We have data and reference as -
In [375]: data
Out[375]: array([30, 20, 30, 10, 20, 10, 20, 10, 30, 20, 20, 30, 30, 10, 30])
In [376]: reference
Out[376]: array([20, 10, 30])
For a moment, let us consider a sorted version of reference -
In [373]: np.sort(reference)
Out[373]: array([10, 20, 30])
Now, we can use np.searchsorted to find out the position of each data element in this sorted version, like so -
In [378]: np.searchsorted(np.sort(reference), data, side='left')
Out[378]: array([2, 1, 2, 0, 1, 0, 1, 0, 2, 1, 1, 2, 2, 0, 2], dtype=int64)
If we run the original code, the expected output turns out to be -
In [379]: indexes
Out[379]: array([2, 0, 2, 1, 0, 1, 0, 1, 2, 0, 0, 2, 2, 1, 2])
As can be seen, the searchsorted output is almost right, except that its 0s must become 1s and its 1s must become 0s. That is because the positions were computed against the sorted version of reference, not reference itself. To map them back to positions in the original reference, we index the sorting indices, i.e. np.argsort(reference), with them. That's basically it for a vectorized approach with no loop and no dict! So, the final implementation would look something like this -
# Get sorting indices for reference
sort_idx = np.argsort(reference)
# Sort reference and get searchsorted indices for data in reference
pos = np.searchsorted(reference[sort_idx], data, side='left')
# Map positions in the sorted reference back to positions in the original reference
out = sort_idx[pos]
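One caveat worth spelling out: the steps above assume that reference holds unique values, is non-empty, and contains every value that occurs in data. np.searchsorted only returns insertion points, so a value missing from reference would silently get a wrong index. Below is a minimal sketch that wraps the same three steps in a function and adds such a check; the function name lookup_positions and the choice to raise ValueError are illustrative assumptions, not part of the original code.

```python
import numpy as np

def lookup_positions(data, reference):
    """For each element of data, return its index in reference.

    Assumes reference is non-empty and holds unique values.
    (Hypothetical helper name, used here only for illustration.)
    """
    data = np.asarray(data)
    reference = np.asarray(reference)

    # Sorting indices for reference
    sort_idx = np.argsort(reference)
    sorted_ref = reference[sort_idx]

    # Insertion points of data in the sorted reference
    pos = np.searchsorted(sorted_ref, data, side='left')

    # Values larger than every reference value get pos == sorted_ref.size,
    # so clip before indexing to stay in bounds
    pos = np.minimum(pos, sorted_ref.size - 1)

    # An insertion point is only a real match if the value found there equals the data value
    if not np.array_equal(sorted_ref[pos], data):
        raise ValueError("some values in data are missing from reference")

    # Map positions in the sorted reference back to the original reference
    return sort_idx[pos]

data = np.array([30, 20, 30, 10, 20, 10, 20, 10, 30, 20, 20, 30, 30, 10, 30])
reference = np.array([20, 10, 30])
out = lookup_positions(data, reference)
```

For the sample arrays this gives the same result as the indexes array shown earlier. The extra check costs one more pass over data, so it can be dropped when the inputs are known to be consistent.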
Runtime tests -
In [396]: data = np.random.randint(0,30000,150000)
...: reference = np.unique(data)
...: reference = reference[np.random.permutation(reference.size)]
...:
...:
...: def org_approach(data,reference):
...:     indexes = np.zeros_like(data, dtype=int)
...:     for i in range(data.size):
...:         indexes[i] = np.where(data[i] == reference)[0]
...:     return indexes
...:
...: def vect_approach(data,reference):
...:     sort_idx = np.argsort(reference)
...:     pos = np.searchsorted(reference[sort_idx], data, side='left')
...:     return sort_idx[pos]
...:
In [397]: %timeit org_approach(data,reference)
1 loops, best of 3: 9.86 s per loop
In [398]: %timeit vect_approach(data,reference)
10 loops, best of 3: 32.4 ms per loop
Verify results -
In [399]: np.array_equal(org_approach(data,reference),vect_approach(data,reference))
Out[399]: True
The reference array is expected to be way smaller than the data, so the main thing I needed to optimize was the loop through all the values in data... Still, it's true I should think about dictionaries more often than I do! :)
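For comparison, the dictionary-based route mentioned here could look roughly like the sketch below. It is only an illustration under the same assumptions (unique values in reference, every data value present); the name position_of is made up for this example. It still loops over data in Python, so for large arrays it is likely to be slower than the searchsorted approach.

```python
import numpy as np

data = np.array([30, 20, 30, 10, 20, 10, 20, 10, 30, 20, 20, 30, 30, 10, 30])
reference = np.array([20, 10, 30])

# Map each reference value to its position in reference
# (position_of is a hypothetical name chosen for this sketch)
position_of = {value: i for i, value in enumerate(reference.tolist())}

# Look up every data value; a KeyError here means the value is missing from reference
indexes = np.array([position_of[v] for v in data.tolist()], dtype=int)
```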