
Let's say I have an array with a finite number of unique values. Say

data = array([30, 20, 30, 10, 20, 10, 20, 10, 30, 20, 20, 30, 30, 10, 30])

And I also have a reference array with all the unique values found in data, without repetitions and in a particular order. Say

reference = array([20, 10, 30])

And I want to create an array with the same shape as data, containing as values the indices in the reference array at which each element of data is found.

In other words, having data and reference, I want to create an array indexes such that the following holds.

data = reference[indexes]
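
For this example, the desired result would be

indexes = array([2, 0, 2, 1, 0, 1, 0, 1, 2, 0, 0, 2, 2, 1, 2])

since reference[2] == 30, reference[0] == 20 and reference[1] == 10.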

A suboptimal approach to compute indexes would be to use a for loop, like this

indexes = np.zeros_like(data, dtype=int)
for i in range(data.size):
    indexes[i] = np.where(data[i] == reference)[0][0]

but I'd be surprised if there weren't a numpythonic (and thus faster!) way to do this... Any ideas?

Thanks!

  • Uhm... I think I wasn't clear about that in the question, sorry... The reference array is expected to be way smaller than data, so the main thing I needed to optimize was the loop through all the values in data... Still, it's true I should think about dictionaries more often than I do! :) Commented Jun 29, 2015 at 12:54

3 Answers

5

We have data and reference as -

In [375]: data
Out[375]: array([30, 20, 30, 10, 20, 10, 20, 10, 30, 20, 20, 30, 30, 10, 30])

In [376]: reference
Out[376]: array([20, 10, 30])

For a moment, let us consider a sorted version of reference -

In [373]: np.sort(reference)
Out[373]: array([10, 20, 30])

Now, we can use np.searchsorted to find out the position of each data element in this sorted version, like so -

In [378]: np.searchsorted(np.sort(reference), data, side='left')
Out[378]: array([2, 1, 2, 0, 1, 0, 1, 0, 2, 1, 1, 2, 2, 0, 2], dtype=int64)

If we run the original code, the expected output turns out to be -

In [379]: indexes
Out[379]: array([2, 0, 2, 1, 0, 1, 0, 1, 2, 0, 0, 2, 2, 1, 2])

As can be seen, the searchsorted output is almost what we want, except that the 0's in it must become 1's and the 1's must become 0's. That's because we computed positions against the sorted version of reference; to map them back to positions in the original reference, we need the indices that sort reference, i.e. np.argsort(reference), which for this example is array([1, 0, 2]). Indexing that array with the searchsorted positions performs exactly the swap we need. That's basically it for a vectorized, loop-free and dict-free approach! So, the final implementation would look something like this -

# Get the sorting indices for reference
sort_idx = np.argsort(reference)

# Sort reference and find the position of each data element in it
pos = np.searchsorted(reference[sort_idx], data, side='left')

# Map positions in the sorted reference back to indices in the original one
out = sort_idx[pos]
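
As a quick sanity check (added for illustration, not part of the original answer), the defining property data = reference[indexes] holds on the sample arrays:

# reference[out] reconstructs data exactly
assert np.array_equal(reference[out], data)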

Runtime tests -

In [396]: data = np.random.randint(0,30000,150000)
     ...: reference = np.unique(data)
     ...: reference = reference[np.random.permutation(reference.size)]
     ...: 
     ...: 
     ...: def org_approach(data,reference):
     ...:     indexes = np.zeros_like(data, dtype=int)
     ...:     for i in range(data.size):
     ...:         indexes[i] = np.where(data[i] == reference)[0][0]
     ...:     return indexes
     ...: 
     ...: def vect_approach(data,reference):
     ...:     sort_idx = np.argsort(reference)
     ...:     pos = np.searchsorted(reference[sort_idx], data, side='left')       
     ...:     return sort_idx[pos]
     ...: 

In [397]: %timeit org_approach(data,reference)
1 loops, best of 3: 9.86 s per loop

In [398]: %timeit vect_approach(data,reference)
10 loops, best of 3: 32.4 ms per loop

Verify results -

In [399]: np.array_equal(org_approach(data,reference),vect_approach(data,reference))
Out[399]: True

1 Comment

Yep! Nice and fast method, thanks!
1

You have to loop through the data once to map the data values onto indexes. The quickest way to do that is to look up each value's index in a dictionary, so you need to build a dictionary mapping values to indices first.

Here's a complete example:

import numpy

data = numpy.array([30, 20, 30, 10, 20, 10, 20, 10, 30, 20, 20, 30, 30, 10, 30])
reference = numpy.array([20, 10, 30])
# Map each reference value to its index, then look every data value up
reference_index = {value: index for index, value in enumerate(reference)}
indexes = [reference_index[value] for value in data]
assert numpy.all(data == reference[indexes])

This will be faster than the numpy.where approach because numpy.where does a linear O(n) search, while the dictionary approach uses a hash table to find each index in O(1) time.
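
If you want the result as a NumPy array rather than a Python list, one option (a sketch added here, not part of the original answer) is numpy.fromiter, which fills the array directly without building an intermediate list:

# Variant sketch: materialize the indices straight into a NumPy array
indexes = numpy.fromiter((reference_index[value] for value in data),
                         dtype=int, count=data.size)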

1 Comment

Nice! This makes sense when the reference array is not much smaller than the data array... In my case reference is way smaller and data is large... It's true that my for-loop example doesn't take this into account... something like for i in range(reference.size): indexes[data == reference[i]] = i would be faster when data.size >> reference.size.
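
For reference, the masked-assignment loop sketched in the comment above would look like this as a complete, runnable version (added for illustration; it loops over the small reference array and does one vectorized assignment per value):

import numpy as np

data = np.array([30, 20, 30, 10, 20, 10, 20, 10, 30, 20, 20, 30, 30, 10, 30])
reference = np.array([20, 10, 30])

indexes = np.zeros_like(data)
# One boolean mask per reference value: cheap when reference is small
for i in range(reference.size):
    indexes[data == reference[i]] = i

assert np.array_equal(data, reference[indexes])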
0
import numpy as np

data = np.array([30, 20, 30, 10, 20, 10, 20, 10, 30, 20, 20, 30, 30, 10, 30])
# Map each reference value directly to its index
reference = {20: 0, 10: 1, 30: 2}
indexes = np.zeros_like(data, dtype=int)

for i in range(data.size):
    indexes[i] = reference[data[i]]

A dictionary lookup is significantly faster. (On Python 2, using xrange instead of range also helped marginally; range is the Python 3 equivalent.)

Using timeit:

Original: 4.01297836938

This version: 1.30972428591
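
The original answer doesn't show its timing harness; a sketch along these lines (the repetition count is an assumption) would produce a comparable measurement:

import timeit

setup = """
import numpy as np
data = np.array([30, 20, 30, 10, 20, 10, 20, 10, 30, 20, 20, 30, 30, 10, 30])
reference = {20: 0, 10: 1, 30: 2}
"""

stmt = """
indexes = np.zeros_like(data, dtype=int)
for i in range(data.size):
    indexes[i] = reference[data[i]]
"""

# number=100000 is an arbitrary choice for this sketch
print(timeit.timeit(stmt, setup=setup, number=100000))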

