
Let's say I have an array with a finite number of unique values. Say

data = array([30, 20, 30, 10, 20, 10, 20, 10, 30, 20, 20, 30, 30, 10, 30])

And I also have a reference array with all the unique values found in data, without repetitions and in a particular order. Say

reference = array([20, 10, 30])

And I want to create an array with the same shape as data, containing as values the indices in the reference array at which each element of data is found.

In other words, having data and reference, I want to create an array indexes such that the following holds.

data = reference[indexes]
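
For this example, the desired result would be

indexes = array([2, 0, 2, 1, 0, 1, 0, 1, 2, 0, 0, 2, 2, 1, 2])

since reference[2] == 30, reference[0] == 20 and reference[1] == 10.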

A suboptimal approach to compute indexes would be to use a for loop, like this

indexes = np.zeros_like(data, dtype=int)
for i in range(data.size):
    indexes[i] = np.where(data[i] == reference)[0][0]

but I'd be surprised if there weren't a numpythonic (and thus faster!) way to do this... Any ideas?

Thanks!

  • Uhm... I think I wasn't clear about that in the question, sorry... The reference array is expected to be way smaller than data, so the main thing I needed to optimize was the loop through all the values in data... Still, it's true I should think about dictionaries more often than I do! :) Commented Jun 29, 2015 at 12:54

3 Answers

5

We have data and reference as -

In [375]: data
Out[375]: array([30, 20, 30, 10, 20, 10, 20, 10, 30, 20, 20, 30, 30, 10, 30])

In [376]: reference
Out[376]: array([20, 10, 30])

For a moment, let us consider a sorted version of reference -

In [373]: np.sort(reference)
Out[373]: array([10, 20, 30])

Now, we can use np.searchsorted to find out the position of each data element in this sorted version, like so -

In [378]: np.searchsorted(np.sort(reference), data, side='left')
Out[378]: array([2, 1, 2, 0, 1, 0, 1, 0, 2, 1, 1, 2, 2, 0, 2], dtype=int64)

If we run the original code, the expected output turns out to be -

In [379]: indexes
Out[379]: array([2, 0, 2, 1, 0, 1, 0, 1, 2, 0, 0, 2, 2, 1, 2])

As can be seen, the searchsorted output is almost what we want, except that the 0's in it must become 1's and the 1's must become 0's. That's because we computed positions against the sorted version of reference; to map them back to positions in the original reference, we need the indices that sort reference, i.e. np.argsort(reference), which for this example is array([1, 0, 2]). Indexing that array with the searchsorted positions performs exactly the swap we need. That's basically it for a vectorized, loop-free and dict-free approach! So, the final implementation would look something like this -

# Get the sorting indices for reference
sort_idx = np.argsort(reference)

# Sort reference and find the position of each data element in it
pos = np.searchsorted(reference[sort_idx], data, side='left')

# Map positions in the sorted reference back to indices in the original one
out = sort_idx[pos]
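
As a quick sanity check (added for illustration, not part of the original answer), the defining property data = reference[indexes] holds on the sample arrays:

# reference[out] reconstructs data exactly
assert np.array_equal(reference[out], data)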

Runtime tests -

In [396]: data = np.random.randint(0,30000,150000)
     ...: reference = np.unique(data)
     ...: reference = reference[np.random.permutation(reference.size)]
     ...: 
     ...: 
     ...: def org_approach(data,reference):
     ...:     indexes = np.zeros_like(data, dtype=int)
     ...:     for i in range(data.size):
     ...:         indexes[i] = np.where(data[i] == reference)[0][0]
     ...:     return indexes
     ...: 
     ...: def vect_approach(data,reference):
     ...:     sort_idx = np.argsort(reference)
     ...:     pos = np.searchsorted(reference[sort_idx], data, side='left')       
     ...:     return sort_idx[pos]
     ...: 

In [397]: %timeit org_approach(data,reference)
1 loops, best of 3: 9.86 s per loop

In [398]: %timeit vect_approach(data,reference)
10 loops, best of 3: 32.4 ms per loop

Verify results -

In [399]: np.array_equal(org_approach(data,reference),vect_approach(data,reference))
Out[399]: True

1 Comment

Yep! Nice and fast method, thanks!
1

You have to loop through the data once to map the data values onto indexes. The quickest way to do that is to look up each value's index in a dictionary, so you need to build a dictionary mapping values to indices first.

Here's a complete example:

import numpy

data = numpy.array([30, 20, 30, 10, 20, 10, 20, 10, 30, 20, 20, 30, 30, 10, 30])
reference = numpy.array([20, 10, 30])
# Map each reference value to its index, then look every data value up
reference_index = {value: index for index, value in enumerate(reference)}
indexes = [reference_index[value] for value in data]
assert numpy.all(data == reference[indexes])

This will be faster than the numpy.where approach because numpy.where does a linear O(n) search, while the dictionary approach uses a hash table to find each index in O(1) time.
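
If you want the result as a NumPy array rather than a Python list, one option (a sketch added here, not part of the original answer) is numpy.fromiter, which fills the array directly without building an intermediate list:

# Variant sketch: materialize the indices straight into a NumPy array
indexes = numpy.fromiter((reference_index[value] for value in data),
                         dtype=int, count=data.size)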

1 Comment

Nice! This makes sense when the reference array is not much smaller than the data array... In my case reference is way smaller and data is large... It's true that my for-loop example doesn't take this into account... something like for i in range(reference.size): indexes[data == reference[i]] = i would be faster when data.size >> reference.size.
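
For reference, the masked-assignment loop sketched in the comment above would look like this as a complete, runnable version (added for illustration; it loops over the small reference array and does one vectorized assignment per value):

import numpy as np

data = np.array([30, 20, 30, 10, 20, 10, 20, 10, 30, 20, 20, 30, 30, 10, 30])
reference = np.array([20, 10, 30])

indexes = np.zeros_like(data)
# One boolean mask per reference value: cheap when reference is small
for i in range(reference.size):
    indexes[data == reference[i]] = i

assert np.array_equal(data, reference[indexes])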
0
import numpy as np

data = np.array([30, 20, 30, 10, 20, 10, 20, 10, 30, 20, 20, 30, 30, 10, 30])
# Map each reference value directly to its index
reference = {20: 0, 10: 1, 30: 2}
indexes = np.zeros_like(data, dtype=int)

for i in range(data.size):
    indexes[i] = reference[data[i]]

A dictionary lookup is significantly faster. (On Python 2, using xrange instead of range also helped marginally; range is the Python 3 equivalent.)

Using timeit:

Original: 4.01297836938

This version: 1.30972428591
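
The original answer doesn't show its timing harness; a sketch along these lines (the repetition count is an assumption) would produce a comparable measurement:

import timeit

setup = """
import numpy as np
data = np.array([30, 20, 30, 10, 20, 10, 20, 10, 30, 20, 20, 30, 30, 10, 30])
reference = {20: 0, 10: 1, 30: 2}
"""

stmt = """
indexes = np.zeros_like(data, dtype=int)
for i in range(data.size):
    indexes[i] = reference[data[i]]
"""

# number=100000 is an arbitrary choice for this sketch
print(timeit.timeit(stmt, setup=setup, number=100000))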

