large array searching with numpy

Question

I have two arrays of integers

a = numpy.array([1109830922873, 2838383, 839839393, ..., 29839933982])
b = numpy.array([2838383, 555555555, 2839474582, ..., 29839933982])

where len(a) ~ 15,000 and len(b) ~ 2 million.

I want to find the indices of the elements in b that match the elements in a. Currently I'm using a list comprehension and numpy.argwhere() to do this:

bInds = [ numpy.argwhere(b == c)[0] for c in a ]

However, this takes a long time to complete, and array a will grow larger too, so this is not a sensible route to take.

Is there a better way to get this result, considering the large arrays I'm dealing with? It currently takes about 5 minutes. Any speed-up would help!

More info: I want the indices to match the order of array a too. (Thanks Charles)

Maybe you could create a hashmap mapping elements from a to their respective index. Then you just have to look them up in the map.
– tobias_k, Commented Jul 3, 2014 at 14:29

tobias_k · Accepted Answer · 2014-07-03 14:41:19Z

2

Unless I'm mistaken, your approach searches the entire array b for each element of a again and again.

Alternatively, you could create a dictionary mapping the individual elements from b to their indices.

indices = {}
for i, e in enumerate(b):
    indices[e] = i                        # if elements in b are unique
    # indices.setdefault(e, []).append(i) # otherwise, use this line instead to collect lists of indices

Then you can use this mapping for quickly finding the indices where elements from a can be found in b.

bInds = [ indices[c] for c in a ]
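As a minimal, self-contained sketch of the dictionary approach above: the small arrays here are made-up stand-ins for the real a and b, the values in b are assumed to be unique, and the tolist() conversion is an optional addition (not part of the answer) so the dictionary keys are plain Python ints.

```python
import numpy

# Made-up sample data standing in for the real arrays (assumption).
a = numpy.array([2838383, 29839933982, 839839393], dtype=numpy.int64)
b = numpy.array([839839393, 555555555, 2838383, 29839933982], dtype=numpy.int64)

# One pass over b: map each value to its position (values assumed unique).
indices = {}
for i, e in enumerate(b.tolist()):
    indices[e] = i

# One dictionary lookup per element of a; the result follows the order of a.
bInds = [indices[c] for c in a.tolist()]
print(bInds)  # [2, 3, 0]
```

This replaces one full scan of b per element of a with a single scan of b followed by constant-time lookups.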

answered Jul 3, 2014 at 14:41

tobias_k


5 Comments

Carl M Over a year ago

I believe this is doing exactly what I need! I guess this is what you meant by a hashmap? Thank you for your time.

tobias_k Over a year ago

Yes, a hashmap is more or less another word for a dictionary. In Python, it's called a dictionary, or dict, or just {}; in Java, it's a Map or HashMap. Sorry about the confusion.

Carl M Over a year ago

Any idea how you'd add the failsafe of an item from a not being found in b?

Carl M Over a year ago

I just created a separate set and used: [ indices[c] if c in b_set else -99 for c in a ]

tobias_k Over a year ago

You do not need a separate set. Lookup in a dict is O(1) as well, so you can just do indices[c] if c in indices else -99, or use get with a default value, i.e. indices.get(c, -99).
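The lookup with a default value discussed in these comments could look roughly like this. The mapping and the list a are hypothetical sample values, and -99 is the sentinel from the comments above.

```python
# Hypothetical mapping from values in b to their indices (assumption).
indices = {839839393: 0, 555555555: 1, 2838383: 2}

# Sample query values; the second one does not occur in the mapping.
a = [2838383, 1109830922873, 839839393]

# dict.get returns the default (-99 here) when the key is missing.
bInds = [indices.get(c, -99) for c in a]
print(bInds)  # [2, -99, 0]
```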

Charles · Accepted Answer · 2014-07-03 14:36:21Z

0

This takes about a second to run.

import numpy

#make some fake data...
a = (numpy.random.random(15000) * 2**16).astype(int)
b = (numpy.random.random(2000000) * 2**16).astype(int)

#find indices of b whose values are contained in a.
set_a = set(a)
result = set()
for i,val in enumerate(b):
    if val in set_a:
        result.add(i)

result = numpy.array(list(result))
result.sort()

print(result)
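For comparison, the same membership test can probably be written without a Python loop. This is a sketch of a swapped-in technique, not the code from this answer: it assumes a NumPy version that provides numpy.isin (1.13 or later) and uses seeded fake data of the same sizes. Like the loop above, it returns indices in the order of b, not of a.

```python
import numpy

# Seeded fake data with the same sizes as above (assumption).
rng = numpy.random.RandomState(0)
a = rng.randint(0, 2**16, size=15000)
b = rng.randint(0, 2**16, size=2000000)

# Boolean mask: True where the value in b also occurs somewhere in a.
mask = numpy.isin(b, a)

# Positions of the True entries, already sorted in ascending order.
result = numpy.flatnonzero(mask)

print(result)
```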

answered Jul 3, 2014 at 14:36

Charles


2 Comments

Carl M Over a year ago

Thank you! However, I want the indices in the same order as the elements in a. This puts them in the same order as they are in b. Does that make sense?

Charles Over a year ago

Yes, but you should clarify your questions, as that is not clear.
