3

I have a two arrays of integers

a = numpy.array([1109830922873, 2838383, 839839393, ..., 29839933982])
b = numpy.array([2838383, 555555555, 2839474582, ..., 29839933982])

where len(a) ~ 15,000 and len(b) ~ 2 million.

What I want is to find the indices of array b elements which match those in array a. Now, I'm using list comprehension and numpy.argwhere() to achieve this:

bInds = [ numpy.argwhere(b == c)[0] for c in a ]

however, obviously, it is taking a long time to complete this. And array a will become larger too, so this is not a sensible route to take.

Is there a better way to achieve this result, considering the large arrays I'm dealing with here? It currently takes around ~5 minutes to do this. Any speed up is needed!

More info: I want the indices to match the order of array a too. (Thanks Charles)

1
  • 2
    Maybe you could create a hashmap mapping elements from a to their respective index. Then you just have to look them up in the map. Commented Jul 3, 2014 at 14:29

2 Answers 2

2

Unless I'm mistaken, your approach searches the entire array b for each element of a again and again.

Alternatively, you could create a dictionary mapping the individual elements from b to their indices.

indices = {}
for i, e in enumerate(b):
    indices[e] = i                      # if elements in b are unique
    indices.setdefault(e, []).append(i) # otherwise, use lists

Then you can use this mapping for quickly finding the indices where elements from a can be found in b.

bInds = [ indices[c] for c in a ]
Sign up to request clarification or add additional context in comments.

5 Comments

I believe this is doing exactly what I need! I guess this is what you meant by a hashmap? Thank you for your time.
Yes, a hashmap is more or less another word for a dictionary. In Python, it's called dictionary, or dict, or just {}, in Java, it's a Map or HashMap. Sorry about the confusion.
Any idea how you'd add the failsafe of an item from a not being found in b?
I just created a separate set and used: [ indices[c] if c in b_set else -99 for c in a ]
You do not need a separate set. Lookup in dict is O(1) as well, so you can just do indices[c] if c in indices else -99, or use get with a default value, i.e. indices.get(e, -99).
0

This take about a second to run.

import numpy

#make some fake data...
a = (numpy.random.random(15000) * 2**16).astype(int)
b = (numpy.random.random(2000000) * 2**16).astype(int)

#find indcies of b that are contained in a.
set_a = set(a)
result = set()
for i,val in enumerate(b):
    if val in set_a:
        result.add(i)

result = numpy.array(list(result))
result.sort()

print result

2 Comments

Thank you! However, I want the indices in the same order as the elements in a. This puts them in the same order as they are in b. Does that make sense?
Yes, but you should clarify your questions, as that is not clear.

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.