
I have a NumPy array:

import numpy as np

leafs = np.array([[1, 2, 3], [1, 2, 4], [2, 3, 4], [4, 2, 1]])

I would like to compute, for each pair of rows, the number of positions at which they have the same element.

In this case I would get the following 4x4 proximity matrix. For example, rows 0 and 1 ([1, 2, 3] and [1, 2, 4]) agree in their first two positions, so proximity[0, 1] == 2:

proximity = array([[3, 2, 0, 1],
                   [2, 3, 1, 1],
                   [0, 1, 3, 0],
                   [1, 1, 0, 3]])

This is the code that I am currently using.

n = leafs.shape[0]  # number of rows
proximity = []

for i in range(n):
    proximity.append(np.apply_along_axis(lambda x: sum(x == leafs[i, :]), axis=1,
                                         arr=leafs))
proximity = np.array(proximity)

I need a faster solution.

EDIT: The accepted solution does not work on this example:

>>> type(f.leafs)
<class 'numpy.ndarray'>
>>> f.leafs.shape
(7210, 1000)
>>> f.leafs.dtype
dtype('int64')

>>> f.leafs.reshape(7210, 1, 1000) == f.leafs.reshape(1, 7210, 1000)
False
>>> f.leafs
array([[ 19,  32,  16, ..., 143, 194, 157],
       [ 19,  32,  16, ..., 143, 194, 157],
       [ 19,  32,  16, ..., 143, 194, 157],
       ..., 
       [139,  32,  16, ...,   5, 194, 157],
       [170,  32,  16, ...,   5, 194, 157],
       [170,  32,  16, ...,   5, 194, 157]])
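For scale, note that this comparison would materialize a boolean array of shape (7210, 7210, 1000) before anything is summed. A rough size check (a sketch, assuming NumPy's one byte per boolean):

import numpy as np

nrows, ncols = 7210, 1000
eq_bytes = nrows * nrows * ncols * np.dtype(bool).itemsize
print(eq_bytes / 2**30)  # ~48.4 GiB for the boolean intermediate alone

That is far more memory than a typical machine has, which matches the guess in the comments on the accepted answer that memory is the problem.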
  • Itertools (python2 | python3) should have something...? Try investigating this answer for "Generate all unique permutations of 2d array". Commented Aug 5, 2014 at 22:41
  • Can you be a bit more precise about what your output should be? Maybe I'm tired, but for 'each couple of rows' I'm not seeing how a 4x3 maps to a 4x4. Commented Aug 5, 2014 at 22:42
  • The final matrix is square and has the same number of rows as the initial one. Each row of the matrix is compared with all the others: `similarity[i, j] = sum(leafs[i, k] == leafs[j, k] for k in range(len(leafs[i, :])))` Commented Aug 5, 2014 at 22:47
  • What are the elements on the array? Integers, strings, floats? Commented Aug 5, 2014 at 22:50
  • The elements are integers. Commented Aug 5, 2014 at 22:58

2 Answers


Here's one way, using broadcasting. Be warned: the temporary array eq has shape (nrows, nrows, ncols), so if nrows is 4000 and ncols is 1000, eq will require 16GB of memory.

In [38]: leafs
Out[38]: 
array([[1, 2, 3],
       [1, 2, 4],
       [2, 3, 4],
       [4, 2, 1]])

In [39]: nrows, ncols = leafs.shape

In [40]: eq = leafs.reshape(nrows,1,ncols) == leafs.reshape(1,nrows,ncols)

In [41]: proximity = eq.sum(axis=-1)

In [42]: proximity
Out[42]: 
array([[3, 2, 0, 1],
       [2, 3, 1, 1],
       [0, 1, 3, 0],
       [1, 1, 0, 3]])

Also note that this solution is inefficient: proximity is symmetric, and the diagonal is always equal to ncols, but the code computes the full array, so it does more than twice as much work as necessary.
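A sketch of that optimization (proximity_symmetric is a name made up here, not part of the answer): fill the diagonal directly, compute only the upper triangle, and mirror it.

import numpy as np

def proximity_symmetric(leafs):
    # Exploit the symmetry noted above: roughly half the comparisons,
    # and only an (nrows - i - 1, ncols) boolean temporary per iteration.
    nrows, ncols = leafs.shape
    proximity = np.empty((nrows, nrows), dtype=int)
    np.fill_diagonal(proximity, ncols)  # a row always matches itself everywhere
    for i in range(nrows - 1):
        counts = (leafs[i] == leafs[i + 1:]).sum(axis=1)
        proximity[i, i + 1:] = counts   # upper triangle
        proximity[i + 1:, i] = counts   # mirrored lower triangle
    return proximity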


5 Comments

This is a fantastic solution and it is very, very fast! Can you explain commands 40 and 41? Thanks!
@Donbeo, 4x1x3 against 1x4x3 => 4x4x3. doc
@Warren Weckesser There is a case in which your solution does not work; I have edited the question. This can probably be due to excessive memory use.
As I noted in my answer, I expected memory would be an issue if the dimensions of leafs were large. Broadcasting+reduction is handy, but the excessive memory use of intermediate results can be a problem. @ojy's answer is a reasonable way to break up the calculation into small steps.
I have edited the question. I am just curious to know why this is not working. @ojy's answer is probably better in my situation.

Warren Weckesser offered a very beautiful solution using broadcasting. However, even a straightforward approach using a loop can have comparable performance. np.apply_along_axis is slow in your initial solution because it does not take advantage of vectorization; the following fixes that:

def proximity_1(leafs):
    n = len(leafs)
    proximity = np.zeros((n, n), dtype=int)  # counts are integers
    for i in range(n):
        # vectorized row comparison: broadcast row i against all rows
        proximity[i] = (leafs == leafs[i]).sum(axis=1)
    return proximity

You could also use a list comprehension to make the above code more concise. The difference is that np.apply_along_axis would loop through all the rows in a non-optimized manner, while leafs == leafs[i] will take advantage of numpy speed.
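For instance, proximity_1 collapses to a one-liner (an equivalent sketch of the same vectorized comparison):

proximity = np.array([(leafs == row).sum(axis=1) for row in leafs])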

The solution from Warren Weckesser truly shows numpy's beauty. However, it includes the overhead of creating an intermediate 3-d array of size nrows*nrows*ncols. So if you have large data, the simple loop might be more efficient.

Here's an example. Below is the code offered by Warren Weckesser, wrapped in a function. (I don't know what the code copyright rules are here, so I assume this reference is enough. :))

def proximity_2(leafs):
    nrows, ncols = leafs.shape    
    eq = leafs.reshape(nrows,1,ncols) == leafs.reshape(1,nrows,ncols)
    proximity = eq.sum(axis=-1)  
    return proximity

Now let's evaluate the performance on an array of random integers of size 10000 x 100.

leafs = np.random.randint(1,100,(10000,100))
time proximity_1(leafs)
>> 28.6 s
time proximity_2(leafs) 
>> 35.4 s 

I ran both examples in an IPython environment on the same machine.
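If memory rather than speed is the main constraint, a middle ground (a sketch combining both ideas; chunk_size is a made-up tuning knob) is to broadcast one block of rows at a time, which caps the boolean intermediate at roughly chunk_size * nrows * ncols bytes:

import numpy as np

def proximity_chunked(leafs, chunk_size=128):
    # Same broadcasting idea as proximity_2, applied to one block of rows
    # at a time so the boolean temporary stays small.
    nrows, ncols = leafs.shape
    proximity = np.empty((nrows, nrows), dtype=int)
    for start in range(0, nrows, chunk_size):
        block = leafs[start:start + chunk_size]      # (b, ncols)
        eq = block[:, None, :] == leafs[None, :, :]  # (b, nrows, ncols)
        proximity[start:start + chunk_size] = eq.sum(axis=-1)
    return proximity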

