
I have a NumPy array:

import numpy as np

leafs = np.array([[1, 2, 3], [1, 2, 4], [2, 3, 4], [4, 2, 1]])

I would like to compute, for each pair of rows, the number of positions at which they have the same element.

In this case I would get the following 4x4 proximity matrix. For example, rows 0 and 1 ([1, 2, 3] and [1, 2, 4]) agree in their first two positions, so proximity[0, 1] == 2:

proximity = array([[3, 2, 0, 1],
                   [2, 3, 1, 1],
                   [0, 1, 3, 0],
                   [1, 1, 0, 3]])

This is the code that I am currently using.

n = leafs.shape[0]  # number of rows
proximity = []

for i in range(n):
    proximity.append(np.apply_along_axis(lambda x: sum(x == leafs[i, :]), axis=1,
                                         arr=leafs))
proximity = np.array(proximity)

I need a faster solution.

EDIT: The accepted solution does not work on this example:

>>> type(f.leafs)
<class 'numpy.ndarray'>
>>> f.leafs.shape
(7210, 1000)
>>> f.leafs.dtype
dtype('int64')

>>> f.leafs.reshape(7210, 1, 1000) == f.leafs.reshape(1, 7210, 1000)
False
>>> f.leafs
array([[ 19,  32,  16, ..., 143, 194, 157],
       [ 19,  32,  16, ..., 143, 194, 157],
       [ 19,  32,  16, ..., 143, 194, 157],
       ..., 
       [139,  32,  16, ...,   5, 194, 157],
       [170,  32,  16, ...,   5, 194, 157],
       [170,  32,  16, ...,   5, 194, 157]])
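For scale, note that this comparison would materialize a boolean array of shape (7210, 7210, 1000) before anything is summed. A rough size check (a sketch, assuming NumPy's one byte per boolean):

import numpy as np

nrows, ncols = 7210, 1000
eq_bytes = nrows * nrows * ncols * np.dtype(bool).itemsize
print(eq_bytes / 2**30)  # ~48.4 GiB for the boolean intermediate alone

That is far more memory than a typical machine has, which matches the guess in the comments on the accepted answer that memory is the problem.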
  • Itertools (python2 | python3) should have something...? Try investigating this answer for "Generate all unique permutations of 2d array". Commented Aug 5, 2014 at 22:41
  • Can you be a bit more precise about what your output should be? Maybe I'm tired, but for 'each couple of rows' I'm not seeing how a 4x3 maps to a 4x4. Commented Aug 5, 2014 at 22:42
  • The final matrix is square and has the same number of rows as the initial one. Each row of the matrix is compared with all the others: `similarity[i, j] = sum(leafs[i, k] == leafs[j, k] for k in range(len(leafs[i, :])))` Commented Aug 5, 2014 at 22:47
  • What are the elements on the array? Integers, strings, floats? Commented Aug 5, 2014 at 22:50
  • The elements are integers. Commented Aug 5, 2014 at 22:58

2 Answers


Here's one way, using broadcasting. Be warned: the temporary array eq has shape (nrows, nrows, ncols), so if nrows is 4000 and ncols is 1000, eq will require 16GB of memory.

In [38]: leafs
Out[38]: 
array([[1, 2, 3],
       [1, 2, 4],
       [2, 3, 4],
       [4, 2, 1]])

In [39]: nrows, ncols = leafs.shape

In [40]: eq = leafs.reshape(nrows,1,ncols) == leafs.reshape(1,nrows,ncols)

In [41]: proximity = eq.sum(axis=-1)

In [42]: proximity
Out[42]: 
array([[3, 2, 0, 1],
       [2, 3, 1, 1],
       [0, 1, 3, 0],
       [1, 1, 0, 3]])

Also note that this solution is inefficient: proximity is symmetric, and the diagonal is always equal to ncols, but the code computes the full array, so it does more than twice as much work as necessary.
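A sketch of that optimization (proximity_symmetric is a name made up here, not part of the answer): fill the diagonal directly, compute only the upper triangle, and mirror it.

import numpy as np

def proximity_symmetric(leafs):
    # Exploit the symmetry noted above: roughly half the comparisons,
    # and only an (nrows - i - 1, ncols) boolean temporary per iteration.
    nrows, ncols = leafs.shape
    proximity = np.empty((nrows, nrows), dtype=int)
    np.fill_diagonal(proximity, ncols)  # a row always matches itself everywhere
    for i in range(nrows - 1):
        counts = (leafs[i] == leafs[i + 1:]).sum(axis=1)
        proximity[i, i + 1:] = counts   # upper triangle
        proximity[i + 1:, i] = counts   # mirrored lower triangle
    return proximity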


5 Comments

This is a fantastic solution and it is very, very fast! Can you explain commands 40 and 41? Thanks!
@Donbeo, 4x1x3 against 1x4x3 => 4x4x3. doc
@Warren Weckesser There is a case in which your solution does not work; I have edited the question. This can probably be due to excessive memory use.
As I noted in my answer, I expected memory would be an issue if the dimensions of leafs were large. Broadcasting+reduction is handy, but the excessive memory use of intermediate results can be a problem. @ojy's answer is a reasonable way to break up the calculation into small steps.
I have edited the question. I am just curious to know why this is not working. @ojy's answer is probably better in my situation.

Warren Weckesser offered a very beautiful solution using broadcasting. However, even a straightforward approach using a loop can have comparable performance. np.apply_along_axis is slow in your initial solution because it does not take advantage of vectorization; the following fixes that:

def proximity_1(leafs):
    n = len(leafs)
    proximity = np.zeros((n, n), dtype=int)  # counts are integers
    for i in range(n):
        # vectorized row comparison: broadcast row i against all rows
        proximity[i] = (leafs == leafs[i]).sum(axis=1)
    return proximity

You could also use a list comprehension to make the above code more concise. The difference is that np.apply_along_axis would loop through all the rows in a non-optimized manner, while leafs == leafs[i] will take advantage of numpy speed.
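For instance, proximity_1 collapses to a one-liner (an equivalent sketch of the same vectorized comparison):

proximity = np.array([(leafs == row).sum(axis=1) for row in leafs])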

The solution from Warren Weckesser truly shows numpy's beauty. However, it includes the overhead of creating an intermediate 3-d array of size nrows*nrows*ncols. So if you have large data, the simple loop might be more efficient.

Here's an example. Below is the code offered by Warren Weckesser, wrapped in a function. (I don't know what the code copyright rules are here, so I assume this reference is enough. :))

def proximity_2(leafs):
    nrows, ncols = leafs.shape    
    eq = leafs.reshape(nrows,1,ncols) == leafs.reshape(1,nrows,ncols)
    proximity = eq.sum(axis=-1)  
    return proximity

Now let's evaluate the performance on an array of random integers of size 10000 x 100.

leafs = np.random.randint(1,100,(10000,100))
time proximity_1(leafs)
>> 28.6 s
time proximity_2(leafs) 
>> 35.4 s 

I ran both examples in an IPython environment on the same machine.
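If memory rather than speed is the main constraint, a middle ground (a sketch combining both ideas; chunk_size is a made-up tuning knob) is to broadcast one block of rows at a time, which caps the boolean intermediate at roughly chunk_size * nrows * ncols bytes:

import numpy as np

def proximity_chunked(leafs, chunk_size=128):
    # Same broadcasting idea as proximity_2, applied to one block of rows
    # at a time so the boolean temporary stays small.
    nrows, ncols = leafs.shape
    proximity = np.empty((nrows, nrows), dtype=int)
    for start in range(0, nrows, chunk_size):
        block = leafs[start:start + chunk_size]      # (b, ncols)
        eq = block[:, None, :] == leafs[None, :, :]  # (b, nrows, ncols)
        proximity[start:start + chunk_size] = eq.sum(axis=-1)
    return proximity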

