
I have a numpy array embed_vec of length tot_vec in which each entry is a 3d vector:

[[ 0.52483319  0.78015841  0.71117216]
 [ 0.53041481  0.79462171  0.67234534]
 [ 0.53645428  0.80896727  0.63119403]
 ..., 
 [ 0.72283509  0.40070804  0.15220522]
 [ 0.71277758  0.38498613  0.16141834]
 [ 0.70221445  0.36918032  0.17370776]]

For each of the elements in this array, I want to find out the number of other entries which are "close" to that entry. By close, I mean that the distance between two vectors is less than a specified value R. For this, I must compare all the possible pairs in this array with each other and then find out the number of close vectors for each of the vectors in the array. So I am doing this:

p = np.zeros(tot_vec) # This contains the number of close vectors
for i in range(tot_vec-1):
    for j in range(i+1, tot_vec):
        if np.linalg.norm(embed_vec[i]-embed_vec[j]) < R:
            p[i] += 1

However, this is extremely inefficient because of the two nested Python loops, and for larger array sizes it takes forever. If this were C++ or Fortran, it wouldn't be much of an issue. My question is: can one achieve the same thing efficiently with numpy, using some vectorization method? As a side note, I don't mind a solution using Pandas either.

  • What's the shape of embed_vec in your actual use-case? Commented Jan 13, 2017 at 9:28
  • @Divakar : It is (60000, 3) Commented Jan 13, 2017 at 9:29
  • @Peaceful I deleted the comment because you are using a multi-dimensional distance. Even though I might try to use some of that logic, this is a distinctly different question Commented Jan 13, 2017 at 9:33
  • You can use scipy's pdist to get a distance matrix. May run into memory issues if tot_vec is large. (A short sketch follows these comments.) Commented Jan 13, 2017 at 9:36
  • Someone should really implement something for this. It has to be about the most asked numpy question. Commented Jan 13, 2017 at 9:50
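
A minimal sketch of the pdist route mentioned in the comments (my addition, not from the thread; it materializes all n(n-1)/2 distances, so it only fits in memory for modest tot_vec):

import numpy as np
from scipy.spatial.distance import pdist, squareform

embed_vec = np.random.rand(3000, 3)  # small enough for a full matrix
R = 0.5

# Condensed pairwise distances, expanded to a full (n, n) matrix
d = squareform(pdist(embed_vec))
# Count neighbors within R; subtract 1 to drop the zero self-distance
p = np.count_nonzero(d < R, axis=1) - 1

Note that this counts every neighbor of each point, whereas the loop in the question only credits the lower-indexed element of each pair.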

3 Answers


Approach #1: Vectorized approach -

def vectorized_app(embed_vec, R):
    tot_vec = embed_vec.shape[0]
    # Row/column indices of the upper triangle, i.e. all pairs (i, j) with j > i
    r, c = np.triu_indices(tot_vec, 1)
    subs = embed_vec[r] - embed_vec[c]
    # Squared distances as row-wise dot products
    dists = np.einsum('ij,ij->i', subs, subs)
    # For each i, count pairs with squared distance below R**2
    return np.bincount(r, dists < R**2, minlength=tot_vec)

Approach #2: With less loop complexity (for very large arrays) -

def loopy_less_app(embed_vec, R):
    tot_vec = embed_vec.shape[0]
    Rsq = R**2
    out = np.zeros(tot_vec, dtype=int)
    for i in range(tot_vec):
        # Differences against the later entries only (j > i)
        subs = embed_vec[i] - embed_vec[i+1:]
        # Squared distances, compared against R**2 to avoid the sqrt
        dists = np.einsum('ij,ij->i', subs, subs)
        out[i] = np.count_nonzero(dists < Rsq)
    return out
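
A quick sanity check that the two approaches agree (my addition, on a small random sample):

# Both functions count, for each i, close vectors with index j > i,
# so their outputs should match exactly
embed_vec = np.random.rand(500, 3)
R = 0.3
assert np.array_equal(vectorized_app(embed_vec, R).astype(int),
                      loopy_less_app(embed_vec, R))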

Benchmarking

Original approach -

def loopy_app(embed_vec, R):
    tot_vec = embed_vec.shape[0]
    p = np.zeros(tot_vec) # This contains the number of close vectors
    for i in range(tot_vec-1):
        for j in range(i+1, tot_vec):
            if np.linalg.norm(embed_vec[i]-embed_vec[j]) < R:
                p[i] += 1
    return p                

Timings -

In [76]: # Sample random array
    ...: embed_vec = np.random.rand(3000,3)
    ...: R = 0.5
    ...: 

In [77]: %timeit loopy_app(embed_vec, R)
1 loops, best of 3: 50.5 s per loop

In [78]: %timeit loopy_less_app(embed_vec, R)
10 loops, best of 3: 143 ms per loop

350x+ speedup there!

Going with a much bigger array with the proposed loopy_less_app -

In [81]: # Sample random array
    ...: embed_vec = np.random.rand(20000,3)
    ...: R = 0.5
    ...: 

In [82]: %timeit loopy_less_app(embed_vec, R)
1 loops, best of 3: 4.47 s per loop

5 Comments

This gives me: ValueError: array is too big; `arr.size * arr.dtype.itemsize` is larger than the maximum possible size.
@Peaceful Check out Approach #2? That should be good for your (60000,3) array, hopefully!
This is impressive indeed. Can you please explain how I can modify this for 1D vectors? In that case it throws: ValueError: einstein sum subscripts string contains too many subscripts for operand 0
@Peaceful For 1D arrays, skip the einsum step and at the last step do out[i] = np.count_nonzero(np.abs(subs) < R) in the loopy_less_app approach (sketched after these comments).
If the input array is both 1D and sorted, how could one speed this up further? (The sketch below includes a searchsorted idea for this.)
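
A sketch of the 1D variant described above, plus a searchsorted idea for the sorted case raised in the last comment (both my additions, following the loopy_less_app pattern):

def loopy_less_app_1d(embed_vec, R):
    # 1D variant per the comment above: no einsum needed, just
    # compare absolute differences against R directly
    tot_vec = embed_vec.shape[0]
    out = np.zeros(tot_vec, dtype=int)
    for i in range(tot_vec):
        subs = embed_vec[i] - embed_vec[i+1:]
        out[i] = np.count_nonzero(np.abs(subs) < R)
    return out

def sorted_1d_app(embed_vec, R):
    # If the 1D input is sorted ascending, one binary search per element
    # counts the later entries within R, in O(n log n) overall
    idx = np.arange(embed_vec.size)
    right = np.searchsorted(embed_vec, embed_vec + R, side='left')
    return right - idx - 1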

I am intrigued by this question and attempted to solve it efficiently using scipy's cKDTree. However, this approach may run out of memory, because internally a list of all pairs with distance <= R is maintained. If your R and tot_vec are small enough, it will work:

import numpy as np
from scipy.spatial import cKDTree as KDTree

tot_vec = 60000
embed_vec = np.random.randn(tot_vec, 3)
R = 0.1

tree = KDTree(embed_vec, leafsize=100)
p = np.zeros(tot_vec)
# query_pairs returns each close pair (i, j) with i < j exactly once,
# so credit both endpoints
for pair in tree.query_pairs(R):
    p[pair[0]] += 1
    p[pair[1]] += 1

In case memory is an issue, with some effort it is possible to rewrite query_pairs as a generator function in Python, at the cost of the C performance.
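
Alternatively, if your SciPy is recent enough (an assumption on my part; check your version), query_ball_point can return just the neighbor counts without materializing the pair list:

# return_length=True yields, per point, the number of neighbors within R;
# each count includes the point itself, hence the -1
p = tree.query_ball_point(embed_vec, R, return_length=True) - 1

This matches the query_pairs loop above, since that loop also credits both endpoints of each pair.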

3 Comments

I was looking into kmeans, but no luck. Good to see it in use for this problem! What about count_neighbors?
@Divakar I guess count_neighbors is less efficient because it is meant to operate on two trees and thus might traverse the tree twice. Did not try it, though (see the sketch after these comments for what it returns).
I see. Not really familiar with kmeans tools :) Would be interesting to see some timings I guess.
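
For what it's worth, a sketch of what count_neighbors returns when a tree is queried against itself (my addition): a single total rather than per-point counts, so it answers a slightly different question.

# Total number of ordered pairs (i, j) with distance <= R, counting
# self-pairs (i, i) and both orderings of each close pair
total = tree.count_neighbors(tree, R)
# Number of distinct close pairs, comparable to len(tree.query_pairs(R))
n_close_pairs = (total - tot_vec) // 2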

First, broadcast the differences:

disp_vecs = embed_vec[:, None, :] - embed_vec[None, :, :]

Now, depending on how big your dataset is, you may want to do a first pass without all the math: if the distance is less than r, the absolute value of every component of the difference must also be less than r.

first_mask = np.max(np.abs(disp_vecs), axis=-1) < r

Then do the actual calculation:

disps = np.linalg.norm(disp_vecs[first_mask], axis=-1)
second_mask = disps < r

Now reassign:

disps = disps[second_mask]
first_mask[first_mask] = second_mask

disps are now the good values, and first_mask is a boolean mask of where they go. You can process from there.
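
Putting the pieces together, a consolidated sketch of this broadcasting approach (my addition; the full (n, n, 3) displacement array limits it to modest n):

import numpy as np

def broadcast_count(embed_vec, r):
    # Pairwise displacements via broadcasting: shape (n, n, 3)
    disp_vecs = embed_vec[:, None, :] - embed_vec[None, :, :]
    # Cheap pre-filter: a pair can only be within r if every
    # component difference is below r in absolute value
    first_mask = np.max(np.abs(disp_vecs), axis=-1) < r
    # Exact distances only for the surviving candidate pairs
    disps = np.linalg.norm(disp_vecs[first_mask], axis=-1)
    second_mask = disps < r
    first_mask[first_mask] = second_mask
    # Each row now flags true neighbors; subtract 1 for the self-pair
    return first_mask.sum(axis=1) - 1

Unlike the question's loop, this counts every neighbor of each point, not just those with a larger index.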

1 Comment

Did you mean to write embed_vec in place of tot_vec?
