
I have a numpy array embed_vec of length tot_vec in which each entry is a 3d vector:

[[ 0.52483319  0.78015841  0.71117216]
 [ 0.53041481  0.79462171  0.67234534]
 [ 0.53645428  0.80896727  0.63119403]
 ..., 
 [ 0.72283509  0.40070804  0.15220522]
 [ 0.71277758  0.38498613  0.16141834]
 [ 0.70221445  0.36918032  0.17370776]]

For each of the elements in this array, I want to find out the number of other entries which are "close" to that entry. By close, I mean that the distance between two vectors is less than a specified value R. For this, I must compare all the possible pairs in this array with each other and then find out the number of close vectors for each of the vectors in the array. So I am doing this:

p = np.zeros(tot_vec) # This contains the number of close vectors
for i in range(tot_vec-1):
    for j in range(i+1, tot_vec):
        if np.linalg.norm(embed_vec[i]-embed_vec[j]) < R:
            p[i] += 1

However, this is extremely inefficient because of the two nested Python loops, and for larger array sizes it takes forever. If this were C++ or Fortran, it wouldn't be much of an issue. My question is: can one achieve the same thing efficiently with numpy, using some vectorization method? As a side note, I don't mind a solution using Pandas either.

  • What's the shape of embed_vec in your actual use-case? Commented Jan 13, 2017 at 9:28
  • @Divakar : It is (60000, 3) Commented Jan 13, 2017 at 9:29
  • @Peaceful I deleted the comment because you are using a multi-dimensional distance. Even though I might try to use some of that logic, this is a distinctly different question Commented Jan 13, 2017 at 9:33
  • You can use scipy's pdist to get a distance matrix. May run into memory issues if tot_vec is large. (A short sketch follows these comments.) Commented Jan 13, 2017 at 9:36
  • Someone should really implement something for this. It has to be about the most asked numpy question. Commented Jan 13, 2017 at 9:50
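
A minimal sketch of the pdist route mentioned in the comments (my addition, not from the thread; it materializes all n(n-1)/2 distances, so it only fits in memory for modest tot_vec):

import numpy as np
from scipy.spatial.distance import pdist, squareform

embed_vec = np.random.rand(3000, 3)  # small enough for a full matrix
R = 0.5

# Condensed pairwise distances, expanded to a full (n, n) matrix
d = squareform(pdist(embed_vec))
# Count neighbors within R; subtract 1 to drop the zero self-distance
p = np.count_nonzero(d < R, axis=1) - 1

Note that this counts every neighbor of each point, whereas the loop in the question only credits the lower-indexed element of each pair.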

3 Answers


Approach #1: Vectorized approach -

def vectorized_app(embed_vec, R):
    tot_vec = embed_vec.shape[0]
    # Row/column indices of the upper triangle, i.e. all pairs (i, j) with j > i
    r, c = np.triu_indices(tot_vec, 1)
    subs = embed_vec[r] - embed_vec[c]
    # Squared distances as row-wise dot products
    dists = np.einsum('ij,ij->i', subs, subs)
    # For each i, count pairs with squared distance below R**2
    return np.bincount(r, dists < R**2, minlength=tot_vec)

Approach #2: With less loop complexity (for very large arrays) -

def loopy_less_app(embed_vec, R):
    tot_vec = embed_vec.shape[0]
    Rsq = R**2
    out = np.zeros(tot_vec, dtype=int)
    for i in range(tot_vec):
        # Differences against the later entries only (j > i)
        subs = embed_vec[i] - embed_vec[i+1:]
        # Squared distances, compared against R**2 to avoid the sqrt
        dists = np.einsum('ij,ij->i', subs, subs)
        out[i] = np.count_nonzero(dists < Rsq)
    return out
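
A quick sanity check that the two approaches agree (my addition, on a small random sample):

# Both functions count, for each i, close vectors with index j > i,
# so their outputs should match exactly
embed_vec = np.random.rand(500, 3)
R = 0.3
assert np.array_equal(vectorized_app(embed_vec, R).astype(int),
                      loopy_less_app(embed_vec, R))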

Benchmarking

Original approach -

def loopy_app(embed_vec, R):
    tot_vec = embed_vec.shape[0]
    p = np.zeros(tot_vec) # This contains the number of close vectors
    for i in range(tot_vec-1):
        for j in range(i+1, tot_vec):
            if np.linalg.norm(embed_vec[i]-embed_vec[j]) < R:
                p[i] += 1
    return p                

Timings -

In [76]: # Sample random array
    ...: embed_vec = np.random.rand(3000,3)
    ...: R = 0.5
    ...: 

In [77]: %timeit loopy_app(embed_vec, R)
1 loops, best of 3: 50.5 s per loop

In [78]: %timeit loopy_less_app(embed_vec, R)
10 loops, best of 3: 143 ms per loop

350x+ speedup there!

Going with a much bigger array with the proposed loopy_less_app -

In [81]: # Sample random array
    ...: embed_vec = np.random.rand(20000,3)
    ...: R = 0.5
    ...: 

In [82]: %timeit loopy_less_app(embed_vec, R)
1 loops, best of 3: 4.47 s per loop

5 Comments

This gives me: ValueError: array is too big; `arr.size * arr.dtype.itemsize` is larger than the maximum possible size.
@Peaceful Check out Approach #2? That should be good for your (60000,3) array, hopefully!
This is impressive indeed. Can you please explain how I can modify this for 1D vectors? In that case it throws: ValueError: einstein sum subscripts string contains too many subscripts for operand 0
@Peaceful For 1D arrays, skip the einsum step and at the last step do out[i] = np.count_nonzero(np.abs(subs) < R) in the loopy_less_app approach (sketched after these comments).
If the input array is both 1D and sorted, how could one speed this up further? (The sketch below includes a searchsorted idea for this.)
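
A sketch of the 1D variant described above, plus a searchsorted idea for the sorted case raised in the last comment (both my additions, following the loopy_less_app pattern):

def loopy_less_app_1d(embed_vec, R):
    # 1D variant per the comment above: no einsum needed, just
    # compare absolute differences against R directly
    tot_vec = embed_vec.shape[0]
    out = np.zeros(tot_vec, dtype=int)
    for i in range(tot_vec):
        subs = embed_vec[i] - embed_vec[i+1:]
        out[i] = np.count_nonzero(np.abs(subs) < R)
    return out

def sorted_1d_app(embed_vec, R):
    # If the 1D input is sorted ascending, one binary search per element
    # counts the later entries within R, in O(n log n) overall
    idx = np.arange(embed_vec.size)
    right = np.searchsorted(embed_vec, embed_vec + R, side='left')
    return right - idx - 1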

I am intrigued by this question and attempted to solve it efficiently using scipy's cKDTree. However, this approach may run out of memory, because internally a list of all pairs with distance <= R is maintained. If your R and tot_vec are small enough, it will work:

import numpy as np
from scipy.spatial import cKDTree as KDTree

tot_vec = 60000
embed_vec = np.random.randn(tot_vec, 3)
R = 0.1

tree = KDTree(embed_vec, leafsize=100)
p = np.zeros(tot_vec)
# query_pairs returns each close pair (i, j) with i < j exactly once,
# so credit both endpoints
for pair in tree.query_pairs(R):
    p[pair[0]] += 1
    p[pair[1]] += 1

In case memory is an issue, with some effort it is possible to rewrite query_pairs as a generator function in Python, at the cost of the C performance.
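
Alternatively, if your SciPy is recent enough (an assumption on my part; check your version), query_ball_point can return just the neighbor counts without materializing the pair list:

# return_length=True yields, per point, the number of neighbors within R;
# each count includes the point itself, hence the -1
p = tree.query_ball_point(embed_vec, R, return_length=True) - 1

This matches the query_pairs loop above, since that loop also credits both endpoints of each pair.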

3 Comments

I was looking into kmeans, but no luck. Good to see it in use for this problem! What about count_neighbors?
@Divakar I guess count_neighbors is less efficient because it is meant to operate on two trees and thus might traverse the tree twice. Did not try it, though (see the sketch after these comments for what it returns).
I see. Not really familiar with kmeans tools :) Would be interesting to see some timings I guess.
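
For what it's worth, a sketch of what count_neighbors returns when a tree is queried against itself (my addition): a single total rather than per-point counts, so it answers a slightly different question.

# Total number of ordered pairs (i, j) with distance <= R, counting
# self-pairs (i, i) and both orderings of each close pair
total = tree.count_neighbors(tree, R)
# Number of distinct close pairs, comparable to len(tree.query_pairs(R))
n_close_pairs = (total - tot_vec) // 2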

First, broadcast the differences:

disp_vecs = embed_vec[:, None, :] - embed_vec[None, :, :]

Now, depending on how big your dataset is, you may want to do a first pass without all the math: if the distance is less than r, the absolute value of every component of the difference must also be less than r.

first_mask = np.max(np.abs(disp_vecs), axis=-1) < r

Then do the actual calculation:

disps = np.linalg.norm(disp_vecs[first_mask], axis=-1)
second_mask = disps < r

Now reassign:

disps = disps[second_mask]
first_mask[first_mask] = second_mask

disps are now the good values, and first_mask is a boolean mask of where they go. You can process from there.
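
Putting the pieces together, a consolidated sketch of this broadcasting approach (my addition; the full (n, n, 3) displacement array limits it to modest n):

import numpy as np

def broadcast_count(embed_vec, r):
    # Pairwise displacements via broadcasting: shape (n, n, 3)
    disp_vecs = embed_vec[:, None, :] - embed_vec[None, :, :]
    # Cheap pre-filter: a pair can only be within r if every
    # component difference is below r in absolute value
    first_mask = np.max(np.abs(disp_vecs), axis=-1) < r
    # Exact distances only for the surviving candidate pairs
    disps = np.linalg.norm(disp_vecs[first_mask], axis=-1)
    second_mask = disps < r
    first_mask[first_mask] = second_mask
    # Each row now flags true neighbors; subtract 1 for the self-pair
    return first_mask.sum(axis=1) - 1

Unlike the question's loop, this counts every neighbor of each point, not just those with a larger index.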

1 Comment

Did you mean to write embed_vec in place of tot_vec?
