I'm working with two arrays, each of shape 75x4, and I'm applying a shortest-distance algorithm between them.

So I want to:

  • perform an operation with one row of the first array with every individual row of the second array, iterating to obtain 75 values
  • find the minimum value, and store that in a new array
  • repeat this with the second row of the first array, once again iterating the operation for all the rows of the second array, and again storing the minimum difference to the new array

How would I go about doing this with numpy?

Essentially I want to perform an operation between one row of array 1 on every row of array 2, find the minimum value, and store that in a new array. Then do that very same thing for the 2nd row of array 1, and so on for all 75 rows of array 1.

Here is the code for the formula I'm using. What I get here is just the distance between each row of array 1 (training data) and the corresponding row of array 2 (testing data). What I'm looking for instead is to take one row of array 1, iterate down all the rows of array 2, store the minimum value in a new array, then do the same for the next row of array 1, and so on.

arr_attributedifference = (arr_trainingdata - arr_testingdata)**2
arr_distance = np.sqrt(arr_attributedifference.sum(axis=1))
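One way to get the row-wise minima the question asks for (a sketch, not from the original post) is to broadcast the two arrays against each other to get the full 75x75 distance matrix, then take the minimum along each row. The variable names reuse those from the snippet above; the random data is just for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
arr_trainingdata = rng.standard_normal((75, 4))
arr_testingdata = rng.standard_normal((75, 4))

# Broadcast (75, 1, 4) - (1, 75, 4) -> (75, 75, 4): every row of array 1
# differenced against every row of array 2.
diff = arr_trainingdata[:, None, :] - arr_testingdata[None, :, :]
# Full 75x75 distance matrix, then the minimum along each row.
arr_dist = np.sqrt((diff ** 2).sum(axis=2))
arr_min = arr_dist.min(axis=1)  # shape (75,): one minimum per training row
```

Note this materializes a 75x75x4 intermediate array, which is harmless at this size but can get expensive for large inputs.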
  • There are three basic alternatives: (1) use a nested loop, giving up all the speed benefits of numpy; (2) build a product array so you can do one giant elementwise operation; (3) leave the outer loop as a loop, but turn the inner loop into an elementwise operation. You usually want to do #3, because it has almost all of the speed benefits of #2 without the memory costs. So you just need to figure out how to do your operation as one row from A against the whole array B, and then loop over A doing that. Commented Jun 7, 2018 at 22:50
  • But meanwhile, two 75x4 arrays is pretty tiny, so you may be fine just writing whatever you feel most comfortable with writing, and then making sure it's not too slow. Commented Jun 7, 2018 at 22:51
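Option #3 from the first comment (outer loop over A, vectorized inner operation) can be sketched like this; `row_minima` is a hypothetical helper name, not from the original discussion:

```python
import numpy as np

def row_minima(A, B):
    """For each row of A, the distance to its nearest row in B (inner loop vectorized)."""
    out = np.empty(len(A))
    for i, a in enumerate(A):
        # One row of A against the whole of B in a single elementwise operation.
        out[i] = np.sqrt(((a - B) ** 2).sum(axis=1)).min()
    return out

A = np.random.randn(75, 4)
B = np.random.randn(75, 4)
mins = row_minima(A, B)  # shape (75,)
```

This keeps memory usage at one (75, 4) temporary per iteration instead of a full 75x75x4 product array.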

1 Answer


Here are two methods: one using einsum, the other a KDTree:

einsum does essentially what we could also achieve via broadcasting; for example, np.einsum('ik,jk', A, B) is roughly equivalent to (A[:, None, :] * B[None, :, :]).sum(axis=2). The advantage of einsum is that it does the summing straight away, so it avoids creating an m x m x n intermediate array.
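To make that equivalence concrete (a small check, not part of the original answer):

```python
import numpy as np

A = np.random.randn(5, 3)
B = np.random.randn(7, 3)

# einsum sums over the shared last axis k directly, never materializing
# the 5x7x3 elementwise product that the broadcasting version creates.
G1 = np.einsum('ik,jk', A, B)
G2 = (A[:, None, :] * B[None, :, :]).sum(axis=2)
assert np.allclose(G1, G2)  # both are the 5x7 matrix of row dot products
```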

KDTree is more sophisticated. We have to invest upfront into generating the tree but afterwards querying nearest neighbors is very efficient.

import numpy as np
from scipy.spatial import cKDTree as KDTree

def f_einsum(A, B):
    # argmin_j ||A_i - B_j||^2 = argmin_j (||B_j||^2 / 2 - A_i . B_j),
    # since ||A_i||^2 is constant in j
    B2AB = np.einsum('ij,ij->i', B, B) / 2 - np.einsum('ik,jk', A, B)
    idx = B2AB.argmin(axis=1)  # nearest row of B for each row of A
    D = A - B[idx]
    return np.sqrt(np.einsum('ij,ij->i', D, D)), idx  # distances and indices

def f_KDTree(A, B):
    T = KDTree(B)
    return T.query(A, 1)

m, n = 75, 4
A, B = np.random.randn(2, m, n)

de, ie = f_einsum(A, B)
dt, it = f_KDTree(A, B)
assert np.all(ie == it) and np.allclose(de, dt)

from timeit import timeit

for m, n in [(75, 4), (500, 4)]:
    A, B = np.random.randn(2, m, n)
    print(m, n)
    print('einsum:', timeit("f_einsum(A, B)", globals=globals(), number=1000))
    print('KDTree:', timeit("f_KDTree(A, B)", globals=globals(), number=1000))

Sample run:

75 4
einsum: 0.067826496087946
KDTree: 0.12196151306852698
500 4
einsum: 3.1056990439537913
KDTree: 0.85108971898444

We can see that for small problem sizes the direct method (einsum) is faster, while for larger problem sizes the KDTree wins.
