I'm working with two arrays, each of shape 75x4, and I'm applying a shortest-distance algorithm between them.

So I want to:

  • perform an operation with one row of the first array with every individual row of the second array, iterating to obtain 75 values
  • find the minimum value, and store that in a new array
  • repeat this with the second row of the first array, once again iterating the operation for all the rows of the second array, and again storing the minimum difference to the new array

How would I go about doing this with numpy?

Essentially I want to perform an operation between one row of array 1 on every row of array 2, find the minimum value, and store that in a new array. Then do that very same thing for the 2nd row of array 1, and so on for all 75 rows of array 1.

Here is the code for the formula I'm using. What I get here is just the distance between each row of array 1 (training data) and the corresponding row of array 2 (testing data). What I'm looking for instead is to take one row of array 1, iterate down all the rows of array 2, store the minimum value in a new array, then do the same for the next row of array 1, and so on.

arr_attributedifference = (arr_trainingdata - arr_testingdata)**2
arr_distance = np.sqrt(arr_attributedifference.sum(axis=1))
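One way to get the row-wise minima the question asks for (a sketch, not from the original post) is to broadcast the two arrays against each other to get the full 75x75 distance matrix, then take the minimum along each row. The variable names reuse those from the snippet above; the random data is just for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
arr_trainingdata = rng.standard_normal((75, 4))
arr_testingdata = rng.standard_normal((75, 4))

# Broadcast (75, 1, 4) - (1, 75, 4) -> (75, 75, 4): every row of array 1
# differenced against every row of array 2.
diff = arr_trainingdata[:, None, :] - arr_testingdata[None, :, :]
# Full 75x75 distance matrix, then the minimum along each row.
arr_dist = np.sqrt((diff ** 2).sum(axis=2))
arr_min = arr_dist.min(axis=1)  # shape (75,): one minimum per training row
```

Note this materializes a 75x75x4 intermediate array, which is harmless at this size but can get expensive for large inputs.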
  • There are three basic alternatives: (1) use a nested loop, giving up all the speed benefits of numpy; (2) build a product array so you can do one giant elementwise operation; (3) leave the outer loop as a loop, but turn the inner loop into an elementwise operation. You usually want to do #3, because it has almost all of the speed benefits of #2 without the memory costs. So you just need to figure out how to do your operation as one row from A against the whole array B, and then loop over A doing that. Commented Jun 7, 2018 at 22:50
  • But meanwhile, two 75x4 arrays is pretty tiny, so you may be fine just writing whatever you feel most comfortable with writing, and then making sure it's not too slow. Commented Jun 7, 2018 at 22:51
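Option #3 from the first comment (outer loop over A, vectorized inner operation) can be sketched like this; `row_minima` is a hypothetical helper name, not from the original discussion:

```python
import numpy as np

def row_minima(A, B):
    """For each row of A, the distance to its nearest row in B (inner loop vectorized)."""
    out = np.empty(len(A))
    for i, a in enumerate(A):
        # One row of A against the whole of B in a single elementwise operation.
        out[i] = np.sqrt(((a - B) ** 2).sum(axis=1)).min()
    return out

A = np.random.randn(75, 4)
B = np.random.randn(75, 4)
mins = row_minima(A, B)  # shape (75,)
```

This keeps memory usage at one (75, 4) temporary per iteration instead of a full 75x75x4 product array.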

1 Answer


Here are two methods: one using einsum, the other a KDTree:

einsum does essentially what we could also achieve via broadcasting; for example, np.einsum('ik,jk', A, B) is roughly equivalent to (A[:, None, :] * B[None, :, :]).sum(axis=2). The advantage of einsum is that it does the summing straight away, so it avoids creating an m x m x n intermediate array.
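To make that equivalence concrete (a small check, not part of the original answer):

```python
import numpy as np

A = np.random.randn(5, 3)
B = np.random.randn(7, 3)

# einsum sums over the shared last axis k directly, never materializing
# the 5x7x3 elementwise product that the broadcasting version creates.
G1 = np.einsum('ik,jk', A, B)
G2 = (A[:, None, :] * B[None, :, :]).sum(axis=2)
assert np.allclose(G1, G2)  # both are the 5x7 matrix of row dot products
```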

KDTree is more sophisticated. We have to invest upfront into generating the tree but afterwards querying nearest neighbors is very efficient.

import numpy as np
from scipy.spatial import cKDTree as KDTree

def f_einsum(A, B):
    # argmin_j ||A_i - B_j||^2 = argmin_j (||B_j||^2 / 2 - A_i . B_j),
    # since ||A_i||^2 is constant in j
    B2AB = np.einsum('ij,ij->i', B, B) / 2 - np.einsum('ik,jk', A, B)
    idx = B2AB.argmin(axis=1)  # nearest row of B for each row of A
    D = A - B[idx]
    return np.sqrt(np.einsum('ij,ij->i', D, D)), idx  # distances and indices

def f_KDTree(A, B):
    T = KDTree(B)
    return T.query(A, 1)

m, n = 75, 4
A, B = np.random.randn(2, m, n)

de, ie = f_einsum(A, B)
dt, it = f_KDTree(A, B)
assert np.all(ie == it) and np.allclose(de, dt)

from timeit import timeit

for m, n in [(75, 4), (500, 4)]:
    A, B = np.random.randn(2, m, n)
    print(m, n)
    print('einsum:', timeit("f_einsum(A, B)", globals=globals(), number=1000))
    print('KDTree:', timeit("f_KDTree(A, B)", globals=globals(), number=1000))

Sample run:

75 4
einsum: 0.067826496087946
KDTree: 0.12196151306852698
500 4
einsum: 3.1056990439537913
KDTree: 0.85108971898444

We can see that for small problem sizes the direct method (einsum) is faster, while for larger problem sizes the KDTree wins.
