exceptions for numpy arrays

Question

I'm looking to remove values from one array that fall within a constant range of the values held in a second array. For example, I have one large NumPy array and I want to remove every value within ±3 of any entry in another array of specific values, say [20,50,90,210]. So if my large array was [14,21,48,54,92,215], I would want [14,54,215] returned. The values are double precision, so I'm trying to avoid creating a large mask array to remove specific values and want to use a range instead.

Joe Kington · Accepted Answer · 2016-01-17 01:08:20Z

You mentioned that you wanted to avoid a large mask array. Unless both your "large array" and your "specific values" array are very large, I wouldn't try to avoid this. Often, with numpy it's best to allow relatively large temporary arrays to be created.

However, if you do need to control memory usage more tightly, you have several options. A typical trick is to only vectorize one part of the operation and iterate over the shorter input (this is shown in the second example below). It saves having nested loops in Python, and can significantly decrease the memory usage involved.

I'll show three different approaches. There are several others (including dropping down to C or Cython if you really need tight control and performance), but hopefully this gives you some ideas.

On a side note, for these small inputs, the overhead of array creation will overwhelm the differences. The speed and memory differences I'm referring to only matter for large (>~1e6 elements) arrays.

Fully vectorized, but most memory usage

The easiest way is to calculate all distances at once and then reduce the mask back to the same shape as the initial array. For example:

import numpy as np

vals = np.array([14,21,48,54,92,215])
other = np.array([20,50,90,210])

dist = np.abs(vals[:,None] - other[None,:])
mask = np.all(dist > 3, axis=1)
result = vals[mask]
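
To make the memory cost concrete, here is a small sketch that inspects the intermediate array from the example above. It only uses the standard `.shape`, `.nbytes`, and `.itemsize` attributes; the printed sizes are for this toy input, not a benchmark.

```python
import numpy as np

vals = np.array([14, 21, 48, 54, 92, 215])
other = np.array([20, 50, 90, 210])

# Broadcasting a (6, 1) column against a (1, 4) row gives a (6, 4) array:
# one distance for every (value, exception) pair.
dist = np.abs(vals[:, None] - other[None, :])
print(dist.shape)  # (6, 4)

# The temporary holds len(vals) * len(other) elements, so its size grows
# with the product of the two lengths.
print(dist.nbytes == vals.size * other.size * dist.itemsize)  # True

# Keep only the rows where every distance exceeds the tolerance.
mask = np.all(dist > 3, axis=1)
result = vals[mask]
print(result)  # [ 14  54 215]
```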

Partially vectorized, intermediate memory usage

Another option is to build up the mask iteratively for each element in the "specific values" array. This iterates over all elements of the shorter "specific values" array (a.k.a. other in this case):

import numpy as np

vals = np.array([14,21,48,54,92,215])
other = np.array([20,50,90,210])

mask = np.ones(len(vals), dtype=bool)
for num in other:
    dist = np.abs(vals - num)
    mask &= dist > 3
result = vals[mask]

Slowest, but lowest memory usage

Finally, if you really want to reduce memory usage, you could iterate over every item in your large array:

import numpy as np

vals = np.array([14,21,48,54,92,215])
other = np.array([20,50,90,210])

result = []
for num in vals:
    if np.all(np.abs(num - other) > 3):
        result.append(num)

The temporary list in that case is likely to take up more memory than the mask in the previous version. However, you could avoid the temporary list by using np.fromiter if you wanted. The timing comparison below shows an example of this.

Timing Comparisons

Let's compare the speed of these functions. We'll use 10,000,000 elements in the "large array" and 4 values in the "specific values" array. The relative speed and memory usage of these functions depend strongly on the sizes of the two arrays, so you should only consider this as a vague guideline.

import numpy as np

vals = np.random.random(int(1e7))  # size must be an integer, not the float 1e7
other = np.array([0.1, 0.5, 0.8, 0.95])
tolerance = 0.05

def basic(vals, other, tolerance):
    dist = np.abs(vals[:,None] - other[None,:])
    mask = np.all(dist > tolerance, axis=1)
    return vals[mask]

def intermediate(vals, other, tolerance):
    mask = np.ones(len(vals), dtype=bool)
    for num in other:
        dist = np.abs(vals - num)
        mask &= dist > tolerance
    return vals[mask]

def slow(vals, other, tolerance):
    def func(vals, other, tolerance):
        for num in vals:
            if np.all(np.abs(num - other) > tolerance):
                yield num
    return np.fromiter(func(vals, other, tolerance), dtype=vals.dtype)

And in this case, the partially vectorized version wins out. That's to be expected in most cases where vals is significantly longer than other. However, the first example (basic) is almost as fast, and is arguably simpler.

In [7]: %timeit basic(vals, other, tolerance)
1 loops, best of 3: 1.45 s per loop

In [8]: %timeit intermediate(vals, other, tolerance)
1 loops, best of 3: 917 ms per loop

In [9]: %timeit slow(vals, other, tolerance)
1 loops, best of 3: 2min 30s per loop

Whichever way you choose to implement things, these are common vectorization "tricks" that show up in many problems. In high-level languages like Python, Matlab, R, etc., it's often useful to try fully vectorizing first, then mix vectorization and explicit loops if memory usage is an issue. Which one is best usually depends on the relative sizes of the inputs, but this is a common pattern to try when optimizing speed vs. memory usage in high-level scientific programming.
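
As a sketch of one further option in the same spirit (it is not one of the three approaches timed above, and it is not benchmarked here): if the "specific values" array is sorted, `np.searchsorted` can locate the neighbouring exceptions for every element at once, so the temporaries are only as long as `vals` rather than `len(vals) * len(other)`. The function name `filter_sorted` is hypothetical, and the sketch assumes `other` is non-empty.

```python
import numpy as np

def filter_sorted(vals, other, tolerance):
    """Keep the elements of vals farther than tolerance from every value in other."""
    other = np.sort(other)
    # Insertion index of each value into the sorted exceptions (0..len(other)).
    idx = np.searchsorted(other, vals)
    last = len(other) - 1
    # Nearest exception on each side; clip so the ends reuse the edge value.
    left = other[np.clip(idx - 1, 0, last)]
    right = other[np.clip(idx, 0, last)]
    # Distance to the closest exception overall.
    nearest = np.minimum(np.abs(vals - left), np.abs(vals - right))
    return vals[nearest > tolerance]

vals = np.array([14, 21, 48, 54, 92, 215])
other = np.array([20, 50, 90, 210])
result = filter_sorted(vals, other, 3)
print(result)  # [ 14  54 215]
```

Whether this beats the partially vectorized loop will depend on the sizes of both arrays, so it is worth timing on your own data before adopting it.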

Thanks a lot. I was trying something similar to the bottom method for a huge spectral dataset and it was very slow. Since I'm using a constant set of exceptions for hundreds to thousands of 2150-length double-precision 1-D arrays, I suppose a constant mask actually will be the way to go.

Also keep in mind that the memory usage may not be as large an issue as you might think. A 1 million element, 64-bit float numpy array is almost exactly 8MB in RAM (8MB + a few tens of bytes of overhead). A boolean mask of the same length is only 1MB. At any rate, you can afford to make a few temporary copies in memory, and it's almost always faster than iterating over an array in Python. Either way, though, numpy gives you fairly tight control over memory if you need it.
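
The sizes quoted in that comment can be checked directly with the `.nbytes` attribute. This is a minimal sketch; `n` and `m` are illustrative values chosen to match the numbers discussed here.

```python
import numpy as np

n = 1_000_000
data = np.zeros(n, dtype=np.float64)
mask = np.ones(n, dtype=bool)

print(data.nbytes)  # 8000000 bytes: 8 bytes per float64
print(mask.nbytes)  # 1000000 bytes: 1 byte per boolean

# The fully vectorized approach builds an (n, m) float array of distances
# before reducing it, so with m = 4 exception values that temporary is
# four times the size of the input.
m = 4
temp_bytes = n * m * data.itemsize
print(temp_bytes)  # 32000000
```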

jeremycg · Answer · 2016-01-17 00:53:53Z

You can try:

def closestmatch(x, y):
    # smallest absolute distance from y to any element of x
    val = np.abs(x - y)
    return val.min() >= 3

Then:

b[np.array([closestmatch(a, x) for x in b])]
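
For completeness, here is a runnable version of the snippet above using the data from the question. It assumes `a` is the short array of specific values and `b` is the large array, which is how the call is written but is not stated explicitly in the answer.

```python
import numpy as np

def closestmatch(x, y):
    # smallest absolute distance from y to any element of x
    val = np.abs(x - y)
    return val.min() >= 3

a = np.array([20, 50, 90, 210])          # assumed: the specific values
b = np.array([14, 21, 48, 54, 92, 215])  # assumed: the large array

# One Python-level call per element of b, each vectorized over a.
keep = np.array([closestmatch(a, x) for x in b])
result = b[keep]
print(result)  # [ 14  54 215]
```

Note that `>= 3` keeps values exactly 3 away from an exception, whereas the other answer's `> 3` removes them; pick whichever matches the intended meaning of "within ±3".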

1 Comment

Joe Kington Over a year ago

That makes sense if a is much larger than b. However, it will be horrendously slow once you get beyond a few hundred thousand elements in the b array.
