Python/Numpy: How to avoid storing Boolean-indexed sub-arrays?

Question

In using Numpy I often have the need to use Boolean indexing to access parts of an array. To makes things easier to read and type, I often store these sub-arrays into new variables, for example:

n = 10000
X = np.random.rand((n, n))
W = np.random.random_integers(0, 1, n)
X0 = X[W==0]
X1 = X[W==1]

However, as I deal with larger and larger data sets, this seems very wasteful. What is the recommended practice in this case? Should I just write (in the example above) X[W==0] and X[W==1] every time instead?

since W contains only 0 or 1, you can do: nulls = (W==0) and then X0 = X[nulls] and X1 = X[np.logical_not(nulls)]. The latter could be X1 = X[~nulls] — Saullo G. P. Castro
– Saullo G. P. Castro, Commented Aug 24, 2014 at 9:20

DrV · Accepted Answer · 2014-08-24 09:42:11Z

If you have only a limited number of elements to index, you might want to store the indices as a list of indices. For this purpose the numpy.nonzero is very useful. However, if your number of elements to index is large, you use less memory by storing the boolean array (1 byte per element).

So, there are four possibilities:

store a boolean array
store indices of non-zeros
always make the comparison separately
masked arrays

From the memory storage point of view alternative 1 takes 8 bytes per indexed element per dimension. (Of course, the "per dimension" can be avoided by using flat indices.) The boolean approach takes 1 byte per element, so if you have more than 1/8 of the elements with a True in the boolean table, it is the more space-saving solution. Solution #3 probably takes the same space as the boolean solution.

(I do not know enough of the internals of NumPy to say much about the masked arrays. I suspect they behave in a similar way to the boolean indexing.)

Performance-wise the situation is similar. The boolean solution is efficient, if you have a lot of elements to pick, but if you have only a few elements, then the index solution is better.

Just to give some benchmarking idea:

import numpy as np
import time

def create_indices(prob):
    data = np.random.random(100000000) < prob
    return data, np.nonzero(data)

def bool_index(data):
    return data[data]

def list_index(data, indices):
    return data[indices]

By using timeit with different probabilities, the results are:

    p    boolean   list
   0.01   0.206    0.012
   0.10   0.415    0.099
   0.20   0.405    0.146
   0.50   0.786    0.373
   0.75   0.539    0.555
   1.00   0.214    0.723

This is actually quite interesting: Using boolean indices is worst when half of the elements are True. The use of list indices behaves as expected.

This benchmark must not be taken as the whole truth. It may be that the type of the array to be indexed changes the situation (here bool which is the same as uint8), etc. However, it seems that performance-wise list indices are very good in most cases.

Collectives™ on Stack Overflow

Python/Numpy: How to avoid storing Boolean-indexed sub-arrays?

1 Answer 1

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Related