Get only non-zero sub arrays from a N dimensional Numpy array

Question

I have a numpy array 'arr' of shape (1756020, 28, 28, 4). Basically, 'arr' holds 1756020 small arrays of shape (28, 28, 4). Of these 1756020 arrays, 967210 are 'all zero' and 788810 have all non-zero values. I want to remove all 967210 'all zero' small arrays. I wrote an if/else loop using the condition arr[i]==0.any() but it takes a lot of time. Is there a better way to do it?

arr.any(axis=(1,2,3)) might be more effective, because the first nonzero value is enough to keep a sub-array; counting the total number is not needed.
– Antti A, Commented May 13, 2018 at 20:56
Did one of the below solutions help? Feel free to accept one (tick on left), or ask for clarification.
– jpp, Commented May 16, 2018 at 11:38

jpp · Accepted Answer · 2018-05-13 20:51:25Z

One way to vectorise your logic is to use numpy.any with a tuple argument for axis that lists the dimensions to reduce over (here, every axis except the first).

import numpy as np

# set up 4d array of ones
A = np.ones((5, 3, 3, 4))

# set the second sub-array, of shape (3, 3, 4), to zero
A[1] = 0  # or A[1, ...] = 0; or A[1, :, :, :] = 0

# find out which sub-arrays contain a non-zero value
res = np.any(A, axis=(1, 2, 3))

print(res)

[ True False  True  True  True]

This feature is available in NumPy v1.7 upwards. As per the docs:

axis : None or int or tuple of ints, optional

If this is a tuple of ints, a reduction is performed on multiple axes, instead of a single axis or all the axes as before.
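To actually drop the all-zero sub-arrays, the boolean result can be used as a mask on the first axis. The following is a minimal sketch on a small stand-in array rather than the full-size data; the names mask and A_nonzero are only illustrative and are not part of the original answer.

```python
import numpy as np

# Small stand-in for the (1756020, 28, 28, 4) array in the question
A = np.ones((5, 3, 3, 4))
A[1] = 0  # one all-zero sub-array

# True for every sub-array that contains at least one non-zero value
mask = np.any(A, axis=(1, 2, 3))

# Boolean indexing along the first axis keeps only those sub-arrays
# (this creates a copy of the selected data)
A_nonzero = A[mask]

print(A_nonzero.shape)  # (4, 3, 3, 4)
```

Note that the boolean indexing step copies the kept sub-arrays, so on the full-size array it needs enough memory for the result in addition to the original.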

edited May 13, 2018 at 20:51

answered May 13, 2018 at 20:09

7 Comments

jpp Over a year ago

@MateenUlhaq, This is pretty much exactly what I have. The difference being I'm identifying the zero arrays. There's no fundamental difference, though. Do you believe there is?

Antti A Over a year ago

I think the problem with arr != 0 is that it creates one new huge array, which needs about 5 GB (or whatever) of memory. Maybe A == 0 is also unnecessary; why not just res = np.any(A, axis=(1, 2, 3))?

Alex Riley Over a year ago

Minor point: A[1, :, :, :] = 0 is just equivalent to A[1] = 0 - there's no need to specify the slices for the trailing axes (either explicitly or using ellipses).

jpp Over a year ago

@AnttiA, Timing the 3 options (A==0, A!=0, just A), there isn't much difference. I don't think you'll find any memory benefit either. Does your testing suggest something else?

jpp Over a year ago

@AlexRiley, Thanks - appreciated, I put all 3 up for good order.

Antti A · Accepted Answer · 2018-05-13 20:31:16Z

I made a small test script with the size you mentioned. On my computer, array creation (memory error with floats, that's why I used booleans) and selection are slow, but finding the zeros seems to be rather fast:

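import time

import numpy as np

if __name__ == '__main__':
    # booleans keep the test array small enough to fit in memory
    arr = np.ones((1756020, 28, 28, 4), dtype=bool)
    arr[::2] = 0  # zero out every other sub-array
    print(arr[:5])
    s = arr.shape
    t0 = time.time()
    # flatten each sub-array into one row, then test each row for any non-zero value
    arr2 = arr.reshape((s[0], np.prod(s[1:])))
    ok = np.any(arr2, axis=1)
    print(time.time()-t0)
    # keep only the rows with a non-zero value (this copies the data)
    arr_clean = arr2[ok]
    print(time.time()-t0)
    # restore the original trailing shape
    arr_clean = arr_clean.reshape((np.sum(ok), *s[1:]))
    print(time.time()-t0)
    print('end')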

Output (cumulative seconds since t0):

0.4846000671386719 # Finding the zeros is fast

29.750200271606445 # Removing the zeros is slow

29.797000408172607 # Reshaping back to the original shape [1:] is fast
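As a rough check, the reshape-then-any approach in the script above should give the same mask as reducing over the trailing axes directly with a tuple axis. The sketch below uses a much smaller array than the test script, and the names ok_flat and ok_tuple are illustrative only.

```python
import numpy as np

# Small stand-in array; every other sub-array is zeroed, as in the script above
arr = np.ones((6, 28, 28, 4), dtype=bool)
arr[::2] = 0

# Flatten each sub-array to one row; -1 lets NumPy infer the row length
ok_flat = arr.reshape(arr.shape[0], -1).any(axis=1)

# Reduce over the trailing axes directly
ok_tuple = arr.any(axis=(1, 2, 3))

assert np.array_equal(ok_flat, ok_tuple)

# Indexing the 4d array with the mask keeps the trailing shape,
# so no final reshape is needed
arr_clean = arr[ok_tuple]
print(arr_clean.shape)  # (3, 28, 28, 4)
```

Indexing the original 4d array avoids the final reshape step, although the selection itself still copies the kept sub-arrays.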
