4

I have a numpy array 'arr' of shape (1756020, 28, 28, 4). In other words, 'arr' contains 1756020 small arrays of shape (28, 28, 4). Of these, 967210 are all zero and 788810 contain only non-zero values. I want to remove all 967210 all-zero small arrays. I wrote an if/else loop using the condition (arr[i] == 0).any(), but it takes a long time. Is there a better way to do it?

4
  • Try arr[(arr!=0).any(axis=(1,2,3))]. Commented May 13, 2018 at 20:09
  • numpy.count_nonzero ? Commented May 13, 2018 at 20:10
  • arr.any(axis=(1,2,3)) might be more efficient: the first non-zero value is enough to keep a block, so counting the total number is not needed. Commented May 13, 2018 at 20:56
  • Did one of the below solutions help? Feel free to accept one (tick on left), or ask for clarification. Commented May 16, 2018 at 11:38

2 Answers

6

One way to vectorise your logic is to use numpy.any with a tuple argument for axis containing non-tested dimensions.

import numpy as np

# set up 4d array of ones
A = np.ones((5, 3, 3, 4))

# set the second sub-array of shape (3, 3, 4) to zero
A[1] = 0  # or A[1, ...] = 0; or A[1, :, :, :] = 0

# find out which sub-arrays contain at least one non-zero value
res = np.any(A, axis=(1, 2, 3))

print(res)

[ True False  True  True  True]

This feature is available in numpy v1.7 upwards. As per the docs:

axis : None or int or tuple of ints, optional

If this is a tuple of ints, a reduction is performed on multiple axes, instead of a single axis or all the axes as before.
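The mask above only identifies the non-zero blocks; to actually drop the all-zero ones, it can be used as a boolean index on the first axis. A minimal sketch on the same toy array (the names keep and A_clean are illustrative, not from the original answer):

```python
import numpy as np

# Same toy array as above: five blocks of shape (3, 3, 4), the second all zero.
A = np.ones((5, 3, 3, 4))
A[1] = 0

# True for every block that has at least one non-zero element.
keep = np.any(A, axis=(1, 2, 3))

# Boolean indexing on the first axis copies only the kept blocks.
A_clean = A[keep]

print(A_clean.shape)  # (4, 3, 3, 4)
```

Note that A[keep] returns a copy, so the result needs its own memory in addition to the original array.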


7 Comments

@MateenUlhaq, This is pretty much exactly what I have. The difference being I'm identifying the zero arrays. There's no fundamental difference, though. Do you believe there is?
I think the problem with arr != 0 is that it creates one new huge array. It needs about 5 GB (or whatever) of memory. Also, A == 0 may be unnecessary; why not just res = np.any(A, axis=(1, 2, 3))?
Minor point: A[1, :, :, :] = 0 is just equivalent to A[1] = 0 - there's no need to specify the slices for the trailing axes (either explicitly or using ellipses).
@AnttiA, Timing the 3 options (A==0, A!=0, just A), there isn't much difference. I don't think you'll find any memory benefit either. Does your testing suggest something else?
@AlexRiley, Thanks - appreciated, I put all 3 up for good order.
1

I made a small test script with the size you mentioned. On my computer, array creation and selection are slow (floats gave a memory error, which is why I used booleans), but finding the zeros seems to be rather fast:

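import time

import numpy as np

if __name__ == '__main__':
    # boolean dtype, because a float array of this size raised a MemoryError
    arr = np.ones((1756020, 28, 28, 4), dtype=bool)
    # zero out every other block
    for i in range(0, 1756020, 2):
        arr[i] = 0
    print(arr[:5])
    s = arr.shape
    t0 = time.time()
    # flatten each (28, 28, 4) block to one row and test each row for non-zeros
    arr2 = arr.reshape((s[0], np.prod(s[1:])))
    ok = np.any(arr2, axis=1)
    print(time.time() - t0)
    # keep only the rows with at least one non-zero value
    arr_clean = arr2[ok]
    print(time.time() - t0)
    # restore the original block shape
    arr_clean = arr_clean.reshape((np.sum(ok), *s[1:]))
    print(time.time() - t0)
    print('end')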

Output:

0.4846000671386719 # Finding the zeros is fast

29.750200271606445 # Removing the zeros is slow

29.797000408172607 # Reshaping back to the original shape[1:] is fast
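The flattened-row mask used in this script and the tuple-of-axes mask from the other answer should select the same blocks. A small check, assuming a reduced size of 1000 blocks so it runs quickly (the size and variable names are illustrative; no timings are claimed):

```python
import numpy as np

# Reduced size (assumption) so the check runs quickly; every other block is zeroed.
arr = np.ones((1000, 28, 28, 4), dtype=bool)
arr[::2] = 0

# Mask via a tuple of axes, as in the first answer.
mask_axes = arr.any(axis=(1, 2, 3))

# Mask via flattening each block to one row, as in the script above.
mask_flat = arr.reshape(arr.shape[0], -1).any(axis=1)

assert np.array_equal(mask_axes, mask_flat)

# Indexing the 4D array directly keeps the block shape.
arr_clean = arr[mask_axes]
print(arr_clean.shape)  # (500, 28, 28, 4)
```

Indexing the 4D array with the mask means the final reshape step is not needed, although the selection itself still copies the kept blocks.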
