I have a numpy array 'arr' of shape (1756020, 28, 28, 4).
Basically, 'arr' contains 1756020 small arrays of shape (28, 28, 4). Of these, 967210 are all zero and 788810 contain only non-zero values. I want to remove all 967210 all-zero small arrays. I wrote an if/else loop using the condition arr[i]==0.any(), but it takes a lot of time. Is there a better way to do it?
2 Answers
One way to vectorise your logic is to use numpy.any with a tuple argument for axis that lists the dimensions to reduce over (here, every axis except the first).
import numpy as np
# set up a 4d array of ones
A = np.ones((5, 3, 3, 4))
# set the second sub-array of shape (3, 3, 4) to zero
A[1] = 0 # or A[1, ...] = 0; or A[1, :, :, :] = 0
# find out which sub-arrays contain at least one non-zero value
res = np.any(A, axis=(1, 2, 3))
print(res)
[ True False  True  True  True]
This feature is available in numpy v1.7 upwards. As per the docs:
axis : None or int or tuple of ints, optional
If this is a tuple of ints, a reduction is performed on multiple axes, instead of a single axis or all the axes as before.
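The answer above stops at the boolean mask. As a minimal sketch of the remaining step, the mask can be used to index the first axis so that only the sub-arrays with at least one non-zero value are kept. The small array size, the float32 dtype and the names keep and arr_clean are assumptions chosen for illustration, not taken from the question.

```python
import numpy as np

# Small stand-in for the large array in the question (size is illustrative).
arr = np.ones((6, 28, 28, 4), dtype=np.float32)
arr[[1, 4]] = 0  # make two of the sub-arrays all zero

# True for every sub-array that contains at least one non-zero value.
keep = np.any(arr, axis=(1, 2, 3))

# Boolean indexing along the first axis returns a copy of the kept sub-arrays.
arr_clean = arr[keep]

print(keep.tolist())    # [True, False, True, True, False, True]
print(arr_clean.shape)  # (4, 28, 28, 4)
```

Because the mask has one entry per sub-array, it selects whole (28, 28, 4) blocks and the trailing dimensions are preserved.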
7 Comments
A[1, :, :, :] = 0 is equivalent to A[1] = 0; there's no need to specify the slices for the trailing axes (either explicitly or using an ellipsis).
I made a small test script with the size you mentioned. On my computer, array creation (memory error with floats, which is why I used booleans) and selection are slow, but finding the zeros seems to be rather fast:
import time
import numpy as np
if __name__ == '__main__':
    arr = np.ones((1756020, 28, 28, 4), dtype=bool)
    for i in range(0,1756020,2):
        arr[i] = 0
    print(arr[:5])
    s = arr.shape
    t0 = time.time()
    arr2 = arr.reshape((s[0], np.prod(s[1:])))
    ok = np.any(arr2, axis=1)
    print(time.time()-t0)
    arr_clean = arr2[ok]
    print(time.time()-t0)
    arr_clean = arr_clean.reshape((np.sum(ok), *s[1:]))
    print(time.time()-t0)
    print('end')
Output:
0.4846000671386719 # Finding the zeros is fast
29.750200271606445 # Removing the zeros is slow
29.797000408172607 # Reshaping back to the original shape[1:] is fast
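For comparison, here is a small-scale sketch checking that the reshape-based mask from this script matches the mask obtained by passing a tuple of axes to np.any. The reduced array size, the random data and the use of -1 in reshape (instead of np.prod) are assumptions made for illustration. It makes no claim about timings.

```python
import numpy as np

rng = np.random.default_rng(0)
arr = rng.random((10, 28, 28, 4))
arr[::2] = 0  # zero out every other sub-array, as in the test script

# Mask via reshape to 2-D, then reduce over the flattened axis.
flat = arr.reshape(arr.shape[0], -1)
mask_reshape = np.any(flat, axis=1)

# Mask via a tuple of axes, with no reshape needed.
mask_tuple = np.any(arr, axis=(1, 2, 3))

same_mask = np.array_equal(mask_reshape, mask_tuple)

# Indexing the 4-D array directly keeps the trailing shape,
# so no reshape back is required afterwards.
clean = arr[mask_tuple]
print(same_mask, clean.shape)  # True (5, 28, 28, 4)
```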
Or, as a one-liner: arr[(arr != 0).any(axis=(1, 2, 3))]