
I have two 3-dimensional numpy arrays A, B (size ~ (1000, 1000, 3), as in image processing) and element-wise operations on them.

The operations, in sequence, are:

import numpy as np
A  = A ** 3
A = np.maximum(A, 0.001)
C = np.divide(B, A)

Since the function performing these 3 commands is the bottleneck of a time-demanding process, I would like to ask whether there is a way to perform all of them with a single access to each element in memory, i.e. at the fastest possible performance.

The only combined forms I could find cover the divide part, e.g. here or here, or this one, which is a special case because of the Einstein sum.

Is there any way to do that accessing each element in the memory one time (thus make it time-efficient) without the need to write a custom ufunc?
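For reference, the closest I got within plain NumPy is using the ufunc `out=` arguments, which at least avoids allocating temporary arrays, though each step is still a separate pass over memory:

```python
import numpy as np

A = np.random.rand(1000, 1000, 3)
B = np.random.rand(1000, 1000, 3)
A0 = A.copy()  # keep the original values around for checking

# In-place variants of the three steps: no temporary arrays are
# allocated, although each step is still one full pass over memory.
np.power(A, 3, out=A)        # A = A ** 3, written back into A
np.maximum(A, 0.001, out=A)  # clamp in place
C = np.divide(B, A)          # the result still needs its own array
```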

  • Datatype permitting, using the third link, do np.einsum('ijk,ijk,ijk->ijk',A,A,A) to simulate A**3? That should be really efficient. Commented Apr 30, 2016 at 18:54

2 Answers


Is there any way to do that accessing each element in the memory one time (thus make it time-efficient) without the need to write a custom ufunc?

Yes, this is exactly what numexpr was designed for.

import numpy as np
import numexpr as ne

def func1(A, B):
    A = A ** 3
    A = np.maximum(A, 0.001)
    return np.divide(B, A)

def func2(A, B):
    return ne.evaluate("B / where(A**3 > 0.001, A**3, 0.001)",
                       local_dict={'A':A,'B':B})

A, B = np.random.randn(2, 1000, 1000, 3)

print(np.allclose(func1(A, B), func2(A, B)))
# True

numexpr gives about a factor of 70 improvement over your original code:

In [1]: %%timeit A, B = np.random.randn(2, 1000, 1000, 3)
func1(A, B)
   ....: 
1 loop, best of 3: 837 ms per loop

In [2]: %%timeit A, B = np.random.randn(2, 1000, 1000, 3)
func2(A, B)
   ....: 
The slowest run took 8.87 times longer than the fastest. This could mean that an
intermediate result is being cached.
100 loops, best of 3: 11.5 ms per loop

In part this is because numexpr will use multiple threads for the computation by default, but even with a single thread it still crushes naive vectorization:

In [3]: ne.set_num_threads(1)
Out[3]: 8

In [4]: %%timeit A, B = np.random.randn(2, 1000, 1000, 3)
func2(A, B)
   ....: 
10 loops, best of 3: 47.3 ms per loop
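If `func2` gets called repeatedly (e.g. per frame), you can also shave off the output allocation with numexpr's `out=` argument, writing into a buffer allocated once. A minimal sketch, assuming same-shaped arrays on every call:

```python
import numpy as np
import numexpr as ne

A = np.random.rand(1000, 1000, 3)
B = np.random.rand(1000, 1000, 3)
C = np.empty_like(A)  # preallocated once, reused on every call

# Evaluate the fused expression directly into the existing buffer,
# avoiding a fresh output allocation per call.
ne.evaluate("B / where(A**3 > 0.001, A**3, 0.001)", out=C)
```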

2 Comments

That speedup is just huge! Confirmed similar figures at my end.
Quite impressive indeed. Thanks!

Frankly, the numexpr one-liner listed in @ali_m's solution looks like the way to go, given the associated speedup numbers. Listed in this post is an alternative suggestion that keeps things in NumPy.

Let's time those instructions one at a time and see if there's any bottleneck -

In [108]: # Random input arrays
     ...: A = np.random.rand(1000,1000,3)
     ...: B = np.random.rand(1000,1000,3)
     ...: 

In [109]: %timeit A**3
1 loops, best of 3: 442 ms per loop

In [110]: A  = A ** 3

In [111]: %timeit np.maximum(A, 0.001)
100 loops, best of 3: 16.4 ms per loop

In [112]: A = np.maximum(A, 0.001)

In [113]: %timeit np.divide(B, A)
10 loops, best of 3: 19.7 ms per loop

So, it seems the power calculation takes up a huge portion of the total runtime.

Let's introduce np.einsum there, but please be aware of the datatypes involved.

In [114]: # Random input arrays
     ...: A = np.random.rand(1000,1000,3)
     ...: B = np.random.rand(1000,1000,3)
     ...: 

In [115]: %timeit A**3
1 loops, best of 3: 442 ms per loop

In [116]: %timeit np.einsum('ijk,ijk,ijk->ijk',A,A,A)
10 loops, best of 3: 28.3 ms per loop

In [117]: np.allclose(A**3,np.einsum('ijk,ijk,ijk->ijk',A,A,A))
Out[117]: True

That's a good speedup there.
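Putting it together, the full pipeline with the einsum replacement would look like this (same three logical steps as the question, only the cube swapped out for the equivalent element-wise product):

```python
import numpy as np

A = np.random.rand(1000, 1000, 3)
B = np.random.rand(1000, 1000, 3)

# A*A*A computed element-wise via einsum, replacing the slower A**3;
# the clamp and the division are unchanged from the question.
A3 = np.einsum('ijk,ijk,ijk->ijk', A, A, A)
C = np.divide(B, np.maximum(A3, 0.001))
```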

