
Despite all the seemingly similar questions and answers, here goes:

I have a fairly large 2D numpy array and would like to process it row by row using multiprocessing. For each row I need to find specific (numeric) values and use them to set values in a second 2D numpy array. A small example (the real use case is an array with approx. 10000x10000 cells):

import numpy as np
inarray = np.array([(1.5,2,3), (4,5.1,6), (2.7, 4.8, 4.3)])
outarray = np.array([(0.0,0.0,0.0), (0.0,0.0,0.0), (0.0,0.0,0.0)])

I would now like to process inarray row by row using multiprocessing, to find all the cells in inarray that are greater than 5 (e.g. inarray[1,1] and inarray[1,2]), and set the cells in outarray whose index locations are one smaller in both dimensions (e.g. outarray[0,0] and outarray[0,1]) to 1.
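To pin down the expected result, here is a plain serial loop that spells out the mapping just described (a sketch of the rule, not an efficient solution):

```python
import numpy as np

inarray = np.array([(1.5, 2, 3), (4, 5.1, 6), (2.7, 4.8, 4.3)])
outarray = np.zeros_like(inarray)

# For every cell of inarray greater than 5, set the cell one row and
# one column earlier in outarray to 1.
for i in range(1, inarray.shape[0]):
    for j in range(1, inarray.shape[1]):
        if inarray[i, j] > 5:
            outarray[i - 1, j - 1] = 1

print(outarray)
# [[1. 1. 0.]
#  [0. 0. 0.]
#  [0. 0. 0.]]
```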

After looking here and here and here I'm sad to say I still don't know how to do it. Help!

  • So if I find an index in helper = inarray[1:,1:], it will be the same index in outarray... right? Commented May 13, 2014 at 22:57

2 Answers


If you can use the latest numpy development version, then you can use multithreading instead of multiprocessing. Since this PR was merged a couple of months ago, numpy releases the GIL when indexing, so you can do something like:

import numpy as np
import threading

def target(in_, out):
    # Boolean-mask assignment; the threshold .5 matches the random
    # [0, 1) benchmark data below -- use 5 for the question's data.
    out[in_ > .5] = 1

def multi_threaded(a, thread_count=3):
    b = np.zeros_like(a)
    chunk = len(a) // thread_count
    threads = []
    for j in range(thread_count):
        sl_a = slice(1 + chunk*j,
                     a.shape[0] if j == thread_count-1 else 1 + chunk*(j+1),
                     None)
        sl_b = slice(sl_a.start-1, sl_a.stop-1, None)
        threads.append(threading.Thread(target=target, args=(a[sl_a, 1:],
                                                             b[sl_b, :-1])))
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return b

And now do things like:

In [32]: a = np.random.rand(100, 100000)

In [33]: %timeit multi_threaded(a, 1)
1 loops, best of 3: 121 ms per loop

In [34]: %timeit multi_threaded(a, 2)
10 loops, best of 3: 86.6 ms per loop

In [35]: %timeit multi_threaded(a, 3)
10 loops, best of 3: 79.4 ms per loop
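If you are stuck on an older numpy that holds the GIL and really need separate processes, the output array can live in shared memory so that workers write results directly instead of pickling arrays back to the parent. A sketch mirroring the chunking above (`worker` and `multi_process` are hypothetical helpers, not library functions; the input array is still pickled to each child, which may matter at 10000x10000):

```python
import multiprocessing as mp
import numpy as np

def worker(shared, shape, a, row_start, row_stop):
    # Re-wrap the shared buffer as a numpy array inside the child process;
    # writes go straight into the parent's memory.
    b = np.frombuffer(shared.get_obj()).reshape(shape)
    mask = a[row_start:row_stop, 1:] > .5
    b[row_start - 1:row_stop - 1, :-1][mask] = 1

def multi_process(a, process_count=2):
    shared = mp.Array('d', a.size)  # doubles, zero-initialised
    chunk = len(a) // process_count
    procs = []
    for j in range(process_count):
        start = 1 + chunk * j
        stop = a.shape[0] if j == process_count - 1 else 1 + chunk * (j + 1)
        p = mp.Process(target=worker, args=(shared, a.shape, a, start, stop))
        procs.append(p)
        p.start()
    for p in procs:
        p.join()
    return np.frombuffer(shared.get_obj()).reshape(a.shape)
```

Whether this beats the threaded version depends on how much the pickling and process start-up cost eats into the parallel speedup.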



I don't think multiprocessing is the right call here, because you want multiple processes to modify a single object, and that is not a good idea. Finding the indexes via multiple processes would be nice, but in order to send the data to another process, the object is pickled internally (again: as far as I know).

Please try this and tell us if it is very slow:

# the (2, 2) boolean mask must index a matching (2, 2) corner of outarray
outarray[:-1, :-1][inarray[1:, 1:] > 5] = 1
outarray

array([[ 1.,  1.,  0.],
       [ 0.,  0.,  0.],
       [ 0.,  0.,  0.]])

