python parallel processing

Question

I am new to Python. I have 2000 files each about 100 MB. I have to read each of them and merge them into a big matrix (or table). Can I use parallel processing for this so that I can save some time? If yes, how? I tried searching and things seem very complicated. Currently, it takes about 8 hours to get this done serially. We have a really big server with one Tera Byte RAM and few hundred processors. How can I efficiently make use of this?

Thank you for your help.

Whether you can parallelize this depends on whether the merging phase is CPU-bound or I/O-bound, in addition to your hardware. — Fred Foo
– Fred Foo, Commented Nov 16, 2011 at 18:58
Just saying, Python probably isn't the best tool for this. Maybe there's more information that makes it a better fit, but something like C or C++ will probably be easier and more efficient and processing large amounts of data. — Kris Harper
– Kris Harper, Commented Nov 16, 2011 at 18:59
Is the merge operation something trivial/segmented, like concatenation? Or something like a reads/op/write (sum, e.g.)? — Brian Cain
– Brian Cain, Commented Nov 16, 2011 at 19:00
@root45: It might not be as bad as you think. The asker talks about matrices, so it may be possible to use something like numpy to do the grunt work in compiled code, while controlling the process from Python. — Thomas K
– Thomas K, Commented Nov 16, 2011 at 19:01

Raymond Hettinger · Accepted Answer · 2011-11-16 19:10:24Z

1

You make be able to preprocess the files in separate processes using the subprocess module; however, if the final table is kept in memory, then that process will end up being you bottleneck.

There is another possible approach using shared memory with mmap objects. Each subprocess can be responsible for loading the files into a subsection of the mapped memory.

answered Nov 16, 2011 at 19:10

Raymond Hettinger

229k67 gold badges405 silver badges504 bronze badges

Sign up to request clarification or add additional context in comments.

Collectives™ on Stack Overflow

python parallel processing

1 Answer 1

Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

Comments

Your Answer

Sign up or log in

Post as a guest

Related