I have a sparse array that seems to be too large to handle effectively in memory (2000x2500000, float). I can build it as a scipy sparse lil_array, but if I try to convert it to a column- or row-compressed sparse array (A.tocsc(), A.tocsr()) my machine runs out of memory. There is also a serious mismatch between the data in a text file (4.4G) and the pickled lil array (12G); it would be nice to have an on-disk format that more closely approximates the raw data size.
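For reference, the workflow looks roughly like this (a minimal sketch: the shape matches the real data, but the fill loop is just dummy values standing in for reading the text file):

```
import numpy as np
import scipy.sparse as sp

# lil format is convenient for incremental, row-by-row assignment
A = sp.lil_array((2000, 2500000), dtype=np.float64)

# dummy fill; the real code reads values out of the 4.4G text file
rng = np.random.default_rng(0)
for i in range(2000):
    for j in rng.integers(0, 2500000, size=100):
        A[i, j] = rng.random()

# this conversion is where the machine runs out of memory on the real data
B = A.tocsr()   # or A.tocsc()
```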
I will probably be handling even larger arrays in the future.
Question: What's the best way to handle large on-disk arrays so that I can use the regular numpy functions transparently? For instance, sums along rows and columns, vector products, max, min, slicing, etc.
Is pytables the way to go? Is there a good (fast) SQL-to-numpy middleware layer? A secret on-disk array built into numpy?
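To make "transparent" concrete, this is the kind of usage I'm imagining (a hedged sketch with PyTables, using made-up file/node names and placeholder data; I don't know if this is the idiomatic way to do it):

```
import numpy as np
import tables

# hypothetical file and node names; chunked, compressed on-disk storage
with tables.open_file("big.h5", mode="w") as h5:
    atom = tables.Float64Atom()
    filters = tables.Filters(complevel=5, complib="blosc")
    A = h5.create_carray(h5.root, "A", atom, (2000, 2500000),
                         filters=filters)

    # write in blocks of rows rather than all at once
    for i in range(0, 2000, 10):
        A[i:i + 10, :] = np.zeros((10, 2500000))  # placeholder data

    # reductions done block-wise, without loading the whole array
    col_sums = np.zeros(2500000)
    for i in range(0, 2000, 10):
        col_sums += A[i:i + 10, :].sum(axis=0)
```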
In the past, with (slightly smaller) arrays, I've always just pickle-cached the results of long calculations to disk. That works when the arrays end up being < 4G or so, but it is no longer tenable.
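For completeness, the current caching is just along these lines (a trivial sketch; paths are made up):

```
import pickle

def cache(obj, path):
    # dump a computed result to disk so it doesn't have to be recomputed
    with open(path, "wb") as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)

def load(path):
    with open(path, "rb") as f:
        return pickle.load(f)

# e.g. cache(A, "lil_cache.pkl") -- fine for smaller results, not for these
```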