I have a sparse array that seems to be too large to handle effectively in memory (2000x2500000, float). I can build it as a scipy sparse lil_array, but if I try to convert it to a column- or row-compressed sparse array (A.tocsc(), A.tocsr()) my machine runs out of memory. There is also a serious mismatch between the data in a text file (4.4G) and the pickled lil array (12G); it would be nice to have an on-disk format that more closely approximates the raw data size.
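For reference, the workflow looks roughly like this (a minimal sketch: the shape matches the real data, but the fill loop is just dummy values standing in for reading the text file):

```
import numpy as np
import scipy.sparse as sp

# lil format is convenient for incremental, row-by-row assignment
A = sp.lil_array((2000, 2500000), dtype=np.float64)

# dummy fill; the real code reads values out of the 4.4G text file
rng = np.random.default_rng(0)
for i in range(2000):
    for j in rng.integers(0, 2500000, size=100):
        A[i, j] = rng.random()

# this conversion is where the machine runs out of memory on the real data
B = A.tocsr()   # or A.tocsc()
```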
I will probably be handling even larger arrays in the future.
Question: What's the best way to handle large on-disk arrays so that I can use the regular numpy functions transparently? For instance, sums along rows and columns, vector products, max, min, slicing, etc.
Is pytables the way to go? Is there a good (fast) SQL-to-numpy middleware layer? A secret on-disk array built into numpy?
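To make "transparent" concrete, this is the kind of usage I'm imagining (a hedged sketch with PyTables, using made-up file/node names and placeholder data; I don't know if this is the idiomatic way to do it):

```
import numpy as np
import tables

# hypothetical file and node names; chunked, compressed on-disk storage
with tables.open_file("big.h5", mode="w") as h5:
    atom = tables.Float64Atom()
    filters = tables.Filters(complevel=5, complib="blosc")
    A = h5.create_carray(h5.root, "A", atom, (2000, 2500000),
                         filters=filters)

    # write in blocks of rows rather than all at once
    for i in range(0, 2000, 10):
        A[i:i + 10, :] = np.zeros((10, 2500000))  # placeholder data

    # reductions done block-wise, without loading the whole array
    col_sums = np.zeros(2500000)
    for i in range(0, 2000, 10):
        col_sums += A[i:i + 10, :].sum(axis=0)
```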
In the past, with (slightly smaller) arrays, I've always just pickle-cached the results of long calculations to disk. That works when the arrays end up being < 4G or so, but it is no longer tenable.
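For completeness, the current caching is just along these lines (a trivial sketch; paths are made up):

```
import pickle

def cache(obj, path):
    # dump a computed result to disk so it doesn't have to be recomputed
    with open(path, "wb") as f:
        pickle.dump(obj, f, protocol=pickle.HIGHEST_PROTOCOL)

def load(path):
    with open(path, "rb") as f:
        return pickle.load(f)

# e.g. cache(A, "lil_cache.pkl") -- fine for smaller results, not for these
```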