Storing large amount of boolean data in python

Question

I need to store sparse matrix data. Size of data is 10^6 10^4 columns. In each column I store a vector of 0's except of few values where it's true.

Then I need to sum over columns in each matrix, and multiply each row with a scalar. I tried dictionaries, but they fail when I need to sum, and multiply.

What would You use?

PS. numpy.zeros is too small

Have you taken a look at scipy.sparse? docs.scipy.org/doc/scipy/reference/sparse.html — Joe Kington
– Joe Kington, Commented Feb 11, 2012 at 19:10
Are you only storing this data or processing it as a matrix? Are you just looking for the smaller memory footprint or something else? Do you want it to look like a list of lists? — dawg
– dawg, Commented Feb 11, 2012 at 22:15

Tim Pietzcker · Accepted Answer · 2012-02-12 06:48:24Z

4

How about two dicts? Assuming this is the matrix (x for True):

   0  1  2  3  4  5  6  7
0  x     x        x 
1     x
2                       x
3              x
4
5
6        x        x
7

You'd only need to store

rows = {0: [0, 2, 5], 1: [1], 2: [7], 3: [4], 6: [2, 5]}

You could easily transform this into

columns = {0: [0], 1: [1], 2: [0, 6], 4: [3], 5: [0, 6], 7: [2]}

using something like

columns = {}
for row in rows:
    for column in rows[row]:
        columns.setdefault(column, []).append(row)

and then sum over columns (sum(1 for x in column[2])) or over rows and multiply the result with whatever you want.

edited Feb 12, 2012 at 6:48

answered Feb 11, 2012 at 18:58

Tim Pietzcker

337k59 gold badges520 silver badges572 bronze badges

Sign up to request clarification or add additional context in comments.

8 Comments

Gareth Latty Over a year ago

Why bother with dicts when you are using indices as keys? At that point, just have lists.

Tim Pietzcker Over a year ago

@Lattyware: Because it's a sparse matrix. Imagine a million rows and 10.000 columns as per the OP's specifications (and only a hundred points in the matrix being occupied).

jsbueno Over a year ago

I think it is ok to keep the data actually stored in two dicts, but I'd say the "way to do it" is to build a class around them so you can get nice methods to perform insertion, reseting and the needed operations.

user688635 Over a year ago

This is inside out. You should be storing a tuple of (row, col) in one dict instead...

Tim Pietzcker Over a year ago

@Colt45: That doesn't make much sense. I could imagine storing (row, col) tuples in a list, and depending on how often insertion, deletion, summing over columns etc. are done, it might outperform my solution, but why in a dict? What would be the keys, what would be the values? I wrote my solution in order to allow quick summing across rows and/or columns. I'd say it depends on the use case.

|

JoshAdel · Accepted Answer · 2012-02-12 00:15:26Z

2

As others have mentioned, you should look at scipy.sparse:

http://docs.scipy.org/doc/scipy/reference/sparse.html

There are a bunch of different formats that are optimized for various sparse operations including scalar multiplication and summation.

For example:

import scipy.sparse
import numpy as np

rows = np.array([1,100,1000])
cols = np.array([100,99,1474])
vals = np.ones_like(rows)

A = scipy.sparse.coo_matrix((vals,(rows,cols)),shape=(int(1E6),int(1E6)),dtype=np.bool)

Then to multiply by a scalar and take sum:

B = 3*A
B.sum() # 9

answered Feb 12, 2012 at 0:15

JoshAdel

69.1k27 gold badges146 silver badges146 bronze badges

1 Comment

Intelligent-Infrastructure Over a year ago

I didn't check it, however I used sparse matrices in Matlab once. I first checked dictionaries, and it worked fine.

dawg · Accepted Answer · 2012-02-13 15:21:12Z

1

Depending on your needs, there are literally hundreds of methods to do this. The Sparse Matrix entry on Wikipedia is a good start to figure a method that applies specifically to your needs.

As an extremely simple example, you could use a Dictionary of Keys class like this:

class SparseDOK(dict):

    def __init__(self):
        pass

    def __setitem__(self,key,value):
        if value in[0,0.0,False,None]:
            dict.__setitem__(self,key,False)
            dict.__delitem__(self,key)
        else:
            dict.__setitem__(self,key,True)

    def __getitem__(self, key):    
        try: 
            return dict.__getitem__(self, key)

        except KeyError: 
            return False


>>> dok=SparseDOK()
>>> dok[10,20]=55
>>> print dok
{(10, 20): True}
>>> print dok[10,20]
True
>>> print dok[55,300]      
False
>>> dok[10,20]=False
>>> print dok[10,20]
False

Every entry in an arbitrary 'matrix' is assumed to be False, unless specifically set to True. You would need to add error checking, but this will be very compact and fast.

The advantage of constructing a Dictionary of Key is very efficient construction of the data structure. You only need to go through the original data once and you can easily add or delete data. The disadvantage is less interactive processing of the matrix once you have constructed it.

Since the dictionary keys are tuples, it is trivial to add the indices by row or column. Since the entire matrix would need to be processed after construction to do this, we can just construct a dict with whatever sum or product is desired once and then refer to that dict of processed data.

>>> dok[10,20]=True
>>> dok[10,2000]=True
>>> dok[11,2000]=True
>>> dok[35000,2000]=True
>>> dok[10,35000]=True
>>> print dok
{(11, 2000): True, (10, 2000): True, (35000, 2000): True, (10, 20): True, (10, 35000): True}
cols={}
for tup in dok.keys():
    if tup[1] not in cols:
        cols[tup[1]]=1
    else:
        cols[tup[1]]+=1    

>>> print cols
{2000: 3, 35000: 1, 20: 1}

Now you can refer to the col key in cols for the sum of the rows by col. It is trivial to add product, etc. Just remember you need to recalculate the sums / products if the original DOK is edited or changed. You could keep a running total if you anticipate that the DOK would change often after it was first created.

If your needs are more complex, consider using SciPy or Pysparse. As you can see, there are 7 different sparse matrix format in SciPy. Don't reinvent something the others have already done better...

edited Feb 13, 2012 at 15:21

answered Feb 11, 2012 at 23:11

dawg

105k24 gold badges143 silver badges217 bronze badges

6 Comments

Tim Pietzcker Over a year ago

Wouldn't that require to iterate over the entire dict every time you want to sum a single column's or row's values?

dawg Over a year ago

@Tim Pietzcker: Yes, as currently written, you would need to iterate over the entire dict. It would be fairly trivial to add keeping a running total by column to the class which might slow down construction of the data structure. This method of a dictionary of tuples for a general DOK is, however, the same way SciPY does it with a DOK. The OP did not specify whether rapid construction or rapid processing is more important. A DOK has the advantage of rapid construction. Since the OP stated he had only a few trues in potentially millions of falses, this seemed like a fair trade-off.

the wolf Over a year ago

@Tim Pietzcker: Not EVERY time no. The cols dict is being calculated as a dict of sums by col, so you would only have to iterate over the sparse matrix of booleans ONCE after it was constructed. Then just refer to the cols dict for all the sums by col. You would need calc cols again if the original data changes I suppose, but that what not specified by the OP.

Intelligent-Infrastructure Over a year ago

Which way in your opinion would be faster: creating a dictionary, or scipy.sparse?

dawg Over a year ago

@Intelligent-Infrastructure: You need to define faster? Faster to write? Faster to construct the matrix once? Interactively? Faster to process the matrix once? Interactively? Faster to debug? Faster to install on other's machines? If you only construct the DOK once and then build the derived dict of sums and products once, the DOK is straightforward, easy to debug, and probably very fast. SciPy is extremely fast, very bug free and many parts are written in C. SciPy has a learning curve, it has to be on the target and it has some overhead. I would need more info to tell you for sure. Test it.

|

Collectives™ on Stack Overflow

Storing large amount of boolean data in python

3 Answers 3

8 Comments

1 Comment

6 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

8 Comments

1 Comment

6 Comments

Your Answer

Sign up or log in

Post as a guest

Related