4

I need to store sparse matrix data. Size of data is 10^6 10^4 columns. In each column I store a vector of 0's except of few values where it's true.

Then I need to sum over columns in each matrix, and multiply each row with a scalar. I tried dictionaries, but they fail when I need to sum, and multiply.

What would You use?

PS. numpy.zeros is too small

2
  • 4
    Have you taken a look at scipy.sparse? docs.scipy.org/doc/scipy/reference/sparse.html Commented Feb 11, 2012 at 19:10
  • 1
    Are you only storing this data or processing it as a matrix? Are you just looking for the smaller memory footprint or something else? Do you want it to look like a list of lists? Commented Feb 11, 2012 at 22:15

3 Answers 3

4

How about two dicts? Assuming this is the matrix (x for True):

   0  1  2  3  4  5  6  7
0  x     x        x 
1     x
2                       x
3              x
4
5
6        x        x
7

You'd only need to store

rows = {0: [0, 2, 5], 1: [1], 2: [7], 3: [4], 6: [2, 5]}

You could easily transform this into

columns = {0: [0], 1: [1], 2: [0, 6], 4: [3], 5: [0, 6], 7: [2]}

using something like

columns = {}
for row in rows:
    for column in rows[row]:
        columns.setdefault(column, []).append(row)

and then sum over columns (sum(1 for x in column[2])) or over rows and multiply the result with whatever you want.

Sign up to request clarification or add additional context in comments.

8 Comments

Why bother with dicts when you are using indices as keys? At that point, just have lists.
@Lattyware: Because it's a sparse matrix. Imagine a million rows and 10.000 columns as per the OP's specifications (and only a hundred points in the matrix being occupied).
I think it is ok to keep the data actually stored in two dicts, but I'd say the "way to do it" is to build a class around them so you can get nice methods to perform insertion, reseting and the needed operations.
This is inside out. You should be storing a tuple of (row, col) in one dict instead...
@Colt45: That doesn't make much sense. I could imagine storing (row, col) tuples in a list, and depending on how often insertion, deletion, summing over columns etc. are done, it might outperform my solution, but why in a dict? What would be the keys, what would be the values? I wrote my solution in order to allow quick summing across rows and/or columns. I'd say it depends on the use case.
|
2

As others have mentioned, you should look at scipy.sparse:

http://docs.scipy.org/doc/scipy/reference/sparse.html

There are a bunch of different formats that are optimized for various sparse operations including scalar multiplication and summation.

For example:

import scipy.sparse
import numpy as np

rows = np.array([1,100,1000])
cols = np.array([100,99,1474])
vals = np.ones_like(rows)

A = scipy.sparse.coo_matrix((vals,(rows,cols)),shape=(int(1E6),int(1E6)),dtype=np.bool)

Then to multiply by a scalar and take sum:

B = 3*A
B.sum() # 9

1 Comment

I didn't check it, however I used sparse matrices in Matlab once. I first checked dictionaries, and it worked fine.
1

Depending on your needs, there are literally hundreds of methods to do this. The Sparse Matrix entry on Wikipedia is a good start to figure a method that applies specifically to your needs.

As an extremely simple example, you could use a Dictionary of Keys class like this:

class SparseDOK(dict):

    def __init__(self):
        pass

    def __setitem__(self,key,value):
        if value in[0,0.0,False,None]:
            dict.__setitem__(self,key,False)
            dict.__delitem__(self,key)
        else:
            dict.__setitem__(self,key,True)

    def __getitem__(self, key):    
        try: 
            return dict.__getitem__(self, key)

        except KeyError: 
            return False


>>> dok=SparseDOK()
>>> dok[10,20]=55
>>> print dok
{(10, 20): True}
>>> print dok[10,20]
True
>>> print dok[55,300]      
False
>>> dok[10,20]=False
>>> print dok[10,20]
False

Every entry in an arbitrary 'matrix' is assumed to be False, unless specifically set to True. You would need to add error checking, but this will be very compact and fast.

The advantage of constructing a Dictionary of Key is very efficient construction of the data structure. You only need to go through the original data once and you can easily add or delete data. The disadvantage is less interactive processing of the matrix once you have constructed it.

Since the dictionary keys are tuples, it is trivial to add the indices by row or column. Since the entire matrix would need to be processed after construction to do this, we can just construct a dict with whatever sum or product is desired once and then refer to that dict of processed data.

>>> dok[10,20]=True
>>> dok[10,2000]=True
>>> dok[11,2000]=True
>>> dok[35000,2000]=True
>>> dok[10,35000]=True
>>> print dok
{(11, 2000): True, (10, 2000): True, (35000, 2000): True, (10, 20): True, (10, 35000): True}
cols={}
for tup in dok.keys():
    if tup[1] not in cols:
        cols[tup[1]]=1
    else:
        cols[tup[1]]+=1    

>>> print cols
{2000: 3, 35000: 1, 20: 1}

Now you can refer to the col key in cols for the sum of the rows by col. It is trivial to add product, etc. Just remember you need to recalculate the sums / products if the original DOK is edited or changed. You could keep a running total if you anticipate that the DOK would change often after it was first created.

If your needs are more complex, consider using SciPy or Pysparse. As you can see, there are 7 different sparse matrix format in SciPy. Don't reinvent something the others have already done better...

6 Comments

Wouldn't that require to iterate over the entire dict every time you want to sum a single column's or row's values?
@Tim Pietzcker: Yes, as currently written, you would need to iterate over the entire dict. It would be fairly trivial to add keeping a running total by column to the class which might slow down construction of the data structure. This method of a dictionary of tuples for a general DOK is, however, the same way SciPY does it with a DOK. The OP did not specify whether rapid construction or rapid processing is more important. A DOK has the advantage of rapid construction. Since the OP stated he had only a few trues in potentially millions of falses, this seemed like a fair trade-off.
@Tim Pietzcker: Not EVERY time no. The cols dict is being calculated as a dict of sums by col, so you would only have to iterate over the sparse matrix of booleans ONCE after it was constructed. Then just refer to the cols dict for all the sums by col. You would need calc cols again if the original data changes I suppose, but that what not specified by the OP.
Which way in your opinion would be faster: creating a dictionary, or scipy.sparse?
@Intelligent-Infrastructure: You need to define faster? Faster to write? Faster to construct the matrix once? Interactively? Faster to process the matrix once? Interactively? Faster to debug? Faster to install on other's machines? If you only construct the DOK once and then build the derived dict of sums and products once, the DOK is straightforward, easy to debug, and probably very fast. SciPy is extremely fast, very bug free and many parts are written in C. SciPy has a learning curve, it has to be on the target and it has some overhead. I would need more info to tell you for sure. Test it.
|

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.