
I have a sparse matrix and I'm trying to add a sparse vector to it. I've tried different sparse formats (csr, csc, lil, coo) and different ways of appending the sparse vector to the sparse matrix, including vstack and concatenate.

All ways and formats turned out to be very slow. But when I convert the vector to dense format (by todense()) and append it to a dense matrix (a numpy.ndarray, specifically), it is done very quickly. Why is that? Is there a trick or a suitable format for this that I'm missing?

Here is my code for when I tried it with 'coo' format:

from scipy.sparse import coo_matrix, rand
from time import time as timer
from numpy import array, concatenate, empty

### sparse appending in coo way ####
def sparse_append(A):
    dim = A.shape[1]
    mat = coo_matrix((0, dim))

    sparse_addtime = 0

    for vector in A:
        st = timer() 

        row = coo_matrix(vector)
        newdata = concatenate((mat.data, row.data))
        newrows = concatenate((mat.row, row.row + mat.shape[0]))
        newcols = concatenate((mat.col, row.col))

        mat = coo_matrix((newdata, (newrows, newcols)), shape = ((mat.shape)[0]+1, (mat.shape)[1]))

        et = timer() 
        sparse_addtime += et-st

    return sparse_addtime

#### dense append ####
def dense_append(A):
    dim = A.shape[1]
    mat = empty([0,dim])

    dense_addtime = 0

    for vector in A:
        st = timer()
        mat = concatenate((mat,vector))
        et = timer()
        dense_addtime += et-st

    return dense_addtime



### main ####
if __name__ == '__main__':
    dim = 400
    n = 200

    A = rand(n, dim, density = 0.1, format='lil')
    B = A.todense() #numpy.ndarray

    t1 = sparse_append(A)
    t2 = dense_append(B)

    print t1, t2

Any help is appreciated.

2 Comments

  • Have you experimented with scipy.sparse.hstack and scipy.sparse.vstack? Commented Aug 21, 2015 at 0:23
  • Yes, I've tried them too; they are slow as well compared to appending a dense vector to a numpy.ndarray. Commented Aug 21, 2015 at 14:37
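For reference, the scipy.sparse.vstack approach mentioned in the comments would look something like the sketch below (the dimensions are taken from the question; the loop structure is an assumption about how it was tried). It illustrates why repeated appends are slow: each vstack call builds a brand-new matrix.

```python
from scipy.sparse import coo_matrix, vstack, rand

# same dimensions and density as in the question
A = rand(200, 400, density=0.1, format='lil')

mat = coo_matrix((0, A.shape[1]))
for vector in A:
    # vstack allocates a new matrix every call and copies all rows
    # accumulated so far, so n appends cost O(n^2) copying overall
    mat = vstack([mat, vector])

print(mat.shape)
```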

1 Answer

The slowest part of your sparse append code is the row conversion:

row = coo_matrix(vector)

This takes roughly 65% of the time when I run it, because the data has to be converted into a different storage format. The other slow part is creating the new matrix:

mat = coo_matrix((newdata, (newrows, newcols)), shape = ((mat.shape)[0]+1, (mat.shape)[1]))

This takes a further 30% of the time. Every time you do this, you copy all of the data and allocate a bunch of memory. The most efficient way of adding your rows, especially if they are already in lil format, is to modify the matrix in place. If you know the matrix's dimensions at the start, you can simply create the matrix with the right shape from the beginning; the sparse format is memory efficient, and empty rows cost nothing. Otherwise, you can use set_shape to increase the dimensions each time.
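A minimal sketch of the grow-as-you-go variant. It assumes a scipy version (1.1 or later) where sparse matrices have an in-place resize method; on the older versions from this thread, set_shape played a similar role.

```python
from scipy.sparse import lil_matrix, rand

A = rand(200, 400, density=0.1, format='lil')

# start with zero rows and grow the matrix in place, one row per append;
# resize only updates the dimensions, it does not copy the stored data
mat = lil_matrix((0, A.shape[1]), dtype=A.dtype)
for i in range(A.shape[0]):
    mat.resize((i + 1, A.shape[1]))
    mat[i] = A.getrow(i)

print(mat.shape)
```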

from scipy.sparse import lil_matrix, rand
from time import time as timer
from numpy import array, concatenate, empty

### sparse appending ####
def sparse_append(A):
    dim = A.shape[1]
    mat = lil_matrix(A.shape, dtype = A.dtype)

    sparse_addtime = 0
    i = 0
    for vector in A:
        st = timer()

        mat[i] = vector
        i += 1
        et = timer() 
        sparse_addtime += et-st

    return sparse_addtime



#### dense append ####
def dense_append(A):
    dim = A.shape[1]
    mat = empty([0,dim])

    dense_addtime = 0

    for vector in A:
        st = timer()
        mat = concatenate((mat,vector))
        et = timer()
        dense_addtime += et-st

    return dense_addtime



### main ####
if __name__ == '__main__':
    dim = 400
    n = 200

    A = rand(n, dim, density = 0.1, format='lil')
    B = A.todense() #numpy.ndarray

    t1 = sparse_append(A)
    t2 = dense_append(B)

    print t1, t2

Running the code like this, I get slightly better times from the sparse addition.
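The row-conversion cost quoted above (roughly 65% of the loop) can be checked in isolation with timeit; a rough sketch, with absolute numbers varying by numpy/scipy version:

```python
from timeit import timeit
from scipy.sparse import coo_matrix, rand

A = rand(200, 400, density=0.1, format='lil')
row = A.getrow(0)

# cost of the lil -> coo conversion performed once per appended row
conv = timeit(lambda: coo_matrix(row), number=1000)
print('1000 row conversions: %.4fs' % conv)
```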


6 Comments

That's right, using lil_matrix and allocating the memory beforehand helps get a better runtime, but it is still much slower than appending a dense vector to a numpy.ndarray. What I need is to beat the dense appending runtime. I'm working with very sparse matrices, so I'm guessing this should be doable, unless the scipy implementation of sparse arrays is inefficient?
Running the code above, I get 0.029s for the sparse version and 0.032s for the dense one. However, the dense code can also be optimized by preallocating the memory once and updating rows in place; in that case, it cuts down to 0.0008s. With lil_matrix, you are appending 200 rows to a linked list. This is much slower than copying a 400-element array of 64-bit values into a preallocated block of memory 200 times. The sparse matrix is useful because it has a smaller memory footprint, and because certain matrix operations can be made faster using sparse implementations.
I ran the code you have written above, and for sparse it gave me 0.32 while for dense it gave 0.022! Are you sure you are getting 0.029 for sparse, and not 0.29?
Interestingly, I get very different results with different versions of scipy/numpy. Numpy 1.8.1 / Scipy 0.13.3: 0.46, 0.031; Numpy 1.9.0 / Scipy 0.14.0: 0.043, 0.028; Numpy 1.9.2 / Scipy 0.14.1: 0.029, 0.031. All of these are with Python 2.7.10.
With Numpy 1.9.2 / Scipy 0.16.0: 0.031, 0.022. So it seems that somewhere along the upgrades, the sparse implementation was improved by a factor of 10, and the dense one by a third.