
I am building n-grams from multiple text documents using scikit-learn. I need to build document frequencies using CountVectorizer.

Example:

document1 = "john is a nice guy"

document2 = "person can be a guy"

So, the document frequencies will be:

{'be': 1,
 'can': 1,
 'guy': 2,
 'is': 1,
 'john': 1,
 'nice': 1,
 'person': 1}
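
For reference, a minimal sketch of how these document frequencies can be computed (binary=True counts each term at most once per document, so the column sums become document frequencies; note that CountVectorizer's default tokenizer drops one-character tokens like 'a'):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["john is a nice guy", "person can be a guy"]
vectorizer = CountVectorizer(binary=True)  # presence/absence per document
X = vectorizer.fit_transform(docs)
# column sums of the binary matrix = number of documents containing each term
doc_freq = dict(zip(vectorizer.get_feature_names(), X.sum(axis=0).A1))
print(doc_freq)  # {'be': 1, 'can': 1, 'guy': 2, 'is': 1, 'john': 1, 'nice': 1, 'person': 1}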

Here the documents are just short strings, but when I tried it with a huge amount of data, it throws a MemoryError.

Code:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
document = [Huge amount of data around 7MB] # ['john is a guy', 'person guy']
vectorizer = CountVectorizer(ngram_range=(1, 5))
X = vectorizer.fit_transform(document).todense()
transformer = vectorizer.transform(document).todense()
matrix_terms = np.array(vectorizer.get_feature_names())
lst_freq = map(sum, zip(*transformer.A))
matrix_freq = np.array(lst_freq)
final_matrix = np.array([matrix_terms,matrix_freq])

ERROR:

Traceback (most recent call last):
  File "demo1.py", line 13, in build_ngrams_matrix
    X = vectorizer.fit_transform(document).todense()
  File "/usr/local/lib/python2.7/dist-packages/scipy/sparse/base.py", line 605, in todense
    return np.asmatrix(self.toarray(order=order, out=out))
  File "/usr/local/lib/python2.7/dist-packages/scipy/sparse/compressed.py", line 901, in toarray
    return self.tocoo(copy=False).toarray(order=order, out=out)
  File "/usr/local/lib/python2.7/dist-packages/scipy/sparse/coo.py", line 269, in toarray
    B = self._process_toarray_args(order, out)
  File "/usr/local/lib/python2.7/dist-packages/scipy/sparse/base.py", line 789, in _process_toarray_args
    return np.zeros(self.shape, dtype=self.dtype, order=order)
MemoryError
  • Have you checked stackoverflow.com/questions/16332083/… or stackoverflow.com/questions/23879139/…? Commented Nov 12, 2014 at 13:24
  • I think todense() is what's generating the MemoryError, but without todense() the output is a sparse matrix, and I don't know how to read that sparse matrix. Any help, please? Commented Nov 12, 2014 at 14:11
  • If you really want to look at the sparse matrix, you can look at a small chunk of it (e.g. the first 10 rows) like so: X[:10,:].todense(). Most other operations, such as summation, work the same way for sparse and dense matrices, so you don't really need to call todense/A/toarray; see the sketch below. Commented Nov 12, 2014 at 15:51
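
To illustrate that comment, a minimal sketch (assuming X is the sparse matrix returned by fit_transform):

# Densify only a small slice for inspection
print(X[:10, :].todense())

# Reductions work directly on the sparse matrix; no full densification needed
term_counts = X.sum(axis=0)  # 1 x n_features numpy matrix of term counts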

1 Answer


As the comments have mentioned, you're running into memory issues when you convert the large sparse matrices to dense format. Try something like this:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
document = [Huge amount of data around 7MB] # ['john is a guy', 'person guy']
vectorizer = CountVectorizer(ngram_range=(1, 5))

# Don't need both X and transformer; they should be identical
X = vectorizer.fit_transform(document)
matrix_terms = np.array(vectorizer.get_feature_names())

# Use the axis keyword to sum over rows
matrix_freq = np.asarray(X.sum(axis=0)).ravel()
final_matrix = np.array([matrix_terms,matrix_freq])
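
A quick usage sketch (an illustrative addition, assuming the variables above): matrix_freq is a plain integer array, so you can, for example, list the ten most frequent n-grams without ever densifying X:

# Indices of the ten most frequent n-grams
top = matrix_freq.argsort()[::-1][:10]
for i in top:
    print("%s: %d" % (matrix_terms[i], matrix_freq[i]))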

EDIT: If you want a dictionary from term to frequency, try this after calling fit_transform:

terms = vectorizer.get_feature_names()
freqs = X.sum(axis=0).A1
result = dict(zip(terms, freqs))
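
An alternative sketch (a suggestion, not part of the original answer) that skips get_feature_names() by reading the fitted vocabulary_ mapping (term to column index) directly:

freqs = X.sum(axis=0).A1
result = {term: freqs[idx] for term, idx in vectorizer.vocabulary_.items()}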

3 Comments

Thanks @perimosocordiae, for suggesting I use just fit_transform(), since both were identical.
@perimosocordiae: Need more help. Your final_matrix is a matrix, and now I want to convert it into a dictionary quickly. I used dict(zip(final_matrix[0], final_matrix[1])), but it takes several seconds. Is there a faster way to convert the matrix into a dictionary?
I'm not sure if it will be much faster, but I've updated my answer to show how to make the dictionary you want.
