
I am building n-grams from multiple text documents using scikit-learn. I need to build document frequencies using CountVectorizer.

Example:

document1 = "john is a nice guy"

document2 = "person can be a guy"

So, the document frequencies will be:

{'be': 1,
 'can': 1,
 'guy': 2,
 'is': 1,
 'john': 1,
 'nice': 1,
 'person': 1}
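
For reference, a minimal sketch of how these document frequencies can be computed (binary=True counts each term at most once per document, so the column sums become document frequencies; note that CountVectorizer's default tokenizer drops one-character tokens like 'a'):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["john is a nice guy", "person can be a guy"]
vectorizer = CountVectorizer(binary=True)  # presence/absence per document
X = vectorizer.fit_transform(docs)
# column sums of the binary matrix = number of documents containing each term
doc_freq = dict(zip(vectorizer.get_feature_names(), X.sum(axis=0).A1))
print(doc_freq)  # {'be': 1, 'can': 1, 'guy': 2, 'is': 1, 'john': 1, 'nice': 1, 'person': 1}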

Here the documents are just short strings, but when I tried it with a huge amount of data, it throws a MemoryError.

Code:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
document = [Huge amount of data around 7MB] # ['john is a guy', 'person guy']
vectorizer = CountVectorizer(ngram_range=(1, 5))
X = vectorizer.fit_transform(document).todense()
transformer = vectorizer.transform(document).todense()
matrix_terms = np.array(vectorizer.get_feature_names())
lst_freq = map(sum, zip(*transformer.A))
matrix_freq = np.array(lst_freq)
final_matrix = np.array([matrix_terms,matrix_freq])

ERROR:

Traceback (most recent call last):
  File "demo1.py", line 13, in build_ngrams_matrix
    X = vectorizer.fit_transform(document).todense()
  File "/usr/local/lib/python2.7/dist-packages/scipy/sparse/base.py", line 605, in todense
    return np.asmatrix(self.toarray(order=order, out=out))
  File "/usr/local/lib/python2.7/dist-packages/scipy/sparse/compressed.py", line 901, in toarray
    return self.tocoo(copy=False).toarray(order=order, out=out)
  File "/usr/local/lib/python2.7/dist-packages/scipy/sparse/coo.py", line 269, in toarray
    B = self._process_toarray_args(order, out)
  File "/usr/local/lib/python2.7/dist-packages/scipy/sparse/base.py", line 789, in _process_toarray_args
    return np.zeros(self.shape, dtype=self.dtype, order=order)
MemoryError
  • Have you checked stackoverflow.com/questions/16332083/… or stackoverflow.com/questions/23879139/…? Commented Nov 12, 2014 at 13:24
  • I think todense() is what's generating the MemoryError, but without todense() the output is a sparse matrix, and I don't know how to read that sparse matrix. Any help, please? Commented Nov 12, 2014 at 14:11
  • If you really want to look at the sparse matrix, you can look at a small chunk of it (e.g. the first 10 rows) like so: X[:10,:].todense(). Most other operations, such as summation, work the same way for sparse and dense matrices, so you don't really need to call todense/A/toarray; see the sketch below. Commented Nov 12, 2014 at 15:51
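
To illustrate that comment, a minimal sketch (assuming X is the sparse matrix returned by fit_transform):

# Densify only a small slice for inspection
print(X[:10, :].todense())

# Reductions work directly on the sparse matrix; no full densification needed
term_counts = X.sum(axis=0)  # 1 x n_features numpy matrix of term counts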

1 Answer


As the comments have mentioned, you're running into memory issues when you convert the large sparse matrices to dense format. Try something like this:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
document = [Huge amount of data around 7MB] # ['john is a guy', 'person guy']
vectorizer = CountVectorizer(ngram_range=(1, 5))

# Don't need both X and transformer; they should be identical
X = vectorizer.fit_transform(document)
matrix_terms = np.array(vectorizer.get_feature_names())

# Use the axis keyword to sum over rows
matrix_freq = np.asarray(X.sum(axis=0)).ravel()
final_matrix = np.array([matrix_terms,matrix_freq])
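
A quick usage sketch (an illustrative addition, assuming the variables above): matrix_freq is a plain integer array, so you can, for example, list the ten most frequent n-grams without ever densifying X:

# Indices of the ten most frequent n-grams
top = matrix_freq.argsort()[::-1][:10]
for i in top:
    print("%s: %d" % (matrix_terms[i], matrix_freq[i]))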

EDIT: If you want a dictionary from term to frequency, try this after calling fit_transform:

terms = vectorizer.get_feature_names()
freqs = X.sum(axis=0).A1
result = dict(zip(terms, freqs))
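
An alternative sketch (a suggestion, not part of the original answer) that skips get_feature_names() by reading the fitted vocabulary_ mapping (term to column index) directly:

freqs = X.sum(axis=0).A1
result = {term: freqs[idx] for term, idx in vectorizer.vocabulary_.items()}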

3 Comments

Thanks @perimosocordiae, for suggesting I use just fit_transform(), since both were identical.
@perimosocordiae: Need more help. Your final_matrix is a matrix, and now I want to convert it into a dictionary quickly. I used dict(zip(final_matrix[0], final_matrix[1])), but it takes several seconds. Is there a faster way to convert the matrix into a dictionary?
I'm not sure if it will be much faster, but I've updated my answer to show how to make the dictionary you want.
