
I'm trying to write a function called "words_in_texts" that returns a result like this:

words_in_texts(['hello', 'bye', 'world'],
               pd.Series(['hello', 'hello world hello']))

array([[1, 0, 0],
       [1, 0, 1]])

I believe the arguments to this function should be a list of the words to look for and a Series of texts.

def words_in_texts(words, texts):
    '''
    Args:
        words (list-like): words to find
        texts (Series): strings to search in

    Returns:
        NumPy array of 0s and 1s with shape (n, p) where n is the
        number of texts and p is the number of words.
    '''
    indicator_array = texts.str.contains(words)

    return indicator_array

I'm confused about how to create the 2D array result. Can anyone please help me with this? Thank you in advance!

1 Answer


Use sklearn.feature_extraction.text.CountVectorizer:

In [52]: from sklearn.feature_extraction.text import CountVectorizer

In [53]: vect = CountVectorizer(vocabulary=['hello', 'bye', 'world'], binary=True)

In [54]: X = vect.fit_transform(pd.Series(['hello', 'hello world hello']))

The result is a sparse matrix:

In [55]: X
Out[55]:
<2x3 sparse matrix of type '<class 'numpy.int64'>'
        with 3 stored elements in Compressed Sparse Row format>

You can convert it to a dense array with toarray() (the X.A shorthand is deprecated in recent SciPy releases):

In [56]: X.toarray()
Out[56]:
array([[1, 0, 0],
       [1, 0, 1]], dtype=int64)
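
The steps so far can be wrapped into the function from the question. This is only a sketch, assuming the default CountVectorizer settings are acceptable: texts are lowercased before matching, and tokens must be at least two alphanumeric characters long, so single-character or mixed-case entries in `words` would never match.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer


def words_in_texts(words, texts):
    """Return an (n_texts, n_words) NumPy array of 0/1 indicators."""
    # binary=True records presence (1) instead of the number of occurrences
    vect = CountVectorizer(vocabulary=list(words), binary=True)
    # The vocabulary is fixed, so fit_transform only counts; toarray() densifies
    return vect.fit_transform(texts).toarray()


result = words_in_texts(['hello', 'bye', 'world'],
                        pd.Series(['hello', 'hello world hello']))
print(result)
```

Column order follows the order of `words`, which is why 'bye' stays in the middle column even though it never occurs.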

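Features (column names); get_feature_names() was removed in scikit-learn 1.2, so use get_feature_names_out() on current versions:

In [57]: vect.get_feature_names_out()
Out[57]: array(['hello', 'bye', 'world'], dtype=object)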
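
If you'd rather not depend on scikit-learn, a pandas/NumPy-only variant is possible too. This is a sketch of a different technique, not the CountVectorizer approach above: it calls `str.contains` once per word, so it does plain substring matching ('hell' would match 'hello') and it is case-sensitive. Treating missing texts as "no match" via `na=False` is an assumption on my part.

```python
import numpy as np
import pandas as pd


def words_in_texts(words, texts):
    """Return an (n_texts, n_words) array of 0/1 substring indicators."""
    # str.contains takes one pattern at a time, so build one boolean column per word.
    # regex=False treats the word literally; na=False maps missing texts to False.
    columns = [texts.str.contains(word, regex=False, na=False).to_numpy()
               for word in words]
    # Put the columns side by side and convert True/False to 1/0
    return np.column_stack(columns).astype(int)


result = words_in_texts(['hello', 'bye', 'world'],
                        pd.Series(['hello', 'hello world hello']))
print(result)
```

This also shows why the original `texts.str.contains(words)` fails: `str.contains` expects a single pattern string, not a list of words.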

1 Comment

Ha ha, my link edit clashed with yours there. Good answer.
