Converting a pandas DataFrame with mixed column types (numerical, ordinal, and categorical) to SciPy sparse arrays is a common problem in machine learning.

If my pandas DataFrame consists of only numerical data, I can convert it to a sparse CSR matrix directly:

scipy.sparse.csr_matrix(df.values)

If my data frame consists of ordinal data types, I can handle them with LabelEncoder:

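from collections import defaultdict
from sklearn.preprocessing import LabelEncoder

# One LabelEncoder per column, created on first access
d = defaultdict(LabelEncoder)
fit = df.apply(lambda x: d[x.name].fit_transform(x))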

Then I can convert the encoded frame the same way, and the problem is solved:

scipy.sparse.csr_matrix(fit.values)

Categorical variables with a low number of values are also not a concern. They can easily be handled with one-hot encoding, using either pd.get_dummies in pandas or the scikit-learn equivalent.
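For the low-cardinality case, a minimal sketch might look like the following. The `colors` column and its values are made up for illustration, and the sketch assumes a pandas version that provides the `DataFrame.sparse` accessor.

```python
import numpy as np
import pandas as pd

# Hypothetical low-cardinality column (not from the original data).
colors = pd.Series(["red", "green", "red", "blue"], name="color")

# sparse=True stores every dummy column as a pandas SparseArray.
dummies = pd.get_dummies(colors, sparse=True, dtype=np.uint8)

# An all-sparse frame can be converted to a SciPy COO matrix, then to CSR.
X_small = dummies.sparse.to_coo().tocsr()
print(X_small.shape)
```

This goes through an intermediate DataFrame, which is fine for a handful of categories but is the step that becomes slow when the number of distinct values is large.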

My main concern is for categorical variables with a large number of values.

The main problem: How to handle categorical variables with a large number of values?

pd.get_dummies(train_set, columns=[categorical_columns_with_large_number_of_values], sparse=True)

takes a lot of time.

This question seems to give interesting directions, but it is not clear whether it handles all the data types efficiently.

Let me know if you know an efficient way. Thanks.

1 Answer

You can convert any single column to a sparse COO array very easily with factorize. This will be MUCH faster than building a giant dense dataframe.

import numpy as np
import pandas as pd
import scipy.sparse

data = pd.DataFrame({"A": ["1", "2", "A", "C", "A"]})

c, u = pd.factorize(data['A'])
n, m = data.shape[0], u.shape[0]

one_hot = scipy.sparse.coo_matrix((np.ones(n, dtype=np.int16), (np.arange(n), c)), shape=(n,m))

You'll get something that looks like this:

>>> one_hot.toarray()
array([[1, 0, 0, 0],
       [0, 1, 0, 0],
       [0, 0, 1, 0],
       [0, 0, 0, 1],
       [0, 0, 1, 0]], dtype=int16)

>>> u
Index(['1', '2', 'A', 'C'], dtype='object')

Rows correspond to your dataframe rows and columns to the factors of your column (u holds the labels for those columns, in order).
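To extend this to several columns, one option is to encode each column separately and stack the blocks side by side with scipy.sparse.hstack. The following is only a sketch under stated assumptions: the helper `one_hot_column`, the example frame, and its column names are hypothetical, and it assumes there are no missing values (pd.factorize assigns the code -1 to NaN, which would be an invalid column index here).

```python
import numpy as np
import pandas as pd
import scipy.sparse


def one_hot_column(series):
    """One-hot encode a single column as a sparse COO matrix (hypothetical helper)."""
    codes, uniques = pd.factorize(series)
    n_rows, n_levels = len(series), len(uniques)
    ones = np.ones(n_rows, dtype=np.int8)
    matrix = scipy.sparse.coo_matrix(
        (ones, (np.arange(n_rows), codes)), shape=(n_rows, n_levels)
    )
    return matrix, uniques


# Hypothetical example frame: two categorical columns and one numeric column.
df = pd.DataFrame({
    "A": ["1", "2", "A", "C", "A"],
    "B": ["x", "y", "x", "x", "z"],
    "num": [0.5, 1.0, 0.0, 2.0, 3.5],
})

# Encode each categorical column on its own, keeping only the matrix.
blocks = [one_hot_column(df[col])[0] for col in ["A", "B"]]

# Numeric columns can be wrapped directly.
numeric = scipy.sparse.csr_matrix(df[["num"]].to_numpy())

# Stack column-wise so every row still describes one sample.
X = scipy.sparse.hstack(blocks + [numeric], format="csr")
print(X.shape)  # (5, 8)
```

Nothing dense is built from the categorical columns, so memory grows with the number of rows rather than rows times categories. Whether a one-hot encoding with millions of levels suits a given model is a separate question.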

4 Comments

How to combine multiple columns in one sparse matrix, then? ML models take in one sparse matrix.
This solution does not work. For my single column with 14350959 unique values and 133267714 rows, it says: MemoryError: Unable to allocate 3.40 PiB for an array with shape (133267714, 14350959) and data type int16. Handling multiple columns is another problem.
You could vstack your encoded arrays but you should very carefully consider what the encoding means and how it works with your ML method. You do not have the memory to make a dense representation of this array and it will fail if you call array.A.
yes, I see your point. But, I think you meant hstack (scipy.sparse.hstack), no?
