I have a data set with two columns of categorical label data (NBA team names). I want to use one-hot encoding to generate a binary 1D array representing each team. Here is my code:

from sklearn.preprocessing import MultiLabelBinarizer
one_hot_encoder = MultiLabelBinarizer()
table["Teams"] = one_hot_encoder.fit_transform(table["Teams"])

The encoder works appropriately, and it generates the arrays accordingly. In other words,

one_hot_encoder.fit_transform(table["Teams"])

generates the following properly:

Link to encoder result screenshot

However, when I try to store the array into the column, as follows:

table["Teams"] = one_hot_encoder.fit_transform(table["Teams"])

It seems like it's not being saved properly.

Link to data frame result screenshot

Instead, it looks as if the column is just taking the first value of each array, and not storing the entire array. How should I go about resolving this?

  • Could you paste your sample data instead of an image? Commented Jul 13, 2018 at 7:21

2 Answers

1

I think you need to convert the 2D array to a list of lists:

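import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# Sample data: one single-character label per row
table = pd.DataFrame({"Teams": list('aaasdffds')})

one_hot_encoder = MultiLabelBinarizer()

# tolist() turns the 2D array into a list of rows, so each cell holds one full vector
table["Teams"] = one_hot_encoder.fit_transform(table["Teams"]).tolist()
print(table)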
          Teams
0  [1, 0, 0, 0]
1  [1, 0, 0, 0]
2  [1, 0, 0, 0]
3  [0, 0, 0, 1]
4  [0, 1, 0, 0]
5  [0, 0, 1, 0]
6  [0, 0, 1, 0]
7  [0, 1, 0, 0]
8  [0, 0, 0, 1]

However, storing arrays or lists in a single column is not recommended, because vectorized methods and functions cannot be used on them. It is better to create a DataFrame with one column per class:

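# Start again from the original label column, not the list-valued one created above
table = pd.DataFrame({"Teams": list('aaasdffds')})
table = pd.DataFrame(one_hot_encoder.fit_transform(table["Teams"]),
                     columns=one_hot_encoder.classes_)
print(table)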

   a  d  f  s
0  1  0  0  0
1  1  0  0  0
2  1  0  0  0
3  0  0  0  1
4  0  1  0  0
5  0  0  1  0
6  0  0  1  0
7  0  1  0  0
8  0  0  0  1
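
One caveat that the single-letter sample hides: `MultiLabelBinarizer` expects each entry to be a collection of labels, so a plain string is iterated character by character, and a full team name would be split into letters. A minimal sketch of one way around this, assuming a hypothetical column of full names (the real data is not shown in the question): `LabelBinarizer` treats each row as one label, and `join` keeps the original column next to the encoded ones.

```python
import pandas as pd
from sklearn.preprocessing import LabelBinarizer

# Hypothetical sample data; the actual team names are not given in the question.
table = pd.DataFrame({"Teams": ["Lakers", "Celtics", "Bulls", "Lakers"]})

encoder = LabelBinarizer()
# One row per record, one column per distinct team (classes are sorted).
encoded = encoder.fit_transform(table["Teams"])

# Reuse the original index so the join lines up row by row.
one_hot = pd.DataFrame(encoded, columns=encoder.classes_, index=table.index)
result = table.join(one_hot)
print(result)
```

Note that with only two distinct labels `LabelBinarizer` returns a single column instead of two, so this sketch assumes at least three teams.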

0

You need a list within your DataFrame. If you store the arrays as a list with one array per row, pandas won't modify them.

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
encoded_array = mlb.fit_transform(table['Teams'])
# One 1D array per row; wrapping each row in an extra list would nest it one level too deep
table['Teams'] = [encoded_array[i, :] for i in range(table.shape[0])]
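
To check that each cell really holds a whole vector, something like the following may help. It is only a sketch: `encoded_array` here is a hand-written stand-in for the encoder output, and the three-row table is made up for illustration.

```python
import numpy as np
import pandas as pd

# Stand-in for the encoder output: one one-hot row per record (assumed data).
encoded_array = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])
table = pd.DataFrame({"Teams": ["a", "b", "c"]})

# list() splits the 2D array into its rows, so each cell receives one 1D array.
table["Teams"] = list(encoded_array)

print(table["Teams"].dtype)   # the column holds generic Python objects
print(table.loc[1, "Teams"])  # the full vector stored for row 1

# Stacking the cells rebuilds the 2D array when a model needs it.
restored = np.stack(table["Teams"].tolist())
```

The column ends up with `object` dtype, which is why vectorized operations are not available on it; `np.stack` is one way to get back to a regular 2D array.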

1 Comment

The OP needs a new column filled with arrays, so your answer doesn't address that. It is only a recommendation, and it uses the same principle as my answer.
