One hot encoding error python machine learning

Question

I am working with categorical variables in machine learning. Here is a sample of my data:

age,gender,height,class,label
25,m,43,A,0
35,f,45,B,1
12,m,36,C,0
14,f,42,A,0

There are two categorical variables, gender and class. I have used the LabelEncoder technique.

My code:

import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder,OneHotEncoder

df=pd.read_csv('test.csv')

X=df.drop(['label'],1)
y=np.array(df['label'])

data=X.iloc[:,:].values

lben = LabelEncoder()
data[:,1] = lben.fit_transform(data[:,1])
data[:,3] = lben.fit_transform(data[:,3])

onehotencoder = OneHotEncoder(categorical_features=[1])
data = onehotencoder.fit_transform(data).toarray()

onehotencoder = OneHotEncoder(categorical_features=[3])
data = onehotencoder.fit_transform(data).toarray()

print(data.shape)

np.savetxt('data.csv',data,fmt='%s')

The data.csv looks like this:

0.0 0.0 1.0 0.0 0.0 1.0 25.0 0.0
0.0 0.0 0.0 1.0 1.0 0.0 35.0 1.0
1.0 0.0 0.0 0.0 0.0 1.0 12.0 2.0
0.0 1.0 0.0 0.0 1.0 0.0 14.0 0.0

I am unable to understand why the columns look like this, i.e. where is the value of the 'height' column? Also, data.shape is (4,8) instead of (4,7), i.e. gender represented by 2 columns, class by 3, plus the 'age' and 'height' features.

Mischa Lisovyi · Accepted Answer · 2018-06-17 11:06:29Z

Are you sure that you need LabelEncoder plus OneHotEncoder? There is a much simpler method (it does not support more advanced procedures, but so far you seem to be working on the basics):

import pandas as pd
import numpy as np

df=pd.read_csv('test.csv')

X=df.drop(columns=['label'])
y=np.array(df['label'])

data = pd.get_dummies(X)
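
As a rough sketch of what get_dummies produces for this data, the snippet below runs it on the four sample rows from the question. Reading the rows from an inline string instead of test.csv is an assumption made only to keep the example self-contained.

```python
import io

import pandas as pd

# The four sample rows from the question, inlined instead of read from test.csv.
csv_text = """age,gender,height,class,label
25,m,43,A,0
35,f,45,B,1
12,m,36,C,0
14,f,42,A,0
"""

df = pd.read_csv(io.StringIO(csv_text))

X = df.drop(columns=["label"])   # features only
y = df["label"].to_numpy()       # target vector

# get_dummies one-hot encodes the string columns (gender, class)
# and leaves the numeric columns (age, height) untouched.
data = pd.get_dummies(X)

print(list(data.columns))
print(data.shape)
```

The numeric columns are kept as they are and one indicator column per category is appended, so the result has the 7 columns the question expected: age, height, two for gender and three for class.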

The problem with the current code is that after you have done the first OHE:

onehotencoder = OneHotEncoder(categorical_features=[1])
data = onehotencoder.fit_transform(data).toarray()

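the columns get shifted: the encoder puts the new one-hot columns first and stacks the remaining columns to their right, so the layout becomes gender (2 columns), age, height, class. Column 3 is therefore the original height column instead of the label-encoded class column, and its 4 distinct values are what produce the 8 columns you see. So change the second encoder to use column 4 (categorical_features=[4]) and you will get the (4,7) result you want.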
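
To make the index shift concrete, here is a small pure-Python sketch. The helper one_hot_first is hypothetical: it is a hand-rolled imitation of the "encoded columns first, remaining columns after" layout, not scikit-learn's actual implementation, and it assumes the label-encoded values f=0, m=1 and A=0, B=1, C=2.

```python
def one_hot_first(rows, col):
    """One-hot encode column `col`; indicator columns go first, the rest follow.

    Hypothetical helper that only mimics the output layout discussed above.
    """
    categories = sorted({row[col] for row in rows})
    encoded = []
    for row in rows:
        indicator = [1 if row[col] == c else 0 for c in categories]
        rest = [v for i, v in enumerate(row) if i != col]
        encoded.append(indicator + rest)
    return encoded


# Sample rows after label encoding: age, gender (f=0, m=1), height, class (A=0, B=1, C=2)
rows = [
    [25, 1, 43, 0],
    [35, 0, 45, 1],
    [12, 1, 36, 2],
    [14, 0, 42, 0],
]

# After encoding gender, the layout is: gender_f, gender_m, age, height, class
step1 = one_hot_first(rows, 1)

# Index 3 now points at height (4 distinct values), giving 4 + 4 = 8 columns.
wrong = one_hot_first(step1, 3)

# Index 4 points at class (3 distinct values), giving 3 + 4 = 7 columns.
right = one_hot_first(step1, 4)

print(step1[0])
print(len(wrong[0]), len(right[0]))
```

The first row of wrong reproduces the first line of data.csv from the question (0 0 1 0 0 1 25 0), which shows that height was encoded and class was left as a plain integer.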
