I have a dataframe X with integer, float and string columns. I'd like to one-hot encode every column that is of "Object" type, so I'm trying to do this:

encoding_needed = X.select_dtypes(include='object').columns
ohe = preprocessing.OneHotEncoder()
X[encoding_needed] = ohe.fit_transform(X[encoding_needed].astype(str)) #need astype bc I imputed with 0, so some rows have a mix of zeroes and strings.

However, I end up with IndexError: tuple index out of range. I don't quite understand this, since according to the documentation the encoder expects X: array-like, shape [n_samples, n_features], so I should be OK passing a dataframe. How can I one-hot encode only the columns listed in encoding_needed?

EDIT:

The data is confidential so I cannot share it and I cannot create a dummy as it has 123 columns as is.

I can provide the following:

X.shape: (40755, 123)
encoding_needed.shape: (81,) and is a subset of columns.

Full stack:

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-90-6b3e9fdb6f91> in <module>()
      1 encoding_needed = X.select_dtypes(include='object').columns
      2 ohe = preprocessing.OneHotEncoder()
----> 3 X[encoding_needed] = ohe.fit_transform(X[encoding_needed].astype(str))

~/anaconda3/envs/python3/lib/python3.6/site-packages/pandas/core/frame.py in __setitem__(self, key, value)
   3365             self._setitem_frame(key, value)
   3366         elif isinstance(key, (Series, np.ndarray, list, Index)):
-> 3367             self._setitem_array(key, value)
   3368         else:
   3369             # set column

~/anaconda3/envs/python3/lib/python3.6/site-packages/pandas/core/frame.py in _setitem_array(self, key, value)
   3393                 indexer = self.loc._convert_to_indexer(key, axis=1)
   3394                 self._check_setitem_copy()
-> 3395                 self.loc._setitem_with_indexer((slice(None), indexer), value)
   3396 
   3397     def _setitem_frame(self, key, value):

~/anaconda3/envs/python3/lib/python3.6/site-packages/pandas/core/indexing.py in _setitem_with_indexer(self, indexer, value)
    592                     # GH 7551
    593                     value = np.array(value, dtype=object)
--> 594                     if len(labels) != value.shape[1]:
    595                         raise ValueError('Must have equal len keys and value '
    596                                          'when setting with an ndarray')

IndexError: tuple index out of range
  • Please provide a sample of your data and the full error traceback, not just the last line. Commented Feb 10, 2020 at 16:15
  • @G.Anderson I updated the question. Commented Feb 10, 2020 at 16:18

3 Answers

import pandas as pd

# example data
X = pd.DataFrame({'int': [0, 1, 2, 3],
                  'float': [4.0, 5.0, 6.0, 7.0],
                  'string1': list('abcd'),
                  'string2': list('efgh')})

   int  float string1 string2
0    0    4.0       a       e
1    1    5.0       b       f
2    2    6.0       c       g
3    3    7.0       d       h

Using pandas

pandas.get_dummies selects the object columns automatically, drops them, and appends the one-hot-encoded columns in their place:

pd.get_dummies(X)

   int  float  string1_a  string1_b  string1_c  string1_d  string2_e  \
0    0    4.0          1          0          0          0          1   
1    1    5.0          0          1          0          0          0   
2    2    6.0          0          0          1          0          0   
3    3    7.0          0          0          0          1          0   

   string2_f  string2_g  string2_h  
0          0          0          0  
1          1          0          0  
2          0          1          0  
3          0          0          1  
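
To restrict the encoding to a chosen list of columns, as the question asks, get_dummies also takes a columns argument. The following is a minimal sketch on the example frame; the encoding_needed list is written out by hand here as a stand-in for the question's column selection. Recent pandas versions return boolean dummy columns by default, so dtype=int is passed to get the 0/1 output shown above.

```python
import pandas as pd

# Example frame from this answer.
X = pd.DataFrame({'int': [0, 1, 2, 3],
                  'float': [4.0, 5.0, 6.0, 7.0],
                  'string1': list('abcd'),
                  'string2': list('efgh')})

# Stand-in for the question's encoding_needed (assumed column names).
encoding_needed = ['string1', 'string2']

# Only the listed columns are encoded; the others are passed through unchanged.
X_encoded = pd.get_dummies(X, columns=encoding_needed, dtype=int)
print(X_encoded.columns.tolist())
```

The listed columns are dropped from the result and their dummy columns are appended after the untouched ones.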

Using sklearn

Here we have to specify that we only need the object columns:

from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder()

X_object = X.select_dtypes('object')
ohe.fit(X_object)

codes = ohe.transform(X_object).toarray()
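# scikit-learn 1.0+; older releases call this method get_feature_names
feature_names = ohe.get_feature_names_out(['string1', 'string2'])

# combine the untouched numeric columns with the encoded ones
X = pd.concat([X.select_dtypes(exclude='object'),
               pd.DataFrame(codes, columns=feature_names).astype(int)], axis=1)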

   int  float  string1_a  string1_b  string1_c  string1_d  string2_e  \
0    0    4.0          1          0          0          0          1   
1    1    5.0          0          1          0          0          0   
2    2    6.0          0          0          1          0          0   
3    3    7.0          0          0          0          1          0   

   string2_f  string2_g  string2_h  
0          0          0          0  
1          1          0          0  
2          0          1          0  
3          0          0          1  
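
As for the IndexError in the question, the likely cause is that fit_transform returns a SciPy sparse matrix by default, and that the encoded result has more columns than encoding_needed, so it cannot be assigned back onto those same columns. Below is a sketch of the same sklearn approach that converts the result to a dense frame and keeps the original row index. The non-default index and the hand-built column names are assumptions made for illustration.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Example frame with a non-default index, to show that rows stay aligned.
X = pd.DataFrame({'int': [0, 1, 2, 3],
                  'float': [4.0, 5.0, 6.0, 7.0],
                  'string1': list('abcd'),
                  'string2': list('efgh')},
                 index=[10, 11, 12, 13])

encoding_needed = ['string1', 'string2']

ohe = OneHotEncoder()
# The result is sparse and has one column per category (8 here), not 2.
encoded = ohe.fit_transform(X[encoding_needed].astype(str))

# Build "<column>_<category>" names from the fitted categories_ attribute.
names = [f"{col}_{cat}"
         for col, cats in zip(encoding_needed, ohe.categories_)
         for cat in cats]

# Reuse X.index so that concat aligns rows instead of introducing NaNs.
encoded_df = pd.DataFrame(encoded.toarray().astype(int),
                          columns=names, index=X.index)
X_encoded = pd.concat([X.drop(columns=encoding_needed), encoded_df], axis=1)
```

Passing index=X.index matters whenever X has been filtered or re-indexed; with a default RangeIndex the result matches the output above.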

1 Comment

While doing ohe.fit(df["sales"]) I am getting error as ValueError: Expected 2D array, got 1D array instead: array=['ab' 'vg' 'ab' 'iu' 'ab' 'vg' 'iu']. Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
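
The error in this comment comes from passing a single Series, which is 1-D, while the encoder expects 2-D input. A small sketch using the values from the error message: selecting the column with double brackets keeps it as a one-column DataFrame. The df frame here is reconstructed from the comment.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Values taken from the error message in the comment above.
df = pd.DataFrame({'sales': ['ab', 'vg', 'ab', 'iu', 'ab', 'vg', 'iu']})

ohe = OneHotEncoder()
# df['sales'] is a 1-D Series; df[['sales']] is a 2-D, one-column DataFrame.
ohe.fit(df[['sales']])
codes = ohe.transform(df[['sales']]).toarray()
print(codes.shape)
```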

Without seeing your data, I'm having a hard time finding your error. You could try the get_dummies method from pandas?

pd.get_dummies(X[encoding_needed])

1 Comment

X[encoding_needed] = pd.get_dummies(X[encoding_needed]) results in a ValueError: Columns must be same length as key.
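
That ValueError is expected: get_dummies returns one column per category, so the result is wider than the columns it would be assigned to. One way around it, sketched here on the example frame from the first answer, is to drop the original columns and join the dummies. The explicit encoding_needed list is an assumption for illustration.

```python
import pandas as pd

X = pd.DataFrame({'int': [0, 1, 2, 3],
                  'float': [4.0, 5.0, 6.0, 7.0],
                  'string1': list('abcd'),
                  'string2': list('efgh')})
encoding_needed = ['string1', 'string2']

# 2 input columns expand to 8 dummy columns, hence the length mismatch.
dummies = pd.get_dummies(X[encoding_needed], dtype=int)

# Replace the originals instead of assigning onto them; join aligns on the index.
X_encoded = X.drop(columns=encoding_needed).join(dummies)
```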

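If OneHotEncoder raises a 'get_feature_names not found' error, the following may work instead:

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

columns_encode = ['string1', 'string2']
encoder = OneHotEncoder()
df_X_enumeric = X.copy()

for col in columns_encode:
    # double brackets keep the input 2-D, as the encoder requires
    onehot = encoder.fit_transform(df_X_enumeric[[col]])
    # categories_ holds the fitted category values for this column
    feature_names = encoder.categories_[0]
    onehot_df = pd.DataFrame(onehot.toarray(), columns=feature_names,
                             index=df_X_enumeric.index)
    df_X_enumeric = pd.concat([df_X_enumeric, onehot_df], axis=1)

# drop the original string columns once they are encoded
df_X_enumeric.drop(columns_encode, axis=1, inplace=True)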

See also: oneHot vs dummies.
