18

I am writing a very basic program to predict missing values in a dataset using scikit-learn's Imputer class.

I have made a NumPy array, created an Imputer object with strategy='mean' and performed fit_transform() on the NumPy array.

When I print the array after calling fit_transform(), the NaNs remain and I don't get any predictions.

What am I doing wrong here? How do I go about predicting the missing values?

import numpy as np
from sklearn.preprocessing import Imputer

X = np.array([[23.56],[53.45],['NaN'],[44.44],[77.78],['NaN'],[234.44],[11.33],[79.87]])

print X

imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
imp.fit_transform(X)

print X
1
  • That's not generally called prediction, it's called imputation. Unless the missing values are all in the future. Commented Apr 12, 2018 at 22:50

3 Answers 3

27

Per the documentation, sklearn.preprocessing.Imputer.fit_transform returns a new array; it does not alter the array passed as its argument. The minimal fix is therefore to assign the result back:

X = imp.fit_transform(X)
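To make the point concrete, here is a minimal sketch showing that the input array is left untouched and only the return value contains the filled-in data. It assumes a recent scikit-learn, where the class is `SimpleImputer` in `sklearn.impute` rather than the `Imputer` class from the question, and it uses `np.nan` instead of the string `'NaN'`. The name `X_filled` is just for illustration.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Same values as in the question, but with real NaNs instead of the string 'NaN'.
X = np.array([[23.56], [53.45], [np.nan], [44.44], [77.78],
              [np.nan], [234.44], [11.33], [79.87]])

imp = SimpleImputer(missing_values=np.nan, strategy='mean')
X_filled = imp.fit_transform(X)  # returns a new array, X is not modified

print(np.isnan(X).sum())         # 2: the original still has its NaNs
print(np.isnan(X_filled).sum())  # 0: the returned array has none
```

Both missing entries receive the same value, the mean of the seven observed numbers in the column, which is what the 'mean' strategy is meant to do.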

6 Comments

That is working fine, thanks. However, the predicted values for all missing values are coming out to be the same. I took much larger datasets too and still all 'NaN's were being replaced by the same value. What do I need to change in my program?
These aren't "predicted" values, they're just replacements for missing data. Your strategy is 'mean', so it will "replace missing values using the mean along the axis".
Okay. Which algorithm should I use for predicting the missing values then?
Additionally, you can set copy=False in the constructor to do imputation in-place and avoid creating a copy whenever possible.
@Rayu You may want to use multiple imputation to do this correctly. See here for more information about doing so using pandas and the very nice port of MICE by Frank Cheng: gsocfrankcheng.blogspot.ca
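As a rough illustration of the model-based imputation the comments above point towards, the sketch below uses scikit-learn's `IterativeImputer`. This is a swapped-in technique, not the MICE port mentioned in the comment: it is inspired by MICE but returns a single imputation by default, and it is still marked experimental, so it has to be enabled explicitly. The two-column data set is hypothetical. The question's single-column array would not benefit, because there is no second feature to predict from.

```python
import numpy as np
# IterativeImputer is experimental, so it must be enabled before importing it.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical data: the second column is roughly twice the first.
X = np.array([[1.0, 2.1],
              [2.0, 3.9],
              [3.0, np.nan],
              [4.0, 8.2],
              [np.nan, 10.1],
              [6.0, 11.8]])

# Each feature with missing values is modelled from the other features.
imp = IterativeImputer(max_iter=10, random_state=0)
X_filled = imp.fit_transform(X)

print(X_filled)
```

Unlike the 'mean' strategy, each missing entry here gets its own estimate based on the other column in the same row.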
8

Since scikit-learn 0.20, imputation has moved to the sklearn.impute module and the class is called SimpleImputer. It is used like this:

import numpy as np
from sklearn.impute import SimpleImputer

impute = SimpleImputer(missing_values=np.nan, strategy='mean')
impute.fit(X)
X = impute.transform(X)

Points to note:

np.nan is used instead of the string 'NaN'.

The axis parameter is no longer needed; SimpleImputer always imputes column by column.

The variable name is arbitrary; imp or imputer would work just as well as impute.
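The first point matters for data like the question's, where floats are mixed with the string 'NaN'. A small sketch of one way to handle that, assuming the values are otherwise numeric text (the name `raw` is only for illustration):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Mixing floats with the string 'NaN' makes NumPy build an array of strings.
raw = np.array([[23.56], [53.45], ['NaN'], [44.44]])
print(raw.dtype.kind)  # 'U' means unicode strings, not floats

# Converting to float parses the text 'NaN' into a real np.nan.
X = raw.astype(float)

impute = SimpleImputer(missing_values=np.nan, strategy='mean')
X = impute.fit_transform(X)
print(X.ravel())
```

Building the array with np.nan from the start avoids the conversion step altogether.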

2

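Note: Due to a change in the sklearn library, the string 'NaN' has to be replaced with np.nan, as shown below. This snippet applies to versions that still ship sklearn.preprocessing.Imputer (it was removed in 0.22), and it assumes X is a float array with missing values in columns 1 and 2.

import numpy as np
from sklearn.preprocessing import Imputer

imputer = Imputer(missing_values=np.nan, strategy='mean', axis=0)
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])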
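On current scikit-learn releases, the same pattern of imputing only a slice of columns can be sketched with `SimpleImputer`. The three-column array below is hypothetical: column 0 stands in for an identifier that should be left alone, and columns 1 and 2 contain the gaps.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data: column 0 is an ID, columns 1 and 2 have missing values.
X = np.array([[1.0, 44.0, 72000.0],
              [2.0, 27.0, np.nan],
              [3.0, np.nan, 54000.0],
              [4.0, 38.0, 61000.0]])

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
# Fit on the selected columns only and write the result back into the slice.
X[:, 1:3] = imputer.fit_transform(X[:, 1:3])

print(X)
```

Each gap is filled with the mean of its own column, and column 0 is never passed to the imputer.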
