Predicting missing values with scikit-learn's Imputer module

Question

I am writing a very basic program to predict missing values in a dataset using scikit-learn's Imputer class.

I have made a NumPy array, created an Imputer object with strategy='mean' and performed fit_transform() on the NumPy array.

When I print the array after calling fit_transform(), the 'NaN's remain and I don't get any predictions.

What am I doing wrong here? How do I go about predicting the missing values?

import numpy as np
from sklearn.preprocessing import Imputer

X = np.array([[23.56],[53.45],['NaN'],[44.44],[77.78],['NaN'],[234.44],[11.33],[79.87]])

print X

imp = Imputer(missing_values='NaN', strategy='mean', axis=0)
imp.fit_transform(X)

print X

That's not generally called prediction, it's called imputation. Unless the missing values are all in the future.
– smci, Commented Apr 12, 2018 at 22:50

jonrsharpe · Accepted Answer · 2014-07-29 14:20:30Z

27

Per the documentation, sklearn.preprocessing.Imputer.fit_transform returns a new array; it doesn't alter the argument array. The minimal fix is therefore to assign the result back:

X = imp.fit_transform(X)
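
To make the "returns a new array" behaviour concrete, here is a minimal sketch. It assumes a recent scikit-learn, where Imputer has been replaced by sklearn.impute.SimpleImputer, and it writes the question's missing entries as np.nan; the name result is only illustrative.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# The question's data, with missing entries written as np.nan.
X = np.array([[23.56], [53.45], [np.nan], [44.44], [77.78],
              [np.nan], [234.44], [11.33], [79.87]])

imp = SimpleImputer(missing_values=np.nan, strategy='mean')

result = imp.fit_transform(X)   # the imputed data is the return value

print(np.isnan(X).sum())        # X itself still holds its two NaNs
print(np.isnan(result).sum())   # the returned array has none
```

Printing X after the call, as the question does, therefore shows the untouched input unless the result is assigned back to X.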



6 Comments

xennygrimmato Over a year ago

That is working fine, thanks. However, the predicted values for all missing values come out the same. I tried much larger datasets too, and all the 'NaN's were still replaced by the same value. What do I need to change in my program?

jonrsharpe Over a year ago

These aren't "predicted" values, they're just replacements for missing data. Your strategy is 'mean', so it will "replace missing values using the mean along the axis".
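
A small sketch of why every gap gets the same number, assuming the newer SimpleImputer class and np.nan markers: fit computes one fill value per column, and transform writes that value into every missing slot of the column.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[23.56], [53.45], [np.nan], [44.44], [77.78],
              [np.nan], [234.44], [11.33], [79.87]])

imp = SimpleImputer(strategy='mean')
filled = imp.fit_transform(X)

column_mean = np.nanmean(X[:, 0])   # mean of the observed values only
print(imp.statistics_)              # the per-column fill value learned by fit
print(filled[2, 0], filled[5, 0])   # both gaps receive that same value
```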

xennygrimmato Over a year ago

Okay. Which algorithm should I use for predicting the missing values then?

Gilles Louppe Over a year ago

Additionally, you can set copy=False in the constructor to do imputation in-place and avoid creating a copy whenever possible.
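
A sketch of the copy=False route, again assuming SimpleImputer, which also accepts this parameter. In-place imputation is only done "whenever possible"; a dense float array like the one below is the case where it can normally be expected, while other dtypes or sparse formats may still be copied.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X = np.array([[23.56], [np.nan], [44.44], [np.nan], [79.87]])  # float64

imp = SimpleImputer(strategy='mean', copy=False)
imp.fit_transform(X)          # return value ignored on purpose

print(np.isnan(X).any())      # expected False if the in-place path was taken
```

Assigning the return value is still the safer habit, since it works whether or not a copy was made.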

Don Over a year ago

@Rayu You may want to use multiple imputation to do this correctly. See here for more information about doing so using pandas and the very nice port of MICE by Frank Cheng: gsocfrankcheng.blogspot.ca
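
For model-based estimates within scikit-learn itself, one option is IterativeImputer, which the scikit-learn documentation describes as inspired by MICE. This is not the pandas port mentioned above, just a hedged sketch of a related technique. It needs at least one other feature to predict from, so the two-column data here is hypothetical rather than the question's single column.

```python
import numpy as np
# IterativeImputer is still experimental and must be enabled explicitly.
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# Hypothetical two-feature data: the second column is roughly twice the first.
X = np.array([[1.0, 2.1],
              [2.0, 3.9],
              [3.0, np.nan],
              [4.0, 8.2],
              [5.0, 9.8],
              [6.0, np.nan],
              [7.0, 14.1]])

imp = IterativeImputer(random_state=0)
filled = imp.fit_transform(X)

print(filled[2, 1], filled[5, 1])   # each gap gets its own estimate
```

Unlike strategy='mean', the two gaps are filled with different values because each one is regressed on the first column of its own row.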


msklc · Accepted Answer · 2020-06-08 20:37:07Z

8

Since scikit-learn 0.20, imputation lives in the sklearn.impute module and Imputer has been replaced by SimpleImputer. It can be used like this:

import numpy as np
from sklearn.impute import SimpleImputer

# X must be numeric, with missing entries stored as np.nan
impute = SimpleImputer(missing_values=np.nan, strategy='mean')
impute.fit(X)
X = impute.transform(X)

Note:

np.nan is used instead of the string 'NaN'

The axis parameter is no longer needed; SimpleImputer always imputes column by column

The variable name impute is arbitrary; imp or imputer would work just as well
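
The first two points can be checked with a short sketch. The two-column data is made up for illustration, and raw and filled are hypothetical names. It assumes the missing entries start out as the string 'NaN', as in the question.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Mixing floats with the string 'NaN' makes NumPy build a string array.
raw = np.array([[23.56, 1.0], ['NaN', 2.0], [44.44, 'NaN'], [77.78, 6.0]])
print(raw.dtype.kind)          # 'U': every entry is text, including '23.56'

X = raw.astype(float)          # the text 'NaN' parses to a real np.nan

imp = SimpleImputer(missing_values=np.nan, strategy='mean')
filled = imp.fit_transform(X)  # no axis argument: each column uses its own mean
print(filled)
```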

edited Jun 8, 2020 at 20:37

answered Dec 21, 2019 at 12:58

msklc


Shrikant Chaudhari · Accepted Answer · 2020-03-12 06:56:13Z

2

Note: Due to a change in the sklearn library, 'NaN' has to be replaced with np.nan, as shown below.

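import numpy as np
# Imputer was deprecated in scikit-learn 0.20 and removed in 0.22;
# on newer versions use sklearn.impute.SimpleImputer (no axis argument).
from sklearn.preprocessing import Imputer

imputer = Imputer(missing_values=np.nan, strategy='mean', axis=0)
imputer = imputer.fit(X[:, 1:3])
X[:, 1:3] = imputer.transform(X[:, 1:3])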
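
The same column-slice pattern can be sketched for current scikit-learn versions, assuming SimpleImputer in place of the removed Imputer class. The three-column array is hypothetical, since the snippet above presumes an X with at least three columns, unlike the question's.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data: column 0 is an ID that should be left alone.
X = np.array([[1.0, 10.0, np.nan],
              [2.0, np.nan, 4.0],
              [3.0, 30.0, 8.0]])

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
X[:, 1:3] = imputer.fit_transform(X[:, 1:3])   # write the result back into the slice

print(X)
```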

edited Mar 12, 2020 at 6:56

Shrikant Chaudhari


answered Aug 17, 2018 at 18:09

MD SAZID KHAN
