Creating new DataFrame from existing dataframe without missing values

Question

I'm stuck on what should be a fairly simple task.

I have a DataFrame with missing data. To compare ways of handling it, I want to test two DataFrames.

In the first one, X_real_zeros, I replace missing values with 0. In the second one, X_real_means, I replace them with the column's mean.

I have collected the names of all numeric columns in one list:

numeric_cols = ['RFCD.Percentage.1', 'RFCD.Percentage.2', 'RFCD.Percentage.3', 
                'RFCD.Percentage.4', 'RFCD.Percentage.5',
                'SEO.Percentage.1', 'SEO.Percentage.2', 'SEO.Percentage.3',
                'SEO.Percentage.4', 'SEO.Percentage.5',
                'Year.of.Birth.1', 'Number.of.Successful.Grant.1', 'Number.of.Unsuccessful.Grant.1']

Then I try to create the two DataFrames:

data = pd.read_csv('data.csv')
X_real_zeros = data
for col in numeric_cols:
    X_real_zeros[col] = data[col].fillna(0)

X_real_means = data
a = calculate_means(data[numeric_cols])
for col in numeric_cols:
    print(a[col], col)
    X_real_means[col] = data[col].fillna(a[col])

But when I go to create the second one, it turns out that my data DataFrame has already been modified. In any case, I suspect my approach is not right. What is the proper way to solve this kind of task?

Community · Accepted Answer · 2020-06-20 09:12:55Z

6

Use:

X_real_means = data.copy()

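Otherwise, the variable X_real_means will reference exactly the same object as data, so any change made through one name also shows up through the other. The same applies to X_real_zeros.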

Wes McKinney answered a similar question here: pandas dataframe, copy by value
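To make the difference between plain assignment and .copy() concrete, here is a minimal sketch. The tiny frame and the names alias and independent are made up for illustration and are not from the question.

```python
import numpy as np
import pandas as pd

# Small made-up frame standing in for data.csv (hypothetical values).
data = pd.DataFrame({"x": [1.0, np.nan, 3.0]})

alias = data               # plain assignment: a second name for the same object
independent = data.copy()  # a new DataFrame with its own data

# Filling through the alias changes data as well, because both names
# point at the same DataFrame.
alias["x"] = alias["x"].fillna(0)

alias_is_data = alias is data
data_nans_after_alias_edit = int(data["x"].isna().sum())
copy_nans = int(independent["x"].isna().sum())

print(alias_is_data, data_nans_after_alias_edit, copy_nans)
```

After the fill, data no longer contains a missing value, while the copy taken beforehand still does.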

The overall code after the change will look like this:

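import pandas as pd

data = pd.read_csv('data.csv')

# Work on an independent copy so that data itself is left untouched.
X_real_zeros = data.copy()
for col in numeric_cols:
    X_real_zeros[col] = data[col].fillna(0)

# A second independent copy; data still holds the original missing values here.
X_real_means = data.copy()
a = calculate_means(data[numeric_cols])  # the asker's own helper, not shown in the question
for col in numeric_cols:
    print(a[col], col)
    X_real_means[col] = data[col].fillna(a[col])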
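As for the "proper way" part of the question, one possible approach is to skip the per-column loop and fill all numeric columns at once. This is only a sketch: the sample frame and the Country column are made up, col_means is a hypothetical name, and DataFrame.mean() is used in place of the asker's calculate_means on the assumption that the helper returns per-column means.

```python
import numpy as np
import pandas as pd

# Made-up sample data; only two of the question's column names are used.
data = pd.DataFrame({
    "RFCD.Percentage.1": [10.0, np.nan, 30.0],
    "Year.of.Birth.1": [1960.0, 1970.0, np.nan],
    "Country": ["A", None, "B"],  # hypothetical non-numeric column, left untouched
})
numeric_cols = ["RFCD.Percentage.1", "Year.of.Birth.1"]

# Variant 1: missing numeric values replaced with 0.
X_real_zeros = data.copy()
X_real_zeros[numeric_cols] = data[numeric_cols].fillna(0)

# Variant 2: missing numeric values replaced with the column mean.
# mean() skips NaN by default; fillna with a Series matches on column labels.
col_means = data[numeric_cols].mean()
X_real_means = data.copy()
X_real_means[numeric_cols] = data[numeric_cols].fillna(col_means)

print(X_real_zeros)
print(X_real_means)
```

Because both variants start from data.copy() and only read from data, the original frame keeps its missing values and can be reused for further experiments.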

edited Jun 20, 2020 at 9:12

CommunityBot

answered Sep 29, 2017 at 12:05

Mohamed Ali JAMAOUI

Kilzrus · Accepted Answer · 2017-09-29 12:10:15Z

1

I think that's all you needed to do:

data = pd.read_csv('data.csv')
X_real_zeros = data.copy()
for col in numeric_cols:
    X_real_zeros[col] = data[col].fillna(0)

X_real_means = data.copy()
a = calculate_means(data[numeric_cols])
for col in numeric_cols:
    print(a[col], col)
    X_real_means[col] = data[col].fillna(a[col])

answered Sep 29, 2017 at 12:10

Kilzrus
