I have a Pandas DataFrame with columns A and B:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,100,size=(10, 2)), columns=list('AB'))
I create column C, which takes the value of A where A > B and is NULL (NaN) otherwise:
df['C'] = np.select([ df.A > df.B ], [df.A], default=np.NaN)
That gives:
A B C
0 95 19 95.0
1 46 11 46.0
2 96 86 96.0
3 22 61 NaN
4 69 1 69.0
5 78 91 NaN
6 42 7 42.0
7 24 28 NaN
8 55 92 NaN
9 92 16 92.0
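As a sanity check, column C comes out as float64 here (the integer dtypes of A and B may differ by platform), so those NaNs are genuine floating-point NaNs that isnull() can see:
df.dtypes
# A      int32
# B      int32
# C    float64
df.C.isnull().sum()
# 4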
I then drop the rows where df.C is NaN, using one of several methods:
df = df.dropna(subset=['C'], how='any')
or
df = df.drop(df[pd.isnull(df.C)].index)
or
df = df.drop(df[(pd.isnull(df.C))].index)
and all 3 methods give me roughly half of the rows. In this case:
A B C
0 95 19 95.0
1 46 11 46.0
2 96 86 96.0
4 69 1 69.0
6 42 7 42.0
9 92 16 92.0
But when I use a string instead of a number:
df['C'] = np.select([ df.A > df.B ], ['yes'], default=np.NaN)
Then those same 3 methods fail to drop any rows: the NaN-looking values in df.C are not filtered out. For example, with column C set to 'yes' wherever df.A > df.B, I get something like this:
A B C
0 6 70 nan
1 85 46 yes
2 76 87 nan
3 77 36 yes
4 73 18 yes
5 1 41 nan
6 19 69 nan
7 62 89 nan
8 6 7 nan
9 35 75 nan
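Note the lowercase nan in that printout. Probing what column C actually contains suggests every cell is a string, and the "missing" ones hold the literal string 'nan':
df.C.map(type).unique()
# array([<type 'str'>], dtype=object)  <- every cell is a str, not a float
(df.C == 'nan').sum()
# 7  <- the "missing" cells are the literal string 'nan'
df.C.isnull().sum()
# 0  <- nothing for dropna()/isnull() to find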
I can fix this by replacing np.NaN with a string like 'IGNORE' and then filtering out 'IGNORE', but otherwise I find this result unexpected:
df['C'] = np.select([ df.A > df.B ], ['yes'], default='IGNORE')
df = df.drop(df[(df.C == 'IGNORE')].index)
What's going on here? (When df.C holds strings, are my np.NaN values being converted to strings?)
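A minimal check that seems to reproduce the coercion outside of np.select (my reading: NumPy promotes the mixed string/float values to a common string dtype, stringifying the NaN along the way):
import numpy as np
a = np.array(['yes', np.nan])  # mix a string with a float NaN
print(a)        # ['yes' 'nan'] -- the NaN has become the string 'nan'
print(a.dtype)  # a string dtype, e.g. '<U32' ('|S32' on Python 2)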
I'm using 64-bit Python 2.7.13, Pandas 0.19.2, and NumPy 1.11.3 on Windows 10.