I have a Pandas dataframe with columns A and B

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0,100,size=(10, 2)), columns=list('AB'))

I create column C, which holds A when A > B and is NaN otherwise

df['C'] = np.select([ df.A > df.B ], [df.A], default=np.NaN)

That gives:

    A   B     C
0  95  19  95.0
1  46  11  46.0
2  96  86  96.0
3  22  61   NaN
4  69   1  69.0
5  78  91   NaN
6  42   7  42.0
7  24  28   NaN
8  55  92   NaN
9  92  16  92.0

I then drop rows that have df.C as NaN with one of several methods:

df = df.dropna(subset=['C'], how='any')

or

df = df.drop(df[pd.isnull(df.C)].index)

or

df = df.drop(df[(pd.isnull(df.C))].index)

and all 3 methods leave me with roughly half the rows. In this case:

    A   B     C
0  95  19  95.0
1  46  11  46.0
2  96  86  96.0
4  69   1  69.0
6  42   7  42.0
9  92  16  92.0

But when the value I assign is not a number, for example a string:

df['C'] = np.select([ df.A > df.B ], ['yes'], default=np.NaN)

Then those same 3 methods no longer drop the rows where df.C is NaN. For example, when df.A > df.B sets column C to yes, I still get all the rows, something like this:

    A   B    C
0   6  70  nan
1  85  46  yes
2  76  87  nan
3  77  36  yes
4  73  18  yes
5   1  41  nan
6  19  69  nan
7  62  89  nan
8   6   7  nan
9  35  75  nan

I can work around this by replacing np.NaN with a string like 'IGNORE' and then filtering out 'IGNORE', but I still find the original result unexpected.

df['C'] = np.select([ df.A > df.B ], ['yes'], default='IGNORE')
df = df.drop(df[(df.C == 'IGNORE')].index)

What's going on here? (When df.C holds strings, are my np.NaN values being converted to strings?)


I'm using 64-bit Python 2.7.13, Pandas 0.19.2, and Numpy 1.11.3 on Windows 10.

  • @Psidom Yes, true. It seems NaN is literally "not a number" and is being converted to a string "nan". Commented Feb 17, 2017 at 20:11
  • @Psidom if you write your comment as an answer I'd be happy to accept it. It doesn't really explain why, but it definitely solves the problem Commented Feb 17, 2017 at 20:54

2 Answers

Instead of dropping, take only finite values.

df = df[np.isfinite(df['C'])]

Edit:

As per your comment, nan is stored as a string, so remove the rows by comparing against that value. This will work:

df = df[df.C != "nan"]

df[df.C.notnull()]
    A   B    C
0  67  23  yes
1  91  61  yes
2  30  92  nan
3  53  97  nan
4  81  11  yes
5  23   7  yes
6  47  39  yes
7  11  27  nan
8  46  55  nan
9  31  82  nan
df = df[df.C != "nan"]


    A   B    C
0  67  23  yes
1  91  61  yes
4  81  11  yes
5  23   7  yes
6  47  39  yes 

3 Comments

I get a TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
I just tried to simulate your problem and came up with this solution.

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.randint(0,100,size=(10, 2)), columns=list('AB'))
df['C'] = np.select([ df.A > df.B ], [df.A], default=np.NaN)
print df

    A   B     C
0  81  17  81.0
1  14  67   NaN
2  16   9  16.0
3  25  31   NaN
4  35  36   NaN
5  56   5  56.0
6  18  20   NaN
7  32   4  32.0
8  46  51   NaN
9  53  34  53.0

df = df[np.isfinite(df['C'])]
print df

    A   B     C
0  81  17  81.0
2  16   9  16.0
5  56   5  56.0
7  32   4  32.0
9  53  34  53.0
OK, the difference is in my actual code (not the sample code I posted here), where the value is a string. Please try this: df['C'] = np.select([ df.A > df.B ], [u'yes'], default=np.NaN)

Your case is similar to this one:

np.array([1,2,'3',np.nan])
# array(['1', '2', '3', 'nan'], 
#       dtype='<U21')

np.select also returns an array, so the same conversion applies. If you check further:

type(np.nan)
# float

str(np.nan)
# 'nan'

So np.nan is a float, but a NumPy array holds a single data type (structured arrays aside). When any element is a string, every element is converted to a string, and np.nan becomes the string 'nan', which isnull() and dropna() do not treat as missing.
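To see the coercion in isolation, here is a minimal sketch. The names mixed and as_object are made up for illustration, and the arrays are built directly with np.array rather than np.select. It assumes a NumPy version in which mixing strings and floats in np.array yields a string dtype, as in the example above:

```python
import numpy as np
import pandas as pd

# A string and a float nan in one array: NumPy picks a single string dtype,
# so the float nan is stored as the text 'nan'.
mixed = np.array(['yes', np.nan])
print(mixed.dtype.kind)    # kind code of the resulting dtype
print(repr(mixed[1]))      # the second element is now ordinary text

# pandas sees only strings here, so nothing is flagged as missing.
print(pd.isnull(mixed))

# With an object array and None, the missing value survives and is detected.
as_object = np.array(['yes', None], dtype=object)
print(pd.isnull(as_object))
```

This is why dropna() has nothing to drop in the string case: by the time the column reaches pandas, there is no missing value left in it.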


In your case, for a string column you can use None instead of np.nan as the default. This creates a missing value that passes the isnull() check and works with dropna():

import pandas as pd
import numpy as np

df = pd.DataFrame(np.random.randint(0,100,size=(10, 2)), columns=list('AB'))
df['C'] = np.select([ df.A > df.B ], ['yes'], default=None)

df.dropna()

#     A   B    C
# 0  82   1  yes
# 3  84   8  yes
# 6  52  30  yes
# 7  68  61  yes
# 9  91  87  yes
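If you would rather avoid np.select for string columns altogether, one alternative (my substitution, not something from the question) is Series.where, which leaves a real missing value wherever the condition is False. A small sketch with made-up data; the name kept is only for illustration:

```python
import pandas as pd

# Small hand-written frame instead of random data, so the result is predictable.
df = pd.DataFrame({'A': [95, 22, 69, 24], 'B': [19, 61, 1, 28]})

# Start from a column that is 'yes' everywhere, then blank out the rows
# where the condition fails. where() fills those rows with a missing value.
df['C'] = pd.Series('yes', index=df.index).where(df['A'] > df['B'])

# The missing values are real ones, so dropna() removes those rows.
kept = df.dropna(subset=['C'])
print(kept)
```

Because the column never passes through a fixed-width NumPy string array, the missing entries stay missing and dropna(), isnull() and notnull() behave as they do for the numeric column.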
