I have a Pandas DataFrame with columns A and B:
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randint(0,100,size=(10, 2)), columns=list('AB'))
I create column C, which takes the value of A where A > B and is NULL (NaN) otherwise:
df['C'] = np.select([ df.A > df.B ], [df.A], default=np.NaN)
That gives:
A B C
0 95 19 95.0
1 46 11 46.0
2 96 86 96.0
3 22 61 NaN
4 69 1 69.0
5 78 91 NaN
6 42 7 42.0
7 24 28 NaN
8 55 92 NaN
9 92 16 92.0
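As a sanity check, column C comes out as float64 here (the integer dtypes of A and B may differ by platform), so those NaNs are genuine floating-point NaNs that isnull() can see:
df.dtypes
# A      int32
# B      int32
# C    float64
df.C.isnull().sum()
# 4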
I then drop the rows where df.C is NaN, using one of several methods:
df = df.dropna(subset=['C'], how='any')
or
df = df.drop(df[pd.isnull(df.C)].index)
or
df = df.drop(df[(pd.isnull(df.C))].index)
and all 3 methods give me roughly half of the rows. In this case:
A B C
0 95 19 95.0
1 46 11 46.0
2 96 86 96.0
4 69 1 69.0
6 42 7 42.0
9 92 16 92.0
But when I use a string instead of a number:
df['C'] = np.select([ df.A > df.B ], ['yes'], default=np.NaN)
Then those same 3 methods fail to drop any rows: the NaN-looking values in df.C are not filtered out. For example, with column C set to 'yes' wherever df.A > df.B, I get something like this:
A B C
0 6 70 nan
1 85 46 yes
2 76 87 nan
3 77 36 yes
4 73 18 yes
5 1 41 nan
6 19 69 nan
7 62 89 nan
8 6 7 nan
9 35 75 nan
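Note the lowercase nan in that printout. Probing what column C actually contains suggests every cell is a string, and the "missing" ones hold the literal string 'nan':
df.C.map(type).unique()
# array([<type 'str'>], dtype=object)  <- every cell is a str, not a float
(df.C == 'nan').sum()
# 7  <- the "missing" cells are the literal string 'nan'
df.C.isnull().sum()
# 0  <- nothing for dropna()/isnull() to find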
I can fix this by replacing np.NaN with a string like 'IGNORE' and then filtering out 'IGNORE', but otherwise I find this result unexpected:
df['C'] = np.select([ df.A > df.B ], ['yes'], default='IGNORE')
df = df.drop(df[(df.C == 'IGNORE')].index)
What's going on here? (When df.C holds strings, are my np.NaN values being converted to strings?)
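A minimal check that seems to reproduce the coercion outside of np.select (my reading: NumPy promotes the mixed string/float values to a common string dtype, stringifying the NaN along the way):
import numpy as np
a = np.array(['yes', np.nan])  # mix a string with a float NaN
print(a)        # ['yes' 'nan'] -- the NaN has become the string 'nan'
print(a.dtype)  # a string dtype, e.g. '<U32' ('|S32' on Python 2)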
I'm using 64-bit Python 2.7.13, Pandas 0.19.2, and NumPy 1.11.3 on Windows 10.