2

I have a DataFrame with many columns of float values. I want to delete a row if any of its columns has a value below 20.

code:

import numpy as np
import pandas as pd

xdf = pd.DataFrame({
    'A': np.random.uniform(low=-50, high=53.3, size=5),
    'B': np.random.uniform(low=10, high=130, size=5),
    'C': np.random.uniform(low=-50, high=130, size=5),
    'D': np.random.uniform(low=-100, high=200, size=5),
})

xdf =  
           A          B           C           D
0  -9.270533  42.098425   91.125009  148.350655
1  17.771411  55.564825  106.396381  -89.082831
2 -22.602563  99.330643   17.590466   73.985202
3  15.890920  76.011631   52.366311  194.023063
4  35.202379  41.973846   32.576890  100.523902

# my code
cols = ['A', 'B', 'C', 'D']
xdf[xdf[cols].ge(20).all(axis=1)]

Out[17]: 
           A          B         C           D
4  35.202379  41.973846  32.57689  100.523902

Expected output: drop a row if any column has a value below 20

xdf =  
           A          B           C           D
4  35.202379  41.973846   32.576890  100.523902 

Is this the best way of doing it?

  • probably, could be faster in numpy Commented Jan 26, 2022 at 22:53
  • @ZLi how do we do it in numpy? Commented Jan 26, 2022 at 23:03
  • added an answer below Commented Jan 26, 2022 at 23:16

3 Answers

1

To do it in numpy:

xdf = pd.DataFrame({'A':np.random.uniform(low=-50, high=53.3, size=(5)),'B':np.random.uniform(low=10, high=130, size=(5)),'C':np.random.uniform(low=-50, high=130, size=(5)),'D':np.random.uniform(low=-100, high=200, size=(5))})

%timeit xdf[xdf[['A','B','C','D']].ge(20).all(axis=1)]
%timeit xdf[(xdf[['A','B','C','D']].values >= 20).all(axis=1)]
705 µs ± 277 ns per loop (mean ± std. dev. of 7 runs, 1000 loops each)
460 µs ± 1.13 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

If you do not need the result as a DataFrame, this can be even faster:

xdf.values[(xdf[['A','B','C','D']].values >= 20).all(axis=1)]
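Before relying on the faster variant, it may be worth checking that the pandas mask and the NumPy mask pick the same rows. A minimal sketch, assuming a seeded generator for reproducibility; the seed and the names `rng`, `pandas_mask`, `numpy_mask`, `kept_pandas` and `kept_numpy` are illustrative and not part of the answer above:

```python
import numpy as np
import pandas as pd

# Seeded generator (an assumption) so the frame is the same on every run.
rng = np.random.default_rng(0)
xdf = pd.DataFrame({
    'A': rng.uniform(low=-50, high=53.3, size=5),
    'B': rng.uniform(low=10, high=130, size=5),
    'C': rng.uniform(low=-50, high=130, size=5),
    'D': rng.uniform(low=-100, high=200, size=5),
})

cols = ['A', 'B', 'C', 'D']

# pandas mask: a boolean Series aligned on the index.
pandas_mask = xdf[cols].ge(20).all(axis=1)

# NumPy mask: a plain boolean array, aligned by position.
numpy_mask = (xdf[cols].to_numpy() >= 20).all(axis=1)

kept_pandas = xdf[pandas_mask]
kept_numpy = xdf[numpy_mask]

# Both approaches should keep exactly the same rows.
print(kept_pandas.equals(kept_numpy))
```

The NumPy version skips index alignment, which is likely where most of the time is saved on a frame this small.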

1

NumPy carries less overhead than pandas for purely numeric work, so it is often faster. Try this:

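import numpy as np

# column_stack gives shape (5, 4): one row per observation, one column per variable
a = np.column_stack([np.random.uniform(low=-50, high=53.3, size=5),
    np.random.uniform(low=10, high=130, size=5),
    np.random.uniform(low=-50, high=130, size=5),
    np.random.uniform(low=-100, high=200, size=5)])

# keep only the rows in which every value is at least 20
print(a[np.all(a >= 20, axis=1)])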

If you want to stick with pandas, another idea would be:

xdfFiltered = xdf.loc[(xdf["A"] >= 20) & (xdf["B"] >= 20) & (xdf["C"] >= 20) & (xdf["D"] >= 20)]
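Writing one comparison per column gets unwieldy when the frame has many columns, as in the question. A minimal sketch of building the same mask from a list of column names, using the sample values printed in the question; `threshold`, `mask` and `xdf_filtered` are hypothetical names, and `>=` is used so that a value of exactly 20 is kept, matching the "below 20" wording:

```python
import numpy as np
import pandas as pd

# Sample values copied from the frame printed in the question.
xdf = pd.DataFrame({
    'A': [-9.270533, 17.771411, -22.602563, 15.890920, 35.202379],
    'B': [42.098425, 55.564825, 99.330643, 76.011631, 41.973846],
    'C': [91.125009, 106.396381, 17.590466, 52.366311, 32.576890],
    'D': [148.350655, -89.082831, 73.985202, 194.023063, 100.523902],
})

threshold = 20            # hypothetical name for the cutoff
cols = list(xdf.columns)  # every column; narrow this list as needed

# One boolean Series per column, AND-ed together element-wise.
mask = np.logical_and.reduce([xdf[c] >= threshold for c in cols])

xdf_filtered = xdf.loc[mask]
print(xdf_filtered)  # only the row with index 4 remains
```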

1

You can use the numpy equivalent of .ge instead:

xdf.loc[np.greater_equal(xdf, 20).all(axis=1)]
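One caveat if the data can contain missing values: "keep rows where all values are at least 20" and "drop rows where any value is below 20" only agree when there are no NaNs, because every comparison with NaN evaluates to False. A small sketch with a made-up frame (`df` and its values are illustrative, not taken from the question):

```python
import numpy as np
import pandas as pd

# Made-up frame: row 1 contains a NaN, row 2 contains a value below 20.
df = pd.DataFrame({'A': [35.0, 25.0, 10.0],
                   'B': [42.0, np.nan, 50.0]})

# "All values >= 20": NaN >= 20 is False, so row 1 is dropped.
keep_all_ge = df[df.ge(20).all(axis=1)]

# "No value < 20": NaN < 20 is also False, so row 1 is kept.
keep_not_any_lt = df[~df.lt(20).any(axis=1)]

print(list(keep_all_ge.index))      # [0]
print(list(keep_not_any_lt.index))  # [0, 1]
```

Which of the two is right depends on whether a missing value should count as "below 20" for the data at hand.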
