Filtering a DataFrame based on two logical conditions, first one numpy array values, second one current day based

Question

The following is just a working example.

I have a DataFrame containing a monotonically increasing function. Some of the values are actuals, some are forecasts.

I need to select specific actual values based on a set of milestones, and I must avoid picking forecast values from the DataFrame.

I wrote the following script. It works, but I don't think it is very Pythonic.

I am self-taught and my working environment is Google Colab.

Expected Output

I would like to avoid the for loop and the if condition.

I would also like to understand whether there is room for improvement in the code quality.

#importing libraries 
import pandas as pd 
import numpy as np
import datetime

#working code mock-up
th_array = np.arange(0, 11000, 1000)
cumulated_array = np.arange(0, 5000, 185)

df_index = pd.date_range(end = "20/04/2021", 
                         periods = len(cumulated_array))
df = pd.DataFrame(data = cumulated_array,index = df_index, 
                columns = ["cumulated"])



df_filtered = pd.DataFrame()
current_day =  pd.to_datetime(datetime.date.today())

#filtering loop
for y in th_array:
  x = df[(df['cumulated'] > y) & (df.index < current_day)]
  if x.empty is False:
      df_filtered = df_filtered.append(x.iloc[0])

tdy · Accepted Answer · 2021-04-13 21:48:28Z

One way to refactor the loop is to locate the desired rows with idxmax() and then index them in one shot:

df = df[df.index < current_day]
th_array = th_array[th_array < df.cumulated.max()]

indexes = pd.DataFrame(df.cumulated.values[:, None] > th_array).idxmax()
df.iloc[indexes]

#             cumulated
# 2021-03-25        185
# 2021-03-30       1110
# 2021-04-04       2035
# 2021-04-10       3145

Explanation

First, keep only the rows before current_day:

df = df[df.index < current_day]

#             cumulated
# 2021-03-24          0
# 2021-03-25        185
# ...
# 2021-04-10       3145
# 2021-04-11       3330

And keep only the th_array values less than cumulated.max():

th_array = th_array[th_array < df.cumulated.max()]

# array([   0, 1000, 2000, 3000])

Then use array broadcasting to build a boolean matrix of cumulated > th_array where rows correspond to cumulated and columns to th_array:

valid = pd.DataFrame(df.cumulated.values[:, None] > th_array)

#         0      1      2      3
# 0   False  False  False  False
# 1    True  False  False  False
# 2    True  False  False  False
# 3    True  False  False  False
# 4    True  False  False  False
# 5    True  False  False  False
# 6    True   True  False  False
# 7    True   True  False  False
# 8    True   True  False  False
# 9    True   True  False  False
# 10   True   True  False  False
# 11   True   True   True  False
# 12   True   True   True  False
# 13   True   True   True  False
# 14   True   True   True  False
# 15   True   True   True  False
# 16   True   True   True  False
# 17   True   True   True   True
# 18   True   True   True   True
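
To make the broadcasting step concrete, here is a minimal sketch with small made-up arrays (the names `values`, `thresholds`, `column` and `matrix` are illustrative, not from the post). Indexing with `[:, None]` adds a second axis, turning shape `(n,)` into `(n, 1)`, so comparing against an array of shape `(m,)` broadcasts to an `(n, m)` matrix instead of attempting an element-wise comparison of two 1-D arrays of different lengths.

```python
import numpy as np

# Illustrative data, not the data from the post.
values = np.array([0, 185, 370, 1110])   # shape (4,)
thresholds = np.array([0, 1000])         # shape (2,)

# None is an alias for np.newaxis: it inserts a length-1 axis.
column = values[:, None]                 # shape (4, 1)

# (4, 1) compared with (2,) broadcasts to (4, 2):
# one row per value, one column per threshold.
matrix = column > thresholds

print(column.shape)
print(matrix.shape)
print(matrix)
```

Without the extra axis, `values > thresholds` would try to pair the elements one to one and fail because the lengths differ.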

So for each column (th_array), we want the first True row (cumulated). These can be found with idxmax(). Since False is 0 and True is 1, all the True indexes are tied for the max, and the first one wins the tiebreaker:

indexes = valid.idxmax()

# 0     1
# 1     6
# 2    11
# 3    17
# dtype: int64
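
A small sketch of why th_array is trimmed before this step, using a made-up boolean frame (the column names `a` and `b` are illustrative). A column with no True at all still gets an answer from idxmax(): every row is tied at False, so the first row wins and the result is 0, which would silently select the wrong row.

```python
import pandas as pd

# Illustrative boolean matrix: column "a" has True values, column "b" has none.
valid = pd.DataFrame({
    "a": [False, True, True],
    "b": [False, False, False],
})

# First row label holding the column maximum (ties go to the first row).
first_true = valid.idxmax()

# Tells apart "first True is at row 0" from "no True anywhere".
has_true = valid.any()

print(first_true)
print(has_true)
```

Here `a` maps to 1 and `b` maps to 0, and only `valid.any()` reveals that the 0 for `b` is not a real match. Filtering the thresholds against cumulated.max() beforehand removes such columns.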

Then just iloc these indexes for the final filtered df:

df.iloc[indexes]

#             cumulated
# 2021-03-25        185
# 2021-03-30       1110
# 2021-04-04       2035
# 2021-04-10       3145
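
As a possible alternative to the idxmax() approach (a different technique, not part of the answer above), the question states that the series is monotonically increasing, so `np.searchsorted` can locate the first value above each threshold without building the full boolean matrix. This is only a sketch: it assumes `cumulated` is sorted in ascending order, and it uses a fixed `cutoff` date in place of the current day so the result is reproducible.

```python
import numpy as np
import pandas as pd

# Mock-up data from the question.
th_array = np.arange(0, 11000, 1000)
cumulated_array = np.arange(0, 5000, 185)
df_index = pd.date_range(end="2021-04-20", periods=len(cumulated_array))
df = pd.DataFrame({"cumulated": cumulated_array}, index=df_index)

# Assumption: a fixed date standing in for "today".
cutoff = pd.Timestamp("2021-04-12")

# Keep only the actuals (rows before the cutoff).
actuals = df[df.index < cutoff]
values = actuals["cumulated"].to_numpy()

# side="right" gives, for each threshold, the position of the first
# value strictly greater than it (valid only for sorted input).
positions = np.searchsorted(values, th_array, side="right")

# A position equal to len(values) means no value exceeds that threshold.
positions = positions[positions < len(values)]

result = actuals.iloc[positions]
print(result)
```

With this cutoff the selected rows are the same four as in the output above (185, 1110, 2035 and 3145).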

Timing

For the sample data, the indexing method runs ~11 times faster than looping+appending:

>>> %timeit iloc(df, th_array)
989 µs ± 15.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

>>> %timeit loop(df, th_array)
10.9 ms ± 202 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Testing functions for reference:

def iloc(df, th_array):
    df = df[df.index < current_day]
    th_array = th_array[th_array < df.cumulated.max()]
    indexes = pd.DataFrame(df.cumulated.values[:, None] > th_array).idxmax()
    return df.iloc[indexes]

def loop(df, th_array):
    df_filtered = pd.DataFrame()
    for y in th_array:
        x = df[(df['cumulated'] > y) & (df.index < current_day)]
        if x.empty is False:
            df_filtered = df_filtered.append(x.iloc[0])
    return df_filtered
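
Note that `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0, so the `loop` function above only runs on older versions. A sketch of an equivalent loop for newer versions follows; the name `loop_concat` and the explicit `current_day` parameter are my own additions, and the fixed date in the usage lines is an assumption made so the example is reproducible.

```python
import numpy as np
import pandas as pd

def loop_concat(df, th_array, current_day):
    """Loop version that collects rows in a list and concatenates once."""
    rows = []
    for y in th_array:
        x = df[(df["cumulated"] > y) & (df.index < current_day)]
        if not x.empty:
            # Double brackets keep a one-row DataFrame (preserves dtypes).
            rows.append(x.iloc[[0]])
    # Return an empty frame with the same columns if nothing matched.
    return pd.concat(rows) if rows else df.iloc[0:0]

# Mock-up data from the question, with a fixed stand-in for "today".
th_array = np.arange(0, 11000, 1000)
cumulated_array = np.arange(0, 5000, 185)
df_index = pd.date_range(end="2021-04-20", periods=len(cumulated_array))
df = pd.DataFrame({"cumulated": cumulated_array}, index=df_index)

result = loop_concat(df, th_array, pd.Timestamp("2021-04-12"))
print(result)
```

The timings quoted above were measured with the original `append` version, so they should not be assumed to carry over to this variant.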

Maybe this is a silly question, but why do you also specify None inside df.cumulated.values[:, None]?
– Andrea Ciufo, Commented May 18, 2021 at 7:00
