More efficient alternative to nested For loop

Question

I have two dataframes which contain data collected at two different frequencies. I want to update the label of df2, to that of df1 if it falls into the duration of an event.

I created a nested for-loop to do it, but it takes a rather long time. Here is the code I used:

for i in np.arange(len(df1)-1):
    for j in np.arange(len(df2)):
        if (df2.timestamp[j] > df1.timestamp[i]) & (df2.timestamp[j] < (df1.timestamp[i] + df1.duration[i])):
            df2.loc[j,"label"] = df1.loc[i,"label"]

Is there a more efficient way of doing this? df1 size (367, 4) df2 size (342423, 9)

short example data:

import numpy as np
import pandas as pd

data1 = {'timestamp':  [1,2,3,4,5,6,7,8,9],
    'duration': [0.5,0.3,0.8,0.2,0.4,0.5,0.3,0.7,0.5],
     'label': ['inh','exh','inh','exh','inh','exh','inh','exh','inh']
    }
df1 = pd.DataFrame (data1, columns = ['timestamp','duration','label'])

data2 = {'timestamp':  [1,1.5,2,2.5,3,3.5,4,4.5,5,5.5,6,6.5,7,7.5,8,8.5,9,9.5],
         'label': ['plc','plc','plc','plc','plc','plc','plc','plc','plc','plc','plc','plc','plc','plc','plc','plc','plc','plc']
        }
    df2 = pd.DataFrame (data2, columns = ['timestamp','label'])

it's always nicer if you could provide a minimal reproducible example. — Pierre D
– Pierre D, Commented Dec 10, 2020 at 14:12
cool, and from your sample code, I assume the logic to use for multiple matches , i.e, if a row in df2 matches more than one in df1, is to then take the last (by index order)? — Pierre D
– Pierre D, Commented Dec 10, 2020 at 14:36
A row in df2 should not be able to match multiple rows in df1, since events in df2 have no duration/duration=0. However, multiple rows in df2 can match a row in df1, in which case they get the same label. — martsc1
– martsc1, Commented Dec 10, 2020 at 14:41

Serge Ballesta · Accepted Answer · 2020-12-11 08:26:12Z

1

I would first use a merge_asof to select the highest timestamp from df1 below the timestamp from df2. Next a simple (vectorized) comparison of df2.timestamp and df1.timestamp + df1.duration is enough to select matching lines.

Code could be:

df1['t2'] = df1['timestamp'].astype('float64') # types of join columns must be the same
temp = pd.merge_asof(df2, df1, left_on='timestamp', right_on='t2')
df2.loc[temp.timestamp_x <= temp.t2 + temp.duration, 'label'] = temp.label_y

It gives for df2:

    timestamp label
0         1.0   inh
1         1.5   inh
2         2.0   exh
3         2.5   plc
4         3.0   inh
5         3.5   inh
6         4.0   exh
7         4.5   plc
8         5.0   inh
9         5.5   plc
10        6.0   exh
11        6.5   exh
12        7.0   inh
13        7.5   plc
14        8.0   exh
15        8.5   exh
16        9.0   inh
17        9.5   inh

edited Dec 11, 2020 at 8:26

answered Dec 10, 2020 at 14:57

Serge Ballesta

150k13 gold badges137 silver badges267 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

martsc1 Over a year ago

What does 't2' represent?

Collectives™ on Stack Overflow

More efficient alternative to nested For loop

1 Answer 1

1 Comment

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

1 Answer 1

1 Comment

Your Answer

Sign up or log in

Post as a guest

Related