0

I have two dataframes which contain data collected at two different frequencies. I want to update the label of df2, to that of df1 if it falls into the duration of an event.

I created a nested for-loop to do it, but it takes a rather long time. Here is the code I used:

for i in np.arange(len(df1)-1):
    for j in np.arange(len(df2)):
        if (df2.timestamp[j] > df1.timestamp[i]) & (df2.timestamp[j] < (df1.timestamp[i] + df1.duration[i])):
            df2.loc[j,"label"] = df1.loc[i,"label"]

Is there a more efficient way of doing this? df1 size (367, 4) df2 size (342423, 9)

short example data:

import numpy as np
import pandas as pd

data1 = {'timestamp':  [1,2,3,4,5,6,7,8,9],
    'duration': [0.5,0.3,0.8,0.2,0.4,0.5,0.3,0.7,0.5],
     'label': ['inh','exh','inh','exh','inh','exh','inh','exh','inh']
    }
df1 = pd.DataFrame (data1, columns = ['timestamp','duration','label'])

data2 = {'timestamp':  [1,1.5,2,2.5,3,3.5,4,4.5,5,5.5,6,6.5,7,7.5,8,8.5,9,9.5],
         'label': ['plc','plc','plc','plc','plc','plc','plc','plc','plc','plc','plc','plc','plc','plc','plc','plc','plc','plc']
        }
    df2 = pd.DataFrame (data2, columns = ['timestamp','label'])
3
  • 1
    it's always nicer if you could provide a minimal reproducible example. Commented Dec 10, 2020 at 14:12
  • cool, and from your sample code, I assume the logic to use for multiple matches , i.e, if a row in df2 matches more than one in df1, is to then take the last (by index order)? Commented Dec 10, 2020 at 14:36
  • A row in df2 should not be able to match multiple rows in df1, since events in df2 have no duration/duration=0. However, multiple rows in df2 can match a row in df1, in which case they get the same label. Commented Dec 10, 2020 at 14:41

1 Answer 1

1

I would first use a merge_asof to select the highest timestamp from df1 below the timestamp from df2. Next a simple (vectorized) comparison of df2.timestamp and df1.timestamp + df1.duration is enough to select matching lines.

Code could be:

df1['t2'] = df1['timestamp'].astype('float64') # types of join columns must be the same
temp = pd.merge_asof(df2, df1, left_on='timestamp', right_on='t2')
df2.loc[temp.timestamp_x <= temp.t2 + temp.duration, 'label'] = temp.label_y

It gives for df2:

    timestamp label
0         1.0   inh
1         1.5   inh
2         2.0   exh
3         2.5   plc
4         3.0   inh
5         3.5   inh
6         4.0   exh
7         4.5   plc
8         5.0   inh
9         5.5   plc
10        6.0   exh
11        6.5   exh
12        7.0   inh
13        7.5   plc
14        8.0   exh
15        8.5   exh
16        9.0   inh
17        9.5   inh
Sign up to request clarification or add additional context in comments.

1 Comment

What does 't2' represent?

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.