I have a dataframe df1 that summarizes different observations of an individual (ID) over time, rounded to fixed 45-minute intervals starting at 00:00:00 (00:00:00, 00:45:00, etc.). As an example:

df1<- data.frame(DateTime45=c("2017-07-09 00:00:00","2017-07-09 00:45:00","2017-07-09 02:15:00","2017-07-09 03:45:00"),
                 ID=c("A","A","A","A"),
                 VariableX=c(0,2,0,4))

df1
           DateTime45 ID VariableX
1 2017-07-09 00:00:00  A         0
2 2017-07-09 00:45:00  A         2
3 2017-07-09 02:15:00  A         0
4 2017-07-09 03:45:00  A         4

I have another dataframe df2 with other information (vedba) about this individual, also over time, but in this case not on 45-minute intervals. As an example:

df2<- data.frame(DateTime= c("2017-07-08 23:40:57.245","2017-07-08 23:58:12.945","2017-07-09 00:01:00.345","2017-07-09 00:07:12.845","2017-07-09 00:28:34.845","2017-07-09 00:31:46.567","2017-07-09 00:53:21.345","2017-07-09 01:01:34.545","2017-07-09 01:09:12.246","2017-07-09 01:23:12.321","2017-07-09 01:34:26.687","2017-07-09 01:57:08.687","2017-07-09 02:05:23.789","2017-07-09 02:32:24.789","2017-07-09 02:42:34.536","2017-07-09 02:59:00.098","2017-07-09 03:03:01.434","2017-07-09 03:11:38.987","2017-07-09 03:23:31.345","2017-07-09 03:28:21.345","2017-07-09 03:42:53.345"),
                 ID=c("A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A"),
                 vedba=c(1.87,2.3,0.3,0.67,1.3,2.1,3.6,0.1,0.8,1.3,2.4,1.5,1.23,2.02,1.89,0.78,1.11,2.13,1.20,0.34,0.94))
df2$DateTime<- as.POSIXct(df2$DateTime, format="%Y-%m-%d %H:%M:%OS",tz="UTC")

df2

                  DateTime ID vedba
1  2017-07-08 23:40:57.244  A  1.87
2  2017-07-08 23:58:12.944  A  2.30
3  2017-07-09 00:01:00.345  A  0.30
4  2017-07-09 00:07:12.845  A  0.67
.        .          .       .    .
.        .          .       .    .

For each row in df1, I want to calculate the mean vedba using values from df2. The key point is that each time in df1 is the centre of a 45-minute window that extends 22 minutes and 30 seconds before and after it. For instance, the window for df1[1,1] (2017-07-09 00:00:00) runs from 2017-07-08 23:37:30 to 2017-07-09 00:22:30.
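To make the window arithmetic concrete, here is a minimal base-R sketch of the bounds for the first df1 time. The names center, half_window, lo and hi are only illustrative and do not appear in my real code.

```r
# Centre of the window: the first DateTime45 value from df1
center <- as.POSIXct("2017-07-09 00:00:00", tz = "UTC")

# Half of a 45-minute window, in seconds (22 min 30 s = 1350 s)
half_window <- 22.5 * 60

# Adding or subtracting a number from a POSIXct shifts it by that many seconds
lo <- center - half_window
hi <- center + half_window

print(format(lo, "%Y-%m-%d %H:%M:%S", tz = "UTC"))
print(format(hi, "%Y-%m-%d %H:%M:%S", tz = "UTC"))
```

Any df2 row whose DateTime falls between lo and hi should count towards the first row of df1.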

In this example, I would expect to get this:

df3
           DateTime45 ID VariableX meanVedba n_vedba
1 2017-07-09 00:00:00  A         0 1.2850000       4
2 2017-07-09 00:45:00  A         2 1.7750000       4
3 2017-07-09 02:15:00  A         0 1.5833333       3
4 2017-07-09 03:45:00  A         4 0.8266667       3

*Note: I include an n_vedba variable to check whether the code takes the right number of rows from df2.

My attempt was this code:

library(data.table)
library(lubridate)

setDT(df1)[, DateTime45 := ymd_hms(DateTime45)]
setDT(df2)[, dt_floor := round_date(ymd_hms(DateTime), unit = "45 mins")]
df3 <- df2[df1, .(meanVedba = mean(vedba),
                  n_vedba = .N),
           on = .(ID, dt_floor = DateTime45), by = .EACHI]

df3

   ID            dt_floor meanVedba n_vedba
1:  A 2017-07-09 00:00:00 0.4850000       2
2:  A 2017-07-09 00:45:00 2.3333333       3
3:  A 2017-07-09 02:15:00        NA       0
4:  A 2017-07-09 03:45:00 0.8266667       3

However, as you can see, I don't get what I would expect.

Does anyone know why and how to change the code in order to accomplish what I want?


Extra comment

When I have hourly intervals instead of 45-minute intervals, the code I showed works.

  1. I create the dataframes
df1<- data.frame(DateTime=c("2017-07-09 00:00:00","2017-07-09 01:00:00","2017-07-09 02:00:00","2017-07-09 03:00:00","2017-07-09 04:00:00"),
                 ID=c("A","A","A","A","A"),
                 VariableX=c(0,2,0,4,7))
df1$DateTime<- as.POSIXct(df1$DateTime, format="%Y-%m-%d %H:%M:%S",tz="UTC")
df1

             DateTime ID VariableX
1 2017-07-09 00:00:00  A         0
2 2017-07-09 01:00:00  A         2
3 2017-07-09 02:00:00  A         0
4 2017-07-09 03:00:00  A         4
5 2017-07-09 04:00:00  A         7

df2<- data.frame(DateTime= c("2017-07-08 23:40:57.245","2017-07-08 23:58:12.945","2017-07-09 00:01:00.345","2017-07-09 00:07:12.845","2017-07-09 00:28:34.845","2017-07-09 00:31:46.567","2017-07-09 00:53:21.345","2017-07-09 01:01:34.545","2017-07-09 01:09:12.246","2017-07-09 01:23:12.321","2017-07-09 01:34:26.687","2017-07-09 01:57:08.687","2017-07-09 02:05:23.789","2017-07-09 02:32:24.789","2017-07-09 02:42:34.536","2017-07-09 02:59:00.098","2017-07-09 03:03:01.434","2017-07-09 03:11:38.987","2017-07-09 03:23:31.345","2017-07-09 03:28:21.345","2017-07-09 03:42:53.345"),
                 ID=c("A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A"),
                 vedba=c(1.87,2.3,0.3,0.67,1.3,2.1,3.6,0.1,0.8,1.3,2.4,1.5,1.23,2.02,1.89,0.78,1.11,2.13,1.20,0.34,0.94))
df2$DateTime<- as.POSIXct(df2$DateTime, format="%Y-%m-%d %H:%M:%OS",tz="UTC")

df2

               DateTime ID vedba            dt_floor
 1: 2017-07-08 23:40:57  A  1.87 2017-07-09 00:00:00
 2: 2017-07-08 23:58:12  A  2.30 2017-07-09 00:00:00
 3: 2017-07-09 00:01:00  A  0.30 2017-07-09 00:00:00
 4: 2017-07-09 00:07:12  A  0.67 2017-07-09 00:00:00
 .            .          .    .            .
 .            .          .    .            .

  2. I calculate the mean vedba for hourly bins
setDT(df1)[, DateTime := ymd_hms(DateTime)]
setDT(df2)[, dt_floor := round_date(ymd_hms(DateTime), unit = "hour")]
df3 <- df2[df1, .(meanVedba = mean(vedba),
                  n_vedba = .N),
           on = .(ID, dt_floor = DateTime), by = .EACHI]

df3

   ID            dt_floor meanVedba n_vedba
1:  A 2017-07-09 00:00:00  1.288000       5
2:  A 2017-07-09 01:00:00  1.580000       5
3:  A 2017-07-09 02:00:00  1.710000       3
4:  A 2017-07-09 03:00:00  1.352857       7
5:  A 2017-07-09 04:00:00  0.940000       1

2 Answers

You need a non-equi join:

library(data.table)
library(lubridate)
df1<- data.frame(DateTime=c("2017-07-09 00:00:00","2017-07-09 00:45:00","2017-07-09 02:15:00","2017-07-09 03:45:00"),
                 ID=c("A","A","A","A"),
                 VariableX=c(0,2,0,4))
df1$DateTime<- as.POSIXct(df1$DateTime, format="%Y-%m-%d %H:%M:%S",tz="UTC")

df2<- data.frame(DateTime= c("2017-07-08 23:40:57.245","2017-07-08 23:58:12.945","2017-07-09 00:01:00.345","2017-07-09 00:07:12.845","2017-07-09 00:28:34.845","2017-07-09 00:31:46.567","2017-07-09 00:53:21.345","2017-07-09 01:01:34.545","2017-07-09 01:09:12.246","2017-07-09 01:23:12.321","2017-07-09 01:34:26.687","2017-07-09 01:57:08.687","2017-07-09 02:05:23.789","2017-07-09 02:32:24.789","2017-07-09 02:42:34.536","2017-07-09 02:59:00.098","2017-07-09 03:03:01.434","2017-07-09 03:11:38.987","2017-07-09 03:23:31.345","2017-07-09 03:28:21.345","2017-07-09 03:42:53.345"),
                 ID=c("A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A","A"),
                 vedba=c(1.87,2.3,0.3,0.67,1.3,2.1,3.6,0.1,0.8,1.3,2.4,1.5,1.23,2.02,1.89,0.78,1.11,2.13,1.20,0.34,0.94))
df2$DateTime<- as.POSIXct(df2$DateTime, format="%Y-%m-%d %H:%M:%OS",tz="UTC")
setDT(df1)
setDT(df2)
df1[, date_lo := DateTime - minutes(22) - seconds(30)]
df1[, date_hi := DateTime + minutes(22) + seconds(30)]
df2[df1, .(mean = mean(vedba),
           N = .N), on = .(ID, DateTime <= date_hi, DateTime >= date_lo), .EACHI]

   ID            DateTime            DateTime      mean N
1:  A 2017-07-09 00:22:30 2017-07-08 23:37:30 1.2850000 4
2:  A 2017-07-09 01:07:30 2017-07-09 00:22:30 1.7750000 4
3:  A 2017-07-09 02:37:30 2017-07-09 01:52:30 1.5833333 3
4:  A 2017-07-09 04:07:30 2017-07-09 03:22:30 0.8266667 3
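
For readers who want to see what the join is doing, the following base-R sketch applies the same window filter row by row. It is only an illustration of the logic, not a replacement for the data.table join, and it will be much slower on large data. The names half_window, per_row, hit and out are hypothetical helpers introduced here.

```r
df1 <- data.frame(
  DateTime = as.POSIXct(c("2017-07-09 00:00:00", "2017-07-09 00:45:00",
                          "2017-07-09 02:15:00", "2017-07-09 03:45:00"), tz = "UTC"),
  ID = "A",
  VariableX = c(0, 2, 0, 4)
)
df2 <- data.frame(
  DateTime = as.POSIXct(c(
    "2017-07-08 23:40:57.245", "2017-07-08 23:58:12.945", "2017-07-09 00:01:00.345",
    "2017-07-09 00:07:12.845", "2017-07-09 00:28:34.845", "2017-07-09 00:31:46.567",
    "2017-07-09 00:53:21.345", "2017-07-09 01:01:34.545", "2017-07-09 01:09:12.246",
    "2017-07-09 01:23:12.321", "2017-07-09 01:34:26.687", "2017-07-09 01:57:08.687",
    "2017-07-09 02:05:23.789", "2017-07-09 02:32:24.789", "2017-07-09 02:42:34.536",
    "2017-07-09 02:59:00.098", "2017-07-09 03:03:01.434", "2017-07-09 03:11:38.987",
    "2017-07-09 03:23:31.345", "2017-07-09 03:28:21.345", "2017-07-09 03:42:53.345"),
    tz = "UTC"),
  ID = "A",
  vedba = c(1.87, 2.3, 0.3, 0.67, 1.3, 2.1, 3.6, 0.1, 0.8, 1.3, 2.4, 1.5,
            1.23, 2.02, 1.89, 0.78, 1.11, 2.13, 1.20, 0.34, 0.94)
)

half_window <- 22.5 * 60  # seconds on each side of the df1 time

# For each df1 row, keep the df2 rows with the same ID whose time lies
# within half_window seconds of the df1 time (both ends inclusive)
per_row <- lapply(seq_len(nrow(df1)), function(i) {
  gap <- abs(as.numeric(difftime(df2$DateTime, df1$DateTime[i], units = "secs")))
  hit <- df2$ID == df1$ID[i] & gap <= half_window
  data.frame(mean = mean(df2$vedba[hit]), N = sum(hit))
})
out <- cbind(df1, do.call(rbind, per_row))
print(out)
```

The mean and N columns should match the join output above; the join simply evaluates the same two inequalities for all rows at once.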

Well, I thought of working around it differently. First I switched your POSIXct for POSIXlt, and I applied it to both df1 and df2 (instead of just df1).

So I ran this:

df1$DateTime45<- as.POSIXlt(df1$DateTime45, format="%Y-%m-%d %H:%M:%OS",tz="UTC")
df2$DateTime<- as.POSIXlt(df2$DateTime, format="%Y-%m-%d %H:%M:%OS",tz="UTC")

Then I decided to use a condition: since you have times, you can check whether the absolute difference between each df2 time and each df1 time is less than 22.5 minutes. I did it with 2 nested for loops:

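half_window_secs <- 22.5 * 60
for (i in seq_len(nrow(df1))) {
  for (n in seq_len(nrow(df2))) {
    # Measure the gap in explicit seconds so the difftime units cannot vary
    gap <- abs(as.numeric(difftime(df1$DateTime45[i], df2$DateTime[n], units = "secs")))
    # Overwrite the df2 time with the df1 time whose window it falls in
    if (gap < half_window_secs) df2$DateTime[n] <- df1$DateTime45[i]
  }
}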

So far, I have basically overwritten (converted) all df2 dates with the relevant df1 ones. So be mindful: if you want to keep your original df2 dates and times, you should run this on a copy of df2.
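
If the nested loops are too slow, the same relabelling can probably be done in one vectorised step. This is a sketch under two assumptions that are not stated in the question: all times are in UTC, and every df1 time sits on a 45-minute grid counted from midnight. Under those assumptions, rounding the epoch seconds to the nearest multiple of 2700 gives the df1 time directly, because 86400 / 2700 = 32 bins fit exactly into a day. The column name bin and the helper names bin_secs and stats are made up for this example, and only the first rows of the sample data are used to keep it short.

```r
bin_secs <- 45 * 60  # 2700 seconds per bin

df1 <- data.frame(
  DateTime45 = as.POSIXct(c("2017-07-09 00:00:00", "2017-07-09 00:45:00"), tz = "UTC"),
  ID = "A",
  VariableX = c(0, 2)
)
df2 <- data.frame(
  DateTime = as.POSIXct(c(
    "2017-07-08 23:40:57.245", "2017-07-08 23:58:12.945",
    "2017-07-09 00:01:00.345", "2017-07-09 00:07:12.845",
    "2017-07-09 00:28:34.845", "2017-07-09 00:31:46.567",
    "2017-07-09 00:53:21.345", "2017-07-09 01:01:34.545"), tz = "UTC"),
  ID = "A",
  vedba = c(1.87, 2.3, 0.3, 0.67, 1.3, 2.1, 3.6, 0.1)
)

# Round every df2 time to the nearest 45-minute grid point (centred windows)
df2$bin <- as.POSIXct(round(as.numeric(df2$DateTime) / bin_secs) * bin_secs,
                      origin = "1970-01-01", tz = "UTC")

# Mean and count of vedba per ID and bin
stats <- aggregate(vedba ~ ID + bin, data = df2, FUN = mean)
names(stats)[names(stats) == "vedba"] <- "meanVedba"
stats$n_vedba <- aggregate(vedba ~ ID + bin, data = df2, FUN = length)$vedba

# Attach the summaries to df1; bins with no df2 rows would come back as NA
df3 <- merge(df1, stats, by.x = c("ID", "DateTime45"), by.y = c("ID", "bin"),
             all.x = TRUE)
print(df3)
```

Note that round() decides where a value lying exactly on a window boundary goes, so boundary cases may be assigned differently from the strict "less than 22.5 minutes" test in the loops.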

Now finally we can calculate the mean vedba and join it to df1, again using a simple for loop:

# One mean per df1 row, taken over the df2 rows relabelled with that df1 time
means <- numeric(nrow(df1))
for (i in seq_len(nrow(df1))) {
  means[i] <- mean(df2[df1$DateTime45[i] == df2$DateTime, ]$vedba)
}
df1 <- cbind(df1, means = means)
rm(means)

Now printing df1 gives us:

           DateTime45 ID VariableX     means
1 2017-07-09 00:00:00  A         0 1.2850000
2 2017-07-09 00:45:00  A         2 1.7750000
3 2017-07-09 02:15:00  A         0 1.5833333
4 2017-07-09 03:45:00  A         4 0.8266667

2 Comments

Thanks for your reply! I am wondering whether a for loop is appropriate in my case, since my real dataframe 'df2' has millions of rows and 'df1' several tens of thousands. I will take a look, although I will wait in case someone else can tell me how to change the code I tried in order to reach my goal. I think it should be something silly.
Also, I forgot to mention that you need to load the "lubridate" package for this to work.
