0

I’m working with machine events data in R, where each event has a start and end time, a unique event code, and a precedence level (ordered factor). Events may overlap, and I need to transform this into non-overlapping time intervals. Each resulting interval should inherit the highest precedence event active during that time.

Additionally:

  • If there are gaps (i.e., no events) between intervals, those gaps should be filled with a dummy event (event_code = 0, precedence = Level 0).
  • Contiguous intervals with the same event code and precedence should be merged to avoid fragmentation.

Here’s a reproducible example:

Input data (startstate)

startstate <- structure(list(
  event_code = c(101, 202, 303, 202, 404),
  start_time = structure(c(1735689600, 1735707600, 1735725600, 1735743600, 1735756200), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
  end_time   = structure(c(1735718400, 1735722000, 1735740000, 1735758000, 1735776000), class = c("POSIXct", "POSIXt"), tzone = "UTC"),
  precedence = structure(c(1L, 2L, 1L, 2L, 3L), levels = c("Level 1", "Level 2", "Level 3"), class = c("ordered", "factor"))
), class = "data.frame", row.names = c(NA, -5L))

> startstate
  event_code          start_time            end_time precedence
1        101 2025-01-01 00:00:00 2025-01-01 08:00:00    Level 1
2        202 2025-01-01 05:00:00 2025-01-01 09:00:00    Level 2
3        303 2025-01-01 10:00:00 2025-01-01 14:00:00    Level 1
4        202 2025-01-01 15:00:00 2025-01-01 19:00:00    Level 2
5        404 2025-01-01 18:30:00 2025-01-02 00:00:00    Level 3

Desired output (endstate)

endstate <- structure(list(
  event_code = c(101, 202, 0, 303, 0, 202, 404),
  start_time = structure(c(1735689600, 1735707600, 1735722000, 1735725600, 1735740000, 1735743600, 1735756200), tzone = "UTC", class = c("POSIXct", "POSIXt")),
  end_time   = structure(c(1735707600, 1735722000, 1735725600, 1735740000, 1735743600, 1735756200, 1735776000), tzone = "UTC", class = c("POSIXct", "POSIXt")),
  precedence = c("Level 1", "Level 2", "Level 0", "Level 1", "Level 0", "Level 2", "Level 3")
), row.names = c(NA, -7L), class = c("tbl_df", "tbl", "data.frame"))

> endstate
  event_code start_time          end_time            precedence
1        101 2025-01-01 00:00:00 2025-01-01 05:00:00 Level 1   
2        202 2025-01-01 05:00:00 2025-01-01 09:00:00 Level 2   
3          0 2025-01-01 09:00:00 2025-01-01 10:00:00 Level 0   
4        303 2025-01-01 10:00:00 2025-01-01 14:00:00 Level 1   
5          0 2025-01-01 14:00:00 2025-01-01 15:00:00 Level 0   
6        202 2025-01-01 15:00:00 2025-01-01 18:30:00 Level 2   
7        404 2025-01-01 18:30:00 2025-01-02 00:00:00 Level 3   

Edited: Fixed event_code typo in endstate. Thanks @jon-spring for spotting it.

I’ve had some success using the excellent ivs package (because of Right-open intervals) to untangle overlaps, but I run into trouble when:

  • The result contains adjacent rows with the same event code and precedence, which should ideally be merged.
  • I’m not sure of the cleanest way to insert filler rows for gaps.

Here’s an example of the undesired intermediate output:

event_code start_time          end_time            precedence
       101 2025-01-01 00:00:00 2025-01-01 05:00:00 Level 1          
       202 2025-01-01 05:00:00 2025-01-01 08:00:00 Level 2          
       202 2025-01-01 08:00:00 2025-01-01 09:00:00 Level 2  <- should merge with row above

Any ideas on how to:

  1. Break overlapping intervals into disjoint ones with correct precedence,
  2. Insert zero-precedence filler intervals for gaps, and
  3. Collapse adjacent intervals with identical attributes?
0

2 Answers 2

1

This is kludgey but it works for the example data. I'd like to see more elegant approaches.

The approach here is to reshape the data long so that each interval is a "start" observation and and "end" observation, and then we track how many active observations there are per precedence level, put those in separate columns, add a dummy "Level 0" level, and fill down. This gives a timeline of which Levels currently have at least one active interval.

Then we can pivot longer again to find the highest observed precedence at each time (i.e. where the count is > 0). We keep the rows where this changes, and add back to the original data, setting end_time to be the next start_time.

startstate |>
  pivot_longer(start_time:end_time, values_to = "start_time") |>
  arrange(start_time) |>
  mutate(active_by_level = cumsum(name == "start_time") - 
           cumsum(name == "end_time"), .by = precedence) |>
  pivot_wider(names_from = precedence, values_from = active_by_level) |>
  mutate(`Level 0` = 1) |>
  fill(everything()) |>

  pivot_longer(-c(event_code:start_time), names_to = "precedence", values_drop_na = TRUE) |>
  summarize(precedence = max(precedence[value > 0]), .by = start_time) |>
  filter(precedence != lag(precedence, 1, "")) |>
  left_join(startstate) |>
  transmute(event_code = coalesce(as.character(event_code), "0"),
            start_time, end_time = coalesce(lead(start_time), end_time),
            precedence) |>
  filter(!is.na(end_time))

Result

  event_code start_time          end_time            precedence
  <chr>      <dttm>              <dttm>              <chr>     
1 101        2025-01-01 00:00:00 2025-01-01 05:00:00 Level 1   
2 202        2025-01-01 05:00:00 2025-01-01 09:00:00 Level 2   
3 0          2025-01-01 09:00:00 2025-01-01 10:00:00 Level 0   
4 303        2025-01-01 10:00:00 2025-01-01 14:00:00 Level 1   
5 0          2025-01-01 14:00:00 2025-01-01 15:00:00 Level 0   
6 202        2025-01-01 15:00:00 2025-01-01 18:30:00 Level 2   
7 404        2025-01-01 18:30:00 2025-01-02 00:00:00 Level 3

(This output varies from your desired output in "event_code" in row 4 and 7, but it seems possible you have a typo in either the source data or the desired output.)

Sign up to request clarification or add additional context in comments.

3 Comments

Thanks for having a crack at it, Jon. No, rows 4 & 7 in my desired output have the correct end times. Intervals may not overlap across rows.
Please see my edited code which fixes the end_times. It was the "event_code" column that didn't make sense to me in your output.
You're right; there was indeed a typo in the event_code, which I have now fixed in text. And the solution works too, although it is difficult to wrap your head around it. Thanks, much appreciated! I will test it on some more complex cases. :-)
0

Suppose you have the following intermediate data

library(tidyverse)

intermediate <- tibble(event_code = c(101, 202, 202, 202),
                       start_time = c(make_datetime(2025,1,1), make_datetime(2025,1,1,5), make_datetime(2025,1,1,8), make_datetime(2025,1,1,12)),
                       end_time = c(make_datetime(2025,1,1,5), make_datetime(2025,1,1,8), make_datetime(2025,1,1,9), make_datetime(2025,1,1,15)),
                       precendence = c(1,2,2, 2))

which looks like

> intermediate
# A tibble: 4 × 4
  event_code start_time          end_time            precendence
       <dbl> <dttm>              <dttm>                    <dbl>
1        101 2025-01-01 00:00:00 2025-01-01 05:00:00           1
2        202 2025-01-01 05:00:00 2025-01-01 08:00:00           2
3        202 2025-01-01 08:00:00 2025-01-01 09:00:00           2
4        202 2025-01-01 12:00:00 2025-01-01 15:00:00           2

To merge the adjacent intervals, use lead() to peek at the line below

need_to_merge <- intermediate |> 
    arrange(event_code, precendence, start_time) |> 
    mutate(.by = c(event_code, precendence),
           next_end_time = lead(end_time)) |> 
    filter(.by = c(event_code, precendence),
           end_time == lead(start_time)) |> 
    mutate(end_time = next_end_time, .keep = "unused")
    
merged <- intermediate |> 
    anti_join(need_to_merge, by = join_by(event_code, precendence, start_time)) |> 
    anti_join(need_to_merge, by = join_by(event_code, precendence, x$end_time == y$end_time)) |> 
    bind_rows(need_to_merge)

Now to add in filler rows, we basically use the same method:

need_to_add <- merged |> 
    arrange(start_time) |> 
    mutate(next_start_time = lead(start_time)) |> 
    filter(next_start_time > end_time) |> 
    mutate(start_time = end_time,
           end_time = next_start_time,
           precendence = 0,
           event_code = 0,
           .keep = "unused")

merged |> 
    bind_rows(need_to_add) |> 
    arrange(start_time)

which gives

# A tibble: 4 × 4
  event_code start_time          end_time            precendence
       <dbl> <dttm>              <dttm>                    <dbl>
1        101 2025-01-01 00:00:00 2025-01-01 05:00:00           1
2        202 2025-01-01 05:00:00 2025-01-01 09:00:00           2
3          0 2025-01-01 09:00:00 2025-01-01 12:00:00           0
4        202 2025-01-01 12:00:00 2025-01-01 15:00:00           2

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.