7

I'm trying to create many different possible weighting schemes based on temperature.

I created a data frame with all possible combinations of 8 vectors (each vector represents a temperature range). So the columns of the data frame are a specific temperature range and the rows are weights.

I would like to pass the temperature ranges as arguments to case_when, and loop through each row of the weights data frame, creating a new variable for each row based on the actual temperature and the associated weight for that temperature based on the information in the weights data frame.

Using the following post, I was able to create a function to produce the weights data frame:

Use dplyr::case_when with arguments programmatically

But I don't know how to construct the case_when arguments using the weights data frame.

Function to create data frame of all possible weights

library(rlang)
library(tidyverse)

create_temp_weights <- function(
  from = 31,
  to = 100,
  by = 10,
  weights = exprs(between(., 31, 40) ~ c(0, 0.2),
                  between(., 41, 50) ~ c(0.5, 0.8),
                  between(., 51, 90) ~ c(0.8, 1),
                  between(., 91, 100) ~ c(0.2, 0.8),
                  TRUE ~ c(-0.1, 0))
) {

  # use 999 to map other temperatures to last case
  map(c(seq(from, to, by), 999), ~ case_when(!!!weights)) %>%
    set_names(c(map_chr(seq(from, to, by),
                      ~ str_c("temp_", ., "_", . + by - 1)), "temp_other")) %>%
  cross_df(.)

}

temp_weights <- create_temp_weights()

Create tibble with vector of temperatures used to construct the weights

test_tibble <- tibble(temp = seq_len(100))

head(test_tibble)

The following case_when is the thing I'm trying to produce programmatically using the weights data frame.

# Now I want to create a function that will produce the following
# case_when from the temp_weight data frame so I don't have to
# manually edit the following each time I create a new weights data frame

test_tibble2 <- map_dfc(.x = seq_len(nrow(temp_weights)),
    ~ transmute(
      test_tibble,
      temp =
        case_when(
          temp >= 31   & temp  <= 40   ~  temp_weights$temp_31_40[.x],
          temp >= 41   & temp  <= 50   ~  temp_weights$temp_41_50[.x],
          temp >= 51   & temp  <= 60   ~  temp_weights$temp_51_60[.x],
          temp >= 61   & temp  <= 70   ~  temp_weights$temp_61_70[.x],
          temp >= 71   & temp  <= 80   ~  temp_weights$temp_71_80[.x],
          temp >= 81   & temp  <= 90   ~  temp_weights$temp_81_90[.x],
          temp >= 91   & temp  <= 100  ~  temp_weights$temp_91_100[.x],
          TRUE & !is.na(temp)          ~  temp_weights$temp_other[.x]
        )
      ) %>% set_names(paste0("temp_wt_", .x))
    ) 

head(test_tibble2)

So what I'm looking for is a function that constructs the case_when arguments from a weights data frame.

5
  • Not sure but maybe cut might be a more efficient function for the job? Commented Jul 31, 2019 at 16:24
  • 1
    Yeah, I should have mentioned that efficiency is important because the actual temperature variable from which the weights are constructed is 40 million observations long. So if cut can be used here more efficiently I'm happy to switch. But then the question becomes: How can I generate the arguments to cut from the weights data frame? Commented Jul 31, 2019 at 16:33
  • 1
    It's a very good question! If you run debugonce(case_when) and then your map_dfc call you can observer that case_when arguments are parsed via fs <- compact_null(list2(...)). One potential solution would be to use trace to replace that fs object with the values you would programmatically generate using the data frame you have. Commented Jul 31, 2019 at 16:34
  • So basically, my initial thinking would be to approach this like that this is a very dirty solution that attempts to inject the externally constructed case_when object. This shows my thinking but returns an error as you would have to understand that str behind the fs object and then replace it properly; not like in my lame attempt. Commented Jul 31, 2019 at 16:45
  • @Konrad Thank you for your comments. I'm not sure how the arguments to case_when are being generated programmatically from the weights data frame. If I change the number of temperature intervals using the function in the beginning of my post, the number of arguments to case_when will need to change as well. I just don't want to have to do this manually. Commented Jul 31, 2019 at 16:55

2 Answers 2

1

Closely mimicking OP:

windows <- 
  str_extract_all(names(temp_weights), "\\d+") %>% 
  modify(as.integer) %>% 
  modify_if(negate(length), ~ c(-Inf, Inf)) %>% 
  set_names(names(temp_weights))

temp <- test_tibble$temp

res <-
  map_dfc(
    seq_len(nrow(temp_weights)), 
    ~ {
      row <- .
      rlang::eval_tidy(expr(case_when(
        !!! imap(
          windows, 
          ~ expr(
            between(temp, !! .x[1], !! .x[2]) ~ !! temp_weights[[.y]][row]
          )
        )
      )))
    }
  ) %>% 
  set_names(paste0("temp_wt_", seq_along(.)))

all.equal(res, test_tibble2)
#> [1] TRUE 

Slightly more efficient (not repeating case_when for each weight combination):

res2 <- 
  rlang::eval_tidy(expr(case_when(
    !!! imap(
      windows, 
      ~ expr(
        between(temp, !! .x[1], !! .x[2]) ~ list(!! temp_weights[[.y]])
      )
    )
  ))) %>% 
  do.call(what = rbind) %>% 
  as_tibble() %>% 
  set_names(paste0("temp_wt_", seq_along(.)))

all.equal(res2, test_tibble2)
#> [1] TRUE   
Sign up to request clarification or add additional context in comments.

Comments

0

This is meant to supplement the accepted answer by Aurèle.

Here, I compare efficiency between Aurèle's two proposed solutions and a final solution using data.table, which also provides the option to preserve NAs.

suppressPackageStartupMessages(library(rlang))
suppressPackageStartupMessages(library(tidyverse))
suppressPackageStartupMessages(library(tictoc))

create_temp_weights <- function(
  from = 31,
  to = 100,
  by = 10,
  weights = exprs(between(., 31, 40) ~ c(0, 0.2),
                  between(., 41, 50) ~ c(0.5, 0.8),
                  between(., 51, 90) ~ c(0.8, 1),
                  between(., 91, 100) ~ c(0.2, 0.8),
                  TRUE ~ c(-0.1, 0))
) {

  # use 999 to map other temperatures to last case
  map(c(seq(from, to, by), 999), ~ case_when(!!!weights)) %>%
    set_names(c(map_chr(seq(from, to, by),
                        ~ str_c("temp_", ., "_", . + by - 1)), "temp_other")) %>%
    cross_df(.)

}

temp_weights <- create_temp_weights()

test_tibble <- tibble(temp = rnorm(1000000, 50, 15))

test_tibble2 <- map_dfc(.x = seq_len(nrow(temp_weights)),
                        ~ transmute(
                          test_tibble,
                          temp =
                            case_when(
                              temp >= 31   & temp  <= 40   ~  temp_weights$temp_31_40[.x],
                              temp >= 41   & temp  <= 50   ~  temp_weights$temp_41_50[.x],
                              temp >= 51   & temp  <= 60   ~  temp_weights$temp_51_60[.x],
                              temp >= 61   & temp  <= 70   ~  temp_weights$temp_61_70[.x],
                              temp >= 71   & temp  <= 80   ~  temp_weights$temp_71_80[.x],
                              temp >= 81   & temp  <= 90   ~  temp_weights$temp_81_90[.x],
                              temp >= 91   & temp  <= 100  ~  temp_weights$temp_91_100[.x],
                              TRUE & !is.na(temp)          ~  temp_weights$temp_other[.x]
                            )
                        ) %>% set_names(paste0("temp_wt_", .x))
) 

windows <- 
  str_extract_all(names(temp_weights), "\\d+") %>% 
  modify(as.integer) %>% 
  modify_if(negate(length), ~ c(-Inf, Inf)) %>% 
  set_names(names(temp_weights))

Solution #1

temp <- test_tibble$temp

tic()
res <-
  map_dfc(
    seq_len(nrow(temp_weights)), 
    ~ {
      row <- .
      rlang::eval_tidy(expr(case_when(
        !!! imap(
          windows, 
          ~ expr(
            between(temp, !! .x[1], !! .x[2]) ~ !! temp_weights[[.y]][row]
          )
        )
      )))
    }
  ) %>% 
  set_names(paste0("temp_wt_", seq_along(.)))
toc()
#> 65.18 sec elapsed

all.equal(res, test_tibble2)
#> [1] TRUE

Solution #2

tic()
res2 <- 
  rlang::eval_tidy(expr(case_when(
    !!! imap(
      windows, 
      ~ expr(
        between(temp, !! .x[1], !! .x[2]) ~ list(!! temp_weights[[.y]])
      )
    )
  ))) %>% 
  do.call(what = rbind) %>% 
  as_tibble() %>% 
  set_names(paste0("temp_wt_", seq_along(.)))
#> Warning: `as_tibble.matrix()` requires a matrix with column names or a `.name_repair` argument. Using compatibility `.name_repair`.
#> This warning is displayed once per session.
toc()
#> 2.76 sec elapsed

all.equal(res2, test_tibble2)
#> [1] TRUE

Solution #3 Using data.table

tic()
res3 <-
  rlang::eval_tidy(expr(case_when(
    !!! imap(
      windows,
      ~ expr(
        between(temp, !! .x[1], !! .x[2]) ~ list(!! temp_weights[[.y]])
      )
    )
  ))) %>%
  data.table::transpose(., fill = NA) %>%
  set_names(paste0("temp_wt_", seq_along(.))) %>%
  as_tibble()
toc()
#> 4.69 sec elapsed

all.equal(res3, test_tibble2)
#> [1] TRUE

In summary, solution #2 seems to be the fastest (2.76 sec) followed by the data.table solution (4.69 sec). However, I appreciate that the data.table solution has the fill option to preserve NAs.

Created on 2019-08-02 by the reprex package (v0.3.0)

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.