I have a for loop that I would like to run by group. It should run through a set of data, create a time series for most rows, and then output a forecast for each row (based on that time point and the ones preceding it in the group). The issue I am having is running that loop for every 'group' within my data. I want to avoid doing so manually, as that would take hours and surely there is a better way.

Allow me to explain in more detail.

I have a large dataset (1.6M rows), each row has a year, country A, country B, and a number of measures which concern the relationship between the two.

So far, I have been successful in extracting a single (country A, country B) relationship into a new table and using a for loop to output the necessary forecast data to a new variable in the dataset. I'd like to have that for loop run over every (country A, country B) grouping with more than 3 entries.

The data:

Here I will replicate a small slice of the data, and will include a missing value for realism.

set.seed(2000)
df <- data.frame(year = rep(1946:1970, length.out = 50),
                 ccode1 = rep("2", length.out = 50),
                 ccode2 = rep(c("20", "31"), each = 25),
                 kappavv = rnorm(50, mean = 0, sd = 0.25),
                 output = NA)
df$kappavv[12] <- NA

What I've done:

NOTE: I start forecasting from the third data point of each group but based on all time points preceding the forecast.

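library(imputeTS)  # na_interpolation()
library(forecast)  # holt()

for (i in 3:nrow(df)) {
  # Series from the first year up to the year of row i
  dat_ts <- ts(df[, 4], start = c(min(df$year), 1), end = c(df$year[i], 1), frequency = 1)
  # Fill missing values by linear interpolation (the default)
  dat_ts_corr <- na_interpolation(dat_ts)
  # One-step-ahead Holt forecast from the series ending at row i
  trialseries <- holt(dat_ts_corr, h = 1)
  df$output[i] <- trialseries$mean
}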

This part works and outputs what I want when I apply it to a single pairing of ccode1 and ccode2 when arranged correctly in ascending order of years.

What isn't working:

I am having serious problems getting my head around applying this for loop by grouping of ccode2. Some of my data is uneven: groups are sometimes different sizes, have different start/end points, and contain missing data.
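To show what I mean by uneven, a summary along these lines lists each pair's size, year range, and missing values. It is only a rough sketch: group_summary and its column names are made up for illustration, the sample data is rebuilt so the block runs on its own, and the gap check assumes at most one row per year within a pair.

```r
library(dplyr)

# Rebuild the sample data from above so this block is self-contained
set.seed(2000)
df <- data.frame(year = rep(1946:1970, length.out = 50),
                 ccode1 = rep("2", length.out = 50),
                 ccode2 = rep(c("20", "31"), each = 25),
                 kappavv = rnorm(50, mean = 0, sd = 0.25),
                 output = NA_real_)
df$kappavv[12] <- NA

# One row per (ccode1, ccode2) pair describing how "uneven" it is
group_summary <- df %>%
  group_by(ccode1, ccode2) %>%
  summarise(
    n_rows        = n(),
    first_year    = min(year),
    last_year     = max(year),
    n_missing     = sum(is.na(kappavv)),
    # TRUE when the pair skips at least one year (assumes unique years per pair)
    has_year_gaps = n() != (max(year) - min(year) + 1),
    .groups = "drop"
  )

print(group_summary)
```

In the real data, pairs with n_rows of 3 or fewer are the ones I want to skip.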

I have tried expressing the loop as a function, using group_by() with piping, and using various apply() functions.

Your help is appreciated. Thanks in advance. I am glad to answer any clarifying questions you have.

1 Answer

You can put the for loop code in a function.

library(dplyr)
library(purrr)

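apply_func <- function(df) {
  # Groups with fewer than 3 rows have nothing to forecast;
  # without this guard 3:nrow(df) would count backwards
  if (nrow(df) < 3) return(df)
  for (i in 3:nrow(df)) {
    # Series from the group's first year up to the year of row i
    dat_ts <- ts(df$kappavv, start = c(min(df$year), 1),
                 end = c(df$year[i], 1), frequency = 1)
    dat_ts_corr <- imputeTS::na_interpolation(dat_ts)
    trialseries <- forecast::holt(dat_ts_corr, h = 1)
    # $mean is a length-one ts; store it as a plain number
    df$output[i] <- as.numeric(trialseries$mean)
  }
  return(df)
}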

Split the data by ccode2 and apply apply_func.

df %>% group_split(ccode2) %>% map_df(apply_func)

#    year ccode1 ccode2 kappavv  output
#   <int> <chr>  <chr>    <dbl>   <dbl>
# 1  1946 2      20     -0.213  NA     
# 2  1947 2      20     -0.0882 NA     
# 3  1948 2      20      0.223   0.286 
# 4  1949 2      20      0.435   0.413 
# 5  1950 2      20      0.229   0.538 
# 6  1951 2      20     -0.294   0.477 
# 7  1952 2      20     -0.485  -0.675 
# 8  1953 2      20      0.524   0.405 
# 9  1954 2      20      0.0564  0.0418
#10  1955 2      20      0.294   0.161 
# … with 40 more rows
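
If the full data has many (ccode1, ccode2) pairs, the same idea can be written with both keys and a size filter for the "more than 3 entries" requirement. Treat this as a sketch rather than a tested solution for the 1.6M-row data: forecast_pair and min_obs are names invented here, the sample data is rebuilt so the block runs on its own, and it assumes each pair has one row per consecutive year (the first i observations are treated as equally spaced).

```r
library(dplyr)

# Rebuild the sample data from the question so this block is self-contained
set.seed(2000)
df <- data.frame(year = rep(1946:1970, length.out = 50),
                 ccode1 = rep("2", length.out = 50),
                 ccode2 = rep(c("20", "31"), each = 25),
                 kappavv = rnorm(50, mean = 0, sd = 0.25),
                 output = NA_real_)
df$kappavv[12] <- NA

# Hypothetical helper: expanding-window Holt forecast for one country pair
forecast_pair <- function(d, min_obs = 3) {
  d <- d[order(d$year), ]          # make sure years are ascending
  d$output <- NA_real_
  if (nrow(d) < min_obs) return(d) # too short to forecast
  for (i in min_obs:nrow(d)) {
    # All observations up to and including row i
    y <- ts(d$kappavv[seq_len(i)], start = d$year[1], frequency = 1)
    # Interpolation needs at least two observed values; otherwise leave NA
    if (sum(!is.na(y)) < 2) next
    y <- imputeTS::na_interpolation(y)
    d$output[i] <- as.numeric(forecast::holt(y, h = 1)$mean)[1]
  }
  d
}

result <- df %>%
  group_by(ccode1, ccode2) %>%
  filter(n() > 3) %>%                       # keep pairs with more than 3 rows
  group_modify(~ forecast_pair(.x)) %>%     # run the loop once per pair
  ungroup()
```

group_modify() passes each pair's rows (without the grouping columns) to the function and binds the results back together, so it plays the same role as group_split() followed by map_df() above.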