3

I am working with an array with dimensions

[1] 290 259  55   4

For each repetition of the last three dimensions, I want to perform a rolling mean on the 290 elements of the first dimension, reducing the number of elements to 289. Finally, I need to create a data frame with the updated values.

The following code achieves what I need, but it takes a LONG time to run (actually, I have to interrupt it before the end).

library(zoo)

# Generate random data with same dimensions as mine
my.array <- array(1:16524200, dim=c(290,259,55,4))

# Get dimension sizes
dim2 <- dim(my.array)[2]
dim3 <- dim(my.array)[3]
dim4 <- dim(my.array)[4]

# Pre-allocate data frame to be used within the loop
df2 <- data.frame()

# Loop over dimensions
for (i in 1:dim4) {
  for (j in 1:dim3) {
    for (k in 1:dim2) {

      # Take rolling average
      u <- rollapply(my.array[,k,j,i], 2, mean)

      # Assemble data frame
      df1 <- data.frame(time=i, level=j, lat=k, wind=u)
      df2 <- rbind(df2, df1)

    }
  }
}
# Very slow, and uses only one machine core

I feel like it is possible to improve the processing time of this code by using vectorization or even some kind of parallelism, but I can't figure out how.

Any suggestions to make this code more efficient?

3
  • 3
    Never build data.frames iteratively. Each time you call rbind, it is copying the entire frame into a new object and over-writing df2. It might work for a few dozen, but (as you can see) this scales horribly. Commented Dec 20, 2019 at 2:12
  • @r2evans that makes sense, but... What would be an alternative? Commented Dec 20, 2019 at 2:19
  • 1
    In general, something <- lapply(list_of_stuff, somefunc) and then do.call(rbind, something) (though this question requires a little more). Commented Dec 20, 2019 at 2:21

3 Answers 3

6

apply() works on any number of dimensions so you can achieve the same result much more quickly using the following wrapped in as.data.frame.table() to efficiently convert the output from an array to a data frame:

library(zoo)
df <- as.data.frame.table(apply(my.array, c(2,3,4), rollmean, 2))

Not strictly necessary but this can be tidied up to match your original output:

idx <- sapply(df, is.factor)
df[idx] <- sapply(df[idx], as.integer)

df <- setNames(df[c(4,3,2,5)], c("time", "level", "lat", "wind"))

Check if the result is the same:

identical(df2, df)
[1] TRUE
Sign up to request clarification or add additional context in comments.

1 Comment

I had forgotten about as.data.frame.table, that does an incredible job of speeding things up (over 2x on initial testing).
5

Up front, you are suffering from the 2nd circle of R's Inferno (https://www.burns-stat.com/pages/Tutor/R_inferno.pdf): growing objects. Each time you call rbind, it makes a complete copy of the frame, does the r-binding, then overwrites that complete copy over the original variable name. So while it might work without noticeable slow-down for the first few dozen, it will slow down a bit over 100 or so ... and you're doing it 56,980 times.

It is generally much better to process things into a list and then do the rbind once at the end on the entire list, as in do.call(rbind, list_of_frames). Granted, you still may have the computational challenge of doing something potentially hard ... luckily zoo is about as efficient as you can get for window operations, and this one is not impossibly hard.

I'll demonstrate on a significantly-reduced problem set (since I don't think it matters if we're looking at 16M or 1.5M iterations.

my.array <- array(1:1502200, dim=c(290,259,5,4))
eg <- do.call(expand.grid, lapply(dim(my.array)[-1], seq_len))
dim(eg)
# [1] 5180    3
head(eg)
#   Var1 Var2 Var3
# 1    1    1    1
# 2    2    1    1
# 3    3    1    1
# 4    4    1    1
# 5    5    1    1
# 6    6    1    1

system.time({
  list_of_frames <- Map(function(i,j,k) {
    u <- zoo::rollapply(my.array[,i,j,k], 2, mean)
    data.frame(i, j, k, wind = u)
  }, eg[[1]], eg[[2]], eg[[3]])
})
#    user  system elapsed 
#    5.79    0.00    5.80 
head(list_of_frames[[5]])
#   i j k   wind
# 1 5 1 1 1161.5
# 2 5 1 1 1162.5
# 3 5 1 1 1163.5
# 4 5 1 1 1164.5
# 5 5 1 1 1165.5
# 6 5 1 1 1166.5

system.time({
  out <- do.call(rbind, list_of_frames)
})
#    user  system elapsed 
#    0.50    0.03    0.53 
nrow(out)
# [1] 1497020
rbind(head(out), tail(out))
#           i j k      wind
# 1         1 1 1       1.5
# 2         1 1 1       2.5
# 3         1 1 1       3.5
# 4         1 1 1       4.5
# 5         1 1 1       5.5
# 6         1 1 1       6.5
# 1497015 259 5 4 1502194.5
# 1497016 259 5 4 1502195.5
# 1497017 259 5 4 1502196.5
# 1497018 259 5 4 1502197.5
# 1497019 259 5 4 1502198.5
# 1497020 259 5 4 1502199.5

Explanation:

  • do.call(expand.grid, ...) is creating a frame of all the i,j,k combinations you need, dynamically on the dimensions of your array.
  • Map(f, is, js, ks) runs the function f with the 1st argument of each of is, js, and ks (notional for this bullet), so Map looks something like:

    f(is[1], js[1], ks[1])
    f(is[2], js[2], ks[2])
    f(is[3], js[3], ks[3])
    # ...
    
  • then we combine them in one call using do.call(rbind, ...). We really have to use do.call here because this call is analogous to

    rbind(list_of_frames[[1]], list_of_frames[[2]], ..., list_of_frames[[5180]])
    

    (over to you if you'd prefer to write out this version).

Comments

4

Another option to flatten the multidimensional array first before using data.table to calculate the rolling mean

library(data.table)
system.time({
    ans <- setDT(as.data.frame.table(my.array))[
        , .(wind=((Freq + shift(Freq)) / 2)[-1L]), 
        .(time=Var4, level=Var3, lat=Var2)]
    cols <- c("time", "level", "lat")
    ans[, (cols) := lapply(.SD, function(x) match(x, unique(x))), .SDcols=cols]
})
ans

output:

          time level lat       wind
       1:    1     1   1        1.5
       2:    1     1   1        2.5
       3:    1     1   1        3.5
       4:    1     1   1        4.5
       5:    1     1   1        5.5
      ---                          
16467216:    4    55 259 16524195.5
16467217:    4    55 259 16524196.5
16467218:    4    55 259 16524197.5
16467219:    4    55 259 16524198.5
16467220:    4    55 259 16524199.5

timings:

   user  system elapsed 
   4.90    1.16    5.66 

and for comparison:

library(zoo)
system.time({
    as.data.frame.table(apply(my.array, c(2,3,4), rollmean, 2))  
})
#   user  system elapsed 
#  21.89    0.63   22.51 

3 Comments

Thanks, I had forgotten how fast data.table can be! I am marking this as the official answer, since my data can be up to 10x bigger than the example I gave, and using this answer would save me a lot of time.
I am struggling with converting the resulting data table to an array with the dimensions 289 259 55 4 (similar to the original one). Do you happen to have any hints on this?
I am guessing it’s using array? Maybe post another question? I am not with a PC right now

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.