R - Speeding up loop over array dimensions

Question

I am working with an array with dimensions

[1] 290 259  55   4

For each repetition of the last three dimensions, I want to perform a rolling mean on the 290 elements of the first dimension, reducing the number of elements to 289. Finally, I need to create a data frame with the updated values.

The following code achieves what I need, but it takes a LONG time to run (actually, I have to interrupt it before the end).

library(zoo)

# Generate random data with same dimensions as mine
my.array <- array(1:16524200, dim=c(290,259,55,4))

# Get dimension sizes
dim2 <- dim(my.array)[2]
dim3 <- dim(my.array)[3]
dim4 <- dim(my.array)[4]

# Pre-allocate data frame to be used within the loop
df2 <- data.frame()

# Loop over dimensions
for (i in 1:dim4) {
  for (j in 1:dim3) {
    for (k in 1:dim2) {

      # Take rolling average
      u <- rollapply(my.array[,k,j,i], 2, mean)

      # Assemble data frame
      df1 <- data.frame(time=i, level=j, lat=k, wind=u)
      df2 <- rbind(df2, df1)

    }
  }
}
# Very slow, and uses only one machine core

I feel like it is possible to improve the processing time of this code by using vectorization or even some kind of parallelism, but I can't figure out how.

Any suggestions to make this code more efficient?

Never build data.frames iteratively. Each time you call rbind, it is copying the entire frame into a new object and over-writing df2. It might work for a few dozen, but (as you can see) this scales horribly. — r2evans
– r2evans, Commented Dec 20, 2019 at 2:12
@r2evans that makes sense, but... What would be an alternative? — thiagoveloso
– thiagoveloso, Commented Dec 20, 2019 at 2:19
In general, something <- lapply(list_of_stuff, somefunc) and then do.call(rbind, something) (though this question requires a little more). — r2evans
– r2evans, Commented Dec 20, 2019 at 2:21

iroha · Accepted Answer · 2019-12-20 03:22:46Z

6

apply() works on any number of dimensions so you can achieve the same result much more quickly using the following wrapped in as.data.frame.table() to efficiently convert the output from an array to a data frame:

library(zoo)
df <- as.data.frame.table(apply(my.array, c(2,3,4), rollmean, 2))

Not strictly necessary but this can be tidied up to match your original output:

idx <- sapply(df, is.factor)
df[idx] <- sapply(df[idx], as.integer)

df <- setNames(df[c(4,3,2,5)], c("time", "level", "lat", "wind"))

Check if the result is the same:

identical(df2, df)
[1] TRUE

edited Dec 20, 2019 at 3:22

answered Dec 20, 2019 at 2:56

iroha

35.3k4 gold badges55 silver badges66 bronze badges

Sign up to request clarification or add additional context in comments.

1 Comment

r2evans Over a year ago

I had forgotten about as.data.frame.table, that does an incredible job of speeding things up (over 2x on initial testing).

r2evans · Accepted Answer · 2019-12-20 02:42:39Z

Up front, you are suffering from the 2nd circle of R's Inferno (https://www.burns-stat.com/pages/Tutor/R_inferno.pdf): growing objects. Each time you call rbind, it makes a complete copy of the frame, does the r-binding, then overwrites that complete copy over the original variable name. So while it might work without noticeable slow-down for the first few dozen, it will slow down a bit over 100 or so ... and you're doing it 56,980 times.

It is generally much better to process things into a list and then do the rbind once at the end on the entire list, as in do.call(rbind, list_of_frames). Granted, you still may have the computational challenge of doing something potentially hard ... luckily zoo is about as efficient as you can get for window operations, and this one is not impossibly hard.

I'll demonstrate on a significantly-reduced problem set (since I don't think it matters if we're looking at 16M or 1.5M iterations.

my.array <- array(1:1502200, dim=c(290,259,5,4))
eg <- do.call(expand.grid, lapply(dim(my.array)[-1], seq_len))
dim(eg)
# [1] 5180    3
head(eg)
#   Var1 Var2 Var3
# 1    1    1    1
# 2    2    1    1
# 3    3    1    1
# 4    4    1    1
# 5    5    1    1
# 6    6    1    1

system.time({
  list_of_frames <- Map(function(i,j,k) {
    u <- zoo::rollapply(my.array[,i,j,k], 2, mean)
    data.frame(i, j, k, wind = u)
  }, eg[[1]], eg[[2]], eg[[3]])
})
#    user  system elapsed 
#    5.79    0.00    5.80 
head(list_of_frames[[5]])
#   i j k   wind
# 1 5 1 1 1161.5
# 2 5 1 1 1162.5
# 3 5 1 1 1163.5
# 4 5 1 1 1164.5
# 5 5 1 1 1165.5
# 6 5 1 1 1166.5

system.time({
  out <- do.call(rbind, list_of_frames)
})
#    user  system elapsed 
#    0.50    0.03    0.53 
nrow(out)
# [1] 1497020
rbind(head(out), tail(out))
#           i j k      wind
# 1         1 1 1       1.5
# 2         1 1 1       2.5
# 3         1 1 1       3.5
# 4         1 1 1       4.5
# 5         1 1 1       5.5
# 6         1 1 1       6.5
# 1497015 259 5 4 1502194.5
# 1497016 259 5 4 1502195.5
# 1497017 259 5 4 1502196.5
# 1497018 259 5 4 1502197.5
# 1497019 259 5 4 1502198.5
# 1497020 259 5 4 1502199.5

Explanation:

do.call(expand.grid, ...) is creating a frame of all the i,j,k combinations you need, dynamically on the dimensions of your array.
Map(f, is, js, ks) runs the function f with the 1st argument of each of is, js, and ks (notional for this bullet), so Map looks something like:
```
f(is[1], js[1], ks[1])
f(is[2], js[2], ks[2])
f(is[3], js[3], ks[3])
# ...
```
then we combine them in one call using do.call(rbind, ...). We really have to use do.call here because this call is analogous to
```
rbind(list_of_frames[[1]], list_of_frames[[2]], ..., list_of_frames[[5180]])
```
(over to you if you'd prefer to write out this version).

chinsoon12 · Accepted Answer · 2019-12-20 05:51:47Z

4

Another option to flatten the multidimensional array first before using data.table to calculate the rolling mean

library(data.table)
system.time({
    ans <- setDT(as.data.frame.table(my.array))[
        , .(wind=((Freq + shift(Freq)) / 2)[-1L]), 
        .(time=Var4, level=Var3, lat=Var2)]
    cols <- c("time", "level", "lat")
    ans[, (cols) := lapply(.SD, function(x) match(x, unique(x))), .SDcols=cols]
})
ans

output:

          time level lat       wind
       1:    1     1   1        1.5
       2:    1     1   1        2.5
       3:    1     1   1        3.5
       4:    1     1   1        4.5
       5:    1     1   1        5.5
      ---                          
16467216:    4    55 259 16524195.5
16467217:    4    55 259 16524196.5
16467218:    4    55 259 16524197.5
16467219:    4    55 259 16524198.5
16467220:    4    55 259 16524199.5

timings:

   user  system elapsed 
   4.90    1.16    5.66

and for comparison:

library(zoo)
system.time({
    as.data.frame.table(apply(my.array, c(2,3,4), rollmean, 2))  
})
#   user  system elapsed 
#  21.89    0.63   22.51

answered Dec 20, 2019 at 5:51

chinsoon12

25.2k4 gold badges27 silver badges35 bronze badges

3 Comments

thiagoveloso Over a year ago

Thanks, I had forgotten how fast data.table can be! I am marking this as the official answer, since my data can be up to 10x bigger than the example I gave, and using this answer would save me a lot of time.

thiagoveloso Over a year ago

I am struggling with converting the resulting data table to an array with the dimensions 289 259 55 4 (similar to the original one). Do you happen to have any hints on this?

chinsoon12 Over a year ago

I am guessing it’s using array? Maybe post another question? I am not with a PC right now

Collectives™ on Stack Overflow

R - Speeding up loop over array dimensions

3 Answers 3

1 Comment

Comments

3 Comments

Your Answer

Hot Network Questions

Collectives™ on Stack Overflow

3 Answers 3

1 Comment

Comments

3 Comments

Your Answer

Sign up or log in

Post as a guest

Related