How to compute new variables out of items using rowMeans function in a loop function (big data set)?

Question

I need your help because I have a big data set about illnesses (wide format). So I have 54 different illnesses, each having a block of 18 questions (data is nested in illnesses and participants).

As I have the same variables/questions for each illness, I am trying to find a fast way to calculate rowMeans for the scales (maybe using a loop function).

So basically I have the variables epi_scm.1 to epi_scm.18, ms_scm.1 to ms_scm.18, autism_scm.1 to autism_scm.18 and so on (the beginning of the column name indicates the illness and the end indicates the item number). I need to calculate the rowMeans of the items for each illness (e.g., Morality = rowMeans(df4[, c("epi_scm.1", "epi_scm.2", etc.)])), but I do not want to do that manually for every illness (as there are many).

Do you know how to do this more efficiently? (I hope you understand what I mean.)

Thanks and best regards!

L

I tried to subset each illness, but that takes too much time and isn't really suitable for my main analyses:

#Subset data to include only 42 columns

subset_epi <- df4[, 1:42]  # Replace 1:42 with the indices or column names of the columns I want to keep
subset_epi <- subset_epi[complete.cases(subset_epi), ]

#Organize index numbers subset

rownames(subset_epi) <- 1:nrow(subset_epi)
dim(subset_epi)

#New variable Morality = Mean Score for Morality

subset_epi$Morality <- rowMeans(na.omit(subset_epi[, c("epi_scm_1", "epi_scm_2", "epi_scm_3", "epi_scm_4", "epi_scm_5")]))

Welcome to Stack Overflow. I recommend pivoting your data to long format to start with, so that instead of having so many columns you just have "participant", "illness", "question_no" and "response", for example. This allows you to do grouping operations by illness and simplifies everything downstream.
– PGSA, Commented Apr 26, 2024 at 9:07
For use cases like this, I suggest melting/pivoting your wide data to long format and performing the operations on the long dataset.
– Wimpel, Commented Apr 26, 2024 at 9:08
Do the names always follow the pattern in the question: characters, then an underscore, then more characters, then a dot and a number?
– Rui Barradas, Commented Apr 26, 2024 at 9:14

PGSA · Accepted Answer · 2024-04-26 09:23:12Z

Here's an example approach you could try.

library(tidyverse)

# Mocking up some data:
df <- data.frame(participant = 1:4,
                 epi_scm_1 = sample(1:5, 4, TRUE),
                 epi_scm_2 = sample(1:5, 4, TRUE),
                 epi_scm_3 = sample(1:5, 4, TRUE),
                 autism_scm_1 = sample(1:5, 4, TRUE),
                 autism_scm_2 = sample(1:5, 4, TRUE),
                 autism_scm_3 = sample(1:5, 4, TRUE)
                 )

# pivoting longer:
df2 <- df |> pivot_longer(-participant,
                   names_to = c("illness", NA, "question_no"),
                   names_sep = "_",
                   values_to = "score")

df2 looks like:

# A tibble: 24 × 4
   participant illness question_no score
         <int> <chr>   <chr>       <int>
 1           1 epi     1               5
 2           1 epi     2               1
 3           1 epi     3               5
 4           1 autism  1               2
 5           1 autism  2               1
 6           1 autism  3               3
 7           2 epi     1               1
 8           2 epi     2               1
 9           2 epi     3               3
10           2 autism  1               4
etc

We then summarise by taking the mean of the score per participant and per illness:

> df2 |> summarise(mean = mean(score), .by = c(participant, illness))
# A tibble: 8 × 3
  participant illness  mean
        <int> <chr>   <dbl>
1           1 epi      3.67
2           1 autism   2   
3           2 epi      1.67
4           2 autism   2.33
5           3 epi      4   
6           3 autism   1   
7           4 epi      1.67
8           4 autism   2.33
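The column names in the question use a dot before the item number (epi_scm.1), whereas the mock data above uses underscores throughout. The following is a minimal sketch of the same idea for dot-separated names, assuming every column follows the layout <illness>_scm.<item>. The regular expression, the df_dot mock data and the "_mean" suffix are illustrative assumptions, not part of the original data. It also pivots the result back to one row per participant, which may be handier for later analyses.

```r
library(tidyr)
library(dplyr)

# Mock data with dot-separated item numbers (assumed layout: <illness>_scm.<item>)
set.seed(1)
df_dot <- data.frame(participant = 1:4,
                     epi_scm.1 = sample(1:5, 4, TRUE),
                     epi_scm.2 = sample(1:5, 4, TRUE),
                     autism_scm.1 = sample(1:5, 4, TRUE),
                     autism_scm.2 = sample(1:5, 4, TRUE))

# Capture the illness prefix and the item number; "_scm." is matched but dropped
long <- df_dot |>
  pivot_longer(-participant,
               names_to = c("illness", "question_no"),
               names_pattern = "^(.*)_scm\\.(\\d+)$",
               values_to = "score")

# Mean per participant and illness, then back to one column per illness
means_wide <- long |>
  summarise(mean = mean(score, na.rm = TRUE), .by = c(participant, illness)) |>
  pivot_wider(names_from = illness, values_from = mean,
              names_glue = "{illness}_mean")
```

Note that the .by argument of summarise() needs dplyr 1.1.0 or later; with older versions, group_by(participant, illness) before summarise() should give the same result.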

Rui Barradas · Accepted Answer · 2024-04-26 09:29:13Z

Get the names' prefixes with sub, discarding everything from the dot onward, then split the names vector by this vector of prefixes. Finally, sapply rowMeans to the columns given by each element of the split vector.

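nms_subset_epi <- names(subset_epi)
# Strip everything from the dot onward to get each column's prefix
f <- sub("\\..*$", "", nms_subset_epi)
# Group the column names by prefix
nms_split <- split(nms_subset_epi, f)

# One column of row means per prefix
result <- sapply(nms_split, \(nms) {
  subset_epi[nms] |> rowMeans(na.rm = TRUE)
})
result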
#>        autism_scm    epi_scm      ms_scm
#>  [1,] -0.71350045 -0.3449778  1.13997663
#>  [2,] -0.38520104  0.4688165  0.31570295
#>  [3,] -0.07565961 -0.1079713  0.26870261
#>  [4,]  0.47264391  0.2930216  0.49563327
#>  [5,] -0.93888010  1.1580985 -0.05093926
#>  [6,] -0.65158609 -0.1481108 -0.57702851
#>  [7,]  0.61227658 -0.1976458  0.33733187
#>  [8,] -0.23906127  0.1635898  0.09771817
#>  [9,]  1.00040381 -1.4154913 -2.05572232
#> [10,]  0.95393066 -1.5664113  0.15947003

Created on 2024-04-26 with reprex v2.1.0

You can then give new names to this matrix, for instance,

colnames(result) <- paste0(colnames(result), "_Means")
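To keep the means next to the original items, the renamed matrix can be bound back onto the data frame. This is a minimal, self-contained sketch of that step. The data frame dat and its four columns are mock data made up for illustration, and the names are only assumed to follow the pattern from the question.

```r
# Small mock data set; column names follow the question's pattern (assumed)
set.seed(42)
dat <- as.data.frame(replicate(4, rnorm(5)))
names(dat) <- c("epi_scm.1", "epi_scm.2", "ms_scm.1", "ms_scm.2")

# Group the column names by prefix and compute one row mean per group
prefix <- sub("\\..*$", "", names(dat))
means <- sapply(split(names(dat), prefix),
                \(nms) rowMeans(dat[nms], na.rm = TRUE))
colnames(means) <- paste0(colnames(means), "_Means")

# cbind keeps the row order, so each mean lines up with its participant
dat_with_means <- cbind(dat, means)
names(dat_with_means)
```

Because rowMeans returns one value per row in the original row order, no join key is needed here, provided the rows of the data frame are not filtered or reordered in between.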

Test data

x <- "epi_scm.1, epi_scm.18, ms_scm.1, ms_scm.18, autism_scm.1, autism_scm.18"
x <- scan(text = x, what = character(), sep = ",") |> trimws()
nms_subset_epi <- x

set.seed(2024)
subset_epi <- replicate(6, rnorm(10))
subset_epi[sample(60, 10)] <- NA
subset_epi <- subset_epi |>
  as.data.frame() |>
  setNames(x)

Created on 2024-04-26 with reprex v2.1.0
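The question also asks for scale scores such as Morality, built from only some of the items in each block. A plain for loop over the illness prefixes is one way to sketch this in base R. The assumptions here are that items 1 to 5 form the Morality scale (taken from the example in the question) and that columns are named <illness>_scm.<item>. The mock data uses six items per illness instead of 18 to stay short.

```r
# Mock data: two illnesses with six items each (the real data has 18 per illness)
set.seed(7)
illnesses <- c("epi", "ms")
dat <- as.data.frame(replicate(12, rnorm(5)))
names(dat) <- paste0(rep(illnesses, each = 6), "_scm.", 1:6)

# Assumption: items 1 to 5 form the Morality scale, as in the question's example
morality_items <- 1:5

for (ill in illnesses) {
  cols <- paste0(ill, "_scm.", morality_items)
  # na.rm = TRUE averages over the items a participant actually answered
  dat[[paste0(ill, "_Morality")]] <- rowMeans(dat[cols], na.rm = TRUE)
}
```

For the real data, the vector of illness prefixes could be derived from the column names, for example with unique(sub("_scm.*$", "", names(dat))), as long as all item columns share that naming pattern. Further scales would each need their own item vector.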
