I have the following data frame:
eu_df <- structure( list( nuts_code = c( "PT17", "PT17", "PT17", "PT17", "PT17", "PT17", "PT17", "PT17", "PT17", "PT17", "PT17", "PT1A", "PT1A", "PT1A", "PT1A", "PT1A", "PT1A", "PT1A", "PT1A", "PT1A", "PT1A", "PT1A", "PT1B", "PT1B", "PT1B", "PT1B", "PT1B", "PT1B", "PT1B", "PT1B", "PT1B", "PT1B", "PT1B" ), year = c( 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023, 2024, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023, 2024, 2014, 2015, 2016, 2017, 2018, 2019, 2020, 2021, 2022, 2023, 2024 ), pop = c( 2815667, 2820766, 2829408, 2836906, 2849085, 2859422, 2884800, 2884695, 2883645, 2921564, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 2097742, 2126578, NA, NA, NA, NA, NA, NA, NA, NA, NA, 823822, 834599 ), medage = c( 41.9, 42.1, 42.4, 42.7, 42.9, 43.2, 43.5, 44, 44.5, 44.8, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 44.5, 44.7, NA, NA, NA, NA, NA, NA, NA, NA, NA, 45.5, 45.6 ), gdp = c( NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 77253.6, 90226.08, 102494.09, NA, NA, NA, NA, NA, NA, NA, NA, 14134.32, 15740.99, 17196, NA ), selfemp = c( NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 132.2, 143.35, NA, NA, NA, NA, NA, NA, NA, NA, NA, 43.5, 45.96, NA, NA ), entnum = c( NA, NA, NA, NA, NA, NA, NA, 387952, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 305387, 333104, NA, NA, NA, NA, NA, NA, NA, NA, NA, 82565, 90899, NA, NA ), area = c( 3015, 3015, 3015, 3015, 3015, 3015, 3015, 3015, 3015, 3015, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1390, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 1625 ) ), row.names = c(NA, -33L), class = "data.frame" )
> eu_df
   nuts_code year     pop medage       gdp selfemp entnum area
1       PT17 2014 2815667   41.9        NA      NA     NA 3015
2       PT17 2015 2820766   42.1        NA      NA     NA 3015
3       PT17 2016 2829408   42.4        NA      NA     NA 3015
4       PT17 2017 2836906   42.7        NA      NA     NA 3015
5       PT17 2018 2849085   42.9        NA      NA     NA 3015
6       PT17 2019 2859422   43.2        NA      NA     NA 3015
7       PT17 2020 2884800   43.5        NA      NA     NA 3015
8       PT17 2021 2884695   44.0        NA      NA 387952 3015
9       PT17 2022 2883645   44.5        NA      NA     NA 3015
10      PT17 2023 2921564   44.8        NA      NA     NA 3015
11      PT17 2024      NA     NA        NA      NA     NA   NA
12      PT1A 2014      NA     NA        NA      NA     NA   NA
13      PT1A 2015      NA     NA        NA      NA     NA   NA
14      PT1A 2016      NA     NA        NA      NA     NA   NA
15      PT1A 2017      NA     NA        NA      NA     NA   NA
16      PT1A 2018      NA     NA        NA      NA     NA   NA
17      PT1A 2019      NA     NA        NA      NA     NA   NA
18      PT1A 2020      NA     NA        NA      NA     NA   NA
19      PT1A 2021      NA     NA  77253.60  132.20 305387   NA
20      PT1A 2022      NA     NA  90226.08  143.35 333104   NA
21      PT1A 2023 2097742   44.5 102494.09      NA     NA   NA
22      PT1A 2024 2126578   44.7        NA      NA     NA 1390
23      PT1B 2014      NA     NA        NA      NA     NA   NA
24      PT1B 2015      NA     NA        NA      NA     NA   NA
25      PT1B 2016      NA     NA        NA      NA     NA   NA
26      PT1B 2017      NA     NA        NA      NA     NA   NA
27      PT1B 2018      NA     NA        NA      NA     NA   NA
28      PT1B 2019      NA     NA        NA      NA     NA   NA
29      PT1B 2020      NA     NA        NA      NA     NA   NA
30      PT1B 2021      NA     NA  14134.32   43.50  82565   NA
31      PT1B 2022      NA     NA  15740.99   45.96  90899   NA
32      PT1B 2023  823822   45.5  17196.00      NA     NA   NA
33      PT1B 2024  834599   45.6        NA      NA     NA 1625
nuts_code is the code of a European region, year is the year in which the variables were observed, pop is the population, medage is the median age of the population, gdp is the GDP, selfemp is the number of self-employed persons, entnum is the number of enterprises founded in that year, and area is the area of the region.
Under the new NUTS classification (the classification of European regions), the region PT17 was recently split into two regions: PT1A and PT1B.
This means that, for example, while the population (pop) for PT17 is missing in 2024, it can be recovered as the sum of the populations for PT1A and PT1B in the same year.
The same holds for the other variables and the other years.
So the exercise is, when a variable for PT17 is missing in a given year, compute it as the sum of the same variable for PT1A and PT1B in that same year.
As an additional complexity, for medage we need to compute the median, not the sum.
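To make the rule concrete, here is a minimal sketch for a single year, using the 2024 values from the data above. The variable names (pop_old, pop_new, and so on) are only illustrative and do not appear in my actual code.

```r
# 2024 values copied from the data above
pop_old <- NA_real_               # PT17 (missing)
pop_new <- c(2126578, 834599)     # PT1A, PT1B

# Keep the old value when it exists, otherwise aggregate the new regions
pop_filled <- if (is.na(pop_old)) sum(pop_new) else pop_old
pop_filled     # 2961177

medage_old <- NA_real_            # PT17 (missing)
medage_new <- c(44.7, 45.6)       # PT1A, PT1B

# Same rule, but with median() over the vector of new-region values
medage_filled <- if (is.na(medage_old)) median(medage_new) else medage_old
medage_filled  # 45.15
```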
I need to build a general function because I have plenty of these cases.
I came up with the following:
library(dplyr)     # filter(), bind_rows(), arrange(), %>%
library(magrittr)  # %<>%

# Bring the new NUTS codes back into the old one (a split region is merged back into the old region)
change_nuts <- function(df, old_nuts_code, new_nuts_codes, fns) {
  old_df <- df %>% filter(nuts_code == old_nuts_code)
  stopifnot(nrow(old_df) > 0)
  vars <- colnames(df) %>% setdiff(c("nuts_code", "year"))
  for (var in vars) {
    for (year in sort(unique(old_df$year))) {
      # Keep values that are already present for the old region
      if (!is.na(old_df[old_df$year == year, var])) {
        next
      }
      # Start from the value of the first new region
      old_df[old_df$year == year, var] <- df[
        df$nuts_code == new_nuts_codes[1] & df$year == year,
        var
      ]
      # Combine with the remaining new regions, one at a time
      for (new_nuts_code in new_nuts_codes[-1]) {
        if (var %in% names(fns)) {
          old_df[old_df$year == year, var] <- do.call(
            fns[[var]],
            list(
              old_df[old_df$year == year, var],
              df[df$nuts_code == new_nuts_code & df$year == year, var]
            )
          )
        } else {
          old_df[old_df$year == year, var] <- sum(
            old_df[old_df$year == year, var],
            df[df$nuts_code == new_nuts_code & df$year == year, var]
          )
        }
      }
    }
  }
  # Remove old NUTS code
  df %<>% filter(nuts_code != old_nuts_code)
  # Remove new NUTS codes
  df %<>% filter(!(nuts_code %in% new_nuts_codes))
  # Bind new dataframe with old nuts code
  df %<>% bind_rows(old_df)
  # Sort by NUTS code and then by year
  df %<>% arrange(nuts_code, year)
  return(df)
}
which seems to work:
fns <- list("medage" = median)
eu_df %<>% change_nuts("PT17", c("PT1A", "PT1B"), fns = fns)
and then, after
eu_df %>% filter(nuts_code %in% c("PT17", "PT1A", "PT1B"))
I get the expected result:
   nuts_code year     pop medage       gdp selfemp entnum area
1       PT17 2014 2815667   41.9        NA      NA     NA 3015
2       PT17 2015 2820766   42.1        NA      NA     NA 3015
3       PT17 2016 2829408   42.4        NA      NA     NA 3015
4       PT17 2017 2836906   42.7        NA      NA     NA 3015
5       PT17 2018 2849085   42.9        NA      NA     NA 3015
6       PT17 2019 2859422   43.2        NA      NA     NA 3015
7       PT17 2020 2884800   43.5        NA      NA     NA 3015
8       PT17 2021 2884695   44.0  91387.92  175.70 387952 3015
9       PT17 2022 2883645   44.5 105967.07  189.31 424003 3015
10      PT17 2023 2921564   44.8 119690.09      NA     NA 3015
11      PT17 2024 2961177   44.7        NA      NA     NA 3015
The missing values for PT17 are now computed from the new regions.
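One detail in this output may be worth double-checking: medage for 2024 comes out as 44.7, which is the PT1A value, not the median of 44.7 and 45.6. As far as I can tell this is because do.call(median, list(a, b)) is the same as median(a, b), and the second positional argument of median() is na.rm, not a second data value. A minimal check, with a and b as illustrative names:

```r
a <- 44.7  # PT1A, 2024
b <- 45.6  # PT1B, 2024

# Equivalent to median(a, b): b is matched to na.rm, so only a is used
pairwise <- do.call(median, list(a, b))
pairwise   # 44.7

# Passing both values as one vector gives the median of the two
vectorised <- median(c(a, b))
vectorised # 45.15
```

The sum() branch is not affected, because sum() adds up all of its arguments.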
The problem is that my function seems, well, ugly.
Is there a cleaner way to do this with the tidyverse?
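For reference, this is the kind of shape I have in mind, as a rough sketch and not a tested replacement. It aggregates the new regions per year with summarise(across(...)) and then uses dplyr::rows_patch(), which only overwrites NA values, to fill the old region. The function name merge_split_regions and the small toy data frame (a subset of the data above) are made up for illustration. It assumes every year present for the new regions also exists for the old region, since rows_patch() otherwise raises an error by default. Because each function receives the whole vector of new-region values, medage for 2024 comes out as 45.15 here instead of the 44.7 shown above.

```r
library(dplyr)

# Hypothetical helper: fill NAs of `old_code` with values aggregated over `new_codes`
merge_split_regions <- function(df, old_code, new_codes, fns = list()) {
  vars <- setdiff(names(df), c("nuts_code", "year"))

  # One row per year: each variable aggregated over the new regions,
  # using sum() by default and fns[[variable]] when one is supplied
  agg <- df %>%
    filter(nuts_code %in% new_codes) %>%
    group_by(year) %>%
    summarise(
      across(all_of(vars), function(x) {
        f <- if (cur_column() %in% names(fns)) fns[[cur_column()]] else sum
        f(x)
      }),
      .groups = "drop"
    ) %>%
    mutate(nuts_code = old_code)

  df %>%
    filter(!nuts_code %in% new_codes) %>%
    # rows_patch() replaces only the NA cells of the matching rows
    rows_patch(agg, by = c("nuts_code", "year")) %>%
    arrange(nuts_code, year)
}

# Toy subset of the data above (2023 and 2024, two variables)
toy <- data.frame(
  nuts_code = c("PT17", "PT17", "PT1A", "PT1A", "PT1B", "PT1B"),
  year      = c(2023, 2024, 2023, 2024, 2023, 2024),
  pop       = c(2921564, NA, 2097742, 2126578, 823822, 834599),
  medage    = c(44.8, NA, 44.5, 44.7, 45.5, 45.6)
)

res <- merge_split_regions(toy, "PT17", c("PT1A", "PT1B"), fns = list(medage = median))
res
```

As in my loop, values already present for PT17 (the 2023 row here) are left untouched, and a sum involving an NA stays NA.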