1

I have a brm model that I'd like to generate predicted values for a fixed value of one parameter (length). I think I need to do this with the posterior_predict function (or possibly the posterior_epred function), but these functions both generate large matrices with no labels so it's impossible to know what is actually being generated.

library(tidyverse)
library(brms)
set.seed(42)
df <- data.frame(mass   = rnorm(100, mean = 50, sd = 5),
                 length = rnorm(100, mean = 20, sd = 2),
                 sex    = sample(c("f", "m"), 100, replace = TRUE),
                 site   = sample(LETTERS[1:3], 100, replace = TRUE))

m <- brm(mass ~ length * sex + site, data = df, prior = prior(normal(0, 10)))

newdata <- data.frame(site   = rep(LETTERS[1:3], each = 30),
                      sex    = rep(c("f", "m"), 45),
                      length = 20)
pred1 <- bind_cols(newdata, predict(m, newdata)) # results in a usable dataframe
pred2 <- posterior_predict(m, newdata)           # results in a large matrix with no labels
pred3 <- posterior_epred(m, newdata)             # results in a large matrix with no labels

ggplot(data = pred1, aes(x = Estimate, y = site, col = sex)) + geom_violin()

enter image description here

6
  • 2
    Both pred1 and pred2 are 4000 rows x 90 columns. By default brms gives you 1000 posterior samples each from 4 Markov chains, for 4000 total. Hence the number of rows. The 90 columns correspond to the 90 rows in newdata. The difference between posterior_predict and posterior_epred is explained in the documentation (sort of like the difference between a prediction interval and confidence interval of the mean, in a frequentist analysis). Commented Mar 20, 2024 at 22:30
  • 2
    If you want output that is easier to work with in tidyverse, look into the tidybayes package, in particular the functions add_predicted_draws and add_epred_draws which give "tidy" versions of the two different types of posterior predictions. Commented Mar 20, 2024 at 22:31
  • @qdread, I think my frustration with posterior_predict is that I have no idea what value corresponds to the data in newdata. Commented Mar 20, 2024 at 23:25
  • @qdred, add_predicted_draws seems like it will be helpful for this particular question, but it doesn't provide the CIs that I'd get with the predict function. Commented Mar 20, 2024 at 23:29
  • 1
    Try newdata |> add_predicted_draws(m) |> median_qi() . This will give 95% equal tailed credible interval by default but other options are possible Commented Mar 21, 2024 at 0:06

1 Answer 1

2

Explanation of posterior_predict output

Both posterior_predict() and posterior_epred() produce a 4000x90 matrix. Why 4000 rows? It has 4000 rows because the model fit by brm() uses an MCMC algorithm to sample the joint posterior distribution of the parameters. By default it does this with four Markov chains that return 1000 samples each, for a total of 4000. Why 90 columns? Because newdata in this case has 90 rows. The columns of pred1 and pred2 correspond to the rows of newdata, in the same order.

Incidentally the difference between posterior_predict() and posterior_epred() is similar to the difference between a prediction interval and a confidence interval around the mean in a frequentist analysis. posterior_predict() gives you draws from the posterior predictive distribution, which incorporates the residual error and thus has higher variance. In contrast posterior_epred() gives you the distribution of the expected value of the posterior.

Solution without any additional packages

To get a median and a 95% equal-tailed credible interval of the posterior predictive distribution for each row in newdata you could do this:

posteriorpredictive_CI <- bind_cols(
  newdata,
  t(apply(pred2, 2L, quantile, probs = c(.5, .025, .975)))
)

Additional solution using tidybayes package

You may find the tidybayes package makes this easier. The functions add_predicted_draws() (for the posterior predictive distribution) and add_epred_draws() (for the expected value of the posterior) output long-form "tidy" data frames. These have 360,000 rows because each row of newdata is repeated 4000 times, one for each of the posterior draws. Then the function median_qi(), imported from ggdist which is a dependency of tidybayes, can be used to summarize with equal-tailed credible intervals. This will give you similar results to the previous code, with some extra ID columns identifying the width of the interval and which row of newdata each row of the summary comes from.

library(tidybayes)

posteriorpredictive_CI <- newdata |>
  add_predicted_draws(m) |>
  median_qi()

The width of the interval is 0.95 by default but can be changed with the .width argument of median_qi().

Sign up to request clarification or add additional context in comments.

Comments

Your Answer

By clicking “Post Your Answer”, you agree to our terms of service and acknowledge you have read our privacy policy.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.