Meta-analysis can be used not just to guess about effects out-of-sample but also to re-evaluate effects in sample

Imagine you are in the fortunate position of planning a collection of studies which you will later get to analyze together (looking at you metaketas). Each study estimates a site specific effect. You want to learn something about general effects. We work through design issues using a multi-study design with J studies that employs both frequentist and Bayesian approaches to meta-analysis. In the designs that we diagnose these perform very similarly in terms of estimating sample and population average effects. But there are tradeoffs. The Bayesian model does better at estimating individual effects by separating out true heterogeneity from sampling error but can sometimes fare poorly at estimating prediction intervals.

What questions might a meta-analysis try to answer?

We are interested in effects within individual studies, in some “general” effect, and in effect variability. We imagine a setting in which effects from \(J\) studies are drawn from a population of possible cases with effects: \[\theta_i \sim f(\mu, \tau) \text{ for } i \in \{1,2,\dots, J\}\] We will assume that \(f\) is Normal but we will be able to see what happens if we are wrong about that. The most important assumption we make here is that the studies are random draws from a population. In practice that is almost never the case, though that may change as researchers coordinate more on multi-site projects.

Actual estimates are given by: \[\hat{\theta}_i \sim N(\theta_i, \sigma_i) \text{ for } i \in \{1,2,\dots, J\}\] The idea here is that estimates are a draw from a Normal distribution centered on the truth, but with some sampling error. Thus we are assuming that whatever estimation strategy is used is unbiased. The size of the error will depend on things like the sample size and estimation strategy. We will assume it is the same for all studies (though that’s largely a design issue that you could experiment with as part of the declaration).

In summary, we are assuming a world where knowing the estimates from a case tell us something about the actual effects in that case and knowing something about the actual effects in a case tells us something about effects in a population of cases.

Our challenge is: given a set of estimates, \(\hat{\theta}\), and a set of estimates of sampling error, \(\hat{\sigma}\), can we figure out:

  • \(\mu\) the population average effect
  • \(\tau\) the fundamental heterogeneity of effects
  • \(\theta\) the country level effects and \(\overline{\theta}\) the average effect in our sample

In addition, partially heeding IntHout et al. (2016)’s plea for presenting “prediction intervals,” we will add an estimate for the probability that a new study will have a positive effect:

  • \(\int_{0}^{\infty}f(x|\mu, \sigma)dx\)

Though not worked in here, you might be interested in still deeper questions such as trying to explain variation in effects across studies and using this variation to make predictions to new cases given information about those cases (this is the stuff of “meta-regression”) and could be done with a relatively modest modification of the design we declare below.

A meta-design

We imagine a design in which we generate a set of \(J\) studies, each producing unbiased estimates of a study specific treatment effect as well as estimates of uncertainty.

We consider three approaches to estimating quantities of interest:

  1. An agnostic frequentist approach that takes the study level effects at face value. We will also include a fairly naive approach to quantifying effect heterogeneity using the standard deviation of study effects (spoiler: this won’t do very well but we are including it in case you’d be tempted to reach for this).
f_agnostic = function(data) with(data,  {
  data.frame(estimate = c(muhat    = mean(est_effects),
                         tauhat    = sd(est_effects),
                         study_est = est_effects[1]))})
  1. A model-based frequentist approach in which the estimates for \(\mu\) and \(\tau\) are generated using a random effects model via maximum likelihood procedures using the metaplus package (Beath and Bolker 2015).
f_metaplus = function(data) with(data,  {
  metaplus_result <- metaplus(yi = est_effects, sei = est_sds)
  metaplus_ests   <- summary(metaplus_result)[[1]][1:2,1]
  data.frame(estimate = c(muhat    = metaplus_ests[1],
                          tauhat   = metaplus_ests[2]^.5,
                          prob_pos = 1-pnorm(0, metaplus_ests[1], metaplus_ests[2])))
  1. For a Bayesian approach we will use the “8 schools model” (Gelman et al. 2013) implemented via stan. See here for more on stan and this model in particular. To feed the stan model into declaredesign we use a handler to generate an analysis step that uses stan. The handler generates data in the form stan wants it in, runs the model, and extracts the quantities we care about.
f_bayesian = function(data) {
  J      <- nrow(data)
  df     <- list(J = J, y = data$est_effects, sigma = data$est_sds)
  fit    <- stan(model_code = stan_model, data = df)
  fit_sm <- summary(fit)$summary
  data.frame(estimate = fit_sm[,1][c("mu", "tau", "theta[1]", "prob_pos")])}

Here’s the full stan model that the handler calls on (based on stan team’s code, with an additional “generated quantities” block making predictions). The model details the type of data to be expected, the parameters,1 and the assumed data generation process (model block). Priors are not specified here so they are taken to be flat over their ranges. stan then seeks to calculate a posterior distribution on the priors given the data.

stan_model <- " 
  data {
    int<lower=0> J;         // number of schools 
    real y[J];              // estimated treatment effects
    real<lower=0> sigma[J]; // standard error of effect estimates 
  parameters {
    real mu;                // population treatment effect
    real<lower=0> tau;      // standard deviation in treatment effects
    vector[J] eta;          // unscaled deviation from mu by school
  transformed parameters {
    vector[J] theta = mu + tau * eta;        // school treatment effects
  model {
    target += normal_lpdf(eta | 0, 1);       // prior log-density
    target += normal_lpdf(y | theta, sigma); // log-likelihood
 generated quantities {
    real<lower=0> prob_pos;                  // Probability an effect is >0
    prob_pos =   1 - 1/(1+exp(-(0.07056 * (-mu/tau)^3 + 1.5976 * (-mu/tau))));   

With these estimators in hand, the design declaration is quite straightforward.

# Parameters
J     = 8   # Number of studies 
mu    = 1   # Assumed average effect
tau   = .5  # Heterogeneity of effects
sigma = .5  # Error in studies (assumed the same across studies), 
            # but could be a vector to reflect variation in precision across studies

# Design
metadesign <- 
  declare_population(N = J, 
                     true_effects = rnorm(N, mu, tau),
                     est_effects = rnorm(N, true_effects, sigma),
                     est_sds = sigma) +

  declare_estimand(mu = mu, tau = tau, study_effect = true_effects[1], 
                   prob_pos  = 1 - pnorm(0, mu, tau))  +
  declare_estimator(handler = tidy_estimator(f_agnostic), label = "Agnostic",
                   estimand = c("mu", "tau", "study_effect")) + 
  declare_estimator(handler = tidy_estimator(f_metaplus), label = "Metaplus", 
                   estimand = c("mu", "tau", "prob_pos")) +
  declare_estimator(handler = tidy_estimator(f_bayesian), label = "Bayesian",
                   estimand = c("mu", "tau", "study_effect", "prob_pos"))

Let’s diagnose a set of designs of different sizes and plot some results.

metadesigns <- redesign(metadesign, J = c(4, 8, 16, 32))
diagnosis   <- diagnose_design(metadesigns)

Bias looks like this:

And the RMSE (expected error) looks like this:

We see that all approaches estimate the population average effect quite well. No bias and a RMSE that starts to vanish as J grows. They differ though in their estimates of variability and, because of this, in their predictions about the probability that the effect in a next study will be positive (prob_pos). They also differ in their final inferences about effects in the already completed studies (study_effect). In particular:

  • The Bayesian model is biased here for \(\tau\)—it ends up with too high an estimate of heterogeneity in expectation, since it starts with a very diffuse prior that it doesn’t let go of fast enough.2 In turn this means that the Bayesian model holds open the possibility of more negative effects arising from the population than we in fact would see given the model assumptions. The naive frequentist approach is also worse for \(\tau\) and it doesn’t get much better with more studies.

  • The Bayesian model does better for \(\theta\) than the other approaches. The reason is that the Bayesian model adjusts the individual study estimates based on inferences about the data generation.

Having more studies wipes out the effects of priors, resulting in the Bayes model accurately estimating heterogeneity but still doing (even) better on \(\theta\).

A model can be useful even when it’s wrong

Both the Bayesian and metaplus approaches examined here to estimate \(\mu\) and \(\tau\) make use of an assumption about the distribution of effects in a population. How robust are results to having the wrong model?

We explore a little by declaring a design in which effects are in fact distributed uniform over [0,2], rather than normally (so the mean is 1 and sd is (1/3)^.5) but the estimation erroneously supposes that effects are distributed normal. One big difference here is that although the variance in effects is a little larger there is now zero probability of a non positive effect. This new design requires changing both of the first two steps, though we will recycle steps 3 - 5:

metadesign_2    <- 
  declare_population(N = J, true_effects = runif(N, 0, 2),
                     est_effects = rnorm(N, true_effects, sigma), est_sds = sigma) +

  declare_estimand(mu = mu, tau = .577, study_effect = true_effects[1], prob_pos  = 0)  +

  metadesign[[3]] + metadesign[[4]] + metadesign[[5]] 
Estimand Label Estimator Label Bias RMSE Mean Estimand Mean Estimate
mu Agnostic -0.01 0.27 1.00 0.99
mu Bayesian -0.01 0.28 1.00 0.99
prob_pos Bayesian 0.89 0.90 0.00 0.89
study_effect Agnostic 0.05 0.51 0.98 1.03
study_effect Bayesian 0.04 0.42 0.98 1.01

We see the Bayesian approach continues to estimate study level effects better than the agnostic approach, despite having the wrong model. Though it does quite poorly on the out of sample predictions since it is working wiht the wrong distribution (and in particular one for which negative effects are always possible).

There are a few ways to improve upon this. Performance of the model can to some extent be assessed within the sample, e.g. by assessing how well a model developed on a subset of studies fares in predicting outcomes in another set of studies. Or more flexible approaches for modelling the underlying distribution of effects could be used (see e.g. Beath (2014)).

So models can help, even when wrong. But they can also lead you astray if too wrong.3 Hmm. If this all makes you nervous one option is to keep the focus on \(\mu\) for which which is unbiased (though perhaps not very useful for out-of-sample predictions).

If we don’t know how a study got selected into a sample we don’t have strong grounds to use it to make inferences out of sample

In practice in many (maybe all?) meta-analyses there is not a very good understanding of how individuals studies were selected from the population of studies of interest. This can make it hard to justify any kind of meta-analytic approach like these that strive to do more than estimate sample average effects. On the brighter side, if you have some information on how studies are selected then you can build that into the estimation and the design declaration and assess how sampling matters for inference.


Beath, K, and B Bolker. 2015. “Metaplus: Robust Meta-Analysis and Meta-Regression.” R Package Version 0.7-4.

Beath, Ken J. 2014. “A Finite Mixture Method for Outlier Detection and Robustness in Meta-Analysis.” Research Synthesis Methods 5 (4): 285–93.

Gelman, Andrew, Hal S Stern, John B Carlin, David B Dunson, Aki Vehtari, and Donald B Rubin. 2013. Bayesian Data Analysis. Chapman; Hall/CRC.

IntHout, Joanna, John PA Ioannidis, Maroeska M Rovers, and Jelle J Goeman. 2016. “Plea for Routinely Presenting Prediction Intervals in Meta-Analysis.” BMJ Open 6 (7): e010247.

Kontopantelis, Evangelos, and David Reeves. 2012. “Performance of Statistical Methods for Meta-Analysis When True Study Effects Are Non-Normally Distributed: A Simulation Study.” Statistical Methods in Medical Research 21 (4): 409–26.

  1. The parameters are (i) a real number, \(\mu\), (ii) a nonnegative real number \(\tau\) and (iii) a set of \(J\) unit level effects \(\theta_i\) (to simplify computation the model defines deviations from the average effect as \(\eta_i\) and then defines \(\theta_i\) as \(\theta_i = \mu + \eta_i\tau\)).

  2. You might find it discomforting to be applying such a frequentist ideas as bias to a Bayesian model; Bayesian models that get the prior right shouldn’t have any bias. The idea here though is that we assess ex ante what is the expected inference that will be made from a Bayesian approach given some data generating process that is not available to a researcher.

  3. See Kontopantelis and Reeves (2012) for explorations.