Mediation Analysis Design

By randomly assigning units to treatment, we can determine whether a treatment affects an outcome, but not why or how it does so. Identifying causal mechanisms, arguably one of the most useful findings in science, requires a treatment with nonzero effects and the specification of at least one mediator.

For this analysis, we assume that the average treatment effect (ATE) of \(Z\) on \(Y\) is nonzero. Our main interest lies in decomposing the ATE into direct and indirect effects. The indirect effect is channeled from the treatment \(Z\) to the outcome \(Y\) through a mediator \(M\), whereas the direct effect runs from \(Z\) to \(Y\) without passing through \(M\).

Identifying causal mechanisms is not a simple task. It involves complex potential outcomes (more on this below) and a mediating variable that is generally not assigned at random and is not a pre-treatment covariate (since it is affected by the treatment). Researchers often use regression-based approaches to identify causal mechanisms, but these rely on assumptions that cannot always be met.
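To make the role of complex potential outcomes concrete, one standard way to write the decomposition uses the shorthand \(Y_i(z, M_i(z'))\) for the outcome unit \(i\) would exhibit under treatment status \(z\) with the mediator at the value it would take under status \(z'\). This shorthand is an assumption of this sketch; the design declaration spells out the same quantities in full. Adding and subtracting the complex potential outcome \(Y_i(1, M_i(0))\) splits the ATE into two parts:

\[
E[Y_i(1, M_i(1)) - Y_i(0, M_i(0))] = E[Y_i(1, M_i(1)) - Y_i(1, M_i(0))] + E[Y_i(1, M_i(0)) - Y_i(0, M_i(0))]
\]

The first term on the right is an indirect effect: treatment status is fixed and only the mediator moves. The second is a direct effect: the mediator is held at its control value and only treatment status changes. Whenever the treatment changes the mediator for unit \(i\), \(Y_i(1, M_i(0))\) cannot be observed, which is why neither term can be read directly off experimental data.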

Below we define a mediation analysis design.

Design Declaration

  • Model:
    We specify a population of size \(N\). Individuals from this population have two potential outcomes related to the mediator:

  • \(M_i(Z_i=0):\) the value of the mediator \(M\) when unit \(i\) is in the control group, and
  • \(M_i(Z_i=1):\) the value of the mediator \(M\) when unit \(i\) is treated.

Additionally, individuals have four potential outcomes related to \(Y\).

Two of these can be observed under the treatment or control conditions:

  • \(Y_i(Z_i=0 , M_i(Z_i=0)):\) the outcome of interest given that unit \(i\) is in the control group and the mediator takes the value that would be observed if unit \(i\) were in the control group, and

  • \(Y_i(Z_i=1 , M_i(Z_i=1)):\) the outcome of interest given that unit \(i\) is in the treatment group and the mediator takes the value that would be observed if unit \(i\) were in the treatment group.

The other two are complex potential outcomes:

  • \(Y_i(Z_i=1 , M_i(Z_i=0)):\) the outcome of interest given that unit \(i\) is treated and the mediator takes the value that would be observed in the control condition, and

  • \(Y_i(Z_i=0 , M_i(Z_i=1)):\) the outcome of interest given that unit \(i\) remains untreated and the mediator takes the value that would be observed in the treatment condition.

Thus, the data-generating process we specify defines \(Y\) as a function of \(M\) and \(Z\), and \(M\) as a function of \(Z\).

  • Inquiry: We are interested in the average effect of the treatment on the mediator \(M\), the average direct effect of the treatment on \(Y\), and the effect of \(Z\) on \(Y\) that runs through \(M\).

  • Data strategy: We assign units to treatment using complete random assignment.

  • Answer strategy: First, we regress \(M\) on \(Z\). Then we regress \(Y\) on \(M\) and \(Z\). Our estimators are the coefficients on these regressors.
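Written as equations, the model and answer strategy look roughly as follows. The functional forms and the parameters \(a\), \(b\), \(c\), \(d\) and \(\rho\) are taken from the code for this design; the regression coefficients \(\alpha\) and \(\beta\) and the residuals \(u_i\) and \(v_i\) are notation introduced here for illustration only.

\[
M_i = \mathbb{1}\left[a Z_i + e_{1i} > 0\right], \qquad Y_i = d Z_i + b M_i + c M_i Z_i + e_{2i}
\]

Here \(e_{1i}\) is standard normal and \(e_{2i}\) is normal with mean \(\rho e_{1i}\), so \(\rho \neq 0\) ties the unobserved determinants of \(M\) to those of \(Y\). The two regressions, with the interaction term that the code includes in the outcome equation, are:

\[
M_i = \alpha_0 + \alpha_1 Z_i + u_i, \qquad Y_i = \beta_0 + \beta_1 Z_i + \beta_2 M_i + \beta_3 Z_i M_i + v_i
\]

Under this parameterization, \(\hat{\alpha}_1\) targets the average effect of \(Z\) on \(M\), \(\hat{\beta}_2\) the effect of \(M\) on \(Y\) when \(Z = 0\), and \(\hat{\beta}_1\) the direct effect of \(Z\) when \(M = 0\). Refitting the outcome regression with \(1 - M_i\) in place of \(M_i\) makes the coefficient on \(Z_i\) equal to \(\beta_1 + \beta_3\), the direct effect of \(Z\) when \(M = 1\).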

In code:

library(DeclareDesign)

# Design parameters
N <- 200    # number of units
a <- 1      # effect of Z on M
b <- 0.4    # effect of M on Y
c <- 0      # interaction between M and Z in the outcome equation
d <- 0.5    # direct effect of Z on Y
rho <- 0    # correlation parameter linking the errors e1 and e2

# Model: error terms, then potential outcomes for M and Y
population <- declare_population(
  N = N,
  e1 = rnorm(N),
  e2 = rnorm(n = N, mean = rho * e1, sd = 1 - rho^2)
)
POs_M <- declare_potential_outcomes(M ~ 1 * (a * Z + e1 > 0))
POs_Y <- declare_potential_outcomes(
  Y ~ d * Z + b * M + c * M * Z + e2,
  conditions = list(M = 0:1, Z = 0:1)
)
# Outcomes with the mediator held at the value it takes under control (nat0)
# or under treatment (nat1)
POs_Y_nat_0 <- declare_potential_outcomes(
  Y_nat0_Z_0 = b * M_Z_0 + e2,
  Y_nat0_Z_1 = d + b * M_Z_0 + c * M_Z_0 + e2
)
POs_Y_nat_1 <- declare_potential_outcomes(
  Y_nat1_Z_0 = b * M_Z_1 + e2,
  Y_nat1_Z_1 = d + b * M_Z_1 + c * M_Z_1 + e2
)

# Inquiry
estimands <- declare_estimands(
  FirstStage = mean(M_Z_1 - M_Z_0),
  Indirect_0 = mean(Y_M_1_Z_0 - Y_M_0_Z_0),
  Indirect_1 = mean(Y_M_1_Z_1 - Y_M_0_Z_1),
  Controlled_Direct_0 = mean(Y_M_0_Z_1 - Y_M_0_Z_0),
  Controlled_Direct_1 = mean(Y_M_1_Z_1 - Y_M_1_Z_0),
  Natural_Direct_0 = mean(Y_nat0_Z_1 - Y_nat0_Z_0),
  Natural_Direct_1 = mean(Y_nat1_Z_1 - Y_nat1_Z_0)
)

# Data strategy: randomly assign Z, then reveal M and Y
assignment <- declare_assignment()
reveal_M <- declare_reveal(M, Z)
reveal_Y <- declare_reveal(Y, assignment_variable = c("M", "Z"))
reveal_nat0 <- declare_reveal(Y_nat0)
reveal_nat1 <- declare_reveal(Y_nat1)
# Not_M flips the mediator so that the coefficient on Z is the effect at M = 1
manipulation <- declare_step(Not_M = 1 - M, handler = fabricate)

# Answer strategy
mediator_regression <- declare_estimator(
  M ~ Z, model = lm_robust,
  estimand = "FirstStage", label = "Stage 1"
)
stage2_1 <- declare_estimator(
  Y ~ Z * M, model = lm_robust, term = c("M"),
  estimand = c("Indirect_0"), label = "Stage 2"
)
stage2_2 <- declare_estimator(
  Y ~ Z * M, model = lm_robust, term = c("Z"),
  estimand = c("Controlled_Direct_0", "Natural_Direct_0"), label = "Direct_0"
)
stage2_3 <- declare_estimator(
  Y ~ Z * Not_M, model = lm_robust, term = c("Z"),
  estimand = c("Controlled_Direct_1", "Natural_Direct_1"), label = "Direct_1"
)

# Full design
mediation_analysis_design <- population + POs_M + POs_Y +
  POs_Y_nat_0 + POs_Y_nat_1 + estimands + assignment +
  reveal_M + reveal_Y + reveal_nat0 + reveal_nat1 + manipulation +
  mediator_regression + stage2_1 + stage2_2 + stage2_3

Diagnosis

Let us diagnose two versions of this design: one in which the correlation between the error term of the mediator regression and that of the outcome regression (\(\rho\)) is greater than zero, and another in which \(\rho\) equals zero.

library(DesignLibrary)  # provides mediation_analysis_designer()
designs <- expand_design(mediation_analysis_designer, rho = c(0, 0.5))
diagnosis <- diagnose_design(designs)
rho  Estimand Label       Estimator Label  Term  N Sims  Bias    RMSE    Power   Coverage  Mean Estimate  SD Estimate  Mean Se  Type S Rate  Mean Estimand
0    Controlled_Direct_0  Direct_0         Z     500     -0.01   0.30    0.43    0.94      0.49           0.30         0.29     0.00         0.50
                                                         (0.01)  (0.01)  (0.02)  (0.01)    (0.01)         (0.01)       (0.00)   (0.00)       (0.00)
0    Controlled_Direct_1  Direct_1         Z     500     -0.01   0.18    0.77    0.95      0.49           0.18         0.18     0.00         0.50
                                                         (0.01)  (0.01)  (0.02)  (0.01)    (0.01)         (0.01)       (0.00)   (0.00)       (0.00)
0    FirstStage           Stage 1          Z     500     0.00    0.05    1.00    0.99      0.34           0.06         0.06     0.00         0.34
                                                         (0.00)  (0.00)  (0.00)  (0.00)    (0.00)         (0.00)       (0.00)   (0.00)       (0.00)
0    Indirect_0           Stage 2          M     500     0.01    0.20    0.51    0.95      0.41           0.20         0.20     0.00         0.40
                                                         (0.01)  (0.01)  (0.02)  (0.01)    (0.01)         (0.01)       (0.00)   (0.00)       (0.00)
0    Indirect_1           NA               NA    500     NA      NA      NA      NA        NA             NA           NA       NA           0.40
                                                         NA      NA      NA      NA        NA             NA           NA       NA           (0.00)
0    Natural_Direct_0     Direct_0         Z     500     -0.01   0.30    0.43    0.94      0.49           0.30         0.29     0.00         0.50
                                                         (0.01)  (0.01)  (0.02)  (0.01)    (0.01)         (0.01)       (0.00)   (0.00)       (0.00)
0    Natural_Direct_1     Direct_1         Z     500     -0.01   0.18    0.77    0.95      0.49           0.18         0.18     0.00         0.50
                                                         (0.01)  (0.01)  (0.02)  (0.01)    (0.01)         (0.01)       (0.00)   (0.00)       (0.00)
0.5  Controlled_Direct_0  Direct_0         Z     500     -0.36   0.43    0.08    0.65      0.14           0.23         0.23     0.11         0.50
                                                         (0.01)  (0.01)  (0.01)  (0.02)    (0.01)         (0.01)       (0.00)   (0.05)       (0.00)
0.5  Controlled_Direct_1  Direct_1         Z     500     -0.25   0.29    0.39    0.60      0.25           0.16         0.15     0.00         0.50
                                                         (0.01)  (0.01)  (0.02)  (0.02)    (0.01)         (0.01)       (0.00)   (0.00)       (0.00)
0.5  FirstStage           Stage 1          Z     500     0.00    0.05    1.00    0.98      0.34           0.06         0.06     0.00         0.34
                                                         (0.00)  (0.00)  (0.00)  (0.01)    (0.00)         (0.00)       (0.00)   (0.00)       (0.00)
0.5  Indirect_0           Stage 2          M     500     0.79    0.81    1.00    0.00      1.19           0.17         0.16     0.00         0.40
                                                         (0.01)  (0.01)  (0.00)  (0.00)    (0.01)         (0.01)       (0.00)   (0.00)       (0.00)
0.5  Indirect_1           NA               NA    500     NA      NA      NA      NA        NA             NA           NA       NA           0.40
                                                         NA      NA      NA      NA        NA             NA           NA       NA           (0.00)
0.5  Natural_Direct_0     Direct_0         Z     500     -0.36   0.43    0.08    0.65      0.14           0.23         0.23     0.11         0.50
                                                         (0.01)  (0.01)  (0.01)  (0.02)    (0.01)         (0.01)       (0.00)   (0.05)       (0.00)
0.5  Natural_Direct_1     Direct_1         Z     500     -0.25   0.29    0.39    0.60      0.25           0.16         0.15     0.00         0.50
                                                         (0.01)  (0.01)  (0.02)  (0.02)    (0.01)         (0.01)       (0.00)   (0.00)       (0.00)

Our diagnosis indicates that when the error terms are not correlated, the direct and indirect effects can be estimated without bias. In contrast, when \(\rho\) is greater than zero, the regression overestimates the effect of the mediator on \(Y\) (a bias of 0.79 at \(\rho = 0.5\)) and understates the direct effects of \(Z\) on \(Y\) (biases of -0.36 and -0.25).

Unfortunately, the assumption of no correlation cannot be guaranteed by the design: because \(M\) is not assigned at random, unobserved factors that influence \(M\) may also influence \(Y\).
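The source of this bias can be reproduced in a few lines of base R, without DeclareDesign. The following is a minimal sketch under stated assumptions: it copies the data generating process and default parameter values from the design above, but fits the simpler additive regression of \(Y\) on \(Z\) and \(M\) rather than the interacted specification, and the helper name simulate_coefs is hypothetical.

```r
# Minimal base-R sketch (not the DesignLibrary implementation).
# simulate_coefs() is a hypothetical helper introduced for illustration.
set.seed(42)

simulate_coefs <- function(rho, N = 200, a = 1, b = 0.4, d = 0.5) {
  Z  <- sample(rep(0:1, length.out = N))            # complete random assignment
  e1 <- rnorm(N)                                    # mediator error
  e2 <- rnorm(N, mean = rho * e1, sd = 1 - rho^2)   # outcome error, as in the design
  M  <- as.numeric(a * Z + e1 > 0)                  # binary mediator
  Y  <- d * Z + b * M + e2                          # outcome with c = 0 (no interaction)
  coef(lm(Y ~ Z + M))[c("Z", "M")]                  # additive outcome regression
}

# 500 simulated experiments for each value of rho
sims_rho_0  <- replicate(500, simulate_coefs(rho = 0))
sims_rho_05 <- replicate(500, simulate_coefs(rho = 0.5))

# Average coefficient estimates; compare with d = 0.5 (Z) and b = 0.4 (M)
rowMeans(sims_rho_0)
rowMeans(sims_rho_05)
```

With \(\rho = 0\), both averages should sit close to the true values. With \(\rho = 0.5\), units with \(M = 1\) tend to have larger \(e_1\) and therefore larger \(e_2\), so the coefficient on \(M\) absorbs part of the error term and the coefficient on \(Z\) is pulled downward.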

Using the Mediation Analysis Designer

In R, you can generate a mediation analysis design using the template function mediation_analysis_designer() in the DesignLibrary package. First, load the package:

library(DesignLibrary)

We can then create specific designs by defining values for each argument. For example, we create a design called mediation_analysis_design with N, a, b, d and rho set to 500, .2, .4, .2, and .15, respectively, by running the lines below.

mediation_analysis_design <- mediation_analysis_designer(
  N = 500, a = .2, b = .4, d = .2, rho = .15)

You can see more details on the mediation_analysis_designer() function and its arguments by running the following line of code:

??mediation_analysis_designer

Further Reading

Gerber, Alan S., and Donald P. Green. 2012. Field Experiments: Design, Analysis, and Interpretation. New York: W.W. Norton.

Imai, Kosuke, Luke Keele, Dustin Tingley, and Teppei Yamamoto. 2011. “Unpacking the Black Box of Causality: Learning About Causal Mechanisms from Experimental and Observational Studies.” American Political Science Review 105 (4): 765–89.