Often we want to learn about the causal effect of one thing on something else – say, the effect of taking a pill on a person becoming healthy. We can define the effect of a cause on an outcome as the difference between how the outcome would have looked under two states of the world: the potential outcome in a world where the cause is present and the potential outcome in a world where it is absent. Were we able to observe that a person was sick in a world where they took the pill and sick in a world where they didn’t, for example, we would infer that the pill did not have any causal effect on their health.

The problem is that we can never observe our world in two counterfactual states at once, and so we can never estimate the effect of a cause on any particular individual person or thing. We can, however, estimate an average of individual causal effects, which we refer to as the average treatment effect (ATE). The simplest design for inferring an ATE is the two-arm experiment: some people are assigned at random to treatment (the first arm), and the rest are assigned to control (the second arm). The difference between the average outcomes of the two groups gives us an unbiased estimate of the true ATE.
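In symbols, with \(Z_i\) indicating assignment to treatment and \(m\) of the \(n\) experimental units treated (these quantities are defined formally in the data strategy), the difference-in-means estimator is

\[
\hat{\tau} \;=\; \frac{1}{m}\sum_{i : Z_i = 1} Y_i \;-\; \frac{1}{n - m}\sum_{i : Z_i = 0} Y_i.
\]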

How does this design work? One way to think about it is to compare the random assignment of the treatment to a random sampling procedure: in this design it is as though we take a representative sample from the two counterfactual states of the world. Seen in this way, the treatment and control groups are ‘random samples’ from the potential outcomes in a world where the cause is present and one in which it is absent. By taking the mean of the group that represents the treated potential outcomes and comparing it to the mean of the group that represents the untreated potential outcomes, we can construct a representative guess of the true average difference between the two states of the world. Because random assignment ensures that neither group differs systematically from the set of potential outcomes it represents, our estimator is unbiased: across hypothetical replications of the experiment, its estimates average out to the true ATE. And just as in random sampling, as the size of our experiment grows our guesses will converge on the truth. As we shall see, however, characterizing our uncertainty about our guesses can be complicated, even in such a simple design.
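The sampling analogy can be seen in a short simulation. The following base-R sketch is our own illustration, separate from the design declared below, and its numbers (100 units, a constant effect of 1) are arbitrary assumptions: it repeatedly re-randomizes a fixed set of potential outcomes and shows that the difference in means averages out to the true ATE.

```r
set.seed(42)
n <- 100                    # number of experimental units (illustrative)
Y0 <- rnorm(n)              # control potential outcomes
Y1 <- Y0 + 1                # treated potential outcomes: constant effect of 1

estimates <- replicate(2000, {
  Z <- sample(rep(c(0, 1), each = n / 2))  # complete random assignment
  Y <- ifelse(Z == 1, Y1, Y0)              # reveal observed outcomes
  mean(Y[Z == 1]) - mean(Y[Z == 0])        # difference in means
})

mean(estimates)  # close to the true ATE of 1
```

Any single estimate can miss the truth in either direction; it is the average over re-randomizations that pins down the ATE.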

Design Declaration

  • Model:

    Our model of the world specifies a population of \(N\) units that each have a control potential outcome, \(Y_i(Z = 0)\), drawn from a standard normal distribution. A unit’s individual treatment effect, \(t_i\), is a random draw from a distribution with mean \(\tau\) and standard deviation \(\sigma\), and is added to its control potential outcome: \(Y_i(Z = 1) = Y_i(Z = 0) + t_i\). This implies that the variance of the treated potential outcomes is higher than the variance of the control potential outcomes; the two are nevertheless correlated, because each treated potential outcome is constructed by adding the treatment effect to the corresponding control potential outcome.

  • Inquiry:

    We want to know the average of all units’ differences in treated and untreated potential outcomes – the average treatment effect: \(E[Y_i(Z = 1) - Y_i(Z = 0)] = E[t_i] = \tau\).

  • Data strategy:

    We randomly sample \(n\) units from the population of \(N\). We then randomly assign a fixed number, \(m\), of the sampled units to treatment and the remaining \(n - m\) units to control.

  • Answer strategy:

    We subtract the mean of the control group from the mean of the treatment group in order to estimate the average treatment effect.

# Model ------------------------------------------------------------------------
N <- 500
n <- 250
m <- 100
tau <- 1
sigma <- 3
population <- declare_population(
    N = N, noise = rnorm(N),
    treatment_effect = rnorm(N, mean = tau, sd = sigma))
potential_outcomes <- declare_potential_outcomes(
    Y_Z_0 = noise, 
    Y_Z_1 = noise + treatment_effect)

# Inquiry ----------------------------------------------------------------------
estimand <- declare_estimand(ATE = mean(Y_Z_1 - Y_Z_0))

# Data Strategy ----------------------------------------------------------------
sampling <- declare_sampling(n = n)
assignment <- declare_assignment(m = m)

# Answer Strategy --------------------------------------------------------------
estimator <- declare_estimator(Y ~ Z, estimand = estimand)

# Design -----------------------------------------------------------------------
two_arm <- declare_design(population, potential_outcomes, sampling,
                          estimand, assignment, reveal_outcomes, estimator)


With the design declared we can run a diagnosis from Monte Carlo simulations of the design:

diagnosis <- diagnose_design(two_arm, sims = 10000, bootstrap_sims = 1000)
Mean Estimate  Mean Estimand    Bias  SE(Bias)  Power  SE(Power)  Coverage  SE(Coverage)
        0.998          0.999  -0.002     0.003  0.858      0.003     0.984         0.001

The diagnosis indicates that our two-arm design recovers an unbiased estimate of the average treatment effect in our sample. The average estimate is 0.998 and the mean estimand is 0.999. Our estimate of the bias is thus very close to 0, at -0.002, and the bootstrapped standard errors of our simulations suggest that its departure from exactly 0 is due to simulation error.

The power of our design is good: if this simulation were to faithfully represent how the study will work in practice, we would have a roughly 86% chance of correctly rejecting the null hypothesis of an average treatment effect of 0, above the conventional 80% target.

Note, however, that the coverage of the estimator is too high: our 95% confidence intervals include the true effect 98% of the time. This implies that our intervals are too wide, which in turn implies that our estimates of the variance of the difference in means are larger than they should be, leading us to fail to reject the null more often than we should. Why might this be the case? Recall for starters that the sampling variance of the difference in means depends not only on the variances of the treated and control potential outcomes but also on their covariance. While it is easy enough to estimate the variance of the treatment group or the control group using the sample variance estimators, estimating the covariance of the potential outcomes is impossible, because we never observe the treated and control potential outcomes of the same unit simultaneously. Common estimators of the variance of a difference in means therefore rely on a conservative bound: since the covariance can be no larger than half the sum of the two variances, conventional estimators substitute that upper bound – equivalent to assuming a constant treatment effect – which can only overstate the true variance.
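To make the logic concrete, here is a sketch of the standard finite-sample result (the symbols \(S^2_1\), \(S^2_0\), \(S_{1,0}\), and \(S^2_t\) are our own shorthand): writing \(S^2_1\) and \(S^2_0\) for the sample variances of the treated and control potential outcomes, \(S_{1,0}\) for their covariance, and \(S^2_t\) for the variance of the individual treatment effects,

\[
\mathrm{Var}\big(\hat{\tau}\big) \;=\; \frac{S^2_1}{m} \;+\; \frac{S^2_0}{n-m} \;-\; \frac{S^2_t}{n},
\qquad
S^2_t \;=\; S^2_1 + S^2_0 - 2\,S_{1,0}.
\]

Because \(S_{1,0}\) is unobservable, \(S^2_t\) cannot be estimated. But since \(S^2_t \geq 0\), dropping the final term – as the conventional estimator \(s^2_1/m + s^2_0/(n-m)\) in effect does – can only overstate the true variance, producing the conservative coverage seen in the diagnosis.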

Further Reading

To learn more about two-arm designs and the tradeoffs involved in designing them, see Gerber and Green (2012). For a discussion of why conventional estimators of the variance of the difference in means are conservatively biased, and what to do about it, see Aronow et al. (2014).


Exercises

  • Modify the estimator in the design to use the lm_robust function from the estimatr package. What happens to the coverage and power, and why?

  • Modify the design so that the variance of the treatment effect is 0. What happens to the coverage of the estimator? Why?

  • Returning to the original design, try increasing or decreasing the relative proportion of units treated, \(m/n\). How does this affect power and coverage? Explain.


Aronow, Peter M., Donald P. Green, Donald K. K. Lee, and others. 2014. “Sharp Bounds on the Variance in Randomized Experiments.” The Annals of Statistics 42 (3): 850–71.

Gerber, Alan S., and Donald P. Green. 2012. Field Experiments: Design, Analysis, and Interpretation. New York: W.W. Norton.