You don’t always get the data you want. Very often, individuals who would be relevant to your inquiry don’t show up in the data: people refuse to answer surveys, data gets lost, collection activities are interrupted.
When data goes missing, two things happen. First, your power goes down because you have less data to work with relative to a study with complete data. Second, you have to worry about whether there exists any systematic relationship between missingness and the outcomes you are studying. If such a relationship exists, it can introduce bias.
These features of attrition are fairly well-known. But how much attrition is too much? How high does the correlation between the propensity to go missing and the outcome you care about have to be in order to seriously jeopardize a study? Here, we declare a design that allows us to study such questions.
Model:
Our model of the world specifies a population of \(N\) units that have three variables affected by the treatment: a response variable, \(R_i\); our outcome of interest, \(Y_i\), which is correlated with the response variable through \(\rho\); and \(Y^{obs}_i\), the measured version of the true \(Y_i\), which is only observed when \(R_i = 1\).
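One way to build the correlation \(\rho\) into such a model, and the one used in the design code in this section, is through two standard normal latent variables, \(u_{R,i}\) and \(u_{Y,i}\), that are thresholded to produce \(R_i\) and \(Y_i\). The short derivation below is a sketch that checks this construction: \(u_{Y,i}\) keeps unit variance and has correlation \(\rho\) with \(u_{R,i}\). Note that \(\rho\) is the correlation between the latent variables, not between the binary \(R_i\) and \(Y_i\) themselves.

\[
u_{R,i} \sim N(0, 1), \qquad u_{Y,i} \mid u_{R,i} \sim N\left(\rho\, u_{R,i},\; 1 - \rho^2\right)
\]

\[
\operatorname{Var}(u_{Y,i}) = \rho^2 \operatorname{Var}(u_{R,i}) + (1 - \rho^2) = 1, \qquad
\operatorname{Cov}(u_{R,i}, u_{Y,i}) = \rho \operatorname{Var}(u_{R,i}) = \rho
\]

\[
\operatorname{Cor}(u_{R,i}, u_{Y,i}) = \frac{\rho}{\sqrt{1 \cdot 1}} = \rho
\]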
Inquiry:
We want to know the average of all units’ differences in treated and untreated potential outcomes – the average treatment effect on the outcome of interest: \(E[Y_i(Z = 1) - Y_i(Z = 0)]\). But we also want to know the average treatment effect on reporting, \(E[R_i(Z = 1) - R_i(Z = 0)]\), as well as the effect of the treatment among those who report, \(E[Y_i(Z = 1) - Y_i(Z = 0) \mid R_i = 1]\).
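As a rough illustration of the three inquiries, the base R sketch below builds one simulated population and computes each quantity directly from the potential outcomes. It assumes the same threshold model and parameter values as the design code in this section (intercepts of 0, treatment effects of 1), with an illustrative \(\rho = 0.8\) and a larger \(N\) chosen only to reduce noise; the object names are hypothetical.

```r
set.seed(1)

N   <- 1000   # assumed population size, larger than the design's N = 100
rho <- 0.8    # assumed correlation between the latent variables

# Latent variables: u_Y has unit variance and correlation rho with u_R
u_R <- rnorm(N)
u_Y <- rnorm(N, mean = rho * u_R, sd = sqrt(1 - rho^2))

# Potential outcomes under control (Z = 0) and treatment (Z = 1)
R_Z_0 <- as.numeric(0 > u_R)
R_Z_1 <- as.numeric(1 > u_R)
Y_Z_0 <- as.numeric(0 > u_Y)
Y_Z_1 <- as.numeric(1 > u_Y)

# Complete random assignment of half the units, then realized reporting
Z <- sample(rep(c(0, 1), each = N / 2))
R <- ifelse(Z == 1, R_Z_1, R_Z_0)

estimands <- c(
  ate_R         = mean(R_Z_1 - R_Z_0),
  ate_Y         = mean(Y_Z_1 - Y_Z_0),
  ate_Y_given_R = mean((Y_Z_1 - Y_Z_0)[R == 1])
)
print(round(estimands, 3))
```

Under these assumptions the first two quantities should both be close to `pnorm(1) - pnorm(0)`, about 0.34, while the third depends on \(\rho\) because it averages only over units that report.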
Data strategy:
We randomly assign half of the units to treatment.
Answer strategy:
For \(R_i\) and \(Y^{obs}_i\), we subtract the mean of the control group’s values from the mean of the treatment group in order to estimate the average treatment effect.
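The estimator described above can be sketched in a few lines of base R. This is only an illustration on a made-up eight-unit data set; `dim_estimate` is a hypothetical helper, not a function from any package, and it handles missing outcomes by dropping them.

```r
# Hypothetical helper: treatment-group mean minus control-group mean,
# ignoring missing values of the outcome.
dim_estimate <- function(outcome, treatment) {
  mean(outcome[treatment == 1], na.rm = TRUE) -
    mean(outcome[treatment == 0], na.rm = TRUE)
}

# Toy data: four treated and four control units
Z <- c(1, 1, 1, 1, 0, 0, 0, 0)
R <- c(1, 0, 1, 1, 1, 0, 1, 1)   # 1 = outcome reported
Y <- c(1, 1, 1, 0, 0, 1, 1, 0)   # true outcome, not always observed

# The outcome is only observed when R == 1
Y_obs <- ifelse(R == 1, Y, NA)

estimate_R     <- dim_estimate(R, Z)
estimate_Y_obs <- dim_estimate(Y_obs, Z)
```

In this toy data set the reporting rate is 0.75 in both arms, so `estimate_R` is 0, and the observed outcome means are 2/3 and 1/3, so `estimate_Y_obs` is 1/3.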
library(DeclareDesign)

# Design parameters
N <- 100
a_R <- 0  # intercept of the reporting equation
b_R <- 1  # effect of treatment on reporting
a_Y <- 0  # intercept of the outcome equation
b_Y <- 1  # effect of treatment on the outcome
rho <- 0  # correlation between the latent variables u_R and u_Y

# Model: latent variables and potential outcomes
population <- declare_population(
  N = N,
  u_R = rnorm(N),
  u_Y = rnorm(N, mean = rho * u_R, sd = sqrt(1 - rho^2))
)
potential_outcomes_R <- declare_potential_outcomes(R ~ (a_R + b_R * Z > u_R))
potential_outcomes_Y <- declare_potential_outcomes(Y ~ (a_Y + b_Y * Z > u_Y))

# Inquiry
estimand_1 <- declare_estimand(mean(R_Z_1 - R_Z_0), label = "ATE on R")
estimand_2 <- declare_estimand(mean(Y_Z_1 - Y_Z_0), label = "ATE on Y")
estimand_3 <- declare_estimand(mean((Y_Z_1 - Y_Z_0)[R == 1]),
                               label = "ATE on Y (Given R)")

# Data strategy: assign half of the units to treatment, reveal outcomes,
# and record Y only for units that report (R == 1)
assignment <- declare_assignment(prob = 0.5)
reveal <- declare_reveal(outcome_variables = c("R", "Y"))
observed <- declare_step(Y_obs = ifelse(R, Y, NA), handler = fabricate)

# Answer strategy: difference in means (DIM) estimators
estimator_1 <- declare_estimator(R ~ Z, term = "Z", estimand = estimand_1,
                                 label = "DIM on R")
estimator_2 <- declare_estimator(Y_obs ~ Z, term = "Z",
                                 estimand = c(estimand_2, estimand_3),
                                 label = "DIM on Y_obs")
estimator_3 <- declare_estimator(Y ~ Z, term = "Z",
                                 estimand = c(estimand_2, estimand_3),
                                 label = "DIM on Y")

two_arm_attrition_design <- population + potential_outcomes_R +
  potential_outcomes_Y + assignment + reveal + observed +
  estimand_1 + estimand_2 + estimand_3 +
  estimator_1 + estimator_2 + estimator_3
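# two_arm_attrition_designer() is provided by the DesignLibrary package
library(DesignLibrary)

# Build one design per value of rho, then diagnose all three
designs <- expand_design(designer = two_arm_attrition_designer,
                         rho = c(0, 0.2, 0.8))
diagnoses <- diagnose_designs(designs, sims = 25)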
## Warning: We recommend you choose a higher number of simulations than 25 for the
## top level of simulation.
knitr::kable(reshape_diagnosis(diagnoses, select = c("Bias", "Power")), digits = 2)
Design Label | Estimand Label | Estimator Label | Term | N Sims | Bias | Power |
---|---|---|---|---|---|---|
design_1 | ATE on R | DIM on R | Z | 25 | 0.02 | 1.00 |
 | | | | | (0.01) | (0.00) |
design_1 | ATE on Y | DIM on Y | Z | 25 | 0.03 | 0.96 |
 | | | | | (0.01) | (0.04) |
design_1 | ATE on Y | DIM on Y_obs | Z | 25 | 0.05 | 0.92 |
 | | | | | (0.02) | (0.06) |
design_1 | ATE on Y (Given R) | DIM on Y | Z | 25 | 0.01 | 0.96 |
 | | | | | (0.02) | (0.04) |
design_1 | ATE on Y (Given R) | DIM on Y_obs | Z | 25 | 0.04 | 0.92 |
 | | | | | (0.02) | (0.06) |
design_2 | ATE on R | DIM on R | Z | 25 | 0.02 | 1.00 |
 | | | | | (0.01) | (0.00) |
design_2 | ATE on Y | DIM on Y | Z | 25 | 0.03 | 1.00 |
 | | | | | (0.01) | (0.00) |
design_2 | ATE on Y | DIM on Y_obs | Z | 25 | 0.01 | 0.88 |
 | | | | | (0.02) | (0.06) |
design_2 | ATE on Y (Given R) | DIM on Y | Z | 25 | 0.03 | 1.00 |
 | | | | | (0.02) | (0.00) |
design_2 | ATE on Y (Given R) | DIM on Y_obs | Z | 25 | 0.01 | 0.88 |
 | | | | | (0.02) | (0.06) |
design_3 | ATE on R | DIM on R | Z | 25 | 0.02 | 1.00 |
 | | | | | (0.01) | (0.00) |
design_3 | ATE on Y | DIM on Y | Z | 25 | 0.03 | 0.96 |
 | | | | | (0.01) | (0.04) |
design_3 | ATE on Y | DIM on Y_obs | Z | 25 | -0.18 | 0.20 |
 | | | | | (0.02) | (0.09) |
design_3 | ATE on Y (Given R) | DIM on Y | Z | 25 | 0.08 | 0.96 |
 | | | | | (0.01) | (0.04) |
design_3 | ATE on Y (Given R) | DIM on Y_obs | Z | 25 | -0.13 | 0.20 |
 | | | | | (0.01) | (0.09) |
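The diagnosis illustrates that the effect on reporting is estimated with high power and little bias at every value of \(\rho\) considered.
However, a strategy that relies on the observed outcome \(Y_i^{obs}\) can be badly biased when missingness is correlated with the outcome, even for an estimand that is conditional on reporting. At \(\rho = 0.8\) (design_3), the bias of the estimator based on \(Y_i^{obs}\) is -0.18 for the ATE on \(Y\) and -0.13 for the ATE on \(Y\) given \(R\), and its power falls to 0.20. Correlation between missingness and outcomes can seriously jeopardize inferences. With only 25 simulations per design, the diagnosands at smaller values of \(\rho\) are too imprecise to show how quickly the bias grows.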
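To explore how the bias changes with \(\rho\) without the declared design, the base R sketch below runs a small Monte Carlo. It is an approximation under stated assumptions, not the diagnosis reported in the table: it hard-codes the same threshold model (intercepts of 0, treatment effects of 1, \(N = 100\)), compares the difference in means on observed outcomes with the sample ATE on \(Y\), and uses 500 simulations per value of \(\rho\). The function and object names are hypothetical.

```r
set.seed(42)

# One simulated experiment: error of the difference in means on observed
# outcomes, relative to the sample ATE on Y.
simulate_once <- function(N = 100, rho = 0) {
  u_R <- rnorm(N)
  u_Y <- rnorm(N, mean = rho * u_R, sd = sqrt(1 - rho^2))

  # Complete random assignment of half the units
  Z <- sample(rep(c(0, 1), each = N / 2))

  # Realized reporting and outcome (a_R = a_Y = 0, b_R = b_Y = 1)
  R <- as.numeric(Z > u_R)
  Y <- as.numeric(Z > u_Y)
  Y_obs <- ifelse(R == 1, Y, NA)

  # Sample ATE on Y from the two potential outcomes
  ate_Y <- mean((1 > u_Y) - (0 > u_Y))

  estimate <- mean(Y_obs[Z == 1], na.rm = TRUE) -
    mean(Y_obs[Z == 0], na.rm = TRUE)
  estimate - ate_Y
}

rhos <- c(0, 0.2, 0.8)
bias_by_rho <- data.frame(
  rho  = rhos,
  bias = sapply(rhos, function(r) mean(replicate(500, simulate_once(rho = r))))
)
print(bias_by_rho)
```

Adding more values to `rhos` or raising the number of replications gives a finer picture of where the bias becomes large enough to matter for a given study.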