DeclareDesign is a system for describing research designs in code and simulating them in order to understand their properties. Because DeclareDesign employs a consistent grammar of designs, you can focus on the intellectually challenging part – designing good research studies – without having to code up simulations from scratch. DeclareDesign is based on the Model-Inquiry-Data Strategy-Answer Strategy (MIDA) framework for describing designs and a declare-diagnose-redesign workflow for improving research designs before implementing them.
You can find a few different entry points into these ideas and the software tools:
 Short introduction to the MIDA framework
 Short introduction to DeclareDesign on the World Bank Development Impact blog
 Our paper in the American Political Science Review (appendices)
 Library of research designs, which includes a range of canonical designs that can be declared with a single line of code and then tailored for your own uses
 Blog series that demonstrates how DeclareDesign can be used to shed light on a range of tricky design choices, which includes declarations of many designs
In this guide, we introduce you to DeclareDesign for R and how each step of the design-diagnose-redesign process can be implemented in it.
Installing R
To get started, install the statistical computing environment R, which you can download for free from CRAN. We also recommend the free program RStudio, which provides a friendly interface to R.^{1}
Once you’ve got RStudio installed, open it up and install DeclareDesign and its related packages. These include three packages that enable specific steps in the research process (fabricatr for simulating social science data; randomizr for random sampling and random assignment; and estimatr for design-based estimators). You can also install DesignLibrary, which gets standard designs up and running in one line. To install them, you can type:
install.packages(c("DeclareDesign", "fabricatr", "randomizr", "estimatr", "DesignLibrary"))
We also recommend you install and get to know the tidyverse suite of packages for data analysis, which we will use in this guide:
install.packages("tidyverse")
In this guide, we will introduce the DeclareDesign software and how to implement the MIDA framework within it. We will not provide a general introduction to R or to the tidyverse, because there are already many terrific introductions. We especially recommend R for Data Science, available for free on the Web.
Where we are going
We will build up to declaring and diagnosing a design in this section. But to get a sense of the goal, below is a simple 100-unit randomized experiment design declared, diagnosed, and redesigned.
Declaring a design
# we should turn this into a picture labeling MIDA
simple_design <-
# M: model
# a 100-unit population with an unobserved shock 'u'
declare_population(N = 100, u = rnorm(N)) +
# two potential outcomes, Y_Z_0 and Y_Z_1
# Y_Z_0 is the control potential outcome (what would happen if the unit is untreated)
# it is equal to the unobserved shock 'u'
# Y_Z_1 is the treated potential outcome
# it is equal to the control potential outcome plus a treatment effect of 0.25
declare_potential_outcomes(Y_Z_0 = u, Y_Z_1 = Y_Z_0 + 0.25) +
# I: inquiry
# we are interested in the average treatment effect in the population (PATE)
declare_estimand(PATE = mean(Y_Z_1 - Y_Z_0)) +
# D: data strategy
# sampling: we randomly sample 50 of the 100 units in the population
declare_sampling(n = 50) +
# assignment: we randomly assign half of the 50 sampled units to treatment (half to control)
declare_assignment(prob = 0.5) +
# reveal outcomes: construct outcomes from the potential outcomes named Y depending on
# the realized value of their assignment variable named Z
declare_reveal(outcome_variables = Y, assignment_variables = Z) +
# A: answer strategy
# calculate the difference-in-means of Y depending on Z
# we link this estimator to PATE because this is our estimate of our inquiry
declare_estimator(Y ~ Z, model = difference_in_means, estimand = "PATE")
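The reveal step in the declaration above can be sketched in base R. This is only an illustration of the idea, not DeclareDesign's implementation: the six-unit data frame `units` and its values are made up, and the sketch assumes the same constant effect of 0.25 used in the design.

```r
# Hypothetical six-unit data set; values are made up for illustration.
units <- data.frame(
  Y_Z_0 = c(-0.5, 0.1, 1.2, -1.0, 0.3, 0.8),  # control potential outcomes
  Z     = c(0, 1, 0, 1, 1, 0)                 # realized assignment
)
units$Y_Z_1 <- units$Y_Z_0 + 0.25             # treated potential outcomes

# Reveal: each unit shows the potential outcome that matches its assignment.
units$Y <- ifelse(units$Z == 1, units$Y_Z_1, units$Y_Z_0)
units
```

Only one potential outcome per unit ends up in Y, which is why the estimator later has to work from Y and Z alone.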
Diagnosis
To diagnose the design, we first define a set of diagnosands, which are statistical properties of the design. In this case, we select the bias (difference between the estimate and the estimand, which is the PATE); the root mean-squared error; and the statistical power of the design.
# Select diagnosands
simple_design_diagnosands <-
declare_diagnosands(select = c(bias, rmse, power))
We then diagnose the design, which involves simulating the design again and again and then calculating the diagnosands from the simulations data.
# Diagnose the design
simple_design_diagnosis <-
diagnose_design(simple_design, diagnosands = simple_design_diagnosands, sims = 500)
estimand_label  estimator_label  bias  rmse  power 

PATE  estimator  0  0.28  0.13 
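The three diagnosands selected above can be written out as a rough sketch. The notation is an assumption introduced here, not the package's own: $\hat{\theta}_s$ is the estimate, $\theta_s$ the estimand, and $p_s$ the p-value in simulation $s$ of $S$, with the 0.05 threshold matching the one used in the dplyr pipeline in the diagnosis section of this guide.

```latex
\begin{aligned}
\text{bias}  &= \frac{1}{S}\sum_{s=1}^{S}\left(\hat{\theta}_s - \theta_s\right) \\
\text{RMSE}  &= \sqrt{\frac{1}{S}\sum_{s=1}^{S}\left(\hat{\theta}_s - \theta_s\right)^2} \\
\text{power} &= \frac{1}{S}\sum_{s=1}^{S}\mathbf{1}\left[p_s < 0.05\right]
\end{aligned}
```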
Redesign
We see that the power of the design is low, so we increase the number of sampled units from 50 to 100. replace_step creates a new design, swapping out the fourth step (sampling) for a modified sampling step.
redesigned_simple_design <-
replace_step(simple_design,
step = 4,
new_step = declare_sampling(n = 100))
With the big picture of the declaration, diagnosis, and redesign of a simple design in mind, we now turn to building up from a single step to a full declared design.
Building a step of a research design
We begin learning about how to build a research design in DeclareDesign by declaring a single step: random assignment. We take as a starting point a fixed set of data, describing a set of voters in Los Angeles. The research project we are planning involves randomly assigning voters to receive a knock on their door from a canvasser (or not to receive a door knock). Our data look like this:
ID  age  sex  party  precinct 

001  27  M  DEM  5210 
002  53  M  REP  2155 
003  35  F  REP  4321 
004  75  M  REP  3590 
005  66  F  REP  5297 
006  64  M  GRN  8905 
There are 100 voters in the dataset.
Using dplyr
We plan to randomly assign voters to treatment (door knock) or control (no door knock) with equal probability, so that about 50 of the 100 voters end up in each group. We want to create an indicator variable Z, where 1 represents treatment and 0 control. In order to do this, we use R’s sample function:
voter_file <- voter_file %>%
mutate(Z = sample(c(0, 1), size = 100, replace = TRUE, prob = c(0.5, 0.5)))
This says: draw 100 values (one per voter) from 0 and 1, sampling with replacement and with probability 0.5 each. Recall that the %>% operator sends a data frame to the dplyr verb mutate, which can add new columns to a data frame. This is a short dplyr “pipeline”.^{2}
Now our data frame voter_file includes the Z indicator:
ID  age  sex  party  precinct  Z 

001  27  M  DEM  5210  1 
002  53  M  REP  2155  0 
003  35  F  REP  4321  1 
004  75  M  REP  3590  1 
005  66  F  REP  5297  1 
006  64  M  GRN  8905  1 
We can make things a bit easier with the randomizr package, which includes common random assignment functions, including the simple random assignment used here. You can instead write:
voter_file <- voter_file %>%
mutate(Z = simple_ra(N = 100, prob = 0.5))
We might use this dplyr
pipeline to actually implement the random assignment for a study. But to diagnose the properties of a research design, we want to know what would happen under any possible random assignment. To do this, we will need to run the assignment step over and over again and save the results.
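As a minimal base R sketch of what "over and over" means here, the following repeats a coin-flip assignment many times and stores every draw. The helper `draw_assignment` and the choice of 1,000 repetitions are assumptions made for illustration; they are not part of randomizr or DeclareDesign.

```r
set.seed(42)  # only so the sketch is reproducible

n_voters <- 100

# One run of the assignment step: an independent coin flip for each voter.
draw_assignment <- function(n, prob = 0.5) {
  rbinom(n, size = 1, prob = prob)
}

# Repeat the step 1,000 times; each column is one possible assignment.
assignments <- replicate(1000, draw_assignment(n_voters))

# The number treated varies from run to run under coin-flip assignment.
n_treated <- colSums(assignments)
summary(n_treated)
```

Each column of `assignments` is one assignment that could have been realized, which is the raw material a diagnosis works from.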
As a function
To simulate the design in order to diagnose it, we need to turn the assignment step into a function. The function can then be run again and again, each time resulting in a different random assignment.
In DeclareDesign, we are going to use a special kind of function: a tidy function, which takes in a data frame and returns a data frame. The new data frame may have an additional variable (such as a random assignment) or it may have fewer rows (due to sampling, for example).
For our random assignment step, we want to create a tidy function that adds our assignment indicator Z
to the data, but leaves it otherwise unchanged. We write:
simple_random_assignment_function <- function(data) {
data %>% mutate(Z = simple_ra(N = 100, prob = 0.5))
}
We took the dplyr
pipeline we built above, and put it on the inside of a tidy function. Now, when we run our random assignment function on the voter file, it adds in Z
:
simple_random_assignment_function(voter_file)
ID  age  sex  party  precinct  Z 

001  27  M  DEM  5210  0 
002  53  M  REP  2155  1 
003  35  F  REP  4321  1 
004  75  M  REP  3590  0 
005  66  F  REP  5297  1 
006  64  M  GRN  8905  1 
In DeclareDesign
DeclareDesign makes writing each design step just a bit easier. Instead of writing a function each time, it writes a function for us. The core of DeclareDesign is a set of declare_* functions, including declare_assignment. Each one is a function factory, meaning it takes a set of parameters about your research design, like the number of units and the random assignment probability, as inputs and returns a function as an output.
Instead of writing the function simple_random_assignment_function as we did above, in DeclareDesign we declare it:
simple_random_assignment_step <- declare_assignment(prob = 0.5)
simple_random_assignment_step is a tidy function. You can run the function on data:
simple_random_assignment_step(voter_file)
ID  age  sex  party  precinct  Z  Z_cond_prob 

001  27  M  DEM  5210  1  0.5 
002  53  M  REP  2155  1  0.5 
003  35  F  REP  4321  0  0.5 
004  75  M  REP  3590  0  0.5 
005  66  F  REP  5297  0  0.5 
006  64  M  GRN  8905  1  0.5 
A few parts of the declaration may seem a little bit odd. First, we did not tell R anything about the number of units in our dataset, as we did in the function and in the dplyr pipeline we wrote earlier. Second, we didn’t give it the data! This is because a step declaration creates a function that will work on a dataset of any size. We told declare_assignment that we want to assign treatment with probability 0.5 (and implicitly control with probability 1 - 0.5 = 0.5), regardless of how large the dataset is. We did not send the declaration the data, because declare_assignment automatically creates a tidy function for us, one that takes data and returns data with an assignment step. We will see in a moment how DeclareDesign
uses these functions to simulate data from a research design. But you can always use the function yourself with your own data.
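The function-factory idea can be sketched in a few lines of base R. The name `make_assignment_step` is hypothetical and this is not how DeclareDesign is implemented internally; the sketch only shows how a declaration can fix the probability once and then work on data of any size.

```r
# Hypothetical function factory; not DeclareDesign's actual implementation.
make_assignment_step <- function(prob) {
  # The returned function remembers `prob` and works on data of any size.
  function(data) {
    data$Z <- rbinom(nrow(data), size = 1, prob = prob)
    data
  }
}

assignment_step <- make_assignment_step(prob = 0.5)

small_data <- data.frame(ID = 1:6)
large_data <- data.frame(ID = 1:1000)

head(assignment_step(small_data))
nrow(assignment_step(large_data))
```

The number of units is read from the data at run time with `nrow(data)`, which is why the declaration itself never needs to mention it.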
Every step of a research design in MIDA can be written using one of the declare_* functions. In the next section, we walk through each step and how to declare it using DeclareDesign.
Declaring each step of a research design
In this section, we walk through how to declare each step of a research design using DeclareDesign. In the next section, we build those steps into a research design, and then describe how to interrogate the design.
Model
The model defines the structure of the world, both its size and background characteristics as well as how interventions in the world determine outcomes. In DeclareDesign, we split the model into two main design steps: the population and potential outcomes. There is always one population in a design, but there can be multiple sets of potential outcomes.
Population
The population defines the number of units in the population, any multilevel structure to the data, and its background characteristics. We can define the population in several ways.
In some cases, you may start a design with data on the population. When that happens, we do not need to simulate it. We can simply declare the data as our population:
declare_population(data = voter_file)
ID  age  sex  party  precinct  Z 

001  27  M  DEM  5210  1 
002  53  M  REP  2155  0 
003  35  F  REP  4321  0 
004  75  M  REP  3590  0 
005  66  F  REP  5297  1 
006  64  M  GRN  8905  1 
When we do not have complete data on the population, we simulate it. Relying on the data simulation functions from our fabricatr package, declare_population asks about the size and variables of the population:
declare_population(N = 100, u = rnorm(N))
When we run the declared population function, we will get a different 100-unit dataset each time:



The fabricatr package can simulate data for social science research, including multilevel data structures like students in classrooms in schools. You can read the fabricatr Web site to get started simulating your data structure. A simple two-level data structure of individuals within households could be declared as:
declare_population(
households = add_level(N = 100, individuals_per_hh = sample(1:10, N, replace = TRUE)),
individuals = add_level(N = individuals_per_hh, age = sample(1:100, N, replace = TRUE))
)
In every step of the research design process, you can short-circuit our default way of doing things and bring in your own code. This is useful when you have a complex design, or when you’ve already written code for your design and you want to use it directly. It works by setting the handler:
complex_population_function <- function(data, N_units) {
data.frame(u = rnorm(N_units))
}
declare_population(handler = complex_population_function, N_units = 100)
Potential outcomes
Defining potential outcomes is as easy as a single expression per potential outcome. These may be a function of background characteristics, other potential outcomes, or other R functions.^{3}
declare_potential_outcomes(
Y_Z_0 = u,
Y_Z_1 = Y_Z_0 + 0.25)
des <- declare_population(N = 100, u = rnorm(N)) +
declare_potential_outcomes(Y_Z_0 = u, Y_Z_1 = Y_Z_0 + 0.25)
draw_data(des)
ID  u  Y_Z_0  Y_Z_1

001  0.01  0.01  0.26
002  -0.92  -0.92  -0.67
003  1.71  1.71  1.96
004  -1.17  -1.17  -0.92
005  -1.78  -1.78  -1.53
006  -2.25  -2.25  -2.00
We also have a simpler interface to define all the potential outcomes at once as a function of a treatment assignment variable. The names of the potential outcomes are constructed from the outcome name (here Y on the left-hand side of the formula) and from the assignment_variables argument (here Z).
declare_potential_outcomes(Y ~ u + 0.25 * Z, assignment_variables = Z)
Either way of creating potential outcomes works; one may be easier or harder to code up in a given research design setting.
Inquiry
To define your inquiry, declare your estimand, which is a function of background characteristics from your population, potential outcomes, or both. We define the average treatment effect for the experiment in our simple design as follows:
declare_estimand(PATE = mean(Y_Z_1 - Y_Z_0))
Notice that we defined the PATE (the population average treatment effect), but said nothing special related to the population. In fact, it looks like we just defined the average treatment effect. This is because where you define the estimand in your design is going to determine whether it refers to the population, sample, or other form of estimand. We will see how to do this in a moment.
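To see why position matters, here is a minimal base R sketch that computes the same expression before and after sampling. It assumes heterogeneous treatment effects, which is a departure from the constant 0.25 effect in the running example, introduced only so that the two estimands can differ; all object names are illustrative.

```r
set.seed(1)

# Population of 100 units with heterogeneous effects (an assumption made
# here so that population and sample estimands can differ).
population <- data.frame(u = rnorm(100))
population$Y_Z_0 <- population$u
population$Y_Z_1 <- population$Y_Z_0 + rnorm(100, mean = 0.25, sd = 0.5)

# Estimand computed before sampling: population average treatment effect.
pate <- mean(population$Y_Z_1 - population$Y_Z_0)

# Estimand computed after sampling 50 units: sample average treatment effect.
sampled <- population[sample(nrow(population), 50), ]
sate <- mean(sampled$Y_Z_1 - sampled$Y_Z_0)

c(PATE = pate, SATE = sate)
```

The expression inside `mean()` is identical in both cases; only the rows available at that point in the sequence change.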
Data strategy
The data strategy constitutes one or more steps representing interventions the researcher makes in the world from sampling to assignment to measurement. Typically, this may include sampling and assignment.
Sampling
The sampling step relies on the randomizr package to conduct random sampling. We define a simple 50-unit sample from the population as follows:
declare_sampling(n = 50)
When we draw data from our simple design at this point, it will be smaller: from 100 units in the population to a data frame of 50 units representing the sample. In the data frame, we have an inclusion probability, the probability of being included in the sample. randomizr includes this by default. In this case, every unit in the population had an equal 0.5 probability of inclusion.
ID  u  Y_Z_0  Y_Z_1  S_inclusion_prob

002  -0.67  -0.67  -0.42  0.5
003  -1.11  -1.11  -0.86  0.5
004  -0.34  -0.34  -0.09  0.5
006  -1.23  -1.23  -0.98  0.5
008  -1.25  -1.25  -1.00  0.5
010  0.72  0.72  0.97  0.5
Sampling could also be nonrandom, which could be accomplished by using a handler.
Assignment
Assignment also relies, by default, on the randomizr package for random assignment. Here, we define assignment as a 50% probability of assignment to treatment and 50% to control.
declare_assignment(prob = 0.5)
Assignment results in a data frame with an additional indicator Z of the assignment as well as the probability of assignment. Again, here the assignment probabilities are constant, but in other designs they are not, and this is crucial information for the analysis stage.
ID  u  Y_Z_0  Y_Z_1  S_inclusion_prob  Z  Z_cond_prob

001  -0.77  -0.77  -0.52  0.5  0  0.5
002  -1.55  -1.55  -1.30  0.5  0  0.5
003  -1.29  -1.29  -1.04  0.5  1  0.5
004  -1.17  -1.17  -0.92  0.5  0  0.5
006  0.70  0.70  0.95  0.5  0  0.5
009  -0.15  -0.15  0.10  0.5  0  0.5
Other data strategies
Random sampling and random assignment are not the only kinds of data strategies. Others may include merging in fixed administrative data from other sources, collapsing data across months or days, and other operations. You can include these as steps in your design too, using declare_step. Here, you must define a handler, as we did for using a custom function in declare_population. Some handlers that may prove useful are the dplyr verbs such as mutate and summarize, and the fabricate function from our fabricatr package.
To add a variable using fabricate:
declare_step(handler = fabricate, add_variable = rnorm(N))
If you have district-month data you may want to analyze at the district level, collapsing across months:^{4}
collapse_data <- function(data, collapse_by) {
data %>% group_by({{ collapse_by }}) %>% summarize_all(mean, na.rm = TRUE)
}
declare_step(handler = collapse_data, collapse_by = district)
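What such a collapsing step does to the data can be sketched in base R. The `district_month` data frame and its `turnout` values are hypothetical, and `aggregate` is used here instead of the dplyr verbs in the handler above.

```r
# Hypothetical district-month data: two districts observed for three months.
district_month <- data.frame(
  district = rep(c("A", "B"), each = 3),
  month    = rep(1:3, times = 2),
  turnout  = c(0.4, 0.5, 0.6, 0.1, 0.2, 0.3)
)

# Collapse across months: one row per district with the mean turnout.
district_level <- aggregate(turnout ~ district, data = district_month, FUN = mean)
district_level
```

Six district-month rows go in and two district rows come out, so any later step in the design sees district-level data.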
Answer strategy
Through our model and data strategy steps, we have simulated a dataset with two key inputs to the answer strategy: an assignment variable and an outcome. In other answer strategies, pretreatment characteristics from the model might also be relevant. The data look like this:
ID  u  Y_Z_0  Y_Z_1  S_inclusion_prob  Z  Z_cond_prob  Y

001  -0.12  -0.12  0.13  0.5  0  0.5  -0.12
005  -1.51  -1.51  -1.26  0.5  1  0.5  -1.26
007  0.52  0.52  0.77  0.5  0  0.5  0.52
008  1.33  1.33  1.58  0.5  1  0.5  1.58
010  1.60  1.60  1.85  0.5  1  0.5  1.85
012  -0.05  -0.05  0.20  0.5  1  0.5  0.20
Our estimator is the difference-in-means estimator, which compares outcomes between the group assigned to treatment and the group assigned to control. We can calculate the difference-in-means estimate with a call to summarize from dplyr:
simple_design_data %>% summarize(DiM = mean(Y[Z == 1]) - mean(Y[Z == 0]))
DiM 

0.51 
The estimatr package makes this easy and calculates the design-based standard error, a p-value, and a confidence interval for you:
difference_in_means(Y ~ Z, data = simple_design_data)
term  estimate  std.error  statistic  p.value  conf.low  conf.high  df  outcome 

Z  0.51  0.27  1.9  0.06  -0.03  1.1  41  Y
Now, in order to declare our estimator, we can send the name of a model to declare_estimator. R has many models that work with declare_estimator, including lm, glm, the ictreg function from the list package, etc. The design-based estimators from estimatr can all be used.
declare_estimator(Y ~ Z, model = difference_in_means, estimand = "PATE")
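For intuition about what a difference-in-means model reports, the following base R sketch computes the point estimate and an unequal-variance standard error by hand and checks them against base R's Welch t-test. The simulated data are hypothetical, and treating this as what estimatr computes in a simple two-arm design is an assumption here; blocked or clustered designs need different formulas.

```r
set.seed(7)

# Hypothetical data: 50 units, half assigned to treatment, effect of 0.25.
n <- 50
Z <- sample(rep(c(0, 1), each = n / 2))
Y <- rnorm(n) + 0.25 * Z

# Difference-in-means point estimate.
dim_estimate <- mean(Y[Z == 1]) - mean(Y[Z == 0])

# Unequal-variance standard error: sqrt(s1^2 / n1 + s0^2 / n0).
dim_se <- sqrt(var(Y[Z == 1]) / sum(Z == 1) + var(Y[Z == 0]) / sum(Z == 0))

# Base R's Welch t-test uses the same estimate and standard error.
welch <- t.test(Y[Z == 1], Y[Z == 0])

c(estimate = dim_estimate, std.error = dim_se, statistic = dim_estimate / dim_se)
```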
In this declaration, we also define the estimand we are targeting with the difference-in-means estimator.^{5} Typically, you will have an estimand that you are targeting, and sometimes you may consider targeting more than one and assessing how well your estimator estimates each of them. For example, you may want to know how good a job your instrumental variables estimator does at targeting the complier average causal effect, but also how close it gets on average to the average treatment effect.
Combining steps to form a design
In the last section, we defined a set of individual research steps. We draw one version of them together here:
population <- declare_population(N = 100, u = rnorm(N))
potential_outcomes <- declare_potential_outcomes(Y_Z_0 = u, Y_Z_1 = Y_Z_0 + 0.25)
estimand <- declare_estimand(PATE = mean(Y_Z_1 - Y_Z_0))
sampling <- declare_sampling(n = 50)
assignment <- declare_assignment(prob = 0.5)
reveal <- declare_reveal(outcome_variables = Y, assignment_variables = Z)
estimator <- declare_estimator(Y ~ Z, model = difference_in_means, estimand = "PATE")
To construct a research design object that we can operate on (diagnose it, redesign it, draw data from it, etc.), we add the steps together with the + operator. The + creates a design object.
simple_design <-
population + potential_outcomes + estimand + sampling + assignment + reveal + estimator
Often we’ll use a more compact way of writing a design, where we define it all at once with the +:
simple_design <-
declare_population(N = 100, u = rnorm(N)) +
declare_potential_outcomes(Y_Z_0 = u, Y_Z_1 = Y_Z_0 + 0.25) +
declare_estimand(PATE = mean(Y_Z_1 - Y_Z_0)) +
declare_sampling(n = 50) +
declare_assignment(prob = 0.5) +
declare_reveal(outcome_variables = Y, assignment_variables = Z) +
declare_estimator(Y ~ Z, model = difference_in_means, estimand = "PATE")
Order matters
When defining a design, the order in which steps are added with the + operator matters. Think of the order of your design as the causal order in which steps take place.
population + potential_outcomes + estimand + sampling + assignment + reveal + estimator
The order encodes several important aspects of the design:
 First, the fact that the estimand follows the potential outcomes and comes before sampling and assignment means it is a population estimand, the population average treatment effect. This is because it is calculated on the data created so far.
 The estimator comes after the assignment and reveal outcomes steps. If it didn’t, our difference-in-means would not work, because it wouldn’t have access to the treatment variable and the realized outcomes.
Simulating a research design
Diagnosing a research design (learning about its properties) requires first simulating the design over and over. We need to simulate the data generating process, then calculate the estimands, then calculate the estimates that will result.
In dplyr
We first demonstrate how to use the tidy functions created by the declare_* functions in a dplyr pipeline to simulate a design once.
We can run the population function, which generates the data structure, and then add the potential outcomes, and calculate the estimand as follows:
population() %>% potential_outcomes %>% estimand
estimand_label  estimand 

PATE  0.25 
This is the same thing as running the functions one at a time on each other: estimand(potential_outcomes(population()))
.
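The nesting described here amounts to running a list of functions in order, each one receiving the previous one's output. The sketch below shows that idea in base R with hypothetical stand-ins for the declared steps (`population_step`, `potential_outcomes_step`, `estimand_step`, and the helper `run_steps` are all made up for illustration); it is not DeclareDesign's own machinery.

```r
# Hypothetical stand-ins for the declared steps, written in base R.
population_step <- function(data = NULL) {
  data.frame(u = rnorm(100))
}
potential_outcomes_step <- function(data) {
  data$Y_Z_0 <- data$u
  data$Y_Z_1 <- data$Y_Z_0 + 0.25
  data
}
estimand_step <- function(data) {
  data.frame(estimand_label = "PATE",
             estimand = mean(data$Y_Z_1 - data$Y_Z_0))
}

# Run a list of steps in order, feeding each output to the next step.
run_steps <- function(steps) {
  data <- NULL
  for (step in steps) {
    data <- step(data)
  }
  data
}

result <- run_steps(list(population_step, potential_outcomes_step, estimand_step))
result
```

Because the treatment effect is a constant 0.25 for every unit, the estimand is 0.25 in every run even though `u` is redrawn each time.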
Similarly, if we want to draw simulated estimates from the design, we again simulate a population, add potential outcomes, but now sample units, assign treatments to sampled units, reveal the outcomes, and calculate estimates:
population() %>% potential_outcomes %>% sampling %>% assignment %>% reveal %>% estimator
estimator_label  term  estimate  std.error  statistic  p.value  conf.low  conf.high  df  outcome  estimand_label 

estimator  Z  -0.05  0.29  -0.18  0.86  -0.63  0.53  47  Y  PATE
In DeclareDesign
With simple_design defined as an object, we can easily learn about what kind of data it generates, the values of its estimand and estimates, and other features with simple functions in DeclareDesign. They chain together functions in a similar way to the dplyr pipelines above.
To draw simulated data based on the design, we use draw_data:
draw_data(simple_design)
ID  u  Y_Z_0  Y_Z_1  S_inclusion_prob  Z  Z_cond_prob  Y

001  0.01  0.01  0.26  0.5  0  0.5  0.01
003  -1.45  -1.45  -1.20  0.5  0  0.5  -1.45
004  1.68  1.68  1.93  0.5  1  0.5  1.93
007  -0.32  -0.32  -0.07  0.5  1  0.5  -0.07
009  -0.04  -0.04  0.21  0.5  1  0.5  0.21
012  -0.57  -0.57  -0.32  0.5  0  0.5  -0.57
draw_data runs all of the “data steps” in a design, which are both from the model (population and potential outcomes) and from the data strategy (typically sampling and assignment).
To simulate the estimands from a single run of the design, we use draw_estimands. This runs two operations at once: it draws the data, and calculates the estimands at the point defined by the design. For example, in our design the estimand comes just after the potential outcomes. In this design, draw_estimands will run the first two steps and then calculate the estimands from the estimand function we declared:
draw_estimands(simple_design)
estimand_label  estimand 

PATE  0.25 
Similarly, we can simulate the estimates from a single run with draw_estimates, which draws data and at the appropriate moment calculates estimates.
draw_estimates(simple_design)
estimator_label  term  estimate  std.error  statistic  p.value  conf.low  conf.high  df  outcome  estimand_label 

estimator  Z  0.31  0.32  0.97  0.33  -0.33  0.96  47  Y  PATE
To diagnose a design, we want a data frame that includes the estimates and estimands from many runs of a design. That is, we want to run the design, draw estimates and estimands, and then do that over and over and stack the results. This is exactly what simulate_design does:
simulate_design(simple_design, sims = 500)
design_label  sim_ID  estimand_label  estimand  estimator_label  term  estimate  std.error  statistic  p.value  conf.low  conf.high  df  outcome 

simple_design  1  PATE  0.25  estimator  Z  0.15  0.25  0.62  0.54  -0.34  0.65  48  Y
simple_design  2  PATE  0.25  estimator  Z  0.64  0.25  2.53  0.02  0.13  1.14  40  Y
simple_design  3  PATE  0.25  estimator  Z  0.34  0.27  1.24  0.22  -0.21  0.88  42  Y
simple_design  4  PATE  0.25  estimator  Z  0.58  0.27  2.14  0.04  0.03  1.13  48  Y
simple_design  5  PATE  0.25  estimator  Z  0.11  0.30  0.38  0.71  -0.49  0.72  46  Y
Diagnosing a research design
The simulations data frame we created (stored here as simulations_df) allows us to diagnose the design (calculate summary statistics from the simulations) directly. We can, for example, use the following dplyr pipeline to calculate the bias, root mean-squared error, and power for each estimator-estimand pair.
simulations_df %>%
group_by(estimand_label, estimator_label) %>%
summarize(bias = mean(estimate - estimand),
rmse = sqrt(mean((estimate - estimand)^2)),
power = mean(p.value < .05))
estimand_label  estimator_label  bias  rmse  power 

PATE  estimator  0.11  0.24  0.4 
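As a check on the arithmetic, the same three diagnosands can be computed in base R from the five simulation rows printed in the simulate_design output earlier in this guide. The data frame `sims` below is typed in by hand from those rows; nothing else is assumed.

```r
# Estimates, estimands, and p-values copied from the five simulation rows.
sims <- data.frame(
  estimate = c(0.15, 0.64, 0.34, 0.58, 0.11),
  estimand = 0.25,
  p.value  = c(0.54, 0.02, 0.22, 0.04, 0.71)
)

bias  <- mean(sims$estimate - sims$estimand)
rmse  <- sqrt(mean((sims$estimate - sims$estimand)^2))
power <- mean(sims$p.value < 0.05)

round(c(bias = bias, rmse = rmse, power = power), 2)
```

After rounding, these agree with the bias of 0.11, RMSE of 0.24, and power of 0.4 in the table above, which also shows that the table was computed from only five simulations.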
In DeclareDesign, we do this in two steps. First, declare your diagnosands. These are functions of the simulations data. We have precoded several standard diagnosands.
study_diagnosands <- declare_diagnosands(
select = c(bias, rmse, power),
mse = mean((estimate - estimand)^2))
Next, take your simulations data and the diagnosands, and diagnose. This runs a single operation, which is to calculate the diagnosands on your simulations data, just like in the dplyr
version above.
diagnose_design(simulations_df, diagnosands = study_diagnosands)
design_label  estimand_label  estimator_label  term  mse  se(mse)  bias  se(bias)  rmse  se(rmse)  power  se(power)  n_sims 

simple_design  PATE  estimator  Z  0.06  0.03  0.11  0.1  0.24  0.06  0.4  0.22  5 
We can also do this in a single step. When you send diagnose_design
a design object, it will first run the simulations for you, then calculate the diagnosands from the simulations data frame that results.
diagnose_design(simple_design, diagnosands = study_diagnosands)
Comparing designs
In the diagnosis phase, you will often want to compare the properties of two designs to see which you prefer on the basis of the diagnosand values. We have two ways to compare. First, we can compare the designs themselves: what kinds of estimates and estimands they produce and what steps are in each design. Second, we can compare the diagnoses.
compare_designs(simple_design, redesigned_simple_design)
To compare the diagnoses, we run a diagnosis for each design, calculate the difference in each diagnosand between the two designs, and conduct a statistical test of the null hypothesis of no difference.
compare_diagnoses(simple_design, redesigned_simple_design)
Comparing many variants of a design
Often, we want to compare a large set of similar designs, varying key design parameters such as sample size, effect size, or the probability of treatment assignment. The easiest way to do this is to write a function that makes designs based on a set of these design inputs. We call these designers. Here’s a simple designer based on our running example:
simple_designer <- function(sample_size, effect_size) {
declare_population(N = sample_size, u = rnorm(N)) +
declare_potential_outcomes(Y_Z_0 = u, Y_Z_1 = Y_Z_0 + effect_size) +
declare_estimand(PATE = mean(Y_Z_1 - Y_Z_0)) +
declare_sampling(n = 50) +
declare_assignment(prob = 0.5) +
declare_reveal(outcome_variables = Y, assignment_variables = Z) +
declare_estimator(Y ~ Z, model = difference_in_means, estimand = "PATE")
}
To create a single design, based on our original parameters of a 100-unit sample size and a treatment effect of 0.25, we can run:
simple_design <- simple_designer(sample_size = 100, effect_size = 0.25)
Now to simulate multiple designs, we can use the DeclareDesign function expand_design. Here we examine our simple design under several possible sample sizes, which we might want to do to conduct a minimum power analysis. We hold the effect size constant.
simple_designs <- expand_design(simple_designer, sample_size = c(100, 500, 1000), effect_size = 0.25)
Our simulation and diagnosis tools can take a set of expanded designs (an R list) and will simulate all of them at once, creating a column called design_label to keep them apart. For example:
diagnose_design(simple_designs)
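For a rough sense of what a power comparison across sample sizes looks like, the sketch below uses base R's analytic `power.t.test` rather than simulation. It assumes that each sample size is the total number of units analyzed, split evenly between two arms, with an effect of 0.25 and an outcome standard deviation of 1; it is a back-of-the-envelope companion to a diagnosis, not a replacement for one.

```r
# Hypothetical total sample sizes, split evenly between two arms.
sample_sizes <- c(100, 500, 1000)

# Analytic power of a two-sample t-test for an effect of 0.25 and
# outcome standard deviation 1 (both taken from the running example).
analytic_power <- sapply(sample_sizes, function(n_total) {
  power.t.test(n = n_total / 2, delta = 0.25, sd = 1, sig.level = 0.05)$power
})

data.frame(sample_size = sample_sizes, power = round(analytic_power, 2))
```

A simulation-based diagnosis has the advantage of covering designs and estimators for which no such closed-form power formula exists.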
Library of designs
In our DesignLibrary package, we have created a set of common designs as designers, so you can get started quickly and also easily set up a range of design variants for comparison.
library(DesignLibrary)
b_c_design <- block_cluster_two_arm_designer(N = 1000, N_blocks = 10)
diagnose_design(b_c_design)
^{1} Both R and RStudio are available on Windows, Mac, and Linux.
^{2} This pipeline could be expressed in base R as
voter_file$Z <- sample(c(0, 1), size = 100, replace = TRUE, prob = c(0.5, 0.5))
^{3} Typically, we think of potential outcomes as fixed and not random, and move random variables to the population.
^{4} The {{ }} syntax is handy for writing functions in dplyr where you want to be able to reuse the function with different variable names. Here, the collapse_data function will group_by the variable you send to the argument collapse_by, which in our declaration we set to district. The pipeline within the function then calculates the mean in each district.
^{5} Sometimes, you may be interested just in the properties of an estimator, such as calculating its power. In this case, you need not define an estimand.