Getting Started with DeclareDesign

DeclareDesign is a system for describing research designs in code and simulating them in order to understand their properties. Because DeclareDesign employs a consistent grammar of designs, you can focus on the intellectually challenging part – designing good research studies – without having to code up simulations from scratch. DeclareDesign is based on the Model-Inquiry-Data Strategy-Answer Strategy (MIDA) framework for describing designs and a declare-diagnose-redesign workflow for improving research designs before implementing them.

In this guide, we introduce you to DeclareDesign for R and show how each step of the declare-diagnose-redesign process can be implemented in it.

Installing R

To get started, install the statistical computing environment R, which you can download for free from CRAN. We also recommend the free program RStudio, which provides a friendly interface to R.[1]

Once you’ve got RStudio installed, open it up and install DeclareDesign and its related packages. These include three packages that enable specific steps in the research process (fabricatr for simulating social science data; randomizr, for random sampling and random assignment; and estimatr for design-based estimators). You can also install DesignLibrary, which gets standard designs up-and-running in one line. To install them, you can type:

install.packages(c("DeclareDesign", "fabricatr", "randomizr", "estimatr", "DesignLibrary"))

We also recommend you install and get to know the tidyverse suite of packages for data analysis, which we will use in this guide:

install.packages("tidyverse")

In this guide, we will introduce the DeclareDesign software and how to implement the MIDA framework within it. We will not provide a general introduction to R or to the tidyverse, because there are already many terrific introductions. We especially recommend R for Data Science, available for free on the Web.

Where we are going

We will build up to declaring and diagnosing a design in this section. But to get a sense of the goal, below is a simple 100-unit randomized experiment design declared, diagnosed, and redesigned.

Declaring a design

simple_design <- 
  
  # M: model
  
  # a 100-unit population with an unobserved shock 'u'
  declare_population(N = 100, u = rnorm(N)) +
  
  # two potential outcomes, Y_Z_0 and Y_Z_1
  # Y_Z_0 is the control potential outcome (what would happen if the unit is untreated)
  #   it is equal to the unobserved shock 'u'
  # Y_Z_1 is the treated potential outcome 
  #   it is equal to the control potential outcome plus a treatment effect of 0.25
  declare_potential_outcomes(Y_Z_0 = u, Y_Z_1 = Y_Z_0 + 0.25) +
  
  # I: inquiry
  
  # we are interested in the average treatment effect in the population (PATE)
  declare_estimand(PATE = mean(Y_Z_1 - Y_Z_0)) +
  
  # D: data strategy
  
  # sampling: we randomly sample 50 of the 100 units in the population
  declare_sampling(n = 50) +
  
  # assignment: we randomly assign half of the 50 sampled units to treatment (half to control)
  declare_assignment(prob = 0.5) +
  
  # reveal outcomes: construct outcomes from the potential outcomes named Y depending on 
  #   the realized value of their assignment variable named Z
  declare_reveal(outcome_variables = Y, assignment_variables = Z) +
  
  # A: answer strategy
  
  # calculate the difference-in-means of Y depending on Z 
  # we link this estimator to PATE because this is our estimate of our inquiry
  declare_estimator(Y ~ Z, model = difference_in_means, estimand = "PATE")

Diagnosis

To diagnose the design, we first define a set of diagnosands, which are statistical properties of the design. In this case, we select the bias (the average difference between the estimate and the estimand, which is the PATE); the root mean-squared error; and the statistical power of the design.

# Select diagnosands
simple_design_diagnosands <- 
  declare_diagnosands(select = c(bias, rmse, power))

We then diagnose the design, which involves simulating the design again and again and then calculating the diagnosands from the simulations data.

# Diagnose the design
simple_design_diagnosis <- 
  diagnose_design(simple_design, diagnosands = simple_design_diagnosands, sims = 500)
estimand_label estimator_label bias rmse power
PATE estimator 0 0.28 0.13

Redesign

We see that the power of the design is small, so we increase the number of sampled units from 50 to 100. replace_step creates a new design, swapping out the fourth step (sampling) for a modified sampling step.

redesigned_simple_design <-
  replace_step(simple_design, 
               step = 4, 
               new_step = declare_sampling(n = 100))
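
As a rough cross-check on the simulated power, the sketch below uses base R's power.t.test rather than DeclareDesign. It assumes two equal arms, an outcome standard deviation of 1 (as implied by u = rnorm(N)), and a two-sided test at the 0.05 level. It also treats the sample as drawn from an infinite population, so it only approximates the design declared above.

```r
# Approximate analytic power for a two-arm comparison.
# Assumptions (not read from the design object): equal arm sizes,
# outcome standard deviation of 1, two-sided test at alpha = 0.05.
power_original   <- power.t.test(n = 25, delta = 0.25, sd = 1)$power  # 50 sampled units
power_redesigned <- power.t.test(n = 50, delta = 0.25, sd = 1)$power  # 100 sampled units

round(c(original = power_original, redesigned = power_redesigned), 2)
```

Under these assumptions, doubling the sample raises power, but it stays well below conventional targets such as 0.8, so a further redesign would probably be needed.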

With the big picture of the declaration, diagnosis, and redesign of a simple design in mind, we now turn to building up from a single step to a full declared design.

Building a step of a research design

We begin learning about how to build a research design in DeclareDesign by declaring a single step: random assignment. We take as a starting point a fixed set of data, describing a set of voters in Los Angeles. The research project we are planning involves randomly assigning voters to receive a knock on their door from a canvasser (or not to receive a door knock). Our data look like this:

ID age sex party precinct
001 27 M DEM 5210
002 53 M REP 2155
003 35 F REP 4321
004 75 M REP 3590
005 66 F REP 5297
006 64 M GRN 8905

There are 100 voters in the dataset.

Using dplyr

We plan to randomly assign voters to treatment (door knock) or control (no door knock), each with probability 0.5, so that about 50 voters end up in each group. We want to create an indicator variable Z, where 1 represents treatment and 0 control. In order to do this, we use R’s sample function:

voter_file <- voter_file %>% 
  mutate(Z = sample(c(0, 1), size = 100, replace = TRUE, prob = c(0.5, 0.5)))

This says: draw 100 values (one per voter) at random from 0 and 1, with replacement and with probability 0.5 each. Recall that the %>% operator sends a data frame to the dplyr verb mutate, which can add new columns to a data frame. This is a short dplyr “pipeline”.[2]

Now our data frame voter_file includes the Z indicator:

ID age sex party precinct Z
001 27 M DEM 5210 1
002 53 M REP 2155 0
003 35 F REP 4321 1
004 75 M REP 3590 1
005 66 F REP 5297 1
006 64 M GRN 8905 1
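
The sample() call above treats each voter independently, so it does not guarantee an exact 50/50 split. The base R sketch below contrasts that approach (simple random assignment) with one way to force exactly 50 treated voters (complete random assignment). The seed and variable names are illustrative only.

```r
set.seed(42)  # arbitrary seed, only for reproducibility
n_voters <- 100

# Simple random assignment: each voter is treated independently with
# probability 0.5, so the number treated varies from draw to draw
Z_simple <- sample(c(0, 1), size = n_voters, replace = TRUE, prob = c(0.5, 0.5))

# Complete random assignment: shuffle a vector holding exactly 50 zeros and 50 ones
Z_complete <- sample(rep(c(0, 1), each = n_voters / 2))

sum(Z_simple)    # close to 50, but usually not exactly 50
sum(Z_complete)  # always exactly 50
```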

We can make things a bit easier with the randomizr package, which includes common random assignment functions, including the simple random assignment used here. You can instead write:

voter_file <- voter_file %>% 
  mutate(Z = simple_ra(N = 100, prob = 0.5))

We might use this dplyr pipeline to actually implement the random assignment for a study. But to diagnose the properties of a research design, we want to know what would happen under any possible random assignment. To do this, we will need to run the assignment step over and over again and save the results.

As a function

To simulate the design in order to diagnose it, we need to turn the assignment step into a function. The function can then be run again and again, each time resulting in a different random assignment.

In DeclareDesign, we are going to use a special kind of function: a tidy function, which takes in a data frame and returns back out a data frame. The new data frame may have an additional variable (such as a random assignment) or it may have fewer rows (due to sampling, for example).

For our random assignment step, we want to create a tidy function that adds our assignment indicator Z to the data, but leaves it otherwise unchanged. We write:

simple_random_assignment_function <- function(data) {
  data %>% mutate(Z = simple_ra(N = 100, prob = 0.5))
}

We took the dplyr pipeline we built above, and put it on the inside of a tidy function. Now, when we run our random assignment function on the voter file, it adds in Z:

simple_random_assignment_function(voter_file) 
ID age sex party precinct Z
001 27 M DEM 5210 0
002 53 M REP 2155 1
003 35 F REP 4321 1
004 75 M REP 3590 0
005 66 F REP 5297 1
006 64 M GRN 8905 1
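
To see what running the step "again and again" looks like, here is a minimal base R sketch. It uses a hypothetical stand-in for the voter file (an ID column only) and a hand-written tidy function, so it does not depend on dplyr or randomizr.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

# Hypothetical stand-in for the voter file: 100 rows with an ID column only
toy_voter_file <- data.frame(ID = sprintf("%03d", 1:100))

# A tidy function written in base R: data frame in, data frame out
toy_assignment_function <- function(data) {
  data$Z <- sample(c(0, 1), size = nrow(data), replace = TRUE)
  data
}

# Run the step 1,000 times and record the share of units treated in each run
share_treated <- replicate(1000, mean(toy_assignment_function(toy_voter_file)$Z))

summary(share_treated)
```

Each run gives a different assignment, and the share treated varies around 0.5 from run to run.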

In DeclareDesign

DeclareDesign makes writing each design step just a bit easier. Instead of writing a function each time, it writes a function for us. The core of DeclareDesign is a set of declare_* functions, including declare_assignment. Each one is a function factory, meaning it takes a set of parameters about your research design like the number of units and the random assignment probability as inputs, and returns a function as an output. Instead of writing the function simple_random_assignment_function as we did above, in DeclareDesign we declare it:

simple_random_assignment_step <- declare_assignment(prob = 0.5)

simple_random_assignment_step is a tidy function. You can run the function on data:

simple_random_assignment_step(voter_file) 
ID age sex party precinct Z Z_cond_prob
001 27 M DEM 5210 1 0.5
002 53 M REP 2155 1 0.5
003 35 F REP 4321 0 0.5
004 75 M REP 3590 0 0.5
005 66 F REP 5297 0 0.5
006 64 M GRN 8905 1 0.5

A few parts of the declaration may seem a little bit odd. First, we did not tell R anything about the number of units in our dataset, as we did in the function and in the dplyr pipeline we wrote earlier. Second, we didn’t give it the data! This is because a step declaration creates a function that will work on any size dataset. We told declare_assignment that we want to assign treatment with probability 0.5 (and implicitly control with probability 1-0.5 = 0.5), regardless of how large the dataset is. We did not send the declaration the data, because declare_assignment automatically creates a tidy function for us, one that takes data and returns data with an assignment step. We will see in a moment how DeclareDesign uses these functions to simulate data from a research design. But you can always use the function yourself with your own data.
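
The function-factory idea can be sketched in a few lines of base R. This is an illustration of the pattern, not DeclareDesign's actual internals; make_assignment_step and toy_step are hypothetical names.

```r
# A hand-rolled function factory (an illustration, not DeclareDesign's internals)
make_assignment_step <- function(prob) {
  # The returned function remembers `prob` and works on data of any size
  function(data) {
    data$Z <- rbinom(nrow(data), size = 1, prob = prob)
    data
  }
}

# "Declare" the step once, without supplying data or the number of units
toy_step <- make_assignment_step(prob = 0.5)

# The same step can then be applied to datasets of different sizes
small_data <- data.frame(ID = 1:6)
large_data <- data.frame(ID = 1:1000)

nrow(toy_step(small_data))
nrow(toy_step(large_data))
```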

Every step of a research design in MIDA can be written using one of the declare_* functions. In the next section, we walk through each step and how to declare it using DeclareDesign.

Declaring each step of a research design

In this section, we walk through how to declare each step of a research design using DeclareDesign. In the next section, we build those steps into a research design, and then describe how to interrogate the design.

Model

The model defines the structure of the world, both its size and background characteristics as well as how interventions in the world determine outcomes. In DeclareDesign, we split the model into two main design steps: the population and potential outcomes. There is always one population in a design, but there can be multiple sets of potential outcomes.

Population

The population defines the number of units in the population, any multilevel structure to the data, and its background characteristics. We can define the population in several ways.

In some cases, you may start a design with data on the population. When that happens, we do not need to simulate it. We can simply declare the data as our population:

declare_population(data = voter_file)
ID age sex party precinct Z
001 27 M DEM 5210 1
002 53 M REP 2155 0
003 35 F REP 4321 0
004 75 M REP 3590 0
005 66 F REP 5297 1
006 64 M GRN 8905 1

When we do not have complete data on the population, we simulate it. Relying on the data simulation functions from our fabricatr package, declare_population asks about the size and variables of the population:

declare_population(N = 100, u = rnorm(N))

When we run the declared population function, we will get a different 100-unit dataset each time:

ID u
001 -1.10
002 -0.23
003 -0.35
004 0.53
005 1.61
006 0.51

ID u
001 1.44
002 1.36
003 0.33
004 1.43
005 -0.87
006 0.95

ID u
001 -0.45
002 0.80
003 -0.57
004 -1.93
005 0.66
006 -1.60

The fabricatr package can simulate data for social science research including multilevel data structures like students in classrooms in schools. You can read the fabricatr Web site to get started simulating your data structure. A simple two-level data structure of individuals within households could be declared as:

declare_population(
  households = add_level(N = 100, individuals_per_hh = sample(1:10, N, replace = TRUE)),
  individuals = add_level(N = individuals_per_hh, age = sample(1:100, N, replace = TRUE))
)
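
For intuition about what a two-level structure looks like, here is a rough base R equivalent. It is a sketch of the resulting data shape, not of how fabricatr builds it, and the ID column names are made up for the example.

```r
set.seed(3)  # arbitrary seed, only for reproducibility
n_households <- 100

# Level 1: households, each with between 1 and 10 members
households <- data.frame(
  household_id = seq_len(n_households),
  individuals_per_hh = sample(1:10, n_households, replace = TRUE)
)

# Level 2: one row per individual, repeating each household row as many
# times as it has members
row_index <- rep(seq_len(n_households), times = households$individuals_per_hh)
individuals <- households[row_index, ]
rownames(individuals) <- NULL
individuals$individual_id <- seq_len(nrow(individuals))
individuals$age <- sample(1:100, nrow(individuals), replace = TRUE)

head(individuals)
```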

In every step of the research design process, you can short-circuit our default way of doing things and bring in your own code. This is useful when you have a complex design, or when you’ve already written code for your design and you want to use it directly. It works by setting the handler:

complex_population_function <- function(data, N_units) {
  data.frame(u = rnorm(N_units))
}

declare_population(handler = complex_population_function, N_units = 100)

Potential outcomes

Defining potential outcomes is as easy as a single expression per potential outcome. These may be a function of background characteristics, other potential outcomes, or other R functions.[3]

declare_potential_outcomes(
  Y_Z_0 = u, 
  Y_Z_1 = Y_Z_0 + 0.25)

des <- declare_population(N = 100, u = rnorm(N)) +
  declare_potential_outcomes(Y_Z_0 = u, Y_Z_1 = Y_Z_0 + 0.25)

draw_data(des)
ID u Y_Z_0 Y_Z_1
001 0.01 0.01 0.26
002 -0.92 -0.92 -0.67
003 1.71 1.71 1.96
004 -1.17 -1.17 -0.92
005 -1.78 -1.78 -1.53
006 -2.25 -2.25 -2.00

We also have a simpler interface to define all the potential outcomes at once as a function of a treatment assignment variable. The names of the potential outcomes are constructed from the outcome name (here Y on the left-hand side of the formula) and from the assignment_variables argument (here Z).

declare_potential_outcomes(Y ~ u + 0.25 * Z, assignment_variables = Z)

Either way of creating potential outcomes works; one may be easier or harder to code up in a given research design setting.
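
One way to think about the formula interface is that the right-hand side is evaluated once for each value of the assignment variable. The base R sketch below mimics that idea for Z = 0 and Z = 1. It is an illustration of the concept under that assumption, not the package's implementation.

```r
set.seed(4)  # arbitrary seed, only for reproducibility
population_data <- data.frame(u = rnorm(5))

# The right-hand side of Y ~ u + 0.25 * Z, stored as an unevaluated expression
outcome_expression <- quote(u + 0.25 * Z)

# Evaluate the expression once per condition, fixing Z at 0 and then at 1,
# and name each new column from the outcome name, the variable name, and the value
for (z in c(0, 1)) {
  column_name <- paste0("Y_Z_", z)
  population_data[[column_name]] <- eval(
    outcome_expression,
    list(u = population_data$u, Z = z)
  )
}

population_data
```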

Inquiry

To define your inquiry, declare your estimand, which is a function of background characteristics from your population, potential outcomes, or both. We define the average treatment effect for the experiment in our simple design as follows:

declare_estimand(PATE = mean(Y_Z_1 - Y_Z_0))

Notice that we defined the PATE (the population average treatment effect), but said nothing special related to the population. In fact, it looks like we just defined the average treatment effect. This is because where you define the estimand in your design is going to determine whether it refers to the population, sample, or other form of estimand. We will see how to do this in a moment.

Data strategy

The data strategy consists of one or more steps representing interventions the researcher makes in the world, from sampling to assignment to measurement. Typically, this includes sampling and assignment.

Sampling

The sampling step relies on the randomizr package to conduct random sampling. We define a simple 50-unit sample from the population as follows:

declare_sampling(n = 50)

When we draw data from our simple design at this point, it will be smaller: from 100 units in the population to a data frame of 50 units representing the sample. In the data frame, we have an inclusion probability, the probability of being included in the sample. randomizr includes this by default. In this case, every unit in the population had an equal 0.5 probability of inclusion.

ID u Y_Z_0 Y_Z_1 S_inclusion_prob
2 002 -0.67 -0.67 -0.42 0.5
3 003 -1.11 -1.11 -0.86 0.5
4 004 -0.34 -0.34 -0.09 0.5
6 006 -1.23 -1.23 -0.98 0.5
8 008 -1.25 -1.25 -1.00 0.5
10 010 0.72 0.72 0.97 0.5

Sampling could also be non-random, which could be accomplished by using a handler.

Assignment

Assignment also relies, by default, on the randomizr package for random assignment. Here, we define assignment as a 50% probability of assignment to treatment and 50% to control.

declare_assignment(prob = 0.5)

Assignment results in a data frame with an additional indicator Z of the assignment as well as the probability of assignment. Again, here the assignment probabilities are constant, but in other designs they are not and this is crucial information for the analysis stage.

ID u Y_Z_0 Y_Z_1 S_inclusion_prob Z Z_cond_prob
001 -0.77 -0.77 -0.52 0.5 0 0.5
002 -1.55 -1.55 -1.30 0.5 0 0.5
003 -1.29 -1.29 -1.04 0.5 1 0.5
004 -1.17 -1.17 -0.92 0.5 0 0.5
006 0.70 0.70 0.95 0.5 0 0.5
009 -0.15 -0.15 0.10 0.5 0 0.5

Other data strategies

Random sampling and random assignment are not the only kinds of data strategies. Others may include merging in fixed administrative data from other sources, collapsing data across months or days, and other operations. You can include these as steps in your design too, using declare_step. Here, you must define a handler, as we did for using a custom function in declare_population. Some handlers that may prove useful are the dplyr verbs such as mutate and summarize, and the fabricate function from our fabricatr package.

To add a variable using fabricate:

declare_step(handler = fabricate, add_variable = rnorm(N))

If you have district-month data you may want to analyze at the district level, collapsing across months:[4]

collapse_data <- function(data, collapse_by) {
  data %>% group_by({{ collapse_by }}) %>% summarize_all(mean, na.rm = TRUE)
}

declare_step(handler = collapse_data, collapse_by = district)

Answer strategy

Through our model and data strategy steps, we have simulated a dataset with two key inputs to the answer strategy: an assignment variable and an outcome. In other answer strategies, pretreatment characteristics from the model might also be relevant. The data look like this:

ID u Y_Z_0 Y_Z_1 S_inclusion_prob Z Z_cond_prob Y
001 -0.12 -0.12 0.13 0.5 0 0.5 -0.12
005 -1.51 -1.51 -1.26 0.5 1 0.5 -1.26
007 0.52 0.52 0.77 0.5 0 0.5 0.52
008 1.33 1.33 1.58 0.5 1 0.5 1.58
010 1.60 1.60 1.85 0.5 1 0.5 1.85
012 -0.05 -0.05 0.20 0.5 1 0.5 0.20
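
The realized outcome Y in this table comes from the reveal step: each unit shows the potential outcome that matches its assignment. A minimal base R sketch of that rule, using the first three rows of the table above:

```r
# Three units copied from the table above: both potential outcomes
# and the realized assignment
toy_data <- data.frame(
  Y_Z_0 = c(-0.12, -1.51, 0.52),
  Y_Z_1 = c(0.13, -1.26, 0.77),
  Z     = c(0, 1, 0)
)

# Reveal: treated units show Y_Z_1, control units show Y_Z_0
toy_data$Y <- ifelse(toy_data$Z == 1, toy_data$Y_Z_1, toy_data$Y_Z_0)

toy_data
```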

Our estimator is the difference-in-means estimator, which compares outcomes between the group that was assigned to treatment and that assigned to control. We can calculate the difference-in-means estimate with a call to summarize from dplyr:

simple_design_data %>% summarize(DiM = mean(Y[Z == 1]) - mean(Y[Z == 0]))
DiM
0.51

The estimatr package makes this easy and calculates the design-based standard error and a p-value and confidence interval for you:

difference_in_means(Y ~ Z, data = simple_design_data)
term estimate std.error statistic p.value conf.low conf.high df outcome
Z 0.51 0.27 1.9 0.06 -0.03 1.1 41 Y
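
To make the standard error less of a black box, the sketch below computes a difference in means and an unequal-variance (Welch) standard error by hand in base R, and checks them against t.test. Our understanding is that, for a two-arm design without blocks or clusters, difference_in_means reports the same quantities, but treat that as an assumption and consult the estimatr documentation. The data here are simulated for illustration.

```r
set.seed(5)  # arbitrary seed, only for reproducibility

# Hypothetical data: 25 control and 25 treated units, true effect of 0.25
Z <- rep(c(0, 1), each = 25)
Y <- rnorm(50) + 0.25 * Z

y1 <- Y[Z == 1]
y0 <- Y[Z == 0]

# Difference-in-means point estimate
estimate <- mean(y1) - mean(y0)

# Unequal-variance (Welch) standard error
std_error <- sqrt(var(y1) / length(y1) + var(y0) / length(y0))

# Base R's Welch t-test reports the same standard error, plus a p-value
welch <- t.test(y1, y0, var.equal = FALSE)

c(estimate = estimate, std_error = std_error, p_value = welch$p.value)
```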

Now, in order to declare our estimator, we can send the name of a model to declare_estimator. R has many models that work with declare_estimator, including lm, glm, and the ictreg function from the list package. The design-based estimators from estimatr can all be used.

declare_estimator(Y ~ Z, model = difference_in_means, estimand = "PATE")

In this declaration, we also define the estimand we are targeting with the difference-in-means estimator.[5] Typically, you will have an estimand that you are targeting, and sometimes you may consider targeting more than one and assessing how well your estimator estimates each of them. For example, you may want to know how good a job your instrumental variables estimator does at targeting the complier average causal effect, but also how close it gets on average to the average treatment effect.

Combining steps to form a design

In the last section, we defined a set of individual research steps. We draw one version of them together here:

population <- declare_population(N = 100, u = rnorm(N)) 
potential_outcomes <- declare_potential_outcomes(Y_Z_0 = u, Y_Z_1 = Y_Z_0 + 0.25) 
estimand <- declare_estimand(PATE = mean(Y_Z_1 - Y_Z_0)) 
sampling <- declare_sampling(n = 50) 
assignment <- declare_assignment(prob = 0.5) 
reveal <- declare_reveal(outcome_variables = Y, assignment_variables = Z) 
estimator <- declare_estimator(Y ~ Z, model = difference_in_means, estimand = "PATE")

To construct a research design object that we can operate on — diagnose it, redesign it, draw data from it, etc. — we add them together with the + operator. The + creates a design object.

simple_design <- 
  population + potential_outcomes + estimand + sampling + assignment + reveal + estimator

Often we’ll use a more compact way of writing a design, where we define it all at once with the +:

simple_design <- 
  declare_population(N = 100, u = rnorm(N)) +
  declare_potential_outcomes(Y_Z_0 = u, Y_Z_1 = Y_Z_0 + 0.25) +
  declare_estimand(PATE = mean(Y_Z_1 - Y_Z_0)) +
  declare_sampling(n = 50) +
  declare_assignment(prob = 0.5) +
  declare_reveal(outcome_variables = Y, assignment_variables = Z) +
  declare_estimator(Y ~ Z, model = difference_in_means, estimand = "PATE")

Order matters

When defining a design, the order in which steps are added with the + operator matters. Think of the order of your design as the causal order in which steps take place.

population + potential_outcomes + estimand + sampling + assignment + reveal + estimator

The order encodes several important aspects of the design:

  • First, the fact that the estimand follows the potential outcomes and comes before sampling and assignment means it is a population estimand, the population average treatment effect. This is because it is calculated on the data created so far.
  • The estimator comes after the assignment and reveal outcomes steps. If it didn’t, our difference-in-means would not work, because it wouldn’t have access to the treatment variable and the realized outcomes.
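
The first point can be illustrated in base R. In the running example the treatment effect is a constant 0.25, so the population and sample estimands coincide. The sketch below therefore assumes hypothetical unit-level effects that vary, which makes the difference visible.

```r
set.seed(6)  # arbitrary seed, only for reproducibility

# Hypothetical population with effects that differ across units
population_data <- data.frame(u = rnorm(100))
population_data$Y_Z_0 <- population_data$u
population_data$Y_Z_1 <- population_data$Y_Z_0 + rnorm(100, mean = 0.25, sd = 0.5)

# Estimand computed before sampling: uses all 100 units (a population estimand)
pate <- mean(population_data$Y_Z_1 - population_data$Y_Z_0)

# Sampling step: keep 50 of the 100 units
sample_data <- population_data[sample(nrow(population_data), 50), ]

# The same expression computed after sampling: uses only the 50 sampled units
sate <- mean(sample_data$Y_Z_1 - sample_data$Y_Z_0)

c(PATE = pate, SATE = sate)
```

The expression is identical in both places. Only the data it sees, and therefore its meaning, changes with its position in the sequence.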

Simulating a research design

Diagnosing a research design (learning about its properties) first requires simulating the design over and over. We need to simulate the data generating process, then calculate the estimands, then calculate the estimates that will result.

In dplyr

We first demonstrate how to use the tidy functions created by the declare_* functions in a dplyr pipeline to simulate a design once.

We can run the population function, which generates the data structure, and then add the potential outcomes, and calculate the estimand as follows:

population() %>% potential_outcomes %>% estimand
estimand_label estimand
PATE 0.25

This is the same thing as running the functions one at a time on each other: estimand(potential_outcomes(population())).

Similarly, if we want to draw simulated estimates from the design, we again simulate a population, add potential outcomes, but now sample units, assign treatments to sampled units, reveal the outcomes, and calculate estimates:

population() %>% potential_outcomes %>% sampling %>%  assignment %>% reveal %>% estimator
estimator_label term estimate std.error statistic p.value conf.low conf.high df outcome estimand_label
estimator Z -0.05 0.29 -0.18 0.86 -0.63 0.53 47 Y PATE

In DeclareDesign

With simple_design defined as an object, we can easily learn about what kind of data it generates, the values of its estimand and estimates, and other features with simple functions in DeclareDesign. They chain together functions in a similar way to the dplyr pipelines above.

To draw simulated data based on the design, we use draw_data:

draw_data(simple_design)
ID u Y_Z_0 Y_Z_1 S_inclusion_prob Z Z_cond_prob Y
001 0.01 0.01 0.26 0.5 0 0.5 0.01
003 -1.45 -1.45 -1.20 0.5 0 0.5 -1.45
004 1.68 1.68 1.93 0.5 1 0.5 1.93
007 -0.32 -0.32 -0.07 0.5 1 0.5 -0.07
009 -0.04 -0.04 0.21 0.5 1 0.5 0.21
012 -0.57 -0.57 -0.32 0.5 0 0.5 -0.57

draw_data runs all of the “data steps” in a design, which are both from the model (population and potential outcomes) and from the data strategy (typically sampling and assignment).

To simulate the estimands from a single run of the design, we use draw_estimands. This runs two operations at once: it draws the data, and calculates the estimands at the point defined by the design. For example, in our design the estimand comes just after the potential outcomes. In this design, draw_estimands will run the first two steps and then calculate the estimands from the estimand function we declared:

draw_estimands(simple_design)
estimand_label estimand
PATE 0.25

Similarly, we can simulate the estimates from a single run with draw_estimates, which draws data and calculates estimates at the appropriate moment.

draw_estimates(simple_design)
estimator_label term estimate std.error statistic p.value conf.low conf.high df outcome estimand_label
estimator Z 0.31 0.32 0.97 0.33 -0.33 0.96 47 Y PATE

To diagnose a design, we want a data frame that includes the estimates and estimands from many runs of a design. That is, we want to run the design, draw estimates and estimands, and then do that over and over and stack the results. This is exactly what simulate_design does:

simulate_design(simple_design, sims = 500)
design_label sim_ID estimand_label estimand estimator_label term estimate std.error statistic p.value conf.low conf.high df outcome
simple_design 1 PATE 0.25 estimator Z 0.15 0.25 0.62 0.54 -0.34 0.65 48 Y
simple_design 2 PATE 0.25 estimator Z 0.64 0.25 2.53 0.02 0.13 1.14 40 Y
simple_design 3 PATE 0.25 estimator Z 0.34 0.27 1.24 0.22 -0.21 0.88 42 Y
simple_design 4 PATE 0.25 estimator Z 0.58 0.27 2.14 0.04 0.03 1.13 48 Y
simple_design 5 PATE 0.25 estimator Z 0.11 0.30 0.38 0.71 -0.49 0.72 46 Y
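
To show what "run the design over and over and stack the results" amounts to, here is a base R sketch of a design like simple_design. It is a hand-written approximation under stated assumptions (exactly 25 of the 50 sampled units are treated, and a Welch t-test supplies the p-value), not the output of simulate_design.

```r
set.seed(7)  # arbitrary seed, only for reproducibility

# One run of a design like simple_design, written without DeclareDesign
run_design_once <- function(sim_ID) {
  # Model: population and potential outcomes
  u <- rnorm(100)
  Y_Z_0 <- u
  Y_Z_1 <- Y_Z_0 + 0.25

  # Inquiry: computed on the full population, before sampling
  estimand <- mean(Y_Z_1 - Y_Z_0)

  # Data strategy: sample 50 units, assign 25 to treatment, reveal outcomes
  sampled <- sample(100, 50)
  Z <- sample(rep(c(0, 1), each = 25))
  Y <- ifelse(Z == 1, Y_Z_1[sampled], Y_Z_0[sampled])

  # Answer strategy: difference in means with a Welch t-test
  test <- t.test(Y[Z == 1], Y[Z == 0])
  data.frame(sim_ID = sim_ID,
             estimand = estimand,
             estimate = mean(Y[Z == 1]) - mean(Y[Z == 0]),
             p.value = test$p.value)
}

# Run the design 500 times and stack the one-row results
toy_simulations <- do.call(rbind, lapply(1:500, run_design_once))

head(toy_simulations)
```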

Diagnosing a research design

The simulations data frame we created (stored here as simulations_df) allows us to diagnose the design (calculate summary statistics from the simulations) directly. We can, for example, use the following dplyr pipeline to calculate the bias, root mean-squared error, and power for each estimator-estimand pair.

simulations_df %>% 
  group_by(estimand_label, estimator_label) %>% 
  summarize(bias = mean(estimate - estimand),
            rmse = sqrt(mean((estimate - estimand)^2)),
            power = mean(p.value < .05))
estimand_label estimator_label bias rmse power
PATE estimator 0.11 0.24 0.4

In DeclareDesign, we do this in two steps. First, declare your diagnosands. These are functions of the simulations data. We have precoded several standard diagnosands.

study_diagnosands <- declare_diagnosands(
  select = c(bias, rmse, power), 
  mse = mean((estimate - estimand)^2))

Next, take your simulations data and the diagnosands, and diagnose. This runs a single operation, which is to calculate the diagnosands on your simulations data, just like in the dplyr version above.

diagnose_design(simulations_df, diagnosands = study_diagnosands)
design_label estimand_label estimator_label term mse se(mse) bias se(bias) rmse se(rmse) power se(power) n_sims
simple_design PATE estimator Z 0.06 0.03 0.11 0.1 0.24 0.06 0.4 0.22 5
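
The se(...) columns describe how uncertain each diagnosand is given the finite number of simulations. Our understanding is that diagnose_design obtains them by bootstrapping the simulations. The base R sketch below shows that idea for the bias diagnosand, using made-up simulations data. It is an illustration of the technique, not the package's code.

```r
set.seed(8)  # arbitrary seed, only for reproducibility

# Hypothetical simulations data: 500 estimates scattered around an estimand of 0.25
fake_simulations <- data.frame(
  estimand = 0.25,
  estimate = rnorm(500, mean = 0.25, sd = 0.28)
)

# The diagnosand: average difference between estimate and estimand
bias_diagnosand <- function(simulations) {
  mean(simulations$estimate - simulations$estimand)
}

# Resample simulation rows with replacement and recompute the diagnosand each time
bootstrap_bias <- replicate(100, {
  resampled_rows <- sample(nrow(fake_simulations), replace = TRUE)
  bias_diagnosand(fake_simulations[resampled_rows, ])
})

# The bootstrap standard error is the spread of the recomputed diagnosands
c(bias = bias_diagnosand(fake_simulations), se_bias = sd(bootstrap_bias))
```

With only a handful of simulations, as in the output above, these standard errors are large. Running more simulations shrinks them.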

We can also do this in a single step. When you send diagnose_design a design object, it will first run the simulations for you, then calculate the diagnosands from the simulations data frame that results.

diagnose_design(simple_design, diagnosands = study_diagnosands)

Comparing designs

In the diagnosis phase, you will often want to compare the properties of two designs to see which you prefer on the basis of the diagnosand values. We have two ways to compare. First, we can compare the designs themselves: what kinds of estimates and estimands they produce and what steps are in the design. Second, we can compare the diagnoses.

compare_designs(simple_design, redesigned_simple_design)

To compare the diagnoses, we run a diagnosis for each design, calculate the difference in each diagnosand between the two designs, and conduct a statistical test of the null hypothesis of no difference.

compare_diagnoses(simple_design, redesigned_simple_design)

Comparing many variants of a design

Often, we want to compare a large set of similar designs, varying key design parameters such as sample size, effect size, or the probability of treatment assignment. The easiest way to do this is to write a function that makes designs based on a set of these design inputs. We call these designers. Here’s a simple designer based on our running example:

simple_designer <- function(sample_size, effect_size) {
  # sample_size sets the population size N; half of the population is sampled,
  # so the number of sampled units grows with sample_size
  declare_population(N = sample_size, u = rnorm(N)) +
    declare_potential_outcomes(Y_Z_0 = u, Y_Z_1 = Y_Z_0 + effect_size) +
    declare_estimand(PATE = mean(Y_Z_1 - Y_Z_0)) +
    declare_sampling(n = sample_size / 2) +
    declare_assignment(prob = 0.5) +
    declare_reveal(outcome_variables = Y, assignment_variables = Z) +
    declare_estimator(Y ~ Z, model = difference_in_means, estimand = "PATE")
}

To create a single design, based on our original parameters of a 100-unit population (50 of which are sampled) and a treatment effect of 0.25, we can run:

simple_design <- simple_designer(sample_size = 100, effect_size = 0.25)

Now to simulate multiple designs, we can use the DeclareDesign function expand_design. Here we examine our simple design under several possible sample sizes, which we might want to do to conduct a minimum power analysis. We hold the effect size constant.

simple_designs <- expand_design(simple_designer, sample_size = c(100, 500, 1000), effect_size = 0.25)

Our simulation and diagnosis tools can take a set of expanded designs (an R list) and will simulate all of them at once, creating a column called design_label to keep them apart. For example:

diagnose_design(simple_designs)
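
The same expand-then-diagnose pattern can be sketched in base R with expand.grid and mapply. Here simulate_power is a hypothetical stand-in for diagnosing one design: it simulates a simple two-arm experiment directly instead of using a declared design, and the grid plays the role of the expanded set of designs.

```r
set.seed(9)  # arbitrary seed, only for reproducibility

# Hypothetical stand-in for diagnosing one design: simulate a two-arm experiment
# with n_per_arm units in each arm and return the share of significant results
simulate_power <- function(n_per_arm, effect_size, sims = 200) {
  p_values <- replicate(sims, {
    control <- rnorm(n_per_arm)
    treated <- rnorm(n_per_arm) + effect_size
    t.test(treated, control)$p.value
  })
  mean(p_values < 0.05)
}

# One row per design variant; the effect size is held constant
design_grid <- expand.grid(n_per_arm = c(25, 100, 400), effect_size = 0.25)

# "Diagnose" every variant and keep the results alongside its parameters
design_grid$power <- mapply(simulate_power,
                            design_grid$n_per_arm,
                            design_grid$effect_size)

design_grid
```

Reading down the power column shows how quickly power grows with the number of units, which is the kind of comparison a power analysis needs.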

Library of designs

In our DesignLibrary package, we have created a set of common designs as designers, so you can get started quickly and also easily set up a range of design variants for comparison.

library(DesignLibrary)

b_c_design <- block_cluster_two_arm_designer(N = 1000, N_blocks = 10)

diagnose_design(b_c_design)

  1. Both R and RStudio are available on Windows, Mac, and Linux.

  2. This pipeline could be expressed in base R as voter_file$Z <- sample(c(0, 1), size = 100, replace = TRUE, prob = c(0.5, 0.5))

  3. Typically, we think of potential outcomes as fixed and not random, and move random variables to the population.

  4. The {{ }} syntax is handy for writing functions in dplyr where you want to be able to reuse the function with different variable names. Here, the collapse_data function will group_by the variable you send to the argument collapse_by, which in our declaration we set to district. The pipeline within the function then calculates the mean in each district.

  5. Sometimes, you may be interested just in the properties of an estimator, such as calculating its power. In this case, you need not define an estimand.