This vignette serves as a brief introduction to the DeclareDesign package for R. DeclareDesign is a software implementation of every step of the design-diagnose-redesign process. While you can of course declare, diagnose, and redesign your design using nearly any programming language, DeclareDesign is structured to make it easy to mix-and-match design elements while handling the tedious simulation bookkeeping behind the scenes.

Installing R

You can download the statistical computing environment R for free from CRAN. We also recommend the free program RStudio, which provides a friendly interface to R. Both R and RStudio are available on Windows, Mac, and Linux.

Once you have R and RStudio installed, open RStudio and install DeclareDesign and its related packages. These include three packages that enable specific steps in the research process (fabricatr for simulating social science data; randomizr for random sampling and random assignment; and estimatr for design-based estimators). You can also install DesignLibrary, which gets standard designs up and running in one line. To install them, copy the following code into your R console:

install.packages(c(
  "DeclareDesign",
  "fabricatr",
  "randomizr",
  "estimatr",
  "DesignLibrary"
))

We also recommend that you install and get to know the tidyverse suite of packages for data analysis, which we will use throughout the book:

install.packages("tidyverse")

For introductions to R and the tidyverse we especially recommend the free resource R for Data Science.

Building a step of a research design

A research design is a concatenation of design steps. The best way to learn how to build a design is to learn how to make a step. We will start out by making—or declaring—a step that implements random assignment.

Almost all steps take a dataset as input and return a dataset as output. We will imagine input data that describes a set of voters in Los Angeles. The research project we are planning involves randomly assigning voters to receive (or not receive) a knock on their door from a canvasser. Our data look like this:

Example data
ID age sex party precinct
001 66 M REP 9104
002 54 F DEM 8029
003 18 M GRN 8383
004 42 F DEM 2048
005 27 M REP 5210

There are 100 voters in the dataset.

We want a function that takes this dataset, implements a random assignment, adds it to the dataset, and then returns the new dataset containing the random assignment.

You could write your own function to do that but you can also use one of the declare_* functions in DeclareDesign that are designed to write functions. Each one of these functions is a kind of function factory: it takes a set of parameters about your research design like the number of units and the random assignment probability as inputs, and returns a function as an output. Here is an example of a declare_assignment step.

simple_random_assignment_step <- declare_assignment(prob = 0.6)

The big idea here is that the object we created, simple_random_assignment_step, is not a particular assignment; it is a function that conducts assignment when called. You can run the function on data:

simple_random_assignment_step(voter_file) 
Data output following implementation of an assignment step.
ID age sex party precinct Z Z_cond_prob
001 66 M REP 9104 0 0.4
002 54 F DEM 8029 0 0.4
003 18 M GRN 8383 1 0.6
004 42 F DEM 2048 0 0.4
005 27 M REP 5210 0 0.4

The output of the simple_random_assignment_step(voter_file) call is the original dataset with a new column indicating treatment assignment (Z) appended. As a bonus, the data also include the probability that each unit is assigned to the condition it is in (Z_cond_prob), which is an extremely useful number to know in many analysis settings. The most important thing to understand here is that steps are “dataset-in, dataset-out” functions. The simple_random_assignment_step took the voter_file dataset and returned a dataset with assignment information appended.

Every step of a research design declaration can be written using one of the declare_* functions. This table collects these according to the four elements of a research design. Below, we walk through the common uses of each of these declaration functions.

Design component Function Description
Model declare_population() define background variables
declare_potential_outcomes() define functional relationships between treatments and outcomes
Inquiry declare_estimand() define research question
Data strategy declare_sampling() specify sampling procedures
declare_assignment() specify assignment procedures
reveal_outcomes() link potential outcomes to revealed outcomes via assignment
declare_measurement() specify measurement procedures
Answer strategy declare_estimator() specify data summary procedures

Options and defaults

Each of the declare_* functions has many options. In general, you do not have to specify these, as default values are usually provided. For instance, you might have noticed that when you ran the assignment step above, the new variable that was created was called Z. This is because declare_assignment has an argument assignment_variable that defaults to Z. You can of course change that to whatever you want.

More subtly, the declare_* functions also default to “handlers,” which have their own default arguments. These handlers are generally well-developed sets of functions that implement the tasks needed by the declare_* function. For instance, assignment_handler defaults to the conduct_ra function from the randomizr package. The declaration passes any additional arguments that you give it on to conduct_ra and, by the same token, assumes the default values of the handler. In the example above, we had prob = 0.6 as an argument. If you look at the documentation, prob is not an argument of declare_assignment, but it is an argument of conduct_ra, with a default value of 0.5. If we had left this bit out, we would have gotten a function that assigned treatment with probability 0.5. As with any software, learning these defaults takes some time; they can be looked up in the help files, e.g. ?declare_assignment.
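
For example, here is a sketch that combines these options; exactly how the new variable name is passed can vary by package version, so check ?declare_assignment before relying on it:

declare_assignment()                                       # all defaults: assigns Z with probability 0.5
declare_assignment(prob = 0.6, assignment_variable = "D")  # assigns a variable named D with probability 0.6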

Your own handlers

The built-in functions we provide in the DeclareDesign package are quite flexible and handle many major designs, but not all. The framework is built so that you are never constrained by what we provide. At any point, rather than using the default handlers (such as conduct_ra), you can write a function that implements your own procedures. The only discipline that the framework imposes is that you write your procedure as a function that takes data in and sends data back.

Here is an example of how you turn your own functions into design steps.

custom_assignment <- function(data) {
  # mutate() comes from dplyr (loaded with the tidyverse)
  mutate(data, Z = rbinom(n = nrow(data), 1, prob = 0.5))
}

my_assignment_step <- declare_assignment(handler = custom_assignment)

my_assignment_step(voter_file)  
Data generated using a custom function
ID age sex party precinct Z
001 66 M REP 9104 0
002 54 F DEM 8029 1
003 18 M GRN 8383 1
004 42 F DEM 2048 0
005 27 M REP 5210 1

Research design steps

In this section, we walk through how to declare each step of a research design using DeclareDesign. In the next section, we build those steps into a research design, and then describe how to interrogate the design.

Model

The model defines the structure of the world, both its size and background characteristics as well as how interventions in the world determine outcomes. In DeclareDesign, we split the model into two functions: declare_population and declare_potential_outcomes.

Population

The population defines the number of units in the population, any multilevel structure to the data, and its background characteristics. We can define the population in several ways. In some cases, you may start a design with data on the population. When that happens, we do not need to simulate it. We can simply declare the data as our population:

declare_population(data = voter_file)
Draw from a fixed population
ID age sex party precinct
001 66 M REP 9104
002 54 F DEM 8029
003 18 M GRN 8383
004 42 F DEM 2048
005 27 M REP 5210

When we do not have complete data on the population, we simulate it. Relying on the data simulation functions from our fabricatr package, declare_population asks about the size and variables of the population. For instance, if we want a function that generates a dataset with 100 units and a random variable U we write:

declare_population(N = 100, U = rnorm(N))

When we run this population function, we will get a different 100-unit dataset each time, as shown here.

Five draws from the population (values of U).
ID Draw 1 Draw 2 Draw 3 Draw 4 Draw 5
001 -0.32 0.19 -1.280 -0.38 0.744
002 1.17 0.69 1.880 -0.35 2.445
003 1.70 0.82 0.597 -0.64 0.043
004 0.93 -0.98 -1.963 0.40 0.159
005 -1.15 -1.29 0.084 0.34 1.686

The fabricatr package can simulate many different types of data, including categorical variables and data structures such as panel or multilevel data. You can read the fabricatr website vignette to get started simulating data.
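
For instance, here is a minimal sketch of a population with a binary background variable, using fabricatr's draw_binary function (the variable name female is purely illustrative):

declare_population(
  N = 100, 
  female = draw_binary(prob = 0.5, N = N)
)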

As an example of a two-level hierarchical data structure, here is a declaration for 100 households, each containing a random number of individuals:

declare_population(
  households = add_level(
    N = 100,
    individuals_per_hh = sample(1:6, N, replace = TRUE)
  ),
  individuals = add_level(
    N = individuals_per_hh, 
    age = sample(1:100, N, replace = TRUE)
  )
)

As always, you can exit our built-in way of doing things and bring in your own code. This is useful for complex designs, or when you have already written code for your design and you want to use it directly. Here is an example of a custom population declaration:

complex_population_function <- function(data, N_units) {
  data.frame(U = rnorm(N_units))
}

declare_population(
  handler = complex_population_function, N_units = 100
)

Potential outcomes

Defining potential outcomes is as easy as a single expression per potential outcome. Potential outcomes may depend on background characteristics, other potential outcomes, or other R functions.

declare_potential_outcomes(
  Y_Z_0 = U, 
  Y_Z_1 = Y_Z_0 + 0.25)

design <- 
  declare_population(N = 100, U = rnorm(N)) +
  declare_potential_outcomes(Y_Z_0 = U, Y_Z_1 = Y_Z_0 + 0.25)

draw_data(design)
Adding potential outcomes to the population.
ID U Y_Z_0 Y_Z_1
001 1.53 1.53 1.782
002 0.92 0.92 1.172
003 -1.19 -1.19 -0.937
004 -0.20 -0.20 0.046
005 -1.06 -1.06 -0.809

The declare_potential_outcomes function also includes an alternative interface for defining potential outcomes that uses R’s formula syntax. The formula syntax lets you specify “regression-like” outcome equations. One downside is that it mildly obscures how the eventual potential outcomes columns are named. We build the names of the potential outcomes columns from the outcome name (here Y, on the left-hand side of the formula) and from the assignment_variables argument (here Z).

declare_potential_outcomes(Y ~ 0.25 * Z + U, assignment_variables = Z)

Either way of creating potential outcomes works; one may be easier or harder to code up in a given research design setting.

Inquiry

To define your inquiry, declare your estimand. Estimands are typically summaries of the data produced in declare_population and declare_potential_outcomes. Here we define the average treatment effect as follows:

declare_estimand(PATE = mean(Y_Z_1 - Y_Z_0))

Notice that we defined the PATE (the population average treatment effect), but said nothing special related to the population – it looks like we just defined the average treatment effect. This is because order matters. If we want to define a SATE (the sample average treatment effect), we would have to do so after sampling has occurred. We will see how to do this in a moment.

Data strategy

The data strategy constitutes one or more steps representing interventions the researcher makes in the world from sampling to assignment to measurement.

Sampling

The sampling step relies on the randomizr package to conduct random sampling. Here we define a procedure for drawing a 50-unit sample from the population:
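
declare_sampling(n = 50)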

When we draw data from our simple design at this point, it will have fewer rows: it will have shrunk from 100 units in the population to a data frame of 50 units representing the sample. The new data frame also includes a variable indicating the probability of being included in the sample. In this case, every unit in the population had an equal inclusion probability of 0.5.

Sampled data.
ID U Y_Z_0 Y_Z_1 S_inclusion_prob
1 001 0.86 0.86 1.11 0.5
3 003 0.62 0.62 0.87 0.5
5 005 1.02 1.02 1.27 0.5
6 006 0.86 0.86 1.11 0.5
8 008 -0.43 -0.43 -0.18 0.5

Sampling could also be non-random, which could be accomplished by using a custom handler.

Assignment

The default handler for declare_assignment also relies on the randomizr package for random assignment. Here, we define an assignment procedure that allocates subjects to treatment with probability 0.5. One subtlety is that by default, declare_assignment conducts complete random assignment (exactly \(m\) of \(N\) units assigned to treatment, where \(m\) = prob * \(N\)).

declare_assignment(prob = 0.5)
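
Complete random assignment is the default behavior of the conduct_ra handler. If you instead wanted each unit to be assigned independently with probability 0.5 (simple, or “Bernoulli,” random assignment), one option is to pass the handler's simple argument through the declaration, a sketch that relies on the argument pass-through described above:

declare_assignment(prob = 0.5, simple = TRUE)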

After treatments are assigned, some potential outcomes are revealed. Treated units reveal their treated potential outcomes and untreated units reveal their untreated potential outcomes. The reveal_outcomes function performs this switching operation.
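
In our design, this step is declared as:

reveal_outcomes(outcome_variables = Y, assignment_variables = Z)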

Adding these two declarations to the design results in a data frame with an additional indicator Z for the assignment as well as its corresponding probability of assignment. Again, here the assignment probabilities are constant, but in other designs they are not and this is crucial information for the analysis stage. The outcome variable Y is composed of each unit’s potential outcomes depending on its treatment status.

Sampled data with assignment indicator.
ID U Y_Z_0 Y_Z_1 S_inclusion_prob Z Z_cond_prob Y
001 -0.86 -0.86 -0.61 0.5 1 0.5 -0.61
003 0.92 0.92 1.17 0.5 1 0.5 1.17
004 -0.72 -0.72 -0.47 0.5 0 0.5 -0.72
006 0.45 0.45 0.70 0.5 0 0.5 0.45
008 0.15 0.15 0.40 0.5 0 0.5 0.15

Measurement

Measurement is a critical part of every research design; sometimes it is beneficial to explicitly declare the measurement procedures of the design, rather than allowing them to be implicit in the ways variables are created in declare_population and declare_potential_outcomes. For example, we might imagine that the normally distributed outcome variable Y is a latent outcome that will be translated into a binary outcome when measured by the researcher:

declare_measurement(Y_binary = rbinom(N, 1, prob = pnorm(Y)))
Sampled data with an explicitly measured outcome.
ID U Y_Z_0 Y_Z_1 S_inclusion_prob Z Z_cond_prob Y Y_binary
001 -1.1 -1.1 -0.86 0.5 1 0.5 -0.86 0
006 3.1 3.1 3.38 0.5 1 0.5 3.38 1
008 1.9 1.9 2.11 0.5 0 0.5 1.86 1
012 -1.6 -1.6 -1.32 0.5 1 0.5 -1.32 0
013 -0.3 -0.3 -0.05 0.5 0 0.5 -0.30 0

Answer strategy

Through our model and data strategy steps, we have simulated a dataset with two key inputs to the answer strategy: an assignment variable and an outcome. In other answer strategies, pretreatment characteristics from the model might also be relevant. The data look like this:

Data with revealed outcomes.
ID U Y_Z_0 Y_Z_1 S_inclusion_prob Z Z_cond_prob Y
001 1.680 1.680 1.93 0.5 1 0.5 1.930
002 0.057 0.057 0.31 0.5 0 0.5 0.057
003 -1.280 -1.280 -1.03 0.5 1 0.5 -1.030
005 0.613 0.613 0.86 0.5 1 0.5 0.863
006 -1.241 -1.241 -0.99 0.5 1 0.5 -0.991

Our estimator is the difference-in-means estimator, which compares outcomes in the group assigned to treatment with outcomes in the group assigned to control. The difference_in_means() function in the estimatr package calculates the estimate, standard error, \(p\)-value, and confidence interval for you:

difference_in_means(Y ~ Z, data = simple_design_data)
Difference-in-means estimate from simulated data.
term estimate std.error statistic p.value conf.low conf.high df outcome
Z 0.13 0.27 0.5 0.62 -0.4 0.67 48 Y

Now, in order to declare our estimator, we can send the name of a modeling function to declare_estimator. R has many modeling functions that work with declare_estimator, including lm, glm, or the ictreg function from the list package, among hundreds of others. Throughout the book, we will be using many estimators from estimatr because they are fast and calculate robust standard errors easily. Estimators are (almost always) associated with estimands.1 Here, we are targeting the population average treatment effect with the difference-in-means estimator.

declare_estimator(
  Y ~ Z, model = difference_in_means, estimand = "PATE"
)

Two finer points: model_summary and label_estimator

Many answer strategies use modeling functions like lm, lm_robust, or glm. The output from these modeling functions is typically a complicated list object that contains a large amount of information about the modeling process. We typically only want a few summary pieces of information out of these model objects, like the coefficient estimates, standard errors, and confidence intervals. We use model summary functions passed to the model_summary argument of declare_estimator to do so. Model summary functions take models as inputs and return data frames as outputs.

The default model summary function is tidy:

declare_estimator(
  Y ~ Z, model = lm_robust, model_summary = tidy
)

You could also use glance to get model fit statistics like \(R^2\).

declare_estimator(
  Y ~ Z, model = lm_robust, model_summary = glance
)

Occasionally, you’ll need to write your own model summary function that takes a model fit object and returns a data.frame with the information you need. For example, in order to calculate average marginal effects estimates from a logistic regression, we run a glm model through the margins function from the margins package; we then need to “tidy” the output from margins using the tidy function. Here we’re also asking for a 95% confidence interval.

tidy_margins <- function(x) {
  # margins() is from the margins package; tidy() is from broom
  tidy(margins(x, data = x$data), conf.int = TRUE)
}

declare_estimator(
  Y ~ Z + X,
  model = glm,
  family = binomial("logit"),
  model_summary = tidy_margins,
  term = "Z"
) 

If your answer strategy does not use a model function, you’ll need to provide a function that takes data as an input and returns a data.frame with the estimate. Set the handler to be label_estimator(your_function_name) to take advantage of DeclareDesign’s mechanism for matching estimands to estimators. When you use label_estimator, you can provide an estimand, and DeclareDesign will keep track of which estimates match each estimand. For example, to calculate the mean of an outcome, you could write your own estimator in this way:

my_estimator <- function(data){
  data.frame(estimate = mean(data$Y))
}
declare_estimator(handler = label_estimator(my_estimator), label = "mean", estimand = "Y_bar")

Other design steps

The main declare_* functions cover many elements of research designs, but not all. You can include any operations we haven’t explicitly included as steps in your design too, using declare_step. Here, you must define a specific handler. Some handlers that may be useful are the dplyr verbs such as mutate and summarize, and the fabricate function from our fabricatr package.

To add a variable using fabricate:

declare_step(handler = fabricate, added_variable = rnorm(N))
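
Similarly, here is a sketch that uses a dplyr verb as the handler; any named arguments are passed along to mutate (age here is the variable from the voter_file example earlier, used purely for illustration):

declare_step(handler = mutate, age_squared = age^2)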

If you have district-month data that you want to analyze at the district level, you can collapse across months:

collapse_data <- function(data, collapse_by) {
  data %>% 
    group_by({{ collapse_by }}) %>% 
    summarize_all(mean, na.rm = TRUE)
}

declare_step(handler = collapse_data, collapse_by = district)

Note: The {{ }} syntax is handy for writing functions in dplyr where you want to be able to reuse the function with different variable names. Here, the collapse_data function will group_by the variable you send to the argument collapse_by, which in our declaration we set to district. The pipeline within the function then calculates the mean in each district.

Building a design from design steps

In the last section, we defined a set of individual research steps. Here we gather one version of each of them:

population <- 
  declare_population(N = 100, U = rnorm(N)) 

potential_outcomes <- 
  declare_potential_outcomes(Y ~ 0.25 * Z + U) 

estimand <- 
  declare_estimand(PATE = mean(Y_Z_1 - Y_Z_0)) 

sampling <- 
  declare_sampling(n = 50) 

assignment <- 
  declare_assignment(prob = 0.5) 

reveal <- 
  reveal_outcomes(outcome_variables = Y, assignment_variables = Z) 

estimator <- 
  declare_estimator(
    Y ~ Z, model = difference_in_means, estimand = "PATE"
  )

To construct a research design object that we can operate on — diagnose it, redesign it, draw data from it, etc. — we add them together with the + operator, just as %>% makes dplyr pipelines or + creates ggplot objects.

design <- 
  population + potential_outcomes + estimand + 
  sampling + assignment + reveal + estimator

We will usually declare designs more compactly, concatenating steps directly with +:

design <- 
  declare_population(N = 100, U = rnorm(N)) +
  declare_potential_outcomes(Y ~ 0.25 * Z + U) +
  declare_estimand(PATE = mean(Y_Z_1 - Y_Z_0)) +
  declare_sampling(n = 50) +
  declare_assignment(prob = 0.5) +
  reveal_outcomes(outcome_variables = Y, assignment_variables = Z) +
  declare_estimator(
    Y ~ Z, model = difference_in_means, estimand = "PATE"
  )

Order matters

When defining a design, the order in which steps are included in the design via the + operator matters. Think of the order of your design as the temporal order in which steps take place. Here, since the estimand comes before sampling and assignment, it is a population estimand, the population average treatment effect.

population + potential_outcomes + estimand + 
  sampling + assignment + reveal + estimator

We could define our estimand as a sample average treatment effect by putting estimand after sampling:

population + potential_outcomes + sampling +
  estimand + assignment + reveal + estimator

Simulating a research design

Diagnosing a research design — learning about its properties — requires first running the design over and over in simulation. We need to simulate the data generating process, then calculate the estimands, then calculate the resulting estimates.

With the design defined as an object, we can learn about what kind of data it generates, the values of its estimand and estimates, and other features. For example, to draw simulated data based on the design, we use draw_data:

draw_data(design)
Simulated data draw.
ID U Y_Z_0 Y_Z_1 S_inclusion_prob Z Z_cond_prob Y
002 0.076 0.076 0.33 0.5 0 0.5 0.076
005 0.595 0.595 0.84 0.5 0 0.5 0.595
006 -0.431 -0.431 -0.18 0.5 0 0.5 -0.431
007 -1.133 -1.133 -0.88 0.5 0 0.5 -1.133
008 -2.237 -2.237 -1.99 0.5 0 0.5 -2.237

draw_data runs all of the “data steps” in a design: those from the model (population and potential outcomes) and those from the data strategy (sampling, assignment, and measurement).

To simulate the estimands from a single run of the design, we use draw_estimands. This performs two operations at once: it draws the data and calculates the estimands at the point where they are defined in the design. In our design, the estimand comes just after the potential outcomes, so draw_estimands will run the first two steps and then calculate the estimands using the estimand function we declared:
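
draw_estimands(design)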

Estimands calculated from simulated data.
estimand_label estimand
PATE 0.25

Similarly, we can draw the estimates from a single run with draw_estimates, which simulates data and, at the appropriate moment, calculates estimates.
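
draw_estimates(design)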

Estimates calculated from simulated data.
term estimate std.error statistic p.value conf.low conf.high df outcome estimand_label
Z 0.63 0.32 2 0.054 -0.011 1.3 44 Y PATE

To simulate designs, we use the simulate_design function to draw data, calculate estimands and estimates, and then repeat the process over and over.

simulation_df <- simulate_design(design)
Simulations data frame.
sim_ID estimand estimate std.error statistic p.value conf.low conf.high df
1 0.25 0.10 0.31 0.33 0.74 -0.52 0.73 43
2 0.25 0.30 0.30 0.99 0.33 -0.31 0.91 47
3 0.25 0.24 0.32 0.73 0.47 -0.42 0.89 40
4 0.25 0.44 0.27 1.61 0.12 -0.11 0.98 41
5 0.25 -0.13 0.29 -0.45 0.66 -0.71 0.45 45

Diagnosing a research design

Using the simulations data frame, we can calculate diagnosands like bias, root mean squared error, and power for each estimator-estimand pair. In DeclareDesign, we do this in two steps. First, declare your diagnosands, which are functions that summarize the simulations data. The software includes many pre-coded diagnosands, though you can write your own like this:

study_diagnosands <- declare_diagnosands(
  bias = mean(estimate - estimand),
  rmse = sqrt(mean((estimate - estimand)^2)),
  power = mean(p.value <= 0.05)
)

Second, apply your diagnosand declaration to the simulations data frame with the diagnose_design function:

diagnose_design(simulation_df, diagnosands = study_diagnosands)
Design diagnosis.
Bias RMSE Power
0.00 0.28 0.14
(0.01) (0.01) (0.01)

We can also do this in a single step by sending diagnose_design a design object. The function will first run the simulations for you, then calculate the diagnosands from the simulation data frame that results.

diagnose_design(design, diagnosands = study_diagnosands)

Redesign

After the declaration phase, you will often want to learn how the diagnosands change as design features change. We can do this using redesign:

redesign(design, N = c(100, 200, 300, 400, 500))

An alternative way to do this is to write a “designer.” A designer is a function that makes designs based on a few design parameters. Designers help researchers flexibly explore design variations. Here’s a simple designer based on our running example:

simple_designer <- function(sample_size, effect_size) {
  declare_population(N = sample_size, U = rnorm(N)) +
    declare_potential_outcomes(Y ~ effect_size * Z + U) +
    declare_estimand(PATE = mean(Y_Z_1 - Y_Z_0)) +
    declare_sampling(n = 50) +
    declare_assignment(prob = 0.5) +
    reveal_outcomes(outcome_variables = Y, assignment_variables = Z) +
    declare_estimator(
      Y ~ Z, model = difference_in_means, estimand = "PATE"
    )
}

To create a single design, based on our original parameters of a 100-unit sample size and a treatment effect of 0.25, we can run:

design <- simple_designer(sample_size = 100, effect_size = 0.25)

Now to simulate multiple designs, we can use the DeclareDesign function expand_design. Here we examine our simple design under several possible sample sizes, which we might want to do as part of a power analysis to find the minimum sample size we need. We hold the effect size constant.

designs <- expand_design(
  simple_designer, 
  sample_size = c(100, 500, 1000), 
  effect_size = 0.25
)

Our simulation and diagnosis tools can take a list of designs and simulate all of them at once, creating a column called design_label to keep track. For example:
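
# One possible call, as a sketch: simulate every design in the list at once
simulate_design(designs)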

Comparing designs

Alternatively, we can compare a pair of designs directly with the compare_designs function. This function is most useful for comparing the differences between a planned design and an implemented design.

compare_designs(planned_design, implemented_design)

Similarly, we can compare two designs on the basis of their diagnoses:

compare_diagnoses(planned_design, implemented_design)

Library of designs

In our DesignLibrary package, we have created a set of common designs as designers (functions that create designs from just a few parameters), so you can get started quickly.

library(DesignLibrary)

b_c_design <- block_cluster_two_arm_designer(N = 1000, N_blocks = 10)

Further Reading

For much more detail and help, we recommend the following resources.


  1. Sometimes, you may be interested in properties of an estimator that do not depend on an estimand, such as its power.