# Getting Started with DeclareDesign

DeclareDesign is a system for describing research designs in code and simulating them in order to understand their properties. Because DeclareDesign employs a consistent grammar of designs, you can focus on the intellectually challenging part – designing good research studies – without having to code up simulations from scratch. DeclareDesign is based on the Model-Inquiry-Data Strategy-Answer Strategy (MIDA) framework for describing designs and a declare-diagnose-redesign workflow for improving research designs before implementing them.

## Installing R

You can download the statistical computing environment R for free from CRAN. We also recommend the free program RStudio, which provides a friendly interface to R. Both R and RStudio are available on Windows, Mac, and Linux.

Once you have R and RStudio installed, open it up and install DeclareDesign and its related packages. These include three packages that enable specific steps in the research process (fabricatr for simulating social science data; randomizr for random sampling and random assignment; and estimatr for design-based estimators). You can also install DesignLibrary, which gets standard designs up-and-running in one line. To install them, copy the following code into your R console:

install.packages(c(
"DeclareDesign",
"fabricatr",
"randomizr",
"estimatr",
"DesignLibrary"
))

We also recommend that you install and get to know the tidyverse suite of packages for data analysis:

install.packages("tidyverse")

For introductions to R and the tidyverse we especially recommend the free resource R for Data Science.

## Building a step of a research design

A research design is a concatenation of design steps. The best way to learn how to build a design is to learn how to make a step. We will start out by making—or declaring—a step that implements random assignment.

Almost all steps take a dataset as input and return a dataset as output. We will imagine input data that describes a set of voters in Los Angeles. The research project we are planning involves randomly assigning voters to receive (or not receive) a knock on their door from a canvasser. Our data look like this:

Table 1: Example data
ID age sex party precinct
001 66 M REP 9104
002 54 F DEM 8029
003 18 M GRN 8383
004 42 F DEM 2048
005 27 M REP 5210

There are 100 voters in the dataset.

We want a function that takes this dataset, implements a random assignment, adds it to the dataset, and then returns the new dataset containing the random assignment.

You could write your own function to do that but you can also use one of the declare_* functions in DeclareDesign that are designed to write functions. Each one of these functions is a kind of function factory: it takes a set of parameters about your research design like the number of units and the random assignment probability as inputs, and returns a function as an output. Here is an example of a declare_assignment step.

simple_random_assignment_step <-
declare_assignment(Z = simple_ra(N = N, prob = 0.6))

The big idea here is that the object we created, simple_random_assignment_step, is not a particular assignment, it is a function that conducts assignment when called. You can run the function on data:

simple_random_assignment_step(voter_file) 
Table 2: Data output following implementation of an assignment step.
ID age sex party precinct Z
001 66 M REP 9104 1
002 54 F DEM 8029 1
003 18 M GRN 8383 0
004 42 F DEM 2048 1
005 27 M REP 5210 0

The output of the simple_random_assignment_step(voter_file) call is the original dataset with a new column indicating treatment assignment (Z) appended. As a bonus, the data also includes the probability that each unit is assigned to the condition in which it is in (Z_cond), which is an extremely useful number to know in many analysis settings. The most important thing to understand here is that steps are “dataset-in, dataset-out” functions. The simple_random_assignment_step took the voter_file dataset and returned a dataset with assignment information appended.

Every step of a research design declaration can be written using one of the declare_* functions. Table 3 collects these according to the four elements of a research design. Below, we walk through the common uses of each of these declaration functions.

Table 3: Declaration functions in DeclareDesign
Design component Function Description
Model declare_model() define background variables and potential outcomes
Inquiry declare_inquiry() define research question
Data strategy declare_sampling() specify sampling procedures
declare_assignment() specify assignment procedures
declare_measurement() specify measurement procedures
Answer strategy declare_estimator() specify estimation procedures
declare_test() specify testing procedures

The built-in functions we provide in the DeclareDesign package are quite flexible and handle many major designs, but not all. The framework is built so that you are never constrained by what we provide. At any point, you can write a function that implements your own procedures. The only discipline that the framework imposes is that you write your procedure as a function that takes data in and sends data back.

Here is an example of how you turn your own functions into design steps.

custom_assignment <- function(data) {
mutate(data, Z = rbinom(n = nrow(data), 1, prob = 0.5))
}

my_assignment_step <- declare_assignment(handler = custom_assignment)

my_assignment_step(voter_file)  
Table 4: Data generated using a custom function
ID age sex party precinct Z
001 66 M REP 9104 0
002 54 F DEM 8029 1
003 18 M GRN 8383 1
004 42 F DEM 2048 0
005 27 M REP 5210 0

## Research design steps

In this section, we walk through how to declare each step of a research design using DeclareDesign. In the next section, we build those steps into a research design, and then describe how to interrogate the design.

### Model

The model defines the structure of the world, both its size and background characteristics as well as how interventions in the world determine outcomes.

#### Population

The population defines the number of units in the population, any multilevel structure to the data, and its background characteristics. We can define the population in several ways. In some cases, you may start a design with data on the population. When that happens, we do not need to simulate it. We can simply declare the data as our population:

declare_model(data = voter_file)
Table 5: Draw from a fixed population
ID age sex party precinct
001 66 M REP 9104
002 54 F DEM 8029
003 18 M GRN 8383
004 42 F DEM 2048
005 27 M REP 5210

When we do not have complete data on the population, we simulate it. Relying on the data simulation functions from our fabricatr package, declare_model asks about the size and variables of the population. For instance, if we want a function that generates a dataset with 100 units and a random variable U we write:

declare_model(N = 100, U = rnorm(N))

When we run this population function, we will get a different 100-unit dataset each time, as shown in Table 6.

Table 6: Five draws from the population.
Draw 1
Draw 2
Draw 3
Draw 4
Draw 5
ID U ID U ID U ID U ID U
001 -0.700 001 -0.35 001 -1.07 001 0.696 001 0.87
002 0.561 002 0.27 002 0.11 002 0.059 002 0.46
003 0.048 003 0.45 003 0.35 003 -0.030 003 1.73
004 0.567 004 -2.69 004 -1.37 004 -1.279 004 -0.46
005 -1.196 005 0.87 005 -0.39 005 -0.247 005 -0.70

The fabricatr package can simulate many different types of data, including various types of categorical variables or different types of data structures, such as panel or multilevel structures. You can read the fabricatr website vignette to get started simulating data.

As an example of a two-level hierarchical data structure, here is a declaration for 100 households with a random number of individuals within each household. This two-level structure could be declared as:

declare_model(
N = 100,
individuals_per_hh = sample(1:6, N, replace = TRUE)
),
N = individuals_per_hh,
age = sample(1:100, N, replace = TRUE)
)
)

As always, you can exit our built-in way of doing things and bring in your own code. This is useful for complex designs, or when you have already written code for your design and you want to use it directly. Here is an example of a custom population declaration:

complex_population_function <- function(data, N_units) {
data.frame(U = rnorm(N_units))
}

declare_model(
handler = complex_population_function, N_units = 100
)

#### Potential outcomes

Defining potential outcomes is as easy as a single expression per potential outcome. Potential outcomes may depend on background characteristics, other potential outcomes, or other R functions.

declare_model(
Y_Z_0 = U,
Y_Z_1 = Y_Z_0 + 0.25)
design <-
declare_model(
N = 100, U = rnorm(N),
potential_outcomes(Y ~ 0.25 * Z + U)
)
Table 7: Adding potential outcomes to the population.
ID U Y_Z_0 Y_Z_1
001 -2.41 -2.41 -2.16
002 0.81 0.81 1.06
003 -0.86 -0.86 -0.61
004 1.30 1.30 1.55
005 0.59 0.59 0.84

The declare_model function also includes an alternative interface for defining potential outcomes that uses R’s formula syntax with the potential_outcomes function. The formula syntax lets you specify “regression-like” outcome equations. One downside is that it mildly obscures how the names of the eventual potential outcomes columns are named. We build the names of the potential outcomes columns the outcome name (here Y on the left-hand side of the formula) and the name of the assignment variable from the variable name in the conditions argument (here Z).

declare_model(potential_outcomes(Y ~ 0.25 * Z + U, conditions = list(Z = c(0, 1))))

Either way of creating potential outcomes works; one may be easier or harder to code up in a given research design setting.

### Inquiry

To define your inquiry, declare your estimand. Estimands are typically summaries of the data produced in declare_model. Here we define the average treatment effect as follows:

declare_inquiry(PATE = mean(Y_Z_1 - Y_Z_0))

Notice that we defined the PATE (the population average treatment effect), but said nothing special related to the population – it looks like we just defined the average treatment effect. This is because order matters. If we want to define a SATE (the sample average treatment effect), we would have to do so after sampling has occurred. We will see how to do this in a moment.

### Data strategy

The data strategy constitutes one or more steps representing interventions the researcher makes in the world from sampling to assignment to measurement.

#### Sampling

The sampling step relies on the randomizr package to conduct random sampling. See Section ?? for an overview of the many kinds of sampling that are possible. Here we define a procedure for drawing a 50-unit sample from the population:

declare_sampling(S = complete_rs(N, n = 50), legacy = FALSE)

When we draw data from our simple design at this point, it will have fewer rows: it will have shrunk from 100 units in the population to a data frame of 50 units representing the sample. The new data frame also includes a variable indicating the probability of being included in the sample. In this case, every unit in the population had an equal inclusion probability of 0.5.

Table 8: Sampled data.
ID U Y_Z_0 Y_Z_1 S
5 005 0.523 0.523 0.77 1
6 006 1.933 1.933 2.18 1
8 008 0.748 0.748 1.00 1
10 010 -0.078 -0.078 0.17 1
11 011 2.119 2.119 2.37 1

Sampling could also be non-random, which could be accomplished by using a custom handler.

#### Assignment

Here, we define an assignment procedure that allocates subjects to treatment with probability 0.5.

declare_assignment(Z = complete_ra(N, prob = 0.5), legacy = FALSE)

After treatments are assigned, some potential outcomes are revealed. Treated units reveal their treated potential outcomes and untreated units reveal their untreated potential outcomes. The reveal_outcomes function performs this switching operation.

declare_measurement(Y = reveal_outcomes(Y ~ Z))

Adding these two declarations to the design results in a data frame with an additional indicator Z for the assignment as well as its corresponding probability of assignment. Again, here the assignment probabilities are constant, but in other designs described in Section ?? they are not and this is crucial information for the analysis stage. The outcome variable Y is composed of each unit’s potential outcomes depending on its treatment status.

Table 9: Sampled data with assignment indicator.
ID U Y_Z_0 Y_Z_1 S Z Y
004 0.94 0.94 1.19 1 1 1.19
005 0.77 0.77 1.02 1 0 0.77
008 -0.17 -0.17 0.08 1 1 0.08
009 1.60 1.60 1.85 1 1 1.85
010 -1.39 -1.39 -1.14 1 0 -1.39

#### Measurement

Measurement is a critical part of every research design; sometimes it is beneficial to explicitly declare the measurement procedures of the design, rather than allowing them to be implicit in the ways variables are created in declare_model. For example, we might imagine that the normally distributed outcome variable Y is a latent outcome that will be translated into a binary outcome when measured by the researcher:

declare_measurement(Y_binary = rbinom(N, 1, prob = pnorm(Y)))
Table 10: Sampled data with an explicitly measured outcome.
ID U Y_Z_0 Y_Z_1 S Z Y Y_binary
001 -1.19 -1.19 -0.94 1 0 -1.19 0
002 -0.66 -0.66 -0.41 1 0 -0.66 0
003 1.09 1.09 1.34 1 0 1.09 1
004 0.51 0.51 0.76 1 1 0.76 1
007 0.76 0.76 1.00 1 0 0.76 0

Through our model and data strategy steps, we have simulated a dataset with two key inputs to the answer strategy: an assignment variable and an outcome. In other answer strategies, pretreatment characteristics from the model might also be relevant. The data look like this:

Table 11: Data with revealed outcomes.
ID U Y_Z_0 Y_Z_1 S Z Y
001 -0.06 -0.06 0.190 1 0 -0.06
002 -0.16 -0.16 0.089 1 0 -0.16
004 0.99 0.99 1.239 1 1 1.24
006 -2.61 -2.61 -2.364 1 0 -2.61
007 0.78 0.78 1.030 1 1 1.03

Our estimator is the difference-in-means estimator, which compares outcomes between the group that was assigned to treatment and that assigned to control. The difference_in_means() function in the estimatr package calculates the estimate, the standard error, $$p$$-value and confidence interval for you:

difference_in_means(Y ~ Z, data = simple_design_data)
Table 12: Difference-in-means estimate from simulated data.
term estimate std.error statistic p.value conf.low conf.high df outcome
Z 0.39 0.28 1.4 0.17 -0.17 0.95 46 Y

Now, in order to declare our estimator, we can send the name of a modeling function to declare_estimator. R has many modeling functions that work with declare_estimator, including lm, glm, or the ictreg function from the list package, among hundreds of others. We use many estimators from estimatr because they are fast and calculate robust standard errors easily. Estimators are (almost always) associated with estimands.1 Here, we are targeting the population average treatment effect with the difference-in-means estimator.

declare_estimator(
Y ~ Z, model = difference_in_means, inquiry = "PATE"
)

#### Two finer points: model_summary and label_estimator

Many answer strategies use modeling functions like lm, lm_robust, or glm. The output from these modeling functions are typically very complicated list objects that contain large amounts of information about the modeling process. We typically only want a few summary pieces of information out of these model objects, like the coefficient estimates, standard errors, and confidence intervals. We use model summary functions passed to the model_summary argument of declare_estimator to do so. Model summary functions take models as inputs and return data frames as outputs.

The default model summary function is tidy:

declare_estimator(
Y ~ Z, model = lm_robust, model_summary = tidy
)

You could also use glance to get model fit statistics like $$R^2$$.

declare_estimator(
Y ~ Z, model = lm_robust, model_summary = glance
)

Occasionally, you’ll need to write your own model summary function that takes a model fit object and returns a data.frame with the information you need. For example, in order to calculate average marginal effects estimates from a logistic regression, we run a glm model through the margins function from the margins package; we then need to “tidy” the output from margins using the tidy function. Here we’re also asking for a 95% confidence interval.

tidy_margins <- function(x) {
tidy(margins(x, data = x$data), conf.int = TRUE) } declare_estimator( Y ~ Z + X, model = glm, family = binomial("logit"), model_summary = tidy_margins, term = "Z" )  If your answer strategy does not use a model function, you’ll need to provide a function that takes data as an input and returns a data.frame with the estimate. Set the handler to be label_estimator(your_function_name) to take advantage of DeclareDesign’s mechanism for matching estimands to estimators. When you use label_estimator, you can provide an estimand, and DeclareDesign will keep track of which estimates match each estimand. For example, to calculate the mean of an outcome, you could write your own estimator in this way: my_estimator <- function(data){ data.frame(estimate = mean(data$Y))
}
declare_estimator(handler = label_estimator(my_estimator),
label = "mean", inquiry = "Y_bar")
## declare_estimator(inquiry = "Y_bar", handler = label_estimator(my_estimator),
##     label = "mean")

### Other design steps

The main declare_* functions cover many elements of research designs, but not all. You can include any operations we haven’t explicitly included as steps in your design too, using declare_step. Here, you must define a specific handler. Some handlers that may be useful are the dplyr verbs such as mutate and summarize, and the fabricate function from our fabricatr package.

To add a variable using fabricate:

declare_step(handler = fabricate, added_variable = rnorm(N))

If you have district-month data you may want to analyze at the district level, collapsing across months:

collapse_data <- function(data, collapse_by) {
data %>%
group_by({{ collapse_by }}) %>%
summarize_all(mean, na.rm = TRUE)
}

declare_step(handler = collapse_data, collapse_by = district)

# Note: The {{ }} syntax is handy for writing functions in dplyr
# where you want to be able to reuse the function with different variable
# names. Here, the collapse_data function will group_by the
# variable you send to the argument collapse_by, which in our
# declaration we set to district. The pipeline within the function
# then calculates the mean in each district.

## Building a design from design steps

In the last section, we defined a set of individual research steps. We draw one version of them together here:

model <-
declare_model(N = 100, U = rnorm(N),
potential_outcomes(Y ~ 0.25 * Z + U))

inquiry <-
declare_inquiry(PATE = mean(Y_Z_1 - Y_Z_0))

sampling <- declare_sampling(
S = complete_rs(N, n = 50))

assignment <- declare_assignment(
Z = complete_ra(N, prob = 0.5))

measurement <- declare_measurement(Y = reveal_outcomes(Y ~ Z))

declare_estimator(
Y ~ Z, model = difference_in_means, inquiry = "PATE"
)

To construct a research design object that we can operate on — diagnose it, redesign it, draw data from it, etc. — we add them together with the + operator, just as %>% makes dplyr pipelines or + creates ggplot objects.

design <-
model + inquiry +
sampling + assignment + measurement + answer_strategy

We will usually declare designs more compactly, concatenating steps directly with +:

design <-
declare_model(N = 100, U = rnorm(N),
potential_outcomes(Y ~ 0.25 * Z + U)) +
declare_inquiry(PATE = mean(Y_Z_1 - Y_Z_0)) +
declare_sampling(S = complete_rs(N, n = 50)) +
declare_assignment(Z = complete_ra(N, prob = 0.5)) +
declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
declare_estimator(
Y ~ Z, model = difference_in_means, inquiry = "PATE"
)

### Order matters

When defining a design, the order in which steps are included in the design via the + operator matters. Think of the order of your design as the temporal order in which steps take place. Here, since the inquiry comes before sampling and assignment, it is a population inquiry, the population average treatment effect.

model +
declare_inquiry(PATE = mean(Y_Z_1 - Y_Z_0)) +
sampling +
assignment +
measurement +
answer_strategy

We could define our inquiry as a sample average treatment effect by putting inquiry after sampling:

model +
sampling +
declare_inquiry(SATE = mean(Y_Z_1 - Y_Z_0)) +
assignment +
measurement +
answer_strategy

## Simulating a research design

Diagnosing a research design — learning about its properties — requires first simulating running the design over and over. We need to simulate the data generating process, then calculate the values of its inquiries, then calculate the resulting estimates. For example, to draw simulated data based on the design, we use draw_data:

draw_data(design)
Table 13: Simulated data draw.
ID U Y_Z_0 Y_Z_1 S Z Y
003 0.080 0.080 0.33 1 1 0.33
008 -1.194 -1.194 -0.94 1 1 -0.94
010 1.300 1.300 1.55 1 1 1.55
013 -1.914 -1.914 -1.66 1 0 -1.91
016 0.034 0.034 0.28 1 1 0.28

draw_data runs all of the “data steps” in a design, which are both from the model and from the data strategy (sampling, assignment, and measurement).

To simulate the estimands from a single run of the design, we use draw_inquiries. This runs two operations at once: it draws the data, and calculates the estimands at the point defined by the design. For example, in our design, the estimand comes just after the potential outcomes. In this design, draw_inquiries will run the first two steps and then calculate the value of inquiry from the inquiry function we declared:

draw_estimands(design)
Table 14: Estimands calculated from simulated data.
inquiry estimand
PATE 0.25

Similarly, we can draw the estimates from a single run with draw_estimates which simulates data and, at the appropriate moment, calculates estimates.

draw_estimates(design)
Table 15: Estimates calculated from simulated data.
term estimate std.error statistic p.value conf.low conf.high df outcome inquiry
Z 0.24 0.29 0.84 0.41 -0.34 0.83 48 Y PATE

To simulate designs, we use the simulate_design function to draw data, calculate estimands and estimates, and then repeat the process over and over.

simulation_df <- simulate_design(design)
simulation_df
Table 16: Simulations data frame.
sim_ID estimand estimate std.error statistic p.value conf.low conf.high df
1 0.25 0.64 0.25 2.55 0.014 0.136 1.15 47
2 0.25 -0.18 0.31 -0.59 0.557 -0.804 0.44 43
3 0.25 0.41 0.29 1.40 0.170 -0.181 1.00 46
4 0.25 0.17 0.23 0.73 0.468 -0.300 0.64 48
5 0.25 0.71 0.32 2.21 0.032 0.063 1.36 45

## Diagnosing a research design

Using the simulations data frame, we can calculate diagnosands like bias, root mean-squared-error, and power for each estimator-estimand pair. In DeclareDesign, we do this in two steps. First, declare your diagnosands, which are functions that summarize simulations data. The software includes many pre-coded diagnosands (see Section ??), though you can write your own like this:

study_diagnosands <- declare_diagnosands(
bias = mean(estimate - estimand),
rmse = sqrt(mean((estimate - estimand)^2)),
power = mean(p.value <= 0.05)
)

Second, apply your diagnosand declaration to the simulations data frame with the diagnose_design function:

diagnose_design(simulation_df, diagnosands = study_diagnosands)
Table 17: Design diagnosis.
Bias RMSE Power
-0.00 0.28 0.14
(0.01) (0.01) (0.01)

We can also do this in a single step by sending diagnose_design a design object. The function will first run the simulations for you, then calculate the diagnosands from the simulation data frame that results.

diagnose_design(design, diagnosands = study_diagnosands)

### Redesign

After the declaration phase, you will often want to learn how the diagnosands change as design features change. We can do this using redesign:

redesign(design, N = c(100, 200, 300, 400, 500))

An alternative way to do this is to write a “designer.” A designer is a function that makes designs based on a few design parameters. Designer help researchers flexibly explore design variations. Here’s a simple designer based on our running example:

simple_designer <- function(sample_size, effect_size) {
declare_model(
N = sample_size,
U = rnorm(N),
potential_outcomes(Y ~ effect_size * Z + U)
) +
declare_inquiry(PATE = mean(Y_Z_1 - Y_Z_0)) +
declare_sampling(S = complete_rs(N, n = 50)) +
declare_assignment(Z = complete_ra(N, prob = 0.5)) +
declare_measurement(Y = reveal_outcomes(Y ~ Z)) +
declare_estimator(
Y ~ Z, model = difference_in_means, inquiry = "PATE"
)
}

To create a single design, based on our original parameters of a 100-unit sample size and a treatment effect of 0.25, we can run:

design <- simple_designer(sample_size = 100, effect_size = 0.25)

Now to simulate multiple designs, we can use the DeclareDesign function expand_design. Here we examine our simple design under several possible sample sizes, which we might want to do to conduct a minimum power analysis. We hold the effect size constant.

designs <- expand_design(
simple_designer,
sample_size = c(100, 500, 1000),
effect_size = 0.25
)

Our simulation and diagnosis tools can take a list of designs and simulate all of them at once, creating a column called design to keep track. For example:

diagnose_design(designs)

### Comparing designs

Alternatively, we can compare a pair of designs directly with the compare_designs function. This function is most useful for comparing the differences between a planned design and an implemented design.

compare_designs(planned_design, implemented_design)

Similarly, we can compare two designs on the basis of their diagnoses:

compare_diagnoses(planned_design, implemented_design)

### Library of designs

In our DesignLibrary package, we have created a set of common designs as designers (functions that create designs from just a few parameters), so you can get started quickly.

library(DesignLibrary)

block_cluster_design <- block_cluster_two_arm_designer(N = 1000, N_blocks = 10)