This vignette serves as a brief introduction to the DeclareDesign package for R. DeclareDesign is a software implementation of every step of the design-diagnose-redesign process. While you can of course declare, diagnose, and redesign your design using nearly any programming language, DeclareDesign is structured to make it easy to mix and match design elements while handling the tedious simulation bookkeeping behind the scenes.
You can download the statistical computing environment R for free from CRAN. We also recommend the free program RStudio, which provides a friendly interface to R. Both R and RStudio are available on Windows, Mac, and Linux.
Once you have R and RStudio installed, open RStudio and install DeclareDesign and its related packages. These include three packages that enable specific steps in the research process (fabricatr for simulating social science data; randomizr for random sampling and random assignment; and estimatr for design-based estimators). You can also install DesignLibrary, which gets standard designs up and running in one line. To install them, copy the following code into your R console:
install.packages(c(
"DeclareDesign",
"fabricatr",
"randomizr",
"estimatr",
"DesignLibrary"
))
We also recommend that you install and get to know the tidyverse suite of packages for data analysis, which we will use throughout the book:
install.packages("tidyverse")
For introductions to R and the tidyverse, we especially recommend the free resource R for Data Science.
A research design is a concatenation of design steps. The best way to learn how to build a design is to learn how to make a step. We will start out by making—or declaring—a step that implements random assignment.
Almost all steps take a dataset as input and return a dataset as output. We will imagine input data that describes a set of voters in Los Angeles. The research project we are planning involves randomly assigning voters to receive (or not receive) a knock on their door from a canvasser. Our data look like this:
ID | age | sex | party | precinct |
---|---|---|---|---|
001 | 66 | M | REP | 9104 |
002 | 54 | F | DEM | 8029 |
003 | 18 | M | GRN | 8383 |
004 | 42 | F | DEM | 2048 |
005 | 27 | M | REP | 5210 |
There are 100 voters in the dataset.
We want a function that takes this dataset, implements a random assignment, adds it to the dataset, and then returns the new dataset containing the random assignment.
You could write your own function to do that, but you can also use one of the declare_* functions in DeclareDesign that are designed to write functions. Each one of these functions is a kind of function factory: it takes a set of parameters about your research design, like the number of units and the random assignment probability, as inputs and returns a function as an output. Here is an example of a declare_assignment step.
simple_random_assignment_step <- declare_assignment(prob = 0.6)
The big idea here is that the object we created, simple_random_assignment_step, is not a particular assignment; it is a function that conducts assignment when called. You can run the function on data:
simple_random_assignment_step(voter_file)
ID | age | sex | party | precinct | Z | Z_cond_prob |
---|---|---|---|---|---|---|
001 | 66 | M | REP | 9104 | 0 | 0.4 |
002 | 54 | F | DEM | 8029 | 0 | 0.4 |
003 | 18 | M | GRN | 8383 | 1 | 0.6 |
004 | 42 | F | DEM | 2048 | 0 | 0.4 |
005 | 27 | M | REP | 5210 | 0 | 0.4 |
The output of the simple_random_assignment_step(voter_file) call is the original dataset with a new column indicating treatment assignment (Z) appended. As a bonus, the data also include the probability that each unit is assigned to the condition it is in (Z_cond_prob), which is an extremely useful number to know in many analysis settings. The most important thing to understand here is that steps are “dataset-in, dataset-out” functions. The simple_random_assignment_step took the voter_file dataset and returned a dataset with assignment information appended.
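To make the function-factory idea concrete, here is a minimal base R sketch. It is not how DeclareDesign is implemented; make_assignment_step and toy_voters are hypothetical names used only for illustration, and the sketch uses simple (coin-flip) assignment rather than the package default.

```r
# Hypothetical factory: takes a design parameter, returns a step function
make_assignment_step <- function(prob) {
  function(data) {
    # Simple (Bernoulli) assignment: each unit is treated with probability `prob`
    data$Z <- rbinom(n = nrow(data), size = 1, prob = prob)
    data
  }
}

# Made-up stand-in for the voter file
toy_voters <- data.frame(
  ID = sprintf("%03d", 1:5),
  age = c(66, 54, 18, 42, 27)
)

toy_step <- make_assignment_step(prob = 0.6)  # a function, not an assignment
toy_result <- toy_step(toy_voters)            # dataset in, dataset out
toy_result
```

Calling toy_step again on the same data would produce a fresh assignment, which is what makes a step reusable inside a simulation.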
Every step of a research design declaration can be written using one of the declare_* functions. This table collects these according to the four elements of a research design. Below, we walk through the common uses of each of these declaration functions.
Design component | Function | Description |
---|---|---|
Model | declare_population() | define background variables |
Model | declare_potential_outcomes() | define functional relationships between treatments and outcomes |
Inquiry | declare_estimand() | define research question |
Data strategy | declare_sampling() | specify sampling procedures |
Data strategy | declare_assignment() | specify assignment procedures |
Data strategy | reveal_outcomes() | link potential outcomes to revealed outcomes via assignment |
Data strategy | declare_measurement() | specify measurement procedures |
Answer strategy | declare_estimator() | specify data summary procedures |
Each of the declare_* functions has many options. In general, you do not have to specify these, as default values are usually provided. For instance, you might have noticed that when you ran the assignment step above, the new variable that was created was called Z. This is because declare_assignment has an argument assignment_variable that defaults to Z. You can of course change that to whatever you want.
More subtly, the declare_* functions also default to “handlers,” which have their own default arguments. These handlers are generally well-developed sets of functions that implement the tasks needed by the declare_* function. For instance, assignment_handler defaults to the conduct_ra function from the randomizr package. The declaration passes any additional arguments that you give it on to conduct_ra and, by the same token, assumes the default values of the handler. In the example above, we had prob = 0.6 as an argument. If you look at the documentation, prob is not an argument of declare_assignment, but it is an argument of conduct_ra, with a default value of 0.5. If we had left this bit out, we would have gotten a function that assigned treatment with probability 0.5. As with any software, learning these defaults takes some time; you can look them up in the help files, e.g. ?declare_assignment.
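The way extra arguments flow from a declaration to its handler can be sketched in base R with the `...` mechanism. This is only an illustration under our own assumptions: toy_handler and toy_declare are hypothetical stand-ins, not the package's functions, and the real handler does considerably more.

```r
# Hypothetical handler with its own default, standing in for a
# function like conduct_ra
toy_handler <- function(N, prob = 0.5) {
  rbinom(n = N, size = 1, prob = prob)
}

# Hypothetical declaration function: captures extra arguments in `...`
# and forwards them to the handler when the step is run
toy_declare <- function(..., handler = toy_handler) {
  extra_args <- list(...)
  function(data) {
    data$Z <- do.call(handler, c(list(N = nrow(data)), extra_args))
    data
  }
}

toy_data <- data.frame(ID = 1:1000)
step_default <- toy_declare()            # handler default applies: prob = 0.5
step_custom  <- toy_declare(prob = 0.9)  # prob is forwarded to the handler

set.seed(1)
share_default <- mean(step_default(toy_data)$Z)  # close to 0.5
share_custom  <- mean(step_custom(toy_data)$Z)   # close to 0.9
c(default = share_default, custom = share_custom)
```

Note that prob never appears in the signature of toy_declare, just as prob does not appear in the signature of declare_assignment.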
The built-in functions we provide in the DeclareDesign package are quite flexible and handle many major designs, but not all. The framework is built so that you are never constrained by what we provide. At any point, rather than using the default handlers (such as conduct_ra), you can write a function that implements your own procedures. The only discipline that the framework imposes is that you write your procedure as a function that takes data in and sends data back.
Here is an example of how you turn your own functions into design steps.
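library(dplyr)  # provides mutate()

# Custom handler: takes data in, sends data back with an assignment column Z
custom_assignment <- function(data) {
  mutate(data, Z = rbinom(n = nrow(data), size = 1, prob = 0.5))
}
my_assignment_step <- declare_assignment(handler = custom_assignment)
my_assignment_step(voter_file)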
ID | age | sex | party | precinct | Z |
---|---|---|---|---|---|
001 | 66 | M | REP | 9104 | 0 |
002 | 54 | F | DEM | 8029 | 1 |
003 | 18 | M | GRN | 8383 | 1 |
004 | 42 | F | DEM | 2048 | 0 |
005 | 27 | M | REP | 5210 | 1 |
In this section, we walk through how to declare each step of a research design using DeclareDesign. In the next section, we build those steps into a research design, and then describe how to interrogate the design.
The model defines the structure of the world, both its size and background characteristics as well as how interventions in the world determine outcomes. In DeclareDesign, we split the model into two functions: declare_population and declare_potential_outcomes.
The population defines the number of units in the population, any multilevel structure to the data, and its background characteristics. We can define the population in several ways. In some cases, you may start a design with data on the population. When that happens, we do not need to simulate it. We can simply declare the data as our population:
declare_population(data = voter_file)
ID | age | sex | party | precinct |
---|---|---|---|---|
001 | 66 | M | REP | 9104 |
002 | 54 | F | DEM | 8029 |
003 | 18 | M | GRN | 8383 |
004 | 42 | F | DEM | 2048 |
005 | 27 | M | REP | 5210 |
When we do not have complete data on the population, we simulate it. Relying on the data simulation functions from our fabricatr package, declare_population asks about the size and variables of the population. For instance, if we want a function that generates a dataset with 100 units and a random variable U, we write:
declare_population(N = 100, U = rnorm(N))
When we run this population function, we will get a different 100-unit dataset each time, as shown here.
ID | U | ID | U | ID | U | ID | U | ID | U |
---|---|---|---|---|---|---|---|---|---|
001 | -0.32 | 001 | 0.19 | 001 | -1.280 | 001 | -0.38 | 001 | 0.744 |
002 | 1.17 | 002 | 0.69 | 002 | 1.880 | 002 | -0.35 | 002 | 2.445 |
003 | 1.70 | 003 | 0.82 | 003 | 0.597 | 003 | -0.64 | 003 | 0.043 |
004 | 0.93 | 004 | -0.98 | 004 | -1.963 | 004 | 0.40 | 004 | 0.159 |
005 | -1.15 | 005 | -1.29 | 005 | 0.084 | 005 | 0.34 | 005 | 1.686 |
The fabricatr package can simulate many different types of data, including various types of categorical variables or different types of data structures, such as panel or multilevel structures. You can read the fabricatr website vignette to get started simulating data.
As an example of a two-level hierarchical data structure, here is a declaration for 100 households with a random number of individuals within each household. This two-level structure could be declared as:
declare_population(
households = add_level(
N = 100,
individuals_per_hh = sample(1:6, N, replace = TRUE)
),
individuals = add_level(
N = individuals_per_hh,
age = sample(1:100, N, replace = TRUE)
)
)
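For intuition about what a two-level structure looks like as a flat dataset, here is a base R sketch of the same idea. The column names (hh_id, ind_id) and the layout are assumptions made for illustration and need not match what fabricatr returns.

```r
set.seed(42)

# Level 1: 100 households, each with 1 to 6 members
households <- data.frame(
  hh_id = 1:100,
  individuals_per_hh = sample(1:6, 100, replace = TRUE)
)

# Level 2: one row per individual, with household columns repeated
row_index <- rep(seq_len(nrow(households)),
                 times = households$individuals_per_hh)
individuals <- households[row_index, ]
rownames(individuals) <- NULL
individuals$ind_id <- seq_len(nrow(individuals))
individuals$age <- sample(1:100, nrow(individuals), replace = TRUE)

head(individuals)
```

The number of rows in the individual-level data equals the sum of individuals_per_hh, so the total size of the dataset varies from draw to draw.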
As always, you can exit our built-in way of doing things and bring in your own code. This is useful for complex designs, or when you have already written code for your design and you want to use it directly. Here is an example of a custom population declaration:
complex_population_function <- function(data, N_units) {
data.frame(U = rnorm(N_units))
}
declare_population(
handler = complex_population_function, N_units = 100
)
Defining potential outcomes is as easy as a single expression per potential outcome. Potential outcomes may depend on background characteristics, other potential outcomes, or other R functions.
declare_potential_outcomes(
Y_Z_0 = U,
Y_Z_1 = Y_Z_0 + 0.25)
design <-
declare_population(N = 100, U = rnorm(N)) +
declare_potential_outcomes(Y ~ 0.25 * Z + U)
draw_data(design)
ID | U | Y_Z_0 | Y_Z_1 |
---|---|---|---|
001 | 1.53 | 1.53 | 1.782 |
002 | 0.92 | 0.92 | 1.172 |
003 | -1.19 | -1.19 | -0.937 |
004 | -0.20 | -0.20 | 0.046 |
005 | -1.06 | -1.06 | -0.809 |
The declare_potential_outcomes function also includes an alternative interface for defining potential outcomes that uses R’s formula syntax. The formula syntax lets you specify “regression-like” outcome equations. One downside is that it mildly obscures how the eventual potential outcomes columns are named. We build the names of the potential outcomes columns from the outcome name (here Y, on the left-hand side of the formula) and from the assignment_variables argument (here Z).
declare_potential_outcomes(Y ~ 0.25 * Z + U, assignment_variables = Z)
Either way of creating potential outcomes works; one may be easier or harder to code up in a given research design setting.
To define your inquiry, declare your estimand. Estimands are typically summaries of the data produced in declare_population and declare_potential_outcomes. Here we define the average treatment effect as follows:
declare_estimand(PATE = mean(Y_Z_1 - Y_Z_0))
Notice that we defined the PATE (the population average treatment effect), but said nothing special related to the population – it looks like we just defined the average treatment effect. This is because order matters. If we want to define a SATE (the sample average treatment effect), we would have to do so after sampling has occurred. We will see how to do this in a moment.
The data strategy constitutes one or more steps representing interventions the researcher makes in the world from sampling to assignment to measurement.
The sampling step relies on the randomizr package to conduct random sampling. Here we define a procedure for drawing a 50-unit sample from the population:
declare_sampling(n = 50)
When we draw data from our simple design at this point, it will have fewer rows: it will have shrunk from 100 units in the population to a data frame of 50 units representing the sample. The new data frame also includes a variable indicating the probability of being included in the sample. In this case, every unit in the population had an equal inclusion probability of 0.5.
ID | U | Y_Z_0 | Y_Z_1 | S_inclusion_prob |
---|---|---|---|---|
001 | 0.86 | 0.86 | 1.11 | 0.5 |
003 | 0.62 | 0.62 | 0.87 | 0.5 |
005 | 1.02 | 1.02 | 1.27 | 0.5 |
006 | 0.86 | 0.86 | 1.11 | 0.5 |
008 | -0.43 | -0.43 | -0.18 | 0.5 |
Sampling could also be non-random, which could be accomplished by using a custom handler.
The default handler for declare_assignment also relies on the randomizr package for random assignment. Here, we define an assignment procedure that allocates subjects to treatment with probability 0.5. One subtlety is that by default, declare_assignment conducts complete random assignment (exactly \(m\) of \(N\) units assigned to treatment, where \(m = \text{prob} \times N\)).
declare_assignment(prob = 0.5)
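The difference between simple and complete random assignment is easy to see in a few lines of base R. This is a sketch, not randomizr's implementation; in particular, rounding prob * N to the nearest integer is our own simplifying assumption for cases where the product is not a whole number.

```r
set.seed(123)
N <- 100
prob <- 0.5

# Simple (Bernoulli) assignment: the number treated varies across draws
Z_simple <- rbinom(N, size = 1, prob = prob)

# Complete assignment: exactly m = prob * N units are treated
m <- round(prob * N)
Z_complete <- sample(rep(c(1, 0), times = c(m, N - m)))

sum(Z_complete)  # always 50
sum(Z_simple)    # varies from draw to draw
```

Fixing the number of treated units removes one source of run-to-run variation, which is one reason complete random assignment is a common default.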
After treatments are assigned, some potential outcomes are revealed. Treated units reveal their treated potential outcomes and untreated units reveal their untreated potential outcomes. The reveal_outcomes function performs this switching operation.
reveal_outcomes(Y, Z)
Adding these two declarations to the design results in a data frame with an additional indicator Z for the assignment as well as its corresponding probability of assignment. Again, here the assignment probabilities are constant, but in other designs they are not and this is crucial information for the analysis stage. The outcome variable Y is composed of each unit’s potential outcomes depending on its treatment status.
ID | U | Y_Z_0 | Y_Z_1 | S_inclusion_prob | Z | Z_cond_prob | Y |
---|---|---|---|---|---|---|---|
001 | -0.86 | -0.86 | -0.61 | 0.5 | 1 | 0.5 | -0.61 |
003 | 0.92 | 0.92 | 1.17 | 0.5 | 1 | 0.5 | 1.17 |
004 | -0.72 | -0.72 | -0.47 | 0.5 | 0 | 0.5 | -0.72 |
006 | 0.45 | 0.45 | 0.70 | 0.5 | 0 | 0.5 | 0.45 |
008 | 0.15 | 0.15 | 0.40 | 0.5 | 0 | 0.5 | 0.15 |
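The switching operation described above can be written out by hand. The sketch below uses base R and the first four rows of the table as toy values; it illustrates the logic only and is not the code behind reveal_outcomes.

```r
# Toy potential outcomes with a constant effect of 0.25,
# as in the running example
toy <- data.frame(Y_Z_0 = c(-0.86, 0.92, -0.72, 0.45))
toy$Y_Z_1 <- toy$Y_Z_0 + 0.25
toy$Z <- c(1, 1, 0, 0)

# Switching equation: treated units reveal Y_Z_1,
# untreated units reveal Y_Z_0
toy$Y <- toy$Z * toy$Y_Z_1 + (1 - toy$Z) * toy$Y_Z_0
toy
```

Only one potential outcome per unit ends up in Y; the other remains unobserved in a real study, even though both are visible in simulated data.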
Measurement is a critical part of every research design; sometimes it is beneficial to explicitly declare the measurement procedures of the design, rather than allowing them to be implicit in the ways variables are created in declare_population and declare_potential_outcomes. For example, we might imagine that the normally distributed outcome variable Y is a latent outcome that will be translated into a binary outcome when measured by the researcher:
declare_measurement(Y_binary = rbinom(N, 1, prob = pnorm(Y)))
ID | U | Y_Z_0 | Y_Z_1 | S_inclusion_prob | Z | Z_cond_prob | Y | Y_binary |
---|---|---|---|---|---|---|---|---|
001 | -1.1 | -1.1 | -0.86 | 0.5 | 1 | 0.5 | -0.86 | 0 |
006 | 3.1 | 3.1 | 3.38 | 0.5 | 1 | 0.5 | 3.38 | 1 |
008 | 1.9 | 1.9 | 2.11 | 0.5 | 0 | 0.5 | 1.86 | 1 |
012 | -1.6 | -1.6 | -1.32 | 0.5 | 1 | 0.5 | -1.32 | 0 |
013 | -0.3 | -0.3 | -0.05 | 0.5 | 0 | 0.5 | -0.30 | 0 |
Through our model and data strategy steps, we have simulated a dataset with two key inputs to the answer strategy: an assignment variable and an outcome. In other answer strategies, pretreatment characteristics from the model might also be relevant. The data look like this:
ID | U | Y_Z_0 | Y_Z_1 | S_inclusion_prob | Z | Z_cond_prob | Y |
---|---|---|---|---|---|---|---|
001 | 1.680 | 1.680 | 1.93 | 0.5 | 1 | 0.5 | 1.930 |
002 | 0.057 | 0.057 | 0.31 | 0.5 | 0 | 0.5 | 0.057 |
003 | -1.280 | -1.280 | -1.03 | 0.5 | 1 | 0.5 | -1.030 |
005 | 0.613 | 0.613 | 0.86 | 0.5 | 1 | 0.5 | 0.863 |
006 | -1.241 | -1.241 | -0.99 | 0.5 | 1 | 0.5 | -0.991 |
Our estimator is the difference-in-means estimator, which compares outcomes between the group that was assigned to treatment and the group that was assigned to control. The difference_in_means() function in the estimatr package calculates the estimate, the standard error, the \(p\)-value, and the confidence interval for you:
difference_in_means(Y ~ Z, data = simple_design_data)
term | estimate | std.error | statistic | p.value | conf.low | conf.high | df | outcome |
---|---|---|---|---|---|---|---|---|
Z | 0.13 | 0.27 | 0.5 | 0.62 | -0.4 | 0.67 | 48 | Y |
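To see what a difference-in-means calculation involves in the simplest two-arm case, here is a base R sketch on made-up data. The unequal-variance standard error shown here is an assumption chosen for illustration; estimatr's function covers more designs and we do not claim this reproduces its output in general.

```r
set.seed(7)

# Made-up data: 50 units, half treated, true effect 0.25
toy <- data.frame(Z = rep(c(0, 1), each = 25))
toy$Y <- 0.25 * toy$Z + rnorm(50)

y1 <- toy$Y[toy$Z == 1]
y0 <- toy$Y[toy$Z == 0]

# Point estimate and unequal-variance standard error
estimate  <- mean(y1) - mean(y0)
std_error <- sqrt(var(y1) / length(y1) + var(y0) / length(y0))

# Base R's Welch t-test is built from the same two quantities
welch <- t.test(y1, y0, var.equal = FALSE)
c(estimate = estimate, std.error = std_error, p.value = welch$p.value)
```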
Now, in order to declare our estimator, we can send the name of a modeling function to declare_estimator. R has many modeling functions that work with declare_estimator, including lm, glm, and the ictreg function from the list package, among hundreds of others. Throughout the book, we will be using many estimators from estimatr because they are fast and calculate robust standard errors easily. Estimators are (almost always) associated with estimands.[1] Here, we are targeting the population average treatment effect with the difference-in-means estimator.
declare_estimator(
Y ~ Z, model = difference_in_means, estimand = "PATE"
)
model_summary and label_estimator
Many answer strategies use modeling functions like lm, lm_robust, or glm. The output from these modeling functions is typically a very complicated list object that contains a large amount of information about the modeling process. We typically only want a few summary pieces of information out of these model objects, like the coefficient estimates, standard errors, and confidence intervals. We use model summary functions passed to the model_summary argument of declare_estimator to do so. Model summary functions take models as inputs and return data frames as outputs.
The default model summary function is tidy:
declare_estimator(
Y ~ Z, model = lm_robust, model_summary = tidy
)
You could also use glance to get model fit statistics like \(R^2\).
declare_estimator(
Y ~ Z, model = lm_robust, model_summary = glance
)
Occasionally, you’ll need to write your own model summary function that takes a model fit object and returns a data.frame with the information you need. For example, in order to calculate average marginal effects estimates from a logistic regression, we run a glm model through the margins function from the margins package; we then need to “tidy” the output from margins using the tidy function. Here we’re also asking for a 95% confidence interval.
tidy_margins <- function(x) {
tidy(margins(x, data = x$data), conf.int = TRUE)
}
declare_estimator(
Y ~ Z + X,
model = glm,
family = binomial("logit"),
model_summary = tidy_margins,
term = "Z"
)
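If you want to see the “model in, data frame out” contract without any extra packages, the following base R sketch writes a summary function for lm fits by hand. The name toy_summary and the choice of columns are ours; the column names merely imitate the style of the tables shown earlier.

```r
# Hypothetical summary function: model in, data frame out
toy_summary <- function(fit) {
  coefs <- coef(summary(fit))
  ci <- confint(fit)  # 95% confidence intervals by default
  data.frame(
    term      = rownames(coefs),
    estimate  = coefs[, "Estimate"],
    std.error = coefs[, "Std. Error"],
    p.value   = coefs[, "Pr(>|t|)"],
    conf.low  = ci[, 1],
    conf.high = ci[, 2],
    row.names = NULL
  )
}

# Made-up data for a quick check
set.seed(11)
toy <- data.frame(Z = rep(c(0, 1), each = 25))
toy$Y <- 0.25 * toy$Z + rnorm(50)

toy_out <- toy_summary(lm(Y ~ Z, data = toy))
toy_out
```

Any function with this shape could in principle serve as a model summary function, since all that is required is a data frame with one row per term.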
If your answer strategy does not use a model function, you’ll need to provide a function that takes data as an input and returns a data.frame with the estimate. Set the handler to be label_estimator(your_function_name) to take advantage of DeclareDesign’s mechanism for matching estimands to estimators. When you use label_estimator, you can provide an estimand, and DeclareDesign will keep track of which estimates match each estimand. For example, to calculate the mean of an outcome, you could write your own estimator in this way:
my_estimator <- function(data){
data.frame(estimate = mean(data$Y))
}
declare_estimator(handler = label_estimator(my_estimator), label = "mean", estimand = "Y_bar")
The main declare_* functions cover many elements of research designs, but not all. You can also include operations we haven’t explicitly provided as steps in your design, using declare_step. Here, you must define a specific handler. Some handlers that may be useful are the dplyr verbs such as mutate and summarize, and the fabricate function from our fabricatr package.
To add a variable using fabricate:
declare_step(handler = fabricate, added_variable = rnorm(N))
If you have district-month data you may want to analyze at the district level, collapsing across months:
collapse_data <- function(data, collapse_by) {
data %>%
group_by({{ collapse_by }}) %>%
summarize_all(mean, na.rm = TRUE)
}
declare_step(handler = collapse_data, collapse_by = district)
# Note: The `{{ }}` syntax is handy for writing functions in `dplyr`
# where you want to be able to reuse the function with different variable
# names. Here, the `collapse_data` function will `group_by` the
# variable you send to the argument `collapse_by`, which in our
# declaration we set to `district`. The pipeline within the function
# then calculates the mean in each district.
In the last section, we defined a set of individual research steps. We draw one version of them together here:
population <-
declare_population(N = 100, U = rnorm(N))
potential_outcomes <-
declare_potential_outcomes(Y ~ 0.25 * Z + U)
estimand <-
declare_estimand(PATE = mean(Y_Z_1 - Y_Z_0))
sampling <-
declare_sampling(n = 50)
assignment <-
declare_assignment(prob = 0.5)
reveal <-
reveal_outcomes(outcome_variables = Y, assignment_variables = Z)
estimator <-
declare_estimator(
Y ~ Z, model = difference_in_means, estimand = "PATE"
)
To construct a research design object that we can operate on (diagnose it, redesign it, draw data from it, etc.), we add the steps together with the + operator, just as %>% makes dplyr pipelines or + creates ggplot objects.
design <-
population + potential_outcomes + estimand +
sampling + assignment + reveal + estimator
We will usually declare designs more compactly, concatenating steps directly with +:
design <-
declare_population(N = 100, U = rnorm(N)) +
declare_potential_outcomes(Y ~ 0.25 * Z + U) +
declare_estimand(PATE = mean(Y_Z_1 - Y_Z_0)) +
declare_sampling(n = 50) +
declare_assignment(prob = 0.5) +
reveal_outcomes(outcome_variables = Y, assignment_variables = Z) +
declare_estimator(
Y ~ Z, model = difference_in_means, estimand = "PATE"
)
When defining a design, the order in which steps are included in the design via the + operator matters. Think of the order of your design as the temporal order in which steps take place. Here, since the estimand comes before sampling and assignment, it is a population estimand, the population average treatment effect.
population + potential_outcomes + estimand +
sampling + assignment + reveal + estimator
We could define our estimand as a sample average treatment effect by putting estimand after sampling:
population + potential_outcomes + sampling +
estimand + assignment + reveal + estimator
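Why the position of the estimand matters can be shown with a short base R sketch. In the running example the effect is a constant 0.25, so the population and sample versions coincide; to make the distinction visible, the sketch below assumes heterogeneous effects, which is our own addition.

```r
set.seed(5)

# Population of 100 with heterogeneous effects (made-up numbers)
pop <- data.frame(U = rnorm(100))
pop$Y_Z_0 <- pop$U
pop$Y_Z_1 <- pop$U + rnorm(100, mean = 0.25, sd = 0.5)

# Estimand computed before sampling: uses all 100 units (PATE)
PATE <- mean(pop$Y_Z_1 - pop$Y_Z_0)

# Estimand computed after sampling: uses only the 50 sampled units (SATE)
smp <- pop[sample(nrow(pop), 50), ]
SATE <- mean(smp$Y_Z_1 - smp$Y_Z_0)

c(PATE = PATE, SATE = SATE)
```

The same expression, mean(Y_Z_1 - Y_Z_0), yields a different quantity depending on which rows are present when it is evaluated.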
Diagnosing a research design, that is, learning about its properties, requires first simulating the design over and over. We need to simulate the data generating process, then calculate the estimands, then calculate the resulting estimates.
With the design defined as an object, we can learn about what kind of data it generates, the values of its estimand and estimates, and other features. For example, to draw simulated data based on the design, we use draw_data:
draw_data(design)
ID | U | Y_Z_0 | Y_Z_1 | S_inclusion_prob | Z | Z_cond_prob | Y |
---|---|---|---|---|---|---|---|
002 | 0.076 | 0.076 | 0.33 | 0.5 | 0 | 0.5 | 0.076 |
005 | 0.595 | 0.595 | 0.84 | 0.5 | 0 | 0.5 | 0.595 |
006 | -0.431 | -0.431 | -0.18 | 0.5 | 0 | 0.5 | -0.431 |
007 | -1.133 | -1.133 | -0.88 | 0.5 | 0 | 0.5 | -1.133 |
008 | -2.237 | -2.237 | -1.99 | 0.5 | 0 | 0.5 | -2.237 |
draw_data runs all of the “data steps” in a design, which come both from the model (population and potential outcomes) and from the data strategy (sampling, assignment, and measurement).
To simulate the estimands from a single run of the design, we use draw_estimands. This runs two operations at once: it draws the data, and calculates the estimands at the point defined by the design. For example, in our design, the estimand comes just after the potential outcomes. In this design, draw_estimands will run the first two steps and then calculate the estimands from the estimand function we declared:
draw_estimands(design)
estimand_label | estimand |
---|---|
PATE | 0.25 |
Similarly, we can draw the estimates from a single run with draw_estimates, which simulates data and, at the appropriate moment, calculates estimates.
draw_estimates(design)
term | estimate | std.error | statistic | p.value | conf.low | conf.high | df | outcome | estimand_label |
---|---|---|---|---|---|---|---|---|---|
Z | 0.63 | 0.32 | 2 | 0.054 | -0.011 | 1.3 | 44 | Y | PATE |
To simulate designs, we use the simulate_design function to draw data, calculate estimands and estimates, and then repeat the process over and over.
simulation_df <- simulate_design(design)
sim_ID | estimand | estimate | std.error | statistic | p.value | conf.low | conf.high | df |
---|---|---|---|---|---|---|---|---|
1 | 0.25 | 0.10 | 0.31 | 0.33 | 0.74 | -0.52 | 0.73 | 43 |
2 | 0.25 | 0.30 | 0.30 | 0.99 | 0.33 | -0.31 | 0.91 | 47 |
3 | 0.25 | 0.24 | 0.32 | 0.73 | 0.47 | -0.42 | 0.89 | 40 |
4 | 0.25 | 0.44 | 0.27 | 1.61 | 0.12 | -0.11 | 0.98 | 41 |
5 | 0.25 | -0.13 | 0.29 | -0.45 | 0.66 | -0.71 | 0.45 | 45 |
Using the simulations data frame, we can calculate diagnosands like bias, root mean-squared-error, and power for each estimator-estimand pair. In DeclareDesign, we do this in two steps. First, declare your diagnosands, which are functions that summarize simulations data. The software includes many pre-coded diagnosands, though you can write your own like this:
study_diagnosands <- declare_diagnosands(
bias = mean(estimate - estimand),
rmse = sqrt(mean((estimate - estimand)^2)),
power = mean(p.value <= 0.05)
)
Second, apply your diagnosand declaration to the simulations data frame with the diagnose_design function:
diagnose_design(simulation_df, diagnosands = study_diagnosands)
Bias | RMSE | Power |
---|---|---|
0.00 | 0.28 | 0.14 |
(0.01) | (0.01) | (0.01) |
We can also do this in a single step by sending diagnose_design a design object. The function will first run the simulations for you, then calculate the diagnosands from the simulation data frame that results.
diagnose_design(design, diagnosands = study_diagnosands)
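The simulate-then-summarize logic can also be written out by hand, which may help demystify what a diagnosis is. The base R sketch below imitates the running design under simplifying assumptions of our own (a Welch t-test as the estimator, 500 simulations); run_once is a hypothetical helper, and the numbers it produces are not meant to match the tables above.

```r
set.seed(2024)

# One run of a stripped-down version of the running design
run_once <- function(N = 100, n = 50, effect = 0.25) {
  U <- rnorm(N)
  Y_Z_0 <- U
  Y_Z_1 <- U + effect
  estimand <- mean(Y_Z_1 - Y_Z_0)          # PATE, computed before sampling
  keep <- sample(N, n)                     # sampling
  Z <- sample(rep(c(0, 1), each = n / 2))  # complete random assignment
  Y <- ifelse(Z == 1, Y_Z_1[keep], Y_Z_0[keep])  # reveal outcomes
  test <- t.test(Y[Z == 1], Y[Z == 0])     # Welch test as the estimator
  data.frame(
    estimand = estimand,
    estimate = mean(Y[Z == 1]) - mean(Y[Z == 0]),
    p.value  = test$p.value
  )
}

# Repeat the run many times and stack the results
sims <- do.call(rbind, lapply(1:500, function(i) run_once()))

# Diagnosands are summaries of the simulations data frame
diagnosis <- with(sims, c(
  bias  = mean(estimate - estimand),
  rmse  = sqrt(mean((estimate - estimand)^2)),
  power = mean(p.value <= 0.05)
))
diagnosis
```

The three expressions inside with() are the same ones passed to declare_diagnosands above, applied here to a hand-built simulations data frame.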
After the declaration phase, you will often want to learn how the diagnosands change as design features change. We can do this using redesign.
An alternative way to do this is to write a “designer.” A designer is a function that makes designs based on a few design parameters. Designers help researchers flexibly explore design variations. Here’s a simple designer based on our running example:
simple_designer <- function(sample_size, effect_size) {
declare_population(N = sample_size, U = rnorm(N)) +
declare_potential_outcomes(Y ~ effect_size * Z + U) +
declare_estimand(PATE = mean(Y_Z_1 - Y_Z_0)) +
declare_sampling(n = 50) +
declare_assignment(prob = 0.5) +
reveal_outcomes(outcome_variables = Y, assignment_variables = Z) +
declare_estimator(
Y ~ Z, model = difference_in_means, estimand = "PATE"
)
}
To create a single design, based on our original parameters of a 100-unit sample size and a treatment effect of 0.25, we can run:
design <- simple_designer(sample_size = 100, effect_size = 0.25)
Now to simulate multiple designs, we can use the DeclareDesign function expand_design. Here we examine our simple design under several possible sample sizes, which we might want to do to conduct a minimum power analysis. We hold the effect size constant.
designs <- expand_design(
simple_designer,
sample_size = c(100, 500, 1000),
effect_size = 0.25
)
Our simulation and diagnosis tools can take a list of designs and simulate all of them at once, creating a column called design_label to keep track. For example:
diagnose_design(designs)
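Conceptually, expanding a designer means calling it once for every combination of parameter values. The base R sketch below illustrates that idea with a hypothetical toy_designer that returns a plain list; it is not how expand_design is implemented, and the design_ labels are our own naming choice.

```r
# Hypothetical designer: returns a plain list rather than a
# DeclareDesign object
toy_designer <- function(sample_size, effect_size) {
  list(sample_size = sample_size, effect_size = effect_size)
}

# Build every combination of the supplied parameter values
grid <- expand.grid(
  sample_size = c(100, 500, 1000),
  effect_size = 0.25
)

# Call the designer once per row of the grid
toy_designs <- Map(toy_designer, grid$sample_size, grid$effect_size)
names(toy_designs) <- paste0("design_", seq_along(toy_designs))

length(toy_designs)  # 3
```

Supplying several values for effect_size as well would multiply the number of designs, since every combination gets its own entry.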
Alternatively, we can compare a pair of designs directly with the compare_designs function. This function is most useful for comparing the differences between a planned design and an implemented design.
compare_designs(planned_design, implemented_design)
Similarly, we can compare two designs on the basis of their diagnoses:
compare_diagnoses(planned_design, implemented_design)
In our DesignLibrary package, we have created a set of common designs as designers (functions that create designs from just a few parameters), so you can get started quickly.
library(DesignLibrary)
b_c_design <- block_cluster_two_arm_designer(N = 1000, N_blocks = 10)
[1] Sometimes, you may be interested in properties of an estimator that do not depend on an estimand, such as its power.