Overview

The DeclareDesign package has only about 10 functions.

The core functions represent key steps in a research design; some may be used more than once, for example if you have two random assignment steps or multiple estimators.

1. declare_population
2. declare_potential_outcomes
3. declare_sampling
4. declare_assignment
5. declare_estimand
6. declare_estimator

The function declare_design can take any of these six functions, plus any R function that takes data and returns data.
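Because any data-in, data-out function qualifies as a step, you can slot your own transformations into a design. A hedged sketch: add_log_income is a hypothetical helper of our own, and the my_* declarations are the ones built later in this guide.

```r
# Hypothetical custom step: any function that takes data and returns data.
add_log_income <- function(data) {
  data$log_income <- log(abs(data$income) + 1)  # illustrative transform
  data
}

# Sketch only: my_population, my_potential_outcomes, my_assignment, and
# reveal_outcomes are declared as in the sections below.
my_design <- declare_design(my_population,
                            my_potential_outcomes,
                            add_log_income,
                            my_assignment,
                            reveal_outcomes)
```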

The post-design-declaration commands are:

1. modify_design (takes a design and a set of modifications, returns a design)
2. diagnose_design (takes a design, returns simulations and diagnosis)
3. compare_designs (takes a list of designs and diagnoses them all)
4. draw_data (takes a design and returns a single draw of the data)
5. get_estimates (takes a design and returns a single simulation of estimates)
6. get_estimands (takes a design and returns a single simulation of estimands)

There are a few other features:

1. A template is a function that takes parameters (e.g., N) and returns a design. quick_design is a function of a template and parameters that returns a design.
2. We can easily declare custom diagnosands with declare_diagnosands; diagnosands are statistics like power and bias, and the package defaults to the usual suspects.
3. reveal_outcomes implements a general switching equation, which allows you to reveal outcomes from potential outcomes and a treatment assignment.
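The switching equation can be sketched in a few lines of base R (illustrative only, not the package's internal code): the observed outcome is the potential outcome matching each unit's assigned condition.

```r
# Toy data: two potential outcomes per unit plus an assignment Z.
pos <- data.frame(Y_Z_0 = c(0.1, 0.2, 0.3),
                  Y_Z_1 = c(0.6, 0.7, 0.8),
                  Z     = c(0, 1, 0))

# The switching equation: Y = Z * Y_Z_1 + (1 - Z) * Y_Z_0.
pos$Y <- ifelse(pos$Z == 1, pos$Y_Z_1, pos$Y_Z_0)
pos$Y
## [1] 0.1 0.7 0.3
```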

The Six Steps

Each of the declare_* functions returns a function.

In addition to using them to declare the design, you can just run them yourself, i.e. your_potential_outcomes(your_population()).

Population

You can easily define a single-level population (i.e. just individuals). You can use any R function to create a variable, with N being the only privileged name. The output is a function, so calling my_population() makes a data.frame.

my_population <-
  declare_population(N = 1000,
                     income = rnorm(N),
                     age = sample(18:95, N, replace = TRUE))

pop <- my_population()
head(pop)
##     ID     income age
## 1 0001  1.3709584  95
## 2 0002 -0.5646982  52
## 3 0003  0.3631284  72
## 4 0004  0.6328626  87
## 5 0005  0.4042683  83
## 6 0006 -0.1061245  75

Multi-level datasets are also easy to create. You set the N of each level in a call to level(). The level function is neat: if the previous level has data, it merges so that there are N entries for each of the units at the higher level. We can handle a non-fixed number of units at each level too. In the individuals line below, we draw a random number of individuals for each village.

my_population_nested <- declare_population(
  districts = level(N = 25, urban = sample(0:1, N, replace = TRUE)),
  villages = level(N = 10, altitude = rnorm(N)),
  individuals = level(N = sample(100:200, size = 250, replace = TRUE),
                      income = rnorm(N),
                      age = sample(18:95, N, replace = TRUE)))

This says that there are 25 districts, 10 villages per district, and then between 100 and 200 individuals per village. It creates districts first, then merges in villages, then merges in individuals, using the ID variables created at the level above.

Within those levels, you can easily include existing data (and also add variables to them if you wish):

region_data <- data.frame(capital = c(1, 0, 0, 0, 0))
pop_level_data <- declare_population(
  regions = level(N = 2, gdp = runif(N)),
  cities = level(N = 2, subways = rnorm(N, mean = 5)))

head(pop_level_data())
##   regions       gdp cities  subways
## 1       1 0.2737925      1 4.864184
## 2       1 0.2737925      2 4.012727
## 3       2 0.9441967      3 5.831925
## 4       2 0.9441967      4 4.204940

Similarly, you can easily declare your existing data as the population:

country_data <- data.frame(
  cow_code = c(504, 15, 100, 90),
  polity_iv = c(-9, 7, -1, 3))

pop_data <- declare_population(data = country_data)

head(pop_data())
##   cow_code polity_iv
## 1      504        -9
## 2       15         7
## 3      100        -1
## 4       90         3

If you don't want your data to be fixed, you can resample from it:

pop_data_bootstrap <- declare_population(
  data = country_data,
  population_function = fabricatr::resample_data)

head(pop_data_bootstrap())
##   cow_code polity_iv
## 1      100        -1
## 2       90         3
## 3       90         3
## 4       90         3

Note that fabricatr is one of the helper packages that come along with DeclareDesign. fabricatr helps you simulate population data or resample from existing data.
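The core idea of resampling can be sketched in base R (fabricatr::resample_data additionally knows how to respect multi-level structure):

```r
# A plain bootstrap: sample row indices with replacement from the fixed data.
boot_rows <- sample(nrow(country_data), replace = TRUE)
boot_data <- country_data[boot_rows, ]
```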

Potential outcomes

A declare_potential_outcomes declaration returns a function. That function takes data and returns data with potential outcomes columns appended. There are two ways of declaring potential outcomes: as a formula or as separate variables (as in declare_population).

As a formula

my_potential_outcomes <- declare_potential_outcomes(
  formula = Y ~ .25 * Z + .01 * age * Z)
pop_pos <- my_potential_outcomes(pop)
head(pop_pos)
##     ID     income age Y_Z_0 Y_Z_1
## 1 0001  1.3709584  95     0  1.20
## 2 0002 -0.5646982  52     0  0.77
## 3 0003  0.3631284  72     0  0.97
## 4 0004  0.6328626  87     0  1.12
## 5 0005  0.4042683  83     0  1.08
## 6 0006 -0.1061245  75     0  1.00

This has defaults set for condition_names (0, 1) and the assignment variable name (Z). You can set the “domain” of the potential outcomes function with condition_names.

my_potential_outcomes <- declare_potential_outcomes(
  formula = Y ~ .25 * Z + .01 * age * Z,
  condition_names = 1:4)
head(my_potential_outcomes(pop))
##     ID     income age Y_Z_1 Y_Z_2 Y_Z_3 Y_Z_4
## 1 0001  1.3709584  95  1.20  2.40  3.60  4.80
## 2 0002 -0.5646982  52  0.77  1.54  2.31  3.08
## 3 0003  0.3631284  72  0.97  1.94  2.91  3.88
## 4 0004  0.6328626  87  1.12  2.24  3.36  4.48
## 5 0005  0.4042683  83  1.08  2.16  3.24  4.32
## 6 0006 -0.1061245  75  1.00  2.00  3.00  4.00

As separate variables

The second way, which some may prefer, is to define each potential outcome yourself. This bakes in the condition names and assignment variable.

my_potential_outcomes <-
  declare_potential_outcomes(
    Y_Z_0 = .05,
    Y_Z_1 = .30 + .01 * age)

head(my_potential_outcomes(pop))
##     ID     income age Y_Z_0 Y_Z_1
## 1 0001  1.3709584  95  0.05  1.25
## 2 0002 -0.5646982  52  0.05  0.82
## 3 0003  0.3631284  72  0.05  1.02
## 4 0004  0.6328626  87  0.05  1.17
## 5 0005  0.4042683  83  0.05  1.13
## 6 0006 -0.1061245  75  0.05  1.05

Sampling

A sampling function takes data and returns a sampled subset of the data. By default, declare_sampling understands arguments passed to ... as randomizr arguments, but it’s easy to supply your own function instead.

my_sampling <- declare_sampling(n = 250)
smp <- my_sampling(pop_pos)
nrow(smp)
## [1] 250

Assignment

Assignment declarations return functions that take data and return data. If you use the randomizr defaults, the function appends to the dataset an assignment draw and the observed condition probabilities.

my_assignment <- declare_assignment(m = 25)
smp <- my_assignment(smp)
Custom Potential Outcomes

You can also supply your own potential outcomes function, which takes data and returns data with the potential outcome columns appended. (Here u is a variable created in an earlier custom population declaration, pop_custom.)

my_potential_outcomes_function <- function(data) {
  data$Y_Z_0 <- with(data, u)
  data$Y_Z_1 <- with(data, 0.25 + u)
  data
}

my_potential_outcomes_custom <- declare_potential_outcomes(
  potential_outcomes_function = my_potential_outcomes_function)

pop_pos_custom <- my_potential_outcomes_custom(pop_custom)
head(pop_pos_custom[, c("u", "Y_Z_0", "Y_Z_1")])
##            u      Y_Z_0      Y_Z_1
## 1  1.0359587  1.0359587  1.2859587
## 2  0.2363770  0.2363770  0.4863770
## 3  1.7074779  1.7074779  1.9574779
## 4  0.4080789  0.4080789  0.6580789
## 5  1.2078574  1.2078574  1.4578574
## 6 -0.6103413 -0.6103413 -0.3603413

Custom Sampling

Again, you can easily use a totally separate custom sampling function. In this case, you provide a function that takes population data and returns sampled data, i.e. you draw the sample and then subset. You can also include inclusion weights in the function if you wish (as the default function does).

my_sampling_function <- function(data) {
  data$S <- rbinom(n = nrow(data), size = 1, prob = 0.1)
  data[data$S == 1, ]
}

my_sampling_custom <- declare_sampling(
  sampling_function = my_sampling_function)

smp_custom <- my_sampling_custom(pop_pos)
nrow(smp_custom)
## [1] 102

Custom Assignment

my_assignment_function <- function(data) {
  data$Z <- rbinom(n = nrow(data), size = 1, prob = 0.5)
  data
}

my_assignment_custom <- declare_assignment(
  assignment_function = my_assignment_function)

table(my_assignment_custom(pop_pos)$Z)
##
##   0   1
## 501 499

Custom Estimand

my_estimand_function <- function(data) {
  with(data, median(Y_Z_1 - Y_Z_0))
}

my_estimand_custom <- declare_estimand(
  estimand_function = my_estimand_function, label = medianTE)

my_estimand_custom(pop_pos)
##   estimand_label estimand
## 1       medianTE     0.82

Custom Estimator

my_estimator_function <- function(formula, data) {
  data.frame(est = with(data, mean(Y)))
}

my_estimator_custom <- declare_estimator(
  Y ~ Z,
  estimator_function = my_estimator_function,
  estimand = my_estimand)

my_estimator_custom(smp)
##   estimator_label     est estimand_label
## 1    my_estimator 0.08868            ATE

Features

Quick design

You can also write a design maker function that declares a design based on a set of parameters, like N or the number of clusters, and use the function quick_design to make designs from just those parameters.

m_arm_trial <- function(numb) {
  my_population <- declare_population(
    N = numb,
    income = rnorm(N),
    age = sample(18:95, N, replace = TRUE))
  my_potential_outcomes <- declare_potential_outcomes(
    formula = Y ~ .25 * Z + .01 * age * Z)
  my_sampling <- declare_sampling(n = 250)
  my_assignment <- declare_assignment(m = 25)
  my_estimand <- declare_estimand(ATE = mean(Y_Z_1 - Y_Z_0))
  my_estimator_dim <- declare_estimator(Y ~ Z, estimand = my_estimand)
  my_design <- declare_design(my_population,
                              my_potential_outcomes,
                              my_estimand,
                              my_sampling,
                              my_assignment,
                              reveal_outcomes,
                              my_estimator_dim)
  return(my_design)
}

my_1000_design <- quick_design(template = m_arm_trial, numb = 1000)
head(draw_data(my_1000_design))
##      ID     income age Y_Z_0 Y_Z_1 S_inclusion_prob Z Z_cond_prob Y
## 6  0006  2.4696180  59     0  0.84             0.25 0         0.9 0
## 9  0009  0.1271827  54     0  0.79             0.25 0         0.9 0
## 13 0013  0.5023062  20     0  0.45             0.25 0         0.9 0
## 16 0016 -0.2910818  63     0  0.88             0.25 0         0.9 0
## 22 0022  0.5427438  49     0  0.74             0.25 0         0.9 0
## 24 0024 -0.5541436  19     0  0.44             0.25 0         0.9 0

Continuous potential outcomes

my_potential_outcomes_continuous <- declare_potential_outcomes(
  formula = Y ~ .25 * Z + .01 * age * Z,
  condition_names = seq(0, 1, by = .1))

continuous_treatment_function <- function(data) {
  data$Z <- sample(seq(0, 1, by = .1), size = nrow(data), replace = TRUE)
  data
}

my_assignment_continuous <- declare_assignment(
  assignment_function = continuous_treatment_function)

my_design <- declare_design(my_population(),
                            my_potential_outcomes_continuous,
                            my_assignment_continuous,
                            reveal_outcomes)

head(draw_data(my_design))
##     ID      income age Y_Z_0 Y_Z_0.1 Y_Z_0.2 Y_Z_0.3 Y_Z_0.4 Y_Z_0.5
## 1 0001 -0.04428595  38     0   0.063   0.126   0.189   0.252   0.315
## 2 0002  0.97052667  18     0   0.043   0.086   0.129   0.172   0.215
## 3 0003 -0.13801701  76     0   0.101   0.202   0.303   0.404   0.505
## 4 0004 -0.35100480  34     0   0.059   0.118   0.177   0.236   0.295
## 5 0005 -1.73853446  36     0   0.061   0.122   0.183   0.244   0.305
## 6 0006 -0.59459264  69     0   0.094   0.188   0.282   0.376   0.470
##   Y_Z_0.6 Y_Z_0.7 Y_Z_0.8 Y_Z_0.9 Y_Z_1   Z     Y
## 1   0.378   0.441   0.504   0.567  0.63 0.8 0.504
## 2   0.258   0.301   0.344   0.387  0.43 0.9 0.387
## 3   0.606   0.707   0.808   0.909  1.01 0.7 0.707
## 4   0.354   0.413   0.472   0.531  0.59 0.3 0.177
## 5   0.366   0.427   0.488   0.549  0.61 0.5 0.305
## 6   0.564   0.658   0.752   0.846  0.94 1.0 0.940

Attrition

Just another potential outcome!

my_potential_outcomes_attrition <- declare_potential_outcomes(
  formula = R ~ rbinom(n = N, size = 1, prob = pnorm(Y_Z_0)))

my_design <- declare_design(my_population(),
                            my_potential_outcomes,
                            my_potential_outcomes_attrition,
                            my_assignment,
                            reveal_outcomes(outcome_variable_name = "R"),
                            reveal_outcomes(attrition_variable_name = "R"))

head(draw_data(my_design)[, c("ID", "Y_Z_0", "Y_Z_1", "R_Z_0", "R_Z_1", "Z", "R", "Y")])
##     ID Y_Z_0 Y_Z_1 R_Z_0 R_Z_1 Z R    Y
## 1 0001  0.05  1.04     1     0 0 1 0.05
## 2 0002  0.05  0.87     1     0 0 1 0.05
## 3 0003  0.05  1.14     0     1 0 0   NA
## 4 0004  0.05  1.22     1     0 1 0   NA
## 5 0005  0.05  0.94     1     1 0 1 0.05
## 6 0006  0.05  1.23     1     0 0 1 0.05

Stochastic population sizes

The population (or any level of the population) can have stochastic population sizes. (In fact, N can be a number, a fixed vector of numbers, or an expression that returns a stochastic number or vector of numbers.)

stochastic_population <- declare_population(
  N = sample(500:1000, 1),
  income = rnorm(N),
  age = sample(18:95, N, replace = TRUE))

c(nrow(stochastic_population()),
nrow(stochastic_population()),
nrow(stochastic_population()))
## [1] 622 832 912

Three companion packages

All built-in default functions live in three standalone packages that can be used to support DeclareDesign or on their own. This keeps the six key functions in declare_design simple and the number of dependencies minimal.

For now, make sure you have the development versions of each package from GitHub, not the version of randomizr on CRAN.

The three packages are:

fabricatr

fabricatr includes the data creation features described above. Its main function, fabricate, creates simulated data as in the declare_population examples above. It can create single-level data or multi-level data using level() within fabricate. As it turns out, it is also the backbone of the default declare_potential_outcomes function that creates each potential outcome from an expression, as described above (i.e. where you define Y_Z_1 = 0.5 + Z). The second main function is bootstrap_data, which resamples your data in a way that respects hierarchy. We expect this package to be built up further, including with proportional outcomes functions.
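A minimal sketch of using fabricatr on its own, assuming the fabricate interface mirrors the declare_population calls above (argument names may differ in the development version):

```r
library(fabricatr)

# Single-level data: N rows, with variables built from expressions.
flat <- fabricate(N = 100,
                  income = rnorm(N),
                  age = sample(18:95, N, replace = TRUE))

# Multi-level data: level() nests individuals within districts.
nested <- fabricate(
  districts = level(N = 5, urban = sample(0:1, N, replace = TRUE)),
  individuals = level(N = 10, income = rnorm(N)))
```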

randomizr

randomizr works by default as described above in declare_assignment() (in fact it appears to the user much as the old version did) but is now an entirely separate package. A set of sampling functions that mimic the assignment functions is now also built in to randomizr, including simple, clustered, stratified, and clustered-and-stratified sampling. The next step is to rewrite some of these in C++ to see if we can speed a few of them up; based on our understanding of Rcpp, the gains here may be minor.
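A short sketch of calling randomizr directly, using complete_ra and its mirrored sampling function complete_rs:

```r
library(randomizr)

# Complete random assignment: exactly 25 of 100 units treated.
Z <- complete_ra(N = 100, m = 25)
table(Z)

# The mirrored sampling function: exactly 40 of 100 units sampled.
S <- complete_rs(N = 100, n = 40)
sum(S)  # 40
```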

estimatr

The core of estimatr is the C++ version of lm with robust standard errors described above, along with the difference-in-means and blocked difference-in-means functions, but it can be built up with further fast estimators.
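A sketch of the intended usage, assuming a difference_in_means estimator with a formula interface (the development version's exact signature may differ); smp is the sampled, assigned data from above:

```r
library(estimatr)

# Hypothetical call: estimate the difference in mean Y between
# the Z = 1 and Z = 0 conditions, with robust standard errors.
dim_fit <- difference_in_means(Y ~ Z, data = smp)
```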