Overview

The DeclareDesign package has only about 10 functions.

The core functions represent key steps in a research design; some may be used more than once, for example if you have two random assignment steps or multiple estimators.

  1. declare_population
  2. declare_potential_outcomes
  3. declare_sampling
  4. declare_assignment
  5. declare_estimand
  6. declare_estimator

The function declare_design can take any of these six functions, plus any R function that takes data and returns data.

The post-design-declaration commands are:

  1. modify_design (takes a design and a set of modifications, returns a design)
  2. diagnose_design (takes a design, returns simulations and diagnosis; see the sketch after this list)
  3. compare_designs (takes a list of designs and diagnoses them all)
  4. draw_data (takes a design and returns a single draw of the data)
  5. get_estimates (takes a design and returns a single simulation of estimates)
  6. get_estimands (takes a design and returns a single simulation of estimands)
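
For example, once a design object has been declared (as in the Declaring Designs section below), the diagnosis workflow for item 2 looks, in sketch form, like:

diagnosis <- diagnose_design(design)
diagnosis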

There are a few other features:

  1. A template is a function that takes parameters (e.g., N) and returns a design. quick_design takes a template and parameters and returns a design.
  2. You can easily declare_diagnosands, which are quantities like power and bias, but the package defaults to the usual suspects (see the sketch after this list).
  3. reveal_outcomes implements a general switching equation, which allows you to reveal outcomes from potential outcomes and a treatment assignment.
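
As a sketch of item 2, a custom set of diagnosands might be declared along the following lines; the exact expressions are illustrative, written in terms of the est, p, and estimand columns that appear in the estimates and estimands data frames later in this document:

my_diagnosands <- declare_diagnosands(
  bias = mean(est - estimand),
  power = mean(p < .05))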

The Six Steps

Each of the declare_* functions returns a function.

In addition to using them to declare the design, you can just run them yourself, i.e. your_potential_outcomes(your_population()).

Population

You can easily define a single-level population (i.e. just individuals). You can use any R function to create a variable, with N being the only privileged name. The output is a function, so calling my_population() makes a data.frame.

my_population <- declare_population(
  N = 1000,
  income = rnorm(N),
  age = sample(18:95, N, replace = TRUE))

pop <- my_population()
head(pop)
##     ID     income age
## 1 0001  1.3709584  95
## 2 0002 -0.5646982  52
## 3 0003  0.3631284  72
## 4 0004  0.6328626  87
## 5 0005  0.4042683  83
## 6 0006 -0.1061245  75

Multi-level datasets are also easy to create. You set the N of each level in a call to level(). The level function is neat – if the previous level has data, it merges so that there are N entries for each unit at the higher level. We can also handle a non-fixed number of units at each level. In the individuals line, we've drawn a random number of individuals for each village.

my_population_nested <- declare_population(
  districts = level(N = 25, urban = sample(0:1, N, replace = TRUE)),
  villages = level(N = 10, altitude = rnorm(N)),
  individuals = level(N = sample(100:200, size = 250, replace = TRUE), 
                      income = rnorm(N),
                      age = sample(18:95, N, replace = TRUE)))

This says that there are 25 districts, 10 villages per district, and then between 100 and 200 individuals per village. It creates districts first, then merges in villages, then merges in individuals, using the ID variables created at the level above.

Within those levels, you can easily include existing data (and also add variables to them if you wish):

region_data <- data.frame(capital = c(1, 0, 0, 0, 0))
pop_level_data <- declare_population(
  regions = level(N = 2, gdp = runif(N)),
  cities = level(N = 2, subways = rnorm(N, mean = 5)))

head(pop_level_data())
##   regions       gdp cities  subways
## 1       1 0.2737925      1 4.864184
## 2       1 0.2737925      2 4.012727
## 3       2 0.9441967      3 5.831925
## 4       2 0.9441967      4 4.204940

Similarly, you can easily declare your existing data as the population:

country_data <- data.frame(
  cow_code = c(504, 15, 100, 90),
  polity_iv = c(-9, 7, -1, 3))
pop_data <- declare_population(data = country_data)

head(pop_data())
##   cow_code polity_iv
## 1      504        -9
## 2       15         7
## 3      100        -1
## 4       90         3

If you don’t want your data to be fixed, you can resample from it, i.e.

pop_data_bootstrap <- declare_population(
  data = country_data, population_function = fabricatr::resample_data)

head(pop_data_bootstrap())
##     cow_code polity_iv
## 3        100        -1
## 4         90         3
## 4.1       90         3
## 4.2       90         3

Note that fabricatr is one of the helper packages that come along with DeclareDesign. fabricatr helps you simulate population data or resample from existing data.

Potential outcomes

A declare_potential_outcomes declaration returns a function. That function takes data and returns data with potential outcomes columns appended. There are two ways of declaring potential outcomes: as a formula or as separate variables (as in declare_population).

As a formula

my_potential_outcomes <- declare_potential_outcomes(
  formula = Y ~ .25 * Z + .01 * age * Z)
pop_pos <- my_potential_outcomes(pop)
head(pop_pos)
##     ID     income age Y_Z_0 Y_Z_1
## 1 0001  1.3709584  95     0  1.20
## 2 0002 -0.5646982  52     0  0.77
## 3 0003  0.3631284  72     0  0.97
## 4 0004  0.6328626  87     0  1.12
## 5 0005  0.4042683  83     0  1.08
## 6 0006 -0.1061245  75     0  1.00

This has defaults set for condition_names (0, 1) and the assignment variable name (Z). You can set the “domain” of the potential outcomes function with condition_names.

my_potential_outcomes <- declare_potential_outcomes(
  formula = Y ~ .25 * Z + .01 * age * Z,
  condition_names = 1:4)
head(my_potential_outcomes(pop))
##     ID     income age Y_Z_1 Y_Z_2 Y_Z_3 Y_Z_4
## 1 0001  1.3709584  95  1.20  2.40  3.60  4.80
## 2 0002 -0.5646982  52  0.77  1.54  2.31  3.08
## 3 0003  0.3631284  72  0.97  1.94  2.91  3.88
## 4 0004  0.6328626  87  1.12  2.24  3.36  4.48
## 5 0005  0.4042683  83  1.08  2.16  3.24  4.32
## 6 0006 -0.1061245  75  1.00  2.00  3.00  4.00

As separate variables

The second way, which some may prefer, is to define each potential outcome yourself. This bakes in the condition names and assignment variable.

my_potential_outcomes <-
  declare_potential_outcomes(
    Y_Z_0 = .05,
    Y_Z_1 = .30 + .01 * age)

head(my_potential_outcomes(pop))
##     ID     income age Y_Z_0 Y_Z_1
## 1 0001  1.3709584  95  0.05  1.25
## 2 0002 -0.5646982  52  0.05  0.82
## 3 0003  0.3631284  72  0.05  1.02
## 4 0004  0.6328626  87  0.05  1.17
## 5 0005  0.4042683  83  0.05  1.13
## 6 0006 -0.1061245  75  0.05  1.05

Sampling

A sampling function takes data and returns a sampled subset of the data. By default, declare_sampling understands arguments passed to ... as randomizr arguments, but it’s easy to supply your own function instead.

my_sampling <- declare_sampling(n = 250)
smp <- my_sampling(pop_pos)
nrow(smp)
## [1] 250
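
Because the ... arguments are handed to randomizr, stratified or clustered sampling should follow the same pattern. The argument names below (strata, clusters) are assumptions based on randomizr's sampling functions, not verbatim package documentation:

my_stratified_sampling <- declare_sampling(strata = urban)     # assumes a strata variable (urban) in the data
my_clustered_sampling <- declare_sampling(clusters = villages) # assumes a cluster ID (villages) in the data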

Assignment

Assignment declarations return functions of data that return data. If you use the randomizr defaults, the function appends an assignment draw (Z) and the probability that each unit is assigned to the condition it is in (Z_cond_prob).

my_assignment <- declare_assignment(m = 25)
smp <- my_assignment(smp)
table(smp$Z)
## 
##   0   1 
## 225  25
head(smp)
##      ID     income age Y_Z_0 Y_Z_1 S_inclusion_prob Z Z_cond_prob
## 3  0003  0.3631284  72     0  0.97             0.25 0         0.9
## 9  0009  2.0184237  30     0  0.55             0.25 0         0.9
## 13 0013 -1.3888607  62     0  0.87             0.25 0         0.9
## 19 0019 -2.4404669  79     0  1.04             0.25 0         0.9
## 22 0022 -1.7813084  59     0  0.84             0.25 0         0.9
## 23 0023 -0.1719174  48     0  0.73             0.25 0         0.9

Estimands

Estimand functions run on data that include potential outcomes, created earlier by a declare_potential_outcomes call.

my_estimand <- declare_estimand(ATE = mean(Y_Z_1 - Y_Z_0))
my_estimand(pop_pos)
##   estimand_label estimand
## 1            ATE  0.81926

The only part baked into DeclareDesign is the naming structure, outcome_assignment_condition. You could write your own potential outcomes function to avoid this (and in most cases this would also require writing your own reveal_outcomes function).

Estimators

To declare an estimator, you provide an estimator function (difference_in_means by default). Optionally, you can also attach an estimand to the estimator.

smp <- reveal_outcomes(smp)
my_estimator_dim <- declare_estimator(Y ~ Z, estimand = my_estimand)
my_estimator_dim(smp)
##   estimator_label coefficient_name    est         se           p  ci_lower
## 1    my_estimator                Z 0.8868 0.04431223 1.76578e-16 0.7953441
##    ci_upper estimand_label
## 1 0.9782559            ATE

Using our simple lm function with built-in robust standard errors (HC2 by default, or your choice of other HC variants) is similarly straightforward. It is also fast, since it is implemented with Rcpp.

my_estimator_lm <- 
  declare_estimator(Y ~ Z, 
                    model = lm_robust, 
                    coefficient_name = "Z", 
                    estimand = my_estimand)

my_estimator_lm(smp)
##   estimator_label coefficient_name    est         se            p
## 1    my_estimator                Z 0.8868 0.04431223 1.103823e-53
##    ci_lower  ci_upper estimand_label
## 1 0.7995237 0.9740763            ATE

Declaring Designs

Once the individual steps are declared, you give declare_design the set of functions in the order of your DAG, i.e. beginning with a population, then potential outcomes, sampling, and so on. You can also place any R function that takes data and returns data at the appropriate point in the causal order – including the nice functions in dplyr like mutate – to create new variables and do things like collapse clusters.

design <- declare_design(my_population,
                         my_potential_outcomes,
                         my_estimand,
                         dplyr::mutate(big_income = 5*income), 
                         my_sampling,
                         my_assignment,
                         reveal_outcomes,
                         my_estimator_dim)

Remarks re: declare_design:

  1. The first argument must always be a dataset or a function that creates one.
  2. Your estimand is placed where you want to define it, i.e. here we are defining a PATE by placing the estimand just after population and before sampling or assignment.
  3. declare_design produces two things: a “dgp function” and a “design function.” The dgp function draws a dataset; the design function returns an estimands data frame and an estimates data frame. It simulates the design from population through estimates, in whatever order you specify – meaning it cleanly separates the data-generating parts of the design from the calculation of estimates and estimands.

You can run them directly via:

dat <- draw_data(design)
head(dat)
##      ID     income age Y_Z_0 Y_Z_1 big_income S_inclusion_prob Z
## 7  0007  1.5194408  40  0.05  0.70   7.597204             0.25 0
## 16 0016  0.9815189  57  0.05  0.87   4.907594             0.25 0
## 18 0018  0.6370899  59  0.05  0.89   3.185450             0.25 0
## 24 0024  0.8700965  26  0.05  0.56   4.350483             0.25 1
## 25 0025  0.4857625  61  0.05  0.91   2.428812             0.25 0
## 34 0034 -0.1823682  46  0.05  0.76  -0.911841             0.25 0
##    Z_cond_prob    Y
## 7          0.9 0.05
## 16         0.9 0.05
## 18         0.9 0.05
## 24         0.1 0.56
## 25         0.9 0.05
## 34         0.9 0.05

and

get_estimates(design)
##   estimator_label coefficient_name    est         se            p ci_lower
## 1    my_estimator                Z 0.8072 0.04009788 1.544802e-16 0.724442
##   ci_upper estimand_label
## 1 0.889958            ATE

get_estimands(design)
##   estimand_label estimand
## 1            ATE  0.78939

Custom functions

Because all steps are functions, it’s easy for you to provide custom functions instead of using our defaults.

Custom Population

You can use a custom function to generate your population entirely on your own, too:

my_population_function <- function(N) {
  data.frame(u = rnorm(N))
}

my_population_custom <- declare_population(
  population_function = my_population_function, N = 100)

pop_custom <- my_population_custom()

head(pop_custom)
##            u
## 1  1.0359587
## 2  0.2363770
## 3  1.7074779
## 4  0.4080789
## 5  1.2078574
## 6 -0.6103413

Custom Potential Outcomes

my_potential_outcomes_function <-
  function(data) {
    data$Y_Z_0 <- with(data, u)
    data$Y_Z_1 <- with(data, 0.25 + u)
    data
  }
my_potential_outcomes_custom <- declare_potential_outcomes(
  potential_outcomes_function = my_potential_outcomes_function
)

pop_pos_custom <- my_potential_outcomes_custom(pop_custom)

head(pop_pos_custom[, c("u", "Y_Z_0", "Y_Z_1")])
##            u      Y_Z_0      Y_Z_1
## 1  1.0359587  1.0359587  1.2859587
## 2  0.2363770  0.2363770  0.4863770
## 3  1.7074779  1.7074779  1.9574779
## 4  0.4080789  0.4080789  0.6580789
## 5  1.2078574  1.2078574  1.4578574
## 6 -0.6103413 -0.6103413 -0.3603413

Custom Sampling

Again, you can easily use a totally separate custom sampling function. In this case, you provide a function that takes population data and returns sampled data, i.e. you draw the sample and then subset. You can also include inclusion probabilities in the function if you wish (as the default function does).

my_sampling_function <- function(data) {
     data$S <- rbinom(n = nrow(data),
            size = 1,
            prob = 0.1)
     data[data$S == 1, ]
}

my_sampling_custom <- declare_sampling(
  sampling_function = my_sampling_function)

smp_custom <- my_sampling_custom(pop_pos)

nrow(smp_custom)
## [1] 102

Custom Assignment

my_assignment_function <- function(data) {
  data$Z <- rbinom(n = nrow(data),
         size = 1,
         prob = 0.5)
  data
}
my_assignment_custom <- declare_assignment(
  assignment_function = my_assignment_function)

table(my_assignment_custom(pop_pos)$Z)
## 
##   0   1 
## 501 499

Custom Estimand

my_estimand_function <- function(data) {
  with(data, median(Y_Z_1 - Y_Z_0))
}
my_estimand_custom <- declare_estimand(
  estimand_function = my_estimand_function, label = medianTE)

my_estimand_custom(pop_pos)
##   estimand_label estimand
## 1       medianTE     0.82

Custom Estimator

my_estimator_function <- function(formula, data){
  data.frame(est = with(data, mean(Y)))
}

my_estimator_custom <- 
  declare_estimator(Y ~ Z, 
                    estimator_function = my_estimator_function, 
                    estimand = my_estimand)

my_estimator_custom(smp)
##   estimator_label     est estimand_label
## 1    my_estimator 0.08868            ATE

Features

Quick design

You can also write a design maker function that declares a design based on a set of parameters like N, the number of clusters, etc., and then use the function quick_design to make designs using just those parameters.

m_arm_trial <- function(numb){
  my_population <- declare_population(
    N = numb, income = rnorm(N), age = sample(18:95, N, replace = TRUE))

  my_potential_outcomes <- declare_potential_outcomes(
    formula = Y ~ .25 * Z + .01 * age * Z)
  my_sampling <- declare_sampling(n = 250)
  my_assignment <- declare_assignment(m = 25)
  my_estimand <- declare_estimand(ATE = mean(Y_Z_1 - Y_Z_0))
  my_estimator_dim <- declare_estimator(Y ~ Z, estimand = my_estimand)
  my_design <- declare_design(my_population,
                              my_potential_outcomes,
                              my_estimand,
                              my_sampling,
                              my_assignment,
                              reveal_outcomes,
                              my_estimator_dim)
  return(my_design)
}

my_1000_design <- quick_design(template = m_arm_trial, numb = 1000)
head(draw_data(my_1000_design))
##      ID     income age Y_Z_0 Y_Z_1 S_inclusion_prob Z Z_cond_prob Y
## 6  0006  2.4696180  59     0  0.84             0.25 0         0.9 0
## 9  0009  0.1271827  54     0  0.79             0.25 0         0.9 0
## 13 0013  0.5023062  20     0  0.45             0.25 0         0.9 0
## 16 0016 -0.2910818  63     0  0.88             0.25 0         0.9 0
## 22 0022  0.5427438  49     0  0.74             0.25 0         0.9 0
## 24 0024 -0.5541436  19     0  0.44             0.25 0         0.9 0

Continuous potential outcomes

my_potential_outcomes_continuous <- declare_potential_outcomes(
  formula = Y ~ .25 * Z + .01 * age * Z, condition_names = seq(0, 1, by = .1))

continuous_treatment_function <- function(data){
 data$Z <- sample(seq(0, 1, by = .1), size = nrow(data), replace = TRUE)
 data
}

my_assignment_continuous <- declare_assignment(
  assignment_function = continuous_treatment_function)

my_design <- declare_design(my_population(),
                            my_potential_outcomes_continuous,
                            my_assignment_continuous,
                            reveal_outcomes)

head(draw_data(my_design))
##     ID      income age Y_Z_0 Y_Z_0.1 Y_Z_0.2 Y_Z_0.3 Y_Z_0.4 Y_Z_0.5
## 1 0001 -0.04428595  38     0   0.063   0.126   0.189   0.252   0.315
## 2 0002  0.97052667  18     0   0.043   0.086   0.129   0.172   0.215
## 3 0003 -0.13801701  76     0   0.101   0.202   0.303   0.404   0.505
## 4 0004 -0.35100480  34     0   0.059   0.118   0.177   0.236   0.295
## 5 0005 -1.73853446  36     0   0.061   0.122   0.183   0.244   0.305
## 6 0006 -0.59459264  69     0   0.094   0.188   0.282   0.376   0.470
##   Y_Z_0.6 Y_Z_0.7 Y_Z_0.8 Y_Z_0.9 Y_Z_1   Z     Y
## 1   0.378   0.441   0.504   0.567  0.63 0.8 0.504
## 2   0.258   0.301   0.344   0.387  0.43 0.9 0.387
## 3   0.606   0.707   0.808   0.909  1.01 0.7 0.707
## 4   0.354   0.413   0.472   0.531  0.59 0.3 0.177
## 5   0.366   0.427   0.488   0.549  0.61 0.5 0.305
## 6   0.564   0.658   0.752   0.846  0.94 1.0 0.940

Attrition

Just another potential outcome!

my_potential_outcomes_attrition <- declare_potential_outcomes(
  formula = R ~ rbinom(n = N, size = 1, prob = pnorm(Y_Z_0)))

my_design <- declare_design(my_population(),
                            my_potential_outcomes,
                            my_potential_outcomes_attrition,
                            my_assignment,
                            reveal_outcomes(outcome_variable_name = "R"),
                            reveal_outcomes(attrition_variable_name = "R"))

head(draw_data(my_design)[, c("ID", "Y_Z_0", "Y_Z_1", "R_Z_0", "R_Z_1", "Z", "R", "Y")])
##     ID Y_Z_0 Y_Z_1 R_Z_0 R_Z_1 Z R    Y
## 1 0001  0.05  1.04     1     0 0 1 0.05
## 2 0002  0.05  0.87     1     0 0 1 0.05
## 3 0003  0.05  1.14     0     1 0 0   NA
## 4 0004  0.05  1.22     1     0 1 0   NA
## 5 0005  0.05  0.94     1     1 0 1 0.05
## 6 0006  0.05  1.23     1     0 0 1 0.05

Stochastic population sizes

The population (or any level of the population) can have stochastic population sizes. (In fact, N can be a number, a fixed vector of numbers, or an expression that returns a stochastic number or vector of numbers.)

stochastic_population <- declare_population(
  N = sample(500:1000, 1), income = rnorm(N), age = sample(18:95, N, replace = TRUE))

c(nrow(stochastic_population()), 
  nrow(stochastic_population()), 
  nrow(stochastic_population()))
## [1] 622 832 912
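
A fixed vector of numbers is useful in the multi-level case. This sketch assumes one entry per unit at the level above (here, exactly 3 villages with 10, 20, and 30 individuals respectively):

fixed_size_population <- declare_population(
  villages = level(N = 3, altitude = rnorm(N)),
  individuals = level(N = c(10, 20, 30), income = rnorm(N)))

nrow(fixed_size_population())  # 60 under this assumption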

Three companion packages

All of the built-in default functions live in three standalone packages that can be used to support DeclareDesign or on their own. This keeps the six key functions used in declare_design simple and the number of package dependencies minimal.

For now, make sure you have the development versions of each package from GitHub, not the version of randomizr on CRAN.

The three packages are:

fabricatr

fabricatr includes the data creation features described above. Its main function, fabricate, creates simulated data as in the declare_population examples above. It can create single-level data or multi-level data using level() within fabricate. It is also the backbone of the declare_potential_outcomes default function that creates each PO as an expression, as described above (i.e. where you define Y_Z_1 = 0.5 + Z). The second main function is resample_data, which resamples your data in a way that respects hierarchy. We expect this package will be built up further, including with proportional outcomes functions.
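
For instance, fabricate and resample_data can be used directly, outside of any declare_* call (a minimal sketch):

library(fabricatr)
tiny_pop <- fabricate(N = 5, income = rnorm(N))  # simulate a small single-level dataset
resample_data(tiny_pop, N = 10)                  # resample 10 rows from it, with replacement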

randomizr

randomizr works by default as described above in declare_assignment() – in fact, it appears to the user much as the old version did – but it is now entirely separate. The sampling functions that mimic the assignment functions are now also built into randomizr, including simple, clustered, stratified, and clustered-and-stratified sampling. The next step is to rewrite some of these in C++ to see if we can speed a few of them up; based on our understanding of Rcpp, the gains here may be minor.
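
For example, the assignment and sampling functions can be used directly (a minimal sketch of two of randomizr's workhorse functions):

library(randomizr)
Z <- complete_ra(N = 100, m = 50)  # complete random assignment: 50 of 100 units treated
S <- complete_rs(N = 100, n = 20)  # complete random sampling: 20 of 100 units sampled
table(Z)
table(S)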

estimatr

The C++ version of lm with robust standard errors described above, as well as the difference-in-means and blocked difference-in-means functions, are the core of estimatr now, but it can be built up with further fast estimators.
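
For example (a minimal sketch, reusing the revealed sample data smp from above):

library(estimatr)
lm_robust(Y ~ Z, data = smp)            # HC2 robust standard errors by default
difference_in_means(Y ~ Z, data = smp)  # the default estimator behind declare_estimator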