Declare sampling procedure
declare_sampling(..., handler = sampling_handler, label = NULL)
sampling_handler(data, ..., sampling_variable = "S", drop_nonsampled = TRUE)
Argument | Description |
---|---|
... | Arguments to be captured and later passed to the handler. |
handler | A tidy-in, tidy-out function. |
label | A string describing the step. |
data | A data.frame. |
sampling_variable | The prefix for the sampling inclusion probability variable. |
drop_nonsampled | Logical indicating whether to drop units that are not sampled. Default is TRUE. |
A sampling declaration, which is a function that takes a data.frame as an argument and returns a data.frame subsetted to sampled observations and (optionally) augmented with inclusion probabilities and other quantities.
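To make the "data.frame in, data.frame out" contract concrete, here is a minimal base-R sketch of what such a step might do. The function `sketch_sampling_step` and its argument `n` are hypothetical names used for illustration only; this is not the package's implementation, and dropping the indicator column after subsetting is an assumption.

```r
# Hypothetical stand-in for a sampling step (not the package's code):
# takes a data.frame and returns a data.frame.
sketch_sampling_step <- function(data, n, sampling_variable = "S",
                                 drop_nonsampled = TRUE) {
  N <- nrow(data)
  # Complete random sampling: exactly n of the N rows are selected.
  sampled <- rep(0L, N)
  sampled[sample.int(N, n)] <- 1L
  data[[sampling_variable]] <- sampled
  # Under this scheme every unit has the same inclusion probability, n / N.
  data[[paste0(sampling_variable, "_inclusion_prob")]] <- n / N
  if (drop_nonsampled) {
    data <- data[data[[sampling_variable]] == 1L, , drop = FALSE]
    # Assumption: the indicator is constant after subsetting, so drop it.
    data[[sampling_variable]] <- NULL
  }
  data
}

set.seed(1)
population <- data.frame(ID = 1:100, U = rnorm(100))
sampled_data <- sketch_sampling_step(population, n = 50)
nrow(sampled_data)   # 50 rows remain
names(sampled_data)
```

With `drop_nonsampled = FALSE`, the same sketch would return all 100 rows with the indicator column retained.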
declare_sampling can work with any sampling function that takes data and returns data. The default handler, sampling_handler, relies on draw_rs from the randomizr package, which allows quick declaration of many sampling schemes that involve strata and clusters.
The arguments to draw_rs can include N, strata, clusters, n, prob, strata_n, and strata_prob. The arguments you need to specify are different for different designs.
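As a rough illustration of how the required arguments change with the design, the sketch below calls the randomizr sampling functions directly. It assumes a randomizr version that uses the `strata` and `clusters` argument names; the `female` and `villages` vectors are made-up data for the example.

```r
library(randomizr)

set.seed(2)
female   <- rbinom(100, 1, 0.5)   # hypothetical stratifying covariate
villages <- rep(1:20, each = 5)   # hypothetical cluster identifier

# Complete random sampling: population size N and sample size n.
S_complete <- complete_rs(N = 100, n = 50)

# Stratified sampling: a strata vector and a within-stratum probability.
S_strata <- strata_rs(strata = female, prob = 0.5)

# Cluster sampling: a cluster vector and the number of clusters to draw.
S_cluster <- cluster_rs(clusters = villages, n = 10)

# Each call returns a 0/1 vector with one entry per unit.
table(S_complete)
table(female, S_strata)
table(villages, S_cluster)
```

In the cluster case, every unit in a village shares the same sampling status, which the last table makes visible.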
Note that declare_sampling works similarly to declare_assignment. A key difference is that declare_sampling functions subset the data to sampled units rather than simply appending an indicator for sample membership (as assignment does). If you need to sample but keep the full dataset, use declare_assignment and define further steps (such as estimation) with respect to subsets defined by the assignment.
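The distinction can be sketched in base R without either declaration function. The example below keeps every row, appends a membership indicator, and computes a later quantity only on the flagged subset; the column names `S` and `Y` are illustrative assumptions.

```r
set.seed(3)
population <- data.frame(ID = 1:100, Y = rnorm(100))

# Assignment-style approach: keep all rows and append an indicator.
population$S <- 0L
population$S[sample.int(nrow(population), 50)] <- 1L

# Later steps are then defined with respect to the subset where S == 1.
sample_mean <- mean(population$Y[population$S == 1L])

nrow(population)    # still 100 rows
sum(population$S)   # 50 units flagged as sampled
```

A sampling-style step would instead return only the 50 flagged rows, so downstream steps would never see the other 50.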
For details, see the help files for complete_rs, strata_rs, cluster_rs, or strata_and_cluster_rs.
library(DeclareDesign)

# Simple random sampling
design <- declare_population(N = 100, female = rbinom(N, 1, 0.5), U = rnorm(N)) +
  declare_potential_outcomes(Y ~ 0.5 * Z + 0.2 * female + 0.1 * Z * female + U)

design_with_sampling <- design + declare_sampling(n = 50)

nrow(draw_data(design))
#> [1] 100
nrow(draw_data(design_with_sampling))
#> [1] 50

# Stratified random sampling
design + declare_sampling(strata = female)
#>
#> Design Summary
#>
#> Step 1 (population): declare_population(N = 100, female = rbinom(N, 1, 0.5), U = rnorm(N))
#>
#> N = 100
#>
#> Added variable: ID
#>  N_missing N_unique     class
#>          0      100 character
#>
#> Added variable: female
#>     0    1
#>    55   45
#>  0.55 0.45
#>
#> Added variable: U
#>    min median  mean  max sd N_missing N_unique
#>  -3.03  -0.06 -0.08 2.62  1         0      100
#>
#> Step 2 (potential outcomes): declare_potential_outcomes(Y ~ 0.5 * Z + 0.2 * female + 0.1 * Z * female + U)
#>
#> Formula: Y ~ 0.5 * Z + 0.2 * female + 0.1 * Z * female + U
#>
#> Added variable: Y_Z_0
#>    min median mean  max   sd N_missing N_unique
#>  -3.03   0.04 0.01 2.62 1.01         0      100
#>
#> Added variable: Y_Z_1
#>    min median mean  max   sd N_missing N_unique
#>  -2.53    0.6 0.55 3.12 1.01         0      100
#>
#> Step 3 (sampling): declare_sampling(strata = female) ---------------------------
#>
#> N = 50 (50 subtracted)
#>
#> Added variable: S_inclusion_prob
#>   0.5
#>    50
#>  1.00
#>
#> Altered variable: ID
#>   Before:
#>  N_missing N_unique     class
#>          0      100 character
#>
#>   After:
#>  N_missing N_unique     class
#>          0       50 character
#>
#> Altered variable: female
#>   Before:
#>     0    1
#>    55   45
#>  0.55 0.45
#>
#>   After:
#>     0    1
#>    27   23
#>  0.54 0.46
#>
#> Altered variable: U
#>   Before:
#>    min median  mean  max sd N_missing N_unique
#>  -3.03  -0.06 -0.08 2.62  1         0      100
#>
#>   After:
#>    min median mean  max   sd N_missing N_unique
#>  -2.32  -0.03 0.01 2.62 1.01         0       50
#>
#> Altered variable: Y_Z_0
#>   Before:
#>    min median mean  max   sd N_missing N_unique
#>  -3.03   0.04 0.01 2.62 1.01         0      100
#>
#>   After:
#>    min median mean  max sd N_missing N_unique
#>  -2.12   0.08 0.11 2.62  1         0       50
#>
#> Altered variable: Y_Z_1
#>   Before:
#>    min median mean  max   sd N_missing N_unique
#>  -2.53    0.6 0.55 3.12 1.01         0      100
#>
#>   After:
#>    min median mean  max sd N_missing N_unique
#>  -1.52   0.63 0.65 3.12  1         0       50
#>

# Custom random sampling functions
my_sampling_function <- function(data, n = 20) {
  data[sample(n, n, replace = TRUE), , drop = FALSE]
}

design + declare_sampling(handler = my_sampling_function)
#>
#> Design Summary
#>
#> Step 1 (population): declare_population(N = 100, female = rbinom(N, 1, 0.5), U = rnorm(N))
#>
#> N = 100
#>
#> Added variable: ID
#>  N_missing N_unique     class
#>          0      100 character
#>
#> Added variable: female
#>     0    1
#>    56   44
#>  0.56 0.44
#>
#> Added variable: U
#>    min median mean  max   sd N_missing N_unique
#>  -2.42   0.11 0.16 2.09 0.86         0      100
#>
#> Step 2 (potential outcomes): declare_potential_outcomes(Y ~ 0.5 * Z + 0.2 * female + 0.1 * Z * female + U)
#>
#> Formula: Y ~ 0.5 * Z + 0.2 * female + 0.1 * Z * female + U
#>
#> Added variable: Y_Z_0
#>    min median mean  max   sd N_missing N_unique
#>  -2.22   0.23 0.25 2.13 0.89         0      100
#>
#> Added variable: Y_Z_1
#>    min median mean  max   sd N_missing N_unique
#>  -1.62   0.75 0.79 2.73 0.91         0      100
#>
#> Step 3 (sampling): declare_sampling(handler = my_sampling_function) ------------
#>
#> N = 20 (80 subtracted)
#>
#> Altered variable: ID
#>   Before:
#>  N_missing N_unique     class
#>          0      100 character
#>
#>   After:
#>  N_missing N_unique     class
#>          0       13 character
#>
#> Altered variable: female
#>   Before:
#>     0    1
#>    56   44
#>  0.56 0.44
#>
#>   After:
#>     0    1
#>    11    9
#>  0.55 0.45
#>
#> Altered variable: U
#>   Before:
#>    min median mean  max   sd N_missing N_unique
#>  -2.42   0.11 0.16 2.09 0.86         0      100
#>
#>   After:
#>    min median mean  max   sd N_missing N_unique
#>  -1.19   0.62 0.62 1.84 0.99         0       13
#>
#> Altered variable: Y_Z_0
#>   Before:
#>    min median mean  max   sd N_missing N_unique
#>  -2.22   0.23 0.25 2.13 0.89         0      100
#>
#>   After:
#>    min median mean  max   sd N_missing N_unique
#>  -1.19   0.72 0.71 2.04 1.06         0       13
#>
#> Altered variable: Y_Z_1
#>   Before:
#>    min median mean  max   sd N_missing N_unique
#>  -1.62   0.75 0.79 2.73 0.91         0      100
#>
#>   After:
#>    min median mean  max  sd N_missing N_unique
#>  -0.69   1.27 1.26 2.64 1.1         0       13
#>
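The custom function in the example draws row indices with `sample(n, n, replace = TRUE)`, which selects from rows 1 to `n` with replacement; this is why the summary reports only 13 unique IDs among 20 rows. If the intent were instead to draw `n` distinct rows from the whole data frame, a base-R variant might look like the sketch below. The name `sample_all_rows` is hypothetical.

```r
# Hypothetical custom handler: draws n distinct rows from the full data frame.
sample_all_rows <- function(data, n = 20) {
  data[sample.int(nrow(data), n), , drop = FALSE]
}

set.seed(4)
population <- data.frame(ID = 1:100, U = rnorm(100))
drawn <- sample_all_rows(population)
nrow(drawn)                  # 20 rows
length(unique(drawn$ID))     # 20 distinct units
```

Such a function could be passed as the `handler` argument in the same way as `my_sampling_function`.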