Resampling data with fabricatr

One way to imagine new data is to take data you already have and resample it, ensuring that existing inter-correlations between variables are preserved, while generating new data or expanding the size of the dataset. fabricatr offers several options to simulate resampling.

Bootstrapping

The simplest option in fabricatr is to “bootstrap” data. Taking data with N observations, the “bootstrap” resamples these observations with replacement and generates N new observations. Existing observations may be used zero times, once, or more than once. Bootstrapping is very simple with the resample_data() function:

survey_data <- fabricate(N = 10, voted_republican = draw_binary(prob = 0.5, N = N))

survey_data_new <- resample_data(survey_data)
head(survey_data_new)

ID	voted_republican
02	0
07	0
01	1
03	1
10	0
02	0

It is also possible to resample fewer or greater number of observations from your existing data. We can do this by specifying the argument N to resample_data(). Consider expanding a small dataset to allow for better imagination of larger data to be collected later.

large_survey_data <- resample_data(survey_data, N = 100)
nrow(large_survey_data)

100

Resampling hierarchical data

One of the most powerful features of all of fabricatr is the ability to resample from hierarchical data at any or all levels. Doing so requires specifying which levels you will want to resample with the ID_labels argument. Unless otherwise specified, all units from levels below the resampled level will be kept. In our earlier country-province-citizen dataset, resampling countries will lead to all provinces and citizens of the selected country being carried forward. You can resample at multiple levels simultaneously.

Consider this example, which takes a dataset containing 2 cities of 3 citizens, and resamples it into a dataset of 3 cities, each containing 5 citizens.

my_data <-
  fabricate(
    cities = add_level(N = 2, elevation = runif(n = N, min = 1000, max = 2000)),
    citizens = add_level(N = 3, age = runif(N, 18, 70))
  )

my_data_2 <- resample_data(my_data, N = c(3, 5), ID_labels = c("cities", "citizens"))
head(my_data_2)

cities	elevation	citizens	age
1	1769	3	23
1	1769	2	68
1	1769	1	51
1	1769	1	51
1	1769	1	51
2	1205	5	64

resample_data() will first select the cities to be resampled. Then, for each city, it will continue by selecting the citizens to be resampled. If a higher level unit is used more than once (for example, the same city being chosen twice), and a lower level is subsequently resampled, the choices of which units to keep for the lower level will differ for each copy of the higher level. In this example, if city 1 is chosen twice, then the sets of five citizens chosen for each copy of the city 1 will differ.

You can also specify the levels you wish to resample from using the name arguents to the N parameter, like in this example which does exactly the same thing as the previous example, but specifies the level names in a different way:

my_data <-
  fabricate(
    cities = add_level(N = 2, elevation = runif(n = N, min = 1000, max = 2000)),
    citizens = add_level(N = 3, age = runif(N, 18, 70))
  )

my_data_2 <- resample_data(my_data, N = c(cities = 3, citizens = 5))
head(my_data_2)

cities	elevation	citizens	age
1	1125	3	69
1	1125	1	49
1	1125	3	69
1	1125	2	35
1	1125	2	35
2	1075	5	62

Unique per-sample labels

Some researchers may be interested in preserving unique labels for each sample draw at a given level. An example of this may be to sample cities, as above, but then want to run city-level statistics; if the same city is sampled twice, then the city-level statistic will incorrectly combine both samples. This can be solved with unique_labels = TRUE, which will create a new column for each sampled level, called <level name>_unique, which will be unique for each sample. Consider the following code:

my_data_unique <- resample_data(my_data, N = c(cities = 3), unique_labels = TRUE)

cities	elevation	citizens	age	cities_unique
2	1075	4	60	2_1
2	1075	5	62	2_1
2	1075	6	26	2_1
2	1075	4	60	2_2
2	1075	5	62	2_2
2	1075	6	26	2_2
2	1075	4	60	2_3
2	1075	5	62	2_3
2	1075	6	26	2_3

“Passthrough” Resampling

In some cases it may make sense to resample each unit at a given level. For example, there may be value in resampling 1 citizen in each and every city represented in the data set. fabricatr allows the user to specify ALL for the N argument to a given level to accomplish this:

my_data <-
  fabricate(
    cities = add_level(N = 2, elevation = runif(n = N, min = 1000, max = 2000)),
    citizens = add_level(N = 3, age = runif(N, 18, 70))
  )

my_data_3 <- resample_data(my_data, N = c(ALL, 1), ID_labels = c("cities", "citizens"))
head(my_data_3)

cities	elevation	citizens	age
1	1980	2	25
2	1072	5	46

Aaron Rudkin

Bootstrapping

Resampling hierarchical data

Unique per-sample labels

“Passthrough” Resampling