One way to imagine new data is to resample data you already have. Because whole observations are copied, existing correlations between variables are preserved while you generate new data or expand the size of the dataset. fabricatr offers several options for resampling.
Bootstrapping
The simplest option in fabricatr is to “bootstrap” data. Given data with N observations, the bootstrap resamples these observations with replacement and generates N new observations. Each existing observation may be used zero times, once, or more than once. Bootstrapping is very simple with the resample_data() function:
survey_data <- fabricate(N = 10, voted_republican = draw_binary(prob = 0.5, N = N))
survey_data_new <- resample_data(survey_data)
head(survey_data_new)
ID | voted_republican |
---|---|
02 | 0 |
07 | 0 |
01 | 1 |
03 | 1 |
10 | 0 |
02 | 0 |
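To make the mechanics concrete, here is a minimal base R sketch of the same idea. It is not fabricatr's implementation; the toy_survey data frame, the seed, and the variable names are assumptions made only for illustration.

```r
# Base R sketch of a bootstrap (illustrative; not fabricatr's internals).
set.seed(42)  # hypothetical seed, only so the sketch is reproducible

# Toy stand-in for survey_data: 10 rows with a binary outcome.
toy_survey <- data.frame(
  ID = sprintf("%02d", 1:10),
  voted_republican = rbinom(10, size = 1, prob = 0.5)
)

# Draw N row indices with replacement, then copy those rows.
boot_rows <- sample(nrow(toy_survey), size = nrow(toy_survey), replace = TRUE)
toy_boot <- toy_survey[boot_rows, ]

# Count how often each original row was used; zeros are rows left out.
table(factor(boot_rows, levels = seq_len(nrow(toy_survey))))
```

The final table shows directly that some rows are drawn several times and others not at all, which is why repeated IDs appear in the output above.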
It is also possible to resample fewer or more observations than your existing data contain by specifying the argument N to resample_data(). Consider expanding a small dataset to help imagine the larger dataset to be collected later.
large_survey_data <- resample_data(survey_data, N = 100)
nrow(large_survey_data)
100
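The following base R sketch illustrates why expanding a dataset this way preserves relationships between variables: rows are copied whole, so no new combinations of values are created. The data frame small, its columns x and y, and the seed are hypothetical and chosen only for this sketch.

```r
# Base R sketch: expanding 10 rows to 100 by resampling with replacement.
set.seed(7)  # hypothetical seed for reproducibility

# Two correlated toy variables.
small <- data.frame(ID = 1:10, x = rnorm(10))
small$y <- 2 * small$x + rnorm(10)

# Draw 100 row indices from the 10 available rows.
big_rows <- sample(nrow(small), size = 100, replace = TRUE)
big <- small[big_rows, ]

nrow(big)  # 100

# Every row of big is a copy of a row of small, so at most 10 distinct rows.
nrow(unique(big))
```

The larger dataset is useful for prototyping an analysis at the planned scale, but it contains no information beyond the original 10 observations.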
Resampling hierarchical data
One of the most powerful features of fabricatr is the ability to resample hierarchical data at any or all levels. Doing so requires specifying which levels you want to resample with the ID_labels argument. Unless otherwise specified, all units from levels below the resampled level are kept. In our earlier country-province-citizen dataset, resampling countries carries forward all provinces and citizens of each selected country. You can resample at multiple levels simultaneously.
Consider this example, which takes a dataset containing 2 cities with 3 citizens each, and resamples it into a dataset of 3 cities, each containing 5 citizens.
my_data <-
  fabricate(
    cities = add_level(N = 2, elevation = runif(n = N, min = 1000, max = 2000)),
    citizens = add_level(N = 3, age = runif(N, 18, 70))
  )
my_data_2 <- resample_data(my_data, N = c(3, 5), ID_labels = c("cities", "citizens"))
head(my_data_2)
cities | elevation | citizens | age |
---|---|---|---|
1 | 1769 | 3 | 23 |
1 | 1769 | 2 | 68 |
1 | 1769 | 1 | 51 |
1 | 1769 | 1 | 51 |
1 | 1769 | 1 | 51 |
2 | 1205 | 5 | 64 |
resample_data() first selects the cities to be resampled. Then, for each selected city, it selects the citizens to be resampled. If a higher-level unit is used more than once (for example, the same city is chosen twice) and a lower level is subsequently resampled, the lower-level units are chosen separately for each copy of the higher-level unit. In this example, if city 1 is chosen twice, the sets of five citizens chosen for each copy of city 1 can differ.
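The two-stage procedure described above can be sketched in base R as follows. This is a simplified illustration under stated assumptions, not fabricatr's code: the toy data frame, the seed, and the names n_cities, n_citizens, chosen, and pieces are all hypothetical.

```r
# Base R sketch of two-level resampling (illustrative only).
set.seed(1)  # hypothetical seed for reproducibility

# Toy hierarchical data: 2 cities with 3 citizens each.
toy <- data.frame(
  cities    = rep(1:2, each = 3),
  elevation = rep(c(1500, 1200), each = 3),
  citizens  = 1:6,
  age       = c(23, 68, 51, 60, 62, 26)
)

n_cities   <- 3  # cities to draw
n_citizens <- 5  # citizens to draw within each drawn city

# Stage 1: choose cities with replacement.
city_ids <- unique(toy$cities)
chosen <- city_ids[sample(length(city_ids), n_cities, replace = TRUE)]

# Stage 2: for each copy of a chosen city, draw citizens separately.
pieces <- lapply(chosen, function(city) {
  rows <- which(toy$cities == city)
  toy[rows[sample(length(rows), n_citizens, replace = TRUE)], ]
})

toy_resampled <- do.call(rbind, pieces)
rownames(toy_resampled) <- NULL
nrow(toy_resampled)  # 15: 3 cities times 5 citizens
```

Because stage 2 runs once per element of chosen, a city drawn twice receives two separate citizen draws, matching the behavior described above.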
You can also specify the levels you wish to resample by naming the elements of the N argument, as in this example, which does exactly the same thing as the previous example but specifies the level names in a different way:
my_data <-
  fabricate(
    cities = add_level(N = 2, elevation = runif(n = N, min = 1000, max = 2000)),
    citizens = add_level(N = 3, age = runif(N, 18, 70))
  )
my_data_2 <- resample_data(my_data, N = c(cities = 3, citizens = 5))
head(my_data_2)
cities | elevation | citizens | age |
---|---|---|---|
1 | 1125 | 3 | 69 |
1 | 1125 | 1 | 49 |
1 | 1125 | 3 | 69 |
1 | 1125 | 2 | 35 |
1 | 1125 | 2 | 35 |
2 | 1075 | 5 | 62 |
Unique per-sample labels
Some researchers may want to preserve a unique label for each sample draw at a given level. For example, you might sample cities, as above, and then compute city-level statistics; if the same city is sampled twice, a city-level statistic will incorrectly combine both samples. This can be solved with unique_labels = TRUE, which creates a new column for each sampled level, called <level name>_unique, whose value is unique to each draw. Consider the following code:
my_data_unique <- resample_data(my_data, N = c(cities = 3), unique_labels = TRUE)
cities | elevation | citizens | age | cities_unique |
---|---|---|---|---|
2 | 1075 | 4 | 60 | 2_1 |
2 | 1075 | 5 | 62 | 2_1 |
2 | 1075 | 6 | 26 | 2_1 |
2 | 1075 | 4 | 60 | 2_2 |
2 | 1075 | 5 | 62 | 2_2 |
2 | 1075 | 6 | 26 | 2_2 |
2 | 1075 | 4 | 60 | 2_3 |
2 | 1075 | 5 | 62 | 2_3 |
2 | 1075 | 6 | 26 | 2_3 |
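To see why the extra column matters, here is a base R sketch that builds labels in the same "<city>_<draw number>" format shown in the table and compares grouping by the two columns. The fixed draw in chosen and the toy data are hypothetical, and whether fabricatr numbers its draws in exactly this way is an assumption.

```r
# Base R sketch of per-draw unique labels (illustrative only).
toy <- data.frame(
  cities   = rep(1:2, each = 3),
  citizens = 1:6,
  age      = c(23, 68, 51, 60, 62, 26)
)

# Hypothetical city-level draw in which city 2 is selected twice.
chosen <- c(2, 2, 1)

# Running count of how many times each city has been drawn so far.
draw_number <- ave(seq_along(chosen), chosen, FUN = seq_along)

# Copy each drawn city's rows and attach a label unique to that draw.
pieces <- lapply(seq_along(chosen), function(i) {
  piece <- toy[toy$cities == chosen[i], ]
  piece$cities_unique <- paste(chosen[i], draw_number[i], sep = "_")
  piece
})
resampled <- do.call(rbind, pieces)

length(unique(resampled$cities))         # 2: both copies of city 2 are merged
length(unique(resampled$cities_unique))  # 3: each draw is its own group
```

Grouping by cities_unique instead of cities keeps the two copies of city 2 apart when computing city-level summaries.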
“Passthrough” resampling
In some cases it may make sense to keep every unit at a given level and resample only the levels below it. For example, there may be value in resampling 1 citizen in each and every city represented in the dataset. fabricatr allows the user to specify ALL as the N value for a given level to accomplish this:
my_data <-
  fabricate(
    cities = add_level(N = 2, elevation = runif(n = N, min = 1000, max = 2000)),
    citizens = add_level(N = 3, age = runif(N, 18, 70))
  )
my_data_3 <- resample_data(my_data, N = c(ALL, 1), ID_labels = c("cities", "citizens"))
head(my_data_3)
cities | elevation | citizens | age |
---|---|---|---|
1 | 1980 | 2 | 25 |
2 | 1072 | 5 | 46 |
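Conceptually, passing every city through and drawing one citizen within each resembles the following base R sketch. It is a rough illustration rather than fabricatr's implementation, and the toy data, seed, and variable names are hypothetical.

```r
# Base R sketch of "passthrough" resampling (illustrative only).
set.seed(3)  # hypothetical seed for reproducibility

toy <- data.frame(
  cities    = rep(1:2, each = 3),
  elevation = rep(c(1980, 1072), each = 3),
  citizens  = 1:6,
  age       = c(31, 25, 44, 52, 46, 19)
)

# Keep every city exactly once; within each, draw a single citizen.
pieces <- lapply(split(toy, toy$cities), function(city_rows) {
  city_rows[sample(nrow(city_rows), 1), ]
})

one_per_city <- do.call(rbind, pieces)
rownames(one_per_city) <- NULL
nrow(one_per_city)  # 2: one row per city
```

No city is dropped or duplicated here, which is the difference from resampling the city level itself.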