Building and Importing Data
Aaron Rudkin
fabricatr is a package designed to help you imagine your data before you collect it. While many solutions exist for creating simulated datasets, fabricatr is specifically designed to make the creation of realistic social science datasets easy. In particular, we need to be able to imagine correlated data and hierarchical data.
Basics
Using fabricatr begins by calling the function fabricate(). fabricate() can be used to create single-level or hierarchical data. There are three main ways to call fabricate():
- Making a single-level dataset by specifying how many observations you would like
- Making a single-level dataset by importing data and optionally modifying it by creating new variables
- Making a hierarchical dataset.
Single-level datasets from scratch
Making a single-level dataset begins with providing the argument N, a number representing the number of observations you wish to create, followed by a series of variable definitions. Variables can be defined using any function you have access to in R. fabricatr provides several simple functions for generating common types of data. These are covered below. Functions that create subsequent variables can rely on previously created variables, which ensures that variables can be related to one another:
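A minimal sketch of such a call, consistent with the output shown below (where Y2 is five times Y). The object name my_data is an assumption, and the drawn values will differ on each run:

```r
library(fabricatr)

# Five observations; Y is uniform on [0, 1], and Y2 is built from Y,
# so the second variable depends on the first.
my_data <- fabricate(
  N = 5,
  Y = runif(N),
  Y2 = Y * 5
)
my_data
```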
ID | Y | Y2 |
---|---|---|
1 | 0.78 | 3.9 |
2 | 0.50 | 2.5 |
3 | 0.54 | 2.7 |
4 | 0.61 | 3.0 |
5 | 0.32 | 1.6 |
This simple example makes use of R’s built-in runif command. The rest of the tutorial assumes a familiarity with R and its basic data generating processes.
Filling out observations.
fabricate is intended to make rectangular data frames: this means that each variable added at a level needs to be the same length. Failure to provide equal-length variables will result in an error. We provide a convenient helper function, recycle, to help expand existing data to fit the length of your level. Here, let’s use the existing month.abb constant from R to generate a month variable:
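A minimal sketch of the recycling described here, assuming the object name month_gen and the variable name month (neither is fixed by the text):

```r
library(fabricatr)

# Ask for 20 observations but supply only the 12 month abbreviations;
# recycle() repeats the vector until it fills the level.
month_gen <- fabricate(
  N = 20,
  month = recycle(month.abb)
)
month_gen$month
```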
month.abb contains the months of the year: [“Jan”, “Feb”, “Mar”, …, “Dec”]. Although we are asking for 20 observations, there are only twelve months in the year. recycle will automatically wrap the month text, resulting in a data frame with the 12 months “Jan” through “Dec”, followed by 8 months “Jan” through “Aug”.
Single-level datasets using existing data
Instead of specifying the argument N, users can specify the argument data to import existing datasets. Once a dataset is imported, subsequent variables have access to N, representing the number of observations in the imported data. This makes it easy to augment existing data with simulations based on that data.
In this example, we make use of the quakes dataset, built into R, which describes characteristics of earthquakes off the coast of Fiji. The mag variable in this dataset contains the Richter magnitude of the earthquakes. We will expand this data to add variables modelling hypothetical fatalities and insurance costs:
simulated_quake_data <- fabricate(
  data = quakes,
  fatalities = round(pmax(0, rnorm(N, mean = mag)) * 100),
  insurance_cost = fatalities * runif(N, 1000000, 2000000)
)
head(simulated_quake_data)
lat | long | depth | mag | stations | fatalities | insurance_cost |
---|---|---|---|---|---|---|
-20 | 182 | 562 | 4.8 | 41 | 494 | 564,736,181 |
-21 | 181 | 650 | 4.2 | 15 | 390 | 414,044,159 |
-26 | 184 | 42 | 5.4 | 43 | 596 | 708,701,956 |
-18 | 182 | 626 | 4.1 | 19 | 293 | 567,537,664 |
-20 | 182 | 649 | 4.0 | 11 | 487 | 920,442,976 |
-20 | 184 | 195 | 4.0 | 12 | 319 | 436,673,133 |
Notice that variable creation calls are able to make reference to both the variables in the imported data set, and newly created variables. Also, function calls can be arbitrarily nested – the variable fatalities uses several nested function calls.
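To make the nesting easier to read, the fatalities definition can be unpacked into its individual steps. The sketch below uses base R only, outside of fabricate(); the intermediate names (n, latent, non_negative) and the seed are illustrative assumptions, not part of the package:

```r
# Base R only: the same steps that the fatalities definition nests into one line.
set.seed(1)                              # arbitrary seed, for repeatability only
n <- nrow(quakes)                        # plays the role of N inside fabricate()
latent <- rnorm(n, mean = quakes$mag)    # one normal draw per quake, centred on its magnitude
non_negative <- pmax(0, latent)          # floor any negative draws at zero
fatalities <- round(non_negative * 100)  # scale up and round to whole counts

head(fatalities)
```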
Hierarchical data
The most powerful use of fabricatr is to create hierarchical (“nested”) data. In the example below, we create 5 countries, each of which has 10 provinces. We also have covariates at the country level (GDP per capita and life expectancy) and at the provincial level (presence of natural resources, and presence of manufacturing industry):
country_data <-
  fabricate(
    countries = add_level(
      N = 5,
      gdp_per_capita = runif(N, min = 10000, max = 50000),
      life_expectancy = 50 + runif(N, 10, 20) + ((gdp_per_capita > 30000) * 10)
    ),
    provinces = add_level(
      N = 10,
      natural_resources = draw_binary(prob = 0.3, N = N),
      manufacturing = draw_binary(prob = 0.7, N = N)
    )
  )
head(country_data)
countries | gdp_per_capita | life_expectancy | provinces | natural_resources | manufacturing |
---|---|---|---|---|---|
1 | 40,451 | 73 | 01 | 1 | 1 |
1 | 40,451 | 73 | 02 | 0 | 1 |
1 | 40,451 | 73 | 03 | 0 | 0 |
1 | 40,451 | 73 | 04 | 1 | 1 |
1 | 40,451 | 73 | 05 | 0 | 0 |
1 | 40,451 | 73 | 06 | 0 | 1 |
Several things can be observed in this example. First, fabricate knows that your second add_level() command will be nested under the first level of data. Each level gets its own ID variable, in addition to the variables you create. Second, the meaning of the variable “N” changes. During the add_level() call for countries, N is 5. During the add_level() call for provinces, N is 10. And the resulting data, of course, has 50 observations.
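The row arithmetic can be checked with a stripped-down version of the same structure. This is a sketch only: the object name nested and the noise variable are illustrative assumptions.

```r
library(fabricatr)

# Two levels: 5 countries, each with 10 provinces nested beneath it.
nested <- fabricate(
  countries = add_level(N = 5),
  provinces = add_level(N = 10, noise = rnorm(N))
)

nrow(nested)             # 5 * 10 rows in total
table(nested$countries)  # number of province rows under each country
```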
Finally, the province-level variables are created using the draw_binary() function. This is a function provided by fabricatr to make simulating discrete random variables simple. When you simulate your own data, you can use fabricatr’s functions, R’s built-ins, or any custom functions you wish. draw_binary() is explained in our tutorial on variable generation using fabricatr.
Adding hierarchy to existing data
fabricatr is also able to import existing data and nest hierarchical data under it. This may be useful if, for example, you have existing country-level data but wish to simulate data at lower geographical levels for the purposes of an experiment you plan to conduct.
Imagine importing the country-province data simulated in the previous example. Because fabricate() returns a data frame, this simulated data can be re-imported into a subsequent fabricate call, just like external data can be.
citizen_data <-
  fabricate(
    data = country_data,
    citizens = add_level(
      N = 10,
      salary = rnorm(
        N,
        mean = gdp_per_capita + natural_resources * 5000 + manufacturing * 5000,
        sd = 10000
      )
    )
  )
head(citizen_data)
countries | gdp_per_capita | life_expectancy | provinces | natural_resources | manufacturing | citizens | salary |
---|---|---|---|---|---|---|---|
1 | 40,451 | 73 | 01 | 1 | 1 | 001 | 45,852 |
1 | 40,451 | 73 | 01 | 1 | 1 | 002 | 56,768 |
1 | 40,451 | 73 | 01 | 1 | 1 | 003 | 52,549 |
1 | 40,451 | 73 | 01 | 1 | 1 | 004 | 46,428 |
1 | 40,451 | 73 | 01 | 1 | 1 | 005 | 70,148 |
1 | 40,451 | 73 | 01 | 1 | 1 | 006 | 64,771 |
In this example, we add a third level of data; for each of our 50 country-province observations, we now have 10 citizen-level observations. Citizen-level covariates like salary can draw from both the country-level covariate and the province-level covariate.
Notice that the syntax for adding a new nested level to existing data is different than the syntax for adding new variables to the original dataset.
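The two syntaxes can be put side by side on a small imported dataset. This is a hedged sketch: base_data, x_doubled, units and w are illustrative names, not anything defined by the package.

```r
library(fabricatr)

# A small single-level dataset to import.
base_data <- fabricate(N = 3, x = runif(N))

# Adding a variable: the definition is passed directly to fabricate(),
# so the result keeps one row per imported observation.
with_variable <- fabricate(data = base_data, x_doubled = x * 2)

# Adding a nested level: the definitions are wrapped in add_level(),
# so each imported row is expanded into N new rows.
with_level <- fabricate(
  data = base_data,
  units = add_level(N = 2, w = rnorm(N))
)

nrow(with_variable)
nrow(with_level)
```

The first call leaves the number of rows unchanged, while the second multiplies it by the N given to add_level().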
Modifying existing levels
Suppose you have hierarchical data, and wish to simulate variables at a higher level of aggregation. For example, imagine you import a dataset containing citizens within countries, but you wish to simulate additional country-level variables. In fabricatr, you can do this using the modify_level() command.
Let’s use our country-province data from earlier:
new_country_data <-
  fabricate(
    data = country_data,
    countries = modify_level(average_temperature = runif(N, 30, 80))
  )
head(new_country_data)
countries | gdp_per_capita | life_expectancy | provinces | natural_resources | manufacturing | average_temperature |
---|---|---|---|---|---|---|
1 | 40,451 | 73 | 01 | 1 | 1 | 38 |
1 | 40,451 | 73 | 02 | 0 | 1 | 52 |
1 | 40,451 | 73 | 03 | 0 | 0 | 69 |
1 | 40,451 | 73 | 04 | 1 | 1 | 64 |
1 | 40,451 | 73 | 05 | 0 | 0 | 33 |
1 | 40,451 | 73 | 06 | 0 | 1 | 65 |
We can observe that the new variable is created at the level of aggregation you chose – countries. Also, although N is not specified anywhere, modify_level() knows how large N should be based on the number of countries it finds in the dataset. It is important, then, to ensure that the modify_level() command is correctly assigned to the level of interest. We can also modify more than one level.
Here, we modify our country-province-citizen data from above:
new_citizen_data <-
  fabricate(
    data = citizen_data,
    countries = modify_level(average_temperature = runif(N, 30, 80)),
    provinces = modify_level(
      conflict_zone = draw_binary(N, prob = 0.2 + natural_resources * 0.3),
      infant_mortality = runif(N, 0, 10) + conflict_zone * 10 +
        (average_temperature > 70) * 10
    ),
    citizens = modify_level(
      college_degree = draw_binary(N, prob = 0.4 - (0.3 * conflict_zone))
    )
  )
Before assessing what this tells us about modify_level(), let’s consider what the simulation does. It creates a new variable at the country level: a country-level average temperature. Subsequently, it creates a province-level binary indicator for whether the province is an active conflict site. Provinces that have natural resources are more likely to be in conflict in this simulation, drawing on conclusions from literature on “resource curses”. The infant mortality rate for the province is able to depend both on province-level data we have just generated and on country-level data: it is higher in high-temperature areas (reflecting literature on increased disease burden near the equator) and also higher in conflict zones. Citizen access to education is also random, but depends on whether they live in a conflict area.
There are a lot of things to learn from this example. First, it’s possible to modify multiple levels. Any new variable created will automatically propagate to the lower-level data accordingly: when an average temperature is set for a country, all provinces, and all citizens of those provinces, carry the value for the country. Values created from one modify_level() call can be used in subsequent variables of the same call, or subsequent calls.
Again, we see the use of draw_binary(). Using this function is covered in our tutorial on generating discrete random variables, linked below.
Averages within higher levels of hierarchy
A powerful feature of nested data and fabricatr’s setup is that variable creation can access variables from higher in the hierarchy.
You may want to include the mean value of a variable within a group defined by a higher level of the hierarchy, for example the average income of citizens within a city. You can do this with ave(), a built-in R command:
ave_example <- fabricate(
  cities = add_level(N = 2),
  citizens = add_level(
    N = 1:2,
    income = rnorm(N),
    income_mean_city = ave(income, cities)
  )
)
ave_example
cities | citizens | income | income_mean_city |
---|---|---|---|
1 | 1 | -1.39 | -1.39 |
2 | 2 | -0.16 | 0.73 |
2 | 3 | 1.62 | 0.73 |
Here, we can create citizen-level data which relies on the data of other citizens within the same city. ave() takes two arguments: first, the name of the variable we are averaging on (in this case, income), and second, the name of the level we are grouping by (in this case, cities). Other R functions which are able to group by variables to compute statistics of interest are also compatible with fabricatr.
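The behaviour of ave() is easiest to see outside of fabricate(), on fixed numbers. The following base R sketch uses made-up income and city vectors purely for illustration; it also shows the FUN argument, which swaps the mean for another summary function:

```r
# Base R only: group summaries are broadcast back to every row of the group.
income <- c(10, 20, 30, 50)
city <- c(1, 1, 2, 2)

income_mean_city <- ave(income, city)            # default FUN is mean
income_max_city <- ave(income, city, FUN = max)  # other summary functions also work

income_mean_city  # 15 15 40 40
income_max_city   # 20 20 50 50
```

Because the result has one value per row rather than one per group, it can be used directly as a variable definition inside add_level(), as in the example above.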
Next Steps
You’ve seen fabricatr’s ability to generate single-level and hierarchical data, which is enough to get you started on using the package. From here, you can explore more about modeling the structure of data by reading our tutorial on panel and cross-classified data or using fabricatr to bootstrap or resample hierarchical data. Or, if you would like to learn about modeling specific variables using fabricatr, you can read our tutorial on common social science variables; our technical manual on generating discrete random variables; or our guide on using other data generation packages with fabricatr.