More complicated level creation with variable numbers of observations
add_level() can be used to create more complicated patterns of nesting. For example, when creating lower level data, it is possible to use a different N for each of the values of the higher level data:
variable_data <-
  fabricate(
    cities = add_level(N = 2, elevation = runif(n = N, min = 1000, max = 2000)),
    citizens = add_level(N = c(2, 4), age = runif(N, 18, 70))
  )
variable_data
cities | elevation | citizens | age |
---|---|---|---|
1 | 1778 | 1 | 46 |
1 | 1778 | 2 | 50 |
2 | 1499 | 3 | 35 |
2 | 1499 | 4 | 65 |
2 | 1499 | 5 | 34 |
2 | 1499 | 6 | 23 |
Here, each city has a different number of citizens, and the value of N used to create the age variable automatically updates as needed. The result is a dataset with 6 citizens, 2 in the first city and 4 in the second. As long as N is either a single number or a vector of the same length as the current lowest level of the data, add_level() will know what to do.
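The "same length as the current lowest level" rule matters most once a third level is added. The following is a minimal sketch, assuming fabricatr is installed; the visits level and its counts are hypothetical and chosen only for illustration. Because the citizens level contains 6 rows in total, the N vector for the level below it needs 6 entries, one per citizen:

```r
library(fabricatr)

three_level <- fabricate(
  cities = add_level(N = 2),
  # One entry per city: 2 citizens in city 1, 4 in city 2 (6 in total).
  citizens = add_level(N = c(2, 4)),
  # One entry per citizen (hypothetical counts): 6 entries for 6 citizens.
  visits = add_level(N = c(1, 2, 1, 3, 1, 2))
)

# Number of visit rows per citizen; these should match the N vector above.
visits_per_citizen <- table(three_level$citizens)
visits_per_citizen

# The total number of rows is the sum of the lowest level's N vector.
nrow(three_level)
```

The per-citizen counts can be compared against the N vector directly, which is a quick way to confirm that each entry was matched to the intended higher-level unit.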
It is also possible to provide a function call to N, enabling a random number of citizens per city:
my_data <-
  fabricate(
    cities = add_level(N = 2, elevation = runif(n = N, min = 1000, max = 2000)),
    citizens = add_level(N = sample(1:6, size = 2, replace = TRUE), age = runif(N, 18, 70))
  )
my_data
cities | elevation | citizens | age |
---|---|---|---|
1 | 1850 | 1 | 53 |
1 | 1850 | 2 | 55 |
2 | 1128 | 3 | 45 |
2 | 1128 | 4 | 42 |
2 | 1128 | 5 | 47 |
2 | 1128 | 6 | 69 |
2 | 1128 | 7 | 40 |
2 | 1128 | 8 | 18 |
Here, each city is given a random number of citizens between 1 and 6. Since the sample() function returns a vector of length 2, this is like specifying 2 separate Ns as in the example above.
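Because the citizen counts are random, the shape of the data changes from run to run. One way to make this easier to inspect is to fix the seed and draw the counts before calling fabricate(). This is a sketch rather than a recommended workflow; the seed value and the helper names n_cities and citizens_per_city are assumptions introduced for illustration:

```r
library(fabricatr)

set.seed(42)  # arbitrary seed, used only to make the draw repeatable

# Hypothetical helper variables: draw one citizen count per city up front.
n_cities <- 2
citizens_per_city <- sample(1:6, size = n_cities, replace = TRUE)

my_data <- fabricate(
  cities = add_level(N = n_cities, elevation = runif(n = N, min = 1000, max = 2000)),
  citizens = add_level(N = citizens_per_city, age = runif(N, 18, 70))
)

# The number of rows equals the sum of the pre-drawn counts.
nrow(my_data) == sum(citizens_per_city)
```

Keeping the drawn counts in a separate vector also makes them available afterwards for checks or summaries.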
It is also possible to define N on the basis of higher level variables themselves. Consider the following example:
variable_n <- fabricate(
  cities = add_level(N = 5, population = runif(N, 10, 200)),
  citizens = add_level(N = round(population * 0.3))
)
head(variable_n)
cities | population | citizens |
---|---|---|
1 | 133 | 001 |
1 | 133 | 002 |
1 | 133 | 003 |
1 | 133 | 004 |
1 | 133 | 005 |
1 | 133 | 006 |
Here, each city has a defined population, and the number of citizens in our simulated data reflects a sample of 30% of that population. Although we only display the first 6 rows for brevity’s sake, the first city, with a population of about 133, would have 40 rows in total (round(133 * 0.3)).
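The link between population and row count can be checked for every city rather than just the first. The following sketch assumes fabricatr is installed and uses base R's tapply(); the names city_pop, city_rows and check are hypothetical:

```r
library(fabricatr)

variable_n <- fabricate(
  cities = add_level(N = 5, population = runif(N, 10, 200)),
  citizens = add_level(N = round(population * 0.3))
)

# Population of each city (the value is constant within a city).
city_pop <- as.vector(tapply(variable_n$population, variable_n$cities, unique))
# Observed number of citizen rows in each city.
city_rows <- as.vector(tapply(variable_n$citizens, variable_n$cities, length))

# Side-by-side comparison of expected and observed row counts.
check <- data.frame(
  population = round(city_pop),
  expected_rows = round(city_pop * 0.3),
  observed_rows = city_rows
)
check
```

The expected and observed columns should agree in every row, since N is evaluated once per city using that city's own population value.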
Finally, relying on the ID label from the higher level, it is also possible to define N on the basis of the higher level’s length:
n_inherit <- fabricate(
  cities = add_level(N = 5, population = runif(N, 10, 200)),
  citizens = add_level(N = sample(1:10, length(cities), replace = TRUE))
)
Here, each city has a random number of citizens from 1 to 10, but we need to supply the length of the higher level’s variable (in this case, the ID label cities) to the sample function to ensure that one draw is made per city.
Correlated variables with custom functions
Some users might be interested in drawing correlated variables generated by functions that are not among R’s default statistical distributions or the functions supplied by fabricatr. Any function can be made to work with correlate() provided it accepts an argument called quantile_y, through which correlate() passes a series of quantiles to draw from the distribution of interest.
As an example, you might have some external data that represents an empirical distribution you wish to draw from. Here we use actual county level vote data from the 2016 US presidential election to generate a vote share correlated with a simulated poverty rate:
# Load external data: Thanks to Tony McGovern, https://github.com/tonmcg
county_level_2016_results <- read.csv(url("https://raw.githubusercontent.com/tonmcg/County_Level_Election_Results_12-16/master/2016_US_County_Level_Presidential_Results.csv"))
# Function that takes quantile_y (probabilities in [0, 1]) and maps to the empirical quantiles of dataset
custom_quantile <- function(data, quantile_y) {
  round(quantile(data, probs = quantile_y, names = FALSE), 2)
}
# Traditional fabricate() call:
county_vote_data <- fabricate(
  N = 500,
  poverty_rate = runif(N, min = 0.01, max = 0.40),
  dem_vote = correlate(custom_quantile,
                       data = county_level_2016_results$per_dem,
                       given = poverty_rate,
                       rho = 0.3)
)
cor(county_vote_data$dem_vote, county_vote_data$poverty_rate, method = "spearman")
0.34
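What a quantile_y-style function is expected to do can be explored in base R without fabricatr or the external download. The sketch below is an illustration under assumptions: shares is simulated stand-in data, and empirical_quantile is a hypothetical name. It uses quantile() to map probabilities in [0, 1] back to values of the empirical distribution. Since that mapping is non-decreasing, the rank order of the incoming quantiles is preserved, which is what a rank-based (Spearman) correlation relies on:

```r
set.seed(1)  # arbitrary seed for illustration

# Simulated stand-in for an empirical distribution (hypothetical data).
shares <- rbeta(1000, shape1 = 2, shape2 = 4)

# Takes probabilities in [0, 1] and returns the matching empirical
# quantiles of `data`.
empirical_quantile <- function(data, quantile_y) {
  quantile(data, probs = quantile_y, names = FALSE)
}

probs <- c(0.1, 0.5, 0.9)
draws <- empirical_quantile(shares, probs)
draws

# Returned values are non-decreasing in the requested probabilities
# and stay within the observed range of the data.
!is.unsorted(draws)
all(draws >= min(shares) & draws <= max(shares))
```

Note that ecdf() goes in the opposite direction: it takes a value and returns the proportion of the data at or below it, so it is the counterpart of quantile() rather than a substitute for it.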
Tidyverse integration
Because the functions in fabricatr take data and return data, they are cross-compatible with a tidyverse workflow. Here is an example of using magrittr’s pipe operator (%>%) and dplyr’s group_by and mutate verbs to add new data.
library(dplyr)
my_data <-
  fabricate(
    cities = add_level(N = 2, elevation = runif(n = N, min = 1000, max = 2000)),
    citizens = add_level(N = c(2, 3), age = runif(N, 18, 70))
  ) %>%
  group_by(cities) %>%
  mutate(pop = n())
my_data
cities | elevation | citizens | age | pop |
---|---|---|---|---|
1 | 1988 | 1 | 47 | 2 |
1 | 1988 | 2 | 32 | 2 |
2 | 1899 | 3 | 60 | 3 |
2 | 1899 | 4 | 66 | 3 |
2 | 1899 | 5 | 52 | 3 |
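The same pattern extends to other dplyr verbs. As a sketch, assuming fabricatr and dplyr are installed, the fabricated data can be collapsed to one row per city with summarise(); the names city_summary and mean_age are hypothetical:

```r
library(fabricatr)
library(dplyr)

city_summary <-
  fabricate(
    cities = add_level(N = 2, elevation = runif(n = N, min = 1000, max = 2000)),
    citizens = add_level(N = c(2, 3), age = runif(N, 18, 70))
  ) %>%
  group_by(cities) %>%
  # One row per city: number of citizens and their mean age.
  summarise(pop = n(), mean_age = mean(age))

city_summary
```

Where mutate() keeps one row per citizen and repeats the city-level count, summarise() returns a city-level data frame, which may be more convenient when only aggregates are needed.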
It is also possible to use the pipe operator (%>%) to direct the flow of data between fabricate() calls. Remember that every fabricate() call can import existing data frames, and every call returns a single data frame.
my_data <-
  tibble(Y = sample(1:10, 2)) %>%
  fabricate(lower_level = add_level(N = 3, Y2 = Y + rnorm(N)))
my_data
Y | lower_level | Y2 |
---|---|---|
10 | 1 | 9.2 |
10 | 2 | 9.8 |
10 | 3 | 9.6 |
4 | 4 | 4.8 |
4 | 5 | 5.2 |
4 | 6 | 4.3 |