More complicated level creation with variable numbers of observations
add_level() can be used to create more complicated patterns of nesting. For example, when creating lower-level data, it is possible to use a different N for each of the values of the higher-level data:
variable_data <-
fabricate(
cities = add_level(N = 2, elevation = runif(n = N, min = 1000, max = 2000)),
citizens = add_level(N = c(2, 4), age = runif(N, 18, 70))
)
variable_data
cities | elevation | citizens | age |
---|---|---|---|
1 | 1778 | 1 | 46 |
1 | 1778 | 2 | 50 |
2 | 1499 | 3 | 35 |
2 | 1499 | 4 | 65 |
2 | 1499 | 5 | 34 |
2 | 1499 | 6 | 23 |
Here, each city has a different number of citizens, and the value of N used to create the age variable updates automatically. The result is a dataset with 6 citizens, 2 in the first city and 4 in the second. As long as N is either a single number or a vector with the same length as the current lowest level of the data, add_level() will know what to do.
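To make the expansion concrete, the following base R sketch mimics how a vector-valued N maps onto the higher level. It only illustrates the resulting structure and is not fabricatr's internal implementation; n_per_city, city_id, and nested are hypothetical names.

```r
# Hypothetical illustration: one N per city, as in N = c(2, 4)
n_per_city <- c(2, 4)

# Repeat each city ID once per citizen in that city
city_id <- rep(seq_along(n_per_city), times = n_per_city)

# Citizens are numbered consecutively across cities
nested <- data.frame(cities = city_id, citizens = seq_along(city_id))
nested

# Count rows per city: 2 for city 1, 4 for city 2
table(nested$cities)
```

Any lower-level variable then needs one value per row of this expanded frame, which is why N inside the age expression refers to the total of 6 rather than to 2 or 4.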
It is also possible to pass a function call to N, enabling a random number of citizens per city:
my_data <-
fabricate(
cities = add_level(N = 2, elevation = runif(n = N, min = 1000, max = 2000)),
citizens = add_level(N = sample(1:6, size = 2, replace = TRUE), age = runif(N, 18, 70))
)
my_data
cities | elevation | citizens | age |
---|---|---|---|
1 | 1850 | 1 | 53 |
1 | 1850 | 2 | 55 |
2 | 1128 | 3 | 45 |
2 | 1128 | 4 | 42 |
2 | 1128 | 5 | 47 |
2 | 1128 | 6 | 69 |
2 | 1128 | 7 | 40 |
2 | 1128 | 8 | 18 |
Here, each city is given a random number of citizens between 1 and 6. Since sample() returns a vector of length 2, this is equivalent to specifying 2 separate values of N, as in the example above.
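Because the number of citizens is now random, the shape of the data changes from run to run. A minimal base R sketch of how seeding makes such a draw repeatable follows; the seed value 42 is arbitrary and draw_city_sizes is a hypothetical helper name, not part of fabricatr.

```r
# Hypothetical helper: draw one citizen count per city under a fixed seed
draw_city_sizes <- function(seed) {
  set.seed(seed)
  sample(1:6, size = 2, replace = TRUE)
}

first <- draw_city_sizes(42)
second <- draw_city_sizes(42)

identical(first, second)  # TRUE: same seed, same draw
length(first)             # 2: one count per city
sum(first)                # total number of citizen rows this draw implies
```

Calling set.seed() before fabricate() should make the whole simulated dataset reproducible in the same way, assuming all random draws go through R's standard random number generator.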
It is also possible to define N on the basis of higher-level variables themselves. Consider the following example:
variable_n <- fabricate(
cities = add_level(N = 5, population = runif(N, 10, 200)),
citizens = add_level(N = round(population * 0.3))
)
cities | population | citizens |
---|---|---|
1 | 133 | 001 |
1 | 133 | 002 |
1 | 133 | 003 |
1 | 133 | 004 |
1 | 133 | 005 |
1 | 133 | 006 |
Here, each city has a defined population, and the number of citizens in our simulated data reflects a sample of 30% of that population. Although we only display the first 6 rows for brevity’s sake, the first city, with a population of about 133, has round(133 * 0.3) = 40 rows in total.
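The arithmetic behind the row counts can be checked in base R. This sketch uses the population of 133 from the table together with two made-up populations; the vector names are hypothetical.

```r
# First value taken from the table above; the other two are made up
population <- c(133, 50, 12)

# One citizen count per city: 30% of the population, rounded
n_citizens <- round(population * 0.3)
n_citizens       # 40 15 4

# Total number of rows in the lower level
sum(n_citizens)  # 59
```

Since the populations in the example are drawn from between 10 and 200, the rounded count is always at least 3, so no city ends up with zero citizens.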
Finally, relying on the ID label from the higher level, it is also possible to define N on the basis of the higher level’s length:
n_inherit <- fabricate(
cities = add_level(N = 5, population = runif(N, 10, 200)),
citizens = add_level(N = sample(1:10, length(cities), replace=TRUE))
)
Here, each city has a random number of citizens from 1 to 10, but we need to supply the length of the higher level’s variable (in this case, the ID label cities) to the sample function to ensure that one draw is made per city.
Correlated variables with custom functions
Some users might be interested in drawing correlated variables generated by functions that are not among the default R statistical distributions or the functions supplied by fabricatr. Any function can be made to work with correlate() provided it accepts an argument called quantile_y, through which correlate() passes a series of quantiles to draw from the distribution of interest.
As an example, you might have some external data that represents an empirical distribution you wish to draw from. Here we use actual county-level vote data from the 2016 US presidential election to generate a vote share correlated with a simulated poverty rate:
# Load external data: Thanks to Tony McGovern, https://github.com/tonmcg
county_level_2016_results <- read.csv(url("https://raw.githubusercontent.com/tonmcg/County_Level_Election_Results_12-16/master/2016_US_County_Level_Presidential_Results.csv"))
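# Function that takes quantile_y and maps it to the empirical quantiles of the dataset.
# quantile() is the inverse of ecdf(): it turns probabilities into data values.
custom_quantile <- function(data, quantile_y) {
round(quantile(data, probs = quantile_y, names = FALSE), 2)
}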
# Traditional fabricate() call:
county_vote_data <- fabricate(
N = 500,
poverty_rate = runif(N, min = 0.01, max = 0.40),
dem_vote = correlate(custom_quantile,
data = county_level_2016_results$per_dem,
given = poverty_rate,
rho = 0.3)
)
cor(county_vote_data$dem_vote, county_vote_data$poverty_rate, method="spearman")
0.34
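An approach like this can carry a rank correlation through to the new variable because a quantile function is monotone: larger quantiles map to larger values. The base R sketch below illustrates that property with simulated stand-in data rather than the election results. The names reference, empirical_quantile, u, and draws are hypothetical, and the sketch does not describe how correlate() works internally.

```r
set.seed(2016)

# Hypothetical stand-in for an external empirical distribution
reference <- rbeta(2000, shape1 = 2, shape2 = 5)

# Map probabilities in [0, 1] to values of the empirical distribution
empirical_quantile <- function(data, quantile_y) {
  quantile(data, probs = quantile_y, names = FALSE)
}

# Uniform quantiles, standing in for what would arrive via quantile_y
u <- runif(500)
draws <- empirical_quantile(reference, u)

# Draws never leave the observed range of the reference data
range(draws)
range(reference)

# The mapping preserves ordering, so the rank correlation is essentially 1
cor(u, draws, method = "spearman")
```

Any ordering present in the incoming quantiles therefore survives the transformation, which is what a Spearman correlation measures.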
Tidyverse integration
Because the functions in fabricatr take data and return data, they are cross-compatible with a tidyverse workflow. Here is an example of using magrittr’s pipe operator (%>%) and dplyr’s group_by and mutate verbs to add new data.
library(dplyr)
my_data <-
fabricate(
cities = add_level(N = 2, elevation = runif(n = N, min = 1000, max = 2000)),
citizens = add_level(N = c(2, 3), age = runif(N, 18, 70))
) %>%
group_by(cities) %>%
mutate(pop = n())
my_data
cities | elevation | citizens | age | pop |
---|---|---|---|---|
1 | 1988 | 1 | 47 | 2 |
1 | 1988 | 2 | 32 | 2 |
2 | 1899 | 3 | 60 | 3 |
2 | 1899 | 4 | 66 | 3 |
2 | 1899 | 5 | 52 | 3 |
It is also possible to use the pipe operator (%>%) to direct the flow of data between fabricate() calls. Remember that every fabricate() call can import existing data frames, and every call returns a single data frame.
my_data <-
tibble(Y = sample(1:10, 2)) %>%
fabricate(lower_level = add_level(N = 3, Y2 = Y + rnorm(N)))
my_data
Y | lower_level | Y2 |
---|---|---|
10 | 1 | 9.2 |
10 | 2 | 9.8 |
10 | 3 | 9.6 |
4 | 4 | 4.8 |
4 | 5 | 5.2 |
4 | 6 | 4.3 |