More complicated level creation with variable numbers of observations

add_level() can be used to create more complicated patterns of nesting. For example, when creating lower level data, it is possible to use a different N for each of the values of the higher level data:

variable_data <-
  fabricate(
    cities = add_level(N = 2, elevation = runif(n = N, min = 1000, max = 2000)),
    citizens = add_level(N = c(2, 4), age = runif(N, 18, 70))
  )
variable_data
cities elevation citizens age
1 1778 1 46
1 1778 2 50
2 1499 3 35
2 1499 4 65
2 1499 5 34
2 1499 6 23

Here, each city has a different number of citizens, and the value of N used to create the age variable updates automatically as needed. The result is a dataset with 6 citizens, 2 in the first city and 4 in the second. As long as N is either a single number or a vector of the same length as the current lowest level of the data, add_level() will know what to do.
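One quick way to confirm the nesting is to count rows per city. The following is a minimal sketch, assuming fabricatr is installed; the seed and the name citizens_per_city are arbitrary choices for illustration.

```r
library(fabricatr)

set.seed(42)  # arbitrary seed so the draws are repeatable

variable_data <- fabricate(
  cities = add_level(N = 2, elevation = runif(n = N, min = 1000, max = 2000)),
  citizens = add_level(N = c(2, 4), age = runif(N, 18, 70))
)

# Count citizens per city; the counts should mirror the vector passed to N
citizens_per_city <- table(variable_data$cities)
citizens_per_city
```

The counts appear in the same order as the cities, so a mismatch between the length of N and the number of cities is easy to spot.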

It is also possible to provide a function call to N, enabling a random number of citizens per city:

my_data <-
  fabricate(
    cities = add_level(N = 2, elevation = runif(n = N, min = 1000, max = 2000)),
    citizens = add_level(N = sample(1:6, size = 2, replace = TRUE), age = runif(N, 18, 70))
  )
my_data
cities elevation citizens age
1 1850 1 53
1 1850 2 55
2 1128 3 45
2 1128 4 42
2 1128 5 47
2 1128 6 69
2 1128 7 40
2 1128 8 18

Here, each city is given a random number of citizens between 1 and 6. Since the sample() function returns a vector of length 2, this is like specifying 2 separate Ns as in the example above.
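To make the equivalence with a vector of Ns explicit, the draws can be stored first and then passed to N. This is a sketch under the assumption that add_level() can see objects defined in the calling environment; n_per_city is a hypothetical name.

```r
library(fabricatr)

set.seed(123)  # arbitrary seed for repeatability

# Draw one citizen count per city up front (hypothetical helper object)
n_per_city <- sample(1:6, size = 2, replace = TRUE)

my_data <- fabricate(
  cities = add_level(N = 2, elevation = runif(n = N, min = 1000, max = 2000)),
  citizens = add_level(N = n_per_city, age = runif(N, 18, 70))
)

# The total number of rows should equal the sum of the per-city draws
nrow(my_data) == sum(n_per_city)
```

Storing the draws separately also keeps a record of how many citizens each city was assigned.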

It is also possible to define N on the basis of higher level variables themselves. Consider the following example:

variable_n <- fabricate(
  cities = add_level(N = 5, population = runif(N, 10, 200)),
  citizens = add_level(N = round(population * 0.3))
)
head(variable_n)
cities population citizens
1 133 001
1 133 002
1 133 003
1 133 004
1 133 005
1 133 006

Here, each city has a defined population, and the number of citizens in our simulated data reflects a sample of 30% of that population. Although we only display the first 6 rows for brevity’s sake, the first city has 40 rows in total (30% of its population of about 133, rounded).
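The relationship between population and the number of simulated citizens can be checked city by city. The sketch below assumes fabricatr is installed; n_by_city, pop_by_city, and expected are illustrative names, not part of the package.

```r
library(fabricatr)

set.seed(1)  # arbitrary seed for repeatability

variable_n <- fabricate(
  cities = add_level(N = 5, population = runif(N, 10, 200)),
  citizens = add_level(N = round(population * 0.3))
)

# Number of citizen rows generated for each city
n_by_city <- tapply(variable_n$citizens, variable_n$cities, length)

# Population is constant within a city, so take the first value per city
pop_by_city <- tapply(variable_n$population, variable_n$cities, function(p) p[1])

# Each city should contain round(30% of its population) citizens
expected <- round(pop_by_city * 0.3)
all(n_by_city == expected)
```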

Finally, relying on the ID label from the higher level, it is also possible to define N on the basis of the higher level’s length:

n_inherit <- fabricate(
  cities = add_level(N = 5, population = runif(N, 10, 200)),
  citizens = add_level(N = sample(1:10, length(cities), replace=TRUE))
)

Here, each city has a random number of citizens from 1 to 10, but we need to supply the length of the higher level’s variable (in this case, the ID label cities) to the sample function to ensure that one draw is made per city.

Correlated variables with custom functions

Some users might be interested in drawing correlated variables from distributions that are not among R’s default statistical distributions or the functions supplied by fabricatr. Any function can be made to work with correlate(), provided it accepts an argument called quantile_y, through which correlate() passes a series of quantiles to draw from the distribution of interest.

As an example, you might have some external data that represents an empirical distribution you wish to draw from. Here we use actual county-level vote data from the 2016 US presidential election to generate a vote share correlated with a simulated poverty rate:

# Load external data: Thanks to Tony McGovern, https://github.com/tonmcg
county_level_2016_results <- read.csv(url("https://raw.githubusercontent.com/tonmcg/County_Level_Election_Results_12-16/master/2016_US_County_Level_Presidential_Results.csv"))

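# Function that takes quantile_y and maps to the empirical quantiles of dataset.
# quantile() is the inverse of ecdf(): it turns probabilities into data values.
custom_quantile <- function(data, quantile_y) {
  round(quantile(data, probs = quantile_y, names = FALSE, na.rm = TRUE), 2)
}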

# Traditional fabricate() call:
county_vote_data <- fabricate(
  N = 500,
  poverty_rate = runif(N, min = 0.01, max = 0.40),
  dem_vote = correlate(custom_quantile, 
                       data = county_level_2016_results$per_dem,
                       given = poverty_rate, 
                       rho = 0.3)
)

cor(county_vote_data$dem_vote, county_vote_data$poverty_rate, method="spearman")

0.34
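The same pattern can be tried without downloading anything. The sketch below is an illustration under stated assumptions: observed_shares is a made-up stand-in for an empirical sample, and empirical_draw is a hypothetical helper that uses base R's quantile() to turn the quantiles supplied through quantile_y into values from that sample.

```r
library(fabricatr)

set.seed(2016)  # arbitrary seed for repeatability

# Made-up stand-in for an external empirical distribution (values in [0, 1])
observed_shares <- rbeta(1000, shape1 = 2, shape2 = 5)

# Hypothetical draw function: must accept an argument named quantile_y
empirical_draw <- function(data, quantile_y) {
  quantile(data, probs = quantile_y, names = FALSE)
}

sim <- fabricate(
  N = 500,
  x = runif(N),
  y = correlate(empirical_draw, data = observed_shares, given = x, rho = 0.3)
)

# Rank correlation should land near the requested rho, up to sampling noise
cor(sim$x, sim$y, method = "spearman")
```

Because the draws are quantiles of observed_shares, every simulated y stays within the range of the source data.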

Tidyverse integration

Because the functions in fabricatr take data and return data, they are cross-compatible with a tidyverse workflow. Here is an example of using magrittr’s pipe operator (%>%) and dplyr’s group_by and mutate verbs to add new data.

library(dplyr)

my_data <-
  fabricate(
    cities = add_level(N = 2, elevation = runif(n = N, min = 1000, max = 2000)),
    citizens = add_level(N = c(2, 3), age = runif(N, 18, 70))
  ) %>%
  group_by(cities) %>%
  mutate(pop = n())

my_data
cities elevation citizens age pop
1 1988 1 47 2
1 1988 2 32 2
2 1899 3 60 3
2 1899 4 66 3
2 1899 5 52 3
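Other dplyr verbs slot in the same way. As a brief sketch, assuming dplyr and fabricatr are installed, summarise() can collapse the citizen-level data back to one row per city; n_citizens and mean_age are illustrative column names.

```r
library(fabricatr)
library(dplyr)

set.seed(7)  # arbitrary seed for repeatability

city_summary <-
  fabricate(
    cities = add_level(N = 2, elevation = runif(n = N, min = 1000, max = 2000)),
    citizens = add_level(N = c(2, 3), age = runif(N, 18, 70))
  ) %>%
  group_by(cities) %>%
  # One row per city: citizen count and average age
  summarise(n_citizens = n(), mean_age = mean(age))

city_summary
```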

It is also possible to use the pipe operator (%>%) to direct the flow of data between fabricate() calls. Remember that every fabricate() call can import existing data frames, and every call returns a single data frame.

my_data <-
  tibble(Y = sample(1:10, 2)) %>%
  fabricate(lower_level = add_level(N = 3, Y2 = Y + rnorm(N)))
my_data
Y lower_level Y2
10 1 9.2
10 2 9.8
10 3 9.6
4 4 4.8
4 5 5.2
4 6 4.3
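The pipe is a convenience rather than a requirement. A minimal sketch of the same import without a pipe, assuming the existing data is passed through fabricate()'s data argument; upper and nested are hypothetical names, and the Y values are fixed here so the result is easy to read.

```r
library(fabricatr)

set.seed(99)  # arbitrary seed for repeatability

# Existing data to import: one row per upper-level unit
upper <- data.frame(Y = c(10, 4))

# Nest three lower-level rows under each imported row
nested <- fabricate(
  data = upper,
  lower_level = add_level(N = 3, Y2 = Y + rnorm(N))
)

nested
```

Each imported row is repeated once per lower-level unit, so Y stays constant within a group while Y2 varies around it.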