Although much of fabricatr is designed to generate hierarchical, nested data, we also provide functions to generate panel and cross-classified (partially nested) data. In this vignette, we offer instructions to do both.
Let’s begin by visualizing an example of panel data, where we have data for each of several countries in each of several years.
This data arrangement is somewhat more complex than traditional nested, hierarchical data. Each observation draws from a common pool of countries and years. In other words, the year-specific variables attached to Country A’s 1997 observation are also attached to Country B’s 1997 observation.
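To make the shape of a panel concrete, here is a minimal base R sketch that does not use fabricatr at all. The country names, years, and variable values are made up for illustration. It crosses a small country table with a small year table and shows that every 1997 row carries the same year-specific value.

```r
# Hypothetical toy data: three countries and four years.
countries <- data.frame(
  country = c("A", "B", "C"),
  country_fe = c(1.5, 2.0, 0.5)
)
years <- data.frame(
  year = 1996:1999,
  year_shock = c(0.1, -0.3, 0.4, 0.0)
)

# A panel is the full crossing of the two tables: every country appears in
# every year. merge() with by = NULL returns the Cartesian product.
panel <- merge(countries, years, by = NULL)

# Country A's 1997 row and Country B's 1997 row share the same year_shock.
panel[panel$year == 1997, ]
```

The crossing produces one row per country-year pair, which is the structure the rest of this section builds with fabricatr.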
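The steps for generating a panel in fabricatr are as follows:
1. Generate the non-nested data frames (for example, countries and years).
2. Use the cross_levels() function to join the non-nested data frames to make a panel.
3. Optionally, add new variables to the resulting panel.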
First, we need to generate country and year data. By default, fabricatr fully nests subsequent levels under the first level call. Here, we must explicitly not do this.
Note that a fabricate() call containing only these two levels is incomplete: we have specified the two non-nested data frames, but not yet told fabricatr what to do with them. The second (and any subsequent) non-nested level should contain the nest = FALSE argument; otherwise, years would be interpreted as a level nested within countries. Each level tracks its own variables, so you can add as many features to each level as you like.
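As a sketch, the two non-nested levels described above could be declared as follows. The level sizes and the variable names gdp_base and year_shock are illustrative assumptions. Because no join step is given yet, the call should simply return the most recently generated data frame (here, years).

```r
library(fabricatr)

# Sketch only: N values and variable names are illustrative assumptions.
staged <- fabricate(
  countries = add_level(N = 10, gdp_base = runif(N, 15, 22)),
  # nest = FALSE keeps years from being nested within countries.
  years = add_level(N = 20, year_shock = runif(N, 1, 10), nest = FALSE)
)

# Without a join step, only the last level is returned.
head(staged)
```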
It is also possible to import multiple non-nested data frames; this will allow you to assemble pre-existing data sources however you would like. Recall that the first argument to a fabricate() call is the data you wish to import. We have previously seen that it is possible to import a single data frame this way, but it is also possible to import a list of data frames, staging them all for use in cross-classifying data. On its own, such a fabricate() call is incomplete: we have imported the data we wish to cross-classify on, but not yet specified how to merge the data. If you do not specify how to merge the data, fabricate() will simply return the most recent data frame imported or generated, unmodified.
Specifying a merge function to create a panel is simple. You need only tell fabricatr which levels you wish to merge, and then you will have an assembled panel and can generate new variables at the observation level. We do this using a call to cross_levels(). cross_levels() takes a single required argument, which is of the form by = join(...). This join command tells fabricatr how to assemble your data. For example, by = join(countries, years) tells it to join the countries data frame to the years data frame, resulting in country-year observations.
Just like with regular add_level() commands, you can add new variables, which have full access to the existing columns.
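Putting the pieces together, a small panel could be assembled as in the sketch below. The level sizes and the variable names country_fe, year_shock, and outcome are illustrative assumptions. The final level crosses countries with years and adds an observation-level variable built from columns of both.

```r
library(fabricatr)

# Sketch only: small N values chosen so the result is easy to inspect.
panel <- fabricate(
  countries = add_level(N = 5, country_fe = runif(N, 1, 10)),
  years = add_level(N = 4, year_shock = runif(N, 1, 10), nest = FALSE),
  obs = cross_levels(
    # Cross every country with every year.
    by = join(countries, years),
    # New observation-level variable using columns from both levels.
    outcome = country_fe + year_shock + rnorm(N, 0, 2)
  )
)

# One row per country-year pair: 5 * 4 = 20 rows.
nrow(panel)
```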
All of the functions specified above work when joining more than two levels. We could, for instance, extend our student example to include college quality. Nothing changes about the join syntax beyond the addition of the third or subsequent variable names:
    three_data <- fabricate(
      primary_schools = add_level(N = 20, ps_quality = runif(N, 1, 10)),
      secondary_schools = add_level(N = 15, ss_quality = runif(N, 1, 10), nest = FALSE),
      colleges = add_level(N = 50, c_quality = runif(N, 1, 10), nest = FALSE),
      students = link_levels(
        N = 1500,
        by = join(ps_quality, ss_quality, c_quality, rho = 0.2),
        earning_potential = 20000 + (2000 * ps_quality) + (6000 * ss_quality) +
          (10000 * c_quality) + rnorm(N, 0, 5000)
      )
    )
One potential source of failure is specifying an invalid rho. If you specify a rho that makes the correlation between the three variables impossible to obtain, the fabricate() call will fail. A common case is specifying a negative rho with three or more levels: if A is strongly negatively correlated with B, and B is strongly negatively correlated with C, then A and C tend to move together and cannot also be strongly negatively correlated.
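The constraint behind this failure can be stated precisely. The following is a sketch of the mathematical feasibility condition for a single common correlation rho across k variables; it is not a description of fabricatr's internal check, which may be stricter. The implied correlation matrix has ones on the diagonal and rho everywhere else, and its eigenvalues are known in closed form:

```latex
\Sigma_k(\rho) = (1 - \rho)\, I_k + \rho\, \mathbf{1}\mathbf{1}^{\top},
\qquad
\lambda_1 = 1 + (k - 1)\rho,
\quad
\lambda_2 = \dots = \lambda_k = 1 - \rho .

\Sigma_k(\rho) \succeq 0
\iff
-\frac{1}{k - 1} \le \rho \le 1 .
```

With k = 3, the lower bound is -0.5, so a value such as rho = -0.8 cannot be realized by any three variables, while any rho between 0 and 1 is feasible.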
Instead of specifying a rho correlation coefficient, users can specify a sigma correlation matrix to make the resulting correlations more sophisticated. Consider the following setup:
    sigma <- matrix(
      c(
        1, 0.4, 0.2,
        0.4, 1, 0.8,
        0.2, 0.8, 1
      ),
      ncol = 3, nrow = 3
    )

    adv_data <- fabricate(
      primary_schools = add_level(N = 20, ps_quality = runif(N, 1, 10)),
      secondary_schools = add_level(N = 15, ss_quality = runif(N, 1, 10), nest = FALSE),
      colleges = add_level(N = 50, c_quality = runif(N, 1, 10), nest = FALSE),
      students = link_levels(
        N = 1500,
        by = join(ps_quality, ss_quality, c_quality, sigma = sigma),
        earning_potential = 20000 + (2000 * ps_quality) + (6000 * ss_quality) +
          (10000 * c_quality) + rnorm(N, 0, 5000)
      )
    )
sigma must be specified as a symmetric square matrix with a diagonal of all 1s and a feasible correlation structure.
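Before passing a matrix to join(), it can help to check these requirements yourself. The base R helper below is a hypothetical sketch, not part of fabricatr. It treats "feasible" as meaning positive semi-definite, which is the standard condition for a correlation matrix but is an assumption about what fabricatr enforces. The bad_sigma matrix is an invented counterexample.

```r
# Hypothetical helper, not part of fabricatr: checks that sigma is a square,
# symmetric matrix with a unit diagonal and no negative eigenvalues.
is_valid_sigma <- function(sigma, tol = 1e-8) {
  is.matrix(sigma) &&
    nrow(sigma) == ncol(sigma) &&
    isSymmetric(sigma) &&
    all(abs(diag(sigma) - 1) < tol) &&
    min(eigen(sigma, symmetric = TRUE, only.values = TRUE)$values) >= -tol
}

# The matrix used in the example above.
sigma <- matrix(c(1, 0.4, 0.2, 0.4, 1, 0.8, 0.2, 0.8, 1), ncol = 3, nrow = 3)

# Invented counterexample: A and B, and B and C, are strongly positively
# correlated, yet A and C are strongly negatively correlated.
bad_sigma <- matrix(c(1, 0.9, -0.9, 0.9, 1, 0.9, -0.9, 0.9, 1), ncol = 3, nrow = 3)

is_valid_sigma(sigma)      # TRUE
is_valid_sigma(bad_sigma)  # FALSE
```

Checking the smallest eigenvalue catches structures that look plausible entry by entry but cannot all hold at once.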