`../vignettes/variable_generation.Rmd`

`variable_generation.Rmd`

**fabricatr** provides convenient helper functions to generate discrete random variables far more easily than using R’s built-in data generation mechanisms. Below we introduce you to the types of data you can generate using **fabricatr**.

The simplest possible type of data is a binary random variable (also called a bernoulli random variable). Generating a binary random variable requires only one parameter `prob`

which specifies the probability that outcomes drawn from this variable are equal to 1. By default, `draw_binary()`

will generate `N = length(prob)`

draws. `N`

can also be specified explicitly. Consider these examples:

```
draw_binary_ex <- fabricate(
N = 3, p = c(0, .5, 1),
binary_1 = draw_binary(prob = p),
binary_2 = draw_binary(N = 3, prob = 0.5)
)
```

In addition to binary variables, you can make data from repeated Bernoulli trials (“binomial” data). This requires using the `draw_binomial()`

function and specifying an argument `trials`

, equal to the number of trials.

```
binomial_ex <- fabricate(
N = 3,
freethrows = draw_binomial(N = N, prob = 0.5, trials = 10)
)
```

Some researchers may be interested in specifying probabilities through a “link function”. This can be done in any of your data generating functions through the `link`

argument. The default link function is “identity”, but we also support “logit”, and “probit”. These link functions transform continuous and unbounded latent data into probabilities of a positive outcome. If you are specifying a link function, you should also specify your latent variable as the `latent`

argument.

```
bernoulli_probit <- fabricate(
N = 3, x = 10 * rnorm(N),
binary = draw_binary(latent = x, link = "probit")
)
```

Some researchers may be interested in generating ordered outcomes – for example, Likert scale outcomes. You can do this with the `draw_ordered()`

function. Ordered variables require a vector of breakpoints, supplied as the argument `breaks`

– points at which the underlying latent variable switches from category to category. The first break should always be below the lower bound of the data, while the final break should always be above the upper bound of the data – if breaks do not cover the data, `draw_ordered()`

will attempt to correct this by adding breaks where appropriate.

In the following example, each of three observations has a latent variable `x`

which is continuous and unbounded. The variable `ordered`

transforms `x`

into three numeric categories: 1, 2, and 3. All values of `x`

below -1 result in `ordered`

1; all values of `x`

between -1 and 1 result in `ordered`

2; all values of `x`

above 1 result in `ordered`

3:

```
ordered_example <- fabricate(
N = 3,
x = 5 * rnorm(N),
ordered = draw_ordered(x, breaks = c(-Inf, -1, 1, Inf))
)
```

Likert variables are a special case of ordered variables. Users can use `draw_ordered()`

with properly specified breaks and break labels to generate Likert data, or use the `draw_likert()`

function as a convenient alias:

```
survey_data <- fabricate(
N = 100,
Q1 = draw_likert(x = rnorm(N)),
Q2 = draw_likert(x = rnorm(N)),
Q3 = draw_likert(x = rnorm(N))
)
```

`draw_likert()`

takes one compulsory argument (`x`

, which represents the latent variable being transformed into ordered data). By default, `draw_likert()`

provides a 7-item Likert scale with breaks at [-\(\infty\), -2.5, -1.5, -0.5, 0.5, 1.5, 2.5, \(\infty\)]. Users can explicitly specify the `type`

argument to use other types of Likert data. Supported types are 4, 5, and 7. Default breaks for 5-item Likert scales are [-\(\infty\), -1.5, -0.5, 0.5, 1.5, \(\infty\)]. Default breaks for 4-item Likert scales are [-\(\infty\), -1, 0, 1, \(\infty\)].

Optionally, users can specify their own breaks. These will override the `type`

command and scale type will be detected based on the length of the `break`

argument. As above, a `break`

argument with 8 values will produce a 7-item Likert scale, one with 6 values will produce a 5-item Likert scale, and one with 5 values will produce a 4-item Likert scale.

Labels are automatically provided by `draw_likert()`

. The default 7-item Likert scale uses the labels [“Strongly Disagree”, “Disagree”, “Lean Disagree”, “Don’t Know / Neutral”, “Lean Agree”, “Agree”, “Strongly Agree”].

Examples of how users might use the function are available below:

```
survey_data <- fabricate(
N = 100,
Q1 = draw_likert(x = rnorm(N), type = 7),
Q2 = draw_likert(x = rnorm(N), type = 5),
Q3 = draw_likert(x = rnorm(N), type = 4),
Q4 = draw_likert(x = rnorm(N), breaks = c(-Inf, -0.8, 0, 1, 2, Inf))
)
table(survey_data$Q2)
```

Strongly Disagree | Disagree | Don’t Know / Neutral | Agree | Strongly Agree |
---|---|---|---|---|

8 | 18 | 45 | 21 | 8 |

This function is a convenient, quick alias for creating likert variables with these labels. Users who want more flexibility with respect to break labels or number of breaks should use `draw_ordered()`

and specify breaks and break labels explicitly.

Some variables of interest are modeled as quantile outcomes; for example, income data in a survey might be reported or analyzed by income deciles. We provide two simple helper functions for this purpose: `draw_quantile()`

which allows you to generate quantile outcomes without underlying data, and `split_quantile()`

which allows you to transform underlying data into quantile outcomes. This is particularly relevant for survey data that has been made granular to protect respondent privacy.

In this example, we will generate income data for a population and then transform it into deciles:

```
population <- fabricate(
N = 1000,
income = 20000 * rgamma(N, 1.4, 0.65),
income_decile = split_quantile(income, type = 10)
)
```

The `type`

paramater of the `split_quantile()`

function specifies how many bins to split the data into. In the decile example, 10. Note that the resulting quantiles will be based on the sample data, not the parameterization of the population. It would also be possible to skip the step of characterizing the income data by directly modeling income deciles:

This parameterization of the income variable was derived by researching U.S. individual income data and using numerical optimization to derive the parameterization that best approximated a series of quantiles from that data. Many other data generation packages, including **wakefield**, implement automatic generation of income data. For more information about using **fabricatr** with other data generating packages, see our online tutorial

`draw_count()`

allows you to create Poisson-distributed count outcomes. These require that the user specify the parameter `mean`

, equal to the Poisson distribution mean (often referred to as `lambda`

in statistical formulations of count data).

```
count_outcome_example = fabricate(N = 3,
x = c(0, 5, 100),
count = draw_count(mean = x))
```

`draw_categorical()`

can generate non-ordered, categorical data. Users must provide a vector of probabilities for each category (or a matrix, if each observation should have separate probabilities).

If probabilities do not sum to exactly one, they will be normalized, but negative probabilities will cause an error.

In the first example, each unit has a different set of probabilities and the probabilities are provided as a matrix:

```
categorical_example <- fabricate(
N = 6,
p1 = runif(N, 0, 1),
p2 = runif(N, 0, 1),
p3 = runif(N, 0, 1),
cat = draw_categorical(N = N, prob = cbind(p1, p2, p3))
)
```

In the second example, each unit has the same probability of getting a given category. `draw_categorical()`

will issue a warning to remind you that it is interpreting the vector in this way.

```
warn_draw_cat_example <- fabricate(
N = 6,
cat = draw_categorical(N = N, prob = c(0.2, 0.4, 0.4))
)
```

“categorical” variables can also use link functions, for example to generate multinomial probit data.

*This functionality is EXPERIMENTAL, and we cannot guarantee its properties for all data structures. Be sure to diagnose your design and assess the distributions of your variables.*

**fabricatr** makes it easy to generate correlated random variables one at a time. All you need is a reference variable, a desired rank correlation coefficient (from -1 to 1), and a `draw_`

function to specify the target distribution of the correlated variable.

This is done using a function called `correlate()`

, which takes four distinct arguments: `draw_handler`

, which is the un-quoted name of your desired outcome distribution (e.g. `draw_binary`

, `draw_binomial`

, `draw_count`

); any parameters needed for that distribution (e.g. `mean`

for `draw_count`

); `given`

, which is the reference variable your new variable will be correlated with, and `rho`

, which is the rank correlation coefficient you are targeting.

In this example, we will generate individuals, who are given a battery of public opinion questions. Their responses are aggregated into index scores in each of several categories: conservative values, economic conservatism, and foreign policy interventionism.

```
respondents <- fabricate(
N = 100,
conservative_values = draw_binomial(prob = 0.4, trials = 20, N = N),
economic_conservative = correlate(given = conservative_values,
rho = 0.5,
draw_binomial,
prob = 0.6,
trials = 20),
foreign_policy = correlate(given = conservative_values,
rho = 0.3,
draw_binomial,
prob = 0.4,
trials = 20)
)
```

Unpacking this example; first, the respondents are asked 20 questions and have a 40% chance of answering yes to each question. The result is summed into an index. Next, the respondents are asked 20 questions and have a 60% chance of answering yes to each question, but result is also constrained to be rank-correlated at `rho = 0.5`

with their score on the `conservative_values`

index. Finally, 20 more questions are asked of the respondents about foreign policy, where they have a 40% chance of answering yes, and the index score is somewhat more weakly correlated with `conservative_values`

.

The argument `N`

is not necessary in the resulting distribution because it is automatically detected from the variable specified in `given`

.

We can analyze the data to verify that, subject to the limitations of random sampling, the results reflect the specification. First, let’s verify that our means are approximately 0.4, 0.6, and 0.4:

`apply(respondents[2:4], 2, mean) / 20`

x | |
---|---|

conservative_values | 0.42 |

economic_conservative | 0.60 |

foreign_policy | 0.40 |

Next, let’s check the correlation between our variables of interest:

`cor(respondents[, 2:4], method="spearman")[1, ]`

x | |
---|---|

conservative_values | 1.00 |

economic_conservative | 0.44 |

foreign_policy | 0.24 |

It is also possible to use `correlate()`

with other variable generation functions, such as those from base `R`

or even custom functions. Custom functions must take an argument `quantile_y`

, which is a vector that supplies quantiles of the distribution to draw from. Base `R`

functions are supported as described here:

```
student_report_card <- fabricate(
N = 100,
math_score = rnorm(N, mean = 80, sd = 5),
chemistry_score = correlate(given = math_score, rho = 0.8,
rnorm, mean = 75, sd = 5),
civics_score = correlate(given = math_score, rho = 0.6,
rnorm, mean = 85, sd = 5)
)
```

Please observe that the first argument to correlate, `draw_handler`

, must be supplied as an unquoted name of a variable generating function. By default, this function works with any of the base `R`

distributions; for information about using a third party or custom data generating function, see our advanced features tutorial

A note about correlated data: Although the method used will generate data that is correlated on average with rank correlation `rho`

, some variance is normal, and distributions which have less flexibility will be less likely to achieve the target correlation. For example, if the first variable is normally distributed with a wide range of values, and the target variable is a binary 0/1 variable, the correlation will be imprecise by the lack of flexibility in the target variable.

We also provide helper functions to generate cluster-correlated random variables with fixed intra-cluster correlation (ICC) values. Our two functions `draw_binary_icc()`

and `draw_normal_icc()`

allow you to generate both discrete binary data with fixed ICCs and normal data with fixed ICCs.

`draw_binary_icc()`

takes three required arguments: `prob`

, a probability or vector of probabilities which determine the chance a given observation will be a 1; `clusters`

, a map of units to clusters (required to generate the correlation structure); and `ICC`

, the fixed intra-cluster correlation (from 0 to 1). Users may optionally specify `N`

; if it is not specified, `draw_binary_icc()`

will determine it based on the length of the `clusters`

vector.

Consider the following example, which models whether individuals smoke:

```
# 100 individual population, 20 each in each of 5 clusters
clusters = rep(1:5, 20)
# Individuals have a 20% chance of smoking, but clusters are highly correlated
# in their tendency to smoke
smoker = draw_binary_icc(prob = 0.2, clusters = clusters, ICC = 0.5)
# Observe distribution of smokers and non-smokers
table(smoker)
```

0 | 1 |
---|---|

76 | 24 |

We see that approximately 20% of the population smokes, in line with our specification, but what patterns of heterogeneity do we see by cluster?

`table(clusters, smoker)`

0 | 1 |
---|---|

16 | 4 |

19 | 1 |

18 | 2 |

4 | 16 |

19 | 1 |

Here we learn that of our 5 clusters, 4 are overwhelmingly non-smokers, while a fifth is composed of 80% smokers.

We can also specify separate mean for each cluster; but it is worth noting that the higher the ICC, the more the cluster mean will depart from the nominal cluster mean.

If you do not specify a vector of probabilities or a correlation coefficient, the default values are probability 0.5 for each cluster and ICC of 0.5. If you do not specify cluster IDs, the function will return an error.

`draw_normal_icc()`

takes four required arguments: `mean`

, a mean or vector of means, one for each cluster; `clusters`

, a map of units to clusters (required to generate the correlation structure); `ICC`

, the fixed intra-cluster correlation coefficient; and `sd`

, a standard deviation or vector of standard deviations, one for each cluster. Users can optionally specify `N`

, a number of units, but if it is not supplied `draw_normal_icc()`

will determine it based on the length of the `clusters`

vector.

If `sd`

is not supplied, each cluster will be assumed to have a within-cluster standard deviation of 1 and the `sd_between`

will be implied by this `sd`

and the `ICC`

parameter. If `mean`

is not supplied, each cluster will be assumed to be mean zero.

Here, we model student academic performance by cluster:

```
# 100 students, 10 each in 10 clusters
clusters <- rep(1:5, 20)
numeric_grade <- draw_normal_icc(mean = 80, clusters = clusters, ICC = 0.5, sd = 15)
letter_grade <- draw_ordered(
x = numeric_grade,
breaks = c(-Inf, 60, 70, 80, 90, Inf),
break_labels = c("F", "D", "C", "B", "A")
)
mean(numeric_grade)
```

84.12

The mean grade matches the population mean. Now let’s look at the relationship between cluster and letter grade to observe the cluster pattern:

`table(letter_grade, clusters)`

F | D | C | B | A |
---|---|---|---|---|

0 | 0 | 3 | 2 | 15 |

0 | 3 | 5 | 9 | 3 |

5 | 3 | 8 | 2 | 2 |

4 | 7 | 2 | 5 | 2 |

0 | 0 | 0 | 2 | 18 |

It is obvious upon inspection that two of the clusters contain academic high-performers, while two of the clusters have a substantial failure rate. Although each cluster has the same mean in expectation, the induced intra-cluster correlation forces some clusters higher and others lower.

Alternatively, users can specify `ICC`

and `total_sd`

. The resulting variable will be rescaled to have a standard deviation exactly equal to `total_sd`

.

If you are interested in reading more about how to generate specific variables with **fabricatr**, you can read our tutorial on common social science variables, or learn how to use other data-generating packages with **fabricatr**.

If you are interested in learning how to import or build data, you can read our introduction to building and importing data. More advanced users can read our tutorial on generating panel or cross-classified data. You can also learn about bootstrapping and resampling hierarchical data.

When generating correlated data using `correlate()`

, the following approach is used, where X is the source variable, F is the empirical distribution of X, G is the target distribution, and Y is the realized variable drawn from that distribution:

- Calculate the quantiles of the observed X in the empirical distribution X
- Draw these quantiles from a standard normal distribution
- Generate a standard normal version of Y based on the specified relationship.
- Map the standard normal Y into the quantiles of the standard normal distribution
- Draw from the target distribution at those quantiles.

Because all transformations are affine, the rank order correlation induced in the standard normal stage is preserved in the target distribution. Mathematically:

\[ \begin{aligned} X_{quan} &= F^{-1}(X) \\ X_{std} &= \Phi(X_{quan}) \\ Y_{std} &\sim N(\rho \times X_{std}, (1 - \rho^2)) \\ Y_{quan} &= \Phi^{-1}(Y_{std}) \\ Y &= G(Y_{quan}) \end{aligned} \]

One thing that may be counterintuitive about this approach is that the correlation is based on the sample characteristics of X – the empirical data X observed without reference to its distribution of origin – but the population characteristics of Y. This choice was chosen because it minimizes the amount of information required of users and maximizes the flexibility of source variables, although it may result in some imprecision on a given draw owing to the finite sample size of X (or chance deviation of X from the desired distribution during sampling).

When generating binary data with a fixed ICC, we use this formula, where \(i\) is a cluster and \(j\) is a unit in a cluster:

\[ \begin{aligned} z_i &\sim \text{Bern}(p_i) \\ u_{ij} &\sim \text{Bern}(\sqrt{\rho}) \\ x_{ij} &= \begin{cases} x_{ij} \sim \text{Bern}(p_i) & \quad \text{if } u_{ij} = 1 \\ z_i & \quad \text{if } u_{ij} = 0 \end{cases} \end{aligned} \]

In expectation, this guarantees an intra-cluster correlation of \(\rho\) and a cluster proportion of \(p_i\). This approach is derived from Hossain, Akhtar and Chakraborti, Hrishikesh. “ICCBin: Facilitates Clustered Binary Data Generation, and Estimation of Intracluster Correlation Coefficient (ICC) for Binary Data”, available on CRAN or GitHub

When generating normal data with a fixed ICC, we follow this formula, again with \(i\) as a cluster and \(j\) as a unit in the cluster:

\[ \begin{aligned} \sigma^2_{\alpha i} &= \frac{(\rho * \sigma^2_{\epsilon i})}{(1 - \rho)} \\ \alpha_i &\sim \mathcal{N}(0, \sigma^2_{\alpha i}) \\ \mu_{ij} &\sim \mathcal{N}(\mu_i, \sigma^2_{\epsilon i}) \\ x_{ij} &= \mu_{ij} + \alpha_i \end{aligned} \]

In expectation, this approach guarantees an intra-cluster correlation of \(\rho\), a cluster mean of \(\mu_{i}\), and a cluster-level variance in error terms of \(\sigma^2_{\epsilon i}\). This approach is specified on StatsExchange.