Using other data generating packages with fabricatr
Aaron Rudkin
Source:../vignettes/other_packages.Rmd
In general, fabricatr is compatible with any existing package you use to generate synthetic data, in one of two ways: you can use that package to create variables within a fabricate call, or use it to build a complete data frame that is then imported into a fabricate call. Below we provide examples for some of the most popular packages that serve this purpose.
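The two integration patterns can be sketched with base R alone. In this minimal sketch, rpois stands in for a function from any other data-generating package, and the variable name visits is purely illustrative:

```r
library(fabricatr)

# Pattern 1: call any vector-generating function inside fabricate().
# rpois() from base R stands in for a function from another package.
within_call <- fabricate(
  N = 20,
  visits = rpois(N, lambda = 3),
  treatment = draw_binary(prob = 0.5, N = N)
)

# Pattern 2: build a complete data frame first, then import it
# through the `data` argument and add variables to it.
external_df <- data.frame(visits = rpois(20, lambda = 3))
imported <- fabricate(
  data = external_df,
  treatment = draw_binary(prob = 0.5, N = N)
)
```

Both calls should yield a 20-row data frame containing visits and treatment; the only difference is where the externally generated variable enters the workflow.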
wakefield: simulating common demographic features
wakefield (by Tyler Rinker) is a popular R package for creating synthetic data. wakefield’s strength is that it can quickly generate common variables, especially human demographic features. wakefield can be integrated into a fabricatr workflow in one of two ways: using wakefield to create individual variables within a fabricate call, or using wakefield to make a data frame and importing that data frame into a fabricate call.
In this example, we create a dataset of participants in a survey experiment, using wakefield to generate the demographic variables:
library(fabricatr)
library(wakefield)
survey_experiment_df <- fabricate(
  N = 50,
  treatment = draw_binary(prob = 0.5, N = N),
  age = age(n = N),
  race = race(n = N),
  sex = sex(n = N)
)
Researchers interested in learning about wakefield’s available functionality, parameterizations, and default probabilities can read wakefield’s user guide on GitHub.
In addition to creating variables within a fabricate call, users can import completed wakefield data frames into a fabricate call:
survey_experiment_df <- r_data_frame(
  n = 50,
  age,
  race,
  sex
)
fabricatr_df <- fabricate(
  data = survey_experiment_df,
  treatment = draw_binary(prob = 0.5, N = N)
)
randomNames: Plausible names for human subjects
randomNames (by Damian Betebenner) is a package that does one thing well: generate random names for human subjects (optionally with specified genders and ethnicities). The primary use case is to generate a variable within a fabricate call. In the example below, we use fabricate to generate some other demographic data, and then randomNames to generate matching names.
library(randomNames)
experiment_data <- fabricate(
  N = 50,
  treatment = draw_binary(prob = 0.5, N = N),
  is_female = draw_binary(prob = 0.5, N = N),
  patient_name = randomNames(N, gender = is_female)
)
Note that we make use of the existing is_female variable from the fabricate call to ensure randomNames generates gender-typical names.
Modeling causality with DAGs and simcausal
Users who are familiar with the DAG (directed acyclic graph) approach to causal inference may be interested in the simcausal package, which allows users to specify a DAG model and then sample from it. Integrating this package with fabricatr is likely to involve using simcausal first to specify a model, simulating data from the model, and then importing the data into a fabricate call for further use with fabricatr.
Consider this example, common in the literature on educational attainment and school outcomes, where students come from families that have a wealth parameter, assignment to schools is based partially on wealth, and test outcomes (testoutcome) are based on both school quality and wealth.
library("simcausal")
# Define DAG
D <- DAG.empty() +
  node("wealth", distr = "rnorm",
       mean = 30000,
       sd = 10000) +
  node("schoolquality", distr = "runif",
       min = 0 + (5 * (wealth > 50000)),
       max = 10) +
  node("testoutcome", distr = "runif",
       min = 0 + 0.0001 * wealth + 0.25 * schoolquality,
       max = 10)
# Freeze DAG object
set_dag <- set.DAG(D)
# Draw data from DAG
df <- sim(set_dag, n = 100)
# Pass into fabricate call and make new variables as necessary
fabricate(data = df,
          passed_test = testoutcome > 6,
          eligible_for_snap = wealth < 25000)
Survival and duration models with simsurv
simsurv is a package dedicated to generating panel survival data. The most likely way to integrate simsurv with fabricatr is to use fabricatr to generate covariates, which can then be passed to simsurv and modeled in a hazard or duration context.
Here, our example will be a clinical trial of a cancer drug. Participants have the expected biographical data: age, gender, whether the patient smokes, the disease stage, assignment to treatment, and a KPS score (commonly used to evaluate overall patient health).
Survival data creates a ragged longitudinal survey: some patients will die during the course of the trial, removing them as observations, while others will remain alive until the end of the trial. We specify a “hazard function”, which tells simsurv how the risk of death changes over time. Covariates with positive betas increase the risk of death, while covariates with negative betas decrease it. We will track patients for 5 years after treatment.
library(simsurv)
# Simulate patient data in a clinical trial
participant_data <- fabricate(
  N = 100,
  age = runif(N, min = 18, max = 85),
  is_female = draw_binary(prob = 0.5, N = N),
  is_smoker = draw_binary(prob = 0.2 + 0.2 * (age > 50), N = N),
  disease_stage = round(runif(N, min = 1 + 0.5 * (age > 65), max = 4)),
  treatment = draw_binary(prob = 0.5, N = N),
  kps = runif(N, min = 40, max = 100)
)
# Simulate data in the survival context
survival_data <- simsurv(
  lambdas = 0.1, gammas = 0.5,
  x = participant_data,
  betas = c(is_female = -0.2, is_smoker = 1.2,
            treatment = -0.4, kps = -0.005,
            disease_stage = 0.2),
  maxt = 5
)
The data in the survival_data object can then be merged back into participant_data using any data-merging tool, or imported through a fabricate call, and then used for subsequent analyses (e.g. using the survival package).
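One way the merge step might look is sketched below. It rests on two assumptions that should be checked against the simsurv documentation: that the simulated output has the columns id, eventtime, and status, and that with the default idvar = NULL individuals are numbered 1 to nrow(x) in row order. The object names participants, merged, and analysis_df, and the derived variables, are illustrative only.

```r
library(fabricatr)
library(simsurv)

# A small covariate frame, standing in for participant_data
participants <- fabricate(
  N = 100,
  treatment = draw_binary(prob = 0.5, N = N),
  kps = runif(N, min = 40, max = 100)
)

survival <- simsurv(
  lambdas = 0.1, gammas = 0.5,
  x = participants,
  betas = c(treatment = -0.4, kps = -0.005),
  maxt = 5
)

# Assumption: simsurv numbered individuals 1..nrow(x) in row order,
# so an integer key built the same way matches its `id` column.
participants$id <- seq_len(nrow(participants))

# Join covariates and simulated outcomes on the shared key
merged <- merge(participants, survival, by = "id")

# Import the merged data into fabricate to derive analysis variables
analysis_df <- fabricate(
  data = merged,
  died_in_trial = status == 1,
  years_observed = pmin(eventtime, 5)
)
```

Merging on an explicit key rather than binding columns side by side keeps the join correct even if either data frame is later reordered.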
Time series using forecast
forecast, by Rob Hyndman, is a package commonly used to analyze time series data that can also generate simulated time series data. Its Arima and simulate functions can be combined to draw data from pre-specified ARIMA models, including those with seasonal time trends.
Below, we provide an example of using forecast to generate an ARIMA time series, reshape the data, and import it into fabricatr to create new variables of interest.
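library(forecast)
# Fit an ARIMA(1, 0, 1) model to quarterly noise, then simulate from the fit
arima_model <- simulate(
  Arima(ts(rnorm(100), frequency = 4),
        order = c(1, 0, 1))
)
# Import the simulated series and add year and quarter indices
fabricate(data = data.frame(arima_model),
          year = rep(1:25, each = 4),
          quarter = rep(1:4, 25))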
Here, ts converts a series of data into a time series, with frequency specifying the number of observations per unit of time (in this case, quarters in a year). Arima ingests this data and fits an ARIMA model with the specified parameters. simulate draws new data from the fitted model, producing a vector of interest. We then import the data into a fabricate call (converting it to a data frame) and add new columns of interest.
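The year and quarter columns are built by hand with rep, so it is worth confirming that they line up with the time index that ts assigns. The following base R sketch does that using cycle and time, assuming the default start of 1; the names quarter_index and year_index are illustrative.

```r
# A quarterly series with 100 observations and the default start = 1
x <- ts(rnorm(100), frequency = 4)

# cycle() gives the position within each unit of time (the quarter);
# floor(time()) gives the unit itself (the year).
quarter_index <- as.integer(cycle(x))
year_index <- as.integer(floor(time(x)))

# Compare against the hand-built columns used in the fabricate call
quarters_match <- identical(quarter_index, rep(1:4, 25))
years_match <- identical(year_index, rep(1:25, each = 4))
```

Deriving the indices from the series itself in this way avoids hard-coding the number of years if the series length or frequency changes.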
Other data simulation tools
The R ecosystem has many other data simulation tools, and all can be used to complement or supplement fabricatr in your workflow. Some of the packages that we have noticed but not covered here include:
- gems by Luisa Salazar Vizcaya
- simFrame by Andreas Alfons
- simPop by Matthias Templ
- simstudy by Keith Goldfeld
- synthpop by Beata Nowok
- SimCorrMix by Allison Cynthia Fialkowski
If you’d like to see a tutorial on using these packages or any others with fabricatr, please contact us so we can help you.