Using other data generating packages with fabricatr
Aaron Rudkin
Source:vignettes/other_packages.Rmd
other_packages.Rmd
In general, fabricatr is going to be compatible with
any existing packages you use to generate synthetic data in one of two
ways: either using those packages to create variables within a
fabricate
call, or using those packages to make complete
data frames which are then imported into a fabricate
. Below
we provide examples for some of the most popular packages that serve
this purpose.
wakefield: simulating common demographic features
wakefield (by Tyler Rinker) is a popular
R
package for creating synthetic data.
wakefield’s strength is that it can quickly generate
common variables, especially for human demographic features.
wakefield can easily be integrated into a
fabricatr workflow in one of two ways: using
wakefield to create individual variables within a
fabricate
call, or using wakefield to make
a data frame and importing that data frame into a fabricate
call.
In this example, we create a data-set of participants in a survey experiment, using wakefield to generate the demographic variables
library(wakefield)
survey_experiment_df <- fabricate(
N = 50,
treatment = draw_binary(prob = 0.5, N = N),
age = age(n = N),
race = race(n = N),
sex = sex(n = N),
)
Researchers interested in learning about wakefield’s available functionality, parameterizations, and default probability can read wakefield’s user guide on GitHub.
In addition to creating variables within a fabricate
call, users can import completed wakefield data frames
into a fabricate
call:
survey_experiment_df <- r_data_frame(
n = 50,
age,
race,
sex)
fabricatr_df <- fabricate(
data = survey_experiment_df,
treatment = draw_binary(prob = 0.5, N = N)
)
randomNames: Plausible names for human subjects
randomNames (by Damian Betebenner) is a
package that does one thing well: generate random names for human
subjects (including with specified genders and ethnicities). The primary
use case would be to use this as part of generating a variable within a
fabricate
call. In the below example, we use
fabricate
to generate some other demographic data, and then
randomNames
to generate matching names.
library(randomNames)
experiment_data <- fabricate(
N = 50,
treatment = draw_binary(prob = 0.5, N = N),
is_female = draw_binary(prob = 0.5, N = N),
patient_name = randomNames(N, gender=is_female)
)
Note that we make use of the existing is_female
variable
from the fabricate
call to ensure randomNames
generates gender-typical names.
Modeling causality with DAGs and simcausal
Users who are familiar with the DAGS (directed acyclic graphs) model
of causal inference may have interest in using the
simcausal package, which allows users to specify a DAGS
model and then sample from it. Integrating this package with
fabricatr is likely to involve using
simcausal first to specify a model, simulating data
from the model, and then importing data into a fabricate
call for further user with fabricatr.
Consider this example, common in the literature on educational
attainment and school outcomes, where students come from families that
have a wealth
parameter, assignment to schools is based
partially on wealth
, and test outcomes
(testoutcome
) is based on both school quality and
wealth.
library("simcausal")
# Define DAG
D <- DAG.empty() +
node("wealth", distr = "rnorm",
mean = 30000,
sd = 10000) +
node("schoolquality", distr = "runif",
min = 0 + (5 * (wealth > 50000)),
max = 10) +
node("testoutcome", distr = "runif",
min = 0 + 0.0001 * wealth + 0.25 * schoolquality,
max = 10)
# Freeze DAG object
set_dag <- set.DAG(D)
# Draw data from DAG
df <- sim(set_dag, n = 100)
# Pass into fabricate call and make new variables as necessary
fabricate(df,
passed_test = testoutcome > 6,
eligible_for_snap = wealth < 25000)
Survival and duration models with simsurv
simsurv is a package dedicated to generating panel survival data. The most likely way you might integrate simsurv with fabricatr would be to use fabricatr to generating covariates which can then be imported into simsurv to model in a hazard or duration context.
Here, our example will be a clinical trial of a cancer drug. Participants have the expected biographical data: age, gender, whether the patient smokes, the disease stage, assignment to treatment, and a KPS score (commonly used to evaluate overall patient health).
Survival data creates a ragged longitudinal survey; some patients
will die during the course of the trial, removing them as observations.
Others will continue alive until the end of the trial. We specify a
“hazard function”, which tells simsurv how the course
of patient survival will change over time. Covariates with positive
betas
increase risk of death, while covariates with
negative betas
decrease risk of death. We will track
patients for 5 years after treatment.
library(simsurv)
# Simulate patient data in a clinical trial
participant_data <- fabricate(
N = 100,
age = runif(N, min = 18, max = 85),
is_female = draw_binary(prob = 0.5, N = N),
is_smoker = draw_binary(prob = 0.2 + 0.2 * (age > 50), N = N),
disease_stage = round(runif(N, min = 1 + 0.5 * (age > 65), max = 4)),
treatment = draw_binary(prob = 0.5, N = N),
kps = runif(N, min = 40, max = 100)
)
# Simulate data in the survival context
survival_data <- simsurv(
lambdas = 0.1, gammas = 0.5,
x = participant_data,
betas = c(is_female = -0.2, is_smoker = 1.2,
treatment = -0.4, kps = -0.005,
disease_stage = 0.2),
maxt = 5)
The generated data from the survival_data
object can
then be re-imported into the participant_data
using any
data merging tools, including through a fabricate
call, and
then used for subsequent analyses (e.g. using the
survival package).
Time series using forecast
forecast, by Rob Hyndman, is a
package commonly used to analyze time series data which also has
functionality capable of generating simulated time series data.
forecast can use the Arima
and
simulate
functions to create pre-specified ARIMA models,
including seasonal time trends.
Below, we provide an example of using forecast to generate an ARIMA time series, reshape the data, and import it into fabricatr to create new variables of interest.
library(forecast)
arima_model <- simulate(
Arima(ts(rnorm(100), frequency = 4),
order = c(1, 0, 1))
fabricate(data.frame(arima_model),
year = rep(1:25, each=4),
quarter = rep(1:4, 25))
Here, ts
converts a series of data into a time series,
with frequency specifying the number of observations per unit of time
(in this case, for example, quarters in a year). Arima
ingests this data and fits an ARIMA model with the specified parameters.
simulate
draws new data from the fit time series, producing
a vector of interest. We then import the data into a
fabricate
call (converting it to a data frame) and add new
columns of interest.
Other data simulation tools
The R
ecosystem has many other data simulation tools,
and all can be used to complement or supplement
fabricatr in your workflow. Some of the packages that
we have noticed but not covered here include:
- gems by Luisa Salazar Vizcaya
- simFrame by Andreas Alfons
- simPop by Matthias Templ
- simstudy by Keith Goldfeld
- synthPop by Beata Nowok
- SimCorrMix by Allison Cynthia Fialkowski
If you’d like to see a tutorial on using these packages or any others with fabricatr, please Contact Us so we can help you