R/draw_normal_icc.R
draw_normal_icc.Rd
Data is generated so that the correlation between clusters is 0 and the intra-cluster correlation equals ICC in expectation. The data generating process used in this function is described at the following URL: https://stats.stackexchange.com/questions/263451/create-synthetic-data-with-a-given-intraclass-correlation-coefficient-icc
draw_normal_icc(
mean = 0,
N = NULL,
clusters,
sd = NULL,
sd_between = NULL,
total_sd = NULL,
ICC = NULL
)
mean: A number or vector of numbers, one mean per cluster. If none is provided, will default to 0.
N: (Optional) A number indicating the number of observations to be generated. Must be equal to length(clusters) if provided.
clusters: A vector of factors or items that can be coerced to clusters; the length will determine the length of the generated data.
sd: A number or vector of numbers, indicating the standard deviation of each cluster's error terms, that is, the standard deviation within a cluster (default 1).
sd_between: A number or vector of numbers, indicating the standard deviation between clusters.
total_sd: A number indicating the total sd of the resulting variable. May only be specified if ICC is specified and sd and sd_between are not.
ICC: A number indicating the desired ICC.
A vector of numbers corresponding to the observations from the supplied cluster IDs.
The typical use for this function is for a user to provide an ICC and, optionally, a set of within-cluster standard deviations, sd. If the user does not provide sd, the default value is 1. These arguments imply a fixed between-cluster standard deviation.
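The implied between-cluster standard deviation can be worked out by hand. The sketch below assumes the usual variance-ratio definition, ICC = sd_between^2 / (sd_between^2 + sd^2); the helper name `implied_sd_between` is hypothetical and is not part of the package.

```r
# Hypothetical helper (not a package function): between-cluster SD implied by
# an ICC and a within-cluster SD, assuming
#   ICC = sd_between^2 / (sd_between^2 + sd^2)
implied_sd_between <- function(ICC, sd = 1) {
  stopifnot(ICC >= 0, ICC < 1, all(sd > 0))
  sd * sqrt(ICC / (1 - ICC))
}

# With ICC = 0.5 the two variance components are equal
implied_sd_between(ICC = 0.5)

# A larger within-cluster SD scales the implied between-cluster SD
sb <- implied_sd_between(ICC = 0.3, sd = 3)

# Plugging the result back into the definition recovers the ICC
sb^2 / (sb^2 + 3^2)
```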
An alternate mode for the function is to provide between-cluster standard deviations, sd_between, and an ICC. These arguments imply a fixed within-cluster standard deviation.
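The same definition can be inverted to show what within-cluster standard deviation a given sd_between and ICC imply. This is a sketch under the assumption ICC = sd_between^2 / (sd_between^2 + sd^2); `implied_sd_within` is a hypothetical name used only for illustration.

```r
# Hypothetical helper (not a package function): within-cluster SD implied by
# an ICC and a between-cluster SD, assuming
#   ICC = sd_between^2 / (sd_between^2 + sd^2)
implied_sd_within <- function(ICC, sd_between) {
  stopifnot(ICC > 0, ICC < 1, all(sd_between > 0))
  sd_between * sqrt((1 - ICC) / ICC)
}

# ICC = 0.2 means within-cluster variance is four times the between-cluster
# variance, so the within-cluster SD is twice sd_between
implied_sd_within(ICC = 0.2, sd_between = 4)  # 8
```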
If users provide all three of ICC, sd_between, and sd, the function will warn the user and use the provided standard deviations for generating the data.
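In every mode the inputs reduce to a pair of standard deviations, one within clusters and one between clusters. The base R sketch below shows how such a pair can produce clustered normal data under a simple random-intercept model. It is an illustration under that assumption and is not necessarily how the package generates data internally; all variable names are chosen for this example.

```r
# Illustrative random-intercept data generating process (an assumption for
# this sketch, not the package's internal code)
set.seed(1)

clusters <- rep(1:5, 10)
ICC <- 0.5
sd_within <- 1
# Between-cluster SD implied by ICC = sd_between^2 / (sd_between^2 + sd_within^2)
sd_between <- sd_within * sqrt(ICC / (1 - ICC))

# Map cluster labels to integer positions 1..K
cluster_ids <- as.integer(as.factor(clusters))
n_clusters <- length(unique(cluster_ids))

# One shared draw per cluster, then independent noise per observation
cluster_effect <- rnorm(n_clusters, mean = 0, sd = sd_between)
y <- cluster_effect[cluster_ids] + rnorm(length(clusters), mean = 0, sd = sd_within)

head(y)
```

Because each cluster effect is shared by every observation in that cluster, observations in the same cluster are correlated, while observations in different clusters are independent.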
# Divide observations into clusters
clusters = rep(1:5, 10)
# Default: unit variance within each cluster
draw_normal_icc(clusters = clusters, ICC = 0.5)
#> [1] 0.34066735 -1.09542354 0.49465106 0.03068483 -0.38183951 -1.57453527
#> [7] -1.04707701 -0.14266562 -0.34736660 -0.46482503 -0.06810272 -2.46486124
#> [13] 0.93923623 0.57155456 0.19195177 -0.63926347 -0.52185873 -0.60495800
#> [19] -0.21088358 -1.15578495 -0.96644288 -1.45195861 1.16092529 -0.27719126
#> [25] -1.61910999 0.18104605 -0.90927880 -0.03387395 -0.77261877 -1.38040296
#> [31] -0.14362702 -2.55397984 -1.07871006 -1.13080947 -2.10452400 -0.25894432
#> [37] -0.98785028 -0.25898333 -1.03700062 0.58671995 -0.21903997 -0.43154027
#> [43] 0.65288855 -0.33999751 0.20586299 -1.23377258 -0.75212116 0.84016629
#> [49] 0.03522322 0.11707450
# Alternatively, you can specify the mean and the within-cluster SD:
draw_normal_icc(mean = 10, clusters = clusters, sd = 3, ICC = 0.3)
#> [1] 7.683778 11.595033 11.208928 8.130258 10.672332 7.478325 6.748232
#> [8] 11.881818 16.267152 4.374796 12.874916 5.270785 10.934186 7.167838
#> [15] 6.263748 8.642012 12.295274 8.562739 8.747630 6.265516 2.455183
#> [22] 15.205841 11.110051 6.028755 11.461177 2.730217 10.152848 13.849622
#> [29] 6.019095 13.100577 5.285693 7.570052 12.674419 11.908607 12.861187
#> [36] 7.627498 13.037720 10.945860 8.900722 10.254335 5.248685 7.019324
#> [43] 12.743514 6.086238 4.823689 9.277063 10.018417 8.688299 4.979698
#> [50] 10.979758
# Can specify between-cluster standard deviation instead:
draw_normal_icc(clusters = clusters, sd_between = 4, ICC = 0.2)
#> [1] -7.8861422 17.4973224 -0.9747662 10.9927142 0.3669069 1.2359383
#> [7] -0.5330519 11.7032931 -0.6102078 2.3591629 -9.1082153 0.2796305
#> [13] 3.7425805 -14.2025004 -11.7893536 1.3154810 25.4646336 10.5318516
#> [19] 11.6572157 -13.7360200 -1.5324509 -1.3092685 2.4174306 14.1789690
#> [25] 4.9655754 0.8698504 9.5552141 8.6847665 20.7272530 -11.8277109
#> [31] 4.0713956 -2.2831786 1.5226788 -13.0287353 -2.4106954 6.6123508
#> [37] 4.0928958 -6.8973644 1.7846996 -7.6459503 2.0411449 12.1529656
#> [43] 7.8893099 2.3123863 -14.3483976 -1.4316713 -0.8622560 -7.7531333
#> [49] -4.3572488 -24.2363288
# Can specify total SD instead:
total_sd_draw = draw_normal_icc(clusters = clusters, ICC = 0.5, total_sd = 3)
sd(total_sd_draw)
#> [1] 3
# Check the generated ICC (R-squared on cluster indicators is only a rough proxy)
corr_draw = draw_normal_icc(clusters = clusters, ICC = 0.4)
summary(lm(corr_draw ~ as.factor(clusters)))$r.squared
#> [1] 0.2751446
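With only five clusters, a single draw can land well away from the target ICC. The sketch below uses the one-way ANOVA estimator of the ICC on a larger simulated sample. It generates its own data in base R under a random-intercept model, which is an assumption made here so that the example does not depend on the package, and all names in it are illustrative.

```r
# One-way ANOVA estimate of the ICC on simulated clustered data (base R only)
set.seed(42)

n_clusters <- 200
cluster_size <- 10
clusters <- rep(seq_len(n_clusters), each = cluster_size)

ICC <- 0.4
sd_within <- 1
sd_between <- sd_within * sqrt(ICC / (1 - ICC))

# Shared cluster effect plus independent within-cluster noise
y <- rnorm(n_clusters, sd = sd_between)[clusters] +
  rnorm(length(clusters), sd = sd_within)

# Between- and within-cluster mean squares from a one-way ANOVA
fit <- anova(lm(y ~ as.factor(clusters)))
msb <- fit[["Mean Sq"]][1]
msw <- fit[["Mean Sq"]][2]

# ANOVA estimator for balanced clusters of size k:
#   (MSB - MSW) / (MSB + (k - 1) * MSW)
icc_hat <- (msb - msw) / (msb + (cluster_size - 1) * msw)
icc_hat
```

The estimate should be close to the target when there are many clusters, and it becomes noisier as the number of clusters shrinks.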