Data are generated so that observations in different clusters are uncorrelated (inter-cluster correlation 0) and the intra-cluster correlation equals ICC in expectation. The data generating process used in this function is described at: https://stats.stackexchange.com/questions/263451/create-synthetic-data-with-a-given-intraclass-correlation-coefficient-icc

draw_normal_icc(mean = 0, N = NULL, clusters, sd = NULL,
sd_between = NULL, total_sd = NULL, ICC = NULL)

Arguments

mean        A number or vector of numbers, one mean per cluster. If none is provided, will default to 0.

N           (Optional) A number indicating the number of observations to be generated. Must be equal to length(clusters) if provided.

clusters    A vector of factors or items that can be coerced to clusters; the length will determine the length of the generated data.

sd          A number or vector of numbers, indicating the standard deviation of each cluster's error terms, i.e. the standard deviation within a cluster (default 1).

sd_between  A number or vector of numbers, indicating the standard deviation between clusters.

total_sd    A number indicating the total sd of the resulting variable. May only be specified if ICC is specified and sd and sd_between are not.

ICC         A number indicating the desired ICC.

Value

A vector of numbers corresponding to the observations from the supplied cluster IDs.

Details

The typical use for this function is for a user to provide an ICC and, optionally, a set of within-cluster standard deviations, sd. If the user does not provide sd, the default value is 1. These arguments imply a fixed between-cluster standard deviation.
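
The implied between-cluster standard deviation can be worked out by hand. The following is a minimal sketch, assuming the usual variance-ratio definition ICC = sd_between^2 / (sd_between^2 + sd^2). The helper name implied_sd_between is hypothetical and is not part of the package.

```r
# Hypothetical helper (not a package function): between-cluster SD implied by
# an ICC and a within-cluster SD, assuming
#   ICC = sd_between^2 / (sd_between^2 + sd^2)
implied_sd_between <- function(ICC, sd = 1) {
  stopifnot(ICC > 0, ICC < 1, all(sd > 0))
  sqrt(ICC * sd^2 / (1 - ICC))
}

# With the default within-cluster SD of 1 and ICC = 0.5, both SDs are equal
implied_sd_between(ICC = 0.5)

# A larger within-cluster SD scales the implied between-cluster SD with it
implied_sd_between(ICC = 0.3, sd = 3)
```

Because sd may be a vector, the same expression would return one implied between-cluster standard deviation per supplied value.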

An alternate mode for the function is to provide between-cluster standard deviations, sd_between, and an ICC. These arguments imply a fixed within-cluster standard deviation.
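
Solving the same variance ratio for the within-cluster standard deviation gives the value implied in this mode. Again this is only a sketch under the assumption ICC = sd_between^2 / (sd_between^2 + sd^2); implied_sd_within is a hypothetical name, not a package function.

```r
# Hypothetical helper (not a package function): within-cluster SD implied by
# an ICC and a between-cluster SD, assuming
#   ICC = sd_between^2 / (sd_between^2 + sd^2)
implied_sd_within <- function(ICC, sd_between) {
  stopifnot(ICC > 0, ICC < 1, all(sd_between > 0))
  sqrt((1 - ICC) * sd_between^2 / ICC)
}

# sd_between = 4 with ICC = 0.2 implies a within-cluster SD of 8,
# since 16 / (16 + 64) = 0.2
implied_sd_within(ICC = 0.2, sd_between = 4)
```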

If users provide all three of ICC, sd_between, and sd, the function will warn the user and use the provided standard deviations for generating the data.
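
Whichever mode supplies the two standard deviations, the draw itself can be pictured as a random-intercept model: one shared normal draw per cluster plus independent normal noise per observation. The sketch below is an illustration under that assumption, not the package's actual implementation; sketch_normal_icc is a hypothetical name, and for brevity it accepts only a scalar mean and a scalar sd.

```r
# Illustrative sketch only (not the package source): random-intercept draw
# with a single mean and a single within-cluster SD.
sketch_normal_icc <- function(clusters, mean = 0, sd = 1, ICC = 0.5) {
  stopifnot(ICC > 0, ICC < 1, sd > 0)
  clusters <- as.factor(clusters)

  # Between-cluster SD implied by the ICC and the within-cluster SD,
  # assuming ICC = sd_between^2 / (sd_between^2 + sd^2)
  sd_between <- sqrt(ICC * sd^2 / (1 - ICC))

  # One shared effect per cluster, independent across clusters
  cluster_effect <- rnorm(nlevels(clusters), mean = 0, sd = sd_between)

  # One independent error term per observation
  individual_error <- rnorm(length(clusters), mean = 0, sd = sd)

  # Each observation: mean + its cluster's effect + its own error
  mean + cluster_effect[as.integer(clusters)] + individual_error
}

set.seed(1)  # arbitrary seed, only to make the sketch reproducible
y <- sketch_normal_icc(clusters = rep(1:5, 10), ICC = 0.5)
length(y)    # one value per element of clusters
```

Observations that share a cluster share the same cluster effect, which is what induces the within-cluster correlation; observations in different clusters share nothing, so they are uncorrelated.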

Examples


# Divide observations into clusters
clusters = rep(1:5, 10)

# Default: unit variance within each cluster
draw_normal_icc(clusters = clusters, ICC = 0.5)
#>  [1] -1.91895050 -2.32972918 -1.04108334 -0.05564077  0.87693614  1.11799259
#>  [7] -0.99141140 -0.88805475 -0.71326911 -0.21932045 -1.22261697 -3.05425246
#> [13] -1.94498592  0.32406580 -0.38223353  1.02693684 -1.52146891 -2.12281385
#> [19] -0.78642284  0.45302176 -0.73581061 -0.91270615 -1.00380565 -1.39700264
#> [25]  3.22000964  0.65148723 -0.03478417 -3.02644937 -0.79626285  0.52635874
#> [31]  0.84369638 -0.74923767 -2.21426708 -1.81894923 -0.43155686  1.00400689
#> [37] -1.12518640 -1.85884918  0.12255121 -0.28577768 -1.01415162 -1.48412826
#> [43] -3.67599455 -1.37105138  0.72496291 -0.68962721 -0.31864645 -2.27866549
#> [49] -1.06514811  0.31793276
# Alternatively, specify the mean and the within-cluster SD:
draw_normal_icc(mean = 10, clusters = clusters, sd = 3, ICC = 0.3)
#>  [1]  2.980361  5.059323 11.220492 12.834811 12.036792  2.311390  6.813053
#>  [8] 21.045831  4.879437  6.892657  5.155492 11.773461 11.567804 12.027526
#> [15]  9.251026  3.861271 11.507811 11.776464 10.176906 15.019595  5.579890
#> [22] 10.263847 11.652612  5.595671 13.200408  3.579218 13.534341  9.284284
#> [29] 13.473204 11.052864  6.018959 10.516244 11.932237  8.787241 14.952115
#> [36]  2.895813  7.866673 12.485727  7.493867 10.279353  5.963135  4.932639
#> [43] 13.080236 10.711291  5.099772  2.513929  6.138051 11.786126  8.023712
#> [50]  8.156668
# Can specify between-cluster standard deviation instead:
draw_normal_icc(clusters = clusters, sd_between = 4, ICC = 0.2)
#>  [1]  -7.9882450  19.1681798 -17.8117587  -5.1334636 -21.9023901  -6.4990373
#>  [7] -11.6548116  -9.0053030  -2.5826652 -12.2276641   5.4512710   2.8243694
#> [13] -17.9118919  -8.0512295 -20.5450364  -4.1655727 -10.5949451  -2.6768807
#> [19]   1.1514062 -11.1928974   1.1884742  -1.9987958  -5.6795444   7.1289345
#> [25]   1.9346844  -9.1734515  10.6353696   6.7102672  -3.3426178 -15.0474155
#> [31]   2.8221672  -4.7640555   1.4392404   5.6333692  -5.4478836  -2.1267284
#> [37]  -4.5118486   0.1530110  10.2640661  -1.2393195  -3.9640117  -8.9105903
#> [43]   1.9340071  11.9791366  -5.0312457  11.1539108   6.4744885  -0.3893474
#> [49]  -9.5299046 -12.6852342
# Can specify total SD instead:
total_sd_draw = draw_normal_icc(clusters = clusters, ICC = 0.5, total_sd = 3)
sd(total_sd_draw)
#> [1] 3
# Roughly check the generated ICC (the estimate varies from draw to draw)
corr_draw = draw_normal_icc(clusters = clusters, ICC = 0.4)
summary(lm(corr_draw ~ as.factor(clusters)))$r.squared
#> [1] 0.2983103