Data are generated so that the inter-cluster correlation is 0 and the intra-cluster correlation equals ICC in expectation. The data-generating process used in this function is described at the following URL: https://stats.stackexchange.com/questions/263451/create-synthetic-data-with-a-given-intraclass-correlation-coefficient-icc

draw_normal_icc(mean = 0, N = NULL, clusters, sd = NULL,
  sd_between = NULL, total_sd = NULL, ICC = NULL)

Arguments

mean

A number or vector of numbers, one mean per cluster. Defaults to 0 if not provided.

N

(Optional) A number indicating the number of observations to be generated. Must be equal to length(clusters) if provided.

clusters

A vector of factors or items that can be coerced to clusters; the length will determine the length of the generated data.

sd

A number or vector of numbers indicating the within-cluster standard deviation, that is, the standard deviation of each cluster's error terms. Defaults to 1 if not provided.

sd_between

A number or vector of numbers, indicating the standard deviation between clusters.

total_sd

A number indicating the total sd of the resulting variable. May only be specified if ICC is specified and sd and sd_between are not.

ICC

A number indicating the desired ICC.

Value

A vector of numbers corresponding to the observations from the supplied cluster IDs.

Details

The typical use for this function is for a user to provide an ICC and, optionally, a set of within-cluster standard deviations, sd. If the user does not provide sd, the default value is 1. These arguments imply a fixed between-cluster standard deviation.
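The implied between-cluster standard deviation can be worked out from the usual variance-ratio definition of the ICC, which is assumed here. The symbols \sigma_b (for sd_between) and \sigma_w (for sd) are introduced only for this illustration:

```latex
\[
\mathrm{ICC} = \frac{\sigma_b^2}{\sigma_b^2 + \sigma_w^2}
\quad\Longrightarrow\quad
\sigma_b = \sigma_w \sqrt{\frac{\mathrm{ICC}}{1 - \mathrm{ICC}}}
\]
```

For example, with the default sd = 1 and ICC = 0.5, this gives sd_between = 1.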

An alternate mode for the function is to provide between-cluster standard deviations, sd_between, and an ICC. These arguments imply a fixed within-cluster standard deviation.
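As a rough sketch of what "imply" means in this mode, assume the usual definition ICC = sd_between^2 / (sd_between^2 + sd^2) and solve it for the within-cluster standard deviation. The helper name implied_sd_within is hypothetical and is not part of the package:

```r
# Hypothetical helper, for illustration only (not exported by the package).
# Assumes ICC = sd_between^2 / (sd_between^2 + sd^2) and solves for sd.
implied_sd_within <- function(sd_between, ICC) {
  stopifnot(ICC > 0, ICC < 1)
  sd_between * sqrt((1 - ICC) / ICC)
}

# With sd_between = 4 and ICC = 0.2, the within-cluster sd works out to 8
sd_within <- implied_sd_within(sd_between = 4, ICC = 0.2)
sd_within
```

A low ICC with a fixed sd_between therefore forces a comparatively large within-cluster spread.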

If users provide all three of ICC, sd_between, and sd, the function will warn the user and use the provided standard deviations for generating the data.
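The typical mode (ICC plus within-cluster sd) can be sketched in base R as a random-intercept model: one shared normal effect per cluster plus independent normal noise per observation. This is a minimal sketch under stated assumptions, not the package's implementation: sketch_normal_icc is a hypothetical name, and it handles only a scalar mean and sd.

```r
# Hypothetical helper for illustration; not the package's implementation.
# Assumes a scalar mean and a scalar within-cluster sd.
sketch_normal_icc <- function(clusters, ICC, sd = 1, mean = 0) {
  stopifnot(ICC > 0, ICC < 1)
  cluster_id <- as.integer(as.factor(clusters))
  n_clusters <- max(cluster_id)

  # Between-cluster sd implied by ICC = sd_between^2 / (sd_between^2 + sd^2)
  sd_between <- sd * sqrt(ICC / (1 - ICC))

  # One shared effect per cluster, plus independent noise per observation
  cluster_effect <- rnorm(n_clusters, mean = 0, sd = sd_between)
  noise <- rnorm(length(cluster_id), mean = 0, sd = sd)

  mean + cluster_effect[cluster_id] + noise
}

set.seed(1)
clusters <- rep(1:5, 10)
y <- sketch_normal_icc(clusters, ICC = 0.5)
length(y)  # one value per element of clusters
```

Because observations in different clusters share no random component, their correlation is 0 by construction, while observations in the same cluster share the cluster effect.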

Examples

# Divide observations into clusters
clusters = rep(1:5, 10)

# Default: unit variance within each cluster
draw_normal_icc(clusters = clusters, ICC = 0.5)
#> [1] -1.91895050 -2.32972918 -1.04108334 -0.05564077 0.87693614 1.11799259
#> [7] -0.99141140 -0.88805475 -0.71326911 -0.21932045 -1.22261697 -3.05425246
#> [13] -1.94498592 0.32406580 -0.38223353 1.02693684 -1.52146891 -2.12281385
#> [19] -0.78642284 0.45302176 -0.73581061 -0.91270615 -1.00380565 -1.39700264
#> [25] 3.22000964 0.65148723 -0.03478417 -3.02644937 -0.79626285 0.52635874
#> [31] 0.84369638 -0.74923767 -2.21426708 -1.81894923 -0.43155686 1.00400689
#> [37] -1.12518640 -1.85884918 0.12255121 -0.28577768 -1.01415162 -1.48412826
#> [43] -3.67599455 -1.37105138 0.72496291 -0.68962721 -0.31864645 -2.27866549
#> [49] -1.06514811 0.31793276

# Alternatively, you can specify characteristics:
draw_normal_icc(mean = 10, clusters = clusters, sd = 3, ICC = 0.3)
#> [1] 2.980361 5.059323 11.220492 12.834811 12.036792 2.311390 6.813053
#> [8] 21.045831 4.879437 6.892657 5.155492 11.773461 11.567804 12.027526
#> [15] 9.251026 3.861271 11.507811 11.776464 10.176906 15.019595 5.579890
#> [22] 10.263847 11.652612 5.595671 13.200408 3.579218 13.534341 9.284284
#> [29] 13.473204 11.052864 6.018959 10.516244 11.932237 8.787241 14.952115
#> [36] 2.895813 7.866673 12.485727 7.493867 10.279353 5.963135 4.932639
#> [43] 13.080236 10.711291 5.099772 2.513929 6.138051 11.786126 8.023712
#> [50] 8.156668

# Can specify between-cluster standard deviation instead:
draw_normal_icc(clusters = clusters, sd_between = 4, ICC = 0.2)
#> [1] -7.9882450 19.1681798 -17.8117587 -5.1334636 -21.9023901 -6.4990373
#> [7] -11.6548116 -9.0053030 -2.5826652 -12.2276641 5.4512710 2.8243694
#> [13] -17.9118919 -8.0512295 -20.5450364 -4.1655727 -10.5949451 -2.6768807
#> [19] 1.1514062 -11.1928974 1.1884742 -1.9987958 -5.6795444 7.1289345
#> [25] 1.9346844 -9.1734515 10.6353696 6.7102672 -3.3426178 -15.0474155
#> [31] 2.8221672 -4.7640555 1.4392404 5.6333692 -5.4478836 -2.1267284
#> [37] -4.5118486 0.1530110 10.2640661 -1.2393195 -3.9640117 -8.9105903
#> [43] 1.9340071 11.9791366 -5.0312457 11.1539108 6.4744885 -0.3893474
#> [49] -9.5299046 -12.6852342

# Can specify total SD instead:
total_sd_draw = draw_normal_icc(clusters = clusters, ICC = 0.5, total_sd = 3)
sd(total_sd_draw)
#> [1] 3

# Verify that ICC generated is accurate
corr_draw = draw_normal_icc(clusters = clusters, ICC = 0.4)
summary(lm(corr_draw ~ as.factor(clusters)))$r.squared
#> [1] 0.2983103
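The R-squared of the cluster-dummy regression is only a rough guide to the ICC, and with five clusters any estimate is noisy. A more direct check is sketched below. It assumes equally sized clusters and uses the one-way ANOVA estimator (MSB - MSW) / (MSB + (k - 1) * MSW), where k is the cluster size. The helper anova_icc is hypothetical, and the data are simulated in base R from a random-intercept model rather than with draw_normal_icc, so that the block runs on its own.

```r
# Hypothetical helper (not part of the package): one-way ANOVA estimator of
# the ICC. Assumes every cluster has the same number of observations.
anova_icc <- function(y, clusters) {
  clusters <- as.factor(clusters)
  k <- unique(as.vector(table(clusters)))  # common cluster size
  stopifnot(length(k) == 1)
  mean_sq <- anova(lm(y ~ clusters))[["Mean Sq"]]
  msb <- mean_sq[1]  # between-cluster mean square
  msw <- mean_sq[2]  # within-cluster (residual) mean square
  (msb - msw) / (msb + (k - 1) * msw)
}

# Simulate a random-intercept model in base R with a target ICC of 0.4
set.seed(42)
n_clusters <- 200
cluster_size <- 10
clusters <- rep(seq_len(n_clusters), each = cluster_size)
target_icc <- 0.4
sd_between <- sqrt(target_icc / (1 - target_icc))  # within-cluster sd fixed at 1
y <- rnorm(n_clusters, sd = sd_between)[clusters] +
  rnorm(length(clusters), sd = 1)

icc_hat <- anova_icc(y, clusters)
icc_hat  # should land near the 0.4 target when there are many clusters
```

With the package loaded, the same helper could in principle be applied to the output of draw_normal_icc, provided the clusters are balanced.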