fabricatr

Data are generated so that the inter-cluster correlation is 0 and the intra-cluster correlation equals ICC in expectation. The data-generating process used in this function is described at: https://stats.stackexchange.com/questions/263451/create-synthetic-data-with-a-given-intraclass-correlation-coefficient-icc

draw_normal_icc(mean = 0, N = NULL, clusters, sd = NULL,
  sd_between = NULL, ICC = NULL)

Arguments

mean

A number or vector of numbers, one mean per cluster. Defaults to 0.

N

(Optional) A number indicating the number of observations to be generated. Must be equal to length(clusters) if provided.

clusters

A vector of factors or items that can be coerced to clusters; the length will determine the length of the generated data.

sd

A number or vector of numbers giving the within-cluster standard deviation, that is, the standard deviation of each cluster's error terms (default 1).

sd_between

A number or vector of numbers, indicating the standard deviation between clusters.

ICC

A number indicating the desired ICC.

Value

A vector of numbers corresponding to the observations from the supplied cluster IDs.

Details

Typically, the user provides an ICC and, optionally, a set of within-cluster standard deviations, sd. If sd is not provided, it defaults to 1. Together, these arguments imply a fixed between-cluster standard deviation.

Alternatively, the user can provide between-cluster standard deviations, sd_between, and an ICC. Together, these arguments imply a fixed within-cluster standard deviation.
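The two modes above can be made concrete with a short derivation. It assumes the usual random-intercept definition of the ICC (between-cluster variance divided by total variance); the symbols \sigma_b (for sd_between) and \sigma_w (for sd) are notation introduced here for illustration, and the package source remains the authority on the exact computation.

```latex
% Assumed definition: ICC as the share of total variance that lies between clusters
\mathrm{ICC} = \frac{\sigma_b^2}{\sigma_b^2 + \sigma_w^2}

% Mode 1: ICC and sd (\sigma_w) given, solve for sd_between (\sigma_b)
\sigma_b = \sigma_w \sqrt{\frac{\mathrm{ICC}}{1 - \mathrm{ICC}}}

% Mode 2: ICC and sd_between (\sigma_b) given, solve for sd (\sigma_w)
\sigma_w = \sigma_b \sqrt{\frac{1 - \mathrm{ICC}}{\mathrm{ICC}}}

% Worked values under this definition:
% sd = 3, ICC = 0.3:          \sigma_b = 3\sqrt{0.3 / 0.7} \approx 1.96
% sd_between = 4, ICC = 0.2:  \sigma_w = 4\sqrt{0.8 / 0.2} = 8
```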

If users provide all three of ICC, sd_between, and sd, the function will warn the user and use the provided standard deviations for generating the data.
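As a rough illustration of how an ICC and a within-cluster standard deviation pin down the data, the following base-R sketch draws from a random-intercept model. The helper name sketch_normal_icc is hypothetical and is not part of fabricatr; the sketch assumes scalar sd and mean and the standard definition ICC = sd_between^2 / (sd_between^2 + sd^2), and it may differ from the package's internals.

```r
# Hypothetical helper, not part of fabricatr: a base-R sketch of a
# random-intercept data-generating process with a target ICC.
# Assumes scalar `sd` and `mean`.
sketch_normal_icc <- function(clusters, ICC, sd = 1, mean = 0) {
  stopifnot(ICC >= 0, ICC < 1, sd > 0)
  f <- as.factor(clusters)
  cluster_id <- as.integer(f)
  # Between-cluster SD implied by ICC = sd_between^2 / (sd_between^2 + sd^2)
  sd_between <- sd * sqrt(ICC / (1 - ICC))
  # One shared shock per cluster, plus independent within-cluster noise
  cluster_effect <- rnorm(nlevels(f), mean = 0, sd = sd_between)
  noise <- rnorm(length(cluster_id), mean = 0, sd = sd)
  mean + cluster_effect[cluster_id] + noise
}

set.seed(1)
clusters <- rep(1:200, each = 20)
y <- sketch_normal_icc(clusters, ICC = 0.5)
```

Because every observation in a cluster shares the same cluster_effect, observations within a cluster are correlated, while observations in different clusters are independent.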

Examples

# Divide observations into clusters
clusters = rep(1:5, 10)

# Default: unit variance within each cluster
draw_normal_icc(clusters = clusters, ICC = 0.5)
#> [1] -2.774221580 0.744750945 -1.256170958 -0.464284195 0.005476574
#> [6] -1.369422702 -0.836103254 -0.123449524 -2.102324520 0.621329562
#> [11] -0.898702207 -0.317246896 -1.218305552 -0.871250814 1.557310484
#> [16] -0.696322293 -0.604463143 -1.979579020 -1.810849294 0.296515115
#> [21] -1.004455099 0.625096428 -3.047062288 -1.052697373 0.788810839
#> [26] -1.368391559 -1.480018037 -1.367819304 -1.488752796 0.350384221
#> [31] -0.398461203 -0.150745376 0.684173868 -0.621120925 0.985746945
#> [36] -0.855393624 -0.627176508 -1.390609732 -1.669810464 0.650505185
#> [41] -0.875855986 -1.114138467 -0.352778182 -0.649242153 -1.654094115
#> [46] -0.627612972 -1.378201814 -1.439114985 -0.931369804 2.069935332

# Alternatively, you can specify characteristics:
draw_normal_icc(mean = 10, clusters = clusters, sd = 3, ICC = 0.3)
#> [1] 6.5076580 4.9897328 10.1449755 7.8583626 14.4104899 10.9502646
#> [7] 3.9472303 10.3020590 9.3788620 11.7987364 7.4582735 0.8442592
#> [13] 14.1324137 7.8228458 11.7481211 10.9525766 5.1749352 15.7697438
#> [19] 12.9372572 12.0742556 15.0646442 10.2735558 12.9598582 10.8189339
#> [25] 15.8945017 8.2490312 16.2499167 17.0550873 2.6046481 12.4859585
#> [31] 7.7983020 11.4686566 17.4882516 8.8115134 11.2750603 12.4293539
#> [37] 7.9646933 11.3487835 5.5431375 9.7058788 4.7493910 7.9675376
#> [43] 15.3378425 10.0832600 12.3363107 7.4247893 12.1190561 9.9523218
#> [49] 10.5290856 8.6540212

# Can specify between-cluster standard deviation instead:
draw_normal_icc(clusters = clusters, sd_between = 4, ICC = 0.2)
#> [1] -17.7058653 0.5882161 5.6486827 7.9324515 -8.3428828 2.5919477
#> [7] -4.1270824 9.5123579 2.6435674 -2.6263506 -10.2739136 11.8095493
#> [13] 4.2633041 11.1662272 11.4890005 -7.0800055 1.7052403 -2.5329016
#> [19] 12.1615392 -2.4679993 -9.7255167 6.1329169 19.8382806 8.8485680
#> [25] -5.5736394 -0.6081300 -10.1431205 17.4245857 7.4653253 1.2508721
#> [31] -13.9938790 1.7155350 11.1095098 7.7341433 0.5826849 -2.9654141
#> [37] 9.5051761 5.7550496 -4.3387526 6.1589327 -7.8782514 11.0888177
#> [43] 14.9726313 15.7865843 5.3907583 -11.7361528 11.0765795 0.5280482
#> [49] 1.5394966 -4.0823270

# Verify that ICC generated is accurate
corr_draw = draw_normal_icc(clusters = clusters, ICC = 0.4)
summary(lm(corr_draw ~ as.factor(clusters)))$r.squared
#> [1] 0.2247414
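
With only five clusters, any single-draw check of the ICC is noisy. As a complementary check, the sketch below uses the one-way ANOVA estimator of the ICC on base-R stand-in data with many clusters. The helper icc_anova is hypothetical (it is not a fabricatr function) and assumes balanced clusters; the stand-in data use the standard random-intercept construction rather than a call to draw_normal_icc.

```r
# Hypothetical helper (not a fabricatr function): one-way ANOVA estimator
# of the ICC, assuming every cluster has the same number of observations.
icc_anova <- function(y, clusters) {
  f <- as.factor(clusters)
  k <- length(y) / nlevels(f)            # common cluster size
  ms <- anova(lm(y ~ f))[["Mean Sq"]]    # c(between, within) mean squares
  (ms[1] - ms[2]) / (ms[1] + (k - 1) * ms[2])
}

# Base-R stand-in data with a target ICC of 0.4 and within-cluster SD of 1
set.seed(42)
clusters <- rep(1:200, each = 20)
sd_between <- sqrt(0.4 / (1 - 0.4))
y <- rnorm(200, sd = sd_between)[clusters] + rnorm(length(clusters))
icc_hat <- icc_anova(y, clusters)
```

With 200 clusters of 20 observations each, icc_hat should land reasonably close to the 0.4 target; the same helper could be applied to output from draw_normal_icc when clusters are balanced.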