Data is generated so that the inter-cluster correlation is 0 and the intra-cluster correlation equals ICC in expectation. The data generating process used in this function is described at the following URL: https://stats.stackexchange.com/questions/263451/create-synthetic-data-with-a-given-intraclass-correlation-coefficient-icc

draw_normal_icc(
  mean = 0,
  N = NULL,
  clusters,
  sd = NULL,
  sd_between = NULL,
  total_sd = NULL,
  ICC = NULL
)

Arguments

mean

A number or a vector of numbers with one mean per cluster. Defaults to 0.

N

(Optional) A number indicating the number of observations to be generated. Must be equal to length(clusters) if provided.

clusters

A vector of factors or items that can be coerced to clusters; the length will determine the length of the generated data.

sd

A number or vector of numbers indicating the within-cluster standard deviation, i.e. the standard deviation of each cluster's error terms (default 1).

sd_between

A number or vector of numbers, indicating the standard deviation between clusters.

total_sd

A number indicating the total standard deviation of the resulting variable. May only be specified if ICC is specified and sd and sd_between are not.

ICC

A number indicating the desired ICC.

Value

A vector of numbers corresponding to the observations from the supplied cluster IDs.

Details

The typical use for this function is for a user to provide an ICC and, optionally, a set of within-cluster standard deviations, sd. If the user does not provide sd, the default value is 1. These arguments imply a fixed between-cluster standard deviation.

An alternate mode for the function is to provide between-cluster standard deviations, sd_between, and an ICC. These arguments imply a fixed within-cluster standard deviation.

If users provide all three of ICC, sd_between, and sd, the function will warn the user and use the provided standard deviations for generating the data.
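
The relationships between ICC, sd, and sd_between can be written out explicitly. The following is a sketch that assumes the usual random-intercept model, with a cluster-level draw plus an independent individual-level error; the symbols \sigma_b (for sd_between), \sigma_w (for sd), \mu_j, \alpha_j, and \varepsilon_{ij} are notation introduced here and do not appear in the package.

```latex
\begin{aligned}
y_{ij} &= \mu_j + \alpha_j + \varepsilon_{ij},
  \qquad \alpha_j \sim N(0, \sigma_b^2),
  \qquad \varepsilon_{ij} \sim N(0, \sigma_w^2) \\
\mathrm{ICC} &= \frac{\sigma_b^2}{\sigma_b^2 + \sigma_w^2} \\
\sigma_b &= \sigma_w \sqrt{\frac{\mathrm{ICC}}{1 - \mathrm{ICC}}}
  \qquad \text{(sd and ICC supplied)} \\
\sigma_w &= \sigma_b \sqrt{\frac{1 - \mathrm{ICC}}{\mathrm{ICC}}}
  \qquad \text{(sd\_between and ICC supplied)}
\end{aligned}
```

Under these assumptions, sd = 3 with ICC = 0.3 implies a between-cluster standard deviation of about 1.96, and sd_between = 4 with ICC = 0.2 implies a within-cluster standard deviation of 8.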

Examples


# Divide observations into clusters
clusters = rep(1:5, 10)

# Default: unit variance within each cluster
draw_normal_icc(clusters = clusters, ICC = 0.5)
#>  [1]  3.193677002 -0.694161660 -1.163596044 -0.030426854 -0.449739775
#>  [6]  1.448604466 -0.198496197 -0.724556142 -0.494297442 -1.286804300
#> [11]  0.922793984  1.627148097 -2.821770515 -0.318945137 -2.081215241
#> [16]  2.267226514  1.392289228 -3.514052755 -0.546281233 -0.686807563
#> [21]  0.285575296 -0.212877609 -1.197733200 -2.104578483 -2.182892144
#> [26]  1.860005221  2.514494388 -3.378069872 -0.313458110 -2.060269606
#> [31]  3.358570696  0.512100103 -1.865812327 -2.180867268  0.406161666
#> [36]  1.192231036  0.168507509 -1.654285359  0.093821567 -0.205126568
#> [41]  0.349162641 -0.004802547 -1.976556277 -0.943767093  2.213761821
#> [46]  2.640729395 -0.695767120 -1.205844672 -0.627108986 -0.680444448

# Alternatively, you can specify the mean and within-cluster SD:
draw_normal_icc(mean = 10, clusters = clusters, sd = 3, ICC = 0.3)
#>  [1] 10.108105 14.233949  1.027469 13.984978  9.547545 10.792231 13.460526
#>  [8] 16.285299 10.511811  9.704436  6.746031 15.437494 14.509163 12.166728
#> [15]  9.019455  4.838860  9.223192 12.136143 13.697882 10.254911  5.922168
#> [22] 10.017498 10.129288 15.085041 10.232368  3.110054 16.489392 10.382424
#> [29] 13.479198 14.029841 11.148306 10.900062  8.829749  7.735726 12.391908
#> [36]  9.684622 14.281370 10.131279 10.875278 10.428040  7.893524 11.386201
#> [43] 10.978700  6.415810 15.469439  2.582721  5.679822  8.662545 15.754203
#> [50] 15.136754

# Can specify between-cluster standard deviation instead:
draw_normal_icc(clusters = clusters, sd_between = 4, ICC = 0.2)
#>  [1]   6.71422798 -24.91961235 -10.57839551   5.51809067   5.44231853
#>  [6]  -5.67860083  -1.78680524   8.87558136   9.04368438   8.80904303
#> [11]   0.56161823  -8.78196665   4.03479889 -10.91835818  -1.85522981
#> [16]  -6.75884223 -11.27108686   5.54061668   8.21677577 -13.18258017
#> [21]  -4.67046765  -8.77616241  12.62723504   3.45399655   0.24836931
#> [26] -12.58538931   2.93049782   0.55791874  14.40594751   5.38960110
#> [31]   0.09590407  -7.66550647   7.26525772  -0.14612536  -8.04789035
#> [36]   7.07680595 -22.19037622  -3.24839479  22.72559225  -0.16300181
#> [41]  -2.53603415 -25.97474318  -1.21757308  12.83747507   5.83212545
#> [46]   6.21730898   0.37607464  -5.64978033   9.49129869   1.23139262

# Can specify total SD instead:
total_sd_draw = draw_normal_icc(clusters = clusters, ICC = 0.5, total_sd = 3)
sd(total_sd_draw)
#> [1] 3

# Verify that the generated ICC is approximately accurate
corr_draw = draw_normal_icc(clusters = clusters, ICC = 0.4)
summary(lm(corr_draw ~ as.factor(clusters)))$r.squared
#> [1] 0.3670907
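
For readers who want to see the mechanics, here is a minimal base-R sketch of a random-intercept data generating process of the kind described above. It is not the package's implementation: `sketch_normal_icc`, its `mu` argument, and the ANOVA-style check are hypothetical names and choices made for illustration, and the sketch assumes a single within-cluster standard deviation and equal cluster sizes.

```r
# Hypothetical helper (not part of the package): draw clustered normal data
# with a target ICC, given a within-cluster SD.
sketch_normal_icc <- function(clusters, ICC, sd = 1, mu = 0) {
  stopifnot(ICC > 0, ICC < 1)
  cluster_index <- as.integer(as.factor(clusters))
  n_clusters <- max(cluster_index)
  # Between-cluster SD implied by the ICC and the within-cluster SD
  sd_between <- sd * sqrt(ICC / (1 - ICC))
  # One shared draw per cluster, plus independent individual-level noise
  cluster_effect <- rnorm(n_clusters, mean = 0, sd = sd_between)
  noise <- rnorm(length(clusters), mean = 0, sd = sd)
  mu + cluster_effect[cluster_index] + noise
}

set.seed(1)
n_per <- 5
many_clusters <- rep(1:2000, each = n_per)
y <- sketch_normal_icc(many_clusters, ICC = 0.4)

# One-way ANOVA estimate of the ICC for a balanced design
cluster_means <- as.numeric(tapply(y, many_clusters, mean))
msb <- n_per * var(cluster_means)                        # between mean square
msw <- mean(as.numeric(tapply(y, many_clusters, var)))   # within mean square
icc_hat <- (msb - msw) / (msb + (n_per - 1) * msw)
icc_hat
```

With many clusters the estimate should land near the target of 0.4; with only five clusters, as in the examples above, any single draw can differ noticeably from the target.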