Data is generated so that the inter-cluster correlation is 0 and the intra-cluster correlation equals ICC in expectation. The data generating process used in this function is described at the following URL: https://stats.stackexchange.com/questions/263451/create-synthetic-data-with-a-given-intraclass-correlation-coefficient-icc

draw_normal_icc(
  mean = 0,
  N = NULL,
  clusters,
  sd = NULL,
  sd_between = NULL,
  total_sd = NULL,
  ICC = NULL
)

Arguments

mean

A number or vector of numbers, one mean per cluster. If none is provided, will default to 0.

N

(Optional) A number indicating the number of observations to be generated. Must be equal to length(clusters) if provided.

clusters

A vector of factors or items that can be coerced to clusters; the length will determine the length of the generated data.

sd

A number or vector of numbers indicating the within-cluster standard deviation, that is, the standard deviation of each cluster's error terms (default 1).

sd_between

A number or vector of numbers, indicating the standard deviation between clusters.

total_sd

A number indicating the total sd of the resulting variable. May only be specified if ICC is specified and sd and sd_between are not.

ICC

A number indicating the desired ICC.

Value

A vector of numbers corresponding to the observations from the supplied cluster IDs.

Details

The typical use for this function is for a user to provide an ICC and, optionally, a set of within-cluster standard deviations, sd. If the user does not provide sd, the default value is 1. These arguments imply a fixed between-cluster standard deviation.
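
To make the implied value concrete, here is a minimal sketch that assumes the usual definition ICC = sd_between^2 / (sd_between^2 + sd^2). The helper implied_sd_between is hypothetical and not part of the package; it only illustrates the arithmetic.

```r
# Hypothetical helper (not part of the package): solve
#   ICC = sd_between^2 / (sd_between^2 + sd^2)
# for sd_between, given the within-cluster SD and the target ICC.
implied_sd_between <- function(sd, ICC) {
  stopifnot(ICC >= 0, ICC < 1, all(sd > 0))
  sd * sqrt(ICC / (1 - ICC))
}

# With the default sd = 1 and ICC = 0.5, the two SDs coincide
implied_sd_between(sd = 1, ICC = 0.5)
#> [1] 1
```

Under this definition, a higher ICC at a fixed sd means a larger between-cluster standard deviation, and therefore a larger total variance.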

An alternate mode for the function is to provide between-cluster standard deviations, sd_between, and an ICC. These arguments imply a fixed within-cluster standard deviation.
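
The within-cluster value implied by this mode can be worked out the same way. The sketch below assumes the definition ICC = sd_between^2 / (sd_between^2 + sd^2); implied_sd_within is a hypothetical helper, not a package function.

```r
# Hypothetical helper (not part of the package): solve
#   ICC = sd_between^2 / (sd_between^2 + sd^2)
# for the within-cluster SD, given sd_between and the target ICC.
implied_sd_within <- function(sd_between, ICC) {
  stopifnot(ICC > 0, ICC <= 1, all(sd_between > 0))
  sd_between * sqrt((1 - ICC) / ICC)
}

# sd_between = 4 with ICC = 0.2 implies a within-cluster SD of 8
implied_sd_within(sd_between = 4, ICC = 0.2)
#> [1] 8
```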

If users provide all three of ICC, sd_between, and sd, the function will warn the user and use the provided standard deviations for generating the data.
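
Once both standard deviations are known, the data generating process can be pictured as a shared cluster-level shock plus an independent individual-level error. The following is an illustrative sketch under that assumption, not the package's source code; sketch_icc_draw is a hypothetical name, and for brevity it accepts scalar mean, sd, and sd_between only.

```r
# Hypothetical sketch (not the package source): one shared normal shock per
# cluster plus an independent normal error per observation.
sketch_icc_draw <- function(clusters, mean = 0, sd = 1, sd_between = 1) {
  cluster_index <- as.integer(as.factor(clusters))
  n_clusters <- max(cluster_index)
  # Between-cluster component: one draw per cluster, shared by its members
  cluster_shock <- rnorm(n_clusters, mean = 0, sd = sd_between)
  # Within-cluster component: independent across observations
  individual_error <- rnorm(length(clusters), mean = 0, sd = sd)
  mean + cluster_shock[cluster_index] + individual_error
}

set.seed(1)
draw <- sketch_icc_draw(clusters = rep(1:5, 10), sd = 1, sd_between = 1)
length(draw)
#> [1] 50
```

Because observations in different clusters share no component, their correlation is 0, while two observations in the same cluster share the cluster shock and so have correlation sd_between^2 / (sd_between^2 + sd^2).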

Examples


# Divide observations into clusters
clusters = rep(1:5, 10)

# Default: unit variance within each cluster
draw_normal_icc(clusters = clusters, ICC = 0.5)
#>  [1]  0.34066735 -1.09542354  0.49465106  0.03068483 -0.38183951 -1.57453527
#>  [7] -1.04707701 -0.14266562 -0.34736660 -0.46482503 -0.06810272 -2.46486124
#> [13]  0.93923623  0.57155456  0.19195177 -0.63926347 -0.52185873 -0.60495800
#> [19] -0.21088358 -1.15578495 -0.96644288 -1.45195861  1.16092529 -0.27719126
#> [25] -1.61910999  0.18104605 -0.90927880 -0.03387395 -0.77261877 -1.38040296
#> [31] -0.14362702 -2.55397984 -1.07871006 -1.13080947 -2.10452400 -0.25894432
#> [37] -0.98785028 -0.25898333 -1.03700062  0.58671995 -0.21903997 -0.43154027
#> [43]  0.65288855 -0.33999751  0.20586299 -1.23377258 -0.75212116  0.84016629
#> [49]  0.03522322  0.11707450

# Alternatively, you can specify the mean and the within-cluster SD:
draw_normal_icc(mean = 10, clusters = clusters, sd = 3, ICC = 0.3)
#>  [1]  7.683778 11.595033 11.208928  8.130258 10.672332  7.478325  6.748232
#>  [8] 11.881818 16.267152  4.374796 12.874916  5.270785 10.934186  7.167838
#> [15]  6.263748  8.642012 12.295274  8.562739  8.747630  6.265516  2.455183
#> [22] 15.205841 11.110051  6.028755 11.461177  2.730217 10.152848 13.849622
#> [29]  6.019095 13.100577  5.285693  7.570052 12.674419 11.908607 12.861187
#> [36]  7.627498 13.037720 10.945860  8.900722 10.254335  5.248685  7.019324
#> [43] 12.743514  6.086238  4.823689  9.277063 10.018417  8.688299  4.979698
#> [50] 10.979758

# Can specify between-cluster standard deviation instead:
draw_normal_icc(clusters = clusters, sd_between = 4, ICC = 0.2)
#>  [1]  -7.8861422  17.4973224  -0.9747662  10.9927142   0.3669069   1.2359383
#>  [7]  -0.5330519  11.7032931  -0.6102078   2.3591629  -9.1082153   0.2796305
#> [13]   3.7425805 -14.2025004 -11.7893536   1.3154810  25.4646336  10.5318516
#> [19]  11.6572157 -13.7360200  -1.5324509  -1.3092685   2.4174306  14.1789690
#> [25]   4.9655754   0.8698504   9.5552141   8.6847665  20.7272530 -11.8277109
#> [31]   4.0713956  -2.2831786   1.5226788 -13.0287353  -2.4106954   6.6123508
#> [37]   4.0928958  -6.8973644   1.7846996  -7.6459503   2.0411449  12.1529656
#> [43]   7.8893099   2.3123863 -14.3483976  -1.4316713  -0.8622560  -7.7531333
#> [49]  -4.3572488 -24.2363288

# Can specify total SD instead:
total_sd_draw = draw_normal_icc(clusters = clusters, ICC = 0.5, total_sd = 3)
sd(total_sd_draw)
#> [1] 3

# Rough check of the generated ICC (noisy with only 5 clusters)
corr_draw = draw_normal_icc(clusters = clusters, ICC = 0.4)
summary(lm(corr_draw ~ as.factor(clusters)))$r.squared
#> [1] 0.2751446
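
As a complementary check, the ICC can also be estimated with the standard one-way ANOVA estimator, which tends to be closer to the target when there are many clusters. The sketch below is an assumption-laden illustration: anova_icc is a hypothetical helper, it requires equally sized clusters, and the test data are built with base R's rnorm rather than draw_normal_icc so that the block runs on its own.

```r
# Hypothetical helper: one-way ANOVA estimator of the ICC for balanced clusters
anova_icc <- function(y, clusters) {
  clusters <- as.factor(clusters)
  sizes <- as.vector(table(clusters))
  stopifnot(length(unique(sizes)) == 1)  # balanced design assumed
  k <- sizes[1]                          # observations per cluster
  fit <- anova(lm(y ~ clusters))
  msb <- fit["clusters", "Mean Sq"]      # between-cluster mean square
  msw <- fit["Residuals", "Mean Sq"]     # within-cluster mean square
  (msb - msw) / (msb + (k - 1) * msw)
}

# 200 clusters of 10 observations, with equal between- and within-cluster SDs,
# so the target ICC is 0.5
set.seed(42)
many_clusters <- rep(1:200, each = 10)
y <- rnorm(200, sd = 1)[many_clusters] + rnorm(2000, sd = 1)
anova_icc(y, many_clusters)
```

With only 5 clusters, as in the examples above, any such estimate varies a lot from draw to draw, so a single value far from the requested ICC is not by itself evidence of a problem.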