Simulations - OLS and Variance

This document explores the properties of different variance estimators using DeclareDesign and estimatr. More details about the variance estimators, with references, can be found in the mathematical notes.

library(DeclareDesign)
## Loading required package: randomizr
## Loading required package: fabricatr
## Loading required package: estimatr
library(ggplot2)

Homoskedastic errors

Under simple conditions with homoskedasticity (i.e., all errors are drawn from a distribution with the same variance), the classical estimator of the variance of OLS should be unbiased. In this section I demonstrate this to be true using DeclareDesign and estimatr.

First, let’s take a simple set up:

\[ \begin{aligned} \mathbf{y} &= \mathbf{X}\beta + \epsilon, \\ \epsilon_i &\overset{i.i.d.}{\sim} N(0, \sigma^2). \end{aligned} \]

For our simulation, let’s have a constant and one covariate, so that \(\mathbf{X} = [\mathbf{1}, \mathbf{x_1}]\), where \(\mathbf{x_1}\) is a column vector of a covariate drawn from a standard normal distribution. Let’s also assume that our covariates are fixed, rather than stochastic. Let’s draw the data we will use.

set.seed(41)
dat <- data.frame(x = rnorm(50))

The error distribution

\[ \epsilon_i \overset{i.i.d.}{\sim} N(0, \sigma^2) \]

encodes the assumption of homoskedasticity. Because of these homoskedastic errors, we know that the true variance of the coefficients with fixed covariates is

\[ \mathbb{V}[\widehat{\beta}] = \sigma^2 (\mathbf{X}^\top \mathbf{X})^{-1}, \]

where I hide conditioning on \(\mathbf{X}\) for simplicity.

Let’s compute the true variance for our dataset.

sigmasq <- 4
# Build the X matrix with intercept
Xmat <- cbind(1, dat$x)
# Invert XtX
XtX_inv <- solve(crossprod(Xmat))
# Get full variance covariance matrix
true_var_cov_mat <- sigmasq * XtX_inv

But for this example, we are only going to focus on the variance for the covariate, not the intercept, so let’s store that variance.

true_varb <- true_var_cov_mat[2, 2]
true_varb
## [1] 0.07831866
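
Since there is only one covariate, this matches the familiar closed form for the slope variance in a bivariate regression with an intercept, \(\sigma^2 / \sum_i (x_i - \bar{x})^2\), which we can verify directly:

# Closed-form slope variance; should equal true_varb up to floating point
sigmasq / sum((dat$x - mean(dat$x))^2)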

Now, using DeclareDesign, let’s specify the rest of the data generating process (DGP). Let’s set \(\beta = [0, 1]^\top\), so that the true DGP is \(\mathbf{y} = \mathbf{x_1} + \epsilon\).

simp_pop <- declare_population(
  epsilon = rnorm(N, sd = 2),
  y = x + epsilon
)
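
To see what one draw from this DGP looks like, here is a minimal sketch using fabricatr directly (declare_population applies this same logic to our fixed data on each simulation):

# Illustrative only: one draw of the heteroskedasticity-free DGP
head(fabricate(dat, epsilon = rnorm(N, sd = 2), y = x + epsilon), 3)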

Now let’s tell DeclareDesign that our target, our estimand, is the true variance.

varb_estimand <- declare_estimand(true_varb = true_varb)

Our estimator for this estimand will be the classical OLS variance estimator, which we know should be unbiased:

\[ \widehat{\mathbb{V}[\widehat{\beta}]} = \frac{\mathbf{e}^\top\mathbf{e}}{N - K} (\mathbf{X}^\top \mathbf{X})^{-1}, \]

where the residuals are \(\mathbf{e} = \mathbf{y} - \mathbf{X}\widehat{\beta}\), \(N\) is the number of observations, and \(K\) is the number of regressors (two in our case). We can easily get this estimate of the variance by squaring the standard error returned by lm_robust in estimatr. Let’s tell DeclareDesign to use that estimator and get the coefficient on the \(\mathbf{x_1}\) variable.

lmc <- declare_estimator(
  y ~ x,
  model = estimatr::lm_robust,
  se_type = "classical",
  estimand = varb_estimand,
  coefficient_name = "x"
)
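
As a quick sanity check outside the design, here is an illustrative sketch (not part of the DeclareDesign pipeline) that computes the classical variance by hand for one simulated draw and confirms it matches the squared standard error from lm_robust:

# One simulated draw from the DGP above
eps <- rnorm(nrow(dat), sd = 2)
y <- dat$x + eps
fit <- estimatr::lm_robust(y ~ x, data = data.frame(dat, y), se_type = "classical")
# e'e / (N - K) * (X'X)^{-1}, reusing Xmat and XtX_inv from above
e <- y - Xmat %*% coef(fit)
vhat <- as.numeric(crossprod(e) / (nrow(Xmat) - ncol(Xmat))) * XtX_inv
all.equal(sqrt(vhat[2, 2]), fit$std.error[["x"]])  # should be TRUE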

Now, we want to test a few results using Monte Carlo simulation. Our main goal is to show that our estimated variance is unbiased for the true variance (our estimand). We can do this by comparing the mean of our estimated variances across our Monte Carlo simulations to the true variance. We can also show that the standard error of our coefficient estimate is the same as the standard deviation of the sampling distribution of our coefficient. Lastly, we also measure the coverage of our 95 percent confidence intervals across simulations to test whether they cover the true coefficient 95 percent of the time.

Let’s first set up the design and our diagnosands.

# First declare all the steps of our design, starting with our fixed data
classical_design <- declare_design(
  dat,
  simp_pop,
  varb_estimand,
  lmc
)

# Declare a set of diagnosands that help us check if
# we have unbiasedness
my_diagnosands <- declare_diagnosands(
  `Bias of Estimated Variance` = mean(se^2 - estimand),
  `Bias of Standard Error` = mean(se - sd(est)),
  `Coverage Rate` = mean(1 <= ci_upper & 1 >= ci_lower),
  `Mean of Estimated Variance` = mean(se^2),
  `True Variance` = estimand[1]
)

Now let’s run the simulations!

# Run 20,000 simulations
set.seed(42)
dig <- diagnose_design(
  classical_design,
  sims = 20000,
  diagnosands = my_diagnosands,
  parallel = TRUE
)

Our diagnosands can help us see if there is any bias. Note that the standard error of each diagnosand, denoted se(Diagnosand), is a useful tool that lets us know whether our estimated bias, or any other diagnosand, is simply an artifact of simulation noise or is precisely estimated.

Diagnosand                         Value
True Variance                       0.07832
Mean of Estimated Variance          0.07831
Bias of Estimated Variance         -0.00001
se(Bias of Estimated Variance)      0.00011

As we can see, the bias is very close to zero. Because the standard error of the bias is much larger than the estimated bias, we can be reasonably certain that the only reason the bias is not exactly zero is simulation error. We can also see the unbiasedness visually, using a density plot of the estimated variances with a vertical line at the true variance.
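
A sketch of that plot, assuming the individual simulation draws are available in dig$simulations with the se column used by the diagnosands above:

# Illustrative sketch; assumes dig$simulations holds one row per simulation
ggplot(dig$simulations, aes(x = se^2)) +
  geom_density() +
  geom_vline(xintercept = true_varb, linetype = "dashed") +
  labs(x = "Estimated variance", y = "Density")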

We can also show that the standard error is unbiased for the standard deviation of the sampling distribution of \(\beta\) and that the coverage is appropriate.

Diagnosand                         Value
Bias of Standard Error             -0.00161
se(Bias of Standard Error)          0.00140
Coverage Rate                       0.94815
se(Coverage Rate)                   0.00164

Heteroskedastic errors

Let’s use the same fixed data set-up, but introduce heteroskedasticity. In this case, the variance of the errors is different across units:

\[ \epsilon_i \sim N(0, \sigma_i^2), \]

where \(\sigma^2_i \neq \sigma^2_j\) for some units \(i\) and \(j\). If the variance of the errors is not independent of the regressors, the “classical” variance estimator will be biased and inconsistent. Meanwhile, heteroskedasticity-consistent variance estimators, such as the HC2 estimator, are consistent and generally less biased than the “classical” estimator. Let’s demonstrate this using DeclareDesign. First, let’s specify the variance of the errors to be strongly correlated with \(x\).

dat <- fabricate(
  dat,
  noise_var = 1 + (x - min(x))^2
)
# Plot shows variance of errors increasing with x
plot(dat$x, dat$noise_var, xlab = "x", ylab = "sigmasq_i")

Note that the general form of the true variance with fixed covariates is:

\[ \mathbb{V}[\widehat{\beta}] = (\mathbf{X}^\top \mathbf{X})^{-1} \mathbf{X}^\top \mathbf{\Phi} \mathbf{X} (\mathbf{X}^\top \mathbf{X})^{-1}, \]

where \(\mathbf{\Phi}\) is the variance covariance matrix of the errors, or \(\mathbf{\Phi} = \mathbb{E}[\epsilon\epsilon^\top]\). In the above case with homoskedasticity, we assumed \(\mathbf{\Phi} = \sigma^2 \mathbf{I}\) and were able to simplify. Now, as in the standard set up for heteroskedasticity, we set \(\mathbf{\Phi}\) to be a diagonal matrix where noise_var, the variance for each unit’s error, is on the diagonal, like so:

\[ \mathbf{\Phi} = \begin{bmatrix} \sigma_1^2 & 0 & \cdots & 0 \\ 0 & \sigma_2^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma_n^2 \end{bmatrix} \]

Using that error structure and the error variance for each unit, we can compute the true variance.

Xmat <- with(dat, cbind(1, x))
XtX_inv <- solve(crossprod(Xmat))
varb <- tcrossprod(XtX_inv, Xmat) %*% diag(with(dat, noise_var)) %*% Xmat %*% XtX_inv
true_varb_het <- varb[2, 2]
true_varb_het
## [1] 0.1473923

Now let’s use DeclareDesign to test whether HC2 is less biased than classical standard errors in this example. Here I will use another feature of DeclareDesign, a design template: a function that takes different datasets (i.e., of different sizes) and variance estimators, so that I can test the properties of different estimators across different datasets.

# This creates a template that takes some fixed data
# that has x and noise_var. Then for that data, it creates
# a single design that you can then simulate many times to get
# the properties you are interested in
# It also takes the two se_types you'd like to compare
fixed_dat_het_design_temp <- function(dat, se_types) {

  # Get true variance for this data
  Xmat <- with(dat, cbind(1, x))
  XtX_inv <- solve(crossprod(Xmat))
  varb <- tcrossprod(XtX_inv, Xmat) %*% diag(with(dat, noise_var)) %*% Xmat %*% XtX_inv
  true_varb_het <- varb[2, 2]

  # Population function now has heteroskedastic noise
  simp_pop <- declare_population(
    epsilon = rnorm(N, sd = sqrt(noise_var)),
    y = x + epsilon
  )

  varb_het_estimand <- declare_estimand(true_varb_het = true_varb_het)

  # Now we declare the two estimators
  lm1 <- declare_estimator(
    y ~ x,
    model = estimatr::lm_robust,
    se_type = !!se_types[1], # the !! operator unquotes the value and passes it to estimatr
    estimand = varb_het_estimand,
    coefficient_name = "x",
    label = se_types[1]
  )

  lm2 <- declare_estimator(
    y ~ x,
    model = estimatr::lm_robust,
    se_type = !!se_types[2],
    estimand = varb_het_estimand,
    coefficient_name = "x",
    label = se_types[2]
  )

  # We return the design so we can diagnose it
  return(
    declare_design(
      dat,
      simp_pop,
      varb_het_estimand,
      lm1,
      lm2
    )
  )
}

So let’s use the same diagnosands as above to test the properties of our estimators with heteroskedasticity.

# Create a design using our template and the data we have been using
het_design <- fixed_dat_het_design_temp(dat, se_types = c("classical", "HC2"))
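
We diagnose this design just as before; the simulation count here is an assumption for illustration:

het_dig <- diagnose_design(
  het_design,
  sims = 20000,  # assumed simulation count
  diagnosands = my_diagnosands
)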
Diagnosand                        Classical        HC2
True Variance                       0.14739    0.14739
Bias of Estimated Variance         -0.03495   -0.00251
Mean of Estimated Variance          0.11244    0.14489
se(Bias of Estimated Variance)      0.00025    0.00058
Bias of Standard Error             -0.05280   -0.01535
se(Bias of Standard Error)          0.00282    0.00286
Coverage Rate                       0.91820    0.93350
se(Coverage Rate)                   0.00257    0.00223

As you can see, the bias for the HC2 estimator is much closer to zero, whereas the bias for the classical estimator is much larger, especially when compared to the standard error of the bias diagnosand. How does this bias change as the sample size changes? Because the HC2 variance estimator is consistent under heteroskedasticity, its bias should converge to zero.

Ns <- c(100, 500, 1000, 2500, 5000)
diags <- vector("list", length(Ns))

set.seed(42)
for (i in seq_along(Ns)) {
  # Generate ONE fixed dataset per N
  dat <- fabricate(
    N = Ns[i],
    x = rnorm(N),
    noise_var = 1 + (x - min(x))^2
  )
  des <- fixed_dat_het_design_temp(dat, se_types = c("classical", "HC2"))
  diags[[i]] <- diagnose_design(
    des,
    sims = 10000,
    diagnosands = my_diagnosands,
    bootstrap = FALSE
  )$diagnosands
  diags[[i]]$N <- Ns[i]
}
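
A minimal sketch of the plot the next paragraph describes; the estimator_label column name is an assumption about the diagnosands data frames collected above:

# Stack the per-N diagnosands and plot bias against sample size
bias_df <- do.call(rbind, diags)
ggplot(bias_df, aes(x = N, y = `Bias of Estimated Variance`,
                    color = estimator_label)) +
  geom_line() +
  geom_hline(yintercept = 0, linetype = "dashed")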

As you can see, the HC2 variance tends to be conservative (its bias is positive) and is consistent, converging to the true value as the sample size increases. The classical estimator’s bias also shrinks in absolute terms, but it stays negative and never reaches zero, systematically underestimating the variance of the coefficients.

HC1 and HC2 in small samples

The default variance estimator in estimatr is HC2, rather than HC1 (Stata’s default). We do this for several reasons, one of which is that HC2 outperforms HC1 in small samples. Let’s test this by comparing HC1 and HC2 over some small sample sizes, plotting the bias of each estimator across 40 datasets per sample size.

datasets <- 40
# Generate several fixed datasets per N
Ns <- rep(c(15, 25, 50), each = datasets)
diags_hc <- vector("list", length(Ns))

set.seed(42)
for (i in seq_along(Ns)) {
  dat <- fabricate(
    N = Ns[i],
    x = rnorm(N),
    noise_var = 1 + (x - min(x))^2
  )
  des <- fixed_dat_het_design_temp(dat, se_types = c("HC1", "HC2"))
  diags_hc[[i]] <- diagnose_design(
    des,
    sims = 200,
    diagnosands = my_diagnosands,
    bootstrap = FALSE
  )$diagnosands
  diags_hc[[i]]$N <- Ns[i]
}
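
A similar sketch for this comparison, under the same assumption about the diagnosands columns, with one point per fixed dataset:

# One point per dataset; jitter within each sample size for readability
hc_df <- do.call(rbind, diags_hc)
ggplot(hc_df, aes(x = factor(N), y = `Bias of Estimated Variance`,
                  color = estimator_label)) +
  geom_jitter(width = 0.15) +
  geom_hline(yintercept = 0, linetype = "dashed") +
  xlab("N")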

In small datasets, HC1 variance estimates suffer from greater downward bias than HC2 variance estimates.