# Two-Stage Least Squares Instrumental Variables Regression

This function estimates an instrumental variables regression using two-stage least squares, with a variety of options for robust standard errors.

```r
iv_robust(formula, data, weights, subset, clusters, fixed_effects,
  se_type = NULL, ci = TRUE, alpha = 0.05, return_vcov = TRUE,
  try_cholesky = FALSE)
```

## Arguments

- `formula`: an object of class formula specifying the regression and the instruments. For example, the formula `y ~ x1 + x2 | z1 + z2 + z3` specifies `x1` and `x2` as regressors and `z1`, `z2`, and `z3` as instruments.

- `data`: a `data.frame` containing the variables in the model.

- `weights`: the bare (unquoted) name of the weights variable in the supplied data.

- `subset`: an optional bare (unquoted) expression specifying a subset of observations to be used.

- `clusters`: an optional bare (unquoted) name of the variable that corresponds to the clusters in the data.

- `fixed_effects`: an optional right-sided formula containing the fixed effects that will be projected out of the data, such as `~ year + country`.

- `se_type`: the sort of standard error sought. If `clusters` is not specified, the options are "HC0", "HC1" (or "stata", the equivalent), "HC2" (default), "HC3", or "classical". If `clusters` is specified, the options are "CR0", "CR2" (default), or "stata". Can also be "none", which may speed up estimation of the coefficients.

- `ci`: logical. Whether to compute and return p-values and confidence intervals, TRUE by default.

- `alpha`: the significance level, 0.05 by default.

- `return_vcov`: logical. Whether to return the variance-covariance matrix for later usage, TRUE by default.

- `try_cholesky`: logical. Whether to try using a Cholesky decomposition to solve least squares instead of a QR decomposition, FALSE by default. A Cholesky decomposition may result in speed gains, but it should only be used if users are sure their model is full rank (i.e., there is no perfect multicollinearity).
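A minimal call combining several of these arguments (a sketch on simulated, hypothetical data; it assumes the `estimatr` package is installed):

```r
library(estimatr)

# hypothetical data: x is endogenous, z is its instrument, w is exogenous
set.seed(1)
n <- 200
z <- rnorm(n)
w <- rnorm(n)
x <- 0.8 * z + 0.5 * w + rnorm(n)
y <- 1 + 2 * x - w + rnorm(n)
dat <- data.frame(y, x, w, z)

# the exogenous regressor w appears on both sides of the bar
fit <- iv_robust(y ~ x + w | z + w, data = dat, se_type = "HC2", alpha = 0.05)
coef(fit)
```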

## Value

An object of class `"iv_robust"`.

The post-estimation functions `summary` and `tidy` return results in a `data.frame`. To get useful data out of the return, you can use these data frames, use the resulting list directly, or use the generic accessor functions `coef`, `vcov`, `confint`, and `predict`.

An object of class `"iv_robust"` is a list containing at least the
following components:

- `coefficients`: the estimated coefficients
- `std.error`: the estimated standard errors
- `statistic`: the t-statistics
- `df`: the estimated degrees of freedom
- `p.value`: the p-values from a two-sided t-test using `coefficients`, `std.error`, and `df`
- `conf.low`: the lower bound of the `1 - alpha` percent confidence interval
- `conf.high`: the upper bound of the `1 - alpha` percent confidence interval
- `term`: a character vector of coefficient names
- `alpha`: the significance level specified by the user
- `se_type`: the standard error type specified by the user
- `res_var`: the residual variance
- `nobs`: the number of observations used
- `k`: the number of columns in the design matrix (includes linearly dependent columns!)
- `rank`: the rank of the fitted model
- `vcov`: the fitted variance-covariance matrix
- `r.squared`: the \(R^2\) of the second stage regression
- `adj.r.squared`: the \(R^2\) of the second stage regression, penalized for the number of parameters, `rank`
- `fstatistic`: a vector with the value of the second stage F-statistic and the numerator and denominator degrees of freedom
- `weighted`: whether or not weights were applied
- `call`: the original function call
- `fitted.values`: the matrix of predicted means
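These components can be pulled out with the generic accessors, or by indexing the list directly. A small sketch on simulated, hypothetical data (it assumes the `estimatr` package is installed):

```r
library(estimatr)

set.seed(7)
n <- 100
z <- rbinom(n, 1, 0.5)       # encouragement (instrument)
d <- z * rbinom(n, 1, 0.8)   # treatment take-up (endogenous)
y <- 1 + 0.5 * d + rnorm(n)
fit <- iv_robust(y ~ d | z, data = data.frame(y, d, z))

coef(fit)                    # named vector of estimates
vcov(fit)                    # fitted variance-covariance matrix
confint(fit, level = 0.95)   # 1 - alpha percent confidence intervals
fit$nobs                     # e.g. the number of observations used
```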

## Details

This function performs two-stage least squares estimation to fit
instrumental variables regression. The syntax is similar to that in
`ivreg` from the `AER` package. Regressors and instruments
should be specified in a two-part formula, such as
`y ~ x1 + x2 | z1 + z2 + z3`, where `x1` and `x2` are
regressors and `z1`, `z2`, and `z3` are instruments. Unlike
`ivreg`, you must explicitly specify all exogenous regressors on
both sides of the bar.

The default variance estimators are the same as in `lm_robust`.
Without clusters, we default to `HC2` standard errors, and with clusters
we default to `CR2` standard errors. 2SLS variance estimates are
computed using the same estimators as in `lm_robust`; however, the
design matrix used is the matrix of second-stage regressors, which includes the estimated
endogenous regressors, and the residuals used are the difference
between the outcome and a fit produced by the second-stage coefficients and the
first-stage (endogenous) regressors. More notes on this can be found in
the mathematical appendix.
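The mechanics described above can be sketched in base R (simulated, hypothetical data; this illustrates the 2SLS idea, not estimatr's actual implementation):

```r
set.seed(42)
n <- 500
z <- rnorm(n)              # instrument
x <- 0.7 * z + rnorm(n)    # endogenous regressor
y <- 1 + 2 * x + rnorm(n)  # outcome

# First stage: project the endogenous regressor onto the instrument
x_hat <- fitted(lm(x ~ z))

# Second stage: regress the outcome on the fitted endogenous regressor
b <- coef(lm(y ~ x_hat))

# Residuals for the variance estimate combine the second-stage coefficients
# with the ORIGINAL regressors, not the first-stage fitted values
res <- y - cbind(1, x) %*% b
```

In this just-identified case the second-stage slope equals the Wald/IV ratio `cov(z, y) / cov(z, x)`.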

If `fixed_effects` are specified, the outcome, regressors, and instruments
are centered using the method of alternating projections (Halperin 1962; Gaure 2013). Specifying
fixed effects in this way will result in large speed gains with standard error
estimators that do not need to invert the matrix of fixed effects; this means that using
"classical", "HC0", "HC1", "CR0", or "stata" standard errors will be faster than the other
standard error estimators. Be wary when specifying fixed effects that may result
in perfect fits for some observations, or when there are intersecting groups across
multiple fixed effect variables (e.g., if you specify both "year" and "country" fixed effects
with an unbalanced panel where in one year you only have data for one country).

## References

Gaure, Simon. 2013. "OLS with multiple high dimensional category variables." Computational Statistics & Data Analysis 66: 8-18. http://dx.doi.org/10.1016/j.csda.2013.03.024

Halperin, I. 1962. "The product of projection operators." Acta Scientiarum Mathematicarum (Szeged) 23(1-2): 96-99.

## Examples

```r
library(estimatr)
library(fabricatr)

dat <- fabricate(
  N = 40,
  Y = rpois(N, lambda = 4),
  Z = rbinom(N, 1, prob = 0.4),
  D = Z * rbinom(N, 1, prob = 0.8),
  X = rnorm(N),
  G = sample(letters[1:4], N, replace = TRUE)
)

# Instrument for treatment `D` with encouragement `Z`
tidy(iv_robust(Y ~ D + X | Z + X, data = dat))
#>          term   estimate std.error  statistic      p.value  conf.low conf.high
#> 1 (Intercept)  3.7273864 0.4776022  7.8043751 2.471825e-09  2.759673 4.6951003
#> 2           D -0.7100564 0.7858244 -0.9035815 3.720629e-01 -2.302288 0.8821750
#> 3           X  0.1560677 0.3621421  0.4309570 6.690000e-01 -0.577702 0.8898373
#>   df outcome
#> 1 37       Y
#> 2 37       Y
#> 3 37       Y

# Instrument with Stata's `ivregress 2sls , small rob` HC1 variance
tidy(iv_robust(Y ~ D | Z, data = dat, se_type = "stata"))
#>          term   estimate std.error  statistic      p.value conf.low conf.high
#> 1 (Intercept)  3.6666667 0.4700241  7.8010180 2.083704e-09 2.715153 4.6181808
#> 2           D -0.6140351 0.7436379 -0.8257179 4.141183e-01 -2.119451 0.8913811
#>   df outcome
#> 1 38       Y
#> 2 38       Y

# With clusters, we use CR2 errors by default
dat$cl <- rep(letters[1:5], length.out = nrow(dat))
tidy(iv_robust(Y ~ D | Z, data = dat, clusters = cl))
#>          term   estimate std.error statistic      p.value  conf.low conf.high
#> 1 (Intercept)  3.6666667 0.2317241 15.823413 0.0001712102  2.997917 4.3354161
#> 2           D -0.6140351 0.4874346 -1.259728 0.2764953738 -1.969373 0.7413026
#>         df outcome
#> 1 3.646251       Y
#> 2 3.985068       Y

# Again, easy to replicate Stata (again with `small` correction in Stata)
tidy(iv_robust(Y ~ D | Z, data = dat, clusters = cl, se_type = "stata"))
#>          term   estimate std.error statistic      p.value conf.low conf.high
#> 1 (Intercept)  3.6666667 0.2391517 15.331971 0.0001055703 3.002675 4.3306582
#> 2           D -0.6140351 0.5047569 -1.216497 0.2906673250 -2.015465 0.7873946
#>   df outcome
#> 1  4       Y
#> 2  4       Y

# We can also specify fixed effects, which will be taken as exogenous regressors
# Speed gains with fixed effects are greatest with "stata" or "HC1" std.errors
tidy(iv_robust(Y ~ D | Z, data = dat, fixed_effects = ~ G, se_type = "HC1"))
#>   term  estimate std.error  statistic   p.value  conf.low conf.high df outcome
#> 1    D -0.618551 0.8095842 -0.7640355 0.4499664 -2.262094  1.024992 35       Y
```