estimatr

This function fits a linear model, provides a variety of options for robust standard errors, and conducts coefficient tests.

lm_robust(formula, data, weights, subset, clusters, fixed_effects,
  se_type = NULL, ci = TRUE, alpha = 0.05, return_vcov = TRUE,
  try_cholesky = FALSE)

Arguments

formula

an object of class formula, as in lm

data

A data.frame

weights

The bare (unquoted) name of the weights variable in the supplied data.

subset

An optional bare (unquoted) expression specifying a subset of observations to be used.

clusters

An optional bare (unquoted) name of the variable that corresponds to the clusters in the data.

fixed_effects

An optional right-sided formula containing the fixed effects that will be projected out of the data, such as ~ blockID. Do not pass multiple fixed effects with intersecting groups. Speed gains are greatest for variables with large numbers of groups and when using "HC1" or "stata" standard errors. See 'Details'.

se_type

The sort of standard error sought. If `clusters` is not specified the options are "HC0", "HC1" (or "stata", the equivalent), "HC2" (default), "HC3", or "classical". If `clusters` is specified the options are "CR0", "CR2" (default), or "stata". Can also specify "none", which may speed up estimation of the coefficients.

ci

logical. Whether to compute and return p-values and confidence intervals, TRUE by default.

alpha

The significance level, 0.05 by default.

return_vcov

logical. Whether to return the variance-covariance matrix for later usage, TRUE by default.

try_cholesky

logical. Whether to try using a Cholesky decomposition to solve least squares instead of a QR decomposition, FALSE by default. Using a Cholesky decomposition may result in speed gains, but should only be used if users are sure their model is full-rank (i.e., there is no perfect multicollinearity).

Value

An object of class "lm_robust".

The post-estimation functions summary and tidy return results in a data.frame. To get useful data out of the return, you can use these data frames, use the resulting list directly, or use the generic accessor functions coef, vcov, confint, and predict. Marginal effects and their uncertainty can be obtained by passing this object to margins from the margins package.

Users who want to print the results in TeX or HTML can use the extract function and the texreg package.

If users specify a multivariate linear regression model (multiple outcomes), then some of the below components will be of higher dimension to accommodate the additional models.

An object of class "lm_robust" is a list containing at least the following components:

coefficients

the estimated coefficients

std.error

the estimated standard errors

statistic

the t-statistic

df

the estimated degrees of freedom

p.value

the p-values from a two-sided t-test using coefficients, std.error, and df

conf.low

the lower bound of the 100 * (1 - alpha) percent confidence interval

conf.high

the upper bound of the 100 * (1 - alpha) percent confidence interval

term

a character vector of coefficient names

alpha

the significance level specified by the user

se_type

the standard error type specified by the user

res_var

the residual variance

N

the number of observations used

k

the number of columns in the design matrix (includes linearly dependent columns!)

rank

the rank of the fitted model

vcov

the fitted variance covariance matrix

r.squared

The \(R^2\), $$R^2 = 1 - \frac{\sum_i e_i^2}{\sum_i (y_i - y^*)^2},$$ where \(y^*\) is the mean of \(y_i\) if there is an intercept and zero otherwise, and \(e_i\) is the \(i\)th residual.

adj.r.squared

The \(R^2\), penalized for the number of estimated parameters (given by rank)

fstatistic

a vector with the value of the F-statistic with the numerator and denominator degrees of freedom

weighted

whether or not weights were applied

call

the original function call

fitted.values

the matrix of predicted means

We also return terms and contrasts, used by predict. If fixed_effects are specified, then we return proj_fstatistic, proj_r.squared, and proj_adj.r.squared, which are model fit statistics that are computed on the projected model (after demeaning the fixed effects).
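
How statistic, p.value, conf.low, and conf.high relate to coefficients, std.error, and df can be checked by hand. The following is a minimal base R sketch, not code from the package: it copies the intercept's estimate, standard error, and degrees of freedom from the tidy(lmro) output in the Examples, and the variable names are purely illustrative.

```r
# Values for the intercept, copied from the tidy(lmro) output in the Examples
est   <- 4.0720980
se    <- 0.4444853
df    <- 37
alpha <- 0.05

# t-statistic and two-sided p-value from a t distribution with df degrees of freedom
t_stat  <- est / se
p_value <- 2 * pt(abs(t_stat), df = df, lower.tail = FALSE)

# 100 * (1 - alpha) percent confidence interval
crit      <- qt(1 - alpha / 2, df = df)
conf_low  <- est - crit * se
conf_high <- est + crit * se

round(c(statistic = t_stat, conf.low = conf_low, conf.high = conf_high), 4)
```

Up to rounding of the inputs, the results should agree with the statistic, conf.low, and conf.high columns reported for the intercept in the Examples.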

Details

This function performs linear regression and provides a variety of standard errors. It takes a formula and data much in the same way as lm does, and all auxiliary variables, such as clusters and weights, can be passed either as quoted names of columns, as bare column names, or as a self-contained vector. Examples of usage can be seen below and in the Getting Started vignette.

The mathematical notes in this vignette specify the exact estimators used by this function. The default variance estimators have been chosen largely in accordance with the procedures in this manual. The default for the case without clusters is the HC2 estimator and the default with clusters is the analogous CR2 estimator. Users can easily replicate Stata standard errors in the clustered or non-clustered case by setting `se_type` = "stata".
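
To make the difference between these estimators concrete, here is a rough base R sketch of the textbook HC0, HC1, and HC2 sandwich formulas on simulated data. It is an illustration under stated assumptions, not the package's C++ implementation: the simulated data and the names X, e, h, and sandwich are hypothetical, and the HC1 scaling n / (n - k) is the standard degrees-of-freedom correction.

```r
set.seed(1)
n <- 40
x <- rnorm(n)
z <- rbinom(n, 1, prob = 0.4)
y <- rpois(n, lambda = 4)

# Design matrix and OLS fit via the normal equations (fine for a small example)
X <- cbind(intercept = 1, x = x, z = z)
k <- ncol(X)
XtX_inv <- solve(crossprod(X))
beta <- drop(XtX_inv %*% crossprod(X, y))
e <- drop(y - X %*% beta)          # residuals
h <- rowSums((X %*% XtX_inv) * X)  # leverages (diagonal of the hat matrix)

# Sandwich estimator: (X'X)^-1 X' diag(w) X (X'X)^-1
sandwich <- function(w) {
  meat <- crossprod(X * sqrt(w))   # scales row i of X by sqrt(w[i])
  XtX_inv %*% meat %*% XtX_inv
}

V_hc0 <- sandwich(e^2)             # no small-sample correction
V_hc1 <- V_hc0 * n / (n - k)       # degrees-of-freedom correction
V_hc2 <- sandwich(e^2 / (1 - h))   # leverage correction

se <- cbind(
  HC0 = sqrt(diag(V_hc0)),
  HC1 = sqrt(diag(V_hc1)),
  HC2 = sqrt(diag(V_hc2))
)
print(se)
```

Because each leverage lies between 0 and 1, the HC2 weights are never smaller than the HC0 weights, so HC2 standard errors are at least as large as HC0 standard errors.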

The function estimates the coefficients and standard errors in C++, using the RcppEigen package. By default, we estimate the coefficients using Column-Pivoting QR decomposition from the Eigen C++ library. Setting `try_cholesky` = TRUE uses a Cholesky decomposition instead, which will likely be faster, but the algorithm does not reliably detect linear dependencies in the model and may fail silently if they exist.
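
The trade-off between the two solvers can be sketched in base R. This is only an analogy: base R's qr() is not Eigen's column-pivoting QR, and the simulated data and variable names are assumptions made for illustration. On a full-rank design both routes give the same coefficients, while the QR route also reports the numerical rank, which is what lets a rank-deficient design be detected.

```r
set.seed(2)
n <- 50
X <- cbind(1, rnorm(n), rnorm(n))
y <- rnorm(n)

# Least squares via QR decomposition of X
beta_qr <- unname(qr.coef(qr(X), y))

# Least squares via Cholesky decomposition of X'X = R'R
R <- chol(crossprod(X))                   # upper-triangular factor
w <- forwardsolve(t(R), crossprod(X, y))  # solve R' w = X'y
beta_chol <- drop(backsolve(R, w))        # solve R b = w

# Both solutions agree when the design matrix is full rank
print(all.equal(beta_qr, beta_chol))

# A duplicated column makes the design rank deficient; QR reports this
X_bad <- cbind(X, X[, 2])
rank_bad <- qr(X_bad)$rank
print(c(columns = ncol(X_bad), rank = rank_bad))
```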

If `fixed_effects` are specified, both the outcome and design matrix are centered using the method of alternating projections (Halperin 1962; Gaure 2013). Specifying fixed effects in this way will result in large speed gains with standard error estimators that do not need to invert the matrix of fixed effects. This means using "classical", "HC0", "HC1", "CR0", or "stata" standard errors will be faster than other standard error estimators. Be wary when specifying fixed effects that may result in perfect fits for some observations or if there are intersecting groups across multiple fixed effect variables (e.g., if you specify both "year" and "country" fixed effects with an unbalanced panel in which, for one year, you have data for only one country).
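
The idea behind alternating projections can be sketched in a few lines of base R. This is a simplified illustration rather than the package's implementation: the simulated data, the demean helper, and its tol and max_iter arguments are all hypothetical. Each variable is repeatedly demeaned within each grouping factor until it stops changing, and the slope from the demeaned data is then compared with the slope from a regression that includes the group dummies explicitly.

```r
set.seed(3)
n  <- 200
g1 <- factor(sample(letters[1:5], n, replace = TRUE))
g2 <- factor(sample(LETTERS[1:4], n, replace = TRUE))
x  <- rnorm(n)
y  <- 1 + 2 * x + as.numeric(g1) + as.numeric(g2) + rnorm(n)

# Hypothetical helper: sweep out group means for each factor in turn,
# repeating until the vector no longer changes (alternating projections)
demean <- function(v, groups, tol = 1e-10, max_iter = 1000) {
  for (iter in seq_len(max_iter)) {
    v_old <- v
    for (g in groups) {
      v <- v - ave(v, g)  # subtract within-group means
    }
    if (max(abs(v - v_old)) < tol) break
  }
  v
}

groups <- list(g1, g2)
y_t <- demean(y, groups)
x_t <- demean(x, groups)

# Slope on the projected (demeaned) data
b_proj <- sum(x_t * y_t) / sum(x_t^2)

# Slope from the equivalent dummy-variable regression
b_dummy <- coef(lm(y ~ x + g1 + g2))[["x"]]

print(c(projected = b_proj, dummies = b_dummy))
```

The two slopes should agree up to the convergence tolerance. With a single grouping factor, one pass of demeaning is already exact; the iteration matters only when several factors are projected out together.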

References

Abadie, Alberto, Susan Athey, Guido W Imbens, and Jeffrey Wooldridge. 2017. "When Should You Adjust Standard Errors for Clustering?" arXiv Pre-Print. https://arxiv.org/abs/1710.02926v2.

Bell, Robert M, and Daniel F McCaffrey. 2002. "Bias Reduction in Standard Errors for Linear Regression with Multi-Stage Samples." Survey Methodology 28 (2): 169-82.

Gaure, Simon. 2013. "OLS with multiple high dimensional category variables." Computational Statistics & Data Analysis 66: 8-1. http://dx.doi.org/10.1016/j.csda.2013.03.024

Halperin, I. 1962. "The product of projection operators." Acta Scientiarum Mathematicarum (Szeged) 23(1-2): 96-99.

MacKinnon, James, and Halbert White. 1985. "Some Heteroskedasticity-Consistent Covariance Matrix Estimators with Improved Finite Sample Properties." Journal of Econometrics 29 (3): 305-25. https://doi.org/10.1016/0304-4076(85)90158-7.

Pustejovsky, James E, and Elizabeth Tipton. 2016. "Small Sample Methods for Cluster-Robust Variance Estimation and Hypothesis Testing in Fixed Effects Models." Journal of Business & Economic Statistics. Taylor & Francis. https://doi.org/10.1080/07350015.2016.1247004.

Samii, Cyrus, and Peter M Aronow. 2012. "On Equivalencies Between Design-Based and Regression-Based Variance Estimators for Randomized Experiments." Statistics and Probability Letters 82 (2). https://doi.org/10.1016/j.spl.2011.10.024.

Examples

library(estimatr)
library(fabricatr)
dat <- fabricate(
  N = 40,
  y = rpois(N, lambda = 4),
  x = rnorm(N),
  z = rbinom(N, 1, prob = 0.4)
)

# Default variance estimator is HC2 robust standard errors
lmro <- lm_robust(y ~ x + z, data = dat)

# Can tidy() the data into a data.frame
tidy(lmro)
#>          term  estimate std.error statistic      p.value   conf.low conf.high
#> 1 (Intercept) 4.0720980 0.4444853 9.1613776 4.723071e-11  3.1714852  4.972711
#> 2           x 0.5004417 0.4023025 1.2439439 2.213436e-01 -0.3147005  1.315584
#> 3           z 0.1008780 0.7319296 0.1378248 8.911262e-01 -1.3821521  1.583908
#>   df outcome
#> 1 37       y
#> 2 37       y
#> 3 37       y

# Can use summary() to get more statistics
summary(lmro)
#>
#> Call:
#> lm_robust(formula = y ~ x + z, data = dat)
#>
#> Standard error type: HC2
#>
#> Coefficients:
#>             Estimate Std. Error t value  Pr(>|t|) CI Lower CI Upper DF
#> (Intercept)   4.0721     0.4445  9.1614 4.723e-11   3.1715    4.973 37
#> x             0.5004     0.4023  1.2439 2.213e-01  -0.3147    1.316 37
#> z             0.1009     0.7319  0.1378 8.911e-01  -1.3822    1.584 37
#>
#> Multiple R-squared: 0.04638 , Adjusted R-squared: -0.00517
#> F-statistic: 0.7856 on 2 and 37 DF, p-value: 0.4633

# Can also get coefficients three ways
lmro$coefficients
#> (Intercept)           x           z
#>   4.0720980   0.5004417   0.1008780
coef(lmro)
#> (Intercept)           x           z
#>   4.0720980   0.5004417   0.1008780
tidy(lmro)$estimate
#> [1] 4.0720980 0.5004417 0.1008780

# Can also get confidence intervals from object or with new 1 - `alpha`
lmro$conf.low
#> (Intercept)           x           z
#>   3.1714852  -0.3147005  -1.3821521
confint(lmro, level = 0.8)
#>                    10 %     90 %
#> (Intercept)  3.49210937 4.652087
#> x           -0.02450443 1.025388
#> z           -0.85418344 1.055940

# Can recover classical standard errors
lmclassic <- lm_robust(y ~ x + z, data = dat, se_type = "classical")
tidy(lmclassic)
#>          term  estimate std.error statistic      p.value   conf.low conf.high
#> 1 (Intercept) 4.0720980 0.4916637 8.2822839 5.984029e-10  3.0758928  5.068303
#> 2           x 0.5004417 0.3750552 1.3343148 1.902551e-01 -0.2594923  1.260376
#> 3           z 0.1008780 0.7477913 0.1349013 8.934211e-01 -1.4142911  1.616047
#>   df outcome
#> 1 37       y
#> 2 37       y
#> 3 37       y

# Can easily match Stata's robust standard errors
lmstata <- lm_robust(y ~ x + z, data = dat, se_type = "stata")
tidy(lmstata)
#>          term  estimate std.error statistic      p.value   conf.low conf.high
#> 1 (Intercept) 4.0720980 0.4415569 9.2221360 3.976890e-11  3.1774187  4.966777
#> 2           x 0.5004417 0.3920827 1.2763677 2.097792e-01 -0.2939933  1.294877
#> 3           z 0.1008780 0.7320639 0.1377995 8.911461e-01 -1.3824243  1.584180
#>   df outcome
#> 1 37       y
#> 2 37       y
#> 3 37       y

# Easy to specify clusters for cluster-robust inference
dat$clusterID <- sample(1:10, size = 40, replace = TRUE)

lmclust <- lm_robust(y ~ x + z, data = dat, clusters = clusterID)
tidy(lmclust)
#>          term  estimate std.error  statistic      p.value   conf.low conf.high
#> 1 (Intercept) 4.0720980 0.3742515 10.8806460 0.0000278716  3.1643042  4.979892
#> 2           x 0.5004417 0.4249444  1.1776640 0.2830728405 -0.5366996  1.537583
#> 3           z 0.1008780 0.6001063  0.1681003 0.8711681416 -1.3125892  1.514345
#>         df outcome
#> 1 6.225079       y
#> 2 6.064140       y
#> 3 7.138410       y

# Can also match Stata's clustered standard errors
lm_robust(
  y ~ x + z,
  data = dat,
  clusters = clusterID,
  se_type = "stata"
)
#>              Estimate Std. Error    t value     Pr(>|t|)   CI Lower CI Upper DF
#> (Intercept) 4.0720980  0.3676274 11.0766974 1.518271e-06  3.2404669 4.903729  9
#> x           0.5004417  0.4095156  1.2220334 2.527346e-01 -0.4259469 1.426830  9
#> z           0.1008780  0.5910571  0.1706739 8.682555e-01 -1.2361861 1.437942  9

# Works just as lm() does with functions in the formula
dat$blockID <- rep(c("A", "B", "C", "D"), each = 10)

lm_robust(y ~ x + z + factor(blockID), data = dat)
#>                    Estimate Std. Error    t value     Pr(>|t|)   CI Lower
#> (Intercept)       5.2051154  0.9397759  5.5386774 3.426568e-06  3.2952610
#> x                 0.3043018  0.4435678  0.6860322 4.973446e-01 -0.5971365
#> z                -0.1127011  0.6859705 -0.1642943 8.704723e-01 -1.5067609
#> factor(blockID)B -1.2726531  1.3086524 -0.9724913 3.376744e-01 -3.9321548
#> factor(blockID)C -1.1121690  1.1621142 -0.9570222 3.453120e-01 -3.4738691
#> factor(blockID)D -1.9470029  1.1977093 -1.6256056 1.132686e-01 -4.3810411
#>                   CI Upper DF
#> (Intercept)      7.1149697 34
#> x                1.2057402 34
#> z                1.2813587 34
#> factor(blockID)B 1.3868486 34
#> factor(blockID)C 1.2495311 34
#> factor(blockID)D 0.4870352 34

# Weights are also easily specified
dat$w <- runif(40)

lm_robust(
  y ~ x + z,
  data = dat,
  weights = w,
  clusters = clusterID
)
#>               Estimate Std. Error   t value   Pr(>|t|)    CI Lower CI Upper
#> (Intercept)  4.6227047  0.3018606 15.314036 0.00001978  3.84930971 5.396100
#> x            1.0925247  0.4371912  2.498963 0.06087351 -0.07647668 2.261526
#> z           -0.9788636  0.8409243 -1.164033 0.28692510 -3.01687836 1.059151
#>                   DF
#> (Intercept) 5.055661
#> x           4.427303
#> z           6.248211

# Subsetting works just as in `lm()`
lm_robust(y ~ x, data = dat, subset = z == 1)
#>              Estimate Std. Error  t value     Pr(>|t|)    CI Lower CI Upper DF
#> (Intercept) 4.3533662  0.6377937 6.825665 4.075491e-06  3.00130394 5.705429 16
#> x           0.8954626  0.4586327 1.952461 6.860710e-02 -0.07679526 1.867721 16

# One can also choose to set the significance level for different CIs
lm_robust(y ~ x + z, data = dat, alpha = 0.1)
#>              Estimate Std. Error   t value     Pr(>|t|)   CI Lower CI Upper DF
#> (Intercept) 4.0720980  0.4444853 9.1613776 4.723071e-11  3.3222096 4.821986 37
#> x           0.5004417  0.4023025 1.2439439 2.213436e-01 -0.1782802 1.179164 37
#> z           0.1008780  0.7319296 0.1378248 8.911262e-01 -1.1339556 1.335712 37

# We can also specify fixed effects
# Speed gains with fixed effects are greatest with "stata" or "HC1" std.errors
tidy(lm_robust(y ~ x + z, data = dat, fixed_effects = ~ blockID, se_type = "HC1"))
#>   term   estimate std.error  statistic   p.value   conf.low conf.high df
#> 1    x  0.3043018 0.4312598  0.7056114 0.4852366 -0.5721235  1.180727 34
#> 2    z -0.1127011 0.6827196 -0.1650766 0.8698613 -1.5001543  1.274752 34
#>   outcome
#> 1       y
#> 2       y

# NOT RUN {
# Can also use 'margins' package if you have it installed to get
# marginal effects
library(margins)
lmrout <- lm_robust(y ~ x + z, data = dat)
summary(margins(lmrout))

# Can output results using 'texreg'
library(texreg)
texreg(lmrout)
# }