Mathematical notes for estimatr
Luke Sonnet
Source:vignettes/mathematical-notes.Rmd
mathematical-notes.Rmd
This document provides the mathematical notes for each of the
estimators in estimatr
. The most up-to-date version of this
can be found on the DeclareDesign
website here.
Estimators
The current estimators we provide are:
-
lm_robust
- for fitting linear models with heteroskedasticity/cluster-robust standard errors -
lm_lin
- a wrapper forlm_robust()
to simplify interacting centered pre-treatment covariates with a treatment variable -
iv_robust
- two stage least squares estimation of instrumental variables regression -
difference_in_means
- for estimating differences in means with appropriate standard errors for unit-randomized, cluster-randomized, block-randomized, matched-pair randomized, and matched-pair clustered designs -
horvitz_thompson
- for estimating average treatment effects taking into consideration treatment probabilities or sampling probabilities for simple and cluster randomized designs
lm_robust
notes
The lm_robust
method uses the
C++ library Eigen, via the RcppEigen
package, to estimate the coefficients, variance-covariance matrix, and,
in some cases, the degrees of freedom of linear models.
The default estimators have been selected for efficiency in large
samples and low bias in small samples as well as for their similarities
to design-based randomization estimators (Samii and Aronow 2012). This
section outlines the various kinds of variance estimators one can employ
within lm_robust
.
Coefficient estimates
Our algorithm solves the least squares problem using a rank-revealing
column-pivoting QR factorization that eliminates the need to invert
explicitly and behaves much like the default lm
function in
R. However, when
is rank deficient, there are certain conditions under which the QR
factorization algorithm we use, from the Eigen C++ library, drops
different coefficients from the output than the default lm
function. In general, users should avoid specifying models with
rank-deficiencies. In fact, if users are certain their data are not rank
deficient, they can improve the speed of lm_robust
by
setting try_cholesky = TRUE
. This replaces the QR
factorization with a Cholesky factorization that is only guaranteed to
work
is of full rank.
Weights
If weights are included, we transform the data as below and then proceed as normal, following advice from Romano and Wolf (2017) that this weighted estimator has attractive properties. We do so by first scaling all of the weights so that they sum to one. Then we multiply each row of the design matrix by the square root each unitβs weight, , and then do the same to the outcome, . This results in our coefficients being estimated as follows, where is a diagonal matrix with the scaled weights on the diagonal.
Weighted:
The transformed data are then used in the analysis below, where is now and is now , where is an operator that applies a square root to the coefficients of some matrix.
We should note that this transformation yields the same standard
errors as specifying weights using aweight
in Stata for the
βclassicalβ, βHC0β, and βHC1β (βstataβ) variance estimators.
Furthermore, in the clustered case, our weighted estimator for the
βstataβ cluster-robust variance also matches Stata. Thus Stataβs main
robust standard error estimators, βHC1β and their clustered estimator,
match our package when weights are applied. Nonetheless, Stata uses a
slightly different Hat matrix and thus βHC2β and βHC3β estimates in
Stata when weights are specified may differ from our estimatesβmore on that
here.
Variance
In addition to solving for OLS coefficients faster than
lm
, we provide a variety of robust variance estimators.
Below we outline them for the non-clustered and clustered cases. You can
see some simulations about the unbiasedness of the classical variance
estimators with homoskedasticity and the consistency of the HC2
estimators with heteroskedasticity in these
simulations.
Heteroskedasticity-Robust Variance and Degrees of Freedom
The default variance estimator without clusters is the HC2 variance, first proposed by MacKinnon and White (1985). This estimator has the advantage of being equivalent to a conservative randomization-based βNeymanβ estimator of the variance (Samii and Aronow 2012). Furthermore, while it is somewhat less efficient than the HC1 variance estimator, the default in Stata, it tends to perform better in small samples (evidence for that can be found in our simulations here).
se_type = |
Variance Estimator () | Degrees of Freedom | Notes |
---|---|---|---|
"classical" |
N - K | ||
"HC0" |
N - K | ||
"HC1" , "stata"
|
N - K | Often called the Eicker-Huber-White variance (or similar) | |
"HC2" (default) |
N - K | ||
"HC3" |
N - K |
- is the th row of .
- is an operator that creates a diagonal matrix from a vector
- is the number of observations
- is the number of elements in .
Cluster-Robust Variance and Degrees of Freedom
For cluster-robust inference, we provide several estimators that are essentially analogs of the heteroskedastic-consistent variance estimators for the clustered case. Our default is the CR2 variance estimator, analogous to HC2 standard errors, and perform quite well in small samples without sacrificing much in the way of efficiency in larger samples. This estimator was originally proposed in Bell and McCaffrey (2002), although we implement a generalized version of the algorithm outlined in Pustejovsky and Tipton (2018); these authors provide an R package for CR2 variance estimation, clubSandwich, that applies these standard errors to a wide variety of models. For a good overview of the different cluster-robust variance estimators and simulations of their accuracy in small samples, again users can see Imbens and Kolesar (2016). For an overview of when to use cluster-robust estimators, especially in an experimental setting, see Abadie et al. (2017).
se_type = |
Variance Estimator () | Degrees of Freedom | Notes |
---|---|---|---|
"CR0" |
|||
"stata" |
The Stata variance estimator is the same as the CR0 estimate but with a special finite-sample correction. | ||
"CR2" (default) |
These estimates of the variance and degrees of freedom come from Pustejovsky and Tipton (2018), which is an extension to certain models with a particular set of dummy variables of the method proposed by Bell and McCaffrey (2002). Note that the degrees of freedom can vary for each coefficient. See below for more complete notation. |
- is the number of clusters
- is the rows of that belong to cluster
- is an identity matrix of size
- is the elements of the residual matrix in cluster , or
- and are defined in the notes below
Further notes on CR2: The variance estimator we implement is shown in equations (4) and (5) in Pustejovsky and Tipton (2018) and equation (11), where we set to be , following Bell and McCaffrey (2002). Further note that the Pustejovsky and Tipton (2018) CR2 estimator and the Bell and McCaffrey (2002) estimator are identical when is full rank. It could be rank-deficient if there were dummy variables, or fixed effects, that were also your clusters. In this case, the original Bell and McCaffrey (2002) estimator could not be computed. You can see the simpler Bell and McCaffrey (2002) estimator written up plainly on page 709 of Imbens and Kolesar (2016) along with the degrees of freedom denoted as .
In the CR2 variance calculation, we get as follows:
where is the symmetric square root of the MooreβPenrose inverse of and are the columns that correspond to cluster . To get the corresponding degrees of freedom, note that
where is a vector of length , the number of coefficients, where the th element is 1 and all other elements are 0. The signifies the coefficient for which we are computing the degrees of freedom.
lm_lin
notes
The lm_lin
estimator is a data
pre-processor for lm_robust
that implements the regression
method for covariate adjustment suggested by Lin
(2013).
This estimator works by taking the outcome and treatment variable as
the main formula (formula
) and takes a right-sided formula
of all pre-treatment covariates (covariates
). These
pre-treatment covariates are then centered to be mean zero and
interacted with the treatment variable before being added to the formula
and passed to lm_robust
. In other words, instead of fitting
a simple model adjusting for pre-treatment covariates such as
with the following model
where is a vector of pre-treatment covariates for unit that have been centered to have mean zero and is an indicator for the treatment group. Lin (2013) proposed this estimator in response to the critique by Freedman (2008) that using regression to adjust for pre-treatment covariates could bias estimates of treatment effects.
The estimator lm_lin
also works for multi-valued
treatments by creating a full set of dummies for each treatment level
and interacting each with the centered pre-treatment covariates. The
rest of the options for the function and corresponding estimation is
identical to lm_robust
.
iv_robust
notes
Our iv_robust
estimator uses a
two-stage least squares estimation.
Coefficient estimates
where are the endogenous regressors, , and are the instruments. This is equivalent to estimating the first stage regression,
and using the first stage predicted values in the second stage regression,
Variance
The variances estimates for iv_robust
are the same as
the estimates for lm_robust
although two changes are made.
First, we replace
with the second stage regressors,
,
and we replace the residuals,
,
with
.
That is, we use the residuals from the final coefficients and the
endogenous, uninstrumented regressors
.
Because Stata does not default to using finite sample corrections and
tests with its ivregress 2sls
estimator, the correspondence
between our instrumental variables estimator and theirs can be a bit
unclear. The following table shows the options in Stata that correspond
to our estimators.
estimatr |
Stata | Notes |
---|---|---|
N/A | ivregress 2sls y (x = z) |
Stataβs default has no finite sample correction (i.e., ). Stata here also uses z-tests. |
iv_robust(y ~ x | z, se_type = "classical") |
ivregress 2sls y (x = z), small |
. |
iv_robust(y ~ x | z, se_type = "HC0") |
ivregress 2sls y (x = z), rob |
Stata uses z-tests here. |
iv_robust(y ~ x | z, se_type = "HC1") |
ivregress 2sls y (x = z), rob small |
|
iv_robust(y ~ x | z, se_type = "HC2") (default) |
N/A | |
iv_robust(y ~ x | z, se_type = "HC3") |
N/A | |
iv_robust(y ~ x | z, clusters = clust,
se_type = "CR0")
|
ivregress 2sls y (x = z), vce(cl clust) |
Stata uses z-tests here. |
iv_robust(y ~ x | z, clusters = clust,
se_type = "stata")
|
ivregress 2sls y (x = z), vce(cl clust) small |
|
iv_robust(y ~ x | z, clusters = clust,
se_type = "CR2") (default) |
N/A |
Confidence intervals and hypothesis testing
If is the th diagonal element of , we build confidence intervals using the user specified as:
We also provide two-sided p-values using a t-distribution with the
aforementioned significance level
and degrees of freedom
,
which come from the second-stage regression. As mentioned in the table
above, these results will be different from Stata in certain cases as
Stata uses z-tests when small
is not specified.
difference_in_means
notes
There are six kinds of experimental designs for which our difference_in_means
estimator can estimate treatment effects, standard errors, confidence
intervals, and provide p-values. We list the different designs here
along with how the software learns the design:
- Simple (both
clusters
andblocks
are unused) - Clustered (
clusters
is specified whileblocks
is not) - Blocked (
blocks
is specified whileclusters
is not) - Blocked and clustered (both are specified)
There are two subsets of blocked designs that we also consider:
- Matched-pairs (only
blocks
is specified and all blocks are size two) - Matched-pair clustered design (both names are specified and each block only has two clusters)
Note: if there are blocks of size two and blocks greater than size two, we default to the matched-pairs estimators described below.
For each design, our estimator is informed by the recent statistical literature on the analysis of experimental data.
Estimates
Any unblocked design where is the treatment variable, is the outcome, and is the total number of units.
Blocked design (including matched-pairs designs) where is the number of blocks, is the size of those blocks, and is the estimated difference-in-means in block .
Weighting
If the user specifies weights, treatment effects (or block-level
treatment effects) and their standard errors are estimated by
lm_robust
. There are three exceptions. First, we still
compute the degrees of freedom as in the below table. Second, if the
design is blocked, a weighted treatment effect and variance estimate are
computed within each block using lm_robust
and then
combined as below. Third, specifying weights with a matched-pairs
estimator in difference_in_means
is not supported at the
moment.
Variance and Degrees of Freedom
Design type | Variance | Degrees of Freedom | Notes |
---|---|---|---|
No blocks or clusters (standard) | Where
is the Bessel-corrected variance of all units where
and
is the number of units in condition
.
This is equivalent to the variance and WelchβSatterthwaite approximation
of the degrees of freedom used by Rβs t.test . |
||
Blocked | Where is the variance of the estimated difference-in-means in block . See footnote 17 on page 74 of (Gerber and Green 2012) for a reference. The degrees of freedom are equivalent to a regression with a full set of block specific treatment effects. | ||
Clusters | Same as lm_robust CR2 estimator |
Same as lm_robust CR2 estimator |
This variance is the same as that recommended by Gerber and Green (2012) in equation 3.23 on page 83 when the clusters are even sizes. |
Blocked and clustered | Where is the variance of the estimated difference-in-means in block and S is the number of clusters. See footnote 17 on page 74 of Gerber and Green (2012) for a reference. The degrees of freedom are equivalent to a regression on data collapsed by cluster with a full set of block specific treatment effects. | ||
Matched pairs | See equation 3.16 on page 77 of Gerber and Green (2012) for a reference. | ||
Matched pair cluster randomized | See the variance for the SATE defined in equation 6 on page 36 of (Imai, King, and Nall 2009) and the suggested degrees of freedom on page 37. |
horvitz_thompson
notes
We provide Horvitz-Thompson estimators for two-armed trials and can be used to estimate unbiased treatment effects when the randomization is known. Horvitz-Thompson estimators require information about the probability each unit is in treatment and control, as well as the joint probability each unit is in the treatment, in the control, and in opposite treatment conditions.
The estimator we implement here, horvitz_thompson()
, can
be told the design of an experiment in several ways, and the reference
page is a good place to see some of those examples. Users can see a
description of the estimator and its properties in Aronow and Middleton (2013), Middleton and Aronow (2015), and Aronow and Samii (2017).
Some definitions used below:
- is the marginal probability of being in condition for unit i
- is the joint probability of unit being in condition and unit being in condition
- is the indicator function
Estimates
Simple, complete, clustered
Blocked
where is the number of blocks, is the size of those blocks, and is the Horvitz-Thompson estimate in block .
Variance
Currently we provide variance estimates that rely on two separate assumptions:
-
"youngs"
which implements a conservative variance estimate using Youngβs inequality, described in equation 35 on page 147 of Aronow and Middleton (2013) and in Aronow and Samii (2017) on pages 11-15. -
"constant"
which assumes constant treatment effects across all units but is less conservative. We only provide this estimator for simple randomized experiments.
Youngβs inequality
For designs that that are not clustered we use the following variance:
There are some simplifications of the above for simpler designs that follow algebraically from the above. For example, if there are no two units for which the joint probability of being in either condition is 0, which is the case for most experiments that are not matched-pair experiments, then we get:
If we further simplify to the case where there is simple random assignment, and there is absolutely no dependence among units (i.e., ), we get:
Clustered designs
For clustered designs, we use the following collapsed estimator by
setting collapsed = TRUE
. Here,
is the total number of clusters,
is the total of the outcomes
for all
units in cluster
,
is the marginal probability of cluster
being in condition
,
and
and
are defined analogously. Warning! If one passes
condition_pr_mat
to horvitz_thompson
for a
clustered design, but not clusters
, the function will not
use the collapsed estimator the the variance estimate will be
inaccurate.
Constant effects
Alternatively, one can assume constant treatment effects and, under that assumption, estimate the variance that is consistent under that assumption but less conservative. Again, this estimator is only implemented for the simple randomized case.
- is the potential outcome for condition for unit . This is either observed if or estimated using the constant effects assumption if , where is the condition for unit . To be precise and , where is the estimated treatment effect.
Confidence intervals and hypothesis testing
Theory on hypothesis testing with the Horvitz-Thompson estimator is yet to be developed. We rely on a normal approximation and construct confidence intervals in the following way:
The associated p-values for a two-sided null hypothesis test are computed using a normal distribution and the aforementioned significance level .