Mathematical notes for estimatr
Luke Sonnet
Source: ../vignettes/mathematical-notes.Rmd
This document provides the mathematical notes for each of the estimators in `estimatr`. The most up-to-date version of this can be found on the DeclareDesign website here.
Estimators
The current estimators we provide are:
- `lm_robust` - for fitting linear models with heteroskedasticity/cluster-robust standard errors
- `lm_lin` - a wrapper for `lm_robust()` to simplify interacting centered pre-treatment covariates with a treatment variable
- `iv_robust` - two-stage least squares estimation of instrumental variables regression
- `difference_in_means` - for estimating differences in means with appropriate standard errors for unit-randomized, cluster-randomized, block-randomized, matched-pair randomized, and matched-pair clustered designs
- `horvitz_thompson` - for estimating average treatment effects taking into consideration treatment probabilities or sampling probabilities for simple and cluster randomized designs
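To make the estimators above concrete, here is a minimal sketch of calling each one. The data and variable names (`y`, `z`, `x`, `cl`, `bl`) are hypothetical simulated values, used only for illustration.

```r
# A hypothetical simulated experiment; variable names are illustrative only.
library(estimatr)

set.seed(42)
n <- 100
dat <- data.frame(
  x  = rnorm(n),                                  # pre-treatment covariate
  z  = rbinom(n, 1, 0.5),                         # randomized treatment indicator
  cl = sample(letters[1:20], n, replace = TRUE),  # cluster labels
  bl = rep(1:25, each = 4)                        # block labels
)
dat$y <- dat$z + dat$x + rnorm(n)

lm_robust(y ~ z + x, data = dat)                    # HC2 SEs by default
lm_robust(y ~ z + x, data = dat, clusters = cl)     # CR2 SEs by default
lm_lin(y ~ z, covariates = ~ x, data = dat)         # Lin (2013) adjustment
difference_in_means(y ~ z, blocks = bl, data = dat) # blocked design
horvitz_thompson(y ~ z, data = dat, condition_prs = 0.5)
```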
`lm_robust` notes

The `lm_robust` method uses the C++ library Eigen, via the RcppEigen package, to estimate the coefficients, variance-covariance matrix, and, in some cases, the degrees of freedom of linear models.

The default estimators have been selected for efficiency in large samples and low bias in small samples, as well as for their similarities to design-based randomization estimators (Samii and Aronow 2012). This section outlines the various kinds of variance estimators one can employ within `lm_robust`.
Coefficient estimates
\[ \widehat{\beta} =(\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbf{y} \]
Our algorithm solves the least squares problem using a rank-revealing column-pivoting QR factorization that eliminates the need to invert \(\mathbf{X}^{\top}\mathbf{X}\) explicitly and behaves much like the default `lm` function in R. However, when \(\mathbf{X}\) is rank deficient, there are certain conditions under which the QR factorization algorithm we use, from the Eigen C++ library, drops different coefficients from the output than the default `lm` function. In general, users should avoid specifying models with rank deficiencies. In fact, if users are certain their data are not rank deficient, they can improve the speed of `lm_robust` by setting `try_cholesky = TRUE`. This replaces the QR factorization with a Cholesky factorization that is only guaranteed to work when \(\mathbf{X}\) is of full rank.
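A sketch of this option on hypothetical full-rank data; the two fits should agree on the coefficients, with the Cholesky path being faster:

```r
# A sketch with hypothetical full-rank data: the Cholesky path skips the
# rank-revealing QR, but assumes X has full column rank.
library(estimatr)
set.seed(1)
dat <- data.frame(x1 = rnorm(1000), x2 = rnorm(1000))
dat$y <- 1 + dat$x1 - dat$x2 + rnorm(1000)

fit_qr   <- lm_robust(y ~ x1 + x2, data = dat)
fit_chol <- lm_robust(y ~ x1 + x2, data = dat, try_cholesky = TRUE)
all.equal(coef(fit_qr), coef(fit_chol))  # should be TRUE for full-rank X
```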
Weights
If weights are included, we transform the data as below and then proceed as normal, following advice from Romano and Wolf (2017) that this weighted estimator has attractive properties. We do so by first scaling all of the weights so that they sum to one. Then we multiply each row of the design matrix \(\mathbf{X}\) by the square root of each unit's weight, \(\mathbf{x}_i \sqrt{w_i}\), and then do the same to the outcome, \(y_i \sqrt{w_i}\). This results in our coefficients being estimated as follows, where \(\mathbf{W}\) is a diagonal matrix with the scaled weights on the diagonal.
Weighted: \[ \widehat{\beta} =(\mathbf{X}^{\top}\mathbf{W}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbf{W}\mathbf{y} \]
The transformed data are then used in the analysis below, where \((\mathbf{X}^{\top}\mathbf{X})^{-1}\) is now \((\mathbf{X}^{\top}\mathbf{W}\mathbf{X})^{-1}\) and \(\mathbf{X}\) is now \(\mathrm{sqrt}[\mathbf{W}]\mathbf{X}\), where \(\mathrm{sqrt}[.]\) is an operator that applies a square root to each element of a matrix.
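A short sketch of this transformation on hypothetical data, verifying that rescaling the rows of \(\mathbf{X}\) and \(\mathbf{y}\) by \(\sqrt{w_i}\) and running unweighted OLS reproduces the weighted least squares coefficients:

```r
# Verify the weighting transformation on hypothetical data: scaling the
# rows of X and y by sqrt(w_i) and running unweighted OLS reproduces WLS.
set.seed(2)
n <- 50
X <- cbind(1, rnorm(n))
y <- X %*% c(2, 1) + rnorm(n)
w <- runif(n)
w <- w / sum(w)                       # scale weights to sum to one

Xw <- sqrt(w) * X                     # each row x_i becomes x_i * sqrt(w_i)
yw <- sqrt(w) * y

b_transformed <- solve(crossprod(Xw), crossprod(Xw, yw))
b_wls         <- solve(t(X) %*% diag(w) %*% X, t(X) %*% diag(w) %*% y)
all.equal(b_transformed, b_wls)       # should be TRUE
```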
We should note that this transformation yields the same standard errors as specifying weights using `aweight` in Stata for the "classical", "HC0", and "HC1" ("stata") variance estimators. Furthermore, in the clustered case, our weighted estimator for the "stata" cluster-robust variance also matches Stata. Thus Stata's main robust standard error estimators, "HC1" and their clustered estimator, match our package when weights are applied. Nonetheless, Stata uses a slightly different hat matrix, and thus "HC2" and "HC3" estimates in Stata when weights are specified may differ from our estimates—more on that here.
Variance
In addition to solving for OLS coefficients faster than `lm`, we provide a variety of robust variance estimators. Below we outline them for the non-clustered and clustered cases. You can see simulations of the unbiasedness of the classical variance estimator under homoskedasticity and the consistency of the HC2 estimator under heteroskedasticity in these simulations.
Heteroskedasticity-Robust Variance and Degrees of Freedom
The default variance estimator without clusters is the HC2 variance, first proposed by MacKinnon and White (1985). This estimator has the advantage of being equivalent to a conservative randomization-based “Neyman” estimator of the variance (Samii and Aronow 2012). Furthermore, while it is somewhat less efficient than the HC1 variance estimator, the default in Stata, it tends to perform better in small samples (evidence for that can be found in our simulations here).
| `se_type =` | Variance Estimator (\(\widehat{\mathbb{V}}[\widehat{\beta}]\)) | Degrees of Freedom | Notes |
|---|---|---|---|
| `"classical"` | \(\frac{\mathbf{e}^\top\mathbf{e}}{N-K} (\mathbf{X}^{\top}\mathbf{X})^{-1}\) | \(N - K\) | |
| `"HC0"` | \((\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathrm{diag}\left[e_i^2\right]\mathbf{X}(\mathbf{X}^{\top}\mathbf{X})^{-1}\) | \(N - K\) | |
| `"HC1"`, `"stata"` | \(\frac{N}{N-K}(\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathrm{diag}\left[e_i^2\right]\mathbf{X}(\mathbf{X}^{\top}\mathbf{X})^{-1}\) | \(N - K\) | Often called the Eicker-Huber-White variance (or similar) |
| `"HC2"` (default) | \((\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathrm{diag}\left[\frac{e_i^2}{1-h_{ii}}\right]\mathbf{X}(\mathbf{X}^{\top}\mathbf{X})^{-1}\) | \(N - K\) | |
| `"HC3"` | \((\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathrm{diag}\left[\frac{e_i^2}{(1-h_{ii})^2}\right]\mathbf{X}(\mathbf{X}^{\top}\mathbf{X})^{-1}\) | \(N - K\) | |
- \(\mathbf{x}_i\) is the \(i\)th row of \(\mathbf{X}\).
- \(h_{ii} = \mathbf{x}_i(\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{x}^{\top}_i\)
- \(e_i = y_i - \mathbf{x}_i\widehat{\beta}\)
- \(\mathrm{diag}[.]\) is an operator that creates a diagonal matrix from a vector
- \(N\) is the number of observations
- \(K\) is the number of elements in \(\beta\).
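Using the notation above, the HC2 variance can be computed by hand and compared against `lm_robust()`; a sketch on hypothetical heteroskedastic data:

```r
# Hand-computed HC2 variance on hypothetical heteroskedastic data,
# compared against lm_robust(se_type = "HC2").
library(estimatr)
set.seed(3)
n <- 40
dat <- data.frame(x = rnorm(n))
dat$y <- 1 + 2 * dat$x + rnorm(n, sd = 1 + abs(dat$x))  # heteroskedastic errors

X      <- model.matrix(~ x, data = dat)
XtXinv <- solve(crossprod(X))
e      <- residuals(lm(y ~ x, data = dat))
h      <- diag(X %*% XtXinv %*% t(X))                   # leverages h_ii

V_hc2 <- XtXinv %*% (t(X) %*% diag(e^2 / (1 - h)) %*% X) %*% XtXinv
sqrt(diag(V_hc2))
lm_robust(y ~ x, data = dat, se_type = "HC2")$std.error  # should match
```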
Cluster-Robust Variance and Degrees of Freedom
For cluster-robust inference, we provide several estimators that are essentially analogs of the heteroskedasticity-consistent variance estimators for the clustered case. Our default is the CR2 variance estimator, analogous to HC2 standard errors, which performs quite well in small samples without sacrificing much in the way of efficiency in larger samples. This estimator was originally proposed in Bell and McCaffrey (2002), although we implement a generalized version of the algorithm outlined in Pustejovsky and Tipton (2018); these authors provide an R package for CR2 variance estimation, clubSandwich, that applies these standard errors to a wide variety of models. For a good overview of the different cluster-robust variance estimators and simulations of their accuracy in small samples, users can see Imbens and Kolesar (2016). For an overview of when to use cluster-robust estimators, especially in an experimental setting, see Abadie et al. (2017).
| `se_type =` | Variance Estimator (\(\widehat{\mathbb{V}}[\widehat{\beta}]\)) | Degrees of Freedom | Notes |
|---|---|---|---|
| `"CR0"` | \((\mathbf{X}^{\top}\mathbf{X})^{-1} \sum^S_{s=1} \left[\mathbf{X}^\top_s \mathbf{e}_s\mathbf{e}^\top_s \mathbf{X}_s \right] (\mathbf{X}^{\top}\mathbf{X})^{-1}\) | \(S - 1\) | |
| `"stata"` | \(\frac{N-1}{N-K}\frac{S}{S-1} (\mathbf{X}^{\top}\mathbf{X})^{-1} \sum^S_{s=1} \left[\mathbf{X}^\top_s \mathbf{e}_s\mathbf{e}^\top_s \mathbf{X}_s \right] (\mathbf{X}^{\top}\mathbf{X})^{-1}\) | \(S - 1\) | The Stata variance estimator is the same as the CR0 estimate but with a special finite-sample correction. |
| `"CR2"` (default) | \((\mathbf{X}^{\top}\mathbf{X})^{-1} \sum^S_{s=1} \left[\mathbf{X}^\top_s \mathbf{A}_s \mathbf{e}_s\mathbf{e}^\top_s \mathbf{A}^\top_s \mathbf{X}_s \right] (\mathbf{X}^{\top}\mathbf{X})^{-1}\) | \(\frac{\left(\sum^S_{i = 1} \mathbf{p}^\top_i \mathbf{p}_i \right)^2}{\sum^S_{i=1}\sum^S_{j=1} \left(\mathbf{p}^\top_i \mathbf{p}_j \right)^2}\) | These estimates of the variance and degrees of freedom come from Pustejovsky and Tipton (2018), which is an extension to certain models with a particular set of dummy variables of the method proposed by Bell and McCaffrey (2002). Note that the degrees of freedom can vary for each coefficient. See below for more complete notation. |
- \(S\) is the number of clusters
- \(\mathbf{X}_s\) comprises the rows of \(\mathbf{X}\) that belong to cluster \(s\)
- \(I_n\) is an identity matrix of size \(n\times n\)
- \(\mathbf{e}_s\) comprises the elements of the residual vector \(\mathbf{e}\) in cluster \(s\), i.e., \(\mathbf{e}_s = \mathbf{y}_s - \mathbf{X}_s \widehat{\beta}\)
- \(\mathbf{A}_s\) and \(\mathbf{p}_s\) are defined in the notes below
Further notes on CR2: The variance estimator we implement is shown in equations (4) and (5) of Pustejovsky and Tipton (2018), where we set \(\mathbf{\Phi}\) in their equation (11) to be \(I\), following Bell and McCaffrey (2002). Further note that the Pustejovsky and Tipton (2018) CR2 estimator and the Bell and McCaffrey (2002) estimator are identical when \(\mathbf{B}_s\) is full rank. \(\mathbf{B}_s\) could be rank deficient if there are dummy variables, or fixed effects, that coincide with the clusters; in that case, the original Bell and McCaffrey (2002) estimator cannot be computed. You can see the simpler Bell and McCaffrey (2002) estimator written up plainly on page 709 of Imbens and Kolesar (2016), along with the degrees of freedom denoted as \(K_{BM}\).
In the CR2 variance calculation, we get \(\mathbf{A}_s\) as follows:
\[ \begin{aligned} \mathbf{H} &= \mathbf{X}(\mathbf{X}^{\top}\mathbf{X})^{-1}\mathbf{X}^\top \\ \mathbf{B}_s &= (I_{N} - \mathbf{H})_s (I_{N} - \mathbf{H})^\top_s \\ \mathbf{A}_s &= \mathbf{B}^{+1/2}_s \end{aligned} \]
where \(\mathbf{B}^{+1/2}_s\) is the symmetric square root of the Moore–Penrose inverse of \(\mathbf{B}_s\) and \((I - \mathbf{H})_s\) are the \(N_s\) columns that correspond to cluster \(s\). To get the corresponding degrees of freedom, note that
\[ \mathbf{p}_s = (I_N - \mathbf{H})^\top_s \mathbf{A}_s \mathbf{X}_s (\mathbf{X}^{\top}\mathbf{X})^{-1} \mathbf{z}_{k} \] where \(\mathbf{z}_{k}\) is a vector of length \(K\), the number of coefficients, where the \(k\)th element is 1 and all other elements are 0. The \(k\) signifies the coefficient for which we are computing the degrees of freedom.
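The CR2 adjustment matrices \(\mathbf{A}_s\) are involved to compute, but the simpler CR0 sandwich in the table above can be sketched directly by summing score contributions cluster by cluster; the data here are hypothetical:

```r
# Hand-computed CR0 cluster-robust variance on hypothetical clustered data,
# summing X_s' e_s e_s' X_s over clusters; compare to lm_robust(se_type = "CR0").
library(estimatr)
set.seed(4)
n   <- 60
dat <- data.frame(cl = rep(1:12, each = 5), x = rnorm(n))
dat$y <- 1 + dat$x + rnorm(12)[dat$cl] + rnorm(n)   # cluster-level noise

X      <- model.matrix(~ x, data = dat)
XtXinv <- solve(crossprod(X))
e      <- residuals(lm(y ~ x, data = dat))

meat <- matrix(0, ncol(X), ncol(X))
for (s in unique(dat$cl)) {
  g    <- t(X[dat$cl == s, , drop = FALSE]) %*% e[dat$cl == s]  # X_s' e_s
  meat <- meat + tcrossprod(g)                                  # g g'
}
V_cr0 <- XtXinv %*% meat %*% XtXinv
sqrt(diag(V_cr0))
lm_robust(y ~ x, data = dat, clusters = cl, se_type = "CR0")$std.error  # should match
```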
Confidence intervals and hypothesis testing
If \(\widehat{\mathbb{V}}_k\) is the \(k\)th diagonal element of \(\widehat{\mathbb{V}}\), we build confidence intervals using the user specified \(\alpha\) as:
\[ \mathrm{CI}^{1-\alpha} = \left(\widehat{\beta_k} + t^{df}_{\alpha/2} \sqrt{\widehat{\mathbb{V}}_k}, \widehat{\beta_k} + t^{df}_{1 - \alpha/2} \sqrt{\widehat{\mathbb{V}}_k}\right) \]
We also provide two-sided p-values using a t-distribution with the aforementioned significance level \(\alpha\) and degrees of freedom \(df\).
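A sketch of this construction from a coefficient, its standard error, and its degrees of freedom (all values below are hypothetical):

```r
# Hypothetical values: a coefficient, its standard error, and df.
beta_hat <- 1.5
se       <- 0.4
df       <- 27
alpha    <- 0.05

ci <- beta_hat + qt(c(alpha / 2, 1 - alpha / 2), df) * se   # (lower, upper)
p  <- 2 * pt(abs(beta_hat / se), df, lower.tail = FALSE)    # two-sided p-value
```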
`lm_lin` notes

The `lm_lin` estimator is a data pre-processor for `lm_robust` that implements the regression method for covariate adjustment suggested by Lin (2013).
This estimator works by taking the outcome and treatment variable as the main formula (`formula`) and a right-sided formula of all pre-treatment covariates (`covariates`). These pre-treatment covariates are then centered to be mean zero and interacted with the treatment variable before being added to the formula and passed to `lm_robust`. In other words, instead of fitting a simple model adjusting for pre-treatment covariates, such as
\[ y_i = \tau z_i + \mathbf{\beta}^\top \mathbf{x}_i + \epsilon_i \]
`lm_lin` fits the following model
\[ y_i = \tau z_i + \mathbf{\beta}^\top \mathbf{x}^c_i + \mathbf{\gamma}^\top \mathbf{x}^c_i z_i + \epsilon_i \]
where \(\mathbf{x}^c_i\) is a vector of pre-treatment covariates for unit \(i\) that have been centered to have mean zero and \(z_i\) is an indicator for the treatment group. Lin (2013) proposed this estimator in response to the critique by Freedman (2008) that using regression to adjust for pre-treatment covariates could bias estimates of treatment effects.
The `lm_lin` estimator also works for multi-valued treatments by creating a full set of dummies for each treatment level and interacting each with the centered pre-treatment covariates. The rest of the options for the function, and the corresponding estimation, are identical to `lm_robust`.
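A sketch of the equivalence on hypothetical data: centering the covariate and interacting it with treatment in `lm_robust` reproduces the `lm_lin` fit.

```r
# lm_lin() is equivalent to centering covariates and interacting them with
# treatment in lm_robust(); a sketch on hypothetical data.
library(estimatr)
set.seed(5)
dat <- data.frame(z = rbinom(100, 1, 0.5), x = rnorm(100))
dat$y <- dat$z + dat$x + rnorm(100)

dat$x_c <- dat$x - mean(dat$x)   # center the pre-treatment covariate

fit_lin    <- lm_lin(y ~ z, covariates = ~ x, data = dat)
fit_manual <- lm_robust(y ~ z * x_c, data = dat)

coef(fit_lin)["z"]      # treatment effect estimate
coef(fit_manual)["z"]   # should match
```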
`iv_robust` notes

Our `iv_robust` estimator uses two-stage least squares estimation.
Coefficient estimates
\[ \widehat{\beta}_{2SLS} =(\mathbf{X}^{\top}\mathbf{P_Z}\mathbf{X})^{-1}\mathbf{X}^{\top}\mathbf{P_Z}\mathbf{y}, \]
where \(\mathbf{X}\) are the endogenous regressors, \(\mathbf{P_Z} = \mathbf{Z}(\mathbf{Z}^{\top}\mathbf{Z})^{-1}\mathbf{Z}^\top\), and \(\mathbf{Z}\) are the instruments. This is equivalent to estimating the first stage regression,
\[ \mathbf{X} = \mathbf{Z}\beta_{FS} + \mathbf{\zeta}, \]
and using the first stage predicted values in the second stage regression,
\[ \begin{aligned} \widehat{\mathbf{X}} &= \mathbf{Z}\widehat{\beta}_{FS} \\ \mathbf{y} &= \widehat{\mathbf{X}}\beta_{2SLS} + \mathbf{\epsilon}. \end{aligned} \]
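A sketch of the two stages done manually on hypothetical data, compared to `iv_robust()`; the coefficient names differ but the values should agree.

```r
# Two-stage least squares done manually on hypothetical data, compared to
# iv_robust(). Note the manual second stage would give the wrong standard
# errors (see the Variance section below); only the coefficients agree.
library(estimatr)
set.seed(6)
n <- 200
z <- rbinom(n, 1, 0.5)          # instrument
x <- 0.5 * z + rnorm(n)         # endogenous regressor
y <- x + rnorm(n)
dat <- data.frame(y, x, z)

x_hat  <- fitted(lm(x ~ z))             # first stage fitted values
b_2sls <- coef(lm(y ~ x_hat))           # second stage coefficients
b_2sls
coef(iv_robust(y ~ x | z, data = dat))  # should match (names aside)
```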
Variance
The variance estimates for `iv_robust` are the same as the estimates for `lm_robust`, although two changes are made. First, we replace \(\mathbf{X}\) with the second stage regressors, \(\widehat{\mathbf{X}}\). Second, we replace the residuals, \(e_i\), with \(\mathbf{y} - \mathbf{X} \widehat{\beta}_{2SLS}\). That is, we use the residuals computed from the final coefficients and the endogenous, uninstrumented regressors \(\mathbf{X}\).
Because Stata does not default to using finite sample corrections and tests with its `ivregress 2sls` estimator, the correspondence between our instrumental variables estimator and theirs can be a bit unclear. The following table shows the options in Stata that correspond to our estimators.
| `estimatr` | Stata | Notes |
|---|---|---|
| N/A | `ivregress 2sls y (x = z)` | Stata's default has no finite sample correction (i.e., \(\widehat{\sigma}^2 = \frac{\mathbf{e}^\top \mathbf{e}}{N}\)). Stata here also uses z-tests. |
| `iv_robust(y ~ x \| z, se_type = "classical")` | `ivregress 2sls y (x = z), small` | \(\widehat{\sigma}^2 = \frac{\mathbf{e}^\top \mathbf{e}}{N - K}\). |
| `iv_robust(y ~ x \| z, se_type = "HC0")` | `ivregress 2sls y (x = z), rob` | Stata uses z-tests here. |
| `iv_robust(y ~ x \| z, se_type = "HC1")` | `ivregress 2sls y (x = z), rob small` | |
| `iv_robust(y ~ x \| z, se_type = "HC2")` (default) | N/A | |
| `iv_robust(y ~ x \| z, se_type = "HC3")` | N/A | |
| `iv_robust(y ~ x \| z, clusters = clust, se_type = "CR0")` | `ivregress 2sls y (x = z), vce(cl clust)` | Stata uses z-tests here. |
| `iv_robust(y ~ x \| z, clusters = clust, se_type = "stata")` | `ivregress 2sls y (x = z), vce(cl clust) small` | |
| `iv_robust(y ~ x \| z, clusters = clust, se_type = "CR2")` (default) | N/A | |
Confidence intervals and hypothesis testing
If \(\widehat{\mathbb{V}}_k\) is the \(k\)th diagonal element of \(\widehat{\mathbb{V}}\), we build confidence intervals using the user specified \(\alpha\) as:
\[ \mathrm{CI}^{1-\alpha} = \left(\widehat{\beta_{2SLS, k}} + t^{df}_{\alpha/2} \sqrt{\widehat{\mathbb{V}}_k}, \widehat{\beta_{2SLS, k}} + t^{df}_{1 - \alpha/2} \sqrt{\widehat{\mathbb{V}}_k}\right) \]
We also provide two-sided p-values using a t-distribution with the aforementioned significance level \(\alpha\) and degrees of freedom \(df\), which come from the second-stage regression. As mentioned in the table above, these results will differ from Stata in certain cases, as Stata uses z-tests when `small` is not specified.
`difference_in_means` notes

There are six kinds of experimental designs for which our `difference_in_means` estimator can estimate treatment effects, standard errors, confidence intervals, and provide p-values. We list the different designs here along with how the software learns the design:
- Simple (both `clusters` and `blocks` are unused)
- Clustered (`clusters` is specified while `blocks` is not)
- Blocked (`blocks` is specified while `clusters` is not)
- Blocked and clustered (both are specified)
There are two subsets of blocked designs that we also consider:
- Matched-pairs (only `blocks` is specified and all blocks are size two)
- Matched-pair clustered design (both names are specified and each block only has two clusters)
Note: if there are both blocks of size two and blocks of size greater than two, we default to the matched-pairs estimators described below.
For each design, our estimator is informed by the recent statistical literature on the analysis of experimental data.
Estimates
Any unblocked design:
\[ \widehat{\tau} = \frac{1}{N_1} \sum^N_{i=1} z_i y_i - \frac{1}{N_0} \sum^N_{i=1} (1 - z_i) y_i \]
where \(z_i\) is the treatment variable, \(y_i\) is the outcome, \(N_1\) and \(N_0\) are the number of treated and control units, respectively, and \(N\) is the total number of units.
Blocked design (including matched-pairs designs):
\[ \widehat{\tau} = \sum^J_{j=1} \frac{N_j}{N} \widehat{\tau_j} \]
where \(J\) is the number of blocks, \(N_j\) is the size of block \(j\), and \(\widehat{\tau_j}\) is the estimated difference-in-means in block \(j\).
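A sketch of the blocked estimate on hypothetical data, computing block-level differences and their size-weighted average:

```r
# Blocked difference-in-means by hand on hypothetical data: estimate within
# each block, then take the block-size-weighted average.
library(estimatr)
set.seed(7)
dat <- data.frame(bl = rep(1:10, each = 8))
dat$z <- unlist(lapply(1:10, function(j) sample(rep(0:1, 4))))  # 4 treated per block
dat$y <- dat$z + rnorm(80)

by_block <- split(dat, dat$bl)
tau_j <- vapply(by_block, function(b) mean(b$y[b$z == 1]) - mean(b$y[b$z == 0]), numeric(1))
N_j   <- vapply(by_block, nrow, numeric(1))

sum(N_j / nrow(dat) * tau_j)                                    # weighted average
difference_in_means(y ~ z, blocks = bl, data = dat)$coefficients  # should match
```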
Weighting
If the user specifies weights, treatment effects (or block-level treatment effects) and their standard errors are estimated by `lm_robust`. There are three exceptions. First, we still compute the degrees of freedom as in the table below. Second, if the design is blocked, a weighted treatment effect and variance estimate are computed within each block using `lm_robust` and then combined as below. Third, specifying weights with a matched-pairs estimator in `difference_in_means` is not supported at the moment.
Variance and Degrees of Freedom
| Design type | Variance \(\widehat{\mathbb{V}}[\widehat{\tau}]\) | Degrees of Freedom | Notes |
|---|---|---|---|
| No blocks or clusters (standard) | \(\frac{\widehat{\mathbb{V}}[y_{i,0}]}{N_0} + \frac{\widehat{\mathbb{V}}[y_{i,1}]}{N_1}\) | \(\frac{\widehat{\mathbb{V}}[\widehat{\tau}]^2}{\frac{(\widehat{\mathbb{V}}[y_{i,1}]/ N_1)^2}{N_1 - 1} + \frac{(\widehat{\mathbb{V}}[y_{i,0}]/ N_0)^2}{N_0 - 1}}\) | Where \(\widehat{\mathbb{V}}[y_{i,k}]\) is the Bessel-corrected variance of all units where \(z_i = k\) and \(N_k\) is the number of units in condition \(k\). This is equivalent to the variance and Welch–Satterthwaite approximation of the degrees of freedom used by R's `t.test`. |
| Blocked | \(\sum^J_{j=1} \left(\frac{N_j}{N}\right)^2 \widehat{\mathbb{V}}[\widehat{\tau_j}]\) | \(N - 2J\) | Where \(\widehat{\mathbb{V}}[\widehat{\tau_j}]\) is the variance of the estimated difference-in-means in block \(j\). See footnote 17 on page 74 of Gerber and Green (2012) for a reference. The degrees of freedom are equivalent to a regression with a full set of block-specific treatment effects. |
| Clustered | Same as `lm_robust` CR2 estimator | Same as `lm_robust` CR2 estimator | This variance is the same as that recommended by Gerber and Green (2012) in equation 3.23 on page 83 when the clusters are equally sized. |
| Blocked and clustered | \(\sum^J_{j=1} \left(\frac{N_j}{N}\right)^2 \widehat{\mathbb{V}}[\widehat{\tau_j}]\) | \(S - 2J\) | Where \(\widehat{\mathbb{V}}[\widehat{\tau_j}]\) is the variance of the estimated difference-in-means in block \(j\) and \(S\) is the number of clusters. See footnote 17 on page 74 of Gerber and Green (2012) for a reference. The degrees of freedom are equivalent to a regression on data collapsed by cluster with a full set of block-specific treatment effects. |
| Matched pairs | \(\frac{1}{J(J-1)} \sum^J_{j=1} \left(\widehat{\tau_j} - \widehat{\tau}\right)^2\) | \(J - 1\) | See equation 3.16 on page 77 of Gerber and Green (2012) for a reference. |
| Matched-pair cluster randomized | \(\frac{J}{(J-1)N^2} \sum^J_{j=1} \left(N_j \widehat{\tau_j} - \frac{N \widehat{\tau}}{J}\right)^2\) | \(J - 1\) | See the variance for the SATE defined in equation 6 on page 36 of Imai, King, and Nall (2009) and the suggested degrees of freedom on page 37. |
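For the unblocked row, a sketch of the variance and the Welch–Satterthwaite degrees of freedom against R's `t.test()`, on hypothetical data:

```r
# The unblocked variance and Welch-Satterthwaite df, matching R's t.test().
set.seed(8)
y1 <- rnorm(30, mean = 1)   # hypothetical treated outcomes
y0 <- rnorm(25)             # hypothetical control outcomes

v  <- var(y1) / 30 + var(y0) / 25
df <- v^2 / ((var(y1) / 30)^2 / 29 + (var(y0) / 25)^2 / 24)

c(se = sqrt(v), df = df)
t.test(y1, y0)$parameter    # same df
```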
Confidence intervals and hypothesis testing
We build confidence intervals using the user specified \(\alpha\) as:
\[ \mathrm{CI}^{1-\alpha} = \left(\widehat{\tau} + t^{df}_{\alpha/2} \sqrt{\widehat{\mathbb{V}}[\widehat{\tau}]},\ \widehat{\tau} + t^{df}_{1 - \alpha/2} \sqrt{\widehat{\mathbb{V}}[\widehat{\tau}]}\right) \]
We also provide two-sided p-values using a t-distribution with the aforementioned significance level \(\alpha\) and degrees of freedom \(df\).
`horvitz_thompson` notes

We provide Horvitz-Thompson estimators for two-armed trials; they can be used to obtain unbiased estimates of treatment effects when the randomization procedure is known. Horvitz-Thompson estimators require information about the probability each unit is in treatment and control, as well as the joint probabilities that each pair of units is in the treatment condition together, in the control condition together, and in opposite conditions.
The estimator we implement here, `horvitz_thompson()`, can be told the design of an experiment in several ways, and the reference page is a good place to see some of those examples. Users can see a description of the estimator and its properties in Aronow and Middleton (2013), Middleton and Aronow (2015), and Aronow and Samii (2017).
Some definitions used below:
- \(\pi_{zi}\) is the marginal probability of being in condition \(z \in \{0, 1\}\) for unit \(i\)
- \(\pi_{ziwj}\) is the joint probability of unit \(i\) being in condition \(z\) and unit \(j\) being in condition \(w \in \{0, 1\}\)
- \(\epsilon_{ziwj}\) is the indicator function \(\mathbb{1}\left(\pi_{ziwj} = 0\right)\)
Estimates
Simple, complete, clustered
\[ \widehat{\tau} = \frac{1}{N} \sum^N_{i=1} z_i \frac{y_i}{\pi_{1i}} - (1 - z_i) \frac{y_i}{\pi_{0i}} \]
Blocked
\[ \widehat{\tau} = \sum^J_{j=1} \frac{N_j}{N} \widehat{\tau_j} \] where \(J\) is the number of blocks, \(N_j\) is the size of those blocks, and \(\widehat{\tau_j}\) is the Horvitz-Thompson estimate in block \(j\).
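A sketch of the point estimate under complete random assignment with \(\pi_{1i} = \pi_{0i} = 0.5\), on hypothetical data:

```r
# Horvitz-Thompson point estimate by hand under complete randomization with
# marginal probabilities of 0.5, on hypothetical data.
library(estimatr)
set.seed(9)
n <- 100
dat <- data.frame(z = sample(rep(0:1, n / 2)))  # complete randomization
dat$y <- dat$z + rnorm(n)

mean(dat$z * dat$y / 0.5 - (1 - dat$z) * dat$y / 0.5)
horvitz_thompson(y ~ z, data = dat, condition_prs = 0.5)$coefficients  # should match
```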
Variance
Currently we provide variance estimates that rely on two separate assumptions:
- `"youngs"`, which implements a conservative variance estimate using Young's inequality, described in equation 35 on page 147 of Aronow and Middleton (2013) and on pages 11-15 of Aronow and Samii (2017).
- `"constant"`, which assumes constant treatment effects across all units but is less conservative. We only provide this estimator for simple randomized experiments.
Young’s inequality
For designs that are not clustered, we use the following variance:
\[ \begin{aligned} \widehat{\mathbb{V}}_{Y}[\widehat{\tau}] = \frac{1}{N^2} \sum^N_{i=1} \Bigg[& z_i \left(\frac{y_i}{\pi_{1i}}\right)^2 + (1 - z_i) \left(\frac{y_i}{\pi_{0i}}\right)^2 + \sum_{j \neq i} \bigg(\frac{z_i z_j}{\pi_{1i1j} + \epsilon_{1i1j}}(\pi_{1i1j} - \pi_{1i}\pi_{1j})\frac{y_i}{\pi_{1i}}\frac{y_j}{\pi_{1j}} \\ & + \frac{(1-z_i) (1-z_j)}{\pi_{0i0j} + \epsilon_{0i0j}}(\pi_{0i0j} - \pi_{0i}\pi_{0j})\frac{y_i}{\pi_{0i}}\frac{y_j}{\pi_{0j}} - 2 \frac{z_i (1-z_j)}{\pi_{1i0j} + \epsilon_{1i0j}}(\pi_{1i0j} - \pi_{1i}\pi_{0j})\frac{y_i}{\pi_{1i}}\frac{y_j}{\pi_{0j}}\bigg) \\ & + \sum_{\forall j \colon \pi_{1i1j} = 0} \left( z_i \frac{y^2_i}{2\pi_{1i}} + z_j \frac{y^2_j}{\pi_{1j}}\right) + \sum_{\forall j \colon \pi_{0i0j} = 0} \left( (1-z_i) \frac{y^2_i}{2\pi_{0i}} + (1-z_j) \frac{y^2_j}{\pi_{0j}}\right) \Bigg] \end{aligned} \]
Some simplifications for simpler designs follow algebraically from the expression above. For example, if there are no two units for which the joint probability of being in either condition is 0, which is the case for most experiments that are not matched-pair experiments, then we get:
\[ \begin{aligned} \widehat{\mathbb{V}}_{Y}[\widehat{\tau}] = \frac{1}{N^2} \sum^N_{i=1} \Bigg[& z_i \left(\frac{y_i}{\pi_{1i}}\right)^2 + (1 - z_i) \left(\frac{y_i}{\pi_{0i}}\right)^2 + \sum_{j \neq i} \bigg(\frac{z_i z_j}{\pi_{1i1j}}(\pi_{1i1j} - \pi_{1i}\pi_{1j})\frac{y_i}{\pi_{1i}}\frac{y_j}{\pi_{1j}} \\ & + \frac{(1-z_i) (1-z_j)}{\pi_{0i0j}}(\pi_{0i0j} - \pi_{0i}\pi_{0j})\frac{y_i}{\pi_{0i}}\frac{y_j}{\pi_{0j}} - 2 \frac{z_i (1-z_j)}{\pi_{1i0j}}(\pi_{1i0j} - \pi_{1i}\pi_{0j})\frac{y_i}{\pi_{1i}}\frac{y_j}{\pi_{0j}}\bigg) \Bigg] \end{aligned} \]
If we further simplify to the case where there is simple random assignment, and there is absolutely no dependence among units (i.e., \(\pi_{ziwj} = \pi_{zi}\pi_{wj} \;\;\forall\;\;z,w,i,j\)), we get:
\[ \begin{aligned} \widehat{\mathbb{V}}_{Y}[\widehat{\tau}] = \frac{1}{N^2} \sum^N_{i=1} \Bigg[& z_i \left(\frac{y_i}{\pi_{1i}}\right)^2 + (1 - z_i) \left(\frac{y_i}{\pi_{0i}}\right)^2\Bigg] \end{aligned} \]
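A sketch of this simplified variance computed by hand under simple (Bernoulli) random assignment, on hypothetical data. Under our reading of the default `se_type = "youngs"`, declaring a simple design via `simple = TRUE` should reproduce it:

```r
# The simple-random-assignment simplification of the Young's-inequality
# variance, computed by hand on hypothetical data.
library(estimatr)
set.seed(10)
n <- 100
p <- 0.5
dat <- data.frame(z = rbinom(n, 1, p))   # simple (Bernoulli) assignment
dat$y <- dat$z + rnorm(n)

v_young <- sum(dat$z * (dat$y / p)^2 + (1 - dat$z) * (dat$y / (1 - p))^2) / n^2
sqrt(v_young)
# Compare to the default "youngs" standard error under a declared simple
# design (should agree):
horvitz_thompson(y ~ z, data = dat, condition_prs = p, simple = TRUE)$std.error
```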
Clustered designs
For clustered designs, we use the following collapsed estimator by setting `collapsed = TRUE`. Here, \(M\) is the total number of clusters, \(y_k\) is the total of the outcomes \(y_i\) for all \(i\) units in cluster \(k\), \(\pi_{zk}\) is the marginal probability of cluster \(k\) being in condition \(z \in \{0, 1\}\), and \(z_k\) and \(\pi_{zkwl}\) are defined analogously. Warning! If one passes `condition_pr_mat` to `horvitz_thompson` for a clustered design, but not `clusters`, the function will not use the collapsed estimator and the variance estimate will be inaccurate.
\[ \begin{aligned} \widehat{\mathbb{V}}_{Y}[\widehat{\tau}] = \frac{1}{N^2} \sum^M_{k=1} \Bigg[& z_k \left(\frac{y_k}{\pi_{1k}}\right)^2 + (1 - z_k) \left(\frac{y_k}{\pi_{0k}}\right)^2 + \sum_{l \neq k} \bigg(\frac{z_k z_l}{\pi_{1k1l} + \epsilon_{1k1l}}(\pi_{1k1l} - \pi_{1k}\pi_{1l})\frac{y_k}{\pi_{1k}}\frac{y_l}{\pi_{1l}} \\ & + \frac{(1-z_k) (1-z_l)}{\pi_{0k0l} + \epsilon_{0k0l}}(\pi_{0k0l} - \pi_{0k}\pi_{0l})\frac{y_k}{\pi_{0k}}\frac{y_l}{\pi_{0l}} - 2 \frac{z_k (1-z_l)}{\pi_{1k0l} + \epsilon_{1k0l}}(\pi_{1k0l} - \pi_{1k}\pi_{0l})\frac{y_k}{\pi_{1k}}\frac{y_l}{\pi_{0l}}\bigg) \\ & + \sum_{\forall l \colon \pi_{1k1l} = 0} \left( z_k \frac{y^2_k}{2\pi_{1k}} + z_l \frac{y^2_l}{\pi_{1l}}\right) + \sum_{\forall l \colon \pi_{0k0l} = 0} \left( (1-z_k) \frac{y^2_k}{2\pi_{0k}} + (1-z_l) \frac{y^2_l}{\pi_{0l}}\right) \Bigg] \end{aligned} \]
Constant effects
Alternatively, one can assume constant treatment effects and estimate a variance that is consistent under that assumption but less conservative. Again, this estimator is only implemented for the simple randomized case.
- \(y_{zi}\) is the potential outcome for condition \(z\) for unit \(i\). This is either observed if \(z_i = z\) or estimated using the constant effects assumption if \(z_i \neq z\), where \(z_i\) is the condition for unit \(i\). To be precise \(y_{1i} = z_i y_{i} + (1 - z_i) (y_{i} + \widehat{\tau})\) and \(y_{0i} = z_i (y_{i} - \widehat{\tau}) + (1 - z_i) y_{i}\), where \(\widehat{\tau}\) is the estimated treatment effect.
\[ \begin{aligned} \widehat{\mathbb{V}}_{C}[\widehat{\tau}] = \frac{1}{N^2} \sum^N_{i=1} \Bigg[& (1 - \pi_{0i}) \pi_{0i} \left(\frac{y_{0i}}{\pi_{0i}}\right)^2 + (1 - \pi_{1i}) \pi_{1i} \left(\frac{y_{1i}}{\pi_{1i}}\right)^2 - 2 y_{1i} y_{0i} \\ & + \sum_{j \neq i} \Big( (\pi_{0i0j} - \pi_{0i} \pi_{0j}) \frac{y_{0i}}{\pi_{0i}} \frac{y_{0j}}{\pi_{0j}} + (\pi_{1i1j} - \pi_{1i} \pi_{1j}) \frac{y_{1i}}{\pi_{1i}} \frac{y_{1j}}{\pi_{1j}} \\ & - 2 (\pi_{1i0j} - \pi_{1i} \pi_{0j}) \frac{y_{1i}}{\pi_{1i}} \frac{y_{0j}}{\pi_{0j}} \Big)\Bigg] \end{aligned} \]
Confidence intervals and hypothesis testing
Theory on hypothesis testing with the Horvitz-Thompson estimator is yet to be developed. We rely on a normal approximation and construct confidence intervals in the following way: \[ \mathrm{CI}^{1-\alpha} = \left(\widehat{\tau} + z_{\alpha/2} \sqrt{\widehat{\mathbb{V}}[\widehat{\tau}]}, \widehat{\tau} + z_{1 - \alpha/2} \sqrt{\widehat{\mathbb{V}}[\widehat{\tau}]}\right) \]
The associated p-values for a two-sided null hypothesis test are computed using a normal distribution and the aforementioned significance level \(\alpha\).
References
Abadie, Alberto, Susan Athey, Guido W Imbens, and Jeffrey Wooldridge. 2017. “A Class of Unbiased Estimators of the Average Treatment Effect in Randomized Experiments.” arXiv Pre-Print. https://arxiv.org/abs/1710.02926v2.
Aronow, Peter M, and Joel A Middleton. 2013. “A Class of Unbiased Estimators of the Average Treatment Effect in Randomized Experiments.” Journal of Causal Inference 1 (1): 135–54. https://doi.org/10.1515/jci-2012-0009.
Aronow, Peter M, and Cyrus Samii. 2017. “Estimating Average Causal Effects Under Interference Between Units.” Annals of Applied Statistics, forthcoming. https://arxiv.org/abs/1305.6156v3.
Bell, Robert M, and Daniel F McCaffrey. 2002. “Bias Reduction in Standard Errors for Linear Regression with Multi-Stage Samples.” Survey Methodology 28 (2): 169–82.
Freedman, David A. 2008. “On Regression Adjustments in Experiments with Several Treatments.” The Annals of Applied Statistics, 176–96. https://doi.org/10.1214/07-AOAS143.
Gerber, Alan S, and Donald P Green. 2012. Field Experiments: Design, Analysis, and Interpretation. New York: W.W. Norton.
Imai, Kosuke, Gary King, and Clayton Nall. 2009. “The Essential Role of Pair Matching in Cluster-Randomized Experiments, with Application to the Mexican Universal Health Insurance Evaluation.” Statistical Science 24 (1): 29–53. https://doi.org/10.1214/08-STS274.
Imbens, Guido W, and Michal Kolesar. 2016. “Robust Standard Errors in Small Samples: Some Practical Advice.” Review of Economics and Statistics 98 (4): 701–12. https://doi.org/10.1162/REST_a_00552.
Lin, Winston. 2013. “Agnostic Notes on Regression Adjustments to Experimental Data: Reexamining Freedman’s Critique.” The Annals of Applied Statistics 7 (1): 295–318. https://doi.org/10.1214/12-AOAS583.
MacKinnon, James, and Halbert White. 1985. “Some Heteroskedasticity-Consistent Covariance Matrix Estimators with Improved Finite Sample Properties.” Journal of Econometrics 29 (3): 305–25. https://doi.org/10.1016/0304-4076(85)90158-7.
Middleton, Joel A, and Peter M Aronow. 2015. “Unbiased Estimation of the Average Treatment Effect in Cluster-Randomized Experiments.” Statistics, Politics and Policy 6 (1-2): 39–75. https://doi.org/10.1515/spp-2013-0002.
Pustejovsky, James E, and Elizabeth Tipton. 2018. “Small-Sample Methods for Cluster-Robust Variance Estimation and Hypothesis Testing in Fixed Effects Models.” Journal of Business & Economic Statistics 36 (4). https://doi.org/10.1080/07350015.2016.1247004.
Romano, Joseph P, and Michael Wolf. 2017. “Resurrecting Weighted Least Squares.” Journal of Econometrics 197 (1): 1–19. https://doi.org/10.1016/j.jeconom.2016.10.003.
Samii, Cyrus, and Peter M Aronow. 2012. “On Equivalencies Between Design-Based and Regression-Based Variance Estimators for Randomized Experiments.” Statistics and Probability Letters 82 (2). https://doi.org/10.1016/j.spl.2011.10.024.