estimatr in the Tidyverse

The estimatr package estimates linear and instrumental variable regressions quickly and with robust standard errors. We designed the package to play well with the “tidyverse”, which is a suite of packages developed by RStudio. The main way we accomplish this is making it easy to turn model fits into data.frames, the data format expected by many tidyverse packages. This vignette will show you how to use estimatr functions with four tidyverse packages: broom, dplyr, purrr, and ggplot2.

Turning model objects into data.frames

library(estimatr)
fit <- lm_robust(mpg ~ disp, data = mtcars)
tidy(fit)
##          term    estimate   std.error statistic      p.value    conf.low
## 1 (Intercept) 29.59985476 1.490268903 19.862090 8.189608e-19 26.55631962
## 2        disp -0.04121512 0.005170197 -7.971672 6.743097e-09 -0.05177407
##     conf.high df outcome
## 1 32.64338989 30     mpg
## 2 -0.03065617 30     mpg

The fit object is an lm_robust object. The tidy() function takes the fit and returns a data.frame. This function and behavior are explictly modeled on the broom package, a tidyverse package for making model objects into data.frames.

Once the model fits are data.frames, it’s easy to pipe the output into dplyr data manipulation functions:

library(dplyr)
fit %>%
  tidy %>%
  mutate(t.stat = estimate / std.error,
         significant = p.value <= 0.05)
##          term    estimate   std.error statistic      p.value    conf.low
## 1 (Intercept) 29.59985476 1.490268903 19.862090 8.189608e-19 26.55631962
## 2        disp -0.04121512 0.005170197 -7.971672 6.743097e-09 -0.05177407
##     conf.high df outcome    t.stat significant
## 1 32.64338989 30     mpg 19.862090        TRUE
## 2 -0.03065617 30     mpg -7.971672        TRUE

Putting robust standard errors in a ggplot

If you want to overlay a regression model on a scatterplot, the stat_smooth function in ggplot2 is great. But if you run stat_smooth(method = "lm"), ggplot will add a confidence interval based on classical variance calculations. If you want to use robust standard errors (and you almost certainly do, since homoskedasticity is an unnecessary and often implausible assumption), then use stat_smooth(method = "lm_robust").

library(ggplot2)
ggplot(mtcars, aes(mpg, disp)) +
  geom_point() +
  stat_smooth(method = "lm_robust") +
  theme_bw()

Running multiple versions of a model with purrr

Sometimes, we want to run multiple versions of a regression mode, perhaps model on subsets of the data. This code runs three regressions on three subsets of the data and stacks the output into a single data.frame. We use the map function from the purrr package twice.

library(purrr)
gg_df <-
  mtcars %>%
  split(.$cyl) %>%
  map(~ lm_robust(mpg ~ disp, data = .)) %>% 
  map(tidy) %>%
  bind_rows(.id = "cyl")
gg_df
##   cyl        term     estimate   std.error  statistic      p.value
## 1   4 (Intercept) 40.871955322 3.507137416 11.6539361 9.875713e-07
## 2   4        disp -0.135141815 0.033649211 -4.0161957 3.035404e-03
## 3   6 (Intercept) 19.081987419 3.495964315  5.4582901 2.807248e-03
## 4   6        disp  0.003605119 0.020548982  0.1754402 8.676173e-01
## 5   8 (Intercept) 22.032798914 2.991733959  7.3645582 8.680881e-06
## 6   8        disp -0.019634095 0.009510531 -2.0644583 6.128588e-02
##      conf.low    conf.high df outcome
## 1 32.93825930 48.805651347  9     mpg
## 2 -0.21126162 -0.059022011  9     mpg
## 3 10.09532505 28.068649785  5     mpg
## 4 -0.04921772  0.056427958  5     mpg
## 5 15.51437058 28.551227247 12     mpg
## 6 -0.04035576  0.001087572 12     mpg

Coefficient plots with ggplot2

Having multiple models in a data.frame makes it convenient to make coefficient plots. Here we plot the regression estimate of the slope of mpg with respect to disp for each level of cyl:

gg_df %>%
  filter(term == "disp") %>%
  ggplot(aes(x = estimate, y = cyl)) +
  geom_point() +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high), height = 0) +
  geom_vline(xintercept = 0, linetype = "dashed") +
  theme_bw()

Multiple outcome variables

We could do a similar exercise with three different dependent variables. lm_robust helpfully returns the name of the outcome variable, which makes plotting these models easy:

gg_df <-
  c("mpg", "hp", "wt") %>%
  map( ~ lm_robust(formula(paste0(., " ~ disp")), data = mtcars)) %>%
  map_df(tidy)
gg_df
##          term     estimate    std.error statistic      p.value
## 1 (Intercept) 29.599854756 1.490269e+00 19.862090 8.189608e-19
## 2        disp -0.041215120 5.170197e-03 -7.971672 6.743097e-09
## 3 (Intercept) 45.734532208 1.025433e+01  4.460021 1.064717e-04
## 4        disp  0.437552650 5.450129e-02  8.028299 5.823354e-09
## 5 (Intercept)  1.599814597 1.675599e-01  9.547720 1.329415e-10
## 6        disp  0.007010325 7.385652e-04  9.491816 1.519591e-10
##       conf.low    conf.high df outcome
## 1 26.556319624 32.643389889 30     mpg
## 2 -0.051774072 -0.030656168 30     mpg
## 3 24.792392081 66.676672334 30      hp
## 4  0.326246164  0.548859136 30      hp
## 5  1.257611729  1.942017464 30      wt
## 6  0.005501974  0.008518677 30      wt
gg_df %>%
  filter(term == "disp") %>%
  ggplot(aes(estimate, outcome)) +
  geom_point() +
  geom_errorbarh(aes(xmin = conf.low, xmax = conf.high), height = 0) +
  geom_vline(xintercept = 0, linetype = "dashed") +
  theme_bw()