Instrumental variable regression and machine learning
Intro
Just like the question “what’s the difference between machine learning and statistics?” has caused a lot of ink to be spilled (since at least Breiman (2001)), the same question with statistics replaced by econometrics has led to a lot of discussion as well. I like this presentation by Hal Varian from almost 6 years ago. There’s a slide called “What econometrics can learn from machine learning”, which summarises Varian (2014) in a few bullet points, and the rest of the presentation discusses what machine learning can learn from econometrics. Varian argues that the difference between machine learning and econometrics is that machine learning focuses on prediction, while econometrics focuses on inference and causality (and, to a lesser extent, prediction as well). Varian cites methods that have been in the econometricians’ toolbox for decades (at least for some of them), such as regression discontinuity, difference in differences and instrumental variables regression. Another interesting paper is Mullainathan and Spiess, especially the section called “What Do We (Not) Learn from Machine Learning Output?”. The authors discuss the tempting idea of using LASSO to perform variable (feature) selection. Econometricians might be tempted to use LASSO for variable selection and draw conclusions such as: the variable (feature) “Number of rooms” has not been selected by LASSO, thus it plays no role in the prediction of house prices. However, when variables (features) are highly correlated, LASSO selects variables essentially at random, without any meaningful impact on the model’s predictive performance. I found this paragraph quite interesting (emphasis mine):
This problem is ubiquitous in machine learning. The very appeal of these algorithms is that they can fit many different functions. But this creates an Achilles’ heel: more functions mean a greater chance that two functions with very different coefficients can produce similar prediction quality. As a result, how an algorithm chooses between two very different functions can effectively come down to the flip of a coin. In econometric terms, while the lack of standard errors illustrates the limitations to making inference after model selection, the challenge here is (uniform) model selection consistency itself.
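Coming back to the LASSO point: a small simulated sketch can make it concrete (this example is mine, not from the paper, and it uses the {glmnet} package, which is not needed for the rest of this post). With two almost perfectly correlated features that both matter, LASSO typically keeps one and drops the other, and which one survives is essentially arbitrary:
library(glmnet)
set.seed(1234)
n <- 500
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.01) # x2 is almost a copy of x1
y <- 1 + 2*x1 + 2*x2 + rnorm(n)
lasso_fit <- cv.glmnet(cbind(x1, x2), y)
coef(lasso_fit, s = "lambda.1se")
# Typically only one of the two coefficients is non-zero; concluding that the
# dropped variable "plays no role" would be wrong, since swapping x1 and x2
# would barely change predictive performance.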
Assuming that we successfully dealt with model selection, we still have to contend with the significance of coefficients. There is recent research on this topic, such as Horel and Giesecke, but I wonder to what extent explainability could help here. I have been looking around for papers that discuss explainability in the context of the social sciences, but have not found any. If any readers of this blog are aware of such papers, please let me know.
To wrap up Mullainathan and Spiess: the authors suggest using machine learning mainly for prediction tasks, such as using satellite images to predict future harvest size (the authors cite Lobell (2013)), or for tasks that have an implicit prediction component. For instance, in the case of instrumental variables regression, two-stage least squares is often used, and its first stage is a prediction task. Propensity score matching is another prediction task where machine learning could be used. Other examples are presented as well. In this blog post, I’ll explore two-stage least squares and see what happens when a random forest is used for the first stage.
Instrumental variables regression using two-stage least squares
Let’s work through a textbook (literally) example of instrumental variables regression. The example below is taken from Wooldridge’s Econometric Analysis of Cross Section and Panel Data, and uses data from Mroz (1987), “The sensitivity of an empirical model of married women’s hours of work to economic and statistical assumptions”.
Let’s first load the needed packages and the "mroz" data included in the {wooldridge} package:
library(tidyverse)
library(randomForest)
library(wooldridge)
library(AER)
library(Metrics)
data("mroz")
Let’s only select the women that are in the labour force (inlf == 1), and run a simple linear regression. The dependent variable, or target, is lwage, the logarithm of the wage, and the explanatory variables, or features, are exper, expersq and educ. For a full description of the data, click below:
Description of data
mroz {wooldridge} R Documentation
mroz
Description
Wooldridge Source: T.A. Mroz (1987), “The Sensitivity of an Empirical Model of Married Women’s Hours of Work to Economic and Statistical Assumptions,” Econometrica 55, 765-799. Professor Ernst R. Berndt, of MIT, kindly provided the data, which he obtained from Professor Mroz. Data loads lazily.
Usage
data('mroz')
Format
A data.frame with 753 observations on 22 variables:
inlf: =1 if in lab frce, 1975
hours: hours worked, 1975
kidslt6: # kids < 6 years
kidsge6: # kids 6-18
age: woman's age in yrs
educ: years of schooling
wage: est. wage from earn, hrs
repwage: rep. wage at interview in 1976
hushrs: hours worked by husband, 1975
husage: husband's age
huseduc: husband's years of schooling
huswage: husband's hourly wage, 1975
faminc: family income, 1975
mtr: fed. marg. tax rte facing woman
motheduc: mother's years of schooling
fatheduc: father's years of schooling
unem: unem. rate in county of resid.
city: =1 if live in SMSA
exper: actual labor mkt exper
nwifeinc: (faminc - wage*hours)/1000
lwage: log(wage)
expersq: exper^2
Used in Text
pages 249-251, 260, 294, 519-520, 530, 535, 535-536, 565-566, 578-579, 593-595, 601-603, 625
Source
https://www.cengage.com/cgi-wadsworth/course_products_wp.pl?fid=M20b&product_isbn_issn=9781111531041
working_w <- mroz %>%
filter(inlf == 1)
wage_lm <- lm(lwage ~ exper + expersq + educ,
data = working_w)
summary(wage_lm)
##
## Call:
## lm(formula = lwage ~ exper + expersq + educ, data = working_w)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.08404 -0.30627 0.04952 0.37498 2.37115
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.5220406 0.1986321 -2.628 0.00890 **
## exper 0.0415665 0.0131752 3.155 0.00172 **
## expersq -0.0008112 0.0003932 -2.063 0.03974 *
## educ 0.1074896 0.0141465 7.598 1.94e-13 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6664 on 424 degrees of freedom
## Multiple R-squared: 0.1568, Adjusted R-squared: 0.1509
## F-statistic: 26.29 on 3 and 424 DF, p-value: 1.302e-15
We see that education is statistically significant and its effect is quite large: the return to education is about 11%.
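Since lwage is the logarithm of the wage, the educ coefficient is read as an (approximate) percentage return per additional year of schooling. The following quick check is not part of the original analysis; it just makes the “about 11%” figure explicit:
coef(wage_lm)["educ"]          # approximate return: about 10.7%
exp(coef(wage_lm)["educ"]) - 1 # exact implied change: about 11.3%
Now, let’s add some more explanatory variables: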
wage_lm2 <- lm(lwage ~ exper + expersq + kidslt6 + kidsge6 + husage + huswage + city + educ,
data = working_w)
summary(wage_lm2)
##
## Call:
## lm(formula = lwage ~ exper + expersq + kidslt6 + kidsge6 + husage +
## huswage + city + educ, data = working_w)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.07431 -0.30500 0.05477 0.37871 2.31157
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.3853695 0.3163043 -1.218 0.22378
## exper 0.0398817 0.0133651 2.984 0.00301 **
## expersq -0.0007400 0.0003985 -1.857 0.06402 .
## kidslt6 -0.0564071 0.0890759 -0.633 0.52692
## kidsge6 -0.0143165 0.0276579 -0.518 0.60499
## husage -0.0028828 0.0049338 -0.584 0.55934
## huswage 0.0177470 0.0102733 1.727 0.08482 .
## city 0.0119960 0.0725595 0.165 0.86877
## educ 0.0986810 0.0151589 6.510 2.16e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6669 on 419 degrees of freedom
## Multiple R-squared: 0.1654, Adjusted R-squared: 0.1495
## F-statistic: 10.38 on 8 and 419 DF, p-value: 2.691e-13
The return to education drops a bit, but remains significant. The issue, however, is that education is not exogenous (it is not randomly assigned) and is thus correlated with the error term of the regression, for instance because of an omitted variable contained in the error term that is correlated with education (for example, work ethic).
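To see why this is a problem, here is the textbook omitted variable bias argument, written for the simple one-regressor case (this derivation is mine, not part of the original post). If the true model is

$$\text{lwage}_i = \beta_0 + \beta_1 \,\text{educ}_i + \gamma\, \text{ethic}_i + u_i$$

but ethic (work ethic) is unobserved and left in the error term, then OLS on educ alone converges to

$$\operatorname{plim}\,\hat{\beta}_1^{\text{OLS}} = \beta_1 + \gamma\,\frac{\operatorname{Cov}(\text{educ}, \text{ethic})}{\operatorname{Var}(\text{educ})},$$

so the estimated return to education also picks up the effect of work ethic whenever the two are correlated.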
To deal with this, econometricians use instrumental variables (IV) regression. I won’t go into detail here; just know that this method can deal with this type of issue. The Wikipedia page gives a good introduction to what this is all about. This short paper is also a nice introduction to instrumental variables.
In practice, IV regression is done in two stages. First, regress the endogenous variable, in our case education, on all the explanatory variables from before, plus so-called instruments. Instruments are variables that are correlated with the endogenous variable (here, education) but uncorrelated with the error term: they only affect the target variable through their correlation with the endogenous variable. We will be using the education levels of the women’s parents, as well as of their husbands, as instruments. The assumption is that the parents’ and the husband’s education are exogenous with respect to the woman’s log wage. This assumption can of course be challenged, but let’s say that it holds.
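Written out (with generic notation that is not in the original post: $X_i$ collects the included exogenous controls, $Z_i$ the instruments), the two stages are:

$$\text{educ}_i = \pi_0 + X_i'\pi_X + Z_i'\pi_Z + v_i \qquad \text{(first stage)}$$

$$\text{lwage}_i = \beta_0 + X_i'\beta_X + \beta_{\text{educ}}\,\widehat{\text{educ}}_i + \varepsilon_i \qquad \text{(second stage)}$$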
Let’s run the first stage and obtain the predicted values of education:
first_stage <- lm(educ ~ exper + expersq + kidslt6 + kidsge6 + husage + huswage
+ city + motheduc + fatheduc + huseduc, data = working_w)
working_w$predictions_first_stage <- predict(first_stage)
We are now ready for the second stage. In the regression from before:
wage_lm2 <- lm(lwage ~ exper + expersq + kidslt6 + kidsge6 + husage + huswage + city + educ,
data = working_w)
we now replace educ with the predictions from stage 1:
second_stage <- lm(lwage ~ exper + expersq + kidslt6 + kidsge6 + husage + huswage
+ city + predictions_first_stage,
data = working_w)
summary(second_stage)
##
## Call:
## lm(formula = lwage ~ exper + expersq + kidslt6 + kidsge6 + husage +
## huswage + city + predictions_first_stage, data = working_w)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.13493 -0.30004 0.03046 0.37142 2.27199
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.1763588 0.4206911 0.419 0.6753
## exper 0.0419047 0.0139885 2.996 0.0029 **
## expersq -0.0007881 0.0004167 -1.891 0.0593 .
## kidslt6 -0.0255934 0.0941128 -0.272 0.7858
## kidsge6 -0.0234422 0.0291914 -0.803 0.4224
## husage -0.0042628 0.0051919 -0.821 0.4121
## huswage 0.0263802 0.0114511 2.304 0.0217 *
## city 0.0215685 0.0759034 0.284 0.7764
## predictions_first_stage 0.0531993 0.0263735 2.017 0.0443 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6965 on 419 degrees of freedom
## Multiple R-squared: 0.08988, Adjusted R-squared: 0.0725
## F-statistic: 5.172 on 8 and 419 DF, p-value: 3.581e-06
We see that education, now instrumented by the parents’ and the husband’s education, is still significant, but the effect is much lower: the return to education is now about 5%. Should our assumptions hold, this effect can be interpreted as causal. There are some caveats, however. The IV estimate is a local average treatment effect, meaning that we only recover the effect for those individuals whose treatment status was changed by the instrument. In this case, it would mean that the effect we recovered only applies to women who were not planning on, say, studying, but did so under the influence of their parents (or vice versa).
IV regression can also be performed using the ivreg() function from the {AER} package:
inst_reg <- ivreg(lwage ~ exper + expersq + kidslt6 + kidsge6 + husage + huswage + city + educ
| .-educ + motheduc + fatheduc + huseduc,
data = working_w)
summary(inst_reg)
##
## Call:
## ivreg(formula = lwage ~ exper + expersq + kidslt6 + kidsge6 +
## husage + huswage + city + educ | . - educ + motheduc + fatheduc +
## huseduc, data = working_w)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.10175 -0.30407 0.03379 0.35255 2.25107
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.1763588 0.4071522 0.433 0.6651
## exper 0.0419047 0.0135384 3.095 0.0021 **
## expersq -0.0007881 0.0004033 -1.954 0.0514 .
## kidslt6 -0.0255934 0.0910840 -0.281 0.7789
## kidsge6 -0.0234422 0.0282519 -0.830 0.4071
## husage -0.0042628 0.0050249 -0.848 0.3967
## huswage 0.0263802 0.0110826 2.380 0.0177 *
## city 0.0215685 0.0734606 0.294 0.7692
## educ 0.0531993 0.0255247 2.084 0.0377 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6741 on 419 degrees of freedom
## Multiple R-Squared: 0.1475, Adjusted R-squared: 0.1312
## Wald test: 5.522 on 8 and 419 DF, p-value: 1.191e-06
Note that the point estimates are exactly the same as in the manual two-step approach; only the standard errors differ, because the manual second stage does not account for the fact that educ was itself estimated in the first stage (the standard errors reported by ivreg() are the correct ones). Ok, great, now let’s see how a machine learning practitioner who took an econometrics MOOC might tackle the issue. The first step is to split the data into training and testing sets:
set.seed(42)
sample <- sample.int(n = nrow(working_w), size = floor(.90 * nrow(working_w)), replace = FALSE)
train <- working_w[sample, ]
test <- working_w[-sample, ]
Let’s now run the same analysis as above, but let’s compute the RMSE of the first stage regression on the testing data as well:
first_stage <- lm(educ ~ exper + expersq + kidslt6 + kidsge6 + husage + huswage
+ city + motheduc + fatheduc + huseduc, data = train)
test$predictions_first_stage <- predict(first_stage, newdata = test)
lm_rmse <- rmse(predicted = test$predictions_first_stage, actual = test$educ)
train$predictions_first_stage <- predict(first_stage)
The first stage is done, let’s go with the second stage:
second_stage <- lm(lwage ~ exper + expersq + kidslt6 + kidsge6 +
husage + huswage + city + predictions_first_stage,
data = train)
summary(second_stage)
##
## Call:
## lm(formula = lwage ~ exper + expersq + kidslt6 + kidsge6 + husage +
## huswage + city + predictions_first_stage, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.09828 -0.28606 0.05248 0.37258 2.29947
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.0037711 0.4489252 -0.008 0.99330
## exper 0.0449370 0.0145632 3.086 0.00218 **
## expersq -0.0008394 0.0004344 -1.933 0.05404 .
## kidslt6 -0.0630522 0.0963953 -0.654 0.51345
## kidsge6 -0.0197164 0.0306834 -0.643 0.52089
## husage -0.0034744 0.0054358 -0.639 0.52310
## huswage 0.0219622 0.0118602 1.852 0.06484 .
## city 0.0679668 0.0804317 0.845 0.39863
## predictions_first_stage 0.0618777 0.0283253 2.185 0.02954 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6952 on 376 degrees of freedom
## Multiple R-squared: 0.1035, Adjusted R-squared: 0.08438
## F-statistic: 5.424 on 8 and 376 DF, p-value: 1.764e-06
The coefficients here are a bit different because of the train/test split, but that’s not an issue. Ok, great, but our machine learning engineer is in love with random forests, so he wants to use a random forest for the prediction task of the first stage:
library(randomForest)
first_stage_rf <- randomForest(educ ~ exper + expersq + kidslt6 + kidsge6 + husage + huswage
+ city + motheduc + fatheduc + huseduc,
data = train)
test$predictions_first_stage_rf <- predict(first_stage_rf, newdata = test)
rf_rmse <- rmse(predicted = test$predictions_first_stage_rf, actual = test$educ)
train$predictions_first_stage_rf <- predict(first_stage_rf)
Let’s compare the RMSEs of the two first stages: the RMSE of the first stage using linear regression was 2.0558723, and for the random forest it was 2.0000417.
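(A small side-by-side comparison, not in the original post, using the lm_rmse and rf_rmse objects computed above:)
# Compare the two first-stage RMSEs on the test set
tibble(first_stage = c("linear regression", "random forest"),
       test_rmse   = c(lm_rmse, rf_rmse))
Our machine learning engineer is happy, because the random forest performs better. Let’s now use its predictions for the second stage: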
second_stage_rf_lm <- lm(lwage ~ exper + expersq + kidslt6 + kidsge6 +
husage + huswage + city + predictions_first_stage_rf,
data = train)
summary(second_stage_rf_lm)
##
## Call:
## lm(formula = lwage ~ exper + expersq + kidslt6 + kidsge6 + husage +
## huswage + city + predictions_first_stage_rf, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0655 -0.3198 0.0376 0.3710 2.3277
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.0416945 0.4824998 -0.086 0.93118
## exper 0.0460311 0.0145543 3.163 0.00169 **
## expersq -0.0008594 0.0004344 -1.978 0.04863 *
## kidslt6 -0.0420827 0.0952030 -0.442 0.65872
## kidsge6 -0.0211208 0.0306490 -0.689 0.49117
## husage -0.0033102 0.0054660 -0.606 0.54514
## huswage 0.0229111 0.0118142 1.939 0.05322 .
## city 0.0688384 0.0805209 0.855 0.39314
## predictions_first_stage_rf 0.0629275 0.0306877 2.051 0.04100 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.6957 on 376 degrees of freedom
## Multiple R-squared: 0.1021, Adjusted R-squared: 0.08302
## F-statistic: 5.346 on 8 and 376 DF, p-value: 2.251e-06
The results are pretty similar. Now, why not go a bit further and use a random forest for the second stage as well?