Missing data imputation and instrumental variables regression: the tidy approach
In this blog post I will discuss missing data imputation and instrumental variables regression. This is based on a short presentation I will give at my job. You can find the data used here on this website: http://eclr.humanities.manchester.ac.uk/index.php/IV_in_R
The data is from Wooldridge’s book, Econometrics: A Modern Approach. You can download the data by clicking here.
This is the variable description:
1. inlf: =1 if in labor force, 1975
2. hours: hours worked, 1975
3. kidslt6: # kids < 6 years
4. kidsge6: # kids 6-18
5. age: woman's age in yrs
6. educ: years of schooling
7. wage: estimated wage from earns., hours
8. repwage: reported wage at interview in 1976
9. hushrs: hours worked by husband, 1975
10. husage: husband's age
11. huseduc: husband's years of schooling
12. huswage: husband's hourly wage, 1975
13. faminc: family income, 1975
14. mtr: fed. marginal tax rate facing woman
15. motheduc: mother's years of schooling
16. fatheduc: father's years of schooling
17. unem: unem. rate in county of resid.
18. city: =1 if live in SMSA
19. exper: actual labor mkt exper
20. nwifeinc: (faminc - wage*hours)/1000
21. lwage: log(wage)
22. expersq: exper^2
The goal is to first impute missing data in the data set, and then determine the impact of one additional year of education on wages. If one simply ignores missing values, bias can be introduced, depending on the missingness mechanism. The second problem here is that education is likely to be endogenous (and thus correlated with the error term), as it is not randomly assigned. This causes biased estimates and may lead to seriously wrong conclusions. So missingness and endogeneity should both be dealt with, but dealing with both issues is more of a programming challenge than an econometrics challenge.
Thankfully, the packages contained in the {tidyverse} as well as {mice} will save the day!
If you inspect the data, you will see that there are no missing values. So I will use the {mice} package to first ampute the data (which means adding missing values). This, of course, is done for illustration purposes only. If you’re lucky enough to not have missing values in your data, you shouldn’t add them!
Let’s load all the packages needed:
library(tidyverse)
library(AER)
library(naniar)
library(mice)
So first, let’s read in the data, and ampute it:
wages_data <- read_csv("http://eclr.humanities.manchester.ac.uk/images/5/5f/Mroz.csv")
## Parsed with column specification:
## cols(
## .default = col_integer(),
## wage = col_character(),
## repwage = col_double(),
## huswage = col_double(),
## mtr = col_double(),
## unem = col_double(),
## nwifeinc = col_double(),
## lwage = col_character()
## )
## See spec(...) for full column specifications.
First, I only select the variables I want to use and convert them to the correct class:
wages_data <- wages_data %>%
    select(wage, educ, fatheduc, motheduc, inlf, hours,
           kidslt6, kidsge6, age, huswage,
           mtr, unem, city, exper) %>%
    mutate_at(vars(kidslt6, kidsge6, hours, educ, age, wage, huswage, mtr,
                   motheduc, fatheduc, unem, exper), as.numeric) %>%
    mutate_at(vars(city, inlf), as.character)
## Warning in evalq(as.numeric(wage), <environment>): NAs introduced by
## coercion
In the data, some women are not in the labour force, and thus do not have any wages; meaning they should have a 0 there. Instead, this is represented with the following symbol: “.”. The coercion to numeric above turned these dots into NAs (hence the warning), so I now convert those NAs to 0. One could argue that the wages should not be 0, but that they’re truly missing. This is true, and there are ways to deal with such questions (Heckman’s selection model, for instance), but this is not the point here.
wages_data <- wages_data %>%
    mutate(wage = ifelse(is.na(wage), 0, wage))
Let’s double check if there are any missing values in the data, using naniar::vis_miss():
vis_miss(wages_data)
Nope! Let’s ampute it:
wages_mis <- ampute(wages_data)$amp
## Warning: Data is made numeric because the calculation of weights requires
## numeric data
ampute() returns an object where the amp element is the amputed data. This is what I save into the new variable wages_mis.
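By the way, ampute() returns more than just the amputed data: the result is an object of class mads that also stores the settings that were used. Here is a quick look, as a sketch; the prop and patterns elements are documented in {mice}, but check ?ampute if your version differs:
amp_obj <- ampute(wages_data)
names(amp_obj)
amp_obj$prop      # proportion of incomplete cases (0.5 by default)
amp_obj$patterns  # the missing data patterns that were applied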
Let’s take a look:
vis_miss(wages_mis)
Ok, so now we have missing values. Let’s use the recently added mice::parlmice() function to impute the dataset, in parallel:
# cl.type = "FORK" only works on Unix-like systems; on Windows, use cl.type = "PSOCK"
imp_wages <- parlmice(data = wages_mis, m = 10, maxit = 20, cl.type = "FORK")
For reproducibility, I save these objects to disk:
write_csv(wages_mis, "wages_mis.csv")
saveRDS(imp_wages, "imp_wages.rds")
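If you come back to this analysis in a later session, you can read these objects back in instead of re-running the imputation:
# restore the saved objects
wages_mis <- read_csv("wages_mis.csv")
imp_wages <- readRDS("imp_wages.rds")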
As a sanity check, let’s look at the missingness pattern for the first completed dataset:
vis_miss(complete(imp_wages))
mice::parlmice() was able to impute the dataset. I imputed it 10 times, so now I have 10 imputed datasets. If I want to estimate a model using this data, I will need to do so 10 times.
This is where the tidyverse comes into play. First, let’s combine all 10 imputed datasets into one long dataset, with an index to differentiate them. This is done easily with mice::complete():
imp_wages_df <- mice::complete(imp_wages, "long")
Let’s take a look at the data:
head(imp_wages_df)
## .imp .id wage educ fatheduc motheduc inlf hours kidslt6 kidsge6 age
## 1 1 1 3.3540 12 7 12 1 1610 1 0 32
## 2 1 2 1.3889 12 7 7 1 1656 0 2 30
## 3 1 3 4.5455 12 7 12 1 1980 0 3 35
## 4 1 4 1.0965 12 7 7 1 456 0 3 34
## 5 1 5 4.5918 14 14 12 1 1568 1 2 31
## 6 1 6 4.7421 12 7 14 1 2032 0 0 54
## huswage mtr unem city exper
## 1 4.0288 0.7215 5.0 0 14
## 2 8.4416 0.6615 11.0 1 5
## 3 3.5807 0.6915 5.0 0 15
## 4 3.5417 0.7815 5.0 0 6
## 5 10.0000 0.6215 9.5 1 14
## 6 4.7364 0.6915 7.5 1 33
As you can see, there are two new columns, .id and .imp. The .imp column indexes the imputed datasets: rows with .imp equal to i come from the i-th imputed dataset.
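A quick sanity check: each of the 10 imputed datasets should have as many rows as the original data (753):
# count the rows per imputed dataset; each group should have n = 753
count(imp_wages_df, .imp)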
Because I have 0’s in my dependent variable, I cannot simply take the log of the wages (log(0) is undefined); instead I use the Inverse Hyperbolic Sine transformation, which is defined at 0. Marc F. Bellemare wrote a nice post about it here.
ihs <- function(x){
    log(x + sqrt(x**2 + 1))
}
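As a side note, this is exactly the function base R ships as asinh(), which makes for an easy sanity check:
# ihs() matches base R's asinh(), and maps 0 to 0 (so zero wages are unproblematic)
all.equal(ihs(c(0, 1, 10, 100)), asinh(c(0, 1, 10, 100)))
ihs(0)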
I can now apply this function, but first I have to group by .imp. Remember, these are 10 separate datasets. I also create experience squared:
imp_wages_df <- imp_wages_df %>%
    group_by(.imp) %>%
    mutate(ihs_wage = ihs(wage),
           exper2 = exper**2)
Now comes some tidyverse magic. I will create a new dataset by using the nest() function from tidyr.
(imp_wages <- imp_wages_df %>%
    group_by(.imp) %>%
    nest())
## # A tibble: 10 x 2
## .imp data
## <int> <list>
## 1 1 <tibble [753 × 17]>
## 2 2 <tibble [753 × 17]>
## 3 3 <tibble [753 × 17]>
## 4 4 <tibble [753 × 17]>
## 5 5 <tibble [753 × 17]>
## 6 6 <tibble [753 × 17]>
## 7 7 <tibble [753 × 17]>
## 8 8 <tibble [753 × 17]>
## 9 9 <tibble [753 × 17]>
## 10 10 <tibble [753 × 17]>
As you can see, imp_wages is now a dataset with two columns: .imp, indexing the imputed datasets, and a column called data, where each element is itself a tibble! data is a so-called list-column. You can read more about list-columns in the purrr tutorial written by Jenny Bryan.
Estimating a model now is easy, if you’re familiar with purrr. This is how you do it:
imp_wages_reg = imp_wages %>%
    mutate(lin_reg = map(data,
                         ~lm(ihs_wage ~ educ + inlf + hours +
                                 kidslt6 + kidsge6 + age + huswage +
                                 mtr + unem + city + exper + exper2,
                             data = .)))
Ok, so what happened here? imp_wages is a data frame, so it’s possible to add a column to it with mutate(). I call that column lin_reg and use map() on the column called data (remember, this column is actually a list of data frames; map() takes a list as its first argument, and then a function or a formula) with the following formula:
~lm(ihs_wage ~ educ + inlf + hours +
        kidslt6 + kidsge6 + age + huswage +
        mtr + unem + city + exper + exper2,
    data = .)
This formula is nothing more than a good old linear regression. The last line, data = ., means that the data to be used inside lm() should come from the list called data, which is the second column of imp_wages. As I’m writing these lines, I realize it is confusing as hell. But I promise you that learning to use purrr is a bit like learning how to ride a bicycle: very difficult to explain, but once you know how to do it, it feels super natural. Take some time to play with the lines above to really understand what happened.
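If the ~ and . notation is new to you, here is a tiny, self-contained example (toy data, nothing to do with wages) showing that a purrr formula is just shorthand for an anonymous function:
# ~mean(.) is shorthand for function(x) mean(x); . stands for the current element
map(list(1:3, 4:6), ~mean(.))
# the same thing, written out explicitly
map(list(1:3, 4:6), function(x) mean(x))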
Now, let’s take a look at the result:
imp_wages_reg
## # A tibble: 10 x 3
## .imp data lin_reg
## <int> <list> <list>
## 1 1 <tibble [753 × 17]> <lm>
## 2 2 <tibble [753 × 17]> <lm>
## 3 3 <tibble [753 × 17]> <lm>
## 4 4 <tibble [753 × 17]> <lm>
## 5 5 <tibble [753 × 17]> <lm>
## 6 6 <tibble [753 × 17]> <lm>
## 7 7 <tibble [753 × 17]> <lm>
## 8 8 <tibble [753 × 17]> <lm>
## 9 9 <tibble [753 × 17]> <lm>
## 10 10 <tibble [753 × 17]> <lm>
imp_wages_reg now has a third column, called lin_reg, where each element is a linear model estimated on the data from the data column! We can now pool the results of these 10 regressions using mice::pool():
pool_lin_reg <- pool(imp_wages_reg$lin_reg)
summary(pool_lin_reg)
## estimate std.error statistic df p.value
## (Intercept) 1.2868701172 3.214473e-01 4.0033628 737.9337 6.876133e-05
## educ 0.0385310276 8.231906e-03 4.6806931 737.9337 3.401935e-06
## inlf 1.8845418354 5.078235e-02 37.1101707 737.9337 0.000000e+00
## hours -0.0001164143 3.011378e-05 -3.8658143 737.9337 1.204773e-04
## kidslt6 -0.0438925013 3.793152e-02 -1.1571510 737.9337 2.475851e-01
## kidsge6 -0.0117978229 1.405226e-02 -0.8395678 737.9337 4.014227e-01
## age -0.0030084595 2.666614e-03 -1.1281946 737.9337 2.596044e-01
## huswage -0.0231736955 5.607364e-03 -4.1327255 737.9337 3.995866e-05
## mtr -2.2109176781 3.188827e-01 -6.9333267 737.9337 8.982592e-12
## unem 0.0028775444 5.462973e-03 0.5267360 737.9337 5.985352e-01
## city 0.0157414671 3.633755e-02 0.4332011 737.9337 6.649953e-01
## exper 0.0164364027 6.118875e-03 2.6861806 737.9337 7.389936e-03
## exper2 -0.0002022602 1.916146e-04 -1.0555575 737.9337 2.915159e-01
This function averages the results from the 10 regressions and computes correct standard errors, following Rubin’s rules (Rubin, 1987, p. 76). As you can see, the linear regression indicates that one added year of education has a positive, significant effect on log wages (they’re not really log wages, since I used the IHS transformation, but “log wages” just sounds better than “inverse hyperbolic sined wages”). This effect is almost 4%.
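To make the pooling a bit more concrete, here is a minimal sketch of Rubin’s rules for a single coefficient, computed by hand from the 10 fitted models (mice::pool() does this for every term, and also computes the degrees of freedom):
m <- 10
estimates <- map_dbl(imp_wages_reg$lin_reg, ~coef(.)["educ"])
variances <- map_dbl(imp_wages_reg$lin_reg, ~vcov(.)["educ", "educ"])

q_bar <- mean(estimates)           # pooled point estimate: the mean of the 10 estimates
w_bar <- mean(variances)           # within-imputation variance
b <- var(estimates)                # between-imputation variance
total_var <- w_bar + (1 + 1/m) * b # Rubin's total variance
c(estimate = q_bar, std.error = sqrt(total_var))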
But education is not randomly assigned, and as such might be endogenous. This is where instrumental variables come into play. An instrument is a variable that impacts the dependent variable only through the endogenous variable (here, education). For example, the parents’ education does not have a direct impact on one’s wage, but having college-educated parents means that you are likely college-educated yourself, and thus have a higher wage than if you only had a high school diploma.
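Before estimating anything, it is a good idea to check that the instruments are at least correlated with education. Here is a quick, informal first-stage regression on the first imputed dataset (a sketch only, not a formal weak-instrument test):
# informal first stage: regress educ on the instruments plus the other regressors
first_stage <- lm(educ ~ fatheduc + motheduc + inlf + hours +
                      kidslt6 + kidsge6 + age + huswage +
                      mtr + unem + city + exper + exper2,
                  data = filter(imp_wages_df, .imp == 1))
summary(first_stage)$coefficients[c("fatheduc", "motheduc"), ]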
I am thus going to instrument education with both parents’ education:
imp_wages_reg = imp_wages_reg %>%
    mutate(iv_reg = map(data,
                        ~ivreg(ihs_wage ~ educ + inlf + hours +
                                   kidslt6 + kidsge6 + age + huswage +
                                   mtr + unem + city + exper + exper2 | . - educ + fatheduc + motheduc,
                               data = .)))
The only difference from before is the formula:
~ivreg(ihs_wage ~ educ + inlf + hours +
           kidslt6 + kidsge6 + age + huswage +
           mtr + unem + city + exper + exper2 | . - educ + fatheduc + motheduc,
       data = .)
## ~ivreg(ihs_wage ~ educ + inlf + hours + kidslt6 + kidsge6 + age +
## huswage + mtr + unem + city + exper + exper2 | . - educ +
## fatheduc + motheduc, data = .)
Instead of lm() I use AER::ivreg(), and the formula has a second part, after the | symbol. This is where I specify the instruments: . - educ + fatheduc + motheduc means “all the exogenous regressors from the first part, minus educ, plus the two instruments fatheduc and motheduc”.
imp_wages_reg now looks like this:
imp_wages_reg
## # A tibble: 10 x 4
## .imp data lin_reg iv_reg
## <int> <list> <list> <list>
## 1 1 <tibble [753 × 17]> <lm> <ivreg>
## 2 2 <tibble [753 × 17]> <lm> <ivreg>
## 3 3 <tibble [753 × 17]> <lm> <ivreg>
## 4 4 <tibble [753 × 17]> <lm> <ivreg>
## 5 5 <tibble [753 × 17]> <lm> <ivreg>
## 6 6 <tibble [753 × 17]> <lm> <ivreg>
## 7 7 <tibble [753 × 17]> <lm> <ivreg>
## 8 8 <tibble [753 × 17]> <lm> <ivreg>
## 9 9 <tibble [753 × 17]> <lm> <ivreg>
## 10 10 <tibble [753 × 17]> <lm> <ivreg>
Let’s take a look at the results:
pool_iv_reg <- pool(imp_wages_reg$iv_reg)
summary(pool_iv_reg)
## estimate std.error statistic df p.value
## (Intercept) 2.0091904157 5.146812e-01 3.9037568 737.9337 1.033832e-04
## educ 0.0038859137 2.086592e-02 0.1862326 737.9337 8.523136e-01
## inlf 1.9200207113 5.499457e-02 34.9129122 737.9337 0.000000e+00
## hours -0.0001313866 3.157375e-05 -4.1612608 737.9337 3.537881e-05
## kidslt6 -0.0234593391 4.000689e-02 -0.5863824 737.9337 5.577979e-01
## kidsge6 -0.0123239220 1.422241e-02 -0.8665145 737.9337 3.864897e-01
## age -0.0040874625 2.763340e-03 -1.4791748 737.9337 1.395203e-01
## huswage -0.0242737100 5.706497e-03 -4.2536970 737.9337 2.373189e-05
## mtr -2.6385172445 3.998419e-01 -6.5989008 737.9337 7.907430e-11
## unem 0.0047331976 5.622137e-03 0.8418859 737.9337 4.001246e-01
## city 0.0255647706 3.716783e-02 0.6878197 737.9337 4.917824e-01
## exper 0.0180917073 6.258779e-03 2.8906127 737.9337 3.957817e-03
## exper2 -0.0002291007 1.944599e-04 -1.1781381 737.9337 2.391213e-01
As you can see, education is not statistically significant anymore! This is why it is quite important to think about endogeneity issues. However, it is not always easy to find suitable instruments. A series of tests exists to determine whether your instruments are relevant and strong, but this blog post is already long enough. I will leave this for a future blog post.
If you found this blog post useful, you might want to follow me on twitter for blog post updates.