Imputing missing values in parallel using {furrr}
RToday I saw this tweet on my timeline:
For those of us that just can't wait until RStudio officially supports parallel purrr in #rstats, boy have I got something for you.
— Davis Vaughan (@dvaughan32) April 13, 2018
Introducing `furrr`, parallel purrr through the use of futures. Go ahead, break things, you know you want to:https://t.co/l9z1UC2Tew
and as a heavy {purrr}
user, as well as the happy owner of a 6-core AMD Ryzen 5 1600X cpu,
I was very excited to try out {furrr}
. For those unfamiliar with {purrr}
, you can
read some of my previous blog posts on it here,
here or
here.
To summarize very quickly: {purrr}
contains so-called higher order functions, which are functions
that take other functions as argument. One such function is map()
. Consider the following simple example:
numbers <- seq(1, 10)
If you want the square root of this numbers, you can of course simply use the sqrt()
function,
because it is vectorized:
sqrt(numbers)
## [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751
## [8] 2.828427 3.000000 3.162278
But in a lot of situations, the solution is not so simple. Sometimes you have to loop over the
values. This is what we would need to do if sqrt()
was not vectorized:
sqrt_numbers <- rep(0, 10)
for(i in length(numbers)){
sqrt_numbers[i] <- sqrt(numbers[i])
}
First, you need to initialize a container, and then you have to populate the sqrt_numbers
list with the results.
Using, {purrr}
is way easier:
library(tidyverse)
map(numbers, sqrt)
## [[1]]
## [1] 1
##
## [[2]]
## [1] 1.414214
##
## [[3]]
## [1] 1.732051
##
## [[4]]
## [1] 2
##
## [[5]]
## [1] 2.236068
##
## [[6]]
## [1] 2.44949
##
## [[7]]
## [1] 2.645751
##
## [[8]]
## [1] 2.828427
##
## [[9]]
## [1] 3
##
## [[10]]
## [1] 3.162278
map()
is only one of the nice functions that are bundled inside {purrr}
. Mastering {purrr}
can really make you a much
more efficient R programmer. Anyways, recently, I have been playing around with imputation and the {mice}
package.
{mice}
comes with an example dataset called boys
, let’s take a look at it:
library(mice)
data(boys)
brotools::describe(boys) %>%
select(variable, type, n_missing, everything())
## # A tibble: 9 x 13
## variable type n_missing nobs mean sd mode min max q25
## <chr> <chr> <int> <int> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 age Numeric 0 748 9.16 6.89 0.035 0.035 21.2 1.58
## 2 bmi Numeric 21 748 18.1 3.05 14.54 11.8 31.7 15.9
## 3 hc Numeric 46 748 51.5 5.91 33.7 33.7 65 48.1
## 4 hgt Numeric 20 748 132. 46.5 50.1 50 198 84.9
## 5 tv Numeric 522 748 11.9 7.99 <NA> 1 25 4
## 6 wgt Numeric 4 748 37.2 26.0 3.65 3.14 117. 11.7
## 7 gen Factor 503 748 NA NA <NA> NA NA NA
## 8 phb Factor 503 748 NA NA <NA> NA NA NA
## 9 reg Factor 3 748 NA NA south NA NA NA
## # ... with 3 more variables: median <dbl>, q75 <dbl>, n_unique <int>
In the code above I use the describe()
function from my personal package to get some summary
statistics of the boys
dataset (you can read more about this function
here). I am especially interested in the number of
missing values, which is why I re-order the columns. If I did not re-order the columns, it would not appear in
the output on my blog.
We see that some columns have a lot of missing values. Using the mice
function, it is very
easy to impute them:
start <- Sys.time()
imp_boys <- mice(boys, m = 10, maxit = 100, printFlag = FALSE)
end <- Sys.time() - start
print(end)
## Time difference of 3.290611 mins
Imputation on a single core took around 3 minutes on my computer. This might seem ok, but if you
have a larger data set with more variables, 3 minutes can become 3 hours. And if you increase maxit
,
which helps convergence, or the number of imputations, 3 hours can become 30 hours. With a 6-core CPU
this could potentially be brought down to 5 hours (in theory). Let’s see if we can go faster,
but first let’s take a look at the imputed data.
The mice()
function returns a mids
object. If you want to look at the data, you have to use
the complete()
function (careful, there is also a complete()
function in the {tidyr}
package,
so to avoid problems, I suggest you explicitely call mice::complete()
):
imp_boys <- mice::complete(imp_boys, "long")
brotools::describe(imp_boys) %>%
select(variable, type, n_missing, everything())
## # A tibble: 11 x 13
## variable type n_missing nobs mean sd mode min max q25
## <chr> <chr> <int> <int> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 .id Numer… 0 7480 374. 216. 1 1 748 188.
## 2 .imp Numer… 0 7480 5.5 2.87 1 1 10 3
## 3 age Numer… 0 7480 9.16 6.89 0.035 0.035 21.2 1.58
## 4 bmi Numer… 0 7480 18.0 3.03 14.54 11.8 31.7 15.9
## 5 hc Numer… 0 7480 51.6 5.89 33.7 33.7 65 48.3
## 6 hgt Numer… 0 7480 131. 46.5 50.1 50 198 83
## 7 tv Numer… 0 7480 8.39 8.09 2 1 25 2
## 8 wgt Numer… 0 7480 37.1 26.0 3.65 3.14 117. 11.7
## 9 gen Factor 0 7480 NA NA G1 NA NA NA
## 10 phb Factor 0 7480 NA NA P1 NA NA NA
## 11 reg Factor 0 7480 NA NA south NA NA NA
## # ... with 3 more variables: median <dbl>, q75 <dbl>, n_unique <int>
As expected, no more missing values. The “long” argument inside mice::complete()
is needed if you want the complete()
function to return a long dataset. Doing the above “manually” using {purrr}
is possible with the following
code:
start <- Sys.time()
imp_boys_purrr <- map(rep(1, 10), ~mice(data = boys, m = ., maxit = 100, printFlag = FALSE))
end <- Sys.time() - start
print(end)
## Time difference of 3.393966 mins
What this does is map the function ~mice(data = boys, m = ., maxit = 100, printFlag = FALSE)
to a list of 1
s, and creates 10 imputed data sets. m = .
means that m
will be equal to whatever is inside
the list we are mapping our function over, so 1
, then 1
then another 1
etc….
It took around the same amount of time as using mice()
directly.
imp_boys_purrr
is now a list of 10 mids
objects. We thus need to map mice::complete()
to imp_boys_purrr
to get the data:
imp_boys_purrr_complete <- map(imp_boys_purrr, mice::complete)
Now, imp_boys_purrr_complete
is a list of 10 datasets. Let’s map brotools::describe()
to it:
map(imp_boys_purrr_complete, brotools::describe)
## [[1]]
## # A tibble: 9 x 13
## variable type nobs mean sd mode min max q25 median q75
## <chr> <chr> <int> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 age Numer… 748 9.16 6.89 0.035 0.035 21.2 1.58 10.5 15.3
## 2 bmi Numer… 748 18.0 3.03 14.54 11.8 31.7 15.9 17.4 19.5
## 3 hc Numer… 748 51.7 5.90 33.7 33.7 65 48.3 53.1 56
## 4 hgt Numer… 748 131. 46.5 50.1 50 198 84 146. 175.
## 5 tv Numer… 748 8.35 8.00 3 1 25 2 3 15
## 6 wgt Numer… 748 37.2 26.0 3.65 3.14 117. 11.7 34.7 59.6
## 7 gen Factor 748 NA NA G1 NA NA NA NA NA
## 8 phb Factor 748 NA NA P1 NA NA NA NA NA
## 9 reg Factor 748 NA NA south NA NA NA NA NA
## # ... with 2 more variables: n_missing <int>, n_unique <int>
##
## [[2]]
## # A tibble: 9 x 13
## variable type nobs mean sd mode min max q25 median q75
## <chr> <chr> <int> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 age Numer… 748 9.16 6.89 0.035 0.035 21.2 1.58 10.5 15.3
## 2 bmi Numer… 748 18.0 3.03 14.54 11.8 31.7 15.9 17.5 19.5
## 3 hc Numer… 748 51.6 5.88 33.7 33.7 65 48.3 53.2 56
## 4 hgt Numer… 748 131. 46.6 50.1 50 198 83.5 145. 175
## 5 tv Numer… 748 8.37 8.02 1 1 25 2 3 15
## 6 wgt Numer… 748 37.1 26.0 3.65 3.14 117. 11.9 34.6 59.6
## 7 gen Factor 748 NA NA G1 NA NA NA NA NA
## 8 phb Factor 748 NA NA P2 NA NA NA NA NA
## 9 reg Factor 748 NA NA south NA NA NA NA NA
## # ... with 2 more variables: n_missing <int>, n_unique <int>
##
## [[3]]
## # A tibble: 9 x 13
## variable type nobs mean sd mode min max q25 median q75
## <chr> <chr> <int> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 age Numer… 748 9.16 6.89 0.035 0.035 21.2 1.58 10.5 15.3
## 2 bmi Numer… 748 18.1 3.04 14.54 11.8 31.7 15.9 17.5 19.5
## 3 hc Numer… 748 51.6 5.87 33.7 33.7 65 48.5 53.3 56
## 4 hgt Numer… 748 131. 46.6 50.1 50 198 83.0 145. 175
## 5 tv Numer… 748 8.46 8.14 2 1 25 2 3 15
## 6 wgt Numer… 748 37.2 26.1 3.65 3.14 117. 11.7 34.6 59.6
## 7 gen Factor 748 NA NA G1 NA NA NA NA NA
## 8 phb Factor 748 NA NA P1 NA NA NA NA NA
## 9 reg Factor 748 NA NA south NA NA NA NA NA
## # ... with 2 more variables: n_missing <int>, n_unique <int>
##
## [[4]]
## # A tibble: 9 x 13
## variable type nobs mean sd mode min max q25 median q75
## <chr> <chr> <int> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 age Numer… 748 9.16 6.89 0.035 0.035 21.2 1.58 10.5 15.3
## 2 bmi Numer… 748 18.1 3.02 14.54 11.8 31.7 15.9 17.5 19.4
## 3 hc Numer… 748 51.7 5.93 33.7 33.7 65 48.5 53.4 56
## 4 hgt Numer… 748 131. 46.5 50.1 50 198 82.9 145. 175
## 5 tv Numer… 748 8.45 8.11 2 1 25 2 3 15
## 6 wgt Numer… 748 37.2 26.0 3.65 3.14 117. 11.7 34.7 59.6
## 7 gen Factor 748 NA NA G1 NA NA NA NA NA
## 8 phb Factor 748 NA NA P1 NA NA NA NA NA
## 9 reg Factor 748 NA NA south NA NA NA NA NA
## # ... with 2 more variables: n_missing <int>, n_unique <int>
##
## [[5]]
## # A tibble: 9 x 13
## variable type nobs mean sd mode min max q25 median q75
## <chr> <chr> <int> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 age Numer… 748 9.16 6.89 0.035 0.035 21.2 1.58 10.5 15.3
## 2 bmi Numer… 748 18.0 3.03 14.54 11.8 31.7 15.9 17.5 19.5
## 3 hc Numer… 748 51.6 5.91 33.7 33.7 65 48.3 53.2 56
## 4 hgt Numer… 748 131. 46.6 50.1 50 198 83.0 146. 175.
## 5 tv Numer… 748 8.21 8.02 3 1 25 2 3 15
## 6 wgt Numer… 748 37.1 26.0 3.65 3.14 117. 11.7 34.6 59.6
## 7 gen Factor 748 NA NA G1 NA NA NA NA NA
## 8 phb Factor 748 NA NA P1 NA NA NA NA NA
## 9 reg Factor 748 NA NA south NA NA NA NA NA
## # ... with 2 more variables: n_missing <int>, n_unique <int>
##
## [[6]]
## # A tibble: 9 x 13
## variable type nobs mean sd mode min max q25 median q75
## <chr> <chr> <int> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 age Numer… 748 9.16 6.89 0.035 0.035 21.2 1.58 10.5 15.3
## 2 bmi Numer… 748 18.0 3.05 14.54 11.8 31.7 15.9 17.4 19.5
## 3 hc Numer… 748 51.7 5.89 33.7 33.7 65 48.3 53.2 56
## 4 hgt Numer… 748 131. 46.5 50.1 50 198 83.0 146. 175
## 5 tv Numer… 748 8.44 8.24 3 1 25 2 3 15
## 6 wgt Numer… 748 37.1 26.0 3.65 3.14 117. 11.7 34.6 59.6
## 7 gen Factor 748 NA NA G1 NA NA NA NA NA
## 8 phb Factor 748 NA NA P1 NA NA NA NA NA
## 9 reg Factor 748 NA NA south NA NA NA NA NA
## # ... with 2 more variables: n_missing <int>, n_unique <int>
##
## [[7]]
## # A tibble: 9 x 13
## variable type nobs mean sd mode min max q25 median q75
## <chr> <chr> <int> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 age Numer… 748 9.16 6.89 0.035 0.035 21.2 1.58 10.5 15.3
## 2 bmi Numer… 748 18.1 3.04 14.54 11.8 31.7 15.9 17.4 19.5
## 3 hc Numer… 748 51.6 5.88 33.7 33.7 65 48.2 53.2 56
## 4 hgt Numer… 748 131. 46.6 50.1 50 198 83.5 146. 175
## 5 tv Numer… 748 8.47 8.15 2 1 25 2 3 15
## 6 wgt Numer… 748 37.2 26.1 3.65 3.14 117. 11.7 34.6 59.6
## 7 gen Factor 748 NA NA G1 NA NA NA NA NA
## 8 phb Factor 748 NA NA P1 NA NA NA NA NA
## 9 reg Factor 748 NA NA south NA NA NA NA NA
## # ... with 2 more variables: n_missing <int>, n_unique <int>
##
## [[8]]
## # A tibble: 9 x 13
## variable type nobs mean sd mode min max q25 median q75
## <chr> <chr> <int> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 age Numer… 748 9.16 6.89 0.035 0.035 21.2 1.58 10.5 15.3
## 2 bmi Numer… 748 18.0 3.04 14.54 11.8 31.7 15.9 17.4 19.4
## 3 hc Numer… 748 51.6 5.85 33.7 33.7 65 48.2 53.3 56
## 4 hgt Numer… 748 131. 46.5 50.1 50 198 83.0 146. 175
## 5 tv Numer… 748 8.36 8.06 2 1 25 2 3 15
## 6 wgt Numer… 748 37.2 26.1 3.65 3.14 117. 11.7 34.6 59.6
## 7 gen Factor 748 NA NA G1 NA NA NA NA NA
## 8 phb Factor 748 NA NA P1 NA NA NA NA NA
## 9 reg Factor 748 NA NA south NA NA NA NA NA
## # ... with 2 more variables: n_missing <int>, n_unique <int>
##
## [[9]]
## # A tibble: 9 x 13
## variable type nobs mean sd mode min max q25 median q75
## <chr> <chr> <int> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 age Numer… 748 9.16 6.89 0.035 0.035 21.2 1.58 10.5 15.3
## 2 bmi Numer… 748 18.0 3.05 14.54 11.8 31.7 15.9 17.4 19.5
## 3 hc Numer… 748 51.6 5.90 33.7 33.7 65 48.3 53.2 56
## 4 hgt Numer… 748 131. 46.6 50.1 50 198 83.9 146. 175
## 5 tv Numer… 748 8.57 8.25 1 1 25 2 3 15
## 6 wgt Numer… 748 37.1 26.1 3.65 3.14 117. 11.7 34.6 59.6
## 7 gen Factor 748 NA NA G1 NA NA NA NA NA
## 8 phb Factor 748 NA NA P1 NA NA NA NA NA
## 9 reg Factor 748 NA NA south NA NA NA NA NA
## # ... with 2 more variables: n_missing <int>, n_unique <int>
##
## [[10]]
## # A tibble: 9 x 13
## variable type nobs mean sd mode min max q25 median q75
## <chr> <chr> <int> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 age Numer… 748 9.16 6.89 0.035 0.035 21.2 1.58 10.5 15.3
## 2 bmi Numer… 748 18.0 3.04 14.54 11.8 31.7 15.9 17.4 19.5
## 3 hc Numer… 748 51.6 5.89 33.7 33.7 65 48.3 53.1 56
## 4 hgt Numer… 748 131. 46.6 50.1 50 198 83.0 146. 175
## 5 tv Numer… 748 8.49 8.18 2 1 25 2 3 15
## 6 wgt Numer… 748 37.1 26.1 3.65 3.14 117. 11.7 34.6 59.6
## 7 gen Factor 748 NA NA G1 NA NA NA NA NA
## 8 phb Factor 748 NA NA P1 NA NA NA NA NA
## 9 reg Factor 748 NA NA south NA NA NA NA NA
## # ... with 2 more variables: n_missing <int>, n_unique <int>
Before merging this 10 datasets together into one, it would be nice to have a column with the id of the datasets.
This can easily be done with a variant of purrr::map()
, called map2()
:
imp_boys_purrr <- map2(.x = seq(1,10), .y = imp_boys_purrr_complete, ~mutate(.y, imp_id = as.character(.x)))
map2()
applies a function, say f()
, to 2 lists sequentially: f(x_1, y_1)
, then f(x_2, y_2)
, etc…
So here I map mutate()
to create a new column, imp_id
in each dataset. Now let’s bind the rows and
take a look at the data:
imp_boys_purrr <- bind_rows(imp_boys_purrr)
imp_boys_purrr %>%
brotools::describe() %>%
select(variable, type, n_missing, everything())
## # A tibble: 10 x 13
## variable type n_missing nobs mean sd mode min max q25
## <chr> <chr> <int> <int> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 age Numeric 0 7480 9.16 6.89 0.035 0.035 21.2 1.58
## 2 bmi Numeric 0 7480 18.0 3.04 14.54 11.8 31.7 15.9
## 3 hc Numeric 0 7480 51.6 5.89 33.7 33.7 65 48.3
## 4 hgt Numeric 0 7480 131. 46.5 50.1 50 198 83
## 5 tv Numeric 0 7480 8.42 8.11 3 1 25 2
## 6 wgt Numeric 0 7480 37.1 26.0 3.65 3.14 117. 11.7
## 7 imp_id Charact… 0 7480 NA NA 1 NA NA NA
## 8 gen Factor 0 7480 NA NA G1 NA NA NA
## 9 phb Factor 0 7480 NA NA P1 NA NA NA
## 10 reg Factor 0 7480 NA NA south NA NA NA
## # ... with 3 more variables: median <dbl>, q75 <dbl>, n_unique <int>
You may ask yourself why I am bothering with all this. This will become apparent now. We can now use
the code we wrote to get our 10 imputed datasets using purrr::map()
and simply use furrr::future_map()
to parallelize the imputation process:
library(furrr)
## Loading required package: future
plan(multiprocess)
start <- Sys.time()
imp_boys_future <- future_map(rep(1, 10), ~mice(data = boys, m = ., maxit = 100, printFlag = FALSE))
end <- Sys.time() - start
print(end)
## Time difference of 33.73772 secs
Boooom! Much faster! And simply by loading {furrr}
, then using plan(multiprocess)
to run the code in
parallel (if you forget that, the code will run on a single core) and using future_map()
instead of map()
.
Let’s take a look at the data:
imp_boys_future_complete <- map(imp_boys_future, mice::complete)
imp_boys_future <- map2(.x = seq(1,10), .y = imp_boys_future_complete, ~mutate(.y, imp_id = as.character(.x)))
imp_boys_future <- bind_rows(imp_boys_future)
imp_boys_future %>%
brotools::describe() %>%
select(variable, type, n_missing, everything())
## # A tibble: 10 x 13
## variable type n_missing nobs mean sd mode min max q25
## <chr> <chr> <int> <int> <dbl> <dbl> <chr> <dbl> <dbl> <dbl>
## 1 age Numeric 0 7480 9.16 6.89 0.035 0.035 21.2 1.58
## 2 bmi Numeric 0 7480 18.0 3.04 14.54 11.8 31.7 15.9
## 3 hc Numeric 0 7480 51.6 5.89 33.7 33.7 65 48.4
## 4 hgt Numeric 0 7480 131. 46.5 50.1 50 198 83
## 5 tv Numeric 0 7480 8.35 8.09 3 1 25 2
## 6 wgt Numeric 0 7480 37.1 26.0 3.65 3.14 117. 11.7
## 7 imp_id Charact… 0 7480 NA NA 1 NA NA NA
## 8 gen Factor 0 7480 NA NA G1 NA NA NA
## 9 phb Factor 0 7480 NA NA P1 NA NA NA
## 10 reg Factor 0 7480 NA NA south NA NA NA
## # ... with 3 more variables: median <dbl>, q75 <dbl>, n_unique <int>
So imputation went from 3.4 minutes (around 200 seconds) to 30 seconds. How cool is that? If you want to play around
with {furrr}
you must install it from Github, as it is not yet available on CRAN:
devtools::install_github("DavisVaughan/furrr")
If you are not comfortable with map()
(and thus future_map()
) but still want to impute in parallel, there is this
very nice script here to do just that. I created a package around this script,
called parlMICE (the same name as the script), to make installation and
usage easier. You can install it like so:
devtools::install_github("b-rodrigues/parlMICE")
If you found this blog post useful, you might want to follow me on twitter for blog post updates.