Imputing missing values in parallel using {furrr}

programming

Published

April 14, 2018

Today I saw this tweet on my timeline:

For those of us that just can't wait until RStudio officially supports parallel purrr in #rstats, boy have I got something for you.

Introducing furrr, parallel purrr through the use of futures. Go ahead, break things, you know you want to:https://t.co/l9z1UC2Tew
— Davis Vaughan (@dvaughan32) April 13, 2018

and as a heavy {purrr} user, as well as the happy owner of a 6-core AMD Ryzen 5 1600X cpu, I was very excited to try out {furrr}. For those unfamiliar with {purrr}, you can read some of my previous blog posts on it here, here or here.

To summarize very quickly: {purrr} contains so-called higher order functions, which are functions that take other functions as argument. One such function is map(). Consider the following simple example:

numbers <- seq(1, 10)

If you want the square root of this numbers, you can of course simply use the sqrt() function, because it is vectorized:

sqrt(numbers)

##  [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751
##  [8] 2.828427 3.000000 3.162278

But in a lot of situations, the solution is not so simple. Sometimes you have to loop over the values. This is what we would need to do if sqrt() was not vectorized:

sqrt_numbers <- rep(0, 10)

for(i in length(numbers)){
  sqrt_numbers[i] <- sqrt(numbers[i])
}

First, you need to initialize a container, and then you have to populate the sqrt_numbers list with the results. Using, {purrr} is way easier:

library(tidyverse)
map(numbers, sqrt)

## [[1]]
## [1] 1
## 
## [[2]]
## [1] 1.414214
## 
## [[3]]
## [1] 1.732051
## 
## [[4]]
## [1] 2
## 
## [[5]]
## [1] 2.236068
## 
## [[6]]
## [1] 2.44949
## 
## [[7]]
## [1] 2.645751
## 
## [[8]]
## [1] 2.828427
## 
## [[9]]
## [1] 3
## 
## [[10]]
## [1] 3.162278

map() is only one of the nice functions that are bundled inside {purrr}. Mastering {purrr} can really make you a much more efficient R programmer. Anyways, recently, I have been playing around with imputation and the {mice} package. {mice} comes with an example dataset called boys, let’s take a look at it:

library(mice)

data(boys)

brotools::describe(boys) %>%
  select(variable, type, n_missing, everything())

## # A tibble: 9 x 13
##   variable type    n_missing  nobs   mean    sd mode     min   max   q25
##   <chr>    <chr>       <int> <int>  <dbl> <dbl> <chr>  <dbl> <dbl> <dbl>
## 1 age      Numeric         0   748   9.16  6.89 0.035  0.035  21.2  1.58
## 2 bmi      Numeric        21   748  18.1   3.05 14.54 11.8    31.7 15.9 
## 3 hc       Numeric        46   748  51.5   5.91 33.7  33.7    65   48.1 
## 4 hgt      Numeric        20   748 132.   46.5  50.1  50     198   84.9 
## 5 tv       Numeric       522   748  11.9   7.99 <NA>   1      25    4   
## 6 wgt      Numeric         4   748  37.2  26.0  3.65   3.14  117.  11.7 
## 7 gen      Factor        503   748  NA    NA    <NA>  NA      NA   NA   
## 8 phb      Factor        503   748  NA    NA    <NA>  NA      NA   NA   
## 9 reg      Factor          3   748  NA    NA    south NA      NA   NA   
## # ... with 3 more variables: median <dbl>, q75 <dbl>, n_unique <int>

In the code above I use the describe() function from my personal package to get some summary statistics of the boys dataset (you can read more about this function here). I am especially interested in the number of missing values, which is why I re-order the columns. If I did not re-order the columns, it would not appear in the output on my blog.

We see that some columns have a lot of missing values. Using the mice function, it is very easy to impute them:

start <- Sys.time()
imp_boys <- mice(boys, m = 10, maxit = 100, printFlag = FALSE)
end <- Sys.time() - start

print(end)

## Time difference of 3.290611 mins

Imputation on a single core took around 3 minutes on my computer. This might seem ok, but if you have a larger data set with more variables, 3 minutes can become 3 hours. And if you increase maxit, which helps convergence, or the number of imputations, 3 hours can become 30 hours. With a 6-core CPU this could potentially be brought down to 5 hours (in theory). Let’s see if we can go faster, but first let’s take a look at the imputed data.

The mice() function returns a mids object. If you want to look at the data, you have to use the complete() function (careful, there is also a complete() function in the {tidyr} package, so to avoid problems, I suggest you explicitely call mice::complete()):

imp_boys <- mice::complete(imp_boys, "long")

brotools::describe(imp_boys) %>%
  select(variable, type, n_missing, everything())

## # A tibble: 11 x 13
##    variable type   n_missing  nobs   mean     sd mode     min   max    q25
##    <chr>    <chr>      <int> <int>  <dbl>  <dbl> <chr>  <dbl> <dbl>  <dbl>
##  1 .id      Numer…         0  7480 374.   216.   1      1     748   188.  
##  2 .imp     Numer…         0  7480   5.5    2.87 1      1      10     3   
##  3 age      Numer…         0  7480   9.16   6.89 0.035  0.035  21.2   1.58
##  4 bmi      Numer…         0  7480  18.0    3.03 14.54 11.8    31.7  15.9 
##  5 hc       Numer…         0  7480  51.6    5.89 33.7  33.7    65    48.3 
##  6 hgt      Numer…         0  7480 131.    46.5  50.1  50     198    83   
##  7 tv       Numer…         0  7480   8.39   8.09 2      1      25     2   
##  8 wgt      Numer…         0  7480  37.1   26.0  3.65   3.14  117.   11.7 
##  9 gen      Factor         0  7480  NA     NA    G1    NA      NA    NA   
## 10 phb      Factor         0  7480  NA     NA    P1    NA      NA    NA   
## 11 reg      Factor         0  7480  NA     NA    south NA      NA    NA   
## # ... with 3 more variables: median <dbl>, q75 <dbl>, n_unique <int>

As expected, no more missing values. The “long” argument inside mice::complete() is needed if you want the complete() function to return a long dataset. Doing the above “manually” using {purrr} is possible with the following code:

start <- Sys.time()
imp_boys_purrr <- map(rep(1, 10), ~mice(data = boys, m = ., maxit = 100, printFlag = FALSE))
end <- Sys.time() - start

print(end)

## Time difference of 3.393966 mins

What this does is map the function ~mice(data = boys, m = ., maxit = 100, printFlag = FALSE) to a list of 1s, and creates 10 imputed data sets. m = . means that m will be equal to whatever is inside the list we are mapping our function over, so 1, then 1 then another 1 etc…. It took around the same amount of time as using mice() directly.

imp_boys_purrr is now a list of 10 mids objects. We thus need to map mice::complete() to imp_boys_purrr to get the data:

imp_boys_purrr_complete <- map(imp_boys_purrr, mice::complete)

Now, imp_boys_purrr_complete is a list of 10 datasets. Let’s map brotools::describe() to it:

map(imp_boys_purrr_complete, brotools::describe)

## [[1]]
## # A tibble: 9 x 13
##   variable type    nobs   mean    sd mode     min   max   q25 median   q75
##   <chr>    <chr>  <int>  <dbl> <dbl> <chr>  <dbl> <dbl> <dbl>  <dbl> <dbl>
## 1 age      Numer…   748   9.16  6.89 0.035  0.035  21.2  1.58   10.5  15.3
## 2 bmi      Numer…   748  18.0   3.03 14.54 11.8    31.7 15.9    17.4  19.5
## 3 hc       Numer…   748  51.7   5.90 33.7  33.7    65   48.3    53.1  56  
## 4 hgt      Numer…   748 131.   46.5  50.1  50     198   84     146.  175. 
## 5 tv       Numer…   748   8.35  8.00 3      1      25    2       3    15  
## 6 wgt      Numer…   748  37.2  26.0  3.65   3.14  117.  11.7    34.7  59.6
## 7 gen      Factor   748  NA    NA    G1    NA      NA   NA      NA    NA  
## 8 phb      Factor   748  NA    NA    P1    NA      NA   NA      NA    NA  
## 9 reg      Factor   748  NA    NA    south NA      NA   NA      NA    NA  
## # ... with 2 more variables: n_missing <int>, n_unique <int>
## 
## [[2]]
## # A tibble: 9 x 13
##   variable type    nobs   mean    sd mode     min   max   q25 median   q75
##   <chr>    <chr>  <int>  <dbl> <dbl> <chr>  <dbl> <dbl> <dbl>  <dbl> <dbl>
## 1 age      Numer…   748   9.16  6.89 0.035  0.035  21.2  1.58   10.5  15.3
## 2 bmi      Numer…   748  18.0   3.03 14.54 11.8    31.7 15.9    17.5  19.5
## 3 hc       Numer…   748  51.6   5.88 33.7  33.7    65   48.3    53.2  56  
## 4 hgt      Numer…   748 131.   46.6  50.1  50     198   83.5   145.  175  
## 5 tv       Numer…   748   8.37  8.02 1      1      25    2       3    15  
## 6 wgt      Numer…   748  37.1  26.0  3.65   3.14  117.  11.9    34.6  59.6
## 7 gen      Factor   748  NA    NA    G1    NA      NA   NA      NA    NA  
## 8 phb      Factor   748  NA    NA    P2    NA      NA   NA      NA    NA  
## 9 reg      Factor   748  NA    NA    south NA      NA   NA      NA    NA  
## # ... with 2 more variables: n_missing <int>, n_unique <int>
## 
## [[3]]
## # A tibble: 9 x 13
##   variable type    nobs   mean    sd mode     min   max   q25 median   q75
##   <chr>    <chr>  <int>  <dbl> <dbl> <chr>  <dbl> <dbl> <dbl>  <dbl> <dbl>
## 1 age      Numer…   748   9.16  6.89 0.035  0.035  21.2  1.58   10.5  15.3
## 2 bmi      Numer…   748  18.1   3.04 14.54 11.8    31.7 15.9    17.5  19.5
## 3 hc       Numer…   748  51.6   5.87 33.7  33.7    65   48.5    53.3  56  
## 4 hgt      Numer…   748 131.   46.6  50.1  50     198   83.0   145.  175  
## 5 tv       Numer…   748   8.46  8.14 2      1      25    2       3    15  
## 6 wgt      Numer…   748  37.2  26.1  3.65   3.14  117.  11.7    34.6  59.6
## 7 gen      Factor   748  NA    NA    G1    NA      NA   NA      NA    NA  
## 8 phb      Factor   748  NA    NA    P1    NA      NA   NA      NA    NA  
## 9 reg      Factor   748  NA    NA    south NA      NA   NA      NA    NA  
## # ... with 2 more variables: n_missing <int>, n_unique <int>
## 
## [[4]]
## # A tibble: 9 x 13
##   variable type    nobs   mean    sd mode     min   max   q25 median   q75
##   <chr>    <chr>  <int>  <dbl> <dbl> <chr>  <dbl> <dbl> <dbl>  <dbl> <dbl>
## 1 age      Numer…   748   9.16  6.89 0.035  0.035  21.2  1.58   10.5  15.3
## 2 bmi      Numer…   748  18.1   3.02 14.54 11.8    31.7 15.9    17.5  19.4
## 3 hc       Numer…   748  51.7   5.93 33.7  33.7    65   48.5    53.4  56  
## 4 hgt      Numer…   748 131.   46.5  50.1  50     198   82.9   145.  175  
## 5 tv       Numer…   748   8.45  8.11 2      1      25    2       3    15  
## 6 wgt      Numer…   748  37.2  26.0  3.65   3.14  117.  11.7    34.7  59.6
## 7 gen      Factor   748  NA    NA    G1    NA      NA   NA      NA    NA  
## 8 phb      Factor   748  NA    NA    P1    NA      NA   NA      NA    NA  
## 9 reg      Factor   748  NA    NA    south NA      NA   NA      NA    NA  
## # ... with 2 more variables: n_missing <int>, n_unique <int>
## 
## [[5]]
## # A tibble: 9 x 13
##   variable type    nobs   mean    sd mode     min   max   q25 median   q75
##   <chr>    <chr>  <int>  <dbl> <dbl> <chr>  <dbl> <dbl> <dbl>  <dbl> <dbl>
## 1 age      Numer…   748   9.16  6.89 0.035  0.035  21.2  1.58   10.5  15.3
## 2 bmi      Numer…   748  18.0   3.03 14.54 11.8    31.7 15.9    17.5  19.5
## 3 hc       Numer…   748  51.6   5.91 33.7  33.7    65   48.3    53.2  56  
## 4 hgt      Numer…   748 131.   46.6  50.1  50     198   83.0   146.  175. 
## 5 tv       Numer…   748   8.21  8.02 3      1      25    2       3    15  
## 6 wgt      Numer…   748  37.1  26.0  3.65   3.14  117.  11.7    34.6  59.6
## 7 gen      Factor   748  NA    NA    G1    NA      NA   NA      NA    NA  
## 8 phb      Factor   748  NA    NA    P1    NA      NA   NA      NA    NA  
## 9 reg      Factor   748  NA    NA    south NA      NA   NA      NA    NA  
## # ... with 2 more variables: n_missing <int>, n_unique <int>
## 
## [[6]]
## # A tibble: 9 x 13
##   variable type    nobs   mean    sd mode     min   max   q25 median   q75
##   <chr>    <chr>  <int>  <dbl> <dbl> <chr>  <dbl> <dbl> <dbl>  <dbl> <dbl>
## 1 age      Numer…   748   9.16  6.89 0.035  0.035  21.2  1.58   10.5  15.3
## 2 bmi      Numer…   748  18.0   3.05 14.54 11.8    31.7 15.9    17.4  19.5
## 3 hc       Numer…   748  51.7   5.89 33.7  33.7    65   48.3    53.2  56  
## 4 hgt      Numer…   748 131.   46.5  50.1  50     198   83.0   146.  175  
## 5 tv       Numer…   748   8.44  8.24 3      1      25    2       3    15  
## 6 wgt      Numer…   748  37.1  26.0  3.65   3.14  117.  11.7    34.6  59.6
## 7 gen      Factor   748  NA    NA    G1    NA      NA   NA      NA    NA  
## 8 phb      Factor   748  NA    NA    P1    NA      NA   NA      NA    NA  
## 9 reg      Factor   748  NA    NA    south NA      NA   NA      NA    NA  
## # ... with 2 more variables: n_missing <int>, n_unique <int>
## 
## [[7]]
## # A tibble: 9 x 13
##   variable type    nobs   mean    sd mode     min   max   q25 median   q75
##   <chr>    <chr>  <int>  <dbl> <dbl> <chr>  <dbl> <dbl> <dbl>  <dbl> <dbl>
## 1 age      Numer…   748   9.16  6.89 0.035  0.035  21.2  1.58   10.5  15.3
## 2 bmi      Numer…   748  18.1   3.04 14.54 11.8    31.7 15.9    17.4  19.5
## 3 hc       Numer…   748  51.6   5.88 33.7  33.7    65   48.2    53.2  56  
## 4 hgt      Numer…   748 131.   46.6  50.1  50     198   83.5   146.  175  
## 5 tv       Numer…   748   8.47  8.15 2      1      25    2       3    15  
## 6 wgt      Numer…   748  37.2  26.1  3.65   3.14  117.  11.7    34.6  59.6
## 7 gen      Factor   748  NA    NA    G1    NA      NA   NA      NA    NA  
## 8 phb      Factor   748  NA    NA    P1    NA      NA   NA      NA    NA  
## 9 reg      Factor   748  NA    NA    south NA      NA   NA      NA    NA  
## # ... with 2 more variables: n_missing <int>, n_unique <int>
## 
## [[8]]
## # A tibble: 9 x 13
##   variable type    nobs   mean    sd mode     min   max   q25 median   q75
##   <chr>    <chr>  <int>  <dbl> <dbl> <chr>  <dbl> <dbl> <dbl>  <dbl> <dbl>
## 1 age      Numer…   748   9.16  6.89 0.035  0.035  21.2  1.58   10.5  15.3
## 2 bmi      Numer…   748  18.0   3.04 14.54 11.8    31.7 15.9    17.4  19.4
## 3 hc       Numer…   748  51.6   5.85 33.7  33.7    65   48.2    53.3  56  
## 4 hgt      Numer…   748 131.   46.5  50.1  50     198   83.0   146.  175  
## 5 tv       Numer…   748   8.36  8.06 2      1      25    2       3    15  
## 6 wgt      Numer…   748  37.2  26.1  3.65   3.14  117.  11.7    34.6  59.6
## 7 gen      Factor   748  NA    NA    G1    NA      NA   NA      NA    NA  
## 8 phb      Factor   748  NA    NA    P1    NA      NA   NA      NA    NA  
## 9 reg      Factor   748  NA    NA    south NA      NA   NA      NA    NA  
## # ... with 2 more variables: n_missing <int>, n_unique <int>
## 
## [[9]]
## # A tibble: 9 x 13
##   variable type    nobs   mean    sd mode     min   max   q25 median   q75
##   <chr>    <chr>  <int>  <dbl> <dbl> <chr>  <dbl> <dbl> <dbl>  <dbl> <dbl>
## 1 age      Numer…   748   9.16  6.89 0.035  0.035  21.2  1.58   10.5  15.3
## 2 bmi      Numer…   748  18.0   3.05 14.54 11.8    31.7 15.9    17.4  19.5
## 3 hc       Numer…   748  51.6   5.90 33.7  33.7    65   48.3    53.2  56  
## 4 hgt      Numer…   748 131.   46.6  50.1  50     198   83.9   146.  175  
## 5 tv       Numer…   748   8.57  8.25 1      1      25    2       3    15  
## 6 wgt      Numer…   748  37.1  26.1  3.65   3.14  117.  11.7    34.6  59.6
## 7 gen      Factor   748  NA    NA    G1    NA      NA   NA      NA    NA  
## 8 phb      Factor   748  NA    NA    P1    NA      NA   NA      NA    NA  
## 9 reg      Factor   748  NA    NA    south NA      NA   NA      NA    NA  
## # ... with 2 more variables: n_missing <int>, n_unique <int>
## 
## [[10]]
## # A tibble: 9 x 13
##   variable type    nobs   mean    sd mode     min   max   q25 median   q75
##   <chr>    <chr>  <int>  <dbl> <dbl> <chr>  <dbl> <dbl> <dbl>  <dbl> <dbl>
## 1 age      Numer…   748   9.16  6.89 0.035  0.035  21.2  1.58   10.5  15.3
## 2 bmi      Numer…   748  18.0   3.04 14.54 11.8    31.7 15.9    17.4  19.5
## 3 hc       Numer…   748  51.6   5.89 33.7  33.7    65   48.3    53.1  56  
## 4 hgt      Numer…   748 131.   46.6  50.1  50     198   83.0   146.  175  
## 5 tv       Numer…   748   8.49  8.18 2      1      25    2       3    15  
## 6 wgt      Numer…   748  37.1  26.1  3.65   3.14  117.  11.7    34.6  59.6
## 7 gen      Factor   748  NA    NA    G1    NA      NA   NA      NA    NA  
## 8 phb      Factor   748  NA    NA    P1    NA      NA   NA      NA    NA  
## 9 reg      Factor   748  NA    NA    south NA      NA   NA      NA    NA  
## # ... with 2 more variables: n_missing <int>, n_unique <int>

Before merging this 10 datasets together into one, it would be nice to have a column with the id of the datasets. This can easily be done with a variant of purrr::map(), called map2():

imp_boys_purrr <- map2(.x = seq(1,10), .y = imp_boys_purrr_complete, ~mutate(.y, imp_id = as.character(.x)))

map2() applies a function, say f(), to 2 lists sequentially: f(x_1, y_1), then f(x_2, y_2), etc… So here I map mutate() to create a new column, imp_id in each dataset. Now let’s bind the rows and take a look at the data:

imp_boys_purrr <- bind_rows(imp_boys_purrr)

imp_boys_purrr %>%
  brotools::describe() %>%
  select(variable, type, n_missing, everything())

## # A tibble: 10 x 13
##    variable type     n_missing  nobs   mean    sd mode     min   max   q25
##    <chr>    <chr>        <int> <int>  <dbl> <dbl> <chr>  <dbl> <dbl> <dbl>
##  1 age      Numeric          0  7480   9.16  6.89 0.035  0.035  21.2  1.58
##  2 bmi      Numeric          0  7480  18.0   3.04 14.54 11.8    31.7 15.9 
##  3 hc       Numeric          0  7480  51.6   5.89 33.7  33.7    65   48.3 
##  4 hgt      Numeric          0  7480 131.   46.5  50.1  50     198   83   
##  5 tv       Numeric          0  7480   8.42  8.11 3      1      25    2   
##  6 wgt      Numeric          0  7480  37.1  26.0  3.65   3.14  117.  11.7 
##  7 imp_id   Charact…         0  7480  NA    NA    1     NA      NA   NA   
##  8 gen      Factor           0  7480  NA    NA    G1    NA      NA   NA   
##  9 phb      Factor           0  7480  NA    NA    P1    NA      NA   NA   
## 10 reg      Factor           0  7480  NA    NA    south NA      NA   NA   
## # ... with 3 more variables: median <dbl>, q75 <dbl>, n_unique <int>

You may ask yourself why I am bothering with all this. This will become apparent now. We can now use the code we wrote to get our 10 imputed datasets using purrr::map() and simply use furrr::future_map() to parallelize the imputation process:

library(furrr)

## Loading required package: future

plan(multiprocess)

start <- Sys.time()
imp_boys_future <- future_map(rep(1, 10), ~mice(data = boys, m = ., maxit = 100, printFlag = FALSE))
end <- Sys.time() - start

print(end)

## Time difference of 33.73772 secs

Boooom! Much faster! And simply by loading {furrr}, then using plan(multiprocess) to run the code in parallel (if you forget that, the code will run on a single core) and using future_map() instead of map().

Let’s take a look at the data:

imp_boys_future_complete <- map(imp_boys_future, mice::complete)

imp_boys_future <- map2(.x = seq(1,10), .y = imp_boys_future_complete, ~mutate(.y, imp_id = as.character(.x)))

imp_boys_future <- bind_rows(imp_boys_future)

imp_boys_future %>%
  brotools::describe() %>%
  select(variable, type, n_missing, everything())

## # A tibble: 10 x 13
##    variable type     n_missing  nobs   mean    sd mode     min   max   q25
##    <chr>    <chr>        <int> <int>  <dbl> <dbl> <chr>  <dbl> <dbl> <dbl>
##  1 age      Numeric          0  7480   9.16  6.89 0.035  0.035  21.2  1.58
##  2 bmi      Numeric          0  7480  18.0   3.04 14.54 11.8    31.7 15.9 
##  3 hc       Numeric          0  7480  51.6   5.89 33.7  33.7    65   48.4 
##  4 hgt      Numeric          0  7480 131.   46.5  50.1  50     198   83   
##  5 tv       Numeric          0  7480   8.35  8.09 3      1      25    2   
##  6 wgt      Numeric          0  7480  37.1  26.0  3.65   3.14  117.  11.7 
##  7 imp_id   Charact…         0  7480  NA    NA    1     NA      NA   NA   
##  8 gen      Factor           0  7480  NA    NA    G1    NA      NA   NA   
##  9 phb      Factor           0  7480  NA    NA    P1    NA      NA   NA   
## 10 reg      Factor           0  7480  NA    NA    south NA      NA   NA   
## # ... with 3 more variables: median <dbl>, q75 <dbl>, n_unique <int>

So imputation went from 3.4 minutes (around 200 seconds) to 30 seconds. How cool is that? If you want to play around with {furrr} you must install it from Github, as it is not yet available on CRAN:

devtools::install_github("DavisVaughan/furrr")

If you are not comfortable with map() (and thus future_map()) but still want to impute in parallel, there is this very nice script here to do just that. I created a package around this script, called parlMICE (the same name as the script), to make installation and usage easier. You can install it like so:

devtools::install_github("b-rodrigues/parlMICE")