Econometrics and Free Software by Bruno Rodrigues.
RSS feed for blog post updates.
Follow me on Mastodon, twitter, or check out my Github.
Check out my package that adds logging to R functions, {chronicler}.
Or read my free ebooks, to learn some R and build reproducible analytical pipelines..
You can also watch my youtube channel or find the slides to the talks I've given here.
Buy me a coffee, my kids don't let me sleep.

Machine learning with {tidymodels}

R

Intro: what is {tidymodels}

I have already written about {tidymodels} in the past but since then, the {tidymodels} meta-package has evolved quite a lot. If you don’t know what {tidymodels} is, it is a suite of packages that make machine learning with R a breeze. R has many packages for machine learning, each with their own syntax and function arguments. {tidymodels} aims at providing an unified interface which allows data scientists to focus on the problem they’re trying to solve, instead of wasting time with learning package specificities.

The packages included in {tidymodels} are:

  • {parsnip} for model definition
  • {recipes} for data preprocessing and feature engineering
  • {rsample} to resample data (useful for cross-validation)
  • {yardstick} to evaluate model performance
  • {dials} to define tuning parameters of your models
  • {tune} for model tuning
  • {workflows} which allows you to bundle everything together and train models easily

There are some others, but I will not cover these. This is a lot of packages, and you might be worried of getting lost; however, in practice I noticed that loading {tidymodels} and then using the functions I needed was good enough. Only rarely did I need to know from which package a certain function came, and the more you use these, the better you know them, obviously. Before continuing, one final and important note: these packages are still in heavy development, so you might not want to use them in production yet. I don’t know how likely it is that the api still evolves, but my guess is that it is likely. However, even though it might be a bit early to use these packages for production code, I think it is important to learn about them as soon as possible and see what is possible with them.

As I will show you, these packages do make the process of training machine learning models a breeze, and of course they integrate very well with the rest of the {tidyverse} packages. The problem we’re going to tackle is to understand which variables play an important role in the probability of someone looking for a job. I’ll use Eustat’s microdata, which I already discussed in my previous blog post. The dataset can be downloaded from here, and is called Population with relation to activity (PRA).

The problem at hand

The dataset contains information on residents from the Basque country, and focuses on their labour supply. Thus, we have information on how many hours people work a week, if they work, in which industry, what is their educational attainment and whether they’re looking for a job. The first step, as usual, is to load the data and required packages:

library(tidyverse)
library(tidymodels)
library(readxl)
library(naniar)
library(janitor)
library(furrr)

list_data <- Sys.glob("~/Documents/b-rodrigues.github.com/content/blog/MICRO*.csv")

dataset <- map(list_data, read_csv2) %>%
  bind_rows()

dictionary <- read_xlsx("~/Documents/b-rodrigues.github.com/content/blog/Microdatos_PRA_2019/diseño_registro_microdatos_pra.xlsx", sheet="Valores",
                        col_names = FALSE)

col_names <- dictionary %>%
  filter(!is.na(...1)) %>%
  dplyr::select(1:2)

english <- readRDS("~/Documents/b-rodrigues.github.com/content/blog/english_col_names.rds")

col_names$english <- english

colnames(dataset) <- col_names$english

dataset <- janitor::clean_names(dataset)

Let’s take a look at the data:

head(dataset)
## # A tibble: 6 x 33
##   household_number survey_year reference_quart… territory capital   sex
##              <dbl>       <dbl>            <dbl> <chr>       <dbl> <dbl>
## 1                1        2019                1 48              9     6
## 2                1        2019                1 48              9     1
## 3                2        2019                1 48              1     1
## 4                2        2019                1 48              1     6
## 5                2        2019                1 48              1     6
## 6                2        2019                1 48              1     1
## # … with 27 more variables: place_of_birth <dbl>, age <chr>, nationality <dbl>,
## #   level_of_studies_completed <dbl>, ruled_teaching_system <chr>,
## #   occupational_training <chr>, retirement_situation <dbl>,
## #   homework_situation <dbl>, part_time_employment <dbl>,
## #   short_time_cause <dbl>, job_search <chr>, search_reasons <dbl>,
## #   day_searched <dbl>, make_arrangements <chr>, search_form <chr>,
## #   search_months <dbl>, availability <chr>,
## #   relationship_with_the_activity <dbl>,
## #   relationship_with_the_activity_2 <chr>, main_occupation <dbl>,
## #   main_activity <chr>, main_professional_situation <dbl>,
## #   main_institutional_sector <dbl>, type_of_contract <dbl>, hours <dbl>,
## #   relationship <dbl>, elevator <dbl>

There are many columns, most of them are categorical variables and unfortunately the levels in the data are only some non-explicit codes. The excel file I have loaded, which I called dictionary contains the codes and their explanation. I kept the file opened while I was working, especially for missing values imputation. Indeed, there are missing values in the data, and one should always try to understand why before blindly imputing them. Indeed, there might be a very good reason why data might be missing for a particular column. For instance, if children are also surveyed, they would have an NA in the, say, main_occupation column which gives the main occupation of the surveyed person. This might seem very obvious, but sometimes these reasons are not so obvious at all. You should always go back with such questions to the data owners/producers, because if not, you will certainly miss something very important. Anyway, the way I tackled this issue was by looking at the variables with missing data and checking two-way tables with other variables. For instance, to go back to my example from before, I would take a look at the two-way frequency table between age and main_occupation. If all the missing values from main_occupation where only for people 16 or younger, then it would be quite safe to assume that I was right, and I could recode these NAs in main_occupation to "without occupation" for instance. I’ll spare you all this exploration, and go straight to the data cleaning:

dataset <- dataset %>%
  mutate(main_occupation2 = ifelse(is.na(main_occupation),
                                   "without_occupation",
                                   main_occupation))

dataset <- dataset %>%
  mutate(main_professional_situation2 = ifelse(is.na(main_professional_situation),
                                               "without_occupation",
                                               main_professional_situation))

# People with missing hours are actually not working, so I put them to 0
dataset <- dataset %>%
  mutate(hours = ifelse(is.na(hours), 0, hours))

# Short time gives the reason why people are working less hours than specified in their contract
dataset <- dataset %>%
  mutate(short_time_cause = ifelse(hours == 0 | is.na(short_time_cause), 
                                   "without_occupation",
                                   short_time_cause))

dataset <- dataset %>%
  mutate(type_of_contract = ifelse(is.na(type_of_contract),
                                   "other_contract",
                                   type_of_contract))

Let’s now apply some further cleaning:

pra <- dataset %>%
  filter(age %in% c("04", "05", "06", "07", "08", "09", "10", "11", "12", "13")) %>%
  filter(retirement_situation == 4) %>%    
  filter(!is.na(job_search)) %>%  
  select(capital, sex, place_of_birth, age, nationality, level_of_studies_completed,
         occupational_training, job_search, main_occupation2, type_of_contract,
         hours, short_time_cause, homework_situation,
         main_professional_situation2) %>%
  mutate_at(.vars = vars(-hours), .funs=as.character) %>%
  mutate(job_search = as.factor(job_search))

I only keep people that are not retired and of ages where they could work. I remove rows where job_search, the target, is missing, mutate all variables but hours to character and job_search to factor. At first, I made every categorical column a factor but I got problems for certain models. I think the issue came from the recipe that I defined (I’ll talk about it below), but the problem was resolved if categorical variables were defined as character variables. However, for certain models, the target (I think it was xgboost) needs to be a factor variable for classification problems.

Let’s take a look at the data and check if any more data is missing:

str(pra)
## Classes 'spec_tbl_df', 'tbl_df', 'tbl' and 'data.frame': 29083 obs. of  14 variables:
##  $ capital                     : chr  "9" "9" "1" "1" ...
##  $ sex                         : chr  "6" "1" "1" "6" ...
##  $ place_of_birth              : chr  "1" "1" "1" "1" ...
##  $ age                         : chr  "09" "09" "11" "10" ...
##  $ nationality                 : chr  "1" "1" "1" "1" ...
##  $ level_of_studies_completed  : chr  "1" "2" "3" "3" ...
##  $ occupational_training       : chr  "N" "N" "N" "N" ...
##  $ job_search                  : Factor w/ 2 levels "N","S": 1 1 1 1 1 1 1 1 1 1 ...
##  $ main_occupation2            : chr  "5" "7" "3" "2" ...
##  $ type_of_contract            : chr  "1" "other_contract" "other_contract" "1" ...
##  $ hours                       : num  36 40 40 40 0 0 22 38 40 0 ...
##  $ short_time_cause            : chr  "2" "2" "2" "2" ...
##  $ homework_situation          : chr  "1" "2" "2" "2" ...
##  $ main_professional_situation2: chr  "4" "2" "3" "4" ...
vis_miss(pra)

The final dataset contains 29083 observations. Look’s like we’re good to go.

Setting up the training: resampling

In order to properly train a model, one needs to split the data into two: a part for trying out models with different configuration of hyper-parameters, and another part for final evaluation of the model. This is achieved with rsample::initial_split():

pra_split <- initial_split(pra, prop = 0.9)

pra_split now contains a training set and a testing set. We can get these by using the rsample::training() and rsample::testing() functions:

pra_train <- training(pra_split)
pra_test <- testing(pra_split)

We can’t stop here though. First we need to split the training set further, in order to perform cross validation. Cross validation will allow us to select the best model; by best I mean a model that has a good hyper-parameter configuration, enabling the model to generalize well to unseen data. I do this by creating 10 splits from the training data (I won’t touch the testing data up until the very end. This testing data is thus sometimes called the holdout set as well):

pra_cv_splits <- vfold_cv(pra_train, v = 10)

Let’s take a look at this object:

pra_cv_splits
## #  10-fold cross-validation 
## # A tibble: 10 x 2
##    splits               id    
##    <named list>         <chr> 
##  1 <split [23.6K/2.6K]> Fold01
##  2 <split [23.6K/2.6K]> Fold02
##  3 <split [23.6K/2.6K]> Fold03
##  4 <split [23.6K/2.6K]> Fold04
##  5 <split [23.6K/2.6K]> Fold05
##  6 <split [23.6K/2.6K]> Fold06
##  7 <split [23.6K/2.6K]> Fold07
##  8 <split [23.6K/2.6K]> Fold08
##  9 <split [23.6K/2.6K]> Fold09
## 10 <split [23.6K/2.6K]> Fold10

Preprocessing the data

I have already pre-processed the missing values in the dataset, so there is not much more that I can do. I will simply create dummy variables out of the categorical variables using step_dummy():

preprocess <- recipe(job_search ~ ., data = pra) %>%
  step_dummy(all_predictors())

preprocess is a recipe that defines the transformations that must be applied to the training data before fitting. In this case there is only one step; transforming all the predictors into dummies (hours is a numeric variable and will be ignored by this step). The recipe also defines the formula that will be fitted by the models, job_search ~ ., and takes data as a further argument. This is only to give the data frame specification to recipe(): it could even be an empty data frame with the right column names and types. This is why I give it the original data pra and not the training set pra_train. Because this recipe is very simple, it could be applied to the original raw data pra and then I could do the split into training and testing set, as well as further splitting the training set into 10 cross-validation sets. However, this is not the recommended way of applying pre-processing steps. Pre-processing needs to happen inside the cross-validation loop, not outside of it. Why? Suppose that you are normalizing a numeric variable, meaning, substracting its mean from it and dividing by its standard deviation. If you do this operation outside of cross-validation, and even worse, before splitting the data into training and testing set, you will be leaking information from the testing set into the training set. The mean will contain information from the testing set, which will be picked up by the model. It is much better and “realistic” to first split the data and then apply the pre-processing (remember that hiding the test set from the model is supposed to simulate the fact that new, completely unseen data, is thrown at your model once it’s put into production). The same logic applies to cross-validation splits; each split contains now also a training and a testing set (which I will be calling analysis and assessment sets, following {tidymodels}’s author, Max Kuhn) and thus the pre-processing needs to be applied inside the cross-validation loop, meaning that the analysis set will be processed on the fly.

Model definition

We come now to the very interesting part: model definition. With {parsnip}, another {tidymodels} package, defining models is always the same, regardless of the underlying package doing the heavy lifting. For instance, to define a logistic regression one would simply write:

# logistic regression 
logit_tune_pra <- logistic_reg() %>%
  set_engine("glm")

This defines a standard logistic regression, powered by the glm() engine or function. The way to do this in vanilla R would be :

glm(y ~ ., data = mydata, family = "binomial")

The difference here is that the formula is contained in the glm() function; in our case it is contained in the recipe, which is why I don’t repeat it in the model definition above. You might wonder what the added value of using {tidymodels} for this is. Well, suppose now that I would like to run a logistic regression but with regularization. I would use {glmnet} for this but would need to know the specific syntax of glmnet() which, as you will see, is very different than the one for glm():

  glmnet(x_vars[train,], y_var[train], alpha = 1, lambda = 1.6)

glmnet(), unlike glm(), does not use a formula as an input, but two matrices, one for the design matrix, and another for the target variable. Using {parsnip}, however, I simply need to change the engine from "glm" to "glmnet":

# logistic regression 
logit_tune_pra <- logistic_reg() %>%
  set_engine("glmnet")

This makes things much simpler as now users only need to learn how to use {parsnip}. However, it is of course still important to read the documentation of the original packages, because it is were hyper-parameters are discussed. Another advantage of {parsnip} is that the same words are used to speak of the same hyper-parameters . For instance for tree-based methods, the number of trees is sometimes ntree then in another package num_trees, and is again different in yet another package. In {parsnip}’s interface for tree-based methods, this parameter is simply called tree. Users can fix the value of hyper-parameters directly by passing values to, say, tree (as in "tree" = 200), or they can tune these hyper-parameters. To do so, one needs to tag them, like so:

# logistic regression 
logit_tune_pra <- logistic_reg(penalty = tune(), mixture = tune()) %>%
  set_engine("glmnet")

This defines logit_tune_pra with 2 hyper-parameters that must be tuned using cross-validation, the penalty and the amount of mixture between penalties (this is for elasticnet regularization).

Now, I will define 5 different models, with different hyper-parameters to tune, and I will also define a grid of hyper-parameters of size 10 for each model. This means that I will train these 5 models 10 times, each time with a different hyper-parameter configuration. To define the grid, I use the grid_max_entropy() function from the {dials} package. This creates a grid with points that are randomly drawn from the parameter space in a way that ensures that the combination we get covers the whole space, or at least are not too far away from any portion of the space. Of course, the more configuration you try, the better, but the longer the training will run.

# Logistic regression
logit_tune_pra <- logistic_reg(penalty = tune(), mixture = tune()) %>%
  set_engine("glmnet")

# Hyperparameter grid
logit_grid <- logit_tune_pra %>%
  parameters() %>%
  grid_max_entropy(size = 10)

# Workflow bundling every step 
logit_wflow <- workflow() %>%
  add_recipe(preprocess) %>%
  add_model(logit_tune_pra)

# random forest
rf_tune_pra <- rand_forest(mtry = tune(), trees = tune()) %>%
  set_engine("ranger") %>%
  set_mode("classification")

rf_grid <- rf_tune_pra %>%
  parameters() %>%
  finalize(select(pra, -job_search)) %>%  
  grid_max_entropy(size = 10)

rf_wflow <- workflow() %>%
  add_recipe(preprocess) %>%
  add_model(rf_tune_pra)

# mars model
mars_tune_pra <- mars(num_terms = tune(), prod_degree = 2, prune_method = tune()) %>%
  set_engine("earth") %>%
  set_mode("classification")

mars_grid <- mars_tune_pra %>%
  parameters() %>%
  grid_max_entropy(size = 10)

mars_wflow <- workflow() %>%
  add_recipe(preprocess) %>%
  add_model(mars_tune_pra)

#boosted trees
boost_tune_pra <- boost_tree(mtry = tune(), tree = tune(),
                             learn_rate = tune(), tree_depth = tune()) %>%
  set_engine("xgboost") %>%
  set_mode("classification")

boost_grid <- boost_tune_pra %>%
  parameters() %>%
  finalize(select(pra, -job_search)) %>%  
  grid_max_entropy(size = 10)

boost_wflow <- workflow() %>%
  add_recipe(preprocess) %>%
  add_model(boost_tune_pra)

#neural nets
keras_tune_pra <- mlp(hidden_units = tune(), penalty = tune(), activation = "relu") %>%
  set_engine("keras") %>%
  set_mode("classification")

keras_grid <- keras_tune_pra %>%
  parameters() %>%
  grid_max_entropy(size = 10)

keras_wflow <- workflow() %>%
  add_recipe(preprocess) %>%
  add_model(keras_tune_pra)

For each model, I defined three objects; the model itself, for instance keras_tune_pra, then a grid of hyper-parameters, and finally a workflow. To define the grid, I need to extract the parameters to tune using the parameters() function, and for tree based methods, I also need to use finalize() to set the mtry parameter. This is because mtry depends on the dimensions of the data (the value of mtry cannot be larger than the number of features), so I need to pass on this information to…well, finalize the grid. Then I can choose the size of the grid and how I want to create it (randomly, or using max entropy, or regularly spaced…). A workflow bundles the pre-processing and the model definition together, and makes fitting the model very easy. Workflows make it easy to run the pre-processing inside the cross-validation loop. Workflow objects can be passed to the fitting function, as we shall see in the next section.

Fitting models with {tidymodels}

Fitting one model with {tidymodels} is quite easy:

fitted_model <- fit(model_formula, data = data_train)

and that’s it. If you define a workflow, which bundles pre-processing and model definition in one package, you need to pass it to fit() as well:

fitted_wflow <- fit(model_wflow, data = data_train)

However, a single call to fit does not perform cross-validation. This simply trains the model on the training data, and that’s it. To perform cross validation, you can use either fit_resamples():

fitted_resamples <- fit_resamples(model_wflow,
                               resamples = my_cv_splits,
                               control = control_resamples(save_pred = TRUE))

or tune_grid():

tuned_model <- tune_grid(model_wflow,
                         resamples = my_cv_splits,
                         grid = my_grid,
                         control = control_resamples(save_pred = TRUE))

As you probably guessed it, fit_resamples() does not perform tuning; it simply fits a model specification (without varying hyper-parameters) to all the analysis sets contained in the my_cv_splits object (which contains the resampled training data for cross-validation), while tune_grid() does the same, but allows for varying hyper-parameters.

We thus are going to use tune_grid() to fit our models and perform hyper-paramater tuning. However, since I have 5 models and 5 grids, I’ll be using map2() for this. If you’re not familiar with map2(), here’s a quick example:

map2(c(1, 1, 1), c(2,2,2), `+`)
## [[1]]
## [1] 3
## 
## [[2]]
## [1] 3
## 
## [[3]]
## [1] 3

map2() maps the +() function to each element of both vectors successively. I’m going to use this to map the tune_grid() function to a list of models and a list of grids. But because this is going to take some time to run, and because I have an AMD Ryzen 5 1600X processor with 6 physical cores and 12 logical cores, I’ll by running this in parallel using furrr::future_map2().

furrr::future_map2() will run one model per core, and the way to do it is to simply define how many cores I want to use, then replace map2() in my code by future_map2():

wflow_list <- list(logit_wflow, rf_wflow, mars_wflow, boost_wflow, keras_wflow)
grid_list <- list(logit_grid, rf_grid, mars_grid, boost_grid, keras_grid)

plan(multiprocess, workers = 6)

trained_models_list <- future_map2(.x = wflow_list,
                                   .y = grid_list,
                                   ~tune_grid(.x , resamples = pra_cv_splits, grid = .y))

Running this code took almost 3 hours. In the end, here is the result:

trained_models_list
## [[1]]
## #  10-fold cross-validation 
## # A tibble: 10 x 4
##    splits               id     .metrics          .notes          
##  * <list>               <chr>  <list>            <list>          
##  1 <split [23.6K/2.6K]> Fold01 <tibble [20 × 5]> <tibble [1 × 1]>
##  2 <split [23.6K/2.6K]> Fold02 <tibble [20 × 5]> <tibble [1 × 1]>
##  3 <split [23.6K/2.6K]> Fold03 <tibble [20 × 5]> <tibble [1 × 1]>
##  4 <split [23.6K/2.6K]> Fold04 <tibble [20 × 5]> <tibble [1 × 1]>
##  5 <split [23.6K/2.6K]> Fold05 <tibble [20 × 5]> <tibble [1 × 1]>
##  6 <split [23.6K/2.6K]> Fold06 <tibble [20 × 5]> <tibble [1 × 1]>
##  7 <split [23.6K/2.6K]> Fold07 <tibble [20 × 5]> <tibble [1 × 1]>
##  8 <split [23.6K/2.6K]> Fold08 <tibble [20 × 5]> <tibble [1 × 1]>
##  9 <split [23.6K/2.6K]> Fold09 <tibble [20 × 5]> <tibble [1 × 1]>
## 10 <split [23.6K/2.6K]> Fold10 <tibble [20 × 5]> <tibble [1 × 1]>
## 
## [[2]]
## #  10-fold cross-validation 
## # A tibble: 10 x 4
##    splits               id     .metrics          .notes          
##  * <list>               <chr>  <list>            <list>          
##  1 <split [23.6K/2.6K]> Fold01 <tibble [20 × 5]> <tibble [1 × 1]>
##  2 <split [23.6K/2.6K]> Fold02 <tibble [20 × 5]> <tibble [1 × 1]>
##  3 <split [23.6K/2.6K]> Fold03 <tibble [20 × 5]> <tibble [1 × 1]>
##  4 <split [23.6K/2.6K]> Fold04 <tibble [20 × 5]> <tibble [1 × 1]>
##  5 <split [23.6K/2.6K]> Fold05 <tibble [20 × 5]> <tibble [1 × 1]>
##  6 <split [23.6K/2.6K]> Fold06 <tibble [20 × 5]> <tibble [1 × 1]>
##  7 <split [23.6K/2.6K]> Fold07 <tibble [20 × 5]> <tibble [1 × 1]>
##  8 <split [23.6K/2.6K]> Fold08 <tibble [20 × 5]> <tibble [1 × 1]>
##  9 <split [23.6K/2.6K]> Fold09 <tibble [20 × 5]> <tibble [1 × 1]>
## 10 <split [23.6K/2.6K]> Fold10 <tibble [20 × 5]> <tibble [1 × 1]>
## 
## [[3]]
## #  10-fold cross-validation 
## # A tibble: 10 x 4
##    splits               id     .metrics          .notes          
##  * <list>               <chr>  <list>            <list>          
##  1 <split [23.6K/2.6K]> Fold01 <tibble [20 × 5]> <tibble [1 × 1]>
##  2 <split [23.6K/2.6K]> Fold02 <tibble [20 × 5]> <tibble [1 × 1]>
##  3 <split [23.6K/2.6K]> Fold03 <tibble [20 × 5]> <tibble [1 × 1]>
##  4 <split [23.6K/2.6K]> Fold04 <tibble [20 × 5]> <tibble [1 × 1]>
##  5 <split [23.6K/2.6K]> Fold05 <tibble [20 × 5]> <tibble [1 × 1]>
##  6 <split [23.6K/2.6K]> Fold06 <tibble [20 × 5]> <tibble [1 × 1]>
##  7 <split [23.6K/2.6K]> Fold07 <tibble [20 × 5]> <tibble [1 × 1]>
##  8 <split [23.6K/2.6K]> Fold08 <tibble [20 × 5]> <tibble [1 × 1]>
##  9 <split [23.6K/2.6K]> Fold09 <tibble [20 × 5]> <tibble [1 × 1]>
## 10 <split [23.6K/2.6K]> Fold10 <tibble [20 × 5]> <tibble [1 × 1]>
## 
## [[4]]
## #  10-fold cross-validation 
## # A tibble: 10 x 4
##    splits               id     .metrics          .notes          
##  * <list>               <chr>  <list>            <list>          
##  1 <split [23.6K/2.6K]> Fold01 <tibble [20 × 7]> <tibble [1 × 1]>
##  2 <split [23.6K/2.6K]> Fold02 <tibble [20 × 7]> <tibble [1 × 1]>
##  3 <split [23.6K/2.6K]> Fold03 <tibble [20 × 7]> <tibble [1 × 1]>
##  4 <split [23.6K/2.6K]> Fold04 <tibble [20 × 7]> <tibble [1 × 1]>
##  5 <split [23.6K/2.6K]> Fold05 <tibble [20 × 7]> <tibble [1 × 1]>
##  6 <split [23.6K/2.6K]> Fold06 <tibble [20 × 7]> <tibble [1 × 1]>
##  7 <split [23.6K/2.6K]> Fold07 <tibble [20 × 7]> <tibble [1 × 1]>
##  8 <split [23.6K/2.6K]> Fold08 <tibble [20 × 7]> <tibble [1 × 1]>
##  9 <split [23.6K/2.6K]> Fold09 <tibble [20 × 7]> <tibble [1 × 1]>
## 10 <split [23.6K/2.6K]> Fold10 <tibble [20 × 7]> <tibble [1 × 1]>
## 
## [[5]]
## #  10-fold cross-validation 
## # A tibble: 10 x 4
##    splits               id     .metrics          .notes          
##  * <list>               <chr>  <list>            <list>          
##  1 <split [23.6K/2.6K]> Fold01 <tibble [20 × 5]> <tibble [1 × 1]>
##  2 <split [23.6K/2.6K]> Fold02 <tibble [20 × 5]> <tibble [1 × 1]>
##  3 <split [23.6K/2.6K]> Fold03 <tibble [20 × 5]> <tibble [1 × 1]>
##  4 <split [23.6K/2.6K]> Fold04 <tibble [20 × 5]> <tibble [1 × 1]>
##  5 <split [23.6K/2.6K]> Fold05 <tibble [20 × 5]> <tibble [1 × 1]>
##  6 <split [23.6K/2.6K]> Fold06 <tibble [20 × 5]> <tibble [1 × 1]>
##  7 <split [23.6K/2.6K]> Fold07 <tibble [20 × 5]> <tibble [1 × 1]>
##  8 <split [23.6K/2.6K]> Fold08 <tibble [20 × 5]> <tibble [1 × 1]>
##  9 <split [23.6K/2.6K]> Fold09 <tibble [20 × 5]> <tibble [1 × 1]>
## 10 <split [23.6K/2.6K]> Fold10 <tibble [20 × 5]> <tibble [1 × 1]>

I now have a list of 5 tibbles containing the analysis/assessment splits, the id identifying the cross-validation fold, a list-column containing information on model performance for that given split and some notes (if everything goes well, notes are empty). Let’s take a look at the column .metrics of the first model and for the first fold:

trained_models_list[[1]]$.metrics[[1]]
## # A tibble: 20 x 5
##     penalty mixture .metric  .estimator .estimate
##       <dbl>   <dbl> <chr>    <chr>          <dbl>
##  1 4.25e- 3  0.0615 accuracy binary         0.906
##  2 4.25e- 3  0.0615 roc_auc  binary         0.895
##  3 6.57e-10  0.0655 accuracy binary         0.908
##  4 6.57e-10  0.0655 roc_auc  binary         0.897
##  5 1.18e- 6  0.167  accuracy binary         0.908
##  6 1.18e- 6  0.167  roc_auc  binary         0.897
##  7 2.19e-10  0.371  accuracy binary         0.907
##  8 2.19e-10  0.371  roc_auc  binary         0.897
##  9 2.73e- 1  0.397  accuracy binary         0.885
## 10 2.73e- 1  0.397  roc_auc  binary         0.5  
## 11 1.72e- 6  0.504  accuracy binary         0.907
## 12 1.72e- 6  0.504  roc_auc  binary         0.897
## 13 1.25e- 9  0.633  accuracy binary         0.907
## 14 1.25e- 9  0.633  roc_auc  binary         0.897
## 15 6.62e- 6  0.880  accuracy binary         0.907
## 16 6.62e- 6  0.880  roc_auc  binary         0.897
## 17 6.00e- 1  0.899  accuracy binary         0.885
## 18 6.00e- 1  0.899  roc_auc  binary         0.5  
## 19 4.57e-10  0.989  accuracy binary         0.907
## 20 4.57e-10  0.989  roc_auc  binary         0.897

This shows how the 10 different configurations of the elasticnet model performed. To see how the model performed on the second fold:

trained_models_list[[1]]$.metrics[[2]]
## # A tibble: 20 x 5
##     penalty mixture .metric  .estimator .estimate
##       <dbl>   <dbl> <chr>    <chr>          <dbl>
##  1 4.25e- 3  0.0615 accuracy binary         0.913
##  2 4.25e- 3  0.0615 roc_auc  binary         0.874
##  3 6.57e-10  0.0655 accuracy binary         0.913
##  4 6.57e-10  0.0655 roc_auc  binary         0.877
##  5 1.18e- 6  0.167  accuracy binary         0.913
##  6 1.18e- 6  0.167  roc_auc  binary         0.878
##  7 2.19e-10  0.371  accuracy binary         0.913
##  8 2.19e-10  0.371  roc_auc  binary         0.878
##  9 2.73e- 1  0.397  accuracy binary         0.901
## 10 2.73e- 1  0.397  roc_auc  binary         0.5  
## 11 1.72e- 6  0.504  accuracy binary         0.913
## 12 1.72e- 6  0.504  roc_auc  binary         0.878
## 13 1.25e- 9  0.633  accuracy binary         0.913
## 14 1.25e- 9  0.633  roc_auc  binary         0.878
## 15 6.62e- 6  0.880  accuracy binary         0.913
## 16 6.62e- 6  0.880  roc_auc  binary         0.878
## 17 6.00e- 1  0.899  accuracy binary         0.901
## 18 6.00e- 1  0.899  roc_auc  binary         0.5  
## 19 4.57e-10  0.989  accuracy binary         0.913
## 20 4.57e-10  0.989  roc_auc  binary         0.878

Hyper-Parameters are the same; it is only the cross validation fold that is different. To get the best performing model from such objects you can use show_best() which will extract the best performing models across all the cross validation folds:

show_best(trained_models_list[[1]], metric = "accuracy")
## # A tibble: 5 x 7
##    penalty mixture .metric  .estimator  mean     n std_err
##      <dbl>   <dbl> <chr>    <chr>      <dbl> <int>   <dbl>
## 1 6.57e-10  0.0655 accuracy binary     0.916    10 0.00179
## 2 1.18e- 6  0.167  accuracy binary     0.916    10 0.00180
## 3 1.72e- 6  0.504  accuracy binary     0.916    10 0.00182
## 4 4.57e-10  0.989  accuracy binary     0.916    10 0.00181
## 5 6.62e- 6  0.880  accuracy binary     0.916    10 0.00181

This shows the 5 best configurations for elasticnet when looking at accuracy. Now how to get the best performing elasticnet regression, random forest, boosted trees, etc? Easy, using map():

map(trained_models_list, show_best, metric = "accuracy")
## [[1]]
## # A tibble: 5 x 7
##    penalty mixture .metric  .estimator  mean     n std_err
##      <dbl>   <dbl> <chr>    <chr>      <dbl> <int>   <dbl>
## 1 6.57e-10  0.0655 accuracy binary     0.916    10 0.00179
## 2 1.18e- 6  0.167  accuracy binary     0.916    10 0.00180
## 3 1.72e- 6  0.504  accuracy binary     0.916    10 0.00182
## 4 4.57e-10  0.989  accuracy binary     0.916    10 0.00181
## 5 6.62e- 6  0.880  accuracy binary     0.916    10 0.00181
## 
## [[2]]
## # A tibble: 5 x 7
##    mtry trees .metric  .estimator  mean     n std_err
##   <int> <int> <chr>    <chr>      <dbl> <int>   <dbl>
## 1    13  1991 accuracy binary     0.929    10 0.00172
## 2    13  1180 accuracy binary     0.929    10 0.00168
## 3    12   285 accuracy binary     0.928    10 0.00168
## 4     8  1567 accuracy binary     0.927    10 0.00171
## 5     8   647 accuracy binary     0.927    10 0.00191
## 
## [[3]]
## # A tibble: 5 x 7
##   num_terms prune_method .metric  .estimator  mean     n std_err
##       <int> <chr>        <chr>    <chr>      <dbl> <int>   <dbl>
## 1         5 backward     accuracy binary     0.904    10 0.00186
## 2         5 forward      accuracy binary     0.902    10 0.00185
## 3         4 exhaustive   accuracy binary     0.901    10 0.00167
## 4         4 seqrep       accuracy binary     0.901    10 0.00167
## 5         2 backward     accuracy binary     0.896    10 0.00209
## 
## [[4]]
## # A tibble: 5 x 9
##    mtry trees tree_depth learn_rate .metric  .estimator  mean     n std_err
##   <int> <int>      <int>      <dbl> <chr>    <chr>      <dbl> <int>   <dbl>
## 1    12  1245         12   7.70e- 2 accuracy binary     0.929    10 0.00175
## 2     1   239          8   8.23e- 2 accuracy binary     0.927    10 0.00186
## 3     1   835         14   8.53e-10 accuracy binary     0.913    10 0.00232
## 4     4  1522         12   2.22e- 5 accuracy binary     0.896    10 0.00209
## 5     6   313          2   1.21e- 8 accuracy binary     0.896    10 0.00209
## 
## [[5]]
## # A tibble: 5 x 7
##   hidden_units  penalty .metric  .estimator  mean     n std_err
##          <int>    <dbl> <chr>    <chr>      <dbl> <int>   <dbl>
## 1           10 3.07e- 6 accuracy binary     0.917    10 0.00209
## 2            6 1.69e-10 accuracy binary     0.917    10 0.00216
## 3            4 2.32e- 7 accuracy binary     0.916    10 0.00194
## 4            7 5.52e- 5 accuracy binary     0.916    10 0.00163
## 5            8 1.13e- 9 accuracy binary     0.916    10 0.00173

Now, we need to test these models on the holdout set, but this post is already quite long. In the next blog post, I will retrain the top best performing models for each type of model and see how they fare against the holdout set. I’ll be also looking at explainability, so stay tuned!

Hope you enjoyed! If you found this blog post useful, you might want to follow me on twitter for blog post updates and watch my youtube channel. If you want to support my blog and channel, you could buy me an espresso or paypal.me, or buy my ebook on Leanpub.

Buy me an EspressoBuy me an Espresso