{tidymodels} workflow with Bayesian optimisation

R Machine learning

I’ve been collecting a few notes on using the tidymodels workflow for modelling, and thought it might be worth sharing them here. They’re more for personal reference than anything, but someone might find my ramblings useful!

Harry Fisher https://hfshr.xyz
05-23-2020

For demonstration purposes I’m going to use the credit_data dataset from the modeldata package. This is a fairly simple data set that doesn’t require too much wrangling to get going.

library(modeldata)
library(tidymodels)
library(tidyverse)
library(doParallel)
library(probably)
library(gt)

data("credit_data")
glimpse(credit_data)
Rows: 4,454
Columns: 14
$ Status    <fct> good, good, bad, good, good, good, good, good, go…
$ Seniority <int> 9, 17, 10, 0, 0, 1, 29, 9, 0, 0, 6, 7, 8, 19, 0, …
$ Home      <fct> rent, rent, owner, rent, rent, owner, owner, pare…
$ Time      <int> 60, 60, 36, 60, 36, 60, 60, 12, 60, 48, 48, 36, 6…
$ Age       <int> 30, 58, 46, 24, 26, 36, 44, 27, 32, 41, 34, 29, 3…
$ Marital   <fct> married, widow, married, single, single, married,…
$ Records   <fct> no, no, yes, no, no, no, no, no, no, no, no, no, …
$ Job       <fct> freelance, fixed, freelance, fixed, fixed, fixed,…
$ Expenses  <int> 73, 48, 90, 63, 46, 75, 75, 35, 90, 90, 60, 60, 7…
$ Income    <int> 129, 131, 200, 182, 107, 214, 125, 80, 107, 80, 1…
$ Assets    <int> 0, 0, 3000, 2500, 0, 3500, 10000, 0, 15000, 0, 40…
$ Debt      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2500, 260, 0,…
$ Amount    <int> 800, 1000, 2000, 900, 310, 650, 1600, 200, 1200, …
$ Price     <int> 846, 1658, 2985, 1325, 910, 1645, 1800, 1093, 195…

First, I’ll split the sample into training and testing sets using rsample, and also create a v-fold cross-validation set for use in the tuning step later.

set.seed(122)

credit_data <- credit_data %>%
  drop_na(Status)

# initial split
split <- initial_split(credit_data, prop = 0.75, strata = "Status")

# train/test sets
train <- training(split)
test <- testing(split)

# cross validation set
folds <- vfold_cv(train, v = 5)
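
As a quick sanity check, we can confirm the stratified split has kept the proportion of good and bad loans roughly the same in both sets (the exact counts will depend on the seed):

# check the class balance is similar in the training and test sets
train %>% count(Status) %>% mutate(prop = n / sum(n))
test %>% count(Status) %>% mutate(prop = n / sum(n))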

Next, I’ll set up the recipe using recipes. As well as defining the formula for the model, I’ve created two preprocessing steps:

  1. Impute the missing values.
  2. One-hot encode the factor variables in the data.

rec <- recipe(Status ~ ., data = train) %>%
  step_bagimpute(Home, Marital, Job, Income, Assets, Debt) %>%
  step_dummy(Home, Marital, Records, Job, one_hot = TRUE)

Now I’ll prepare the model specification using parsnip. Here I’m using an xgboost model and specifying the parameters I want to tune using Bayesian optimisation. The mtry parameter requires one additional step to finalise the range of possible values, because it depends on the number of predictors in the data and a suitable range of values to test can’t be estimated without that information.

mod <- boost_tree(
  trees = 1000,
  min_n = tune(),
  learn_rate = tune(),
  loss_reduction = tune(),
  sample_size = tune(),
  mtry = tune(),
  tree_depth = tune()
) %>%
  set_engine("xgboost") %>%
  set_mode("classification")


params <- parameters(mod) %>%
  finalize(train)

In this step I’ve bundled the recipe and model specification into a workflow. This avoids the need to use the prep, juice and bake functions directly (which I never quite got my head around…!).

xgboost_wflow <- workflow() %>%
  add_recipe(rec) %>%
  add_model(mod)
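
For reference, this is roughly what the manual approach would look like; a minimal sketch only, since none of it is needed once the recipe and model live inside a workflow:

# prep() estimates the imputation and dummy steps from the training data,
# juice() returns the processed training set, and bake() applies the same
# steps to new data
rec_prepped <- prep(rec, training = train)
train_processed <- juice(rec_prepped)
test_processed <- bake(rec_prepped, new_data = test)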

Now we are ready to OPTIMISE. I’m going to use an iterative search, where Bayesian optimisation predicts which parameter values to try next (as opposed to a grid search, where we need to specify the parameter values to test in advance).
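
For contrast, a grid search version of the same tuning would look roughly like this; a sketch only (the grid object and its size are illustrative, and it isn’t run here):

# build a space-filling grid of candidate values in advance,
# then evaluate every candidate across the resamples
xgboost_grid <- grid_latin_hypercube(params, size = 30)

tuned_grid <- tune_grid(
  object = xgboost_wflow,
  resamples = folds,
  grid = xgboost_grid,
  metrics = metric_set(pr_auc)
)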

First I set up some additional workers so the resamples can be fitted in parallel, and then use the tune_bayes() function to run the tuning. Here I’ve decided to optimise the area under the precision-recall curve (pr_auc), start from 10 initial results, run up to 30 search iterations, and stop early if there is no improvement after 10 consecutive iterations.

options(tidymodels.dark = TRUE)

cl <- makePSOCKcluster(8)
registerDoParallel(cl)

tuned <- tune_bayes(
  object = xgboost_wflow,
  resamples = folds,
  param_info = params,
  iter = 30,
  metrics = metric_set(pr_auc),
  initial = 10,
  control = control_bayes(
    verbose = TRUE,
    no_improve = 10,
    seed = 123
  )
)

After a little while, we’re done! The top combinations were:

show_best(tuned, "pr_auc") %>%
  select(1:7, 11) %>%
  gt()
mtry   min_n   tree_depth   learn_rate     loss_reduction   sample_size   .metric   std_err
  14      13            4   1.949228e-02     1.295324e-10     0.2703632   pr_auc    0.01525507
   1      13            4   1.615862e-02     3.036423e-03     0.8919163   pr_auc    0.01816570
  13       5           12   1.057048e-05     3.710341e-09     0.5714647   pr_auc    0.02226760
  11      23           14   2.847187e-02     6.473076e-08     0.2472379   pr_auc    0.01866750
  10       5           12   3.553680e-02     1.867881e-01     0.3068261   pr_auc    0.01438201
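
To get a feel for how the search behaved, the tuning results also have an autoplot() method (the default plot shows estimated performance against each tuned parameter):

# visualise estimated pr_auc across the tuned parameters
autoplot(tuned)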

Now we can create our final model using these parameters.

xgboost_wkflow_tuned <- finalize_workflow(
  xgboost_wflow,
  select_best(tuned, "pr_auc")
)

Finally, we can use last_fit() to fit the tuned workflow to the training data and evaluate it on the test set.

final_res <- last_fit(
  xgboost_wkflow_tuned,
  split
)

stopCluster(cl)
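
last_fit() also calculates metrics on the test set (accuracy and roc_auc by default for classification), which we can pull out directly:

# test set metrics from the final fit
collect_metrics(final_res)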

With our model in hand, we can collect the predictions made on the test set to evaluate performance.

preds <- final_res %>% 
  collect_predictions()

Now we can use the yardstick package to evaluate how the model performed.

conf_mat(preds, Status, .pred_class)
          Truth
Prediction bad good
      bad  152   57
      good 161  743
preds %>%
  gain_curve(Status, .pred_bad) %>%
  autoplot()
preds %>%
  pr_curve(Status, .pred_bad) %>%
  autoplot()
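
The curves give a visual sense of performance; if you also want the areas under them as single numbers, the corresponding yardstick metric functions take the same arguments:

# area under the ROC and precision-recall curves for the test set predictions
preds %>% roc_auc(Status, .pred_bad)
preds %>% pr_auc(Status, .pred_bad)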

Hmmm… that confusion matrix doesn’t look too great. Quite a large number of “bad” loans were predicted as “good” (false negatives). Maybe we can improve the class predictions by using probably.

Here we use threshold_perf() to evaluate different probability thresholds for making our class predictions. One method to determine the “best” cut point is the j-index (sensitivity + specificity - 1), which takes a maximum value of 1 when there are no false positives and no false negatives.

threshold_data <- preds %>%
  threshold_perf(Status, .pred_bad, thresholds = seq(0.2, 1, by = 0.0025))


max_j_index_threshold <- threshold_data %>%
  filter(.metric == "j_index") %>%
  filter(.estimate == max(.estimate)) %>%
  pull(.threshold)


preds_new <- preds %>% 
  mutate(new_class_pred = factor(ifelse(.pred_bad >= max_j_index_threshold, "bad", "good"), 
                                 levels = c("bad", "good"))) 
  

max_j_index_threshold
[1] 0.2725
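
As an aside, probably also has a make_two_class_pred() helper that does the same re-thresholding and returns a dedicated class_pred column; roughly equivalent to the ifelse() above (a sketch, not what I used here):

# equivalent re-thresholding using probably's helper
preds_new2 <- preds %>%
  mutate(new_class_pred = make_two_class_pred(
    estimate = .pred_bad,                  # probability of the first level ("bad")
    levels = levels(Status),
    threshold = max_j_index_threshold
  ))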

Now we have a new class prediction based on our new threshold of 0.27, versus the default threshold of 0.50. We can compare the performance on a range of different binary classification metrics by calling summary() on the conf_mat() object for both the old and new predicted classes.

summary(conf_mat(preds, Status, .pred_class)) %>%
  select(-.estimator) %>%
  rename(old_threshold = .estimate) %>%
  bind_cols(.,
            summary(conf_mat(preds_new, Status, new_class_pred)) %>%
              select(.estimate) %>%
              rename(new_threshold = .estimate)) %>%
  gt() %>%
  fmt_number(columns = c(2, 3),
             decimals = 3) %>%
  tab_style(
    style = cell_fill(color = "indianred3"),
    locations = cells_body(columns = 3,
                           rows = new_threshold < old_threshold)
  )  %>%
  tab_style(
    style = cell_fill(color = "springgreen3"),
    locations = cells_body(columns = 3,
                           rows = new_threshold > old_threshold)
  ) %>% 
  cols_align(align = "center")
.metric                old_threshold   new_threshold
accuracy                       0.804           0.786
kap                            0.461           0.516
sens                           0.486           0.773
spec                           0.929           0.791
ppv                            0.727           0.592
npv                            0.822           0.899
mcc                            0.477           0.526
j_index                        0.414           0.564
bal_accuracy                   0.707           0.782
detection_prevalence           0.188           0.367
precision                      0.727           0.592
recall                         0.486           0.773
f_meas                         0.582           0.670

Lowering the threshold to 0.27 seems to have had a positive impact on quite a few of the binary classification metrics. There is always going to be a trade-off between maximising some metrics over others, and the right balance will of course depend on what you are trying to achieve with your model.

…and that is a very quick tour of tidymodels! There are obviously some additional steps you would want to carry out in the “real” world. You’d probably want to compare a range of different models and maybe do some additional feature engineering based on the data you have, but the code above is a good starting point for a tidymodels-orientated workflow.

Thanks for reading!
