Lab 4 - The Models package

SOLUTIONS

Introduction

In today’s lab, you’ll practice building workflow sets with recipes, parsnip models, rsample resampling, model tuning and model comparison.

Learning goals

By the end of the lab you will…

  • Be able to build workflows to evaluate different models and feature sets.

Packages

# check if 'librarian' is installed and if not, install it
if (! "librarian" %in% rownames(installed.packages()) ){
  install.packages("librarian")
}
  
# load packages if not already loaded
librarian::shelf(tidyverse, magrittr, gt, gtExtras, tidymodels, ggplot2)

# set the default theme for plotting
theme_set(theme_bw(base_size = 18) + theme(legend.position = "top"))

The Data

Today we will be using the Ames Housing Data.

This data set, from De Cock (2011), records 82 fields for 2,930 properties in Ames, Iowa, in the US. The version in the modeldata package is copied from the AmesHousing package but does not include a few quality columns that appear to be outcomes rather than predictors.

dat <- modeldata::ames

The data dictionary can be found on the internet:

cat(readr::read_file("http://jse.amstat.org/v19n3/decock/DataDocumentation.txt"))

Exercise 1: EDA

Write and execute the code to perform summary EDA on the Ames Housing data using the package skimr.

SOLUTION:
dat |> skimr::skim()
Data summary
Name dat
Number of rows 2930
Number of columns 74
_______________________
Column type frequency:
factor 40
numeric 34
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
MS_SubClass 0 1 FALSE 16 One: 1079, Two: 575, One: 287, One: 192
MS_Zoning 0 1 FALSE 7 Res: 2273, Res: 462, Flo: 139, Res: 27
Street 0 1 FALSE 2 Pav: 2918, Grv: 12
Alley 0 1 FALSE 3 No_: 2732, Gra: 120, Pav: 78
Lot_Shape 0 1 FALSE 4 Reg: 1859, Sli: 979, Mod: 76, Irr: 16
Land_Contour 0 1 FALSE 4 Lvl: 2633, HLS: 120, Bnk: 117, Low: 60
Utilities 0 1 FALSE 3 All: 2927, NoS: 2, NoS: 1
Lot_Config 0 1 FALSE 5 Ins: 2140, Cor: 511, Cul: 180, FR2: 85
Land_Slope 0 1 FALSE 3 Gtl: 2789, Mod: 125, Sev: 16
Neighborhood 0 1 FALSE 28 Nor: 443, Col: 267, Old: 239, Edw: 194
Condition_1 0 1 FALSE 9 Nor: 2522, Fee: 164, Art: 92, RRA: 50
Condition_2 0 1 FALSE 8 Nor: 2900, Fee: 13, Art: 5, Pos: 4
Bldg_Type 0 1 FALSE 5 One: 2425, Twn: 233, Dup: 109, Twn: 101
House_Style 0 1 FALSE 8 One: 1481, Two: 873, One: 314, SLv: 128
Overall_Cond 0 1 FALSE 9 Ave: 1654, Abo: 533, Goo: 390, Ver: 144
Roof_Style 0 1 FALSE 6 Gab: 2321, Hip: 551, Gam: 22, Fla: 20
Roof_Matl 0 1 FALSE 8 Com: 2887, Tar: 23, WdS: 9, WdS: 7
Exterior_1st 0 1 FALSE 16 Vin: 1026, Met: 450, HdB: 442, Wd : 420
Exterior_2nd 0 1 FALSE 17 Vin: 1015, Met: 447, HdB: 406, Wd : 397
Mas_Vnr_Type 0 1 FALSE 5 Non: 1775, Brk: 880, Sto: 249, Brk: 25
Exter_Cond 0 1 FALSE 5 Typ: 2549, Goo: 299, Fai: 67, Exc: 12
Foundation 0 1 FALSE 6 PCo: 1310, CBl: 1244, Brk: 311, Sla: 49
Bsmt_Cond 0 1 FALSE 6 Typ: 2616, Goo: 122, Fai: 104, No_: 80
Bsmt_Exposure 0 1 FALSE 5 No: 1906, Av: 418, Gd: 284, Mn: 239
BsmtFin_Type_1 0 1 FALSE 7 GLQ: 859, Unf: 851, ALQ: 429, Rec: 288
BsmtFin_Type_2 0 1 FALSE 7 Unf: 2499, Rec: 106, LwQ: 89, No_: 81
Heating 0 1 FALSE 6 Gas: 2885, Gas: 27, Gra: 9, Wal: 6
Heating_QC 0 1 FALSE 5 Exc: 1495, Typ: 864, Goo: 476, Fai: 92
Central_Air 0 1 FALSE 2 Y: 2734, N: 196
Electrical 0 1 FALSE 6 SBr: 2682, Fus: 188, Fus: 50, Fus: 8
Functional 0 1 FALSE 8 Typ: 2728, Min: 70, Min: 65, Mod: 35
Garage_Type 0 1 FALSE 7 Att: 1731, Det: 782, Bui: 186, No_: 157
Garage_Finish 0 1 FALSE 4 Unf: 1231, RFn: 812, Fin: 728, No_: 159
Garage_Cond 0 1 FALSE 6 Typ: 2665, No_: 159, Fai: 74, Goo: 15
Paved_Drive 0 1 FALSE 3 Pav: 2652, Dir: 216, Par: 62
Pool_QC 0 1 FALSE 5 No_: 2917, Exc: 4, Goo: 4, Typ: 3
Fence 0 1 FALSE 5 No_: 2358, Min: 330, Goo: 118, Goo: 112
Misc_Feature 0 1 FALSE 6 Non: 2824, She: 95, Gar: 5, Oth: 4
Sale_Type 0 1 FALSE 10 WD : 2536, New: 239, COD: 87, Con: 26
Sale_Condition 0 1 FALSE 6 Nor: 2413, Par: 245, Abn: 190, Fam: 46

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Lot_Frontage 0 1 57.65 33.50 0.00 43.00 63.00 78.00 313.00 ▇▇▁▁▁
Lot_Area 0 1 10147.92 7880.02 1300.00 7440.25 9436.50 11555.25 215245.00 ▇▁▁▁▁
Year_Built 0 1 1971.36 30.25 1872.00 1954.00 1973.00 2001.00 2010.00 ▁▂▃▆▇
Year_Remod_Add 0 1 1984.27 20.86 1950.00 1965.00 1993.00 2004.00 2010.00 ▅▂▂▃▇
Mas_Vnr_Area 0 1 101.10 178.63 0.00 0.00 0.00 162.75 1600.00 ▇▁▁▁▁
BsmtFin_SF_1 0 1 4.18 2.23 0.00 3.00 3.00 7.00 7.00 ▃▂▇▁▇
BsmtFin_SF_2 0 1 49.71 169.14 0.00 0.00 0.00 0.00 1526.00 ▇▁▁▁▁
Bsmt_Unf_SF 0 1 559.07 439.54 0.00 219.00 465.50 801.75 2336.00 ▇▅▂▁▁
Total_Bsmt_SF 0 1 1051.26 440.97 0.00 793.00 990.00 1301.50 6110.00 ▇▃▁▁▁
First_Flr_SF 0 1 1159.56 391.89 334.00 876.25 1084.00 1384.00 5095.00 ▇▃▁▁▁
Second_Flr_SF 0 1 335.46 428.40 0.00 0.00 0.00 703.75 2065.00 ▇▃▂▁▁
Gr_Liv_Area 0 1 1499.69 505.51 334.00 1126.00 1442.00 1742.75 5642.00 ▇▇▁▁▁
Bsmt_Full_Bath 0 1 0.43 0.52 0.00 0.00 0.00 1.00 3.00 ▇▆▁▁▁
Bsmt_Half_Bath 0 1 0.06 0.25 0.00 0.00 0.00 0.00 2.00 ▇▁▁▁▁
Full_Bath 0 1 1.57 0.55 0.00 1.00 2.00 2.00 4.00 ▁▇▇▁▁
Half_Bath 0 1 0.38 0.50 0.00 0.00 0.00 1.00 2.00 ▇▁▅▁▁
Bedroom_AbvGr 0 1 2.85 0.83 0.00 2.00 3.00 3.00 8.00 ▁▇▂▁▁
Kitchen_AbvGr 0 1 1.04 0.21 0.00 1.00 1.00 1.00 3.00 ▁▇▁▁▁
TotRms_AbvGrd 0 1 6.44 1.57 2.00 5.00 6.00 7.00 15.00 ▁▇▂▁▁
Fireplaces 0 1 0.60 0.65 0.00 0.00 1.00 1.00 4.00 ▇▇▁▁▁
Garage_Cars 0 1 1.77 0.76 0.00 1.00 2.00 2.00 5.00 ▅▇▂▁▁
Garage_Area 0 1 472.66 215.19 0.00 320.00 480.00 576.00 1488.00 ▃▇▃▁▁
Wood_Deck_SF 0 1 93.75 126.36 0.00 0.00 0.00 168.00 1424.00 ▇▁▁▁▁
Open_Porch_SF 0 1 47.53 67.48 0.00 0.00 27.00 70.00 742.00 ▇▁▁▁▁
Enclosed_Porch 0 1 23.01 64.14 0.00 0.00 0.00 0.00 1012.00 ▇▁▁▁▁
Three_season_porch 0 1 2.59 25.14 0.00 0.00 0.00 0.00 508.00 ▇▁▁▁▁
Screen_Porch 0 1 16.00 56.09 0.00 0.00 0.00 0.00 576.00 ▇▁▁▁▁
Pool_Area 0 1 2.24 35.60 0.00 0.00 0.00 0.00 800.00 ▇▁▁▁▁
Misc_Val 0 1 50.64 566.34 0.00 0.00 0.00 0.00 17000.00 ▇▁▁▁▁
Mo_Sold 0 1 6.22 2.71 1.00 4.00 6.00 8.00 12.00 ▅▆▇▃▃
Year_Sold 0 1 2007.79 1.32 2006.00 2007.00 2008.00 2009.00 2010.00 ▇▇▇▇▃
Sale_Price 0 1 180796.06 79886.69 12789.00 129500.00 160000.00 213500.00 755000.00 ▇▇▁▁▁
Longitude 0 1 -93.64 0.03 -93.69 -93.66 -93.64 -93.62 -93.58 ▅▅▇▆▁
Latitude 0 1 42.03 0.02 41.99 42.02 42.03 42.05 42.06 ▂▂▇▇▇

Exercise 2: Train / Test Splits

Write and execute code to create training and test datasets. Have the training dataset represent 75% of the total data.

SOLUTION:
set.seed(8740)
# split the data, with 75% in the training set
data_split <- rsample::initial_split(dat, strata = "Sale_Price", prop = 0.75)

# extract the training set
ames_train <- rsample::training(data_split)
# extract the test set
ames_test  <- rsample::testing(data_split)
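
A quick sanity check confirms that the split proportions are roughly as requested:

# the training set should hold about 75% of the rows
nrow(ames_train) / nrow(dat)
# and the test set the remaining 25%
nrow(ames_test) / nrow(dat)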

Exercise 3: Data Preprocessing

Create a recipe based on the formula Sale_Price ~ Longitude + Latitude + Lot_Area + Neighborhood + Year_Sold with the following steps:

  • transform the outcome variable Sale_Price to log(Sale_Price) (natural log)
  • center and scale all numeric predictors
  • transform the categorical variable Neighborhood to pool infrequent values (see recipes::step_other)
  • create dummy variables for all nominal predictors

Finally prep the recipe.

Make sure you consider the order of the operations (hint: step_dummy turns factors into multiple integer (numeric) predictors, so consider when to scale numeric predictors relative to creating dummy predictors).

You can use broom::tidy() on the prepped recipe to check that the steps are specified and ordered as intended.

SOLUTION:
norm_recipe <- 
  # create a recipe with the specified formula and data
  recipes::recipe(
    Sale_Price ~ Longitude + Latitude + Lot_Area + Neighborhood + Year_Sold, 
    data = ames_train
  ) |>
  # center all numeric predictors
  recipes::step_center(all_numeric_predictors()) |>
  # scale all numeric predictors
  recipes::step_scale(all_numeric_predictors()) |>
  # transform the outcome using log-base-e (natural log)
  recipes::step_log(Sale_Price, base = exp(1)) |> 
  # pool categories with few members into a new category - 'other'
  recipes::step_other(Neighborhood) |> 
  # create dummy variables for all categories
  recipes::step_dummy(all_nominal_predictors()) 
norm_recipe |> 
  # estimate the means and standard deviations by prepping the data
  recipes::prep(training = ames_train, retain = TRUE) |> broom::tidy()
# A tibble: 5 × 6
  number operation type   trained skip  id          
   <int> <chr>     <chr>  <lgl>   <lgl> <chr>       
1      1 step      center TRUE    FALSE center_8P2vV
2      2 step      scale  TRUE    FALSE scale_5uUEH 
3      3 step      log    TRUE    FALSE log_Y9hV0   
4      4 step      other  TRUE    FALSE other_JBWlv 
5      5 step      dummy  TRUE    FALSE dummy_yvcFM 
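
Another way to check the preprocessing is to bake the prepped recipe and inspect the resulting columns. A minimal sketch using recipes::bake(); with new_data = NULL it returns the training data retained by prep():

norm_recipe |> 
  recipes::prep(training = ames_train, retain = TRUE) |> 
  # new_data = NULL returns the preprocessed (retained) training data
  recipes::bake(new_data = NULL) |> 
  dplyr::glimpse()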

Exercise 4: Modeling

Create three regression models using the parsnip package and assign each model to its own variable:

  • a base regression model using lm
  • a regression model using glmnet; set the model parameters penalty and mixture for tuning
  • a random forest model using the ranger engine; set the model parameters min_n and trees for tuning
SOLUTION:
# linear model: engine lm, mode regression
lm_mod_base <- parsnip::linear_reg() |>
  parsnip::set_engine("lm")  |> 
  parsnip::set_mode("regression")

# tuned linear model: engine glmnet, mode regression; tuning penalty and mixture
lm_mod_glmnet <- 
  parsnip::linear_reg( penalty = parsnip::tune(), mixture = parsnip::tune() ) |> 
  parsnip::set_engine("glmnet") |> 
  parsnip::set_mode("regression")

# random forest model: engine ranger, mode regression; tuning min_n and trees
lm_mod_rforest <- 
  parsnip::rand_forest( min_n = parsnip::tune(), trees = parsnip::tune() ) |> 
  parsnip::set_engine("ranger") |> 
  parsnip::set_mode("regression")
lm_mod_base
Linear Regression Model Specification (regression)

Computational engine: lm 
lm_mod_glmnet
Linear Regression Model Specification (regression)

Main Arguments:
  penalty = parsnip::tune()
  mixture = parsnip::tune()

Computational engine: glmnet 
lm_mod_rforest
Random Forest Model Specification (regression)

Main Arguments:
  trees = parsnip::tune()
  min_n = parsnip::tune()

Computational engine: ranger 
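
To double-check which arguments were flagged for tuning, the parameter set of a specification can be extracted. A minimal sketch, assuming the extract_parameter_set_dials() generic from hardhat (loaded as part of tidymodels):

# list the parameters marked with tune() in each tunable specification
lm_mod_glmnet  |> hardhat::extract_parameter_set_dials()
lm_mod_rforest |> hardhat::extract_parameter_set_dials()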

Exercise 5

Use parsnip::translate() on each model to see the model template for each method of fitting.

SOLUTION:
# lm model
lm_mod_base |> parsnip::translate()
Linear Regression Model Specification (regression)

Computational engine: lm 

Model fit template:
stats::lm(formula = missing_arg(), data = missing_arg(), weights = missing_arg())
# glmnet model
lm_mod_glmnet |> parsnip::translate()
Linear Regression Model Specification (regression)

Main Arguments:
  penalty = tune()
  mixture = parsnip::tune()

Computational engine: glmnet 

Model fit template:
glmnet::glmnet(x = missing_arg(), y = missing_arg(), weights = missing_arg(), 
    alpha = parsnip::tune(), family = "gaussian")
# rforest model
lm_mod_rforest |> parsnip::translate()
Random Forest Model Specification (regression)

Main Arguments:
  trees = parsnip::tune()
  min_n = parsnip::tune()

Computational engine: ranger 

Model fit template:
ranger::ranger(x = missing_arg(), y = missing_arg(), weights = missing_arg(), 
    num.trees = parsnip::tune(), min.node.size = min_rows(~parsnip::tune(), 
        x), num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 
        1))

Exercise 6: Bootstrap

Create bootstrap samples for the training dataset. You can leave the parameters set to their defaults.

SOLUTION:
set.seed(8740)
train_resamples <- rsample::bootstraps(ames_train)
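
Before tuning, it can be worth a quick look at the resampling object. A short sketch using the rsample accessors:

# print the bootstrap resamples (25 splits by default)
train_resamples

# sizes of the analysis and assessment sets for the first resample
rsample::analysis(train_resamples$splits[[1]]) |> nrow()
rsample::assessment(train_resamples$splits[[1]]) |> nrow()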

Exercise 7

Create workflows with workflowsets::workflow_set using your recipe and models. Show the resulting data structure, noting the number of columns, and then use tidyr:: to unnest the info column and show its contents.

SOLUTION:
all_workflows <- 
  workflowsets::workflow_set(
    preproc = list(base = norm_recipe),
    models = list(base = lm_mod_base, glmnet = lm_mod_glmnet, forest = lm_mod_rforest)
  )

all_workflows # four columns
all_workflows |> tidyr::unnest(info)
# A workflow set/tibble: 3 × 4
  wflow_id    info             option    result    
  <chr>       <list>           <list>    <list>    
1 base_base   <tibble [1 × 4]> <opts[0]> <list [0]>
2 base_glmnet <tibble [1 × 4]> <opts[0]> <list [0]>
3 base_forest <tibble [1 × 4]> <opts[0]> <list [0]>
# A tibble: 3 × 7
  wflow_id    workflow   preproc model       comment option    result    
  <chr>       <list>     <chr>   <chr>       <chr>   <list>    <list>    
1 base_base   <workflow> recipe  linear_reg  ""      <opts[0]> <list [0]>
2 base_glmnet <workflow> recipe  linear_reg  ""      <opts[0]> <list [0]>
3 base_forest <workflow> recipe  rand_forest ""      <opts[0]> <list [0]>

Exercise 8

Use workflowsets::workflow_map to map the default function (tune::tune_grid(); see the help for workflowsets::workflow_map) across the workflows in the workflow set you just created, and update the variable all_workflows with the result.

all_workflows <- all_workflows |> 
  workflowsets::workflow_map(
    verbose = TRUE                # enable logging
    , resamples = train_resamples # a parameter passed to tune::tune_grid()
    , grid = 5                    # a parameter passed to tune::tune_grid()
  )
i   No tuning parameters. `fit_resamples()` will be attempted
i 1 of 3 resampling: base_base
✔ 1 of 3 resampling: base_base (976ms)
i 2 of 3 tuning:     base_glmnet
✔ 2 of 3 tuning:     base_glmnet (3.2s)
i 3 of 3 tuning:     base_forest
✔ 3 of 3 tuning:     base_forest (1m 15.9s)

The updated variable all_workflows contains a nested column named result, and each cell of the column result is a tibble containing a nested column named .metrics. Write code to

  1. un-nest the metrics in the column .metrics
  2. keep only the rows for the metric rsq
  3. group by wflow_id, order the .estimate column from highest to lowest, and pick out the first row of each group.
SOLUTION:
all_workflows |> 
  # unnest the result column
  dplyr::select(wflow_id,result) |> 
  tidyr::unnest(result) |> 
  # unnest the .metrics column
  tidyr::unnest(.metrics) |> 
  # keep only the rows for the rsq metric
  dplyr::filter(.metric == 'rsq') |> 
  # group by wflow_id
  dplyr::group_by(wflow_id) |> 
  # order by .estimate, starting with the largest value
  dplyr::arrange(desc(.estimate) ) |> 
  # select the first row for each group (i.e. highest rsq)
  dplyr::slice(1)
# A tibble: 3 × 12
# Groups:   wflow_id [3]
  wflow_id    splits             id         .metric .estimator .estimate .config
  <chr>       <list>             <chr>      <chr>   <chr>          <dbl> <chr>  
1 base_base   <split [2197/798]> Bootstrap… rsq     standard       0.448 Prepro…
2 base_forest <split [2197/797]> Bootstrap… rsq     standard       0.712 Prepro…
3 base_glmnet <split [2197/798]> Bootstrap… rsq     standard       0.448 Prepro…
# ℹ 5 more variables: penalty <dbl>, mixture <dbl>, trees <int>, min_n <int>,
#   .notes <list>
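
The same per-resample summary can be reached without manual un-nesting, since collect_metrics() has a method for workflow sets. A minimal sketch, assuming its summarize argument works as documented in tune:

all_workflows |> 
  # per-resample estimates rather than means across resamples
  tune::collect_metrics(summarize = FALSE) |> 
  dplyr::filter(.metric == "rsq") |> 
  dplyr::group_by(wflow_id) |> 
  # highest rsq per workflow
  dplyr::slice_max(.estimate, n = 1)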

Exercise 9

Run the code below and compare to your results from exercise 8.

SOLUTION:
workflowsets::rank_results(
  all_workflows
  , rank_metric = 'rsq'
  , select_best = TRUE
)
# A tibble: 6 × 9
  wflow_id    .config       .metric  mean std_err     n preprocessor model  rank
  <chr>       <chr>         <chr>   <dbl>   <dbl> <int> <chr>        <chr> <int>
1 base_forest Preprocessor… rmse    0.240 0.00255    25 recipe       rand…     1
2 base_forest Preprocessor… rsq     0.657 0.00471    25 recipe       rand…     1
3 base_base   Preprocessor… rmse    0.321 0.00366    25 recipe       line…     2
4 base_base   Preprocessor… rsq     0.376 0.00836    25 recipe       line…     2
5 base_glmnet Preprocessor… rmse    0.321 0.00364    25 recipe       line…     3
6 base_glmnet Preprocessor… rsq     0.375 0.00836    25 recipe       line…     3
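
The workflowsets package also supplies an autoplot() method, which makes the same ranking easier to compare visually. A minimal sketch (argument names assumed from the workflowsets documentation):

# plot the best configuration of each workflow, ranked by rsq
all_workflows |> 
  ggplot2::autoplot(rank_metric = "rsq", metric = "rsq", select_best = TRUE)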

Exercise 10

Select the best model per the rsq metric using its id.

SOLUTION:
# extract the best model (per rsq)
best_model_workflow <- 
  all_workflows |> 
  workflowsets::extract_workflow("base_forest")
# finalize the workflow
best_model_workflow <- 
  best_model_workflow |> 
  tune::finalize_workflow(
    tibble::tibble(trees = 1971, min_n = 2) # enter the name and value of the best-fit parameters
  ) 
# fit the finalized workflow on the training and test sets and compare performance
training_fit <- best_model_workflow |> 
  fit(data = ames_train)
training_fit
testing_fit <- best_model_workflow |> 
  fit(data = ames_test)
testing_fit
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()

── Preprocessor ────────────────────────────────────────────────────────────────
5 Recipe Steps

• step_center()
• step_scale()
• step_log()
• step_other()
• step_dummy()

── Model ───────────────────────────────────────────────────────────────────────
Ranger result

Call:
 ranger::ranger(x = maybe_data_frame(x), y = y, num.trees = ~1971,      min.node.size = min_rows(~2, x), num.threads = 1, verbose = FALSE,      seed = sample.int(10^5, 1)) 

Type:                             Regression 
Number of trees:                  1971 
Sample size:                      2197 
Number of independent variables:  12 
Mtry:                             3 
Target node size:                 2 
Variable importance mode:         none 
Splitrule:                        variance 
OOB prediction error (MSE):       0.05686626 
R squared (OOB):                  0.6558831 
══ Workflow [trained] ══════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: rand_forest()

── Preprocessor ────────────────────────────────────────────────────────────────
5 Recipe Steps

• step_center()
• step_scale()
• step_log()
• step_other()
• step_dummy()

── Model ───────────────────────────────────────────────────────────────────────
Ranger result

Call:
 ranger::ranger(x = maybe_data_frame(x), y = y, num.trees = ~1971,      min.node.size = min_rows(~2, x), num.threads = 1, verbose = FALSE,      seed = sample.int(10^5, 1)) 

Type:                             Regression 
Number of trees:                  1971 
Sample size:                      733 
Number of independent variables:  12 
Mtry:                             3 
Target node size:                 2 
Variable importance mode:         none 
Splitrule:                        variance 
OOB prediction error (MSE):       0.06798341 
R squared (OOB):                  0.5976383 

What is the ratio of the OOB prediction errors (MSE): test/train?

The ratio is 0.06798341 / 0.05686626 - 1 ≈ 0.195, so the OOB prediction error is about 20% higher in the test dataset than in the training dataset.
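
Rather than hard-coding trees = 1971 and min_n = 2, the winning parameters could also be pulled from the tuning results programmatically. A minimal sketch, assuming workflowsets::extract_workflow_set_result() and tune::select_best() behave as documented:

# pull the tuning results for the best workflow and select its best parameters (per rsq)
best_params <- all_workflows |> 
  workflowsets::extract_workflow_set_result("base_forest") |> 
  tune::select_best(metric = "rsq")

# finalize the workflow with those parameters instead of typed-in values
best_model_workflow <- all_workflows |> 
  workflowsets::extract_workflow("base_forest") |> 
  tune::finalize_workflow(best_params)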

Grading

Total points available: 30 points.

Component Points
Ex 1 - 10 30