Lab 4 - The TidyModels Package

BSMM 8740 Fall 2024

Author

Add your name here

Introduction

In today’s lab, you’ll practice building workflowsets with recipes, parsnip models, rsample cross validations, model tuning and model comparison.

Learning goals

By the end of the lab you will…

  • Be able to build workflows to evaluate different models and feature sets.

Getting started

  • Log in to your github account and then go to the GitHub organization for the course and find the 2024-lab-4-[your github username] repository to complete the lab.

    Create an R project using your 2024-lab-4-[your github username] repository (remember to create a PAT, etc., as in lab-1) and add your answers by editing the 2024-lab-4.qmd file in your repository.

  • When you are done, be sure to: save your document, stage, commit and push your work.

Important

To access Github from the lab, you will need to make sure you are logged in as follows:

  • username: .\daladmin
  • password: Business507!

Remember to (create a PAT and set your git credentials)

  • create your PAT using usethis::create_github_token() ,
  • store your PAT with gitcreds::gitcreds_set() ,
  • set your username and email with
    • usethis::use_git_config( user.name = ___, user.email = ___)

Packages

# check if 'librarian' is installed and if not, install it
if (! "librarian" %in% rownames(installed.packages()) ){
  install.packages("librarian")
}
  
# load packages if not already loaded
librarian::shelf(
  tidyverse, magrittr, tidymodels, modeldata, ranger, rsample, broom, recipes, parsnip
)

# set the efault theme for plotting
theme_set(theme_bw(base_size = 18) + theme(legend.position = "top"))

The Data

Today we will be using the Ames Housing Data.

This is a data set from De Cock (2011) has 82 fields were recorded for 2,930 properties in Ames Iowa in the US. The version in the modeldata package is copied from the AmesHousing package but does not include a few quality columns that appear to be outcomes rather than predictors.

dat <- modeldata::ames

The data dictionary can be found on the internet:

cat(readr::read_file("http://jse.amstat.org/v19n3/decock/DataDocumentation.txt"))

Exercise 1: EDA

Write and execute the code to perform summary EDA on the Ames Housing data using the package skimr.

YOUR ANSWER:
# PLEASE SHOW YOUR WORK

Exercise 2: Train / Test Splits

Write and execute code to create training and test datasets. Have the training dataset represent 75% of the total data.

YOUR ANSWER:
# PLEASE SHOW YOUR WORK

set.seed(8740)
data_split <- rsample::__

ames_train <- rsample::__
ames_test  <- rsample::__

Exercise 3: Data Preprocessing

create a recipe based on the formula Sale_Price ~ Longitude + Latitude + Lot_Area + Neighborhood + Year_Sold with the following steps:

  • transform the variable Sale_Price to log(Sale_Price)
  • center and scale all predictors
  • create dummy variables for all nominal variables
  • transform the variable Neighborhood to pool infrequent values (see recipes::step_other)

Finally prep the recipe.

YOUR ANSWER:
# PLEASE SHOW YOUR WORK
norm_recipe <- 
  recipes::recipe( ___ ) %>% 
  ...
  recipes::prep( ___ ) %>% broom::tidy()

Exercise 4 Modeling

Create three regression models

  • a base regression model using lm

  • a regression model using glmnet; set the model parameters penalty and mixture for tuning

  • a tree model using the ranger engine; set the model parameters min_n and trees for tuning

YOUR ANSWER:
# PLEASE SHOW YOUR WORK


lm_mod_base <- 
  parsnip::linear_reg( ___ ) %>% ___

lm_mod_glmnet <- 
  parsnip::linear_reg( ___ ) %>% ___

lm_mod_rforest <- 
  parsnip::rand_forest( ___ ) %>% ___

Exercise 5

Use parsnip::translate() on each model to see the code object that is specific to a particular engine

YOUR ANSWER:
# PLEASE SHOW YOUR WORK

Exercise 6 Bootstrap

Create bootstrap samples for the training dataset. You can leave the parameters set to their defaults

YOUR ANSWER:
# PLEASE SHOW YOUR WORK
set.seed(8740)
train_resamples <- ___ %>% rsample::bootstraps()

This is a good place to render, commit, and push changes to your remote lab repo on GitHub. Click the checkbox next to each file in the Git pane to stage the updates you’ve made, write an informative commit message, and push. After you push the changes, the Git pane in RStudio should be empty.

Exercise 7

Create workflows with workflowsets::workflow_set using your recipe and models.

YOUR ANSWER:
# PLEASE SHOW YOUR WORK

all_workflows <- 
  workflowsets::workflow_set(
    preproc = list(base = norm_recipe),
    models = list(base = lm_mod_base, glmnet = lm_mod_glmnet, forest = lm_mod_rforest)
  )

Exercise 8

Map the default function (tune::tune_grid()) across the workflows in the workflowset you just created and update the variable all_workflows with the result

all_workflows <- all_workflows %>% 
  workflowsets::workflow_map(
    verbose = TRUE                # enable logging
    , resamples = train_resamples # a parameter passed to tune::tune_grid()
    , grid = 5                    # a parameter passed to tune::tune_grid()
  )

The updated variable all_workflows contains a nested column named result, and each cell of the column result is a tibble containing a nested column named .metrics. Write code to

  1. un-nest the metrics in the column .metrics

  2. filter out the rows for the metric rmse

  3. group by wflow_id, order the .estimate column from highest to lowest, and pick out the first row of each group

YOUR ANSWER:
# PLEASE SHOW YOUR WORK

all_workflows %>% 
  dplyr::select(wflow_id,__) %>% 
  tidyr::unnest(__) %>% 
  dplyr::select(wflow_id,__) %>% 
  tidyr::unnest(__) %>% 
  dplyr::filter(.metric == '___') %>% 
  dplyr::group_by(wflow_id) %>% 
  dplyr::arrange(desc(__) ) %>% 
  dplyr::slice(1)

Exercise 9

Run the code below and compare to your results from exercise 8.

YOUR ANSWER:
# PLEASE SHOW YOUR WORK

workflowsets::rank_results(all_workflows, rank_metric = "rmse", select_best = TRUE)

# compare to your results from exercise 8.

Exercise 10

Select the best model per the rsme metric using its id.

YOUR ANSWER:
# PLEASE SHOW YOUR WORK

best_model_workflow <- 
  all_workflows %>% 
  workflowsets::extract_workflow("__")

Finalize the workflow by setting the parameters for the best model

YOUR ANSWER:
# PLEASE SHOW YOUR WORK

best_model_workflow <- 
  best_model_workflow %>% 
  tune::finalize_workflow(
    tibble::tibble(__ = __, __ = __) # enter the name and value of the best-fit parameters
  ) 

Now compare the fits

YOUR ANSWER:
# PLEASE SHOW YOUR WORK

training_fit <- best_model_workflow %>% 
  fit(data = ames_train)

testing_fit <- best_model_workflow %>% 
  fit(data = ames_test)


# What is the ratio of the OOB prediction errors (MSE): test/train?

You’re done and ready to submit your work! Save, stage, commit, and push all remaining changes. You can use the commit message “Done with Lab 4!”, and make sure you have committed and pushed all changed files to GitHub (your Git pane in RStudio should be empty) and that all documents are updated in your repo on GitHub.

Submission

I will pull (copy) everyone’s submissions at 5:00pm on the Sunday following class, and I will work only with these copies, so anything submitted after 5:00pm will not be graded. (don’t forget to commit and then push your work by 5:00pm on Sunday!)

Grading

Total points available: 50 points.

Component Points
Ex 1 - 10 30