```r
# check if 'librarian' is installed and if not, install it
if (!"librarian" %in% rownames(installed.packages())) {
  install.packages("librarian")
}
# load packages if not already loaded
librarian::shelf(
  tidyverse, magrittr, tidymodels, modeldata, ranger, rsample, broom, recipes, parsnip
)
# set the default theme for plotting
theme_set(theme_bw(base_size = 18) + theme(legend.position = "top"))
```
Lab 4 - The TidyModels Package
BSMM 8740 Fall 2024
Introduction
In today’s lab, you’ll practice building `workflowsets` with `recipes`, `parsnip` models, `rsample` cross-validation, model tuning, and model comparison.
Learning goals
By the end of the lab you will…
- Be able to build workflows to evaluate different models and feature sets.
Getting started
Log in to your GitHub account and then go to the GitHub organization for the course and find the 2024-lab-4-[your github username] repository to complete the lab.
Create an R project using your 2024-lab-4-[your github username] repository (remember to create a PAT, etc., as in lab-1) and add your answers by editing the `2024-lab-4.qmd` file in your repository. When you are done, be sure to save your document, then stage, commit, and push your work.
To access Github from the lab, you will need to make sure you are logged in as follows:
- username: .\daladmin
- password: Business507!
Remember to create a PAT and set your git credentials:
- create your PAT using `usethis::create_github_token()`,
- store your PAT with `gitcreds::gitcreds_set()`,
- set your username and email with `usethis::use_git_config(user.name = ___, user.email = ___)`
Packages
The Data
Today we will be using the Ames Housing Data.
This data set, from De Cock (2011), has 82 fields recorded for 2,930 properties in Ames, Iowa, in the US. The version in the `modeldata` package is copied from the `AmesHousing` package but does not include a few quality columns that appear to be outcomes rather than predictors.
```r
dat <- modeldata::ames
```
The data dictionary can be found on the internet:
```r
cat(readr::read_file("http://jse.amstat.org/v19n3/decock/DataDocumentation.txt"))
```
Exercise 1: EDA
Write and execute the code to perform summary EDA on the Ames Housing data using the package `skimr`.
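A minimal sketch of one possible approach, assuming the data object `dat` created above (note that `skimr` is not in the package list at the top, so you may need to install and load it separately):

```r
# summary EDA of the Ames data with skimr
dat %>% skimr::skim()
```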
Exercise 2: Train / Test Splits
Write and execute code to create training and test datasets. Have the training dataset represent 75% of the total data.
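A sketch using `rsample`; the object names (`ames_split`, `ames_train`, `ames_test`) and the seed value are illustrative choices, not part of the lab instructions:

```r
set.seed(8740)  # arbitrary seed for reproducibility

# 75% training / 25% test split
ames_split <- rsample::initial_split(dat, prop = 0.75)
ames_train <- rsample::training(ames_split)
ames_test  <- rsample::testing(ames_split)
```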
Exercise 3: Data Preprocessing
Create a recipe based on the formula `Sale_Price ~ Longitude + Latitude + Lot_Area + Neighborhood + Year_Sold` with the following steps:
- transform the variable `Sale_Price` to `log(Sale_Price)`
- center and scale all predictors
- create dummy variables for all nominal variables
- transform the variable `Neighborhood` to pool infrequent values (see `recipes::step_other`)
Finally, prep the recipe.
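One possible sketch, assuming the training data from Exercise 2 is named `ames_train`; the step order (pooling before dummy coding) and the `step_other` threshold are choices, not requirements from the lab:

```r
# recipe sketch: object names, step order, and threshold are assumptions
ames_rec <- recipes::recipe(
  Sale_Price ~ Longitude + Latitude + Lot_Area + Neighborhood + Year_Sold,
  data = ames_train
) %>%
  recipes::step_log(Sale_Price) %>%                           # log-transform the outcome
  recipes::step_other(Neighborhood, threshold = 0.01) %>%     # pool infrequent neighborhoods
  recipes::step_dummy(recipes::all_nominal_predictors()) %>%  # dummy variables for nominal predictors
  recipes::step_normalize(recipes::all_numeric_predictors())  # center and scale predictors

ames_rec_prepped <- recipes::prep(ames_rec)
```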
Exercise 4: Modeling
Create three regression models:
- a base regression model using `lm`
- a regression model using `glmnet`; set the model parameters `penalty` and `mixture` for tuning
- a tree model using the `ranger` engine; set the model parameters `min_n` and `trees` for tuning
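A minimal sketch of the three model specifications; the object names are illustrative:

```r
# base linear regression model
lm_mod <- parsnip::linear_reg() %>%
  parsnip::set_engine("lm")

# regularized regression with penalty and mixture marked for tuning
glmnet_mod <- parsnip::linear_reg(penalty = tune::tune(), mixture = tune::tune()) %>%
  parsnip::set_engine("glmnet")

# random forest via ranger with min_n and trees marked for tuning
rf_mod <- parsnip::rand_forest(min_n = tune::tune(), trees = tune::tune()) %>%
  parsnip::set_engine("ranger") %>%
  parsnip::set_mode("regression")
```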
Exercise 5
Use `parsnip::translate()` on each model to see the code object that is specific to a particular engine.
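For example, assuming the model objects sketched in Exercise 4:

```r
# show the engine-specific fit call for each specification
lm_mod %>% parsnip::translate()
glmnet_mod %>% parsnip::translate()
rf_mod %>% parsnip::translate()
```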
Exercise 6: Bootstrap
Create bootstrap samples for the training dataset. You can leave the parameters set to their defaults.
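A sketch using `rsample::bootstraps()`; the name `train_resamples` matches the object used in Exercise 8, while `ames_train` is the assumed name of the training data from the Exercise 2 sketch:

```r
# bootstrap resamples of the training data, using default arguments
train_resamples <- rsample::bootstraps(ames_train)
```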
This is a good place to render, commit, and push changes to your remote lab repo on GitHub. Click the checkbox next to each file in the Git pane to stage the updates you’ve made, write an informative commit message, and push. After you push the changes, the Git pane in RStudio should be empty.
Exercise 7
Create workflows with `workflowsets::workflow_set` using your recipe and models.
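A sketch, assuming the recipe and model objects from the earlier sketches; the list names (which become part of each `wflow_id`) are arbitrary labels:

```r
# cross the single recipe with the three model specifications
all_workflows <- workflowsets::workflow_set(
  preproc = list(base = ames_rec),
  models  = list(lm = lm_mod, glmnet = glmnet_mod, ranger = rf_mod)
)
```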
Exercise 8
Map the default function (`tune::tune_grid()`) across the workflows in the workflowset you just created and update the variable `all_workflows` with the result.
```r
all_workflows <- all_workflows %>%
  workflowsets::workflow_map(
    verbose   = TRUE,            # enable logging
    resamples = train_resamples, # a parameter passed to tune::tune_grid()
    grid      = 5                # a parameter passed to tune::tune_grid()
  )
```
The updated variable `all_workflows` contains a nested column named `result`, and each cell of the column `result` is a tibble containing a nested column named `.metrics`. Write code to
- un-nest the metrics in the column `.metrics`
- filter out the rows for the metric `rmse`
- group by `wflow_id`, order the `.estimate` column from highest to lowest, and pick out the first row of each group
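One possible sketch of this pipeline; it reads "filter out the rows for the metric rmse" as keeping only the `rmse` rows, so adjust the filter if you interpret it differently:

```r
all_workflows %>%
  dplyr::select(wflow_id, result) %>%
  tidyr::unnest(result) %>%          # un-nest the tuning results
  tidyr::unnest(.metrics) %>%        # un-nest the .metrics column
  dplyr::filter(.metric == "rmse") %>%
  dplyr::group_by(wflow_id) %>%
  dplyr::arrange(dplyr::desc(.estimate), .by_group = TRUE) %>%  # highest to lowest within each group
  dplyr::slice(1)                    # first row of each group
```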
Exercise 9
Run the code below and compare to your results from exercise 8.
Exercise 10
Select the best model per the rmse metric using its id.
Finalize the workflow by setting the parameters for the best model.
Now compare the fits.
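A sketch of one possible approach; the workflow id "base_ranger" is purely illustrative (substitute the id of your best workflow), and `ames_split` is the assumed split object from the Exercise 2 sketch:

```r
# pull the tuning results for the best workflow (id is illustrative)
best_result <- all_workflows %>%
  workflowsets::extract_workflow_set_result("base_ranger")

# pick the parameter combination with the best rmse
best_params <- tune::select_best(best_result, metric = "rmse")

# finalize the workflow with those parameters, fit on the training set,
# and evaluate on the test set
final_fit <- all_workflows %>%
  workflowsets::extract_workflow("base_ranger") %>%
  tune::finalize_workflow(best_params) %>%
  tune::last_fit(ames_split)

tune::collect_metrics(final_fit)  # compare these to the resampling estimates
```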
You’re done and ready to submit your work! Save, stage, commit, and push all remaining changes. You can use the commit message “Done with Lab 4!”, and make sure you have committed and pushed all changed files to GitHub (your Git pane in RStudio should be empty) and that all documents are updated in your repo on GitHub.
I will pull (copy) everyone’s submissions at 5:00pm on the Sunday following class, and I will work only with these copies, so anything submitted after 5:00pm will not be graded. (don’t forget to commit and then push your work by 5:00pm on Sunday!)
Grading
Total points available: 50 points.
| Component | Points |
|-----------|--------|
| Ex 1 - 10 | 30     |