```r
# check if 'librarian' is installed and if not, install it
if (!"librarian" %in% rownames(installed.packages())) {
  install.packages("librarian")
}
# load packages if not already loaded
librarian::shelf(
  tidyverse, magrittr, gt, gtExtras, tidymodels, DataExplorer, skimr, janitor, ggplot2, forcats,
  broom, yardstick, parsnip, workflows, rsample, tune, dials
)
# set the default theme for plotting
theme_set(theme_bw(base_size = 18) + theme(legend.position = "top"))
```
# Lab 5 - Classification and clustering

BSMM 8740, Fall 2024
## Introduction

In today’s lab, you’ll practice building `workflowsets` with `recipes`, `parsnip` models, `rsample` cross-validations, model tuning, and model comparison in the context of classification and clustering.
## Learning goals
By the end of the lab you will…
- Be able to build workflows to fit different classification models.
- Be able to build workflows to evaluate different clustering models.
## Getting started
To complete the lab, log on to your GitHub account, then go to the class GitHub organization and find the 2024-lab-5-[your github username] repository.

Create an R project using your 2024-lab-5-[your github username] repository (remember to create a PAT, etc.) and add your answers by editing the `2024-lab-5.qmd` file in your repository. When you are done, be sure to save your document, then stage, commit, and push your work.
To access GitHub from the lab, you will need to make sure you are logged in as follows:
- username: .\daladmin
- password: Business507!
Remember to create a PAT and set your git credentials:

- create your PAT using `usethis::create_github_token()`,
- store your PAT with `gitcreds::gitcreds_set()`,
- set your username and email with `usethis::use_git_config(user.name = ___, user.email = ___)`.
## Packages

The packages used in this lab are loaded in the setup chunk at the top of this document.
## The Data
Today we will be using customer churn data.

In the customer management lifecycle, customer churn refers to a decision made by the customer to end the business relationship; it is also referred to as loss of clients or customers. This dataset contains 20 features related to churn in a telecom context, and we will look at how to predict churn and estimate the effect of predictors on the customer churn odds ratio.
```r
data <- readr::read_csv("data/Telco-Customer-Churn.csv", show_col_types = FALSE) |>
  dplyr::mutate(churn = as.factor(churn))
```
## Exercise 1: EDA
Write and execute the code to perform summary EDA on the data using the package `skimr`. Plot histograms for monthly charges and tenure. Tenure measures the strength of the customer relationship by measuring the length of time that a person has been a customer.
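A minimal sketch, assuming the data was loaded as `data` above; the histogram column names (`monthly_charges`, `tenure`) are assumptions about the cleaned column names, so verify them with `names(data)`:

```r
# summary EDA of the full dataset
data |> skimr::skim()

# histograms; `monthly_charges` and `tenure` are assumed column names
data |> ggplot(aes(x = monthly_charges)) + geom_histogram(bins = 30)
data |> ggplot(aes(x = tenure)) + geom_histogram(bins = 30)
```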
## Exercise 2: train / test splits & recipe
Write and execute code to create training and test datasets. Have the training dataset represent 70% of the total data.

Next create a recipe where churn is related to all the other variables, and

- normalize the numeric variables,
- create dummy variables for the ordinal predictors.

Make sure the steps are in a sequence that preserves the (0,1) dummy variables (i.e. normalize before creating the dummies).

Prep the recipe on the training data and show the result.
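A minimal sketch, assuming `data` from above; the seed value and object names are arbitrary choices:

```r
set.seed(8740)  # arbitrary seed, for reproducibility
data_split <- rsample::initial_split(data, prop = 0.70)
train_data <- rsample::training(data_split)
test_data  <- rsample::testing(data_split)

# normalize first so the dummy variables created afterwards stay (0,1)
churn_rec <- recipes::recipe(churn ~ ., data = train_data) |>
  recipes::step_normalize(recipes::all_numeric_predictors()) |>
  recipes::step_dummy(recipes::all_nominal_predictors())

churn_rec |> recipes::prep(training = train_data)
```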
## Exercise 3: logistic modeling
1. Create a generalized linear model using logistic regression to predict churn. For the set-engine stage use “glm”, and set the mode to “classification”.
2. Create a workflow using the recipe of the last exercise and the model of the last step.
3. With the workflow, fit the training data.
4. Combine the training data and the predictions from step 3 using `broom::augment`, and assign the result to a variable.
5. Create a combined metric function using `yardstick::metric_set`, as shown in the code below.
6. Use the variable from step 4 as the first argument to the function from step 5. The other arguments are `truth = churn` (from the data) and `estimate = .pred_class` (from step 4). Make a note of the numerical metrics.
7. Use the variable from step 4 as the first argument to the functions listed below, with arguments `truth = churn` and `estimate = .pred_No`:
   - `yardstick::roc_auc`
   - `yardstick::roc_curve`, followed by `ggplot2::autoplot()`.
You can ignore any warning message printed when fitting; it means that there are a lot of predictors.
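A minimal sketch of these steps, assuming `churn_rec` and `train_data` from exercise 2; the specific metrics passed to `yardstick::metric_set` are assumptions, so substitute whatever the lab’s own chunk specifies:

```r
# steps 1-3: model, workflow, fit
lr_model <- parsnip::logistic_reg() |>
  parsnip::set_engine("glm") |>
  parsnip::set_mode("classification")

lr_workflow <- workflows::workflow() |>
  workflows::add_recipe(churn_rec) |>
  workflows::add_model(lr_model)

lr_fit <- lr_workflow |> parsnip::fit(data = train_data)

# step 4: augment the training data with predictions
lr_aug <- lr_fit |> broom::augment(train_data)

# step 5: combined metric function (metric choices assumed)
churn_metrics <- yardstick::metric_set(
  yardstick::accuracy, yardstick::sensitivity, yardstick::specificity
)

# step 6: numerical metrics
lr_aug |> churn_metrics(truth = churn, estimate = .pred_class)

# step 7: AUC and ROC curve
lr_aug |> yardstick::roc_auc(.pred_No, truth = churn)
lr_aug |> yardstick::roc_curve(.pred_No, truth = churn) |> ggplot2::autoplot()
```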
## Exercise 4: effects
Use `broom::tidy()` on the fit object from exercise 3 to get the predictor coefficients. Sort them in decreasing order by absolute value.
What is the effect of one additional year of tenure on the churn odds ratio?
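A minimal sketch, assuming `lr_fit` from exercise 3 (if `broom::tidy()` does not accept the workflow fit directly, extract the underlying model first with `workflows::extract_fit_parsnip()`). For interpretation, recall that `exp(coefficient)` multiplies the odds per one-unit change in a predictor, and that the recipe normalized `tenure`, so one unit here is one standard deviation of tenure rather than one month or year:

```r
lr_fit |>
  broom::tidy() |>
  dplyr::arrange(dplyr::desc(abs(estimate)))
```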
## Exercise 5: knn modeling
Now we will create a K-nearest neighbours model to estimate churn. To do this, write the code for the following steps (a sketch follows the list):

1. Create a K-nearest neighbours model to predict churn using `parsnip::nearest_neighbor` with argument `neighbors = 3`, which will use the three most similar data points from the training set to predict churn. For the set-engine stage use “kknn”, and set the mode to “classification”.
2. Take the workflow from exercise 3 and create a new workflow by updating the original workflow. Use `workflows::update_model` to swap out the original logistic model for the nearest-neighbour model.
3. Use the new workflow to fit the training data. Take the fit and use `broom::augment` to augment the fit with the training data.
4. Use the augmented data from step 3 to plot the ROC curve, using `yardstick::roc_curve(.pred_No, truth = churn)` as in exercise 3. How do you interpret this curve?
5. Take the fit from step 3 and use `broom::augment` to augment the fit with the test data.
6. Repeat step 4 using the augmented data from step 5.
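A minimal sketch, assuming `lr_workflow`, `train_data`, and `test_data` from the earlier exercises:

```r
# step 1: knn model specification
knn_model <- parsnip::nearest_neighbor(neighbors = 3) |>
  parsnip::set_engine("kknn") |>
  parsnip::set_mode("classification")

# step 2: swap the model in the existing workflow
knn_workflow <- lr_workflow |> workflows::update_model(knn_model)

# step 3: fit on the training data
knn_fit <- knn_workflow |> parsnip::fit(data = train_data)

# step 4: ROC curve on the training data
knn_fit |>
  broom::augment(train_data) |>
  yardstick::roc_curve(.pred_No, truth = churn) |>
  ggplot2::autoplot()

# steps 5-6: ROC curve on the test data
knn_fit |>
  broom::augment(test_data) |>
  yardstick::roc_curve(.pred_No, truth = churn) |>
  ggplot2::autoplot()
```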
## Exercise 6: cross validation
Following the last exercise, we should have some concerns about over-fitting by the nearest-neighbour model.
To address this we will use cross validation to tune the model and evaluate the fits.
1. Create a cross-validation dataset based on 5 folds using `rsample::vfold_cv` (see the sketch following this list).
2. Using the knn workflow from exercise 5, apply `tune::fit_resamples` with arguments `resamples` and `control`, where the resamples are the dataset created in step 1 and the control is `tune::control_resamples(save_pred = TRUE)`, which will ensure that the predictions are saved.
3. Use `tune::collect_metrics()` on the results from step 2.
4. Use `tune::collect_predictions()` on the results from step 2 to plot the ROC curve as in exercise 5. Has it changed much from exercise 5?
This is a good place to render, commit, and push changes to your remote lab repo on GitHub. Click the checkbox next to each file in the Git pane to stage the updates you’ve made, write an informative commit message, and push. After you push the changes, the Git pane in RStudio should be empty.
## Exercise 7: tuning for k
In this exercise we’ll tune the number of nearest neighbours in our model to see if we can improve performance.
1. Redo exercise 5 steps 1 and 2, setting `neighbors = tune::tune()` for the model, and then updating the workflow with `workflows::update_model`.
2. Use `dials::grid_regular(dials::neighbors(), levels = 10)` to create a grid for tuning k.
3. Use `tune::tune_grid` with `tune::control_grid(save_pred = TRUE)` and `yardstick::metric_set(yardstick::accuracy, yardstick::roc_auc)` to generate the tuning results (see the sketch below).
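A minimal sketch, assuming `knn_workflow` and `folds` from the previous exercises:

```r
# step 1: knn model with a tunable k, swapped into the workflow
knn_tune_model <- parsnip::nearest_neighbor(neighbors = tune::tune()) |>
  parsnip::set_engine("kknn") |>
  parsnip::set_mode("classification")

knn_tune_wf <- knn_workflow |> workflows::update_model(knn_tune_model)

# step 2: regular grid of k values
k_grid <- dials::grid_regular(dials::neighbors(), levels = 10)

# step 3: tune over the grid, saving predictions
knn_tune_res <- knn_tune_wf |>
  tune::tune_grid(
    resamples = folds,
    grid = k_grid,
    metrics = yardstick::metric_set(yardstick::accuracy, yardstick::roc_auc),
    control = tune::control_grid(save_pred = TRUE)
  )
```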
## Exercise 8
Use `tune::collect_metrics()` to collect the metrics from the tuning results in exercise 7 and then plot the metrics as a function of k using the code below.
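The lab supplies its own plotting code in the original chunk; in case it is missing here, the following is one possible sketch, assuming the tuning results from exercise 7 are stored in `knn_tune_res`:

```r
knn_tune_res |>
  tune::collect_metrics() |>
  ggplot(aes(x = neighbors, y = mean)) +
  geom_line() +
  geom_point() +
  facet_wrap(~ .metric, scales = "free_y") +
  labs(x = "k (neighbors)", y = "mean metric across folds")
```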
## Exercise 9
Use `tune::show_best` and `tune::select_best` with argument `metric = "roc_auc"` to find the best k for the knn classification model. Then

1. update the workflow using `tune::finalize_workflow` to set the best k value,
2. use `tune::last_fit` with the updated workflow from step 1, evaluated on the split data from exercise 2, to finalize the fit,
3. use `tune::collect_metrics()` to get the metrics for the best fit,
4. use `tune::collect_predictions()` to get the predictions and plot the ROC curve as in the prior exercises (a sketch follows).
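A minimal sketch, assuming `knn_tune_res`, `knn_tune_wf`, and `data_split` from the earlier exercises:

```r
knn_tune_res |> tune::show_best(metric = "roc_auc")
best_k <- knn_tune_res |> tune::select_best(metric = "roc_auc")

# step 1: fix k at its best value
final_wf <- knn_tune_wf |> tune::finalize_workflow(best_k)

# step 2: fit on the training set, evaluate on the test set of the original split
final_fit <- final_wf |> tune::last_fit(data_split)

# step 3: test-set metrics
final_fit |> tune::collect_metrics()

# step 4: test-set ROC curve
final_fit |>
  tune::collect_predictions() |>
  yardstick::roc_curve(.pred_No, truth = churn) |>
  ggplot2::autoplot()
```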
## Exercise 10: clustering
Load the data for this exercise as below, plot it, and then create an analysis dataset with the cluster labels removed.
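The actual load-and-plot code is given in the lab’s own chunk; the following is only a hypothetical sketch of the plot-and-strip-labels step, with made-up names (`cluster_data` and its columns `x1`, `x2`, `cluster` are placeholders):

```r
# hypothetical names throughout; substitute the lab's actual object and columns
cluster_data |> ggplot(aes(x = x1, y = x2, colour = cluster)) + geom_point()

# analysis dataset with the cluster labels removed
analysis_data <- cluster_data |> dplyr::select(-cluster)
```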
You’re done and ready to submit your work! Save, stage, commit, and push all remaining changes. You can use the commit message “Done with Lab 5!”, and make sure you have committed and pushed all changed files to GitHub (your Git pane in RStudio should be empty) and that all documents are updated in your repo on GitHub.
I will pull (copy) everyone’s repository submissions at 5:00pm on the Sunday following class, and I will work only with these copies, so anything submitted after 5:00pm will not be graded. (don’t forget to commit and then push your work!)
## Grading
Total points available: 30 points.
| Component | Points |
|-----------|--------|
| Ex 1 - 10 | 30     |