Lab 5 - Classification and clustering

BSMM 8740 Fall 2024

Author

Add your name here

Introduction

In today’s lab, you’ll practice building workflows with recipes, parsnip models, rsample cross-validation, model tuning, and model comparison in the context of classification and clustering.

Learning goals

By the end of the lab you will…

  • Be able to build workflows to fit different classification models.
  • Be able to build workflows to evaluate different clustering models.

Getting started

  • To complete the lab, log on to your GitHub account, then go to the class GitHub organization and find the 2024-lab-5-[your github username] repository.

    Create an R project using your 2024-lab-5-[your github username] repository (remember to create a PAT, etc.) and add your answers by editing the 2024-lab-5.qmd file in your repository.

  • When you are done, be sure to: save your document, stage, commit and push your work.

Important

To access GitHub from the lab, you will need to make sure you are logged in as follows:

  • username: .\daladmin
  • password: Business507!

Remember to create a PAT and set your git credentials:

  • create your PAT using usethis::create_github_token() ,
  • store your PAT with gitcreds::gitcreds_set() ,
  • set your username and email with
    • usethis::use_git_config( user.name = ___, user.email = ___)

Packages

# check if 'librarian' is installed and if not, install it
if (! "librarian" %in% rownames(installed.packages()) ){
  install.packages("librarian")
}
  
# load packages if not already loaded
librarian::shelf(
  tidyverse, magrittr, gt, gtExtras, tidymodels, DataExplorer, skimr, janitor, ggplot2, forcats,
  broom, yardstick, parsnip, workflows, rsample, tune, dials
)

# set the default theme for plotting
theme_set(theme_bw(base_size = 18) + theme(legend.position = "top"))

The Data

Today we will be using customer churn data.

In the customer management lifecycle, customer churn refers to a decision made by the customer to end the business relationship. It is also referred to as loss of clients or customers. This dataset contains 20 features related to churn in a telecom context, and we will look at how to predict churn and estimate the effect of the predictors on the customer churn odds ratio.

data <- 
  readr::read_csv("data/Telco-Customer-Churn.csv", show_col_types = FALSE) |> 
  dplyr::mutate(churn = as.factor(churn))

Exercise 1: EDA

Write and execute the code to perform summary EDA on the data using the package skimr. Plot histograms for monthly charges and tenure. Tenure measures the strength of the customer relationship by measuring the length of time that a person has been a customer.

YOUR ANSWER:
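As a starting point, here is a minimal sketch of one possible approach (the histogram column names monthly_charges and tenure are assumptions about this file; check the skimr output for the actual names):

# summary EDA
skimr::skim(data)

# histograms (column names assumed; adjust to match the data)
data |> ggplot(aes(x = monthly_charges)) + geom_histogram(bins = 30)
data |> ggplot(aes(x = tenure)) + geom_histogram(bins = 30)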

Exercise 2: train / test splits & recipe

Write and execute code to create training and test datasets. Have the training dataset represent 70% of the total data.

Next create a recipe where churn is related to all the other variables, and

  • normalize the numeric variables
  • create dummy variables for the ordinal predictors

Make sure the steps are in a sequence that preserves the (0,1) dummy variables.

Prep the data on the training data and show the result.

YOUR ANSWER:
set.seed(8740)

# split data


# create a recipe

Exercise 3: logistic modeling

  1. Create a linear model using logistic regression to predict churn. For the set_engine stage use “glm”, and set the mode to “classification”.
  2. Create a workflow using the recipe of the last exercise and the model from the last step.
  3. With the workflow, fit the training data.
  4. Combine the training data and the predictions from step 3 using broom::augment, and assign the result to a variable.
  5. Create a combined metric function using yardstick::metric_set as shown in the code below.
  6. Use the variable from step 4 as the first argument to the function from step 5. The other arguments are truth = churn (from the data) and estimate = .pred_class (from step 4). Make a note of the numerical metrics.
  7. Use the variable from step 4 as the first argument to the functions listed below, with arguments truth = churn and estimate = .pred_No.
    • yardstick::roc_auc
    • yardstick::roc_curve followed by ggplot2::autoplot().
rank-deficiency

You may see a rank-deficiency warning when this model is fit. You can ignore it; it means some predictors are linearly dependent (redundant) once the dummy variables are created.

YOUR ANSWER:
# create a logistic regression model

# create a workflow

# fit the workflow

# augment the training data with the fitted data
# create the metrics function
m_set_fn <- 
  yardstick::metric_set(
    yardstick::accuracy
    , yardstick::precision
    , yardstick::recall
    , yardstick::f_meas
    , yardstick::spec
    , yardstick::sens
    , yardstick::ppv
    , yardstick::npv
)
# compute roc_auc and plot the roc_curve
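A minimal sketch of steps 1–7, reusing churn_rec and data_train from the exercise-2 sketch (all object names are illustrative):

# logistic regression model specification
log_mod <- 
  parsnip::logistic_reg() |>
  parsnip::set_engine("glm") |>
  parsnip::set_mode("classification")

# workflow = recipe + model
churn_wf <- 
  workflows::workflow() |>
  workflows::add_recipe(churn_rec) |>
  workflows::add_model(log_mod)

# fit the workflow on the training data
churn_fit <- churn_wf |> fit(data = data_train)

# augment the training data with the predictions
churn_aug <- broom::augment(churn_fit, data_train)

# class-prediction metrics
m_set_fn(churn_aug, truth = churn, estimate = .pred_class)

# roc_auc and ROC curve (.pred_No assumes "No" is the first level of the churn factor)
yardstick::roc_auc(churn_aug, truth = churn, .pred_No)
yardstick::roc_curve(churn_aug, truth = churn, .pred_No) |> ggplot2::autoplot()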

Exercise 4: effects

Use broom::tidy() on the fit object from exercise 3 to get the predictor coefficients. Sort them in decreasing order by absolute value.

What is the effect of one additional year of tenure on the churn odds ratio?

YOUR ANSWER:
# 
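A minimal sketch, assuming churn_fit is the fitted workflow from the exercise-3 sketch (if broom::tidy() does not accept a workflow in your installed version, pipe through workflows::extract_fit_parsnip() first):

# coefficients sorted by absolute value
broom::tidy(churn_fit) |>
  dplyr::arrange(dplyr::desc(abs(estimate)))

# coefficients are on the log-odds scale; exponentiate to get the odds-ratio multiplier
# (the term name "tenure" is an assumption about this dataset's column names)
broom::tidy(churn_fit) |>
  dplyr::filter(term == "tenure") |>
  dplyr::mutate(odds_ratio = exp(estimate))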

The effect of one additional year of tenure on the churn odds ratio is __.

Exercise 5: knn modeling

Now we will create a K-nearest neighbours model to estimate churn. To do this, write the code for the following steps:

  1. Create a K-nearest neighbours model to predict churn using parsnip::nearest_neighbor with argument neighbors = 3, which will use the three most similar data points from the training set to predict churn. For the set_engine stage use “kknn”, and set the mode to “classification”.
  2. Take the workflow from exercise 3 and create a new workflow by updating the original workflow. Use workflows::update_model to swap out the original logistic model for the nearest neighbour model.
  3. Use the new workflow to fit the training data. Take the fit and use broom::augment to augment the fit with the training data.
  4. Use the augmented data from step 3 to plot the ROC curve, using yardstick::roc_curve(.pred_No, truth = churn) as in exercise 3. How do you interpret this curve?
  5. Take the fit from step 3 and use broom::augment to augment the fit with the test data.
  6. Repeat step 4 using the augmented data from step 5.
YOUR ANSWER:
?parsnip::nearest_neighbor

# create a knn classification model

# create a workflow

# fit the workflow

# augment the training data with the fitted data
# compute the metrics
?yardstick::roc_curve
# compute roc_auc and plot the roc_curve
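A minimal sketch of steps 1–6, reusing churn_wf, data_train, and data_test from the earlier sketches:

# knn model specification with 3 neighbours
knn_mod <- 
  parsnip::nearest_neighbor(neighbors = 3) |>
  parsnip::set_engine("kknn") |>
  parsnip::set_mode("classification")

# swap the logistic model out of the exercise-3 workflow
knn_wf <- churn_wf |> workflows::update_model(knn_mod)

# fit on the training data
knn_fit <- knn_wf |> fit(data = data_train)

# ROC curve on the training data
broom::augment(knn_fit, data_train) |>
  yardstick::roc_curve(.pred_No, truth = churn) |>
  ggplot2::autoplot()

# ROC curve on the test data
broom::augment(knn_fit, data_test) |>
  yardstick::roc_curve(.pred_No, truth = churn) |>
  ggplot2::autoplot()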

Exercise 6: cross validation

Following the last exercise, we should have some concerns about over-fitting by the nearest-neighbour model.

To address this we will use cross validation to tune the model and evaluate the fits.

  1. Create a cross-validation dataset based on 5 folds using rsample::vfold_cv.
  2. Using the knn workflow from exercise 5, apply tune::fit_resamples with arguments resamples and control where the resamples are the dataset created in step 1 and control is tune::control_resamples(save_pred = TRUE), which will ensure that the predictions are saved.
  3. Use tune::collect_metrics() on the results from step 2
  4. Use tune::collect_predictions() on the results from step 2 to plot the roc_auc curve as in exercise 5. Has it changed much from exercise 5?
YOUR ANSWER:
?rsample::vfold_cv

# create v-fold cross validation data

# use tune::fit_resamples on the cv data, saving the predictions
?tune::fit_resamples

# collect the metrics

# compute the roc_curve
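A minimal sketch, reusing knn_wf and data_train from the exercise-5 sketch:

# 5-fold cross-validation on the training data
folds <- rsample::vfold_cv(data_train, v = 5)

# fit the knn workflow on each resample, saving the predictions
knn_res <- 
  knn_wf |>
  tune::fit_resamples(
    resamples = folds,
    control = tune::control_resamples(save_pred = TRUE)
  )

# cross-validated metrics
tune::collect_metrics(knn_res)

# ROC curve from the held-out predictions
tune::collect_predictions(knn_res) |>
  yardstick::roc_curve(.pred_No, truth = churn) |>
  ggplot2::autoplot()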

This is a good place to render, commit, and push changes to your remote lab repo on GitHub. Click the checkbox next to each file in the Git pane to stage the updates you’ve made, write an informative commit message, and push. After you push the changes, the Git pane in RStudio should be empty.

Exercise 7: tuning for k

In this exercise we’ll tune the number of nearest neighbours in our model to see if we can improve performance.

  1. Redo exercise 5 steps 1 and 2, setting neighbors = tune::tune() for the model, and then updating the workflow with workflows::update_model.
  2. Use dials::grid_regular(dials::neighbors(), levels = 10) to create a grid for tuning k.
  3. Use tune::tune_grid with tune::control_grid(save_pred = TRUE) and yardstick::metric_set(yardstick::accuracy, yardstick::roc_auc) to generate tuning results
YOUR ANSWER:
?tune::tune_grid

# re-specify the model for tuning


# update the workflow


# make a grid for tuning


# use the grid to tune the model


# show the tuning results dataframe
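A minimal sketch, reusing knn_wf and folds from the earlier sketches:

# re-specify the model with neighbors marked for tuning
knn_tune_mod <- 
  parsnip::nearest_neighbor(neighbors = tune::tune()) |>
  parsnip::set_engine("kknn") |>
  parsnip::set_mode("classification")

# update the workflow with the tunable model
knn_tune_wf <- knn_wf |> workflows::update_model(knn_tune_mod)

# a regular grid of 10 candidate values for k
knn_grid <- dials::grid_regular(dials::neighbors(), levels = 10)

# tune over the grid on the cross-validation folds
knn_tune_res <- 
  knn_tune_wf |>
  tune::tune_grid(
    resamples = folds,
    grid = knn_grid,
    control = tune::control_grid(save_pred = TRUE),
    metrics = yardstick::metric_set(yardstick::accuracy, yardstick::roc_auc)
  )

knn_tune_res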

Exercise 8

Use tune::collect_metrics() to collect the metrics from the tuning results in exercise 7 and then plot the metrics as a function of k using the code below.

YOUR ANSWER:
# collect the metrics

# plot the collected metrics as a function of K
_your_metrics_ |>
  ggplot(aes(neighbors, mean)) +
  geom_line(linewidth = 1.5, alpha = 0.6) +
  geom_point(size = 2) +
  facet_wrap(~ .metric, scales = "free", nrow = 2)
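One possible way to collect the metrics, assuming knn_tune_res names the tuning result from the exercise-7 sketch:

# collect the tuning metrics and use them in place of _your_metrics_ in the plot above
knn_metrics <- tune::collect_metrics(knn_tune_res)
knn_metrics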

Exercise 9

Use tune::show_best and tune::select_best with argument “roc_auc” to find the best k for the knn classification model. Then

  1. update the workflow using tune::finalize_workflow to set the best k value.
  2. use tune::last_fit with the updated workflow from step 1, evaluated on the split data from exercise 2 to finalize the fit.
  3. use tune::collect_metrics() to get the metrics for the best fit
  4. use tune::collect_predictions() to get the predictions and plot the roc_auc as in the prior exercises
YOUR ANSWER:
?tune::finalize_workflow
# show the roc_auc metrics

# select the best roc_auc metric (using a function from tune::)

# finalize the workflow with the best neighbors value from the last step

# use tune::last_fit with the finalized workflow on the data_split (ex 2)

# collect the metrics from the final fit

# collect the predictions from the final fit and plot the roc_curve
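A minimal sketch, reusing knn_tune_res, knn_tune_wf, and data_split from the earlier sketches:

# inspect and select the best k by roc_auc
tune::show_best(knn_tune_res, metric = "roc_auc")
best_k <- tune::select_best(knn_tune_res, metric = "roc_auc")

# finalize the workflow with the selected k
final_wf <- knn_tune_wf |> tune::finalize_workflow(best_k)

# fit on the training data and evaluate on the test data from the exercise-2 split
final_fit <- final_wf |> tune::last_fit(data_split)

# metrics for the final fit
tune::collect_metrics(final_fit)

# ROC curve from the test-set predictions
tune::collect_predictions(final_fit) |>
  yardstick::roc_curve(.pred_No, truth = churn) |>
  ggplot2::autoplot()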

Exercise 10: clustering

Load the data for this exercise as below and plot it, and then create an analysis dataset with the cluster labels removed

YOUR ANSWER:
# read the data
labelled_points <- readr::read_csv("data/lab_5_clusters.csv", show_col_types = FALSE)

# plot the clusters
labelled_points |> ggplot(aes(x1, x2, color = cluster)) +
  geom_point(alpha = 0.3) + 
  theme(legend.position="none")

# remove cluster labels to make the analysis dataset
points <-
  labelled_points |>
  select(?)

You have frequently used broom::augment to combine a model with the data set, and broom::tidy to summarize model components; broom::glance is used similarly to summarize goodness-of-fit metrics.

Now perform k-means clustering on the points data for different values of k as follows:

kclusts <-
  # number of clusters from 1-9
  tibble(k = 1:9) |>
  # mutate to add columns
  mutate(
    # a list-column with the results of the kmeans function (clustering)
    kclust = purrr::map(k, ~stats::kmeans(points, .x)),
    # a list-column with the results broom::tidy applied to the clustering results
    tidied = purrr::map(kclust, broom::tidy),
    # a list-column with the results broom::glance applied to the clustering results
    glanced = purrr::map(kclust, broom::glance),
    # a list-column with the results broom::augment applied to the clustering results
    augmented = purrr::map(kclust, broom::augment, points)
  )

(i) Create 3 variables by tidyr::unnesting the appropriate columns of kclusts

# take kclusts and use tidyr::unnest() on the appropriate columns

clusters <- ?

assignments <- ?

clusterings <- ?
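A minimal sketch of the unnesting step; the list-column names match the kclusts definition above:

clusters    <- kclusts |> tidyr::unnest(cols = c(tidied))
assignments <- kclusts |> tidyr::unnest(cols = c(augmented))
clusterings <- kclusts |> tidyr::unnest(cols = c(glanced))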

(ii) Use the assignments variable to plot the cluster assignments generated by stats::kmeans

# plot the points assigned to each cluster
p <- assignments |> ggplot(aes(x = x1, y = x2)) +
  geom_point(aes(color = .cluster), alpha = 0.8) +
  facet_wrap(~ k) + theme(legend.position="none")
p

(iii) Use the clusters variable to add the cluster centers to the plot

# on the last plot, mark the cluster centres with an X
p + geom_point(data = clusters, size = 10, shape = "x")

(iv) Use the clusterings variable to plot the total within sum of squares value by number of clusters.

# make a separate line-and-point plot with the tot-withinss data by cluster number
clusterings |> ggplot(aes(k, tot.withinss)) +
  geom_line() +
  geom_point()

(v) Using the results of parts (iii) and (iv), the k (number of clusters) that gives the best results is __.

You’re done and ready to submit your work! Save, stage, commit, and push all remaining changes. You can use the commit message “Done with Lab 5!” , and make sure you have committed and pushed all changed files to GitHub (your Git pane in RStudio should be empty) and that all documents are updated in your repo on GitHub.

Submission

I will pull (copy) everyone’s repository submissions at 5:00pm on the Sunday following class, and I will work only with these copies, so anything submitted after 5:00pm will not be graded. (don’t forget to commit and then push your work!)

Grading

Total points available: 30.

Component         Points
Exercises 1 - 10  30