Lab 5 - Classification & clustering methods

SOLUTIONS

Introduction

In today’s lab, you’ll practice building workflows that combine recipes and parsnip models, along with rsample cross-validation, model tuning, and model comparison, in the context of classification and clustering.

Packages

# check if 'librarian' is installed and if not, install it
if (! "librarian" %in% rownames(installed.packages()) ){
  install.packages("librarian")
}
  
# load packages if not already loaded
librarian::shelf(tidyverse, magrittr, gt, gtExtras, tidymodels, ggplot2)

# set the default theme for plotting
theme_set(theme_bw(base_size = 18) + theme(legend.position = "top"))

The Data

Today we will be using customer churn data.

In the customer management lifecycle, customer churn refers to a decision made by the customer to end the business relationship; it is also referred to as loss of clients or customers. This dataset contains 20 features related to churn in a telecom context, and we will look at how to predict churn and estimate the effect of the predictors on the customer churn odds ratio.

data <- 
  readr::read_csv("data/Telco-Customer-Churn.csv", show_col_types = FALSE) |>
  dplyr::mutate(churn = as.factor(churn))
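As a quick first check, you can tabulate the outcome to see the class balance; churn is imbalanced, which is worth keeping in mind when reading the metrics later:

# quick look at the outcome balance (roughly 73% No / 27% Yes)
data |> dplyr::count(churn) |> dplyr::mutate(prop = n / sum(n))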

Exercise 1: EDA

Write and execute the code to perform summary EDA on the data using the package skimr. Plot histograms for monthly charges and tenure. Tenure measures the strength of the customer relationship by measuring the length of time that a person has been a customer.

SOLUTION:
skimr::skim(data)
data |>
  ggplot(aes(x=monthly_charges)) + geom_histogram()
data |>
  ggplot(aes(x=tenure)) + geom_histogram()
Data summary
Name data
Number of rows 7032
Number of columns 21
_______________________
Column type frequency:
character 17
factor 1
numeric 3
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
gender 0 1 4 6 0 2 0
senior_citizen 0 1 2 3 0 2 0
partner 0 1 2 3 0 2 0
dependents 0 1 2 3 0 2 0
tenure_interval 0 1 9 11 0 7 0
phone_service 0 1 2 3 0 2 0
multiple_lines 0 1 2 16 0 3 0
internet_service 0 1 2 11 0 3 0
online_security 0 1 2 19 0 3 0
online_backup 0 1 2 19 0 3 0
device_protection 0 1 2 19 0 3 0
tech_support 0 1 2 19 0 3 0
streaming_tv 0 1 2 19 0 3 0
streaming_movies 0 1 2 19 0 3 0
contract 0 1 8 14 0 3 0
paperless_billing 0 1 2 3 0 2 0
payment_method 0 1 12 25 0 4 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
churn 0 1 FALSE 2 No: 5163, Yes: 1869

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
tenure 0 1 32.42 24.55 1.00 9.00 29.00 55.00 72.00 ▇▃▃▃▅
monthly_charges 0 1 64.80 30.09 18.25 35.59 70.35 89.86 118.75 ▇▅▆▇▅
total_charges 0 1 2283.30 2266.77 18.80 401.45 1397.47 3794.74 8684.80 ▇▂▂▂▁

Exercise 2: train / test splits & recipe

Write and execute code to create training and test datasets. Have the training dataset represent 70% of the total data.

Next create a recipe where churn is related to all the other variables, and

  • normalize the numeric variables
  • create dummy variables for the ordinal predictors

Make sure the steps are in a sequence that preserves the (0,1) dummy variables: normalize the numeric predictors before creating the dummies, so the dummy columns are not normalized as well.

Prep the data on the training data and show the result.

SOLUTION:
set.seed(8740)

# split data
data_split    <- rsample::initial_split(data, prop = 0.7)
default_train <- rsample::training(data_split)
default_test  <- rsample::testing(data_split)

# create a recipe
default_recipe <- default_train |>
  recipes::recipe(formula = churn ~ .) |>
  recipes::step_normalize(recipes::all_numeric_predictors()) |>
  recipes::step_dummy(recipes::all_nominal_predictors())

default_recipe |> recipes::prep(default_train) |> 
  summary()
# A tibble: 37 × 4
   variable                     type      role      source  
   <chr>                        <list>    <chr>     <chr>   
 1 tenure                       <chr [2]> predictor original
 2 monthly_charges              <chr [2]> predictor original
 3 total_charges                <chr [2]> predictor original
 4 churn                        <chr [3]> outcome   original
 5 gender_Male                  <chr [2]> predictor derived 
 6 senior_citizen_Yes           <chr [2]> predictor derived 
 7 partner_Yes                  <chr [2]> predictor derived 
 8 dependents_Yes               <chr [2]> predictor derived 
 9 tenure_interval_X0.6.Month   <chr [2]> predictor derived 
10 tenure_interval_X12.24.Month <chr [2]> predictor derived 
# ℹ 27 more rows
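To inspect the transformed training data itself (not just the variable summary above), you can prep the recipe and bake it; a short check, where new_data = NULL returns the retained training set:

# bake the prepped recipe to see the normalized and dummy-coded predictors
default_recipe |>
  recipes::prep(default_train) |>
  recipes::bake(new_data = NULL) |>
  dplyr::glimpse()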

Exercise 3: logistic modeling

  1. Create a linear model using logistic regression to predict churn. For the set engine stage use “glm,” and set the mode to “classification.”
  2. Create a workflow using the recipe from the last exercise and the model from the last step.
  3. With the workflow, fit the training data
  4. Combine the training data and the predictions from step 3 using broom::augment , and assign the result to a variable
  5. Create a combined metric function as shown in the code below:
  6. Use the variable from step 4 as the first argument to the function from step 5. The other arguments are truth = churn (from the data) and estimate=.pred_class (from step 4). Make a note of the numerical metrics.
  7. Use the variable from step 4 as the first argument to the functions below, with arguments truth = churn and estimate =.pred_No.
    • yardstick::roc_auc
    • yardstick::roc_curve followed by ggplot2::autoplot().
SOLUTION:
# create a logistic regression model
default_model <- parsnip::logistic_reg() |>
  parsnip::set_engine("glm") |>
  parsnip::set_mode("classification")

# create a workflow
default_workflow <- workflows::workflow() |>
  workflows::add_recipe(default_recipe) |>
  workflows::add_model(default_model)

# fit the workflow
lm_fit <-
  default_workflow |>
  parsnip::fit(default_train)

# augment the training data with the fitted model's predictions
training_results <-
  broom::augment(lm_fit, default_train)
# create the metrics function
m_set_fn <- 
  yardstick::metric_set(
    yardstick::accuracy
    , yardstick::precision
    , yardstick::recall
    , yardstick::f_meas
    , yardstick::spec
    , yardstick::sens
    , yardstick::ppv
    , yardstick::npv
)
training_results |> m_set_fn(truth = churn, estimate = .pred_class)
# A tibble: 8 × 3
  .metric   .estimator .estimate
  <chr>     <chr>          <dbl>
1 accuracy  binary         0.807
2 precision binary         0.842
3 recall    binary         0.905
4 f_meas    binary         0.872
5 spec      binary         0.546
6 sens      binary         0.905
7 ppv       binary         0.842
8 npv       binary         0.683
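All of these class-based metrics derive from the same confusion matrix, which you can inspect directly as a cross-check:

# confusion matrix underlying the metrics above
training_results |> yardstick::conf_mat(truth = churn, estimate = .pred_class)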
# compute roc_auc and plot the roc_curve
training_results |>
  yardstick::roc_auc(.pred_No, truth = churn)
training_results |>
  yardstick::roc_curve(.pred_No, truth = churn) |> autoplot()
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary         0.854

Build your own

You can construct the ROC curve directly, as follows:

# use the augmented training results to find the point on the curve for a given threshold
build_roc_point <- function(threshold, dat = training_results){
  # threshold the predicted probabilities to get hard class predictions
  log_preds <- ifelse(dat$.pred_Yes > threshold, 1, 0)
  # convert the predictions to a factor with the same levels as the truth
  log_preds <- factor(log_preds, levels = c(0,1), labels = c('No','Yes'))

  return(
    tibble::tibble(
      threshold = threshold
      , "1-Specificity" = 
        1 - yardstick::specificity_vec(truth = dat$churn, estimate = log_preds)
      , Sensitivity = 
          yardstick::sensitivity_vec(truth = dat$churn, estimate = log_preds)
    )
  )
}
# take thresholds in [0,1] and calculate the corresponding (x, y) points
((0:50)/50) |> purrr::map(~build_roc_point(.x)) |> 
  dplyr::bind_rows() |> 
    ggplot(aes(x=`1-Specificity`, y=Sensitivity)) +
    geom_point(size=1) + coord_fixed() +
    geom_abline(intercept=0, slope=1)
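As a rough numerical check against yardstick::roc_auc, the area under these manual points can be approximated with the trapezoidal rule; with 51 thresholds the result should land close to the 0.854 computed earlier:

# approximate the AUC by trapezoidal integration over the manual curve points
roc_pts <- ((0:50)/50) |>
  purrr::map(~build_roc_point(.x)) |>
  dplyr::bind_rows() |>
  dplyr::arrange(`1-Specificity`, Sensitivity)

sum(diff(roc_pts$`1-Specificity`) *
      (head(roc_pts$Sensitivity, -1) + tail(roc_pts$Sensitivity, -1)) / 2)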

Exercise 4: effects

Use broom::tidy() on the fit object from exercise 3 to get the predictor coefficients. Sort them in decreasing order by absolute value.

What is the effect of one additional year of tenure on the churn odds ratio?

SOLUTION:
fit0_tbl <- lm_fit |> broom::tidy() |>
  dplyr::arrange(desc(abs(estimate)))

fit0_tbl
# A tibble: 37 × 5
   term                         estimate std.error statistic  p.value
   <chr>                           <dbl>     <dbl>     <dbl>    <dbl>
 1 tenure_interval_X6.12.Month     -3.04     0.820    -3.71  2.11e- 4
 2 tenure_interval_X12.24.Month    -2.78     0.705    -3.94  8.02e- 5
 3 tenure_interval_X0.6.Month      -2.68     0.905    -2.96  3.09e- 3
 4 tenure_interval_X24.36.Month    -2.20     0.554    -3.96  7.51e- 5
 5 tenure                          -2.10     0.386    -5.43  5.67e- 8
 6 (Intercept)                      1.91     1.62      1.18  2.37e- 1
 7 contract_Two.year               -1.45     0.217    -6.67  2.60e-11
 8 tenure_interval_X36.48.Month    -1.28     0.402    -3.19  1.44e- 3
 9 phone_service_Yes               -1.23     0.782    -1.57  1.17e- 1
10 monthly_charges                  1.15     1.15      0.997 3.19e- 1
# ℹ 27 more rows
# pull the tenure coefficient and exponentiate it to get the odds-ratio multiplier
fit0_tbl |> dplyr::filter(term == 'tenure') |> 
  dplyr::pull(estimate) |> 
  exp()
[1] 0.1228617

Note that the recipe normalized tenure, so this is the multiplicative change in the churn odds per one standard deviation of tenure (about 24.5 months), holding the other predictors fixed. One additional year is 12/24.5 ≈ 0.49 standard deviations, so the odds ratio for one more year of tenure is roughly exp(-2.10 * 0.49) ≈ 0.36.
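The same transformation applies to every coefficient, so you can tabulate all the odds-ratio multipliers at once; a short sketch:

# exponentiate all coefficients to express each effect as an odds-ratio multiplier
fit0_tbl |>
  dplyr::mutate(odds_ratio = exp(estimate)) |>
  dplyr::select(term, estimate, odds_ratio)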

Exercise 5: knn modeling

Now we will create a K-nearest neighbours model to estimate churn. To do this, write the code for the following steps:

  1. Create a K-nearest neighbours model to predict churn using parsnip::nearest_neighbor with argument neighbors = 3, which classifies each observation using the three most similar data points from the training set. For the set engine stage use “kknn,” and set the mode to “classification.”
  2. Take the workflow from exercise 3 and create a new workflow by updating the original workflow. Use workflows::update_model to swap out the original logistic model for the nearest neighbour model.
  3. Use the new workflow to fit the training data. Take the fit and use broom::augment to augment the fit with the training data.
  4. Use the augmented data from step 3 to plot the roc curve, using yardstick::roc_curve(.pred_No, truth = churn) as in exercise 3. How do you interpret this curve?
  5. Take the fit from step 3 and use broom::augment to augment the fit with the test data.
  6. Repeat step 4 using the augmented data from step 5.
SOLUTION:
# create a knn classification model
default_model_knn <- parsnip::nearest_neighbor(neighbors = 3) |>
  parsnip::set_engine("kknn") |>
  parsnip::set_mode("classification")

# create a workflow
default_workflow_knn <- default_workflow |>
  workflows::update_model(default_model_knn)

# fit the workflow
knn_fit <-
  default_workflow_knn |>
  parsnip::fit(default_train)

# augment the training data with the fit's predictions
training_results_knn <-
  broom::augment(knn_fit, default_train)
# compute the metrics
training_results_knn |> m_set_fn(truth = churn, estimate = .pred_class)
# A tibble: 8 × 3
  .metric   .estimator .estimate
  <chr>     <chr>          <dbl>
1 accuracy  binary         0.999
2 precision binary         0.999
3 recall    binary         1.00 
4 f_meas    binary         0.999
5 spec      binary         0.998
6 sens      binary         1.00 
7 ppv       binary         0.999
8 npv       binary         0.999
# compute roc_auc and plot the roc_curve
training_results_knn |>
  yardstick::roc_auc(.pred_No, truth = churn)
training_results_knn |>
  yardstick::roc_curve(.pred_No, truth = churn) |> autoplot()
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary         0.999

Exercise 6: cross validation

Following the last exercise, we should have some concerns about over-fitting by the nearest-neighbour model.

To address this we will use cross validation to tune the model and evaluate the fits.

  1. Create a cross-validation dataset based on 5 folds using rsample::vfold_cv.
  2. Using the knn workflow from exercise 5, apply tune::fit_resamples with arguments resamples and control where the resamples are the dataset created in step 1 and control is tune::control_resamples(save_pred = TRUE), which will ensure that the predictions are saved.
  3. Use tune::collect_metrics() on the results from step 2
  4. Use tune::collect_predictions() on the results from step 2 to plot the roc_auc curve as in exercise 5. Has it changed much from exercise 5?
SOLUTION:
# create v-fold cross validation data
data_vfold_cv <- data |> rsample::vfold_cv(v=5)

# use tune::fit_resamples on the cv data, saving the predictions
knn_fit_rs <-
  default_workflow_knn |>
  tune::fit_resamples(data_vfold_cv, control = tune::control_resamples(save_pred = TRUE))

# collect the metrics
knn_fit_rs |> tune::collect_metrics()
# A tibble: 3 × 6
  .metric     .estimator  mean     n std_err .config             
  <chr>       <chr>      <dbl> <int>   <dbl> <chr>               
1 accuracy    binary     0.729     5 0.00480 Preprocessor1_Model1
2 brier_class binary     0.207     5 0.00374 Preprocessor1_Model1
3 roc_auc     binary     0.746     5 0.00790 Preprocessor1_Model1
# compute the roc_curve
knn_fit_rs |> tune::collect_predictions() |>
  yardstick::roc_curve(.pred_No, truth = churn) |> autoplot()
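To see the fold-to-fold variation behind these averages, the metrics can also be collected without summarizing:

# per-fold metrics rather than the averaged ones
knn_fit_rs |> tune::collect_metrics(summarize = FALSE)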

This is a good place to render, commit, and push changes to your remote lab repo on GitHub. Click the checkbox next to each file in the Git pane to stage the updates you’ve made, write an informative commit message, and push. After you push the changes, the Git pane in RStudio should be empty.

Exercise 7: tuning for k

In this exercise we’ll tune the number of nearest neighbours in our model to see if we can improve performance.

  1. Redo exercise 5 steps 1 and 2, setting neighbors = tune::tune() for the model, and then updating the workflow with workflows::update_model.
  2. Use dials::grid_regular(dials::neighbors(), levels = 10) to create a grid for tuning k.
  3. Use tune::tune_grid with tune::control_grid(save_pred = TRUE) and yardstick::metric_set(yardstick::accuracy, yardstick::roc_auc) to generate the tuning results.
SOLUTION:
# re-specify the model for tuning
default_model_knn_tuned <- parsnip::nearest_neighbor(neighbors = tune::tune()) |>
  parsnip::set_engine("kknn") |>
  parsnip::set_mode("classification")

# update the workflow
default_workflow_knn <- default_workflow |>
  workflows::update_model(default_model_knn_tuned)

# make a grid of neighbor values for tuning k
neighbors_grid <-
  dials::grid_regular(dials::neighbors(), levels = 10)

# use the grid to tune the model
tune_results <- tune::tune_grid(
  default_workflow_knn,
  resamples = data_vfold_cv,
  grid = neighbors_grid,
  control = tune::control_grid(save_pred = TRUE)
  , metrics =
    yardstick::metric_set(yardstick::accuracy, yardstick::roc_auc)
)

# show the tuning results dataframe
tune_results
# Tuning results
# 5-fold cross-validation 
# A tibble: 5 × 5
  splits              id    .metrics          .notes           .predictions
  <list>              <chr> <list>            <list>           <list>      
1 <split [5625/1407]> Fold1 <tibble [20 × 5]> <tibble [0 × 3]> <tibble>    
2 <split [5625/1407]> Fold2 <tibble [20 × 5]> <tibble [0 × 3]> <tibble>    
3 <split [5626/1406]> Fold3 <tibble [20 × 5]> <tibble [0 × 3]> <tibble>    
4 <split [5626/1406]> Fold4 <tibble [20 × 5]> <tibble [0 × 3]> <tibble>    
5 <split [5626/1406]> Fold5 <tibble [20 × 5]> <tibble [0 × 3]> <tibble>    

Exercise 8

Use tune::collect_metrics() to collect the metrics from the tuning results in exercise 7 and then plot the metrics as a function of k using the code below.

SOLUTION:
# collect the metrics
tune_results |>
  tune::collect_metrics()
# plot the collected metrics as a function of K
tune_results |>
  tune::collect_metrics() |>
  ggplot(aes(neighbors,mean)) +
  geom_line(linewidth = 1.5, alpha = 0.6) +
  geom_point(size = 2) +
  facet_wrap(~ .metric, scales = "free", nrow = 2)
# A tibble: 20 × 7
   neighbors .metric  .estimator  mean     n std_err .config              
       <int> <chr>    <chr>      <dbl> <int>   <dbl> <chr>                
 1         1 accuracy binary     0.729     5 0.00471 Preprocessor1_Model01
 2         1 roc_auc  binary     0.656     5 0.00416 Preprocessor1_Model01
 3         2 accuracy binary     0.729     5 0.00460 Preprocessor1_Model02
 4         2 roc_auc  binary     0.718     5 0.00752 Preprocessor1_Model02
 5         3 accuracy binary     0.729     5 0.00480 Preprocessor1_Model03
 6         3 roc_auc  binary     0.746     5 0.00790 Preprocessor1_Model03
 7         4 accuracy binary     0.730     5 0.00470 Preprocessor1_Model04
 8         4 roc_auc  binary     0.760     5 0.00707 Preprocessor1_Model04
 9         5 accuracy binary     0.729     5 0.00469 Preprocessor1_Model05
10         5 roc_auc  binary     0.767     5 0.00698 Preprocessor1_Model05
11         6 accuracy binary     0.757     5 0.00608 Preprocessor1_Model06
12         6 roc_auc  binary     0.777     5 0.00670 Preprocessor1_Model06
13         7 accuracy binary     0.763     5 0.00566 Preprocessor1_Model07
14         7 roc_auc  binary     0.783     5 0.00655 Preprocessor1_Model07
15         8 accuracy binary     0.768     5 0.00578 Preprocessor1_Model08
16         8 roc_auc  binary     0.789     5 0.00668 Preprocessor1_Model08
17         9 accuracy binary     0.770     5 0.00639 Preprocessor1_Model09
18         9 roc_auc  binary     0.793     5 0.00662 Preprocessor1_Model09
19        10 accuracy binary     0.771     5 0.00686 Preprocessor1_Model10
20        10 roc_auc  binary     0.797     5 0.00659 Preprocessor1_Model10
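Tuning results also have an autoplot method, so a similar display can be produced in one line as an alternative to the manual ggplot above:

# one-line alternative to the manual plot
tune_results |> ggplot2::autoplot()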

Exercise 9

Use tune::show_best and tune::select_best with argument metric = "roc_auc" to find the best k for the knn classification model. Then

  1. update the workflow using tune::finalize_workflow to set the best k value.
  2. use tune::last_fit with the updated workflow from step 1, evaluated on the split data from exercise 2 to finalize the fit.
  3. use tune::collect_metrics() to get the metrics for the best fit
  4. use tune::collect_predictions() to get the predictions and plot the roc_auc as in the prior exercises
SOLUTION:
# show the top candidates ranked by roc_auc
tune_results |>
  tune::show_best(metric = "roc_auc")
# select the best roc_auc metric
best_nn <- tune_results |>
  tune::select_best(metric = "roc_auc")

# finalize the workflow with the best nn metric from the last step
final_wf <- default_workflow_knn |>
  tune::finalize_workflow(best_nn)

# use tune::last_fit with the finalized workflow on the data_split (ex 2)
final_fit <-
  final_wf |>
  tune::last_fit(data_split)

# collect the metrics from the final fit
final_fit |>
  tune::collect_metrics()
final_fit |>
  tune::collect_predictions() |>
  yardstick::roc_curve(.pred_No, truth = churn) |>
  autoplot()
# A tibble: 5 × 7
  neighbors .metric .estimator  mean     n std_err .config              
      <int> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                
1        10 roc_auc binary     0.797     5 0.00659 Preprocessor1_Model10
2         9 roc_auc binary     0.793     5 0.00662 Preprocessor1_Model09
3         8 roc_auc binary     0.789     5 0.00668 Preprocessor1_Model08
4         7 roc_auc binary     0.783     5 0.00655 Preprocessor1_Model07
5         6 roc_auc binary     0.777     5 0.00670 Preprocessor1_Model06
# A tibble: 3 × 4
  .metric     .estimator .estimate .config             
  <chr>       <chr>          <dbl> <chr>               
1 accuracy    binary         0.773 Preprocessor1_Model1
2 roc_auc     binary         0.798 Preprocessor1_Model1
3 brier_class binary         0.157 Preprocessor1_Model1
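If you want to keep the finalized model for later use, the fitted workflow can be extracted from the last_fit object and used for prediction; a sketch, where new_customers stands in for a hypothetical data frame with the same predictor columns as the training data:

# extract the fitted workflow from the last_fit result
final_model <- final_fit |> tune::extract_workflow()

# predict churn probabilities for new data (new_customers is hypothetical)
# predict(final_model, new_data = new_customers, type = "prob")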

Exercise 10: clustering

Load the data for this exercise as below and plot it; then create an analysis dataset with the cluster labels removed.

#
# read the data
labelled_points <- readr::read_csv("data/lab_5_clusters.csv", show_col_types = FALSE)

# plot the clusters
labelled_points |> ggplot(aes(x1, x2, color = cluster)) +
  geom_point(alpha = 0.3) + 
  theme(legend.position="none")

# remove cluster labels to make the analysis dataset
points <-
  labelled_points |>
  select(-cluster)

You have frequently used broom::augment to combine a model with the data set, and broom::tidy to summarize model components; broom::glance is used similarly, to summarize goodness-of-fit metrics.

Now perform k-means clustering on the points data for different values of k as follows:

kclusts <-
  # number of clusters from 1-9
  tibble(k = 1:9) |>
  # mutate to add columns
  mutate(
    # a list-column with the results of the kmeans function (clustering)
    kclust = purrr::map(k, ~stats::kmeans(points, .x)),
    # a list-column with the results broom::tidy applied to the clustering results
    tidied = purrr::map(kclust, broom::tidy),
    # a list-column with the results broom::glance applied to the clustering results
    glanced = purrr::map(kclust, broom::glance),
    # a list-column with the results broom::augment applied to the clustering results
    augmented = purrr::map(kclust, broom::augment, points)
  )
SOLUTION:

(i) Create 3 variables by tidyr::unnesting the appropriate columns of kclusts

# take kclusts and use tidyr::unnest() on the appropriate columns
clusters <-
  kclusts |>
  tidyr::unnest(cols = c(tidied))

assignments <-
  kclusts |>
  tidyr::unnest(cols = c(augmented))

clusterings <-
  kclusts |>
  tidyr::unnest(cols = c(glanced))

(ii) Use the assignment variable to plot the cluster assignments generated by stats::kmeans

# plot the points assigned to each cluster
p <- assignments |> ggplot(aes(x = x1, y = x2)) +
  geom_point(aes(color = .cluster), alpha = 0.8) +
  facet_wrap(~ k) + theme(legend.position="none")
p

(iii) Use the clusters variable to add the cluster centers to the plot

# on the last plot, mark the cluster centres with an X
p + geom_point(data = clusters, size = 10, shape = "x")

(iv) Use the clusterings variable to plot the total within sum of squares value by number of clusters.

# make a separate line-and-point plot with the tot-withinss data by cluster number
clusterings |> ggplot(aes(k, tot.withinss)) +
  geom_line() +
  geom_point()

(v) Visually and by the “elbow” heuristic, we should use k = 3: the total within-cluster sum of squares drops steeply up to k = 3 and flattens afterwards, so k = 3 gives a good fit with low model complexity.
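Having settled on k = 3, you can fit and inspect just that clustering for downstream use; a short sketch, with a seed added since stats::kmeans starts from random centres:

# fit the chosen k = 3 model and summarize it
set.seed(8740)
final_kclust <- stats::kmeans(points, centers = 3)
broom::tidy(final_kclust)    # cluster centres and sizes
broom::glance(final_kclust)  # goodness-of-fit, e.g. tot.withinss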

Submission

Warning

Before you wrap up the assignment, make sure all documents are saved, staged, committed, and pushed to your repository on the course github site.

Remember – you do not have to turn in an *.html file. I will be pulling your work directly from your repository on the course website.

Grading

Total points available: 30 points.

Component    Points
Ex 1 - 10    30