In today’s lab, you’ll practice building workflowsets with recipes, parsnip models, rsample cross validations, model tuning and model comparison in the context of classification and clustering.
Packages
# check if 'librarian' is installed and if not, install itif (!"librarian"%in%rownames(installed.packages()) ){install.packages("librarian")}# load packages if not already loadedlibrarian::shelf(tidyverse, magrittr, gt, gtExtras, tidymodels, ggplot2)# set the default theme for plottingtheme_set(theme_bw(base_size =18) +theme(legend.position ="top"))
The Data
Today we will be using customer churn data.
In the customer management lifecycle, customer churn refers to a decision made by the customer about ending the business relationship. It is also referred as loss of clients or customers. This dataset contains 20 features related to churn in a telecom context and we will look at how to predict churn and estimate the effect of predictors on the customer churn odds ratio.
data <- readr::read_csv("data/Telco-Customer-Churn.csv", show_col_types =FALSE) |> dplyr::mutate(churn =as.factor(churn))
Exercise 1: EDA
Write and execute the code to perform summary EDA on the data using the package skimr. Plot histograms for monthly charges and tenure. Tenure measures the strength of the customer relationship by measuring the length of time that a person has been a customer.
# A tibble: 37 × 4
variable type role source
<chr> <list> <chr> <chr>
1 tenure <chr [2]> predictor original
2 monthly_charges <chr [2]> predictor original
3 total_charges <chr [2]> predictor original
4 churn <chr [3]> outcome original
5 gender_Male <chr [2]> predictor derived
6 senior_citizen_Yes <chr [2]> predictor derived
7 partner_Yes <chr [2]> predictor derived
8 dependents_Yes <chr [2]> predictor derived
9 tenure_interval_X0.6.Month <chr [2]> predictor derived
10 tenure_interval_X12.24.Month <chr [2]> predictor derived
# ℹ 27 more rows
Exercise 3: logistic modeling
Create a linear model using logistic regression to predict churn. for the set engine stage use “glm,” and set the mode to “classification.”
Create a workflow using the recipe of the last exercise and the model if the last step.
With the workflow, fit the training data
Combine the training data and the predictions from step 3 using broom::augment , and assign the result to a variable
Create a combined metric function as show in the code below:
Use the variable from step 4 as the first argument to the function from step 5. The other arguments are truth = churn (from the data) and estimate=.pred_class (from step 4). Make a note of the numerical metrics.
Use the variable from step 4 as the first argument to the functions below, with arguments truth = churn and estimate =.pred_No.
yardstick::roc_auc
yardstick::roc_curve followed by ggplot2::autoplot().
SOLUTION:
# create a linear regression modeldefault_model <- parsnip::logistic_reg() |> parsnip::set_engine("glm") |> parsnip::set_mode("classification")# create a workflowdefault_workflow <- workflows::workflow() |> workflows::add_recipe(default_recipe) |> workflows::add_model(default_model)# fit the workflowlm_fit <- default_workflow |> parsnip::fit(default_train)# training datasettraining_results <- broom::augment(lm_fit , default_train)
# compute roc_auc and plot the roc_curvetraining_results |> yardstick::roc_auc(.pred_No, truth = churn)training_results |> yardstick::roc_curve(.pred_No, truth = churn) |>autoplot()
# pull the tenure coefficient and exponentiate itfit0_tbl |> dplyr::filter(term =='tenure') |> dplyr::pull(estimate) |>exp()
[1] 0.1228617
Exercise 5 knn modeling
Now we will create a K-nearest neighbours model to estimate churn. To do this, write the code for the following steps:
Create a K-nearest neighbours model to predict churn using parsnip::nearest_neighbor with argument neighbors = 3 which will use the three most similar data points from the training set to predict churn. For the set engine stage use “kknn,” and set the mode to “classification.”
Take the workflow from exercise 3 and create a new workflow by updating the original workflow. Use workflows::update_model to swap out the original logistic model for the nearest neighbour model.
Use the new workflow to fit the training data. Take the fit and use broom::augment to augment the fit with the training data.
Use the augmented data from step 3 to plot the roc curve, using yardstick::roc_curve(.pred_No, truth = churn) as in exercise 3. How do you interpret his curve?
Take the fit from step 3 and use broom::augment to augment the fit with the test data.
Repeat step 4 using the augmented data from step 5.
SOLUTION:
# create a knn classification modeldefault_model_knn <- parsnip::nearest_neighbor(neighbors =3) |> parsnip::set_engine("kknn") |> parsnip::set_mode("classification")# create a workflowdefault_workflow_knn <- default_workflow |> workflows::update_model(default_model_knn)# fit the workflowlm_fit_knn <- default_workflow_knn |> parsnip::fit(default_train)# augment the training data with the fitted datatraining_results_knn <- broom::augment(lm_fit_knn , default_train)
# compute the metricstraining_results_knn |>m_set_fn(truth = churn, estimate = .pred_class)
# compute roc_auc and plot the roc_curvetraining_results_knn |> yardstick::roc_auc(.pred_No, truth = churn)training_results_knn |> yardstick::roc_curve(.pred_No, truth = churn) |>autoplot()
Following the last exercise, we should have some concerns about over-fitting by the nearest-neighbour model.
To address this we will use cross validation to tune the model and evaluate the fits.
Create a cross-validation dataset based on 5 folds using rsample::vfold_cv.
Using the knn workflow from exercise 5, apply tune::fit_resamples with arguments resamples and control where the resamples are the dataset created in step 1 and control is tune::control_resamples(save_pred = TRUE), which will ensure that the predictions are saved.
Use tune::collect_metrics() on the results from step 2
Use tune::collect_predictions() on the results from step 2 to plot the roc_auc curve as in exercise 5. Has it changed much from exercise 5?
SOLUTION:
# create v-fold cross validation datadata_vfold_cv <- data |> rsample::vfold_cv(v=5)# use tune::fit on the cv dat, saving the predictionsrf_fit_rs <- default_workflow_knn |> tune::fit_resamples(data_vfold_cv, control = tune::control_resamples(save_pred =TRUE))# collect the metricsrf_fit_rs |> tune::collect_metrics()
# compute the roc_curverf_fit_rs |> tune::collect_predictions() |> yardstick::roc_curve(.pred_No, truth = churn) |>autoplot()
This is a good place to render, commit, and push changes to your remote lab repo on GitHub. Click the checkbox next to each file in the Git pane to stage the updates you’ve made, write an informative commit message, and push. After you push the changes, the Git pane in RStudio should be empty.
Exercise 7: tuning for k
In this exercise we’ll tune the number of nearest neighbours in our model to see if we can improve performance.
Redo exercise 5 steps 1 and 2, setting neighbors = tune::tune() for the model, and then updating the workflow with workflows::update_model.
Use dials::grid_regular(dials::neighbors(), levels = 10) to create a grid for tuning k.
Use tune::tune_grid with tune::control_grid(save_pred = TRUE) and yardstick::metric_set(yardstick::accuracy, yardstick::roc_auc) to generate tuning results
SOLUTION:
# re-specify the model for tuningdefault_model_knn_tuned <- parsnip::nearest_neighbor(neighbors = tune::tune()) |> parsnip::set_engine("kknn") |> parsnip::set_mode("classification")# update the workflowdefault_workflow_knn <- default_workflow |> workflows::update_model(default_model_knn_tuned)# make a grid for tuningclust_num_grid <- dials::grid_regular(dials::neighbors(), levels =10)# use the grid to tune the modeltune_results <- tune::tune_grid( default_workflow_knn,resamples = data_vfold_cv,grid = clust_num_grid,control = tune::control_grid(save_pred =TRUE) , metrics = yardstick::metric_set(yardstick::accuracy, yardstick::roc_auc))# show the tuning results dataframetune_results
Use tune::collect_metrics() to collect the metrics from the tuning results in exercise 7 and then plot the metrics as a function of k using the code below.
SOLUTION:
# collect the metricstune_results |> tune::collect_metrics()# plot the collected metrics as a function of Ktune_results |> tune::collect_metrics() |>ggplot(aes(neighbors,mean)) +geom_line(linewidth =1.5, alpha =0.6) +geom_point(size =2) +facet_wrap(~ .metric, scales ="free", nrow =2)
Use tune::show_best and tune::select_best with argument “roc_auc” to find the best k for the knn classification model. Then
update the workflow using tune::finalize_workflow to set the best k value.
use tune::last_fit with the updated workflow from step 1, evaluated on the split data from exercise 2 to finalize the fit.
use tune::collect_metrics() to get the metrics for the best fit
use tune::collect_predictions() to get the predictions and plot the roc_auc as in the prior exercises
SOLUTION:
# show the roc_auc metricstune_results |> tune::show_best(metric ="roc_auc")# select the best roc_auc metricbest_nn <- tune_results |> tune::select_best(metric ="roc_auc")# finalize the workflow with the best nn metric from the last stepfinal_wf <- default_workflow_knn |> tune::finalize_workflow(best_nn)# use tune::last_fit with the finaized workflow on the data_split (ex 2)final_fit <- final_wf |> tune::last_fit(data_split)# collect the metrics from the final fitfinal_fit |> tune::collect_metrics()final_fit |> tune::collect_predictions() |> yardstick::roc_curve(.pred_No, truth = churn) |>autoplot()
Load the data for this exercise as below and plot it, and then create an analysis dataset with the cluster labels removed
## read the datalabelled_points <- readr::read_csv("data/lab_5_clusters.csv", show_col_types =FALSE)# plot the clusterslabelled_points |>ggplot(aes(x1, x2, color = cluster)) +geom_point(alpha =0.3) +theme(legend.position="none")
# remove cluster labels to make the analysis datasetpoints <- labelled_points |>select(-cluster)
You have frequently used broom::augment to combine a model with the data set, and broom::tidy to summarize model components; broom::glance is used to similarly to summarize goodness-of-fit metrics.
Now perform k-means clustering on the points data for different values of k as follows:
kclusts <-# number of clusters from 1-9tibble(k =1:9) |># mutate to add columnsmutate(# a list-column with the results of the kmeans function (clustering)kclust = purrr::map(k, ~stats::kmeans(points, .x)),# a list-column with the results broom::tidy applied to the clustering resultstidied = purrr::map(kclust, broom::tidy),# a list-column with the results broom::glance applied to the clustering resultsglanced = purrr::map(kclust, broom::glance),# a list-column with the results broom::augment applied to the clustering resultsaugmented = purrr::map(kclust, broom::augment, points) )
SOLUTION:
(i) Create 3 variables by tidyr::unnesting the appropriate columns of kclusts
# take kclusts and use tidy::unnest() on the appropriate columnsclusters <- kclusts |> tidyr::unnest(cols =c(tidied))assignments <- kclusts |> tidyr::unnest(cols =c(augmented))clusterings <- kclusts |> tidyr::unnest(cols =c(glanced))
(ii) Use the assignment variable to plot the cluster assignments generated by stats::kmeans
# plot the points assigned to each clusterp <- assignments |>ggplot(aes(x = x1, y = x2)) +geom_point(aes(color = .cluster), alpha =0.8) +facet_wrap(~ k) +theme(legend.position="none")p
(iii) Use the clusters variable to add the cluster centers to the plot
# on the last plot, mark the cluster centres with an Xp +geom_point(data = clusters, size =10, shape ="x")
(iv) Use the clusterings variable to plot the total within sum of squares value by number of clusters.
# make a separate line-and-point plot with the tot-withinss data by cluster numberclusterings |>ggplot(aes(k, tot.withinss)) +geom_line() +geom_point()
(v) Visually and by the “elbow” heursistic, we should use k=3, i.e. k=3 should give good results: good fit with low model complexity.
Submission
Warning
Before you wrap up the assignment, make sure all documents are saved, staged, committed, and pushed to your repository on the course github site.
Remember – you do not have to turn in an *.html file. I will be pulling your work directly from your repository on the course website.