Lab 2 - The Recipes package

BSMM 8740 Fall 2024

Author

Add your name here

Introduction

In today’s lab, you’ll explore several data sets and practice pre-processing and feature engineering with recipes. The main goal is to increase your understanding of the recipes workflow, including the data structures underlying recipes, what they contain and how to access them.

Learning goals

By the end of the lab you will…

  • Be able to use the recipes package to prepare train and test datasets for analysis/modeling.
  • Be able to access recipe data post-processing, to both extract and validate the results of your recipe steps.

Getting started

  • To complete the lab, log on to your GitHub account, then go to the class GitHub organization and find the 2024-lab-2-[your github username] repository.

    Create an R project using your 2024-lab-2-[your github username] repository (remember to create a PAT, etc., as in lab-1) and add your answers by editing the 2024-lab-2.qmd file in your personal repository.

  • When you are done, be sure to save your document, stage, commit and push your work.

Important

To access GitHub from the lab, make sure you are logged in as follows:

  • username: .\daladmin
  • password: Business507!

Remember to create a PAT and set your git credentials:

  • create your PAT using usethis::create_github_token(),
  • store your PAT with gitcreds::gitcreds_set(),
  • set your username and email with
    • usethis::use_git_config(user.name = ___, user.email = ___)

Packages

We will use the following packages in today’s lab.

# check if 'librarian' is installed and if not, install it
if (! "librarian" %in% rownames(installed.packages()) ){
  install.packages("librarian")
}
  
# load packages if not already loaded
librarian::shelf(
  tidyverse, magrittr, gt, gtExtras, tidymodels, DataExplorer, skimr, janitor, ggplot2, forcats
)

Data: The Boston Cocktail Recipes

The Boston Cocktail Recipes dataset appeared in a TidyTuesday posting. TidyTuesday is a weekly data project in R.

The dataset is derived from the Mr. Boston Bartender’s Guide, together with a dataset that was web-scraped as part of a hackathon.

This dataset contains the following information for each cocktail:

variable           class      description
name               character  Name of cocktail
category           character  Category of cocktail
row_id             integer    Drink identifier
ingredient_number  integer    Ingredient number
ingredient         character  Ingredient
measure            character  Measurement/volume of ingredient
measure_number     real       measure as a number

Use the code below to load the Boston Cocktail Recipes data set.

boston_cocktails <- readr::read_csv('data/boston_cocktails.csv', show_col_types = FALSE)

Note

Exercises 1-7 use the recipes package to preprocess data, producing normalized data and principal components whose key parameters can be accessed from the prepped recipe object. As an unsupervised ‘learning’ method, the results of PCA often need to be interpreted for management; the majority of exercises 1-7 ask you to do that.

Exercises

Exercise 1

First use skimr::skim to assess the quality of the data set.

Next prepare a summary. What is the median measure number across cocktail recipes?

YOUR ANSWER:
# PLEASE SHOW YOUR WORK
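A minimal sketch of one possible approach (the summary below uses median(); other summary verbs work too):

boston_cocktails |> skimr::skim()

boston_cocktails |>
  dplyr::summarise(median_measure = median(measure_number, na.rm = TRUE))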

Exercise 2

From the boston_cocktails dataset select the name, category, ingredient, and measure_number columns and then pivot the table to create a column for each ingredient. Fill any missing values with the number zero.

Since the names of the new columns may contain spaces, clean them using janitor::clean_names(). Finally, drop any rows with NA values and save this new dataset in a variable.

How much gin is in the cocktail called Leap Frog Highball?

YOUR ANSWER:
# PLEASE SHOW YOUR WORK
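A possible sketch; the variable name cocktails_df is a placeholder, and values_fill = 0 handles the missing values:

cocktails_df <- boston_cocktails |>
  dplyr::select(name, category, ingredient, measure_number) |>
  tidyr::pivot_wider(names_from = ingredient, values_from = measure_number, values_fill = 0) |>
  janitor::clean_names() |>
  tidyr::drop_na()

cocktails_df |>
  dplyr::filter(name == "Leap Frog Highball") |>
  dplyr::pull(gin)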

Exercise 3

Prepare a recipes::recipe object without a target, giving name and category ‘id’ roles. Add steps to normalize the predictors and perform PCA. Finally, prep the recipe and save it in a variable.

How many predictor variables are prepped by the recipe?

YOUR ANSWER:
# PLEASE SHOW YOUR WORK
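A possible sketch, assuming the wide dataset from Exercise 2 is named cocktails_df:

pca_rec <- recipes::recipe(~ ., data = cocktails_df) |>
  recipes::update_role(name, category, new_role = "id") |>
  recipes::step_normalize(recipes::all_predictors()) |>
  recipes::step_pca(recipes::all_predictors())

pca_prep <- recipes::prep(pca_rec)

# count the variables carrying the "predictor" role
summary(pca_rec) |> dplyr::count(role)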

This is a good place to render, commit, and push changes to your remote lab repo on GitHub. Click the checkbox next to each file in the Git pane to stage the updates you’ve made, write an informative commit message, and push. After you push the changes, the Git pane in RStudio should be empty.

Exercise 4

Apply the recipes::tidy verb to the prepped recipe in the last exercise. The result is a table identifying the information generated and stored by each step in the recipe from the input data.

To see the values calculated for normalization, apply the recipes::tidy verb as before, but with second argument = 1.

Which ingredient is used the most, on average?

YOUR ANSWER:
# PLEASE SHOW YOUR WORK
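A sketch of the two calls, assuming the prepped recipe is named pca_prep (normalization is step 1, per the exercise):

recipes::tidy(pca_prep)      # one row per recipe step
recipes::tidy(pca_prep, 1)   # per-ingredient means and standard deviations

recipes::tidy(pca_prep, 1) |>
  dplyr::filter(statistic == "mean") |>
  dplyr::slice_max(value, n = 1)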

Exercise 5

Now look at the result of the PCA, applying the recipes::tidy verb as before, but with second argument = 2. Save the result in a variable and filter for the components PC1 to PC5. Mutate the resulting component column so that the values are factors, ordering them in the order they appear using the forcats::fct_inorder verb.

Plot this data with ggplot2, piping it into the code below.

YOUR ANSWER:
# PLEASE SHOW YOUR WORK 
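# A possible sketch (assumes pca_prep from Exercise 3; PCA is step 2):
tidied_pca <- recipes::tidy(pca_prep, 2) |>
  dplyr::filter(component %in% paste0("PC", 1:5)) |>
  dplyr::mutate(component = forcats::fct_inorder(component))

tidied_pca |>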
ggplot(aes(value, terms, fill = terms)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~component, nrow = 1) +
  labs(y = NULL) +
  theme(axis.text = element_text(size = 7),
        axis.title = element_text(size = 14, face = "bold"))

How would you describe the drinks represented by PC1?

Exercise 6

As in the last exercise, use the variable with the tidied PCA data, this time keeping only components PC1 to PC4. Take/slice the top 8 ingredients by component, ordered by their absolute value, using the verb dplyr::slice_max. Next, generate a grouped table using gt::gt, colouring the cell backgrounds (i.e. fill) green for values ≥ 0 and red for values < 0.

What is the characteristic alcoholic beverage of each of the first 4 principal components?

YOUR ANSWER:
# PLEASE SHOW YOUR WORK
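A possible sketch, assuming the tidied PCA data from Exercise 5 is named tidied_pca; the colouring here uses gt::tab_style, one of several ways to fill cells:

tidied_pca |>
  dplyr::filter(component %in% paste0("PC", 1:4)) |>
  dplyr::group_by(component) |>
  dplyr::slice_max(abs(value), n = 8) |>
  gt::gt() |>
  gt::tab_style(
    style = gt::cell_fill(color = "green"),
    locations = gt::cells_body(columns = value, rows = value >= 0)
  ) |>
  gt::tab_style(
    style = gt::cell_fill(color = "red"),
    locations = gt::cells_body(columns = value, rows = value < 0)
  )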

Exercise 7

For this exercise, bake the prepped PCA recipe using recipes::bake on the original data, and plot each cocktail by its PC1 and PC2 components using

ggplot(aes(PC1, PC2, label = name)) +
  geom_point(aes(color = category), alpha = 0.7, size = 2) +
  geom_text(check_overlap = TRUE, hjust = "inward") + 
  labs(color = NULL)

Can you create an interpretation of the PCA analysis?

YOUR ANSWER:
# PLEASE SHOW YOUR WORK
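A possible sketch, assuming pca_prep from Exercise 3 (new_data = NULL returns the prepped training data):

recipes::bake(pca_prep, new_data = NULL) |>
  ggplot(aes(PC1, PC2, label = name)) +
  geom_point(aes(color = category), alpha = 0.7, size = 2) +
  geom_text(check_overlap = TRUE, hjust = "inward") +
  labs(color = NULL)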

This is a good place to render, commit, and push changes to your remote lab repo on GitHub. Click the checkbox next to each file in the Git pane to stage the updates you’ve made, write an informative commit message, and push. After you push the changes, the Git pane in RStudio should be empty.

Exercise 8

In the following exercises, we’ll use the recipes package to prepare time series data. The starting dataset contains monthly house price data for each of the four countries/regions of the UK.

uk_prices <- readr::read_csv('data/UK_house_prices.csv', show_col_types = FALSE)

Write code to clean the names in the uk_prices dataset using janitor::clean_names(), then use skimr::skim to confirm that the region names are correct and that there are no missing values. Call the resulting analytic data set df.

YOUR ANSWER:
# PLEASE SHOW YOUR WORK
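A minimal sketch (region_name is the cleaned column name, per the expected output in Exercise 9):

df <- uk_prices |> janitor::clean_names()

df |> skimr::skim()
df |> dplyr::count(region_name)   # confirm the four region names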

Exercise 9

We want to use the monthly house price data to predict house prices one month ahead for each region. The basic model will be:

sales_volume ~ .

where date and region_name are id variables, not predictor variables. Instead, we will use the prior-month lagged sales volume as the only predictor.

The recipe uses the following 5 steps from the recipes package (listed out of order):

  • update_role(date, region_name, new_role = "id")
  • recipe(sales_volume ~ ., data = df |> janitor::clean_names())
  • step_naomit(lag_1_sales_volume, skip = FALSE)
  • step_lag(sales_volume, lag = 1)
  • step_arrange(region_name, date)

Use these 5 steps in the proper order to create a recipe to pre-process the time series data for each region. Prep and then bake your recipe using df.

YOUR ANSWER:
# PLEASE SHOW YOUR WORK
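One plausible ordering, as a sketch only; verify your own ordering against the expected output below (df already has cleaned names from Exercise 8):

ts_rec <- recipes::recipe(sales_volume ~ ., data = df) |>
  recipes::update_role(date, region_name, new_role = "id") |>
  recipes::step_arrange(region_name, date) |>
  recipes::step_lag(sales_volume, lag = 1) |>
  recipes::step_naomit(lag_1_sales_volume, skip = FALSE)

baked <- ts_rec |> recipes::prep() |> recipes::bake(new_data = NULL)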

The result of the bake step, after filtering for dates ≤ 2005-03-01, should look like this:

readRDS("data/baked_uk_house_dat.rds") |> 
  dplyr::filter(date <= lubridate::ymd(20050301)) |> 
  gt::gt() |> 
  gt::fmt_currency(columns = -c(date,region_name), decimals = 0) |> 
  gtExtras::gt_theme_espn()
date        region_name       sales_volume  lag_1_sales_volume
2005-02-01  England           $56,044       $53,464
2005-03-01  England           $67,322       $56,044
2005-02-01  Northern Ireland  $978          $978
2005-03-01  Northern Ireland  $978          $978
2005-02-01  Scotland          $7,631        $8,876
2005-03-01  Scotland          $9,661        $7,631
2005-02-01  Wales             $2,572        $2,516
2005-03-01  Wales             $3,336        $2,572

Exercise 10

Recall The Business Problem

We’re at a fast-paced startup. The company is growing quickly, and the marketing team is looking for ways to increase sales from existing customers by getting them to buy more. The main idea is to unlock the potential of the customer base through incentives, in this case a discount. We of course want to measure the effect of the discount on customer behavior. Still, the team does not want to waste money giving discounts to customers who are not valuable. As always, it is about return on investment (ROI).

Without going into specifics about the nature of the discount, it has been designed to provide a positive return on investment if the customer buys more than $1 as a result of the discount. How can we measure the effect of the discount and make sure our experiment has a positive ROI? The marketing team came up with the following strategy:

  • Select a sample of existing customers from the same cohort.
  • Set a test window of 1 month.
  • Look into the historical data of web visits from the last month. The hypothesis is that web visits are a good proxy for the customer’s interest in the product.
  • For customers with a high number of web visits, send a discount. Within this group of potentially valuable customers there will be a hold-out group which will not receive the discount. For customers with a low number of web visits, do not send a discount (the marketing team wants to report a positive ROI, so they do not want to waste money on customers who are not valuable). Still, they want to use these customers to measure the effect of the discount.
  • We also want to use the results of the test to tag loyal customers. These are customers who received a discount (since they showed potential interest in the product) and customers with exceptional sales numbers even if they did not receive one. The idea is to use this information to target them in the future if the discount strategy proves positive.

In the last lab we did some exploratory data analysis. The next step is to prepare some descriptive statistics.

Descriptive Statistics

The first thing the data analytics team did was to split the sales distribution by discount group:

data <- readr::read_csv('data/sales_dag.csv', show_col_types = FALSE)

data |> dplyr::mutate(discount = factor(discount)) |> 
  ggplot(aes(x = sales, after_stat(count), fill = discount)) +
  geom_histogram(alpha = 0.30, position = 'identity', color="#e9ecef", bins = 30)+
  geom_density(alpha = 0.30) +
  xlab("Sales") +
  ylab("Count") +
  theme_minimal()

It looks like customers with a discount have higher sales. Data scientist A is optimistic about this initial result. To quantify it, compute the difference in means by discount:

YOUR ANSWER:

Group by discount and find the difference in means:
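A minimal sketch, assuming discount is coded 0/1:

means <- data |>
  dplyr::group_by(discount) |>
  dplyr::summarise(mean_sales = mean(sales))

means
diff(means$mean_sales)   # discount-group mean minus no-discount mean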

What is the difference in means?

The discount strategy seems to be working. Data scientist A is happy with the results and decides to get feedback from the rest of the data science team.

Data scientist B is not so happy with the results. They think the uplift is too good to be true (based on domain knowledge and the sales distributions 🤔). When thinking about reasons for such a high uplift, they realize the discount assignment was not random: it was based on the number of web visits (remember the marketing plan?). This means the discount group is not fully comparable to the control group! They decide to plot sales against web visits per discount group:

data |> dplyr::mutate(discount = factor(discount)) |> 
  ggplot(aes(x=visits, y = sales, color = discount)) +
  geom_point() + 
  facet_grid(cols = vars(discount))

Indeed, they realize they should probably adjust for the number of web visits. A natural metric is sales per web visit. Compute the sales per visit for each discount group:

YOUR ANSWER:
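A minimal sketch; if sales_per_visit is not already a column in data, create it first:

data <- data |> dplyr::mutate(sales_per_visit = sales / visits)

data |>
  dplyr::group_by(discount) |>
  dplyr::summarise(mean_sales_per_visit = mean(sales_per_visit))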

The mean value is higher for the discount group. As always, they also looked at the distributions:

data |> dplyr::mutate(discount = factor(discount)) |> 
  ggplot(aes(x = sales_per_visit, after_stat(count), fill = discount)) +
  geom_histogram(alpha = 0.30, position = 'identity', color="#e9ecef", bins = 30)+
  # geom_density(alpha = 0.30) +
  xlab("Sales per Visit") +
  ylab("Count") +
  theme_minimal()

To both data scientists A and B the results look much better, but they are unsure which uplift to report. They consider the difference in means:

Compute the difference in mean sales_per_visit by discount:

YOUR ANSWER:
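A minimal sketch, again assuming discount is coded 0/1:

spv_means <- data |>
  dplyr::group_by(discount) |>
  dplyr::summarise(mean_spv = mean(sales_per_visit))

diff(spv_means$mean_spv)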

Can you see a way to interpret this in terms of dollars? yes/no

You’re done and ready to submit your work! Save, stage, commit, and push all remaining changes. You can use the commit message “Done with Lab 2!”, and make sure you have committed and pushed all changed files to GitHub (your Git pane in RStudio should be empty) and that all documents are updated in your repo on GitHub.

Submission

I will pull (copy) everyone’s submissions at 5:00pm on the Sunday following class, and I will work only with these copies, so anything submitted after 5:00pm will not be graded. (don’t forget to commit and then push your work by 5:00pm on Sunday!)

Grading

Total points available: 30 points.

Component  Points
Ex 1 - 10  30