# check if 'librarian' is installed and if not, install it
if (!"librarian" %in% rownames(installed.packages())) {
  install.packages("librarian")
}
# load packages if not already loaded
librarian::shelf(
  tidyverse, magrittr, gt, gtExtras, tidymodels, DataExplorer,
  skimr, janitor, ggplot2, forcats
)
Lab 2 - The Recipes package
BSMM 8740 Fall 2024
Introduction
In today’s lab, you’ll explore several data sets and practice pre-processing and feature engineering with the recipes package. The main goal is to increase your understanding of the recipes workflow, including the data structures underlying recipes, what they contain, and how to access them.
Learning goals
By the end of the lab you will…
- Be able to use the recipes package to prepare train & test datasets for analysis/modeling.
- Be able to access recipes data post-processing to both extract & validate the results of your recipe steps.
Getting started
To complete the lab, log on to your GitHub account and then go to the class GitHub organization and find the 2024-lab-2-[your github username] repository.
Create an R project using your 2024-lab-2-[your github username] repository (remember to create a PAT, etc., as in lab-1) and add your answers by editing the 2024-lab-2.qmd file in your personal repository. When you are done, be sure to save your document, stage, commit, and push your work.
To access Github from the lab, you will need to make sure you are logged in as follows:
- username: .\daladmin
- password: Business507!
Remember to create a PAT and set your git credentials:
- create your PAT using usethis::create_github_token(),
- store your PAT with gitcreds::gitcreds_set(),
- set your username and email with usethis::use_git_config(user.name = ___, user.email = ___).
Packages
We will use the packages loaded by the librarian::shelf() call at the top of this document.
Data: The Boston Cocktail Recipes
The Boston Cocktail Recipes dataset appeared in a TidyTuesday posting. TidyTuesday is a weekly data project in R.
The dataset is derived from the Mr. Boston Bartender’s Guide, together with a dataset that was web-scraped as part of a hackathon.
This dataset contains the following information for each cocktail:
variable | class | description |
---|---|---|
name | character | Name of cocktail |
category | character | Category of cocktail |
row_id | integer | Drink identifier |
ingredient_number | integer | Ingredient number |
ingredient | character | Ingredient |
measure | character | Measurement/volume of ingredient |
measure_number | real | Measure as a number |
Use the code below to load the Boston Cocktail Recipes data set.
boston_cocktails <- readr::read_csv('data/boston_cocktails.csv', show_col_types = FALSE)
Exercises 1-7 use the recipes package to preprocess data, producing normalized data and principal components whose key parameters can be accessed from the prepped recipe object. As an unsupervised ‘learning’ method, the results of PCA often need to be interpreted for management. The majority of exercises 1-7 ask you to do that.
Exercises
Exercise 1
First use skimr::skim to assess the quality of the data set.
Next prepare a summary. What is the median measure number across cocktail recipes?
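A minimal sketch of one approach (dplyr::summarize is one option; base summary works too):

boston_cocktails |> skimr::skim()

boston_cocktails |>
  dplyr::summarize(median_measure_number = median(measure_number, na.rm = TRUE))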
Exercise 2
From the boston_cocktails dataset select the name, category, ingredient, and measure_number columns and then pivot the table to create a column for each ingredient. Fill any missing values with the number zero.
Since the names of the new columns may contain spaces, clean them using janitor::clean_names(). Finally, drop any rows with NA values and save this new dataset in a variable.
How much gin is in the cocktail called Leap Frog Highball?
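A hedged sketch of one possible pipeline (the variable name cocktails_df is illustrative):

cocktails_df <- boston_cocktails |>
  dplyr::select(name, category, ingredient, measure_number) |>
  tidyr::pivot_wider(names_from = ingredient,
                     values_from = measure_number,
                     values_fill = 0) |>
  janitor::clean_names() |>
  tidyr::drop_na()

# look up the gin measure for the cocktail in question
cocktails_df |>
  dplyr::filter(name == "Leap Frog Highball") |>
  dplyr::select(gin)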
Exercise 3
Prepare a recipes::recipe object without a target, giving name and category ‘id’ roles. Add steps to normalize the predictors and perform PCA. Finally, prep the data and save it in a variable.
How many predictor variables are prepped by the recipe?
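One way this might look (pca_rec and pca_prep are illustrative names; cocktails_df is the dataset saved in Exercise 2):

pca_rec <- recipes::recipe(~ ., data = cocktails_df) |>
  recipes::update_role(name, category, new_role = "id") |>
  recipes::step_normalize(recipes::all_predictors()) |>
  recipes::step_pca(recipes::all_predictors())

pca_prep <- recipes::prep(pca_rec)
pca_prep   # printing the prepped recipe reports how many predictors were processed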
This is a good place to render, commit, and push changes to your remote lab repo on GitHub. Click the checkbox next to each file in the Git pane to stage the updates you’ve made, write an informative commit message, and push. After you push the changes, the Git pane in RStudio should be empty.
Exercise 4
Apply the recipes::tidy verb to the prepped recipe from the last exercise. The result is a table identifying the information generated and stored by each step in the recipe from the input data.
To see the values calculated for normalization, apply the recipes::tidy verb as before, but with second argument = 1.
What ingredient is the most used, on average?
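A sketch, assuming the prepped recipe from Exercise 3 is called pca_prep; step 1 is step_normalize, whose tidied output holds a mean and sd per ingredient:

recipes::tidy(pca_prep)        # one row per step in the recipe

recipes::tidy(pca_prep, 1) |>  # the normalization statistics
  dplyr::filter(statistic == "mean") |>
  dplyr::slice_max(value, n = 1)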
Exercise 5
Now look at the result of the PCA, applying the recipes::tidy verb as before, but with second argument = 2. Save the result in a variable and filter for the components PC1 to PC5. Mutate the resulting component column so that the values are factors, ordering them in the order they appear using the forcats::fct_inorder verb.
Plot this data using ggplot2, piping it into the code below:
ggplot(aes(value, terms, fill = terms)) +
geom_col(show.legend = FALSE) +
facet_wrap(~component, nrow = 1) +
labs(y = NULL) +
theme(axis.text=element_text(size=7),
axis.title=element_text(size=14,face="bold"))
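Put together, a sketch of the full pipeline (pca_tidied is an illustrative name; pca_prep is the prepped recipe from Exercise 3):

pca_tidied <- recipes::tidy(pca_prep, 2)   # step 2 is step_pca

pca_tidied |>
  dplyr::filter(component %in% paste0("PC", 1:5)) |>
  dplyr::mutate(component = forcats::fct_inorder(component)) |>
  ggplot(aes(value, terms, fill = terms)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~component, nrow = 1) +
  labs(y = NULL) +
  theme(axis.text = element_text(size = 7),
        axis.title = element_text(size = 14, face = "bold"))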
How would you describe the drinks represented by PC1?
Exercise 6
As in the last exercise, use the variable with the tidied PCA data, keeping only PCA components PC1 to PC4. Take/slice the top 8 ingredients per component, ordered by their absolute value, using the verb dplyr::slice_max. Next, generate a grouped table using gt::gt, colouring the cell backgrounds (i.e. fill) with green for positive values and red for negative values.
What is the characteristic alcoholic beverage of each of the first 4 principal components?
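One possible sketch, using gt::tab_style with gt::cells_body for the conditional fill (pca_tidied is the tidied PCA data from Exercise 5; the exact colours are illustrative). Note that gt::gt() on a grouped data frame creates the row groups automatically:

pca_tidied |>
  dplyr::filter(component %in% paste0("PC", 1:4)) |>
  dplyr::group_by(component) |>
  dplyr::slice_max(abs(value), n = 8) |>
  gt::gt() |>
  gt::tab_style(style = gt::cell_fill(color = "green"),
                locations = gt::cells_body(columns = value, rows = value >= 0)) |>
  gt::tab_style(style = gt::cell_fill(color = "red"),
                locations = gt::cells_body(columns = value, rows = value < 0))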
Exercise 7
For this exercise, bake the prepped PCA recipe using recipes::bake on the original data and plot each cocktail by its PC1 and PC2 components, using
ggplot(aes(PC1, PC2, label = name)) +
geom_point(aes(color = category), alpha = 0.7, size = 2) +
geom_text(check_overlap = TRUE, hjust = "inward") +
labs(color = NULL)
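A sketch, assuming pca_prep is the prepped recipe from Exercise 3; with new_data = NULL, bake returns the already-processed training data:

recipes::bake(pca_prep, new_data = NULL) |>
  ggplot(aes(PC1, PC2, label = name)) +
  geom_point(aes(color = category), alpha = 0.7, size = 2) +
  geom_text(check_overlap = TRUE, hjust = "inward") +
  labs(color = NULL)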
Can you create an interpretation of the PCA analysis?
This is a good place to render, commit, and push changes to your remote lab repo on GitHub. Click the checkbox next to each file in the Git pane to stage the updates you’ve made, write an informative commit message, and push. After you push the changes, the Git pane in RStudio should be empty.
Exercise 8
In the following exercise, we’ll use the recipes package to prepare time series data. The starting dataset contains monthly house price data for each of the four countries/regions in the UK:
uk_prices <- readr::read_csv('data/UK_house_prices.csv', show_col_types = FALSE)
Write code to clean the names in the uk_prices dataset using janitor::clean_names(), and then use skimr::skim to confirm that the region names are correct and that there are no missing values. Call the resulting analytic data set df.
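A minimal sketch (assuming clean_names yields a region_name column):

df <- uk_prices |> janitor::clean_names()

df |> skimr::skim()                  # completeness check: no missing values
df |> dplyr::distinct(region_name)   # confirm the region names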
Exercise 9
We want to use the monthly house price data to predict house prices one month ahead for each region. The basic model will be:
sales_volume ~ .
where date and region_name are id variables, not predictor variables. Instead we will use the prior-month lagged prices as the only predictor.
The recipe uses the following 5 steps from the recipes package:
- update_role(date, region_name, new_role = "id")
- recipe(sales_volume ~ ., data = df |> janitor::clean_names())
- step_naomit(lag_1_sales_volume, skip = FALSE)
- step_lag(sales_volume, lag = 1)
- step_arrange(region_name, date)
Use these 5 steps in the proper order to create a recipe to pre-process the time series data for each region. Prep and then bake your recipe using df.
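As a sketch of the mechanics only (the step order is left for you to determine; rec and baked are illustrative names, and df is the cleaned data from Exercise 8):

rec <- recipes::recipe(sales_volume ~ ., data = df) |>
  recipes::update_role(date, region_name, new_role = "id")
# pipe the remaining step_*() calls onto rec in the order you choose, then:

baked <- rec |> recipes::prep() |> recipes::bake(new_data = NULL)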
The result of the bake step, after filtering for dates on or before 2005-03-01, should look like this:
readRDS("data/baked_uk_house_dat.rds") |>
  dplyr::filter(date <= lubridate::ymd(20050301)) |>
  gt::gt() |>
  gt::fmt_currency(columns = -c(date, region_name), decimals = 0) |>
  gtExtras::gt_theme_espn()
date | region_name | sales_volume | lag_1_sales_volume |
---|---|---|---|
2005-02-01 | England | $56,044 | $53,464 |
2005-03-01 | England | $67,322 | $56,044 |
2005-02-01 | Northern Ireland | $978 | $978 |
2005-03-01 | Northern Ireland | $978 | $978 |
2005-02-01 | Scotland | $7,631 | $8,876 |
2005-03-01 | Scotland | $9,661 | $7,631 |
2005-02-01 | Wales | $2,572 | $2,516 |
2005-03-01 | Wales | $3,336 | $2,572 |
Exercise 10
Recall The Business Problem
We’re at a fast-paced startup. The company is growing fast and the marketing team is looking for ways to increase sales from existing customers by making them buy more. The main idea is to unlock the potential of the customer base through incentives, in this case a discount. We of course want to measure the effect of the discount on customer behavior. Still, they do not want to waste money giving discounts to users who are not valuable. As always, it is about return on investment (ROI).
Without going into specifics about the nature of the discount, it has been designed to provide a positive return on investment if the customer buys more as a result of the discount. How can we measure the effect of the discount and make sure our experiment has a positive ROI? The marketing team came up with the following strategy:
- Select a sample of existing customers from the same cohort.
- Set a test window of 1 month.
- Look into the historical data of web visits from the last month. The hypothesis is that web visits are a good proxy for the customer’s interest in the product.
- For customers with a high number of web visits, send them a discount. Within these potentially valuable customers there will be a hold-out group that does not receive the discount. For customers with a low number of web visits, do not send a discount (the marketing team wants to report a positive ROI, so they do not want to waste money on customers who are not valuable). Still, they want to use these customers to measure the effect of the discount.
- We also want to use the results of the test to tag loyal customers. These are customers who got a discount (since they showed potential interest in the product) and customers with exceptional sales numbers even if they did not get a discount. The idea is to use this information to target them in the future if the discount strategy pays off.
In the last lab we did some exploratory data analysis. The next step is to prepare some descriptive statistics.
Descriptive Statistics
The first thing the data analytics team did was to split the sales distribution by discount group:
data <- readr::read_csv('data/sales_dag.csv', show_col_types = FALSE)

data |>
  dplyr::mutate(discount = factor(discount)) |>
  ggplot(aes(x = sales, after_stat(count), fill = discount)) +
  geom_histogram(alpha = 0.30, position = 'identity', color = "#e9ecef", bins = 30) +
  geom_density(alpha = 0.30) +
  xlab("Sales") +
  ylab("Density") +
  theme_minimal()
It looks like customers with a discount have higher sales. Data scientist A is optimistic about this initial result. To quantify it, group by discount and compute the difference in mean sales.
What is the difference in means?
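A sketch of the computation (diff subtracts the first group mean from the second):

data |>
  dplyr::group_by(discount) |>
  dplyr::summarize(mean_sales = mean(sales)) |>
  dplyr::summarize(diff_in_means = diff(mean_sales))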
The discount strategy seems to be working. Data scientist A is happy with the results and decides to get feedback from the rest of the data science team.
Data scientist B is not so happy with the results. They think that the uplift is too good to be true (based on domain knowledge and the sales distributions 🤔). When thinking about reasons for such a high uplift, they realized the discount assignment was not at random. It was based on the number of web visits (remember the marketing plan?). This means that the discount group is not completely comparable to the control group! They decide to plot sales against web visits per discount group:
data |>
  dplyr::mutate(discount = factor(discount)) |>
  ggplot(aes(x = visits, y = sales, color = discount)) +
  geom_point() +
  facet_grid(cols = vars(discount))
Indeed, they realize they should probably adjust for the number of web visits. A natural metric is sales per web visit. Compute the sales per visit for each discount group:
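A sketch, assuming sales_per_visit is not already a column in the data (if it is, skip the mutate):

data |>
  dplyr::mutate(sales_per_visit = sales / visits) |>
  dplyr::group_by(discount) |>
  dplyr::summarize(mean_sales_per_visit = mean(sales_per_visit))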
The mean value is higher for the discount group. As always, they also looked at the distributions:
data |>
  dplyr::mutate(discount = factor(discount)) |>
  ggplot(aes(x = sales_per_visit, after_stat(count), fill = discount)) +
  geom_histogram(alpha = 0.30, position = 'identity', color = "#e9ecef", bins = 30) +
  # geom_density(alpha = 0.30) +
  xlab("Sales per Visit") +
  ylab("Count") +
  theme_minimal()
To both data scientists A & B the results look much better, but they were unsure which uplift to report. They thought about the difference in means:
Compute the difference in mean sales_per_visit by discount:
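A sketch of the same pattern as before, now on sales_per_visit:

data |>
  dplyr::group_by(discount) |>
  dplyr::summarize(mean_spv = mean(sales_per_visit)) |>
  dplyr::summarize(diff_in_means = diff(mean_spv))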
Can you see a way to interpret this in terms of dollars? yes/no
You’re done and ready to submit your work! Save, stage, commit, and push all remaining changes. You can use the commit message “Done with Lab 2!”, and make sure you have committed and pushed all changed files to GitHub (your Git pane in RStudio should be empty) and that all documents are updated in your repo on GitHub.
Submission
I will pull (copy) everyone’s submissions at 5:00pm on the Sunday following class, and I will work only with these copies, so anything submitted after 5:00pm will not be graded. (don’t forget to commit and then push your work by 5:00pm on Sunday!)
Grading
Total points available: 30 points.
Component | Points |
---|---|
Ex 1 - 10 | 30 |