Lab 2 - The Recipes package
SOLUTIONS
Packages
We will use the following packages in this lab.

```r
# check if 'librarian' is installed and, if not, install it
if (!"librarian" %in% rownames(installed.packages())) {
  install.packages("librarian")
}
# load packages (installing any that are not already installed)
librarian::shelf(
  tidyverse, magrittr, gt, gtExtras, tidymodels, DataExplorer, skimr, janitor, ggplot2
)
theme_set(theme_bw(base_size = 12))
```

I like to use the gt:: and gtExtras:: packages to format my tables. The results can be saved as an image and are especially useful for inserting into PowerPoint decks. You'll see examples throughout the course, and I encourage you to examine the examples and use gt:: and gtExtras:: in your own work.
Data: The Boston Cocktail Recipes
The Boston Cocktail Recipes dataset appeared in a TidyTuesday posting. TidyTuesday is a weekly data project in R.
The dataset is derived from the Mr. Boston Bartender’s Guide, together with a dataset that was web-scraped as part of a hackathon.
This dataset contains the following information for each cocktail:
variable | class | description |
---|---|---|
name | character | Name of cocktail |
category | character | Category of cocktail |
row_id | integer | Drink identifier |
ingredient_number | integer | Ingredient number |
ingredient | character | Ingredient |
measure | character | Measurement/volume of ingredient |
measure_number | real | Measure as a number |

```r
# read the cocktail recipes data
boston_cocktails <- readr::read_csv('data/boston_cocktails.csv', show_col_types = FALSE)
```
Exercises
Exercise 1
First use skimr::skim and DataExplorer::introduce to assess the quality of the data set.
Next prepare a summary of the data. What is the median measure_number across cocktail recipes?
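A minimal sketch of one approach:

```r
# assess data quality
skimr::skim(boston_cocktails)
DataExplorer::introduce(boston_cocktails)
# summarize the data; the median of measure_number appears in the output
summary(boston_cocktails)
median(boston_cocktails$measure_number, na.rm = TRUE)
```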
Exercise 2
From the boston_cocktails dataset select the name, category, ingredient, and measure_number columns and then pivot the table to create a column for each ingredient. Fill any missing values with the number zero.
Since the names of the new columns may contain spaces, clean them using janitor::clean_names(). Finally, drop any rows with NA values and save this new dataset in a variable.
How much gin is in the cocktail called Leap Frog Highball?
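A sketch, assuming the result is saved in a (hypothetically named) variable cocktails_df:

```r
cocktails_df <- boston_cocktails |>
  dplyr::select(name, category, ingredient, measure_number) |>
  tidyr::pivot_wider(names_from = ingredient, values_from = measure_number, values_fill = 0) |>
  janitor::clean_names() |>
  tidyr::drop_na()

# amount of gin in the Leap Frog Highball
cocktails_df |>
  dplyr::filter(name == "Leap Frog Highball") |>
  dplyr::pull(gin)
```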
Exercise 3
Prepare a recipes::recipe object without a target, but give name and category 'id' roles. Add steps to normalize the predictors and perform PCA. Finally, prep the recipe and save it in a variable.
How many predictor variables are prepped by the recipe?
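A sketch, assuming cocktails_df from Exercise 2 and saving the prepped recipe as the (hypothetically named) pca_prep:

```r
pca_rec <- recipes::recipe(~ ., data = cocktails_df) |>
  recipes::update_role(name, category, new_role = "id") |>
  recipes::step_normalize(recipes::all_predictors()) |>
  recipes::step_pca(recipes::all_predictors())

pca_prep <- recipes::prep(pca_rec)
pca_prep   # printing the prepped recipe reports the number of predictors
```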
Exercise 4
Apply the recipes::tidy verb to the prepped recipe in the last exercise. The result is a table identifying the information generated and stored by each step in the recipe from the input data.
To see the values calculated for normalization, apply the recipes::tidy verb as before, but with the second argument = 1.
What ingredient is the most used, on average?
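A sketch, assuming pca_prep from Exercise 3:

```r
recipes::tidy(pca_prep)      # one row per recipe step
recipes::tidy(pca_prep, 1)   # means and sds computed by step_normalize
# the ingredient with the largest mean, i.e. the most used on average
recipes::tidy(pca_prep, 1) |>
  dplyr::filter(statistic == "mean") |>
  dplyr::slice_max(value, n = 1)
```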
Exercise 5
Now look at the result of the PCA, applying the recipes::tidy verb as before, but with the second argument = 2. Save the result in a variable and filter for the components PC1 to PC5. Mutate the resulting component column so that the values are factors, ordering them in the order they appear using the forcats::fct_inorder verb.
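A sketch, saving the result in a (hypothetically named) variable pca_components:

```r
pca_components <- recipes::tidy(pca_prep, 2) |>
  dplyr::filter(component %in% paste0("PC", 1:5)) |>
  dplyr::mutate(component = forcats::fct_inorder(component))
```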
Plot this data using ggplot2 and the code below:

```r
# pipe the variable with the tidied PCA data into ggplot(), e.g. pca_components |>
ggplot(aes(value, terms, fill = terms)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~component, nrow = 1) +
  labs(y = NULL) +
  theme(axis.text = element_text(size = 7),
        axis.title = element_text(size = 14, face = "bold"))
```
How would you describe the drinks represented by PC1?
Exercise 6
As in the last exercise, use the variable with the tidied PCA data, keeping only PCA components PC1 to PC4. Take/slice the top 8 ingredients by component, ordered by their absolute value, using the verb dplyr::slice_max. Next, generate a grouped table using gt::gt, colouring the cell backgrounds (i.e. fill) with green for positive values and red for negative values.
What is the characteristic alcoholic beverage of each of the first 4 principal components?
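A sketch, assuming pca_components from Exercise 5; the colouring here uses gt::tab_style with gt::cell_fill, which is one way to achieve the green/red fill:

```r
pca_components |>
  dplyr::filter(component %in% paste0("PC", 1:4)) |>
  dplyr::group_by(component) |>
  dplyr::slice_max(abs(value), n = 8) |>
  gt::gt() |>
  gt::tab_style(
    style = gt::cell_fill(color = "green"),
    locations = gt::cells_body(columns = value, rows = value > 0)
  ) |>
  gt::tab_style(
    style = gt::cell_fill(color = "red"),
    locations = gt::cells_body(columns = value, rows = value < 0)
  )
```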
Exercise 7
For this exercise, bake the prepped PCA recipe using recipes::bake on the original data and plot each cocktail by its PC1 and PC2 components, using:

```r
# pipe the baked data (see the sketch below) into ggplot()
ggplot(aes(PC1, PC2, label = name)) +
  geom_point(aes(color = category), alpha = 0.7, size = 2) +
  geom_text(check_overlap = TRUE, hjust = "inward") +
  labs(color = NULL)
```
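A sketch of the bake step, assuming pca_prep from Exercise 3; with new_data = NULL, recipes::bake returns the processed training data:

```r
pca_baked <- recipes::bake(pca_prep, new_data = NULL)
# then: pca_baked |> ggplot(aes(PC1, PC2, label = name)) + ...
```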
Can you create an interpretation of the PCA analysis?
Exercise 8
In the following exercise, we'll use the recipes package to prepare time series data. The starting dataset contains monthly house price data for each of the four countries/regions in the UK.

```r
uk_prices <- readr::read_csv('data/UK_house_prices.csv', show_col_types = FALSE)
```

Write code to clean the names in the uk_prices dataset using janitor::clean_names(), and then, using skimr::skim, confirm that the region names are correct and that there are no missing values. Call the resulting analytic data set df.
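A minimal sketch:

```r
df <- uk_prices |> janitor::clean_names()
skimr::skim(df)   # check the region names and confirm there are no missing values
```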
Exercise 9
We want to use the monthly house price data to predict house prices one month ahead for each region. The basic model will be:

SalesVolume ~ .

where the date and region name are id variables, not predictor variables. Instead we will use the prior-month lagged prices as the only predictor.
The recipe uses the following 5 steps from the recipes package:

- update_role(region_name, new_role = "id")
- recipe(sales_volume ~ ., data = df |> janitor::clean_names())
- step_naomit(lag_1_sales_volume, skip = FALSE)
- step_lag(sales_volume, lag = 1)
- step_arrange(region_name, date)

Use these 5 steps in the proper order to create a recipe to pre-process the time series data for each region. Prep and then bake your recipe using df.
The result of the bake step, after filtering for dates on or before 2005-03-01, should look like this:

```r
readRDS("data/baked_uk_house_dat.rds") |>
  dplyr::filter(date <= lubridate::ymd(20050301)) |>
  gt::gt() |>
  gt::fmt_currency(columns = -c(date, region_name), decimals = 0) |>
  gtExtras::gt_theme_espn()
```

date | region_name | sales_volume | lag_1_sales_volume |
---|---|---|---|
2005-02-01 | England | $56,044 | $53,464 |
2005-03-01 | England | $67,322 | $56,044 |
2005-02-01 | Northern Ireland | $978 | $978 |
2005-03-01 | Northern Ireland | $978 | $978 |
2005-02-01 | Scotland | $7,631 | $8,876 |
2005-03-01 | Scotland | $9,661 | $7,631 |
2005-02-01 | Wales | $2,572 | $2,516 |
2005-03-01 | Wales | $3,336 | $2,572 |
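A sketch of one plausible ordering of these steps, assuming df from Exercise 8 (the ordering is my reading of the exercise, so verify it against your own reasoning):

```r
ts_rec <- recipes::recipe(sales_volume ~ ., data = df |> janitor::clean_names()) |>
  recipes::update_role(region_name, new_role = "id") |>
  recipes::step_arrange(region_name, date) |>
  recipes::step_lag(sales_volume, lag = 1) |>
  recipes::step_naomit(lag_1_sales_volume, skip = FALSE)

ts_rec |>
  recipes::prep() |>
  recipes::bake(new_data = df)
```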
Exercise 10
Recall The Business Problem
We're at a fast-paced startup. The company is growing quickly, and the marketing team is looking for ways to increase sales from existing customers by getting them to buy more. The main idea is to unlock the potential of the customer base through incentives, in this case a discount. We of course want to measure the effect of the discount on customer behavior. Still, the team does not want to waste money giving discounts to users who are not valuable. As always, it is about return on investment (ROI).
Without going into specifics about the nature of the discount, it has been designed to provide a positive return on investment if the customer buys more than a given threshold as a result of the discount. How can we measure the effect of the discount and make sure our experiment has a positive ROI? The marketing team came up with the following strategy:
- Select a sample of existing customers from the same cohort.
- Set a test window of 1 month.
- Look into the historical data of web visits from the last month. The hypothesis is that web visits are a good proxy for the customer’s interest in the product.
- For customers with a high number of web visits, send a discount. Within these potentially valuable customers there will be a hold-out group which will not receive the discount. For customers with a low number of web visits, do not send a discount (the marketing team wants to report a positive ROI, so they do not want to waste money on customers who are not valuable). Still, they want to use these customers to measure the effect of the discount.
- We also want to use the results of the test to tag loyal customers. These are customers who got a discount (since they showed potential interest in the product) and customers with exceptional sales numbers even if they did not get a discount. The idea is to use this information to target them in the future if the discount strategy proves positive.
In the last lab we did some exploratory data analysis. The next step is to prepare some descriptive statistics.
Descriptive Statistics
The first thing the data analytics team did was to split the sales distribution by discount group:
```r
data <- readr::read_csv('data/sales_dag.csv', show_col_types = FALSE)

data |>
  dplyr::mutate(discount = factor(discount)) |>
  ggplot(aes(x = sales, after_stat(count), fill = discount)) +
  geom_histogram(alpha = 0.30, position = 'identity', color = "#e9ecef", bins = 30) +
  geom_density(alpha = 0.30) +
  xlab("Sales") +
  ylab("Density") +
  theme_minimal()
```
It looks like customers with a discount have higher sales. Data scientist A is optimistic about this initial result. To quantify it, they computed the difference in means:

```r
mean_sales <- data |>
  dplyr::group_by(discount) |>
  dplyr::summarize("mean sales" = mean(sales)) |>
  dplyr::mutate("mean sales difference" = `mean sales` - dplyr::lag(`mean sales`))
mean_sales
```

```
# A tibble: 2 × 3
  discount `mean sales` `mean sales difference`
     <dbl>        <dbl>                   <dbl>
1        0         15.8                   NA
2        1         20.6                    4.80
```
Our calculation gives a mean uplift of about 4.8! This is great news: the discount strategy seems to be working. Data scientist A is happy with the results and decides to get feedback from the rest of the data science team.
Data scientist B is not so happy with the results. They think the uplift is too good to be true (based on domain knowledge and the sales distributions 🤔). When thinking about reasons for such a high uplift, they realized the discount assignment was not random: it was based on the number of web visits (remember the marketing plan?). This means the discount group is not completely comparable to the control group! They decide to plot sales against web visits per discount group:
```r
data |>
  dplyr::mutate(discount = factor(discount)) |>
  ggplot(aes(x = visits, y = sales, color = discount)) +
  geom_point() +
  facet_grid(cols = vars(discount))
```
Indeed, they realize they should probably adjust for the number of web visits. A natural metric is sales per web visit. Let’s compute it:
```r
mean_sales_pv <- data |>
  dplyr::group_by(discount) |>
  dplyr::summarize("sales_per_visit" = mean(sales_per_visit))
mean_sales_pv
```

```
# A tibble: 2 × 2
  discount sales_per_visit
     <dbl>           <dbl>
1        0           0.861
2        1           0.938
```
The mean value is higher for the discount group. As always, they also looked at the distributions:
```r
data |>
  dplyr::mutate(discount = factor(discount)) |>
  ggplot(aes(x = sales_per_visit, after_stat(count), fill = discount)) +
  geom_histogram(alpha = 0.30, position = 'identity', color = "#e9ecef", bins = 30) +
  # geom_density(alpha = 0.30) +
  xlab("Sales per Visit") +
  ylab("Count") +
  theme_minimal()
```
To both data scientists A and B the results look much better, but they were unsure about which uplift to report. They thought about the difference in means:

```r
mean_sales_per_visit <- data |>
  dplyr::group_by(discount) |>
  dplyr::summarize("mean sales per visit" = mean(sales_per_visit)) |>
  dplyr::mutate("mean sales difference" = `mean sales per visit` - dplyr::lag(`mean sales per visit`))

mean_sales_per_visit |>
  gt::gt() |>
  gt::tab_header(title = "Mean sales per visit") |>
  gt::fmt_number(columns = `mean sales difference`, decimals = 5) |>
  gtExtras::gt_theme_espn()
```

Mean sales per visit
discount | mean sales per visit | mean sales difference |
---|---|---|
0 | 0.8612426 | NA |
1 | 0.9382929 | 0.07705 |
However, how should we interpret this value in terms of dollars? To be continued …
Grading
Total points available: 30 points.
Component | Points |
---|---|
Ex 1 - 10 | 30 |