Lab 2 - The Recipes package

SOLUTIONS

Packages

We will use the following packages in this lab.

# check if 'librarian' is installed and if not, install it
if (! "librarian" %in% rownames(installed.packages()) ){
  install.packages("librarian")
}
  
# load packages if not already loaded
librarian::shelf(
  tidyverse, magrittr, gt, gtExtras, tidymodels, DataExplorer, skimr, janitor, ggplot2
)

theme_set(theme_bw(base_size = 12))
boston_cocktails <- readr::read_csv('data/boston_cocktails.csv', show_col_types = FALSE)
Tip

I like to use the gt:: and gtExtras:: packages to format my tables. The results can be saved as an image and are especially useful for inserting into PowerPoint decks.

You’ll see examples throughout the course, and I encourage you to examine them and use gt:: and gtExtras:: in your own work.
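A minimal sketch (not part of the lab) of that workflow, assuming the hypothetical file name 'my_table.png'; saving to PNG requires the webshot2 package:

# format a small table and save it as an image for a slide deck
mtcars %>%
  head() %>%
  gt::gt() %>%
  gtExtras::gt_theme_espn() %>%
  gt::gtsave("my_table.png")  # hypothetical file name; PNG export needs webshot2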

Data: The Boston Cocktail Recipes

The Boston Cocktail Recipes dataset appeared in a TidyTuesday posting. TidyTuesday is a weekly data project in R.

The dataset is derived from the Mr. Boston Bartender’s Guide, together with a dataset that was web-scraped as part of a hackathon.

This dataset contains the following information for each cocktail:

variable           class      description
name               character  Name of cocktail
category           character  Category of cocktail
row_id             integer    Drink identifier
ingredient_number  integer    Ingredient number
ingredient         character  Ingredient
measure            character  Measurement/volume of ingredient
measure_number     real       Measure as a number

Exercises

Exercise 1

First use skimr::skim and DataExplorer::introduce to assess the quality of the data set.

Next prepare a summary. What is the median measure number across cocktail recipes?

SOLUTION:
boston_cocktails %>% skimr::skim()
Data summary
Name Piped data
Number of rows 2542
Number of columns 7
_______________________
Column type frequency:
character 4
numeric 3
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
name 0 1 4 36 0 937 0
category 0 1 3 21 0 11 0
ingredient 0 1 3 23 0 40 0
measure 0 1 1 8 0 28 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
row_id 0 1 495.37 284.75 1.00 261.25 497 730.75 990 ▇▇▇▇▇
ingredient_number 0 1 2.54 1.29 1.00 1.00 2 3.00 6 ▇▃▂▁▁
measure_number 0 1 0.98 0.72 0.02 0.50 1 1.50 16 ▇▁▁▁▁
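The exercise also asks for DataExplorer::introduce; a minimal sketch of that call (output not shown):

# one-row overview: counts of rows, columns, missing values, complete rows, etc.
boston_cocktails %>% DataExplorer::introduce()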
boston_cocktails %>% summary()
     name             category             row_id      ingredient_number
 Length:2542        Length:2542        Min.   :  1.0   Min.   :1.000    
 Class :character   Class :character   1st Qu.:261.2   1st Qu.:1.000    
 Mode  :character   Mode  :character   Median :497.0   Median :2.000    
                                       Mean   :495.4   Mean   :2.545    
                                       3rd Qu.:730.8   3rd Qu.:3.000    
                                       Max.   :990.0   Max.   :6.000    
  ingredient          measure          measure_number   
 Length:2542        Length:2542        Min.   : 0.0200  
 Class :character   Class :character   1st Qu.: 0.5000  
 Mode  :character   Mode  :character   Median : 1.0000  
                                       Mean   : 0.9797  
                                       3rd Qu.: 1.5000  
                                       Max.   :16.0000  

The median measure_number is 1.0. Note that the units are ounces (oz), as identified in the measure column.
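To confirm the units, a quick look at the raw measure strings (a sketch, not required by the exercise):

# tabulate the raw measure strings; the values carry the unit, e.g. ounces (oz)
boston_cocktails %>%
  dplyr::count(measure, sort = TRUE) %>%
  head()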

Exercise 2

From the boston_cocktails dataset select the name, category, ingredient, and measure_number columns and then pivot the table to create a column for each ingredient. Fill any missing values with the number zero.

Since the names of the new columns may contain spaces, clean them using janitor::clean_names(). Finally, drop any rows with NA values and save this new dataset in a variable.

How much gin is in the cocktail called Leap Frog Highball?

SOLUTION:
cocktails_df <- boston_cocktails %>%
  # select the columns (by de-selecting the ones we don't want)
  dplyr::select(-ingredient_number, -row_id, -measure) %>%
  # pivot wider (make more columns); use zeros in place of NA values
  tidyr::pivot_wider(
    names_from = ingredient
    , values_from = measure_number
    , values_fill = 0
  ) %>%
  janitor::clean_names() %>%
  tidyr::drop_na()
# show the table in the document
cocktails_df
# A tibble: 937 × 42
   name    category light_rum lemon_juice lime_juice sweet_vermouth orange_juice
   <chr>   <chr>        <dbl>       <dbl>      <dbl>          <dbl>        <dbl>
 1 Gauguin Cocktai…      2           1          1               0           0   
 2 Fort L… Cocktai…      1.5         0          0.25            0.5         0.25
 3 Cuban … Cocktai…      2           0          0.5             0           0   
 4 Cool C… Cocktai…      0           0          0               0           1   
 5 John C… Whiskies      0           1          0               0           0   
 6 Cherry… Cocktai…      1.25        0          0               0           0   
 7 Casa B… Cocktai…      2           0          1.5             0           0   
 8 Caribb… Cocktai…      0.5         0          0               0           0   
 9 Amber … Cordial…      0           0.25       0               0           0   
10 The Jo… Whiskies      0           0.5        0               0           0   
# ℹ 927 more rows
# ℹ 35 more variables: powdered_sugar <dbl>, dark_rum <dbl>,
#   cranberry_juice <dbl>, pineapple_juice <dbl>, bourbon_whiskey <dbl>,
#   simple_syrup <dbl>, cherry_flavored_brandy <dbl>, light_cream <dbl>,
#   triple_sec <dbl>, maraschino <dbl>, amaretto <dbl>, grenadine <dbl>,
#   apple_brandy <dbl>, brandy <dbl>, gin <dbl>, anisette <dbl>,
#   dry_vermouth <dbl>, apricot_flavored_brandy <dbl>, bitters <dbl>, …
cocktails_df %>% 
  # filter for the desired cocktail
  dplyr::filter(name == 'Leap Frog Highball') %>% 
  dplyr::pull(gin)
[1] 2

Two ounces (oz) of gin are in the Leap Frog Highball.
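As a cross-check, the same answer can be read off the original long-format data; a sketch, assuming the raw ingredient label contains "gin" (matched case-insensitively):

# look up the cocktail in the long data and match the ingredient by pattern
boston_cocktails %>%
  dplyr::filter(
    name == "Leap Frog Highball",
    stringr::str_detect(ingredient, stringr::regex("gin", ignore_case = TRUE))
  ) %>%
  dplyr::select(name, ingredient, measure, measure_number)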

Exercise 3

Prepare a recipes::recipe object without a target but give name and category as ‘id’ roles. Add steps to normalize the predictors and perform PCA. Finally prep the data and save it in a variable.

How many predictor variables are prepped by the recipe?

SOLUTION:
# create a recipe: normally y ~ . with an outcome/target; here there is no target, so we use ~.
pca_rec <- recipes::recipe(~., data = cocktails_df) 
pca_rec %>% summary()
# A tibble: 42 × 4
   variable        type      role      source  
   <chr>           <list>    <chr>     <chr>   
 1 name            <chr [3]> predictor original
 2 category        <chr [3]> predictor original
 3 light_rum       <chr [2]> predictor original
 4 lemon_juice     <chr [2]> predictor original
 5 lime_juice      <chr [2]> predictor original
 6 sweet_vermouth  <chr [2]> predictor original
 7 orange_juice    <chr [2]> predictor original
 8 powdered_sugar  <chr [2]> predictor original
 9 dark_rum        <chr [2]> predictor original
10 cranberry_juice <chr [2]> predictor original
# ℹ 32 more rows
pca_rec <- pca_rec %>% 
  # change the roles of name and category to 'id' from 'predictor'
  recipes::update_role(name, category, new_role = "id") %>%
  # normalize the remaining predictors
  recipes::step_normalize(all_predictors()) %>%
  # convert the predictors to principal components
  recipes::step_pca(all_predictors())

# note there are 40 predictors, but nothing has been calculated yet
pca_rec %>% summary()
# A tibble: 42 × 4
   variable        type      role      source  
   <chr>           <list>    <chr>     <chr>   
 1 name            <chr [3]> id        original
 2 category        <chr [3]> id        original
 3 light_rum       <chr [2]> predictor original
 4 lemon_juice     <chr [2]> predictor original
 5 lime_juice      <chr [2]> predictor original
 6 sweet_vermouth  <chr [2]> predictor original
 7 orange_juice    <chr [2]> predictor original
 8 powdered_sugar  <chr [2]> predictor original
 9 dark_rum        <chr [2]> predictor original
10 cranberry_juice <chr [2]> predictor original
# ℹ 32 more rows
# prep estimates the quantities required by each step and prepares the data per the recipe
pca_prep <- recipes::prep(pca_rec)
pca_prep %>% summary
# A tibble: 7 × 4
  variable type      role      source  
  <chr>    <list>    <chr>     <chr>   
1 name     <chr [3]> id        original
2 category <chr [3]> id        original
3 PC1      <chr [2]> predictor derived 
4 PC2      <chr [2]> predictor derived 
5 PC3      <chr [2]> predictor derived 
6 PC4      <chr [2]> predictor derived 
7 PC5      <chr [2]> predictor derived 
  • There are 40 predictors and 2 id variables before the data is prepped.
  • Once prepped, the PCA returns just 5 components by default, so we have (post-prep) 5 predictors and 2 id variables.
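If a different number of components is wanted, step_pca() also accepts num_comp = k, or threshold = p to keep enough components to explain a fraction p of the total variance. A sketch (the 0.8 threshold is an arbitrary example):

# same recipe, but keep enough components to explain ~80% of the variance
pca_rec_80 <- recipes::recipe(~., data = cocktails_df) %>%
  recipes::update_role(name, category, new_role = "id") %>%
  recipes::step_normalize(all_predictors()) %>%
  recipes::step_pca(all_predictors(), threshold = 0.8)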

Exercise 4

Apply the recipes::tidy verb to the prepped recipe in the last exercise. The result is a table identifying the information generated and stored by each step in the recipe from the input data.

To see the values calculated for normalization, apply the recipes::tidy verb as before, but with second argument = 1.

What ingredient is the most used, on average?

SOLUTION:
# tidy returns a tibble with the calculations performed by prep
pca_prep %>% recipes::tidy()
# A tibble: 2 × 6
  number operation type      trained skip  id             
   <int> <chr>     <chr>     <lgl>   <lgl> <chr>          
1      1 step      normalize TRUE    FALSE normalize_kssgp
2      2 step      pca       TRUE    FALSE pca_LwyKj      
# if we select the first (normalization) step we get the values calculated:
# - the mean and standard deviation for each variable
foo <- pca_prep %>% recipes::tidy(1)
foo
# A tibble: 80 × 4
   terms           statistic  value id             
   <chr>           <chr>      <dbl> <chr>          
 1 light_rum       mean      0.161  normalize_kssgp
 2 lemon_juice     mean      0.229  normalize_kssgp
 3 lime_juice      mean      0.138  normalize_kssgp
 4 sweet_vermouth  mean      0.0691 normalize_kssgp
 5 orange_juice    mean      0.185  normalize_kssgp
 6 powdered_sugar  mean      0.0891 normalize_kssgp
 7 dark_rum        mean      0.0454 normalize_kssgp
 8 cranberry_juice mean      0.0363 normalize_kssgp
 9 pineapple_juice mean      0.0608 normalize_kssgp
10 bourbon_whiskey mean      0.0768 normalize_kssgp
# ℹ 70 more rows
# we can just filter to find the largest mean value
# first, isolate the mean values
foo %>% dplyr::filter(statistic == 'mean') %>% 
  # once we have just the mean values, keep the row with the max value
  dplyr::filter(value == max(value))
# A tibble: 1 × 4
  terms statistic value id             
  <chr> <chr>     <dbl> <chr>          
1 gin   mean      0.252 normalize_kssgp

On average, gin is the most used ingredient, at just over 1/4 oz per cocktail.
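An equivalent one-step lookup (a sketch) uses dplyr::slice_max directly:

# keep the row with the largest mean in a single verb
foo %>%
  dplyr::filter(statistic == "mean") %>%
  dplyr::slice_max(value, n = 1)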

Exercise 5

Now look at the result of the PCA, applying the recipes::tidy verb as before, but with second argument = 2. Save the result in a variable and filter for the components PC1 to PC5. Mutate the resulting component column so that the values are factors, ordering them in the order they appear using the forcats::fct_inorder verb.

Plot this data using ggplot2 and the code below

ggplot(aes(value, terms, fill = terms)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~component, nrow = 1) +
  labs(y = NULL) +
  theme(axis.text=element_text(size=7),
        axis.title=element_text(size=14,face="bold"))

How would you describe the drinks represented by PC1?

SOLUTION:
# the tidy operation shows the weights of each ingredient, 
# for each principal component.
# - i.e. PC1 (along with PC2 - PC5) is a weighted sum of ingredients
bar <- pca_prep %>% recipes::tidy(2)
bar
# A tibble: 1,600 × 4
   terms             value component id       
   <chr>             <dbl> <chr>     <chr>    
 1 light_rum        0.163  PC1       pca_LwyKj
 2 lemon_juice     -0.0140 PC1       pca_LwyKj
 3 lime_juice       0.224  PC1       pca_LwyKj
 4 sweet_vermouth  -0.0661 PC1       pca_LwyKj
 5 orange_juice     0.0308 PC1       pca_LwyKj
 6 powdered_sugar  -0.476  PC1       pca_LwyKj
 7 dark_rum         0.124  PC1       pca_LwyKj
 8 cranberry_juice  0.0954 PC1       pca_LwyKj
 9 pineapple_juice  0.119  PC1       pca_LwyKj
10 bourbon_whiskey  0.0963 PC1       pca_LwyKj
# ℹ 1,590 more rows
# plot to show the ingredient weights
bar %>%
  # since there are only 5 components, this is redundant
  dplyr::filter(component %in% paste0("PC", 1:5)) %>%
  # change component from a character to a factor, and give them an order
  dplyr::mutate(component = forcats::fct_inorder(component)) %>%
  # plot
  ggplot(aes(value, terms, fill = terms)) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~component, nrow = 1) +
  labs(y = NULL) +
  theme(axis.text=element_text(size=7),
        axis.title=element_text(size=14,face="bold"))

Based on its largest components (in absolute value), PC1 represents a syrupy (not sugary) drink, containing tequila and lime juice but no egg products.

Exercise 6

As in the last exercise, use the variable with the tidied PCA data and use only PCA components PC1 to PC4. Take/slice the top 8 ingredients by component, ordered by their absolute value, using the verb dplyr::slice_max. Next, generate a grouped table using gt::gt, colouring the cell backgrounds (i.e. fill) with green for values ≥ 0 and red for values < 0.

What is the characteristic alcoholic beverage of each of the first 4 principal components?

SOLUTION:
bar %>%
  # filter out the rows for PC1 - PC4
  dplyr::filter(component %in% paste0("PC", 1:4)) %>%
  # group by component (i.e. principal component) 
  dplyr::group_by(component) %>%
  # for each group, take the top 8 ingredients by absolute value
  dplyr::slice_max(n = 8, order_by = abs(value)) %>% 
  # now make a nicely formatted table
  gt::gt() %>% 
  # make/apply a table style: this one for values < 0
  gt::tab_style(
    style = list(
      gt::cell_fill(color = "red"),
      gt::cell_text(weight = "bold")
      ),
    locations = gt::cells_body(
      columns = value,
      rows = value < 0
    )
  ) %>% 
  # make/apply another table style: this one for values >= 0
  gt::tab_style(
    style = list(
      gt::cell_fill(color = "green"),
      gt::cell_text(weight = "bold")
      ),
    locations = gt::cells_body(
      columns = value,
      rows = value >= 0
    )
  ) %>% 
  # apply a theme; any theme will do
  gtExtras::gt_theme_espn()
terms value id
PC1
powdered_sugar -0.4764442 pca_LwyKj
simple_syrup 0.3125978 pca_LwyKj
whole_egg -0.2903794 pca_LwyKj
egg_white -0.2643614 pca_LwyKj
gin -0.2588481 pca_LwyKj
port -0.2354010 pca_LwyKj
lime_juice 0.2240295 pca_LwyKj
blanco_tequila 0.2025436 pca_LwyKj
PC2
dry_vermouth 0.4332412 pca_LwyKj
sweet_vermouth 0.3754366 pca_LwyKj
powdered_sugar -0.3273723 pca_LwyKj
lemon_juice -0.2651836 pca_LwyKj
simple_syrup -0.2512150 pca_LwyKj
gin 0.2437374 pca_LwyKj
whole_egg -0.2191547 pca_LwyKj
port -0.1763370 pca_LwyKj
PC3
gin 0.3780348 pca_LwyKj
egg_white 0.3094631 pca_LwyKj
benedictine -0.2864463 pca_LwyKj
whole_egg -0.2825812 pca_LwyKj
blended_scotch_whiskey -0.2352922 pca_LwyKj
lemon_juice 0.2318927 pca_LwyKj
vodka -0.2167814 pca_LwyKj
apricot_flavored_brandy 0.2134899 pca_LwyKj
PC4
grenadine 0.4068737 pca_LwyKj
orange_juice 0.3625622 pca_LwyKj
vodka 0.3008406 pca_LwyKj
simple_syrup -0.2533007 pca_LwyKj
half_and_half 0.2493552 pca_LwyKj
egg_white 0.2465323 pca_LwyKj
cranberry_juice 0.2228099 pca_LwyKj
white_creme_de_cacao 0.2089063 pca_LwyKj

Principal components and similar methods are very useful in reducing the complexity of our models. In this case we reduced the 40 original predictors to just 5 predictors.

The challenge with using these methods is attaching meaning to the derived predictors, which is important when we need to explain our models. The computer can compute the new predictors, but it can't tell us what they represent. For this we need to look at the structure of the new predictors and see if we can attach some meaning to them; often the solution is to give them names that capture the underlying structure.
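One way to check how much of the original signal the derived predictors retain is the variance summary stored by the PCA step; the tidy method for step_pca accepts type = "variance". A sketch:

# cumulative percent of variance explained by the leading components
pca_prep %>%
  recipes::tidy(2, type = "variance") %>%
  dplyr::filter(terms == "cumulative percent variance") %>%
  head(5)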

In this case, looking at the ingredients that make up the PCA predictors:

  • PC1 represents a drink with:

    • little or no sugar, egg, gin or port; some or a lot of syrup and citrus juice
    • most often / mainly tequila
  • PC2 represents a drink with:

    • little or no sugar, syrup, or citrus juice
    • most often / mainly vermouth
  • PC3 represents a drink with:

    • little or no egg, whiskey or vodka

    • most often / mainly gin

  • PC4 represents a drink with:

    • little or no syrup; some or a lot of juice and dairy product

    • most often / mainly grenadine and vodka

Exercise 7

For this exercise, bake the prepped PCA recipe using recipes::bake on the original data and plot each cocktail by its PC1 and PC2 components, using

ggplot(aes(PC1, PC2, label = name)) +
  geom_point(aes(color = category), alpha = 0.7, size = 2) +
  geom_text(check_overlap = TRUE, hjust = "inward") + 
  labs(color = NULL)

Can you create an interpretation of the PCA analysis?

SOLUTION:
# bake the dataset cocktails_df, creating a new dataset
recipes::bake(pca_prep, new_data = cocktails_df) %>%
  ggplot(aes(PC1, PC2, label = name)) +
  geom_point(aes(color = category), alpha = 0.7, size = 2) +
  geom_text(check_overlap = TRUE, hjust = "inward") + 
  labs(color = NULL)

In this exercise we are plotting the cocktails against the first two principal components and trying to interpret the results.

The interpretation is more difficult now than it was in the last exercise where we found an interpretation for each principal component. Now we are looking at each cocktail in terms of combinations of PC1 and PC2, both positive and negative.

It appears that the lower left quadrant (negative PC1 and PC2 values) contains mainly cocktail classics. If you look at the components of PC1 and PC2 from the last exercise and negate them, you can describe the cocktails in this quadrant: they have egg/egg white, port, and sugar, and no vermouth, gin, or tequila.

The lower right quadrant is positive PC1 and negative PC2. PC1 loads positively on juice and syrup, while PC2 loads negatively on juice, syrup, and sugar; so in this quadrant we have cocktails that are like PC1 (juice and syrup) and unlike PC2 (which is negative on juice, syrup, and sugar). The cocktails in this quadrant have citrus juice, are sweet from syrup and sugar, and likely contain tequila.

The top half of the plot shows cocktails clustered along the PC1 = 0 axis; those are mainly vermouth cocktails, very much like PC2.

Is there an opportunity to create a new cocktail in the upper left or upper right quadrants?
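A sketch for exploring that question: list the baked cocktails that land in the top half of the plot (PC2 > 0), split by the sign of PC1.

# cocktails in the upper quadrants, sorted by PC2
recipes::bake(pca_prep, new_data = NULL) %>%  # NULL returns the (retained) training data
  dplyr::filter(PC2 > 0) %>%
  dplyr::mutate(quadrant = ifelse(PC1 >= 0, "upper right", "upper left")) %>%
  dplyr::arrange(dplyr::desc(PC2)) %>%
  dplyr::select(name, category, PC1, PC2, quadrant)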

Exercise 8

In the following exercise, we’ll use the recipes package to prepare time series data. The starting dataset contains monthly house price data for each of the four countries/regions in the UK.

uk_prices <- readr::read_csv('data/UK_house_prices.csv', show_col_types = FALSE)

Write code to clean the names in the uk_prices dataset using janitor::clean_names(), and then using skimr::skim confirm that the region names are correct and that there are no missing values. Call the resulting analytic data set df.

SOLUTION:
# PLEASE SHOW YOUR WORK
df <- uk_prices |> janitor::clean_names()
unique(df$region_name)
[1] "England"          "Northern Ireland" "Scotland"         "Wales"           
skimr::skim(df)
Data summary
Name df
Number of rows 888
Number of columns 3
_______________________
Column type frequency:
character 1
Date 1
numeric 1
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
region_name 0 1 5 16 0 4 0

Variable type: Date

skim_variable n_missing complete_rate min max median n_unique
date 0 1 2005-01-01 2023-06-01 2014-03-16 222

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
sales_volume 8 0.99 20490.35 29647.56 665 2517.5 4984.5 18286.5 145089 ▇▁▂▁▁

Exercise 9

We want to use the monthly house price data to predict house prices one month ahead for each region. The basic model will be:

SalesVolume ~ .

where date and region_name are id variables, not predictor variables. Instead, we will use the prior-month lagged sales volume as the only predictor.

The recipe uses the following 5 steps from the recipes package: update_role(region_name, new_role = "id") | recipe(sales_volume ~ ., data = df |> janitor::clean_names()) | step_naomit(lag_1_sales_volume, skip=FALSE) | step_lag(sales_volume, lag=1) | step_arrange(region_name, date)

Use these 5 steps in the proper order to create a recipe to pre-process the time series data for each region. Prep and then bake your recipe using df.

SOLUTION:
# PLEASE SHOW YOUR WORK
df_rec <- df |> recipes::recipe(sales_volume ~ .) |> 
  recipes::update_role(region_name, new_role = "id") |> 
  recipes::step_lag(sales_volume, lag=1) |> 
  recipes::step_naomit(lag_1_sales_volume, skip=FALSE) |> 
  recipes::step_arrange(region_name, date)

df_rec |> 
  recipes::prep() |> 
  recipes::bake(new_data = NULL) |> 
  dplyr::filter(date <= lubridate::ymd(20050301)) 
# A tibble: 8 × 4
  date       region_name      sales_volume lag_1_sales_volume
  <date>     <fct>                   <dbl>              <dbl>
1 2005-02-01 England                56044              53464 
2 2005-03-01 England                67322              56044 
3 2005-02-01 Northern Ireland         978.               978.
4 2005-03-01 Northern Ireland         978.               978.
5 2005-02-01 Scotland                7631               8876 
6 2005-03-01 Scotland                9661               7631 
7 2005-02-01 Wales                   2572               2516 
8 2005-03-01 Wales                   3336               2572 

The result of the bake step, after filtering for dates ≤ 2005-03-01, should look like this:

readRDS("data/baked_uk_house_dat.rds") |> 
  dplyr::filter(date <= lubridate::ymd(20050301)) |> 
  gt::gt() |> 
  gt::fmt_currency(columns = -c(date,region_name), decimals = 0) |> 
  gtExtras::gt_theme_espn()
date region_name sales_volume lag_1_sales_volume
2005-02-01 England $56,044 $53,464
2005-03-01 England $67,322 $56,044
2005-02-01 Northern Ireland $978 $978
2005-03-01 Northern Ireland $978 $978
2005-02-01 Scotland $7,631 $8,876
2005-03-01 Scotland $9,661 $7,631
2005-02-01 Wales $2,572 $2,516
2005-03-01 Wales $3,336 $2,572
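For comparison, a dplyr sketch of an equivalent per-region lag; grouping by region_name makes the within-region intent explicit:

# arrange, lag within each region, then drop the first (NA) row per region
df |>
  dplyr::arrange(region_name, date) |>
  dplyr::group_by(region_name) |>
  dplyr::mutate(lag_1_sales_volume = dplyr::lag(sales_volume, 1)) |>
  dplyr::ungroup() |>
  tidyr::drop_na(lag_1_sales_volume)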

Exercise 10

Recall The Business Problem

We’re at a fast-paced startup. The company is growing fast, and the marketing team is looking for ways to increase sales from existing customers by getting them to buy more. The main idea is to unlock the potential of the customer base through incentives, in this case a discount. We of course want to measure the effect of the discount on customer behavior. Still, they do not want to waste money giving discounts to users who are not valuable. As always, it is about return on investment (ROI).

Without going into specifics about the nature of the discount, it has been designed to provide a positive return on investment if the customer buys more than $1 as a result of the discount. How can we measure the effect of the discount and make sure our experiment has a positive ROI? The marketing team came up with the following strategy:

  • Select a sample of existing customers from the same cohort.
  • Set a test window of 1 month.
  • Look into the historical data of web visits from the last month. The hypothesis is that web visits are a good proxy for the customer’s interest in the product.
  • For customers with a high number of web visits, send them a discount. Within these potentially valuable customers there will be a hold-out group which will not receive the discount. For customers with a low number of web visits, do not send a discount (the marketing team wants to report a positive ROI, so they do not want to waste money on customers who are not valuable). Still, they want to use them to measure the effect of the discount.
  • We also want to use the results of the test to tag loyal customers. These are customers who got a discount (since they showed potential interest in the product) and customers with exceptional sales numbers even if they did not get a discount. The idea is to use this information to target them in the future if the discount strategy is positive.

In the last lab we did some exploratory data analysis. The next step is to prepare some descriptive statistics.

Descriptive Statistics

The first thing the data analytics team did was to split the sales distribution by discount group:

data <- readr::read_csv('data/sales_dag.csv', show_col_types = FALSE)

data |> dplyr::mutate(discount = factor(discount)) |> 
  ggplot(aes(x = sales, after_stat(count), fill = discount)) +
  geom_histogram(alpha = 0.30, position = 'identity', color="#e9ecef", bins = 30)+
  geom_density(alpha = 0.30) +
  xlab("Sales") +
  ylab("Density") +
  theme_minimal()

It looks like customers with a discount have higher sales. Data scientist A is optimistic about this initial result. To quantify it, they computed the difference in means:

SOLUTION:

difference in means:

mean_sales <- data |> dplyr::group_by(discount) |> 
  dplyr::summarize("mean sales" = mean(sales)) |> 
  dplyr::mutate("mean sales difference" = `mean sales` - lag(`mean sales`))
mean_sales
# A tibble: 2 × 3
  discount `mean sales` `mean sales difference`
     <dbl>        <dbl>                   <dbl>
1        0         15.8                   NA   
2        1         20.6                    4.80

Our calculation gives a $4.8 mean uplift! This is great news. The discount strategy seems to be working. Data scientist A is happy with the results and decides to get feedback from the rest of the data science team.

Data scientist B is not so happy with the results. They think the uplift is too good to be true (based on domain knowledge and the sales distributions 🤔). When thinking about reasons for such a high uplift, they realized the discount assignment was not random: it was based on the number of web visits (remember the marketing plan?). This means the discount group is not fully comparable to the control group! They decide to plot sales against web visits per discount group:

data |> dplyr::mutate(discount = factor(discount)) |> 
  ggplot(aes(x=visits, y = sales, color = discount)) +
  geom_point() + 
  facet_grid(cols = vars(discount))

Indeed, they realize they should probably adjust for the number of web visits. A natural metric is sales per web visit. Let’s compute it:

SOLUTION:
mean_sales_pv <- data |> dplyr::group_by(discount) |> 
  dplyr::summarize("sales_per_visit" = mean(sales_per_visit)) 
mean_sales_pv
# A tibble: 2 × 2
  discount sales_per_visit
     <dbl>           <dbl>
1        0           0.861
2        1           0.938

The mean value is higher for the discount group. As always, they also looked at the distributions:

data |> dplyr::mutate(discount = factor(discount)) |> 
  ggplot(aes(x = sales_per_visit, after_stat(count), fill = discount)) +
  geom_histogram(alpha = 0.30, position = 'identity', color="#e9ecef", bins = 30)+
  # geom_density(alpha = 0.30) +
  xlab("Sales per Visit") +
  ylab("Count") +
  theme_minimal()

The results look much better to both data scientists A and B, but they were unsure which uplift to report. They thought about the difference in means:

SOLUTION:
mean_sales_per_visit <- data |> dplyr::group_by(discount) |> 
  dplyr::summarize("mean sales per visit" = mean(sales_per_visit)) |> 
  dplyr::mutate("mean sales difference" = `mean sales per visit` - dplyr::lag(`mean sales per visit`))

mean_sales_per_visit |> 
  gt::gt() |> 
  gt::tab_header(title = "Mean sales per visit") |> 
  gt::fmt_number(columns = `mean sales difference`, decimals = 5) |> 
  gtExtras::gt_theme_espn()
Mean sales per visit
discount mean sales per visit mean sales difference
0 0.8612426 NA
1 0.9382929 0.07705

However, how to interpret this value in terms of dollars? To be continued …

Grading

Total points available: 30 points.

Component   Points
Ex 1 - 10   30