Lab 1 - Tidy Data Wrangling

SOLUTIONS

Data: Yearly statistics and standings for baseball teams

Today’s data is all baseball statistics. The data is in the Lahman package.

View the data

Before doing any analysis, you will want to get quick view of the data. This is useful as part of the EDA process.

dim(Teams)
[1] 3015   48

Data dictionary

The variable definitions are found in the help for Teams, and are listed below.

?Teams
Column Description
yearID Year
lgID League; a factor with levels AA AL FL NL PL UA
teamID Team; a factor
franchID Franchise (links to TeamsFranchises table)
divID Team’s division; a factor with levels C E W
Rank Position in final standings
G Games played
Ghome Games played at home
W Wins
L Losses
DivWin Division Winner (Y or N)
WCWin Wild Card Winner (Y or N)
LgWin League Champion(Y or N)
WSWin World Series Winner (Y or N)
R Runs scored
AB At bats
H Hits by batters
X2B Doubles
X3B Triples
HR Homeruns by batters
BB Walks by batters
SO Strikeouts by batters
SB Stolen bases
CS Caught stealing
HBP Batters hit by pitch
SF Sacrifice flies
RA Opponents runs scored
ER Earned runs allowed
ERA Earned run average
CG Complete games
SHO Shutouts
SV Saves
IPouts Outs Pitched (innings pitched x 3)
HA Hits allowed
HRA Homeruns allowed
BBA Walks allowed
SOA Strikeouts by pitchers
E Errors
DP Double Plays
FP Fielding percentage
name Team’s full name
park Name of team’s home ballpark
attendance Home attendance total
BPF Three-year park factor for batters
PPF Three-year park factor for pitchers
teamIDBR Team ID used by Baseball Reference website
teamIDlahman45 Team ID used in Lahman database version 4.5
teamIDretro Team ID used by Retrosheet

Exercises

Exercise 1

How many observations are in the Teams dataset? How many variables?

# take the first three rows and glimpse the data
Teams |> dplyr::slice_head(n=3) |> dplyr::glimpse()
Rows: 3
Columns: 48
$ yearID         <int> 1871, 1871, 1871
$ lgID           <fct> NA, NA, NA
$ teamID         <fct> BS1, CH1, CL1
$ franchID       <fct> BNA, CNA, CFC
$ divID          <chr> NA, NA, NA
$ Rank           <int> 3, 2, 8
$ G              <int> 31, 28, 29
$ Ghome          <int> NA, NA, NA
$ W              <int> 20, 19, 10
$ L              <int> 10, 9, 19
$ DivWin         <chr> NA, NA, NA
$ WCWin          <chr> NA, NA, NA
$ LgWin          <chr> "N", "N", "N"
$ WSWin          <chr> NA, NA, NA
$ R              <int> 401, 302, 249
$ AB             <int> 1372, 1196, 1186
$ H              <int> 426, 323, 328
$ X2B            <int> 70, 52, 35
$ X3B            <int> 37, 21, 40
$ HR             <int> 3, 10, 7
$ BB             <int> 60, 60, 26
$ SO             <int> 19, 22, 25
$ SB             <int> 73, 69, 18
$ CS             <int> 16, 21, 8
$ HBP            <int> NA, NA, NA
$ SF             <int> NA, NA, NA
$ RA             <int> 303, 241, 341
$ ER             <int> 109, 77, 116
$ ERA            <dbl> 3.55, 2.76, 4.11
$ CG             <int> 22, 25, 23
$ SHO            <int> 1, 0, 0
$ SV             <int> 3, 1, 0
$ IPouts         <int> 828, 753, 762
$ HA             <int> 367, 308, 346
$ HRA            <int> 2, 6, 13
$ BBA            <int> 42, 28, 53
$ SOA            <int> 23, 22, 34
$ E              <int> 243, 229, 234
$ DP             <int> 24, 16, 15
$ FP             <dbl> 0.834, 0.829, 0.818
$ name           <chr> "Boston Red Stockings", "Chicago White Stockings", "Cle…
$ park           <chr> "South End Grounds I", "Union Base-Ball Grounds", "Nati…
$ attendance     <int> NA, NA, NA
$ BPF            <int> 103, 104, 96
$ PPF            <int> 98, 102, 100
$ teamIDBR       <chr> "BOS", "CHI", "CLE"
$ teamIDlahman45 <chr> "BS1", "CH1", "CL1"
$ teamIDretro    <chr> "BS1", "CH1", "CL1"

How many character columns/measurements have missing variables?

Teams |> skimr::skim()
Data summary
Name Teams
Number of rows 3015
Number of columns 48
_______________________
Column type frequency:
character 10
factor 3
numeric 35
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
divID 1517 0.50 1 1 0 3 0
DivWin 1545 0.49 1 1 0 2 0
WCWin 2181 0.28 1 1 0 2 0
LgWin 28 0.99 1 1 0 2 0
WSWin 357 0.88 1 1 0 2 0
name 0 1.00 11 33 0 140 0
park 34 0.99 7 70 0 219 0
teamIDBR 0 1.00 3 3 0 101 0
teamIDlahman45 0 1.00 3 3 0 148 0
teamIDretro 0 1.00 3 3 0 151 0

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
lgID 0 1 FALSE 7 NL: 1534, AL: 1310, AA: 85, NA: 50
teamID 0 1 FALSE 149 CHN: 147, PHI: 140, PIT: 136, CIN: 133
franchID 0 1 FALSE 120 ATL: 147, CHC: 147, CIN: 141, PIT: 141

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
yearID 0 1.00 1959.49 43.23 1871.00 1923.00 1968.00 1997.00 2022.00 ▃▅▅▆▇
Rank 0 1.00 4.03 2.29 1.00 2.00 4.00 6.00 13.00 ▇▅▃▁▁
G 0 1.00 150.13 24.33 6.00 154.00 159.00 162.00 165.00 ▁▁▁▁▇
Ghome 399 0.87 78.08 6.91 24.00 77.00 81.00 81.00 84.00 ▁▁▁▁▇
W 0 1.00 74.67 17.97 0.00 66.00 77.00 87.00 116.00 ▁▁▃▇▂
L 0 1.00 74.67 17.73 4.00 65.00 76.00 87.00 134.00 ▁▂▇▅▁
R 0 1.00 681.16 139.02 24.00 614.00 691.00 764.00 1220.00 ▁▁▇▅▁
AB 0 1.00 5132.17 794.89 211.00 5142.00 5405.00 5519.00 5781.00 ▁▁▁▁▇
H 0 1.00 1339.25 229.86 33.00 1297.00 1389.00 1464.00 1783.00 ▁▁▁▇▅
X2B 0 1.00 229.03 59.72 1.00 195.00 234.00 272.00 376.00 ▁▂▆▇▂
X3B 0 1.00 45.43 22.52 0.00 29.00 40.00 58.00 150.00 ▅▇▃▁▁
HR 0 1.00 106.54 64.11 0.00 47.00 111.00 156.00 307.00 ▇▆▇▃▁
BB 0 1.00 473.61 132.08 0.00 427.00 494.00 554.00 835.00 ▁▁▇▇▁
SO 16 0.99 768.12 323.43 3.00 517.00 766.00 1000.00 1596.00 ▂▇▇▅▂
SB 125 0.96 109.09 69.47 0.00 62.00 92.00 137.00 581.00 ▇▃▁▁▁
CS 831 0.72 46.26 21.91 0.00 32.75 43.00 56.00 191.00 ▆▇▁▁▁
HBP 1158 0.62 46.18 18.31 7.00 32.00 44.00 58.00 160.00 ▆▇▂▁▁
SF 1541 0.49 44.04 10.16 7.00 38.00 44.00 50.00 77.00 ▁▂▇▃▁
RA 0 1.00 681.16 138.81 34.00 610.50 689.00 766.00 1252.00 ▁▁▇▃▁
ER 0 1.00 573.99 149.50 23.00 504.00 595.00 672.00 1023.00 ▁▂▇▆▁
ERA 0 1.00 3.84 0.76 1.22 3.37 3.84 4.33 8.00 ▁▇▇▁▁
CG 0 1.00 47.08 39.39 0.00 9.00 41.00 75.00 148.00 ▇▅▃▂▁
SHO 0 1.00 9.63 5.03 0.00 6.00 9.00 13.00 32.00 ▅▇▃▁▁
SV 0 1.00 24.59 16.34 0.00 10.00 25.00 39.00 68.00 ▇▅▆▅▁
IPouts 0 1.00 4016.16 660.67 162.00 4080.00 4257.00 4341.00 4518.00 ▁▁▁▁▇
HA 0 1.00 1339.03 230.06 49.00 1286.00 1389.00 1468.00 1993.00 ▁▁▁▇▁
HRA 0 1.00 106.54 60.99 0.00 52.00 114.00 154.00 305.00 ▆▆▇▂▁
BBA 0 1.00 473.96 131.23 1.00 429.50 496.00 553.50 827.00 ▁▁▇▇▁
SOA 0 1.00 767.54 324.61 0.00 513.00 767.00 1002.00 1687.00 ▂▇▇▅▁
E 0 1.00 179.84 108.25 20.00 110.00 141.00 205.00 639.00 ▇▅▁▁▁
DP 0 1.00 132.60 35.76 0.00 116.00 140.00 156.00 217.00 ▁▁▃▇▁
FP 0 1.00 0.97 0.03 0.76 0.97 0.98 0.98 0.99 ▁▁▁▁▇
attendance 279 0.91 1385099.81 964039.83 0.00 544781.75 1203014.50 2080399.25 4483350.00 ▇▆▅▂▁
BPF 0 1.00 100.19 4.92 60.00 97.00 100.00 103.00 129.00 ▁▁▇▅▁
PPF 0 1.00 100.21 4.86 60.00 97.00 100.00 103.00 141.00 ▁▁▇▁▁
SOLUTION

From the dim(Teams) statement used after library(Lahman), there are 3015 observations and 48 variables.

From Teams |> skimr::skim() 6 of 10 character variables have missing values

Exercise 2

Ben Baumer worked for the New York Mets from 2004 to 2012. What was the team W/L record during those years? Use filter() and select() to quickly identify only those pieces of information that we care about.

SOLUTION:
# filter to use only rows where teamID equals "NYN"
mets <- Teams  %>% 
  dplyr::filter(teamID == "NYN")
# filter to use only rows where yearID is >= 2004 and <= 2012
# you could also write dplyr::filter(yearID  %in% 2004:2012)
my_mets <- mets %>% 
  dplyr::filter(yearID >= 2004 & yearID <= 2012)
# the dataset needs to have at least the year and the won (W) loss (L) record for that year
my_mets %>% 
  dplyr::select(teamID,yearID,W,L)
  teamID yearID  W  L
1    NYN   2004 71 91
2    NYN   2005 83 79
3    NYN   2006 97 65
4    NYN   2007 88 74
5    NYN   2008 89 73
6    NYN   2009 70 92
7    NYN   2010 79 83
8    NYN   2011 77 85
9    NYN   2012 74 88

Overall, the won-loss record was as follows:

my_mets %>% 
  dplyr::select(teamID,yearID,W,L) %>% 
  dplyr::summarize(
    "2004-2012 wins" = sum(W)
    , "2004-2012 losses" = sum(L)
  )
  2004-2012 wins 2004-2012 losses
1            728              730

Exercise 3

The model estimates the expected winning percentage as follows:

Ŵpct=11+(RARS)2 \hat{\text{W}}_{\text{pct}}=\frac{1}{1+\left(\frac{\text{RA}}{\text{RS}}\right)^{2}}

where RA\text{RA} is the number of runs the team allows to be scored, RS\text{RS} is the number of runs that the team scores, and Ŵpct\hat{\text{W}}_{\text{pct}} is the team’s expected winning percentage. The runs scored and allowed are present in the Teams table, so we start by selecting them.

SOLUTION:
mets_ben <- Teams |>
  # select to get the columns you want
  dplyr::select(teamID, yearID, W, L, R, RA) |>
  # filter to get the rows you want
  dplyr::filter(teamID == "NYN" & yearID %in% 2004:2012)

The column name can be changed with the dplyr::rename function (Use new_name = old_name to rename selected variables). Alternatively, you can rename the column directly in the select statement above, like this:

dplyr::select(teamID,yearID,W,L,RS = R,RA)

mets_ben <- mets_ben |>
  dplyr::rename(RS = R)    # new name = old name
mets_ben
  teamID yearID  W  L  RS  RA
1    NYN   2004 71 91 684 731
2    NYN   2005 83 79 722 648
3    NYN   2006 97 65 834 731
4    NYN   2007 88 74 804 750
5    NYN   2008 89 73 799 715
6    NYN   2009 70 92 671 757
7    NYN   2010 79 83 656 652
8    NYN   2011 77 85 718 742
9    NYN   2012 74 88 650 709

Exercise 4

Next, we need to compute the team’s actual winning percentage in each of these seasons. Thus, we need to add a new column to our data frame, and we do this with the mutate() command.

SOLUTION:
mets_ben <- mets_ben |>
  # once we have the data, we mutate to add a new value (column), using the formula
  dplyr::mutate( WPct = 1/(1 + (RA/RS)^2 ) )
mets_ben
  teamID yearID  W  L  RS  RA      WPct
1    NYN   2004 71 91 684 731 0.4668211
2    NYN   2005 83 79 722 648 0.5538575
3    NYN   2006 97 65 834 731 0.5655308
4    NYN   2007 88 74 804 750 0.5347071
5    NYN   2008 89 73 799 715 0.5553119
6    NYN   2009 70 92 671 757 0.4399936
7    NYN   2010 79 83 656 652 0.5030581
8    NYN   2011 77 85 718 742 0.4835661
9    NYN   2012 74 88 650 709 0.4566674

The expected number of wins is then equal to the product of the expected winning percentage times the number of games.

mets_ben <- mets_ben |>
  # once we have calculated the expected winning percentage,
  # the expected number of wins is the percentage times the total number of games played
  dplyr::mutate( W_hat = WPct * (W+L) )
mets_ben
  teamID yearID  W  L  RS  RA      WPct    W_hat
1    NYN   2004 71 91 684 731 0.4668211 75.62501
2    NYN   2005 83 79 722 648 0.5538575 89.72491
3    NYN   2006 97 65 834 731 0.5655308 91.61600
4    NYN   2007 88 74 804 750 0.5347071 86.62255
5    NYN   2008 89 73 799 715 0.5553119 89.96053
6    NYN   2009 70 92 671 757 0.4399936 71.27896
7    NYN   2010 79 83 656 652 0.5030581 81.49541
8    NYN   2011 77 85 718 742 0.4835661 78.33771
9    NYN   2012 74 88 650 709 0.4566674 73.98012

Exercise 5

In this case, the Mets’ fortunes were better than expected in three of these seasons, and worse than expected in the other six.

We can confirm this as follows:

SOLUTION:
mets_ben %>% 
  # first check that the assertion above is correct
  dplyr::summarize('better then expected' = sum(W >= W_hat), 'worse than expected' = sum(W < W_hat))
  better then expected worse than expected
1                    3                   6

To see how the Mets did over all seasons we can repeat our calculation

Teams |>
  # here we repeat our prior calculation (all steps combined) for all the years in the dataset
  dplyr::select(teamID, yearID, W, L, RS = R, RA) |>
  dplyr::filter(teamID == "NYN") |>
  dplyr::mutate( 
    WPct = 1/(1 + (RA/RS)^2 )
    , W_hat = WPct * (W+L)
  )  |> 
dplyr::summarize( 
  "better then expected" = sum(W >= W_hat)
  , 'worse than expected' = sum(W < W_hat) 
)
  better then expected worse than expected
1                   22                  39

Exercise 6

Naturally, the Mets experienced ups and downs during Ben’s time with the team. Which seasons were best? To figure this out, we can simply sort the rows of the data frame by number of wins.

SOLUTION:
# for this we just need to sort the number of wins in descending order
mets_ben |> dplyr::arrange(desc(W))
  teamID yearID  W  L  RS  RA      WPct    W_hat
1    NYN   2006 97 65 834 731 0.5655308 91.61600
2    NYN   2008 89 73 799 715 0.5553119 89.96053
3    NYN   2007 88 74 804 750 0.5347071 86.62255
4    NYN   2005 83 79 722 648 0.5538575 89.72491
5    NYN   2010 79 83 656 652 0.5030581 81.49541
6    NYN   2011 77 85 718 742 0.4835661 78.33771
7    NYN   2012 74 88 650 709 0.4566674 73.98012
8    NYN   2004 71 91 684 731 0.4668211 75.62501
9    NYN   2009 70 92 671 757 0.4399936 71.27896

Exercise 7

In 2006, the Mets had the best record in baseball during the regular season and nearly made the World Series. How do these seasons rank in terms of the team’s performance relative to our model?

SOLUTION:
mets_ben %>% 
  # add a column with the difference between wins (W) and expected wins (W_hat)
  dplyr::mutate(Diff = W - W_hat) |>
  # then sort the result
  dplyr::arrange(desc(Diff))
  teamID yearID  W  L  RS  RA      WPct    W_hat        Diff
1    NYN   2006 97 65 834 731 0.5655308 91.61600  5.38400315
2    NYN   2007 88 74 804 750 0.5347071 86.62255  1.37744558
3    NYN   2012 74 88 650 709 0.4566674 73.98012  0.01988152
4    NYN   2008 89 73 799 715 0.5553119 89.96053 -0.96052803
5    NYN   2009 70 92 671 757 0.4399936 71.27896 -1.27895513
6    NYN   2011 77 85 718 742 0.4835661 78.33771 -1.33770571
7    NYN   2010 79 83 656 652 0.5030581 81.49541 -2.49540821
8    NYN   2004 71 91 684 731 0.4668211 75.62501 -4.62501135
9    NYN   2005 83 79 722 648 0.5538575 89.72491 -6.72490937

In the years 2006, 2007 and 2012, the Mets had more wins than expected by the model. In all other seasons they performed worse than predicted by the model.

We can summarize the Mets performance as follows:

mets_ben |>
  dplyr::summarize(
    num_years = dplyr::n(),  # number of years
    total_W = sum(W),        # total number of wins
    total_L = sum(L),        # total number of losses
    total_WPct = total_W / (total_W + total_L) # win percentage
  )
  num_years total_W total_L total_WPct
1         9     728     730  0.4993141

In these nine years, the Mets had a combined record of 728 wins and 730 losses, for an overall winning percentage of 49.93%.

Exercise 8

Discretize the years into three chunks: one for each of the three general managers under whom Ben worked. Jim Duquette was the Mets’ general manager in 2004, Omar Minaya from 2005 to 2010, and Sandy Alderson from 2011 to 2012.

SOLUTION:
mets_ben %>% 
  # this questions requires a logic for deciding 
  # which years each general manager worked
  dplyr::mutate(
    # nested ifelse statements are OK for this logic, 
    # but are only practical for about three cases
    gm = ifelse(
      yearID == 2004, 
      'Jim Duquette', 
      ifelse(
        yearID >= 2011, 
        'Sandy Alderson', 
        'Omar Minaya')
    )
  )
  teamID yearID  W  L  RS  RA      WPct    W_hat             gm
1    NYN   2004 71 91 684 731 0.4668211 75.62501   Jim Duquette
2    NYN   2005 83 79 722 648 0.5538575 89.72491    Omar Minaya
3    NYN   2006 97 65 834 731 0.5655308 91.61600    Omar Minaya
4    NYN   2007 88 74 804 750 0.5347071 86.62255    Omar Minaya
5    NYN   2008 89 73 799 715 0.5553119 89.96053    Omar Minaya
6    NYN   2009 70 92 671 757 0.4399936 71.27896    Omar Minaya
7    NYN   2010 79 83 656 652 0.5030581 81.49541    Omar Minaya
8    NYN   2011 77 85 718 742 0.4835661 78.33771 Sandy Alderson
9    NYN   2012 74 88 650 709 0.4566674 73.98012 Sandy Alderson

Alternatively, we can use the case_when function

mets_ben <- mets_ben |>
  dplyr::mutate(
    # same problem, but case_when is easier to work with
    gm = dplyr::case_when(
      yearID == 2004 ~ 'Jim Duquette', 
      yearID >= 2011 ~ 'Sandy Alderson', 
      TRUE ~ 'Omar Minaya' # this is the default case
    )
  )
mets_ben
  teamID yearID  W  L  RS  RA      WPct    W_hat             gm
1    NYN   2004 71 91 684 731 0.4668211 75.62501   Jim Duquette
2    NYN   2005 83 79 722 648 0.5538575 89.72491    Omar Minaya
3    NYN   2006 97 65 834 731 0.5655308 91.61600    Omar Minaya
4    NYN   2007 88 74 804 750 0.5347071 86.62255    Omar Minaya
5    NYN   2008 89 73 799 715 0.5553119 89.96053    Omar Minaya
6    NYN   2009 70 92 671 757 0.4399936 71.27896    Omar Minaya
7    NYN   2010 79 83 656 652 0.5030581 81.49541    Omar Minaya
8    NYN   2011 77 85 718 742 0.4835661 78.33771 Sandy Alderson
9    NYN   2012 74 88 650 709 0.4566674 73.98012 Sandy Alderson

Exercise 9

The raw churn data can be transformed into a tidy dataset as follows:

SOLUTION:
set.seed(42)

# read data and drop column 1 (it contains row numbers and doesn't have a column name)
df <- readr::read_csv("data/monthly_data.csv", show_col_types = FALSE, col_select = -1)

df |>
  # take date columns and pivot to longer table
  tidyr::pivot_longer(starts_with("20"), names_to = "date", values_to = "quantity") |> 
  # split the 'date' column into two measurements
  tidyr::separate_wider_delim(cols = date, delim = "_", names = c("date","paymentMandate")) |> 
  # pivot the two columns paymentMandate and quantity to two columns called payment and mandate
  tidyr::pivot_wider(names_from = paymentMandate, values_from = quantity) |> 
  # finally, mutate the date columns from strings to Dates
  dplyr::mutate(
    incorporation_date = as.Date(incorporation_date)
    , date = as.Date(date)
  )
# A tibble: 10,824 × 6
   company_id vertical    incorporation_date date       payments mandates
        <dbl> <chr>       <date>             <date>        <dbl>    <dbl>
 1          1 gym/fitness 2013-05-30         2014-01-01        0        1
 2          1 gym/fitness 2013-05-30         2014-02-01        0        2
 3          1 gym/fitness 2013-05-30         2014-03-01        0        2
 4          1 gym/fitness 2013-05-30         2014-04-01        1        1
 5          1 gym/fitness 2013-05-30         2014-05-01        0        0
 6          1 gym/fitness 2013-05-30         2014-06-01        1        0
 7          1 gym/fitness 2013-05-30         2014-07-01        0        0
 8          1 gym/fitness 2013-05-30         2014-08-01        0        0
 9          1 gym/fitness 2013-05-30         2014-09-01        0        0
10          1 gym/fitness 2013-05-30         2014-10-01        0        0
# ℹ 10,814 more rows

Exercise 10

Use the gm function to define the manager groups with the group_by() operator, and run the summaries again, this time across the manager groups.

SOLUTION:
#| label: read the data
data <- readr::read_csv("data/sales_dag.csv", show_col_types = FALSE)

data |> dplyr::slice_head(n=5) |> 
  gt::gt() |> 
  gt::tab_header(title = "sample marketing data") |> 
  gtExtras::gt_theme_espn()
sample marketing data
visits discount is_loyal sales sales_per_visit
12 0 0 13.34830 1.1123585
26 1 1 21.70125 0.8346635
13 0 0 14.70040 1.1308004
24 0 0 20.37734 0.8490557
14 0 0 12.63372 0.9024089
data |> skimr::skim()
Data summary
Name data
Number of rows 700
Number of columns 5
_______________________
Column type frequency:
numeric 5
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
visits 0 1 21.19 5.19 3.00 17.00 21.00 25.00 40.00 ▁▃▇▃▁
discount 0 1 0.74 0.44 0.00 0.00 1.00 1.00 1.00 ▃▁▁▁▇
is_loyal 0 1 0.75 0.43 0.00 0.75 1.00 1.00 1.00 ▂▁▁▁▇
sales 0 1 19.40 4.88 2.09 16.44 19.50 22.50 36.14 ▁▃▇▃▁
sales_per_visit 0 1 0.92 0.11 0.42 0.85 0.91 0.98 1.41 ▁▂▇▂▁

The mean of the sales_per_visit columns/measurement is 0.9183 and there are no grouped observations.

# calculate the % share of customers receiving a discount vs the % not receiving a discount
data$discount |> table() / length(data$discount)

   0    1 
0.26 0.74 

Similarly for the share of customers which are loyal:

# calculate the % share of customers that are 'loyal' vs not 'loyal'
data$is_loyal |> table() / length(data$is_loyal)

   0    1 
0.25 0.75 

To understand these features better, they also looked at a cross-tab table:

# build a cross-tab table of 'loyal' customers vs customers getting a discount
data |> xtabs(~discount + is_loyal, data = _)
        is_loyal
discount   0   1
       0 175   7
       1   0 518

Alternatively:

data |> 
  dplyr::group_by(discount,is_loyal) |> 
  dplyr::summarize(n = dplyr::n(), .groups='drop') |> 
  tidyr::pivot_wider(
    names_from = is_loyal
    , values_from = n
    , names_prefix = 'is_loyal '
  ) |> gt::gt() |> 
  gt::tab_header(title = "Cross-tabs", subtitle = "discount vs is_loyal") |> 
  gtExtras::gt_theme_espn()
Cross-tabs
discount vs is_loyal
discount is_loyal 0 is_loyal 1
0 175 7
1 NA 518
data |> dplyr::mutate(id = dplyr::row_number(), .before = 1) |> 
  dplyr::filter(discount == 0) |> 
  dplyr::arrange( desc(sales) ) |> 
  dplyr::slice_head(n=10) |> 
  gt::gt() |> 
  gt::tab_header(title = "Sales: loyal customers vs others") |> 
  gtExtras::gt_theme_espn()
Sales: loyal customers vs others
id visits discount is_loyal sales sales_per_visit
567 33 0 1 31.72169 0.9612633
205 33 0 1 28.31111 0.8579125
366 33 0 1 28.08163 0.8509586
50 33 0 1 27.79064 0.8421407
281 29 0 1 27.58115 0.9510740
546 27 0 1 26.25533 0.9724196
105 29 0 1 26.21752 0.9040526
362 28 0 0 24.09410 0.8605037
652 28 0 0 24.06459 0.8594495
494 27 0 0 24.00630 0.8891220

The loyal customers are the top ones in terms of sales. This is good news. It means that the definition of loyal customers is consistent with the data.

In order to have orders of magnitude for the sales, the data scientist provided some summary statistics table:

gtExtras::gt_plt_summary(data, title = "Sales data")
Sales data
700 rows x 5 cols
Column Plot Overview Missing Mean Median SD
visits 340 0.0% 21.2 21.0 5.2
discount 01 0.0% 0.7 1.0 0.4
is_loyal 01 0.0% 0.8 1.0 0.4
sales 236 0.0% 19.4 19.5 4.9
sales_per_visit 0.421.41 0.0% 0.9 0.9 0.1

To have a better glimpse of the data, the data scientist also provided a histogram of the sales:

data |> 
  ggplot(aes(x=sales)) +
  geom_histogram(aes(y = ..density..), bins = 30, colour = 1, fill = "white") +
  geom_density(lwd = 1, colour = 4, fill = 4, alpha = 0.25) +
  labs(title = "Sales Distribution") +
  theme_minimal()

Resources for additional practice (optional)