Lab 1 - Tidy Data Wrangling

SOLUTIONS

Data: Yearly statistics and standings for baseball teams

Today’s data is all baseball statistics. The data is in the Lahman package.

View the data

Before doing any analysis, you will want to get quick view of the data. This is useful as part of the EDA process.

dim(Teams)

[1] 3015   48

Data dictionary

The variable definitions are found in the help for Teams, and are listed below.

?Teams

Column	Description
yearID	Year
lgID	League; a factor with levels AA AL FL NL PL UA
teamID	Team; a factor
franchID	Franchise (links to TeamsFranchises table)
divID	Team’s division; a factor with levels C E W
Rank	Position in final standings
G	Games played
Ghome	Games played at home
W	Wins
L	Losses
DivWin	Division Winner (Y or N)
WCWin	Wild Card Winner (Y or N)
LgWin	League Champion(Y or N)
WSWin	World Series Winner (Y or N)
R	Runs scored
AB	At bats
H	Hits by batters
X2B	Doubles
X3B	Triples
HR	Homeruns by batters
BB	Walks by batters
SO	Strikeouts by batters
SB	Stolen bases
CS	Caught stealing
HBP	Batters hit by pitch
SF	Sacrifice flies
RA	Opponents runs scored
ER	Earned runs allowed
ERA	Earned run average
CG	Complete games
SHO	Shutouts
SV	Saves
IPouts	Outs Pitched (innings pitched x 3)
HA	Hits allowed
HRA	Homeruns allowed
BBA	Walks allowed
SOA	Strikeouts by pitchers
E	Errors
DP	Double Plays
FP	Fielding percentage
name	Team’s full name
park	Name of team’s home ballpark
attendance	Home attendance total
BPF	Three-year park factor for batters
PPF	Three-year park factor for pitchers
teamIDBR	Team ID used by Baseball Reference website
teamIDlahman45	Team ID used in Lahman database version 4.5
teamIDretro	Team ID used by Retrosheet

Exercises

Exercise 1

How many observations are in the Teams dataset? How many variables?

# take the first three rows and glimpse the data
Teams |> dplyr::slice_head(n=3) |> dplyr::glimpse()

Rows: 3
Columns: 48
$ yearID         <int> 1871, 1871, 1871
$ lgID           <fct> NA, NA, NA
$ teamID         <fct> BS1, CH1, CL1
$ franchID       <fct> BNA, CNA, CFC
$ divID          <chr> NA, NA, NA
$ Rank           <int> 3, 2, 8
$ G              <int> 31, 28, 29
$ Ghome          <int> NA, NA, NA
$ W              <int> 20, 19, 10
$ L              <int> 10, 9, 19
$ DivWin         <chr> NA, NA, NA
$ WCWin          <chr> NA, NA, NA
$ LgWin          <chr> "N", "N", "N"
$ WSWin          <chr> NA, NA, NA
$ R              <int> 401, 302, 249
$ AB             <int> 1372, 1196, 1186
$ H              <int> 426, 323, 328
$ X2B            <int> 70, 52, 35
$ X3B            <int> 37, 21, 40
$ HR             <int> 3, 10, 7
$ BB             <int> 60, 60, 26
$ SO             <int> 19, 22, 25
$ SB             <int> 73, 69, 18
$ CS             <int> 16, 21, 8
$ HBP            <int> NA, NA, NA
$ SF             <int> NA, NA, NA
$ RA             <int> 303, 241, 341
$ ER             <int> 109, 77, 116
$ ERA            <dbl> 3.55, 2.76, 4.11
$ CG             <int> 22, 25, 23
$ SHO            <int> 1, 0, 0
$ SV             <int> 3, 1, 0
$ IPouts         <int> 828, 753, 762
$ HA             <int> 367, 308, 346
$ HRA            <int> 2, 6, 13
$ BBA            <int> 42, 28, 53
$ SOA            <int> 23, 22, 34
$ E              <int> 243, 229, 234
$ DP             <int> 24, 16, 15
$ FP             <dbl> 0.834, 0.829, 0.818
$ name           <chr> "Boston Red Stockings", "Chicago White Stockings", "Cle…
$ park           <chr> "South End Grounds I", "Union Base-Ball Grounds", "Nati…
$ attendance     <int> NA, NA, NA
$ BPF            <int> 103, 104, 96
$ PPF            <int> 98, 102, 100
$ teamIDBR       <chr> "BOS", "CHI", "CLE"
$ teamIDlahman45 <chr> "BS1", "CH1", "CL1"
$ teamIDretro    <chr> "BS1", "CH1", "CL1"

How many character columns/measurements have missing variables?

Teams |> skimr::skim()

Data summary
Name	Teams
Number of rows	3015
Number of columns	48
_______________________
Column type frequency:
character	10
factor	3
numeric	35
________________________
Group variables	None

Variable type: character

skim_variable	n_missing	complete_rate	min	max	n_unique
divID	1517	0.50	1	1	3
DivWin	1545	0.49	1	1	2
WCWin	2181	0.28	1	1	2
LgWin	28	0.99	1	1	2
WSWin	357	0.88	1	1	2
name	0	1.00	11	33	140
park	34	0.99	7	70	219
teamIDBR	0	1.00	3	3	101
teamIDlahman45	0	1.00	3	3	148
teamIDretro	0	1.00	3	3	151

Variable type: factor

skim_variable	complete_rate	ordered	n_unique	top_counts
lgID	1	FALSE	7	NL: 1534, AL: 1310, AA: 85, NA: 50
teamID	1	FALSE	149	CHN: 147, PHI: 140, PIT: 136, CIN: 133
franchID	1	FALSE	120	ATL: 147, CHC: 147, CIN: 141, PIT: 141

Variable type: numeric

skim_variable	n_missing	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
yearID	0	1.00	1959.49	43.23	1871.00	1923.00	1968.00	1997.00	2022.00	▃▅▅▆▇
Rank	0	1.00	4.03	2.29	1.00	2.00	4.00	6.00	13.00	▇▅▃▁▁
G	0	1.00	150.13	24.33	6.00	154.00	159.00	162.00	165.00	▁▁▁▁▇
Ghome	399	0.87	78.08	6.91	24.00	77.00	81.00	81.00	84.00	▁▁▁▁▇
W	0	1.00	74.67	17.97	0.00	66.00	77.00	87.00	116.00	▁▁▃▇▂
L	0	1.00	74.67	17.73	4.00	65.00	76.00	87.00	134.00	▁▂▇▅▁
R	0	1.00	681.16	139.02	24.00	614.00	691.00	764.00	1220.00	▁▁▇▅▁
AB	0	1.00	5132.17	794.89	211.00	5142.00	5405.00	5519.00	5781.00	▁▁▁▁▇
H	0	1.00	1339.25	229.86	33.00	1297.00	1389.00	1464.00	1783.00	▁▁▁▇▅
X2B	0	1.00	229.03	59.72	1.00	195.00	234.00	272.00	376.00	▁▂▆▇▂
X3B	0	1.00	45.43	22.52	0.00	29.00	40.00	58.00	150.00	▅▇▃▁▁
HR	0	1.00	106.54	64.11	0.00	47.00	111.00	156.00	307.00	▇▆▇▃▁
BB	0	1.00	473.61	132.08	0.00	427.00	494.00	554.00	835.00	▁▁▇▇▁
SO	16	0.99	768.12	323.43	3.00	517.00	766.00	1000.00	1596.00	▂▇▇▅▂
SB	125	0.96	109.09	69.47	0.00	62.00	92.00	137.00	581.00	▇▃▁▁▁
CS	831	0.72	46.26	21.91	0.00	32.75	43.00	56.00	191.00	▆▇▁▁▁
HBP	1158	0.62	46.18	18.31	7.00	32.00	44.00	58.00	160.00	▆▇▂▁▁
SF	1541	0.49	44.04	10.16	7.00	38.00	44.00	50.00	77.00	▁▂▇▃▁
RA	0	1.00	681.16	138.81	34.00	610.50	689.00	766.00	1252.00	▁▁▇▃▁
ER	0	1.00	573.99	149.50	23.00	504.00	595.00	672.00	1023.00	▁▂▇▆▁
ERA	0	1.00	3.84	0.76	1.22	3.37	3.84	4.33	8.00	▁▇▇▁▁
CG	0	1.00	47.08	39.39	0.00	9.00	41.00	75.00	148.00	▇▅▃▂▁
SHO	0	1.00	9.63	5.03	0.00	6.00	9.00	13.00	32.00	▅▇▃▁▁
SV	0	1.00	24.59	16.34	0.00	10.00	25.00	39.00	68.00	▇▅▆▅▁
IPouts	0	1.00	4016.16	660.67	162.00	4080.00	4257.00	4341.00	4518.00	▁▁▁▁▇
HA	0	1.00	1339.03	230.06	49.00	1286.00	1389.00	1468.00	1993.00	▁▁▁▇▁
HRA	0	1.00	106.54	60.99	0.00	52.00	114.00	154.00	305.00	▆▆▇▂▁
BBA	0	1.00	473.96	131.23	1.00	429.50	496.00	553.50	827.00	▁▁▇▇▁
SOA	0	1.00	767.54	324.61	0.00	513.00	767.00	1002.00	1687.00	▂▇▇▅▁
E	0	1.00	179.84	108.25	20.00	110.00	141.00	205.00	639.00	▇▅▁▁▁
DP	0	1.00	132.60	35.76	0.00	116.00	140.00	156.00	217.00	▁▁▃▇▁
FP	0	1.00	0.97	0.03	0.76	0.97	0.98	0.98	0.99	▁▁▁▁▇
attendance	279	0.91	1385099.81	964039.83	0.00	544781.75	1203014.50	2080399.25	4483350.00	▇▆▅▂▁
BPF	0	1.00	100.19	4.92	60.00	97.00	100.00	103.00	129.00	▁▁▇▅▁
PPF	0	1.00	100.21	4.86	60.00	97.00	100.00	103.00	141.00	▁▁▇▁▁

SOLUTION

From the dim(Teams) statement used after library(Lahman), there are 3015 observations and 48 variables.

From Teams |> skimr::skim() 6 of 10 character variables have missing values

Exercise 2

Ben Baumer worked for the New York Mets from 2004 to 2012. What was the team W/L record during those years? Use filter() and select() to quickly identify only those pieces of information that we care about.

SOLUTION:

# filter to use only rows where teamID equals "NYN"
mets <- Teams  %>% 
  dplyr::filter(teamID == "NYN")
# filter to use only rows where yearID is >= 2004 and <= 2012
# you could also write dplyr::filter(yearID  %in% 2004:2012)
my_mets <- mets %>% 
  dplyr::filter(yearID >= 2004 & yearID <= 2012)
# the dataset needs to have at least the year and the won (W) loss (L) record for that year
my_mets %>% 
  dplyr::select(teamID,yearID,W,L)

  teamID yearID  W  L
1    NYN   2004 71 91
2    NYN   2005 83 79
3    NYN   2006 97 65
4    NYN   2007 88 74
5    NYN   2008 89 73
6    NYN   2009 70 92
7    NYN   2010 79 83
8    NYN   2011 77 85
9    NYN   2012 74 88

Overall, the won-loss record was as follows:

my_mets %>% 
  dplyr::select(teamID,yearID,W,L) %>% 
  dplyr::summarize(
    "2004-2012 wins" = sum(W)
    , "2004-2012 losses" = sum(L)
  )

  2004-2012 wins 2004-2012 losses
1            728              730

Exercise 3

The model estimates the expected winning percentage as follows:

$\hat{\text{W}}_{\text{pct}}=\frac{1}{1+\left(\frac{\text{RA}}{\text{RS}}\right)^{2}}$

where $\text{RA}$ is the number of runs the team allows to be scored, $\text{RS}$ is the number of runs that the team scores, and $\hat{\text{W}}_{\text{pct}}$ is the team’s expected winning percentage. The runs scored and allowed are present in the Teams table, so we start by selecting them.

SOLUTION:

mets_ben <- Teams |>
  # select to get the columns you want
  dplyr::select(teamID, yearID, W, L, R, RA) |>
  # filter to get the rows you want
  dplyr::filter(teamID == "NYN" & yearID %in% 2004:2012)

The column name can be changed with the dplyr::rename function (Use new_name = old_name to rename selected variables). Alternatively, you can rename the column directly in the select statement above, like this:

dplyr::select(teamID,yearID,W,L,RS = R,RA)

mets_ben <- mets_ben |>
  dplyr::rename(RS = R)    # new name = old name
mets_ben

  teamID yearID  W  L  RS  RA
1    NYN   2004 71 91 684 731
2    NYN   2005 83 79 722 648
3    NYN   2006 97 65 834 731
4    NYN   2007 88 74 804 750
5    NYN   2008 89 73 799 715
6    NYN   2009 70 92 671 757
7    NYN   2010 79 83 656 652
8    NYN   2011 77 85 718 742
9    NYN   2012 74 88 650 709

Exercise 4

Next, we need to compute the team’s actual winning percentage in each of these seasons. Thus, we need to add a new column to our data frame, and we do this with the mutate() command.

SOLUTION:

mets_ben <- mets_ben |>
  # once we have the data, we mutate to add a new value (column), using the formula
  dplyr::mutate( WPct = 1/(1 + (RA/RS)^2 ) )
mets_ben

  teamID yearID  W  L  RS  RA      WPct
1    NYN   2004 71 91 684 731 0.4668211
2    NYN   2005 83 79 722 648 0.5538575
3    NYN   2006 97 65 834 731 0.5655308
4    NYN   2007 88 74 804 750 0.5347071
5    NYN   2008 89 73 799 715 0.5553119
6    NYN   2009 70 92 671 757 0.4399936
7    NYN   2010 79 83 656 652 0.5030581
8    NYN   2011 77 85 718 742 0.4835661
9    NYN   2012 74 88 650 709 0.4566674

The expected number of wins is then equal to the product of the expected winning percentage times the number of games.

mets_ben <- mets_ben |>
  # once we have calculated the expected winning percentage,
  # the expected number of wins is the percentage times the total number of games played
  dplyr::mutate( W_hat = WPct * (W+L) )
mets_ben

  teamID yearID  W  L  RS  RA      WPct    W_hat
1    NYN   2004 71 91 684 731 0.4668211 75.62501
2    NYN   2005 83 79 722 648 0.5538575 89.72491
3    NYN   2006 97 65 834 731 0.5655308 91.61600
4    NYN   2007 88 74 804 750 0.5347071 86.62255
5    NYN   2008 89 73 799 715 0.5553119 89.96053
6    NYN   2009 70 92 671 757 0.4399936 71.27896
7    NYN   2010 79 83 656 652 0.5030581 81.49541
8    NYN   2011 77 85 718 742 0.4835661 78.33771
9    NYN   2012 74 88 650 709 0.4566674 73.98012

Exercise 5

In this case, the Mets’ fortunes were better than expected in three of these seasons, and worse than expected in the other six.

We can confirm this as follows:

SOLUTION:

mets_ben %>% 
  # first check that the assertion above is correct
  dplyr::summarize('better then expected' = sum(W >= W_hat), 'worse than expected' = sum(W < W_hat))

  better then expected worse than expected
1                    3                   6

To see how the Mets did over all seasons we can repeat our calculation

Teams |>
  # here we repeat our prior calculation (all steps combined) for all the years in the dataset
  dplyr::select(teamID, yearID, W, L, RS = R, RA) |>
  dplyr::filter(teamID == "NYN") |>
  dplyr::mutate( 
    WPct = 1/(1 + (RA/RS)^2 )
    , W_hat = WPct * (W+L)
  )  |> 
dplyr::summarize( 
  "better then expected" = sum(W >= W_hat)
  , 'worse than expected' = sum(W < W_hat) 
)

  better then expected worse than expected
1                   22                  39

Exercise 6

Naturally, the Mets experienced ups and downs during Ben’s time with the team. Which seasons were best? To figure this out, we can simply sort the rows of the data frame by number of wins.

SOLUTION:

# for this we just need to sort the number of wins in descending order
mets_ben |> dplyr::arrange(desc(W))

  teamID yearID  W  L  RS  RA      WPct    W_hat
1    NYN   2006 97 65 834 731 0.5655308 91.61600
2    NYN   2008 89 73 799 715 0.5553119 89.96053
3    NYN   2007 88 74 804 750 0.5347071 86.62255
4    NYN   2005 83 79 722 648 0.5538575 89.72491
5    NYN   2010 79 83 656 652 0.5030581 81.49541
6    NYN   2011 77 85 718 742 0.4835661 78.33771
7    NYN   2012 74 88 650 709 0.4566674 73.98012
8    NYN   2004 71 91 684 731 0.4668211 75.62501
9    NYN   2009 70 92 671 757 0.4399936 71.27896

Exercise 7

In 2006, the Mets had the best record in baseball during the regular season and nearly made the World Series. How do these seasons rank in terms of the team’s performance relative to our model?

SOLUTION:

mets_ben %>% 
  # add a column with the difference between wins (W) and expected wins (W_hat)
  dplyr::mutate(Diff = W - W_hat) |>
  # then sort the result
  dplyr::arrange(desc(Diff))

  teamID yearID  W  L  RS  RA      WPct    W_hat        Diff
1    NYN   2006 97 65 834 731 0.5655308 91.61600  5.38400315
2    NYN   2007 88 74 804 750 0.5347071 86.62255  1.37744558
3    NYN   2012 74 88 650 709 0.4566674 73.98012  0.01988152
4    NYN   2008 89 73 799 715 0.5553119 89.96053 -0.96052803
5    NYN   2009 70 92 671 757 0.4399936 71.27896 -1.27895513
6    NYN   2011 77 85 718 742 0.4835661 78.33771 -1.33770571
7    NYN   2010 79 83 656 652 0.5030581 81.49541 -2.49540821
8    NYN   2004 71 91 684 731 0.4668211 75.62501 -4.62501135
9    NYN   2005 83 79 722 648 0.5538575 89.72491 -6.72490937

In the years 2006, 2007 and 2012, the Mets had more wins than expected by the model. In all other seasons they performed worse than predicted by the model.

We can summarize the Mets performance as follows:

mets_ben |>
  dplyr::summarize(
    num_years = dplyr::n(),  # number of years
    total_W = sum(W),        # total number of wins
    total_L = sum(L),        # total number of losses
    total_WPct = total_W / (total_W + total_L) # win percentage
  )

  num_years total_W total_L total_WPct
1         9     728     730  0.4993141

In these nine years, the Mets had a combined record of 728 wins and 730 losses, for an overall winning percentage of 49.93%.

Exercise 8

Discretize the years into three chunks: one for each of the three general managers under whom Ben worked. Jim Duquette was the Mets’ general manager in 2004, Omar Minaya from 2005 to 2010, and Sandy Alderson from 2011 to 2012.

SOLUTION:

mets_ben %>% 
  # this questions requires a logic for deciding 
  # which years each general manager worked
  dplyr::mutate(
    # nested ifelse statements are OK for this logic, 
    # but are only practical for about three cases
    gm = ifelse(
      yearID == 2004, 
      'Jim Duquette', 
      ifelse(
        yearID >= 2011, 
        'Sandy Alderson', 
        'Omar Minaya')
    )
  )

  teamID yearID  W  L  RS  RA      WPct    W_hat             gm
1    NYN   2004 71 91 684 731 0.4668211 75.62501   Jim Duquette
2    NYN   2005 83 79 722 648 0.5538575 89.72491    Omar Minaya
3    NYN   2006 97 65 834 731 0.5655308 91.61600    Omar Minaya
4    NYN   2007 88 74 804 750 0.5347071 86.62255    Omar Minaya
5    NYN   2008 89 73 799 715 0.5553119 89.96053    Omar Minaya
6    NYN   2009 70 92 671 757 0.4399936 71.27896    Omar Minaya
7    NYN   2010 79 83 656 652 0.5030581 81.49541    Omar Minaya
8    NYN   2011 77 85 718 742 0.4835661 78.33771 Sandy Alderson
9    NYN   2012 74 88 650 709 0.4566674 73.98012 Sandy Alderson

Alternatively, we can use the case_when function

mets_ben <- mets_ben |>
  dplyr::mutate(
    # same problem, but case_when is easier to work with
    gm = dplyr::case_when(
      yearID == 2004 ~ 'Jim Duquette', 
      yearID >= 2011 ~ 'Sandy Alderson', 
      TRUE ~ 'Omar Minaya' # this is the default case
    )
  )
mets_ben

  teamID yearID  W  L  RS  RA      WPct    W_hat             gm
1    NYN   2004 71 91 684 731 0.4668211 75.62501   Jim Duquette
2    NYN   2005 83 79 722 648 0.5538575 89.72491    Omar Minaya
3    NYN   2006 97 65 834 731 0.5655308 91.61600    Omar Minaya
4    NYN   2007 88 74 804 750 0.5347071 86.62255    Omar Minaya
5    NYN   2008 89 73 799 715 0.5553119 89.96053    Omar Minaya
6    NYN   2009 70 92 671 757 0.4399936 71.27896    Omar Minaya
7    NYN   2010 79 83 656 652 0.5030581 81.49541    Omar Minaya
8    NYN   2011 77 85 718 742 0.4835661 78.33771 Sandy Alderson
9    NYN   2012 74 88 650 709 0.4566674 73.98012 Sandy Alderson

Exercise 9

The raw churn data can be transformed into a tidy dataset as follows:

SOLUTION:

set.seed(42)

# read data and drop column 1 (it contains row numbers and doesn't have a column name)
df <- readr::read_csv("data/monthly_data.csv", show_col_types = FALSE, col_select = -1)

df |>
  # take date columns and pivot to longer table
  tidyr::pivot_longer(starts_with("20"), names_to = "date", values_to = "quantity") |> 
  # split the 'date' column into two measurements
  tidyr::separate_wider_delim(cols = date, delim = "_", names = c("date","paymentMandate")) |> 
  # pivot the two columns paymentMandate and quantity to two columns called payment and mandate
  tidyr::pivot_wider(names_from = paymentMandate, values_from = quantity) |> 
  # finally, mutate the date columns from strings to Dates
  dplyr::mutate(
    incorporation_date = as.Date(incorporation_date)
    , date = as.Date(date)
  )

# A tibble: 10,824 × 6
   company_id vertical    incorporation_date date       payments mandates
        <dbl> <chr>       <date>             <date>        <dbl>    <dbl>
 1          1 gym/fitness 2013-05-30         2014-01-01        0        1
 2          1 gym/fitness 2013-05-30         2014-02-01        0        2
 3          1 gym/fitness 2013-05-30         2014-03-01        0        2
 4          1 gym/fitness 2013-05-30         2014-04-01        1        1
 5          1 gym/fitness 2013-05-30         2014-05-01        0        0
 6          1 gym/fitness 2013-05-30         2014-06-01        1        0
 7          1 gym/fitness 2013-05-30         2014-07-01        0        0
 8          1 gym/fitness 2013-05-30         2014-08-01        0        0
 9          1 gym/fitness 2013-05-30         2014-09-01        0        0
10          1 gym/fitness 2013-05-30         2014-10-01        0        0
# ℹ 10,814 more rows

Exercise 10

Use the gm function to define the manager groups with the group_by() operator, and run the summaries again, this time across the manager groups.

SOLUTION:

#| label: read the data
data <- readr::read_csv("data/sales_dag.csv", show_col_types = FALSE)

data |> dplyr::slice_head(n=5) |> 
  gt::gt() |> 
  gt::tab_header(title = "sample marketing data") |> 
  gtExtras::gt_theme_espn()

sample marketing data
visits	discount	is_loyal	sales	sales_per_visit
12	0	0	13.34830	1.1123585
26	1	1	21.70125	0.8346635
13	0	0	14.70040	1.1308004
24	0	0	20.37734	0.8490557
14	0	0	12.63372	0.9024089

data |> skimr::skim()

Data summary
Name	data
Number of rows	700
Number of columns	5
_______________________
Column type frequency:
numeric	5
________________________
Group variables	None

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
visits	1	21.19	5.19	3.00	17.00	21.00	25.00	40.00	▁▃▇▃▁
discount	1	0.74	0.44	0.00	0.00	1.00	1.00	1.00	▃▁▁▁▇
is_loyal	1	0.75	0.43	0.00	0.75	1.00	1.00	1.00	▂▁▁▁▇
sales	1	19.40	4.88	2.09	16.44	19.50	22.50	36.14	▁▃▇▃▁
sales_per_visit	1	0.92	0.11	0.42	0.85	0.91	0.98	1.41	▁▂▇▂▁

The mean of the sales_per_visit columns/measurement is 0.9183 and there are no grouped observations.

# calculate the % share of customers receiving a discount vs the % not receiving a discount
data$discount |> table() / length(data$discount)


   0    1 
0.26 0.74

Similarly for the share of customers which are loyal:

# calculate the % share of customers that are 'loyal' vs not 'loyal'
data$is_loyal |> table() / length(data$is_loyal)


   0    1 
0.25 0.75

To understand these features better, they also looked at a cross-tab table:

# build a cross-tab table of 'loyal' customers vs customers getting a discount
data |> xtabs(~discount + is_loyal, data = _)

        is_loyal
discount   0   1
       0 175   7
       1   0 518

Alternatively:

data |> 
  dplyr::group_by(discount,is_loyal) |> 
  dplyr::summarize(n = dplyr::n(), .groups='drop') |> 
  tidyr::pivot_wider(
    names_from = is_loyal
    , values_from = n
    , names_prefix = 'is_loyal '
  ) |> gt::gt() |> 
  gt::tab_header(title = "Cross-tabs", subtitle = "discount vs is_loyal") |> 
  gtExtras::gt_theme_espn()

Cross-tabs
discount vs is_loyal
discount	is_loyal 0	is_loyal 1
0	175	7
1	NA	518

data |> dplyr::mutate(id = dplyr::row_number(), .before = 1) |> 
  dplyr::filter(discount == 0) |> 
  dplyr::arrange( desc(sales) ) |> 
  dplyr::slice_head(n=10) |> 
  gt::gt() |> 
  gt::tab_header(title = "Sales: loyal customers vs others") |> 
  gtExtras::gt_theme_espn()

Sales: loyal customers vs others
id	visits	discount	is_loyal	sales	sales_per_visit
567	33	0	1	31.72169	0.9612633
205	33	0	1	28.31111	0.8579125
366	33	0	1	28.08163	0.8509586
50	33	0	1	27.79064	0.8421407
281	29	0	1	27.58115	0.9510740
546	27	0	1	26.25533	0.9724196
105	29	0	1	26.21752	0.9040526
362	28	0	0	24.09410	0.8605037
652	28	0	0	24.06459	0.8594495
494	27	0	0	24.00630	0.8891220

The loyal customers are the top ones in terms of sales. This is good news. It means that the definition of loyal customers is consistent with the data.

In order to have orders of magnitude for the sales, the data scientist provided some summary statistics table:

gtExtras::gt_plt_summary(data, title = "Sales data")

Sales data
700 rows x 5 cols
	Column	Plot Overview	Missing	Mean	Median	SD
	visits		0.0%	21.2	21.0	5.2
	discount		0.0%	0.7	1.0	0.4
	is_loyal		0.0%	0.8	1.0	0.4
	sales		0.0%	19.4	19.5	4.9
	sales_per_visit		0.0%	0.9	0.9	0.1

To have a better glimpse of the data, the data scientist also provided a histogram of the sales:

data |> 
  ggplot(aes(x=sales)) +
  geom_histogram(aes(y = ..density..), bins = 30, colour = 1, fill = "white") +
  geom_density(lwd = 1, colour = 4, fill = 4, alpha = 0.25) +
  labs(title = "Sales Distribution") +
  theme_minimal()

Resources for additional practice (optional)

Chapter 2: Get Started Data Visualization by Kieran Healy
Chapter 3: Data visualization in R for Data Science by Hadley Wickham
RStudio Cloud Primers
- Visualization Basics: https://rstudio.cloud/learn/primers/1.1
- Work with Data: https://rstudio.cloud/learn/primers/2
- Visualize Data: https://rstudio.cloud/learn/primers/3