dim(Teams)
[1] 3015 48
SOLUTIONS
Today’s data is all baseball statistics. The data is in the Lahman
package.
Before doing any analysis, you will want to get quick view of the data. This is useful as part of the EDA process.
dim(Teams)
[1] 3015 48
The variable definitions are found in the help for Teams, and are listed below.
?Teams
Column | Description |
---|---|
yearID | Year |
lgID | League; a factor with levels AA AL FL NL PL UA |
teamID | Team; a factor |
franchID | Franchise (links to TeamsFranchises table) |
divID | Team’s division; a factor with levels C E W |
Rank | Position in final standings |
G | Games played |
Ghome | Games played at home |
W | Wins |
L | Losses |
DivWin | Division Winner (Y or N) |
WCWin | Wild Card Winner (Y or N) |
LgWin | League Champion(Y or N) |
WSWin | World Series Winner (Y or N) |
R | Runs scored |
AB | At bats |
H | Hits by batters |
X2B | Doubles |
X3B | Triples |
HR | Homeruns by batters |
BB | Walks by batters |
SO | Strikeouts by batters |
SB | Stolen bases |
CS | Caught stealing |
HBP | Batters hit by pitch |
SF | Sacrifice flies |
RA | Opponents runs scored |
ER | Earned runs allowed |
ERA | Earned run average |
CG | Complete games |
SHO | Shutouts |
SV | Saves |
IPouts | Outs Pitched (innings pitched x 3) |
HA | Hits allowed |
HRA | Homeruns allowed |
BBA | Walks allowed |
SOA | Strikeouts by pitchers |
E | Errors |
DP | Double Plays |
FP | Fielding percentage |
name | Team’s full name |
park | Name of team’s home ballpark |
attendance | Home attendance total |
BPF | Three-year park factor for batters |
PPF | Three-year park factor for pitchers |
teamIDBR | Team ID used by Baseball Reference website |
teamIDlahman45 | Team ID used in Lahman database version 4.5 |
teamIDretro | Team ID used by Retrosheet |
How many observations are in the Teams
dataset? How many variables?
# take the first three rows and glimpse the data
|> dplyr::slice_head(n=3) |> dplyr::glimpse() Teams
Rows: 3
Columns: 48
$ yearID <int> 1871, 1871, 1871
$ lgID <fct> NA, NA, NA
$ teamID <fct> BS1, CH1, CL1
$ franchID <fct> BNA, CNA, CFC
$ divID <chr> NA, NA, NA
$ Rank <int> 3, 2, 8
$ G <int> 31, 28, 29
$ Ghome <int> NA, NA, NA
$ W <int> 20, 19, 10
$ L <int> 10, 9, 19
$ DivWin <chr> NA, NA, NA
$ WCWin <chr> NA, NA, NA
$ LgWin <chr> "N", "N", "N"
$ WSWin <chr> NA, NA, NA
$ R <int> 401, 302, 249
$ AB <int> 1372, 1196, 1186
$ H <int> 426, 323, 328
$ X2B <int> 70, 52, 35
$ X3B <int> 37, 21, 40
$ HR <int> 3, 10, 7
$ BB <int> 60, 60, 26
$ SO <int> 19, 22, 25
$ SB <int> 73, 69, 18
$ CS <int> 16, 21, 8
$ HBP <int> NA, NA, NA
$ SF <int> NA, NA, NA
$ RA <int> 303, 241, 341
$ ER <int> 109, 77, 116
$ ERA <dbl> 3.55, 2.76, 4.11
$ CG <int> 22, 25, 23
$ SHO <int> 1, 0, 0
$ SV <int> 3, 1, 0
$ IPouts <int> 828, 753, 762
$ HA <int> 367, 308, 346
$ HRA <int> 2, 6, 13
$ BBA <int> 42, 28, 53
$ SOA <int> 23, 22, 34
$ E <int> 243, 229, 234
$ DP <int> 24, 16, 15
$ FP <dbl> 0.834, 0.829, 0.818
$ name <chr> "Boston Red Stockings", "Chicago White Stockings", "Cle…
$ park <chr> "South End Grounds I", "Union Base-Ball Grounds", "Nati…
$ attendance <int> NA, NA, NA
$ BPF <int> 103, 104, 96
$ PPF <int> 98, 102, 100
$ teamIDBR <chr> "BOS", "CHI", "CLE"
$ teamIDlahman45 <chr> "BS1", "CH1", "CL1"
$ teamIDretro <chr> "BS1", "CH1", "CL1"
How many character columns/measurements have missing variables?
|> skimr::skim() Teams
Name | Teams |
Number of rows | 3015 |
Number of columns | 48 |
_______________________ | |
Column type frequency: | |
character | 10 |
factor | 3 |
numeric | 35 |
________________________ | |
Group variables | None |
Variable type: character
skim_variable | n_missing | complete_rate | min | max | empty | n_unique | whitespace |
---|---|---|---|---|---|---|---|
divID | 1517 | 0.50 | 1 | 1 | 0 | 3 | 0 |
DivWin | 1545 | 0.49 | 1 | 1 | 0 | 2 | 0 |
WCWin | 2181 | 0.28 | 1 | 1 | 0 | 2 | 0 |
LgWin | 28 | 0.99 | 1 | 1 | 0 | 2 | 0 |
WSWin | 357 | 0.88 | 1 | 1 | 0 | 2 | 0 |
name | 0 | 1.00 | 11 | 33 | 0 | 140 | 0 |
park | 34 | 0.99 | 7 | 70 | 0 | 219 | 0 |
teamIDBR | 0 | 1.00 | 3 | 3 | 0 | 101 | 0 |
teamIDlahman45 | 0 | 1.00 | 3 | 3 | 0 | 148 | 0 |
teamIDretro | 0 | 1.00 | 3 | 3 | 0 | 151 | 0 |
Variable type: factor
skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
---|---|---|---|---|---|
lgID | 0 | 1 | FALSE | 7 | NL: 1534, AL: 1310, AA: 85, NA: 50 |
teamID | 0 | 1 | FALSE | 149 | CHN: 147, PHI: 140, PIT: 136, CIN: 133 |
franchID | 0 | 1 | FALSE | 120 | ATL: 147, CHC: 147, CIN: 141, PIT: 141 |
Variable type: numeric
skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
---|---|---|---|---|---|---|---|---|---|---|
yearID | 0 | 1.00 | 1959.49 | 43.23 | 1871.00 | 1923.00 | 1968.00 | 1997.00 | 2022.00 | ▃▅▅▆▇ |
Rank | 0 | 1.00 | 4.03 | 2.29 | 1.00 | 2.00 | 4.00 | 6.00 | 13.00 | ▇▅▃▁▁ |
G | 0 | 1.00 | 150.13 | 24.33 | 6.00 | 154.00 | 159.00 | 162.00 | 165.00 | ▁▁▁▁▇ |
Ghome | 399 | 0.87 | 78.08 | 6.91 | 24.00 | 77.00 | 81.00 | 81.00 | 84.00 | ▁▁▁▁▇ |
W | 0 | 1.00 | 74.67 | 17.97 | 0.00 | 66.00 | 77.00 | 87.00 | 116.00 | ▁▁▃▇▂ |
L | 0 | 1.00 | 74.67 | 17.73 | 4.00 | 65.00 | 76.00 | 87.00 | 134.00 | ▁▂▇▅▁ |
R | 0 | 1.00 | 681.16 | 139.02 | 24.00 | 614.00 | 691.00 | 764.00 | 1220.00 | ▁▁▇▅▁ |
AB | 0 | 1.00 | 5132.17 | 794.89 | 211.00 | 5142.00 | 5405.00 | 5519.00 | 5781.00 | ▁▁▁▁▇ |
H | 0 | 1.00 | 1339.25 | 229.86 | 33.00 | 1297.00 | 1389.00 | 1464.00 | 1783.00 | ▁▁▁▇▅ |
X2B | 0 | 1.00 | 229.03 | 59.72 | 1.00 | 195.00 | 234.00 | 272.00 | 376.00 | ▁▂▆▇▂ |
X3B | 0 | 1.00 | 45.43 | 22.52 | 0.00 | 29.00 | 40.00 | 58.00 | 150.00 | ▅▇▃▁▁ |
HR | 0 | 1.00 | 106.54 | 64.11 | 0.00 | 47.00 | 111.00 | 156.00 | 307.00 | ▇▆▇▃▁ |
BB | 0 | 1.00 | 473.61 | 132.08 | 0.00 | 427.00 | 494.00 | 554.00 | 835.00 | ▁▁▇▇▁ |
SO | 16 | 0.99 | 768.12 | 323.43 | 3.00 | 517.00 | 766.00 | 1000.00 | 1596.00 | ▂▇▇▅▂ |
SB | 125 | 0.96 | 109.09 | 69.47 | 0.00 | 62.00 | 92.00 | 137.00 | 581.00 | ▇▃▁▁▁ |
CS | 831 | 0.72 | 46.26 | 21.91 | 0.00 | 32.75 | 43.00 | 56.00 | 191.00 | ▆▇▁▁▁ |
HBP | 1158 | 0.62 | 46.18 | 18.31 | 7.00 | 32.00 | 44.00 | 58.00 | 160.00 | ▆▇▂▁▁ |
SF | 1541 | 0.49 | 44.04 | 10.16 | 7.00 | 38.00 | 44.00 | 50.00 | 77.00 | ▁▂▇▃▁ |
RA | 0 | 1.00 | 681.16 | 138.81 | 34.00 | 610.50 | 689.00 | 766.00 | 1252.00 | ▁▁▇▃▁ |
ER | 0 | 1.00 | 573.99 | 149.50 | 23.00 | 504.00 | 595.00 | 672.00 | 1023.00 | ▁▂▇▆▁ |
ERA | 0 | 1.00 | 3.84 | 0.76 | 1.22 | 3.37 | 3.84 | 4.33 | 8.00 | ▁▇▇▁▁ |
CG | 0 | 1.00 | 47.08 | 39.39 | 0.00 | 9.00 | 41.00 | 75.00 | 148.00 | ▇▅▃▂▁ |
SHO | 0 | 1.00 | 9.63 | 5.03 | 0.00 | 6.00 | 9.00 | 13.00 | 32.00 | ▅▇▃▁▁ |
SV | 0 | 1.00 | 24.59 | 16.34 | 0.00 | 10.00 | 25.00 | 39.00 | 68.00 | ▇▅▆▅▁ |
IPouts | 0 | 1.00 | 4016.16 | 660.67 | 162.00 | 4080.00 | 4257.00 | 4341.00 | 4518.00 | ▁▁▁▁▇ |
HA | 0 | 1.00 | 1339.03 | 230.06 | 49.00 | 1286.00 | 1389.00 | 1468.00 | 1993.00 | ▁▁▁▇▁ |
HRA | 0 | 1.00 | 106.54 | 60.99 | 0.00 | 52.00 | 114.00 | 154.00 | 305.00 | ▆▆▇▂▁ |
BBA | 0 | 1.00 | 473.96 | 131.23 | 1.00 | 429.50 | 496.00 | 553.50 | 827.00 | ▁▁▇▇▁ |
SOA | 0 | 1.00 | 767.54 | 324.61 | 0.00 | 513.00 | 767.00 | 1002.00 | 1687.00 | ▂▇▇▅▁ |
E | 0 | 1.00 | 179.84 | 108.25 | 20.00 | 110.00 | 141.00 | 205.00 | 639.00 | ▇▅▁▁▁ |
DP | 0 | 1.00 | 132.60 | 35.76 | 0.00 | 116.00 | 140.00 | 156.00 | 217.00 | ▁▁▃▇▁ |
FP | 0 | 1.00 | 0.97 | 0.03 | 0.76 | 0.97 | 0.98 | 0.98 | 0.99 | ▁▁▁▁▇ |
attendance | 279 | 0.91 | 1385099.81 | 964039.83 | 0.00 | 544781.75 | 1203014.50 | 2080399.25 | 4483350.00 | ▇▆▅▂▁ |
BPF | 0 | 1.00 | 100.19 | 4.92 | 60.00 | 97.00 | 100.00 | 103.00 | 129.00 | ▁▁▇▅▁ |
PPF | 0 | 1.00 | 100.21 | 4.86 | 60.00 | 97.00 | 100.00 | 103.00 | 141.00 | ▁▁▇▁▁ |
Ben Baumer worked for the New York Mets from 2004 to 2012. What was the team W/L record during those years? Use filter()
and select()
to quickly identify only those pieces of information that we care about.
The model estimates the expected winning percentage as follows:
where is the number of runs the team allows to be scored, is the number of runs that the team scores, and is the team’s expected winning percentage. The runs scored and allowed are present in the Teams
table, so we start by selecting them.
Next, we need to compute the team’s actual winning percentage in each of these seasons. Thus, we need to add a new column to our data frame, and we do this with the mutate()
command.
The expected number of wins is then equal to the product of the expected winning percentage times the number of games.
<- mets_ben |>
mets_ben # once we have calculated the expected winning percentage,
# the expected number of wins is the percentage times the total number of games played
::mutate( W_hat = WPct * (W+L) )
dplyr mets_ben
teamID yearID W L RS RA WPct W_hat
1 NYN 2004 71 91 684 731 0.4668211 75.62501
2 NYN 2005 83 79 722 648 0.5538575 89.72491
3 NYN 2006 97 65 834 731 0.5655308 91.61600
4 NYN 2007 88 74 804 750 0.5347071 86.62255
5 NYN 2008 89 73 799 715 0.5553119 89.96053
6 NYN 2009 70 92 671 757 0.4399936 71.27896
7 NYN 2010 79 83 656 652 0.5030581 81.49541
8 NYN 2011 77 85 718 742 0.4835661 78.33771
9 NYN 2012 74 88 650 709 0.4566674 73.98012
In this case, the Mets’ fortunes were better than expected in three of these seasons, and worse than expected in the other six.
We can confirm this as follows:
Naturally, the Mets experienced ups and downs during Ben’s time with the team. Which seasons were best? To figure this out, we can simply sort the rows of the data frame by number of wins.
In 2006, the Mets had the best record in baseball during the regular season and nearly made the World Series. How do these seasons rank in terms of the team’s performance relative to our model?
Discretize the years into three chunks: one for each of the three general managers under whom Ben worked. Jim Duquette was the Mets’ general manager in 2004, Omar Minaya from 2005 to 2010, and Sandy Alderson from 2011 to 2012.
The raw churn data can be transformed into a tidy dataset as follows:
Use the gm
function to define the manager groups with the group_by()
operator, and run the summaries again, this time across the manager groups.
Alternatively:
|>
data ::group_by(discount,is_loyal) |>
dplyr::summarize(n = dplyr::n(), .groups='drop') |>
dplyr::pivot_wider(
tidyrnames_from = is_loyal
values_from = n
, names_prefix = 'is_loyal '
, |> gt::gt() |>
) ::tab_header(title = "Cross-tabs", subtitle = "discount vs is_loyal") |>
gt::gt_theme_espn() gtExtras
Cross-tabs | ||
---|---|---|
discount vs is_loyal | ||
discount | is_loyal 0 | is_loyal 1 |
0 | 175 | 7 |
1 | NA | 518 |
|> dplyr::mutate(id = dplyr::row_number(), .before = 1) |>
data ::filter(discount == 0) |>
dplyr::arrange( desc(sales) ) |>
dplyr::slice_head(n=10) |>
dplyr::gt() |>
gt::tab_header(title = "Sales: loyal customers vs others") |>
gt::gt_theme_espn() gtExtras
Sales: loyal customers vs others | |||||
---|---|---|---|---|---|
id | visits | discount | is_loyal | sales | sales_per_visit |
567 | 33 | 0 | 1 | 31.72169 | 0.9612633 |
205 | 33 | 0 | 1 | 28.31111 | 0.8579125 |
366 | 33 | 0 | 1 | 28.08163 | 0.8509586 |
50 | 33 | 0 | 1 | 27.79064 | 0.8421407 |
281 | 29 | 0 | 1 | 27.58115 | 0.9510740 |
546 | 27 | 0 | 1 | 26.25533 | 0.9724196 |
105 | 29 | 0 | 1 | 26.21752 | 0.9040526 |
362 | 28 | 0 | 0 | 24.09410 | 0.8605037 |
652 | 28 | 0 | 0 | 24.06459 | 0.8594495 |
494 | 27 | 0 | 0 | 24.00630 | 0.8891220 |
The loyal customers are the top ones in terms of sales. This is good news. It means that the definition of loyal customers is consistent with the data.
In order to have orders of magnitude for the sales, the data scientist provided some summary statistics table:
::gt_plt_summary(data, title = "Sales data") gtExtras
Sales data | ||||||
---|---|---|---|---|---|---|
700 rows x 5 cols | ||||||
Column | Plot Overview | Missing | Mean | Median | SD | |
visits | 0.0% | 21.2 | 21.0 | 5.2 | ||
discount | 0.0% | 0.7 | 1.0 | 0.4 | ||
is_loyal | 0.0% | 0.8 | 1.0 | 0.4 | ||
sales | 0.0% | 19.4 | 19.5 | 4.9 | ||
sales_per_visit | 0.0% | 0.9 | 0.9 | 0.1 |
To have a better glimpse of the data, the data scientist also provided a histogram of the sales:
|>
data ggplot(aes(x=sales)) +
geom_histogram(aes(y = ..density..), bins = 30, colour = 1, fill = "white") +
geom_density(lwd = 1, colour = 4, fill = 4, alpha = 0.25) +
labs(title = "Sales Distribution") +
theme_minimal()