```r
# check if 'librarian' is installed and if not, install it
if (!"librarian" %in% rownames(installed.packages())) {
  install.packages("librarian")
}

# load packages if not already loaded
librarian::shelf(
  tidyverse, Lahman, magrittr, gt, gtExtras, ggplot2,
  skimr, dplyr, here, readr, rmarkdown
)

theme_set(theme_bw(base_size = 18) + theme(legend.position = "top"))
```
Lab 1 - Tidy Data Wrangling
BSMM 8740 Fall 2024
Introduction
This lab will go through many of the same operations we’ve demonstrated in class. The main goal is to reinforce our understanding of tidy data, the tidyverse and the pipe, which we will be using throughout the course.
As the labs progress, you are encouraged to explore beyond what the labs require; a willingness to experiment will make you a much better programmer. Before we get to that stage, however, you need to build some basic fluency in R and the tidyverse. Today we begin with exercises in the fundamental building blocks of R and RStudio: the interface, reading in data, and basic commands.
Learning goals
By the end of the lab, you will…
- Be familiar with the workflow using R, RStudio, Git, and GitHub
- Have practiced version control using GitHub
Getting started
To complete the lab, log on to your GitHub account and then go to the class GitHub organization and find the 2024-lab-1-[your github username] repository.
Create an R project using your 2024-lab-1-[your github username] repository (remember to create a PAT, etc.) and add your answers by editing the `2024-lab-1.qmd` file in your repository. When you are done, be sure to save your document, then stage, commit, and push your work.
To access Github from the lab, you will need to make sure you are logged in as follows:
- username: .\daladmin
- password: Business507!
Remember to create a PAT and set your git credentials:
- create your PAT using `usethis::create_github_token()`,
- store your PAT with `gitcreds::gitcreds_set()`,
- set your username and email with `usethis::use_git_config(user.name = ___, user.email = ___)`
Packages
The code below will install (if necessary) and load the packages needed in today’s exercises.
Now, today’s lab
The tidyverse is a meta-package. When you load it you get eight packages loaded for you:
- ggplot2: for data visualization
- dplyr: for data wrangling
- tidyr: for data tidying and rectangling
- readr: for reading and writing data
- tibble: for modern, tidy data frames
- stringr: for string manipulation
- forcats: for dealing with factors
- purrr: for iteration with functional programming
The message that’s printed when you load the package tells you which versions of these packages are loaded, as well as any conflicts they may have introduced. For example, the `filter()` function from dplyr has now masked (overwritten) the `filter()` function available in base R (and that’s OK, since we’ll use `dplyr::filter()` anyway).
We’ll be using functionality from all of these packages throughout the semester, though we’ll always load them all at once with `library(tidyverse)`. You can find out more about the tidyverse and each of the packages that make it up here.
Data: Yearly statistics and standings for baseball teams
Today’s data is all baseball statistics. The data is in the `Lahman` package.
View the data
Before doing any analysis, you may want to get a quick view of the data. This is useful when you’ve imported data, to check that the import worked as expected. We can use the `view()` function to see the entire data set in RStudio, and `dim()` to check its dimensions. Type the code below in the Console:

```r
dim(Teams)
```
Data dictionary
The variable definitions are found in the help for `Teams`:

```r
?Teams
```
Exercises
Write all code and narrative in your R Markdown file where indicated. Write all narrative in complete sentences. Throughout the assignment, you should periodically render your Quarto document to ensure that all code executes and that your document format is intact, save, stage & commit the changes in the Git pane, and push the updated files to your repository. This ensures that your work is saved.
Make sure we can read all of the code in your Quarto document. This means you will need to break up long lines of code. One way to avoid long lines of code is to start a new line after every pipe (`%>%` or `|>`) and plus sign (`+`).
Exercise 1
The `view()` function helps us get a quick view of the dataset, but let’s get more detail about its structure. Viewing a summary of the data is a useful starting point for data analysis, especially if the dataset has a large number of observations (rows) or variables (columns). Run the code below to use the `dplyr::glimpse()` function to see a summary of the `Teams` dataset.

```r
dplyr::glimpse(Teams[1:10, ])
```
In your `lab-1.qmd` document you’ll see that we already added the code required for most exercises, as well as a sentence where you can fill in the blanks to report the answer.
Also note that the code chunk has a label: `glimpse-data`. It’s not required, but it is good practice and highly encouraged to label your code chunks in this way. If there is an error when you render your document, the code-chunk label will identify where the error is.
Exercise 2
Ben Baumer worked for the New York Mets from 2004 to 2012. What was the team’s W/L record during those years? Use `filter()` and `select()` to quickly identify only those pieces of information that we care about.
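One way to pull out just those seasons and columns is sketched below; it assumes the Mets’ `teamID` in the Lahman `Teams` table is `"NYN"`.

```r
library(Lahman)
library(dplyr)

# Keep only the Mets seasons 2004-2012 and the win/loss columns
mets_ben <- Teams |>
  filter(teamID == "NYN", yearID %in% 2004:2012) |>
  select(yearID, teamID, W, L)

mets_ben
```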
Exercise 3
We’ve answered the simple question of how the Mets performed during the time that Ben was there, but since we are data scientists, we are interested in deeper questions. For example, some of these seasons were subpar—the Mets had more losses than wins. Did the team just get unlucky in those seasons? Or did they actually play as badly as their record indicates?
In order to answer this question, we need a model for expected winning percentage. It turns out that one of the most widely used contributions to the field of baseball analytics (courtesy of Bill James) is exactly that. This model translates the number of runs that a team scores and allows over the course of an entire season into an expectation for how many games they should have won. The simplest version of this model is:

$$\widehat{WPct} = \frac{RS^2}{RS^2 + RA^2}$$

where $RA$ is the number of runs the team allows, $RS$ is the number of runs that the team scores, and $\widehat{WPct}$ is the team’s expected winning percentage. Luckily for us, the runs scored and allowed are present in the `Teams` table, so let’s grab them and save them in a new data frame.
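A sketch of grabbing the runs scored (`R`) and runs allowed (`RA`) columns and computing the expected winning percentage (the data-frame and column names here are our own choices):

```r
library(Lahman)
library(dplyr)

# Runs scored (R) and runs allowed (RA) for the Mets, 2004-2012,
# plus the Pythagorean expected winning percentage
mets_exp <- Teams |>
  filter(teamID == "NYN", yearID %in% 2004:2012) |>
  select(yearID, W, L, R, RA) |>
  mutate(WPct_hat = R^2 / (R^2 + RA^2))

mets_exp
```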
This is a good place to save, stage, commit, and push changes to your remote lab-1 repository. Click the checkbox next to each file in the Git pane to stage the updates you’ve made, write an informative commit message (e.g., “Completed exercises 1 - 3”), and push. After you push the changes, the Git pane in RStudio should be empty.
Exercise 4
Next, we need to compute the team’s actual winning percentage in each of these seasons. Thus, we need to add a new column to our data frame, and we do this with the `mutate()` command.
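A self-contained sketch of the `mutate()` step (rebuilding the Exercise 3 data frame so this chunk runs on its own; the column names are our own choices):

```r
library(Lahman)
library(dplyr)

mets_wpct <- Teams |>
  filter(teamID == "NYN", yearID %in% 2004:2012) |>
  mutate(
    WPct_hat = R^2 / (R^2 + RA^2),  # expected winning percentage (Exercise 3)
    WPct     = W / (W + L)          # actual winning percentage
  ) |>
  select(yearID, W, L, WPct, WPct_hat)

mets_wpct
```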
Exercise 5
In this case, the Mets’ fortunes were better than expected in three of these seasons, and worse than expected in the other six.
In how many seasons were the Mets better than expected? How many were they worse than expected?
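One way to count the seasons (a sketch combining the columns built in Exercises 3 and 4; the name `res` is our own choice):

```r
library(Lahman)
library(dplyr)

res <- Teams |>
  filter(teamID == "NYN", yearID %in% 2004:2012) |>
  mutate(
    WPct_hat = R^2 / (R^2 + RA^2),  # expected (Pythagorean) winning percentage
    WPct     = W / (W + L)          # actual winning percentage
  ) |>
  summarize(
    better = sum(WPct > WPct_hat),  # seasons better than expected
    worse  = sum(WPct < WPct_hat)   # seasons worse than expected
  )

res
```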
This is a good place to save, stage, commit, and push changes to your remote lab-1 repo. Click the checkbox next to each file in the Git pane to stage the updates you’ve made, write an informative commit message (e.g., “Completed exercises 4 and 5”), and push. After you push the changes, the Git pane in RStudio should be empty.
Exercise 6
Naturally, the Mets experienced ups and downs during Ben’s time with the team. Which seasons were best? To figure this out, we can simply sort the rows of the data frame.
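Sorting can be done with `dplyr::arrange()`; a minimal sketch, ordering seasons by wins in descending order:

```r
library(Lahman)
library(dplyr)

# Best seasons (most wins) first
mets_sorted <- Teams |>
  filter(teamID == "NYN", yearID %in% 2004:2012) |>
  select(yearID, W, L) |>
  arrange(desc(W))

mets_sorted
```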
Exercise 7
In 2006, the Mets had the best record in baseball during the regular season and nearly made the World Series. How do these seasons rank in terms of the team’s performance relative to our model?
This is a good place to save, stage, commit, and push changes to your remote lab-1 repo. Click the checkbox next to each file in the Git pane to stage the updates you’ve made, write an informative commit message (e.g., “Completed exercises 6 and 7”), and push. After you push the changes, the Git pane in RStudio should be empty.
Exercise 8
Discretize the years into three chunks: one for each of the three general managers under whom Ben worked. Jim Duquette was the Mets’ general manager in 2004, Omar Minaya from 2005 to 2010, and Sandy Alderson from 2011 to 2012.
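One way to discretize the years is with `dplyr::case_when()`; a sketch (the column name `gm` is our own choice):

```r
library(Lahman)
library(dplyr)

# Label each season with the general manager in charge
mets_gm <- Teams |>
  filter(teamID == "NYN", yearID %in% 2004:2012) |>
  mutate(gm = case_when(
    yearID == 2004        ~ "Jim Duquette",
    yearID %in% 2005:2010 ~ "Omar Minaya",
    yearID %in% 2011:2012 ~ "Sandy Alderson"
  )) |>
  select(yearID, gm, W, L)

mets_gm
```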
Exercise 9
The following dataset is the basis of a model that predicts which businesses are likely to have customer churn at the start of 2015, based on the business type and incorporation_date. This question will give you some practice using the various `tidyr` pivot operations.
```r
set.seed(42)

# read data and drop column 1 (it contains row numbers and doesn't have a column name)
df <- readr::read_csv("data/monthly_data.csv", show_col_types = FALSE, col_select = -1)

# Have a glimpse of the data
glimpse(df)
```
```
Rows: 902
Columns: 27
$ company_id            <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 15, …
$ `2014-01-01_payments` <dbl> 0, 1, 6, 8, 0, 2, 3, 0, 0, 0, 0, 0, 0, 4, 43, 34…
$ `2014-02-01_payments` <dbl> 0, 1, 6, 4, 0, 2, 3, 0, 0, 0, 2, 0, 11, 1, 51, 5…
$ `2014-03-01_payments` <dbl> 0, 3, 6, 7, 39, 1, 1, 6, 0, 0, 0, 8, 0, 1, 44, 4…
$ `2014-04-01_payments` <dbl> 1, 2, 6, 7, 0, 3, 7, 50, 1, 0, 0, 2, 0, 0, 47, 5…
$ `2014-05-01_payments` <dbl> 0, 2, 6, 1, 54, 1, 4, 119, 0, 1, 0, 0, 0, 5, 46,…
$ `2014-06-01_payments` <dbl> 1, 1, 7, 2, 0, 2, 1, 151, 3, 0, 0, 0, 0, 2, 81, …
$ `2014-07-01_payments` <dbl> 0, 1, 8, 2, 0, 2, 7, 182, 0, 0, 0, 3, 0, 0, 91, …
$ `2014-08-01_payments` <dbl> 0, 1, 7, 4, 22, 1, 2, 167, 0, 0, 0, 2, 0, 0, 93, …
$ `2014-09-01_payments` <dbl> 0, 1, 8, 3, 0, 1, 5, 180, 0, 0, 0, 0, 9, 5, 88, …
$ `2014-10-01_payments` <dbl> 0, 4, 8, 5, 0, 2, 8, 157, 1, 0, 0, 0, 2, 1, 86, …
$ `2014-11-01_payments` <dbl> 0, 3, 9, 5, 0, 1, 2, 105, 0, 0, 0, 0, 0, 0, 93, …
$ `2014-12-01_payments` <dbl> 0, 3, 9, 9, 0, 3, 8, 57, 0, 0, 0, 0, 0, 0, 104, …
$ `2014-01-01_mandates` <dbl> 1, 0, 0, 0, 4, 0, 0, 0, 4, 4, 0, 0, 22, 0, 1, 6,…
$ `2014-02-01_mandates` <dbl> 2, 0, 0, 0, 31, 1, 0, 2, 3, 8, 0, 0, 20, 12, 0, …
$ `2014-03-01_mandates` <dbl> 2, 0, 0, 0, 24, 0, 0, 0, 5, 19, 0, 0, 11, 17, 0,…
$ `2014-04-01_mandates` <dbl> 1, 0, 0, 53, 18, 0, 1, 0, 0, 0, 0, 1, 11, 14, 1,…
$ `2014-05-01_mandates` <dbl> 0, 0, 2, 0, 8, 0, 0, 0, 0, 0, 0, 1, 15, 0, 0, 3,…
$ `2014-06-01_mandates` <dbl> 0, 0, 2, 0, 7, 0, 0, 0, 0, 0, 0, 0, 13, 5, 0, 0,…
$ `2014-07-01_mandates` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 13, 16, 0, 0…
$ `2014-08-01_mandates` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 14, 8, 0, 0,…
$ `2014-09-01_mandates` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 16, 2, 1, 1,…
$ `2014-10-01_mandates` <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 16, 1, 0, 2,…
$ `2014-11-01_mandates` <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6, 0, 1, 0, …
$ `2014-12-01_mandates` <dbl> 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 11, 0, 0, 0,…
$ vertical              <chr> "gym/fitness", "gym/fitness", "freelance develop…
$ incorporation_date    <date> 2013-05-30, 2003-09-25, 2008-10-22, 2005-06-28,…
```
You’ll notice that the data is not in tidy form: one type of measurement in each column and all measurements that go together in the same row. To put this data into tidy form you’ll need to:

- Take all the columns with names that start with a date string and use `tidyr::pivot_longer` to create a column named date containing the original column names and a column named quantity to contain the values.
- In the date column (which contains strings at this stage), split the date from the remainder of the string, saving the remainder in a column called paymentMandate and the date string in a column called date. To do this, use `tidyr::separate_wider_delim` with `delim = "_"` and the specified names.
- Next, use `tidyr::pivot_wider` with `names_from = paymentMandate` and `values_from = quantity` to create payments and mandates columns.
- Finally, use `dplyr::mutate` to change the type of the date and incorporation_date columns to `Date`.
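The steps above can be sketched on a toy tibble with the same column-name pattern as `monthly_data.csv` (the toy values here are invented for illustration):

```r
library(tidyverse)

# Toy data mimicking the monthly_data.csv layout (values invented)
df <- tibble(
  company_id = 1:2,
  `2014-01-01_payments` = c(0, 1),
  `2014-02-01_payments` = c(0, 1),
  `2014-01-01_mandates` = c(1, 0),
  `2014-02-01_mandates` = c(2, 0),
  vertical = c("gym/fitness", "gym/fitness"),
  incorporation_date = c("2013-05-30", "2003-09-25")
)

tidy_df <- df |>
  # step 1: long format, original column names into `date`
  pivot_longer(
    cols = matches("^\\d{4}-"),
    names_to = "date",
    values_to = "quantity"
  ) |>
  # step 2: split "2014-01-01_payments" into date and paymentMandate
  separate_wider_delim(date, delim = "_", names = c("date", "paymentMandate")) |>
  # step 3: spread payments/mandates back into their own columns
  pivot_wider(names_from = paymentMandate, values_from = quantity) |>
  # step 4: convert the string columns to Date
  mutate(
    date = as.Date(date),
    incorporation_date = as.Date(incorporation_date)
  )

tidy_df
```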
Exercise 10
The Business Problem
The story begins in a fast-paced startup. The company is growing quickly, and the marketing team is looking for ways to increase sales from existing customers by getting them to buy more. The main idea is to unlock the potential of the customer base through incentives, in this case a discount. We of course want to measure the effect of the discount on customer behavior. Still, they do not want to waste money giving discounts to customers who are not valuable. As always, it is about return on investment (ROI).
Without going into specifics about the nature of the discount, it has been designed to provide a positive return on investment if the customer buys sufficiently more as a result of the discount. How can we measure the effect of the discount and make sure our experiment has a positive ROI? The marketing team came up with the following strategy:
- Select a sample of existing customers from the same cohort.
- Set a test window of 1 month.
- Look into the historical data of web visits from the last month. The hypothesis is that web visits are a good proxy for the customer’s interest in the product.
- For customers with a high number of web visits, send a discount. Within these potentially valuable customers there will be a hold-out group which will not receive the discount. For customers with a low number of web visits, do not send a discount (the marketing team wants to report a positive ROI, so they do not want to waste money on customers who are not valuable). Still, they want to use them to measure the effect of the discount.
- We also want to use the results of the test to tag loyal customers. These are customers who got a discount (since they showed potential interest in the product) and customers with exceptional sales numbers even if they did not get a discount. The idea is to use this information to target them in the future if the discount strategy is positive.
The Data
The team collected data from the experiment above and asked the data science team to analyze it and provide insights. In particular, they want to know if they should keep the discount strategy. The data consists of the following fields:

- `visits`: number of visits to the website during the test window.
- `discount`: whether the customer received a discount or not.
- `is_loyal`: whether the customer is loyal or not, according to the definition above.
- `sales`: sales during the test window.
Prepare Notebook
Data scientist A was the one in charge of preparing the environment and collecting the data. As an important best practice, they fixed a global seed which initializes the random number generator in order to make sure every part of the analysis was reproducible. This ensures that the calculations are not affected by pure randomness. In addition all the required packages were listed from the start (reproducible R environment).
```r
set.seed(8740)
```
Read Data
They pulled the data from a csv file and displayed the first 5 measurements.
```r
data <- readr::read_csv("data/sales_dag.csv", show_col_types = FALSE)

data |>
  dplyr::slice_head(n = 5) |>
  gt::gt() |>
  gt::tab_header(title = "sample marketing data") |>
  gtExtras::gt_theme_espn()
```
sample marketing data

| visits | discount | is_loyal | sales | sales_per_visit |
|---|---|---|---|---|
| 12 | 0 | 0 | 13.34830 | 1.1123585 |
| 26 | 1 | 1 | 21.70125 | 0.8346635 |
| 13 | 0 | 0 | 14.70040 | 1.1308004 |
| 24 | 0 | 0 | 20.37734 | 0.8490557 |
| 14 | 0 | 0 | 12.63372 | 0.9024089 |
They then checked for missing values and whether the measurements were in the correct format.
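A quick missing-value check can be done with `dplyr::across()`; a sketch on a toy tibble (in the lab you would run the same `summarize()` on `data`; the toy values are invented):

```r
library(dplyr)

# Toy tibble with some deliberately missing values
toy <- tibble::tibble(
  visits = c(12, NA, 13),
  sales  = c(13.3, 21.7, NA)
)

# Count NAs in every column
na_counts <- toy |>
  summarize(across(everything(), ~ sum(is.na(.x))))

na_counts
```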
Exploratory Data Analysis
As part of the project scope, the data science team in charge of the analysis was asked to provide a summary of the data. The team was also asked to provide a visualization of the data to help the marketing team understand the data better. Data scientist A took over this task.
They started by looking at the share of customers which received a discount:
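The share of discounted customers can be computed with `dplyr::count()`; a sketch using just the five sample rows shown earlier (in the lab you would run this on the full `data`):

```r
library(dplyr)

# The five sample rows from the table above (visits and discount only)
sample_data <- tibble::tibble(
  visits   = c(12, 26, 13, 24, 14),
  discount = c(0, 1, 0, 0, 0)
)

# Number and share of customers with / without a discount
discount_share <- sample_data |>
  count(discount) |>
  mutate(share = n / sum(n))

discount_share
```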
Note that all customers with discount are loyal (as required) and that there are loyal users which did not receive a discount. This is because they had exceptional sales numbers. Verify this:
```r
data |>
  dplyr::mutate(id = dplyr::row_number(), .before = 1) |>
  dplyr::filter(discount == 0) |>
  dplyr::arrange(desc(sales)) |>
  dplyr::slice_head(n = 10) |>
  gt::gt() |>
  gt::tab_header(title = "Sales: loyal customers vs others") |>
  gtExtras::gt_theme_espn()
```
The loyal customers are the top ones in terms of sales. This is good news: it means that the definition of loyal customers is consistent with the data.
To give a sense of the orders of magnitude of the sales, the data scientist provided a summary statistics table:
```r
gtExtras::gt_plt_summary(data)
```
To have a better glimpse of the data, the data scientist also provided a histogram of the sales:
```r
data |>
  ggplot(aes(x = sales)) +
  geom_histogram(aes(y = after_stat(density)), bins = 30, colour = 1, fill = "white") +
  geom_density(lwd = 1, colour = 4, fill = 4, alpha = 0.25) +
  labs(title = "Sales Distribution") +
  theme_minimal()
```
You’re done and ready to submit your work! Save, stage, commit, and push all remaining changes. You can use the commit message “Done with Lab 1!” , and make sure you have committed and pushed all changed files to GitHub (your Git pane in RStudio should be empty) and that all documents are updated in your repo on GitHub.
I will pull (copy) everyone’s repository submissions at 5:00pm on the Sunday following class, and I will work only with these copies, so anything submitted after 5:00pm will not be graded. (don’t forget to commit and then push your work!)
Grading
Total points available: 30 points.
| Component | Points |
|---|---|
| Ex 1 - 10 | 30 |
Resources for additional practice (optional)
- Chapter 2 (“Get Started”) of Data Visualization by Kieran Healy
- Chapter 3 (“Data visualization”) of R for Data Science by Hadley Wickham
- RStudio Cloud Primers
- Visualization Basics: https://rstudio.cloud/learn/primers/1.1
- Work with Data: https://rstudio.cloud/learn/primers/2
- Visualize Data: https://rstudio.cloud/learn/primers/3