Lab 1 - Tidy Data Wrangling

BSMM 8740 Fall 2024

Author

Add your name here

Introduction

This lab will go through many of the same operations we’ve demonstrated in class. The main goal is to reinforce our understanding of tidy data, the tidyverse and the pipe, which we will be using throughout the course.

As the labs progress, you are encouraged to explore beyond what the labs require; a willingness to experiment will make you a much better programmer. Before we get to that stage, however, you need to build some basic fluency in R and the tidyverse. Today we begin with exercises in the fundamental building blocks of R and RStudio: the interface, reading in data, and basic commands.

Learning goals

By the end of the lab, you will…

  • Be familiar with the workflow using R, RStudio, Git, and GitHub
  • Have practiced version control using GitHub

Getting started

  • To complete the lab, log on to your github account and then go to the class GitHub organization and find the 2024-lab-1-[your github username] repository .

    Create an R project using your 2024-lab-1-[your github username] repository (remember to create a PAT, etc.) and add your answers by editing the 2024-lab-1.qmd file in your repository.

  • When you are done, be sure to: save your document, stage, commit and push your work.

Important

To access Github from the lab, you will need to make sure you are logged in as follows:

  • username: .\daladmin
  • password: Business507!

Remember to (create a PAT and set your git credentials)

  • create your PAT using usethis::create_github_token() ,
  • store your PAT with gitcreds::gitcreds_set() ,
  • set your username and email with
    • usethis::use_git_config( user.name = ___, user.email = ___)

Packages

Note

The code below will install (if necessary) and load the packages needed in today’s exercises

# check if 'librarian' is installed and if not, install it
if (! "librarian" %in% rownames(installed.packages()) ){
  install.packages("librarian")
}
  
# load packages if not already loaded
librarian::shelf(tidyverse, Lahman, magrittr, gt, gtExtras, ggplot2, skimr
                 , dplyr, gt, gtExtras, here, readr, rmarkdown)
theme_set(theme_bw(base_size = 18) + theme(legend.position = "top"))

Now, today’s lab

The tidyverse is a meta-package. When you load it you get eight packages loaded for you:

  • ggplot2: for data visualization
  • dplyr: for data wrangling
  • tidyr: for data tidying and rectangling
  • readr: for reading and writing data
  • tibble: for modern, tidy data frames
  • stringr: for string manipulation
  • forcats: for dealing with factors
  • purrr: for iteration with functional programming

The message that’s printed when you load the package tells you which versions of these packages are loaded as well as any conflicts they may have introduced, e.g., the filter() function from dplyr has now masked (overwritten) the filter() function available in base R (and that’s ok, we’ll use dplyr::filter() anyway).

We’ll be using functionality from all of these packages throughout the semester, though we’ll always load them all at once with library(tidyverse). You can find out more about the tidyverse and each of the packages that make it up here.

Data: Yearly statistics and standings for baseball teams

Today’s data is all baseball statistics. The data is in the Lahman package.

View the data

Before doing any analysis, you may want to get quick view of the data. This is useful when you’ve imported data to see if your data imported correctly. We can use the view() function to see the entire data set in RStudio. Type the code below in the Console to view the entire dataset.

dim(Teams)

Data dictionary

The variable definitions are found in the help for Teams

?Teams

Exercises

Write all code and narrative in your R Markdown file where indicated. Write all narrative in complete sentences. Throughout the assignment, you should periodically render your Quarto document to ensure that all code executes and that your document format is intact, save, stage & commit the changes in the Git pane, and push the updated files to your repository. This ensures that your work is saved.

Tip

Make sure we can read all of the code in your quarto document. This means you will need to break up long lines of code. One way to help avoid long lines of code is is start a new line after every pipe (%>% or |>) and plus sign (+).

Exercise 1

The view() function helps us get a quick view of the dataset, but let’s get more detail about its structure. Viewing a summary of the data is a useful starting point for data analysis, especially if the dataset has a large number of observations (rows) or variables (columns). Run the code below to use the dplyr::glimpse() function to see a summary of the ikea dataset.

dplyr::glimpse(Teams[1:10,])
YOUR ANSWER:

How many observations are in the Teams dataset? How many variables/measurements?

dplyr::glimpse(Teams[1:10,])

How many character columns/measurements have missing variables?

# run a data exploration here using the skimr package.
Teams |> skimr::skim()
Note

In your lab-1.qmd document you’ll see that we already added the code required for most exercises as well as a sentence where you can fill in the blanks to report the answer.

Also note that the code chunk as a label: glimpse-data. It’s not required, but it is good practice and highly encouraged to label your code chunks in this way. If there is an error when you render your document, the code-chunk label will identify where the error is.

Exercise 2

Ben Baumer worked for the New York Mets from 2004 to 2012. What was the team W/L record during those years? Use filter() and select() to quickly identify only those pieces of information that we care about.

YOUR ANSWER:
# fill in the blanks and ensure your code produces the correct result
mets <- Teams  %>% 
  dplyr::filter(teamID == "NYN")
my_mets <- mets %>% 
  dplyr::filter(_)
my_mets %>% 
  dplyr::select(_,_,_,_)

Exercise 3

We’ve answered the simple question of how the Mets performed during the time that Ben was there, but since we are data scientists, we are interested in deeper questions. For example, some of these seasons were subpar—the Mets had more losses than wins. Did the team just get unlucky in those seasons? Or did they actually play as badly as their record indicates?

In order to answer this question, we need a model for expected winning percentage. It turns out that one of the most widely used contributions to the field of baseball analytics (courtesy of Bill James) is exactly that. This model translates the number of runs4 that a team scores and allows over the course of an entire season into an expectation for how many games they should have won. The simplest version of this model is this:

Ŵpct=11+(RARS)2 \hat{\text{W}}_{\text{pct}}=\frac{1}{1+\left(\frac{\text{RA}}{\text{RS}}\right)^{2}}

where RA\text{RA} is the number of runs the team allows to be scored, RS\text{RS} is the number of runs that the team scores, and Ŵpct\hat{\text{W}}_{\text{pct}} is the team’s expected winning percentage. Luckily for us, the runs scored and allowed are present in the Teams table, so let’s grab them and save them in a new data frame.

YOUR ANSWER:
# fill in the blanks and ensure your code produces the correct result
mets_ben <- Teams |>
  dplyr::select(_, _, _, _, _, _) |>
  dplyr::filter(_ == "NYN" & _ %in% 2004:2012)
mets_ben

First, note that the runs-scored variable is called R in the Teams table, but to stick with our notation we want to rename it RS.

mets_ben <- mets_ben |>
  dplyr::rename(_ = _)    # new name = old name
mets_ben

This is a good place to save, stage, commit, and push changes to your remote lab-1 repository. Click the checkbox next to each file in the Git pane to stage the updates you’ve made, write an informative commit message (e.g., “Completed exercises 1 - 3”), and push. After you push the changes, the Git pane in RStudio should be empty.

Exercise 4

Next, we need to compute the team’s actual winning percentage in each of these seasons. Thus, we need to add a new column to our data frame, and we do this with the mutate() command.

YOUR ANSWER:
# add the calculation using the formula above
mets_ben <- mets_ben |>
  dplyr::mutate(WPct = _)
mets_ben

The expected number of wins is then equal to the product of the expected winning percentage times the number of games.

# fill in the blanks and ensure your code produces the correct result
mets_ben <- mets_ben |>
  dplyr::mutate(W_hat = _)
mets_ben

Exercise 5

In this case, the Mets’ fortunes were better than expected in three of these seasons, and worse than expected in the other six.

In how many seasons were the Mets better than expected? How many were they worse than expected?

YOUR ANSWER:
# repeat the better_than/ worse_than calculation for all the years in the dataset: show your work.

This is a good place to save, stage, commit, and push changes to your remote lab-1 repo. Click the checkbox next to each file in the Git pane to stage the updates you’ve made, write an informative commit message (e.g., “Completed exercises 4 and 5”), and push. After you push the changes, the Git pane in RStudio should be empty.

Exercise 6

Naturally, the Mets experienced ups and downs during Ben’s time with the team. Which seasons were best? To figure this out, we can simply sort the rows of the data frame.

YOUR ANSWER:
dplyr::arrange(mets_ben, _)

Exercise 7

In 2006, the Mets had the best record in baseball during the regular season and nearly made the World Series. How do these seasons rank in terms of the team’s performance relative to our model?

YOUR ANSWER:
mets_ben %>% 
  dplyr::mutate(Diff = _) |>
  dplyr::arrange(_)

Summarize the Mets performance:

# fill in the blanks and ensure your code produces the correct result
mets_ben |>
  dplyr::summarize(
    num_years = _, 
    total_W = _, 
    total_L = _, 
    total_WPct = _, 
    sum_resid = _
  )

In these nine years, the Mets had a combined record of ? wins and ? losses, for an overall winning percentage of _?.

This is a good place to save, stage, commit, and push changes to your remote lab-1 repo. Click the checkbox next to each file in the Git pane to stage the updates you’ve made, write an informative commit message (e.g., “Completed exercises 6 - 8”), and push. After you push the changes, the Git pane in RStudio should be empty.

Exercise 8

Discretize the years into three chunks: one for each of the three general managers under whom Ben worked. Jim Duquette was the Mets’ general manager in 2004, Omar Minaya from 2005 to 2010, and Sandy Alderson from 2011 to 2012.

YOUR ANSWER:
mets_ben <- mets_ben %>% 
  dplyr::mutate(
    gm = ifelse(
      yearID == _, 
      _, 
      ifelse(
        yearID >= _, 
        _, 
        _)
    )
  )

Alternatively, we use the case_when function

mets_ben <- mets_ben |>
  dplyr::mutate(
    gm = dplyr::case_when(
      yearID == _ ~ _, 
      yearID >= _ ~ _, 
      TRUE ~ _
    )
  )

Exercise 9

The following dataset is the basis of a model that predicts which businesses are likely to have customer churn at the start of 2015, based on the business type and incorporation_date. This question will give you some practice using the various tidyr:: pivot operations.

set.seed(42)

# read data and drop column 1 (it contains row numbers and doesn't have a column name)
df <- readr::read_csv("data/monthly_data.csv", show_col_types = FALSE, col_select = -1)

# Have a glimpse of the data
glimpse(df)
Rows: 902
Columns: 27
$ company_id            <dbl> 1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13, 14, 15, …
$ `2014-01-01_payments` <dbl> 0, 1, 6, 8, 0, 2, 3, 0, 0, 0, 0, 0, 0, 4, 43, 34…
$ `2014-02-01_payments` <dbl> 0, 1, 6, 4, 0, 2, 3, 0, 0, 0, 2, 0, 11, 1, 51, 5…
$ `2014-03-01_payments` <dbl> 0, 3, 6, 7, 39, 1, 1, 6, 0, 0, 0, 8, 0, 1, 44, 4…
$ `2014-04-01_payments` <dbl> 1, 2, 6, 7, 0, 3, 7, 50, 1, 0, 0, 2, 0, 0, 47, 5…
$ `2014-05-01_payments` <dbl> 0, 2, 6, 1, 54, 1, 4, 119, 0, 1, 0, 0, 0, 5, 46,…
$ `2014-06-01_payments` <dbl> 1, 1, 7, 2, 0, 2, 1, 151, 3, 0, 0, 0, 0, 2, 81, …
$ `2014-07-01_payments` <dbl> 0, 1, 8, 2, 0, 2, 7, 182, 0, 0, 0, 3, 0, 0, 91, …
$ `2014-08-01_payments` <dbl> 0, 1, 7, 4, 22, 1, 2, 167, 0, 0, 0, 2, 0, 0, 93,…
$ `2014-09-01_payments` <dbl> 0, 1, 8, 3, 0, 1, 5, 180, 0, 0, 0, 0, 9, 5, 88, …
$ `2014-10-01_payments` <dbl> 0, 4, 8, 5, 0, 2, 8, 157, 1, 0, 0, 0, 2, 1, 86, …
$ `2014-11-01_payments` <dbl> 0, 3, 9, 5, 0, 1, 2, 105, 0, 0, 0, 0, 0, 0, 93, …
$ `2014-12-01_payments` <dbl> 0, 3, 9, 9, 0, 3, 8, 57, 0, 0, 0, 0, 0, 0, 104, …
$ `2014-01-01_mandates` <dbl> 1, 0, 0, 0, 4, 0, 0, 0, 4, 4, 0, 0, 22, 0, 1, 6,…
$ `2014-02-01_mandates` <dbl> 2, 0, 0, 0, 31, 1, 0, 2, 3, 8, 0, 0, 20, 12, 0, …
$ `2014-03-01_mandates` <dbl> 2, 0, 0, 0, 24, 0, 0, 0, 5, 19, 0, 0, 11, 17, 0,…
$ `2014-04-01_mandates` <dbl> 1, 0, 0, 53, 18, 0, 1, 0, 0, 0, 0, 1, 11, 14, 1,…
$ `2014-05-01_mandates` <dbl> 0, 0, 2, 0, 8, 0, 0, 0, 0, 0, 0, 1, 15, 0, 0, 3,…
$ `2014-06-01_mandates` <dbl> 0, 0, 2, 0, 7, 0, 0, 0, 0, 0, 0, 0, 13, 5, 0, 0,…
$ `2014-07-01_mandates` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 13, 16, 0, 0…
$ `2014-08-01_mandates` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 14, 8, 0, 0,…
$ `2014-09-01_mandates` <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 16, 2, 1, 1,…
$ `2014-10-01_mandates` <dbl> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 16, 1, 0, 2,…
$ `2014-11-01_mandates` <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 6, 0, 1, 0, …
$ `2014-12-01_mandates` <dbl> 0, 0, 0, 0, 0, 0, 3, 0, 0, 0, 0, 0, 11, 0, 0, 0,…
$ vertical              <chr> "gym/fitness", "gym/fitness", "freelance develop…
$ incorporation_date    <date> 2013-05-30, 2003-09-25, 2008-10-22, 2005-06-28,…

You’ll notice that the data is not in tidy form: one type of measurement in each column and all measurements that go together in the same row. To put this data into tidy from you’ll need to

  1. Take all the columns with names that start with a date string and use tidyr::pivot_longer to create a column named date containing the original column names and a column named quantity to contain the values.
  2. In the date column (which contains strings at this stage), split the date from the remainder of the string, saving the remainder in a column called paymentMandate, and the date string in a column called date. To do this use tidyr::separate_wider_delim with delim = "_" and the specified names.
  3. Next, use tidyr::pivot_wider with names_from = paymentMandate, and values_from = quantity to create payments and mandates columns.
  4. Finally, use dplyr::mutate to change the type of the data and incorporation_date columns to Date.
YOUR ANSWER:

What is the mean of the sales_per_visit columns/measurement? Are there any grouped observations?

# show your steps and the resulting tibble here:

Your code should produce a tibble that looks like this:

# A tibble: 10,824 × 6
   company_id vertical    incorporation_date date       payments mandates
        <int> <chr>       <date>             <date>        <dbl>    <dbl>
 1          1 gym/fitness 2013-05-30         2014-01-01        0        1
 2          1 gym/fitness 2013-05-30         2014-02-01        0        2
 3          1 gym/fitness 2013-05-30         2014-03-01        0        2
 4          1 gym/fitness 2013-05-30         2014-04-01        1        1
 5          1 gym/fitness 2013-05-30         2014-05-01        0        0
 6          1 gym/fitness 2013-05-30         2014-06-01        1        0
 7          1 gym/fitness 2013-05-30         2014-07-01        0        0
 8          1 gym/fitness 2013-05-30         2014-08-01        0        0
 9          1 gym/fitness 2013-05-30         2014-09-01        0        0
10          1 gym/fitness 2013-05-30         2014-10-01        0        0
# ℹ 10,814 more rows

Exercise 10

The Business Problem

The story begins in a fast paced startup. The company is growing fast and the marketing team is looking for ways to increase the sales from existing customers by making them buy more. The main idea is to unlock the potential of the customer base through incentives, in this case a discount. We of course want to measure the effect of the discount on the customer’s behavior. Still, they do not want to waste money giving discounts to users which are not valuable. As always, it is about return on investment (ROI).

Without going into specifics about the nature of the discount, it has been designed to provide a positive return on investment if the customer buys more than $1\$ 1 as a result of the discount. How can we measure the effect of the discount and make sure our experiment has a positive ROI? The marketing team came up with the following strategy:

  • Select a sample of existing customers from the same cohort.
  • Set a test window of 1 month.
  • Look into the historical data of web visits from the last month. The hypothesis is that web visits are a good proxy for the customer’s interest in the product.
  • For customers with a high number of web visits, send them a discount. There will be a hold out group which will not receive the discount within the potential valuable customers based on the number of web visits. For customers with a low number of web visits, do not send them a discount (the marketing team wants to report a positive ROI, so they do not want to waste money on customers which are not valuable). Still, they want to use them to measure the effect of the discount.
  • We also want to use the results of the test to tag loyal customers. These are customers which got a discount (since they showed potential interest in the product) and customers with exceptional sales numbers even if they did not get a discount. The idea is to use this information to target them in the future if the discount strategy is positive.

The Data

The team collected data from the experiment above and asked the data science team to analyze it and provide insights. In particular they want to know if they should keep the discount strategy. The data consists of the following fields: - visits: Number of visits to the website during the test window. - discount: Whether the customer received a discount or not. - is_loyal: Whether the customer is loyal or not according to the definition above. - sales: Sales in $\$ during the test window.

Prepare Notebook

Data scientist A was the one in charge of preparing the environment and collecting the data. As an important best practice, they fixed a global seed which initializes the random number generator in order to make sure every part of the analysis was reproducible. This ensures that the calculations are not affected by pure randomness. In addition all the required packages were listed from the start (reproducible R environment).

set.seed(8740)
Read Data

They pulled the data from a csv file and displayed the first 5 measurements.

data <- readr::read_csv("data/sales_dag.csv", show_col_types = FALSE)

data |> dplyr::slice_head(n=5) |> 
  gt::gt() |> 
  gt::tab_header(title = "sample marketing data") |> 
  gtExtras::gt_theme_espn()
sample marketing data
visits discount is_loyal sales sales_per_visit
12 0 0 13.34830 1.1123585
26 1 1 21.70125 0.8346635
13 0 0 14.70040 1.1308004
24 0 0 20.37734 0.8490557
14 0 0 12.63372 0.9024089

They then checked for missing values and whether the measurements were in the correct format.

YOUR ANSWER:

What is the mean of the sales_per_visit columns/measurement? Are there any grouped observations?

# run a data exploration here using the skimr package.
data |> skimr::skim()
Exploratory Data Analysis

As part of the project scope, the data science team in charge of the analysis was asked to provide a summary of the data. The team was also asked to provide a visualization of the data to help the marketing team understand the data better. Data scientist A took over this task.

They started by looking at the share of customers which received a discount:

YOUR ANSWER:
# calculate the % share of customers receiving a discount vs the % not receiving a discount

Similarly for the share of customers which are loyal:

# calculate the % share of customers that are 'loyal' vs not 'loyal'

To understand these features better, they also looked at a cross-tab table:

# build a cross-tab table of 'loyal' customers vs customers getting a discount

Note that all customers with discount are loyal (as required) and that there are loyal users which did not receive a discount. This is because they had exceptional sales numbers. Verify this:

data |> dplyr::mutate(id = dplyr::row_number(), .before = 1) |> 
  dplyr::filter(discount == 0) |> 
  dplyr::arrange( desc(sales) ) |> 
  dplyr::slice_head(n=10) |> 
  gt::gt() |> 
  gt::tab_header(title = "Sales: loyal customers vs others") |> 
  gtExtras::gt_theme_espn()

The loyal customers are the top ones in terms of sales. This is good news. It means that the definition of loyal customers is consistent with the data.

In order to have orders of magnitude for the sales, the data scientist provided some summary statistics table:

gtExtras::gt_plt_summary(data)

To have a better glimpse of the data, the data scientist also provided a histogram of the sales:

data |> 
  ggplot(aes(x=sales)) +
  geom_histogram(aes(y = ..density..), bins = 30, colour = 1, fill = "white") +
  geom_density(lwd = 1, colour = 4, fill = 4, alpha = 0.25) +
  labs(title = "Sales Distribution") +
  theme_minimal()

You’re done and ready to submit your work! Save, stage, commit, and push all remaining changes. You can use the commit message “Done with Lab 1!” , and make sure you have committed and pushed all changed files to GitHub (your Git pane in RStudio should be empty) and that all documents are updated in your repo on GitHub.

Submission

I will pull (copy) everyone’s repository submissions at 5:00pm on the Sunday following class, and I will work only with these copies, so anything submitted after 5:00pm will not be graded. (don’t forget to commit and then push your work!)

Grading

Total points available: 30 points.

Component Points
Ex 1 - 10 30

Resources for additional practice (optional)