Lab 3 - Regression

BSMM 8740 Fall 2024

Author

Add your name here

Introduction

In today’s lab, you’ll explore several data sets and practice building and evaluating regression models.

Learning goals

By the end of the lab you will…

  • Be able to use different regression models to predict a response/target/outcome as a function of a set of variates.

Getting started

  • To complete the lab, log in to your GitHub account, go to the class GitHub organization, and find the 2024-lab-3-[your github username] repository.

    Create an R project using your 2024-lab-3-[your github username] repository (remember to create a PAT, etc., as in lab-1) and add your answers by editing the 2024-lab-3.qmd file in your repository.

  • When you are done, be sure to save your document, stage, commit and push your work.

Important

To access GitHub from the lab, make sure you are logged in as follows:

  • username: .\daladmin
  • password: Business507!

Remember to create a PAT and set your git credentials:

  • create your PAT using usethis::create_github_token(),
  • store your PAT with gitcreds::gitcreds_set(),
  • set your username and email with
    • usethis::use_git_config(user.name = ___, user.email = ___)

Packages

We will use the following packages in today’s lab.

# check if 'librarian' is installed and if not, install it
if (! "librarian" %in% rownames(installed.packages()) ){
  install.packages("librarian")
}
  
# load packages if not already loaded
librarian::shelf(
  tidyverse, magrittr, gt, gtExtras, tidymodels, DataExplorer, skimr, janitor, ggplot2, knitr,
  ISLR2, stats, xgboost
)
theme_set(theme_bw(base_size = 12))

Data: Boston House Values

The Boston House Values dataset (usually referred to as the Boston dataset) appears in several R packages in different versions and is based on economic studies published in the late 1970s.

This dataset contains the following information for each town:

variable  description
--------  -----------
crim      per capita crime rate by town.
zn        proportion of residential land zoned for lots over 25,000 sq.ft.
indus     proportion of non-retail business acres per town.
chas      Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
nox       nitrogen oxides concentration (parts per 10 million).
rm        average number of rooms per dwelling.
age       proportion of owner-occupied units built prior to 1940.
dis       weighted mean of distances to five Boston employment centres.
rad       index of accessibility to radial highways.
tax       full-value property-tax rate per $10,000.
ptratio   pupil-teacher ratio by town.
lstat     lower status of the population (percent).
medv      median value of owner-occupied homes in $1000s.

Use the code below to load the Boston data set.

boston <- ISLR2::Boston

Exercises

Exercise 1

Plot the median value of owner-occupied homes (medv) vs the percentage of houses with lower socioeconomic status (lstat), then use lm to model medv ~ lstat, saving the result in a variable for use later.

Next, prepare a summary of the model. What are the intercept and the coefficient of lstat in this model?
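
A minimal sketch of one possible approach (assuming the boston data frame loaded above; the model name mdl is an arbitrary choice):

# scatter plot of medv vs lstat
boston %>% ggplot(aes(x = lstat, y = medv)) + geom_point()

# linear model, saved for use in later exercises
mdl <- lm(medv ~ lstat, data = boston)
summary(mdl)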

YOUR ANSWER:
# PLEASE SHOW YOUR WORK
# plot medv vs lstat
# create a linear model of medv vs lstat and save the model

Exercise 2

Using the result from Exercise 1 and the data below, use the predict function (stats::predict.lm, or just predict) with the argument interval = "confidence" to prepare a summary table with columns lstat, fit, lwr, upr.

You can use stats::predict.lm directly with the data below.

Or consider creating a nested column using dplyr::mutate along with purrr::map, with first argument lstat and second argument a function you create. The last operation is to unnest the nested column with tidyr::unnest(__).


tibble(lstat = c(5, 10, 15, 20))

Finally, use your model to plot some performance checks using the performance::check_model function with arguments check=c("linearity","qq","homogeneity", "outliers").
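
A sketch of both steps, assuming mdl is the model saved in Exercise 1:

# confidence intervals at the new lstat values
new_pts <- tibble::tibble(lstat = c(5, 10, 15, 20))
predict(mdl, new_pts, interval = "confidence") %>% 
  tibble::as_tibble() %>%           # matrix with columns fit, lwr, upr
  dplyr::bind_cols(new_pts, .)      # columns: lstat, fit, lwr, upr

# diagnostic plots
performance::check_model(mdl, check = c("linearity", "qq", "homogeneity", "outliers"))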

Are there any overly influential observations in this dataset?

YOUR ANSWER:
# PLEASE SHOW YOUR WORK



# Are there any overly influential observations in this dataset?

Exercise 3

Fit the variable medv (median value of owner-occupied homes) to all predictors in the dataset and use the performance::check_collinearity function on the resulting model to check if any predictors are redundant.

The variance inflation factor (VIF) is a measure of the magnitude of multicollinearity of model terms. A VIF of less than 5 indicates low correlation of that predictor with the other predictors, a value between 5 and 10 indicates moderate correlation, and VIF values larger than 10 are a sign of high, non-tolerable correlation among model predictors.
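
A sketch of the two steps (the model name mdl_full is arbitrary):

# regress medv on all other columns of boston
mdl_full <- lm(medv ~ ., data = boston)

# VIF for each predictor
performance::check_collinearity(mdl_full)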

Which predictors in this dataset might be redundant for predicting medv?

YOUR ANSWER:
# PLEASE SHOW YOUR WORK



# Which predictors in this dataset might be redundant for predicting `medv`?

This is a good place to save, commit, and push changes to your remote lab repo. Click the checkbox next to each file in the Git pane to stage the updates you’ve made, write an informative commit message, and push. After you push the changes, the Git pane in RStudio should be empty.

Exercise 4

In this exercise you will compare and interpret the results of linear regression on two similar datasets.

The first dataset (dat0, generated below) has demand0 and price0 variables, along with an unobserved variable (unobserved0, which is not in our dataset) that depends on price0 and demand0 but does not affect their values. Use lm to build a model to predict demand0 from price0. Plot the data, including the fitted intercept and slope. What is the slope of the demand curve in dataset dat0?

Tip

plot with something like:

dat0 %>% ggplot(aes(x=price0,y=demand0)) + 
         # plot the points
         geom_point() +
         # add a straight line to the plot
         geom_abline(
          data = ?? a table with the coefficient estimates ??
            , aes(intercept = `(Intercept)`, slope = price0)
            , colour = "red"
         )
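
One way to build the coefficient table used by geom_abline above (a sketch; mdl0 is assumed to be your saved lm fit of demand0 on price0):

# one-row table with columns `(Intercept)` and `price0`
mdl0 %>% coef() %>% t() %>% tibble::as_tibble()

The dat0 data is generated as follows: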
N <- 500
set.seed(1966)

dat0 <- tibble::tibble(
  price0 = 10 + rnorm(N)
  , demand0 = 30 - (price0 + rnorm(N))
  , unobserved0 = 0.45*price0 + 0.77*demand0 + rnorm(N)
)

The second dataset (dat1, generated below) has demand1 and price1 variables, along with a variable unobserved1 that is completely random and is not observed, so it isn’t in our dataset; here unobserved1 affects both price1 and demand1. Use lm to build a model to predict demand1 from price1. Plot the data, including the fitted intercept and slope. What is the slope of the demand curve in dataset dat1?

set.seed(1966)

dat1 <- tibble::tibble(
  unobserved1 = rnorm(N)
  , price1 = 10 + unobserved1 + rnorm(N)
  , demand1 = 23 - (0.5*price1 + unobserved1 + rnorm(N))
)

Which linear model returns the (approximately) correct dependence of demand on price, as given in the data generation process?

YOUR ANSWER:
# PLEASE SHOW YOUR WORK



# Which linear model returns the (approximately) correct dependence of demand on price, as given in the data generation process?

Exercise 5

Now repeat the modeling of Exercise 4, but assume that the formerly unobservable variables are now observable and so can be included in the linear regression models.
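
A sketch for dat1 (the dat0 model is analogous):

# include the formerly unobserved variable as a second predictor
lm(demand1 ~ price1 + unobserved1, data = dat1) %>% summary()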

Which model returns the (approximately) correct dependence of demand on price, as given in the data generation process?

What can you conclude from these two exercises?

YOUR ANSWER:
# PLEASE SHOW YOUR WORK



# What can you conclude from these two exercises?

Exercise 6

For the next several exercises, we’ll work with a new dataset. This dataset is taken from an EPA site on fuel economy, in particular the fuel economy dataset for 2023.

Use the code below to load the FE Guide data set.

dat <- 
  readxl::read_xlsx("data/2023 FE Guide for DOE-release dates before 7-28-2023.xlsx")

From the raw data in dat, we’ll make a smaller dataset, and we’ll need to do some cleaning to make it usable.

First, select the columns "Comb FE (Guide) - Conventional Fuel", "Eng Displ", "# Cyl", "Transmission", "# Gears", "Air Aspiration Method Desc", "Regen Braking Type Desc", "Batt Energy Capacity (Amp-hrs)", "Drive Desc", "Fuel Usage Desc - Conventional Fuel", "Cyl Deact?", and "Var Valve Lift?", and then clean the column names using janitor::clean_names(). Assign the revised data to the variable cars_23.

Perform a quick check of the data using DataExplorer::introduce() and DataExplorer::plot_missing(), and then modify the data as follows (a sketch of all of these steps appears after the list):

  • mutate the columns comb_fe_guide_conventional_fuel, number_cyl, and number_gears to ensure that they contain integer values, not doubles.
  • use tidyr::replace_na to replace any missing values in the batt_energy_capacity_amp_hrs column with zeros, and replace any missing values in regen_braking_type_desc with empty strings ("").
  • finally, mutate the columns transmission, air_aspiration_method_desc, regen_braking_type_desc, drive_desc, fuel_usage_desc_conventional_fuel, cyl_deact, and var_valve_lift so their values are factors.
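
A sketch of these steps (one possible shape, not the only one; the snake_case column names assume janitor::clean_names() output):

# select the needed columns and clean the names
cars_23 <- dat %>% 
  dplyr::select(
    `Comb FE (Guide) - Conventional Fuel`, `Eng Displ`, `# Cyl`, Transmission,
    `# Gears`, `Air Aspiration Method Desc`, `Regen Braking Type Desc`,
    `Batt Energy Capacity (Amp-hrs)`, `Drive Desc`,
    `Fuel Usage Desc - Conventional Fuel`, `Cyl Deact?`, `Var Valve Lift?`
  ) %>% 
  janitor::clean_names()

# quick structural checks
DataExplorer::introduce(cars_23)
DataExplorer::plot_missing(cars_23)

# type fixes, missing-value handling, and factor conversion
cars_23 <- cars_23 %>% 
  dplyr::mutate(
    dplyr::across(c(comb_fe_guide_conventional_fuel, number_cyl, number_gears), as.integer)
  ) %>% 
  tidyr::replace_na(list(batt_energy_capacity_amp_hrs = 0, regen_braking_type_desc = "")) %>% 
  dplyr::mutate(
    dplyr::across(
      c(transmission, air_aspiration_method_desc, regen_braking_type_desc, drive_desc,
        fuel_usage_desc_conventional_fuel, cyl_deact, var_valve_lift),
      as.factor
    )
  )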

Prepare a recipe to pre-process cars_23 ahead of modelling, using comb_fe_guide_conventional_fuel as the outcome, with the following steps (a sketch follows the list).

  • Centering for: recipes::all_numeric()
  • Scaling for: recipes::all_numeric()
  • Dummy variables from: recipes::all_factor()
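
A sketch of the recipe (the name cars_23_rec is arbitrary; it is reused in the Exercise 7 sketch):

cars_23_rec <- recipes::recipe(comb_fe_guide_conventional_fuel ~ ., data = cars_23) %>% 
  recipes::step_center(recipes::all_numeric()) %>% 
  recipes::step_scale(recipes::all_numeric()) %>% 
  recipes::step_dummy(recipes::all_factor())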

How many predictor variables are there in cars_23?

YOUR ANSWER:
# PLEASE SHOW YOUR WORK



# How many predictor variables are there in `cars_23` ?

Exercise 7

For this exercise, set a sample size equal to 75% of the observations of cars_23 and split the data as follows:

set.seed(1966)

# sample 75% of the rows of the cars_23 dataset to make the training set
train <- cars_23 %>% 
  # make an ID column for use as a key
  tibble::rowid_to_column("ID") %>% 
  # sample the rows
  dplyr::sample_frac(0.75)

# remove the training dataset from the original dataset to make the test set
test  <- 
  dplyr::anti_join(
    cars_23 %>% tibble::rowid_to_column("ID") # add a key column to the original data
    , train
    , by = 'ID'
  )

# drop the ID column from training and test datasets
train %<>% dplyr::select(-ID); test %<>% dplyr::select(-ID)

Next, prep the recipe created in the last exercise using recipes::prep on the training data, then use the result of the prep step with recipes::bake on both the training and test data. Save the baked data in separate variables for use later.
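
A sketch of the prep/bake steps; the names cars_23_train and cars_23_test match those used in the Exercise 8 code, and cars_23_rec is assumed from the Exercise 6 sketch:

# estimate the pre-processing parameters on the training data only
prepped <- recipes::prep(cars_23_rec, training = train)

# apply the same transformations to both splits
cars_23_train <- recipes::bake(prepped, new_data = train)
cars_23_test  <- recipes::bake(prepped, new_data = test)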

After these two steps, how many columns are in the data? Why does this differ from the last step?

YOUR ANSWER:
# PLEASE SHOW YOUR WORK



# After these two steps how many columns are in the data? Why does this differ from the last step?

This is a good place to render, commit, and push changes to your remote lab repo on GitHub. Click the checkbox next to each file in the Git pane to stage the updates you’ve made, write an informative commit message, and push. After you push the changes, the Git pane in RStudio should be empty.

Exercise 8

In this exercise we will use xgboost::xgboost to fit and evaluate the regression.

First, fit the model with un-tuned values of the hyperparameters max_depth and eta, using the training data per the code below:

untuned_xgb <-
  xgboost::xgboost(
    data = cars_23_train %>% dplyr::select(-comb_fe_guide_conventional_fuel) %>% as.matrix(), 
    label = cars_23_train %>% dplyr::select(comb_fe_guide_conventional_fuel) %>% as.matrix(),
    nrounds = 1000,
    objective = "reg:squarederror",
    early_stopping_rounds = 3,
    max_depth = 6,
    eta = .25
    , verbose = FALSE
  )

Next use the fitted model to predict the outcome using the test data:

# create predictions using the test data and the fitted model
yhat <- predict(
  untuned_xgb
  , cars_23_test %>% 
    dplyr::select(-comb_fe_guide_conventional_fuel) %>% 
    as.matrix() 
)

Finally, pull out the comb_fe_guide_conventional_fuel column from the test data, assign it to the variable y, and then use caret::postResample with arguments yhat and y to evaluate how well the model fits.
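
A sketch of the evaluation step (yhat is the prediction vector from above):

# observed outcome from the (baked) test data
y <- cars_23_test %>% dplyr::pull(comb_fe_guide_conventional_fuel)

# RMSE, R-squared, and MAE for the predictions
caret::postResample(yhat, y)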

What is the RMSE for the un-tuned model?

YOUR ANSWER:
# PLEASE SHOW YOUR WORK



# What is the RMSE for the un-tuned model?

Exercise 9

In this exercise we are going to tune the model using cross validation. First we create a tuning grid for the parameters and then fit the model for all the values in the grid, saving the results.

Finally, we select the best parameters by lowest test RMSE. Note: this code will take a while to run.

# create hyperparameter grid
hyper_grid <- expand.grid(max_depth = seq(3, 6, 1), eta = seq(.2, .35, .01))  

# initialize our metric variables
xgb_train_rmse <- NULL
xgb_test_rmse  <- NULL

for (j in 1:nrow(hyper_grid)) {
  set.seed(123)
  m_xgb_untuned <- xgboost::xgb.cv(
    data = cars_23_train %>% dplyr::select(-comb_fe_guide_conventional_fuel) %>% as.matrix(), 
    label = cars_23_train %>% dplyr::select(comb_fe_guide_conventional_fuel) %>% as.matrix(),
    nrounds = 1000,
    objective = "reg:squarederror",
    early_stopping_rounds = 3,
    nfold = 5,
    max_depth = hyper_grid$max_depth[j],
    eta = hyper_grid$eta[j],
    verbose = FALSE
  )
  
  xgb_train_rmse[j] <- m_xgb_untuned$evaluation_log$train_rmse_mean[m_xgb_untuned$best_iteration]
  xgb_test_rmse[j] <- m_xgb_untuned$evaluation_log$test_rmse_mean[m_xgb_untuned$best_iteration]
}    

best <- hyper_grid[which(xgb_test_rmse == min(xgb_test_rmse)),]; best # there may be ties

Re-run the code from the last exercise and evaluate the fit using the best tuning parameters (i.e., when re-running the regression, set max_depth and eta to one pair of the best-fit parameters; there may be ties).

Is the tuned model better than the un-tuned model? If it is better, by how much (in %) has the RMSE improved?

YOUR ANSWER:
# PLEASE SHOW YOUR WORK



# Is the tuned model better than the un-tuned model? If it is better, by how much (in %) has the RMSE improved?

Exercise 10

Using xgboost::xgb.importance, rank the importance of each predictor in the model. Finally, take the top 10 predictors by importance and plot them using xgboost::xgb.plot.importance.
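
A sketch, assuming the tuned model from Exercise 9 is saved as tuned_xgb (the name is arbitrary):

# importance of each predictor in the fitted model
imp <- xgboost::xgb.importance(model = tuned_xgb)

# plot the top 10 predictors by importance
xgboost::xgb.plot.importance(imp, top_n = 10)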

Per this model, what is the most important feature for predicting fuel efficiency?

YOUR ANSWER:
# PLEASE SHOW YOUR WORK



# What is the most important feature for predicting fuel efficiency?

You’re done and ready to submit your work! Save, stage, commit, and push all remaining changes. You can use the commit message “Done with Lab 3!”, and make sure you have committed and pushed all changed files to GitHub (your Git pane in RStudio should be empty) and that all documents are updated in your repo on GitHub.

Submission

I will pull (copy) everyone’s submissions at 5:00pm on the Sunday following class, and I will work only with these copies, so anything submitted after 5:00pm will not be graded. (Don’t forget to commit and then push your work by 5:00pm on Sunday!)

Grading

Total points available: 30 points.

Component    Points
Ex 1 - 10    30