if (! "librarian" %in% rownames(installed.packages()) ){
install.packages("librarian")
}
# load packages if not already loaded
::shelf(
librarian
tidyverse, magrittr, gt, gtExtras, tidymodels, DataExplorer, skimr, janitor, ggplot2, knitr, ISLR2, stats, xgboost, see
)theme_set(theme_bw(base_size = 12))
Lab 3 - Regression
SOLUTIONS
Introduction
In today’s lab, you’ll explore several data sets and practice building and evaluating regression models.
Learning goals
By the end of the lab you will…
- Be able to use different regression models to predict a response/target/outcome as a function of a set of variates.
Packages
We will use the packages loaded by the setup code at the top of this document in today’s lab.
Data: Boston House Values
The Boston House Values dataset (usually referred to as the Boston dataset) appears in several R packages in different versions and is based on economic studies published in the late 1970s.
This dataset contains the following information for each census tract:
variable | description |
---|---|
crim | per capita crime rate by town. |
zn | proportion of residential land zoned for lots over 25,000 sq.ft. |
indus | proportion of non-retail business acres per town. |
chas | Charles River dummy variable (= 1 if tract bounds river; 0 otherwise). |
nox | nitrogen oxides concentration (parts per 10 million). |
rm | average number of rooms per dwelling. |
age | proportion of owner-occupied units built prior to 1940. |
dis | weighted mean of distances to five Boston employment centres. |
rad | index of accessibility to radial highways. |
tax | full-value property-tax rate per $10,000. |
ptratio | pupil-teacher ratio by town. |
lstat | lower status of the population (percent). |
medv | median value of owner-occupied homes in $1000s. |
Use the code below to load the Boston data set.
boston <- ISLR2::Boston
Exercises
Exercise 1
Plot the median value of owner-occupied homes (medv) vs the percentage of houses with lower socioeconomic status (lstat), then use lm to model medv ~ lstat and save the result in a variable for use later.
Next prepare a summary of the model. What is the intercept and the coefficient of lstat in this model?
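A minimal sketch of one possible approach, assuming the boston data frame loaded above (the name lm_lstat is just an illustration):

```r
# scatter plot of medv against lstat
ggplot(boston, aes(x = lstat, y = medv)) +
  geom_point()

# simple linear regression of medv on lstat, saved for later exercises
lm_lstat <- lm(medv ~ lstat, data = boston)

# the summary shows the intercept and the lstat coefficient
summary(lm_lstat)
```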
Exercise 2
Using the result from Exercise 1, and the data below, use the predict function (stats::predict.lm or just predict) with the argument interval = "confidence" to prepare a summary table with columns lstat, fit, lwr, upr.
tibble(lstat = c(5, 10, 15, 20))
Finally, use your model to plot some performance checks using the performance::check_model function with arguments check = c("linearity", "qq", "homogeneity", "outliers").
Are there any overly influential observations in this dataset?
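A sketch of these steps, assuming the model object lm_lstat from the sketch in Exercise 1 and that the performance package is installed:

```r
# confidence intervals for predictions at new lstat values
new_data <- tibble(lstat = c(5, 10, 15, 20))
preds <- predict(lm_lstat, newdata = new_data, interval = "confidence")

# summary table with columns lstat, fit, lwr, upr
dplyr::bind_cols(new_data, tibble::as_tibble(preds))

# visual model checks
performance::check_model(
  lm_lstat,
  check = c("linearity", "qq", "homogeneity", "outliers")
)
```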
Exercise 3
Fit medv to all predictors in the dataset and use the performance::check_collinearity function on the resulting model to check if any predictors are redundant.
The variance inflation factor (VIF) is a measure of the magnitude of multicollinearity of model terms. A VIF below 5 indicates low correlation of that predictor with the other predictors, a value between 5 and 10 indicates moderate correlation, and VIF values above 10 signal high, no longer tolerable correlation among model predictors.
Which predictors in this dataset might be redundant for predicting medv?
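One possible sketch (the model name lm_all is illustrative):

```r
# regress medv on all remaining predictors
lm_all <- lm(medv ~ ., data = boston)

# variance inflation factors for each model term
performance::check_collinearity(lm_all)
```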
Exercise 4
In this exercise you will compare and interpret the results of linear regression on two similar datasets.
The first dataset (dat0, generated below) has demand0 and price0 variables along with an unobserved variable (unobserved0, so not in our dataset) that doesn’t change the values of demand0 and price0. Use lm to build a model to predict demand0 from price0. Plot the data, including intercept and slope. What is the slope of the demand curve in dataset dat0?
N <- 500
set.seed(1966)
dat0 <- tibble::tibble(
  price0 = 10 + rnorm(500)
  , demand0 = 30 - (price0 + rnorm(500))
  , unobserved0 = 0.45*price0 + 0.77*demand0 + rnorm(500)
)
The second dataset (dat1, generated below) has demand1 and price1 variables, along with a variable unobserved1 that is completely random and is not observed, so it isn’t in our dataset. Use lm to build a model to predict demand1 from price1. Plot the data, including intercept and slope. What is the slope of the demand curve in dataset dat1?
set.seed(1966)
dat1 <- tibble::tibble(
  unobserved1 = rnorm(500)
  , price1 = 10 + unobserved1 + rnorm(500)
  , demand1 = 23 - (0.5*price1 + unobserved1 + rnorm(500))
)
Which linear model returns the (approximately) correct dependence of demand on price, as given in the data generation process?
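A sketch of the two fits and plots (fit0 and fit1 are illustrative names; the geom_abline overlay is one way to show the fitted intercept and slope):

```r
# demand0 as a function of price0
fit0 <- lm(demand0 ~ price0, data = dat0)
coef(fit0)

ggplot(dat0, aes(x = price0, y = demand0)) +
  geom_point() +
  geom_abline(intercept = coef(fit0)[1], slope = coef(fit0)[2], colour = "red")

# demand1 as a function of price1
fit1 <- lm(demand1 ~ price1, data = dat1)
coef(fit1)

ggplot(dat1, aes(x = price1, y = demand1)) +
  geom_point() +
  geom_abline(intercept = coef(fit1)[1], slope = coef(fit1)[2], colour = "red")
```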
Exercise 5
Now repeat the modeling of exercise 4, but assuming that the formerly unobservable variables are now observable, and so can be included in the linear regression models.
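A sketch under the assumption that the unobserved columns are simply added as predictors to the models from Exercise 4:

```r
# same data, but now including the formerly unobserved variables
fit0_full <- lm(demand0 ~ price0 + unobserved0, data = dat0)
fit1_full <- lm(demand1 ~ price1 + unobserved1, data = dat1)

coef(fit0_full)
coef(fit1_full)
```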
Which model returns the (approximately) correct dependence of demand on price, as given in the data generation process?
What can you conclude from these two exercises?
Exercise 6
For the next several exercises, we’ll work with a new dataset. This dataset is taken from an EPA site on fuel economy, in particular the fuel economy dataset for 2023.
Use the code below to load the FE Guide data set.
dat <-
  readxl::read_xlsx("data/2023 FE Guide for DOE-release dates before 7-28-2023.xlsx")
From the raw data in dat, we’ll make a smaller dataset, and we’ll need to do some cleaning to make it usable.
First select the columns "Comb FE (Guide) - Conventional Fuel", "Eng Displ", "# Cyl", "Transmission", "# Gears", "Air Aspiration Method Desc", "Regen Braking Type Desc", "Batt Energy Capacity (Amp-hrs)", "Drive Desc", "Fuel Usage Desc - Conventional Fuel", "Cyl Deact?", and "Var Valve Lift?", and then clean the column names using janitor::clean_names(). Assign the revised data to the variable cars_23.
Perform a quick check of the data using DataExplorer::introduce() and DataExplorer::plot_missing(), and modify the data as follows (a sketch of these steps appears after the list):
- mutate the columns comb_fe_guide_conventional_fuel, number_cyl, and number_gears to ensure that they contain integer values, not doubles.
- use tidyr::replace_na to replace any missing values in the batt_energy_capacity_amp_hrs column with zeros, and replace any missing values in regen_braking_type_desc with empty strings ("").
- finally, mutate the columns transmission, air_aspiration_method_desc, regen_braking_type_desc, drive_desc, fuel_usage_desc_conventional_fuel, cyl_deact, and var_valve_lift so their values are factors.
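A sketch of these selection and cleaning steps (the cleaned column names below are assumed to match the ones referenced in the instructions above, as produced by janitor::clean_names()):

```r
cars_23 <- dat %>%
  dplyr::select(
    `Comb FE (Guide) - Conventional Fuel`, `Eng Displ`, `# Cyl`, Transmission,
    `# Gears`, `Air Aspiration Method Desc`, `Regen Braking Type Desc`,
    `Batt Energy Capacity (Amp-hrs)`, `Drive Desc`,
    `Fuel Usage Desc - Conventional Fuel`, `Cyl Deact?`, `Var Valve Lift?`
  ) %>%
  janitor::clean_names()

# quick structural and missing-value checks
DataExplorer::introduce(cars_23)
DataExplorer::plot_missing(cars_23)

cars_23 <- cars_23 %>%
  # integer-valued columns
  dplyr::mutate(dplyr::across(
    c(comb_fe_guide_conventional_fuel, number_cyl, number_gears), as.integer
  )) %>%
  # fill missing battery capacity with 0 and missing regen braking type with ""
  tidyr::replace_na(list(batt_energy_capacity_amp_hrs = 0, regen_braking_type_desc = "")) %>%
  # convert the categorical columns to factors
  dplyr::mutate(dplyr::across(
    c(transmission, air_aspiration_method_desc, regen_braking_type_desc, drive_desc,
      fuel_usage_desc_conventional_fuel, cyl_deact, var_valve_lift),
    as.factor
  ))
```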
Prepare a recipe to pre-process cars_23 ahead of modelling, using comb_fe_guide_conventional_fuel as the outcome, with the following steps (see the sketch after the list):
- Centering for: recipes::all_numeric()
- Scaling for: recipes::all_numeric()
- Dummy variables from: recipes::all_factor()
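One way the recipe might be sketched (the object name cars_23_rec is illustrative):

```r
cars_23_rec <- recipes::recipe(comb_fe_guide_conventional_fuel ~ ., data = cars_23) %>%
  recipes::step_center(recipes::all_numeric()) %>%
  recipes::step_scale(recipes::all_numeric()) %>%
  recipes::step_dummy(recipes::all_factor())

summary(cars_23_rec)
```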
How many predictor variables are there in cars_23?
Exercise 7
For this exercise, set a sample size equal to 75% of the observations of cars_23 and split the data as follows:
set.seed(1966)
# sample 75% of the rows of the cars_23 dataset to make the training set
train <- cars_23 %>%
  # make an ID column for use as a key
  tibble::rowid_to_column("ID") %>%
  # sample the rows
  dplyr::sample_frac(0.75)

# remove the training rows from the original dataset to make the test set
test <-
  dplyr::anti_join(
    cars_23 %>% tibble::rowid_to_column("ID") # add a key column to the original data
    , train
    , by = 'ID'
  )

# drop the ID column from training and test datasets
train %<>% dplyr::select(-ID); test %<>% dplyr::select(-ID)
Next prep the recipe created in the last exercise using recipes::prep on the training data, and then use the result of the prep step to recipes::bake the training and test data. Save the baked data in separate variables for use later.
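A sketch of the prep and bake steps; the names cars_23_train and cars_23_test match the variables used in the later exercises, and cars_23_rec is the recipe sketched in Exercise 6:

```r
# estimate the pre-processing parameters from the training data only
prepped_rec <- recipes::prep(cars_23_rec, training = train)

# apply the same transformations to both splits
cars_23_train <- recipes::bake(prepped_rec, new_data = train)
cars_23_test  <- recipes::bake(prepped_rec, new_data = test)

ncol(cars_23_train)
```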
After these two steps how many columns are in the data? Why does this differ from the last step?
Exercise 8
In this exercise we will run xgboost::xgboost to evaluate the regression.
First, fit the model with default meta-parameters for max_depth and eta, using the training data per the code below:
untuned_xgb <-
  xgboost::xgboost(
    data = cars_23_train %>% dplyr::select(-comb_fe_guide_conventional_fuel) %>% as.matrix(),
    label = cars_23_train %>% dplyr::select(comb_fe_guide_conventional_fuel) %>% as.matrix(),
    nrounds = 1000,
    objective = "reg:squarederror",
    early_stopping_rounds = 3,
    max_depth = 6,
    eta = .25,
    verbose = FALSE
  )
Next use the fitted model to predict the outcome using the test data:
# create predictions using the test data and the fitted model
yhat <- predict(
  untuned_xgb
  , cars_23_test %>%
      dplyr::select(-comb_fe_guide_conventional_fuel) %>%
      as.matrix()
)
Finally, pull out the comb_fe_guide_conventional_fuel column from the test data, assign it to the variable y, and then use caret::postResample with arguments yhat and y to evaluate how well the model fits.
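A sketch of this evaluation step (note that caret is not in the librarian::shelf() call above, so it is assumed to be installed):

```r
# observed outcomes from the test data
y <- cars_23_test %>%
  dplyr::pull(comb_fe_guide_conventional_fuel)

# RMSE, R-squared, and MAE for the un-tuned model
caret::postResample(yhat, y)
```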
What is the RMSE for the un-tuned model?
Exercise 9
In this exercise we are going to tune the model using cross validation. First we create a tuning grid for the parameters and then fit the model for all the values in the grid, saving the results.
Finally, we select the best parameters by least RMSE.
# create hyperparameter grid
hyper_grid <- expand.grid(max_depth = seq(3, 6, 1), eta = seq(.2, .35, .01))

# initialize our metric variables
xgb_train_rmse <- NULL
xgb_test_rmse <- NULL

for (j in 1:nrow(hyper_grid)) {
  set.seed(123)
  m_xgb_untuned <- xgboost::xgb.cv(
    data = cars_23_train %>% dplyr::select(-comb_fe_guide_conventional_fuel) %>% as.matrix(),
    label = cars_23_train %>% dplyr::select(comb_fe_guide_conventional_fuel) %>% as.matrix(),
    nrounds = 1000,
    objective = "reg:squarederror",
    early_stopping_rounds = 3,
    nfold = 5,
    max_depth = hyper_grid$max_depth[j],
    eta = hyper_grid$eta[j],
    verbose = FALSE
  )

  xgb_train_rmse[j] <- m_xgb_untuned$evaluation_log$train_rmse_mean[m_xgb_untuned$best_iteration]
  xgb_test_rmse[j] <- m_xgb_untuned$evaluation_log$test_rmse_mean[m_xgb_untuned$best_iteration]
}

best <- hyper_grid[which(xgb_test_rmse == min(xgb_test_rmse)), ]; best # there may be ties
max_depth eta
50 4 0.32
Re-run the code from the last exercise and evaluate the fit using the best tuning parameters.
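A sketch of re-fitting and re-evaluating with the selected parameters (using the best data frame from the grid search above; the first row is taken in case of ties):

```r
tuned_xgb <- xgboost::xgboost(
  data = cars_23_train %>% dplyr::select(-comb_fe_guide_conventional_fuel) %>% as.matrix(),
  label = cars_23_train %>% dplyr::select(comb_fe_guide_conventional_fuel) %>% as.matrix(),
  nrounds = 1000,
  objective = "reg:squarederror",
  early_stopping_rounds = 3,
  max_depth = best$max_depth[1],
  eta = best$eta[1],
  verbose = FALSE
)

yhat_tuned <- predict(
  tuned_xgb,
  cars_23_test %>% dplyr::select(-comb_fe_guide_conventional_fuel) %>% as.matrix()
)

caret::postResample(yhat_tuned, y)
```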
Is the tuned model better than the un-tuned model? If so, by how much (in %) has the RMSE improved?
Exercise 10
Using xgboost::xgb.importance, rank the importance of each predictor in the model. Finally, take the top 10 predictors by importance and plot them using xgboost::xgb.plot.importance.
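A sketch of the ranking and the plot (assuming the tuned model tuned_xgb from the sketch in Exercise 9):

```r
# gain-based importance of each feature in the fitted booster
importance <- xgboost::xgb.importance(model = tuned_xgb)
importance

# plot the ten most important predictors
xgboost::xgb.plot.importance(importance, top_n = 10)
```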
Per this model, what is the most important feature for predicting fuel efficiency?
Grading
Total points available: 50 points.
Component | Points |
---|---|
Ex 1 - 10 | 30 |