Model Diagnostics
The following supplemental notes were created by Dr. Maria Tackett for STA 210. They are provided for students who want to dive deeper into the mathematics behind regression and reflect some of the material covered in STA 211: Mathematics of Regression. Additional supplemental notes will be added throughout the semester.
This document discusses some of the mathematical details of the model diagnostics: leverage, standardized residuals, and Cook’s distance. We assume the reader has knowledge of the matrix form for multiple linear regression. Please see Matrix Form of Linear Regression for a review.
Introduction
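Suppose we have $n$ observations. Let the $i^{th}$ observation be $(x_{i1}, \ldots, x_{ip}, y_i)$, such that $x_{i1}, \ldots, x_{ip}$ are the explanatory variables (predictors) and $y_i$ is the response variable. We assume the data can be modeled using the least-squares regression model, such that the mean response for a given combination of explanatory variables follows the form in Equation 1.
$$\mu\{y \mid x_1, \ldots, x_p\} = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p \tag{1}$$
We can write the response for the $i^{th}$ observation as shown in Equation 2.
$$y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + \epsilon_i \tag{2}$$
such that $\epsilon_i$ is the amount $y_i$ deviates from $\mu\{y \mid x_{i1}, \ldots, x_{ip}\}$, the mean response for a given combination of explanatory variables. We assume each $\epsilon_i \sim N(0, \sigma^2)$, where $\sigma^2$ is a constant variance for the distribution of the response $y$ for any combination of explanatory variables $x_1, \ldots, x_p$.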
Matrix Form for the Regression Model
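We can represent Equation 1 and Equation 2 using matrix notation. Let
$$\mathbf{Y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix}, \quad \mathbf{X} = \begin{bmatrix} 1 & x_{11} & \dots & x_{1p} \\ 1 & x_{21} & \dots & x_{2p} \\ \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & \dots & x_{np} \end{bmatrix}, \quad \boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix}, \quad \boldsymbol{\epsilon} = \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix}$$
Thus,
$$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon} \tag{3}$$
Therefore the estimated response for a given combination of explanatory variables and the associated residuals can be written as
$$\hat{\mathbf{Y}} = \mathbf{X}\hat{\boldsymbol{\beta}}, \qquad \mathbf{e} = \mathbf{Y} - \hat{\mathbf{Y}} \tag{4}$$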
Hat Matrix & Leverage
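Recall from the notes Matrix Form of Linear Regression that $\hat{\boldsymbol{\beta}}$ can be written as the following:
$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y} \tag{5}$$
Combining Equation 4 and Equation 5, we can write $\hat{\mathbf{Y}}$ as the following:
$$\hat{\mathbf{Y}} = \mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y} \tag{6}$$
We define the hat matrix as an $n \times n$ matrix of the form $\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$. Thus Equation 6 becomes
$$\hat{\mathbf{Y}} = \mathbf{H}\mathbf{Y} \tag{7}$$
The diagonal elements of the hat matrix are a measure of how far the predictor variables of each observation are from the means of the predictor variables. For example, $h_{ii}$ is a measure of how far the values of the predictor variables for the $i^{th}$ observation, $x_{i1}, \ldots, x_{ip}$, are from the mean values of the predictor variables, $\bar{x}_1, \ldots, \bar{x}_p$. In the case of simple linear regression, the $i^{th}$ diagonal, $h_{ii}$, can be written as
$$h_{ii} = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{\sum_{j=1}^{n}(x_j - \bar{x})^2}$$
We call these diagonal elements the leverage of each observation.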
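The relationship between the hat matrix, the fitted values, and leverage can be checked numerically. The following is a minimal sketch for simple linear regression using only the Python standard library; the data values and the helper name `hat_matrix` are made up for illustration and are not part of the original notes.

```python
# Sketch: hat matrix and leverage for simple linear regression (p = 1).
# The x and y values are invented for illustration only.

def hat_matrix(x):
    """Return H = X (X^T X)^{-1} X^T for the design matrix X = [1, x]."""
    n = len(x)
    sx = sum(x)
    sxx = sum(v * v for v in x)
    det = n * sxx - sx * sx  # determinant of the 2x2 matrix X^T X
    # Entries of the symmetric inverse (X^T X)^{-1} = [[a, b], [b, d]]
    a, b, d = sxx / det, -sx / det, n / det
    # H[i][j] = [1, x_i] (X^T X)^{-1} [1, x_j]^T
    return [[a + b * (xi + xj) + d * xi * xj for xj in x] for xi in x]


x = [1.0, 2.0, 3.0, 4.0, 10.0]
y = [1.2, 1.9, 3.2, 3.8, 9.5]
n = len(x)

H = hat_matrix(x)

# Fitted values as a linear combination of the observed responses: Y_hat = H Y
y_hat = [sum(H[i][j] * y[j] for j in range(n)) for i in range(n)]

# Leverage: diagonal of H, compared with the simple-regression closed form
leverage = [H[i][i] for i in range(n)]
x_bar = sum(x) / n
ss_x = sum((v - x_bar) ** 2 for v in x)
closed_form = [1 / n + (v - x_bar) ** 2 / ss_x for v in x]

for xi, h, c in zip(x, leverage, closed_form):
    print(f"x = {xi:5.1f}  h_ii = {h:.4f}  closed form = {c:.4f}")
```

In this toy data set the observation at $x = 10$ sits far from the mean of $x$, so it receives by far the largest leverage.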
The diagonal elements of the hat matrix have the following properties:
- $\sum_{i=1}^{n} h_{ii} = p + 1$, where $p$ is the number of predictor variables in the model.
- The mean hat value is $\bar{h} = \frac{\sum_{i=1}^{n} h_{ii}}{n} = \frac{p+1}{n}$.
Using these properties, we consider a point to have high leverage if it has a leverage value that is more than 2 times the average. In other words, observations with leverage greater than $\frac{2(p+1)}{n}$ are considered to be high leverage points, i.e. outliers in the predictor variables. We are interested in flagging high leverage points, because they may have an influence on the regression coefficients.
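As an illustration of the flagging rule, the sketch below computes leverage for a single predictor ($p = 1$) and marks observations whose leverage exceeds twice the mean. The data are invented, and the variable names are only for this example.

```python
# Sketch: flag high-leverage points in simple linear regression.
# Made-up predictor values; the last one is far from the rest.
x = [1.0, 2.0, 3.0, 4.0, 10.0]
n, p = len(x), 1

x_bar = sum(x) / n
ss_x = sum((v - x_bar) ** 2 for v in x)

# Leverage from the simple-regression closed form
leverage = [1 / n + (v - x_bar) ** 2 / ss_x for v in x]

mean_leverage = sum(leverage) / n   # equals (p + 1) / n
threshold = 2 * (p + 1) / n         # twice the mean leverage

# Indices (0-based) of observations exceeding the threshold
high_leverage = [i for i, h in enumerate(leverage) if h > threshold]

print("mean leverage:", mean_leverage)
print("threshold:", threshold)
print("high-leverage indices:", high_leverage)
```

The threshold is a rule of thumb for deciding which observations to inspect more closely, not a formal test.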
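When there are high leverage points in the data, the regression line will tend towards those points; therefore, one property of high leverage points is that they tend to have small residuals. We will show this by rewriting the residuals from Equation 4 using Equation 7.
$$\mathbf{e} = \mathbf{Y} - \hat{\mathbf{Y}} = \mathbf{Y} - \mathbf{H}\mathbf{Y} = (\mathbf{I} - \mathbf{H})\mathbf{Y} \tag{8}$$
Note that the identity matrix and hat matrix are idempotent, i.e. $\mathbf{I}\mathbf{I} = \mathbf{I}$, $\mathbf{H}\mathbf{H} = \mathbf{H}$. Thus, $(\mathbf{I} - \mathbf{H})$ is also idempotent, since $(\mathbf{I} - \mathbf{H})(\mathbf{I} - \mathbf{H}) = \mathbf{I} - 2\mathbf{H} + \mathbf{H}\mathbf{H} = \mathbf{I} - \mathbf{H}$. These matrices are also symmetric. Using these properties and Equation 8, we have that the variance-covariance matrix of the residuals, $\mathbf{e}$, is
$$Var(\mathbf{e}) = (\mathbf{I} - \mathbf{H})\,Var(\mathbf{Y})\,(\mathbf{I} - \mathbf{H})^T = (\mathbf{I} - \mathbf{H})\,\hat{\sigma}^2\mathbf{I}\,(\mathbf{I} - \mathbf{H})^T = \hat{\sigma}^2(\mathbf{I} - \mathbf{H})(\mathbf{I} - \mathbf{H}) = \hat{\sigma}^2(\mathbf{I} - \mathbf{H}) \tag{9}$$
where $\hat{\sigma}^2 = \frac{\sum_{i=1}^{n} e_i^2}{n - p - 1}$ is the estimated regression variance. Thus, the variance of the $i^{th}$ residual is $Var(e_i) = \hat{\sigma}^2(1 - h_{ii})$. Therefore, the higher the leverage, the smaller the variance of the residual. Because the expected value of the residuals is 0, we conclude that points with high leverage tend to have smaller residuals than points with lower leverage.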
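The properties used in this argument can be verified numerically. The sketch below builds $\mathbf{I} - \mathbf{H}$ for a small made-up simple linear regression and checks that it is symmetric and idempotent; the helpers `hat_matrix` and `matmul` are hypothetical names written for this example.

```python
# Sketch: check that M = I - H is symmetric and idempotent, and that
# its diagonal (1 - h_ii) is smallest for the high-leverage point.
# Data are invented for illustration.

def hat_matrix(x):
    """Return H = X (X^T X)^{-1} X^T for the design matrix X = [1, x]."""
    n = len(x)
    sx = sum(x)
    sxx = sum(v * v for v in x)
    det = n * sxx - sx * sx
    a, b, d = sxx / det, -sx / det, n / det
    return [[a + b * (xi + xj) + d * xi * xj for xj in x] for xi in x]


def matmul(A, B):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


x = [1.0, 2.0, 3.0, 4.0, 10.0]
y = [1.2, 1.9, 3.2, 3.8, 9.5]
n = len(x)

H = hat_matrix(x)
M = [[(1.0 if i == j else 0.0) - H[i][j] for j in range(n)] for i in range(n)]

# Idempotent: M M = M.  Symmetric: M^T = M.
MM = matmul(M, M)
max_idem_err = max(abs(MM[i][j] - M[i][j]) for i in range(n) for j in range(n))
max_sym_err = max(abs(M[i][j] - M[j][i]) for i in range(n) for j in range(n))

# Residuals e = (I - H) Y, and the factor multiplying sigma^2 in Var(e_i)
residuals = [sum(M[i][j] * y[j] for j in range(n)) for i in range(n)]
var_factor = [M[i][i] for i in range(n)]  # 1 - h_ii

print("idempotency error:", max_idem_err)
print("symmetry error:", max_sym_err)
print("1 - h_ii:", [round(v, 4) for v in var_factor])
print("residuals:", [round(r, 4) for r in residuals])
```

In this example the observation with the largest leverage has the smallest value of $1 - h_{ii}$, so its residual has the smallest variance.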
Standardized Residuals
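In general, we standardize a value by shifting by the expected value and rescaling by the standard deviation (or standard error). Thus, the $i^{th}$ standardized residual takes the form
$$\text{std.res}_i = \frac{e_i - E(e_i)}{SE(e_i)}$$
The expected value of the residuals is 0, i.e. $E(e_i) = 0$. From Equation 9, the standard error of the residual is $SE(e_i) = \hat{\sigma}\sqrt{1 - h_{ii}}$. Therefore,
$$\text{std.res}_i = \frac{e_i}{\hat{\sigma}\sqrt{1 - h_{ii}}} \tag{10}$$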
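A short numerical sketch of this calculation for simple linear regression follows. It assumes the usual estimate of the regression variance, the sum of squared residuals divided by $n - p - 1$, and uses made-up data.

```python
import math

# Sketch: standardized residuals for simple linear regression (p = 1).
# Data are invented for illustration.
x = [1.0, 2.0, 3.0, 4.0, 10.0]
y = [1.2, 1.9, 3.2, 3.8, 9.5]
n, p = len(x), 1

# Least-squares fit
x_bar, y_bar = sum(x) / n, sum(y) / n
ss_x = sum((v - x_bar) ** 2 for v in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ss_x
b0 = y_bar - b1 * x_bar

# Residuals and leverage
e = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
h = [1 / n + (v - x_bar) ** 2 / ss_x for v in x]

# Assumed estimate of the regression variance: SSE / (n - p - 1)
sigma2_hat = sum(r * r for r in e) / (n - p - 1)
sigma_hat = math.sqrt(sigma2_hat)

# Standardized residual: e_i / (sigma_hat * sqrt(1 - h_ii))
std_res = [r / (sigma_hat * math.sqrt(1 - hi)) for r, hi in zip(e, h)]

for xi, r, s in zip(x, e, std_res):
    print(f"x = {xi:5.1f}  residual = {r:8.4f}  std.res = {s:8.4f}")
```

Dividing by $\sqrt{1 - h_{ii}}$ inflates the residuals of high-leverage points, which puts all residuals on a comparable scale.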
Cook’s Distance
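Cook’s distance is a measure of how much each observation influences the model coefficients, and thus the predicted values. The Cook’s distance for the $i^{th}$ observation can be written as
$$D_i = \frac{(\hat{\mathbf{Y}} - \hat{\mathbf{Y}}_{(i)})^T(\hat{\mathbf{Y}} - \hat{\mathbf{Y}}_{(i)})}{(p+1)\hat{\sigma}^2} \tag{11}$$
where $\hat{\mathbf{Y}}_{(i)}$ is the vector of predicted values from the model fitted when the $i^{th}$ observation is deleted. Cook’s distance can be calculated without deleting observations one at a time, since Equation 12 below is mathematically equivalent to Equation 11.
$$D_i = \frac{1}{p+1}\,\text{std.res}_i^2\left[\frac{h_{ii}}{1 - h_{ii}}\right] = \frac{e_i^2}{(p+1)\hat{\sigma}^2(1 - h_{ii})}\left[\frac{h_{ii}}{1 - h_{ii}}\right] \tag{12}$$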
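The equivalence of the deletion form and the closed form can be checked numerically. The sketch below refits a simple linear regression with each observation removed and compares the result with the leverage-based formula. The data and the helper `fit` are made up for this example, and the regression variance is assumed to be estimated from the full model as the sum of squared residuals divided by $n - p - 1$.

```python
# Sketch: Cook's distance by leave-one-out refitting versus the closed form.
# Data are invented for illustration.

def fit(x, y):
    """Least-squares intercept and slope for simple linear regression."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    ss_x = sum((v - x_bar) ** 2 for v in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / ss_x
    return y_bar - b1 * x_bar, b1


x = [1.0, 2.0, 3.0, 4.0, 10.0]
y = [1.2, 1.9, 3.2, 3.8, 9.5]
n, p = len(x), 1

# Full-model fit, residuals, leverage, and estimated variance
b0, b1 = fit(x, y)
y_hat = [b0 + b1 * v for v in x]
e = [yi - fi for yi, fi in zip(y, y_hat)]
x_bar = sum(x) / n
ss_x = sum((v - x_bar) ** 2 for v in x)
h = [1 / n + (v - x_bar) ** 2 / ss_x for v in x]
sigma2_hat = sum(r * r for r in e) / (n - p - 1)

# Deletion form: refit without observation i, predict all n observations
cooks_deletion = []
for i in range(n):
    a0, a1 = fit(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
    y_hat_i = [a0 + a1 * v for v in x]
    num = sum((f - g) ** 2 for f, g in zip(y_hat, y_hat_i))
    cooks_deletion.append(num / ((p + 1) * sigma2_hat))

# Closed form: e_i^2 * h_ii / ((p + 1) * sigma2_hat * (1 - h_ii)^2)
cooks_closed = [r * r * hi / ((p + 1) * sigma2_hat * (1 - hi) ** 2)
                for r, hi in zip(e, h)]

for d1, d2 in zip(cooks_deletion, cooks_closed):
    print(f"deletion = {d1:.6f}  closed form = {d2:.6f}")
```

The two columns agree up to floating-point error, while the closed form needs only one model fit.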