Matrix Notation for Multiple Linear Regression

Note

The following supplemental notes were created by Dr. Maria Tackett for STA 210. They are provided for students who want to dive deeper into the mathematics behind regression and reflect some of the material covered in STA 211: Mathematics of Regression. Additional supplemental notes will be added throughout the semester.

This document provides the details of the matrix notation for multiple linear regression. We assume the reader has some familiarity with linear algebra. Please see Chapter 1 of An Introduction to Statistical Learning for a brief review of linear algebra.

Introduction

Suppose we have $n$ observations. Let the $i^{th}$ observation be $(x_{i1}, \ldots, x_{ip}, y_i)$, such that $x_{i1}, \ldots, x_{ip}$ are the explanatory variables (predictors) and $y_i$ is the response variable. We assume the data can be modeled using the least-squares regression model, such that the mean response for a given combination of explanatory variables follows the form in Equation 1.

$$y = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p \qquad(1)$$

We can write the response for the $i^{th}$ observation as shown in Equation 2

$$y_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip} + \epsilon_i \qquad(2)$$

such that $\epsilon_i$ is the amount $y_i$ deviates from $\mu\{y|x_{i1}, \ldots, x_{ip}\}$, the mean response for a given combination of explanatory variables. We assume each $\epsilon_i \sim N(0, \sigma^2)$, where $\sigma^2$ is a constant variance for the distribution of the response $y$ for any combination of explanatory variables $x_1, \ldots, x_p$.
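To make the model concrete, the minimal sketch below simulates data from Equation 2 in Python with NumPy. The sample size, the coefficient values, and $\sigma$ are hypothetical choices used only for illustration.

```python
import numpy as np

# Simulate n observations from the model in Equation 2 with p = 2
# explanatory variables. The coefficients and sigma are hypothetical
# values chosen only for illustration.
rng = np.random.default_rng(210)
n, p = 100, 2
beta = np.array([1.0, 2.0, -0.5])      # beta_0, beta_1, beta_2
sigma = 1.5                            # standard deviation of the errors

x = rng.uniform(0, 10, size=(n, p))    # x_{i1}, x_{i2}
eps = rng.normal(0, sigma, size=n)     # epsilon_i ~ N(0, sigma^2)
y = beta[0] + x @ beta[1:] + eps       # y_i = beta_0 + beta_1 x_{i1} + beta_2 x_{i2} + epsilon_i
```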

Matrix Representation for the Regression Model

We can represent Equation 1 and Equation 2 using matrix notation. Let

$$\mathbf{Y} = \begin{bmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{bmatrix} \hspace{1cm} \mathbf{X} = \begin{bmatrix} 1 & x_{11} & x_{12} & \dots & x_{1p} \\ 1 & x_{21} & x_{22} & \dots & x_{2p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 1 & x_{n1} & x_{n2} & \dots & x_{np} \end{bmatrix} \hspace{1cm} \boldsymbol{\beta} = \begin{bmatrix} \beta_0 \\ \beta_1 \\ \vdots \\ \beta_p \end{bmatrix} \hspace{1cm} \boldsymbol{\epsilon} = \begin{bmatrix} \epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n \end{bmatrix} \qquad(3)$$

where the first column of $\mathbf{X}$ is a column of ones corresponding to the intercept $\beta_0$.

Thus,

$$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$$

Therefore, the estimated response for a given combination of explanatory variables and the associated residuals can be written as

𝐘̂=𝐗𝛃̂𝐞=𝐘𝐗𝛃̂(4) \hat{\mathbf{Y}} = \mathbf{X}\hat{\boldsymbol{\beta}} \hspace{1cm} \mathbf{e} = \mathbf{Y} - \mathbf{X}\hat{\boldsymbol{\beta}} \qquad(4)

Estimating the Coefficients

The least-squares model is the one that minimizes the sum of the squared residuals. Therefore, we want to find the coefficients $\hat{\boldsymbol{\beta}}$ that minimize

$$\sum\limits_{i=1}^{n} e_{i}^2 = \mathbf{e}^T\mathbf{e} = (\mathbf{Y} - \mathbf{X}\hat{\boldsymbol{\beta}})^T(\mathbf{Y} - \mathbf{X}\hat{\boldsymbol{\beta}}) \qquad(5)$$

where $\mathbf{e}^T$ is the transpose of the matrix $\mathbf{e}$. Expanding the right-hand side of Equation 5 gives

$$(\mathbf{Y} - \mathbf{X}\hat{\boldsymbol{\beta}})^T(\mathbf{Y} - \mathbf{X}\hat{\boldsymbol{\beta}}) = \mathbf{Y}^T\mathbf{Y} - \mathbf{Y}^T\mathbf{X}\hat{\boldsymbol{\beta}} - \hat{\boldsymbol{\beta}}^{T}\mathbf{X}^T\mathbf{Y} + \hat{\boldsymbol{\beta}}^{T}\mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}} \qquad(6)$$

Note that $(\mathbf{Y}^T\mathbf{X}\hat{\boldsymbol{\beta}})^T = \hat{\boldsymbol{\beta}}^{T}\mathbf{X}^T\mathbf{Y}$. Since these are both constants (i.e., $1 \times 1$ matrices), $\mathbf{Y}^T\mathbf{X}\hat{\boldsymbol{\beta}} = \hat{\boldsymbol{\beta}}^{T}\mathbf{X}^T\mathbf{Y}$. Thus, Equation 6 becomes

$$\mathbf{Y}^T\mathbf{Y} - 2\hat{\boldsymbol{\beta}}^{T}\mathbf{X}^T\mathbf{Y} + \hat{\boldsymbol{\beta}}^{T}\mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}}$$

Since we want to find the $\hat{\boldsymbol{\beta}}$ that minimizes Equation 5, we find the value of $\hat{\boldsymbol{\beta}}$ such that the derivative with respect to $\hat{\boldsymbol{\beta}}$ is equal to 0.

$$\begin{aligned} \frac{\partial \mathbf{e}^T\mathbf{e}}{\partial \hat{\boldsymbol{\beta}}} &= \frac{\partial}{\partial \hat{\boldsymbol{\beta}}}\left(\mathbf{Y}^T\mathbf{Y} - 2\hat{\boldsymbol{\beta}}^{T}\mathbf{X}^T\mathbf{Y} + \hat{\boldsymbol{\beta}}^{T}\mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}}\right) = 0 \\ &\Rightarrow -2\mathbf{X}^T\mathbf{Y} + 2\mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}} = 0 \\ &\Rightarrow 2\mathbf{X}^T\mathbf{Y} = 2\mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}} \\ &\Rightarrow \mathbf{X}^T\mathbf{Y} = \mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}} \\ &\Rightarrow (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}} \\ &\Rightarrow (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y} = \mathbf{I}\hat{\boldsymbol{\beta}} \end{aligned} \qquad(7)$$

Thus, the estimate of the model coefficients is $\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}$.
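This closed form can be checked numerically. The sketch below, again with hypothetical simulated data, solves the normal equations $\mathbf{X}^T\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{X}^T\mathbf{Y}$ and compares the result with NumPy's least-squares routine.

```python
import numpy as np

# Hypothetical simulated data, as in the earlier sketches.
rng = np.random.default_rng(210)
n, p = 100, 2
x = rng.uniform(0, 10, size=(n, p))
y = 1.0 + x @ np.array([2.0, -0.5]) + rng.normal(0, 1.5, size=n)
X = np.column_stack([np.ones(n), x])

# Normal equations: (X^T X) beta_hat = X^T Y. Solving the linear system
# is numerically preferable to forming (X^T X)^{-1} explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's built-in least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_hat, beta_lstsq))   # True
```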

Variance-Covariance Matrix of the Coefficients

We will use two properties to derive the form of the variance-covariance matrix of the coefficients:

  1. $E[\boldsymbol{\epsilon}\boldsymbol{\epsilon}^T] = \sigma^2\mathbf{I}$
  2. $\hat{\boldsymbol{\beta}} = \boldsymbol{\beta} + (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol{\epsilon}$

First, we will show that $E[\boldsymbol{\epsilon}\boldsymbol{\epsilon}^T] = \sigma^2\mathbf{I}$.

$$\begin{aligned} E[\boldsymbol{\epsilon}\boldsymbol{\epsilon}^T] &= E\left[\begin{bmatrix}\epsilon_1 \\ \epsilon_2 \\ \vdots \\ \epsilon_n\end{bmatrix}\begin{bmatrix}\epsilon_1 & \epsilon_2 & \dots & \epsilon_n\end{bmatrix}\right] \\ &= E\begin{bmatrix} \epsilon_1^2 & \epsilon_1\epsilon_2 & \dots & \epsilon_1\epsilon_n \\ \epsilon_2\epsilon_1 & \epsilon_2^2 & \dots & \epsilon_2\epsilon_n \\ \vdots & \vdots & \ddots & \vdots \\ \epsilon_n\epsilon_1 & \epsilon_n\epsilon_2 & \dots & \epsilon_n^2 \end{bmatrix} \\ &= \begin{bmatrix} E[\epsilon_1^2] & E[\epsilon_1\epsilon_2] & \dots & E[\epsilon_1\epsilon_n] \\ E[\epsilon_2\epsilon_1] & E[\epsilon_2^2] & \dots & E[\epsilon_2\epsilon_n] \\ \vdots & \vdots & \ddots & \vdots \\ E[\epsilon_n\epsilon_1] & E[\epsilon_n\epsilon_2] & \dots & E[\epsilon_n^2] \end{bmatrix} \end{aligned} \qquad(8)$$

Recall the regression assumption that the errors $\epsilon_i$ are Normally distributed with mean 0 and variance $\sigma^2$. Thus, $E(\epsilon_i^2) = Var(\epsilon_i) = \sigma^2$ for all $i$, since $E(\epsilon_i) = 0$. Additionally, recall the regression assumption that the errors are uncorrelated, i.e., $E(\epsilon_i\epsilon_j) = Cov(\epsilon_i, \epsilon_j) = 0$ for all $i \neq j$. Using these assumptions, we can write Equation 8 as

$$E[\boldsymbol{\epsilon}\boldsymbol{\epsilon}^T] = \begin{bmatrix} \sigma^2 & 0 & \dots & 0 \\ 0 & \sigma^2 & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & \sigma^2 \end{bmatrix} = \sigma^2\mathbf{I} \qquad(9)$$

where $\mathbf{I}$ is the $n \times n$ identity matrix.
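This property can also be illustrated by simulation. The minimal sketch below averages $\boldsymbol{\epsilon}\boldsymbol{\epsilon}^T$ over many independently drawn error vectors; the average should be close to $\sigma^2\mathbf{I}$. The values of $n$, $\sigma$, and the number of replications are hypothetical.

```python
import numpy as np

# Monte Carlo check that the average of epsilon epsilon^T over many
# draws is close to sigma^2 I. The values below are hypothetical.
rng = np.random.default_rng(210)
n, sigma, reps = 4, 1.5, 200_000

eps = rng.normal(0, sigma, size=(reps, n))   # each row is one error vector
avg_outer = eps.T @ eps / reps               # average of the outer products
print(np.round(avg_outer, 2))                # approximately 2.25 * I, since sigma^2 = 2.25
```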

Next, we show that $\hat{\boldsymbol{\beta}} = \boldsymbol{\beta} + (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol{\epsilon}$.

Recall that $\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}$ and $\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}$. Then,

$$\begin{aligned} \hat{\boldsymbol{\beta}} &= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y} \\ &= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T(\mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}) \\ &= \boldsymbol{\beta} + (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol{\epsilon} \end{aligned} \qquad(10)$$

Using these two properties, we derive the form of the variance-covariance matrix for the coefficients. Note that, since $E[\hat{\boldsymbol{\beta}}] = \boldsymbol{\beta}$, the variance-covariance matrix is $E[(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})^T]$.

$$\begin{aligned} E[(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})^T] &= E[(\boldsymbol{\beta} + (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol{\epsilon} - \boldsymbol{\beta})(\boldsymbol{\beta} + (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol{\epsilon} - \boldsymbol{\beta})^T] \\ &= E[(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\boldsymbol{\epsilon}\boldsymbol{\epsilon}^T\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}] \\ &= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T E[\boldsymbol{\epsilon}\boldsymbol{\epsilon}^T]\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1} \\ &= (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T(\sigma^2\mathbf{I})\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1} \\ &= \sigma^2(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1} \\ &= \sigma^2(\mathbf{X}^T\mathbf{X})^{-1} \end{aligned} \qquad(11)$$

Thus, the variance-covariance matrix of the coefficients is $\sigma^2(\mathbf{X}^T\mathbf{X})^{-1}$.
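In practice $\sigma^2$ is unknown and is estimated from the residuals; a standard estimate is $\hat{\sigma}^2 = \mathbf{e}^T\mathbf{e}/(n - p - 1)$. The sketch below, continuing the hypothetical simulated data from the earlier sketches, computes $\hat{\sigma}^2(\mathbf{X}^T\mathbf{X})^{-1}$ and takes the square roots of its diagonal to obtain the estimated standard errors of the coefficients.

```python
import numpy as np

# Hypothetical simulated data, as in the earlier sketches.
rng = np.random.default_rng(210)
n, p = 100, 2
x = rng.uniform(0, 10, size=(n, p))
y = 1.0 + x @ np.array([2.0, -0.5]) + rng.normal(0, 1.5, size=n)
X = np.column_stack([np.ones(n), x])

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
e = y - X @ beta_hat

# Estimate sigma^2 from the residuals, dividing by n - p - 1
# (the degrees of freedom when the model includes an intercept).
sigma2_hat = (e @ e) / (n - p - 1)

# Estimated variance-covariance matrix of the coefficients (Equation 11)
# with sigma^2 replaced by its estimate; the diagonal holds the variances.
vcov = sigma2_hat * np.linalg.inv(X.T @ X)
se = np.sqrt(np.diag(vcov))                  # standard errors of beta_hat
print(se)
```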