Model Selection Criteria: AIC & BIC

Note

The following supplemental notes were created by Dr. Maria Tackett for STA 210. They are provided for students who want to dive deeper into the mathematics behind regression and reflect some of the material covered in STA 211: Mathematics of Regression. Additional supplemental notes will be added throughout the semester.

This document discusses some of the mathematical details of Akaike's Information Criterion (AIC) and Schwarz's Bayesian Information Criterion (BIC). We assume the reader is familiar with the matrix form of multiple linear regression. Please see Matrix Notation for Multiple Linear Regression for a review.

Maximum Likelihood Estimation of $\boldsymbol{\beta}$ and $\sigma^2$

To understand the formulas for AIC and BIC, we will first briefly explain the likelihood function and maximum likelihood estimates for regression.

Let $\mathbf{Y}$ be the $n \times 1$ matrix of responses, $\mathbf{X}$ the $n \times (p+1)$ matrix of predictors, and $\boldsymbol{\beta}$ the $(p+1) \times 1$ matrix of coefficients. If the multiple linear regression model is correct, then

$$\mathbf{Y} \sim N(\mathbf{X}\boldsymbol{\beta}, \sigma^2) \qquad(1)$$
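To make the notation concrete, here is a brief sketch of generating data from this model. It is written in Python/numpy rather than R, and the specific values of `n`, `p`, `beta`, and `sigma` are illustrative assumptions, not from the notes.

```python
# A minimal sketch (not from these notes) of simulating responses from the
# model in Equation 1; the parameter values below are made up.
import numpy as np

rng = np.random.default_rng(seed=210)

n, p = 100, 2                                    # sample size and number of predictors
beta = np.array([1.0, 2.0, -0.5])                # (p + 1) coefficients, intercept first
sigma = 1.5                                      # error standard deviation

X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])   # n x (p + 1) design matrix
Y = X @ beta + rng.normal(scale=sigma, size=n)                # Y ~ N(X beta, sigma^2)
```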

When we do linear regression, our goal is to estimate the unknown parameters $\boldsymbol{\beta}$ and $\sigma^2$ from Equation 1. In Matrix Notation for Multiple Linear Regression, we showed a way to estimate these parameters using matrix algebra. Another approach for estimating $\boldsymbol{\beta}$ and $\sigma^2$ is maximum likelihood estimation.

A likelihood function is used to summarize the evidence from the data in support of each possible value of a model parameter. Using Equation 1, we can write the likelihood function for linear regression as

$$L(\mathbf{X}, \mathbf{Y}|\boldsymbol{\beta}, \sigma^2) = \prod\limits_{i=1}^n (2\pi \sigma^2)^{-\frac{1}{2}} \exp\bigg\{-\frac{1}{2\sigma^2}(Y_i - \mathbf{X}_i \boldsymbol{\beta})^T(Y_i - \mathbf{X}_i \boldsymbol{\beta})\bigg\} \qquad(2)$$

where $Y_i$ is the $i^{th}$ response and $\mathbf{X}_i$ is the vector of predictors for the $i^{th}$ observation. One approach to estimating $\boldsymbol{\beta}$ and $\sigma^2$ is to find the values of those parameters that maximize the likelihood in Equation 2, i.e., maximum likelihood estimation. To make the calculations more manageable, we will maximize the logarithm of the likelihood function, i.e., the log-likelihood function, instead of the likelihood itself. Because the logarithm is a monotone increasing function, the values of the parameters that maximize the log-likelihood function are the same values that maximize the likelihood function. The log-likelihood function we will maximize is

$$\begin{aligned} \log L(\mathbf{X}, \mathbf{Y}|\boldsymbol{\beta}, \sigma^2) &= \sum\limits_{i=1}^n -\frac{1}{2}\log(2\pi\sigma^2) -\frac{1}{2\sigma^2}\sum\limits_{i=1}^n(Y_i - \mathbf{X}_i \boldsymbol{\beta})^T(Y_i - \mathbf{X}_i \boldsymbol{\beta}) \\ &= -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}(\mathbf{Y} - \mathbf{X} \boldsymbol{\beta})^T(\mathbf{Y} - \mathbf{X} \boldsymbol{\beta}) \end{aligned} \qquad(3)$$
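Equation 3 is also straightforward to evaluate in code. The following sketch (illustrative Python with assumed names `X`, `Y`, `beta`, and `sigma2`, not from the notes) computes the matrix form on the second line of Equation 3.

```python
# A sketch (assumed names, not from the notes): evaluating the log-likelihood
# in Equation 3 at a given beta and sigma^2.
import numpy as np

def log_likelihood(X, Y, beta, sigma2):
    """Log-likelihood of the linear regression model evaluated at (beta, sigma2)."""
    n = len(Y)
    resid = Y - X @ beta                      # residual vector Y - X beta
    return -n / 2 * np.log(2 * np.pi * sigma2) - (resid @ resid) / (2 * sigma2)
```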

The maximum likelihood estimates of $\boldsymbol{\beta}$ and $\sigma^2$ are
$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y} \hspace{1cm} \hat{\sigma}^2 = \frac{1}{n}(\mathbf{Y} - \mathbf{X} \hat{\boldsymbol{\beta}})^T(\mathbf{Y} - \mathbf{X} \hat{\boldsymbol{\beta}}) = \frac{1}{n}RSS \qquad(4)$$

where $RSS$ is the residual sum of squares. Note that the maximum likelihood estimate of $\sigma^2$ is not exactly equal to the estimate we typically use, $\frac{RSS}{n-p-1}$. This is because the maximum likelihood estimate of $\sigma^2$ in Equation 4 is a biased estimator of $\sigma^2$. When $n$ is much larger than the number of predictors $p$, the difference between these two estimates is trivial.
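A short code sketch of Equation 4 may help. This is illustrative Python (the function name and return values are assumptions, not the notes' code); it computes $\hat{\boldsymbol{\beta}}$, the maximum likelihood estimate $RSS/n$, and the usual unbiased estimate $RSS/(n-p-1)$ side by side.

```python
# A sketch (illustrative, not from the notes): the maximum likelihood estimates
# in Equation 4, plus the usual unbiased estimate of sigma^2 for comparison.
import numpy as np

def mle_estimates(X, Y):
    """Return (beta_hat, sigma2_mle, sigma2_unbiased) for a linear model."""
    n, p_plus_1 = X.shape
    # beta_hat = (X^T X)^{-1} X^T Y, computed with a linear solve for stability
    beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
    resid = Y - X @ beta_hat
    rss = resid @ resid
    sigma2_mle = rss / n                       # biased MLE: RSS / n
    sigma2_unbiased = rss / (n - p_plus_1)     # usual estimate: RSS / (n - p - 1)
    return beta_hat, sigma2_mle, sigma2_unbiased
```

For $n$ much larger than $p$, the two variance estimates returned by this function are nearly identical, which is the point made above.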

AIC

Akaike's Information Criterion (AIC) is a model selection criterion developed by Hirotugu Akaike that aims to estimate the relative quality of different models while penalizing for model complexity. Here is Akaike's original paper introducing the AIC concept, A New Look at the Statistical Model Identification. The purpose of AIC is to find a model that maximizes the likelihood of the data while taking into account the number of parameters used. The formula for AIC is as follows:

$$AIC = -2 \log L + 2(p+1) \qquad(5)$$

where $\log L$ is the log-likelihood, which measures how well the model fits the data. The term $p+1$ represents the number of parameters in the model, including the intercept and the coefficient for each predictor. This is the general form of AIC that can be applied to a variety of models, but for now, let's focus on AIC for multiple linear regression.

$$\begin{aligned} AIC &= -2 \log L + 2(p+1) \\ &= -2\bigg[-\frac{n}{2}\log(2\pi\hat{\sigma}^2) - \frac{1}{2\hat{\sigma}^2}(\mathbf{Y} - \mathbf{X} \hat{\boldsymbol{\beta}})^T(\mathbf{Y} - \mathbf{X} \hat{\boldsymbol{\beta}})\bigg] + 2(p+1) \\ &= n\log\Big(2\pi\frac{RSS}{n}\Big) + \frac{1}{RSS/n}RSS + 2(p+1) \\ &= n\log(2\pi) + n\log(RSS) - n\log(n) + n + 2(p+1) \end{aligned} \qquad(6)$$
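The terms $n\log(2\pi)$ and $n$ are the same for every model fit to the same response, so they do not affect which model has the smallest AIC. As a numerical check on Equation 6, the sketch below (illustrative Python with assumed names, not the notes' code) computes AIC both from the log-likelihood definition in Equation 5 and from the RSS-based closed form; the two values agree.

```python
# A sketch (assumed names, not from the notes): computing AIC two ways for a
# linear regression fit and checking that Equation 5 and Equation 6 agree.
import numpy as np

def aic_linear_regression(X, Y):
    n, p_plus_1 = X.shape
    beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
    resid = Y - X @ beta_hat
    rss = resid @ resid
    sigma2_mle = rss / n

    # -2 log L evaluated at the maximum likelihood estimates (Equation 5)
    log_lik = -n / 2 * np.log(2 * np.pi * sigma2_mle) - rss / (2 * sigma2_mle)
    aic_from_loglik = -2 * log_lik + 2 * p_plus_1

    # Closed form from Equation 6
    aic_closed_form = (n * np.log(2 * np.pi) + n * np.log(rss)
                       - n * np.log(n) + n + 2 * p_plus_1)

    return aic_from_loglik, aic_closed_form    # equal up to floating point error
```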

BIC

Similar to AIC, the Bayesian Information Criterion (BIC) is another model selection criterion that considers both model fit and complexity. BIC is based on Bayesian principles and imposes a stronger penalty for model complexity than AIC. Gideon Schwarz's foundational paper on BIC, "Estimating the Dimension of a Model," was published in 1978. The formula for BIC is as follows:

$$BIC = -2 \log L + (p+1) \log n \qquad(7)$$

In the formula, the terms $\log L$ and $p+1$ have the same meaning as in AIC. Additionally, the term $\log n$ is the logarithm of the sample size $n$. The $\log n$ term in BIC introduces a stronger penalty for model complexity than AIC, since the penalty on each parameter grows with the sample size.

The main difference between AIC and BIC lies in the penalty term for model complexity. While AIC penalizes each additional parameter by a constant 2, BIC's per-parameter penalty of $\log n$ grows with the sample size; once $n > e^2 \approx 7.4$, we have $\log n > 2$, so BIC imposes the more pronounced penalty in essentially any realistic sample. Therefore, BIC tends to favor simpler models than AIC, promoting a more parsimonious approach to model selection.
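To make the comparison concrete, here is a small illustrative sketch (Python, assumed names, not from the notes) that computes AIC and BIC for the same ordinary least squares fit, so the only difference between the two values is the penalty term.

```python
# A sketch (illustrative, not from the notes): AIC and BIC for the same linear
# regression fit, so only the penalty terms differ.
import numpy as np

def aic_bic(X, Y):
    """Return (AIC, BIC) for an ordinary least squares fit of Y on X."""
    n, p_plus_1 = X.shape
    beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)
    resid = Y - X @ beta_hat
    rss = resid @ resid
    sigma2_mle = rss / n
    minus_2_log_lik = n * np.log(2 * np.pi * sigma2_mle) + n   # -2 log L at the MLEs
    aic = minus_2_log_lik + 2 * p_plus_1                       # Equation 5
    bic = minus_2_log_lik + p_plus_1 * np.log(n)               # Equation 7
    return aic, bic
```

For example, with $n = 100$ the BIC penalty per parameter is $\log(100) \approx 4.6$ compared to 2 for AIC, so a weak predictor that only slightly improves the likelihood is more likely to be rejected by BIC than by AIC.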