Model Selection Criteria: AIC & BIC
The following supplemental notes were created by Dr. Maria Tackett for STA 210. They are provided for students who want to dive deeper into the mathematics behind regression and reflect some of the material covered in STA 211: Mathematics of Regression. Additional supplemental notes will be added throughout the semester.
This document discusses some of the mathematical details of Akaike's Information Criterion (AIC) and Schwarz's Bayesian Information Criterion (BIC). We assume the reader has knowledge of the matrix form for multiple linear regression. Please see Matrix Notation for Multiple Linear Regression for a review.
Maximum Likelihood Estimation of $\boldsymbol{\beta}$ and $\sigma^2$
To understand the formulas for AIC and BIC, we will first briefly explain the likelihood function and maximum likelihood estimates for regression.
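Let $\mathbf{Y}$ be the $n \times 1$ matrix of responses, $\mathbf{X}$ the $n \times (p+1)$ matrix of predictors, and $\boldsymbol{\beta}$ the $(p+1) \times 1$ matrix of coefficients. If the multiple linear regression model is correct, then
$$\mathbf{Y} = \mathbf{X}\boldsymbol{\beta} + \boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon} \sim N(\mathbf{0}, \sigma^2\mathbf{I}) \tag{1}$$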
When we do linear regression, our goal is to estimate the unknown parameters $\boldsymbol{\beta}$ and $\sigma^2$ from Equation 1. In Matrix Notation for Multiple Linear Regression, we showed a way to estimate these parameters using matrix algebra. Another approach for estimating $\boldsymbol{\beta}$ and $\sigma^2$ is maximum likelihood estimation.
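A likelihood function is used to summarise the evidence from the data in support of each possible value of a model parameter. Using Equation 1, we will write the likelihood function for linear regression as
$$L(\mathbf{X}, \mathbf{Y} \mid \boldsymbol{\beta}, \sigma^2) = \prod_{i=1}^{n} (2\pi\sigma^2)^{-1/2} \exp\left\{-\frac{1}{2\sigma^2}(Y_i - \mathbf{X}_i\boldsymbol{\beta})^2\right\} \tag{2}$$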
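where $Y_i$ is the response and $\mathbf{X}_i$ is the (row) vector of predictors for the $i^{th}$ observation. One approach to estimating $\boldsymbol{\beta}$ and $\sigma^2$ is to find the values of those parameters that maximize the likelihood in Equation 2, i.e., maximum likelihood estimation. To make the calculations more manageable, we maximize the logarithm of the likelihood, i.e., the log-likelihood function, instead. Because the logarithm is a strictly increasing function, the parameter values that maximize the log-likelihood function are the same values that maximize the likelihood function. The log-likelihood function we will maximize is
$$\log L(\mathbf{X}, \mathbf{Y} \mid \boldsymbol{\beta}, \sigma^2) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(Y_i - \mathbf{X}_i\boldsymbol{\beta})^2 \tag{3}$$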
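A brief sketch of the maximization step, assuming $\mathbf{X}^T\mathbf{X}$ is invertible: writing the log-likelihood in matrix form and setting its partial derivatives with respect to $\boldsymbol{\beta}$ and $\sigma^2$ to zero gives the two conditions the estimates must satisfy.

$$
\begin{aligned}
\log L &= -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}) \\
\frac{\partial \log L}{\partial \boldsymbol{\beta}} &= \frac{1}{\sigma^2}\mathbf{X}^T(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}) = \mathbf{0} \;\Rightarrow\; \mathbf{X}^T\mathbf{X}\boldsymbol{\beta} = \mathbf{X}^T\mathbf{Y} \\
\frac{\partial \log L}{\partial \sigma^2} &= -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta}) = 0 \;\Rightarrow\; n\sigma^2 = (\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})^T(\mathbf{Y} - \mathbf{X}\boldsymbol{\beta})
\end{aligned}
$$

The first condition is the familiar set of normal equations, so the maximum likelihood estimate of $\boldsymbol{\beta}$ coincides with the least squares estimate.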
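The maximum likelihood estimates of $\boldsymbol{\beta}$ and $\sigma^2$ are
$$\hat{\boldsymbol{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{Y}, \qquad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(Y_i - \mathbf{X}_i\hat{\boldsymbol{\beta}})^2 = \frac{RSS}{n} \tag{4}$$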
where $RSS$ is the residual sum of squares. Note that the maximum likelihood estimate $\hat{\sigma}^2$ is not exactly equal to the estimate of $\sigma^2$ we typically use, $\frac{RSS}{n - p - 1}$. This is because the maximum likelihood estimate of $\sigma^2$ in Equation 4 is a biased estimator of $\sigma^2$. When $n$ is much larger than the number of predictors $p$, the difference between these two estimates is trivial.
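The estimates above can be checked numerically. The following is a minimal sketch, assuming NumPy is available and using simulated data; the sample size, coefficients, noise level, and the helper name `log_likelihood` are all hypothetical choices made for illustration. It computes $\hat{\boldsymbol{\beta}}$ by least squares, compares the two variance estimates, and evaluates the log-likelihood at the estimates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (hypothetical values): n observations, p predictors, plus an intercept.
n, p = 200, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # n x (p + 1) design matrix
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=1.5, size=n)


def log_likelihood(beta, sigma2, X, y):
    """Gaussian log-likelihood of the linear model at (beta, sigma2)."""
    resid = y - X @ beta
    m = len(y)
    return -0.5 * m * np.log(2 * np.pi * sigma2) - (resid @ resid) / (2 * sigma2)


# The MLE of beta is the least squares solution of the normal equations.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

rss = float(np.sum((y - X @ beta_hat) ** 2))
sigma2_mle = rss / n                 # maximum likelihood estimate (biased)
sigma2_unbiased = rss / (n - p - 1)  # the estimate typically reported

ll_at_mle = log_likelihood(beta_hat, sigma2_mle, X, y)

print("beta_hat:", beta_hat)
print("sigma2 (MLE):", sigma2_mle)
print("sigma2 (RSS / (n - p - 1)):", sigma2_unbiased)
print("log-likelihood at the MLE:", ll_at_mle)
```

Because $n - p - 1 < n$, the maximum likelihood estimate of $\sigma^2$ is always the smaller of the two, and moving either parameter away from its estimate can only lower the log-likelihood.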
AIC
Akaike's Information Criterion (AIC) is a model selection criterion developed by Hirotugu Akaike that aims to estimate the relative quality of different models while penalizing for model complexity. Akaike introduced the concept in the paper "A New Look at the Statistical Model Identification." The purpose of AIC is to find a model that maximizes the likelihood of the data while taking into account the number of parameters used. The formula for AIC is as follows:
$$AIC = -2\log L + 2k$$
where $\log L$ is the log-likelihood, which measures how well the model fits the data. The term $k$ represents the number of parameters in the model, including the intercept and any additional predictors. Among candidate models fit to the same data, the model with the smallest AIC is preferred. This is the general form of AIC that can be applied to a variety of models, but for now, let's focus on AIC for multiple linear regression.
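For multiple linear regression, substituting the maximum likelihood estimates $\hat{\boldsymbol{\beta}}$ and $\hat{\sigma}^2 = RSS/n$ into the Gaussian log-likelihood gives $\log L = -\frac{n}{2}\left[\log(2\pi) + \log(RSS/n) + 1\right]$, so that $AIC = n\log(2\pi) + n\log(RSS/n) + n + 2k$. The sketch below illustrates this calculation; the function names and the RSS values are hypothetical, and $k$ counts only the regression coefficients, following the description above. Some software also counts $\sigma^2$ as a parameter, which adds the same constant to every model's AIC and so does not change how models fit to the same data are ranked.

```python
import math


def max_log_likelihood(rss, n):
    """Gaussian log-likelihood evaluated at the MLEs, where sigma^2 = RSS / n."""
    return -0.5 * n * (math.log(2 * math.pi) + math.log(rss / n) + 1)


def aic(rss, n, k):
    """AIC = -2 log L + 2k, with k the number of regression coefficients."""
    return -2 * max_log_likelihood(rss, n) + 2 * k


# Hypothetical comparison of two models fit to the same n = 100 observations.
n = 100
aic_small = aic(rss=250.0, n=n, k=3)  # intercept + 2 predictors
aic_large = aic(rss=248.0, n=n, k=6)  # intercept + 5 predictors

print("AIC, smaller model:", aic_small)
print("AIC, larger model:", aic_large)
```

In this made-up comparison the three extra predictors reduce the RSS only slightly, so the improvement in fit does not offset the added penalty and the smaller model has the lower AIC.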
BIC
Similar to AIC, the Bayesian Information Criterion (BIC) is another model selection criterion that considers both model fit and complexity. BIC is based on Bayesian principles and imposes a stronger penalty for model complexity than AIC. Gideon Schwarz's foundational paper on BIC is titled "Estimating the Dimension of a Model" and was published in 1978. The formula for BIC is as follows:
$$BIC = -2\log L + k\log(n)$$
In the formula, the terms $\log L$ and $k$ have the same meaning as in AIC. Additionally, the term $\log(n)$ represents the natural logarithm of the sample size ($n$). The term $k\log(n)$ in BIC introduces a stronger penalty for model complexity than AIC whenever $\log(n) > 2$, that is, for $n \geq 8$, and the penalty grows with the sample size.
The main difference between AIC and BIC lies in the penalty term for model complexity. While AIC penalizes complexity with the term $2k$, which does not depend on the sample size, BIC's penalty $k\log(n)$ increases logarithmically with the sample size, resulting in a more pronounced penalty for all but the smallest samples. Therefore, BIC tends to favor simpler models compared to AIC, promoting a more parsimonious approach to model selection.
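The difference in penalties can be made concrete with a small sketch. The function names and the RSS values below are hypothetical, and the log-likelihood is the Gaussian regression log-likelihood evaluated at the maximum likelihood estimates, with $k$ counting the regression coefficients. The example is constructed so that the two criteria disagree: the extra predictor improves the fit enough to overcome AIC's penalty of 2 but not BIC's penalty of $\log(100) \approx 4.6$.

```python
import math


def max_log_likelihood(rss, n):
    """Gaussian log-likelihood evaluated at the MLEs, where sigma^2 = RSS / n."""
    return -0.5 * n * (math.log(2 * math.pi) + math.log(rss / n) + 1)


def aic(rss, n, k):
    """AIC = -2 log L + 2k."""
    return -2 * max_log_likelihood(rss, n) + 2 * k


def bic(rss, n, k):
    """BIC = -2 log L + k log(n)."""
    return -2 * max_log_likelihood(rss, n) + k * math.log(n)


# Penalty per parameter: AIC always charges 2, BIC charges log(n).
for size in (7, 8, 100, 10_000):
    print(f"n = {size}: AIC penalty per parameter = 2, BIC = {math.log(size):.3f}")

# Two hypothetical candidate models fit to the same n = 100 observations.
n = 100
candidates = {
    "small": {"rss": 250.0, "k": 3},
    "large": {"rss": 242.0, "k": 4},
}

best_by_aic = min(candidates, key=lambda name: aic(n=n, **candidates[name]))
best_by_bic = min(candidates, key=lambda name: bic(n=n, **candidates[name]))

print("Model preferred by AIC:", best_by_aic)
print("Model preferred by BIC:", best_by_bic)
```

With these made-up numbers AIC selects the larger model and BIC selects the smaller one, which is the typical direction of disagreement between the two criteria.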