Bias-Variance Trade-offs

Note

The following supplemental notes are based on this page. They are provided for students who want to dive deeper into the mathematics behind regularized regression. Additional supplemental notes will be added throughout the semester.

In the linear regression model, we have $n$ observations of the response variable $Y$, modeled as a linear combination of $m$ predictor variables $\mathbf{x}$, where

$$\pi\left(Y=y \mid \mathbf{x},\boldsymbol{\theta}\right)=\mathcal{N}\left(y \mid \beta_{0}+\boldsymbol{\beta}'\mathbf{x},\sigma^{2}\right) \qquad (1)$$

where $\boldsymbol{\theta}=\left(\beta_{0},\boldsymbol{\beta},\sigma^{2}\right)$ collects all the parameters of the model. The vector of parameters $\boldsymbol{\beta}=\left(\beta_{1},\ldots,\beta_{m}\right)$ contains the weights or regression coefficients. Each coefficient $\beta_{i}$ specifies the change in the output that we expect if the corresponding input feature $x_{i}$ changes by one unit. The term $\beta_{0}$ is the offset or bias term, and specifies the output when all the inputs are zero. It captures the unconditional response and acts as a baseline. We sometimes write the input as $\left(1,x_{1},\ldots,x_{m}\right)$ so the offset can be absorbed into the weight vector.

We can always apply a transformation $\phi$ (linear or non-linear) to the input vector, replacing $\mathbf{x}$ with $\phi(\mathbf{x})$. As long as the parameters of the feature extractor are fixed, the model remains linear in the parameters even if it is not linear in the inputs.

Least squares estimation

To fit the linear regression model to data, we minimize the negative log-likelihood on the training set.

$$\begin{align*} \text{NLL}\left(\beta,\sigma^{2}\right) & =-\sum_{n=1}^{N}\log\left[\left(\frac{1}{2\pi\sigma^{2}}\right)^{\frac{1}{2}}\exp\left(-\frac{1}{2\sigma^{2}}\left(y_{n}-\beta'x_{n}\right)^{2}\right)\right]\\ & =\frac{1}{2\sigma^{2}}\sum_{n=1}^{N}\left(y_{n}-\hat{y}_{n}\right)^{2}+\frac{N}{2}\log\left(2\pi\sigma^{2}\right) \end{align*}$$

where the predicted response is $\hat{y}_{n}\equiv\beta'x_{n}$. Focusing on just the weights, the NLL is, up to a constant, the residual sum of squares:

$$\text{RSS}\left(\beta\right)=\frac{1}{2}\sum_{n=1}^{N}\left(y_{n}-\beta'x_{n}\right)^{2}=\frac{1}{2}\left\Vert y-X\beta\right\Vert ^{2}=\frac{1}{2}\left(y-X\beta\right)'\left(y-X\beta\right)$$

We must estimate the parameter values $\hat{\boldsymbol{\beta}}$ from the data. Using the ordinary least squares (OLS) method, the loss function is

$$\text{L}_{OLS}\left(\hat{\beta}\right)=\sum_{i=1}^{n}\left(y_{i}-x_{i}'\hat{\beta}\right)^{2}=\left\Vert y-X\hat{\beta}\right\Vert ^{2} \qquad (2)$$

which is minimized with the estimate

$$\hat{\boldsymbol{\beta}}_{OLS}=\left(X'X\right)^{-1}X'Y \qquad (3)$$
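To make the closed-form solution concrete, here is a minimal sketch in R. It uses simulated data (not anything from these notes) and simply checks that equation (3) reproduces the coefficients returned by lm():

# Minimal sketch (simulated data, for illustration only): the closed-form
# estimate in (3) matches the coefficients returned by lm().
set.seed(123)
n <- 100
X <- cbind(1, rnorm(n), rnorm(n))        # design matrix with a column of ones for the offset
beta_true <- c(1, 2, -3)
y <- as.numeric(X %*% beta_true + rnorm(n))
beta_ols <- solve(t(X) %*% X) %*% t(X) %*% y
cbind(closed_form = beta_ols, lm_fit = coef(lm(y ~ X[, -1])))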

Bias and variance

The bias is the difference between the expected value of the estimator and the true population parameter:

$$\text{Bias}\left(\hat{\boldsymbol{\beta}}_{OLS}\right)=\mathbb{E}\left[\hat{\boldsymbol{\beta}}_{OLS}\right]-\boldsymbol{\beta} \qquad (4)$$

The variance measures the uncertainty in these estimates:

$$\text{Var}\left(\hat{\boldsymbol{\beta}}_{OLS}\right)=\sigma^{2}\left(X'X\right)^{-1} \qquad (5)$$

where $\sigma^{2}$ is estimated from the residuals $e$:

$$\begin{align*}e & =y-X\hat{\boldsymbol{\beta}}\\\hat{\sigma}^{2} & =\frac{e'e}{n-m}\end{align*} \qquad (6)$$
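As a quick illustration (again with simulated data, and with $m$ here counting all estimated coefficients, including the offset), the residual-based estimate of $\sigma^2$ and the resulting covariance matrix in (5) can be computed by hand and compared against vcov() from lm():

# Sketch (simulated data): estimate sigma^2 from the residuals as in (6) and
# plug it into (5); the result matches vcov() of the corresponding lm() fit.
set.seed(123)
n <- 100
X <- cbind(1, rnorm(n), rnorm(n))
y <- as.numeric(X %*% c(1, 2, -3) + rnorm(n))
beta_hat   <- solve(t(X) %*% X) %*% t(X) %*% y
e          <- y - X %*% beta_hat                    # residuals
m          <- ncol(X)                               # number of estimated coefficients
sigma2_hat <- as.numeric(t(e) %*% e) / (n - m)
var_beta   <- sigma2_hat * solve(t(X) %*% X)
all.equal(var_beta, vcov(lm(y ~ X[, -1])), check.attributes = FALSE)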

This picture illustrates what bias and variance are.

Source: kdnuggets.com

Ideally, both the bias and the variance are low. The model’s expected error can be decomposed into three parts: error resulting from a large variance, error resulting from significant bias, and the remainder, the irreducible part that the model cannot explain.

$$\begin{align*}\mathbb{E}\left[e^{2}\right] & =\left(\mathbb{E}\left[X\hat{\beta}\right]-X\beta\right)^{2}+\mathbb{E}\left[\left(X\hat{\beta}-\mathbb{E}\left[X\hat{\beta}\right]\right)^{2}\right]+\sigma^{2}\\ & =\text{Bias}^{2}+\text{Variance}+\sigma^{2}\end{align*} \qquad (7)$$
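A small simulation sketch (not part of the original notes) can make the decomposition tangible: we repeatedly draw data from a known linear model, predict at a fixed test point with OLS, and compare the average squared error of a new observation against bias² + variance + $\sigma^2$.

# Simulation sketch (illustrative only): check that the expected squared error at a
# fixed test point splits into bias^2 + variance + sigma^2.
set.seed(123)
n <- 50; sigma <- 1
beta_true <- c(2, -1, 0.5)
x_test <- c(1, 1, 1)                                 # fixed test point
preds <- replicate(5000, {
  X <- matrix(rnorm(n * 3), n, 3)
  y <- X %*% beta_true + rnorm(n, sd = sigma)
  beta_hat <- solve(t(X) %*% X, t(X) %*% y)
  sum(x_test * beta_hat)                             # OLS prediction at x_test
})
f_true   <- sum(x_test * beta_true)                  # noiseless response at x_test
bias_sq  <- (mean(preds) - f_true)^2                 # essentially zero: OLS is unbiased
variance <- var(preds)
mse_new  <- mean((preds - (f_true + rnorm(5000, sd = sigma)))^2)
c(bias_sq + variance + sigma^2, mse_new)             # the two numbers should be close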

The OLS estimator has the desired property of being unbiased. However, it can have a huge variance. Specifically, this happens when:

  • The predictor variables are highly correlated with each other (see the short sketch after this list);
  • There are many predictors. This is reflected in the formula for the variance given above: as $m$ approaches $n$, the denominator $n-m$ in the estimate of $\sigma^{2}$ goes to zero and the estimated variance approaches infinity.
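Here is the short sketch referred to above. With simulated data (illustrative only), two nearly identical predictors make the diagonal of $(X'X)^{-1}$, and hence the coefficient variances, explode relative to the uncorrelated case.

# Sketch (simulated data): nearly collinear predictors blow up (X'X)^{-1} and
# therefore Var(beta_hat).
set.seed(123)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.01)                       # x2 is almost a copy of x1
X_corr  <- cbind(x1, x2)
X_indep <- cbind(rnorm(n), rnorm(n))
diag(solve(t(X_corr) %*% X_corr))                    # huge entries
diag(solve(t(X_indep) %*% X_indep))                  # small entries by comparison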

The general solution to this is: reduce variance at the cost of introducing some bias. This approach is called regularization and is almost always beneficial for the predictive performance of the model. To make it sink in, let’s take a look at the following plot.

Source: researchgate.net

As model complexity, which in the case of linear regression can be thought of as the number of predictors, increases, the variance of the estimates also increases, while the bias decreases. The unbiased OLS would place us on the right-hand side of the picture, which is far from optimal. That’s why we regularize: to lower the variance at the cost of some bias, thus moving left on the plot, towards the optimum.

Ridge Regression

From the discussion so far, we have concluded that we would like to decrease the model complexity, that is, the number of predictors. We could use forward or backward selection for this, but that way we would not be able to tell anything about the removed variables’ effect on the response. Removing predictors from the model can be seen as setting their coefficients to zero. Instead of forcing them to be exactly zero, let’s penalize them if they are too far from zero, thus encouraging them to be small in a continuous way. This way, we decrease model complexity while keeping all variables in the model. This, in essence, is what Ridge Regression does.

Model Specification

In Ridge Regression, the OLS loss function is augmented in such a way that we not only minimize the sum of squared residuals but also penalize the size of parameter estimates, in order to shrink them towards zero:

$$\text{L}_{ridge}\left(\hat{\beta}\right)=\sum_{i=1}^{n}\left(y_{i}-x_{i}'\hat{\beta}\right)^{2}+\lambda\sum_{j=1}^{m}\hat{\beta}_{j}^{2}=\left\Vert y-X\hat{\beta}\right\Vert ^{2}+\lambda\left\Vert \hat{\beta}\right\Vert ^{2}$$

Solving this for $\hat{\beta}$ gives the ridge regression estimate $\hat{\beta}_{ridge}=\left(X'X+\lambda I\right)^{-1}X'Y$, where $I$ denotes the identity matrix.
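As a quick sketch (simulated, standardized data; ridge_solve is an illustrative helper, not how glmnet computes its estimates internally), the closed-form ridge solution can be evaluated directly, which also illustrates the limiting cases discussed just below:

# Sketch of the closed-form ridge estimate on simulated, standardized data.
set.seed(123)
n <- 100
X <- scale(matrix(rnorm(n * 3), n, 3))               # standardized predictors
y <- as.numeric(X %*% c(2, -1, 0.5) + rnorm(n))
y <- y - mean(y)                                     # centered response
ridge_solve <- function(X, y, lambda) {
  solve(t(X) %*% X + lambda * diag(ncol(X)), t(X) %*% y)
}
cbind(ridge_solve(X, y, 0),                          # lambda = 0 reproduces OLS
      ridge_solve(X, y, 10),
      ridge_solve(X, y, 1e6))                        # huge lambda shrinks towards zero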

The $\lambda$ parameter is the regularization penalty. We will talk about how to choose it in the next sections of this tutorial, but for now notice that:

  • As $\lambda\rightarrow 0$, $\hat{\beta}_{ridge}\rightarrow\hat{\beta}_{OLS}$;
  • As $\lambda\rightarrow\infty$, $\hat{\beta}_{ridge}\rightarrow 0$.

So, setting $\lambda$ to 0 is the same as using OLS, while the larger its value, the more strongly the size of the coefficients is penalized.

Bias-Variance Trade-Off in Ridge Regression

For the ridge estimator, the bias and the variance are

$$\begin{align*} \text{Bias}\left(\hat{\boldsymbol{\beta}}_{ridge}\right) & =-\lambda\left(X'X+\lambda I\right)^{-1}\beta\\ \text{Var}\left(\hat{\boldsymbol{\beta}}_{ridge}\right) & =\sigma^{2}\left(X'X+\lambda I\right)^{-1}X'X\left(X'X+\lambda I\right)^{-1} \end{align*}$$

From these formulas you can see that as $\lambda$ becomes larger, the variance decreases and the bias increases. This poses the question: how much bias are we willing to accept in order to decrease the variance? Or: what is the optimal value for $\lambda$?
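To see the trade-off numerically, here is a sketch (with a simulated design and a known $\beta$ and $\sigma^2$, so the formulas above can be evaluated exactly; ridge_bias_var is an illustrative helper) over a grid of $\lambda$ values:

# Sketch: evaluate the ridge bias and variance formulas above on a lambda grid.
set.seed(123)
n <- 100
X <- scale(matrix(rnorm(n * 3), n, 3))
beta   <- c(2, -1, 0.5)
sigma2 <- 1
XtX    <- t(X) %*% X
ridge_bias_var <- function(lambda) {
  M    <- solve(XtX + lambda * diag(3))
  bias <- -lambda * M %*% beta                       # bias formula from above
  V    <- sigma2 * M %*% XtX %*% M                   # variance formula from above
  c(lambda = lambda, bias_norm = sqrt(sum(bias^2)), total_variance = sum(diag(V)))
}
t(sapply(c(0, 1, 10, 100, 1000), ridge_bias_var))    # bias grows, variance shrinks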

There are two ways we could tackle this issue. A more traditional approach would be to choose $\lambda$ such that some information criterion, e.g., AIC or BIC, is the smallest. A more machine learning-like approach is to perform cross-validation and select the value of $\lambda$ that minimizes the cross-validated sum of squared residuals (or some other measure). The former approach emphasizes the model’s fit to the data, while the latter is more focused on its predictive performance. Let’s discuss both.

Minimizing Information Criteria

This approach boils down to estimating the model with many different values of $\lambda$ and choosing the one that minimizes the Akaike or Bayesian Information Criterion:

$$\begin{align*} \text{AIC}_{ridge} & =\log\left(e'e\right)+2\,\text{df}_{ridge}\\ \text{BIC}_{ridge} & =\log\left(e'e\right)+2\,\text{df}_{ridge}\log n \end{align*}$$

where $\text{df}_{ridge}$ is the number of degrees of freedom. Watch out here! The number of degrees of freedom in ridge regression is different from that in regular OLS. This is often overlooked, which leads to incorrect inference. In both OLS and ridge regression, the degrees of freedom are equal to the trace of the so-called hat matrix, the matrix that maps the vector of response values to the vector of fitted values: $\hat{y}=Hy$.

In OLS, the hat matrix is $H_{OLS}=X\left(X'X\right)^{-1}X'$, which gives $\text{df}_{OLS}=\text{tr}\,H_{OLS}=m$, where $m$ is the number of predictor variables. In ridge regression, however, the hat matrix must include the regularization penalty: $H_{ridge}=X\left(X'X+\lambda I\right)^{-1}X'$, which gives $\text{df}_{ridge}=\text{tr}\,H_{ridge}$, which is no longer equal to $m$. Some ridge regression software packages produce information criteria based on the OLS formula. To be sure you are doing things right, it is safer to compute them manually, which is what we will do later in this tutorial.
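As a preview of that manual computation (a sketch on simulated, standardized data; ridge_ic is an illustrative helper, and the mtcars version comes later in the tutorial), $\text{df}_{ridge}$ can be obtained as the trace of the ridge hat matrix and plugged into the AIC/BIC formulas above:

# Sketch: df_ridge as the trace of the ridge hat matrix, fed into the AIC/BIC
# formulas given above.
set.seed(123)
n <- 100
X <- scale(matrix(rnorm(n * 5), n, 5))
y <- as.numeric(X %*% c(2, -1, 0.5, 0, 0) + rnorm(n))
y <- y - mean(y)
ridge_ic <- function(lambda) {
  H   <- X %*% solve(t(X) %*% X + lambda * diag(ncol(X))) %*% t(X)
  df  <- sum(diag(H))                                # trace of the hat matrix
  e   <- y - H %*% y                                 # ridge residuals
  rss <- as.numeric(t(e) %*% e)
  c(lambda = lambda, df = df,
    AIC = log(rss) + 2 * df,
    BIC = log(rss) + 2 * df * log(n))
}
t(sapply(c(0, 1, 10, 100), ridge_ic))                # df drops below m as lambda grows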

Ridge Regression: R example

In R, the glmnet package contains all you need to implement ridge regression. We will use the well-known mtcars dataset as an illustration, where the task is to predict miles per gallon based on a car’s other characteristics. One more thing: ridge regression assumes the predictors are standardized and the response is centered! You will see why this assumption is needed in a moment. For now, we will just standardize before modeling.

# Load libraries, get data & set seed for reproducibility ---------------------
set.seed(123)    # seed for reproducibility
library(glmnet)  # for ridge regression
library(dplyr)   # for data cleaning
library(psych)   # for function tr() to compute trace of a matrix

data("mtcars")
# Center y, X will be standardized in the modelling function
y <- mtcars %>% select(mpg) %>% scale(center = TRUE, scale = FALSE) %>% as.matrix()
X <- mtcars %>% select(-mpg) %>% as.matrix()


# Perform 10-fold cross-validation to select lambda ---------------------------
lambdas_to_try <- 10^seq(-3, 5, length.out = 100)
# Setting alpha = 0 implements ridge regression
ridge_cv <- cv.glmnet(X, y, alpha = 0, lambda = lambdas_to_try,
                      standardize = TRUE, nfolds = 10)
# Plot cross-validation results
plot(ridge_cv)
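
The object returned by cv.glmnet stores the $\lambda$ values selected by cross-validation (lambda.min and the more conservative lambda.1se). A typical next step, sketched below using the objects created above, is to extract the error-minimizing $\lambda$ and refit the ridge model with it.

# Sketch of a typical next step (uses ridge_cv, X and y from above): extract the
# lambda with the lowest cross-validated error and refit the ridge model with it.
lambda_cv <- ridge_cv$lambda.min   # or ridge_cv$lambda.1se for a more heavily regularized fit
model_cv  <- glmnet(X, y, alpha = 0, lambda = lambda_cv, standardize = TRUE)
coef(model_cv)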