BSMM8740-2-R-2024F [WEEK - 3]
Last time we worked with the recipes
package to develop workflows for pre-processing our data.
Today we look at regression methods we might apply to understand relationships between measurements in our data.
In the simple linear regression (SLR) model, we have \(N\) observations of a single outcome variable \(Y\) along with \(D\) predictor (aka co-variate) variables \(\mathbf{x}\) where the likelihood of observing \(Y=y\) is conditional on the predictor values \(x\) and parameters \(\theta=\{\beta,\sigma^2\}\):
\[ \pi\left(Y=y|\mathbf{x,\theta}\right)=\mathscr{N}\left(\left.y\right|\mu(\mathbf{x};\beta),\sigma^{2}\right) \]
In the SLR model, \(\mathscr{N}\left(\left.y\right|\mu(\mathbf{x};\beta),\sigma^{2}\right)\) is a Normal probability density with mean \(\mu(\mathbf{x};\beta)\) and variance \(\sigma^2\), where \(\sigma^2\) is a constant and the mean is a function of the predictors \(\mathbf{x}\) and a vector of parameters \(\beta\).
The mean function \(\mu(\mathbf{x};\beta)\) is often assumed to be continuous, i.e. a small change in the predictors implies a small change in the outcome. In addition, it is often convenient to decompose the mean function into a sum of simpler functions, e.g. polynomial functions (like straight lines, parabolas, and more).
The decomposition of a function \(f\) of a single variable \(x\) into a sum of simpler polynomial functions is called a Taylor series and is defined as follows:
\[ f(x;x_0)=\sum_{n=0}\beta_n(x-x_0)^n=\beta_0+\beta_1(x-x_0)+\beta_2(x-x_0)^2+\beta_3(x-x_0)^3+\ldots \]
We can use the following constructive argument to find the coefficients in the series: differentiating \(n\) times and evaluating at \(x=x_{0}\) gives
\[ \left.\frac{d^{n}f(x)}{dx^{n}}\right|_{x=x_{0}} \equiv f^{(n)}(x_{0}) = n!\,\beta_{n}\;\Rightarrow\;\beta_{n}=\frac{1}{n!}f^{(n)}(x_{0}) \]
Similarly, when the number of variables is \(D=2\) the Taylor series is (writing \(f_{x}\equiv\frac{\partial f}{\partial x}\), \(f_{y}\equiv\frac{\partial f}{\partial y}\), \(f_{x,y}\equiv\frac{\partial^2 f}{\partial x\,\partial y}\) and so on):
\[ \begin{align*} f(x,y;x_0,y_0) & =f(x_{0},y_{0})+f_{x}(x_{0},y_{0})(x-x_{0})+f_{y}(x_{0},y_{0})(y-y_{0})\\ & \quad + \frac{1}{2}f_{x,x}(x_{0},y_{0})(x-x_{0})^{2}+\frac{1}{2}f_{y,y}(x_{0},y_{0})(y-y_{0})^{2}\\ & \quad + f_{x,y}(x_{0},y_{0})(x-x_{0})(y-y_{0})+\ldots \end{align*} \]
which decomposes a function of two variables into a sum of simpler functions, e.g. polynomial surfaces (planes, paraboloids, and more).
For one co-variate, if the mean function is smooth (i.e. not changing rapidly with the co-variate) then \(\mu^{(n)}\) will be decreasing in \(n\), and furthermore \(\beta_n\) decreases as \(1/n!\), so SLR models in one co-variate typically use only the first two or at most three \(\beta\) coefficients.
Thus the likelihood of observing \(Y=y\) is conditional on the predictor value \(x\) and parameters \(\theta=\{\beta_0,\beta_1,\sigma^2\}\):

\[ \pi\left(Y=y|x,\theta\right)=\mathscr{N}\left(\left.y\right|\beta_{0}+\beta_{1}x,\sigma^{2}\right) \]

where \(\beta_0\) is a constant (the intercept) and \(\beta_1\) is the co-variate weight or regression coefficient.
For multiple co-variates, if the mean function is smooth (i.e. not changing rapidly with the co-variates) then, under similar (reasonable) assumptions on the derivatives, it is common to see SLR models using only the first-order coefficients, i.e. a constant and one coefficient for each co-variate.
Note
It is common to add the unit constant to the covariate vector \(x=(1,x_1,x_2,\ldots)\) so that the coefficient vector is \(\mathbf{\beta}=(\beta_0,\beta_{x_1},\beta_{x_2},\ldots)\), and the model can be expressed as a vector equation: \(y=\mathbf{\beta}\cdot \mathbf{x}\).
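In R, model.matrix() performs this augmentation automatically, prepending a column of ones to the predictors (a small illustration added here, using mtcars variables as a stand-in):

head(model.matrix(hp ~ mpg + wt, data = mtcars), 3)   # first column is the unit constant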
Recall that the one dimensional Normal/Gaussian probability density (aka likelihood) is:
\[ \mathscr{N}\left(x;\mu,\sigma^2\right)=\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{1}{2}\frac{\left(x-\mu\right)^2}{\sigma^2}} \]
The likelihood of multiple observations is the product of the likelihoods for each observation.
To fit the 1D linear regression model given \(N\) data samples, we minimize the negative log-likelihood (NLL) on the training set.
\[ \begin{align*} \text{NLL}\left(\beta,\sigma^{2}\right) & =-\sum_{n=1}^{N}\log\left[\left(\frac{1}{2\pi\sigma^{2}}\right)^{\frac{1}{2}}\exp\left(-\frac{1}{2\sigma^{2}}\left(y_{n}-\beta'x_{n}\right)^{2}\right)\right]\\ & =\frac{1}{2\sigma^{2}}\sum_{n=1}^{N}\left(y_{n}-\hat{y}_{n}\right)^{2}+\frac{N}{2}\log\left(2\pi\sigma^{2}\right) \end{align*} \]
where the predicted response is \(\hat{y}_{n}\equiv\beta'x_{n}\). This is also the maximum likelihood estimation (MLE) method.
Minimizing the NLL with respect to \(\beta\) is therefore the same as minimizing the residual sum of squares (RSS).
The MLE can be generalized by replacing the NLL (\(\ell\left(y_{n},\theta;x_{n}\right)=-\log\pi\left(y_n|x_n,\theta\right)\)) with any other loss function to get
\[ \mathscr{L}\left(\theta\right)=\frac{1}{N}\sum_{n=1}^{N}\ell\left(y_{n},\theta;x_{n}\right) \]
This is known as empirical risk minimization (ERM): the expected loss taken with respect to the empirical distribution.
Focusing on just the coefficients \(\beta\), the minimum NLL is (up to a constant) the minimum of the residual sum of squares (RSS) with coefficient estimates \(\hat\beta\) :
\[ \begin{align*}\text{RSS}\left(\beta\right) & =\frac{1}{2}\sum_{n=1}^{N}\left(y_{n}-\beta'x_{n}\right)^{2}=\frac{1}{2}\left\Vert \mathbf{y}-X\beta\right\Vert ^{2}\\ & =\frac{1}{2}\left(\mathbf{y}-X\beta\right)'\left(\mathbf{y}-X\beta\right) \end{align*} \]
Note that, given the assumption that the data generation process is Normal/Gaussian, we can write our regression equation in terms of individual observations as
\[ y_i=\beta_0+\beta_1 x_i + u_i \]
where error term \(u_i\) is a sample from \(\mathscr{N}\left(0,\sigma^{2}\right)\) which in turn implies \(\mathbb{E}\left[u\right]=0;\;\mathbb{E}\left[\left.u\right|x\right]=0\)
The independence of the covariates and the errors/residuals is a testable assumption.
It follows from the independence of the covariates and the errors that
\[ \begin{align*} \mathbb{E}\left[y-\beta_{0}-\beta_{1}x\right] & =0\\ \mathbb{E}\left[x\left(y-\beta_{0}-\beta_{1}x\right)\right] & =0 \end{align*} \]
Writing these equations in terms of our samples (where \(\hat{\beta}_{0}, \hat{\beta}_{1}\) are our coefficient estimates)
\[ \begin{align*} \frac{1}{N}\sum_{i=1}^{N}\left(y_{i}-\hat{\beta}_{0}-\hat{\beta}_{1}x_{i}\right) & =0\\ \frac{1}{N}\sum_{i=1}^{N}x_{i}\left(y_{i}-\hat{\beta}_{0}-\hat{\beta}_{1}x_{i}\right) & =0 \end{align*} \]
From the first equation
\[ \begin{align*} \bar{y}-\hat{\beta}_{0}-\hat{\beta}_{1}\bar{x} & =0\\ \bar{y}-\hat{\beta}_{1}\bar{x} & =\hat{\beta}_{0} \end{align*} \]
Substituting the expression for \(\hat{\beta}_{0}\) in the independence equation
\[ \begin{align*} \frac{1}{N}\sum_{i=1}^{N}x_{i}\left(y_{i}-\left(\bar{y}-\hat{\beta}_{1}\bar{x}\right)-\hat{\beta}_{1}x_{i}\right) & =0\\ \frac{1}{N}\sum_{i=1}^{N}x_{i}\left(y_{i}-\bar{y}\right) & =\hat{\beta}_{1}\frac{1}{N}\sum_{i=1}^{N}x_{i}\left(x_{i}-\bar{x}\right)\\ \frac{1}{N}\sum_{i=1}^{N}\left(x_{i}-\bar{x}\right)\left(y_{i}-\bar{y}\right) & =\hat{\beta}_{1}\frac{1}{N}\sum_{i=1}^{N}\left(x_{i}-\bar{x}\right)^2 \end{align*} \]
So as long as \(\sum_{i=1}^{N}\left(x_{i}-\bar{x}\right)^2\ne 0\)
\[ \begin{align*} \hat{\beta}_{1} & =\frac{\sum_{i=1}^{N}\left(x_{i}-\bar{x}\right)\left(y_{i}-\bar{y}\right)}{\sum_{i=1}^{N}\left(x_{i}-\bar{x}\right)^2}\\ & =\frac{\text{sample covariance}(x,y)}{\text{sample variance}(x)} \end{align*} \]
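As a quick numerical check (a minimal sketch on simulated data, not data from the notes), the covariance-over-variance formula reproduces the slope reported by lm():

set.seed(8740)
x <- rnorm(100)
y <- 2 + 3 * x + rnorm(100)

# slope from sample covariance / sample variance, intercept from beta0 = ybar - beta1 * xbar
beta1_hat <- cov(x, y) / var(x)
beta0_hat <- mean(y) - beta1_hat * mean(x)

c(beta0_hat, beta1_hat)
coef(lm(y ~ x))   # should match, up to rounding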
Similarly, in the vector form, the RSS is minimized by (assuming \(N>D\)):
\[ \hat{\mathbf{\beta}}_{OLS}=\left(X'X\right)^{-1}\left(X'Y\right) = \frac{\text{cov}(X,Y)}{\text{var}(X)} \]
There are algorithmic issues with computing \(\left(X'X\right)^{-1}\), however, so we could instead start from \(X\beta=y\) and solve for \(\beta\) directly (in the least-squares sense) without ever forming the inverse.
Computing the inverse of \(X'X\) directly, while theoretically possible, can be numerically unstable.
In R, the \(QR\) decomposition is used to solve for \(\beta\). Let \(X=QR\) where \(Q'Q=I\) and write:
\[ \begin{align*} (QR)\beta & = y\\ Q'QR\beta & = Q'y\\ R\beta & = Q'y\\ \beta & = R^{-1}(Q'y) \end{align*} \]
Since \(R\) is upper triangular, the last equation can be solved by back-substitution.
(Console output from the slides: an orthonormal \(Q\) factor from a QR decomposition, followed by a worked example solving the linear system \(A\mathbf{x}=\mathbf{y}\) with
\[ A=\begin{pmatrix}3 & 2 & -1\\ 2 & -2 & 4\\ -1 & 0.5 & -1\end{pmatrix},\qquad \mathbf{y}=\begin{pmatrix}1\\ -2\\ 0\end{pmatrix},\qquad \mathbf{x}=\begin{pmatrix}1\\ -2\\ -2\end{pmatrix}. \])
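The exact code behind that output is not shown in the notes; a minimal sketch using base R's qr() on the system above reproduces the back-substitution step:

A <- matrix(c( 3,  2.0, -1,
               2, -2.0,  4,
              -1,  0.5, -1), nrow = 3, byrow = TRUE)
y <- c(1, -2, 0)

qr_A <- qr(A)                                        # A = QR
beta <- backsolve(qr.R(qr_A), t(qr.Q(qr_A)) %*% y)   # solve R beta = Q'y by back-substitution
as.vector(beta)                                      # 1 -2 -2, matching the vector x above
# solve(qr_A, y) gives the same result in one call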
One of the important assumptions of the classical linear regression models is that there is no exact collinearity among the regressors.
While high correlation between regressors is an indicator of a potential collinearity problem, a direct linear relationship between regressors is sufficient to cause it.
Potential sources of multicollinearity include the data collection method, constraints on the fitted regression model, model specification errors, and an overdefined model.
In other cases it is an artifact caused by creating new predictors from other predictors.
The problem of collinearity has potentially serious effects on the regression estimates, such as:
Mitigating Collinearity:
We introduced truncated Taylor series approximations to motivate using simplified models of the mean function when using regression.
But bias is the error introduced by approximating a real-world problem, which may be complex, by a simplified model.
So to reduce bias, why not include more Taylor series terms, or more covariates in a first-order model?
Note that for random variables in general, and Gaussian random variables in particular, variances add: the variance of a sum of terms is the sum of their variances (plus any covariance terms), so every additional estimated coefficient contributes additional variance to the prediction.
So adding more terms or more covariates may reduce bias by improving the mean estimate, but will certainly increase the variance of the estimate.
Low Bias and High Variance
High Bias and Low Variance
A model with high bias makes oversimplified assumptions about the data, ignoring relevant complexities.
This leads to underfitting, where the model is too simple to capture the underlying patterns in the data, resulting in poor performance on both training and new data.
Balancing Bias and Variance
Underfitting (High Bias, Low Variance)
Overfitting (Low Bias, High Variance)
Balanced Model
The following regression techniques trade off the higher variance that follows from a large number of covariates by adding a bit of bias: the variance is reduced by penalizing the covariate coefficients, shrinking them towards zero.
The resulting simpler models may not fully capture the patterns in the data, thus underfitting the data.
Ridge regression is an example of a penalized regression model; in this case the magnitude of the weights is penalized by adding the \(\ell_2\) norm of the weights to the loss function. In particular, the ridge regression weights are:
\[ \hat{\beta}_{\text{ridge}}=\arg\!\min\text{RSS}\left(\beta\right)+\lambda\left\Vert \beta\right\Vert _{2}^{2} \]
where \(\lambda\) is the strength of the penalty term.
The Ridge objective function is
\[ \mathscr{L}\left(\beta,\lambda\right)=\text{NLL}+\lambda\left\Vert \beta\right\Vert_2^2 \]
The solution is:
\[ \begin{align*} \hat{\mathbf{\beta}}_{ridge} & =\left(X'X+\lambda I_{D}\right)^{-1}\left(X'Y\right)\\ & =\left(\sum_{n}x_{n}x'_{n}+\lambda I_{D}\right)^{-1}\left(\sum_{n}y_{n}x_{n}\right) \end{align*} \]
As for un-penalized linear regression, using matrix inversion to solve for \(\hat{\mathbf{\beta}}_{ridge}\) can be a bad idea. The QR decomposition can be used here as well; however, ridge regression is often used when \(D>N\), in which case the SVD is faster.
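The ridge demo below references objects x, y, and model that are defined in a chunk not shown in the notes; a setup along these lines (assumed to mirror the mtcars variables used in the lasso example later) makes it reproducible:

library(ggplot2)   # the plotting calls below use unqualified ggplot2 functions

# response and predictor matrix (same mtcars variables as the lasso example below)
y <- dplyr::pull(mtcars, hp)
x <- data.matrix(dplyr::select(mtcars, mpg, wt, drat, qsec))

# ridge regression: alpha = 0 keeps only the l2 penalty
model <- glmnet::glmnet(x, y, alpha = 0)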
> # perform k-fold cross-validation to find optimal lambda value
> cv_model <- glmnet::cv.glmnet(x, y, alpha = 0)
>
> # find optimal lambda value that minimizes test MSE
> best_lambda <- cv_model$lambda.min
>
> # produce plot of test MSE by lambda value
> cv_model %>% broom::tidy() %>%
+ ggplot(aes(x=lambda, y = estimate)) +
+ geom_ribbon(aes(ymin = conf.low, ymax = conf.high), fill = "#00ABFD", alpha=0.5) +
+ geom_point() +
+ geom_vline(xintercept=best_lambda) +
+ labs(title='Ridge Regression'
+ , subtitle =
+ stringr::str_glue(
+ "The best lambda value is {scales::number(best_lambda, accuracy=0.01)}"
+ )
+ ) +
+ ggplot2::scale_x_log10()
> model$beta %>%
+ as.matrix() %>%
+ t() %>%
+ tibble::as_tibble() %>%
+ tibble::add_column(lambda = model$lambda, .before = 1) %>%
+ tidyr::pivot_longer(-lambda, names_to = 'parameter') %>%
+ ggplot(aes(x=lambda, y=value, color=parameter)) +
+ geom_line() + geom_point() +
+ xlim(0,2000) +
+ labs(title='Ridge Regression'
+ , subtitle =
+ stringr::str_glue(
+ "Parameters as a function of lambda"
+ )
+ )
Lasso regression is another example of a penalized regression model; in this case both the magnitude of the weights and (implicitly) the number of non-zero weights are penalized, by adding the \(\ell_1\) norm of the weights to the loss function. In particular, the lasso regression weights are:
\[ \hat{\beta}_{\text{lasso}}=\arg\!\min\text{RSS}\left(\beta\right)+\lambda\left\Vert \beta\right\Vert _{1} \]
The Lasso objective function is
\[ \mathscr{L}\left(\beta,\lambda\right)=\text{NLL}+\lambda\left\Vert \beta\right\Vert _{1} \]
> # define response variable
> y <- mtcars %>% dplyr::pull(hp)
>
> # define matrix of predictor variables
> x <- mtcars %>% dplyr::select(mpg, wt, drat, qsec) %>% data.matrix()
>
> # fit lasso regression model
> model <- glmnet::glmnet(x, y, alpha = 1)
>
> # get coefficients when lambda = 3.53
> coef(model, s = 3.53)
5 x 1 sparse Matrix of class "dgCMatrix"
s1
(Intercept) 480.761125
mpg -3.036337
wt 20.222451
drat .
qsec -18.944318
> #perform k-fold cross-validation to find optimal lambda value
> cv_model <- glmnet::cv.glmnet(x, y, alpha = 1)
>
> #find optimal lambda value that minimizes test MSE
> best_lambda <- cv_model$lambda.min
>
> #produce plot of test MSE by lambda value
> cv_model %>% broom::tidy() %>%
+ ggplot(aes(x=lambda, y = estimate)) +
+ geom_ribbon(aes(ymin = conf.low, ymax = conf.high), fill = "#00ABFD", alpha=0.5) +
+ geom_point() +
+ geom_vline(xintercept=best_lambda) +
+ labs(title='Lasso Regression'
+ , subtitle =
+ stringr::str_glue(
+ "The best lambda value is {scales::number(best_lambda, accuracy=0.01)}"
+ )
+ ) +
+ ggplot2::scale_x_log10(limits = c(NA, exp(4)))
> model %>%
+ broom::tidy() %>%
+ tidyr::pivot_wider(names_from=term, values_from=estimate) %>%
+ dplyr::select(-c(step,dev.ratio, `(Intercept)`)) %>%
+ dplyr::mutate_all(dplyr::coalesce, 0) %>%
+ tidyr::pivot_longer(-lambda, names_to = 'parameter') %>%
+ ggplot(aes(x=lambda, y=value, color=parameter)) +
+ geom_line() + geom_point() +
+ xlim(0,70) +
+ labs(title='Lasso Regression'
+ , subtitle =
+ stringr::str_glue(
+ "Parameters as a function of lambda"
+ )
+ )
Elastic Net regression is a hybrid of ridge and lasso regression.
The elastic net objective function is
\[ \mathscr{L}\left(\beta,\lambda,\alpha\right)=\text{NLL}+\lambda\left(\left(1-\alpha\right)\left\Vert \beta\right\Vert _{2}^{2}+\alpha\left\Vert \beta\right\Vert _{1}\right) \]
so that \(\alpha=0\) is ridge regression and \(\alpha=1\) is lasso regression and \(\alpha\in\left(0,1\right)\) is the general elastic net.
> # set length of data and seed for reproducability
> n <- 50
> set.seed(2467)
> # create the dataset
> dat <- tibble::tibble(
+ a = sample(1:20, n, replace = T)/10
+ , b = sample(1:10, n, replace = T)/10
+ , c = sort(sample(1:10, n, replace = T))
+ ) %>%
+ dplyr::mutate(
+ z = (a*b)/2 + c + sample(-10:10, n, replace = T)/10
+ , .before = 1
+ )
> # cross validate to get the best alpha
> alpha_dat <- tibble::tibble( alpha = seq(0.01, 0.99, 0.01) ) %>%
+ dplyr::mutate(
+ mse =
+ purrr::map_dbl(
+ alpha
+ , (\(a){
+ cvg <-
+ glmnet::cv.glmnet(
+ x = dat %>% dplyr::select(-z) %>% as.matrix()
+ , y = dat$z
+ , family = "gaussian"
+ , alpha = a
+ )
+ min(cvg$cvm)
+ })
+ )
+ )
>
> best_alpha <- alpha_dat %>%
+ dplyr::filter(mse == min(mse)) %>%
+ dplyr::pull(alpha)
>
> cat("best alpha:", best_alpha)
best alpha: 0.64
best lambda: 0.01015384
# A tibble: 4 × 5
term step estimate lambda dev.ratio
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) 1 -0.467 0.0102 0.963
2 a 1 0.221 0.0102 0.963
3 b 1 0.560 0.0102 0.963
4 c 1 1.03 0.0102 0.963
RMSE: 0.5817823
R-squared: 0.9999828
MSE: 0.3384707
A generalized linear model (GLM) is a flexible generalization of ordinary linear regression.
Ordinary linear regression predicts the expected value of the outcome variable, a random variable, as a linear combination of a set of observed values (predictors). In a generalized linear model (GLM), each outcome \(Y\) is assumed to be generated from a particular distribution in an exponential family. The mean, \(\mu\), of the distribution depends on the independent variables, \(X\), through:
\[ \mathbb{E}\left[\left.Y\right|X\right]=\mu=\text{g}^{-1}\left(X\beta\right) \] where \(g\) is called the link function.
For example, if \(Y\) is Poisson distributed, then
\[ \mathbb{P}\left[\left.Y=y\right|X,\lambda\right]=\frac{\lambda^{y}}{y!}e^{-\lambda}=e^{y\log\lambda-\lambda-\log y!} \]
where \(\lambda\) is both the mean and the variance. In the GLM the link function is \(\log\) and

\[ \log\mathbb{E}\left[\left.Y\right|X\right] = X\beta=\log\lambda \]
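A minimal sketch (on simulated counts, not data from the notes) of fitting such a Poisson GLM with the log link via glm():

set.seed(8740)
n <- 500
x <- runif(n)
y <- rpois(n, lambda = exp(0.5 + 1.5 * x))   # counts with log-mean 0.5 + 1.5 * x

fit <- glm(y ~ x, family = poisson(link = "log"))
coef(fit)                                    # estimates of (0.5, 1.5) on the log scale
head(predict(fit, type = "response"))        # predicted means: g^{-1}(X beta) = exp(X beta)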
Random Component:
Systematic Component:
Link Function:
Linear Regression (Normal Distribution)
Logistic Regression (Binomial Distribution)
Poisson Regression (Poisson Distribution)
Gamma Regression (Gamma Distribution)
Logistic Regression Example:
Poisson Regression Example:
Gamma Regression Example:
There are many methodologies for constructing regression trees but one of the oldest is known as the classification and regression tree (CART) approach.
Basic regression trees partition a data set into smaller subgroups and then fit a simple constant for each observation in the subgroup. The partitioning is achieved by successive binary partitions (aka recursive partitioning) based on the different predictors.
As a simple example, consider a continuous response variable \(y\) with two covariates \(x_1,x_2\) and the support of \(x_1,x_2\) partitioned into three regions. Then we write the tree regression model for \(y\) as:
\[ \hat{y} = \hat{f}(x_1,x_2)=\sum_{i=1}^{3}c_i\times I_{(x_1,x_2)\in R_i} \]
Tree algorithms differ in how they grow the regression tree, i.e. partition the space of the covariates.
All partitioning of variables is done in a top-down, greedy fashion: a partition performed earlier in the tree will not change based on later partitions. In general the partitions are made to minimize the following objective function (support initially partitioned into 2 regions, i.e. a binary tree):
\[ \text{SSE}=\left\{ \sum_{i\in R_{1}}\left(y_{i}-c_{1}\right)^{2}+\sum_{i\in R_{2}}\left(y_{i}-c_{2}\right)^{2}\right\} \]
Having found the best split, we repeat the splitting process on each of the two regions.
This process is continued until some stopping criterion is reached. What typically results is a very deep, complex tree that may produce good predictions on the training set, but is likely to overfit the data, particularly at the lower nodes.
By pruning these lower-level nodes, we introduce a little bit of bias into our model that helps to stabilize predictions and tends to generalize better to new, unseen data.
As with penalized linear regression, we can use a complexity parameter \(\alpha\) to penalize the number of terminal nodes of the tree (\(|T|\)), much like the lasso \(\ell_1\) penalty, and find the smallest tree with the lowest penalized error, i.e. minimize the following objective function:
\[ \text{SSE}+\alpha\left|T\right| \]
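A minimal sketch of cost-complexity pruning with the rpart package (an assumed tooling choice; the notes do not name a package). rpart's cp parameter plays the role of \(\alpha\) above:

# grow a deep tree on the Boston housing data, then prune it back
fit <- rpart::rpart(
  medv ~ ., data = MASS::Boston,
  control = rpart::rpart.control(cp = 0, minsplit = 5)   # essentially unpruned
)

# cptable reports cross-validated error for each candidate cp value
best_cp <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
pruned  <- rpart::prune(fit, cp = best_cp)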
Strengths
Weaknesses
As mentioned, single-tree models suffer from high variance. Although pruning the tree helps reduce this variance, there are alternative methods that actually exploit the variability of single trees in a way that can significantly improve performance over and above that of single trees. Bootstrap aggregating (bagging) is one such approach.
Bagging combines and averages multiple tree models. Averaging across multiple trees reduces the variability of any one tree and reduces overfitting, improving predictive performance.
Bagging follows three steps:
Fig: The bagging process.
Bagging trees introduces a random component into the tree building process that reduces the variance of a single tree’s prediction and improves predictive performance. However, the trees in bagging are not completely independent of each other since all the original predictors are considered at every split of every tree.
So trees from different bootstrap samples typically have similar structure to each other (especially at the top of the tree) due to underlying relationships. They are correlated.
Tree correlation prevents bagging from optimally reducing the variance of the predictive values. Reducing variance further can be achieved by injecting more randomness into the tree-growing process. Random forests achieve this in two ways:
For regression trees, the typical default value used in split-variable randomization is \(m=\frac{p}{3}\), but this should be considered a tuning parameter.
When \(m=p\), the randomization amounts to using only step 1 and is the same as bagging.
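A minimal sketch contrasting bagging and a random forest via the mtry argument of randomForest (an assumed package choice; any implementation with an mtry-style parameter behaves the same way):

p <- ncol(MASS::Boston) - 1     # number of predictors

# bagging: consider all p predictors at every split (m = p)
bag <- randomForest::randomForest(medv ~ ., data = MASS::Boston, mtry = p)

# random forest default for regression: m = p / 3
rf  <- randomForest::randomForest(medv ~ ., data = MASS::Boston, mtry = floor(p / 3))

c(bagging = tail(bag$mse, 1), random_forest = tail(rf$mse, 1))   # out-of-bag MSE comparison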
Strengths
Weaknesses
Gradient boosted machines (GBMs) are an extremely popular machine learning method that has proven successful across many domains and is one of the leading approaches for winning Kaggle competitions.
Whereas random forests build an ensemble of deep independent trees, GBMs build an ensemble of shallow, weak, successive trees, with each tree learning from and improving on the previous one. When combined, these many weak successive trees produce a powerful “committee” that is often hard to beat with other algorithms.
The main idea of boosting is to add new models to the ensemble sequentially. At each particular iteration, a new weak, base-learner model is trained with respect to the error of the whole ensemble learnt so far.
Sequential ensemble approach.
Boosting is a framework that iteratively improves any weak learning model. Many gradient boosting applications allow you to “plug in” various classes of weak learners at your disposal. In practice however, boosted algorithms almost always use decision trees as the base-learner.
A weak model is one whose error rate is only slightly better than random guessing. The idea behind boosting is that each new model is a simple, weak learner fit to slightly improve on the remaining errors. With regard to decision trees, shallow trees represent a weak learner; commonly, trees with only 1-6 splits are used.
Combining many weak models (versus strong ones) has a few benefits:
Here is the algorithm for boosted regression trees with features \(x\) and response \(y\):
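The step list itself is not reproduced here. As a rough illustration of the idea (a hand-rolled sketch, not the algorithm exactly as listed in the slides), the loop below repeatedly fits a shallow rpart tree to the current residuals and adds a shrunken copy of it to the ensemble:

set.seed(8740)
n   <- 200
dat <- data.frame(x = runif(n, 0, 10))
dat$y <- sin(dat$x) + rnorm(n, sd = 0.3)

M    <- 100                      # number of boosting rounds
eta  <- 0.1                      # learning rate (shrinkage)
pred <- rep(mean(dat$y), n)      # initialize with a constant prediction

for (m in seq_len(M)) {
  dat$res <- dat$y - pred        # residuals = negative gradient of the squared-error loss
  tree <- rpart::rpart(
    res ~ x, data = dat,
    control = rpart::rpart.control(maxdepth = 2, cp = 0)   # a shallow, weak learner
  )
  pred <- pred + eta * predict(tree, dat)   # add the shrunken tree to the ensemble
}

mean((dat$y - pred)^2)           # training MSE after boosting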
XGBoost is short for eXtreme Gradient Boosting package.
While the XGBoost
model often achieves higher accuracy than a single decision tree, it sacrifices the intrinsic interpretability of decision trees. For example, following the path that a single decision tree takes to make its decision is trivial and self-explanatory, but following the paths of hundreds or thousands of trees is much harder.
We will work with XGBoost
in today’s lab.
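As a preview (a minimal sketch that reuses the mtcars variables from the penalized-regression examples as an assumed stand-in; the lab data may differ), xgboost fits a boosted tree ensemble on a numeric matrix like so:

X <- data.matrix(mtcars[, c("mpg", "wt", "drat", "qsec")])
y <- mtcars$hp

bst <- xgboost::xgboost(
  data = X, label = y,
  nrounds = 200, max_depth = 2, eta = 0.1,   # many shallow trees, small learning rate
  objective = "reg:squarederror", verbose = 0
)
head(predict(bst, X))                         # fitted values on the training data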
Kernel Regression is a non-parametric technique in machine learning used to estimate the relationship between a dependent variable and one or more independent variables.
Unlike linear regression, Kernel Regression does not assume a specific form for the relationship between the variables. Instead, it uses a weighted average of nearby observed data points to make predictions.
> #Kernel regression
> # from https://towardsdatascience.com/kernel-regression-made-easy-to-understand-86caf2d2b844
> Kdata <-
+ tibble::tibble(
+ Area = c(11,22,33,44,50,56,67,70,78,89,90,100)
+ , RiverFlow = c(2337,2750,2301,2500,1700,2100,1100,1750,1000,1642, 2000,1932)
+ )
>
> #function to calculate Gaussian kernel
> gausinKernel <- function(x,b){exp(-0.5 *(x/b)^2)/(sqrt(2*pi))}
> #plotting function
> plt_fit <- function(bandwidth = 10, support = seq(5,110,1)){
+ tibble::tibble(x_hat = support) |>
+ dplyr::mutate(
+ y_hat =
+ purrr::map_dbl(
+ x_hat
+ , (
+ \(x){
+ K <- gausinKernel(Kdata$Area-x, bandwidth)
+ sum( Kdata$RiverFlow * K/sum(K) )
+ })
+ )
+ ) |>
+ ggplot(aes(x=x_hat, y=y_hat)) +
+ geom_line(color="blue") +
+ geom_point(data = Kdata, aes(x=Area, y=RiverFlow), size=4, color="red") +
+ labs(title = "Kernel regression", subtitle = stringr::str_glue("bandwith = {bandwidth}; data = red | fit = blue") ) +
+ theme_minimal()
+ }
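The helper plt_fit() is defined above but never invoked in the notes; a call like the following renders the fit for a chosen bandwidth (10 is just an illustrative value):

plt_fit(bandwidth = 10)   # smaller bandwidths follow the points more closely (lower bias, higher variance)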
Regression with neural nets involves using artificial neural networks (ANNs) to predict a continuous output variable based on one or more input variables. Neural nets are powerful, flexible models that can capture complex relationships and patterns in the data.
Architecture of an ANN
Single layer NN architecture
Common Activation Functions
> set.seed(500)
>
> # Boston dataset from MASS
> data <- MASS::Boston
>
> # Normalize the data
> maxs <- data %>% dplyr::summarise_all(max) %>% as.matrix() %>% as.vector()
> mins <- data %>% dplyr::summarise_all(min) %>% as.matrix() %>% as.vector()
> data_scaled <- data %>%
+ scale(center = mins, scale = maxs - mins) %>%
+ tibble::as_tibble()
>
> # Split the data into training and testing set
> data_split <- data_scaled %>% rsample::initial_split(prop = .75)
> # extracting training data and test data as two separate dataframes
> data_train <- rsample::training(data_split)
> data_test <- rsample::testing(data_split)
>
> nn <- data_train %>%
+ neuralnet::neuralnet(
+ medv ~ .
+ , data = .
+ , hidden = c(5, 3)
+ , linear.output = TRUE
+ )
>
> # Predict on test data
> pr.nn <- neuralnet::compute( nn, data_test %>% dplyr::select(-medv) )
>
> # Compute mean squared error
> pr.nn_ <-
+ pr.nn$net.result *
+ (max(data$medv) - min(data$medv)) +
+ min(data$medv)
> test.r <-
+ data_test$medv *
+ (max(data$medv) - min(data$medv)) +
+ min(data$medv)
> MSE.nn <- sum((test.r - pr.nn_)^2) / nrow(data_test)
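For reference (a sketch added here, not part of the original notes), the same scaled split can be fed to a plain linear model and the two test MSEs compared on the original scale:

lm.fit <- lm(medv ~ ., data = data_train)
pr.lm  <- predict(lm.fit, data_test) * (max(data$medv) - min(data$medv)) + min(data$medv)
MSE.lm <- sum((test.r - pr.lm)^2) / nrow(data_test)

c(MSE.lm = MSE.lm, MSE.nn = MSE.nn)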
Today we worked through parametric and non-parametric regression methods that are useful for predicting a value given a set of covariates.
Next week we will look in detail at the tidymodels
package, which will give us a way to develop a workflow for fitting and comparing our models across different feature sets.