
stats-learning-notes

Notes from Introduction to Statistical Learning

Previous: Chapter 5 - Resampling Methods


Chapter 6 - Linear Model Selection and Regularization

Though least squares is the most commonly used fitting procedure, other fitting procedures can yield better model interpretability and predictive accuracy.

Subset selection involves identifying the subset of the predictors that are believed to be related to the response and then fitting a least squares model with the reduced set of variables.

Best Subset Selection

Best subset selection involves fitting a separate least squares regression for each possible combination of the $p$ predictors and then selecting the best model.

Selecting the “best” model is not a trivial process and usually involves a two-step procedure, as outlined by the algorithm below:

  1. Let $M_0$ denote the null model, which uses no predictors and always yields the sample mean for predictions.

  2. For $k = 1, 2, \ldots, p$:
    1. Fit all $\binom{p}{k}$ models that contain exactly $k$ predictors.
    2. Let $M_k$ denote the model that yields the smallest residual sum of squares or, equivalently, the largest $R^2$.
  3. Select the single best model from $M_0, M_1, \ldots, M_p$ using cross-validated prediction error, Cp (Akaike information criterion), Bayes information criterion, or adjusted $R^2$.

It should be noted that step 3 of the above algorithm should be performed with care because, as the number of features used by the models increases, the residual sum of squares decreases monotonically and the $R^2$ increases monotonically. Because of this, choosing the model with the smallest residual sum of squares or the largest $R^2$ will always yield the model involving all of the variables. This stems from the fact that residual sum of squares and $R^2$ are measures of training error, whereas it’d be better to select the best model based on low test error. For this reason, step 3 uses cross-validated prediction error, Cp, Bayes information criterion, or adjusted $R^2$ to select the best model.

Best subset selection can also be employed for other learning techniques such as logistic regression. In the case of logistic regression, instead of using residual sum of squares or $R^2$ to order models in step 2 of the above algorithm, a measure that substitutes for residual sum of squares in a broader class of models, known as the deviance, is used. The deviance is negative two times the maximized log-likelihood; the smaller the deviance, the better the fit of the model.

Best subset selection has computational limitations since $2^p$ models must be considered. As such, best subset selection becomes computationally infeasible for values of $p$ greater than ~40. Some computational shortcuts, known as branch-and-bound techniques, can eliminate some of the model evaluation, but they only work for least squares linear regression and they still face limitations as $p$ gets large because it becomes more likely that models are encountered that look good on the training data even though they have no predictive power on new data. As should be obvious, this leads to overfitting and high variance.
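
To make the enumeration concrete, below is a minimal Python sketch of best subset selection. The data matrix `X`, response `y`, and the choice of scikit-learn’s `LinearRegression` with RSS as the within-size criterion are assumptions made for illustration.

```python
# A sketch of best subset selection, assuming a NumPy data matrix X (n x p) and
# response vector y. For each size k it keeps the subset with the lowest RSS
# (step 2 of the algorithm); choosing among M_0, ..., M_p (step 3) would use
# Cp, BIC, adjusted R^2, or cross-validation instead.
from itertools import combinations

import numpy as np
from sklearn.linear_model import LinearRegression


def best_subset(X, y):
    n, p = X.shape
    best_per_size = {0: ((), float(np.sum((y - y.mean()) ** 2)))}  # M_0: null model
    for k in range(1, p + 1):
        best_rss, best_vars = np.inf, None
        for subset in combinations(range(p), k):  # all C(p, k) models of size k
            cols = list(subset)
            fit = LinearRegression().fit(X[:, cols], y)
            rss = float(np.sum((y - fit.predict(X[:, cols])) ** 2))
            if rss < best_rss:
                best_rss, best_vars = rss, subset
        best_per_size[k] = (best_vars, best_rss)  # M_k
    return best_per_size
```

The double loop makes the cost explicit: the inner loop alone fits $\binom{p}{k}$ models for each $k$, for a total of $2^p$ fits.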

One family of alternatives to best subset selection is the family of stepwise methods, which explore a far more restricted set of models.

Forward Stepwise Selection

Forward stepwise selection begins with a model that utilizes no predictors and successively adds predictors one-at-a-time until the model utilizes all the predictors. Specifically, the predictor that yields the greatest additional improvement is added to the model at each step.

An algorithm for forward stepwise selection is outlined below:

  1. Let $M_0$ denote the null model that utilizes no predictors.

  2. For $k = 0, 1, \ldots, p - 1$:
    1. Consider all $p - k$ models that augment the predictors of $M_k$ with one additional predictor.
    2. Choose the best of these $p - k$ models as determined by the smallest residual sum of squares or the largest $R^2$ and call it $M_{k+1}$.
  3. Select the single best model from $M_0, M_1, \ldots, M_p$ using cross-validated prediction error, Cp (Akaike information criterion), Bayes information criterion, or adjusted $R^2$.

Forward stepwise selection involves fitting one null model and $p - k$ models for each iteration $k = 0, 1, \ldots, p - 1$. This amounts to $1 + \sum_{k=0}^{p-1}(p - k) = 1 + \frac{p(p+1)}{2}$ models, which is a significant improvement over best subset selection’s $2^p$ models.

Forward stepwise selection may not always find the best possible model out of all $2^p$ models due to its additive nature. For example, forward stepwise selection cannot find the best two-variable model in a data set where the best one-variable model uses a variable not contained in the best two-variable model, since every model it considers after the first step must include that first selected variable.
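
A minimal Python sketch of the greedy search, under the same illustrative assumptions as the best subset sketch above (a NumPy `X` and `y`, RSS as the per-step criterion):

```python
# Forward stepwise selection: starting from the null model, each step adds the
# single predictor that most reduces the RSS, producing the nested sequence
# M_0, M_1, ..., M_p. The final choice among them is left to Cp, BIC,
# adjusted R^2, or cross-validation.
import numpy as np
from sklearn.linear_model import LinearRegression


def forward_stepwise(X, y):
    n, p = X.shape
    selected, path = [], [()]          # path[k] holds the variables in M_k
    remaining = set(range(p))
    for k in range(p):
        best_rss, best_j = np.inf, None
        for j in remaining:            # the p - k candidate models at step k
            cols = selected + [j]
            fit = LinearRegression().fit(X[:, cols], y)
            rss = float(np.sum((y - fit.predict(X[:, cols])) ** 2))
            if rss < best_rss:
                best_rss, best_j = rss, j
        selected.append(best_j)
        remaining.remove(best_j)
        path.append(tuple(selected))   # M_{k+1}
    return path
```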

Forward stepwise selection can be applied in high-dimensional scenarios where $n < p$; however, it can only construct the submodels $M_0, \ldots, M_{n-1}$ due to its reliance on least squares regression.

Backward Stepwise Selection

Backward stepwise selection is another efficient alternative to best subset selection. Contrary to forward stepwise selection, backward stepwise selection starts with the full least squares model utilizing all predictors and iteratively removes the least useful predictor with each iteration.

An algorithm for backward stepwise selection:

  1. Let $M_p$ denote a model using all $p$ predictors.

  2. For $k = p, p - 1, \ldots, 1$:
    1. Consider all $k$ models that contain all but one of the predictors in $M_k$, for a total of $k - 1$ predictors each.
    2. Choose the best of these $k$ models as determined by the smallest residual sum of squares or the highest $R^2$ and call it $M_{k-1}$.
  3. Select the single best model from $M_0, M_1, \ldots, M_p$ using cross-validated prediction error, Cp (AIC), BIC, or adjusted $R^2$.

Like forward stepwise selection, backward stepwise selection searches through only $1 + \frac{p(p+1)}{2}$ models, making it useful in scenarios where $p$ is too large for best subset selection. Also like forward stepwise selection, backward stepwise selection is not guaranteed to yield the best possible model.
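
For completeness, a matching sketch of the backward search under the same illustrative assumptions (NumPy `X` and `y`, RSS as the per-step criterion):

```python
# Backward stepwise selection: starting from the full model, each step drops the
# predictor whose removal increases the RSS the least, yielding M_p, ..., M_0.
import numpy as np
from sklearn.linear_model import LinearRegression


def backward_stepwise(X, y):
    n, p = X.shape
    current = list(range(p))
    path = [tuple(current)]            # M_p
    while current:
        best_rss, best_drop = np.inf, None
        for j in current:              # k candidate models, each omitting one predictor
            cols = [c for c in current if c != j]
            if cols:
                fit = LinearRegression().fit(X[:, cols], y)
                rss = float(np.sum((y - fit.predict(X[:, cols])) ** 2))
            else:
                rss = float(np.sum((y - y.mean()) ** 2))  # null model
            if rss < best_rss:
                best_rss, best_drop = rss, j
        current.remove(best_drop)
        path.append(tuple(current))    # M_{k-1}
    return path
```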

Unlike forward stepwise selection, backward stepwise selection requires that the number of samples, $n$, is greater than the number of variables, $p$, so the full model with all $p$ predictors can be fit.

This makes forward stepwise selection the only viable subset method when $p$ is very large, since forward stepwise selection can be used even when $n < p$.

Both forward stepwise selection and backward stepwise selection perform a guided search over the model space, so they effectively consider substantially more than the $1 + \frac{p(p+1)}{2}$ models they explicitly fit.

Hybrid Approaches

Hybrid subset selection methods add variables to the model sequentially, analogous to forward stepwise selection, but with each iteration they may also remove any variables that no longer offer any improvement to model fit.

Hybrid approaches try to better simulate best subset selection while maintaining the computational advantages of stepwise approaches.

Choosing an Optimal Model

Since $R^2$ and residual sum of squares are both related to training error, the model with all the predictors will always appear to be the best. To combat this, it would be better to select the model from the set of models $M_0, M_1, \ldots, M_p$ that yields the lowest estimated test error.

Two common approaches for estimating test error are:

  1. Indirectly estimating test error by making adjustments to the training error to account for the bias caused by overfitting.

  2. Directly estimating test error using a validation set or cross validation.

Cp, AIC, BIC, and adjusted R-Squared

Recall that training mean squared error ($\mathrm{RSS}/n$) usually underestimates test mean squared error since the least squares approach ensures the smallest possible training residual sum of squares. An important difference is that training error will decrease as more variables are added, whereas test error may not. This prevents the use of training residual sum of squares and $R^2$ for comparing models with different numbers of variables.

There are, however, a number of techniques for adjusting training error according to model size which enables comparing models with different numbers of variables.

Four of these strategies are Cp, the Akaike information criterion (AIC), the Bayes information criterion (BIC), and adjusted $R^2$.

For a model containing $d$ predictors fitted with least squares, the Cp estimate of test mean squared error is calculated as

$$ C_p = \frac{1}{n}(\mathrm{RSS} + 2d\hat{\sigma}^2) $$

where $\hat{\sigma}^2$ is an estimate of the variance of the error $\epsilon$ associated with each response measurement. In essence, the Cp statistic adds a penalty of $2d\hat{\sigma}^2$ to the training residual sum of squares to adjust for the tendency of training error to underestimate test error; the penalty increases as additional predictors are added.

It can be shown that if $\hat{\sigma}^2$ is an unbiased estimate of $\sigma^2$, then Cp will be an unbiased estimate of test mean squared error. As a result, Cp tends to take on small values when test mean squared error is low, so a model with a low Cp is preferable.

The Akaike information criterion (AIC) is defined for a large class of models fit by maximum likelihood. In the case of simple linear regression, when the errors follow a Gaussian distribution, maximum likelihood and least squares are the same thing, in which case AIC is given by

$$ \mathrm{AIC} = \frac{1}{n\hat{\sigma}^2}(\mathrm{RSS} + 2d\hat{\sigma}^2) $$

This formula omits an additive constant, but even so it can be seen that Cp and AIC are proportional to each other for least squares models, and as such AIC offers no benefit over Cp in this case.

For a least squares model with $d$ predictors, the Bayes information criterion (BIC), excluding a few irrelevant constants, is given by

$$ \mathrm{BIC} = \frac{1}{n}(\mathrm{RSS} + \log(n)\,d\hat{\sigma}^2) $$

Similar to Cp, BIC tends to take on smaller values when test MSE is low, so smaller values of BIC are preferable.

BIC replaces the $2d\hat{\sigma}^2$ penalty imposed by Cp with a penalty of $\log(n)\,d\hat{\sigma}^2$, where $n$ is the number of observations. Because $\log n$ is greater than 2 for $n > 7$, the BIC statistic tends to penalize models with more variables more heavily than Cp, which in turn results in the selection of smaller models.

Adjusted $R^2$ is another popular choice for comparing models with differing numbers of variables. Recall that $R^2$ is defined as

$$ R^2 = 1 - \frac{\mathrm{RSS}}{\mathrm{TSS}} $$

where TSS is the total sum of squares, given by

$$ \mathrm{TSS} = \sum_{i=1}^{n}(y_i - \bar{y})^2 $$

Since the residual sum of squares always decreases as more variables are added, $R^2$ will always increase as more variables are added.

For a least squares fitted model with $d$ predictors, adjusted $R^2$ is given by

$$ \text{Adjusted } R^2 = 1 - \frac{\mathrm{RSS}/(n - d - 1)}{\mathrm{TSS}/(n - 1)} $$

Unlike Cp, AIC, and BIC, where a smaller value reflects lower test error, for adjusted $R^2$ a larger value signifies lower test error.

Maximizing adjusted $R^2$ is equivalent to minimizing $\frac{\mathrm{RSS}}{n - d - 1}$. Because $d$ appears in the denominator, adding variables may increase or decrease the value of $\frac{\mathrm{RSS}}{n - d - 1}$.

Adjusted $R^2$ aims to penalize models that include unnecessary variables. This stems from the idea that, after all of the correct variables have been added, adding additional noise variables will only decrease the residual sum of squares slightly. This slight decrease is counteracted by the presence of $d$ in the denominator of $\frac{\mathrm{RSS}}{n - d - 1}$.
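
The formulas above translate directly into code. The sketch below is a minimal illustration that assumes the quantities `rss`, `tss`, `n`, `d`, and an error-variance estimate `sigma2_hat` are already available for a fitted model; the names are hypothetical.

```python
# The ISL forms of the adjustment criteria: Cp = (RSS + 2 d sigma^2) / n,
# BIC = (RSS + log(n) d sigma^2) / n, adjusted R^2 = 1 - (RSS/(n-d-1)) / (TSS/(n-1)).
# Smaller Cp/BIC and larger adjusted R^2 indicate lower estimated test error.
import numpy as np


def cp(rss, n, d, sigma2_hat):
    return (rss + 2 * d * sigma2_hat) / n


def bic(rss, n, d, sigma2_hat):
    return (rss + np.log(n) * d * sigma2_hat) / n


def adjusted_r2(rss, tss, n, d):
    return 1 - (rss / (n - d - 1)) / (tss / (n - 1))
```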

Validation and cross-validation can be useful in situations where it’s hard to pinpoint the model’s degrees of freedom or when it’s hard to estimate the error variance, $\sigma^2$.

The one-standard-error rule advises that, when many models have low estimated test error and it’s difficult to discern which truly has the lowest test error, one should select the model with the fewest variables whose estimated test error is within one standard error of the lowest point on the estimated test error curve. The rationale is that, given a set of more or less equally good models, it’s often better to pick the simpler model.
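
A small sketch of the rule, assuming hypothetical arrays `cv_errors` and `cv_se` holding the cross-validated error estimates and their standard errors for models of increasing size:

```python
# One-standard-error rule: pick the smallest model whose estimated test error is
# within one standard error of the minimum estimated test error.
import numpy as np


def one_standard_error_rule(cv_errors, cv_se):
    cv_errors, cv_se = np.asarray(cv_errors), np.asarray(cv_se)
    best = int(np.argmin(cv_errors))
    threshold = cv_errors[best] + cv_se[best]
    return int(np.argmax(cv_errors <= threshold))  # first (smallest) model under the threshold
```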

Shrinkage Methods

Shrinkage methods present an alternative to subset selection that uses all the predictors, but employs a technique to constrain or regularize the coefficient estimates.

Constraining coefficient estimates can significantly reduce their variance. Two well-known techniques for shrinking the regression coefficients toward zero are ridge regression and the lasso.

Ridge Regression

Ridge regression is very similar to least squares fitting except the coefficients are estimated by minimizing a modified quantity.

Recall that the least squares fitting procedure estimates the coefficients $\beta_0, \beta_1, \ldots, \beta_p$ by minimizing the residual sum of squares, where the residual sum of squares is given by

$$ \mathrm{RSS} = \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 $$

Ridge regression instead selects the coefficients that minimize

$$ \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{p}\beta_j^2 = \mathrm{RSS} + \lambda\sum_{j=1}^{p}\beta_j^2 $$

where $\lambda \geq 0$ is a tuning parameter.

The second term, $\lambda\sum_{j=1}^{p}\beta_j^2$, is referred to as a shrinkage penalty. The penalty is small when the coefficients are close to zero, and it grows with $\lambda$ and with the magnitude of the coefficients. As the second term grows, it pushes the coefficient estimates closer to zero, thereby shrinking them.

The tuning parameter $\lambda$ controls the balance between the two terms. When $\lambda$ is zero, the second term is nullified, yielding estimates that exactly match those of least squares. As $\lambda$ approaches infinity, the impact of the shrinkage penalty grows, pushing/shrinking the ridge regression coefficients closer and closer to zero.

Depending on the value of $\lambda$, ridge regression will produce a different set of coefficient estimates, notated $\hat{\beta}^R_\lambda$, for each value of $\lambda$.

It’s worth noting that the ridge regression penalty is only applied to the variable coefficients, $\beta_1, \beta_2, \ldots, \beta_p$, not to the intercept coefficient $\beta_0$. Recall that the goal is to shrink the impact of each variable on the response and, as such, this shrinkage should not be applied to the intercept coefficient, which is a measure of the mean value of the response when none of the variables are present.

The $\ell_2$ norm of a vector is defined as

$$ \|\beta\|_2 = \sqrt{\sum_{j=1}^{p}\beta_j^2} $$

The $\ell_2$ norm measures the distance of the vector $\beta$ from zero.

In regard to ridge regression, as $\lambda$ increases, the $\ell_2$ norm of $\hat{\beta}^R_\lambda$ will always decrease as the coefficient estimates shrink closer to zero.

An important difference between ridge regression and least squares regression is that least squares regression’s coefficient estimates are scale equivariant while ridge regression’s are not. This means that multiplying $X_j$ by a constant $c$ leads to a scaling of the least squares coefficient estimate by a factor of $1/c$. Put another way, regardless of how the jth predictor is scaled, the value of $X_j\hat{\beta}_j$ remains the same. In contrast, ridge regression coefficients can change dramatically when the scale of a given predictor is changed, which means that $X_j\hat{\beta}^R_{j,\lambda}$ may depend on the scaling of other predictors. Because of this, it is best to apply ridge regression after standardizing the predictors using the formula below:

$$ \tilde{x}_{ij} = \frac{x_{ij}}{\sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_{ij} - \bar{x}_j)^2}} $$

This formula puts all the predictors on the same scale by normalizing each predictor relative to its estimated standard deviation. As a result, all the predictors will have a standard deviation of 1 which will yield a final fit that does not depend on the scale of the predictors.
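
As a concrete illustration, here is a minimal Python sketch using scikit-learn; the synthetic data, the pipeline, and the particular `alpha` values (scikit-learn’s name for $\lambda$) are assumptions made for the example.

```python
# Ridge regression on standardized predictors. scikit-learn's Ridge penalizes the
# coefficients but not the intercept, and StandardScaler mirrors the
# standardization formula above (dividing each predictor by its standard deviation).
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))                      # hypothetical 100 x 10 data matrix
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)    # hypothetical response

for alpha in (0.01, 10.0, 1000.0):                  # alpha plays the role of lambda
    ridge = make_pipeline(StandardScaler(), Ridge(alpha=alpha)).fit(X, y)
    coef = ridge.named_steps["ridge"].coef_
    print(alpha, np.sqrt(np.sum(coef ** 2)))        # the l2 norm shrinks as alpha grows
```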

Ridge regression’s advantage over least squares stems from the bias-variance trade-off. As the tuning parameter $\lambda$ increases, the flexibility of the ridge regression fit decreases, leading to a decrease in variance but an increase in bias. Since least squares is equivalent to the most flexible form of ridge regression (where $\lambda = 0$), it offers less bias at the cost of higher variance. As such, ridge regression is best employed in situations where the least squares estimates have high variance.

Ridge regression also offers computational advantages for fixed values of $\lambda$. In fact, it can be shown that the computational requirements of calculating ridge regression coefficient estimates for all values of $\lambda$ simultaneously are almost identical to those for fitting a model using least squares.

Compared to subset methods, ridge regression is at a disadvantage when it comes to the number of predictors used, since ridge regression will always use all $p$ predictors. Ridge regression will shrink predictor coefficients toward zero, but it will never set any of them to exactly zero (except when $\lambda = \infty$). Though the extra predictors may not hurt prediction accuracy, they can make interpretability more difficult, especially when $p$ is large.

The Lasso

The lasso is a more recent alternative to ridge regression that allows for excluding some variables.

Coefficient estimates for the lasso, $\hat{\beta}^L_\lambda$, are generated by minimizing the quantity

$$ \sum_{i=1}^{n}\Big(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Big)^2 + \lambda\sum_{j=1}^{p}|\beta_j| = \mathrm{RSS} + \lambda\sum_{j=1}^{p}|\beta_j| $$

The main difference between ridge regression and the lasso is the change in penalty. Instead of the $\beta_j^2$ term of ridge regression, the lasso uses the $\ell_1$ norm of the coefficient vector $\beta$ as its penalty term. The $\ell_1$ norm of a coefficient vector is given by

$$ \|\beta\|_1 = \sum_{j=1}^{p}|\beta_j| $$

The $\ell_1$ penalty can force some coefficient estimates to zero when the tuning parameter $\lambda$ is sufficiently large. This means that, like the subset methods, the lasso performs variable selection. As a result, models generated by the lasso tend to be easier to interpret than models formulated with ridge regression. These models are sometimes called sparse models since they include only a subset of the variables.

The variable selection of the lasso can be considered a kind of soft thresholding.
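
A minimal sketch contrasting the lasso’s variable selection with ridge regression, under the same illustrative assumptions (synthetic data, scikit-learn, an arbitrary `alpha`):

```python
# The lasso sets some coefficients exactly to zero when alpha (lambda) is large
# enough, producing a sparse model.
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)

lasso = make_pipeline(StandardScaler(), Lasso(alpha=0.5)).fit(X, y)
coef = lasso.named_steps["lasso"].coef_
print(np.sum(coef != 0), "of", coef.size, "coefficients are non-zero")
```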

An alternative viewpoint of ridge regression and the lasso reformulates the problem as minimizing the residual sum of squares subject to

$$ \sum_{j=1}^{p}|\beta_j| \leq s $$

in the case of the lasso and

$$ \sum_{j=1}^{p}\beta_j^2 \leq s $$

in the case of ridge regression.

This reformulation states that, for every value of $\lambda$, there is some value of $s$ that will yield the same coefficient estimates for both perspectives of the lasso. Similarly, for every value of $\lambda$ there is some value of $s$ that will yield the same coefficient estimates for both perspectives of ridge regression.

In both cases, this means that the goal is the set of coefficient estimates for which the residual sum of squares is as small as possible, subject to the requirement that the penalty not exceed the budget $s$.

This perspective reveals a close relationship between the lasso, ridge regression, and best subset selection. In fact, best subset selection is equivalent to trying to minimize the residual sum of squares subject to the constraint that

$$ \sum_{j=1}^{p}I(\beta_j \neq 0) \leq s $$

where $I(\beta_j \neq 0)$ is an indicator variable that is equal to one when $\beta_j$ is non-zero and equal to zero otherwise. In this case, the inequality enforces that no more than $s$ coefficients can be non-zero.

Unfortunately, this perspective on best subset selection is still computationally infeasible because it requires considering all $\binom{p}{s}$ models containing $s$ predictors. It does, however, mean that ridge regression and the lasso can be interpreted as computationally feasible alternatives to best subset selection, since they replace an intractable form of the budget constraint with forms that are much easier to solve.

Selecting the tuning parameter, $\lambda$, can be accomplished for both ridge regression and the lasso through the use of cross-validation. A general algorithm for selecting a tuning parameter might proceed like so:

  1. Select a grid of $\lambda$ values.
  2. Compute the cross-validation error for the given shrinkage method for each value of $\lambda$.
  3. Select the value of $\lambda$ for which the cross-validation error is smallest.
  4. Refit the model using all available observations and the selected tuning parameter value.
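
scikit-learn wraps this procedure in its cross-validated estimators; the sketch below is one illustrative way to run it, with an assumed grid, fold count, and synthetic data.

```python
# RidgeCV and LassoCV search a grid of alpha (lambda) values, pick the one with the
# lowest cross-validation error, and refit on the full data set (steps 1-4 above).
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)

alphas = np.logspace(-3, 3, 50)                              # step 1: grid of lambda values
ridge_cv = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas, cv=10)).fit(X, y)
lasso_cv = make_pipeline(StandardScaler(), LassoCV(alphas=alphas, cv=10)).fit(X, y)
print(ridge_cv.named_steps["ridgecv"].alpha_)                # chosen lambda for ridge
print(lasso_cv.named_steps["lassocv"].alpha_)                # chosen lambda for the lasso
```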

Neither ridge regression nor the lasso is universally dominant over the other. The lasso will perform better in scenarios where not all of the predictors are related to the response, or where some number of variables are only weakly associated with the response. Ridge regression will perform better when the response is a function of many predictors, all with coefficients roughly equal in size.

Like ridge regression, the lasso can help reduce variance at the expense of a small increase in bias in situations where the least squares estimates have excessively high variance.

Dimension Reduction Methods

Dimension reduction methods are a class of techniques that transform the predictors and then fit a least squares model using the transformed variables instead of the original predictors.

Let $Z_1, Z_2, \ldots, Z_M$ represent $M < p$ linear combinations of the original $p$ predictors, $X_1, X_2, \ldots, X_p$. Formally,

$$ Z_m = \sum_{j=1}^{p}\phi_{jm}X_j $$

for some constants $\phi_{1m}, \phi_{2m}, \ldots, \phi_{pm}$, $m = 1, \ldots, M$. It is then possible to use least squares to fit the linear regression model

$$ y_i = \theta_0 + \sum_{m=1}^{M}\theta_m z_{im} + \epsilon_i, \qquad i = 1, \ldots, n $$

where the regression coefficients are represented by $\theta_0, \theta_1, \ldots, \theta_M$.

If the constants $\phi_{1m}, \phi_{2m}, \ldots, \phi_{pm}$ are chosen carefully, dimension reduction approaches can outperform a least squares regression on the original predictors.

The term dimension reduction references the fact that this approach reduces the problem of estimating the $p + 1$ coefficients $\beta_0, \beta_1, \ldots, \beta_p$ to the simpler problem of estimating the $M + 1$ coefficients $\theta_0, \theta_1, \ldots, \theta_M$, where $M < p$, thereby reducing the dimension of the problem from $p + 1$ to $M + 1$.

This approach can be seen as a special constrained version of the original linear regression, considering that

$$ \sum_{m=1}^{M}\theta_m z_{im} = \sum_{m=1}^{M}\theta_m\sum_{j=1}^{p}\phi_{jm}x_{ij} = \sum_{j=1}^{p}\Big(\sum_{m=1}^{M}\theta_m\phi_{jm}\Big)x_{ij} = \sum_{j=1}^{p}\beta_j x_{ij} $$

where

$$ \beta_j = \sum_{m=1}^{M}\theta_m\phi_{jm} $$

This serves to constrain the estimated $\beta_j$ coefficients, since they must now take the form $\beta_j = \sum_{m=1}^{M}\theta_m\phi_{jm}$.

This constraint has the potential to bias the coefficient estimates, but in situations where $p$ is large relative to $n$, selecting a value of $M$ much less than $p$ can significantly reduce the variance of the fitted coefficients.

If $M = p$ and all the linear combinations $Z_m$ are linearly independent, the constraint has no effect, no dimension reduction occurs, and the model is equivalent to performing least squares regression on the original $p$ predictors.

All dimension reduction methods work in two steps. First, the transformed predictors $Z_1, Z_2, \ldots, Z_M$ are obtained. Second, the model is fit using the $M$ transformed predictors.

The difference between dimension reduction methods tends to arise from the means of deriving the transformed predictors $Z_1, Z_2, \ldots, Z_M$, or equivalently, the selection of the $\phi_{jm}$ coefficients.

Two popular forms of dimension reduction are principal component analysis and partial least squares.

Principal Component Regression

Principal component analysis is a common approach for deriving a low-dimensional set of features from a large set of variables. It is also a useful tool for unsupervised learning.

Principal component analysis (PCA) is a technique for reducing the dimension of an $n \times p$ data matrix $X$.

The first principal component direction of the data is the line along which the observations vary the most.

Put another way, the first principal component direction is the line such that if the observations were projected onto the line then the projected observations would have the largest possible variance and projecting observations onto any other line would yield projected observations with lower variance.

Another interpretation of principal component analysis describes the first principal component vector as the line that is as close as possible to the data. In other words, the first principal component line minimizes the sum of the squared perpendicular distances between each point and the line. This means that the first principal component is chosen such that the projected observations are as close as possible to the original observations.

Projecting a point onto a line simply involves finding the location on the line which is closest to the point.

For a two-predictor scenario, the first principal component can be summarized mathematically as

$$ Z_1 = \phi_{11}(X_1 - \bar{X}_1) + \phi_{21}(X_2 - \bar{X}_2) $$

where $\phi_{11}^2 + \phi_{21}^2 = 1$ and the selected values of $\phi_{11}$ and $\phi_{21}$ maximize the variance of the linear combination.

It is necessary to consider only linear combinations satisfying $\phi_{11}^2 + \phi_{21}^2 = 1$ because otherwise $\phi_{11}$ and $\phi_{21}$ could be increased arbitrarily to exaggerate the variance.

In general, up to $\min(n - 1,\ p)$ distinct principal components can be constructed.

The second principal component, $Z_2$, is a linear combination of the variables that is uncorrelated with $Z_1$ and that has the largest variance subject to that constraint.

It turns out that the constraint that $Z_2$ must not be correlated with $Z_1$ is equivalent to the condition that the direction of $Z_2$ must be perpendicular, or orthogonal, to the first principal component direction.

Generally, this means that in the two-predictor case the second principal component takes the form

$$ Z_2 = \phi_{21}(X_1 - \bar{X}_1) - \phi_{11}(X_2 - \bar{X}_2) $$

since, up to sign, $(\phi_{21}, -\phi_{11})$ is the only unit vector orthogonal to the first principal component direction $(\phi_{11}, \phi_{21})$.

Constructing additional principal components, in cases where $p > 2$, would successively maximize variance subject to the constraint that each additional component be uncorrelated with the previous components.
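
One common way to compute the principal component directions in practice is via the singular value decomposition of the centered data matrix; the short NumPy sketch below is an illustration of that approach, with synthetic data as an assumption.

```python
# Principal components via the SVD of the centered data matrix: the columns of
# `components` are the unit-norm directions phi_1, phi_2, ..., and `scores` holds
# the projections Z_1, Z_2, ... of each observation onto those directions.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

X_centered = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(X_centered, full_matrices=False)
components = Vt.T                      # orthogonal unit vectors (principal component directions)
scores = X_centered @ components       # Z_1, ..., Z_p for each observation
print(np.var(scores, axis=0))          # variances appear in decreasing order
```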

The Principal Component Regression Approach

The principal component regression approach first constructs the first $M$ principal components, $Z_1, \ldots, Z_M$, and then uses those components as the predictors in a linear regression model that is fit with least squares.

The premise behind this approach is that a small number of principal components can often suffice to explain most of the variability in the data, as well as the relationship between the predictors and the response. This relies on the assumption that the directions in which $X_1, \ldots, X_p$ show the most variation are the directions that are associated with the response $Y$. Though not always true, it is true often enough to yield good results.

In scenarios where the assumption underlying principal component regression holds true, the result of fitting a model to $Z_1, \ldots, Z_M$ will likely be better than the result of fitting a model to $X_1, \ldots, X_p$, since most of the information in the data that relates to the response is captured by $Z_1, \ldots, Z_M$, and by estimating only $M \ll p$ coefficients overfitting is mitigated.

The number of principal components used relates to the bias-variance trade-off in that using fewer principal components will increase bias, but reduce variance and conversely, using more principal components will decrease bias, but increase variance.

Principal component regression will tend to do better in scenarios where the first few principal components are sufficient to capture most of the variation in the predictors and the relation with the response. The closer $M$ is to $p$, the more principal component regression will resemble a least squares model fit to the original predictors.

It should be noted that principal component regression is not a feature selection method, since each of the $M$ principal components used in the regression is a linear combination of all $p$ original predictors. For this reason, principal component regression is more similar to ridge regression than it is to the lasso. In fact, it can be shown that principal component regression and ridge regression are closely related, with ridge regression acting as a continuous version of principal component regression.

As with the shrinkage methods, the value chosen for $M$ is best informed by cross-validation.

It is generally recommended that all predictors be standardized prior to generating principal components. As with ridge regression, standardization can be achieved via

$$ \tilde{x}_{ij} = \frac{x_{ij}}{\sqrt{\frac{1}{n}\sum_{i=1}^{n}(x_{ij} - \bar{x}_j)^2}} $$

Standardization ensures all variables are on the same scale which limits the degree to which the high-variance predictors dominate the principal components obtained. Additionally, the scale on which the variables are measured will ultimately affect the principal component regression model obtained. That said, if the variables are all in the same units, one might choose not to standardize them.
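
Principal component regression is straightforward to assemble as a pipeline; the sketch below is one illustrative way to do so, with a hypothetical choice of $M = 3$ and synthetic data (in practice $M$ would come from cross-validation).

```python
# Principal component regression: standardize, compute the first M principal
# components, then fit least squares on those components.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)

pcr = make_pipeline(StandardScaler(), PCA(n_components=3), LinearRegression()).fit(X, y)
print(pcr.score(X, y))  # training R^2 of the fit using M = 3 components
```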

Partial Least Squares

Unlike principal component regression, partial least squares is a supervised learning method in that the value of the response is used to supervise the dimension reduction process.

Partial least squares (PLS) identifies a new set of features that are linear combinations of the original predictors and then uses these new features to fit a linear model using least squares.

Unlike principal component regression, partial least squares makes use of the response to identify new features that not only approximate the original predictors well, but that are also related to the response.

The first partial least squares direction is computed by first standardizing the $p$ predictors. Next, each coefficient $\phi_{j1}$ is set by performing a simple linear regression of $Y$ onto $X_j$. It can be shown that this coefficient is proportional to the correlation between $Y$ and $X_j$. Because of this proportional relationship, it can be seen that partial least squares places the highest weight on the variables that are most strongly related to the response as it computes $Z_1 = \sum_{j=1}^{p}\phi_{j1}X_j$.

To identify the second partial least squares direction, it is first necessary to adjust each of the variables for $Z_1$. This is achieved by regressing each variable on $Z_1$ and taking the residuals. These residuals can be interpreted as the remaining information not explained by the first partial least squares direction.

This orthogonalized data is then used to compute $Z_2$ in the same way that the original data was used to compute $Z_1$. This iterative approach can be repeated $M$ times to identify multiple partial least squares components, $Z_1, \ldots, Z_M$.

Like principal component regression, the number of partial least squares directions, $M$, used with partial least squares is generally selected using cross-validation.

Before performing partial least squares, the predictors and the response are typically standardized.
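
A matching sketch using scikit-learn’s `PLSRegression`, which standardizes the variables internally by default; the number of directions and the synthetic data are illustrative assumptions.

```python
# Partial least squares regression with M = 3 directions chosen for illustration;
# in practice M would be selected by cross-validation.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=100)

pls = PLSRegression(n_components=3).fit(X, y)
print(pls.score(X, y))  # training R^2 using M = 3 partial least squares directions
```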

In practice, partial least squares often performs no better than principal component regression or ridge regression. Though the supervised dimension reduction of partial least squares can reduce bias, it also has the potential to increase variance. Because of this, the benefit of partial least squares compared to principal component regression is often negligible.

Considerations For High-Dimensional Data

Most statistical techniques for regression and classification are intended for low-dimensional settings where $n \gg p$.

Data sets containing more features than observations ($p > n$) are often referred to as high-dimensional.

When $p \geq n$, least squares will yield a set of coefficient estimates that perfectly fit the data, whether or not there is truly a relationship between the features and the response. As such, least squares should never be used in a high-dimensional setting.

Cp, AIC, and BIC are also not appropriate in the high-dimensional setting because estimating the error variance $\hat{\sigma}^2$ is problematic.

Regression in High Dimensions

Methods for generating less flexible least squares models like forward stepwise selection, ridge regression, and the lasso turn out to be especially useful in the high-dimensional setting, since they essentially avoid overfitting by using a less flexible fitting approach.

Regularization and/or shrinkage play a key role in high-dimensional problems.

Appropriate tuning parameter selection is crucial for good predictive performance in the high-dimensional setting.

Test error tends to increase as the dimension of the problem increases unless the additional features are truly associated with the response. This is related to the curse of dimensionality, as the additional noise features increase the dimensionality of the problem, increasing the risk of overfitting without any potential upside.

The risk of multicollinearity is also exacerbated in the high-dimensional setting. Since any variable in the model can be written as a linear combination of all the other variables in the model, it can be extremely difficult to determine which variables (if any) are truly predictive of the outcome. Even the best regression coefficient estimates can never be identified. The best that can be hoped for is that large regression coefficients are assigned to the variables that are truly predictive of the outcome.

In the high-dimensional setting, one should never use sum of squared errors, p-values, $R^2$, or other traditional measures of model fit on the training data as evidence of good model fit. The MSE or $R^2$ of an independent test set is a valid measure of model fit, but the MSE or $R^2$ of the training set is certainly not.


Next: Chapter 7 - Moving Beyond Linearity