
stats-learning-notes

Notes from Introduction to Statistical Learning

Previous: Chapter 6 - Linear Model Selection and Regularization


Chapter 7 - Moving Beyond Linearity

Polynomial regression extends the linear model by adding additional predictors obtained by raising each of the original predictors to a power. For example, cubic regression uses three variables, $X$, $X^2$, and $X^3$, as predictors.

Step functions split the range of a variable into distinct regions in order to produce a qualitative variable. This has the effect of fitting a piecewise constant function.

Regression splines are an extension of polynomials and step functions that provide more flexibility. Regression splines split the range of $X$ into distinct regions and within each region a polynomial function is used to fit the data. The polynomial functions selected are constrained to ensure they join smoothly at region boundaries called knots. With enough regions, regression splines can offer an extremely flexible fit.

Smoothing splines are similar to regression splines, but unlike regression splines, smoothing splines result from minimizing a residual sum of squares criterion subject to a smoothness penalty.

Local regression is similar to splines, but the regions are allowed to overlap. The overlapping regions allow for improved smoothness.

Generalized additive models extend splines, local regression, and polynomials to deal with multiple predictors.

Polynomial Regression

Extending linear regression to accommodate scenarios where the relationship between the predictors and the response is non-linear typically involves replacing the standard linear model

$$y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$

with a polynomial function of the form

$$y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \dots + \beta_d x_i^d + \epsilon_i .$$

This approach is known as polynomial regression. For large values of $d$, polynomial regression can produce extremely non-linear curves, but a degree greater than 3 or 4 is unusual because large values of $d$ can be overly flexible and take on strange shapes, especially near the boundaries of the $X$ variable.

Coefficients in polynomial regression can be estimated easily using least squares linear regression since the model is a standard linear model with predictors $x_i, x_i^2, x_i^3, \dots, x_i^d$, which are derived by transforming the original predictor $X$.

Even though this yields a linear regression model, the individual coefficients are less important compared to the overall fit of the model and the perspective it provides on the relationship between the predictors and the response.

Once a model is fit, least squares can be used to estimate the variance of each coefficient as well as the covariance between coefficient pairs.

The obtained variance estimates can be used to compute the estimated variance of the fit, $\hat{f}(x_0)$, at any point $x_0$. The estimated pointwise standard error of $\hat{f}(x_0)$ is the square root of this variance.
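As a concrete illustration, here is a minimal numpy sketch (synthetic data; the variable names are made up for this example) that fits a degree-3 polynomial by least squares and computes the pointwise standard error of $\hat{f}(x_0)$ from the estimated coefficient covariance matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, 200)
y = 1 + 2 * x - 0.5 * x**3 + rng.normal(scale=0.5, size=x.size)

d = 3
X = np.vander(x, d + 1, increasing=True)          # columns: 1, x, x^2, x^3

# Least squares fit: beta_hat = (X^T X)^{-1} X^T y
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

# Estimated coefficient covariance matrix: sigma^2 (X^T X)^{-1}
resid = y - X @ beta_hat
sigma2 = resid @ resid / (x.size - (d + 1))
C = sigma2 * np.linalg.inv(X.T @ X)

# Pointwise standard error of f_hat(x0) = l0^T beta_hat is sqrt(l0^T C l0)
x0 = 1.5
l0 = np.array([x0**p for p in range(d + 1)])
f_hat = l0 @ beta_hat
se = np.sqrt(l0 @ C @ l0)
print(f"f_hat(x0) = {f_hat:.3f} +/- {2 * se:.3f} (approx. 95% band)")
```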

Step Functions

Polynomial functions of the predictors in a linear model impose a global structure on the estimated non-linear function of $X$. Step functions don’t impose such a global structure.

Step functions split the range of $X$ into bins and fit a different constant to each bin. This is equivalent to converting a continuous variable into an ordered categorical variable.

First, $K$ cut points, $c_1, c_2, \dots, c_K$, are created in the range of $X$, from which $K + 1$ new variables are created:

$$C_0(X) = I(X < c_1),\quad C_1(X) = I(c_1 \leq X < c_2),\quad \dots,\quad C_{K-1}(X) = I(c_{K-1} \leq X < c_K),\quad C_K(X) = I(c_K \leq X),$$

where $I(\cdot)$ is an indicator function that returns 1 if the condition is true and 0 otherwise.

It is worth noting that the bins are mutually exclusive and

$$C_0(X) + C_1(X) + \dots + C_K(X) = 1,$$

since each value of $X$ ends up in exactly one of the $K + 1$ intervals.

Once the slices have been selected, a linear model is fit using $C_1(X), C_2(X), \dots, C_K(X)$ as predictors:

$$y_i = \beta_0 + \beta_1 C_1(x_i) + \beta_2 C_2(x_i) + \dots + \beta_K C_K(x_i) + \epsilon_i .$$

At most one of $C_1(x_i), \dots, C_K(x_i)$ can be non-zero. When $X < c_1$, all of the predictors are zero, so $\beta_0$ can be interpreted as the mean value of $Y$ for $X < c_1$. Similarly, for $c_j \leq X < c_{j+1}$, the linear model reduces to $\beta_0 + \beta_j$, so $\beta_j$ represents the average increase in the response for $X$ in $c_j \leq X < c_{j+1}$ compared to $X < c_1$.
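A minimal numpy sketch of this encoding (synthetic data; the cut points are chosen arbitrarily): the indicator columns for $C_1(X), \dots, C_K(X)$ are built by hand and fit with least squares, so the intercept recovers the mean of $Y$ in the leftmost bin.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 300)
y = np.where(x < 4, 1.0, np.where(x < 7, 3.0, 2.0)) + rng.normal(scale=0.3, size=x.size)

cuts = np.array([4.0, 7.0])                 # cut points c_1, ..., c_K
bin_idx = np.digitize(x, cuts)              # 0 for x < c_1, ..., K for x >= c_K

# Indicator columns C_1(x), ..., C_K(x); C_0 is absorbed by the intercept
C = np.column_stack([(bin_idx == k).astype(float) for k in range(1, cuts.size + 1)])
X = np.column_stack([np.ones_like(x), C])

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print("beta_0 (mean of y for x < c_1):", beta_hat[0])
print("beta_j (jumps relative to the first bin):", beta_hat[1:])
```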

Unless there are natural breakpoints in the predictors, piecewise constant functions can miss the interesting trends in the data.

Basis Functions

Polynomial and piecewise constant functions are special cases of a basis function approach. The basis function approach utilizes a family of functions or transformations, $b_1(X), b_2(X), \dots, b_K(X)$, that can be applied to a variable $X$.

Instead of fitting a linear model in $X$, a similar model that applies the fixed and known basis functions to $X$ is used:

$$y_i = \beta_0 + \beta_1 b_1(x_i) + \beta_2 b_2(x_i) + \dots + \beta_K b_K(x_i) + \epsilon_i .$$

For polynomial regression, the basis functions are $b_j(x_i) = x_i^j$. For piecewise constant functions, the basis functions are $b_j(x_i) = I(c_j \leq x_i < c_{j+1})$.

Since the basis function model is just linear regression with predictors $b_1(x_i), b_2(x_i), \dots, b_K(x_i)$, least squares can be used to estimate the unknown regression coefficients. Additionally, all the inference tools for linear models, like standard errors for coefficient estimates and F-statistics for overall model significance, can also be employed in this setting.
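A tiny numpy sketch of this general pattern (the basis functions below are arbitrary, illustrative choices): once the $b_j$ are fixed and known, fitting reduces to ordinary least squares on the transformed columns.

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.uniform(0, 2 * np.pi, 150)
y = np.sin(x) + 0.5 * np.cos(2 * x) + rng.normal(scale=0.2, size=x.size)

# Fixed, known basis functions b_1, ..., b_K (illustrative choices)
basis = [np.sin, np.cos, lambda t: np.sin(2 * t), lambda t: np.cos(2 * t)]

# Design matrix: intercept column plus one column per basis function
X = np.column_stack([np.ones_like(x)] + [b(x) for b in basis])

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)
```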

Many different types of basis functions exist.

Regression Splines

The simplest spline is a piecewise polynomial function. Piecewise polynomial regression involves fitting separate low-degree polynomials over different regions of $X$ instead of fitting a high-degree polynomial over the entire range of $X$.

For example, a piecewise cubic polynomial is generated by fitting a cubic regression model of the form

$$y_i = \beta_0 + \beta_1 x_i + \beta_2 x_i^2 + \beta_3 x_i^3 + \epsilon_i ,$$

but where the coefficients, $\beta_0, \beta_1, \beta_2, \beta_3$, differ in different regions of the range of $X$.

The points in the range of $X$ where the coefficients change are called knots.

Assuming no functions are repeated, a range of $X$ split at $K$ knots would be fit with $K + 1$ different functions of the selected type (constant, linear, cubic, etc.), one for each region.

In many situations, the number of degrees of freedom in the piecewise context can be determined by multiplying the number of parameters per region ($d + 1$) by the number of regions ($K + 1$), which is one more than the number of knots. For an unconstrained piecewise polynomial regression of degree $d$ with $K$ knots, the number of degrees of freedom would be $(d + 1)(K + 1)$.

Piecewise functions often run into the problem that they aren’t continuous at the knots. To remedy this, a constraint can be put in place that the fitted curve must be continuous. Even then the fitted curve can look unnatural.

To ensure the fitted curve is not just continuous, but also smooth, additional constraints can be placed on the derivatives of the piecewise polynomial.

A degree-$d$ spline is a piecewise degree-$d$ polynomial with continuity in derivatives up to degree $d - 1$ at each knot.

For example, a cubic spline requires that each piecewise cubic polynomial be constrained at each knot such that the curve is continuous, the first derivative is continuous, and the second derivative is continuous. Each constraint imposed on the piecewise cubic polynomial effectively reclaims one degree of freedom by reducing complexity.

In general, a cubic spline with $K$ knots uses a total of $K + 4$ degrees of freedom.
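As a quick check of this count: a cubic spline with $K$ knots has $K + 1$ regions with 4 parameters each, and each knot imposes 3 constraints (continuity of the function, of its first derivative, and of its second derivative), so

$$4(K + 1) - 3K = K + 4 .$$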

The Spline Basis Representation

The basis model can be used to represent a regression spline. For example, a cubic spline with $K$ knots can be modeled as

$$y_i = \beta_0 + \beta_1 b_1(x_i) + \beta_2 b_2(x_i) + \dots + \beta_{K+3} b_{K+3}(x_i) + \epsilon_i$$

with an appropriate choice of basis functions. Such a model could then be fit using least squares.

Though there are many ways to represent cubic splines using different choices of basis functions, the most direct way is to start off with a basis for a cubic polynomial ($x$, $x^2$, $x^3$) and then add one truncated power basis function per knot. A truncated power basis function is defined as

$$h(x, \xi) = (x - \xi)^3_+ = \begin{cases} (x - \xi)^3 & \text{if } x > \xi \\ 0 & \text{otherwise,} \end{cases}$$

where $\xi$ is the knot. It can be shown that augmenting a cubic polynomial with a term of the form $\beta_4 h(x, \xi)$ will lead to a discontinuity only in the third derivative at $\xi$. The function will remain continuous, with continuous first and second derivatives, at each of the knots.

This means that to fit a cubic spline to a data set with $K$ knots, least squares regression can be employed with an intercept and $K + 3$ predictors of the form $X, X^2, X^3, h(X, \xi_1), h(X, \xi_2), \dots, h(X, \xi_K)$, where $\xi_1, \dots, \xi_K$ are the knots. This amounts to estimating a total of $K + 4$ regression coefficients and uses $K + 4$ degrees of freedom.
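A minimal numpy sketch of this construction (synthetic data; three knots placed at arbitrary quantiles), fit by ordinary least squares on the intercept plus $K + 3$ truncated power basis predictors:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, 300)
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

def cubic_spline_basis(x, knots):
    """Design matrix [1, x, x^2, x^3, (x - xi_1)_+^3, ..., (x - xi_K)_+^3]."""
    cols = [np.ones_like(x), x, x**2, x**3]
    cols += [np.clip(x - xi, 0, None) ** 3 for xi in knots]
    return np.column_stack(cols)

knots = np.quantile(x, [0.25, 0.5, 0.75])    # K = 3 knots at uniform quantiles
X = cubic_spline_basis(x, knots)             # intercept plus K + 3 predictors

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print("number of fitted coefficients (K + 4):", beta_hat.size)

# Predictions on a grid
grid = np.linspace(0, 10, 100)
fitted = cubic_spline_basis(grid, knots) @ beta_hat
```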

Cubic splines are popular because the discontinuity at the knots is not detectable by the human eye in most situations.

Splines can suffer from high variance at the outer range of the predictors. To combat this, a natural spline can be used. A natural spline is a regression spline with additional boundary constraints that force the function to be linear in the boundary region.

There are a variety of methods for choosing the number and location of the knots. Because the regression spline is most flexible in regions that contain a lot of knots, one option is to place more knots where the function might vary the most and fewer knots where the function might be more stable. Another common practice is to place the knots in a uniform fashion. One means of doing this is to choose the desired degrees of freedom and then use software or other heuristics to place the corresponding number of knots at uniform quantiles of the data.

Cross validation is a useful mechanism for determining the appropriate number of knots and/or degrees of freedom.

Regression splines often outperform polynomial regression. Unlike polynomials, which must use a high degree to produce a flexible fit, splines can keep the degree fixed and increase the number of knots instead. Splines can also distribute knots, and hence flexibility, to those parts of the function that most need it, which tends to produce more stable estimates.

Smoothing Splines

Smoothing splines take a substantially different approach to producing a spline. To fit a smooth curve to a data set, it would be ideal to find a function, $g(x)$, that fits the data well with a small residual sum of squares. However, without any constraints on $g$, it’s always possible to produce a $g$ that interpolates all of the data and yields an RSS of zero, but is overly flexible and over-fits the data. What is really wanted is a $g$ that makes RSS small while also remaining smooth. One way to achieve this is to find the function $g$ that minimizes

$$\sum_{i=1}^{n} (y_i - g(x_i))^2 + \lambda \int g''(t)^2 \, dt,$$

where $\lambda$ is a non-negative tuning parameter. The function $g$ that minimizes this criterion is a smoothing spline.

Like ridge regression and the lasso, smoothing splines utilize a loss and penalty strategy.

The term

$$\lambda \int g''(t)^2 \, dt$$

is a penalty term that encourages $g$ to be smooth and less variable. $g''(t)$ refers to the second derivative of the function $g$. The first derivative, $g'(t)$, measures the slope of the function at $t$, and the second derivative measures the rate at which the slope is changing. Put another way, the second derivative measures the rate of change of the rate of change of $g(t)$. Roughly speaking, the second derivative is a measure of a function’s roughness: $g''(t)$ is large in absolute value if $g(t)$ is very wiggly near $t$, and it is close to zero when $g(t)$ is smooth near $t$. As an example, the second derivative of a straight line is zero because a straight line is perfectly smooth.

The symbol $\int$ indicates an integral, which can be thought of as a summation over the range of $t$. All together, this means that $\int g''(t)^2 \, dt$ is a measure of the total change in the function $g'(t)$ over its full range.

If $g$ is very smooth, then $g'(t)$ will be close to constant and $\int g''(t)^2 \, dt$ will have a small value. At the other extreme, if $g$ is variable and wiggly, then $g'(t)$ will vary significantly and $\int g''(t)^2 \, dt$ will have a large value.

The tuning parameter, $\lambda$, controls how smooth the resulting function will be. When $\lambda$ is large, $g$ will be smoother. When $\lambda = 0$, the penalty term has no effect, resulting in a function that is as variable and jumpy as the training observations dictate. As $\lambda$ approaches infinity, $g$ will grow smoother and smoother until it eventually becomes a perfectly smooth straight line, which is also the linear least squares solution, since the loss term aims to minimize the residual sum of squares.

At this point it should come as no surprise that the tuning parameter, $\lambda$, controls the bias-variance trade-off of the smoothing spline.

The function $g(x)$ that minimizes the smoothing spline criterion has some noteworthy special properties. It is a piecewise cubic polynomial with knots at the unique values of $x_1, \dots, x_n$ that is continuous in its first and second derivatives at each knot. Additionally, it is linear in the regions outside the outermost knots. In other words, the minimizing $g(x)$ is a natural cubic spline with knots at $x_1, \dots, x_n$; however, it is not the same natural cubic spline derived from the basis function approach. Instead, it’s a shrunken version of such a function, where $\lambda$ controls the amount of shrinkage.

The choice of $\lambda$ also controls the effective degrees of freedom of the smoothing spline. It can be shown that as $\lambda$ increases from zero to infinity, the effective degrees of freedom, $df_\lambda$, decreases from $n$ down to 2.

Smoothing splines are considered in terms of effective degrees of freedom because, though a smoothing spline nominally has $n$ parameters and thus $n$ nominal degrees of freedom, those parameters are heavily constrained. Because of this, effective degrees of freedom are a more useful measure of flexibility.

The effective degrees of freedom are not guaranteed to be an integer.

The higher $df_\lambda$, the more flexible the smoothing spline. The definition of effective degrees of freedom is somewhat technical, but at a high level, effective degrees of freedom is defined as

$$df_\lambda = \sum_{i=1}^{n} \{\mathbf{S}_\lambda\}_{ii},$$

or the sum of the diagonal elements of the matrix $\mathbf{S}_\lambda$. This matrix arises because the $n$-vector of fitted values of the smoothing spline at each of the training points, $\hat{g}_\lambda$, can be written as $\hat{g}_\lambda = \mathbf{S}_\lambda \mathbf{y}$: the matrix $\mathbf{S}_\lambda$ is applied to the response vector $\mathbf{y}$ to determine the solution for a particular value of $\lambda$.

Using these values, the leave-one-out cross validation error can be calculated efficiently via

$$RSS_{cv}(\lambda) = \sum_{i=1}^{n} \big(y_i - \hat{g}_\lambda^{(-i)}(x_i)\big)^2 = \sum_{i=1}^{n} \left[ \frac{y_i - \hat{g}_\lambda(x_i)}{1 - \{\mathbf{S}_\lambda\}_{ii}} \right]^2,$$

where $\hat{g}_\lambda^{(-i)}(x_i)$ refers to the fitted value at $x_i$ using all training observations except for the $i$th.
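The sketch below is not the exact natural cubic spline smoother from the text; it uses a penalized truncated-power-basis fit as a stand-in linear smoother (an assumption made only for illustration) to show how, once $\hat{g}_\lambda = \mathbf{S}_\lambda \mathbf{y}$, the effective degrees of freedom $\mathrm{tr}(\mathbf{S}_\lambda)$ and the LOOCV shortcut can be computed:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200
x = np.sort(rng.uniform(0, 10, n))
y = np.sin(x) + rng.normal(scale=0.3, size=n)

# Stand-in basis: cubic polynomial plus truncated cubic terms at interior knots
knots = np.quantile(x, np.linspace(0.1, 0.9, 10))
B = np.column_stack([np.ones(n), x, x**2, x**3] +
                    [np.clip(x - xi, 0, None) ** 3 for xi in knots])

# Penalize only the knot coefficients so the fit is shrunk toward a cubic
Omega = np.zeros((B.shape[1], B.shape[1]))
Omega[4:, 4:] = np.eye(len(knots))

def smoother_matrix(lam):
    # Fitted values are g_hat = S_lambda y for this penalized least squares fit
    return B @ np.linalg.solve(B.T @ B + lam * Omega, B.T)

for lam in (0.1, 10.0, 1000.0):
    S = smoother_matrix(lam)
    g_hat = S @ y
    df = np.trace(S)                                       # effective degrees of freedom
    loocv = np.sum(((y - g_hat) / (1 - np.diag(S))) ** 2)  # LOOCV shortcut formula
    print(f"lambda={lam:8.1f}  df={df:5.1f}  LOOCV RSS={loocv:.3f}")
```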

Local Regression

Local regression is an approach to fitting flexible non-linear functions which involves computing the fit at a target point $x_0$ using only the nearby training observations.

Each new point at which a local regression fit is computed requires fitting a new weighted least squares regression model, by minimizing the weighted least squares criterion for a new set of weights.

A general algorithm for local regression is:

  1. Gather the fraction $s = k/n$ of training points whose $x_i$ are closest to $x_0$.
  2. Assign a weight $K_{i0} = K(x_i, x_0)$ to each point in this neighborhood such that the point furthest from $x_0$ has a weight of zero and the point closest to $x_0$ has the highest weight. All but the $k$ nearest neighbors get a weight of zero.
  3. Fit a weighted least squares regression of the $y_i$ on the $x_i$ using the weights calculated earlier, by finding the coefficients $\hat{\beta}_0$ and $\hat{\beta}_1$ that minimize a modified version of the appropriate least squares criterion. For linear regression that modified criterion is

     $$\sum_{i=1}^{n} K_{i0} (y_i - \beta_0 - \beta_1 x_i)^2 .$$

  4. The fitted value at $x_0$ is given by $\hat{f}(x_0) = \hat{\beta}_0 + \hat{\beta}_1 x_0$, as shown in the sketch below.
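A minimal numpy sketch of these four steps (synthetic data; tricube weights are assumed here as one common choice of weighting function), computing a local linear fit at a single target point $x_0$:

```python
import numpy as np

rng = np.random.default_rng(4)
x = np.sort(rng.uniform(0, 10, 200))
y = np.sin(x) + rng.normal(scale=0.3, size=x.size)

def local_linear_fit(x0, x, y, span=0.3):
    """Weighted least squares fit at x0 using the span * n nearest points."""
    n = x.size
    k = max(2, int(np.ceil(span * n)))

    # Step 1: the k training points closest to x0
    dist = np.abs(x - x0)
    max_dist = np.sort(dist)[k - 1]

    # Step 2: tricube weights; the furthest neighbor and all points outside get weight zero
    u = np.clip(dist / max_dist, 0, 1)
    w = (1 - u**3) ** 3
    w[dist > max_dist] = 0.0

    # Step 3: weighted least squares of y on (1, x)
    X = np.column_stack([np.ones(n), x])
    W = np.diag(w)
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)

    # Step 4: the fitted value at x0
    return beta[0] + beta[1] * x0

print(local_linear_fit(5.0, x, y))
```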

Local regression is sometimes referred to as a memory-based procedure because the whole training data set is required to make each prediction.

In order to perform local regression, a number of important choices must be made.

The most important decision is the size of the span S. The span plays a role like $\lambda$ did for smoothing splines, offering some control over the bias-variance trade-off. The smaller the span S, the more local, flexible, and wiggly the resulting non-linear fit will be. Conversely, a larger value of S will lead to a more global fit. Again, cross validation is useful for choosing an appropriate value for S.

In the multiple linear regression setting, local regression can be generalized to yield a multiple linear regression model in which some variable coefficients are globally static while other variable coefficients are localized. These types of varying coefficient models are a useful way of adapting a model to the most recently gathered data.

Local regression can also be useful in multi-dimensional settings, though the curse of dimensionality limits its effectiveness to just a few variables.

Generalized Additive Models

Generalized additive models (GAMs) offer a general framework for extending a standard linear model by allowing non-linear functions of each of the predictors while maintaining additivity. GAMs can be applied with both quantitative and qualitative responses.

One way to extend the multiple linear regression model

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \epsilon_i$$

to allow for non-linear relationships between each feature and the response is to replace each linear component, $\beta_j x_{ij}$, with a smooth non-linear function, $f_j(x_{ij})$, which would yield the model

$$y_i = \beta_0 + f_1(x_{i1}) + f_2(x_{i2}) + \dots + f_p(x_{ip}) + \epsilon_i .$$

This model is additive because a separate $f_j$ is calculated for each $x_{ij}$ and the results are then added together.

The additive nature of GAMs makes them more interpretable than some other types of models.

GAMs allow for using the many methods of fitting functions to single variables as building blocks for fitting an additive model.

Backfitting can be used to fit GAMs in situations where least squares cannot be used. Backfitting fits a model involving multiple predictors by repeatedly updating the fit for each predictor in turn, holding the others fixed. This approach has the benefit that each time a function is updated, the fitting method for a variable can be applied to a partial residual.

A partial residual is the remainder left over after subtracting the fitted contributions of the variables being held fixed from the response. This residual can then be used as the response in a non-linear regression on the variable being updated.

For example, given a model of

$$y_i = \beta_0 + f_1(x_{i1}) + f_2(x_{i2}) + f_3(x_{i3}) + \epsilon_i ,$$

a partial residual for $x_{i3}$ could be computed as

$$r_i = y_i - f_1(x_{i1}) - f_2(x_{i2}) .$$

The yielded residual can then be used as the response in order to fit $f_3$ in a non-linear regression on $x_{i3}$.
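A minimal numpy sketch of backfitting (synthetic data; each $f_j$ is updated with a simple cubic polynomial fit to the partial residuals, standing in for whatever smoother would actually be used):

```python
import numpy as np

rng = np.random.default_rng(5)
n = 500
X = rng.uniform(-2, 2, size=(n, 3))
y = np.sin(X[:, 0]) + X[:, 1] ** 2 + 0.5 * X[:, 2] + rng.normal(scale=0.3, size=n)

def smooth(x, r, degree=3):
    """Fit a cubic polynomial of x to the partial residual r, return fitted values."""
    coefs = np.polyfit(x, r, degree)
    return np.polyval(coefs, x)

beta0 = y.mean()
f = np.zeros_like(X)                       # columns hold f_1(x_1), f_2(x_2), f_3(x_3)

for _ in range(20):                        # backfitting iterations
    for j in range(X.shape[1]):
        # Partial residual: remove the intercept and all other fitted functions
        r = y - beta0 - f[:, [k for k in range(X.shape[1]) if k != j]].sum(axis=1)
        f[:, j] = smooth(X[:, j], r)
        f[:, j] -= f[:, j].mean()          # center each f_j so beta0 stays the mean of y

fitted = beta0 + f.sum(axis=1)
print("training MSE:", np.mean((y - fitted) ** 2))
```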

Pros and Cons of GAMs

Overall, GAMs provide a useful compromise between linear and fully non-parametric models.

GAMs for Classification Problems

GAMs can also be used in scenarios where $Y$ is qualitative. For simplicity, what follows assumes $Y$ takes on values of 0 or 1, and defines $p(X) = \Pr(Y = 1 \mid X)$ to be the conditional probability that the response is equal to one.

Similar to using GAMs for linear regression, using GAMs for classification begins by modifying the logistic regression model

$$\log \left( \frac{p(X)}{1 - p(X)} \right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p$$

to pair each predictor with a smooth non-linear function instead of a constant coefficient:

$$\log \left( \frac{p(X)}{1 - p(X)} \right) = \beta_0 + f_1(X_1) + f_2(X_2) + \dots + f_p(X_p),$$

which yields a logistic regression GAM. From this point, logistic regression GAMs share all the same pros and cons as their linear regression counterparts.
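One simple way to approximate such a model, sketched below under loose assumptions, is to expand each predictor in its own basis (a plain cubic polynomial basis here, for brevity) and fit an ordinary logistic regression on the expanded columns with statsmodels; a dedicated GAM fitting routine would normally be used instead:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(6)
n = 1000
X = rng.uniform(-2, 2, size=(n, 2))
logit = np.sin(2 * X[:, 0]) + X[:, 1] ** 2 - 1
y = rng.binomial(1, 1 / (1 + np.exp(-logit)))

def poly_basis(x, degree=3):
    """Columns x, x^2, ..., x^degree for one predictor (intercept added later)."""
    return np.column_stack([x ** d for d in range(1, degree + 1)])

# Basis-expand each predictor separately so the model stays additive
design = sm.add_constant(np.column_stack([poly_basis(X[:, j]) for j in range(X.shape[1])]))

result = sm.Logit(y, design).fit(disp=0)
print(result.params)
```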


Next: Chapter 8 - Tree-Based Methods