
Linear Methods for Regression and Shrinkage Methods

Reference: The Elements of Statistical Learning, by T. Hastie, R. Tibshirani, J. Friedman, Springer

1. Linear Regression Models

• Input vector: X = (X_1, X_2, …, X_p)

• Each X_j is an attribute / feature / predictor (independent variable)
• The linear regression model:

  f(X) = β_0 + Σ_{j=1}^{p} X_j β_j

• The output Y is called the response (dependent variable)

• The β_j's are unknown parameters (coefficients)

2. Linear Regression Models: Least Squares

• A set of training data: (x_1, y_1), …, (x_N, y_N)
• Each x_i = (x_i1, x_i2, …, x_ip) is a vector of attribute values

• Each y_i is the corresponding response value / label
• We wish to estimate the parameters β = (β_0, β_1, …, β_p)

3. Linear Regression Models: Least Squares

• One common approach is the method of least squares:
• Pick the coefficients β to minimize the residual sum of squares:

  RSS(β) = Σ_{i=1}^{N} (y_i − f(x_i))² = Σ_{i=1}^{N} (y_i − β_0 − Σ_{j=1}^{p} x_ij β_j)²

4. Linear Regression Models: Least Squares

• This criterion is reasonable if the training observations (x_i, y_i) represent independent random draws.

• Even if the x_i's were not drawn randomly, the criterion is still valid if the y_i's are conditionally independent given the inputs x_i.

5. Linear Regression Models: Least Squares

• Least squares makes no assumption about the validity of the model
• It simply finds the best linear fit to the data

6. Linear Regression Models: Finding the Residual Sum of Squares

• Denote by X the N × (p + 1) matrix with each row an input vector (with a 1 in the first position)
• Let y be the N-vector of outputs in the training set
• The residual sum of squares is a quadratic function of the parameters:

  RSS(β) = (y − Xβ)^T (y − Xβ)

7. Linear Regression Models: Finding the Residual Sum of Squares

• Set the first derivative to zero:

  ∂RSS/∂β = −2 X^T (y − Xβ) = 0

• Obtain the unique solution (assuming X^T X is nonsingular):

  β̂ = (X^T X)^{-1} X^T y

8. Linear Regression Models: Orthogonal Projection

• The fitted values at the training inputs are:

  ŷ = X β̂ = X (X^T X)^{-1} X^T y

• The matrix H = X (X^T X)^{-1} X^T appearing in the above equation is called the “hat” matrix, because it puts the hat on y

9. Linear Regression Models: Example

• Training Data:

  x = (x1, x2, x3)    y
  (1, 2, 1)           22
  (2, 0, 4)           49
  (3, 4, 2)           39
  (4, 2, 3)           52
  (5, 4, 1)           38

10. Linear Regression Models: Example

• With a column of 1s prepended for the intercept:

      [ 1  1  2  1 ]        [ 22 ]
      [ 1  2  0  4 ]        [ 49 ]
  X = [ 1  3  4  2 ]    y = [ 39 ]
      [ 1  4  2  3 ]        [ 52 ]
      [ 1  5  4  1 ]        [ 38 ]

• β̂ = (X^T X)^{-1} X^T y ≈ (8.13, 4.04, 0.51, 8.43)^T, i.e. β̂_0 ≈ 8.13, β̂_1 ≈ 4.04, β̂_2 ≈ 0.51, β̂_3 ≈ 8.43

11. Linear Regression Models: Example

• With β̂ ≈ (8.13, 4.04, 0.51, 8.43)^T from the previous slide:

• Fitted values ŷ = X β̂ and residual vector r = y − ŷ:

  y     ŷ        r = y − ŷ
  22    21.61     0.39
  49    49.91    −0.91
  39    39.13    −0.13
  52    50.57     1.43
  38    38.78    −0.78
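The numbers in this example can be reproduced with a short NumPy sketch (not part of the original slides); the data are taken from the table on slide 9, and the printed values are rounded.

```python
import numpy as np

# Training data from slide 9: rows of x = (x1, x2, x3) and the responses y.
X_raw = np.array([[1, 2, 1],
                  [2, 0, 4],
                  [3, 4, 2],
                  [4, 2, 3],
                  [5, 4, 1]], dtype=float)
y = np.array([22, 49, 39, 52, 38], dtype=float)

# Prepend a column of 1s for the intercept (the "1 in the first position").
X = np.column_stack([np.ones(len(y)), X_raw])

# Least squares solution of the normal equations: beta_hat = (X^T X)^{-1} X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(np.round(beta_hat, 2))        # [8.13 4.04 0.51 8.43]

y_hat = X @ beta_hat                # fitted values, approx [21.61 49.91 39.13 50.57 38.78]
r = y - y_hat                       # residuals,     approx [ 0.39 -0.91 -0.13  1.43 -0.78]
print(np.round(y_hat, 2), np.round(r, 2))
```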

12. Linear Regression Models: Orthogonal Projection

• A different, geometrical representation of the least squares estimate, in ℝ^N

• Denote the column vectors of X by x_0, x_1, …, x_p, with x_0 ≡ 1

13. Linear Regression Models: Orthogonal Projection

• These vectors span a subspace of ℝ^N, also referred to as the column space of X
• Minimizing RSS(β) by choosing β̂ makes the residual vector r = y − ŷ orthogonal to this subspace
• Checking with the example (using the rounded residuals r ≈ (0.39, −0.91, −0.13, 1.43, −0.78)):

  r · x_0 = 0.39 − 0.91 − 0.13 + 1.43 − 0.78 ≈ 0
  r · x_1 = 0.39 − 0.91·2 − 0.13·3 + 1.43·4 − 0.78·5 ≈ 0

14. Linear Regression Models: Orthogonal Projection

• ŷ = X β̂ is the orthogonal projection of y onto this subspace
• Projection matrix: the hat matrix H = X (X^T X)^{-1} X^T computes the orthogonal projection
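A small continuation of the sketch above (again not from the slides) checks the projection interpretation numerically: H is symmetric and idempotent, and the residual is orthogonal to every column of X.

```python
import numpy as np

# Same design matrix (with the intercept column) and response as in the example.
X = np.array([[1, 1, 2, 1],
              [1, 2, 0, 4],
              [1, 3, 4, 2],
              [1, 4, 2, 3],
              [1, 5, 4, 1]], dtype=float)
y = np.array([22, 49, 39, 52, 38], dtype=float)

H = X @ np.linalg.inv(X.T @ X) @ X.T     # the "hat" matrix
y_hat = H @ y                            # orthogonal projection of y onto the column space of X
r = y - y_hat                            # residual vector

print(np.allclose(H, H.T))               # True: H is symmetric ...
print(np.allclose(H @ H, H))             # True: ... and idempotent, i.e. a projection matrix
print(np.round(X.T @ r, 8))              # ~0 for every column: r is orthogonal to the subspace
```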

15. Linear Regression Models: Sampling Properties of the Parameters

• Assume the observations y_i are uncorrelated and have constant variance σ², and that the x_i are fixed (non-random)
• Variance-covariance matrix of the least squares parameter estimates:

  Var(β̂) = (X^T X)^{-1} σ²

16. Linear Regression Models: Sampling Properties of the Parameters

• To test the hypothesis that a particular coefficient β_j = 0:
• Form the standardized coefficient or Z-score:

  z_j = β̂_j / (σ̂ √v_j)

• Here v_j is the j-th diagonal element of (X^T X)^{-1} and σ̂² is the estimated error variance
• Under the null hypothesis that β_j = 0, z_j is distributed as t_{N−p−1} (a t distribution with N − p − 1 degrees of freedom)
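As a hedged illustration of these sampling properties, the sketch below estimates σ², forms Var(β̂) = (X^T X)^{-1} σ̂², and computes Z-scores with t-based p-values on the toy data set above (only one residual degree of freedom here, so the numbers are not very meaningful; SciPy is assumed to be available).

```python
import numpy as np
from scipy import stats

# Toy data from the earlier example: N = 5 observations, p = 3 inputs plus an intercept.
X = np.array([[1, 1, 2, 1],
              [1, 2, 0, 4],
              [1, 3, 4, 2],
              [1, 4, 2, 3],
              [1, 5, 4, 1]], dtype=float)
y = np.array([22, 49, 39, 52, 38], dtype=float)
N, k = X.shape                               # k = p + 1 columns including the intercept

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
r = y - X @ beta_hat

sigma2_hat = r @ r / (N - k)                 # unbiased estimate of sigma^2
cov_beta = sigma2_hat * XtX_inv              # Var(beta_hat) = (X^T X)^{-1} sigma^2

z = beta_hat / np.sqrt(np.diag(cov_beta))    # Z-scores
p_values = 2 * stats.t.sf(np.abs(z), df=N - k)
print(np.round(z, 2), np.round(p_values, 3))
```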

17. Subset Selection: Motivation

• Prediction accuracy can sometimes be improved by shrinking some coefficients or setting them to zero
• With a large number of predictors, we would like to determine a smaller subset that exhibits the strongest effects
• With subset selection we retain only a subset of the variables, and eliminate the rest from the model
• Least squares regression is then used to estimate the coefficients of the inputs that are retained

18. Subset Selection: Motivation

• Best subset regression finds, for each k ∈ {0, 1, 2, …, p}, the subset of size k that gives the smallest residual sum of squares
• An efficient algorithm, the leaps and bounds procedure, makes this feasible for p as large as 30 or 40
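The leaps and bounds procedure itself is not shown in the slides; the brute-force sketch below (my own synthetic data and naming) simply enumerates all subsets of each size and keeps the one with the smallest residual sum of squares, which is what best subset selection computes, but is only feasible for small p.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
N, p = 50, 6
X = rng.normal(size=(N, p))                              # synthetic inputs
beta_true = np.array([3.0, 0.0, -2.0, 0.0, 0.0, 1.5])    # only three truly active predictors
y = X @ beta_true + rng.normal(scale=0.5, size=N)

def rss(cols):
    """Residual sum of squares of the least squares fit on the given columns (plus intercept)."""
    Xd = np.column_stack([np.ones(N), X[:, list(cols)]])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    res = y - Xd @ beta
    return res @ res

# For each subset size k, keep the subset with the smallest RSS.
for k in range(1, p + 1):
    best = min(combinations(range(p), k), key=rss)
    print(k, best, round(rss(best), 2))   # the best RSS necessarily decreases as k grows
```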

19. Subset Selection: Motivation

• Figure 3.5 in the reference shows all the subset models (8 features) that are eligible for selection by the best-subsets approach

• The best-subset curve (the red lower boundary in Figure 3.5) is necessarily decreasing, so it cannot be used to select the subset size k

20. Subset Selection: Introduction

• Performing well on the training data is not necessarily good for actual use
• Use a different data set, a test set, for measuring the error of the learned model
• Choose the smallest model that minimizes an estimate of the expected prediction error
• Use the cross-validation method (more details later)
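A minimal sketch of this idea, using scikit-learn on synthetic data of my own: for each subset size the training RSS keeps decreasing, while the cross-validated error typically levels off or rises, and the latter is what should guide the choice of k. (For brevity the subset is chosen on the full data; a more careful analysis would redo the selection inside each fold.)

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
N, p = 80, 6
X = rng.normal(size=(N, p))
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 0.0, 1.5]) + rng.normal(scale=0.5, size=N)

def train_rss(cols):
    model = LinearRegression().fit(X[:, list(cols)], y)
    return np.sum((y - model.predict(X[:, list(cols)])) ** 2)

for k in range(1, p + 1):
    # Best subset of size k by training RSS (brute force, small p only).
    best = min(combinations(range(p), k), key=train_rss)
    # Estimate the expected prediction error by 5-fold cross-validation.
    cv_mse = -cross_val_score(LinearRegression(), X[:, list(best)], y,
                              cv=5, scoring="neg_mean_squared_error").mean()
    print(k, best, round(train_rss(best), 2), round(cv_mse, 3))
```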

21. Subset Selection: Stepwise Selection

Forward Stepwise Selection
• Starts with the intercept
• Sequentially adds into the model the predictor that most improves the fit
• Clever updating algorithms exploit the QR decomposition of the current fit to rapidly establish the next candidate
• Produces a sequence of models indexed by k, the subset size, which must be determined

• Greedy algorithm
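A sketch of the greedy forward stepwise idea on synthetic data of my own; it refits from scratch at each step for clarity rather than using the QR-based updating mentioned above.

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 100, 8
X = rng.normal(size=(N, p))
y = X @ np.concatenate([[4.0, -3.0, 2.0], np.zeros(p - 3)]) + rng.normal(size=N)

def rss(cols):
    """RSS of the least squares fit on the intercept plus the given columns."""
    Xd = np.column_stack([np.ones(N), X[:, list(cols)]])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return np.sum((y - Xd @ beta) ** 2)

active, remaining = [], set(range(p))
while remaining:
    # Greedily add the predictor that most improves the fit (largest RSS reduction).
    j_best = min(remaining, key=lambda j: rss(active + [j]))
    active.append(j_best)
    remaining.discard(j_best)
    print(list(active), round(rss(active), 2))   # one model per subset size k
```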

22. Subset Selection: Stepwise Selection

Backward Stepwise Selection
• Starts with the full model
• Sequentially deletes the predictor that has the least impact on the fit
• Candidate for dropping: the variable with the smallest Z-score
• Can only be used when N > p

23. Shrinkage Methods: Motivation

• Subset selection retains a subset of the predictors and discards the rest
• It produces a model that is
  - interpretable
  - possibly of lower prediction error than the full model
• But it is a discrete process (variables are either retained or discarded), so it often exhibits high variance
• As a result it often does not reduce the prediction error of the full model

24. Shrinkage Methods: Motivation

• Shrinkage methods are more continuous
• They don't suffer as much from high variability

25. Shrinkage Methods: Motivation

• When there are many correlated variables in a linear regression model, their coefficients are poorly determined and exhibit high variance
• A wildly large positive coefficient on one variable can be canceled by a similarly large negative coefficient on its correlated cousin
• The problem is alleviated by imposing a size constraint on the coefficients
• Ridge solutions are not equivariant under scaling of the inputs, so one normally standardizes the inputs before solving

26. Shrinkage Methods: Ridge Regression

• Shrinks the regression coefficients by imposing a penalty on their size
• The ridge coefficients minimize a penalized residual sum of squares:

  β̂^ridge = argmin_β { Σ_{i=1}^{N} (y_i − β_0 − Σ_{j=1}^{p} x_ij β_j)² + λ Σ_{j=1}^{p} β_j² }

27. Shrinkage Methods: Ridge Regression

• λ ≥ 0 is a complexity parameter that controls the amount of shrinkage: the larger the value of λ, the greater the amount of shrinkage
• The coefficients are shrunk toward zero (and toward each other)

28. Shrinkage Methods: Ridge Regression

• An equivalent way to write the ridge problem:

  β̂^ridge = argmin_β Σ_{i=1}^{N} (y_i − β_0 − Σ_{j=1}^{p} x_ij β_j)²  subject to  Σ_{j=1}^{p} β_j² ≤ t

• This makes explicit the size constraint on the parameters
• There is a one-to-one correspondence between the parameters λ and t

29. Shrinkage Methods: Ridge Regression

• Notice that the intercept β_0 is left out of the penalty
• Assume the centering has been done, so the matrix X has p (rather than p + 1) columns
  - Each x_ij is replaced by x_ij − x̄_j
  - Estimate β_0 by ȳ = (1/N) Σ_i y_i
• The criterion in matrix form:

  RSS(λ) = (y − Xβ)^T (y − Xβ) + λ β^T β

30. Shrinkage Methods: Ridge Regression

• The ridge regression solution:

  β̂^ridge = (X^T X + λ I)^{-1} X^T y

• Here I is the p × p identity matrix
• Even with the quadratic penalty λ β^T β, the ridge regression solution is still a linear function of y
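A sketch of the closed-form ridge solution on centered synthetic data (my own choices of data and λ), checked against scikit-learn's Ridge, which uses the same criterion ||y − Xβ||² + λ||β||² when fit_intercept=False. In practice the inputs would normally be standardized as well.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(3)
N, p = 60, 5
X = rng.normal(size=(N, p))
y = X @ np.array([2.0, -1.0, 0.0, 3.0, 0.5]) + rng.normal(size=N)

# Center the inputs and the response, so no intercept is needed (beta_0_hat = y_bar).
Xc = X - X.mean(axis=0)
yc = y - y.mean()

lam = 10.0                                                   # complexity parameter lambda
beta_ridge = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)

sk = Ridge(alpha=lam, fit_intercept=False).fit(Xc, yc)       # same penalized criterion
print(np.allclose(beta_ridge, sk.coef_))                     # True
print(np.round(beta_ridge, 3))
```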

31. Lasso

• A shrinkage method like ridge, with subtle but important differences
• The lasso estimate:

  β̂^lasso = argmin_β Σ_{i=1}^{N} (y_i − β_0 − Σ_{j=1}^{p} x_ij β_j)²  subject to  Σ_{j=1}^{p} |β_j| ≤ t

• Similar to ridge regression, we can center the inputs, estimate β_0 by ȳ, and then fit a model without an intercept

32. Lasso

• The lasso problem in the equivalent Lagrangian form:

  β̂^lasso = argmin_β { (1/2) Σ_{i=1}^{N} (y_i − β_0 − Σ_{j=1}^{p} x_ij β_j)² + λ Σ_{j=1}^{p} |β_j| }

• Similarity to the ridge regression problem: the L2 ridge penalty Σ_j β_j² is replaced by the L1 lasso penalty Σ_j |β_j|

33. Lasso

• The latter constraint makes the solution nonlinear in the y_i
• There is no closed-form expression as in ridge regression
• Computing the lasso solution is a quadratic programming problem
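In practice one rarely solves the quadratic program directly; the sketch below just calls scikit-learn's coordinate-descent Lasso on centered synthetic data of my own. Note that scikit-learn minimizes (1/(2N))·RSS + α·Σ|β_j|, so its α corresponds to λ/N in the Lagrangian form above.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
N, p = 100, 10
X = rng.normal(size=(N, p))
y = X @ np.concatenate([[3.0, -2.0, 1.5], np.zeros(p - 3)]) + rng.normal(size=N)

Xc = X - X.mean(axis=0)          # centered inputs; the intercept is handled by y_bar
yc = y - y.mean()

for alpha in [0.01, 0.1, 0.5, 1.0]:
    beta = Lasso(alpha=alpha, fit_intercept=False, max_iter=10_000).fit(Xc, yc).coef_
    print(alpha, np.round(beta, 2))   # larger alpha -> more coefficients exactly zero
```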

34. Lasso

• Making t sufficiently small will cause some of the coefficients to be exactly zero, so the lasso does a kind of continuous subset selection
• If t is chosen larger than t_0 = Σ_{j=1}^{p} |β̂_j| (where the β̂_j are the least squares estimates), then the lasso estimates are just the β̂_j's

• For example, for t = t_0 / 2 the least squares coefficients are shrunk by about 50% on average

35. Discussion: Orthonormal Case

• When the input matrix X is orthonormal (X^T X = I), the three procedures have explicit solutions
• Each method applies a simple transformation to the least squares estimate β̂_j

  Estimator               Formula
  Best subset (size M)    β̂_j · 1[ |β̂_j| ≥ |β̂_(M)| ]
  Ridge                   β̂_j / (1 + λ)
  Lasso                   sign(β̂_j) · ( |β̂_j| − λ )_+

• Here |β̂_(M)| is the M-th largest coefficient in absolute value, and (·)_+ denotes the positive part

36. Discussion

• Ridge regression does a proportional shrinkage
• “Soft thresholding” (also used in the context of wavelet-based smoothing): the lasso translates each coefficient by a constant factor λ, truncating at zero
• “Hard thresholding”: best-subset selection drops all variables with coefficients smaller than the M-th largest
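The three transformations in the table can be written as one-line functions; the sketch below (names and example coefficients are mine) applies proportional shrinkage, soft thresholding, and hard thresholding to a vector of least squares estimates.

```python
import numpy as np

def ridge_shrink(beta, lam):
    """Proportional shrinkage: beta_j / (1 + lambda)."""
    return beta / (1.0 + lam)

def lasso_soft_threshold(beta, lam):
    """Soft thresholding: sign(beta_j) * (|beta_j| - lambda)_+."""
    return np.sign(beta) * np.maximum(np.abs(beta) - lam, 0.0)

def best_subset_hard_threshold(beta, M):
    """Hard thresholding: keep the M largest |beta_j|, set the rest to zero."""
    keep = np.argsort(-np.abs(beta))[:M]
    out = np.zeros_like(beta)
    out[keep] = beta[keep]
    return out

beta_ls = np.array([4.0, -2.5, 1.2, 0.3, -0.1])     # hypothetical least squares estimates
print(ridge_shrink(beta_ls, lam=1.0))               # every coefficient shrunk proportionally
print(lasso_soft_threshold(beta_ls, lam=1.0))       # translated toward zero, small ones truncated
print(best_subset_hard_threshold(beta_ls, M=2))     # only the two largest survive
```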

37. Discussion

Nonorthogonal Case
• Suppose that there are only two parameters, β_1 and β_2
• The residual sum of squares has elliptical contours, centered at the full least squares estimate
• The constraint region
  - for ridge regression is the disk β_1² + β_2² ≤ t
  - for the lasso is the diamond |β_1| + |β_2| ≤ t (it has corners)
  - both methods find the first point where the elliptical contours hit the constraint region

38. Discussion

39. Discussion

• Unlike the disk, the diamond (lasso constraint region) has corners
• For the lasso:
  - if the solution occurs at a corner, it has one parameter β_j equal to zero
  - when p > 2, the diamond becomes a rhomboid, which has many corners, flat edges and faces
  - there are then many more opportunities for the estimated parameters to be zero

40. Methods Using Derived Input Directions

• In many situations we have a large number of inputs, often correlated
• One idea is to produce a small number of linear combinations Z_m = Σ_{j=1}^{p} φ_jm X_j, m = 1, …, M, of the original inputs X_j

• The Z_m are used in place of the X_j as inputs in the regression
• One major consideration is how the linear combinations are constructed.

41. Methods Using Derived Input Directions: Principal Components Regression

• The linear combinations Z_m used are the principal components of X
• The singular value decomposition of the N × p matrix X has the form X = U D V^T
  - Here U and V are N × p and p × p orthogonal matrices, with the columns of U spanning the column space of X and the columns of V spanning the row space
  - D is a p × p diagonal matrix with diagonal entries d_1 ≥ d_2 ≥ ⋯ ≥ d_p ≥ 0, called the singular values of X

42. Methods Using Derived Input Directions: Principal Components Regression

• The eigenvectors v_m (the columns of V) are called the principal component directions of X

• Form the derived input columns z_m = X v_m
• Regress y on z_1, z_2, …, z_M for some M ≤ p

43. Methods Using Derived Input Directions: Principal Components Regression

44. Methods Using Derived Input Directions: Principal Components Regression

• Since the z_m are orthogonal, this regression is just a sum of univariate regressions:

  ŷ^pcr_(M) = ȳ 1 + Σ_{m=1}^{M} θ̂_m z_m,  where θ̂_m = ⟨z_m, y⟩ / ⟨z_m, z_m⟩

45. Methods Using Derived Input Directions: Principal Components Regression

• Since the z_m are each linear combinations of the original x_j,
• we can express the solution in terms of coefficients of the x_j:

  β̂^pcr(M) = Σ_{m=1}^{M} θ̂_m v_m

46. Methods Using Derived Input Directions: Principal Components Regression

• As with ridge regression, principal components depend on the scaling of the inputs, so typically we first standardize them
• If M = p, we get back the usual least squares estimates
  - the columns of Z = UD span the column space of X

• For M < p, we get a reduced regression
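A sketch of principal components regression via the SVD on standardized synthetic data (my own); it also checks that using all M = p components recovers the least squares fit.

```python
import numpy as np

rng = np.random.default_rng(5)
N, p = 100, 6
X = rng.normal(size=(N, p))
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=N)           # make two inputs strongly correlated
y = X @ np.array([2.0, 0.0, -1.0, 1.0, 0.0, 0.5]) + rng.normal(size=N)

# Standardize the inputs and center the response.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
yc = y - y.mean()

# SVD: X = U D V^T; the columns of V are the principal component directions.
U, d, Vt = np.linalg.svd(Xs, full_matrices=False)
V = Vt.T

def pcr_coef(M):
    """beta_pcr(M) = sum_{m<=M} theta_m v_m with theta_m = <z_m, y> / <z_m, z_m>."""
    Z = Xs @ V[:, :M]                                   # derived inputs z_m = X v_m
    theta = (Z.T @ yc) / np.sum(Z * Z, axis=0)
    return V[:, :M] @ theta

beta_ls, *_ = np.linalg.lstsq(Xs, yc, rcond=None)
print(np.allclose(pcr_coef(p), beta_ls))                # True: M = p gives back least squares
print(np.round(pcr_coef(2), 3))                         # M < p gives a reduced regression
```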
