Linear Methods for Regression and Shrinkage Methods
Reference: The Elements of Statistical Learning, by T. Hastie, R. Tibshirani, J. Friedman, Springer.

Linear Regression Models: Least Squares
• Input vector $X = (X_1, X_2, \ldots, X_p)$.
• Each $X_j$ is an attribute / feature / predictor (independent variable).
• The linear regression model: $f(X) = \beta_0 + \sum_{j=1}^{p} X_j \beta_j$.
• The output $Y$ is called the response (dependent variable).
• The $\beta_j$'s are unknown parameters (coefficients).
• A set of training data $(x_1, y_1), \ldots, (x_N, y_N)$ is available.
• Each $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip})^T$ corresponds to $p$ attributes.
• Each $y_i$ is a class attribute value / a label.
• We wish to estimate the parameters $\beta$.
• One common approach is the method of least squares: pick the coefficients $\beta$ to minimize the residual sum of squares
  $\mathrm{RSS}(\beta) = \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2.$
• This criterion is reasonable if the training observations $(x_i, y_i)$ represent independent random draws.
• Even if the $x_i$'s were not drawn randomly, the criterion is still valid if the $y_i$'s are conditionally independent given the inputs $x_i$.
• Least squares makes no assumption about the validity of the model; it simply finds the best linear fit to the data.

Linear Regression Models: Finding the Residual Sum of Squares
• Denote by $\mathbf{X}$ the $N \times (p+1)$ matrix with each row an input vector (with a 1 in the first position).
• Let $\mathbf{y}$ be the $N$-vector of outputs in the training set.
• The residual sum of squares is a quadratic function in the parameters:
  $\mathrm{RSS}(\beta) = (\mathbf{y} - \mathbf{X}\beta)^T (\mathbf{y} - \mathbf{X}\beta).$
• Set the first derivative to zero: $\partial \mathrm{RSS}/\partial \beta = -2\mathbf{X}^T(\mathbf{y} - \mathbf{X}\beta) = 0$.
• Assuming $\mathbf{X}^T\mathbf{X}$ is nonsingular, obtain the unique solution: $\hat{\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$.

Linear Regression Models: Orthogonal Projection
• The fitted values at the training inputs are $\hat{\mathbf{y}} = \mathbf{X}\hat{\beta} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$.
• The matrix $\mathbf{H} = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T$ appearing in the above equation is called the "hat" matrix because it puts the hat on $\mathbf{y}$.

Linear Regression Models: Example
• Training data ($N = 5$, $p = 3$):

  x          y
  (1, 2, 1)  22
  (2, 0, 4)  49
  (3, 4, 2)  39
  (4, 2, 3)  52
  (5, 4, 1)  38

• With a leading column of 1s,
  $\mathbf{X} = \begin{pmatrix} 1 & 1 & 2 & 1 \\ 1 & 2 & 0 & 4 \\ 1 & 3 & 4 & 2 \\ 1 & 4 & 2 & 3 \\ 1 & 5 & 4 & 1 \end{pmatrix}, \qquad \mathbf{y} = (22, 49, 39, 52, 38)^T.$
• $\hat{\beta} = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y} \approx (8.13, 4.04, 0.51, 8.43)^T.$
• Fitted values $\hat{\mathbf{y}}$ and residual vector $\mathbf{y} - \hat{\mathbf{y}}$:

  fitted  residual
  21.61    0.39
  49.91   -0.91
  39.13   -0.13
  50.57    1.43
  38.78   -0.78

Linear Regression Models: Orthogonal Projection
• A different geometrical representation of the least squares estimate, in $\mathbb{R}^N$.
• Denote the column vectors of $\mathbf{X}$ by $\mathbf{x}_0, \mathbf{x}_1, \ldots, \mathbf{x}_p$, with $\mathbf{x}_0 \equiv \mathbf{1}$.
• These vectors span a subspace of $\mathbb{R}^N$, also referred to as the column space of $\mathbf{X}$.
• We minimize $\mathrm{RSS}(\beta)$ by choosing $\hat{\beta}$ so that the residual vector $\mathbf{y} - \hat{\mathbf{y}}$ is orthogonal to this subspace.
• Some checks with the example above: the residuals are orthogonal to the first column (all 1s),
  $0.39 - 0.91 - 0.13 + 1.43 - 0.78 = 0,$
  and to the second column $(1, 2, 3, 4, 5)^T$,
  $0.39 \cdot 1 - 0.91 \cdot 2 - 0.13 \cdot 3 + 1.43 \cdot 4 - 0.78 \cdot 5 = 0.$
• $\hat{\mathbf{y}}$ is the orthogonal projection of $\mathbf{y}$ onto this subspace.
• Projection matrix: the hat matrix $\mathbf{H}$ computes the orthogonal projection.

Linear Regression Models: Sampling Properties of the Parameters
• Assume the observations $y_i$ are uncorrelated and have constant variance $\sigma^2$, and that the $x_i$ are fixed (non-random).
• The variance-covariance matrix of the least squares parameter estimates is $\mathrm{Var}(\hat{\beta}) = (\mathbf{X}^T\mathbf{X})^{-1}\sigma^2$.
• To test the hypothesis that a particular coefficient $\beta_j = 0$, form the standardized coefficient or Z-score
  $z_j = \frac{\hat{\beta}_j}{\hat{\sigma}\sqrt{v_j}},$
  where $v_j$ is the $j$-th diagonal element of $(\mathbf{X}^T\mathbf{X})^{-1}$.
• Under the null hypothesis that $\beta_j = 0$, $z_j$ is distributed as $t_{N-p-1}$ (a $t$ distribution with $N - p - 1$ degrees of freedom).
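As a concrete check on the worked example above, the following sketch recomputes the least squares fit via the normal equations. It assumes NumPy is available; the printed values should match the coefficients, fitted values, and residuals listed above up to rounding.

```python
import numpy as np

# Design matrix from the example above, with a leading column of 1s for the intercept.
X = np.array([
    [1, 1, 2, 1],
    [1, 2, 0, 4],
    [1, 3, 4, 2],
    [1, 4, 2, 3],
    [1, 5, 4, 1],
], dtype=float)
y = np.array([22, 49, 39, 52, 38], dtype=float)

# Least squares solution of the normal equations (X^T X) beta = X^T y.
# Solving the linear system is preferred over forming (X^T X)^{-1} explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print("beta_hat: ", np.round(beta_hat, 2))       # approx [8.13 4.04 0.51 8.43]

# Fitted values and residuals.
y_hat = X @ beta_hat
resid = y - y_hat
print("fitted:   ", np.round(y_hat, 2))          # approx [21.61 49.91 39.13 50.57 38.78]
print("residuals:", np.round(resid, 2))          # approx [ 0.39 -0.91 -0.13  1.43 -0.78]

# The residual vector is orthogonal to every column of X (the column space),
# which is the geometric picture behind the hat matrix as an orthogonal projection.
print("X^T r:    ", np.round(X.T @ resid, 10))   # all zeros up to floating point error
```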
Subset Selection: Motivation
• Prediction accuracy can sometimes be improved by shrinking some coefficients or setting them to zero.
• With a large number of predictors, we would like to determine a smaller subset that exhibits the strongest effects.
• With subset selection we retain only a subset of the variables and eliminate the rest from the model.
• Least squares regression is used to estimate the coefficients of the inputs that are retained.
• Best subset regression finds, for each $k \in \{0, 1, 2, \ldots, p\}$, the subset of size $k$ that gives the smallest residual sum of squares.
• An efficient algorithm, the leaps and bounds procedure, makes this feasible for $p$ as large as 30 or 40.
• Figure 3.5 in the reference shows all the subset models (8 features) that are eligible for selection by the best-subsets approach.
• The best-subset curve (the red lower boundary in Figure 3.5) is necessarily decreasing, so it cannot be used to select the subset size $k$.

Subset Selection: Introduction
• Performing well on the training data is not necessarily good for actual use.
• Use a different data set, a test set, for measuring the error of the learned model.
• Choose the smallest model that minimizes an estimate of the expected prediction error.
• Use cross-validation (more details later).

Subset Selection: Stepwise Selection
Forward Stepwise Selection
• Starts with the intercept.
• Sequentially adds into the model the predictor that most improves the fit.
• Clever updating algorithms exploit the QR decomposition of the current fit to rapidly establish the next candidate.
• Produces a sequence of models indexed by $k$, the subset size, which must be determined.
• A greedy algorithm.
Backward Stepwise Selection
• Starts with the full model.
• Sequentially deletes the predictor that has the least impact on the fit.
• Candidate for dropping: the variable with the smallest Z-score.
• Can only be used when $N > p$.

Shrinkage Methods: Motivation
• Subset selection retains a subset of the predictors and discards the rest.
• It produces a model that is interpretable and possibly has lower prediction error than the full model.
• However, it is a discrete process (variables are either retained or discarded), so it often exhibits high variance and does not reduce the prediction error of the full model.
• Shrinkage methods are more continuous and don't suffer as much from high variability.
• When there are many correlated variables in a linear regression model, their coefficients can be poorly determined and exhibit high variance.
• A wildly large positive coefficient on one variable can be canceled by a similarly large negative coefficient on its correlated cousin.
• The problem is alleviated by imposing a size constraint on the coefficients.
• Ridge solutions are not equivariant under scaling of the inputs, so one normally standardizes the inputs before solving.

Shrinkage Methods: Ridge Regression
• Ridge regression shrinks the regression coefficients by imposing a penalty on their size.
• The ridge coefficients minimize a penalized residual sum of squares:
  $\hat{\beta}^{\mathrm{ridge}} = \arg\min_{\beta} \Big\{ \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{p} \beta_j^2 \Big\}.$
• $\lambda \ge 0$ is a complexity parameter that controls the amount of shrinkage: the larger the value of $\lambda$, the greater the amount of shrinkage.
• The coefficients are shrunk toward zero (and toward each other).
• An equivalent way to write the ridge problem:
  $\hat{\beta}^{\mathrm{ridge}} = \arg\min_{\beta} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} \beta_j^2 \le t.$
• This makes the size constraint on the parameters explicit.
• There is a one-to-one correspondence between the parameters $\lambda$ and $t$.
• Notice that the intercept $\beta_0$ is left out of the penalty.
• Assume the centering has been done, so the matrix $\mathbf{X}$ has $p$ (rather than $p+1$) columns:
  – each $x_{ij}$ is replaced by $x_{ij} - \bar{x}_j$;
  – estimate $\beta_0$ by $\bar{y} = \frac{1}{N}\sum_{i=1}^{N} y_i$.
• The criterion in matrix form:
  $\mathrm{RSS}(\lambda) = (\mathbf{y} - \mathbf{X}\beta)^T(\mathbf{y} - \mathbf{X}\beta) + \lambda \beta^T \beta.$
• The ridge regression solutions: $\hat{\beta}^{\mathrm{ridge}} = (\mathbf{X}^T\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^T\mathbf{y}$, where $\mathbf{I}$ is the $p \times p$ identity matrix.
• Even with the quadratic penalty $\lambda \beta^T \beta$, the ridge regression solution is still a linear function of $\mathbf{y}$.
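To make the closed-form ridge solution above concrete, here is a minimal sketch on centered inputs, reusing the small example data set from the least squares section. NumPy is assumed, the function name ridge_coefficients is our own, and for brevity the inputs are only centered rather than fully standardized.

```python
import numpy as np

def ridge_coefficients(X, y, lam):
    """Ridge solution on centered inputs: beta = (Xc^T Xc + lam * I)^{-1} Xc^T y.

    X is the N x p input matrix WITHOUT the intercept column; the intercept is
    handled by centering each column and is recovered afterwards.
    """
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean                                   # x_ij replaced by x_ij - xbar_j
    p = Xc.shape[1]
    beta = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ (y - y_mean))
    beta0 = y_mean - x_mean @ beta                    # intercept on the original scale
    return beta0, beta

# Example data from the least squares section (no column of 1s here).
X = np.array([[1, 2, 1], [2, 0, 4], [3, 4, 2], [4, 2, 3], [5, 4, 1]], dtype=float)
y = np.array([22, 49, 39, 52, 38], dtype=float)

for lam in (0.0, 1.0, 10.0):
    beta0, beta = ridge_coefficients(X, y, lam)
    print(f"lambda={lam:5.1f}  intercept={beta0:7.2f}  beta={np.round(beta, 3)}")
# lambda = 0 recovers the least squares fit; larger lambda shrinks beta toward zero.
```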
Lasso
• The lasso is a shrinkage method like ridge, with subtle but important differences.
• The lasso estimate:
  $\hat{\beta}^{\mathrm{lasso}} = \arg\min_{\beta} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 \quad \text{subject to} \quad \sum_{j=1}^{p} |\beta_j| \le t.$
• As in ridge regression, we can center the inputs, estimate $\beta_0$ by $\bar{y}$, and then fit a model without an intercept.
• The lasso problem in the equivalent Lagrangian form:
  $\hat{\beta}^{\mathrm{lasso}} = \arg\min_{\beta} \Big\{ \frac{1}{2} \sum_{i=1}^{N} \Big( y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j \Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j| \Big\}.$
• Similarity to the ridge regression problem: the ridge penalty $\sum_j \beta_j^2$ is replaced by the lasso penalty $\sum_j |\beta_j|$.
• The latter constraint makes the solution nonlinear in the $y_i$.
• There is no closed-form expression as in ridge regression.
• Computing the lasso solution is a quadratic programming problem.
• Making $t$ sufficiently small will cause some of the coefficients to be exactly zero, so the lasso does a kind of continuous subset selection.
• If $t$ is chosen larger than $t_0 = \sum_{j=1}^{p} |\hat{\beta}_j|$ (where the $\hat{\beta}_j$ are the least squares estimates), then the lasso estimates are the least squares estimates $\hat{\beta}_j$.
• For example, for $t = t_0 / 2$ the least squares coefficients are shrunk by about 50% on average.

Discussion: Orthonormal Case
• When the input matrix $\mathbf{X}$ is orthonormal ($\mathbf{X}^T\mathbf{X} = \mathbf{I}$), the three procedures have explicit solutions.
• Each method applies a simple transformation to the least squares estimate $\hat{\beta}_j$ (a numerical sketch of the three transformations appears at the end of this section):

  Estimator             Formula
  Best subset (size M)  $\hat{\beta}_j \cdot I\big(|\hat{\beta}_j| \ge |\hat{\beta}_{(M)}|\big)$
  Ridge                 $\hat{\beta}_j / (1 + \lambda)$
  Lasso                 $\mathrm{sign}(\hat{\beta}_j)\big(|\hat{\beta}_j| - \lambda\big)_+$

  Here $|\hat{\beta}_{(M)}|$ is the $M$-th largest coefficient in absolute value and $(\cdot)_+$ denotes the positive part.
• Ridge regression does a proportional shrinkage.
• "Soft thresholding" (used in the context of wavelet-based smoothing): the lasso translates each coefficient toward zero by a constant amount $\lambda$, truncating at zero.
• "Hard thresholding": best-subset selection drops all variables with coefficients smaller than the $M$-th largest.

Discussion: Nonorthogonal Case
• Suppose that there are only two parameters, $\beta_1$ and $\beta_2$.
• The residual sum of squares has elliptical contours, centered at the full least squares estimate.
• The constraint region for ridge regression is the disk $\beta_1^2 + \beta_2^2 \le t$, while that for the lasso is the diamond $|\beta_1| + |\beta_2| \le t$, which has corners.
• Both methods find the first point where the elliptical contours hit the constraint region.
• Unlike the disk, the diamond has corners; if the solution occurs at a corner, then one parameter is equal to zero.
• When $p > 2$ the diamond becomes a rhomboid with many corners, flat edges, and faces, so there are many more opportunities for the estimated parameters to be zero.

Methods Using Derived Input Directions
• In many situations we have a large number of inputs, often correlated.
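The numerical sketch referenced in the orthonormal-case discussion above: given a vector of least squares coefficients, it applies the three transformations from the table (hard thresholding for best subset, proportional shrinkage for ridge, soft thresholding for the lasso). NumPy is assumed, the helper names are our own, and the coefficient values are made up purely for illustration.

```python
import numpy as np

def best_subset(beta_ls, M):
    """Hard thresholding: keep the M coefficients largest in absolute value, zero the rest."""
    out = np.zeros_like(beta_ls)
    keep = np.argsort(np.abs(beta_ls))[-M:]           # indices of the M largest |beta_j|
    out[keep] = beta_ls[keep]
    return out

def ridge_shrink(beta_ls, lam):
    """Proportional shrinkage: beta_j / (1 + lambda)."""
    return beta_ls / (1.0 + lam)

def lasso_soft_threshold(beta_ls, lam):
    """Soft thresholding: sign(beta_j) * (|beta_j| - lambda)_+ ."""
    return np.sign(beta_ls) * np.maximum(np.abs(beta_ls) - lam, 0.0)

# Illustrative least squares coefficients (orthonormal-input setting assumed).
beta_ls = np.array([2.5, -1.2, 0.3, -0.05, 0.8])

print("best subset (M=2): ", best_subset(beta_ls, 2))             # only the two largest survive
print("ridge (lambda=1):  ", ridge_shrink(beta_ls, 1.0))          # every coefficient halved
print("lasso (lambda=0.5):", lasso_soft_threshold(beta_ls, 0.5))  # small coefficients set exactly to zero
```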