
Linear Methods for Regression and Shrinkage Methods

Reference: The Elements of Statistical Learning, by T. Hastie, R. Tibshirani, J. Friedman, Springer

1. Linear Regression Models

• Input vector: X = (X_1, X_2, …, X_p)

• Each X_j is an attribute / feature / predictor (independent variable)
• The linear regression model:

  f(X) = β_0 + Σ_{j=1}^{p} X_j β_j

• The output Y is called the response (dependent variable)

• The β_j's are unknown parameters (coefficients)

2. Linear Regression Models: Least Squares

• A set of training data: (x_1, y_1), …, (x_N, y_N)
• Each x_i = (x_i1, x_i2, …, x_ip) is a vector of attribute values

• Each y_i is the corresponding response value / label
• We wish to estimate the parameters β = (β_0, β_1, …, β_p)

3. Linear Regression Models: Least Squares

• One common approach is the method of least squares:
• Pick the coefficients β to minimize the residual sum of squares:

  RSS(β) = Σ_{i=1}^{N} (y_i − f(x_i))² = Σ_{i=1}^{N} (y_i − β_0 − Σ_{j=1}^{p} x_ij β_j)²

4. Linear Regression Models: Least Squares

• This criterion is reasonable if the training observations (x_i, y_i) represent independent random draws.

• Even if the x_i's were not drawn randomly, the criterion is still valid if the y_i's are conditionally independent given the inputs x_i.

5. Linear Regression Models: Least Squares

• Least squares makes no assumption about the validity of the model
• It simply finds the best linear fit to the data

6. Linear Regression Models: Finding the Residual Sum of Squares

• Denote by X the N × (p + 1) matrix with each row an input vector (with a 1 in the first position)
• Let y be the N-vector of outputs in the training set
• The residual sum of squares is a quadratic function of the parameters:

  RSS(β) = (y − Xβ)^T (y − Xβ)

7. Linear Regression Models: Finding the Residual Sum of Squares

• Set the first derivative to zero:

  ∂RSS/∂β = −2 X^T (y − Xβ) = 0

• Obtain the unique solution (assuming X^T X is nonsingular):

  β̂ = (X^T X)^{-1} X^T y

8. Linear Regression Models: Orthogonal Projection

• The fitted values at the training inputs are:

  ŷ = X β̂ = X (X^T X)^{-1} X^T y

• The matrix H = X (X^T X)^{-1} X^T appearing in the above equation is called the “hat” matrix, because it puts the hat on y

9. Linear Regression Models: Example

• Training Data:

  x = (x1, x2, x3)    y
  (1, 2, 1)           22
  (2, 0, 4)           49
  (3, 4, 2)           39
  (4, 2, 3)           52
  (5, 4, 1)           38

10. Linear Regression Models: Example

• With a column of 1s prepended for the intercept:

      [ 1  1  2  1 ]        [ 22 ]
      [ 1  2  0  4 ]        [ 49 ]
  X = [ 1  3  4  2 ]    y = [ 39 ]
      [ 1  4  2  3 ]        [ 52 ]
      [ 1  5  4  1 ]        [ 38 ]

• β̂ = (X^T X)^{-1} X^T y ≈ (8.13, 4.04, 0.51, 8.43)^T, i.e. β̂_0 ≈ 8.13, β̂_1 ≈ 4.04, β̂_2 ≈ 0.51, β̂_3 ≈ 8.43

11. Linear Regression Models: Example

• With β̂ ≈ (8.13, 4.04, 0.51, 8.43)^T from the previous slide:

• Fitted values ŷ = X β̂ and residual vector r = y − ŷ:

  y     ŷ        r = y − ŷ
  22    21.61     0.39
  49    49.91    −0.91
  39    39.13    −0.13
  52    50.57     1.43
  38    38.78    −0.78
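The numbers in this example can be reproduced with a short NumPy sketch (not part of the original slides); the data are taken from the table on slide 9, and the printed values are rounded.

```python
import numpy as np

# Training data from slide 9: rows of x = (x1, x2, x3) and the responses y.
X_raw = np.array([[1, 2, 1],
                  [2, 0, 4],
                  [3, 4, 2],
                  [4, 2, 3],
                  [5, 4, 1]], dtype=float)
y = np.array([22, 49, 39, 52, 38], dtype=float)

# Prepend a column of 1s for the intercept (the "1 in the first position").
X = np.column_stack([np.ones(len(y)), X_raw])

# Least squares solution of the normal equations: beta_hat = (X^T X)^{-1} X^T y.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(np.round(beta_hat, 2))        # [8.13 4.04 0.51 8.43]

y_hat = X @ beta_hat                # fitted values, approx [21.61 49.91 39.13 50.57 38.78]
r = y - y_hat                       # residuals,     approx [ 0.39 -0.91 -0.13  1.43 -0.78]
print(np.round(y_hat, 2), np.round(r, 2))
```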

12. Linear Regression Models: Orthogonal Projection

• A different, geometrical representation of the least squares estimate, in ℝ^N

• Denote the column vectors of X by x_0, x_1, …, x_p, with x_0 ≡ 1

13. Linear Regression Models: Orthogonal Projection

• These vectors span a subspace of ℝ^N, also referred to as the column space of X
• Minimizing RSS(β) by choosing β̂ makes the residual vector r = y − ŷ orthogonal to this subspace
• Checking with the example (using the rounded residuals r ≈ (0.39, −0.91, −0.13, 1.43, −0.78)):

  r · x_0 = 0.39 − 0.91 − 0.13 + 1.43 − 0.78 ≈ 0
  r · x_1 = 0.39 − 0.91·2 − 0.13·3 + 1.43·4 − 0.78·5 ≈ 0

14. Linear Regression Models: Orthogonal Projection

• ŷ = X β̂ is the orthogonal projection of y onto this subspace
• Projection matrix: the hat matrix H = X (X^T X)^{-1} X^T computes the orthogonal projection
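A small continuation of the sketch above (again not from the slides) checks the projection interpretation numerically: H is symmetric and idempotent, and the residual is orthogonal to every column of X.

```python
import numpy as np

# Same design matrix (with the intercept column) and response as in the example.
X = np.array([[1, 1, 2, 1],
              [1, 2, 0, 4],
              [1, 3, 4, 2],
              [1, 4, 2, 3],
              [1, 5, 4, 1]], dtype=float)
y = np.array([22, 49, 39, 52, 38], dtype=float)

H = X @ np.linalg.inv(X.T @ X) @ X.T     # the "hat" matrix
y_hat = H @ y                            # orthogonal projection of y onto the column space of X
r = y - y_hat                            # residual vector

print(np.allclose(H, H.T))               # True: H is symmetric ...
print(np.allclose(H @ H, H))             # True: ... and idempotent, i.e. a projection matrix
print(np.round(X.T @ r, 8))              # ~0 for every column: r is orthogonal to the subspace
```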

15. Linear Regression Models: Sampling Properties of the Parameters

• Assume the observations y_i are uncorrelated and have constant variance σ², and that the x_i are fixed (non-random)
• Variance-covariance matrix of the least squares parameter estimates:

  Var(β̂) = (X^T X)^{-1} σ²

16. Linear Regression Models: Sampling Properties of the Parameters

• To test the hypothesis that a particular coefficient β_j = 0:
• Form the standardized coefficient or Z-score:

  z_j = β̂_j / (σ̂ √v_j)

• Here v_j is the j-th diagonal element of (X^T X)^{-1} and σ̂² is the estimated error variance
• Under the null hypothesis that β_j = 0, z_j is distributed as t_{N−p−1} (a t distribution with N − p − 1 degrees of freedom)
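As a hedged illustration of these sampling properties, the sketch below estimates σ², forms Var(β̂) = (X^T X)^{-1} σ̂², and computes Z-scores with t-based p-values on the toy data set above (only one residual degree of freedom here, so the numbers are not very meaningful; SciPy is assumed to be available).

```python
import numpy as np
from scipy import stats

# Toy data from the earlier example: N = 5 observations, p = 3 inputs plus an intercept.
X = np.array([[1, 1, 2, 1],
              [1, 2, 0, 4],
              [1, 3, 4, 2],
              [1, 4, 2, 3],
              [1, 5, 4, 1]], dtype=float)
y = np.array([22, 49, 39, 52, 38], dtype=float)
N, k = X.shape                               # k = p + 1 columns including the intercept

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y
r = y - X @ beta_hat

sigma2_hat = r @ r / (N - k)                 # unbiased estimate of sigma^2
cov_beta = sigma2_hat * XtX_inv              # Var(beta_hat) = (X^T X)^{-1} sigma^2

z = beta_hat / np.sqrt(np.diag(cov_beta))    # Z-scores
p_values = 2 * stats.t.sf(np.abs(z), df=N - k)
print(np.round(z, 2), np.round(p_values, 3))
```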

17. Subset Selection: Motivation

• Prediction accuracy can sometimes be improved by shrinking some coefficients or setting them to zero
• With a large number of predictors, we would like to determine a smaller subset that exhibits the strongest effects
• With subset selection we retain only a subset of the variables, and eliminate the rest from the model
• Least squares regression is then used to estimate the coefficients of the inputs that are retained

18. Subset Selection: Motivation

• Best subset regression finds, for each k ∈ {0, 1, 2, …, p}, the subset of size k that gives the smallest residual sum of squares
• An efficient algorithm, the leaps and bounds procedure, makes this feasible for p as large as 30 or 40
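The leaps and bounds procedure itself is not shown in the slides; the brute-force sketch below (my own synthetic data and naming) simply enumerates all subsets of each size and keeps the one with the smallest residual sum of squares, which is what best subset selection computes, but is only feasible for small p.

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
N, p = 50, 6
X = rng.normal(size=(N, p))                              # synthetic inputs
beta_true = np.array([3.0, 0.0, -2.0, 0.0, 0.0, 1.5])    # only three truly active predictors
y = X @ beta_true + rng.normal(scale=0.5, size=N)

def rss(cols):
    """Residual sum of squares of the least squares fit on the given columns (plus intercept)."""
    Xd = np.column_stack([np.ones(N), X[:, list(cols)]])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    res = y - Xd @ beta
    return res @ res

# For each subset size k, keep the subset with the smallest RSS.
for k in range(1, p + 1):
    best = min(combinations(range(p), k), key=rss)
    print(k, best, round(rss(best), 2))   # the best RSS necessarily decreases as k grows
```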

19. Subset Selection: Motivation

• Figure 3.5 in the reference shows all the subset models (8 features) that are eligible for selection by the best-subsets approach

• The best-subset curve (the red lower boundary in Figure 3.5) is necessarily decreasing, so it cannot be used to select the subset size k

20. Subset Selection: Introduction

• Performing well on the training data is not necessarily good for actual use
• Use a different data set, a test set, for measuring the error of the learned model
• Choose the smallest model that minimizes an estimate of the expected prediction error
• Use the cross-validation method (more details later)
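A minimal sketch of this idea, using scikit-learn on synthetic data of my own: for each subset size the training RSS keeps decreasing, while the cross-validated error typically levels off or rises, and the latter is what should guide the choice of k. (For brevity the subset is chosen on the full data; a more careful analysis would redo the selection inside each fold.)

```python
import numpy as np
from itertools import combinations
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
N, p = 80, 6
X = rng.normal(size=(N, p))
y = X @ np.array([3.0, 0.0, -2.0, 0.0, 0.0, 1.5]) + rng.normal(scale=0.5, size=N)

def train_rss(cols):
    model = LinearRegression().fit(X[:, list(cols)], y)
    return np.sum((y - model.predict(X[:, list(cols)])) ** 2)

for k in range(1, p + 1):
    # Best subset of size k by training RSS (brute force, small p only).
    best = min(combinations(range(p), k), key=train_rss)
    # Estimate the expected prediction error by 5-fold cross-validation.
    cv_mse = -cross_val_score(LinearRegression(), X[:, list(best)], y,
                              cv=5, scoring="neg_mean_squared_error").mean()
    print(k, best, round(train_rss(best), 2), round(cv_mse, 3))
```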

21. Subset Selection: Stepwise Selection

Forward Stepwise Selection
• Starts with the intercept
• Sequentially adds into the model the predictor that most improves the fit
• Clever updating algorithms exploit the QR decomposition of the current fit to rapidly establish the next candidate
• Produces a sequence of models indexed by k, the subset size, which must be determined

• Greedy algorithm
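A sketch of the greedy forward stepwise idea on synthetic data of my own; it refits from scratch at each step for clarity rather than using the QR-based updating mentioned above.

```python
import numpy as np

rng = np.random.default_rng(2)
N, p = 100, 8
X = rng.normal(size=(N, p))
y = X @ np.concatenate([[4.0, -3.0, 2.0], np.zeros(p - 3)]) + rng.normal(size=N)

def rss(cols):
    """RSS of the least squares fit on the intercept plus the given columns."""
    Xd = np.column_stack([np.ones(N), X[:, list(cols)]])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    return np.sum((y - Xd @ beta) ** 2)

active, remaining = [], set(range(p))
while remaining:
    # Greedily add the predictor that most improves the fit (largest RSS reduction).
    j_best = min(remaining, key=lambda j: rss(active + [j]))
    active.append(j_best)
    remaining.discard(j_best)
    print(list(active), round(rss(active), 2))   # one model per subset size k
```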

22. Subset Selection: Stepwise Selection

Backward Stepwise Selection
• Starts with the full model
• Sequentially deletes the predictor that has the least impact on the fit
• Candidate for dropping: the variable with the smallest Z-score
• Can only be used when N > p

23. Shrinkage Methods: Motivation

• Subset selection retains a subset of the predictors and discards the rest
• It produces a model that is
  - interpretable
  - possibly of lower prediction error than the full model
• But it is a discrete process (variables are either retained or discarded), so it often exhibits high variance
• As a result it often does not reduce the prediction error of the full model

24. Shrinkage Methods: Motivation

• Shrinkage methods are more continuous
• They don't suffer as much from high variability

25. Shrinkage Methods: Motivation

• When there are many correlated variables in a linear regression model, their coefficients are poorly determined and exhibit high variance
• A wildly large positive coefficient on one variable can be canceled by a similarly large negative coefficient on its correlated cousin
• The problem is alleviated by imposing a size constraint on the coefficients
• Ridge solutions are not equivariant under scaling of the inputs, so one normally standardizes the inputs before solving

26. Shrinkage Methods: Ridge Regression

• Shrinks the regression coefficients by imposing a penalty on their size
• The ridge coefficients minimize a penalized residual sum of squares:

  β̂^ridge = argmin_β { Σ_{i=1}^{N} (y_i − β_0 − Σ_{j=1}^{p} x_ij β_j)² + λ Σ_{j=1}^{p} β_j² }

27. Shrinkage Methods: Ridge Regression

• λ ≥ 0 is a complexity parameter that controls the amount of shrinkage: the larger the value of λ, the greater the amount of shrinkage
• The coefficients are shrunk toward zero (and toward each other)

28. Shrinkage Methods: Ridge Regression

• An equivalent way to write the ridge problem:

  β̂^ridge = argmin_β Σ_{i=1}^{N} (y_i − β_0 − Σ_{j=1}^{p} x_ij β_j)²  subject to  Σ_{j=1}^{p} β_j² ≤ t

• This makes explicit the size constraint on the parameters
• There is a one-to-one correspondence between the parameters λ and t

29. Shrinkage Methods: Ridge Regression

• Notice that the intercept β_0 is left out of the penalty
• Assume the centering has been done, so the matrix X has p (rather than p + 1) columns
  - Each x_ij is replaced by x_ij − x̄_j
  - Estimate β_0 by ȳ = (1/N) Σ_i y_i
• The criterion in matrix form:

  RSS(λ) = (y − Xβ)^T (y − Xβ) + λ β^T β

30. Shrinkage Methods: Ridge Regression

• The ridge regression solution:

  β̂^ridge = (X^T X + λ I)^{-1} X^T y

• Here I is the p × p identity matrix
• Even with the quadratic penalty λ β^T β, the ridge regression solution is still a linear function of y
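A sketch of the closed-form ridge solution on centered synthetic data (my own choices of data and λ), checked against scikit-learn's Ridge, which uses the same criterion ||y − Xβ||² + λ||β||² when fit_intercept=False. In practice the inputs would normally be standardized as well.

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(3)
N, p = 60, 5
X = rng.normal(size=(N, p))
y = X @ np.array([2.0, -1.0, 0.0, 3.0, 0.5]) + rng.normal(size=N)

# Center the inputs and the response, so no intercept is needed (beta_0_hat = y_bar).
Xc = X - X.mean(axis=0)
yc = y - y.mean()

lam = 10.0                                                   # complexity parameter lambda
beta_ridge = np.linalg.solve(Xc.T @ Xc + lam * np.eye(p), Xc.T @ yc)

sk = Ridge(alpha=lam, fit_intercept=False).fit(Xc, yc)       # same penalized criterion
print(np.allclose(beta_ridge, sk.coef_))                     # True
print(np.round(beta_ridge, 3))
```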

31. Lasso

• A shrinkage method like ridge, with subtle but important differences
• The lasso estimate:

  β̂^lasso = argmin_β Σ_{i=1}^{N} (y_i − β_0 − Σ_{j=1}^{p} x_ij β_j)²  subject to  Σ_{j=1}^{p} |β_j| ≤ t

• Similar to ridge regression, we can center the inputs, estimate β_0 by ȳ, and then fit a model without an intercept

32. Lasso

• The lasso problem in the equivalent Lagrangian form:

  β̂^lasso = argmin_β { (1/2) Σ_{i=1}^{N} (y_i − β_0 − Σ_{j=1}^{p} x_ij β_j)² + λ Σ_{j=1}^{p} |β_j| }

• Similarity to the ridge regression problem: the L2 ridge penalty Σ_j β_j² is replaced by the L1 lasso penalty Σ_j |β_j|

33. Lasso

• The latter constraint makes the solution nonlinear in the y_i
• There is no closed-form expression as in ridge regression
• Computing the lasso solution is a quadratic programming problem
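In practice one rarely solves the quadratic program directly; the sketch below just calls scikit-learn's coordinate-descent Lasso on centered synthetic data of my own. Note that scikit-learn minimizes (1/(2N))·RSS + α·Σ|β_j|, so its α corresponds to λ/N in the Lagrangian form above.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(4)
N, p = 100, 10
X = rng.normal(size=(N, p))
y = X @ np.concatenate([[3.0, -2.0, 1.5], np.zeros(p - 3)]) + rng.normal(size=N)

Xc = X - X.mean(axis=0)          # centered inputs; the intercept is handled by y_bar
yc = y - y.mean()

for alpha in [0.01, 0.1, 0.5, 1.0]:
    beta = Lasso(alpha=alpha, fit_intercept=False, max_iter=10_000).fit(Xc, yc).coef_
    print(alpha, np.round(beta, 2))   # larger alpha -> more coefficients exactly zero
```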

34. Lasso

• Making t sufficiently small will cause some of the coefficients to be exactly zero, so the lasso does a kind of continuous subset selection
• If t is chosen larger than t_0 = Σ_{j=1}^{p} |β̂_j| (where the β̂_j are the least squares estimates), then the lasso estimates are just the β̂_j's

• For example, for t = t_0 / 2 the least squares coefficients are shrunk by about 50% on average

35. Discussion: Orthonormal Case

• When the input matrix X is orthonormal (X^T X = I), the three procedures have explicit solutions
• Each method applies a simple transformation to the least squares estimate β̂_j

  Estimator               Formula
  Best subset (size M)    β̂_j · 1[ |β̂_j| ≥ |β̂_(M)| ]
  Ridge                   β̂_j / (1 + λ)
  Lasso                   sign(β̂_j) · ( |β̂_j| − λ )_+

• Here |β̂_(M)| is the M-th largest coefficient in absolute value, and (·)_+ denotes the positive part

36. Discussion

• Ridge regression does a proportional shrinkage
• “Soft thresholding” (also used in the context of wavelet-based smoothing): the lasso translates each coefficient by a constant factor λ, truncating at zero
• “Hard thresholding”: best-subset selection drops all variables with coefficients smaller than the M-th largest
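The three transformations in the table can be written as one-line functions; the sketch below (names and example coefficients are mine) applies proportional shrinkage, soft thresholding, and hard thresholding to a vector of least squares estimates.

```python
import numpy as np

def ridge_shrink(beta, lam):
    """Proportional shrinkage: beta_j / (1 + lambda)."""
    return beta / (1.0 + lam)

def lasso_soft_threshold(beta, lam):
    """Soft thresholding: sign(beta_j) * (|beta_j| - lambda)_+."""
    return np.sign(beta) * np.maximum(np.abs(beta) - lam, 0.0)

def best_subset_hard_threshold(beta, M):
    """Hard thresholding: keep the M largest |beta_j|, set the rest to zero."""
    keep = np.argsort(-np.abs(beta))[:M]
    out = np.zeros_like(beta)
    out[keep] = beta[keep]
    return out

beta_ls = np.array([4.0, -2.5, 1.2, 0.3, -0.1])     # hypothetical least squares estimates
print(ridge_shrink(beta_ls, lam=1.0))               # every coefficient shrunk proportionally
print(lasso_soft_threshold(beta_ls, lam=1.0))       # translated toward zero, small ones truncated
print(best_subset_hard_threshold(beta_ls, M=2))     # only the two largest survive
```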

37. Discussion

Nonorthogonal Case
• Suppose that there are only two parameters, β_1 and β_2
• The residual sum of squares has elliptical contours, centered at the full least squares estimate
• The constraint region
  - for ridge regression is the disk β_1² + β_2² ≤ t
  - for the lasso is the diamond |β_1| + |β_2| ≤ t (it has corners)
  - both methods find the first point where the elliptical contours hit the constraint region

38. Discussion

39. Discussion

• Unlike the disk, the diamond (lasso constraint region) has corners
• For the lasso:
  - if the solution occurs at a corner, it has one parameter β_j equal to zero
  - when p > 2, the diamond becomes a rhomboid, which has many corners, flat edges and faces
  - there are then many more opportunities for the estimated parameters to be zero

40. Methods Using Derived Input Directions

• In many situations we have a large number of inputs, often correlated
• One idea is to produce a small number of linear combinations Z_m = Σ_{j=1}^{p} φ_jm X_j, m = 1, …, M, of the original inputs X_j

• The Z_m are used in place of the X_j as inputs in the regression
• One major consideration is how the linear combinations are constructed.

41. Methods Using Derived Input Directions: Principal Components Regression

• The linear combinations Z_m used are the principal components of X
• The singular value decomposition of the N × p matrix X has the form X = U D V^T
  - Here U and V are N × p and p × p orthogonal matrices, with the columns of U spanning the column space of X and the columns of V spanning the row space
  - D is a p × p diagonal matrix with diagonal entries d_1 ≥ d_2 ≥ ⋯ ≥ d_p ≥ 0, called the singular values of X

42. Methods Using Derived Input Directions: Principal Components Regression

• The eigenvectors v_m (the columns of V) are called the principal component directions of X

• Form the derived input columns z_m = X v_m
• Regress y on z_1, z_2, …, z_M for some M ≤ p

43. Methods Using Derived Input Directions: Principal Components Regression

44. Methods Using Derived Input Directions: Principal Components Regression

• Since the z_m are orthogonal, this regression is just a sum of univariate regressions:

  ŷ^pcr_(M) = ȳ 1 + Σ_{m=1}^{M} θ̂_m z_m,  where θ̂_m = ⟨z_m, y⟩ / ⟨z_m, z_m⟩

45. Methods Using Derived Input Directions: Principal Components Regression

• Since the z_m are each linear combinations of the original x_j,
• we can express the solution in terms of coefficients of the x_j:

  β̂^pcr(M) = Σ_{m=1}^{M} θ̂_m v_m

46. Methods Using Derived Input Directions: Principal Components Regression

• As with ridge regression, principal components depend on the scaling of the inputs, so typically we first standardize them
• If M = p, we get back the usual least squares estimates
  - the columns of Z = UD span the column space of X

• For M < p, we get a reduced regression
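A sketch of principal components regression via the SVD on standardized synthetic data (my own); it also checks that using all M = p components recovers the least squares fit.

```python
import numpy as np

rng = np.random.default_rng(5)
N, p = 100, 6
X = rng.normal(size=(N, p))
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=N)           # make two inputs strongly correlated
y = X @ np.array([2.0, 0.0, -1.0, 1.0, 0.0, 0.5]) + rng.normal(size=N)

# Standardize the inputs and center the response.
Xs = (X - X.mean(axis=0)) / X.std(axis=0)
yc = y - y.mean()

# SVD: X = U D V^T; the columns of V are the principal component directions.
U, d, Vt = np.linalg.svd(Xs, full_matrices=False)
V = Vt.T

def pcr_coef(M):
    """beta_pcr(M) = sum_{m<=M} theta_m v_m with theta_m = <z_m, y> / <z_m, z_m>."""
    Z = Xs @ V[:, :M]                                   # derived inputs z_m = X v_m
    theta = (Z.T @ yc) / np.sum(Z * Z, axis=0)
    return V[:, :M] @ theta

beta_ls, *_ = np.linalg.lstsq(Xs, yc, rcond=None)
print(np.allclose(pcr_coef(p), beta_ls))                # True: M = p gives back least squares
print(np.round(pcr_coef(2), 3))                         # M < p gives a reduced regression
```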
