LASSO Geometric Interpretation, Cross Validation Slides

Regularized Regression: Geometric intuition of solution
Plus: Cross validation
CSE 446: Machine Learning
Emily Fox, University of Washington, January 20, 2017

Coordinate descent for lasso (for normalized features)

Coordinate descent for least squares regression
Initialize ŵ = 0 (or smartly…)
while not converged:
  for j = 0, 1, …, D:
    compute ρj = Σ_{i=1}^N hj(xi)(yi – ŷi(ŵ-j)), where ŷi(ŵ-j) is the prediction made without feature j, so ρj measures how feature j correlates with the residual
    set ŵj = ρj

Coordinate descent for lasso
Initialize ŵ = 0 (or smartly…)
while not converged:
  for j = 0, 1, …, D:
    compute ρj = Σ_{i=1}^N hj(xi)(yi – ŷi(ŵ-j))
    set ŵj = ρj + λ/2   if ρj < -λ/2
             0          if ρj in [-λ/2, λ/2]
             ρj – λ/2   if ρj > λ/2

Soft thresholding
The lasso update is the soft-thresholding operator applied to ρj: plotted as ŵj versus ρj, it is exactly zero on the interval [-λ/2, λ/2] and shifted toward zero by λ/2 outside it. (The least squares update ŵj = ρj is the identity line.)

How to assess convergence?
When to stop? For convex problems, the algorithm will start to take smaller and smaller steps. Measure the size of the steps taken in a full loop over all features, and stop when the max step < ε.

Other lasso solvers
Classically: least angle regression (LARS) [Efron et al. ’04]
Then: coordinate descent [Fu ’98; Friedman, Hastie, & Tibshirani ’08]
Now:
• Parallel coordinate descent (e.g., Shotgun [Bradley et al. ’11])
• Other parallel learning approaches for linear models
  - Parallel stochastic gradient descent (SGD) (e.g., Hogwild! [Niu et al. ’11])
  - Parallel independent solutions then averaging [Zhang et al. ’12]
• Alternating direction method of multipliers (ADMM) [Boyd et al. ’11]

Coordinate descent for lasso (for unnormalized features)

Coordinate descent for lasso with unnormalized features
Precompute: zj = Σ_{i=1}^N hj(xi)²
Initialize ŵ = 0 (or smartly…)
while not converged:
  for j = 0, 1, …, D:
    compute ρj = Σ_{i=1}^N hj(xi)(yi – ŷi(ŵ-j))
    set ŵj = (ρj + λ/2)/zj   if ρj < -λ/2
             0               if ρj in [-λ/2, λ/2]
             (ρj – λ/2)/zj   if ρj > λ/2
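As a concrete illustration of the updates above, here is a minimal NumPy sketch of coordinate descent for the lasso with (possibly) unnormalized features; the normalized case corresponds to zj = 1. The function names, tolerance, and iteration cap are illustrative choices, not part of the slides.

```python
import numpy as np

def soft_threshold(rho, lam):
    """Soft-thresholding operator from the slides: zero inside [-lam/2, lam/2]."""
    if rho < -lam / 2:
        return rho + lam / 2
    elif rho > lam / 2:
        return rho - lam / 2
    return 0.0

def lasso_coordinate_descent(H, y, lam, tol=1e-6, max_iter=1000):
    """Coordinate descent for the lasso.

    H: (N, D) matrix whose columns are the features h_j(x_i) (include a
    constant column for the intercept if desired); y: (N,) targets.
    """
    D = H.shape[1]
    w = np.zeros(D)                    # initialize w-hat = 0 (or smartly...)
    z = (H ** 2).sum(axis=0)           # precompute z_j = sum_i h_j(x_i)^2
    for _ in range(max_iter):
        max_step = 0.0
        for j in range(D):
            y_hat_minus_j = H @ w - H[:, j] * w[j]       # prediction without feature j
            rho_j = H[:, j] @ (y - y_hat_minus_j)
            w_j_new = soft_threshold(rho_j, lam) / z[j]  # lasso coordinate update
            max_step = max(max_step, abs(w_j_new - w[j]))
            w[j] = w_j_new
        if max_step < tol:             # stop when the max step in a full sweep < epsilon
            break
    return w
```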
Geometric intuition for sparsity of lasso solution

Geometric intuition for ridge regression

Visualizing the ridge cost in 2D
With two features h0 and h1, the ridge objective is
  RSS(w) + λ||w||₂² = Σ_{i=1}^N (yi – w0h0(xi) – w1h1(xi))² + λ(w0² + w1²)
The slides plot, in the (w0, w1) plane, the contours of the RSS term (ellipses), the contours of the L2 penalty (circles centered at the origin), and the contours of their sum.

Visualizing the ridge solution
The ridge solution sits where a level set of the RSS term meets a level set of the L2 penalty. Because the L2 penalty's level sets are circles, this intersection generically occurs at a point where both w0 and w1 are nonzero.

Geometric intuition for lasso

Visualizing the lasso cost in 2D
The lasso objective replaces the L2 penalty with the L1 penalty:
  RSS(w) + λ||w||₁ = Σ_{i=1}^N (yi – w0h0(xi) – w1h1(xi))² + λ(|w0| + |w1|)
The slides again plot the RSS contours, the L1 penalty contours (diamonds centered at the origin), and the contours of their sum.

Visualizing the lasso solution
The lasso solution again sits where a level set of the RSS term meets a level set of the penalty. Because the L1 penalty's level sets are diamonds with corners on the axes, the intersection often lands on a corner, where one coefficient is exactly zero; this is the geometric source of sparsity.

Revisit polynomial fit demo
What happens if we refit our high-order polynomial, but now using lasso regression? We will consider a few settings of λ…

How to choose λ: Cross validation

If sufficient amount of data…
Split the data into a training set, a validation set, and a test set:
- fit ŵλ on the training set
- test the performance of ŵλ on the validation set to select λ*
- assess the generalization error of ŵλ* on the test set

Start with smallish dataset
All data.

Still form test set and hold out
Split off a test set; keep the rest of the data.

How do we use the other data?
Use the rest of the data for both training and validation, but not so naively.

Recall naïve approach
Split the rest into a training set and a small validation set. Is that single validation set enough to compare the performance of ŵλ across λ values? No.

Choosing the validation set
We didn't have to use the last data points to form the small validation set; any data subset can be used. Which subset should we use? All of them: average performance over all choices.

K-fold cross validation
Preprocessing: randomly assign the data to K groups (blocks of about N/K points each); use the same split of the data for all other steps.
For k = 1, …, K:
1. Estimate ŵλ^(k) on the training blocks (all blocks except block k)
2. Compute the error on the validation block: errork(λ)
Compute the average error: CV(λ) = (1/K) Σ_{k=1}^K errork(λ)
Repeat the procedure for each choice of λ, and choose λ* to minimize CV(λ).

What value of K?
Formally, the best approximation occurs for validation sets of size 1 (K = N): leave-one-out cross validation. It is computationally intensive, requiring N fits of the model per λ. Typically K = 5 or 10 (5-fold or 10-fold CV).

Choosing λ via cross validation for lasso
Cross validation chooses the λ that provides the best predictive accuracy. It tends to favor less sparse solutions, and thus smaller λ, than the optimal choice for feature selection. See "Machine Learning: A Probabilistic Perspective" (Murphy, 2012) for further discussion.
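To make the procedure concrete, here is a minimal sketch of K-fold cross validation for selecting λ, reusing the lasso_coordinate_descent sketch from above. The fold assignment, the error metric (mean squared error), and all names are illustrative assumptions, not taken from the slides.

```python
import numpy as np

def kfold_cv_error(H, y, lam, K=5, seed=0):
    """Average validation error CV(lambda) over K folds."""
    N = H.shape[0]
    rng = np.random.default_rng(seed)
    folds = rng.permutation(N) % K       # randomly assign each point to one of K blocks
    errors = []
    for k in range(K):
        train, valid = folds != k, folds == k
        w = lasso_coordinate_descent(H[train], y[train], lam)   # fit on training blocks
        resid = y[valid] - H[valid] @ w
        errors.append(np.mean(resid ** 2))                      # error_k(lambda)
    return np.mean(errors)                                      # CV(lambda)

# Repeat for each candidate lambda (same seed keeps the same split) and keep the minimizer:
# lambdas = np.logspace(-2, 2, 20)
# best_lam = min(lambdas, key=lambda lam: kfold_cv_error(H, y, lam))
```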
Practical concerns with lasso

Issues with standard lasso objective
1. With a group of highly correlated features, lasso tends to select among them arbitrarily
   - Often we would prefer to select them all together
2. Often, empirically, ridge has better predictive performance than lasso, but lasso leads to a sparser solution

Elastic net aims to address these issues:
- a hybrid between lasso and ridge regression
- uses both L1 and L2 penalties (a sketch of the resulting coordinate update follows below)
See Zou & Hastie '05 for further discussion.
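The slides do not spell out the elastic net update, but adding a ridge term λ2||w||₂² to the lasso objective only changes the coordinate update's denominator. A hedged sketch of that update, reusing soft_threshold from the earlier code:

```python
def elastic_net_coordinate_update(rho_j, z_j, lam1, lam2):
    """Coordinate update for RSS(w) + lam1*||w||_1 + lam2*||w||_2^2.

    Derived by adding the ridge term to the lasso subgradient equation
    (an assumption of this sketch, not shown on the slides): the same
    soft threshold as before, with lam2 added to z_j.
    """
    return soft_threshold(rho_j, lam1) / (z_j + lam2)
```

Swapping this into the lasso_coordinate_descent loop above gives a basic elastic net solver; setting lam2 = 0 recovers the lasso.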
Summary for feature selection and lasso regression

Impact of feature selection and lasso
Lasso has changed machine learning, statistics, and electrical engineering. But, for feature selection in general, be careful about interpreting the selected features:
- selection only considers the features included
- it is sensitive to correlations between features
- the result depends on the algorithm used
- there are theoretical guarantees for lasso under certain conditions

What you can do now…
• Describe "all subsets" and greedy variants for feature selection
• Analyze computational costs of these algorithms
• Formulate the lasso objective
• Describe what happens to estimated lasso coefficients as the tuning parameter λ is varied
• Interpret the lasso coefficient path plot
• Contrast ridge and lasso regression
• Estimate lasso regression parameters using an iterative coordinate descent algorithm
• Implement K-fold cross validation to select the lasso tuning parameter λ

Linear classifiers
CSE 446: Machine Learning
Emily Fox, University of Washington, January 20, 2017

Linear classifier: Intuition
Input x: a sentence from a review, e.g., "Sushi was awesome, the food was awesome, but the service was awful."
Classifier MODEL
Output y: the predicted class, here ŷ = +1
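As a toy illustration of this intuition, here is a tiny hand-weighted linear classifier over words: score the sentence by summing word weights and predict the sign. The word weights are invented for illustration; how to learn them is the subject of the upcoming lectures.

```python
def predict_sentiment(sentence, weights):
    """Sum the weights of the words in the sentence; predict +1 if positive, else -1."""
    score = sum(weights.get(word.strip(",."), 0.0)
                for word in sentence.lower().split())
    return +1 if score > 0.0 else -1

weights = {"awesome": 1.0, "awful": -1.5, "great": 1.2}   # hypothetical word weights
print(predict_sentiment(
    "Sushi was awesome, the food was awesome, but the service was awful.", weights))
# With these weights: 1.0 + 1.0 - 1.5 = 0.5 > 0, so the prediction is +1.
```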