
Regularized Regression: Geometric Intuition of the Solution
Plus: Cross Validation
CSE 446: Machine Learning
Emily Fox, University of Washington
January 20, 2017

Coordinate descent for lasso (for normalized features)

Coordinate descent for least squares regression
Initialize ŵ = 0 (or smartly…)
while not converged
  for j = 0, 1, …, D
    compute: ρj = Σ_{i=1}^N hj(xi) (yi – ŷi(ŵ-j))
    set: ŵj = ρj
Here ŷi(ŵ-j) is the prediction without feature j, so (yi – ŷi(ŵ-j)) is the residual without feature j.

Coordinate descent for lasso
Initialize ŵ = 0 (or smartly…)
while not converged
  for j = 0, 1, …, D
    compute: ρj = Σ_{i=1}^N hj(xi) (yi – ŷi(ŵ-j))
    set: ŵj = ρj + λ/2   if ρj < -λ/2
         ŵj = 0          if ρj in [-λ/2, λ/2]
         ŵj = ρj – λ/2   if ρj > λ/2

Soft thresholding
The lasso coordinate update is the soft-thresholding operator:
  ŵj = ρj + λ/2   if ρj < -λ/2
  ŵj = 0          if ρj in [-λ/2, λ/2]
  ŵj = ρj – λ/2   if ρj > λ/2
[Figure: plot of ŵj versus ρj — flat at zero on [-λ/2, λ/2], shifted linear pieces outside that interval]

How to assess convergence?
(The algorithm above loops "while not converged" — when do we stop?)

Convergence criteria
When to stop? For convex problems, the algorithm will start to take smaller and smaller steps.
Measure the size of the steps taken in a full loop over all features:
- stop when max step < ε

Other lasso solvers
Classically: Least angle regression (LARS) [Efron et al. '04]
Then: Coordinate descent algorithm [Fu '98; Friedman, Hastie, & Tibshirani '08]
Now:
• Parallel CD (e.g., Shotgun [Bradley et al. '11])
• Other parallel learning approaches for linear models
  - Parallel stochastic gradient descent (SGD) (e.g., Hogwild! [Niu et al. '11])
  - Parallel independent solutions, then averaging [Zhang et al. '12]
• Alternating direction method of multipliers (ADMM) [Boyd et al. '11]
Coordinate descent for lasso (for unnormalized features)

Coordinate descent for lasso with unnormalized features
Precompute: zj = Σ_{i=1}^N hj(xi)²
Initialize ŵ = 0 (or smartly…)
while not converged
  for j = 0, 1, …, D
    compute: ρj = Σ_{i=1}^N hj(xi) (yi – ŷi(ŵ-j))
    set: ŵj = (ρj + λ/2)/zj   if ρj < -λ/2
         ŵj = 0               if ρj in [-λ/2, λ/2]
         ŵj = (ρj – λ/2)/zj   if ρj > λ/2

Geometric intuition for sparsity of the lasso solution

Geometric intuition for ridge regression

Visualizing the ridge cost in 2D
RSS(w) + λ||w||₂² = Σ_{i=1}^N (yi – w0 h0(xi) – w1 h1(xi))² + λ (w0² + w1²)
[Figures: contours in the (w0, w1) plane of the RSS cost, the L2 penalty (circles), and the combined ridge cost]

Visualizing the ridge solution
The ridge solution lies where a level set of the RSS term first intersects a level set of the circular L2 penalty.
[Figure: RSS level sets intersecting circular L2 penalty contours in the (w0, w1) plane]

Geometric intuition for lasso

Visualizing the lasso cost in 2D
RSS(w) + λ||w||₁ = Σ_{i=1}^N (yi – w0 h0(xi) – w1 h1(xi))² + λ (|w0| + |w1|)
[Figures: contours in the (w0, w1) plane of the RSS cost, the L1 penalty (diamonds), and the combined lasso cost]

Visualizing the lasso solution
The lasso solution lies where a level set of the RSS term first intersects a level set of the diamond-shaped L1 penalty; because the diamond has corners on the axes, this intersection often occurs where some coefficients are exactly 0, giving a sparse solution.
[Figure: RSS level sets intersecting diamond-shaped L1 penalty contours in the (w0, w1) plane]

Revisit polynomial fit demo
What happens if we refit our high-order polynomial, but now using lasso regression?
Will consider a few settings of λ…

How to choose λ: Cross validation

If sufficient amount of data…
Split the data into a training set, a validation set, and a test set:
- fit ŵλ on the training set
- test performance of ŵλ on the validation set to select λ*
- assess generalization error of ŵλ* on the test set
Start with smallish dataset
[Figure: a bar representing all the data]

Still form test set and hold out
Split the data into a test set and the rest of the data.

How do we use the other data?
Use the rest of the data for both training and validation, but not so naïvely.

Recall naïve approach
Split the rest of the data into a training set and a small validation set.
Is a small validation set enough to compare performance of ŵλ across λ values? No.

Choosing the validation set
We didn't have to use the last data points tabulated to form the validation set; any data subset works.
Which subset should I use? ALL of them: average performance over all choices of validation set.

K-fold cross validation
Preprocessing: Randomly assign the data to K groups of N/K points each (use the same split of the data for all other steps).
[Figure: the rest of the data divided into K blocks of N/K points]

For k = 1, …, K
1. Estimate ŵλ^(k) on the training blocks (all blocks except block k)
2. Compute the error on the validation block: error_k(λ)
[Figures: blocks 1 through K each taking a turn as the validation set, producing error_1(λ), …, error_K(λ)]

Compute the average error: CV(λ) = (1/K) Σ_{k=1}^K error_k(λ)

Repeat the procedure for each choice of λ.
Choose λ* to minimize CV(λ).

What value of K?
Formally, the best approximation occurs for validation sets of size 1 (K = N): leave-one-out cross validation.
This is computationally intensive: it requires computing N fits of the model per λ.
Typically, K = 5 or 10 (5-fold CV or 10-fold CV).

Choosing λ via cross validation for lasso
Cross validation chooses the λ that provides the best predictive accuracy.
This tends to favor less sparse solutions, and thus smaller λ, than the optimal choice for feature selection.
c.f. "Machine Learning: A Probabilistic Perspective", Murphy, 2012, for further discussion.
Practical concerns with lasso

Issues with the standard lasso objective
1. With a group of highly correlated features, lasso tends to select among them arbitrarily
   - often we would prefer to select them all together
2. Empirically, ridge often has better predictive performance than lasso, but lasso leads to a sparser solution

Elastic net aims to address these issues:
- a hybrid between lasso and ridge regression
- uses both L1 and L2 penalties
See Zou & Hastie '05 for further discussion.

Summary for feature selection and lasso regression

Impact of feature selection and lasso
Lasso has changed machine learning, statistics, and electrical engineering.
But, for feature selection in general, be careful about interpreting selected features:
- selection only considers the features included
- sensitive to correlations between features
- result depends on the algorithm used
- there are theoretical guarantees for lasso under certain conditions

What you can do now…
• Describe "all subsets" and greedy variants for feature selection
• Analyze computational costs of these algorithms
• Formulate the lasso objective
• Describe what happens to estimated lasso coefficients as the tuning parameter λ is varied
• Interpret the lasso coefficient path plot
• Contrast ridge and lasso regression
• Estimate lasso regression parameters using an iterative coordinate descent algorithm
• Implement K-fold cross validation to select the lasso tuning parameter λ

Linear Classifiers
CSE 446: Machine Learning
Emily Fox, University of Washington
January 20, 2017

Linear classifier: Intuition

Classifier
Input x: a sentence from a review, e.g., "Sushi was awesome, the food was awesome, but the service was awful."
MODEL: the classifier
Output y: the predicted class, here ŷ = +1