
Regularized Regression: Geometric intuition of solution
Plus: Cross validation
CSE 446: Machine Learning
Emily Fox, University of Washington
January 20, 2017


Coordinate descent for lasso (for normalized features)


Coordinate descent for least squares regression

Initialize ŵ = 0 (or smartly…)
while not converged:
  for j = 0, 1, …, D:
    compute: ρj = Σ_{i=1}^N hj(xi)(yi − ŷi(ŵ-j))   (residual against the prediction ŷi(ŵ-j) made without feature j)
    set: ŵj = ρj
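A minimal NumPy sketch of this algorithm (function and variable names are my own; it assumes the columns of H are the normalized features hj, so the plain update ŵj = ρj applies):

```python
import numpy as np

def coordinate_descent_least_squares(H, y, num_passes=100):
    """Coordinate descent for least squares with normalized features.

    H: (N, D+1) matrix whose columns are the normalized features hj.
    y: (N,) vector of observed outputs.
    """
    w = np.zeros(H.shape[1])                     # initialize w_hat = 0
    for _ in range(num_passes):                  # "while not converged" (fixed pass count here)
        for j in range(len(w)):
            y_without_j = H @ w - H[:, j] * w[j]     # prediction without feature j
            rho_j = H[:, j] @ (y - y_without_j)      # rho_j = sum_i hj(xi)(yi - yhat_i(w_-j))
            w[j] = rho_j                             # unregularized update
    return w
```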


Coordinate descent for lasso

Initialize ŵ = 0 (or smartly…)
while not converged:
  for j = 0, 1, …, D:
    compute: ρj = Σ_{i=1}^N hj(xi)(yi − ŷi(ŵ-j))
    set: ŵj = ρj + λ/2   if ρj < −λ/2
              0          if ρj in [−λ/2, λ/2]
              ρj − λ/2   if ρj > λ/2


Soft thresholding

ŵj = ρj + λ/2   if ρj < −λ/2
     0          if ρj in [−λ/2, λ/2]
     ρj − λ/2   if ρj > λ/2

[Plot: the soft-thresholded value ŵj as a function of ρj]
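A tiny sketch of the soft-thresholding rule as a standalone function (name and signature are my own):

```python
def soft_threshold(rho_j, lam):
    """Lasso coordinate update for a normalized feature j."""
    if rho_j < -lam / 2.0:
        return rho_j + lam / 2.0
    if rho_j > lam / 2.0:
        return rho_j - lam / 2.0
    return 0.0                  # rho_j in [-lam/2, lam/2] is snapped to exactly zero
```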


How to assess convergence?

(Same lasso coordinate descent algorithm as above: for each j, compute ρj and apply the soft-thresholding update.)


Convergence criteria

When to stop?

For convex problems, the algorithm will start to take smaller and smaller steps as it approaches the optimum.

Measure the size of the steps taken in a full loop over all features, and stop when the maximum step < ε.
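A generic sketch of this stopping rule, assuming a hypothetical sweep(w) function that performs one full coordinate pass, updates w in place, and returns the largest coordinate change it made:

```python
def run_until_converged(sweep, w, eps=1e-6, max_passes=1000):
    """Repeat full passes until the maximum step in a pass drops below eps."""
    for _ in range(max_passes):
        max_step = sweep(w)      # largest |change in w_j| over the full loop j = 0..D
        if max_step < eps:       # stop when max step < epsilon
            break
    return w
```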


Other lasso solvers

Classically: Least angle regression (LARS) [Efron et al. ‘04]

Then: Coordinate descent algorithm [Fu ‘98, Friedman, Hastie, & Tibshirani ’08]

Now:
• Parallel CD (e.g., Shotgun [Bradley et al. ‘11])
• Other parallel learning approaches for linear models
  - Parallel stochastic gradient descent (SGD) (e.g., Hogwild! [Niu et al. ’11])
  - Parallel independent solutions, then averaging [Zhang et al. ‘12]
• Alternating direction method of multipliers (ADMM) [Boyd et al. ’11]


Coordinate descent for lasso (for unnormalized features)


Coordinate descent for lasso with unnormalized features

Precompute: zj = Σ_{i=1}^N hj(xi)²
Initialize ŵ = 0 (or smartly…)
while not converged:
  for j = 0, 1, …, D:
    compute: ρj = Σ_{i=1}^N hj(xi)(yi − ŷi(ŵ-j))
    set: ŵj = (ρj + λ/2)/zj   if ρj < −λ/2
              0               if ρj in [−λ/2, λ/2]
              (ρj − λ/2)/zj   if ρj > λ/2
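A NumPy sketch of this version, combining the zj rescaling with the max-step stopping rule discussed earlier (all names are my own):

```python
import numpy as np

def lasso_coordinate_descent(H, y, lam, eps=1e-6, max_passes=1000):
    """Lasso coordinate descent for unnormalized features."""
    z = (H ** 2).sum(axis=0)                     # precompute z_j = sum_i hj(xi)^2
    w = np.zeros(H.shape[1])
    for _ in range(max_passes):
        max_step = 0.0
        for j in range(len(w)):
            y_without_j = H @ w - H[:, j] * w[j]     # prediction without feature j
            rho_j = H[:, j] @ (y - y_without_j)
            if rho_j < -lam / 2.0:
                w_new = (rho_j + lam / 2.0) / z[j]
            elif rho_j > lam / 2.0:
                w_new = (rho_j - lam / 2.0) / z[j]
            else:
                w_new = 0.0
            max_step = max(max_step, abs(w_new - w[j]))
            w[j] = w_new
        if max_step < eps:                       # converged: largest step in the pass is tiny
            break
    return w
```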


Geometric intuition for sparsity of lasso solution


Geometric intuition for ridge regression


Visualizing the ridge cost in 2D: RSS cost

[Contour plot of the RSS cost over (w0, w1)]

RSS(w) + λ||w||₂² = Σ_{i=1}^N (yi − w0 h0(xi) − w1 h1(xi))² + λ(w0² + w1²)

Visualizing the ridge cost in 2D: L2 penalty

[Contour plot of the L2 penalty λ(w0² + w1²) over (w0, w1)]

RSS(w) + λ||w||₂² = Σ_{i=1}^N (yi − w0 h0(xi) − w1 h1(xi))² + λ(w0² + w1²)


Visualizing the ridge cost in 2D

RSS(w) + λ||w||₂² = Σ_{i=1}^N (yi − w0 h0(xi) − w1 h1(xi))² + λ(w0² + w1²)

Visualizing the ridge solution: level sets intersect

[Contour plot over (w0, w1): the RSS level sets and the L2 penalty level sets intersect at the ridge solution]

RSS(w) + λ||w||₂² = Σ_{i=1}^N (yi − w0 h0(xi) − w1 h1(xi))² + λ(w0² + w1²)


Geometric intuition for lasso


Visualizing the lasso cost in 2D: RSS cost

[Contour plot of the RSS cost over (w0, w1)]

RSS(w) + λ||w||₁ = Σ_{i=1}^N (yi − w0 h0(xi) − w1 h1(xi))² + λ(|w0| + |w1|)


Visualizing the lasso cost in 2D: L1 penalty

[Contour plot of the L1 penalty λ(|w0| + |w1|) over (w0, w1)]

RSS(w) + λ||w||₁ = Σ_{i=1}^N (yi − w0 h0(xi) − w1 h1(xi))² + λ(|w0| + |w1|)

Visualizing the lasso cost in 2D

RSS(w) + λ||w||₁ = Σ_{i=1}^N (yi − w0 h0(xi) − w1 h1(xi))² + λ(|w0| + |w1|)


Visualizing the lasso solution: level sets intersect

[Contour plot over (w0, w1): the RSS level sets and the L1 penalty level sets intersect at the lasso solution, often at a corner where a coordinate is exactly zero]

RSS(w) + λ||w||₁ = Σ_{i=1}^N (yi − w0 h0(xi) − w1 h1(xi))² + λ(|w0| + |w1|)

Revisit polynomial fit demo

What happens if we refit our high-order polynomial, but now using lasso regression?

Will consider a few settings of λ …


How to choose λ: Cross validation


If sufficient amount of data…

Training set | Validation set | Test set

• Training set: fit ŵλ
• Validation set: test performance of ŵλ to select λ*
• Test set: assess generalization error of ŵλ*


Start with smallish dataset

All data


Still form test set and hold out

Rest of data | Test set


How do we use the other data?

Rest of data

use for both training and validation, but not so naively


Recall naïve approach

Training set | Valid. set

small validation set

Is validation set enough to compare performance of ŵλ across λ values?

No


Choosing the validation set

Valid. set

small validation set

We didn’t have to use the last tabulated data points to form the validation set; we can use any subset of the data.


Choosing the validation set

Valid. set

small validation set

Which subset should I use?

ALL of them! Average performance over all choices.


K-fold cross validation

Rest of data

(K blocks, each with N/K data points)

Preprocessing: Randomly assign data to K groups

(use same split of data for all other steps)


K-fold cross validation

Hold out block 1 as the validation set: estimate ŵλ^(1) on the remaining blocks and compute error_1(λ).

For k = 1, …, K:
1. Estimate ŵλ^(k) on the training blocks
2. Compute error on the validation block: error_k(λ)


(The same two steps are repeated with blocks 2, 3, 4, … held out as the validation set in turn, producing error_2(λ), error_3(λ), error_4(λ), ….)

K-fold cross validation

Finally, hold out the last block as the validation set: estimate ŵλ^(K) and compute error_K(λ).

Compute average error: CV(λ) = (1/K) Σ_{k=1}^K error_k(λ)
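A sketch of the whole procedure, assuming user-supplied fit(H, y, lam) and error(H, y, w) callbacks (both hypothetical names, e.g., a lasso solver and a validation-set RSS):

```python
import numpy as np

def k_fold_cv(H, y, lam, fit, error, K=5, seed=0):
    """Return CV(lambda), the average validation error over K folds."""
    N = H.shape[0]
    idx = np.random.default_rng(seed).permutation(N)   # random assignment to K groups,
    folds = np.array_split(idx, K)                      # reused for every lambda
    errors = []
    for k in range(K):
        val = folds[k]                                              # validation block k
        train = np.concatenate([folds[m] for m in range(K) if m != k])
        w_hat = fit(H[train], y[train], lam)                        # estimate w_hat^(k)
        errors.append(error(H[val], y[val], w_hat))                 # error_k(lambda)
    return np.mean(errors)                                          # CV(lambda)
```

λ* is then the candidate value that minimizes this average, e.g., min(lambdas, key=lambda lam: k_fold_cv(H, y, lam, fit, error)).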


K-fold cross validation


Repeat procedure for each choice of λ

Choose λ* to minimize CV(λ)


What value of K?

Formally, the best approximation occurs for validation sets of size 1 (K=N)

leave-one-out cross validation

Computationally intensive - requires computing N fits of model per λ

Typically, K = 5 (5-fold CV) or K = 10 (10-fold CV)


Choosing λ via cross validation for lasso

Cross validation chooses the λ that provides the best predictive accuracy.

This tends to favor less sparse solutions, and thus smaller λ, than the optimal choice for feature selection.

cf. “Machine Learning: A Probabilistic Perspective,” Murphy, 2012, for further discussion


Practical concerns with lasso


Issues with standard lasso objective

1. With a group of highly correlated features, lasso tends to select amongst them arbitrarily
- Often we would prefer to select all of them together

2. Empirically, ridge often has better predictive performance than lasso, but lasso leads to a sparser solution

Elastic net aims to address these issues
- a hybrid between lasso and ridge regression

- uses L1 and L2 penalties

See Zou & Hastie ‘05 for further discussion


Summary for feature selection and lasso regression


Impact of feature selection and lasso

Lasso has changed machine learning, statistics, & electrical engineering

But, for feature selection in general, be careful about interpreting selected features:
- selection only considers features included
- sensitive to correlations between features
- result depends on algorithm used
- there are theoretical guarantees for lasso under certain conditions


What you can do now…

• Describe “all subsets” and greedy variants for feature selection
• Analyze computational costs of these algorithms
• Formulate lasso objective
• Describe what happens to estimated lasso coefficients as tuning parameter λ is varied
• Interpret lasso coefficient path plot
• Contrast ridge and lasso regression
• Estimate lasso regression parameters using an iterative coordinate descent algorithm
• Implement K-fold cross validation to select lasso tuning parameter λ


Linear classifiers

CSE 446: Machine Learning
Emily Fox, University of Washington
January 20, 2017


Linear classifier: Intuition


Classifier

Input x: sentence from a review (“Sushi was awesome, the food was awesome, but the service was awful.”)
→ Classifier MODEL →
Output y: predicted class, ŷ = +1 or ŷ = −1


Simple linear classifier

Score(x) = weighted sum of the features of the sentence (each feature has a coefficient)
Input x: sentence from a review
If Score(x) > 0: ŷ = +1
Else: ŷ = −1


A simple example: Word counts

Input xi: “Sushi was great, the food was awesome, but the service was terrible.”

Feature                          Coefficient
good                              1.0
great                             1.2
awesome                           1.7
bad                              -1.0
terrible                         -2.1
awful                            -3.3
restaurant, the, we, where, …     0.0
…                                 …

Called a linear classifier, because the score is a weighted sum of the features.
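A small sketch that scores this example review using the coefficient table above (the crude tokenization is my own simplification):

```python
coef = {"good": 1.0, "great": 1.2, "awesome": 1.7,
        "bad": -1.0, "terrible": -2.1, "awful": -3.3}   # every other word gets 0.0

def score(sentence):
    """Weighted sum of word counts: sum over words of coef[word] * count(word)."""
    words = sentence.lower().replace(",", " ").replace(".", " ").split()
    return sum(coef.get(word, 0.0) for word in words)

review = "Sushi was great, the food was awesome, but the service was terrible."
s = score(review)            # 1.2 (great) + 1.7 (awesome) - 2.1 (terrible) = 0.8
y_hat = +1 if s > 0 else -1  # 0.8 > 0, so y_hat = +1
```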

More generically…

Model: ŷi = sign(Score(xi))

Score(xi) = w0 h0(xi) + w1 h1(xi) + … + wD hD(xi)

= Σ_{j=0}^D wj hj(xi) = wᵀ h(xi)

feature 1 = h0(x) … e.g., 1
feature 2 = h1(x) … e.g., x[1] = #awesome
feature 3 = h2(x) … e.g., x[2] = #awful
                         or, log(x[7]) · x[2] = log(#bad) · #awful
                         or, tf-idf(“awful”)
…
feature D+1 = hD(x) … some other function of x[1], …, x[d]
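A minimal sketch of this generic model, with illustrative feature functions and weights of my own choosing (they mirror the two-word example on the following slides):

```python
import numpy as np

def predict(w, h, x):
    """y_hat = sign(Score(x)), where Score(x) = w^T h(x)."""
    return 1 if np.dot(w, h(x)) > 0.0 else -1

# Illustrative choices: h(x) = [1, #awesome, #awful], w = [0.0, 1.0, -1.5]
h = lambda x: np.array([1.0, x.count("awesome"), x.count("awful")])
w = np.array([0.0, 1.0, -1.5])

print(predict(w, h, "awesome awesome awful"))   # Score = 2*1.0 - 1.5 = 0.5 > 0 -> +1
```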


Decision boundaries


Suppose only two words had non-zero coefficient

Input       Coefficient   Value
            w0             0.0
#awesome    w1             1.0
#awful      w2            -1.5

Score(x) = 1.0 #awesome − 1.5 #awful

Example input: “Sushi was awesome, the food was awesome, but the service was awful.”

[Scatter plot of reviews in the (#awesome, #awful) plane]


Decision boundary example

Input       Coefficient   Value
            w0             0.0
#awesome    w1             1.0
#awful      w2            -1.5

Score(x) = 1.0 #awesome − 1.5 #awful

The decision boundary 1.0 #awesome − 1.5 #awful = 0 separates the + and − predictions, with Score(x) > 0 on one side and Score(x) < 0 on the other.

[Plot: the line 1.0 #awesome − 1.5 #awful = 0 in the (#awesome, #awful) plane]
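A short sketch checking which side of this boundary a review falls on (the example counts are invented; a tie at Score(x) = 0 is broken arbitrarily here):

```python
def classify(num_awesome, num_awful):
    """Sign of Score(x) = 1.0 * #awesome - 1.5 * #awful."""
    score = 1.0 * num_awesome - 1.5 * num_awful
    return +1 if score > 0 else -1      # points with score == 0 lie on the boundary

print(classify(3, 1))   # score = 1.5  -> +1
print(classify(1, 3))   # score = -3.5 -> -1
```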

For more inputs (linear features)…

Score(x) = w0 + w1 #awesome + w2 #awful + w3 #great

[3D plot over the inputs x[1] = #awesome, x[2], x[3]: with linear features, the decision boundary Score(x) = 0 is a plane]


For general features…

For more general classifiers (not just linear features) → more complicated decision boundary shapes
