Regularized Regression: Geometric intuition of solution
Plus: Cross validation
CSE 446: Machine Learning
Emily Fox, University of Washington
January 20, 2017
Coordinate descent for lasso (for normalized features)
Coordinate descent for least squares regression
Initialize ŵ = 0 (or smartly…)
while not converged:
  for j = 0, 1, …, D:
    compute: ρj = Σi=1..N hj(xi)(yi – ŷi(ŵ-j))
             (ŷi(ŵ-j) is the prediction without feature j, so yi – ŷi(ŵ-j) is the residual)
    set: ŵj = ρj
Coordinate descent for lasso
Initialize ŵ = 0 (or smartly…)
while not converged:
  for j = 0, 1, …, D:
    compute: ρj = Σi=1..N hj(xi)(yi – ŷi(ŵ-j))
    set: ŵj = ρj + λ/2   if ρj < -λ/2
              0          if ρj in [-λ/2, λ/2]
              ρj – λ/2   if ρj > λ/2
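Below is a minimal Python/NumPy sketch of this update, assuming the columns of the feature matrix H are already normalized (unit 2-norm); the function names, tolerance, and iteration cap are illustrative choices, not part of the slides.

```python
import numpy as np

def soft_threshold(rho_j, lam):
    """Soft-thresholding update from the slide (normalized features)."""
    if rho_j < -lam / 2:
        return rho_j + lam / 2
    elif rho_j > lam / 2:
        return rho_j - lam / 2
    else:
        return 0.0

def lasso_coordinate_descent(H, y, lam, tol=1e-4, max_iter=1000):
    """Coordinate descent for lasso.

    H   : (N, D+1) matrix of features h_j(x_i), columns normalized to unit 2-norm
    y   : (N,) vector of targets
    lam : L1 penalty strength lambda
    """
    w = np.zeros(H.shape[1])                      # initialize w_hat = 0
    for _ in range(max_iter):
        max_step = 0.0
        for j in range(H.shape[1]):
            # prediction without feature j: y_hat_i(w_-j)
            y_hat_minus_j = H @ w - H[:, j] * w[j]
            # rho_j = sum_i h_j(x_i) * (y_i - y_hat_i(w_-j))
            rho_j = H[:, j] @ (y - y_hat_minus_j)
            w_j_new = soft_threshold(rho_j, lam)
            max_step = max(max_step, abs(w_j_new - w[j]))
            w[j] = w_j_new
        if max_step < tol:                        # convergence check (see below)
            break
    return w
```

This direct translation recomputes the prediction without feature j from scratch on every coordinate update; a practical solver would maintain the residual incrementally, but the update itself is exactly the soft-thresholding step above.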
Soft thresholding
ŵj = ρj + λ/2   if ρj < -λ/2
     0          if ρj in [-λ/2, λ/2]
     ρj – λ/2   if ρj > λ/2

[Plot: ŵj as a function of ρj, showing the soft-thresholding function: zero on [-λ/2, λ/2], shifted linear outside]
How to assess convergence?
Initialize ŵ = 0 (or smartly…)
while not converged:
  for j = 0, 1, …, D:
    compute: ρj = Σi=1..N hj(xi)(yi – ŷi(ŵ-j))
    set: ŵj = ρj + λ/2   if ρj < -λ/2
              0          if ρj in [-λ/2, λ/2]
              ρj – λ/2   if ρj > λ/2
Convergence criteria
When to stop?
For convex problems, the algorithm will start to take smaller and smaller steps as it converges
Measure the size of the steps taken in a full loop over all features; stop when the max step < ε
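As a small illustration of this criterion (the coefficient values and tolerance below are made up for the example):

```python
epsilon = 1e-4

# coefficient values before and after one full sweep over all features (illustrative)
w_old = [0.50, -1.20, 0.00, 2.10]
w_new = [0.50, -1.21, 0.00, 2.10]

# max step = largest coefficient change during the sweep
max_step = max(abs(new - old) for new, old in zip(w_new, w_old))
converged = max_step < epsilon
print(max_step, converged)   # max_step is about 0.01 > epsilon, so not yet converged
```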
Other lasso solvers
Classically: Least angle regression (LARS) [Efron et al. ‘04]
Then: Coordinate descent algorithm [Fu ‘98, Friedman, Hastie, & Tibshirani ’08]
Now:
• Parallel CD (e.g., Shotgun [Bradley et al. ’11])
• Other parallel learning approaches for linear models
  - Parallel stochastic gradient descent (SGD) (e.g., Hogwild! [Niu et al. ’11])
  - Parallel independent solutions, then averaging [Zhang et al. ’12]
• Alternating direction method of multipliers (ADMM) [Boyd et al. ’11]
Coordinate descent for lasso (for unnormalized features)
Coordinate descent for lasso with unnormalized features

Precompute: zj = Σi=1..N hj(xi)²
Initialize ŵ = 0 (or smartly…)
while not converged:
  for j = 0, 1, …, D:
    compute: ρj = Σi=1..N hj(xi)(yi – ŷi(ŵ-j))
    set: ŵj = (ρj + λ/2)/zj   if ρj < -λ/2
              0               if ρj in [-λ/2, λ/2]
              (ρj – λ/2)/zj   if ρj > λ/2
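A sketch of the same loop with the zj precomputation, under the same illustrative assumptions as the normalized-feature sketch earlier:

```python
import numpy as np

def lasso_cd_unnormalized(H, y, lam, tol=1e-4, max_iter=1000):
    """Coordinate descent for lasso with unnormalized features (illustrative sketch).

    H   : (N, D+1) matrix of features h_j(x_i); columns need not be normalized
    y   : (N,) vector of targets
    lam : L1 penalty strength lambda
    """
    z = (H ** 2).sum(axis=0)                      # precompute z_j = sum_i h_j(x_i)^2
    w = np.zeros(H.shape[1])                      # initialize w_hat = 0
    for _ in range(max_iter):
        max_step = 0.0
        for j in range(H.shape[1]):
            y_hat_minus_j = H @ w - H[:, j] * w[j]    # prediction without feature j
            rho_j = H[:, j] @ (y - y_hat_minus_j)
            if rho_j < -lam / 2:
                w_j_new = (rho_j + lam / 2) / z[j]
            elif rho_j > lam / 2:
                w_j_new = (rho_j - lam / 2) / z[j]
            else:
                w_j_new = 0.0
            max_step = max(max_step, abs(w_j_new - w[j]))
            w[j] = w_j_new
        if max_step < tol:                        # stop when max step < epsilon
            break
    return w
```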
Geometric intuition for sparsity of lasso solution
Geometric intuition for ridge regression
Visualizing the ridge cost in 2D: RSS cost

[Contour plot of the RSS term over (w0, w1)]

RSS(w) + λ||w||₂² = Σi=1..N (yi – w0 h0(xi) – w1 h1(xi))² + λ(w0² + w1²)
Visualizing the ridge cost in 2D: L2 penalty

[Contour plot of the L2 penalty term over (w0, w1): circular level sets]

RSS(w) + λ||w||₂² = Σi=1..N (yi – w0 h0(xi) – w1 h1(xi))² + λ(w0² + w1²)
Visualizing the ridge cost in 2D

RSS(w) + λ||w||₂² = Σi=1..N (yi – w0 h0(xi) – w1 h1(xi))² + λ(w0² + w1²)
Visualizing the ridge solution: level sets intersect

[Contour plot over (w0, w1): the level sets of the RSS term and of the L2 penalty intersect at the ridge solution]

RSS(w) + λ||w||₂² = Σi=1..N (yi – w0 h0(xi) – w1 h1(xi))² + λ(w0² + w1²)
Geometric intuition for lasso
Visualizing the lasso cost in 2D: RSS cost

[Contour plot of the RSS term over (w0, w1)]

RSS(w) + λ||w||₁ = Σi=1..N (yi – w0 h0(xi) – w1 h1(xi))² + λ(|w0| + |w1|)
Visualizing the lasso cost in 2D: L1 penalty

[Contour plot of the L1 penalty term over (w0, w1): diamond-shaped level sets]

RSS(w) + λ||w||₁ = Σi=1..N (yi – w0 h0(xi) – w1 h1(xi))² + λ(|w0| + |w1|)
Visualizing the lasso cost in 2D

RSS(w) + λ||w||₁ = Σi=1..N (yi – w0 h0(xi) – w1 h1(xi))² + λ(|w0| + |w1|)
Visualizing the lasso solution: level sets intersect

[Contour plot over (w0, w1): the level sets of the RSS term and of the L1 penalty (a diamond) intersect at the lasso solution, often at a corner of the diamond where one coefficient is exactly 0]

RSS(w) + λ||w||₁ = Σi=1..N (yi – w0 h0(xi) – w1 h1(xi))² + λ(|w0| + |w1|)
Revisit polynomial fit demo
What happens if we refit our high-order polynomial, but now using lasso regression?
Will consider a few settings of λ …
How to choose λ: Cross validation
If we have a sufficient amount of data…
[Figure: data split into a Training set, a Validation set, and a Test set]

Training set: fit ŵλ
Validation set: test performance of ŵλ to select λ*
Test set: assess generalization error of ŵλ*
Start with smallish dataset
All data
Still form test set and hold out
[Figure: data split into “Rest of data” and a Test set]
How do we use the other data?
Rest of data
use for both training and validation, but not so naively
Recall naïve approach
[Figure: remaining data split into a Training set and a small Validation set]
Is validation set enough to compare performance of ŵλ across λ values?
No
Choosing the validation set
[Figure: a small validation set carved out of the data]
We didn’t have to use the last data points tabulated to form the validation set; we can use any subset of the data
Which subset should I use?
Use them ALL! Average performance over all choices
K-fold cross validation
Rest of data
[Figure: the data divided into K blocks, each with N/K points]
Preprocessing: Randomly assign data to K groups
(use same split of data for all other steps)
K-fold cross validation

[Figure: block k is held out as the validation set; the remaining K–1 blocks form the training set]

For k = 1, …, K:
  1. Estimate ŵλ(k) on the training blocks
  2. Compute error on the validation block: errork(λ)

Repeat, holding each block k = 1, …, K out in turn.

Compute average error: CV(λ) = (1/K) Σk=1..K errork(λ)
Repeat procedure for each choice of λ
Choose λ* to minimize CV(λ)
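A minimal Python/NumPy sketch of this procedure; fit and error are placeholder callables (for example, the lasso coordinate-descent sketch above and a mean squared error on the validation block), and the λ grid in the usage comment is hypothetical.

```python
import numpy as np

def cross_validation_error(H, y, lam, fit, error, K=5, seed=0):
    """CV(lambda): average validation error over K folds (illustrative sketch).

    fit(H_train, y_train, lam)      -> w_hat
    error(H_valid, y_valid, w_hat)  -> scalar validation error
    """
    idx = np.random.default_rng(seed).permutation(len(y))   # randomly assign data to K groups
    folds = np.array_split(idx, K)                           # same split reused for every lambda
    errs = []
    for k in range(K):
        valid = folds[k]
        train = np.concatenate([folds[i] for i in range(K) if i != k])
        w_hat = fit(H[train], y[train], lam)                 # estimate w on the training blocks
        errs.append(error(H[valid], y[valid], w_hat))        # error_k(lambda) on the validation block
    return np.mean(errs)                                     # CV(lambda) = (1/K) * sum_k error_k(lambda)

# Example usage over a hypothetical grid of lambda values:
# lam_star = min(lam_grid, key=lambda lam: cross_validation_error(H, y, lam, fit, error))
```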
What value of K?
Formally, the best approximation occurs for validation sets of size 1 (K=N)
leave-one-out cross validation
Computationally intensive - requires computing N fits of model per λ
Typically, K = 5 or 10 (5-fold or 10-fold CV)
Choosing λ via cross validation for lasso
Cross validation chooses the λ that provides the best predictive accuracy
This tends to favor less sparse solutions, and thus smaller λ, than the optimal choice for feature selection
cf. “Machine Learning: A Probabilistic Perspective”, Murphy, 2012, for further discussion
Practical concerns with lasso
Issues with standard lasso objective
1. With a group of highly correlated features, lasso tends to select among them arbitrarily
   - Often we would prefer to select them all together
2. Empirically, ridge often has better predictive performance than lasso, but lasso leads to a sparser solution
Elastic net aims to address these issues
- a hybrid between lasso and ridge regression
- uses both L1 and L2 penalties
See Zou & Hastie ‘05 for further discussion
Summary for feature selection and lasso regression
Impact of feature selection and lasso
Lasso has changed machine learning, statistics, & electrical engineering
But, for feature selection in general, be careful about interpreting the selected features
- selection only considers the features included
- sensitive to correlations between features
- the result depends on the algorithm used
- there are theoretical guarantees for lasso under certain conditions
What you can do now…
• Describe “all subsets” and greedy variants for feature selection
• Analyze computational costs of these algorithms
• Formulate the lasso objective
• Describe what happens to estimated lasso coefficients as the tuning parameter λ is varied
• Interpret the lasso coefficient path plot
• Contrast ridge and lasso regression
• Estimate lasso regression parameters using an iterative coordinate descent algorithm
• Implement K-fold cross validation to select the lasso tuning parameter λ
Linear classifiers
CSE 446: Machine Learning
Emily Fox, University of Washington
January 20, 2017
Linear classifier: Intuition
Classifier

[Diagram: Input x (a sentence from a review, e.g. “Sushi was awesome, the food was awesome, but the service was awful.”) feeds into the classifier MODEL, which outputs the predicted class y: ŷ = +1 or ŷ = -1]
Simple linear classifier

Score(x) = weighted sum of the features of the sentence

If Score(x) > 0: ŷ = +1
Else:            ŷ = -1

[Table: Feature | Coefficient | … ]
A simple example: Word counts
Input xi: “Sushi was great, the food was awesome, but the service was terrible.”

Feature                          Coefficient
good                             1.0
great                            1.2
awesome                          1.7
bad                              -1.0
terrible                         -2.1
awful                            -3.3
restaurant, the, we, where, …    0.0
…                                …

Called a linear classifier, because the score is a weighted sum of the features.
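For example, reading the coefficients off the table: this review has #great = 1, #awesome = 1, and #terrible = 1 (the other listed words either do not appear or have coefficient 0.0), so Score(xi) = 1.2·1 + 1.7·1 – 2.1·1 = 0.8 > 0, and the predicted class is ŷ = +1.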
More generically…
Model: ŷi = sign(Score(xi))
Score(xi) = w0 h0(xi) + w1 h1(xi) + … + wD hD(xi)
          = Σj=0..D wj hj(xi) = wᵀ h(xi)

feature 1 = h0(x) … e.g., 1 (constant)
feature 2 = h1(x) … e.g., x[1] = #awesome
feature 3 = h2(x) … e.g., x[2] = #awful, or log(x[7])·x[2] = log(#bad)·#awful, or tf-idf(“awful”), …
feature D+1 = hD(x) … some other function of x[1],…, x[d]
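A minimal Python/NumPy sketch of this model, assuming the feature vector h(x) has already been computed; the coefficient values in the usage example are borrowed from the word-count slide, with an assumed constant coefficient w0 = 0.

```python
import numpy as np

def predict(w, h):
    """Linear classifier: y_hat = sign(Score(x)) with Score(x) = w^T h(x)."""
    score = w @ h                      # Score(x) = w_0 h_0(x) + ... + w_D h_D(x)
    return +1 if score > 0 else -1     # +1 if Score(x) > 0, else -1 (as on the earlier slide)

# Features: h_0 = 1 (constant, assumed w_0 = 0), h_1 = #great, h_2 = #awesome, h_3 = #terrible
w = np.array([0.0, 1.2, 1.7, -2.1])
h = np.array([1.0, 1.0, 1.0, 1.0])     # "Sushi was great, the food was awesome, but the service was terrible."
print(predict(w, h))                   # Score = 0.8 > 0, so this prints 1
```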
Decision boundaries
Suppose only two words had non-zero coefficients

Input       Coefficient   Value
(constant)  w0            0.0
#awesome    w1            1.0
#awful      w2            -1.5

Score(x) = 1.0 #awesome – 1.5 #awful

Example review: “Sushi was awesome, the food was awesome, but the service was awful.”

[Plot: reviews plotted by their counts, with #awesome on the horizontal axis and #awful on the vertical axis]
Decision boundary example

Input       Coefficient   Value
(constant)  w0            0.0
#awesome    w1            1.0
#awful      w2            -1.5

Score(x) = 1.0 #awesome – 1.5 #awful

The decision boundary 1.0 #awesome – 1.5 #awful = 0 separates the + and – predictions: Score(x) > 0 on one side of the line, Score(x) < 0 on the other.

[Plot: #awful vs. #awesome with the line 1.0 #awesome – 1.5 #awful = 0 drawn through the data]
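As a check: for the example review from the previous slide, #awesome = 2 and #awful = 1, so Score(x) = 1.0·2 – 1.5·1 = 0.5 > 0, placing the point (2, 1) on the positive side of the boundary and giving the prediction ŷ = +1.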
For more inputs (linear features)…

Score(x) = w0 + w1 #awesome + w2 #awful + w3 #great

[3D plot with axes x[1] = #awesome, x[2] = #awful, x[3] = #great; the decision boundary is now a plane]
For general features…
For more general classifiers (not just linear features) → more complicated decision boundary shapes