
Regularized Regression: Geometric intuition of solution
Plus: Cross validation
CSE 446: Machine Learning
Emily Fox, University of Washington
January 20, 2017


Coordinate descent for lasso (for normalized features)


Coordinate descent for least squares regression

Initialize ŵ = 0 (or smartly…)
while not converged:
  for j = 0, 1, …, D:
    compute: ρj = Σ_{i=1}^N hj(xi)(yi − ŷi(ŵ-j))   (residual against the prediction ŷi(ŵ-j) made without feature j)
    set: ŵj = ρj
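A minimal NumPy sketch of this algorithm (function and variable names are my own; it assumes the columns of H are the normalized features hj, so the plain update ŵj = ρj applies):

```python
import numpy as np

def coordinate_descent_least_squares(H, y, num_passes=100):
    """Coordinate descent for least squares with normalized features.

    H: (N, D+1) matrix whose columns are the normalized features hj.
    y: (N,) vector of observed outputs.
    """
    w = np.zeros(H.shape[1])                     # initialize w_hat = 0
    for _ in range(num_passes):                  # "while not converged" (fixed pass count here)
        for j in range(len(w)):
            y_without_j = H @ w - H[:, j] * w[j]     # prediction without feature j
            rho_j = H[:, j] @ (y - y_without_j)      # rho_j = sum_i hj(xi)(yi - yhat_i(w_-j))
            w[j] = rho_j                             # unregularized update
    return w
```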


Coordinate descent for lasso

Initialize ŵ = 0 (or smartly…)
while not converged:
  for j = 0, 1, …, D:
    compute: ρj = Σ_{i=1}^N hj(xi)(yi − ŷi(ŵ-j))
    set: ŵj = ρj + λ/2   if ρj < −λ/2
              0          if ρj in [−λ/2, λ/2]
              ρj − λ/2   if ρj > λ/2


Soft thresholding

ŵj = ρj + λ/2   if ρj < −λ/2
     0          if ρj in [−λ/2, λ/2]
     ρj − λ/2   if ρj > λ/2

[Plot: the soft-thresholded value ŵj as a function of ρj]
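A tiny sketch of the soft-thresholding rule as a standalone function (name and signature are my own):

```python
def soft_threshold(rho_j, lam):
    """Lasso coordinate update for a normalized feature j."""
    if rho_j < -lam / 2.0:
        return rho_j + lam / 2.0
    if rho_j > lam / 2.0:
        return rho_j - lam / 2.0
    return 0.0                  # rho_j in [-lam/2, lam/2] is snapped to exactly zero
```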


How to assess convergence?

(Same lasso coordinate descent algorithm as above: for each j, compute ρj and apply the soft-thresholding update.)


Convergence criteria

When to stop?

For convex problems, the algorithm will start to take smaller and smaller steps as it approaches the optimum.

Measure the size of the steps taken in a full loop over all features, and stop when the maximum step < ε.
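A generic sketch of this stopping rule, assuming a hypothetical sweep(w) function that performs one full coordinate pass, updates w in place, and returns the largest coordinate change it made:

```python
def run_until_converged(sweep, w, eps=1e-6, max_passes=1000):
    """Repeat full passes until the maximum step in a pass drops below eps."""
    for _ in range(max_passes):
        max_step = sweep(w)      # largest |change in w_j| over the full loop j = 0..D
        if max_step < eps:       # stop when max step < epsilon
            break
    return w
```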


Other lasso solvers

Classically: Least angle regression (LARS) [Efron et al. ‘04]

Then: Coordinate descent algorithm [Fu ‘98, Friedman, Hastie, & Tibshirani ’08]

Now:
• Parallel CD (e.g., Shotgun [Bradley et al. ‘11])
• Other parallel learning approaches for linear models
  - Parallel stochastic gradient descent (SGD) (e.g., Hogwild! [Niu et al. ’11])
  - Parallel independent solutions, then averaging [Zhang et al. ‘12]
• Alternating direction method of multipliers (ADMM) [Boyd et al. ’11]


Coordinate descent for lasso (for unnormalized features)


Coordinate descent for lasso with unnormalized features

Precompute: zj = Σ_{i=1}^N hj(xi)²
Initialize ŵ = 0 (or smartly…)
while not converged:
  for j = 0, 1, …, D:
    compute: ρj = Σ_{i=1}^N hj(xi)(yi − ŷi(ŵ-j))
    set: ŵj = (ρj + λ/2)/zj   if ρj < −λ/2
              0               if ρj in [−λ/2, λ/2]
              (ρj − λ/2)/zj   if ρj > λ/2
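A NumPy sketch of this version, combining the zj rescaling with the max-step stopping rule discussed earlier (all names are my own):

```python
import numpy as np

def lasso_coordinate_descent(H, y, lam, eps=1e-6, max_passes=1000):
    """Lasso coordinate descent for unnormalized features."""
    z = (H ** 2).sum(axis=0)                     # precompute z_j = sum_i hj(xi)^2
    w = np.zeros(H.shape[1])
    for _ in range(max_passes):
        max_step = 0.0
        for j in range(len(w)):
            y_without_j = H @ w - H[:, j] * w[j]     # prediction without feature j
            rho_j = H[:, j] @ (y - y_without_j)
            if rho_j < -lam / 2.0:
                w_new = (rho_j + lam / 2.0) / z[j]
            elif rho_j > lam / 2.0:
                w_new = (rho_j - lam / 2.0) / z[j]
            else:
                w_new = 0.0
            max_step = max(max_step, abs(w_new - w[j]))
            w[j] = w_new
        if max_step < eps:                       # converged: largest step in the pass is tiny
            break
    return w
```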


Geometric intuition for sparsity of lasso solution


Geometric intuition for ridge regression


Visualizing the ridge cost in 2D: RSS cost

[Contour plot of the RSS cost over (w0, w1)]

RSS(w) + λ||w||₂² = Σ_{i=1}^N (yi − w0 h0(xi) − w1 h1(xi))² + λ(w0² + w1²)

Visualizing the ridge cost in 2D: L2 penalty

[Contour plot of the L2 penalty λ(w0² + w1²) over (w0, w1)]

RSS(w) + λ||w||₂² = Σ_{i=1}^N (yi − w0 h0(xi) − w1 h1(xi))² + λ(w0² + w1²)


Visualizing the ridge cost in 2D

RSS(w) + λ||w||₂² = Σ_{i=1}^N (yi − w0 h0(xi) − w1 h1(xi))² + λ(w0² + w1²)

Visualizing the ridge solution: level sets intersect

[Contour plot over (w0, w1): the RSS level sets and the L2 penalty level sets intersect at the ridge solution]

RSS(w) + λ||w||₂² = Σ_{i=1}^N (yi − w0 h0(xi) − w1 h1(xi))² + λ(w0² + w1²)


Geometric intuition for lasso


Visualizing the lasso cost in 2D: RSS cost

[Contour plot of the RSS cost over (w0, w1)]

RSS(w) + λ||w||₁ = Σ_{i=1}^N (yi − w0 h0(xi) − w1 h1(xi))² + λ(|w0| + |w1|)


Visualizing the lasso cost in 2D: L1 penalty

[Contour plot of the L1 penalty λ(|w0| + |w1|) over (w0, w1)]

RSS(w) + λ||w||₁ = Σ_{i=1}^N (yi − w0 h0(xi) − w1 h1(xi))² + λ(|w0| + |w1|)

Visualizing the lasso cost in 2D

RSS(w) + λ||w||₁ = Σ_{i=1}^N (yi − w0 h0(xi) − w1 h1(xi))² + λ(|w0| + |w1|)


Visualizing the lasso solution: level sets intersect

[Contour plot over (w0, w1): the RSS level sets and the L1 penalty level sets intersect at the lasso solution, often at a corner where a coordinate is exactly zero]

RSS(w) + λ||w||₁ = Σ_{i=1}^N (yi − w0 h0(xi) − w1 h1(xi))² + λ(|w0| + |w1|)

Revisit polynomial fit demo

What happens if we refit our high-order polynomial, but now using lasso regression?

Will consider a few settings of λ …


How to choose λ: Cross validation


If sufficient amount of data…

Training set | Validation set | Test set

• Training set: fit ŵλ
• Validation set: test performance of ŵλ to select λ*
• Test set: assess generalization error of ŵλ*


Start with smallish dataset

All data


Still form test set and hold out

Rest of data | Test set


How do we use the other data?

Rest of data

use for both training and validation, but not so naively


Recall naïve approach

Training set | Valid. set

small validation set

Is validation set enough to compare performance of ŵλ across λ values?

No


Choosing the validation set

Valid. set

small validation set

We didn’t have to use the last tabulated data points to form the validation set; we can use any subset of the data.


Choosing the validation set

Valid. set

small validation set

Which subset should I use?

ALL of them! Average performance over all choices.


K-fold cross validation

Rest of data

(K blocks, each with N/K data points)

Preprocessing: Randomly assign data to K groups

(use same split of data for all other steps)


K-fold cross validation

Hold out block 1 as the validation set: estimate ŵλ^(1) on the remaining blocks and compute error_1(λ).

For k = 1, …, K:
1. Estimate ŵλ^(k) on the training blocks
2. Compute error on the validation block: error_k(λ)


(The same two steps are repeated with blocks 2, 3, 4, … held out as the validation set in turn, producing error_2(λ), error_3(λ), error_4(λ), ….)

K-fold cross validation

Finally, hold out the last block as the validation set: estimate ŵλ^(K) and compute error_K(λ).

Compute average error: CV(λ) = (1/K) Σ_{k=1}^K error_k(λ)
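A sketch of the whole procedure, assuming user-supplied fit(H, y, lam) and error(H, y, w) callbacks (both hypothetical names, e.g., a lasso solver and a validation-set RSS):

```python
import numpy as np

def k_fold_cv(H, y, lam, fit, error, K=5, seed=0):
    """Return CV(lambda), the average validation error over K folds."""
    N = H.shape[0]
    idx = np.random.default_rng(seed).permutation(N)   # random assignment to K groups,
    folds = np.array_split(idx, K)                      # reused for every lambda
    errors = []
    for k in range(K):
        val = folds[k]                                              # validation block k
        train = np.concatenate([folds[m] for m in range(K) if m != k])
        w_hat = fit(H[train], y[train], lam)                        # estimate w_hat^(k)
        errors.append(error(H[val], y[val], w_hat))                 # error_k(lambda)
    return np.mean(errors)                                          # CV(lambda)
```

λ* is then the candidate value that minimizes this average, e.g., min(lambdas, key=lambda lam: k_fold_cv(H, y, lam, fit, error)).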


K-fold cross validation


Repeat procedure for each choice of λ

Choose λ* to minimize CV(λ)


What value of K?

Formally, the best approximation occurs for validation sets of size 1 (K=N)

leave-one-out cross validation

Computationally intensive - requires computing N fits of model per λ

Typically, K = 5 (5-fold CV) or K = 10 (10-fold CV)


Choosing λ via cross validation for lasso

Cross validation chooses the λ that provides the best predictive accuracy.

This tends to favor less sparse solutions, and thus smaller λ, than the optimal choice for feature selection.

cf. “Machine Learning: A Probabilistic Perspective,” Murphy, 2012, for further discussion


Practical concerns with lasso


Issues with standard lasso objective

1. With a group of highly correlated features, lasso tends to select amongst them arbitrarily
- Often we would prefer to select all of them together

2. Empirically, ridge often has better predictive performance than lasso, but lasso leads to a sparser solution

Elastic net aims to address these issues
- a hybrid between lasso and ridge regression

- uses L1 and L2 penalties

See Zou & Hastie ‘05 for further discussion


Summary for feature selection and lasso regression


Impact of feature selection and lasso

Lasso has changed machine learning, statistics, & electrical engineering

But, for feature selection in general, be careful about interpreting selected features:
- selection only considers features included
- sensitive to correlations between features
- result depends on algorithm used
- there are theoretical guarantees for lasso under certain conditions


What you can do now…

• Describe “all subsets” and greedy variants for feature selection
• Analyze computational costs of these algorithms
• Formulate lasso objective
• Describe what happens to estimated lasso coefficients as tuning parameter λ is varied
• Interpret lasso coefficient path plot
• Contrast ridge and lasso regression
• Estimate lasso regression parameters using an iterative coordinate descent algorithm
• Implement K-fold cross validation to select lasso tuning parameter λ


Linear classifiers

CSE 446: Machine Learning
Emily Fox, University of Washington
January 20, 2017


Linear classifier: Intuition


Classifier

Input x: sentence from a review (“Sushi was awesome, the food was awesome, but the service was awful.”)
→ Classifier MODEL →
Output y: predicted class, ŷ = +1 or ŷ = −1


Simple linear classifier

Score(x) = weighted sum of the features of the sentence (each feature has a coefficient)
Input x: sentence from a review
If Score(x) > 0: ŷ = +1
Else: ŷ = −1


A simple example: Word counts

Input xi: “Sushi was great, the food was awesome, but the service was terrible.”

Feature                          Coefficient
good                              1.0
great                             1.2
awesome                           1.7
bad                              -1.0
terrible                         -2.1
awful                            -3.3
restaurant, the, we, where, …     0.0
…                                 …

Called a linear classifier, because the score is a weighted sum of the features.
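A small sketch that scores this example review using the coefficient table above (the crude tokenization is my own simplification):

```python
coef = {"good": 1.0, "great": 1.2, "awesome": 1.7,
        "bad": -1.0, "terrible": -2.1, "awful": -3.3}   # every other word gets 0.0

def score(sentence):
    """Weighted sum of word counts: sum over words of coef[word] * count(word)."""
    words = sentence.lower().replace(",", " ").replace(".", " ").split()
    return sum(coef.get(word, 0.0) for word in words)

review = "Sushi was great, the food was awesome, but the service was terrible."
s = score(review)            # 1.2 (great) + 1.7 (awesome) - 2.1 (terrible) = 0.8
y_hat = +1 if s > 0 else -1  # 0.8 > 0, so y_hat = +1
```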

More generically…

Model: ŷi = sign(Score(xi))

Score(xi) = w0 h0(xi) + w1 h1(xi) + … + wD hD(xi)

= Σ_{j=0}^D wj hj(xi) = wᵀ h(xi)

feature 1 = h0(x) … e.g., 1
feature 2 = h1(x) … e.g., x[1] = #awesome
feature 3 = h2(x) … e.g., x[2] = #awful
                         or, log(x[7]) · x[2] = log(#bad) · #awful
                         or, tf-idf(“awful”)
…
feature D+1 = hD(x) … some other function of x[1], …, x[d]
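A minimal sketch of this generic model, with illustrative feature functions and weights of my own choosing (they mirror the two-word example on the following slides):

```python
import numpy as np

def predict(w, h, x):
    """y_hat = sign(Score(x)), where Score(x) = w^T h(x)."""
    return 1 if np.dot(w, h(x)) > 0.0 else -1

# Illustrative choices: h(x) = [1, #awesome, #awful], w = [0.0, 1.0, -1.5]
h = lambda x: np.array([1.0, x.count("awesome"), x.count("awful")])
w = np.array([0.0, 1.0, -1.5])

print(predict(w, h, "awesome awesome awful"))   # Score = 2*1.0 - 1.5 = 0.5 > 0 -> +1
```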


Decision boundaries


Suppose only two words had non-zero coefficient

Input       Coefficient   Value
            w0             0.0
#awesome    w1             1.0
#awful      w2            -1.5

Score(x) = 1.0 #awesome − 1.5 #awful

Example input: “Sushi was awesome, the food was awesome, but the service was awful.”

[Scatter plot of reviews in the (#awesome, #awful) plane]


Decision boundary example

Input       Coefficient   Value
            w0             0.0
#awesome    w1             1.0
#awful      w2            -1.5

Score(x) = 1.0 #awesome − 1.5 #awful

The decision boundary 1.0 #awesome − 1.5 #awful = 0 separates the + and − predictions, with Score(x) > 0 on one side and Score(x) < 0 on the other.

[Plot: the line 1.0 #awesome − 1.5 #awful = 0 in the (#awesome, #awful) plane]
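A short sketch checking which side of this boundary a review falls on (the example counts are invented; a tie at Score(x) = 0 is broken arbitrarily here):

```python
def classify(num_awesome, num_awful):
    """Sign of Score(x) = 1.0 * #awesome - 1.5 * #awful."""
    score = 1.0 * num_awesome - 1.5 * num_awful
    return +1 if score > 0 else -1      # points with score == 0 lie on the boundary

print(classify(3, 1))   # score = 1.5  -> +1
print(classify(1, 3))   # score = -3.5 -> -1
```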

For more inputs (linear features)…

Score(x) = w0 + w1 #awesome + w2 #awful + w3 #great

[3D plot over the inputs x[1] = #awesome, x[2], x[3]: with linear features, the decision boundary Score(x) = 0 is a plane]


For general features…

For more general classifiers (not just linear features) → more complicated decision boundary shapes
