
Linear classifiers: Overfitting and regularization
CSE 446: Machine Learning
Emily Fox, University of Washington
January 25, 2017


Logistic regression recap


Thus far, we focused on decision boundaries

Score(xi) = w0 h0(xi) + w1 h1(xi) + … + wD hD(xi) = wTh(xi)

Relate Score(xi) to P̂(y=+1|xi,ŵ)?

[Figure: decision boundary 1.0 #awesome – 1.5 #awful = 0 in the (#awesome, #awful) plane; Score(x) > 0 on one side, Score(x) < 0 on the other.]


Compare and contrast regression models

• Linear regression with Gaussian errors:
  yi = wTh(xi) + εi,  εi ~ N(0, σ²)  è  p(y|x,w) = N(y; wTh(x), σ²), mean µ = wTh(x)

• Logistic regression:
  P(y=+1|x,w) = 1 / (1 + e^(-wTh(x)))
  P(y=-1|x,w) = e^(-wTh(x)) / (1 + e^(-wTh(x)))


Understanding the logistic regression model

P(y=+1|xi,w) = sigmoid(Score(xi)) = 1 / (1 + e^(-wTh(xi)))

[Figure: sigmoid curve 1/(1 + e^(-wTh(x))) plotted against wTh(x), mapping Score(xi) to P(y=+1|xi,w).]
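To make the mapping concrete, here is a minimal sketch in Python/NumPy (the feature vector and function name are illustrative; the coefficients match the 1.0 #awesome – 1.5 #awful boundary from the recap):

```python
import numpy as np

def predict_probability(w, h_x):
    """P(y=+1 | x, w) = sigmoid(Score(x)) = sigmoid(wTh(x))."""
    score = np.dot(w, h_x)
    return 1.0 / (1.0 + np.exp(-score))

# Hypothetical input with h(x) = [1, #awesome, #awful] and w = [0.0, 1.0, -1.5]
print(predict_probability(np.array([0.0, 1.0, -1.5]),
                          np.array([1.0, 3.0, 1.0])))  # Score = 1.5, probability ≈ 0.82
```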

Predict most likely class: P̂(y|x) = estimate of class probabilities

Input: x (sentence from review)
If P̂(y=+1|x) > 0.5:
  ŷ = +1
Else:
  ŷ = -1

Estimating P̂(y|x) improves interpretability:
- Predict ŷ = +1 and tell me how sure you are


Gradient ascent algorithm, cont’d


Step 5: Gradient over all data points

Sum over data points; feature value hj(xi); difference between truth (1[yi = +1]) and prediction (P(y=+1|xi,w)):

ℓ(w) = Σ_{i=1}^N [ (1[yi = +1] - 1) wTh(xi) - ln(1 + e^(-wTh(xi))) ]

∂ℓ(w)/∂wj = Σ_{i=1}^N hj(xi) ( 1[yi = +1] - P(y=+1 | xi, w) )

Contribution of data point i to the j-th partial:

            P(y=+1|xi,w) ≈ 1            P(y=+1|xi,w) ≈ 0
yi = +1     ≈ 0 (no change needed)      ≈ hj(xi) (push score up)
yi = -1     ≈ -hj(xi) (push score down) ≈ 0 (no change needed)


Summary of gradient ascent for logistic regression

init w(1)=0 (or randomly, or smartly), t=1

while ||∇ℓ(w(t))|| > ε:
  for j=0,…,D:
    partial[j] = Σ_{i=1}^N hj(xi) ( 1[yi = +1] - P(y=+1 | xi, w(t)) )
    wj(t+1) ß wj(t) + η partial[j]
  t ß t + 1
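A minimal runnable sketch of this loop in Python/NumPy (function and variable names such as `gradient_ascent`, `H`, and `step_size` are illustrative, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_ascent(H, y, step_size=0.1, tolerance=1e-4, max_iter=1000):
    """Batch gradient ascent for logistic regression.

    H: (N, D+1) feature matrix with rows h(x_i) (first column all 1s for w0).
    y: (N,) labels in {+1, -1}.
    """
    w = np.zeros(H.shape[1])                 # init w = 0
    indicator = (y == +1).astype(float)      # 1[y_i = +1]
    for _ in range(max_iter):
        prob = sigmoid(H @ w)                # P(y=+1 | x_i, w) for every i
        gradient = H.T @ (indicator - prob)  # partial[j] for every j at once
        if np.linalg.norm(gradient) < tolerance:
            break
        w = w + step_size * gradient
    return w
```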


Overfitting in classification


Learned decision boundary

Feature   Value   Coefficient learned

h0(x) 1 0.23

h1(x) x[1] 1.12

h2(x) x[2] -1.07


Quadratic features (in 2d)
Note: we are not including cross terms for simplicity

Feature   Value   Coefficient learned

h0(x) 1 1.68

h1(x) x[1] 1.39

h2(x)   x[2]       -0.59
h3(x)   (x[1])^2   -0.17
h4(x)   (x[2])^2   -0.96


Degree 6 features (in 2d)
Note: we are not including cross terms for simplicity

Coefficient values getting large!

Feature   Value      Coefficient learned
h0(x)     1          21.6
h1(x)     x[1]       5.3
h2(x)     x[2]       -42.7
h3(x)     (x[1])^2   -15.9
h4(x)     (x[2])^2   -48.6
h5(x)     (x[1])^3   -11.0
h6(x)     (x[2])^3   67.0
h7(x)     (x[1])^4   1.5
h8(x)     (x[2])^4   48.0
h9(x)     (x[1])^5   4.4
h10(x)    (x[2])^5   -14.2
h11(x)    (x[1])^6   0.8
h12(x)    (x[2])^6   -8.6


Degree 20 features (in 2d)
Note: we are not including cross terms for simplicity

Feature   Value       Coefficient learned
h0(x)     1           8.7
h1(x)     x[1]        5.1
h2(x)     x[2]        78.7
…         …           …
h11(x)    (x[1])^6    -7.5
h12(x)    (x[2])^6    3803
h13(x)    (x[1])^7    21.1
h14(x)    (x[2])^7    -2406
…         …           …
h37(x)    (x[1])^19   -2*10^-6
h38(x)    (x[2])^19   -0.15
h39(x)    (x[1])^20   -2*10^-8
h40(x)    (x[2])^20   0.03

Often, overfitting is associated with very large estimated coefficients ŵ.


Overfitting in classification

Overfitting if there exists w* such that:
• training_error(w*) > training_error(ŵ)
• true_error(w*) < true_error(ŵ)

[Figure: classification error vs. model complexity.]


Overfitting in classifiers è Overconfident predictions


Logistic regression model

[Figure: wTh(xi) ranges over (-∞, +∞); the sigmoid maps it to P(y=+1|xi,w) in [0, 1], with value 0.5 at wTh(xi) = 0.]

P(y=+1|xi,w) = sigmoid(wTh(xi))


The subtle (negative) consequence of overfitting in logistic regression

Overfitting è Large coefficient values

ŵTh(xi) is very positive (or very negative) è sigmoid(ŵTh(xi)) goes to 1 (or to 0)

Model becomes extremely overconfident of predictions


Effect of coefficients on logistic regression model

Input x: #awesome=2, #awful=1

Three coefficient settings (all with w0 = 0):
• w#awesome = +1, w#awful = -1
• w#awesome = +2, w#awful = -2
• w#awesome = +6, w#awful = -6

[Figure: three plots of 1/(1 + e^(-wTh(x))) against (#awesome - #awful); larger coefficient magnitudes make the sigmoid steeper.]
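A quick numerical check of this effect (a sketch; the three coefficient settings and the input #awesome=2, #awful=1 come from the slide, the code itself is illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_awesome, n_awful = 2, 1                                  # the input x from the slide
for w_awesome, w_awful in [(1, -1), (2, -2), (6, -6)]:     # w0 = 0 in all three settings
    score = w_awesome * n_awesome + w_awful * n_awful
    print(w_awesome, w_awful, round(sigmoid(score), 3))    # ≈ 0.731, 0.881, 0.998
```

Same decision boundary each time, but the predicted probability for the same input becomes far more extreme as the coefficients grow.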


Learned probabilities

Feature   Value   Coefficient learned

h0(x) 1 0.23

h1(x) x[1] 1.12

h2(x)   x[2]   -1.07

P(y=+1|x,w) = 1 / (1 + e^(-wTh(x)))


Quadratic features: Learned probabilities

Feature   Value   Coefficient learned

h0(x) 1 1.68

h1(x) x[1] 1.39

h2(x)   x[2]       -0.58
h3(x)   (x[1])^2   -0.17
h4(x)   (x[2])^2   -0.96

P(y=+1|x,w) = 1 / (1 + e^(-wTh(x)))


Overfitting è Overconfident predictions

[Figure: learned probabilities for the degree 6 and degree 20 models.]

Tiny uncertainty regions è Overfitting & overconfident about it!!!

We are sure we are right, when we are surely wrong!


Overfitting in logistic regression: Another perspective


Linearly-separable data

Data are linearly separable if:
• There exist coefficients ŵ such that:
  - For all positive training data: Score(xi) = ŵTh(xi) > 0
  - For all negative training data: Score(xi) = ŵTh(xi) < 0

Note 1: If you are using D features, this separation happens in a D-dimensional space

Note 2: If you have enough features, data are (almost) always linearly separable, in which case training_error(ŵ) = 0


Effect of linear separability on coefficients

Data are linearly separable with ŵ1 = 1.0 and ŵ2 = -1.5
Data are also linearly separable with ŵ1 = 10 and ŵ2 = -15
Data are also linearly separable with ŵ1 = 10^9 and ŵ2 = -1.5x10^9

[Figure: decision boundary 1.0 #awesome – 1.5 #awful = 0 in the (x[1]=#awesome, x[2]=#awful) plane; Score(x) > 0 on one side, Score(x) < 0 on the other.]


Effect of linear separability on weights

Maximum likelihood estimation (MLE) prefers the most certain model è Coefficients go to infinity for linearly-separable data!!!

As the coefficients are scaled up, P(y=+1|xi, ŵ) is pushed toward 0 or 1:
• Data are linearly separable with ŵ1 = 1.0 and ŵ2 = -1.5
• Data are also linearly separable with ŵ1 = 10 and ŵ2 = -15
• Data are also linearly separable with ŵ1 = 10^9 and ŵ2 = -1.5x10^9

[Figure: the same decision boundary in the (x[1]=#awesome, x[2]=#awful) plane as on the previous slide.]


Overfitting in logistic regression is “twice as bad”

• Learning tries to find a decision boundary that separates the data è overly complex boundary
• If data are linearly separable è coefficients go to infinity! (e.g., ŵ1 = 10^9, ŵ2 = -1.5x10^9)


Penalizing large coefficients to mitigate overfitting


Desired total cost format

Want to balance: i. How well function fits data ii. Magnitude of coefficients

Total quality = measure of fit - measure of magnitude of coefficients

measure of fit (data likelihood): large # = good fit to training data
measure of magnitude of coefficients: large # = overfit

Consider resulting objective

Select ŵ to maximize: ℓ(w) - λ ||w||₂²
λ: tuning parameter = balance of fit and magnitude

This objective is L2 regularized logistic regression.

Pick λ using (as in ridge/lasso regression):
• Validation set (for large datasets)
• Cross-validation (for smaller datasets)


Degree 20 features, effect of regularization penalty λ

Regularization:          λ = 0           λ = 0.00001      λ = 0.001       λ = 1           λ = 10
Range of coefficients:   -3170 to 3803   -8.04 to 12.14   -0.70 to 1.25   -0.13 to 0.57   -0.05 to 0.22
Decision boundary:       [figures]


Coefficient path

[Figure: coefficient path; each ŵj is plotted as a function of λ, and the coefficients shrink toward 0 as λ increases.]


Degree 20 features: regularization reduces “overconfidence”

Regularization:          λ = 0           λ = 0.00001      λ = 0.001       λ = 1
Range of coefficients:   -3170 to 3803   -8.04 to 12.14   -0.70 to 1.25   -0.13 to 0.57
Learned probabilities:   [figures]


Finding best L2 regularized linear classifier with gradient ascent


Understanding contribution of L2 regularization

Total derivative: ∂ℓ(w)/∂wj - 2λwj   (the -2λwj term comes from the L2 penalty: ∂/∂wj [λ||w||₂²] = 2λwj)

Impact of the -2λwj term on wj:
• wj > 0: term is negative è pushes wj down toward 0
• wj < 0: term is positive è pushes wj up toward 0


Summary of gradient ascent for logistic regression + L2 regularization

init w(1)=0 (or randomly, or smartly), t=1
while not converged:
  for j=0,…,D:
    partial[j] = Σ_{i=1}^N hj(xi) ( 1[yi = +1] - P(y=+1 | xi, w(t)) )
    wj(t+1) ß wj(t) + η (partial[j] - 2λ wj(t))
  t ß t + 1
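A minimal sketch of this regularized loop in Python/NumPy (names like `l2_penalty` and `H` are illustrative; following the slide's pseudocode, the update below penalizes all coefficients, including w0):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def l2_regularized_gradient_ascent(H, y, l2_penalty, step_size=0.1, max_iter=1000):
    """Gradient ascent for L2 regularized logistic regression: maximize l(w) - lambda*||w||^2.

    H: (N, D+1) feature matrix with rows h(x_i); y: (N,) labels in {+1, -1}.
    """
    w = np.zeros(H.shape[1])
    indicator = (y == +1).astype(float)
    for _ in range(max_iter):
        prob = sigmoid(H @ w)
        partial = H.T @ (indicator - prob)                    # likelihood part of the gradient
        w = w + step_size * (partial - 2.0 * l2_penalty * w)  # L2 term contributes -2*lambda*w_j
    return w
```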


Summary of overfitting in logistic regression


What you can do now…

• Identify when overfitting is happening
• Relate large learned coefficients to overfitting
• Describe the impact of overfitting on decision boundaries and predicted probabilities of linear classifiers
• Motivate the form of the L2 regularized logistic regression quality metric
• Describe what happens to estimated coefficients as the tuning parameter λ is varied
• Interpret coefficient path plots
• Estimate L2 regularized logistic regression coefficients using gradient ascent
• Describe the use of L1 regularization to obtain sparse logistic regression solutions


Linear classifiers: Scaling up learning via SGD
CSE 446: Machine Learning
Emily Fox, University of Washington
January 25, 2017


Stochastic gradient: Learning, one data point at a time


Why gradient ascent is slow…

[Diagram: w(t) → update coefficients → w(t+1) → update coefficients → w(t+2); each update computes the gradient ∇ℓ(w(t)) with a pass over the data.]

Every update requires a full pass over the data.


More formally: How expensive is gradient ascent?

Sum over data points

∂ℓ(w)/∂wj = Σ_{i=1}^N hj(xi) ( 1[yi = +1] - P(y=+1 | xi, w) )

Contribution of data point xi,yi to gradient

∂ℓi(w)/∂wj


Every step requires touching every data point!!!

Sum over data points

∂ℓ(w)/∂wj = Σ_{i=1}^N ∂ℓi(w)/∂wj = Σ_{i=1}^N hj(xi) ( 1[yi = +1] - P(y=+1 | xi, w) )

Time to compute contribution of xi,yi   # of data points (N)   Total time to compute 1 step of gradient ascent
1 millisecond                           1000                   1 second
1 second                                1000                   16.7 minutes
1 millisecond                           10 million             2.8 hours
1 millisecond                           10 billion             115.7 days


Stochastic gradient ascent

[Diagram: w(t) → update → w(t+1) → update → w(t+2) → update → w(t+3) → update → w(t+4); each update computes the gradient using only a small subset of the data.]

Many updates for each pass over the data.


Example: Instead of all data points for gradient, use 1 data point only???

Gradient ascent (sum over data points):
∂ℓ(w)/∂wj = Σ_{i=1}^N hj(xi) ( 1[yi = +1] - P(y=+1 | xi, w) )

Stochastic gradient ascent (each time, pick a different data point i):
∂ℓ(w)/∂wj ≈ ∂ℓi(w)/∂wj = hj(xi) ( 1[yi = +1] - P(y=+1 | xi, w) )


Stochastic gradient ascent for logistic regression

init w(1)=0, t=1
until converged:
  for i=1,…,N  (each time, pick a different data point i)
    for j=0,…,D:
      partial[j] = hj(xi) ( 1[yi = +1] - P(y=+1 | xi, w(t)) )
      wj(t+1) ß wj(t) + η partial[j]
    t ß t + 1
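A minimal runnable sketch of this loop (Python/NumPy; `stochastic_gradient_ascent`, `n_passes`, and the shuffling scheme are illustrative choices, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def stochastic_gradient_ascent(H, y, step_size=0.01, n_passes=10, seed=0):
    """Stochastic gradient ascent for logistic regression: one data point per update.

    H: (N, D+1) feature matrix with rows h(x_i); y: (N,) labels in {+1, -1}.
    """
    rng = np.random.default_rng(seed)
    N = H.shape[0]
    w = np.zeros(H.shape[1])
    indicator = (y == +1).astype(float)
    for _ in range(n_passes):
        for i in rng.permutation(N):           # visit points in a shuffled order
            prob_i = sigmoid(H[i] @ w)         # P(y=+1 | x_i, w)
            w = w + step_size * H[i] * (indicator[i] - prob_i)
    return w
```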


Stochastic gradient for L2-regularized objective

Total derivative: ∂ℓ(w)/∂wj - 2λwj = Σ_{i=1}^N hj(xi) ( 1[yi = +1] - P(y=+1 | xi, w) ) - 2λwj

What about the regularization term? Each time, pick a different data point i:

Stochastic gradient ascent: total derivative ≈ ∂ℓi(w)/∂wj - 2λwj / N

Each data point contributes 1/N of the regularization.
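A sketch of the corresponding single-point update (illustrative names; each update carries a 1/N share of the -2λw term, per the slide):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_l2_logistic(H, y, l2_penalty, step_size=0.01, n_passes=10, seed=0):
    """Stochastic gradient ascent for L2 regularized logistic regression."""
    rng = np.random.default_rng(seed)
    N = H.shape[0]
    w = np.zeros(H.shape[1])
    indicator = (y == +1).astype(float)
    for _ in range(n_passes):
        for i in rng.permutation(N):
            grad_i = H[i] * (indicator[i] - sigmoid(H[i] @ w))       # likelihood contribution of x_i
            w = w + step_size * (grad_i - 2.0 * l2_penalty * w / N)  # 1/N of the -2*lambda*w term
    return w
```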


Comparing computational time per step

Gradient ascent:
∂ℓ(w)/∂wj = Σ_{i=1}^N hj(xi) ( 1[yi = +1] - P(y=+1 | xi, w) )

Stochastic gradient ascent:
∂ℓ(w)/∂wj ≈ ∂ℓi(w)/∂wj = hj(xi) ( 1[yi = +1] - P(y=+1 | xi, w) )

Time to compute contribution of xi,yi   # of data points (N)   Total time for 1 step of gradient   Total time for 1 step of stochastic gradient
1 millisecond                           1000                   1 second                            1 millisecond
1 second                                1000                   16.7 minutes                        1 second
1 millisecond                           10 million             2.8 hours                           1 millisecond
1 millisecond                           10 billion             115.7 days                          1 millisecond


Which one is better??? Depends…

Algorithm             Time per iteration     Total time to convergence for large data (in theory / in practice)   Sensitivity to parameters
Gradient              Slow for large data    Slower / often slower                                                 Moderate
Stochastic gradient   Always fast            Faster / often faster                                                 Very high


Summary of stochastic gradient

Tiny change to gradient ascent

Much better scalability

Huge impact in real-world

Very tricky to get right in practice


Why would stochastic gradient ever work???


Gradient is direction of steepest ascent

Gradient is “best” direction, but any direction that goes “up” would be useful


In ML, steepest direction is sum of “little directions” from each data point

Sum over data points:
∂ℓ(w)/∂wj = Σ_{i=1}^N hj(xi) ( 1[yi = +1] - P(y=+1 | xi, w) )

For most data points, the contribution points "up"

Stochastic gradient: Pick a data point and move in direction

∂ℓ(w)/∂wj ≈ ∂ℓi(w)/∂wj = hj(xi) ( 1[yi = +1] - P(y=+1 | xi, w) )

Most of the time, total likelihood will increase


Stochastic gradient ascent: Most iterations increase likelihood, but sometimes decrease it è On average, make progress

until converged:
  for i=1,…,N:
    for j=0,…,D:
      wj(t+1) ß wj(t) + η ∂ℓi(w)/∂wj
    t ß t + 1


Convergence path


Convergence paths

[Figure: convergence paths in coefficient space; gradient follows a smooth path, stochastic gradient a noisy one.]


Stochastic gradient convergence is “noisy”

• Stochastic gradient makes “noisy” progress: it achieves higher likelihood sooner, but it’s noisier
• Gradient usually increases likelihood smoothly

[Figure: avg. log likelihood (higher is better) vs. time; total time is proportional to # passes over data.]


Eventually, gradient catches up

Note: should only trust “average” quality of stochastic gradient

[Figure: avg. log likelihood (higher is better) vs. time for stochastic gradient and gradient; eventually gradient catches up.]


The last coefficients may be really good or really bad!! (e.g., w(1005) was good, but w(1000) was bad)

• Stochastic gradient will eventually oscillate around a solution
• How do we minimize the risk of picking bad coefficients?
  - Minimize noise: don’t return the last learned coefficients
  - Output the average: ŵ = (1/T) Σ_{t=1}^T w(t)
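A sketch of SGD that returns the averaged iterates instead of the last one (names are illustrative; a plain running average over all updates is one of several reasonable variants):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_with_averaging(H, y, step_size=0.01, n_passes=10, seed=0):
    """SGD for logistic regression; returns hat{w} = (1/T) sum_t w(t) rather than the noisy last w."""
    rng = np.random.default_rng(seed)
    N = H.shape[0]
    w = np.zeros(H.shape[1])
    w_sum = np.zeros_like(w)
    indicator = (y == +1).astype(float)
    n_updates = 0
    for _ in range(n_passes):
        for i in rng.permutation(N):
            w = w + step_size * H[i] * (indicator[i] - sigmoid(H[i] @ w))
            w_sum += w                       # accumulate every iterate w(t)
            n_updates += 1
    return w_sum / n_updates
```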


Summary of why stochastic gradient works

Gradient finds direction of steepest ascent

Gradient is sum of contributions from each data point

Stochastic gradient uses direction from 1 data point

On average increases likelihood, sometimes decreases

Stochastic gradient has “noisy” convergence


Online learning: Fitting models from streaming data


Batch vs online learning

Batch learning:
• All data is available at start of training time
[Diagram: Data → ML algorithm → ŵ]

Online learning:
• Data arrives (streams in) over time; must train model as data arrives!
[Diagram: at each time t=1, 2, 3, 4, … new data flows into the ML algorithm, which outputs ŵ(1), ŵ(2), ŵ(3), ŵ(4), …]


Online learning example: Ad targeting

[Diagram: a website shows suggested ads ŷ = (Ad1, Ad2, Ad3). Input xt: user info, page text. The ML algorithm with coefficients ŵ(t) makes the suggestion; the user clicks on Ad2, so yt = Ad2, and the coefficients are updated to ŵ(t+1).]


Online learning problem
• Data arrives over time; at each time step t:

- Observe input xt • Info of user, text of webpage

- Make a prediction ŷt • Which ad to show

- Observe true output yt • Which ad user clicked on

Need ML algorithm to update coefficients each time step!


Stochastic gradient ascent can be used for online learning!!!
• init w(1)=0, t=1
• Each time step t:

- Observe input xt

- Make a prediction ŷt

- Observe true output yt - Update coefficients: for j=0,…,D

wj(t+1) ß wj(t) + η ∂ℓt(w)/∂wj
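A minimal sketch of this online loop (Python/NumPy; the `stream` of (h(xt), yt) pairs and all names are illustrative assumptions about how data arrives):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def online_logistic_regression(stream, n_features, step_size=0.01):
    """Online learning via stochastic gradient ascent.

    `stream` yields (h_t, y_t) pairs over time: h_t is the feature vector h(x_t),
    y_t in {+1, -1} is the true output observed after the prediction is made.
    """
    w = np.zeros(n_features)
    predictions = []
    for h_t, y_t in stream:
        predictions.append(+1 if sigmoid(h_t @ w) > 0.5 else -1)         # predict before seeing y_t
        w = w + step_size * h_t * (float(y_t == +1) - sigmoid(h_t @ w))  # then update coefficients
    return w, predictions
```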


Summary of online learning

Data arrives over time

Must make a prediction every time new data point arrives

Observe true class after prediction made

Want to update parameters immediately


Summary of stochastic gradient descent


What you can do now…

• Significantly speed up the learning algorithm using stochastic gradient
• Describe intuition behind why stochastic gradient works
• Apply stochastic gradient in practice
• Describe online learning problems
• Relate stochastic gradient to online learning


Generalized linear models

OPTIONAL


Models for data

• Linear regression with Gaussian errors:
  yi = wTh(xi) + εi,  εi ~ N(0, σ²)  è  p(y|x,w) = N(y; wTh(x), σ²)
  [Figure: Gaussian density over y with mean µ = wTh(x) and spread σ.]

• Logistic regression

  P(y=+1|x,w) = 1 / (1 + e^(-wTh(x)))
  P(y=-1|x,w) = e^(-wTh(x)) / (1 + e^(-wTh(x)))
  [Figure: probability mass on y ∈ {-1, +1}.]

Examples of “generalized linear models”

A generalized linear model has E[y | x,w ] = g(wTh(x))

                      Mean E[y|x,w]     Link function g
Linear regression     wTh(x)            identity function
Logistic regression   P(y=+1|x,w)       sigmoid function

(For logistic regression this assumes y in {0,1} instead of {-1,+1} or some other set of values; the mean is different if y is not in {0,1}, and the link needs slight modification if y is not in {0,1}.)


Stochastic gradient descent more formally

OPTIONAL


Learning Problems as Expectations

• Minimizing loss in training data:
  - Given a dataset sampled i.i.d. from some distribution p(x) on features
  - Given a loss ℓ(w, x), e.g., hinge loss, logistic loss, …
  - We often minimize the loss on training data:
    ℓ_D(w) = (1/N) Σ_{j=1}^N ℓ(w, xj)

• However, we should really minimize expected loss on all data:

ℓ(w) = E_x[ℓ(w, x)] = ∫ p(x) ℓ(w, x) dx

• So, we are approximating the integral by the average on the training data


Gradient Ascent in Terms of Expectations

• “True” objective function: ℓ(w) = E_x[ℓ(w, x)] = ∫ p(x) ℓ(w, x) dx
• Taking the gradient: ∇ℓ(w) = E_x[∇ℓ(w, x)]
• “True” gradient ascent rule: w(t+1) ß w(t) + η E_x[∇ℓ(w(t), x)]

• How do we estimate expected gradient?


SGD: Stochastic Gradient Ascent (or Descent)

• “True” gradient: ∇ℓ(w) = E_x[∇ℓ(w, x)]
• Sample-based approximation: ∇ℓ(w) ≈ (1/N) Σ_{j=1}^N ∇ℓ(w, xj)
• What if we estimate the gradient with just one sample??? (see the check below)
  - Unbiased estimate of the gradient
  - Very noisy!
  - Called stochastic gradient ascent (or descent), among many other names
  - VERY useful in practice!!!
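A quick numerical sanity check of the unbiasedness claim (a sketch with synthetic data; every name here is illustrative): averaging the per-point gradient contributions recovers the full gradient, so a uniformly sampled single point gives an unbiased, if noisy, estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 1000, 3
H = np.hstack([np.ones((N, 1)), rng.normal(size=(N, D))])  # rows h(x_i), with intercept
y = np.where(rng.random(N) < 0.5, 1.0, -1.0)               # synthetic labels in {+1, -1}
w = rng.normal(size=D + 1)

indicator = (y == 1).astype(float)
prob = 1.0 / (1.0 + np.exp(-(H @ w)))                      # P(y=+1 | x_i, w)

full_gradient = H.T @ (indicator - prob)                   # sum of per-point contributions
per_point = H * (indicator - prob)[:, None]                # row i: contribution of (x_i, y_i)
print(np.allclose(N * per_point.mean(axis=0), full_gradient))  # True: E[N * single-sample] = full
```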
