
Linear classifiers: Overfitting and regularization
CSE 446: Machine Learning
Emily Fox, University of Washington
January 25, 2017


Logistic regression recap


Thus far, we focused on decision boundaries

Score(xi) = w0 h0(xi) + w1 h1(xi) + … + wD hD(xi) = w^T h(xi)

Relate Score(xi) to P̂(y=+1|x,ŵ)?

[Figure: decision boundary in the (#awesome, #awful) plane, with Score(x) > 0 on one side and Score(x) < 0 on the other]


Compare and contrast regression models

• Linear regression with Gaussian errors:
  yi = w^T h(xi) + εi,   εi ~ N(0, σ²)   ⇒   p(y|x,w) = N(y; w^T h(x), σ²)

• Logistic regression:
  P(y|x,w) = 1 / (1 + e^(−w^T h(x)))                       if y = +1
           = e^(−w^T h(x)) / (1 + e^(−w^T h(x)))           if y = −1


Understanding the logistic regression model

P(y=+1|xi,w) = sigmoid(Score(xi)) = 1 / (1 + e^(−w^T h(xi)))

[Figure: the sigmoid function mapping Score(xi) to P(y=+1|xi,w)]


Predict most likely class, where P̂(y|x) is the estimate of class probabilities.

Input: x (sentence from review)
If P̂(y=+1|x) > 0.5:
    ŷ = +1
Else:
    ŷ = −1

Estimating P̂(y|x) improves interpretability:
- Predict ŷ = +1 and tell me how sure you are
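As a rough illustration (not part of the original slides), here is a minimal Python sketch of this prediction rule; the feature layout [1, #awesome, #awful] and the coefficient values are assumptions chosen to match the running example:

```python
import numpy as np

def sigmoid(score):
    # P(y=+1 | x, w) = 1 / (1 + exp(-score)), where score = w^T h(x)
    return 1.0 / (1.0 + np.exp(-score))

def predict(w, h_x):
    # h_x: feature vector h(x); w: coefficient vector of the same length
    prob_plus = sigmoid(np.dot(w, h_x))    # estimate of P(y=+1 | x, w)
    y_hat = +1 if prob_plus > 0.5 else -1  # predict the most likely class
    return y_hat, prob_plus

# Hypothetical example: w0=0, w_awesome=+1, w_awful=-1; input with 2 "awesome", 1 "awful"
w = np.array([0.0, 1.0, -1.0])
h_x = np.array([1.0, 2.0, 1.0])            # [constant, #awesome, #awful]
print(predict(w, h_x))                      # (1, ~0.73): predict +1 and report how sure we are
```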


Gradient ascent algorithm, cont'd


Step 5: Gradient over all data points

∂ℓ(w)/∂wj = Σ_{i=1}^N hj(xi) ( 1[yi = +1] − P(y=+1|xi,w) )

• Sum over data points; hj(xi) is the feature value; the term in parentheses is the difference between truth and prediction:
  - If yi = +1: difference = 1 − P(y=+1|xi,w)
  - If yi = −1: difference = 0 − P(y=+1|xi,w) = −P(y=+1|xi,w)
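A hedged Python sketch of this gradient; the feature matrix H, the {−1, +1} label encoding, and the function name are my assumptions, not from the slides:

```python
import numpy as np

def log_likelihood_gradient(w, H, y):
    """Gradient of the logistic log-likelihood.

    H: N x (D+1) matrix whose rows are h(x_i); y: length-N array of labels in {-1, +1}.
    Returns a length-(D+1) array whose j-th entry is
        sum_i h_j(x_i) * (1[y_i = +1] - P(y=+1 | x_i, w)).
    """
    prob_plus = 1.0 / (1.0 + np.exp(-H.dot(w)))   # P(y=+1 | x_i, w) for every i
    indicator = (y == +1).astype(float)           # 1[y_i = +1]
    return H.T.dot(indicator - prob_plus)         # sum over data points of feature * difference
```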


Summary of gradient ascent for logistic regression

init w^(1) = 0 (or randomly, or smartly), t = 1
while ||∇ℓ(w^(t))|| > ε:
    for j = 0, …, D:
        partial[j] = Σ_{i=1}^N hj(xi) (1[yi=+1] − P(y=+1|xi, w^(t)))
        wj^(t+1) ← wj^(t) + η partial[j]
    t ← t + 1
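A minimal sketch of the loop above, assuming the same H and y arrays as before; the step size, tolerance, and iteration cap are arbitrary placeholder values:

```python
import numpy as np

def gradient_ascent(H, y, step_size=1e-4, tolerance=1e-3, max_iter=10000):
    """Batch gradient ascent for logistic regression (sketch; defaults are assumptions)."""
    w = np.zeros(H.shape[1])                             # init w^(1) = 0
    for t in range(max_iter):
        prob_plus = 1.0 / (1.0 + np.exp(-H.dot(w)))      # P(y=+1 | x_i, w^(t))
        partial = H.T.dot((y == +1).astype(float) - prob_plus)
        if np.linalg.norm(partial) <= tolerance:         # stop when ||grad l(w^(t))|| <= epsilon
            break
        w = w + step_size * partial                      # w^(t+1) = w^(t) + eta * partial
    return w
```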


Overfitting in classification


Learned decision boundary

Feature | Value | Coefficient learned
h0(x)   | 1     | 0.23
h1(x)   | x[1]  | 1.12
h2(x)   | x[2]  | -1.07


Quadratic features (in 2d). Note: we are not including cross terms for simplicity.

Feature | Value    | Coefficient learned
h0(x)   | 1        | 1.68
h1(x)   | x[1]     | 1.39
h2(x)   | x[2]     | -0.59
h3(x)   | (x[1])²  | -0.17
h4(x)   | (x[2])²  | -0.96


Degree 6 features (in 2d). Note: we are not including cross terms for simplicity.

Feature | Value    | Coefficient learned
h0(x)   | 1        | 21.6
h1(x)   | x[1]     | 5.3
h2(x)   | x[2]     | -42.7
h3(x)   | (x[1])²  | -15.9
h4(x)   | (x[2])²  | -48.6
h5(x)   | (x[1])³  | -11.0
h6(x)   | (x[2])³  | 67.0
h7(x)   | (x[1])⁴  | 1.5
h8(x)   | (x[2])⁴  | 48.0
h9(x)   | (x[1])⁵  | 4.4
h10(x)  | (x[2])⁵  | -14.2
h11(x)  | (x[1])⁶  | 0.8
h12(x)  | (x[2])⁶  | -8.6

Coefficient values are getting large.

[Figure: learned decision boundary, with Score(x) < 0 outside it]


Degree 20 features (in 2d). Note: we are not including cross terms for simplicity.

Feature | Value     | Coefficient learned
h0(x)   | 1         | 8.7
h1(x)   | x[1]      | 5.1
h2(x)   | x[2]      | 78.7
…       | …         | …
h11(x)  | (x[1])⁶   | -7.5
h12(x)  | (x[2])⁶   | 3803
h13(x)  | (x[1])⁷   | 21.1
h14(x)  | (x[2])⁷   | -2406
…       | …         | …
h37(x)  | (x[1])¹⁹  | -2×10⁻⁶
h38(x)  | (x[2])¹⁹  | -0.15
h39(x)  | (x[1])²⁰  | -2×10⁻⁸
h40(x)  | (x[2])²⁰  | 0.03

Often, overfitting is associated with very large estimated coefficients ŵ.


Overfitting in classification

Overfitting if there exists w* such that:
• training_error(w*) > training_error(ŵ)
• true_error(w*) < true_error(ŵ)

[Figure: classification error (y-axis) vs. model complexity (x-axis)]


Overfitting in classifiers ⇒ Overconfident predictions


Logistic regression model

P(y=+1|xi,w) = sigmoid(w^T h(xi))

[Figure: the sigmoid curve; as w^T h(xi) goes from −∞ to +∞, P(y=+1|xi,w) goes from 0.0 through 0.5 to 1.0]


The subtle (negative) consequence of overfitting in logistic regression

Overfitting ⇒ Large coefficient values

ŵ^T h(xi) is very positive (or very negative) ⇒ sigmoid(ŵ^T h(xi)) goes to 1 (or to 0)

Model becomes extremely overconfident of predictions


Effect of coefficients on logistic regression model

Input x: #awesome=2, #awful=1

w0 = 0, w#awesome = +1, w#awful = −1
w0 = 0, w#awesome = +2, w#awful = −2
w0 = 0, w#awesome = +6, w#awful = −6

[Figure: three sigmoid curves plotted against #awesome − #awful, one per coefficient setting; larger coefficient magnitudes give a steeper curve]
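To see the effect numerically, here is a tiny sketch using the slide's input (2 "awesome", 1 "awful"); the helper name is an assumption:

```python
import numpy as np

def prob_positive(w_awesome, w_awful, n_awesome=2, n_awful=1, w0=0.0):
    # P(y=+1 | x, w) = sigmoid(w0 + w_awesome*#awesome + w_awful*#awful)
    score = w0 + w_awesome * n_awesome + w_awful * n_awful
    return 1.0 / (1.0 + np.exp(-score))

for w in (1, 2, 6):                  # same decision boundary, rescaled coefficients
    print(w, prob_positive(+w, -w))  # ~0.73, ~0.88, ~0.998: predictions get more extreme
```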


Learned probabilities

Feature | Value | Coefficient learned
h0(x)   | 1     | 0.23
h1(x)   | x[1]  | 1.12
h2(x)   | x[2]  | -1.07


Quadratic features: Learned probabilities

Feature | Value    | Coefficient learned
h0(x)   | 1        | 1.68
h1(x)   | x[1]     | 1.39
h2(x)   | x[2]     | -0.58
h3(x)   | (x[1])²  | -0.17
h4(x)   | (x[2])²  | -0.96


Overfitting ⇒ Overconfident predictions

[Figures: learned probabilities for degree 6 features and degree 20 features]

Tiny uncertainty regions ⇒ Overfitting & overconfident about it!!!
We are sure we are right, when we are surely wrong!


Overfitting in logistic regression: Another perspective


Linearly-separable data

Data are linearly separable if:
• There exist coefficients ŵ such that:
  - For all positive training data, Score(xi) = ŵ^T h(xi) > 0
  - For all negative training data, Score(xi) = ŵ^T h(xi) < 0

Note 1: If you are using D features, separability is assessed in a D-dimensional space.
Note 2: If you have enough features, data are (almost) always linearly separable, so training_error(ŵ) = 0.


Effect of linear separability on coefficients

Data are linearly separable with ŵ1 = 1.0 and ŵ2 = −1.5
Data also linearly separable with ŵ1 = 10 and ŵ2 = −15
Data also linearly separable with ŵ1 = 10⁹ and ŵ2 = −1.5×10⁹

[Figure: separable data in the (x[1] = #awesome, x[2] = #awful) plane, with Score(x) > 0 on one side of the boundary and Score(x) < 0 on the other]


Effect of linear separability on weights: maximum likelihood estimation (MLE) prefers the most certain model ⇒ coefficients go to infinity for linearly-separable data!!!

P(y=+1|xi, ŵ) for each of the settings above:
Data are linearly separable with ŵ1 = 1.0 and ŵ2 = −1.5
Data also linearly separable with ŵ1 = 10 and ŵ2 = −15
Data also linearly separable with ŵ1 = 10⁹ and ŵ2 = −1.5×10⁹

[Figure: the same (x[1] = #awesome, x[2] = #awful) plot, now showing P(y=+1|xi, ŵ)]


Overfitting in logistic regression is “twice as bad”

If data are linearly separable:
• Learning tries to find a decision boundary that separates the data ⇒ overly complex boundary
• Coefficients go to infinity! (e.g., ŵ1 = 10⁹, ŵ2 = −1.5×10⁹)


Penalizing large coefficients to mitigate overfitting


Desired total cost format

Want to balance: i. How well function fits data ii. Magnitude of coefficients

Total quality = measure of fit − measure of magnitude of coefficients
• Measure of fit (data likelihood): large # = good fit to training data
• Measure of magnitude of coefficients: large # = overfit

Consider resulting objective

Select ŵ to maximize:  ℓ(w) − λ||w||₂²
where λ is a tuning parameter balancing fit and magnitude of coefficients.

This is L2 regularized logistic regression. Pick λ using:
• Validation set (for large datasets)
• Cross-validation (for smaller datasets)
(as in ridge/lasso regression)
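A minimal sketch of this objective, assuming labels in {−1, +1} and the same feature matrix H as earlier; following the slides, all coefficients (including the intercept) are penalized:

```python
import numpy as np

def log_likelihood(w, H, y):
    # sum_i log P(y_i | x_i, w) for labels y_i in {-1, +1}
    scores = H.dot(w)
    return np.sum(-np.log(1.0 + np.exp(-y * scores)))

def regularized_objective(w, H, y, lam):
    # l(w) - lambda * ||w||_2^2, the quantity we select w-hat to maximize
    return log_likelihood(w, H, y) - lam * np.sum(w ** 2)
```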


Degree 20 features, effect of regularization penalty λ

Regularization          | λ = 0         | λ = 0.00001    | λ = 0.001     | λ = 1         | λ = 10
Range of coefficients   | -3170 to 3803 | -8.04 to 12.14 | -0.70 to 1.25 | -0.13 to 0.57 | -0.05 to 0.22
Decision boundary       | [figures: learned decision boundaries for each λ]


Coefficient path

[Figure: coefficient path plot, ŵj coefficients on the y-axis vs. λ on the x-axis]


Degree 20 features: regularization reduces “overconfidence”

Regularization          | λ = 0         | λ = 0.00001    | λ = 0.001     | λ = 1
Range of coefficients   | -3170 to 3803 | -8.04 to 12.14 | -0.70 to 1.25 | -0.13 to 0.57
Learned probabilities   | [figures: learned probability surfaces for each λ]


Finding best L2 regularized linear classifier with gradient ascent


Understanding contribution of L2 regularization

Term from the L2 penalty:  ∂/∂wj [ −λ||w||₂² ] = −2λ wj

Impact on wj:
• If wj > 0, then −2λ wj < 0, so the update pushes wj down, toward 0
• If wj < 0, then −2λ wj > 0, so the update pushes wj up, toward 0


Summary of gradient ascent for logistic regression + L2 regularization

init w^(1) = 0 (or randomly, or smartly), t = 1
while not converged:
    for j = 0, …, D:
        partial[j] = Σ_{i=1}^N hj(xi) (1[yi=+1] − P(y=+1|xi, w^(t)))
        wj^(t+1) ← wj^(t) + η (partial[j] − 2λ wj^(t))
    t ← t + 1
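A sketch of the regularized update, assuming the same conventions as the earlier gradient ascent sketch; the convergence test and default values are assumptions:

```python
import numpy as np

def l2_gradient_ascent(H, y, lam, step_size=1e-4, tolerance=1e-3, max_iter=10000):
    """Gradient ascent for L2 regularized logistic regression (sketch)."""
    w = np.zeros(H.shape[1])
    for t in range(max_iter):
        prob_plus = 1.0 / (1.0 + np.exp(-H.dot(w)))
        partial = H.T.dot((y == +1).astype(float) - prob_plus)  # likelihood part of the gradient
        update = partial - 2.0 * lam * w                        # add the -2*lambda*w_j penalty term
        if np.linalg.norm(update) <= tolerance:                 # a simple stand-in for "not converged"
            break
        w = w + step_size * update
    return w
```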


Summary of overfitting in logistic regression


What you can do now…

• Identify when overfitting is happening
• Relate large learned coefficients to overfitting
• Describe the impact of overfitting on decision boundaries and predicted probabilities of linear classifiers
• Motivate the form of L2 regularized logistic regression quality metric
• Describe what happens to estimated coefficients as tuning parameter λ is varied
• Interpret coefficient path plot
• Estimate L2 regularized logistic regression coefficients using gradient ascent
• Describe the use of L1 regularization to obtain sparse logistic regression solutions


Linear classifiers: Scaling up learning via SGD
CSE 446: Machine Learning
Emily Fox, University of Washington
January 25, 2017


Stochastic gradient: Learning, one data point at a time


Why gradient ascent is slow…

w^(t) → [update coefficients] → w^(t+1) → [update coefficients] → w^(t+2) → …

Every update requires a full pass over the data to compute the gradient ∇ℓ(w^(t)), then ∇ℓ(w^(t+1)), and so on.


More formally: How expensive is gradient ascent?

∂ℓ(w)/∂wj = Σ_{i=1}^N hj(xi) (1[yi=+1] − P(y=+1|xi,w))
• Sum over data points
• Each term is the contribution of data point (xi, yi) to the gradient


Every step requires touching every data point!!!

Each step sums over all data points:

Time to compute contribution of xi,yi | # of data points (N) | Total time to compute 1 step of gradient ascent
1 millisecond | 1,000      | 1 second
1 second      | 1,000      | 16.7 minutes
1 millisecond | 10 million | 2.8 hours
1 millisecond | 10 billion | 115.7 days


Stochastic gradient ascent

w^(t) → [update] → w^(t+1) → [update] → w^(t+2) → [update] → w^(t+3) → [update] → w^(t+4) → …

Many updates for each pass over the data: each gradient is computed using only a small subset of the data.


Example: Instead of all data points for gradient, use 1 data point only???

Gradient ascent (sum over data points):
  ∂ℓ(w)/∂wj = Σ_{i=1}^N hj(xi) (1[yi=+1] − P(y=+1|xi,w))

Stochastic gradient ascent (each time, pick a different data point i):
  use only the single term hj(xi) (1[yi=+1] − P(y=+1|xi,w))


Stochastic gradient ascent for logistic regression

init w^(1) = 0, t = 1
until converged:
    for i = 1, …, N:    (each time, a different data point i; no sum over data points)
        for j = 0, …, D:
            partial[j] = hj(xi) (1[yi=+1] − P(y=+1|xi, w^(t)))
            wj^(t+1) ← wj^(t) + η partial[j]
        t ← t + 1
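A minimal sketch of this loop; the random shuffling on each pass and the fixed number of passes are assumptions standing in for "until converged":

```python
import numpy as np

def stochastic_gradient_ascent(H, y, step_size=1e-4, n_passes=10, seed=0):
    """Stochastic gradient ascent for logistic regression (sketch; defaults are assumptions)."""
    rng = np.random.default_rng(seed)
    N, num_coeffs = H.shape
    w = np.zeros(num_coeffs)                                      # init w^(1) = 0
    for _ in range(n_passes):                                     # fixed passes instead of a convergence test
        for i in rng.permutation(N):                              # each time, pick a different data point i
            prob_plus = 1.0 / (1.0 + np.exp(-H[i].dot(w)))
            partial = H[i] * (float(y[i] == +1) - prob_plus)      # contribution of one data point only
            w = w + step_size * partial
    return w
```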


Stochastic gradient for L2-regularized objective

Total derivative of the L2 regularized objective:
  ∂/∂wj [ ℓ(w) − λ||w||₂² ] = Σ_{i=1}^N hj(xi) (1[yi=+1] − P(y=+1|xi,w)) − 2λ wj

What about the regularization term in stochastic gradient ascent? Each time, pick a different data point i and use
  hj(xi) (1[yi=+1] − P(y=+1|xi,w)) − (2λ/N) wj
so that each data point contributes 1/N of the regularization.
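A sketch of a single stochastic step on the regularized objective, with the 1/N-scaled penalty; the argument names are assumptions:

```python
import numpy as np

def sgd_step_l2(w, h_i, y_i, lam, N, step_size):
    # One stochastic step on the L2 regularized objective:
    # likelihood contribution of point i, plus 1/N of the -2*lambda*w penalty.
    prob_plus = 1.0 / (1.0 + np.exp(-h_i.dot(w)))
    partial = h_i * (float(y_i == +1) - prob_plus)
    return w + step_size * (partial - (2.0 * lam / N) * w)
```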


Comparing computational time per step

Gradient ascent: sum over all data points. Stochastic gradient ascent: one data point at a time.

Time to compute contribution of xi,yi | # of data points (N) | Total time for 1 step of gradient | Total time for 1 step of stochastic gradient
1 millisecond | 1,000      | 1 second     | 1 millisecond
1 second      | 1,000      | 16.7 minutes | 1 second
1 millisecond | 10 million | 2.8 hours    | 1 millisecond
1 millisecond | 10 billion | 115.7 days   | 1 millisecond


Which one is better??? Depends…

Algorithm           | Time per iteration  | Total time to convergence for large data: in theory | in practice  | Sensitivity to parameters
Gradient            | Slow for large data | Slower                                              | Often slower | Moderate
Stochastic gradient | Always fast         | Faster                                              | Often faster | Very high


Summary of stochastic gradient

Tiny change to gradient ascent

Much better scalability

Huge impact in real-world

Very tricky to get right in practice


Why would stochastic gradient ever work???


Gradient is direction of steepest ascent

Gradient is “best” direction, but any direction that goes “up” would be useful


In ML, steepest direction is sum of “little directions” from each data point

Sum over data points: for most data points, the contribution points "up".

Stochastic gradient: Pick a data point and move in the direction of its contribution,
  hj(xi) (1[yi=+1] − P(y=+1|xi,w))

Most of the time, total likelihood will increase


Stochastic gradient ascent: Most iterations increase likelihood, but sometimes decrease it ⇒ On average, make progress

until converged:
    for i = 1, …, N:
        for j = 0, …, D:
            wj^(t+1) ← wj^(t) + η hj(xi) (1[yi=+1] − P(y=+1|xi, w^(t)))
        t ← t + 1


Convergence path


Convergence paths

[Figure: convergence paths of gradient (smooth) vs. stochastic gradient (noisy)]


Stochastic gradient convergence is "noisy":
• Stochastic gradient makes "noisy" progress; it achieves higher likelihood sooner, but it's noisier
• Gradient usually increases likelihood smoothly

[Figure: avg. log likelihood (higher is better) vs. total time, which is proportional to # passes over data]


Eventually, gradient catches up

Note: should only trust “average” quality of stochastic gradient

[Figure: avg. log likelihood over time for stochastic gradient and gradient (higher is better); gradient eventually catches up]


The last coefficients may be really good or really bad!!
• Stochastic gradient will eventually oscillate around a solution; e.g., w^(1000) was bad while w^(1005) was good
• How do we minimize the risk of picking bad coefficients? Minimize noise: don't return the last learned coefficients
• Output the average instead:  ŵ = (1/T) Σ_{t=1}^T w^(t)
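A minimal sketch of returning the averaged iterates instead of the last one, reusing the stochastic update from above (array names and defaults are assumptions):

```python
import numpy as np

def averaged_sgd(H, y, step_size=1e-4, n_passes=10, seed=0):
    """SGD that returns the average of all iterates w^(t) instead of the last one (sketch)."""
    rng = np.random.default_rng(seed)
    N, num_coeffs = H.shape
    w = np.zeros(num_coeffs)
    w_sum = np.zeros(num_coeffs)
    T = 0
    for _ in range(n_passes):
        for i in rng.permutation(N):
            prob_plus = 1.0 / (1.0 + np.exp(-H[i].dot(w)))
            w = w + step_size * H[i] * (float(y[i] == +1) - prob_plus)
            w_sum += w                 # accumulate every iterate
            T += 1
    return w_sum / T                   # w-hat = (1/T) * sum_t w^(t)
```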


Summary of why stochastic gradient works

Gradient finds direction of steepest ascent

Gradient is sum of contributions from each data point

Stochastic gradient uses direction from 1 data point

On average increases likelihood, sometimes decreases

Stochastic gradient has “noisy” convergence


Online learning: Fitting models from streaming data


Batch vs online learning

Batch learning: all data is available at the start of training.
  [Diagram: Data → ML algorithm → ŵ]

Online learning: data arrives (streams in) over time, so the model must be trained as data arrives!
  [Diagram: at time steps t = 1, 2, 3, 4, …, the arriving data is fed to the ML algorithm, which produces ŵ^(1), ŵ^(2), ŵ^(3), ŵ^(4), …]


Online learning example: Ad targeting

[Diagram: a website shows suggested ads ŷ (Ad1, Ad2, Ad3); the user clicks on Ad2, so yt = Ad2. The input xt is the user info and page text; the ML algorithm updates ŵ^(t) → ŵ^(t+1).]


Online learning problem
• Data arrives over time; at each time step t:
  - Observe input xt (info of user, text of webpage)
  - Make a prediction ŷt (which ad to show)
  - Observe true output yt (which ad the user clicked on)

Need an ML algorithm to update coefficients at each time step!


Stochastic gradient ascent can be used for online learning!!!
• init w^(1) = 0, t = 1
• At each time step t:
  - Observe input xt
  - Make a prediction ŷt
  - Observe true output yt
  - Update coefficients: for j = 0, …, D:
      wj^(t+1) ← wj^(t) + η hj(xt) (1[yt=+1] − P(y=+1|xt, w^(t)))
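A sketch of this online loop as a small class; the class name, step size, and feature handling are assumptions:

```python
import numpy as np

class OnlineLogisticRegression:
    """Online learning via one stochastic-gradient update per arriving data point (sketch)."""

    def __init__(self, n_features, step_size=0.1):
        self.w = np.zeros(n_features)       # init w^(1) = 0
        self.step_size = step_size

    def predict_proba(self, h_x):
        # P(y=+1 | x, w) = sigmoid(w^T h(x))
        return 1.0 / (1.0 + np.exp(-self.w.dot(h_x)))

    def predict(self, h_x):
        # make a prediction y-hat_t before the true label is observed
        return +1 if self.predict_proba(h_x) > 0.5 else -1

    def update(self, h_x, y_true):
        # after observing the true output y_t, update coefficients immediately
        partial = h_x * (float(y_true == +1) - self.predict_proba(h_x))
        self.w = self.w + self.step_size * partial
```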


Summary of online learning

Data arrives over time

Must make a prediction every time a new data point arrives

Observe true class after prediction made

Want to update parameters immediately


Summary of stochastic gradient descent


What you can do now…

• Significantly speed up the learning algorithm using stochastic gradient
• Describe the intuition behind why stochastic gradient works
• Apply stochastic gradient in practice
• Describe online learning problems
• Relate stochastic gradient to online learning


Generalized linear models


Models for data

• Linear regression with Gaussian errors:
  yi = w^T h(xi) + εi,   εi ~ N(0, σ²)   ⇒   p(y|x,w) = N(y; w^T h(x), σ²)

• Logistic regression:
  P(y|x,w) = 1 / (1 + e^(−w^T h(x)))                       if y = +1
           = e^(−w^T h(x)) / (1 + e^(−w^T h(x)))           if y = −1

Examples of “generalized linear models”

A generalized linear model has E[y | x, w] = g(w^T h(x))

Model               | Mean E[y|x,w] | Link function g
Linear regression   | w^T h(x)      | identity function
Logistic regression | P(y=+1|x,w)   | sigmoid function

(For the logistic regression row, this assumes y in {0,1} instead of {−1,+1} or some other set of values; the mean and link function need slight modification if y is not in {0,1}.)
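A minimal sketch of the two link functions in the table, assuming y in {0,1} for the logistic case; the function name is an assumption:

```python
import numpy as np

def glm_mean(w, h_x, link="identity"):
    """E[y | x, w] = g(w^T h(x)) for the two examples above (sketch)."""
    score = np.dot(w, h_x)
    if link == "identity":                  # linear regression: mean is w^T h(x)
        return score
    if link == "sigmoid":                   # logistic regression with y in {0,1}: mean is P(y=1|x,w)
        return 1.0 / (1.0 + np.exp(-score))
    raise ValueError("unknown link")
```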


Stochastic gradient descent more formally


Learning Problems as Expectations

• Minimizing loss in training data:
  - Given dataset: {x1, …, xN}
    • Sampled i.i.d. from some distribution p(x) on features
  - Loss function ℓ(w, xi), e.g., hinge loss, logistic loss, …
  - We often minimize loss in training data:  (1/N) Σ_{i=1}^N ℓ(w, xi)

• However, we should really minimize expected loss on all data:
  E_x[ ℓ(w, x) ] = ∫ p(x) ℓ(w, x) dx

• So, we are approximating the integral by the average on the training data


Gradient Ascent in Terms of Expectations

• "True" objective function:  ℓ_true(w) = E_x[ ℓ(w, x) ]

• Taking the gradient:  ∇ℓ_true(w) = E_x[ ∇ℓ(w, x) ]

• "True" gradient ascent rule:  w^(t+1) ← w^(t) + η E_x[ ∇ℓ(w^(t), x) ]

• How do we estimate expected gradient?


SGD: Stochastic Gradient Ascent (or Descent)

• "True" gradient:  ∇ℓ_true(w) = E_x[ ∇ℓ(w, x) ]

• Sample-based approximation:  ∇ℓ_true(w) ≈ (1/N) Σ_{i=1}^N ∇ℓ(w, xi)

• What if we estimate the gradient with just one sample???
  - Unbiased estimate of the gradient
  - Very noisy!
  - Called stochastic gradient ascent (or descent), among many other names
  - VERY useful in practice!!!
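A tiny numerical sanity check of "unbiased but very noisy", using the logistic log-likelihood contribution from earlier as ℓ; the simulated data and all names here are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 1000, 3
H = rng.normal(size=(N, D))                             # simulated features (assumption)
y = np.where(H[:, 0] + rng.normal(size=N) > 0, 1, -1)   # simulated labels in {-1, +1}
w = rng.normal(size=D)                                  # an arbitrary coefficient vector

def point_gradient(i):
    # contribution of data point i to the log-likelihood gradient
    p = 1.0 / (1.0 + np.exp(-H[i].dot(w)))
    return H[i] * (float(y[i] == +1) - p)

full = np.mean([point_gradient(i) for i in range(N)], axis=0)  # (1/N) * full gradient
one = point_gradient(rng.integers(N))                          # single-sample estimate
print(full)
print(one)   # matches the full gradient in expectation, but any single draw is much noisier
```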
