Linear classifiers: Overfitting and regularization
CSE 446: Machine Learning
Emily Fox, University of Washington
January 25, 2017
Logistic regression recap
Thus far, we focused on decision boundaries
Score(xi) = w0 h0(xi) + w1 h1(xi) + … + wD hD(xi) = wᵀh(xi)

[Figure: decision boundary 1.0 #awesome − 1.5 #awful = 0 in the (#awesome, #awful) plane, with Score(x) > 0 on one side and Score(x) < 0 on the other]

How do we relate Score(xi) to P̂(y=+1|x,ŵ)?
Compare and contrast regression models
• Linear regression with Gaussian errors:
  yi = wᵀh(xi) + εi,   εi ~ N(0, σ²)   ⟹   p(y|x,w) = N(y; wᵀh(x), σ²), with mean µ = wᵀh(x)

• Logistic regression:
  P(y|x,w) = 1 / (1 + e^(−wᵀh(x)))              if y = +1
  P(y|x,w) = e^(−wᵀh(x)) / (1 + e^(−wᵀh(x)))    if y = −1
Understanding the logistic regression model
P(y=+1|xi,w) = sigmoid(Score(xi)) = 1 / (1 + e^(−wᵀh(xi)))

[Figure: the sigmoid curve 1/(1 + e^(−wᵀh(x))), mapping the score wᵀh(x) on the horizontal axis to P(y=+1|xi,w) on the vertical axis]
Predict most likely class

P̂(y|x) = estimate of class probabilities

Input: x (e.g., a sentence from a review)
  If P̂(y=+1|x) > 0.5:  ŷ = +1
  Else:                ŷ = −1

Estimating P̂(y|x) improves interpretability:
- Predict ŷ = +1 and tell me how sure you are
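As a concrete illustration (my own sketch, not code from the slides), this prediction rule in NumPy; the feature encoding and coefficient values below are hypothetical:

```python
import numpy as np

def sigmoid(score):
    # P(y=+1 | x, w) = 1 / (1 + exp(-w^T h(x)))
    return 1.0 / (1.0 + np.exp(-score))

def predict(w, h_x):
    """Return (estimated P(y=+1|x), predicted class) for one example."""
    p_pos = sigmoid(w @ h_x)                # w^T h(x) pushed through the sigmoid
    y_hat = +1 if p_pos > 0.5 else -1
    return p_pos, y_hat

# Hypothetical example: h(x) = [1, #awesome, #awful]
w = np.array([0.0, 1.0, -1.5])
h_x = np.array([1.0, 2.0, 1.0])             # 2 "awesome", 1 "awful"
print(predict(w, h_x))                      # probability and predicted class
```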
Gradient descent algorithm, cont’d
Step 5: Gradient over all data points
∂ℓ(w)/∂wj = Σ_{i=1}^N hj(xi) ( 1[yi = +1] − P(y = +1 | xi, w) )

Sum over data points; hj(xi) is the feature value; the term in parentheses is the difference between truth and predicted probability.

Contribution of each data point:
- yi = +1, P(y=+1|xi,w) ≈ 1: contribution ≈ 0 (confident and correct)
- yi = +1, P(y=+1|xi,w) ≈ 0: contribution ≈ +hj(xi) (push score up)
- yi = −1, P(y=+1|xi,w) ≈ 1: contribution ≈ −hj(xi) (push score down)
- yi = −1, P(y=+1|xi,w) ≈ 0: contribution ≈ 0 (confident and correct)
Summary of gradient ascent for logistic regression
init w(1) = 0 (or randomly, or smartly), t = 1
while ||∇ℓ(w(t))|| > ε:
  for j = 0, …, D:
    partial[j] = Σ_{i=1}^N hj(xi) ( 1[yi = +1] − P(y = +1 | xi, w(t)) )
    wj(t+1) ← wj(t) + η partial[j]
  t ← t + 1
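A minimal NumPy sketch of this batch gradient ascent loop (my own illustration, not course code): it assumes a feature matrix H whose rows are h(xi) and labels y in {−1, +1}; the names, step size, and stopping tolerance are hypothetical choices.

```python
import numpy as np

def logistic_gradient_ascent(H, y, step_size=0.1, tol=1e-4, max_iter=1000):
    """Batch gradient ascent for logistic regression.
    H: (N, D+1) array of features h(x_i); y: labels in {-1, +1}."""
    w = np.zeros(H.shape[1])
    indicator = (y == +1).astype(float)          # 1[y_i = +1]
    for _ in range(max_iter):
        p_pos = 1.0 / (1.0 + np.exp(-H @ w))     # P(y=+1 | x_i, w) for every i
        gradient = H.T @ (indicator - p_pos)     # partial[j] for all j at once
        if np.linalg.norm(gradient) < tol:       # ||grad l(w)|| > epsilon check
            break
        w = w + step_size * gradient
    return w
```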
Overfitting in classification
Learned decision boundary
Feature   Value    Learned coefficient
h0(x)     1        0.23
h1(x)     x[1]     1.12
h2(x)     x[2]     −1.07
Quadratic features (in 2d). Note: we are not including cross terms for simplicity.
Feature   Value      Learned coefficient
h0(x)     1          1.68
h1(x)     x[1]       1.39
h2(x)     x[2]       −0.59
h3(x)     (x[1])²    −0.17
h4(x)     (x[2])²    −0.96
Degree 6 features (in 2d). Note: we are not including cross terms for simplicity.
Feature   Value      Learned coefficient
h0(x)     1          21.6
h1(x)     x[1]       5.3
h2(x)     x[2]       −42.7
h3(x)     (x[1])²    −15.9
h4(x)     (x[2])²    −48.6
h5(x)     (x[1])³    −11.0
h6(x)     (x[2])³    67.0
h7(x)     (x[1])⁴    1.5
h8(x)     (x[2])⁴    48.0
h9(x)     (x[1])⁵    4.4
h10(x)    (x[2])⁵    −14.2
h11(x)    (x[1])⁶    0.8
h12(x)    (x[2])⁶    −8.6

Coefficient values are getting large. [Figure: learned decision boundary, with the Score(x) < 0 region marked]
Degree 20 features (in 2d). Note: we are not including cross terms for simplicity.
Feature   Value       Learned coefficient
h0(x)     1           8.7
h1(x)     x[1]        5.1
h2(x)     x[2]        78.7
…         …           …
h11(x)    (x[1])⁶     −7.5
h12(x)    (x[2])⁶     3803
h13(x)    (x[1])⁷     21.1
h14(x)    (x[2])⁷     −2406
…         …           …
h37(x)    (x[1])¹⁹    −2×10⁻⁶
h38(x)    (x[2])¹⁹    −0.15
h39(x)    (x[1])²⁰    −2×10⁻⁸
h40(x)    (x[2])²⁰    0.03

Often, overfitting is associated with very large estimated coefficients ŵ.
Overfitting in classification
Overfitting if there exists w* such that:
• training_error(w*) > training_error(ŵ)
• true_error(w*) < true_error(ŵ)

[Figure: classification error (training and true) plotted against model complexity]
Overfitting in classifiers ⟹ overconfident predictions
Logistic regression model
P(y=+1|xi,w) = sigmoid(wᵀh(xi))

[Figure: the score wᵀh(xi) ranges over (−∞, +∞); the sigmoid maps it to P(y=+1|xi,w) in (0.0, 1.0), passing through 0.5 when the score is 0]
The subtle (negative) consequence of overfitting in logistic regression
Overfitting ⟹ large coefficient values

ŵᵀh(xi) is very positive (or very negative) ⟹ sigmoid(ŵᵀh(xi)) goes to 1 (or to 0)
Model becomes extremely overconfident of predictions
Effect of coefficients on logistic regression model
Input x: #awesome=2, #awful=1
Three settings of the coefficients (same decision boundary, increasing magnitude):
  w0 = 0, w#awesome = +1, w#awful = −1
  w0 = 0, w#awesome = +2, w#awful = −2
  w0 = 0, w#awesome = +6, w#awful = −6

[Figure: three sigmoid curves 1/(1 + e^(−wᵀh(x))) plotted against #awesome − #awful; larger coefficient magnitudes make the curve steeper, so predicted probabilities saturate toward 0 or 1 more quickly]
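A quick numeric check of this effect (my own illustration, not from the slides) for the input above with #awesome = 2 and #awful = 1:

```python
import numpy as np

h_x = np.array([1.0, 2.0, 1.0])                     # [1, #awesome, #awful]
for scale in (1.0, 2.0, 6.0):
    w = scale * np.array([0.0, 1.0, -1.0])          # w0=0, w_awesome=+scale, w_awful=-scale
    p_pos = 1.0 / (1.0 + np.exp(-(w @ h_x)))        # sigmoid(w^T h(x))
    print(f"scale={scale:.0f}: P(y=+1|x,w) = {p_pos:.3f}")
# Same decision boundary each time, but the probability climbs toward 1 as the scale grows
# (about 0.73, 0.88, and 0.998 respectively).
```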
Learned probabilities
Feature   Value   Learned coefficient
h0(x)     1       0.23
h1(x)     x[1]    1.12
h2(x)     x[2]    −1.07

P(y = +1 | x, w) = 1 / (1 + e^(−wᵀh(x)))
Quadratic features: Learned probabilities
Feature   Value      Learned coefficient
h0(x)     1          1.68
h1(x)     x[1]       1.39
h2(x)     x[2]       −0.58
h3(x)     (x[1])²    −0.17
h4(x)     (x[2])²    −0.96

P(y = +1 | x, w) = 1 / (1 + e^(−wᵀh(x)))
Overfitting ⟹ overconfident predictions
[Figures: learned probabilities for degree 6 and degree 20 features]

Tiny uncertainty regions ⟹ overfitting, and overconfident about it!!!
We are sure we are right, when we are surely wrong!
Overfitting in logistic regression: Another perspective
Linearly-separable data
Data are linearly separable if:
• There exist coefficients ŵ such that:
  - For all positive training data: Score(xi) = ŵᵀh(xi) > 0
  - For all negative training data: Score(xi) = ŵᵀh(xi) < 0

Note 1: If you are using D features, linear separability happens in a D-dimensional space.
Note 2: If you have enough features, data are (almost) always linearly separable, so training_error(ŵ) = 0.
Effect of linear separability on coefficients
Data are linearly separable with ŵ1 = 1.0 and ŵ2 = −1.5.
Data are also linearly separable with ŵ1 = 10 and ŵ2 = −15.
Data are also linearly separable with ŵ1 = 10⁹ and ŵ2 = −1.5×10⁹.

[Figure: decision boundary 1.0 #awesome − 1.5 #awful = 0 in the (x[1] = #awesome, x[2] = #awful) plane, with Score(x) > 0 on one side and Score(x) < 0 on the other]
Effect of linear separability on weights: maximum likelihood estimation (MLE) prefers the most certain model ⟹ coefficients go to infinity for linearly-separable data!!!
Data are linearly separable with ŵ1 = 1.0 and ŵ2 = −1.5; also with ŵ1 = 10 and ŵ2 = −15; also with ŵ1 = 10⁹ and ŵ2 = −1.5×10⁹.

[Figure: P(y=+1|xi,ŵ) over the (x[1] = #awesome, x[2] = #awful) plane; the larger the coefficient magnitudes, the more certain the predicted probabilities become]
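A small check of this claim (my own sketch with a made-up, hypothetical toy dataset): on linearly separable data, scaling up a separating ŵ only increases the data likelihood, so MLE keeps pushing the coefficients larger.

```python
import numpy as np

# Hypothetical separable data: h(x) = [#awesome, #awful], labels in {-1, +1}
H = np.array([[3.0, 1.0], [2.0, 0.0], [0.0, 2.0], [1.0, 3.0]])
y = np.array([+1, +1, -1, -1])
w_hat = np.array([1.0, -1.5])                     # separates this toy data

def log_likelihood(w):
    # sum_i log P(y_i | x_i, w) = -sum_i log(1 + exp(-y_i * w^T h(x_i)))
    return -np.sum(np.log1p(np.exp(-y * (H @ w))))

for c in (1, 10, 100, 1000):
    print(c, log_likelihood(c * w_hat))           # increases toward 0 as c grows
```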
Overfitting in logistic regression is “twice as bad”
If data are linearly separable, learning tries to find a decision boundary that separates the data, leading to:
• an overly complex boundary
• coefficients that go to infinity! (e.g., ŵ1 = 10⁹, ŵ2 = −1.5×10⁹)
Penalizing large coefficients to mitigate overfitting
Desired total cost format
Want to balance: i. How well function fits data ii. Magnitude of coefficients
Total quality = measure of fit − measure of magnitude of coefficients
• measure of fit (data likelihood): large # = good fit to training data
• measure of magnitude of coefficients: large # = overfit
Consider resulting objective
Select ŵ to maximize: ℓ(w) − λ ||w||₂²
  λ: tuning parameter = balance of fit and magnitude

This is L2 regularized logistic regression. Pick λ using (as in ridge/lasso regression):
• a validation set (for large datasets)
• cross-validation (for smaller datasets)
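As a small sketch (my own, not course-provided code) of this quality metric, reusing the hypothetical H and y conventions from the earlier examples:

```python
import numpy as np

def l2_regularized_quality(w, H, y, lam):
    """Total quality = log-likelihood - lam * ||w||_2^2 (larger is better)."""
    log_lik = -np.sum(np.log1p(np.exp(-y * (H @ w))))   # sum_i log P(y_i | x_i, w)
    return log_lik - lam * np.sum(w ** 2)
```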
Degree 20 features, effect of regularization penalty λ
Regularization:          λ = 0           λ = 0.00001      λ = 0.001       λ = 1           λ = 10
Range of coefficients:   −3170 to 3803   −8.04 to 12.14   −0.70 to 1.25   −0.13 to 0.57   −0.05 to 0.22

[Figures: the corresponding decision boundaries, which become simpler as λ increases]
Coefficient path
[Figure: coefficient path — the estimated coefficients ŵj plotted against λ]
Degree 20 features: regularization reduces “overconfidence”
Regularization:          λ = 0           λ = 0.00001      λ = 0.001       λ = 1
Range of coefficients:   −3170 to 3803   −8.04 to 12.14   −0.70 to 1.25   −0.13 to 0.57

[Figures: learned probabilities for each λ — the uncertainty regions widen as λ increases]
Finding best L2 regularized linear classifier with gradient ascent
Understanding contribution of L2 regularization
Derivative of the regularized objective:
  ∂/∂wj [ ℓ(w) − λ||w||₂² ] = ∂ℓ(w)/∂wj − 2λ wj

Term from the L2 penalty: −2λ wj. Impact on wj:
• wj > 0: −2λ wj < 0, so the penalty pushes wj down toward 0
• wj < 0: −2λ wj > 0, so the penalty pushes wj up toward 0
Summary of gradient ascent for logistic regression + L2 regularization
init w(1) = 0 (or randomly, or smartly), t = 1
while not converged:
  for j = 0, …, D:
    partial[j] = Σ_{i=1}^N hj(xi) ( 1[yi = +1] − P(y = +1 | xi, w(t)) )
    wj(t+1) ← wj(t) + η (partial[j] − 2λ wj(t))
  t ← t + 1
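A sketch of this regularized update in NumPy (same assumptions as before: H holds the rows h(xi), y is in {−1, +1}; step size and iteration count are hypothetical):

```python
import numpy as np

def l2_logistic_gradient_ascent(H, y, lam, step_size=0.1, n_iters=500):
    """Gradient ascent on l(w) - lam * ||w||_2^2 for logistic regression."""
    w = np.zeros(H.shape[1])
    indicator = (y == +1).astype(float)
    for _ in range(n_iters):
        p_pos = 1.0 / (1.0 + np.exp(-H @ w))
        partial = H.T @ (indicator - p_pos)              # gradient of the log-likelihood
        w = w + step_size * (partial - 2.0 * lam * w)    # subtract the L2 penalty term
    return w
```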
Summary of overfitting in logistic regression
What you can do now…
• Identify when overfitting is happening
• Relate large learned coefficients to overfitting
• Describe the impact of overfitting on decision boundaries and predicted probabilities of linear classifiers
• Motivate the form of the L2 regularized logistic regression quality metric
• Describe what happens to estimated coefficients as the tuning parameter λ is varied
• Interpret a coefficient path plot
• Estimate L2 regularized logistic regression coefficients using gradient ascent
• Describe the use of L1 regularization to obtain sparse logistic regression solutions
Linear classifiers: Scaling up learning via SGD
CSE 446: Machine Learning
Emily Fox, University of Washington
January 25, 2017
Stochastic gradient descent: Learning, one data point at a time
Why gradient ascent is slow…
[Figure: w(t) → update coefficients → w(t+1) → update coefficients → w(t+2) → …; before each update, the gradient ∇ℓ(w(t)) is computed from the full dataset]

Every update requires a full pass over the data.
More formally: How expensive is gradient ascent?
Sum over data points
∂ℓ(w)/∂wj = Σ_{i=1}^N hj(xi) ( 1[yi = +1] − P(y = +1 | xi, w) )

Contribution of data point (xi, yi) to the gradient:
∂ℓi(w)/∂wj = hj(xi) ( 1[yi = +1] − P(y = +1 | xi, w) )
Every step requires touching every data point!!!
Sum over data points
∂ℓ(w)/∂wj = Σ_{i=1}^N ∂ℓi(w)/∂wj = Σ_{i=1}^N hj(xi) ( 1[yi = +1] − P(y = +1 | xi, w) )

Time to compute contribution of (xi, yi)   # of data points (N)   Total time for 1 step of gradient ascent
1 millisecond                              1,000                  1 second
1 second                                   1,000                  16.7 minutes
1 millisecond                              10 million             2.8 hours
1 millisecond                              10 billion             115.7 days
Stochastic gradient ascent
[Figure: w(t) → update coefficients → w(t+1) → update coefficients → w(t+2) → w(t+3) → w(t+4) → …; each gradient is computed from only a small subset of the data]

Many updates for each pass over the data.
Example: Instead of all data points for gradient, use 1 data point only???
Gradient ascent (sum over data points):
  ∂ℓ(w)/∂wj = Σ_{i=1}^N hj(xi) ( 1[yi = +1] − P(y = +1 | xi, w) )

Stochastic gradient ascent (each time, pick a different data point i):
  ∂ℓ(w)/∂wj ≈ ∂ℓi(w)/∂wj = hj(xi) ( 1[yi = +1] − P(y = +1 | xi, w) )
Stochastic gradient ascent for logistic regression
init w(1) = 0, t = 1
until converged:
  for i = 1, …, N:   (each time, pick a different data point i)
    for j = 0, …, D:
      partial[j] = hj(xi) ( 1[yi = +1] − P(y = +1 | xi, w(t)) )
      wj(t+1) ← wj(t) + η partial[j]
    t ← t + 1
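A sketch of this loop in NumPy (same hypothetical H and y as in the earlier examples; the random shuffling and constant step size are my own choices, not prescribed by the slides):

```python
import numpy as np

def stochastic_gradient_ascent(H, y, step_size=0.1, n_passes=10, seed=0):
    """Stochastic gradient ascent: one data point per coefficient update."""
    rng = np.random.default_rng(seed)
    w = np.zeros(H.shape[1])
    indicator = (y == +1).astype(float)
    for _ in range(n_passes):
        for i in rng.permutation(len(y)):             # visit data points in random order
            p_pos = 1.0 / (1.0 + np.exp(-(H[i] @ w))) # P(y=+1 | x_i, w)
            w = w + step_size * H[i] * (indicator[i] - p_pos)
    return w
```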
Stochastic gradient for L2-regularized objective
Total derivative of the regularized objective:
  Σ_{i=1}^N ∂ℓi(w)/∂wj − 2λ wj

What about the regularization term? Picking a different data point i each time, the stochastic approximation of the total derivative is:
  ∂ℓi(w)/∂wj − (2λ wj) / N

Each data point contributes 1/N of the regularization.
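A minimal sketch (my own) of one such per-data-point step, with hypothetical argument names:

```python
import numpy as np

def sgd_step_l2(w, h_xi, yi, lam, N, step_size=0.1):
    """One stochastic gradient ascent step on the L2-regularized objective."""
    p_pos = 1.0 / (1.0 + np.exp(-(w @ h_xi)))              # P(y=+1 | x_i, w)
    grad_i = h_xi * (float(yi == +1) - p_pos)              # contribution of data point i
    return w + step_size * (grad_i - (2.0 * lam / N) * w)  # 1/N of the penalty per point
```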
Comparing computational time per step
Gradient ascent:             ∂ℓ(w)/∂wj = Σ_{i=1}^N hj(xi) ( 1[yi = +1] − P(y = +1 | xi, w) )
Stochastic gradient ascent:  ∂ℓ(w)/∂wj ≈ ∂ℓi(w)/∂wj = hj(xi) ( 1[yi = +1] − P(y = +1 | xi, w) )

Time to compute contribution of (xi, yi)   # of data points (N)   Total time for 1 step of gradient   Total time for 1 step of stochastic gradient
1 millisecond                              1,000                  1 second                            1 millisecond
1 second                                   1,000                  16.7 minutes                        1 second
1 millisecond                              10 million             2.8 hours                           1 millisecond
1 millisecond                              10 billion             115.7 days                          1 millisecond
Which one is better??? Depends…
Algorithm             Time per iteration    Total time to convergence for large data (in theory / in practice)   Sensitivity to parameters
Gradient              Slow for large data   Slower / Often slower                                                Moderate
Stochastic gradient   Always fast           Faster / Often faster                                                Very high
Summary of stochastic gradient
Tiny change to gradient ascent
Much better scalability
Huge impact in real-world
Very tricky to get right in practice
Why would stochastic gradient ever work???
Gradient is direction of steepest ascent
Gradient is “best” direction, but any direction that goes “up” would be useful
In ML, steepest direction is sum of “little directions” from each data point
Sum over data points:
  ∂ℓ(w)/∂wj = Σ_{i=1}^N hj(xi) ( 1[yi = +1] − P(y = +1 | xi, w) )

For most data points, the contribution points "up".
Stochastic gradient: Pick a data point and move in direction
∂ℓ(w)/∂wj ≈ ∂ℓi(w)/∂wj = hj(xi) ( 1[yi = +1] − P(y = +1 | xi, w) )
Most of the time, total likelihood will increase
Stochastic gradient ascent: most iterations increase the likelihood, but some decrease it ⟹ on average, we make progress
until converged:
  for i = 1, …, N:
    for j = 0, …, D:
      wj(t+1) ← wj(t) + η ∂ℓi(w)/∂wj
    t ← t + 1
Convergence path
Convergence paths
[Figure: convergence paths toward the optimum — the gradient path is smooth, while the stochastic gradient path is noisy]
Stochastic gradient convergence is "noisy"

[Figure: average log likelihood (higher is better) vs. # passes over data (total time is proportional to # passes). Stochastic gradient achieves higher likelihood sooner, but it makes "noisy" progress; gradient usually increases the likelihood smoothly.]
Eventually, gradient catches up
Note: should only trust “average” quality of stochastic gradient
[Figure: average log likelihood over time for stochastic gradient and gradient; the gradient curve eventually catches up]
The last coefficients may be really good or really bad!!

Stochastic gradient will eventually oscillate around a solution: for example, w(1005) was good but w(1000) was bad. How do we minimize the risk of picking bad coefficients?

Minimize noise: don't return the last learned coefficients; output the average instead:
  ŵ = (1/T) Σ_{t=1}^T w(t)
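A sketch (my own, same hypothetical setup as the earlier examples) of stochastic gradient ascent that returns the averaged coefficients rather than the last iterate:

```python
import numpy as np

def averaged_sgd(H, y, step_size=0.1, n_passes=10, seed=0):
    """Stochastic gradient ascent returning the average of all iterates w(t)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(H.shape[1])
    w_sum = np.zeros_like(w)
    n_updates = 0
    indicator = (y == +1).astype(float)
    for _ in range(n_passes):
        for i in rng.permutation(len(y)):
            p_pos = 1.0 / (1.0 + np.exp(-(H[i] @ w)))
            w = w + step_size * H[i] * (indicator[i] - p_pos)
            w_sum += w
            n_updates += 1
    return w_sum / n_updates        # w_hat = (1/T) * sum_t w(t)
```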
Summary of why stochastic gradient works
Gradient finds direction of steepest ascent
Gradient is sum of contributions from each data point
Stochastic gradient uses direction from 1 data point
On average increases likelihood, sometimes decreases
Stochastic gradient has “noisy” convergence
Online learning: Fitting models from streaming data
Batch vs online learning
Batch learning:
• All data is available at the start of training time
  [Diagram: Data → ML algorithm → ŵ]

Online learning:
• Data arrives (streams in) over time
  - Must train model as data arrives!
  [Diagram: at each time t = 1, 2, 3, 4, …, new data flows into the ML algorithm, which outputs updated coefficients ŵ(1), ŵ(2), ŵ(3), ŵ(4), …]
Online learning example: Ad targeting
[Diagram: input xt (user info, page text) → ML algorithm with current coefficients ŵ(t) → ŷ = suggested ads (Ad1, Ad2, Ad3) shown on the website → user clicked on Ad2, so the true output is yt = Ad2 → coefficients updated to ŵ(t+1)]
Online learning problem
• Data arrives over time; at each time step t:
- Observe input xt • Info of user, text of webpage
- Make a prediction ŷt • Which ad to show
- Observe true output yt • Which ad user clicked on
Need ML algorithm to update coefficients each time step!
Stochastic gradient ascent can be used for online learning!!! • init w(1)=0, t=1 • Each time step t:
- Observe input xt
- Make a prediction ŷt
- Observe true output yt
- Update coefficients: for j = 0, …, D:
    wj(t+1) ← wj(t) + η ∂ℓt(w)/∂wj
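A sketch (my own; the stream contents and feature values are hypothetical) of this online update for a binary logistic model:

```python
import numpy as np

def online_logistic_update(w, h_xt, yt, step_size=0.1):
    """One online SGD step after observing (x_t, y_t), with y_t in {-1, +1}."""
    p_pos = 1.0 / (1.0 + np.exp(-(w @ h_xt)))               # probability used for prediction
    y_hat = +1 if p_pos > 0.5 else -1                       # make prediction
    w = w + step_size * h_xt * (float(yt == +1) - p_pos)    # then update with observed y_t
    return w, y_hat

# Hypothetical usage on a stream of (h(x_t), y_t) pairs:
w = np.zeros(3)
stream = [(np.array([1.0, 2.0, 1.0]), +1), (np.array([1.0, 0.0, 3.0]), -1)]
for h_xt, yt in stream:
    w, y_hat = online_logistic_update(w, h_xt, yt)
```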
Summary of online learning
Data arrives over time
Must make a prediction every time new data point arrives
Observe true class after prediction made
Want to update parameters immediately
Summary of stochastic gradient descent
What you can do now…
• Significantly speed up the learning algorithm using stochastic gradient
• Describe the intuition behind why stochastic gradient works
• Apply stochastic gradient in practice
• Describe online learning problems
• Relate stochastic gradient to online learning
Generalized linear models
OPTIONAL
Models for data
• Linear regression with Gaussian errors:
  yi = wᵀh(xi) + εi,   εi ~ N(0, σ²)   ⟹   p(y|x,w) = N(y; wᵀh(x), σ²), with mean µ = wᵀh(x)

• Logistic regression:
  P(y|x,w) = 1 / (1 + e^(−wᵀh(x)))              if y = +1
  P(y|x,w) = e^(−wᵀh(x)) / (1 + e^(−wᵀh(x)))    if y = −1
Examples of “generalized linear models”
A generalized linear model has E[y | x, w] = g(wᵀh(x))
                      Mean E[y|x,w]    Link function g
Linear regression     wᵀh(x)           identity function
Logistic regression   P(y=+1|x,w)      sigmoid function

(For logistic regression this assumes y is in {0,1} instead of {−1,+1} or some other set of values; the mean is different, and the link needs slight modification, if y is not in {0,1}.)
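As a small illustration of the two link functions in this table (my own sketch, with hypothetical argument names):

```python
import numpy as np

def glm_mean(w, h_x, link="identity"):
    """E[y | x, w] = g(w^T h(x)) for the two GLMs above."""
    score = w @ h_x
    if link == "identity":                  # linear regression
        return score
    if link == "sigmoid":                   # logistic regression (y in {0, 1})
        return 1.0 / (1.0 + np.exp(-score))
    raise ValueError(f"unknown link: {link}")
```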
Stochastic gradient descent more formally
OPTIONAL
Learning Problems as Expectations
• Minimizing loss in training data:
  - Given dataset sampled iid from some distribution p(x) on features
  - Loss function ℓ(w, x), e.g., hinge loss, logistic loss, …
  - We often minimize loss in training data:
      ℓ_D(w) = (1/N) Σ_{j=1}^N ℓ(w, xj)
• However, we should really minimize expected loss on all data:
ℓ(w) = E_x[ℓ(w, x)] = ∫ p(x) ℓ(w, x) dx
• So, we are approximating the integral by the average on the training data
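A tiny numeric illustration (entirely my own; the loss function and distribution are made up) of approximating the expected loss by the average over training data:

```python
import numpy as np

rng = np.random.default_rng(0)

def loss(w, x):
    # Made-up loss, just to illustrate the approximation
    return (w * x - 1.0) ** 2

w = 0.5
x_train = rng.normal(size=1000)                          # iid samples from p(x) = N(0, 1)
empirical = np.mean(loss(w, x_train))                    # l_D(w): average over training data
expected = np.mean(loss(w, rng.normal(size=10**6)))      # Monte Carlo estimate of E_x[l(w, x)]
print(empirical, expected)                               # the two should be close
```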
Gradient Ascent in Terms of Expectations
• “True” objective function: ℓ(w) = E_x[ℓ(w, x)] = ∫ p(x) ℓ(w, x) dx
• Taking the gradient: ∇ℓ(w) = E_x[∇ℓ(w, x)]
• “True” gradient ascent rule: w(t+1) ← w(t) + η E_x[∇ℓ(w, x)]
• How do we estimate expected gradient?
SGD: Stochastic Gradient Ascent (or Descent)
• “True” gradient: ∇ℓ(w) = E_x[∇ℓ(w, x)]
• Sample-based approximation: ∇ℓ(w) ≈ (1/N) Σ_{j=1}^N ∇ℓ(w, xj)
• What if we estimate gradient with just one sample??? - Unbiased estimate of gradient - Very noisy!
- Called stochastic gradient ascent (or descent), among many other names
- VERY useful in practice!!!