1/25/2017
Linear classifiers: Overfitting and regularization CSE 446: Machine Learning Emily Fox University of Washington January 25, 2017
©2017 Emily Fox CSE 446: Machine Learning
Logistic regression recap
Thus far, we focused on decision boundaries
Score(xi) = w0 h0(xi) + w1 h1(xi) + … + wD hD(xi) = wᵀh(xi)
Relate Score(xi) to P̂(y=+1|x,ŵ)?
[Figure: decision boundary in the (#awesome, #awful) plane, with Score(x) > 0 on one side and Score(x) < 0 on the other.]
Compare and contrast regression models
• Linear regression with Gaussian errors: yi = wᵀh(xi) + εi, εi ~ N(0, σ²), so p(y|x,w) = N(y; wᵀh(x), σ²)
• Logistic regression:
P(y|x,w) = 1 / (1 + e^(-wᵀh(x)))             if y = +1
P(y|x,w) = e^(-wᵀh(x)) / (1 + e^(-wᵀh(x)))   if y = -1
Understanding the logistic regression model
P(y=+1|xi,w) = sigmoid(Score(xi)) = 1 / (1 + e^(-wᵀh(xi)))
[Figure: sigmoid curve mapping Score(xi) to P(y=+1|xi,w).]
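As a quick sketch (not course code), the score-to-probability mapping above can be written as follows; the function names and the feature-vector representation are illustrative choices, not from the slides:

```python
import math

def sigmoid(score):
    # Numerically stable logistic function: maps any real score to (0, 1).
    if score >= 0:
        return 1.0 / (1.0 + math.exp(-score))
    z = math.exp(score)
    return z / (1.0 + z)

def predict_prob(w, h):
    # P(y = +1 | x, w) = sigmoid(w^T h(x)), with h = h(x) as a plain list.
    score = sum(wj * hj for wj, hj in zip(w, h))
    return sigmoid(score)
```

For example, with w = [1, -1] and h(x) = [2, 1] the score is 1 and the predicted probability is sigmoid(1) ≈ 0.731.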
Predict most likely class. P̂(y|x) = estimate of class probabilities.
Input: x (sentence from review)
If P̂(y=+1|x) > 0.5: ŷ = +1. Else: ŷ = -1.
Estimating P̂(y|x) improves interpretability:
- Predict ŷ = +1 and tell me how sure you are
Gradient descent algorithm, cont’d
Step 5: Gradient over all data points
∂ℓ(w)/∂wj = Σi hj(xi) ( 1[yi = +1] − P(y=+1|xi,w) )
A sum over data points; hj(xi) is the feature value, and the term in parentheses is the difference between truth and prediction:
- yi = +1: contribution hj(xi) (1 − P(y=+1|xi,w))
- yi = -1: contribution −hj(xi) P(y=+1|xi,w)
Summary of gradient ascent for logistic regression
init w(1) = 0 (or randomly, or smartly), t = 1
while ||∇ℓ(w(t))|| > ε:
  for j = 0, …, D
    partial[j] = Σi hj(xi) ( 1[yi = +1] − P(y=+1|xi, w(t)) )
  wj(t+1) ← wj(t) + η partial[j]
  t ← t + 1
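The pseudocode above can be sketched in Python. This is a minimal illustration, not code from the course; the step size, iteration cap, and convergence threshold are made-up defaults:

```python
import math

def sigmoid(s):
    # Numerically stable logistic function.
    return 1.0 / (1.0 + math.exp(-s)) if s >= 0 else math.exp(s) / (1.0 + math.exp(s))

def gradient_ascent(H, y, eta=0.1, eps=1e-4, max_iter=1000):
    """Batch gradient ascent for logistic regression.
    H: list of feature vectors h(x_i); y: labels in {+1, -1}."""
    D = len(H[0])
    w = [0.0] * D                                   # init w(1) = 0
    for _ in range(max_iter):                       # while ||grad l(w)|| > eps
        partial = [0.0] * D
        for hi, yi in zip(H, y):
            p = sigmoid(sum(wj * hj for wj, hj in zip(w, hi)))
            ind = 1.0 if yi == +1 else 0.0          # 1[y_i = +1]
            for j in range(D):
                partial[j] += hi[j] * (ind - p)     # feature * (truth - prediction)
        if math.sqrt(sum(g * g for g in partial)) < eps:
            break
        w = [wj + eta * gj for wj, gj in zip(w, partial)]  # w_j += eta * partial[j]
    return w
```

Note that on linearly separable data the loop will hit `max_iter` rather than converge, which foreshadows the separability discussion later in this lecture.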
Overfitting in classification
Learned decision boundary
Feature    Value    Coefficient learned
h0(x)      1        0.23
h1(x)      x[1]     1.12
h2(x)      x[2]     -1.07
Quadratic features (in 2d). Note: we are not including cross terms for simplicity.
Feature    Value      Coefficient learned
h0(x)      1          1.68
h1(x)      x[1]       1.39
h2(x)      x[2]       -0.59
h3(x)      (x[1])^2   -0.17
h4(x)      (x[2])^2   -0.96
Degree 6 features (in 2d). Note: we are not including cross terms for simplicity.
Feature    Value      Coefficient learned
h0(x)      1          21.6
h1(x)      x[1]       5.3
h2(x)      x[2]       -42.7
h3(x)      (x[1])^2   -15.9
h4(x)      (x[2])^2   -48.6
h5(x)      (x[1])^3   -11.0
h6(x)      (x[2])^3   67.0
h7(x)      (x[1])^4   1.5
h8(x)      (x[2])^4   48.0
h9(x)      (x[1])^5   4.4
h10(x)     (x[2])^5   -14.2
h11(x)     (x[1])^6   0.8
h12(x)     (x[2])^6   -8.6
Coefficient values getting large.
Degree 20 features (in 2d). Note: we are not including cross terms for simplicity.
Feature    Value       Coefficient learned
h0(x)      1           8.7
h1(x)      x[1]        5.1
h2(x)      x[2]        78.7
…          …           …
h11(x)     (x[1])^6    -7.5
h12(x)     (x[2])^6    3803
h13(x)     (x[1])^7    21.1
h14(x)     (x[2])^7    -2406
…          …           …
h37(x)     (x[1])^19   -2*10^-6
h38(x)     (x[2])^19   -0.15
h39(x)     (x[1])^20   -2*10^-8
h40(x)     (x[2])^20   0.03
Often, overfitting is associated with very large estimated coefficients ŵ.
Overfitting in classification
Overfitting if there exists w* such that:
• training_error(w*) > training_error(ŵ)
• true_error(w*) < true_error(ŵ)
[Figure: classification error vs. model complexity.]
Overfitting in classifiers: Overconfident predictions
Logistic regression model
P(y=+1|xi,w) = sigmoid(wᵀh(xi))
[Figure: sigmoid curve; as wᵀh(xi) ranges from −∞ to +∞, P(y=+1|xi,w) ranges from 0.0 through 0.5 to 1.0.]
The subtle (negative) consequence of overfitting in logistic regression
Overfitting → large coefficient values
ŵᵀh(xi) is very positive (or very negative) → sigmoid(ŵᵀh(xi)) goes to 1 (or to 0)
Model becomes extremely overconfident of predictions
Effect of coefficients on logistic regression model
Input x: #awesome=2, #awful=1
w0 = 0 in all three settings:
• w#awesome = +1, w#awful = -1
• w#awesome = +2, w#awful = -2
• w#awesome = +6, w#awful = -6
[Figures: P(y=+1|x,w) as a function of #awesome − #awful for each setting; larger coefficients make the sigmoid steeper.]
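A tiny sketch (not course code) of the effect above, using the slide's input (#awesome = 2, #awful = 1) and the three coefficient scales:

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

# Input from the slide: #awesome = 2, #awful = 1, and w0 = 0 throughout.
n_awesome, n_awful = 2, 1
for w in (1, 2, 6):                       # the three coefficient scales above
    score = w * n_awesome - w * n_awful   # w0 + w*#awesome - w*#awful
    print(w, round(sigmoid(score), 4))    # 1 -> 0.7311, 2 -> 0.8808, 6 -> 0.9975
```

Same decision boundary, same prediction, but the probability jumps from about 0.73 to nearly 1.0 as the coefficients grow: exactly the overconfidence the slide warns about.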
Learned probabilities
Feature    Value    Coefficient learned
h0(x)      1        0.23
h1(x)      x[1]     1.12
h2(x)      x[2]     -1.07
Quadratic features: Learned probabilities
Feature    Value      Coefficient learned
h0(x)      1          1.68
h1(x)      x[1]       1.39
h2(x)      x[2]       -0.58
h3(x)      (x[1])^2   -0.17
h4(x)      (x[2])^2   -0.96
Overfitting → overconfident predictions
[Figures: learned probabilities for degree 6 and degree 20 features; tiny uncertainty regions.]
Overfitting & overconfident about it!!! We are sure we are right, when we are surely wrong!
Overfitting in logistic regression: Another perspective
Linearly-separable data
Data are linearly separable if:
• There exist coefficients ŵ such that:
  - For all positive training data, Score(xi) > 0
  - For all negative training data, Score(xi) < 0
Note 1: If you are using D features, linear separability happens in a D-dimensional space.
Note 2: If you have enough features, data are (almost) always linearly separable → training_error(ŵ) = 0.
Effect of linear separability on coefficients
Data are linearly separable with ŵ1 = 1.0 and ŵ2 = -1.5.
Data are also linearly separable with ŵ1 = 10 and ŵ2 = -15.
Data are also linearly separable with ŵ1 = 10^9 and ŵ2 = -1.5*10^9.
[Figure: decision boundary in the (x[1] = #awesome, x[2] = #awful) plane, with Score(x) > 0 on one side and Score(x) < 0 on the other.]
Effect of linear separability on weights
Maximum likelihood estimation (MLE) prefers the most certain model → coefficients go to infinity for linearly-separable data!!!
[Figure: P(y=+1|xi,ŵ) in the (x[1] = #awesome, x[2] = #awful) plane for ŵ1 = 1.0, ŵ2 = -1.5; for ŵ1 = 10, ŵ2 = -15; and for ŵ1 = 10^9, ŵ2 = -1.5*10^9.]
Overfitting in logistic regression is “twice as bad”
Learning tries to find a decision boundary that separates the data → overly complex boundary.
If data are linearly separable → coefficients go to infinity!
Example: ŵ1 = 10^9, ŵ2 = -1.5*10^9
Penalizing large coefficients to mitigate overfitting
Desired total cost format
Want to balance: i. How well function fits data ii. Magnitude of coefficients
Total quality = measure of fit − measure of magnitude of coefficients
• measure of fit = data likelihood (large # = good fit to training data)
• measure of magnitude of coefficients (large # = overfit)
Consider resulting objective
Select ŵ to maximize: ℓ(w) − λ||w||_2^2
λ = tuning parameter = balance of fit and magnitude
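As a sketch (not course code) of this objective, assuming labels in {+1, -1} and a made-up list-of-lists dataset format:

```python
import math

def log_likelihood(w, H, y):
    # Data log-likelihood: sum_i log P(y_i | x_i, w), for y_i in {+1, -1}.
    ll = 0.0
    for hi, yi in zip(H, y):
        score = sum(wj * hj for wj, hj in zip(w, hi))
        m = yi * score
        # log P(y_i|x_i,w) = -log(1 + exp(-y_i * score)), in a stable form:
        ll -= math.log1p(math.exp(-m)) if m >= 0 else (-m + math.log1p(math.exp(m)))
    return ll

def l2_objective(w, H, y, lam):
    # Total quality = measure of fit - lambda * ||w||_2^2
    return log_likelihood(w, H, y) - lam * sum(wj * wj for wj in w)
```

With λ = 0 this is plain maximum likelihood; larger λ pays a growing price for large coefficients.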
Pick λ using:
• Validation set (for large datasets)
• Cross-validation (for smaller datasets)
(as in ridge/lasso regression)
This objective defines L2 regularized logistic regression.
Degree 20 features, effect of regularization penalty λ
Regularization λ:        λ = 0           λ = 0.00001     λ = 0.001      λ = 1          λ = 10
Range of coefficients:   -3170 to 3803   -8.04 to 12.14  -0.70 to 1.25  -0.13 to 0.57  -0.05 to 0.22
[Figures: decision boundaries for each λ.]
Coefficient path
[Figure: coefficients ŵj plotted against λ.]
Degree 20 features: regularization reduces “overconfidence”
Regularization λ:        λ = 0           λ = 0.00001     λ = 0.001      λ = 1
Range of coefficients:   -3170 to 3803   -8.04 to 12.14  -0.70 to 1.25  -0.13 to 0.57
[Figures: learned probabilities for each λ.]
Finding best L2 regularized linear classifier with gradient ascent
Understanding contribution of L2 regularization
Term from L2 penalty: -2λwj
Impact on wj:
• wj > 0: -2λwj < 0, so the update pushes wj toward 0
• wj < 0: -2λwj > 0, so the update pushes wj toward 0
Summary of gradient ascent for logistic regression + L2 regularization
init w(1) = 0 (or randomly, or smartly), t = 1
while not converged:
  for j = 0, …, D
    partial[j] = Σi hj(xi) ( 1[yi = +1] − P(y=+1|xi, w(t)) )
  wj(t+1) ← wj(t) + η ( partial[j] − 2λ wj(t) )
  t ← t + 1
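A single step of the regularized update can be sketched as follows (illustrative code, not from the course; note the slide's formula penalizes the intercept w[0] too, whereas implementations often leave the intercept unpenalized):

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s)) if s >= 0 else math.exp(s) / (1.0 + math.exp(s))

def l2_gradient_step(w, H, y, eta, lam):
    """One gradient-ascent step on l(w) - lam * ||w||^2 for labels in {+1, -1}."""
    D = len(w)
    partial = [0.0] * D
    for hi, yi in zip(H, y):
        p = sigmoid(sum(wj * hj for wj, hj in zip(w, hi)))
        ind = 1.0 if yi == +1 else 0.0
        for j in range(D):
            partial[j] += hi[j] * (ind - p)
    # w_j <- w_j + eta * (partial[j] - 2 * lam * w_j)
    return [wj + eta * (gj - 2.0 * lam * wj) for wj, gj in zip(w, partial)]
```

With no data at all, repeated steps just shrink w toward 0, which is exactly the penalty term acting alone.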
Summary of overfitting in logistic regression
What you can do now…
• Identify when overfitting is happening
• Relate large learned coefficients to overfitting
• Describe the impact of overfitting on decision boundaries and predicted probabilities of linear classifiers
• Motivate the form of the L2 regularized logistic regression quality metric
• Describe what happens to estimated coefficients as the tuning parameter λ is varied
• Interpret a coefficient path plot
• Estimate L2 regularized logistic regression coefficients using gradient ascent
• Describe the use of L1 regularization to obtain sparse logistic regression solutions
Linear classifiers: Scaling up learning via SGD CSE 446: Machine Learning Emily Fox University of Washington January 25, 2017
Stochastic gradient descent: Learning, one data point at a time
Why gradient ascent is slow…
w(t) → update coefficients → w(t+1) → update coefficients → w(t+2) → …
Every update requires a full pass over the data to compute the gradient ∇ℓ(w(t)).
More formally: How expensive is gradient ascent?
∂ℓ(w)/∂wj = Σi hj(xi) ( 1[yi = +1] − P(y=+1|xi,w) )
A sum over data points, where each term is the contribution of data point (xi, yi) to the gradient.
Every step requires touching every data point!!!
Time to compute contribution of (xi,yi)   # of data points (N)   Total time for 1 step of gradient ascent
1 millisecond                             1000                   1 second
1 second                                  1000                   16.7 minutes
1 millisecond                             10 million             2.8 hours
1 millisecond                             10 billion             115.7 days
Stochastic gradient ascent
w(t) → update coefficients → w(t+1) → update coefficients → w(t+2) → update coefficients → w(t+3) → update coefficients → w(t+4) → …
Many updates for each pass over the data: compute the gradient using only small subsets of the data.
Example: Instead of all data points for gradient, use 1 data point only???
Gradient ascent: ∂ℓ(w)/∂wj = Σi hj(xi) ( 1[yi = +1] − P(y=+1|xi,w) ), a sum over data points.
Stochastic gradient ascent: each time, pick a different data point i and use only its term, hj(xi) ( 1[yi = +1] − P(y=+1|xi,w) ).
Stochastic gradient ascent for logistic regression
init w(1) = 0, t = 1
until converged:
  for i = 1, …, N   (each time, pick a different data point i)
    for j = 0, …, D
      partial[j] = hj(xi) ( 1[yi = +1] − P(y=+1|xi, w(t)) )
    wj(t+1) ← wj(t) + η partial[j]
    t ← t + 1
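The stochastic version can be sketched as follows (illustrative code, not from the course); shuffling the visit order each pass is a common practical choice, not something the pseudocode above specifies:

```python
import math
import random

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s)) if s >= 0 else math.exp(s) / (1.0 + math.exp(s))

def stochastic_gradient_ascent(H, y, eta=0.1, n_passes=25, seed=0):
    """One coefficient update per data point; labels y in {+1, -1}."""
    rng = random.Random(seed)
    D = len(H[0])
    w = [0.0] * D                                   # init w(1) = 0
    order = list(range(len(H)))
    for _ in range(n_passes):
        rng.shuffle(order)                          # visit points in random order
        for i in order:
            hi, yi = H[i], y[i]
            p = sigmoid(sum(wj * hj for wj, hj in zip(w, hi)))
            ind = 1.0 if yi == +1 else 0.0
            # Update from ONE data point: w_j += eta * h_j(x_i) * (1[y_i=+1] - p)
            w = [wj + eta * hj * (ind - p) for wj, hj in zip(w, hi)]
    return w
```

Each pass now performs N updates instead of one, which is the whole point of the next few slides.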
Stochastic gradient for L2-regularized objective
What about the regularization term?
Total derivative = ∂ℓ(w)/∂wj − 2λwj = partial[j] − 2λwj
Stochastic gradient ascent (each time, pick a different data point i): total derivative ≈ partial[j] − (2λ/N) wj
Each data point contributes 1/N of the regularization.
Comparing computational time per step
Gradient ascent vs. stochastic gradient ascent:
Time to compute contribution of (xi,yi)   # of data points (N)   1 step of gradient   1 step of stochastic gradient
1 millisecond                             1000                   1 second             1 millisecond
1 second                                  1000                   16.7 minutes         1 second
1 millisecond                             10 million             2.8 hours            1 millisecond
1 millisecond                             10 billion             115.7 days           1 millisecond
Which one is better??? Depends…
Algorithm              Time per iteration    Total time to convergence for large data   Sensitivity to parameters
                                             (in theory)      (in practice)
Gradient               Slow for large data   Slower           Often slower              Moderate
Stochastic gradient    Always fast           Faster           Often faster              Very high
Summary of stochastic gradient
Tiny change to gradient ascent
Much better scalability
Huge impact in real-world
Very tricky to get right in practice
Why would stochastic gradient ever work???
Gradient is direction of steepest ascent
Gradient is “best” direction, but any direction that goes “up” would be useful
In ML, steepest direction is sum of “little directions” from each data point
∇ℓ(w) is a sum over data points of each point's contribution.
For most data points, the contribution points "up".
Stochastic gradient: Pick a data point and move in direction
Most of the time, total likelihood will increase
Stochastic gradient ascent: most iterations increase the likelihood, but some decrease it. On average, we make progress.
until converged:
  for i = 1, …, N
    for j = 0, …, D
      wj(t+1) ← wj(t) + η hj(xi) ( 1[yi = +1] − P(y=+1|xi, w(t)) )
    t ← t + 1
Convergence path
Convergence paths
[Figure: convergence paths in coefficient space for gradient vs. stochastic gradient.]
Stochastic gradient convergence is "noisy": stochastic gradient makes noisy progress, achieving higher likelihood sooner, but it's noisier. Gradient usually increases the likelihood smoothly.
[Figure: avg. log likelihood vs. # passes over data; total time is proportional to # passes over data.]
Eventually, gradient catches up
Note: should only trust “average” quality of stochastic gradient
[Figure: avg. log likelihood for stochastic gradient vs. gradient; the gradient curve eventually catches up.]
The last coefficients may be really good or really bad!!
Stochastic gradient will eventually oscillate around a solution, e.g., w(1000) was bad but w(1005) was good.
How do we minimize the risk of picking bad coefficients? Minimize noise: don't return the last learned coefficients; output the average: ŵ = (1/T) Σt w(t).
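The averaging step above is a one-liner; this sketch (illustrative, not course code) assumes the SGD iterates have been collected into a list of coefficient vectors:

```python
def averaged_sgd_weights(iterates):
    """Average the coefficient vectors w(1), ..., w(T) visited by stochastic
    gradient: w_hat_j = (1/T) * sum_t w_j(t). This smooths out the noise of
    any single (possibly bad) final iterate."""
    T = len(iterates)
    D = len(iterates[0])
    return [sum(w[j] for w in iterates) / T for j in range(D)]
```

In practice one usually averages only the later iterates (after an initial "burn-in"), since early iterates are far from the solution; that refinement is left out here for brevity.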
Summary of why stochastic gradient works
Gradient finds direction of steepest ascent
Gradient is sum of contributions from each data point
Stochastic gradient uses direction from 1 data point
On average increases likelihood, sometimes decreases
Stochastic gradient has “noisy” convergence
Online learning: Fitting models from streaming data
Batch vs online learning
Batch learning: all data is available at the start of training time; Data → ML algorithm → ŵ.
Online learning: data arrives (streams in) over time; must train the model as data arrives! At time t = 1, 2, 3, 4, …, data arrives and the ML algorithm produces ŵ(1), ŵ(2), ŵ(3), ŵ(4), …
Online learning example: Ad targeting
Input: xt = user info, text of the webpage. The website shows suggested ads Ad1, Ad2, Ad3; the prediction ŷt is which ad to show. The user clicked on Ad2, so yt = Ad2. The ML algorithm updates ŵ(t) → ŵ(t+1).
Online learning problem • Data arrives over each time step t:
- Observe input xt • Info of user, text of webpage
- Make a prediction ŷt • Which ad to show
- Observe true output yt • Which ad user clicked on
Need ML algorithm to update coefficients each time step!
Stochastic gradient ascent can be used for online learning!!! • init w(1)=0, t=1 • Each time step t:
- Observe input xt
- Make a prediction ŷt
- Observe true output yt
- Update coefficients: for j = 0, …, D
  wj(t+1) ← wj(t) + η hj(xt) ( 1[yt = +1] − P(y=+1|xt, w(t)) )
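A streaming version of the update can be sketched as a small class (illustrative, not course code). It assumes an identity feature map and labels in {+1, -1}, which is simpler than the multi-ad example above:

```python
import math

class OnlineLogisticRegression:
    """Online learning via stochastic gradient: one coefficient update per
    arriving (x_t, y_t) pair."""

    def __init__(self, D, eta=0.1):
        self.w = [0.0] * D      # init w(1) = 0
        self.eta = eta

    def predict_prob(self, h):
        # P(y = +1 | x_t, w), computed in a numerically stable way.
        s = sum(wj * hj for wj, hj in zip(self.w, h))
        return 1.0 / (1.0 + math.exp(-s)) if s >= 0 else math.exp(s) / (1.0 + math.exp(s))

    def update(self, h, label):
        # After observing the true output y_t, take one gradient step:
        # w_j += eta * h_j(x_t) * (1[y_t = +1] - P(y=+1|x_t, w))
        p = self.predict_prob(h)
        ind = 1.0 if label == +1 else 0.0
        self.w = [wj + self.eta * hj * (ind - p) for wj, hj in zip(self.w, h)]
```

Predict, observe, update: the same loop as stochastic gradient, except the "data point picked each time" is whatever the stream delivers next.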
Summary of online learning
Data arrives over time
Must make a prediction every time new data point arrives
Observe true class after prediction made
Want to update parameters immediately
Summary of stochastic gradient descent
What you can do now…
• Significantly speed up the learning algorithm using stochastic gradient
• Describe the intuition behind why stochastic gradient works
• Apply stochastic gradient in practice
• Describe online learning problems
• Relate stochastic gradient to online learning
Generalized linear models
Models for data
• Linear regression with Gaussian errors: yi = wᵀh(xi) + εi, εi ~ N(0, σ²), so p(y|x,w) = N(y; wᵀh(x), σ²)
• Logistic regression
P(y|x,w) = 1 / (1 + e^(-wᵀh(x)))             if y = +1
P(y|x,w) = e^(-wᵀh(x)) / (1 + e^(-wᵀh(x)))   if y = -1
Examples of “generalized linear models”
A generalized linear model has E[y | x,w ] = g(wTh(x))
Model                Mean E[y|x,w]   Link function g
Linear regression    wᵀh(x)          identity function
Logistic regression  P(y=+1|x,w)     sigmoid function
(For logistic regression this assumes y in {0,1} instead of {-1,+1} or some other set of values; the mean is different, and a slight modification is needed, if y is not in {0,1}.)
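The table above can be summarized in a few lines of illustrative code (not from the course); `glm_mean` and the lambda names are made up for this sketch:

```python
import math

def glm_mean(w, h, g):
    # A generalized linear model predicts E[y | x, w] = g(w^T h(x)).
    return g(sum(wj * hj for wj, hj in zip(w, h)))

identity = lambda s: s                           # linear regression
logistic = lambda s: 1.0 / (1.0 + math.exp(-s))  # logistic regression, y in {0,1}
```

The two models share the linear score wᵀh(x); only the link function g differs.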
Stochastic gradient descent more formally
Learning Problems as Expectations
• Minimizing loss in training data:
  - Given dataset: {(xi, yi)}, i = 1, …, N
    • Sampled iid from some distribution p(x) on features
  - Loss function ℓ(w; x, y), e.g., hinge loss, logistic loss, …
  - We often minimize loss in training data: (1/N) Σi ℓ(w; xi, yi)
• However, we should really minimize expected loss on all data: E_x[ℓ(w; x, y)] = ∫ p(x) ℓ(w; x, y) dx
• So, we are approximating the integral by the average on the training data.
Gradient Ascent in Terms of Expectations
• "True" objective function: E_x[ℓ(w; x, y)]
• Taking the gradient: ∇ E_x[ℓ(w; x, y)] = E_x[∇ℓ(w; x, y)]
• "True" gradient ascent rule: w(t+1) ← w(t) + η E_x[∇ℓ(w(t); x, y)]
• How do we estimate the expected gradient?
SGD: Stochastic Gradient Ascent (or Descent)
• "True" gradient: E_x[∇ℓ(w; x, y)]
• Sample-based approximation: (1/N) Σi ∇ℓ(w; xi, yi)
• What if we estimate the gradient with just one sample: ∇ℓ(w; xi, yi)?
  - Unbiased estimate of the gradient
  - Very noisy!
  - Called stochastic gradient ascent (or descent), among many other names
  - VERY useful in practice!!!
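The unbiasedness claim above can be checked numerically. This sketch uses made-up per-point gradient contributions (not real data): picking one contribution uniformly at random has expectation equal to the full sample average, so averaging many one-sample estimates recovers it:

```python
import random

# Hypothetical per-point gradient contributions g_i = grad l(w; x_i, y_i)
# at some fixed w (made-up numbers: 2 coefficients, N = 4 points).
contributions = [[1.0, -2.0], [3.0, 0.5], [-1.0, 0.5], [1.0, 1.0]]
N = len(contributions)

# Full sample-based gradient estimate: the average over all N points.
full = [sum(g[j] for g in contributions) / N for j in range(2)]

# One uniformly chosen contribution is an unbiased (but noisy) estimate:
# averaging many independent one-sample estimates should recover `full`.
rng = random.Random(0)
trials = 20000
avg = [0.0, 0.0]
for _ in range(trials):
    g = contributions[rng.randrange(N)]
    avg = [a + gj / trials for a, gj in zip(avg, g)]
```

Any single draw can be far from `full` (that is the "very noisy" bullet), but the long-run average matches it.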