
Linear classifiers: Overfitting and regularization
CSE 446: Machine Learning
Emily Fox, University of Washington
January 25, 2017

Logistic regression recap

Thus far, we focused on decision boundaries

Score(x_i) = w_0 h_0(x_i) + w_1 h_1(x_i) + … + w_D h_D(x_i) = w^T h(x_i)

(Figure: the decision boundary 1.0 #awesome − 1.5 #awful = 0 in the #awesome/#awful plane, with Score(x) > 0 on one side and Score(x) < 0 on the other.)

Question: how do we relate Score(x_i) to P̂(y = +1 | x, ŵ)?

Compare and contrast regression models
• Linear regression with Gaussian errors:
  y_i = w^T h(x_i) + ε_i,  ε_i ~ N(0, σ^2)  ⇒  p(y | x, w) = N(y; w^T h(x), σ^2)
• Logistic regression:
  P(y = +1 | x, w) = 1 / (1 + e^(−w^T h(x)))
  P(y = −1 | x, w) = e^(−w^T h(x)) / (1 + e^(−w^T h(x)))

Understanding the logistic regression model

P(y = +1 | x_i, w) = sigmoid(Score(x_i)) = 1 / (1 + e^(−w^T h(x_i)))

(Figure: the sigmoid 1/(1 + e^(−w^T h(x))) plotted against Score(x_i) = w^T h(x_i).)

Predict most likely class

P̂(y | x) = estimate of class probabilities
If P̂(y = +1 | x) > 0.5: ŷ = +1
Else: ŷ = −1

Estimating P̂(y | x) improves interpretability:
- Predict ŷ = +1 and tell me how sure you are

Gradient ascent algorithm, cont'd

Step 5: Gradient over all data points

The log-likelihood over all N data points is

    ℓ(w) = Σ_{i=1}^N [ (1[y_i = +1] − 1) w^T h(x_i) − ln(1 + e^(−w^T h(x_i))) ]

and its partial derivative with respect to coefficient w_j is

    ∂ℓ(w)/∂w_j = Σ_{i=1}^N h_j(x_i) ( 1[y_i = +1] − P(y = +1 | x_i, w) )

The sum is over data points, h_j(x_i) is the feature value, and the term in parentheses is the difference between the truth and the prediction: when y_i = +1 and P(y = +1 | x_i, w) ≈ 1, or y_i = −1 and P(y = +1 | x_i, w) ≈ 0, the point contributes almost nothing to the gradient; poorly predicted points contribute the most.

Summary of gradient ascent for logistic regression

    init w^(1) = 0 (or randomly, or smartly), t = 1
    while ||∇ℓ(w^(t))|| > ε:
      for j = 0, …, D:
        partial[j] = Σ_{i=1}^N h_j(x_i) ( 1[y_i = +1] − P(y = +1 | x_i, w^(t)) )
      w_j^(t+1) ← w_j^(t) + η partial[j]
      t ← t + 1

(A runnable sketch of this loop appears after the feature tables below.)

Overfitting in classification

Learned decision boundary

    Feature   Value     Coefficient learned
    h0(x)     1          0.23
    h1(x)     x[1]       1.12
    h2(x)     x[2]      -1.07

Quadratic features (in 2d)
Note: we are not including cross terms for simplicity

    Feature   Value      Coefficient learned
    h0(x)     1           1.68
    h1(x)     x[1]        1.39
    h2(x)     x[2]       -0.59
    h3(x)     (x[1])^2   -0.17
    h4(x)     (x[2])^2   -0.96

Degree 6 features (in 2d)
Note: we are not including cross terms for simplicity
Coefficient values are getting large:

    Feature   Value      Coefficient learned
    h0(x)     1          21.6
    h1(x)     x[1]        5.3
    h2(x)     x[2]      -42.7
    h3(x)     (x[1])^2  -15.9
    h4(x)     (x[2])^2  -48.6
    h5(x)     (x[1])^3  -11.0
    h6(x)     (x[2])^3   67.0
    h7(x)     (x[1])^4    1.5
    h8(x)     (x[2])^4   48.0
    h9(x)     (x[1])^5    4.4
    h10(x)    (x[2])^5  -14.2
    h11(x)    (x[1])^6    0.8
    h12(x)    (x[2])^6   -8.6

Degree 20 features (in 2d)
Note: we are not including cross terms for simplicity
Often, overfitting is associated with very large estimated coefficients ŵ:

    Feature   Value       Coefficient learned
    h0(x)     1             8.7
    h1(x)     x[1]          5.1
    h2(x)     x[2]         78.7
    …         …             …
    h11(x)    (x[1])^6     -7.5
    h12(x)    (x[2])^6    3803
    h13(x)    (x[1])^7     21.1
    h14(x)    (x[2])^7   -2406
    …         …             …
    h37(x)    (x[1])^19   -2*10^-6
    h38(x)    (x[2])^19   -0.15
    h39(x)    (x[1])^20   -2*10^-8
    h40(x)    (x[2])^20    0.03
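All of the learned coefficients in the tables above come from maximizing the likelihood with the gradient ascent loop summarized earlier. As a minimal NumPy sketch (not the course's reference implementation), assuming a feature matrix H that already includes the constant feature, labels y in {-1, +1}, and an illustrative step size and stopping tolerance:

    import numpy as np

    def sigmoid(score):
        # P(y = +1 | x, w) = 1 / (1 + exp(-w^T h(x)))
        return 1.0 / (1.0 + np.exp(-score))

    def logistic_gradient_ascent(H, y, eta=0.1, epsilon=1e-4, max_iter=10000):
        """Gradient ascent on the logistic regression log-likelihood.

        H : (N, D+1) array of feature values h_j(x_i), constant feature included.
        y : (N,) array of labels in {-1, +1}.
        """
        w = np.zeros(H.shape[1])                    # init w^(1) = 0
        indicator = (y == +1).astype(float)         # 1[y_i = +1]
        for t in range(max_iter):
            prob = sigmoid(H @ w)                   # P(y = +1 | x_i, w^(t)) for all i
            partial = H.T @ (indicator - prob)      # partial[j] = sum_i h_j(x_i)(1[y_i=+1] - P(...))
            if np.linalg.norm(partial) <= epsilon:  # stop when ||grad l(w)|| is small
                break
            w = w + eta * partial                   # w_j^(t+1) = w_j^(t) + eta * partial[j]
        return w

Fitting this on the same data with higher- and higher-degree polynomial features is what produces coefficient tables like the degree 6 and degree 20 ones above.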
Overfitting in classification

Classification error = fraction of data points misclassified.

Overfitting occurs if there exists w* such that:
• training_error(w*) > training_error(ŵ)
• true_error(w*) < true_error(ŵ)

(Figure: classification error vs. model complexity; training error keeps decreasing with complexity while true error eventually rises.)

Overfitting in classifiers ⇒ overconfident predictions

Logistic regression model

P(y = +1 | x_i, w) = sigmoid(w^T h(x_i))

As w^T h(x_i) ranges from −∞ to +∞, P(y = +1 | x_i, w) ranges from 0.0 to 1.0, passing through 0.5 at w^T h(x_i) = 0.

The subtle (negative) consequence of overfitting in logistic regression

Overfitting ⇒ large coefficient values
⇒ ŵ^T h(x_i) is very positive (or very negative)
⇒ sigmoid(ŵ^T h(x_i)) goes to 1 (or to 0)
⇒ the model becomes extremely overconfident in its predictions

Effect of coefficients on logistic regression model

Input x: #awesome = 2, #awful = 1

    Coefficient    Setting 1   Setting 2   Setting 3
    w0              0           0           0
    w_#awesome     +1          +2          +6
    w_#awful       -1          -2          -6

(Figure: 1/(1 + e^(−w^T h(x))) plotted against #awesome − #awful for each setting; larger coefficients make the sigmoid steeper and the predicted probabilities more extreme.)

Learned probabilities

    Feature   Value     Coefficient learned
    h0(x)     1          0.23
    h1(x)     x[1]       1.12
    h2(x)     x[2]      -1.07

P(y = +1 | x, w) = 1 / (1 + e^(−w^T h(x)))

Quadratic features: Learned probabilities

    Feature   Value      Coefficient learned
    h0(x)     1           1.68
    h1(x)     x[1]        1.39
    h2(x)     x[2]       -0.58
    h3(x)     (x[1])^2   -0.17
    h4(x)     (x[2])^2   -0.96

Overfitting ⇒ overconfident predictions

(Figure: learned probabilities for the degree 6 and degree 20 fits.) Tiny uncertainty regions ⇒ overfitting, and overconfident about it! We are sure we are right, when we are surely wrong.

Overfitting in logistic regression: Another perspective

Linearly-separable data

Data are linearly separable if there exist coefficients ŵ such that:
- For all positive training data, Score(x_i) = ŵ^T h(x_i) > 0
- For all negative training data, Score(x_i) = ŵ^T h(x_i) < 0

Note 1: If you are using D features, linear separability happens in a D-dimensional space.
Note 2: If you have enough features, data are (almost) always linearly separable, in which case training_error(ŵ) = 0.

Effect of linear separability on coefficients

(Figure: the boundary 1.0 #awesome − 1.5 #awful = 0 in the x[1] = #awesome, x[2] = #awful plane, with Score(x) > 0 on one side and Score(x) < 0 on the other.)

• Data are linearly separable with ŵ1 = 1.0 and ŵ2 = −1.5
• Data are also linearly separable with ŵ1 = 10 and ŵ2 = −15
• Data are also linearly separable with ŵ1 = 10^9 and ŵ2 = −1.5x10^9

Maximum likelihood estimation (MLE) prefers the most certain model ⇒ coefficients go to infinity for linearly-separable data!

For any point the boundary classifies as positive (e.g., x[1] = 2, x[2] = 1), each rescaling of the same separating coefficients pushes P(y = +1 | x_i, ŵ) closer to 1, so the likelihood keeps increasing as the coefficients grow (see the sketch below).

Overfitting in logistic regression is "twice as bad"

Learning tries to find a decision boundary that separates the data. This can produce an overly complex boundary, and if the data are linearly separable, the coefficients also go to infinity (e.g., ŵ1 = 10^9, ŵ2 = −1.5x10^9).
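To see the "coefficients go to infinity" effect numerically, here is a small illustration of my own, combining the example input #awesome = 2, #awful = 1 from the "Effect of coefficients" slide with the coefficient rescalings from the linear-separability slides, and taking w0 = 0:

    import numpy as np

    def prob_positive(w, h):
        # P(y = +1 | x, w) = sigmoid(w^T h(x))
        return 1.0 / (1.0 + np.exp(-np.dot(w, h)))

    # Feature vector h(x) = (1, #awesome, #awful) for #awesome = 2, #awful = 1.
    h = np.array([1.0, 2.0, 1.0])

    # Same decision boundary, increasingly rescaled coefficients (w0, w_#awesome, w_#awful).
    for w in (np.array([0.0, 1.0, -1.5]),
              np.array([0.0, 10.0, -15.0]),
              np.array([0.0, 1e9, -1.5e9])):
        print(w, prob_positive(w, h))

The sign of the score never changes, so the prediction is the same, but P(y = +1 | x, w) climbs from about 0.62 toward 1.0. That growing "certainty" is exactly why maximum likelihood keeps pushing the coefficients of a separating boundary larger and larger.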
Penalizing large coefficients to mitigate overfitting

Desired total cost format

Want to balance:
i.  how well the function fits the data
ii. the magnitude of the coefficients

Total quality = measure of fit − measure of magnitude of coefficients

The measure of fit is the data likelihood (a large value means a good fit to the training data); a large measure of magnitude signals overfitting.

Consider resulting objective

Select ŵ to maximize:

    ℓ(w) − λ ||w||_2^2

This is L2 regularized logistic regression. The tuning parameter λ balances fit and magnitude; pick λ using a validation set (for large datasets) or cross-validation (for smaller datasets), as in ridge/lasso regression.

Degree 20 features, effect of regularization penalty λ

    Regularization λ        0              0.00001         0.001          1              10
    Range of coefficients   -3170 to 3803  -8.04 to 12.14  -0.70 to 1.25  -0.13 to 0.57  -0.05 to 0.22

(Figure: the corresponding decision boundary for each λ.)

Coefficient path

(Figure: coefficient path plot of the learned coefficients ŵ_j as a function of λ.)

Degree 20 features: regularization reduces "overconfidence"

    Regularization λ        0              0.00001         0.001          1
    Range of coefficients   -3170 to 3803  -8.04 to 12.14  -0.70 to 1.25  -0.13 to 0.57

(Figure: the corresponding learned probabilities for each λ.)

Finding best L2 regularized linear classifier with gradient ascent

Understanding contribution of L2 regularization

The term contributed by the L2 penalty is

    ∂/∂w_j [ −λ ||w||_2^2 ] = −2λ w_j

Impact on w_j:
• if w_j > 0, then −2λ w_j < 0, which decreases w_j toward 0
• if w_j < 0, then −2λ w_j > 0, which increases w_j toward 0

Summary of gradient ascent for logistic regression + L2 regularization

    init w^(1) = 0 (or randomly, or smartly), t = 1
    while not converged:
      for j = 0, …, D:
        partial[j] = Σ_{i=1}^N h_j(x_i) ( 1[y_i = +1] − P(y = +1 | x_i, w^(t)) )
      w_j^(t+1) ← w_j^(t) + η ( partial[j] − 2λ w_j^(t) )
      t ← t + 1

(A runnable sketch of this regularized update follows the summary below.)

Summary of overfitting in logistic regression

What you can do now…
• Identify when overfitting is happening
• Relate large learned coefficients to overfitting
• Describe the impact of overfitting on decision boundaries and predicted probabilities of linear classifiers
• Motivate the form of L2 regularized logistic regression quality metric
• Describe what happens to estimated coefficients as the tuning parameter λ is varied
• Interpret a coefficient path plot
• Estimate L2 regularized logistic regression coefficients using gradient ascent
• Describe the use of L1 regularization to obtain sparse logistic regression solutions
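To make the regularized update concrete, here is a minimal NumPy sketch of the loop from "Summary of gradient ascent for logistic regression + L2 regularization" above. The fixed iteration count stands in for the unspecified convergence test, the step size is an illustrative choice, and, like the slides, it penalizes every coefficient including the intercept w0 (in practice the intercept is often left unpenalized):

    import numpy as np

    def l2_regularized_logistic_ascent(H, y, l2_penalty, eta=0.01, num_iters=1000):
        """Gradient ascent on l(w) - lambda * ||w||_2^2.

        H : (N, D+1) array of feature values h_j(x_i), constant feature included.
        y : (N,) array of labels in {-1, +1}.
        l2_penalty : the tuning parameter lambda.
        """
        w = np.zeros(H.shape[1])                     # init w^(1) = 0
        indicator = (y == +1).astype(float)          # 1[y_i = +1]
        for t in range(num_iters):                   # stand-in for "while not converged"
            prob = 1.0 / (1.0 + np.exp(-(H @ w)))    # P(y = +1 | x_i, w^(t))
            partial = H.T @ (indicator - prob)       # likelihood part of the gradient
            # The L2 penalty contributes -2*lambda*w_j, shrinking each coefficient toward 0.
            w = w + eta * (partial - 2.0 * l2_penalty * w)
        return w

Sweeping l2_penalty over values like 0, 0.00001, 0.001, 1, and 10 and recording the resulting w is one way to reproduce shrinking coefficient ranges (and a coefficient path plot) like those shown above.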