
Linear classifiers: Overfitting and regularization
CSE 446: Machine Learning
Emily Fox, University of Washington
January 25, 2017


Logistic regression recap


Thus far, we focused on decision boundaries

Score(xi) = w0 h0(xi) + w1 h1(xi) + … + wD hD(xi) = wTh(xi)

Relate Score(xi) to P̂(y=+1|xi,ŵ)?

[Figure: decision boundary 1.0 #awesome – 1.5 #awful = 0 in the (#awesome, #awful) plane; Score(x) > 0 on one side, Score(x) < 0 on the other.]


Compare and contrast regression models

• Linear regression with Gaussian errors:
  yi = wTh(xi) + εi,  εi ~ N(0, σ²)  è  p(y|x,w) = N(y; wTh(x), σ²), mean µ = wTh(x)

• Logistic regression:
  P(y=+1|x,w) = 1 / (1 + e^(-wTh(x)))
  P(y=-1|x,w) = e^(-wTh(x)) / (1 + e^(-wTh(x)))


Understanding the logistic regression model

P(y=+1|xi,w) = sigmoid(Score(xi)) = 1 / (1 + e^(-wTh(xi)))

[Figure: sigmoid curve 1/(1 + e^(-wTh(x))) plotted against wTh(x), mapping Score(xi) to P(y=+1|xi,w).]
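To make the mapping concrete, here is a minimal sketch in Python/NumPy (the feature vector and function name are illustrative; the coefficients match the 1.0 #awesome – 1.5 #awful boundary from the recap):

```python
import numpy as np

def predict_probability(w, h_x):
    """P(y=+1 | x, w) = sigmoid(Score(x)) = sigmoid(wTh(x))."""
    score = np.dot(w, h_x)
    return 1.0 / (1.0 + np.exp(-score))

# Hypothetical input with h(x) = [1, #awesome, #awful] and w = [0.0, 1.0, -1.5]
print(predict_probability(np.array([0.0, 1.0, -1.5]),
                          np.array([1.0, 3.0, 1.0])))  # Score = 1.5, probability ≈ 0.82
```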

Predict most likely class: P̂(y|x) = estimate of class probabilities

Input: x (sentence from review)
If P̂(y=+1|x) > 0.5:
  ŷ = +1
Else:
  ŷ = -1

Estimating P̂(y|x) improves interpretability:
- Predict ŷ = +1 and tell me how sure you are


Gradient ascent algorithm, cont’d


Step 5: Gradient over all data points

Sum over data points; feature value hj(xi); difference between truth (1[yi = +1]) and prediction (P(y=+1|xi,w)):

ℓ(w) = Σ_{i=1}^N [ (1[yi = +1] - 1) wTh(xi) - ln(1 + e^(-wTh(xi))) ]

∂ℓ(w)/∂wj = Σ_{i=1}^N hj(xi) ( 1[yi = +1] - P(y=+1 | xi, w) )

Contribution of data point i to the j-th partial:

            P(y=+1|xi,w) ≈ 1            P(y=+1|xi,w) ≈ 0
yi = +1     ≈ 0 (no change needed)      ≈ hj(xi) (push score up)
yi = -1     ≈ -hj(xi) (push score down) ≈ 0 (no change needed)


Summary of gradient ascent for logistic regression

init w(1)=0 (or randomly, or smartly), t=1

while ||∇ℓ(w(t))|| > ε:
  for j=0,…,D:
    partial[j] = Σ_{i=1}^N hj(xi) ( 1[yi = +1] - P(y=+1 | xi, w(t)) )
    wj(t+1) ß wj(t) + η partial[j]
  t ß t + 1
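A minimal runnable sketch of this loop in Python/NumPy (function and variable names such as `gradient_ascent`, `H`, and `step_size` are illustrative, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_ascent(H, y, step_size=0.1, tolerance=1e-4, max_iter=1000):
    """Batch gradient ascent for logistic regression.

    H: (N, D+1) feature matrix with rows h(x_i) (first column all 1s for w0).
    y: (N,) labels in {+1, -1}.
    """
    w = np.zeros(H.shape[1])                 # init w = 0
    indicator = (y == +1).astype(float)      # 1[y_i = +1]
    for _ in range(max_iter):
        prob = sigmoid(H @ w)                # P(y=+1 | x_i, w) for every i
        gradient = H.T @ (indicator - prob)  # partial[j] for every j at once
        if np.linalg.norm(gradient) < tolerance:
            break
        w = w + step_size * gradient
    return w
```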


Overfitting in classification


Learned decision boundary

Feature   Value   Coefficient learned

h0(x) 1 0.23

h1(x) x[1] 1.12

h2(x) x[2] -1.07


Quadratic features (in 2d)
Note: we are not including cross terms for simplicity

Feature   Value   Coefficient learned

h0(x) 1 1.68

h1(x) x[1] 1.39

h2(x)   x[2]       -0.59
h3(x)   (x[1])^2   -0.17
h4(x)   (x[2])^2   -0.96


Degree 6 features (in 2d)
Note: we are not including cross terms for simplicity

Coefficient values getting large!

Feature   Value      Coefficient learned
h0(x)     1          21.6
h1(x)     x[1]       5.3
h2(x)     x[2]       -42.7
h3(x)     (x[1])^2   -15.9
h4(x)     (x[2])^2   -48.6
h5(x)     (x[1])^3   -11.0
h6(x)     (x[2])^3   67.0
h7(x)     (x[1])^4   1.5
h8(x)     (x[2])^4   48.0
h9(x)     (x[1])^5   4.4
h10(x)    (x[2])^5   -14.2
h11(x)    (x[1])^6   0.8
h12(x)    (x[2])^6   -8.6


Degree 20 features (in 2d)
Note: we are not including cross terms for simplicity

Feature   Value       Coefficient learned
h0(x)     1           8.7
h1(x)     x[1]        5.1
h2(x)     x[2]        78.7
…         …           …
h11(x)    (x[1])^6    -7.5
h12(x)    (x[2])^6    3803
h13(x)    (x[1])^7    21.1
h14(x)    (x[2])^7    -2406
…         …           …
h37(x)    (x[1])^19   -2*10^-6
h38(x)    (x[2])^19   -0.15
h39(x)    (x[1])^20   -2*10^-8
h40(x)    (x[2])^20   0.03

Often, overfitting is associated with very large estimated coefficients ŵ.


Overfitting in classification

Overfitting if there exists w* such that:
• training_error(w*) > training_error(ŵ)
• true_error(w*) < true_error(ŵ)

[Figure: classification error vs. model complexity.]


Overfitting in classifiers è Overconfident predictions


Logistic regression model

[Figure: wTh(xi) ranges over (-∞, +∞); the sigmoid maps it to P(y=+1|xi,w) in [0, 1], with value 0.5 at wTh(xi) = 0.]

P(y=+1|xi,w) = sigmoid(wTh(xi))


The subtle (negative) consequence of overfitting in logistic regression

Overfitting è Large coefficient values

ŵTh(xi) is very positive (or very negative) è sigmoid(ŵTh(xi)) goes to 1 (or to 0)

Model becomes extremely overconfident of predictions


Effect of coefficients on logistic regression model

Input x: #awesome=2, #awful=1

Three coefficient settings (all with w0 = 0):
• w#awesome = +1, w#awful = -1
• w#awesome = +2, w#awful = -2
• w#awesome = +6, w#awful = -6

[Figure: three plots of 1/(1 + e^(-wTh(x))) against (#awesome - #awful); larger coefficient magnitudes make the sigmoid steeper.]
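A quick numerical check of this effect (a sketch; the three coefficient settings and the input #awesome=2, #awful=1 come from the slide, the code itself is illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n_awesome, n_awful = 2, 1                                  # the input x from the slide
for w_awesome, w_awful in [(1, -1), (2, -2), (6, -6)]:     # w0 = 0 in all three settings
    score = w_awesome * n_awesome + w_awful * n_awful
    print(w_awesome, w_awful, round(sigmoid(score), 3))    # ≈ 0.731, 0.881, 0.998
```

Same decision boundary each time, but the predicted probability for the same input becomes far more extreme as the coefficients grow.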


Learned probabilities

Feature   Value   Coefficient learned

h0(x) 1 0.23

h1(x) x[1] 1.12

h2(x)   x[2]   -1.07

P(y=+1|x,w) = 1 / (1 + e^(-wTh(x)))


Quadratic features: Learned probabilities

Feature   Value   Coefficient learned

h0(x) 1 1.68

h1(x) x[1] 1.39

h2(x)   x[2]       -0.58
h3(x)   (x[1])^2   -0.17
h4(x)   (x[2])^2   -0.96

P(y=+1|x,w) = 1 / (1 + e^(-wTh(x)))


Overfitting è Overconfident predictions

[Figure: learned probabilities for the degree 6 and degree 20 models.]

Tiny uncertainty regions è Overfitting & overconfident about it!!!

We are sure we are right, when we are surely wrong!


Overfitting in logistic regression: Another perspective


Linearly-separable data

Data are linearly separable if:
• There exist coefficients ŵ such that:
  - For all positive training data: Score(xi) = ŵTh(xi) > 0
  - For all negative training data: Score(xi) = ŵTh(xi) < 0

Note 1: If you are using D features, this separation happens in a D-dimensional space

Note 2: If you have enough features, data are (almost) always linearly separable, in which case training_error(ŵ) = 0


Effect of linear separability on coefficients

Data are linearly separable with ŵ1 = 1.0 and ŵ2 = -1.5
Data are also linearly separable with ŵ1 = 10 and ŵ2 = -15
Data are also linearly separable with ŵ1 = 10^9 and ŵ2 = -1.5x10^9

[Figure: decision boundary 1.0 #awesome – 1.5 #awful = 0 in the (x[1]=#awesome, x[2]=#awful) plane; Score(x) > 0 on one side, Score(x) < 0 on the other.]


Effect of linear separability on weights

Maximum likelihood estimation (MLE) prefers the most certain model è Coefficients go to infinity for linearly-separable data!!!

As the coefficients are scaled up, P(y=+1|xi, ŵ) is pushed toward 0 or 1:
• Data are linearly separable with ŵ1 = 1.0 and ŵ2 = -1.5
• Data are also linearly separable with ŵ1 = 10 and ŵ2 = -15
• Data are also linearly separable with ŵ1 = 10^9 and ŵ2 = -1.5x10^9

[Figure: the same decision boundary in the (x[1]=#awesome, x[2]=#awful) plane as on the previous slide.]


Overfitting in logistic regression is “twice as bad”

• Learning tries to find a decision boundary that separates the data è overly complex boundary
• If data are linearly separable è coefficients go to infinity! (e.g., ŵ1 = 10^9, ŵ2 = -1.5x10^9)


Penalizing large coefficients to mitigate overfitting


Desired total cost format

Want to balance: i. How well function fits data ii. Magnitude of coefficients

Total quality = measure of fit - measure of magnitude of coefficients

measure of fit (data likelihood): large # = good fit to training data
measure of magnitude of coefficients: large # = overfit

Consider resulting objective

Select ŵ to maximize: ℓ(w) - λ ||w||₂²
λ: tuning parameter = balance of fit and magnitude

This objective is L2 regularized logistic regression.

Pick λ using (as in ridge/lasso regression):
• Validation set (for large datasets)
• Cross-validation (for smaller datasets)


Degree 20 features, effect of regularization penalty λ

Regularization:          λ = 0           λ = 0.00001      λ = 0.001       λ = 1           λ = 10
Range of coefficients:   -3170 to 3803   -8.04 to 12.14   -0.70 to 1.25   -0.13 to 0.57   -0.05 to 0.22
Decision boundary:       [figures]


Coefficient path

[Figure: coefficient path; each ŵj is plotted as a function of λ, and the coefficients shrink toward 0 as λ increases.]


Degree 20 features: regularization reduces “overconfidence”

Regularization:          λ = 0           λ = 0.00001      λ = 0.001       λ = 1
Range of coefficients:   -3170 to 3803   -8.04 to 12.14   -0.70 to 1.25   -0.13 to 0.57
Learned probabilities:   [figures]


Finding best L2 regularized linear classifier with gradient ascent


Understanding contribution of L2 regularization

Total derivative: ∂ℓ(w)/∂wj - 2λwj   (the -2λwj term comes from the L2 penalty: ∂/∂wj [λ||w||₂²] = 2λwj)

Impact of the -2λwj term on wj:
• wj > 0: term is negative è pushes wj down toward 0
• wj < 0: term is positive è pushes wj up toward 0


Summary of gradient ascent for logistic regression + L2 regularization

init w(1)=0 (or randomly, or smartly), t=1
while not converged:
  for j=0,…,D:
    partial[j] = Σ_{i=1}^N hj(xi) ( 1[yi = +1] - P(y=+1 | xi, w(t)) )
    wj(t+1) ß wj(t) + η (partial[j] - 2λ wj(t))
  t ß t + 1
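A minimal sketch of this regularized loop in Python/NumPy (names like `l2_penalty` and `H` are illustrative; following the slide's pseudocode, the update below penalizes all coefficients, including w0):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def l2_regularized_gradient_ascent(H, y, l2_penalty, step_size=0.1, max_iter=1000):
    """Gradient ascent for L2 regularized logistic regression: maximize l(w) - lambda*||w||^2.

    H: (N, D+1) feature matrix with rows h(x_i); y: (N,) labels in {+1, -1}.
    """
    w = np.zeros(H.shape[1])
    indicator = (y == +1).astype(float)
    for _ in range(max_iter):
        prob = sigmoid(H @ w)
        partial = H.T @ (indicator - prob)                    # likelihood part of the gradient
        w = w + step_size * (partial - 2.0 * l2_penalty * w)  # L2 term contributes -2*lambda*w_j
    return w
```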


Summary of overfitting in logistic regression


What you can do now…

• Identify when overfitting is happening
• Relate large learned coefficients to overfitting
• Describe the impact of overfitting on decision boundaries and predicted probabilities of linear classifiers
• Motivate the form of the L2 regularized logistic regression quality metric
• Describe what happens to estimated coefficients as the tuning parameter λ is varied
• Interpret coefficient path plots
• Estimate L2 regularized logistic regression coefficients using gradient ascent
• Describe the use of L1 regularization to obtain sparse logistic regression solutions


Linear classifiers: Scaling up learning via SGD
CSE 446: Machine Learning
Emily Fox, University of Washington
January 25, 2017


Stochastic gradient: Learning, one data point at a time


Why gradient ascent is slow…

[Diagram: w(t) → update coefficients → w(t+1) → update coefficients → w(t+2); each update computes the gradient ∇ℓ(w(t)) with a pass over the data.]

Every update requires a full pass over the data.


More formally: How expensive is gradient ascent?

Sum over data points

∂ℓ(w)/∂wj = Σ_{i=1}^N hj(xi) ( 1[yi = +1] - P(y=+1 | xi, w) )

Contribution of data point xi,yi to gradient

∂ℓi(w)/∂wj


Every step requires touching every data point!!!

Sum over data points

∂ℓ(w)/∂wj = Σ_{i=1}^N ∂ℓi(w)/∂wj = Σ_{i=1}^N hj(xi) ( 1[yi = +1] - P(y=+1 | xi, w) )

Time to compute contribution of xi,yi   # of data points (N)   Total time to compute 1 step of gradient ascent
1 millisecond                           1000                   1 second
1 second                                1000                   16.7 minutes
1 millisecond                           10 million             2.8 hours
1 millisecond                           10 billion             115.7 days


Stochastic gradient ascent

[Diagram: w(t) → update → w(t+1) → update → w(t+2) → update → w(t+3) → update → w(t+4); each update computes the gradient using only a small subset of the data.]

Many updates for each pass over the data.


Example: Instead of all data points for gradient, use 1 data point only???

Gradient ascent (sum over data points):
∂ℓ(w)/∂wj = Σ_{i=1}^N hj(xi) ( 1[yi = +1] - P(y=+1 | xi, w) )

Stochastic gradient ascent (each time, pick a different data point i):
∂ℓ(w)/∂wj ≈ ∂ℓi(w)/∂wj = hj(xi) ( 1[yi = +1] - P(y=+1 | xi, w) )


Stochastic gradient ascent for logistic regression

init w(1)=0, t=1
until converged:
  for i=1,…,N  (each time, pick a different data point i)
    for j=0,…,D:
      partial[j] = hj(xi) ( 1[yi = +1] - P(y=+1 | xi, w(t)) )
      wj(t+1) ß wj(t) + η partial[j]
    t ß t + 1
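A minimal runnable sketch of this loop (Python/NumPy; `stochastic_gradient_ascent`, `n_passes`, and the shuffling scheme are illustrative choices, not from the slides):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def stochastic_gradient_ascent(H, y, step_size=0.01, n_passes=10, seed=0):
    """Stochastic gradient ascent for logistic regression: one data point per update.

    H: (N, D+1) feature matrix with rows h(x_i); y: (N,) labels in {+1, -1}.
    """
    rng = np.random.default_rng(seed)
    N = H.shape[0]
    w = np.zeros(H.shape[1])
    indicator = (y == +1).astype(float)
    for _ in range(n_passes):
        for i in rng.permutation(N):           # visit points in a shuffled order
            prob_i = sigmoid(H[i] @ w)         # P(y=+1 | x_i, w)
            w = w + step_size * H[i] * (indicator[i] - prob_i)
    return w
```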


Stochastic gradient for L2-regularized objective

Total derivative: ∂ℓ(w)/∂wj - 2λwj = Σ_{i=1}^N hj(xi) ( 1[yi = +1] - P(y=+1 | xi, w) ) - 2λwj

What about the regularization term? Each time, pick a different data point i:

Stochastic gradient ascent: total derivative ≈ ∂ℓi(w)/∂wj - 2λwj / N

Each data point contributes 1/N of the regularization.
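A sketch of the corresponding single-point update (illustrative names; each update carries a 1/N share of the -2λw term, per the slide):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_l2_logistic(H, y, l2_penalty, step_size=0.01, n_passes=10, seed=0):
    """Stochastic gradient ascent for L2 regularized logistic regression."""
    rng = np.random.default_rng(seed)
    N = H.shape[0]
    w = np.zeros(H.shape[1])
    indicator = (y == +1).astype(float)
    for _ in range(n_passes):
        for i in rng.permutation(N):
            grad_i = H[i] * (indicator[i] - sigmoid(H[i] @ w))       # likelihood contribution of x_i
            w = w + step_size * (grad_i - 2.0 * l2_penalty * w / N)  # 1/N of the -2*lambda*w term
    return w
```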


Comparing computational time per step

Gradient ascent:
∂ℓ(w)/∂wj = Σ_{i=1}^N hj(xi) ( 1[yi = +1] - P(y=+1 | xi, w) )

Stochastic gradient ascent:
∂ℓ(w)/∂wj ≈ ∂ℓi(w)/∂wj = hj(xi) ( 1[yi = +1] - P(y=+1 | xi, w) )

Time to compute contribution of xi,yi   # of data points (N)   Total time for 1 step of gradient   Total time for 1 step of stochastic gradient
1 millisecond                           1000                   1 second                            1 millisecond
1 second                                1000                   16.7 minutes                        1 second
1 millisecond                           10 million             2.8 hours                           1 millisecond
1 millisecond                           10 billion             115.7 days                          1 millisecond


Which one is better??? Depends…

Algorithm             Time per iteration     Total time to convergence for large data (in theory / in practice)   Sensitivity to parameters
Gradient              Slow for large data    Slower / often slower                                                 Moderate
Stochastic gradient   Always fast            Faster / often faster                                                 Very high


Summary of stochastic gradient

Tiny change to gradient ascent

Much better scalability

Huge impact in real-world

Very tricky to get right in practice


Why would stochastic gradient ever work???


Gradient is direction of steepest ascent

Gradient is “best” direction, but any direction that goes “up” would be useful


In ML, steepest direction is sum of “little directions” from each data point

Sum over data points:
∂ℓ(w)/∂wj = Σ_{i=1}^N hj(xi) ( 1[yi = +1] - P(y=+1 | xi, w) )

For most data points, the contribution points "up"

Stochastic gradient: Pick a data point and move in direction

∂ℓ(w)/∂wj ≈ ∂ℓi(w)/∂wj = hj(xi) ( 1[yi = +1] - P(y=+1 | xi, w) )

Most of the time, total likelihood will increase


Stochastic gradient ascent: Most iterations increase likelihood, but sometimes decrease it è On average, make progress

until converged:
  for i=1,…,N:
    for j=0,…,D:
      wj(t+1) ß wj(t) + η ∂ℓi(w)/∂wj
    t ß t + 1


Convergence path


Convergence paths

[Figure: convergence paths in coefficient space; gradient follows a smooth path, stochastic gradient a noisy one.]


Stochastic gradient convergence is “noisy”

• Stochastic gradient makes “noisy” progress: it achieves higher likelihood sooner, but it’s noisier
• Gradient usually increases likelihood smoothly

[Figure: avg. log likelihood (higher is better) vs. time; total time is proportional to # passes over data.]


Eventually, gradient catches up

Note: should only trust “average” quality of stochastic gradient

[Figure: avg. log likelihood (higher is better) vs. time for stochastic gradient and gradient; eventually gradient catches up.]


The last coefficients may be really good or really bad!! (e.g., w(1005) was good, but w(1000) was bad)

• Stochastic gradient will eventually oscillate around a solution
• How do we minimize the risk of picking bad coefficients?
  - Minimize noise: don’t return the last learned coefficients
  - Output the average: ŵ = (1/T) Σ_{t=1}^T w(t)
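A sketch of SGD that returns the averaged iterates instead of the last one (names are illustrative; a plain running average over all updates is one of several reasonable variants):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sgd_with_averaging(H, y, step_size=0.01, n_passes=10, seed=0):
    """SGD for logistic regression; returns hat{w} = (1/T) sum_t w(t) rather than the noisy last w."""
    rng = np.random.default_rng(seed)
    N = H.shape[0]
    w = np.zeros(H.shape[1])
    w_sum = np.zeros_like(w)
    indicator = (y == +1).astype(float)
    n_updates = 0
    for _ in range(n_passes):
        for i in rng.permutation(N):
            w = w + step_size * H[i] * (indicator[i] - sigmoid(H[i] @ w))
            w_sum += w                       # accumulate every iterate w(t)
            n_updates += 1
    return w_sum / n_updates
```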


Summary of why stochastic gradient works

Gradient finds direction of steepest ascent

Gradient is sum of contributions from each data point

Stochastic gradient uses direction from 1 data point

On average increases likelihood, sometimes decreases

Stochastic gradient has “noisy” convergence


Online learning: Fitting models from streaming data


Batch vs online learning

Batch learning:
• All data is available at start of training time
[Diagram: Data → ML algorithm → ŵ]

Online learning:
• Data arrives (streams in) over time; must train model as data arrives!
[Diagram: at each time t=1, 2, 3, 4, … new data flows into the ML algorithm, which outputs ŵ(1), ŵ(2), ŵ(3), ŵ(4), …]


Online learning example: Ad targeting

[Diagram: a website shows suggested ads ŷ = (Ad1, Ad2, Ad3). Input xt: user info, page text. The ML algorithm with coefficients ŵ(t) makes the suggestion; the user clicks on Ad2, so yt = Ad2, and the coefficients are updated to ŵ(t+1).]


Online learning problem
• Data arrives over time; at each time step t:

- Observe input xt • Info of user, text of webpage

- Make a prediction ŷt • Which ad to show

- Observe true output yt • Which ad user clicked on

Need ML algorithm to update coefficients each time step!


Stochastic gradient ascent can be used for online learning!!!
• init w(1)=0, t=1
• Each time step t:

- Observe input xt

- Make a prediction ŷt

- Observe true output yt - Update coefficients: for j=0,…,D

wj(t+1) ß wj(t) + η ∂ℓt(w)/∂wj
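A minimal sketch of this online loop (Python/NumPy; the `stream` of (h(xt), yt) pairs and all names are illustrative assumptions about how data arrives):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def online_logistic_regression(stream, n_features, step_size=0.01):
    """Online learning via stochastic gradient ascent.

    `stream` yields (h_t, y_t) pairs over time: h_t is the feature vector h(x_t),
    y_t in {+1, -1} is the true output observed after the prediction is made.
    """
    w = np.zeros(n_features)
    predictions = []
    for h_t, y_t in stream:
        predictions.append(+1 if sigmoid(h_t @ w) > 0.5 else -1)         # predict before seeing y_t
        w = w + step_size * h_t * (float(y_t == +1) - sigmoid(h_t @ w))  # then update coefficients
    return w, predictions
```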


Summary of online learning

Data arrives over time

Must make a prediction every time new data point arrives

Observe true class after prediction made

Want to update parameters immediately


Summary of stochastic gradient descent


What you can do now…

• Significantly speed up the learning algorithm using stochastic gradient
• Describe intuition behind why stochastic gradient works
• Apply stochastic gradient in practice
• Describe online learning problems
• Relate stochastic gradient to online learning


Generalized linear models

OPTIONAL


Models for data

• Linear regression with Gaussian errors:
  yi = wTh(xi) + εi,  εi ~ N(0, σ²)  è  p(y|x,w) = N(y; wTh(x), σ²)
  [Figure: Gaussian density over y with mean µ = wTh(x) and spread σ.]

• Logistic regression

  P(y=+1|x,w) = 1 / (1 + e^(-wTh(x)))
  P(y=-1|x,w) = e^(-wTh(x)) / (1 + e^(-wTh(x)))
  [Figure: probability mass on y ∈ {-1, +1}.]

Examples of “generalized linear models”

A generalized linear model has E[y | x,w ] = g(wTh(x))

                      Mean E[y|x,w]     Link function g
Linear regression     wTh(x)            identity function
Logistic regression   P(y=+1|x,w)       sigmoid function

(For logistic regression this assumes y in {0,1} instead of {-1,+1} or some other set of values; the mean is different if y is not in {0,1}, and the link needs slight modification if y is not in {0,1}.)


Stochastic gradient descent more formally

OPTIONAL


Learning Problems as Expectations

• Minimizing loss in training data:
  - Given a dataset sampled i.i.d. from some distribution p(x) on features
  - Given a loss ℓ(w, x), e.g., hinge loss, logistic loss, …
  - We often minimize the loss on training data:
    ℓ_D(w) = (1/N) Σ_{j=1}^N ℓ(w, xj)

• However, we should really minimize expected loss on all data:

ℓ(w) = E_x[ℓ(w, x)] = ∫ p(x) ℓ(w, x) dx

• So, we are approximating the integral by the average on the training data


Gradient Ascent in Terms of Expectations

• “True” objective function: ℓ(w) = E_x[ℓ(w, x)] = ∫ p(x) ℓ(w, x) dx
• Taking the gradient: ∇ℓ(w) = E_x[∇ℓ(w, x)]
• “True” gradient ascent rule: w(t+1) ß w(t) + η E_x[∇ℓ(w(t), x)]

• How do we estimate expected gradient?


SGD: Stochastic Gradient Ascent (or Descent)

• “True” gradient: ∇ℓ(w) = E_x[∇ℓ(w, x)]
• Sample-based approximation: ∇ℓ(w) ≈ (1/N) Σ_{j=1}^N ∇ℓ(w, xj)
• What if we estimate the gradient with just one sample??? (see the check below)
  - Unbiased estimate of the gradient
  - Very noisy!
  - Called stochastic gradient ascent (or descent), among many other names
  - VERY useful in practice!!!
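A quick numerical sanity check of the unbiasedness claim (a sketch with synthetic data; every name here is illustrative): averaging the per-point gradient contributions recovers the full gradient, so a uniformly sampled single point gives an unbiased, if noisy, estimate.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 1000, 3
H = np.hstack([np.ones((N, 1)), rng.normal(size=(N, D))])  # rows h(x_i), with intercept
y = np.where(rng.random(N) < 0.5, 1.0, -1.0)               # synthetic labels in {+1, -1}
w = rng.normal(size=D + 1)

indicator = (y == 1).astype(float)
prob = 1.0 / (1.0 + np.exp(-(H @ w)))                      # P(y=+1 | x_i, w)

full_gradient = H.T @ (indicator - prob)                   # sum of per-point contributions
per_point = H * (indicator - prob)[:, None]                # row i: contribution of (x_i, y_i)
print(np.allclose(N * per_point.mean(axis=0), full_gradient))  # True: E[N * single-sample] = full
```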
