
Linear classifiers: Overfitting and regularization
CSE 446: Machine Learning
Emily Fox, University of Washington
January 25, 2017
©2017 Emily Fox

Logistic regression recap

Thus far, we focused on decision boundaries
Score(x_i) = w_0 h_0(x_i) + w_1 h_1(x_i) + … + w_D h_D(x_i) = w^T h(x_i)
[Figure: decision boundary in the (x[1] = #awesome, x[2] = #awful) plane, with Score(x) > 0 on one side and Score(x) < 0 on the other]
Question: relate Score(x_i) to P(y=+1|x, ŵ)?

Compare and contrast regression models
• Linear regression with Gaussian errors
  y_i = w^T h(x_i) + ε_i,   ε_i ~ N(0, σ^2)
  p(y|x,w) = N(y; w^T h(x), σ^2)
  [Figure: Gaussian density over y, centered at w^T h(x) with spread 2σ]
• Logistic regression
  P(y=+1|x,w) = 1 / (1 + e^(-w^T h(x)))
  P(y=-1|x,w) = e^(-w^T h(x)) / (1 + e^(-w^T h(x)))
  [Figure: probability mass on the two outcomes y = -1 and y = +1]

Understanding the logistic regression model
P(y=+1|x_i,w) = sigmoid(Score(x_i)) = 1 / (1 + e^(-w^T h(x_i)))
[Figure: sigmoid curve of P(y=+1|x_i,w) as a function of Score(x_i)]

Predict most likely class
P̂(y|x) = estimate of class probabilities
Input: x (e.g., sentence from a review)
If P̂(y=+1|x) > 0.5:  ŷ = +1
Else:                 ŷ = -1
Estimating P̂(y|x) improves interpretability:
- Predict ŷ = +1 and tell me how sure you are

Gradient descent algorithm, cont'd

Step 5: Gradient over all data points
∂ℓ(w)/∂w_j = Σ_i h_j(x_i) (1[y_i = +1] - P(y=+1|x_i,w))
- Σ_i: sum over data points
- h_j(x_i): feature value
- (1[y_i = +1] - P(y=+1|x_i,w)): difference between truth and prediction,
  equal to 1 - P(y=+1|x_i,w) when y_i = +1 and to -P(y=+1|x_i,w) when y_i = -1

Summary of gradient ascent for logistic regression
init w(1) = 0 (or randomly, or smartly), t = 1
while ||∇ℓ(w(t))|| > ε:
  for j = 0, …, D:
    partial[j] = Σ_i h_j(x_i) (1[y_i = +1] - P(y=+1|x_i, w(t)))
    w_j(t+1) ← w_j(t) + η partial[j]
  t ← t + 1
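To make the pseudocode above concrete, here is a minimal NumPy sketch of the same gradient-ascent loop. It is an illustration rather than the course's reference implementation: the function and argument names (logistic_gradient_ascent, the feature matrix H, step_size, tol, max_iter) and their default values are made up for this example.

```python
import numpy as np

def sigmoid(score):
    """P(y = +1 | x, w) = 1 / (1 + exp(-w^T h(x)))."""
    return 1.0 / (1.0 + np.exp(-score))

def logistic_gradient_ascent(H, y, step_size=1e-4, tol=1e-3, max_iter=10000):
    """Maximize the log-likelihood l(w) by gradient ascent.

    H : (N, D+1) array whose rows are h(x_i), with h_0(x_i) = 1 in the first column.
    y : (N,) array of labels in {+1, -1}.
    """
    w = np.zeros(H.shape[1])                 # init w(1) = 0
    indicator = (y == +1).astype(float)      # 1[y_i = +1]
    for _ in range(max_iter):
        prob = sigmoid(H @ w)                # P(y = +1 | x_i, w) for every data point
        gradient = H.T @ (indicator - prob)  # partial[j] = sum_i h_j(x_i) (1[y_i=+1] - P(y=+1|x_i,w))
        if np.linalg.norm(gradient) < tol:   # stop once ||grad l(w)|| falls below epsilon
            break
        w = w + step_size * gradient         # w_j <- w_j + eta * partial[j]
    return w
```

For the sentiment example, H would have columns [1, #awesome, #awful], so the returned vector holds (w_0, w_#awesome, w_#awful).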
Overfitting in classification

Learned decision boundary
Feature   Value     Coefficient learned
h0(x)     1          0.23
h1(x)     x[1]       1.12
h2(x)     x[2]      -1.07
[Figure: linear decision boundary on the training data]

Quadratic features (in 2d)   (Note: we are not including cross terms for simplicity)
Feature   Value      Coefficient learned
h0(x)     1           1.68
h1(x)     x[1]        1.39
h2(x)     x[2]       -0.59
h3(x)     (x[1])^2   -0.17
h4(x)     (x[2])^2   -0.96
[Figure: quadratic decision boundary]

Degree 6 features (in 2d)   (Note: we are not including cross terms for simplicity)
Coefficient values getting large
Feature    Value      Coefficient learned
h0(x)      1           21.6
h1(x)      x[1]         5.3
h2(x)      x[2]       -42.7
h3(x)      (x[1])^2   -15.9
h4(x)      (x[2])^2   -48.6
h5(x)      (x[1])^3   -11.0
h6(x)      (x[2])^3    67.0
h7(x)      (x[1])^4     1.5
h8(x)      (x[2])^4    48.0
h9(x)      (x[1])^5     4.4
h10(x)     (x[2])^5   -14.2
h11(x)     (x[1])^6     0.8
h12(x)     (x[2])^6    -8.6
[Figure: wiggly degree-6 decision boundary, Score(x) < 0 region carved around the data]

Degree 20 features (in 2d)   (Note: we are not including cross terms for simplicity)
Often, overfitting is associated with very large estimated coefficients ŵ
Feature    Value       Coefficient learned
h0(x)      1             8.7
h1(x)      x[1]          5.1
h2(x)      x[2]         78.7
…          …             …
h11(x)     (x[1])^6     -7.5
h12(x)     (x[2])^6    3803
h13(x)     (x[1])^7     21.1
h14(x)     (x[2])^7   -2406
…          …             …
h37(x)     (x[1])^19   -2×10^-6
h38(x)     (x[2])^19   -0.15
h39(x)     (x[1])^20   -2×10^-8
h40(x)     (x[2])^20    0.03
[Figure: highly contorted degree-20 decision boundary]

Overfitting in classification
Error = classification error
Overfitting if there exists w* such that:
• training_error(w*) > training_error(ŵ)
• true_error(w*) < true_error(ŵ)
[Figure: classification error vs. model complexity — training error keeps decreasing while true error eventually rises]

Overfitting in classifiers ⇒ Overconfident predictions

Logistic regression model
P(y=+1|x_i,w) = sigmoid(w^T h(x_i))
[Figure: sigmoid mapping w^T h(x_i) from -∞ to +∞ onto probabilities from 0.0 through 0.5 to 1.0]

The subtle (negative) consequence of overfitting in logistic regression
Overfitting ⇒ large coefficient values
⇒ ŵ^T h(x_i) is very positive (or very negative)
⇒ sigmoid(ŵ^T h(x_i)) goes to 1 (or to 0)
⇒ Model becomes extremely overconfident of predictions

Effect of coefficients on logistic regression model
Input x: #awesome = 2, #awful = 1
  w_0 = 0, w_#awesome = +1, w_#awful = -1
  w_0 = 0, w_#awesome = +2, w_#awful = -2
  w_0 = 0, w_#awesome = +6, w_#awful = -6
[Figure: sigmoid of (#awesome - #awful) becomes steeper as the coefficient magnitudes grow]

Learned probabilities
Feature   Value     Coefficient learned
h0(x)     1          0.23
h1(x)     x[1]       1.12
h2(x)     x[2]      -1.07
[Figure: smooth predicted-probability contours for the linear model]

Quadratic features: Learned probabilities
Feature   Value      Coefficient learned
h0(x)     1           1.68
h1(x)     x[1]        1.39
h2(x)     x[2]       -0.58
h3(x)     (x[1])^2   -0.17
h4(x)     (x[2])^2   -0.96
[Figure: predicted-probability contours for the quadratic model]

Overfitting ⇒ Overconfident predictions
[Figures: Degree 6 learned probabilities; Degree 20 learned probabilities]
Tiny uncertainty regions
Overfitting & overconfident about it!!!
We are sure we are right, when we are surely wrong!

Overfitting in logistic regression: Another perspective

Linearly-separable data
Data are linearly separable if there exist coefficients ŵ such that:
- For all positive training data: Score(x_i) > 0
- For all negative training data: Score(x_i) < 0
Note 1: If you are using D features, linear separability happens in a D-dimensional space
Note 2: If you have enough features, data are (almost) always linearly separable ⇒ training_error(ŵ) = 0

Effect of linear separability on coefficients
[Figure: separating line in the (x[1] = #awesome, x[2] = #awful) plane, Score(x) > 0 on one side and Score(x) < 0 on the other]
Data are linearly separable with ŵ_1 = 1.0 and ŵ_2 = -1.5
Data are also linearly separable with ŵ_1 = 10 and ŵ_2 = -15
Data are also linearly separable with ŵ_1 = 10^9 and ŵ_2 = -1.5×10^9

Effect of linear separability on weights
Maximum likelihood estimation (MLE) prefers the most certain model ⇒ coefficients go to infinity for linearly-separable data!!!
P(y=+1|x_i, ŵ) for a training point is pushed toward 0 or 1 as the same separating boundary is rescaled from (ŵ_1 = 1.0, ŵ_2 = -1.5) to (10, -15) to (10^9, -1.5×10^9)

Overfitting in logistic regression is "twice as bad"
• Learning tries to find a decision boundary that separates the data ⇒ overly complex boundary
• If the data are linearly separable ⇒ coefficients go to infinity! (e.g., ŵ_1 = 10^9, ŵ_2 = -1.5×10^9)
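The overconfidence problem is easy to reproduce numerically. The short sketch below is only an illustration: it fixes the input from the earlier slide (#awesome = 2, #awful = 1, no intercept) and rescales the separating coefficients (1.0, -1.5) by 1, 10, and 10^9, mirroring the three settings above, to show the predicted probability saturating at 1.

```python
import numpy as np

def sigmoid(score):
    return 1.0 / (1.0 + np.exp(-score))

h_x = np.array([2.0, 1.0])       # input x: #awesome = 2, #awful = 1 (w_0 = 0, so no intercept feature)
w_base = np.array([1.0, -1.5])   # one coefficient vector that separates the data

for scale in [1.0, 10.0, 1e9]:   # same decision boundary, increasingly large coefficients
    w = scale * w_base
    score = w @ h_x              # w^T h(x) = scale * (2 * 1.0 + 1 * (-1.5)) = 0.5 * scale
    print(f"scale = {scale:g}\tScore(x) = {score:g}\tP(y=+1|x,w) = {sigmoid(score):.6f}")
```

The boundary (Score(x) = 0) never moves, yet the reported probability climbs from about 0.62 to 0.99 to 1.000000, which is exactly the behavior maximum likelihood rewards on separable data.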
Penalizing large coefficients to mitigate overfitting

Desired total cost format
Want to balance:
i.  How well the function fits the data
ii. Magnitude of the coefficients
Total quality = measure of fit - measure of magnitude of coefficients
- measure of fit (data likelihood): large # = good fit to training data
- measure of magnitude of coefficients: large # = overfit

Consider resulting objective
Select ŵ to maximize:  ℓ(w) - λ ||w||_2^2
λ: tuning parameter = balance of fit and magnitude
Pick λ using:
• Validation set (for large datasets)
• Cross-validation (for smaller datasets)
(as in ridge/lasso regression)
This is L2 regularized logistic regression.

Degree 20 features, effect of regularization penalty λ
λ = 0:        coefficients range from -3170 to 3803
λ = 0.00001:  coefficients range from -8.04 to 12.14
λ = 0.001:    coefficients range from -0.70 to 1.25
λ = 1:        coefficients range from -0.13 to 0.57
λ = 10:       coefficients range from -0.05 to 0.22
[Figures: decision boundaries become smoother as λ increases]

Coefficient path
[Figure: each coefficient ŵ_j plotted as a function of λ, shrinking toward 0 as λ grows]

Degree 20 features: regularization reduces "overconfidence"
λ = 0:        coefficients range from -3170 to 3803
λ = 0.00001:  coefficients range from -8.04 to 12.14
λ = 0.001:    coefficients range from -0.70 to 1.25
λ = 1:        coefficients range from -0.13 to 0.57
[Figures: learned probabilities — uncertainty regions widen as λ increases]

Finding best L2 regularized linear classifier with gradient ascent

Understanding contribution of L2 regularization
Term from L2 penalty: ∂/∂w_j (-λ ||w||_2^2) = -2λ w_j
Impact on w_j:
- if w_j > 0, the term -2λ w_j < 0, so the update decreases w_j (shrinks it toward 0)
- if w_j < 0, the term -2λ w_j > 0, so the update increases w_j (shrinks it toward 0)

Summary of gradient ascent for logistic regression + L2 regularization
init w(1) = 0 (or randomly, or smartly), t = 1
while not converged:
  for j = 0, …, D:
    partial[j] = Σ_i h_j(x_i) (1[y_i = +1] - P(y=+1|x_i, w(t)))
    w_j(t+1) ← w_j(t) + η (partial[j] - 2λ w_j(t))
  t ← t + 1

Summary of overfitting in logistic regression

What you can do now…
• Identify when overfitting is happening
• Relate large learned coefficients to overfitting
• Describe the impact of overfitting on decision boundaries and predicted probabilities of linear classifiers
• Motivate the form of the L2 regularized logistic regression quality metric
• Describe what happens to estimated coefficients as the tuning parameter λ is varied
• Interpret coefficient path plots
• Estimate L2 regularized logistic regression coefficients using gradient ascent
• Describe the use of L1 regularization to obtain sparse logistic regression solutions

Linear classifiers: Scaling up learning via SGD
CSE 446: Machine Learning
Emily Fox, University of Washington
January 25, 2017
©2017 Emily Fox

Stochastic gradient descent: Learning, one data point at a time

Why gradient ascent is slow…
[Figure: w(t) → update coefficients → w(t+1) → update coefficients → w(t+2)]
Every update ∇ℓ(w(t)), ∇ℓ(w(t+1)), … requires a full pass over the data.
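To tie the two pieces of this lecture together, here is a hedged sketch of the L2-regularized gradient ascent update, again with made-up names (l2_regularized_logistic_ascent, H, l2_penalty, step_size, max_iter) and a simplified fixed-iteration stopping rule rather than a convergence test. Note that every iteration recomputes predictions for all N data points, which is the full-pass cost that motivates stochastic gradient descent.

```python
import numpy as np

def sigmoid(score):
    return 1.0 / (1.0 + np.exp(-score))

def l2_regularized_logistic_ascent(H, y, l2_penalty, step_size=1e-4, max_iter=10000):
    """Maximize l(w) - lambda * ||w||_2^2 by gradient ascent.

    H : (N, D+1) feature matrix, y : (N,) labels in {+1, -1}, l2_penalty : lambda >= 0.
    """
    w = np.zeros(H.shape[1])
    indicator = (y == +1).astype(float)
    for _ in range(max_iter):
        prob = sigmoid(H @ w)                # full pass: touches all N data points every iteration
        partial = H.T @ (indicator - prob)   # likelihood part: sum_i h_j(x_i)(1[y_i=+1] - P(y=+1|x_i,w))
        partial -= 2.0 * l2_penalty * w      # L2 penalty term -2*lambda*w_j (as on the slide, w_0 is penalized too)
        w = w + step_size * partial          # w_j <- w_j + eta * (partial[j] - 2*lambda*w_j)
    return w
```

Refitting over a grid of l2_penalty values and recording the resulting w would trace out a coefficient path like the plot above, with every coefficient shrinking toward 0 as λ grows.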