
Linear classifiers: Overfitting and regularization
CSE 446: Machine Learning
Emily Fox, University of Washington
January 25, 2017


Logistic regression recap


Thus far, we focused on decision boundaries

Score(xi) = w0 h0(xi) + w1 h1(xi) + … + wD hD(xi) = w^T h(xi)

Relate Score(xi) to P̂(y=+1|x,ŵ)?

[Figure: decision boundary in the (#awesome, #awful) plane, with Score(x) > 0 on one side and Score(x) < 0 on the other]


Compare and contrast regression models

• Linear regression with Gaussian errors:
  yi = w^T h(xi) + εi,   εi ~ N(0, σ²)   ⇒   p(y|x,w) = N(y; w^T h(x), σ²)

• Logistic regression:
  P(y|x,w) = 1 / (1 + e^(−w^T h(x)))                       if y = +1
           = e^(−w^T h(x)) / (1 + e^(−w^T h(x)))           if y = −1


Understanding the logistic regression model

P(y=+1|xi,w) = sigmoid(Score(xi)) = 1 / (1 + e^(−w^T h(xi)))

[Figure: the sigmoid function mapping Score(xi) to P(y=+1|xi,w)]


Predict most likely class, where P̂(y|x) is the estimate of class probabilities.

Input: x (sentence from review)
If P̂(y=+1|x) > 0.5:
    ŷ = +1
Else:
    ŷ = −1

Estimating P̂(y|x) improves interpretability:
- Predict ŷ = +1 and tell me how sure you are
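As a rough illustration (not part of the original slides), here is a minimal Python sketch of this prediction rule; the feature layout [1, #awesome, #awful] and the coefficient values are assumptions chosen to match the running example:

```python
import numpy as np

def sigmoid(score):
    # P(y=+1 | x, w) = 1 / (1 + exp(-score)), where score = w^T h(x)
    return 1.0 / (1.0 + np.exp(-score))

def predict(w, h_x):
    # h_x: feature vector h(x); w: coefficient vector of the same length
    prob_plus = sigmoid(np.dot(w, h_x))    # estimate of P(y=+1 | x, w)
    y_hat = +1 if prob_plus > 0.5 else -1  # predict the most likely class
    return y_hat, prob_plus

# Hypothetical example: w0=0, w_awesome=+1, w_awful=-1; input with 2 "awesome", 1 "awful"
w = np.array([0.0, 1.0, -1.0])
h_x = np.array([1.0, 2.0, 1.0])            # [constant, #awesome, #awful]
print(predict(w, h_x))                      # (1, ~0.73): predict +1 and report how sure we are
```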


Gradient ascent algorithm, cont'd


Step 5: Gradient over all data points

∂ℓ(w)/∂wj = Σ_{i=1}^N hj(xi) ( 1[yi = +1] − P(y=+1|xi,w) )

• Sum over data points; hj(xi) is the feature value; the term in parentheses is the difference between truth and prediction:
  - If yi = +1: difference = 1 − P(y=+1|xi,w)
  - If yi = −1: difference = 0 − P(y=+1|xi,w) = −P(y=+1|xi,w)
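A hedged Python sketch of this gradient; the feature matrix H, the {−1, +1} label encoding, and the function name are my assumptions, not from the slides:

```python
import numpy as np

def log_likelihood_gradient(w, H, y):
    """Gradient of the logistic log-likelihood.

    H: N x (D+1) matrix whose rows are h(x_i); y: length-N array of labels in {-1, +1}.
    Returns a length-(D+1) array whose j-th entry is
        sum_i h_j(x_i) * (1[y_i = +1] - P(y=+1 | x_i, w)).
    """
    prob_plus = 1.0 / (1.0 + np.exp(-H.dot(w)))   # P(y=+1 | x_i, w) for every i
    indicator = (y == +1).astype(float)           # 1[y_i = +1]
    return H.T.dot(indicator - prob_plus)         # sum over data points of feature * difference
```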


Summary of gradient ascent for logistic regression

init w^(1) = 0 (or randomly, or smartly), t = 1
while ||∇ℓ(w^(t))|| > ε:
    for j = 0, …, D:
        partial[j] = Σ_{i=1}^N hj(xi) (1[yi=+1] − P(y=+1|xi, w^(t)))
        wj^(t+1) ← wj^(t) + η partial[j]
    t ← t + 1
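A minimal sketch of the loop above, assuming the same H and y arrays as before; the step size, tolerance, and iteration cap are arbitrary placeholder values:

```python
import numpy as np

def gradient_ascent(H, y, step_size=1e-4, tolerance=1e-3, max_iter=10000):
    """Batch gradient ascent for logistic regression (sketch; defaults are assumptions)."""
    w = np.zeros(H.shape[1])                             # init w^(1) = 0
    for t in range(max_iter):
        prob_plus = 1.0 / (1.0 + np.exp(-H.dot(w)))      # P(y=+1 | x_i, w^(t))
        partial = H.T.dot((y == +1).astype(float) - prob_plus)
        if np.linalg.norm(partial) <= tolerance:         # stop when ||grad l(w^(t))|| <= epsilon
            break
        w = w + step_size * partial                      # w^(t+1) = w^(t) + eta * partial
    return w
```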


Overfitting in classification


Learned decision boundary

Feature | Value | Coefficient learned
h0(x)   | 1     | 0.23
h1(x)   | x[1]  | 1.12
h2(x)   | x[2]  | -1.07


Quadratic features (in 2d). Note: we are not including cross terms for simplicity.

Feature | Value    | Coefficient learned
h0(x)   | 1        | 1.68
h1(x)   | x[1]     | 1.39
h2(x)   | x[2]     | -0.59
h3(x)   | (x[1])²  | -0.17
h4(x)   | (x[2])²  | -0.96


Degree 6 features (in 2d). Note: we are not including cross terms for simplicity.

Feature | Value    | Coefficient learned
h0(x)   | 1        | 21.6
h1(x)   | x[1]     | 5.3
h2(x)   | x[2]     | -42.7
h3(x)   | (x[1])²  | -15.9
h4(x)   | (x[2])²  | -48.6
h5(x)   | (x[1])³  | -11.0
h6(x)   | (x[2])³  | 67.0
h7(x)   | (x[1])⁴  | 1.5
h8(x)   | (x[2])⁴  | 48.0
h9(x)   | (x[1])⁵  | 4.4
h10(x)  | (x[2])⁵  | -14.2
h11(x)  | (x[1])⁶  | 0.8
h12(x)  | (x[2])⁶  | -8.6

Coefficient values are getting large.

[Figure: learned decision boundary, with Score(x) < 0 outside it]


Degree 20 features (in 2d). Note: we are not including cross terms for simplicity.

Feature | Value     | Coefficient learned
h0(x)   | 1         | 8.7
h1(x)   | x[1]      | 5.1
h2(x)   | x[2]      | 78.7
…       | …         | …
h11(x)  | (x[1])⁶   | -7.5
h12(x)  | (x[2])⁶   | 3803
h13(x)  | (x[1])⁷   | 21.1
h14(x)  | (x[2])⁷   | -2406
…       | …         | …
h37(x)  | (x[1])¹⁹  | -2×10⁻⁶
h38(x)  | (x[2])¹⁹  | -0.15
h39(x)  | (x[1])²⁰  | -2×10⁻⁸
h40(x)  | (x[2])²⁰  | 0.03

Often, overfitting is associated with very large estimated coefficients ŵ.


Overfitting in classification

Overfitting if there exists w* such that:
• training_error(w*) > training_error(ŵ)
• true_error(w*) < true_error(ŵ)

[Figure: classification error (y-axis) vs. model complexity (x-axis)]


Overfitting in classifiers ⇒ Overconfident predictions


Logistic regression model

P(y=+1|xi,w) = sigmoid(w^T h(xi))

[Figure: the sigmoid curve; as w^T h(xi) goes from −∞ to +∞, P(y=+1|xi,w) goes from 0.0 through 0.5 to 1.0]


The subtle (negative) consequence of overfitting in logistic regression

Overfitting ⇒ Large coefficient values

ŵ^T h(xi) is very positive (or very negative) ⇒ sigmoid(ŵ^T h(xi)) goes to 1 (or to 0)

Model becomes extremely overconfident of predictions


Effect of coefficients on logistic regression model

Input x: #awesome=2, #awful=1

w0 = 0, w#awesome = +1, w#awful = −1
w0 = 0, w#awesome = +2, w#awful = −2
w0 = 0, w#awesome = +6, w#awful = −6

[Figure: three sigmoid curves plotted against #awesome − #awful, one per coefficient setting; larger coefficient magnitudes give a steeper curve]
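To see the effect numerically, here is a tiny sketch using the slide's input (2 "awesome", 1 "awful"); the helper name is an assumption:

```python
import numpy as np

def prob_positive(w_awesome, w_awful, n_awesome=2, n_awful=1, w0=0.0):
    # P(y=+1 | x, w) = sigmoid(w0 + w_awesome*#awesome + w_awful*#awful)
    score = w0 + w_awesome * n_awesome + w_awful * n_awful
    return 1.0 / (1.0 + np.exp(-score))

for w in (1, 2, 6):                  # same decision boundary, rescaled coefficients
    print(w, prob_positive(+w, -w))  # ~0.73, ~0.88, ~0.998: predictions get more extreme
```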


Learned probabilities

Feature | Value | Coefficient learned
h0(x)   | 1     | 0.23
h1(x)   | x[1]  | 1.12
h2(x)   | x[2]  | -1.07


Quadratic features: Learned probabilities

Feature | Value    | Coefficient learned
h0(x)   | 1        | 1.68
h1(x)   | x[1]     | 1.39
h2(x)   | x[2]     | -0.58
h3(x)   | (x[1])²  | -0.17
h4(x)   | (x[2])²  | -0.96


Overfitting ⇒ Overconfident predictions

[Figures: learned probabilities for degree 6 features and degree 20 features]

Tiny uncertainty regions ⇒ Overfitting & overconfident about it!!!
We are sure we are right, when we are surely wrong!


Overfitting in logistic regression: Another perspective


Linearly-separable data

Data are linearly separable if:
• There exist coefficients ŵ such that:
  - For all positive training data, Score(xi) = ŵ^T h(xi) > 0
  - For all negative training data, Score(xi) = ŵ^T h(xi) < 0

Note 1: If you are using D features, separability is assessed in a D-dimensional space.
Note 2: If you have enough features, data are (almost) always linearly separable, so training_error(ŵ) = 0.


Effect of linear separability on coefficients

Data are linearly separable with ŵ1 = 1.0 and ŵ2 = −1.5
Data also linearly separable with ŵ1 = 10 and ŵ2 = −15
Data also linearly separable with ŵ1 = 10⁹ and ŵ2 = −1.5×10⁹

[Figure: separable data in the (x[1] = #awesome, x[2] = #awful) plane, with Score(x) > 0 on one side of the boundary and Score(x) < 0 on the other]


Effect of linear separability on weights: maximum likelihood estimation (MLE) prefers the most certain model ⇒ coefficients go to infinity for linearly-separable data!!!

P(y=+1|xi, ŵ) for each of the settings above:
Data are linearly separable with ŵ1 = 1.0 and ŵ2 = −1.5
Data also linearly separable with ŵ1 = 10 and ŵ2 = −15
Data also linearly separable with ŵ1 = 10⁹ and ŵ2 = −1.5×10⁹

[Figure: the same (x[1] = #awesome, x[2] = #awful) plot, now showing P(y=+1|xi, ŵ)]


Overfitting in logistic regression is “twice as bad”

If data are linearly separable:
• Learning tries to find a decision boundary that separates the data ⇒ overly complex boundary
• Coefficients go to infinity! (e.g., ŵ1 = 10⁹, ŵ2 = −1.5×10⁹)


Penalizing large coefficients to mitigate overfitting


Desired total cost format

Want to balance: i. How well function fits data ii. Magnitude of coefficients

Total quality = measure of fit − measure of magnitude of coefficients
• Measure of fit (data likelihood): large # = good fit to training data
• Measure of magnitude of coefficients: large # = overfit

Consider resulting objective

Select ŵ to maximize:  ℓ(w) − λ||w||₂²
where λ is a tuning parameter balancing fit and magnitude of coefficients.

This is L2 regularized logistic regression. Pick λ using:
• Validation set (for large datasets)
• Cross-validation (for smaller datasets)
(as in ridge/lasso regression)
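A minimal sketch of this objective, assuming labels in {−1, +1} and the same feature matrix H as earlier; following the slides, all coefficients (including the intercept) are penalized:

```python
import numpy as np

def log_likelihood(w, H, y):
    # sum_i log P(y_i | x_i, w) for labels y_i in {-1, +1}
    scores = H.dot(w)
    return np.sum(-np.log(1.0 + np.exp(-y * scores)))

def regularized_objective(w, H, y, lam):
    # l(w) - lambda * ||w||_2^2, the quantity we select w-hat to maximize
    return log_likelihood(w, H, y) - lam * np.sum(w ** 2)
```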


Degree 20 features, effect of regularization penalty λ

Regularization          | λ = 0         | λ = 0.00001    | λ = 0.001     | λ = 1         | λ = 10
Range of coefficients   | -3170 to 3803 | -8.04 to 12.14 | -0.70 to 1.25 | -0.13 to 0.57 | -0.05 to 0.22
Decision boundary       | [figures: learned decision boundaries for each λ]


Coefficient path

[Figure: coefficient path plot, ŵj coefficients on the y-axis vs. λ on the x-axis]


Degree 20 features: regularization reduces “overconfidence”

Regularization          | λ = 0         | λ = 0.00001    | λ = 0.001     | λ = 1
Range of coefficients   | -3170 to 3803 | -8.04 to 12.14 | -0.70 to 1.25 | -0.13 to 0.57
Learned probabilities   | [figures: learned probability surfaces for each λ]


Finding best L2 regularized linear classifier with gradient ascent


Understanding contribution of L2 regularization

Term from the L2 penalty:  ∂/∂wj [ −λ||w||₂² ] = −2λ wj

Impact on wj:
• If wj > 0, then −2λ wj < 0, so the update pushes wj down, toward 0
• If wj < 0, then −2λ wj > 0, so the update pushes wj up, toward 0


Summary of gradient ascent for logistic regression + L2 regularization

init w^(1) = 0 (or randomly, or smartly), t = 1
while not converged:
    for j = 0, …, D:
        partial[j] = Σ_{i=1}^N hj(xi) (1[yi=+1] − P(y=+1|xi, w^(t)))
        wj^(t+1) ← wj^(t) + η (partial[j] − 2λ wj^(t))
    t ← t + 1
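A sketch of the regularized update, assuming the same conventions as the earlier gradient ascent sketch; the convergence test and default values are assumptions:

```python
import numpy as np

def l2_gradient_ascent(H, y, lam, step_size=1e-4, tolerance=1e-3, max_iter=10000):
    """Gradient ascent for L2 regularized logistic regression (sketch)."""
    w = np.zeros(H.shape[1])
    for t in range(max_iter):
        prob_plus = 1.0 / (1.0 + np.exp(-H.dot(w)))
        partial = H.T.dot((y == +1).astype(float) - prob_plus)  # likelihood part of the gradient
        update = partial - 2.0 * lam * w                        # add the -2*lambda*w_j penalty term
        if np.linalg.norm(update) <= tolerance:                 # a simple stand-in for "not converged"
            break
        w = w + step_size * update
    return w
```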


Summary of overfitting in logistic regression


What you can do now…

• Identify when overfitting is happening
• Relate large learned coefficients to overfitting
• Describe the impact of overfitting on decision boundaries and predicted probabilities of linear classifiers
• Motivate the form of L2 regularized logistic regression quality metric
• Describe what happens to estimated coefficients as tuning parameter λ is varied
• Interpret coefficient path plot
• Estimate L2 regularized logistic regression coefficients using gradient ascent
• Describe the use of L1 regularization to obtain sparse logistic regression solutions


Linear classifiers: Scaling up learning via SGD
CSE 446: Machine Learning
Emily Fox, University of Washington
January 25, 2017


Stochastic gradient: Learning, one data point at a time


Why gradient ascent is slow…

w^(t) → [update coefficients] → w^(t+1) → [update coefficients] → w^(t+2) → …

Every update requires a full pass over the data to compute the gradient ∇ℓ(w^(t)), then ∇ℓ(w^(t+1)), and so on.


More formally: How expensive is gradient ascent?

∂ℓ(w)/∂wj = Σ_{i=1}^N hj(xi) (1[yi=+1] − P(y=+1|xi,w))
• Sum over data points
• Each term is the contribution of data point (xi, yi) to the gradient


Every step requires touching every data point!!!

Each step sums over all data points:

Time to compute contribution of xi,yi | # of data points (N) | Total time to compute 1 step of gradient ascent
1 millisecond | 1,000      | 1 second
1 second      | 1,000      | 16.7 minutes
1 millisecond | 10 million | 2.8 hours
1 millisecond | 10 billion | 115.7 days


Stochastic gradient ascent

w^(t) → [update] → w^(t+1) → [update] → w^(t+2) → [update] → w^(t+3) → [update] → w^(t+4) → …

Many updates for each pass over the data: each gradient is computed using only a small subset of the data.


Example: Instead of all data points for gradient, use 1 data point only???

Gradient ascent (sum over data points):
  ∂ℓ(w)/∂wj = Σ_{i=1}^N hj(xi) (1[yi=+1] − P(y=+1|xi,w))

Stochastic gradient ascent (each time, pick a different data point i):
  use only the single term hj(xi) (1[yi=+1] − P(y=+1|xi,w))


Stochastic gradient ascent for logistic regression

init w^(1) = 0, t = 1
until converged:
    for i = 1, …, N:    (each time, a different data point i; no sum over data points)
        for j = 0, …, D:
            partial[j] = hj(xi) (1[yi=+1] − P(y=+1|xi, w^(t)))
            wj^(t+1) ← wj^(t) + η partial[j]
        t ← t + 1
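A minimal sketch of this loop; the random shuffling on each pass and the fixed number of passes are assumptions standing in for "until converged":

```python
import numpy as np

def stochastic_gradient_ascent(H, y, step_size=1e-4, n_passes=10, seed=0):
    """Stochastic gradient ascent for logistic regression (sketch; defaults are assumptions)."""
    rng = np.random.default_rng(seed)
    N, num_coeffs = H.shape
    w = np.zeros(num_coeffs)                                      # init w^(1) = 0
    for _ in range(n_passes):                                     # fixed passes instead of a convergence test
        for i in rng.permutation(N):                              # each time, pick a different data point i
            prob_plus = 1.0 / (1.0 + np.exp(-H[i].dot(w)))
            partial = H[i] * (float(y[i] == +1) - prob_plus)      # contribution of one data point only
            w = w + step_size * partial
    return w
```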


Stochastic gradient for L2-regularized objective

Total derivative of the L2 regularized objective:
  ∂/∂wj [ ℓ(w) − λ||w||₂² ] = Σ_{i=1}^N hj(xi) (1[yi=+1] − P(y=+1|xi,w)) − 2λ wj

What about the regularization term in stochastic gradient ascent? Each time, pick a different data point i and use
  hj(xi) (1[yi=+1] − P(y=+1|xi,w)) − (2λ/N) wj
so that each data point contributes 1/N of the regularization.
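A sketch of a single stochastic step on the regularized objective, with the 1/N-scaled penalty; the argument names are assumptions:

```python
import numpy as np

def sgd_step_l2(w, h_i, y_i, lam, N, step_size):
    # One stochastic step on the L2 regularized objective:
    # likelihood contribution of point i, plus 1/N of the -2*lambda*w penalty.
    prob_plus = 1.0 / (1.0 + np.exp(-h_i.dot(w)))
    partial = h_i * (float(y_i == +1) - prob_plus)
    return w + step_size * (partial - (2.0 * lam / N) * w)
```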


Comparing computational time per step

Gradient ascent: sum over all data points. Stochastic gradient ascent: one data point at a time.

Time to compute contribution of xi,yi | # of data points (N) | Total time for 1 step of gradient | Total time for 1 step of stochastic gradient
1 millisecond | 1,000      | 1 second     | 1 millisecond
1 second      | 1,000      | 16.7 minutes | 1 second
1 millisecond | 10 million | 2.8 hours    | 1 millisecond
1 millisecond | 10 billion | 115.7 days   | 1 millisecond


Which one is better??? Depends…

Algorithm           | Time per iteration  | Total time to convergence for large data: in theory | in practice  | Sensitivity to parameters
Gradient            | Slow for large data | Slower                                              | Often slower | Moderate
Stochastic gradient | Always fast         | Faster                                              | Often faster | Very high


Summary of stochastic gradient

Tiny change to gradient ascent

Much better scalability

Huge impact in real-world

Very tricky to get right in practice


Why would stochastic gradient ever work???


Gradient is direction of steepest ascent

Gradient is “best” direction, but any direction that goes “up” would be useful


In ML, steepest direction is sum of “little directions” from each data point

Sum over data points: for most data points, the contribution points "up".

Stochastic gradient: Pick a data point and move in the direction of its contribution,
  hj(xi) (1[yi=+1] − P(y=+1|xi,w))

Most of the time, total likelihood will increase


Stochastic gradient ascent: Most iterations increase likelihood, but sometimes decrease it ⇒ On average, make progress

until converged:
    for i = 1, …, N:
        for j = 0, …, D:
            wj^(t+1) ← wj^(t) + η hj(xi) (1[yi=+1] − P(y=+1|xi, w^(t)))
        t ← t + 1


Convergence path


Convergence paths

[Figure: convergence paths of gradient (smooth) vs. stochastic gradient (noisy)]


Stochastic gradient convergence is "noisy":
• Stochastic gradient makes "noisy" progress; it achieves higher likelihood sooner, but it's noisier
• Gradient usually increases likelihood smoothly

[Figure: avg. log likelihood (higher is better) vs. total time, which is proportional to # passes over data]


Eventually, gradient catches up

Note: should only trust “average” quality of stochastic gradient

[Figure: avg. log likelihood over time for stochastic gradient and gradient (higher is better); gradient eventually catches up]


The last coefficients may be really good or really bad!!
• Stochastic gradient will eventually oscillate around a solution; e.g., w^(1000) was bad while w^(1005) was good
• How do we minimize the risk of picking bad coefficients? Minimize noise: don't return the last learned coefficients
• Output the average instead:  ŵ = (1/T) Σ_{t=1}^T w^(t)
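A minimal sketch of returning the averaged iterates instead of the last one, reusing the stochastic update from above (array names and defaults are assumptions):

```python
import numpy as np

def averaged_sgd(H, y, step_size=1e-4, n_passes=10, seed=0):
    """SGD that returns the average of all iterates w^(t) instead of the last one (sketch)."""
    rng = np.random.default_rng(seed)
    N, num_coeffs = H.shape
    w = np.zeros(num_coeffs)
    w_sum = np.zeros(num_coeffs)
    T = 0
    for _ in range(n_passes):
        for i in rng.permutation(N):
            prob_plus = 1.0 / (1.0 + np.exp(-H[i].dot(w)))
            w = w + step_size * H[i] * (float(y[i] == +1) - prob_plus)
            w_sum += w                 # accumulate every iterate
            T += 1
    return w_sum / T                   # w-hat = (1/T) * sum_t w^(t)
```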


Summary of why stochastic gradient works

Gradient finds direction of steepest ascent

Gradient is sum of contributions from each data point

Stochastic gradient uses direction from 1 data point

On average increases likelihood, sometimes decreases

Stochastic gradient has “noisy” convergence


Online learning: Fitting models from streaming data


Batch vs online learning

Batch learning: all data is available at the start of training.
  [Diagram: Data → ML algorithm → ŵ]

Online learning: data arrives (streams in) over time, so the model must be trained as data arrives!
  [Diagram: at time steps t = 1, 2, 3, 4, …, the arriving data is fed to the ML algorithm, which produces ŵ^(1), ŵ^(2), ŵ^(3), ŵ^(4), …]


Online learning example: Ad targeting

[Diagram: a website shows suggested ads ŷ (Ad1, Ad2, Ad3); the user clicks on Ad2, so yt = Ad2. The input xt is the user info and page text; the ML algorithm updates ŵ^(t) → ŵ^(t+1).]


Online learning problem
• Data arrives over time; at each time step t:
  - Observe input xt (info of user, text of webpage)
  - Make a prediction ŷt (which ad to show)
  - Observe true output yt (which ad the user clicked on)

Need an ML algorithm to update coefficients at each time step!


Stochastic gradient ascent can be used for online learning!!!
• init w^(1) = 0, t = 1
• At each time step t:
  - Observe input xt
  - Make a prediction ŷt
  - Observe true output yt
  - Update coefficients: for j = 0, …, D:
      wj^(t+1) ← wj^(t) + η hj(xt) (1[yt=+1] − P(y=+1|xt, w^(t)))
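A sketch of this online loop as a small class; the class name, step size, and feature handling are assumptions:

```python
import numpy as np

class OnlineLogisticRegression:
    """Online learning via one stochastic-gradient update per arriving data point (sketch)."""

    def __init__(self, n_features, step_size=0.1):
        self.w = np.zeros(n_features)       # init w^(1) = 0
        self.step_size = step_size

    def predict_proba(self, h_x):
        # P(y=+1 | x, w) = sigmoid(w^T h(x))
        return 1.0 / (1.0 + np.exp(-self.w.dot(h_x)))

    def predict(self, h_x):
        # make a prediction y-hat_t before the true label is observed
        return +1 if self.predict_proba(h_x) > 0.5 else -1

    def update(self, h_x, y_true):
        # after observing the true output y_t, update coefficients immediately
        partial = h_x * (float(y_true == +1) - self.predict_proba(h_x))
        self.w = self.w + self.step_size * partial
```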


Summary of online learning

Data arrives over time

Must make a prediction every time a new data point arrives

Observe true class after prediction made

Want to update parameters immediately


Summary of stochastic gradient descent


What you can do now…

• Significantly speed up the learning algorithm using stochastic gradient
• Describe the intuition behind why stochastic gradient works
• Apply stochastic gradient in practice
• Describe online learning problems
• Relate stochastic gradient to online learning


Generalized linear models


Models for data

• Linear regression with Gaussian errors:
  yi = w^T h(xi) + εi,   εi ~ N(0, σ²)   ⇒   p(y|x,w) = N(y; w^T h(x), σ²)

• Logistic regression:
  P(y|x,w) = 1 / (1 + e^(−w^T h(x)))                       if y = +1
           = e^(−w^T h(x)) / (1 + e^(−w^T h(x)))           if y = −1

Examples of “generalized linear models”

A generalized linear model has E[y | x, w] = g(w^T h(x))

Model               | Mean E[y|x,w] | Link function g
Linear regression   | w^T h(x)      | identity function
Logistic regression | P(y=+1|x,w)   | sigmoid function

(For the logistic regression row, this assumes y in {0,1} instead of {−1,+1} or some other set of values; the mean and link function need slight modification if y is not in {0,1}.)
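A minimal sketch of the two link functions in the table, assuming y in {0,1} for the logistic case; the function name is an assumption:

```python
import numpy as np

def glm_mean(w, h_x, link="identity"):
    """E[y | x, w] = g(w^T h(x)) for the two examples above (sketch)."""
    score = np.dot(w, h_x)
    if link == "identity":                  # linear regression: mean is w^T h(x)
        return score
    if link == "sigmoid":                   # logistic regression with y in {0,1}: mean is P(y=1|x,w)
        return 1.0 / (1.0 + np.exp(-score))
    raise ValueError("unknown link")
```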


Stochastic gradient descent more formally


Learning Problems as Expectations

• Minimizing loss in training data:
  - Given dataset: {x1, …, xN}
    • Sampled i.i.d. from some distribution p(x) on features
  - Loss function ℓ(w, xi), e.g., hinge loss, logistic loss, …
  - We often minimize loss in training data:  (1/N) Σ_{i=1}^N ℓ(w, xi)

• However, we should really minimize expected loss on all data:
  E_x[ ℓ(w, x) ] = ∫ p(x) ℓ(w, x) dx

• So, we are approximating the integral by the average on the training data


Gradient Ascent in Terms of Expectations

• "True" objective function:  ℓ_true(w) = E_x[ ℓ(w, x) ]

• Taking the gradient:  ∇ℓ_true(w) = E_x[ ∇ℓ(w, x) ]

• "True" gradient ascent rule:  w^(t+1) ← w^(t) + η E_x[ ∇ℓ(w^(t), x) ]

• How do we estimate expected gradient?


SGD: Stochastic Gradient Ascent (or Descent)

• "True" gradient:  ∇ℓ_true(w) = E_x[ ∇ℓ(w, x) ]

• Sample-based approximation:  ∇ℓ_true(w) ≈ (1/N) Σ_{i=1}^N ∇ℓ(w, xi)

• What if we estimate the gradient with just one sample???
  - Unbiased estimate of the gradient
  - Very noisy!
  - Called stochastic gradient ascent (or descent), among many other names
  - VERY useful in practice!!!
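A tiny numerical sanity check of "unbiased but very noisy", using the logistic log-likelihood contribution from earlier as ℓ; the simulated data and all names here are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 1000, 3
H = rng.normal(size=(N, D))                             # simulated features (assumption)
y = np.where(H[:, 0] + rng.normal(size=N) > 0, 1, -1)   # simulated labels in {-1, +1}
w = rng.normal(size=D)                                  # an arbitrary coefficient vector

def point_gradient(i):
    # contribution of data point i to the log-likelihood gradient
    p = 1.0 / (1.0 + np.exp(-H[i].dot(w)))
    return H[i] * (float(y[i] == +1) - p)

full = np.mean([point_gradient(i) for i in range(N)], axis=0)  # (1/N) * full gradient
one = point_gradient(rng.integers(N))                          # single-sample estimate
print(full)
print(one)   # matches the full gradient in expectation, but any single draw is much noisier
```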
