Overfitting, Regularization

Day 2: Overfitting, Regularization
Introduction to Machine Learning Summer School, June 18, 2018 - June 29, 2018, Chicago
Instructor: Suriya Gunasekar, TTI Chicago
19 June 2018

Review
• Yesterday
  o Supervised learning
  o Linear regression - polynomial curve fitting
  o Empirical risk minimization, evaluation
• Today
  o Overfitting
  o Model selection
  o Regularization
  o Gradient descent
• Schedule:
  9:00am-10:25am  Lecture 2.a: Overfitting, model selection
  10:35am-noon    Lecture 2.b: Regularization, gradient descent
  noon-1:00pm     Lunch
  1:00pm-3:30pm   Programming
  3:30pm-5:00pm   Invited Talk - Mathew Walter

Overfitting

Dataset size and linear regression
• Recall linear regression
  o Input x ∈ ℝ^d, output y ∈ ℝ, training data S = {(x^(i), y^(i)) : i = 1, 2, …, N}
  o Estimate weights w ∈ ℝ^d and bias w_0 ∈ ℝ by minimizing the training loss:
    (ŵ, ŵ_0) = argmin_{w, w_0} Σ_{i=1}^{N} (w·x^(i) + w_0 − y^(i))²
• What happens when we only have a single data point (in 1D)?
  o Ill-posed problem: an infinite number of lines perfectly fit the data
• Two points in 1D? Two points in 2D?
• Takeaway: the amount of data needed to obtain a meaningful estimate of a model is related to the number of parameters in the model (its complexity)
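The ill-posedness of fitting a line to a single 1D point can be checked directly: any slope, paired with the right bias, attains zero training loss, while two distinct points pin the line down. A minimal numpy sketch (the specific numbers are illustrative, not from the slides):

```python
import numpy as np

# One training point in 1D: x = 2.0, y = 3.0
x, y = 2.0, 3.0

# Any slope w, paired with bias w0 = y - w * x, fits this point exactly:
for w in (0.0, 1.0, -7.5):
    w0 = y - w * x
    residual = (w * x + w0) - y
    assert residual == 0.0  # zero training loss for every choice of slope

# With two distinct points in 1D, the line is uniquely determined:
X = np.array([[1.0, 1.0], [2.0, 1.0]])  # rows: [x, 1] (the 1 is the bias feature)
Y = np.array([2.0, 4.0])
w, w0 = np.linalg.solve(X, Y)
print(w, w0)  # slope 2.0, bias 0.0
```

With two points in 2D the problem becomes ill-posed again: three parameters (two weights plus a bias) cannot be pinned down by two equations.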
Linear regression - generalization
• Consider a 1D example
  o S_train = {(x^(i), y^(i)) : i = 1, 2, …, N}, where
    x^(i) ∼ Uniform(−5, 5) and y^(i) = w*·x^(i) + ε^(i) for a true parameter w* and noise ε^(i) ∼ N(0, 1)
  o S_test is generated the same way
  o ŵ = argmin_w Σ_{i=1}^{N} (w·x^(i) − y^(i))²
• Question: does the training error increase with the size of the training data?

Model complexity vs. fit for fixed N
• Recall polynomial regression of degree m in 1D:
  ŵ = argmin_{w ∈ ℝ^{m+1}} Σ_{i=1}^{N} (w_0 + w_1·x^(i) + w_2·(x^(i))² + … + w_m·(x^(i))^m − y^(i))²
• [Figure: polynomial fits of increasing degree m on N = 30 points]
• [Figure: the same fits on N = 100 points]

Overfitting with ERM
• For the same amount of data, more complex models overfit more than simpler models
  o Recall: higher degree → more parameters to fit
• What happens if we have more data?
  o More complex models require more data to avoid overfitting

How to avoid overfitting?
• How to detect overfitting?
• How to avoid overfitting?
  o Look at the test error and pick m = 5? (No: the test set must not be used to choose the model)
• Split S = S_train ∪ S_val ∪ S_test
  o Use performance on S_val as a proxy for the test error
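The degree-vs-fit experiment above can be sketched in numpy; the true slope w* = 1.5 and the random seed are my own choices, not from the slides. Because the polynomial model classes are nested, the training error can only decrease as the degree grows, while the test error typically deteriorates at high degree when N = 30; exact numbers depend on the seed.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_data(n, w_star=1.5):
    # x ~ Uniform(-5, 5), y = w* x + noise, noise ~ N(0, 1)
    x = rng.uniform(-5, 5, size=n)
    y = w_star * x + rng.normal(0.0, 1.0, size=n)
    return x, y

x_train, y_train = make_data(30)    # N = 30, as on the slide
x_test, y_test = make_data(1000)

train_mse, test_mse = {}, {}
for m in (1, 3, 5, 9):
    coeffs = np.polyfit(x_train, y_train, deg=m)  # least-squares fit of degree m
    train_mse[m] = np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)
    test_mse[m] = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {m}: train MSE {train_mse[m]:.3f}, test MSE {test_mse[m]:.3f}")
```

Rerunning with `make_data(100)` for the training set reproduces the N = 100 slide: the gap between train and test error shrinks as more data is added.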
Model selection
• S = S_train ∪ S_val ∪ S_test
• m model classes H_1, H_2, …, H_m
  o Recall each H_r is a set of candidate functions mapping X → Y
  o e.g., H_r = {x → w_0 + w_1·x + w_2·x² + … + w_r·x^r}
• Minimize the training loss L_{S_train} to pick the best ĥ_r ∈ H_r
  o e.g., ĥ_r(x) = ŵ_0 + ŵ_1·x + … + ŵ_r·x^r, where
    (ŵ_0, ŵ_1, …, ŵ_r) = argmin_{w_0,…,w_r} Σ_{(x,y)∈S_train} (w_0 + w_1·x + … + w_r·x^r − y)²
• Compute the validation loss L_{S_val}(ĥ_r) for each of ĥ_1, ĥ_2, …, ĥ_m
• Pick ĥ* = argmin_r L_{S_val}(ĥ_r), i.e., the candidate with the smallest validation loss
• Evaluate the test loss L_{S_test}(ĥ*)

Model selection, continued
• Can we overfit to the validation data?
  o How much data should we keep aside for validation?
• What if we don't have enough data?

Cross validation (illustration credit: Nati Srebro)
• Split S = S_1 ∪ S_2 ∪ … ∪ S_K ∪ S_test
  o Extreme case K = N: leave-one-out cross validation
• m model classes H_1, H_2, …, H_m
• For each fold k = 1, …, K:
  o The training loss L^(k)_{S_train} is the loss on S_train^(k) = S_1 ∪ … ∪ S_{k−1} ∪ S_{k+1} ∪ … ∪ S_K
  o Pick the best ĥ_r^(k) ∈ H_r by ĥ_r^(k) = argmin_{h ∈ H_r} L^(k)_{S_train}(h)
  o Compute the validation loss L_{S_k}(ĥ_r^(k)) on the held-out fold S_k, for each r
• Pick the model class by average validation loss: r̂* = argmin_r (1/K) Σ_{k=1}^{K} L_{S_k}(ĥ_r^(k))
• H_{r̂*} is the selected model class
• ĥ* = argmin_{h ∈ H_{r̂*}} L_{S_1 ∪ … ∪ S_K}(h), retraining on all non-test data, or ĥ* = (1/K) Σ_k ĥ_{r̂*}^(k) (if averaging the fitted models makes sense)
• Evaluate L_{S_test}(ĥ*)
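The K-fold procedure above can be sketched for the running example of choosing a polynomial degree. This is a from-scratch illustration, assuming synthetic linear data with w* = 1.5 and K = 5; the candidate degrees and the seed are my own choices, not from the slides:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic 1D data as in the earlier slides: y = w* x + noise
x = rng.uniform(-5, 5, size=40)
y = 1.5 * x + rng.normal(0.0, 1.0, size=40)

K = 5
folds = np.array_split(rng.permutation(len(x)), K)  # shuffled indices, K folds

def cv_loss(degree):
    # average validation MSE over the K folds
    losses = []
    for k in range(K):
        val_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(K) if j != k])
        coeffs = np.polyfit(x[train_idx], y[train_idx], deg=degree)
        pred = np.polyval(coeffs, x[val_idx])
        losses.append(np.mean((pred - y[val_idx]) ** 2))
    return float(np.mean(losses))

degrees = [1, 2, 3, 5, 9]
scores = {m: cv_loss(m) for m in degrees}
best = min(scores, key=scores.get)
print("CV losses:", scores)
print("selected degree:", best)
```

Setting K = len(x) gives the leave-one-out extreme from the slide; after selection, one would refit the chosen degree on all non-test data before touching S_test.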
