Day 2: Overfitting, regularization
Introduction to Machine Learning Summer School June 18, 2018 - June 29, 2018, Chicago
Instructor: Suriya Gunasekar, TTI Chicago
19 June 2018

Review
• Yesterday
  o Supervised learning
  o Linear regression - polynomial curve fitting
  o Empirical risk minimization, evaluation
• Today
  o Overfitting
  o Model selection
  o Regularization
  o Gradient descent
• Schedule:
  o 9:00am-10:25am: Lecture 2.a: Overfitting, model selection
  o 10:35am-noon: Lecture 2.b: Regularization, gradient descent
  o noon-1:00pm: Lunch
  o 1:00pm-3:30pm: Programming
  o 3:30pm-5:00pm: Invited Talk - Mathew Walter
Overfitting
Dataset size and linear regression
• Recall linear regression
  o Input $x \in \mathbb{R}^d$, output $y \in \mathbb{R}$, training data $S = \{(x_i, y_i) : i = 1, 2, \dots, N\}$
  o Estimate weights $\hat{w} \in \mathbb{R}^d$ and bias $\hat{b} \in \mathbb{R}$ by minimizing training loss
    $(\hat{w}, \hat{b}) = \operatorname{argmin}_{w, b} \sum_{i=1}^{N} (w \cdot x_i + b - y_i)^2$
• What happens when we only have a single data point (in 1D)?
  o Ill-posed problem: an infinite number of lines perfectly fit the data
• Two points in 1D?
• Two points in 2D?
  o The amount of data needed to obtain a meaningful estimate of a model is related to the number of parameters in the model (its complexity)
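The ill-posedness is easy to see numerically. Below is a minimal numpy sketch (not from the lecture) showing that with a single 1D data point the design matrix for $(w, b)$ is rank-deficient, so least squares has infinitely many zero-loss solutions; `lstsq` silently returns just one of them (the minimum-norm one).

```python
import numpy as np

# One data point in 1D: fit y = w*x + b.
x = np.array([2.0])
y = np.array([3.0])
A = np.column_stack([x, np.ones_like(x)])   # design matrix [x, 1], shape (1, 2)

# Rank 1 with 2 parameters: the problem is underdetermined (ill-posed).
print(np.linalg.matrix_rank(A))             # -> 1

# lstsq returns one of the infinitely many zero-loss solutions
# (the minimum-norm one); any (w, b) with 2*w + b = 3 also fits perfectly.
w, b = np.linalg.lstsq(A, y, rcond=None)[0]
print(w, b, w * 2.0 + b - 3.0)              # residual ~ 0
```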
Linear regression - generalization
Consider a 1D example
• $S = \{(x_i, y_i) : i = 1, 2, \dots, N\}$ where
  o $x_i \sim \text{Uniform}(-5, 5)$
  o $y_i = w^* x_i + \epsilon_i$ for a true $w^*$ and noise $\epsilon_i \sim \mathcal{N}(0, 1)$
• $S_{test}$ similarly generated
• Fit $\hat{w} = \operatorname{argmin}_{w} \sum_i (w x_i - y_i)^2$
• Why does the training error increase with the size of the training data?
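A small sketch of this experiment, following the slide's setup ($x_i \sim \text{Uniform}(-5, 5)$, $y_i = w^* x_i + \epsilon_i$ with unit-variance Gaussian noise); the value $w^* = 2$ is an arbitrary choice for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
w_star = 2.0   # hypothetical "true" parameter, for illustration only

for N in [2, 10, 100, 1000]:
    x = rng.uniform(-5, 5, size=N)
    eps = rng.normal(0, 1, size=N)
    y = w_star * x + eps                       # y_i = w* x_i + eps_i
    w_hat = (x @ y) / (x @ x)                  # closed-form 1D least squares
    train_mse = np.mean((w_hat * x - y) ** 2)
    print(f"N={N:4d}  w_hat={w_hat:.3f}  train MSE={train_mse:.3f}")

# With few points the fitted line can partly fit the noise itself, so the
# training error starts below the noise variance (1.0) and rises toward it
# as N grows.
```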
Model complexity vs fit for fixed N
• Recall polynomial regression of degree $m$ in 1D
  $\hat{w} = \operatorname{argmin}_{w \in \mathbb{R}^{m+1}} \sum_{i=1}^{N} (w_0 + w_1 x_i + w_2 x_i^2 + \dots + w_m x_i^m - y_i)^2$
[Figure: polynomial fits of increasing degree, N = 30]
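A sketch of degree-$m$ polynomial least squares via the Vandermonde matrix; the noisy-sine target used here is a stand-in, since the slides do not specify the data-generating function behind the figure.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 30                                          # as in the figure
x = rng.uniform(-5, 5, size=N)
y = np.sin(x) + rng.normal(0, 0.3, size=N)      # stand-in target, for illustration

def fit_poly(x, y, m):
    """Degree-m polynomial least squares via the Vandermonde matrix."""
    X = np.vander(x, m + 1, increasing=True)    # columns: 1, x, x^2, ..., x^m
    return np.linalg.lstsq(X, y, rcond=None)[0]

def mse(x, y, w):
    return np.mean((np.vander(x, len(w), increasing=True) @ w - y) ** 2)

for m in [1, 3, 5, 9]:
    w = fit_poly(x, y, m)
    print(f"degree {m}: train MSE = {mse(x, y, w):.4f}")

# Training error can only decrease as the degree m grows: each model class
# is nested inside the next one.
```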
Overfitting with ERM
• For the same amount of data, more complex models overfit more than simpler models
  o Recall: higher degree → more parameters to fit
• What happens if we have more data?
Model complexity vs fit for fixed N
• Recall polynomial regression of degree $m$ in 1D
  $\hat{w} = \operatorname{argmin}_{w \in \mathbb{R}^{m+1}} \sum_{i=1}^{N} (w_0 + w_1 x_i + w_2 x_i^2 + \dots + w_m x_i^m - y_i)^2$
[Figure: polynomial fits of increasing degree, N = 100]
Overfitting with ERM
• For the same amount of data, complex models overfit more than simple models
  o Recall: higher degree → more parameters to fit
• What happens if we have more data?
  o More complex models require more data to avoid overfitting (see the sketch below)
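A sketch of the "more data" effect under the same assumed noisy-sine setup: fix a complex model (degree 9) and watch the train/test gap shrink as $N$ grows.

```python
import numpy as np

rng = np.random.default_rng(1)

def gen(n):
    """Stand-in noisy-sine data, for illustration only."""
    x = rng.uniform(-5, 5, size=n)
    return x, np.sin(x) + rng.normal(0, 0.3, size=n)

def fit_poly(x, y, m):
    X = np.vander(x, m + 1, increasing=True)
    return np.linalg.lstsq(X, y, rcond=None)[0]

def mse(x, y, w):
    return np.mean((np.vander(x, len(w), increasing=True) @ w - y) ** 2)

x_te, y_te = gen(10_000)                        # large held-out test set
for N in [15, 30, 100, 1000]:
    x_tr, y_tr = gen(N)
    w = fit_poly(x_tr, y_tr, 9)                 # fixed complex model: degree 9
    print(f"N={N:5d}  train={mse(x_tr, y_tr, w):.3f}  test={mse(x_te, y_te, w):.3f}")

# The gap between training and test error of the degree-9 model shrinks
# as N grows.
```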
How to avoid overfitting?
• How to detect overfitting?
• How to avoid overfitting?
  o Look at the test error and pick $m = 5$? No: the test set would then take part in model selection and no longer give an unbiased estimate of generalization
• Split $S = S_{train} \cup S_{valid} \cup S_{test}$
  o Use performance on $S_{valid}$ as a proxy for the test error
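A minimal sketch of such a three-way split; the 60/20/20 fractions are illustrative defaults, not prescribed in the lecture.

```python
import numpy as np

def three_way_split(n, frac_valid=0.2, frac_test=0.2, seed=0):
    """Randomly partition indices 0..n-1 into train / validation / test."""
    idx = np.random.default_rng(seed).permutation(n)
    n_te = int(n * frac_test)
    n_va = int(n * frac_valid)
    return idx[n_te + n_va:], idx[n_te:n_te + n_va], idx[:n_te]

tr, va, te = three_way_split(100)
print(len(tr), len(va), len(te))   # -> 60 20 20
```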
Model selection
• $S = S_{train} \cup S_{valid} \cup S_{test}$
• $m$ model classes $\mathcal{H}_1, \mathcal{H}_2, \dots, \mathcal{H}_m$
  o Recall each $\mathcal{H}_k$ is a set of candidate functions mapping $\mathcal{X} \to \mathcal{Y}$
  o e.g., $\mathcal{H}_k = \{x \mapsto w_0 + w_1 x + w_2 x^2 + \dots + w_k x^k\}$
• Minimize training loss $\hat{L}_{train}$ on $S_{train}$ to pick the best $\hat{f}_k \in \mathcal{H}_k$
  o e.g., $\hat{f}_k(x) = \hat{w}_0 + \hat{w}_1 x + \hat{w}_2 x^2 + \dots + \hat{w}_k x^k$ where
    $(\hat{w}_0, \hat{w}_1, \dots, \hat{w}_k) = \operatorname{argmin}_{w_0, \dots, w_k} \sum_{(x, y) \in S_{train}} (w_0 + w_1 x + w_2 x^2 + \dots + w_k x^k - y)^2$
• Compute validation loss $\hat{L}_{valid}(\hat{f}_k)$ on $S_{valid}$ for each of $\hat{f}_1, \hat{f}_2, \dots, \hat{f}_m$
• Pick $\hat{f}^* = \operatorname{argmin}_{\hat{f}_k} \hat{L}_{valid}(\hat{f}_k)$, i.e., the candidate with the smallest of $\hat{L}_{valid}(\hat{f}_1), \dots, \hat{L}_{valid}(\hat{f}_m)$
• Evaluate test loss $\hat{L}_{test}(\hat{f}^*)$
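Putting the procedure together, a sketch of validation-based selection over polynomial degrees, again on assumed noisy-sine data.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-5, 5, size=100)
y = np.sin(x) + rng.normal(0, 0.3, size=100)    # stand-in data, for illustration

idx = rng.permutation(100)                      # 60 / 20 / 20 split
tr, va, te = idx[:60], idx[60:80], idx[80:]

def fit(k, i):
    """Best f_hat_k in H_k (degree-k polynomials) by training loss on indices i."""
    X = np.vander(x[i], k + 1, increasing=True)
    return np.linalg.lstsq(X, y[i], rcond=None)[0]

def loss(w, i):
    return np.mean((np.vander(x[i], len(w), increasing=True) @ w - y[i]) ** 2)

models = {k: fit(k, tr) for k in range(1, 10)}          # one f_hat_k per class
k_star = min(models, key=lambda k: loss(models[k], va)) # select on S_valid
print("selected degree:", k_star)
print("test MSE:", loss(models[k_star], te))            # report on S_test once
```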
Model selection
• Can we overfit to the validation data?
  o How much data should we keep aside for validation?
• What if we don’t have enough data?
Cross validation
[Figure: the data split into folds $S_1, \dots, S_K$ plus $S_{test}$; for each $j$, a model is trained on all folds except $S_j$ and validated on $S_j$. Illustration credit: Nati Srebro]
• Split $S = S_1 \cup S_2 \cup \dots \cup S_K \cup S_{test}$
  o Extreme case $K = N$: leave-one-out cross validation
• $m$ model classes $\mathcal{H}_1, \mathcal{H}_2, \dots, \mathcal{H}_m$
• For each $j = 1, \dots, K$:
  o Training loss $\hat{L}^{(j)}_{train}$ is the loss on $S^{(j)} = S_1 \cup \dots \cup S_{j-1} \cup S_{j+1} \cup \dots \cup S_K$
  o Let the best $\hat{f}^{(j)}_k \in \mathcal{H}_k$ be $\hat{f}^{(j)}_k = \operatorname{argmin}_{f \in \mathcal{H}_k} \hat{L}^{(j)}_{train}(f)$
  o Compute validation loss $\hat{L}^{(j)}_{valid}(\hat{f}^{(j)}_k)$ on $S_j$ for each $k$
• Pick the model class based on average validation loss: $k^* = \operatorname{argmin}_k \frac{1}{K} \sum_{j=1}^{K} \hat{L}^{(j)}_{valid}(\hat{f}^{(j)}_k)$
• $\mathcal{H}_{k^*}$ is the model class to use
• $\hat{f}^* = \operatorname{argmin}_{f \in \mathcal{H}_{k^*}} \hat{L}_{S_1 \cup \dots \cup S_K}(f)$, or $\hat{f}^* = \frac{1}{K} \sum_j \hat{f}^{(j)}_{k^*}$ (if averaging the models makes sense)
• Evaluate $\hat{L}_{test}(\hat{f}^*)$
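A sketch of $K$-fold cross validation following the slide's recipe (average validation loss over folds, then refit the selected class on all non-test data), on the same assumed noisy-sine data with $K = 5$; setting $K$ equal to the number of points would give leave-one-out.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-5, 5, size=60)
y = np.sin(x) + rng.normal(0, 0.3, size=60)     # stand-in data, for illustration
K = 5
folds = np.array_split(rng.permutation(60), K)  # S_1, ..., S_K

def fit(k, i):
    X = np.vander(x[i], k + 1, increasing=True)
    return np.linalg.lstsq(X, y[i], rcond=None)[0]

def loss(w, i):
    return np.mean((np.vander(x[i], len(w), increasing=True) @ w - y[i]) ** 2)

def cv_loss(k):
    """Average validation loss of model class H_k over the K folds."""
    errs = []
    for j in range(K):
        train = np.concatenate([folds[i] for i in range(K) if i != j])
        errs.append(loss(fit(k, train), folds[j]))
    return np.mean(errs)

k_star = min(range(1, 10), key=cv_loss)
w_star = fit(k_star, np.arange(60))             # refit H_{k*} on all folds
print("selected degree:", k_star)
```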