
Day 2: Overfitting, regularization

Introduction to Summer School, June 18, 2018 - June 29, 2018, Chicago

Instructor: Suriya Gunasekar, TTI Chicago

19 June 2018

Review
• Yesterday
  o Linear regression
  o Polynomial regression
  o Empirical risk minimization, evaluation
• Today
  o Overfitting
  o Model selection
  o Regularization
  o Gradient descent
• Schedule:
  9:00am-10:25am Lecture 2.a: Overfitting, model selection
  10:35am-noon Lecture 2.b: Regularization, gradient descent
  noon-1:00pm Lunch
  1:00pm-3:30pm Programming
  3:30pm-5:00pm Invited Talk - Matthew Walter

Overfitting

Dataset size and linear regression

• Recall linear regression
  o Input $x \in \mathbb{R}^d$, output $y \in \mathbb{R}$, training data $S = \{(x_i, y_i) : i = 1, 2, \ldots, N\}$
  o Estimate weights $w \in \mathbb{R}^d$ and bias $b \in \mathbb{R}$ by minimizing the training loss:
    $\hat{w}, \hat{b} = \operatorname{argmin}_{w, b} \sum_{i=1}^{N} (w \cdot x_i + b - y_i)^2$
• What happens when we only have a single data point (in 1D)?
  o Ill-posed problem: an infinite number of lines perfectly fit the data (see the sketch below)
• Two points in 1D?
• Two points in 2D?
  o The amount of data needed to obtain a meaningful estimate of a model is related to the number of parameters in the model (its complexity)
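The ill-posedness is easy to verify numerically. A minimal sketch, where the single point $(x, y) = (2, 3)$ is made up for illustration:

```python
# Sketch: with a single 1D data point and two parameters (w, b), least squares
# is ill-posed. numpy reports rank 1 < 2, i.e. infinitely many exact fits.
import numpy as np

X = np.array([[2.0, 1.0]])   # one point x = 2, with a constant column for b
y = np.array([3.0])
sol, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print(f"one solution (w, b) = {sol}, rank = {rank} out of 2 parameters")
# Any (w, b) with 2w + b = 3 fits this point exactly; a second point in 1D
# pins the line down, but two points in 2D (three parameters) do not.
```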


Linear regression - generalization

Consider a 1D example:
• $S = \{(x_i, y_i) : i = 1, 2, \ldots, N\}$ where
  o $x_i \sim \mathrm{Uniform}(-5, 5)$
  o $y_i = w^* x_i + \epsilon_i$ for a true $w^*$ and noise $\epsilon_i \sim \mathcal{N}(0, 1)$
• $S_{\mathrm{test}}$ similarly generated
• $\hat{w} = \operatorname{argmin}_{w} \sum_{i=1}^{N} (w x_i - y_i)^2$

• Why does the training error increase with the size of the training data? (With more points, no single line passes through all of them, so the average training error rises toward the noise level, even as the test error falls; the sketch below simulates this.)
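A minimal sketch of this experiment. The true slope $w^* = 2$ and the sample sizes are assumptions; the slides do not fix them:

```python
# Simulate the 1D model above and report train/test MSE as N grows.
import numpy as np

rng = np.random.default_rng(0)
w_star = 2.0  # hypothetical true parameter

def make_data(n):
    x = rng.uniform(-5, 5, size=n)
    y = w_star * x + rng.normal(0, 1, size=n)  # y_i = w* x_i + eps_i
    return x, y

x_test, y_test = make_data(1000)

for n in [2, 5, 20, 100, 1000]:
    x, y = make_data(n)
    w_hat = np.sum(x * y) / np.sum(x * x)      # closed-form least squares (no bias)
    print(f"N={n:5d}  train MSE={np.mean((w_hat * x - y) ** 2):.3f}  "
          f"test MSE={np.mean((w_hat * x_test - y_test) ** 2):.3f}")
```

Both errors converge to the noise variance of 1: the training error from below, the test error from above.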

Model complexity vs fit for fixed N

• Recall polynomial regression of degree $m$ in 1D (a code sketch follows the figure):
  $\hat{w} = \operatorname{argmin}_{w \in \mathbb{R}^{m+1}} \sum_{i=1}^{N} (w_0 + w_1 x_i + w_2 x_i^2 + \cdots + w_m x_i^m - y_i)^2$

[Figure: polynomial fits of increasing degree to the same N = 30 training points]
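A sketch of this fit using numpy's polynomial least squares; the data model and the degree grid are assumptions carried over from the earlier 1D example:

```python
# Training error of degree-m polynomial fits on N = 30 points drawn from the
# linear model above; a higher degree always fits the training set better.
import numpy as np

rng = np.random.default_rng(0)
N = 30
x = rng.uniform(-5, 5, size=N)
y = 2.0 * x + rng.normal(0, 1, size=N)   # assumed true model, as before

for m in [1, 3, 5, 7, 9]:
    coeffs = np.polyfit(x, y, deg=m)     # least-squares fit of w_0..w_m
    train_mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    print(f"degree m={m:2d}  training MSE={train_mse:.4f}")
```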

Overfitting with ERM

• For the same amount of data, more complex models overfit more than simpler models
  o Recall: higher degree → more parameters to fit
• What happens if we have more data?

Model complexity vs fit for fixed N

• Recall polynomial regression of degree $m$ in 1D:
  $\hat{w} = \operatorname{argmin}_{w \in \mathbb{R}^{m+1}} \sum_{i=1}^{N} (w_0 + w_1 x_i + w_2 x_i^2 + \cdots + w_m x_i^m - y_i)^2$

[Figure: the same polynomial fits with N = 100 training points]

Overfitting with ERM

• For the same amount of data, complex models overfit more than simple models
  o Recall: higher degree → more parameters to fit
• What happens if we have more data?
  o More complex models require more data to avoid overfitting, as the sketch below illustrates
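A sketch of this effect for a fixed complex model; the degree 9 and the sample sizes are assumptions for illustration:

```python
# The gap between test and training error of a degree-9 polynomial shrinks
# as the training set grows.
import numpy as np

rng = np.random.default_rng(0)

def sample(n):
    x = rng.uniform(-5, 5, size=n)
    return x, 2.0 * x + rng.normal(0, 1, size=n)

x_test, y_test = sample(2000)

for n in [30, 100, 1000]:
    x, y = sample(n)
    coeffs = np.polyfit(x, y, deg=9)
    train = np.mean((np.polyval(coeffs, x) - y) ** 2)
    test = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"N={n:5d}  train={train:.3f}  test={test:.3f}  gap={test - train:.3f}")
```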

How to avoid overfitting?

• How to detect overfitting?

• How to avoid overfitting?
  o Look at the test error and pick m = 5? (No: once the test set is used for selection, it no longer estimates generalization.)

• Split $S = S_{\mathrm{train}} \cup S_{\mathrm{val}} \cup S_{\mathrm{test}}$
  o Use performance on $S_{\mathrm{val}}$ as a proxy for the test error, as in the sketch below
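A minimal sketch of the split; the 70/15/15 ratio is an assumption, not something the slides prescribe:

```python
# Random train/validation/test split of the 1D dataset used throughout.
import numpy as np

rng = np.random.default_rng(0)
N = 200
x = rng.uniform(-5, 5, size=N)
y = 2.0 * x + rng.normal(0, 1, size=N)

idx = rng.permutation(N)                      # shuffle before splitting
n_tr, n_val = int(0.7 * N), int(0.15 * N)
tr, va, te = idx[:n_tr], idx[n_tr:n_tr + n_val], idx[n_tr + n_val:]
x_tr, y_tr = x[tr], y[tr]                     # fit on this
x_va, y_va = x[va], y[va]                     # select on this
x_te, y_te = x[te], y[te]                     # report on this, once
```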

Model selection

• $S = S_{\mathrm{train}} \cup S_{\mathrm{val}} \cup S_{\mathrm{test}}$

• $m$ model classes $\mathcal{H}_1, \mathcal{H}_2, \ldots, \mathcal{H}_m$

  o Recall each $\mathcal{H}_k$ is a set of candidate functions mapping $\mathcal{X} \to \mathcal{Y}$
  o e.g., $\mathcal{H}_k = \{x \mapsto w_0 + w_1 x + w_2 x^2 + \cdots + w_k x^k\}$
• Minimize the training loss on $S_{\mathrm{train}}$ to pick the best $\hat{h}_k \in \mathcal{H}_k$
  o e.g., $\hat{h}_k(x) = \hat{w}_0 + \hat{w}_1 x + \hat{w}_2 x^2 + \cdots + \hat{w}_k x^k$ where
    $\hat{w}_0, \hat{w}_1, \ldots, \hat{w}_k = \operatorname{argmin}_{w_0, \ldots, w_k} \sum_{(x, y) \in S_{\mathrm{train}}} (w_0 + w_1 x + \cdots + w_k x^k - y)^2$
• Compute the validation loss $\hat{L}_{\mathrm{val}}(\hat{h}_k)$ on $S_{\mathrm{val}}$ for each of $\hat{h}_1, \hat{h}_2, \ldots, \hat{h}_m$
• Pick $\hat{h}^* = \operatorname{argmin}_k \hat{L}_{\mathrm{val}}(\hat{h}_k)$
• Evaluate the test loss $\hat{L}_{\mathrm{test}}(\hat{h}^*)$ (the whole pipeline is sketched below)
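A sketch of the procedure above: ERM within each class $\mathcal{H}_k$ on the training split, selection by validation loss, one final look at the test split. The data model, split ratio, and degree grid are assumptions carried over from earlier:

```python
# Validation-based model selection over polynomial degrees.
import numpy as np

rng = np.random.default_rng(0)
N = 200
x = rng.uniform(-5, 5, size=N)
y = 2.0 * x + rng.normal(0, 1, size=N)
idx = rng.permutation(N)
tr, va, te = idx[:140], idx[140:170], idx[170:]   # 70/15/15 split (assumed)

def mse(coeffs, xs, ys):
    return np.mean((np.polyval(coeffs, xs) - ys) ** 2)

degrees = list(range(1, 11))                      # candidate classes H_1..H_10
fits = [np.polyfit(x[tr], y[tr], deg=m) for m in degrees]   # ERM in each H_k
val_losses = [mse(c, x[va], y[va]) for c in fits]

best = int(np.argmin(val_losses))                 # h* = argmin_k L_val(h_k)
print(f"selected degree m = {degrees[best]}, "
      f"test MSE = {mse(fits[best], x[te], y[te]):.3f}")
```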

Model selection

• Can we overfit to the validation data?

  o How much data should we keep aside for validation?
• What if we don't have enough data?

Cross validation (illustration credit: Nati Srebro)

• Split $S = S_1 \cup S_2 \cup \cdots \cup S_K$
  o Extreme case $K = N$: leave-one-out cross validation
• $m$ model classes $\mathcal{H}_1, \mathcal{H}_2, \ldots, \mathcal{H}_m$
• For each fold $k$:
  o The training loss $\hat{L}^{-k}$ is the loss on $S \setminus S_k = S_1 \cup \cdots \cup S_{k-1} \cup S_{k+1} \cup \cdots \cup S_K$
  o Let the best $\hat{h}_{j,k} \in \mathcal{H}_j$ be $\hat{h}_{j,k} = \operatorname{argmin}_{h \in \mathcal{H}_j} \hat{L}^{-k}(h)$
  o Compute the validation loss $\hat{L}_k(\hat{h}_{j,k})$ on $S_k$ for each $j$
• Pick the model class with the lowest average validation loss: $j^* = \operatorname{argmin}_j \frac{1}{K} \sum_k \hat{L}_k(\hat{h}_{j,k})$
• $\mathcal{H}_{j^*}$ is the model class to use
• $\hat{h}^* = \operatorname{argmin}_{h \in \mathcal{H}_{j^*}} \hat{L}_S(h)$ (retrain on all of $S$), or $\hat{h}^* = \frac{1}{K} \sum_k \hat{h}_{j^*, k}$ (if averaging hypotheses makes sense)
• Evaluate the test loss $\hat{L}_{\mathrm{test}}(\hat{h}^*)$ (a sketch of the whole procedure follows)
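A sketch of K-fold cross validation for choosing the polynomial degree. Here $K = 5$ and the degree grid are assumptions; setting $K = N$ would give leave-one-out cross validation:

```python
# K-fold cross validation over polynomial degrees, then retraining on all of S.
import numpy as np

rng = np.random.default_rng(0)
N, K = 100, 5
x = rng.uniform(-5, 5, size=N)
y = 2.0 * x + rng.normal(0, 1, size=N)
folds = np.array_split(rng.permutation(N), K)     # S = S_1 u ... u S_K

def cv_loss(m):
    """Average validation MSE of degree-m fits over the K folds."""
    losses = []
    for k in range(K):
        val = folds[k]
        trn = np.concatenate([folds[j] for j in range(K) if j != k])
        coeffs = np.polyfit(x[trn], y[trn], deg=m)
        losses.append(np.mean((np.polyval(coeffs, x[val]) - y[val]) ** 2))
    return float(np.mean(losses))

degrees = range(1, 11)
j_star = min(degrees, key=cv_loss)                # argmin of average val loss
final_fit = np.polyfit(x, y, deg=j_star)          # retrain on all of S
print(f"selected degree {j_star}")
```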