Cross-Validation for Detecting and Preventing Overfitting

Cross-validation for detecting and preventing overfitting Note to other teachers and users of Andrew W. Moore these slides. Andrew would be delighted if you found this source material useful in giving your own lectures. Feel free to use Professor these slides verbatim, or to modify them to fit your own needs. PowerPoint originals are available. If you make use School of Computer Science of a significant portion of these slides in your own lecture, please include this message, or the following link to the Carnegie Mellon University source repository of Andrew’s tutorials: http://www.cs.cmu.edu/~awm/tutorials . www.cs.cmu.edu/~awm Comments and corrections gratefully received. [email protected] 412-268-7599 Copyright © Andrew W. Moore Slide 1 A Regression Problem y = f(x) + noise Can we learn f from this data? y x Let’s consider three methods… Copyright © Andrew W. Moore Slide 2 Linear Regression y x Copyright © Andrew W. Moore Slide 3 Linear Regression Univariate Linear regression with a constant term: X Y X= 3 y= 7 Originally 3 7 1 3 discussed in the previous Andrew 1 3 : : Lecture: “Neural Nets” : : x1=(3).. y1=7.. Copyright © Andrew W. Moore Slide 4 Linear Regression Univariate Linear regression with a constant term: X Y X= 3 y= 7 3 7 1 3 1 3 : : : : x =(3).. y =7.. Z= y1= 1 1 3 7 1 1 3 : : z1=(1,3).. y1=7.. zk=(1,xk) Copyright © Andrew W. Moore Slide 5 Linear Regression Univariate Linear regression with a constant term: X Y X= 3 y= 7 3 7 1 3 1 3 : : : : x =(3).. y =7.. Z= y1= 1 1 3 7 1 1 3 : : β=(ZTZ)-1(ZTy) z1=(1,3).. y1=7.. est y = β0+ β1 x zk=(1,xk) Copyright © Andrew W. Moore Slide 6 Quadratic Regression y x Copyright © Andrew W. Moore Slide 7 Quadratic Regression Much more about X Y X= 3 y= 7 this in the future Andrew Lecture: 3 7 1 3 “Favorite Regression 1 3 : : Algorithms” : : x1=(3,2).. y1=7.. 1 3 9 y= Z= 7 1 1 1 3 : β=(ZTZ)-1(ZTy) : z=(1 , x, x2 ) est 2 , y = β0+ β1 x+ β2 x Copyright © Andrew W. Moore Slide 8 Join-the-dots Also known as piecewise linear nonparametric regression if that makes you feel better y x Copyright © Andrew W. Moore Slide 9 Which is best? y y x x Why not choose the method with the best fit to the data? Copyright © Andrew W. Moore Slide 10 What do we really want? y y x x Why not choose the method with the best fit to the data? “How well are you going to predict future data drawn from the same distribution?” Copyright © Andrew W. Moore Slide 11 The test set method 1. Randomly choose 30% of the data to be in a test set 2. The remainder is a y training set x Copyright © Andrew W. Moore Slide 12 The test set method 1. Randomly choose 30% of the data to be in a test set 2. The remainder is a y training set 3. Perform your regression on the training x set (Linear regression example) Copyright © Andrew W. Moore Slide 13 The test set method 1. Randomly choose 30% of the data to be in a test set 2. The remainder is a y training set 3. Perform your regression on the training x set (Linear regression example) 4. Estimate your future performance with the test Mean Squared Error = 2.4 set Copyright © Andrew W. Moore Slide 14 The test set method 1. Randomly choose 30% of the data to be in a test set 2. The remainder is a y training set 3. Perform your regression on the training x set (Quadratic regression example) 4. Estimate your future performance with the test Mean Squared Error = 0.9 set Copyright © Andrew W. Moore Slide 15 The test set method 1. Randomly choose 30% of the data to be in a test set 2. The remainder is a y training set 3. Perform your regression on the training x set (Join the dots example) 4. Estimate your future performance with the test Mean Squared Error = 2.2 set Copyright © Andrew W. Moore Slide 16 The test set method Good news: •Very very simple •Can then simply choose the method with the best test-set score Bad news: •What’s the downside? Copyright © Andrew W. Moore Slide 17 The test set method Good news: •Very very simple •Can then simply choose the method with the best test-set score Bad news: •Wastes data: we get an estimate of the We say the “test-set best method to apply to 30% less data estimator of performance •If we don’t have much data, our test-set has high might just be lucky or unlucky variance” Copyright © Andrew W. Moore Slide 18 LOOCV (Leave-one-out Cross Validation) For k=1 to R th 1. Let (xk,yk) be the k record y x Copyright © Andrew W. Moore Slide 19 LOOCV (Leave-one-out Cross Validation) For k=1 to R th 1. Let (xk,yk) be the k record 2. Temporarily remove (xk,yk) from the dataset y x Copyright © Andrew W. Moore Slide 20 LOOCV (Leave-one-out Cross Validation) For k=1 to R th 1. Let (xk,yk) be the k record 2. Temporarily remove (xk,yk) from the dataset 3. Train on the remaining R-1 y datapoints x Copyright © Andrew W. Moore Slide 21 LOOCV (Leave-one-out Cross Validation) For k=1 to R th 1. Let (xk,yk) be the k record 2. Temporarily remove (xk,yk) from the dataset 3. Train on the remaining R-1 y datapoints 4. Note your error (xk,yk) x Copyright © Andrew W. Moore Slide 22 LOOCV (Leave-one-out Cross Validation) For k=1 to R th 1. Let (xk,yk) be the k record 2. Temporarily remove (xk,yk) from the dataset 3. Train on the remaining R-1 y datapoints 4. Note your error (xk,yk) x When you’ve done all points, report the mean error. Copyright © Andrew W. Moore Slide 23 LOOCV (Leave-one-out Cross Validation) For k=1 to R 1. Let (xk,yk) be the kth record y y y 2. Temporarily remove (xk,yk) from the dataset x x x 3. Train on the remaining R-1 datapoints 4. Note your y y y error (xk,yk) When you’ve done all points, x x x report the mean error. MSELOOCV = 2.12 y y y x x x Copyright © Andrew W. Moore Slide 24 LOOCV for Quadratic Regression For k=1 to R 1. Let (xk,yk) be the kth record y y y 2. Temporarily remove (xk,yk) from the dataset x x x 3. Train on the remaining R-1 datapoints 4. Note your y y y error (xk,yk) When you’ve done all points, x x x report the mean error. MSELOOCV =0.962 y y y x x x Copyright © Andrew W. Moore Slide 25 LOOCV for Join The Dots For k=1 to R 1. Let (xk,yk) be the kth record y y y 2. Temporarily remove (xk,yk) from the dataset x x x 3. Train on the remaining R-1 datapoints 4. Note your y y y error (xk,yk) When you’ve done all points, x x x report the mean error. MSELOOCV =3.33 y y y x x x Copyright © Andrew W. Moore Slide 26 Which kind of Cross Validation? Downside Upside Test-set Variance: unreliable Cheap estimate of future performance Leave- Expensive. Doesn’t one-out Has some weird waste data behavior ..can we get the best of both worlds? Copyright © Andrew W. Moore Slide 27 k-fold Cross Randomly break the dataset into k partitions (in our example we’ll have k=3 Validation partitions colored Red Green and Blue) y x Copyright © Andrew W. Moore Slide 28 k-fold Cross Randomly break the dataset into k partitions (in our example we’ll have k=3 Validation partitions colored Red Green and Blue) For the red partition: Train on all the points not in the red partition. Find the test-set sum of errors on the red points. y x Copyright © Andrew W. Moore Slide 29 k-fold Cross Randomly break the dataset into k partitions (in our example we’ll have k=3 Validation partitions colored Red Green and Blue) For the red partition: Train on all the points not in the red partition. Find the test-set sum of errors on the red points. For the green partition: Train on all the points not in the green partition. y Find the test-set sum of errors on the green points. x Copyright © Andrew W. Moore Slide 30 k-fold Cross Randomly break the dataset into k partitions (in our example we’ll have k=3 Validation partitions colored Red Green and Blue) For the red partition: Train on all the points not in the red partition. Find the test-set sum of errors on the red points. For the green partition: Train on all the points not in the green partition. y Find the test-set sum of errors on the green points. For the blue partition: Train on all the points not in the blue partition. Find x the test-set sum of errors on the blue points. Copyright © Andrew W. Moore Slide 31 k-fold Cross Randomly break the dataset into k partitions (in our example we’ll have k=3 Validation partitions colored Red Green and Blue) For the red partition: Train on all the points not in the red partition.

Load more