
The Theory Behind Overfitting, Cross Validation, Regularization, Bagging, and Boosting: Tutorial

Benyamin Ghojogh [email protected] Department of Electrical and Computer Engineering, Machine Learning Laboratory, University of Waterloo, Waterloo, ON, Canada

Mark Crowley [email protected] Department of Electrical and Computer Engineering, Machine Learning Laboratory, University of Waterloo, Waterloo, ON, Canada

Abstract

In this tutorial paper, we first define mean squared error, variance, covariance, and bias of both random variables and classification/predictor models. Then, we formulate the true and generalization errors of the model for both training and validation/test instances, where we make use of the Stein's Unbiased Risk Estimator (SURE). We define overfitting, underfitting, and generalization using the obtained true and generalization errors. We introduce cross validation and two well-known examples of it, which are K-fold and leave-one-out cross validations. We briefly introduce generalized cross validation and then move on to regularization, where we use SURE again. We work on both ℓ2 and ℓ1 norm regularizations. Then, we show that bootstrap aggregating (bagging) reduces the variance of estimation. Boosting, specifically AdaBoost, is introduced and explained both as an additive model and as a maximum margin model, i.e., Support Vector Machine (SVM). The upper bound on the generalization error of boosting is also provided to show why boosting prevents overfitting. As examples of regularization, the theory of ridge and lasso regressions, weight decay, noise injection to input/weights, and early stopping are explained. Random forest, dropout, histogram of oriented gradients, and single shot multi-box detector are explained as examples of bagging in machine learning and computer vision. Finally, boosting tree and SVM models are mentioned as examples of boosting.

1. Introduction

Assume we have a dataset of instances {(x_i, y_i)}_{i=1}^N with sample size N and dimensionality x_i ∈ R^d and y_i ∈ R. The {x_i}_{i=1}^N are the input data to the model and the {y_i}_{i=1}^N are the observations (labels). We denote the dataset by D so that N := |D|. This dataset is the union of disjoint subsets, i.e., the training set T and the test set R; therefore:

    D = T ∪ R,   (1)
    T ∩ R = ∅.   (2)

For the training set, the observations (labels), y_i's, are available. Although we might also have the y_i's for the test set, we do not use them for training the model. The observations come from a finite discrete set of values in classification tasks, and they are continuous in prediction (regression) tasks. Assume the sample sizes of the training and test sets are n := |T| and m := N − n, respectively; therefore, we have {(x_i, y_i)}_{i=1}^n as the training set. In some cases where we want to have a validation set V as well, the dataset includes three disjoint subsets:

    D = T ∪ R ∪ V,   (3)
    T ∩ R = ∅, T ∩ V = ∅, V ∩ R = ∅.   (4)

We will define the intuitions of the training, test, and validation sets later in this paper.

In this paper, we introduce overfitting, cross validation, generalized cross validation, regularization, bagging, and boosting and explain why they work theoretically. We also provide some examples of these methods in machine learning and computer vision.

2. Mean Squared Error, Variance, and Bias

2.1. Measures for a Random Variable

Assume we have a variable X and we estimate it. Let the random variable X̂ denote the estimate of X.

Figure 1. The dart example for (a) high bias and low variance, (b) low bias and high variance, (c) high bias and high variance, and (d) low bias and low variance. The worst and best cases are (c) and (d), respectively. The center of the circles is the true value of the variable.

The variance of estimating this random variable is defined as:

    Var(X̂) := E[(X̂ − E(X̂))²],   (5)

which means the average deviation of X̂ from the mean of our estimate, E(X̂), where the deviation is squared for symmetry of the difference. This variance can be restated as:

    Var(X̂) = E[X̂² + (E(X̂))² − 2X̂E(X̂)]
           =(a) E(X̂²) + (E(X̂))² − 2E(X̂)E(X̂)
           = E(X̂²) − (E(X̂))²,   (6)

where (a) is because expectation is a linear operator and E(X̂) is not a random variable.

Our estimation can have a bias. The bias of our estimate is defined as:

    Bias(X̂) := E(X̂) − X,   (7)

which means how much the mean of our estimate deviates from the original X.

The Mean Squared Error (MSE) of our estimate, X̂, is defined as:

    MSE(X̂) := E[(X̂ − X)²],   (8)

which means how much our estimate deviates from the original X.

The intuition of bias, variance, and MSE is illustrated in Fig. 1, where the estimations are like a dart game. We have four cases with low/high values of bias and variance, which are depicted in this figure.

The relation of MSE, variance, and bias is as follows:

    MSE(X̂) = E[(X̂ − X)²]
            = E[(X̂ − E(X̂) + E(X̂) − X)²]
            = E[(X̂ − E(X̂))² + (E(X̂) − X)² + 2(X̂ − E(X̂))(E(X̂) − X)]
            =(a) E[(X̂ − E(X̂))²] + (E(X̂) − X)² + 2(E(X̂) − E(X̂))(E(X̂) − X)
            =(b) Var(X̂) + (Bias(X̂))²,   (9)

where (a) is because expectation is a linear operator and X and E(X̂) are not random (note that E(X̂) − E(X̂) = 0 kills the cross term), and (b) is because of Eqs. (5) and (7).

If we have two random variables X̂ and Ŷ, we can say:

    Var(aX̂ + bŶ) =(6) E[(aX̂ + bŶ)²] − (E(aX̂ + bŶ))²
                  =(a) a²E(X̂²) + b²E(Ŷ²) + 2ab E(X̂Ŷ) − a²(E(X̂))² − b²(E(Ŷ))² − 2ab E(X̂)E(Ŷ)
                  =(6) a²Var(X̂) + b²Var(Ŷ) + 2ab Cov(X̂, Ŷ),   (10)

where (a) is because of the linearity of expectation, and Cov(X̂, Ŷ) is the covariance defined as:

    Cov(X̂, Ŷ) := E(X̂Ŷ) − E(X̂)E(Ŷ).   (11)

If the two random variables are independent, i.e., X ⊥⊥ Y, we have:

    E(X̂Ŷ) =(a) ∫∫ x̂ŷ f(x̂, ŷ) dx̂ dŷ =(⊥⊥) ∫∫ x̂ŷ f(x̂)f(ŷ) dx̂ dŷ
           = ∫ ŷf(ŷ) [∫ x̂f(x̂) dx̂] dŷ = E(X̂) ∫ ŷf(ŷ) dŷ
           = E(X̂)E(Ŷ) ⟹ Cov(X̂, Ŷ) = 0,   (12)

where (a) is according to the definition of expectation. Note that the reverse implication of Eq. (12) is not true (we can prove this by counterexample).

We can extend Eqs. (10) and (11) to multiple random variables:

    Var(∑_{i=1}^k a_i X_i) = ∑_{i=1}^k a_i² Var(X_i) + ∑_{i=1}^k ∑_{j=1, j≠i}^k a_i a_j Cov(X_i, X_j),   (13)

    Cov(∑_{i=1}^{k1} a_i X_i, ∑_{j=1}^{k2} b_j Y_j) = ∑_{i=1}^{k1} ∑_{j=1}^{k2} a_i b_j Cov(X_i, Y_j),   (14)

where the a_i's and b_j's are not random.

Figure 2. The true model and the estimated model which is trained using the input training data and their observations. The observations are obtained from the outputs of the true model fed with the training input but corrupted by the noise. After the model is trained, it can be used to estimate the observation for either training or test input data.

2.2. Measures for a Model

Assume we have a function f which gets the i-th input x_i and outputs f_i = f(x_i). Figure 2 shows this function and its input and output. We wish to know this function, which we call the true model, but we do not have access to it as it is unknown. Also, the pure outputs (true observations), f_i's, are not available. The output may be corrupted with an additive noise ε_i:

    y_i = f_i + ε_i,   (15)

where the noise is ε_i ∼ N(0, σ²). Therefore:

    E(ε_i) = 0,   E(ε_i²) =(6) Var(ε_i) + (E(ε_i))² = σ².   (16)
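The data-generating process of Eqs. (15) and (16) can be sketched in a few lines of Python. Here, the true model f(x) = sin(x), the noise level σ = 0.3, and the sample sizes are hypothetical choices for illustration; in practice, f is unknown and only the corrupted observations y_i are available:

    import numpy as np

    rng = np.random.default_rng(1)

    def f(x):                 # hypothetical true model (unknown in practice)
        return np.sin(x)

    sigma = 0.3               # noise standard deviation
    n, m = 30, 100            # training and test sample sizes

    x_train = rng.uniform(0, 2 * np.pi, n)
    y_train = f(x_train) + sigma * rng.standard_normal(n)   # Eq. (15)

    x_test = rng.uniform(0, 2 * np.pi, m)
    y_test = f(x_test) + sigma * rng.standard_normal(m)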

The true observation f_i is not random, thus:

    E(f_i) = f_i.   (17)

The input training data {x_i}_{i=1}^n and their corrupted observations {y_i}_{i=1}^n are available to us. We would like to approximate (estimate) the true model by a model f̂ in order to estimate the observations {y_i}_{i=1}^n from the input {x_i}_{i=1}^n. Calling the estimated observations {ŷ_i}_{i=1}^n, we want the {ŷ_i}_{i=1}^n to be as close as possible to {y_i}_{i=1}^n for the training input data {x_i}_{i=1}^n. We train the model using the training data in order to estimate the true model. After training, the model can be used to estimate the output for both the training input {x_i}_{i=1}^n and the unseen test input {x_i}_{i=1}^m, to have the estimates {ŷ_i}_{i=1}^n and {ŷ_i}_{i=1}^m, respectively. The explained details are illustrated in Fig. 2.

In this work, we denote the estimation of the observation of the i-th instance with either ŷ_i or f̂_i. The model can be a regression (prediction) or classification model. In regression, the model's estimation is continuous, while in classification, the estimation is a member of a discrete set of possible observations.

The definitions of variance, bias, and MSE, i.e., Eqs. (5), (7), and (8), can also be used for the estimation f̂_i of the true model f_i. Eq. (9) can be illustrated for the model as in Fig. 3, which holds because of the Pythagorean theorem.

Figure 3. The triangle of variance, bias, and MSE. The f and f̂ are the true and estimated models, respectively.

2.3. Measures for Ensemble of Models

If we have an ensemble of models (Polikar, 2012), we can have some similar definitions of bias and variance (e.g., see Appendix C in (Schapire et al., 1998)). Here, we assume the models are classifiers.

If P(.) denotes the probability, the expected error or prediction error (PE) of the model f̂_i is defined as:

    PE(f) := P(f̂_i ≠ y_i),   (18)

where f̂_i is the estimation of the trained model for the observation y_i (input x_i).

In the parentheses, it is required to mention that the Bayesian classifier is the optimal classifier because it can be seen as an ensemble of hypotheses (models) in the hypothesis (model) space, and no other ensemble of hypotheses can outperform it (see Chapter 6, Page 175 in (Mitchell, 1997)). In the literature, it is referred to as the Bayes optimal classifier. However, implementing the Bayesian classifier is difficult, so it is approximated by naive Bayes (Zhang, 2004).

Back to our main discussion, let f̂* denote the Bayes optimal prediction. Also, let the estimate of each of the models trained by an ensemble learning method (such as bagging or boosting, which will be introduced later) be denoted by f̂, which is trained using {(x_i, y_i)}_{i=1}^n. Finally, let f̂^m denote the classification using majority voting between the models. The bias and variance of the model can be defined as (Kong & Dietterich, 1995):

    Bias(f̂) := PE(f̂^m) − PE(f̂*),   (19)
    Var(f̂) := E(PE(f̂)) − PE(f̂^m).   (20)

There also exists another definition in the literature. Suppose the sample space of data is the union of two disjoint subsets U and B, which are the unbiased and biased sets with f̂_i^m = f̂_i^* and f̂_i^m ≠ f̂_i^*, respectively. We can define (Breiman, 1998):

    Bias(f̂_i) := P(f̂_i^* = y_i, x_i ∈ B) − E(P(f̂_i = y_i, x_i ∈ B)),   (21)
    Var(f̂_i) := P(f̂_i^* = y_i, x_i ∈ U) − E(P(f̂_i = y_i, x_i ∈ U)).   (22)

3. Mean Squared Error of the Estimation of Observations

Suppose we have an instance (x_0, y_0). This instance can be either a training or a test/validation instance. We will cover both cases. According to Eq. (15), the observation y_0 is:

    y_0 = f_0 + ε_0.   (23)

Assume the model's estimation of y_0 is f̂_0. According to Eq. (8), the MSE of the estimation is:

    E[(f̂_0 − y_0)²] =(23) E[(f̂_0 − f_0 − ε_0)²]
                    = E[(f̂_0 − f_0)² + ε_0² − 2ε_0(f̂_0 − f_0)]
                    = E[(f̂_0 − f_0)²] + E(ε_0²) − 2E[ε_0(f̂_0 − f_0)]
                    =(16) E[(f̂_0 − f_0)²] + σ² − 2E[ε_0(f̂_0 − f_0)].   (24)

The last term is:

    E[ε_0(f̂_0 − f_0)] =(23) E[(y_0 − f_0)(f̂_0 − f_0)].   (25)

For the calculation of this term, we have two cases: whether the instance (x_0, y_0) is (I) not in the training set or (II) in the training set. In other words, whether the instance was or was not used to train the model (estimator).

3.1. Case I: Instance not in the Training Set

Assume the instance (x_0, y_0) was not in the training set, i.e., it was not used for training the model; in other words, we have (x_0, y_0) ∉ T. This means that the estimation f̂_0 is independent of the observation y_0, because the observation was not used to train the model while the estimation is obtained from the model. Therefore:

    y_0 ⊥⊥ f̂_0 ⟹ (y_0 − f_0) ⊥⊥ (f̂_0 − f_0)
    ⟹ E[(y_0 − f_0)(f̂_0 − f_0)] =(a) E(y_0 − f_0) E(f̂_0 − f_0) =(b) 0 × E(f̂_0 − f_0) = 0,

where (a) is because (y_0 − f_0) ⊥⊥ (f̂_0 − f_0) and (b) is because:

    E(y_0 − f_0) = E(y_0) − E(f_0) =(c) f_0 − f_0 = 0,

where (c) is because of Eq. (17) and:

    E(y_0) =(23) E(f_0) + E(ε_0) = f_0 + 0 = f_0.

Therefore, in this case, the last term in Eq. (24) is zero. Thus:

    E[(f̂_0 − y_0)²] = E[(f̂_0 − f_0)²] + σ².   (26)

Suppose the number of instances which are not in the training set is m. We can sum the MSE over all the m instances:

    ∑_{i=1}^m (f̂_i − y_i)² = ∑_{i=1}^m (f̂_i − f_i)² + ∑_{i=1}^m σ²,   (27)

where the last term equals mσ². The first term, ∑_{i=1}^m (f̂_i − y_i)², is the observed error on these data; this error is referred to as the empirical error and is denoted by err. The second term, ∑_{i=1}^m (f̂_i − f_i)², is the error with respect to the true model; this error is referred to as the true error, test error, or generalization error and is denoted by Err. Therefore:

    Err = err − mσ².   (28)

Hence, in this case, the empirical error is a good estimation of the true error (up to the constant mσ²). Thus, we can minimize the empirical error in order to properly minimize the true error.
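A minimal numerical illustration of Eq. (26): on instances not in the training set, the expected observed squared error equals the true squared error plus σ². The model (a degree-3 polynomial fitted by least squares) and all the constants below are hypothetical choices for the sketch:

    import numpy as np

    rng = np.random.default_rng(2)
    f = np.sin
    sigma = 0.3

    x_tr = rng.uniform(0, 2 * np.pi, 30)
    y_tr = f(x_tr) + sigma * rng.standard_normal(30)

    # Fit a degree-3 polynomial on the training set (the estimated model).
    coef = np.polyfit(x_tr, y_tr, deg=3)

    # Held-out instances (Case I: not used for training).
    x_te = rng.uniform(0, 2 * np.pi, 200000)
    y_te = f(x_te) + sigma * rng.standard_normal(200000)
    f_hat = np.polyval(coef, x_te)

    observed = np.mean((f_hat - y_te) ** 2)      # err / m
    true_err = np.mean((f_hat - f(x_te)) ** 2)   # Err / m
    print(observed, true_err + sigma ** 2)       # approximately equal, Eq. (26)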

3.2. Case II: Instance in the Training Set

Consider a multivariate random variable R^d ∋ z = [z_1, ..., z_d]^⊤ whose components are independent random variables with normal distribution, i.e., z_i ∼ N(µ_i, σ²). Take R^d ∋ µ = [µ_1, ..., µ_d]^⊤ and let g(z) = [g_1, ..., g_d]^⊤, with g: R^d → R^d, be a function of the random variable z. There exists a lemma, named Stein's lemma, which states:

    E[(z − µ)^⊤ g(z)] = σ² ∑_{i=1}^d E(∂g_i/∂z_i),   (29)

which is used in Stein's Unbiased Risk Estimate (SURE) (Stein, 1981). See Appendix A in this tutorial paper for the proof of Eq. (29).

If the random variable is univariate, Stein's lemma becomes:

    E[(z − µ) g(z)] = σ² E(∂g(z)/∂z).   (30)

Suppose we take ε_0, 0, and f̂_0 − f_0 as the z, µ, and g(z), respectively. Using Eq. (30), the last term in Eq. (24) is:

    E[(ε_0 − 0)(f̂_0 − f_0)] = σ² E(∂(f̂_0 − f_0)/∂ε_0)
        =(a) σ² E(∂f̂_0/∂ε_0)
        =(b) σ² E(∂f̂_0/∂y_0 × ∂y_0/∂ε_0)
        =(c) σ² E(∂f̂_0/∂y_0),

where (a) is because the true model f is not dependent on the noise, (b) is because of the chain rule in derivatives, and (c) is because:

    y_0 = f_0 + ε_0 =(23)⟹ ∂y_0/∂ε_0 = 1.

Therefore, in this case, Eq. (24) is:

    E[(f̂_0 − y_0)²] = E[(f̂_0 − f_0)²] + σ² − 2σ² E(∂f̂_0/∂y_0).   (31)

Suppose the number of training instances is n. We can sum the MSE over all the n training instances:

    ∑_{i=1}^n (f̂_i − y_i)² = ∑_{i=1}^n (f̂_i − f_i)² + nσ² − 2σ² ∑_{i=1}^n ∂f̂_i/∂y_i.   (32)

The term on the left-hand side is the empirical error (denoted by err) and the first term on the right-hand side is the true error (denoted by Err). Therefore:

    Err = err − nσ² + 2σ² ∑_{i=1}^n ∂f̂_i/∂y_i,   (33)

which is the Stein's Unbiased Risk Estimate (SURE) (Stein, 1981).

The last term in Eq. (33) is a measure of the complexity (or overfitting) of the model. Note that ∂f̂_i/∂y_i means: if we move the i-th training instance, how much does the model's estimation of that instance change? This shows how complex or overfitted the model is. For better understanding, consider a line regressing a training set via the least squares problem. If we change a point, the line will not change significantly, because the model is not complex (it is underfitted). On the other hand, consider a regression model passing through "all" the points. If we move a training point, the regressing curve changes noticeably, because the model is very complex (overfitted). See Fig. 4, which illustrates the explained examples.

Figure 4. An example for a simple and a complex model: (a) a simple model, (b) the modified simple model after moving a training instance (shown in green), (c) a complex model, and (d) the modified complex model after moving a training instance (shown in green). The complex model is impacted significantly by moving the instance while the simple model is not that much affected.

According to Eq. (33), in the case where the instance is in the training set, the empirical error is not a good estimation of the true error. The reason is that minimization of err usually increases the complexity of the model, cancelling out the minimization of Err after some level of training.

3.3. Estimation of σ in MSE of the Model

The MSE of the model in both mentioned cases includes σ, which is the standard deviation of the noise. An unbiased estimation of the variance of the noise is:

    σ² ≈ (1/(n − 1)) ∑_{i=1}^n (y_i − f̂_i)²,   (34)

which uses the training observations and their estimations by the model. However, the model's estimations of the training observations are themselves dependent on the complexity of the model.

Because of this problem, in practice, we use an estimator with high bias and low variance in order to estimate σ, so that the estimation does not depend on the complexity of the model. For example, we use a line fitted to the training data using the least squares problem (linear regression) in order to have the estimations f̂_i of the y_i's, and then we use Eq. (34). Thereafter, for the sake of simplicity, we do not change σ (we assume it is fixed) when we change the complexity of the model.

4. Overfitting, Underfitting, and Generalization

If the model is trained in an extremely simple way so that its estimation has low variance but high bias, we have underfitting. Note that underfitting is also referred to as over-generalization. On the other hand, if the model is trained in an extremely complex way so that its estimation has high variance but low bias, we have overfitting. To summarize:

• in underfitting: low variance, high bias, and low complexity;
• in overfitting: high variance, low bias, and high complexity.
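This summary can be illustrated with a short Python sketch in which the model complexity is the degree of a polynomial fitted by least squares (the degrees, sample sizes, and noise level are arbitrary choices): the training error err keeps decreasing as the complexity grows, while the test error Err is smallest near the good fit, in line with the example of Fig. 5 discussed below.

    import numpy as np

    rng = np.random.default_rng(3)
    f = np.sin
    sigma = 0.3
    x_tr = np.sort(rng.uniform(0, 2 * np.pi, 20))
    y_tr = f(x_tr) + sigma * rng.standard_normal(20)
    x_te = np.sort(rng.uniform(0, 2 * np.pi, 1000))
    y_te = f(x_te) + sigma * rng.standard_normal(1000)

    for deg in [1, 4, 10]:   # underfit, good fit, overfit
        coef = np.polyfit(x_tr, y_tr, deg)
        err = np.mean((np.polyval(coef, x_tr) - y_tr) ** 2)   # training error
        Err = np.mean((np.polyval(coef, x_te) - y_te) ** 2)   # test error
        print(deg, err, Err)
    # err shrinks monotonically with the degree; Err is smallest near deg = 4.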

An example for underfitting, good fit, and overfitting is illustrated in Fig. 5. As this figure shows, in both underfitting and overfitting, the estimation of a test instance might be very weak, while with a good fit, the test instance, which was not seen in the training phase, is estimated well enough with a smaller error. The ability of the model to estimate the unseen test (out-of-sample) data is referred to as generalization. The lack of generalization is the reason why both overfitting and underfitting, especially overfitting, are not acceptable. In overfitting, the training error, i.e., err, is very small while the test (true) error, i.e., Err, is usually awful!

Figure 5. An example for (a) underfitting, (b) good fit, and (c) overfitting. The black circles and red square are training and test instances, respectively. The red curve is the fitted curve.

5. Cross Validation

5.1. Definition

In order to either (I) find out up to which complexity we should train the model or (II) tune the parameters of the model, we should use cross validation (Arlot & Celisse, 2010). In cross validation, we divide the dataset D into two partitions, i.e., a training set denoted by T and a test set denoted by R, where the union of these two subsets is the whole dataset and their intersection is the empty set:

    T ∪ R = D,   (35)
    T ∩ R = ∅.   (36)

The T is used for training the model. After the model is trained, the R is used for testing the performance of the model.

We have different methods for cross validation. Two of the most well-known methods for cross validation are K-fold cross validation and Leave-One-Out Cross Validation (LOOCV), shown in Algorithms 1 and 2.

1 Randomly split D into K partitions with almost equal sizes.
2 for k from 1 to K do
3     R ← Partition k from D.
4     T ← D \ R.
5     Use T to train the model.
6     Err_k ← Use the trained model to predict R.
7 Err ← (1/K) ∑_{k=1}^K Err_k

Algorithm 1: K-fold Cross Validation

1 for k from 1 to |D| = N do
2     R ← Take the k-th instance from D.
3     T ← D \ R.
4     Use T to train the model.
5     Err_k ← Use the trained model to predict R.
6 Err ← (1/|D|) ∑_{k=1}^{|D|} Err_k

Algorithm 2: Leave-One-Out Cross Validation
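A minimal Python sketch of Algorithms 1 and 2 follows. The fit/predict pair is kept generic; the degree-3 polynomial used at the end is only a hypothetical example model:

    import numpy as np

    def kfold_cv(x, y, K, fit, predict, seed=0):
        """Algorithm 1: average test error over K random folds."""
        rng = np.random.default_rng(seed)
        idx = rng.permutation(len(x))
        folds = np.array_split(idx, K)          # almost equal sizes
        errs = []
        for k in range(K):
            test = folds[k]
            train = np.concatenate([folds[j] for j in range(K) if j != k])
            model = fit(x[train], y[train])
            errs.append(np.mean((predict(model, x[test]) - y[test]) ** 2))
        return np.mean(errs)

    def loocv(x, y, fit, predict):
        """Algorithm 2: the special case of K-fold with K = N."""
        return kfold_cv(x, y, K=len(x), fit=fit, predict=predict)

    # Example usage with a degree-3 polynomial as the model:
    rng = np.random.default_rng(4)
    x = rng.uniform(0, 2 * np.pi, 50)
    y = np.sin(x) + 0.3 * rng.standard_normal(50)
    fit = lambda xt, yt: np.polyfit(xt, yt, 3)
    predict = lambda c, xq: np.polyval(c, xq)
    print(kfold_cv(x, y, 10, fit, predict), loocv(x, y, fit, predict))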

In K-fold cross validation, we randomly split the dataset D into K partitions {D_1, ..., D_K} where:

    |D_1| ≈ |D_2| ≈ ··· ≈ |D_K|,   (37)
    ∪_{i=1}^K D_i = D,   (38)
    D_i ∩ D_j = ∅, ∀i, j ∈ {1, ..., K}, i ≠ j,   (39)

where |.| denotes the cardinality of a set. Sometimes, the dataset D is shuffled before the cross validation for better randomization. Moreover, both simple random sampling without replacement and stratified sampling (Barnett, 1974) can be used for this splitting. The K-fold cross validation includes K iterations, in each of which one of the partitions is used as the test set and the rest of the data is used for training. The overall estimation error is the average test error over the iterations. Note that we usually have K = 2, 5, 10 in the literature, but K = 10 is the most common. The algorithm of K-fold cross validation is shown in Algorithm 1.

In LOOCV, we iterate |D| = N times, and in each iteration, we take one instance as the R (so that |R| = 1) and the rest of the instances as the training set. The overall estimation error is the average test error over the iterations. The algorithm of LOOCV is shown in Algorithm 2. Usually, when the size of the dataset is small, LOOCV is used in order to use most of the dataset for training and then test the model properly.

If we want to train the model and then test it, the cross validation should be done using training and test sets as explained. Note that the test set and the training set should be disjoint, i.e., T ∩ R = ∅; otherwise, we are introducing the whole or a part of the test instances to the model so that it learns them. Of course, in that way, the model will learn to estimate the test instances more easily and better; however, in real-world applications, the test data is not available at the time of training. Therefore, if we mistakenly have T ∩ R ≠ ∅, it is referred to as cheating in machine learning (we call it cheating #1 here).

In some cases, the model has some parameters which need to be determined. In this case, we split the data D into three subsets, i.e., a training set T, a test set R, and a validation set V. Usually, we have |T| > |R| and |T| > |V|. First, we want to find the best parameters. For this, the training set is used to train the model with different values of the parameters. For every value of the parameter(s), after the model is trained, it is tested on the validation set. This is performed for all desired values of the parameters. The parameter value resulting in the best estimation performance on the validation set is selected to be the value of the parameter(s). After finding the values of the parameters, the model is trained using the training set (where the found parameter value is used). Then, the model is tested on the test set, and the estimation performance is the average test error over the cross validation iterations.

In cross validation with a validation set, we have:

    T ∩ R = ∅, T ∩ V = ∅, V ∩ R = ∅.   (40)

The validation and test sets should be disjoint because the parameters of the model should not be optimized by testing on the test set. In other words, in real-world applications, the training and validation sets are available but the test set is not available yet. If we mistakenly have V ∩ R ≠ ∅, it is referred to as cheating in machine learning (we call it cheating #2 here). Note that this kind of mistake is unfortunately very common in the literature, where some people optimize the parameters by testing on the test set without having a validation set. Moreover, the training and test sets should be disjoint as explained beforehand; otherwise, that would be another kind of cheating in machine learning (introduced before as cheating #1). On the other hand, the training and validation sets should also be disjoint. Although having T ∩ V ≠ ∅ is not cheating, it should not be done, for a reason which will be explained later in this section.

In order to have a validation set in cross validation, we usually first split the dataset D into T′ and R, where T′ ∪ R = D and T′ ∩ R = ∅. Then, we split the set T′ into the training and validation sets, i.e., T ∪ V = T′ and T ∩ V = ∅, and usually |T| > |V|. The algorithms of K-fold cross validation and LOOCV can be modified accordingly to include the validation set. In LOOCV, we usually have |V| = 1.

5.2. Theory

Recall Eqs. (28) and (33), where the true error is related to the empirical error for test (not in the training set) and training instances, respectively. When the instance is in the training set, the true error, Err, and the training error, err, behave differently, as shown in Fig. 6-a. In the first stages of training, err and Err both decrease; however, after some training, the model becomes more complex and goes toward overfitting. At that stage, the Err starts to increase. We should end the training when Err starts to increase, because that stage is the good fit. Usually, in order to find out when to stop training, we train the model for one stage (e.g., iteration) and then test the trained model on the validation set, where the error is named Err. This is commonly used in training neural networks (Goodfellow et al., 2016), where Err is measured after every epoch, for example. For neural networks, we usually save a history of Err for the several last epochs, and if the overall pattern of Err is increasing, we stop and take the last best trained model with the least Err. We do this because in complex models such as neural networks, the curve of Err usually has some small fluctuations, and we do not want to misjudge the stopping criterion based on them. This procedure is named early stopping in neural networks (Prechelt, 1998), which will be explained later in Section 7.4.4.

Figure 6. The overfitting of the model: (a) training error and true error, (b) depiction of Eq. (33).

The reason why Err increases after a while of training is Eq. (33). Dropping the constant nσ² from that expression, we have Err = err + 2σ² ∑_{i=1}^n ∂f̂_i/∂y_i, where the term 2σ² ∑_{i=1}^n ∂f̂_i/∂y_i shows the model complexity. See Fig. 6-b, where both err and the model complexity are illustrated as functions of the training stages (iterations). According to Eq. (33), the Err is the summation of these two curves, which clarifies the reason for its behavior. That is why we should not train too much on the training set, because the model will get very fitted (biased) to the training set and will lose its ability to generalize to new unseen data.

Fig. 6-a and Eq. (33) show that it is better to have T ∩ V = ∅. Otherwise, for example if we have T = V, the Err will be equivalent to err and thus it will go down even in the overfitting stages. This is harmful to our training because we will not notice overfitting properly. Eq. (28) also explains why the error on the validation or test set is a good measure of the true error. That is why we can use the test or validation error in order to know up to what stage we can train the model without overfitting.

Finally, it is noteworthy to discuss the intersections of the training, test, and validation sets according to the above explanations and the previous subsection. If we have only training and test sets without a validation set:

• T ∩ R ≠ ∅ ⟹ cheating #1.

If we have training, test, and validation sets:

• T ∩ R ≠ ∅ ⟹ cheating #1;
• V ∩ R ≠ ∅ ⟹ cheating #2;
• T ∩ V ≠ ∅ ⟹ harmful to training (not noticing overfitting properly).

The first two items are advantageous to the model's performance on the test data, but they are cheating, and they may be disadvantageous for future new test data. The third item is disadvantageous to the model's performance on the test data because we may not find out about overfitting, or we may find it out late and the generalization error will become worse; therefore, it is better not to do it.

6. Generalized Cross Validation

In this section, we consider the model which estimates the observations {y_i}_{i=1}^N as:

    ŷ = Γy,   (41)

where ŷ = [ŷ_1, ..., ŷ_N]^⊤ and y = [y_1, ..., y_N]^⊤, assuming that the observations for the whole dataset are available. The Γ ∈ R^{N×N} is called the hat matrix because it puts a hat on y. An example of Γ is Γ = X(X^⊤X)^{−1}X^⊤, which is used in linear regression (Friedman et al., 2009). If γ_ij denotes the (i, j)-th element of Γ, the i-th element of ŷ can be stated as:

    ŷ_i = ∑_{j=1}^N γ_ij y_j.   (42)

Now assume that we remove the i-th instance for the sake of having LOOCV. Let ŷ_i^(−i) denote the model's estimate of y_i where the model is trained using D\{x_i} (i.e., using the entire data except the i-th instance). We can say:

    ŷ_i^(−i) = (∑_{j=1}^N γ_ij y_j) − γ_ii y_i + γ_ii ŷ_i^(−i),   (43)

which means that we remove the estimation of y_i by the model trained on the whole D and instead we put the estimation of the model trained on D\{x_i}. Adding y_i to the left- and right-hand sides of Eq. (43) and rearranging gives:

    y_i − ŷ_i^(−i) = (y_i − ŷ_i) / (1 − γ_ii).   (44)

Eq. (44) means that we can do LOOCV for the model of Eq. (41) without the need to iterate over the instances. We can train the model once using the whole D and then use Eq. (44) to find the error of every iteration of LOOCV. The overall scaled mean squared error of LOOCV, then, is:

    ∑_{i=1}^N (y_i − ŷ_i^(−i))² = ∑_{i=1}^N ((y_i − ŷ_i) / (1 − γ_ii))².   (45)

Suppose that we replace γ_ii by its average:

    (1/N) ∑_{i=1}^N γ_ii =(a) (1/N) tr(Γ) =(b) p/N,   (46)

where tr(.) is the trace of a matrix, (a) is because the trace is equivalent to the summation of the diagonal, and (b) assumes that the trace of the hat matrix is p. The p can be considered as the dimensionality of the subspace if Eq. (41) is considered as a projection onto a subspace.

Using Eq. (46) in Eq. (45) gives:

    ∑_{i=1}^N (y_i − ŷ_i^(−i))² = ∑_{i=1}^N ((y_i − ŷ_i) / (1 − p/N))².   (47)

Eq. (47) is referred to as generalized cross validation (Craven & Wahba, 1979; Golub et al., 1979). It is noteworthy that generalized cross validation can also be related to SURE (Stein, 1981), which was introduced before (see (Li, 1985)).
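The shortcut of Eqs. (44) and (45) can be checked numerically for linear regression, where the hat matrix is Γ = X(X^⊤X)^{−1}X^⊤. The following Python sketch (with arbitrary sizes and coefficients) compares the LOOCV residuals computed from a single fit against explicit retraining, and then applies the approximation of Eq. (47):

    import numpy as np

    rng = np.random.default_rng(5)
    N, d = 40, 3
    X = np.column_stack([np.ones(N), rng.standard_normal((N, d))])
    y = X @ rng.standard_normal(d + 1) + 0.5 * rng.standard_normal(N)

    G = X @ np.linalg.inv(X.T @ X) @ X.T        # hat matrix, Eq. (41)
    resid = y - G @ y

    # Shortcut of Eq. (44): LOOCV residuals without refitting.
    loo_short = resid / (1 - np.diag(G))

    # Explicit LOOCV for comparison.
    loo_exact = np.empty(N)
    for i in range(N):
        keep = np.arange(N) != i
        beta = np.linalg.lstsq(X[keep], y[keep], rcond=None)[0]
        loo_exact[i] = y[i] - X[i] @ beta
    print(np.allclose(loo_short, loo_exact))    # True

    # Eq. (47): generalized cross validation, gamma_ii replaced by tr(G)/N.
    gcv = np.sum((resid / (1 - np.trace(G) / N)) ** 2)
    print(gcv)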

7. Regularization

7.1. Definition

We can minimize the true error, Err, using optimization. According to Eq. (33), we have:

    minimize  err − nσ² + 2σ² ∑_{i=1}^n ∂f̂_i/∂y_i.   (48)

As the term nσ² is a constant, we can drop it. Moreover, the calculation of ∂f̂_i/∂y_i is usually very difficult; therefore, we usually use a penalty term in its place, where the penalty increases as the complexity of the model increases, in order to imitate the behavior of ∂f̂_i/∂y_i. Therefore, the optimization can be written as a regularized optimization problem:

    minimize_x  J̃(x; θ) := J(x; θ) + α Ω(x),   (49)

where θ is the parameter(s) of the cost function, J(.) is the objective err to be minimized, Ω(.) is the penalty function representing the complexity of the model, α > 0 is the regularization parameter, and J̃(.) is the regularized objective function.

The penalty function can be different things, such as the ℓ2 norm (Friedman et al., 2009), the ℓ1 norm (Tibshirani, 1996; Schmidt, 2005), the ℓ2,1 norm (Chang), etc. The ℓ1 and ℓ2,1 norms are useful for having sparsity (Bach et al., 2011; 2012). Sparsity is very effective because of the "bet on sparsity" principle: "Use a procedure that does well in sparse problems, since no procedure does well in dense problems" (Friedman et al., 2009; Tibshirani et al., 2015). The effectiveness of sparsity can also be explained by Occam's razor (Domingos, 1999), stating that "simpler solutions are more likely to be correct than complex ones" or "simplicity is a goal in itself".

Note that in Eqs. (48) and (49), we are minimizing the Err (i.e., J̃(x; θ)) and not err (i.e., J(x; θ)). As discussed in Sections 4 and 5, minimizing err results in overfitting. Therefore, regularization helps avoid overfitting.

7.2. Theory for ℓ2 Norm Regularization

In this section, we briefly explain the theory behind the ℓ2 norm regularization (Friedman et al., 2009), which is:

    minimize_x  J̃(x; θ) := J(x; θ) + (α/2) ||x||₂².   (50)

The ℓ2 norm regularization is also referred to as ridge regression or Tikhonov regularization (Goodfellow et al., 2016).

Suppose x* is the minimizer of J(x; θ), i.e.:

    ∇J(x*; θ) = 0.   (51)

The Taylor series expansion of J(x; θ) up to the second derivative at x* gives:

    Ĵ(x; θ) ≈ J(x*; θ) + ∇J(x*; θ)^⊤(x − x*) + (1/2)(x − x*)^⊤ H (x − x*)
            = J(x*; θ) + (1/2)(x − x*)^⊤ H (x − x*),   (52)

where H ∈ R^{d×d} is the Hessian. Using this Taylor approximation in the cost gives us (Goodfellow et al., 2016):

    J̃(x; θ) = Ĵ(x; θ) + (α/2) ||x||₂²
            = J(x*; θ) + (1/2)(x − x*)^⊤ H (x − x*) + (α/2) ||x||₂²,
    ∂J̃(x; θ)/∂x = 0 + H(x† − x*) + α x† =set 0
    ⟹ (H + αI) x† = Hx*
    ⟹ x† = (H + αI)^{−1} H x*,   (53)

where x† is the minimizer of J̃(x; θ). Note that in the calculations we take ∂J(x*; θ)/∂x = 0 because J(x*; θ) is a constant with respect to x. Eq. (53) makes sense because if α = 0, which means we do not have the regularization term, we will have x† = x*. This means that the minimizer of J̃(x; θ) will be the same as the minimizer of J(x; θ), which is correct according to Eq. (50) where α = 0.

If we apply Singular Value Decomposition (SVD) on the Hessian matrix, we will have:

    H = UΛU^⊤,   (54)

where the left and right matrices of singular vectors are equivalent because the Hessian matrix is symmetric. Using this decomposition in Eq. (53) gives us:

    x† = (UΛU^⊤ + αI)^{−1} UΛU^⊤ x*
       =(a) (UΛU^⊤ + UU^⊤αI)^{−1} UΛU^⊤ x*
       =(b) (UΛU^⊤ + UαIU^⊤)^{−1} UΛU^⊤ x*
       = (U(Λ + αI)U^⊤)^{−1} UΛU^⊤ x*
       =(c) U(Λ + αI)^{−1} U^{−1}U ΛU^⊤ x*
       = U(Λ + αI)^{−1} ΛU^⊤ x*,   (55)

where (a) and (c) are because U is an orthogonal matrix, so we have U^{−1} = U^⊤, which yields U^⊤U = I and UU^⊤ = I (because U is not truncated), and (b) is because α is a scalar and can move between the multiplications of matrices. Eq. (55) means that we rotate x* by U^⊤x*, then manipulate it with the term (Λ + αI)^{−1}Λ, and finally rotate it back with U.

Based on Eq. (55), we can have the following interpretations:

• If α = 0, we have:

    x† = U (Λ^{−1}Λ) U^⊤ x* = UU^⊤ x* =(a) x*,

where (a) is because U is a non-truncated orthogonal matrix. This means that if we do not have the penalty term, the minimizer of J̃(x; θ) is the minimizer of J(x; θ), as expected. In other words, we are rotating the solution x* by U^⊤ and then rotating it back by U.

• If α ≠ 0, the term (Λ + αI)^{−1}Λ is:

    (Λ + αI)^{−1}Λ = diag(λ_1/(λ_1 + α), λ_2/(λ_2 + α), ..., λ_d/(λ_d + α)),

where Λ = diag([λ_1, ..., λ_d]^⊤). Therefore, for the j-th direction of the Hessian, we have the factor λ_j/(λ_j + α).

  – If λ_j ≫ α, we will have λ_j/(λ_j + α) ≈ 1, so for the j-th direction we have (Λ + αI)^{−1}Λ ≈ I; therefore, x† ≈ x*. This makes sense because λ_j ≫ α means that the j-th direction of the Hessian, and thus the j-th direction of J(x; θ), is large enough to be effective. Therefore, the penalty is roughly ignored with respect to it.

  – If λ_j ≪ α, we will have λ_j/(λ_j + α) ≈ 0, so for the j-th direction we have (Λ + αI)^{−1}Λ ≈ 0; therefore, x† ≈ 0. This makes sense because λ_j ≪ α means that the j-th direction of the Hessian, and thus the j-th direction of J(x; θ), is small and not effective. Therefore, the penalty shrinks that direction to almost zero.

Therefore, the ℓ2 norm regularization keeps the effective directions but shrinks the weak directions to zero. Note that the following measure is referred to as the effective number of parameters or degrees of freedom (Friedman et al., 2009):

    ∑_{j=1}^d λ_j/(λ_j + α),   (56)

because it counts the number of effective directions, as discussed above. Moreover, the term λ_j/(λ_j + α) or (Λ + αI)^{−1}Λ is called the shrinkage factor because it shrinks the weak directions.

7.3. Theory for ℓ1 Norm Regularization

As explained before, sparsity is very useful and effective. If x = [x_1, ..., x_d]^⊤, for having sparsity, we should use subset selection for the regularization:

    minimize_x  J̃(x; θ) := J(x; θ) + α ||x||₀,   (57)

where:

    ||x||₀ := ∑_{j=1}^d I(x_j ≠ 0),   (58)

with I(x_j ≠ 0) equal to zero if x_j = 0 and one if x_j ≠ 0, is the "ℓ0" norm, which is not a true norm (so we use quotation marks for it) because it does not satisfy the norm properties (Boyd & Vandenberghe, 2004). The "ℓ0" norm counts the number of non-zero elements, so when we penalize it, it means that we want to have sparser solutions with many zero entries. According to (Donoho, 2006), the convex relaxation of the "ℓ0" norm (subset selection) is the ℓ1 norm. Therefore, we write the regularized optimization as:

    minimize_x  J̃(x; θ) := J(x; θ) + α ||x||₁.   (59)

Note that the ℓ1 regularization is also referred to as lasso regularization (Tibshirani, 1996). Different methods exist for solving optimization problems having the ℓ1 norm, such as its proximal algorithm using soft thresholding (Parikh & Boyd, 2014) and coordinate descent (Wright, 2015; Wu & Lange, 2008). Here, we explain solving the optimization using the coordinate descent algorithm.

The idea of the coordinate descent algorithm is similar to the idea of Gibbs sampling (Casella & George, 1992), where we work on the dimensions of the variable one by one. Similar to what we did for obtaining Eq. (53), we have:

    J̃(x; θ) = Ĵ(x; θ) + α ||x||₁
            = J(x*; θ) + (1/2)(x − x*)^⊤ H (x − x*) + α ||x||₁.

For simplicity in deriving an interpretable expression, we assume that the Hessian matrix is diagonal (Goodfellow et al., 2016). For coordinate descent, we look at the j-th coordinate (dimension):

    J̃(x_j; θ) = Ĵ(x_j; θ) + α |x_j|
             = J(x_j*; θ) + (1/2)(x_j − x_j*)² h_j + α |x_j| + c,

where x = [x_1, ..., x_d]^⊤, x* = [x_1*, ..., x_d*]^⊤, h_j is the (j, j)-th element of the diagonal Hessian matrix, and c is a constant term with respect to x_j (not dependent on x_j). Taking the derivative with respect to x_j gives us:

    ∂J̃(x_j; θ)/∂x_j = 0 + (x_j − x_j*) h_j + α sign(x_j) =set 0
    ⟹ x_j† = x_j* − (α/h_j) sign(x_j) = { x_j* − α/h_j   if x_j > 0,
                                          x_j* + α/h_j   if x_j < 0,   (60)

which is a soft thresholding function. This function is depicted in Fig. 7. As can be seen in this figure, if |x_j*| < α/h_j, the solution to the regularized problem, i.e., x_j†, is zero. Recall that in ℓ2 norm regularization, we shrank the weak solutions close to zero; however, here in ℓ1 norm regularization, we set the weak solutions exactly to zero. That is why the solutions are relatively sparse in ℓ1 norm regularization. Notice that in ℓ1 norm regularization, as shown in Fig. 7, even the strong solutions are shrunk a little (from the x_j† = x_j* line), a fact that we also had in ℓ2 norm regularization.

Figure 7. The soft thresholding function.

Another intuition for why the ℓ1 norm regularization is sparse is illustrated in Fig. 8 (Tibshirani, 1996). As this figure shows, the objective J(x; θ) has some contour levels like a bowl (if it is convex). The regularization term is also a norm ball, which is a sphere bowl (cone) for the ℓ2 norm and a diamond bowl (cone) for the ℓ1 norm (Boyd & Vandenberghe, 2004). As Fig. 8 shows, for ℓ2 norm regularization, the objective and the penalty term contact at a point where some of the coordinates might be small; however, for the ℓ1 norm, the contact point can be at a point where some variables are exactly zero. This again shows the reason for sparsity in ℓ1 norm regularization.

Figure 8. The unit balls for ℓ1 and ℓ2 norm regularizations: (a) ℓ2 norm regularization and (b) ℓ1 norm regularization. The green curves are the contour levels of the non-regularized objective function. The red balls show the unit balls for the norm penalties. The data are assumed to be two dimensional. A third dimension can be imagined for the value of the cost function.

7.4. Examples in Machine Learning: Regression, Weight Decay, Noise Injection, and Early Stopping

7.4.1. LINEAR, RIDGE, AND LASSO REGRESSION

Let X = [1, [x_1, ..., x_n]^⊤] ∈ R^{n×(d+1)} and β ∈ R^{d+1}. In linear regression, the optimization is (Friedman et al., 2009):

    minimize_β  ||y − Xβ||₂².   (61)

The result of this optimization is:

    β = (X^⊤X)^{−1} X^⊤ y.   (62)

We can penalize the regression coefficients using ℓ2 norm regularization. This is referred to as ridge regression, whose optimization is (Friedman et al., 2009):

    minimize_β  ||y − Xβ||₂² + (α/2) ||β||₂².   (63)
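The two shrinkage behaviors derived above can be contrasted in a few lines of Python: the soft-thresholding function of Eq. (60) (assuming the diagonal-Hessian setting of Section 7.3) sets weak coordinates exactly to zero, while the ℓ2 shrinkage factor λ_j/(λ_j + α) of Section 7.2 only pulls them toward zero. The numbers below are arbitrary illustrative values:

    import numpy as np

    def soft_threshold(x_star, alpha, h=1.0):
        """Eq. (60): soft thresholding with diagonal Hessian entry h."""
        return np.sign(x_star) * np.maximum(np.abs(x_star) - alpha / h, 0.0)

    def ridge_shrink(x_star, alpha, lam):
        """l2 shrinkage factor lam/(lam + alpha) per direction (Sec. 7.2)."""
        return lam / (lam + alpha) * x_star

    x_star = np.array([-3.0, -0.5, 0.2, 1.0, 4.0])   # unregularized minimizers
    print(soft_threshold(x_star, alpha=1.0))   # weak entries become exactly 0
    print(ridge_shrink(x_star, alpha=1.0, lam=1.0))  # all shrunk, none zero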

The result of the optimization in Eq. (63) is:

    β = (X^⊤X + αI)^{−1} X^⊤ y.   (64)

Note that one intuition of ridge regression is that adding αI strengthens the main diagonal of X^⊤X in order to make it full-rank and non-singular for inversion.

We can also have ℓ1 norm regularization, named lasso regression (Tibshirani, 1996), which makes the coefficients sparse. The optimization of lasso regression is:

    minimize_β  ||y − Xβ||₂² + α ||β||₁,   (65)

which does not have a closed-form solution but has an iterative solution, as explained before.

7.4.2. WEIGHT DECAY

Recall Eq. (50). If we replace the objective variable x with the vector of neural network weights w, we will have:

    minimize_w  J̃(w; θ) := J(w; θ) + (α/2) ||w||₂²,   (66)

which can be the optimized loss function in a neural network (Goodfellow et al., 2016). Penalizing the weights with regularization is referred to as weight decay (Krogh & Hertz, 1992; Chiu et al., 1994). This penalty prevents the neural network from becoming too non-linear (complex) and thus overfitted. The reason is that, in non-linear activation functions such as the hyperbolic tangent, very large weights (very positive or very negative) fall in the very non-linear parts of the activation functions. Although a neural network should not be completely linear, in order to be able to learn non-linear patterns, it should not be very non-linear either, so as not to overfit to the training data. Penalizing the weights makes the weights relatively small (where the activation functions are almost linear) in order to have a balance between linearity and non-linearity. According to Eq. (55), the result of Eq. (66) is:

    w† = U(Λ + αI)^{−1} ΛU^⊤ w*,   (67)

which has similar interpretations to those we discussed before.
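The name weight decay can be seen directly from the gradient step. Below is a minimal sketch (the quadratic toy cost and all constants are hypothetical): the penalty in Eq. (66) adds αw to the gradient, so each update first decays the weights by the factor (1 − ηα):

    import numpy as np

    def sgd_step_weight_decay(w, grad_J, eta, alpha):
        """One gradient step on Eq. (66): J(w) + (alpha/2)||w||^2.
        The penalty contributes alpha*w to the gradient, so the weights
        are first decayed by (1 - eta*alpha) and then updated as usual."""
        return (1.0 - eta * alpha) * w - eta * grad_J(w)

    # Toy usage with a hypothetical quadratic cost J(w) = 0.5*||w - 1||^2:
    grad_J = lambda w: w - 1.0
    w = np.array([5.0, -5.0])
    for _ in range(200):
        w = sgd_step_weight_decay(w, grad_J, eta=0.1, alpha=0.5)
    print(w)   # both entries converge to 1/(1+alpha) = 2/3, shrunk toward 0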

7.4.3. NOISE INJECTION TO INPUT IN NEURAL NETWORKS

In training neural networks, it is beneficial to add noise to the input (Matsuoka, 1992). One perspective on why adding noise to the input helps better training of the network is data augmentation (Van Dyk & Meng, 2001; DeVries & Taylor, 2017). Data augmentation is useful for training deep networks because they have a huge number of weights (parameters), and if we do not introduce enough training data to them, they will overfit to the training data.

Another interpretation of noise injection to the input is regularization (Grandvalet et al., 1997; Goodfellow et al., 2016). Assume that the optimization of the neural network is:

    minimize_w  J := E[(ŷ(x) − y)²],   (68)

where x, ŷ(x), and y are the input, the estimation (output) of the network, and the training label, respectively. We add noise ε ∼ N(0, σ²I) to the input, so the objective function changes to:

    J̃ := E[(ŷ(x + ε) − y)²]
       = E[ŷ²(x + ε) − 2yŷ(x + ε) + y²]
       = E[ŷ²(x + ε)] − 2E[yŷ(x + ε)] + E(y²).

Assuming that the variance of the noise is small, the Taylor series expansion of ŷ(x + ε) is:

    ŷ(x + ε) = ŷ(x) + ε^⊤ ∇_x ŷ(x) + (1/2) ε^⊤ ∇²_x ŷ(x) ε + o(ε³).

Therefore:

    J̃ ≈ E[(ŷ(x) + ε^⊤∇_x ŷ(x) + (1/2) ε^⊤∇²_x ŷ(x) ε)²]
       − 2E[yŷ(x) + yε^⊤∇_x ŷ(x) + (1/2) yε^⊤∇²_x ŷ(x) ε] + E(y²)
       = E[ŷ(x)² + y² − 2yŷ(x)] − 2E[(1/2) yε^⊤∇²_x ŷ(x) ε]
         + E[ŷ(x) ε^⊤∇²_x ŷ(x) ε + (ε^⊤∇_x ŷ(x))²] + o(ε³).

The first term, E[ŷ(x)² + y² − 2yŷ(x)] = E[(ŷ(x) − y)²], is the loss function before adding the noise to the input, according to Eq. (68). Also, because of ε ∼ N(0, σ²I), we have E(εε^⊤) = σ²I. As the noise and the input are independent, the following term is simplified as:

    E[(ε^⊤∇_x ŷ(x))²] =(⊥⊥) σ² E(||∇_x ŷ(x)||₂²),

and the rest of the expression is simplified as:

    E[ŷ(x) ε^⊤∇²_x ŷ(x) ε] − 2E[(1/2) yε^⊤∇²_x ŷ(x) ε] = σ² E[(ŷ(x) − y) ∇²_x ŷ(x)].

Hence, the overall loss function after noise injection to the input is simplified to:

    J̃ ≈ J + σ² E[(ŷ(x) − y) ∇²_x ŷ(x)] + σ² E(||∇_x ŷ(x)||₂²),   (69)

which is a regularized optimization problem with an ℓ2 norm penalty (see Eq. (50)). The penalty is on the derivatives of the output of the neural network. This means that we do not want to have significant changes in the output of the neural network. This penalization prevents overfitting.

Note that the technique of adding noise to the input is also used in denoising autoencoders (Vincent et al., 2008). Moreover, an overcomplete autoencoder with one hidden layer (Goodfellow et al., 2016) (where the number of hidden neurons is greater than the dimension of the data) needs a noisy input; otherwise, the mapping in the autoencoder will just copy the input to the output without learning a latent space.

It is also noteworthy that injecting noise into the weights of a neural network (Goodfellow et al., 2016; Ho et al., 2008) can be interpreted similarly to injecting noise into the input. Therefore, noise injection to the weights can also be interpreted as regularization, where the regularization penalty term is σ² E(||∇_w ŷ(x)||₂²), with w the vector of weights (Goodfellow et al., 2016).

7.4.4. EARLY STOPPING IN NEURAL NETWORKS

As we mentioned in the explanations of Fig. 6-a, we train the neural network up to the point where overfitting is starting. This is referred to as early stopping (Prechelt, 1998; Yao et al., 2007), which helps avoid overfitting (Caruana et al., 2001).

According to Eq. (52), we have:

    ∇_w Ĵ(w) ≈ ∇_w J(w*) + H(w − w*) =(51) H(w − w*).

The gradient descent (with η as the learning rate) used in the back-propagation of the neural network is (Boyd & Vandenberghe, 2004):

    w^(t) := w^(t−1) − η ∇_w Ĵ(w^(t−1))
           = w^(t−1) − η H (w^(t−1) − w*)
    ⟹ w^(t) − w* = (I − ηH)(w^(t−1) − w*),

where t is the index of the iteration. According to Eq. (54), we have:

    w^(t) − w* = (I − η UΛU^⊤)(w^(t−1) − w*).

Assuming the initial weights are w^(0) = 0, we have:

    w^(1) − w* = −(I − η UΛU^⊤) w*
    ⟹ w^(1) = (I − (I − η UΛU^⊤)) w*
    =(a)⟹ w^(1) = (UU^⊤ − (UU^⊤ − η UΛU^⊤)) w*
    ⟹ w^(1) = U(I − (I − η Λ)) U^⊤ w*,

where (a) is because U is a non-truncated orthogonal matrix, so UU^⊤ = I. By induction, we have:

    w^(t) = U(I − (I − η Λ)^t) U^⊤ w*
    ⟹ U^⊤ w^(t) = U^⊤U (I − (I − η Λ)^t) U^⊤ w*
    =(a)⟹ U^⊤ w^(t) = (I − (I − η Λ)^t) U^⊤ w*,   (70)

where (a) is because U is an orthogonal matrix, so U^⊤U = I.

On the other hand, recall Eq. (67):

    w† = U(Λ + αI)^{−1} ΛU^⊤ w*
    ⟹ U^⊤ w† = (Λ + αI)^{−1} ΛU^⊤ w*
    =(a)⟹ U^⊤ w† = (I − (Λ + αI)^{−1} α) U^⊤ w*,   (71)

where (a) is because of an expression rearrangement asserted in (Goodfellow et al., 2016). Comparing Eqs. (70) and (71) shows that early stopping can be seen as an ℓ2 norm regularization or weight decay (Goodfellow et al., 2016). Actually, Eqs. (70) and (71) are equivalent if:

    (I − η Λ)^t = (Λ + αI)^{−1} α,   (72)

for some η, t, and α. If we take the logarithm of these expressions and use the Taylor series expansion for log(1 + x), we have:

    log(I − η Λ)^t = t log(I − η Λ) ≈ −t (ηΛ + (1/2)η²Λ² + (1/3)η³Λ³ + ···),   (73)

    log((Λ + αI)^{−1} α) = −log(Λ + αI) + log α = −log(α(I + (1/α)Λ)) + log α
        = −log α − log(I + (1/α)Λ) + log α
        ≈ −(1/α)Λ + (1/(2α²))Λ² − (1/(3α³))Λ³ + ···.   (74)

Equating Eqs. (73) and (74) because of Eq. (72) gives us:

    α ≈ 1/(t η),   t ≈ 1/(α η),   (75)

which shows that the inverse of the number of iterations is proportional to the weight decay (ℓ2 norm) regularization parameter. In other words, the more training iterations we have, the less we are penalizing the weights and the more the network might get overfitted.

Moreover, some empirical studies (Zur et al., 2009) show that noise injection and weight decay are more effective than early stopping for avoiding overfitting, although early stopping has its own merits.

8. Bagging

8.1. Definition

Bagging is short for bootstrap aggregating, first proposed by (Breiman, 1996). It is a meta algorithm which can be used with any model (classifier, regression, etc.).

The definition of bootstrapping is as follows. Suppose we have a sample {x_i}_{i=1}^n with size n, where f(x) is the unknown distribution of the sample, i.e., x_i ∼iid f(x). We would like to sample from this distribution but we do not know f(x). Approximating the sampling from the distribution by randomly sampling from the available sample is named bootstrapping. In bootstrapping, we use simple random sampling with replacement. The drawn sample is named a bootstrap sample.

In bagging, we draw k bootstrap samples, each with some sample size. Then, we train the model h_j using the j-th bootstrap sample, ∀j ∈ {1, ..., k}. Hence, we have k trained models rather than one model. Finally, we aggregate the results of the estimations of the k models for an instance x:

    f̂(x) = (1/k) ∑_{j=1}^k h_j(x).   (76)

If the model is a classifier, we should probably use the sign function:

    f̂(x) = sign((1/k) ∑_{j=1}^k h_j(x)).   (77)
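A minimal sketch of Eq. (76) for a regression model follows, together with a crude Monte Carlo comparison of the variance of a single model versus the bagged model (the polynomial base model, the number of bootstrap samples, and the query point are arbitrary choices; the variance reduction is quantified theoretically in the next subsection and depends on how correlated the bootstrap models are):

    import numpy as np

    rng = np.random.default_rng(6)

    def bagged_predict(x_tr, y_tr, x_query, k=25, deg=7):
        """Eq. (76): average the predictions of k models, each trained on
        a bootstrap sample (simple random sampling with replacement)."""
        preds = []
        for _ in range(k):
            idx = rng.integers(0, len(x_tr), len(x_tr))   # bootstrap sample
            coef = np.polyfit(x_tr[idx], y_tr[idx], deg)
            preds.append(np.polyval(coef, x_query))
        return np.mean(preds, axis=0)

    # Variance comparison at one query point over repeated training sets:
    single, bagged = [], []
    for _ in range(200):
        x = rng.uniform(0, 2 * np.pi, 40)
        y = np.sin(x) + 0.3 * rng.standard_normal(40)
        single.append(np.polyval(np.polyfit(x, y, 7), np.pi / 4))
        bagged.append(bagged_predict(x, y, np.pi / 4))
    print(np.var(single), np.var(bagged))  # bagged variance typically smaller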
8.2. Theory

Let e_j denote the error of the j-th model in estimating the observation of an instance. Suppose this error is a random variable with normal distribution having mean zero, i.e., e_j ∼iid N(0, s) with s := σ². We denote the covariance of the estimations of two trained models using two different bootstrap samples by c. Therefore, we have:

    E(e_j²) = s ⟹ Var(e_j) = E(e_j²) − (E(e_j))² = s − 0 = s ⟹ Var(h_j(x)) = s,   (78)

    E(e_j e_ℓ) = c ⟹ Cov(e_j, e_ℓ) = E(e_j e_ℓ) − E(e_j)E(e_ℓ) = c − (0 × 0) = c ⟹ Cov(h_j(x), h_ℓ(x)) = c,   (79)

for all j, ℓ ∈ {1, ..., k}, j ≠ ℓ.

According to Eqs. (76), (78), and (79), we have:

    Var(f̂(x)) = (1/k²) Var(∑_{j=1}^k h_j(x))
              =(13) (1/k²) ∑_{j=1}^k Var(h_j(x)) + (1/k²) ∑_{j=1}^k ∑_{ℓ=1, ℓ≠j}^k Cov(h_j(x), h_ℓ(x))
              = (1/k²) ks + (1/k²) k(k − 1)c = (1/k)s + ((k − 1)/k)c.   (80)

The obtained expression has an interesting interpretation: if two models trained on two different bootstrap samples are very correlated, we will have c ≈ s, thus:

    lim_{c→s} Var(f̂(x)) = (1/k)s + ((k − 1)/k)s = s,   (81)

and if the two trained models are very different (uncorrelated), we will have c ≈ 0, hence:

    lim_{c→0} Var(f̂(x)) = (1/k)s + ((k − 1)/k) × 0 = (1/k)s.   (82)

This means that if the trained models are very correlated in bagging, there is not any difference from using only one model; however, if we have different trained models, the variance of the estimation improves significantly, by the factor of k. This also implies that bagging is never destructive; it either is not effective or improves the estimation in terms of variance (Bühlmann & Yu, 2000; Breiman, 1996).

Figure 5 shows that the more complex model usually has more variance and less bias. This trade-off is shown in Fig. 6. Therefore, more variance corresponds to overfitting. As bagging helps decrease the variance of the estimation, it helps prevent overfitting. Therefore, bagging is a meta algorithm useful for having less variance and not getting overfitted (Breiman, 1998). Moreover, as will also be mentioned in Section 9.3.1, bagging can be seen as an ensemble learning method (Polikar, 2012), which is useful because of model averaging (Hoeting et al., 1999; Claeskens & Hjort, 2008).

8.3. Examples in Machine Learning: Random Forest and Dropout

8.3.1. RANDOM FOREST

One of the examples of using bagging in machine learning is random forest (Liaw & Wiener, 2002). In random forest, we train different models (trees) using different bootstrap samples (subsets of the training set). However, as the trees work similarly, they will be very correlated. For the already explained reason, this will not yield a significant improvement over using one tree. Random forest addresses this issue by also sampling from the features (dimensions) of the bootstrap sample. This makes the trained trees very different and thus results in a noticeable improvement.

8.3.2. DROPOUT

Another example of bagging is dropout in neural networks (Srivastava et al., 2014). According to dropout, in every iteration of the training phase, the neurons are randomly removed with probability p = 0.5, i.e., we sample from a Bernoulli distribution. This makes the training phase like training different neural networks, as we have different models in bagging. At test time, all the neurons are used, but their outputs are multiplied by p. This imitates the model averaging of bagging in Eq. (76). That is why dropout prevents the neural network from overfitting. Another intuition for why dropout works is that it makes the neural network sparse, which is very effective because of the principle of sparsity (Friedman et al., 2009; Tibshirani et al., 2015) or Occam's razor (Domingos, 1999), introduced before.

8.4. Examples in Computer Vision: HOG and SSD

8.4.1. HISTOGRAM OF ORIENTED GRADIENTS

An example of bagging is the Histogram of Oriented Gradients (HOG) (Dalal & Triggs, 2005) used in computer vision, especially for human detection in images. In HOG, different cells or blocks are used, each of which includes a histogram of the gradients of a sub-region of the image. Finally, using bagging, the histograms are combined into one histogram. The effectiveness of HOG is because of the effectiveness of bagging.

8.4.2. SINGLE SHOT MULTI-BOX DETECTOR

The Single Shot multi-box Detector (SSD) (Liu et al., 2016) is another usage of bagging in computer vision and object detection using deep neural networks (LeCun et al., 2015; Goodfellow et al., 2016). In SSD, a set of bounding boxes (i.e., the models in bagging) with different sizes are used, which are processed and learned using convolutional layers in the neural network. Some of the boxes are matched, and their weighted summation is used as the loss function of the neural network to optimize.

9. Boosting

9.1. Definition

Boosting is a meta algorithm which can be used with any model (classifier, regression, etc.). For binary classification, for example, if we use boosting with a classifier even slightly better than flipping a coin, we will have a strong classifier (we will explain the reason later in Section 9.3.1). Thus, we can say boosting makes the estimation or classification very strong. In other words, boosting addresses the question whether a strong classifier can be obtained from a set of weak classifiers (Kearns, 1988; Kearns & Valiant, 1994).

The idea of boosting is to learn k models in a hierarchy, where every model gives more attention (a larger weight) to the instances misclassified (or estimated very badly) by the previous model. Figure 9 shows this hierarchy. Finally, the overall estimation or classification is a weighted summation (average) of the k estimations. For an instance x, we have:

    f̂(x) = ∑_{j=1}^k α_j h_j(x).   (83)

If the model is a classifier, we should probably use the sign function:

    f̂(x) = sign(∑_{j=1}^k α_j h_j(x)),   (84)

which is equivalent to majority voting among the trained classifiers.

1 Initialize w_i = 1/n, ∀i ∈ {1, ..., n}
2 for j from 1 to k do
3     h_j(x) = arg min L_j
4     α_j = log((1 − L_j) / L_j)
5     w_i = w_i exp(α_j I(y_i ≠ h_j(x_i)))

Algorithm 3: The AdaBoost Algorithm
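A minimal Python sketch of Algorithm 3 follows, using decision stumps on one-dimensional data as the weak classifiers (the stump family and the toy dataset are assumptions for illustration; AdaBoost itself works with any weak learner whose weighted error is strictly between 0 and 0.5):

    import numpy as np

    def fit_stump(x, y, w):
        """Weighted decision stump on 1-D data: pick the threshold and
        polarity minimizing the weighted error L_j of Eq. (85)."""
        best = (np.inf, 0.0, 1)
        for t in np.unique(x):
            for s in (1, -1):
                pred = s * np.sign(x - t + 1e-12)
                L = np.sum(w * (pred != y)) / np.sum(w)
                if L < best[0]:
                    best = (L, t, s)
        return best            # (error, threshold, polarity)

    def adaboost(x, y, k):
        """Algorithm 3, with binary labels y in {-1, +1}."""
        w = np.full(len(x), 1.0 / len(x))
        models = []
        for _ in range(k):
            L, t, s = fit_stump(x, y, w)          # line 3
            alpha = np.log((1 - L) / L)           # line 4
            pred = s * np.sign(x - t + 1e-12)
            w = w * np.exp(alpha * (pred != y))   # line 5: boost mistakes
            models.append((alpha, t, s))
        return models

    def predict(models, x):
        """Eq. (84): sign of the weighted sum of the base classifiers."""
        F = sum(a * s * np.sign(x - t + 1e-12) for a, t, s in models)
        return np.sign(F)

    # Toy usage: a 1-D problem that no single stump can solve.
    x = np.array([-3., -2., -1., 1., 2., 3.])
    y = np.array([1., 1., -1., -1., 1., 1.])
    models = adaboost(x, y, k=10)
    print(predict(models, x))   # matches y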

Figure 9. The training phase in boosting k models.

Different methods have been proposed for boosting; one of the most well-known ones is AdaBoost (Adaptive Boosting) (Freund & Schapire, 1996). The algorithm of AdaBoost for binary classification is shown in Algorithm 3. In this algorithm, L_j is the cost function minimized in the j-th model h_j, I(.) is the indicator function, which is one if its condition is satisfied and zero otherwise, and w_i is the weight associated with the i-th instance for weighting it as the input to the next layer of boosting. Here, we can have several cases which help us understand the interpretation of the AdaBoost algorithm:

• If an instance is correctly classified, the I(y_i ≠ h_j(x_i)) is zero and thus the w_i will remain w_i without any change. This makes sense because the correctly classified instance should not gain a significant weight in the next layer of boosting.

• If an instance is misclassified, the I(y_i ≠ h_j(x_i)) is one. In this case, we can have two sub-cases:

  – If the classifier which classified that instance was a bad classifier, its cost would be like flipping a coin, i.e., L_j = 0.5. Therefore, we will have α_j = log(1) = 0 and again the w_i will remain w_i without any change. This makes sense because we cannot trust the bad classifier about whether the instance is correctly or incorrectly classified, and thus we should not make any decision based on it.

  – If the classifier which classified that instance was a good classifier, then we have L_j < 0.5, and as we also have I(y_i ≠ h_j(x_i)) = 1, the weight will change as w_i := w_i exp(α_j). This also is intuitive because the previous model in the boosting was a good classifier and we can trust it, and that good classifier could not classify the instance correctly; therefore, we should notice that instance more in the next model of the boosting hierarchy.

Note that the cost in AdaBoost is:

    L_j = (∑_{i=1}^n w_i I(y_i ≠ h_j(x_i))) / (∑_{i=1}^n w_i),   (85)

which makes sense because it gets larger if the observations of more instances are estimated incorrectly.

9.2. Theory Based on Additive Models

Additive models (Hastie & Tibshirani, 1986) can be used to explain why boosting works (Friedman et al., 2000; Rojas, 2009). In an additive model, we map the data as x ↦ φ_j(x), ∀j ∈ {1, ..., k}, and then add the mappings using some weights β_j's:

    φ(x) = ∑_{j=1}^k β_j φ_j(x).   (86)

A well-known example of the additive model is the Radial Basis Function (RBF) neural network (here with k hidden nodes), which uses Gaussian mappings (Broomhead & Lowe, 1988; Schwenker et al., 2001).

Now, consider a cost function for an instance as:

    L(y, h(x)) := exp(−y h(x)),   (87)

where y is the observation or label for x and h(x) is the model's estimation of y. This cost is intuitive because when the instance is misclassified, the signs of y and h(x) will be different and the cost will be large, while in the case of correct classification, the signs are similar and the cost is small. If we add up the cost over the n training instances, we have:

    L_t(y, h(x)) := ∑_{i=1}^n exp(−y_i h(x_i)),   (88)

where L_t denotes the total cost.

In Eq. (86), if we rename the mapping to $h(x)$, which is the model used in boosting, we will have:

$$h(x) = \sum_{j=1}^{k} \beta_j\, h_j(x). \qquad (89)$$

We can write this expression as a forward stage-wise additive model (Friedman et al., 2000; Rojas, 2009) in which we work with the models one by one, adding each to the previously fitted models:

$$f_{q-1}(x) = \sum_{j=1}^{q-1} \beta_j\, h_j(x), \qquad (90)$$

$$f_q(x) = f_{q-1}(x) + \beta_q\, h_q(x), \quad q \leq k, \qquad (91)$$

where $h(x) = f_k(x)$. Therefore, minimizing the cost, i.e., Eq. (88), for the $j$-th model in the additive manner is:

$$\min_{\beta_j, h_j} \sum_{i=1}^{n} \exp\big(-y_i\, [f_{j-1}(x_i) + \beta_j\, h_j(x_i)]\big) = \min_{\beta_j, h_j} \sum_{i=1}^{n} \exp(-y_i\, f_{j-1}(x_i))\, \exp(-y_i\, \beta_j\, h_j(x_i)).$$

The first term is a constant with respect to $\beta_j$ and $h_j$, so we name it $w_i$:

$$w_i := \exp(-y_i\, f_{j-1}(x_i)). \qquad (92)$$

Thus:

$$\min_{\beta_j, h_j} \sum_{i=1}^{n} w_i\, \exp(-y_i\, \beta_j\, h_j(x_i)).$$

As in binary AdaBoost we have $\pm 1$ for $y_i$ and $h_j$, we can say:

$$\min_{\beta_j, h_j}\ \exp(-\beta_j) \sum_{i=1}^{n} w_i\, \mathbb{I}(y_i = h_j(x_i)) + \exp(\beta_j) \sum_{i=1}^{n} w_i\, \mathbb{I}(y_i \neq h_j(x_i))$$

$$\overset{(a)}{=} \min_{\beta_j, h_j}\ \exp(-\beta_j) \sum_{i=1}^{n} w_i - \exp(-\beta_j) \sum_{i=1}^{n} w_i\, \mathbb{I}(y_i \neq h_j(x_i)) + \exp(\beta_j) \sum_{i=1}^{n} w_i\, \mathbb{I}(y_i \neq h_j(x_i)),$$

where $(a)$ is because:

$$\sum_{i=1}^{n} w_i\, \mathbb{I}(y_i = h_j(x_i)) = \sum_{i=1}^{n} w_i - \sum_{i=1}^{n} w_i\, \mathbb{I}(y_i \neq h_j(x_i)).$$

For the sake of minimization, we take the derivative and set it to zero:

$$\frac{\partial L_t}{\partial \beta_j} = -\exp(-\beta_j) \sum_{i=1}^{n} w_i + \exp(-\beta_j) \sum_{i=1}^{n} w_i\, \mathbb{I}(y_i \neq h_j(x_i)) + \exp(\beta_j) \sum_{i=1}^{n} w_i\, \mathbb{I}(y_i \neq h_j(x_i)) = 0,$$

which gives:

$$\implies (\exp(-\beta_j) + \exp(\beta_j))\, \frac{\sum_{i=1}^{n} w_i\, \mathbb{I}(y_i \neq h_j(x_i))}{\sum_{i=1}^{n} w_i} = \exp(-\beta_j)$$

$$\overset{(85)}{\implies} (\exp(-\beta_j) + \exp(\beta_j))\, L_j = \exp(-\beta_j) \implies L_j = \frac{\exp(-\beta_j)}{\exp(-\beta_j) + \exp(\beta_j)}$$

$$\implies \exp(2\beta_j) = \frac{1 - L_j}{L_j} \implies 2\beta_j = \log\Big(\frac{1 - L_j}{L_j}\Big) \overset{(a)}{\implies} \alpha_j = 2\beta_j, \qquad (93)$$

where $(a)$ is because of line 4 in Algorithm 3.

According to Eqs. (90), (91), and (92), we have:

$$w_i := w_i\, \exp(-y_i\, \beta_j\, h_j(x_i)). \qquad (94)$$

As we have $y_i\, h_j(x_i) = \pm 1$, we can say:

$$-y_i\, h_j(x_i) = 2\, \mathbb{I}(y_i \neq h_j(x_i)) - 1. \qquad (95)$$

According to Eqs. (93), (94), and (95), we have:

$$w_i := w_i\, \exp\big(\alpha_j\, \mathbb{I}(y_i \neq h_j(x_i))\big)\, \exp(-\beta_j), \qquad (96)$$

which is equivalent to line 5 in Algorithm 3 up to a factor of $\exp(-\beta_j)$. This factor does not depend on whether the instance is correctly classified or not.
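Before moving on, the key identities of this derivation, Eqs. (93) and (95), can be sanity-checked numerically. This is only a toy verification with values we picked ourselves:

```python
import numpy as np

# Eq. (95): for y, h in {-1, +1}, -y*h equals 2*I(y != h) - 1.
for y in (-1, +1):
    for h in (-1, +1):
        assert -y * h == 2 * int(y != h) - 1

# Eq. (93): beta_j = 0.5*log((1-L_j)/L_j) minimizes exp(-b)(1-L_j) + exp(b)*L_j.
L = 0.3                                   # a toy weighted error L_j
betas = np.linspace(-3.0, 3.0, 100001)
obj = np.exp(-betas) * (1 - L) + np.exp(betas) * L
assert abs(betas[np.argmin(obj)] - 0.5 * np.log((1 - L) / L)) < 1e-3
print("both identities check out")
```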

9.3. Theory Based on Maximum Margin

9.3.1. UPPER BOUND ON THE GENERALIZATION ERROR OF BOOSTING

There is an upper bound on the generalization error of boosting (Schapire et al., 1998). In binary boosting, we have $\pm 1$ for $y_i$, and the sign of $\widehat{f}(x_i)$ is what matters; therefore, $y_i \widehat{f}(x_i) < 0$ means that we have an error in estimating the $i$-th instance. Thus, for an error, we have:

$$y_i\, \widehat{f}(x_i) \leq \theta, \qquad (97)$$

for a $\theta > 0$. Recall Eq. (83). We can normalize this equation because only its sign is important:

$$\widehat{f}(x_i) = \frac{\sum_{j=1}^{k} \alpha_j\, h_j(x_i)}{\sum_{j=1}^{k} \alpha_j}. \qquad (98)$$

According to Eqs. (97) and (98), we have:

$$y_i\, \widehat{f}(x_i) \leq \theta \iff y_i \sum_{j=1}^{k} \alpha_j\, h_j(x_i) \leq \theta \sum_{j=1}^{k} \alpha_j \iff \exp\Big(-y_i \sum_{j=1}^{k} \alpha_j\, h_j(x_i) + \theta \sum_{j=1}^{k} \alpha_j\Big) \geq 1.$$

Therefore, in terms of probability, we have:

$$\mathbb{P}(y_i\, \widehat{f}(x_i) \leq \theta) = \mathbb{P}\Big( \exp\Big(-y_i \sum_{j=1}^{k} \alpha_j\, h_j(x_i) + \theta \sum_{j=1}^{k} \alpha_j\Big) \geq 1 \Big). \qquad (99)$$

According to Markov's inequality, which is (for $a > 0$ and a random variable $X$):

$$\mathbb{P}(X \geq a) \leq \frac{\mathbb{E}(X)}{a}, \qquad (100)$$

and Eq. (99), we have (take $a = 1$ and the exponential term as $X$ in Markov's inequality):

$$\mathbb{P}(y_i\, \widehat{f}(x_i) \leq \theta) \leq \mathbb{E}\Big[ \exp\Big(-y_i \sum_{j=1}^{k} \alpha_j\, h_j(x_i) + \theta \sum_{j=1}^{k} \alpha_j\Big) \Big] \overset{(a)}{=} \exp\Big(\theta \sum_{j=1}^{k} \alpha_j\Big)\, \mathbb{E}\Big[ \exp\Big(-y_i \sum_{j=1}^{k} \alpha_j\, h_j(x_i)\Big) \Big]$$

$$\overset{(b)}{=} \exp\Big(\theta \sum_{j=1}^{k} \alpha_j\Big)\, \frac{1}{n} \sum_{i=1}^{n} \exp\Big(-y_i \sum_{j=1}^{k} \alpha_j\, h_j(x_i)\Big), \qquad (101)$$

where $(a)$ is because the expectation is with respect to the data, i.e., $x_i$ and $y_i$, and $(b)$ is according to the definition of expectation.

Recall line 5 in Algorithm 3:

$$w_i^{(j+1)} = w_i^{(j)}\, \exp\big(\alpha_j\, \mathbb{I}(y_i \neq h_j(x_i))\big),$$

which can be restated as:

$$w_i^{(j+1)} = w_i^{(j)}\, \exp(-y_i\, \alpha_j\, h_j(x_i)),$$

because $y_i = \pm 1$ and $h_j(x_i) = \pm 1$. It is not harmful to AdaBoost if we use the normalized weights:

$$w_i^{(j+1)} = \frac{w_i^{(j)}\, \exp(-y_i\, \alpha_j\, h_j(x_i))}{z_j}, \qquad (102)$$

where:

$$z_j := \sum_{i=1}^{n} w_i^{(j)}\, \exp(-y_i\, \alpha_j\, h_j(x_i)). \qquad (103)$$

Considering that $w_i^{(1)} = 1/n$, we have a recursive expression for the weights:

$$w_i^{(k+1)} = \frac{w_i^{(k)}\, \exp(-y_i\, \alpha_k\, h_k(x_i))}{z_k} = w_i^{(1)} \times \frac{1}{z_k \times \cdots \times z_1} \times \exp(-y_i\, \alpha_k\, h_k(x_i)) \times \cdots \times \exp(-y_i\, \alpha_1\, h_1(x_i))$$

$$= \frac{1}{n} \times \frac{1}{\prod_{j=1}^{k} z_j} \times \prod_{j=1}^{k} \exp(-y_i\, \alpha_j\, h_j(x_i)) = \frac{1}{n} \times \frac{1}{\prod_{j=1}^{k} z_j} \times \exp\Big(-y_i \sum_{j=1}^{k} \alpha_j\, h_j(x_i)\Big). \qquad (104)$$

We continue Eq. (101):

$$\mathbb{P}(y_i\, \widehat{f}(x_i) \leq \theta) \leq \exp\Big(\theta \sum_{j=1}^{k} \alpha_j\Big)\, \frac{1}{n} \sum_{i=1}^{n} \exp\Big(-y_i \sum_{j=1}^{k} \alpha_j\, h_j(x_i)\Big) \overset{(104)}{=} \exp\Big(\theta \sum_{j=1}^{k} \alpha_j\Big) \Big(\prod_{j=1}^{k} z_j\Big) \Big(\sum_{i=1}^{n} w_i^{(k+1)}\Big).$$

According to Eqs. (102) and (103), we have:

$$\sum_{i=1}^{n} w_i^{(j+1)} = \frac{\sum_{i=1}^{n} w_i^{(j)}\, \exp(-y_i\, \alpha_j\, h_j(x_i))}{\sum_{i=1}^{n} w_i^{(j)}\, \exp(-y_i\, \alpha_j\, h_j(x_i))} = 1.$$

Therefore:

$$\therefore\ \mathbb{P}(y_i\, \widehat{f}(x_i) \leq \theta) \leq \exp\Big(\theta \sum_{j=1}^{k} \alpha_j\Big)\, \prod_{j=1}^{k} z_j. \qquad (105)$$

On the other hand, according to Eq. (103), we have:

$$z_j = \sum_{i=1}^{n} w_i^{(j)}\, \exp(-y_i\, \alpha_j\, h_j(x_i)) = \sum_{i=1}^{n} w_i^{(j)}\, \exp(-\alpha_j)\, \mathbb{I}(y_i = h_j(x_i)) + \sum_{i=1}^{n} w_i^{(j)}\, \exp(\alpha_j)\, \mathbb{I}(y_i \neq h_j(x_i))$$

$$= \exp(-\alpha_j) \sum_{i=1}^{n} w_i^{(j)}\, \mathbb{I}(y_i = h_j(x_i)) + \exp(\alpha_j) \sum_{i=1}^{n} w_i^{(j)}\, \mathbb{I}(y_i \neq h_j(x_i)). \qquad (106)$$

Recall Eq. (102) for $w_i^{(j+1)}$. This is in the range $[0, 1]$, and its summation over the error cases can be considered as the probability of error:

$$\sum_{i=1}^{n} w_i^{(j)}\, \mathbb{I}(y_i \neq h_j(x_i)) \overset{(a)}{=} \mathbb{P}(y_i \neq h_j(x_i)) = L_j, \qquad (107)$$

where $(a)$ is because Eq. (85) is the cost, which is the probability of error. Therefore, Eq. (106) becomes:

$$z_j = \exp(-\alpha_j)\, (1 - L_j) + \exp(\alpha_j)\, L_j.$$

Recall $\alpha_j$ in line 4 of Algorithm 3. Scaling it is not harmful to AdaBoost:

$$\alpha_j = \frac{1}{2} \log\Big(\frac{1 - L_j}{L_j}\Big). \qquad (108)$$

Therefore, we can have:

$$z_j = 2 \sqrt{L_j (1 - L_j)}. \qquad (109)$$

Plugging Eqs. (108) and (109) into Eq. (105) gives:

$$\mathbb{P}(y_i\, \widehat{f}(x_i) \leq \theta) \leq \exp\Big(\frac{\theta}{2} \sum_{j=1}^{k} \log\Big(\frac{1 - L_j}{L_j}\Big)\Big)\, 2^k \prod_{j=1}^{k} \sqrt{L_j (1 - L_j)} = 2^k \prod_{j=1}^{k} \Big(\frac{1 - L_j}{L_j}\Big)^{\theta/2} \sqrt{L_j (1 - L_j)},$$

which simplifies to the upper bound on the generalization error of AdaBoost (Schapire et al., 1998):

$$\mathbb{P}\big(y_i\, \widehat{f}(x_i) \leq \theta\big) \leq 2^k \prod_{j=1}^{k} \sqrt{L_j^{\,1-\theta}\, (1 - L_j)^{\,1+\theta}}, \qquad (110)$$

where $\mathbb{P}(y_i \widehat{f}(x_i) \leq \theta)$ is the probability that the generalization (true) error for the $i$-th instance is less than $\theta > 0$. According to Eq. (85), we have $L_j \in [0, 1]$. If we have $L_j \leq 0.5 - \xi$, where $\xi \in (0, 0.5)$, Eq. (110) becomes:

$$\mathbb{P}\big(y_i\, \widehat{f}(x_i) \leq \theta\big) \leq \Big( \sqrt{(1 - 2\xi)^{1-\theta}\, (1 + 2\xi)^{1+\theta}} \Big)^k, \qquad (111)$$

which is a very good upper bound because if $\theta < \xi$, we have $\sqrt{(1 - 2\xi)^{1-\theta}(1 + 2\xi)^{1+\theta}} < 1$; thus, the probability of error, $\mathbb{P}(y_i \widehat{f}(x_i) \leq \theta)$, decreases exponentially with $k$, the number of models used in boosting. This shows that boosting helps us reduce the generalization error and thus avoid overfitting. In other words, because of this bound on the generalization error, boosting hardly ever overfits.

If $\xi$ is a very small positive number, then $L_j \leq 0.5 - \xi$ is only a little smaller than $0.5$, i.e., $L_j \lessapprox 0.5$. As we are discussing binary classification in boosting, $L_j = 0.5$ means random classification by flipping a coin. Therefore, to obtain the good bound of Eq. (111), having weak base models (a little better than random decision) suffices. This shows the effectiveness of boosting. Note that a very small $\xi$ means a very small $\theta$ because of $\theta < \xi$; therefore, it means a very small probability of error, $\mathbb{P}(y_i \widehat{f}(x_i) \leq \theta)$.

It is noteworthy that both boosting and bagging can be seen as ensemble learning (Polikar, 2012) (or majority voting) methods which use model averaging (Hoeting et al., 1999; Claeskens & Hjort, 2008) and are very effective in learning theory. Moreover, both boosting and bagging reduce the variance of estimation (Breiman, 1998; Schapire et al., 1998), especially for models with a high variance of estimation, such as trees (Quinlan, 1996).

In the above, we analyzed boosting for binary classification. A similar discussion can be carried out for multi-class classification in boosting to find an upper bound on the generalization error (see the appendix in (Schapire et al., 1998) for more details).
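To see how quickly the bound of Eq. (111) shrinks with the number of models $k$, we can evaluate it for toy values of $\xi$ and $\theta$ (our own choices, with $\theta < \xi$):

```python
import numpy as np

xi, theta = 0.1, 0.05   # weak learners beat coin flipping by xi; margin theta < xi
base = np.sqrt((1 - 2 * xi) ** (1 - theta) * (1 + 2 * xi) ** (1 + theta))
print(base)             # ~0.99, strictly below 1
for k in (10, 100, 1000):
    print(k, base ** k) # the bound of Eq. (111) decays exponentially in k
```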

9.3.2. BOOSTING AS MAXIMUM MARGIN CLASSIFIER

In another perspective, the upper bound found for boosting shows that boosting can be seen as a method to increase (maximize) the margins of the training error, which results in a good generalization error (Boser et al., 1992). This phenomenon is the basis of the theory of Support Vector Machines (SVM) (Cortes & Vapnik, 1995; Burges, 1998). In the following, we analyze the analogy between the maximum margin classifier (i.e., SVM) and boosting (Schapire et al., 1998). In addition to (Schapire et al., 1998), some more discussions exist on the upper bound and margin of boosting (Wang et al., 2008; Gao & Zhou, 2013), to which we refer the interested readers.

Assume we have training instances $\{(x_i, y_i)\}_{i=1}^{n}$ where $y_i \in \{-1, +1\}$ for binary classification. The two classes may not be linearly separable. In order to handle this case, we map the data to a higher-dimensional feature space using kernels (Scholkopf & Smola, 2001; Hofmann et al., 2008), hoping that they become linearly separable in the feature space. Assume $h(x)$ is a vector which non-linearly maps the data to the feature space. Considering $\alpha$ as the vector of dual variables, the dual optimization problem (Boyd & Vandenberghe, 2004) in SVM is (Burges, 1998; Schapire et al., 1998):

$$\max_{\alpha}\ \min_{\{(x_i, y_i)\}_{i=1}^{n}}\ \frac{y_i\, (\alpha^\top h(x_i))}{\|\alpha\|_2}. \qquad (112)$$

Note that $y_i = \pm 1$ and $\alpha^\top h(x_i) \gtrless 0$; therefore, the sign of $y_i\, (\alpha^\top h(x_i))$ determines the class of the $i$-th instance. On the other hand, Eq. (98) can be written in vector form:

$$\widehat{f}(x_i) = \frac{\sum_{j=1}^{k} \alpha_j\, h_j(x_i)}{\sum_{j=1}^{k} \alpha_j} = \frac{\alpha^\top h(x_i)}{\|\alpha\|_1}, \qquad (113)$$

where $h(x_i) = [h_1(x_i), \dots, h_k(x_i)]^\top$ and $\alpha = [\alpha_1, \dots, \alpha_k]^\top$. Note that here, $h_j(x_i) = \pm 1$ and $\alpha_j$ is obtained from Eq. (108) or line 4 in Algorithm 3.

The similarity between Eq. (113) and the cost function in Eq. (112) shows that boosting can be seen as maximizing the margin of classification, resulting in a good generalization error (Schapire et al., 1998). In other words, both methods find a linear combination in the high-dimensional feature space having a large margin between the training instances of the classes. A slight difference is the type of norm, which is interpretable because the mapping to the feature space in boosting is only to $h_j(x_i) = \pm 1$, while in SVM it can be any number where only the sign is important. Therefore, the $\ell_1$ and $\ell_2$ norms are suitable in boosting and SVM, respectively (Schapire et al., 1998).

Another connection between SVM (the maximum margin classifier) and boosting is that some of the training instances are found to be the most important instances, called support vectors (Burges, 1998). In boosting, likewise, weighting the training instances can be seen as selecting some informative models (Freund, 1995), which can be analogous to support vectors.

9.4. Examples in Machine Learning: Boosting Trees and SVMs

Both bagging and boosting are used a lot with trees (Quinlan, 1996; Friedman et al., 2009). The reason boosting is very effective with trees is that trees have a large variance of estimation, and boosting and bagging can significantly reduce the variance of estimation, as discussed before. Note that boosting is also used with SVM for imbalanced data (Wang & Japkowicz, 2010).

A. Proof of Stein's Lemma

As the components of $z = [z_1, \dots, z_d]^\top \in \mathbb{R}^d$ are independent random variables with normal distribution, i.e., $z_i \sim \mathcal{N}(\mu_i, \sigma^2)$, we have:

$$f(z) = f(z_1, \dots, z_d) \overset{(a)}{=} f(z_1) \times \cdots \times f(z_d) = \prod_{i=1}^{d} \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big(-\frac{(z_i - \mu_i)^2}{2\sigma^2}\Big)$$

$$= \frac{1}{\sqrt{(2\pi\sigma^2)^d}} \exp\Big(-\frac{\sum_{i=1}^{d} (z_i - \mu_i)^2}{2\sigma^2}\Big) = \frac{1}{\sqrt{(2\pi\sigma^2)^d}} \exp\Big(-\frac{\|z - \mu\|_2^2}{2\sigma^2}\Big),$$

where $(a)$ is because $z_1 \perp\!\!\!\perp \dots \perp\!\!\!\perp z_d$.

We also have:

$$(z - \mu)^\top g(z) = \sum_{i=1}^{d} (z_i - \mu_i)\, g_i.$$

According to the definition of expectation, we have:

$$\mathbb{E}\big[(z - \mu)^\top g(z)\big] = \int_{\mathbb{R}^d} f(z)\, (z - \mu)^\top g(z)\, dz = \sum_{i=1}^{d} \int_{\mathbb{R}^d} \frac{1}{\sqrt{(2\pi\sigma^2)^d}} \exp\Big(-\frac{\|z - \mu\|_2^2}{2\sigma^2}\Big)\, (z_i - \mu_i)\, g_i\; dz_1 \dots dz_d$$

$$\overset{(a)}{=} \sigma^2 \sum_{i=1}^{d} \int_{\mathbb{R}^d} \frac{1}{\sqrt{(2\pi\sigma^2)^d}} \exp\Big(-\frac{\|z - \mu\|_2^2}{2\sigma^2}\Big)\, \frac{\partial g_i}{\partial z_i}\; dz_1 \dots dz_d \overset{(b)}{=} \sigma^2 \sum_{i=1}^{d} \mathbb{E}\Big(\frac{\partial g_i}{\partial z_i}\Big),$$

where $(a)$ uses integration by parts and $(b)$ is according to the definition of expectation. Q.E.D.
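The lemma just proved can also be checked by simulation. The following Monte Carlo sketch (with an arbitrary smooth g of our own choosing) compares E[(z − µ)g(z)] against σ²E[g′(z)] for d = 1:

```python
import numpy as np

# Stein's lemma for d = 1: E[(z - mu) g(z)] = sigma^2 * E[g'(z)].
# We pick g(z) = z**3, so g'(z) = 3*z**2 (an arbitrary smooth test function).
rng = np.random.default_rng(0)
mu, sigma = 1.5, 2.0
z = rng.normal(mu, sigma, size=5_000_000)
lhs = np.mean((z - mu) * z**3)
rhs = sigma**2 * np.mean(3 * z**2)
print(lhs, rhs)   # the two estimates agree up to Monte Carlo error
```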

10. Conclusion

This paper was a tutorial paper introducing overfitting, cross validation, generalized cross validation, regularization, bagging, and boosting. The theory behind these methods was explained, and some examples of them in machine learning and computer vision were provided.

Acknowledgment

The authors hugely thank Prof. Ali Ghodsi (see his great online related courses (Ghodsi, 2015a;b)), Prof. Mu Zhu, Prof. Hoda Mohammadzade, Prof. Wayne Oldford, etc., whose courses have partly covered the materials mentioned in this tutorial paper.

References

Arlot, Sylvain and Celisse, Alain. A survey of cross-validation procedures for model selection. Statistics Surveys, 4:40–79, 2010.

Bach, Francis, Jenatton, Rodolphe, Mairal, Julien, and Obozinski, Guillaume. Convex optimization with sparsity-inducing norms. Optimization for Machine Learning, 5:19–53, 2011.

Bach, Francis, Jenatton, Rodolphe, Mairal, Julien, and Obozinski, Guillaume. Optimization with sparsity-inducing penalties. Foundations and Trends in Machine Learning, 4(1):1–106, 2012.

Barnett, Vic. Elements of Sampling Theory. English Universities Press, London, 1974.

Boser, Bernhard E, Guyon, Isabelle M, and Vapnik, Vladimir N. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, pp. 144–152. ACM, 1992.

Boyd, Stephen and Vandenberghe, Lieven. Convex Optimization. Cambridge University Press, 2004.

Breiman, Leo. Bagging predictors. Machine Learning, 24(2):123–140, 1996.

Breiman, Leo. Arcing classifier (with discussion and a rejoinder by the author). The Annals of Statistics, 26(3):801–849, 1998.

Broomhead, D. S. and Lowe, David. Multivariable functional interpolation and adaptive networks. Complex Systems, 2:321–355, 1988.

Bühlmann, Peter Lukas and Yu, Bin. Explaining bagging. In Research Report/Seminar für Statistik, Eidgenössische Technische Hochschule Zürich, volume 92. Seminar für Statistik, Eidgenössische Technische Hochschule (ETH), 2000.

Burges, Christopher JC. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121–167, 1998.

Caruana, Rich, Lawrence, Steve, and Giles, C Lee. Overfitting in neural nets: Backpropagation, conjugate gradient, and early stopping. In Advances in Neural Information Processing Systems, pp. 402–408, 2001.

Casella, George and George, Edward I. Explaining the Gibbs sampler. The American Statistician, 46(3):167–174, 1992.

Chang, Yale. L2,1 norm and its applications. Technical Report, University of Central Florida.

Chiu, Ching-Tai, Mehrotra, Kishan, Mohan, Chilukuri K, and Ranka, Sanjay. Modifying training algorithms for improved fault tolerance. In Proceedings of 1994 IEEE International Conference on Neural Networks (ICNN'94), volume 1, pp. 333–338. IEEE, 1994.

Claeskens, Gerda and Hjort, Nils Lid. Model Selection and Model Averaging. Cambridge Books, Cambridge University Press, 2008.

Cortes, Corinna and Vapnik, Vladimir. Support-vector networks. Machine Learning, 20(3):273–297, 1995.

Craven, P and Wahba, G. Smoothing noisy data with spline functions: estimating the correct degree of smoothing by the method of generalized cross-validation. Numerische Mathematik, 31:377–403, 1979.

Dalal, Navneet and Triggs, Bill. Histograms of oriented gradients for human detection. In International Conference on Computer Vision & Pattern Recognition (CVPR'05), volume 1, pp. 886–893. IEEE Computer Society, 2005.

DeVries, Terrance and Taylor, Graham W. Dataset augmentation in feature space. arXiv preprint arXiv:1702.05538, 2017.

Domingos, Pedro. The role of Occam's razor in knowledge discovery. Data Mining and Knowledge Discovery, 3(4):409–425, 1999.

Donoho, David L. For most large underdetermined systems of linear equations the minimal ℓ1-norm solution is also the sparsest solution. Communications on Pure and Applied Mathematics: A Journal Issued by the Courant Institute of Mathematical Sciences, 59(6):797–829, 2006.

Freund, Yoav. Boosting a weak learning algorithm by majority. Information and Computation, 121(2):256–285, 1995.

Freund, Yoav and Schapire, Robert E. Experiments with a new boosting algorithm. In Proceedings of the Thirteenth International Conference on Machine Learning, pp. 148–156. Morgan Kaufmann, San Francisco, 1996.

Friedman, Jerome, Hastie, Trevor, and Tibshirani, Robert. Additive logistic regression: a statistical view of boosting. The Annals of Statistics, 28(2):337–407, 2000.

Friedman, Jerome, Hastie, Trevor, and Tibshirani, Robert. The Elements of Statistical Learning, volume 2. Springer Series in Statistics, New York, NY, USA, 2009.

Gao, Wei and Zhou, Zhi-Hua. On the doubt about margin explanation of boosting. Artificial Intelligence, 203:1–18, 2013.

Ghodsi, Ali. Classification course, Department of Statistics and Actuarial Science, University of Waterloo. Online YouTube videos, 2015a. Accessed: January 2019.

Ghodsi, Ali. Deep learning course, Department of Statistics and Actuarial Science, University of Waterloo. Online YouTube videos, 2015b. Accessed: January 2019.

Golub, Gene H, Heath, Michael, and Wahba, Grace. Generalized cross-validation as a method for choosing a good ridge parameter. Technometrics, 21(2):215–223, 1979.

Goodfellow, Ian, Bengio, Yoshua, and Courville, Aaron. Deep Learning. MIT Press, 2016.

Grandvalet, Yves, Canu, Stéphane, and Boucheron, Stéphane. Noise injection: Theoretical prospects. Neural Computation, 9(5):1093–1108, 1997.

Hastie, Trevor J and Tibshirani, Robert. Generalized additive models. Statistical Science, 1(3):297–318, 1986.

Ho, Kevin, Leung, Chi-sing, and Sum, John. On weight-noise-injection training. In International Conference on Neural Information Processing, pp. 919–926. Springer, 2008.

Hoeting, Jennifer A, Madigan, David, Raftery, Adrian E, and Volinsky, Chris T. Bayesian model averaging: a tutorial. Statistical Science, pp. 382–401, 1999.

Hofmann, Thomas, Schölkopf, Bernhard, and Smola, Alexander J. Kernel methods in machine learning. The Annals of Statistics, pp. 1171–1220, 2008.

Kearns, Michael. Thoughts on hypothesis boosting. Technical Report, Machine Learning class project, pp. 1–9, 1988.

Kearns, Michael and Valiant, Leslie. Cryptographic limitations on learning boolean formulae and finite automata. Journal of the ACM (JACM), 41(1):67–95, 1994.

Kong, Eun Bae and Dietterich, Thomas G. Error-correcting output coding corrects bias and variance. In Machine Learning Proceedings 1995, pp. 313–321. Elsevier, 1995.

Krogh, Anders and Hertz, John A. A simple weight decay can improve generalization. In Advances in Neural Information Processing Systems, pp. 950–957, 1992.

LeCun, Yann, Bengio, Yoshua, and Hinton, Geoffrey. Deep learning. Nature, 521(7553):436, 2015.

Li, Ker-Chau. From Stein's unbiased risk estimates to the method of generalized cross validation. The Annals of Statistics, 13(4):1352–1377, 1985.

Liaw, Andy and Wiener, Matthew. Classification and regression by randomForest. R News, 2(3):18–22, 2002.

Liu, Wei, Anguelov, Dragomir, Erhan, Dumitru, Szegedy, Christian, Reed, Scott, Fu, Cheng-Yang, and Berg, Alexander C. SSD: Single shot multibox detector. In European Conference on Computer Vision, pp. 21–37. Springer, 2016.

Matsuoka, Kiyotoshi. Noise injection into inputs in back-propagation learning. IEEE Transactions on Systems, Man, and Cybernetics, 22(3):436–440, 1992.

Mitchell, Thomas. Machine Learning. McGraw Hill Higher Education, 1997.

Parikh, Neal and Boyd, Stephen. Proximal algorithms. Foundations and Trends in Optimization, 1(3):127–239, 2014.

Polikar, Robi. Ensemble learning. In Ensemble Machine Learning, pp. 1–34. Springer, 2012.

Prechelt, Lutz. Early stopping - but when? In Neural Networks: Tricks of the Trade, pp. 55–69. Springer, 1998.

Quinlan, J Ross. Bagging, boosting, and C4.5. In AAAI/IAAI Conference, volume 1, pp. 725–730, 1996.

Rojas, Raúl. AdaBoost and the super bowl of classifiers: a tutorial introduction to adaptive boosting. Freie University, Berlin, Technical Report, 2009.

Schapire, Robert E, Freund, Yoav, Bartlett, Peter, and Lee, Wee Sun. Boosting the margin: A new explanation for the effectiveness of voting methods. The Annals of Statistics, 26(5):1651–1686, 1998.

Schmidt, Mark. Least squares optimization with L1-norm regularization. CS542B Project Report, 504:195–221, 2005.

Schölkopf, Bernhard and Smola, Alexander J. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2001.

Schwenker, Friedhelm, Kestler, Hans A, and Palm, Günther. Three learning phases for radial-basis-function networks. Neural Networks, 14(4-5):439–458, 2001.

Srivastava, Nitish, Hinton, Geoffrey, Krizhevsky, Alex, Sutskever, Ilya, and Salakhutdinov, Ruslan. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.

Stein, Charles M. Estimation of the mean of a multivariate normal distribution. The Annals of Statistics, pp. 1135–1151, 1981.

Tibshirani, Robert. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1):267–288, 1996.

Tibshirani, Robert, Wainwright, Martin, and Hastie, Trevor. Statistical Learning with Sparsity: The Lasso and Generalizations. Chapman and Hall/CRC, 2015.

Van Dyk, David A and Meng, Xiao-Li. The art of data augmentation. Journal of Computational and Graphical Statistics, 10(1):1–50, 2001.

Vincent, Pascal, Larochelle, Hugo, Bengio, Yoshua, and Manzagol, Pierre-Antoine. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on Machine Learning, pp. 1096–1103. ACM, 2008.

Wang, Benjamin X and Japkowicz, Nathalie. Boosting support vector machines for imbalanced data sets. Knowledge and Information Systems, 25(1):1–20, 2010.

Wang, Liwei, Sugiyama, Masashi, Yang, Cheng, Zhou, Zhi-Hua, and Feng, Jufu. On the margin explanation of boosting algorithms. In COLT, pp. 479–490. Citeseer, 2008.

Wright, Stephen J. Coordinate descent algorithms. Mathematical Programming, 151(1):3–34, 2015.

Wu, Tong Tong and Lange, Kenneth. Coordinate descent algorithms for lasso penalized regression. The Annals of Applied Statistics, 2(1):224–244, 2008.

Yao, Yuan, Rosasco, Lorenzo, and Caponnetto, Andrea. On early stopping in gradient descent learning. Constructive Approximation, 26(2):289–315, 2007.

Zhang, Harry. The optimality of naive Bayes. In American Association for Artificial Intelligence (AAAI), 2004.

Zur, Richard M, Jiang, Yulei, Pesce, Lorenzo L, and Drukker, Karen. Noise injection for training artificial neural networks: A comparison with weight decay and early stopping. Medical Physics, 36(10):4810–4818, 2009.