Using Validation Sets to Avoid Overfitting in Adaboost
Tom Bylander and Lisa Tate
Department of Computer Science, University of Texas at San Antonio, San Antonio, TX 78249 USA
{bylander, ltate}@cs.utsa.edu

Abstract

AdaBoost is a well-known, effective technique for increasing the accuracy of learning algorithms. However, it has the potential to overfit the training set because its objective is to minimize error on the training set. We demonstrate that overfitting in AdaBoost can be alleviated in a time-efficient manner using a combination of dagging and validation sets. Half of the training set is removed to form the validation set. The sequence of base classifiers, produced by AdaBoost from the training set, is applied to the validation set, creating a modified set of weights. The training and validation sets are switched, and a second pass is performed. The final classifier votes using both sets of weights. We show our algorithm has similar performance on standard datasets and improved performance when classification noise is added.

Introduction

AdaBoost was first introduced by Freund and Schapire (Freund & Schapire 1997). Since then, AdaBoost has been shown to do well with algorithms such as C4.5 (Quinlan 1996), Decision Stumps (Freund & Schapire 1996), and Naive Bayes (Elkan 1997). It has been shown that AdaBoost has the potential to overfit (Groves & Schuurmans 1998; Jiang 2000), although rarely with low-noise data. However, it has a much higher potential to overfit in the presence of very noisy data (Dietterich 2000; Ratsch, Onoda, & Muller 2001; Servedio 2003). At each iteration, AdaBoost focuses on classifying the misclassified instances. This might result in fitting the noise during training. In this paper, we use validation sets to adjust the hypothesis of the AdaBoost algorithm to improve generalization, thereby alleviating overfitting and improving performance.

Validation sets have long been used in addressing the problem of overfitting with neural networks (Heskes 1997) and decision trees (Quinlan 1996). The basic concept is to apply the classifier to a set of instances distinct from the training set. The performance over the validation set can then be used to determine pruning, in the case of decision trees, or early stopping, in the case of neural networks. In both cases, the training set is used to find a small hypothesis space, and the validation set is used to select a hypothesis from that space. The justification is that in a large hypothesis space, minimizing training error will often result in overfitting. The training set can instead be used to establish a set of choices as training error is minimized, with the validation set used to reverse or modify those choices.

A validation set could simply be used for early stopping in AdaBoost. However, we obtain the most success by also performing 2-dagging (Ting & Witten 1997) and modifying weights. Dagging is disjoint aggregation, where a training set is partitioned into subsets and training is performed on each subset. Our algorithm partitions the original training set into two partitions, applies boosting on one partition, and performs validation with the other partition. It then switches the training and validation sets and applies boosting and validation again. In each boosting/validation pass, the sequence of weak hypotheses from boosting on the training partition is fit to the validation partition in the same order. The sequence is truncated if and when the average error over the training and validation sets is 50% or more. The final weights are derived by averaging the error rates.

Our results show that a large improvement is due to the 2-dagging. When a training set contains classification noise, learning algorithms that are sensitive to noise and prone to overfitting can fit the noise. When 2-dagging partitions the training set, a noisy example will be in only one of the subsets. Although the hypothesis created from that subset may overfit the noisy example, the hypothesis created from the other subset will not overfit that particular example. Aggregating the hypotheses counteracts the overfitting of each separate hypothesis. A further improvement is obtained by using the validation set to stop early and to modify the weights. While this might not be the best way to use validation sets for boosting, we do demonstrate that validation sets can be used to avoid overfitting efficiently.

Several techniques have been used to address overfitting of noisy data in AdaBoost. BrownBoost (Freund 1999), an adaptive version of Boost by Majority (Freund 1990), gives small weights to misclassified examples far from the margin. Using a "soft margin" (Ratsch, Onoda, & Muller 2001) effectively avoids giving preference to hypotheses that rely on a few examples with large weights from continuous misclassification. MadaBoost (Domingo & Watanabe 2000) provides an upper bound on example weights based on initial probability. SmoothBoost (Servedio 2003) smooths the distribution over the examples by weighting on the margin of error and creating an upper bound so that no one example can have too much weight. NadaBoost (Nakamura, Nomiya, & Uehara 2002) adjusts the weighting scheme by simply adding a threshold for the maximum allowable weight. In this paper, we address overfitting of noisy data by using a validation set to smooth the hypothesis weights.

The rest of this paper is organized as follows. First we describe the AdaBoost.M1 algorithm, used for multiclass datasets. We then present our AdaBoost.MV algorithm. Finally, we describe our experiments, including a comparison with the BrownBoost algorithm, and conclude with a discussion of the results.

*This research has been supported in part by the Center for Infrastructure Assurance and Security at the University of Texas at San Antonio.
Copyright © 2006, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.

AdaBoost.M1 Algorithm

Algorithm AdaBoost.M1
Input: training set of m examples ((x_1, y_1), ..., (x_m, y_m))
       with labels y_i ∈ Y = {1, ..., k}
       learning algorithm L
       integer T specifying the number of iterations
Initialize D_1(i) = 1/m for all i
Do for t = 1, ..., T:
  1. Call L, providing it with the distribution D_t
  2. Get back a hypothesis h_t: X → Y
  3. Calculate the error of h_t: ε_t = Σ_{i : h_t(x_i) ≠ y_i} D_t(i)
  4. If ε_t ≥ 1/2, then set T = t − 1 and abort
  5. Set β_t = ε_t / (1 − ε_t)
  6. Update distribution D_t:
       D_{t+1}(i) = (D_t(i) / Z_t) × (β_t if h_t(x_i) = y_i, 1 otherwise)
     where Z_t is a normalization constant
Output the final classifier:
  h_fin(x) = arg max_{y ∈ Y} Σ_{t : h_t(x) = y} log(1/β_t)

Figure 1: The AdaBoost.M1 algorithm

AdaBoost.M1 is a version of the AdaBoost algorithm (Freund & Schapire 1997) that handles multiple-label classification. The AdaBoost.M1 algorithm takes a set of training examples ((x_1, y_1), ..., (x_m, y_m)), where x_i ∈ X is an instance and y_i ∈ Y is the class label. It initializes the first distribution of weights D_1 to be uniform over the training examples. It calls a learning algorithm, referred to as L, iteratively for a predetermined T times. At each iteration, L is provided with the distribution D_t and returns a hypothesis h_t relative to the distribution. The β_t value, a function of the error, is used to update the weights in the distribution at every iteration t. If an example is misclassified, it receives a higher weight. The Z_t value is the normalization factor. The β_t value is also used to calculate the final weight for each hypothesis h_t as log(1/β_t). The final hypothesis h_fin(x) is a weighted plurality vote determined by the weights of the hypotheses.

Validation Algorithms

FitValidationSet Subroutine

Subroutine FitValidationSet
Input: validation set of m examples ((x_1, y_1), ..., (x_m, y_m))
       with labels y_i ∈ Y = {1, ..., k}
       hypotheses {h_1, ..., h_T} from AdaBoost.M1
       errors {ε_1, ..., ε_T} from AdaBoost.M1
Initialize D′_1(i) = 1/m for all i
Do for t = 1, ..., T:
  1. Calculate the error of h_t: ε′_t = Σ_{i : h_t(x_i) ≠ y_i} D′_t(i)
  2. If (ε_t + ε′_t)/2 ≥ 1/2, then set T = t − 1 and abort
  3. Set β′_t = ε′_t / (1 − ε′_t)
  4. Update distribution D′_t:
       D′_{t+1}(i) = (D′_t(i) / Z′_t) × (β′_t if h_t(x_i) = y_i, 1 otherwise)
     where Z′_t is a normalization constant
Output the errors ε′_t

Figure 2: The FitValidationSet algorithm

The inputs to the FitValidationSet subroutine (Figure 2) are the validation set, the classifiers from boosting on the training set, and the errors from boosting on the training set. FitValidationSet applies AdaBoost.M1's distribution-updating scheme to the classifiers in sequential order. The first distribution D′_1 is initialized to be uniform. The validation error ε′_t is calculated for each classifier h_t with respect to distribution D′_t. The next distribution D′_{t+1} is determined using the same distribution update as AdaBoost.M1. FitValidationSet will stop early if (ε_t + ε′_t)/2 ≥ 1/2, where ε_t is the error of h_t on the distribution D_t from AdaBoost.M1. In our algorithm, the validation set is used to adjust the weights away from fitting (maybe overfitting) the training set toward fitting the validation set. There are a variety of other reasonable-sounding ways to use a validation set in boosting.

Informally, our algorithm can be justified as follows. We don't want to overfit either the training set or the validation set. We also don't want to forget the weights learned from either set, so our algorithm chooses a weight in between the two. For the method of adjustment, averaging the errors followed by computing a weight is more stable than
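The AdaBoost.M1 pseudocode in Figure 1 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function names, the scalar-threshold stump base learner, and the small-constant guard against a zero error rate (the paper's pseudocode simply aborts when ε_t ≥ 1/2 and otherwise sets β_t = ε_t/(1 − ε_t)) are all assumptions made here for a runnable example.

```python
import math

def adaboost_m1(examples, labels, learner, T):
    """Sketch of AdaBoost.M1: keep a distribution over examples,
    shrink the weight of correctly classified examples by beta_t,
    and abort when a weak hypothesis has weighted error >= 1/2."""
    m = len(examples)
    D = [1.0 / m] * m                          # D_1(i) = 1/m
    hypotheses, betas, errors = [], [], []
    for t in range(T):
        h = learner(examples, labels, D)       # call L with distribution D_t
        eps = sum(D[i] for i in range(m) if h(examples[i]) != labels[i])
        if eps >= 0.5:                         # corresponds to "set T = t-1 and abort"
            break
        beta = max(eps, 1e-12) / (1.0 - eps)   # numerical guard (not in the paper)
        # Multiply correctly classified examples by beta, then renormalize.
        new_D = [D[i] * (beta if h(examples[i]) == labels[i] else 1.0)
                 for i in range(m)]
        Z = sum(new_D)                         # normalization constant Z_t
        D = [w / Z for w in new_D]
        hypotheses.append(h); betas.append(beta); errors.append(eps)
    def h_fin(x):
        # Weighted plurality vote with hypothesis weight log(1/beta_t).
        votes = {}
        for h, b in zip(hypotheses, betas):
            votes[h(x)] = votes.get(h(x), 0.0) + math.log(1.0 / b)
        return max(votes, key=votes.get)
    return h_fin, hypotheses, errors

def stump_learner(xs, ys, D):
    """Hypothetical base learner L: best single threshold on scalar inputs,
    minimizing error weighted by the distribution D."""
    best = None
    for thr in xs:
        for lo, hi in ((0, 1), (1, 0)):
            err = sum(d for x, y, d in zip(xs, ys, D)
                      if (lo if x < thr else hi) != y)
            if best is None or err < best[0]:
                best = (err, thr, lo, hi)
    _, thr, lo, hi = best
    return lambda x, thr=thr, lo=lo, hi=hi: lo if x < thr else hi
```

On a trivially separable dataset the final classifier reproduces the labels; the point of the sketch is only to make the distribution update and the log(1/β_t) vote concrete.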
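The FitValidationSet subroutine of Figure 2 admits a similarly compact sketch. Again this is an illustrative rendering under assumptions: the function name, the lambda stand-ins for the hypotheses, and the small-constant guard on β′_t are ours, and the training errors passed in would come from a prior AdaBoost.M1 run.

```python
def fit_validation_set(val_examples, val_labels, hypotheses, train_errors):
    """Sketch of FitValidationSet: replay AdaBoost.M1's distribution
    update on the validation set, visiting the hypotheses in the same
    order, and truncate the sequence as soon as the average of the
    training and validation errors reaches 1/2."""
    m = len(val_examples)
    D = [1.0 / m] * m                                # D'_1(i) = 1/m
    val_errors = []
    for h, train_eps in zip(hypotheses, train_errors):
        eps = sum(D[i] for i in range(m)
                  if h(val_examples[i]) != val_labels[i])
        if (train_eps + eps) / 2.0 >= 0.5:           # "set T = t-1 and abort"
            break
        val_errors.append(eps)                       # output the errors eps'_t
        beta = max(eps, 1e-12) / (1.0 - eps)         # guard (not in the paper)
        new_D = [D[i] * (beta if h(val_examples[i]) == val_labels[i] else 1.0)
                 for i in range(m)]
        Z = sum(new_D)                               # normalization constant Z'_t
        D = [w / Z for w in new_D]
    return val_errors
```

A quick check with three stand-in hypotheses shows the truncation behavior: a perfect classifier and a mediocre one pass the (ε_t + ε′_t)/2 test, while a hypothesis that is wrong everywhere stops the replay.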
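Finally, the abstract describes the overall AdaBoost.MV vote: two boosting/validation passes over the swapped partitions, with final weights derived from averaged error rates. The sketch below shows one plausible reading of that combination; the exact weight formula, log((1 − ε̄)/ε̄) with ε̄ the average of the training and validation errors (mirroring AdaBoost.M1's log(1/β_t)), is an assumption on our part, as this excerpt cuts off before the paper's precise definition.

```python
import math

def mv_final_classifier(pass1, pass2):
    """Hedged sketch of the final AdaBoost.MV vote. Each pass supplies
    (hypothesis, train_error, validation_error) triples; a hypothesis is
    weighted via the averaged error, and both sequences vote together."""
    scored = []
    for h, eps_t, eps_v in list(pass1) + list(pass2):
        avg = (eps_t + eps_v) / 2.0
        # Clamp so the weight stays finite and positive (guard, not from paper).
        avg = min(max(avg, 1e-12), 0.5 - 1e-12)
        scored.append((h, math.log((1.0 - avg) / avg)))
    def classify(x):
        votes = {}
        for h, w in scored:
            votes[h(x)] = votes.get(h(x), 0.0) + w
        return max(votes, key=votes.get)
    return classify
```

With one accurate hypothesis (averaged error 0.15) and one weak hypothesis (averaged error 0.425), the accurate one dominates the plurality vote, which is the intended smoothing effect of averaging the two error estimates.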