Effect of Pruning and Early Stopping on Performance of a Boosting Ensemble
Harris Drucker, Ph.D., Monmouth University, West Long Branch, NJ 07764 USA
[email protected]

Abstract: Generating an architecture for an ensemble of boosting machines involves making a series of design decisions. One design decision is whether to use simple "weak learners" such as decision tree stumps or more complicated weak learners such as large decision trees or neural networks. Another design decision is the training algorithm for the constituent weak learners. Here we concentrate on binary decision trees and show that the best results are obtained using the Z-criterion to build the trees without pruning. When using neural networks, early stopping is recommended as a way to lower training time. For multi-class boosting algorithms, the jury is still out on whether the all-pairs binary learning algorithm or pseudo-loss is better.

Keywords: boosting, decision trees, pruning, C4.5, pseudo-loss, neural networks

1. Introduction

Boosting techniques have been around long enough that there is strong evidence that building an ensemble of classifiers using boosting techniques [2] gives performance superior to that of any other ensemble technique, including bagging [2] and stacking [16]. Basic to an understanding of any of the boosting algorithms is the concept of a weak learner: any classifier that achieves an error rate below 50% on any distribution of data, whether the problem is a multi-class problem or a binary (two-class) problem.

In all of the boosting algorithms, we are given a set of training examples, and the first weak learner in the ensemble is trained on this first training set. The second and subsequent weak learners are trained on examples that the previously constructed weak learners find difficult to classify correctly. As the boosting algorithm iterates, the error rate of each new learner on its own training set tends to rise, because the examples it is trained on become more difficult to learn, even though the ensemble training error rate decreases. For the two-class case, keeping the error rate below 50% is not too difficult with any reasonable choice of weak learner. It should be made clear that even if a weak learner has a terrible error rate such as 80%, we could always reverse its classifications and turn it into a weak learner with an error rate of 20%. Therefore, when we say that we want the error rate below 50% in the two-class case, we mean that the weak learner should do better than a random decision. Decision tree stumps (decision trees with two leaves) have often been used as weak learners for the two-class case [1].

In the multi-class case, the error rate still must remain below 50%. For the n-class case, however, a random classifier gives an error rate of (n-1)/n, which can be far above 50%. Therefore, for the multi-class case we must investigate weak learners that are more powerful than those used for the two-class case. "Weak learner" is somewhat of a misnomer in that we would be pleased to have classifiers with as low an error rate as possible. The boosting algorithms prove that the ensemble training error rate goes to zero as long as the error rate of each member of the ensemble is below 50%. Because it may be difficult to obtain low error rates in the multi-class case, we use neural networks and the concept of pseudo-loss [7] rather than directly minimizing the error rate. For the two-class problem, decision trees give good results.
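To make the weak-learner idea concrete, the following is a minimal sketch (my own illustration, not from the paper) of a two-class decision stump trained on a weighted distribution of examples. The function name train_stump and the toy data are assumptions for illustration; the sketch also shows the classification-reversal trick mentioned above, in which a stump with weighted error above 50% is kept but used with its output flipped.

```python
import numpy as np

def train_stump(X, y, weights):
    """Search one-feature thresholds for the stump with the lowest weighted
    error.  Labels y are assumed to be +/-1 and `weights` to sum to 1."""
    best = (np.inf, None, None, 1)          # (error, feature, threshold, polarity)
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            pred = np.where(X[:, j] <= thr, -1, 1)
            err = np.sum(weights[pred != y])
            polarity = 1
            # A stump worse than random becomes a useful one when flipped:
            # an 80% weighted error turns into a 20% weighted error.
            if err > 0.5:
                err, polarity = 1.0 - err, -1
            if err < best[0]:
                best = (err, j, thr, polarity)
    return best

# Toy usage: a uniform initial distribution, as in the first boosting round.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.where(X[:, 0] + 0.5 * X[:, 1] > 0, 1, -1)
w = np.full(len(y), 1.0 / len(y))
err, j, thr, sign = train_stump(X, y, w)
stump_pred = sign * np.where(X[:, j] <= thr, -1, 1)
print(err, np.mean(stump_pred != y))        # weighted and unweighted error
```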
We will give a detailed discussion of decision trees in the following sections, discuss the design decisions that have to be made, show the results of some experiments, and state some open issues.

2. Design Decisions

Once a decision has been made to use a committee of weak learners, a series of design choices have to be made:

1. Which of the boosting algorithms to use?
2. Which of the weak learners to use? Possible candidates are decision stumps, decision trees, or neural networks.
3. What should the architecture be? For decision stumps, since they are, by definition, two-leaf trees, no additional architectural decision has to be made. For decision trees, once the data is known, the architecture is fixed; however, depending on the choice of the "splitting criterion", a different tree may be constructed. For neural networks, except for the no-hidden-layer case, one must decide on the number of hidden layers and the number of nodes in each hidden layer.
4. The choice of learning algorithm for the weak learner: For neural networks and classification, back-propagation is the method of choice, but there are different methods to improve the learning rate. For neural networks and regression [5], conjugate gradient seems to be the better procedure. For decision trees, as stated above, one must choose the criterion on which to base the splitting of the nodes. In addition, one must consider whether to build the tree using one set of examples and prune using another set, or to build and prune using the same set of examples.

These decisions will be discussed in the next sections, with emphasis placed on decision trees. We have attempted to answer these questions elsewhere [5], although one's conclusions change as the state of the art in boosting algorithms changes. We will concentrate on the following design issue here: if one has a two-class problem and decides to use decision trees, what is the proper choice of design parameters? However, we will also briefly discuss the use of neural networks and pseudo-loss.

3. Choice of boosting algorithms

I divide the boosting algorithms into three types: (1) the first boosting algorithm [13], (2) those that are primarily used for the two-class case [14], and (3) an algorithm specifically designed for the multi-class case [7].

The first practical implementation of a boosting algorithm [6] (Figure 1) was successfully used to drive down the error rate in a character recognition problem. In this case the weak learner was a multi-layered neural network with over 100,000 connections. The oracle of this algorithm is assumed to create a large (but not infinite) number of examples; every time it is called, a new pattern is generated. We call this algorithm Boost1 because it was the first boosting algorithm. It can be used for multi-class classification. WeakLearn is any algorithm that has less than a 50% error rate on any distribution of data.

Figure 1: Algorithm Boost1

Given Oracle, size of training set m, and WeakLearn:
1. Call Oracle to generate m training patterns. Train WeakLearn on these examples to obtain a first network h1.
2. Iterate until m new training patterns are obtained:
   · Flip a fair coin.
   · If heads, repetitively call Oracle and pass patterns through h1 until a pattern is misclassified by h1, then add it to the training set for h2. If tails, repetitively call Oracle until a pattern is classified correctly by h1, then add it to the training set for h2.
3. Call WeakLearn to train h2.
4. Iterate until m training patterns are obtained:
   · Call Oracle and pass the pattern through h1 and h2.
   · If h1 and h2 disagree on the classification, add the pattern to the training set for h3; otherwise discard it.
5. Call WeakLearn to train h3.
6. The final hypothesis is then:
   $h_{\mathrm{final}}(x) = \arg\max_{y \in Y} \sum_{i=1}^{3} h_i(x, y)$
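To illustrate the filtering in step 2 of Figure 1, here is a minimal sketch (not the paper's code) of how the second training set can be assembled so that, in expectation, half of it is misclassified by the first network and half is classified correctly. The names oracle, h1, and build_second_training_set are hypothetical stand-ins for the pattern generator and the first trained network.

```python
import random

def build_second_training_set(oracle, h1, m, rng=random.Random(0)):
    """Assemble m patterns for h2 using the coin-flip filter of Figure 1.
    `oracle` is an iterator yielding (pattern, label) pairs; `h1` maps a
    pattern to a predicted label."""
    training_set = []
    while len(training_set) < m:
        want_mistake = rng.random() < 0.5          # flip a fair coin
        pattern, label = next(oracle)
        # keep drawing patterns until h1's behaviour matches the coin flip
        while (h1(pattern) != label) != want_mistake:
            pattern, label = next(oracle)
        training_set.append((pattern, label))
    return training_set

# Toy usage with a trivial oracle and a deliberately imperfect h1.
def toy_oracle():
    r = random.Random(1)
    while True:
        x = r.uniform(-1, 1)
        yield x, int(x > 0)

weak_h1 = lambda x: int(x > 0.3)                   # wrong for x in (0, 0.3]
subset = build_second_training_set(toy_oracle(), weak_h1, m=20)
```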
In the case of neural networks recognizing the ten digits, we simply added the three sets of ten outputs together and chose as the hypothesis the output with the largest sum. The problem with this algorithm is that it assumes an oracle that can produce a very large number of training patterns. The method used to obtain such a large number of training patterns was to apply affine transformations to the original training set of written digits to produce a large number of derivative patterns. The affine transformations included horizontal and vertical translations, scaling, squeezing (simultaneous vertical and horizontal elongation of the original patterns), shearing, and line-width variation. Using this technique on the NIST (National Institute of Standards and Technology) database of 60,000 training patterns and 10,000 test patterns, we were able to bring the error rate down from 1.1% (using a single neural network) to 0.7% using the three neural networks [9], the lowest reported error rate for this database.

Figure 2: Algorithm AdaBoost.P

1. Input: a sequence of m examples $(x_1, y_1), \ldots, (x_m, y_m)$ with labels $y_i \in Y = \{1, \ldots, k\}$, and WeakLearn.
2. Let $B = \{(i, y) : i \in \{1, \ldots, m\},\ y \neq y_i\}$.
3. Initialize $D_1(i, y) = 1/|B|$ for $(i, y) \in B$. Set t = 1.
4. Iterate while $\epsilon_t < 0.5$:
   · Call WeakLearn, providing it with the distribution $D_t(i, y)$.
   · Get back a hypothesis $h_t : X \times Y \to [0, 1]$.
   · Calculate the pseudo-loss: $\epsilon_t = \frac{1}{2} \sum_{(i,y) \in B} D_t(i, y)\,[1 - h_t(x_i, y_i) + h_t(x_i, y)]$
   · Set $\beta_t = \epsilon_t / (1 - \epsilon_t)$.
   · Update the distribution: $D_{t+1}(i, y) = \frac{D_t(i, y)\, \beta_t^{\frac{1}{2}[1 + h_t(x_i, y_i) - h_t(x_i, y)]}}{Z_t}$, where $Z_t$ is a normalization constant chosen such that $D_{t+1}$ is a distribution.
   · Set t = t + 1.
5. Output the final hypothesis:
   $h_{\mathrm{final}}(x) = \arg\max_{y \in Y} \sum_t \left( \log \frac{1}{\beta_t} \right) h_t(x, y)$

This first boosting algorithm has been supplanted by a multitude of boosting algorithms, all with different names but generically called AdaBoost algorithms. Rather than relying on an oracle, the same training examples are used over and over again: as each new member of the ensemble is created, its weak learner is trained on a different distribution taken from the original training set.
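The following is a minimal sketch (my own illustration, not the paper's implementation) of the pseudo-loss loop in Figure 2. It uses small scikit-learn decision trees as weak learners, takes $h_t(x, y)$ to be the tree's predicted probability for label y, and collapses the mislabel distribution $D_t(i, y)$ into one sample weight per example, which is a common simplification when the weak learner cannot consume the full mislabel distribution. The function name adaboost_pseudo_loss, the tree depth, and the toy data are assumptions.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_pseudo_loss(X, y, n_classes, max_rounds=20):
    """Sketch of a pseudo-loss boosting loop in the style of Figure 2.
    Labels are assumed to be integers 0..n_classes-1."""
    m = len(y)
    # B = {(i, label) : label != y_i}, stored as an m x k matrix whose
    # true-label column is forced to zero; D_1(i, y) = 1/|B|.
    D = np.ones((m, n_classes)) / (m * (n_classes - 1))
    D[np.arange(m), y] = 0.0

    hypotheses, betas = [], []
    for _ in range(max_rounds):
        sample_w = D.sum(axis=1)                         # collapse D_t to per-example weights
        tree = DecisionTreeClassifier(max_depth=3, random_state=0)
        tree.fit(X, y, sample_weight=sample_w / sample_w.sum())
        H = np.zeros((m, n_classes))
        H[:, tree.classes_] = tree.predict_proba(X)      # h_t(x_i, label)

        h_true = H[np.arange(m), y][:, None]             # h_t(x_i, y_i)
        eps = 0.5 * np.sum(D * (1.0 - h_true + H))       # pseudo-loss
        if eps >= 0.5:
            break
        eps = max(eps, 1e-10)
        beta = eps / (1.0 - eps)
        D = D * beta ** (0.5 * (1.0 + h_true - H))       # distribution update
        D[np.arange(m), y] = 0.0                         # keep true-label pairs out of B
        D /= D.sum()                                     # normalization (Z_t)
        hypotheses.append(tree)
        betas.append(beta)

    def predict(X_new):
        votes = np.zeros((len(X_new), n_classes))
        for tree, beta in zip(hypotheses, betas):
            P = np.zeros((len(X_new), n_classes))
            P[:, tree.classes_] = tree.predict_proba(X_new)
            votes += np.log(1.0 / beta) * P              # weighted vote, as in step 5
        return votes.argmax(axis=1)

    return predict

# Toy usage on random data with 3 classes (purely illustrative).
rng = np.random.default_rng(0)
Xd = rng.normal(size=(300, 5))
yd = (Xd[:, 0] + Xd[:, 1] > 0).astype(int) + (Xd[:, 2] > 1).astype(int)   # labels 0..2
predict = adaboost_pseudo_loss(Xd, yd, n_classes=3)
print((predict(Xd) == yd).mean())
```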