Pattern Recognition Letters 28 (2007) 631–643 www.elsevier.com/locate/patrec

Unifying multi-class AdaBoost algorithms with binary base learners under the margin framework

Yijun Sun a,*, Sinisa Todorovic b, Jian Li c

a Interdisciplinary Center for Biotechnology Research, P.O. Box 103622, University of Florida, Gainesville, FL 32610, USA
b Beckman Institute, University of Illinois at Urbana-Champaign, 405 N. Mathews Ave, Urbana, IL 61801, USA
c Department of Electrical and Computer Engineering, P.O. Box 116130, University of Florida, Gainesville, FL 32611, USA

Received 30 January 2006; received in revised form 21 June 2006 Available online 14 December 2006

Communicated by K. Tumer

Abstract

Multi-class AdaBoost algorithms AdaBoost.MO, -ECC and -OC have received great attention in the literature, but their relationships have not been fully examined to date. In this paper, we present a novel interpretation of the three algorithms, by showing that MO and ECC perform stage-wise functional gradient descent on a cost function defined over margin values, and that OC is a shrinkage version of ECC. This allows us to strictly explain the properties of ECC and OC, empirically observed in prior work. Also, the outlined interpretation leads us to introduce shrinkage as regularization in MO and ECC, and thus to derive two new algorithms: SMO and SECC. Experiments on diverse databases are performed. The results demonstrate the effectiveness of the proposed algorithms and validate our theoretical findings.
© 2006 Elsevier B.V. All rights reserved.

Keywords: AdaBoost; Margin theory; Multi-class classification problem

1. Introduction

AdaBoost is a method for improving the accuracy of a learning algorithm (Freund and Schapire, 1997; Schapire and Singer, 1999). It repeatedly applies a base learner to the re-sampled version of training data to produce a collection of hypothesis functions, which are then combined via a weighted linear vote to form the final ensemble classifier. Under a mild assumption that the base learner consistently outperforms random guessing, AdaBoost can produce a classifier with arbitrarily accurate training performance.

AdaBoost, originally designed for binary problems, can be extended to solve multi-class problems. In one such approach, a multi-class problem is first decomposed into a set of binary ones, and then a binary classifier is used as the base learner. This approach is important, since the majority of well-studied classification algorithms are designed only for binary problems. It is also the main focus of this paper. Several algorithms have been proposed within the outlined framework, including AdaBoost.MO (Schapire and Singer, 1999), -OC (Schapire, 1997) and -ECC (Guruswami and Sahai, 1999). In these algorithms, a code matrix is specified such that each row of the code matrix (i.e., code word) represents a class. The code matrix in MO is specified before learning; therefore, the underlying dependence between the fixed code matrix and the so-constructed binary classifiers is not explicitly accounted for, as discussed in (Allwein et al., 2000). ECC and OC seem to alleviate this problem by alternatively generating columns of a code matrix and binary hypothesis functions in a pre-defined number of iteration steps. Thereby, the underlying dependence between the code matrix and the binary classifiers is exploited in a stage-wise manner.

* Corresponding author. Tel.: +1 352 273 8065; fax: +1 352 392 0044. E-mail address: [email protected] (Y. Sun).


MO, ECC, and OC, as the multi-class extensions of AdaBoost, naturally inherit some of the well-known properties of AdaBoost, including a good generalization capability. Extensive theoretical and empirical studies have been reported in the literature aimed at understanding the generalization capability of the two-class AdaBoost (Dietterich, 2000; Grove and Schuurmans, 1998; Quinlan, 1996). One leading explanation is the margin theory (Schapire et al., 1998), stating that AdaBoost can effectively increase the margin, which in turn is conducive to a good generalization over unseen test data. It has been proved that AdaBoost performs a stage-wise functional gradient descent procedure on a cost function of sample margins (Mason et al., 2000; Breiman, 1999; Friedman et al., 2000). However, to the best of our knowledge, neither a similar margin-based analysis nor an empirical comparison of the multi-class AdaBoost algorithms has to date been reported.

Further, we observe that the behavior of ECC and OC for various experimental settings, as well as the relationship between the two algorithms, is not fully examined in the literature. For example, Guruswami and Sahai (1999) claim that ECC outperforms OC, which, as we show in this paper, is not true for many settings. Also, it has been empirically observed that the training error of ECC converges faster to zero than that of OC (Guruswami and Sahai, 1999), but no mathematical explanation of this phenomenon, as of yet, has been proposed.

In this paper, we investigate the aforementioned missing links in the theoretical developments and empirical studies of MO, ECC, and OC. We show that MO and ECC perform stage-wise functional gradient descent on a cost function defined in the domain of margin values. We further prove that OC is actually a shrinkage version of ECC. This theoretical analysis allows us to derive the following results. First, several properties of ECC and OC are formulated and proved, including (i) the relationship between their training convergence rates, and (ii) their performances in noisy regimes. Second, we show how to avoid the redundant calculation of pseudo-loss in OC, and thus to simplify the algorithm. Third, two new regularized algorithms are derived, referred to as SMO and SECC, where shrinkage as regularization is explicitly exploited in MO and ECC, respectively.

We also report experiments on the algorithms' behavior in the presence of mislabeled training data. Mislabeling noise is critical for many applications, where preparing a good training data set is a challenging task. Indeed, erroneous human supervision in hard-to-classify cases may lead to training sets with a significant number of mislabeled data. The influence of mislabeled training data on the classification error for two-class AdaBoost was also investigated in (Dietterich, 2000). However, to our knowledge, no such study has been reported for multi-class AdaBoost yet. The experimental results support our theoretical findings. In a very likely event, when for example 10% of training patterns are mislabeled, OC outperforms ECC. Moreover, in the presence of mislabeling noise, SECC converges fastest to the smallest test error, as compared to the other algorithms.

The remainder of the paper is organized as follows. In Sections 2 and 3, we briefly review the output-coding method for solving the multi-class problem and two-class AdaBoost. Then, in Sections 4 and 5, we show that MO and ECC perform functional gradient descent. Further, in Section 5, we prove that OC is a shrinkage version of ECC. The experimental validation of our theoretical findings is presented in Section 6. We conclude the paper with our final remarks in Section 7.

2. Output coding method

In this section, we briefly review the output coding method for solving multi-class classification problems (Allwein et al., 2000; Dietterich and Bakiri, 1995). Let $D = \{(x_n, y_n)\}_{n=1}^{N} \subset X \times Y$ denote a training dataset, where $X$ is the pattern space and $Y = \{1, \ldots, C\}$ is the label space. To decompose the multi-class problem into several binary ones, a code matrix $M \in \{\pm 1\}^{C \times L}$ is introduced, where $L$ is the length of a code word. Here, $M(c)$ denotes the $c$th row, that is, a code word for class $c$, and $M(c,l)$ denotes an element of the code matrix. Each column of $M$ defines a binary partition of the $C$ classes over data samples – the partition on which a binary classifier is trained. After $L$ training steps, the output coding method produces a final classifier $f(x) = [f_1(x), \ldots, f_L(x)]^T$, where $f_l(x): x \to \mathbb{R}$. When presented an unseen sample $x$, the output coding method predicts the label $y^*$ such that the code word $M(y^*)$ is the "closest" to $f(x)$, with respect to a specified decoding strategy. In this paper, we use the loss-based decoding strategy (Allwein et al., 2000), given by $y^* = \arg\min_{y \in Y} \sum_{l=1}^{L} \exp(-M(y,l) f_l(x))$.
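To make the decoding step concrete, the following Python sketch implements the loss-based decoding rule above for a given code matrix and a vector of binary scores; the array names, shapes, and the toy one-against-all code are our own illustrative assumptions, not part of the original algorithm specification.

```python
import numpy as np

def loss_based_decode(M, f):
    """Loss-based decoding (Allwein et al., 2000).

    M : (C, L) code matrix with entries in {-1, +1}.
    f : (L,) real-valued outputs of the L binary classifiers for one sample.
    Returns the class index y* minimizing sum_l exp(-M[y, l] * f[l]).
    """
    losses = np.exp(-M * f[np.newaxis, :]).sum(axis=1)  # one loss per class
    return int(np.argmin(losses))

# Toy usage: 3 classes, a one-against-all code of length 3 (an illustrative choice).
M = np.array([[+1, -1, -1],
              [-1, +1, -1],
              [-1, -1, +1]])
f = np.array([0.4, -1.2, 2.3])   # hypothetical classifier scores
print(loss_based_decode(M, f))    # -> 2
```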
3. AdaBoost and LP-based margin optimization

In this section, we briefly review the AdaBoost algorithm and its relationship with margin optimization. Given a set of hypothesis functions $H = \{h(x): x \to \{\pm 1\}\}$, AdaBoost finds an ensemble function of the form $F(x) = \sum_t \alpha_t h_t(x)$, or $f(x) = F(x)/\sum_t \alpha_t$, where $\alpha_t \geq 0$ for all $t$, by minimizing the cost function $G = \sum_{n=1}^{N} \exp(-y_n F(x_n))$. The sample margin is defined as $\rho(x_n) \triangleq y_n f(x_n)$, and the classifier margin, or simply margin, as $\rho \triangleq \min_{1 \leq n \leq N} \rho(x_n)$. Interested readers can refer to Freund and Schapire (1997) and Sun et al. (in press) for a more detailed discussion of AdaBoost.

It has been empirically observed that AdaBoost can effectively increase the margin (Schapire et al., 1998). For this reason, since the invention of AdaBoost, it has been conjectured that when $t \to \infty$, AdaBoost solves a linear programming (LP) problem:

$\max \rho$, s.t. $\rho(x_n) \geq \rho$, $n = 1, \ldots, N$,  (1)

where the margin is directly maximized. In the recent paper of Rudin et al. (2004), however, the equivalence of the two algorithms has been proven to not always hold. Nevertheless, these two algorithms are closely connected in the sense that both try to maximize the margin. Throughout this paper, we will make use of this connection to define a cost function over the domain of margin values, upon which the multi-class AdaBoost.MO, ECC and OC algorithms perform stage-wise gradient descent.
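As a small illustration of the quantities used above, the following minimal sketch (our own, with the base learners left abstract as callables) computes the normalized sample margins, the classifier margin, and the exponential cost $G$ for a given ensemble; it is meant only to fix notation, not to reproduce the authors' implementation.

```python
import numpy as np

def ensemble_margins(alphas, hypotheses, X, y):
    """Normalized sample margins rho(x_n) = y_n * F(x_n) / sum_t alpha_t.

    alphas     : list of non-negative weights alpha_t.
    hypotheses : list of callables h_t(X) -> array of {-1, +1} predictions.
    X, y       : data matrix and binary labels in {-1, +1}.
    """
    F = sum(a * h(X) for a, h in zip(alphas, hypotheses))  # unnormalized F(x_n)
    margins = y * F / np.sum(alphas)                        # rho(x_n)
    cost = np.exp(-y * F).sum()                             # G = sum_n exp(-y_n F(x_n))
    return margins, margins.min(), cost                     # sample margins, classifier margin, G
```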
4. AdaBoost.MO

In this section, we study AdaBoost.MO. In Fig. 1, we present the pseudo-code of the algorithm as proposed in (Schapire and Singer, 1999). Given a code matrix $M \in \{\pm 1\}^{C \times L}$, in each iteration $t$, $t = 1, \ldots, T$, a distribution $D_t$ is generated over pairs of training examples and columns of the matrix $M$. A set of base learners $\{h_t^{(l)}(x)\}_{l=1}^{L}$ is then trained with respect to the distribution $D_t$, based on the binary partition defined by each column of $M$. The error $\epsilon_t$ of $\{h_t^{(l)}(x)\}_{l=1}^{L}$ is computed as in Step (4) in Fig. 1, and the combination coefficient $\alpha_t$ is computed in Step (5) as $\alpha_t = \frac{1}{2}\ln[(1-\epsilon_t)/\epsilon_t]$. Then, the distribution is updated as $D_{t+1}(n,l) = D_t(n,l)\exp(-\alpha_t M(y_n,l) h_t^{(l)}(x_n))/Z_t$, where $Z_t$ is a normalization constant such that $D_{t+1}$ is a distribution, i.e., $\sum_{n=1}^{N}\sum_{l=1}^{L} D_{t+1}(n,l) = 1$. The process continues until the maximum iteration number $T$ is reached. After $T$ rounds, MO outputs a final classifier $F(x) = [F^{(1)}(x), \ldots, F^{(L)}(x)]^T$ or $f(x) = F(x)/\sum_{t=1}^{T}\alpha_t$, where $F^{(l)}(x) = \sum_{t=1}^{T}\alpha_t h_t^{(l)}(x)$. When presented with an unseen sample $x$, MO predicts the label $y^*$ whose row $M(y^*)$ is the "closest" to $F(x)$, with respect to a specified decoding strategy.

Now, we seek to define the cost function upon which MO performs the optimization. Given a fixed code matrix, one natural idea is to find $F(x)$ such that the minimum value of the sample margin of each bit position is maximized:

$\max \rho$, s.t. $\rho^{(l)}(x_n) = M(y_n,l)\,\frac{\sum_t \alpha_t h_t^{(l)}(x_n)}{\sum_t \alpha_t} \geq \rho$, $n = 1,\ldots,N$, $l = 1,\ldots,L$.  (2)

In the light of the connection between AdaBoost and LP, discussed in Section 3, the cost function that MO optimizes can be specified as

$G = \sum_{n=1}^{N}\sum_{l=1}^{L}\exp\!\left(-\rho^{(l)}(x_n)\sum_t \alpha_t\right) = \sum_{n=1}^{N}\sum_{l=1}^{L}\exp(-F^{(l)}(x_n)\, M(y_n,l))$,  (3)

which is indeed proved in the following theorem.

Theorem 1. AdaBoost.MO performs a stage-wise functional gradient descent procedure on the cost function given by Eq. (3).

Proof. The proof is provided in the Appendix. □
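To make the loop of Fig. 1 concrete, here is a compact sketch of one way to implement the AdaBoost.MO updates described above; the use of a depth-1 scikit-learn decision tree as the base learner and all variable names are our own assumptions for illustration, not the authors' code.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier  # stand-in for the C4.5 base learner

def adaboost_mo(X, y, M, T):
    """Sketch of AdaBoost.MO (Schapire and Singer, 1999) as described in Section 4.

    X : (N, d) data, y : (N,) labels in {0, ..., C-1}, M : (C, L) code matrix in {-1, +1}.
    Returns the list of (alpha_t, [h_t^(1), ..., h_t^(L)]) pairs defining F(x).
    """
    N, L = X.shape[0], M.shape[1]
    D = np.full((N, L), 1.0 / (N * L))           # joint distribution over (example, column) pairs
    ensemble = []
    for t in range(T):
        hs, preds = [], np.empty((N, L))
        for l in range(L):                        # Step (3): one binary learner per column
            target = M[y, l]                      # binary relabeling defined by column l
            h = DecisionTreeClassifier(max_depth=1).fit(X, target, sample_weight=D[:, l])
            hs.append(h)
            preds[:, l] = h.predict(X)
        eps = np.sum(D * (preds != M[y, :]))      # Step (4): weighted error over all (n, l)
        alpha = 0.5 * np.log((1 - eps) / eps)     # Step (5)
        D *= np.exp(-alpha * M[y, :] * preds)     # Step (6): reweight and renormalize
        D /= D.sum()
        ensemble.append((alpha, hs))
    return ensemble
```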

Fig. 1. Pseudo-code of AdaBoost.MO, as proposed in (Schapire and Singer, 1999).

Fig. 2. Pseudo-code of AdaBoost.ECC, as proposed in (Guruswami and Sahai, 1999).

Fig. 3. Pseudo-code of AdaBoost.OC, as proposed in (Schapire, 1997).

5. AdaBoost.ECC and AdaBoost.OC

In this section, we investigate ECC (Schapire, 1997) and OC (Guruswami and Sahai, 1999), whose pseudo-codes are given in Figs. 2 and 3, respectively. Our goal is to unify both algorithms under the framework of margin theory, and to establish the relationship between the two algorithms.

We begin by defining the sample margin, $\rho(x_n)$, as

$\rho(x_n) \triangleq \min_{\{c \in C,\, c \neq y_n\}} D(M(c), f(x_n)) - D(M(y_n), f(x_n))$,  (4)

where $D(\cdot)$ is a distance measure. Maximization of $\rho(x_n)$ in Eq. (4) can be interpreted as finding $f(x_n)$ close to the code word of the true label, while at the same time distant from the code word of the most confused class. By specifying $D(M(c), f(x)) \triangleq \|M(c) - f(x)\|^2$, from Eq. (4) we derive

$\rho(x_n) = 2 M(y_n) f(x_n) - \max_{\{c \in C,\, c \neq y_n\}}\{2 M(c) f(x_n)\}$.  (5)

Maximization of $\rho(x_n)$ can be formulated as an optimization problem:

$\max \rho$, s.t. $M(y_n) f(x_n) - \max_{\{c \in C,\, c \neq y_n\}}\{M(c) f(x_n)\} \geq \rho$, $n = 1, \ldots, N$.  (6)

We are particularly interested in finding $f(x)$ of the form $f(x) = [\alpha_1 h_1(x), \ldots, \alpha_T h_T(x)]^T$, where $\sum_{t=1}^{T}\alpha_t = 1$, $\alpha_t \geq 0$. From Eq. (6), we derive

$\max \rho$, s.t. $\sum_{t=1}^{T}\alpha_t (M(y_n,t) - M(c,t))\, h_t(x_n) \geq \rho$, $n = 1,\ldots,N$, $c = 1,\ldots,C$, $c \neq y_n$, $\sum_{t=1}^{T}\alpha_t = 1$, $\alpha_t \geq 0$.  (7)

From the discussion on LP and AdaBoost in Section 3, it appears reasonable to define a new cost function, optimized by a multi-class AdaBoost algorithm:

$G \triangleq \sum_{n=1}^{N}\sum_{\{c=1,\, c \neq y_n\}}^{C}\exp(-(M(y_n) - M(c))\, F(x_n)) = \sum_{n=1}^{N}\sum_{\{c=1,\, c \neq y_n\}}^{C}\exp\!\left(-\sum_{t=1}^{T}\alpha_t (M(y_n,t) - M(c,t))\, h_t(x_n)\right)$.  (8)
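The cost in Eq. (8) is easy to evaluate numerically; the short sketch below is our own illustration (with hypothetical array shapes) of computing $G$ for a given code matrix, hypothesis outputs, and coefficients.

```python
import numpy as np

def ecc_cost(M, y, H, alphas):
    """Cost G of Eq. (8) for AdaBoost.ECC.

    M      : (C, T) code matrix (one column per boosting step), entries in {-1, +1}.
    y      : (N,) true labels in {0, ..., C-1}.
    H      : (N, T) binary hypothesis outputs h_t(x_n) in {-1, +1}.
    alphas : (T,) non-negative coefficients alpha_t.
    """
    F = H * alphas                                  # alpha_t * h_t(x_n), shape (N, T)
    diff = M[y][:, None, :] - M[None, :, :]         # M(y_n, t) - M(c, t), shape (N, C, T)
    margins = np.einsum('nct,nt->nc', diff, F)      # sum_t alpha_t (M(y_n,t) - M(c,t)) h_t(x_n)
    G = np.exp(-margins)                            # per-(n, c) exponential terms
    G[np.arange(len(y)), y] = 0.0                   # exclude c = y_n from the sum
    return G.sum()
```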

The following theorem shows that AdaBoost.ECC optimizes the above cost function.

Theorem 2. AdaBoost.ECC performs a stage-wise functional gradient descent procedure on the cost function given by Eq. (8).

Proof. Fig. 2 presents the pseudo-code of ECC (Guruswami and Sahai, 1999). By comparing the expressions for (i) the data-sampling distribution $d_t$ (Step (5)), and (ii) the weights $\alpha_t$ (Step (8)), with those obtained from minimization of $G$ in Eq. (8), we prove the theorem. For the moment, we assume that the code matrix $M$ is given, the generation of which is discussed later. After $(t-1)$ iteration steps, ECC produces $F_{t-1}(x_n) = [\alpha_1 h_1(x_n), \ldots, \alpha_{t-1} h_{t-1}(x_n), 0, \ldots, 0]^T$. In the $t$th iteration step, the goal of the algorithm is to compute the $t$th entry of $F_t(x_n) = [\alpha_1 h_1(x_n), \ldots, \alpha_{t-1} h_{t-1}(x_n), \alpha_t h_t(x_n), 0, \ldots, 0]^T$. From Eq. (8), the negative functional derivative of $G$ with respect to $F_{t-1}(x)$, at $x = x_n$, is computed as $-\nabla_{F_{t-1}} G|_{x=x_n} = \sum_{\{c=1,\, c \neq y_n\}}^{C} (M(y_n) - M(c))\, G_{t-1}(n,c)$, where $G_{t-1}(n,c) \triangleq \exp(-(M(y_n) - M(c))\, F_{t-1}(x_n))$. To find the optimal hypothesis function $h_t$ in the next training step, we resort to Friedman's optimization approach (Friedman, 2001). It is straightforward to show that $h_t$ should be selected from $H$ by maximizing its correlation with the $t$th component of $-\nabla_{F_{t-1}} G$, as $h_t = \arg\max_{h \in H} \sum_{n=1}^{N} h(x_n) \sum_{\{c=1,\, c \neq y_n\}}^{C} (M(y_n,t) - M(c,t))\, G_{t-1}(n,c)$. To facilitate the computation of $h_t$, we introduce the following terms:

$V_t(n) \triangleq \sum_{\{c=1,\, c \neq y_n\}}^{C} |M(y_n,t) - M(c,t)|\, G_{t-1}(n,c)$, $\quad V_t \triangleq \sum_{n=1}^{N} V_t(n)$, $\quad d_t(n) \triangleq V_t(n)/V_t$.  (9)

Note that $V_t$ differs from $U_t$, defined in Step (4) in Fig. 2, by a constant $2N(C-1)\prod_{i=1}^{t-1} Z_i$, where $Z_i$ is a normalization constant defined in Step (9) in Fig. 2. Also, note that $(M(y_n,t) - M(c,t))$ either equals zero or has the same sign as $M(y_n,t)$. It follows that

$h_t = \arg\max_{h \in H} \sum_{n=1}^{N} V_t(n)\,\mathrm{sign}(M(y_n,t))\, h(x_n)$  (10)
$\quad = \arg\max_{h \in H} V_t \sum_{n=1}^{N} d_t(n)\, M(y_n,t)\, h(x_n)$  (11)
$\quad = \arg\max_{h \in H} U_t \sum_{n=1}^{N} d_t(n)\, M(y_n,t)\, h(x_n)$  (12)
$\quad = \arg\min_{h \in H} \sum_{n=1}^{N} I(M(y_n,t) \neq h(x_n))\, d_t(n)$  (13)
$\quad = \arg\min_{h \in H} \epsilon$,  (14)

where $\epsilon$ is the weighted error of $h$. Once $h_t$ is found, $\alpha_t$ can be computed by a line search as

$\alpha_t = \arg\min_{\alpha \geq 0} G_t = \arg\min_{\alpha \geq 0} \sum_{n=1}^{N}\sum_{\{c=1,\, c \neq y_n\}}^{C} G_{t-1}(n,c)\exp[-\alpha (M(y_n,t) - M(c,t))\, h_t(x_n)]$.  (15)

In our case, where the hypotheses $h_t$ are specified as binary classifiers, $\alpha_t$ can be solved analytically. Taking the derivative of $G_t$ with respect to $\alpha_t$ gives

$\dfrac{\partial G_t}{\partial \alpha_t} = -V_t \sum_{n=1}^{N} d_t(n)\, M(y_n,t)\, h_t(x_n)\exp(-2\alpha_t M(y_n,t) h_t(x_n)) = -V_t\!\left(\sum_{n=1}^{N} I(M(y_n,t) = h_t(x_n))\, d_t(n)\, e^{-2\alpha_t} - \sum_{n=1}^{N} I(M(y_n,t) \neq h_t(x_n))\, d_t(n)\, e^{2\alpha_t}\right)$,  (16)

where we used the definitions from Eq. (9), and the fact that $(M(y_n,t) - M(c,t))$ takes values in the set $\{0, 2, -2\}$ and has the same sign as $M(y_n,t)$. From Eqs. (13) and (14), and $\partial G_t/\partial \alpha_t = 0$, we obtain $\alpha_t = \frac{1}{4}\ln[(1-\epsilon_t)/\epsilon_t]$, which is equal to the expression given in Step (8), Fig. 2.

Finally, we check the update rule for the data distribution $d_t$. By unravelling $D_t(n,c)$ in Step (9) of the pseudo-code of ECC (see Fig. 2), we derive $D_t(n,c) = G_{t-1}(n,c)/\big(N(C-1)\prod_{i=1}^{t-1} Z_i\big)$, for $c \neq y_n$, and $D_t(n,c) = 0$, for $c = y_n$. By plugging this result into Steps (4) and (5) of the pseudo-code of ECC, it is straightforward to show that the expressions for $d_t$ in Step (5) and Eq. (9) are the same. This completes the proof. □

Table 1
The number of samples in each database

Database    Training     Cross-validation   Test
Cars        865 (50%)    286 (15%)          577 (35%)
Images      210 (9%)     210 (9%)           1890 (82%)
Letters     8039 (40%)   3976 (20%)         7985 (40%)
PenDigits   5621 (75%)   1873 (25%)         3498
USPS        6931 (95%)   360 (5%)           2007

Now, we discuss how to generate the columns of the code matrix, denoted as $M_{.t}$. Simultaneous optimization of both $M_{.t}$ and $h_t$ is known to be an NP-hard problem (Crammer and Singer, 2000). In both OC and ECC, this problem is alleviated by conducting a two-stage optimization. That is, $M_{.t}$ is first generated by maximizing $U_t$, given in Step (4) in Fig. 2, and then $h_t$ is trained based on the binary partition defined by $M_{.t}$. In (Schapire, 1997; Guruswami and Sahai, 1999), this procedure is justified by showing that maximizing $U_t$ decreases the upper bound of the training error. Maximizing $U_t$ is a special case of the "Max-Cut" problem, which is known to be NP-complete. For computing the approximate solution of the optimal $M_{.t}$, in our experiments we use the same approach as that used in (Schapire, 1997).
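Putting the pieces of the proof together, one boosting iteration of ECC reduces to: train $h_t$ on the binary relabeling given by column $M_{.t}$ under the weights $d_t$, and set $\alpha_t = \frac{1}{4}\ln[(1-\epsilon_t)/\epsilon_t]$. The sketch below is our own minimal illustration of that step, again using a depth-1 tree as a hypothetical stand-in for the C4.5 base learner of Section 6.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier  # hypothetical stand-in for C4.5

def ecc_step(X, y, column, d):
    """One AdaBoost.ECC iteration after the column M_{.t} has been chosen.

    X      : (N, d) data, y : (N,) labels in {0, ..., C-1}.
    column : (C,) the t-th column of the code matrix, entries in {-1, +1}.
    d      : (N,) data-sampling distribution d_t (non-negative, sums to 1).
    Returns the trained binary hypothesis and its coefficient alpha_t.
    """
    target = column[y]                                    # binary relabeling M(y_n, t)
    h = DecisionTreeClassifier(max_depth=1).fit(X, target, sample_weight=d)
    eps = np.sum(d * (h.predict(X) != target))            # weighted error, Eq. (13)
    alpha = 0.25 * np.log((1 - eps) / eps)                # Step (8) in Fig. 2, via Eq. (16)
    return h, alpha
```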


Fig. 4. Classification errors on the test data of Cars, for label-noise levels of 0%, 10%, 20%, and 30%. In some plots, the error curves of SECC and SMO overlap those of ECC and MO (i.e., g = 1), respectively. For SECC and SMO, the optimal number of training steps T* is found by cross-validation and indicated in the figures.

We point out that the proof of Theorem 2 provides for yet another interpretation of the outlined procedure. Ideally, in the $t$th iteration step we want to find $M_{.t}$ and $h_t$ simultaneously to maximize their correlation, which however is NP-hard. Therefore, we resort to the two-stage optimization. It is evident from Eq. (12) that $M_{.t}$ should be generated such that $U_t$ is maximized.

5.1. Relationship between AdaBoost.OC and AdaBoost.ECC

ECC is derived from OC on the algorithm level, as discussed in (Guruswami and Sahai, 1999). However, their relationship is not fully and strictly examined in the literature, which is addressed in the following theorem.

Theorem 3. AdaBoost.OC is a shrinkage version of AdaBoost.ECC.

Proof. We compare the expressions for computing the weights, and for updating the data-sampling distribution, to establish the relationship between OC and ECC. Let us first take a look at the pseudo-code of OC given in Fig. 3. In Step (7), a pseudo-hypothesis function is constructed as $\tilde{h}_t(x) = \{c \in Y : h_t(x) = M(c,t)\}$. Using $\tilde{h}_t(x)$, a pseudo-loss, $\tilde{\epsilon}_t$, is computed in Step (8) as $\tilde{\epsilon}_t = \frac{1}{2}\sum_{n=1}^{N}\sum_{c=1}^{C} D_t(n,c)\big(I(y_n \notin \tilde{h}_t(x_n)) + I(c \in \tilde{h}_t(x_n))\big)$, where $D_t(n,c)$ is updated in Step (9). Note that:

$I(y_n \notin \tilde{h}_t(x_n)) + I(c \in \tilde{h}_t(x_n)) = \frac{1}{2}(M(c,t) - M(y_n,t))\, h_t(x_n) + 1$.  (17)

It follows that

$\tilde{\epsilon}_t = \frac{1}{4}\underbrace{\sum_{n=1}^{N}\sum_{c=1}^{C} D_t(n,c)(M(c,t) - M(y_n,t))\, h_t(x_n)}_{r} + \frac{1}{2} = \frac{1}{4} r + \frac{1}{2} \ \Rightarrow\ r = 4\left(\tilde{\epsilon}_t - \frac{1}{2}\right)$.  (18)

Now, let us take a look at the pseudo-code of ECC given in Fig. 2. The training error, $\epsilon_t$, is computed in Step (7) as

$\epsilon_t = \sum_{n=1}^{N} I(M(y_n,t) \neq h_t(x_n))\, d_t(n) = \frac{1}{2} - \frac{1}{2}\sum_{n=1}^{N} d_t(n)\, M(y_n,t)\, h_t(x_n)$.  (19)


Fig. 5. Classification errors on the test data of Images.

From Step (5) in Fig. 2 and the fact that $I(M(y_n,t) \neq M(c,t))\, M(y_n,t) = (M(y_n,t) - M(c,t))/2$, we have

$\epsilon_t = \frac{1}{2} - \frac{1}{4 U_t}\sum_{n=1}^{N}\sum_{c=1}^{C} D_t(n,c)(M(y_n,t) - M(c,t))\, h_t(x_n) = \frac{1}{2} + \frac{1}{4 U_t}\, r$.  (20)

By plugging Eq. (18) into Eq. (20), we get:

$\epsilon_t = \frac{1}{2} + \frac{1}{U_t}\left(\tilde{\epsilon}_t - \frac{1}{2}\right)$.  (21)

From Eq. (21), we observe that $\epsilon_t \leq \frac{1}{2}$ if and only if $\tilde{\epsilon}_t \leq \frac{1}{2}$, which means that both algorithms provide the same conditions for the regular operation of AdaBoost. That is, when $\epsilon_t \leq \frac{1}{2}$, both $\alpha_t \geq 0$ and $\tilde{\alpha}_t \geq 0$. Furthermore, note that $U_t \in [0,1]$, as defined in Step (4) in Fig. 2. From Eq. (21), we have that, for $\tilde{\epsilon}_t \leq \frac{1}{2}$,

$\epsilon_t - \tilde{\epsilon}_t = \left(\frac{1}{U_t} - 1\right)\left(\tilde{\epsilon}_t - \frac{1}{2}\right) \leq 0 \ \Rightarrow\ \epsilon_t \leq \tilde{\epsilon}_t$.  (22)

Finally, from Eq. (22), and Steps (8) and (9) in Figs. 2 and 3, respectively, we derive

$\tilde{\alpha}_t = g_t\, \alpha_t$,  (23)

where $g_t \in [0,1]$. Eq. (23) asserts that in each iteration step $t$, AdaBoost.OC takes a smaller step size than AdaBoost.ECC in the search for $h_t$ over the functional space $H$. Finally, by using Eq. (17), it is straightforward to show that the updating rules of the data-sampling distributions of the two algorithms, i.e., Step (5) in Figs. 2 and 3, are the same. In conclusion, AdaBoost.OC is a shrinkage version of AdaBoost.ECC. □
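Relation (21) can be turned into a small numerical check: given the ECC weighted error $\epsilon_t$ and $U_t$, one can recover the OC pseudo-loss directly (the simplification noted in the first remark of Section 5.2 below). The snippet is our own illustration with arbitrary example numbers, not values from the paper's experiments.

```python
import numpy as np

def oc_pseudo_loss_from_ecc_error(eps_t, U_t):
    """Recover the OC pseudo-loss from the ECC weighted error via Eq. (21).

    Eq. (21): eps_t = 1/2 + (1/U_t) * (eps_tilde_t - 1/2)
    =>        eps_tilde_t = 1/2 + U_t * (eps_t - 1/2),
    so Steps (7)-(8) of Fig. 3 need not be evaluated once eps_t (Eq. (13)) is known.
    """
    return 0.5 + U_t * (eps_t - 0.5)

# Illustrative numbers only:
eps_t, U_t = 0.30, 0.8
eps_tilde = oc_pseudo_loss_from_ecc_error(eps_t, U_t)   # 0.34 >= eps_t, consistent with Eq. (22)
alpha_ecc = 0.25 * np.log((1 - eps_t) / eps_t)           # ECC step size, Step (8) in Fig. 2
print(eps_tilde, alpha_ecc)
```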


Fig. 6. Classification errors on the test data of Letters.

5.2. Discussion

The following remarks are immediate from Theorems 1–3:

• It is possible to reduce the computational complexity of OC by eliminating Steps (7) and (8) in Fig. 3. Instead, $\tilde{\epsilon}_t$ can be directly calculated from Eq. (21), given $\epsilon_t$. Here, the simplification stems from the fact that $\epsilon_t$ is easier to compute, as in Eq. (13).
• In (Guruswami and Sahai, 1999), the authors prove that ECC has a better upper bound of the training error than OC. They also experimentally observe that the training error of ECC converges faster than that of OC. However, the fact that the training-error upper bound of one algorithm is better than that of the other cannot explain the empirical evidence related to the convergence rate. Theorem 3 provides a strict proof of the observed phenomenon.
• Shrinkage can be considered a regularization method, which has been reported to significantly improve classification performance in the case of noise-corrupted data (Friedman, 2001). Introducing shrinkage in the steepest descent minimization of ECC is analogous to specifying a learning rate in neural networks. Consequently, based on the analysis in Theorem 3, one can expect that in noise-corrupted-data cases, OC should perform better than ECC. This provides a guideline for selecting the appropriate algorithm between the two.
• Theorems 1 and 2 give rise to a host of novel algorithms that can be constructed by introducing the shrinkage term, $g \in (0,1]$, into the original multi-class algorithms. Thereby, we design SMO and SECC, respectively. The pseudo-codes of SMO and MO are identical, except for Step (5), where $\alpha_t$ should be computed as $\alpha_t = g\,\frac{1}{2}\ln[(1-\epsilon_t)/\epsilon_t]$. Similarly, the pseudo-codes of SECC and ECC are identical, except for Step (8), where $\alpha_t$ should be computed as $\alpha_t = g\,\frac{1}{4}\ln[(1-\epsilon_t)/\epsilon_t]$ (see the sketch after this list).
• From Theorems 1–3, it follows that MO, SMO, ECC, SECC and OC perform stage-wise functional gradient descent on a cost function expressed in terms of margin values, i.e., that they increase the classifier margin over a given number of training iterations. Therefore, the margin-based theoretical analysis of two-class AdaBoost by Schapire et al. (1998) can be directly applied to the multi-class AdaBoost algorithms to explain their good generalization capability. This property is also demonstrated experimentally in the following section.

In the following section, we present experiments with MO, SMO, ECC, SECC, and OC, which support our theoretical findings.
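As a concrete illustration of the shrinkage modification described in the last two bullets, the helpers below scale the MO and ECC step sizes by a fixed $g$; the function names and the clipping of the error are our own sketch, not the authors' pseudo-code.

```python
import numpy as np

def secc_alpha(eps_t, g=0.5):
    """AdaBoost.SECC step size: the ECC coefficient of Step (8) scaled by g in (0, 1]."""
    eps_t = np.clip(eps_t, 1e-12, 0.5 - 1e-12)   # guard against eps_t = 0 or 1/2
    return g * 0.25 * np.log((1 - eps_t) / eps_t)

def smo_alpha(eps_t, g=0.5):
    """AdaBoost.SMO step size: the MO coefficient of Step (5) scaled by g in (0, 1]."""
    eps_t = np.clip(eps_t, 1e-12, 0.5 - 1e-12)
    return g * 0.5 * np.log((1 - eps_t) / eps_t)
```

With g = 1 both helpers reduce to the original ECC and MO step sizes, which is why the SECC and SMO error curves overlap those of ECC and MO in some of the plots of Figs. 4 and 5.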


Fig. 7. Classification errors on the test data of PenDigits.

6. Experiments

In our experiments, we choose C4.5 as a base learner. C4.5 is a decision-tree classifier with a long record of successful implementation in many classification systems (Quinlan, 1993). Although, in general, C4.5 can be employed to classify multiple classes, in our experiments we use it as a binary classifier.

We test AdaBoost.MO, SMO, ECC, SECC, and OC on five databases, four of which are publicly available at the UCI Benchmark Repository (Blake and Merz, 1998). These are (1) the Car Evaluation Database (Cars for short), (2) the Image Segmentation Database (Images), (3) the Letter Recognition Database (Letters), and (4) Pen-Based Recognition of Handwritten Digits (PenDigits). The fifth database is the USPS Dataset of Handwritten Characters (USPS) (LeCun et al., 1989). We divide each database into training, test, and validation sets. For Cars, Images, and Letters, all available data samples are grouped in a single file per database. Here, training, test, and validation sets are formed by random selection of samples from that single file, such that a certain percentage of samples per class is present in each of the three datasets, as detailed in Table 1. For PenDigits and USPS, the training and test datasets are already given. Here, the test datasets are kept intact, while new training and validation sets are formed from the original training data, as reported in Table 1. For the USPS database, to reduce the run-time of our experiments, we projected each sample using principal component analysis onto a lower-dimensional feature space (256 features → 54 features) at the price of 10% of the representation error.

To conduct experiments with mislabeling, noise is introduced only to the training and validation sets, while the test set is kept intact. The level of noise represents the percentage of randomly selected training data (or validation data) whose class labels are changed. The performance of the five algorithms is evaluated for several different noise levels, ranging from 0% to 30%.

For MO and SMO, we choose the "one against all" coding method, because of its simplicity, considering that there is no decisive answer as to which output code is the best (Allwein et al., 2000). Throughout, for SMO and SECC, the optimal shrinkage parameter g* and the optimal number of training steps T* are found by cross-validation. We find T* by sliding, in 10-step increments, a 20-step-wide averaging window over the test-error results on validation data. The sliding of the window is stopped at the training step T* = t when an increase in the average test error over the previously recorded value in step (t - 10) is detected. Further, we choose g* for which the classification error on the validation data at the T*th step is minimum.
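The following sketch shows one way to implement the sliding-window rule for selecting T* described above; the window width and stride come from the text, while the function name and the exact stopping convention at the boundaries are our own assumptions.

```python
import numpy as np

def select_T_star(val_errors, window=20, stride=10):
    """Pick T* from a per-iteration validation-error curve.

    val_errors : sequence of validation errors, one entry per boosting step.
    A 20-step-wide window is averaged and slid forward in 10-step increments;
    sliding stops as soon as the windowed average increases over the previous
    window, and the current step index is returned as T*.
    """
    errs = np.asarray(val_errors, dtype=float)
    prev_avg = np.inf
    t = window
    while t <= len(errs):
        avg = errs[t - window:t].mean()
        if avg > prev_avg:               # increase detected relative to step t - stride
            return t
        prev_avg = avg
        t += stride
    return len(errs)                      # no increase observed: use all available steps
```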


Fig. 8. Classification errors on the test data of USPS.

In all experiments, the maximum number of training steps is preset to T = 500 for ECC, SECC, and OC. As typically done in many approaches (Schapire and Singer, 1999; Schapire, 1997), we preset the number of training steps to obtain a reasonable comparison of the performance of the algorithms balanced against the required processing time. Although the algorithms may not reach the minimum possible test error in the specified number of iteration steps, this minimum error becomes irrelevant in real applications when the processing time exceeds several hours on a 2.4 GHz, 1 GB RAM PC, which is the case when T = 500. To obtain comparable results (in terms of computation) for MO and SMO, where we use the "one against all" coding strategy, the maximum number of training steps is computed as T = 500/(# of classes), which is different for each database. Thus, for Cars T = 125, for Images T = 72, for Letters T = 20, for PenDigits T = 50, and for USPS T = 50. Note that the outlined numbers of iteration steps take approximately the same processing time as T = 500 for ECC, SECC, and OC. The classification error on both the test and validation sets is averaged over 10 runs for each database.

Figs. 4–8 show the classification error on the test data. In Figs. 9 and 10 we plot the training errors for Images and Letters as two typical examples of the algorithms' training-error convergence rates. The optimal g* values for SECC and SMO are presented in Table 2. In Table 3, we also list the classification error on the test data at the optimal step T*, indicated in parentheses if different from the maximum preset number of iteration steps T. Fig. 11 illustrates the classifier margin on Letters over a range of mislabeling noise levels for the five algorithms. These results experimentally validate Theorems 1–3: the algorithms increase the margin as the number of iterations becomes larger, which is conducive to a good generalization capability.

From the results we observe the following. First, the training error of ECC converges faster to zero than that of OC and SECC, which is in agreement with Theorem 3. Also, the training convergence rate of SECC is the slowest, which becomes very pronounced for high-level noise settings, where typically a small value of the shrinkage parameter g* is used, as detailed in Table 2.

Second, in the absence of mislabeling noise, we observe that ECC performs similarly to OC with respect to the test error. However, with the increase of noise level, OC outperforms ECC, as predicted in Section 5.2. Moreover, SECC performs better than both OC and ECC at all noise levels above zero. For example, for Letters, in a likely event when 10% of training patterns are mislabeled, SECC improves the classification performance of ECC by about 50% (28.2% vs. 13.4%).

Third, regularization of MO and ECC by introducing the shrinkage parameter g improves their performance; however, it may also lead to overfitting (see Figs. 4 and 5). It is possible to estimate the optimal number of training steps to avoid overfitting through cross-validation.


Fig. 9. Classification errors on the training data of Images.


Fig. 10. Classification error on the training data of Letters.

Overall, SECC outperforms the other four algorithms with significant improvements for all databases and all noise levels.

Table 2
Optimal g* values

            AdaBoost.SECC (noise level)      AdaBoost.SMO (noise level)
Database    0%     10%    20%    30%         0%     10%    20%    30%
Cars        1      1      0.05   0.05        1      1      0.05   0.05
Images      0.5    0.05   0.05   0.05        0.8    0.05   0.05   0.05
Letters     0.2    0.05   0.05   0.05        1      0.5    0.35   0.2
PenDigits   1      0.5    0.5    0.2         1      1      0.8    0.8
USPS        0.5    0.35   0.05   0.05        0.65   0.65   0.65   0.05

Table 3
Classification errors (%) on the test data

Database    Noise (%)   ECC    SECC        MO     SMO         OC
Letters     0           9.3    8.0         9.5    9.5         8.3
            10          28.2   13.4        20.6   18.5        19.8
            20          35.6   16.9        27.2   23.4        24.9
            30          41.9   23.3        34.5   29.0        31.5
USPS        0           6.3    6.2         7.2    7.0         6.5
            10          7.3    6.6         8.1    8.0         6.9
            20          9.7    7.5         9.6    9.1         7.9
            30          12.4   8.7         13.0   11.4        9.1
Cars        0           5.5    5.5         7.1    7.1         7.3
            10          12.5   12.5        12.7   12.7        13.3
            20          20.8   17.0 (70)   20.1   18.3 (160)  20.0
            30          30.2   22.9 (90)   27.6   24.1 (80)   28.3
Images      0           4.5    4.2         4.7    4.6         4.5
            10          8.6    7.6 (70)    8.9    8.9 (210)   8.4
            20          15.1   11.5 (160)  15.1   13.7 (280)  14.0
            30          22.5   18.2 (150)  22.5   19.3 (210)  22.9
PenDigits   0           12.9   12.9        14.9   14.9        13.9
            10          15.2   14.7        16.3   16.3        14.8
            20          17.7   16.3        19.1   18.8        17.1
            30          21.7   19.2        23.1   21.3        19.4

The best results are marked in bold face.

7. Conclusion

In this paper, we have unified AdaBoost.MO, ECC, and OC under the framework of margin theory. We have shown that MO and ECC perform stage-wise functional gradient descent on a cost function defined over margin values, and that OC is a shrinkage version of ECC.


Fig. 11. Margin of the five algorithms on Letters, for 10%, 20%, and 30% of mislabeled data.

Based on these analyses, we have formulated and explained several properties of ECC and OC, and derived the shrinkage versions of MO and ECC, referred to as SMO and SECC.

We have also conducted empirical studies to compare the five multi-class AdaBoost algorithms. The experimental results showed: (1) ECC is faster than OC, while SECC is the slowest with respect to the convergence rate of the training error; (2) in the absence of mislabeling noise, ECC performs similarly to OC; however, for noise levels above zero, OC outperforms ECC, which is supported by our theoretical studies; (3) regularization of MO and ECC, by introducing the shrinkage parameter, significantly improves their performance in the presence of mislabeling noise. Overall, SECC performs significantly better than the other four algorithms for all databases and all noise levels.

Appendix. Proof of Theorem 1

By comparing the steps in the algorithm of MO, in particular the optimization of the hypothesis functions in Step (3) of Fig. 1 and the computation of the weights $\alpha_t$ in Step (5), with the corresponding expressions obtained from the minimization of $G$, we prove the theorem.

In the $(t-1)$th iteration, MO produces $F_{t-1}(x) = [F_{t-1}^{(1)}(x), \ldots, F_{t-1}^{(L)}(x)]^T$. In the $t$th iteration, the algorithm updates $F_{t-1}(x)$ as

$F_t(x) = F_{t-1}(x) + \alpha_t [h_t^{(1)}(x), \ldots, h_t^{(L)}(x)]^T = F_{t-1}(x) + \alpha_t h_t$.  (24)

From Eq. (3), the gradient of $G$ with respect to $F_{t-1}$, at $x = x_n$, reads $-\nabla_{F_{t-1}} G|_{x=x_n} = [M(y_n,1)\, e^{-F_{t-1}^{(1)}(x_n) M(y_n,1)}, \ldots, M(y_n,L)\, e^{-F_{t-1}^{(L)}(x_n) M(y_n,L)}]$. Similar to Friedman's approach (Friedman, 2001), we find $h_t$ as

$h_t = \arg\min_{\{h \in H,\, \beta\}} \sum_{n=1}^{N}\big\| -\nabla_{F_{t-1}} G|_{x=x_n} - \beta(h(x_n)) \big\|^2 = \arg\max_{h \in H} \sum_{l=1}^{L}\sum_{n=1}^{N}\exp(-M(y_n,l)\, F_{t-1}^{(l)}(x_n))\, M(y_n,l)\, h^{(l)}(x_n)$,  (25)

where $\beta$ is a nuisance parameter, and $h \in H$ denotes that $h^{(l)} \in H$ for all $l$. The right-hand side of Eq. (25) can be conveniently expressed in terms of the data distribution $D$, computed in Step (6) in Fig. 1. First, by unravelling Step (6), we obtain

$D_t(n,l) = \dfrac{\exp(-M(y_n,l)\, F_{t-1}^{(l)}(x_n))}{\sum_{i=1}^{L}\sum_{j=1}^{N}\exp(-M(y_j,i)\, F_{t-1}^{(i)}(x_j))}$.  (26)

Then, by plugging Eq. (26) into Eq. (25), we get

$h_t = \arg\max_{h \in H} \sum_{l=1}^{L}\sum_{n=1}^{N} D_t(n,l)\, M(y_n,l)\, h^{(l)}(x_n) = \arg\min_{h \in H} \sum_{l=1}^{L}\left(\sum_{i=1}^{N} D_t(i,l)\right)\sum_{n=1}^{N}\dfrac{D_t(n,l)}{\sum_{i=1}^{N} D_t(i,l)}\, I(M(y_n,l) \neq h^{(l)}(x_n))$.  (27)

From Eq. (27), it follows that each $h_t^{(l)}$ should be selected from $H$ such that the training error of the binary problem defined by the $l$th column is minimized. This is equivalent to Step (3) in Fig. 1.

After $h_t$ is found, $\alpha_t$ can be computed by minimizing the cost function in Eq. (3). From Eqs. (3) and (24), the derivative of $G_t$ with respect to $\alpha_t$ is $\partial G_t/\partial \alpha_t = -\sum_{l=1}^{L}\sum_{n=1}^{N}\exp(-(F_{t-1}^{(l)}(x_n) + \alpha_t h_t^{(l)}(x_n))\, M(y_n,l))\, h_t^{(l)}(x_n)\, M(y_n,l)$. From $\partial G_t/\partial \alpha_t = 0$ and Eq. (26), we have

$\sum_{l=1}^{L}\sum_{n=1}^{N} D_t(n,l)\exp(-\alpha_t h_t^{(l)}(x_n)\, M(y_n,l))\, h_t^{(l)}(x_n)\, M(y_n,l) = 0$.  (28)

It follows that

$\sum_{l=1}^{L}\sum_{n=1}^{N} D_t(n,l)\, e^{-\alpha_t}\, I(h_t^{(l)}(x_n) = M(y_n,l)) - \sum_{l=1}^{L}\sum_{n=1}^{N} D_t(n,l)\, e^{\alpha_t}\, I(h_t^{(l)}(x_n) \neq M(y_n,l)) = 0$,

which yields $\alpha_t = \frac{1}{2}\ln[(1-\epsilon_t)/\epsilon_t]$, where $\epsilon_t$ is defined in Step (4) in Fig. 1. Thus, by minimizing the cost function $G_t$ with respect to $\alpha_t$, we obtain the same expression as the one in Step (5). This completes the proof. □

References

Allwein, E.L., Schapire, R.E., Singer, Y., 2000. Reducing multiclass to binary: a unifying approach for margin classifiers. J. Mach. Learn. Res. 1, 113–141.
Blake, C., Merz, C., 1998. UCI repository of machine learning databases.
Breiman, L., 1999. Prediction games and arcing algorithms. Neural Comput. 11 (7), 1493–1517.
Crammer, K., Singer, Y., 2000. On the learnability and design of output codes for multiclass problems. In: Proc. 13th Annual Conf. Computational Learning Theory, Stanford University, CA, USA, pp. 35–46.
Dietterich, T.G., 2000. An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization. Mach. Learn. 40 (2), 139–157.
Dietterich, T.G., Bakiri, G., 1995. Solving multiclass learning problems via error-correcting output codes. J. Artificial Intell. Res. 2, 263–286.
Freund, Y., Schapire, R.E., 1997. A decision-theoretic generalization of on-line learning and an application to boosting. J. Comput. Syst. Sci. 55 (1), 119–139.
Friedman, J., 2001. Greedy function approximation: a gradient boosting machine. Ann. Statist. 29 (5), 1189–1232.
Friedman, J., Hastie, T., Tibshirani, R., 2000. Additive logistic regression: a statistical view of boosting. Ann. Statist. 28 (2), 337–407.
Grove, A.J., Schuurmans, D., 1998. Boosting in the limit: maximizing the margin of learned ensembles. In: Proc. 15th Natl. Conf. on Artificial Intelligence, Madison, WI, USA, pp. 692–699.
Guruswami, V., Sahai, A., 1999. Multiclass learning, boosting, and error-correcting codes. In: Proc. 12th Annual Conf. Computational Learning Theory, Santa Cruz, California, pp. 145–155.
LeCun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R., Hubbard, W., Jackel, L.D., 1989. Backpropagation applied to handwritten zip code recognition. Neural Comput. 1 (4), 541–551.
Mason, L., Baxter, J., Bartlett, P., Frean, M., 2000. Functional gradient techniques for combining hypotheses. In: Schölkopf, B., Smola, A., Bartlett, P., Schuurmans, D. (Eds.), Advances in Large Margin Classifiers. MIT Press, Cambridge, MA, USA, pp. 221–247.
Quinlan, J.R., 1993. C4.5: Programs for Machine Learning. Morgan Kaufmann.
Quinlan, J., 1996. Bagging, boosting, and C4.5. In: Proc. 13th Natl. Conf. Artificial Intelligence and 8th Innovative Applications of Artificial Intelligence Conf., Portland, OR, USA, pp. 725–730.
Rudin, C., Daubechies, I., Schapire, R.E., 2004. The dynamics of AdaBoost: cyclic behavior and convergence of margins. J. Mach. Learn. Res. 5, 1557–1595.
Schapire, R.E., 1997. Using output codes to boost multiclass learning problems. In: Proc. 14th Intl. Conf. Machine Learning, Nashville, TN, USA, pp. 313–321.
Schapire, R., Singer, Y., 1999. Improved boosting algorithms using confidence-rated predictions. Mach. Learn. 37 (3), 297–336.
Schapire, R.E., Freund, Y., Bartlett, P., Lee, W.S., 1998. Boosting the margin: a new explanation for the effectiveness of voting methods. Ann. Statist. 26 (5), 1651–1686.
Sun, Y., Todorovic, S., Li, J. Reducing the overfitting of AdaBoost by controlling its data distribution skewness. Int. J. Pattern Recog. Artificial Intell., in press.