Pattern Recognition Letters 28 (2007) 631–643 www.elsevier.com/locate/patrec
Unifying multi-class AdaBoost algorithms with binary base learners under the margin framework
Yijun Sun a,*, Sinisa Todorovic b, Jian Li c
a Interdisciplinary Center for Biotechnology Research, P.O. Box 103622, University of Florida, Gainesville, FL 32610, USA
b Beckman Institute, University of Illinois at Urbana-Champaign, 405 N. Mathews Ave, Urbana, IL 61801, USA
c Department of Electrical and Computer Engineering, P.O. Box 116130, University of Florida, Gainesville, FL 32611, USA
Received 30 January 2006; received in revised form 21 June 2006. Available online 14 December 2006.
Communicated by K. Tumer
Abstract
The multi-class AdaBoost algorithms AdaBoost.MO, -ECC and -OC have received great attention in the literature, but their relationships have not been fully examined to date. In this paper, we present a novel interpretation of the three algorithms, by showing that MO and ECC perform stage-wise functional gradient descent on a cost function defined over margin values, and that OC is a shrinkage version of ECC. This allows us to rigorously explain the properties of ECC and OC empirically observed in prior work. The outlined interpretation also leads us to introduce shrinkage as regularization in MO and ECC, and thus to derive two new algorithms: SMO and SECC. Experiments on diverse databases are performed. The results demonstrate the effectiveness of the proposed algorithms and validate our theoretical findings.
© 2006 Elsevier B.V. All rights reserved.
Keywords: AdaBoost; Margin theory; Multi-class classification problem
1. Introduction

AdaBoost is a method for improving the accuracy of a learning algorithm (Freund and Schapire, 1997; Schapire and Singer, 1999). It repeatedly applies a base learner to re-sampled versions of the training data to produce a collection of hypothesis functions, which are then combined via a weighted linear vote to form the final ensemble classifier. Under the mild assumption that the base learner consistently outperforms random guessing, AdaBoost can produce a classifier with arbitrarily accurate training performance. AdaBoost, originally designed for binary problems, can be extended to solve multi-class problems. In one such approach, a multi-class problem is first decomposed into a set of binary ones, and then a binary classifier is used as the base learner. This approach is important, since the majority of well-studied classification algorithms are designed only for binary problems. It is also the main focus of this paper. Several algorithms have been proposed within the outlined framework, including AdaBoost.MO (Schapire and Singer, 1999), -OC (Schapire, 1997) and -ECC (Guruswami and Sahai, 1999). In these algorithms, a code matrix is specified such that each row of the code matrix (i.e., code word) represents a class. The code matrix in MO is specified before learning; therefore, the underlying dependence between the fixed code matrix and the so-constructed binary classifiers is not explicitly accounted for, as discussed in (Allwein et al., 2000). ECC and OC seem to alleviate this problem by alternately generating columns of a code matrix and binary hypothesis functions in a pre-defined number of iteration steps. Thereby, the underlying dependence between the code matrix and the binary classifiers is exploited in a stage-wise manner.

* Corresponding author. Tel.: +1 352 273 8065; fax: +1 352 392 0044. E-mail address: [email protected]fl.edu (Y. Sun).
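The generic boosting loop described above, which re-weights the training data, fits a base learner, and combines the learners by a weighted vote, can be sketched as follows. This is a minimal two-class AdaBoost with brute-force decision stumps; all identifiers are ours, not the paper's code.

```python
import numpy as np

def adaboost(X, y, T=20):
    """Minimal binary AdaBoost (Freund and Schapire, 1997) with
    threshold stumps; labels y are in {+1, -1}."""
    N = len(y)
    w = np.full(N, 1.0 / N)                      # sample distribution D_t
    stumps, alphas = [], []
    for _ in range(T):
        best = None
        for j in range(X.shape[1]):              # brute-force stump search
            for thr in np.unique(X[:, j]):
                for s in (+1, -1):
                    pred = s * np.where(X[:, j] <= thr, 1, -1)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, thr, s)
        err, j, thr, s = best
        if err >= 0.5:                           # base learner no better than chance
            break
        a = 0.5 * np.log((1 - err) / max(err, 1e-12))
        pred = s * np.where(X[:, j] <= thr, 1, -1)
        w *= np.exp(-a * y * pred)               # re-weight: hard samples gain mass
        w /= w.sum()
        stumps.append((j, thr, s)); alphas.append(a)
    def predict(Xq):
        F = sum(a * s * np.where(Xq[:, j] <= thr, 1, -1)
                for a, (j, thr, s) in zip(alphas, stumps))
        return np.sign(F)                        # weighted linear vote
    return predict

X = np.array([[0.], [1.], [2.], [3.]])
y = np.array([+1, +1, -1, -1])
predict = adaboost(X, y, T=5)
print(predict(X))   # -> [ 1.  1. -1. -1.]
```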
doi:10.1016/j.patrec.2006.11.001
MO, ECC, and OC, as the multi-class extensions of AdaBoost, naturally inherit some of the well-known properties of AdaBoost, including a good generalization capability. Extensive theoretical and empirical studies have been reported in the literature aimed at understanding the generalization capability of two-class AdaBoost (Dietterich, 2000; Grove and Schuurmans, 1998; Quinlan, 1996). One leading explanation is the margin theory (Schapire et al., 1998), stating that AdaBoost can effectively increase the margin, which in turn is conducive to good generalization over unseen test data. It has been proved that AdaBoost performs a stage-wise functional gradient descent procedure on a cost function of sample margins (Mason et al., 2000; Breiman, 1999; Friedman et al., 2000). However, to the best of our knowledge, neither a similar margin-based analysis nor an empirical comparison of the multi-class AdaBoost algorithms has to date been reported.

Further, we observe that the behavior of ECC and OC under various experimental settings, as well as the relationship between the two algorithms, has not been fully examined in the literature. For example, Guruswami and Sahai (1999) claim that ECC outperforms OC, which, as we show in this paper, is not true for many settings. Also, it has been empirically observed that the training error of ECC converges to zero faster than that of OC (Guruswami and Sahai, 1999), but no mathematical explanation of this phenomenon has as of yet been proposed.

In this paper, we investigate the aforementioned missing links in the theoretical developments and empirical studies of MO, ECC, and OC. We show that MO and ECC perform stage-wise functional gradient descent on a cost function defined in the domain of margin values. We further prove that OC is actually a shrinkage version of ECC. This theoretical analysis allows us to derive the following results. First, several properties of ECC and OC are formulated and proved, including (i) the relationship between their training convergence rates, and (ii) their performances in noisy regimes. Second, we show how to avoid the redundant calculation of pseudo-loss in OC, and thus to simplify the algorithm. Third, two new regularized algorithms are derived, referred to as SMO and SECC, in which shrinkage as regularization is explicitly exploited in MO and ECC, respectively.

We also report experiments on the algorithms' behavior in the presence of mislabeled training data. Mislabeling noise is critical for many applications, where preparing a good training data set is a challenging task. Indeed, erroneous human supervision in hard-to-classify cases may lead to training sets with a significant number of mislabeled data. The influence of mislabeled training data on the classification error of two-class AdaBoost was investigated in (Dietterich, 2000); however, to our knowledge, no such study has been reported for multi-class AdaBoost yet. The experimental results support our theoretical findings. In a very likely event, when for example 10% of the training patterns are mislabeled, OC outperforms ECC. Moreover, in the presence of mislabeling noise, SECC converges fastest to the smallest test error, as compared to the other algorithms.

The remainder of the paper is organized as follows. In Sections 2 and 3, we briefly review the output-coding method for solving the multi-class problem and two-class AdaBoost. Then, in Sections 4 and 5, we show that MO and ECC perform functional gradient descent. Further, in Section 5, we prove that OC is a shrinkage version of ECC. The experimental validation of our theoretical findings is presented in Section 6. We conclude the paper with our final remarks in Section 7.

2. Output coding method

In this section, we briefly review the output coding method for solving multi-class classification problems (Allwein et al., 2000; Dietterich and Bakiri, 1995). Let $\mathcal{D} = \{(\mathbf{x}_n, y_n)\}_{n=1}^{N} \subset \mathcal{X} \times \mathcal{Y}$ denote a training dataset, where $\mathcal{X}$ is the pattern space and $\mathcal{Y} = \{1, \ldots, C\}$ is the label space. To decompose the multi-class problem into several binary ones, a code matrix $\mathbf{M} \in \{\pm 1\}^{C \times L}$ is introduced, where $L$ is the length of a code word. Here, $\mathbf{M}(c)$ denotes the $c$th row, that is, the code word of class $c$, and $M(c,l)$ denotes an element of the code matrix. Each column of $\mathbf{M}$ defines a binary partition of the $C$ classes over the data samples, namely the partition on which a binary classifier is trained. After $L$ training steps, the output coding method produces a final classifier $\mathbf{f}(\mathbf{x}) = [f_1(\mathbf{x}), \ldots, f_L(\mathbf{x})]^{\mathrm{T}}$, where $f_l(\mathbf{x}): \mathbf{x} \to \mathbb{R}$. When presented with an unseen sample $\mathbf{x}$, the output coding method predicts the label $y^*$ such that the code word $\mathbf{M}(y^*)$ is the ``closest'' to $\mathbf{f}(\mathbf{x})$ with respect to a specified decoding strategy. In this paper, we use the loss-based decoding strategy (Allwein et al., 2000), given by $y^* = \arg\min_{y \in \mathcal{Y}} \sum_{l=1}^{L} \exp(-M(y,l)\, f_l(\mathbf{x}))$.

3. AdaBoost and LP-based margin optimization

In this section, we briefly review the AdaBoost algorithm and its relationship with margin optimization. Given a set of hypothesis functions $\mathcal{H} = \{h(\mathbf{x}): \mathbf{x} \to \{\pm 1\}\}$, AdaBoost finds an ensemble function of the form $F(\mathbf{x}) = \sum_t a_t h_t(\mathbf{x})$, or $f(\mathbf{x}) = F(\mathbf{x}) / \sum_t a_t$ with $a_t \geq 0$ for all $t$, by minimizing the cost function $G = \sum_{n=1}^{N} \exp(-y_n F(\mathbf{x}_n))$. The sample margin is defined as $\rho(\mathbf{x}_n) = y_n f(\mathbf{x}_n)$, and the classifier margin, or simply the margin, as $\rho = \min_{1 \leq n \leq N} \rho(\mathbf{x}_n)$. Interested readers can refer to Freund and Schapire (1997) and Sun et al. (in press) for a more detailed discussion of AdaBoost.

It has been empirically observed that AdaBoost can effectively increase the margin (Schapire et al., 1998). For this reason, since the invention of AdaBoost, it has been conjectured that, as $t \to \infty$, AdaBoost solves the linear programming (LP) problem

$$\max \rho, \quad \text{s.t.} \quad \rho(\mathbf{x}_n) \geq \rho, \quad n = 1, \ldots, N, \tag{1}$$

where the margin is directly maximized. In the recent paper of Rudin et al. (2004), however, the equivalence of the two algorithms has been proven to not always hold. Nevertheless, the two algorithms are closely connected in the sense that both try to maximize the margin. Throughout this paper, we will make use of this connection to define a cost function over the domain of margin values, upon which the multi-class AdaBoost.MO, ECC and OC algorithms perform stage-wise gradient descent.

4. AdaBoost.MO

In this section, we study AdaBoost.MO. In Fig. 1, we present the pseudo-code of the algorithm as proposed in (Schapire and Singer, 1999). Given a code matrix $\mathbf{M} \in \{\pm 1\}^{C \times L}$, in each iteration $t$, $t = 1, \ldots, T$, a distribution $D_t$ is generated over pairs of training examples and columns of the matrix $\mathbf{M}$. A set of base learners $\{h_t^{(l)}(\mathbf{x})\}_{l=1}^{L}$ is then trained with respect to the distribution $D_t$, based on the binary partitions defined by the columns of $\mathbf{M}$. The error $\epsilon_t$ of $\{h_t^{(l)}(\mathbf{x})\}_{l=1}^{L}$ is computed as in Step (4) of Fig. 1, and the combination coefficient $a_t$ is computed in Step (5) as $a_t = \frac{1}{2}\ln\frac{1-\epsilon_t}{\epsilon_t}$. Then, the distribution is updated as $D_{t+1}(n,l) = D_t(n,l)\exp(-a_t M(y_n,l)\, h_t^{(l)}(\mathbf{x}_n)) / Z_t$, where $Z_t$ is a normalization constant such that $D_{t+1}$ is a distribution, i.e., $\sum_{n=1}^{N}\sum_{l=1}^{L} D_{t+1}(n,l) = 1$. The process continues until the maximum iteration number $T$ is reached. After $T$ rounds, MO outputs a final classifier $\mathbf{F}(\mathbf{x}) = [F^{(1)}(\mathbf{x}), \ldots, F^{(L)}(\mathbf{x})]^{\mathrm{T}}$, or $\mathbf{f}(\mathbf{x}) = \mathbf{F}(\mathbf{x}) / \sum_{t=1}^{T} a_t$, where $F^{(l)}(\mathbf{x}) = \sum_{t=1}^{T} a_t h_t^{(l)}(\mathbf{x})$. When presented with an unseen sample $\mathbf{x}$, MO predicts the label $y^*$ whose row $\mathbf{M}(y^*)$ is the ``closest'' to $\mathbf{F}(\mathbf{x})$ with respect to a specified decoding strategy.

Now, we seek to define the cost function upon which MO performs the optimization. Given a fixed code matrix, one natural idea is to find $\mathbf{F}(\mathbf{x})$ such that the minimum value of the sample margin at each bit position is maximized:

$$\max \rho, \quad \text{s.t.} \quad \rho^{(l)}(\mathbf{x}_n) = M(y_n,l)\,\frac{\sum_t a_t h_t^{(l)}(\mathbf{x}_n)}{\sum_t a_t} \geq \rho, \quad n = 1, \ldots, N, \; l = 1, \ldots, L. \tag{2}$$

In the light of the connection between AdaBoost and LP, discussed in Section 3, the cost function that MO optimizes can be specified as

$$G = \sum_{n=1}^{N}\sum_{l=1}^{L} \exp\left(-\rho^{(l)}(\mathbf{x}_n)\sum_t a_t\right) = \sum_{n=1}^{N}\sum_{l=1}^{L} \exp\left(-F^{(l)}(\mathbf{x}_n)\,M(y_n,l)\right), \tag{3}$$

which is indeed proved in the following theorem.

Theorem 1. AdaBoost.MO performs a stage-wise functional gradient descent procedure on the cost function given by Eq. (3).

Proof. The proof is provided in the Appendix. □

5. AdaBoost.ECC and AdaBoost.OC

In this section, we investigate ECC (Schapire, 1997) and OC (Guruswami and Sahai, 1999), whose pseudo-codes are given in Figs. 2 and 3, respectively. Our goal is to unify both algorithms under the framework of margin theory, and to establish the relationship between the two algorithms.

We begin by defining the sample margin, $\rho(\mathbf{x}_n)$, as

$$\rho(\mathbf{x}_n) = \min_{\{c \in \mathcal{C},\, c \neq y_n\}} D(\mathbf{M}(c), \mathbf{f}(\mathbf{x}_n)) - D(\mathbf{M}(y_n), \mathbf{f}(\mathbf{x}_n)), \tag{4}$$

where $D(\cdot)$ is a distance measure. Maximization of $\rho(\mathbf{x}_n)$ in Eq. (4) can be interpreted as finding $\mathbf{f}(\mathbf{x}_n)$ close to the code word of the true label, while at the same time distant from the code word of the most confused class. By specifying $D(\mathbf{M}(c), \mathbf{f}(\mathbf{x})) = \|\mathbf{M}(c) - \mathbf{f}(\mathbf{x})\|^2$, from Eq. (4) we derive

$$\rho(\mathbf{x}_n) = 2\,\mathbf{M}(y_n)\mathbf{f}(\mathbf{x}_n) - \max_{\{c \in \mathcal{C},\, c \neq y_n\}} \{2\,\mathbf{M}(c)\mathbf{f}(\mathbf{x}_n)\}. \tag{5}$$

Maximization of $\rho(\mathbf{x}_n)$ can be formulated as an optimization problem:
Fig. 1. Pseudo-code of AdaBoost.MO, as proposed in (Schapire and Singer, 1999).
Fig. 2. Pseudo-code of AdaBoost.ECC, as proposed in (Guruswami and Sahai, 1999).
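The loss-based decoding rule reviewed in Section 2, $y^* = \arg\min_{y} \sum_{l} \exp(-M(y,l) f_l(\mathbf{x}))$, can be sketched as follows. This is our own minimal illustration with a hypothetical 4-class code matrix, not the authors' code.

```python
import numpy as np

def loss_based_decode(f, M):
    """Predict y* = argmin_y sum_l exp(-M(y, l) * f_l(x)):
    the row of the code matrix M with the smallest exponential
    loss against the real-valued binary-classifier outputs f."""
    losses = np.exp(-M * f).sum(axis=1)   # one loss per class (row)
    return int(np.argmin(losses))

# Hypothetical code matrix: C = 4 classes, code words of length L = 3.
# Each column defines one binary partition of the classes.
M = np.array([[+1, +1, +1],
              [+1, -1, -1],
              [-1, +1, -1],
              [-1, -1, +1]], dtype=float)

f = np.array([0.9, -1.2, -0.4])   # outputs of the L binary classifiers
print(loss_based_decode(f, M))    # -> 1 (f best matches code word [+1, -1, -1])
```

Note that the sign pattern of `f` agrees with row 1 of `M` in every bit, so that row attains the smallest exponential loss.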
$$\max \rho, \quad \text{s.t.} \quad \mathbf{M}(y_n)\mathbf{f}(\mathbf{x}_n) - \max_{\{c \in \mathcal{C},\, c \neq y_n\}} \{\mathbf{M}(c)\mathbf{f}(\mathbf{x}_n)\} \geq \rho, \quad n = 1, \ldots, N. \tag{6}$$

We are particularly interested in finding $\mathbf{f}(\mathbf{x})$ of the form $\mathbf{f}(\mathbf{x}) = [a_1 h_1(\mathbf{x}), \ldots, a_T h_T(\mathbf{x})]^{\mathrm{T}}$, where $\sum_{t=1}^{T} a_t = 1$ and $a_t \geq 0$. From Eq. (6), we derive

$$\max \rho, \quad \text{s.t.} \quad \sum_{t=1}^{T} a_t\,(M(y_n,t) - M(c,t))\,h_t(\mathbf{x}_n) \geq \rho, \quad n = 1, \ldots, N, \; c = 1, \ldots, C, \; c \neq y_n; \quad \sum_{t=1}^{T} a_t = 1, \; a_t \geq 0. \tag{7}$$

From the discussion on LP and AdaBoost in Section 3, it appears reasonable to define a new cost function, optimized by a multi-class AdaBoost algorithm:

$$G = \sum_{n=1}^{N} \sum_{\{c=1,\, c \neq y_n\}}^{C} \exp\left(-(\mathbf{M}(y_n) - \mathbf{M}(c))\,\mathbf{F}(\mathbf{x}_n)\right) = \sum_{n=1}^{N} \sum_{\{c=1,\, c \neq y_n\}}^{C} \exp\left(-\sum_{t=1}^{T} a_t\,(M(y_n,t) - M(c,t))\,h_t(\mathbf{x}_n)\right). \tag{8}$$

The following theorem shows that AdaBoost.ECC optimizes the above cost function.

Theorem 2. AdaBoost.ECC performs a stage-wise functional gradient descent procedure on the cost function given by Eq. (8).

Fig. 3. Pseudo-code of AdaBoost.OC, as proposed in (Schapire, 1997).

Proof. By showing that ECC performs a stage-wise minimization of $G$ in Eq. (8), we prove the theorem. For the moment, we assume that the code matrix $\mathbf{M}$ is given; its generation is discussed later. After $(t-1)$ iteration steps, ECC produces $\mathbf{F}_{t-1}(\mathbf{x}_n) = [a_1 h_1(\mathbf{x}_n), \ldots, a_{t-1} h_{t-1}(\mathbf{x}_n), 0, \ldots, 0]^{\mathrm{T}}$. In the $t$th iteration step, the goal of the algorithm is to compute the $t$th entry of $\mathbf{F}_t(\mathbf{x}_n) = [a_1 h_1(\mathbf{x}_n), \ldots, a_{t-1} h_{t-1}(\mathbf{x}_n), a_t h_t(\mathbf{x}_n), 0, \ldots, 0]^{\mathrm{T}}$. From Eq. (8), the negative functional derivative of $G$ with respect to $\mathbf{F}_{t-1}(\mathbf{x})$, at $\mathbf{x} = \mathbf{x}_n$, is computed as $-\nabla_{\mathbf{F}_{t-1}} G\big|_{\mathbf{x}=\mathbf{x}_n} = \sum_{\{c=1,\, c \neq y_n\}}^{C} (\mathbf{M}(y_n) - \mathbf{M}(c))\,G_{t-1}(n,c)$, where $G_{t-1}(n,c) = \exp(-(\mathbf{M}(y_n) - \mathbf{M}(c))\,\mathbf{F}_{t-1}(\mathbf{x}_n))$. To find the optimal hypothesis function $h_t$ in the next training step, we resort to Friedman's optimization approach (Friedman, 2001). It is straightforward to show that $h_t$ should be selected from $\mathcal{H}$ by maximizing its correlation with the $t$th component of $-\nabla_{\mathbf{F}_{t-1}} G$, that is, $h_t = \arg\max_{h \in \mathcal{H}} \sum_{n=1}^{N} \sum_{\{c=1,\, c \neq y_n\}}^{C} h(\mathbf{x}_n)\,(M(y_n,t) - M(c,t))\,G_{t-1}(n,c)$. To facilitate the computation of $h_t$, we introduce the following terms:

$$V_t(n) = \sum_{\{c=1,\, c \neq y_n\}}^{C} |M(y_n,t) - M(c,t)|\,G_{t-1}(n,c), \quad V_t = \sum_{n=1}^{N} V_t(n), \quad d_t(n) = V_t(n)/V_t. \tag{9}$$

Note that $V_t$ differs from $U_t$, defined in Step (4) in Fig. 2, by a constant $2N(C-1)\prod_{i=1}^{t-1} Z_i$, where $Z_i$ is a normalization constant defined in Step (9) in Fig. 2. Also, note that $(M(y_n,t) - M(c,t))$ either equals zero or has the same sign as $M(y_n,t)$. It follows that

$$h_t = \arg\max_{h \in \mathcal{H}} \sum_{n=1}^{N} V_t(n)\,\mathrm{sign}(M(y_n,t))\,h(\mathbf{x}_n) \tag{10}$$
$$= \arg\max_{h \in \mathcal{H}} V_t \sum_{n=1}^{N} d_t(n)\,M(y_n,t)\,h(\mathbf{x}_n) \tag{11}$$
$$= \arg\max_{h \in \mathcal{H}} U_t \sum_{n=1}^{N} d_t(n)\,M(y_n,t)\,h(\mathbf{x}_n) \tag{12}$$
$$= \arg\min_{h \in \mathcal{H}} \sum_{n=1}^{N} I(M(y_n,t) \neq h(\mathbf{x}_n))\,d_t(n) \tag{13}$$
$$= \arg\min_{h \in \mathcal{H}} \epsilon, \tag{14}$$

where $\epsilon$ is the weighted error of $h$. Once $h_t$ is found, $a_t$ can be computed by a line search as

$$a_t = \arg\min_{a \geq 0} G_t = \arg\min_{a \geq 0} \sum_{n=1}^{N} \sum_{\{c=1,\, c \neq y_n\}}^{C} G_{t-1}(n,c)\,\exp\left[-a\,(M(y_n,t) - M(c,t))\,h_t(\mathbf{x}_n)\right].$$
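The $t$th ECC step derived in the proof, namely forming the weights $d_t(n)$ from $G_{t-1}(n,c)$, choosing $h_t$ by the weighted error of Eq. (13), and line-searching $a_t$, can be sketched as follows. This is our own toy illustration: a tiny explicit candidate set stands in for $\mathcal{H}$, a coarse grid stands in for the exact line search, and indices are 0-based, unlike the paper.

```python
import numpy as np

def ecc_step(M, t, y, candidates, F_prev):
    """One stage of the functional gradient descent derived above:
    form d_t(n) from G_{t-1}(n, c) via Eq. (9), pick h_t by minimizing
    the d_t-weighted error (Eq. (13)), then line-search a_t >= 0."""
    N, C = len(y), M.shape[0]
    diff = M[y][:, None, :] - M[None, :, :]             # (M(y_n) - M(c)), shape (N, C, T)
    G = np.exp(-np.einsum('nct,nt->nc', diff, F_prev))  # G_{t-1}(n, c)
    G[np.arange(N), y] = 0.0                            # drop the c = y_n terms
    V = (np.abs(diff[:, :, t]) * G).sum(axis=1)         # V_t(n), Eq. (9)
    d = V / V.sum()                                     # d_t(n)
    errs = [(d * (M[y, t] != h)).sum() for h in candidates]
    h_t = candidates[int(np.argmin(errs))]              # Eq. (13)
    grid = np.linspace(0.0, 3.0, 301)                   # coarse line search on G_t
    cost = [(G * np.exp(-a * diff[:, :, t] * h_t[:, None])).sum() for a in grid]
    a_t = float(grid[int(np.argmin(cost))])
    return h_t, a_t

# Toy run: C = 3 classes, a code matrix with T = 2 columns, step t = 0.
M = np.array([[+1., +1.], [-1., +1.], [+1., -1.]])
y = np.array([0, 1, 2, 0, 1, 2, 0, 1])
F_prev = np.zeros((len(y), M.shape[1]))                 # nothing learned yet
h_good = np.array([+1., -1., +1., -1., -1., +1., -1., -1.])  # right on 6 of 8 samples
candidates = [h_good, -h_good, np.ones(8)]              # stand-in hypothesis class
h_t, a_t = ecc_step(M, 0, y, candidates, F_prev)
```

With `F_prev` all zero, every `G` weight starts at one, `h_good` has the smallest weighted error of the three candidates, and the line search settles at an interior minimum rather than the grid boundary.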