arXiv:0904.2037v1 [cs.LG] 14 Apr 2009
Boosting through Optimization of Margin Distributions

Keywords: AdaBoost, boosting, margin distribution, column generation

Chunhua Shen, Hanxi Li
NICTA, Canberra Research Laboratory, Canberra, ACT 2601, Australia
Australian National University, Canberra, ACT 0200, Australia

Abstract

Boosting has attracted much research attention in the past decade. The success of boosting algorithms may be interpreted in terms of the margin theory. Recently it has been shown that the generalization error of classifiers can be obtained by explicitly taking the margin distribution of the training data into account. Most of the boosting algorithms currently used in practice optimize a convex loss function and do not make use of the margin distribution. In this work we design a new boosting algorithm, termed margin-distribution boosting (MDBoost), which directly maximizes the average margin and minimizes the margin variance simultaneously. This way the margin distribution is optimized. A totally-corrective optimization algorithm based on column generation is proposed to implement MDBoost. Experiments on UCI datasets show that MDBoost outperforms AdaBoost and LPBoost in most cases.

1. Introduction

Boosting has been extensively studied in the past decade [1, 2]. Boosting algorithms are designed to reduce the classification error of any learning algorithm that consistently produces classifiers whose performance is only slightly better than random guessing. Boosting was originally proposed as an ensemble learning method, which depends on majority voting of multiple individual classifiers. Later, Breiman [3] and Friedman et al. [4] observed that many boosting algorithms can be viewed as gradient descent optimization in functional space. Mason et al. [5] developed a similar idea, AnyBoost, for boosting arbitrary loss functions. Despite the large success of boosting in practice, there are still open questions about why and how it works that remain to be answered.

Inspired by the large-margin theory in kernel methods, the authors of [1] presented a margin-based bound for AdaBoost. Experiments on Arc-Gv show results contrary to the margin theory: Arc-Gv always has a minimum margin that is provably larger than that of AdaBoost, but Arc-Gv performs worse in terms of test error. Grove and Schuurmans [6] observed the same phenomenon. In the literature, much work has focused on maximizing the minimum margin [7, 8, 9]. Recently, Reyzin and Schapire [10] re-ran Breiman's experiments while controlling the complexity of the weak classifiers. They found that a better margin distribution is more important than the minimum margin. It is important to have a large minimum margin, but not at the expense of other factors. They thus conjectured that maximizing the average margin rather than the minimum margin may lead to improved boosting algorithms. We try to verify this conjecture in this work.

Recently, Garg and Roth [11] introduced a margin distribution based complexity measure for learning classifiers and developed margin distribution based generalization bounds. Competitive classification results have been shown by optimizing this bound. Another relevant work is [12], which applies a boosting method to optimize the margin distribution based generalization bound obtained by [13]. Experiments show that the new boosting method achieves considerable improvements over AdaBoost. The optimization of this new boosting method is based on the AnyBoost framework [5].

Aligned with these attempts, we propose a new boosting algorithm through optimization of the margin distribution (termed MDBoost).
Instead of minimizing a margin distribution based generalization bound, we directly optimize the margin distribution: maximizing the average margin and at the same time minimizing the variance of the margin distribution.

The theoretical justification of this idea is that, approximately, AdaBoost actually maximizes the average margin and minimizes the margin variance. For self-completeness, we include the theoretical results from [14] in the next section.

1.1. Notation

Throughout the paper, a matrix is denoted by an upper-case letter ($X$); a column vector is denoted by a bold lower-case letter ($\mathbf{x}$). The $i$th row of $X$ is denoted by $X_{i:}$ and the $i$th column by $X_{:i}$. We use $I$ to denote the identity matrix. $\mathbf{1}$ and $\mathbf{0}$ are column vectors of 1's and 0's, respectively. Their sizes will be clear from the context. We use $\succcurlyeq$ and $\preccurlyeq$ to denote component-wise inequalities.

The rest of the paper is structured as follows. In Section 2 we present the main idea. In Section 3 the dual of MDBoost's optimization problem is derived, which enables us to design an LPBoost-like column generation based boosting algorithm. We provide an experimental comparison of the algorithms on UCI data in Section 4, and conclude the paper in Section 5.

2. Algorithm

Before we present our main results, we introduce some preliminary concepts. Let $\{(x_i, y_i)\}_{i=1,\dots,M}$ be the set of training data, where $x_i \in \mathcal{X}$ and $y_i \in \{-1, +1\}$, $\forall i$. Let $h(\cdot) \in \mathcal{H}$ be a base/weak classifier that maps an input vector $x$ into $[-1, +1]$. We assume that the set $\mathcal{H}$ is finite and that we have $N$ possible weak classifiers. Let $H \in \mathbb{R}^{M \times N}$ be the matrix whose $(i, j)$ entry is $H_{ij} = h_j(x_i)$, that is, the label predicted by weak classifier $h_j(\cdot)$ on the training datum $x_i$. Therefore each column $H_{:j}$ of the matrix $H$ consists of the outputs of weak classifier $h_j(\cdot)$ on all the training data, while each row $H_{i:}$ contains the outputs of all weak classifiers on the training datum $x_i$.

Boosting is a typical example of ensemble learning, where multiple learners are trained to solve a single classification problem. A boosting algorithm creates a strong learner by incrementally adding weak learners to the final strong learner [2]. The weak learner has an important impact on the strong learner. In general, a boosting algorithm builds on a user-specified base learning procedure and runs it repeatedly on modified data that are outputs from the previous iterations. The final output strong classifier takes the form $F(x) = \sum_{j=1}^{N} w_j h_j(x)$.

The following theorem serves as the basis of the proposed MDBoost algorithm.

Theorem 2.1. AdaBoost maximizes the unnormalized average margin and simultaneously minimizes the variance of the margin distribution under the assumption that the margin follows a Gaussian distribution.

Proof: See appendix.

Mathematically, the above theorem can be formulated as:

$$\max_{\mathbf{w}} \ \bar{\rho} - \tfrac{1}{2}\sigma^2, \quad \text{s.t.} \ \mathbf{w} \succcurlyeq \mathbf{0}, \ \mathbf{1}^\top \mathbf{w} = D, \tag{1}$$

where $\sigma^2$ is the unnormalized margin variance and $\bar{\rho}$ is the unnormalized average margin. Let $\rho_i$ denote the unnormalized margin for the $i$th example datum, i.e.,

$$\rho_i = y_i H_{i:} \mathbf{w}, \quad \forall i = 1, \dots, M. \tag{2}$$

We explicitly write the optimization in $\boldsymbol{\rho}$:

$$\min_{\mathbf{w}} \ \frac{1}{2(M-1)} \sum_{i>j} (\rho_i - \rho_j)^2 - \sum_{i=1}^{M} \rho_i, \quad \text{s.t.} \ \mathbf{w} \succcurlyeq \mathbf{0}, \ \mathbf{1}^\top \mathbf{w} = D. \tag{3}$$

We define a matrix $A \in \mathbb{R}^{M \times M}$:

$$A = \begin{bmatrix} 1 & -\frac{1}{M-1} & \cdots & -\frac{1}{M-1} \\ -\frac{1}{M-1} & 1 & \cdots & -\frac{1}{M-1} \\ \vdots & \vdots & \ddots & \vdots \\ -\frac{1}{M-1} & -\frac{1}{M-1} & \cdots & 1 \end{bmatrix}.$$

Our optimization problem can be rewritten as

$$\min_{\mathbf{w}} \ \tfrac{1}{2} \boldsymbol{\rho}^\top A \boldsymbol{\rho} - \mathbf{1}^\top \boldsymbol{\rho}, \quad \text{s.t.} \ \mathbf{w} \succcurlyeq \mathbf{0}, \ \mathbf{1}^\top \mathbf{w} = D, \ \rho_i = y_i H_{i:} \mathbf{w}, \ \forall i = 1, \dots, M. \tag{4}$$

It is easy to see that $A$ is positive semidefinite (see footnote 1). So (4) is a convex quadratic problem (QP) in $\mathbf{w}$.

Footnote 1: $A$ is not strictly positive definite. Since each row of $A$ sums to zero, one of the eigenvalues of $A$ is zero.

If we could access all the weak classifiers (that is, if the entire matrix $H$ were known), we could solve problem (4) using off-the-shelf quadratic programming (QP) solvers [15]. However, in many cases we do not know $H$ beforehand, simply because the size of the weak classifier set could be prohibitively (or even infinitely) large. As in LPBoost [9], column generation can be used to attack this problem. Column generation was first proposed by [16] for solving some specially structured linear programs with an extremely large number of variables. A comprehensive survey on this technique is [17]. The general idea of column generation is that, instead of solving the original large-scale problem (the master problem), one works on a restricted master problem with a reasonably small subset of variables at each step. The dual of the restricted master problem is solved by conventional convex programming, and the optimal dual solution is used to find the new variable to be included in the restricted master problem. LPBoost is a direct application of column generation in boosting. For the first time, LPBoost shows that in a linear programming framework, unknown weak hypotheses can be learned from the dual although the space of all weak hypotheses is infinitely large.

This leads to $\boldsymbol{\rho} = -A^{-1}(\mathbf{u} - \mathbf{1})$, and the infimum is $-\tfrac{1}{2} (\mathbf{u} - \mathbf{1})^\top A^{-1} (\mathbf{u} - \mathbf{1})$.

By putting the above results together, the dual is

$$\max_{r, \mathbf{u}} \ -Dr - \tfrac{1}{2} (\mathbf{u} - \mathbf{1})^\top A^{-1} (\mathbf{u} - \mathbf{1}), \quad \text{s.t. (7)}. \tag{9}$$

We can reformulate (9) as

$$\min_{r, \mathbf{u}} \ r + \tfrac{1}{2D} (\mathbf{u} - \mathbf{1})^\top A^{-1} (\mathbf{u} - \mathbf{1}), \quad \text{s.t. (7)}. \tag{10}$$

Under some mild conditions, both weak duality and strong duality hold between the primal and dual problems we have derived. By strong duality, the two problems are equivalent in the sense that we can always obtain the solution of one problem from the other.
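To make the primal QP (4) concrete for the case where the whole matrix H is known, the following is a minimal sketch rather than the authors' implementation. It builds A, checks numerically that the quadratic term of (4) matches the pairwise form in (3), and minimizes (4) over w on a toy problem. Everything beyond the symbols A, H, y, w, rho and D is an assumption made for illustration: the function names, the one-dimensional toy data, the threshold stumps used as weak classifiers, and the choice of projected gradient descent in place of the off-the-shelf QP solver mentioned in the text.

```python
import numpy as np


def margin_matrix(M):
    """A in R^{M x M}: 1 on the diagonal, -1/(M-1) everywhere else."""
    A = np.full((M, M), -1.0 / (M - 1))
    np.fill_diagonal(A, 1.0)
    return A


def objective(w, H, y, A):
    """Objective of (4): 0.5 * rho^T A rho - 1^T rho, with rho_i = y_i H_i: w."""
    rho = y * (H @ w)
    return 0.5 * rho @ A @ rho - rho.sum()


def project_to_simplex(v, D):
    """Euclidean projection onto {w : w >= 0, sum(w) = D} (sort-based method)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - D
    k = np.arange(1, v.size + 1)
    idx = np.nonzero(u - css / k > 0)[0][-1]
    theta = css[idx] / (idx + 1)
    return np.maximum(v - theta, 0.0)


def solve_primal(H, y, D=1.0, iters=5000):
    """Projected gradient descent on (4); a stand-in for a real QP solver."""
    M, N = H.shape
    A = margin_matrix(M)
    Z = y[:, None] * H            # rho = Z @ w
    Q = Z.T @ A @ Z               # Hessian of the objective with respect to w
    c = Z.T @ np.ones(M)          # linear term, since 1^T rho = c^T w
    step = 1.0 / max(np.linalg.eigvalsh(Q)[-1], 1e-12)
    w = np.full(N, D / N)         # feasible starting point
    for _ in range(iters):
        w = project_to_simplex(w - step * (Q @ w - c), D)
    return w


# Hypothetical toy data: six points on a line, weak classifiers are stumps.
x = np.array([-2.5, -1.5, -0.5, 0.5, 1.5, 2.5])
y = np.array([-1.0, -1.0, 1.0, -1.0, 1.0, 1.0])
thresholds = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
H = np.where(x[:, None] > thresholds[None, :], 1.0, -1.0)  # H[i, j] = h_j(x_i)

A = margin_matrix(len(y))
w = solve_primal(H, y, D=1.0)
rho = y * (H @ w)

# Quadratic term of (4) versus the pairwise sum over i > j in (3).
pairwise = 0.5 * np.sum(np.subtract.outer(rho, rho) ** 2)
print("quadratic term :", 0.5 * rho @ A @ rho)
print("pairwise form  :", pairwise / (2 * (len(y) - 1)))
print("weights w      :", np.round(w, 3))
print("objective (4)  :", objective(w, H, y, A))
```

The sketch only covers the setting in which every column of H is available up front. In the column generation setting described above, H would instead grow one column at a time, and the restricted problem would be re-solved after each new weak classifier is added.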