Direct Optimization for Classification with Boosting
Direct Optimization for Classification with Boosting

A thesis submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy

by

Shaodan Zhai
M.S., Northwest University, 2009
B.S., Northwest University, 2006

2015
Wright State University

WRIGHT STATE UNIVERSITY GRADUATE SCHOOL

November 9, 2015

I HEREBY RECOMMEND THAT THE THESIS PREPARED UNDER MY SUPERVISION BY Shaodan Zhai ENTITLED Direct Optimization for Classification with Boosting BE ACCEPTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF Doctor of Philosophy.

Shaojun Wang, Ph.D., Dissertation Director
Michael Raymer, Ph.D., Director, Department of Computer Science and Engineering
Robert E. W. Fyffe, Ph.D., Vice President for Research and Dean of the Graduate School

Committee on Final Examination:
Shaojun Wang, Ph.D.
Keke Chen, Ph.D.
Krishnaprasad Thirunarayan, Ph.D.
Chaocheng Huang, Ph.D.

ABSTRACT

Zhai, Shaodan. Ph.D., Department of Computer Science and Engineering, Wright State University, 2015. Direct Optimization for Classification with Boosting.

Boosting, one of the state-of-the-art classification approaches, is widely used in industry for a broad range of problems. Existing boosting methods typically formulate classification as a convex optimization problem by using surrogates of performance measures. While convex surrogates are computationally efficient to optimize globally, they are sensitive to outliers and inconsistent under some conditions. On the other hand, boosting's success can be ascribed to maximizing margins, yet few boosting approaches are designed to maximize the margin directly. In this research, we design novel boosting algorithms that directly optimize non-convex performance measures, including the empirical classification error and margin functions, without resorting to surrogates or approximations. We first apply this approach to binary classification and then extend the idea to more complicated settings, including multi-class, semi-supervised, and multi-label classification. These extensions are non-trivial: each learning task requires mathematically re-formulating the optimization problem, defining new objectives, and designing new algorithms tailored to it. Moreover, we establish good theoretical properties of the optimization objectives, which explain why these objectives are chosen and how the algorithms optimize them efficiently. Finally, we show experimentally that the proposed approaches achieve results competitive with or better than state-of-the-art convex-relaxation boosting methods, and that they perform especially well in noisy settings.
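The contrast the abstract draws, and which Figure 1.1 depicts, is between the bounded 0-1 loss and the convex surrogates that standard boosting methods optimize instead. As a rough illustration only (a minimal sketch, not code or notation from the thesis; the variable names are ours), the snippet below evaluates the 0-1 loss alongside three surrogates commonly associated with the baseline methods compared against later: the exponential loss of AdaBoost, the logistic loss of LogitBoost, and the hinge loss of soft-margin LPBoost.

import numpy as np

# Margins z = y * f(x); z > 0 means the example (x, y) is classified correctly.
z = np.linspace(-2.0, 2.0, 401)

zero_one = (z <= 0).astype(float)     # 0-1 loss: the measure optimized directly in this thesis
exp_loss = np.exp(-z)                 # exponential surrogate (AdaBoost)
log_loss = np.log2(1.0 + np.exp(-z))  # logistic surrogate (LogitBoost)
hinge    = np.maximum(0.0, 1.0 - z)   # hinge surrogate (soft-margin LPBoost / SVM-style)

# Each surrogate is a convex upper bound on the 0-1 loss, which keeps the
# training objective easy to optimize globally but lets badly misclassified
# outliers (very negative z) dominate it, unlike the bounded 0-1 loss.
for name, loss in [("0-1", zero_one), ("exponential", exp_loss),
                   ("logistic", log_loss), ("hinge", hinge)]:
    print(f"{name:>12s} loss at z = -2: {loss[0]:6.3f}   at z = +2: {loss[-1]:6.3f}")

At z = -2, for example, the exponential surrogate charges about 7.39 while the 0-1 loss charges 1; that unbounded penalty on badly misclassified points is the outlier sensitivity the abstract refers to.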
Contents

1 Introduction
  1.1 Direct Loss Minimization
  1.2 Direct Margin Maximization
  1.3 Multi-class Boosting
  1.4 Semi-supervised Boosting
  1.5 Multi-label Boosting
  1.6 Our Works

2 DirectBoost: a boosting method for binary classification
  2.1 Minimizing Classification Error
  2.2 Maximizing Margins
  2.3 Experimental Results
    2.3.1 Experiments on UCI Data
    2.3.2 Evaluation of Noise Robustness

3 DMCBoost: a boosting method for multi-class classification
  3.1 Minimizing Multi-class Classification Errors
  3.2 Maximizing a Multi-class Margin Objective
  3.3 Experiments
    3.3.1 Experimental Results on UCI Datasets
    3.3.2 Evaluation of Noise Robustness

4 SSDBoost: a boosting method for semi-supervised classification
  4.1 Minimizing the Generalized Classification Error
  4.2 Maximizing the Generalized Average Margin
  4.3 Experiments
    4.3.1 UCI Datasets
    4.3.2 Evaluation of Noise Robustness

5 MRMM: a multi-label boosting method
  5.1 Minimum Ranking Margin
  5.2 The Algorithm
    5.2.1 Outline of the MRMM Algorithm
    5.2.2 Line Search Algorithm for a Given Weak Classifier
    5.2.3 Weak Learning Algorithm
    5.2.4 Computational Complexity
  5.3 Experiments
    5.3.1 Settings
    5.3.2 Experimental Results
    5.3.3 Influence of Each Parameter in MRMM

6 Conclusion and Future Work

Bibliography

List of Figures

1.1 Convex surrogates of 0-1 loss.
2.1 An example of computing the minimum 0-1 loss of a weak learner over 4 samples.
2.2 0-1 loss curve in one dimension when the 0-1 loss has not reached 0.
2.3 0-1 loss surface in two dimensions when the 0-1 loss has not reached 0.
2.4 0-1 loss curve in one dimension when the 0-1 loss has reached 0.
2.5 0-1 loss surface in two dimensions when the 0-1 loss has reached 0.
2.6 Margin curves of six examples. At points q1, q2, q3, and q4, the median example changes. At points q2 and q4, the set of bottom n′ = 3 examples changes.
2.7 The average margin curve over the bottom 5 examples in one dimension.
2.8 The average margin surface over the bottom 5 examples in two dimensions.
2.9 Contour of the average margin surface over the bottom 5 examples in two dimensions.
2.10 Path of the ε-relaxation method, adapted from (Bertsekas, 1998, 1999).
2.11 The bottom k-th order margin curve in one dimension.
2.12 The bottom k-th order margin surface in two dimensions.
2.13 Contour of the bottom k-th order margin surface in two dimensions.
2.14 The average margin of the bottom n′ samples vs. the number of iterations for LPBoost with column generation and DirectBoost_avg on the Australian dataset; left: decision tree, right: decision stump.
3.1 Three scenarios for computing the empirical error of a weak learner h_t over an example pair (x_i, y_i), where l denotes the incorrect label with the highest score and q denotes the intersection point at which the empirical error changes. The red bold line in each scenario represents the inference function of example x_i and its true label y_i.
3.2 Three scenarios of the margin curve of a weak learner h_t over an example pair (x_i, y_i), where l denotes the incorrect label with the highest score.
3.3 Left: training errors of AdaBoost.M1, SAMME, and DMCBoost with the 0-1 loss minimization algorithm on the Car dataset. Middle: test errors of AdaBoost.M1, SAMME, and DMCBoost with the 0-1 loss minimization algorithm on the Car dataset. Right: training and test errors of DMCBoost when it switches to the margin maximization algorithm on the Car dataset.
4.1 Mean error rates (in %) of ASSEMBLE, EntropyBoost, and SSDBoost with 20 labeled samples and an increasing number of unlabeled samples on the Mushroom dataset.
4.2 Margin distributions of ASSEMBLE, EntropyBoost, and SSDBoost for labeled and unlabeled samples on synthetic data with 20% label noise. For SSDBoost, the parameters n′ and m′ used to plot the figure were selected on the validation set; they are n/2 and m/2, respectively.
5.1 The average 5 smallest MRMs along two coordinates in one dimension (upper left and right), and the surface of the average 5 smallest MRMs in these two coordinates (lower).
5.2 Left: F1 and micro-F1 vs. different values of m on the medical dataset. Right: F1 and micro-F1 vs. different values of k on the medical dataset.
5.3 Left: F1 and micro-F1 vs. the number of ensemble rounds on the medical dataset. Right: objective value vs. the number of ensemble rounds on the medical dataset.

List of Tables

2.1 Percent test errors of AdaBoost, LogitBoost, soft-margin LPBoost with column generation, BrownBoost, and three DirectBoost methods on 10 UCI datasets, each with N samples and D attributes.
2.2 Number of iterations and total run time (in seconds) in the training stage on the Adult dataset with 10,000 training samples and decision trees of depth 3.
2.3 Percent test errors of AdaBoost, LogitBoost, LPBoost, BrownBoost, DirectBoost_avg, and DirectBoost_order on Long and Servedio's three-point example with random noise.
2.4 Percent test errors of AdaBoost, LogitBoost, LPBoost, BrownBoost, DirectBoost_avg, and DirectBoost_order on two UCI datasets with random noise.
3.1 Description of 13 UCI datasets.
3.2 Test errors (and standard deviations) of the multi-class boosting methods AdaBoost.M1, SAMME, GD-MCBoost, and DMCBoost on 13 UCI datasets, using multi-class decision trees with a depth of 3.
3.3 Test errors (and standard deviations) of the multi-class boosting methods AdaBoost.M1, AdaBoost.MH, SAMME, GD-MCBoost, and DMCBoost on 13 UCI datasets, using decision trees with a maximum depth of 12.
3.4 Test errors (and standard deviations) of the multi-class boosting methods AdaBoost.M1, AdaBoost.MH, SAMME, GD-MCBoost,