Genetic Programming for Classification with Unbalanced Data
by Urvesh Bhowan

A thesis submitted to the Victoria University of Wellington in fulfilment of the requirements for the degree of Doctor of Philosophy in Computer Science.

Victoria University of Wellington
2012

Abstract

In classification, machine learning algorithms can suffer a performance bias when data sets are unbalanced. Binary data sets are unbalanced when one class (the minority class) is represented by only a small number of training examples, while the other class (the majority class) makes up the rest. In this scenario, the induced classifiers typically have high accuracy on the majority class but poor accuracy on the minority class. As the minority class typically represents the main class-of-interest in many real-world problems, accurately classifying examples from this class can be at least as important as, and in some cases more important than, accurately classifying examples from the majority class.

Genetic Programming (GP) is a promising machine learning technique that uses the principles of Darwinian evolution to automatically evolve computer programs that solve problems. While GP has shown much success in evolving reliable and accurate classifiers for typical classification tasks with balanced data, GP, like many other learning algorithms, can evolve biased classifiers when data is unbalanced. This is because traditional training criteria, such as the overall success rate used in the GP fitness function, can be influenced by the larger number of examples from the majority class.

This thesis proposes a GP approach to classification with unbalanced data. The goal is to develop new internal cost-adjustment techniques in GP to improve classification performance on both the minority class and the majority class. By focusing on internal cost-adjustment within GP rather than the traditional data-balancing techniques, the unbalanced data can be used directly or “as is” in the learning process.
This removes any dependence on a sampling algorithm to first artificially re-balance the input data prior to the learning process. This thesis shows that by developing a number of new methods in GP, genetic program classifiers with good classification ability on the minority and the majority classes can be evolved. This thesis evaluates these methods on a range of binary benchmark classification tasks with unbalanced data.

This thesis demonstrates that, unlike tasks with multiple balanced classes where some dynamic (non-static) classification strategies perform significantly better than the simple static classification strategy, static and dynamic strategies show no significant difference in the performance of evolved GP classifiers on these binary tasks. For this reason, the rest of the thesis uses the static classification strategy.

This thesis proposes several new fitness functions in GP to perform cost adjustment between the minority and the majority classes, allowing the unbalanced data sets to be used directly in the learning process without sampling. Using the Area under the Receiver Operating Characteristics (ROC) curve (also known as the AUC) to measure how well a classifier performs on the minority and majority classes, these new fitness functions find genetic program classifiers with high AUC on the tasks, covering both classes, and with fast GP training times. These GP methods outperform two popular learning algorithms, namely Naive Bayes and Support Vector Machines, on the tasks, particularly when the level of class imbalance is large, where both algorithms show biased classification performance.

This thesis also proposes a multi-objective GP (MOGP) approach which treats the accuracies of the minority and majority classes separately in the learning process.
The MOGP approach evolves a good set of trade-off solutions (a Pareto front) in a single run that perform as well as, and in some cases better than, multiple runs of canonical single-objective GP (SGP). In SGP, individual genetic program solutions capture the performance trade-off between the two objectives (minority and majority class accuracy) using an ROC curve, whereas in MOGP this requirement is delegated to multiple genetic program solutions along the Pareto front.

This thesis also shows how multiple Pareto front classifiers can be combined into an ensemble where individual members vote on the class label. Two ensemble diversity measures are developed in the fitness functions which treat the diversity on both the minority and the majority classes as equally important; otherwise, these measures risk being biased toward the majority class. The evolved ensembles outperform their individual members on the tasks due to good cooperation between members.

This thesis further improves the ensemble performance by developing a GP approach to ensemble selection, to quickly find small groups of individuals that cooperate very well together in the ensemble. The pruned ensembles use far fewer individuals to achieve performance that is as good as that of larger (unpruned) ensembles, particularly on tasks with high levels of class imbalance, thereby reducing the total time to evaluate the ensemble.

Publications Produced

The following fully refereed papers were published during this Ph.D.

1. Urvesh Bhowan, Mark Johnston, Mengjie Zhang, Xin Yao. “Evolving Diverse Ensembles using Genetic Programming for Classification with Unbalanced Data”. IEEE Transactions on Evolutionary Computation (Accepted April 2012).

2. Urvesh Bhowan, Mark Johnston, Mengjie Zhang. “Developing New Fitness Functions in Genetic Programming for Classification with Unbalanced Data”. IEEE Transactions on Systems, Man, and Cybernetics (Part B), volume 42, issue 2. 2011. pp. 406–421.

3. Urvesh Bhowan, Mengjie Zhang and Mark Johnston. “Ensemble Learning and Pruning in Multi-Objective Genetic Programming for Classification with Unbalanced Data”. Proceedings of the 24th Australasian Joint Conference on Artificial Intelligence (AI 2011). Lecture Notes in Artificial Intelligence. Vol. 7106. Springer. Perth, Australia, December, 2011. pp. 192–202.

4. Urvesh Bhowan, Mengjie Zhang, Mark Johnston. “Evolving Ensembles in Multi-objective Genetic Programming for Classification with Unbalanced Data”. Proceedings of the Genetic and Evolutionary Computation Conference (GECCO 2011). ACM Press. Dublin, Ireland. 2011. pp. 1331–1338.

5. Urvesh Bhowan, Mengjie Zhang, Mark Johnston. “A Comparison of Classification Strategies in Genetic Programming with Unbalanced Data”. Proceedings of the 23rd Australasian Joint Conference on Artificial Intelligence. AI 2010: Advances in Artificial Intelligence. Lecture Notes in Artificial Intelligence. Vol. 6464. Springer. Adelaide, Australia, 2010. pp. 243–252. (Nominated for the Best Student Paper award)

6. Urvesh Bhowan, Mengjie Zhang, Mark Johnston. “AUC Analysis of the Pareto-Front using Multi-objective GP for Classification with Unbalanced Data”. Proceedings of the 2010 Genetic and Evolutionary Computation Conference (GECCO 2010). ACM Press. Portland, USA. 2010. pp. 845–852.

7. Urvesh Bhowan, Mengjie Zhang, Mark Johnston. “Genetic Programming for Classification with Unbalanced Data”. Proceedings of the 13th European Conference on Genetic Programming (EuroGP 2010). Lecture Notes in Computer Science, Vol. 6021. Springer. Istanbul, Turkey. 2010. pp. 1–13.

8. Urvesh Bhowan, Mengjie Zhang, Mark Johnston. “Multi-Objective Genetic Programming for Classification with Unbalanced Data”. Proceedings of the 22nd Australasian Joint Conference on Artificial Intelligence (AI 2009). Lecture Notes in Artificial Intelligence. Vol. 5866, Springer. Melbourne, Australia. 2009. pp. 370–380.

9. Urvesh Bhowan, Mengjie Zhang, Mark Johnston. “Genetic Programming for Image Classification with Unbalanced Data”. Proceedings of the 24th International Conference on Image and Vision Computing New Zealand. IEEE Press. Wellington, NZ. 2009. pp. 316–321.

10. Urvesh Bhowan, Mark Johnston, and Mengjie Zhang. “Differentiating Between Individual Class Performance in Genetic Programming Fitness for Classification with Unbalanced Data”. Proceedings of the 2009 IEEE Congress on Evolutionary Computation (CEC 2009). IEEE Press. Trondheim, Norway. 2009. pp. 2802–2809.

Acknowledgments

I would like to thank my supervisors, Dr Mengjie Zhang and Dr Mark Johnston, for their guidance and constant encouragement over the past three years, and for their constructive feedback in writing this thesis and the articles that came before it.

Thank you to Dr Mengjie Zhang, the Marsden Fund of New Zealand (under contract number VUW0806), and the BuildIT PhD Scholarship, for the financial assistance over the past three years.

Thank you to the rest of the Evolutionary Computation Research Group, in particular Dr Kourosh Neshatian, for the many lively and interesting discussions.

Thank you to my Dad for his encouragement. And most of all, thank you to Niamh (and Murdoch) for the support you’ve given me these three long years. At times it has been difficult but you have always brought me through it with your encouragement, love and support.

Contents

1 Introduction
  1.1 Motivation
  1.2 Research goals
  1.3 Major Contributions
  1.4 Organisation of Thesis
  1.5 Benchmark Tasks with Unbalanced Data
2 Literature Review
  2.1 Machine Learning
    2.1.1 Classification
    2.1.2 Class Imbalance Learning
    2.1.3 Evaluating Classifier Performance
  2.2 Evolutionary Computation