New Jersey Institute of Technology
Digital Commons @ NJIT

Theses — Electronic Theses and Dissertations

Spring 5-31-2017

Decision tree rule-based feature selection for imbalanced data
Haoyue Liu, New Jersey Institute of Technology

Follow this and additional works at: https://digitalcommons.njit.edu/theses
Part of the Computer Engineering Commons

Recommended Citation:
Liu, Haoyue, "Decision tree rule-based feature selection for imbalanced data" (2017). Theses. 25. https://digitalcommons.njit.edu/theses/25

This Thesis is brought to you for free and open access by the Electronic Theses and Dissertations at Digital Commons @ NJIT. It has been accepted for inclusion in Theses by an authorized administrator of Digital Commons @ NJIT. For more information, please contact [email protected].

Copyright Warning & Restrictions

The copyright law of the United States (Title 17, United States Code) governs the making of photocopies or other reproductions of copyrighted material. Under certain conditions specified in the law, libraries and archives are authorized to furnish a photocopy or other reproduction. One of these specified conditions is that the photocopy or reproduction is not to be "used for any purpose other than private study, scholarship, or research." If a user makes a request for, or later uses, a photocopy or reproduction for purposes in excess of "fair use," that user may be liable for copyright infringement. This institution reserves the right to refuse to accept a copying order if, in its judgment, fulfillment of the order would involve violation of copyright law.
Please Note: The author retains the copyright, while the New Jersey Institute of Technology reserves the right to distribute this thesis or dissertation.

Printing note: If you do not wish to print this page, then select "Pages from: first page # to: last page #" on the print dialog screen.

The Van Houten Library has removed some of the personal information and all signatures from the approval pages and biographical sketches of theses and dissertations in order to protect the identity of NJIT graduates and faculty.

ABSTRACT

DECISION TREE RULE-BASED FEATURE SELECTION FOR IMBALANCED DATA

by Haoyue Liu

The class imbalance problem appears in many real-world applications, e.g., fault diagnosis, text categorization, and fraud detection. When dealing with an imbalanced dataset, feature selection becomes an important issue. To address it, this work proposes a feature selection method based on a decision tree rule and a weighted Gini index. The effectiveness of the proposed method is verified by classifying a dataset from Santander Bank and two datasets from the UCI Machine Learning Repository. Compared with filter-based feature selection approaches, i.e., Chi-square and F-statistic, the proposed method achieves a higher Area Under the Curve (AUC) and F-measure, at the cost of slightly more computational effort.

DECISION TREE RULE-BASED FEATURE SELECTION FOR IMBALANCED DATA

by
Haoyue Liu

A Thesis Submitted to the Faculty of New Jersey Institute of Technology in Partial Fulfillment of the Requirements for the Degree of Master of Science in Computer Engineering

Helen and John C. Hartmann Department of Electrical and Computer Engineering

May 2017

APPROVAL PAGE

DECISION TREE RULE-BASED FEATURE SELECTION FOR IMBALANCED DATA

Haoyue Liu

Dr. Mengchu Zhou, Thesis Advisor    Date
Distinguished Professor of Electrical and Computer Engineering, NJIT

Dr.
Osvaldo Simeone, Committee Member    Date
Professor of Electrical and Computer Engineering, NJIT

Dr. Yunqing Shi, Committee Member    Date
Professor of Electrical and Computer Engineering, NJIT

BIOGRAPHICAL SKETCH

Author: Haoyue Liu
Degree: Master of Science
Date: May 2017

Undergraduate and Graduate Education:
• Master of Science in Computer Engineering, New Jersey Institute of Technology, Newark, NJ, 2017
• Bachelor of Science in Automation, Kunming University of Science and Technology, Kunming, P. R. China, 2014

Major: Computer Engineering

Presentations and Publications:
Liu, H. Y., and Zhou, M. C., "Decision tree rule-based feature selection for large-scale imbalanced data," The 27th Wireless and Optical Communication Conference (WOCC), New Jersey, USA, April 2017.

Dedicated to my family, all inclusive, known and unknown, for giving birth to me in the first place and supporting me spiritually throughout my life.

ACKNOWLEDGMENT

Foremost, I would like to express my deepest gratitude to my advisor, Dr. Mengchu Zhou, for his excellent guidance and patience. Professor Zhou served as my research advisor, was very influential in the academic path I have chosen, and gave me many excellent suggestions. I would also like to thank Dr. Osvaldo Simeone for sharing with me his deep knowledge of machine learning, and Dr. Yunqing Shi for his advice on my research. My sincere thanks also go to my group members, Xiaoyu Lu and Liang Qi, and my good friends, Yanfei Gao and Keyuan Wu, for giving me many excellent suggestions and assisting me in completing this thesis work.

TABLE OF CONTENTS

1 INTRODUCTION
  1.1 High-dimensional Data
  1.2 Class Imbalance
  1.3 Methods for the Class Imbalance Problem
    1.3.1 Sampling Methods
    1.3.2 Feature Selection
    1.3.3 Cost-sensitive Learning
2 REVIEW OF LITERATURE
  2.1 Feature Selection
    2.1.1 Filter Methods
    2.1.2 Wrapper Methods
    2.1.3 Embedded Methods
  2.2 Decision Tree
  2.3 Evaluation Methods
    2.3.1 Confusion Matrix
    2.3.2 Accuracy
    2.3.3 F-Measure
    2.3.4 ROC AUC
3 METHODOLOGY
  3.1 Filter-based Feature Selection
    3.1.1 Chi-square
    3.1.2 F-statistic
  3.2 Decision Tree Rule-based Feature Selection Method
    3.2.1 Splitting Criteria
    3.2.2 CART
    3.2.3 Weighted Gini Index for CART
  3.3 Classification
4 EXPERIMENTAL RESULTS
  4.1 Datasets
  4.2 Experimental Design
  4.3 Case Study 1: Santander Bank Dataset
  4.4 Case Study 2: Letter Recognition Dataset
  4.5 Case Study 3: Statlog Dataset
  4.6 Summary
5 CONCLUSION AND FUTURE WORK
  5.1 Summary of Contributions of This Thesis
  5.2 Limitations and Future Work
REFERENCES

LIST OF TABLES

2.1 Confusion Matrices for Binary Classification
2.2 Confusion Matrices for Two Models
3.1 Chi-square Computation Example
3.2 Counts for Node A
3.3 Matrix for One Splitting Node
3.4 Splitting Results for Node 1
3.5 Splitting Results for Node 2
3.6 Splitting Results for Node 3
4.1 Summary of Benchmark Datasets
4.2 Non-feature Selection Versus Top-Ranked Features Chosen Based on DT-FS
4.3 Random Feature Selection Versus Top-Ranked Features Chosen Based on DT-FS
4.4 Comparison among Chi-square, F-statistic, and DT-FS
4.5 Performance of Five Feature Selection Methods in Terms of F-measure (selecting 20% of features)
4.6 Performance of Five Feature Selection Methods in Terms of ROC AUC (selecting 20% of features)
4.7 Performance of Five Feature Selection Methods in Terms of F-measure (selecting 20% of features)
4.8 Performance of Five Feature Selection Methods in Terms of ROC AUC (selecting 20% of features)

LIST OF FIGURES

2.1 A graphical view of how the filter, wrapper, and embedded methods work on a dataset
2.2 An example of a decision tree
2.3 ROC curve
4.1 Scores of features based on DT-FS
4.2 Performance of three feature selection methods in terms of ROC AUC
4.3 Number of needed features among three feature selection methods
4.4 Performance of three feature selection methods in terms of ROC AUC (number of sequential trees = 110)
4.5 Performance of three feature selection methods in terms of ROC AUC (number of sequential trees = 170)
4.6 Performance of five feature selection methods in terms of F-measure (Letter dataset)
4.7 Performance of five feature selection methods in terms of ROC AUC (Letter dataset)
4.8 Boxplot for F-measure and ROC AUC (Letter dataset)
4.9 Performance of F-measure vs. feature ranking (Letter dataset)
4.10 Performance of ROC AUC vs. feature