UNIVERSITY OF OKLAHOMA
GRADUATE COLLEGE

ROBUST WEIGHTED KERNEL LOGISTIC REGRESSION IN IMBALANCED AND RARE EVENTS DATA

A DISSERTATION SUBMITTED TO THE GRADUATE FACULTY in partial fulfillment of the requirements for the Degree of DOCTOR OF PHILOSOPHY

By MAHER MAALOUF
Norman, Oklahoma
2009

A DISSERTATION APPROVED FOR THE SCHOOL OF INDUSTRIAL ENGINEERING BY

Theodore B. Trafalis, Chair
S. Lakshmivarahan
Hillel Kumin
Binil Starly
Amy McGovern

© Copyright by MAHER MAALOUF 2009
All Rights Reserved.

To my father, who instilled in me the love for science.
To my mother, for her measureless love, support, and contagious Greek passion.

Acknowledgements

Warmest thanks I give to Dr. Theodore B. Trafalis for the freedom to determine the path of my rare-events study and for having served as my advisor. I would also like to thank Dr. S. Lakshmivarahan for all his instruction and insightful guidance. Sincerest gratitude I express to Drs. Mary Court and Sridhar Radhakrishnan for their support in my last year. I would also like to thank the rest of my committee members, Drs. Amy McGovern and Binil Starly, as well as the director of the Industrial Engineering Department, Dr. Randa Shehab, for their encouragement and support. For the critical contribution of many useful ideas and the bootstrap code, I thank Robin Gilbert. Appreciation goes to Dr. Indra Adrianto for providing the code of the DDAG method. I would also like to thank Dr. Michele Eodice, the Industrial Engineering faculty and staff, and my friend and colleague Dr. Hicham Mansouri for all their support. For taking the time to edit this dissertation, I sincerely thank my cousin, Daniel E. Karim. My utmost thanks go to Dr. Hillel Kumin, without whose support throughout most of my doctoral years this dissertation would not have been possible.
Contents

Acknowledgements
List of Tables
List of Figures
Abstract
1 Introduction
2 Rare Events and Imbalanced Datasets Research: An Overview
  2.1 Evaluation Metrics
  2.2 Algorithm Level Techniques
    2.2.1 Threshold Method
    2.2.2 Learn Only the Rare Class
    2.2.3 Cost-Sensitive Learning
    2.2.4 Other Methods
  2.3 Data Level Techniques
    2.3.1 Feature Selection
    2.3.2 Sampling
  2.4 Kernel-Based Methods
  2.5 Conclusion
3 Logistic Regression: An Overview
  3.1 Logistic Regression
  3.2 Iteratively Re-weighted Least Squares
    3.2.1 TR-IRLS Algorithm
  3.3 Logistic Regression in Rare Events Data
    3.3.1 Endogenous (Choice-Based) Sampling
    3.3.2 Correcting Estimates Under Endogenous Sampling
4 Kernel Logistic Regression Using Truncated Newton Method
  4.1 Introduction
  4.2 Kernel Logistic Regression
  4.3 Iteratively Re-weighted Least Squares
  4.4 TR-KLR Algorithm
  4.5 Computational Results & Discussion
    4.5.1 Binary Classification
    4.5.2 Multi-class Classification
5 Robust Weighted Kernel Logistic Regression in Imbalanced and Rare Events Data
  5.1 KLR for Rare Events and Imbalanced Data
    5.1.1 Rare Events and Finite Sample Correction
  5.2 Iteratively Re-weighted Least Squares
  5.3 RE-WKLR Algorithm
6 Computational Results, Applications & Discussion
  6.1 Benchmark Datasets
    6.1.1 Balanced Training Data
    6.1.2 Imbalanced Training Data
  6.2 RE-WKLR for Tornado Detection
    6.2.1 Tornado Data
    6.2.2 Experimental Results and Discussion
7 Conclusion
Bibliography
A Maximum Likelihood Estimation
  A.1 Asymptotic Properties of Maximum Likelihood Estimators
    A.1.1 Asymptotic Consistency
    A.1.2 Asymptotic Normality
    A.1.3 Asymptotic Efficiency
    A.1.4 Invariance
B Support Vector Machines

List of Tables

2.1 Confusion matrix for binary classification.
4.1 Datasets.
4.2 Optimal parameter values found for the binary-class datasets.
4.3 Comparison of ten-fold CV accuracy (%) with 95% confidence level.
4.4 Comparison of ten-fold CV time (in seconds).
4.5 Maximum number of iterations reached by IRLS and CG during the ten-fold CV on the binary-class datasets.
4.6 Optimal parameter values for the multi-class datasets. C is the SVM regularization parameter, σ is the parameter (width) of the RBF kernel, and λ is the regularization parameter for both TR-IRLS and TR-KLR.
4.7 Comparison of ten-fold CV accuracy (%) with 95% confidence level.
4.8 Maximum number of iterations reached by IRLS and CG during the ten-fold CV on the multi-class datasets.
6.1 Datasets.
6.2 Benchmark datasets optimal parameter values with balanced training sets.
6.3 Benchmark datasets bootstrap accuracy (%) comparison using balanced training sets. Bold accuracy values indicate the highest accuracy reached by the algorithms being compared.
6.4 Benchmark datasets optimal parameter values with imbalanced training sets.
6.5 Benchmark datasets bootstrap accuracy (%) using imbalanced training sets.
6.6 Spam dataset optimal parameter values with balanced training datasets.
6.7 Spam dataset bootstrap accuracy (%) using balanced training sets.
6.8 Spam dataset optimal parameter values with imbalanced training datasets.
6.9 Spam dataset comparison of bootstrap accuracy (%) using imbalanced training sets.
6.10 Tornado dataset optimal parameter values.
6.11 Tornado dataset bootstrap accuracy (%).
6.12 Tornado dataset execution time in seconds.

List of Figures

3.1 Logistic Response Function.
4.1 Mapping of non-linearly separable data from the input space to the feature space.
4.2 Illustration of DDAG for classification with four classes.
6.1 Benchmark datasets accuracy comparison with balanced training sets.
6.2 Benchmark datasets accuracy comparison using balanced training sets with 95% confidence.
6.3 Benchmark datasets accuracy comparison with imbalanced training sets.
6.4 Benchmark datasets accuracy comparison using imbalanced training sets with 95% confidence level.
6.5 Comparison of algorithms on benchmark datasets with balanced and imbalanced training sets.
6.6 Spam dataset accuracy comparison with balanced training sets.
6.7 Spam dataset accuracy comparison using balanced training sets with 95% confidence level.
6.8 Spam dataset accuracy comparison with imbalanced training sets.
6.9 Spam dataset accuracy comparison using imbalanced training sets with 95% confidence level.
6.10 Comparison of algorithms on Spam dataset with balanced and imbalanced training sets.
6.11 Tornado bootstrap accuracy (%) with 95% confidence level using the original training set.
6.12 Tornado bootstrap accuracy (%) with 95% confidence level. Training set includes 774 non-tornado instances and 387 tornado instances.
6.13 Tornado bootstrap accuracy (%) with 95% confidence level. Training set includes 387 non-tornado instances and 387 tornado instances.
B.1 SVM for classification.
Abstract

Recent developments in computing and technology, along with the availability of large amounts of raw data, have contributed to the creation of many effective techniques and algorithms in the fields of pattern recognition and machine learning. Some of the main objectives for developing these algorithms are to identify patterns within the available data, to make predictions, or both. Great success has been achieved with many classification techniques in real-life applications. Concerning binary data classification in particular, the analysis of data containing rare events or disproportionate class distributions poses a great challenge to industry and to the machine learning community. This study examines rare events (REs) with binary dependent variables containing many times more non-events (zeros) than events (ones). Such variables are difficult to predict and to explain, as has been demonstrated in the literature. This research combines rare-events corrections on Logistic Regression (LR) with truncated-Newton methods and applies these techniques to Kernel Logistic Regression (KLR). The resulting model, Rare-Event Weighted Kernel Logistic Regression (RE-WKLR), is a combination of weighting, regularization, approximate numerical methods, kernelization, bias correction, and efficient implementation, all of which enable RE-WKLR to be at once fast, accurate, and robust.

Chapter 1
Introduction

Politics, national security, weather forecasting, medical diagnosis, image and speech recognition, and bioinformatics are but a few of the fields in which pattern recognition and machine learning have been applied. Predictive tasks whose outcomes are quantitative (real numbers) are called regression, and tasks whose outcomes are qualitative (binary, categorical, or discrete) are called classification. The most fundamental method to address regression problems is the least squares
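To make the regression case concrete, an ordinary least-squares fit can be sketched in a few lines. This is an illustrative sketch only; the toy data and variable names are hypothetical and not taken from the dissertation:

```python
import numpy as np

# Toy regression data: quantitative outcome y depends roughly linearly on x.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])  # approximately y = 1 + 2x plus noise

# Ordinary least squares: choose beta minimizing ||X @ beta - y||^2.
X = np.column_stack([np.ones_like(x), x])     # design matrix with intercept column
beta, *_ = np.linalg.lstsq(X, y, rcond=None)  # closed-form least-squares solution

intercept, slope = beta
print(round(intercept, 2), round(slope, 2))   # → 1.04 1.99
```

A classification task would instead map each input to a discrete label; logistic regression, the subject of Chapter 3, does this by modeling the probability of the positive class rather than the outcome value itself.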