DISSERTATION

Title of the Dissertation:
Efficient Feature Reduction and Classification Methods: Applications in Drug Discovery and Email Categorization

Author: Mag. Andreas Janecek
Academic degree sought: Doktor der technischen Wissenschaften (Dr. techn.)
Vienna, December 2009
Degree programme code (as per student record sheet): A 786 880
Field of dissertation (as per student record sheet): Informatik (Computer Science)
Advisor: Priv.-Doz. Dr. Wilfried Gansterer

Contents

1 Introduction
  1.1 Motivation and Problem Statement
  1.2 Synopsis
  1.3 Summary of Publications
  1.4 Notation
  1.5 Acknowledgements

I Theoretical Background

2 Data Mining and Knowledge Discovery
  2.1 Definition of Terminology
  2.2 Connection to Other Disciplines
  2.3 Data
  2.4 Models for the Knowledge Discovery Process
    2.4.1 Step 1: Data Extraction
    2.4.2 Step 2: Data Pre-processing
    2.4.3 Step 3: Feature Reduction
    2.4.4 Step 4: Data Mining
    2.4.5 Step 5: Post-processing and Interpretation
  2.5 Relevant Literature

3 Feature Reduction
  3.1 Relevant Literature
  3.2 Feature Selection
    3.2.1 Filter Methods
    3.2.2 Wrapper Methods
    3.2.3 Embedded Approaches
    3.2.4 Comparison of Filter, Wrapper and Embedded Approaches
  3.3 Dimensionality Reduction
    3.3.1 Low-rank Approximations
    3.3.2 Principal Component Analysis
    3.3.3 Singular Value Decomposition
    3.3.4 Nonnegative Matrix Factorization
    3.3.5 Algorithms for Computing NMF
    3.3.6 Other Dimensionality Reduction Techniques
    3.3.7 Comparison of Techniques

4 Supervised Learning
  4.1 Relevant Literature
  4.2 The k-Nearest Neighbor Algorithm
  4.3 Decision Trees
  4.4 Rule-Based Learners
  4.5 Support Vector Machines
  4.6 Ensemble Methods
    4.6.1 Bagging
    4.6.2 Random Forest
    4.6.3 Boosting
    4.6.4 Stacking
    4.6.5 General Evaluation
  4.7 Vector Space Model
  4.8 Latent Semantic Indexing
  4.9 Other Relevant Classification Methods
    4.9.1 Naïve Bayesian
    4.9.2 Neural Networks
    4.9.3 Discriminant Analysis

5 Application Areas
  5.1 Email Filtering
    5.1.1 Classification Problem for Email Filtering
    5.1.2 Relevant Literature for Spam Filtering
    5.1.3 Relevant Literature for Phishing
  5.2 Predictive QSAR Modeling
    5.2.1 Classification Problem for QSAR Modeling
    5.2.2 Relevant Literature for QSAR Modeling
  5.3 Comparison of Areas

II New Approaches in Feature Reduction and Classification

6 On the Relationship Between FR and Classification Accuracy
  6.1 Overview of Chapter
  6.2 Introduction and Related Work
  6.3 Open Issues and Own Contributions
  6.4 Datasets
  6.5 Feature Subsets
  6.6 Machine Learning Methods
  6.7 Experimental Results
    6.7.1 Email Data
    6.7.2 Drug Discovery Data
    6.7.3 Runtimes
  6.8 Discussion

7 Email Filtering Based on Latent Semantic Indexing
  7.1 Overview of Chapter
  7.2 Introduction and Related Work
  7.3 Open Issues and Own Contributions
  7.4 Classification Based on VSM and LSI
    7.4.1 Feature Sets
    7.4.2 Feature/Attribute Selection Methods Used
    7.4.3 Training and Classification
  7.5 Experimental Evaluation
    7.5.1 Datasets
    7.5.2 Experimental Setup
    7.5.3 Analysis of Data Matrices
    7.5.4 Aggregated Classification Results
    7.5.5 True/False Positive Rates
    7.5.6 Feature Reduction
  7.6 Discussion

8 New Initialization Strategies for NMF
  8.1 Overview of Chapter
  8.2 Introduction and Related Work
  8.3 Open Issues and Own Contributions
  8.4 Datasets
  8.5 Interpretation of Factors
    8.5.1 Interpretation of Email Data
    8.5.2 Interpretation of Drug Discovery Data
  8.6 Initializing NMF
    8.6.1 Feature Subset Selection
    8.6.2 FS-Initialization
    8.6.3 Results
    8.6.4 Runtime
  8.7 Discussion

9 Utilizing NMF for Classification Problems
  9.1 Overview of Chapter
  9.2 Introduction and Related Work
  9.3 Open Issues and Own Contributions
  9.4 Classification Using Basis Features
  9.5 Generalizing LSI Based on NMF
    9.5.1 Two NMF-based Classifiers
    9.5.2 Classification Results
    9.5.3 Runtimes
  9.6 Discussion

10 Improving the Performance of NMF
  10.1 Overview of Chapter
  10.2 Introduction and Related Work
  10.3 Open Issues and Own Contributions
  10.4 Hardware, Software, Datasets
    10.4.1 Datasets
    10.4.2 Hardware Architecture
    10.4.3 Software Architecture
  10.5 Task-Parallel Speedup
  10.6 Improvements for Single Factorizations
    10.6.1 Multithreading Improvements
    10.6.2 Improving Matlab's ALS Code
  10.7 Initialization vs. Task-Parallelism
  10.8 Discussion

11 Conclusion and Future Work

Summary

The sheer volume of data today and its expected growth over the coming years are among the key challenges in data mining and knowledge discovery applications. Besides the huge number of data samples that are collected and processed, the high-dimensional nature of data arising in many applications creates the need for effective and efficient techniques that can deal with this massive amount of data. In addition to significantly increasing the demand for computational resources, such large datasets can also degrade the quality of several data mining applications (especially when the number of features is very high compared to the number of samples). As the dimensionality of the data increases, many types of data analysis and classification problems become significantly harder. This can lead to problems for both supervised and unsupervised learning.
Dimensionality reduction and feature (subset) selection are two families of techniques for reducing the attribute space. While feature selection retains a subset of the original attributes, dimensionality reduction generally produces linear combinations of the original attribute set. In both approaches, the goal is to find a low-dimensional representation of the attribute space that preserves most of the information in the original data. In recent years, feature selection and dimensionality reduction techniques have
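The distinction can be sketched in a few lines of NumPy. This is an illustrative example only, not one of the methods developed in this thesis: the variance filter and PCA-via-SVD below are generic stand-ins for a feature selection criterion and a dimensionality reduction technique.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))   # 100 samples, 10 original features

k = 3  # target number of attributes after reduction

# Feature selection: keep a SUBSET of the original features,
# here the k columns with the highest variance (a simple filter criterion).
variances = X.var(axis=0)
top_k = np.argsort(variances)[-k:]
X_selected = X[:, top_k]         # columns are still original attributes

# Dimensionality reduction: build k LINEAR COMBINATIONS of all features
# (PCA computed via the SVD of the mean-centered data matrix).
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
X_reduced = Xc @ Vt[:k].T        # columns are new, derived attributes

print(X_selected.shape)  # (100, 3)
print(X_reduced.shape)   # (100, 3)
```

Both results have the same shape, but only the selected subset keeps the interpretability of the original attributes; the PCA scores mix all ten features into each new dimension.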