Scalable Subset Selection with Filters and Its Applications

A Thesis Submitted to the Faculty of Drexel University by Gregory Charles Ditzler in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Electrical & Computer Engineering, April 2015. © Copyright 2015 Gregory Charles Ditzler. All Rights Reserved. This work is licensed under the terms of the Creative Commons Attribution-ShareAlike license, Version 3.0. The license is available at http://creativecommons.org/licenses/by-sa/3.0/

Dedications

To my father.

Acknowledgements

First of all, I would like to thank my advisor Gail Rosen for allowing me the opportunity to work in the Ecological & Evolutionary Signal Processing and Informatics Lab (EESI). She has been a constant source of encouragement throughout my graduate education. Robi Polikar has also served as a fantastic co-advisor, and I am extremely grateful for his support. I would also like to extend my thanks and gratitude to the additional members of my Ph.D. committee, Gavin Brown, Andrew Cohen, John M. Walsh, and Steve Weber. I owe a debt of gratitude to Diamantino Caseiro for his mentorship while I was at AT&T Research and for his suggestions after my Ph.D. research proposal. I would like to thank the greatest group of labmates that a graduate student could ask for: Cricket Reichenberger, Yemin Lan, Steve Essinger, Calvin Morrison, Stephen Woloszynek, and Steve Pastor. Additionally, I would like to thank David Koslicki and Jean-luc Bouchot for some incredibly insightful discussions when they were doing post-docs at Drexel. Finally, I thank my family for their continued support.

Abstract

Increasingly, many applications are encountering volumes of data that were almost unimaginable just a few years ago, and hence, many current algorithms cannot handle, i.e., do not scale to, today's extremely large volumes of data. The data are made up of a large set of features describing each observation, and the complexity of the models used for making predictions tends to increase not only with the number of observations, but also with the number of features. Fortunately, not all of the features that make up the data carry meaningful information for making predictions. Thus, irrelevant features should be filtered from the data prior to building a model. Such a process of removing features to produce a subset is commonly referred to as feature subset selection. In this work, we present two new filter-based feature subset selection algorithms that are scalable to large data sets: (i) they address potentially large & distributed data sets, and (ii) they are capable of scaling to very large feature sets. Our first proposed algorithm, Neyman-Pearson Feature Selection (NPFS), uses a statistical hypothesis test derived from the Neyman-Pearson lemma to determine whether a feature is statistically relevant. The proposed approach can be applied as a wrapper to any feature selection algorithm, regardless of the feature selection criterion used, to determine whether a feature belongs in the relevant set. Perhaps more importantly, this procedure efficiently determines the number of relevant features given an initial starting point, and it fits into a computationally attractive MapReduce model. We also describe a sequential learning framework for feature subset selection (SLSS) that scales with both the number of features and the number of observations. SLSS uses bandit algorithms to process features and form a level of importance for each feature. Feature selection is performed independently from the optimization of any classifier to reduce unnecessary complexity. We demonstrate the capabilities of NPFS and SLSS on synthetic and real-world data sets. We also present a new approach for classifier-dependent feature selection that is an online learning algorithm that easily handles large amounts of missing feature values in a data stream. There are many real-world applications that can benefit from scalable feature subset selection algorithms; one such area is the study of the microbiome (i.e., the study of micro-organisms and their influence on the environments that they inhabit). Feature subset selection algorithms can be used to sift through massive amounts of data collected from the genomic sciences to help microbial ecologists understand the microbes – particularly the micro-organisms that are the best indicators of some phenotype, such as healthy or unhealthy. In this work, we provide insights into data collected from the American Gut Project, and deliver open-source software implementations for feature selection with biological data formats.

Table of Contents

ABSTRACT ...... iv
LIST OF TABLES ...... ix
LIST OF FIGURES ...... xi
1. Introduction ...... 1
1.1 A Simple and Concise Example ...... 1
1.2 The Nature of Data ...... 5
1.2.1 Large Volumes of Data ...... 5
1.2.2 Data in Life Science ...... 5
1.2.3 Volume, Velocity, Variety, Veracity and Value ...... 6
1.3 Research Questions ...... 7
1.4 Contributions ...... 8
1.5 Broader Impacts ...... 9
1.6 Organization of the Thesis ...... 9
2. Feature Selection and Related Works ...... 11
2.1 Introduction ...... 11
2.1.1 Feature Subset Selection ...... 11
2.1.2 Wrappers ...... 14
2.1.3 Embedded Methods ...... 15
2.1.4 Filters ...... 17
2.1.5 Bandits in Subset Selection ...... 23
2.2 Scaling Feature Selection for Large Data ...... 27
2.3 Figures of Merit ...... 28
2.4 Analyzing Data from the Microbiome ...... 30
2.4.1 An Introduction to the Microbiome ...... 30
2.4.2 An Introduction to Metagenomics ...... 31
2.4.3 Answering Questions with Metagenomic Data ...... 32
2.4.4 Collecting Samples from the Microbiome ...... 33
2.4.5 Representing Data in the Microbiome ...... 34
2.4.6 Subset Selection and Its Role in the Analysis of the Microbiome ...... 35
2.5 Summary ...... 36
3. Subset Selection with Neyman-Pearson Test ...... 37
3.1 Introduction ...... 37
3.2 Neyman-Pearson Subset Selection ...... 37
3.2.1 Algorithm Description & Derivation ...... 38
3.2.2 Theoretical Properties on NPFS ...... 42
3.2.3 Advantages of NPFS ...... 44
3.3 Experiments ...... 44
3.3.1 Data Sets and Testing Procedure ...... 45
3.3.2 Results on Synthetic Data Sets ...... 46
3.3.3 Results on UCI Data Sets ...... 49
3.3.4 Optical Character Recognition ...... 53

3.3.5 Biasing NPFS to Control False Positives ...... 53
3.3.6 Convergence Checking for p̂_t ...... 57
3.3.7 Big Data Trials ...... 60
3.4 Summary ...... 61
4. Sequential Learning for Subset Selection ...... 63
4.1 Introduction ...... 63
4.2 Forming Feature Selection as a Bandit Problem ...... 63
4.3 Sequential Learning for Subset Selection ...... 64
4.3.1 SLSS Description ...... 64
4.3.2 Selecting Features with SLSS ...... 68
4.3.3 Scaling to Large Data Sets ...... 69
4.3.4 Theoretical Analysis ...... 72
4.4 Experiments ...... 73
4.4.1 Description of Synthetic & Real World Data ...... 74
4.4.2 Algorithms for Comparison & Figures of Merit ...... 76
4.4.3 Examination of Regret ...... 78
4.4.4 Results on Synthetic Data ...... 81
4.4.5 Evaluation with the Bag-of-Little Bootstraps ...... 84
4.4.6 Results on Real-World Data ...... 84
4.4.7 Summary ...... 90
5. Online Feature Selection Ensemble ...... 92
5.1 Introduction ...... 92
5.2 Learning Setting ...... 93
5.3 Online Feature Selection with Partial Inputs ...... 95
5.4 Online Feature Selection Ensembles ...... 97
5.4.1 Algorithm Descriptions ...... 97
5.4.2 Promoting Diversity ...... 101
5.5 Experimental Results ...... 102
5.5.1 Data Sets & Benchmarks ...... 102
5.5.2 Real-World & UCI Experiments ...... 103
5.6 Summary ...... 106
6. Feature Selection Tools for Genomics ...... 108
6.1 Introduction ...... 108
6.2 Motivation ...... 108
6.3 Fizzy: Feature Selection for Biological Data Formats ...... 110
6.4 Benchmark Data Sets ...... 112
6.4.1 MetaHit ...... 112
6.4.2 American Gut Project ...... 113
6.5 Experimental Results ...... 113
6.5.1 American Gut Project ...... 113
6.5.2 MetaHit ...... 120
6.5.3 Evaluation Times ...... 121
6.6 Summary ...... 121

7. Conclusions ...... 123
7.1 Contributions of this Work ...... 123
7.2 Broader Impacts ...... 124
7.3 Recommendations for Future Work ...... 125
Bibliography ...... 126

List of Tables

2.1 Monetary cost of collecting sequences...... 32

3.1 Mathematical Notations for NPFS...... 38

3.2 Classification errors of NPFS and benchmarks on the UCI data...... 52

4.1 Properties of data sets used in our experiments...... 76

4.2 Average percentage of features selected by algorithms that have the ability to approximate the relevant set size...... 86

4.3 Rank of an algorithm’s performance when averaged across the data sets shown in Table 4.1 (lower is better)...... 87

4.4 Classification error rates on the real-world data sets...... 88

4.5 Classification F1-measures on the real-world data sets...... 89

5.1 Mathematical Notations for OFSE...... 94

5.2 Properties of the UCI data sets used in OFS experiments...... 103

5.3 Parameter selection for OFS...... 103

5.4 Cumulative evaluation time measured in seconds for the ensemble algorithms...... 105

5.5 Cumulative mistake rates and rankings for OFS...... 106

5.6 Statistical significance test for comparing OFS algorithms...... 107

6.1 List of the top ranking features for omnivores and vegetarians in the 16S data collected from the American Gut Project detected using JMI within Fizzy...... 117

6.2 List of the top ranking features for omnivores and vegetarians in the 16S data collected from the American Gut Project detected using Random Forests...... 118

6.3 List of the top ranking features for omnivores and vegetarians in the 16S data collected from the American Gut Project detected using NPFS...... 119

6.4 List of the top five ranked Pfams as selected by Fizzy's Mutual Information Maximization (MIM) applied to MetaHit...... 120

List of Figures

1.1 Binary prediction problem with two variables...... 4

2.1 Example of a feature selection problem...... 13

2.2 High-level pseudo code for a wrapper-based feature selection algorithm .... 15

2.3 Entropy of Bernoulli random variables (measured in nats)...... 20

2.4 Pseudo code for a filter-based subset selection approach...... 22

2.5 Pseudo code for the UCB1 Policy [1]...... 25

3.1 Pseudo code for NPFS...... 43

3.2 Results of the Neyman-Pearson hypothesis test applied to the synthetic uniform data set...... 47

3.3 Number of features selected by the Neyman-Pearson detector for varying levels of k...... 48

3.4 Variation in the Neyman-Pearson test for the value of k∗ given that k may have been selected too small ....

3.5 NPFS evaluated on optical character recognition data...... 51

3.6 Effect of the α and β parameters on the critical threshold in NPFS...... 55

3.7 NPFS evaluated with variations in n and β...... 56

3.8 Jaccard index for NPFS as a function of the number of bootstraps with early stopping...... 58

3.9 The early stopping criterion, \frac{1}{K}\|\hat{p}_t - \hat{p}_{t-1}\|_1, for different size feature sets. .. 59
3.10 NPFS evaluated on a data set with 58M+ observations...... 59

3.11 Stability of NPFS on a massive amount of data using the Jaccard and Lustgarten indices as figures of merit...... 61

4.1 Pseudo code for the SLSS-UCB1...... 64

4.2 Pseudo code for the SLSS-Exp3...... 66

4.3 Pseudo code for the SLSS-Thompson...... 66

4.4 Feature importance detection algorithm for SLSS...... 68

4.5 Two simulations of the feature importance detection algorithm in Figure 4.4 applied to synthetic rewards from SLSS...... 70

4.6 SLSS implementation with the bag-of-little bootstraps (BLB), which results in a confidence interval around SLSS's feature importance estimate...... 71

4.7 Evaluation of the theoretical and empirical regret for SLSS-UCB1...... 74

4.8 Cumulative regret of SLSS-Exp3, SLSS-UCB1, and SLSS-Thompson evaluated on a synthetic data set...... 79

4.9 Extension of the average cumulative regret experiment from Figure 4.8 for SLSS-{Exp3, UCB1, Thompson} evaluated with varying k∗ when T = 1000. The solid lines represent L = 25 and the dashed lines represent L = 40...... 79

4.10 Visualization of SLSS weights as a function of the learning round...... 81

4.11 The Lustgarten index and SLSS feature set size as a function of the number of features in a synthetic data set...... 83

4.12 Performance, measured in MSE, of SLSS and SLSS-BLB evaluated on a synthetic data set...... 85

4.13 SLSS learned feature importance distribution for two data sets collected from microbial ecology...... 90

5.1 Online algorithm for learning a linear function with parameters w...... 93

5.2 Pseudo code for OFS with Partial Inputs [2]...... 96

5.3 Pseudo code for truncating a parameter vector in OFS [2]...... 97

5.4 Pseudo code for online bagging feature selectors...... 98

5.5 Pseudo code for online boosting feature selectors...... 99

5.6 Simulations of a Poisson distribution for truncation randomization...... 101

5.7 Cumulative mistakes of OFS on UCI data sets...... 105

6.1 Overview of feature selection using data collected from the microbiome. ... 109

6.2 Overview of the Fizzy software tool...... 111

6.3 Ranked differences between the relative abundances between omnivores and vegetarians as determined from JMI...... 116

6.4 Evaluation times of several feature selection algorithms on biological data. 121


1. Introduction

Supervised pattern recognition and machine learning is the process of taking in raw data and making a decision based on the category or class of the pattern [3]; such a definition focuses on the task of prediction or classification. For example, given values from a medical test, is a patient at risk of heart disease? In many prediction and knowledge discovery tasks, it is commonplace to learn a function that can make the prediction from a collection of observations. Each observation is made up of a collection of features (i.e., variables), where the number of features could be quite large, as could the number of observations. Typically, learning such a function has a complexity that scales with the number of observations as well as the number of features in the data set. Furthermore, we often need to consider this complexity before using any machine learning algorithm in practice. Reducing the number of features to a set that contains only the most relevant features can greatly reduce the complexity of the learning algorithm, and allows the algorithm to focus solely on learning from valuable data rather than on learning which variables are irrelevant. In this chapter we describe the nature of the feature subset selection problem by motivating it with an intuitive example. We then describe the research problems that this thesis addresses, as well as how the proposed algorithms address them.

1.1 A Simple and Concise Example

Let us consider a recently proposed data science challenge by Kaggle Inc. that was funded by a $3 million prize from the Heritage Provider Network¹ [4].

¹ http://www.heritageprovidernetwork.com/

Approximately 71 million individuals are admitted to hospitals each year, and while many of these patients require hospital care, approximately $31 billion is still spent annually on unnecessary visits. Now imagine that we are tasked with predicting whether a patient is going to be admitted to the hospital over the next year given the information in their medical record. There is an enormous number of medical records in the US, and having a person manually sift through them on a computer is not a viable option. Instead, consider that we choose to develop an automated system for making this prediction, so that given a patient's medical record, an algorithm returns whether the patient is likely to return to the hospital over the next year.

Consider that we are provided medical records – or observations – that need to be classified into a category (i.e., visited the hospital or did not visit the hospital). In the context of machine learning, we refer to this scenario as a classification problem, where we have data (records or observations) and corresponding classes (categories, outcomes or groups). The goal is to use the information provided by the features in the medical record to predict the class. The medical records consist of information about the individual as well as any results from medical tests. Each piece of information in the record can be viewed as a variable, or feature, of interest for predicting the outcome of whether that patient will visit the hospital within the next year. In fact, features are sometimes referred to as predictors in the statistics community. For example, sex, hair color, blood pressure, and weight can all be viewed as features. Typically there is a very large set of features, because a medical record can be represented by millions of features if the record contains personal information, images and medical measurements. Fortunately, not all of the features are meaningful for making the prediction. Features such as hair color are unlikely to be meaningful for making such a prediction, whereas blood pressure may be better. Thus, if we have a computer program that can make the prediction, the hair color of a patient is not needed to make the prediction. Recall that the complexity of many algorithms that make such predictions scales with the number of variables (also known as the dimensionality of the data). The algorithm can therefore be made less complex by using only the features that are informative. Not only does the complexity of the algorithm change with the size of the data and feature set, but so does the monetary cost of implementing the model. For example, if certain medical tests are not informative for making the prediction, then they do not need to be collected or measured. Not requiring features to be collected can reduce the cost from the patients' perspective because they do not need to pay for a test. This process of selecting such variables is referred to as feature subset selection.

There are two key points to take note of in this running example. First, there is an enormous number of medical records – in fact, so many that the data may not be stored on the same computer. Many modern computing resources share their data over a network, or use a distributed file system (e.g., Hadoop²). Thus, we face our first computational hurdle in addressing whether a patient is likely to make a hospital visit: loading the entire data set into memory to build a model can be inefficient or simply infeasible. Second, we need to develop a framework to filter out the irrelevant features before we build a model.
We can view features as being relevant, redundant or irrelevant. Relevant features are those that are meaningful for predicting the outcome by some predefined measure (discussed in Chapter 2). These features may be medical tests or something else that is indicative of the outcome. For example, a feature indicating a serious chronic disease may be very relevant in determining if someone will return to the hospital. Redundant features are those that carry similar information to another feature; if two medical tests consistently provide the same outcome, then both tests do not need to be performed.

² http://hadoop.apache.org/


Figure 1.1: Binary prediction problem with two variables, where the task is to predict blue squares or red circles. The data are plotted as pairs (x1, x2) and are projected down onto each of the axes.

Irrelevant features are those that do not carry any information for predicting the outcome and can be left out of the model. It is possible, though, that two (or more) features are not informative on their own but are informative jointly. As an example, consider Figure 1.1, which shows a binary prediction problem with two variables x1 and x2. Examining x1 or x2 individually is not very useful; in fact, you may be better off flipping a coin to determine the outcome. Jointly considering x1 and x2, however, allows us to observe a rule that can predict the class quite reliably. Conceptually, these ideas of feature importance are easy to understand; however, the analysis becomes more complicated, for both theoretical and algorithmic reasons, when it is carried out on a feature set of massive scale. In this example, we have highlighted some basic principles of machine learning and feature selection. These concepts are reviewed with more mathematical rigor in Chapter 2.
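To make the point about jointly informative features concrete, the short sketch below (an illustration of ours, not part of the thesis; the data-generating rule and sample size are assumed) builds an XOR-like problem in the spirit of Figure 1.1 and compares single-feature decision rules against a rule that uses both features.

import numpy as np

rng = np.random.default_rng(0)

# Synthetic data in the spirit of Figure 1.1: the class is determined by an
# XOR-like rule on (x1, x2), so neither variable is predictive on its own.
n = 5000
x1 = rng.uniform(0.0, 1.0, n)
x2 = rng.uniform(0.0, 1.0, n)
y = ((x1 > 0.5) ^ (x2 > 0.5)).astype(int)

def threshold_accuracy(feature, labels):
    """Best accuracy achievable by thresholding a single feature at 0.5."""
    pred = (feature > 0.5).astype(int)
    acc = (pred == labels).mean()
    return max(acc, 1.0 - acc)   # allow either side of the threshold

print("x1 alone:", threshold_accuracy(x1, y))          # ~0.5, no better than a coin flip
print("x2 alone:", threshold_accuracy(x2, y))          # ~0.5
joint_pred = ((x1 > 0.5) ^ (x2 > 0.5)).astype(int)
print("x1 and x2 jointly:", (joint_pred == y).mean())  # ~1.0

Each feature alone predicts the class at roughly chance level, while the joint rule recovers it almost perfectly, which is exactly the behavior sketched in Figure 1.1.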

1.2 The Nature of Data

1.2.1 Large Volumes of Data

The sources generating data today are producing observations and features on a massive scale, with gigabytes, terabytes and even petabytes of data in some cases. The previous example can be viewed as a very large data set if we look at all of the records within a healthcare provider's network. Another example is CERN's Large Hadron Collider, which has generated over 100 petabytes of data since its inception. Not only are scientific and experimental studies generating massive data [5–7], but social media companies, such as Yahoo!, Google, Facebook, and YouTube, are generating data at a rapid rate that seemed almost unimaginable just several years ago. Data of these proportions are quickly becoming the norm in data analysis, to the point where big data may simply be known as data in the not-so-distant future. Therefore, algorithms that can scale to large volumes of data, while still being able to extract the most valuable information, are of utmost importance in machine learning.

1.2.2 Data in Life Science

Life science is one area that is known to generate a tremendous amount of data [8, 9]; one field where there has been a massive influx of interest is the study of the bacterial communities living in the environment and in human habitats. This recent interest in the field is being propelled by: (i) reductions in the costs associated with sequencing, (ii) improvements to sequencing technology, and (iii) the revelation that bacteria play an extremely vital role in human health. There are typically thousands to tens of thousands of bacteria detected in a single environmental sample, and microbial ecologists need efficient computational methods to uncover differences between phenotypes, such as disease, pH levels, or the biome from which the sample was collected.

In particular, microbial ecologists are interested in finding the bacteria that best differentiate and characterize these phenotypes; hence, feature/variable subset selection methods can provide useful insights to the experimental microbial ecologist. It is very important that any type of subset selection or extraction that produces a condensed set of variables maintains a physical interpretation (i.e., bacteria, protein family, etc.). Therefore, multi-dimensional scaling and manifold learning methods do not meet the microbial ecologists' needs because the resulting features cannot be physically interpreted. Unfortunately, the difficulty in applying a feature selection method to metagenomic data is: (i) which objective function for feature selection should be used (should the user choose freely?), and (ii) out of all the bacteria detected in the samples, how many important bacteria exist in the collection of samples? The current state of knowledge in feature subset selection contains methods that can typically satisfy (i) but not (ii), or vice versa. One component of this research is developing an algorithm that can satisfy both (i) and (ii), and do so in a computationally efficient manner.

1.2.3 Volume, Velocity, Variety, Veracity and Value

Regardless of the source of the data, if it is large, it typically shares a few common properties. While the number of properties can be debated, we see five V's associated with big data: volume, velocity, variety, veracity and value. Big data has traditionally focused on the volume of the "big" data in terms of the number of data records; however, there is a lack of studies on the problem of big dimensionality [10]. One of the problems that must be addressed with supervised learning is feature discovery, which does not focus on building a classifier, but rather on examining the importance of the variables in a data set. There is certainly no shortage of feature subset selection algorithms that have been developed over the last 20 years, and many of them have been shown to work well – even with data sets of moderate proportions. However, the difficulty lies in applying such methods to very large data sets when: (i) the data cannot be loaded into memory, or (ii) the complexity of the subset selection algorithm simply cannot scale to the size of the data set.

1.3 Research Questions

Most feature selection approaches require the number of relevant features to be specified in advance [2, 11, 12], which is often a nearly impossible task for any reasonably sized data set. In fact, the set size is typically chosen arbitrarily, or via a grid search. While there are approaches that can provide inference on the feature set size, these approaches require choosing their own free parameters [13], or are tied to a specific classifier [14]. Furthermore, some classifier-independent feature selection approaches can provide inference on the relevant set size, though they fix the objective function used to perform the selection. An unfortunate situation arises if users want to: (i) perform classifier-independent subset selection, (ii) have an algorithm detect the relevant set size, and (iii) allow for the choice of any arbitrary feature selection objective function. Most feature selection research provides users the ability to choose only two of these three properties. The next question of interest stems from the fact that filter-based subset selection algorithms are typically not used for large data sets out of concerns about scalability. An important contribution would be the development of a framework that provides properties (i), (ii), and (iii) while scaling to large data sets. Furthermore, any such framework should have justifiable theoretical properties.

1.4 Contributions

The primary goal of this research is to develop a framework that: (i) identifies variable relevancy using a decision-theoretic implementation; (ii) sequentially learns subsets of relevant features by solving smaller subset selection problems rather than considering the entire high-dimensional space; (iii) has theoretically justifiable properties derived as a result of this research; (iv) provides a realizable implementation for large-scale data, which includes not only high dimensionality but also a large number of observations; (v) can be readily implemented in a distributed/parallel computing environment; and (vi) can function in an incremental, or life-long, learning setting, which seamlessly and continuously allows for processing and learning from new data. Against this background and motivation, the specific contributions of this thesis include:

• Neyman-Pearson based feature selection (NPFS) [15, 16]: A scalable framework for feature subset selection that: (i) is classifier-independent, (ii) provides a straightforward model to cope with large data, (iii) has theoretically justifiable properties, and (iv) detects the relevant feature set size. NPFS is highly parallelizable.

• Sequential Learning Subset Selection (SLSS): An algorithmic feature subset selection framework that is capable of processing massive data sets using an underlying sequential learning algorithm, which can weigh the relative importance of features in the data set. SLSS can also work with partial information, hence making it possible to address missing/incomplete data.

• Online Feature Selection with Bagging & Boosting [17]: An embedded online ensemble-based feature selection algorithm is presented using bagging and boosting. Our approach can scale to very large data streams, and we show that the complexity of predicting with an ensemble model is equivalent to that of a single model; however, experiments show that the ensemble models yield a much lower error rate.


• Subset Selection as a Tool for Comparative Metagenomics [18–21]: We developed feature subset selection tools for biological data, and have applied them to problems in comparative metagenomics. We present novel results that reveal the differentiating taxa between genders and diets.

Implementations of the novel algorithms developed throughout this thesis have been made available to the public³.

1.5 Broader Impacts

The impact of this research extends far beyond that of an algorithm and its theoretical properties as the deliverable. There is certainly no shortage of real-world problems that can benefit from a feature subset selection framework that is capable of processing large volumes of data (e.g., health care [4, 5], astronomical time series data [6]). The National Research Council has recently laid out the challenges that are being faced in the current big data era [7]. Feature discovery is one of the research areas that is of great importance to supervised learning with big/massive data because of its wide applicability, for example to medical care prediction.

1.6 Organization of the Thesis

This thesis is organized as follows:

• Chapter 2 summarizes the literature related to feature subset selection as well as other mathematical concepts that are important for understanding the contributions. We also discuss related work in metagenomics.

³ http://github.com/gditzler and http://github.com/EESI

• Chapters 3 and 4 present a formal analysis of NPFS and SLSS, respectively. We provide an experimental analysis in each chapter by evaluating the approaches on synthetic and real-world data sets.

• Chapter 5 presents an online learning ensemble that builds linear feature selection models and combines them with a convex combination. The ensemble model has the same sparsity as the linear model. We evaluate our approach on several real-world and synthetic data sets.

• Chapter 6 presents experiments on feature subset selection that provide microbial ecologists with tools for comparative metagenomics.

• Chapter 7 provides a discussion, conclusions, and directions for future research.

2. Feature Selection and Related Works

2.1 Introduction

Feature subset selection is a broad research topic that has attracted contributors from fields such as machine learning, mathematics and applied research. In this chapter, we present some works – both recent and foundational – that are related to the work performed in this thesis. Feature selection is an area of machine learning that has been of interest for the last 20 years; however, our work focuses on modernizing feature selection for larger data sets so that it fits into a convenient computational model.

2.1.1 Feature Subset Selection

Feature subset selection is an important component of the supervised learning process, and it enables an algorithm to efficiently learn from data. For example, consider a neural network that is trained on a data set with K features, only a fraction of which are relevant, though we do not know in advance which ones. Using all K features during training would require that the neural network learn which features are irrelevant while also learning the mapping from the features to the labels. Furthermore, the complexity of training the neural network is larger when all of the features are used, whereas a network using just a fraction of the features (i.e., those that are relevant) would be a lower-complexity network. The selection of the most informative variables is not only important for reducing the feature set to one that contains only relevant features (i.e., providing a classifier/regressor with only "good" features)¹ [22, 23], but it can also lead to tighter bounds on uniform convergence (i.e., the VC-dimension of the hypothesis class H typically increases with the dimensionality) [24–26] and improve the generalization performance [27].

¹ Garbage in, garbage out!

Not only can subset selection provide nice theoretical properties during learning, but, perhaps most importantly, it has been shown to be extremely helpful in areas that have high-dimensional data, such as genomics [28], micro-array analysis [29], cancer classification [30], robotics [31], and many other fields.

The setting for feature selection can be formulated as follows: we have a labeled data set D = {(x_i, y_i)}_{i=1}^M of M observations, where x_i ∈ R^K is the observation and y_i ∈ Ω is the label. The outcome space Ω is assumed to be discrete (e.g., Ω = {healthy, disease}), R^K is a high-dimensional space, and many of the variables in the feature space are assumed to contain little or no information about the space Ω. As discussed in Chapter 1, it is possible for some variables to be redundant with others. In some situations we may seek to eliminate the redundant variables, particularly if classification is the end goal. The objective is then to find a feature subset of size k < K that provides the most information about the space Ω. Furthermore, in situations where features are either relevant or irrelevant, we assume there exists a subset of size k∗ that is the size of the true relevant set. Figure 2.1 provides an example of strong, weak and irrelevant features for a very simple problem. The labels are determined by y = sign(θᵀx), where θ is a vector containing the weights tied to each feature. In this example, θ ∈ [0, 1]^K, where a weight close to zero means the feature carries no information and a weight close to one means the feature is strongly relevant.

Our objective is to examine the subset of k features returned by a feature selection algorithm A, and to provide some inference – post feature subset selection – on the value of k∗. We acknowledge that the notion of having a fixed number k∗ becomes ambiguous when there are varying levels of relevance through linear or non-linear relationships.

Figure 2.1: Features are weighted with the value shown on the y-axis. The discriminant function is given by y = θᵀx. Features indexed from -10 to -1 are strongly relevant, -1 to 1 are weakly relevant, and 2 to 10 are irrelevant.

Therefore, we adopt the following definitions of relevance (or lack thereof) presented by Wu et al. [32].

Strong Relevance A feature X_i is strongly relevant for an outcome Y if, for all X_{-i} ⊂ X \ X_i,

P(Y \mid X_{-i}) \neq P(Y \mid X_{-i}, X_i)    (2.1)

Weak Relevance A feature X_i is weakly relevant for an outcome Y if X_i is not strongly relevant and there exists X_{-i} ⊂ X \ X_i such that

P(Y \mid X_{-i}) \neq P(Y \mid X_{-i}, X_i)    (2.2)

where X := {X_1, ..., X_K} is the set of features represented as random variables.

Irrelevance A feature X_i is irrelevant to an outcome Y if, for all X_{-i} ⊂ X \ X_i,

P(Y \mid X_{-i}) = P(Y \mid X_{-i}, X_i)    (2.3)

However, guaranteeing that equality in (2.3) holds may be too stringent, particularly if there is noise. Therefore, we can relax the definition of irrelevance to

|P(Y \mid X_{-i}) - P(Y \mid X_{-i}, X_i)| \leq \delta    (2.4)

for δ being some very small number, which implies that there can exist some difference in the posterior if the feature is omitted; however, the difference is rather minuscule. Feature subset selection algorithms typically fall into one of three broad categories: wrapper-based methods, embedded methods, and filter methods. Though there are hybrid approaches, our focus shall just be on the aforementioned three, which we now describe in more detail.

2.1.2 Wrappers

Wrapper approaches are a classifier-dependent form of subset selection, which forces the user to be limited to a specific classifier. Kohavi & John demonstrated that feature selection wrappers can lead to classifiers that produce a small generalization loss [33, 34]; however, the Achilles' heel of these methods is the computational complexity, as the selected features are classifier-dependent and a new classifier needs to be trained for each selection. A high-level pseudo code can be found in Figure 2.2.

The generic algorithm in the figure requires access to a classifier (Learn), a data set D, and a feature set of indices F ⊂ [K], where |F| = k; the classifier could be, for example, a support vector machine [35, 36]. The wrapper works in rounds by building a classifier on the data using only the features in F. Then an error is measured. Based on the error, the feature set is adapted to include a different set of indices in F. The set F can be adapted using genetic algorithms [37], or another suitable evolutionary approach. The key observation to make for the wrapper is that the score is computed from a cross-validation error of a predefined classifier. Training and testing the classifier on each round is extremely computationally expensive, which is where the Achilles' heel of the wrapper approaches lies.

Input: classifier Learn, data set D := {(x_i, y_i)}_{i=1}^n
Initialize: F is a random set of indices of size k
1: while convergence not met do
2:   h = Learn(D, F)
3:   Measure the error of h
4:   Adapt F based on the error
5: end while

Figure 2.2: High-level pseudo code for a wrapper-based feature selection algorithm

This score can be viewed as a measure of the quality or value of the set F. Recent efforts by Bolón-Canedo et al. have attempted to implement a distributed wrapper to improve the feature selection wrapper's run time [38], but their approach still needs to generate classifiers, which still leads to a computationally complex solution, and the optimality of the wrapper cannot be guaranteed [39]. As our focus is primarily on classifier-independent subset selection, and in particular approaches that can be scaled to large data sets, wrapper-based feature selection algorithms are not of interest in this work.
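Although wrappers are not the focus of this thesis, the loop in Figure 2.2 is easy to make concrete. The sketch below (our own illustration, with an assumed synthetic data set, subset size, and a simple random mutation in place of a genetic algorithm) scores candidate subsets by the cross-validation accuracy of a fixed scikit-learn classifier.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

rng = np.random.default_rng(1)
X, y = make_classification(n_samples=500, n_features=50, n_informative=5,
                           random_state=1)
K, k = X.shape[1], 5

def score(feature_idx):
    # Wrapper score: cross-validation accuracy of a fixed classifier
    # trained only on the candidate feature subset.
    clf = LinearSVC(dual=False)
    return cross_val_score(clf, X[:, feature_idx], y, cv=5).mean()

# Adapt F by random mutation; a genetic algorithm would play this role
# in the description above.
best = rng.choice(K, size=k, replace=False)
best_score = score(best)
for _ in range(50):
    candidate = best.copy()
    candidate[rng.integers(k)] = rng.integers(K)   # swap one index at random
    if len(set(candidate)) == k:
        s = score(candidate)
        if s > best_score:
            best, best_score = candidate, s

print(sorted(best), round(best_score, 3))

Every adaptation of F requires retraining and re-evaluating the classifier, which is exactly the computational burden discussed above.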

2.1.3 Embedded Methods

Embedded methods optimize the classifier's parameters and the feature selector jointly [22, 23]. The parameters are classifier-specific and need to be learned, and the mechanism that implements the feature selection depends on the classifier being used. It may sound as though embedded and wrapper methods are the same, but there is a clear difference: the feature selection for an embedded method is controlled by the objective function being optimized, whereas a wrapper is simply a search procedure for the best features given a classifier.

The optimization problem for an embedded approach can be written generically as:

\theta^* = \arg\min_{\theta \in \Theta,\, F \subset X} \; \mathrm{Loss}(D, F, \theta) + \mathrm{Penalty}(\theta)    (2.5)

where Loss(D, F, θ) is the loss of the prediction model with parameters θ using a feature subset F, and Penalty(θ) is a term that penalizes the complexity of the model. While both embedded and wrapper methods optimize a classifier, embedded methods tend to be much more desirable due to their lower complexity.

The least absolute shrinkage and selection operator (lasso) is a popular embedded feature selection approach [13]. The central concept behind lasso is to minimize the

l2-norm of the difference between the response y and the prediction ŷ = Xᵀθ (i.e., to measure the error), and to add a penalty on θ's l1-norm (i.e., to penalize the parameters). The objective function for lasso is formally given by:

\theta^* = \arg\min_{\theta \in \Theta} \; \frac{1}{2M} \|y - X^T\theta\|_2^2 + \lambda_1 \|\theta\|_1    (2.6)

where λ_1 > 0, and ‖·‖_2 and ‖·‖_1 denote the l2- and l1-norms, respectively. The solution of (2.6) tends to be sparse because of the l1-norm minimization².

Since lasso's prediction is given by ŷ = θᵀx, any element where θ_j = 0 implies that the corresponding feature has no influence on the prediction, hence performing feature selection. Unfortunately, lasso is not always well suited for problems where the data may have highly correlated features. Elastic-net was developed to avoid this problem

² The solution is not guaranteed to be sparse; however, setting to zero the values in θ that have a magnitude smaller than some δ > 0 would force sparsity.

(see [40]) by changing the objective to

\theta^* = \arg\min_{\theta \in \Theta} \; \frac{1}{2M} \|y - X^T\theta\|_2^2 + \lambda_1 \|\theta\|_1 + \frac{\lambda_2}{2} \|\theta\|_2^2    (2.7)

which is commonly implemented in most software packages as:

\theta^* = \arg\min_{\theta \in \Theta} \; \frac{1}{2M} \|y - X^T\theta\|_2^2 + \alpha\lambda_1 \|\theta\|_1 + (1-\alpha)\frac{\lambda_2}{2} \|\theta\|_2^2    (2.8)

where α ∈ [0, 1] controls the weight attached to the l1- and l2-norms. If α = 1, the objective function is the same as lasso's. If α = 0, the optimization problem reduces to ridge regression [41] (also known as Tikhonov regularization [42]); otherwise, if α ∈ (0, 1), we have the elastic net³.

A recent work by Yan et al. began to examine embedded methods for massive data sets [44]. Their approach is an adaptive feature scaling algorithm that, while scaling to very large data sets, limits itself to binary classification problems. Variations of the support vector machine (SVM) have been used in embedded feature selection [45]. However, even with methods such as elastic-net, the end user is limited to tuning the features for one prediction model. Thus, the set of features selected by embedded and wrapper methods may not generalize to other classifiers. Classifier-independent subset selection is therefore an approach that can aid in improving generalization across various classifiers.
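As a minimal illustration of embedded selection with (2.6) and (2.8), the sketch below (ours, not the thesis's software) uses scikit-learn's Lasso and ElasticNet on an assumed synthetic regression problem; scikit-learn's alpha and l1_ratio parameters roughly play the roles of the regularization weights and of α in (2.8), and the chosen values are arbitrary.

import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso

# Regression data with only 5 of 50 informative features (assumed for illustration).
X, y = make_regression(n_samples=200, n_features=50, n_informative=5,
                       noise=1.0, random_state=0)

# Lasso, cf. (2.6): scikit-learn's `alpha` plays the role of lambda_1.
lasso = Lasso(alpha=1.0).fit(X, y)
print("lasso keeps:", np.flatnonzero(lasso.coef_))

# Elastic net in the software parameterization, cf. (2.8): `l1_ratio` plays the
# role of alpha and `alpha` sets the overall penalty strength.
enet = ElasticNet(alpha=1.0, l1_ratio=0.5).fit(X, y)
print("elastic net keeps:", np.flatnonzero(enet.coef_))

Features whose coefficients are driven exactly to zero are discarded, which is how the embedded model performs the selection.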

2.1.4 Filters

As discussed with both the wrapper and embedded methods, the scoring function for a feature set is derived from a classifier loss, whether it is a 0-1 loss or, in the case of lasso, a mean-squared error.

³ The penalization of elastic-nets is sometimes written as α‖θ‖_1 + (1-α)‖θ‖_2^2, where α = λ_2/(λ_2 + λ_1), to simplify the problem. The problem of choosing an appropriate value for α is still not an easy task, and it must be tuned (see [43] for an example of elastic-net and lasso applied to metagenomic data).

Filter methods for subset selection decouple the objectives of classification and feature selection by using a scoring function that is not based on a classifier loss. The motivation for decoupling the classifier loss from the feature selection is that the filter's feature set should generalize better across different classifiers. A simple scoring function is the χ2 test, which examines the independence of two events. If the test is rejected for a feature, it indicates that there is dependence between the feature and the class label. χ2 feature selection has been shown to be inaccurate, due to the one degree of freedom, and the test commonly selects more features than are actually relevant [46]. Other approaches, such as Relief, focus on capturing measures of inter- and intra-class dependencies to select features [47]. One of the more popular classes of filter-based subset selection algorithms uses information theory and greedy search algorithms [48, 49]. A general form of the information-theoretic methods has an objective function of the form:

J(X, F) = Relevance(X, Y) - Redundancy(X, F)    (2.9)

where Relevance(X, Y) measures the relevancy between a feature X and the class Y, and Redundancy(X, F) measures the redundancy of a variable X with a collection of other features F ⊂ X. However, before discussing information-theoretic subset selection algorithms, let us review information theory and how it can be used to form the terms in (2.9).

A Review of Information Theory

Information theory is a mathematical theory developed by Claude Shannon to quantify information [50], and the field plays an important role in developing quantities for measuring the relevancy and redundancy of features.

While the field of information theory is quite deep, we shall only discuss the quantities that are of most importance for feature selection. At the core of information theory lies the fundamental quantity known as entropy, which is formally defined by:

H(X) = -\sum_{x \in \mathcal{X}} p_X(x) \log p_X(x)    (2.10)

The entropy of a random variable X with support \mathcal{X} is a measure of the uncertainty in the random variable. The choice of the logarithm's base defines the units of measurement (e.g., log_2 is measured in bits and log is measured in nats).

Given the definition of entropy, a large entropy means there is a large amount of uncertainty in a random variable. For example, flipping a biased coin with the probability of heads being 0.98 is a low-entropy random variable because the outcome is not very uncertain: we know that heads will appear on the vast majority of the coin flips. A fair coin, on the other hand, is a high-entropy random variable because the outcome of the coin flip is maximally uncertain. The coin flipping example can be viewed as the entropy of a Bernoulli random variable with probability of success p, which is shown in Figure 2.3.

Another important quantity is the conditional entropy H(X | Y), which is a measure of the uncertainty in a random variable X after Y is known. This quantity is defined by:

H(X \mid Y) = -\sum_{x \in \mathcal{X}} \sum_{y \in \mathcal{Y}} p_{X,Y}(x, y) \log p_{X|Y}(x \mid y)    (2.11)

Note that through manipulations of the probabilities, we may write (2.11) differently. It is important to see that H(X | Y) ≤ H(X), and that both quantities must be positive.


Figure 2.3: Entropy of Bernoulli random variables (measured in nats).

The next major quantity is mutual information, which is defined by:

I(X; Y) = H(X) - H(X \mid Y)    (2.12)

The mutual information of random variables X and Y is a measure of the variables' mutual dependence. It measures the reduction in the uncertainty of one variable once the other random variable is known; note that I(X; Y) = I(Y; X). The final quantity we will define is the conditional mutual information:

I(X; Y \mid Z) = \sum_{z \in \mathcal{Z}} p_Z(z)\, I(X; Y \mid Z = z)    (2.13)

These few quantities can be used to form a vast number of objective functions that can be used to implement (2.9).
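The quantities above are straightforward to compute for discrete distributions. The sketch below (ours; the joint distribution is assumed for illustration) evaluates (2.10)-(2.12), reproduces the coin-flip examples plotted in Figure 2.3, and verifies that H(X | Y) ≤ H(X).

import numpy as np

def entropy(p):
    """Plug-in entropy in nats of a discrete distribution p, as in (2.10)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

# Entropy of a Bernoulli random variable, as in Figure 2.3: low for a heavily
# biased coin, maximal (log 2 ~ 0.693 nats) for a fair coin.
print(entropy([0.98, 0.02]))   # ~0.098
print(entropy([0.5, 0.5]))     # ~0.693

# Mutual information via (2.12): I(X;Y) = H(X) - H(X|Y) for a joint pmf p(x, y).
p_xy = np.array([[0.4, 0.1],
                 [0.1, 0.4]])          # assumed joint distribution of (X, Y)
p_x, p_y = p_xy.sum(axis=1), p_xy.sum(axis=0)
h_x = entropy(p_x)
h_x_given_y = -sum(p_xy[x, y] * np.log(p_xy[x, y] / p_y[y])
                   for x in range(2) for y in range(2))   # conditional entropy (2.11)
print(h_x, h_x_given_y, h_x - h_x_given_y)  # H(X|Y) <= H(X); the difference is I(X;Y)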

Approaches for Information-Theoretic Filters

As Brown et al. state, the filter assumption must be met for these approaches to be effective; i.e., the optimization of the feature set and the classifier can be performed in two stages: (i) optimize the feature set, and (ii) use the result from (i) to optimize the classifier. Greedy algorithms are often selected for maximizing the scoring function of filter methods. Perhaps the simplest objective function satisfying (2.9) is J(X) = I(X; Y), which drops the redundancy term from the function. This procedure is known as mutual information maximization (MIM) [51]. Feature selection with MIM is as simple as choosing the k features with the largest mutual information. Peng et al. introduced the minimum-redundancy maximum-relevancy (mRMR) approach to feature selection [52], which has been widely used and benchmarked against. Features are added into a relevant set F ⊂ X so as to maximize

J(X_j) = I(X_j; Y) - \frac{1}{|F|} \sum_{X \in F} I(X_j; X)    (2.14)

where X is a feature in the relevant set F. Equation (2.14) shows how the objective function in (2.9) can be designed to control feature relevancy versus redundancy in the selection process. The first term in (2.14) is simply the MI between the feature and the label variable (i.e., how relevant is the feature?). The second term is a penalization for the feature under test having mutual information with features that are already in the relevant feature set (i.e., how redundant is X_j with the set F?). An objective function such as mRMR's can be maximized using the greedy algorithm shown in Figure 2.4.

Input: collection of features X := {X_i : i ∈ [K]}, scoring function J, and phenotype variable Y
Initialize: F = ∅
1: while |F| < k do
2:   X^* = arg max_{X ∈ X} J(X, F)
3:   F = F ∪ {X^*}
4:   X = X \ {X^*}
5: end while

Figure 2.4: Pseudo code for selecting features using a greedy algorithm that attempts to maximize J.

The algorithm begins by considering a feature set X with an objective function J. The relevant feature set F is initialized to the empty set. The approach then evaluates all the features in X with the objective function J and chooses the feature that maximizes it. That feature is added to the relevant set F and removed from X. This procedure is repeated k times. Thus, the greedy algorithm in Figure 2.4 can be used to maximize a very broad class of objective functions of the form in (2.9). Recall that mRMR's redundancy term did not use any conditional MI, where the condition would be on the labels (i.e., I(X_i; X_j | Y)).

Yang and Moody presented the joint mutual information maximization (JMI) objective function for feature subset selection. Unlike mRMR, JMI uses a conditional MI term in its redundancy term, which is given in (2.16).

J_{JMI}(X_i) = I(X_i; Y) - \frac{1}{|F|} \sum_{X_j \in F} \big( I(X_i; X_j) - I(X_i; X_j \mid Y) \big)    (2.16)

Notice that the sign of the I(X_i; X_j | Y) term inside the sum is negative, which, together with the negative sign in front of the sum, gives I(X_i; X_j | Y) an overall positive contribution to the JMI objective function. Hence, not all redundancy is bad! Features that have a large I(X_i; X_j | Y) help reinforce the importance of the joint set of features.

There have been many such information-theoretic approaches that have been gaining popularity [11, 53–55]. There also exists a class of functions known as submodular functions [56, 57], which have nice theoretical properties that allow a greedy algorithm to get provably close to the optimal solution. Submodularity can be interpreted as a discrete analog of convexity [56]. Submodular functions have become a recent hot topic in combinatorial optimization [58–60]; however, there is still the issue of when to stop adding to the subset, and one must show that the function being used in subset selection is submodular. Zhou et al. presented a new view of feature selection where the features are streaming and new features are added to the predictor if they are beneficial [61, 62]. Furthermore, they take into consideration the runtime of the approach, which they name streamwise feature selection. Other approaches to subset selection consider the correlation of random variables [63], and cost constraints on the features [64]. One of the primary observations to take away from this section is that there is a lack of subset selection implementations that allow the user to choose the objective function while, at the same time, providing inference on k∗. The work in the following chapters focuses on the development of meta selection algorithms that do not restrict the objective function (i.e., MSE, MI, JMI, etc.), and we provide inference on the parameter k∗, which is unknown.
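A compact sketch of the greedy filter in Figure 2.4 with pluggable objective functions of the form (2.9) is given below (our own illustration, not the software delivered with this thesis). It uses scikit-learn's plug-in mutual information estimate for discrete features, and the toy data set, in which only the first three features drive the label, is assumed.

import numpy as np
from sklearn.metrics import mutual_info_score   # plug-in MI estimate for discrete variables

def greedy_selection(X, y, J, k):
    """Greedy forward selection maximizing a scoring function J (cf. Figure 2.4)."""
    remaining, selected = list(range(X.shape[1])), []
    for _ in range(k):
        best = max(remaining, key=lambda j: J(X, y, j, selected))
        selected.append(best)
        remaining.remove(best)
    return selected

def mim(X, y, j, F):
    # MIM: J(X_j) = I(X_j; Y); the redundancy term in (2.9) is dropped.
    return mutual_info_score(X[:, j], y)

def mrmr(X, y, j, F):
    # mRMR (2.14): relevance minus the average redundancy with the selected set F.
    red = np.mean([mutual_info_score(X[:, j], X[:, i]) for i in F]) if F else 0.0
    return mutual_info_score(X[:, j], y) - red

def cond_mi(a, b, y):
    # I(A; B | Y) estimated as sum_y P(Y = y) I(A; B | Y = y), as in (2.13).
    return sum((y == c).mean() * mutual_info_score(a[y == c], b[y == c])
               for c in np.unique(y))

def jmi(X, y, j, F):
    # JMI (2.16): the conditional redundancy I(X_j; X_i | Y) enters with a positive sign.
    rel = mutual_info_score(X[:, j], y)
    if not F:
        return rel
    pen = np.mean([mutual_info_score(X[:, j], X[:, i]) - cond_mi(X[:, j], X[:, i], y)
                   for i in F])
    return rel - pen

# Toy discrete data (assumed): only the first three of eight features drive the label.
rng = np.random.default_rng(0)
X = rng.integers(0, 3, size=(1000, 8))
y = (X[:, 0] + X[:, 1] + X[:, 2] > 3).astype(int)
for J in (mim, mrmr, jmi):
    print(J.__name__, greedy_selection(X, y, J, k=3))

Swapping the scoring function J is all that is needed to move between MIM, mRMR (2.14), and JMI (2.16), which is the flexibility that the meta selection algorithms in later chapters aim to preserve.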

2.1.5 Bandits in Subset Selection

Online learning allows efficient analysis of large volumes of data by not considering the entire data set when a classifier is being trained, and these approaches have only recently been considered for feature selection. Wang et al. present an online embedded subset selection model [2, 65]. They show that the online model is capable of processing large volumes of data; however, their approach is not classifier-independent, and binary prediction problems are the only type addressed. The multi-armed bandit is an online algorithm that can be useful for subset selection.

Multi-Arm Bandits

Section 2.1.1 discussed the primary types of subset selection algorithms, with a heavy focus on the use of classifier-independent approaches (i.e., filters). This section highlights the use of bandit-like approaches in subset selection; however, we do not limit the discussion to only filter-like approaches. The multi-arm bandit (MAB) addresses the problem of exploration-versus-exploitation (EvE), in which a player needs to select a decision (from one of many) that maximizes the player's reward over time. Robbins was one of the first to address this problem in 1952 while examining the problem of sequential decision making [66], and the MAB problem became quite popular in the computer science community after Auer et al. derived finite-time regret bounds for simple MAB algorithms [1]. The MAB problem is commonly presented as a gambler who is faced with deciding which slot machine lever to pull when she/he is faced with K choices. Each slot machine has a reward distribution. The problem that the gambler faces is which machine's lever she/he should pull such that the reward is maximized after n rounds of play. This is a problem of EvE, where we want to explore new levers (because the reward is a random variable), but exploit levers that we know pay well. A more practical setting for MABs – as opposed to the problems that gamblers face on a daily basis – is selecting ads based on users' prior knowledge and interests [67]. In this setting, the reward could be user clicks (or some function of the number of clicks).

Input: K arms with rewards µ_1, . . . , µ_K
1: for t = 1, ..., T do
2:   Play arm j^*, where

j^* = \arg\max_{j \in [K]} \left( \hat{\mu}_j + \sqrt{\frac{2 \log t}{t_j}} \right)    (2.17)

     where t_j is the number of times arm j was played and μ̂_j is the average reward of arm j.
3: end for

Figure 2.5: Pseudo code for the UCB1 Policy [1].

In terms of feature subset selection, we can view a feature as an arm; however, the difficulty in our setting is how to derive an appropriate reward function. We now review a few approaches that use MABs or some type of EvE in the subset selection process.

The upper confidence bound (UCB1) policy is one bandit algorithm that has been shown to be very successful in practice [1]. In this section, we focus solely on the UCB1 implementation (see Figure 2.5 for the pseudo code). There exists a reward distribution ν_j over the jth arm. At each round t, the player plays the arm that maximizes (2.17). It is important to note that the player only observes the reward of the arm she/he pulls; the rewards of the arms that do not maximize (2.17) at round t are never observed. The UCB1 policy, while simple, achieves sublinear regret. That is to say, the "regret" the player feels for not playing the best arm on each of the T rounds grows sublinearly with T. To state this in mathematical terms, define

\mu^* = \max_{j \in [K]} \mu_j \quad \text{and} \quad j^* \in \arg\max_{j \in [K]} \mu_j

where µ^* is the reward of the best performing arm. Ideally, the player would pull this

arm on each of the T rounds because it has the highest expected reward out of the K available arms. The regret a player feels from playing T rounds is defined by:

\mathrm{regret} = \mu^* T - \mathbb{E}\left[ \sum_{t=1}^{T} \mu_{I_t} \right]    (2.18)

where I_t is the arm that was played on round t. Given this definition, the UCB1 policy's regret is no more than

\mathrm{regret} \leq \left(1 + \frac{\pi^2}{3}\right) \cdot \sum_{j=1}^{K} \Delta_j + 8 \sum_{i:\, \mu_i < \mu^*} \frac{\log T}{\Delta_i}    (2.19)

after T rounds of play, where Δ_j = µ^* − µ_j [1]. The regret of UCB1 grows sublinearly, in fact logarithmically, with T. Thus the per-round regret grows smaller, which is indicative of the UCB1 policy learning the optimal arm to play. The UCB1 policy is only one approach to solving the EvE problem described above. There are many other approaches to managing EvE; for a more formal discussion of bandits, the reader is encouraged to read [68].
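The UCB1 policy in Figure 2.5 can be simulated in a few lines. In the sketch below (ours; the Bernoulli arm means, horizon, and seed are assumed), the index in (2.17) quickly concentrates the plays on the best arm, and the gap-based regret, in the spirit of (2.18), grows roughly logarithmically, consistent with (2.19).

import numpy as np

rng = np.random.default_rng(0)

# Bernoulli reward distributions for K arms; the means are assumed for illustration.
means = np.array([0.2, 0.5, 0.7, 0.9])
K, T = len(means), 5000

counts = np.zeros(K)      # t_j: number of times arm j has been played
estimates = np.zeros(K)   # average observed reward of arm j

# Play each arm once so the confidence term in (2.17) is well defined.
for j in range(K):
    counts[j] = 1
    estimates[j] = float(rng.random() < means[j])

regret = 0.0
for t in range(K + 1, T + 1):
    ucb = estimates + np.sqrt(2.0 * np.log(t) / counts)   # index in (2.17)
    j = int(np.argmax(ucb))
    reward = float(rng.random() < means[j])
    counts[j] += 1
    estimates[j] += (reward - estimates[j]) / counts[j]   # running mean
    regret += means.max() - means[j]                      # expected gap, cf. (2.18)

print("plays per arm:", counts, "cumulative regret:", round(regret, 1))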

Bandits in Feature Selection

Busa-Fekete et al. presented a straightforward approach to subset selection with AdaBoost that uses an adversarial MAB to select features on each round of boosting [69–72]. Their approach used a MAB to identify a set of features (they considered each feature as a set), which was shown to increase the accuracy of AdaBoost. The MAB would, over time, learn the best set of features to provide to the weak learner. Similar to AdaBoost+MAB, many existing online and subset selection methods are classifier-dependent [2, 73, 74]. MABs are becoming increasingly popular for addressing feature subset selection, though they have not seen widespread adoption in the field.

Kalyanakrishnan et al. have studied MABs that select multiple arms, as opposed to the single arm in the gambling example, from a very theoretical perspective [75, 76]. Their works considered not only theoretical quantities of regret that the algorithm experiences (i.e., how close to optimal is the arm selection strategy), but also the time complexity. Kalyanakrishnan et al. did not consider where the reward function for the multiple arms would come from, which is a central issue in feature selection using filters. The work discussed in Chapter 4 develops algorithms and reward functions for implementing feature selection using MABs. Gaudel & Sebag examine feature selection using a bandit approach in [77]. As opposed to a strictly theoretical perspective, they focus on developing an algorithm for implementing feature selection. They use an upper confidence tree to search the space of feature subsets; however, their reward function for the bandit requires generating a classifier. Ashtiani et al. also examine the use of bandits for subset selection, though their approach relies on a Euclidean target space with a k-NN classifier to score the sets [78]. Furthermore, the scalability of the approach to very large data was not evaluated, though large-scale data was not a central focus of their work. Again, we find that it is difficult for online and bandit-like subset selection approaches to decouple themselves from a classifier.

2.2 Scaling Feature Selection for Large Data

The concept of large and big data has primarily focused on the volume of data from the perspective of the number of observations [7]. However, as Zhai et al. point out, there is another component to volume in big data that is commonly overlooked, which is the dimensionality [10], and there is certainly no shortage of problems in big data where subset selection can help [5, 79–82]. Several recent works by Farahat et al. examined an instance subset selection approach to reduce the size of the data, though there could still be an issue of dimensionality [83, 84]. Our goal with this work is to examine an approach to subset selection that processes subsets of the instances and feature space to learn the relative importance of features in a lightweight computational framework.

2.3 Figures of Merit

We need a way to measure the performance of a feature selection algorithm. The error of a classifier is not necessarily the primary objective when subset selection is used; furthermore, error does not address the consistency or stability of the subset selection method. That is to say, if we had two finite-sample data sets $\mathcal{D}_1$ and $\mathcal{D}_2$ sampled from the probability distribution $\mathcal{P}$, would the subset selection algorithm choose the same features for $\mathcal{D}_1$ and $\mathcal{D}_2$? The answer is: probably not! There are three primary stability indices that can be considered for benchmarking the proposed subset selection algorithms on synthetic and real-world data sets.

Definition: Jaccard Stability [85] The Jaccard stability/consistency index for two subsets $A \subset \mathcal{X}$ and $B \subset \mathcal{X}$ is

$$C_{\text{Jac}}(A, B) = \frac{|A \cap B|}{|A \cup B|} \qquad (2.20)$$

The Jaccard index is bounded in $[0, 1]$, where $C_{\text{Jac}}(A, B) = 1$ implies that $A$ and $B$ are identical sets, and $C_{\text{Jac}}(A, B) = 0$ implies that $A$ and $B$ do not share any of the same elements. This is a very simple consistency index that does not require that $A$ and $B$ be of the same size.

Definition: Kuncheva Stability [86] The Kuncheva stability/consistency index for two subsets $A \subset \mathcal{X}$ and $B \subset \mathcal{X}$, such that $r = |A \cap B|$ and $|A| = |B| = k$, where $1 \le k \le K = |\mathcal{X}|$, is

$$C_{\text{Kun}}(A, B) = \frac{rK - k^2}{k(K - k)} \qquad (2.21)$$

where $C_{\text{Kun}}(A, B) \in [-1, 1]$ for all subsets $A$ and $B$ in $\mathcal{X}$.

Definition: Lustgarten et al. Stability [87] The Lustgarten stability/consistency index for two subsets $A \subset \mathcal{X}$ and $B \subset \mathcal{X}$ is

$$C_{\text{Lus}}(A, B) = \frac{|A \cap B| - \frac{|A| \times |B|}{K}}{\min(|A|, |B|) - \max(0, |A| + |B| - K)} \qquad (2.22)$$
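All three stability indices reduce to simple set arithmetic. The Python sketch below implements the three definitions above for subsets given as sets of feature indices over a feature space of size K; the function names are illustrative.

def jaccard_stability(A, B):
    # C_Jac(A, B) = |A n B| / |A u B|, as in (2.20)
    A, B = set(A), set(B)
    return len(A & B) / len(A | B)

def kuncheva_stability(A, B, K):
    # C_Kun(A, B) = (r K - k^2) / (k (K - k)), as in (2.21); requires |A| = |B| = k
    A, B = set(A), set(B)
    k, r = len(A), len(A & B)
    return (r * K - k ** 2) / (k * (K - k))

def lustgarten_stability(A, B, K):
    # C_Lus(A, B), as in (2.22); does not require equal-size subsets
    A, B = set(A), set(B)
    numerator = len(A & B) - len(A) * len(B) / K
    denominator = min(len(A), len(B)) - max(0, len(A) + len(B) - K)
    return numerator / denominator

# example with a K = 10 dimensional feature space
print(jaccard_stability({0, 1, 2}, {1, 2, 3}))            # 0.5
print(kuncheva_stability({0, 1, 2}, {1, 2, 3}, K=10))
print(lustgarten_stability({0, 1, 2}, {1, 2, 3, 4}, K=10))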

Alternatively, if a classifier is used in conjunction with a feature selection method, then the error of the classifier can be used as a measure of performance. In practice, error is what applied researchers are interested in (i.e., those who just want a software tool to work in the classification pipeline of their data). We use the average classification error, which for a binary classification problem with $\hat{y}, y \in \{0, 1\}$, is given by

$$\widehat{\text{err}} = \frac{1}{m} \sum_{j=1}^{m} |\hat{y}_j - y_j| = \frac{1}{m} \|\hat{\mathbf{y}} - \mathbf{y}\|_1 \qquad (2.23)$$

Furthermore, if the performance of a classifier is of interest to the end user, we should also consider other figures of merit, such as the f-measure [88–90]. Precision, recall, and f-measure are commonly used performance measures, and all of them can be obtained from the confusion matrix. The precision of a classifier, as shown in (2.24), is the ratio of the number of instances that are positive and are classified as positive to the number of instances that were classified as positive. The recall of a classifier, as shown in (2.25), is the ratio of the number of instances that are positive and are classified as positive to the number of instances that are truly positive.

$$\text{precision} = \frac{TP}{TP + FP} \qquad (2.24)$$

$$\text{recall} = \frac{TP}{TP + FN} \qquad (2.25)$$

where $TP$, $FP$, and $FN$ are the number of true positives, false positives, and false negatives, respectively. The f-measure is a measure of accuracy that takes into account both precision and recall. The f-measure, or balanced f-score (f1 score), is the harmonic mean of precision and recall, as shown in (2.26). The harmonic mean is proportional to the squared G-mean divided by the arithmetic mean of the precision and recall.

$$\text{f-measure} = \frac{2 \times \text{recall} \times \text{precision}}{\text{recall} + \text{precision}} \qquad (2.26)$$
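For reference, a small Python sketch computing these measures directly from confusion-matrix counts (the counts used in the example are hypothetical):

def precision_recall_f1(tp, fp, fn):
    # precision (2.24), recall (2.25), and f-measure (2.26) from confusion-matrix counts
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f_measure = 2 * recall * precision / (recall + precision)
    return precision, recall, f_measure

# hypothetical counts: 40 true positives, 10 false positives, 20 false negatives
print(precision_recall_f1(40, 10, 20))   # (0.8, 0.666..., 0.727...)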

2.4 Analyzing Data from the Microbiome

2.4.1 An Introduction to the Microbiome

The microbiome is the population of micro-organisms inhabiting an environment, which could be a human, soil, or water environment. Microbes are absolutely everywhere and drive complex interactions with the places they inhabit. Microbes are responsible for operations such as the oxygen and carbon-dioxide cycles. Not only are micro-organisms in the environment and throughout our bodies, but they are quite abundant. In fact, there are approximately 10 times more bacterial cells on, and throughout, our body than there are of our own cells. This realization has led to a huge thrust to understand the microbes in our body, and how they can be linked with diseases [91–93], obesity [94, 95], and forensics [96].

The study of the microbiome has become a diverse field requiring individuals from multiple backgrounds working together (e.g., life science, mathematics, engineering, and computer science). As an example, computer scientists, engineers, and biologists are working together to develop a sustainable computational framework and web server to enable an easy-to-use workflow for researchers to analyze their data using web services such as MGRAST and KBase [8, 9, 97]. Several questions are of particular importance when the microbiome is being examined: in particular, who is there, how many of them are there, and what are they doing? Some of these questions can be addressed using DNA/RNA sequencing followed by homology and taxonomic classification. Answering such questions requires an advanced pipeline of computational tools to filter, analyze, and characterize a sample.

2.4.2 An Introduction to Metagenomics

Metagenomics is the analysis of genetic material extracted from samples obtained directly from the environment. Each sample is comprised of many micro-organisms, as opposed to the traditional study of genomics, where an organism has its genome isolated and then sequenced. The main difference between the traditional view of DNA analysis and metagenomics is that with metagenomics we are potentially sequencing from hundreds or thousands of organisms in a single sequencing run. While sequencing thousands of organisms sounds like a costly task, improvements to sequencing technology have significantly decreased the cost of obtaining sequences (see Table 2.1). As mentioned earlier, a sample from a metagenomic study is collected directly from the environment, which can be a gram of soil [98, 99], a milliliter of ocean

Date     Cost per Mb    Cost per Genome
Jan-04   $1,598.91      $28,780,376
Jan-05   $974.16        $17,534,970
Jan-06   $699.20        $12,585,659
Jan-07   $522.71        $9,408,739
Jan-08   $102.13        $3,063,820
Jan-09   $2.59          $232,735
Jan-10   $0.52          $46,774
Jan-11   $0.23          $20,963
Jan-12   $0.09          $7,666
Jan-13   $0.06          $5,671
Jan-14   $0.04          $4,008

Table 2.1: The cost-accounting data presented here are summarized relative to two metrics: (1) “Cost per Megabase of DNA Sequence” - the cost of determining one megabase (Mb; a million bases) of DNA sequence of a specified quality; (2) “Cost per Genome” - the cost of sequencing a human-sized genome.

water [100], or a swab from an object [101], including any living organism such as humans [101, 102]. Furthermore, these microbes inhabit nearly every surface of the earth, even in regions where life is scarce, such as hot springs [103], Antarctica [104], and the Dead Sea [105]. It is the genomic content of these samples that is used to describe and characterize the microbiome.

2.4.3 Answering Questions with Metagenomic Data

Once the environmental samples are collected, we can begin to answer basic questions by analyzing a microbial community through its genetic material. The first and second questions microbial ecologists like to answer are "who is in the sample" and "how many of them are there"? Both questions are addressed using taxonomic classification approaches. The third question is "what are they doing" in the environment; a better way to phrase this question is "what is their function", which can be answered using whole-genome shotgun data (described in the following section).

Answering these three questions allows microbial ecologists to make inferences about their data, and address hypotheses about different populations in a metagenomic sample. For example, assume that we are given DNA sequences collected from a control and a diseased population; what are the most significant differences between these two phenotypes? Here we use the term phenotype to identify a group within a metagenomic study. Examining the differences between phenotypes is a sub-field of metagenomics known as comparative metagenomics. We can borrow from other fields, such as machine learning, to perform the analysis of the data. For example, finding the bacteria or protein families that distinguish metagenomic samples collected from a control and a diseased population can be addressed through feature selection [19, 22, 28].

2.4.4 Collecting Samples from the Microbiome

Genetic information is collected directly from the environment and needs to be sequenced to address the aforementioned questions. Biological sequence data are typically derived from one of two popular methods, depending on whether the scientist is interested in targeted genetic loci or random genetic loci. Targeting a specific locus is achieved using the polymerase chain reaction (PCR), and the method is referred to as amplicon sequencing. The 16S rRNA gene is commonly targeted in microbial research because of its slow mutation rate. Amplicon sequencing is ideal for classifying sequences into taxonomic classes, which allows ecologists to see who is in the sample and provides a measure of how abundant they are. The other approach, which randomly targets genetic loci, is shotgun metagenomic sequencing. Notice that with the shotgun approach we are not targeting a particular gene when sequencing is performed; rather, many genes are being sequenced. Genes provide organisms with functions, and hence, shotgun sequencing allows ecologists to address what functions are present within a sample.

One of the advantages of amplicon sequencing is that the monetary cost of acquiring sequences is far less, because only the 16S rRNA gene is being sequenced as opposed to any gene present in the genetic sample. Unfortunately, limiting the gene being sequenced restricts the functional knowledge that can be obtained from the genetic sample. In either case of sequencing, however, the general process used for analyzing the data can be quite similar.

2.4.5 Representing Data in the Microbiome

There are several pipelines that can be used to convert sequences into vectors in $\mathbb{R}^K$, so that we can apply traditional machine learning and mathematical algorithms to the metagenomic data. One such approach starts by removing short and low quality-score reads from the entire batch of sequences. The remaining sequences are then clustered, using an algorithm such as CD-HIT, which forms a representative sequence for each cluster [106, 107]. Representative sequences are aligned using NAST [108], and grouped into taxonomic classifications using tools such as the Ribosomal Database Project's (RDP) naïve Bayes classifier (NBC) [109]. Each sample consists of thousands of reads that are classified into taxa or operational taxonomic units (OTUs); each sample can then be represented as a vector $\mathbf{x} \in \mathbb{N}_+^K$, where $K$ is the number of different taxa in the database, and $\mathbb{N}_+$ is the set of positive integers. The $i$th entry of $\mathbf{x}$ is the abundance of taxon $i$, or simply the number of times the $i$th OTU was detected in the sample. Other features, such as $m$-mers, can also be considered; however, using $m$-mers leads to an extremely high dimensional space ($K = 4^m$), whereas the OTU features lie in a much smaller feature space.
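As a rough illustration of this representation, the following Python sketch counts per-read taxonomic assignments into an abundance vector; the taxon names and the helper function are hypothetical and stand in for the output of a classifier such as the RDP NBC.

from collections import Counter
import numpy as np

def otu_abundance_vector(read_assignments, taxa_index):
    # Count how many reads were assigned to each OTU/taxon, producing the
    # abundance vector x described above (K = number of taxa in the database).
    counts = Counter(read_assignments)
    x = np.zeros(len(taxa_index), dtype=int)
    for taxon, c in counts.items():
        x[taxa_index[taxon]] = c
    return x

# hypothetical example: three taxa in the reference database, five classified reads
taxa_index = {"Bacteroides": 0, "Prevotella": 1, "Lactobacillus": 2}
reads = ["Bacteroides", "Prevotella", "Bacteroides", "Bacteroides", "Lactobacillus"]
print(otu_abundance_vector(reads, taxa_index))   # [3 1 1]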

2.4.6 Subset Selection and Its Role in the Analysis of the Microbiome

Feature subset selection has been used in several recent works related to the study of the microbiome; however, such tools have not yet seen widespread usage. Microbial ecologists must be able to make comparisons within (α-diversity) and between (β-diversity) samples. Many of the commonly used software tools that implement these diversity approaches are available in the QIIME software suite [110], which has become a de-facto standard when analyzing data collected from 16S (see [98, 111–114]). Many of the β-diversity analyses implemented in QIIME do not go past principal coordinate analysis (PCoA) [115]. The random forest is the only supervised tool within QIIME that can perform feature selection [14]; however, given our previous discussion, a classifier-independent method could be far more desirable. Knights et al. performed an early study of supervised learning algorithms and their benefits for analyzing data collected from the microbiome [43], and their analysis evaluated multiple classification algorithms, some of which perform embedded feature selection (e.g., random forests and elastic nets). They found that the random forest was consistently one of the better classifiers in their experiments, while a recent work by Fernández-Delgado et al. found that the random forest was able to outperform many other classifiers on 200+ data sets [116]. Kursa and Rudnicki developed specific tools in R for performing feature selection with a random forest [117] (see [118] for microbiome research using the Boruta package). Other recent works have deviated away from random forests to more sophisticated methods, such as genetic algorithms for studying the vaginal microbiome [119]. Unfortunately, all of these recent works are classifier-dependent, and even approaches that use prior knowledge are classifier specific [120–122]. Garbarine et al. couple information-theoretic subset selection with an SVM [123], and more recent work by Lan et al. has studied the use of information-theoretic approaches for age-related feature selection [124].

The random forest has been widely used in the study of the microbiome for supervised learning, and it has the ability to score features for their importance. However, the value of classifier-independent methods cannot be overstated in the field of metagenomics. Given our discussion in this section, developing novel, scalable, and classifier-independent methods would benefit not only research in the microbiome, but also other related fields.

2.5 Summary

In this chapter we reviewed feature subset selection and categorized these approaches as filter, embedded, or wrapper methods. Furthermore, we gave definitions of relevance, weak relevance, and irrelevance in terms of feature selection. Along with general definitions, we described the state-of-the-art approaches in subset selection for each of these areas and drew the conclusion that methods are lacking that are: (i) classifier independent, (ii) objective function independent, (iii) able to provide an inference on the optimal number of features, and (iv) scalable to very large data sets. In addition to subset selection, we provided a brief overview of the microbiome and how subset selection can provide researchers in microbial ecology insights into their data. In the following chapter, we present new methodologies for large-scale subset selection using filters and embedded methods.

3. Subset Selection with Neyman-Pearson Test

3.1 Introduction

This chapter presents a hypothesis test approach for feature selection that works with a generic feature selection/ranking rule (e.g., MIM, JMI, etc.) and the approach is specifically designed for large data sets. Our approach, which is described in detail below, addresses the following problems:

1. Classifier Independent: The proposed approach is classifier independent and the focus of the algorithm does not involve constructing a classifier. Instead, filters are used to evaluate feature subsets.

2. Objective Function Independent: The selection of the filter’s objective function is not limited in any way. This choice allows a user to define what they deem is relevant, and still use our proposed approach.

3. Relevant Set Size Detection: The proposed approach has the user choose the number of features they think are relevant, and the hypothesis test returns the number of features that best represents the relevant set, even if the user's choice was incorrect.

4. Massively Scalable: The proposed approach can scale to massive data sets that are potentially distributed across a cluster, and the approach fits into a MapReduce model [125].

3.2 Neyman-Pearson Subset Selection

Different feature selection algorithms optimize different objective functions; hence, different assumptions are made about the dispersion or distribution of the data and the meaning of feature importance through the filter. Unfortunately, few methods

Table 3.1: Mathematical Notations for NPFS

Notation : Meaning
$\mathcal{X}$ : full set of features, $|\mathcal{X}| = K$
$\mathcal{F}$ : feature selection algorithm
$X_l$ : Bernoulli random variable indicating if a feature was selected as relevant on the $l$th bootstrap trial
$Z$ : Binomial random variable
$H_0$ : null hypothesis
$H_1$ : alternative hypothesis
$\zeta_{\text{crit}}$ : Neyman-Pearson (critical) threshold
$k$ : number of features selected by $\mathcal{F}$
$n$ : number of bootstraps
$T(Z)$ : sufficient statistic of random variable $Z$
$\mathcal{A}$ : base feature selection algorithm

can offer the dynamic selection of k, and fewer yet have the ability to work with other feature selection objective functions (e.g., they already have a specified filter criterion; see feature selection with the χ² statistic [126]). The Neyman-Pearson feature selection (NPFS) hypothesis test is an algorithm-independent meta-approach to determine an appropriate level of k. This approach can be used with any feature selection algorithm. Table 3.1 contains the mathematical notations used throughout this manuscript.

3.2.1 Algorithm Description & Derivation

A “base” feature selection algorithm, $\mathcal{A}$, is run $n$ times on bootstrap data sets sampled uniformly from $\mathcal{D}$. In this setting, it is the data instances, and not the features, that are sampled randomly from a distribution. For each bootstrap data set that is sampled, $\mathcal{A}$ selects $k$ of the $K$ features as the “relevant feature set”. For the moment, we assume there is a $k^*$, the optimal number of relevant features. Ideally, the same

$k$ features would be found by $\mathcal{A}$ as relevant over each of the $n$ trials; however, this is rarely the case due to initializations and randomness in the bootstrap sample. A consistency index can be used to measure the stability of the relevant feature sets over these $n$ trials. This index, however, is not based on a statistical hypothesis test, nor is it designed to determine if a feature is consistently selected as relevant. In fact, by Kuncheva's formulation, $C_{\text{Kun}}(A, B)$ is a random variable (this is easy to see in (2.21) since $R = r$ is a random variable with a hypergeometric distribution). Let us first consider a hypothesis test applied to a single feature (the proposed test can be applied to each feature individually). At each bootstrap iteration, $\mathcal{A}$ returns a set of indices for the relevant feature set. For each feature in the set $\mathcal{X}$, we mark whether the feature was in the relevant set ($X_l = 1$) or not in the set ($X_l = 0$), where $l \in [n]$ is the bootstrap iteration. Note that we have slightly adjusted our definition of $X_l$ from the previous chapter: the previous chapter viewed the random variable $X_l$ as a feature; however, now it indicates whether a feature was detected as relevant.

In this situation, we can determine that the random variable Xl is distributed as a Bernoulli random variable with probability p (that is yet to be determined). The n Bernoulli random variables from the n bootstrap data sets form a Binomial distribution with

$$Z_n = \sum_{l=1}^{n} X_l$$

successes (let $Z_n = z$ be an observation of the random variable $Z_n$). If a feature is selected by chance, then the probability of such a feature appearing in the relevant feature set is $p_0 = k/K$. There is also the observed probability of a feature appearing in the relevant feature set over the bootstrap trials, which is $p_1 = z/n$. If all features were equally relevant (or equally irrelevant), we would expect these probabilities to

be equal to one another. Ultimately, we would like to know if p1 > p0, or in other words, if the probability of a feature being in the relevant set is greater than the probability of a feature being selected by random chance. Against this background we have a hypothesis test formulated as follows,

H0 : p0 = p1

H1 : p1 > p0

where H0 is the null hypothesis (that all features are equally relevant), and H1 is the alternative hypothesis (that some features are more relevant than others). We select the Neyman-Pearson test for several reasons: (i) the likelihood functions under H0 and

H1 can be explicitly computed as shown below, (ii) the solution with the Neyman-Pearson lemma is a simple yet elegant result, and (iii) perhaps most importantly, the Neyman-Pearson test is the most powerful test available for size α [127]. The Neyman-Pearson lemma states that we reject the null hypothesis if,

$$\Lambda(z) = \frac{P(z \mid H_1)}{P(z \mid H_0)} > \zeta_{\text{crit}} \qquad (3.1)$$

where $P(z \mid H_0)$ is the probability distribution under the null hypothesis, $P(z \mid H_1)$ is the probability distribution under the alternative hypothesis, and $\zeta_{\text{crit}}$ is a threshold such that,

$$P(T(z) > \zeta_{\text{crit}} \mid H_0) = \alpha \qquad (3.2)$$

where $\alpha$ is the size of the test, and $T(z)$ is the test statistic. Using $\log \Lambda(z)$ would provide equivalent results, since taking the logarithm does not affect the solution. Recall that the random variable $Z$ follows a Binomial distribution. Using (3.1) and the form of

the probability distribution on Z, we apply the Neyman–Pearson lemma:

$$\frac{P(Z_n = z \mid H_1)}{P(Z_n = z \mid H_0)} = \frac{\binom{n}{z} p_1^z (1-p_1)^{n-z}}{\binom{n}{z} p_0^z (1-p_0)^{n-z}} = \left(\frac{p_1}{p_0}\right)^{z} \left(\frac{1-p_1}{1-p_0}\right)^{n-z} = \left(\frac{1-p_1}{1-p_0}\right)^{n} \left(\frac{p_1(1-p_0)}{p_0(1-p_1)}\right)^{z} > \zeta_{\text{crit}}$$

Since $\left(\frac{1-p_1}{1-p_0}\right)^{n}$ is simply a constant, it can be moved to the other side of the inequality, resulting in a new threshold $\zeta'_{\text{crit}}$. Thus,

$$\left(\frac{p_1(1-p_0)}{p_0(1-p_1)}\right)^{z} > \zeta'_{\text{crit}}$$

Taking the logarithm gives us

$$z \log\!\left(\frac{p_1(1-p_0)}{p_0(1-p_1)}\right) > \zeta''_{\text{crit}}$$

where, again, the logarithm term is simply a constant and it can be removed to find

a scaled threshold $\zeta'''_{\text{crit}}$. Thus, we are seeking

$$z > \zeta'''_{\text{crit}}$$

where $\zeta'''_{\text{crit}}$ is a critical threshold determined by $P(z > \zeta'''_{\text{crit}} \mid H_0) = \alpha$ (note that, by definition, $z$ is a sufficient statistic for $T(z)$). Since the probability distribution under the null

hypothesis is known (i.e., Binomial), we may explicitly solve for $\zeta'''_{\text{crit}}$.

$$P(z > \zeta'''_{\text{crit}} \mid H_0) = 1 - \underbrace{P(z \le \zeta'''_{\text{crit}} \mid H_0)}_{\text{cumulative distribution function}} = \alpha \qquad (3.3)$$

Since $P(z \le \zeta'''_{\text{crit}} \mid H_0)$ has a closed-form expression, it can be obtained from a lookup table. Note that $\alpha$ can be used to control how conservative the hypothesis test will be. That is, if $\alpha$ is small, it will become more difficult for a feature to be detected

as relevant because $\zeta'''_{\text{crit}}$ will become large. To summarize, NPFS is implemented as follows (also in Figure 3.1):

1. Run a feature selection algorithm $\mathcal{A}$ on $n$ independently sampled data sets (sampling instances, not features). The independently sampled data sets can be the result of cross-validation or bootstrap samples. Form a matrix $\mathbf{X} \in \{0, 1\}^{K \times n}$ where $\{\mathbf{X}\}_{il}$ is the Bernoulli random variable for feature $i$ on trial $l$.

2. Compute $\zeta'''_{\text{crit}}$ using (3.3), which requires $n$, $p_0$, and the Binomial inverse cumulative distribution function.

3. Let $\{z\}_i = \sum_{l=1}^{n} \{\mathbf{X}\}_{il}$. If $\{z\}_i > \zeta'''_{\text{crit}}$ then feature $i$ belongs in the relevant set; otherwise the feature is deemed non-relevant. Use only the features selected by the Neyman-Pearson detector for learning a classification or regression function.

As alluded to in Figure 3.1, NPFS fits into the MapReduce framework [125], which allows the subset selection to take place on data sets with a massive number of instances.
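To make the three steps concrete, here is a serial Python sketch of the procedure (the thesis implementation is in Matlab and fits a MapReduce model; the sketch below only mirrors the steps). It assumes a base selector select(data, labels, k) that returns k column indices, and it obtains the critical threshold from the Binomial inverse CDF in scipy.

import numpy as np
from scipy.stats import binom

def npfs(select, data, labels, k, n_boot=100, alpha=0.01, seed=0):
    # Neyman-Pearson feature selection wrapper (steps 1-3 above), written serially.
    rng = np.random.default_rng(seed)
    M, K = data.shape
    X = np.zeros((K, n_boot), dtype=int)          # Bernoulli indicator matrix

    # step 1: run the base selector on n bootstrap samples of the instances
    for l in range(n_boot):
        idx = rng.integers(0, M, size=M)          # bootstrap the rows, not the columns
        X[select(data[idx], labels[idx], k), l] = 1

    # step 2: critical threshold from the Binomial inverse CDF with p0 = k / K
    zeta = binom.ppf(1.0 - alpha, n_boot, k / K)

    # step 3: a feature is relevant if its selection count exceeds the threshold
    return np.where(X.sum(axis=1) > zeta)[0]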

3.2.2 Theoretical Properties on NPFS

An important property of the proposed approach is that if $X_1, \ldots, X_n \sim \text{Bernoulli}(p)$, then we expect the difference between $p$ and its bootstrap estimate $\hat{p}$ to become arbitrarily small as $n$ grows large.

[Figure 3.1 schematic: each bootstrap data set $\mathcal{D}_l$ is passed through $\mathcal{A}(\mathcal{D}_l, k)$ (the map step) to produce a binary column $\mathbf{X}_{:,l}$ of the $K \times n$ matrix $\mathbf{X}$; the reduce/inference step declares feature $j$ relevant if $\sum_i \mathbf{X}_{j,i} > \zeta_{\text{crit}}$.]

Figure 3.1: Pseudo code for NPFS.

The probability of the magnitude of the difference between $p$ and $\hat{p}$ being greater than some $\epsilon > 0$ can be upper bounded using Hoeffding's inequality.

Theorem 3.2.1 (Hoeffding's Inequality [41]) Let $Y_1, Y_2, \ldots, Y_n$ be independent random observations such that $\mathbb{E}[Y] = \mu$, $\bar{Y} = \frac{1}{n}\sum_i Y_i$, and $a \le Y_i \le b$. For any $\epsilon > 0$, the following inequality holds,

$$P(|\bar{Y} - \mu| \ge \epsilon) \le 2e^{-2n\epsilon^2/(b-a)^2} \qquad (3.4)$$

Hoeffding's inequality is similar to Markov's inequality; however, it produces a tighter bound for larger deviations. We may use Hoeffding's inequality, with a few assumptions, to bound the difference between the bootstrap estimate $\hat{p}$ and the true probability $p$. If $X_1, \ldots, X_n \sim \text{Bernoulli}(p)$, then for any $\epsilon > 0$, we have,

$$P(|\hat{p} - p| \ge \epsilon) \le 2e^{-2n\epsilon^2} \qquad (3.5)$$

where $\hat{p} = \frac{1}{n} Z_n$. Thus if $X_1, \ldots, X_n \sim \text{Bernoulli}(p)$, then $\hat{p}$ approaches $p$ exponentially quickly as a function of $n$. Chebyshev's inequality can also be used to find a bound

on $P(|\hat{p} - p| \ge \epsilon)$; however, Hoeffding's inequality provides a tighter upper bound for larger values of $\epsilon$.

3.2.3 Advantages of NPFS

The proposed method for post-analysis of feature selection offers several capabilities. Let us assume that $k$ was selected to be too large compared to the true number of relevant features, $k^*$. How can we determine a more accurate value of $k$? The proposed approach provides a natural solution: simply use the features that the Neyman-Pearson detector returns as being relevant. Note that the number of features returned by the Neyman-Pearson detector need not be $k$: if $k$ were too large, we expect the test to return fewer relevant features. Having such an inference on $k$ can reduce the complexity of the classifier or the regression function. We can also ask the opposite question: what if $k$, provided as a user input to the feature selection algorithm, was selected too small? Could we apply this hypothesis test to determine the subset of the $K$ features that are relevant even though $\mathcal{A}$ never selects all of them because $k$ was smaller than $k^*$?

3.3 Experiments

Our proposed methodology for feature relevance using NPFS was implemented using joint mutual information (JMI), unless otherwise stated, as the baseline feature selection objective function. In this section, we seek to determine the behavior of the hypothesis testing procedure through several experiments on synthetic and real-world data. We wish to answer the following questions:

1. Given a controlled data set, can NPFS correctly identify the truly relevant features?

2. If k were selected too large, can NPFS identify the subset of the k features that should be used instead of the set of k features?

3. If k were selected too small, can NPFS identify all the relevant features that could not be identified as relevant due to k being too small?

We provide a Matlab implementation of NPFS under the GNU GPLv3 at http://github.com/EESI/NPFS.

3.3.1 Data Sets and Testing Procedure

The proposed Neyman-Pearson hypothesis testing methodology (NPFS) for any given feature selection algorithm was tested on a synthetic data set, and a collection of data obtained from the UCI machine learning repository [128] (see Table 3.2). The synthetic data, described below, allows us to tailor experiments to test the strengths and weaknesses of the proposed approach.

Description of the Uniform Data

$M$ observations are generated with features that are independently and identically distributed (iid) uniform random variables in the interval $[0, 10]$. This data set is referred to as $\mathcal{D}_{\text{uni}}$. Each feature vector $\mathbf{x}_m$ for $m \in [M]$ has $K$ features. The true labeling function, unknown to any algorithm, is given by,

$$y_m = \begin{cases} 1, & \sum_{i=1}^{k^*} x_m(i) \le 5 \cdot k^* \\ 0, & \text{otherwise} \end{cases}$$

Hence, only the first $k^*$ features carry information for determining the label $y_m$ of a feature vector $\mathbf{x}_m$. Our goal is to identify, using our hypothesis test, those features (indices $i \in [k^*]$) that are relevant to the classification problem. Note that the threshold for determining the class label is the statistical expectation of the linear


combination of the first $k^*$ feature variables (this is easily shown using the properties of the expectation of a linear function). Such a threshold sets the prior probability of each class to approximately $\frac{1}{2}$ for a randomly sampled data set.
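A minimal Python sketch of this generator, under the labeling rule above (the function name and default sizes are illustrative):

import numpy as np

def make_uniform_data(M=1000, K=25, k_star=5, seed=0):
    # Synthetic 'uni' data: K iid Uniform[0, 10] features; only the first
    # k_star features determine the label via the threshold 5 * k_star.
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 10.0, size=(M, K))
    y = (X[:, :k_star].sum(axis=1) <= 5.0 * k_star).astype(int)
    return X, y

X, y = make_uniform_data()
print(y.mean())   # close to 0.5, matching the intended class priors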

There are $n$ bootstrap data sets drawn from $\mathcal{D}_{\text{uni}}$, and the JMI feature selection algorithm is run independently on each sampled bootstrap set. $k$ of the $K$ features are selected for each bootstrap data set, and a vector of binary indicators, representing whether or not each feature was selected, is produced. The $n$ vectors form a $K \times n$ matrix with binary entries (i.e., $\mathbf{X}$). Each row, corresponding to a feature, is the sequence of Bernoulli successes and failures used in NPFS.

3.3.2 Results on Synthetic Data Sets

Let us start with our questions on the appropriate selection of $k$: if $k$ is selected too large, can $k^*$ be found such that $k^* < k$, and what is approximately the ideal value of $k$ given the results from the $n$ bootstraps? In this experiment, 5 features were considered relevant out of 25 features (recall that the features are uniform random variables). The value of $k$ was varied from 10 to 24. For these cases, there are (at least) 5 to 19 irrelevant features incorrectly selected as relevant at any given bootstrap iteration. We apply the Neyman-Pearson test after 100 bootstraps. Figure 3.2 shows that the Neyman-Pearson test can identify when irrelevant features are being selected by JMI. In this figure, the matrix $\mathbf{X}$ is visualized with white entries indicating features selected by JMI at different bootstrap iterations. The orange rows highlight the features that the Neyman-Pearson method identifies as being relevant. Note that features $\{1, 2, 3, 4, 5\}$ are the only relevant features for this problem. Clearly, the inference provided by the Neyman-Pearson test allows us to reduce $k$ to achieve a much smaller subset of relevant features. In each of these experiments, we find that there are a few features being detected as relevant, which are actually

non-relevant. It is possible to tune n and α such that in every experiment only features 1 through 5 are being detected as relevant. In every experiment, however, the proposed method is always recommending the use of fewer features, because many of the features JMI selects at each bootstrap iteration are irrelevant.

[Figure 3.2 panels: (a) k = 10, (b) k = 15, (c) k = 20, (d) k = 24; each panel plots feature index versus bootstrap iteration (1 to 100).]

Figure 3.2: Results of the Neyman-Pearson hypothesis test applied to the synthetic uniform data set for different cardinalities of the relevant feature set. The Neyman-Pearson hypothesis test recovers the original 5 relevant features (first 5 rows of each plot) with only a few additional irrelevant features in the set. This is a visualization of $\mathbf{X}$, where black segments indicate $X_l = 0$, white segments $X_l = 1$, and the orange rows are the features detected as relevant by the Neyman-Pearson test.

[Figure 3.3 panels: (a) K = 50, k* = 15; (b) K = 100, k* = 15; (c) K = 250, k* = 15; each panel plots the number of features selected versus the number of bootstraps (up to 500) for k in {3, 5, 10, 15, 25}.]

Figure 3.3: Number of features selected by the Neyman-Pearson detector for varying levels of k (too large & too small) when there are 15 relevant features (k*) in the synthetic data set. The number of features selected by the proposed approach appears to be converging to 15 when k is initially selected too small. Even though the number of selected features diverges when k is selected too big, they undershoot the original guess, while the too-small k's overshoot their original guesses.

The second key question is: can the value of $k^*$ be recovered if $k$ was initially chosen too small, and if so, how many bootstraps are needed? To examine this situation, three more synthetic uniform data sets were generated. All of the synthetic data sets' features are uniform random variables with 15 relevant features; however, the data sets have 50, 100, or 250 features. We apply our Neyman-Pearson test with the number of bootstraps varying between 1 and 500. Furthermore, $k \in \{3, 5, 10, 15, 25\}$ are examined. Figure 3.3(a) shows that the value $k^*$ selected by the Neyman-Pearson algorithm approaches the true value for various selections of $k$. We should note that we can improve these results by increasing the number of observations in the data set. However, if $k$ were too large, there are still a few features left in the relevant set as determined by the Neyman-Pearson detector (as observed previously in Figure 3.2). Figure 3.3(c) shows the effect of using 250 features rather than 50 features. Again, if $k$ were selected too small, the Neyman-Pearson detector finds approximately $k^*$ features; however, the method is still unable to completely recover all of them with 500 bootstraps.

3.3.3 Results on UCI Data Sets

In this section, we present the classification error using a base classifier trained on: (i) all features, (ii) the top 10 features selected by JMI, and (iii) the features selected by the proposed approach. The data sets are obtained from the UCI machine learning repository [128] and the mRMR paper [52]. The naïve Bayes (nb) and CART algorithms are used as baseline classifiers [3, 129]. We use the following notation to denote the classifier and the feature selection algorithm: nb (naïve Bayes trained on all features), nb-npfs (naïve Bayes trained with features identified by JMI and the proposed NPFS), and nb-jmi (naïve Bayes trained on the top 10 features selected with JMI). It is important to note that we do not have access to the (true) $k^*$ or the degree of feature relevancy for these data sets; therefore, we must examine the performance of a classifier to evaluate the method's effectiveness. Table 3.2 presents each classifier's error and its rank (see [130]). The proposed approach, for both the naïve Bayes and CART, produces the best average rank. Unfortunately, there is not enough statistical evidence to suggest that the proposed approach provides uniformly the lowest error rate. There is, however, statistical significance between CART-NPFS and CART-JMI, with CART-NPFS outperforming CART-JMI at an α-level of 0.1 using Wilcoxon's signed rank test. The average number of features selected by the Neyman-Pearson test after 10,000 bootstraps can be found in Figure 3.4. The UCI data sets do not allow us to control the level of feature relevancy as we did with the synthetic data, and it is worth noting that we do not observe NPFS detecting all features as relevant even when the number of bootstraps is quite large.

[Figure 3.4 panels: two bar plots at 10,000 bootstraps with bars for k in {3, 5, 10}; the top panel covers brst, cngr, hrt, iono, lung, park, snr, soy, krvskp, lndsat, seme, spect, splice, wave, and wine; the bottom panel covers pcln, pleuk, plung, plymp, and pnci9.]

Figure 3.4: Variation in the Neyman-Pearson test's estimate of k* given that k may have been selected too small. The x-axis represents the data set under test and the y-axis is the predicted k* by the proposed approach using 10,000 bootstraps.

Figure 3.5: Top row: 16 × 16 images from the OCR data set corrupted with noisy pixels. The actual OCR images are 8 × 8 and take a 4-bit value. Bottom row: Irrelevant features marked by the Neyman-Pearson test are indicated in black. Note that ONLY the black pixels are irrelevant features and not the "actual" value of the pixel (i.e., we have scaled the pixels to ensure there were no black pixels). The Neyman-Pearson test selects a subset of 52 features in the 16 × 16 image that are relevant.

[Table 3.2: per-data-set errors and ranks for nb, nb-jmi, nb-npfs and for cart, cart-jmi, cart-npfs on the UCI data sets (breast, congress, heart, ionosphere, krvskp, landsat, lungcancer, parkinsons, pengcolon, pengleuk, penglung, penglymp, pengnci9, semeion, sonar, soybean, spect, splice, waveform, wine), along with each data set's number of instances and features. Average ranks: nb 2.275, nb-jmi 1.900, nb-npfs 1.825; cart 2.075, cart-jmi 2.125, cart-npfs 1.800.]

Table 3.2: Classification errors of a naïve Bayes and CART tested on the UCI data sets (see Section 3.3.3) and rank after 10-fold cross validation. The errors in the table have been truncated; however, the ranks are determined via the untruncated values.

3.3.4 Optical Character Recognition

Our final experiment uses the optical character recognition (OCR) data set collected from the UCI Machine Learning Repository. Each image in the experiment consists of 64 pixels represented by 4 bits (i.e., an 8 × 8 image); however, each image has been corrupted by adding noisy pixels. The final image is 16 × 16. Just as before, we run 100 bootstrap trials with the JMI feature selection algorithm and apply the Neyman-Pearson hypothesis test. In this experiment k = 64 and K = 256. Each noisy pixel is sampled from a uniform probability mass function taking possible values $\{1, \ldots, 16\}$. Figure 3.5 presents the NPFS results on the OCR data set. The top row of Figure 3.5 shows the 16 × 16 images corrupted with noisy pixels. Note that the original OCR images can be observed as they are embedded within the noise. The bottom row of Figure 3.5 shows the irrelevant features marked in black by the Neyman-Pearson test. Note that only the black pixels are irrelevant features and not the "actual" value of the pixel (i.e., we have scaled the pixels to ensure there were no black pixels). The Neyman-Pearson test selects a subset of 52 features in the 16 × 16 image that are relevant. Thus the Neyman-Pearson test is suggesting that there is a subset of features, fewer than 64, that are relevant for discriminating between the characters in the images.

3.3.5 Biasing NPFS to Control False Positives

One of the issues observed with our method is that, if n is increased, NPFS may select more features than are actually relevant. We attribute this to the null hypothesis of NPFS, which assumes that all features can fall into the relevant set uniformly at random, an unrealistic assumption. One way to address this

problem is to regularize the hypothesis with a bias term as follows:

H0 : p0 + β = p1

H1 : p1 > p0 + β

where $H_0$ is the null hypothesis, $H_1$ is the alternative hypothesis, and $\beta \in (0, 1 - p_0]$ is a bias term; a practical selection of $\beta$ should be small. We omit the derivation of the test statistic and simply note that the resulting test is equivalent to NPFS, with the exception that the calculation of the critical threshold using (3.2) assumes that $Z$ is Binomial with success probability $p_0 + \beta$ rather than $p_0$. This biased algorithm is referred to as NPFS-β. The effect of the β heuristic is examined in the experimental results section.
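Computationally, the only change NPFS-β introduces is in the critical threshold. A short sketch, assuming scipy's Binomial inverse CDF:

from scipy.stats import binom

def critical_threshold(n, p0, alpha=0.01, beta=0.0):
    # zeta for NPFS (beta = 0) or NPFS-beta (beta > 0): the null hypothesis
    # assumes each feature is selected with probability p0 + beta per bootstrap.
    return binom.ppf(1.0 - alpha, n, p0 + beta)

print(critical_threshold(100, 0.10))              # plain NPFS
print(critical_threshold(100, 0.10, beta=0.05))   # more conservative NPFS-beta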

Figure 3.6(a) shows the effect that $\beta$ has on the critical threshold $\zeta'''_{\text{crit}}$ for a fixed value of $n$ and size of the hypothesis test (for convenience we write $\zeta'''_{\text{crit}}$ as $\zeta$). For this experiment $n = 100$ and $\alpha = 0.01$. As desired, increasing $\beta$ increases the critical threshold, making it more difficult to select a feature as being relevant, which leads to a more conservative test for selecting features. Blindly increasing $\beta$ to be arbitrarily large (e.g., $\beta = 0.3$) more than doubles the number of times a feature needs to be detected as relevant by $\mathcal{A}$ to be considered relevant by NPFS-β. However, why not tune the value of $\alpha$, which can also be used to control the size/conservativeness of the test? Figure 3.6(b) shows $\alpha$ varying from 0.001 to 0.2. This figure shows that we have more control over how conservative the test is by changing $\beta$ than $\alpha$. Furthermore, $\beta$, like $\alpha$, has an intuitive sense of control over the conservativeness of the test. Also, $\alpha$, as we would expect, has the most control over $\zeta$ when $\alpha$ is small. Unfortunately, this makes it more difficult to control the conservativeness of the test, because such small changes in $\alpha$ correspond to exponential changes in $\zeta$.

[Figure 3.6 panels: (a) ζ versus β for NPFS and NPFS-β with p0 in {0.05, 0.1, 0.2}; (b) ζ versus α.]

Figure 3.6: (left) Increasing the value of β increases the critical threshold ζ for several different values of p0 (n = 100 and α = 0.01 for this experiment). (right) Effect that α has on the critical threshold ζ.

Figure 3.7 examines the effects of β and n on NPFS when evaluated on the synthetic data set described above. In this experiment $\beta \in [0, \frac{1}{5}]$, $n \in [10, 500]$, $M = 1{,}000$, $K = 50$, and MIM is the base feature subset selection method $\mathcal{A}$ [51]. The Jaccard index is our performance figure of merit. Figures 3.7(a), 3.7(b), 3.7(c), and 3.7(d) fix the relevant feature set to $\{1, \ldots, 10\}$, and MIM selects 3, 5, 15, and 25 features, respectively. Note that in the cases of $k \in \{3, 5\}$, MIM selects fewer than the optimal number of features (i.e., $k^* = 10$), and for $k \in \{15, 25\}$, MIM selects more than the optimal number of features. NPFS works better with β = 0 when $k < k^*$, which goes along with our intuition about how β affects NPFS: β increases the critical threshold for the hypothesis test, which makes it more difficult for a feature to be detected as relevant; hence, the lower Jaccard indices for larger values of β in the setting where $k < k^*$. When $k > k^*$, we observe that β improves the Jaccard index for NPFS. Similar observations can be made for the remaining subplots in Figure 3.7, which show the same experiment for $k^* = 15$. Clearly, β can have a very large impact on NPFS even on this synthetic data set; how, then, should we proceed to choose β?

[Figure 3.7 panels: (a) k* = 10, k = 3; (b) k* = 10, k = 5; (c) k* = 10, k = 15; (d) k* = 10, k = 25; (e) k* = 15, k = 3; (f) k* = 15, k = 5; (g) k* = 15, k = 15; (h) k* = 15, k = 25; each panel plots the bias β versus the number of bootstraps.]

Figure 3.7: Jaccard index of NPFS measured against the set $\{1, \ldots, k^*\}$ on a synthetic uniform data set. The number of bootstraps is varied from 10 to 500 and the bias parameter is varied from 0 to 1/5.

In most settings, empirical results suggest using a smaller value of $k$ with a small $\beta < \frac{k}{K}$, and then increasing the number of bootstraps. This observation can be made in Figures 3.7(a), 3.7(b), 3.7(e), 3.7(f), and 3.7(g). These experiments demonstrate that β must be chosen very carefully and not naïvely, similar to choosing a learning rate (see Duda et al.'s discussion on learning rates and overshooting [3]).

3.3.6 Convergence Checking for $\hat{\mathbf{p}}_t$

The number of bootstraps, or random calls to a collection of databases, is a free parameter of NPFS. Therefore, in this section we examine how implementing early

stopping with NPFS can avoid unnecessary computation. Let $\hat{\mathbf{p}}_t \in [0, 1]^K$ be a vector containing the ratio of the number of times each feature was detected as relevant to the number of trials that have been performed up to time $t$. A straightforward stopping criterion is to define a bound on the average difference between two consecutive time stamps $t$ and $t-1$, which is given by

$$\frac{1}{K} \|\hat{\mathbf{p}}_t - \hat{\mathbf{p}}_{t-1}\|_1 \le \xi \qquad (3.6)$$

where $\xi > 0$ is some small number defining the minimal change needed to continue running NPFS. While $\xi$ must be chosen somewhat carefully, we argue that this stopping criterion is a quick way to perform early stopping, and it has been used for early stopping in neural network training [42]. We evaluated the implementation of early stopping on the synthetic data set described at the beginning of the section, where $K = 100$, $k^* = 25$, $k = 10$, and we average our results over 100 randomly sampled data sets. We set the maximum number of bootstraps to $n = 1000$, which, given the properties of the Binomial random variables in $Z$, should be sufficient to reach convergence. We set $\xi \in \{0.001, 0.0005, 0.0001\}$ and evaluated NPFS using the Jaccard index.
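A sketch of the stopping rule in (3.6), assuming a running count of how many times each feature has been detected as relevant; the variable names are illustrative. A typical loop would update the counts after each bootstrap, call this check, and terminate once it returns true.

import numpy as np

def should_stop(counts, t, prev_p, xi=5e-4):
    # Early stopping per (3.6): stop when the average change in p_hat between
    # consecutive bootstraps is at most xi. Returns (stop_flag, p_hat_t).
    p_t = counts / t                               # selection frequency after t trials
    delta = np.abs(p_t - prev_p).sum() / counts.shape[0]
    return (t > 1 and delta <= xi), p_t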

[Figure 3.8: stability (Jaccard index) versus the number of bootstraps.]

Figure 3.8: Jaccard index for NPFS as a function of the number of bootstraps with early stopping.

Figure 3.8 shows the Jaccard index for NPFS as a function of the number of bootstraps. We find that, if early stopping is applied, the average stopping times for the three different thresholds are given by $T_{\xi=0.001} = 81.56$, $T_{\xi=0.0005} = 156.06$, and $T_{\xi=0.0001} = 758.12$. As in other areas that use early stopping, there are generally small decreases in performance; however, the advantage is that the amount of time the algorithm takes to run is significantly lower with early stopping. NPFS was observed to converge in the Jaccard and Lustgarten indices rather quickly, and this can be explained by the quick convergence of $\frac{1}{K}\|\hat{\mathbf{p}}_t - \hat{\mathbf{p}}_{t-1}\|_1$ to zero. Figure 3.9 shows $\frac{1}{K}\|\hat{\mathbf{p}}_t - \hat{\mathbf{p}}_{t-1}\|_1$ for $\xi = 0.0005$ applied to varying levels of dimensionality. For all selections of $K$, NPFS stops running before the maximum number of bootstraps is reached when early stopping is implemented.

[Figure 3.9: average error versus iteration (0 to 100) for K in {100, 350, 600, 850, 1000}.]

Figure 3.9: The early stopping criterion, $\frac{1}{K}\|\hat{\mathbf{p}}_t - \hat{\mathbf{p}}_{t-1}\|_1$, for different size feature sets.

[Figure 3.10 panels: (a) NPFS vs. lasso, runtime versus data processed (MB); (b) NPFS on big data, runtime versus data processed (GB).]

Figure 3.10: (left) Runtime of NPFS versus lasso on a large synthetic data set. (right) NPFS evaluated on a very large data set. The NPFS times can be interpreted as the amount of time it takes to complete a data set of size X GB.

3.3.7 Big Data Trials

We have also evaluated NPFS on a massive synthetic data set to prove its scalability for data that have a massive $M$ and a large $K$ (i.e., $M = 58{,}340{,}000$ and $K = 10{,}000$). The entire data set was separated into many smaller files ($\approx$ 100MB) to allow our computing nodes to load portions of the data into memory without crashing (the entire data set is nearly 100GB, which is far larger than the memory on any of our computing nodes). Lasso and NPFS were implemented in Matlab, where NPFS uses MIM as the base subset selection algorithm. NPFS takes advantage of parallel computing in its implementation. We process about 100MB of data at a time. We also increase the number of bootstraps as the larger data sets are processed to take advantage of parallelism within NPFS. Figure 3.10(a) shows the evaluation time of lasso and NPFS on the synthetic data set with increasing size. Clearly, NPFS is a far better performer in terms of evaluation time as the size of the data set increases, and the sizes of the data sets being tested are, at the moment, not too large. This figure shows that it is difficult to make lasso perform well in a reasonable amount of time with a large data set (i.e., larger than 1GB). Figure 3.10(b) shows NPFS applied on a 100GB data set; we did not include lasso because the software implementation would not run on data sets over 1GB. We observe that NPFS's runtime increases approximately linearly with the size of the data it is processing. In the previous example, lasso saw a large jump in runtime when the data set became quite large in size. Note that NPFS times could be further improved by increasing the number of parallel workers; however, due to software limitations, we must limit the number of parallel processes NPFS can use at once. We also examined the stability of NPFS in the previous experiment using the Jaccard and Lustgarten indices. Recall that the Lustgarten index is a variation of Kuncheva's consistency index [86]; however, Lustgarten's index does not require that $|A| = |B| = k$, which is assumed by Kuncheva.


[Figure 3.11: stability (Jaccard and Lustgarten indices) versus data processed (GB); an inset shows the indices over the first 2GB.]

Figure 3.11: Stability of NPFS on a massive amount of data using the Jaccard and Lustgarten indices as figures of merit. The inner plot shows the Jaccard and Lustgarten indices when processed over 2GB of data.

Figure 3.11 shows the Jaccard and Lustgarten indices of NPFS as data are processed. This figure suggests that NPFS can detect the optimal feature set after only a very small portion of the data set has been processed.

3.4 Summary

In this chapter, we presented a wrapper methodology for validating the selection of $k$ given a feature selection algorithm using the Neyman-Pearson hypothesis test, uniformly the most powerful hypothesis test. There are no assumptions made about the distribution of the data that the base feature selection algorithm would not already be making. The approach is easily integrated with existing feature selection methods, and can be used as a post-hoc test to determine the size of the relevant set given an initial starting point at $k$. We demonstrated, on synthetic data sets, that NPFS is capable of identifying the correct number of relevant features even when the base feature selection method does not select $k^*$ features on each bootstrap, and that NPFS works well in practice on UCI data sets. We also examined biasing NPFS to reduce false positives and a method of early stopping. The primary observations are:

• Early stopping can be quite effective at reducing the number of rounds that NPFS requires; however, the selection of ξ can degrade the performance measured by the stability indices. Furthermore, we lose some parallelism by needing to check for convergence of $\|\hat{\mathbf{p}}_t - \hat{\mathbf{p}}_{t-1}\|_1$.

• Biasing the hypothesis test in NPFS by some amount β can reduce the false positives when k was chosen larger than k*. The best suggestion for choosing k and β is to set k small with a small value of β; increasing the number of bootstraps then tended to detect k*.

• NPFS can easily be scaled to massive data sets, which may not have been possible to process with traditional feature subset selection algorithms.

Furthermore, we demonstrated NPFS's feasibility on massive data sets (100GB+) and presented how it fits into the MapReduce computational model.

4. Sequential Learning for Subset Selection

4.1 Introduction

The previous chapter presented a simple, yet effective, approach for inferring $k^*$ with any generic subset selection algorithm ($\mathcal{A}$) that returns $k$ of $K$ features. NPFS uses $\mathcal{A}$ to evaluate all of the feature variables at each iteration, and this evaluation could be too computationally intensive. If at all possible, we should try to move away from processing all of the features in the data at once. Therefore, in this chapter we propose a family of algorithms, inspired by the bandit literature, that process subsets of the overall feature set at any iteration; that is, we present a new approach to performing subset selection using multi-arm bandits.

4.2 Forming Feature Selection as a Bandit Problem

The problem of selecting $k$ of $K$ features using a generic feature selection algorithm $\mathcal{A}$ can be computationally very difficult, particularly if the software implementation of $\mathcal{A}$ has strict limitations on memory usage, time requirements, or the cost of running the algorithm (e.g., RAM-hours). Similar to NPFS, our proposed approach, sequential learning for subset selection (SLSS), uses a base subset selection algorithm $\mathcal{A}$ to select smaller subsets of features and evaluate each set for important features. However, how should SLSS select variable subsets as the sequential search continues? We alluded to the use of bandits to solve this problem in Chapter 2. We now continue this discussion by introducing a simple bandit algorithm for feature subset selection. As discussed earlier, the MAB addresses the problem of EvE when a player needs to select a decision (from one of many) that maximizes the reward of the player over time.

Input: data set $\mathcal{D}$, $T$ rounds, select $k$, subset size $\ell$, weak reward $\eta \in [0, 1]$, and set weights $w_j^1 = 0$ for $j \in [K]$.
Initialize: $T_j, r_j = 0$ for $j \in [K]$
Exploration: Explore all features using subsets of size $\ell$ until all features have been tested (i.e., $T_j > 0$).
1: for $t = 1, \ldots, T$ do
2:   Choose $\ell$ indices that maximize
       $$\hat{w}_j^t = w_j^t + \sqrt{\frac{2 \log t}{T_j}} \qquad (4.1)$$
     Refer to this set of indices as $\mathcal{K}_t$.
3:   $\mathcal{I}_t = \mathcal{A}(\mathcal{D}(\mathcal{K}_t), k)$
4:   $T_j \leftarrow T_j + 1$ for $j \in \mathcal{K}_t$
5:   $r_j \leftarrow r_j + 1$ for $j \in \mathcal{I}_t$
6:   $r_{j'} \leftarrow r_{j'} + \eta$ for $j' \in \mathcal{K}_t \setminus \mathcal{I}_t$
7:   Update weights $w_j^{t+1}$ for $j \in \mathcal{K}_t$
8: end for

Figure 4.1: Pseudo code for SLSS-UCB1
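The following Python sketch mirrors Figure 4.1. Because the weight update in step 7 is left generic in the pseudo code, the sketch assumes the weight is the mean observed reward; the base selector select is a placeholder that returns local column indices, and ℓ ≥ k is assumed. This is an illustrative sketch, not the thesis implementation.

import numpy as np

def slss_ucb1(select, data, labels, k, ell, T, eta=0.1):
    # Sequential subset selection with a UCB1-style index (Figure 4.1).
    # Assumption: w_j is taken to be the mean observed reward of feature j.
    M, K = data.shape
    w = np.zeros(K)                    # reward weights w_j
    counts = np.zeros(K)               # T_j: times feature j was tested
    rewards = np.zeros(K)              # r_j: cumulative reward

    def test_block(block):
        chosen = block[select(data[:, block], labels, k)]   # I_t = A(D(K_t), k)
        counts[block] += 1
        rewards[block] += eta          # weak reward for every tested feature
        rewards[chosen] += 1.0 - eta   # selected features end up with reward 1
        w[block] = rewards[block] / counts[block]

    # exploration: test every feature at least once, ell at a time
    for start in range(0, K, ell):
        test_block(np.arange(start, min(start + ell, K)))

    for t in range(1, T + 1):
        index = w + np.sqrt(2.0 * np.log(t + 1) / counts)   # upper confidence (4.1)
        test_block(np.argsort(index)[-ell:])                # ell largest indices
    return w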

In this chapter we present the use of MABs with filter feature selection algorithms that do not evaluate the entire feature space on each learning round.

4.3 Sequential Learning for Subset Selection

4.3.1 SLSS Description

There are three primary implementations of the SLSS family of feature selection algorithms, which are covered in detail in this section. Figure 4.1 shows the pseudo code for SLSS-UCB1. As indicated by its name, SLSS-UCB1 uses the UCB1 confidence term to select features to test (see [1]). The key difference between UCB1 and SLSS is that UCB1 traditionally selects only one arm, whereas in our setting multiple arms (i.e., features) are evaluated in each learning round. SLSS-UCB1 holds a set of weights, $w_j^t$, over the features $j \in [K]$ at learning round $t$. The variables $T_j$ and $r_j$ represent the total number of times feature $j$ was evaluated (i.e., tested) and its cumulative reward, respectively. Both of these variables are initialized to zero. Then, for $T$ rounds, SLSS-UCB1 begins by sorting the weights in (4.1) and adding the features with the $\ell$ largest values to a set $\mathcal{K}_t$. Equation (4.1) is the upper confidence term from UCB1 and it controls the EvE trade-off. Next, the base algorithm $\mathcal{A}$ performs feature selection on the subset of features in $\mathcal{K}_t$ (i.e., feature selection is not performed using all $K$ features). The features $\mathcal{A}$ selects from $\mathcal{K}_t$ are denoted by $\mathcal{I}_t$ (i.e., $\mathcal{I}_t \subset \mathcal{K}_t$). A Bernoulli-style reward is assigned to the features in $\mathcal{K}_t$, such that the features that $\mathcal{A}$ selected as important receive reward 1, and those features not selected receive reward $0 \le \eta < 1$ at time $t$. All features tested (i.e., those in $\mathcal{K}_t$) have their $T_j$ incremented. The features in $\mathcal{X} \setminus \mathcal{K}_t$ do not have $r_j$, $T_j$, or $w_j^{t+1}$ updated. A weak reward is provided to the features tested but not selected; thus features that were not selected by $\mathcal{A}$ are provided with some positive feedback just for being considered in the set $\mathcal{K}_t$.

SLSS-UCB1 uses a stochastic bandit for choosing the features to test in $\mathcal{K}_t$, which assumes that the rewards on each of the arms are sampled from a fixed probability distribution. However, since the reward of a feature could depend on the other features in $\mathcal{K}_t$, the assumptions of the stochastic bandit could be too strict. Therefore, we also evaluate an adversarial bandit, which does not assume the reward distribution remains fixed. Figure 4.2 shows the pseudo code for SLSS-Exp3. The adversarial MAB uses the Exp3-style selection rule [131], which chooses the $\ell$ largest features from:

$$\hat{p}_j^t = (1-\gamma)\,\frac{w_j^t}{\sum_{i=1}^{K} w_i^t} + \frac{\gamma}{K}$$

where γ ∈ [0, 1] is an egalitarian factor that interpolates between a uniform

Input: data set D, T rounds, select k, subset size ℓ, weak reward η ∈ [0, 1], Exp3 parameter γ ∈ [0, 1], and set weights w_j^1 = 1/K for j ∈ [K].
Exploration: Explore all features using subsets of size ℓ until all features have been tested.
1: for t = 1, ..., T do
2:   Choose the ℓ indices that maximize

         $\hat{p}_j^t = (1-\gamma)\,\frac{w_j^t}{\sum_{i=1}^{K} w_i^t} + \frac{\gamma}{K}$          (4.2)

     Refer to this set of indices as K_t.
3:   I_t = A(D(K_t), k)
4:   Set:

         $w_j^{t+1} = w_j^t \exp\!\left(\frac{\gamma\, r_j^t}{\ell}\right)$          (4.3)

5: end for

Figure 4.2: Pseudo code for the SLSS-Exp3
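The selection rule (4.2) and the exponential weight update (4.3) translate into a few lines of Python, sketched below. As before, base_select stands in for the base algorithm A and is assumed to return indices relative to the columns it receives; the Bernoulli-style reward with weak reward η mirrors the SLSS-UCB1 description.

import numpy as np

def slss_exp3_round(w, X, y, base_select, ell, k, gamma, eta=0.1):
    """One SLSS-Exp3 round (sketch): select via eq. (4.2), reward, update via eq. (4.3)."""
    K = len(w)
    p_hat = (1.0 - gamma) * w / w.sum() + gamma / K     # eq. (4.2)
    K_t = np.argsort(p_hat)[-ell:]                      # the ell largest probabilities
    I_t = K_t[base_select(X[:, K_t], y, k)]             # features chosen by A

    r = np.zeros(K)
    r[K_t] = eta                                        # weak reward for tested features
    r[I_t] = 1.0                                        # full reward for selected features
    w_new = w.copy()
    w_new[K_t] *= np.exp(gamma * r[K_t] / ell)          # eq. (4.3)
    return w_new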

Input: data set D, T rounds, select k, subset size ℓ, set successes/failures s_j, f_j = 0 for j ∈ [K], and choose α & β for a Beta distribution.
Exploration: Explore all features using subsets of size ℓ until all features have been tested.
1: for t = 1, ..., T do
2:   Set

         $\theta_j \sim \mathrm{Beta}(s_j + \alpha,\; f_j + \beta)$          (4.4)

     Choose the ℓ indices with the largest values of θ_j as K_t.
3:   I_t = A(D(K_t), k)
4:   s_j ← s_j + 1 for j ∈ I_t
5:   f_{j'} ← f_{j'} + 1 for j' ∈ K_t \ I_t
6: end for

Figure 4.3: Pseudo code for the SLSS-Thompson

distribution and the probability distribution formed by the weights. The traditional Exp3 algorithm would only consider the feature with the largest value, not the ℓ largest as SLSS-Exp3 does. This term tunes Exp3's preference for either choosing features uniformly at random or using the weight distribution. The updated weights then become

$w_j^{t+1} = w_j^t \exp\left(\frac{\gamma\, r_j^t}{\ell}\right)$ for the indices j ∈ K_t, where r_j^t is the reward assigned to the jth feature evaluated at time t.

Chapelle and Li demonstrated that Thompson sampling, an old heuristic for balancing exploration & exploitation, can be highly competitive compared to other bandit approaches [132]. Therefore, we also implemented an SLSS implementation of Thompson sampling for feature subset selection, whose pseudo code can be found in Figure 4.3. The implementation is similar to SLSS-UCB1 and SLSS-Exp3;

however, the set K_t is determined by sampling from a Beta distribution with parameters α and β. The rewards are assigned as a success (s_j) or a failure (f_{j'}), which can be viewed as a Bernoulli random variable. As Chapelle and Li point out, Thompson sampling configures different features to use different Beta distributions as their priors (i.e., the priors are controlled by the successes and failures of a feature being selected).

Finally, we also implemented an ε-greedy search, which we refer to as SLSS-ε. This

implementation selects a random set of indices K_t ⊂ {1, ..., K} with probability ε, and exploits the ℓ maximizing weights with probability 1 − ε. We can reduce the level of unnecessary exploration by implementing annealing, which sets ε_t = κ^t ε for κ ∈ (0, 1).
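As a concrete illustration, here is a small Python sketch of the two selection rules just described: Thompson sampling draws θ_j from the per-feature Beta posteriors (4.4), and the annealed ε-greedy rule explores a random subset with probability ε_t = κ^t ε. The function names and default values are illustrative only.

import numpy as np

def thompson_choose(s, f, ell, alpha=1.0, beta=1.0, rng=np.random):
    """Draw theta_j ~ Beta(s_j + alpha, f_j + beta) for every feature (eq. 4.4)
    and return the ell indices with the largest draws (the set K_t)."""
    theta = rng.beta(s + alpha, f + beta)
    return np.argsort(theta)[-ell:]

def epsilon_greedy_choose(w, ell, t, eps=0.1, kappa=0.99, rng=np.random):
    """SLSS-epsilon selection with annealing eps_t = kappa**t * eps: explore a random
    subset with probability eps_t, otherwise exploit the ell largest weights."""
    eps_t = (kappa ** t) * eps
    if rng.rand() < eps_t:
        return rng.choice(len(w), size=ell, replace=False)
    return np.argsort(w)[-ell:]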

Input: SLSS feature weights w^T, where w^T(j) = w_j^T, and number of simulations M.
1: for m = 1, ..., M do
2:   Choose I_m to be a bootstrap sample, drawn with replacement from the indices [K], using w^T
3:   k_m^* = Unique(I_m)
4: end for
Output: Size of the feature importance set

         $\hat{k}^* = \left\lceil \frac{1}{M}\sum_{m=1}^{M} k_m^* \right\rceil,$          (4.5)

or the floor operator can be used in place of the ceiling operator.

Figure 4.4: Feature importance detection algorithm for SLSS.

4.3.2 Selecting Features with SLSS

The SLSS family of feature subset selection approaches output a set of reward

weights over the features, w^T, where w^T(j) = w_j^T;¹ however, we still need to convert a set of weights into the selection of a feature subset. To produce a feature set, we propose the feature sampling algorithm in Figure 4.4. The algorithm begins by normalizing the weights, w^T, such that they form a distribution. Then a bootstrap sample of size K is drawn with replacement from the set {1, ..., K}, where each feature is weighted using the normalized weights, which represent the rewards given by the MAB algorithm.

Hence, higher rewards yield a higher probability of being sampled. k_m^* is the number of unique indices in the bootstrap sample of indices, where m denotes the mth sample drawn. The relevant feature set size is given by (4.5), which is the average unique set size over the bootstrap samples. The important features are then chosen as the k̂^* indices with the largest reward.
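A minimal Python sketch of the Figure 4.4 procedure is shown below; w is the learned SLSS reward vector, the function name is illustrative, and the function returns both the estimated set size and the selected indices.

import numpy as np

def detect_feature_set(w, M=1000, rng=np.random):
    """Sketch of the Figure 4.4 sampling procedure: normalize the rewards, draw M
    bootstrap samples of feature indices, and average the unique set sizes (eq. 4.5)."""
    K = len(w)
    p = w / w.sum()                                  # rewards as a sampling distribution
    k_star = np.empty(M)
    for m in range(M):
        sample = rng.choice(K, size=K, replace=True, p=p)
        k_star[m] = np.unique(sample).size
    k_hat = int(np.ceil(k_star.mean()))              # eq. (4.5); the floor also works
    selected = np.argsort(w)[-k_hat:]                # the k_hat largest rewards
    return k_hat, selected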

¹Note that T here refers to the number of learning rounds and is not a transposition (which is denoted with a different symbol).

We also implemented an alternate procedure to that described in Figure 4.4 to increase the robustness of the feature selection phase of SLSS by seeking to decrease the number of false positives. The weights are preprocessed by reducing the weight

of features that fall below w̄^T + σ̂_{w^T}, where σ̂_{w^T} is the sample standard deviation of the weights. Then the sampling procedure continues as shown in the algorithm in Figure 4.4.

To demonstrate the capabilities and effectiveness of the SLSS feature selection algorithm, we provide two simulations of rewards from SLSS and show how Figure 4.4 can be used to detect important features. A random 50-dimensional vector is sampled from [0, 1]^50, and a logistic sigmoid is applied to give some features inherently more relevance than others. We then provide this vector to our feature importance sampling algorithm in Figure 4.4. Figures 4.5(a) and 4.5(c) show the rewards for the two simulations with the detection approach with and without the weight reduction, respectively. The features that are detected as relevant by SLSS are colored in red, while irrelevant features are colored in blue in Figures 4.5(a) and 4.5(c). Figures 4.5(b) and 4.5(d) show the distribution of the expected set size after running the selection algorithm 1000 times. We observe in the simulation that the feature set size is relatively consistent, and that using the ceiling or floor operator in Figure 4.4 has little effect on the set size outcome. Furthermore, the variance of the expected set size is significantly reduced by performing the weight-reduction preprocessing.

4.3.3 Scaling to Large Data Sets

The implementation of SLSS, while capable of processing relatively large data sets, cannot feasibly be evaluated on data sets with a large number of observations. The reason for this is that the bootstrap data sets are the same size as the original data set, which may not be feasible due to strict memory limitations. To get around this

[Figure 4.5 panels: (a) Simulation #1 weights, (b) Simulation #1 set size, (c) Simulation #2 weights, (d) Simulation #2 set size; reward versus feature ID and the distribution of the expected set size.]

Figure 4.5: Two simulations of the feature importance detection algorithm in Figure 4.4 applied to synthetic rewards from SLSS. The bottom row implements the weight reduction preprocessing step described in Section 4.3.2.

[Figure 4.6 diagram: the data set D is subsampled into D_1, ..., D_M; SLSS with the bag-of-little-bootstraps produces per-subsample statistics ξ_1 = θ(w_1, ..., w_r), ..., ξ_M, which are reduced to a single feature-importance statistic ξ* = θ(ξ_1, ..., ξ_M).]

Figure 4.6: SLSS implementation with the bag-of-little bootstraps (BLB), which results in a confidence interval around SLSS's estimate of feature importance.

issue of large sample sizes, one could aggressively subsample the observations, which has been suggested for estimating a bandit's reward function [77]. Aggressive subsampling is, unfortunately, very heuristic. To avoid this heuristic, Kleiner et al. presented a procedure for performing a bootstrap with a very large data set [133] (a.k.a. the big data bootstrap), which is supported by sound theory and statistical correctness. Figure 4.6 shows an implementation of SLSS with the bag-of-little bootstraps (BLB), as described below. A large data set D is subsampled from the observations with replacement to form M randomly sampled data subsets. SLSS is called r times, using bootstrap samples, on each of the subsamples from D. The results from SLSS applied to each of these r bootstraps are reduced to a statistic ξ_v for v ∈ [M]. The statistic ξ_v contains a point estimate of the reward for each feature from SLSS, along with a set of error bars to denote the confidence in the prediction. The M

statistics ξ_v are reduced to a single statistic ξ^*, which has a point estimate and a set of final error bars. It is important to note that this process is extremely parallelizable.
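The following Python sketch illustrates the subsample-and-aggregate pattern just described; run_slss is a placeholder for any of the SLSS variants and is assumed to return a per-feature reward vector. Note that a faithful BLB implementation would represent the size-n bootstraps by multinomial weights rather than materializing them; this sketch materializes them only for brevity.

import numpy as np

def slss_blb(X, y, run_slss, M=10, r=5, subsample_frac=0.1, rng=np.random):
    """Sketch of SLSS with the bag-of-little bootstraps: run SLSS on r bootstraps of
    each of M small subsamples of D, then reduce the M statistics into xi_star."""
    n, K = X.shape
    b = max(1, int(subsample_frac * n))
    xi_mean = np.zeros((M, K))
    xi_err = np.zeros((M, K))
    for v in range(M):                                    # each subsample can run in parallel
        idx = rng.choice(n, size=b, replace=False)        # small random subsample of D
        reps = np.zeros((r, K))
        for i in range(r):
            boot = rng.choice(idx, size=n, replace=True)  # bootstrap back up to size n
            reps[i] = run_slss(X[boot], y[boot])          # per-feature reward vector
        xi_mean[v], xi_err[v] = reps.mean(axis=0), reps.std(axis=0)
    return xi_mean.mean(axis=0), xi_err.mean(axis=0)      # xi_star: point estimate + error bars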

4.3.4 Theoretical Analysis

The regret of SLSS-UCB1 and SLSS-Exp3 against an optimal policy can be upper bounded, and we show that the regret is sublinear in T. To motivate the relationship between the regret of SLSS-UCB1 and the traditional UCB1 algorithm, let f(w) ∈ [0, 1] be a reward function. Assume that after running SLSS-UCB1 there are L features selected from the feature indices {1, ..., K}, where L < K. Thus there are $\binom{K}{L}$ possible feature sets that SLSS-UCB1 could choose. We can then view SLSS-UCB1 as a UCB1 bandit with $\binom{K}{L}$ arms. Given this connection, we now bound the regret of SLSS-UCB1 and SLSS-Exp3 using existing works in the literature on multi-arm bandits. The regret of SLSS-UCB1 and SLSS-Exp3 can be examined by generalizing the results on regret in [1, 134].

Theorem 4.3.1 (Regret of SLSS-UCB1) Define the optimal reward to be $f(w^{\mathrm{true}}_{\mathcal{I}^*}) = \mathbf{1}_L^{\mathsf{T}} w^{\mathrm{true}}_{\mathcal{I}^*}$, where $w^{\mathrm{true}} \in [0,1]^K$ are the optimal weights, $\mathbf{1}_L \in \{1\}^L$ is a vector of L ones, and $w^{\mathrm{true}}_{\mathcal{I}^*}$ is a vector containing only the optimal reward values in the set $\mathcal{I}^*$. Define $\Delta_i = f(w^{\mathrm{true}}_{\mathcal{I}^*}) - f(w^{\mathrm{true}}_{\mathcal{I}_i})$ to be the difference between the maximum reward and the reward of a set of features $\mathcal{I}_i$. Then the regret of SLSS-UCB1 is upper bounded by:

$$\left[\sum_{i\,:\, f(w^{\mathrm{true}}_{\mathcal{I}_i}) < f(w^{\mathrm{true}}_{\mathcal{I}^*})} \frac{8 \log(T)}{\Delta_i}\right] + \left(1 + \frac{\pi^2}{3}\right)\sum_{i=1}^{\binom{K}{L}} \Delta_i$$

Furthermore, we can examine similar bounds for SLSS-Exp3.

Theorem 4.3.2 (Regret of SLSS-Exp3) For SLSS-Exp3 run with γ ∈ [0, 1) and any stopping time T > 0, SLSS-Exp3 suffers a regret no greater than

$$(e-1)\,\gamma\, T\, \mathbf{1}_L^{\mathsf{T}} w^{\mathrm{true}}_{\mathcal{I}^*} + \frac{K' \log K'}{\gamma}$$

where $K' = \binom{K}{L}$.

Unfortunately, Theorems 4.3.1 and 4.3.2 are purely analytical interpretations since $\mathcal{I}^*$ is never actually known. Furthermore, Exp3 is not sublinear in regret, because the approach is suited to an adversarial setting, which can be important if SLSS is applied to nonstationary data streams [135].

The theoretical pseudo-regret and the empirical regret are evaluated for

synthetic data sets with a relatively small K because of the $\binom{K}{L}$ term. Figure 4.7 shows the theoretical versus empirical regret of SLSS-UCB1. The solid lines represent the theoretical upper bound on regret for different choices of K, and the dotted lines represent the empirical regret. First, the regret can be calculated for this problem because the data are synthetic, and we know the optimal solution (i.e., policy) precisely. Second, we choose to examine $\log\left(\sum_t \mathrm{regret}(t)\right)$ for reasons of presentation. It can be clearly observed that the upper bound on regret is quite loose; however, we do observe that the empirical regret follows the behavior of the theoretical regret.

4.4 Experiments

We now present an empirical analysis of the SLSS family of feature selection algorithms, and compare our approaches to other feature selection approaches. The experiments include a number of carefully designed synthetic data sets, as well as real-world problems. The synthetic data sets are particularly useful when we want to understand the behavior of a feature selection approach under very specific conditions and characteristics of a data set. Such data allow us to demonstrate the strengths and weaknesses of an approach, which cannot be done as easily on real-world data. We also evaluate the SLSS approach with and without the BLB to show the importance of parallelization with SLSS on large data sets. Furthermore, we have evaluated SLSS and other widely used feature subset selection approaches on several


Figure 4.7: Evaluation of the theoretical and empirical regret for SLSS-UCB1. The simulation is run for different levels of K, with k^* = 10, T = 500, and L = 12. Note that simulations for K > 25 become increasingly difficult to run.

real-world data sets, some of which are commonly used as benchmarks and others that are collected from the domain of microbial ecology.

4.4.1 Description of Synthetic & Real World Data

For synthetic data, M observations are generated with features that are independently and identically distributed (iid) uniform random variables on the interval [0, 10]. Each feature vector x_m, m ∈ [M], has K features. The true labeling function, unknown to any algorithm, is given by

$$y_m = \begin{cases} 1, & \sum_{j=1}^{K} \alpha_j x_m(j) \le \nu \\ 0, & \text{otherwise} \end{cases} \qquad (4.6)$$

where α_j, ν ∈ ℝ for all j. Given this formulation of y_m, we may use the weights α_j to control the importance level of a feature. For example, if α_j = 1 for j ∈ [k^*] and all other values of α_i equal zero, then only the first k^* features carry information for determining the label y_m of a feature vector x_m, and the remaining K − k^* features are irrelevant. Our goal is to identify those features (indices i ∈ [k^*]) that are relevant to the classification problem. In this setting, if ν = 5k^*, then the threshold sets the prior probability of each class to approximately 1/2 for a random sample of size m. This can be determined by observing that the threshold is the sum of k^* iid random variables with mean 5; thus, by linearity of expectation, the mean of the sum is 5k^*. Furthermore, the synthetic data allow us to choose α_j however we want. We also experiment with the coefficients α_j forming a logistic sigmoid as shown in Figure 2.1.

Most of the real-world data are collected from the UCI machine learning repository [128] and the original mRMR manuscript [52]. Table 4.1 contains properties of the UCI data sets and the data sets listed below. Furthermore, we also evaluate data collected from the American Gut Project (AGP)², which studies the human microbiome. The AGP data are represented by the abundances of thousands of bacterial taxa (i.e., species) detected in fecal samples collected from individuals. The data are collected from 1900+ people of different genders, diet types, and age groups. We extracted 24k+ taxa abundances as the features, and the diet type is the label for each sample. The diets are categorized as either omnivore or vegetarian. Another microbial study of the gut microbiome was collected by Caporaso et al. [101] and used in our evaluation. In this data set, there are 15k+ taxa, and the gender of the sample is used as the class label. The data for both Caporaso's study and the AGP study were preprocessed by removing low-abundance taxa and binning the features to allow us to compute information-theoretic quantities for the base-subset selection algorithm used by SLSS.
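As a sanity check on the synthetic data description above, a short Python sketch of the generator follows; the settings α_j = 1 for the first k^* features and ν = 5k^* follow the text, while the function name is illustrative.

import numpy as np

def make_synthetic(M, K, k_star, rng=np.random):
    """Synthetic data of Section 4.4.1: iid Uniform[0, 10] features and labels from
    eq. (4.6) with alpha_j = 1 for the first k_star features and nu = 5 * k_star,
    giving each class a prior probability of roughly 1/2."""
    X = rng.uniform(0.0, 10.0, size=(M, K))
    alpha = np.zeros(K)
    alpha[:k_star] = 1.0                     # only the first k_star features are relevant
    nu = 5.0 * k_star
    y = (X @ alpha <= nu).astype(int)        # y_m = 1 if sum_j alpha_j x_m(j) <= nu
    return X, y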

²http://americangut.org/

Table 4.1: Properties of data sets used in our experiments.

data set            instances   features
Caporaso (Cap)      467         1193
American Gut (AG)   2905        3110
Colon               62          2000
Lung                73          325
NCI9                60          9712
Leuk                72          7070
Sido0               12678       4932

4.4.2 Algorithms for Comparison & Figures of Merit

We use several state-of-the-art algorithms to compare their ability to learn from moderate to very high dimensional data, and we also evaluate data sets with a very large cardinality. We have considered two standard information-theoretic subset selection approaches, and a more recent approach to address data sets with a large number of observations. We evaluate the following feature subset selection approaches:

• Base: no feature selection is performed, and a classifier is trained using all features.

• minimum Redundancy Maximum Relevancy (mRMR): Information-theoretic approach that maximizes the mutual information function, I(X_j; Y), with a scaled penalization on redundancy, I(X_j; X_k) for X_k ∈ F [52].

• Mutual Information Maximization (MIM): Information-theoretic approach that maximizes the mutual information function, I(X_j; Y) [51].

• Neyman-Pearson Feature Selection (NPFS): Similar to SLSS, NPFS uses a base-subset selection approach then applies a hypothesis test to detect the relevant feature set size [15]. However, NPFS needs to evaluate the entire feature space, whereas SLSS evaluates subspaces.

• SLSS-{ε, UCB1, Thompson}: The parameters ℓ = ⌊K/8⌋, k = ⌊K/16⌋, ε = 1/10, and η = 1/10 are fixed for all data sets, and the base scoring algorithm (A) is MIM. The ε-greedy search is annealed.

While there are several parameters in SLSS, we argue that many of them are straightforward to choose and have little effect for a large T. We use a distance-weighted 5-NN classifier with the Euclidean distance, and CART. All experiments are evaluated using 5-fold cross validation. Our implementations of mRMR and MIM are in the FEAST feature selection toolbox [11]. The 0-1 classification error and the f1-measure, which is given by

$$f_1 = \frac{2 \times \text{precision} \times \text{recall}}{\text{precision} + \text{recall}}$$

are used as figures of merit for all classification tasks [89].

The Lustgarten index is used to measure the stability of a feature selection algorithm [87]; it was designed to extend Kuncheva's stability index [86]. For two sets, A and B, each of size no greater than K, the Lustgarten index is given by

$$L(A, B) = \frac{|A \cap B| - \frac{|A|\cdot|B|}{K}}{\min(|A|, |B|) - \max(0, |A| + |B| - K)}$$

which lies in the range ±1. For the purposes of our experiments, A is the optimal feature set and B is the feature set that SLSS selects. Ideally, we would want L(A, B) = 1. This measure can only be applied to the synthetic data, since the optimal feature set is unknown for real-world data sets. The Jaccard index was not evaluated because of how rapidly it drops off and becomes ineffective at coping with a reasonable number of false positives [85].
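For reference, the Lustgarten index above translates directly into a few lines of Python; the helper name is illustrative.

def lustgarten_index(A, B, K):
    """Lustgarten stability index between two feature sets A and B selected out of
    K total features; the value lies in [-1, 1] and equals 1 for identical sets."""
    A, B = set(A), set(B)
    expected_overlap = len(A) * len(B) / K            # overlap expected by chance
    numerator = len(A & B) - expected_overlap
    denominator = min(len(A), len(B)) - max(0, len(A) + len(B) - K)
    return numerator / denominator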

4.4.3 Examination of Regret

In our first experiment, we evaluate the regret of SLSS, and its variants, on the synthetic data described above. Note that other approaches such as NPFS and mRMR cannot be evaluated for regret for several reasons: (i) SLSS's regret can be evaluated as a function of T whereas other approaches do not work in learning rounds, and (ii) some of the state-of-the-art approaches require that k be fixed in their final selection of features and, hence, cannot adapt to an optimal size. We use synthetic data because

we can compute the reward from the optimal policy, and we set α_j = 1 for j ∈ [k^*] and all other weights to zero. The regret on the Tth round is measured by

$$\mathrm{regret}(T) = \mathbf{1}_L^{\mathsf{T}} w^{\mathrm{true}}_{1:L} - \mathbf{1}_L^{\mathsf{T}} w^{\mathrm{true}}_{\mathcal{I}^*}$$

where $w^{\mathrm{true}}_{1:L}$ is an L-dimensional vector of the optimal rewards, $w^{\mathrm{true}}_{\mathcal{I}^*}$ is an L-dimensional vector of the rewards of the features chosen by SLSS, and $\mathbf{1}_L$ is an L-dimensional vector of ones. The value of L is not a parameter of SLSS; rather, it is only used to compute the regret.
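In code, this regret measure amounts to comparing the true rewards of the L best features against the true rewards of the L features that SLSS currently ranks highest; the following sketch assumes w_true holds the optimal per-feature rewards and w_slss the learned SLSS weights.

import numpy as np

def slss_regret(w_true, w_slss, L):
    """Per-round regret: sum of the L largest true rewards minus the sum of the true
    rewards of the L features ranked highest by the SLSS weight vector."""
    optimal = np.sort(w_true)[-L:].sum()               # 1_L^T w_true_{1:L}
    chosen = np.argsort(w_slss)[-L:]                   # indices SLSS ranks highest
    achieved = w_true[chosen].sum()                    # 1_L^T w_true_{I*}
    return optimal - achieved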

For this experiment, we set T = 5000, k = 5, ℓ = 10, η = 1/10, L = 25, and K = 100, and we evaluate the cumulative regret. The cumulative regret for four values of k^* is shown in Figure 4.8. Before examining the regret, observe that as k^* increases, the optimal feature set becomes less sparse (i.e., more features are relevant). We examine settings where L ≤ k^* and where L > k^*. We observe a clear sub-linear cumulative regret in Figures 4.8(a) and 4.8(b), which occurs when k^*/K is small. We have omitted the results of SLSS-Exp3 because it is known from Theorem 4.3.2 that the cumulative regret will not be sub-linear because of the T term. We observe that for both SLSS-UCB1 and SLSS-Thompson the

[Figure 4.8 panels: (a) k^* = 10, (b) k^* = 15, (c) k^* = 25, (d) k^* = 50; cumulative regret versus learning round for the greedy, UCB1, and Thompson variants.]

Figure 4.8: Cumulative regret of SLSS-ε, SLSS-UCB1, and SLSS-Thompson evaluated on a synthetic data set with T = 5000, k = ⌊K/16⌋, L = 25, ℓ = ⌊K/8⌋, η = 1/10, and K = 100. We choose the optimal reward to be the sum of the top 25 feature weights. We observe that when k^*/K is small, the per-round regret approaches zero quickly. The shaded region represents a 95% confidence interval around the mean cumulative regret.


Figure 4.9: Extension of the average cumulative regret experiment from Figure 4.8 for SLSS-{ε, UCB1, Thompson} evaluated over varying k^* with T = 1000. The solid lines represent L = 25 and the dashed lines represent L = 40.

feature importance vector is approaching the optimal reward vector quite quickly. Furthermore, the larger regret for SLSS-ε can be explained by the search policy randomly searching the feature space with probability ε, though one way to improve SLSS-ε would be to anneal the exploration rate. However, for situations where the optimal feature set does not have a small k^*/K, i.e., where most of the features are relevant, the sub-linear behavior of SLSS is more difficult to show even after a large T (see Figures 4.8(c) and 4.8(d)). Having observed this, we know that changing k and ℓ can further improve the regret if they are chosen with knowledge of k^*; however, such knowledge is commonly unavailable. These results suggest that SLSS works best in situations where the optimal feature subset is quite small, which is not a very limiting assumption. Furthermore, we observe that SLSS-ε consistently obtains a higher regret than SLSS-Thompson and SLSS-UCB1.

Another observation from the cumulative regret is that the regret generally grows with increasing k^* in Figures 4.8(a), 4.8(b), and 4.8(c); however, the cumulative regret drops off in Figure 4.8(d). This change in regret can be explained by examining the cumulative regret for a fixed T and L, which is shown in Figure 4.9. Two selections of L, 25 and 40, are represented by solid and dashed lines, respectively. There is a clear peak in the regret at k^* = L, which is semi-intuitive if we consider the experiment. Recall that all relevant features carry the same weight, so if k^* is less than L then SLSS only needs to select the k^* relevant indices, and the remaining selected indices carry no weight in the regret measurement. Furthermore, if k^* is larger than L, it becomes more likely (and easier) for SLSS to identify relevant features because it only needs to identify L of the k^* features to incur no regret. The most difficult selection happens at k^* = L, because SLSS must select its feature set to be precisely the optimal one.

(a) SLSS-Exp3   (b) SLSS-ε   (c) SLSS-Thom   (d) SLSS-UCB1

(e) SLSS-Exp3 (sig.)   (f) SLSS-ε (sig.)   (g) SLSS-Thom (sig.)   (h) SLSS-UCB1 (sig.)

Figure 4.10: Visualization of SLSS weights as a function of the learning round. The data being evaluated are from the synthetic experiment described in Section 4.4.1 with T = 500, ℓ = 25, k = 10, k^* = 50, and K = 500. The terms log(w^t) are shown.

4.4.4 Results on Synthetic Data

We now examine SLSS and its variants on different configurations of the synthetic data set discussed in section 4.4.1. We designed our experiments to seek answers to the following key questions:

1. Does w^T converge after T rounds of learning, and how does the overall level of feature importance affect the convergence?

2. How does the dimensionality affect the performance of SLSS, particularly if we make a naïve choice of ℓ and k? How does SLSS react to the scenario of an increasing dimensionality with a decreasing ratio k^*/K?

3. How stable is SLSS’s feature selection using the algorithm in Figure 4.4?

We first evaluate the convergence of w^T. To do so, we examine two methods for determining the labels of the synthetic data. In the first approach, α_j = 1 for j ∈ [k^*]

and all other values of α_i are equal to zero for i ∈ [K] \ [k^*]. Hence, features either carry importance or they do not (i.e., a given feature is either (strongly) relevant or not relevant at all, as discussed in Section 2.1.1). In the second approach, α_j is a logistic function similar to that shown in Figure 2.1, where features contain different levels of relevance. The SLSS implementations have T = 500, ℓ = 25, M = 5000, and k = 10. The data set has 500 features; however, only 50 features are relevant (i.e., K = 500 and k^* = 50).

Figure 4.10 shows the convergence of SLSS's feature weight vector over time for the first synthetic data set, where each feature is either relevant or non-relevant. For each graph, each column is a learning round and each row is a feature. The plots show log(w^t) to better highlight the differences between the SLSS variants. Let us begin by examining the top row of Figure 4.10. The key observation is that each of the SLSS implementations favors the first 50 features (see Figures 4.10(a), 4.10(b), 4.10(c), & 4.10(d)) by giving them a larger weight than any of the other features. Recall from (4.6) that the first 50 features are indeed the relevant ones. A second observation from the top row is that SLSS-{ε, UCB1, Thompson} learn the true relevant features in fewer iterations than SLSS-Exp3, since Exp3 is an adversarial bandit, unlike ε-greedy, UCB1, and Thompson sampling. The bottom row of Figure 4.10 shows the convergence of SLSS's feature weight vector over time for the synthetic data using a logistic function to determine the importance of a feature (approximately the first 200 features carry a high or moderate weight). We observe, again, that with the exception of the Exp3 bandit, the SLSS variants quickly learn the importance distribution, with SLSS-Thompson providing a very clear transition from strong to weak relevance.

We now move on to address the 2nd and 3rd questions posed in this section. To address these questions jointly, we designed a synthetic data set with k^* = 25 and

[Figure 4.11 panels: (a) Lustgarten Index and (b) SLSS set size, each plotted against the total number of features for the ucb1, exp3, greedy, and thompson variants.]

Figure 4.11: The Lustgarten index and the SLSS feature set size as a function of the number of features in a synthetic data set. Increasing the dimensionality implies that the ratio of the number of relevant features to the number of features in the data set is decreasing.

K ∈ {100, 200, ..., 2000}. In such a setting, the ratio of the optimal feature set size to the total number of features becomes increasingly smaller, which evaluates the ability of SLSS to identify the optimal feature set even if the selections of k and ℓ are inaccurate. For SLSS, we choose the heuristic k = ⌊K/16⌋ and ℓ = ⌊K/8⌋. Figure 4.11 shows the stability of SLSS (via the Lustgarten index) and the average set size determined by SLSS's sampling procedure. In Figure 4.11(a), SLSS-Exp3 has the lowest Lustgarten index of all the SLSS algorithms, which is caused by SLSS-Exp3 giving larger weights to features that are not relevant (see Figure 4.11(b)). Another important note is that SLSS almost identifies the correct feature set size using the sampling procedure, and there are only a few incorrectly selected variables. Notice that the heuristic choice of k and ℓ worked quite well until about K = 500, where only 5% of the features are relevant. SLSS-Exp3 tends to select more features than the other SLSS approaches, which explains why SLSS-Exp3 has the worst Lustgarten index. False positives are unavoidable and they generally increase with the total number of features (similar results were found in [15]), which explains the slight decrease in stability. It is important to note here that we are focusing on problems where an overwhelming number of features are irrelevant or very weakly relevant.

4.4.5 Evaluation with the Bag-of-Little Bootstraps

We now empirically evaluate SLSS with the bag-of-little-bootstraps implementation on a large synthetic data set (see Section 4.4.1, with M = 150k and K = 4000). We evaluated the proposed approach on a single cluster node with 64 cores and 256GB of memory. For the implementation without the BLB, SLSS is evaluated sequentially on chunks of data. Thus, SLSS can be used in incremental learning problems where the importance of a variable is learned over multiple batches of data. We monitor the mean-squared error (MSE) between the optimal weight vector and the weight vector produced by SLSS and SLSS-BLB, and the total evaluation time. Figure 4.12, which was motivated by an evaluation of the BLB in [133], shows the evaluation time versus the MSE for SLSS and SLSS-BLB. We find that SLSS-BLB produces a smaller error in a fraction of the time, which can be further sped up with access to more cores on a physical computing node.

4.4.6 Results on Real-World Data

Unlike the synthetic data sets, we do not know the correct number of relevant features for the real-world data. Therefore, we cannot use the Lustgarten index as we did in the previous experiments. In this section we evaluate SLSS and the state-of-the-art approaches with two different classifiers.

Table 4.2 presents the proportion of features that are selected by the algorithms. We only present the algorithms that are capable of identifying the relevant feature set size. We observe that SLSS-{UCB1, ε, Thompson} select a smaller per-


Figure 4.12: Performance, measured in MSE, of SLSS and SLSS-BLB evaluated on a synthetic data set. We only show a single point for SLSS-BLB because it can be fully parallelized and the MSE can be evaluated after all subsamples of D are processed.

centage of features than NPFS and the adversarial bandit SLSS-Exp3. The larger number of features selected by SLSS-Exp3 is not surprising given the results shown in Figure 4.11(b).

The error and f1-measure are averaged over 5-fold cross validation. The averages are ranked so that we can make comparisons among algorithms. The ranks range from (1) to (8), where (1) is the best and (8) is the worst performing algorithm. We then use the Friedman test as described in [130], which has the null hypothesis that all algorithms perform equally, to make formal statistical comparisons between classifiers over multiple data sets. Table 4.3 shows the ranks averaged over all of the data sets for 5-NN and CART (the full error and f-measure tables can be found in Tables 4.4 and 4.5, respectively). For the 5-NN, we find that the Friedman test rejects the null hypothesis that the error rates are equal; however, there is not enough statistical significance to state that there is a difference in the f1-measure. Using Table 4.3 and applying the Friedman test, we find that

Table 4.2: Average percentage of features selected by algorithms that have the ability to approximate the relevant set size.

data set   NPFS   UCB1   Thompson   ε-greedy   Exp3
Cap        11.5    6.9     6.7        6.6      38.5
AGP        17.1   11.9    11.3       10.2      30.5
Colon      16.6   11.5    10.8        9.7      17.9
Lung       19.4   14.9    13.4       12.9      21.3
NCI        19.6   13.5    13.1       12.1      20.7
Leuk       16.6   11.2    10.2        9.2      19.1

the null hypothesis is rejected for both the CART and 5-NN classifiers. Upon further investigation for the 5-NN, we find that SLSS-Thompson outperforms Base, MIM, and mRMR, and other improvements are observed with SLSS-UCB1 (mRMR) and SLSS-ε (Base, mRMR, Exp3). For CART, significant improvements are observed in the error of SLSS-UCB1 (Base, MIM, and NPFS), SLSS-Thompson (Base and NPFS), and SLSS-ε (Base). It is worth noting that the implementation of mRMR did not run due to the size of the data set. This is one of the key advantages of SLSS: it does not need to evaluate the entire feature set; rather, it processes small regions of the feature space at a time.

We now take a closer look at the results of SLSS on the data collected from the American Gut Project and the Caporaso data. Figure 4.13 shows the weights of SLSS-{ε, UCB1, Thompson}. The x-axis represents indices corresponding to particular bacteria, which have been sorted in descending order according to the UCB1 weights. We notice that the bacteria are ranked in approximately the same order. If we order the weights in increasing order (i.e., the weight attached to a particular bacterium), we find that a linear fit achieves R² = 0.9997, which demonstrates that the rankings of the bacteria are in very high agreement across the three different SLSS algorithms. Taking a closer look at the bacteria that are highly ranked, we find that the families

Table 4.3: Rank of an algorithm’s performance when averaged across the data sets shown in Table 4.1 (lower is better).

algorithm     5-NN error   5-NN f1   CART error   CART f1
Base          6.21         5.86      5.86         6.57
MIM           4.93         4.00      5.21         4.14
mRMR          6.21         4.93      3.50         4.29
NPFS          5.14         6.14      5.86         5.43
SLSS-UCB1     3.71         4.79      3.00         3.57
SLSS-Thom     2.36         2.93      3.50         4.14
SLSS-ε        3.29         3.43      3.50         2.14
SLSS-Exp3     4.14         3.93      5.57         5.71

[Table 4.4: Classification error rates on the real-world data sets (Colon, Leuk, Lung, NCI, Cap, AGP, Sido0) for Base, MIM, mRMR, NPFS, SLSS-UCB1, SLSS-Thom, SLSS-ε, and SLSS-Exp3.]

[Table 4.5: Classification F1-measures on the real-world data sets (Colon, Leuk, Lung, NCI, Cap, AGP, Sido0) for Base, MIM, mRMR, NPFS, SLSS-UCB1, SLSS-Thom, SLSS-ε, and SLSS-Exp3.]

[Figure 4.13 panels: (a) American Gut Project and (b) Caporaso et al.; SLSS weight versus bacteria index for UCB1, ε-Greedy, and Thompson, with reference marks at K/16 and 7K/8.]

Figure 4.13: SLSS learned feature importance distribution for two data sets collected from microbial ecology.

Lachnospiraceae, Ruminococcaceae, Erysipelotrichaceae and Verrucomicrobiaceae are commonly ranked with large weights, which indicates a high level of importance. These families are all known indicators of either having – or not having – a vegetarian diet [136, 137], which supports our intuition about how we should expect SLSS to behave on this well understood life science application.

4.4.7 Summary

In this work we discussed the difficulties of scaling up state-of-the-art feature subset selection algorithms to data that contain a large number of features and observations, with a particular emphasis on filter-based methods, which have largely been ignored in large-scale subset selection. Large data sets, which are common throughout fields such as the life sciences, raise both theoretical and algorithmic concerns for performing subset selection. Two benefits of filter approaches are that the scoring function can be evaluated quite quickly and that they are classifier independent. Furthermore, implementing subset selection with multiple classes is rather straightforward with filter-based feature selection approaches, whereas many embedded approaches assume binary learning problems. However, the issue of long evaluation times can arise for very large feature sets. As a solution, we proposed a family of feature subset selection algorithms, namely sequential learning for subset selection (SLSS), that only needs to consider a portion of the feature and observation space in a series of small feature selection tasks. SLSS searches regions of the feature subspace to learn the importance of variables over a fixed number of learning rounds. This strategy is motivated by the optimization literature, where it is common to break a large optimization problem into smaller optimization problems [138, 139]. Such a strategy can greatly improve the applicability of an algorithm in an environment where it is simply infeasible to evaluate the entire feature and observation space simultaneously. SLSS was evaluated in several synthetic settings to understand the dynamics of the approach when applied to problems of varying levels of difficulty. Furthermore, we also evaluated SLSS on real-world data and obtained promising insights on the data collected from microbial ecology. The results of SLSS applied to the data from microbial ecology agreed with results previously reported in the literature, which makes subset selection an attractive tool for microbial data analytics.

Furthermore, SLSS easily fits into an incremental learning framework that does not require all data to be available at the time of learning. This was demonstrated in Section 4.4.5, where an incremental learning version of SLSS was compared to SLSS-BLB. As expected, the incremental approach does not yield a smaller MSE, as batch algorithms tend to perform slightly better than incremental or online algorithms [12].

5. Online Feature Selection Ensemble

5.1 Introduction

The research in the previous chapters focused on the use of filters for subset selection, and all of the data were assumed to be available at the time of learning. While SLSS is capable of incremental learning, it is not an online learning algorithm (i.e., an online algorithm does not need batches of data, only a single instance at a time). One of the key advantages of a pure online algorithm is that only a single data instance needs to be loaded into memory; thus, analyzing a massive data stream (e.g., billions of observations) is straightforward. Furthermore, this chapter focuses on an embedded approach to feature selection rather than a filter-based approach. Our online learning approach to feature selection addresses three of the five V's of big data, namely: volume, value, and velocity. Both NPFS and SLSS can easily handle extracting value from a large volume of data; however, data that arrive in a stream (i.e., velocity) are not incorporated into their design. One observation to make about SLSS is that it only examines a portion of the feature space at each learning round, which we refer to as learning with partial information because not all features need to be available. Such a learning setting is useful when there is a large amount of missing data, though most algorithms do not incorporate the possibility that there could be missing data. It is against this background that we develop an ensemble classifier that: (i) implements feature selection, (ii) incorporates new data into the model when they become available, and (iii) learns a prediction model using partial information.

Input: Distribution D, learning rule Update, parameter vector w ∼ N(0, 1)^{K×1}

for t = 1, 2, ...
  1. Receive (x_t, y_t) ∼ D
  2. Update w = Update(w, x_t, y_t)
  3. Receive test instance x ∼ D
  4. Receive y and measure the loss ℓ(w^T x, y)

Figure 5.1: Online algorithm for learning a linear function with parameters w.

5.2 Learning Setting

Our learning setting is formulated as follows (summarized in Figure 5.1, with common mathematical notations provided in Table 5.1): a data pair (x_t, y_t) ∈ ℝ^K × {±1} is presented at some time t, sampled from a joint distribution D (i.e., D := P(X, Y)). The objective is to learn a linear model w that produces a minimal loss on a function ℓ(w^T x, y), where ℓ(·, ·) is convex in its first argument. The model w is adjusted via an update rule when each new data pair arrives. Since the model is linear, the output is given by ŷ = sign(w^T x). Feature selection is performed by forcing some, preferably many, of the elements in w to zero. This approach to learning is an online learning paradigm with delayed labeling, because we receive some level of feedback on our predictions (i.e., a loss can be measured) after the prediction is made. Furthermore, unlike traditional online learning algorithms that assume all K features are available when the update is performed, our learning setting assumes that only B feature values are needed for learning (B < K). We describe below, in detail, how these B variables are selected.

Finally, we use an ensemble setting, and hence at each time stamp t we update not one model but rather a finite set of models. Our motivation is rooted in the

Table 5.1: Mathematical Notations for OFSE

Symbol        Meaning
D             joint probability distribution for sampling data
(x_t, y_t)    data pair in the space ℝ^K × {−1, 1}, (x, y) ∼ D, at time t
C_t           feature subset that is required to be measured in x_t
x̃_t           partial view of x_t containing the features in the set C_t
x̂_t           scaled version of x̃_t
w_t^(j)       jth single OFS model in an ensemble with J models at time t
p             ensemble model (i.e., linear combination of the w_t^(j))
B             sparsity level
R             maximum bound on ‖w_t^(j)‖_2
I(τ)          1 if τ = True, otherwise 0
[K]           the set {1, ..., K}
N(µ, σ)       Gaussian distribution with parameters µ and σ
Poisson(λ)    Poisson distribution with parameter λ

well-established result that ensemble models, under relatively mild conditions, have better generalization performance than a single model [140, 141]. The ensembles are constructed using variants of online bagging and boosting. Furthermore, since we use linear models as the base classifier, the ensemble model is no more complex than any single model. That is, the ensemble prediction is given by:

$$\hat{y} = \mathbf{x}^{\mathsf{T}} \sum_{j=1}^{J} \alpha_j w_t^{(j)} = \mathbf{x}^{\mathsf{T}} p \qquad (5.1)$$

where p is the ensemble model, a linear combination of the single models w_t^(j). We later show that the sparsity levels of w_t^(j) and p can be forced to be equal (i.e., ‖w_t^(j)‖_0 ≤ B and ‖p‖_0 ≤ B); however, p results in a smaller number of mistakes.

5.3 Online Feature Selection with Partial Inputs

Online feature selection (OFS) was presented by Wang et al. for learning feature importance via a sparse projection [2]. In their work, they proposed a full- and a partial-information approach for OFS. Full information assumes that all of the feature measurements in x_t are available at the time of training or testing. Partial information assumes that only B variables are available in x_t, where B < K. Figure 5.2 presents OFS with partial information. OFS requires, as inputs, a limit on the l2-norm of w (R), a sparsity level (B), a search probability (τ), and a learning rate (η), all specified prior to learning. A parameter vector w_1 is initialized to a zero vector (other initializations can be considered). At each time t, a Bernoulli random variable is sampled with probability of success τ. If the random variable is a success, then B variables are chosen uniformly at random from the set [K]; otherwise, the non-zero entries in w_t are chosen. Call this set C_t and only require these indices from x_t, which we call x̃_t. Let all the other entries [K] \ C_t be equal to zero. Hence, K − B variables need not be measured, and depending on the selection of B, OFS can work well even when there are a large number of missing variables (i.e., because K − B variables are not required for learning). If the prediction on x̃_t is correct, then there is no update to w_t. However, if the prediction is incorrect, then x̃_t is scaled as in (5.2) and a gradient descent update is performed with a learning rate of η; the result of the update is then projected onto an l2-ball to bound its Euclidean norm. The vector is then truncated to have at most B non-zero elements. The truncation procedure is shown in Figure 5.3. We refer to lines 4 through 15 of Figure 5.2 as "OFS Update".

1: Input
     • R: maximum l2 magnitude
     • B: OFS truncation parameter
     • τ: search parameter
     • η: learning rate
2: Initialization
     • w_1 = 0
3: for t = 1, ..., T do
4:   if Bernoulli(τ) == 1 then
5:     Randomly choose B attributes C_t from [K]
6:   else
7:     Choose the attributes that have non-zero values in w_t, i.e., C_t := {i : [w_t]_i ≠ 0}
8:   end if
9:   Receive x̃_t by only requiring the attributes in C_t, and y_t from D
10:  if y_t x̃_t^T w_t < 0 then
11:    Compute x̂_t (i ∈ [K])

           $[\hat{x}_t]_i = \dfrac{[\tilde{x}_t]_i}{\frac{B}{K}\,\tau + \mathbb{I}([w_t]_i \neq 0)\,(1-\tau)}$          (5.2)

12:    ŵ_t ← w_t + η y_t x̂_t
13:    ŵ_t ← min{1, R/‖ŵ_t‖_2} ŵ_t
14:    w_{t+1} ← Truncate(ŵ_t, B)
15:  end if
16: end for

Figure 5.2: Pseudo code for OFS with Partial Inputs [2]

1: Input
     • w ∈ ℝ^K: weight vector
     • B: OFS truncation parameter
2: if ‖w‖_0 > B then
3:   Set ŵ equal to w with the K − B elements of smallest magnitude set to zero
4: else
5:   ŵ = w
6: end if
7: Return ŵ

Figure 5.3: Pseudo code for truncating a parameter vector in OFS [2]
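A compact Python sketch of the partial-information OFS update in Figures 5.2 and 5.3 is given below; the function names are illustrative, and the rescaling in eq. (5.2) is written with the search probability τ.

import numpy as np

def truncate(w, B):
    """Keep only the B entries of w with the largest magnitude (Figure 5.3)."""
    if np.count_nonzero(w) > B:
        w = w.copy()
        w[np.argsort(np.abs(w))[:-B]] = 0.0       # zero the K - B smallest magnitudes
    return w

def ofs_partial_update(w, x, y, B, R, tau, eta, rng=np.random):
    """One OFS update with partial inputs (sketch of lines 4-15 of Figure 5.2)."""
    K = len(w)
    if rng.rand() < tau:                           # explore: B random attributes
        C = rng.choice(K, size=B, replace=False)
    else:                                          # exploit: nonzero entries of w
        C = np.flatnonzero(w)
    x_tilde = np.zeros(K)
    x_tilde[C] = x[C]                              # only the attributes in C are measured
    if y * x_tilde.dot(w) < 0:                     # mistake: update the model
        denom = (B / K) * tau + (w != 0) * (1.0 - tau)
        x_hat = np.where(denom > 0, x_tilde / denom, 0.0)    # eq. (5.2) rescaling
        w = w + eta * y * x_hat
        w = w * min(1.0, R / (np.linalg.norm(w) + 1e-12))    # project onto the l2-ball of radius R
        w = truncate(w, B)
    return w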

5.4 Online Feature Selection Ensembles

In this section, we describe our proposed integration of online feature selection with online bagging and boosting. While the integration of these two methodologies is rather straightforward, the impact is quite significant. As discussed in the previous section, OFS is a linear prediction approach, i.e., ŷ = sign(w^T x), where w is B-sparse with ‖w‖_0 ≤ B. By forming an ensemble of linear B-sparse models, the ensemble model can typically achieve a lower error rate while remaining B-sparse. Note that we are assuming only partial information is adequate, as discussed in Section 5.3. In this section, we limit our discussion to integrating online bagging and boosting mechanisms into an online feature selection framework. Interested readers are encouraged to read Oza's work for a more detailed explanation and motivation of a general framework for online ensemble algorithms [142].

5.4.1 Algorithm Descriptions

1: Input
     • B: OFS truncation parameter
     • R: maximum l2 magnitude
     • J: ensemble size
2: Initialization
     • w_0^(j) ∼ N(0, 1) ∀ j ∈ [J]
     • p_0 ∼ N(0, 1)
3: for t = 1, ..., T do
4:   (x_t, y_t) ∼ D
5:   for j = 1, ..., J do
6:     Z ∼ Poisson(1)
7:     for z = 1, ..., Z do
8:       w_t^(j) ← OFS Update(w_t^(j), x_t, y_t)
9:     end for
10:  end for
11:  p_t = (1/J) Σ_{j=1}^J w_t^(j)
12:  p_t ← Truncate(p_t, B)
13: end for

Figure 5.4: Pseudo code for online bagging feature selectors

1: Input
     • B: OFS truncation parameter
     • R: maximum l2 magnitude
     • J: ensemble size
2: Initialization
     • w_0^(j) ∼ N(0, 1) ∀ j ∈ [J]
     • λ_j^sc ← 0 ∀ j ∈ [J]
     • λ_j^sw ← 0 ∀ j ∈ [J]
     • p_0 ∼ N(0, 1)
3: for t = 1, ..., T do
4:   (x_t, y_t) ∼ D
5:   λ_t = 1
6:   for j = 1, ..., J do
7:     Z ∼ Poisson(λ_t)
8:     for z = 1, ..., Z do
9:       w_t^(j) ← OFS Update(w_t^(j), x_t, y_t)
10:    end for
11:    if y_t x_t^T w_t^(j) < 0 then
12:      λ_j^sw ← λ_j^sw + λ_t
13:      λ_t ← λ_t (λ_j^sc + λ_j^sw)/(2 λ_j^sw)
14:    else
15:      λ_j^sc ← λ_j^sc + λ_t
16:      λ_t ← λ_t (λ_j^sc + λ_j^sw)/(2 λ_j^sc)
17:    end if
18:  end for
19:  Set ε_j = λ_j^sw/(λ_j^sw + λ_j^sc) ∀ j ∈ [J]
20:  p_t = Σ_{j=1}^J log((1 − ε_j)/ε_j) w_t^(j)
21:  p_t ← Truncate(p_t, B)
22: end for

Figure 5.5: Pseudo code for online boosting feature selectors

Online Bagging

We begin by presenting online bagging for feature selectors (OFS-Bag), whose pseudo code can be found in Figure 5.4. OFS-Bag requires one additional input,

the ensemble size J. To better promote diversity, the single OFS models, w_t^(j), are initialized using K-dimensional random Gaussian vectors, as opposed to the zero vector used in traditional OFS. Furthermore, the initialized vectors are truncated using the procedure in Figure 5.3, so that the models are B-sparse. At each time t, a data pair is sampled from the distribution D. For each OFS model in the ensemble, a random variable is then sampled from a Poisson(1) distribution. The single model is updated with (x_t, y_t) the number of times indicated by the Poisson random variable. Once all OFS models are updated, they are averaged and truncated. Hence, a prediction with the ensemble is no more complex than a prediction with a single model, and the ensemble is still B-sparse because of the truncation at line 12. Predictions can be made on a sample x ∼ D after line 12 using ŷ = sign(p_t^T x).
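The following Python sketch mirrors Figure 5.4; it reuses the ofs_partial_update and truncate helpers sketched after Figure 5.3, and stream is assumed to be an iterable of (x_t, y_t) pairs.

import numpy as np

def ofs_bag(stream, K, J, B, R, tau, eta, rng=np.random):
    """Sketch of OFS-Bag: J OFS models, each updated Poisson(1) times per example,
    averaged and truncated into a single B-sparse ensemble model p_t."""
    W = np.vstack([truncate(w, B) for w in rng.randn(J, K)])   # Gaussian init, B-sparse
    p = np.zeros(K)
    for x_t, y_t in stream:
        for j in range(J):
            for _ in range(rng.poisson(1.0)):
                W[j] = ofs_partial_update(W[j], x_t, y_t, B, R, tau, eta, rng)
        p = truncate(W.mean(axis=0), B)                        # average, then truncate
    return p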

Online Boosting

The pseudo code for online boosting with OFS (OFS-Boo) is shown in Figure 5.5. Similar to OFS-Bag, OFS-Boo requires one extra input, the size of the ensemble. One of the key differences between OFS-Boo and OFS-Bag is that the Poisson sampling parameter is fixed for OFS-Bag, whereas OFS-Boo initializes the Poisson

parameter, λ_t, to 1 and then adjusts it based on the ability of a single model to make the prediction with partial information. Online boosting is similar to online bagging; however, unlike online bagging, the Poisson parameter does not remain fixed. Rather, it is adjusted for each data sample. The parameter is updated as follows: if an instance is misclassified by a single OFS model, then the Poisson parameter, λ_t, is increased when the observation is presented


Figure 5.6: Distribution of B for random truncation sizes with K = 100, sampled from B ∼ Poisson(λ), where λ ∈ {1, 10, 25, 50}.

to the next OFS model in the ensemble. This is analogous to AdaBoost increasing the instance weights in a batch of data after a weak hypothesis has been produced. If the single OFS model predicted x_t correctly, then the Poisson parameter

λ_t is decreased. Note that line 11 of Figure 5.5 can use as many – or as few – feature measurements as are available in x_t. The OFS models are now combined via a weighted sum rather than a simple average. The ensemble model p_t is then truncated to ensure that it is B-sparse.
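A Python sketch of this boosting loop is shown below; it again reuses the ofs_partial_update and truncate helpers, and the exact λ scaling follows Oza's online boosting scheme, which should be treated as an assumption of this sketch.

import numpy as np

def ofs_boo(stream, K, J, B, R, tau, eta, rng=np.random):
    """Sketch of OFS-Boo: lambda grows when a model errs and shrinks when it is
    correct (Oza-style update, assumed here); models are combined by a weighted sum."""
    W = np.vstack([truncate(w, B) for w in rng.randn(J, K)])
    lam_sc = np.zeros(J)            # cumulative lambda on correctly classified examples
    lam_sw = np.zeros(J)            # cumulative lambda on misclassified examples
    p = np.zeros(K)
    for x_t, y_t in stream:
        lam = 1.0
        for j in range(J):
            for _ in range(rng.poisson(lam)):
                W[j] = ofs_partial_update(W[j], x_t, y_t, B, R, tau, eta, rng)
            if y_t * x_t.dot(W[j]) < 0:                    # model j makes a mistake
                lam_sw[j] += lam
                lam *= (lam_sc[j] + lam_sw[j]) / (2.0 * lam_sw[j])
            else:
                lam_sc[j] += lam
                lam *= (lam_sc[j] + lam_sw[j]) / (2.0 * lam_sc[j])
        eps = np.clip(lam_sw / (lam_sw + lam_sc + 1e-12), 1e-6, 1.0 - 1e-6)
        p = truncate(np.log((1.0 - eps) / eps) @ W, B)     # weighted sum, then truncate
    return p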

5.4.2 Promoting Diversity

Diversity within an ensemble is generally known to be a useful quality [143–146]. We experiment with two different approaches to promote varying levels of diversity

within the ensemble. First, at each time t the parameter vectors, w_t^(j), of the ensemble are updated, where with probability τ a random set of indices in the parameter vector is evaluated. Rather than keeping the set of indices fixed for all parameter vectors at each time t (i.e., C_t), we let each update to a parameter vector sample a different set

of indices. Thus, each classifier has the opportunity to sample different features, but is limited to sampling at most B. This measure of diversity is implemented in all of our ensemble algorithms. In situations where there is a high percentage of missing data, this sampling procedure may not be feasible; however, we leave this investigation to future work. Another measure to promote diversity is to let the truncation parameter be a random variable sampled from B ∼ Poisson(βK), where β ∈ (0, 1). Of course B < K must hold, which is a check that must be performed. Figure 5.6 shows four Poisson distributions for a data set with K = 100. Setting a reasonable Poisson parameter allows the OFS models to have varying levels of diversity. We refer to the implementations that use the random truncation for OFS as OFS-Bag-R and OFS-Boo-R.
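For the -R variants, the random truncation size can be drawn as in the short sketch below; the clipping that keeps B strictly below K is the check mentioned above, and the function name is illustrative.

import numpy as np

def random_truncation_size(K, beta=0.25, rng=np.random):
    """Draw B ~ Poisson(beta * K) for the OFS-Bag-R / OFS-Boo-R variants and clip it
    so that 1 <= B < K, giving each model its own sparsity level."""
    B = rng.poisson(beta * K)
    return int(np.clip(B, 1, K - 1))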

5.5 Experimental Results

5.5.1 Data Sets & Benchmarks

We evaluated our proposed ensemble approaches on data sets collected from the UCI machine learning repository [128]. All selected benchmarks are binary prediction problems, whose feature set sizes and cardinalities are shown in Table 5.2. A test-then-train strategy is used to evaluate the data stream [147], where an observation is tested before it is used for training any model. We measure the cumulative number of mistakes made by our ensemble-based prediction algorithms and compare them to the single-model methods as a benchmark. At each time step, the partial information in x_t is used for training; however, the full information in x_t is used to evaluate the mistake rate. Each data set is run 25 times, where the data stream is randomly permuted on each run. The average over the trials is reported. The OFS free parameters were selected based on recommendations by the authors in [2] (see Table 5.3 for parameter values). All data sets have had their features standardized.

Table 5.2: Properties of the UCI data sets used in our experiments. All data sets are binary prediction problems.

Name         Instances   Features
adult        32,561      123
german       1,000       24
ionosphere   351         33
ovarian      216         4,000
sido0        12,678      4,932
spam         4,601       56
splice       3,175       60
svmguide     1,243       21

Table 5.3: Parameter selection for OFS following guidelines from Wang et al. [2].

Name           Value
τ              1/5
J              25
η              1/5
R              10
Poisson(βK)    Poisson(K/4)
B              ⌊K/4⌋

Finally, we have made the code publicly available for reproducibility¹: https://github.com/gditzler/OFSE. The code was implemented in Matlab.

5.5.2 Real-World & UCI Experiments

The evaluation times and mistake rates can be found in Tables 5.4 and 5.5, respectively. There are several observations we can make from these experiments. First, the cumulative evaluation time of each ensemble method is approximately the same

¹Note that precise reproducibility of the evaluation times depends on the speed and capabilities of the CPU.

(the time only includes the evaluation), which agrees with our intuition, since there is little difference in complexity. Note that the evaluation cost of the ensemble model is equal to that of a single model, because the ensembles are linear models computed with a dot product. That is, the ensemble prediction is given by ŷ = x^T p_t and the single-model prediction is ŷ = x^T w_t^(j). The only difference in timing between the single model and the ensemble model would be the cumulative training time.

The cumulative mistake rates, on the other hand (see Table 5.5 and Figure 5.7), are significantly lower for the ensemble models than for the single OFS model on all data sets evaluated. The number in parentheses is the performance rank of the algorithm (lower is better). In fact, OFS-Boo and OFS-Boo-R are generally ranked very well, even compared to their bagging counterparts. The improvements can clearly be observed by examining the cumulative mistakes made over time (see Figure 5.7). We also begin to observe convergence in the mistake rate for some of the ensemble models evaluated on the data streams, despite the fact that these algorithms are only allowed to observe B of the K features. The randomization of B appears to provide some improvement for online bagging; however, the differences are not statistically significant. We find that OFS-Boo and OFS-Boo-R – in general – perform the best, though not always with statistical significance (see Table 5.6 for p-values derived from the rank statistics [130]). Table 5.6 can be interpreted as follows: each element represents the p-value that the "row algorithm" is performing better than the corresponding "column algorithm". The Friedman rank test rejects the hypothesis that all algorithms are

performing equally (p_F = 9.1831 × 10⁻⁵ with the size of the test set at α = 0.05). Furthermore, upon further investigation of the family-wise analysis of the p-values derived from the mistakes, we find that each of the ensemble approaches presented in this work performs better – with statistical significance – than the single OFS

[Figure 5.7 panels: (a) adult, (b) german, (c) ovarian, (d) spam, (e) splice, (f) svmguide; cumulative mistakes versus time for Single, OFS-Bag, OFS-Boo, OFS-Bag-R, and OFS-Boo-R.]

Figure 5.7: Cumulative mistakes made by a single OFS model and the proposed online bagging and boosting models. Note that at each time step only a fraction (B/K) of the total number of features is required for learning.

Table 5.4: Cumulative evaluation time (i.e., testing time) measured in seconds for the ensemble algorithms.

Data Set     OFS-Bag   OFS-Boo   OFS-Bag-R   OFS-Boo-R
adult        0.362     0.407     0.369       0.411
german       0.010     0.011     0.010       0.011
ionosphere   0.004     0.004     0.004       0.004
ovarian      0.010     0.009     0.010       0.009
sido0        0.720     0.658     0.722       0.660
spambase     0.045     0.052     0.045       0.050
splice       0.032     0.035     0.032       0.036
svmguide3    0.012     0.014     0.012       0.014

Table 5.5: Cumulative mistake rates made by the proposed online learning algorithms and a single OFS model. The number in parenthesis represents the order rank of the performance on the algorithm.

Data Set     Single        OFS-Bag       OFS-Boo       OFS-Bag-R     OFS-Boo-R
adult        0.34768 (5)   0.28876 (1)   0.31086 (3)   0.28915 (2)   0.29911 (4)
german       0.42981 (5)   0.37911 (4)   0.33674 (2)   0.38432 (3)   0.31231 (1)
ionosphere   0.28810 (5)   0.22400 (3)   0.15943 (1)   0.22419 (4)   0.16667 (2)
ovarian      0.21734 (5)   0.17736 (4)   0.13457 (2)   0.18450 (3)   0.13364 (1)
sido0        0.50575 (5)   0.50311 (3)   0.42115 (2)   0.49908 (4)   0.38612 (1)
spambase     0.17244 (5)   0.12559 (1)   0.12972 (3)   0.12588 (2)   0.13412 (4)
splice       0.23686 (5)   0.19389 (4)   0.18003 (1)   0.19433 (3)   0.18794 (2)
svmguide3    0.39052 (5)   0.30258 (4)   0.25980 (3)   0.27424 (1)   0.24745 (2)
rank         5.000         3.000         2.125         2.750         2.125

model. For statistical significance in the family-wise comparison we must require that $p \leq \alpha/4$ to account for the multiple comparisons [148]. Given this definition of statistical significance, it becomes clear that the ensembles are outperforming the single OFS model (see Table 5.6).
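The rank-based comparison above can be reproduced, at least in spirit, with off-the-shelf tools. The following is a minimal sketch (not the code used in this thesis) that runs the Friedman test over the cumulative mistake rates in Table 5.5 and applies the Bonferroni-style threshold $\alpha/4$ for the four ensemble-versus-single comparisons; the p-values reported in the text follow the procedure of [130] and may differ slightly from this sketch.

```python
import numpy as np
from scipy.stats import friedmanchisquare, rankdata

# Columns: Single, OFS-Bag, OFS-Boo, OFS-Bag-R, OFS-Boo-R (rows are the data sets of Table 5.5).
mistakes = np.array([
    [0.34768, 0.28876, 0.31086, 0.28915, 0.29911],  # adult
    [0.42981, 0.37911, 0.33674, 0.38432, 0.31231],  # german
    [0.28810, 0.22400, 0.15943, 0.22419, 0.16667],  # ionosphere
    [0.21734, 0.17736, 0.13457, 0.18450, 0.13364],  # ovarian
    [0.50575, 0.50311, 0.42115, 0.49908, 0.38612],  # sido0
    [0.17244, 0.12559, 0.12972, 0.12588, 0.13412],  # spambase
    [0.23686, 0.19389, 0.18003, 0.19433, 0.18794],  # splice
    [0.39052, 0.30258, 0.25980, 0.27424, 0.24745],  # svmguide3
])

# Friedman test of the null hypothesis that all five algorithms perform equally.
stat, p_friedman = friedmanchisquare(*mistakes.T)
print(f"Friedman p-value: {p_friedman:.2e}")

# Average rank of each algorithm across the data sets (lower mistake rate -> better rank).
avg_rank = np.vstack([rankdata(row) for row in mistakes]).mean(axis=0)
print("average ranks:", avg_rank)

# Family-wise comparison of each ensemble against the single OFS model.
alpha, n_comparisons = 0.05, 4
print("corrected significance threshold:", alpha / n_comparisons)
```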

5.6 Summary

In this chapter, we addressed online learning of streaming data in a high-dimensional space, where many of the features carry little or no information, and only portions of the feature space are available at the time of learning. We presented two online ensemble-based approaches that, once trained, are of the same complexity as a single OFS model when evaluating unseen data, while maintaining the same level of sparsity. We also presented two approaches, specific to the online ensemble of feature selectors, to promote diversity within the ensemble. OFS-Bag and OFS-Boo were evaluated on several data sets collected from the UCI machine learning repository to demonstrate the ability of the ensemble approach compared with single OFS models.

Table 5.6: Statistical significance test (p-values) for determining if an algorithm is performing better than any other. Note that for α = 0.05, the p-value must be less than 0.01 to account for multiple comparisons.

          Single    OBag      OBoo      OBag-R    OBoo-R
Single    –         0.9943    0.9999    0.9978    0.9999
OBag      0.0057    –         0.8658    0.6241    0.8658
OBoo      0.0001    0.1342    –         0.2146    0.5000
OBag-R    0.0022    0.3759    0.7854    –         0.7854
OBoo-R    0.0001    0.1342    0.5000    0.2146    –

The key advantage to identify is that OFS-Bag and OFS-Boo typically yield smaller mistake rates, with statistical significance, than any single linear model, while maintaining the same level of sparsity.

6. Feature Selection Tools for Genomics

6.1 Introduction

Some of the current software tools for comparative metagenomics provide microbial ecologists the ability to investigate and explore bacterial communities using α– & β–diversity (i.e., within and between sample diversity, respectively) tests. Feature subset selection can also provide a unique insight into the differences between metagenomic phenotypes. In particular, feature subset selection methods can obtain the operational taxonomic units (OTUs), or functional features, that have the most influence on the condition being studied. For example, previous studies have begun to examine information-theoretic feature selection to understand the differences between protein family abundances that best discriminate between different age groups in the human gut microbiome [124]. In this chapter we present a new Python command line tool, which is compatible with the widely adopted BIOM format [149], for microbial ecologists that implements information-theoretic subset selection methods for biological data formats. We demonstrate the software tool's capabilities on publicly available datasets.

6.2 Motivation

There is an immense amount of sequence data being collected from next-generation sequencers. Sequences from bacterial communities are collected from whole genome shotgun (WGS) or amplicon sequencing runs, and the analysis of such data allows researchers to study the functional or taxonomic composition of a sample. Microbial ecologists represent the composition in the form of an abundance matrix (see Section 2.4.5), which usually holds counts of operational taxonomic units (OTUs), but can also hold counts of genes or metabolic pathway occurrences if the data are collected from WGS (see Figure 6.1).

[Figure 6.1 block diagram: user-supplied training and testing metagenomes, together with functional databases (Pfam, SEED, etc.), produce abundance profiles for two groups; Fizzy selects feature subsets and trains a classifier, and the frequencies of the selected variables with the lowest loss and highest AUC are used for interpretation.]

Figure 6.1: Overview of feature selection using data collected from the microbiome.

Furthermore, collections of metagenomic samples contain different factors, groups, or phenotypes, such as environmental pH and salinity values, or a health related status [94,150]. In this chapter, we introduce software tools for microbial ecologists that implement feature subset selection routines for biological data formats, which we refer to as Fizzy. Prior to feature selection, we assume that the raw sequences from the environmental samples have already been classified into operational taxonomic units (OTUs), or functional features. The raw OTU counts or functional features are stored in a matrix

$\mathbf{X} \in \mathbb{N}_+^{K \times M}$, where $\mathbb{N}_+$ is the set of positive natural numbers, $K$ is the number of OTU clusters, and $M$ is the number of samples collected (i.e., the abundance profile in Figure 6.1). The $M$ samples contain a significant amount of meta-data describing the sample, which is where we obtain the phenotypes describing the sample. While there may be many different meta-data fields, we shall only focus on one piece of meta-data at a time. For example, a sample may contain the sex, age, and height of the person from whom the sample was collected, and the analysis would only use one of those fields. That is, we could use $\mathbf{X}$ to build a predictive model of sex. Both the data matrix and meta-data can be found for hundreds of publicly available datasets through pioneering projects such as MG-RAST [8], KBase [9], the Human Microbiome Project [151], and the Earth Microbiome Project [152]. A natural question to ask about studies with multiple phenotypes is: "which OTUs or functions are important for differentiating the phenotypes?" Answering such a question can be useful for understanding which conditions are driving, or being affected by, differences in composition and function across samples. Subset selection can produce a feature subset that not only removes irrelevant features (i.e., features that do not carry information about the phenotype), but also does not contain features that are redundant (i.e., features that carry the same information). This process of reducing the feature set offers a rapid insight into uncovering the differences between multiple populations in a metagenomic study and can be performed as a complementary analysis to β-diversity methods, such as principal coordinate analysis (PCoA) [115].
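To make the layout concrete, the following toy sketch (with made-up counts and a hypothetical meta-data record) illustrates an abundance matrix $\mathbf{X}$ with $K$ OTU rows and $M$ sample columns, and the choice of a single meta-data field as the prediction target.

```python
import numpy as np

K, M = 5, 4                                   # OTU clusters, samples
X = np.random.randint(1, 50, size=(K, M))     # abundance matrix, X in N_+^{K x M}

metadata = {                                  # one record per sample (column of X)
    "sample_id": ["S1", "S2", "S3", "S4"],
    "sex":       ["female", "male", "male", "female"],
    "age":       [34, 29, 41, 52],
}
y = np.array(metadata["sex"])                 # the single phenotype used as the label

# Most learning libraries expect samples as rows, so the abundance matrix is
# transposed before model fitting: rows = samples, columns = OTU features.
X_samples_by_features = X.T
```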

6.3 Fizzy: Feature Selection for Biological Data Formats

Fizzy is a suite of feature subset selection tools for biological data formats and, in particular, we use the Biom1 format standard [149] because of its wide usage throughout microbial ecology (e.g., HMP2, EMP3, etc.). Commonly used software in microbial ecology, such as Qiime [110], typically requires a Biom file containing the

1 The Biom format follows the standard for JSON.
2 http://hmpdacc.org/
3 http://www.earthmicrobiome.org/

[Figure 6.2 block diagram: the Fizzy suite takes a Biom file (a sparse or dense JSON representation of the OTU table), a map file of per-sample meta-data with one column containing the class labels, and the feature selection parameters; feature selection (PyFeast, lasso, or NPFS) and post-processing write the OTU IDs of the relevant features to the output file path.]

Figure 6.2: Overview of the Fizzy software tool.

16S data and a map file containing the meta-data of the samples within the Biom file. However, Fizzy allows users to store the meta-data in the Biom file directly, thus avoiding the requirement for both a Biom and a map file. Figure 6.2 shows a high-level overview of the software tool, which is discussed below. The Fizzy software suite implements information-theoretic subset selection, NPFS (see Chapter 3), and lasso. The core of Fizzy is based on the FEAST C feature selection library [11], which is used to implement all of the information-theoretic methods. FEAST was selected for two primary reasons: (i) the library contains a large selection of information-theoretic feature selection objective functions, and (ii) the run-time of FEAST is typically faster than other feature selection libraries because it is written in a compiled language. We implemented a Python interface for FEAST to use within Fizzy, which is available to the public at http://github.com/EESI/PyFeast.
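For readers unfamiliar with the format, the sketch below shows roughly what reading a dense Biom 1.0 (JSON) table looks like using only the standard library; the key names follow the Biom 1.0 specification, but in practice one would use the biom-format package rather than parsing the file by hand, and the function name here is only illustrative.

```python
import json
import numpy as np

def load_dense_biom(path):
    """Read a dense BIOM 1.0 JSON table into OTU ids, sample ids, and a count matrix."""
    with open(path) as fh:
        table = json.load(fh)
    otu_ids = [row["id"] for row in table["rows"]]
    sample_ids = [col["id"] for col in table["columns"]]
    # For a dense table, "data" is a list of rows (OTUs) by columns (samples).
    X = np.asarray(table["data"], dtype=float)
    return otu_ids, sample_ids, X
```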

The Fizzy tool requires that a Biom format OTU table (sparse or dense), a mapping file in tab-delimited (TSV) format, a metagenomic phenotype column in the map file, and an output file path be specified. Furthermore, Fizzy allows the user to specify the number of taxonomic units to select as well as the feature selection objective function. The current implementation of Fizzy has nine subset selection objective functions, which are all based on information theory (see Brown et al. for the mathematical details about the objective functions [11]). We also provide an implementation of the NPFS module, which can infer the number of relevant features given any subset selection method in FEAST [15], as described in Chapter 3. NPFS has a parallel implementation where the user can control the number of cores used by the program. The lasso implementation within Fizzy uses Scikit-Learn [153].
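As a flavor of the information-theoretic objectives Fizzy exposes, the sketch below is an illustrative re-implementation of the simplest one, mutual information maximization (MIM): each feature (assumed to be discretized) is scored by its mutual information with the class label and the top-k features are kept. This is not the FEAST/PyFeast code that Fizzy actually calls; the function names are only for illustration.

```python
import numpy as np

def mutual_information(x, y):
    """Mutual information between two discrete vectors, estimated from joint counts."""
    mi = 0.0
    for xv in np.unique(x):
        for yv in np.unique(y):
            p_xy = np.mean((x == xv) & (y == yv))
            if p_xy > 0:
                p_x, p_y = np.mean(x == xv), np.mean(y == yv)
                mi += p_xy * np.log(p_xy / (p_x * p_y))
    return mi

def mim_select(X, y, k):
    """Rank the columns of X (samples x features) by MI with y; return the top-k indices."""
    scores = np.array([mutual_information(X[:, j], y) for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:k]
```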

6.4 Benchmark Data Sets

We consider two publicly available data sets for benchmarking the suite of Fizzy’s feature selection tools.

6.4.1 MetaHit

As a case study, let us examine a metagenome data set collected by [150], which is widely referred to as the MetaHit data set. The data are collected from Illumina-based metagenomic sequencing of 124 fecal samples of 124 European individuals from Spain and Denmark. Among the 124 individuals in the database, 25 are patients who have inflammatory bowel disease (IBD), and 42 are obese. It is interesting to note that only three of the individuals who have IBD are also obese. Let us consider two different labeling schemes for the data: IBD and obesity, both of which are binary prediction problems. The sequences from each individual were functionally annotated using the Pfam database [154] in a recent study that utilized the MetaHit data set for feature selection on patient age [124]. There are a total of 6,343 unique functional features detected in the data set.

6.4.2 American Gut Project

The American Gut Project (AGP)5 is a joint effort between several institutes that aims at characterizing the gut microbiome of individuals across the US. The samples are collected by volunteers, who send them to the institutions running the project for sequencing and analysis. In a way, the AGP can be seen as a way of crowdsourcing science to get individuals involved in the research by providing samples, which are kept anonymous. Furthermore, while the project focuses on the "gut" as indicated by the name, the AGP database contains samples from different body sites (e.g., skin or oral cavity). Once the samples are received by the Knight Lab at the University of California, San Diego, they are processed using 16S amplicon sequencing to produce a taxonomic abundance matrix (see Section 2.4.5). The AGP data are represented by the abundance of thousands of operational taxonomic units (OTUs) detected during sequencing, and we examine only the fecal samples. The final fecal data are represented by 2.9k+ people (both males & females), diet types, and age groups. We extracted 24k+ taxa abundances as the features and the diet type was chosen as the label for each sample. The diets are categorized as either omnivore or vegetarian.

6.5 Experimental Results

6.5.1 American Gut Project

Fizzy was run using joint mutual information (JMI) on 2.9k+ samples collected from the AG Project and features were selected using the diet type as

5 http://americangut.org/

the predictor variable. The diets are broken down into omnivores and vegetarians, where subcategories of omnivore and vegetarian (e.g., omnivore but does not eat red meat) are simply categorized as omnivore. Table 6.1 shows the top ranking OTUs selected for differentiating omnivores versus vegetarians in the AG Project data. Both Bacteroides and Prevotella were selected by Fizzy (note that Prevotella is not shown in the table because it was not ranked within the top 15 OTUs), which have been hypothesized as being important differentiators of diet [155]. NPFS detected 27 OTUs of the Prevotella genus, and the relative abundances were larger for the vegetarians when examining the largest differences, which coincides with results in the literature [156]. We also compare Fizzy to Qiime's random forests [14], because the random forest within Qiime has become a commonly used benchmark in microbial ecology. The top ranked features for random forests are found in Table 6.2. Similar to approaches such as mRMR and JMI, a threshold for the number of features to select must be chosen in advance for the random forest. We find some overlap between the results of Fizzy (using JMI) and the random forests. The Bacteroides genus was detected as relevant several times for both Fizzy and random forests, and Bacteroides has been found to be an indicator of diet [136,137,157]. Figure 6.3 shows the largest differences between the omnivores and vegetarians among the top 500 OTUs selected by JMI. The numerical values on the x-axis correspond to the OTUs given by:

1. (F148) Bacteroidetes, Bacteroidia, Bacteroidales, Bacteroidaceae, Bacteroides uniformis (GGID1733364): -6.20923

2. (F4) Bacteroidetes, Bacteroidia, Bacteroidales, Bacteroidaceae, Bacteroides (GGID197367): 5.14587

3. (F127) Firmicutes, Clostridia, Clostridiales, Lachnospiraceae (GGID340761): 4.18384

4. (F223) Firmicutes, Clostridia, Clostridiales, Ruminococcaceae (GGID180285): -4.11038

5. (F291) Bacteroidetes, Bacteroidia, Bacteroidales, Bacteroidaceae, Bacteroides ovatus (GGID180606): -3.96605

6. (F206) Firmicutes, Clostridia, Clostridiales, Ruminococcaceae (GGID352347): -3.65923

7. (F195) Bacteroidetes, Bacteroidia, Bacteroidales, Bacteroidaceae, Bacteroides (GGID3465233): 3.34877

8. (F60) Firmicutes, Clostridia, Clostridiales (GGID173876): -2.49844

9. (F458) Firmicutes, Clostridia, Clostridiales, Lachnospiraceae (GGID193477): -2.28077

10. (F113) Bacteroidetes, Bacteroidia, Bacteroidales, Rikenellaceae (GGID4453609): 1.94571

11. (F463) Firmicutes, Clostridia, Clostridiales, Lachnospiraceae, Ruminococcus gnavus (GGID191755): 1.32321

12. (F310) Bacteroidetes, Bacteroidia, Bacteroidales, Porphyromonadaceae, Parabacteroides (GGID847228): 1.30030

13. (F276) Firmicutes, Clostridia, Clostridiales, Lachnospiraceae, Coprococcus (GGID2740950): -1.12856

14. (F257) Bacteroidetes, Bacteroidia, Bacteroidales, Bacteroidaceae, Bacteroides (GGID190913): 0.89408

15. (F106) Firmicutes, Clostridia, Clostridiales, Lachnospiraceae (GGID176306): -0.58509

where the differences are in units of $\times 10^{-3}$ and a negative value means that the average relative abundance was higher in the vegetarians.
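The quantity plotted in Figure 6.3 is simple to compute; the sketch below (illustrative only, with hypothetical function and label names) takes a samples-by-OTUs count matrix, the diet labels, and the indices of the OTUs chosen by the selector, and returns the selected OTUs ordered by the absolute difference in mean relative abundance between omnivores and vegetarians.

```python
import numpy as np

def abundance_differences(counts, diet, selected):
    """counts: samples x OTUs matrix of raw counts; diet: numpy array of 'omnivore'/'vegetarian'
    labels per sample; selected: numpy array of OTU column indices chosen by the selector."""
    rel = counts / counts.sum(axis=1, keepdims=True)           # per-sample relative abundance
    omni = rel[diet == "omnivore"][:, selected].mean(axis=0)    # mean abundance per group
    veg = rel[diet == "vegetarian"][:, selected].mean(axis=0)
    diff = omni - veg                      # negative => higher average abundance in vegetarians
    order = np.argsort(-np.abs(diff))      # sort by absolute difference, as in Figure 6.3
    return selected[order], diff[order]
```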

[Figure 6.3 appears here: a bar chart of the 15 largest differences in relative abundance, with bars indicating higher abundance in either the vegetarians or the omnivores.]

Figure 6.3: Joint Mutual Information (JMI) was configured to select 500 features from the 25k+ OTUs in the American Gut Project's fecal samples. The diet of the sample is the dependent variable. The selected Greengenes (GG) OTUs are sorted by the absolute difference in relative abundance between the omnivores and vegetarians. The numerical values on the x-axis that correspond to an OTU can be found in the text.

Table 6.1: List of the top ranking features for omnivores and vegetarians in the 16S data collected from the American Gut Project detected using JMI. The number followed by "F" indicates the order Fizzy selected the OTU and the "GGID" contains the Greengenes OTU ID from the taxonomic classification.

(feature rank)  Operational Taxonomic Unit Classification (OTU ID)
(F1)   Firmicutes, Clostridia, Clostridiales, Lachnospiraceae (GGID4329132)
(F2)   Firmicutes, Clostridia, Clostridiales, Ruminococcaceae (GGID185584)
(F3)   Bacteroidetes, Bacteroidia, Bacteroidales, Bacteroidaceae, Bacteroides (GGID177150)
(F4)   Bacteroidetes, Bacteroidia, Bacteroidales, Bacteroidaceae, Bacteroides (GGID197367)
(F5)   Bacteroidetes, Bacteroidia, Bacteroidales, Bacteroidaceae, Bacteroides (GGID199716)
(F6)   Bacteroidetes, Bacteroidia, Bacteroidales, Bacteroidaceae, Bacteroides (GGID188887)
(F7)   Bacteroidetes, Bacteroidia, Bacteroidales, Bacteroidaceae, Bacteroides (GGID312140)
(F8)   Bacteroidetes, Bacteroidia, Bacteroidales, Bacteroidaceae, Bacteroides (GGID4401110)
(F9)   Bacteroidetes, Bacteroidia, Bacteroidales, Bacteroidaceae, Bacteroides (GGID198449)
(F10)  Firmicutes, Bacilli, Bacillales, Paenibacillaceae, Paenibacillus (GGID4470837)
(F11)  Firmicutes, Clostridia, Clostridiales, Ruminococcaceae, Faecalibacterium prausnitzii (GGID359314)
(F12)  Bacteroidetes, Bacteroidia, Bacteroidales, Bacteroidaceae, Bacteroides (GGID2859978)
(F13)  Firmicutes, Clostridia, Clostridiales (GGID197832)
(F14)  Bacteroidetes, Bacteroidia, Bacteroidales, Bacteroidaceae, Bacteroides (GGID205904)
(F15)  Firmicutes, Clostridia, Clostridiales, Ruminococcaceae, Faecalibacterium prausnitzii (GGID520413)

Table 6.2: List of the top ranking features for omnivores and vegetarians in the 16S data collected from the American Gut Project detected using Random Forests. The number followed by "F" indicates the order the OTU was selected and the "GGID" contains the Greengenes OTU ID from the taxonomic classification.

(feature rank)  Operational Taxonomic Unit Classification (OTU ID)
(F1)   Bacteroidetes, Bacteroidia, Bacteroidales, Bacteroidaceae, Bacteroides ovatus (GGID180606)
(F2)   Bacteroidetes, Bacteroidia, Bacteroidales, Bacteroidaceae, Bacteroides fragilis (GGID4386507)
(F3)   Firmicutes, Clostridia, Clostridiales, Lachnospiraceae, Roseburia (GGID4335815)
(F4)   Actinobacteria, Actinobacteria, Actinomycetales, Corynebacteriaceae, Corynebacterium simulans (GGID912997)
(F5)   Bacteroidetes, Bacteroidia, Bacteroidales, Rikenellaceae (GGID175375)
(F6)   Firmicutes, Clostridia, Clostridiales, Lachnospiraceae (GGID194112)
(F7)   Firmicutes, Clostridia, Clostridiales, Ruminococcaceae (GGID189924)
(F8)   Bacteroidetes, Bacteroidia, Bacteroidales, Bacteroidaceae, Bacteroides (GGID1105984)
(F9)   Bacteroidetes, Bacteroidia, Bacteroidales, Bacteroidaceae, Bacteroides (GGID197367)
(F10)  Firmicutes, Clostridia, Clostridiales, Ruminococcaceae (GGID174818)
(F11)  Firmicutes, Clostridia, Clostridiales, Ruminococcaceae (GGID4324040)
(F12)  Firmicutes, Clostridia, Clostridiales, Ruminococcaceae (GGID197204)
(F13)  Bacteroidetes, Bacteroidia, Bacteroidales, Bacteroidaceae, Bacteroides (GGID1944498)
(F14)  Firmicutes, Clostridia, Clostridiales, Ruminococcaceae (GGID196307)
(F15)  Firmicutes, Clostridia, Clostridiales, Ruminococcaceae, Ruminococcus flavefaciens (GGID1122673)

Table 6.3: List of the top ranking features for omnivores and vegetarians in the 16S data collected from the American Gut Project detected using NPFS. The number followed by "F" indicates the order the OTU was selected and the "GGID" contains the Greengenes OTU ID from the taxonomic classification.

(feature rank)  Operational Taxonomic Unit Classification (OTU ID)
(F1)   Firmicutes, Clostridia, Clostridiales, Lachnospiraceae, Shuttleworthia (GGID4424924)
(F2)   Cyanobacteria, Oscillatoriophycideae, Chroococcales, Xenococcaceae, Chroococcidiopsis (GGID649518)
(F3)   Proteobacteria, Betaproteobacteria, Gallionellales, Gallionellaceae, Gallionella (GGID3239358)
(F4)   Firmicutes, Clostridia, Clostridiales (GGID176062)
(F5)   Firmicutes, Bacilli, Gemellales, Gemellaceae (GGID967433)
(F6)   Firmicutes, Erysipelotrichi, Erysipelotrichales, Erysipelotrichaceae, Erysipelothrix (GGID4478325)
(F7)   Firmicutes, Clostridia, Clostridiales, Lachnospiraceae (GGID183576)
(F8)   Firmicutes, Clostridia, Clostridiales, Clostridiaceae, Clostridium (GGID174688)
(F9)   Firmicutes, Clostridia, Clostridiales, Clostridiaceae (GGID1137375)
(F10)  Firmicutes, Clostridia, Clostridiales, Lachnospiraceae, Blautia (GGID305997)
(F11)  Firmicutes, Clostridia, Clostridiales, Lachnospiraceae (GGID288682)
(F12)  Proteobacteria, Gammaproteobacteria, Pasteurellales, Pasteurellaceae, Haemophilus (GGID995893)
(F13)  Bacteroidetes, Bacteroidia, Bacteroidales, Bacteroidaceae, Bacteroides (GGID4450198)
(F14)  Firmicutes, Clostridia, Clostridiales (GGID267502)
(F15)  Bacteroidetes, Bacteroidia, Bacteroidales, Bacteroidaceae, Bacteroides (GGID531722)

Table 6.4: List of the top five ranked Pfams as selected by Fizzy's Mutual Information Maximization (MIM) applied to MetaHit.

Rank        IBD features
feature 1   ABC transporter (PF00005)
feature 2   Phage integrase family (PF00589)
feature 3   Glycosyl transferase family 2 (PF00535)
feature 4   Acetyltransferase (GNAT) family (PF00583)
feature 5   Helix-turn-helix (PF01381)

Rank        Obese features
feature 1   ABC transporter (PF00005)
feature 2   MatE (PF01554)
feature 3   TonB dependent receptor (PF00593)
feature 4   Histidine kinase-, DNA gyrase B-, and HSP90-like ATPase (PF02518)
feature 5   Response regulator receiver domain (PF00072)

6.5.2 MetaHit

The top Pfams that maximize the mutual information for the MetaHit data set are shown in Table 6.4. It is known that in IBD patients the expression of the ABC transporter protein (PF00005, the first feature MIM selected for classifying IBD vs. no-IBD samples) is decreased, which limits the protection against various luminal threats [158]. The feature selection for IBD also identified glycosyl transferase (PF00535), whose alteration is hypothesized to result in recruitment of bacteria to the gut mucosa and increased inflammation [159,160]. The genotype of acetyltransferase (PF00583) also plays an important role in the pathogenesis of IBD, which is useful in the diagnostics and treatment of IBD [93]. It is not surprising that ABC transporter (PF00005) is also selected for obesity, since it is known to mediate fatty acid transport that is associated with obesity and insulin resistant states [161], as are ATPases (PF02518) that catalyze dephosphorylation reactions to release energy.

[Figure 6.4 appears here: evaluation time in seconds versus the number of selected features for JMI, mRMR, MIM, Lasso, NPFS-MIM, and RFC.]

Figure 6.4: Evaluation times of several feature selection algorithms on biological data.

6.5.3 Evaluation Times

Figure 6.4 shows the evaluation time of five feature selection algorithms and the number of features they select. Both lasso and NPFS can select the size of the relevant set, which is why they are represented as a single point. An interesting observation is that lasso selects very few features (compared to NPFS) and its evaluation time is nearly triple that of NPFS. MIM, as expected, has the fastest evaluation time because there is no calculation for redundancy, and the approaches that use redundancy (JMI and mRMR) take significantly longer to run.

6.6 Summary

Feature subset selection provides an avenue for rapid insight into the taxonomic or functional differences that can be found between different metagenomic or 16S phenotypes in an environmental study. We have presented information-theoretic feature subset selection and lasso implementations for biological data formats in Python that are compatible with those used by the Qiime software package. Furthermore, we have compared the results of our subset selection implementations on real-world 16S and metagenomic data, and we have compared our results to recent literature to ensure biological importance.

7. Conclusions

The primary scope of this thesis was to develop new scalable feature subset selection algorithms that are capable of processing large volumes of features and observations. Three families of algorithms were presented for addressing such problems, namely NPFS, SLSS, and OFSE, where NPFS and SLSS use filters and OFSE is an embedded approach to subset selection.

7.1 Contributions of this Work

Neyman-Pearson based feature selection (NPFS) is an approach for performing large scale, and potentially distributed, feature selection. This framework for feature subset selection (i) has theoretically justifiable properties, (ii) is classifier-independent, (iii) provides a computationally straightforward model to analyze large data, and (iv) can automatically detect the relevant feature set size. NPFS is highly parallelizable, and it fits into the MapReduce computational model. Another advantage of NPFS is that it does not restrict the user to any particular objective function; rather, the user can choose the function. Sequential Learning Subset Selection (SLSS) is developed to sequentially search the subspace of features using a multi-arm bandit algorithm to determine the importance of the features. SLSS is motivated by the fact that NPFS uses a generic subset selection approach to evaluate all feature variables at each iteration, and this evaluation can be too computationally intensive. SLSS still uses a generic subset selection approach; however, only a subspace is considered. SLSS can be implemented in parallel on distributed architectures such as Hadoop using the bag of little bootstraps. SLSS can also work with partial information, hence making it possible to address missing/incomplete data. Online Feature Selection with Bagging & Boosting is developed with the motivation that real-world applications are commonly associated with scenarios where: (i) not all data instances are available at the time of training, (ii) not all data can fit into memory during training, and/or (iii) not all feature values (e.g., measurements) are available for all data instances. Therefore, we extended the Online Feature Selection (OFS) approach [2], a recently introduced approach that uses partial feature information, by developing an online ensemble to make predictions.

The OFS approach assumes a linear model, which allows the $\ell_0$-norm of the parameter vector to be constrained to perform feature selection. We show that the ensemble model (OFSE) typically yields a smaller mistake rate than any single linear model, while maintaining the same level of sparsity. Thus, a B-sparse model ($\|\mathbf{w}\|_0 \leq B$ for a model $\mathbf{w}$) for an ensemble yields lower error rates than a B-sparse single model. Feature Selection Tools for Biological Data Formats are implemented in Python for performing feature selection with the widely used Biom and map file formats. We implemented an interface to the FEAST C library for information-theoretic feature selection. Furthermore, we provide implementations of NPFS and lasso for microbial ecologists to use with their data. All software implementations have been made available to the public.
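To make the B-sparse constraint concrete, the following is a minimal sketch (not the thesis implementation, which follows [2]) of the two operations behind OFSE: a mistake-driven linear update followed by hard truncation of the weight vector to its B largest-magnitude entries, and a simple averaged prediction over the ensemble members. The function names and the learning rate are illustrative.

```python
import numpy as np

def truncate(w, B):
    """Hard truncation: keep only the B largest-magnitude weights, so ||w||_0 <= B."""
    w_trunc = np.zeros_like(w)
    keep = np.argsort(np.abs(w))[-B:]
    w_trunc[keep] = w[keep]
    return w_trunc

def ofs_step(w, x, y, B, eta=0.2):
    """One online round for a single B-sparse linear learner: update on a mistake,
    then re-impose the sparsity constraint."""
    if y * np.dot(w, x) <= 0:          # prediction sign(w.x) disagreed with label y in {-1,+1}
        w = truncate(w + eta * y * x, B)
    return w

def ensemble_predict(weights, x):
    """OFSE-style prediction: sign of the averaged score of the ensemble members."""
    return np.sign(np.mean([np.dot(w, x) for w in weights]))
```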

7.2 Broader Impacts

The primary application area of this thesis was the study of the complex interactions within microbial communities, and understanding how they differ under various environmental conditions. At the core of the analysis, though, were the feature subset selection algorithms presented throughout this thesis. NPFS, SLSS, and OFSE are all approaches for performing feature subset selection that can be applied to a broad spectrum of applications. For example, the algorithms presented in this thesis can be applied to astronomical time series analysis [6], health care [4,5,79], proteomics [81], and text analysis [162].

7.3 Recommendations for Future Work

While this work provided some basic families of large scale filter-based subset selection approaches, there is still work that needs to be explored. The theoretical bounds on the SLSS regret are quite loose and do not involve $k$ or $\ell$. It would be worthwhile to understand the influence of these parameters on the regret. Furthermore, we observed that SLSS can appropriately identify the correct feature set size with our sampling procedure even with a poor choice of $\ell$ and $k$ (see Figure 4.11). It would be advantageous to explore how MapReduce can be used for filter-based subset selection, particularly SLSS, to extend the algorithms to large-scale distributed data. Embedded subset selection for massive data is typically more common due to improvements to optimization [2,10,12]; though, implementing filter-based subset selection approaches on massive data is still largely under-explored. Improvements to optimization techniques have the potential to move embedded subset selection to even larger scale data sets than we have evaluated in this thesis. Furthermore, optimizing B for OFSE would remove a free parameter and potentially lead to lower mistake rates.

Bibliography

[1] P. Auer, N. Cesa-Bianchi, and P. Fisher, “Finite-time analysis of the multi- armed bandit problem,” Machine Learning, vol. 47, pp. 235–256, 2002.

[2] J. Wang, P. Zhao, S. Hoi, and J. Zhu, “Online feature selection and its appli- cations,” IEEE Transactions on Knowledge and Data Engineering, 2013.

[3] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification. John Wiley & Sons, Inc., 2nd ed., 2001.

[4] Kaggle, "Heritage Provider Network Health Prize," 2014.

[5] E. Kolker and E. Kolker, “Healthcare analytics: Creating a prioritized improve- ment system with performance benchmarking,” Big Data, vol. 2, no. 1, 2014.

[6] P. Huijse, P. A. Estevez, P. Protopapas, J. C. Principe, and P. Zegers, “Compu- tational intelligence challenges and applications on large-scale astronomical time series databases,” Computational Intelligence Magazine, vol. 9, no. 3, pp. 27–39, 2014.

[7] National Research Council, Frontiers in Massive Data Analysis. National Academies Press, 2013.

[8] F. Meyer, D. Paarmann, M. D’Souza, R. Olson, E. M. Glass, M. Kubal, T. Paczian, A. Rodriguez, R. Stevens, A. Wilke, J. Wilkening, and R. A. Ed- wards, “The metagenomics RAST server – a public resource for the automatic phylogenetic and functional analysis of metagenomes,” BMC Bioinformatics, vol. 9, no. 386, 2008.

[9] Department of Energy, “DOE systems biology knowledge base,” 2013.

[10] Y. Zhai, Y.-S. Ong, and I. W. Tsang, “The Emerging “Big Dimensionality”,” Computational Intelligence Magazine, vol. 9, pp. 14–26, August 2014.

[11] G. Brown, A. Pocock, M.-J. Zhao, and M. Luján, "Conditional likelihood maximisation: A unifying framework for information theoretic feature selection," Journal of Machine Learning Research, vol. 13, pp. 27–66, 2012.

[12] Y. Wu, S. C. H. Hoi, and T. Mei, “Massive-scale online feature selection for sparse ultra-high dimensional data,” arXiv:1409.7794, 2014.

[13] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of Royal Statistics Society, vol. 58, no. 1, pp. 267–288, 1996.

[14] L. Breiman, "Random forests," Machine Learning, vol. 45, no. 1, pp. 5–32, 2001.

[15] G. Ditzler, R. Polikar, and G. Rosen, “A bootstrap based neyman–pearson test for identifying variable importance,” IEEE Transactions on Neural Networks and Learning Systems, vol. 26, no. 4, pp. 880–886, 2014.

[16] G. Ditzler, M. Austen, R. Polikar, and G. Rosen, “Scaling a subset selection approach via heuristics for mining massive datasets,” in IEEE Symposium on Computational Intelligence and Data Mining, 2014.

[17] G. Ditzler, J. LaBark, G. Rosen, and R. Polikar, “Improvements to scalable online feature selection using bagging and boosting,” in International Joint Conference on Neural Networks, 2015.

[18] G. Ditzler, Y. Lan, J.-L. Bouchot, and G. Rosen, Feature selection for metagenomic data analysis. Encyclopedia of Metagenomics, 2014.

[19] G. Ditzler, R. Polikar, and G. Rosen, “Information theoretic feature selection for high dimensional metagenomic data,” in International Workshop on Genomic Signal Processing and Statistics, 2012.

[20] G. Ditzler, R. Polikar, and G. Rosen, “Forensic identification using environ- mental samples,” in International Conference on Acoustics, Speech and Signal Processing, pp. 1861–1864, 2012.

[21] J.-L. Bouchot, W. Trimble, G. Ditzler, Y. Lan, and G. Rosen, Advances in machine learning for processing and comparison of metagenomic data. Compu- tational Systems Biology, Springer, 2014.

[22] I. Guyon, S. Gunn, M. Nikravesh, and L. A. Zadeh, Feature Extraction: Foun- dations and Applications. Springer, 2006.

[23] I. Guyon and A. Elisseeff, “An introduction to variable and feature selection,” Journal of Machine Learning Research, vol. 3, pp. 1157–1182, 2003.

[24] M. Kearns, Computational Complexity of Machine Learning. MIT Press, 1990.

[25] M. Kearns and U. Vazirani, An Introduction to Computational Learning Theory. MIT Press, 1994.

[26] V. Vapnik, The Nature of Statistical Learning Theory. Springer-Verilag, 2nd ed., 1999.

[27] A. Ng, "On feature selection: Learning with exponentially many irrelevant features as training examples," in International Conference on Machine Learning, 1998.

[28] Y. Saeys, I. Inza, and P. Larrañaga, "A review of feature selection techniques in bioinformatics," Oxford Bioinformatics, vol. 23, no. 19, pp. 2507–2517, 2007.

[29] C. Lazar, J. Taminau, S. Meganck, D. Steenhoff, A. Coletta, C. Molter, V. de Schaetzen, R. Duque, H. Bersini, and A. Nowé, "A survey on filter techniques for feature selection in gene expression microarray analysis," IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 9, no. 4, pp. 1106–1119, 2012.

[30] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, “Gene selection for can- cer classification using support vector machines,” Machine Learning, vol. 46, pp. 389–422, 2002.

[31] T. Nguyen, Z. Li, T. Silander, and T.-Y. Leong, "Online feature selection for model-based reinforcement learning," in International Conference on Machine Learning, 2013.

[32] X. Wu, K. Yu, W. Ding, H. Wang, and X. Zhu, “Online feature selection with streaming features,” IEEE Transactions on Pattern Analysis and Machine In- telligence, vol. 35, no. 5, pp. 1178–1192, 2013.

[33] R. Kohavi and G. John, “Wrappers for feature subset selection,” Artificial In- telligence, pp. 273–324, 1997.

[34] G. John, R. Kohavi, and K. Pfleger, “Irrelevant features and the subset selection problem,” in International Conference on Machine Learning, 1994.

[35] V. Vapnik, Statistical Learning Theory. Wily-Interscience, 1989.

[36] C. Burges, “A tutorial on support vector machines for pattern recognition,” Data Mining and Knowledge Discovery, vol. 2, no. 2, pp. 121–167, 1998.

[37] J. Huang, Y. Cai, and X. Xu, “A hybrid genetic algorithm for feature selection wrapper based on mutual information,” Pattern Recognition Letters, vol. 28, no. 13, pp. 1825–1844, 2007.

[38] V. Bolón-Canedo, N. Sánchez-Maroño, and A. Alonso-Betanzos, "A distributed wrapper approach for feature selection," in European Symposium on Artificial Neural Networks, pp. 173–178, 2013.

[39] I. Tsamardinos and C. Aliferis, “Towards principled feature selection: relevancy, filters and wrappers,” in International Workshop on Artificial Intelligence and Statistics, 2003.

[40] H. Zou and T. Hastie, "Regularization and variable selection via the elastic net," Journal of the Royal Statistical Society, vol. 67, no. 2, pp. 301–320, 2005.

[41] L. Wasserman, All of Statistics: A concise course in Statistical Inference. Springer, 2005.

[42] S. Haykin, Neural Networks and Learning Machines. Pearson, 2009.

[43] D. Knights, E. K. Costello, and R. Knight, “Supervised classification of human microbiota,” FEMS Microbiology Reviews, vol. 35, no. 2, pp. 343–359, 2011.

[44] M. Tan, I. W. Tsang, and L. Wang, “Towards ultrahigh dimensional feature selection for big data,” Journal of Machine Learning Research, vol. 15, pp. 1371– 1429, 2014.

[45] J. Weston, S. Mukherjee, O. Chapelle, M. Pontil, T. Poggio, and V. Vapnik, “Feature selection for SVMs,” in Advances in Neural Information Processing Systems, 2001.

[46] C. Manning, P. Raghavan, and H. Schütze, Introduction to Information Retrieval. Cambridge University Press, 2008.

[47] K. Kira and L. Rendell, “A practical approach to feature selection,” in National Conference on Artificial Intelligence, 1992.

[48] T. M. Cover and J. A. Thomas, Elements of Information Theory. Wiley- Interscience, 2006.

[49] T. Cormen, C. Leiserson, R. Rivest, and C. Stein, Introduction to Algorithms. MIT press, 1990.

[50] C. E. Shannon, “A mathematical theory of communication,” The Bell System Technical Journal, vol. 27, pp. 379–423, 1948.

[51] D. D. Lewis, “Feature selection and feature extraction for text categorization,” in Proceedings of the Workshop on Speech and Natural Language, pp. 212–217, 1992.

[52] H. Peng, F. Long, and C. Ding, “Feature selection based on mutual infor- mation: criteria of max–dependency, max–relevance, and min–redundancy,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 27, no. 8, pp. 1226–1238, 2005.

[53] W. Duch, K. Grąbczewski, T. Winiarski, J. Biesiada, and A. Kachel, "Feature selection based on information theory, consistency, and separability indices," in International Conference on Neural Information Processing, pp. 1951–1955, 2002.

[54] H. Almuallim and T. Dietterich, "Efficient algorithms for identifying relevant features," in Canadian Conference on Artificial Intelligence, 1992.

[55] G. Brown, “A new perspective for information theoretic feature selection,” in International Conference on Artificial Intelligence and Statistics, pp. 49–56, 2009.

[56] S. Iwata, L. Fleischer, and S. Fujishige, “A combinatorial strongly polynomial algorithm for minimizing submodular functions,” in Proceedings of the 32nd Annual ACM Symposium on Theory of Computing, pp. 96–107, 2000.

[57] A. Schrijver, “A combinatorial algorithm minimizing submodular functions in strongly polynomial time,” Journal of Combinatorial Theory, Series B, vol. 80, no. 2, pp. 346–355, 2000.

[58] A. Das and D. Kempe, “Submodular meets spectral: Greedy algorithms for sub- set selection, sparse approximation and dictionary selection,” in International Conference on Machine Learning, 2011.

[59] Y. Liu, K. Wei, K. Kirchoff, Y. Song, and J. Bilmes, “Submodular feature se- lection for high-dimensional acoustic score spaces,” in International Conference on Acoustics, Speech and Signal Processing, pp. 7184–7188, 2013.

[60] A. Krause and V. Cevher, “Submodular dictionary selection for sparse repre- sentation,” in International Conference on Machine Learning, 2010.

[61] J. Zhou, D. Foster, R. Stine, and L. Ungar, “Streamwise feature selection,” Journal of Machine Learning Research, vol. 7, pp. 1861–1885, 2006.

[62] J. Zhou, B. Stine, D. Foster, and L. Ungar, “Streamwise feature selection using alpha investing,” in SIGKDD, pp. 384–393, 2005.

[63] L. Yu and H. Liu, “Feature selection for high-dimensional data: A fast correlation-based filter solution,” in International Conference on Machine Learning, 2003.

[64] F. Min, Q. Hu, and W. Zhu, “Feature selection with test cost constraint,” International Journal of Approximate Reasoning, vol. 55, no. 1, pp. 167–179, 2014.

[65] S. C. H. Hoi, J. Wang, P. Zhao, and R. Jin, “Online feature selection for mining big data,” in International Workshop on Big Data, Streams and Heterogeneous Source Mining: Algorithms, Systems, Programming Models and Applications, 2012.

[66] H. Robbins, “Some aspects of the sequential design of experiments,” Bulletin of the American Mathematical Society, vol. 58, no. 5, pp. 527–535, 1952.

[67] J. M. White, Bandit Algorithms for Website Optimization. O'Reilly, 2012.

[68] S. Bubeck and N. Cesa-Bianchi, “Regret analysis of stochastic and nonstochas- tic multi-armed bandit problems,” Foundations and Trends in Machine Learn- ing, vol. 5, no. 1, pp. 1–122, 2012.

[69] R. Busa-Fekete and B. Kégl, "Fast boosting using adversarial bandits," in International Conference on Machine Learning, 2010.

[70] R. Schapire and Y. Freund, Boosting: Foundations and Algorithms. The MIT Press, 2012.

[71] Y. Freund and R. Schapire, "A decision-theoretic generalization of online learning and an application to boosting," Journal of Computer and System Sciences, vol. 55, pp. 119–139, 1997.

[72] R. Schapire and Y. Singer, "Improved boosting algorithms using confidence-rated predictions," Machine Learning, vol. 37, no. 3, pp. 297–336, 1999.

[73] T. Rückstieß, C. Osendorfer, and P. van der Smagt, "Sequential feature selection for classification," in Artificial Intelligence, vol. 7106, pp. 132–141, 2011.

[74] T. Rückstieß, C. Osendorfer, and P. van der Smagt, "Minimizing data consumption with sequential online feature selection," International Journal of Machine Learning and Cybernetics, vol. 4, no. 3, pp. 235–243, 2013.

[75] S. Kalyanakrishnan, A. Tewari, P. Auer, and P. Stone, “PAC subset selection in stochastic multi-armed bandits,” in International Conference on Machine Learning, 2012.

[76] S. Kalyanakrishnan and P. Stone, “Efficient selection of multiple bandit arms: Theory and practice,” in International Conference on Machine Learning, 2010.

[77] R. Gaudel and M. Sebag, “Feature selection as a one-player game,” in Interna- tional Conference on Machine Learning, 2010.

[78] M.-H. Z. Ashtiani, M. N. Ahmadabadi, and B. N. Araabi, “Bandit-based local feature subset selection,” Neurocomputing, 2014.

[79] M. Herland, T. M. Khoshgoftaar, and R. Wald, “A review of data mining using big data in health informatics,” Journal of Big Data, vol. 1, no. 2, 2014.

[80] B. Delessandro, “Bring the noise: Embracing randomness is the key to scaling up machine learning algorithms,” Big Data, vol. 1, no. 2, 2013.

[81] C. Hillman, Y. Ahmad, M. Whitehorn, and A. Cobley, “Near real-time process- ing of proteomics data using Hadoop,” Big Data, vol. 2, no. 1, 2014.

[82] F. Provost and T. Fawcett, "Data science and its relationship to big data and data driven decision making," Big Data, vol. 1, no. 1, 2014.

[83] A. Farahat, A. Elgohary, A. Ghodsi, and M. Kamel, “Greedy column subset selection for large-scale data sets,” Knowledge and Information Systems, 2014.

[84] A. Farahat, A. Elgohary, A. Ghodsi, and M. Kamel, “Distributed column subset selection on mapreduce,” in International Conference on Data Mining, 2013.

[85] P. Jaccard, "Étude comparative de la distribution florale dans une portion des Alpes et des Jura," Bulletin de la Société Vaudoise des Sciences Naturelles, vol. 37, pp. 547–579, 1901.

[86] L. I. Kuncheva, "A stability index for feature selection," in International Conference on Artificial Intelligence and Application, pp. 390–395, 2007.

[87] J. Lustgarten, V. Gopalakrishnan, and S. Visweswaran, “Measuring stability of feature selection in biomedical datasets,” in American Medical Informatics Association Annual Symposium, pp. 406—410, 2009.

[88] C. X. Ling, J. Huang, and H. Zhang, “AUC: A statistically consistent and more discriminating measure than accuracy,” in International Conference on Artificial Intelligence, pp. 329–341, 2003.

[89] T. Fawcett, “An introduction to ROC analysis,” Pattern Recognition Letters, vol. 27, pp. 861–874, 2006.

[90] S. Bengio, M. Keller, and J. Mariethoz, “The expected performance curve,” IDIAP Research Report IDIAP-RR 03-85, Dalle Molle Institute for Perceptual Artificial Intelligence, 2004.

[91] D. Knights, K. G. Lassen, and R. J. Xavier, “Advances in inflammatory bowel disease pathogenesis: linking host genetics and the microbiome,” Gut, vol. 62, no. 10, pp. 1505–1510, 2013.

[92] N. Wisittipanit, H. Rangwala, M. Sikaroodi, A. Keshavarzian, P. Gillevet, and E. Mutlu, “Svm-based classification and feature selection methods for the analy- sis of inflammatory bowel disease microbiome data,” in International Workshop on Data Mining in Bioinformatics, 2010.

[93] M. Baranska, R. Trzcinski, A. Dziki, M. Rychlik-Sych, M. Dudarewicz, and J. Skretkowicz, “The role of n-acetyltransferase 2 polymorphism in the etiopathogenesis of inflammatory bowel,” Digestive Diseases and Sciences, vol. 56, no. 7, pp. 2073–2080, 2011.

[94] P. Turnbaugh, M. Hamady, T. Yatsunenko, B. Cantarel, A. Duncan, R. Ley, M. Sogin, W. Jones, B. Roe, J. Affourtit, M. Egholm, B. Henrissat, A. Heath, R. Knight, and J. Gordon, "A core gut microbiome in obese and lean twins," Nature, vol. 475, pp. 480–485, 2009.

[95] A. Hartstra, K. Bouter, F. B¨ackhed, and M. Nieuwdorp, “Insights into the role of the microbiome in obesity and type 2 diabetes,” Diabetes Care, vol. 38, no. 1, pp. 159–165, 2015.

[96] N. Fierer, C. L. Lauber, N. Zhou, D. McDonald, E. K. Costello, and R. Knight, “Forensic identification using skin bacterial communities,” Proceedings of the National Academy of Sciences, vol. 107, no. 14, pp. 6477–6481, 2010.

[97] R. K. Aziz et al., “The RAST server: Rapid annotations using subsystems technology,” BMC Genomics, vol. 9, no. 75, 2008.

[98] J. Rousk, E. Bååth, P. C. Brookes, C. L. Lauber, C. Lozupone, J. G. Caporaso, R. Knight, and N. Fierer, "Soil bacterial and fungal communities across a pH gradient in an arable soil," ISME Journal, vol. 4, pp. 1340–1351, 2010.

[99] R. M. Bowers, S. McLetchie, R. Knight, and N. Fierer, “Spatial variability in airborne bacterial communities across land-use types and their relationship to the bacterial communities of potential source environments,” ISME Journal, vol. 5, pp. 601–612, 2011.

[100] S. Williamson, D. Rusch, S. Yooseph, A. Halpern, K. Heidelberg, J. Glass, C. Andrews-Pfannkoch, D. Fadrosh, C. Miller, G. Sutton, M. Frazier, and J. C. Venter, “The Sorcerer II global ocean sampling expedition: Metagenomic char- acterization of viruses within aquatic microbial samples,” PLoS Biology, no. 1, 2008.

[101] J. G. Caporaso, C. L. Lauber, E. K. Costello, D. Berg-Lyons, A. Gonzalez, J. Stombaugh, D. Knights, P. Gajer, J. Ravel, N. Fierer, J. I. Gordon, and R. Knight, “Moving pictures of the human microbiome,” Genome Biology, vol. 12, no. 5, 2011.

[102] E. K. Costello, C. L. Lauber, M. Hamady, N. Fierer, J. I. Gordon, and R. Knight, “Bacterial community variation in human body habitats across space and time,” Science, vol. 326, pp. 1694–1697, 2009.

[103] D. J. Jimenez, F. D. Andreote, D. Chaves, J. S. Montaña, C. Osorio-Forero, H. Junca, M. M. Zambrano, and S. Baena, "Structural and functional insights from the metagenome of an acidic hot spring microbial planktonic community in the Colombian Andes," PLoS One, vol. 7, no. 12, 2012.

[104] T. Varin, C. Lovejoy, A. D. Jungblut, W. F. Vincent, and J. Corbeil, “Metage- nomic analysis of stress genes in microbial mat communities from antarctica and the high arctic,” Applied and Environmental Microbiology, vol. 78, no. 2, pp. 549–559, 2012.

[105] I. Bodaker, I. Sharon, M. Suzuki, R. F. M. Shmoish, E. Andreishcheva, M. Sogin, M. Rosenberg, M. Maguire, S. Belkin, A. Oren, and O. B. O, "Comparative

community genomics in the dead sea: an increasingly extreme environment,” The ISME journal, vol. 4, no. 3, pp. 399–407, 2010.

[106] W. Li and A. Godzik, “Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences,” Bioinformatics, vol. 22, pp. 1658– 1659, 2006.

[107] W. Li, L. Jaroszewski, and A. Godzik, “Clustering of highly homologous se- quences to reduce the size of large protein databases,” Bioinfomatics, vol. 17, no. 3, pp. 282–283, 2001.

[108] T. DeSantis, P. Hugenholtz, K. Keller, E. Brodie, N. Larsen, Y. Piceno, R. Phan, and G. Andersen, “NAST: a multiple sequence alignment server for comparative analysis of 16s rrna genes,” Nucleic Acids Research, vol. 34, pp. W394–W399, 2006.

[109] J. R. Cole, Q. Wang, E. Cardenas, J. Fish, B. Chai, R. J. Farris, A. S. Kulam- Syed-Mohideen, D. M. McGarrell, T. Marsh, G. M. Garrity, and J. M. Tiedje, “The ribosomal database project: improved alignments and new tools for rRNA analysis,” Nucleic Acids Research, vol. 37, pp. 141–145, 2009.

[110] J. G. Caporaso, J. Kuczynski, J. Stombaugh, K. Bittinger, F. D. Bushman, E. K. Costello, N. Fierer, A. G. Peña, J. K. Goodrich, J. I. Gordon, G. A. Huttley, S. T. Kelley, D. Knights, J. E. Koenig, R. E. Ley, C. A. Lozupone, D. McDonald, B. D. Muegge, M. Pirrung, J. Reeder, J. R. Sevinsky, P. J. Turnbaugh, W. A. Walters, J. Widmann, T. Yatsunenko, J. Zaneveld, and R. Knight, "QIIME allows analysis of high-throughput community sequencing data," Nature Methods, vol. 7, pp. 335–336, 2010.

[111] A. Schubert, M. Rogers, C. Ring, J. Mogle, J. Petrosino, V. Young, D. Aronoff, and P. Schloss, “Microbiome data distinguish patients with Clostridium difficile infection and Non-C. difficile-associated diarrhea from healthy controls,” mBio, vol. 5, no. 3, pp. e01021–14, 2014.

[112] J. C. Clemente, L. K. Ursell, L. W. Parfrey, and R. Knight, “The impact of the gut microbiota on human health: An integrative view,” Cell, vol. 148, no. 16, pp. 1258–1270, 2012.

[113] B. D. Muegge, J. Kuczynski, D. Knights, J. Clemente, A. González, L. Fontana, B. Henrissat, R. Knight, and J. Gordon, "Diet drives convergence in gut microbiome functions across mammalian phylogeny and within humans," Science, vol. 332, no. 6032, pp. 970–974, 2011.

[114] T. Yatsunenko, F. E. Rey, M. J. Manary, I. Trehan, M. G. Dominguez-Bello, M. Contreras, M. Magris, G. Hidalgo, R. N. Baldassano, A. P. Anokhin, A. C. Heath, B. Warner, J. Reeder, J. Kuczynski, J. G. Caporaso, C. A. Lozupone,

C. Lauber, J. C. Clemente, D. Knights, R. Knight, and J. Gordon, “Human gut microbiome viewed across age and geography,” Nature, vol. 486, pp. 222–227, 2012.

[115] J. Gower, “Multivariate analysis and multidimensional geometry,” Journal of Royal Statistics Society, vol. 17, no. 1, pp. 13–28, 1967.

[116] M. Fernández-Delgado, E. Cernadas, S. Barro, and D. Amorim, "Do we need hundreds of classifiers to solve real world classification problems?," Journal of Machine Learning Research, vol. 15, pp. 3133–3181, 2014.

[117] M. Kursa and W. Rudnicki, “Feature selection with the boruta package,” Jour- nal of Statistical Software, vol. 36, no. 11, 2010.

[118] K. Aagaard, K. Riehle, J. Ma, N. Segata, T.-A. Mistretta, C. Coarfa, S. Raza, S. Rosenbaum, I. V. den Veyver, A. Milosavljevic, D. Gevers, C. Huttenhower, J. Petrosino, and J. Versalovic, “A metagenomic approach to characterization of the vaginal microbiome signature in pregnancy,” PLoS ONE, 2012.

[119] J. Carter, D. Beck, H. Williams, G. Dozier, and J. Foster, “GA-based selection of vaginal microbiome features associated with bacterial vaginosis,” in Genetic and Evolutionary Computation Conference, pp. 12–16, 2014.

[120] Z. Liu, W. Hsiao, B. Cantarel, E. F. Drábek, and C. Fraser-Liggett, "Sparse distance based learning for simultaneous multiclass classification and feature selection of metagenomic data," Oxford Bioinformatics, vol. 27, no. 23, 2011.

[121] A. Statnikov, M. Henaff, V. Narendra, K. Konganti, Z. Li, L. Yang, Z. Pei, M. J. Blaser, C. F. Aliferis, and A. V. Alekseyenko, “A comprehensive evaluation of multicategory classification methods for microbiomic data,” Microbiome, vol. 1, no. 11, 2013.

[122] A. Statnikov, A. Alekseyenko, Z. Li, M. Henaff, G. Perez-Perez, M. J. Blaser, and C. F. Aliferis, “Microbiomic signatures of psoriasis: Feasibility and method- ology comparison,” Scientific Reports, vol. 3, 2013.

[123] E. Garbarine, J. DePasquale, V. Gadia, R. Polikar, and G. Rosen, “Information- theoretic approaches to SVM feature selection for metagenome read classifica- tion,” Computational Biology and Chemistry, vol. 35, pp. 199–209, 2011.

[124] Y. Lan, A. Kriete, and G. Rosen, “Selecting age-related functional characteris- tics in the human gut microbiome,” Microbiome, vol. 1, no. 2, 2013.

[125] J. Dean and S. Ghemawat, "MapReduce: Simplified data processing on large clusters," in Sixth Symposium on Operating System Design and Implementation, 2004.

[126] Y. Yang and J. Pedersen, “A comparative study on feature selection in text categorization,” in International Conference on Machine Learning, 1997.

[127] J. Neyman and E. Pearson, “On the problem of the most efficient tests of sta- tistical hypotheses,” Philosophical Transactions of the Royal Society of London. Series A, vol. 231, pp. 289–337, 1933.

[128] A. Frank and A. Asuncion, “UCI machine learning repository,” 2010.

[129] L. Breiman, J. Friedman, R. Olshen, and C. Stone, Classification and Regression Trees. CRC Press, 1984.

[130] J. Demšar, "Statistical comparisons of classifiers over multiple data sets," Journal of Machine Learning Research, vol. 7, pp. 1–30, 2006.

[131] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. Schapire, "The non-stochastic multi-armed bandit problem," SIAM Journal on Computing, vol. 32, no. 1, pp. 48–77, 2002.

[132] O. Chapelle and L. Li, “An empirical evaluation of thompson sampling,” in Advances in Neural Information Processing Systems, 2011.

[133] A. Kleiner, A. Talwalkar, P. Sarkar, and M. I. Jordan, “The big data bootstrap,” in International Conference on Machine Learning, 2012.

[134] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. Schapire, "Gambling in a rigged casino: The adversarial multi-armed bandit problem," in 36th Annual Symposium on Foundations of Computer Science, 1995.

[135] J. Gama, I. Žliobaitė, A. Bifet, M. Pechenizkiy, and A. Bouchachia, "A survey on concept drift adaptation," ACM Computing Surveys, 2013.

[136] B. Matijašić, T. Obermajer, L. Lipoglavšek, I. Grabnar, G. Avguštin, and I. Rogelj, "Association of dietary type with fecal microbiota in vegetarians and omnivores in Slovenia," European Journal of Nutrition, vol. 53, no. 4, pp. 1051–1064, 2014.

[137] M. S. Kim, S. S. Hwang, E. J. Park, and J. W. Bae, “Strict vegetarian diet improves the risk factors associated with metabolic diseases by modulating gut microbiota and reducing intestinal inflammation,” Environmental Microbiology Reports, vol. 5, no. 5, pp. 765–775, 2013.

[138] J. Platt, Sequential Minimal Optimization: A Fast Algorithm for Training Sup- port Vector Machines. MIT Press, 1998.

[139] L. Bottou and C.-J. Lin, Support Vector Machine Solvers. MIT Press, 2006.

[140] R. Polikar, “Ensemble based systems in decision making,” IEEE Circuits and Systems Magazine, vol. 6, no. 3, pp. 21–45, 2006.

[141] L. I. Kuncheva, Combining Pattern Classifiers: Methods and Algorithms. John Wiley & Sons, Inc., 2004.

[142] N. Oza, Online Ensemble Learning. PhD thesis, University of California, Berkeley, 2001.

[143] G. Brown, J. Wyatt, and P. Tiňo, "Managing diversity in regression ensembles," Journal of Machine Learning Research, vol. 6, pp. 1621–1650, 2005.

[144] N. V. Chawla and J. Sylvester, “Exploiting diversity in ensembles: Improv- ing the performance on unbalanced datasets,” in International Workshop on Multiple Classifier Systems, pp. 397–406, 2007.

[145] G. Brown and L. I. Kuncheva, ““good” and “bad” diversity in majority vote ensembles,” in International Workshop on Multiple Classifier Systems, pp. 124– 133, 2010.

[146] L. I. Kuncheva and R. Kountchev, “Generating classifier outputs of fixed accu- racy and diversity,” Pattern Recognition Letters, vol. 23, pp. 593–600, 2002.

[147] A. Bifet, G. Holmes, R. Kirkby, and B. Pfahringer, “MOA: Massive online analysis,” Journal of Machine Learning Research, vol. 11, pp. 1601–1604, 2010.

[148] O. J. Dunn, “Multiple comparisons among means,” Journal of the American Statistical Association, vol. 52, pp. 52–64, 1961.

[149] D. McDonald, J. C. Clemente, J. Kuczynski, J. R. Rideout, J. Stombaugh, D. Wendel, A. Wilke, S. Huse, J. Hufnagle, F. Meyer, R. Knight, and J. G. Caporaso, “The biological observation matrix (BIOM) format or: how I learned to stop worrying and love the ome-ome,” GigaScience, vol. 1, June 2012.

[150] J. Qin, R. Li, J. Raes, M. Arumugam, K. S. Burgdorf, C. Manichanh, T. Nielsen, N. Pons, F. Levenez, T. Yamada, D. R. Mende, J. Li, J. Xu, S. Li, D. Li, J. Cao, B. Wang, H. Liang, H. Zheng, Y. Xie, J. Tap, P. Lepage, M. Bertalan, J. M. Batto, T. Hansen, D. Le Paslier, A. Linneberg, H. B. Nielsen, E. Pelletier, P. Renault, T. Sicheritz-Ponten, K. Turner, H. Zhu, C. Yu, M. Jian, Y. Zhou, Y. Li, X. Zhang, N. Qin, H. Yang, J. Wang, S. Brunak, J. Dore, F. Guarner, K. Kristiansen, O. Pedersen, J. Parkhill, J. Weissenbach, P. Bork, and S. D. Ehrlich, “A human gut microbial gene catalogue established by metagenomic sequencing,” Nature, vol. 464, pp. 59–65, 2010.

[151] The NIH HMP Working Group et al., "The NIH Human Microbiome Project," Genome Research, vol. 19, no. 12, pp. 2317–2323, 2009.

[152] J. Gilbert, F. Meyer, D. Antonopoulos, P. Balaji, C. T. Brown, C. Brown, N. Desai, J. A. Eisen, D. Evers, D. Field, W. Feng, D. Huson, J. Jansson, R. Knight, J. Knight, E. Kolker, K. Konstantindis, J. Kostka, N. Kyrpides, R. Mackelprang, A. McHardy, C. Quince, J. Raes, A. Sczyrba, A. Shade, and R. Stevens, "Meeting Report: The Terabase Metagenomics Workshop and the Vision of an Earth Microbiome Project," Standards in Genomic Sciences, vol. 3, no. 3, 2010.

[153] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.

[154] R. D. Finn, J. Mistry, J. Tate, P. Coggill, A. Heger, J. E. Pollington, O. L. Gavin, P. Gunesekaran, G. Ceric, K. Forslund, L. Holm, E. L. Sonnhammer, S. Eddy, and A. Bateman, "The Pfam protein families database," Nucleic Acids Research, vol. 38, pp. D211–222, 2010.

[155] M. Glick-Bauer and M.-C. Yeh, "The health advantage of a vegan diet: Exploring the gut microbiota connection," Nutrients, vol. 6, pp. 4822–4838, 2014.

[156] G. Wu, J. Chen, C. Hoffmann, K. Bittinger, Y.-Y. Chen, S. Keilbaugh, M. Bewtra, D. Knights, W. Walters, R. Knight, R. Sinha, E. Gilroy, K. Gupta, R. Baldassano, L. Nessel, H. Li, F. Bushman, and J. Lewis, "Linking long-term dietary patterns with gut microbial enterotypes," Science, vol. 334, no. 6052, pp. 105–108, 2011.

[157] S. Ruengsomwong, Y. Korenori, N. Sakamoto, B. Wannissorn, J. Nakayama, and S. Nitisinprasert, "Senior Thai fecal microbiota comparison between vegetarians and non-vegetarians using PCR-DGGE and real-time PCR," Journal of Microbiology and Biotechnology, vol. 24, no. 8, pp. 1026–1033, 2014.

[158] J. J. Deuring, M. P. Peppelenbosch, E. J. Kuipers, C. J. van der Woude, and C. de Haar, "Impeded protein folding and function in active inflammatory bowel disease," Biochem Soc Trans, vol. 39, pp. 1107–1111, 2011.

[159] E. Theodoratou, H. Campbell, N. T. Ventham, D. Kolarich, M. Pučić-Baković, V. Zoldoš, D. Fernandes, I. K. Pemberton, I. Rudan, N. A. Kennedy, M. Wuhrer, E. Nimmo, V. Annese, D. P. McGovern, J. Satsangi, and G. Lauc, "The role of glycosylation in IBD," Nature Reviews Gastroenterology and Hepatology, 2014.

[160] B. Campbell, L. Yu, and J. Rhodes, "Altered glycosylation in inflammatory bowel disease: a possible role in cancer development," Glycoconjugate Journal, vol. 18, no. 11–12, pp. 851–858, 2001.

[161] K. Ashrafi, Obesity and the regulation of fat metabolism. Worm Book, 2007.

[162] J.-B. Michel, Y. K. Shen, A. P. Aiden, A. Veres, M. Gray, W. Brockman, the Google Books Team, J. Pickett, D. Hoiberg, D. Clancy, P. Norvig, J. Orwant, S. Pinker, M. Nowak, and E. L. Aiden, “Quantitative analysis of culture using millions of digitized books,” Science, 2010.