RICE UNIVERSITY

An Empirical Study of Feature Selection in Binary Classification with DNA Microarray Data

by

Michael Louis Lecocke

A Thesis Submitted in Partial Fulfillment of the Requirements for the Degree

Doctor of Philosophy

Approved, Thesis Committee:

Dr. Rudy Guerra, Chairman
Professor, Rice University

Dr. Kenneth Hess, Adviser
Associate Professor, UT MD Anderson Cancer Center

Dr. Jeff Morris
Assistant Professor of Biostatistics, UT MD Anderson Cancer Center

Dr. David W. Scott
Noah Harding Professor of Statistics, Rice University

Dr. Devika Subramanian
Professor of Computer Science, Rice University

Houston, Texas
May, 2005

Abstract

An Empirical Study of Feature Selection in Binary Classification with DNA Microarray Data

by

Michael Louis Lecocke

Motivation: Binary classification is a common problem in many types of research, including clinical applications of gene expression microarrays. This research is comprised of a large-scale empirical study that involves a rigorous and systematic comparison of classifiers, in terms of learning methods and both univariate and multivariate feature selection approaches. Other principal areas of investigation involve the use of cross-validation (CV) and how to guard against the effects of optimism and selection bias when assessing candidate classifiers via CV. This is taken into account by ensuring that the feature selection is performed during training of the classification rule at each stage of a CV process ("external CV"), which to date has not been the traditional approach to performing cross-validation.

Results: A large-scale empirical comparison study is presented, in which a 10-fold CV procedure is applied internally and externally to a univariate as well as two genetic algorithm- (GA-) based feature selection processes. These procedures are used in conjunction with six supervised learning algorithms across six published two-class clinical microarray datasets. It was found that external CV generally provided more realistic and honest misclassification error rates than those from using internal CV. Also, although the more sophisticated multivariate FSS approaches were able to select gene subsets that went undetected via the combination of genes from even the top 100 univariately ranked gene list, neither of the two GA-based methods led to significantly better 10-fold internal or external CV error rates. Considering all the selection bias estimates together across all subset sizes, learning algorithms, and datasets, the average bias estimates from each of the GA-based methods were roughly 2.5 times that of the univariate-based method. Ultimately, this research has put to test the more traditional implementations of the statistical learning aspects of cross-validation and feature selection and has provided a solid foundation on which these issues can and should be further investigated when performing limited-sample classification studies using high-dimensional gene expression data.

Acknowledgements

I would like to first of all thank my adviser, Dr. Hess, for guiding me along this thesis path for the past couple of years, and helping me stay focused on not missing the forest for the trees throughout this process. From my literature review(s) on GBM's to the wonderful world of microarrays, it's been quite a ride. I would like to thank the rest of my thesis committee – Dr. Guerra, Dr. Morris, Dr. Scott, and Dr. Subramanian – for allowing me to pursue this type of large-scale empirical comparison study, as it was extremely interesting and proved to be a very useful project with what I feel are very practical and insightful results to contribute to the microarray classification literature! Jeff, my fellow Eagles fan and great friend, thank you especially for all your insights on everything from the GA to job thoughts. I would also like to thank Dr. Baggerly, Dr. Coombes, and especially James Martin for all their help in my understanding and implementation of the genetic algorithm. I would like to acknowledge my officemates over the years, from Rick to the "Sweatshop Crew of DH1041" – Ginger, Gretchen, Jason, and Chris – your patience has been tested and proven! I'd especially like to thank Rick for helping me keep my head above water during my first couple of years as a graduate student in the world of sigma fields, and for being an incredible friend. HG and Chris, the same goes to both of you as great friends and fellow survivors, especially the past couple of years. Finally, and most importantly: I'd like to thank my parents for giving me the wonderful opportunities over the years to be where I am today, my sister for the countless email exchanges that simply cannot be duplicated and that helped the days go by much better, and last but absolutely, positively, NEVER least, my wife Meredith – her love and support (and patience!) have been beyond what I could have ever imagined before, and without her, this road would have been immeasurably tougher and much less pleasantly traveled.

Contents

Abstract
Acknowledgements
List of Figures
List of Tables

1 Introduction
  1.1 Microarrays Overview
  1.2 Motivation
  1.3 Areas of Investigation

2 Background
  2.1 Introduction
  2.2 Supervised Learning: A General Overview
  2.3 Supervised Learning: Some Popular Existing Methods
      2.3.1 Standard Discriminant Analysis
      2.3.2 k-Nearest Neighbors
      2.3.3 Support Vector Machines
  2.4 Feature Subset Selection (FSS)
      2.4.1 Univariate Screening ("Filter") Approach to FSS
      2.4.2 Multiple Comparisons
      2.4.3 Multivariate Approach to FSS
      2.4.4 A Modular Multivariate Approach in an "Evolutionary" Way
  2.5 Assessing the Performance of a Prediction Rule: Cross-Validation
  2.6 Two Approaches to Cross-Validation: Internal and External CV
  2.7 Optimism Bias and Selection Bias

3 Previous Work & Results
  3.1 Introduction
  3.2 Published Dataset Descriptions
  3.3 Univariate Screening
  3.4 Multivariate Feature Selection
      3.4.1 MC and SFS Approaches
      3.4.2 Genetic Algorithms: GA+kNN
  3.5 What's Next: A Large-Scale Investigation

4 Univariate-Based FSS Results
  4.1 Introduction
      4.1.1 Supervised Learning Methods
      4.1.2 Feature Subset Selection
      4.1.3 Internal and External CV
      4.1.4 Single and Repeated-CV Runs
      4.1.5 Plot Breakdowns
  4.2 Preprocessing of Datasets
      4.2.1 An Initial Glimpse of the Datasets: Unsupervised Learning via Multidimensional Scaling
  4.3 Internal CV Results
  4.4 External CV Results
  4.5 Resubstitution, Internal & External CV, & Selection & Optimism Bias: A Closer Look at the Repeated-Run CV Approach
      4.5.1 Resubstitution, Internal CV, and External CV MER's
      4.5.2 Optimism Bias, Selection Bias, and Total Bias
  4.6 Final Thoughts

5 Multivariate-Based FSS Results
  5.1 Introduction
      5.1.1 Internal CV, External CV, and Repeated Runs with the GA
      5.1.2 Single- and Two-Stage GA-Based Approaches
      5.1.3 Genalg Files and Parameterization
      5.1.4 Plot Breakdowns
  5.2 Resubstitution, External & Internal CV, & Selection & Optimism Bias
      5.2.1 Resubstitution, Internal CV, and External CV MER's
      5.2.2 Optimism Bias, Selection Bias, and Total Bias
  5.3 Final Thoughts

6 Univariate or Multivariate: Comparing the Results
  6.1 Introduction
  6.2 Gene Selection: Univariate vs. Multivariate
  6.3 Head-to-Head CV MER Results
  6.4 Head-to-Head Optimism and Selection Bias Results
  6.5 Final Thoughts

7 Conclusions and Further Thoughts
  7.1 Learning Algorithms and Subset Sizes
  7.2 Feature Subset Selection Approaches
      7.2.1 Gene Selection
      7.2.2 CV Error Rates
  7.3 Internal CV vs. External CV
  7.4 Optimism and Selection Bias
  7.5 Impact
  7.6 Future Directions

A Other Results from Current Research

Bibliography

List of Figures

4.1 Multidimensional Scaling Plots for Each Dataset
4.2 1 × 10-Fold Internal CV w/ Univ FSS
4.3 10 × 10-Fold Internal CV w/ Univ FSS
4.4 1 × 10-Fold External CV w/ Univ FSS
4.5 10 × 10-Fold External CV w/ Univ FSS
4.6 10 × 10-Fold CV; Int CV vs. Ext CV; Univ FSS; Alon Data
4.7 10 × 10-Fold CV; Int CV vs. Ext CV; Univ FSS; Golub Data
4.8 10 × 10-Fold CV; Int CV vs. Ext CV; Univ FSS; Nutt Data
4.9 10 × 10-Fold CV; Int CV vs. Ext CV; Univ FSS; Pomeroy Data
4.10 10 × 10-Fold CV; Int CV vs. Ext CV; Univ FSS; Shipp Data
4.11 10 × 10-Fold CV; Int CV vs. Ext CV; Univ FSS; Singh Data
4.12 10 × 10-Fold CV w/ Univ FSS; Optimism Bias vs. Gene Subset Size
4.13 10 × 10-Fold CV w/ Univ FSS; Selection Bias vs. Gene Subset Size
4.14 10 × 10-Fold CV w/ Univ FSS; Total (Sel + Opt) Bias vs. Gene Subset Size
5.1 10-Fold CV; Int CV vs. Ext CV; 1- and 2-Stage GA FSS; Alon Data
5.2 10-Fold CV; Int CV vs. Ext CV; 1- and 2-Stage GA FSS; Golub Data
5.3 10-Fold CV; Int CV vs. Ext CV; 1- and 2-Stage GA FSS; Nutt Data
5.4 10-Fold CV; Int CV vs. Ext CV; 1- and 2-Stage GA FSS; Pomeroy Data
5.5 10-Fold CV; Int CV vs. Ext CV; 1- and 2-Stage GA FSS; Shipp Data
5.6 10-Fold CV; Int CV vs. Ext CV; 1- and 2-Stage GA FSS; Singh Data
5.7 10-Fold CV w/ 1-Stage GA FSS; Optimism Bias vs. Gene Subset Size
5.8 10-Fold CV w/ 2-Stage GA FSS; Optimism Bias vs. Gene Subset Size
5.9 10-Fold CV w/ 1-Stage GA FSS; Selection Bias vs. Gene Subset Size
5.10 10-Fold CV w/ 2-Stage GA FSS; Selection Bias vs. Gene Subset Size
5.11 10-Fold CV w/ 1-Stage GA FSS; Total (Sel + Opt) Bias vs. Gene Subset Size
5.12 10-Fold CV w/ 2-Stage GA FSS; Total (Sel + Opt) Bias vs. Gene Subset Size
6.1 10-Fold Ext & Int CV; Univ vs. 1- & 2-Stage GA FSS; Alon Data
6.2 10-Fold Ext & Int CV; Univ vs. 1- & 2-Stage GA FSS; Golub Data
6.3 10-Fold Ext & Int CV; Univ vs. 1- & 2-Stage GA FSS; Nutt Data
6.4 10-Fold Ext & Int CV; Univ vs. 1- & 2-Stage GA FSS; Pomeroy Data
6.5 10-Fold Ext & Int CV; Univ vs. 1- & 2-Stage GA FSS; Shipp Data
6.6 10-Fold Ext & Int CV; Univ vs. 1- & 2-Stage GA FSS; Singh Data
6.7 10-Fold CV; Univ, 1-, & 2-Stage GA FSS; Opt & Sel Bias vs. Subset Size; Alon Data
6.8 10-Fold CV; Univ, 1-, & 2-Stage GA FSS; Opt & Sel Bias vs. Subset Size; Golub Data
6.9 10-Fold CV; Univ, 1-, & 2-Stage GA FSS; Opt & Sel Bias vs. Subset Size; Nutt Data
6.10 10-Fold CV; Univ, 1-, & 2-Stage GA FSS; Opt & Sel Bias vs. Subset Size; Pomeroy Data
6.11 10-Fold CV; Univ, 1-, & 2-Stage GA FSS; Opt & Sel Bias vs. Subset Size; Shipp Data
6.12 10-Fold CV; Univ, 1-, & 2-Stage GA FSS; Opt & Sel Bias vs. Subset Size; Singh Data

List of Tables

2.1 Bonferroni Significance Levels Needed at Indiv Gene Level for Overall Level of 0.05
4.1 Optimism, Selection, & Total Bias Across All Subset Sizes; Univ FSS
5.1 Optimism, Selection, & Total Bias Across All Subset Sizes; 1-Stage GA
5.2 Optimism, Selection, & Total Bias Across All Subset Sizes; 2-Stage GA
6.1 Percentage of Genes Per Subset Size Not Within Top 100 Univ List (Feature Selection Based on All Samples)
6.2 Percentage of Genes Per Subset Size Selected from 2-Stage, but not Single-Stage, GA Process (Feature Selection Based on All Samples)
6.3 Min 10-Fold Int CV Avg MER's Across Classifiers: Univ FSS vs. 1- & 2-Stage GA FSS
6.4 Min 10-Fold Ext CV Avg MER's Across Classifiers: Univ FSS vs. 1- & 2-Stage GA FSS
6.5 Int CV, Ext CV, Resub MER: Empir Grand Means; Univ FSS vs. 1- & 2-Stage GA FSS
6.6 Opt & Sel Bias: Empir Grand Means; Univ FSS vs. 1- & 2-Stage GA FSS
A.1 Raw and Adjusted P-values for Top 25 Genes, All Samples; Alon Data
A.2 Raw and Adjusted P-values for Top 25 Genes, All Samples; Golub Data
A.3 Raw and Adjusted P-values for Top 25 Genes, All Samples; Nutt Data
A.4 Raw and Adjusted P-values for Top 25 Genes, All Samples; Pomeroy Data
A.5 Raw and Adjusted P-values for Top 25 Genes, All Samples; Shipp Data
A.6 Raw and Adjusted P-values for Top 25 Genes, All Samples; Singh Data
A.7 Gene Selection Based on All Samples: Alon Data (a)
A.8 Gene Selection Based on All Samples: Alon Data (b)
A.9 Gene Selection Based on All Samples: Golub Data (a)
A.10 Gene Selection Based on All Samples: Golub Data (b)
A.11 Gene Selection Based on All Samples: Nutt Data (a)
A.12 Gene Selection Based on All Samples: Nutt Data (b)
A.13 Gene Selection Based on All Samples: Pomeroy Data (a)
A.14 Gene Selection Based on All Samples: Pomeroy Data (b)
A.15 Gene Selection Based on All Samples: Shipp Data (a)
A.16 Gene Selection Based on All Samples: Shipp Data (b)
A.17 Gene Selection Based on All Samples: Singh Data (a)
A.18 Gene Selection Based on All Samples: Singh Data (b)

Chapter 1

Introduction

1.1 Microarrays Overview

DNA microarray technology has greatly influenced biomedical research, with the hope of significantly impacting the diagnosis and treatment of disease.

Microarrays have the ability to measure the expression levels of thousands of genes simultaneously. They measure how much of a given type of messenger RNA (mRNA) is being made in a tissue sample at a given time, which gives a good idea of how much of a corresponding protein is produced. Hence, a "signature" of a tumor can be obtained from the readings of mRNA abundance in the tumor cells. The wealth of gene expression data that has become available for microarray data analysis has introduced a number of statistical questions to tackle. Some questions are targeted towards various preprocessing stages of a microarray experiment, such as RNA hybridization to arrays, image processing, and normalization, while others are geared towards assessing differential expression and identifying profiles for classification and prediction. Within the framework of tumor classification, the types of goals that have been explored include discovering or identifying previously unknown tumor classes, classifying tumors into previously known classes, and identifying "marker genes" that characterize various tumor classes. The focus of this research is targeted not towards the statistical issues involved during various preprocessing stages of a microarray experiment, but instead towards the issue of feature subset selection (i.e., variable selection) – in particular, feature subset selection within the framework of a binary classification problem.

1.2 Motivation

This research is composed of a large-scale empirical analysis focused on the comparison of several popular supervised learning techniques in conjunction with several feature (gene) subset selection approaches within the context of binary classification of microarray data. The motivation behind this research is to obtain a comprehensive understanding of a variety of popular supervised learning methods as well as both univariate and multivariate feature selection methods, as applied in a binary classification setting with gene expression data - a type of comprehensive analysis that has not been conducted to this extent, and that would offer valuable insights regarding how to most effectively and honestly conduct microarray analysis research (namely feature selection and classification).

With respect to the feature selection aspect of this research, several issues should be noted. First off, the prediction rule may not even be able to be formed using all p variables (e.g., if using Fisher's linear discriminant analysis). Even if all the variables could be taken into account in forming the prediction rule, some of them may possess minimal (individual) discriminatory power, potentially inhibiting the performance of the prediction rule when applied to new (unclassified) tumors. Also, it has been reported that as model complexity is increased with more genes added to a given model, the proportion of training samples (tissues) misclassified may decrease, but the misclassification rate of new samples (generalization error) would eventually begin to increase; this latter effect being the product of overfitting the model with the training data [19, 26, 36, 40, 41]. The motivation for performing a multivariate feature selection technique (namely, a genetic algorithm (GA)-based approach) is grounded in the fact that the merits of implementing multivariate feature selection in the context of microarrays in general have been given relatively little attention compared to the much more widespread use of univariate approaches such as the simple T-test. In terms of ease of implementation and computation, T-tests applied on a gene-by-gene basis have of course been preferred over multivariate feature selection approaches.

However, there are other considerations that should be given more attention than has been given in the past with respect to feature subset selection within the context of microarrays. Because univariate approaches can only consider a single gene at a time, the possibility of detecting sets of genes that together jointly discriminate between two classes of patients is greatly reduced. After all, the gene subsets formed from a ranked list of the top X univariately significant genes may not include genes that are not discriminatory in a univariate sense yet still offer independent prognostic information when considered jointly with other genes. In implementing a GA-based search technique, the potential to select combinations of genes that are jointly discriminatory would be greater than if one combined individually predictive genes from a univariate screening method. Part of this research includes a study on how effective the more sophisticated GA-based feature selection approaches really are in detecting discriminatory genes that would be otherwise undetected among the top X genes selected by univariate screening methods. Discovery of key genes needed for accurate prediction could pave the way to better understand class differences at the molecular level, which could hopefully provide more information about how to select important biomarkers to be used in the development of clinical trials for predicting outcome and various forms of treatment.

Ultimately, with a collection of genes that has high discriminatory power, an effective prediction rule can be developed based on these genes and used to allocate subsequent unclassified tissue samples to one of two classes (e.g., cancer or normal, or perhaps one of two subtypes of a particular cancer). Regarding the formation of prediction rules, aside from selecting an appropriate feature selection approach and classification technique, there is also the need to assess the candidate prediction rules in an effective and honest manner. A customary approach to estimate the error rate of a prediction rule would be to apply the rule to a "held-out" test set randomly selected from among a training set of samples. However, with microarray data, one usually does not have the luxury of withholding part of a dataset as an independent test set, as a result of the small number of samples (usually between 10 and 100, significantly smaller than the thousands of genes involved). As an alternative, cross-validation (CV) is very often used. With microarray classification problems, the practice has generally been to perform CV only on the classifier construction process, not taking into account feature selection. Leaving out feature selection from the CV process will inevitably lead to problems with selection bias (i.e., with overly optimistic error rates), as the feature selection would not be based on the particular training samples used for each CV stage. To prevent this from happening, the feature selection should be performed based only on those samples set aside as training samples at each stage of the CV process, external to the test samples at each stage. This issue constitutes a prominent area of investigation within this research.
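To make the distinction concrete, the following minimal sketch (not the thesis code) contrasts "internal" CV, where genes are screened once using all samples, with "external" CV, where the screen is redone inside each training fold. It assumes scikit-learn-style tools and a toy random dataset; the univariate F-test screen (equivalent to a two-sample t-test for two classes) and the 3-NN classifier are illustrative stand-ins.

```python
# Minimal sketch contrasting internal vs. external CV for feature selection.
# Assumes scikit-learn; X is a samples x genes matrix with binary labels y.
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(60, 2000))          # toy stand-in for a microarray dataset
y = rng.integers(0, 2, size=60)

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
clf = KNeighborsClassifier(n_neighbors=3)

# Internal CV: feature selection sees the test folds (optimistically biased).
X_reduced = SelectKBest(f_classif, k=50).fit_transform(X, y)
internal_mer = 1 - cross_val_score(clf, X_reduced, y, cv=cv).mean()

# External CV: the gene screen is refit on each training fold only.
pipe = Pipeline([("select", SelectKBest(f_classif, k=50)), ("knn", clf)])
external_mer = 1 - cross_val_score(pipe, X, y, cv=cv).mean()

print(f"internal CV MER: {internal_mer:.3f}  external CV MER: {external_mer:.3f}")
```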

Overall, as a result of this research conducted over multiple published microarray datasets, one may be able to determine whether the success of the results obtained is really a product more of the structure of the data or of the classification process itself - a question that today remains unresolved. To address this overriding question, however, other statistical questions involved in microarray classification research, which to date remain largely unsettled, are investigated. Although many of these were discussed above and in the previous section, the following section outlines them within the framework of two "research phases."

1.3 Areas of Investigation

• Phase I: Exhaustive and comprehensive analysis of multiple public datasets

  – How to go about a systematic comparison of feature selection techniques and learning methods for 2-class microarray datasets?

  – Implementation of univariate (rank-based, unequal-variance T-test) and multivariate (two variations of GA) approaches to feature subset selection (FSS)

  – Implementation of various learning algorithms in conjunction with FSS (e.g., SVM, DLDA, k-NN (k = 1, 3, 7, 15))

  – Performance evaluation: 10-fold cross-validation

    ∗ With respect to FSS, consider classification and FSS performed per CV run (inclusion of FSS in the CV process, serving as a safeguard against selection bias)

    ∗ Consider both single-run and repeated runs of CV

• Phase II: Reflection & Interpretation

  – Modularizing – What is the best marriage (if any) among FSS, learning algorithm, and gene subset size?

  – Should a univariate or multivariate feature subset selection (FSS) approach be implemented? In a resubstitution setting (i.e., training set only), do the more sophisticated GA-based approaches actually detect discriminatory genes that would be otherwise undetected among the top X genes of a univariate screen? Can one deduce from the data structure which type of feature selection approach to use?

  – What type of supervised learning algorithm would be best suited to a particular dataset? Can one deduce from the data structure which type of approach to use?

  – Is there a gene subset size that leads to the smallest predictive errors of a given classification process across a series of datasets, or perhaps on a dataset-by-dataset basis?

  – Is there a particular combination of learning algorithm and feature selection that consistently works best among a series of datasets, or does the best combination depend heavily on the particular dataset (where the notion of "best" refers to lowest error rates, based on gene subset sizes that are as minimal as possible)?

  – What effect does building the feature selection process into each stage of a 10-fold CV approach to assessing the predictive accuracy have? Is there a selection bias incurred from not building the feature selection into the CV process?

• Plan:

  – Develop strategies for a rigorous & systematic comparison of supervised learning algorithms and feature selection techniques

  – Develop classification schemes that are efficient, accurate, and honest (i.e., consider the effect of selection bias, if possible)

  – Reach conclusions regarding prediction rules that could be generalizable to other microarray datasets

  – Provide fertile ground upon which a number of other interesting research problems can be investigated in the future

Chapter 2

Background

2.1 Introduction

First of all, a brief explanation of some key underlying aspects of this research is provided. In particular, some space is given to summarizing both the notion of supervised learning in general as well as several popular techniques of supervised learning that are used in this research (results of which are discussed in Chapter 4). Next, some comments on feature (variable) subset selection (FSS) are provided. Following this is some general information on two general approaches to FSS, namely univariate ("filter") methods and multivariate methods. It is in this discussion of multivariate FSS methods that some of the basic ideas behind genetic algorithms are provided. Concluding the background material of this research are sections that discuss the importance of cross-validation as a means of assessing prediction rules formed for performing classification using microarray data.

2.2 Supervised Learning: A General Overview

Although unsupervised learning techniques (clustering) have been among the most widely applied methods for analyzing gene expression data in classification problems, supervised learning approaches have become increasingly popular in recent years. Some basic notions and several popular techniques of supervised learning are discussed in this section.

To begin with, gene expression data for $p$ genes over each of $N$ mRNA samples can be expressed as an $N \times p$ matrix $X = (x_{ij})$ ($i = 1, \ldots, N$ and $j = 1, \ldots, p$). Each value $x_{ij}$ corresponds to the expression level for gene $j$ in sample $i$. Each sample has associated with it a gene expression profile $x_i = (x_{i1}, x_{i2}, \ldots, x_{ip}) \in \mathbb{R}^p$, along with its class designation $y_i$ (response, or dependent variable), which is one of $K$ predefined and unordered values among $\{k_1, k_2, \ldots, k_K\}$; for this study, the setting is binary classification, so $y_i \in \{0, 1\}$. Using the observed measurements $X$, a classifier for $K$ classes is thus a mapping $G : \mathbb{R}^p \rightarrow \{0, 1, \ldots, K-1\}$, where $G(x)$ denotes the predicted class, $\hat{y} = k$, for a sample with feature vector $x$.

The samples already known to belong to certain classes, $\mathcal{L} = \{(x_1, y_1), (x_2, y_2), \ldots, (x_{n_L}, y_{n_L})\}$, constitute the training (or learning) set. The training set is used to construct a classifier, which is then used to predict the classes of an independent set of samples (the test set $\mathcal{T} = \{(x_1, y_1), (x_2, y_2), \ldots, (x_{n_T}, y_{n_T})\}$). This way, the class predictions $\hat{y}_i$ ($i = 1, 2, \ldots, n_T$) for each test set expression profile $x_i$ can be made. Of course, with the true classes $y_i$ ($i = 1, 2, \ldots, n_T$) of the test set known, a misclassification error rate (MER) can then be computed.
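The short numpy sketch below (illustrative only; the random data and the nearest-mean rule are placeholders, not methods from this research) shows the objects just described: an $N \times p$ expression matrix, a learning set, an independent test set, and the resulting MER.

```python
# Minimal numpy sketch of the setup above: an N x p expression matrix, a
# training/test split, and a misclassification error rate (MER).
import numpy as np

rng = np.random.default_rng(1)
N, p = 40, 500
X = rng.normal(size=(N, p))       # rows: samples, columns: genes
y = rng.integers(0, 2, size=N)    # binary class labels y_i in {0, 1}

train = np.arange(0, 30)          # learning set L (assumes both classes appear here)
test = np.arange(30, N)           # independent test set T

# Placeholder classifier: assign each test profile to the class with nearest mean.
means = np.vstack([X[train][y[train] == k].mean(axis=0) for k in (0, 1)])
dists = ((X[test][:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
y_hat = dists.argmin(axis=1)

mer = np.mean(y_hat != y[test])   # misclassification error rate on T
print(f"test-set MER: {mer:.3f}")
```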

2.3 Supervised Learning: Some Popular Existing Methods

This section includes discussion of several popular types of supervised learning techniques that are implemented in this research. These are also methods that have been widely used not only with respect to microarray classification, but also in many other applications of statistical learning.

2.3.1 Standard Discriminant Analysis

Fisher's technique of linear discriminant analysis (LDA) [16] was one of the earliest formal statistical methods to ever be developed, and is still widely used today, some 69 years later. Fisher's LDA merely searches for a "sensible" rule to discriminate between classes, by searching for the linear discriminant function $\mathbf{a}'\mathbf{x}$ that maximizes the ratio of the between-groups sum of squares to the within-groups sum of squares. This ratio is given by $\mathbf{a}'B\mathbf{a} / \mathbf{a}'W\mathbf{a}$, where $B$ and $W$ represent the $p \times p$ matrices of between-groups and within-groups sums of squares, respectively. If $\mathbf{a}$ is the vector that maximizes the above ratio, the function $\mathbf{a}'\mathbf{x}$ is known as Fisher's linear discriminant function, or the first canonical variate. Mardia et al. [25] show that the vector $\mathbf{a}$ in Fisher's linear discriminant function is the eigenvector of $W^{-1}B$ corresponding to the largest eigenvalue. In general, the matrix $W^{-1}B$ has no more than $m = \min(p, K-1)$ non-zero eigenvalues $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_m$, with corresponding linearly independent eigenvectors $v_1, v_2, \ldots, v_m$ denoting the first, second, and subsequent canonical variates.

For gene expression levels $x = (x_1, x_2, \ldots, x_p)$, the (squared) Euclidean distance is given below, in terms of the discriminant variables $u_l = x v_l$, $l = 1, 2, \ldots, m$, from the $1 \times p$ vector of averages $\bar{x}_k$ (for class $k$), for the training set $\mathcal{L}$:

$$ d_k(x) = \sum_{l=1}^{m} \left( (x - \bar{x}_k)\, v_l \right)^2 \qquad (2.1) $$

The predicted class for expression profile $x$ is the class whose mean vector is nearest $x$ in the space of discriminant variables, and is described below:

$$ C(x, \mathcal{L}) = \arg\min_k d_k(x). \qquad (2.2) $$

One should refer to Mardia et al. [25] to see how Fisher's discriminant function can also arise in a parametric setting. In particular, for $K = 2$ classes, Fisher's LDA results in the same classifier as that derived from the maximum likelihood discriminant rule for multivariate normal class densities with equal covariance matrices.
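A rough numpy sketch of the computation just described is given below: the discriminant directions are taken as eigenvectors of $W^{-1}B$, and a new profile is assigned to the class whose mean is nearest in the space of discriminant variables (Equations 2.1 and 2.2). It assumes $W$ is invertible, which generally fails when $p$ far exceeds the number of samples; this is one reason gene filtering precedes LDA in practice. The function names are illustrative, not from the thesis.

```python
# Illustrative numpy sketch of Fisher's LDA (assumes W is invertible).
import numpy as np

def fisher_lda_fit(X, y):
    """Return discriminant directions (eigenvectors of W^{-1} B) and class means."""
    classes = np.unique(y)
    grand_mean = X.mean(axis=0)
    W = np.zeros((X.shape[1], X.shape[1]))
    B = np.zeros_like(W)
    for k in classes:
        Xk = X[y == k]
        mk = Xk.mean(axis=0)
        W += (Xk - mk).T @ (Xk - mk)                                   # within-groups SS
        B += len(Xk) * np.outer(mk - grand_mean, mk - grand_mean)      # between-groups SS
    evals, evecs = np.linalg.eig(np.linalg.solve(W, B))
    order = np.argsort(evals.real)[::-1][: len(classes) - 1]
    V = evecs.real[:, order]                                           # canonical variates
    class_means = np.vstack([X[y == k].mean(axis=0) for k in classes])
    return V, class_means, classes

def fisher_lda_predict(x, V, class_means, classes):
    """Assign x to the class whose mean is nearest in the discriminant-variable space."""
    d = ((x @ V - class_means @ V) ** 2).sum(axis=1)   # Equation (2.1)
    return classes[np.argmin(d)]                       # Equation (2.2)
```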

Dudoit et al. [15] provide some information on classification rules when the class conditional densities $\Pr(x \mid y = k)$ are already known. In a situation of this nature, there is no need for a training set, and the class of an expression profile $x = (x_1, x_2, \ldots, x_p)$ is predicted as shown below by the maximum likelihood discriminant rule, in which the predicted class is the one that gives the biggest likelihood to $x$:

$$ C(x) = \arg\max_k \Pr(x \mid y = k). \qquad (2.3) $$

It should be noted that a training set could be necessary even if the distributional forms are known, to estimate the parameters of the class conditional densities. In this case, the rule becomes the sample maximum likelihood discriminant rule, and a training set is used to obtain the sample mean vectors and covariance matrices; that is, $\hat{\mu}_k = \bar{x}_k$ and $\hat{\Sigma}_k = S_k$. If a constant covariance matrix is used, the pooled estimate is used as follows:

$$ \hat{\Sigma} = \sum_k \frac{(n_k - 1)\, S_k}{n - K}. \qquad (2.4) $$

The maximum likelihood discriminant rule for multivariate normal class conditional densities $(x \mid y = k) \sim N(\mu_k, \Sigma_k)$ is a quadratic discriminant rule, as shown below:

$$ C(x) = \arg\min_k \left\{ (x - \mu_k)' \Sigma_k^{-1} (x - \mu_k) + \log |\Sigma_k| \right\}. \qquad (2.5) $$

Several special cases of the multivariate normal rule are given by Dudoit et al. [15]. Each is based on a particular choice of covariance matrix for the class conditional densities.

• If the densities have an identical covariance matrix, $\Sigma$, the rule is linear and based on the square of the Mahalanobis distance [25]:

$$ C(x) = \arg\min_k (x - \mu_k)' \Sigma^{-1} (x - \mu_k). \qquad (2.6) $$

• If the densities have the same diagonal covariance matrix $\Sigma = \mathrm{diag}(\sigma_1^2, \sigma_2^2, \ldots, \sigma_p^2)$, the rule is known as the diagonal linear discriminant rule (DLDA):

$$ C(x) = \arg\min_k \sum_{i=1}^{p} \frac{(x_i - \mu_{ki})^2}{\sigma_i^2}. \qquad (2.7) $$

• If the densities have diagonal covariance matrices $\Sigma_k = \mathrm{diag}(\sigma_{k1}^2, \sigma_{k2}^2, \ldots, \sigma_{kp}^2)$, the rule is known as the diagonal quadratic discriminant rule (DQDA):

$$ C(x) = \arg\min_k \sum_{i=1}^{p} \left[ \frac{(x_i - \mu_{ki})^2}{\sigma_{ki}^2} + \log \sigma_{ki}^2 \right]. \qquad (2.8) $$
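A small numpy sketch of the DLDA and DQDA rules in Equations 2.7 and 2.8 follows; it is illustrative only (the function and variable names are not from the thesis). The pooled gene-wise variance corresponds to Equation 2.4 restricted to the diagonal.

```python
# Illustrative sketch of DLDA (Eq. 2.7) and DQDA (Eq. 2.8) with estimated parameters.
import numpy as np

def diagonal_discriminant_fit(X, y):
    classes = np.unique(y)
    means = np.vstack([X[y == k].mean(axis=0) for k in classes])
    var_k = np.vstack([X[y == k].var(axis=0, ddof=1) for k in classes])   # DQDA variances
    n_k = np.array([(y == k).sum() for k in classes])
    pooled = ((n_k - 1)[:, None] * var_k).sum(axis=0) / (len(y) - len(classes))  # DLDA variances
    return classes, means, var_k, pooled

def dlda_predict(x, classes, means, pooled):
    scores = (((x - means) ** 2) / pooled).sum(axis=1)
    return classes[np.argmin(scores)]

def dqda_predict(x, classes, means, var_k):
    scores = (((x - means) ** 2) / var_k + np.log(var_k)).sum(axis=1)
    return classes[np.argmin(scores)]
```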

2.3.2 k-Nearest Neighbors

The k-nearest neighbors methodology is based on a distance function that describes the "closeness" of training points to a particular observation in the test set. The general idea, in a very simple case, is that for any test point $x_0$, one finds the $k$ training points $x_{(i)}$, $i = 1, 2, \ldots, k$ closest in distance to the test point and then makes a classification based on majority vote among the $k$ nearest neighbors [19]. The notion of "closeness" implies some sort of metric, so for simplicity it can be Euclidean, Mahalanobis, or really any distance metric in the feature space:

$$ d_{\mathrm{eucl}}(x_i, x_j) = \sqrt{ \sum_g w_g (x_{gi} - x_{gj})^2 } \qquad (2.9) $$

where $w_g = 1$ for (unstandardized) Euclidean distance, $w_g = 1/s_g^2$ for standard-deviation-standardized distance, and $w_g = 1/R_g^2$ for range-standardized distance;

$$ d_{\mathrm{maha}}(x_i, x_j) = \sqrt{ (x_i - x_j)'\, S^{-1} (x_i - x_j) } \qquad (2.10) $$

where $S$ is any $p \times p$ positive definite matrix (usually the sample covariance matrix of the $p$ variables); if $S = I_p$, $d_{\mathrm{maha}} = d_{\mathrm{eucl}}$. Dudoit et al. [15] describe this methodology for microarray data based on a different distance function, $1 - $ the correlation. That is, for two gene expression profiles $x = (x_1, x_2, \ldots, x_p)$ and $x' = (x_1', x_2', \ldots, x_p')$, their correlation is given by

$$ r_{x, x'} = \frac{ \sum_{i=1}^{p} (x_i - \bar{x})(x_i' - \bar{x}') }{ \sqrt{ \sum_{i=1}^{p} (x_i - \bar{x})^2 } \, \sqrt{ \sum_{i=1}^{p} (x_i' - \bar{x}')^2 } }. \qquad (2.11) $$

Again, one first determines the $k$ nearest observations within the training set and then makes the class prediction based on which class is the most common (e.g., has the highest proportion) out of the $k$ closest observations. Cross-validation can be used to determine $k$. For example, with leave-one-out CV, the distance from each of the training set observations to the remainder of the training set observations is computed, and a class prediction can be made for each observation. With the true classes known, one can compare the prediction with truth, obtain a cross-validation error rate, and ultimately keep the value of $k$ that yields the smallest error rate. More on cross-validation can be found in Section 2.5. Another issue to keep in mind is that standard k-NN algorithms equally weight all the $k$ neighbors. However, by assigning distance weights to each of the neighbors based on their distance from the test sample (where weighting is done inversely proportional to distance from the test sample), a more sensitive rule can be obtained [14]. As discussed in Ripley [32], however, this type of weighting scheme has proven to be controversial. Overall, several choices must be made with respect to k-NN, including what distance function to use, what number of neighbors to use, whether or not to weight the votes based on distance, and of course what features to include in the classifier.
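The following sketch (scikit-learn based, not the thesis implementation) illustrates choosing $k$ by leave-one-out CV as described above, using the $1 - $ correlation distance of Equation 2.11 as a user-supplied metric; the toy data and the candidate values of $k$ are illustrative assumptions.

```python
# Illustrative choice of k for k-NN by leave-one-out CV with a 1 - correlation metric.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

def one_minus_correlation(u, v):
    return 1.0 - np.corrcoef(u, v)[0, 1]

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 200))
y = rng.integers(0, 2, size=50)

best_k, best_err = None, np.inf
for k in (1, 3, 7, 15):
    # weights="distance" would give the inverse-distance-weighted variant mentioned above.
    knn = KNeighborsClassifier(n_neighbors=k, metric=one_minus_correlation)
    err = 1 - cross_val_score(knn, X, y, cv=LeaveOneOut()).mean()
    if err < best_err:
        best_k, best_err = k, err
print(f"chosen k = {best_k}, LOOCV error = {best_err:.3f}")
```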

2.3.3 Support Vector Machines

The power of support vector machines (SVM's) [37] lies in their ability to map input vectors into a (possibly) higher dimension, in which the data can be separated in linear (or nonlinear) fashion using a separating hyperplane. SVM's always look for a globally optimized solution and avoid over-fitting, so they potentially have the advantage of being able to deal with high-dimensional datasets. An important feature of SVM's is that the separating hyperplane can be determined without defining the actual feature space. In the binary classification framework with training data $\mathcal{L} = \{(x_1, y_1), (x_2, y_2), \ldots, (x_{n_L}, y_{n_L})\}$, with class labels $y_i \in \{-1, 1\}$, $i = 1, 2, \ldots, n_L$ ($\{-1, 1\}$ just used as an example for binary labels), the idea is to use a maximum margin separating hyperplane between the positive and negative samples in a higher-dimensional feature space. The "margin" is defined as the minimum distance from the hyperplane to the nearest data instance of each class. Hence, the "maximum margin" separating hyperplane is that which maximizes the margin and can be completely defined by a linear combination of the input vectors, each of which is multiplied by some weight. The hyperplane is defined by

$$ w \cdot x + b = 0 \qquad (2.12) $$

where the vector $w$ defines a direction perpendicular to the hyperplane and $b$ is the bias of the plane from the origin (i.e., varying $b$ moves the hyperplane parallel to itself). The hyperplane of Equation 2.12 satisfies the following conditions:

$$ x_i \cdot w + b > 0 \;\;\text{if } y_i = 1 \quad \text{and} \quad x_i \cdot w + b < 0 \;\;\text{if } y_i = -1, \qquad i = 1, 2, \ldots, n_L \qquad (2.13) $$

So, combining the two equations above, an equivalent decision surface can be obtained as

$$ y_i (x_i \cdot w + b) - 1 \ge 0, \qquad i = 1, 2, \ldots, n_L. \qquad (2.14) $$

The hyperplane that optimally separates the data into two classes can be shown to be the one that minimizes the functional $\frac{\|w\|^2}{2}$ (where $\|w\|$ represents the Euclidean norm of $w$), so the optimization can be reformulated into an equivalent unconstrained problem using Lagrangian multipliers. As a quadratic optimization problem, then, the functional would be the following, with the $\alpha_i$'s the Lagrange multipliers:

$$ L(w, b, \alpha) = \frac{\|w\|^2}{2} - \sum_{i=1}^{n_L} \alpha_i y_i (x_i \cdot w + b) + \sum_{i=1}^{n_L} \alpha_i \qquad (2.15) $$

Minimizing with respect to $w$ and $b$, the solution can be shown to be $w_0 = \sum_{i=1}^{n_L} y_i \alpha_i x_i$, which, inserted into Equation 2.15, yields the following, which has to be maximized with respect to the constraints $\alpha_i \ge 0$:

$$ W(\alpha) = \sum_{i=1}^{n_L} \alpha_i - \frac{1}{2} \sum_{i=1}^{n_L} \sum_{j=1}^{n_L} \alpha_i \alpha_j y_i y_j (x_i \cdot x_j) \qquad (2.16) $$

Once the solution has been found (i.e., $\alpha^0 = (\alpha_1^0, \alpha_2^0, \ldots, \alpha_{n_L}^0)$), the optimal separating hyperplane can be found to be $w_0 = \sum_{\text{support vectors}} y_i \alpha_i^0 x_i$ and $b_0 = -\frac{1}{2}\, w_0 \cdot (x_r + x_s)$, where $x_r$ and $x_s$ are any support vectors from the two classes. Finally, the classifier can be constructed as shown below (note that only the vectors $x_i$ which lead to non-zero Lagrangian multipliers $\alpha_i^0$ are referred to as "support vectors"):

$$ f(x) = \mathrm{sign}(w_0 \cdot x + b_0) = \mathrm{sign}\!\left( \sum_{\text{support vectors}} y_i \alpha_i^0 (x_i \cdot x) + b_0 \right) \qquad (2.17) $$

In the event the data are not separable, slack variables $\xi_i$ can be introduced to measure the amount by which the constraints are violated. Again the margin is maximized, now taking into account a penalty proportional to the amount of constraint violation. Formally, $\frac{\|w\|^2}{2} + C\left(\sum_i \xi_i\right)$ is minimized subject to

$$ y_i (x_i \cdot w + b) \ge 1 - \xi_i, \qquad i = 1, 2, \ldots, n_L, \quad \xi_i \ge 0 \qquad (2.18) $$

where $C$ is a parameter chosen a priori, defining the cost of constraint violation. As before, the Lagrangian is formed as follows:

$$ L(w, b, \alpha) = \frac{\|w\|^2}{2} + C\left(\sum_{i=1}^{n_L} \xi_i\right) - \sum_{i=1}^{n_L} \alpha_i y_i (x_i \cdot w + b) + \sum_{i=1}^{n_L} (\alpha_i - \xi_i) - \sum_{i=1}^{n_L} \xi_i \qquad (2.19) $$

where the $\alpha_i$ and $\xi_i$ are associated with the constraints in Equation 2.18 and $0 \le \alpha_i \le C$. The solution is determined by the saddle point of this Lagrangian in a similar fashion as before.

Finally, in the event that the decision surface is non-linear, SVM's can perform non-linear mapping of the input vectors into a higher-dimensional space by specifying a non-linear mapping a priori. The extension to non-linear boundaries is achieved through the use of kernel functions (rather than dot products between two data instances as before). Popular choices of kernel functions are [19]:

• $d$th-degree polynomial: $K(x, x_i) = (1 + (x \cdot x_i))^d$

• radial basis: $K(x, x_i) = \exp\!\left( -\|x - x_i\|^2 \right)$

• neural network: $K(x, x_i) = \tanh(\kappa_1 (x \cdot x_i) + \kappa_2)$, where $\kappa_1$ and $\kappa_2$ are user-defined.

Using Lagrange multiplier analysis similar to Equations 2.15 and 2.16, replacing the dot product with a given kernel function $K(x_i, x_j)$, the classifier is given by

$$ f(x) = \mathrm{sign}\!\left( \sum_{\text{support vectors}} y_i \alpha_i^0 K(x_i, x) + b_0 \right). \qquad (2.20) $$

For more details on SVM theory or further references, the reader is referred to [13] or [37].
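As a usage illustration only (scikit-learn's soft-margin SVC, not the thesis code), the sketch below fits SVMs with kernels corresponding to those listed above and reports 10-fold CV error on a toy dataset; here $C$ plays the role of the cost parameter for constraint violations, and the kernel parameters are library defaults rather than choices made in this research.

```python
# Illustrative comparison of SVM kernels via scikit-learn's SVC.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(3)
X = rng.normal(size=(60, 300))
y = rng.integers(0, 2, size=60)
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

for name, svm in {
    "linear": SVC(kernel="linear", C=1.0),
    "poly (degree 3)": SVC(kernel="poly", degree=3, C=1.0),
    "radial basis": SVC(kernel="rbf", gamma="scale", C=1.0),
    "sigmoid (tanh)": SVC(kernel="sigmoid", C=1.0),
}.items():
    err = 1 - cross_val_score(svm, X, y, cv=cv).mean()
    print(f"{name:16s} 10-fold CV error: {err:.3f}")
```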

2.4 Feature Subset Selection (FSS)

In general, feature (variable) selection is an extremely important aspect of classification problems, since the features selected are used to build the classifier. Careful consideration should be given to the problem of feature subset selection with high-dimensional data. With respect to microarray data, this of course amounts to reducing the number of genes used to construct a prediction rule for a given learning algorithm. To borrow terminology from the literature, there are two basic methodologies for the problem of feature subset selection – a "wrapper" (multivariate) approach and a "filter" (univariate) approach. In the former, the feature selection criterion depends on the learning algorithm used to build the prediction rule, while in the latter, the selection criterion is independent of the prediction rule. One should note that although wrapper methods could likely perform better than straightforward univariate approaches, they could do so at the risk of eventually finding a gene subset that performs well on the test set by chance alone. There is also of course the negative aspect of significantly larger amounts of computation time being used with wrapper approaches (especially if used in a cross-validation setting). Kohavi and John provide more extensive insights on these two approaches [22].

There are several reasons for performing feature reduction. First of all, whereas two variables could be considered good predictors individually, there could be little to gain by combining more than one variable together in a feature vector, as a result of potential high correlation between the variables. It has been reported that as model complexity is increased with more genes added to a given model, the proportion of training samples (tissues) misclassified may decrease, but the misclassification rate of new samples (generalization error) would eventually begin to increase; this latter effect being the product of overfitting the model with the training data [19, 26, 36, 40, 41]. Further, if another technology will be used to implement the gene classifier in practice (e.g., to develop diagnostic assays for selected subsets of genes), the cost incurred is often a function of the number of genes. Along these lines, one should keep in mind that genes selected for their discriminatory power among two classes should later be validated for biological relevance through further experimental studies. In this way, the process of feature selection can function as an excellent way to begin identifying differentially expressed subsets of genes for future clinical research. Finally, there is the obvious issue of increased computational cost and complexity as more and more features are included.

2.4.1 Univariate Screening (“Filter”) Approach to FSS

Filter approaches have for years been the most common methodology used in statistics. The features are selected based solely on the training data during a preprocessing stage (i.e., prior to running the learning algorithm). In doing so, they don't take into account any biases resulting from the learning algorithm. A couple of filtering feature selection methods were implemented on various gene expression datasets in the study of Xiong et al. [41]. One of the approaches that has been used extensively has been the simple t-test, used to measure the degree of gene expression difference between two types of samples. In general, the top K genes in terms of T-statistics are retained for use in the discriminant analyses. Another type of filtering performed in the study of Xiong et al. [41] was based on the prediction strength first proposed by Golub et al. [18]. Using the means and standard deviations of the (log) expression levels of each gene $g$ in the cancerous and normal tissue samples, the $K$ genes with highest (i.e., most informative) correlation strength statistic, given by

$$ P(g) = \frac{\mu_1(g) - \mu_2(g)}{s_1(g) + s_2(g)}, \qquad (2.21) $$

were retained for use in the discriminant analyses. In the comparison study of discrimination methods by Dudoit et al. [15], another filtering scheme used was based on the ratio of the between-groups to within-groups sum of squares of the genes. This statistic was used to take into account a large number of genes that exhibited nearly constant expression levels across observations (samples). That is, for gene $j$,

$$ \frac{BSS(j)}{WSS(j)} = \frac{ \sum_i \sum_k I(y_i = k)\, (\bar{x}_{kj} - \bar{x}_{.j})^2 }{ \sum_i \sum_k I(y_i = k)\, (x_{ij} - \bar{x}_{kj})^2 } \qquad (2.22) $$

where $\bar{x}_{.j}$ denotes the average gene expression level of gene $j$ across all samples and $\bar{x}_{kj}$ denotes the average gene expression level of gene $j$ across samples belonging to class $k$. This gene selection method operates by considering only the $p$ genes that have the largest $BSS/WSS$ statistic for use in the classification algorithm. They also used a variant of Golub's PS statistic to compare with their filtering scheme (for more details on this variant, see [15]). The important thing to keep in mind, however, is that with these filter-based methods the effectiveness of the selected feature subsets is not directly measured by the learning algorithm to which they are applied. Furthermore, any possible effects among combinations of genes are not able to be captured with this methodology.
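The sketch below (illustrative, not the thesis code) computes the three univariate screens just described for every gene: the two-sample t-statistic, Golub's prediction strength $P(g)$ of Equation 2.21, and the $BSS/WSS$ ratio of Equation 2.22, and then returns the indices of the top-ranked genes. Class labels are assumed to be 0/1, with columns of X ordered as genes.

```python
# Illustrative univariate ("filter") screens: t-statistics, Golub PS, and BSS/WSS.
import numpy as np
from scipy import stats

def filter_scores(X, y):
    X0, X1 = X[y == 0], X[y == 1]
    # Two-sample t-statistics (unequal variances), one per gene.
    t_stat, _ = stats.ttest_ind(X1, X0, axis=0, equal_var=False)
    # Golub et al. prediction strength: (mu1 - mu2) / (s1 + s2), Equation (2.21).
    ps = (X1.mean(axis=0) - X0.mean(axis=0)) / (
        X1.std(axis=0, ddof=1) + X0.std(axis=0, ddof=1))
    # Between-groups / within-groups sums of squares, Equation (2.22).
    grand = X.mean(axis=0)
    bss = sum(Xk.shape[0] * (Xk.mean(axis=0) - grand) ** 2 for Xk in (X0, X1))
    wss = sum(((Xk - Xk.mean(axis=0)) ** 2).sum(axis=0) for Xk in (X0, X1))
    return np.abs(t_stat), np.abs(ps), bss / wss

def top_k_genes(scores, k=100):
    return np.argsort(scores)[::-1][:k]   # indices of the k highest-scoring genes
```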

2.4.2 Multiple Comparisons

In performing microarray analyses in which univariate-based feature selections are implemented to identify genes that are significantly differentially expressed between two conditions (say, normal and tumor), one important issue to keep in mind is that of multiple comparisons. Perhaps the best way of discussing this is in terms of multiple hypothesis testing, in which hypothesis tests are to be performed on all genes of a given microarray to determine whether each one is differentially expressed or not. As in classical hypothesis testing, a null hypothesis would be formed to compare with an alternative hypothesis. In gene expression studies, the null hypothesis would be that there is no change in expression levels between two (or more) experimental conditions, and the alternative would then be that there is a change. If the test statistic falls into some pre-selected rejection region, then the null hypothesis would be rejected in favor of the alternative. When testing each individual gene for differential gene expression, two types of errors could be committed. When the test statistic is significant, but the gene is not truly differentially expressed, a Type I error would be committed (false positive). When the test statistic is not significant, but the gene is truly differentially expressed, a Type II error would be committed (false negative). In the context of multiple hypothesis testing, though, the situation is very complicated, since each gene would have potential Type I and II errors. Further, one should determine how to measure the overall error rate.

More precisely, take the significance level $\alpha$ to be the acceptable probability of a Type I error. When a t-statistic for a gene is more extreme than the threshold $t_\alpha$, the gene would be called differentially expressed. However, if this occurred just as a result of random effects, a Type I error would be committed (with probability $\alpha$). If no mistake is made, though, the correct conclusion for that given gene would occur with probability:

$$ \Pr(\text{correct}) = 1 - p \qquad (2.23) $$

Now, taking into account the multiple comparisons, since there are, say, $G$ total genes being tested, the ultimate goal would be to make the correct conclusion for all the genes. The probability of making the correct decision for all the genes (assuming the events are independent) would thus be:

$$ \Pr(\text{all correct}) = (1 - p) \cdot (1 - p) \cdots (1 - p) = (1 - p)^G \qquad (2.24) $$

Hence, the probability of being wrong for at least one of the genes would be the complement of the above probability:

$$ \Pr(\text{wrong somewhere}) = 1 - (1 - p)^G \qquad (2.25) $$

This would also represent the significance level of the entire experiment, often referred to as the family-wise error rate (FWER). Rewriting Equation 2.25, one has:

$$ \alpha_{Gl} = 1 - (1 - \alpha_g)^G \qquad (2.26) $$

where $\alpha_{Gl}$ represents the probability of a Type I error at the overall (global) level and $\alpha_g$ represents the Type I error at the individual gene level. Ultimately one wants to determine the value of $\alpha_g$ such that the global Type I error is no bigger than $\alpha_{Gl}$. A traditional correction that has been used for multiple comparisons is that of Bonferroni [10, 11]. For small $\alpha_g$, he noted that the first two terms of the binomial expansion of $(1 - \alpha_g)^G$ could be used to approximate Equation 2.26:

$$ \alpha_{Gl} = 1 - (1 - \alpha_g)^G = 1 - (1 - G \cdot \alpha_g + \cdots) \approx G \cdot \alpha_g \qquad (2.27) $$

Hence, the Bonferroni correction for multiple comparisons would be given as:

$$ \alpha_g = \frac{\alpha_{Gl}}{G} \qquad (2.28) $$

However, the Bonferroni correction is unsuitable for microarray analyses because of the large number of genes involved, which causes the required individual gene significance level to decrease at a very fast rate (see Table 2.1).

Table 2.1: Bonferroni Significance Levels Needed at Individual Gene Level to Ensure Overall Significance Level of 0.05

    Gene Subset Size    Bonferroni
    1                   0.05
    10                  0.005
    25                  0.002
    50                  0.001
    100                 0.0005
    1000                0.00005
    10000               0.000005
    15000               0.0000033

Clearly, if a gene is still significant after the Bonferroni correction, then it is truly a differentially expressed gene between two groups. However, if a gene is not significant after this correction, it could still be truly differentially expressed.

The Holm step-down group of methods is less conservative than the Bonferroni approach. Here, the genes are arranged in order of increasing p-value (arising from a simple T-test between two groups such as normal and tumor), and successively smaller adjustments are made on a gene-by-gene basis. That is, the threshold is not unique for all genes, a la the Bonferroni correction. Each gene has its own $p_i$ value, corresponding to the probability that its test statistic occurred by chance alone (under a true null hypothesis that the average expression between normal and tumor patients is the same). The adjusted p-values depend on each gene's position in the ordered list of raw (uncorrected) p-values. The thresholds are given as $\frac{\alpha_{Gl}}{G}$ for the first gene, $\frac{\alpha_{Gl}}{G-1}$ for the second gene, and so on until $\frac{\alpha_{Gl}}{1}$ for the last gene (with highest raw p-value). The null hypotheses $H_i$, $i = 1, 2, \ldots, k$ are rejected, where $k$ is the largest $i$ for which the following holds:

$$ p_i < \frac{\alpha_{Gl}}{G - i + 1} \qquad (2.29) $$

Both Bonferroni and Holm-based corrections for multiple comparisons assume the genes are independent, however. To take into account the often complex dependencies among genes in an organism, the false discovery rate (FDR) procedure of Benjamini and Yekutieli [9] allows for some dependencies among genes. Similar to the Holm step-down method, the FDR approach again orders the genes in order of increasing p-values. However, now the thresholds are based also on the proportion $p_0$ of null hypotheses $H_i$ that are actually true. Of course, this is not known, so $p_0$ can be conservatively estimated as 1 (meaning all null hypotheses are actually true, i.e., no differentially expressed genes exist). The thresholds are given as $p_i < \frac{i}{G}\, \alpha_{Gl}\, p_0$ for the $i$th p-value. Hence, they would be (assuming $p_0 = 1$) as follows: $\frac{1}{G}\alpha_{Gl}$ for the first gene, $\frac{2}{G}\alpha_{Gl}$ for the second gene, and so on until $\alpha_{Gl}$ for the last gene (with highest raw p-value). The null hypotheses for those genes with p-value lower than their threshold would be rejected. Thus, the null hypotheses $H_i$, $i = 1, 2, \ldots, k$ are rejected, where $k$ is the largest $i$ for which $p_i < \frac{i}{G}\, \alpha_{Gl}$.
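A compact sketch of the three threshold rules discussed so far is given below: Bonferroni, the Holm thresholds of Equation 2.29, and the FDR thresholds $p_i < \frac{i}{G}\alpha_{Gl}$ with $p_0 = 1$. It is illustrative only, and the step-down/step-up stopping rules are implemented in the simple form described in the text.

```python
# Illustrative Bonferroni, Holm, and FDR rejection calls from a vector of raw p-values.
import numpy as np

def reject_calls(p_raw, alpha=0.05):
    p_raw = np.asarray(p_raw, dtype=float)
    G = len(p_raw)
    order = np.argsort(p_raw)
    p_sorted = p_raw[order]
    i = np.arange(1, G + 1)

    bonferroni = p_sorted < alpha / G
    holm = p_sorted < alpha / (G - i + 1)
    if not holm.all():                       # step-down: stop at the first failure
        holm[np.argmin(holm):] = False
    fdr = p_sorted < (i / G) * alpha
    if fdr.any():                            # step-up: reject up to the largest passing i
        fdr[: np.max(np.nonzero(fdr)) + 1] = True

    # Map the decisions back to the original gene order.
    out = {}
    for name, calls in (("bonferroni", bonferroni), ("holm", holm), ("fdr", fdr)):
        mask = np.zeros(G, dtype=bool)
        mask[order] = calls
        out[name] = mask
    return out
```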

Finally, a more general permutation-based method of adjusting for multiple comparisons was proposed by Westfall and Young [39]. Their approach fully takes into account all dependencies among genes, which of course is important for highly correlated genes. The method proceeds by initially randomly changing the measurements between the two groups (or, by randomly permuting the labels). New p-values are computed based on the new arrangements, and the values are corrected using the Holm step-down procedure discussed previously. This procedure of re-labeling and testing is repeated thousands to tens of thousands of times. A final p-value for gene $i$ is given as the proportion of times the t-statistic based on the original labels, $t_i$, is less than or equal to the test statistic from a random permutation, as shown below:

$$ \text{p-value(gene } i) = \frac{ \text{Number of permutations for which } u_j^{(b)} \ge t_i }{ \text{Total number of permutations} } \qquad (2.30) $$

where the $u_j^{(b)}$ are the corrected values as done in Holm's method for permutation $b$. Although this approach takes into account dependencies among genes, its main disadvantage is that it is an empirical process that requires a very large amount of computation and time to run, especially as the number of permutations is increased (a minimum of 1000 is often suggested).
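The sketch below illustrates the permutation idea behind Equation 2.30 in a simplified form: per-gene permutation p-values are computed by comparing each observed t-statistic with t-statistics recomputed under randomly permuted labels. The Holm-based step-down adjustment that Westfall and Young apply across genes is deliberately omitted here, so this is not the full procedure.

```python
# Simplified per-gene permutation p-values (not the full Westfall-Young step-down).
import numpy as np
from scipy import stats

def permutation_pvalues(X, y, n_perm=1000, seed=0):
    rng = np.random.default_rng(seed)
    t_obs, _ = stats.ttest_ind(X[y == 1], X[y == 0], axis=0, equal_var=False)
    exceed = np.zeros(X.shape[1])
    for _ in range(n_perm):
        y_perm = rng.permutation(y)                      # randomly re-label the samples
        t_perm, _ = stats.ttest_ind(X[y_perm == 1], X[y_perm == 0],
                                    axis=0, equal_var=False)
        exceed += np.abs(t_perm) >= np.abs(t_obs)        # count exceedances per gene
    return exceed / n_perm
```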

2.4.3 Multivariate Approach to FSS

The idea behind this methodology is to incorporate feature selection more directly into a supervised learning algorithm and use the marriage of these two tasks as a more accurate tool for biomarker identification and the creation of an optimal classification rule. All genes are potential candidates, such that one isn't working from a filtered list of genes based on univariate screening. Instead, the feature selection can be accomplished in such a way that all individual genes as well as potentially all combinations of genes can be considered and evaluated in terms of how well they classify within the context of the particular learning algorithm. Multivariate approaches to feature selection are often referred to as "wrapper" approaches in the machine learning literature, since the result of the learning algorithm is used by the feature selection method to assess the effectiveness of the feature subset. The feature selection method generates feature subsets in an iterative manner, and these subsets are considered "candidate" solutions. Selected qualities of the candidate solutions are then evaluated, and the iterative process continues until a prespecified termination criterion is reached. Of course, since the learning algorithm is asked to be run with all sets of features considered, there is a trade-off in that computation time could become very intensive with multivariate feature selection methods.

Two multivariate methods were implemented in the Xiong et al. study [41]. The first one was a Monte Carlo (MC) method in which $n$ randomly selected subsets of size $K$ were obtained (where $n$ was taken to be 200,000 and $K$ started at 1). $K$ was increased by 1 each time, with the MC process continuing until a prespecified classification accuracy threshold was reached, or until a prespecified subset size $K$ was reached. The second type of method implemented in this study was a stepwise forward selection (SFS) procedure, in which all possible combinations of two genes were initially considered, with the top pair in terms of classification accuracy being retained. Next, the classification accuracy was determined based on these two genes and each of the remaining genes (hence composing a triplet of genes). The gene that led to the highest classification accuracy was then included in the steadily increasing optimal subset of genes, and the process continued by adding one gene at a time until a prespecified classification accuracy threshold was reached, or until a prespecified subset of $K$ genes was obtained. A modification of this stepwise algorithm, the sequential floating forward selection algorithm (SFFS), attempts to take into account the "nesting effect" problem, whereby once a particular feature is included in the optimal subset, it cannot be discarded later. For more details, SFFS is discussed in Xiong et al. [41] and further in Pudil et al. [31]. It should be noted that both backward and forward selection wrapper approaches were also implemented in the study conducted by Ambroise and McLachlan [7].

One multivariate feature selection method that has received quite a bit of attention, and one I intend to investigate as an integral part of this research, is an evolutionary algorithm known as a genetic algorithm [17], first proposed by John Holland [20]. Genetic algorithms are discussed in much more detail in Section 2.4.4.

Ultimately, the hope with multivariate feature selection methods is to avoid the limitations of univariate approaches; namely, that among some chosen gene subsets from univariate screening, there could be many genes that happen to be highly correlated with one another, offering little relevant information. Further, the selection of these genes could prevent the inclusion of other genes that may have lesser individual significance in terms of differential expression, but when considered together with other genes, form a group that is significantly differentially expressed between two classes. Some genes may be highly differentially expressed within a particular subcategory of tumor but not in another. That is, samples of one class may be at different developmental stages of a cancer, which could cause some (very informative) t-statistics to be quite small and go unnoticed. Multivariate feature selection methods such as genetic algorithms could pick these up and hence provide very informative insights about the predictive structure of the data, let alone improve classifier performance. A solid understanding of the predictive structure of the data would greatly help to better understand class differences at the biological level, which could hopefully provide valuable information regarding the selection of important biomarkers to be used in the development of clinical trials for predicting outcome and various forms of treatment.

2.4.4 A Modular Multivariate Approach in an "Evolutionary" Way

Evolutionary algorithms, and in particular, genetic algorithms (GA’s), apply the principles of natural evolution and selection as a means of determining an optimal solution to a feature subset selection problem. Biologically speaking, the characteristics of an organism are determined by its genetic information, which is stored in chromosomes. This information can of course be passed onto future generations, with selection depending on fitness. However, this information can also be altered along the way, as a result of genetic functions such as crossover and mutation. In essence, mutation in GA’s is inspired by the role played by mutation within an organism’s DNA in natural evolution, as the GA periodically makes mutations in at least one member of the population of chromosomes. Crossover in GA’s is analogous to the role played by sexual reproduction in the evolution of living organisms, or more specifically, by the crossover of DNA strands that occurs during reproduction. This genetic function seeks to combine elements of existing solutions together and hence form a new solution (or “offspring”) with some features from each “parent.” Finally, a third parallel between GA’s and natural evolution is the selection process. Coinciding with the “survival of the fittest” notion of evolution, GA’s perform a selection process in which the “most fit” members of the population survive, whereas the “least fit” ones are dismissed.

In the context of GA’s as an optimization problem within the setting of a classification problem based on gene expression data, the “chromosome” is represented by a set of genes, and each of these chromosomes (among a population of possible chromosomes) is considered a candidate solution to apply to the classification problem. Each chromosome can be thought of as a point in the high-dimensional search space of candidate solutions. The chromosomes are often encoded as bit strings (where each locus has two possible alleles: 0 and 1). Whether or not a chromosome is passed onto the next generation depends on its fitness. That is, its passage depends on the closeness of its particular properties to those which are desired, where the notion of “closeness” depends on the fitness function chosen for use in the GA (in practice, the fitness function is some type of supervised learning algorithm). As one would expect, the better the fitness, the greater the chance a given chromosome has of being selected and passed on. As discussed above, random combinations and/or changes can occur among the passed chromosomes, which would of course induce variations in later generations of “offspring.” Ultimately, optimal (or as close to optimal as possible) candidate solutions are generated after evolving through many generations. By implementing an algorithm such as a GA, one can take into account the discrimination capabilities of not only individual genes, but also combinations of genes – as mentioned in Section 2.4.3, a very important characteristic of multivariate feature selection methods. Consequently, as an inherently multivariate method, it is quite possible that a GA could find gene combinations whose individual members lack significant discriminatory power on their own but, when considered together, discriminate the classes well. A potential drawback to keep in mind with GA’s is that a solution is considered better only with respect to other presently known solutions, and hence no single optimal solution is ever really attained. Further, as is the case with other multivariate feature selection methods, there is the issue that a GA never really knows when to stop iterating, aside from being provided with a prespecified number of iterations, time allotment, and/or candidate solutions to reach. More details on GA’s being used in conjunction with a supervised learning algorithm (e.g., k-NN) are provided in Section 3.4.2. For more information on GA’s in general, including theory behind them as well as various applications of them, the reader is referred to [27].
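As a concrete illustration of these ideas, the following R sketch evolves fixed-size subsets of gene indices (a common alternative to bit-string encoding when the number of genes is in the thousands), with leave-one-out 3-NN accuracy as the fitness function. The population size, number of generations, mutation rate, and elitism rule are arbitrary illustrative choices; this is a schematic example, not the GA configuration used later in this research.

library(class)  # knn.cv() provides leave-one-out k-NN predictions

ga_select <- function(X, y, d = 10, pop_size = 50, n_gen = 30, mut_rate = 0.1) {
  p <- ncol(X)
  # fitness: leave-one-out 3-NN accuracy using only the chromosome's genes
  fitness <- function(genes) {
    mean(knn.cv(train = X[, genes, drop = FALSE], cl = y, k = 3) == y)
  }
  # initial population: random chromosomes of d distinct gene indices
  pop <- replicate(pop_size, sample(p, d), simplify = FALSE)
  for (gen in seq_len(n_gen)) {
    fit <- sapply(pop, fitness)
    # rank-based selection: fitter chromosomes are more likely to become parents
    probs   <- rank(fit) / sum(rank(fit))
    parents <- pop[sample(pop_size, pop_size, replace = TRUE, prob = probs)]
    offspring <- vector("list", pop_size)
    for (i in seq_len(pop_size)) {
      # crossover: a child draws its d genes from the union of two parents
      mate  <- parents[[sample(pop_size, 1)]]
      child <- sample(union(parents[[i]], mate), d)
      # mutation: occasionally swap one gene for a gene outside the chromosome
      if (runif(1) < mut_rate) {
        child[sample(d, 1)] <- sample(setdiff(seq_len(p), child), 1)
      }
      offspring[[i]] <- child
    }
    # elitism: carry the single best chromosome forward unchanged
    offspring[[1]] <- pop[[which.max(fit)]]
    pop <- offspring
  }
  fit <- sapply(pop, fitness)
  list(genes = sort(pop[[which.max(fit)]]), fitness = max(fit))
}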

2.5 Assessing the Performance of a Prediction Rule: Cross-Validation

A rather simple (and perhaps ideal) approach to estimating the error rate of the prediction rule would be to apply the rule to a “held-out” test set randomly selected from among the training set samples. This “holdout” approach is preferred over using the whole training set to perform feature selection, build the classifier, and estimate the classification error rates (resubstitution (training error) estimation). This latter type of error estimate may decrease as the complexity of the classifier increases, but the generalization error would at some point then begin to increase, since the model would have adapted itself too closely to the training data (i.e., overfit the data) [19]. However, it should be noted that the “holdout” approach has the obvious disadvantage that the size of the training set, from which the prediction rule is generated, is reduced. Often, especially when the training set size is small to begin with (as is usually the case with microarray data), this is not a desirable approach, since one would ideally like to use as many of the available samples as possible for feature selection and construction of the prediction rule.

As an alternative to the “holdout” approach, cross-validation is very often used, especially when one does not have the luxury of withholding part of a dataset as an independent test set and possibly even another part as a validation set (usually the case with microarray data). Further, the repeatability of results on new data can be assessed with this approach. Cross-validation can come in a number of different flavors. In general, however, all CV approaches fall under the “K-fold CV” heading. Here, the training set of samples is divided into K non-overlapping subsets of (roughly) the same size. One of the K subsets is “held out” for testing, while the prediction rule is trained on the remaining K − 1 subsets. Thus, an estimate of the error rate can be obtained from applying the prediction rule to the test set. This process repeats K times, such that each subset is treated once as the test set.

Ultimately, the average of the resulting K error rate estimates forms the K-fold CV error rate. Equation 2.31 below describes the standard K-fold CV misclassification error rate (MER) calculation, where the two class labels have unit difference and the predictions $\hat{y}_i$, $(i = 1, 2, ..., n_{test})$, take on these values (e.g., 0 and 1).

$$MER_{K\text{-}fold} = \frac{1}{K} \sum_{k=1}^{K} MER_k, \quad \text{where} \tag{2.31}$$

$$MER_k = \frac{1}{n_{test}} \sum_{i=1}^{n_{test}} |y_i - \hat{y}_i|, \quad \text{and} \tag{2.32}$$

$$n_{test} = \frac{N}{K} \tag{2.33}$$
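For reference, the calculation in Eqns. 2.31–2.33 can be sketched in R as follows, with 3-NN (from the 'class' package) standing in for an arbitrary prediction rule; the classifier and the random fold assignment are illustrative placeholders.

library(class)

kfold_mer <- function(X, y, K = 10) {
  n     <- nrow(X)
  folds <- sample(rep(seq_len(K), length.out = n))  # random partition into K folds
  mer_k <- numeric(K)
  for (k in seq_len(K)) {
    test <- which(folds == k)
    pred <- knn(train = X[-test, , drop = FALSE],
                test  = X[test, , drop = FALSE],
                cl    = y[-test], k = 3)
    mer_k[k] <- mean(pred != y[test])               # Eqn. 2.32 (0/1 labels)
  }
  mean(mer_k)                                       # Eqn. 2.31
}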

Leave-one-out CV (LOO) occurs when K = N, where N represents the total number of available samples. In this case, each sample serves as its own test set.

However, in the case of LOO CV, although it is nearly unbiased, it is generally highly variable and requires considerable computation time. In general, there is a bias-variance tradeoff with the selection of K, as larger values of K generally yield smaller bias but less stability (higher variance) than smaller values of K. It should also be noted that the CV process could be repeated multiple times, using different partitions of the data each run and averaging the results, to obtain more reliable estimates.

Overall, Hastie et al. [19] suggest that 5- or 10-fold CV is a good compromise.

Further, at the expense of increased computation cost, repeated- (10-) run CV has been recommended as the procedure of choice for assessing predictive accuracy in limited-sample classification processes in general, including those based on gene expression data [12, 21].

2.6 Two Approaches to Cross-Validation: Internal and External CV

With microarray classification problems, the practice has generally been to perform CV only on the classifier construction process, not taking into account feature selection. The feature selection process is applied to the entire set of data (“internal” cross-validation [7, 14, 26]). Although the intention of CV is to provide accurate estimates of classification error rates, using CV in this manner means that any inference made would be made with respect to the classifier building process only. However, because the significant genes are usually unknown to begin with, the idea is to make inference taking into account the feature selection process as well. Leaving feature selection out of the cross-validation process will inevitably lead to a problem with selection bias (i.e., overly optimistic error rates), as the feature selection would not be based on the particular training set at each CV stage. To prevent this selection bias from occurring, an “external” cross-validation process should be implemented, with the feature selection repeated at each CV stage [7, 14, 26]. That is, the feature selection is performed based only on those samples set aside as training samples at each stage of the CV process, external to the test samples at each stage. Unfortunately, one cannot guarantee that the subset of genes chosen at each CV stage would be the same as that originally obtained when all the training samples were considered. Hence, with external CV, a final model can be specified, but not in terms of which particular subset of genes makes up the model. Rather, one would have the best model in terms of which subset size of genes yielded the lowest CV misclassification error rate, along with an assessment based on a CV approach that takes into account the effect of selection bias. This notion of selection bias, as well as another type of bias that is investigated, optimism bias, are discussed in the following section.
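The distinction can be made concrete with the following R sketch, in which a simple t-test filter (keeping the top g genes) plays the role of the feature selection step and 3-NN plays the role of the classifier; both are placeholders for the actual FSS methods and learning algorithms compared in this research. The only difference between the two modes is where the gene selection happens relative to the CV folds.

library(class)

# Rank genes by Welch t-test p-value and keep the top g; y holds the two classes.
top_genes <- function(X, y, g) {
  pvals <- apply(X, 2, function(col) t.test(col ~ y, var.equal = FALSE)$p.value)
  order(pvals)[seq_len(g)]
}

cv_mer <- function(X, y, g, K = 10, external = TRUE) {
  folds <- sample(rep(seq_len(K), length.out = nrow(X)))
  # internal CV: genes are chosen once, using ALL samples (selection bias risk)
  if (!external) genes <- top_genes(X, y, g)
  mer <- numeric(K)
  for (k in seq_len(K)) {
    test <- which(folds == k)
    # external CV: genes are re-selected from the training samples of each fold
    if (external) genes <- top_genes(X[-test, , drop = FALSE], y[-test], g)
    pred <- knn(train = X[-test, genes, drop = FALSE],
                test  = X[test, genes, drop = FALSE],
                cl    = y[-test], k = 3)
    mer[k] <- mean(pred != y[test])
  }
  mean(mer)
}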

2.7 Optimism Bias and Selection Bias

Several measures of bias are considered in this study. First of all, the difference between the internal CV misclassification error rate (MER) and the resubstitution (training) MER, for any given subset size of genes, is referred to as the optimism bias:

$$\widehat{ob} = MER_{IntCV} - MER_{Resub}. \tag{2.34}$$

This estimate represents the bias incurred from using the same data to both train the classifier and estimate the performance of the classifier. Feature subset selection for both MER’s used in this computation is based on using all samples of each dataset. Positive values of this estimate signify that the MER’s based on internal CV were higher than those based on using all the data to both train and test the classification rule, and vice-versa for negative values.

Taking this a step further, another bias estimate of great interest used in this study is the selection bias, given by:

$$\widehat{sb} = MER_{ExtCV} - MER_{IntCV}. \tag{2.35}$$

This estimate represents the bias incurred from using the same data to both select the gene subsets and estimate the performance of the classification rule based on these subsets. Positive values of this estimate signify that the MER’s based on including the feature subset selection in the CV process were higher than those based on performing the feature selection outside the CV process using all the samples of a given dataset, and vice-versa for any negative values.

Finally, consider the optimism and selection bias estimates of Eqns. 2.34 and 2.35, respectively, as the two components that comprise a third bias estimate – a measure of total bias:

$$\widehat{tb} = \widehat{sb} + \widehat{ob} \tag{2.36}$$

$$\widehat{tb} = MER_{ExtCV} - MER_{Resub} \tag{2.37}$$

This estimate represents the bias incurred from using the same data to select gene subsets, train the classifier, and estimate the performance of the classifier. It also takes into account the bias from using the same data to select the gene subsets and estimate the performance of the classification rule based on these subsets. In a sense then, computing this difference between the 10-fold external CV error and the resubstitution error can also provide one with a measure of the degree of overfitting, where overfitting refers to the situation when a model fits training data well, but not so well with respect to independent test data. In the case of total bias, the external CV MER would be treated as the test set performance, while the resubstitution MER would of course be treated as the training set performance.

Chapter 3

Previous Work & Results

3.1 Introduction

This chapter includes a review of the principal work done in the fields of feature selection and supervised learning, as applied to gene expression analysis using microarrays. With respect to feature selection in particular, there is discussion of both univariate methods and multivariate methods that have been applied, the latter including some initial work implementing the genetic algorithm. Also included in this chapter are summaries of the published microarray datasets that have been analyzed in the studies discussed (all publicly available for download). Of these datasets, there are six that are also analyzed in this research. The analyses and results obtained from these six datasets are discussed in Chapters 4, 5, and 6.


3.2 Published Dataset Descriptions

• Alizadeh et al. lymphoma dataset [5]

http://www-genome.wi.mit.edu/cancer

This dataset is composed of gene expression levels measured by a specialized cDNA microarray, the Lymphochip, which contains genes that are differentially expressed in lymphoid cells or known to be immunologically or oncologically important. There are 4026 genes over 47 mRNA samples (24 germinal center B-like diffuse large B-cell lymphoma (DLBCL) and 23 activated B-like DLBCL). See [5] for more details on this dataset.

• Perou et al. breast cancer dataset [29]

http://genome-www.stanford.edu/sbcmp

This dataset consists of gene expression levels from cDNA microarrays containing 5776 human sequences over 27 samples (14 human mammary epithelial cells and 13 breast tumors). See [29] for more details on this dataset.

• West et al. breast cancer dataset [38]

http://data.cgt.duke.edu/west.php

This dataset consists of gene expression levels measured from Affymetrix high-density oligonucleotide chips (HuGeneFl) using the GeneChip software. Each chip contains 7129 probe sets (including 6817 human genes) over 49 breast tumor mRNA samples. Because it is believed that both estrogen receptor (ER) status and lymph node status are important prognostic factors for the development of breast cancer, two different two-class problems were analyzed in the original study: ER+ (25 samples) vs. ER- (24 samples) and lymph node status (i.e., affected node present or node+ (25 samples) vs. affected node absent or node- (24 samples)). See [38] for more details on this dataset.

Public Datasets Analyzed in this Research

• Alon et al. colon cancer dataset [6]

http://microarray.princeton.edu/oncology/affydata/index.html

This dataset consists of gene expression levels measured from Affymetrix oligonu- cleotide arrays (HU6000; quantization software uncertain) for 2000 genes across 62 samples. The binary classes used for analysis are normal (22 samples) and tumor (40 colon tumor samples). As discussed in [24], five colon samples previously identified as being contaminated were omitted (N34, N36, T30, T33, and T36), leaving the total sample size for analysis at 57. See [6] for more details on this dataset.

• Golub et al. leukemia dataset [18]

http://www.broad.mit.edu/cgi-bin/cancer/publications/pub_paper.cgi?mode=view&paper_id=43

This dataset consists of gene expression levels (presumably measured from the GeneChip software) from Affymetrix chips (HuGeneFl). The oligonucleotide arrays have 7129 probe sets over 72 samples. The binary classes used for analysis are acute myeloid leukemia (AML; 25 samples) and acute lymphoblastic leukemia (ALL; 47 samples). See [18] for more details on this dataset.

• Nutt et al. brain cancer dataset [28]

http://www.broad.mit.edu/cgi-bin/cancer/publications/pub_paper.cgi?mode=view&paper_id=82

This dataset consists of gene expression levels measured from Affymetrix high- density oligonucleotide chips (U95Av2) using the GeneChip software. Each array contains 12625 probe sets over 50 samples. The binary classes used for analysis are glioblastoma (28 samples) and anaplastic oligodendroglioma (22 samples). The downloaded raw expression values were previously normalized by linear scaling such that the mean array intensity for active (“present”) genes was identical for all the scans. See [28] for more details on this dataset.

• Pomeroy et al. brain cancer dataset [30]

http://www.broad.mit.edu/mpr/CNS

This dataset consists of gene expression levels measured from Affymetrix high-density oligonucleotide chips (HuGeneFl) using the GeneChip software. Each chip contains 7129 probe sets. To facilitate the binary classification framework, dataset ’A2’ from the project website was used, in which 60 medulloblastoma (MD) samples formed one class and the remaining 30 samples classified as “Other” formed the second class (Note: of these 30, there were 10 malignant gliomas (MG), 10 atypical teratoid/rhabdoid tumors (AT/RT), 6 supratentorial primitive neuroectodermal tumors (PNET), and 4 normal cerebellum samples). See [30] for more details on this dataset.

• Shipp et al. lymphoma dataset [33]

http://www.broad.mit.edu/mpr/lymphoma

This dataset consists of gene expression levels measured from Affymetrix chips (HuGeneFL) using the GeneChip software. Each oligonucleotide array contained 7129 probe sets over 77 samples. The two classes used for analysis are diffuse large B-cell lymphoma (DLBCL; 58 samples) and follicular lymphoma (FL; 19 samples). See [33] for more details on this dataset.

• Singh et al. prostate cancer dataset [34]

http://www.broad.mit.edu/cgi-bin/cancer/publications/pub_paper.cgi?mode=view&paper_id=75

This dataset consists of gene expression levels measured from Affymetrix chips (HU95Av2) using the GeneChip software. The number of arrays available for analysis was 102, with each containing 12600 probe sets. The two classes used for analysis are normal (50 samples) and prostate cancer (52 samples). See [34] for more details on this dataset.

3.3 Univariate Screening

There has been a large amount of work done with respect to univariate feature selection in conjunction with classification of microarray data. As discussed throughout Section 2.3, Dudoit et al. [15] provide an in-depth comparative study of several supervised learning methods (LDA, DLDA, DQDA, CART, k-NN) for tumor classification using microarray data based on filtered (univariately screened) sets of genes. As discussed in Section 2.4.1, the gene selection method implemented by Dudoit et al. [15] operates by considering only the p genes that have the largest ratio of between- to within-sum-of-squares for use in the classification algorithm. A similar type of univariate screening also compared in that study was the PS statistic [18]. More recently, Dudoit and Fridlyand [14] also applied univariate screening via both a t-test (using the expression values) and a rank-based t-test, the Wilcoxon test, to analyze the datasets of West et al. [38] and Pomeroy et al. [30]. The classification schemes they used were k-NN, DLDA, DQDA, boosting with trees, random forests, and SVM’s.

Starting with the more recent results of Dudoit and Fridlyand [14], for the k-NN analyses, no value of k above 5 was considered. Leave-one-out cross-validation was used in obtaining these classifiers. Also, for each CV training set, feature selection was performed, to ensure that it was taken into account when evaluating the classifiers’ performances. In general, there appeared to be no significant advantage to using more complicated classification algorithms that require more tuning parameters (i.e., boosting with trees, random forests, and SVM’s versus simpler algorithms such as DLDA or k-NN), especially considering the very poor results obtained with random forests. With respect to k-NN, boosting, and SVM’s, the number of neighbors, boosting iterations, and combination of cost parameter and kernel choice, respectively, did not have a significant impact on the results. As far as how many genes to include in the classifiers, there was little significant change in general as more genes were added to the models, especially for gene sets of 100 or less. However, for the gene sets of 500 or greater, there did appear to be slight increases in the misclassification results. The boosting and random forest classifiers were generally insensitive to the number of features used to build them, as expected since they have their own built-in feature selection abilities. As far as comparing the rank-based Wilcoxon test statistic and the regular t-statistic for feature selection, there seemed to be less change across subset sizes with the Wilcoxon test results than with the t-test results.

As an additional study, Dudoit and Fridlyand compared the effect of feature selection performed on the entire training set (internal LOO CV) and on each individual training set of a LOO CV (external LOO CV). As discussed in Section 2.6, the former does not lead to as realistic and honest estimates of the generalization error, since they generally tend to be overly optimistic. This statement held true in their results, as the internal LOO CV method did indeed lead to misclassification results that were severely biased downward compared to the external approach.

Some earlier studies by Dudoit et al. [15] on the leukemia dataset of Golub (distinguishing ALL from AML [18]) show similar types of results to the above. They implemented k-NN, LDA, DLDA, DQDA, classification trees (CART), bagging with trees (variations including parametric multivariate normal (MVN), non-parametric (standard bagging), and convex pseudo-data (CPD)), and boosting with trees. In this study, repeated (150) runs of training/test set partitions were performed, with feature selection performed only on each training set. The ratio of training to test set samples was 2:1. The data were preprocessed such that there were 3571 genes to be analyzed for each of the 72 samples (see [15] for more details on the preprocessing). The dataset was already divided into a training set of size 38 and a test set of size 34. For this dataset, the top 40 genes in terms of the largest BSS/WSS ratio (see Section 2.4.1) were retained for inclusion in the classifiers. Aggregating methods performed were boosting (using 50 “pseudo” training sets) and CPD bagging. Increasing the number of bagging or boosting iterations did not have much impact on the classifiers’ performance. Relatively low test set error rates were obtained for each of the 150 different runs, where a random scheme was presumably used to re-create additional training and test sets for the remainder of the runs. The number of neighbors for k-NN was selected by cross-validation (details of which are unclear), and for about half of the runs k was 1 or 2, and in general less than 7. DLDA yielded much better error rates than did LDA and DQDA. Hence, it could be said that for this dataset at least, better error rates were obtained by ignoring correlations between genes.

Overall, the highest error rates were found when LDA was used, while DLDA (error rate of 0) and k-NN provided the best error rates, followed by the best of the aggregating classifiers (boosting CART). As discussed in [15], the poor performance of LDA could be attributed to the fact that this classifier borrows strength from the majority of the data (unlike the more local k-NN method), so some samples may not be well represented by the discriminant variable of this dataset. Also, because of the "large p, small N" situation, the BSS and WSS matrices may not provide good estimates of the corresponding population quantities. The authors also briefly investigated the effect of increasing the number of features to include and reported that increasing the number of features to 200 did not have a significant impact on the performances of the classifiers.

For the above study and the previous one of Dudoit and Fridlyand [14], the general conclusion was that the simpler classification methods performed better than the more complicated ones. However, as discussed in [15], a couple of important factors other than generalization misclassification error that should be considered when choosing a classifier are simplicity and insight about the predictive structure of the data itself. For example, although DLDA is simple and generated low error rates, it does not take into account the effect of gene inter-correlations, which are important biological factors that should not be disregarded lightly.

The feature selection aspect of microarray classification is the stage where investigation of gene interactions could (and probably should) be conducted. The above results were all based on univariate screening as the feature selection mechanism, which, as discussed in Section 2.4.1, is not conducive to detecting groups of significant genes. Also, such screening is not able to avoid the possibility of including redundant sets of genes among a list of genes (i.e., genes that are highly correlated with one another).

3.4 Multivariate Feature Selection

This section includes discussion on two general approaches to multivariate feature selection – a Monte Carlo approach and a couple of variations of the traditional stepwise forward selection approach. Their application in previous studies and the results of these studies are provided in this section as well.

3.4.1 MC and SFS Approaches

As discussed in Section 2.4.3, there were two multivariate feature selection methods, an MC method and the SFS methods, implemented in the Xiong et al. study [41]. There were three binary classification datasets used in this study: the colon dataset of Alon et al. [6], the leukemia dataset of Golub [18], and a breast cancer dataset of Perou et al. [29]. The authors found that using optimal or near-optimally selected subsets of genes can generate very high classification accuracy (i.e., low misclassification rates). These results were based only on using LDA for the classification of tumor and normal samples. It should also be noted that these authors used a “holdout” method to evaluate the performance of the selected genes within the LDA analyses. An interesting caveat of this study was that the authors divided the data into a training and test set in the following proportions: (50%, 50%), (68%, 32%), and (95%, 5%), respectively, and then averaged the results of 200 runs of each of these approaches. That is, no cross-validation was implemented to assess the predictive accuracy of the classification processes. In this study, it was found that both multivariate methods performed better than the univariate-based t-test and PS statistic methods that had been used previously. It was also found that the stepwise method required less computation time than did the MC method. It should be noted that the accuracy of classification for forming gene subsets (only sizes 1, 2, and 3 considered) was based on the total collection of tissue samples, which allowed for the presence of selection bias [7].

On the other hand, external CV was implemented on two published datasets in Ambroise and McLachlan [7]. The samples were randomly divided into 50 different training and test set partitions, with the CV performed only on the training data. They used two schemes for feature selection and classification – backward selection with SVM and forward selection with LDA. No univariate-based approach to perform the feature selection was implemented. They considered the effect of selection bias by performing external 10-fold CV and internal LOO CV (although unfortunately no internal 10-fold and external LOO results were provided). The average values of the error rate estimates across the multiple runs were obtained for both approaches for each dataset. They found that the internal LOO CV led to overly optimistic error rates compared to the external 10-fold CV process, for both classification schemes and datasets.

The focus of the multivariate-based feature selection aspect of this research, however, will be on the use of genetic algorithms, which have received very limited use in the context of binary classification with gene expression data. An example of how a GA has been used in this context is presented in the following section.

3.4.2 Genetic Algorithms: GA + kNN

3.4.2.1 Procedures

The GA was used in conjunction with the k-NN supervised learning technique in a couple of studies [23, 24]. For the Li et al. study [24], the authors looked at the binary classification problems with the colon microarray data of Alon et al. [6] and the lymphoma data of Alizadeh et al. [5]. One should recall from Section 3.2 that five samples were omitted from the colon data, and both datasets were divided into a training set and test set (although the exact breakdown of numbers of particular classes within the training and test sets was not made clear). The basic GA/k-NN process is discussed next.

An initial population of chromosomes was created, in which each “chromosome” consisted of a prespecified number of randomly selected (distinct) genes from a pool of 2000 genes (choices of chromosome length d were 5, 10, 20, 30, 40, and 50). Since the number of possibilities for selecting 50 genes, for example, from a pool of 2000 is about $10^{100}$, investigating all of these is clearly not a practical approach. Sub-populations (“niches”) were created, and typically there were 10 of these allowed to evolve in parallel per run, with 150 chromosomes in each niche. Each of the sub-populations was evolved independently. However, at each generation, the top chromosomes (one from each niche) were identified and combined to replace the 10 least fit chromosomes in each niche in the following generation. Hence the best chromosomes were preserved at each generation.

The notion of “best” was determined by the fitness function used, the k-NN method. That is, the fitness, or merit, of each chromosome was determined by its ability to accurately classify the training set samples according to the k-NN method. For this fitness function, the class of each training sample was compared to that of its three nearest neighbors, in terms of Euclidean distance in the d-dimensional space defined by the chromosome's genes. The choice of k = 3 was large enough to ensure that tight clusters could be formed, while keeping the computation time below what would have been required with larger values of k. They employed an “all or nothing” (or “consensus”) voting approach, in which a score of 1 was given to a particular sample only if it and its three nearest neighbors all belonged to the same class. The scores were then summed across samples to determine the goodness-of-classification of each chromosome and hence form a sum they refer to as the “cross-validation R²”. Thus, the maximum value of this statistic would be the number of training samples, ntrain, and a proportion of correctly classified samples could then be obtained by dividing R² by the number of training samples. A fitness score was given to each of the chromosomes based on this classification measure. When no chromosomes among the first population achieved the targeted R² (at least 31/34 for the lymphoma data and 38/40 for the colon data), a second population of chromosomes was generated based on the survival-of-the-fittest principle. The best chromosome from each niche was passed on deterministically, and the other 149 probabilistically, based on the fitness scores assigned to them.
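A bare-bones version of this consensus-vote fitness can be written in R as follows; it is only a schematic reading of the description above (Euclidean distance over the chromosome's genes, a score of 1 only under unanimous agreement), not Li et al.'s implementation.

# Consensus 3-NN fitness in the spirit of the "cross-validation R^2": a training
# sample scores 1 only if it and its k nearest neighbors (computed over the
# chromosome's genes) all share the same class; the fitness is the total score.
consensus_fitness <- function(X, y, chromosome, k = 3) {
  D <- as.matrix(dist(X[, chromosome, drop = FALSE]))  # Euclidean distances
  scores <- sapply(seq_len(nrow(X)), function(i) {
    nn <- order(D[i, ])[2:(k + 1)]                     # nearest neighbors, excluding i itself
    as.integer(all(y[nn] == y[i]))
  })
  sum(scores)   # maximum possible value = number of training samples
}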

The probabilistic way in which mutation was performed was such that each chromosome was selected from its parent niche based on a probability proportional to its fitness score rank. If a chromosome were selected for transmission, between 1 and 5 of its genes were randomly selected for mutation with probabilities 0.53125, 0.25, 0.125, 0.0625, and 0.03125, respectively. Hence, a single-gene mutation is assigned the highest probability. With this mutation number determined, genes outside the chromosome replace those selected to be changed.

The entire procedure was repeated until the targeted cross-validated R² for the training set was achieved in any of the 10 niche runs (typically 10-50 generations were needed per run), at which point that chromosome was saved and the GA/k-NN process restarted. The procedure ended when a pre-specified large number of chromosomes (not necessarily distinct) was reached (10000 for this study). With 10000 “large-R²” chromosomes now saved, the frequency with which genes were selected was investigated, the reasoning being that these genes were part of chromosomes that discriminated fairly well between the two classes of a given dataset. To validate a set of top (i.e., most frequently selected) genes, the set of genes was used for classification of the test set samples. For each d, each of the test set samples was classified using sets of these top-ranked genes. A sample would be classified as germinal center B-like DLBCL if all of its 3 nearest training set neighbors were of the same class (consensus rule), and similarly for the other class. Otherwise, the sample would be given an “unclassifiable” label.

For the lymphoma data, a systematic difference based on t-statistics was noted between the training and test sets from the original assignment of [5], in which the first 34 (of 47) samples were assigned to the training set. Hence, for this “new” original assignment, the lymphoma samples were randomly reshuffled before assignment. Two other types of assignments were also carried out for both datasets: a random one in which ntrain samples were randomly selected from the entire dataset to be the training set, and a discrepant one in which the last ntrain samples were assigned to the training set. Each of the three training sets of each dataset was used in the GA/k-NN process described above, and the genes selected for each were compared to estimate the dependence of gene selection on training set makeup. For this stability study, the chromosome length was 40. This length was chosen based on sensitivity and reproducibility studies also performed in this report, where it was shown that the selection of optimal genes was insensitive to the choice of d (see [24] for more details on these two studies).

3.4.2.2 Results

For the lymphoma data, the 50 most frequently selected genes from each of the three types of training sets for each dataset were used to classify the corresponding test set samples. Classification was done based on a majority, not consensus, rule (i.e., 2 or 3 neighbors must agree with the test case). The only discrepancies found were in the reshuffled training sample assignment, where only two samples were misclassified. For the colon data, only one sample was misclassified across all three types of training assignments.

Tuning parameter issues involved in this process included the choice of chromosome length d, the termination criterion R², the number of near-optimal chromosomes, and the number of top genes for test set classification. For choices of d, smaller values (5-10) led to faster computation than did larger values (up to 50). However, with only a small chromosome length, a few genes dominated the selection process. As d increased, the gene selection process stabilized. For sample classification, the choice of d had little impact on the classification results, although d = 20-50 provided the best overall results.

For the choice of R², it was reported that preliminary studies showed that gene selection is more sensitive to the choice of termination criterion than is test set classification. A less stringent criterion such as R² = (ntrain − 2)/ntrain or (ntrain − 1)/ntrain may cause the relative rank order of genes to vary, but had little impact on the selection of the most important genes. A less stringent statistic could have helped in terms of computation time, as well as allowing for a higher probability that genes of predictive importance would be retained even if they failed for a few samples.

In choosing the number of top genes for classification, too many genes could add irrelevant information to the classification. Using a consensus rule of classification for k-NN, the authors found that anywhere between 50 and 200 top genes led to the best classification of the test data. This number could change with the choice of classification algorithm. This report made clear that not all genes contain relevant information, since including all 4026 genes with a consensus-rule k-NN classification yielded only 31% classification accuracy on the test set. With a majority rule, only 61% accuracy was obtained. In previous work [23], similar results were found for the colon [6] and leukemia [18] datasets (2-class case between ALL and AML samples).

Finally, for the previous GA/k-NN study of Li et al. [23] looking at the colon and leukemia datasets, the same basic GA/k-NN process was used. For the colon data, an initial population of chromosomes was created, in which each “chromosome” consisted of 50 randomly selected (distinct) genes from the pool of 2000 genes. Sub-populations (“niches”) were created, typically 10 per run, with 150 chromosomes per niche. Each of the sub-populations was evolved independently. At each generation, the top chromosomes (one from each niche) were identified and combined to replace the 10 least fit chromosomes in each niche in the following generation. Hence the best chromosomes were preserved at each generation. For the colon dataset, the GA/k-NN method led to 6348 chromosomes being selected (implementing a 3-NN scheme with a consensus rule). Again taking the 50 most frequently selected genes, these genes were used to classify 20 test set samples (the first 42 samples being the training set and the other 20 the test set; both with 2:1 ratios of tumor:normal samples). When only the top gene was used for classification, there were 7 misclassifications and 2 samples that were unclassifiable. When using between 25 and 110 of the top genes, the predictive ability stabilized, but adding too many genes beyond 110-120 only led to an increased number of unclassifiable samples as a result of high-dimensional noise being introduced. Including all 2000 genes, for example, led to 8 of the 20 samples being unclassifiable, again showing that not all of the gene expression data is necessary for discriminating between the normal and tumor samples.

For comparison purposes, the same GA/k-NN method was applied to the leukemia dataset (recall from Section 3.2 the assignment of samples to training and test sets). Here, a different set of top 50 genes was selected than that found in the original study of Golub et al. [18] (which used the univariate PS statistic approach discussed in Section 2.4.1). Test set classification using 3-NN and a consensus rule showed that only one sample (of 34) was misclassified. Another analysis was done using 5-NN (again with a consensus rule), which led to similar results, with the exception of now incurring one unclassifiable sample. For more details on these analyses, refer to Li et al. [23].

3.5 What’s Next: A Large-Scale Investigation

Much of the work done to date with respect to binary classification of microarray data has been based on univariate feature selection approaches. Of course, if among the thousands of features (genes) there exist several dominant, individually predictive genes, then perhaps simple univariate gene selection approaches would suffice. Nonetheless, based on the results of Section 3.4 above, I believe there is definitely cause to further investigate the merits of a multivariate, modular approach to feature subset selection for binary classification with high-dimensional data from microarrays. In doing so, currently existing methods of performing feature selection in this context should be rigorously assessed. Biologically, it is known that genes work together, so it seems reasonable to further investigate performing variable selection in a multivariate manner. At the same time, it should also be noted that biological interaction among genes does not necessarily imply that the genes will be jointly predictive in a classification problem. An evolutionary algorithm such as a GA, which has had relatively little use in the context of microarrays, has the advantage of being able to find at least near-optimal solutions (i.e., subsets of jointly significant genes) to use for the classification of microarray data in a “large p, small N” setting, in which there are simply too many genes to find truly optimal combinations of genes through a completely exhaustive search of the high-dimensional feature space. Further, there are solutions that would very likely not have been found by combining individually predictive genes found from univariate screening (as discussed in Sections 2.4.1 and 3.3).

The work of Li et al. [24] provided some promising results on the use of GA’s in conjunction with k-NN as the fitness function. However, I believe some modifications to their method are in order. First of all, with respect to the final solution selected, they essentially ignore all the candidate chromosomes (i.e., solutions, or combinations of genes) generated by the GA process, since they resorted back to a rank-based selection approach by choosing the top genes from among the most frequently selected genes among all the candidate chromosomes. With respect to the k-NN fitness function, the authors described a cross-validated R² as a goodness-of-classification measure to describe the number of correctly classified samples, based on a consensus rule. However, for some datasets one might incur a more serious problem with unclassifiable samples than was the case with their analyses, if the k nearest neighbors of a sample do not satisfy the consensus vote required for classification. Overall, a simpler and less computationally intense fitness function should be investigated.

Ultimately, on a much broader scale, there is the question of whether the success of the classification accuracy results from a given prediction rule is really a product more of the structure of the data or of the classification process itself – a question that today unfortunately remains unresolved. To address this question, however, a number of other very important and currently open-ended issues should be investigated in a larger-scale empirical comparative study than any that has been undertaken to date. First of all, not only should one seek to determine what type of feature selection approach should be used, but also what type of learning algorithm and what gene subset size are best suited for performing binary classification based on gene expression data from a given dataset. With respect to the notion of “best” predictor gene sets, the hope is that the final group of genes selected is as strongly correlated with (predictive of) class distinction as possible, but also as uncorrelated as possible with each other. Of course, based on this final subset, there is then the question regarding how successful the genes are in discriminating between two classes when applied to an independent test set. Cross-validation is an important technique for assessing the predictive accuracy of classification processes with microarray data, where one usually does not have the luxury of withholding samples as test and/or validation sets. Other key CV-related issues that warrant further investigation include the effect of external vs. internal CV when assessing the predictive accuracy of a given prediction rule and how effective external CV really is in taking into account selection bias, for a number of different types of classifiers across a number of two-class gene expression datasets (i.e., for univariate- and GA-based feature selection techniques, various supervised classification algorithms, and various gene subset sizes). On the gene detection front, another issue to investigate is how effective a GA-based feature selection approach really is in detecting genes that would otherwise go undetected by univariate screening methods. As a result of conducting a large-scale empirical study over multiple published two-class microarray datasets – a type of comparative study that to this date has not been undertaken – the hope is that more insights into all these issues can be obtained.

Chapter 4

Univariate-Based FSS Results

4.1 Introduction

This is the first of three chapters presenting the results obtained in this research. This chapter includes results obtained using a univariate-based means of feature subset selection in conjunction with six different types of classifiers. To assess the predictive accuracy of the various models, both external and internal single- and repeated- (10-) run 10-fold cross-validation was used. All analyses of this chapter were performed using the R statistical software package [35] on a Red Hat Linux machine (dual Intel(R) 3.06 GHz processors and 4 GB memory).


4.1.1 Supervised Learning Methods

In this study, three well known and widely used supervised learning algorithms were implemented – support vector machines (SVM’s; linear kernel), diagonal linear discriminant analysis (DLDA), and k-nearest neighbors (k-NN). Each of these three methods has been shown to be a successful learning algorithm for the problem of classification using gene expression data. Also, each of them functions in an inherently different way. For these reasons, they were selected for use in this large-scale empirical study. For k-NN, it should be noted that four different variations were implemented (k = 1, 3, 7, 15, to cover a wide range of k; Euclidean distance metric). Hence, the total number of different learning algorithms studied was actually six. For more details on each of these classifiers, the reader should refer to [14] and [15].

4.1.2 Feature Subset Selection

Rank-based, unequal variance T-tests were performed on each of the genes from the designated training sets of samples selected from the datasets being studied. In each training set, this resulted in an ordered (by increasing p-value) list of “top genes” for use in generating various “top gene subset size” models for each of the six classifiers implemented. Note that the training set could either be all N samples of a given dataset if “internal” 10-fold CV is being used, or 90% of the N samples if “external” 10-fold CV is being used (see Section 4.1.3).
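A minimal R sketch of this screening step is given below; it applies a Welch (unequal-variance) t-test to each gene's expression values and orders the genes by raw p-value. A rank-based variant would apply the same test to the within-training-set ranks of each gene; that refinement, and the exact options used in this research, are not shown here.

# Univariate screen on a training set: Welch t-test per gene, ordered by p-value.
# X is a samples x genes expression matrix; y holds the two class labels.
rank_genes <- function(X, y) {
  pvals <- apply(X, 2, function(g) t.test(g ~ y, var.equal = FALSE)$p.value)
  data.frame(gene_index = order(pvals), p_value = sort(pvals))
}

# e.g., the top-25 list reported for a given dataset:  head(rank_genes(X, y), 25)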

For the “internal” CV results, in which all samples were used in performing the T-tests, the final lists of ordered genes, along with their (raw) p-values, are provided in Tables A.1 – A.6 of Appendix A. Note that these lists include only up to the top 25 genes. It should also be noted that the first column of these tables, “Gene Index”, refers to the row (or column, depending on how one imports the dataset) number of that particular gene, so the reader is referred to the original dataset for cross-referencing any gene of interest (see Section 3.2 for more information pertaining to each dataset’s website where the data was downloaded). Included in each of these tables are the adjusted p-values from the Benjamini-Yekutieli (“BY”) FDR adjustment, the Holm step-down adjustment, the Bonferroni correction, and finally the Westfall-Young (“WY”) permutation-based correction. As expected, the importance of taking into account multiple comparisons was evident, as the p-values were not as small once they were adjusted, especially as one proceeds down the list of top genes for each of the datasets. For each dataset, it should be noted that the top 25 genes across all methods still maintained significant p-values (at the 0.05 significance level); only the last 5 of the 25 genes listed from the Nutt dataset had permutation-based p-values slightly above the 0.05 level. If one were to actually report particular top gene subsets for model specification and their associated p-values, it would be more accurate to report results from one of the adjusted p-value approaches. For more details on these corrections for multiple comparisons, refer to Section 2.4.2.
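For the adjusted p-values, the Benjamini-Yekutieli, Holm, and Bonferroni corrections are available directly through base R's p.adjust(), as sketched below for a vector of raw per-gene p-values; the Westfall-Young correction requires a permutation loop (it is implemented, for example, in the Bioconductor multtest package) and is omitted from this sketch.

# Multiplicity adjustments of the raw per-gene p-values (e.g., from rank_genes()).
adjust_pvals <- function(raw_p) {
  data.frame(raw        = raw_p,
             BY         = p.adjust(raw_p, method = "BY"),
             Holm       = p.adjust(raw_p, method = "holm"),
             Bonferroni = p.adjust(raw_p, method = "bonferroni"))
}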

4.1.3 Internal and External CV

To assess the predictive accuracy of the classification processes, a classic 10-fold cross-validation approach was used. Although the commonly used leave-one-out approach to CV is nearly unbiased, it can provide highly variable MER estimates. In addition, because repeated-run CV is implemented in this research, the computational burden of leave-one-out CV would be quite heavy compared to a lower-fold variation of CV. For this research, 10-fold CV, which yields a slightly more biased but less variable estimate of the misclassification error rate (MER), is used. 10-fold CV has also been shown to be a reliable approach for assessing the predictive accuracy of limited-sample classification problems in general, including those based on gene expression data [12, 19, 21]. By “classic” CV, the key is that each of the ten test set partitions is mutually exclusive of the others. This chapter includes results of both internal and external CV, as well as resubstitution error rate results (i.e., the “apparent”, or training, error rate: the estimation of misclassification error based on using all samples both to build each model and to evaluate each model). With respect to the resubstitution errors, one would expect these curves to all have very low, if not perfect, MER's. Refer to Section 2.5 for more on cross-validation in general.

4.1.4 Single- and Repeated-Run CV

Also included in this chapter are results based on either the standard single-run 10-fold CV process or a repeated- (10-) run version of the 10-fold CV. The latter refers to running the standard 10-fold CV process 10 separate times and averaging the resulting ten misclassification rates to obtain the final misclassification error rate, providing a more stable, Monte Carlo type of estimate. Equation 2.31 describes the standard (single-run) K-fold CV MER calculation (for this study, K = 10), where the two class labels have unit difference and the predictions $\hat{y}_i$, $(i = 1, 2, ..., n_{test})$, take on these labels (e.g., 0 and 1).

For each of the classifiers implemented for a given dataset, the same ten training and test set partitions for a given iteration were used, to maintain consistency in interpreting the repeated-run 10-fold CV results for each dataset. That is, iteration 1 of 10 would use the same set of ten training and mutually exclusive test set partitions across all the classifiers. Iteration 2 of 10 would do the same, followed by iterations 3, 4,..., and 10 of 10. There could of course be some overlap among training and/or test set samples from iteration to iteration, but this is inevitable for the repeated-run analyses.
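Operationally, this amounts to generating the ten sets of fold assignments once and reusing them for every classifier, as in the R sketch below; classifier_mer() is a hypothetical helper that returns the single-run 10-fold MER of one classifier given a fixed fold assignment.

# Generate R sets of K-fold assignments once, so every classifier sees the same
# ten training/test partitions within each repetition.
make_partitions <- function(n, K = 10, R = 10, seed = 1) {
  set.seed(seed)
  lapply(seq_len(R), function(r) sample(rep(seq_len(K), length.out = n)))
}

# Average the single-run K-fold MER over the R stored partitions.
repeated_cv_mer <- function(X, y, classifier_mer, partitions) {
  mean(sapply(partitions, function(folds) classifier_mer(X, y, folds)))
}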

4.1.5 Plot Breakdowns

This chapter begins with plots of misclassification error rates (MER’s) vs. top gene subset size, for four different “flavors” of the 10-fold CV process, with discussion following the first two plots (internal CV; Figures 4.2 and 4.3) and the second two plots (external CV; Figures 4.4 and 4.5). The “flavors” are as follows:

• single-run 10-fold internal CV (Figure 4.2)

• repeated (10)-run 10-fold internal CV (Figure 4.3)

• single-run 10-fold external CV (Figure 4.4)

• repeated (10)-run 10-fold external CV (Figure 4.5)

Other repeated-run-based plots are also included in this chapter. These include resubstitution error plots, external CV vs. internal CV plots, and finally plots looking at the effect of both optimism and selection bias, incurred from performing resubstitution error assessment rather than internal CV and from performing internal CV rather than external CV, respectively. For each of the above analyses, there are six individual plots corresponding to the six datasets, each of which contains six MER curves corresponding to each of the six classifiers considered across the selected “top” gene subset sizes. From these plots, any effects of the datasets, the classifiers, and the gene subset sizes on the misclassification rates should be seen. The choice of which top gene subset sizes to use for each dataset was made ensuring that a full range of the possible magnitudes of subset sizes was taken into account for each dataset. Since some datasets were bigger than others, their plots obviously included larger-sized gene subsets.

4.2 Preprocessing of Datasets

As discussed in Section 3.2, there were 6 datasets analyzed in this research, all of which are from Affymetrix microarrays [1, 2, 3, 4]:

• Alon et al. colon cancer dataset (2000 genes; 57 samples: 20 (35%) normal and 37 (65%) tumor) [6]

• Golub et al. leukemia dataset (7129 genes; 72 samples: 25 (35%) AML and 47 (65%) ALL) [18]

• Nutt et al. brain cancer dataset (12625 genes; 50 samples: 28 (56%) glioblastoma and 22 (44%) anaplastic oligodendroglioma) [28]

• Pomeroy et al. brain cancer dataset (7129 genes; 90 samples: 60 (67%) MD and 30 (33%) other) [30]

• Shipp et al. lymphoma dataset (7129 genes; 77 samples: 58 (75%) DLBCL and 19 (25%) FL) [33]

• Singh et al. prostate cancer dataset (12600 genes; 102 samples: 50 (49%) normal and 52 (51%) tumor) [34]

The only preprocessing that was done on each of the datasets was to normalize the arrays such that they each have zero mean and unit variance (an approach also used in the comparative gene expression study of Dudoit et al. [15]). Standardization of microarray data in this manner achieves a location and scale normalization of the arrays. This was done to ensure that all the arrays of a given dataset were independent of the particular technology used (i.e., to reduce the effect of processing artefacts, such as longer hybridization periods, less post-hybridization washing of the arrays, and greater laser power, to name a few). This way, for a given dataset, the values corresponding to individual genes can be compared directly from one array to another. Further, it has been shown that this type of normalization is effective in preventing the expression values of one array from dominating the average expression measures across arrays [42]. Currently there is no universally accepted means of normalizing microarray data.
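Concretely, with the expression data held in a samples x genes matrix X, this array-wise standardization can be done in a line of R, as sketched below (scale() operates on columns, so the matrix is transposed and transposed back).

# Standardize each array (row of X) to zero mean and unit variance across genes.
standardize_arrays <- function(X) {
  t(scale(t(X)))
}
# Sanity check: rowMeans of the result are ~0 and the row standard deviations are ~1.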

4.2.1 An Initial Glimpse of the Datasets: Unsupervised Learning via Multidimensional Scaling

To get an initial idea of how difficult the classification task will be for each dataset, a popular type of unsupervised learning technique, multidimensional scaling (MDS), is applied to each of the datasets, using all samples and all genes in each case. Being unsupervised, the goal is to be able to visualize groups (clusters) among the data without the use of any classes defined a priori, as is done with supervised learning methods. MDS is a method that maps data points in $R^p$ to a lower-dimensional manifold. For binary classification with gene expression data, then, MDS takes each gene expression profile $x_i = (x_{i1}, x_{i2}, ..., x_{ip}) \in R^p$, $i = 1, ..., N$, and maps it to two-dimensional space. The general idea of MDS is to represent the expression profiles as points in some lower-dimensional space than the original space, such that the distances between points in the reduced space correspond to the dissimilarities between points in the original space. If acceptably accurate representations can be obtained in, say, two dimensions, MDS can serve as a very valuable means to gain insight into the structure of the dataset. Of particular interest in this research would be to obtain a visualization of the expression data which can then be used to give a potentially meaningful interpretation of the relationships between the samples in the data. The distances need not be based on Euclidean distance (Equation 2.9) and can actually represent many types of dissimilarities between sample points (e.g., the 1 − correlation measure, as discussed in Section 2.3.2). The distance information is provided by an N x N distance matrix. If a dataset can be easily classified, one would expect to see clear separation between two groups of sample points (corresponding to the two classes). Because this technique is being used merely for the purpose of getting some initial visual ideas of the ease of separability of the two classes for each dataset, further details on MDS are not discussed here. For more details, the reader is referred to [19, 25].
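The MDS views shown in Figure 4.1 can be reproduced in outline with base R, as in the sketch below: classical MDS (cmdscale) is applied to a 1 − correlation dissimilarity matrix between samples, and the first two coordinates are plotted with class labels. The plotting details (panel layout, label abbreviations) are omitted.

# Classical MDS on a 1 - correlation dissimilarity matrix between samples.
# X is a samples x genes matrix of standardized expression values; y gives the classes.
mds_view <- function(X, y) {
  d   <- as.dist(1 - cor(t(X)))      # N x N dissimilarity between samples
  mds <- cmdscale(d, k = 2)          # first two MDS coordinates
  plot(mds, type = "n", xlab = "1st MDS Coordinate", ylab = "2nd MDS Coordinate")
  text(mds, labels = as.character(y), cex = 0.7)
}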

The plots presented in Figure 4.1 show the first two MDS coordinates for each of the six datasets (all genes and all samples used). The results are based on using 1 − correlation as the dissimilarity measure on the normalized expression values. The reader is referred to Section 3.2 for more clarification of what the two classes represent in each of the plots. As seen in Figure 4.1, the Nutt dataset would seem to present a more difficult classification task, since its two groups demonstrate considerable overlap. On the other hand, one would expect the classification task for the Alon and Golub datasets to be considerably less difficult, since they show much better group separation. To see whether these conjectures hold true, the remainder of this chapter is devoted to the construction and predictive accuracy assessment of the various classifiers considered for each of the six datasets.

Figure 4.1: MDS Plots for Each Dataset (Measure=1-corr)

(Panels A-C: Alon, Golub, Nutt; panels D-F: Pomeroy, Shipp, Singh. Each panel plots the 2nd MDS coordinate against the 1st MDS coordinate, with sample points labeled by class.)

4.3 Internal CV Results

This section includes results based on a) running the standard 10-fold internal CV process once and b) running the standard 10-fold internal CV process 10 separate times and averaging the resulting ten misclassification rates to obtain the final misclassification error rate.
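To make the distinction with the external CV of Section 4.4 concrete, the following R sketch lays out the internal CV structure: the top d genes are chosen once, using all samples, and only the classification rule is cross-validated. This is an assumed illustration (with a two-sample t statistic as an example univariate ranking criterion and k-NN from the class package as an example learner), not the code used in this research.

    library(class)  # for knn()

    # "Internal" 10-fold CV: univariate gene selection is done up-front on ALL samples.
    internal_cv <- function(X, y, d, k = 3, folds = 10) {
      # X: genes x samples matrix; y: two-level factor of class labels
      tstat <- apply(X, 1, function(g) abs(t.test(g ~ y)$statistic))
      top   <- order(tstat, decreasing = TRUE)[1:d]    # selection sees every sample
      Z     <- t(X[top, , drop = FALSE])               # samples x selected genes
      fold_id <- sample(rep(1:folds, length.out = ncol(X)))
      errs <- sapply(1:folds, function(f) {
        test <- fold_id == f
        pred <- knn(Z[!test, , drop = FALSE], Z[test, , drop = FALSE],
                    cl = y[!test], k = k)
        mean(pred != y[test])
      })
      mean(errs)   # MER for a single 10-fold CV run
    }

Averaging the value returned by ten such runs gives the repeated-run (10 × 10-fold) estimate.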

Figure 4.2: 1 × 10-Fold CV; Univ FSS Based on All Samples

(Panels A-F: Alon, Golub, Nutt, Pomeroy, Shipp, Singh. Each panel plots mean MER against gene subset size (log scale) for the SVM, DLDA, 1-NN, 3-NN, 7-NN, and 15-NN classifiers.)

Figure 4.3: 10 × 10-Fold CV; Univ FSS Based on All Samples

(Panels A-F: Alon, Golub, Nutt, Pomeroy, Shipp, Singh. Each panel plots mean MER against gene subset size (log scale) for the SVM, DLDA, 1-NN, 3-NN, 7-NN, and 15-NN classifiers.)

With respect to the internal 10-fold CV process, in comparing the single- and repeated-run results shown in this section, the repeated-runs approach yielded slightly smoother curves overall. The Alon and Golub datasets (Figures 4.2 A, B and 4.3 A, B) had the lowest MER's in general for all the classifiers (all ≤ 0.10), followed by the Pomeroy, Shipp, and Singh datasets (Figures 4.2 D, E, F and 4.3 D, E, F). The MER's of these latter three were all generally between 0.10 and 0.20, except for the DLDA curves in each, which were marked by a relatively large increase for gene subset sizes of 1000 and greater; this increase in the DLDA error rates was especially apparent with the Shipp data (Figures 4.2 E and 4.3 E), again taking place around gene subset size 1000. On the other hand, the Nutt dataset (Figures 4.2 C and 4.3 C) was marked by much higher MER's (mostly in the range of 0.2 to 0.3) than the other five datasets across all the gene subset sizes. Also, there seemed to be less variation among the classifier curves for the Alon, Golub, and Pomeroy datasets than in the other three. It was not immediately clear from these plots that one particular classifier was obviously better than any of the others across all the datasets. One should note that the SVM classifier for the Shipp and Singh results seemed to be significantly better than the other classifiers for the larger gene subset sizes (≥ 100), while the DLDA classifier results for the Pomeroy, Shipp, and Singh datasets seemed to be significantly worse than the other classifiers for even bigger gene subset sizes (≥ 1000). The SVM classifier was also the least variable among the six classifiers in all the datasets, and it seemed to be consistently one of the best classifiers in each of the datasets. No gene subset size stood out as the best, as none of the classifiers' curves showed any dominant upward or downward trend with increasing gene subset size.

There appeared to be a dataset-learning algorithm interaction (i.e., for any given number of genes, the rank ordering of the learning algorithms varies from dataset to dataset), as well as some interaction between learning algorithm and gene subset size (i.e., for any dataset, the rank ordering of learning algorithms varies from subset size to subset size). The results also suggested that minimal interaction between dataset and gene subset size was present (i.e., for any learning algorithm, the rank ordering of the datasets varies only slightly from gene subset size to gene subset size). Looking at all three of these experimental parameters together, it seems that a three-way interaction among them exists, since the effect of the number of genes used for classification seemed to be dependent on both the classifier and the dataset.

The situation was further complicated because not only does the magnitude of the effect of the number of genes used change, but also the nature of the effect (i.e., monotonic or not) changes with the choice of both dataset and learning algorithm. Interactions aside, it appeared that dataset had the biggest main effect on the error rates, followed by the size of the top gene subset, with learning algorithm having the least effect (although the SVM curves of Figures 4.2 and 4.3, at least, tended to be among the best across gene subset sizes for all datasets except the Nutt one).

4.4 External CV Results

This section includes results based on a) running the standard 10-fold external CV process once and b) running the standard 10-fold external CV process 10 separate times and averaging the resulting ten misclassification rates to obtain the final misclassification error rate.
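For contrast with the internal CV sketch of Section 4.3, the following assumed R sketch shows the external CV layout, in which the univariate ranking is recomputed within each fold from that fold's training samples only, so that feature selection is part of the cross-validated procedure (again using an example t-statistic ranking and k-NN learner, not the actual code used in this research).

    library(class)  # for knn()

    # "External" 10-fold CV: gene selection is redone inside every fold.
    external_cv <- function(X, y, d, k = 3, folds = 10) {
      # X: genes x samples matrix; y: two-level factor of class labels
      fold_id <- sample(rep(1:folds, length.out = ncol(X)))
      errs <- sapply(1:folds, function(f) {
        test  <- fold_id == f
        Xtr   <- X[, !test, drop = FALSE]; ytr <- y[!test]
        tstat <- apply(Xtr, 1, function(g) abs(t.test(g ~ ytr)$statistic))
        top   <- order(tstat, decreasing = TRUE)[1:d]   # training folds only
        pred  <- knn(t(Xtr[top, , drop = FALSE]),
                     t(X[top, test, drop = FALSE]), cl = ytr, k = k)
        mean(pred != y[test])
      })
      mean(errs)   # MER for a single 10-fold external CV run
    }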

Figure 4.4: 1 × 10-Fold CV with Univ FSS Built-in

(Panels A-F: Alon, Golub, Nutt, Pomeroy, Shipp, Singh. Each panel plots mean MER against gene subset size (log scale) for the SVM, DLDA, 1-NN, 3-NN, 7-NN, and 15-NN classifiers.)

Figure 4.5: 10 × 10-Fold CV with Univ FSS Built-in

(Panels A-F: Alon, Golub, Nutt, Pomeroy, Shipp, Singh. Each panel plots mean MER against gene subset size (log scale) for the SVM, DLDA, 1-NN, 3-NN, 7-NN, and 15-NN classifiers.)

With respect to the external 10-fold CV process, in comparing the single- and repeated-run results shown in Figures 4.4 and 4.5, one should immediately note that performing repeated runs of the CV process led to smoother curves overall. The repeated-run MER curves for these external CV results showed a greater gain in stability over their corresponding single-run results than was the case for the internal CV results (discussed in Section 4.3). This was expected, since the final estimates are averages of the MER's from multiple runs of the process rather than the results of a single run. Otherwise, the same general conclusions found with the internal CV analyses are reached here. Figures 4.4 and 4.5 show that the classifiers' MER curves, except for DLDA in a few datasets, all seemed to follow a generally decreasing (or constant) pattern with increasing subset size, especially with respect to subset sizes ≤ 5. The Nutt results (Figures 4.4 C and 4.5 C) were marked by much higher MER's (mostly in the range of 0.20 to 0.30) in general across all the gene subset sizes. The Alon and Golub results (Figures 4.4 A, B and 4.5 A, B) possessed the lowest MER's in general (mostly in the range of 0.05 to 0.10) across subset sizes, followed by the Pomeroy, Shipp, and Singh datasets (generally between 0.1 and 0.2) (Figures 4.4 D, E, F and 4.5 D, E, F). Also, there seemed to be less variation among the classifier curves for the Alon, Golub, and Pomeroy datasets than in the other three.

In general, it was not totally clear from these plots that one particular classifier was obviously better than any of the others across all the datasets, as they were all quite similar to each other. However, as was the case with the corresponding internal CV plots of Section 4.3, SVM at least seemed to consistently be one of the best classifiers in each of the datasets (aside from the Nutt data, where it was generally higher than the other classifiers for a large number of subset sizes). The SVM curves were generally marked by a typical pattern of slightly improved performance with increasing number of top genes. The Singh and Nutt data were exceptions to this trend with SVM, as the curves in these cases showed no dominant upward or downward trend across subset sizes. With respect to the DLDA curves, it was interesting to note that they seemed to decrease in the range of the top 1 to 10 genes, hold steady (or increase slightly) from 10 to 100 genes, and then increase quickly beyond the top 100 genes. A final note on these plots is that the SVM classifier curves for the Shipp and Singh datasets seemed to be significantly better than the other classifiers' curves for some of the larger gene subset sizes (≥ 100), while the DLDA classifier results for the Pomeroy, Shipp, and Singh datasets seemed to be significantly worse than the rest for even bigger gene subset sizes (≥ 1000). The SVM classifier was also the least variable among the six classifiers in all the datasets. As far as what size of gene subset was best, this was not immediately clear, as it varied from classifier to classifier across top gene subset sizes for each dataset.

Overall, similar to what was discussed in Section 4.3, Figures 4.4 and 4.5 suggest that there appeared to be a dataset-learning algorithm interaction, as well as some interaction between learning algorithm and gene subset size. The results also suggested that an interaction between dataset and subset size is present, albeit very slight. Finally, taking all three parameters together, the presence of a three-way interaction among them was evident (as discussed in Section 4.3). Considering each parameter individually, it appears that dataset had the biggest main effect on the error rates, followed by the size of the top gene subset, with learning algorithm apparently having the least effect (although the SVM curves of Figures 4.4 and 4.5, at least, tended to be among the best across gene subset sizes for each of the datasets except the Nutt one).

4.5 Resubstitution, Internal & External CV, & Selection & Optimism Bias: A Closer Look at the Repeated-Run CV Approach

4.5.1 Resubstitution, Internal CV, and External CV MER’s

A closer look at the repeated-run 10-fold CV results is provided in Figures 4.6 – 4.11. Included in each plot are the internal CV and external CV MER curves together for a given classifier and dataset. Also included in each plot are the resubstitution error rates.

Figure 4.6: 10 × 10-Fold CV; Internal CV vs. External CV; Univ FSS; Alon Data

(Panels A-F: SVM, DLDA, 1-NN, 3-NN, 7-NN, 15-NN. Each panel plots mean MER against gene subset size (log scale) for the external CV, internal CV, and resubstitution error curves.)

Figure 4.7: 10 × 10-Fold CV; Internal CV vs. External CV; Univ FSS; Golub Data

(Panels A-F: SVM, DLDA, 1-NN, 3-NN, 7-NN, 15-NN. Each panel plots mean MER against gene subset size (log scale) for the external CV, internal CV, and resubstitution error curves.)

Figure 4.8: 10 × 10-Fold CV; Internal CV vs. External CV; Univ FSS; Nutt Data

(Panels A-F: SVM, DLDA, 1-NN, 3-NN, 7-NN, 15-NN. Each panel plots mean MER against gene subset size (log scale) for the external CV, internal CV, and resubstitution error curves.)

Figure 4.9: 10 × 10-Fold CV; Internal CV vs. External CV; Univ FSS; Pomeroy Data

(Panels A-F: SVM, DLDA, 1-NN, 3-NN, 7-NN, 15-NN. Each panel plots mean MER against gene subset size (log scale) for the external CV, internal CV, and resubstitution error curves.)

Figure 4.10: 10 × 10-Fold CV; Internal CV vs. External CV; Univ FSS; Shipp Data

(Panels A-F: SVM, DLDA, 1-NN, 3-NN, 7-NN, 15-NN. Each panel plots mean MER against gene subset size (log scale) for the external CV, internal CV, and resubstitution error curves.)

Figure 4.11: 10 × 10-Fold CV; Internal CV vs. External CV; Univ FSS; Singh Data

(Panels A-F: SVM, DLDA, 1-NN, 3-NN, 7-NN, 15-NN. Each panel plots mean MER against gene subset size (log scale) for the external CV, internal CV, and resubstitution error curves.)

Several things should be noted from these plots. First off, with respect to the resubstitution error curves, the ones for the 1-NN classifier were always 0, as one would expect. In general, most of the classifiers' resubstitution error rates across all datasets (but the Nutt one) were relatively stable, except perhaps for the DLDA ones from the Shipp and Singh datasets for gene subset sizes greater than 1000. The Nutt dataset was marked by unstable resubstitution curves for all classifiers (except 1-NN). As discussed in Sections 4.3 and 4.4, the Alon and Golub datasets had the lowest MER curves for all the classifiers across the gene subset sizes for both the external CV and internal CV analyses, followed by the Pomeroy, Shipp, Singh, and Nutt datasets. Finally, one can see that the MER curves from the Shipp, Singh, and especially Nutt datasets were in general more variable than those of the other three datasets, for each of the classifiers.

Regarding the internal and external CV curves, as expected, the internal and external CV results for any given classifier and dataset combination were the same for the model using the maximum possible number of top genes. On the whole, it was interesting to see that the external CV results appeared to be slightly less variable than the internal ones across the subset sizes, for all classifiers and datasets. Also, one can see that the external CV error rates were greater than the internal ones, as the former were based on CV in which the feature selection process was also included in the CV procedure. It should be noted, though, that for all datasets but the Nutt one, the discrepancy between the external CV results and the internal CV results for all classifiers was not too substantial. Further, with external CV, selection bias was taken into account. This issue is discussed further, along with optimism bias, in the following section.

4.5.2 Optimism Bias, Selection Bias, and Total Bias

Figures 4.12, 4.13, and 4.14 illustrate how the optimism bias, selection bias, and “total bias” (optimism bias + selection bias) estimates, respectively, vary by classifier within dataset across gene subset sizes (see Equations 2.34, 2.35, and 2.37).
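In terms of the three error rates already plotted, these estimates amount to simple differences: optimism bias is the internal CV MER minus the resubstitution MER, selection bias is the external CV MER minus the internal CV MER, and total bias is their sum (the external CV MER minus the resubstitution MER). The short R sketch below restates this arithmetic; the numbers in the example call are made up purely for illustration, and the formal definitions are those of Equations 2.34, 2.35, and 2.37.

    # Bias estimates as differences of the three misclassification error rates
    # (a restatement of the verbal definitions used in this section).
    bias_estimates <- function(mer_resub, mer_internal, mer_external) {
      c(optimism  = mer_internal - mer_resub,
        selection = mer_external - mer_internal,
        total     = mer_external - mer_resub)
    }

    # Illustrative (made-up) values:
    bias_estimates(mer_resub = 0.02, mer_internal = 0.06, mer_external = 0.08)
    #  optimism selection     total
    #      0.04      0.02      0.06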

Figure 4.12: 10 × 10-Fold CV; Univ FSS; Optimism Bias vs. Gene Subset Size

(Panels A-F: Alon, Golub, Nutt, Pomeroy, Shipp, Singh. Each panel plots the bias estimate against gene subset size (log scale) for the SVM, DLDA, 1-NN, 3-NN, 7-NN, and 15-NN classifiers.)

Figure 4.13: 10 × 10-Fold CV; Univ FSS; Selection Bias vs. Gene Subset Size

(Panels A-F: Alon, Golub, Nutt, Pomeroy, Shipp, Singh. Each panel plots the bias estimate against gene subset size (log scale) for the SVM, DLDA, 1-NN, 3-NN, 7-NN, and 15-NN classifiers.)

Figure 4.14: 10 × 10-Fold CV; Univ FSS; Total (Sel + Opt) Bias vs. Gene Subset Size

(Panels A-F: Alon, Golub, Nutt, Pomeroy, Shipp, Singh. Each panel plots the bias estimate against gene subset size (log scale) for the SVM, DLDA, 1-NN, 3-NN, 7-NN, and 15-NN classifiers.)

It should be observed from the plots in Figures 4.12 – 4.14 that the Alon and Golub datasets generally had smaller optimism, selection, and total bias values across the subset sizes than the other datasets, for all classifiers, while the Nutt dataset generally had the highest values of all three biases among all the classifiers, across subset sizes. No subset size emerged with significantly better (or worse) bias values across learning algorithms and datasets. With respect to the optimism bias plots in Figure 4.12, among the learning algorithms, DLDA yielded the lowest optimism bias values for all six datasets, while 1-NN generally led to the highest bias values. Since all the curves were predominantly positive across subset sizes and datasets, it was clear that there was at least some penalty in terms of higher MER when not using all the samples to both build and evaluate the classifier. With respect to the selection bias plots in Figure 4.13, it should be observed that the Alon, Singh, and Golub datasets were all marked by little selection bias across all subset sizes. The Nutt, Shipp, and Pomeroy datasets all had slightly higher selection bias, especially the Nutt dataset, whose selection bias curves for all classifiers were also much more variable than those of the other datasets. No particular learning algorithm emerged with significantly better (or worse) bias values across the subset sizes and datasets.

In addition, the fact that all the curves were predominantly positive across subset sizes and datasets indicated there was at least some penalty in terms of higher MER when performing external 10-fold CV instead of the internal CV approach. Finally, looking at the “total bias” plots in Figure 4.14, the Nutt dataset again had the largest total bias values. Overall, there was clearly some bias present for all the classification schemes, across all the datasets. To the extent that “total bias” measures overfit, the results indicate that overfitting was not a consistent function of the number of genes included.

Concluding this section is a table summarizing the means and standard deviations of each of the three bias estimates across gene subset sizes, for each dataset and learning algorithm combination. The empirical grand means across all gene subset sizes, datasets, and learning algorithms for each of the three bias estimates are provided in the last row of Table 4.1. Overall, considering all datasets, classifiers, and gene subset sizes together, the average optimism, selection, and total bias estimates were only 4.7%, 2.6%, and 7.3%, respectively. It should be noted that if the Nutt data were excluded, these averages became 3.6%, 1.9%, and 5.5%, respectively.

4.6 Final Thoughts

To begin with, for both the internal and external CV analyses, the repeated-run analyses generally led to less variable MER curves than did the single-run analyses. This finding was more evident with the external CV results than with the internal CV results. As to whether any set of MER curves from the external CV results of Section 4.4 was lower than the corresponding internal CV results of Section 4.3, the latter were perhaps only slightly better across gene subset sizes than their external CV counterparts, but not by a substantial amount at all. Also, it should be noted that the internal CV MER curves were less stable than those from the external CV approach. Hence, having the feature selection process built into the CV procedure helped avoid the effect of selection bias, and not at the expense of significantly higher MER's. For both implementations of CV, the SVM classifier generally yielded the lowest and least variable MER curves across the datasets. With respect to datasets, the Alon and Golub datasets generally yielded the lowest error rates across all the learning algorithms, while the Nutt dataset was marked by the highest ones. This reinforces what was suggested by the MDS plots in Section 4.2.1.

With respect to classifiers, the SVM classifier for the Golub, Shipp, and Singh results seemed to be significantly better than the other classifiers for the largest subset sizes, while the DLDA classifier results for the Pomeroy, Shipp, and Singh datasets seemed to be significantly worse than the other classifiers for the largest subset sizes. Based on the results of both the external and internal CV analyses, there appeared to be a dataset-learning algorithm interaction, as well as some interaction between learning algorithm and subset size and between dataset and subset size. Taking all three parameters together, the presence of a three-way interaction among them was evident.

Considering each parameter individually, it appeared that dataset had the biggest main effect on the error rates, followed by the size of the top gene subset, with learning algorithm having the least effect.

Directly comparing the external and internal CV approaches, it was found that the selection bias estimates across the majority of the gene subset sizes were positive, indicating there was at least some penalty, albeit not substantial, in terms of higher MER when performing external 10-fold CV instead of the internal CV approach, as expected. In addition, there seemed to be a bigger selection bias for the smaller gene subsets (especially sizes ≤ 10) than for the bigger subsets – an observation consistent from classifier to classifier. Only the Nutt dataset had noticeably higher selection bias estimates than those of the other datasets, but even then, the average across classifiers was only about 6% (Table 4.1). Similarly, it was found for all datasets that there was some optimism bias present among the classification rules used, as a result of using the same samples to both build the classifier and estimate its performance. Again though, the optimism bias estimates across all the learning algorithms and subset sizes were very small. Only the Nutt dataset had higher optimism bias estimates, especially with the 1-NN classifier (on average across all classifiers, though, about 10% for the Nutt data). To the extent that the “total bias” estimates measure overfit, the results indicate that overfitting is not a consistent function of the number of genes included in a given model.

Table 4.1: Optimism, Selection, & Total Bias Across All Subset Sizes; Univ FSS

Dataset    Learning    Opt.Bias        Sel.Bias        Tot.Bias
           Algorithm   Mean (SD)       Mean (SD)       Mean (SD)
----------------------------------------------------------------
Alon       SVM         0.019 (0.013)   0.009 (0.011)   0.027 (0.013)
           DLDA        0.010 (0.015)   0.014 (0.010)   0.024 (0.015)
           1-NN        0.041 (0.018)   0.011 (0.016)   0.051 (0.025)
           3-NN        0.016 (0.016)   0.011 (0.014)   0.026 (0.019)
           7-NN        0.013 (0.013)   0.014 (0.015)   0.026 (0.016)
           15-NN       0.010 (0.010)   0.012 (0.012)   0.022 (0.014)
Golub      SVM         0.020 (0.011)   0.012 (0.016)   0.032 (0.018)
           DLDA        0.017 (0.033)   0.021 (0.020)   0.037 (0.037)
           1-NN        0.045 (0.018)   0.013 (0.019)   0.058 (0.020)
           3-NN        0.025 (0.018)   0.012 (0.017)   0.036 (0.024)
           7-NN        0.009 (0.010)   0.016 (0.018)   0.025 (0.020)
           15-NN       0.016 (0.019)   0.014 (0.019)   0.030 (0.019)
Nutt       SVM         0.166 (0.066)   0.087 (0.042)   0.253 (0.052)
           DLDA        0.056 (0.045)   0.070 (0.027)   0.126 (0.029)
           1-NN        0.243 (0.043)   0.050 (0.046)   0.293 (0.033)
           3-NN        0.066 (0.031)   0.047 (0.040)   0.113 (0.056)
           7-NN        0.043 (0.034)   0.052 (0.033)   0.095 (0.032)
           15-NN       0.031 (0.033)   0.050 (0.035)   0.081 (0.036)
Pomeroy    SVM         0.067 (0.032)   0.030 (0.028)   0.097 (0.026)
           DLDA        0.018 (0.017)   0.043 (0.024)   0.061 (0.021)
           1-NN        0.120 (0.036)   0.017 (0.020)   0.138 (0.043)
           3-NN        0.026 (0.023)   0.022 (0.020)   0.048 (0.028)
           7-NN        0.015 (0.012)   0.025 (0.021)   0.040 (0.022)
           15-NN       0.021 (0.024)   0.027 (0.021)   0.049 (0.023)
Shipp      SVM         0.043 (0.026)   0.028 (0.029)   0.071 (0.037)
           DLDA        0.004 (0.005)   0.028 (0.031)   0.032 (0.031)
           1-NN        0.114 (0.045)   0.024 (0.033)   0.138 (0.052)
           3-NN        0.044 (0.021)   0.027 (0.032)   0.071 (0.034)
           7-NN        0.019 (0.017)   0.031 (0.034)   0.050 (0.034)
           15-NN       0.012 (0.009)   0.029 (0.035)   0.041 (0.034)
Singh      SVM         0.068 (0.034)   0.017 (0.013)   0.085 (0.032)
           DLDA        0.008 (0.008)   0.019 (0.012)   0.027 (0.012)
           1-NN        0.151 (0.054)   0.010 (0.014)   0.161 (0.049)
           3-NN        0.060 (0.036)   0.012 (0.015)   0.072 (0.026)
           7-NN        0.034 (0.019)   0.012 (0.014)   0.046 (0.013)
           15-NN       0.017 (0.013)   0.012 (0.017)   0.029 (0.013)
----------------------------------------------------------------
Grand Avg              0.047 (0.024)   0.026 (0.023)   0.073 (0.028)

Chapter 5

Multivariate-Based FSS Results

5.1 Introduction

Much of the work done to date with respect to binary classification of microarray data has been based on univariate-based feature selection approaches. Of course, simple univariate gene ranking approaches would suffice if, among the thousands of genes, there exist a number of dominant (individually predictive) genes. However, based on the results discussed in Section 3.4, I believe there is definitely cause to further investigate the merits of a multivariate, modular approach to feature subset selection for binary classification with high-dimensional data from microarrays. Biologically, it is known that genes work together, so it would at least seem reasonable to consider performing feature selection in a multivariate manner. At the same time, though, it should be noted that biological interaction among genes does not necessarily imply that the genes will be jointly predictive in a classification problem. An evolutionary algorithm such as the genetic algorithm has the advantage of being able to find at least near-optimal solutions (i.e., subsets of jointly significant genes) to use for the classification of microarray data in a setting in which there are simply too many genes to find optimal combinations through a completely exhaustive search of the high-dimensional feature space. Further, with univariate screening to form a list of genes based on their individual predictive power, it is suspected that there are solutions that would not be found by combining top individually predictive genes from the list (as discussed in Sections 2.4.1 and 3.3).

This chapter includes misclassification error rates obtained using a multivariate-based means of feature subset selection, the genetic algorithm, in both a single- and a two-stage setting, in conjunction with six supervised learning algorithms (SVM, DLDA, and k-NN with k = 1, 3, 7, 15). Both internal and external 10-fold cross-validation were used to estimate the predictive accuracy of the various classification processes. These analyses are done with respect to various gene subset sizes: 1, 2, 3, 4, 5, 10, 15, 20, and 25. The same six published microarray datasets and preprocessing used for the CV analyses with univariate-based feature subset selection are used in this chapter (refer to Section 4.2 for more details on the datasets). All analyses of this chapter were performed using the R statistical software package [35] on a Red Hat Linux machine (dual Intel(R) 3.06 GHz processors and 4 GB memory).

5.1.1 Internal CV, External CV, and Repeated Runs with the GA

One of the more interesting aspects of the results of this chapter is that half of them are based on implementing the GA in a manner not before investigated in the microarray classification and feature selection literature. That is, the GA feature selection processes are built into each stage of a 10-fold CV process (external CV) such that the effect of selection bias can be taken into account, as was successfully done with the univariate feature selection approach of Chapter 4. The other half of the results are based on an internal CV approach, in which the GA is performed 'up-front', prior to the CV process, using all the samples of each dataset. For more motivation on these two approaches to CV, refer to Section 4.1.3. Repeated (10) runs of the GA process are used to provide more stable estimates of the 10-fold CV error rates. The repeated runs are implemented via the genalg software (discussed further in Section 5.1.3), which provides the user with the opportunity to run the entire GA multiple (10) times for each microarray dataset. In the case of external CV, the GA is applied 10 times to each of the 10 training subsets of samples corresponding to the stages of the 10-fold CV process. The same 10 training subsets for each of the datasets were used across learning algorithms to maintain consistency in interpreting the 10-fold CV results among the classifiers.

5.1.2 Single- and Two-Stage GA-Based Approaches

A multivariate-based means of feature subset selection, the genetic algorithm (GA), was used to perform gene selection. For more details on the GA in general, the reader is referred to [17, 20, 27]. Both a single-stage and a two-stage GA feature selection process were implemented.

For the single-stage GA approach, the GA considers all p genes of each dataset. The number of d-gene solutions (“chromosomes”) selected by each implementation of the single-stage GA is 1000, so over 10 iterations, a “superpopulation” of 10000 candidate solutions is obtained. The number of generations to run for each iteration of the GA was set to 250, which was large enough to ensure convergence of the 1000 solutions.

For the two-stage approach (“GA-GA”), the first GA stage takes into account all p genes, while in the second stage, the algorithm is applied to a reduced set of genes based on the initial GA's selection results for each training subset of the datasets. For each training set of data, for a given subset size d, the union of all genes selected among the final generation's population of 1000 d-gene solutions from the first stage of the GA, over all 10 iterations of the GA, is retained as the reduced gene pool to use for the second stage. That is, the second stage's GA procedure uses as its initial gene pool all genes that appeared at least once among the “superpopulation” of 10000 solutions obtained from the initial GA stage. This way, genes that may appear in a small proportion of the 1000 solutions of any one iteration, but appear in multiple iterations of the GA for a given set of training data, have a better chance of being considered for use in building classifiers based on a given gene subset size and an appropriate learning method. Thus, the idea behind the second implementation of the GA is to attempt to select the 'best of the best' genes from a given training dataset. It should be noted that for the second-stage GA, the number of d-gene solutions selected by each implementation of the GA was reduced to 500, since the initial gene pool was reduced considerably (the number of generations remained at 250).
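The pooling step that links the two stages can be sketched in a few lines of R. This is an assumed illustration, not the genalg code itself, and it assumes (purely for exposition) that the first-stage superpopulation is stored as a matrix of gene indices with one d-gene solution per row.

    # Form the reduced gene pool for the second GA stage: every gene that
    # appears at least once in the 10000 first-stage solutions is retained.
    second_stage_pool <- function(solutions) {
      # solutions: 10000 x d matrix of gene indices (1000 solutions from each
      # of the 10 GA iterations, stacked by row)
      sort(unique(as.vector(solutions)))
    }

    # Hypothetical usage:
    # pool      <- second_stage_pool(superpop)   # superpop from the first stage
    # X_reduced <- X[pool, ]                     # gene pool passed to the second stage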

5.1.3 Genalg Files and Parameterization

The code used is a C++ program named genalg from the Department of Applied Mathematics and Biostatistics at the University of Texas M.D. Anderson Cancer Center [8]. The GA portions of this research were run at M.D. Anderson on Condor, a specialized, full-featured batch workload management system of roughly 80 machines used for handling multiple computationally intensive jobs in parallel (on Windows-based platforms). In this program, the fitness function used to evaluate the candidate d-gene solutions is the Mahalanobis distance (refer to Equation 2.10). For more information on the genalg software, the reader should refer to [8].
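As an illustration of the kind of fitness computation involved, the R sketch below scores one candidate d-gene solution by the squared Mahalanobis distance between the two class means under a pooled covariance estimate. It is an assumed re-implementation for exposition only (the actual fitness is computed inside the genalg C++ code; see Equation 2.10), and it omits any regularization the software may apply when the pooled covariance is near-singular.

    # Mahalanobis-distance fitness for one candidate set of d genes.
    mahalanobis_fitness <- function(X, y, genes) {
      # X: genes x samples matrix; y: two-level factor; genes: indices of the d genes
      Z1 <- t(X[genes, y == levels(y)[1], drop = FALSE])   # class-1 samples x d
      Z2 <- t(X[genes, y == levels(y)[2], drop = FALSE])   # class-2 samples x d
      S  <- ((nrow(Z1) - 1) * cov(Z1) + (nrow(Z2) - 1) * cov(Z2)) /
            (nrow(Z1) + nrow(Z2) - 2)                      # pooled covariance
      d12 <- colMeans(Z1) - colMeans(Z2)
      as.numeric(t(d12) %*% solve(S) %*% d12)              # squared distance (fitness)
    }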

The GA software requires a couple of parameterization files in addition to the datafiles – a meta file and a driver file. The meta file, which only needs to be created once for each dataset the GA is to be run on, provides basic information about the dataset as described below:

= number of rows p in the dataset (i.e., the number of genes; varies by dataset)
= number of columns N in the dataset (i.e., the number of samples; varies by dataset)
= a binary vector of length N denoting which samples are of one class and which are of the other (e.g., cancer (1) and normal (0); varies by dataset)
= dataset that this meta file describes (tab-delimited dataset with no header; p rows and N columns)

Parameters for the implementation of the genetic algorithm are specified in the driver file as follows (where applicable, the values used in this research are given in parentheses following each parameter):

= number of iterations to run the genetic algorithm (10)
= population size, i.e., the number of d-gene solutions generated per generation (1000 for the first stage of the GA and 500 for the second stage)
= number of genes d to consider for each solution of the population during the GA runs (d = 1, 2, 3, 4, 5, 10, 15, 20, 25), where d is specified by this parameter
= number of generations to run for each iteration of the GA (250; large enough to have convergence of solutions)
= probability in each generation of a feature being randomly changed (0.001; the standard value assigned to this parameter in GA applications)
= prefix that every output file from a given run will begin with (varies by dataset)
= output will be written to file every g generations, where g is specified by this parameter (the first and last generations are always written to file) (10)
= name of the meta file (varies by dataset)

5.1.4 Plot Breakdowns

All results shown in this section are based on error rates obtained from 10-fold external and internal cross-validation, with the single-stage and two-stage GA processes serving as the feature selection mechanisms. The same variety of plots generated for the error rates based on univariate feature selection are also presented here. These results include resubstitution, external CV, and internal CV curves plotted against gene subset size, as well as plots investigating the effect of both optimism and selection bias as a function of subset size. The choice of gene subset sizes to use for all datasets was made taking into account the feasibility of the computation that would be involved in running the GA as well as the desire to maintain relatively small gene subset sizes. In this work, no subsets larger than 25 genes were considered, as preliminary investigation showed no significant increase nor decrease in error rates for these larger sizes, for all datasets and learning algorithms. This choice was further supported by the univariate-based results of Chapter 4, in which gene subset sizes larger than about 25 did not yield significantly better error rates, at the expense of a more complicated model with a larger number of genes.

5.2 Resubstitution, External & Internal CV, & Selection & Optimism Bias

Similar to what was done in Chapter 4, this section provides a direct comparison of the 10-fold external and internal cross-validation results for both the single-stage and two-stage GA-based feature selection approaches. Also included are results in which the differences among the resubstitution errors, external CV errors, and internal CV errors are computed as a function of subset size, as a way of assessing the optimism, selection, and total bias incurred.

5.2.1 Resubstitution, Internal CV, and External CV MER’s

The results of this section compare the 10-fold external and internal CV error rates, as well as the resubstitution errors, across gene subset sizes for both the single-stage and two-stage GA-based feature selection approaches.

Figure 5.1: 10-Fold CV; Internal CV vs. External CV; 1- and 2-Stage GA FSS; Alon Data

(Panels A-F: SVM, DLDA, 1-NN, 3-NN, 7-NN, 15-NN. Each panel plots mean MER against gene subset size for the external CV, internal CV, and resubstitution curves of both the single-stage (GA) and two-stage (GAGA) feature selection approaches.)

Figure 5.2: 10-Fold CV; Internal CV vs. External CV; 1- and 2-Stage GA FSS; Golub Data

(Panels A-F: SVM, DLDA, 1-NN, 3-NN, 7-NN, 15-NN. Each panel plots mean MER against gene subset size for the external CV, internal CV, and resubstitution curves of both the single-stage (GA) and two-stage (GAGA) feature selection approaches.)

Figure 5.3: 10-Fold CV; Internal CV vs. External CV; 1- and 2-Stage GA FSS; Nutt Data

(Panels A-F: SVM, DLDA, 1-NN, 3-NN, 7-NN, 15-NN. Each panel plots mean MER against gene subset size for the external CV, internal CV, and resubstitution curves of both the single-stage (GA) and two-stage (GAGA) feature selection approaches.)

Figure 5.4: 10-Fold CV; Internal CV vs. External CV; 1- and 2-Stage GA FSS; Pomeroy Data

(Panels A-F: SVM, DLDA, 1-NN, 3-NN, 7-NN, 15-NN. Each panel plots mean MER against gene subset size for the external CV, internal CV, and resubstitution curves of both the single-stage (GA) and two-stage (GAGA) feature selection approaches.)

Figure 5.5: 10-Fold CV; Internal CV vs. External CV; 1- and 2-Stage GA FSS; Shipp Data

(Panels A-F: SVM, DLDA, 1-NN, 3-NN, 7-NN, 15-NN. Each panel plots mean MER against gene subset size for the external CV, internal CV, and resubstitution curves of both the single-stage (GA) and two-stage (GAGA) feature selection approaches.)

Figure 5.6: 10-Fold CV; Internal CV vs. External CV; 1- and 2-Stage GA FSS; Singh Data

(Panels A-F: SVM, DLDA, 1-NN, 3-NN, 7-NN, 15-NN. Each panel plots mean MER against gene subset size for the external CV, internal CV, and resubstitution curves of both the single-stage (GA) and two-stage (GAGA) feature selection approaches.)

From the plots, one can see that the resubstitution error curves for the 1-NN classifier were always 0, as one would expect. Focusing on the datasets, one should note that the Alon and Golub results possessed the lowest MER's in general across subset sizes (mostly in the range of only 0.05 to 0.10 for the external results and around 0.05 for the internal ones), while the Nutt dataset had the highest MER's (mostly between 0.30 and 0.50 for the external CV results and between 0.10 and 0.35 for the internal results). Also, there seemed to be slightly more variation among the CV curves, and even the resubstitution curves, across subset sizes for the Nutt data than in the other five datasets. The other three datasets were quite comparable in terms of their error rates across subset sizes. With respect to the single-stage vs. two-stage GA procedure, the more complicated two-stage one surprisingly did not offer a noticeable advantage in terms of lower error rates over the simpler single-stage one. In terms of learning algorithms, it was not totally clear from these plots that one particular algorithm was significantly better than any of the others across all the datasets. As far as what size gene subset was best, this was not immediately clear either, as it varied from classifier to classifier across subset sizes, with no dominant upward or downward trend with increasing size, for each dataset.

Similar to what was found from the univariate-based feature selection results of Chapter 4, there appeared to be a dataset-learning algorithm interaction (i.e., for any given number of genes, the rank ordering of the learning algorithms varied from dataset to dataset), an interaction between learning algorithm and gene subset size (i.e., for any dataset, the rank ordering of learning algorithms varied from gene subset size to gene subset size), and perhaps a minimal interaction between dataset and gene subset size (i.e., for any learning algorithm, the rank ordering of the datasets varied only slightly from gene subset size to gene subset size). Looking at all three of these experimental parameters together, it would seem that a three-way interaction among them exists, since the effect of the number of genes used for classification seemed to be dependent on both the classifier and the dataset. Interactions aside, it appears that dataset had the biggest main effect on the error rates, followed by gene subset size, with classifier apparently having the least effect.

Finally, focusing on the internal and external CV curves for all the datasets, one can see that the external CV error rates were greater than the internal ones, as expected. However, for all datasets but the Nutt one, the discrepancy between the external and internal CV results for all classifiers was not too substantial. Furthermore, in using external CV, not only were the error rates comparable, but selection bias was also taken into account, providing for more honest estimates of the misclassification error rates. The effects of optimism and selection bias are investigated further in the following section.

5.2.2 Optimism Bias, Selection Bias, and Total Bias

For the single-stage GA results, Figures 5.7, 5.9, and 5.11 illustrate how the optimism bias, selection bias, and “total bias” (optimism bias + selection bias) estimates, respectively, vary by classifier within dataset across gene subset sizes (see Equations 2.34, 2.35, and 2.37). The analogous optimism, selection, and total bias plots for the two-stage GA results are shown in Figures 5.8, 5.10, and 5.12, respectively.

Figure 5.7: 10-Fold CV; 1-Stage GA FSS; Optimism Bias vs. Gene Subset Size

(Panels A-F: Alon, Golub, Nutt, Pomeroy, Shipp, Singh. Each panel plots the bias estimate against gene subset size for the SVM, DLDA, 1-NN, 3-NN, 7-NN, and 15-NN classifiers.)

Figure 5.8: 10-Fold CV; 2-Stage GA FSS; Optimism Bias vs. Gene Subset Size

(Panels A-F: Alon, Golub, Nutt, Pomeroy, Shipp, Singh. Each panel plots the bias estimate against gene subset size for the SVM, DLDA, 1-NN, 3-NN, 7-NN, and 15-NN classifiers.)

Figure 5.9: 10-Fold CV; 1-Stage GA FSS; Selection Bias vs. Gene Subset Size

(Panels A-F: Alon, Golub, Nutt, Pomeroy, Shipp, Singh. Each panel plots the bias estimate against gene subset size for the SVM, DLDA, 1-NN, 3-NN, 7-NN, and 15-NN classifiers.)

Figure 5.10: 10-Fold CV; 2-Stage GA FSS; Selection Bias vs. Gene Subset Size

(Panels A-F: Alon, Golub, Nutt, Pomeroy, Shipp, Singh. Each panel plots the bias estimate against gene subset size for the SVM, DLDA, 1-NN, 3-NN, 7-NN, and 15-NN classifiers.)

Figure 5.11: 10-Fold CV; 1-Stage GA FSS; Total (Sel + Opt) Bias vs. Gene Subset Size

(Panels A-F: Alon, Golub, Nutt, Pomeroy, Shipp, Singh. Each panel plots the bias estimate against gene subset size for the SVM, DLDA, 1-NN, 3-NN, 7-NN, and 15-NN classifiers.)

Figure 5.12: 10-Fold CV; 2-Stage GA FSS; Total (Sel + Opt) Bias vs. Gene Subset Size

(Panels A-F: Alon, Golub, Nutt, Pomeroy, Shipp, Singh. Each panel plots the bias estimate against gene subset size for the SVM, DLDA, 1-NN, 3-NN, 7-NN, and 15-NN classifiers.)

In general, regardless of whether the single- or two-stage GA was used as the feature selection process, it should be noted from Figures 5.7 – 5.12 that all datasets except the Nutt one possessed very little optimism, selection, and total bias across all gene subset sizes. Further, the bias curves for all classifiers of the Nutt dataset were also much more variable than those of the other datasets. The Alon and Golub datasets generally had smaller values of all three biases for all the classifiers, as was the case with the univariate-based feature selection approaches. No particular subset size emerged with significantly better (or worse) bias values across learning algorithms and datasets. Focusing on the optimism bias plots of Figures 5.7 and 5.8, among the learning algorithms, DLDA was among the lowest bias curves for all six datasets, while 1-NN generally led to the highest bias values. As seen with the univariate-based results of Chapter 4, the optimism bias values were predominantly positive across subset sizes and datasets, indicating there was at least some penalty in terms of higher MER when not using all the samples to both build and evaluate the classifiers. For the selection bias plots of Figures 5.9 and 5.10, it should be noted that no particular learning algorithm emerged with significantly better (or worse) bias values across the subset sizes and datasets. The fact that all the curves were predominantly positive across subset sizes and datasets indicated there was at least some penalty in terms of higher MER when performing external 10-fold CV instead of the internal CV approach. These findings were also evident in the univariate-based feature selection results of Chapter 4. Finally, looking at the “total bias” plots in Figures 5.11 and 5.12, the Nutt dataset again had the largest total bias values. Overall, as was the case with the univariate results, there was clearly some optimism and selection bias present for all the classification rules and across all the datasets. To the extent that “total bias” measures overfit, the results indicate that overfitting was not a consistent function of the number of genes included.

Concluding this section are a couple of tables summarizing the means and standard deviations of the optimism, selection, and total bias estimates across gene subset sizes, for each combination of dataset and learning algorithm. Table 5.1 contains the single-stage GA-based feature selection results, while Table 5.2 contains the results from the two-stage GA feature selection approach. The empirical grand means across all subset sizes, datasets, and learning algorithms for each of the three bias estimates are provided in the last row of each of the tables. Overall, considering all datasets, classifiers, and gene subset sizes together, the average optimism, selection, and total bias estimates for the single-stage GA-based results were only 3.6%, 6.5%, and 10.1%, respectively. For the two-stage GA-based results, these averages were virtually identical at 3.7%, 6.3%, and 10.0%, respectively. It should be noted that if the Nutt data were excluded, these averages became 3.1%, 4.5%, and 7.5% for the single-stage GA results and 2.9%, 4.4%, and 7.4% for the two-stage GA results.

5.3 Final Thoughts

For both implementations of GA-based feature selection, for all datasets except the Nutt one, the discrepancy between the external CV results and internal CV results for all classifiers was not substantial. External CV, however, allowed for selection bias to be taken into account and hence provided more honest estimates of the misclassification error rates. Among the datasets, the Alon and Golub results possessed the lowest CV MER's in general across subset sizes, while the Nutt dataset had the highest MER's. This reinforces what was shown in the MDS plots in Section 4.2.1. Also, with the Nutt dataset, there seemed to be more variation among the CV and resubstitution curves across subset sizes than in the other five datasets. The remaining three datasets were quite comparable in terms of their error rates across subset sizes.

Considering the two GA-based feature selection approaches, the more complicated two-stage one did not offer a noticeable advantage over the single-stage one. In terms of learning algorithms, no particular algorithm emerged as significantly better than any of the others across all the datasets. No single subset size emerged as best either, as the best-performing size varied from classifier to classifier for each of the datasets.

Based on the internal and external CV results, there appeared to be a dataset-classifier interaction, as well as some interaction between classifier and gene subset size and between dataset and subset size. Taking all three parameters together, the presence of a three-way interaction among them was evident. Considering each parameter individually, it appeared that dataset had the biggest main effect on the error rates, followed by subset size and classifier. Further investigating internal CV, external CV, and resubstitution errors, the optimism and selection bias estimates across the majority of the gene subset sizes were positive, indicating at least some penalty in terms of a) higher MER when not using all samples to both train and evaluate a given classifier, and b) higher MER when performing external 10-fold CV instead of the internal CV approach. A direct comparison of the repeated-run internal and external 10-fold CV error rates based on the univariate feature selection with the internal and external 10-fold CV results based on the two GA feature selection approaches is provided in the next chapter.

Table 5.1: Optimism, Selection, & Total Bias Across All Subset Sizes; 1-Stage GA

                    Opt.Bias         Sel.Bias         Tot.Bias
Dataset   Algorithm Mean (SD)        Mean (SD)        Mean (SD)
Alon      SVM       0.006 (0.013)    0.042 (0.024)    0.048 (0.028)
          DLDA      0.005 (0.008)    0.041 (0.026)    0.047 (0.021)
          1-NN      0.026 (0.019)    0.033 (0.025)    0.059 (0.030)
          3-NN      0.014 (0.012)    0.040 (0.038)    0.054 (0.038)
          7-NN      0.012 (0.013)    0.051 (0.032)    0.064 (0.030)
          15-NN     0.019 (0.019)    0.036 (0.039)    0.055 (0.028)
Golub     SVM       0.003 (0.007)    0.052 (0.026)    0.055 (0.030)
          DLDA      0.001 (0.012)    0.024 (0.028)    0.025 (0.035)
          1-NN      0.035 (0.020)    0.035 (0.033)    0.070 (0.023)
          3-NN      0.012 (0.016)    0.032 (0.022)    0.043 (0.027)
          7-NN      0.017 (0.016)    0.020 (0.017)    0.037 (0.025)
          15-NN     0.014 (0.019)    0.014 (0.028)    0.028 (0.024)
Nutt      SVM       0.021 (0.025)    0.237 (0.066)    0.259 (0.062)
          DLDA      0.037 (0.058)    0.149 (0.060)    0.186 (0.062)
          1-NN      0.198 (0.107)    0.094 (0.129)    0.291 (0.069)
          3-NN      0.067 (0.042)    0.179 (0.125)    0.246 (0.101)
          7-NN      0.047 (0.039)    0.170 (0.094)    0.217 (0.068)
          15-NN     0.028 (0.023)    0.169 (0.069)    0.197 (0.069)
Pomeroy   SVM       0.012 (0.018)    0.088 (0.037)    0.100 (0.031)
          DLDA      0.005 (0.010)    0.091 (0.021)    0.096 (0.024)
          1-NN      0.141 (0.079)    0.044 (0.040)    0.185 (0.063)
          3-NN      0.048 (0.035)    0.070 (0.043)    0.119 (0.051)
          7-NN      0.049 (0.036)    0.040 (0.048)    0.089 (0.035)
          15-NN     0.020 (0.025)    0.040 (0.059)    0.059 (0.046)
Shipp     SVM       0.014 (0.013)    0.056 (0.027)    0.069 (0.028)
          DLDA      0.014 (0.011)    0.068 (0.031)    0.082 (0.029)
          1-NN      0.128 (0.049)    0.036 (0.053)    0.163 (0.056)
          3-NN      0.052 (0.038)    0.029 (0.058)    0.081 (0.041)
          7-NN      0.023 (0.022)    0.034 (0.055)    0.056 (0.038)
          15-NN     0.026 (0.027)    0.020 (0.060)    0.046 (0.056)
Singh     SVM       0.013 (0.009)    0.090 (0.053)    0.103 (0.054)
          DLDA      0.004 (0.009)    0.071 (0.045)    0.075 (0.044)
          1-NN      0.126 (0.098)    0.036 (0.060)    0.162 (0.060)
          3-NN      0.033 (0.019)    0.043 (0.058)    0.076 (0.046)
          7-NN      0.033 (0.021)    0.032 (0.057)    0.065 (0.041)
          15-NN     0.011 (0.018)    0.029 (0.072)    0.040 (0.071)
Grand Avg           0.036 (0.028)    0.065 (0.049)    0.101 (0.044)

Table 5.2: Optimism, Selection, & Total Bias Across All Subset Sizes; 2-Stage GA

                    Opt.Bias         Sel.Bias         Tot.Bias
Dataset   Algorithm Mean (SD)        Mean (SD)        Mean (SD)
Alon      SVM       0.006 (0.013)    0.030 (0.015)    0.036 (0.025)
          DLDA      0.007 (0.009)    0.037 (0.021)    0.044 (0.018)
          1-NN      0.036 (0.023)    0.019 (0.027)    0.056 (0.032)
          3-NN      0.009 (0.011)    0.031 (0.043)    0.040 (0.039)
          7-NN      0.011 (0.010)    0.036 (0.030)    0.047 (0.027)
          15-NN     0.027 (0.024)    0.027 (0.052)    0.053 (0.034)
Golub     SVM       0.002 (0.007)    0.045 (0.024)    0.048 (0.024)
          DLDA     -0.001 (0.011)    0.032 (0.021)    0.031 (0.027)
          1-NN      0.036 (0.025)    0.044 (0.036)    0.080 (0.030)
          3-NN      0.020 (0.014)    0.030 (0.032)    0.049 (0.034)
          7-NN      0.011 (0.019)    0.023 (0.020)    0.034 (0.029)
          15-NN     0.010 (0.016)    0.019 (0.021)    0.029 (0.029)
Nutt      SVM       0.013 (0.018)    0.268 (0.078)    0.281 (0.066)
          DLDA      0.049 (0.057)    0.179 (0.069)    0.228 (0.094)
          1-NN      0.189 (0.086)    0.145 (0.109)    0.334 (0.052)
          3-NN      0.109 (0.075)    0.148 (0.138)    0.256 (0.089)
          7-NN      0.054 (0.054)    0.109 (0.132)    0.163 (0.091)
          15-NN     0.038 (0.035)    0.100 (0.138)    0.138 (0.112)
Pomeroy   SVM       0.007 (0.011)    0.098 (0.035)    0.105 (0.033)
          DLDA      0.015 (0.015)    0.095 (0.022)    0.110 (0.030)
          1-NN      0.128 (0.093)    0.068 (0.054)    0.196 (0.062)
          3-NN      0.049 (0.036)    0.078 (0.028)    0.127 (0.042)
          7-NN      0.023 (0.019)    0.080 (0.036)    0.104 (0.034)
          15-NN     0.015 (0.016)    0.065 (0.042)    0.080 (0.034)
Shipp     SVM       0.008 (0.011)    0.057 (0.029)    0.065 (0.025)
          DLDA      0.012 (0.012)    0.044 (0.026)    0.056 (0.023)
          1-NN      0.125 (0.044)    0.034 (0.064)    0.160 (0.062)
          3-NN      0.039 (0.022)    0.034 (0.055)    0.073 (0.055)
          7-NN      0.021 (0.021)    0.040 (0.045)    0.061 (0.046)
          15-NN     0.024 (0.021)    0.053 (0.040)    0.078 (0.039)
Singh     SVM       0.009 (0.012)    0.074 (0.046)    0.082 (0.047)
          DLDA      0.004 (0.008)    0.061 (0.034)    0.065 (0.034)
          1-NN      0.123 (0.099)    0.028 (0.064)    0.151 (0.057)
          3-NN      0.048 (0.025)    0.011 (0.054)    0.058 (0.041)
          7-NN      0.030 (0.023)    0.026 (0.069)    0.056 (0.055)
          15-NN     0.023 (0.012)    0.015 (0.073)    0.038 (0.073)
Grand Avg           0.037 (0.028)    0.063 (0.051)    0.100 (0.046)

Chapter 6

Univariate or Multivariate: Comparing the Results

6.1 Introduction

This chapter reflects on the results of Chapters 4 and 5. Attention is first given to how effective the single- and two-stage GA-based feature selection processes were at selecting jointly discriminatory subsets of genes that would otherwise not be detected by combining individually predictive genes from the univariate screen. In addition, a direct comparison will be made among the univariate- and multivariate-based approaches, in terms of the 10-fold internal and external CV error curves across a selected group of gene subset sizes, for all learning algorithms and datasets. All analyses of this chapter were performed using the R statistical software package [35] on a Red Hat Linux machine (dual Intel(R) 3.06 GHz processors and 4 GB memory).

6.2 Gene Selection: Univariate vs. Multivariate

This section investigates whether or not the GA feature selection approaches were capable of detecting jointly discriminative groups of genes that would not be easily selected by combining individually predictive genes from the simpler T-test feature selection approach. Results for this section were generated in a resubstitution setting; that is, they were based on performing the feature selection processes on all samples of each dataset. This way, specific genes could be easily monitored, as opposed to having the gene subsets vary at each stage of a 10-fold CV procedure. Although for modeling purposes these particular genes would lead to biased results when using cross-validation to assess predictive accuracy, since the same samples would be used for both training and prediction, the aim of this exercise is to gain quick insight into whether or not the multivariate-based feature selection approaches were able to choose subsets of genes that would otherwise have gone undetected by the univariate approach.
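To make the univariate screen concrete, the following minimal sketch (in R, the software used for these analyses) ranks genes by unequal-variance t-tests computed on all samples, as in the resubstitution setting just described. The object names X (a genes-by-samples expression matrix), y (a two-level class factor), and rank.genes.ttest are hypothetical and are not taken from the thesis code.

```r
# Minimal sketch: rank genes by Welch (unequal-variance) t-test p-values.
# X: genes-by-samples expression matrix; y: two-level class factor (hypothetical names).
rank.genes.ttest <- function(X, y, top = 100) {
  y <- factor(y)
  stopifnot(nlevels(y) == 2)
  pvals <- apply(X, 1, function(g)
    t.test(g[y == levels(y)[1]], g[y == levels(y)[2]],
           var.equal = FALSE)$p.value)          # Welch two-sample t-test per gene
  ord <- order(pvals)                           # most significant genes first
  data.frame(gene.index = ord[seq_len(top)],
             p.value    = pvals[ord[seq_len(top)]])
}
```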

Tables A.7 – A.18 of Appendix A show the genes selected after both the first and second stages of GA, for each of the subset sizes considered. It should be noted that the columns entitled “Gene Index” provide the row (or column, depending on how one imports the dataset) number of each particular gene, so the reader is referred to the original dataset for cross-referencing any gene of interest (see Section 3.2 for more information pertaining to each dataset’s website where the data was downloaded).

Regarding the subset sizes, it should be noted that they are the same ones used in the multivariate-based results of Chapter 5, since they represent sizes common to those considered in the univariate results of Chapter 4, for all six datasets. In addition, these subset sizes yielded desirable CV error rates in both analyses. The low error rates and small subset sizes are both favorable aspects to consider when designing models for use in classifying gene expression data. The tables show where on the univariate ranked gene list each of the selected genes was positioned, as well as each gene’s p-value. From these tables, it is interesting to note where many of the genes selected by the GA-based feature selection schemes were located on the univariate-based feature selection list. For all situations except the 1st-stage GA results from the Shipp dataset, the gene subsets of size 1 and 2 were made up of the same genes from the univariate and both GA feature selection schemes. Beyond these subset sizes, though, the matches were not exact. In fact, considering as many as the top 100 genes of the univariate list, the GA-based processes selected genes that did not even appear within these top 100 univariately selected genes. Table 6.1 provides a breakdown of the percentage of genes (relative to each gene subset size) among the GA-based feature selection processes that were not among the top 100 univariately significant genes, for all six datasets.

Another interesting aspect of performing the second stage of GA is to consider how many genes selected for the final subsets after the two-stage GA process were not among the final subsets after the first stage of GA. Table 6.2 shows a breakdown of what percentage of genes (relative to each subset size) fell into this category.

Table 6.1: Percentage of Genes Per Subset Size Not Within Top 100 Univ List (Feature Selection Based on All Samples)

        Alon          Golub         Nutt          Pomeroy       Shipp         Singh
Size   GA1    GA2    GA1    GA2    GA1    GA2    GA1    GA2    GA1    GA2    GA1    GA2
1       0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0
2       0.0    0.0    0.0    0.0    0.0    0.0    0.0    0.0   50.0    0.0    0.0    0.0
3      66.7   66.7   33.3   33.3   33.3   66.7   33.3  100.0  100.0   66.7    0.0   33.3
4      50.0   50.0   50.0   25.0   75.0   75.0   50.0   50.0   75.0  100.0   50.0   50.0
5      20.0   60.0   40.0   20.0  100.0   80.0   80.0   60.0   60.0   80.0   60.0   60.0
10     70.0   70.0   60.0   60.0   70.0   90.0   70.0   80.0   70.0   80.0   60.0   60.0
15     73.3   66.7   60.0   80.0   93.3   86.7   66.7   60.0   93.3   80.0   73.3   73.3
20     75.0   75.0   70.0   80.0   95.0   95.0   85.0   80.0   85.0   85.0   85.0   95.0
25     84.0   84.0   76.0   72.0   96.0   92.0   76.0   84.0   96.0   96.0   84.0   76.0

Table 6.2: Percentage of Genes Per Subset Size Selected from 2-Stage, but not Single-Stage, GA Process (Feature Selection Based on All Samples)

Size    Alon    Golub    Nutt     Pomeroy   Shipp    Singh
1        0.0      0.0      0.0      0.0       0.0      0.0
2        0.0      0.0      0.0      0.0      50.0      0.0
3        0.0     66.7    100.0      0.0      66.7     33.3
4        0.0     50.0    100.0     50.0      25.0     50.0
5       80.0     40.0     40.0    100.0      40.0     20.0
10      70.0     70.0    100.0     60.0      40.0     90.0
15      73.3     46.7     80.0     73.3      73.3     80.0
20      65.0     60.0     90.0     90.0      90.0    100.0
25      92.0     80.0    100.0     92.0      84.0     88.0

From Table 6.2, looking just at the gene indices selected after the initial stage of GA and after the second stage of GA, the gene subsets for both GA-based processes were the same for subset sizes 1 and 2 for all datasets except the Shipp one. Beyond that, for all six datasets there were a number of genes selected after the two-stage GA process that were not in the final subsets after the 1st stage of GA. This was especially true for gene subset sizes 10, 15, 20, and 25. It is feasible that these genes may have appeared in only a small proportion of the 1000 solutions for any run in the first stage of GA, but still appeared in multiple runs of the GA. With the two-stage GA process, these genes had a better chance of being selected for use in building classifiers, since the second implementation of the GA considered as its initial gene pool all genes that appeared at least once in the final generation’s solutions for each of the 10 runs from the initial GA stage. Hence, the second implementation of the GA attempted to select the “best of the best” genes, some of which may not necessarily have been selected as part of the final subset of genes for a given subset size after the initial GA stage.
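As an illustration of this pooling step, the sketch below shows one way the second-stage gene pool could be assembled in R. The object final.generations (a list of ten matrices of gene indices, one per first-stage run, one row per solution) and the function name are hypothetical stand-ins for the actual GA output, not the thesis implementation.

```r
# Hedged sketch: collect every gene appearing at least once among the
# final-generation solutions of the 10 first-stage GA runs.
second.stage.pool <- function(final.generations) {
  sort(unique(unlist(final.generations)))
}

# Toy example: 10 runs, 1000 solutions of subset size 10 drawn from 2000 genes.
runs <- replicate(10, matrix(sample(2000, 1000 * 10, replace = TRUE),
                             nrow = 1000), simplify = FALSE)
pool <- second.stage.pool(runs)   # candidate gene pool for the second GA stage
```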

Ultimately, though, perhaps the more important comparison is between the genes selected by the more sophisticated GA-based processes and those selected by the univariate feature selection process. From Table 6.1, for all datasets (except the Golub one in the case of the two-stage GA feature selection process), for subset sizes of four or more, both the single- and double-stage GA processes generated final gene subsets in which the majority of the genes of each subset size were not among the top 100 of that dataset’s univariately significant genes. This finding was especially true for subset sizes of 10, 15, 20, and 25 – all of which yielded very favorable 10-fold internal and external CV misclassification error rates, aside from the unusually high error rates of the Nutt dataset. Thus, these GA approaches provided the opportunity to take into account the presence of discriminatory gene subsets composed of genes that otherwise would have gone undetected from a ranked list of 100 genes generated by the univariate testing method.

6.3 Head-to-Head CV MER Results

The 10-fold external and internal CV results based on a) the univariate (T-test-based) feature selection process, b) the single-stage GA process, and c) the 2-stage GA process are compared directly in a series of plots (Figures 6.1 – 6.6). For each dataset, six plots are presented, corresponding to each of the six classifiers. Each plot includes the CV MER curves for the univariate- and the two multivariate-based feature selection results. It should be noted that for the univariate-based results, the repeated- (10-) run CV results are used. Also note that these subset sizes are those used in the multivariate-based results of Chapter 5, for reasons given in Section 6.2.

Figure 6.1: 10-Fold Ext & Int CV; Univ FSS vs. 1- & 2-Stage GA FSS; Alon Data

(Six panels, one per classifier (SVM, DLDA, 1-NN, 3-NN, 7-NN, 15-NN), each plotting MER against gene subset size for the Univ_Ext, Univ_Int, GA_Ext, GA_Int, GAGA_Ext, and GAGA_Int curves.)

Figure 6.2: 10-Fold Ext & Int CV; Univ FSS vs. 1- & 2-Stage GA FSS; Golub Data

(Six panels, one per classifier, each plotting MER against gene subset size; curves as in Figure 6.1.)

Figure 6.3: 10-Fold Ext & Int CV; Univ FSS vs. 1- & 2-Stage GA FSS; Nutt Data

(Six panels, one per classifier, each plotting MER against gene subset size; curves as in Figure 6.1.)

Figure 6.4: 10-Fold Ext & Int CV; Univ FSS vs. 1- & 2-Stage GA FSS; Pomeroy Data

(Six panels, one per classifier, each plotting MER against gene subset size; curves as in Figure 6.1.)

Figure 6.5: 10-Fold Ext & Int CV; Univ FSS vs. 1- & 2-Stage GA FSS; Shipp Data

(Six panels, one per classifier, each plotting MER against gene subset size; curves as in Figure 6.1.)

Figure 6.6: 10-Fold Ext & Int CV; Univ FSS vs. 1- & 2-Stage GA FSS; Singh Data

(Six panels, one per classifier, each plotting MER against gene subset size; curves as in Figure 6.1.)

With respect to the external and internal CV results in Figures 6.1 – 6.6, several things should be noted. First, a finding that was evident throughout all the classifiers and datasets was that the external CV error rates were in general higher than the corresponding internal CV error rates. This trend was expected since the feature selection procedure was incorporated into the CV process, thus providing classifiers that yielded more realistic error rates than would be the case if the feature selection were performed on all N samples of a given dataset prior to performing the cross-validation. For all datasets, whether internal or external CV was used, it was clear that the single- and double-stage GA-based processes led to very comparable misclassification errors across learning algorithms and subset sizes.
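The distinction between the two forms of cross-validation can be summarized in a short sketch. The R code below outlines external 10-fold CV with the feature selection step repeated inside each fold; the helper functions select.genes, fit.rule, and predict.rule are hypothetical placeholders for a feature selector and a learning algorithm, not the procedures used in this research.

```r
# Minimal sketch of external K-fold CV: feature selection is repeated inside
# each fold using only that fold's training samples.
external.cv.mer <- function(X, y, K = 10, subset.size = 10,
                            select.genes, fit.rule, predict.rule) {
  n <- length(y)
  folds <- sample(rep(seq_len(K), length.out = n))   # random fold assignment
  errors <- 0
  for (k in seq_len(K)) {
    train <- folds != k
    genes <- select.genes(X[, train, drop = FALSE], y[train], subset.size)
    rule  <- fit.rule(X[genes, train, drop = FALSE], y[train])
    pred  <- predict.rule(rule, X[genes, !train, drop = FALSE])
    errors <- errors + sum(pred != y[!train])
  }
  errors / n   # external CV misclassification error rate
}
# "Internal" CV would instead call select.genes(X, y, subset.size) once,
# outside the loop, so the held-out test samples also influence gene selection.
```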

Furthermore, comparing the univariate-based results with either of the GA-based results, for either type of CV and any choice of learning algorithm, the GA-based analyses did not offer any significant advantage over the univariate results in terms of lower error rates across subset sizes. This finding was evident across all six datasets.

It should also be noted that, as mentioned in Chapters 4 and 5, the Alon and Golub datasets (Figures 6.1 and 6.2, respectively) had internal and external CV error rates that were in general lower across subset sizes, learning algorithms, and feature selection methods than those from the other four datasets. On the other hand, the Nutt dataset had error rates that were markedly higher than those of the other five datasets. These findings reinforce what was suggested in the MDS plots in Section 4.2.1. In terms of learning algorithms and subset sizes, no particular algorithm or size emerged as clearly the “best” in terms of lowest error rates (internal or external) across all the datasets.

Side Note: Confirming the Presence of a Three-Way Interaction

To further investigate the presence of a three-way interaction among dataset, learning algorithm, and subset size (a notion suggested in the results from Chapters 4 and 5), a linear model approach was taken. Treating the external CV MER as the response variable and dataset, learning algorithm, and subset size as the three predictor variables, a linear model was constructed. Dataset and learning algorithm were treated as factor variables, while subset size was treated as a continuous variable. This model included the two- and three-way interactions as well. Although the normality of the residuals was verified for this model, there was a piecewise linear trend evident in a plot of the residuals vs. gene subset size: decreasing residuals from subset size 1 to 5 and then increasing residuals from size 5 to 25. This trend was successfully removed by fitting a piecewise linear function for gene subset size. In order to test the significance of the three-way interaction, this first model was compared to a second model which contained all the nested main effects and two-way interactions, but not the three-way interaction. Since the second model was nested within the first and the first model included 25 additional parameters, an F test was used to assess the significance of the interaction term. The F-statistic was indeed significant at the 0.001 significance level. This confirmed the suggestion that a three-way interaction exists; that is, the subset size effect was dependent on both the choice of learning algorithm and which dataset was being investigated.
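A sketch of this model comparison in R is given below. The data frame res (with columns mer, dataset, algorithm, and size) and the placement of the piecewise knot at subset size 5 are assumptions based on the description above; the exact parameterization used for the thesis results may differ.

```r
# Hedged sketch of the nested F test for the three-way interaction.
# res: hypothetical data frame with one row per (dataset, algorithm, subset size).
res$dataset   <- factor(res$dataset)
res$algorithm <- factor(res$algorithm)
res$size2     <- pmax(res$size - 5, 0)   # hinge term: piecewise-linear fit in subset size

full    <- lm(mer ~ dataset * algorithm * (size + size2), data = res)
reduced <- update(full, . ~ . - dataset:algorithm:size - dataset:algorithm:size2)

anova(reduced, full)   # F test comparing the nested models
```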

Further Insights: Collapsing the Results Across Learning Algorithms, Subset Sizes, and Datasets

First, combining the results across learning algorithms, it is interesting to observe which subset size led to the minimum average error rate for each of the three feature selection approaches and for each of the six datasets. These results are shown in Tables 6.3 and 6.4.

Table 6.3: Minimum 10-Fold Internal CV Average MER's Across All Classifiers; Univ FSS vs. 1- & 2-Stage GA FSS

          Univ FSS                     1-Stage GA FSS             2-Stage GA FSS
Dataset   MER (SD)       Subset Size   MER (SD)      Subset Size  MER (SD)      Subset Size
Alon      0.035 (0.020)  10 (also 15)  0.026 (0.018)  4           0.014 (0.013)  10
Golub     0.025 (0.011)  25            0.031 (0.024)  5           0.030 (0.036)  10
Nutt      0.159 (0.018)  10            0.071 (0.037)  10          0.052 (0.034)  5
Pomeroy   0.080 (0.007)  15            0.056 (0.024)  4           0.048 (0.019)  4
Shipp     0.055 (0.006)  4             0.044 (0.033)  25          0.048 (0.033)  15
Singh     0.075 (0.021)  10            0.049 (0.021)  5           0.039 (0.014)  4

Table 6.4: Minimum 10-Fold External CV Average MER's Across All Classifiers; Univ FSS vs. 1- & 2-Stage GA FSS

          Univ FSS                     1-Stage GA FSS             2-Stage GA FSS
Dataset   MER (SD)       Subset Size   MER (SD)      Subset Size  MER (SD)      Subset Size
Alon      0.051 (0.017)  15            0.026 (0.021)  2           0.026 (0.021)  2
Golub     0.053 (0.014)  10            0.051 (0.011)  3           0.056 (0.009)  1 (also 15)
Nutt      0.259 (0.032)  25            0.235 (0.066)  15          0.277 (0.046)  25
Pomeroy   0.119 (0.008)  25            0.131 (0.013)  4           0.133 (0.012)  5
Shipp     0.131 (0.029)  15            0.094 (0.025)  25          0.095 (0.035)  10
Singh     0.092 (0.014)  10            0.108 (0.020)  15          0.106 (0.016)  20

For all datasets except the Singh data, Table 6.3 shows that the minima of the average 10-fold internal CV MER's across classifiers using either of the GA-based analyses were based on subset sizes that were smaller than or equal to those that led to the minimum MER's using the univariate-based analyses. Also, the minimum average MER's from performing univariate feature selection were each slightly higher than those of the two GA-based approaches for all datasets except the Golub one, although the Golub error rates for the 1- and 2-stage GA-based results were based on subset sizes of 20 and 15 genes fewer, respectively. Furthermore, looking at the 1-stage and 2-stage GA results, the minimum average 2-stage GA error rates were slightly lower than those of the 1-stage process for all datasets except the Shipp one, although these MER's were still very comparable. From the external CV results in Table 6.4, the 2-stage GA results were very comparable to those of the 1-stage process for all datasets. The univariate-based results were all higher than those of the 1-stage GA results for all datasets except the Pomeroy and Singh datasets. However, they were lower than those of the 2-stage GA results for all datasets except the Alon and Shipp data. In terms of the subset sizes at which the minimum average MER's occurred, for all datasets except the Shipp and Singh ones, the sizes from the univariate-based results were all at least as big as those from the two GA-based feature selection approaches. Considering both tables together, all minimum MER's based on the external CV analyses were larger than their counterparts from the internal CV analyses, as expected since the feature selection was built into the CV process. With respect to subset sizes, in the majority of the dataset and feature selection method combinations, the sizes for the external CV results were larger than those of the internal CV results.

Finally, the empirical grand means for both the internal and external CV results, as well as the resubstitution results, are presented in Table 6.5. Results for each of the three feature selection approaches are averaged across all datasets, learning algorithms, and subset sizes. The average resubstitution and CV error rates from the univariate-based feature selection results were quite comparable to those of each of the GA-based analyses. In comparing the average internal CV error rates with the average resubstitution error rates in Table 6.5, the univariate-based internal CV average was slightly lower than the corresponding resubstitution average, while the opposite relationship held for the two GA-based processes. Also, the discrepancy between the average external and internal CV error rates, as well as between the external CV and resubstitution error rates, was quite evident for all three feature selection approaches. These relationships among resubstitution, internal CV, and external CV are investigated further in Section 6.4.

Table 6.5: IntCV, ExtCV, Resub MER: Empir Grand Means Across All Datasets, Classifiers, & Subset Sizes; Univ FSS vs. 1 & 2-Stage GA FSS

             Univ FSS   1-Stage GA FSS   2-Stage GA FSS
Resub MER    6.2 %      5.5 %            5.3 %
IntCV MER    5.7 %      9.1 %            9.0 %
ExtCV MER    14.7 %     15.6 %           15.4 %

6.4 Head-to-Head Optimism and Selection Bias Results

Again referring to Figures 6.1 – 6.6, it is of particular interest to note the effect of performing internal CV instead of external CV. Whether univariate or multivariate feature selection was implemented, the external CV curves for all learning algorithms were in general slightly higher across subset sizes than those of the internal CV. This was evident for all datasets, although the discrepancies between the internal and external CV error rates for the Nutt dataset were larger than those for the other five datasets. The slightly higher external CV error rates did not come as a surprise, though. With external CV, the feature selection process was built into the CV procedure. Because of this, the error rates were more reflective of true generalization error, since the feature selection was performed on 90% of the samples; that is, externally to each CV stage's test set samples. Also, although resubstitution curves are not explicitly shown in these plots, there was also a penalty in terms of higher error rates across all subset sizes, learning algorithms, and datasets from performing internal CV over resubstitution error estimation. This was expected since resubstitution estimation was based on performing the feature selection, classifier training, and classifier evaluation on all the samples, which of course resulted in extremely optimistic error rates. As first discussed in Section 2.7, the higher error rates incurred from performing a) internal CV instead of resubstitution and b) external CV instead of internal CV bring up the issues of optimism and selection bias, respectively. Both of these bias measures are plotted against subset size in Figures 6.7 – 6.12, for each dataset and learning algorithm combination. These figures provide a direct comparison between the optimism and selection bias values incurred from using univariate-based feature selection and those incurred from using either of the two GA-based feature selection approaches. For simplicity, the “total bias” curves are not shown on these plots; the total bias is merely the sum of the optimism bias and selection bias values at each subset size.
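Written out explicitly, and consistent with the definitions referenced from Section 2.7, the three bias measures relate the three error estimates as follows (the notation below is introduced only for this summary):

```latex
\begin{align*}
\widehat{\text{bias}}_{\text{opt}} &= \widehat{\text{MER}}_{\text{int CV}} - \widehat{\text{MER}}_{\text{resub}}, \\
\widehat{\text{bias}}_{\text{sel}} &= \widehat{\text{MER}}_{\text{ext CV}} - \widehat{\text{MER}}_{\text{int CV}}, \\
\widehat{\text{bias}}_{\text{tot}} &= \widehat{\text{bias}}_{\text{opt}} + \widehat{\text{bias}}_{\text{sel}}
                                    = \widehat{\text{MER}}_{\text{ext CV}} - \widehat{\text{MER}}_{\text{resub}}.
\end{align*}
```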

Figure 6.7: 10-Fold CV; Univ, 1-, & 2-Stage GA FSS; Opt & Sel Bias vs. Subset Size; Alon Data

(Six panels, one per classifier (SVM, DLDA, 1-NN, 3-NN, 7-NN, 15-NN), each plotting bias against gene subset size for the Univ_SB, Univ_OB, GA_SB, GA_OB, GAGA_SB, and GAGA_OB curves.)

Figure 6.8: 10-Fold CV; Univ, 1-, & 2-Stage GA FSS; Opt & Sel Bias vs. Subset Size; Golub Data

(Six panels, one per classifier, each plotting bias against gene subset size; curves as in Figure 6.7.)

Figure 6.9: 10-Fold CV; Univ, 1-, & 2-Stage GA FSS; Opt & Sel Bias vs. Subset Size; Nutt Data

(Six panels, one per classifier, each plotting bias against gene subset size; curves as in Figure 6.7.)

Figure 6.10: 10-Fold CV; Univ, 1-, & 2-Stage GA FSS; Opt & Sel Bias vs. Subset Size; Pomeroy Data

(Six panels, one per classifier, each plotting bias against gene subset size; curves as in Figure 6.7.)

Figure 6.11: 10-Fold CV; Univ, 1-, & 2-Stage GA FSS; Opt & Sel Bias vs. Subset Size; Shipp Data

(Six panels, one per classifier, each plotting bias against gene subset size; curves as in Figure 6.7.)

Figure 6.12: 10-Fold CV; Univ, 1-, & 2-Stage GA FSS; Opt & Sel Bias vs. Subset Size; Singh Data

(Six panels, one per classifier, each plotting bias against gene subset size; curves as in Figure 6.7.)

Figures 6.7 – 6.12 bring together the results from the univariate- and multivariate-based analyses of Chapters 4 and 5, respectively. With respect to datasets, the Alon and Golub ones generally had smaller optimism and selection bias values across the six learning algorithms and all the gene subset sizes. On the other hand, the Nutt dataset generally had the largest bias values among the classifiers and subset sizes. In terms of learning algorithms, both SVM and DLDA generally led to smaller optimism bias values than did the other four algorithms, while no clear best (or worst) learning algorithm emerged for the selection bias results. Similarly, no particular subset size emerged with significantly better (or worse) optimism and selection bias values across learning algorithms and datasets.

Furthermore, directly comparing the optimism and selection bias curves from the univariate, 1-stage, and 2-stage feature selection processes, several things should be noted. Focusing on the two GA-based feature selection curves in each of the datasets' plots, the two-stage GA process did not offer noticeably lower optimism or selection bias values across the learning algorithms and subset sizes than those from the single-stage GA process. With respect to the optimism bias curves, no immediately obvious conclusions were evident in terms of the univariate-based results versus either of the two GA-based results. At certain subset sizes the univariate-based bias values were higher than the multivariate ones, but for other sizes the opposite was true, and even these trends varied by learning algorithm and dataset. With respect to the selection bias curves, though, it was very interesting to note that in general the univariate-based values were often lower than the corresponding values from each of the two GA-based approaches across subset sizes and learning algorithms. These findings seemed to be even more evident for the SVM and DLDA learning algorithms than for the four k-NN algorithms. To further investigate these bias findings, let us revisit the empirical grand means for the optimism, selection, and total bias results, averaged across all datasets, learning algorithms, and subset sizes.

These results are presented for all three feature selection approaches together in Table 6.6. The average optimism bias estimates from each of the GA-based analyses were only slightly less than that from the univariate-based analyses. However, the average selection and total bias estimates from each of the GA-based analyses were roughly 2.5 and 1.4 times, respectively, the average selection and total bias estimates incurred from performing univariate-based feature selection.

Table 6.6: Opt & Sel Bias: Empir Grand Means Across All Datasets, Classifiers, & Subset Sizes; Univ FSS vs. 1 & 2-Stage GA FSS

             Univ FSS   1-Stage GA FSS   2-Stage GA FSS
Opt. Bias    4.7 %      3.6 %            3.7 %
Sel. Bias    2.6 %      6.5 %            6.3 %
Tot. Bias    7.3 %      10.1 %           10.0 %

6.5 Final Thoughts

Overall, the results of this chapter have addressed several important issues regarding feature subset selection and its role with respect to the use of cross-validation when assessing classifiers. In particular, questions involving a) the use of univariate versus multivariate (GA-based) feature subset selection methods and b) how to most honestly evaluate candidate classifiers such that the effects of optimism and selection bias are properly taken into account were addressed across six published two-class microarray datasets, six learning algorithms, and a number of gene subset sizes. General conclusions that can be inferred from the results presented in Chapters 4, 5, and 6 are the subject of the following chapter.

Chapter 7

Conclusions and Further Thoughts

Chapter 4 presented 10-fold internal and external cross-validation misclassification error rate results, as well as optimism, selection, and total bias results, across six learning algorithms and six published microarray datasets. All CV results were based on a feature selection approach that was univariate in nature (rank-based, unequal-variance T-tests). Chapter 5 presented the same types of results based on two feature selection approaches that were multivariate in nature: a single-stage and a two-stage genetic algorithm. In Chapter 6, the external and internal cross-validation results based on all three feature selection methods were directly compared. This chapter provides an overall summary of all these results by focusing on four important areas from this large-scale empirical comparison study of feature selection in binary classification with microarray data: choice of learning algorithm and subset size, choice of feature subset selection technique, choice of internal CV versus external CV, and finally the presence of optimism bias and selection bias.

7.1 Learning Algorithms and Subset Sizes

Based on these studies, the findings across all six datasets suggest that neither the choice of learning algorithm nor the gene subset size is a clear-cut choice when constructing a prediction model for performing binary classification on a given microarray dataset, based on the 10-fold internal and external CV prediction errors of the classification processes implemented. As far as subset sizes go, however, it was clear that desirable classification results could be obtained using gene subset sizes of 25 or fewer, at least for the six published microarray datasets of this study.

7.2 Feature Subset Selection Approaches

7.2.1 Gene Selection

It is important for the reader to recall that, for the purpose of actually determining particular genes to comprise each subset size, a resubstitution setting was used to obtain the results discussed in this section. For all datasets, there were a number of genes selected by the two-stage GA process for the final subsets of a given subset size that were not among the final subsets selected by the initial stage of GA. For gene subset sizes 10, 15, 20, and 25, this was especially true (see Table 6.2). These genes may have appeared in only a small proportion of the 1000 solutions found within any single GA run, but they still may have appeared in multiple runs of the GA. With the two-stage GA process, these genes had a better chance of being selected for use in building classifiers. Recall that the second implementation of the GA attempted to take this issue into account by considering for its initial gene pool all genes that appeared at least once among the final generation's 1000 solutions for each of the 10 runs from the initial implementation of the GA procedure (i.e., at least once in the “superpopulation” of 10000 solutions from the initial GA procedure).

If there were intuition for a particular microarray dataset that as many genes as possible should be taken into account when performing feature subset selection, then the GA-based multivariate approaches to selecting genes from the initial gene pool would be warranted. The motivation for implementing the two-stage GA process is as just described: to take into account all genes chosen among the final generation's solutions from all runs of an initial (single-stage) implementation of the GA.

Also in a resubstitution setting, and for the same reason given above, this gene selection issue was investigated by comparing the univariate feature selection results with both the single- and the double-stage GA-based feature selection results (refer to Table 6.1). For subset sizes of four or greater, both GA-based feature selection processes generated final gene subsets in which the majority of the genes relative to each subset size were not among the top 100 genes of the univariately ranked gene list. This was especially true for subset sizes of 10, 15, 20, and 25, all of which yielded favorable misclassification error rates (aside from the unusually high error rates in general incurred with the Nutt data). It should be noted that this finding held for all datasets except the Golub one, in the case of the two-stage GA feature selection process.

7.2.2 CV Error Rates

The ability of the two-stage GA feature selection process to select genes that the single-stage procedure would not have included in its final subsets has been shown. However, of particular interest is that, in terms of the actual 10-fold internal and external CV misclassification error rates, the two procedures were very comparable across all six learning algorithms, all subset sizes, and all six datasets, with the more sophisticated two-stage procedure offering only minimal advantages with respect to lower error rates. Perhaps even more surprising, however, was how well all the CV misclassification error rates based on the simple rank-based T-tests did in comparison to the more complicated and computationally intensive GA-based feature selection procedures. Despite the ability of the GA procedures to select subsets of genes that would most likely go undetected via the combination of individually predictive genes from, say, the top 100 genes of a ranked gene list, the results in Chapter 6 clearly showed that neither of the two GA-based feature selection methods led to significantly better 10-fold CV error rates. This finding was true across all learning algorithms, all subset sizes, and all six datasets, whether internal CV or external CV was implemented. More discussion on the topics of internal and external CV in particular is provided in the following section.

7.3 Internal CV vs. External CV

One of the most important aspects of this research was the investigation of the role of feature selection with respect to cross-validation. The standard practice has been to perform the feature selection on all N samples of a given dataset, such that the test samples at each stage of a CV process were also used during the feature selection process (“internal CV”). Hence, overly optimistic error rates would be incurred. As a major part of this large-scale comparative study of gene expression-based clinical outcome classifiers, this research considered another approach to the cross-validation: external CV, in which the feature selection was performed at each stage of a CV process only on that stage's training set samples, external to each stage's test set samples. In doing so, the idea was to provide more realistic and honest misclassification error rates than would normally be the case with internal CV and, of course, resubstitution error estimation. The results presented in Chapters 4, 5, and 6 confirmed this idea across all six datasets, all six learning algorithms, and all subset sizes, using both univariate- and GA-based feature selection approaches; the latter FSS technique had never been implemented in a CV setting of this nature before. Whether the feature selection method was univariate or multivariate in nature, the results of this research showed that a 10-fold external CV procedure did not suffer much at all in terms of higher error rates relative to a 10-fold internal CV procedure for assessing the predictive accuracy of a given binary classification procedure, while at the same time incorporating the feature selection process into the cross-validation.

In general, the plots shown in Figures 6.1 – 6.6 illustrated these discrepancies between the external CV results and the internal CV results. Only in the case of the Nutt dataset were these discrepancies noticeably larger. This dataset, for reasons unknown at this point, generally had unusually high error rates for all classifiers relative to the other five datasets' corresponding results (a finding that was suggested in the exploratory MDS plots in Section 4.2.1). It is extremely important, however, not to ignore the discrepancies that exist in terms of the external CV error rates being higher in general than those obtained from internal CV. This situation arose as a result of the feature selection procedure being built into the CV process, thus providing classifiers that yielded more realistic error rates than would be the case if the feature selection were performed on all N samples of a given dataset prior to performing the cross-validation. Taking all the internal CV error rates together and averaging them across datasets, learning algorithms, and feature subset sizes, the empirical grand means for the error rates, based on univariate, single-stage, and double-stage GA feature selection, were 5.7%, 9.1%, and 9.0%, respectively. For external CV, these empirical grand means were considerably higher, as one would expect: 14.7%, 15.6%, and 15.4%, respectively. More insights into internal and external CV were provided during the investigation of optimism and selection bias, which is summarized in the following section.

7.4 Optimism and Selection Bias

Directly related to the resubstitution, internal CV, and external CV methods of assessing the predictive accuracy of a classification process are the issues of optimism and selection bias: the former reflects the over-optimism of resubstitution relative to internal CV, and the latter the over-optimism of internal CV relative to external CV. Figures 6.7 – 6.12 allowed a direct comparison to be made between the optimism bias curves from the univariate-based and each of the two GA-based feature selection methods, as well as between the corresponding selection bias curves. Although no clear conclusions were evident with respect to the optimism bias curves for each of the three feature selection methods, in general the single- and two-stage GA-based selection bias estimates were higher than those of the univariate-based analyses across subset sizes, learning algorithms, and datasets. In fact, taking all the selection bias estimates together across all subset sizes, learning algorithms, and datasets, the average selection bias estimates from each of the GA-based methods were roughly 2.5 times that of the univariate-based method. Thus, although the more sophisticated GA feature selection techniques were able to select subsets of genes that would likely go undetected by combining univariately discriminatory genes from a ranked list of genes, it is important to realize that they could nevertheless also have greater potential to select spurious genes than would be the case with a univariate-based feature selection approach. This finding makes sense: since the selection bias measures the bias in the estimate of CV prediction error due to feature selection, one would suspect that it would be higher with the multivariate feature selection approach, which searches a much higher-dimensional model space when finding the features. Thus, with the multivariate feature selection approach, it would naturally be more possible to include spurious genes in candidate models, which can be viewed as one type of data overfit. That is, if overfitting is taken to mean that too much flexibility is allowed in the model space, such that the models trace the data too closely, likely select spurious features of the given dataset, and hence do not accurately generalize to independent test data, it would be safe to say that the GA-based methods tend to overfit the data. This notion of overfitting can also be recognized by considering the total bias measure defined in Equation 2.37. That is, taking total bias to be the difference between the external CV MER and the training (resubstitution) error, the average total bias estimates across all subset sizes, learning algorithms, and datasets from each of the GA-based methods were nearly 1.5 times that of the univariate-based approach. Ultimately, whether a univariate or a GA-based feature selection approach is implemented, the presence of both optimism and selection bias should be taken into account through the use of external CV.

7.5 Impact

This research provided a large-scale empirical comparative study of feature subset selection in binary classification with DNA microarray data. It builds on the findings of the studies by Ambroise and McLachlan [7], Dudoit and Fridlyand [14], Dudoit et al. [15], and Xiong et al. [41], in the sense that 10-fold external CV was implemented to take into account selection bias when estimating the misclassification error of a classification rule based on microarray data. However, in this research, the external CV is performed in conjunction with both univariate- and multivariate GA-based feature selection to assess the performance of various prediction rules across multiple two-class microarray datasets. The current research also extends the analyses of Li et al. [23] and [24] in that the GA is actually incorporated into each stage of a 10-fold (external) CV procedure, rather than having the data split into reduced training and test sets. It also builds on those results in that, once subsets of genes are selected by the initial stage of GA (single-stage approach), all unique genes selected are not then pooled together again such that the final subsets used for modeling would be selected based on their frequency of selection from the first-stage GA procedure, which is ultimately an inherently univariate notion of feature selection. Instead, in this research the GA-selected gene subsets are left alone and not further formed based on frequency of selection among all subsets. Also, Mahalanobis distance, a simpler and less computationally intensive objective function than k-NN, is employed in the GA implemented in this research.
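As an illustration of such an objective function, the sketch below computes the squared Mahalanobis distance between the two class means for a candidate gene subset. The use of a pooled within-class covariance matrix and a small ridge term for numerical stability are assumptions made for this sketch, not details taken from the thesis implementation.

```r
# Hedged sketch of a Mahalanobis-distance fitness for a candidate gene subset.
# X: genes-by-samples matrix; y: two-level class factor; genes: candidate gene indices.
mahalanobis.fitness <- function(X, y, genes) {
  y  <- factor(y)
  Z  <- t(X[genes, , drop = FALSE])                     # samples x selected genes
  g1 <- Z[y == levels(y)[1], , drop = FALSE]
  g2 <- Z[y == levels(y)[2], , drop = FALSE]
  S  <- ((nrow(g1) - 1) * cov(g1) + (nrow(g2) - 1) * cov(g2)) /
        (nrow(g1) + nrow(g2) - 2)                       # pooled within-class covariance
  S  <- S + diag(1e-6, ncol(S))                         # small ridge for numerical stability
  d  <- colMeans(g1) - colMeans(g2)
  drop(t(d) %*% solve(S) %*% d)                         # larger separation = better fitness
}
```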

Ultimately, this study has provided a more extensive comparative analysis than any microarray classification study to date in terms of the number of datasets considered, the range of classifiers (including learning algorithms, feature subset selection methods, and subset sizes), and the investigation of both internal and external K-fold CV as a means of assessing the predictive accuracy of the classification rules. This research has also put to the test the more traditional implementations of the statistical learning aspects of cross-validation and feature selection. With respect to the former, there has not been a strong enough emphasis in the microarray classification literature on the importance of building the feature selection process into a traditional CV procedure, as a means of taking selection bias into account and hence providing more realistic and honest estimates of generalization accuracy when performing limited-sample classification studies using gene expression data. With respect to feature selection, this research has shown across multiple datasets and classifiers how a simple univariate feature selection approach can perform quite comparably to a more sophisticated high-dimensional optimization search algorithm such as a genetic algorithm. At the same time, though, this research showed the effectiveness of the genetic algorithm as a feature subset selection technique that can select combinations of genes that would most likely otherwise go undetected by combining highly ranked discriminatory genes from a ranked list of univariately significant genes.

7.6 Future Directions

I believe this research has provided a solid foundation in terms of the magnitude of the empirical study that has been undertaken: a study involving the comparison of feature selection and binary classification techniques, as well as two general approaches to implementing cross-validation to assess the predictive accuracy of candidate classifiers. At the same time, I also believe this research has provided fertile ground from which a number of other interesting investigations could be pursued. This study has put to the test some of the more traditional assertions about a) cross-validation (i.e., how to implement it to achieve more honest error rate estimates), and b) feature subset selection (i.e., determining whether a more sophisticated high-dimensional search technique such as the GA really offers a significant advantage when performing binary classification using gene expression data).

Although no particular learning algorithm emerged as a clear-cut best choice across all the datasets, I still believe a more in-depth investigation is warranted into whether one can infer from a given dataset any clues as to which feature selection method and learning algorithm would be best suited to it. The unsupervised learning technique of multidimensional scaling was used in this research as one form of exploratory analysis to investigate the structure of the datasets, in particular to obtain an idea of how difficult the classification task may be in each case. Unfortunately, there does not seem to be a clear-cut way of inferring from the features of a given dataset what type of feature selection and supervised learning algorithm should definitely be used for performing binary classification with gene expression data. However, I envision this particular topic becoming a prime area of very interesting and challenging research that could have a very large impact on the microarray classification field.
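For reference, the kind of exploratory MDS display described here can be produced with a few lines of R; the Euclidean distance and classical scaling below are one reasonable choice rather than necessarily the ones used in Section 4.2.1, and X and y are again hypothetical names for the expression matrix and class labels.

```r
# Minimal sketch of a two-dimensional MDS plot of the samples.
d   <- dist(t(X))                          # distances between samples (X is genes x samples)
mds <- cmdscale(d, k = 2)                  # classical (metric) MDS
plot(mds, col = as.integer(factor(y)) + 1, pch = 19,
     xlab = "MDS dimension 1", ylab = "MDS dimension 2",
     main = "Samples in two MDS dimensions")
```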

In the area of performance assessment, it is important to keep in mind that in performing external CV, the results are designed to assess the whole process of building and evaluating a classification rule for binary classification of microarray data. That is, external CV is not inherently geared towards determining which specific group of d genes are the best predictors (for a given subset size d), since the feature selection process is built into each stage of the CV procedure and hence the particular genes selected at each stage will inevitably vary. Perhaps some form of gene-monitoring procedure could be developed so that specific, stable subsets of genes can be identified.
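One simple form such a monitoring step could take is to tally, across the external CV folds, how often each gene is chosen by the within-fold selection; genes that recur in nearly every fold are natural candidates for a stable reported subset. The sketch below is illustrative only and again assumes a univariate F-statistic filter and the scikit-learn API.

import numpy as np
from collections import Counter
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_selection import SelectKBest, f_classif

def gene_selection_frequencies(X, y, n_genes=25, n_folds=10, seed=0):
    # Repeat the univariate selection inside every training fold of an
    # external CV and count how many folds pick each gene index.
    counts = Counter()
    cv = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=seed)
    for train_idx, _ in cv.split(X, y):
        selector = SelectKBest(f_classif, k=n_genes).fit(X[train_idx], y[train_idx])
        counts.update(np.flatnonzero(selector.get_support()).tolist())
    return counts   # e.g. counts.most_common(25) lists the most consistently selected genes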

An extension of the genetic algorithm aspect of this research would be to implement a wider variety of objective functions within the genetic algorithm and examine how the results are affected. In particular, it would be interesting to see whether any particular GA implementation yields subsets of genes that a) would not easily be discovered via univariate feature selection and b) would offer a significant advantage over univariate feature selection-based analyses in terms of minimal subset size and low misclassification error rates. One could also conduct further research investigating the probability that certain genes known to be strongly discriminatory are not selected by a GA in the final generation's solution sets. One parameter that warrants further study is the mutation rate, and in particular how it relates to the solution (chromosome) size being pursued in a GA. For example, if the GA were designed to seek the best (i.e., most discriminatory) single-gene solution, one would think that a larger mutation rate should be used, to allow a wider portion of the original gene pool to be considered during the evolution of the algorithm, and vice versa as the solution size increases. Overall, I believe more in-depth research on how to optimally select GA parameters such as population size, crossover rate, and mutation rate would be a much appreciated and needed contribution to the GA community. Another interesting extension of this research would be to aggregate different GA-based classifiers into a "meta-classifier." For example, using different objective functions within a GA, one could combine the results of GA/Mahalanobis distance, GA/SVM, and GA/k-NN analyses in some fashion.
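As a point of reference for such experiments, the following is a bare-bones sketch of a GA over fixed-size gene subsets with a pluggable objective function and an explicit mutation-rate parameter. It is not the implementation used in this research; the operators (truncation selection with elitism, crossover by resampling from the union of two parents, per-locus mutation) and all default settings are illustrative assumptions.

import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def knn_cv_fitness(X, y, genes):
    # One possible objective function: mean 5-fold CV accuracy of 3-NN
    # restricted to the candidate gene subset.
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    return cross_val_score(KNeighborsClassifier(n_neighbors=3), X[:, genes], y, cv=cv).mean()

def _mutate(chrom, n_genes, rate, rng):
    # Each locus is replaced, with probability `rate`, by a gene that is
    # not already in the chromosome (so the subset size is preserved).
    chrom = chrom.copy()
    for i in range(chrom.size):
        if rng.random() < rate:
            new_gene = int(rng.integers(n_genes))
            while new_gene in chrom:
                new_gene = int(rng.integers(n_genes))
            chrom[i] = new_gene
    return chrom

def ga_select(X, y, subset_size=10, pop_size=40, n_generations=30,
              mutation_rate=0.05, fitness=knn_cv_fitness, seed=0):
    # Chromosomes are arrays of `subset_size` distinct gene indices.
    rng = np.random.default_rng(seed)
    n_genes = X.shape[1]
    pop = [rng.choice(n_genes, size=subset_size, replace=False) for _ in range(pop_size)]
    best, best_score = pop[0], -np.inf
    for _ in range(n_generations):
        scores = np.array([fitness(X, y, c) for c in pop])
        if scores.max() > best_score:
            best, best_score = pop[int(scores.argmax())].copy(), float(scores.max())
        order = np.argsort(scores)[::-1]
        parents = [pop[i] for i in order[: pop_size // 2]]   # keep the fitter half
        children = [p.copy() for p in parents]               # elitist carry-over
        while len(children) < pop_size:
            i, j = rng.choice(len(parents), size=2, replace=False)
            pool = np.union1d(parents[i], parents[j])        # crossover: resample from the union
            child = rng.choice(pool, size=subset_size, replace=False)
            children.append(_mutate(child, n_genes, mutation_rate, rng))
        pop = children
    return best, best_score

Swapping knn_cv_fitness for an LDA- or SVM-based score would give the other GA/classifier pairings mentioned above, and the best chromosomes from several such runs could be pooled or majority-voted to form the suggested meta-classifier. For honest error estimates, the entire ga_select call would itself need to sit inside each training fold of an external CV.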

Further investigation into the nature of the two-class problem could also prove useful. That is, one could more closely compare results obtained from problems in which the two classes are tumor and normal with results from problems in which the two classes are subtypes of a particular cancer. Finally, beyond the realm of microarrays, the results and insights from this research can hopefully be applied to other binary classification problems marked by high-dimensional datasets with relatively small samples, and perhaps motivate a similar type of large-scale empirical study focused on more complex multi-class classification problems.

Appendix A

Other Results from Current Research

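Tables A.1–A.6 report, for each dataset, the raw per-gene p-values together with Benjamini–Yekutieli (BY), Holm, Bonferroni, and Westfall–Young (WY) adjusted values for the top 25 univariately ranked genes. For orientation, the sketch below shows how the first three adjustments can be reproduced from a vector of raw p-values; it assumes, purely for illustration, a per-gene two-sample t-test and the statsmodels API, and the permutation-based WY adjustment is not reproduced here.

import numpy as np
from scipy.stats import ttest_ind
from statsmodels.stats.multitest import multipletests

def adjusted_pvalues(X, y):
    # Raw per-gene p-values (illustrated here with a two-sample t-test),
    # then BY, Holm, and Bonferroni adjustments over all genes.
    y = np.asarray(y)
    raw = np.array([ttest_ind(X[y == 0, j], X[y == 1, j]).pvalue
                    for j in range(X.shape[1])])
    out = {"raw": raw}
    for label, method in [("BY", "fdr_by"), ("Holm", "holm"), ("Bonferroni", "bonferroni")]:
        out[label] = multipletests(raw, method=method)[1]   # element [1] holds the adjusted p-values
    return out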

Table A.1: Raw and Adjusted P-values for Top 25 Genes; Alon Data

Gene Index Raw BY Holm Bonferroni WY 493 2.38E-16 3.90E-12 4.77E-13 4.77E-13 2.00E-04 267 1.88E-15 1.54E-11 3.77E-12 3.77E-12 2.00E-04 245 3.51E-15 1.91E-11 7.01E-12 7.02E-12 2.00E-04 1423 8.71E-15 3.56E-11 1.74E-11 1.74E-11 2.00E-04 1635 3.74E-14 1.17E-10 7.47E-11 7.48E-11 2.00E-04 377 4.28E-14 1.17E-10 8.53E-11 8.55E-11 2.00E-04 1042 1.58E-13 3.42E-10 3.14E-10 3.15E-10 2.00E-04 780 1.67E-13 3.42E-10 3.34E-10 3.35E-10 2.00E-04 897 5.07E-13 9.22E-10 1.01E-09 1.02E-09 2.00E-04 765 9.13E-13 1.48E-09 1.82E-09 1.83E-09 2.00E-04 964 9.93E-13 1.48E-09 1.98E-09 1.99E-09 2.00E-04 1494 1.70E-12 2.31E-09 3.37E-09 3.39E-09 2.00E-04 1730 2.08E-12 2.61E-09 4.13E-09 4.15E-09 2.00E-04 1843 8.25E-12 9.64E-09 1.64E-08 1.65E-08 2.00E-04 1771 1.56E-11 1.70E-08 3.10E-08 3.12E-08 2.00E-04 365 1.93E-11 1.95E-08 3.82E-08 3.85E-08 2.00E-04 513 2.02E-11 1.95E-08 4.02E-08 4.05E-08 2.00E-04 1263 3.90E-11 3.54E-08 7.73E-08 7.80E-08 2.00E-04 138 1.05E-10 9.01E-08 2.07E-07 2.09E-07 2.00E-04 824 1.14E-10 9.30E-08 2.25E-07 2.28E-07 2.00E-04 249 1.26E-10 9.78E-08 2.49E-07 2.51E-07 2.00E-04 625 2.12E-10 1.58E-07 4.20E-07 4.24E-07 2.00E-04 1421 2.42E-10 1.72E-07 4.79E-07 4.84E-07 2.00E-04 1060 6.70E-10 4.57E-07 1.33E-06 1.34E-06 4.00E-04 1892 1.19E-09 7.78E-07 2.35E-06 2.38E-06 4.00E-04 177

Table A.2: Raw and Adjusted P-values for Top 25 Genes; Golub Data

Gene Index Raw BY Holm Bonferroni WY 1834 5.782E-20 2.072E-15 4.122E-16 4.122E-16 2.000E-04 6855 6.151E-20 2.072E-15 4.385E-16 4.385E-16 2.000E-04 4847 1.771E-19 3.978E-15 1.263E-15 1.263E-15 2.000E-04 6041 5.906E-19 9.947E-15 4.209E-15 4.211E-15 2.000E-04 1882 9.517E-19 1.282E-14 6.781E-15 6.785E-15 2.000E-04 2354 9.459E-18 9.812E-14 6.738E-14 6.743E-14 2.000E-04 3252 1.020E-17 9.812E-14 7.262E-14 7.268E-14 2.000E-04 4377 2.618E-17 2.205E-13 1.865E-13 1.867E-13 2.000E-04 1685 1.521E-16 1.138E-12 1.083E-12 1.084E-12 2.000E-04 1144 5.289E-16 3.425E-12 3.766E-12 3.771E-12 2.000E-04 760 5.594E-16 3.425E-12 3.982E-12 3.988E-12 2.000E-04 1745 1.316E-15 7.387E-12 9.367E-12 9.381E-12 2.000E-04 2121 1.636E-15 8.480E-12 1.165E-11 1.167E-11 2.000E-04 4328 6.828E-15 3.285E-11 4.859E-11 4.867E-11 2.000E-04 4366 1.702E-14 7.643E-11 1.211E-10 1.213E-10 2.000E-04 5501 2.526E-14 1.063E-10 1.797E-10 1.801E-10 2.000E-04 4973 5.710E-14 2.263E-10 4.062E-10 4.071E-10 2.000E-04 1909 7.337E-14 2.746E-10 5.218E-10 5.231E-10 2.000E-04 6281 2.426E-13 8.603E-10 1.725E-09 1.730E-09 2.000E-04 2642 2.903E-13 9.778E-10 2.064E-09 2.070E-09 2.000E-04 4107 3.424E-13 1.098E-09 2.434E-09 2.441E-09 2.000E-04 1630 5.141E-13 1.574E-09 3.654E-09 3.665E-09 2.000E-04 804 8.550E-13 2.504E-09 6.076E-09 6.095E-09 2.000E-04 7119 9.136E-13 2.564E-09 6.492E-09 6.513E-09 2.000E-04 2020 1.012E-12 2.726E-09 7.189E-09 7.213E-09 2.000E-04 178

Table A.3: Raw and Adjusted P-values for Top 25 Genes; Nutt Data

Gene Index Raw BY Holm Bonferroni WY 12194 5.836E-09 7.383E-04 7.368E-05 7.368E-05 1.200E-03 10017 2.415E-08 1.527E-03 3.048E-04 3.049E-04 2.000E-03 1614 5.178E-08 1.941E-03 6.536E-04 6.537E-04 4.000E-03 8501 6.138E-08 1.941E-03 7.748E-04 7.750E-04 5.000E-03 5908 7.734E-08 1.957E-03 9.761E-04 9.764E-04 5.600E-03 10943 9.877E-08 2.083E-03 1.247E-03 1.247E-03 6.200E-03 216 1.546E-07 2.795E-03 1.951E-03 1.952E-03 7.200E-03 1272 3.281E-07 5.188E-03 4.140E-03 4.142E-03 1.720E-02 221 4.167E-07 5.236E-03 5.258E-03 5.261E-03 2.200E-02 6223 4.343E-07 5.236E-03 5.480E-03 5.484E-03 1.920E-02 7367 4.561E-07 5.236E-03 5.754E-03 5.758E-03 2.200E-02 6682 4.967E-07 5.236E-03 6.265E-03 6.270E-03 2.420E-02 12195 6.850E-07 6.067E-03 8.640E-03 8.648E-03 2.760E-02 8495 7.013E-07 6.067E-03 8.845E-03 8.854E-03 2.920E-02 4574 7.194E-07 6.067E-03 9.072E-03 9.082E-03 2.660E-02 7645 7.785E-07 6.155E-03 9.816E-03 9.828E-03 3.460E-02 5212 1.033E-06 7.262E-03 1.302E-02 1.304E-02 4.280E-02 5682 1.079E-06 7.262E-03 1.360E-02 1.362E-02 3.480E-02 10681 1.091E-06 7.262E-03 1.375E-02 1.377E-02 4.460E-02 6681 1.329E-06 8.404E-03 1.675E-02 1.677E-02 4.820E-02 12012 1.493E-06 8.747E-03 1.882E-02 1.885E-02 5.040E-02 4723 1.521E-06 8.747E-03 1.917E-02 1.921E-02 5.460E-02 8397 1.590E-06 8.747E-03 2.004E-02 2.008E-02 5.660E-02 8455 1.719E-06 9.064E-03 2.167E-02 2.171E-02 5.800E-02 11837 2.222E-06 1.124E-02 2.800E-02 2.805E-02 6.720E-02 179

Table A.4: Raw and Adjusted P-values for Top 25 Genes; Pomeroy Data

Gene Index Raw BY Holm Bonferroni WY 3127 2.326E-14 8.843E-10 1.658E-10 1.658E-10 2.000E-04 2365 2.625E-14 8.843E-10 1.871E-10 1.872E-10 2.000E-04 1422 4.498E-14 1.010E-09 3.206E-10 3.207E-10 2.000E-04 2322 2.979E-13 5.017E-09 2.123E-09 2.124E-09 2.000E-04 2967 1.738E-12 2.341E-08 1.238E-08 1.239E-08 2.000E-04 4457 3.422E-12 3.652E-08 2.438E-08 2.440E-08 2.000E-04 6512 3.795E-12 3.652E-08 2.703E-08 2.705E-08 2.000E-04 2511 4.950E-12 4.168E-08 3.526E-08 3.529E-08 2.000E-04 4484 7.940E-12 5.808E-08 5.654E-08 5.661E-08 2.000E-04 6718 8.622E-12 5.808E-08 6.139E-08 6.146E-08 2.000E-04 5585 1.028E-11 6.294E-08 7.316E-08 7.327E-08 2.000E-04 6435 1.177E-11 6.355E-08 8.379E-08 8.392E-08 2.000E-04 5669 1.262E-11 6.355E-08 8.980E-08 8.995E-08 2.000E-04 3136 1.383E-11 6.355E-08 9.839E-08 9.857E-08 2.000E-04 3525 1.415E-11 6.355E-08 1.007E-07 1.009E-07 2.000E-04 2545 1.626E-11 6.474E-08 1.156E-07 1.159E-07 2.000E-04 6625 1.634E-11 6.474E-08 1.162E-07 1.165E-07 2.000E-04 6181 2.643E-11 9.890E-08 1.879E-07 1.884E-07 2.000E-04 2344 4.980E-11 1.766E-07 3.541E-07 3.550E-07 2.000E-04 4632 6.317E-11 2.105E-07 4.491E-07 4.503E-07 2.000E-04 5581 6.562E-11 2.105E-07 4.665E-07 4.678E-07 2.000E-04 3032 7.643E-11 2.340E-07 5.432E-07 5.449E-07 2.000E-04 3001 9.046E-11 2.650E-07 6.429E-07 6.449E-07 2.000E-04 588 1.058E-10 2.970E-07 7.518E-07 7.542E-07 2.000E-04 3043 1.304E-10 3.513E-07 9.263E-07 9.294E-07 2.000E-04 180

Table A.5: Raw and Adjusted P-values for Top 25 Genes; Shipp Data

Gene Index Raw BY Holm Bonferroni WY 972 2.235E-16 1.506E-11 1.594E-12 1.594E-12 2.000E-04 4569 7.299E-16 2.055E-11 5.203E-12 5.204E-12 2.000E-04 4194 1.199E-15 2.055E-11 8.545E-12 8.548E-12 2.000E-04 506 1.231E-15 2.055E-11 8.772E-12 8.776E-12 2.000E-04 4028 1.525E-15 2.055E-11 1.087E-11 1.087E-11 2.000E-04 2988 2.786E-15 3.128E-11 1.985E-11 1.986E-11 2.000E-04 5386 5.256E-15 5.058E-11 3.744E-11 3.747E-11 2.000E-04 699 7.482E-15 6.300E-11 5.328E-11 5.334E-11 2.000E-04 4292 1.744E-14 1.305E-10 1.242E-10 1.243E-10 2.000E-04 1092 2.950E-14 1.987E-10 2.101E-10 2.103E-10 2.000E-04 6815 4.955E-14 3.034E-10 3.527E-10 3.532E-10 2.000E-04 605 7.921E-14 4.447E-10 5.638E-10 5.647E-10 2.000E-04 6179 9.972E-14 5.168E-10 7.097E-10 7.109E-10 2.000E-04 2043 3.096E-13 1.490E-09 2.203E-09 2.207E-09 2.000E-04 2137 3.726E-13 1.585E-09 2.651E-09 2.656E-09 2.000E-04 4372 3.764E-13 1.585E-09 2.678E-09 2.684E-09 2.000E-04 1080 5.327E-13 2.111E-09 3.789E-09 3.798E-09 2.000E-04 4183 1.530E-12 5.726E-09 1.088E-08 1.091E-08 2.000E-04 5994 2.142E-12 7.594E-09 1.523E-08 1.527E-08 2.000E-04 2121 2.323E-12 7.824E-09 1.652E-08 1.656E-08 2.000E-04 1790 2.508E-12 8.047E-09 1.783E-08 1.788E-08 2.000E-04 1352 2.789E-12 8.432E-09 1.983E-08 1.988E-08 2.000E-04 1612 2.879E-12 8.432E-09 2.046E-08 2.052E-08 2.000E-04 6476 3.044E-12 8.544E-09 2.163E-08 2.170E-08 2.000E-04 1188 3.724E-12 9.951E-09 2.646E-08 2.655E-08 2.000E-04 181

Table A.6: Raw and Adjusted P-values for Top 25 Genes; Singh Data

Gene Index Raw BY Holm Bonferroni WY 6185 3.669E-24 4.632E-19 4.623E-20 4.623E-20 2.000E-04 10494 6.929E-17 4.373E-12 8.729E-13 8.730E-13 2.000E-04 9850 1.740E-16 5.719E-12 2.193E-12 2.193E-12 2.000E-04 4365 1.812E-16 5.719E-12 2.283E-12 2.283E-12 2.000E-04 10138 1.141E-15 2.880E-11 1.437E-11 1.437E-11 2.000E-04 9172 1.996E-15 4.200E-11 2.514E-11 2.515E-11 2.000E-04 5944 6.554E-15 1.117E-10 8.254E-11 8.258E-11 2.000E-04 9034 7.080E-15 1.117E-10 8.915E-11 8.920E-11 2.000E-04 3649 4.314E-14 6.051E-10 5.432E-10 5.436E-10 2.000E-04 2839 6.507E-14 7.687E-10 8.192E-10 8.198E-10 2.000E-04 8554 6.698E-14 7.687E-10 8.433E-10 8.440E-10 2.000E-04 7557 9.176E-14 9.653E-10 1.155E-09 1.156E-09 2.000E-04 205 2.395E-13 2.326E-09 3.015E-09 3.018E-09 2.000E-04 3794 3.493E-13 3.121E-09 4.397E-09 4.402E-09 2.000E-04 10956 3.708E-13 3.121E-09 4.667E-09 4.673E-09 2.000E-04 8850 5.905E-13 4.659E-09 7.432E-09 7.441E-09 2.000E-04 7520 8.218E-13 6.102E-09 1.034E-08 1.035E-08 2.000E-04 9050 1.012E-12 7.095E-09 1.273E-08 1.275E-08 2.000E-04 10537 1.617E-12 1.074E-08 2.034E-08 2.037E-08 2.000E-04 5757 1.999E-12 1.262E-08 2.515E-08 2.519E-08 2.000E-04 8123 2.951E-12 1.774E-08 3.713E-08 3.719E-08 2.000E-04 8768 5.099E-12 2.926E-08 6.414E-08 6.424E-08 2.000E-04 6462 7.598E-12 4.170E-08 9.557E-08 9.574E-08 2.000E-04 7768 1.562E-11 8.218E-08 1.965E-07 1.969E-07 2.000E-04 7247 1.691E-11 8.539E-08 2.127E-07 2.131E-07 2.000E-04 182

Table A.7: Gene Selection Based on All Samples: Alon Data (a)

GA GA-GA Subset Size Gene Index Univ Pval Univ Rank Gene Index Univ Pval Univ Rank 1 493 2.38E-16 1 493 2.38E-16 1 2 66 3.70E-09 31 66 3.70E-09 31 1423 8.71E-15 4 1423 8.71E-15 4 3 1058 5.69E-04 274 1058 5.69E-04 274 1423 8.71E-15 4 1423 8.71E-15 4 1484 1.69E-03 339 1484 1.69E-03 339 4 66 3.70E-09 31 66 3.70E-09 31 1058 5.69E-04 274 1058 5.69E-04 274 1423 8.71E-15 4 1423 8.71E-15 4 1484 1.69E-03 339 1484 1.69E-03 339 5 267 1.88E-15 2 26 1.80E-07 54 377 4.28E-14 6 377 4.28E-14 6 441 3.14E-01 1557 624 7.26E-04 288 493 2.38E-16 1 1286 2.33E-05 153 1836 1.15E-07 49 1423 8.71E-15 4 10 245 3.51E-15 3 66 3.70E-09 31 339 8.28E-03 478 93 2.06E-02 615 493 2.38E-16 1 245 3.51E-15 3 792 1.26E-04 201 258 1.22E-02 529 895 1.20E-02 525 622 6.24E-04 279 1135 3.32E-02 695 882 6.61E-05 183 1445 4.32E-01 1841 895 1.20E-02 525 1567 3.94E-03 415 1565 2.70E-01 1447 1585 4.73E-01 1934 1836 1.15E-07 49 1836 1.15E-07 49 1873 1.12E-06 78 15 13 4.16E-04 256 13 4.16E-04 256 43 1.00E-06 77 79 2.39E-01 1373 128 4.38E-01 1859 267 1.88E-15 2 164 1.44E-02 559 377 4.28E-14 6 190 1.67E-04 210 409 2.29E-01 1350 267 1.88E-15 2 624 7.26E-04 288 493 2.38E-16 1 769 3.01E-01 1522 792 1.26E-04 201 897 5.07E-13 9 897 5.07E-13 9 1052 2.94E-01 1511 1024 7.98E-02 922 1058 5.69E-04 274 1082 6.42E-02 851 1400 1.19E-03 316 1150 3.52E-01 1653 1423 8.71E-15 4 1676 7.29E-02 891 1522 3.34E-01 1604 1819 5.76E-02 823 1653 3.22E-01 1576 1969 7.35E-02 893 1873 1.12E-06 78 183

Table A.8: Gene Selection Based on All Samples: Alon Data (b)

GA GAGA Subset Size Gene Index Univ Pval Univ Rank Gene Index Univ Pvbal Univ Rank 20 66 3.70E-09 31 104 2.90E-02 668 141 5.09E-06 107 137 1.36E-08 37 213 2.06E-02 614 213 2.06E-02 614 234 4.68E-01 1927 533 2.48E-01 1393 431 4.94E-02 783 542 3.86E-01 1730 572 2.96E-02 674 624 7.26E-04 288 581 3.79E-08 41 895 1.20E-02 525 624 7.26E-04 288 897 5.07E-13 9 1058 5.69E-04 274 1058 5.69E-04 274 1103 3.79E-01 1716 1225 4.10E-01 1786 1186 7.00E-04 284 1295 4.74E-01 1939 1295 4.74E-01 1939 1306 1.65E-03 337 1423 8.71E-15 4 1423 8.71E-15 4 1597 1.43E-03 329 1571 4.00E-01 1769 1647 1.63E-01 1163 1644 3.94E-02 729 1713 5.90E-02 832 1668 4.71E-02 770 1790 7.48E-03 472 1808 2.06E-06 95 1808 2.06E-06 95 1829 3.67E-01 1692 1858 1.87E-01 1231 1851 1.31E-01 1082 1873 1.12E-06 78 1873 1.12E-06 78 25 13 4.16E-04 256 13 4.16E-04 256 31 3.31E-07 63 189 2.92E-03 390 66 3.70E-09 31 190 1.67E-04 210 188 1.84E-03 345 270 1.45E-01 1109 634 2.85E-02 665 326 2.50E-02 645 792 1.26E-04 201 495 1.30E-06 80 970 1.42E-01 1103 537 1.08E-01 1006 982 3.18E-04 241 623 5.00E-01 1999 983 4.73E-01 1933 624 7.26E-04 288 1018 1.73E-01 1193 659 2.78E-01 1474 1049 5.68E-02 817 734 2.84E-03 387 1050 4.65E-02 769 879 1.68E-01 1174 1102 7.22E-02 887 897 5.07E-13 9 1158 2.71E-01 1451 931 2.81E-05 157 1239 5.21E-02 799 1058 5.69E-04 274 1243 2.28E-01 1348 1085 2.93E-01 1506 1342 4.95E-02 786 1136 6.07E-04 277 1423 8.71E-15 4 1152 6.83E-02 872 1440 2.65E-02 653 1346 1.91E-03 348 1563 6.25E-03 456 1423 8.71E-15 4 1629 4.32E-01 1838 1478 4.45E-02 756 1634 1.54E-08 38 1593 8.53E-02 941 1654 4.25E-01 1821 1724 4.70E-05 177 1672 9.54E-05 196 1859 4.59E-03 426 1927 2.93E-01 1508 1873 1.12E-06 78 184

Table A.9: Gene Selection Based on All Samples: Golub Data (a)

GA GA-GA Subset Size Gene Index Univ Pval Univ Rank Gene Index Univ Pval Univ Rank 1 4847 1.77E-19 3 4847 1.77E-19 3 2 1834 5.78E-20 1 1834 5.78E-20 1 6539 3.55E-10 58 6539 3.55E-10 58 3 1775 1.57E-02 1634 1779 4.73E-12 32 4847 1.77E-19 3 1941 4.37E-03 1169 4951 4.85E-09 85 4847 1.77E-19 3 4 1779 4.73E-12 32 1745 1.32E-15 12 2121 1.64E-15 13 1779 4.73E-12 32 5949 3.98E-01 6115 1796 3.75E-01 5895 6277 1.26E-02 1525 2121 1.64E-15 13 5 1456 1.58E-04 551 1745 1.32E-15 12 1779 4.73E-12 32 1779 4.73E-12 32 1796 3.75E-01 5895 1796 3.75E-01 5895 1834 5.78E-20 1 1834 5.78E-20 1 4847 1.77E-19 3 1882 9.52E-19 5 10 1745 1.32E-15 12 1779 4.73E-12 32 1779 4.73E-12 32 1796 3.75E-01 5895 1796 3.75E-01 5895 1829 1.12E-11 39 1829 1.12E-11 39 1975 8.12E-03 1350 2221 3.46E-01 5629 2426 1.23E-05 343 2288 1.91E-12 27 3847 5.27E-10 64 2681 4.70E-01 6832 4200 3.94E-04 674 3795 5.99E-02 2479 4847 1.77E-19 3 4642 1.16E-01 3202 5778 2.89E-02 1950 4664 7.49E-04 763 6184 1.21E-07 155 15 325 3.20E-02 2019 877 4.75E-01 6897 1779 4.73E-12 32 905 2.19E-01 4325 1796 3.75E-01 5895 1779 4.73E-12 32 1829 1.12E-11 39 1796 3.75E-01 5895 1834 5.78E-20 1 1834 5.78E-20 1 1882 9.52E-19 5 1882 9.52E-19 5 1963 8.79E-02 2875 2242 2.79E-05 398 2242 2.79E-05 398 2317 2.65E-01 4778 2643 2.13E-01 4262 2337 4.49E-01 6618 3551 4.20E-02 2192 2725 1.61E-01 3701 3892 1.68E-01 3784 3191 2.51E-02 1869 4642 1.16E-01 3202 4746 2.02E-01 4157 4947 1.29E-02 1533 4863 4.84E-01 6979 6041 5.91E-19 4 5307 6.66E-02 2572 6401 2.89E-01 5039 6012 1.34E-03 864 185

Table A.10: Gene Selection Based on All Samples: Golub Data (b)

GA GAGA Subset Size Gene Index Univ Pval Univ Rank Gene Index Univ Pvbal Univ Rank 20 94 1.78E-01 3894 49 9.82E-03 1424 348 4.84E-01 6966 368 1.52E-03 886 1745 1.32E-15 12 1109 1.65E-02 1652 1779 4.73E-12 32 1779 4.73E-12 32 1796 3.75E-01 5895 1796 3.75E-01 5895 1834 5.78E-20 1 1834 5.78E-20 1 1891 2.92E-04 624 1891 2.92E-04 624 2121 1.64E-15 13 1941 4.37E-03 1169 2288 1.91E-12 27 2121 1.64E-15 13 2635 1.95E-01 4080 2240 1.93E-01 4049 3320 1.45E-06 244 2288 1.91E-12 27 3813 2.04E-01 4183 2389 1.57E-04 549 3886 7.58E-02 2695 2635 1.95E-01 4080 4341 6.16E-02 2503 3921 3.56E-01 5713 4840 3.22E-01 5376 4246 3.47E-01 5638 4847 1.77E-19 3 4847 1.77E-19 3 5182 6.64E-02 2567 5599 2.24E-03 994 5252 4.55E-01 6677 5742 8.43E-03 1365 5395 2.09E-01 4222 5934 8.33E-02 2808 5481 3.47E-01 5639.5 6184 1.21E-07 155 25 381 2.88E-01 5025 727 4.69E-02 2281 424 2.40E-01 4538 1307 1.49E-01 3571 487 1.17E-01 3215 1477 2.08E-01 4217 605 6.28E-02 2524 1674 3.02E-07 185 774 3.14E-05 401 1771 5.54E-02 2417 1107 4.43E-01 6555 1779 4.73E-12 32 1779 4.73E-12 32 1796 3.75E-01 5895 1796 3.75E-01 5895 1834 5.78E-20 1 1829 1.12E-11 39 1882 9.52E-19 5 1977 4.49E-02 2249 2121 1.64E-15 13 2497 6.37E-09 94 2426 1.23E-05 343 3576 3.56E-03 1121 2518 4.51E-01 6635 3714 4.34E-07 199 2752 3.93E-01 6071 3737 9.96E-02 3013 3714 4.34E-07 199 4166 3.27E-02 2033 3847 5.27E-10 64 4197 5.08E-05 435 4847 1.77E-19 3 4200 3.94E-04 674 5048 8.37E-02 2811 4217 4.71E-01 6860 5065 3.44E-02 2071 4601 3.00E-01 5148 5102 1.94E-01 4067 4847 1.77E-19 3 5364 1.35E-01 3402 5206 4.27E-01 6403 5432 1.61E-02 1640 5357 4.24E-01 6376 6043 3.21E-01 5359 5552 2.97E-09 77 6184 1.21E-07 155 6283 2.13E-11 43 6283 2.13E-11 43 6848 4.88E-01 7010 6845 1.67E-01 3776 186

Table A.11: Gene Selection Based on All Samples: Nutt Data (a)

GA GA-GA Subset Size Gene Index Univ Pval Univ Rank Gene Index Univ Pval Univ Rank 1 1614 5.18E-08 3 1614 5.18E-08 3 2 2202 3.05E-06 33 2202 3.05E-06 33 10017 2.41E-08 2 10017 2.41E-08 2 3 1272 3.28E-07 8 1614 5.18E-08 3 9780 2.89E-04 239 4066 7.75E-05 125 11432 2.86E-06 30 12359 1.41E-01 5242 4 7855 6.98E-06 46 1614 5.18E-08 3 9299 3.76E-03 836 8691 2.19E-04 206 10474 2.95E-02 2261 9299 3.76E-03 836 12154 4.15E-03 872 12154 4.15E-03 872 5 4257 4.53E-05 107 1614 5.18E-08 3 6171 1.76E-02 1775 4257 4.53E-05 107 8691 2.19E-04 206 8691 2.19E-04 206 9299 3.76E-03 836 9299 3.76E-03 836 10112 5.63E-03 1001 10118 6.17E-03 1066 10 398 2.22E-02 1979 690 1.57E-02 1662 666 8.57E-02 3977 723 4.39E-05 103 1614 5.18E-08 3 1058 8.94E-02 4075 2099 4.52E-01 11654 2148 7.32E-02 3660 2202 3.05E-06 33 5244 1.24E-04 160 3805 1.95E-04 190 5579 3.37E-01 9380 8526 1.43E-01 5294 8334 2.35E-01 7311 10926 2.68E-03 722 11280 1.96E-03 628 12194 5.84E-09 1 11618 3.69E-02 2555 12569 4.09E-01 10784 11821 1.14E-05 57 15 93 3.60E-01 9837 1614 5.18E-08 3 618 1.98E-01 6508 1727 1.83E-01 6204 3086 1.00E-02 1348 3418 2.02E-02 1879 3096 1.38E-01 5176 3748 1.32E-02 1537 3748 1.32E-02 1537 4257 4.53E-05 107 5398 1.98E-04 193 4575 4.82E-02 2915 6785 2.00E-04 196 5244 1.24E-04 160 7121 3.95E-01 10535 6081 2.11E-05 77 8846 1.86E-02 1810 6708 1.25E-02 1492 9392 3.11E-01 8856 7077 1.84E-01 6229 9783 7.68E-06 48 10118 6.17E-03 1066 10261 7.23E-02 3618 10494 2.69E-01 7972 11196 2.27E-01 7133 11196 2.27E-01 7133 11376 6.82E-05 119 11376 6.82E-05 119 11404 8.48E-02 3959 11515 4.12E-02 2699 187

Table A.12: Gene Selection Based on All Samples: Nutt Data (b)

GA GAGA Subset Size Gene Index Univ Pval Univ Rank Gene Index Univ Pvbal Univ Rank 20 334 6.43E-04 373 1614 5.18E-08 3 723 4.39E-05 103 2068 3.80E-01 10235 1068 8.14E-04 414 3250 8.26E-04 419 2088 3.03E-01 8682 4066 7.75E-05 125 3555 2.54E-01 7675 6049 4.24E-01 11099 3794 2.61E-01 7823 6144 1.38E-01 5167 4575 4.82E-02 2915 6162 9.50E-03 1305 4689 1.75E-03 586 6876 1.04E-01 4391 5000 7.58E-02 3728 7215 1.95E-02 1852 5478 1.17E-01 4660 7741 1.50E-02 1627.5 6162 9.50E-03 1305 8412 1.21E-01 4753 7064 2.37E-01 7371 8432 2.52E-02 2086 7477 2.43E-05 84 8589 7.73E-02 3771 8054 5.89E-02 3240 8696 2.52E-04 226 8101 2.32E-02 2020 8948 3.40E-01 9432.5 8696 2.52E-04 226 9830 2.26E-01 7105 10118 6.17E-03 1066 11280 1.96E-03 628 12517 6.79E-02 3487 11376 6.82E-05 119 12542 2.90E-01 8422 12359 1.41E-01 5242 12567 4.62E-01 11834 12581 3.62E-01 9863 25 142 3.17E-01 8970.5 1038 4.69E-01 11989 723 4.39E-05 103 1377 6.98E-02 3537 2551 2.42E-01 7454 1545 3.74E-02 2574 3588 4.03E-01 10706 1687 3.45E-01 9523 4390 4.84E-01 12270 2570 3.92E-02 2630 5510 3.04E-01 8693 4005 1.93E-01 6413 5830 2.63E-01 7861 4066 7.75E-05 125 6715 8.02E-02 3843 4257 4.53E-05 107 7477 2.43E-05 84 4384 1.97E-01 6481 7667 2.13E-01 6826 5475 4.69E-01 11995 8149 4.92E-05 108 5660 3.29E-01 9220 8223 4.70E-01 12098 6510 4.84E-02 2923 8274 7.74E-02 3774 7058 7.11E-02 3587 8393 1.05E-01 4400 7785 1.47E-01 5392 8943 1.66E-02 1719 8352 1.53E-01 5531 9368 1.36E-03 533 8455 1.72E-06 24 9420 2.16E-02 1956 8803 4.25E-01 11127 9522 7.84E-03 1200 9367 2.75E-01 8093 10078 1.96E-03 630 9510 1.53E-01 5543 10596 2.77E-01 8134 9637 1.22E-05 59 11022 2.38E-01 7388 10395 3.56E-01 9741 11406 4.39E-01 11366 10501 5.00E-01 12587 11919 3.85E-01 10335 11482 3.48E-03 814 12130 1.04E-02 1364 11502 4.92E-01 12428 12272 3.36E-01 9353 11711 1.17E-01 4653 188

Table A.13: Gene Selection Based on All Samples: Pomeroy Data (a)

GA GA-GA Subset Size Gene Index Univ Pval Univ Rank Gene Index Univ Pval Univ Rank 1 379 7.83E-09 66 379 7.83E-09 66 2 2967 1.74E-12 5 2967 1.74E-12 5 3127 2.33E-14 1 3127 2.33E-14 1 3 1783 2.33E-08 83 1040 2.47E-02 3417 3127 2.33E-14 1 1746 1.69E-04 971 6377 6.64E-05 721 3681 8.49E-04 1560 4 14 8.20E-07 193 3127 2.33E-14 1 2545 1.63E-11 16 4031 2.06E-06 249 3127 2.33E-14 1 6377 6.64E-05 721 6377 6.64E-05 721 6432 3.20E-08 86 5 576 4.38E-07 164 2830 3.55E-05 592 2511 4.95E-12 8 3080 2.61E-04 1095 3258 1.68E-07 125 3127 2.33E-14 1 6401 1.99E-07 133 6377 6.64E-05 721 6967 8.65E-02 4484 6432 3.20E-08 86 10 527 1.20E-05 408 1407 5.53E-04 1363 692 2.95E-01 6087 1788 2.46E-01 5750 1389 9.08E-03 2762 2203 3.78E-02 3754 2365 2.63E-14 2 3127 2.33E-14 1 3127 2.33E-14 1 4031 2.06E-06 249 4031 2.06E-06 249 4134 2.46E-02 3413 5664 1.71E-07 126 4701 2.01E-06 247 6432 3.20E-08 86 5664 1.71E-07 126 6971 5.07E-03 2419 6377 6.64E-05 721 7097 1.62E-01 5175 6432 3.20E-08 86 15 423 2.41E-03 2000 702 6.96E-04 1464 1286 1.16E-03 1691 1951 1.02E-09 36 1520 9.28E-02 4563 2203 3.78E-02 3754 1951 1.02E-09 36 2365 2.63E-14 2 2135 3.75E-02 3747 3097 2.30E-03 1983 2203 3.78E-02 3754 3127 2.33E-14 1 2830 3.55E-05 592 3128 8.25E-02 4437 2967 1.74E-12 5 3563 2.49E-03 2026 3043 1.30E-10 25 4454 1.24E-01 4863 4388 4.97E-01 7109 4475 1.14E-01 4772 4482 2.00E-01 5463 5585 1.03E-11 11 4912 5.34E-02 4034 6377 6.64E-05 721 5585 1.03E-11 11 6432 3.20E-08 86 6314 4.57E-01 6921 6801 1.87E-04 998 6432 3.20E-08 86 7121 2.31E-08 81 189

Table A.14: Gene Selection Based on All Samples: Pomeroy Data (b)

GA GAGA Subset Size Gene Index Univ Pval Univ Rank Gene Index Univ Pvbal Univ Rank 20 27 5.61E-02 4076 249 8.50E-03 2716 412 1.98E-01 5450 496 2.64E-05 534 546 4.47E-08 95 529 1.79E-08 77 576 4.38E-07 164 1267 1.63E-01 5182 1040 2.47E-02 3417 1286 1.16E-03 1691 1746 1.69E-04 971 1520 9.28E-02 4563 1776 4.45E-02 3885 2203 3.78E-02 3754 2062 3.97E-01 6623 2964 2.28E-01 5638 2203 3.78E-02 3754 3220 5.38E-02 4037 2354 1.41E-09 41 3306 1.21E-05 411 2376 2.33E-05 512 4340 3.83E-01 6559 2568 7.97E-02 4393 4407 1.33E-01 4944 4378 2.69E-01 5902 4499 9.28E-07 201 4380 1.05E-03 1643 4578 5.20E-03 2433 4623 4.10E-02 3821 5240 3.59E-04 1201 5585 1.03E-11 11 5585 1.03E-11 11 5804 1.07E-03 1651 6068 5.06E-02 3989 5867 3.85E-01 6570 6377 6.64E-05 721 6444 4.22E-02 3845 6432 3.20E-08 86 6914 2.44E-05 519 7121 2.31E-08 81 25 549 4.81E-05 644 27 5.61E-02 4076 716 3.18E-02 3613 141 4.81E-01 7033 1142 1.87E-01 5361 174 7.27E-02 4314 1233 7.73E-02 4364 496 2.64E-05 534 1520 9.28E-02 4563 1110 9.54E-09 71 1712 2.37E-02 3387 1121 1.29E-02 2994 2203 3.78E-02 3754 1319 3.01E-02 3564 2262 3.08E-06 283 2033 2.41E-04 1066 2511 4.95E-12 8 2076 7.87E-03 2668 3032 7.64E-11 22 2203 3.78E-02 3754 3343 2.75E-01 5954 2356 2.74E-01 5949 3466 8.08E-03 2679 2511 4.95E-12 8 3804 8.98E-02 4533 2855 1.19E-01 4817 4378 2.69E-01 5902 3220 5.38E-02 4037 4831 3.26E-07 147 4172 2.40E-09 47 4951 4.09E-02 3818 4299 2.57E-02 3449 5585 1.03E-11 11 4745 1.15E-03 1685 5928 1.60E-04 955 4931 3.64E-02 3722 6343 3.00E-03 2122 5211 1.27E-03 1726 6377 6.64E-05 721 5603 1.26E-03 1723 6432 3.20E-08 86 5795 2.72E-01 5924 6435 1.18E-11 12 6064 4.56E-01 6914 6625 1.63E-11 17 6321 4.09E-01 6694 6778 1.21E-01 4828 6377 6.64E-05 721 7121 2.31E-08 81 6748 1.26E-09 38 190

Table A.15: Gene Selection Based on All Samples: Shipp Data (a)

GA GA-GA Subset Size Gene Index Univ Pval Univ Rank Gene Index Univ Pval Univ Rank 1 699 7.48E-15 8 699 7.48E-15 8 2 699 7.48E-15 8 699 7.48E-15 8 3053 5.40E-05 639 5077 7.95E-10 67 3 555 1.34E-08 102 555 1.34E-08 102 6725 9.72E-08 152 2609 1.56E-08 106 7102 6.33E-08 140 4194 1.20E-15 3 4 555 1.34E-08 102 555 1.34E-08 102 6040 5.99E-09 84 2609 1.56E-08 106 6725 9.72E-08 152 6725 9.72E-08 152 7102 6.33E-08 140 7102 6.33E-08 140 5 555 1.34E-08 102 555 1.34E-08 102 1818 6.13E-11 42 2609 1.56E-08 106 2609 1.56E-08 106 4194 1.20E-15 3 4194 1.20E-15 3 5233 2.37E-04 975 5518 1.69E-02 3463 6141 7.99E-05 711 10 555 1.34E-08 102 555 1.34E-08 102 613 3.26E-10 58 613 3.26E-10 58 1685 2.79E-02 3851 1685 2.79E-02 3851 2337 9.29E-02 4925 2173 1.39E-01 5347 2937 6.30E-10 63 2609 1.56E-08 106 4194 1.20E-15 3 3984 7.65E-02 4756 4454 7.48E-03 2848 4194 1.20E-15 3 4497 2.17E-01 5887 5233 2.37E-04 975 6079 3.34E-04 1072 6079 3.34E-04 1072 7102 6.33E-08 140 7102 6.33E-08 140 15 555 1.34E-08 102 555 1.34E-08 102 699 7.48E-15 8 699 7.48E-15 8 991 8.38E-05 720 801 4.50E-05 601 1077 5.99E-04 1294 1585 1.65E-08 108 1585 1.65E-08 108 2006 2.35E-08 115 3259 1.96E-01 5750 2079 9.53E-02 4952 3285 3.45E-06 318 2226 1.35E-02 3305 3429 8.60E-04 1442 2524 1.69E-02 3462 4934 1.75E-03 1841 2937 6.30E-10 63 5198 6.76E-02 4625 3033 2.69E-01 6155 5932 9.20E-02 4914 3787 2.28E-01 5945 6323 7.47E-02 4728 4580 9.75E-09 98 6337 9.31E-06 412 5198 6.76E-02 4625 6493 3.20E-08 122 5940 2.50E-04 995 6814 8.29E-03 2930 6432 1.14E-01 5142 191

Table A.16: Gene Selection Based on All Samples: Shipp Data (b)

GA GAGA Subset Size Gene Index Univ Pval Univ Rank Gene Index Univ Pvbal Univ Rank 20 426 4.66E-03 2487 219 9.73E-02 4982 486 4.60E-06 341 555 1.34E-08 102 555 1.34E-08 102 699 7.48E-15 8 613 3.26E-10 58 726 1.53E-01 5453 657 1.39E-03 1702 899 2.84E-05 534 1309 4.39E-01 6855 2006 2.35E-08 115 2267 3.02E-01 6297 3212 8.24E-05 715 2396 2.61E-06 299 3395 1.33E-01 5294 2477 1.02E-01 5039 3647 3.09E-01 6333 2893 7.77E-02 4768 3787 2.28E-01 5945 2937 6.30E-10 63 4114 2.26E-02 3685 3212 8.24E-05 715 4218 1.11E-02 3161 4194 1.20E-15 3 4236 2.81E-04 1021 4307 2.46E-06 295 5198 6.76E-02 4625 4903 9.93E-06 416 5877 2.56E-01 6092 4943 6.81E-03 2780 5976 4.21E-01 6786 6731 6.69E-02 4614 6141 7.99E-05 711 6967 8.33E-03 2938 6337 9.31E-06 412 7024 1.45E-01 5390 6493 3.20E-08 122 7102 6.33E-08 140 6927 5.94E-03 2664 25 86 1.56E-04 857 22 5.96E-03 2665 699 7.48E-15 8 122 6.54E-03 2756 801 4.50E-05 601 373 2.37E-08 116 1389 3.41E-02 4026 439 4.51E-01 6909 1437 4.38E-03 2445 555 1.34E-08 102 1910 1.53E-01 5455 613 3.26E-10 58 2006 2.35E-08 115 749 1.36E-02 3314 2482 4.90E-06 348 1437 4.38E-03 2445 2714 4.58E-01 6942 1909 2.78E-02 3848 3023 3.94E-01 6664 2142 1.01E-02 3089 3131 3.31E-01 6442 2990 2.96E-02 3906 3215 3.07E-04 1056 2995 2.50E-02 3758 3285 3.45E-06 318 3086 8.44E-06 401 4010 4.56E-06 339 3515 3.27E-02 3989 4118 1.44E-03 1720 4137 8.43E-02 4830 4397 1.27E-06 246 4469 1.52E-01 5448 5158 5.99E-03 2671 5086 3.85E-02 4137 5198 6.76E-02 4625 5198 6.76E-02 4625 5497 7.87E-02 4773 5233 2.37E-04 975 6011 2.80E-03 2147 5601 2.49E-01 6053 6136 9.14E-04 1473 5846 6.54E-04 1328 6573 8.65E-06 403 6243 2.92E-02 3899 6611 1.85E-03 1872 6725 9.72E-08 152 6746 2.38E-02 3716 6746 2.38E-02 3716 7102 6.33E-08 140 7102 6.33E-08 140 192

Table A.17: Gene Selection Based on All Samples: Singh Data (a)

GA GA-GA Subset Size Gene Index Univ Pval Univ Rank Gene Index Univ Pval Univ Rank 1 9050 1.01E-12 18 9050 1.01E-12 18 2 205 2.40E-13 13 205 2.40E-13 13 7520 8.22E-13 17 7520 8.22E-13 17 3 6185 3.67E-24 1 6185 3.67E-24 1 7768 1.56E-11 24 7768 1.56E-11 24 11942 2.12E-10 45 10234 1.30E-07 165 4 6185 3.67E-24 1 5045 2.46E-09 85 8092 5.50E-03 2861 6185 3.67E-24 1 10234 1.30E-07 165 10234 1.30E-07 165 11871 4.18E-09 98 12067 3.63E-04 1205 5 1247 2.10E-02 4393 205 2.40E-13 13 6185 3.67E-24 1 1247 2.10E-02 4393 6323 9.91E-04 1651 6185 3.67E-24 1 8965 7.82E-11 35 6323 9.91E-04 1651 10234 1.30E-07 165 10234 1.30E-07 165 10 2607 1.10E-02 3598 1561 3.46E-03 2461 3617 1.68E-03 1976 2694 3.75E-01 11108 8200 4.99E-11 32 4767 5.44E-05 687 8965 7.82E-11 35 4899 8.49E-07 233 9860 3.98E-01 11351 5045 2.46E-09 85 10234 1.30E-07 165 6185 3.67E-24 1 11190 1.82E-02 4218 8965 7.82E-11 35 11871 4.18E-09 98 9002 5.78E-03 2914 11942 2.12E-10 45 9093 2.25E-11 26 11947 4.61E-01 12180 11588 2.22E-01 9138 15 205 2.40E-13 13 497 1.45E-01 7918 1107 9.78E-02 6981 1308 1.98E-01 8775 2714 1.96E-01 8738 2434 2.70E-01 9786 3113 1.89E-01 8638 3062 6.64E-03 3048 4613 4.74E-01 12308 4525 1.32E-08 112 6185 3.67E-24 1 5045 2.46E-09 85 6539 2.25E-01 9195 5862 4.13E-08 132 7362 1.68E-01 8289 6185 3.67E-24 1 8603 2.78E-01 9884 6838 2.11E-04 1015 10234 1.30E-07 165 8443 7.54E-07 231 10426 6.19E-03 2977 8603 2.78E-01 9884 10494 6.93E-17 2 9034 7.08E-15 8 11245 3.70E-03 2505 10234 1.30E-07 165 11871 4.18E-09 98 11730 4.27E-02 5453 12535 5.00E-07 212 11942 2.12E-10 45 193

Table A.18: Gene Selection Based on All Samples: Singh Data (b)

GA GAGA Subset Size Gene Index Univ Pval Univ Rank Gene Index Univ Pvbal Univ Rank 20 205 2.40E-13 13 1221 1.10E-06 253 2597 1.33E-03 1828 2608 9.62E-02 6944 2861 1.61E-01 8194 3094 6.22E-03 2983 5255 1.37E-02 3875 3206 4.32E-01 11788 5578 1.27E-05 471 3530 6.95E-03 3102 5890 1.26E-10 37 3938 2.17E-03 2124 5991 2.35E-05 546 5629 6.53E-04 1454 7217 3.00E-01 10163 6390 1.78E-04 969 7499 4.97E-01 12567 7297 5.54E-03 2866 7800 2.96E-01 10110 7465 1.07E-06 251 7900 4.40E-01 11904 8123 2.95E-12 21 7903 3.21E-05 594 8297 2.89E-01 10037 7956 4.39E-01 11877 8729 2.25E-04 1038 8610 1.51E-02 3985 8814 1.30E-05 477 9034 7.08E-15 8 9630 2.64E-02 4715 9315 2.84E-01 9958 10234 1.30E-07 165 9964 1.10E-01 7233 11331 1.04E-01 7127.5 10417 4.50E-05 650 11858 1.44E-06 275 10426 6.19E-03 2977 12378 1.02E-02 3503 11661 3.73E-01 11068 12495 4.74E-08 136 25 205 2.40E-13 13 49 4.79E-01 12373 1098 3.17E-01 10421 205 2.40E-13 13 2182 1.24E-02 3741 1023 8.89E-03 3371 2872 5.25E-03 2817 1355 3.05E-01 10240 3794 3.49E-13 14 1534 1.22E-01 7496 4161 4.24E-01 11705 1912 6.90E-03 3094 4636 3.01E-05 585 2403 2.99E-02 4900 4652 4.53E-01 12077 2752 2.66E-02 4723 4725 1.06E-02 3547 5045 2.46E-09 85 5838 9.94E-02 7011 5890 1.26E-10 37 5890 1.26E-10 37 5915 1.82E-02 4216 6468 9.06E-05 801 6185 3.67E-24 1 7875 8.69E-07 234 6719 4.00E-02 5344 8681 2.11E-02 4398 8295 4.87E-01 12455 9133 6.59E-08 146 9034 7.08E-15 8 9608 4.92E-01 12513 9263 2.12E-01 8993 9850 1.74E-16 3 10000 1.45E-05 487 9882 2.82E-02 4821 10234 1.30E-07 165 10234 1.30E-07 165 10861 1.07E-01 7180 10512 3.07E-01 10273 10979 8.98E-03 3379 11026 2.67E-01 9737 11091 5.47E-02 5860 11946 4.42E-01 11931 11858 1.44E-06 275 12194 3.74E-04 1219 11871 4.18E-09 98 12402 2.21E-03 2134 11918 1.24E-04 873 12406 3.93E-01 11310 12478 5.40E-04 1362 Bibliography

[1] Affymetrix. GeneChip Analysis Suite. User guide, version 3.3, Affymetrix, 1999.

[2] Affymetrix. Expression analysis technical manual. Technical report, Affymetrix, 2000.

[3] Affymetrix. GeneChip expression analysis. Technical manual, Affymetrix, 2000.

[4] Affymetrix. Statistical algorithms description document. Technical report, Affymetrix, 2002.

[5] A Alizadeh, M Eisen, R Davis, C Ma, I Lossos, A Rosenwald, J Boldrick, H Sabet, T Tran, X Yu, JI Powell, L Yang, GE Marti, T Moore, J Hudson Jr., L Lu, DB Lewis, R Tibshirani, G Sherlock, WC Chan, TC Greiner, DD Weisenburger, JO Armitage, R Warnke, R Levy, W Wilson, MR Grever, JC Byrd, D Botstein, PO Brown, and LM Staudt. Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature, 403:503–511, 2000.

[6] U Alon, N Barkai, DA Notterman, K Gish, S Ybarra, D Mack, and AJ Levine. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc. Natl. Acad. Sci., 96:6745–6750, 1999.

[7] C Ambroise and G McLachlan. Selection bias in gene extraction on the basis of microarray gene-expression data. Proceedings of the National Academy of Sciences of the USA, 99(10):6562–6566, May 2002.

[8] K Baggerly, J Morris, J Wang, D Gold, L Xiao, and K Coombes. A comprehensive approach to the analysis of matrix-assisted laser desorption/ionization-time of flight proteomics spectra from serum samples. Proteomics, 3(9):1667–1672, 2003.

[9] Y Benjamini and D Yekutieli. The control of the false discovery rate in multiple testing under dependency. Annals of Statistics, 29(4):1165–1188, 2001.

[10] CE Bonferroni. Il calcolo delle assicurazioni su gruppi di teste. In 'Studi in Onore del Professore Salvatore Ortu Carboni', Rome, 1935.

[11] CE Bonferroni. Teoria statistica delle classi e calcolo delle probabilità. Pubblicazioni del Istituto Superiore di Scienze Economiche e Commerciali di Firenze, 8:3–62, 1936.

[12] UM Braga-Neto and ER Dougherty. Is cross-validation valid for small-sample microarray classification? Bioinformatics, 20(3):374–380, 2004.

[13] N Cristianini and J Shawe-Taylor. Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, Cambridge, England, 2000.

[14] S Dudoit and J Fridlyand. Classification in Microarray Experiments, chapter 3, pages 93–158. Chapman and Hall/CRC, 2003. Appearing in 'Statistical Analysis of Gene Expression Microarray Data' (ed. Terry Speed).

[15] S Dudoit, J Fridlyand, and T Speed. Comparison of discrimination methods for the classification of tumors using gene expression data. Technical Report 576, University of California, Berkeley, Dept. of Statistics, June 2000.

[16] RA Fisher. The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7:179–188, 1936.

[17] A Freitas. A Survey of Evolutionary Algorithms for Data Mining and Knowledge Discovery. Springer-Verlag, 2001. Appearing in 'Advances in Evolutionary Computing' (eds. A. Ghosh and S. Tsutsui).

[18] T Golub, D Slonim, P Tamayo, C Huard, M Gaasenbeek, J Mesirov, H Coller, M Loh, J Downing, M Caligiuri, C Bloomfield, and E Lander. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science, 286:531–537, 1999.

[19] T Hastie, R Tibshirani, and J Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, New York, 2001.

[20] J Holland. Adaptation in Natural and Artificial Systems. The University of Michigan Press, Ann Arbor, MI, 1975.

[21] R Kohavi. A study of cross-validation and bootstrap for accuracy estimation and model selection. In Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI-95), pages 1137–1143, Montreal, Canada, 1995.

[22] R Kohavi and G John. The Wrapper Approach, chapter 1, pages 33–50. Kluwer Academic Publishers, 1998. Appearing in 'Feature Selection for Knowledge Discovery and Data Mining' (eds. H. Liu and H. Motoda).

[23] L Li, T Darden, C Weinberg, A Levine, and L Pedersen. Gene assessment and sample classification for gene expression data using a genetic algorithm/k-nearest neighbor method. Combinatorial Chemistry and High Throughput Screening, 4(8):727–739, 2001.

[24] L Li, C Weinberg, T Darden, and L Pedersen. Gene selection for sample classification based on gene expression data: study of sensitivity to choice of parameters of the GA/KNN method. Bioinformatics, 17(12):1131–1142, 2001.

[25] KV Mardia, JT Kent, and JM Bibby. Multivariate Analysis. Academic Press, San Diego, 1979.

[26] G McLachlan. Discriminant Analysis and Statistical Pattern Recognition. Wiley, New York, 1992.

[27] M Mitchell. An Introduction to Genetic Algorithms. MIT Press, Cambridge, MA, 1997.

[28] CL Nutt, DR Mani, RA Betensky, P Tamayo, JG Cairncross, C Ladd, U Pohl, C Hartmann, ME McLaughlin, TT Batchelor, PM Black, A von Deimling, S Pomeroy, TR Golub, and DN Louis. Gene expression-based classification of malignant gliomas correlates better with survival than histological classification. Cancer Research, 63:1602–1607, 2003.

[29] CM Perou, SS Jeffrey, MVD Rijn, CA Rees, MB Eisen, RT Ross, A Pergamenschikov, CV Williams, SX Zhu, JC Lee, D Lashkari, D Shalon, PO Brown, and D Botstein. Distinctive gene expression patterns in human mammary epithelial cells and breast cancers. Proc. Natl. Acad. Sci., 96:9212–9217, 1999.

[30] SL Pomeroy, P Tamayo, M Gaasenbeek, LM Sturla, M Angelo, ME McLaughlin, JYH Kim, LC Goumnerova, PM Black, C Lau, JC Allen, D Zagzag, JM Olson, T Curran, C Wetmore, JA Biegel, T Poggio, S Mukherjee, R Rifkin, A Califano, G Stolovitzky, DN Louis, JP Mesirov, ES Lander, and TR Golub. Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature, 415:436–442, 2002.

[31] P Pudil, J Novovicova, and J Kittler. Floating search methods in feature selection. Patt. Recogn. Lett., 15:1119–1125, 1994.

[32] BD Ripley. Pattern Recognition and Neural Networks. Cambridge University Press, Cambridge, England, 1996.

[33] MA Shipp, KN Ross, P Tamayo, AP Weng, JL Kutor, RCT Aguiar, M Gaasenbeek, M Angelo, M Reich, GS Pinkus, TS Ray, MA Koval, KW Last, A Norton, A Lister, J Mesirov, DS Neuberg, ES Lander, JC Aster, and TR Golub. Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nature Medicine, 8(1):68–74, 2002.

[34] D Singh, PG Febbo, K Ross, DG Jackson, J Manola, C Ladd, P Tamayo, AA Renshaw, AV D'Amico, JP Richie, ES Lander, M Loda, PW Kantoff, TR Golub, and WR Sellers. Gene expression correlates of clinical prostate cancer behavior. Cancer Cell, 1:203–209, 2002.

[35] R Development Core Team. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria, 2004. ISBN 3-900051-00-3.

[36] S Theodoridis. Pattern Recognition. Academic Press, San Diego, 1999.

[37] V Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1996.

[38] M West, C Blanchette, H Dressman, E Huang, S Ishida, R Spang, H Zuzan, JA Olson Jr, JR Marks, and JR Nevins. Predicting the clinical status of human breast cancer by using gene expression profiles. Proc. Natl. Acad. Sci., 98:11462–11467, 2001.

[39] PH Westfall and SS Young. Resampling-Based Multiple Testing: Examples and Methods for p-Value Adjustment. Wiley, New York, 1993.

[40] E Xing. Feature Selection in Microarray Analysis, chapter 6, pages 110–131. Kluwer Academic Publishers, 2002. Appearing in 'A Practical Approach to Microarray Data Analysis' (eds. D. Berrar, W. Dubitzky, and M. Granzow).

[41] M Xiong, L Wuju, J Zhao, L Jin, and E Boerwinkle. Feature (gene) selection in gene expression-based tumor classification. Mol. Genet. Metab., 73:239–247, 2001.

[42] YH Yang, T Speed, S Dudoit, and P Luu. Normalization for cDNA microarray data. In ML Bittner et al., editors, Microarrays: Optical Technologies and Informatics, volume 4266 of Proc. SPIE, pages 141–152, May 2001.