A Learning Classifier System Approach
Total Page:16
File Type:pdf, Size:1020Kb
The Detection and Characterization of Epistasis and Heterogeneity: A Learning Classifier System Approach A Thesis Submitted to the Faculty in partial fulfillment of the requirements for the degree of Doctor of Philosophy in Genetics by Ryan John Urbanowicz DARTMOUTH COLLEGE Hanover, New Hampshire March 2012 Examining Committee: Professor Jason H. Moore (chair) Professor Michael L. Whitfield Professor Margaret J. Eppstein Professor Robert H. Gross Professor Tricia A. Thornton-Wells Brian W. Pogue, Ph.D. Dean of Graduate Studies c 2012 - Ryan John Urbanowicz All rights reserved. Abstract As the ubiquitous complexity of common disease has become apparent, so has the need for novel tools and strategies that can accommodate complex patterns of association. In particular, the analytic challenges posed by the phenomena known as epistasis and heterogeneity have been largely ignored due to the inherent difficulty of approaching multifactor non-linear relationships. The term epistasis refers to an interaction effect between factors contributing to diesease risk. Heterogeneity refers to the occurrence of similar or identical phenotypes by means of independent con- tributing factors. Here we focus on the concurrent occurrence of these phenomena as they are likely to appear in studies of common complex disease. In order to ad- dress the unique demands of heterogeneity we break from the traditional paradigm of epidemiological modeling wherein the objective is the identification of a single best model describing factors contributing to disease risk. Here we develop, evaluate, and apply a learning classifier system (LCS) algorithm to the identification, modeling, and characterization of susceptibility factors in the concurrent presence of heterogeneity and epistasis. This work includes an examination of existing LCS algorithms, the development of new strategies to address the inherent problem of knowledge discov- ery in LCSs, the introduction of novel heuristics that improve LCS performance and allow for the explicit characterization of heterogeneity, and an application of these ii Abstract collective advancements to an investigation of bladder cancer susceptibility. Success- ful analysis of simulated and real-world complex disease associations demonstrates the validity and unique potential of this LCS approach. iii Citations to Previously Published Work Chapter 1 has been published as: R. J. Urbanowicz, and J. H. Moore. Learning Classifier Systems: A Com- plete Introduction, Review, and Roadmap. Journal of Artificial Evolution and Applications. (2009):1, 2009. Chapter 2 has been published as: R. J. Urbanowicz, and J. H. Moore. The Application of Michigan-Style Learning Classifier Systems to Address Genetic Heterogeneity and Epis- tasis in Association Studies. Proceedings of the 12th Annual Conference on Genetic and Evolutionary Computation. 195–202, 2010. Chapter 3 has been published as: R. J. Urbanowicz, and J. H. Moore. The Application of Pittsburgh-Style Learning Classifier Systems to Address Genetic Heterogeneity and Epis- tasis in Association Studies. Parallel Problem Solving from Nature–PPSN XI. 404–413, 2011 Chapter 4 has been submitted as: R. J. Urbanowicz, A. Granizo-Mackenzie, and J. H. Moore. An Analysis Pipeline with Visualization-Guided Knowledge Discovery for Michigan- Style Learning Classifier Systems. Chapter 5 has been submitted as: R. J. Urbanowicz, A. Granizo-Mackenzie, and J. H. Moore. Instance- Linked Attribute Tracking and Feedback for Michigan-Style Supervised Learning Classifier Systems. iv Acknowledgments This work would not have been possible without the financial support of the Na- tional Institute of Health (LM010098, LM009012, and AI59694), and the William H. Neukom Institute for Computational Science at Dartmouth College. I am grateful for the expertise of my thesis committee including Margaret Eppstein and Michael Whit- field, each of whom provided excellent professional guidance throughout this research. I would especially like to thank my advisor Jason Moore for his support, trust, and guidance throughout the development of this work. I would also like to acknowledge the contribution of my co-authors who have been instrumental in the publication of works included in this thesis. Ambrose Granizo-Mackenzie has contributed to chap- ters 4 and 5. Jeff Kiralis, Jonathan Fisher, Nicholas Sinnott-Armstrong, and Tamra Heberling contributed to the development of the model simulation strategy utilized in chapters 2, 3, 4, and 5 (submitted for publication outside of this thesis). I am grate- ful to Angeline Andrew and Margaret Karagas for their collaboration on chapter 6, and for making the bladder cancer susceptibility data available for analysis. I am fortunate to have worked alongside such a diverse and inquisitive group of students and post-docs. In particular I would like to thank Chantel Sloan, Anna Tyler, Casey Greene, Nicholas Sinnott-Armstrong, Delaney Granizo-Mackenzie, Ambrose Granizo- Mackenzie, Arvis Sulovari, Ting Hu, Richard Cowper, Cristian Dabaros, Josh Payne, and Dov Pechenick for their insights and day to day collaboration. I’d also like to v Acknowledgments thank Peter Andrews, Jonathan Fisher, and Bill White for their programming ex- pertise and Anna Green for having developed the LaTeX thesis template used here. Lastly I thank my all my teachers, mentors, family, and friends for their encourage- ment, love, and support. A special thanks to the Andrews family. vi Contents TitlePage.................................... i Abstract..................................... ii Citations to Previously Published Work . iv Acknowledgments................................ v TableofContents................................ vii ListofTables .................................. xi ListofFigures.................................. xii Dedication.................................... xvi 1 An Introduction and Review of Learning Classifier Systems 1 1.1 Introduction................................ 2 1.2 AGeneralLCS .............................. 5 1.3 TheDrivingMechanisms......................... 6 1.3.1 Discovery - The Genetic Algorithm . 6 1.3.2 Learning.............................. 10 1.4 AMinimalClassifierSystem ....................... 12 1.5 HistoricalPerspective........................... 16 1.5.1 TheEarlyYears.......................... 17 1.5.2 TheRevolution .......................... 29 1.5.3 IntheWakeofXCS ....................... 32 1.5.4 Revisiting the Pitt . 34 1.5.5 Visualization . 35 1.6 ProblemDomains............................. 35 1.7 Biological Applications . 37 1.8 OptimizingLCS.............................. 38 1.9 ComponentRoadmap........................... 39 1.9.1 DetectorsandEffectors. 40 vii Contents 1.9.2 Population............................. 41 1.9.3 PerformanceComponentandSelection . 44 1.9.4 ReinforcementComponent . 47 1.9.4.1 BucketBrigade ..................... 47 1.9.4.2 Q-Learning-based. 49 1.9.4.3 More Credit Assignment . 51 1.9.5 DiscoveryComponents . 52 1.9.6 BeyondtheBasics ........................ 54 1.10Conclusion................................. 55 1.11Resources ................................. 57 1.12Bibliography................................ 58 2 The Application of Michigan-Style Learning Classifier Systems to Address Genetic Heterogeneity and Epistasis in Association Studies 97 2.1 Introduction................................ 98 2.2 Methods.................................. 102 2.2.1 LearningClassifierSystems . 102 2.2.1.1 Implementing LCS . 104 2.2.1.2 Three Michigan-Style LCSs . 107 2.2.2 Data Simulation . 108 2.2.3 SystemEvaluations. .. .. 110 2.3 Results................................... 113 2.3.1 ParameterSweep ......................... 113 2.3.2 Michigan LCS Evaluation . 114 2.4 DiscussionandConclusions . 117 2.5 Bibliography................................ 121 3 The Application of Pittsburgh-Style Learning Classifier Systems to Address Genetic Heterogeneity and Epistasis in Association Studies128 3.1 Introduction................................ 129 3.2 Methods.................................. 132 3.2.1 LearningClassifierSystems . 132 3.2.1.1 Implementing LCSs . 133 3.2.1.2 Two Pittsburgh-Style LCSs . 134 3.2.2 SystemEvaluations. .. .. 135 3.3 Results................................... 137 3.3.1 ParameterSweep ......................... 137 3.3.2 PittsburghLCSEvaluations . 139 viii Contents 3.4 DiscussionandConclusions . 144 3.5 Bibliography................................ 146 4 An Analysis Pipeline with Visualization-Guided Knowledge Discov- ery for Michigan-Style Learning Classifier Systems: Interpreting the Black Box 149 4.1 Introduction................................ 150 4.2 LearningClassifierSystems . 154 4.3 BioinformaticsProblem. .. .. 156 4.4 Analysis Pipeline . 160 4.4.1 RuntheM-LCS.......................... 160 4.4.2 Significance Testing of M-LCS Statistics . 161 4.4.3 Visualization - Heat-Map . 169 4.4.4 Visualization - Co-Occurrence Network . 173 4.4.5 Guided Knowledge Discovery . 174 4.4.6 NegativeControlAnalysis . 176 4.5 Conclusions ................................ 177 4.6 Bibliography................................ 178 5 Instance-Linked Attribute Tracking and Feedback for Michigan-Style Supervised Learning Classifier Systems 184 5.1 Introduction................................ 185 5.2 Methods.................................. 188 5.2.1 M-LCS............................... 189 5.2.2 AttributeTracking ........................ 190 5.2.3 AttributeFeedback . 193 5.2.3.1 Mutation ........................ 194 5.2.3.2 UniformCrossover . 195 5.2.4 Evaluation............................