Bayesian Frameworks for Parsimonious Modeling of Molecular Cancer Data

BAYESIAN FRAMEWORKS FOR PARSIMONIOUS MODELING OF MOLECULAR CANCER DATA by Arturo López Pineda B. Sc. in Computer Science, Tecnológico de Monterrey, 2006 M. Sc. in Intelligent Systems, Tecnológico de Monterrey, 2008 M. Sc. in Biomedical Informatics, University of Pittsburgh, 2012 Submitted to the Graduate Faculty of the School of Medicine in partial fulfillment of the requirements for the degree of Doctor of Philosophy University of Pittsburgh 2015 UNIVERSITY OF PITTSBURGH SCHOOL OF MEDICINE This dissertation was presented by Arturo López Pineda It was defended on December 1, 2015 and approved by Vanathi Gopalakrishnan, Ph.D. Associate Professor of Biomedical Informatics, University of Pittsburgh Shyam Visweswaran, M.D., Ph.D. Associate Professor of Biomedical Informatics, University of Pittsburgh Gregory F. Cooper, M.D., Ph.D. Professor of Biomedical Informatics, University of Pittsburgh Claudia Rangel Escareño, Ph.D. Professor of Computational Genomics, National Institute of Genomic Medicine (Mexico) Dissertation Advisor: Vanathi Gopalakrishnan, Ph.D. ii BAYESIAN FRAMEWORKS FOR PARSIMONIOUS MODELING OF MOLECULAR CANCER DATA Arturo López Pineda, M.S. University of Pittsburgh, 2015 Copyright © by Arturo López Pineda 2015 iii BAYESIAN FRAMEWORKS FOR PARSIMONIOUS MODELING OF MOLECULAR CANCER DATA Arturo López Pineda, M.S. University of Pittsburgh, 2015 In this era of precision medicine, clinicians and researchers critically need the assistance of computational models that can accurately predict various clinical events and outcomes (e.g,, diagnosis of disease, determining the stage of the disease, or molecular subtyping). Typically, statistics and machine learning are applied to ‘omic’ datasets, yielding computational models that can be used for prediction. In cancer research there is still a critical need for computational models that have high classification performance but are also parsimonious in the number of variables they use. Some models are very good at performing their intended classification task, but are too complex for human researchers and clinicians to understand, due to the large number of variables they use. In contrast, some models are specifically built with a small number of variables, but may lack excellent predictive performance. This dissertation proposes a novel framework, called Junction to Knowledge (J2K), for the construction of parsimonious computational models. The J2K framework consists of four steps: filtering (discretization and variable selection), Bayesian network generation, Junction tree generation, and clique evaluation. The outcome of applying J2K to a particular dataset is a parsimonious Bayesian network model with high predictive performance, but also that is composed of a small number of variables. Not only does J2K find parsimonious gene cliques, but also provides the ability to create multi-omic models that can further improve the classification performance. These multi-omic models have the potential to accelerate biomedical discovery, followed by translation of their results into clinical practice. iv TABLE OF CONTENTS TABLE OF CONTENTS ............................................................................................................ V LIST OF TABLES ................................................................................................................... VIII LIST OF FIGURES ..................................................................................................................... X LIST OF EQUATIONS ............................................................................................................ XII LIST OF ALGORITHMS ....................................................................................................... XIII ACKNOWLEDGEMENTS .................................................................................................... XIV 1.0 INTRODUCTION ........................................................................................................ 1 1.1 THE PROBLEM .................................................................................................. 4 1.1.1 Modeling molecular cancer data .................................................................... 5 1.1.2 Interpreting the models ................................................................................... 7 1.2 THE APPROACH ............................................................................................... 8 1.2.1 Thesis .............................................................................................................. 10 1.3 SIGNIFICANCE ................................................................................................ 11 1.4 DISSERTATION OVERVIEW ....................................................................... 12 2.0 BACKGROUND ........................................................................................................ 13 2.1 ANALYSIS WORKFLOW OF MOLECULAR DATA ................................ 13 2.2 PARSIMONIOUS DATA MODELS ............................................................... 16 2.3 JUNCTION TREES FOR BIOMEDICAL DATA ......................................... 18 v 2.4 MULTI-OMIC DATA INTEGRATION ......................................................... 20 3.0 METHODS ................................................................................................................. 22 3.1 DESCRIPTION OF DATA ............................................................................... 22 3.1.1 Microarray Technology ................................................................................ 22 3.1.2 The Cancer Genome Atlas ............................................................................ 26 3.1.3 TCGA Datasets .............................................................................................. 30 3.2 THE J2K FRAMEWORK ................................................................................ 35 3.2.1 Discretization ................................................................................................. 36 3.2.2 Feature Selection ............................................................................................ 38 3.2.3 Building Bayesian Networks ......................................................................... 41 3.2.4 Creating Junction Trees ................................................................................ 46 3.3 THE MODI FRAMEWORK ............................................................................ 49 3.3.1 Integrating Multiple Bayesian Models ........................................................ 50 3.3.2 Latent Variables and the Expectation-Maximization Algorithm ............. 52 3.4 CLASSIFICATION PERFORMANCE .......................................................... 52 4.0 ANNOTATED EXAMPLE ....................................................................................... 54 4.1 DATASET .......................................................................................................... 54 4.2 DISCRETIZING WITH MDLCP .................................................................... 56 4.3 FEATURE SELECTION WITH RELIEFF ................................................... 58 4.4 MODEL BUILDING WITH EBMC ................................................................ 60 4.5 JUNCTION TREE BUILDING ....................................................................... 62 4.6 CLIQUE EVALUATION ................................................................................. 64 4.7 MODI FRAMEWORK MODEL ..................................................................... 65 vi 4.8 PARSIMONY IN J2K ....................................................................................... 69 5.0 EXPERIMENTS AND ANALYSIS ......................................................................... 71 5.1 DISCRETIZING CONTINUOUS VALUES .................................................. 72 5.2 SELECTING VARIABLES FOR CLASSIFIERS ......................................... 75 5.3 BUILDING BAYESIAN NETWORK CLASSIFIERS .................................. 81 5.4 SELECTING PARSIMONIOUS MODELS ................................................... 84 5.5 HYPOTHESIS TESTING................................................................................. 87 5.6 SUMMARY ........................................................................................................ 91 6.0 CONCLUSIONS, LIMITATIONS AND FUTURE WORK .................................. 93 6.1 CONCLUSIONS ................................................................................................ 93 6.2 LIMITATIONS .................................................................................................. 94 6.3 FUTURE WORK ............................................................................................... 95 6.3.1 Investigate the Junction Structure. .............................................................. 95 6.3.2 Explore Novel Search Strategies for Bayesian Model Building. ............... 96 6.3.3 Expand the MODI Framework to Integrate more ‘Omics’. ..................... 97 6.3.4 Modeling from Liquid Biopsy Samples ....................................................... 97 APPENDIX A ............................................................................................................................ 100 BACKGROUND ............................................................................................................... 100 EXPERIMENTAL DESIGN ..........................................................................................

Bayesian Frameworks for Parsimonious Modeling of Molecular Cancer Data

Abnormal Developmental Control of Replication-Timing Domains in Pediatric Acute Lymphoblastic Leukemia

Identification of Genetic Factors Underpinning Phenotypic Heterogeneity in Huntington’S Disease and Other Neurodegenerative Disorders

A Genome Wide Screen in C. Elegans Identifies Cell Non-Autonomous Regulators of Oncogenic Ras Mediated Over-Proliferation DISSER

Genomic Mapping of the MHC Transactivator CIITA Using An

The Human Canonical Core Histone Catalogue David Miguel Susano Pinto, Andrew Flaus,†

A Multiprotein Occupancy Map of the Mrnp on the 3 End of Histone

Supplemental Data.Pdf

A Novel Histone H4 Variant Regulates Rdna Transcription in Breast Cancer

Primate-Specific Histone Variants

The Changing Chromatome As a Driver of Disease: a Panoramic View from Different Methodologies

WO 2013/095793 Al 27 June 2013 (27.06.2013) W P O P C T

On Predicting Lung Cancer Subtypes Using