Mining Patterns in Genomic and Clinical Cancer Data to Characterize Novel Driver Genes
Rachel D. Melamed

Submitted in partial fulfillment of the requirements for the degree of Doctor of Philosophy under the Executive Committee of the Graduate School of Arts and Sciences

COLUMBIA UNIVERSITY

2015

©2015 Rachel D. Melamed
All rights reserved

ABSTRACT

Cancer research, like many areas of science, is adapting to a new era characterized by increasing quantity, quality, and diversity of observational data. An example of the advances, and the resulting challenges, is represented by The Cancer Genome Atlas, an enormous public effort that has provided genomic profiles of hundreds of tumors of each of the most common solid cancer types. Alongside this resource is a host of other data and knowledge, including gene interaction databases, Mendelian disease causal variants, and electronic health records spanning many millions of patients. Thus, a current challenge is how best to integrate these data to discover mechanisms of oncogenesis and cancer progression. Ultimately, this could enable genomics-based prediction of an individual patient’s outcome and targeted therapies, a goal termed precision medicine. In this thesis, I develop novel approaches that examine patterns in populations of cancer patients to identify key genetic changes and suggest likely roles of these driver genes in the diseases.

In the first section I show how genomics can lead to the identification of driver alterations in melanoma. The most recurrent genetic mutations are often in important cancer driver genes: in a newly sequenced melanoma cohort, recurrent inactivating mutations point to an exciting new melanoma candidate tumor suppressor, FBXW7, with therapeutic implications.
But each tumor is unique, underlining the fact that recurrence will never capture all relevant mutations responsible for the disease. Tumors are a result of random events that must collaborate to endow a cell with all of the invasive and immortal properties of a cancer. Some combinations of events are lethal to a developing tumor, while other combinations are simply not preferentially selected. In order to discover these complex patterns, I develop a method based on the joint entropy of a set of genes, called GAMToC. Using GAMToC, I identify sets of recurrently altered genes with a strongly non-random joint pattern of co-occurrence and mutual exclusivity. Then, I extend this method as a means of identifying novel genes with a role in cancer, by virtue of their non-random pattern of alteration. Insights into the roles of these novel drivers can come from their most strongly co-selected partners.

In the final section of the main text, I develop the use of cancer comorbidity, or increased cancer risk, as a novel data source for understanding cancer. The recent availability of clinical records spanning a large percentage of the American population has enabled discovery of many cancer comorbidities. Although most cancers arise as a result of somatic mutations accumulating over a patient’s lifespan, mutations present at birth could predispose some rare populations to increased cancer risk. Mendelian disease phenotype provides strong insight into the genotype of an afflicted individual. Thus, if Mendelian diseases with cancer comorbidity can be shown to have specific defects in processes that are important in the development of that cancer, statistical comorbidity could provide a new resource for prioritizing Mendelian disease genes as novel cancer-related genes. For this purpose, I integrate clinical comorbidity, Mendelian disease causal variants, and somatic genomic profiles of thousands of cancers.
I demonstrate that comorbidity indeed is associated with significant genetic similarity between Mendelian diseases and the cancers these patients are predisposed to, suggesting highly interesting and plausible new candidate cancer genes. While cancer may be the result of a series of selected random events, patterns of incidence across large populations, as measured by genomics or by other phenotypes, contain much non-random signal yet to be mined.

TABLE OF CONTENTS

LIST OF GRAPHS, IMAGES, AND ILLUSTRATIONS
LIST OF SUPPLEMENTARY TABLES
ACKNOWLEDGEMENTS
1 INTRODUCTION
2 Coding mutations influencing development of melanoma and nevi
2.1 Sequencing melanomas and discovery of FBXW7 as a melanoma tumor suppressor
2.1.1 Methods
2.1.2 Results
2.2 Sequencing nevi: exploring the progression to melanoma
2.2.1 Methods
2.2.2 Results and discussion
2.3 Discussion
3 Applying the total correlation to identify and contextualize driver alterations
3.1 An information theoretic method to identify combinations of genomic alterations that promote glioblastoma
3.1.1 Introduction
3.1.2 Method
3.1.3 Results
3.1.4 Discussion
3.2 GAMToC-L: Using patterns of co-selection of cancer genes to identify and contextualize novel drivers
3.2.1 Methods
3.2.2 Results
3.2.3 Discussion
4 Genetic similarity between cancers and comorbid Mendelian diseases identifies candidate driver genes
4.1 Introduction
4.2 Comparing Mendelian disease and comorbid cancer
4.2.1 Integration of disease comorbidities and genes
4.2.2 Genetic similarity of comorbid diseases
4.3 Mendelian disease comorbidity and cancer processes
4.3.1 Prediction of diseases with shared cellular processes
4.3.2 Pan-cancer Mendelian associations
4.4 Discussion
5 Data-driven discovery of seasonally linked diseases from an Electronic Health Records system
5.1 Introduction
5.2 Methods
5.2.1 Quantifying incidence of diagnoses
5.2.2 Correcting for confounding trends
5.2.3 Evaluating periodicity
5.2.4 Comorbidity analysis
5.3 Results