A Graph-Theoretic Approach to Model Genomic Data and Identify Biological Modules Associated with Cancer Outcomes

Deanna Petrochilos

A dissertation presented in partial fulfillment of the requirements for the degree of Doctor of Philosophy

University of Washington
2013

Reading Committee:
Neil Abernethy, Chair
John Gennari
Ali Shojaie

Program Authorized to Offer Degree: Biomedical Informatics and Health Education

UMI Number: 3588836
Published by ProQuest LLC (2013). Copyright in the Dissertation held by the Author.
© Copyright 2013 Deanna Petrochilos

University of Washington

Abstract

Using Graph-Based Methods to Integrate and Analyze Cancer Genomic Data

Deanna Petrochilos

Chair of the Supervisory Committee: Assistant Professor Neil Abernethy, Biomedical Informatics and Health Education

Studies of the genetic basis of complex disease present statistical and methodological challenges in the discovery of reliable and high-confidence genes that reveal biological phenomena underlying the etiology of disease or gene signatures prognostic of disease outcomes. This dissertation examines the capacity of graph-theoretical methods to model and analyze genomic information and thus facilitate using prior knowledge to create a more discrete and functionally relevant feature space.
To assess the statistical and computational value of graph-based algorithms in genomic studies of cancer onset and progression, I apply a random walk graph algorithm in a weighted interaction network. I merge high-throughput co-expression and curated interaction data to search for biological modules associated with key cancer processes and evaluate significant modules in terms of both their predictive value and functional relevance. This approach identifies interactions among genes involved in proliferation, apoptosis, angiogenesis, immune evasion, metastasis, and energy metabolism pathways, and generates hypotheses for future cancer biology studies. Based on the results of this work, I conclude that graph-based approaches are powerful tools for the integration and analysis of complex molecular relationships that reveal significant coordinated activity of genomic features where previous statistical and analytical methods have been limited.

TABLE OF CONTENTS

Table of Figures
Table of Tables
Glossary
Acknowledgements

Chapter 1: Introduction
  1.1: Challenges in Large Scale Genomic Studies
  1.2: Research Objectives
    1.2.1: Assessing Network Characteristics of Cancer-Associated Genes in Metabolic and Signaling Networks
    1.2.2: Using Weighted Random Walks to Identify Cancer-Associated Modules in Expression Data
    1.2.3: Evaluation of the Use of Weighted Random Walks and Expression Data to Identify Cancer-Associated Modules
    1.2.4: Analysis of microRNA Data in Random Walk-Generated Expression Modules
    1.2.5: Evaluation of Analyzing miRNA Data in Random Walk-Generated Expression Modules
  1.3: Contributions
  1.4: Dissertation Overview

Chapter 2: Network Biology and the Cancer Genome
  2.1: Introduction
  2.2: Overview of Biological Pathways and Interaction Networks
  2.3: Network Features and Definitions
  2.4: Graph and Pathway-Based Approaches Using Prior Evidence in Genome Studies
  2.5: Graph-based Random Walks in Gene Prioritization and Module Discovery

Chapter 3: Assessing Network Characteristics of Cancer Associated Genes in Metabolic and Signaling Networks
  3.1: Introduction
  3.2: Methods
    3.2.1: Overview
    3.2.2: Network Construction
    3.2.3: Definition of Cancer Genes
    3.2.4: Network Features
    3.2.5: Statistical Analysis
    3.2.6: Community Analysis
  3.3: Results and Discussion
    3.3.1: Global Network Statistics
    3.3.2: Feature Prediction
    3.3.3: Community Analysis
  3.4: Conclusion

Chapter 4: Using Random Walks to Identify Cancer-Associated Modules in Expression Data
  4.1: Introduction
  4.2: Methods
    4.2.1: Overview
    4.2.2: Gene Expression Data
    4.2.3: Network Construction
    4.2.4: Weights and Significance Scoring
    4.2.5: Definition of Cancer Genes
    4.2.6: Community Analysis
  4.3: Results and Discussion
    4.3.1: Functional Annotation
    4.3.2: Breast Cancer
    4.3.3: Hepatocellular Carcinoma
    4.3.4: Colorectal Cancer
    4.3.5: Evaluation: Overlap with GSEA
    4.3.6: Evaluation: Comparison with jActiveModules and Matisse
  4.4: Conclusion

Chapter 5: Analysis of miRNA Data in Random Walk-Generated Expression Modules
  5.1: Introduction
  5.2: Methods
    5.2.1:

