Integrative Network Analysis for Understanding Human Complex Traits

by Lili Wang

A thesis submitted to the School of Computing in conformity with the requirements for the degree of Doctor of Philosophy

Queen's University
Kingston, Ontario, Canada
April 2015

Copyright © Lili Wang, 2015

Abstract

Over the last decade, high-throughput biological data have been accumulating at rapidly increasing rates, providing the opportunity to gain insight into various fundamental biological processes. Such large-scale data have been explored using network representations and graph theory to study biological relationships. Meanwhile, a great amount of effort has also been dedicated to integrating diverse biological data types in order to build networks and apply computational analysis to distill meaningful information for specific biological problems. As a result, network-based analysis has become a powerful paradigm to model and study large-scale biological data. The goal of network-based analysis of human complex traits is to annotate or predict new relationships between biological entities, such as proteins, drugs and phenotypes. Furthermore, such analysis can facilitate the diagnosis and prognosis of common complex diseases.

This thesis comprises three contributions. First, heterogeneous biological data are integrated and a novel tool has been developed to easily construct and navigate networks representing the large-scale data. In addition, the resulting networks can be analyzed using computational methods to solve specific biological problems. Second, an integrative network-based pathway analysis for genome-wide association studies (GWAS) has been proposed to take advantage of the large-scale network to combine topological connectivity with signals from GWAS in order to detect enriched pathways.
Third, an integrative strategy combines multiple quantitative profiles with a large-scale network to assist biomarker selection for ovarian cancer using two different computational methods: (A) an aggregate ranking to score the candidate proteins and (B) pathway analysis to find enriched sub-networks. These three contributions demonstrate a pipeline to model large heterogeneous biological data in terms of networks and conduct network-based analysis for understanding the molecular basis of human diseases.

Acknowledgments

First, I would like to thank my parents, who have always stood behind me and supported me, and my daughter, who inspires me and gives me the motivation to succeed.

Special thanks go to my gracious advisor and mentor, Dr. Parvin Mousavi, for her great encouragement and generous support throughout my PhD studies. Without her, I would not have completed my studies, nor been able to take part in such a cross-disciplinary research field. She is a role model of inspiration and wisdom in life.

I would also like to thank my mentors Dr. Sergio Baranzini and Dr. Igor Jurisica for providing a huge amount of help with all aspects of my research. Many thanks to the members of the Baranzini Lab at the University of California, San Francisco, and the Jurisica Lab at the University of Toronto for great collaborations and for sharing ideas and experience.

Additionally, I would like to thank my PhD advisory committee members, Dr. Dorothea Blostein and Dr. Janice Glasgow, for their feedback and support. I really appreciate their effort in reading and commenting on my research.

I would also like to thank the past and current members of the Med-i Lab: Amir, Farhad, Layan, Nathan, and Zhen, for their friendship.

Finally, I would like to thank the School of Computing at Queen's University for the education it offered me, which helped me become who I am.

Glossary

CROC Concentrated Receiver Operating Characteristic.
CTD Comparative Toxicogenomics Database.
DO Disease Ontology.
GO Gene Ontology.
GWAS Genome-wide association studies.
I2D Interologous Interaction Database.
KEGG Kyoto Encyclopedia of Genes and Genomes.
NHGRI National Human Genome Research Institute.
OMIM Online Mendelian Inheritance in Man database.
PPI Protein-protein interaction.
ROC Receiver operating characteristic.
SNP Single nucleotide polymorphism.

Contents

Abstract
Acknowledgments
Glossary
Contents
List of Figures
List of Tables

Chapter 1: Introduction
1.1 Motivation
1.2 Objectives
1.3 Contributions
1.4 Organization of Thesis

Chapter 2: Background
2.1 Network Terminology and Definitions
2.1.1 Degree
2.1.2 Path
2.1.3 Clustering Coefficient
2.1.4 Network Models
2.2 Biological Interactions and Sources
2.2.1 Protein-protein Interactions
2.2.2 Metabolic and Signalling Pathways
2.2.3 Disease Gene Associations
2.2.4 Drug Targets
2.2.5 Gene Expression Data
2.2.6 Protein Expression Data
2.2.7 microRNA Data
2.3 Data Integration
2.4 Network Analysis
2.4.1 Gene Prioritization
2.4.2 Subnetwork Detection
2.5 Discussion

Chapter 3: Integrative Complex Traits Networks
3.1 Introduction
3.2 Background
3.3 Materials and Methods
3.3.1 Nodes
3.3.2 Edges
3.3.3 Computational Analysis
3.4 Implementation
3.4.1 Availability and Requirements
3.4.2 Database
3.5 Features
3.5.1 Disease-centered Network Model
3.5.2 Gene-centered Network Model
3.5.3 Drug-centered Network Model
3.5.4 Similarity Network
3.6 Discussion

Chapter 4: Integrative Network-based Functional Module Discovery for Genome-wide Association Studies
4.1 Introduction
4.2 Background
4.3 Data
4.3.1 GWAS Data Sets
4.3.2 Human Protein Interaction Network
4.3.3 Benchmark Genes
4.4 Methods
4.4.1 Step One
4.4.2 Step Two
4.4.3 Step Three
4.4.4 Parameters
4.4.5 Evaluation
4.5 Results
4.5.1 Prediction of WTCCC2 Genes
4.5.2 Predicting iChip Genes
4.5.3 Significantly Enriched Networks
4.5.4 Sensitivity of iPINBPA to Parameters
4.6 Application
4.6.1 Availability and Requirements
4.6.2 Features
4.7 Discussion

Chapter 5: Integrative Biomarker Selection for Ovarian Cancer
5.1 Introduction
5.2 Background
5.3 Data
5.3.1 Proteomic Profiles
5.3.2 Secretomic Profiles
5.3.3 Transcriptomic Profiles
5.3.4 Interologous PPI Network
5.3.5 Matching Profiles
5.4 Candidate Protein Selection
5.5 Aggregate Ranking
5.6 Sub-networks Detection
5.7 Results
5.7.1 Enriched Subnetworks
5.7.2 Enriched KEGG Pathways
5.8 Conclusion

Chapter 6: Summary and Future Work
6.1 Future Directions

Bibliography
Appendix A: iCTNet Data Source Description
Appendix B: iCTNet Database Schema
Appendix C: ACS Proteins Selected for Ovarian Cancer

List of Figures

1.1 Network-based analytic process of biological data
2.1 KEGG Pathway: Insulin Secretion (http://www.kegg.jp)
3.1 iCTNet database schema and user interface
3.2 Disease-centered network schema
3.3 Disease network of breast cancer
3.4 Gene-centered network schema
3.5 Gene network of BRCA1, BRCA2 and BRCA3
3.6 Drug-centered network schema
3.7 Drug network of Methotrexate
3.8 Human disease networks
4.1 Work flow of iPINBPA
4.2 ROC and CROC curves [1]
4.3 CROC curves of Meta2.5 and WTCCC2 GWAS data sets
4.4 CROC curves with different restart ratios
4.5 CROC fold enrichments of different values of restart ratio r
4.6 Overview of iPINBPA app
4.7 Manhattan plot in iPINBPA
4.8 Table of genes and blocks
4.9 Node attributes added after running random walk with restart
4.10 Sub-network table
4.11 Sub-network of genes with p-value ≤ 0.05
4.12 Histograms of random networks
5.1 Data structure
5.2 Data distribution
5.3 Aggregate ranking strategy
...
