A Network-Based Analysis of the Cellular and Genetic Etiology of Disease

Alexander John Cornish

Submitted in part-fulfilment of the requirements for the degree of Doctor of Philosophy of Imperial College London and the Diploma of Imperial College London

Department of Life Sciences
Imperial College London
2016

Declaration

The contents of this thesis are my own work unless otherwise specified. The copyright of this thesis rests with the author and is made available under a Creative Commons Attribution Non-Commercial No Derivatives licence. Researchers are free to copy, distribute or transmit the thesis on the condition that they attribute it, that they do not use it for commercial purposes and that they do not alter, transform or build upon it. For any reuse or redistribution, researchers must make clear to others the licence terms of this work.

Alexander John Cornish

Abstract

Thousands of disease-associated loci have been identified in genome-wide association studies (GWAS). These loci can span multiple genes and identifying which, if any, of these genes are causal can be challenging. Multiple methods have been developed to identify causal genes, some of which use networks of physical interactions between proteins. The performance of many of these methods may however be limited by their failure to use data specific to the tissues and cell types that manifest each disease. Furthermore, many network-based approaches may be biased towards better-studied genes.

In order to use data specific to a disease-manifesting cell type to identify disease-associated genes, it is first necessary to identify the disease-manifesting cell types. In this thesis, I report the development of the GSC (Gene Set Compactness) and GSO (Gene Set Overexpression) methods, which I use to identify associations between 352 diseases and 73 cell types.
The GSC method identifies these associations using cell-type-specific protein-protein interaction (PPI) networks, which I generate by integrating PPI and gene expression data. Using text mining, I demonstrate that these methods identify a large number of well-characterised disease-cell-type associations, as well as associations that warrant further investigation.

I also describe the development of ALPACA (Analysing Loci using Phenotypic And Cellular Associations), which identifies disease-associated genes using cell-type-specific PPI networks and phenotype data from humans and mice. I demonstrate that by taking a permutation-based approach, ALPACA avoids being biased towards better-studied genes. Furthermore, I demonstrate that using cell-type-specific networks, instead of generic networks, improves method performance. As the volume of available tissue- and cell-type-specific data continues to increase, methods that integrate these data will become increasingly important in understanding disease etiology.

Acknowledgements

I would first like to thank my friends and colleagues, past and present, in the Structural Bioinformatics Group, Imperial College London. I would especially like to thank Michael Sternberg for his supervision, guidance and encouragement over the last three years. My thanks also go to Alessia David, Ioannis Filippis, Suhail Islam, Christopher Yates and Joe Greener for our conversations on all matters bioinformatics. Acknowledgment must also go to the British Heart Foundation, for providing the funding that allowed me to complete this research.

I would also like to thank my family and friends for the invaluable support they have given me throughout this research. Finally, I thank Hannah, for providing me with plentiful distractions when the work got tough and for reminding me that there are sometimes (occasionally) more important things in life than science.

Contents

1 Introduction
  1.1 Outline of thesis
  1.2 Disease variants
    1.2.1 Identifying disease variants
    1.2.2 Human variant databases
    1.2.3 Non-human variant databases
  1.3 Biological networks
    1.3.1 Types of biological network
    1.3.2 PPI databases
    1.3.3 Context-specific networks
  1.4 Methods for identifying causal genes
    1.4.1 Text mining
    1.4.2 Physical interactions
    1.4.3 Functional relationships
    1.4.4 Phenotypic similarities
    1.4.5 Data from model organisms
    1.4.6 Context-specific data
  1.5 Methods for mapping diseases to contexts
    1.5.1 Text mining
    1.5.2 Gene expression data
    1.5.3 Gene expression and PPI data
    1.5.4 Epigenetic data
  1.6 Scope of thesis
2 Materials and methods
  2.1 Performance evaluation
    2.1.1 Performance statistics
    2.1.2 ROC curves
  2.2 Statistical tests
    2.2.1 Fisher's exact test
    2.2.2 The Mann-Whitney U test
  2.3 Adjusting for multiple testing
    2.3.1 Bonferroni correction
    2.3.2 Benjamini-Hochberg procedure
  2.4 Hierarchical clustering
  2.5 Normalising read counts
  2.6 Random-walk-based network algorithms
    2.6.1 Measuring distances
    2.6.2 Propagating scores
3 Identifying associations between diseases and cell types
  3.1 Introduction
    3.1.1 Motivation
    3.1.2 Generating context-specific networks
    3.1.3 Mapping diseases to contexts
  3.2 Materials and methods
    3.2.1 Gene expression data
    3.2.2 Protein-protein interaction data
    3.2.3 Generating context-specific PPI networks
    3.2.4 Disease-gene association data
    3.2.5 Mapping diseases to contexts
  3.3 Results
    3.3.1 Network topology features
    3.3.2 Disease-associated cell-type-specific sub-networks
    3.3.3 Parameter selection
    3.3.4 Effect of gene set size on method performance
    3.3.5 Comparison of the associations identified by the methods
    3.3.6 Disease-cell-type associations identified by the GSC method
    3.3.7 Cell-type-based diseasomes
  3.4 Method implementation and data availability
  3.5 Discussion
  3.6 Conclusions
4 Prioritising genes in trait-associated loci
  4.1 Introduction
    4.1.1 Motivation
    4.1.2 Data sources
    4.1.3 Generating association scores
  4.2 Materials and methods
    4.2.1 The ALPACA method
    4.2.2 Defining trait-associated loci
    4.2.3 Human disease variant data
    4.2.4 Human disease phenotype terms
    4.2.5 Mouse phenotype data
    4.2.6 Measuring phenotype similarity
    4.2.7 Protein-protein interaction data
    4.2.8 Evaluating method performance
  4.3 Results
    4.3.1 Study bias in PPI databases
    4.3.2 Calibrating trait-associated loci
    4.3.3 Network propagation parameter selection
    4.3.4 Type-1 error rate analysis
    4.3.5 Effect of study bias on gene prioritisation
    4.3.6 Comparison of method performance
    4.3.7 Performance using data from multiple species
    4.3.8 Performance using context-specific networks
    4.3.9 Comparison of edge weighting methods
    4.3.10 Case study
  4.4 Discussion
  4.5 Conclusions
5 Discussion and future work
  5.1 Discussion
    5.1.1 Moving towards a dynamic picture of the interactome
    5.1.2 Understanding the context-specific effects of disease genes
    5.1.3 Identifying causal genes using trans-acting regulatory elements
  5.2 Future work
    5.2.1 Using data from additional species
    5.2.2 Applying ALPACA to different network types
    5.2.3 Prioritising disease variants
    5.2.4 Making ALPACA available for use
6 Conclusions
Appendix

List of Figures

1.1 The case-control GWAS setup
1.2 Imputing SNPs in unrelated individuals
1.3 The Y2H method
1.4 The TAP-MS method
1.5 Inferring edges using the spoke and matrix models
1.6 Methods for generating context-specific PPI networks
1.7 Network distance measures
1.8 The MICA semantic similarity measure
3.1 FANTOM5 project sample normalisation and combination pipeline
3.2 Clustering of cell samples of different potencies in cell type facets
3.3 Number of FANTOM5 project samples in each cell type facet
3.4 The GSO method
3.5 The GSC method
3.6 The text mining method
3.7 Cell-type-specific network edge weight distribution
3.8 Cell-type-specific disease sub-networks
3.9 Support between the GSC and GSO methods
3.10 Overlap between the GSC and GSO methods
3.11 A subset of the associations identified by the GSC method
3.12 Cell-type-based diseasome generated using two connections
3.13 Cell-type-based diseasome generated using three connections
3.14 Cell-type-based diseasome generated using four connections
3.15 Cell-type-based diseasome generated using five connections
4.1 The ALPACA method
4.2 Defining trait-associated loci and the genes they contain
4.3 Integrating disease-gene associations
4.4 Integrating disease-phenotype-term mappings
4.5.
