Gene and Protein Networks in Understanding Cellular Function
Sonja Katriina Lehtinen

A thesis submitted to University College London for the degree of Doctor of Philosophy

March 2015

I, Sonja Lehtinen, confirm that the work presented in this thesis is my own. Where information has been derived from other sources, I confirm that this has been indicated in the thesis.

Sonja Lehtinen
2 March, 2015

Abstract

Over the past decades, networks have emerged as a useful way of representing complex large-scale systems in a variety of fields. In cellular and molecular biology, gene and protein networks have attracted considerable interest as tools for making sense of increasingly large volumes of data. Despite this interest, there is still substantial debate over how best to exploit network models in cellular biology. This thesis explores the use of gene and protein networks in various biological contexts.

The first part of the thesis (Chapter 2) examines protein function prediction using network-based 'guilt-by-association' approaches. Given the falling costs of genome sequencing and the availability of large volumes of biological data, automated annotation of gene and protein function is becoming increasingly useful. Chapter 2 describes the development of a new network-based protein function prediction method and compares it to a leading algorithm on a number of benchmarks. Biases in benchmarking methods are also explicitly explored.

The second part (Chapters 3 and 4) explores network approaches in understanding loss of function variation in the human genome. For a number of genes, homozygous loss of function appears to have no detrimental effect. A possible explanation is that these genes are only necessary in specific genetic backgrounds. Chapter 3 develops methods for identifying these types of relationships between apparently loss of function tolerant genes.
Chapter 4 describes the use of networks in predicting the functional effects of loss of function mutations.

The third part of the thesis (Chapters 5 and 6) uses network representations to model the effects of cellular stress on yeast cells. Chapter 5 examines stress induced changes in co-expression and protein interaction networks, finding evidence of increased modularisation in both types of network. Chapter 6 explores the effect of stress on resilience to node removal in the co-expression networks.

Acknowledgements

If theses are written from the shoulders of giants, I have been lucky to work with particularly big and friendly ones. First and foremost, I would like to thank my primary supervisor Christine Orengo for her outstanding guidance and genuine kindness. I am also grateful for the excellent input and good humour of my second supervisor Jürg Bähler and the support of my thesis chair Konstantinos Thalassinos.

I have had the pleasure of collaborating with a number of excellent scientists during this PhD. My thesis benefited greatly from the expertise and insight of John Shawe-Taylor, Jon Lees, Vera Pancaldi, Suganthi Balasubramanian and Koon-Kiu Yan. I would also like to thank everybody in the Orengo group for their help along the way.

In my final year, I had the chance to visit the Gerstein Lab at Yale for three months. This was a fantastic experience - a big thank you to Mark Gerstein, UCL graduate school and the Yale-UCL Collaborative for making my visit possible and all of the Gerstein Lab for their friendly welcome.

This PhD would not have been possible without the financial and academic support of CoMPLEX or the friendship of the CoMPLEX people. I could not have asked for a better PhD cohort.

Finally, to my family and the rest of my friends: thank you for rarely asking about the PhD - it was appreciated.

Contents

1 Introduction
  1.1 Overview: Network Biology
  1.2 Network Concepts and Terminology
  1.3 Network Approaches
    1.3.1 Characterising Nodes
    1.3.2 Characterising Networks
    1.3.3 Modelling Networks
  1.4 Network Types
    1.4.1 Protein-Protein Interaction Networks
    1.4.2 Co-Expression Networks
    1.4.3 Genetic Interaction Networks
    1.4.4 Other Functional Association Networks
    1.4.5 Dynamic Networks
  1.5 Thesis Overview
2 A Graph Kernel Method for Gene Function Prediction
  2.1 Introduction
    2.1.1 Gene Function Prediction: Scope and Definitions
    2.1.2 Exploiting Networks for Function Prediction
    2.1.3 Problems with Network Based Approaches
    2.1.4 Aims and Objectives
  2.2 Technical Background
    2.2.1 Kernel Methods
    2.2.2 Kernelized Prediction Algorithms
    2.2.3 Protein Function Prediction and Negative Examples
  2.3 Benchmark Development
    2.3.1 GO Rollback Benchmark
    2.3.2 Phenotypic RNAi Benchmark
    2.3.3 Fission Yeast Ageing Benchmark
  2.4 Preliminary Work
    2.4.1 Regression and Support Vector Machines
    2.4.2 Dimensionality Reduction Approaches
  2.5 Algorithms
    2.5.1 Compass Algorithm
    2.5.2 GeneMANIA algorithm
    2.5.3 Network Construction and Weighting
  2.6 Comparison to GeneMANIA
    2.6.1 Results Summary
    2.6.2 RNAi Benchmark
    2.6.3 GO Benchmark
    2.6.4 Ageing Benchmark
    2.6.5 GeneMANIA weighting scheme
  2.7 Detailed Investigation of Prediction
    2.7.1 Cross-Validation vs Rollback
    2.7.2 Effect of Gene Degree on Label Predictability
    2.7.3 Effect of Discovery Date on Label Predictability
    2.7.4 Effect of Degree on Discovery of New Labels
  2.8 Discussion
    2.8.1 Relative Performance of GeneMANIA and Compass
    2.8.2 Further Investigation of the Effects of Network Quality on Predictive Performance
    2.8.3 Cross-Validation and Parameter Selection
    2.8.4 Temporal Effects
    2.8.5 Problems with CAFA-Style Benchmarks
    2.8.6 Conclusion and Further Work
3 Identifying Genetic Interactions between Loss of Function Tolerant Genes
  3.1 Introduction
    3.1.1 Loss of Function Variation
    3.1.2 Challenges in LoF Variant Identification
    3.1.3 Interactions between Genes: Recessive LoF Variants and Epistasis
    3.1.4 Aims and Objectives
  3.2 LoF Data
  3.3 Identifying Pairwise Genetic Interactions
    3.3.1 Hypergeometric Model
    3.3.2 Confounding Factors and Model Refinement
    3.3.3 Pairwise Interactions: Results
    3.3.4 Interpretation and Evaluation of Putative Interactions
  3.4 Network Approaches to LoF pairs
    3.4.1 Introduction to Modularity
    3.4.2 Anti-Community Clustering
    3.4.3 Identification of Epistatic Communities from Co-Occurrence Data
    3.4.4 Evaluation of Partition Approaches
    3.4.5 Epistatic Communities
  3.5 Conclusion
4 Functional Association Networks For Prediction of Loss of Function Tolerance
  4.1 Introduction
  4.2 Datasets
  4.3 Network properties
    4.3.1 Protein Interaction Networks
    4.3.2 Genetic Interaction Networks
    4.3.3 Metabolic Networks
  4.4 Prediction Using Centrality
  4.5 Guilt-by-Association
  4.6 Integrated Prediction
  4.7 Discussion and Further Work
5 Network Approaches to Modelling the Stress Response in Fission Yeast
  5.1 Introduction
    5.1.1 Stress Response
    5.1.2 Studying Changing Networks
    5.1.3 Work Undertaken
  5.2 Methods
    5.2.1 Co-Expression Network Construction
    5.2.2 Protein Interaction Network Construction
    5.2.3 Network Modularity
  5.3 Stress Induced Changes to Network Structure
    5.3.1 Co-Expression Networks
    5.3.2 Protein Interaction Networks
  5.4 Biological Correlates of Network Change
    5.4.1 Principles of Enrichment Analysis
    5.4.2 Co-Expression Networks
    5.4.3 PPI Networks
    5.4.4 Summary of Enrichment Analyses
  5.5 Possible Extensions of this Work
  5.6 Conclusion
6 Network Resilience to Node Removal: Variability in Network Models and Co-Expression Networks
  6.1 Introduction
    6.1.1 Robustness and Stress
    6.1.2 Variability of Resilience
    6.1.3 Aims and Objectives
  6.2 Methods
    6.2.1 Network Models
    6.2.2 Stress Networks
    6.2.3 Resilience Measure
  6.3 Network Models
  6.4 Stress Networks
  6.5 Discussion and Conclusion
7 Discussion
  7.1 Protein Function Prediction
  7.2 Loss of Function Variation
  7.3 Stress Response
  7.4 Overall Conclusions

List of Figures

1.1 Examples of regular lattice, random and 'small-world' networks
1.2 Degree distributions in random and 'scale-free' networks
1.3 Illustration of a yeast two hybrid system
1.4 Illustration of tandem affinity purification
1.5 The number of physical protein-protein interactions in the BioGRID database
2.1 Illustration of the effect of information transfer across databases on benchmarking
2.2 Illustration of how a mapping into a different space can make patterns in data more detectable
2.3 The relative performance of different prediction approaches on.