Communities and Homology in Protein-Protein Interactions

Communities and Homology in Protein-protein Interactions Anna Lewis Balliol College University of Oxford A thesis submitted for the degree of Doctor of Philosophy Michaelmas 2011 2 3 This thesis is dedicated to the inspirational Jen Weterings. Wishing her well in her ongoing recovery. Acknowledgements With thanks to Charlotte, Nick and Mason, for being the best supervisors a girl could ask for. With thanks to all those who have helped me with the work in this thesis. In particular Sumeet Agarwal, Rebecca Hamer, Gesine Reinert, James Wakefield, Simon Myers, Steve Kelly, Nicholas Lewis, members of the Oxford Protein Informatics Group and members of the Oxford/Imperial Systems and Signalling group. With thanks to all those who have made my DPhil years a wonderful experience, in particular my family, my Balliol family, and my DTC family. Abstract Knowledge of protein sequences has exploded, but knowledge of protein function is needed to make use of sequence information, and this lags behind. A protein’s function must be understood in context and part of this is the network of interactions between proteins. What are the relationships between protein function and the structure of the interaction network? In the first part of my thesis, I investigate the functional relevance of clusters, or communities, of proteins in the yeast protein interaction network. Communities are candidates for biological modules. The work I present is the first to systematically investigate this structure at multiple scales in such networks. I develop novel tests to assess whether communities are functionally homogeneous, and demonstrate that almost every protein is found in a functionally homogeneous community at some scale. The evolution of protein sequences is well-studied, but comparatively little is known about the evolution of protein function. Such knowledge is needed to un- derstand when it is appropriate to annotate newly sequenced proteins by transferring functional information from homologs–i.e. evolutionarily related proteins. In the sec- ond part of my thesis, I assess the success of transferring protein-protein interactions across species and use this to estimate the rate at which interactions are lost in evolution. At levels of sequence similarity associated with functional annotation transfer, I demonstrate that protein-protein interaction transfer is unreliable. The relevance of community structure for understanding protein function and the low conservation of individual interactions, suggests a possible role for communities in the evolution of cellular function. I discuss this possibility in my conclusions. Contents 1 Introduction 1 1.1 Overview.................................. 1 1.1.1 Systems biology and protein-protein interactions . ...... 1 1.1.2 Communities ........................... 2 1.1.3 Homology ............................. 3 1.1.4 Communities and homology of protein-protein interactions .. 3 1.2 Protein-proteininteractiondata . ... 4 1.2.1 Proteins .............................. 4 1.2.2 What is a protein-protein interaction? . 5 1.2.3 Experimentalprotocols. 6 1.2.3.1 Yeast-two-hybrid screens . 6 1.2.3.2 Purification and identification of complexes . 7 1.2.3.3 Othermethods ..................... 9 1.2.3.4 Literature curation of small-scale experiments . .. 9 1.2.3.5 Databases........................ 10 1.2.4 Choiceofdataset......................... 11 1.2.5 Estimatesoferrorrates. 13 1.2.6 Consideredasanetwork . 15 1.3 Protein-protein interactions and function . ...... 17 1.3.1 Predictingproteinfunction. 17 i 1.3.2 Degreedistribution . 18 1.3.3 Motifs ............................... 20 1.3.4 Communities ........................... 21 1.4 Protein-protein interactions and evolution . ...... 22 1.4.1 Evolutionary claims in the literature . 22 1.4.2 Interactions effecting rate of evolution . .. 23 1.4.3 Co-evolution of interacting proteins . .. 24 1.4.4 Models for protein-protein interaction network evolution ... 27 1.4.5 Between-species comparisons . 31 1.4.5.1 Interologs ........................ 31 1.4.5.2 PINalignment ..................... 35 1.5 Protein-protein interactions, function, and evolution.......... 35 2 Tools and techniques 38 2.1 Communitydetection. .. .. 38 2.1.1 Introduction............................ 38 2.1.2 Choosingamethod........................ 39 2.1.3 Modularitymaximisation. 40 2.1.3.1 Resolutionlimit. 42 2.1.3.2 A multi-resolution generalisation . 42 2.1.3.3 Algorithms for maximising modularity . 43 2.1.4 Other multi-resolution approaches . 44 2.1.4.1 Traditional clustering techniques . 44 2.1.4.2 k-cliquepercolation. 44 2.1.4.3 MCL........................... 45 2.1.4.4 Hierarchicalrandomgraphs . 45 2.1.4.5 Dynamical communities . 46 2.1.5 What next after detecting communities? . 46 ii 2.2 Comparingpartitions........................... 49 2.3 Assessing predictive ability . .. 53 2.4 Similarity measures based on structured ontologies . ....... 55 3 The Function of Communities in Protein-Protein Interaction Net- works at Multiple Scales 60 3.1 Introduction................................ 60 3.2 Datasetsanddataprocessing . 62 3.2.1 Protein-protein interaction data sets . .. 62 3.2.2 GO................................. 64 3.2.3 MIPS ............................... 65 3.2.4 Chemoinformatics screen of growth rates data . .. 65 3.3 Pairwisepropertiesofproteins . .. 66 3.4 Communities ............................... 67 3.4.1 Robustnessofpartitionsfound. 70 3.5 Examples of communities found at multiple resolutions . ....... 72 3.6 The standard assessment of biological relevance: functional enrichment 75 3.7 Functional homogeneity of communities . ... 76 3.7.1 Beyondpairwisemeasures . 83 3.8 Use of topological properties to select functionally homogeneous communities .................................. 85 3.9 Tracing the community membership of a particular protein...... 92 3.10 Investigating particular protein functions . ........ 94 3.10.1 Interactions and functional classes . ... 96 3.10.2 Distribution of functional classes in communities . ....... 98 3.10.3 Co-occurrence of functional classes in communities ...... 102 3.10.4 Functionally homogeneous communities and functional classes 103 3.11Conclusions ................................ 103 iii 4 What evidence is there for the homology of protein-protein interactions? 109 4.1 Introduction................................ 109 4.2 Datasetsanddataprocessing . 111 4.2.1 Protein-proteininteractiondata . 111 4.2.2 Homologydata .......................... 113 4.3 Interactions conserved across species . .... 118 4.3.1 Theevidence ........................... 118 4.3.2 Errorsintheinteractomedata. 128 4.3.2.1 Falsepositives . 130 4.3.2.2 Coverage of the source species interactions . 133 4.3.2.3 Coverage of the target species interactions . 133 4.3.3 Probability per million years that a duplicated interaction is lost141 4.3.4 Can one select the conserved interactions? . 143 4.4 Interactions conserved within species . .... 152 4.5 Conclusions ................................ 156 5 Conclusions and future directions 159 5.1 Communities ............................... 159 5.2 Homology ................................. 161 5.3 Communities and homology of protein-protein interactions ...... 162 A Methods for predicting protein interactions 165 A.1 Assessing prediction methods . 166 A.2 Network-basedmethods . 167 A.3 Genome-basedmethods. 168 A.3.1 Geneneighbourhood . 168 A.3.2 Genefusion ............................ 169 iv A.3.3 Phylogeneticprofile. 169 A.4 Sequence-based methods . 170 A.4.1 Interologs ............................. 170 A.4.2 Phylogenetictreemethods . 170 A.5 Structure-basedmethods . 170 A.6 Domain-basedmethods. 171 A.7 Summary ................................. 173 B Examples of community membership 175 C Examples of community membership following individual proteins 198 D Alternative sequence similarity scores for inferring interactions 207 Bibliography 213 v List of Figures 1.1 Growth of the IntAct protein-protein interaction database ...... 12 1.2 PINevolutionmodels. .......................... 28 1.3 ConceptofanInterolog.......................... 33 2.1 StructureoftheGeneOntology. 57 3.1 Communities identified in the A and P Networks ........... 68 3.2 Community structure of an Erd˝os-Rényi random network and a syn- thetic network with strong community structure . .. 69 3.3 Robustness of community detection algorithm . .... 71 3.4 Examples of communities found . 73 3.5 Literature standard test for functional enrichment of communities . 77 3.6 Functional homogeneity of communities found . .... 79 3.7 Correlation of similarity measures . ... 81 3.8 Correlation of normalised and un-normalised GO similarity measures 82 3.9 Functional homogeneity of communities found: chains . ....... 86 3.10 Correlation of similarity measures: chains . ....... 87 3.11 Correlation of similarity measures: chains-based test compared to interactors- basedtest ................................. 88 3.12 Using topological properties to pick out functionally homogeneous communities .................................. 91 vi 3.13 Tracing the community membership of a particular protein: example 94 3.14 Tracing the community membership of a particular protein: further examples.................................. 95 3.15 GOslimandinteractingproteins . 99 3.16 The distribution of four GO slim terms throughout communities . 101 3.17 The fraction of interactions connecting proteins of different types found withincommunities. .. .. 104 3.18 Fraction of proteins of particular types in functionally homogeneous

Communities and Homology in Protein-Protein Interactions

(PWP2) Gene Mapping to 21Q22.3 Ronald G

Assembly and Annotation of an Ashkenazi Human Reference Genome

Simplified Computational Model for Generating Bio- Logical Networks

Identification and Characterization of TPRKB Dependency in TP53 Deficient Cancers

Prospects & Overviews Orthology Prediction Methods: a Quality Assessment Using Curated Protein Families

Supporting Information

WO 2019/079361 Al 25 April 2019 (25.04.2019) W 1P O PCT

Biological Network Approaches and Applications in Rare Disease Studies

Spatial Protein Interaction Networks of the Intrinsically Disordered Transcription Factor C(%3$

Complementary Symbiont Contributions to Plant Decomposition in a Fungus-Farming Termite

Peregrine and Saker Falcon Genome Sequences Provide Insights Into Evolution of a Predatory Lifestyle

De Novo Genome Assembly and Analysis of Non- Allelic Recombination in Pathogenic Yeast Candida Glabrata