Computational Prediction of Protein Complexes from Protein Interaction Networks
Sriganesh Srihari The University of Queensland Institute for Molecular Bioscience Chern Han Yong Duke-National University of Singapore Medical School Limsoon Wong National University of Singapore
ACM Books #16 Copyright © 2017 by the Association for Computing Machinery and Morgan & Claypool Publishers
Computational Prediction of Protein Complexes from Protein Interaction Networks Sriganesh Srihari, Chern Han Yong, Limsoon Wong books.acm.org www.morganclaypoolpublishers.com
ISBN: 978-1-97000-155-6 hardcover ISBN: 978-1-97000-152-5 paperback ISBN: 978-1-97000-153-2 ebook ISBN: 978-1-97000-154-9 ePub Series ISSN: 2374-6769 print 2374-6777 electronic DOIs: 10.1145/3064650 Book 10.1145/3064650.3064656 Chapter 5 10.1145/3064650.3064651 Preface 10.1145/3064650.3064657 Chapter 6 10.1145/3064650.3064652 Chapter 1 10.1145/3064650.3064658 Chapter 7 10.1145/3064650.3064653 Chapter 2 10.1145/3064650.3064659 Chapter 8 10.1145/3064650.3064654 Chapter 3 10.1145/3064650.3064660 Chapter 9 10.1145/3064650.3064655 Chapter 4 10.1145/3064650.3064661 References, Bios
A publication in the ACM Books series, #16 Editor in Chief: M. Tamer Ozsu,¨ University of Waterloo
First Edition 10987654321 Contents
Preface xi
Chapter 1 Introduction to Protein Complex Prediction 1 1.1 From Protein Interactions to Protein Complexes 6 1.2 Databases for Protein Complexes 11 1.3 Organization of the Rest of the Book 13
Chapter 2 Constructing Reliable Protein-Protein Interaction (PPI) Networks 15 2.1 High-Throughput Experimental Systems to Infer PPIs 15 2.2 Data Sources for PPIs 23 2.3 Topological Properties of PPI Networks 25 2.4 Theoretical Models for PPI Networks 27 2.5 Visualizing PPI Networks 31 2.6 Building High-Confidence PPI Networks 34 2.7 Enhancing PPI Networks by Integrating Functional Interactions 50
Chapter 3 Computational Methods for Protein Complex Prediction from PPI Networks 59 3.1 Basic Definitions and Terminologies 60 3.2 Taxonomy of Methods for Protein Complex Prediction 60 3.3 Methods Based Solely on PPI Network Clustering 61 3.4 Methods Incorporating Core-Attachment Structure 78 3.5 Methods Incorporating Functional Information 85
Chapter 4 Evaluating Protein Complex Prediction Methods 91 4.1 Evaluation Criteria and Methodology 91 4.2 Evaluation on Unweighted Yeast PPI Networks 93 4.3 Evaluation on Weighted Yeast PPI Networks 95 4.4 Evaluation on Human PPI Networks 99 x Contents
4.5 Case Study: Prediction of the Human Mechanistic Target of Rapamycin Complex 103 4.6 Take-Home Lessons from Evaluating Prediction Methods 105
Chapter 5 Open Challenges in Protein Complex Prediction 107 5.1 Three Main Challenges in Protein Complex Prediction 107 5.2 Identifying Sparse Protein Complexes 112 5.3 Identifying Overlapping Protein Complexes 118 5.4 Identifying Small Protein Complexes 124 5.5 Identifying Protein Sub-complexes 134 5.6 An Integrated System for Identifying Challenging Protein Complexes 136 5.7 Recent Methods for Protein Complex Prediction 138 5.8 Identifying Membrane-Protein Complexes 141
Chapter 6 Identifying Dynamic Protein Complexes 145 6.1 Dynamism of Protein Interactions and Protein Complexes 145 6.2 Identifying Temporal Protein Complexes 149 6.3 Intrinsic Disorder in Proteins 156 6.4 Intrinsic Disorder in Protein Interactions and Protein Complexes 157 6.5 Identifying Fuzzy Protein Complexes 162
Chapter 7 Identifying Evolutionarily Conserved Protein Complexes 165 7.1 Inferring Evolutionarily Conserved PPIs (Interologs) 166 7.2 Identifying Conserved Complexes from Interolog Networks, I 170 7.3 Identifying Conserved Complexes from Interolog Networks, II 178
Chapter 8 Protein Complex Prediction in the Era of Systems Biology 185 8.1 Constructing the Network of Protein Complexes 185 8.2 Identifying Protein Complexes Across Phenotypes 189 8.3 Identifying Protein Complexes in Diseases 192 8.4 Enhancing Quantitative Proteomics Using PPI Networks and Protein Complexes 208 8.5 Systems Biology Tools and Databases to Analyze Biomolecular Networks 221
Chapter 9 Conclusion 225
References 233 Authors’ Biographies 279 Preface
The suggestion and motivation to write this book came from Limsoon, who thought that it would be a great idea to compile our (Sriganesh’s and Chern Han’s) Ph.D. research conducted at National University of Singapore on protein complex predic- tion from protein-protein interaction (PPI) networks into a comprehensive book for the research community. Since we (Sriganesh and Chern Han) completed our Ph.D.s not long ago, the timing could not have been better for writing this book while the topic is still fresh in our minds and the empirical set up (datasets and software pipelines) for evaluating the methods is still in a “quick-to-run” form. However, although we had our Ph.D. theses to our convenient disposal and ref- erence, it is only after we started writing this book that we realized the real scale of the task that we had embarked upon. The problem of protein complex prediction may be just one of the plethora of computational problems that have opened up since the deluge of proteomics (protein-protein interaction; PPI) data over the last several years. However, in re- ality this problem encompasses or directly relates to several important and open problems in the area—in particular, the fundamental problems of modeling, vi- sualizing, and denoising of PPI networks, prediction of PPIs (novel as well as evo- lutionarily conserved), and protein function prediction from PPI data. Therefore, to write a comprehensive self-contained book, we had to cover even these closely related problems to some extent or at least allude to or reference them in the book. We had to do so without missing the connection between these problems and our central problem of protein complex prediction in the book. The early tone to write the book in this manner was set by our review article in a 2015 special issue of FEBS Letters, where we covered a number of protein com- plex prediction methods which are based on a diverse range of topological, func- tional, temporal, structural, and evolutionary information. However, being only a single-volume article, the description of the methods was brief, and to compile xii Preface
this description in the form of a book we had to delve a lot deeper into the algo- rithmic underpinnings of each of the methods, highlight how each method utilized the information (topological, functional, temporal, structural, and evolutionary) on which it was based in its own unique way, and evaluate and study the applications of the methods across a diverse range of datasets and scenarios. To do this well, we had to: (i) cover in substantial detail the preliminaries such as the experimen- tal techniques available to infer PPIs, the limitations of each of these techniques, PPI network topology, modeling, and denoising, PPI databases that are available, and how functional, temporal, structural, and evolutionary information of proteins can be integrated with PPI networks; and (ii) we had to categorize protein complex prediction methods into logical groups based on some criteria, and dedicate a sep- arate chapter for each group to make our description comprehensive. In the book, we cover (i) in Chapter 2 and in the form of independent sections within each of the other Chapters 3, 5, 6, 7, and 8. We cover (ii) by allocating Chapters 3 and 4 for “classical” methods and their comprehensive evaluation, Chapter 5 for methods that predict certain kinds of “challenging complexes” which the classical meth- ods do not predict well, Chapter 6 for methods that utilize temporal and structural information, Chapter 7 for methods that utilize information on evolutionary con- servation, and Chapter 8 for methods that integrate other kinds of omics datasets to predict “specialized” complexes—e.g., protein complexes in diseases. The requirement for a book exclusively dedicated to the problem of protein complex prediction from PPI networks at this point in time cannot be understated. Over the last two decades, a major focus of high-throughput experimental tech- nologies and of computational methods to analyze the generated data has been in genomics—e.g., in the analysis of genome sequencing data. It is relatively re- cently that this focus has started to shift toward proteomics and computational methods to analyze proteomics data. For example, while the complete sequence of the human genome was assembled more than a decade ago, it is only over the last three years that there have been similar large-scale efforts to map the human proteome. The ProteomicsDB (http://www.proteomicsdb.org/), Human Proteome Map (http://humanproteomemap.org/), and the Human Protein Atlas projects (http://www.proteinatlas.org/) have all appeared only over the last three years. Sim- ilarly is The Cancer Proteome Atlas (TCPA) project (http://app1.bioinformatics .mdanderson.org/tcpa/_design/basic/index.html) which complements The Cancer Genome Atlas project (TCGA). This means that developing more effective solutions for fundamental problems such as protein complex prediction has become all the more important today, as we try to apply these solutions to larger and more com- plex datasets arising from these newer technologies and projects. In this respect, Preface xiii we had to write the book not just by considering protein complex prediction meth- ods (i.e., their algorithmic details) as important in their own right but also by giving significant importance to the applications of these methods in the light of today’s complex datasets and research questions. There are several sections within each of the Chapters 6, 7, and 8 that play this dual role, e.g., a section in Chapter 7 discusses the evolutionary conservation of core cellular processes based on conservation pat- terns of protein complexes, and a section in Chapter 8 discusses the dysregulation of these processes in diseases based on rewiring of protein complexes between normal and disease conditions. In the end, we hope that we have done justice to what we intend this book to be. We hope that this book provides valuable insights into protein complex prediction and inspires further research in the area especially for tackling the open challenges, as well as inspires new applications in diverse areas of biomedicine.
Acknowledgments Although this book is primarily concerned with the problem of protein complex prediction, the book also covers several other aspects of PPI networks. We would like to therefore dedicate this book to the students—Honors, Masters, and Ph.D. students—who worked on these different aspects of PPI networks by being part of the computational biology group at the Department of Computer Science, National University of Singapore, over the years. Several of the methods covered in this book are a result of the extensive research conducted by these students. Sriganesh would like to thank Hon Wai Leong (Professor of Computer Science, National University of Singapore) under whom he conducted his Ph.D. research on protein complex prediction; Mark Ragan (Head of Division of Genomics of Development and Disease at Institute for Molecular Bioscience, The University of Queensland) under whom he conducted his postdoctoral research, a substantial portion of which was on identifying protein complexes in diseases; and Kum Kum Khanna (Senior Principal Research Fellow and Group Leader at QIMR-Berghofer Medical Research Institute) whose guidance played a significant part in his understanding of biological aspects of protein complexes. Sriganesh is grateful to Mark for passing him an original copy of a 1977 volume of Progress in Biophysics and Molecular Biology in which G. Rickey Welch makes a consistent principled argument that “multienzyme clusters” are advantageous to the cell and organism because they enable metabolites to be channeled within the clusters and protein expression to be co-regulated [Welch 1977]—a possession which Sriganesh will deeply cherish. Chern Han would like to thank his coauthors: Sriganesh for doing the heavy lifting in writing, editing, and xiv Preface
driving this project and Limsoon Wong for guiding him through his Ph.D. journey on protein complexes. He would also like to acknowledge the support of Bin Tean Teh (Professor with Program in Cancer and Stem Cell Biology, Duke-NUS Medical School), who currently oversees his postdoctoral research. Limsoon would like to acknowledge Chern Han and Sriganesh for doing the bulk of the writing for this book, and especially thank Sriganesh for taking the overall lead on the project. When he suggested the book to Chern Han and Sriganesh, he had not imagined that he would eventually be a co-author. We are indebted also to the Editor-in-Chief of ACM Books, Tamer Ozsu,¨ Exec- utive Editor Diane Cerra, Production Manager Paul C. Anagnostopoulos, and the entire team at ACM Books and Morgan & Claypool Publishers for their encourage- ment and for producing this book so beautifully.
Sriganesh Srihari Chern Han Yong Limsoon Wong May 2017 Introduction to Protein Complex Prediction
Unfortunately, the proteome is much more complicated than the genome. —Carol Ezzell [Ezzel et al. 2002]
In an early survey, American biochemist Bruce Alberts termed large assemblies of proteins as protein machines of cells [Alberts et al. 1998]. Protein assemblies are composed of highly specialized parts that coordinate to execute almost all of the biochemical, signaling, and functional processes in cells [Alberts et al. 1998]. It is not hard to see why protein assemblies are more advantageous to cells than 1individual proteins working in an uncoordinated manner. Compare, for example, the speed and elegance of the DNA replication machinery that simultaneously replicates both strands of the DNA double helix with what could ensue if each of the individual components—DNA helicases for separating the double-stranded DNA into single stands, DNA polymerases for assembling nucleotides, DNA primase for generating the primers, and the sliding clamp to hold these enzymes onto the DNA—acted in an uncoordinated manner [Alberts et al. 1998]. Although what might seem like individual parts brought together to perform arbitrary functions, protein assemblies can be very specific and enormously complicated. For example, the spliceosome is composed of 5 small nuclear RNAs (snRNAs or “snurps”) and more than 50 proteins, and is thought to catalyze an ordered sequence of more than 10 RNA rearrangements at a time as it removes an intron from an RNA transcript [Alberts et al. 1998, Baker et al. 1998]. The discovery of this intron-splicing process won Phillip A. Sharp and Richard J. Roberts the 1993 Nobel Prize in Physiology or Medicine.1
1. http://www.nobelprize.org/nobel_prizes/medicine/laureates/1993/ 2 Chapter 1 Introduction to Protein Complex Prediction
Protein assemblies are known to be in the order of hundreds even in the simplest of eukaryotic cells. For example, more than 400 protein assemblies have been iden- tified in the single-celled eukaryote Saccharomyces cerevisiae (budding yeast) [Pu et al. 2009]. However, our knowledge of these protein assemblies is still fragmentary, as is our conception of how each of these assemblies work together to constitute the “higher level” functional architecture of cells. A faithful attempt toward identifica- tion and characterization of all protein assemblies is therefore crucial to elucidate the functioning of the cellular machinery. To identify the entire complement of protein assemblies, it is important to first crack the proteome—a concept so novel that the word “proteome” first ap- peared only around 20 years ago [Wilkins et al. 1996, Bryson 2003, Cox and Mann 2007]. The proteome, as defined in the UniProt Knowledgebase, is the entire complement of proteins expressed or derived from protein-coding genes in an organism [Bairoch and Apweiler 1996, UniProt 2015]. With the introduction of high-throughput experimental (proteomics) techniques including mass spectro- metric [Cox and Mann 2007, Aebersold and Mann 2003] and protein quantitative trait locus (QTL) technologies [Foss et al. 2007], mapping of proteins on a large scale has become feasible. Just like how genomics techniques (including genome sequencing) were first demonstrated in model organisms, proteome-mapping has progressed initially and most rapidly for model prokaryotes including Escherichia coli (bacteria) and model eukaryotes including Saccharomyces cerevisiae (budding or baker’s yeast), Drosophila melanogaster (fruit fly), Caenorhabditis elegans (a ne- matode), and Arabidopsis thaliana (a flowering plant). Table 1.1 summarizes the numbers of proteins or protein-coding genes identified from these organisms. Of these, the proportions of protein-coding genes that are essential (genes that are thought to be critical for the survival of the cell or organism; “fitness genes”) range from ∼2% in Drosophila to ∼6.5% in Caenorhabditis and ∼18% in Saccha- romyces [Cherry et al. 2012, Chen et al. 2012]. Recent landmark studies using large-scale proteomics [Wilhelm et al. 2014, Kim et al. 2014, Uhl´en et al. 2010, Uhl´en et al. 2015]onHomo sapiens (human) cells have characterized >17,000 (or >90%) putative protein-coding genes from ≥40 tissues and organs in the human body. An encyclopedic resource on these proteins covering their levels of expres- sion and abundance in different human tissues is available from the ProteomicsDB (http://www.proteomicsdb.org/)[Wilhelm et al. 2014], The Human Proteome Map (http://humanproteomemap.org/)[ Kim et al. 2014], and The Human Protein At- las (http://www.proteinatlas.org/)[Uhl´en et al. 2010, Uhl´en et al. 2015] projects. GeneCards (http://www.genecards.org/)[Safran et al. 2002, Safran et al. 2010] ag- gregates information on human protein-coding genes from >125 Web sources Chapter 1 Introduction to Protein Complex Prediction 3 Kim et al. 2014 , en et al. 2015 ] ´ Ishii et al. 2007 ] Picotti et al. 2013 ] Rhee et al. 2003 ] McDowall et al. 2015 ] Uhl (Norwegian rat), en et al. 2010 , ´ [ Huala et al. 2001 , [ Stein et al. 2001 ] [ Sprague et al. 2001 ] [ The Flybase Consortium 1996 ] [ Keseler et al. 2009 , [ Bult et al. 2010 ] [ Wilhelm et al. 2014 , [ Shimoyama et al. 2015 ] [ Cherry et al. 2012 , [ Wood et al. 2012 , [ Karpinka et al. 2015 ] Uhl Rattus norvegicus (African clawed frog) (house mouse), Xenopus laevis Mus musculus http://www.arabidopsis.org/ http://www.wormbase.org/ http://zfin.org/ http://flybase.org/ http://ecocyc.org/ , http://ecoli.iab.keio.ac.jp/ http://www.informatics.jax.org/ http://humanproteomemap.org/ , , http://www.proteomicsdb.org/ http://www.proteinatlas.org/ http://rgd.mcw.edu/ http://www.yeastgenome.org/ http://www.pombase.org/ http://www.xenbase.org/ (fission yeast), and (Zebrafish), No. of Proteins/ > 27,400 ∼ 20,500 > 26,000 ∼ 17,700 ∼ 4,300 ∼ 23,200 > 17,000 ∼ 29,600 ∼ 6,500 ∼ 5,100 ∼ 4,700 Danio rerio Organism Protein-Coding Genes Source References D. rerio E. coli H. sapiens S. cerevisiae X. laevis A. thaliana C. elegans D. melanogaster M. musculus R. norvegicus S. pombe Examples of proteome resourcescovering for also some model and higher-order organisms (as of December 2015), Schizosaccharomyces pombe Table 1.1 4 Chapter 1 Introduction to Protein Complex Prediction
and presents the information in an integrative user-friendly manner. The expres- sion levels of nearly 200 proteins that are essential for driving different human cancers are available from The Cancer Proteome Atlas (TCPA) project (http://app1 .bioinformatics.mdanderson.org/tcpa/_design/basic/index.html)[Li et al. 2013], measured from more than 3,000 tissue samples across 11 cancer types studied as part of The Cancer Genome Atlas (TCGA) project (http://cancergenome.nih .gov/). Short-hairpin RNA (shRNA)-mediated knockdown [Paddison et al. 2002, Lambeth and Smith 2013], clustered regularly interspaced short palindromic re- peats (CRISPR)/Cas9-based gene editing [Sanjana et al. 2014, Baltimore et al. 2015, Shalem et al. 2015], and disruptive mutagenesis [Bokel¨ 2008] screening using MCF-10A (near-normal mammary), MDA-MB-435 (breast cancer), KBM7 (chronic myeloid leukemia), HAP1 (haploid), A375 (melanoma), HCT116 (colorectal cancer), and HUES62 (human embryonic stem) cells have characterized 1,500–1,880 (or 8– 10%) “core” protein-coding genes as essential in human cells [Marcotte et al. 2016, Silva et al. 2008, Wang et al. 2014, Hart et al. 2015, Hart et al. 2014, Wang et al. 2015, Blomen et al. 2015]. Comparative analyses of proteomes from different species have revealed inter- esting insights into the evolution and conservation of proteins. For example, it is es- timated that the genomes (proteomes) of human and budding yeast diverged about 1 billion years ago from a common ancestor [Douzery et al. 2014], and these share several thousand genes accounting for more than one-third of the yeast genome [O’Brien et al. 2005, Ostlund¨ et al. 2010]. Yeast and human orthologs are highly diverged; the amino-acid sequence similarity between human and yeast proteins ranges from 9–92%, with a genome-wide average of 32%. But, sequence similarity predicts only a part of the picture [Sun et al. 2016]. Recent studies [Kachroo et al. 2015, Laurent et al. 2015] have reported that 414 (or nearly half of the) essential protein-coding genes in yeast could be “replaced” by human genes, with replace- ability depending on gene (protein) assemblies: genes in the same process tend to be similarly replaceable (e.g., sterol biosynthesis) or not replaceable (e.g., DNA replication initiation). Irrespective of whether in a lower-order model or a higher-order complex or- ganism, a protein has to physically interact with other proteins and biomolecules to remain functional. Estimates in human suggest that over 80% of proteins do not function alone, but instead interact to function as macromolecular assemblies [Bergg´ard et al. 2007]. This organization of individual proteins into assemblies is tightly regulated in cellular space and time, and is supported by protein conforma- tional changes, posttranslational modifications, and competitive binding [Gibson and Goldberg 2009]. On the basis of the stability (area of interaction surface and du- Chapter 1 Introduction to Protein Complex Prediction 5 ration of interaction) and partner specificity, the interactions between proteins are classified as homo- or hetero-oligomeric, obligate or non-obligate, and permanent or transient [Zhang 2009, Nooren and Thornton 2003]. Proteins in obligate inter- actions cannot exist as stable structures on their own and are frequently bound to their partners upon translation and folding, whereas proteins in non-obligate in- teractions can exist as stable structures in bound and unbound states. Obligate interactions are generally permanent or constitutive, which once formed exist for the entire lifetime of the proteins, whereas non-obligate interactions may be per- manent, or alternatively transient, wherein the protein interacts with its partners for a brief time period and dissociates after that. Depending on the functional, spatial, and temporal context of the interactions, protein assemblies are classified as pro- tein complexes, functional modules, and biochemical (metabolic) and signaling pathways. Protein complexes are the most basic forms of protein assemblies and consti- tute fundamental functional units within cells. Complexes are stoichiometrically stable structures and are formed from physical interactions between proteins com- ing together at a specific time and space. Complexes are responsible for a wide range of functions within cells including formation of cytoskeleton, transporta- tion of cargo, metabolism of substrates for the production of energy, replication of DNA, protection and maintenance of the genome, transcription and translation of genes to gene products, maintenance of protein turn over, and protection of cells from internal and external damaging agents. Complexes can be permanent—i.e., once assembled can function for the entire lifetime of cells (e.g., ribosomes)—or transient—i.e., assembled temporarily to perform a specific function and are disas- sembled after that (e.g., cell-cycle kinase-substrate complexes formed in a cell-cycle dependent manner). Functional modules are formed when two or more protein complexes interact with each other and often other biomolecules (viz. nucleic acids, sugars, lipids, small molecules, and individual proteins) at a specific time and space to perform a particular function and disassociate after that. This molecular organization has been termed “protein sociology” [Robinson et al. 2007]. For example, the DNA repli- cation machinery, highlighted earlier, is formed by a tightly coordinated assembly of DNA polymerases, DNA helicase, DNA primase, the sliding clamp and other com- plexes within the nucleus to ensure error-free replication of the DNA during cell division. Pathways are formed when sets of complexes and individual proteins inter- act via an ordered sequence of interactions to transduce signals (signaling path- ways) or metabolize substrates from one form to another (metabolic pathways). For 6 Chapter 1 Introduction to Protein Complex Prediction
example, the MAPK pathway is composed of a sequence of microtubule-associated protein kinases (MAPKs) that transduce signals from the cell membrane to the nucleus, to induce the transcription of specific genes within the nucleus. Unlike complexes and functional modules, pathways do not require all components to co-localize in time and space.
From Protein Interactions to Protein Complexes 1.1 Physical interactions between proteins are fundamental to the formation of protein complexes. Therefore, mapping the entire complement of protein interactions (the “interactome”) occurring within cells (in vivo) is crucial for identifying and charac- terizing complexes. However, inferring all interactions occurring during the entire lifetime of cells in an organism is challenging, and this challenge increases multi- fold as the complexity of the organism increases—e.g., for multicellular organisms made up of multiple cell types. The development of high-throughput proteomics technologies including yeast two-hybrid- (Y2H) [Fields and Song 1989], co-immunoprecipitation (Co-IP) [Golemis and Adams 2002] and affinity-purification (AP)-based [Rigaut et al. 1999] screens have revolutionized our ability to interrogate protein interactions on a massive scale, and have enabled global surveys of interactomes from a number of organisms. In particular, up to 70% of the interactions from model organisms in- cluding yeast [Ito et al. 2000, Uetz et al. 2000, Ho et al. 2002, Gavin et al. 2002, Gavin et al. 2006, Krogan et al. 2006], fly [Guruharsha et al. 2011], and nematode [Butland et al. 2005, Li et al. 2004] have been mapped, and the identification of interactions from higher-order multicellular organisms including species of flowering plant Arabidopsis, fish Danio (zebrafish), and several mammals—Mus musculus (house mouse), Rattus norvegicus (Norwegian rat), and humans—is rapidly underway; the interactions are cataloged in large public databases [Stark et al. 2011, Rolland et al. 2014]. The earliest and most widely used experimental techniques to capture binary interacting proteins on a high-throughput scale were mostly yeast two-hybrid (Y2H) [Fields and Song 1989]. However, datasets of protein interactions inferred from Y2H screens were found to have significant numbers of spurious interactions [Von Mering et al. 2002, Bader and Hogue 2002, Bader et al. 2004]. This is attributed in part to the nature of the Y2H protocol in which all potential interactors are tested within the same compartment (nucleus) even though some of these do not meet during their lifetimes due to compartmentalization (different subcellular localizations) within living cells. 1.1 From Protein Interactions to Protein Complexes 7
Co-immunoprecipitation or affinity-purification (Co-IP/AP) techniques were in- troduced later and these are more specific in detecting interactions between co- complexed proteins [Golemis and Adams 2002, Rigaut et al. 1999, Kocher¨ and Superti-Furga 2007]. In these protocols, cohesive groups or complexes of pro- teins are “pulled down,” from which the binary interactions between the proteins are individually inferred. However, this indirect inference could lead to over- or under-estimation of protein interactions. In the tandem affinity purification (TAP) procedure [Rigaut et al. 1999, Puig et al. 2001], proteins of interest (“baits”) are TAP-tagged and purified in an affinity column with potential interaction part- ners (“preys”). The pulled-down complexes are subjected to mass spectrometric (MS) analysis to identify individual components within the complexes. However, although more reliable than Y2H, the TAP/MS procedure can be elaborate and with the inclusion of MS, it can be expensive too. The exhaustiveness of TAP/MS depends on the baits used—there is no way to identify all possible complexes unless all possible baits are tested. Proteins which do not interact directly with the chosen bait but interact with one or more of the preys, might also get pulled down as part of the purified complex. In some cases, these proteins are indeed part of the real complex whereas in other cases these proteins are not (i.e., they are contaminants); therefore multiple purifications are required, possibly with each protein as a bait and as a prey, to identify the correct set of proteins within the complex. The TAP procedure therefore offers two successive affinity purifi- cations so that the chance of retained contaminants reduces significantly. Con- versely, a chosen bait might form a real complex with a set of proteins without actually interacting directly with every protein from the set, and therefore some proteins might not get pulled down as part of the purified complex. In these cases, multiple baits would need to be tested to assemble the complete complex. Moreover, since some proteins participate in more than one complex, multiple independent purifications are required to identify all hosting complexes for these proteins. Binary interactions between the proteins in a pulled-down protein complex are inferred using two models: matrix and spoke. In the matrix model, a binary interaction is inferred between every pair of proteins within the complex, whereas in the spoke model interactions are inferred only between the bait and all its preys. Since all pairs of proteins within a complex do not necessarily interact, the matrix model is usually an overestimation of the total number of binary interactions, whereas the spoke model is an underestimation. Therefore, usually a balance is struck between the two models that is close enough to the estimated total number of interactions for the species or organism. 8 Chapter 1 Introduction to Protein Complex Prediction
Table 1.2 Numbers of mapped physical interactions between proteins across different model and higher-order organisms
Organism No. of Interactions No. of Proteins A. thaliana 34,320 9,240 C. elegans 5,783 3,269 D. rerio 188 181 D. melanogaster 36,741 8,071 E. coli 99 104 H. sapiens 230,843 20,006 M. musculus 18,465 8,611 R. norvegicus 4,537 3,328 S. cerevisiae 82,327 6,278 S. pombe 9,492 2,944 X. laevis 532 471
Based on BioGrid version 3.4.130 (November 2015) [Stark et al. 2011, Chatr-Aryamontri et al. 2015].
Despite differences in procedures and technologies, the use of different experi- mental protocols can effectively complement one another in detecting interactions. While TAP can be more specific and detect mainly stable (co-complexed) protein interactions, Y2H can be more exhaustive and detect even transient and between- complex interactions. Based on BioGrid version 3.4.130 (November 2015) (http:/ /thebiogrid.org/)[Stark et al. 2011, Chatr-Aryamontri et al. 2015], the numbers of mapped physical interactions range from 99 in E. coli to ∼82,300 in S. cerevisiae and ∼230,900 in H. sapiens (summarized in Table 1.2). It remains to be seen how many of these interactions actually occur in the physiological contexts of living cells or cell types, how many are subject to genetic and physiological variations, and how many still remain to be mapped. The binary interactions inferred from the different experiments are assembled into a protein-protein interaction network, or simply, PPI network. The PPI network presents a global or “systems” view of the interactome, and provides a mathemati- cal (topological) framework to analyze these interactions. Protein complexes are ex- pected to be embedded as modular structures within the PPI network [Hartwell et al. 1999, Spirin and Mirny 2003]. Topologically, this modularity refers to densely con- nected subsets of proteins separated by less-dense regions in the network [Newman 1.1 From Protein Interactions to Protein Complexes 9
2004, Newman 2010]. Biologically, this modularity represents division of labor among the complexes, and provides robustness against disruptions to the network from internal (e.g., mutations) and external (e.g., chemical attacks) agents. Com- putational methods developed to identify protein complexes therefore mine for modular subnetworks in the PPI network. While this strategy appears reasonable in general, limitations in PPI datasets, arising due to the shortcomings highlighted above in experimental protocols, severely restrict the feasibility of accurately pre- dicting complexes from the network. Specifically, the limitations in existing PPI datasets that directly impact protein complex prediction include:
1. presence of a large number of spurious (noisy) interactions; 2. relative paucity of interactions between “complexed” proteins; and 3. missing contextual—e.g., temporal and spatial—information about the in- teractions.
These limitations translate to the following three main challenges currently faced by computational methods for protein complex prediction:
1. difficulty in detecting sparse complexes; 2. difficulty in detecting small (containing fewer than four proteins) and sub- complexes; and 3. difficulty in deconvoluting overlapping complexes (i.e., complexes that share many proteins), especially when these complexes occur under different cel- lular contexts.
While the interactome coverage can be improved by integrating multiple PPI datasets, the lack of agreement between the datasets from different experimental protocols [Von Mering et al. 2002, Bader et al. 2004], and the multifold increase in accompanying noise (spurious interactions), tend to cancel out the advantage gained from the increased coverage. Consequently, the confidence of each interac- tion has to be assessed (confidence scoring) and low-confidence interactions have to be first removed from the datasets (filtering) before performing any downstream analysis. To summarize, computational identification of protein complexes from interaction datasets follows these steps (Figure 1.1):
1. integrating interactions from multiple experiments and stringently assess- ing the confidence (reliability) of these interactions; 2. constructing a reliable PPI network using only the high-confidence interac- tions; Anaphase promoting complex 26S proteasome Potentially novel
20S proteasome Nuclear pore complex Potentially novel
NFKB1-NFKB2-RELA-RELB complex BAF complex Potentially novel (a) (b) Fanconi anemia core complex
Figure 1.1 Identification of protein complexes from protein interaction data. (a) A high-confidence PPI network is assembled from physical interactions between proteins after discarding low-confidence (potentially spurious) interactions. (b) Candidate protein complexes are predicted from this PPI network using network-clustering approaches. The quality of the predicted complexes is validated against bona fide complexes, whereas novel complexes are functionally assessed and assigned new roles where possible. 1.2 Databases for Protein Complexes 11
3. identifying modular subnetworks from the PPI network to generate a candi- date list of protein complexes; and 4. evaluating these candidate complexes against bona fide complexes, and validating and assigning roles for novel complexes.
As we shall see in the following chapters, several sophisticated approaches have been developed over the years to overcome some of the above-mentioned chal- lenges. Computational methods have co-evolved with proteomics technologies, and over the last ten years a plethora of computational methods have been developed to predict complexes from PPI networks, which is the subject of this book. In general, computational methods complement experimental approaches in several ways. These methods have helped counter some of the limitations arising in proteomic studies, e.g., by eliminating spurious interactions via interaction scoring, and by enriching true interactions via prediction of missing interactions. The novel in- teractions and protein complexes predicted from these methods have been added back to proteomics databases, and these have helped to further enhance our re- sources and knowledge in the field.
Databases for Protein Complexes 1.2 Several high-quality resources for protein complexes have been developed over the years covering both lower-order model and higher-order organisms (summarized in Table 1.3). In total, Aloy [Aloy et al. 2004], CYC2008 [Pu et al. 2009], and MIPS [Mewes et al. 2008] contain over 450 manually curated complexes from S. cerevisiae (budding yeast). CORUM [Reuepp et al. 2008, 2010] contains ∼3,000 mammalian complexes of which ∼1,970 are protein complexes identified from human cells. The European Molecular Biology Laboratory (EMBL) and European Bioinformatics Institute (EBI) maintain a database of manually curated protein complexes from 18 different species including C. elegans, H. sapiens, M. musculus, S. cerevisiae, and S. pombe [Meldal et al. 2015]. Havugimana et al. [2012] present a dataset of 622 putative human soluble pro- tein complexes (http://human.med.utoronto.ca/) identified using high-throughput AP/MS pulldown and PPI-clustering approaches. Huttlin et al. [2015] present 352 putative human complexes identified from human embryonic (HEK293T) cells (http://wren.hms.harvard.edu/bioplex/). Wanetal.[2015] present a catalog of con- served metazoan complexes (http://metazoa.med.utoronto.ca/) identified by clus- tering of high-quality pulldown interactions from C. elegans, D. melanogaster, H. sapiens, M. musculus, and Strongylocentrotus purpuratus (purple sea urchin). This 12 Chapter 1 Introduction to Protein Complex Prediction R. 2017 ] (399), and Strongylocentrotus S. cerevisiae , and [ Vinayagam et al. 2013 ] [ Meldal et al. 2015 ] [ Wan et al. 2015 ] [ Reuepp et al. 2008 , Ruepp et al. 2010 ] [ Aloy et al. 2004 ] [ Huttlin et al. 2015 , [ Havugimana et al. 2012 ] [ Mewes et al. 2008 ] [ Ori et al. 2016 ] [ Pu et al. 2009 ] [ Drew et al. 2017 ] (house mouse) (15%) and (404), M. musculus , M. musculus . The EMBL-EBI portal includes protein M. musculus H. sapiens , (64%), (441), S. cerevisiae H. sapiens H. sapiens , and D. melanogaster , H. sapiens , C.elegans http://www.flyrnai.org/compleat/ http://metazoa.med.utoronto.ca/ http://www.ebi.ac.uk/intact/complex/ http://mips.helmholtz-muenchen.de/genre/ proj/corum/ http://www.russelllab.org/complexes/ http://wren.hms.harvard.edu/bioplex/ http://human.med.utoronto.ca/ http://www.helmholtz-muenchen.de/en/ibis/ http://www.bork.embl.de/Docu/variable_ complexes/ http://wodaklab.org/cyc2008/ http://proteincomplexes.org (16 complexes), a C. elegans D. melanogaster No. of 102 354 622 313 408 > 4,600 3 species 8,886 Metazoa 344–490 18 species 1,564 S. cerevisiae H. sapiens (HEK293T) H. sapiens S. cerevisiae S. cerevisiae H. sapiens c b (purple sea urchin), consisting of 344 complexes with entirely ancient proteins and 490 complexes with largely ancient (12%) (Norwegian rat). (16). CORUM includes mammalian protein complexes, mainly from EMBL-EBI Complex Portal MIPS—CORUM Mammals 2,837 Database Organisms Complexes Source Reference Aloy BioPlex COMPLEAT Human Soluble Metazoan conserved MIPS—Yeast CYC2008 hu.MAP Ori Mammals 279 purpuratus proteins conserved ubiquitously among eurkaryotes. norvegicus c. Includes mainly conserved complexes among the metazoans, complexes from 18 different speciesS. of pombe which are Publicly available databases for protein complexes a. No. of complexes as ofb. 2016. COMPLEAT includes protein complexes from Table 1.3 1.3 Organization of the Rest of the Book 13
dataset includes ∼300 complexes composed of entirely ancient proteins (evolu- tionarily conserved from lower-order organisms), and ∼500 complexes composed of largely ancient proteins conserved ubiquitously among eurkaryotes. Drew et al. [2017] present a comprehensive catalog of >4,600 computationally predicted human protein complexes covering >7,700 proteins and >56,000 interactions by analyzing data from >9,000 published mass spectrometry experiments. Vinayagam et al. [2013] present COMPLEAT (http://www.flyrnai.org/compleat/), a database of 3,077, 3,636, and 2,173 literature-curated protein complexes from D. melanogaster, H. sapiens, and S. cerevisiae, respectively. Ori et al. [2016] combined mammalian complexes from CORUM and COMPLEAT to generate a dataset of 279 protein com- plexes from mammals.
Organization of the Rest of the Book 1.3 The rest of this book reads as follows. Chapter 2 discusses important concepts underlying PPI networks and presents prerequisites for understanding subse- quent chapters. We discuss different high-throughput experimental techniques employed to infer PPIs (including the Y2H and AP/MS techniques mentioned ear- lier), explaining briefly the biological and biochemical concepts underlying these techniques and highlighting their strengths and weaknesses. We explain compu- tational approaches that denoise (PPI weighting) and integrate data from multiple experiments to construct reliable PPI networks. We also discuss topological proper- ties of PPI networks, theoretical models for PPI networks, and the various databases and software tools that catalog and visualize PPI networks. Chapter 3 forms the main crux of this book as it introduces and discusses in depth the algorithmic underpinnings of some of the classical (seminal) computational methods to iden- tify protein complexes from PPI networks. While some of these methods work solely on the topology of the PPI network, others incorporate additional biolog- ical information—e.g., in the form of functional annotations—with PPI network topology to improve their predictions. Chapter 4 presents a comprehensive em- pirical evaluation of six widely used protein complex prediction methods available in the literature using unweighted and weighted PPI networks from yeast and hu- man. Taking a known human protein complex as an example, we discuss how the methods have fared in recovering this complex from the PPI network. Based on this evaluation, we explain in Chapter 5 the shortcomings of current methods in detecting certain kinds of protein complexes, e.g., protein complexes that are sparse or that overlap with other complexes. Through this, we highlight the open challenges that need to be tackled to improve coverage and accuracy of protein 14 Chapter 1 Introduction to Protein Complex Prediction
complex prediction. We discuss some recently proposed methods that attempt to tackle these open challenges and to what extent these methods have been suc- cessful. Chapter 6 is dedicated to an important class of protein complexes that are dynamic in their protein composition and assembly. While some of these protein complexes are temporal in nature—i.e., assemble at a specific timepoint and dis- sociate after that—others are structurally variable—e.g., change their 3D structure and/or composition—based on the cellular context. Quite obviously, it is not pos- sible to detect dynamic complexes solely by analyzing the PPI network; methods that integrate gene or protein expression and 3D structural information are re- quired. These more-sophisticated methods are covered here. Chapter 7 discusses methods to identify protein complexes that are conserved between organisms or species; these evolutionarily conserved complexes provide important insights into the conservation of cellular processes through the evolution. Finally, in today’s era of systems biology where biological systems are studied as a complex interplay of multiple (biomolecular) entities, we explain how protein complex prediction methods are playing a crucial role in shaping up the field; these applications are covered in Chapter 8. We discuss the application of these methods for predicting dysregulated or dysfunctional protein complexes, identifying rewiring of interac- tions within complexes, and in discovery of new disease genes and drug targets. We conclude the book in Chapter 9 by reiterating the diverse applications of pro- tein complex prediction methods and thereby the importance of computational methods in driving this exciting field of research. Constructing Reliable Protein-Protein Interaction (PPI) Networks
No molecule arising naturally (MAN) is an island, entire of itself. —John Donne (1573–1631), English poet and cleric (modified [Dunn 2010] from original quote, “No man is an island, entire of itself.”)
2The identification of PPIs yields insights into functional relationships between pro- teins. Over the years, a number of different experimental techniques have been developed to infer PPIs. This inference of PPIs is orthogonal, but also complemen- tary, to experiments inferring genetic interactions; both provide lists of candidate interactions and implicate functional relationships between proteins [Morris et al. 2014].
High-Throughput Experimental Systems to Infer PPIs 2.1 Physical interactions between proteins are inferred using different biochemical, biophysical, and genetic techniques (summarized in Table 2.1). Yeast two-hybrid (Y2H; less commonly, YTH) [Ito et al. 2000, Uetz et al. 2000, Ho et al. 2002] and protein-fragment complement assays [Michnick 2003, Remy and Michnick 2004, Remy et al. 2007] enable identification of direct binary physical interactions be- tween the proteins, whereas co-immunoprecipitation or affinity purification assays [Golemis and Adams 2002, Rigaut et al. 1999, Kocher¨ and Superti-Furga 2007, Dunham et al. 2012] enable pull down of whole protein complexes from which the binary interactions are inferred. Protein-fragment complementation assay (PCA) 16 Chapter 2 Constructing Reliable Protein-Protein Interaction (PPI) Networks Ho et Rigaut et Petschnigg Tarassov et al. Kittanakom et al. Remy and Michnick Uetz et al. 2000 , Yao et al. 2017 ] Lalonde et al. 2010 , Remy et al. 2007 , al. 2002 ] al. 1999 ] [ Michnick 2003 , 2004 , 2008 ] Key References [ Lalonde et al. 2008 , 2009 , et al. 2014 , ractions; these techniques can be employed in a high-throughput Binary interactions; can infer membrane-protein interactions Binary interactionsCo-complex relationships [ Ito et al. , 2000 [ Golemis and Adams 2002 , Binary interactions between membrane proteins (yeast, (yeast, (yeast, ies for potential interactors In vivo mammalian) Cell Assay Interaction Type In vivo mammalian) In vitro In vivo mammalian) Protein-fragment complement (PCA) Experimental Technique Yeast two-hybrid (Y2H) Co-immunoprecipitation precipitation followed by mass spectrometry (Co-IP/MS) Membrane yeast two- hybrid (MYTH), Mammalian membrane YTH (MaMTH) Experimental techniques for screening protein inte manner to screen whole protein librar Table 2.1 2.1 High-Throughput Experimental Systems to Infer PPIs 17 coupled with biomolecular fluorescence complementation (BIFC) [Grinberg et al. 2004] enables mapping of interaction surfaces of proteins, and is thus a good tool to confirm protein binding. Membrane YTH and mammalian membrane YTH (MaMTH) [Lalonde et al. 2008, Kittanakom et al. 2009, Lalonde et al. 2010, Petschnigg et al. 2014, Yao et al. 2017] enable identification of interactions in- volving membrane or membrane-bound proteins which are typically difficult to identify using traditional Y2H and AP techniques. Techniques inferring genetic in- teractions [Brown et al. 2006] enable detection of functional associations or genetic relationships between the proteins (genes), but these associations do not always correspond to physical interactions. Here, we present only an overview of each of the experimental techniques; for a more descriptive survey, the readers are referred to Br¨uckneret al. [2009], Shoemaker and Panchenko [2007], and Snider et al. [2015].
Yeast Two-Hybrid (Y2H) Screening System Y2H was first described by Fields and Song [1989] and is based on the modularity of binding domains in eukaryotic transcription factors. Eukaryotic transcription factors have at least two distinct domains: (1) the DNA binding domain (BD) which directs binding to a promoter DNA sequence (upstream activating sequence (UAS)); and (2) the transcription activating domain (AD) which activates the transcription of target reporter genes. Splitting the BD and AD domains inactivates transcription, but even indirectly connecting AD and BD can restore transcription resulting in activation of specific reporter genes. Plasmids are engineered to produce a protein product (chimeric or “hybrid”) in which the BD fragment is in-frame fused onto a protein of interest (the bait), while the AD fragment is in-frame fused onto another protein (the prey) (Figure 2.1). The plasmids are then transfected into cells chosen for the screening method, usually from yeast. If the bait and prey proteins interact, the AD and BD domains are indirectly connected, resulting in the activation of reporters within nuclei of cells. Typically, multiple independent yeast colonies are assayed for each combination of plasmids to account for the heterogeneity in protein expression levels and their ability to activate reporter transcription. This basic Y2H technique has been improved over the years to enable large library screening [Chien et al. 1991, Dufree et al. 1993, Gyuris et al. 1993, Finley and Brent 1994]. Interaction mating is one such protocol that can screen more than one bait against a library of preys, and can save considerable time and materials. In this protocol, the AD- and BD-fused proteins begin in two different haploid yeast strains with opposite mating types. These proteins are brought together by mating, a process in which two haploid cells fuse to form a single diploid cell. The diploids are then tested using conventional reporter activation for possible interactors. 18 Chapter 2 Constructing Reliable Protein-Protein Interaction (PPI) Networks
Prey AD Chimeric plasmids with in-frame fused bait and prey
Bait
BD
UAS Reporter gene
Figure 2.1 Schematic representation of the yeast two-hybrid protocol to detect interaction between bait and prey proteins. If the bait and prey proteins interact, the DNA binding domain (BD) fused to the bait and the transcription activing domain (AD) fused to the prey are indirectly connected resulting in the activation of the reporter gene. UAS: upstream activating sequence (promoter) of the reporter gene.
Therefore, different bait-expressing strains can be mated with a library of prey- expressing strains, and the resulting diploids can be screened for interactors. It is important to know how many viable diploids have arisen and to determine the false-positive frequency of the detected interactions. True interactors tend to come up in a timeframe specific for each given bait, with false positives clustering at a different timepoint. Multiple yeast colonies are assayed to confirm the interactors. Y2H screens have been extensively used to detect protein interactions among yeast proteins, with two of the earliest studies reporting 692 [Uetz et al. 2000] and 841 [Ito et al. 2000] interactions for S. cerevisiae. In the bacteria Helicobacter py- lori, one of the first applications of Y2H identified over 1,200 interactions, covering about 47% of the bacterial proteome [Rain et al. 2001]. Applications on fly pro- ceeded on an even greater scale when Giot et al. [2003] identified 10,021 protein interactions involving 4,500 proteins in D. melanogaster. More recently, Vo et al. [2016] used Y2H to map binary interactions in the yeast S. pombe (fission yeast). This network, called FissionNet, consisted of 2,278 interactions covering 4,989 protein- coding genes in S. pombe. The Y2H system has also been applied for humans, with two initial studies [Rual et al. 2005, Stelzl et al. 2005] yielding over 5,000 interac- tions among human proteins. More recently, Rolland et al. [2014] employed Y2H to characterize nearly 14,000 human interactions. 2.1 High-Throughput Experimental Systems to Infer PPIs 19
However, inherent to this type of library screening, the number of detected false-positive interactions is usually high. Among the possible reasons for the gen- eration of false positives is that the experimental compartmentalization (within the nucleus) for bait and prey proteins does not correspond to the natural cellular compartmentalization. Moreover, proteins that are not correctly folded under ex- perimental conditions or are “sticky” may show non-specific interactions. The third source of false positives is the interaction of the preys themselves with reporter pro- teins, which can turn on the reporter genes. Von Mering et al. [2002] estimated the accuracy of classic Y2H to be less than 10%, with subsequent evaluations suggest- ing the number of false positives to be between 50% and 70% in large-scale Y2H interaction datasets for yeast [Bader and Hogue 2002, Bader et al. 2004].
Co-Immunoprecipitation/Affinity Purification (AP) Followed by Mass Spectrometry (Co-IP/AP followed by MS) Complementing the in vivo Y2H screens are the in vitro Co-IP/AP followed by MS screens that identify whole complexes of interacting proteins, from which the bi- nary interactions between proteins can be inferred [Golemis and Adams 2002, Rigaut et al. 1999, Kocher¨ and Superti-Furga 2007, Dunham et al. 2012]. The Co- IP/AP followed by MS screens consist of two steps: co-immunoprecipitation/affinity purification and mass spectrometry (Figure 2.2). In the first step, cells are lysed
MS
Harvest and Incubate lysate with Antibody binds to Elute Separate out Cut bands from gel; lyse cells tagged proteins; bait; nonbinding complex the eluted identify proteins by introduce specific proteins are proteins mass spectrometry antibody for washed; retain using gel protein of proteins (bait and electrophoresis interest (bait) preys) in complex
Figure 2.2 Schematic representation of the co-immunoprecipitation/affinity purification followed by mass spectrometry (Co-IP/AP followed by MS) protocol. The protein of interest (bait) is targeted with a specific antibody and pulled down with its interactors in a cell lysate buffer. The individual components of the pulled-down complex are identified using mass spectrometry. These days, liquid chromatography with mass spectrometry (LCMS) instead of running the gel is increasingly being used more frequently for as a combined physical-separation and MS-analysis technique [Pitt 2009]. 20 Chapter 2 Constructing Reliable Protein-Protein Interaction (PPI) Networks
in a radioimmunoprecipitation assay (RIPA) buffer. The RIPA buffer enables effi- cient cell lysis and protein solubilization while avoiding protein degradation and interference with biological activity of the proteins. A known member of the set of proteins (the protein of interest or bait) is epitope-tagged and is either immunopre- cipitated using a specific antibody against the tag or purified using affinity columns recognizing the tag, giving the interacting partners (preys) of the bait. Normally, this purification step is more effective when two consecutive purification steps are used with proteins that are doubly tagged (hence called tandem affinity purification or TAP). This results in an enrichment of native multi-protein complexes containing the bait. The individual components within each such purified complex are then screened by gel electrophoresis and identified using mass spectrometry. In one of the first applications of TAP/MS, Ho et al. [2002] expressed 10% of the coding open reading frames from yeast, and the identified interactions connected 25% of the yeast proteome as multi-protein complexes. Subsequently, Gavin et al. [2002], Gavin et al. [2006], and Krogan et al. [2006] purified 1,993 and 2,357 TAP- tagged proteins covering 60% and 72% of the yeast proteome, and identified 7,592 and 7,123 protein interactions from yeast, respectively. One of the first proof-of- concept studies for humans applied AP/MS to characterize interactors using 338 bait proteins that were selected based on their putative involvement in diseases, and the study identified 6,463 interactions between 2,235 proteins [Ewing et al. 2007].
Comparison of Y2H and AP/MS Experimental Techniques A majority of the interaction data collected so far has come from Y2H screening. For example, approximately half of the data available in databases including IntAct [Hermjakob et al. 2004, Kerrien et al. 2012] and MINT [Zanzoni et al. 2002, Chatr- Aryamontri et al. 2007] are from Y2H screens [Br¨uckneret al. 2009] (more sources of PPI data are listed in Table 2.2). This could in part be attributed to the inaccessi- bility of mass spectrometry due to the expensive large equipment that is required. But, in general, Y2H and AP/MS techniques are complementary in the kind of in- teractors they detect. If a set of proteins form a stable complex, then an AP/MS screen can determine all the proteins within the complex, but may not necessarily confirm every interacting pair (the binary interactions) within the complex. On the other hand, a Y2H screen can detect whether any given two proteins directly inter- act. While stable interactions between co-complexed proteins can be accurately determined using AP/MS techniques, Y2H techniques are useful for identifying transient interactions between the proteins. However, due to considerable func- 2.1 High-Throughput Experimental Systems to Infer PPIs 21 tional cross-talk within cells, Y2H can also report an interaction even when the proteins are not directly connected. In addition, some types of interactions can be missed in Y2H due to inherent limitations in the technique—e.g., interactions in- volving membrane proteins, or proteins requiring posttranslational modifications to interact—but these limitations may also occur with AP/MS-based approaches [Br¨uckneret al. 2009]. Therefore, only a combination of different approaches that necessarily also includes computational methods (to filter out the incorrectly de- tected interactions) will eventually lead to a fairly complete characterization of all physiologically relevant interactions in a given cell or organism.
Protein-Fragment Complementation Assay (PCA) PCA is a relatively new technique which can detect in vivo protein interactions as well as their modulation or spatial and temporal changes [Michnick 2003, Morell et al. 2009, Tarassov et al. 2008]. Similar to Y2H, PCA is based on the principle of splitting a reporter protein into two fragments, each of which cannot function alone [Michnick 2003]. However, unlike Y2H, PCA is based on the formation of a biomolecular complex between the bait and prey, where both are fused to the split domains of the reporter. Importantly, the formation of this complex occurs in competition with alternative endogenous interaction partners present within the cell. The interaction brings the two split fragments in proximity enabling their non-covalent reassembly, folding, and recovery of protein reporter function [Morell et al. 2009]. Typically, the reporter proteins are fluorescent proteins, and the for- mation of biomolecular complexes is visualized using biomolecular fluorescence complementation (BIFC). BIFC can also be used to map the interaction surfaces of these complexes. This enables investigation of competitive binding between mu- tually exclusive interaction partners as well as comparison of their intracellular distributions [Grinberg et al. 2004]. PCA can be used as a screening tool to identify potential interaction partners of a specific protein [Remy and Michnick 2004, Remy et al. 2007], or to validate the interactions detected from other techniques such as Y2H [Vo et al. 2016]. In one of the first applications of PCA on a genome-wide in vivo scale, Tarassov et al. [2008] identified 2,770 interactions among 1,124 proteins from S. cerevisiae. Vo et al. [2016] used PCA as an orthogonal assay to reconfirm the interactions detected in S. pombe (from the FissionNet network consisting of 2,278 interactions; discussed earlier). PCA has also been employed to validate interactions between membrane proteins or membrane-associated proteins [Babu et al. 2012, Shoemaker and Panchenko 2007] (discussed next). 22 Chapter 2 Constructing Reliable Protein-Protein Interaction (PPI) Networks
Techniques for Inferring Membrane-Protein Interactions Membrane proteins are attached to or associated with membranes of cells or their organelles, and constitute approximately 30% of the proteomes of organisms [Carpenter et al. 2008, Von Heijne 2007, Byrne and Iwata 2002]. Being non-polar (hydrophobic), membrane proteins are difficult to crystallize using traditional X- ray crystallography compared to soluble proteins, and are the least studied among all proteins using high-throughput proteomics techniques [Carpenter et al. 2008]. Membrane proteins are involved in the transportation of ions, metabolites, and larger molecules such as proteins, RNA, and lipids across membranes, in sending and receiving chemical signals and propagating electrical impulses across mem- branes, in anchoring enzymes and other proteins to membranes, in controlling membrane lipid composition, and in organizing and maintaining the shape of organelles and the cell itself [Lodish et al. 2000]. In humans, the G-protein-coupled- receptors (GPCRs), which are membrane proteins involved in signal transduction across membranes, alone account for 15% of all membrane proteins; and 30% of all drug targets are GPCRs [Von Heijne 2007]. Due to the key roles of membrane pro- teins, identifying interactions involving these proteins has important applications especially in drug development. Membrane protein complexes are notoriously difficult to study using traditional high-throughput techniques [Lalonde et al. 2008]. Intact membrane-protein com- plexes are difficult to pull down using conventional AP/MS systems. This is due in part to the hydrophobic nature of membrane proteins as well as the ready dissocia- tion of subunit interactions, either between trans-membrane subunits or between trans-membrane and cytoplasmic subunits [Barrera et al. 2008]. Further, mem- brane protein structure is difficult to study by commonly used high-resolution methods including X-ray crystallography and NMR spectroscopy. A major avenue by which one can understand membrane proteins and their complexes is by mapping the membrane-protein “subinteractome”—the subset of interactions involving membrane proteins. Conventional Y2H system is confined to the nucleus of the cell thereby excluding the study of membrane proteins. New biochemical techniques have been developed to facilitate the characterization of interactions among membrane proteins. Among these is the split-ubiquitin mem- brane yeast two-hybrid (MYTH) system [Miller et al. 2005, Kittanakom et al. 2009, Stagljar et al. 1998, Petschnigg et al. 2012]. This system is based on ubiquitin, an evolutionarily conserved 76-amino acid protein that serves as a tag for proteins targeted for degradation by the 26S proteasome. The presence of ubiquitin is recog- nized by ubiquitin-specific proteases (UBPs) located in the nucleus and cytoplasm of all eukaryotic cells. Ubiquitin can be split and expressed as two halves: the amino- 2.2 Data Sources for PPIs 23
terminal (N) and the carboxyl terminal (C). These two halves have a high affinity for each other in the cell and can reconstitute to form pseudo-ubiquitin that is recog- nizable by UBPs. In MYTH, the bait proteins are fused to the C-terminal of a split-ubiquitin, and the prey proteins are fused to the N-terminal. The two halves reconstitute into a pseudo-ubiquitin protein if there is affinity between the bait and prey proteins. This pseudo-ubiquitin is recognized by UBPs, which cleaves after the C-terminus of ubiquitin to release the transcription factor, which then enters the nucleus to activate reporter genes. Two of the earliest studies using the MYTH screens reported a fair number of interactions among membrane proteins from yeast: 343 interactions among 179 proteins by Lalonde et al. [2010], and 808 interactions among 536 proteins by Miller et al. [2005]. PCA has also been adopted to identify and/or verify membrane-protein interactions. For example, Babu et al. [2012] used PCA to validate and integrate 1,726 yeast membrane-protein interactions obtained from multiple studies, and these encompassed 501 putative membrane protein complexes. The mammalian version of membrane yeast two-hybrid, MaMTH, is also based on the split-ubiquitin assay and is derived from the MYTH assay. Stagljar and col- leagues [Petschnigg et al. 2014, Yao et al. 2017] used MaMTH to probe interactions involving the epidermal growth factor receptor/receptor tyrosine-protein kinase (RTK) ErbB-1 (EGFR/ERBB1), Erb-B2 receptor tyrosine kinase 2 (ERBB2), and other RTKs that localize to the plasma membrane in human cells. When applied to hu- man lung cancer cells, the assay identified 124 interactors for wild-type and mutant EGFR [Petschnigg et al. 2014].
Data Sources for PPIs 2.2 Several public and proprietary databases now catalog protein interactions from both lower-order model and higher-order organisms (summarized in Table 2.2). These databases contain PPI data in an acceptable format required for data depo- sition, such as IMEx (http://www.imexconsortium.org/submit-your-data)[Orchard et al. 2012]. The Biomolecular Interaction Network Database (BIND) [Bader et al. 2003], now called Biomolecular Object Network Database (BOND), includes experi- mentally determined protein-protein, protein-small molecule, and protein-nucleic acid interactions. BioGrid [Stark et al. 2011] catalogs physical and genetic inter- actions inferred from multiple high-throughput experiments. The Database of In- teracting Proteins (DIP) [Xenarios et al. 2002] contains experimentally determined protein interactions with a “core” subset of interactions that have passed quality 24 Chapter 2 Constructing Reliable Protein-Protein Interaction (PPI) Networks Yu et al. 2011 ] Szklarczyk et al. 2011 ] Kerrien et al. 2012 ] Salwinski et al. 2004 ] Chatr-Aryamontri et al. Yu et al. 2008 , Turner et al. 2010 ] Prasad et al. 2009 ] protein-DNA interactions Persico et al. 2005 ] uldener et al. 2005 ] Reference [ Peri et al. 2004 , [ Brown and Jurisica 2005 ] [ Lynn et al. 2008 ] [ Razick et al. 2008 , [ Zanzoni et al. 2002 , [ Mewes et al. 2008 ] [ Pagel et al. 2005 ] [ Von Mering et al. 2003 , 2007 , [ Bader et al. 2003 ] [ Rolland et al. 2014 , [ G¨ [ Xenarios et al. 2002 , [ Hermjakob et al. 2004 , [ Chen et al. 2009 ] [ Stark et al. 2011 ] , protein-small molecule, and http://irefindex.org/wiki/index.php?title=iRefIndex http://string-db.org/ http://www.hprd.org/ http://ophid.utoronto.ca/ http://www.innatedb.com/ http://mint.bio.uniroma2.it/mint/ http://mips.helmholtz-muenchen.de/proj/ppi/ http://mips.helmholtz-muenchen.de/proj/ppi/ http://bind.ca http://interactome.dfci.harvard.edu/ http://mips.helmholtz-muenchen.de/genre/proj/yeast/ http://www.ebi.ac.uk/intact/ http://dip.doe-mbi.ucla.edu http://discern.uits.iu.edu:8340/HAPPI/ http://thebiogrid.org iRefIndex STRING HPRD OPHID/IID InnateDB MINT/HomoMINT MIPS MPPI BIND CYGD EMBL-EBI IntAct CCSB DIP HAPPI PPI DatabaseBioGrid Source Public and proprietary databases for protein-protein Table 2.2 2.3 Topological Properties of PPI Networks 25
assessment (for example, based on literature verification). The Centre for Can- cer Systems Biology (CCSB) Interactome Database at Harvard [Rolland et al. 2014, Yu et al. 2008, Yu et al. 2011] contains yeast, plant, virus, and human interac- tions. STRING [Von Mering et al. 2003, Szklarczyk et al. 2011] catalogs physical and functional interactions inferred from experimental and computational techniques. MIPS Comprehensive Yeast Genome Database (CYGD) [G¨uldener et al. 2005] and MIPS Mammalian Protein-Protein Interaction Database (MPPI) [Pagel et al. 2005] catalog protein interactions and also expert-curated protein complexes from yeast and mammals. The Human Protein Reference Database (HPRD) [Peri et al. 2004, Prasad et al. 2009] mainly contains experimentally identified human interactions. IRefIndex [Razick et al. 2008, Turner et al. 2010] and Integrated Interaction Data- base (IID) [Brown and Jurisica 2005] integrate experimental and computationally predicted interactions for human and several other species.
Topological Properties of PPI Networks 2.3 A simple yet effective way to represent interaction data is in the form of an undi- rected network called a protein-protein interaction network or simply PPI network, given as G =V , E, where V is the set of proteins and E is the set of physical interactions between the proteins. Such a network presents a global or “systems” view of the entire set of proteins and their interactions, and provides a topolog- ical (mathematical) framework to interrogate the interactions. In the definitions throughout this book, we also use V (G) and E(G) to refer to the set of proteins and ∈ interactions of a (sub)network of G. For a protein v V , the set N(v)or Nv includes all immediate neighbors of v, and deg(v) =|N(v)| is the degree of v. These neigh- = ={ ∈ }∪{ ∈ bors together with their interactions, Ev E(v) (v, u) : u N(v) (u, w) : u N(v), w ∈ N(v)∩ N(u)}, constitute the local (immediate) neighborhood subnetwork of v. PPI networks, like most real-world networks, have characteristic topological properties which are distinct from that of random networks. But, to understand this distinction we need to first understand what are random networks. Traditionally, random networks have been described using the Erd¨os-R´enyi (ER) model, in which G(n, p) is a random network with |V |=n nodes and each possible edge connecting pairs of these nodes has probability p of existing [Erdos¨ 1960, Bollob´as 1985]. The n expected number of edges in the network is 2 p, and the expected mean degree is np. Alternatively, a random network is defined as a network chosen uniformly at n (2) random from the collection m of all possible networks with n nodes and m edges. If p is the probability for the existence of an edge, the probability for each network 26 Chapter 2 Constructing Reliable Protein-Protein Interaction (PPI) Networks
in this collection is,
m n −m p (1 − p)(2) , (2.1)
where the closer p gets to 1, the more skewed the collection is toward networks with higher number of edges. When p = 1/2, the collection contains all possible n n (2) (2) m networks with equal probability, (1/2) . A marked feature of PPI networks that makes them non-random is the extremely broad distribution of degrees of proteins: While a majority of proteins have small number of immediate binding partners, there exist some proteins, referred to as “hubs,” with unusually large number of binding partners. Moreover, the degrees of most proteins are larger than the average node degree of the network. This degree distribution P(k)can be approximated by a power law P(k)∼ k−γ , where k is the node degree and γ is a small constant. Such networks are called scale-free networks [Barab´asi 1999, Albert and Barab´asi 2002]. Proteins within PPI networks exhibit an inherent tendency to cluster, which can be quantified by the local clustering coefficient for these proteins [Watts and Strogatz 1998]. For a selected protein v that is connected to |N(v)| other nodes, if these immediate neighbors were part of a clique (a complete subgraph), there |N(v)|.(|N(v)|−1) will be 2 interactions between them. The ratio of the actual number of interactions that exist between these nodes and the total number of possible interactions gives the local clustering coefficient CC(v) for the protein v: 2|{(u, w) : u, w ∈ N(v), (u, w) ∈ E}| CC(v) = . (2.2) |N(v)| . (|N(v)|−1) The average clustering coefficient of the entire network is therefore given by 1 CC(G) = CC(v). (2.3) |V (G)| v∈V (G) This clustering property gives rise to groups (subnetworks) of proteins in the PPI network that exhibit dense interactions within the groups, but sparse or less-dense interactions between the groups. The interactions between the groups occur via a few “central” proteins through which pass most paths in the network. The average shortest path length d(G)¯ is given by 2 u =v(u, v) d(G)¯ = , (2.4) |V (G)| . (|V (G)|−1) where (u, v)is the length of the shortest path between two connected nodes u and v. This average shortest path length of the PPI network is significantly shorter than 2.4 Theoretical Models for PPI Networks 27
that of an ER random network; in an ER network, the average shortest path length is proportional to ln n. However, both PPI networks as well as ER networks fall under “small-world” networks, because the average path length between any two nodes is still significantly smaller than the network size: n>>k>>ln n>>1, where k is the average node degree. The average clustering coefficient of a PPI network is significantly higher than a random network constructed on the same protein set and with the same average shortest path length. This small-world property can also be quantified using two coefficients: close- ness and betweenness [Hormozdiari et al. 2007]. The closeness of v is defined based on its average shortest path length to all other proteins reachable from v (that is, within the same connected component as v) in the network,
|V (G)|−1 Cl(v) = , (2.5) u∈R(G,v)(u, v)
where R(G, v) is the set of proteins reachable from v. Closeness is thus the inverse of the average distance of v to all other nodes in the network. The betweenness of the node v measures the extent to which v lies “between” any pair of proteins
connected to v in the network. Let Sxy be the number of shortest paths between all ∈ pairs x, y R(G, v), and let Sxy(v) be the number of these shortest paths that pass through v. The betweenness Bet(v) of v is defined as: Sxy(v) Bet(v) = . (2.6) S x,y∈R(G,v),x =y xy
The distributions of closeness and betweenness coefficients for PPI networks are significantly different from that of random networks [Hormozdiari et al. 2007]. In particular, PPI networks contain central proteins which have high betweenness and these connect and hold together different groups of proteins or regions of the network.
Theoretical Models for PPI Networks 2.4 Studying the topological properties of PPI networks has gained considerable at- tention in the last several years, and in particular various network models have been proposed to describe PPI networks. As new PPI data became available, these theoretical models have been improved to accurately model the data. These in- clude the earliest models such as the Erdos-R¨ ´enyi model [Erdos¨ 1960, Bollob´as 1985], and more recent ones such as the Barab´asi-Albert [Barab´asi 1999, Albert and 28 Chapter 2 Constructing Reliable Protein-Protein Interaction (PPI) Networks
Barab´asi 2002], Watts-Strogatz [Watts and Strogatz 1998], and hierarchical [Ravasz and Barab´asi 2003] network models. With a fixed probability for each edge, the Erdos-R¨ ´enyi (ER) networks (discussed above) look to be intuitive and seemingly correct, and were once commonly used to model real-world networks. However, in reality, ER networks are rarely found in real-world examples. Most real-world networks—including road and airline routes, social contacts, webpage links, and also PPI networks—do not have evenly dis- tributed degrees. Moreover, although ER networks have the small-world property, they have almost no clustering effects. For these reasons, using ER networks as null models for comparison against real-world networks including PPI networks is usually inappropriate. In the Barab´asi-Albert (BA) model [Barab´asi 1999, Albert and Barab´asi 2002], nodes are added one at a time to simulate network growth. The probability of an edge forming between an incoming new node and existing nodes is based on the principle of preferential attachment—that is, existing nodes with higher degrees have a higher likelihood of forming an edge with the incoming node. Hence, the degree distribution follows a power-law distribution (scale-free) P(k)∼ k−γ and exhibits small-world behavior, but like ER, lacks modular organization (low clustering behavior). The BA model is particularly important as one of the earliest instances of network models proposed as a suitable null model, instead of ER, for comparisons against biological networks [Barab´asi 1999, Albert and Barab´asi 2002]. This led to a panoply of works describing the properties of various biological network types including metabolic [Wagner and Fell 2001] and PPI networks [Yook et al. 2004] using BA. At the same time, the deficiencies of the BA model started to become more clear: while the scale-free property is preserved, modularity is not. Thus, better null models are needed. The Watts-Strogatz (WS) model [Watts and Strogatz 1998] is seldom used for modeling biological networks but demonstrates how easy it is to transition from a non-small world network (e.g., a lattice graph) to small-world network with random rewiring of a few edges. The generation procedure is amazingly simple: From a ring lattice with n vertices and k edges per vertex, each edge is rewired at random with probability p.Ifp = 0, then the graph is completely regular, and if 1, the graph is completely random/disordered. For the WS model, the intermediate region, where 0
3-node graphlets 4-node graphlets
12 3 4 567 8
5-node graphlets
9 10 11 12 13 14 15 16 17 18 19
20 21 22 23 24 25 26 27 28 29
Figure 2.3 The 29 graphlets of 3–5 nodes defined by Prˇzuljet al. [2004].
geometric random graphs and PPI networks using the distributions for graphlets. A graphlet is a connected network with a small number of nodes, and graphlet fre- quency is the number of occurrences of a graphlet in a network. The authors defined 29 types of graphlets using 3–5 nodes (Figure 2.3). The relative frequency of a graphlet i from the PPI network G is defined as Ni(G)/T (G), where Ni(G) is the number of ∈{ } = 29 graphlets of type i 1,...,29, and T (G) i=1 Ni(G) is the total number of graphlets of G. The same is defined for a generated geometric random network H . The relative graphlet frequency distance D(G, H) between the two networks G and H is measured as 29 = | − | D(G, H) Fi(G) Fi(H ) , (2.7) i=1 =− where Fi(G) log(Ni(G)/T (G)); the logarithm being used because the graphlet frequencies can differ by several orders of magnitude. The authors generated ge- ometric random networks with the same number of nodes as the proteins in S. cerevisiae and D. melanogaster PPI networks from high-throughput experiments. The authors found that, although the degree distributions of these PPI networks were closer to that of scale-free random networks, other topological parameters matched closely with those of geometric random networks. Specifically, the di- ameter, local and whole-network clustering coefficients, and the relative graphlet frequencies, computed as above, showed that PPI networks were closer to geomet- 2.5 Visualizing PPI Networks 31
ric random networks than scale-free networks. Furthermore, the authors suggest that as the quality and quantity of PPI data improve, the geometric random network may become better suited compared to scale-free and other networks to model PPI networks.
Visualizing PPI Networks 2.5 Visualization concerns an important component of the analysis of PPI networks. Very simply, a PPI network is visualized as dots and lines, where the dots (or other shapes) represent proteins and the lines connecting the dots represent interactions between the proteins. Such a visualization enables quick exploration of the topolog- ical properties of PPI networks—for example, counting the neighbors of a selected protein in a PPI network, the number of connected components in the network, or even spotting dense and sparse regions of the network. However, the ease of this exploration and subsequent analysis depends on how effective is the visualization method or tool used to render the PPI network. This rendering of the PPI network concerns the field of graph or network layout, where layout algorithms are used to draw the network—by appropriately positioning the dots and lines—in a 2D space. A good layout should be able to (visually) bring out the topological properties of the PPI network easily, and this has been a subject of research in graph visualization for several years. Here we briefly introduce some of the commonly used algorithms for PPI network layout; for details the readers are referred to the excellent reviews of Morris et al. [2014], Agaptio et al. [2013], and Doncheva et al. [2012]. Random layout arranges the dots (nodes) and lines (edges) in a random man- ner in the 2D space. The advantage of this algorithm is its simplicity, but on the other hand, it presents a high number of criss-crossing edges, and does not nec- essarily use the available space optimally, especially for large networks. Circular layout arranges the nodes in succession, one after the other, on a circle. While this algorithm also suffers from criss-crossing of edges, it is widely used to visu- alize small (sub)networks such as protein complexes and pathways. Tree layout arranges the network as a tree with a hierarchical organization of the nodes. This is obviously more suitable for visualizing trees than networks with cycles. Often, the layout is “ballooned out” by placing the children of each node in the tree on a circle surrounding the node, resulting in several concentric circles. Force-directed layout places the nodes according to a system of forces based on physical concepts in spring mechanics. Typically, the system combines attractive forces between adja- cent nodes with repulsive forces between all pairs of nodes to seek an optimal layout in which the overall edge lengths are small while the nodes are well separated. 32 Chapter 2 Constructing Reliable Protein-Protein Interaction (PPI) Networks
Number of proteins: 1977 Transcriptional Number of interactions: 5679 regulation Average number of neighbors: 5.47 Network diameter: 31 Clustering coefficient: 0573
Proteasome
Eukaryotic initiation factor 4F complex
Fanconi anemia pathway Chromatin remodeling
Nuclear pore DNA damage complex repair
Figure 2.4 PPI network visualization using the Cytoscape 3.4.0 tool [Shannon et al. 2003, Smoot et al. 2010]. A portion of the human PPI network (1,977 proteins and 5,679 interactions) downloaded from BioGrid [Stark et al. 2011] is visualized here using force-directed layout. Basic statistics—average number of neighbors, network diameter, etc.—are displayed for the network. Proteins (e.g., BRCA1), protein complexes (e.g., eukaryotic initiation factor 4F and nuclear pore complexes), pathways (e.g., Fanconi anaemia pathway), and cellular processes (e.g., DNA-damage repair and chromatin remodeling) are “pulled-out” and highlighted. Cytoscape provides “link-out” to external databases and tools—e.g., KEGG [Kanehisa and Goto 2000]—to enable further analysis.
Once the PPI network is laid out, a good visualization tool should allow at least some basic visual analysis of the network. The following aspects become important here (see Figure 2.4). The ease of navigation through the PPI network to explore individual proteins and interactions is of prime importance. In particular, the tool should be able to load and enable nagivation of even large networks. 2.5 Visualizing PPI Networks 33
Next is the provision to annotate the network using internal (e.g., labeling nodes by serial numbers or by their network properties) or external information (see below). The tool should also be able to compute (basic) topological properties of the network—for example, node degree, shortest path lengths, and clustering, closeness, and betweenness coefficients. These statistics help users get at least a preliminary idea of the network. Another valuable feature of a good tool is linkout to external databases, for example to PubMed literature (http://www.ncbi.nlm.nih .gov/pubmed), National Center for Biotechnology Information (NCBI) (http://www .ncbi.nlm.nih.gov/), UniProt or SwissProt (http://www.uniprot.org/)[Bairoch and Apweiler 1996, UniProt 2015], BioGrid (http://thebiogrid.org/)[Stark et al. 2011], Gene Ontology (GO) (http://www.geneontology.org/)[Ashburner et al. 2000], and Kyoto Encyclopedia of Genes and Genomes (KEGG) (http://www.genome.jp/kegg/ pathway.html)[Kanehisa and Goto 2000]. These enable functional annotation of proteins and interactions. Finally, the tool should also possibly support advanced analyses such as clustering of the network, comparison (based on topological characteristics, for example) between networks, and enrichment analysis, e.g., using GO terms. Table 2.3 lists some of the popular tools available for PPI network visualization and (visual) analysis. OMICS Tools (http://omictools.com/network- visualization-category) maintains an exhaustive list of visualization tools for PPI and other biomolecular network analysis.
Table 2.3 Software tools for PPI network visualization and analysis
Visualization Tool Source Reference
Arena3D http://arena3d.org/ [Pavlopoulos et al. 2011] AVIS http://actin.pharm.mssm.edu/AVIS2/ [Seth et al. 2007] BioLayout http://www.biolayout.org/ [Theocharidis et al. 2009] Cytoscape http://www.cytoscape.org/download.html [Shannon et al. 2003, Smoot et al. 2010] Medusa http://coot.embl.de/medusa/ [Hooper and Bork 2005] NAViGaTOR http://ophid.utoronto.ca/navigator/download.html [Brown et al. 2009] ONDEX http://www.ondex.org/ [Kohler¨ et al. 2006] Osprey http://biodata.mshri.on.ca/osprey/servlet/Index [Breitkreutz et al. 2003] Pajek http://vlado.fmf.uni-lj.si/pub/networks/pajek/ [Vladimir and Andrej 2004] PIVOT http://acgt.cs.tau.ac.il/pivot/ [Orlev et al. 2004] ProViz http://cbi.labri.fr/eng/proviz.htm [Florian et al. 2005] 34 Chapter 2 Constructing Reliable Protein-Protein Interaction (PPI) Networks
Building High-Confidence PPI Networks 2.6 From our discussions on experimental protocols in earlier sections, we know that some protocols—including the AP/MS ones—offer only pulled-down complexes consisting of baits and their preys without specifying the binary interactions be- tween these components. Therefore, binary interactions need to be specifically inferred between the bait and each of its preys within the pulled-down complexes. However, not all preys in a pulled-down complex interact directly with the bait (but, get pulled down due to their interactions with other preys in the complex). There- fore, it is necessary to infer binary interactions not just between the bait and its preys but also between the interacting preys. Yet, care should be taken to avoid inferring spurious (false-positive) interactions between the preys that do not inter- act. To overcome these uncertainties, often a balance is sought between two kinds of models, spoke and matrix, which are used to transform pulled-down complexes into binary interactions between the proteins [Gavin et al. 2006, Krogan et al. 2006, Spirin and Mirny 2003, Zhang et al. 2008]. The spoke model assumes that the only interactions in the complex are between the bait and its preys, like the spokes of a wheel. This model is useful to reduce the complexity of the data, but misses all (true) prey–prey interactions. On the other hand, the matrix model assumes that every pair of protein within a complex interact. This model can cover all possible true interactions, but can also predict a large number of spurious interactions. An empirical evaluation using 1,993 baits and 2,760 preys from the dataset from Gavin et al. [2006] against 13,384 pairwise protein interactions between proteins within the expert-curated MIPS complexes [Mewes et al. 2006] revealed 80.2% true-negative (missing) interactions and 39% false-positive (spurious) interactions in the spoke model, and 31.2% true-negative interactions but 308.7% false-positive interactions in the matrix model [Zhang et al. 2008]. However, note that many of the missing interactions could be due to the lack of protein coverage in these experiments. A balance is struck between the two models that covers as many true interactions between the baits and preys as possible without allowing too many false interactions [Gavin et al. 2006]; see Figure 2.5.
Gaining Confidence in PPI Networks Although high-throughput studies have been successful in mapping large frac- tions of interactomes from multiple organisms, the datasets generated from these studies are not free from errors. High-throughput PPI datasets often contain a considerable number of spurious interactions, while missing a substantial num- 2.6 Building High-Confidence PPI Networks 35
p p
b p p p p 0.60 p 0.60 p 0.85 0.55 0.85 0.55 A 0.70 0.70 p p 0.25 0.60 b b 0.75 0.70 b 0.75 0.60 0.70 p p 0.65 0.65 p p 0.75 0.75 0.25 p p p p b C D
p p B
Figure 2.5 Inferring protein interactions from pull-down protein complexes. Bait–prey relationships from pull-down complexes are assembled using the spokes model, where the bait is connected to each of the preys (A); or using the matrix model, where every bait–prey and prey–prey pair is connected (B). However, these models either miss many true interactions or produce too many spurious interactions. Therefore, a combination of the spoke and matrix models is used, where a balance is sought between the two models using weighting of interactions (C). Interactions with low weights are discarded to give the final set of high-confidence inferred interactions (D).
ber of true interactions [Von Mering et al. 2002, Bader and Hogue 2002, Cusick et al. 2009]. Consequently, a crucial challenge in adopting these datasets for down- stream analysis—including protein complex prediction—is in overcoming these challenges.
Spurious Interactions Spurious or false-positive interactions in high-throughput screens may arise from technical limitations in the underlying experimental protocols, or limitations in the (computational) inference of interactions from the screen. For example, the Y2H system, despite being in vivo, does not consider the localization (compart- mentalization), time, and cellular context while testing for binding partners. Since all proteins are tested within one compartment (the nucleus), the chances that two proteins, belonging to two different compartments and are not likely to meet during their lifetimes in live cells, end up testing positive for interaction, is high. Similarly, in vitro TAP pull downs are carried out using cell lysates in an environ- ment where every protein is present in the same “uncompartmentalized soup” [Mackay et al. 2007, Welch 2009]. Therefore, even though two proteins interact 36 Chapter 2 Constructing Reliable Protein-Protein Interaction (PPI) Networks
under these laboratory conditions, it is not certain that they will ever meet or inter- act during their life times in live cells. Opportunities are high for “sticky” molecules to function as bridges between proteins causing these proteins to interact promis- cuously with partners that never interact with in live cells [Mackay et al. 2007]. Once these complexes are pulled down, the model used to infer binary interactions— between bait and prey or between preys—can also result in inference of spurious interactions (further discussed below). Recent analyses showed that only 30–50% of interactions inferred from high-throughput screens actually occur within cells [Shoemaker and Panchenko 2007, Welch 2009], while the remaining interactions are false positives.
Missing Interactions and Lack of Concordance Between Datasets Comparisons between datasets from different techniques have shown a striking lack of concordance, with each technique producing a unique distribution of in- teractions [Shoemaker and Panchenko 2007, Von Mering et al. 2002, Bader and Hogue 2002, Cusick et al. 2009]. Moreover, certain interactions depend on post- translational modifications such as disulfide-bridge formation, glycosylation, and phosphorylation, which may not be supported in the adopted system. Many of these techniques also show bias toward abundant proteins (e.g., soluble proteins) and bias against certain kind of proteins (e.g., membrane proteins). For example, AP/MS screens predict relatively few interactions for proteins involved in transport and sensing (trans-membrane proteins), while Y2H screens being targeted in the nu- cleus fail to cover extracellular proteins [Shoemaker and Panchenko 2007]. These limitations effectively result in a considerable number of missed interactions in interactome datasets. Welch [2009] summed up the status of interactome maps, based on these above limitations, as “fuzzy,” i.e., error-prone, yet filled with promise.
Estimating Reliabilities of Interactions The coverage of true interactions can be increased by integrating datasets from multiple experiments. This integration ensures that all or most regions of the in- teractome are sufficiently represented in the PPI network. However, overcoming spurious interactions still remains a challenge, which is further magnified when datasets are integrated. Therefore, estimating the reliabilities of interactions be- comes necessary, thereby keeping only the highly reliable interactions while dis- carding the spurious or less-reliable ones. Confidence or reliability scoring schemes offer a score (weight) to each interac- tion in the PPI network. For an interaction (u, v) ∈ E in the scored (weighted) PPI 2.6 Building High-Confidence PPI Networks 37
Table 2.4 Classification of confidence scoring (PPI weighting) schemes for protein interactions
Classification Scoring Scheme Reference Sampling or Bootstrap sampling [Friedel et al. 2009] counting based Comparative Proteomic [Sowa et al. 2009] Analysis (ComPASS) Dice coefficient a [Zhang et al. 2008] Hypergeometric sampling [Hart et al. 2007] Significance Analysis of [Choi et al. 2011, Teo et INTeractions (SAINT) al. 2014] Socio-affinity scoring [Gavin et al. 2006]
Independent Bayesian networks and C4.5 [Krogan et al. 2006] evidence-based decision trees Topological Clustering [Jain and Bader 2010] Semantic Similarity (TCSS) Purification Enrichment (PE) [Collins et al. 2007]
Topology-based Collaborative Filtering (CF) [Luo et al. 2015] Functional Similarity (FS) [Chua et al. 2006] Weight a Geometric embedding [Prˇzuljet al. 2004, Higham et al. 2008] Iterative Czekanowski-Dice [Liu et al. 2008] (ICD) distance a PageRank affinity [Voevodski et al. 2009]
a. Dice coefficient, FS Weight, and Iterative CD scoring schemes can also be considered as independent evidence-based schemes, because if a pair of proteins have several common partners then these proteins most likely perform the same or similar functions and/or are present in the same cellular compartment (a biological evidence).
network G =V , E, w, the score w(u, v) encodes the confidence for the physical interaction between the two proteins u and v. The scoring function w: V × V → R accounts for the biological variability and technical limitations of the experi- ments used to infer the interactions. The scoring schemes can be classified into three broad categories (Table 2.4): (i) sampling or counting-based, (ii) biological evidence-based, and (iii) topology-based schemes. 38 Chapter 2 Constructing Reliable Protein-Protein Interaction (PPI) Networks
Sampling or Counting-Based Schemes These schemes estimate the confidence of protein pairs by measuring the number of times each protein pair is observed to interact across multiple trials against what would be expected by chance given the abundance of each protein in the library. If the protein pairs are coming from the same experiment, the counting is performed across multiple purifications of the experiment. Given multiple PPI datasets, this idea can be extended to score interacting pairs by measuring the number of times each pair is observed across the different datasets against what would be expected from random given the number of times these proteins appear across the datasets. However, if the PPI datasets come from different experiments (e.g., Y2H and TAP/MS-based), which is usually the case, then it is useful to capture the relative reliability of each experimental technique or source of the datasets into this computation. For example, if Y2H is believed to be less reliable than TAP/MS- based techniques, then protein pairs can be assigned lower weights when observed in Y2H datasets, but assigned higher weights when observed in TAP/MS datasets. In the study by Gavin et al. [2006], a “socio-affinity” scheme based on this counting idea was used to estimate confidence for interactions inferred from pulled-down complexes detected from TAP purifications. The interactions within the pulled-down complexes are inferred as a combination of spoke and matrix- modeled relationships. A socio-affinity index SA(u, v) then quantifies the tendency for two proteins u and v to identify each other when tagged (spoke model, S) and to co-purify when other proteins are tagged (matrix model, M): = + + SA(u, v) S(u,v)|u=bait S(u,v)|v=bait M(u,v), n(u,v)|u=bait where S | = = log , (u,v) u bait bait . . prey . prey fu nbait fv nu=bait n(u,v)|v=bait (2.8) S | = = log , (u,v) v bait bait . . prey . prey fv nbait fu nv=bait ⎛ ⎞ nprey = ⎝ (u,v) ⎠ and M(u,v) log − , prey . prey . npreys(npreys 1) fu fv all baits 2
where, for the spoke model (S), n(u,v)|u=bait is the number of times that u retrieves bait prey v when u is tagged; fu is the fraction of purifications when u was bait; fv is the fraction of all retrieved preys that were v; nbait is the total number of purifications preys (i.e., using baits); and nu=bait is the number of preys retrieved with u as bait. These prey terms are similarly defined for v as bait. For the matrix model (M), n(u,v) is the 2.6 Building High-Confidence PPI Networks 39 number of times that u and v are seen in purifications as preys with baits other prey prey than u or v; fu and fv are as above; and nprey is the number of preys observed with a particular bait (excluding the bait itself). Friedel et al. [2009] combined the bait–prey relationships detected from the Gavin et al. [2006] and Krogan et al. [2006] experiments, and used a random sampling-based scheme to estimate confidence of interactions. In this approach, = a list (φ1,...,φn) purifications were generated where each purification φi con- sisted of one bait bi and the preys pi,1,...,pi,m identified for this bait in the = = purification: φi bi ,[pi,1,...,pi,m] . From , l 1000 bootstrap samples were created by drawing n purifications with replacement. This means that the bootstrap sample Sj () contains the same number of purifications as and each purifica- tion φi can be contained in Sj () once, multiple times, or not at all, with multiple copies being treated as separate purifications. Interaction scores for the protein pairs are then calculated from these l bootstrap samples using socio-affinity scor- ing as above, where each protein pair is counted for the number of times the pair appeared across randomly sampled sets of interactions against what would be ex- pected for the pair from random based on the abundance of each protein in the two datasets. Zhang et al. [2008] modeled each purification as a bit vector which lists proteins pulled down as preys against a bait across different experiments. The authors then used the Sørensen-Dice similarity index [Sørensen 1948, Dice 1945] between the vectors to estimate co-purification of preys across experiments, and thus the inter- action reliability between proteins. Specifically, the pull-down data is transformed into a binary protein pull-down matrix in which a cell [u, i] in the matrix is 1 if u is pulled down as a prey in the experiment or purification i, and a zero otherwise. For two protein vectors in this matrix, the Sørensen-Dice similarity index, or simply the Dice coefficient, is computed as follows:
2q D(u, v) = , (2.9) 2q + r + s where q is the number of the matrix elements (experiments or purifications) that have ones for both proteins u and v; r is the number of elements where u has ones, but v has zeroes; and s is the number of elements where v has ones, but u has zeroes. If u and v indeed interact (directly or as part of a complex), then most likely the two proteins will be frequently co-purified in different experiments. The Dice coefficient therefore estimates the fraction of times u and v are co-purified in order to estimate the interaction reliability between u and v. 40 Chapter 2 Constructing Reliable Protein-Protein Interaction (PPI) Networks
Hart et al. [2007] generated a Probabilistic Integrated Co-complex (PICO) net- work by integrating matrix-modeled relationships from the Gavin et al. [2002], Gavin et al. [2006], Krogan et al. [2006], and Ho et al. [2002] datasets using hypergeo- metric sampling. Specifically, the significance (p-value) for observing an interaction between the proteins u and v at least k times in the dataset is estimated using the hypergeometric distribution as
min (n,m) p(#interactions ≥ k|n, m, N)= p(i|n, m, N), i=k (2.10) n N−n − p(i|n m N)= i m i where , , N , m
where k is the number of times the interaction between u and v is observed, n and m are the total number of interactions for u and v, respectively, and N is the total number of interactions in the entire dataset. The lower the p-value, the lesser is the chance that the observed interaction between u and v is random, and therefore higher is the chance that the interaction is true. Methods such as Significance Analysis of INTeractome (SAINT) are based on quantitative analysis of mass spectrometry data. SAINT, developed by Choi et al. [2011] and Teoetal.[2014], assigns confidence scores to interactions based on the spectral counts of proteins pulled down in AP/MS experiments. The aim is to con-
vert the spectral count Xij for a prey protein i identified in a purification of bait | j into the probability of true interaction between the two proteins, P(True Xij ). | | For this, the true and false distributions, P(Xij T rue) and P(Xij False), and the prior probability πT of true interactions in the dataset are inferred from the spectral counts of all interactions involving prey i and bait j. Essentially, SAINT assumes that, if proteins i and j interact, then their “interaction abundance” is proportional | to the product XiXj of their spectral counts Xi and Xj . To compute P(Xij T rue), the spectral counts Xi and Xj are learned not only from the interaction between i and j, but also from all bona fide interactions that involve i and j. The same prin- | ciple is applied to compute P(Xij False) for false interactions. These probability distributions are then used to calculate the posterior probability of true interaction | P(True Xij ). The interactions are then ranked in decreasing order of their proba- bilities, and a threshold is used to select the most likely true interactions. Comparative Proteomic Analysis (ComPASS) [Sowa et al. 2009] employs a com- parative methodology to assign scores to proteins identified within parallel pro- × = teomic datasets. It constructs a stats table X[k m] where each cell X[i, j] Xi,j 2.6 Building High-Confidence PPI Networks 41 is the total spectral count (TSC) for an interactor j (arranged as m rows) in an ex- periment i (arranged as k columns). ComPASS uses a D-score to normalize the TSCs across proteins such that the highest scores are given to proteins in each experiment that are found rarely, found in duplicate runs, and have high TSCs—all characteristics that qualify proteins to be candidate high-confidence interactors. The D-score is a modification of the Z-score, which weights all interactors equally ¯ regardless of the number of replicates or their TSCs. Let Xj be the average TSC for interactor j across all the experiments, i=k i=1,j=n Xi,j X¯ = ; n = 1,2,...,m. (2.11) j k
The Z-score is computed as
X − X¯ = i,j j Zi,j , (2.12) σj where σj is the standard deviation for the spectral counts of interactor j across the experiments. The D-score improves the Z-score by incorporating the uniqueness, the TSC, and the reproducibility of the interaction to assign a score to each protein within each experiment. The D-score first rescales Xi,j as