Computational Prediction of Complexes from Protein Interaction Networks

Sriganesh Srihari The University of Queensland Institute for Molecular Bioscience Chern Han Yong Duke-National University of Singapore Medical School Limsoon Wong National University of Singapore

ACM Books #16 Copyright © 2017 by the Association for Computing Machinery and Morgan & Claypool Publishers

Computational Prediction of Protein Complexes from Protein Interaction Networks Sriganesh Srihari, Chern Han Yong, Limsoon Wong books.acm.org www.morganclaypoolpublishers.com

ISBN: 978-1-97000-155-6 hardcover ISBN: 978-1-97000-152-5 paperback ISBN: 978-1-97000-153-2 ebook ISBN: 978-1-97000-154-9 ePub Series ISSN: 2374-6769 print 2374-6777 electronic DOIs: 10.1145/3064650 Book 10.1145/3064650.3064656 Chapter 5 10.1145/3064650.3064651 Preface 10.1145/3064650.3064657 Chapter 6 10.1145/3064650.3064652 Chapter 1 10.1145/3064650.3064658 Chapter 7 10.1145/3064650.3064653 Chapter 2 10.1145/3064650.3064659 Chapter 8 10.1145/3064650.3064654 Chapter 3 10.1145/3064650.3064660 Chapter 9 10.1145/3064650.3064655 Chapter 4 10.1145/3064650.3064661 References, Bios

A publication in the ACM Books series, #16 Editor in Chief: M. Tamer Ozsu,¨ University of Waterloo

First Edition 10987654321 Contents

Preface xi

Chapter 1 Introduction to Protein Complex Prediction 1 1.1 From Protein Interactions to Protein Complexes 6 1.2 Databases for Protein Complexes 11 1.3 Organization of the Rest of the Book 13

Chapter 2 Constructing Reliable Protein-Protein Interaction (PPI) Networks 15 2.1 High-Throughput Experimental Systems to Infer PPIs 15 2.2 Data Sources for PPIs 23 2.3 Topological Properties of PPI Networks 25 2.4 Theoretical Models for PPI Networks 27 2.5 Visualizing PPI Networks 31 2.6 Building High-Confidence PPI Networks 34 2.7 Enhancing PPI Networks by Integrating Functional Interactions 50

Chapter 3 Computational Methods for Protein Complex Prediction from PPI Networks 59 3.1 Basic Definitions and Terminologies 60 3.2 Taxonomy of Methods for Protein Complex Prediction 60 3.3 Methods Based Solely on PPI Network Clustering 61 3.4 Methods Incorporating Core-Attachment Structure 78 3.5 Methods Incorporating Functional Information 85

Chapter 4 Evaluating Protein Complex Prediction Methods 91 4.1 Evaluation Criteria and Methodology 91 4.2 Evaluation on Unweighted Yeast PPI Networks 93 4.3 Evaluation on Weighted Yeast PPI Networks 95 4.4 Evaluation on Human PPI Networks 99 x Contents

4.5 Case Study: Prediction of the Human Mechanistic Target of Rapamycin Complex 103 4.6 Take-Home Lessons from Evaluating Prediction Methods 105

Chapter 5 Open Challenges in Protein Complex Prediction 107 5.1 Three Main Challenges in Protein Complex Prediction 107 5.2 Identifying Sparse Protein Complexes 112 5.3 Identifying Overlapping Protein Complexes 118 5.4 Identifying Small Protein Complexes 124 5.5 Identifying Protein Sub-complexes 134 5.6 An Integrated System for Identifying Challenging Protein Complexes 136 5.7 Recent Methods for Protein Complex Prediction 138 5.8 Identifying Membrane-Protein Complexes 141

Chapter 6 Identifying Dynamic Protein Complexes 145 6.1 Dynamism of Protein Interactions and Protein Complexes 145 6.2 Identifying Temporal Protein Complexes 149 6.3 Intrinsic Disorder in 156 6.4 Intrinsic Disorder in Protein Interactions and Protein Complexes 157 6.5 Identifying Fuzzy Protein Complexes 162

Chapter 7 Identifying Evolutionarily Conserved Protein Complexes 165 7.1 Inferring Evolutionarily Conserved PPIs (Interologs) 166 7.2 Identifying Conserved Complexes from Interolog Networks, I 170 7.3 Identifying Conserved Complexes from Interolog Networks, II 178

Chapter 8 Protein Complex Prediction in the Era of Systems Biology 185 8.1 Constructing the Network of Protein Complexes 185 8.2 Identifying Protein Complexes Across Phenotypes 189 8.3 Identifying Protein Complexes in Diseases 192 8.4 Enhancing Quantitative Proteomics Using PPI Networks and Protein Complexes 208 8.5 Systems Biology Tools and Databases to Analyze Biomolecular Networks 221

Chapter 9 Conclusion 225

References 233 Authors’ Biographies 279 Preface

The suggestion and motivation to write this book came from Limsoon, who thought that it would be a great idea to compile our (Sriganesh’s and Chern Han’s) Ph.D. research conducted at National University of Singapore on protein complex predic- tion from protein-protein interaction (PPI) networks into a comprehensive book for the research community. Since we (Sriganesh and Chern Han) completed our Ph.D.s not long ago, the timing could not have been better for writing this book while the topic is still fresh in our minds and the empirical set up (datasets and software pipelines) for evaluating the methods is still in a “quick-to-run” form. However, although we had our Ph.D. theses to our convenient disposal and ref- erence, it is only after we started writing this book that we realized the real scale of the task that we had embarked upon. The problem of protein complex prediction may be just one of the plethora of computational problems that have opened up since the deluge of proteomics (protein-protein interaction; PPI) data over the last several years. However, in re- ality this problem encompasses or directly relates to several important and open problems in the area—in particular, the fundamental problems of modeling, vi- sualizing, and denoising of PPI networks, prediction of PPIs (novel as well as evo- lutionarily conserved), and protein function prediction from PPI data. Therefore, to write a comprehensive self-contained book, we had to cover even these closely related problems to some extent or at least allude to or reference them in the book. We had to do so without missing the connection between these problems and our central problem of protein complex prediction in the book. The early tone to write the book in this manner was set by our review article in a 2015 special issue of FEBS Letters, where we covered a number of protein com- plex prediction methods which are based on a diverse range of topological, func- tional, temporal, structural, and evolutionary information. However, being only a single-volume article, the description of the methods was brief, and to compile xii Preface

this description in the form of a book we had to delve a lot deeper into the algo- rithmic underpinnings of each of the methods, highlight how each method utilized the information (topological, functional, temporal, structural, and evolutionary) on which it was based in its own unique way, and evaluate and study the applications of the methods across a diverse range of datasets and scenarios. To do this well, we had to: (i) cover in substantial detail the preliminaries such as the experimen- tal techniques available to infer PPIs, the limitations of each of these techniques, PPI network topology, modeling, and denoising, PPI databases that are available, and how functional, temporal, structural, and evolutionary information of proteins can be integrated with PPI networks; and (ii) we had to categorize protein complex prediction methods into logical groups based on some criteria, and dedicate a sep- arate chapter for each group to make our description comprehensive. In the book, we cover (i) in Chapter 2 and in the form of independent sections within each of the other Chapters 3, 5, 6, 7, and 8. We cover (ii) by allocating Chapters 3 and 4 for “classical” methods and their comprehensive evaluation, Chapter 5 for methods that predict certain kinds of “challenging complexes” which the classical meth- ods do not predict well, Chapter 6 for methods that utilize temporal and structural information, Chapter 7 for methods that utilize information on evolutionary con- servation, and Chapter 8 for methods that integrate other kinds of omics datasets to predict “specialized” complexes—e.g., protein complexes in diseases. The requirement for a book exclusively dedicated to the problem of protein complex prediction from PPI networks at this point in time cannot be understated. Over the last two decades, a major focus of high-throughput experimental tech- nologies and of computational methods to analyze the generated data has been in genomics—e.g., in the analysis of genome sequencing data. It is relatively re- cently that this focus has started to shift toward proteomics and computational methods to analyze proteomics data. For example, while the complete sequence of the was assembled more than a decade ago, it is only over the last three years that there have been similar large-scale efforts to map the human proteome. The ProteomicsDB (http://www.proteomicsdb.org/), Human Proteome Map (http://humanproteomemap.org/), and the Human Protein Atlas projects (http://www.proteinatlas.org/) have all appeared only over the last three years. Sim- ilarly is The Cancer Proteome Atlas (TCPA) project (http://app1.bioinformatics .mdanderson.org/tcpa/_design/basic/index.html) which complements The Cancer Genome Atlas project (TCGA). This means that developing more effective solutions for fundamental problems such as protein complex prediction has become all the more important today, as we try to apply these solutions to larger and more com- plex datasets arising from these newer technologies and projects. In this respect, Preface xiii we had to write the book not just by considering protein complex prediction meth- ods (i.e., their algorithmic details) as important in their own right but also by giving significant importance to the applications of these methods in the light of today’s complex datasets and research questions. There are several sections within each of the Chapters 6, 7, and 8 that play this dual role, e.g., a section in Chapter 7 discusses the evolutionary conservation of core cellular processes based on conservation pat- terns of protein complexes, and a section in Chapter 8 discusses the dysregulation of these processes in diseases based on rewiring of protein complexes between normal and disease conditions. In the end, we hope that we have done justice to what we intend this book to be. We hope that this book provides valuable insights into protein complex prediction and inspires further research in the area especially for tackling the open challenges, as well as inspires new applications in diverse areas of biomedicine.

Acknowledgments Although this book is primarily concerned with the problem of protein complex prediction, the book also covers several other aspects of PPI networks. We would like to therefore dedicate this book to the students—Honors, Masters, and Ph.D. students—who worked on these different aspects of PPI networks by being part of the computational biology group at the Department of Computer Science, National University of Singapore, over the years. Several of the methods covered in this book are a result of the extensive research conducted by these students. Sriganesh would like to thank Hon Wai Leong (Professor of Computer Science, National University of Singapore) under whom he conducted his Ph.D. research on protein complex prediction; Mark Ragan (Head of Division of Genomics of Development and Disease at Institute for Molecular Bioscience, The University of Queensland) under whom he conducted his postdoctoral research, a substantial portion of which was on identifying protein complexes in diseases; and Kum Kum Khanna (Senior Principal Research Fellow and Group Leader at QIMR-Berghofer Medical Research Institute) whose guidance played a significant part in his understanding of biological aspects of protein complexes. Sriganesh is grateful to Mark for passing him an original copy of a 1977 volume of Progress in Biophysics and Molecular Biology in which G. Rickey Welch makes a consistent principled argument that “multienzyme clusters” are advantageous to the cell and organism because they enable metabolites to be channeled within the clusters and protein expression to be co-regulated [Welch 1977]—a possession which Sriganesh will deeply cherish. Chern Han would like to thank his coauthors: Sriganesh for doing the heavy lifting in writing, editing, and xiv Preface

driving this project and Limsoon Wong for guiding him through his Ph.D. journey on protein complexes. He would also like to acknowledge the support of Bin Tean Teh (Professor with Program in Cancer and Stem Cell Biology, Duke-NUS Medical School), who currently oversees his postdoctoral research. Limsoon would like to acknowledge Chern Han and Sriganesh for doing the bulk of the writing for this book, and especially thank Sriganesh for taking the overall lead on the project. When he suggested the book to Chern Han and Sriganesh, he had not imagined that he would eventually be a co-author. We are indebted also to the Editor-in-Chief of ACM Books, Tamer Ozsu,¨ Exec- utive Editor Diane Cerra, Production Manager Paul C. Anagnostopoulos, and the entire team at ACM Books and Morgan & Claypool Publishers for their encourage- ment and for producing this book so beautifully.

Sriganesh Srihari Chern Han Yong Limsoon Wong May 2017 Introduction to Protein Complex Prediction

Unfortunately, the proteome is much more complicated than the genome. —Carol Ezzell [Ezzel et al. 2002]

In an early survey, American biochemist Bruce Alberts termed large assemblies of proteins as protein machines of cells [Alberts et al. 1998]. Protein assemblies are composed of highly specialized parts that coordinate to execute almost all of the biochemical, signaling, and functional processes in cells [Alberts et al. 1998]. It is not hard to see why protein assemblies are more advantageous to cells than 1individual proteins working in an uncoordinated manner. Compare, for example, the speed and elegance of the DNA replication machinery that simultaneously replicates both strands of the DNA double helix with what could ensue if each of the individual components—DNA helicases for separating the double-stranded DNA into single stands, DNA polymerases for assembling nucleotides, DNA primase for generating the primers, and the sliding clamp to hold these enzymes onto the DNA—acted in an uncoordinated manner [Alberts et al. 1998]. Although what might seem like individual parts brought together to perform arbitrary functions, protein assemblies can be very specific and enormously complicated. For example, the spliceosome is composed of 5 small nuclear RNAs (snRNAs or “snurps”) and more than 50 proteins, and is thought to catalyze an ordered sequence of more than 10 RNA rearrangements at a time as it removes an intron from an RNA transcript [Alberts et al. 1998, Baker et al. 1998]. The discovery of this intron-splicing process won Phillip A. Sharp and Richard J. Roberts the 1993 Nobel Prize in Physiology or Medicine.1

1. http://www.nobelprize.org/nobel_prizes/medicine/laureates/1993/ 2 Chapter 1 Introduction to Protein Complex Prediction

Protein assemblies are known to be in the order of hundreds even in the simplest of eukaryotic cells. For example, more than 400 protein assemblies have been iden- tified in the single-celled eukaryote Saccharomyces cerevisiae (budding yeast) [Pu et al. 2009]. However, our knowledge of these protein assemblies is still fragmentary, as is our conception of how each of these assemblies work together to constitute the “higher level” functional architecture of cells. A faithful attempt toward identifica- tion and characterization of all protein assemblies is therefore crucial to elucidate the functioning of the cellular machinery. To identify the entire complement of protein assemblies, it is important to first crack the proteome—a concept so novel that the word “proteome” first ap- peared only around 20 years ago [Wilkins et al. 1996, Bryson 2003, Cox and Mann 2007]. The proteome, as defined in the UniProt Knowledgebase, is the entire complement of proteins expressed or derived from protein-coding in an organism [Bairoch and Apweiler 1996, UniProt 2015]. With the introduction of high-throughput experimental (proteomics) techniques including mass spectro- metric [Cox and Mann 2007, Aebersold and Mann 2003] and protein quantitative trait locus (QTL) technologies [Foss et al. 2007], mapping of proteins on a large scale has become feasible. Just like how genomics techniques (including genome sequencing) were first demonstrated in model organisms, proteome-mapping has progressed initially and most rapidly for model prokaryotes including Escherichia coli (bacteria) and model eukaryotes including Saccharomyces cerevisiae (budding or baker’s yeast), Drosophila melanogaster (fruit fly), Caenorhabditis elegans (a ne- matode), and Arabidopsis thaliana (a flowering plant). Table 1.1 summarizes the numbers of proteins or protein-coding genes identified from these organisms. Of these, the proportions of protein-coding genes that are essential (genes that are thought to be critical for the survival of the cell or organism; “fitness genes”) range from ∼2% in Drosophila to ∼6.5% in Caenorhabditis and ∼18% in Saccha- romyces [Cherry et al. 2012, Chen et al. 2012]. Recent landmark studies using large-scale proteomics [Wilhelm et al. 2014, Kim et al. 2014, Uhl´en et al. 2010, Uhl´en et al. 2015]onHomo sapiens (human) cells have characterized >17,000 (or >90%) putative protein-coding genes from ≥40 tissues and organs in the human body. An encyclopedic resource on these proteins covering their levels of expres- sion and abundance in different human tissues is available from the ProteomicsDB (http://www.proteomicsdb.org/)[Wilhelm et al. 2014], The Human Proteome Map (http://humanproteomemap.org/)[ Kim et al. 2014], and The Human Protein At- las (http://www.proteinatlas.org/)[Uhl´en et al. 2010, Uhl´en et al. 2015] projects. GeneCards (http://www.genecards.org/)[Safran et al. 2002, Safran et al. 2010] ag- gregates information on human protein-coding genes from >125 Web sources Chapter 1 Introduction to Protein Complex Prediction 3 Kim et al. 2014 , en et al. 2015 ] ´ Ishii et al. 2007 ] Picotti et al. 2013 ] Rhee et al. 2003 ] McDowall et al. 2015 ] Uhl (Norwegian rat), en et al. 2010 , ´ [ Huala et al. 2001 , [ Stein et al. 2001 ] [ Sprague et al. 2001 ] [ The Flybase Consortium 1996 ] [ Keseler et al. 2009 , [ Bult et al. 2010 ] [ Wilhelm et al. 2014 , [ Shimoyama et al. 2015 ] [ Cherry et al. 2012 , [ Wood et al. 2012 , [ Karpinka et al. 2015 ] Uhl Rattus norvegicus (African clawed frog) (house mouse), Xenopus laevis Mus musculus http://www.arabidopsis.org/ http://www.wormbase.org/ http://zfin.org/ http://flybase.org/ http://ecocyc.org/ , http://ecoli.iab.keio.ac.jp/ http://www.informatics.jax.org/ http://humanproteomemap.org/ , , http://www.proteomicsdb.org/ http://www.proteinatlas.org/ http://rgd.mcw.edu/ http://www.yeastgenome.org/ http://www.pombase.org/ http://www.xenbase.org/ (fission yeast), and (Zebrafish), No. of Proteins/ > 27,400 ∼ 20,500 > 26,000 ∼ 17,700 ∼ 4,300 ∼ 23,200 > 17,000 ∼ 29,600 ∼ 6,500 ∼ 5,100 ∼ 4,700 Danio rerio Organism Protein-Coding Genes Source References D. rerio E. coli H. sapiens S. cerevisiae X. laevis A. thaliana C. elegans D. melanogaster M. musculus R. norvegicus S. pombe Examples of proteome resourcescovering for also some model and higher-order organisms (as of December 2015), Schizosaccharomyces pombe Table 1.1 4 Chapter 1 Introduction to Protein Complex Prediction

and presents the information in an integrative user-friendly manner. The expres- sion levels of nearly 200 proteins that are essential for driving different human cancers are available from The Cancer Proteome Atlas (TCPA) project (http://app1 .bioinformatics.mdanderson.org/tcpa/_design/basic/index.html)[Li et al. 2013], measured from more than 3,000 tissue samples across 11 cancer types studied as part of The Cancer Genome Atlas (TCGA) project (http://cancergenome.nih .gov/). Short-hairpin RNA (shRNA)-mediated knockdown [Paddison et al. 2002, Lambeth and Smith 2013], clustered regularly interspaced short palindromic re- peats (CRISPR)/Cas9-based editing [Sanjana et al. 2014, Baltimore et al. 2015, Shalem et al. 2015], and disruptive mutagenesis [Bokel¨ 2008] screening using MCF-10A (near-normal mammary), MDA-MB-435 (breast cancer), KBM7 (chronic myeloid leukemia), HAP1 (haploid), A375 (melanoma), HCT116 (colorectal cancer), and HUES62 (human embryonic stem) cells have characterized 1,500–1,880 (or 8– 10%) “core” protein-coding genes as essential in human cells [Marcotte et al. 2016, Silva et al. 2008, Wang et al. 2014, Hart et al. 2015, Hart et al. 2014, Wang et al. 2015, Blomen et al. 2015]. Comparative analyses of proteomes from different species have revealed inter- esting insights into the evolution and conservation of proteins. For example, it is es- timated that the genomes (proteomes) of human and budding yeast diverged about 1 billion years ago from a common ancestor [Douzery et al. 2014], and these share several thousand genes accounting for more than one-third of the yeast genome [O’Brien et al. 2005, Ostlund¨ et al. 2010]. Yeast and human orthologs are highly diverged; the amino-acid sequence similarity between human and yeast proteins ranges from 9–92%, with a genome-wide average of 32%. But, sequence similarity predicts only a part of the picture [Sun et al. 2016]. Recent studies [Kachroo et al. 2015, Laurent et al. 2015] have reported that 414 (or nearly half of the) essential protein-coding genes in yeast could be “replaced” by human genes, with replace- ability depending on gene (protein) assemblies: genes in the same process tend to be similarly replaceable (e.g., sterol biosynthesis) or not replaceable (e.g., DNA replication initiation). Irrespective of whether in a lower-order model or a higher-order complex or- ganism, a protein has to physically interact with other proteins and biomolecules to remain functional. Estimates in human suggest that over 80% of proteins do not function alone, but instead interact to function as macromolecular assemblies [Bergg´ard et al. 2007]. This organization of individual proteins into assemblies is tightly regulated in cellular space and time, and is supported by protein conforma- tional changes, posttranslational modifications, and competitive binding [Gibson and Goldberg 2009]. On the basis of the stability (area of interaction surface and du- Chapter 1 Introduction to Protein Complex Prediction 5 ration of interaction) and partner specificity, the interactions between proteins are classified as homo- or hetero-oligomeric, obligate or non-obligate, and permanent or transient [Zhang 2009, Nooren and Thornton 2003]. Proteins in obligate inter- actions cannot exist as stable structures on their own and are frequently bound to their partners upon translation and folding, whereas proteins in non-obligate in- teractions can exist as stable structures in bound and unbound states. Obligate interactions are generally permanent or constitutive, which once formed exist for the entire lifetime of the proteins, whereas non-obligate interactions may be per- manent, or alternatively transient, wherein the protein interacts with its partners for a brief time period and dissociates after that. Depending on the functional, spatial, and temporal context of the interactions, protein assemblies are classified as pro- tein complexes, functional modules, and biochemical (metabolic) and signaling pathways. Protein complexes are the most basic forms of protein assemblies and consti- tute fundamental functional units within cells. Complexes are stoichiometrically stable structures and are formed from physical interactions between proteins com- ing together at a specific time and space. Complexes are responsible for a wide range of functions within cells including formation of cytoskeleton, transporta- tion of cargo, metabolism of substrates for the production of energy, replication of DNA, protection and maintenance of the genome, transcription and translation of genes to gene products, maintenance of protein turn over, and protection of cells from internal and external damaging agents. Complexes can be permanent—i.e., once assembled can function for the entire lifetime of cells (e.g., ribosomes)—or transient—i.e., assembled temporarily to perform a specific function and are disas- sembled after that (e.g., cell-cycle kinase-substrate complexes formed in a cell-cycle dependent manner). Functional modules are formed when two or more protein complexes interact with each other and often other biomolecules (viz. nucleic acids, sugars, lipids, small molecules, and individual proteins) at a specific time and space to perform a particular function and disassociate after that. This molecular organization has been termed “protein sociology” [Robinson et al. 2007]. For example, the DNA repli- cation machinery, highlighted earlier, is formed by a tightly coordinated assembly of DNA polymerases, DNA helicase, DNA primase, the sliding clamp and other com- plexes within the nucleus to ensure error-free replication of the DNA during cell division. Pathways are formed when sets of complexes and individual proteins inter- act via an ordered sequence of interactions to transduce signals (signaling path- ways) or metabolize substrates from one form to another (metabolic pathways). For 6 Chapter 1 Introduction to Protein Complex Prediction

example, the MAPK pathway is composed of a sequence of microtubule-associated protein kinases (MAPKs) that transduce signals from the cell membrane to the nucleus, to induce the transcription of specific genes within the nucleus. Unlike complexes and functional modules, pathways do not require all components to co-localize in time and space.

From Protein Interactions to Protein Complexes 1.1 Physical interactions between proteins are fundamental to the formation of protein complexes. Therefore, mapping the entire complement of protein interactions (the “interactome”) occurring within cells (in vivo) is crucial for identifying and charac- terizing complexes. However, inferring all interactions occurring during the entire lifetime of cells in an organism is challenging, and this challenge increases multi- fold as the complexity of the organism increases—e.g., for multicellular organisms made up of multiple cell types. The development of high-throughput proteomics technologies including yeast two-hybrid- (Y2H) [Fields and Song 1989], co-immunoprecipitation (Co-IP) [Golemis and Adams 2002] and affinity-purification (AP)-based [Rigaut et al. 1999] screens have revolutionized our ability to interrogate protein interactions on a massive scale, and have enabled global surveys of interactomes from a number of organisms. In particular, up to 70% of the interactions from model organisms in- cluding yeast [Ito et al. 2000, Uetz et al. 2000, Ho et al. 2002, Gavin et al. 2002, Gavin et al. 2006, Krogan et al. 2006], fly [Guruharsha et al. 2011], and nematode [Butland et al. 2005, Li et al. 2004] have been mapped, and the identification of interactions from higher-order multicellular organisms including species of flowering plant Arabidopsis, fish Danio (zebrafish), and several mammals—Mus musculus (house mouse), Rattus norvegicus (Norwegian rat), and humans—is rapidly underway; the interactions are cataloged in large public databases [Stark et al. 2011, Rolland et al. 2014]. The earliest and most widely used experimental techniques to capture binary interacting proteins on a high-throughput scale were mostly yeast two-hybrid (Y2H) [Fields and Song 1989]. However, datasets of protein interactions inferred from Y2H screens were found to have significant numbers of spurious interactions [Von Mering et al. 2002, Bader and Hogue 2002, Bader et al. 2004]. This is attributed in part to the nature of the Y2H protocol in which all potential interactors are tested within the same compartment (nucleus) even though some of these do not meet during their lifetimes due to compartmentalization (different subcellular localizations) within living cells. 1.1 From Protein Interactions to Protein Complexes 7

Co-immunoprecipitation or affinity-purification (Co-IP/AP) techniques were in- troduced later and these are more specific in detecting interactions between co- complexed proteins [Golemis and Adams 2002, Rigaut et al. 1999, Kocher¨ and Superti-Furga 2007]. In these protocols, cohesive groups or complexes of pro- teins are “pulled down,” from which the binary interactions between the proteins are individually inferred. However, this indirect inference could lead to over- or under-estimation of protein interactions. In the tandem affinity purification (TAP) procedure [Rigaut et al. 1999, Puig et al. 2001], proteins of interest (“baits”) are TAP-tagged and purified in an affinity column with potential interaction part- ners (“preys”). The pulled-down complexes are subjected to mass spectrometric (MS) analysis to identify individual components within the complexes. However, although more reliable than Y2H, the TAP/MS procedure can be elaborate and with the inclusion of MS, it can be expensive too. The exhaustiveness of TAP/MS depends on the baits used—there is no way to identify all possible complexes unless all possible baits are tested. Proteins which do not interact directly with the chosen bait but interact with one or more of the preys, might also get pulled down as part of the purified complex. In some cases, these proteins are indeed part of the real complex whereas in other cases these proteins are not (i.e., they are contaminants); therefore multiple purifications are required, possibly with each protein as a bait and as a prey, to identify the correct set of proteins within the complex. The TAP procedure therefore offers two successive affinity purifi- cations so that the chance of retained contaminants reduces significantly. Con- versely, a chosen bait might form a real complex with a set of proteins without actually interacting directly with every protein from the set, and therefore some proteins might not get pulled down as part of the purified complex. In these cases, multiple baits would need to be tested to assemble the complete complex. Moreover, since some proteins participate in more than one complex, multiple independent purifications are required to identify all hosting complexes for these proteins. Binary interactions between the proteins in a pulled-down protein complex are inferred using two models: matrix and spoke. In the matrix model, a binary interaction is inferred between every pair of proteins within the complex, whereas in the spoke model interactions are inferred only between the bait and all its preys. Since all pairs of proteins within a complex do not necessarily interact, the matrix model is usually an overestimation of the total number of binary interactions, whereas the spoke model is an underestimation. Therefore, usually a balance is struck between the two models that is close enough to the estimated total number of interactions for the species or organism. 8 Chapter 1 Introduction to Protein Complex Prediction

Table 1.2 Numbers of mapped physical interactions between proteins across different model and higher-order organisms

Organism No. of Interactions No. of Proteins A. thaliana 34,320 9,240 C. elegans 5,783 3,269 D. rerio 188 181 D. melanogaster 36,741 8,071 E. coli 99 104 H. sapiens 230,843 20,006 M. musculus 18,465 8,611 R. norvegicus 4,537 3,328 S. cerevisiae 82,327 6,278 S. pombe 9,492 2,944 X. laevis 532 471

Based on BioGrid version 3.4.130 (November 2015) [Stark et al. 2011, Chatr-Aryamontri et al. 2015].

Despite differences in procedures and technologies, the use of different experi- mental protocols can effectively complement one another in detecting interactions. While TAP can be more specific and detect mainly stable (co-complexed) protein interactions, Y2H can be more exhaustive and detect even transient and between- complex interactions. Based on BioGrid version 3.4.130 (November 2015) (http:/ /thebiogrid.org/)[Stark et al. 2011, Chatr-Aryamontri et al. 2015], the numbers of mapped physical interactions range from 99 in E. coli to ∼82,300 in S. cerevisiae and ∼230,900 in H. sapiens (summarized in Table 1.2). It remains to be seen how many of these interactions actually occur in the physiological contexts of living cells or cell types, how many are subject to genetic and physiological variations, and how many still remain to be mapped. The binary interactions inferred from the different experiments are assembled into a protein-protein interaction network, or simply, PPI network. The PPI network presents a global or “systems” view of the interactome, and provides a mathemati- cal (topological) framework to analyze these interactions. Protein complexes are ex- pected to be embedded as modular structures within the PPI network [Hartwell et al. 1999, Spirin and Mirny 2003]. Topologically, this modularity refers to densely con- nected subsets of proteins separated by less-dense regions in the network [Newman 1.1 From Protein Interactions to Protein Complexes 9

2004, Newman 2010]. Biologically, this modularity represents division of labor among the complexes, and provides robustness against disruptions to the network from internal (e.g., mutations) and external (e.g., chemical attacks) agents. Com- putational methods developed to identify protein complexes therefore mine for modular subnetworks in the PPI network. While this strategy appears reasonable in general, limitations in PPI datasets, arising due to the shortcomings highlighted above in experimental protocols, severely restrict the feasibility of accurately pre- dicting complexes from the network. Specifically, the limitations in existing PPI datasets that directly impact protein complex prediction include:

1. presence of a large number of spurious (noisy) interactions; 2. relative paucity of interactions between “complexed” proteins; and 3. missing contextual—e.g., temporal and spatial—information about the in- teractions.

These limitations translate to the following three main challenges currently faced by computational methods for protein complex prediction:

1. difficulty in detecting sparse complexes; 2. difficulty in detecting small (containing fewer than four proteins) and sub- complexes; and 3. difficulty in deconvoluting overlapping complexes (i.e., complexes that share many proteins), especially when these complexes occur under different cel- lular contexts.

While the interactome coverage can be improved by integrating multiple PPI datasets, the lack of agreement between the datasets from different experimental protocols [Von Mering et al. 2002, Bader et al. 2004], and the multifold increase in accompanying noise (spurious interactions), tend to cancel out the advantage gained from the increased coverage. Consequently, the confidence of each interac- tion has to be assessed (confidence scoring) and low-confidence interactions have to be first removed from the datasets (filtering) before performing any downstream analysis. To summarize, computational identification of protein complexes from interaction datasets follows these steps (Figure 1.1):

1. integrating interactions from multiple experiments and stringently assess- ing the confidence (reliability) of these interactions; 2. constructing a reliable PPI network using only the high-confidence interac- tions; Anaphase promoting complex 26S proteasome Potentially novel

20S proteasome Nuclear pore complex Potentially novel

NFKB1-NFKB2-RELA-RELB complex BAF complex Potentially novel (a) (b) Fanconi anemia core complex

Figure 1.1 Identification of protein complexes from protein interaction data. (a) A high-confidence PPI network is assembled from physical interactions between proteins after discarding low-confidence (potentially spurious) interactions. (b) Candidate protein complexes are predicted from this PPI network using network-clustering approaches. The quality of the predicted complexes is validated against bona fide complexes, whereas novel complexes are functionally assessed and assigned new roles where possible. 1.2 Databases for Protein Complexes 11

3. identifying modular subnetworks from the PPI network to generate a candi- date list of protein complexes; and 4. evaluating these candidate complexes against bona fide complexes, and validating and assigning roles for novel complexes.

As we shall see in the following chapters, several sophisticated approaches have been developed over the years to overcome some of the above-mentioned chal- lenges. Computational methods have co-evolved with proteomics technologies, and over the last ten years a plethora of computational methods have been developed to predict complexes from PPI networks, which is the subject of this book. In general, computational methods complement experimental approaches in several ways. These methods have helped counter some of the limitations arising in proteomic studies, e.g., by eliminating spurious interactions via interaction scoring, and by enriching true interactions via prediction of missing interactions. The novel in- teractions and protein complexes predicted from these methods have been added back to proteomics databases, and these have helped to further enhance our re- sources and knowledge in the field.

Databases for Protein Complexes 1.2 Several high-quality resources for protein complexes have been developed over the years covering both lower-order model and higher-order organisms (summarized in Table 1.3). In total, Aloy [Aloy et al. 2004], CYC2008 [Pu et al. 2009], and MIPS [Mewes et al. 2008] contain over 450 manually curated complexes from S. cerevisiae (budding yeast). CORUM [Reuepp et al. 2008, 2010] contains ∼3,000 mammalian complexes of which ∼1,970 are protein complexes identified from human cells. The European Molecular Biology Laboratory (EMBL) and European Bioinformatics Institute (EBI) maintain a database of manually curated protein complexes from 18 different species including C. elegans, H. sapiens, M. musculus, S. cerevisiae, and S. pombe [Meldal et al. 2015]. Havugimana et al. [2012] present a dataset of 622 putative human soluble pro- tein complexes (http://human.med.utoronto.ca/) identified using high-throughput AP/MS pulldown and PPI-clustering approaches. Huttlin et al. [2015] present 352 putative human complexes identified from human embryonic (HEK293T) cells (http://wren.hms.harvard.edu/bioplex/). Wanetal.[2015] present a catalog of con- served metazoan complexes (http://metazoa.med.utoronto.ca/) identified by clus- tering of high-quality pulldown interactions from C. elegans, D. melanogaster, H. sapiens, M. musculus, and Strongylocentrotus purpuratus (purple sea urchin). This 12 Chapter 1 Introduction to Protein Complex Prediction R. 2017 ] (399), and Strongylocentrotus S. cerevisiae , and [ Vinayagam et al. 2013 ] [ Meldal et al. 2015 ] [ Wan et al. 2015 ] [ Reuepp et al. 2008 , Ruepp et al. 2010 ] [ Aloy et al. 2004 ] [ Huttlin et al. 2015 , [ Havugimana et al. 2012 ] [ Mewes et al. 2008 ] [ Ori et al. 2016 ] [ Pu et al. 2009 ] [ Drew et al. 2017 ] (house mouse) (15%) and (404), M. musculus , M. musculus . The EMBL-EBI portal includes protein M. musculus H. sapiens , (64%), (441), S. cerevisiae H. sapiens H. sapiens , and D. melanogaster , H. sapiens , C.elegans http://www.flyrnai.org/compleat/ http://metazoa.med.utoronto.ca/ http://www.ebi.ac.uk/intact/complex/ http://mips.helmholtz-muenchen.de/genre/ proj/corum/ http://www.russelllab.org/complexes/ http://wren.hms.harvard.edu/bioplex/ http://human.med.utoronto.ca/ http://www.helmholtz-muenchen.de/en/ibis/ http://www.bork.embl.de/Docu/variable_ complexes/ http://wodaklab.org/cyc2008/ http://proteincomplexes.org (16 complexes), a C. elegans D. melanogaster No. of 102 354 622 313 408 > 4,600 3 species 8,886 Metazoa 344–490 18 species 1,564 S. cerevisiae H. sapiens (HEK293T) H. sapiens S. cerevisiae S. cerevisiae H. sapiens c b (purple sea urchin), consisting of 344 complexes with entirely ancient proteins and 490 complexes with largely ancient (12%) (Norwegian rat). (16). CORUM includes mammalian protein complexes, mainly from EMBL-EBI Complex Portal MIPS—CORUM Mammals 2,837 Database Organisms Complexes Source Reference Aloy BioPlex COMPLEAT Human Soluble Metazoan conserved MIPS—Yeast CYC2008 hu.MAP Ori Mammals 279 purpuratus proteins conserved ubiquitously among eurkaryotes. norvegicus c. Includes mainly conserved complexes among the metazoans, complexes from 18 different speciesS. of pombe which are Publicly available databases for protein complexes a. No. of complexes as ofb. 2016. COMPLEAT includes protein complexes from Table 1.3 1.3 Organization of the Rest of the Book 13

dataset includes ∼300 complexes composed of entirely ancient proteins (evolu- tionarily conserved from lower-order organisms), and ∼500 complexes composed of largely ancient proteins conserved ubiquitously among eurkaryotes. Drew et al. [2017] present a comprehensive catalog of >4,600 computationally predicted human protein complexes covering >7,700 proteins and >56,000 interactions by analyzing data from >9,000 published mass spectrometry experiments. Vinayagam et al. [2013] present COMPLEAT (http://www.flyrnai.org/compleat/), a database of 3,077, 3,636, and 2,173 literature-curated protein complexes from D. melanogaster, H. sapiens, and S. cerevisiae, respectively. Ori et al. [2016] combined mammalian complexes from CORUM and COMPLEAT to generate a dataset of 279 protein com- plexes from mammals.

Organization of the Rest of the Book 1.3 The rest of this book reads as follows. Chapter 2 discusses important concepts underlying PPI networks and presents prerequisites for understanding subse- quent chapters. We discuss different high-throughput experimental techniques employed to infer PPIs (including the Y2H and AP/MS techniques mentioned ear- lier), explaining briefly the biological and biochemical concepts underlying these techniques and highlighting their strengths and weaknesses. We explain compu- tational approaches that denoise (PPI weighting) and integrate data from multiple experiments to construct reliable PPI networks. We also discuss topological proper- ties of PPI networks, theoretical models for PPI networks, and the various databases and software tools that catalog and visualize PPI networks. Chapter 3 forms the main crux of this book as it introduces and discusses in depth the algorithmic underpinnings of some of the classical (seminal) computational methods to iden- tify protein complexes from PPI networks. While some of these methods work solely on the topology of the PPI network, others incorporate additional biolog- ical information—e.g., in the form of functional annotations—with PPI network topology to improve their predictions. Chapter 4 presents a comprehensive em- pirical evaluation of six widely used protein complex prediction methods available in the literature using unweighted and weighted PPI networks from yeast and hu- man. Taking a known human protein complex as an example, we discuss how the methods have fared in recovering this complex from the PPI network. Based on this evaluation, we explain in Chapter 5 the shortcomings of current methods in detecting certain kinds of protein complexes, e.g., protein complexes that are sparse or that overlap with other complexes. Through this, we highlight the open challenges that need to be tackled to improve coverage and accuracy of protein 14 Chapter 1 Introduction to Protein Complex Prediction

complex prediction. We discuss some recently proposed methods that attempt to tackle these open challenges and to what extent these methods have been suc- cessful. Chapter 6 is dedicated to an important class of protein complexes that are dynamic in their protein composition and assembly. While some of these protein complexes are temporal in nature—i.e., assemble at a specific timepoint and dis- sociate after that—others are structurally variable—e.g., change their 3D structure and/or composition—based on the cellular context. Quite obviously, it is not pos- sible to detect dynamic complexes solely by analyzing the PPI network; methods that integrate gene or protein expression and 3D structural information are re- quired. These more-sophisticated methods are covered here. Chapter 7 discusses methods to identify protein complexes that are conserved between organisms or species; these evolutionarily conserved complexes provide important insights into the conservation of cellular processes through the evolution. Finally, in today’s era of systems biology where biological systems are studied as a complex interplay of multiple (biomolecular) entities, we explain how protein complex prediction methods are playing a crucial role in shaping up the field; these applications are covered in Chapter 8. We discuss the application of these methods for predicting dysregulated or dysfunctional protein complexes, identifying rewiring of interac- tions within complexes, and in discovery of new disease genes and drug targets. We conclude the book in Chapter 9 by reiterating the diverse applications of pro- tein complex prediction methods and thereby the importance of computational methods in driving this exciting field of research. Constructing Reliable Protein-Protein Interaction (PPI) Networks

No molecule arising naturally (MAN) is an island, entire of itself. —John Donne (1573–1631), English poet and cleric (modified [Dunn 2010] from original quote, “No man is an island, entire of itself.”)

2The identification of PPIs yields insights into functional relationships between pro- teins. Over the years, a number of different experimental techniques have been developed to infer PPIs. This inference of PPIs is orthogonal, but also complemen- tary, to experiments inferring genetic interactions; both provide lists of candidate interactions and implicate functional relationships between proteins [Morris et al. 2014].

High-Throughput Experimental Systems to Infer PPIs 2.1 Physical interactions between proteins are inferred using different biochemical, biophysical, and genetic techniques (summarized in Table 2.1). Yeast two-hybrid (Y2H; less commonly, YTH) [Ito et al. 2000, Uetz et al. 2000, Ho et al. 2002] and protein-fragment complement assays [Michnick 2003, Remy and Michnick 2004, Remy et al. 2007] enable identification of direct binary physical interactions be- tween the proteins, whereas co-immunoprecipitation or affinity purification assays [Golemis and Adams 2002, Rigaut et al. 1999, Kocher¨ and Superti-Furga 2007, Dunham et al. 2012] enable pull down of whole protein complexes from which the binary interactions are inferred. Protein-fragment complementation assay (PCA) 16 Chapter 2 Constructing Reliable Protein-Protein Interaction (PPI) Networks Ho et Rigaut et Petschnigg Tarassov et al. Kittanakom et al. Remy and Michnick Uetz et al. 2000 , Yao et al. 2017 ] Lalonde et al. 2010 , Remy et al. 2007 , al. 2002 ] al. 1999 ] [ Michnick 2003 , 2004 , 2008 ] Key References [ Lalonde et al. 2008 , 2009 , et al. 2014 , ractions; these techniques can be employed in a high-throughput Binary interactions; can infer membrane-protein interactions Binary interactionsCo-complex relationships [ Ito et al. , 2000 [ Golemis and Adams 2002 , Binary interactions between membrane proteins (yeast, (yeast, (yeast, ies for potential interactors In vivo mammalian) Cell Assay Interaction Type In vivo mammalian) In vitro In vivo mammalian) Protein-fragment complement (PCA) Experimental Technique Yeast two-hybrid (Y2H) Co-immunoprecipitation precipitation followed by mass spectrometry (Co-IP/MS) Membrane yeast two- hybrid (MYTH), Mammalian membrane YTH (MaMTH) Experimental techniques for screening protein inte manner to screen whole protein librar Table 2.1 2.1 High-Throughput Experimental Systems to Infer PPIs 17 coupled with biomolecular fluorescence complementation (BIFC) [Grinberg et al. 2004] enables mapping of interaction surfaces of proteins, and is thus a good tool to confirm protein binding. Membrane YTH and mammalian membrane YTH (MaMTH) [Lalonde et al. 2008, Kittanakom et al. 2009, Lalonde et al. 2010, Petschnigg et al. 2014, Yao et al. 2017] enable identification of interactions in- volving membrane or membrane-bound proteins which are typically difficult to identify using traditional Y2H and AP techniques. Techniques inferring genetic in- teractions [Brown et al. 2006] enable detection of functional associations or genetic relationships between the proteins (genes), but these associations do not always correspond to physical interactions. Here, we present only an overview of each of the experimental techniques; for a more descriptive survey, the readers are referred to Br¨uckneret al. [2009], Shoemaker and Panchenko [2007], and Snider et al. [2015].

Yeast Two-Hybrid (Y2H) Screening System Y2H was first described by Fields and Song [1989] and is based on the modularity of binding domains in eukaryotic transcription factors. Eukaryotic transcription factors have at least two distinct domains: (1) the DNA binding domain (BD) which directs binding to a promoter DNA sequence (upstream activating sequence (UAS)); and (2) the transcription activating domain (AD) which activates the transcription of target reporter genes. Splitting the BD and AD domains inactivates transcription, but even indirectly connecting AD and BD can restore transcription resulting in activation of specific reporter genes. Plasmids are engineered to produce a protein product (chimeric or “hybrid”) in which the BD fragment is in-frame fused onto a protein of interest (the bait), while the AD fragment is in-frame fused onto another protein (the prey) (Figure 2.1). The plasmids are then transfected into cells chosen for the screening method, usually from yeast. If the bait and prey proteins interact, the AD and BD domains are indirectly connected, resulting in the activation of reporters within nuclei of cells. Typically, multiple independent yeast colonies are assayed for each combination of plasmids to account for the heterogeneity in protein expression levels and their ability to activate reporter transcription. This basic Y2H technique has been improved over the years to enable large library screening [Chien et al. 1991, Dufree et al. 1993, Gyuris et al. 1993, Finley and Brent 1994]. Interaction mating is one such protocol that can screen more than one bait against a library of preys, and can save considerable time and materials. In this protocol, the AD- and BD-fused proteins begin in two different haploid yeast strains with opposite mating types. These proteins are brought together by mating, a process in which two haploid cells fuse to form a single diploid cell. The diploids are then tested using conventional reporter activation for possible interactors. 18 Chapter 2 Constructing Reliable Protein-Protein Interaction (PPI) Networks

Prey AD Chimeric plasmids with in-frame fused bait and prey

Bait

BD

UAS Reporter gene

Figure 2.1 Schematic representation of the yeast two-hybrid protocol to detect interaction between bait and prey proteins. If the bait and prey proteins interact, the DNA binding domain (BD) fused to the bait and the transcription activing domain (AD) fused to the prey are indirectly connected resulting in the activation of the reporter gene. UAS: upstream activating sequence (promoter) of the reporter gene.

Therefore, different bait-expressing strains can be mated with a library of prey- expressing strains, and the resulting diploids can be screened for interactors. It is important to know how many viable diploids have arisen and to determine the false-positive frequency of the detected interactions. True interactors tend to come up in a timeframe specific for each given bait, with false positives clustering at a different timepoint. Multiple yeast colonies are assayed to confirm the interactors. Y2H screens have been extensively used to detect protein interactions among yeast proteins, with two of the earliest studies reporting 692 [Uetz et al. 2000] and 841 [Ito et al. 2000] interactions for S. cerevisiae. In the bacteria Helicobacter py- lori, one of the first applications of Y2H identified over 1,200 interactions, covering about 47% of the bacterial proteome [Rain et al. 2001]. Applications on fly pro- ceeded on an even greater scale when Giot et al. [2003] identified 10,021 protein interactions involving 4,500 proteins in D. melanogaster. More recently, Vo et al. [2016] used Y2H to map binary interactions in the yeast S. pombe (fission yeast). This network, called FissionNet, consisted of 2,278 interactions covering 4,989 protein- coding genes in S. pombe. The Y2H system has also been applied for humans, with two initial studies [Rual et al. 2005, Stelzl et al. 2005] yielding over 5,000 interac- tions among human proteins. More recently, Rolland et al. [2014] employed Y2H to characterize nearly 14,000 human interactions. 2.1 High-Throughput Experimental Systems to Infer PPIs 19

However, inherent to this type of library screening, the number of detected false-positive interactions is usually high. Among the possible reasons for the gen- eration of false positives is that the experimental compartmentalization (within the nucleus) for bait and prey proteins does not correspond to the natural cellular compartmentalization. Moreover, proteins that are not correctly folded under ex- perimental conditions or are “sticky” may show non-specific interactions. The third source of false positives is the interaction of the preys themselves with reporter pro- teins, which can turn on the reporter genes. Von Mering et al. [2002] estimated the accuracy of classic Y2H to be less than 10%, with subsequent evaluations suggest- ing the number of false positives to be between 50% and 70% in large-scale Y2H interaction datasets for yeast [Bader and Hogue 2002, Bader et al. 2004].

Co-Immunoprecipitation/Affinity Purification (AP) Followed by Mass Spectrometry (Co-IP/AP followed by MS) Complementing the in vivo Y2H screens are the in vitro Co-IP/AP followed by MS screens that identify whole complexes of interacting proteins, from which the bi- nary interactions between proteins can be inferred [Golemis and Adams 2002, Rigaut et al. 1999, Kocher¨ and Superti-Furga 2007, Dunham et al. 2012]. The Co- IP/AP followed by MS screens consist of two steps: co-immunoprecipitation/affinity purification and mass spectrometry (Figure 2.2). In the first step, cells are lysed

MS

Harvest and Incubate lysate with Antibody binds to Elute Separate out Cut bands from gel; lyse cells tagged proteins; bait; nonbinding complex the eluted identify proteins by introduce specific proteins are proteins mass spectrometry antibody for washed; retain using gel protein of proteins (bait and electrophoresis interest (bait) preys) in complex

Figure 2.2 Schematic representation of the co-immunoprecipitation/affinity purification followed by mass spectrometry (Co-IP/AP followed by MS) protocol. The protein of interest (bait) is targeted with a specific antibody and pulled down with its interactors in a cell lysate buffer. The individual components of the pulled-down complex are identified using mass spectrometry. These days, liquid chromatography with mass spectrometry (LCMS) instead of running the gel is increasingly being used more frequently for as a combined physical-separation and MS-analysis technique [Pitt 2009]. 20 Chapter 2 Constructing Reliable Protein-Protein Interaction (PPI) Networks

in a radioimmunoprecipitation assay (RIPA) buffer. The RIPA buffer enables effi- cient cell lysis and protein solubilization while avoiding protein degradation and interference with biological activity of the proteins. A known member of the set of proteins (the protein of interest or bait) is epitope-tagged and is either immunopre- cipitated using a specific antibody against the tag or purified using affinity columns recognizing the tag, giving the interacting partners (preys) of the bait. Normally, this purification step is more effective when two consecutive purification steps are used with proteins that are doubly tagged (hence called tandem affinity purification or TAP). This results in an enrichment of native multi-protein complexes containing the bait. The individual components within each such purified complex are then screened by gel electrophoresis and identified using mass spectrometry. In one of the first applications of TAP/MS, Ho et al. [2002] expressed 10% of the coding open reading frames from yeast, and the identified interactions connected 25% of the yeast proteome as multi-protein complexes. Subsequently, Gavin et al. [2002], Gavin et al. [2006], and Krogan et al. [2006] purified 1,993 and 2,357 TAP- tagged proteins covering 60% and 72% of the yeast proteome, and identified 7,592 and 7,123 protein interactions from yeast, respectively. One of the first proof-of- concept studies for humans applied AP/MS to characterize interactors using 338 bait proteins that were selected based on their putative involvement in diseases, and the study identified 6,463 interactions between 2,235 proteins [Ewing et al. 2007].

Comparison of Y2H and AP/MS Experimental Techniques A majority of the interaction data collected so far has come from Y2H screening. For example, approximately half of the data available in databases including IntAct [Hermjakob et al. 2004, Kerrien et al. 2012] and MINT [Zanzoni et al. 2002, Chatr- Aryamontri et al. 2007] are from Y2H screens [Br¨uckneret al. 2009] (more sources of PPI data are listed in Table 2.2). This could in part be attributed to the inaccessi- bility of mass spectrometry due to the expensive large equipment that is required. But, in general, Y2H and AP/MS techniques are complementary in the kind of in- teractors they detect. If a set of proteins form a stable complex, then an AP/MS screen can determine all the proteins within the complex, but may not necessarily confirm every interacting pair (the binary interactions) within the complex. On the other hand, a Y2H screen can detect whether any given two proteins directly inter- act. While stable interactions between co-complexed proteins can be accurately determined using AP/MS techniques, Y2H techniques are useful for identifying transient interactions between the proteins. However, due to considerable func- 2.1 High-Throughput Experimental Systems to Infer PPIs 21 tional cross-talk within cells, Y2H can also report an interaction even when the proteins are not directly connected. In addition, some types of interactions can be missed in Y2H due to inherent limitations in the technique—e.g., interactions in- volving membrane proteins, or proteins requiring posttranslational modifications to interact—but these limitations may also occur with AP/MS-based approaches [Br¨uckneret al. 2009]. Therefore, only a combination of different approaches that necessarily also includes computational methods (to filter out the incorrectly de- tected interactions) will eventually lead to a fairly complete characterization of all physiologically relevant interactions in a given cell or organism.

Protein-Fragment Complementation Assay (PCA) PCA is a relatively new technique which can detect in vivo protein interactions as well as their modulation or spatial and temporal changes [Michnick 2003, Morell et al. 2009, Tarassov et al. 2008]. Similar to Y2H, PCA is based on the principle of splitting a reporter protein into two fragments, each of which cannot function alone [Michnick 2003]. However, unlike Y2H, PCA is based on the formation of a biomolecular complex between the bait and prey, where both are fused to the split domains of the reporter. Importantly, the formation of this complex occurs in competition with alternative endogenous interaction partners present within the cell. The interaction brings the two split fragments in proximity enabling their non-covalent reassembly, folding, and recovery of protein reporter function [Morell et al. 2009]. Typically, the reporter proteins are fluorescent proteins, and the for- mation of biomolecular complexes is visualized using biomolecular fluorescence complementation (BIFC). BIFC can also be used to map the interaction surfaces of these complexes. This enables investigation of competitive binding between mu- tually exclusive interaction partners as well as comparison of their intracellular distributions [Grinberg et al. 2004]. PCA can be used as a screening tool to identify potential interaction partners of a specific protein [Remy and Michnick 2004, Remy et al. 2007], or to validate the interactions detected from other techniques such as Y2H [Vo et al. 2016]. In one of the first applications of PCA on a genome-wide in vivo scale, Tarassov et al. [2008] identified 2,770 interactions among 1,124 proteins from S. cerevisiae. Vo et al. [2016] used PCA as an orthogonal assay to reconfirm the interactions detected in S. pombe (from the FissionNet network consisting of 2,278 interactions; discussed earlier). PCA has also been employed to validate interactions between membrane proteins or membrane-associated proteins [Babu et al. 2012, Shoemaker and Panchenko 2007] (discussed next). 22 Chapter 2 Constructing Reliable Protein-Protein Interaction (PPI) Networks

Techniques for Inferring Membrane-Protein Interactions Membrane proteins are attached to or associated with membranes of cells or their organelles, and constitute approximately 30% of the proteomes of organisms [Carpenter et al. 2008, Von Heijne 2007, Byrne and Iwata 2002]. Being non-polar (hydrophobic), membrane proteins are difficult to crystallize using traditional X- ray crystallography compared to soluble proteins, and are the least studied among all proteins using high-throughput proteomics techniques [Carpenter et al. 2008]. Membrane proteins are involved in the transportation of ions, metabolites, and larger molecules such as proteins, RNA, and lipids across membranes, in sending and receiving chemical signals and propagating electrical impulses across mem- branes, in anchoring enzymes and other proteins to membranes, in controlling membrane lipid composition, and in organizing and maintaining the shape of organelles and the cell itself [Lodish et al. 2000]. In humans, the G-protein-coupled- receptors (GPCRs), which are membrane proteins involved in signal transduction across membranes, alone account for 15% of all membrane proteins; and 30% of all drug targets are GPCRs [Von Heijne 2007]. Due to the key roles of membrane pro- teins, identifying interactions involving these proteins has important applications especially in drug development. Membrane protein complexes are notoriously difficult to study using traditional high-throughput techniques [Lalonde et al. 2008]. Intact membrane-protein com- plexes are difficult to pull down using conventional AP/MS systems. This is due in part to the hydrophobic nature of membrane proteins as well as the ready dissocia- tion of subunit interactions, either between trans-membrane subunits or between trans-membrane and cytoplasmic subunits [Barrera et al. 2008]. Further, mem- brane protein structure is difficult to study by commonly used high-resolution methods including X-ray crystallography and NMR spectroscopy. A major avenue by which one can understand membrane proteins and their complexes is by mapping the membrane-protein “subinteractome”—the subset of interactions involving membrane proteins. Conventional Y2H system is confined to the nucleus of the cell thereby excluding the study of membrane proteins. New biochemical techniques have been developed to facilitate the characterization of interactions among membrane proteins. Among these is the split-ubiquitin mem- brane yeast two-hybrid (MYTH) system [Miller et al. 2005, Kittanakom et al. 2009, Stagljar et al. 1998, Petschnigg et al. 2012]. This system is based on ubiquitin, an evolutionarily conserved 76-amino acid protein that serves as a tag for proteins targeted for degradation by the 26S proteasome. The presence of ubiquitin is recog- nized by ubiquitin-specific proteases (UBPs) located in the nucleus and cytoplasm of all eukaryotic cells. Ubiquitin can be split and expressed as two halves: the amino- 2.2 Data Sources for PPIs 23

terminal (N) and the carboxyl terminal (C). These two halves have a high affinity for each other in the cell and can reconstitute to form pseudo-ubiquitin that is recog- nizable by UBPs. In MYTH, the bait proteins are fused to the C-terminal of a split-ubiquitin, and the prey proteins are fused to the N-terminal. The two halves reconstitute into a pseudo-ubiquitin protein if there is affinity between the bait and prey proteins. This pseudo-ubiquitin is recognized by UBPs, which cleaves after the C-terminus of ubiquitin to release the transcription factor, which then enters the nucleus to activate reporter genes. Two of the earliest studies using the MYTH screens reported a fair number of interactions among membrane proteins from yeast: 343 interactions among 179 proteins by Lalonde et al. [2010], and 808 interactions among 536 proteins by Miller et al. [2005]. PCA has also been adopted to identify and/or verify membrane-protein interactions. For example, Babu et al. [2012] used PCA to validate and integrate 1,726 yeast membrane-protein interactions obtained from multiple studies, and these encompassed 501 putative membrane protein complexes. The mammalian version of membrane yeast two-hybrid, MaMTH, is also based on the split-ubiquitin assay and is derived from the MYTH assay. Stagljar and col- leagues [Petschnigg et al. 2014, Yao et al. 2017] used MaMTH to probe interactions involving the epidermal growth factor receptor/receptor tyrosine-protein kinase (RTK) ErbB-1 (EGFR/ERBB1), Erb-B2 receptor tyrosine kinase 2 (ERBB2), and other RTKs that localize to the plasma membrane in human cells. When applied to hu- man lung cancer cells, the assay identified 124 interactors for wild-type and mutant EGFR [Petschnigg et al. 2014].

Data Sources for PPIs 2.2 Several public and proprietary databases now catalog protein interactions from both lower-order model and higher-order organisms (summarized in Table 2.2). These databases contain PPI data in an acceptable format required for data depo- sition, such as IMEx (http://www.imexconsortium.org/submit-your-data)[Orchard et al. 2012]. The Biomolecular Interaction Network Database (BIND) [Bader et al. 2003], now called Biomolecular Object Network Database (BOND), includes experi- mentally determined protein-protein, protein-small molecule, and protein-nucleic acid interactions. BioGrid [Stark et al. 2011] catalogs physical and genetic inter- actions inferred from multiple high-throughput experiments. The Database of In- teracting Proteins (DIP) [Xenarios et al. 2002] contains experimentally determined protein interactions with a “core” subset of interactions that have passed quality 24 Chapter 2 Constructing Reliable Protein-Protein Interaction (PPI) Networks Yu et al. 2011 ] Szklarczyk et al. 2011 ] Kerrien et al. 2012 ] Salwinski et al. 2004 ] Chatr-Aryamontri et al. Yu et al. 2008 , Turner et al. 2010 ] Prasad et al. 2009 ] protein-DNA interactions Persico et al. 2005 ] uldener et al. 2005 ] Reference [ Peri et al. 2004 , [ Brown and Jurisica 2005 ] [ Lynn et al. 2008 ] [ Razick et al. 2008 , [ Zanzoni et al. 2002 , [ Mewes et al. 2008 ] [ Pagel et al. 2005 ] [ Von Mering et al. 2003 , 2007 , [ Bader et al. 2003 ] [ Rolland et al. 2014 , [ G¨ [ Xenarios et al. 2002 , [ Hermjakob et al. 2004 , [ Chen et al. 2009 ] [ Stark et al. 2011 ] , protein-small molecule, and http://irefindex.org/wiki/index.php?title=iRefIndex http://string-db.org/ http://www.hprd.org/ http://ophid.utoronto.ca/ http://www.innatedb.com/ http://mint.bio.uniroma2.it/mint/ http://mips.helmholtz-muenchen.de/proj/ppi/ http://mips.helmholtz-muenchen.de/proj/ppi/ http://bind.ca http://interactome.dfci.harvard.edu/ http://mips.helmholtz-muenchen.de/genre/proj/yeast/ http://www.ebi.ac.uk/intact/ http://dip.doe-mbi.ucla.edu http://discern.uits.iu.edu:8340/HAPPI/ http://thebiogrid.org iRefIndex STRING HPRD OPHID/IID InnateDB MINT/HomoMINT MIPS MPPI BIND CYGD EMBL-EBI IntAct CCSB DIP HAPPI PPI DatabaseBioGrid Source Public and proprietary databases for protein-protein Table 2.2 2.3 Topological Properties of PPI Networks 25

assessment (for example, based on literature verification). The Centre for Can- cer Systems Biology (CCSB) Interactome Database at Harvard [Rolland et al. 2014, Yu et al. 2008, Yu et al. 2011] contains yeast, plant, virus, and human interac- tions. STRING [Von Mering et al. 2003, Szklarczyk et al. 2011] catalogs physical and functional interactions inferred from experimental and computational techniques. MIPS Comprehensive Yeast Genome Database (CYGD) [G¨uldener et al. 2005] and MIPS Mammalian Protein-Protein Interaction Database (MPPI) [Pagel et al. 2005] catalog protein interactions and also expert-curated protein complexes from yeast and mammals. The Human Protein Reference Database (HPRD) [Peri et al. 2004, Prasad et al. 2009] mainly contains experimentally identified human interactions. IRefIndex [Razick et al. 2008, Turner et al. 2010] and Integrated Interaction Data- base (IID) [Brown and Jurisica 2005] integrate experimental and computationally predicted interactions for human and several other species.

Topological Properties of PPI Networks 2.3 A simple yet effective way to represent interaction data is in the form of an undi- rected network called a protein-protein interaction network or simply PPI network, given as G =V , E, where V is the set of proteins and E is the set of physical interactions between the proteins. Such a network presents a global or “systems” view of the entire set of proteins and their interactions, and provides a topolog- ical (mathematical) framework to interrogate the interactions. In the definitions throughout this book, we also use V (G) and E(G) to refer to the set of proteins and ∈ interactions of a (sub)network of G. For a protein v V , the set N(v)or Nv includes all immediate neighbors of v, and deg(v) =|N(v)| is the degree of v. These neigh- = ={ ∈ }∪{ ∈ bors together with their interactions, Ev E(v) (v, u) : u N(v) (u, w) : u N(v), w ∈ N(v)∩ N(u)}, constitute the local (immediate) neighborhood subnetwork of v. PPI networks, like most real-world networks, have characteristic topological properties which are distinct from that of random networks. But, to understand this distinction we need to first understand what are random networks. Traditionally, random networks have been described using the Erd¨os-R´enyi (ER) model, in which G(n, p) is a random network with |V |=n nodes and each possible edge connecting pairs of these nodes has probability p of existing [Erdos¨ 1960, Bollob´as 1985]. The n expected number of edges in the network is 2 p, and the expected mean degree is np. Alternatively, a random network is defined as a network chosen uniformly at n (2) random from the collection m of all possible networks with n nodes and m edges. If p is the probability for the existence of an edge, the probability for each network 26 Chapter 2 Constructing Reliable Protein-Protein Interaction (PPI) Networks

in this collection is,

m n −m p (1 − p)(2) , (2.1)

where the closer p gets to 1, the more skewed the collection is toward networks with higher number of edges. When p = 1/2, the collection contains all possible n n (2) (2) m networks with equal probability, (1/2) . A marked feature of PPI networks that makes them non-random is the extremely broad distribution of degrees of proteins: While a majority of proteins have small number of immediate binding partners, there exist some proteins, referred to as “hubs,” with unusually large number of binding partners. Moreover, the degrees of most proteins are larger than the average node degree of the network. This degree distribution P(k)can be approximated by a power law P(k)∼ k−γ , where k is the node degree and γ is a small constant. Such networks are called scale-free networks [Barab´asi 1999, Albert and Barab´asi 2002]. Proteins within PPI networks exhibit an inherent tendency to cluster, which can be quantified by the local clustering coefficient for these proteins [Watts and Strogatz 1998]. For a selected protein v that is connected to |N(v)| other nodes, if these immediate neighbors were part of a clique (a complete subgraph), there |N(v)|.(|N(v)|−1) will be 2 interactions between them. The ratio of the actual number of interactions that exist between these nodes and the total number of possible interactions gives the local clustering coefficient CC(v) for the protein v: 2|{(u, w) : u, w ∈ N(v), (u, w) ∈ E}| CC(v) = . (2.2) |N(v)| . (|N(v)|−1) The average clustering coefficient of the entire network is therefore given by 1 CC(G) = CC(v). (2.3) |V (G)| v∈V (G) This clustering property gives rise to groups (subnetworks) of proteins in the PPI network that exhibit dense interactions within the groups, but sparse or less-dense interactions between the groups. The interactions between the groups occur via a few “central” proteins through which pass most paths in the network. The average shortest path length d(G)¯ is given by 2 u =v(u, v) d(G)¯ = , (2.4) |V (G)| . (|V (G)|−1) where (u, v)is the length of the shortest path between two connected nodes u and v. This average shortest path length of the PPI network is significantly shorter than 2.4 Theoretical Models for PPI Networks 27

that of an ER random network; in an ER network, the average shortest path length is proportional to ln n. However, both PPI networks as well as ER networks fall under “small-world” networks, because the average path length between any two nodes is still significantly smaller than the network size: n>>k>>ln n>>1, where k is the average node degree. The average clustering coefficient of a PPI network is significantly higher than a random network constructed on the same protein set and with the same average shortest path length. This small-world property can also be quantified using two coefficients: close- ness and betweenness [Hormozdiari et al. 2007]. The closeness of v is defined based on its average shortest path length to all other proteins reachable from v (that is, within the same connected component as v) in the network,

|V (G)|−1 Cl(v) = , (2.5) u∈R(G,v)(u, v)

where R(G, v) is the set of proteins reachable from v. Closeness is thus the inverse of the average distance of v to all other nodes in the network. The betweenness of the node v measures the extent to which v lies “between” any pair of proteins

connected to v in the network. Let Sxy be the number of shortest paths between all ∈ pairs x, y R(G, v), and let Sxy(v) be the number of these shortest paths that pass through v. The betweenness Bet(v) of v is defined as: Sxy(v) Bet(v) = . (2.6) S x,y∈R(G,v),x =y xy

The distributions of closeness and betweenness coefficients for PPI networks are significantly different from that of random networks [Hormozdiari et al. 2007]. In particular, PPI networks contain central proteins which have high betweenness and these connect and hold together different groups of proteins or regions of the network.

Theoretical Models for PPI Networks 2.4 Studying the topological properties of PPI networks has gained considerable at- tention in the last several years, and in particular various network models have been proposed to describe PPI networks. As new PPI data became available, these theoretical models have been improved to accurately model the data. These in- clude the earliest models such as the Erdos-R¨ ´enyi model [Erdos¨ 1960, Bollob´as 1985], and more recent ones such as the Barab´asi-Albert [Barab´asi 1999, Albert and 28 Chapter 2 Constructing Reliable Protein-Protein Interaction (PPI) Networks

Barab´asi 2002], Watts-Strogatz [Watts and Strogatz 1998], and hierarchical [Ravasz and Barab´asi 2003] network models. With a fixed probability for each edge, the Erdos-R¨ ´enyi (ER) networks (discussed above) look to be intuitive and seemingly correct, and were once commonly used to model real-world networks. However, in reality, ER networks are rarely found in real-world examples. Most real-world networks—including road and airline routes, social contacts, webpage links, and also PPI networks—do not have evenly dis- tributed degrees. Moreover, although ER networks have the small-world property, they have almost no clustering effects. For these reasons, using ER networks as null models for comparison against real-world networks including PPI networks is usually inappropriate. In the Barab´asi-Albert (BA) model [Barab´asi 1999, Albert and Barab´asi 2002], nodes are added one at a time to simulate network growth. The probability of an edge forming between an incoming new node and existing nodes is based on the principle of preferential attachment—that is, existing nodes with higher degrees have a higher likelihood of forming an edge with the incoming node. Hence, the degree distribution follows a power-law distribution (scale-free) P(k)∼ k−γ and exhibits small-world behavior, but like ER, lacks modular organization (low clustering behavior). The BA model is particularly important as one of the earliest instances of network models proposed as a suitable null model, instead of ER, for comparisons against biological networks [Barab´asi 1999, Albert and Barab´asi 2002]. This led to a panoply of works describing the properties of various biological network types including metabolic [Wagner and Fell 2001] and PPI networks [Yook et al. 2004] using BA. At the same time, the deficiencies of the BA model started to become more clear: while the scale-free property is preserved, modularity is not. Thus, better null models are needed. The Watts-Strogatz (WS) model [Watts and Strogatz 1998] is seldom used for modeling biological networks but demonstrates how easy it is to transition from a non-small world network (e.g., a lattice graph) to small-world network with random rewiring of a few edges. The generation procedure is amazingly simple: From a ring lattice with n vertices and k edges per vertex, each edge is rewired at random with probability p.Ifp = 0, then the graph is completely regular, and if 1, the graph is completely random/disordered. For the WS model, the intermediate region, where 0

3-node graphlets 4-node graphlets

12 3 4 567 8

5-node graphlets

9 10 11 12 13 14 15 16 17 18 19

20 21 22 23 24 25 26 27 28 29

Figure 2.3 The 29 graphlets of 3–5 nodes defined by Prˇzuljet al. [2004].

geometric random graphs and PPI networks using the distributions for graphlets. A graphlet is a connected network with a small number of nodes, and graphlet fre- quency is the number of occurrences of a graphlet in a network. The authors defined 29 types of graphlets using 3–5 nodes (Figure 2.3). The relative frequency of a graphlet i from the PPI network G is defined as Ni(G)/T (G), where Ni(G) is the number of ∈{ } = 29 graphlets of type i 1,...,29, and T (G) i=1 Ni(G) is the total number of graphlets of G. The same is defined for a generated geometric random network H . The relative graphlet frequency distance D(G, H) between the two networks G and H is measured as 29 = | − | D(G, H) Fi(G) Fi(H ) , (2.7) i=1 =− where Fi(G) log(Ni(G)/T (G)); the logarithm being used because the graphlet frequencies can differ by several orders of magnitude. The authors generated ge- ometric random networks with the same number of nodes as the proteins in S. cerevisiae and D. melanogaster PPI networks from high-throughput experiments. The authors found that, although the degree distributions of these PPI networks were closer to that of scale-free random networks, other topological parameters matched closely with those of geometric random networks. Specifically, the di- ameter, local and whole-network clustering coefficients, and the relative graphlet frequencies, computed as above, showed that PPI networks were closer to geomet- 2.5 Visualizing PPI Networks 31

ric random networks than scale-free networks. Furthermore, the authors suggest that as the quality and quantity of PPI data improve, the geometric random network may become better suited compared to scale-free and other networks to model PPI networks.

Visualizing PPI Networks 2.5 Visualization concerns an important component of the analysis of PPI networks. Very simply, a PPI network is visualized as dots and lines, where the dots (or other shapes) represent proteins and the lines connecting the dots represent interactions between the proteins. Such a visualization enables quick exploration of the topolog- ical properties of PPI networks—for example, counting the neighbors of a selected protein in a PPI network, the number of connected components in the network, or even spotting dense and sparse regions of the network. However, the ease of this exploration and subsequent analysis depends on how effective is the visualization method or tool used to render the PPI network. This rendering of the PPI network concerns the field of graph or network layout, where layout algorithms are used to draw the network—by appropriately positioning the dots and lines—in a 2D space. A good layout should be able to (visually) bring out the topological properties of the PPI network easily, and this has been a subject of research in graph visualization for several years. Here we briefly introduce some of the commonly used algorithms for PPI network layout; for details the readers are referred to the excellent reviews of Morris et al. [2014], Agaptio et al. [2013], and Doncheva et al. [2012]. Random layout arranges the dots (nodes) and lines (edges) in a random man- ner in the 2D space. The advantage of this algorithm is its simplicity, but on the other hand, it presents a high number of criss-crossing edges, and does not nec- essarily use the available space optimally, especially for large networks. Circular layout arranges the nodes in succession, one after the other, on a circle. While this algorithm also suffers from criss-crossing of edges, it is widely used to visu- alize small (sub)networks such as protein complexes and pathways. Tree layout arranges the network as a tree with a hierarchical organization of the nodes. This is obviously more suitable for visualizing trees than networks with cycles. Often, the layout is “ballooned out” by placing the children of each node in the tree on a circle surrounding the node, resulting in several concentric circles. Force-directed layout places the nodes according to a system of forces based on physical concepts in spring mechanics. Typically, the system combines attractive forces between adja- cent nodes with repulsive forces between all pairs of nodes to seek an optimal layout in which the overall edge lengths are small while the nodes are well separated. 32 Chapter 2 Constructing Reliable Protein-Protein Interaction (PPI) Networks

Number of proteins: 1977 Transcriptional Number of interactions: 5679 regulation Average number of neighbors: 5.47 Network diameter: 31 Clustering coefficient: 0573

Proteasome

Eukaryotic initiation factor 4F complex

Fanconi anemia pathway Chromatin remodeling

Nuclear pore DNA damage complex repair

Figure 2.4 PPI network visualization using the Cytoscape 3.4.0 tool [Shannon et al. 2003, Smoot et al. 2010]. A portion of the human PPI network (1,977 proteins and 5,679 interactions) downloaded from BioGrid [Stark et al. 2011] is visualized here using force-directed layout. Basic statistics—average number of neighbors, network diameter, etc.—are displayed for the network. Proteins (e.g., BRCA1), protein complexes (e.g., eukaryotic initiation factor 4F and nuclear pore complexes), pathways (e.g., Fanconi anaemia pathway), and cellular processes (e.g., DNA-damage repair and chromatin remodeling) are “pulled-out” and highlighted. Cytoscape provides “link-out” to external databases and tools—e.g., KEGG [Kanehisa and Goto 2000]—to enable further analysis.

Once the PPI network is laid out, a good visualization tool should allow at least some basic visual analysis of the network. The following aspects become important here (see Figure 2.4). The ease of navigation through the PPI network to explore individual proteins and interactions is of prime importance. In particular, the tool should be able to load and enable nagivation of even large networks. 2.5 Visualizing PPI Networks 33

Next is the provision to annotate the network using internal (e.g., labeling nodes by serial numbers or by their network properties) or external information (see below). The tool should also be able to compute (basic) topological properties of the network—for example, node degree, shortest path lengths, and clustering, closeness, and betweenness coefficients. These statistics help users get at least a preliminary idea of the network. Another valuable feature of a good tool is linkout to external databases, for example to PubMed literature (http://www.ncbi.nlm.nih .gov/pubmed), National Center for Biotechnology Information (NCBI) (http://www .ncbi.nlm.nih.gov/), UniProt or SwissProt (http://www.uniprot.org/)[Bairoch and Apweiler 1996, UniProt 2015], BioGrid (http://thebiogrid.org/)[Stark et al. 2011], (GO) (http://www.geneontology.org/)[Ashburner et al. 2000], and Kyoto Encyclopedia of Genes and Genomes (KEGG) (http://www.genome.jp/kegg/ pathway.html)[Kanehisa and Goto 2000]. These enable functional annotation of proteins and interactions. Finally, the tool should also possibly support advanced analyses such as clustering of the network, comparison (based on topological characteristics, for example) between networks, and enrichment analysis, e.g., using GO terms. Table 2.3 lists some of the popular tools available for PPI network visualization and (visual) analysis. OMICS Tools (http://omictools.com/network- visualization-category) maintains an exhaustive list of visualization tools for PPI and other biomolecular network analysis.

Table 2.3 Software tools for PPI network visualization and analysis

Visualization Tool Source Reference

Arena3D http://arena3d.org/ [Pavlopoulos et al. 2011] AVIS http://actin.pharm.mssm.edu/AVIS2/ [Seth et al. 2007] BioLayout http://www.biolayout.org/ [Theocharidis et al. 2009] Cytoscape http://www.cytoscape.org/download.html [Shannon et al. 2003, Smoot et al. 2010] Medusa http://coot.embl.de/medusa/ [Hooper and Bork 2005] NAViGaTOR http://ophid.utoronto.ca/navigator/download.html [Brown et al. 2009] ONDEX http://www.ondex.org/ [Kohler¨ et al. 2006] Osprey http://biodata.mshri.on.ca/osprey/servlet/Index [Breitkreutz et al. 2003] Pajek http://vlado.fmf.uni-lj.si/pub/networks/pajek/ [Vladimir and Andrej 2004] PIVOT http://acgt.cs.tau.ac.il/pivot/ [Orlev et al. 2004] ProViz http://cbi.labri.fr/eng/proviz.htm [Florian et al. 2005] 34 Chapter 2 Constructing Reliable Protein-Protein Interaction (PPI) Networks

Building High-Confidence PPI Networks 2.6 From our discussions on experimental protocols in earlier sections, we know that some protocols—including the AP/MS ones—offer only pulled-down complexes consisting of baits and their preys without specifying the binary interactions be- tween these components. Therefore, binary interactions need to be specifically inferred between the bait and each of its preys within the pulled-down complexes. However, not all preys in a pulled-down complex interact directly with the bait (but, get pulled down due to their interactions with other preys in the complex). There- fore, it is necessary to infer binary interactions not just between the bait and its preys but also between the interacting preys. Yet, care should be taken to avoid inferring spurious (false-positive) interactions between the preys that do not inter- act. To overcome these uncertainties, often a balance is sought between two kinds of models, spoke and matrix, which are used to transform pulled-down complexes into binary interactions between the proteins [Gavin et al. 2006, Krogan et al. 2006, Spirin and Mirny 2003, Zhang et al. 2008]. The spoke model assumes that the only interactions in the complex are between the bait and its preys, like the spokes of a wheel. This model is useful to reduce the complexity of the data, but misses all (true) prey–prey interactions. On the other hand, the matrix model assumes that every pair of protein within a complex interact. This model can cover all possible true interactions, but can also predict a large number of spurious interactions. An empirical evaluation using 1,993 baits and 2,760 preys from the dataset from Gavin et al. [2006] against 13,384 pairwise protein interactions between proteins within the expert-curated MIPS complexes [Mewes et al. 2006] revealed 80.2% true-negative (missing) interactions and 39% false-positive (spurious) interactions in the spoke model, and 31.2% true-negative interactions but 308.7% false-positive interactions in the matrix model [Zhang et al. 2008]. However, note that many of the missing interactions could be due to the lack of protein coverage in these experiments. A balance is struck between the two models that covers as many true interactions between the baits and preys as possible without allowing too many false interactions [Gavin et al. 2006]; see Figure 2.5.

Gaining Confidence in PPI Networks Although high-throughput studies have been successful in mapping large frac- tions of interactomes from multiple organisms, the datasets generated from these studies are not free from errors. High-throughput PPI datasets often contain a considerable number of spurious interactions, while missing a substantial num- 2.6 Building High-Confidence PPI Networks 35

p p

b p p p p 0.60 p 0.60 p 0.85 0.55 0.85 0.55 A 0.70 0.70 p p 0.25 0.60 b b 0.75 0.70 b 0.75 0.60 0.70 p p 0.65 0.65 p p 0.75 0.75 0.25 p p p p b C D

p p B

Figure 2.5 Inferring protein interactions from pull-down protein complexes. Bait–prey relationships from pull-down complexes are assembled using the spokes model, where the bait is connected to each of the preys (A); or using the matrix model, where every bait–prey and prey–prey pair is connected (B). However, these models either miss many true interactions or produce too many spurious interactions. Therefore, a combination of the spoke and matrix models is used, where a balance is sought between the two models using weighting of interactions (C). Interactions with low weights are discarded to give the final set of high-confidence inferred interactions (D).

ber of true interactions [Von Mering et al. 2002, Bader and Hogue 2002, Cusick et al. 2009]. Consequently, a crucial challenge in adopting these datasets for down- stream analysis—including protein complex prediction—is in overcoming these challenges.

Spurious Interactions Spurious or false-positive interactions in high-throughput screens may arise from technical limitations in the underlying experimental protocols, or limitations in the (computational) inference of interactions from the screen. For example, the Y2H system, despite being in vivo, does not consider the localization (compart- mentalization), time, and cellular context while testing for binding partners. Since all proteins are tested within one compartment (the nucleus), the chances that two proteins, belonging to two different compartments and are not likely to meet during their lifetimes in live cells, end up testing positive for interaction, is high. Similarly, in vitro TAP pull downs are carried out using cell lysates in an environ- ment where every protein is present in the same “uncompartmentalized soup” [Mackay et al. 2007, Welch 2009]. Therefore, even though two proteins interact 36 Chapter 2 Constructing Reliable Protein-Protein Interaction (PPI) Networks

under these laboratory conditions, it is not certain that they will ever meet or inter- act during their life times in live cells. Opportunities are high for “sticky” molecules to function as bridges between proteins causing these proteins to interact promis- cuously with partners that never interact with in live cells [Mackay et al. 2007]. Once these complexes are pulled down, the model used to infer binary interactions— between bait and prey or between preys—can also result in inference of spurious interactions (further discussed below). Recent analyses showed that only 30–50% of interactions inferred from high-throughput screens actually occur within cells [Shoemaker and Panchenko 2007, Welch 2009], while the remaining interactions are false positives.

Missing Interactions and Lack of Concordance Between Datasets Comparisons between datasets from different techniques have shown a striking lack of concordance, with each technique producing a unique distribution of in- teractions [Shoemaker and Panchenko 2007, Von Mering et al. 2002, Bader and Hogue 2002, Cusick et al. 2009]. Moreover, certain interactions depend on post- translational modifications such as disulfide-bridge formation, glycosylation, and phosphorylation, which may not be supported in the adopted system. Many of these techniques also show bias toward abundant proteins (e.g., soluble proteins) and bias against certain kind of proteins (e.g., membrane proteins). For example, AP/MS screens predict relatively few interactions for proteins involved in transport and sensing (trans-membrane proteins), while Y2H screens being targeted in the nu- cleus fail to cover extracellular proteins [Shoemaker and Panchenko 2007]. These limitations effectively result in a considerable number of missed interactions in interactome datasets. Welch [2009] summed up the status of interactome maps, based on these above limitations, as “fuzzy,” i.e., error-prone, yet filled with promise.

Estimating Reliabilities of Interactions The coverage of true interactions can be increased by integrating datasets from multiple experiments. This integration ensures that all or most regions of the in- teractome are sufficiently represented in the PPI network. However, overcoming spurious interactions still remains a challenge, which is further magnified when datasets are integrated. Therefore, estimating the reliabilities of interactions be- comes necessary, thereby keeping only the highly reliable interactions while dis- carding the spurious or less-reliable ones. Confidence or reliability scoring schemes offer a score (weight) to each interac- tion in the PPI network. For an interaction (u, v) ∈ E in the scored (weighted) PPI 2.6 Building High-Confidence PPI Networks 37

Table 2.4 Classification of confidence scoring (PPI weighting) schemes for protein interactions

Classification Scoring Scheme Reference Sampling or Bootstrap sampling [Friedel et al. 2009] counting based Comparative Proteomic [Sowa et al. 2009] Analysis (ComPASS) Dice coefficient a [Zhang et al. 2008] Hypergeometric sampling [Hart et al. 2007] Significance Analysis of [Choi et al. 2011, Teo et INTeractions (SAINT) al. 2014] Socio-affinity scoring [Gavin et al. 2006]

Independent Bayesian networks and C4.5 [Krogan et al. 2006] evidence-based decision trees Topological Clustering [Jain and Bader 2010] Semantic Similarity (TCSS) Purification Enrichment (PE) [Collins et al. 2007]

Topology-based Collaborative Filtering (CF) [Luo et al. 2015] Functional Similarity (FS) [Chua et al. 2006] Weight a Geometric embedding [Prˇzuljet al. 2004, Higham et al. 2008] Iterative Czekanowski-Dice [Liu et al. 2008] (ICD) distance a PageRank affinity [Voevodski et al. 2009]

a. Dice coefficient, FS Weight, and Iterative CD scoring schemes can also be considered as independent evidence-based schemes, because if a pair of proteins have several common partners then these proteins most likely perform the same or similar functions and/or are present in the same cellular compartment (a biological evidence).

network G =V , E, w, the score w(u, v) encodes the confidence for the physical interaction between the two proteins u and v. The scoring function w: V × V → R accounts for the biological variability and technical limitations of the experi- ments used to infer the interactions. The scoring schemes can be classified into three broad categories (Table 2.4): (i) sampling or counting-based, (ii) biological evidence-based, and (iii) topology-based schemes. 38 Chapter 2 Constructing Reliable Protein-Protein Interaction (PPI) Networks

Sampling or Counting-Based Schemes These schemes estimate the confidence of protein pairs by measuring the number of times each protein pair is observed to interact across multiple trials against what would be expected by chance given the abundance of each protein in the library. If the protein pairs are coming from the same experiment, the counting is performed across multiple purifications of the experiment. Given multiple PPI datasets, this idea can be extended to score interacting pairs by measuring the number of times each pair is observed across the different datasets against what would be expected from random given the number of times these proteins appear across the datasets. However, if the PPI datasets come from different experiments (e.g., Y2H and TAP/MS-based), which is usually the case, then it is useful to capture the relative reliability of each experimental technique or source of the datasets into this computation. For example, if Y2H is believed to be less reliable than TAP/MS- based techniques, then protein pairs can be assigned lower weights when observed in Y2H datasets, but assigned higher weights when observed in TAP/MS datasets. In the study by Gavin et al. [2006], a “socio-affinity” scheme based on this counting idea was used to estimate confidence for interactions inferred from pulled-down complexes detected from TAP purifications. The interactions within the pulled-down complexes are inferred as a combination of spoke and matrix- modeled relationships. A socio-affinity index SA(u, v) then quantifies the tendency for two proteins u and v to identify each other when tagged (spoke model, S) and to co-purify when other proteins are tagged (matrix model, M): = + + SA(u, v) S(u,v)|u=bait S(u,v)|v=bait M(u,v), n(u,v)|u=bait where S | = = log , (u,v) u bait bait . . prey . prey fu nbait fv nu=bait n(u,v)|v=bait (2.8) S | = = log , (u,v) v bait bait . . prey . prey fv nbait fu nv=bait ⎛ ⎞ nprey = ⎝ (u,v) ⎠ and M(u,v) log − , prey . prey . npreys(npreys 1) fu fv all baits 2

where, for the spoke model (S), n(u,v)|u=bait is the number of times that u retrieves bait prey v when u is tagged; fu is the fraction of purifications when u was bait; fv is the fraction of all retrieved preys that were v; nbait is the total number of purifications preys (i.e., using baits); and nu=bait is the number of preys retrieved with u as bait. These prey terms are similarly defined for v as bait. For the matrix model (M), n(u,v) is the 2.6 Building High-Confidence PPI Networks 39 number of times that u and v are seen in purifications as preys with baits other prey prey than u or v; fu and fv are as above; and nprey is the number of preys observed with a particular bait (excluding the bait itself). Friedel et al. [2009] combined the bait–prey relationships detected from the Gavin et al. [2006] and Krogan et al. [2006] experiments, and used a random sampling-based scheme to estimate confidence of interactions. In this approach, = a list  (φ1,...,φn) purifications were generated where each purification φi con- sisted of one bait bi and the preys pi,1,...,pi,m identified for this bait in the =  = purification: φi bi ,[pi,1,...,pi,m] . From , l 1000 bootstrap samples were created by drawing n purifications with replacement. This means that the bootstrap sample Sj () contains the same number of purifications as  and each purifica- tion φi can be contained in Sj () once, multiple times, or not at all, with multiple copies being treated as separate purifications. Interaction scores for the protein pairs are then calculated from these l bootstrap samples using socio-affinity scor- ing as above, where each protein pair is counted for the number of times the pair appeared across randomly sampled sets of interactions against what would be ex- pected for the pair from random based on the abundance of each protein in the two datasets. Zhang et al. [2008] modeled each purification as a bit vector which lists proteins pulled down as preys against a bait across different experiments. The authors then used the Sørensen-Dice similarity index [Sørensen 1948, Dice 1945] between the vectors to estimate co-purification of preys across experiments, and thus the inter- action reliability between proteins. Specifically, the pull-down data is transformed into a binary protein pull-down matrix in which a cell [u, i] in the matrix is 1 if u is pulled down as a prey in the experiment or purification i, and a zero otherwise. For two protein vectors in this matrix, the Sørensen-Dice similarity index, or simply the Dice coefficient, is computed as follows:

2q D(u, v) = , (2.9) 2q + r + s where q is the number of the matrix elements (experiments or purifications) that have ones for both proteins u and v; r is the number of elements where u has ones, but v has zeroes; and s is the number of elements where v has ones, but u has zeroes. If u and v indeed interact (directly or as part of a complex), then most likely the two proteins will be frequently co-purified in different experiments. The Dice coefficient therefore estimates the fraction of times u and v are co-purified in order to estimate the interaction reliability between u and v. 40 Chapter 2 Constructing Reliable Protein-Protein Interaction (PPI) Networks

Hart et al. [2007] generated a Probabilistic Integrated Co-complex (PICO) net- work by integrating matrix-modeled relationships from the Gavin et al. [2002], Gavin et al. [2006], Krogan et al. [2006], and Ho et al. [2002] datasets using hypergeo- metric sampling. Specifically, the significance (p-value) for observing an interaction between the proteins u and v at least k times in the dataset is estimated using the hypergeometric distribution as

min(n,m) p(#interactions ≥ k|n, m, N)= p(i|n, m, N), i=k (2.10) n N−n − p(i|n m N)= i m i where , , N , m

where k is the number of times the interaction between u and v is observed, n and m are the total number of interactions for u and v, respectively, and N is the total number of interactions in the entire dataset. The lower the p-value, the lesser is the chance that the observed interaction between u and v is random, and therefore higher is the chance that the interaction is true. Methods such as Significance Analysis of INTeractome (SAINT) are based on quantitative analysis of mass spectrometry data. SAINT, developed by Choi et al. [2011] and Teoetal.[2014], assigns confidence scores to interactions based on the spectral counts of proteins pulled down in AP/MS experiments. The aim is to con-

vert the spectral count Xij for a prey protein i identified in a purification of bait | j into the probability of true interaction between the two proteins, P(True Xij ). | | For this, the true and false distributions, P(Xij T rue) and P(Xij False), and the prior probability πT of true interactions in the dataset are inferred from the spectral counts of all interactions involving prey i and bait j. Essentially, SAINT assumes that, if proteins i and j interact, then their “interaction abundance” is proportional | to the product XiXj of their spectral counts Xi and Xj . To compute P(Xij T rue), the spectral counts Xi and Xj are learned not only from the interaction between i and j, but also from all bona fide interactions that involve i and j. The same prin- | ciple is applied to compute P(Xij False) for false interactions. These probability distributions are then used to calculate the posterior probability of true interaction | P(True Xij ). The interactions are then ranked in decreasing order of their proba- bilities, and a threshold is used to select the most likely true interactions. Comparative Proteomic Analysis (ComPASS) [Sowa et al. 2009] employs a com- parative methodology to assign scores to proteins identified within parallel pro- × = teomic datasets. It constructs a stats table X[k m] where each cell X[i, j] Xi,j 2.6 Building High-Confidence PPI Networks 41 is the total spectral count (TSC) for an interactor j (arranged as m rows) in an ex- periment i (arranged as k columns). ComPASS uses a D-score to normalize the TSCs across proteins such that the highest scores are given to proteins in each experiment that are found rarely, found in duplicate runs, and have high TSCs—all characteristics that qualify proteins to be candidate high-confidence interactors. The D-score is a modification of the Z-score, which weights all interactors equally ¯ regardless of the number of replicates or their TSCs. Let Xj be the average TSC for interactor j across all the experiments, i=k i=1,j=n Xi,j X¯ = ; n = 1,2,...,m. (2.11) j k

The Z-score is computed as

X − X¯ = i,j j Zi,j , (2.12) σj where σj is the standard deviation for the spectral counts of interactor j across the experiments. The D-score improves the Z-score by incorporating the uniqueness, the TSC, and the reproducibility of the interaction to assign a score to each protein within each experiment. The D-score first rescales Xi,j as

p R =  k D = . Xi,j i,j i k f i=1 i,j (2.13)  1, when X > 0 = i,j where fi,j 0, otherwise, and p is the number of replicate runs in which the interactor is present. A D-score distribution is generated using a simulated random dataset, and a D-score thresh- old DT is determined below which 95% of this randomized data falls. A normalized N = R T D-score is then computed using this threshold as Di,j Di,j /D . All interactors N ≥ N with Di,j 1 are considered true, whereas those with Di,j < 1 are less likely to be bona fide interactors. Huttlin et al. [2015] employed ComPASS to identify 23,744 high-confidence interactors for 2,594 baits expressed in human embryonic kidney (HEK293T) cells (“the BioPlex network,” which as of 2016 contains over 50,000 in- teractions: http://wren.hms.harvard.edu/bioplex/). The readers are referred to the review by Nesvizhskii [2012] for a number of other scoring methods based on analy- sis of mass spectrometry data. 42 Chapter 2 Constructing Reliable Protein-Protein Interaction (PPI) Networks

Evidence-Based Schemes These schemes use external or independence evidence to estimate confidence of interactions in the PPI dataset. For example, these evidence may include Gene Ontology (GO) annotations [Ashburner et al. 2000, Mi et al. 2013] for protein func- tions and localization (compartmentalization), and co-complex memberships in validated complexes. Some of the methods are learning-based; for example, given a (training) set of known interacting pairs and GO annotations, the methods learn (train on) the conditional probability distribution for interacting pairs with and without similar GO annotations (e.g., similar functions or localization), and us- ing this learned distribution, the methods estimate the probability of interaction for the proteins in each pair. In Krogan et al.’s study [Krogan et al. 2006], a ma- chine learning approach using Bayesian networks and C4.5-decision trees trained on validated physical interactions and functional evidence—co-occurrence in man- ually curated complexes from MIPS—was used to estimate confidence for protein pairs in a spoke-modeled experimental dataset. Collins et al. [2007] developed Pu- rification Enrichment (PE) scoring to generate a “Consolidated network” using the matrix-modeled relationships from Gavin et al. and Krogan et al. datasets. The PE scoring is based on a Bayes classifier trained on manually curated co- complexed protein pairs, GO annotations, mRNA expression patterns, and cellular co-localization and co-expression profiles. The Consolidated network was shown to be of high quality, comparable to that of PPIs derived from low-throughput ex- periments. In other methods, explicit learning may not be involved, but instead the evidence is directly used to assess the interaction confidence of protein pairs. For example, Resnick’s measure [Resnick 1995] for computing the semantic similarity between annotation terms has been adopted to compute confidence based on GO annota- tions between the proteins [Xu et al. 2008, Pesquita et al. 2008, Jain and Bader 2010]. Specifically, the semantic similarity between two ontology terms (S, T)having a set A of common ancestors in the GO graph is given as the information content, r(S, T)=− p(A) log(p(A)), (2.14) A∈A where p(A) is the fraction of proteins annotated to term A and all its descendants in the GO graph. Suppose that proteins u and v are annotated to sets of GO terms S and T , respectively. Then the semantic similarity between u and v is defined as the maximum information content (Resnick’s measure) of the set S × T ,

= sim(u, v) max r(Si , Tj ). (2.15) Si∈S,Tj ∈T 2.6 Building High-Confidence PPI Networks 43

The interaction confidence between u and v is then estimated as sim(u, v). GO graphs tend to be unbalanced with some paths containing more details (depth) compared to others, which stems from the complex biological structure of the GO annotations. However, this creates a bias against terms that do not represent such complex structures, i.e., terms that do not have sufficient depth in the GO graph. To account for this topological imbalance of the GO graph, Jain and Bader [2010] developed Topological Clustering Semantic Similarity (TCSS) which collapses subgraphs that define similar concepts. Terms that are lower down the GO tree have higher information content (i.e., more specific) than the terms at higher levels (i.e., less specific). A cut-off is used to identify subgraphs—subgraph root terms and all their children—with high information content. Since GO terms often have multiple parents, it is likely that this results in overlapping subgraphs. Overlaps between subgraphs are removed in two steps: edge removal by transitive reduction and term duplication. In the edge-removal step, a reduction is performed on the subgraphs: If nodes u and v are connected both via a directed edge u → → → v as well as a directed path u w1,...,wk v, then a transitive reduction is performed to preserve only u → v. After this step, if a term still belongs to more than one subgraph, then the term and all its descendents are duplicated across the subgraphs. The similarity between two proteins is measured using this reduced GO graph using Resnick’s similarity between their GO terms, as described above.

Topology-Based Schemes These schemes analyze the topology of the PPI network—usually in the immediate neighborhood of each protein pair—to estimate interaction confidence for the protein pairs. If the proteins in an interacting pair have many common neighbors in the network, the proteins and their shared neighbors have similar functions and/or are co-localized [Batada et al. 2004], and it is likely that the observed interaction between the proteins is true. This can be understood from the following simple example. Two proteins need to be localized to the same compartment to interact physically. Let us assume that a PPI screen with a false-positive rate of p reports that protein A interacts with C1,...,Cn and D1,...,Dm, and another protein B interacts with C1,...,Cn and E1,...,Em . Suppose that each of these proteins can be localized to say h subcellular compartments with equal chance. The probability = − − that A and B interact with a Ci in two different places is therefore x (1 p) . (1 − p) . (h 1)/h. Thus, the probability that A and B interact with each C1,...,Cn in two different compartments is xn, and the probability that A and B interact with − n some Ci in the same compartment is 1 x , which monotonically increases with n. That is, the more common partners A and B have (as reported by the screen), the 44 Chapter 2 Constructing Reliable Protein-Protein Interaction (PPI) Networks

higher the chance they are in the same compartment. Thus, the more common partners the two proteins have in the PPI network, the higher the chance they satisfy the co-localization requirement for interaction. In other words, topology- based schemes based on counting common partners can be viewed as ways to guess whether A and B are likely to be in the same compartment (without needing to know explicitly which compartment); that is, these indices are in fact based on exploiting the biological fact that two proteins must be in the same compartment to physically interact (i.e., a piece of biological information). Therefore, Functional Similarity Weighting (FS Weight) [Chua et al. 2006], Iterative Czekanowski-Dice (CD) weight [Liu et al. 2008], and Dice coefficient [Zhang et al. 2008] can be classified as both topology-based as well as biological evidence-based schemes. FS Weight, proposed by Chua et al. [2006], is inspired from the graph theory measure Czekanowski-Dice (CD) distance, which is given by |N N | CD(u, v) = u v , (2.16) | ∪ |+| ∩ | Nu Nv Nu Nv = where Nu includes u and all the neighbors of u, similarly for Nv, and NuNv − ∪ − (Nu Nv) (Nv Nu) is the symmetric difference between the sets of neighbors = = Nu and Nv. CD is a distance (dissimilarity) measure. If Nu Nv, then CD(u, v) 0; ∩ =∅ = and if Nu Nv , then CD(u, v) 1. FS Weight takes inspiration from CD to use common neighbors, and estimates the confidence of the physical interaction between u and v based on their common neighbors. Chua et al. [2006] show that proteins share functions with their strictly indirect neighbors more than with their strictly direct neighbors, and proteins share functions with even higher likelihood with proteins that are simultaneously their direct and indirect neighbors than with proteins that are either strictly their direct or strictly their indirect neighbors. The weight FS(u, v) of the interaction between u and v is estimated as 2|N ∩ N | FS(u, v) = u v | − |+ | ∩ |+ Nu Nv 2 Nu Nv λu,v (2.17) 2|N ∩ N | × u v , | − |+ | ∩ |+ Nv Nu 2 Nu Nv λv,u

where λu,v and λv,u are used to penalize protein pairs with very few neighbors, and are given by = − | − |+| ∩ | λu,v max(0, navg ( Nu Nv Nu Nv )), and (2.18) = − | − |+| ∩ | λu,v max(0, navg ( Nv Nu Nv Nu )), 2.6 Building High-Confidence PPI Networks 45

where navg is the average number of level-1 neighbors that a protein has in the net- work. FS Weight thus assigns higher weight to protein pairs with larger number of common neighbors. FS Weight can be used to predict new functional associa- tions between proteins based on the number of neighbors they share; some of these functional associations may be physical interactions. In Iterative CD, proposed by Liu et al. [2008], the weight for each interaction is computed using the number of neighbors shared between the two interacting proteins, in a manner similar to FS Weight. However, Iterative CD then iteratively corrects these weights for the interactions, such that the weights computed in an iteration uses the weights computed in the previous iteration. By doing so, Iterative CD progressively reinforces the weights of true interactions and dampens the weights of false-positive interactions with each iteration. Iterative CD begins with an unweighted network (all weights set to 1) or a network with weights coming from some prior evidence (e.g., reliabilities of experiments from which the protein pairs were inferred). The weight wk(u, v) for each protein pair (u, v) in the k-th (k > 1) iteration is estimated as k−1 + k−1 x∈N ∩N [w (x, u) w (x, v)] wk(u, v) = u v , (2.19) wk−1(x, u) + λk + wk−1(x, v) + λk x∈Nu u x∈Nv v where w1(u, v) = 1 if the interaction (u, v) exists in the original network, w1(u, v) = 0 if the interaction (u, v) does not exist; alternatively, w1(u, v) = r(e)(u,v), which is the reliability of the experiment e used to infer (u, v). The parameters λu and λv penalize protein pairs with very few common neighbors. Liu et al. show that the iterative procedure converges in two iterations for typical PPI networks, but 10–30 iterations may be necessary if the network has high levels of noise. The procedure produces a weight between 0 and 1 for each interaction, and a cut-off (recommended 0.2) is used to filter out low-scoring interactions. Luo et al. [2015] scored interactions based on collaborative-filtering (CF). This approach is inspired by personalized recommendation in e-commerce where the problem is to identify useful patterns reflecting the connection between users and items from their usage history, and make reliable predictions for possible user- item links based on these patterns [Herlocker et al. 2002]. The CF scheme first computes the similarity sim(u, v) between a pair of interacting proteins u and v in the PPI network using the cosine distance of their adjacency vectors,

¯u, v¯ sim(u, v) = , (2.20) ||u ¯||.||¯v|| 46 Chapter 2 Constructing Reliable Protein-Protein Interaction (PPI) Networks

where ... denotes the inner product between the vectors, and ||.|| denotes the Euclidean norm of the vectors. Therefore, the closer sim(u, v) is to 1, more the common neighbors that u and v share. This cosine distance is then rescaled by

incorporating a Cγ parameter which, similar to λ in FS Weight and Iterative CD, is used to account for spurious neighbors in the network,

¯u, v¯ sim(u, v) =   . (2.21) || ¯||2 ||¯||2 ( u + 1). ( v + 1) Cγ Cγ

Like in e-commerce where people judge an item based on multiple reviews, prefer- ably from known contacts, the score sim(u, v) from Equation (2.21) is rescaled as:

¯ ¯ d u, v . (ru v) sim(u, v) =   , , (2.22) || ¯||2 ||¯||2 ( u + 1). ( v + 1) Cγ Cγ

d where (ru,v) (d being a tunable parameter) is the mean of the scores of interactions with common neighbors, {(u, x) : x ∈ N(u) ∩ N(v)}∪{(v, x) : x ∈ N(u) ∩ N(v)},in the PPI network. Voevodski et al. [2009] developed PageRank Affinity, a scoring scheme inspired from Google’s PageRank algorithm [Haveliwala 2003], to score interactions in the PPI network. PageRank Affinity uses random walks to estimate the interconnected- ness of protein interactions in the network. A random walk is a Markov process where in each step the walk moves from the current protein to the next with a cer- tain preset probability. PageRank Affinity begins with an unweighted network G (with all weights set to 1) given by the adjacency matrix  1, if (u, v) ∈ E = AG(u, v) (2.23) 0, otherwise.

The transition probability matrix for the random walks (often called the random walk matrix) is the normalized adjacency matrix where each row sums to one,

= −1 WG DG AG, (2.24)

where D is the degree matrix, which is a diagonal matrix containing the degrees of the proteins,  deg(u) =|N(u)|,ifu = v = DG(u, v) (2.25) 0, otherwise. 2.6 Building High-Confidence PPI Networks 47

Therefore, the transition probabilities are given by the matrix WG, where the tran- + = sition probabilities pt+1 at step t 1 are simply pt+1 ptWG. PageRank Affinity repeatedly simulates random walks beginning from each protein in the network. Let pr(s) be the steady-state probability distribution of a random walk with restart probability α, where the starting vector s gives the probability distribution upon restarting. Then, pr(s) is the unique solution for the linear system given by

= + − prα(s)t+1 αs (1 α) . prα(s)t . WG. (2.26)

The starting vector here is set as follows:  1, if i = u = su(i) (2.27) 0, otherwise.

Therefore, pr(su) is the steady-state probability distribution of a random walk that always restarts at u. Let pr(su)[v] represent the steady-state probability value that a protein v has in the distribution vector pr(su). This can be thought of as the probability contribution that u makes to v, and is denoted as pr(u → v). This provides the affinity between u and v when the walks always restart at u. The final affinity between u and v is computed as the minimum of pr(u → v) and pr(v → u). Prˇzuljet al. [2004] and Higham et al. [2008] argue that PPI networks are best modeled by geometric random graphs instead of scale-free or small-world graphs, as suggested by earlier works [Watts and Strogatz 1998, Barab´asi 1999]. A geomet- ric random graph G =V ,  with radius is a graph with node set V of points in a metric space and edge set E ={(u, v) : u, v ∈ V ;0< ||u − v|| ≤ }, where ||.|| is an arbitrary distance norm in this space. The authors show that a PPI network can be represented as a geometric random graph by embedding the proteins of the PPI network in metric spaces R2, R3,orR4, and finding an such that any two pro- teins are connected in the PPI network if and only if the proteins are -close in the chosen metric space. This embedding can be used to not only weight all interac- tions, but also predict new interactions between protein pairs and remove spurious interactions from the PPI network: Interactions between proteins that are farther than -distance away in the metric embedding are pruned off as noise, whereas non-interacting protein pairs that are -close are predicted to be interacting. The embedding of the PPI network in a chosen metric space is described briefly as follows. Given a PPI network of N proteins, the interaction weights for protein pairs (i, j)are interpreted as pairwise distances dij , and the task is to find locations { [i]N } Rm in m-dimensional Euclidean space (vectors x i=1 in ) for these proteins so || [i] − [j]|| = that the pairwise distances are preserved, i.e., x x 2 dij for all i, j.InPPI 48 Chapter 2 Constructing Reliable Protein-Protein Interaction (PPI) Networks

networks with only {0, 1} connectivity information, the length of the shortest path between proteins in the network is used in lieu of the Euclidean distance. Przulj = et al. suggest that using the square root of the path length, dij pathij , where pathij denotes the path length between i and j, is a good function for the purpose. | | Conditional probabilities p(dij interact) and p(dij noninteract) are learned for | distances dij from the embedding. Here, p(dij interact) is the probability density function describing the distances between pairs of proteins which are known to | interact, and p(dij noninteract) is the probability density function describing the distances between pairs of proteins which do not interact in the dataset (and are known to not interact). Given a distance threshold δ, these probabilities are used to | ≤ | ≤ compute the posterior probabilities p(interact dij δ) and p(noninteract dij δ) for protein pairs to interact or not interact. For each protein pair (i, j) within the δ-threshold, the weight of the interaction between i and j is then estimated as | ≤ p(interact(i, j) dij δ) S(i, j)= . (2.28) | ≤ + | ≤ p(interact(i, j) dij δ) p(noninteract(i, j) dij δ) A threshold on this estimated weight is applied to remove false-positive interactions from the network.

Combining Confidence Scores Interactions scored high by all or a majority of confidence-scoring schemes (con- sensus interactions) are likely to be true interactions. Therefore, a simple majority- voting scheme counts the number of times each interaction is assigned a high score (above the recommended cut-off) by each scheme, and considers only the interac- tions scored high by a majority of the schemes. Chua et al. [2009] integrated multiple scoring schemes using a na¨ıve Bayesian

approach. For an interaction (u, v) that is assigned scores pi(u, v) by different schemes i (assuming the scores are in the same range [0,1]), the combined score − − can be computed as 1 i(1 pi(u, v)). However, even within the same range [0,1], usually different scoring schemes tend to have different distribution of scores they assign to the interactions. Some schemes assign high scores (close to 1) to most or a sizeable fraction of the interactions, whereas other schemes are more conservative and assign low scores to many interactions. To account for the variability in dis- tributions of scores, it is important to consider the relative ranking of interactions within each scoring scheme instead of their absolute scores. Chua et al. [2009] therefore proposed a ranked-based combination scheme, which works as follows. For each scheme i, the scored interactions are first binned in increasing order of their scores: The first 100 interactions are placed in the first 2.6 Building High-Confidence PPI Networks 49 bin, the second 100 interactions in the second bin, and so on. For each bin k in scheme i, a weight p(i, k) is assigned based on the number of interactions from the bin that match known interactions from an independent dataset: #interactions from bin k for scheme i that match known interactions p(i, k) = #total interactions in bin k across all schemes. (2.29)

While combining the scores for interaction (u, v), the Bayesian weighting is mod- − − ified to: 1 (i,k)∈D(u,v)(1 p(i, k)), where D(u, v) is the list of scheme-bin pairs (i, k) that contain (u, v) across all schemes. This ensures that, irrespective of the scoring distribution of the schemes, if an interaction belongs to reliable bins, it is assigned a high final score. Yong et al. [2012] present a supervised maximum-likelihood weighting scheme (SWC) to combine PPI datasets and to infer co-complexed protein pairs. The method uses a na¨ıve Bayes maximum-likelihood model to derive the posterior probability that an interaction (u, v) is a co-complexed interaction based on the scores assigned to (u, v) across multiple data sources. These data sources include PPI databases, namely BioGrid [Stark et al. 2011], IntAct [Hermjakob et al. 2004, Kerrien et al. 2012], MINT [Zanzoni et al. 2002, Chatr-Aryamontri et al. 2007], and STRING [Von Mering et al. 2003, Szklarczyk et al. 2011], and evidence from co- occurrence of proteins in the PubMed literature abstracts (http://www.ncbi.nlm .nih.gov/pubmed). The set of features is the set of these data sources, and a fea- ture F has value f if proteins u and v are related by data source F with score f , else f = 0. The features are discretized using minimum-description length su- pervised discretization [Fayyad and Irani 1993]. Using a reference set of protein complexes, each (u, v) in the training set is given a class label co-complex if both u and v are in the same complex, otherwise it is given the class label non-co-complex. The maximum-likelihood parameters are learned for the two classes,

Nc,F =f P(F = f |co-complex) = Nc (2.30) N¬c,F =f P(F = f |non-co-complex) = , N¬c = where Nc is the number of interactions with label co-complex, Nc,F f is the num- ber of interactions with label co-complex and feature value F = f , and likewise for = interactions with label non-co-complex, N¬c,F f . After learning the maximum- likelihood model, the score for each interaction is computed as the posterior prob- ability of being a co-complex interaction based on the na¨ıve Bayes assumption of 50 Chapter 2 Constructing Reliable Protein-Protein Interaction (PPI) Networks

independence of the features: = | = = s(u, v) P(co-complex F1 f1, F2 f2,...)  P(F = f |co-complex) . P(co-complex) (2.31) = i i i , Z where Z is the normalizing factor:  = = | Z P(Fi fi co-complex) . P(co-complex) i  (2.32) + = | P(Fi fi non-co-complex) . P(non-co-complex). i Table 2.5 displays some of the publicly available databases that integrate PPI datasets from multiple sources. However, a common problem with integrating multiple scored-datasets is the low agreement between the schemes or experiments used to produce these data. Moreover, most scoring methods favor high abundance proteins and are not effective enough to filter out common contaminants [Pu et al. 2015]. Therefore, better scoring and integrating schemes for PPI datasets are always required.

Enhancing PPI Networks by Integrating Functional Interactions 2.7 In addition to the presence of spurious interactions, another limitation in existing PPI datasets is the lack of coverage for true interactions among certain kinds of proteins (the “sparse zone” [Rolland et al. 2014]). This is in part due to limitations in experimental protocols (e.g., washing away of weakly connected proteins dur- ing purification of pull-down complexes in TAP experiments), and in part due to the under-representation of certain groups of proteins in these experiments (e.g., membrane proteins). The paucity of true interactions can considerably affect down- stream analysis including protein complex prediction. For example, in an analysis by Srihari and Leong [2012a] using protein complexes from MIPS and CYC2008, it was found that many true complexes are embedded in sparse and disconnected regions of the PPI network, thereby altering their dense connectivity and modular- ity. As we shall see in a subsequent chapter, many computational methods find it difficult to identify these sparse complexes. Computational prediction of protein interactions can be a good alternative to experimental protocols for enriching the PPI network with true interactions, and to “densify” regions of the network that are sparsely connected. However, accu- rate prediction of physical interactions between proteins is a difficult problem 2.7 Enhancing PPI Networks by Integrating Functional Interactions 51 Kotlyar et Brown and Jurisca 2007 , Kotlyar et al. 2015 , Szklarczyk et al. 2011 ] Turner et al. 2010 ] Zhang et al. 2013 ] Reference [ Veres et al. 2015 ] [ Warde-Farley et al. 2010 ] [ Patil et al. 2011 ] [ Brown and Jurisica 2005 , [ Kamburov et al. 2012 ] [ Basha et al. 2015 ] [ Brown and Jurisica 2005 , al. 2016 ] [ Aranda et al. 2011 ] [ Kalathur et al. 2014 ] Kotlyar et al. 2015 ] [ Schaefer et al. 2012 ] [ Lee et al. 2011 ] [ Li et al. 2017 ] [ Chautard et al. 2011 ] [ Zhang et al. 2012 , [ Von Mering et al. 2003 , [ Lynn et al. 2008 ] [ Razick et al. 2008 , http://comppi.linkgroup.hu/ http://www.genemania.org/ http://hintdb.hgc.jp/htp/ http://ophid.utoronto.ca/ophidv2.204/ http://intscore.molgen.mpg.de/ http://netbio.bgu.ac.il/myproteinnet/ http://ophid.utoronto.ca/iid/ http://psicquic.googlecode.com/ http://www.unihi.org/ http://cbdm.mdc-berlin.de/tools/hippie/ http://www.functionalnet.org/ http://www.lagelab.org/resources/ http://matrixdb.univ-lyon1.fr/ http://bhapp.c2b2.columbia.edu/PrePPI/ http://string-db.org/ http://www.innatedb.com/ http://irefindex.org/wiki/index.php?title=iRefIndex DatabaseComPPI GeneMANIA Source HitPredict I2D/OPHID IntScore MyProteinNet IID/OPHID PSICQUIC UniHI HIPPIE HumanNet InWeb MatrixDB PrePPI STRING InnateDB iRefIndex Publicly available databases that integratesources PPI datasets from multiple experimental, literature, and computational Table 2.5 52 Chapter 2 Constructing Reliable Protein-Protein Interaction (PPI) Networks

in itself, and as several studies have noted [Von Mering et al. 2003, Szklarczyk et al. 2011, Srihari and Leong 2012a] most predicted interactions tend to be “func- tional associations”—that is, relationships connecting functionally similar pairs of proteins—instead of actual physical interactions between the proteins. Never- theless, if these functional interactions are successful in “topologically enhancing” the PPI network, these can still aid downstream analysis including protein complex prediction.

Computational Prediction of Protein Interactions Although high-throughput techniques produce large amounts of data, the covered fraction of the interactomes from most organisms are far from complete [Cusick et al. 2009, Hart et al. 2006, Huang et al. 2007]. For example, while ∼70% of the in- teractomes from model organisms including S. cerevisiae have been mapped, these interactomes still lack interactions among membrane proteins [Von Mering et al. 2002, Hart et al. 2006, Huang et al. 2007]. Likewise, estimates show that less than 50% of the interactomes from higher-order organisms including human (∼10%) and other mammals have been mapped [Hart et al. 2006, Stumpf et al. 2008, Vidal 2016]. Computational prediction of interactions could partially compensate for this lack of coverage by predicting interactions between proteins in network regions with low coverage. Here, we only present a brief conceptual overview of computa- tional methods developed for protein interaction prediction; for methodological details and for a comprehensive list of these methods, the readers are referred to excellent surveys by Valencia and Pazos [2002], Obenauer and Yaffe [2004], Zahiri et al. [2013], Ehrenberger et al. [2015], and Keskin et al. [2016].

Gene Neighbors. A commonly used approach to predict protein interactions in prokaryotes is by using co-transcribed or co-regulated sets of genes. It is based on the observation that, in prokaryotes, proteins encoded by genes that are tran- scribed or regulated as single units—e.g., as operons—are often involved in similar functions and tend to physically interact. Computational methods exist to pre- dict operons in bacterial genomes using intergenic distances [Ermolaeva et al. 2011, Price et al. 2005]. Analysis of gene-order conservation in bacterial and ar- chaeal genomes shows that protein products of 63–75% of operonic genes physi- cally interact [Dandekar et al. 1998]. In eukaryotes, evidence from yeast and worm [Teichmann and Babu 2002, Snel et al. 2004] shows that co-regulated sets of genes encode proteins that are functionally similar and these proteins are highly likely to interact. These studies therefore provide the basis to predict new interactions be- tween proteins using sets of co-transcribed and co-regulated sets of genes [Huynen et al. 2000, Bowers et al. 2004]. 2.7 Enhancing PPI Networks by Integrating Functional Interactions 53

Phylogenetic Profiles. Similar phylogenetic profiles between proteins provide strong evidence for protein interactions [Pellegrini et al. 1999, Galperin and Koonin 2000, Pellegrini 2012]. For a given protein, a phylogenetic profile is constructed as a vector of N elements, where N is the number of genomes (species). The presence or absence of the protein in a genome is indicated as 1 or 0 at the corresponding po- sition in the phylogenetic profile. Phylogenetic profiles of a collection of proteins can be clustered using a bit-distance measure, to generate clusters of proteins that co-evolve. Therefore, proteins appearing in the same cluster are considered to be evolutionarily co-evolving and these proteins are inferred to be functionally related and physically interacting. This inference is based on the hypothesis that interacting sets of non-homologous proteins that co-evolve are under evolutionary pressure to conserve their interactions and to maintain their co-functioning ability [Shoemaker and Panchenko 2007, Sun et al. 2005].

Co-Evolution of Interacting Proteins. Interacting proteins often co-evolve so that changes in one protein in a pair leading to the loss of function or interaction should be compensated by correlated changes in the other protein [Shoemaker and Panchenko 2007]. This co-evolution is reflected by the similarity between the phylogenetic protein trees (or simply, protein trees) of non-homologous interacting protein families. A protein tree represents the evolutionary history of protein fami- lies, i.e., proteins or protein families that diverged from a common ancestor. These protein trees reconciled with their species trees have their internal nodes annotated to speciation and duplication events [Vilella et al. 2009]. TreeSoft (http://treesoft .sourceforge.net/treebest.shtml) provides a suite of tools to build and visualize protein trees. The similarity between two protein trees can be computed by aligning the corresponding distance matrices so as to minimize the difference between the matrix elements: the smaller the difference between the matrices, the stronger the co-evolution between the two protein families. Interactions are predicted between proteins corresponding to the aligned columns of the two matrices. The similarity between two protein trees is influenced by the speciation process and, therefore, there is a certain background similarity between any two protein trees, irrespective of whether the proteins interact or not. Statistical approaches exist to correct for these factors (phylogenetic subtraction) [Harvey and Pagel 1991, Harvey et al. 1995]. It is also worth noting that a protein can have multiple partners, and so taking into consideration its co-evolution with all its partners further enhances the accuracy of the interaction prediction [Juan et al. 2008].

Gene Fusion. Gene fusion is a common event in evolution, wherein two or more genes in one species fuse into a single gene in another species. Gene fusion is 54 Chapter 2 Constructing Reliable Protein-Protein Interaction (PPI) Networks

a result of duplication, translocation, or inversion events that affect coding se- quences during the evolution of genomes. Therefore, gene fusions play an im- portant role in determining the gene (and genomic) architecture of species. Gene fusions may occur to optimize co-transcription of genes involved in the fusion: by fusing two or more genes, it may be easier to transcribe these genes as a single en- tity, thus resulting in a single protein product. Typically, proteins coded by these fused genes in a species carry multiple functional domains, which originate from different proteins (genes) in the ancestor species. Therefore, one may infer inter- actions between these individual proteins in the ancestor species: it is likely that these proteins are partners in performing a particular function and they interact in the ancestor species, and that gene fusion has occurred in another species to optimize the transcription and to produce a single multidomain protein [Marcotte et al. 1999]. These fused proteins are referred to as chimeric or Rosetta Stone pro- teins [Marcotte et al. 1999]. The Rosetta Stone approach [Enright and Ouzounis 2001, Suhre 2007] infers protein interactions by detecting fusion events between protein sequences across species. In E. coli, this approach identified 6,809 puta- tive interacting pairs of proteins, wherein both proteins from each pair had sig- nificant sequence similarity to a single (fused) protein from at least one other species (genome). The analysis of these interacting pairs revealed that, for more than half of these pairs, both the proteins were functionally related [Marcotte et al. 1999].

PPI Network Topology. The pattern of interactions between proteins in a PPI net- work says a lot about how proteins interact, and provides a way to predict new interactions. For example, if a pair of proteins have many common neighbors in the PPI network, then most likely the two proteins in the pair and their common neighbors are involved in the same or similar function(s). Therefore, one may infer a direct physical interaction between the two proteins based on the number of neigh- bors and/or functions the two proteins share. Chua et al. [2006] used FS Weight interaction-scoring approach in this manner to predict interactions between level- 2 neighbors (connected via one other protein) in the PPI network. This is based on the observation that level-2 neighbors in the PPI network show the same or simi- lar annotations for functions and/or cellular compartment, and therefore these are more likely to interact compared to random pairs of proteins in the network. These FS-weighted predicted interactions between level-2 neighbors are added back to the PPI network after removing low-weighted interactions. Using the same rationale, one can predict new interactions using other topology-based (common-neighbor counting) schemes including Dice coefficient [Zhang et al. 2008] and Iterative CD 2.7 Enhancing PPI Networks by Integrating Functional Interactions 55

[Liu et al. 2008]. Likewise, the geometric embedding model [Prˇzuljet al. 2004, Higham et al. 2008] can also be used to predict new interactions: Proteins that are -close in the geometric embedding of the PPI network are more likely to interact compared to random pairs of proteins and proteins that are farther than -distance away in the embedding.

Functional Features. Interacting proteins are often involved in the same or similar functions. Therefore, if a pair of proteins are annotated with the same or similar functions, one could, with some degree of accuracy, infer a physical interaction between the two proteins. This is often referred to as “guilt by association,” which refers to the principle that genes or proteins with related functions tend to share properties such as genetic or physical interactions [Oliver 2000]. This inference can be further enhanced by combining other evidence that supports their functional similarity—for example, if the genes coding for the two proteins are located close by on the genome or are transcribed as an operonic unit (for prokaryotes) [Dandekar et al. 1998, Kumar et al. 2002], or the coding genes are co-transcribed or co-expressed [Huynen et al. 2000, Bowers et al. 2004, Jansen et al. 2002], or show similar phylo- genetic profiles [Pellegrini et al. 1999, Galperin and Koonin 2000, Pellegrini 2012]. Proteins within the same protein complex (co-complexed proteins) show a strong tendency to share functions and cellular localization and therefore physically inter- act. On the other hand, proteins from different cellular compartments most likely do not meet and therefore do not interact in vivo during their lifetimes. Jansen et al. [2003] used interactions between co-complexed proteins from the MIPS protein complex catalog [Mewes et al. 2006] as the positive training set, and non-interacting pairs of proteins as the negative training set, in a Bayesian framework, to predict new interactions in yeast. Blohm et al. [2014] present a dataset, the “Negatome,” of protein pairs that are highly unlikely to interact, which can be used as a negative training set. The Gene Ontology graph [Ashburner et al. 2000] integrates information on the functional and localization properties of proteins, and therefore provides a way to predict new interactions. For example, the TCSS approach by Jain and Bader [2010] can be used to compute similarity between pairs of proteins using the GO graph, and protein pairs showing high GO-semantic similarity can be predicted to physically interact. Likewise, multiple pieces of experimental and functional infor- mation can be combined to predict new interactions. For example, GeneMANIA (http://www.genemania.org/)[Warde-Farley et al. 2010] combines experimentally detected interactions from BioGrid [Stark et al. 2011, Chatr-Aryamontri et al. 2015], pathway annotations from Pathway Commons (http://www.pathwaycommons.org/) 56 Chapter 2 Constructing Reliable Protein-Protein Interaction (PPI) Networks

[Cerami et al. 2011], and information on evolutionary conservation of interactions from the Interologous Interaction Database (I2D) [Brown and Jurisica 2005], along with GO-based similarity, to predict new interactions (GeneMANIA and I2D are also listed in Table 2.5). The HumanNet [Lee et al. 2011] is a human functional interac- tion network which includes predicted interactions based on guilt by association for genes involved in human diseases.

Structural Information on Proteins. 3D structures of proteins provide first-hand ev- idence for protein interaction sites and binding surfaces of proteins. Therefore, by assessing the compatibility between the binding surfaces between two proteins, one can predict whether the two proteins interact or not. For example, Zhang et al. [2012, 2013] analyzed 3D structures of proteins from the (PDB) (http://www.rcsb.org/pdb/home/home.do)[Berman et al. 2000], a database which stores 3D structures for over 600 of the ∼6,000 characterized yeast proteins (∼10%), to predict new interactions between proteins in yeast. However, since the 3D structures are available for only a small fraction of proteins, using this approach for prediction of interactions on a larger scale is not feasible. Zhang et al. pro- posed to overcome this limitation to some extent by deriving homology models for proteins without available 3D structures. Homology models were derived for an additional ∼3,600 yeast proteins using the ModBase (http://modbase.compbio.ucsf .edu/)[Pieper et al. 2006] and Skybase (http://skybase.c2b2.columbia.edu/pdb60_ new/struct_show.php)[Mirkovic et al. 2007] databases. Given a query protein, these databases predict the most likely 3D structure for the protein based on its sequence similarity with templates built from proteins with available 3D structures. The final set of structurally predicted interactions from the Zhang et al. study is avail- able in the PrePPI database (http://bhapp.c2b2.columbia.edu/PrePPI/). Struct2Net (http://groups.csail.mit.edu/cb/struct2net/webserver/)[Singh et al. 2010] uses a structure-threading approach to predict interactions between proteins. Given two protein sequences, Struct2Net “threads” the sequences to known 3D structures from PDB, and then based on the best-matching structures, estimates the inter- action between the two proteins. PredictProtein (http://www.predictprotein.org/) [Yachdav et al. 2014] combines structure and GO-based methods to predict new interactions. Wang et al. [2012] curated a 3D-structure resolved dataset of 4,222 high-quality human PPIs enriched for human disease genes by examining relation- ships between 3,949 genes, 62,663 mutations, and 3,453 associated with human disorders.

Literature Mining. Interactions missed in PPI datasets but have direct or indi- rect reference in scientific publications can be identified by mining the litera- 2.7 Enhancing PPI Networks by Integrating Functional Interactions 57 ture. For example, these references may include abstracts or full-texts of publi- cations maintained in PubMed (http://www.ncbi.nlm.nih.gov/pubmed) by the Na- tional Center for Biotechnology Information (NCBI). Text-mining tools based on natural language processing (NLP) and other machine-learning techniques mine for co-occurrence of protein names in these literature sources, and proteins fre- quently referenced together can be predicted to interact. For example, the PubGene tool, which is a part of COREMINE (http://www.coremine.com/medical/), mines for information on genes and proteins including their co-occurrences in abstracts of publications, their , and association with cell processes and diseases. This information can be used to predict interactions between fre- quently co-occurring proteins. Similarly, iHOP (http://www.ihop-net.org/UniPub/ iHOP/)[Hoffman and Valencia 2004] presents a network of genes and proteins that co-occur in the scientific literature.

Limitations of Computationally Predicted Protein Interactions. Despite their promise in enhancing experimentally curated interaction datasets, computational methods have their own limitations. For example, methods that rely on (high-throughput) experimental datasets to predict new interactions—e.g., PPI topology-based meth- ods—carry an inherent bias in their predictions toward proteins already accounted for or enriched in these datasets. Moreover, these predictions are also affected by biological and technical noise (spurious interactions) in experimental data- sets. The methods that are based on genomic distances, fusion of genes, and co- transcription of (operonic) genes are applicable only to a selected subset of species (prokaryotes), since most eurkaryotic systems do not exhibit these properties so strongly. Knowledge-based methods—e.g., those based on the GO graph, 3D struc- tures, and literature abstracts—are restricted by the amount and kind of available information, and are again applicable to only a subset of proteins. As a result of these limitations, accurate prediction of physical interactions between proteins is a challenging problem, and most methods in fact predict only functional associa- tions instead of direct physical interactions between proteins. Depending on the application, these functional associations can turn out to be more useful or less useful. However, despite these limitations, computational techniques are orthogonal to and can effectively complement experimental techniques. Computational pre- dictions can be used to enhance the topologies of PPI networks—for example, by “densifying” sparse network regions or piecing together disconnected proteins or regions of the network—which in turn can improve downstream applications of PPI networks including protein complex prediction. Computational Methods for Protein Complex Prediction from PPI Networks

All models are wrong, but some models are useful. 3—George Box (1919–2013), British statistician (as quoted in Box [1979]) The process of identifying protein complexes from high-throughput interaction datasets involves the following steps [Spirin and Mirny 2003, Srihari and Leong 2012b, Srihari et al. 2015a] (Figure 1.1):

1. integrating high-throughput datasets from multiple sources and assessing the confidence of interactions; 2. constructing a reliable PPI network using only the high-confidence interac- tions; 3. identifying modular subnetworks from the network to generate a candidate list of complexes; and 4. evaluating the identified complexes against bona fide complexes, and vali- dating and assigning roles to novel complexes.

In this chapter, we review some of the representative methods developed to date for computational prediction of protein complexes from PPI networks. 60 Chapter 3 Computational Methods for Protein Complex Prediction from PPI Networks

Basic Definitions and Terminologies 3.1 A protein-protein interaction (PPI) network is modeled as an undirected graph G =V , E, where V is the set of proteins, and E ⊆ V × V is the set of interactions between the proteins. We also use V (G) and E(G) to refer to the set of proteins and ∈ interactions of a (sub-)network G, respectively. For a protein v V , N(v) or Nv is =| |=| | the set of immediate neighbors of v, deg(v) N(v) Nv is the degree of v, and E(v) is the set of interactions in the immediate neighborhood of v. The interaction density of G (or its subnetwork) is defined as: 2|E(G)| density(G) = , (3.1) |V | . (|V (G)|−1) which gives a real number in [0, 1] quantifying the “richness of interactions” within G: 0 for a network without any interactions and 1 for a fully connected network. The clustering coefficient CC(v) measures the “cliquishness” for the neighbor- hood of v, and is given by: 2|E(v)| CC(v) = . (3.2) |N(v)| . (|N(v)|−1) If the interactions of the network are scored (weighted), i.e., G =V , E, w, then the weighted versions—weighted degree, weighted interaction density, and weighted clustering coefficients—are given as follows: = degw(v) w(u, v), u∈N(v) 2 e∈E(G) w(e) density (G) = , and (3.3) w |V (G)| . (|V (G)|−1) 2 w(e) = e∈E(v) CCw(v) |N(v)| . (|N(v)|−1) There are other variants for these definitions proposed in the literature; see Kalna and Higham [2007] and Newman [2010] for a survey.

Taxonomy of Methods for Protein Complex Prediction 3.2 Although at a generic level most methods make the basic assumption that protein complexes are embedded among densely connected subsets of proteins within the PPI network, these methods vary considerably in their algorithmic strategies or in the biological information that they employ to detect complexes. Accordingly, we classify protein complex detection methods into the following two categories [Srihari and Leong 2012b, Srihari et al. 2015a, Li et al. 2010]: 3.3 Methods Based Solely on PPI Network Clustering 61

1. methods based solely on network clustering; and 2. methods based on network clustering and additional biological information.

The biological information can be in the form of functional, structural, organiza- tional, or evolutionary evidence on complexes or their constituent proteins. Wang et al. [2010] and Chen et al. [2014] present alternative classifications based on topo- logical (interaction density, betweenness centrality, and clustering coefficient), and dynamical properties of PPI networks capitalized by computational methods, re- spectively. We present this classification of methods for protein complex prediction as two snapshots—methodology-based and chronology-based—as shown in Figures 3.1 and 3.2, respectively. The methodology-based classification (Figure 3.1) is based on algorithmic methodologies employed by the methods. At level 1 of the tree, we di- vide methods into those based solely on network clustering, and those employing additional biological information. At subsequent levels, we further subdivide the methods based on their algorithmic strategies into: (i) methods employing merging or growing of clusters from the network (agglomerative); and (ii) methods em- ploying repeated partitioning of the network (divisive). Agglomerative methods go bottom-up, i.e., by beginning with small “seeds” (e.g., triangles or cliques) and re- peatedly adding proteins or merging clusters of proteins based on certain similarity criteria to arrive at predicted set of complexes. On the other hand, divisive methods go top-down, i.e., by repeatedly partitioning the network into multiple dense sub- networks based on certain dissimilarity criteria to arrive at predicted complexes. In the chronology-based classification (Figure 3.2), we bin methods based on the time (year) when these were developed, and for each bin we stack the methods based on the kind of biological information they employ for the prediction. Figures 3.1 and 3.2 and Table 3.1 show only the methods that are discussed in this chapter. Other methods that employ diverse kinds of biological information and/or algorithmic strategies are covered in the following chapters.

Methods Based Solely on PPI Network Clustering 3.3 Most methods that directly mine for dense subnetworks from the PPI network make use of solely the topology of the network.

Molecular COmplex DEtection (MCODE) MCODE, proposed by Bader and Hogue [2003], is one of the first computational methods developed for protein complex detection from PPI networks. The MCODE algorithm operates in two steps, vertex weighting and complex prediction, and an optional third step for post-processing of the candidate complexes. In the first step, 62 Chapter 3 Computational Methods for Protein Complex Prediction from PPI Networks Network partitioning • RNSC (King et al., 2005) Functional homogeneity Merging and growing clusters • PCP (Chua et al., 2007) • DECAFF (Li et al., 2007) Biological information PPI network clustering + Core- attachment level Kind of Merging and biological growing clusters information • COACH (Wu et al., 2009) • COACH • CORE (Leung et al., 2009) • MCL-CAw (Srihari et al., 2010) (Wu et al., 2012) • CACHET used Kind of techniques algorithmic Protein Complex Prediction Methods Network partitioning • SPC (Spirin & Mirny, 2003) Dongen, 2000) • MCL (Van et al. (2004) • Leal-Pereira • Pu et al. (2007) • Friedel et al. (2008) Solely PPI network clustering Merging and growing clusters • MCODE (Bader et al., 2003) • LCMA (Li et al., 2005) • CFinder (Ademcsek et al., 2006) • DPClus (Altaf-Ul-Amin et al., 2006) • IPCA (Li et al., 2008) • SuperComplex (Qi et al., 2008) • CMC (Liu et al., 2009) et al., 2009) (Wang • HACO • ClusterONE (Nepusz et al., 2012) et al., 2012) • Ensemble (Yong Methodology-based taxonomy of protein complexplexes, prediction of size ≥ 4). methods At (for the largeinto topmost com- those level, based the solely complex onditional prediction PPI biological methods network information are clustering, (here, classified and core-attachment thosethe and that functional next take homogeneity). level, into At the account methods ad- ters are (agglomerative), and further those classified that into partitionMethods those the that network that are into merge not smaller shown and clusters here grow (divisive). are clus- covered in the following chapters. Figure 3.1 DECAFF (Li et al.) Functional RNSC (King et al.) homogeneity PCP (Chua et al.)

CORE (Leung et al.) Core- DPClus MCODE (Altaf-Ul-Amin et al.) CFinder COACH CACHET attachment (Bader et al.) MCL-CAw (Ademcsek et al.) (Wu et al.) (Srihari et al.) (Wu et al.)

MCL IPCA (Li et al.) ClusterONE LCMA (Enright & Van Dongen) HACO (Wang et al.) (Nepusz et al.) PPI network (Li et al.) MCL MCL clustering Pu et al. SuperComplex Ensemble (Van Dongen) SPC (Spirin & Mirny) (Leal-Pereira et al.) (Qi et al.) CMC (Liu et al.) (Yong et al.) 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 2010 2011 2012

Figure 3.2 Chronology-based taxonomy of protein complex prediction methods (for large complexes, of size≥4). Complex prediction methods are binned based on the year (2000–2012) these were published. The methods are stacked into layers based on whether these are based on solely PPI network clustering, or PPI network clustering with additional biological information (here, core-attachment and functional homogeneity). Methods that are not shown here are covered in the following chapters. Table 3.1 Protein complex prediction methods discussed in this chapter and weblinks to their associated software.

Classification Method Year Source Reference

PPI network MCODE 2003 http://apps.cytoscape.org/apps/mcode [Bader and Hogue 2003] clustering MCL 2000, 2004 http://micans.org/mcl/ [Van Dongen 2000, Leal-Pereira et al. 2004]

http://apps.cytoscape.org/apps/clustermaker

SPC 2003 http://www.weizmann.ac.il/complex/compphys/sites/ [Spirin and Mirny 2003, Blatt et al. 1996] complex.compphys/..., files/uploads/software/clustering_prl.ps LCMA 2005 Not available [Li et al. 2005]

CFinder 2006 http://www.cfinder.org/ [Adamcsek et al. 2006]

DPClus 2006 http://kanaya.naist.jp/DPClus/ [Altaf-Ul-Amin et al. 2006]

IPCA 2008 http://netlab.csu.edu.cn/bioinformatics/limin/IPCA/ [Li et al. 2008]

SuperComplex 2008 http://www.cs.cmu.edu/~qyj/SuperComplex/ [Qi et al. 2008]

CMC 2009 http://www.comp.nus.edu.sg/~wongls/projects/... [Liu et al. 2009] complexprediction/CMC-26may09 HACO 2010 http://www.bio.ifi.lmu.de/Complexes/ProCope/ [Friedel et al. 2009, Wang et al. 2009] ClusterONE 2012 http://apps.cytoscape.org/apps/clusterone [Nepusz et al. 2012] Ensemble 2012 http://compbio.ddns.comp.nus.edu.sg/~cherny/SWC/ [Yong et al. 2012] Core-attachment CORE 2009 http://alse.cs.hku.hk/complexes/ [Leung et al. 2009] structure COACH 2009 http://www1.i2r.a-star.edu.sg/~xlli/coach.zip [Wu et al. 2009] MCL-CAw 2009, 2010 http://sites.google.com/site/mclcaw/ [Srihari et al. 2009, Srihari et al. 2010] CACHET 2012 http://www1.i2r.a-star.edu.sg/~xlli/CACHET/CACHET.htm [Wu et al. 2012]

Functional RNSC 2004 http://www.cs.utoronto.ca/~juris/data/ppi04/ [King et al. 2004] homogeneity DECAFF 2007 http://www1.i2r.a-star.edu.sg/~xlli/DECAFF/complexes.html [Li et al. 2007]

PCP 2008 http://www.comp.nus.edu.sg/~wongls/projects/... [Chua et al. 2008] complexprediction/PCP-3aug07/ 3.3 Methods Based Solely on PPI Network Clustering 65 each protein v in the PPI network G is weighted based on its clustering coefficient (CC). However, instead of using the entire neighborhood of v, MCODE uses the density of the highest k-core in the neighborhood of v, which amplifies the weights for proteins located in densely connected regions of G.Ak-core is a subnetwork of proteins such that each protein in this subnetwork has a degree no less than k.Ak- core in the neighborhood of v is a subnetwork of v and all its neighbors that have degree no less than k.Ak-core in the neighborhood of v is highest or maximal if + there exists no (k 1)-core in that neighborhood. If Ck(v) represents this highest k-core, the weight for v is assigned as: 2|E(C (v))| CC(v) = k . (3.4) | | | |− V(Ck(v)) . ( V(Ck(v)) 1) In the second step, a protein v with the highest weight is used to “seed” a complex. MCODE then recursively moves outwards from the seed by including proteins whose weight is no more than a preset percentage, controlled by the vertex- weight parameter, away from the seed protein. The process stops when there are no more proteins to be added to the complex. This process is repeated by selecting the next unseeded protein. A protein once added to a complex is not checked subsequently for seeding. At the end of this process, multiple non-overlapping candidate complexes are generated. The optional third step postprocesses the candidate list of complexes. First, complexes without 2-cores, i.e., 1-protein and 2-protein complexes, are filtered out. New proteins in the neighborhood of candidate complexes and with weights higher than a preset ‘fluff’ parameter are then added to these complexes. These newly added proteins are not marked as “seen” and hence can be added to multiple com- plexes (shared between the complexes). This step also includes a “hair cut” option whereby proteins singly connected to complexes are removed before attempting to add to complexes. The final complexes are ranked based on their interaction densi- ties. The time complexity of the MCODE algorithm is O(|V (G)| . |E(G)| . h3), where h is the size of an average neighborhood in G.

Markov CLustering (MCL) The MCL algorithm, proposed by Van Dongen [2000], is a popular graph-clustering algorithm that works by extracting dense subnetworks or regions from graphs using random walks. These random walks are collectively called a flow. In biological applications, MCL was first applied to cluster protein families and protein ortholog groups [Enright et al. 2002] where interactions in the graph represented (sequence) similarities between the proteins. MCL was subsequently found to be effective 66 Chapter 3 Computational Methods for Protein Complex Prediction from PPI Networks

for clustering PPI networks into protein complexes and functional modules [Leal- Pereira et al. 2004, Brohee and van Helden 2006, Pu et al. 2007]. MCL works by manipulating the connectivity (adjacency) matrix of the network using two operators, expansion and inflation, to control the flow. Expansion con- trols the dispersion of the flow, whereas inflation controls the contraction of the flow, making the flow thicker in dense regions and thinner in sparse regions. These parameters boost the probabilities of intra-cluster walks and demote those of inter-cluster walks. Such iterative expansion and inflation separates the net- work into multiple non-overlapping regions. An animated example of the clustering process is available from the MICANS website, http://www.micans.org/mcl/. Math- ematically, expansion coincides with matrix multiplication, whereas inflation is a Hadamard power followed by a diagonal scaling of the matrix. Therefore, being mainly repetitive matrix operations, MCL is efficient and scalable even to large net- works. The inflation parameter I is a user input, and higher the value of I the finer is the clustering. For PPI networks, Brohee and van Helden [2006] recommend I to be between 1.8 and 2.0, to obtain clusters that match bona fide protein complexes. However, if the PPI network is reasonably dense (e.g., with average node degree ≥10) there is a tendency for MCL to produce several large (size ≥30) clusters that subsume smaller clusters, and therefore, I>2 is more appropriate to produce a finer clustering in such cases. However, a large I can also often lead to artificial breaking apart of clusters. One drawback of MCL particularly in its application to detect protein complexes is that MCL only produces disjoint (non-overlapping) clusters. Therefore, proteins that participate in more than one complex will be assigned (arbitrarily) to only one of the clusters. To overcome this problem, Pu et al. [2007] tweaked MCL by allowing proteins in a cluster that have sufficiently large number of interacting partners in another cluster (the acceptor cluster) to be also assigned to that cluster. The mimimum fraction f of partners a protein requires to have in the acceptor cluster C is defined by a power-law function f = a|C|b, where a and b are empirically determined, the recommended values being a = 1.5 and b =−0.5.

Superparamagnetic Clustering (SPC) SPC, proposed by Blatt et al. [1996] and Getz et al. [2000, 2002], is based on the Potts model [Fukunaga 1990] (a model of ferromagnetism), where data points in a given metric space are clustered based on “spins” assigned to them. Initially, each data point in the metric space is assigned a spin from a fixed range of spin values such that two close-enough (based on a distance threshold) data points are assigned 3.3 Methods Based Solely on PPI Network Clustering 67 the same spin. Beginning from an initial condition (represented by a temperature parameter), the system is subjected to a relaxation (cooling) process (by constantly lowering the temperature). At each step, the spins on the data points are allowed to fluctuate such that a cost function is constantly minimized. This cost function measures the overall similarity between the spins of the data points, and is the lowest when data points that are close to one another have similar spins. The system is considered unstable at the initial condition (high temperature), but when the cost function reaches its minimum (at a lower temperature), the system is considered stable (the superparamagnetic phase). At this stable condition, the data points are considered clustered, with points with the same spins assigned to the same cluster. X ={ } Let x1, x2,...,xN be a set of data points in a given metric space that needs to be partitioned into M groups that form the clusters. If two data points xi and xj || − || ≤ = are close enough, xi xj d, they are assigned the same spin si sj from a range [1, q]. The cost function H(X) is defined as: X = H( ) J(xi , xj ) . δ(xi , xj ), (3.5) (i,j)

∝ −|| − || = = = where J(xi , xj ) xi xj , and δ(xi , xj ) 0ifsi sj , else δ(xi , xj ) 1. The func- tion δ(xi , xj ) measures dissimilarity between the spins of xi and xj . The cooling process makes the spins on the data points to fluctuate such that H(X) decreases at each step until the function reaches a (local) minima. However, due to lack of sufficient details in the original works, we are unable to describe how the spins are updated to ensure H(X) reaches a minima and the cooling process halts. Blatt et al. [1996] describe a halting condition based on variation of the fluctuating spins during the cooling process. During the initial stages of the cooling process, the spins fluctuate more (unstable system), but as the system cools, the spins tend to converge to specific values and fluctuate less (stable system). The aim is to locate a temperature range [Tt ...Tt+l] in which the system is stable. A susceptibility parameter χ over the temperature range of the system is measured which depends on the variance m of fluctuation of spins over the temperature range, given as:

N χ = (m2−m2), (3.6) l where for each temperature point Ti of the system, m is computed as:

(N /N)q − 1 m = max . (3.7) q − 1 68 Chapter 3 Computational Methods for Protein Complex Prediction from PPI Networks

= { } Here, Nmax max N1, N2,...,Nq where Nμ is the number of spins with value μ, and ... refers to the average. As the system stabilizes, the variance of the fluctuating spins becomes negligible, and so does the susceptibility χ of the system.   The average spin dissimilarity δ(xi , xj ) of a subset of data points is used to decide whether or not the data points belong to the same cluster. Spirin and Mirny [2003] used SPC to predict protein complexes from PPI net- works. Again, due to lack of sufficient details in the original works, we are unable to provide a clear description of this application of SPC to networks. But briefly,

if two proteins pi and pj are immediate neighbors in the PPI network, they are mapped to data points xi and xj in the metric space such that xj is one of the k nearest data points to xi. The two points xi and xj are assigned the same spins, = si sj . As described above, the system is allowed to cool until the system attains low susceptibility, at which stage the data points and hence the proteins are consid- ered to be clustered. A subset of proteins are assigned to the same cluster (protein complex) if the average spin dissimilarity between all pairs of proteins in the subset is below a threshold.

Local Clique Merging Algorithm (LCMA) LCMA, proposed by Li et al. [2005], is based on identifying and merging local cliques into bigger dense subgraphs for protein complex detection. LCMA works in two steps. In the first step, local cliques are identified from the network G as follows. =  For a vertex v in G, its local neighborhood subgraph Gv Vv , Ev is computed, ={ }∪{ ∈ } ={ ∈ ∈ ∈ } where Vv v u : (u, v) E , and Ev (s, t) : (s, t) E, s Vv , t Vv . LCMA then repeatedly removes a (loosely connected) vertex from this local neighborhood subgraph if the removal increases the interaction density of the subgraph, until the density cannot be increased any further. The resulting local subgraph is a local clique with density 1. LCMA then moves on to the local neighborhood of the next vertex to produce similar local cliques. However, the local cliques produced this way may overlap considerably. There- fore, in the second step, LCMA repeatedly merges these local cliques using their neighborhood affinities (NA) to produce larger neighborhoods. The affinity

NA(C1, C2) between two local cliques (neighborhoods) C1 and C2 is given as | ∩ |2 C1 C2 NA(C , C ) = | | | | . Beginning with the set LC of all local cliques (neighbor- 1 2 C1 . C2 ≥ hoods), LCMA repeatedly merges two cliques C1 and C2 that have NA(C1, C2) ω, a preset threshold, to produce a merged and larger neighborhood. This resulting neighborhood is added back to LC and used in the merging process. Local cliques 3.3 Methods Based Solely on PPI Network Clustering 69 that are not used up in the merging process are retained. The final set of merged neighborhoods and unmerged cliques are output as candidate protein complexes.

Complex Finder (CFinder) CFinder, proposed by Adamcsek et al. [2006], uses the idea of clique percolation by Der´enyi et al. [2005] to locate k-clique percolation clusters that represent protein complexes in the PPI network. A k-clique is a complete subnetwork of size k, and two k-cliques are said to be adjacent if they share exactly k − 1 vertices. A k- clique percolation cluster includes: (i) all vertices that can be reached via chains of adjacent k-cliques from each other; and (ii) the edges within these cliques. A k-clique percolation cluster is very much like a building a k-clique adjacency graph— where the vertices represent the k-cliques of the original graph and there is an edge between two vertices if the corresponding k-cliques are adjacent—and moving from one vertex to another along an edge is equivalent to “rolling” a k-clique template from one k-clique of the original graph to an adjacent one. For an Erdos-Renyi¨ random network of N vertices with p being the probability that any two vertices are connected, Der´enyi et al. [2005] show that a giant k-clique component appears = in the network at probability p pc(k), the percolation threshold,

= 1 pc(k) . (3.8) [(k − 1) . N]1/(k−1)

= = For k 2, this result agrees with the threshold pc(k) 1/N because 2-clique con- nectedness is equivalent to regular edge connectedness. By locating a random k-clique, we can roll a k-clique template onto an adjacent k-clique by relocating one the vertices of the k-clique and keeping other k − 1 vertices fixed. The expres- sion 3.8 can then be obtained by requiring that after rolling the expectation value of the number of adjacent k-cliques, where the template can roll further, be equal to 1 at the percolation threshold pc(k). The intuition behind this requirement is that a smaller expectation value would cause the rolling to halt too soon, whereas a larger expectation value would allow an infinite series of bifurcations for the rolling. This expectation can be estimated as (k − 1)(N − k − 1)pk−1, where the first term (k − 1) counts the number of vertices that can be selected for the roll-on, the sec- ond term (N − k − 1) counts the number of potential destinations for this roll-on, of which only a fraction pk−1 is acceptable, because each of the new k − 1 edges (as- sociated with the roll-on) must exist in order to obtain a new k-clique. For large N, this criteria simplifies to (k − 1)Npk−1 = 1, giving the above Equation (3.8) for p(k). 70 Chapter 3 Computational Methods for Protein Complex Prediction from PPI Networks

Since we are interested in locating more than one k-cliques rather than just a single giant k-clique, it is important that the number of k-cliques in the percolation cluster, denoted by N ∗, not grow as fast as N , the total number of k-cliques in the graph. We can define an order parameter associated with this choice as ψ = N ∗/N . The number of k-cliques in graph can be estimated as:   k N − N − N ≈ pk(k 1)/2 ≈ pk(k 1)/2, (3.9) k k! N because k different vertices can be selected in k different ways, and any such selection makes a k-clique if and only if all the k(k − 1)/2 edges between these k vertices exist, each with probability p. Erdos¨ [1960] showed that for random graphs with N vertices, the size of the largest component scales as N 2/3. Therefore, applying the same to the k-clique adjacency graph, the size of the giant component N ∗ N 2/3 = scales as . Plugging p pc(k) from Equation (3.8) into Equation (3.9), we get the scaling

N ∼ N k/2 (3.10)

for the total number of k-cliques. Thus, the size of the giant component N ∗ is expected to scale as N 2/3 ∼ N k/3, and the order parameter ψ scales as N 2/3/N ∼ N −k/6. But obviously, this holds only if k ≤ 3 because for k>3, the size N ∗ which scales as N k/3 is faster than N. Therefore, for k>3, we expect N ∗ ∼ N, and using Equation (3.10), we get ψ = N ∗/N ∼ N 1−(k/2), that is, as some negative power of N. Since PPI networks are generally sparse and composed of more than one connected components, we expect this to hold. Adamcsek et al. [2006] identified k-clique percolation clusters from PPI networks, and suggest k between 4 and 6 to identify clusters that match real protein complexes.

Density-Periphery-based Clustering (DPClus) DPClus, proposed by Altaf-Ul-Amin et al. [2006], works in a manner similar to MCODE [Bader and Hogue 2003] by first identifying seed vertices and then ex- panding them into protein complexes. DPClus begins by assigning a weight λ(u) =| | to every protein u in the network: λ(u) N(u) , if the network is unweighted; and = λ(u) v∈N(u) w(u, v), if the network is weighted. Proteins are then sorted in non- increasing order of their weights, and the protein with the highest weight is chosen as a seed. This seed is initialized as a cluster, and the neighbors of the cluster are then repeatedly chosen and added to the cluster. Specifically, a cluster property Cp(v, C) of a neighboring protein v with respect to a cluster C is computed as: 3.3 Methods Based Solely on PPI Network Clustering 71

p∈C w(v, p) 1 Cp(v, C) = . , (3.11) | | C densityw(C) where densityw(C) is the weighted interaction density of C. The neighbors of C are then prioritised in non-increasing order of their cluster property. A neighbor v is ≥ repeatedly selected and added to C if: (i) Cp(v, C) Cpin, a minimum threshold for the cluster property; and (ii) the addition of v does not reduce the density of C below a threshold din. The protein v is then considered to be not at the “periphery” of the cluster but rather to belong to the cluster (and hence the name, DPClus). Once a cluster cannot be further be expanded, it is output as a predicted complex, and also removed from the network. The next seed is chosen and expanded using the remaining network. DPClus thus outputs only non-overlapping clusters; how- ever, in a modification of their implementation (http://kanaya.naist.jp/DPClus/), the authors retain the original network when future seeds are expanded, to produce overlapping clusters.

Interaction Probability-based Clustering Algorithm (IPCA) IPCA, developed by Li et al. [2008], modifies the criteria used in DPClus to add (neighboring) proteins to clusters by introducing two additional measures, cluster diameter and interaction probability. The diameter DiamC of the cluster C is the maximum distance between any pair of proteins in the cluster. The interaction probability P(v, C) for a protein v with the cluster C is given by p∈C w(v, p) P(v, C) = . (3.12) p,q∈C w(p, q) Li et al. observed that if for every protein v ∈ C, P(v, C ) ≥ t, where V(C ) = V(C)− {v} and t is a fixed threshold, then density(C) ≥ t, that is, the density of the cluster C is lower-bounded by t. Li et al. also observed that many real complexes have small diameters. Based on these observations, the criteria to select proteins v to be added to cluster C in IPCA is modified to: (i) P(v, C) ≥ t; and (ii) the diameter of the cluster ≤ after including v is DiamC d, where t and d are fixed thresholds.

Clustering Based on Merging Maximal Cliques (CMC) CMC, proposed by Liu et al. [2009], detects protein complexes by repeated merging of reliable maximal cliques from the PPI network. While LCMA [Li et al. 2005] also uses a similar strategy, it is restricted to unweighted networks, whereas CMC uses the weights of interactions to filter and merge cliques. The inclusion of weights allows CMC to discount the effect of noise in the network. The interactions are 72 Chapter 3 Computational Methods for Protein Complex Prediction from PPI Networks

weighted using Iterative-CD scoring (Chapter 2), which is available with the CMC software package. CMC begins by enumerating all maximal cliques in the network using a fast search space pruning-based maximum clique-finding algorithm [Tomita et al. 2006]. Although enumerating all maximal cliques in a graph is NP-hard in gen- eral, this does not pose a problem here because PPI networks are quite sparse. Each clique C is scored using its weighted interaction density, u v∈C w(u, v) density (C) = , . (3.13) w |C| . (|C|−1) The cliques are ranked in non-increasing order of their weighted densities. CMC then iteratively merges highly overlapping cliques based on the extent of

their inter-connectivity. The overlap between two cliques C1 and C2 is defined as: | ∩ | C1 C2 O(C , C ) = | ∪ | , and the cliques are considered to highly overlap if O(C , C ) ≥ 1 2 C1 C2 1 2 To, an overlap threshold. The inter-connectivity Iw(C1, C2) between C1 and C2 is then defined based on their non-overlapping regions  u∈(C −C ) v∈C w(u, v) u∈(C −C ) v∈C w(u, v) I (C , C ) = 1 2 2 . 2 1 1 . (3.14) w 1 2 | − | | | | − | | | C1 C2 . C2 C2 C1 . C1 ≥ If Iw(C1, C2) Tm, a merge threshold, then C2 is merged with C1, otherwise C2 ≥ ≥ is removed if densityw(C1) densityw(C2) and O(C1, C2) To. Finally, all merged clusters are ranked based on their weighted densities and output as predicted complexes. Since CMC takes into account the weights of interactions, it prioritises more reliable cliques for the merging process while eliminating the less reliable ones, thereby reasonably discounting the effect of noise in PPI datasets.

Clustering with Overlapping Neighborhood Expansion (ClusterONE) ClusterONE, proposed by Nepusz et al. [2012], works in a manner similar to MCODE, by greedy neighborhood expansion. However, unlike MCODE (and MCL) it allows overlaps between the generated protein complexes. Beginning with indi- vidual seed proteins, ClusterONE greedily expands them into clusters C ∈ C based on a cohesiveness measure, given by:

w(in)(C) f(C)= , (3.15) w(in)(C) + w(bound)(C) + p(C)

where w(in)(C) is the total weight of interactions within cluster C, w(bound)(C) is the total weight of interactions connecting C to the rest of the PPI network, and p(C) is a penalty term to model uncertainty in the data due to missing interactions. At each step, new proteins are added to C until f(C)does not increase any further. 3.3 Methods Based Solely on PPI Network Clustering 73

Cluster C represents a locally cohesive group of proteins. When all such clusters are generated, the highly overlapping ones are merged into candidate protein ∈ C complexes. The overlap between two clusters C1, C2 is computed as: |C ∩ C |2 ω(C , C ) = 1 2 , (3.16) 1 2 | | | | C1 . C2 ≥ and C1 and C2 are merged if ω(C1, C2) 0.8. If a cluster does not overlap with any other cluster, it is promoted by itself as a candidate protein complex without any merging. Likewise, if a cluster overlaps with more than one other clusters, then it is merged individually with each of its overlapping clusters. In the final step, ClusterONE discards all candidate complexes that contain less than three proteins or whose density is below a certain threshold, to produce the final list of protein complexes.

Hierarchical Agglomerative Clustering with Overlaps (HACO) HACO, proposed by Wang et al. [2009], modifies the classical hierarchical agglom- erative clustering (HAC) [Kaufman and Rousseeuv 2009, Sardiu et al. 2009] to allow overlaps between clusters and to identify overlapping complexes. At each step, the HAC algorithm with average linkage maintains a pool of candidate sets to be merged. The distance between two non-overlapping sets S1 and S2 is given by: 1 D(S , S ) = . d(p, q), (3.17) 1 2 | | . | | S1 S2 ∈ ∈ p S1,q S2 where d(p, q) is the complement of the affinity between proteins p and q. The affinity is usually the reliability weight w(p, q)∈ [0, 1]of the interaction (p, q)in the PPI network, and therefore d(p, q)can be set to, for example, d(p, q) = 1 − w(p, q).

In each step, two non-overlapping sets S1 and S2 with the closest distance are iteratively merged to generate a new set S12, and the original sets S1 and S2 are removed. The algorithm terminates when there are no remaining sets to be merged.

In HACO, the sets S1 and S2 are retained for a potential merges with other sets later, the intuition being that if there is another set S3 whose distance to S1 is only slightly greater than that of S2, then the decision to merge S1 and S2 could be arbitrary and unstable. This is more so when there are spurious interactions in the PPI network. Therefore, HACO produces two merged sets S12 and S13 by retaining S1 based on a divergence decision: If S1 is considerably different from S12 then S1 is retained (to generate S13), otherwise S1 is removed while S12 is retained. This procedure can potentially result in overlapping complexes, for example, when both S12 and S13 are produced, and thus improves the classical HAC for identifying overlapping protein complexes. 74 Chapter 3 Computational Methods for Protein Complex Prediction from PPI Networks

Supervised Protein Complex Prediction (SuperComplex) SuperComplex, proposed by Qi et al. [2008], works by supervised clustering of the PPI network using a Bayesian Network model learned from distinguishing features of real protein complexes. If a certain subnetwork from the PPI network represents a real protein complex, a distinguishing feature is a property that can sharply distin- guish this subnetwork from other (random) subnetworks that do not represent real protein complexes. Qi et al. employed ten groups of features, of which a majority (nine groups) are based on the topology of the subnetwork, and one is based on the weight and size characteristics of proteins involved in the subnetwork. Hence, we classify SuperComplex as still a (predominantly) topology-based clustering method. The ten groups of distinguishing features for a subnetwork S are as follows (the number of features are given in parenthesis):

1. number of proteins in S (1); 2. interaction density of S (1); 3. degree statistics: mean, variance, median, and maximum degree of the pro- teins in S (4); 4. interaction-weight statistics: mean and variance of interaction weights in S (2); 5. interaction density of S with respect to preset weight cut-offs on interac- tions (1); 6. degree correlation statistics: mean, variance, and the maximum of the num- ber of interactions with immediate neighbors of each protein in S (3); 7. clustering coefficient: see below (1); 8. topological coefficient statistics: see below (1); 9. the first eigenvalues: see below (3); and 10. protein weight/size statistics: the average and maximum protein length, and average and maximum protein weight of each protein in the subnetwork (4).

Clustering coefficient (CC), which has been defined earlier (Section 3.1) but we redefine it here for the ease of reading, measures the “cliquishness” in the

neighborhood of a protein v, and is computed as follows. If Ev is the number of = | | | | | |− interactions in the immediate neighborhood of v, then CCv 2 Ev /( Nv . ( Nv 1). This feature is then averaged for all proteins in the subnetwork S in question: = | | . CC(S) 1/ S v∈S CCv. 3.3 Methods Based Solely on PPI Network Clustering 75

A B A B A B A B

C D C D C D C D

Linear Clique Star Hybrid SV1: 1.618 SV1: 3.0 SV1: 1.7321 SV1: 2.1701 SV2: 1.618 SV2: 3.0 SV2: 1.7321 SV2: 2.1701 SV3: 0.618 SV3: 1.0 SV3: 0.0 SV3: 1.0

Figure 3.3 The first three eigenvalues for four example subnetwork topologies. (Redrawn from Qi et al. [2008])

Topological coefficient (TC) is computed as follows. Let u be a neighbor of v, ∩ | ∩ |+ then Nv Nu is the set of neighbors shared between v and u. Therefore Nv Nu 1 is the number of shared neighbors to which both v and u are linked including each other. Thus, TC is defined as: | ∩ |+ 1 u∈N ,|N ∩N |≥1( Nv Nu 1) TC = . v v u , (3.18) v | | |{ ∈ | ∩ |≥ }| Nv u : u Nv , Nv Nu 1 where u in the equation includes only the neighbors of v with which v shares at least one other neighbor. This feature is then averaged for all proteins in the subnetwork = | | . S in question: TC(S) 1/ S v∈S TCv. The eigenvalues feature includes the first three singular values (SV) of the adja- cency matrix of the subnetwork S. The singular value decomposition of an m × n matrix M is a factorization of the matrix in the form M = UV, where U is a m × m unitary matrix,  is a m × n diagonal matrix with non-negative real numbers on its diagonal, and V is a n × n unitary matrix. The diagonal entries of  are known as the SVs of M. A common convention is to list the SVs in descending order. Qi et al. noted that the SVs for subnetworks of different topologies (e.g., linear, clique, star, and hybrid) show differences in the first three SVs; see Figure 3.3. A supervised Bayesian Network (BN) is then trained using a positive training set of known protein complexes, and a negative training set generated by randomly selecting proteins from the PPI network into groups. The above features are gener- ated independently of two parameters: whether a subnetwork is a protein complex or not (C = 1 or 0), and the number of proteins N in the subnetwork. The number of proteins N in the subnetwork is treated separately and is not included as part of the list of features because of the tendency for other features to depend on N. 76 Chapter 3 Computational Methods for Protein Complex Prediction from PPI Networks

For a given subnetwork S, the conditional probability that it represents a protein complex is computed using Bayes’ rule as:

| = . = = | = p(N, x1, x2,...,xm C 1) p(C 1) p(C 1 N , x1, x2,...,xm) p(N, x1, x2,...,xm) | = . | = . = = p(x1, x2,...,xm N , C 1) p(N C 1) p(C 1) p(N, x1, x2,...,xm)  m | = | = = = p(xk N , C 1) . p(N C 1) . p(C 1) = k 1 , p(N, x1, x2,...,xm) (3.19)

where x1, x2,...xm represent the features discussed above. A similar conditional probability is computed for S to represent a non-complex, by replacing C = 1by C = 0 in the above equation. Using the two posteriors, the log likelihood ratio for the subnetwork S is given by:

p(C = 1|N , x , x ,...,x ) L = log 1 2 m = | p(C 0 N , x1, x2,...,xm)  (3.20) | = = m | = p(N C 1) . p(C 1) . = p(xk N , C 1) = log k 1 . | = . = . m | = p(N C 0) p(C 0) k=1 p(xk N , C 0)

Maximum likelihood estimation is used for learning these conditional probabili- ties from the training data. The prior is P(C= 0) = 1 − P(C= 1) with P(C= 1) set to 0.0001. The log likelihood score is then used to identify maximally scoring subnetworks from the PPI networks, using a heuristic iterative procedure as follows. The proce- dure begins with seed clusters consisting of pairs of proteins (u, v) in which the protein u is connected to only the neighbor v with which it has the highest interac- tion weight. In each iteration, a neighbor is repeatedly chosen from the neighbors of each current seed cluster and added to the cluster if the new log likelihood score L is higher than the current score L. But, if the addition of the neighbor decreases

the score, the expanded cluster is still accepted with a probability e(L−L )/T , where T

is a temperature parameter, with the initial temperature T0 chosen using cross val- idation on a dataset of training and testing samples. After each such iteration, the temperature is reduced as T := αT , where the scaling factor α is also chosen using cross validation. The resulting clusters after 20 iterations of this search procedure are output as predicted protein complexes. 3.3 Methods Based Solely on PPI Network Clustering 77

Ensemble Clustering Yong et al. [2012] and Srihari and Leong [2012a] developed ensemble approaches using majority voting to aggregate clusters generated from different protein com- plex prediction methods, and from different PPI networks, respectively. In the study by Yong et al. [2012], individual lists of clusters are first predicted from the PPI network using different protein complex prediction methods namely, MCL, CMC, ClusterONE and HACO. Sets of similar clusters but produced from different clustering methods are first identified: Two clusters A and B produced from two different methods are assessed for similarity using the Jaccard index, J(A, B) = |A ∩ B|/|A ∪ B|, and the two clusters are deemed similar if J(A, B) ≥ 0.75. If the two clusters are similar, then only one of them which has the highest weighted density, say cluster A here, is retained, and is assigned a score given by:

= Score(A) Num(A) . densityw(A), (3.21) where Num(A) is the number of different methods that produce A or clusters sim- ilar to A. Thus, the clusters with higher quality (density) and predicted by multiple methods are scored higher. All unique clusters produced by each method, which are not similar to clusters from other methods, are also retained and scored based on their densities. As the different complex prediction methods do not have large over- lap in their predictions, doing so improves the coverage of the generated clusters while maintaining the quality of the clusters by scoring higher those with higher densities and predicted by multiple methods. A similar approach is adopted by Benschop et al. [2010], but with the difference being that the overlapping (similar) clusters are merged such that the final merged cluster contains proteins present in at least two originating clusters. In the study by Srihari and Leong [2012a], clusters produced from the same method but from different (weighted) PPI networks are combined to produce con- G ={ } sensus clusters. Let G1, G2,...,Gk be k different PPI networks from which C =∪ the clusters are predicted, and let iC(Gi) be the union of the sets of clus- = ters, C(Gi), i 1,...,k, produced from the k networks. Sets of similar clusters but identified from different PPI networks are first identified. Two clusters A ∈ ∈ = C(Gi) and B C(Gj ), i j, are assessed for similarity using the Jaccard index J(A, B) =|A ∩ B|/|A ∪ B|, and the two clusters are deemed similar if J(A, B) ≥ =  = C 0.75. Next, a cluster-similarity network H VH , EH is constructed, where VH , ={ ∈ = = ≥ } and EH (A, B) : A C(Gi), B C(Gj ), i j , J(A, B) 0.75 . Each connected component from the network H is then collapsed into a single consensus complex T such that T contains all proteins that are present in at least half of the clusters 78 Chapter 3 Computational Methods for Protein Complex Prediction from PPI Networks

in the component. All remaining clusters—singleton clusters that do not belong to any connected component, and clusters within components that are not col- lapsed into consensus complexes—are retained as individual clusters. The final list of clusters are then scored using weighted density, as before. This consensus approach accounts for the differences in clusters arising from differences in the weighting of interactions between the networks. Taking the consensus improves the protein coverage of clusters while maintaining their quality by including only proteins that are predicted from multiple networks. In another approach, instead of combining the output clusters from different clustering methods, Asur et al. [2007] used the k base clusters to construct a new network, such that an interaction in this network exists between two proteins u and v if and only if the two proteins are present together in at least one base cluster. The interaction (u, v) is weighted proportional to the reliability of the clusters containing the protein pair as: p = w(u, v) Rel(Ck) . Mem(u, v, Ck), (3.22) k=1

where Rel(Ck) represents the reliability of cluster Ck (determined, for example, as = the interaction density of the cluster) and Mem(u, v, Ck) 0/1 is the membership of the pair (u, v) in Ck. The resulting weighted network is then clustered using hierarchical agglomerative method (a method not used previously to produce the base clusters) to obtain the final candidate set of protein complexes.

Methods Incorporating Core-Attachment Structure 3.4 Protein complexes are modular entities and, in particular, proteins within eukary- otic complexes can be categorized into two distinct modular subgroups based on their function and organization—cores, which constitute central functional units of complexes, and attachments which are peripheral and aid the core proteins in their functions [Gavin et al. 2006]—as depicted in Figure 3.4. This categorization into cores and attachments (the core-attachment structure) provides important insights into the manner in which the proteins come together to constitute protein com- plexes and perform functions. For example, core proteins are strongly co-expressed with one another, and depending on the cellular context, a set of core proteins may interact with different sets of attachment proteins to form different protein com- plexes. Likewise, depending on the cellular context, the same attachment proteins may interact with different core proteins to form different complexes. But even among the attachment proteins, there are subsets of proteins called attachment modules that often function together in complexes (Figure 3.4). 3.4 Methods Incorporating Core-Attachment Structure 79

Attachment module

Cores Attachments

Figure 3.4 Core-attachment modularity observed in yeast and eukaryotic protein complexes [Gavin et al. 2006]. Cores constitute central functional units within complexes, whereas attachments are peripheral and aid the core proteins in their functions. Core proteins are strongly co- expressed with one another, and depending on the cellular context, a set of core proteins may interact with different attachment proteins to form different protein complexes. Likewise, the same attachment proteins may interact with different core proteins to form different protein complexes. Even among the attachment proteins, there may be subsets of proteins called attachment modules that often function while forming complexes.

In the PPI network, one can observe that core proteins are more densely con- nected with each other than to the rest of the proteins in the complex, but interact more densely with attachment proteins than with the rest of the PPI network [Gavin et al. 2006]. This provides an important topological insight to identify protein com- plexes. For example, not all dense clusters identified from the PPI network may correspond to real protein complexes, in particular these clusters may have been “densified” by spurious interactions; on the other hand, clusters that adhere to the core-attachment structure may correspond better to real complexes. CORE [Leung et al. 2009], COACH [Wu et al. 2009], MCL-CAw [Srihari et al. 2009, 2010], and CA- CHET [Wu et al. 2012] are some methods that predict protein complexes based on this idea.

Complex Detection by Core-Attachment (CORE) CORE, proposed by Leung et al. [2009], builds protein complexes from a PPI net- work in two steps: by (i) identifying core sets in which the proteins densely interact 80 Chapter 3 Computational Methods for Protein Complex Prediction from PPI Networks

with one another and (ii) adding attachment proteins to these cores from proteins which may not densely interact with one another, but interact densely with at least

one of the cores. The probability for two proteins u and v with degrees du and dv, re- spectively, to belong to the same core is determined by the number of interactions i (i = 0 or 1) and the number of common neighbors m between u and v. The prob- ability that u and v have ≥ i interactions and ≥ m common neighbors is computed

under the null hypothesis that du interactions connecting u and dv interactions connecting v are assigned randomly in the network according to a uniform distri- bution. If there are |V (G)|=N proteins in the network, the probability that u and v have exactly i interactions between them is calculated by considering the num- − − ber of combinations to assign (du i) and (dv i) interactions connecting u and v, respectively, to the other N − 2 proteins: N−2 N−2 d −i d −i P (i|N , d , d ) = u v . (3.23) interact u v N−2 N−2 + N−2 N−2 du dv du−1 dv−1

Only two situations are studied: When i = 1, u and v must have an interaction between them, and when i = 0, u and v might or might not have an interaction between them. The probability that u and v with i interactions have exactly m common neigh- bors is computed by considering three groups of proteins—common neighbors of u and v, neighbors of u, and neighbors of v—and is given as: N−2 (N−2)−m (N−2)−m−(du−i−m) m d −i−m d −i−m (m|N d d i) = u v Pcommon , u, v , N−2 N−2 . (3.24) du−1 dv−i

The probability that u and v have ≥ i interactions and ≥ m common neighbors is computed by taking the product of the above two equations:

p-value(u, v) = | | (3.25) Pinteract(j N , du, dv) . Pcommon(k N , du, dv , j). i≤j≤1,m≤k≤min(du,dv)−j

A small p-value(u, v) means that the null hypothesis is likely to be wrong, that is, u and v have a higher chance of being a pair of core proteins. If p-value(u, v) is the smallest p-value among p-value(u, k) and p-value(v, k) for all possible proteins k, then u and v are a pair of core proteins. This core (u, v) is then progressively expanded by adding new proteins x if p-value(x, u) for all proteins u in the core is smaller than p-value(x, k) for all proteins k not in the core, until no further proteins 3.4 Methods Incorporating Core-Attachment Structure 81 can be added. It is easy to see that the core protein sets are disjoint because each protein can only associate with a unique set of proteins with the lowest p-values. Given a core set, proteins that are common neighbors to at least half of the core proteins are selected as attachment proteins to that core. The final protein complex is the union of the core and all its attachments. Note that a set of attachments may be linked to more than one core, thus giving sharing of attachments between the complexes. An additional point to note here is that, if the attachment proteins (to a given core) do not interact between themselves, then all the attachments may not necessarily belong to the same protein complex, but instead the core proteins may be interacting with different subsets of these attachments (modules; Figure 3.4) to form distinct (but overlapping) protein complexes. However, this pattern of overlapping protein complexes is not directly decipherable here, and requires more specialized methods, as we shall see in Chapter 5.

COre-AttACHment Based Complex Prediction (COACH) COACH, proposed by Wu et al. [2009], works in a manner similar to CORE, in the sense that it first identifies cores and then adds attachments to these cores to construct the final protein complexes. COACH begins by building “prelimi- nary cores,” which are dense but redundant or overlapping subsets of proteins that need to be postprocessed to determine the final cores. For every protein v =  =  in G V , E , its neighborhood subnetwork Gv Vv , Ev is first computed, where ={ }∪{ ∈ } ={ ∈ ∈ ∈ } Vv v u : (u, v) E , and Ev (s, t): (s, t) E, s Vv , t Vv . All proteins with degree 1 in this subnetwork (i.e., those only connected to v) are removed. A protein ∈ ≥ u Gv is designated as a core protein relative to Gv if deg(u) AvgDeg(Gv), the av- erage degree computed using only interactions in Gv. These core vertices and the interactions among them are used to build a core subnetwork CSv of this neigh- borhood subnetwork of v.Apreliminary core is defined as a subnetwork of CSv satisfying two properties: (i) the interaction density of the subnetwork is greater than a preset threshold (0.70) and (ii) the subnetwork is maximal, that is, there exists no other subnetwork of CSv that satisfies property (i). If the entire core subnetwork CSv itself is dense enough (that is, satisfies prop- erty (i)), then CSv is designated as a preliminary core. Otherwise, multiple prelim- inary cores are likely to be embedded in CSv which can potentially overlap, thus resulting in redundancy among these preliminary cores. To identify these prelimi- nary cores, COACH uses a core-removal and redundancy-filtering step, which works as follows. For a neighborhood network Gv whose core subnetwork CSv is not dense enough, all core proteins are first removed from Gv, breaking Gv into multiple connected components. Next, low-degree proteins are repeatedly removed from 82 Chapter 3 Computational Methods for Protein Complex Prediction from PPI Networks

these components until the components achieve the required density threshold. Trivial components are discarded (those of size <4). The core proteins are then added back to the components such that these components retain the density threshold. For each resulting pair of components (A, B), a similarity is computed =| ∩ |2 | | | | as Sim(A, B) VA VB / VA . VB .IfSim(A, B)>t, a similarity threshold, then only the component with higher density is retained as a preliminary core. This way, the resulting preliminary cores do not completely encompass one another (thus, satisfying property ii). Finally, to determine the attachment proteins, for each preliminary core A identified from above, if there exists a neighboring protein p in the network such | ∩ | | |≥ that VA Np / V 0.5, then p is included as an attachment to the core A.A final complex is built from each such preliminary core A and all its attachment proteins p.

MCL Followed by Core-Attachment-based Filtering (MCL-CAw) MCL-CAw, proposed by Srihari et al. [2010], refines clusters produced from MCL [Van Dongen 2000] by incorporating core-attachment structure, to predict protein complexes. MCL-CAw is an improvement over its unweighted and predecessor ver- sion MCL-CA [Srihari et al. 2009] that is designed only for unweighted networks. The motivation behind MCL-CAw is three-fold: (i) The clusters produced from MCL often include “noisy” proteins that get “pulled in” due to the presence of spurious (but sparse) interactions connecting them to the cluster proteins; (ii) MCL produces non-overlapping clusters, and in particular, if a protein interacts with proteins from two different clusters then it is arbitrarily assigned to only one of the clusters; and (iii) MCL often produces large clusters (of sizes ≥30) which possibly encompass multiple protein complexes. If protein complexes indeed follow a core-attachment modularity, then refining the MCL clusters by extracting only the subsets of pro- teins that adhere to this core-attachment structure, should produce clusters that correspond better to real protein complexes. MCL is first run on a PPI network to produce an initial set of clusters using the recommended inflation parameter I ∈ [1.8, 2.0] [Brohee and van Helden 2006]. For each cluster of size ≥30, a subnetwork is constructed using proteins within the cluster and the interactions among them. This subnetwork is then reclustered with a higher inflation, I>2, giving clusters of size at most 30. All resultant clusters are postprocessed using a core-attachment procedure as follows. From each cluster C (from above), MCL-CAw first identifies a core set of proteins Core(C). A protein p ∈ C is considered a core protein, i.e., p ∈ Core(C) if the follow- 3.4 Methods Incorporating Core-Attachment Structure 83

≥ (in) ing two conditions are satisfied: (i) degw(p) AvgDeg(C); and (ii) degw (p, C) > (out) degw (p, C), where degw(p) is the weighted degree of p, AvgDeg(C) is the aver- (in) age weighted degree of proteins in C, degw (p, C) is the weighted in-connectivity of p defined as the total weight of interactions of p with proteins within C, and (out) degw (p, C) is the weighted out-connectivity of p defined as the total weight of interactions of p with proteins outside of C. Only those clusters with at least two core proteins are retained while the rest are discarded. The attachment proteins are chosen from the non-core proteins. An attachment protein p is assigned to a Core(C) based on the extent of interactions p has with Core(C) considering the interaction density and size of Core(C), and is decided based on the following condition: − |Core(C)| γ I(p, Core(C)) ≥ α . I(Core(C)) . , (3.26) 2 where I(p, Core(C)) is the total weight of interactions of p with Core(C), I(Core(C)) is the total weight of interactions within Core(C), and α and γ are used to control the effects of I(Core(C)) and |Core(C)|. Large and dense cores require the attachments to be strongly connected to them, whereas weak connections with attachments are allowed for smaller and less-dense cores. The final complexes are then constructed by the union of cores and their attachments, as before. Proteins that neither qualify as cores nor attachments are discarded, thus eliminating noisy proteins from the MCL clusters. Moreover, an attachment protein is allowed to be assigned to more than one core as long as the above condition (Equation (3.26)) is satisfied for each of the cores, and therefore this step allows for sharing of proteins between multiple clusters.

Core-AttaCHment Structures Directly from BipartitE TAP Data (CACHET) CACHET, proposed by Wu et al. [2012], works by the rationale that co-complex re- lationships identified from pulled-down complexes in TAP experiments hold vital information on cores: the bait–prey and prey–prey pairs within pulled-down com- plexes that are repeatedly observed across purifications may constitute permanent or constitutive interactions between these proteins, and may thus constitute cores within the complexes. These co-complex relationships are lost when the pulled- down complexes are converted to pairwise interactions in the PPI network. There- fore, CACHET works directly on the bait–prey interactions produced from TAP experiments instead of constructing the PPI network. 84 Chapter 3 Computational Methods for Protein Complex Prediction from PPI Networks

The TAP data is represented as a bipartite graph G =P , Q, E, where P is the set of baits, Q is the set of preys captured by the baits, and

E ⊆{(u, v) : u ∈ P , v ∈ Q, u = v}

represents the set of bait-prey relationships. Note that, in TAP experiments, a protein p can appear both as a bait and as a prey; each such protein is maintained as two separate entities, one as a bait and the other as a prey. Additionally, a bait might tag itself as a prey; to avoid the resulting self-relationships, E is made to contain only relationships (u, v)where u = v. Each edge (u, v) ∈ E carries a reliability weight w(u, v) that estimates the likelihood of the bait u interacting with prey v. The reliability of a subgraph S =P(S), Q(S), E(S) of G is defined as the average of reliability weights of all the interactions in E(S): (u v)∈E(S) w(u, v) Reliability(S) = , . (3.27) |E(S)| Each protein u ∈ P(S)is assigned a degree-ratio given by: (u l)∈E(S) w(u, l) dr (u) = , . (3.28) S |E(S)| The degree-ratio for u ∈ Q(S) is analogously computed. Similar to the above methods, CACHET first identifies cores and then adds attachments to these cores to build protein complexes. The cores are built from reliable bicliques of G. A subgraph of G, given by B =P(B), Q(B), E(B),isa biclique if there exists (u, v) ∈ E for every u ∈ P(B) and v ∈ Q(B). The biclique B is considered reliable if Reliability(B) ≥ t, a preset threshold (0.70). The procedure begins by identifying all maximal bicliques from G, using the algorithm by Li et al. [2007]. If a maximal biclique B has a reliability lower than the threshold, CACHET ∈ repeatedly prunes B by removing proteins w B with low degree-ratios drB(w) until Reliability(B) achieves the desired threshold. The pruned biclique is then added to the list of reliable bicliques. Several of these bicliques may overlap and may produce redundant cores. CACHET therefore merges overlapping bicliques based on their =  neighborhood affinity (NA) score. The NA score for two bicliques B1 P1, Q1, E1 =  and B2 P2, Q2, E2 is given by: |(P ∪ Q ) ∩ (P ∪ Q )|2 NA(B , B ) = 1 1 2 2 . (3.29) 1 2 | ∪ | | ∪ | P1 Q1 . P2 Q2

Two (near-)bicliques B1 and B2 from the list of reliable biclques are repeatedly ≥ chosen and merged into a near-biclique if NA(B1, B2) ω, an overlap threshold, 3.5 Methods Incorporating Functional Information 85

such that the resulting near-bicliques are all maximal and non-redundant. After discarding the small near-bicliques (below a size threshold), the final set of near- bicliques form the cores. The attachment proteins are chosen from the set Q of preys, by looking for proteins that densely interact with the cores identified above. A prey p ∈ Q is added as an attachment to a core B if p satisfies the following two conditions: (i) p interacts with at least half the proteins in B; and (ii) the average reliability of these interactions between p and B is at least a preset threshold (0.7).

Methods Incorporating Functional Information 3.5 Proteins within a complex are enriched for the same or similar functions. There- fore, dense protein clusters which also show high functional coherence are highly likely to correspond to real complexes compared to random sets of (dense) clus- ters. Following this idea, Restricted Neighborhood Search Clustering (RNSC) [King et al. 2004], PCP [Chua et al. 2008], and Dense-neighborhood Extraction using Connectivity and conFidence Features (DECAFF) [Li et al. 2007] employ functional annotations (e.g., from Gene Ontology (GO) [Ashburner et al. 2000]) to enhance complex prediction from PPI networks.

Restricted Neighborhood Search Clustering (RNSC) RNSC, proposed by King et al. [2004], works in two steps: (i) clustering the PPI network, and (ii) filtering clusters based on their functional coherence. In the clus- tering step, RNSC uses a cost-based local search algorithm to repeatedly partition the node set V into smaller subsets. Beginning with a random clustering (obtained as user input), RNSC uses a variant of the tabu search metaheuristic [Glover 1986] to efficiently search the space of partitions of V , each of which is assigned a cost, to find a clustering with lower cost. At each step of this process, proteins are shuffled between the clusters and the costs are recomputed, to gradually generate a cluster- ing with the (locally) minimum cost. In the initial few steps, RNSC uses a simple integer-valued cost function which is efficient to compute, but then upgrades to a more expensive (but less efficient) real-valued cost function as the clustering pro- gresses. A common problem among local search algorithms is their tendency to settle in poor local minima. RNSC overcomes this problem by maintaining a tabu (forbidden) list of previously explored moves that are less-optimal, and by occasion- ally dispersing the contents of a cluster at random. However, being randomized, different runs of RNSC could result in different clusterings. 86 Chapter 3 Computational Methods for Protein Complex Prediction from PPI Networks

In the filtering step, RNSC discards clusters that are not likely to correspond to real complexes based on their cluster sizes and functional homogeneity. Small complexes (of sizes 2 and 3) that are completely connected correspond to single interactions or triangles in the network. However, there are many interactions and even more triangles in a typical PPI network that do not necessarily correspond to complexes. Hence, most small clusters (single interactions and triangles) do not correspond to complexes. The small clusters that do correspond to complexes are difficult to identify solely based on their network topology. RNSC therefore simply discards all small clusters. Next, all clusters of low interaction densities are discarded, and the authors recommend a cluster density cut-off ranging from 0.65– 0.75. Finally, RNSC computes the enrichment for Gene Ontology (GO) functional groups within these clusters. A GO functional group Fˆ is the set of all proteins annotated with a particular function. RNSC computes a hypergeometric test p-value for each cluster C:

ˆ ˆ i=k−1 |F | |V |−|F | | |− p = − i C i 1 |V | , (3.30) i=0 |C|

where C contains k proteins annotated with terms in Fˆ. Clusters that do not pass the p-value cut-off of 0.001 for at least one GO term are discarded, and clusters that pass these filtering criteria are predicted as protein complexes.

Protein Complex Prediction Using Indirect Neighbors (PCP) Instead of postprocessing the clusters based on explicitly given functional homo- geneity information, PCP, proposed by Chua et al. [2008], uses FS Weight [Chua et al. 2006] (Chapter 2) to assess the functional similarity between proteins based on the topology of their neighborhoods in the PPI network. Weights computed by FS Weight correlate well with the functional homogeneity and co-localization of pro- teins. Therefore, clusters identified from a FS-Weighted network are expected to show high functional and co-localization coherence, and thus correspond better to real complexes, compared to those identified from an unweighted network. PCP begins by first preprocessing the unweighted PPI network by: (i) weighting all interactions by FS Weight (weights in the range [0,1]); (ii) adding new interac- tions between indirect (level-2) neighbors using FS Weight; and (iii) discarding all low-weighted interactions (typically weights below 0.2). The addition of new interac- tions is based on the observation that indirect neighbors in the network often share the same or similar functions, and therefore adding FS-Weighted interactions be- 3.5 Methods Incorporating Functional Information 87 tween them reinforces their functional association, and also potentially helps to densify the PPI network by completing many partial cliques in the network. PCP identifies dense clusters by merging maximal cliques. These maximal cliques are identified using an efficient search-space pruning algorithm by Tomita et al. [2006]. Cliques identified from the PPI network before the postprocessing step are usually small and only partially cover real complexes. On the other hand, the addition of FS-Weighted interactions between level-2 neighbors helps to com- plete many partial cliques in the network, thus producing larger cliques. However, this procedure also tends to produce numerous overlapping cliques, which need to be reconciled into larger clusters that represent real complexes. To go about this, PCP uses the inter-cluster density measure, which quantifies the interconnectivity between two cliques based on the interactions between the non-overlapping pro- teins of the two cliques. The inter-cluster density I(C1, C2) between two cliques C1 and C2 is defined as: u∈(V −V ),v∈(V −V ),(u,v)∈E FS(u, v) I(C , C ) = 1 2 2 1 . (3.31) 1 2 | − | | − | V1 V2 . V2 V1 The merging procedure begins by computing an average FS Weight for each clique

Ci as follows: (u,v)∈E(C ) FS(u, v) FS (C ) = i . (3.32) avg i | | E(Ci) The cliques are then sorted in non-increasing order of their average FS Weights, and the clique with the highest weight is repeatedly selected and compared with the remaining cliques. When an overlap above a threshold ω is found with another =| ∩ | | ∪ |≥ clique, given by J(Ci , Cj ) Ci Cj / Ci Cj ω, the two cliques are merged if ≥ their inter-cluster density is above a threshold I0: I(Ci , Cj ) I0. The merging of the two cliques produces a larger but partial clique, which is added back to the list of (partial) cliques for future merging. Essentially, PCP follows a hierarchical merging 0 = 0 0  procedure here to merge the (partial) cliques. An initial graph H VH , EH is 0 ∈ 0 defined from the maximal cliques identified from G such that each vertex Xi VH 0 0 ∈ 0 represents a maximal clique from G, and each edge (Xi , Xj ) EH is weighted using 0 0 the interconnectivity between the cliques represented by Xi and Xj in G. Cliques in H 0 represent the highly overlapping and interconnected cliques from G, and these are collapsed and merged into larger partial cliques. This procedure is then repeated on graphs constructed at higher and higher levels: In each iteration k, k = k k  k ∈ k a new graph H VH , EH is constructed such that Xi VH represents a partial k k clique and (Xi , Xj ) represents the interconnectivity between two partial cliques 88 Chapter 3 Computational Methods for Protein Complex Prediction from PPI Networks

from the previous iteration k − 1. Maximal cliques are then identified from H k and = − merged after reducing the inter-cluster density threshold as Ik : Ik−1 0.1. These iterations continue while Ik >Imin, a minimum threshold. The final set of all partial cliques produced when the iterations stop represents the partial cliques from the original network G after repeated mergers. These final partial cliques together with the cliques that were not used up in the merging process are output as the list of protein complexes.

Dense-neighborhood Extraction Using Connectivity and conFidence Features (DECAFF) DECAFF, proposed by Li et al. [2007], scores a PPI network based on functional similarity between the proteins and then clusters this scored network to gener- ate protein complexes. However, unlike PCP, DECAFF uses external evidence on functional similarity, instead of the network topology as in PCP, to score the PPI network.

DECAFF starts by assigning an initial reliability ru,v to each interaction (u, v) in the network based on the number of times (u, v) is observed across different experimental datasets, where higher reliabilities are assigned to interactions de- tected from multiple experiments. Using these reliabilities as the prior probabili- ties, DECAFF computes the posterior probabilities for the proteins to interact, by employing the functional similarity between u and v from the MIPS functional cata- log [Mewes et al. 2006] as additional evidence. This is done based on the following three cases: (i) u and v share the same or similar functions; (ii) u and v do not share any functions; and (iii) the functions of either u or v (or both) are unknown (unannotated proteins). For case (i), the posterior probability P(R|S)given the two proteins u and v share the same function is estimated as: P(R). P(S|R) P(R|S) = , (3.33) P(S) = = | + |¬ where P(R) ru,v is the prior probability, and P(S) P(S R) . P(R) P(S R) . P(¬R) is the probability that two proteins share functions. P(S|R) is estimated

using an independent high-confidence small-scale experimental dataset DS: |{(u, v) : share(u, v), (u, v) ∈ D }| P(S|R) = S , (3.34) | | DS

where share(u, v) denotes that u and v share at least one function. A dataset DN of one million randomly generated protein pairs that are not present in current 3.5 Methods Incorporating Functional Information 89 protein interaction datasets is constructed, and P(S|¬R) is estimated as: |{(u, v) : share(u, v), (u, v) ∈ D }| P(S|¬R) = N . (3.35) | | DN For case (ii), the posterior probability P(R|D) given that the two proteins do not share any functions is computed as: P(D|R) . P(R) P(R|D) = , (3.36) P(D) where P(D)= 1 − P(S), and P(D|R) = 1 − P(S|R). Finally, for case (iii), the posterior probability P(R|U) given that one or both proteins are unannotated is computed as:

P(R|U)= P(S). P(R|S) + P(D). P(R|D). (3.37)

Once the network is scored using the posterior probabilities as computed above, DECAFF clusters the network using LCMA [Li et al. 2005], with an added step to remove less-reliable clusters. For each generated cluster C from LCMA, a cluster- reliability is computed as: 1 Reliability(C) = R , (3.38) |E(C)| u,v (u,v)∈E(C) where Ru,v is the posterior probability depending on the case that (u, v) satis- fies. DECAFF removes all clusters C for which Reliability(C)<μ+ max(0.5, γ). σ , where μ and σ are the mean and standard deviation of the reliability distribution, and γ is a tuning parameter to control the stringency of the filtering. Evaluating Protein Complex Prediction Methods

The only relevant test of the validity of a hypothesis is comparison of prediction with experience. —Milton Friedman (1912–2006), American economist (as quoted in Friedman [1953])

In the previous chapter, we presented a taxonomy of protein complex prediction 4methods available in the literature, and described their algorithmic underpinnings. In this chapter, we evaluate a sampling of these algorithms for prediction of yeast and human complexes. Additionally, we investigate the impact of the weighting of PPI networks on the reduction of noise (false-positive interactions) from PPI datasets, and its impact on improving protein complex prediction.

Evaluation Criteria and Methodology 4.1 We employ criteria and methodology described and used in the literature [Brohee and van Helden 2006, Pu et al. 2007, Srihari and Leong 2012b, Srihari et al. 2015a, Sardiu et al. 2009] to evaluate protein complex prediction methods. We consider a predicted complex P to match a reference (benchmark) complex C if the Jaccard index of P and C, Jaccard(P , C) ≥ match thresh,

V ∩ V Jaccard(P , C) = P C , ∪ VP VC

where VX is the set of proteins in complex X. 92 Chapter 4 Evaluating Protein Complex Prediction Methods

We define the score of a predicted complex P as the weighted density of inter- actions in the complex: a,b∈V ,a =b weight(a, b) score(P ) =  P        − VP ( VP 1)

Here, weight(a, b) is the weight of edge (a, b) between proteins a and b. To evaluate the predicted complexes, we plot precision-recall curves by calcu- lating the precision and recall of predicted complexes at various score thresholds. Given a set of reference complexes C, consisting of large complexes (at least four

proteins) Clarge and small complexes (fewer than four proteins) Csmall, and a set of predicted complexes P, we define the precision and recall at score threshold s as:     { | ∈ ∧∃ ∈ ≥ }  C C Clarge P P, score(P ) s, P matches C  Recall = s | | Clarge   {P |P ∈ P, score(P ) ≥ s ∧∃C ∈ C, C matches P } = Precisions . |{P |P ∈ P, score(P ) ≥ s}|

Note that while we evaluate recall only for large complexes, we evaluate precision to include both large and small complexes, so that the prediction algorithm is not penalized should it also predict small complexes. As a summarizing statistic for an entire set of predicted complexes, we calculate its F-score, which is the harmonic mean of the precision and recall over all the predicted complexes:

2 × precision × recall F = . precision + recall

We also test different weighting (interaction-scoring) methods to ameliorate the noise in PPI networks. To evaluate weighting methods independent of complex prediction algorithms, we calculate and plot the precision-recall-coverage graph of each PPI-weighting method on the prediction of co-complex interactions, i.e., protein pairs belonging to the same complex. Given a set of PPI interactions in the PPI network E, and weighting function weight(e) for each PPI edge e ∈ E, we define the precision, recall, and complex-coverage for co-complex interactions at weight threshold w as: 4.2 Evaluation on Unweighted Yeast PPI Networks 93

    { | ∈ ∧ ≥ ∧ ∈ }  e e E weight(e) w e interactions(Clarge)  Recall = w | | interactions(Clarge)   {e|e ∈ E ∧ weight(e) ≥ w ∧ e ∈ interactions(C)} = Precisionw |{e|e ∈ E ∧ weight(e) ≥ w}|     { | ∈ ∧ ∃ ∈ ≥ ∧ ∈ }  C C Clarge ( e E, weight(e) w e C)  Coverage = , w |{ | ∈ }| C C Clarge where interactions(C) is the set of all co-complex interactions in the set of reference complexes C.

Data Sources for PPI Networks and Benchmark Protein Complexes To evaluate protein complex prediction algorithms on yeast and human complexes, we obtain yeast and human PPI data from two repositories, Biogrid (downloaded June 2016) [Stark et al. 2011] and IntAct (downloaded June 2016) [Hermjakob et al. 2004], keeping only the physical PPIs. In addition, for yeast PPIs, we incorporate the widely used Consolidated PPI dataset [Collins et al. 2007], which consolidates two high-throughput TAP-MS datasets from Gavin et al. [2006] and Krogan et al. [2006] using a probabilistic framework called Purification Enrichment (PE); see Chapter 2. We evaluate protein complex prediction algorithms on the yeast reference com- plexes dataset CYC2008 [Pu et al. 2009], and the human reference complexes data- set CORUM [Reuepp et al. 2008]. Here, we evaluate only methods for prediction of large protein complexes (i.e., complexes consisting of at least four proteins); prediction of small complexes is covered in Chapter 5.

Evaluation on Unweighted Yeast PPI Networks 4.2 PPI data is inherently noisy due to limitations of PPI assay technologies. It is characterized by high rates of spurious interactions (false positives) and missing interactions (false negative) [Bader and Hogue 2002, Bader et al. 2004, Yong and Wong 2015a]. To ameliorate this problem for protein complex prediction, most methods are designed for weighted PPI networks, where such weights may indicate the reliability or quality of the PPI in some way, such as the number of times reported or the experiments used. As an indication of baseline performance, we first perform yeast protein complex prediction on the yeast unweighted PPI network, so that no information about the reliability of the PPIs is utilized. We obtain yeast PPI data from the BioGrid and IntAct databases and the Con- solidated PPI dataset, deriving an unweighted PPI network composed of 136,552 94 Chapter 4 Evaluating Protein Complex Prediction Methods

Table 4.1 Clustering algorithms tested a

Clustering Algorithm Parameters CMC min_deg_ratio=1, min_size=4, overlap_thres=0.5, merge_thres=0.75 ClusterOne -s 4 IPCA -S4 -P2 -T.6 MCL -I 3.0 RNSC -e10 -D50 -d10 -t20 -T3 COACH Default settings

a. Parameters used: CMC: min_deg_ratio minimum degree ratio; min_size minimum cluster size; overlap_thres threshold to consider two clusters as overlapping; merge_thres threshold to merge overlapping clusters. ClusterOne: -s minimum cluster size. IPCA: -S minimum cluster size; -P shortest path length; -T interaction probability threshold. MCL: -I inflation parameter. RNSC: -e number of experiments; -D diversification frequency; - d shuffling diversification length; -t tabu length; -T tabu list tolerence. COACH: default settings provided by the software were used (no parameters were set).

interactions. This set of PPIs accounts for 70.63% of co-complex interactions with a precision of 5.61%, and covers 99.33% of yeast complexes with at least one PPI each. We evaluate the performance of the following six protein complex prediction al- gorithms, which form a mix of purely network topology-based and network topology and additional information-based methods: MCL [Bader and Hogue 2003], IPCA [Li et al. 2008], ClusterOne [Nepusz et al. 2012], CMC [Liu et al. 2009], COACH [Wu et al. 2009], and RNSC [King et al. 2004] (see Table 4.1 for parameters used). When evaluating on unweighted PPI networks we set the interaction weights to 1 for the methods that use interaction weights. As the number of interactions in the unweighted PPI network (number of interactions 136,552) is too large for most clustering algorithms, we choose 50,000 interactions at random to form the un- weighted PPI network for complex prediction. Interactions are chosen randomly so that the quality of PPIs is not factored in the PPI network. Figure 4.1 shows the precision-recall graph for complex prediction. Most algo- rithms perform poorly with unweighted PPI networks, recalling less than 7% of reference complexes at less than 10% precision. The exception is ClusterOne, which performs noticeably better than the others, achieving almost 20% recall and 10% precision. Even so, this performance is poorer than expected, showing that pro- tein complex prediction is generally unsatisfactory without any reliability weighting of PPIs. 4.3 Evaluation on Weighted Yeast PPI Networks 95

0.6 CMC 0.5 ClusterOne IPCA 0.4 MCL RNSC 0.3 Coach

Precision 0.2

0.1

0.0 0.00 0.05 0.10 0.15 0.20 Recall

Figure 4.1 Protein complex prediction from yeast unweighted PPI network.

Evaluation on Weighted Yeast PPI Networks 4.3 In the previous section we showed that an unweighted PPI network, which does not account for the reliability of PPIs, gave dismal performance in protein complex prediction. Noise in the PPI data results in a PPI network prevalent in spurious and missing interactions, making it ill-suited for analysis for protein complex discovery. Here we evaluate the impact of weighting the PPI network to ameliorate such noise. We test three different ways of weighting to account for PPI reliability: weighting by topology, publication count, and experiment.

Weighting by Topology Here, we weight each PPI based on its local topology in the PPI network. Proteins that share many common neighbors tend to interact themselves, as they share cellu- lar locations or functions via the common neighbors. Moreover, multiple proteins that function together in cohort are reflected as dense regions in the PPI network. Thus, a PPI in a dense neighborhood with many shared common neighbors is more likely to be a true PPI, compared to a solitary PPI. We weight each PPI using Iterative CD scoring with two iterations [Liu et al. 2009] (see Chapter 2), which accounts for both the density and extent of shared neighbors around each PPI.

Weighting by Publication Count Here, we weight each PPI using the number of times it has been reported in the different experiments, similar to the sampling-based scoring method by Friedel et al. [2009] (Chapter 2). A true PPI is more likely to be observed in screens by different 96 Chapter 4 Evaluating Protein Complex Prediction Methods

authors and reported in multiple publications, whereas a spuriously-detected PPI is less likely to be repeatedly observed. For each PPI, we count the number times it is reported in unique publications in the BioGrid and IntAct repositories. We normalize the counts to a maximum of 1.

Weighting by Experiment Here, we weight each PPI based on its reported experimental detection methods. Since different experimental methods are associated with different levels of speci- ficity, the reliability of a PPI should depend on the method(s) used to detect it, as well as the number of repeated observations by different authors. We use a Noisy-Or model to combine the number of times a PPI is reported and the types of experi- ment used to detect it. For each experimental detection method e, we estimate its

reliability rele as the fraction of interactions detected where both proteins share at least one high-level cellular-component Gene Ontology term. Then we estimate the reliability of each PPI (a, b) as:  = − − ne, (a, b) weight(a, b) 1 (1 rele) , ∈ e X(a, b)

where X(a,b) is the set of experimental detection methods that detected (a, b), ne,(a,b) is the number of times that experimental method e detected (a, b).We incorporate the yeast Consolidated PPI dataset by discretizing the PE scores into ten equally-spaced bins, and considering each bin as a separate experimental detection method. We avoid duplicate counting of evidences by ensuring PPIs reported by each publication ID are incorporated only once during weighting. Since the three weighted PPI networks are derived from the same datasets, they consist of the same underlying network structures (i.e., with the same sets of PPI interactions), with only differences in edge weights between them. All three net- works are composed of 136,552 interactions. However, as noted earlier, the entire PPI network of 136,552 interactions is too large for most protein complex prediction methods. Morever, most algorithms perform better when a smaller subset of high- quality PPIs is used. To predict protein complexes, we select the top k interactions from each weighted network (k=10,000, 20,000, and 40,000), before running the pre- diction algorithms. We evaluated protein complex prediction at match thresh = 0.5 (rough prediction) and 0.75 (stringent prediction). Figure 4.2a shows the complex prediction results in terms of F-scores for the six algorithms, at various levels of k and match thresh. Weighting by experiment frequently gave the best F-scores (CMC, IPCA, RNSC, and COACH for rough prediction; all clustering algorithms for strin- gent prediction), followed by weighting by publication count. For ClusterOne and iue4.2 Figure (a) egtdb xeietwt h o 000itrcin,fr(b) for network interactions, PPI 10,000 using form top algorithms, the to 6 with interactions the experiment for 40,000 by graphs and weighted Precision-recall 20,000, (b)-(c) publication 10,000, network. (topology, top PPI methods the the 3 with by experiment), weighted F-scores networks and (a) PPI count, algorithms. using prediction algorithms, complex 6 the protein for 6 using prediction complex Yeast c 0.75. (c) F score 0.0 0.1 0.2 0.3 0.4 0.5 0.6

Precision 0.0 0.2 0.4 0.6 0.8 1.0 . . . . 0.8 0.6 0.4 0.2 0.0 CMC (b) match_thresh=0.5 ih hd:k=1,0 eimsae 000Darkshade:k=40,000 Mediumshade:k=20,000 Light shade:k=10,000 Recall Topological weighting ClusterOne ulcto on PPIreliability Publication count IPCA Coach RNSC MCL IPCA ClusterOne CMC match

Precision 0.0 0.2 0.4 0.6 0.8 1.0 thresh . . 0.2 0.1 0.0 MCL = (c) match_thresh=0.75 .,and 0.5, Recall 0.3 RNSC match_thresh =0.5 match_thresh =0.75 . 0.5 0.4 Coach Coach RNSC MCL IPCA ClusterOne CMC 98 Chapter 4 Evaluating Protein Complex Prediction Methods

MCL, weighting by topology achieved the best F-scores, but only for rough predic- tion. For experiment and publication count weighting, using the smallest network (10,000 interactions) consistently gave the best performance; whereas for topolog- ical weighting, larger networks (20,000 or 40,000 interactions) improved perfor- mance. This may reflect the concentration of highly weighted interactions in a few complexes for topological weighting, so more interactions are required to uncover a wider range of complexes. Among the algorithms tested, RNSC achieved the best F-score, followed by CMC and IPCA; ClusterOne achieved the lowest F-score. The F-score reflects the aggregate precision and recall of all the complexes predicted by each prediction algorithm. However, it is sometimes beneficial to consider only a subset of complexes predicted by each algorithm (i.e., the top- scoring predicted complexes). To evaluate the performance at different prediction score thresholds, as well as to observe the trade-off between precision and recall, we vary cutoffs for the predicted complexes’ weighted densities to plot precision- recall graphs. Figures 4.2b–c show the precision-recall graphs for the six cluster- ing algorithms, using the PPI network weighted by experiment with k=10,000, for match thresh = 0.5 and 0.75, respectively. When considering only the high-scoring predicted complexes (low recall range, left side of each graph), COACH achieves the highest precision, while IPCA attains the lowest precision. On the other hand, to maximize recall by considering all predicted complexes (right side of each graph), IPCA and CMC achieve better precision. Since these algorithms achieve different trade-offs between precision and recall, an assessment of their performance de- pends on whether the objective is to predict more complexes to maximize recall (for which IPCA and CMC perform best), or to predict fewer complexes with high precision (for which COACH performs best). Figure 4.3 shows the precision-recall graph for prediction of co-complex inter- actions. Since all the PPI networks are structurally identical, they achieve the same overall recall and precision of 70.63% and 5.61%, respectively, covering 99.33% of yeast complexes. However, the shapes of the precision-recall graphs differ accord- ing to weighting method: weighting by topology achieves the highest precision at all recall levels (solid blue markers), followed by experiment (solid orange), and finally publication count (solid green). While topological weighting predicts co-complex interactions with high precision, these interactions are clustered in a few dense complexes, leading to lower complex coverage (hollow blue markers). In contrast, weighting by publication count and experiment gives lower precision, but their pre- dicted co-complex interactions are distributed among a wider range of complexes (hollow green and orange markers), enabling the prediction of a wider range of complexes. 4.4 Evaluation on Human PPI Networks 99

1.0 1.0 Unweighted Topological weighting 0.8 0.8 Precision Publication count 0.6 0.6 PPI reliability

0.4 0.4 Unweighted Precision Topological weighting

0.2 0.2 Complexes coverage Complexes coverage Publication count 0.0 0.0 0.0 0.2 0.4 0.6 0.8 1.0 PPI reliability Recall

Figure 4.3 Prediction of co-complex interactions in yeast PPI network weighted by three approaches: topology, publication-count, and experiment-based weighting.

Evaluation on Human PPI Networks 4.4 Here we evaluate the same algorithms on the prediction of human complexes. Human complex prediction presents a distinct set of challenges, making it a more difficult task compared to predicting yeast complexes. A significant challenge is the paucity of human PPI data. Hart et al. [2006] estimated the human interactome size at around 220,000 PPIs. Our human PPI data consists of around 260,000 PPIs, and with an estimated false-positive rate of 50%, this means that our human PPI network covers just slightly over half the human interactome. Many human complexes are therefore sparsely connected, which is a major hurdle for most protein complex prediction algorithms which are based on discovering dense clusters in the PPI network. In contrast, the yeast interactome is estimated at around 50,000 PPIs. Our yeast PPI data consists of around 130,000 PPIs, so even with the same false- positive rate, our yeast PPI network is a comprehensive representation of the yeast interactome. A second challenge is the more elaborate structure of interactions between protein complexes in human. Many complexes form core-attachment structures in vivo [Gavin et al. 2006]; see Chapter 3. A core consists of proteins that bind together (more or less) permanently as a subcomplex, and recruits other proteins as attachments. Such attachments to a core can vary depending on cellular location or state. These complexes are present as overlapping clusters in the PPI network, and are problematic for protein complex discovery algorithms to deconvolute: they are frequently discovered as a single large complex, in which different core-attachment configurations are fused. While such overlapping complexes also occur in the yeast complexome, they are much more frequent in human. 100 Chapter 4 Evaluating Protein Complex Prediction Methods

0.007 CMC 0.006 ClusterOne IPCA 0.005 MCL 0.004 RNSC Coach 0.003 Precision 0.002

0.001

0.000 0.000 0.002 0.004 0.006 0.008 0.0 Recall

Figure 4.4 Human complex prediction on unweighted PPI network.

We obtain human PPI data from the BioGrid and IntAct databases, deriving a PPI network composed of 261,711 interactions. While twice the size of the yeast PPI network, this set of PPIs accounts for only 34.1% of human co-complex interactions, with a precision of 4.5%; this indicates the higher extent of noise in human PPI data compared to yeast. Despite this, almost all the complexes are covered with at least one PPI (99.7%). Predicting complexes using the unweighted PPI network (taking 50,000 interactions at random) gave dismal results, with every algorithm achieving less than 1% in both recall and precision (Figure 4.4). Clearly, noise is highly prevalent in the human PPI network, such that complex prediction is untenable without addressing this problem. To reduce noise in the PPI network, we applied three PPI weighting methods as described above: weighting by topology, publication count, and experiment. Figure 4.5 shows the performance of these weighting methods on the prediction of co-complex interactions. As in yeast, topological weighting achieves the highest precision, but its high-weighted interactions are clustered in a few dense protein complexes, giving low protein complex coverage. Whereas weighting by publication count and experiment give lower precision, but their high-weighted interactions are distributed among a wider range of complexes, which enables their discovery by protein complex prediction algorithms. In contrast to yeast, for human co- complex interactions weighting by publication count achieves higher precision over weighting by experiment. 4.4 Evaluation on Human PPI Networks 101

1.0 1.0 Unweighted Topological weighting 0.8 0.8 Precision Publication count 0.6 0.6 PPI reliability

0.4 0.4 Unweighted Precision Topological weighting

0.2 0.2 Complexes coverage Complexes coverage Publication count 0.0 0.0 0.0 0.1 0.2 0.3 0.4 0.5 PPI reliability Recall

Figure 4.5 Prediction of co-complex interactions in human PPI network weighted by three ap- proaches: topology, publication count, and experiment.

Figure 4.6a shows the F-score performance of the six clustering algorithms on human complex discovery, for each of the three weighting methods with k=10,000, 20,000, and 40,000. Consistent with the results for co-complex edge prediction, and in contrast to yeast, weighting by publication count performed better than weighting by experiment. Weighting by publication count achieves the highest F-scores in four clustering algorithms (CMC, IPCA, RNSC, and COACH), while topological weighting performed best for ClusterOne and MCL. As in yeast, the smallest networks (k=10,000) gave the best performance for networks weighted by publication count and experiment; while larger networks performed better when weighted by topology. Again, this may reflect the concentration of highly-weighted interactions in a few complexes for topological weighting, so more interactions are required to uncover a wider range of complexes. Among the six clustering algorithms, IPCA and CMC achieve the highest F-score, followed by RNSC and COACH. Figures 4.6b-c show the precision-recall graphs of the six algorithms on hu- man complex discovery, using weighting by publication count (k=10,000), for match thresh = 0.5 and 0.75, respectively. COACH achieves the highest precision across a large recall range, demonstrating the importance of the core-attachment model for human complexes. In contrast, MCL performs poorly–its limitation in predicting non-overlapping clusters degrades performance particularly for human complexes, where many complexes overlap. When considering only the top pre- dicted complexes for each algorithm (low recall range, left side of graphs), COACH and ClusterOne achieve high precision. In contrast, when considering all predicted iue4.6 Figure h P ewr.()()Peiinrcl rpsfrte6agrtm,uigPInetwork (b) PPI for interactions, using 10,000 form algorithms, top to the 6 with interactions the count publication for 40,000 by graphs and weighted Precision-recall 20,000, (b)-(c) publication 10,000, network. (topology, top PPI methods the the 3 with by experiment), weighted networks and PPI F-scores count, using (a) algorithms, algorithms. 6 prediction the complex protein for 6 using prediction complex Human (a) n c 0.75. (c) and

F score 0.0 0.1 0.2 0.3 0.4

Precision 0.0 0.2 0.4 0.6 0.8 1.0 . . 0.2 0.1 0.0 CMC (b) match_thresh=0.5 ih hd:k=1,0 eimsae 000Darkshade:k=40,000 Mediumshade:k=20,000 Light shade:k=10,000 Recall Topological weighting ClusterOne 0.3 . 0.5 0.4 ulcto on PPIreliability Publication count IPCA Coach RNSC MCL IPCA ClusterOne CMC

Precision 0.0 0.2 0.4 0.6 0.8 1.0 .000 .00.15 0.10 0.05 0.00 match MCL thresh (c) match_thresh=0.75 = 0.5, Recall RNSC match_thresh =0.5 match_thresh =0.75 Coach Coach RNSC MCL IPCA ClusterOne CMC 4.5 Case Study: Prediction of the Human Mechanistic Target of Rapamycin Complex 103

complexes so as to maximize recall (right sides of graphs), IPCA and CMC achieve the highest precisions. The complex prediction performance is generally poorer in human compared to yeast, with a maximum recall of 42% in human, compared to 73% in yeast (both achieved by IPCA). Regardless, the strengths of the protein complex prediction algorithms are consistent across human and yeast complexes: for both human and yeast, COACH achieves good precision among top-scoring predictions, while IPCA and CMC are most competent for maximizing recall.

Case Study: Prediction of the Human Mechanistic Target 4.5 of Rapamycin Complex To demonstrate the challenges of protein complex prediction, and to show how the idiosyncrasies underpinning different protein complex prediction algorithms can produce complexes with different compositions even on the same PPI network, we use the mechanistic target of rapamycin complex 2 (mTORC2) as an illustrative example. mTORC2 is a signaling protein kinase complex involved in regulation of the cytoskeleton, metabolism, and cellular proliferation, and is implicated in diseases such as diabetes and cancer [Laplante and Sabatini 2012]. It consists of the four proteins MTOR, MLST8, MAPKAP1, and RICTOR as per the CORUM database (Figure 4.7(a)). mTORC2 presents three challenges for protein complex prediction algorithms. First, MTOR and MLST8 also participate in a related signaling complex called mTORC1 (consisting of MTOR, MLST8, RAPTOR, and AKT1S1). Thus, these two complexes are present as overlapping clusters in the PPI network, which may be difficult to tease apart. Second, many external proteins are connected to mTORC2 members, corresponding to regulators or phosphorylation targets of either mTORC1 or mTORC2. Finally, the mTORC2 complex is itself not fully connected in the PPI network, with five out of six possible edges present between its proteins. Each of the six protein complex prediction algorithm produces different com- plexes (Figure 4.7b-g), with CMC, COACH, and IPCA predicting complexes that best match mTORC2 (Jaccard score of 0.6). CMC’s prediction includes three of four mTORC2 proteins (MTOR, RICTOR, MAPKAP1), and an additional protein AKT1, a phosphorylation target of the complex. COACH predicts a complex identi- cal to CMC’s prediction, but additionally predicts a second complex which includes RAPTOR from the mTORC1 complex. IPCA’s results demonstrate its propensity to promiscuously predict overlapping complexes around dense regions. It predicts six complexes that overlap with mTORC2 (of which three are shown). The first consists 104 Chapter 4 Evaluating Protein Complex Prediction Methods

(a) Actual complex (b) CMC (c) COACH

(d) IPCA (e) MCL

(f) ClusterOne (g) RNSC

Figure 4.7 Example human complex mTORC2. (a) Protein complex obtained from CORUM. (b)-(g) Predictions from CMC, COACH, IPCA, MCL, ClusterONE, and RNSC methods. COACH, IPCA, and RNSC produce multiple predictions which (partially) match the actual complex, and these are shown here within different boundaries.

of three mTORC2 proteins and its target AKT1 (identical to CMC’s prediction), while the second consists of three mTORC1 proteins and an mTORC1 target. The third consists of a mix of mTORC1 and mTORC2 proteins: MTOR, MLST8, and RICTOR from the former, and MTOR, MLST8, and RAPTOR from the latter. The three other 4.6 Take-Home Lessons from Evaluating Prediction Methods 105

predictions overlap mTORC2 to a lesser extent. In contrast, since MCL does not pre- dict overlapping complexes, it instead merges the dense region around mTORC1 and mTORC2 into one large complex, also including three other proteins corre- sponding to targets or regulators of these complexes. ClusterONE also predicts a large complex, merging the entire mTORC1 and mTORC2 along with their targets and regulators. RNSC predicts two complexes: the first a small complex consist- ing of two mTORC2 proteins with its target AKT, and the second a larger complex consisting of two mTORC1 proteins with two other interactors. The best-matching mTORC2 prediction consists of only three out of its four proteins, missing out on MLST8 and instead includes extraneous protein AKT1 (predicted in common by CMC, COACH, and IPCA). On the other hand, the only prediction that includes all four mTORC2 proteins also includes the mTORC1 complex as well as four additional interactors (from ClusterONE). Thus, most predictions could not properly distinguish the boundary separating mTORC2 with the related complex mTORC1 or their interactors.

Take-Home Lessons from Evaluating Prediction Methods 4.6 In this chapter, we evaluated six complex prediction algorithms on the prediction of yeast and human complexes. Due to the effects of noise in the PPI data, manifested as false positives (spurious interactions) and false negatives (missing interactions) in the PPI network, a na¨ıve use of PPI data without accounting for PPI quality re- sults in dismal performance in protein complex prediction. We employed three PPI-weighting methods to estimate PPI quality, namely, weighting by topology, by publication count, and by the type of experimental technique. Topological weight- ing uses only information inherent in the PPI network, but tends to weight highly a limited set of interactions found in dense clusters, whereas weighting by publica- tion count and experiment exploits extraneous information that is easily obtained from the PPI databases. All three methods ameliorate the impact of noise in the PPI data, and substantially improve protein complex prediction performance. As different protein complex prediction algorithms are based on different clus- tering methods or models, their predicted complexes can vary considerably, lead- ing to distinct trade-offs between precision and recall. For example, COACH, which employs a core-attachment model for complexes, achieves the highest precision among its top-scoring predicted complexes; while IPCA, which tends to generate a large number of complexes, is effective at predicting more complexes at the ex- pense of precision. On the other hand, MCL, which does not generate overlapping 106 Chapter 4 Evaluating Protein Complex Prediction Methods

complexes, performs poorly for human complexes which are highly overlapping. Thus, the objective of the user, as well as the nature of the interactome and com- plexome being considered, should inform the choice of the appropriate protein complex prediction algorithm to use. The prediction of human complexes share some of the difficulties as yeast complex prediction (noisy PPI data), but is further hindered by unique challenges: highly overlapping complexes and insufficient PPI coverage. In the next chapter, we further discuss these challenges as well as the difficulty of predicting small complexes, which was not evaluated here. Open Challenges in Protein Complex Prediction

Euclid taught me that without assumptions there is no proof. Therefore, in any argument, examine the assumptions. —Eric Temple Bell (1883–1960), Scottish mathematician (as quoted in Eves [1998])

Although existing computational methods for protein complex prediction incor- porate diverse biological evidence (e.g., core-attachment [Leung et al. 2009, Wu 5et al. 2009, Srihari et al. 2010, Wu et al. 2012] and functional homogeneity [King et al. 2004, Chua et al. 2008, Li et al. 2007]) to improve their prediction ability, most methods work on the basic credence that protein complexes are embedded as dense subnetworks within PPI networks. While this assumption is generally cor- rect, due to limitations in current PPI datasets, some of which we highlighted in Chapter 2, most methods miss many protein complexes.

Three Main Challenges in Protein Complex Prediction 5.1 Yong and Wong [2015a] evaluated the performance of protein complex prediction algorithms by categorizing protein complexes in three ways. They stratified pro- tein complexes into large and small complexes consisting of >3 and ≤3 proteins, respectively. Large complexes were further stratified by density (DENS), and the number of external proteins highly connected to the complex (EXT). EXT is used to indicate the extent to which a complex may overlap other complexes. The au- thors showed that yeast protein complexes with low DENS and/or high EXT were more difficult for complex prediction algorithms to predict (Figure 5.1a), because the predicted complexes matched poorly with the actual complexes (Figure 5.1b). 108 Chapter 5 Open Challenges in Protein Complex Prediction

The predicted protein complexes with high EXT included large numbers of extra- neous proteins (Figure 5.1c), and these were merged with other complexes as well (Figure 5.1d). Small protein complexes were also difficult to predict. The only com- plexes that were consistently predicted well were the large complexes with high DENS and low EXT. The same conclusions were drawn from evaluating human pro- tein complexes (Figure 5.2). Based on this evaluation, we conclude that existing methods for protein complex prediction do not perform well at identifying: (i) sparse and (ii) small protein com- plexes; and at (iii) deconvoluting overlapping protein complexes—the three kinds of “challenging protein complexes.” The first limitation is partly because some protein complexes are naturally sparse (e.g., each protein within such a complex may not necessarily interact with every other protein in the complex), and partly because of the paucity of interactions between co-complexed proteins in currently available PPI datasets. A case study by Srihari and Leong [2012a] that mapped 123 large protein complexes from MIPS [Mewes et al. 2006] onto the Consolidated yeast PPI network (1,622 proteins, 9,704 interactions, and average node degree 11.96) found that only 89 complexes were completely “embedded” within the main connected component of the network, whereas the remaining 34 complexes were “scattered” among multiple compo- nents (Figure 5.3). The Consolidated network [Collins et al. 2007] is generated by combining TAP/MS interactions from Gavin et al. [2006] and Krogan et al. [2006] studies using the purification enrichment scoring (for details, refer to Chapter 2). The authors evaluated four methods, CMC, HACO, MCL and MCL_CAw, and found that these methods were able to recover only 58 of these 123 complexes. Of the 65 unrecovered complexes, 27 were scattered across the different components of the network, and 34 complexes, although intact, had low interaction densities (<0.50) in the network. For example, the MIPS complex 510.190.110 (the CCR4 complex) was scattered among four network components, thus rendering the complex dis- connected and with low density (0.19); this complex went undetected by all the four methods. Small protein complexes (consisting of fewer than four proteins) comprise a ma- jority of complexes in yeast and human, but their prediction is especially suscepti- ble to inaccuracies in the PPI network. Missing interactions could easily disconnect a small complex whereas spurious interactions could embed the complex within a larger subnetwork. Topological measures such as interaction density applicable to larger complexes are less effective for detecting small complexes (e.g., in a network with n proteins there are O(n3) triplets (with density 1) that could be predicted as three-protein complexes). Furthermore, evaluation measures such as Jaccard 5.1 Three Main Challenges in Protein Complex Prediction 109 0.7. > Small [ 2015a ]) Large (d) Merged complexes 0.7. Hi DENS: density Lo Med Hi ≤ 1.0 0.8 0.6 0.4 0.2 0.0 1.0 0.8 0.6 0.4 0.2 0.0 1.0 0.8 0.6 0.4 0.2 0.0 1.0 0.8 0.6 0.4 0.2 0.0 1.0 0.8 0.6 0.4 0.2 0.0 Yong and Wong density Small < Large in best cluster (c) Extra proteins Lo Med Hi (Figure adapted from 8 6 4 2 0 0 5 0 8 6 4 2 0 5 0 10 80 60 40 20 30 25 20 15 10 30 25 20 15 10 100 0.35. Med DENS: 0.35 Small ≤ Large 3 unique proteins. of best cluster (b) Match score ≤ Lo Med Hi 3. Lo DENS: density 1.0 0.8 0.6 0.4 0.2 0.0 1.0 0.8 0.6 0.4 0.2 0.0 1.0 0.8 0.6 0.4 0.2 0.0 1.0 0.8 0.6 0.4 0.2 0.0 1.0 0.8 0.6 0.4 0.2 0.0 > Small (a) Recall Large 3. Hi EXT: EXT ≤ Lo Med Hi 3 unique proteins. Small size: Lo Hi Lo Hi Lo Hi Lo Hi Lo Hi Lo Hi Lo Hi Lo Hi Lo Hi Lo Hi Lo Hi Lo Hi > 1.0 0.8 0.6 0.4 0.2 0.0 1.0 0.8 0.6 0.4 0.2 0.0 1.0 0.8 0.6 0.4 0.2 0.0 1.0 0.8 0.6 0.4 0.2 0.0 1.0 0.8 0.6 0.4 0.2 0.0 EXT SIZE MCL CMC DENS HACO Coach Performance of complex discovery algorithms on yeast complexes, stratified byproteins size, density highly (DENS), connected and number to of external theeach complex chart (i.e., corresponds connected to to the atof different least complexes stratified half recalled. groups (b) of of Jaccard the complexes, match complex’sbest-matching given score proteins; prediction. of at (d) EXT). the the Number best-matching The bottom of prediction. x-axis of (c) complexesare: of the Number Lo merged figure. of EXT: into (a) EXT extraneous the proteins Proportion best-matching in prediction. the Stratification categories ClusterOne Large size: Figure 5.1 110 Chapter 5 Open Challenges in Protein Complex Prediction 0.7. > Small [ 2015a ]) Large (d) Merged complexes 0.7. Hi DENS: density Lo Med Hi ≤ 1.0 0.8 0.6 0.4 0.2 0.0 1.0 0.8 0.6 0.4 0.2 0.0 1.0 0.8 0.6 0.4 0.2 0.0 1.0 0.8 0.6 0.4 0.2 0.0 1.0 0.8 0.6 0.4 0.2 0.0 Yong and Wong density Small < Large in best cluster (c) Extra proteins Lo Med Hi (Figure adapted from 5 0 0 0 8 6 4 2 0 5 0 30 25 20 15 10 80 60 40 20 60 50 40 30 20 10 12 10 30 25 20 15 10 0.35. Med DENS: 0.35 Small ≤ Large 3 unique proteins. of best cluster (b) Match score ≤ Lo Med Hi 3. Lo DENS: density 1.0 0.8 0.6 0.4 0.2 0.0 1.0 0.8 0.6 0.4 0.2 0.0 1.0 0.8 0.6 0.4 0.2 0.0 1.0 0.8 0.6 0.4 0.2 0.0 1.0 0.8 0.6 0.4 0.2 0.0 > Small (a) Recall Large 3. Hi EXT: EXT ≤ Lo Med Hi 3 unique proteins. Small size: Lo Hi Lo Hi Lo Hi Lo Hi Lo Hi Lo Hi Lo Hi Lo Hi Lo Hi Lo Hi Lo Hi Lo Hi > 1.0 0.8 0.6 0.4 0.2 0.0 1.0 0.8 0.6 0.4 0.2 0.0 1.0 0.8 0.6 0.4 0.2 0.0 1.0 0.8 0.6 0.4 0.2 0.0 1.0 0.8 0.6 0.4 0.2 0.0 EXT SIZE MCL CMC DENS HACO Coach ClusterOne Performance of complex discovery algorithmsexternal on proteins highly human connected to complexes, the stratified complex (i.e., byof connected to each size, at chart least density half corresponds (DENS), of to the and the complex’sof proteins; different number EXT). complexes stratified The of recalled. groups x-axis (b) of Jaccard complexes, match givenbest-matching score at prediction. the of (d) bottom the Number of best-matching the of prediction. figure. (c) complexesare: (a) Number Lo merged Proportion of EXT: into EXT extraneous the proteins best-matching in prediction. the Stratification categories Large size: Figure 5.2 5.1 Three Main Challenges in Protein Complex Prediction 111

YAL0210C YDL160C Main component of the PPI network: Total MIPS complexes in

1,034 proteins the PPI network: 123 YCR093W YNR052C 8,377 interactions YPR072W YFL028C

Nuclear pore YDL165W complex CCR4 complex SAGA complex

CCR4 complex RNA –I, II, and III (disconnected and sparse) complexes (overlapping complexes)

Chaperonine containing T-complex TRiC Cdc28p complees (Cdc28p interacts with diefferent proteins but at different times)

Smaller components: MIPS complees 588 proteins “scatered” among 696 interactions smaller components: 34

Figure 5.3 The Consolidated network (1,622 proteins and 9,704 interactions) [Collins et al. 2007] contains one main connected component (covering 1,034 proteins and 8,377 interactions) and several small- to medium-sized components of sizes 2–15 (covering the remaining 588 proteins and 696 interactions). Of the 123 MIPS complexes (of sizes at least 4), 89 are completely embedded within the main component, whereas the remaining 34 complexes are “scattered” among multiple components. For example, the SAGA and Chaperonine TRiC complexes are mostly embedded within a single (the main) component, whereas the CCR4 (inset) and Nuclear Pore complexes are scattered among multiple components which makes the latter two difficult to identify. On the other hand, proteins within the three RNA polymerase complexes—I, II, and III—are tightly interconnected which makes it difficult to deconvolute the individual complexes. The Cdc28p complexes, on the other hand, are formed during different cellular contexts, and without this contextual information, these are difficult to deconvolute. 112 Chapter 5 Open Challenges in Protein Complex Prediction

coefficient become less effective for evaluating small complexes (e.g., a mismatch of only one protein in a three-protein complex renders the prediction inaccurate or less useful despite achieving a Jaccard of 0.50). As a result, most methods fare poorly in detecting small complexes. Some methods therefore explicitly exclude small complexes from their predictions (e.g., Wu et al. [2009]). The third limitation can occur when proteins from different complexes are heavily shared among the complexes, thus resulting in a high density of interactions between the complexes. The example of the three RNA polymerase complexes (Pol) I, II, and III is highlighted in several studies [Krogan et al. 2006, Pu et al. 2007, Srihari et al. 2010, Liu et al. 2011], where due to the sharing of multiple proteins, the three complexes are not clearly discernable from each other (Figure 5.3). The presence of spurious interactions can further aggravate this problem by increasing the density of interactions between the complexes. In other cases, the individual complexes that share proteins are not all formed simultaneously; however, due to the lack of contextual data, it is difficult to deconvolute the individual complexes. For example, different Cdc28-cyclin complexes are formed depending on the cell- cycle phase, all of which share Cdc28; however, without the cell-cycle context, the complexes are lumped together into one large cluster which does not match any single (context-dependent) complex. To handle the above limitations, specialized methods are needed, which form the subject of this chapter.

Identifying Sparse Protein Complexes 5.2 One of the first works that attempted to study sparse protein complexes was by Habibi et al. [2010], who modeled complexes as k-connected subgraphs. A graph G =V , E is connected if every two vertices of G are linked by a path. The graph G is k-connected (or, specifically k-vertex-connected), k<|V |, if for every subset S ⊆ V , |S|

k k k where s1(C), s2(C),...,st (C) are maximal k-connected subgraphs of the complex C. Therefore, the average k-connected score for the set of complexes C is given by, k ∈C max{|s (C) : i = 1,2,...,t|} kscore(C) = C i . (5.2) |{ ∈ ∃ ∈ }| C∈C p C : q, (p, q) E While the average density of the 827 known complexes in the BioGrid PPI network was low (0.41), the kscore of these complexes was 0.784 for k = 2, and 0.537 for k = 3, meaning that, on average, 78.4% of proteins within each real complex were located in a 2-connected subgraph, and 53.7% of proteins in a 3-connected subgraph. This meant that protein complexes, in particular the low-density ones, could be identified by modeling them as k-connected instead of dense subgraphs. To do so, Habibi et al. proposed the k-Connected complex Finding Algorithm (CFA), which works in the following two steps. In the first step, CFA generates maximal k-connected subgraphs for various k ≥ 1 to build a candidate list of complexes. For a k, CFA begins by removing all pro- teins of degree less than k, because from Menger’s theorem these proteins cannot be part of any k-connected subgraphs. Next, CFA finds a subset of h

greater than k, and 1-connected subgraphs of diameter greater than 4 are removed. These subgraphs have very long shortest paths between their constituent proteins, and likely do not represent the topologies of real complexes. In the analysis by Habibi et al., the k-connected subgraphs with larger k matched real complexes more accurately, as expected. However, there were several 1-con- nected and 2-connected subgraphs that also matched real complexes with a Jaccard of at least 0.5. In particular, when evaluated against the 827 MIPS and Aloy et al. complexes, these 1-connected and 2-connected subgraphs achieved a recall of 0.163 and 0.149, respectively (i.e., 1-connected: 134, and 2-connected: 123 recovered complexes), thus identifying several sparsely connected complexes from the PPI network. Fanetal.[2012] devised a method to separate out regions of high and low densities within the PPI network and thereby to identify high- and low-density protein complexes. The method starts by identifying the set C of all maximal cliques of size at least three from the PPI network using a branch-and-bound approach. This procedure begins with pairs of interacting proteins and adds one neighboring protein at a time to expand each pair into a clique until no more vertices can be added. Since most proteins in the PPI network have small degrees (a long-tail distribution of degrees), this branch-and-bound procedure is still tractable. Next,

= |C∩C | for any two cliques C1 and C2, a similarity is defined as S(C, C ) |C|.|C | . For a clique C, all other cliques C are counted such that S(C, C ) ≥ t, a similarity threshold, and the clique C is considered to be in a low-density neighborhood if this number is below a threshold c, and in a high-density neighborhood otherwise. Cliques with high and low neighborhood densities are treated differently while building complexes. For a maximal clique with low neighborhood density, a best neighboring protein is repeatedly identified and added such that the density of the enlarged subgraph re- mains high enough and the length of the shortest path between any two proteins in the subgraph remains small (based on thresholds). For maximal cliques with high neigbourhood densities, the method first sorts all cliques in non-increasing order of their densities. For a clique C, if there exists another clique C with S(C, C ) ≥ ω, a merge threshold, then the two cliques are merged, and the resulting subgraph is added to the list of complexes. The two cliques C and C are then removed. All unmerged cliques are added to the list of complexes. Srihari and Leong [2012a] devised a method SPARC (SPARse Complexes) that adds functional interactions to the PPI network to detect sparse complexes. SPARC is based on the observation that a protein complex detection method based solely on the topology of the PPI network tends to detect a protein complex with better 5.2 Identifying Sparse Protein Complexes 115 accuracy if: (i) a majority of the protein subunits of the complex are within the same connected component; and (ii) the interaction density of the complex is higher rel- ative to its immediate (local) neighborhood. Therefore, if functional interactions can be predicted and added to the PPI network such that these interactions im- prove the connectivity as well as the neighborhood-relative density of complexes, then existing methods should be able to detect sparse complexes with better accu- racy. Based on this observation, the authors devised a Component-Edge-density (CE) score to measure the detectability of protein complexes. The CE score for a protein complex is composed of two parts: the Component score (CS) and the Edge density score (ES). CS measures the fraction of proteins within a protein complex that be- long to the same connected component. ES measures the density of interactions within the complex relative to the immediate neighborhood of the complex. For a protein complex B, its CS and ES scores with respect to the network G are given by: max |S (B)| CS(B, G) = i i |{p : p ∈ B, ∃q ∈ B, (p, q) ∈ E(G)}| (5.3) for |B| > 0, else CS(B, G) = 0, = where Si(B), i 1,2,...,t represent the t connected components of B; and e∈E(B) w(e) ES(B, G) = for E(N[B]) =∅, else ES(B, G) = 0, (5.4) e∈E(N[B]) w(e) where N[B] is the set of proteins in B and the immediate neighborhood (i.e., level-1 neighbors) of B in the network. The CE score for the complex B is then defined as,

CE(B, G) = CS(B, G) . ES(B, G). (5.5) B = = For a set containing m complexes, if CE(Bi , G) 1, for i 1,2,...m, then each ∈ B complex Bi forms a connected component in the network disjoint (isolated) from the other complexes; therefore, B forms a set of complexes that are topologi- = cally “easiest” to identify. On the other hand, if CE(Bi , G) 0, then each complex forms a “hole” in the network G, that is, the complex has no interactions among its constituent proteins but may have interactions with proteins outside the com- plex (extraneous interactions); these complexes are topologically the “hardest” to ≤ ≤ identify. Given a threshold 0 tce 1, a complex Bi is considered to be sparse with respect to the network G if CE(Bi , G)

=  C ∈ C GP VP , EP , clusters are generated using an existing method. All clusters C ≥ with CE(C, G) tce are dense, and are therefore added to the list of predicted com-

plexes. The remaining clusters C with CE(C , G)CE(C, GA) and adds p to C , until the CE score cannot be increased any further. If the CE score ≥ of the expanded cluster CE(C , GA) tce, then C is added to the list of predicted complexes. Srihari and Leong [2012a] studied the effectiveness of SPARC in improving the predictions of four existing methods, MCL, MCL_CAw, CMC, and HACO, using a yeast PPI network containing 4,113 proteins and 26,518 interactions, and a func- tional network containing 3,980 proteins and 18,683 interactions (the augmented network contained 5,145 proteins and 43,905 interactions). SPARC was evaluated against detectable protein complexes from the MIPS catalog—that is, MIPS pro- tein complexes with at least four proteins present in the PPI network. The methods achieved 25% higher recall on average upon SPARC-based processing of their clus- ters. For example, there are 155 MIPS protein complexes with at least 4 proteins present in the PPI network (of the total 313 protein complexes in MIPS). Before SPARC-based processing, MCL produced 294 clusters, which matched 38 out of the 155 MIPS complexes (a recall 0.245), but after SPARC-based processing this im- proved to 56 protein complexes (recall 0.361). Among the 294 MCL clusters, 102

were deemed sparse (tce<0.40), of which 19 underwent processing by SPARC, and these recovered 18 additional MIPS complexes—that is, 38 + 18 = 56 total recov- ered complexes, an improvement of 47% in recall for MCL. Among the MIPS com- plexes that were successfully identified because of SPARC-based processing was 510.190.110 (CCR4 complex), highlighted earlier in Figure 5.3. SPARC-processing added several functional interactions (see Figure 5.4) between the protein subunits of this complex, thus increasing its CE score and enabling its detection by all the four methods. Blindly adding functional interactions can result in many false-positive predic- tions. Therefore, SPARC selectively utilizes these functional interactions, only to improve the CE scores of sparse clusters in the network. This also suggests that while using PPI datasets for protein complex prediction is intuitive and reasonable, 5.2 Identifying Sparse Protein Complexes 117

YIL038C YCR093W Protein in complex YJR011C YNR052C External protein YIL106W YGR092W YJR011C YER068W Protein not present YER112W YGR134W YNR052C YKR036C YCR093W YNL288W YGR092W YGR134W YAL021C YPR072W YAL021C YNL288W YDL160C YJL124C YPR072W YIL106W YDL165W YDL165W YFL028C YFL028C YMR308C YDL160C

YFL028C YER112W YJR122W YER068W YJR122W YIL038C YER043C YKR036C YJL124C (a) (b) YMR308C YER043C

Figure 5.4 MIPS complex 510.190.110 (CCR4 complex) before and after refinement using SPARC [Srihari and Leong 2012a]. (a) Before: proteins of the complex (orange) are not connected to one another and the PPI network lacks several proteins (blue). (b) After: proteins of the complex (orange) belong to the same connected component, are densely connected relative to the immediate neighborhood (olive), and the SPARC-processed PPI+functional network lacks only one protein (blue).

functional linkage between proteins can be useful to further enrich the topology of PPI networks and to reveal new co-complex memberships. Yong et al. [2012] showed that by integrating data from multiple sources (e.g., sequence similarity, co-expression, and literature co-occurrence) to add functional linkage information can improve protein complex prediction and increase the credibility of predicted novel complexes. Yong et al. proposed a supervised maximum likelihood-based weighting scheme SWC to integrate functional linkage information with a PPI net- work and build a “Composite network.” The details of this weighting method are described in Section 2.6. Yong et al. demonstrate that SWC is more accurate than STRING in predicting co-complexed interactions. Consequently, when seven differ- ent protein complex prediction methods—MCL, CMC, HACO, ClusterONE, IPCA, RNSC, and an ensemble of these methods, called COMBINED—were evaluated on the Composite network, all the methods achieved considerably higher recall com- pared to when they were evaluated on the original PPI and the STRING networks. For example, when the top 20,000 predicted interactions from SWC and STRING were evaluated for co-complexed interactions, SWC achieved 50% recall at 40% pre- cision, compared to 40% recall for STRING at the same precision level. All seven methods showed improvement in protein complex prediction. For example, the COMBINED method achieved a precision-recall area-under-the-curve of 0.63 on 118 Chapter 5 Open Challenges in Protein Complex Prediction

the Composite network compared to 0.58 on the STRING network. Of the addi- tional identified protein complexes was the yeast mitochondrial cytochrome bc1 complex which had 19 co-complexed and 145 extraneous interactions in the PPI network, making it difficult to be detectable by any method. But, in the Composite network, these extraneous interactions reduced to 26, thus rendering the complex more easily detectable by all the methods. The reader is reminded here that SWC as- signs lower scores to non-co-complexed interactions, which leads to the reduction of extraneous interactions; see Section 2.6.

Identifying Overlapping Protein Complexes 5.3 Many proteins perform multiple distinct functions. As a result, these proteins par- ticipate in distinct complexes which tend to overlap in the PPI network. Most ex- isting protein complex prediction methods, being unable to demarcate the bound- aries between these overlapping complexes, tend to lump the complexes together into large clusters. Deconvoluting these overlapping complexes is important not only for mapping as many complexes as possible, but also for understanding how the shared proteins perform multiple functions via their different complexes. Every protein has a set of one or more functional domains, which enable the protein to perform its functions. Being crucial units of protein sequences, these domains are often conserved, and can fold, evolve, and function independently of the rest of the protein [Yesylevskyy et al. 2006, Weiner et al. 2006, Fong et al. 2007, Chothia et al. 2009]. Protein domains enable biomolecular interactions be- tween proteins via inter-domain or domain-domain interactions (DDIs), where the size, structure, and availability of domains are responsible for determining the part- ners of a protein [Deng et al. 2002, Yesylevskyy et al. 2006]. Proteins with multiple domains (multidomain proteins) occupy nearly half of all proteomes, and eukary- otic proteomes contain a higher fraction of multidomain proteins than prokaryotic proteomes [Han et al. 2007]. Multidomain proteins have an evolutionary advantage over large single-domain proteins with respect to folding: Multiple domains pro- vide proteins with both structural and functional plasticity [Han et al. 2007, Apic et al. 2003, Vogel et al. 2004]. Multidomain proteins can be involved in multiple inter- actions simultaneously, with the different domains involved in interactions with different proteins, thereby giving rise to large multiprotein complexes. In some cases, a protein may use the same set of domains to interact with different sets of proteins. Such a protein often tends to be multifunctional, with its functions deter- mined by the protein or substrate it interacts with. For example, cyclin-dependent kinases interact with different cyclins during the different cell-cycle phases to aid 5.3 Identifying Overlapping Protein Complexes 119 passage of the cell through cell cycle. Consequently, multifunctional proteins form “hubs” in the PPI network, and are contained within multiple complexes that over- lap in the network. However, due to missing context for the interaction (e.g., the cyclin substrate with which the interaction occurs, and the cell-cycle phase when the interaction occurs) it becomes difficult to accurately differentiate between the different complexes. The integration of such contextual information with PPIs will be discussed in Chapter 6. Here, we discuss methods that integrate information on mutual exclusivity of interactions to deconvolute overlapping complexes. If a protein has fewer interacting domains than the number of partners it inter- acts with, then the protein can participate in only a subset of these interactions at any given time [Aloy and Russell 2006, Kim et al. 2006]. Therefore, all overlapping complexes to which the protein belongs may not necessarily co-exist at the same time. Information on this mutual exclusivity between the interactions, due to lim- ited number of interacting domains, can be used to deconvolute these overlapping complexes. Jung et al. [2010] developed an approach called Simultaneous PPI Network (SPIN) to remodel PPI networks by accounting for cooperative and mutually ex- clusive interactions. SPIN is constructed using structural information on binding surfaces of interacting domains from proteins. If two or more interaction partners of a protein bind to the same domain or interfacial surface of the protein, then the domain or surface is treated as physically available to only one of the partners at a time. Therefore, the interactions making use of the domain are considered mutually exclusive. By generating individual PPI networks called SPINs, each rep- resenting only a compatible subset of interactions from the given PPI network, Jung et al. showed that overlapping protein complexes can be separated out, and thus be used to improve the performance of existing methods. SPIN begins by identifying the set of all conflicting domains for every protein in the PPI network. That is, for each protein, SPIN identifies the domains shared by multiple partners, using data on DDIs from PSIMAP [Gong et al. 2005] and In- terPro [Hunter et al. 2009, Quevillon et al. 2005]. SPIN uses Boolean clauses to represent the set of interactions that the protein can participate in the network, as shown in Figure 5.5(a). The clauses for the three proteins p1, p2, and p3 in the ⊕ ∧ ∨ ¬ ∧ = Figure 5.5(a) are: (i) (i1 i3) for p1; (ii) (i1 i2) ( i1 i2) i2 for p2; and (iii) ∧ ∨ ∧¬ = (i2 i3) (i2 i3) i2 for p3. Therefore, the Boolean clause of the three-protein ⊕ ∧ ¬ ∧ ∧ ∨ ∧ ∧¬ network is: (i1 i3) i2, which gives ( i1 i2 i3) (i1 i2 i3). Each conjunc- tive subclause contains a set of non-conflicting interactions, and these subclauses can be used to build distinct snapshots called SPINs of the network. A second 120 Chapter 5 Open Challenges in Protein Complex Prediction

p2 i1 i i 1 i2 1 i2 i1 i2 p i i3 i3 1 2 i4 i4 i4 i5 i6 i3 i5 i6 p3

(a) (b)

Figure 5.5 The SPIN model to construct simultaneous PPI networks using mutual exclusivity of domains [Jung et al. 2010]. (a) Mutual exclusivity of domains in a PPI network: here, i1 and i3 cannot simultaneously occur. (b) Constructing SPIN networks from a PPI network: here, i3 and i5, and i3 and i6 cannot simultaneously occur, so the network is decomposed into two SPIN models, one with i3 (and no i5 and i6) and the other with i5 and i6 (and no i3).

example of the same is depicted in Figure 5.5(b). Jung et al. evaluated two meth- ods namely, MCODE and LCMA, on the SPINs of a PPI network containing 4,579 proteins and 15,524 interactions, with 100 proteins having at least one domain in- volved in mutually exclusive interactions. Protein complexes were detected from each of these SPINs individually and then combined to a common set of predic- tions. MCODE predicted 171 clusters in total (vs. 140 from the original PPI network) that matched the 267 reference complexes from MIPS with a recall of 0.243 (0.213 on the original network) and precision of 0.528 (0.401 on the original network), respectively. LCMA detected 2,073 clusters (vs. 1,696 from the original network) which matched the reference complexes with a recall of 0.528 (0.401 on the orig- inal network) and precision of 0.128 (0.098 on the original network). Among the deconvoluted protein complexes were the three Skp1-Cdc53-F-box (SCF) complexes namely, 445.10, 445.20, and 445.30, that share several proteins, which were other- wise lumped into one single cluster when predicted from the original PPI network. Ozawa et al. [2010] employed a slightly different approach to integrate infor- mation on mutual exclusivity of protein interactions. In this approach, clusters produced from the PPI network are postprocessed (refined) using mutual exclu- sivity between the interacting domains. Let C be the set of predicted clusters from a PPI network identified using any existing protein complex prediction method, P = and let be the set of PPIs in these clusters. For any two proteins, u and v, Puv 1 = means the interaction between u and v occurs, and Puv 0 otherwise. Therefore, the objective is to maximize the total number of allowable PPIs in the clusters: Maximize Puv , (5.6) Puv∈P 5.3 Identifying Overlapping Protein Complexes 121 which is rationalized by the fact the proteins within real complexes are densely connected. This maximization is constrained by the number of DDIs available for D ={ } the interactions to occur. Let uv Duv,1, Duv,2,...,Duv,n be the DDIs available = for u and v to interact, where a Duv,k 1 means the DDI Duv,k occurs for the = interaction between u and v, and Duv,k 0 otherwise. This can be represented as: = Puv Duv,i (5.7) ∈D Duv, i uv subject to the constraint, ≤ Duv,i 2, (5.8) ∈D Duv, i uv because the interaction Puv can utilize at most one DDI per protein. The clusters are post-processed to maximize the allowable PPIs subject to these constaints, and all clusters with at least three proteins and at least two DDIs are output as the final list of refined predictions. Ozawa et al. [2010] evaluated their approach on a yeast PPI network contain- ing 35,353 interactions among 4,621 proteins from BioGrid [Stark et al. 2011], and 18,207 high-confidence DDIs between 3,382 domains from iPfam [Finn et al. 2005] and InterDom [Ng et al. 2003]. MCODE and MCL were used to predict the initial set of clusters—487 and 90, respectively—from the PPI network. Post refinement, the numbers of these clusters reduced to 42 and 34, respectively. The precision of the refined clusters was at least two-fold higher than the original clusters; however, as expected, the recall reduced drastically (by 31% on average) due to the reduction in the number of clusters. Despite this, the refined clusters were highly informative and accurately depicted the interaction pattern within real complexes—for exam- ple, the domain interaction pattern of the {HOS2, SET3, HST1, SIF2} complex was accurately deciphered using the approach (Figure 5.6). The approach by Will and Helms [2014] used a similar strategy based on in- corporating domain-interaction constraints into protein complex prediction. This approach, called DACO (Domain-Aware Cohesiveness Optimization), incorporates these constraints during the cluster-expansion steps of MCODE and ClusterONE. DACO was used to study transcription-factor (TF)-complexes; a protein complex is considered a TF-complex if the complex contains at least one TF. The precision of the predicted TF-complexes by DACO was higher compared to the predictions from MCODE and ClusterONE run without incorporating domain constraints. More- over, the authors characterized these TF-complexes based on the different modes of action by which the TFs controlled the transcription of target genes: by direct 122 Chapter 5 Open Challenges in Protein Complex Prediction

PHD SIR2

SET3 HST1

SET DUF592

Hist_deactly WD40

HOS2 SIF2

Figure 5.6 Domain interaction pattern of the HOS2-SET3-HST1-SIF2 complex which is accurately identified by the approach by Ozawa et al. [2010].

interaction of TFs with the basal transcriptional machinery, or by affecting the chromatin structure and histone placement. This further enabled the segregation of overlapping TF-complexes based on regulatory roles of TFs. The above approaches [Jung et al. 2010, Ozawa et al. 2010, Will and Helms 2014] demonstrate how overlapping protein complexes can be deconvoluted by ac- counting for domain-interaction patterns of PPIs. However, the success of these approaches is limited by the availability of DDI and structural datasets, thus reduc- ing the utility of these methods for large-scale identification and analysis of protein complexes based on structures. Liu et al. [2011] reasoned that since proteins which participate in multiple complexes are multifunctional proteins which tend to form hubs in the PPI network [Ryan et al. 2012], discounting these hubs during the clus- tering process of PPI networks is an alternative strategy to segregate overlapping protein complexes. Moreover, interactions in the PPI network that occur between complexes often tend to be spurious interactions, and more so when these com- plexes belong to different subcellular compartments within the cell. These spurious interactions aggravate the lumping together of complexes. Therefore, further de- composing the PPI network by discarding these spurious interactions that occur between proteins belonging to different compartments may be a way to prevent lumping together of complexes. To go about this, Liu et al. used informative GO terms on the subcellular loca- tion of proteins to identify interactions occurring between proteins from different cellular compartments. A GO term is considered informative if the number of pro- 5.3 Identifying Overlapping Protein Complexes 123

teins annotated with the term is at least NGO, and none of the descendents of the term is annotated to NGO or more proteins; the threshold NGO should be large, and is usually at least 30 [Chua et al. 2006]. Using a PPI network of 5,765 proteins and 52,096 interactions and four different protein complex prediction methods, MCL, IPCA, CMC, and RNSC, Liu et al. found that by decomposing the PPI network using ≥ NGO 500 (amounting to 10 informative GO terms, resulting in 2,193 discarded proteins and 27,477 discarded interactions), and in addition by also discarding the hubs during the clustering process (446 hubs with degree ≥ 50), the F1-measure for all the four methods improved significantly. For example, for MCL, the F1-measure improved from 0.25 in the original network to 0.406 after hub-removal and network decomposition using GO, thereby leading to several overlapping protein complexes being segregated in the process. Gagneur et al. [2004] proposed a method to disambiguate shared proteins be- tween complexes using modular decomposition of TAP/MS data, and thus to iden- tify overlapping protein complexes. Recalling from Chapter 2, TAP/MS data consists of pulled-down complexes in which bait proteins are pulled down with their inter- acting prey proteins in an affinity purification column. Often, multiple purifications are performed to cover all possible complex configurations and to enrich for true complexes. But, if a protein is pulled down as part of a different complex during each of these purifications, then it is likely that the protein is shared between com- plexes, with only of one of the complexes occurring during any given purification. Gagneur et al. used this reasoning to categorize pulled-down co-complexed sets of proteins into distinct complexes, based on the following four cases. Let protein A be the bait, and B, C, and D its preys. Then: (i) Basic case: B is pulled down with A in all purifications, then A and B form a complex; (ii) Series case: B is pulled down with A in one purification, C is pulled down with A in another purification, and B and C together are pulled down with A in the third purification then, A, B, and C belong to the same complex; (iii) Parallel case: B is pulled down with A in one purification, and C is pulled down with A in another purification (but, B and C are never pulled down together with A) then, A and B, and A and C form two separate (parallel) complexes; and, finally, (iv) Prime case: at least two preys, B and C, C and D, and B and D are pulled down with A in different purifications then, A and all the three preys B, C and D belong to the same complex. In Case (iii), A is shared between the two complexes that function in a mutually exclusive manner. Case (iv) is a transitive extension of Case (ii), and all proteins function together in the same complex. Zotenko et al. [2006] proposed a “tree of complexes” representation in which overlapping protein complexes can be hierarchically represented as a tree, and thus 124 Chapter 5 Open Challenges in Protein Complex Prediction

be used to study the sharing of proteins between these complexes. Zotenko et al.’s work is motivated by a fundamental result [Gavrill 1974]onchordal graphs, which states that every chordal graph has a clique-tree representation.Achord in a graph is any edge that connects two non-consecutive vertices of a cycle. A chordal graph is a graph which does not contain chordless cycles of length greater than three. In a clique-tree representation, each vertex is a maximal clique of the original chordal graph. Moreover, a set of maximal cliques that contain the same vertex from the original graph forms a connected subgraph in the clique tree. This “connected subgraph constraint” influences the topology of the clique tree, and in particular, the topology of the clique tree is determined by the structure of overlaps between the maximal cliques in the graph. In the example shown in Figure 5.7(a), clique B is connected to F by a path that passes through C and D, which means that any protein shared by B and F is also contained in C and D. In other words, the overlap between B and F is entirely contained in the overlap between B and D, which in turn is contained in the overlap between B and C. In Zotenko et al.’s work, cographs are used to model protein complexes. A cograph is characterized by

the absense of an induced subgraph of (shortest) path length four (P4). Thus, the diameter of a connected cograph is at most two; therefore, cographs and chordal graphs are very similar, and both tend to be dense and cliquish. Zotenko et al. show that a clique-tree representation of a PPI network can be obtained if the network can be decomposed as a collection of cographs. Therefore, complexes represented by these cographs can be represented in the clique tree, such that the complexes are hierarchically arranged based on the overlaps with each other. When applied to the Tumor Necrosis Factor (TNF)α/Nuclear Factor (NF)κβ network, this decomposition procedure yielded a clique tree of five complexes (Figure 5.7 (b)), in which complexes containing NFκβ-inducing (NIF) kinases, and Inhibitor (I)-κα proteins appear at the beginning of the tree, and the complexes containing Iκβ proteins appear later in the tree. One could infer patterns of sharing of proteins between the complexes from this representation—for example, in Figure 5.7(b), p100 functions with I-κ proteins in the first two complexes on the left, whereas in the rightmost three complexes, p100 functions with NIκB proteins.

Identifying Small Protein Complexes 5.4 A large majority of protein complexes are small (of size<4), and thus form an impor- tant subset of complexes that are completely ignored by current protein complex prediction methods. The inherent challenge in detecting small complexes lies in differentiating them from other small groups of proteins that have topological char- 5.4 Identifying Small Protein Complexes 125

1 13 Maximal cliques A: 1, 2, 3 7 B: 3, 4, 5, 6 3 12 14 2 6 16 C: 3, 5, 6, 7 8 D: 5, 6, 7, 8 15 E: 5, 8, 9, 10, 11 F: 6, 7, 8, 12 4 17 5 G: 12, 13, 14, 15 9 H: 14, 15, 16 I: 15, 17 10 11

Identifying maximal cliques (representing complexes) from an example chordal graph (a)

A I

C D E F NIκ IKKa IκBa B p100 IKKb IκBe E H NIκB IKKc Col-Tpl2

“Tree of complexes” representation for “Tree of complexes” representation for the example chordal graph the NFκB network (b) (c)

Figure 5.7 “Tree of complexes” representation [Zotenko et al. 2006]: (a) Identifying maximal cliques (complexes) from an example chordal graph; (b) Tree of complexes representation for the graph; and (c) Tree of complexes representation for protein complexes identified for the TNFα/NFκβ network by the approach by Zotenko et al. [2006].

acteristics in the PPI network similar to real complexes by chance. As mentioned before, classical topological measures including interaction density are not suffi- cient to differentiate these complexes. An important characteristic of real complexes is their modularity in the PPI net- work; that is, complexes form dense subnetworks which are either isolated (discon- nected) from other complexes, or are separated from one another by sparse regions in the network. While larger complexes may share proteins, the isolatedness is ob- served even more so for small complexes. Isolatedness of small complexes is thus 126 Chapter 5 Open Challenges in Protein Complex Prediction

an important characteristic that can be used to differentiate them from random (small) groups of proteins. However, there is one more challenge here. Spurious (false-positive) interactions can connect random (isolated) proteins, thus resulting in small groups of interacting proteins in the network. In addition, the absence of other (real or spurious) interactions that connect these random protein groups to the rest of the network may render these groups isolated in a manner similar to real (small) complexes. Consequently, the challenge here is to differentiate the interactions that connect proteins within small complexes from the (spurious) in- teractions that connect random proteins. Based on these careful observations, Yong et al. [2014] proposed a Size-Specific Supervised (SSS) weighting scheme that weights each interaction in the PPI network based on the probability of the interaction to: (i) belong to a (small or large) protein complex; and (ii) be isolated or belong to an isolated triangle in the network. Interactions taken from multiple diverse sources are integrated using SSS, and the intuition behind doing so is as follows: Even if a random group of proteins is connected and has topological characteristics similar to that of a real complex in one dataset, it is unlikely that the same group of proteins is connected and has the same topological characteristics across multiple datasets. In particular, if a random small group of proteins (protein pairs or triangles) displays isolatedness in one dataset, it is unlikely for the exact same spurious interactions that generate such an isolated group to occur in all other datasets. Therefore, only the true complexes are repeatedly observed to be isolated across multiple datasets. SSS uses a na¨ıve-Bayes model to capture the properties of true interactions belonging to real complexes. Specifically, for each interaction (a, b) several features are derived using multiple sources of information, namely: (i) the STRING database [Von Mering et al. 2003, Szklarczyk et al. 2011] to derive functional association between a and b; (ii) literature co-occurrence to derive Jaccard similarity between | ∩ | | ∪ | the sets of publications that contain a and b: Aa Ab / Aa Ab , where Ax is the set of publications containing protein x; and (iii) PPI network topology to derive three features namely, degree, neighborhood connectivity, and the extent of shared neighbors for a and b. The degree of the pair (a, b) is the total weight of outgoing interactions from a and b: DEG(a, b) = w(x, a) + w(y, b). (5.9)

x∈Na−{b} y∈Nb−{a}

The neighborhood connectivity for (a, b) is the weighted density of all neighbors excluding the pair itself: 5.4 Identifying Small Protein Complexes 127

x,y∈N w(x, y) NBC(a, b) = a, b , (5.10) | | | | − min( Na,b , λ) . (min( Na,b , λ) 1) where λ>1 is a dampening factor. The extent of shared neighbors is computed using Iterative CD scoring [Liu et al. 2008] (see Chapter 2). These features are then discretized using minimum description length (MDL) supervised discretization [Fayyad and Irani 1993], and combined in a na¨ıve Bayes formulation together with the two other properties: the probability for the interaction to belong to a (small or large) complex, and its isolatedness in the network. Using a known set of real complexes, maximum-likelihood parameters are learned for three classes lg, sm, and non:

nsm F =f P(F = f |sm) = , nsm

nlg,F =f P(F = f |lg) = (5.11) nlg

nnon,F =f P(F = f |non) = , nnon where nsm,F =f is the number of interactions belonging to small complexes and whose feature F has a value f ; nlg,F =f is the number of interactions belonging to large complexes and whose feature F has a value f ; and nnon,F =f is the number of interactions that do not belong to any complex and whose feature F has a value f . The isolatedness feature (ISO) is computed for each interaction (a, b) based on the probability that the interaction is isolated, or is part of an isolated triangle: ISO(a, b) = ISO2(a, b) + ISO3(a, b),  = ISO2(a, b) P(a,b),sm . P(x,y),non, ∈ ∈ x a,b,y Na, b (5.12)  = ISO3(a, b) P(a,b),sm . P(a,c),sm . P(b,c),sm . P(x,y),non , ∈ ∩ ∈{ } ∈ c Na Nb x a,b,c ,y Na, b, c where P(a,b),sm is the probability that (a, b) belong a small complex, and P(x,y),non is the probability that (x, y) does not belong to any (other) complex. These proba- bilities are computed in a manner similar to Equation (5.13) using features from sources (i), (ii), and (iii), but—to avoid circularity—excluding the feature ISO which is not yet defined. SSS weights interactions based on their propensity to belong to complexes based on the complex size, both small and large. The posterior probabilities for the three classes are then recomputed incorporating the new 128 Chapter 5 Open Challenges in Protein Complex Prediction

feature ISO as  P(F = f |sm) . P(sm) P ((a, b) is sm |F = f , F = f ,...) = i i i ; 1 1 2 2 Z  P(F = f |lg) . P(lg) P ((a, b) is lg |F = f , F = f ,...) = i i i ; (5.13) 1 1 2 2 Z  P(F = f |non) . P (non) P ((a, b) is non |F = f , F = f ,...) = i i i , 1 1 2 2 Z where Z is the normalizing factor,  = = | Z P(Fi fi class) . P(class). (5.14) class∈{sm,lg,non} i The prior P(sm)is estimated from the set of training complexes as the proportion of interactions in small complexes out of the total interactions in the PPI data; likewise for P(lg)and P (non). Following SSS-based weighting, small complexes are extracted from the network using a procedure, EXTRACT, which works in two steps: (i) disambiguation of interactions into size-2 and size-3 complexes; and (ii) cohesiveness-based ranking

of complexes. Specifically, each interaction (a, b) carries a weight P(a,b),sm, which denotes its probability to belong to a size-2 or a size-3 complex. If (a, b) is contained within a triangle with higher interaction weights, then (a, b) is likely to belong to

this triangle rather by itself as a size-2 complex. So, the probability P(a,b),sm2 for (a, b) to be a size-2 complex is deduced as = − . . P(a,b),sm2 P(a,b),sm P(a,b),sm P(b,x),sm P(a,x),sm. (5.15) x∈Na∩Nb Similarly, if (a, b) is contained within two triangles with different interaction weights, then (a, b) is likely to belong to the triangle with higher weight. So, the

probability P(a,b),sm3,abc for (a, b) to belong to a specific triangle (a, b, c) is de- duced as = − . . P(a,b),sm3,abc P(a,b),sm P(a,b),sm P(b,x),sm P(a,x),sm. (5.16) x∈Na∩Nb\{c} In the second step, EXTRACT scores each candidate small protein complex based on the cohesiveness of the complex. The cohesiveness of a size-2 cluster (a, b) is given by P = (a,b),sm2 Coh(a, b) , (5.17) P + ∈{ } ∈ (P + P ) (a,b),sm2 x a,b ,y Na, b (x,y),sm (x,y),lg 5.4 Identifying Small Protein Complexes 129 and for a size-3 complex (a, b, c) is given by

Coh(a, b, c) (5.18) P + P + P = (a,b),sm3,abc (a,c),sm3,abc (b,c),sm3,abc

P + P + P + ∈{ } ∈ (P + P ). (a,b),sm3,abc (a,c),sm3,abc (b,c),sm3,abc x a,b,c ,y Na, b, c (x,y),sm (x,y),lg

The score for the candidate complex is then defined as

= . CScore(a, b) Coh(a, b) P(a,b),sm2 (5.19) as a size-2 complex, and

CScore(a, b, c) + + (5.20) P(a b) sm abc P(a c) sm abc) P(b c) sm abc) = Coh(a, b, c) . , , 3, , , 3, , , 3, 3 as a size-3 complex. Yong et al. evaluated EXTRACT using top 10,000 SSS-scored interactions, and compared it against five other methods namely, MCL, RNSC, IPCA, CMC, and ClusterONE. The predicted complexes were evaluated against 259 yeast and 701 human small complexes from CYC2008 [Pu et al. 2009] and CORUM [Reuepp et al. 2008], respectively. While evaluating small complexes, exact matching of the predicted complexes with reference complexes is necessary rather than a Jaccard- based matching at a certain threshold as in the case of large complexes. All six methods performed considerably better in extracting small complexes from the SSS-scored network compared to networks scored using other schemes that are not based on the “co-complexness” and isolatedness of interactions. For exam- ple, ClusterONE achieved a precision-recall area-under-the-curve (AUC) of 0.15 on the SSS-weighted network compared to 0.1 on the STRING network. EXTRACT was significantly better at extracting small complexes compared to all other methods. EXTRACT yielded an AUC of 0.25 compared to ∼0.15 on average for other methods on the SSS-weighted network. Among the successfully detected small complexes from yeast included the DNA replication factor A complex containing Rfa1p, Rfa2p, and Rfa3p, the chromatin silencing complex containing Sir2p, Sir3p, and Sir4p, and the RENT complex containing Sir2p, Cdc14p, and Net1p. The two complexes shar- ing Sir2p were deconvoluted by EXTRACT because SSS assigned higher weights to the interactions within the two complexes and lower weights to all extraneous in- teractions. Among the correctly detected human complexes included the ubiquitin ligase heterodimer complex UBE2V2-UBE2N, which was isolated in the network 130 Chapter 5 Open Challenges in Protein Complex Prediction

and carried a high weight for its sole interaction compared to similarly isolated spurious interactions in the network. Ruan et al. [2013, 2014] proposed a method to predict heterodimeric (two- protein) complexes from the PPI network by discerning protein pairs that are heterodimers against a background of extraneous interactions. Similar to Yong et al.’s work, weights between the heterodimeric proteins and their topological properties are considered as features, and these are learned using a Support Vector Machine (SVM). Essentially, the following features for the heterodimeric complex

formed between proteins Pi and Pj are considered. To form a complex, the two proteins should interact, and therefore the first feature (F1) is the weight wij of the interaction between Pi and Pj . Next, the weights of extraneous interactions, wik and wjk, with an external protein Pk should be lower than wij , else Pk could rather be part of the complex with Pi and Pj ; therefore the second feature (F2) is the maximum of the neighboring weights wik and wjk (to compare against wij ). In the event that the neighboring weights are larger than wij , then Pi and Pj do not form a complex on their own, but instead the neighboring protein Pk could form some complex with either one or both of Pi or Pj . Moreover, if wij is smaller than both wik and wjk, then Pi and Pk, and Pj and Pk can independently form a heterodimer complex. Therefore, the minimum of the neighboring weights, and the maximum two weights are included as the third and fourth features (F3 and F4), respectively. Finally, similar to the work by Ozawa et al. [2010], constraints are introduced on the number of interactions that a protein can participate in depending on the number of available domains; so, the next two features (F5 and F6) are the maximum and minimum number of domains available for the proteins

Pi and Pj for the interaction. These six features are learned from a PPI network using SVM by using known heterodimeric complexes as positive examples. Ruan et al. used 152 heterodimeric complexes from the CYC2008 catalog [Pu et al. 2009] from yeast as positive examples, and 5,345 interacting protein pairs belonging to larger complexes (size>2) as negative examples. Ten-fold cross- validation was performed to assess the performance of the SVM classifier. The clas- sifier achieved considerably higher precision and recall in predicting heterodimeric complexes compared to MCL: a precision of 0.586 and recall of 0.659 compared to 0.017 and 0.023, respectively, for MCL. Tatsuke and Maruyama [2013] proposed Protein Partition Sampler (PPSam- pler) to predict small protein complexes from PPI networks based on Markov Chain Monte Carlo (MCMC) sampling [Liu et al. 2008]. PPSampler is based on the Metropolis-Hastings (MH) algorithm, a kind of MCMC sampling, which generates a partition C of the entire protein set V according to a specified probability distri- 5.4 Identifying Small Protein Complexes 131 bution with each element c ∈ C of this partition considered as a predicted protein complex. Therefore, the partition C can be represented as   = ⊆ ∀ =∅ ∪ = ∀ = ∩ =∅ C c1,...,cn V : i, ci ; ici V ; i, j(i j), ci cj . (5.21)

Each partition is called a state, and the collection C of all states the algorithm can reach is called the domain. The MH algorithm in PPSampler generates a sequence of states from a probability distribution given as −f(C) P(C)∝ exp , (5.22) T where f(C)is a scoring function on the partition C, and T is a parameter called temperature (initially set to T = 5). Given a partition C ∈ C, the proposal probability Q(C |C) of reaching a new partition C ∈ C is determined in two ways, in both of which, a protein u ∈ V is chosen uniformly at random and is removed from the cluster to which u belongs to in C. Thus, the probability of choosing u is 1/|V |.In the first way, u forms a singleton cluster with a probability β (set to 1/100), in which case,

β Q(C |C) = . (5.23) |V | In the second way, u is added to an existing cluster in C. The cluster is chosen probabilistically as follows. All proteins v( = u) are ranked in non-increasing order ≥ ≥ ≥ of their interaction weights to u from the PPI network: w(u, v1) w(u, v2) ... ∈ w(u, v|V |−1). The probability of choosing a cluster c C is set inversely proportional to the total rank of proteins v that the cluster contains, 1. As a result, the i vi∈c i proposal distribution becomes, (1 − β) 1 Q(C |C) ∝ . . (5.24) |V | i vi∈c − Next, the scoring function f(C) is chosen as the product, f1(C) . f2(C) . f3(C), of three scoring functions f1(C), f2(C), and f3(C), which depend on three factors: (i) weights of interactions within the clusters in C, (ii) relative frequency of sizes of clusters in C, and (iii) the total number of proteins within clusters of size at least two = in C, respectively. The first scoring function f1(C) is defined as f1(C) c∈C f1(c), where ⎧ | |= ⎨⎪ 0, if c 1 f (c) = −∞,if|c| >Nor ∃u ∈ c, ∀v( = u) ∈ c, w(u, v) = 0 (5.25) 1 ⎩⎪ u,v( =u)∈c w(u, v), otherwise. 132 Chapter 5 Open Challenges in Protein Complex Prediction

The second scoring function f2(C) is defined based on how close is the relative frequency of clusters in C based on size to a predefined target distribution provided as a parameter. For size i = (2,3,...,N), let φ(i) be a predefined target of the

relative frequency of clusters of size i in C, and let φC(i) be the relative frequency of clusters of size i in C. Then, f2(C) is defined as

N 1 f (C) = . (5.26) 2 1 + i2 . (φ(i) − φ (i))2 i=2 C

Let S2(C)be the number of proteins within clusters of size at least two, that is, = | | S2(C) c∈C,|c|≥2 c . The third scoring function f3(C) is defined as

1 f (C) = , (5.27) 3 + (S2(C)−λ)2 1 1,000

where λ is a parameter specifying the target value for S2(C), and the division by

1,000 is to adjust the strength of f3 (which is a larger number) relative to f2 and f1. The initial state C0 of the algorithm consists of the following clusters: (i) a unique cluster of two proteins for every pair (u, v)that interact with weight w(u, v)in the PPI network, and (ii) for each singleton (non-interacting) protein w, a cluster consisting =∞ = of only w. This way, f(C0) , and therefore, P(C0) 0, and thus C0 is a valid initial state for the MH algorithm. PPSampler was evaluated on a PPI network consisting of 49,607 interactions among 5,953 proteins, and the predicted protein complexes were matched against the 408 reference complexes from CYC2008 [Pu et al. 2009]. Of these 408 complexes, 172 (42%) and 87 (21%) are two-protein and three-protein complexes, respectively. PPSampler produced 350 predicted complexes of average size 5.72, which matched the CYC2008 complexes with a precision of 0.537 and recall of 0.534, giving a F- measure of 0.536. PPSampler particularly performed well for two- and three-protein complexes, achieving a precision of 0.302 and recall of 0.331 (F-measure 0.316) for two-protein complexes, and a precision of 0.533 and recall of 0.552 (F-measure 0.542) for three-protein complexes. Widita and Maruyama [2013] subsequently improved PPSampler by modifying the scoring functions in the MH algorithm, to develop PPSampler2. The scoring =− + + function in PPSampler2 is defined as f(C) (g1(C) g2(C) g3(C)), using three new scoring functions g1(C), g2(C), and g3(C), but using the same data source as = f1(C), f2(C), and f3(C). The first scoring function is given by g1(C) c∈C g1(c), where 5.4 Identifying Small Protein Complexes 133

⎧ ⎪ 0, if |c|=1 ⎨⎪ = −∞,if|c| >Nor ∃u ∈ c, ∀v( = u) ∈ c, w(u, v) = 0 g1(c) (5.28) ⎪ w(u,v) ⎩ = ∈ √ , otherwise, u,v( u) c |c| √ | | which differs from f1(c) by the use of an additional c factor. The addition of w(u,v) u, v( =u)∈c this factor rescales u,v( =u)∈c w(u, v) to approximate the density |c|.(|c|−1) of the cluster. Widita and Maruyama observe that using density√ penalizes larger clusters more severely, but using a more gradual function as |c| tends to ease this penalization on larger clusters.

To define the second scoring function, the size distribution φC(i) for clusters of size i, is estimated using a normal distribution with mean φ(i) and standard deviation σ2,i, given by  −(φ (i) − φ(i))2 p(φ (i)|φ(i) σ ) ∝ C C , 2,i exp 2 . (5.29) 2σ2,i

The joint probability of φC(2), φC(3),...,φC(N) given φ(2), φ(3),...,φ(N) and = σ2 (σ2,2, σ2,3,...,σ2,N ) is formulated as their product  N − 2 | ∝ −(φC(i) φ(i)) p(φC(2), φC(3),...,φC(N) φ, σ ) exp 2 2σ 2 i=2 2,i  N (φ (i) − φ(i))2 (5.30) = exp − C 2σ 2 i=2 2,i = { } exp g2(C) ,

− 2 where g (C) =− N (φC(i) φ(i)) . The third scoring function is derived by esti- 2 i=2 σ 2 2 2, i mating S2(C) using a normal distribution with mean λ and standard deviation σ2 given by  (S2(C) − λ)2 p(S (C)|λ σ ) ∝ − 2 , 2 exp 2 2σ2 (5.31) = { } exp g3(C) , =−(S2(C)−λ)2 where g3(C) 2 . The parameters in PPSampler2 are set to the following 2σ2 = −9 = 2 = 6 new values: T 10 , λ 2000, and σ2 10 . Upon evaluating on the same PPI dataset as above (as used for PPSampler), PPSampler2 generated 402 protein com- plexes, with average size 5, which matched the CYC2008 reference complexes with 134 Chapter 5 Open Challenges in Protein Complex Prediction

a precision of 0.618 and recall of 0.742 (F-measure 0.674). For size-two complexes, PPSampler2 achieved a precision of 0.41 and recall of 0.6 (F-measure 0.49), and for size-three complexes, the precision was 0.58 and recall was 0.8 (F-measure 0.67), thus outperforming PPSampler for complexes of all sizes.

Identifying Protein Sub-complexes 5.5 Identifying protein sub-complexes is as an interesting special case of overlapping and/or small complexes in which a subset of proteins from a larger complex forms a smaller but by itself a distinct functional complex. For example, the origin recog- nition (ORC) and the minichromosome maintenance (MCM) complexes are sub- complexes which come together to facilitate DNA replication in eukaryotic cells by forming the pre-replication complex (pre-RC). One may relate sub-complexes to “cores,” in which the set of core proteins interact with different sets of attachments to form different distinct complexes, as suggested by Gavin et al. [2006]. Since these sub-complexes share proteins with the larger complexes they constitute, most ex- isting methods predict only the larger complexes and fail to detect the constituting smaller sub-complexes. One indication for the presence of sub-complexes comes from examining data from TAP/MS experiments. If during multiple TAP purifications, a bait and its set of preys are co-purified repeatedly in such a way that the purified sets of preys never completely overlap, then each such purified set of bait and its preys may form a sub-complex. In the normal case, all bait-prey relationships are combined to build a PPI network in which every relationship occuring in more than one purifications is collapsed into a single interaction (see Chapter 2). However, by preserving these re- lationships from individual purifications, it may be possible to extract the different sub-complexes represented by these relationships. CACHET [Wu et al. 2012], ex- plained in Chapter 3, can be used to detect sub-complexes in this manner because CACHET treats individual sets of bait-preys detected from different purifications as a unique core (a modeled as biclique) to which attachments are added to build dis- tinct complexes; each of these distinct cores can be considered as a sub-complex. Indeed, in an evaluation of protein complex prediction methods, Zaki and Mora [2014] found that CACHET performs particularly well in identifying sub-complexes. Inspired by CACHET, Zaki and Mora developed TRIBAL (TRIad-Based ALgorithm) to detect sub-complexes from TAP data. Like CATCHET, TRIBAL preserves the multi-interaction information from dif- ferent purifications for the PPI scoring and clustering steps. TRIBAL starts by gen- erating a pull-down matrix [b, (p × p)], in which each row represents a bait b, and { } each column represents a pair of preys (pi , pj ). Each cell (bk , pi , pj ) in the ma- 5.5 Identifying Protein Sub-complexes 135

p3 p3

b1 b1

b1 b1

p1 p2 p1 p3

p1 p2 p1 p2

b2 b2

b2 b2

p1 p2 p1 p4

p4 p4

Reliable trends Spoke-modeled Overlaid on a real

interaction network. complex (b1, p1, p2, b2). Interactions (b1, p1), Sub-complex (b2, p2) detected in identified (in red) multiple directions. (b1, p1, b2).

Figure 5.8 Identifying sub-complexes using TRIBAL [Zaki and Mora 2014]. TRIBAL preserves interactions identified from individual purifications. These interactions are mapped onto reference complexes; the subset of proteins traced out by interactions from multiple purifications are predicted to be subcomplexes of the reference complexes.

trix isa1ora0depending on whether or not the preys pi and pj are co-purified along with a bait bk. This way, the co-purification relationships among preys (with common baits) is maintained. The Dice coefficient [Zhang et al. 2008] (Chapter 2) is used to score each interaction between a bait and a prey-pair, and all interac- tions above a certain threshold of the score are considered reliable. The interactions deemed reliable by the Dice scoring are used to build a purification network: Each { } reliable triad (b, p1, p2 ) is converted into a spoke graph and added to the net- work, by preserving duplicate interactions between baits and preys. Finally, these spoke-modeled interactions are considered as templates and are overlaid onto a set of known (real) complexes, as shown in Figure 5.8. If a set of duplicated interac- tions induces a connected subnetwork within a real complex, then that subnetwork is considered a sub-complex. When applied on the yeast TAP/MS dataset, TRIBAL was able to detect TIM9-TIM10 as a sub-complex of the TIM22 complex, ADA-GCN5 as a sub-complex of the SAGA complex, and the ORC and MCM complexes as sub- complexes of the pre-RC complex, and the post-replication, DNA polymerase δ, DNA polymerase , and DNA polymerase ζ complexes as sub-complexes of the DNA repli- cation complex. 136 Chapter 5 Open Challenges in Protein Complex Prediction

An Integrated System for Identifying Challenging 5.6 Protein Complexes The three challenges that we have described remain open problems in protein complex prediction. Although the above-described approaches can go some way to address each of these issues with varying degrees of success, none of them can completely resolve them. Yong and Wong [2015a, 2015b] showed that most complexes fall into at least one of the three categories of challenging complexes— sparse complexes, small complexes, or overlapping complexes—in both yeast and human. The majority of complexes are small complexes, while substantial numbers of the large complexes are of low density or overlap with other complexes. The less- challenging complexes are the non-overlapping large complexes with high density, and based on current reference datasets these constitute only 15% and 5% of yeast and human complexes respectively. Thus, although no approach can completely tackle all the three challenges, it still makes sense to integrate these approaches so that the partial improvements to tackle each of these challenges stack up to a greater overall improvement in complex prediction. To this end, Yong and Wong [2015b] combined three approaches in an inte- grated complex prediction system to address the three challenges (Figure 5.9): (i) Supervised Weighting of Composite Networks (SWC) [Yong et al. 2012], which uses diverse data sources with supervised learning to fill-in missing PPIs and thereby densify sparse complexes; (ii) PPI network decomposition using GO terms and hub removal [Liu et al. 2011], which improves the prediction of overlapping complexes and complexes embedded within highly-connected regions; and (iii) Size-Specific Supervised Weighting (SSS) [Yong et al. 2014], which integrates diverse data sources and their topological features to predict small complexes specifically. SWC and PPI network decomposition utilize multiple clustering-based complex prediction al- gorithms (CMC, ClusterOne, IPCA, COACH, MCL, and RNSC), with voting-based aggregation (the Ensemble method [Yong et al. 2012] described in Chapter 3), to derive complexes from their respective processed PPI networks. Each of these three approaches are run independently, and the resulting complexes are combined us- ing a voting-based aggregation strategy. Yong et al. tested the integrated system on yeast and human complex pre- diction, and observed marked improvement compared to each of its constituent approaches, and also compared to each individual clustering-based complex pre- diction algorithm, predicting a greater number of reference complexes at higher precision levels. This improvement was especially dramatic for human complexes, 5.6 An Integrated System for Identifying Challenging Protein Complexes 137

PPI data Experimental STRING literature

SWC PPI network SSS decomposition Hub removal

Data integration and Data integration supervised weighting GO decomposition and size-specific supervised weighting

CMC … IPCA CMC … IPCA Extract

Recombine Recombine decomposed … decomposed Combine predictions clusters clusters from algorithms

Re-add hubs … Re-add hubs

Combine predictions from algorithms

Large complexes Overlapping Sparsecomplexescomplexes Small complexes

Rescale scores and combine

Final predictions

Figure 5.9 Integrated system to predict challenging protein complexes [Yong and Wong 2015b]. Large, small, overlapping, and sparse complexes differ in their topological characteristics, and one single strategy (e.g., density-based prediction) does not cover all protein complexes. A combination of strategies is therefore required to predict protein complexes. 138 Chapter 5 Open Challenges in Protein Complex Prediction

where an additional 10% of reference complexes were predicted with greater than twofold increase in precision. Further analysis showed that the three categories of challenging complexes saw the greatest benefits in improvements, as each of the three constituent approaches complemented each other to predict different types of challenging complexes. While this approach took convincing strides in tackling these three challenges in protein complex prediction, there remains much room for further improve- ment; e.g., only 40% of human complexes were predicted by the integrated system. Nonetheless, these results highlight a need to identify and understand specific chal- lenges within protein complex prediction, and to design solutions to address them. As the “easy” protein complexes (in fact constituting only a minority of yeast and human complexes) can already be predicted by the wide variety of existing pro- tein complex prediction algorithms, it is the other, more difficult, complexes that present the best opportunities for improvements in protein complex prediction.

Recent Methods for Protein Complex Prediction 5.7 The last two to three years have seen development of new protein complex predic- tion methods, which have been evaluated on more recent and larger PPI datasets and shown to outperform most traditional methods in precision and recall. For ex- ample, ClusterEP, proposed by Liu et al. [2016], is a supervised complex prediction method that uses emerging patterns (EPs) to distinguish real complexes from ran- dom subgraphs in the PPI network. An EP is a kind of conjunctive pattern that can contrast sharply between different classes of data; that is, the elements of the data- set that exhibit a particular EP are more likely to belong to a particular class than other classes. ClusterEP combines multiple topological properties of complexes to build these EPs, and works in the following three steps. In the first step, a feature vector is built to represent the positive (real complexes) and negative (random subnetworks) classes of data. This feature vector F consists of 22 different topological features (F1–F22) for subnetworks S. These features include: the number of nodes in the subnetwork (F1), the density of the subnetwork (F2), and the remaining 20 features (F3–F22) are grouped into six categories based on the following properties: mean and variation of the degrees of nodes, clustering coefficient, topological coefficient, eigenvalues of the adjacency (or connectivity) matrix of the subgraph, and weight and size (number of amino acids) of the proteins in the subnetwork. The eigenvalue feature includes the first three singular values (SV) of the adjacency matrix of the subnetwork S. The singular value decomposition 5.7 Recent Methods for Protein Complex Prediction 139 of an m × n matrix M is a factorization of the matrix in the form M = UV, where U is a m × m unitary matrix,  is a m × n diagonal matrix with non-negative real numbers on its diagonal, and V is a n × n unitary matrix. The diagonal entries of  are known as the SVs of M. A common convention is to list the SVs in descending order. SVs for subnetworks of different “shapes” (for example, with topologies such as linear, clique, star, and hybrid) have remarkably different first three SVs [Qi et al. 2008]; refer to Figure 3.3 from Chapter 3.

The feature vector for the positive class Dp is learned from a set of real protein complexes. The feature vector for the negative class Dn is learned from a set of randomly generated PPI networks and their subgraphs, which is at least 20 times = ∪ larger than the set of real complexes. The feature values in D Dp Dn for each feature are discretized into ten equal-width bins. A bin for each feature is termed an item, and a set of items from different features is called an itemset (pattern). In the second step, ClusterEP identifies a specific type of EPs called noise- tolerant EPs (NEPs), defined as follows. Given two support thresholds δ1, δ2 > 0 and δ2 >> δ1, an NEP from D1 to D2 is an itemset X that satisfies the following two conditions: (i) Supp (X) ≤ δ and Supp (X) ≥ δ ; and (ii) no proper subset of D1 1 D2 2 X satisfies condition (i), where Supp (X) is the occurrence frequency of X in D Di i = (i 1, 2). The NEP from Dp to Dn is represented as NEP(Dn), and the NEP from Dn F to Dp is represented as NEP(Dp). For a given subgraph S, let (S) be the feature vector values of S. An EP-score for S with respect to Dp is computed as: EPscore(S, D ) = Supp (e), (5.32) p Dp e∈NEP(Dp),e⊆F (S) and with respect to Dn is computed as: EPscore(S, D ) = Supp (e). (5.33) n Dn e∈NEP(Dn),e⊆F (S) ∈ By definition, a noise-tolerant EP e NEP(Dp) is a pattern that occurs much more often in Dp than in Dn. Thus, EPscore(S, Dp) sums the number of real com- ∈ plexes in Dp over each noise-tolerant EP e NEP(Dp) that is consistent with F(S). In other words, EPscore(S, Dp) sums the support in Dp of patterns in F(S)that are more associated with real complexes. Analogously, EPscore(S, Dn) sums the sup- port in Dn of patterns in F(S) that are more associated with random subgraphs. Therefore, the score EPscore(S, Dp) favors a positive label; that is, the larger the score the more likely S is a real protein complex. On the other hand, EPscore(S, Dn) favors a negative label; that is, the larger the score the more likely S is a random 140 Chapter 5 Open Challenges in Protein Complex Prediction

subgraph. These EP scores are normalized relative to the median EP scores in the two datasets, = Norm EPscore(S, Dp) EPscore(S, Dp)/ median(Dp) (5.34) = Norm EPscore(S, Dn) EPscore(S, Dn)/ median(Dn), and the normalized scores are used to compute a clustering score for S:

Norm EPscore(S, Dp) f(S)= , (5.35) + Norm EPscore(S, Dp) Norm EPscore(S, Dn) with f(S)>1/2 meaning S is more likely to be a real complex. In the third step, ClusterEP uses a seed-and-expand procedure similar to MCODE [Bader and Hogue 2003] (Chapter 3) to identify complexes, with the differ- ence being that each time a protein p is added to an existing cluster S, ClusterEP checks for the following conditions: f(S∪{p})>f(S)and the average degree of ∪{ } S p has increased. Two subgraphs S1 and S2 built in this manner are then =| ∩ |2 | | | | checked for a possible overlap ω(S1, S2) S1 S2 /( S1 . S2 ). If the overlap ≥ ω(S1, S2) , a fixed threshold, the two clusters are merged. Liu et al. [2016] trained ClusterEP using a yeast PPI network containing 4,931 proteins and 22,227 interactions, and yeast complexes from MIPS (103 complexes) [Mewes et al. 2006] and the Gavin et al. [2006] study (491 complexes). ClusterEP was used to predict protein complexes from a human PPI network consisting of 6,305 proteins and 62,937 interactions from HPRD [Prasad et al. 2009]. The predicted complexes were evaluated against known complexes from CORUM (1,843 complexes) [Reuepp et al. 2008, 2010]. ClusterEP achieved a recall of 0.603 at a precision of 0.211, which was considerably better than that of some methods described before—for example, ClusterONE (recall of 0.376 at precision of 0.169)— on the same dataset. Rizzetto et al. [2015] presented a SImulation-based COMplex PREdiction (SiComPre) approach that considers the absolute protein concentrations, protein binding sites, and DDIs involved in interactions between proteins, and localiza- tion of protein interactors, to generate protein complexes. The approach is based on stochastic simulations where multiple instances of proteins (corresponding to the square root of the absolute protein concentrations) move and interact ran- domly on a 2D space. The simulation space is divided into square lattices, where proteins are diffused randomly between neighboring lattices at discrete time steps. Proteins are represented as objects with binding sites, where proteins with com- plementary binding sites can interact to form complexes, or their bonds can break resulting in sub-complexes or singleton proteins. Protein domains were retrieved 5.8 Identifying Membrane-Protein Complexes 141

from the SMART database [Letunic et al. 2012], which cover domains for 34% of the protein interactions. To compensate for the lack of sufficient DDI data, fic- titious interacting domains were added to proteins: domains were added to a protein pair if the proteins are involved in the same biological function, accord- ing to MIPS. This procedure increased the coverage of interactions with DDI data to 84%. Starting with a yeast protein interaction dataset consisting of 1,622 proteins and 9,022 interactions, and a human interaction dataset consisting of 3,006 proteins and 13,992 interactions, Rizzetto et al. [2015] generated the networks containing 1,474 proteins and 7,618 interactions with protein domains in yeast, and 2,342 proteins and 9,395 interactions with protein domains in human. Since the intial (random) positioning of the proteins and their concentration levels affect the sim- ulations, multiple simulation runs were tested. However, most simulation runs produced the same common set of complexes, and the number of predicted com- plexes unique to each run only varied by 1% between the different runs. On average, the simulations yielded 657 complexes in yeast, of which 409 complexes matched a known complex from MIPS. In human, 1,158 complexes were predicted, of which 268 matched a known complex from CORUM.

Identifying Membrane-Protein Complexes 5.8 Native membrane protein complexes are difficult to purify using traditional TAP procedures owing to the hydrophobic nature of membrane proteins (MPs) [Lalonde et al. 2008]. Conventional Y2H system is confined to the nucleus of the cell thereby excluding the study of membrane proteins [Miller et al. 2005, Kittanakom et al. 2009]. Consequently, MPs and their interactions are often under-represented in public interaction datasets [Turner et al. 2010]. This poses a severe challenge for predicting MP complexes. Therefore, new experimental protocols are required that can cover MPs and interactions among MPs. Babu et al. [2012] developed a new TAP extraction and purification procedure that can pull down complexes involving MPs. The study used 2,141 MPs, of which 1,590 were tagged and processed using this TAP procedure. Of these tagged MPs, about 77% (1,228 out of 1,590) were purified as part of some complex. The physical interactions inferred within the purified complexes were scored using Purification Enrichment scoring [Collins et al. 2007]. These interactions were integrated with TAP/MS interaction datasets from Gavin et al. [2006] and Krogan et al. [2006] that mainly cover soluble proteins (the readers are referred to Chapter 2 for details on Purification Enrichment scoring, and the Gavin et al. and Krogan et al. datasets). 142 Chapter 5 Open Challenges in Protein Complex Prediction

This resulted in an integrated PPI network consisting of 13,343 high-confidence interactions among 2,875 proteins, which represented two-thirds of the yeast pro- teome detectable by mass spectrometry. Babu et al. clustered this network using MCL [Enright et al. 2002] to derive 720 putative protein complexes, of which 501 complexes contained at least one MP. Comparisons between these predicted com- plexes and the expert-curated CYC2008 catalog [Pu et al. 2009] showed that of the 167 curated complexes containing at least one MP, there were 67 complexes with each containing 90% or more proteins covered by a predicted complex. This suggests that as more interaction data on MPs become available, existing compu- tational methods for protein complex prediction will be able to perform better at predicting MP complexes. Membrane proteins are involved in the transportation of ions, metabolites, and larger molecules such as proteins, RNA, and lipids across membranes. MP com- plexes involve highly transient interactions required for the “dynamic exchange” of cargoes across membranes. For example, dynamic exchange of proteins between MP complexes has been observed for the translocase of the mitochondrial outer membrane (TOM) and the NADH-ubiquinone oxidoreductase complexes [Rapaport 2005, Lazarou et al. 2007]. Therefore, MP complexes are not stable entities like their soluble counterparts. Traditional AP/MS techniques involve dual-step strin- gent purification which filters weak interactors of baits. Keilhauer et al. [2015] used a AP/MS protocol for yeast which used a less-stringent single-step purification that preserves the weaker interactions. Although this protocol results in co-purification of a large number of nonspecific interactors, the complexes can still be identified because of their higher enrichment in specific bait pull-downs vs. all other pull- downs. This modified protocol, called the Affinity Enrichment Mass Spectrometry (AE/MS) rather than AP/MS, identified several MP complexes for yeast including the HOPS vacuolar and the SPOTS complexes [Keilhauer et al. 2015]. However, a problem with this protocol is the potential large number of contaminants: A large number of controls may be necessary to comprehensively cover all possible nonspecific interactors. There is an ongoing collaborative effort to establish a “con- taminant repository for affinity purification” (the “CRAPome”) containing control pull-downs from different laboratories performed under various experimental con- ditions [Mellacheruvu et al. 2013]. Readers are referred to the review by Laganowsky et al. [2013] for details on these and other on-going efforts for studying MP com- plexes. Since MP complexes are not stable entities like their soluble counterparts, their identification requires understanding their dynamic assembly and disassembly— how the individual proteins come together to form complexes, and how these 5.8 Identifying Membrane-Protein Complexes 143 complexes are eventually degraded. Studies reveal that this assembly occurs in an orderly fashion; that is, MP complexes are formed by an ordered assembly of intermediaries. To prevent unwanted intermediaries, this assembly is aided by chaperones [Daley 2008]. For example, in a study [Maddalo et al. 2011] that focused on 30 inner and outer membrane-bound protein complexes from E. coli, a well- characterized periplasmic chaperone (PpiD) is involved in the assembly of several MPs. One of the proteins (YfgM) in this chaperone complex had no annotated function, but its association to the complex suggested that YfgM is also part of the PpiD interactome, and is therefore a chaperone involved in the assembly of MPs. Why membrane complexes assemble in an ordered manner is unclear, but studies suggest that this could be a protection mechanism of the cell against harmful intermediary complexes [Herrmann and Funes 2005]. Identifying Dynamic Protein Complexes

Governing dynamics, gentlemen! —John Nash, played by Russell Crowe, A Beautiful Mind (2001)

Many, if not all, protein complexes are dynamic entities, which assemble at a spe- cific sub-cellular space and time to perform a particular function and disassemble after that. For example, cyclin-CDK complexes regulate cell-cycle functions at pe- riodic intervals, and are assembled when cyclin-dependent kinases (CDKs) are ac- tivated based on the concentration levels of cyclins during different phases of the cell cycle [Nurse 2001, Morgan 1995, Mendenhall and Hodge 1998, Enserink and 6Kolodner 2010]. To be able to detect dynamic complexes, we need to capture the dy- namics of the proteins and their interactions. While the interaction between two dynamic proteins, at the minimum, requires the two proteins to co-localize and be expressed (to certain minimum levels), it also depends upon other spatial, tempo- ral, structural, and/or contextual constraints of the proteins at that time [Hein et al. 2015]. Dynamic interactions occurring in this manner play an important role in driving all cellular systems, and taking this dynamics into consideration is cru- cial to understanding cellular functioning and organization [Przytycka et al. 2010, Washburn 2016]. While large numbers of genome-scale PPI datasets have been generated in the last several years, in general, these lack the specific spatial, tem- poral, and contextual information about the interactions. Therefore, it becomes necessary to integrate such information from other sources into the analysis of PPI networks in order to study the dynamics of PPIs and protein complexes.

Dynamism of Protein Interactions and Protein Complexes 6.1 Whether a protein interacts, and the choice of its interaction partner, is regulated by different cellular mechanisms [Nooren and Thornton 2003, Przytycka et al. 2010, 146 Chapter 6 Identifying Dynamic Protein Complexes

Hein et al. 2015]. For example, the co-localization of the interactors in time and space, as well as the local concentration of the interactors, are influenced by the expression of the coding genes, degradation rates of the mRNAs, and transport, secretion, and degradation of the protein products. Similarly, the binding affini- ties of different interactors are regulated through posttranslational modifications on the proteins, or changes to the physiochemical environments within the cell (e.g., by changes to the concentration levels of effector molecules such as ATP that may affect the binding affinity), or be determined by the context (e.g., response to a stimuli or availability of a metabolite). Depending on these factors, interactions are categorized as obligate, where the proteins cannot exist as stable structures on their own and are frequently bound to their partners upon translation and fold- ing; or as non-obligate, where the proteins can exist as stable structures in bound and unbound states. Obligate interactions are generally permanent or constitutive, which once formed can exist for the entire lifetime of the proteins. Non-obligate interactions may be permanent or alternatively transient, wherein a protein inter- acts with its partners for a brief time period and dissociates after that. It is not just the permanent interactions which result in formation of long-lived multipro- tein assemblies, transient interactions are also of crucial functional importance, particularly in cell signaling [Jordon et al. 2000, Pawson and Nash 2000]. In addition to the levels of expression and abundance for proteins and sub- strates, protein interactions are also mediated by the levels of intrinsic disorder in the participating proteins. Protein disorder refers to the phenomenon of a protein partially or fully not folding into a stable structure. Such proteins or protein regions are called intrinsically disordered proteins (IDPs) or IDP regions [Dunker et al. 2001, 2015] (this nomenclature will be discussed in detail later in the chapter). Nearly half of the eukaryotic proteome is found to contain regions of 40 or more amino acids with disorder [Dunker et al. 2001, 2015, Berlow et al. 2015]. Under different cellular and physiological contexts, disordered proteins and regions can undergo order-to- disorder transition, thereby determining the binding partners and binding affinity of protein interactions. A commonly observed mechanism for this form of binding is through a short linear sequence motif (commonly referred to as a SLiM), which is about 10 amino acids in length. A SLiM can induce transient interactions of low affinities with multiple target proteins [Van Roey et al. 2014]. This mechanism is observed in the oncoprotein E7 from the human papilloma virus (HPV) (Figure 6.1): E7 contains an intrinsically disordered N-terminus made up of a short LxCxE motif which interacts with the TAZ2 domain of cAMP response element binding (CREB)- binding protein (CBP)/p300, and the pocket domain of the retinoblastoma protein Rb to deregulate the cell cycle of the host [Jansma et al. 2014]. Figure 6.1 Human papilloma virus (HPV) E7 oncoprotein with disordered region containing a LxCxE motif occurring at positions 22–26 in HPV type-16, and positions 26–30 in PHV type-45 (the regions in red for the ’disorder’ track). (Figure generated from Protein Data Bank server [Berman et al. 2000]) 148 Chapter 6 Identifying Dynamic Protein Complexes

As a result of dynamism in protein interactions, protein complexes also display dynamism in their formation, composition, and stability, which impart them with important functional properties. For example, the highly conserved yeast cyclin- dependent kinase Cdc28p regulates the cell cycle by forming complexes with differ- ent cyclins; these complexes in turn phosphorylate different substrates to regulate different cell-cycle phases: Cdc28p forms a complex with cyclin Cln3p to enter

the cell cycle, with Cln1,2p to enter and regulate activities in the G1 phase, with Clb5,6p to begin DNA replication in the S phase, and with Clb1,2,3,4p to enter M phase [Morgan 1995, Mendenhall and Hodge 1998, Enserink and Kolodner 2010]. Here, the dynamism of Cdk28p-cyclin complexes is determined by transient in- teractions mediated by the levels of cyclins during the different cell-cycle phases. The second example is that of RNA polymerase II (Pol II) which undergoes large structural rearrangements of protein subunits during transcription initiation and elongation [Cramer et al. 2001, Hanh 2004]. Pol II is composed of four structural mobile elements—Core, Clamp, Shelf, and Jaw Lobe—that move relative to each other. The Core element forms a cleft through which the DNA enters from one side and the active site is buried at the base. The Shelf and Jaw Lobe elements move relatively little and can rotate parallel to the active site cleft. The Clamp ele- ment, connected to the Core through a set of flexible switches, moves with a large swinging motion of up to 30A˚ to open and close the cleft. The final example is of protein complexes formed by the intrinsically disordered hypoxia inducible factor- 1α (HIF-1α) depending on the levels of oxygen in the cellular milieu [Berlow et al. 2015]. HIF-1α plays an important role in the transcription of genes that are criti- cal for cell survival under low cellular oxygen concentrations through interactions with the TAZ1 domain of the transcriptional coactivator CBP/p300. The C-terminal disordered region of HIF-1α forms a protein complex with CBP/p300 based on cel- lular oxygen levels: Under normoxic (normal oxygen) conditions, the C-terminal is hydroxylated by the asparagine hydroxylase FIH to impair binding to the TAZ1 do- main of CBP/p300; however, under hypoxic conditions, the C-terminal undergoes a disorder-to-order transition forming helical structures that interact with the TAZ1 domain of CBP/p300, to activate survival-related genes. The importance of dynamism of protein interactions and protein complexes cannot be understated, and is rapidly becoming the subject of extensive computa- tional research. A number of computational methods have focused on the effective integration of temporal, contextual, and structural information with protein inter- action networks to understand this dynamism of protein interactions and com- plexes; some of these methods are covered in this Chapter. 6.2 Identifying Temporal Protein Complexes 149

Identifying Temporal Protein Complexes 6.2 The study by Han et al. [2004] was one of the first attempts to study dynamism at an interactome level by integrating expression levels of proteins with PPI networks. Han et al. combined a S. cerevisiae interaction network (“filtered yeast interactome” or FYI) with mRNA expression data to study the dynamics of hub proteins. The FYI dataset consisted of 2,493 high-confidence interactions (among 1,379 proteins) detected in at least two different experiments. Hubs in the FYI network were char- acterized using expression profiles containing 315 data points across five different experimental conditions. Pearson correlation coefficients (PCCs) were computed between the hubs and their partners. Han et al. found that for hubs with degree k ≥ 5, the average PCCs followed a bimodal distribution, with one peak centered around 0.1 and the other around 0.6. On the other hand, for proteins with k<5, the average PCCs showed a normal distribution centered around 0.1. Based on this bimodal distribution, Han et al. suggested that hubs (with k ≥ 5) can be split into two distinct subtypes—ones with high average PPCs (“party hubs”), and the ones with low average PCCs (“date hubs”). This bimodal distribution was observed only for specific conditions—e.g., “stress response” and “cell cycle”—but not for oth- ers, indicating condition-dependent participation of these hubs in interactions. When removed from the network, the party and date hubs had distinct effects on the overall topology of the network, with the removal of date hubs having a more significant effect on the network connectivity. The removal of date hubs resulted in disconnected subnetworks which were larger in size and number than compared to those resulting from removal of party hubs. Based on this, Han et al. suggested that date hubs are central to the network, and tend to participate in a wide range of integrated connections required for global organization of biological modules in the whole network. On the other hand, party hubs tend to form local modules, and although important for the function of these modules, tend to perform at a lower level in the functional organization. Although initially contested [Batada et al. 2006], the observations by Han et al. have since then been reproduced on larger PPI datasets [Agarwal et al. 2010, Pritykin and Singh 2013]. For example, Pritykin and Singh [2013] further demonstrated that date hubs have higher betweenness centrality, while party hubs have higher clustering and functional similarity, thus supporting the central placement of date and modular placement of party hubs in PPI networks. Moreover, the authors found a higher enrichment for essential genes among party hubs than date hubs, as determined by a hypergeometric test; thus agreeing with the notion that cellular systems try to buffer their date hubs (which are more centrally located) against 150 Chapter 6 Identifying Dynamic Protein Complexes

disruption. Komurov and White [2007] further expanded the concept to include “family hubs,” and showed that family hubs are more constitutively expressed and form static modules. Luscombe et al. [2004] studied the dynamics of transcription factor (TF) reg- ulatory networks using 7,074 regulatory interactions between 142 TFs and 3,420 target genes from S. cerevisiae. The authors integrated gene-expression data from five conditions—cell cycle, sporulation, diauxic shift, DNA damage, and stress response—and identified the subnetworks active under each of these conditions. Luscombe et al. found about half of the target genes were uniquely expressed in only one condition, whereas most TFs (95/142) were expressed across multiple condi- tions. Moreover, over half of the interactions were replaced by new ones between conditions, and only 66 interactions were retained across four or more conditions; these 66 “hot links” corresponded to housekeeping functions. Endogenous pro- cesses (cell cycle and sporulation) are multi-stage and operate with an internal transcriptional programme, whereas exogenous processes (diauxic shift, DNA dam- age, and stress response) constitute binary events that react to external stimuli with a rapid turnover of expressed genes. In Luscombe et al.’s analysis, the topology of the subnetworks changed considerably between these endogenous and exogenous processes. In particular, the average number of TFs regulating a target gene was smaller for exogenous processes, indicating that the TFs were regulating in simpler combinations. On the other hand, the average number of target genes regulated by a TF was larger for exogenous processes, indicating a wider regulation in response to external stimuli. Further, feedforward loops (FFLs) were present higher in the ex- ogenous subnetworks. A FFL is a three-gene pattern composed of two input TFs, one of which regulates the other, and both jointly regulating a target gene. FFLs respond only to persistent input signals or stimuli. Therefore, FFLs suit exogenous conditions better, as cells cannot initiate a new stage until the previous one has stabilized. Within the cell-cycle phases, Luscombe et al. found that most TFs that are active in the cell cycle operate only in a particular phase, and a small minority of TFs are ubiquitously active throughout the cell cycle. A third of these ubiquitous TFs are static hubs and perform housekeeping functions. De Lichtenberg et al. [2005] studied the dynamics of protein complex forma- tion using the example of yeast cell cycle. The authors constructed a PPI network from 300 yeast cell-cycle proteins of which 184 proteins were periodically expressed during the cell cycle. By mapping curated complexes from MIPS [Mewes et al. 2006] onto this cell-cycle network, De Lichtenberg et al. identified 29 periodic modules consisting of a combination of periodic and static proteins. For exam- ple, the minichromosome maintenance / origin recognition (MCM/ORC) complex 6.2 Identifying Temporal Protein Complexes 151

(the pre-replication complex) with 22 proteins consists of 12 static proteins and 10 periodic proteins, of which six are expressed during the M/G1 phase, two during the early G1, and one during late G1. The presence of static proteins within these modules, and the close timepoints at which most of the periodic proteins within a module are expressed, indicated a pattern for the assembly of modules. In par- ticular, de Lichtenberg et al. suggested that yeast (and in general, all eukaryotic) complexes are assembled just-in-time (as against synthesized just-in-time like in prokaryotes) where the transcription of only a few components of the complexes is tighly regulated and held off until the final complex assembly. This prevents ac- cidental switching on of complexes, but at the same time is more efficient than synthesis of entire complexes like in prokaryotes. The analysis by De Lichtenberg et al. also revealed several binary dynamic complexes (e.g., Cdc28p formed binary complexes with different cyclin substrates during the different phases of the yeast cell cycle). Srihari and Leong [2012c] studied the dynamics of protein complex assembly during the yeast cell cycle using complexes derived from the Consolidated net- work [Collins et al. 2007]. Using the Cyclebase dataset (http://www.cyclebase.org/) [Gauthier et al. 2008] of timepoints (cell-cycle phases) at which yeast cell-cycle pro- teins show their peak expression, the authors derived the most likely cell-cycle phases at which cell-cycle complexes are assembled. Specifically, cell-cycle phases

(G1, S, G2, M) were assigned to each yeast protein by averaging its expression across multiple datasets and computing the phase(s) at which the protein showed peak expression. Of the 6,114 yeast proteins available in the dataset, 5,514 were labeled static, and the remaining 600 as dynamic. Of these dynamic proteins, 576 had a unique peak phase, whereas the phases for the remaining 24 were difficult to de- termine. By mapping these labels onto the Consolidated network (1,622 proteins and 9,704 interactions), the authors found that static-static interactions dominated the network (94.69%), whereas static-dynamic and dynamic-dynamic interactions formed relatively smaller fractions (4.6% and 0.716%, respectively). This agreed with the notion that, in order for the cellular system to be stable, most of the inter- actions had to be static with only a small fraction of the interactions capable of changing dynamically. Protein complexes were derived using a core-attachment model (MCL-CAw [Srihari et al. 2010]), and by mapping the cell-cycle proteins onto these complexes, 57 cell-cycle complexes containing at least one dynamic protein each were identified. The authors found that the attachment proteins within these complexes were significantly more enriched for dynamic proteins compared to the cores. Moreover, when these complexes were mapped back onto the network, the static cores were shared between multiple complexes and these were involved in 152 Chapter 6 Identifying Dynamic Protein Complexes

Yal040c M Yal040c Ygr108w Ygr108w Ypl256c Ypl256c

Ypr119w Mapping of cell-cycle G2 phases G Ygr109c Ybr160w 1 Ybr160w Ypr119w Ygr109c

Ydl155w Ydl155w Ypr120c Ypr120c S Ylr210w Ymr199c Ylr210w Ymr199c

G1/S

Figure 6.2 Decomposing clusters by mapping cell-cycle phases [Srihari and Leong 2012c]. The decomposition of a cluster containing the kinase Cdc28 (Ybr160w) and its cyclin- substrates identifies protein complexes active at different cell-cycle phases.

multiphase interactions, i.e., dynamic proteins peaking at different (usually adja- cent) cell-cycle phases interacted with these static core proteins. These observations hinted at a biological design principle of “temporal reusability” of the static cores: By maintaining the core proteins throughout all phases, these proteins are reused, whereas the attachment proteins are transcribed when required to assemble the complexes, thus agreeing with the just-in-time assembly proposed by De Licht- enberg et al. [2005]. Srihari and Leong then decomposed large clusters using the cell-cycle phases of constituent proteins, and found that these clusters contained multiple distinct complexes. For example, when the cell-cycle phases were mapped onto a cluster containing the kinase Cdc28 (Ybr160w) and its cyclin-substrates, the authors identified distinct dynamic complexes, as shown in Figure 6.2, that were active at distinct phases. Here, Cdc28 can be considered the static core, while the cyclins as its dynamically regulated attachments. Tang et al. [2011], Li et al. [2012], and Ou-Yang et al. [2014] proposed methods to detect temporal protein complexes by constructing dynamic PPI (DPPI) networks using a static PPI network and time-course data. All the three meth- ods are very similar, and we describe here only the method by Ou-Yang et al., which is the most recent among the three. Given a generic (static) PPI network, Ou-Yang et al. constructed a DPPI network for each of the T timepoints by determining the peak timepoint(s) of expression for each protein. Specifically, each DDPI network is considered as a combination of a static and a dynamic subnetwork. The static subnetwork is estimated based on the PCCs between the proteins: two interacting 6.2 Identifying Temporal Protein Complexes 153 proteins u and v are said to have a static interaction if the PCC between their expres- sion patterns across the T timepoints is greater than a threshold δ (set to 0.6). The dynamic subnetwork at each timepoint is estimated based on the active proteins at that timepoint. A protein u is considered active at a timepoint if its expression is greater than the active threshold AT (u) given by

AT (u) = μ(u) + 3σ (u)(1 − F (u)), (6.1) where μ(u) is the mean and σ (u) is the standard deviation of u’s expression across all timepoints, and F (u) = 1/(1 + σ 2(u)) is a weight function that corrects for the fluctuation in u’s expression [Wang et al. 2013]. Equation 6.1 is based on the “three- sigma principle” to differentiate active and inactive timepoints of a protein during the cellular cycle [Wang et al. 2013]. For a normal distribution, three standard deviations (“three sigmas”) on either side of the mean encompasses 99% of the data represented in the distribution. To allow for fluctuations in the expression of the protein, it is considered to be active if its expression varies beyond this three-sigma range. However, these fluctuations can happen even due to noise in the expression data, and therefore a correction factor F (u) which is inversely proportional to the fluctuation (standard deviation) is additionally used in the Equation to determine the active threshold. An interacting pair (u, v)from the input PPI network is added to the dynamic subnetwork at time t only if both u and v are active, that is, their expression levels are above their respective active thresholds, AT (u) and AT (v). Finally, the DPPI network at timepoint t is the union of the static and dynamic subnetworks at t. Complexes are then identified by clustering each of these DPPI networks individually, and the final set of predicted temporal complexes is the union of all sets of complexes. Using PPI networks from BioGrid (59,748 interactions among 5,640 proteins) and DIP (21,592 interactions among 4,850 proteins), Ou-Yang et al. predicted 606 complexes with a precision of 0.363 and recall of 0.741, which were higher than clustering without using temporal segregation of the network (precision 0.332 and recall 0.647). In particular, Ou- Yang et al. showed that the three RNA polymerase (Pol) complexes, namely, Pol I, II, and III, could be easily identified using temporal network segregation, which are otherwise lumped into one large cluster when detected using only the generic PPI network. Similar to what Han et al. [2004] report on the bimodal distributions of PCC values with date hubs centered around 0.1 and party hubs centered around 0.6, the motivation to divide the DPPI network into static and dynamic portions in Ou-Yang et al.’s work also comes from the observed differences in PCC distributions for the 154 Chapter 6 Identifying Dynamic Protein Complexes

two portions. Essentially, Ou-Yang et al. observed a bimodal distribution for PCC values, one peak centered around 0.6, corresponding to static interactions, and the other around 0.1, corresponding to transient interactions (hence, δ was set to 0.6). Hanna et al. [2015] proposed biclustering of gene-expression data and mapping the biclusters onto PPI networks as a means to generate dynamic PPI networks, and thereby identified temporal protein complexes. Essentially, Hanna et al. argue that if an interaction in the PPI network occurs only during specific conditions, then the correponding gene pair will be correlated only under those conditions in the expression dataset. Therefore, dynamic PPI networks for specific conditions can be generated by biclustering the expression data and mapping the correlation between gene pairs falling within the same cluster, onto the PPI network. Biclustering of expression data is defined as follows [Cheng and Church 2000]. Let X be the set of

genes and Y the set of conditions. Let aij be the element of the expression matrix A representing the expression value of gene i ∈ X under condition j ∈ Y . Let BC(I, J) ⊂ ⊂ specify a submatrix AIJ, I X and J Y with the following mean squared residue (MSR) score MSR(BC(I , J)): 1 MSR(BC(I , J))= (a − a − a + a )2, (6.2) |I| . |J | ij iJ Ij IJ i∈I ,j∈J where 1 a = a , iJ |J | ij j∈J (6.3) 1 a = a , Ij |I| ij i∈I and 1 a = a (6.4) IJ |I| . |J | ij i∈I ,j∈J are the row and column means, and the mean of the submatrix (I , J). The lower the MSR score the higher is the bicluster coherence; that is, the genes within the bicluster show higher correlation in their expression patterns. After generating the set of biclusters, Hanna et al. generated subnetworks by extracting interactions from the PPI network that corresponded to these biclusters. Protein complexes were then detected from each of these PPI networks individually via clustering, and the resultant complexes were combined into a final set of predicted complexes. Using a PPI network containing 5,192 proteins and 24,698 interactions from DIP [Xenarios et al. 2002], and expression dataset GSE3431 from GEO [Barrett et al. 2013], Hanna et al. identified 89 protein complexes with a precision of 0.646 and 6.2 Identifying Temporal Protein Complexes 155 recall of 0.717, compared to 76 complexes with a precision of 0.6 and recall of 0.7 without using gene-expression biclusters. Park and Bader [2012] studied how PPI networks change across timepoints and how these affect changes to complexes, by simulating time-evolving PPI networks using stochastic models. Let {G(t) : t = 1,...,T } be a series of T time-ordered PPI networks. For an arbitrary pair, t = t , G(t) and G(t ) can have different proteins and interactions. The authors infer a sequence of time-evolving stochastic block models, {M(t) : t = 1,2,...,T }, where M(t) is a model generating the network G(t). Such a network-generative model M is built as follows. The total n vertices in M are grouped into K clusters. The probability for a vertex v to belong to a cluster k = is πk, which is the parameter for a multinomial distribution, with k πk 1. The ∈ parameter θi,j [0, 1] gives the probability of adding an unweighted undirected interaction between vertices u ∈ i and v ∈ j belonging to clusters i and j, and is modeled as independent Bernoulli trials. Essentially, each vertex u is sampled = with probability πk for cluster k, then each interaction is sampled as euv 0or1 ∈ ∈ as a Bernoulli trial with success probability θij for u i and v j. Therefore, the = = number of interactions at the cluster-level will be nij u∈i,v∈i euv for i j,or = = . u≤v∈i euv for i j. The total possible number of interactions is mij ni nj for = = − = i j, and mii ni . (ni 1)/2 for i j. The resulting clusters are then merged into larger clusters such that the probability for a vertex to belong to a large cluster k is ˆ = ˆ = πk nk/n and the probability for two such vertices to be connected is θij eij /mij , where eij is the number of interacting pairs where one protein is in cluster i and the other is in cluster j. This procedure is repeated T times to generate T PPI networks, which can be ordered and compared as if these were a single network that is slowly changing over the T timepoints. The number of vertices and edges to be generated in the model followed a yeast PPI dataset containing 13,401 interactions and 3,248 proteins. The vertices to be present in the different timepoints is based on a timecourse gene-expression data- set covering 3,510 significantly periodic genes from 36 timepoints. This resulted in 36 network snapshots which on average contained 1,380 proteins with degree 1.8. The authors recovered 31 dynamic protein complexes containing at least three proteins each from these networks. Most of the complexes matched across the en- tire time-course, but some complexes disappeared and reappeared. For example, a complex involved in DNA repair was most active at the end of each 12-point cycle. For other complexes such as the mitochondrial ribosome complex, and the nuclear pore complex, some of the proteins were transcribed and then degraded at specific timepoints—e.g., the ribosomal small subunit of mitochondria (RSM) subunit 22 was active only at t = 9, 20, and 32, suggesting a periodic requirement of RSM22 at specific timepoints for the functioning of the mitochondria. 156 Chapter 6 Identifying Dynamic Protein Complexes

Intrinsic Disorder in Proteins 6.3 Intrinsically disordered proteins (IDPs) and IDP regions do not form stable tertiary structures, yet they exhibit biological activities [Dunker et al. 2001, 2015, Dyson and Wright 2005, Oldfield and Dunker 2014]. The word “disordered” has been adopted because of Jirgensons’ use of it for protein classification [Jirgensons 1966]. The word “intrinsically” emphasizes a sequence-dependent characteristic: The struc- tural instability of IDPs is likely encoded by/in their amino acid sequences [Wright and Dyson 1999]. Other terms have also been used in the literature to explain this phenomena namely, naturally disordered, intrinsically unstructured, partially folded, flexible, rheomorphic, natively denatured, and natively unfolded. This sim- ple definition of disorder, however, covers distinct structural phenomena, the two most apparent being static and dynamic disorder [Tompa and Fuxreiter 2008]. In static disorder, an IDP region might adopt (one of the) multiple stable conforma- tions (but is missing from electron density maps and qualifies for disorder). Such regions are sometimes called “wobbly” domains [Uversky et al. 2005]. By contrast, an IDP or IDP region might also constantly fluctuate between a large number of conformations and can be best described as a conformational ensemble; this kind of disorder is dynamic. Computational studies of IDPs began with the investigations into why they do not fold into stable tertiary structures. IDPs and IDP regions do not fold primarily because they are rich in polar residues and proline and depleted in hydrophobic residues [Xie et al. 1998, Dunker et al. 2002]. More than 50% of eukaryotic proteins have at least one long (>30 amino acid sequence) IDP region. Partial disorder has been observed in almost every protein, but it seems to be more abundant in proteins belonging to certain biological processes such as transcription, cell- cycle regulation and signal transduction [Tompa and Fuxreiter 2008]. The lack of structural constraints offers IDPs and IDP regions mobile flexibility to facilitate a diverse range of biological processes. In particular, IDP regions facilitate proteins to perform tedious functions that require more flexibility and cannot be performed by rigid-structured proteins (e.g., movement through narrow pores or channels, and creation of chimeric or fused proteins [Dunker et al. 2001, 2015]). Some examples of IDPs include casein, phosvitin, fibrinogen, trypsinogen, and calcineurin [Oldfield and Dunker 2014]. There are five main databases for IDPs namely, DisProt (http://www.disprot .org/)[Hobohm and Sander 1994], Intrinsically Disordered Proteins with Exten- sive Annotations and Literature (IDEAL) (http://idp1.force.cs.is.nagoya-u.ac.jp/ IDEAL/)[Fukuchi et al. 2012], the Database of Protein Disorder and Mobility Anno- tations (MoBiDB) (http://mobidb.bio.unipd.it/)[De Domenico et al. 2012, Walsh et al. 2015], the Database of Disordered Protein Prediction (D2P2)(http://d2p2.pro/) 6.4 Intrinsic Disorder in Protein Interactions and Protein Complexes 157

[Oates et al. 2013], and the Protein Ensemble Database (pE-DB) (http://pedb.vib .be/)[Varadi et al. 2014]. There are several software and Web servers available in the literature for IDP and IDP region prediction; some of the popular ones include: Predictor Of Naturally Disordered Regions (PONDR®)(http://www.pondr .com/)[Romero et al. 1997], Intrinsically Unstructured Proteins Prediction (IUPred) (http://iupred.enzim.hu/)[Doszt´anyi et al. 2005], DisEMBL (http://dis.embl.de/) [Linding et al. 2003a], GlobProt (http://globplot.embl.de)[Linding et al. 2003b], and DISOPRED (http://bioinf.cs.ucl.ac.uk/psipred/?disopred=1)[Jones and Ward 2003]. Some of these databases and predictors (e.g., DisProt) are based on the as- sumption that disorder is encoded in specific features of the amino acid sequence, whereas others (e.g., PONDR®) take into account the flexibility, hydropathy, charge, and coordination number apart from the amino acid composition, to predict IDP regions. IDPs interact with other (IDP) proteins, nucleic acids, and other types of mol- ecules, just like their structured counterparts. These interactions are often facili- tated by small-molecule ligands, macromolecular binding partners, or posttrans- lational modifications that induce IDPs or IDP regions to become structured for the duration of those interactions. This kind of disorder-to-order transition in the partner-bound state is termed fuzziness [Tompa and Fuxreiter 2008]. The term em- phasizes the structural ambiguity of protein interactions involving IDPs and IDP domains. Using a simple extension from disordered proteins, fuzziness of PPIs can be divided along the same lines into static and dynamic fuzziness. If a PPI involves a static IDP or IDP region, then the fuzziness of the interaction is considered static fuzziness. In contrast, if a PPI involves a dynamic IDP or IDP region, then the fuzzi- ness of the interaction is considered dynamic fuzziness (for further discussion on these definitions, see Section 6.5).

Intrinsic Disorder in Protein Interactions and Protein Complexes 6.4 The representation of proteins and their interactions in PPI networks—as dots and lines connecting these dots—neglects their biophysical properties. This sim- plified representation, although makes for a powerful tool to explore topological properties, is insufficient to answer important questions such as: Which domains mediate the interactions? Which interactions are mediated by structured regions and which are mediated by disordered regions? Which interactions are transient and which are permanent? Which interactions exclude one another and which can occur simultaneously? Although there have been some recent efforts (e.g., Hein et al. [2015]) to quantify the strength and duration of interactions, these questions still largely remain unanswered using traditional proteomics techniques. 158 Chapter 6 Identifying Dynamic Protein Complexes

A complementary computational approach is to use protein docking or molecular dynamics simulations to predict the ensemble of interactions and macromolecular assemblies possible between complementary binding surfaces of proteins [Meng et al. 2011, Russel et al. 2012, De Ruyck et al. 2016]. Cross-docking involves identify- ing interacting partners by docking each protein in the set with every other protein. This, to some extent, helps us understand the structural dynamics of protein inter- actions. The Critical Assessment of PRedicted Interactions (CAPRI) challenge was started to assess the performance of docking algorithms and scoring approaches [Janin 2010]. The CAPRI challenge has led to significant improvements in avail- able tools for docking [Pierce et al. 2014, De Vries and Zacharias 2013, Torchala et al. 2013, Kozakov et al. 2013]. For example, ZDOCK (http://zdock.umassmed .edu/)[Pierce et al. 2014] is a popular docking algorithm which uses Fast Fourier Transform (FFT)-based global search to identify docking sites based on statistical potential, shape complementarity, and electrostatics. However, fuzzy or flexible (IDP-based) docking is much more difficult than rigid docking, and predicting docking pairs involving proteins undergoing large conformational changes is still a challenge. Certain proteins have a more central position in PPI networks. These proteins are typically hubs and are likely to be essential for survival [Jeong et al. 2001]. Some hub proteins have multiple simultaneous interactions (party hubs), while others have multiple sequential interactions separated in time or space (date hubs) [Han et al. 2004]. While proteomics techniques do not provide answers on the dynamics of protein interactions, and docking-based analysis for all proteins in the PPI net- work is challenging and limited by the availability of 3D structures, targeted analysis of hubs is more feasible. For example, in a pioneering work, Aloy and Russell [2002] predicted interactions for cyclin-dependent kinases (Cdks) using 3D structures of interaction surfaces. By mapping these structures onto a yeast PPI network consist- ing of 2,590 interactions, the authors identified structural interaction patterns for 59 of these interactions, which covered many of the Cdks. In another study (also highlighted earlier in Chapter 5), Kim et al. [2006] generated a structural interac- tion network (SIN) by integrating 3D structures of proteins with a yeast PPI network (available through the Structural Analysis of Protein Interaction Networks (SAPIN) server: http://sapin.crg.es/). The authors found that date hubs were often involved in mutually exclusive interactions in the SIN, which appeared to be more transient than simultaneously possible interactions. Tsuji et al. [2015] combined structural information from the Protein Data Bank (PDB) [Berman et al. 2000] with protein interactions from IntAct [Hermjakob et al. 2004] to investigate individual protein interactions involving hubs and protein complexes. The authors mapped human 6.4 Intrinsic Disorder in Protein Interactions and Protein Complexes 159

PPIs from IntAct onto protein complex models from PDB to extract 3,959 unique experimentally determined interfaces for 7,241 interactions. Of these, about 80% of the proteins had a single interface. Proteins within protein complexes possessed more interfaces, with the largest number of interfaces reaching up to seven in some proteins (e.g., 26S proteasome subunit 7, MAP kinase kinase kinase (MAP3K) 3, and growth receptor bound (GRB) protein 2). When the amino-acid sequences of these interface regions were mapped on the PDB protein complex models, it was found that approximately half (43%) of the sequence regions mapped to IDP re- gions, and between 16–57% mapped to interface, surface, and interior (burried residues) of the protein complex models. In another study, Karetal.[2009] in- tegrated structural interfaces for hubs in a human cancer PPI network to build a cancer structural protein interface network (ciSPIN). The ciSPIN was used to study cancer-related properties of hub proteins using their interface dynamics. The In- teractome3D database (http://interactome3d.irbbarcelona.org/)[Mosca et al. 2013] provides structural information for over 12,000 protein interactions covering eight model organisms. The above-mentioned studies on the existence of hubs capable of participating in multiple interactions suggests a distinct mechanism for molecular recognition: Structural features that endow these hubs the flexibility and mobility to accommo- date a diverse range of interactions with other proteins [Hasty and Collins 2001]. Since IDPs are flexible, IDP-based interactions have been proposed as a mechanism to explain multiple binding by hubs in PPI networks [Dunker et al. 2005]. Several studies have supported this idea, with disorder being associated with the interac- tion patterns observed in both date [Ekman et al. 2006, Patil and Nakamura 2006, Patil et al. 2010] as well as party hubs [Jaffee et al. 2004, Marinissen and Gutkind 2005]. Some examples of completely (100% of the protein sequence/structure is disordered) or almost completely disordered hubs include α-synuclein, caldesmon, high mobility group protein A (HMGA), and synaptobrevin [Dunker et al. 2005]. There exist hubs which are themselves ordered but interact with intrinsically dis- ordered binding partners; an example is that of cyclin-dependent kinases (Cdks) binding to their inhibitors, cyclin-kinase inhibitors (CKIs). Cdks have been intro- duced earlier in this Chapter. Cdks are considered as the master timekeepers of cell division [Nurse 2001, Morgan 1995]. Cdks are regulated by binding to their cy- clin partners to form heterodimeric complexes. For example, cyclins A and E that interact with Cdk2 are required for the G1/S transition and progression through the S phase; exit from G1 is primarily under the control of cyclin D/Cdk4/6; and the interaction of Cdk1 with cyclin B1 directs the G2/M transition [Morgan 1995, Mendenhall and Hodge 1998, Enserink and Kolodner 2010, Dunker et al. 2005]. 160 Chapter 6 Identifying Dynamic Protein Complexes

Therefore, not surprisingly, the activity of Cdks throughout the cell cycle is pre- cisely directed by a combination of mechanisms including the levels of cyclins and the levels of their inhibitors (CKI proteins), which are responsible for deactivating Cdk-cyclin complexes. The crystal structures of several Cdks and their complexes with cyclins and CKIs have been resolved [Pavletich 1999]. These studies suggest that CKIs are disordered, but upon binding to specific cyclin-Cdk complexes, the CKIs attain an order. The “molecular staple” analogy is often used to describe this phenomenon, wherein the “prongs” of the “CKI staples” are unstructured and flex- ible, but the binding of CKIs to cyclin-Cdk complexes connects the prongs, thus rendering the CKIs a transient structure, to facilitate deactivation of the cyclin-Cdk complexes. In addition to the two extremes of (dis)orderness observed in hubs, there also exists an intermediate category of partially disordered hubs. This group includes hub proteins that fold into overall stable 3D structures but contain disordered do- mains which play important roles in binding to the partners of these hubs. The percentage of the protein sequence/structure covered by these disordered regions ranges approximately from 12% (e.g., 14-3-3 proteins) to 79% (e.g., breast cancer susceptibility protein 1 BRCA1) [Dunker et al. 2005, Dunker et al. 2015]. A proto- typical example here is that of the tumor suppressor protein TP53, which has about 29% disordered regions, and is involved in crucial biological processes including response to cellular stress and DNA damage, arresting of cell cycle progression, and inducing of apoptosis [Anderson and Appella 2004]. TP53 has four (not necessar- ily structured) domains namely, the N-terminal transcription activation domain, the central DNA binding domain, the C-terminal tetramerization domain, and the C-terminal regulatory domain [Lee et al. 2000, Wells et al. 2008] (Figure 6.3). The N-terminal domain interacts with transcription factors II D (TFIID) and H (TFIIH), the mouse double minute 2 homolog / E3 ubiquitin-protein ligase (Mdm2), repli- cation protein A (RPA), cAMP response element binding (CREB)-binding protein (CBP)/p300, and COP9 signalosome subunit 5/ JUN activation domain-binding pro- tein 1 (CSN5/Jab1) among many other proteins. The C-terminal regulatory domain of TP53 interacts with glycogen synthase kinase 3 β (GSK3β), Poly(ADP-Ribose) Poly- merase (PARP) 1, TATA-Box binding protein associated factor 1 (TAF1), transforma- tion/transcription domain associated protein (TRRAP), histone acetyltransferase 5 (hGcn)5, 14-3-3, S100 calcium binding protein B (S100B), and several other pro- teins. While the DNA-binding domain is structured, the two terminal domains are instrinsically disordered, and aquire structures only upon formation of interactions or complexes [Lee et al. 2000, Wells et al. 2008]. Multiple different posttranslational modifications have also been reported in TP53, which are believed to alter the struc- (a) Predicted disordered tail

ATM

CDKN2A

BRCA1 MAPK8 MDM2 CDKN1A

TP53

EP300 CREBBP SIRT1

KAT2B

(b) (c)

Figure 6.3 Disordered regions in the human tumor suppressor protein TP53. (a) TP53 sequence with predicted disordered regions (potentially disordered shown in red, and probably disordered shown in blue in the grey track), generated from Protein Data Bank server [Berman et al. 2000]; (b) two views of TP53 secondary structure with predicted disordered tail [Walsh et al. 2015], generated from SWISS-MODEL [Kiefer et al. 2009]; and (c) high-confidence interactors of TP53, identified from STRING [Von Mering et al. 2003, Szklarczyk et al. 2011]. 162 Chapter 6 Identifying Dynamic Protein Complexes

tures of these domains to enable protein interactions. Other examples of partially disordered hubs include BRCA1, Mdm2, xeroderma pigmentosum complementa- tion group A protein (XPA), and estrogen receptor (ER) α [Dunker et al. 2015].

Identifying Fuzzy Protein Complexes 6.5 The concept of disorder in proteins can be extended to protein interactions and protein complexes involving IDP proteins or domains, to give fuzzy interactions and fuzzy complexes. As mentioned earlier, this can be a simple extension of the defi- nition: If an IDP or its domain is capable of adopting (one of the multiple) stable conformations in its unbound state, but gets completely structured in its bound state, the fuzziness of the resulting interaction or complex is considered static. Likewise, if an IDP or its domain remains in an ensemble of conformations in its unbound state, but gets completely structured following binding, the fuzziness is considered dynamic. However, if the conformational freedom provides IDPs and IDP regions functional diversity, why would nature limit this conformational free- dom of IDPs after they interact and attain a bound state? It would be a twisted logic not to exploit these beneficial features after interacting with a partner. It makes sense for IDPs and IDP regions to preserve their conformational freedom under all circumstances [Fuxreiter and Tompa 2012, Patel et al. 2007]. Fuxreiter and Tompa [Tompa and Fuxreiter 2008, Fuxreiter and Tompa 2012] identified 26 cases of pro- tein complexes where structural disorder was present even in the bound state with significant contribution to function. Based on these cases the authors proposed four major categories of fuzziness of complexes: (i) polymorphic complexes, (ii) clamp complexes, (iii) flanking complexes, and (iv) random complexes. These cases are not completely distinct, and can overlap with one another. In a polymorphic complex, at least one of the proteins involved in the complex attains a few or multiple conformations. These alternative conformations underlie different biological functions. For example, consider the protein complex formed by the binding of α-importins to proteins “tagged” with nuclear localization signals (NLS). Importins are a type of karyopherins that transport protein molecules into the cellular nucleus. NLS is an amino acid sequence that ‘tags’ to a cargo protein for import of the protein into the nucleus (e.g., PKKKRKV found in the SV40 Large T-antigen, and KR[PAATKKAGQA]KKKK found in the nucleoplasmin). Here, the same NLS peptide exhibits different side-chain conformations to enable its tagging with different cargo proteins, which then form the complex with importins, to be transported into the nucleus [Terry et al. 2007]. Other examples of polymorphic complexes include the T-cell factor 4 (Tcf4) (as the IDP) with β-catenin (partner), 6.5 Identifying Fuzzy Protein Complexes 163

Inhibitor 2 with protein phosphatase I, Myelin with Actin, and RNase I with RNase inhibitor. In clamp complexes, the disordered segment connects two or more ordered binding regions. For example, the MAPK scaffold protein Ste5 interacts with the MAPK fusion protein Fus3 via an eight-residue long linker, which dynamically in- terchanges among several conformations and provides flexibility to the interaction. The interaction of CKIs with cyclin-Cdk heterodimers, discussed above, is another example of a clamp complex. In flank complexes, the IDP region acts as recognition sequence (a linear motif) and is used to establish a specific contact with a partner. The IDP region imparts the plasticity to localize to the target site. For example, the kinase inducible domain (KID) of CREB interacts with the kinase-inducible domain interacting (KIX) domain of CBP via a 29-residue IDP region. Random complexes are extreme cases of fuzziness, where binding does not induce any structural order for the interacting regions. For example, T-cell receptor ζ chains lack a stable structure and remain in a fast dynamic equilibrium between conformations [Sigalov et al. 2004]. This random behavior is not limited only to the binding between proteins; for example, the binding of single-stranded DNA (ssDNA)-binding (SSB) protein to DNA does not induce a strict disorder-to-order transition of the protein, which keeps fluctuating among several alternative conformations [Savvides et al. 2004]. Hegyi et al. [2007] studied disorder within protein complexes derived from E. coli and S. cerevisiae, and correlated it to the size of protein complexes. The protein complexes were derived by clustering interaction data from the IntAct [Hermjakob et al. 2004, Kerrien et al. 2012] and Gavin et al. [2006] datasets. Structural disor- der for proteins or protein regions in these complexes was predicted using IUPred (http://iupred.enzim.hu/)[Doszt´anyi et al. 2005]. For each protein, the fraction of amino-acid sequence accounting for the disorder was predicted, and this was av- eraged for the entire set of proteins within each complex. The protein complexes were then grouped according to size: singular, 2–4, 5–10, 11–20, 21–30, and 31– 100 protein-containing complexes. The authors found that there were significant (p<0.001 by Chi-square test) differences between the groups for the observed aver- age disorder within complexes. In particular, larger protein complexes had greater average disorder. For example, in E. coli, the median disorder for complexes of size 11–100 was ∼12%, whereas for smaller complexes of size 5–10, the median disorder was ∼7%. Similar observations were noted for S. cerevisiae, but only stronger. For ex- ample, the median disorder for complexes of size 11–100 was ∼18% in S. cerevisiae, and that for small complexes of size 5–10 was ∼14%. These findings likely suggest that larger complexes are assembled from proteins that tend to be more disordered, or conversely, disorder in proteins facilitates the assembly of large complexes. 164 Chapter 6 Identifying Dynamic Protein Complexes

Due to the binding flexibility of IDPs it is difficult to predict the exact partners of IDPs, and therefore the composition of fuzzy protein complexes. Accurate pre- diction of binding interfaces, and thereby the binding partners, therefore becomes a prerequistite to predict fuzzy complexes. Recently, several tools have been devel- oped to predict binding interfaces involving disordered regions [Davey et al. 2012, O’Brien et al. 2013, Meszaros et al. 2009, Disfani et al. 2012, Jones and Cozzetto 2014, Malhis et al. 2015]. Some of these tools are based on the identification of the short linear motifs (SLiMS) and molecular recognition features (MoRFs) that IDPs adopt to bind to their partners. For example, SLiMPrints (http://bioware.ucd.ie/ slimprints.html)[Davey et al. 2012] predicts SLiMS based on conserved regions that may participate in binding. SLiMScape (http://apps.cytoscape.org/apps/slimscape/ )[O’Brien et al. 2013] enables identification of SLiM-mediated interactions between proteins and is integrated with the Cytoscape network-visualization environment [Shannon et al. 2003, Smoot et al. 2010]. Other methods [Disfani et al. 2012, Jones and Cozzetto 2014, Malhis et al. 2015] combine different features of MoRFs namely, amino acid propensity, flanking regions, physiochemical properties, and evolu- tionary profiles, to predict binding. Experimental techniques including Nuclear Magnetic Resonance (NMR) and Small Angle X-ray Scattering (SAXS) in combina- tion with these computational methods are also being employed to model the en- semble of conformations adopted by disordered regions within protein complexes. The Protein Ensemble Database (PE-DB) (http://pedb.vib.be/)[Varadi et al. 2014]is a collection of structural ensembles of proteins obtained from a combination of experimental and computational techniques. Recently, Methyl-TROSY (transverse relaxation optimized spectroscopy) NMR [Sprangers et al. 2007] and live cell imag- ing [Monachino et al. 2016] techniques have made the direct characterization of dynamic complexes far more accessible. For example, Picco et al. [2017] presented a pipeline to reconstruct the 3D architecture of protein complexes in vivo using live cell imaging, and applied the pipeline to study the structure of the exocyst complex. Identifying Evolutionarily Conserved Protein Complexes

Nothing in biology makes sense except in the light of evolution. —Theodosius Dobzhansky (1900–1975), Russian geneticist and evolutionary biologist

With rapid increase in the number of resources for PPIs from several different organisms—both model and higher-order (Chapter 1, Table 1.2)—applying compu- tational methods to identify complexes from these organisms has become feasible. 7Among the biological questions this has inspired is: To what extent do complexes from different organisms show evolutionary conservation? Comparison of protein orthologs between humans, other vertebrates, and lower-order species including S. cerevisae and D. melanogaster has indicated that proteins within human com- plexes are more ancient and have higher conservation on average compared to non-complexed proteins. In a study by Havugimana et al. [2012] involving 622 pre- dicted human complexes, roughly a quarter (24%; 149/622) had significant overlaps with complexes from S. cerevisae and D. melanogaster, indicating that some of the human complexes are ancient and slowly evolving. On the other hand, most other complexes (60%; 376/622) overlapped with only the vertebrates (i.e., orthologs are not present in these lower-order species), thus indicating a major shift or expansion in the PPI network and complex organization coincident with the emergence of ver- tebrates [Bloome et al. 2006]. In another recent study [Wan et al. 2015] that analyzed more than 980 putative conserved complexes, the protein components compris- ing these complexes spanned 500 million to over 1 billion years old. Strikingly, 344/980 complexes that contained entirely ancient components (>1 billion years old) and 490/980 complexes that contained predominantly ancient components were ubiquitously conserved among eukaryotes. The high degree of conservation 166 Chapter 7 Identifying Evolutionarily Conserved Protein Complexes

of complexes observed in these studies indicates that core cellular processes are (preferentially) conserved throughout the evolution. Detecting conserved complexes using PPI networks is typically based on identi- fying conserved dense (modular) subnetworks between the PPI networks from two or more species. The underlying assumption here is that if a dense subnetwork is conserved between species, then the proteins within the subnetwork are under evolutionary pressure to maintain their interactions [Shoemaker and Panchenko 2007, Juan et al. 2008, Yamada and Bork 2009], and therefore the subnetwork cor- responds to an evolutionarily conserved function, most often executed by protein complexes or functional modules. Most methods go about identifying conserved subnetworks by first building interolog or orthology networks. An interolog is a conserved interaction between a pair of proteins in an organ- ism which have interacting homologs in another organism [Walhout et al. 2000, Matthews et al. 2001]. The homologs are typically identified based on protein se- quence similarity. One of the organisms is considered the source (S), whereas the { } other the target (T ), and an interaction between a pair of proteins uS , vS from S { } is said to be conserved as uT , vT in T if the pair has a joint (geometric mean) se- quence similarity ≥80% or a joint BLAST E-value <10−70 [Yu et al. 2004]. However, many times this inference is not one-to-one, and one protein pair in a species might match multiple pairs in another; these issues would need to be suitably resolved; e.g., using genomic context [Wolf et al. 2001, Wozniak et al. 2011, Wozniak et al. 2014], and domain information [Bork et al. 1997, Nguyen et al. 2013]. The inferred =  interologs are then assembled into an interolog network GI VI , EI , in which = ∈ each node u (uS , uT ) VI represents a “supernode” of the orthologous proteins ∈ uS and uT , and each edge (u, v) EI between two supernodes corresponds to the interactions (uS , vS) and (uT , vT ) conserved between the two species S and T (see Figure 7.1). Dense subnetworks found within the interolog network are mapped back onto the original source and target PPI networks to identify conserved subnetworks between the two PPI networks. Conserved protein complexes are believed to be embedded among these dense subnetworks, which can be validated using bona fide complexes from the two species. The success of this procedure depends on how well the interologs are inferred, and how well subnetworks are identified from the interolog network.

Inferring Evolutionarily Conserved PPIs (Interologs) 7.1 Some early works on the prediction of protein function were based on the joint presence or absence of protein orthologs in different species [Pellegrini et al. 1999, 7.1 Inferring Evolutionarily Conserved PPIs (Interologs) 167

Orthologs

e e′

a a′ ′ c c

d a″ d′ b b′ Paralogs

PPI network PPI network from species S from species T

e|e′ Interologs c|c′ a|a′

a|a″ d|d′ b|b′ Interologs (orthology) network from species S and T

Figure 7.1 Constructing interolog networks. Conserved interactions (interologs) are identified by mapping orthologous proteins between source S and target T PPI networks. A protein in S can map to multiple proteins in T , in which case the inference of its interactions should be suitably resolved. For example, a from S maps to a and a from T . Here, interactions (a, b) and (a, c) from S are conserved in T as (a , b ), (a , b ), (a , c ) and (a , c ). The proteins a and a in T are paralogs.

Galperin and Koonin 2000]. This approach, called phylogenetic profiling, is built on the premise that proteins that function together are gained and lost together in evolution. This approach can also be used to infer interologs. The phylogenetic profile for a set of N proteins across M species is essentially an N × M matrix witha1atamatrix position if an ortholog is present for a protein in a particu- lar species, and a 0, otherwise. Two proteins with similar phylogenetic profiles can be predicted to be under evolutionary pressure (co-evolving) to maintain, for ex- ample, the interaction (interolog) between them across the species. Proteins with shared ancestory can be considered to be in the same or similar “orthogroups.” Dey et al. [2015] constructed human orthogroup phylogenetic (hOP) profiles for 168 Chapter 7 Identifying Evolutionarily Conserved Protein Complexes

19,973 human proteins using 177 eukaryotic species. Clustering these hOP pro- files identified over a thousand hOP modules of sizes 2–50. Enrichment analysis of these hOP modules revealed that proteins sharing phylogenetic profiles (co- evolving protein pairs) were enriched for protein complexes and pathways. The strongest co-evolving pairs (2,101 unique proteins) were significantly enriched for co-complex and co-pathway interactions involved in canonical signaling, immune response, and transcriptional control. This observation suggests a tendency for cellular networks with strong internal coupling (protein complexes and metabolic pathways) to form evolutionary modules over interlinked signaling and transcrip- tional pathways. The hOP profiles and hOP modules generated from the Dey et al. study are available from http://web.stanford.edu/group/meyerlab/hOPMAPServer/ index.html. While phylogenetic profiling strongly suggests evolutionary pressure and thus conservation of interactions, protein sequence similarity provides a more direct evidence for the conservation: If two interacting proteins in one species (source) have sequence-conserved orthologs in a second species (target), then their inter- action is most likely conserved in the second species. Yu et al. [2004] transfered interologs from C. elegans (worm), D. melanogaster (fly), and H. pylori (bacteria) (as source) to S. cerevisiae (yeast) (as target) based on joint protein sequence similarity ≥ −70 JI 80% and joint BLAST E-value JE<10 . Yu et al. used 410, 4,786, and 1,465 interactions from the three source organisms, respectively, to infer interologs and validated these against 8,250 interactions occurring between co-complexed pro- teins (from MIPS complexes) from S. cerevisae. The authors found that almost all ≥ interacting protein pairs having JI 80% in worm, fly, and bacteria tended to in- teract (and therefore be contained within the same complex) in yeast, whereas few

protein pairs interacted at JI <40%. These results indicate that pairs of proteins with high sequence similarity tend to conserve their interactions. The inferred in- terologs from this study are available at http://interolog.gersteinlab.org/. Vo et al. [2016] used their FissionNet network consisting of 2,278 high-quality interactions from S. pombe (see Chapter 2) to study conservation of interactions between the two yeasts S. pombe and S. cerevisiae, and H. sapiens. Protein sequence- based analysis revealed, not suprisingly, that the two yeasts are less divergent from each other than either yeast is from human and the two yeasts share a greater frac- tion of protein-coding genes than either yeast does with human, agreeing with ex- isting literature [Sipiczki 2000]. Interestingly, however, interaction-based analysis showed that while only ∼40% of S. pombe interactions are conserved in S. cerevisiae, ∼65% of S. pombe interactions are conserved in human. These results suggest that a large fraction of interactions are conserved between human and S. pombe but have 7.1 Inferring Evolutionarily Conserved PPIs (Interologs) 169 been lost specifically in the S. cerevisiae lineage. Furthermore, the differences in proteins between S. cerevisiae and S. pombe have been observed within protein com- plexes. For example, during the process of chromatin remodeling, histone variants replace the canonical histone subunits to change chromatin structure, and this re- placement requires an interaction between the histone variants and the chromatin remodeler inositol requiring protein 80 (INO80) complex. The INO80 complexes in S. cerevisiae and S. pombe replace the canonical histone with different histone vari- ants, thereby displaying a rewiring of interactions between the two species [Yamada and Bork 2009]. Das et al. [2013] constructed StressNet, a network of 235 high-quality protein interactions among proteins involved in stress response and cell signaling from S. pombe. Comparing this network to the interaction network from S. cerevisiae, the authors found that most of the interactions among these stress response and sig- naling proteins are not conserved between the two yeasts but are “rewired”; that is, orthologous proteins have different binding partners in both yeasts. In particu- lar, transient interactions connecting proteins in different functional modules are more likely to be rewired than conserved. Brown and Jurisica [2005] developed the Online Predicted Human Interaction Database (OPHID) to infer human PPIs using PPIs from C. elegans, D. melanogaster, M. musculus, and S. cerevisae. The inference is based on mapping human and model-organism orthologs using BLASTP and the reciprocal best-hit approach. Briefly, each model organism protein sequence was BLASTed against human pro- tein sequences, and each top-hit human sequence with an E-value <10−5 was BLASTed back against the set of model-organism protein sequences. If the top hit in the reverse direction (with E-value<10−5) matched the original query protein, the matching human protein was selected as a potential ortholog. A predicted hu- man interaction was added if both proteins in the model organism interaction were conserved in humans. These inferred interactions were scored using information from domain co-occurrence constructed from domains in InterPro (http://www.ebi .ac.uk/interpro/)[Quevillon et al. 2005], gene co-expression from GeneAtlas expres- sion dataset on 79 normal human tissues (http://www.ebi.ac.uk/gxa)[Kapusheksy et al. 2010], and GO-term similarity based on descriptive (lower level) GO terms (see Chapter 2); the final scored interologs are available at http://ophid.utoronto .ca/ (OPHID). The BIANA (Biologic Interactions and Network Analysis) Interolog Prediction Server (BIPS) (http://sbi.imim.es/BIPS.php)[Garcia-Garcia et al. 2012] offers a web interface to faciliate prediction of interologs using the BIANA framework [Garcia-Garcia et al. 2010]. Similar to OPHID, BIANA (BIPS) combines sequence 170 Chapter 7 Identifying Evolutionarily Conserved Protein Complexes

−70 similarity (BLAST alignments with JE < 10 ), domain-domain interactions (us- ing iPfam [Finn et al. 2005] and 3DID [Stein et al. 2011] databases), GO functional annotations [Ashburner et al. 2000], and also presence within the same cluster of orthologous genes (from eggNOG [Powell et al. 2012] and Ensembl [Flicek et al. 2011]) to predict interologs conserved between organisms including Arabidopsis, C. elegans, D. melanogaster, H. sapiens, M. musculus, and S. cerevisae. HomoMINT, developed by Persico et al. [2005], infers conserved PPIs in human using PPIs from E. coli, C. elegans, D. melanogaster, H. pylori, and S. cerevisae in a manner similar to the above methods. HomoMINT is integrated into the MINT database [Zanzoni et al. 2002, Chatr-Aryamontri et al. 2007], and is available at http://mint.bio.uniroma2.it/mint/release/main.php.

Identifying Conserved Complexes from Interolog Networks, I 7.2 Kelley et al. [2003, 2004] developed PATHBLAST (http://www.pathblast.org/) which performs a global alignment of PPI networks to predict conserved protein complexes and pathways between two species. The two networks from species S and T are aligned by identifying paths Path(S) and Path(T ) that allow for (i) direct matches: conserved interactions between the two paths; (ii) gaps: Path(T ) can skip over a protein in Path(S); and (iii) mismatches: a protein in Path(S) is dissimilar in sequence, E-value > 10−2, to the protein at the corresponding location in Path(T ); see Figure 7.2. The aligned paths are combined into a global alignment graph in which each aligned path P is assigned a log probability score Score(P ), p(v) q(e) Score(P ) = log + log , (7.1) 10 p 10 q v∈P random e∈P random

where p(v) is the probability of a true homology within the protein pair represented by vertex v, given its BLAST E-value, and q(e) is the probability that the PPI rep- resented by edge e is real, i.e., not a false positive. The background probabilities

prandom and qrandom are the expected values of p(v) and q(e)and are computed from global alignment graphs generated by permuting the protein names in the protein networks. For acyclic networks, the highest-scoring path of length L can be found in linear time using dynamic programming. Because the global alignment graph may contain cycles, Kelley et al. generated a large number, 5L!, of acyclic subnet- works by random removal of interactions from the global alignment network and then aggregated the results of identifying the highest-scoring path from each of the subnetworks. 7.2 Identifying Conserved Complexes from Interolog Networks, I 171

a a′ a|a′ Direct

b b′ b|b′ c

Gap d d′ d|d′

e e′ Mismatch

f f ′ f|f′ Path (S) Path (T)

Figure 7.2 The direct-gap-mismatch model [Kelley et al. 2003, Kelley et al. 2004] for global alignment of PPI networks from two species. While aligning paths in PPI networks from two species, the model either records a direct match, or skips over (a short number of) unmatched proteins to record a gap, or records a mismatch.

Global alignment between 14,489 interactions (among 4,688 proteins) from S. cerevisae and 1,465 interactions (among 732 proteins) from H. pylori resulted in a global alignment graph containing 2,036 edges (among 829 vertices), of which 7 edges were direct matches, 260 edges were gaps, and 1,769 edges were mismatches. As seen from these numbers, the conservation of direct interaction pairs between yeast and bacterial networks appear to be rare, which Kelley et al. attribute to low coverage and quality of interactions in the dataset at that time. But, this was still higher than expected from random yeast and bacterial networks (obtained by permuting protein names): 7 direct matches from real networks vs. 2.5 ± 1.9 from random networks. Moreover, the use of gaps and mismatches allowed for detection of longer conserved paths from the network even though the direct interactions per se were not conserved. Of course these paths could contain a large number of gaps and mismatches and may be irrelevant; but, interestingly, the best pathway- alignment score was still significantly higher than expected from random (8.1 vs. 6.1 ± 0.8, p = 0.006). In retrospect, it is possible that the lack of direct matches yet conservation of larger network regions is due to the reorganization of protein domains between yeast and bacteria—that is, because of multidomain proteins in yeast that subsume the domains of two or more bacterial proteins—but, pathways as a whole are generally conserved. 172 Chapter 7 Identifying Evolutionarily Conserved Protein Complexes

The top 150 highest-scoring paths in this graph covered 4.1% and 1.2% of the proteins in the H. pylori and S. cerevisiae PPI networks, respectively. Validation of these pathway alignments against the MIPS and The Institute for Genomic Re- search (http://www.tigr.org/; TIGR) [Pruitt 1998] databases identified conserved functions between the two species including rRNA transcription, protein synthe- sis, protein fate and targeting, cell envelope and nuclear transport, and proteolytic degradation. The alignment further showed that single pathways in bacteria often corresponded to multiple pathways in yeast, consistent with the model that yeast has undergone one or more whole-genome duplications relative to bacteria [De la Cruz et al. 1999, Kellis et al. 2004]. The alignment of the yeast network to itself re- sulted in a global alignment network of 1,389 direct matching edges (among 5,593 vertices). The top 300 highest scoring pathways involved alignments between pro- tein complexes known to be distinct but homologous in function, indicating that these aligned pathways represented paralogous structures (e.g., RNA polymerase II aligned with RNA polymerases I and III). Similarly, alignments were seen for the three AAA heteromeric complexes which belong to separate subcellular localiza- tions: cytosolic 26S proteasome (ATPase of the 19S regulatory particle of the 26S proteasome subunits 3–6), mitochondrial AAA protease complex (Afg3 and Yta12), and the Pex1/6 complex involved in protein disassembly. Apart from global alig- ment of networks, PATHBLAST was also used to query for specific pathways or complexes (as source structures) within networks (as the target). For example, query for the classic MAPK pathway (Ste11-Ste7-Kss1) in yeast identified two other MAPK pathways, Bck-Mkk1-Slt2 and Ssk2-Pbs2-Hog1. Sharan et al. [2005a] applied PATHBLAST to perform a three-way alignment of PPI networks from C. elegans, D. melanogaster, and S. cerevisiae, containing 3,926 (2,718 proteins), 20,720 (7,038), and 14,319 (4,389) interactions, respectively. Net- work alignment was performed to identify two types of conserved subnetwork struc- tures: short linear paths of interacting proteins, which model signal transduction pathways, and dense clusters of interactions, which model protein complexes. The search was guided by the reliabilities of individual protein interactions [Bader and Hogue 2002]. A log-likelihood ratio was used to compare the fit of a subnetwork to the desired structure (path or cluster) versus its likelihood given that each species’ interaction map was randomly constructed, under the assumption that: (i) in a real subnetwork, each interaction should be present independently within each PPI net- work with high probability, whereas (ii) in a random subnetwork, the probability of an interaction between any two proteins depends on their total number of connec- tions in the network. The network alignment identified 183 protein clusters and 240 paths at a significance level p<0.01, covering 649 proteins common to yeast, worm, and fly. The three-way comparison identified 200 yeast/worm, 835 yeast/fly, and 132 7.2 Identifying Conserved Complexes from Interolog Networks, I 173 worm/fly clusters. Comparison of these conserved clusters against complexes an- notated in MIPS showed that 94% of the clusters were “pure,” i.e., contained three or more annotated proteins and at least half of these shared the same annotation. Among the GO cellular processes represented by these conserved clusters included phosphorous metabolism (kinase signaling), intracellular transport, DNA metabo- lism, RNA metabolism, cell proliferation, and regulation of transcription. Singh et al. [2008] proposed IsoRank (http://groups.csail.mit.edu/cb/mna/), an approach based on simultaneous handling of protein and network similarities to perform global alignment of PPI networks. Given two PPI networks, IsoRank identifies a mapping between the nodes of the two networks that maximizes two objectives: (i) the size of the common subnetwork implied by the mapping; and (ii) the aggregate sequence similarity between nodes mapped to each other. Consider two PPI networks S and T , and let Rij be the similarity score for the protein pair (i, j)where i belongs to S and j belongs to T . To compute Rij , IsoRank pursues the intuition that (i, j) is a good match if the protein sequences of i and j align well and their respective neighbors are also a good match with each other. Rij is first initialized to reflect only the protein sequence similarity between i and j. IsoRank then computes the neighborhood scores in a recursive fashion, with the following equation to hold for all possible pairs (i, j):

R R = uv , i ∈ V(S), j ∈ V(T). (7.2) ij |N(u)| . |N(v)| u∈N(i) v∈N(j)

This equation requires that, for i and j to be deemed as matching, Rij should be equal to the total support provided by each of the |N(i)| . |N(j)| possible matches between the neighbors of i and j in the respective networks. In return, each node- | | | | pair (u, v) must distribute back its score Ruv equally among the N(u) . N(v) possible matches between its neighbors. This equation therefore captures the non- local influences on Rij : The score Rij depends on the scores for the neighbors of i and j, which, in turn, depend on the scores for the neighbors of neighbors, and so on. By applying the method to PPI networks from C. elegans (4,572 in- teractions), D. melanogaster (25,831), H. sapiens (36,387), M. musculus (255), and S. cerevisiae (31,899), Singh et al. identified 1,663 aligned interactions that were supported by at least two PPI networks, and 157 aligned interactions that were supported by at least three PPI networks. In comparison, the sequence-only map- ping produced 509 and 40 interactions that were supported by two and three PPI networks, respectively. The average functional coherence of the mapped pro- teins, using Gene Ontology terms describing molecular function and biological processes, was 0.220, which was comparable to the functional coherence obtained 174 Chapter 7 Identifying Evolutionarily Conserved Protein Complexes

from InParanoid (0.206), a database for orthologous and paralogous genes (http:// InParanoid.sbc.su.se)[O’Brien et al. 2005, Ostlund¨ et al. 2010]. NetAligner (http://netaligner.irbbarcelona.org), developed by Pache et al. [2012], works on a similar principle as PATHBLAST by mapping protein homologs between species using BLAST, and extending the initial alignment seeds (aligned edges) using gaps and mismatches to identify conserved connected components. These connected components are then ranked based on their statistical significance using a Monte Carlo permutation test. NetAligner is designed to generate visualizable outputs using visualization softwares including Cytoscape [Smoot et al. 2010]. Instead of using global alignment of networks, Sharan et al. [2005b] recast the problem of identifying protein complexes conserved between two PPI networks as the problem of identifying dense subnetworks from the interolog (orthology) net- work constructed from the two PPI networks. Interactions in this interolog network are weighted such that higher weights are assigned to interactions that occur within dense subnetworks, modeled as cliques, in both the PPI networks. Dense sub- networks from the interolog network are then identified using a heuristic-search procedure. Specifically, given a PPI network G =V , E, Sharan et al. defined two

models: a protein-complex model Mc, and a null model Mn. The Mc model assumes that every two proteins in a complex interact with a high probability β, independent of other interactions, and that protein complexes are embedded as cliques in the

network. On the other hand, the Mn model assumes that each interaction can oc- cur with a probability that one would expect if the interactions in the network were

randomly distributed. For a given protein pair (u, v), let Tuv denote the event that the two proteins interact, Fuv the event that they do not interact, and Ouv, the set of observations on u and v from, say, a reference dataset of protein interactions. There- | fore, one can estimate the probability P(Ouv Tuv) of the observations on this pair | given the two proteins interact, probability P(Ouv Fuv) of the observations that the proteins do not interact, and the prior probability P(Tuv) that two random proteins interact using the reference dataset or other biological information (e.g., GO-term similarity). Given a subset U ⊂ V , one can compute the likelihood of U under the

two models. Let OU be the collection of all observations on vertex pairs in U. Under the Mc model, we get,  | = | P(OU Mc) P(Ouv Mc) (u,v)∈U×U  = | | + | | (P (Ouv Tuv , Mc) . P(Tuv Mc) P(Ouv Fuv , Mc) . P(Fuv Mc)) (u,v)∈U×U  = | + − | (β . P(Ouv Tuv , Mc) (1 β) . P(Ouv Fuv , Mc)), (7.3) (u,v)∈U×U 7.2 Identifying Conserved Complexes from Interolog Networks, I 175 following the assumption that the interactions are independent. The null model assumes that G is drawn uniformly at random from the collection of all graphs whose degree sequence is close to that of the interaction graph. This induces a | probability puv for each protein pair (u, v). Therefore, the probability P(OU Mn) is computed as  | = | + − | P(OU Mn) (puv . P(Ouv Tuv , Mn) (1 puv) . P(Ouv Fuv , Mn)). (7.4) (u,v)∈U×U Then, the log likelihood ratio for the set of vertices U is P(O |M ) L(U) = log U c | P(OU Mn) (7.5) β . P(O |T , M ) + (1 − β) . P(O |F , M ) = log uv uv c uv uv c . p . P(O |T , M ) + (1 − p ) . P(O |F , M ) (u,v)∈U×U uv uv uv n uv uv uv n Finally, given the interaction networks of two species 1 and 2, consider two subsets U 1 ={u1,...,u1 } and V 2 ={v2,...,v2 }, and some mapping θ : U 1 → V 2 1 k1 1 k2 between them. Assuming that the interaction networks of the two species are independent of each other, the log likelihood ratio for these two sets is P(O1 |M1) P(O2 |M2) L(U 1, V 2) = log U c + log V c . (7.6) 1 | 1 2 | 2 P(OU Mn) P(OV Mn) However, this score does not take into account the extent of sequence conservation among the pairs of proteins associated by the mapping θ. Let Euv denote the BLAST ¯ E-value for the similarity between proteins u and v, and let huv and huv denote the events that u and v are orthologous or non-orthologous, respectively. Therefore, the likelihood ratio for (u, v) is P(E |M ) P(E |h ) uv c = uv uv . (7.7) | | + | ¯ ¯ P(Euv Mn) P(Euv huv)P (huv) P(Euv huv)P (huv)

P(huv|Euv) Using Bayes’ rule, the right-hand side simplifies to P (h) , where P (h) is the prior probability that the two proteins are orthologous (discussed further below). Therefore, the final score of U 1 and V 2 under θ is the sum of the two likelihood scores k | 1 P(hu1v2 Eu1v2) L (U 1, V 2) = L(U 1, V 2) + log i j i j . (7.8) θ P (h) i=1 2∈ 1 vj θ(ui ) Sharan et al. applied their approach to identify conserved protein complexes from yeast and bacterial PPI networks. The yeast network consisted of 14,848 in- teractions among 4,716 proteins, whereas the bacterial network consisted of 1,403 176 Chapter 7 Identifying Evolutionarily Conserved Protein Complexes

interactions among 732 proteins. Protein alignments were computed using BLAST, which gave 1,909 protein pairs with E-value below 0.01 (strong ortholog), of which 822 pairs had some observed interaction. After adding proteins pairs with weaker orthology, that is, those that did not pass the E-value cut-off but were immediate neighbors of protein pairs with strong orthology in the two networks, the final or- thology graph consisted of 866 nodes and 12,420 interactions. The probability of observing an interaction in the complex model was set in both yeast and bacte- ria to β=0.95, and the prior probability of a true interaction was set to 0.001. The prior probability P (h) that a yeast and a bacterial protein are orthologous is com- puted as the frequency of protein pairs from both species that are in the same COG (cluster of orthologous groups), which gave P (h)=0.0016. Based on these settings, 11 nonredundant conserved protein complexes with p-values<0.05, scored against empirical runs on randomized data, were identified. The complex sizes ranged from 6–20. Analysis of these complexes using MIPS functional categories showed con- servation for DNA metabolism, protein fate, protein synthesis, transcription, and cell envelop categories; this suggests conservation of core cellular processes be- tween the two species. For example, among the conserved complexes included the nuclear pore (NUP) complex which is integral to the eukaryotic nuclear membrane and serves to transport cargoes between the nucleas and cytoplasm; however, in the absence of a well-characterized nuclear envelop in bacteria, the ancestral prokary- otic NUP proteins may have less-evolved functions. Hirsh and Sharan [2007] developed a probablistic model to identify conserved protein complexes between two species by specifying the pattern of interactions in an unobserved common ancestor of the two species and probabilistically de- scribing the evolutionary events that might have yielded the observed complexes

in each of the species. Specifically, let P0 and P1 be the set of proteins in the two species 0 and 1, and P be the set of proteins in the common ancestor species. ∪ = = Let φ(.) be the mapping from Pi Pj to P , where φ(x) φ(y) for x y iff x and y are homologous. In other words, φ(x) is the ancestral protein from which x (in

species 0 or 1) originated. Given a conserved protein complex, let S0, S1, and S de- note the set of proteins comprising it in the species 0, species 1, and the common

ancestral species, respectively. The model assumes that the interactions of S0 and S1 have evolved from the ancestral interaction pattern. Let m be the number of pro- = tein pairs in the ancestral complex S. For each of these pairs pi (ai , bi) in S, let li ={ ∈ = = be the set of equivalent pairs in S0 and S1 under φ: li (x, y) S0 : φ(x) ai , φ(y) }∪{ ∈ = = } bi (x, y) S1 : φ(x) ai , φ(y) bi . Similar to the work by Sharan et al. [2005b] above, interactions in each of species

are specified under two models: the conserved complex model MC and the null 7.2 Identifying Conserved Complexes from Interolog Networks, I 177

model MN . Under the MC model, it is assumed that each interaction in li has evolved from pi independently of all other events; that is, interactions are attached with some probability PA and detached with probability PD. Moreover, for duplicated interactions, it is assumed that such duplications did not evolve from an ancestral protein pair, as the duplication is assumed to have happened after the speciation event. It is assumed that these duplicated interactions occur with some probability β, independent of other events.

For two proteins x and y, let Txy denote the event that these two proteins interact, ∈{ } and Fxy denote that event that they do not interact. Let Oxy 0, 1 denote the observation on whether x and y interact. Let OS denote the entire set of observations on complex S, and let DS denote the set of duplicated pairs in S. The likelihood of a set of observations on a conserved complex under this model MC is given by:

m  P(O O |M ) = P(O |M ) . P(O |M ) S0 , S1 C li C xy C , (7.9) i=1 (x y)∈D ∪D , S0 S1 where

P(O |M ) = P(O |T M )P (T |M ) + P(O |F M )P (F |M ) li C li ai ,bi , C ai ,bi C li ai ,bi , C ai ,bi C = βP(O |T M ) + ( − β)P(O |F M ) li ai ,bi , C 1 li ai ,bi , C (7.10) and   [O =0] = P(O |T M ) = P(O |T M ) = P xy ( − P )[Oxy 1] li ai ,bi , C xy ai ,bi , C D 1 D , x,y∈li x,y∈li  [O =1] = P(O |F M ) = P xy ( − P )[Oxy 0] li ai ,bi , C A 1 A . x,y∈li

The null model assumes that each interaction in the PPI networks of the two species is present with a probability that one would expect if the edges were ran- domly distributed, but with the vertex degrees respected. This gives the likelihood under the null model MN as   P(O O |M ) = P(O |M ) . P(O |M ) S0 , S1 N xy N xy N , (7.11) ∈ ∈ (x,y) S0 (x,y) S1 which is estimated using a Monte Carlo approach. The final likelihood score for the conserved complex is therefore given by: 178 Chapter 7 Identifying Evolutionarily Conserved Protein Complexes

L(O O ) S0 , S1 P(O , O |M ) = S0 S1 C P(O , O |M ) S0 S1 N (7.12)  [O =0] =  [O =1] = . xy − [Oxy 1] + − . xy − [Oxy 0] β x,y∈l PD (1 PD) (1 β) x,y∈l PA (1 PA) = i   i . ∈ P(O |M ) . ∈ P(O |M ) (x,y) S0 xy N (x,y) S1 xy N

Hirsh and Sharan applied their approach to predict conserved protein com- plexes from yeast and fly interaction networks. The yeast interaction network con- sisted of 15,147 interactions spanning 4,738 proteins, and the fly network consisted of 23,484 interactions spanning 7,165 proteins. Of these, 484 and 453 distinct pro- teins from yeast and fly, respectively, were orthologous at BLAST E-value ≤ 10−10. This generated an orthology graph of 890 nodes and 1,070 interactions. A range of

values from 0.01–0.1 were tested for probability PD that an interaction in a con- served complex is removed in an extant species; the algorithm showed similar

performace for the entire range. The probability PA that an interaction is intro- duced into a conserved complex was computed from PD assuming that the rate of interaction attachment within conserved complexes is equal to that of interac-

tion detachment, as is the case over the entire network. For PD=0.01, PA attained a value of ∼0.001. The probability β of observing an interaction in the complex model was set in both yeast and bacteria to β=0.8. These settings yielded 150 sig- nificant, nonredundant conserved clusters, ranging in sizes from 4 to 14, with an average size of 7. These clusters covered 227 proteins in yeast and 196 proteins in fly. Comparisons with MIPS complexes showed that 94 out of 150 clusters signifi- cantly matched some MIPS complex, with 78% of the clusters having an enriched GO annotation in yeast, and 43% enriched for fly annotations.

Identifying Conserved Complexes from Interolog Networks, II 7.3 Brown and Jurisca [2007] studied the conservation of PPIs and protein complexes from C. elegans (worm), D. melanogaster (fruit fly), H. sapiens (human), M. musculus (mouse), R. norvegicus (rat), and S. cerevisae (yeast). The authors found that evo- lutionary conservation of PPIs across these organisms is not uniform in several ways. First, highly connected components of the human PPI network are more con- served than the lower degree proteins, but as expected the proportion of proteins conserved decreases with evolutionary distance. However, this conservation also differs based on the functions of the proteins and network components. For exam- ple, overall, about 70% of human interactions are conserved in mouse. Of these, 78% 7.3 Identifying Conserved Complexes from Interolog Networks, II 179 of human kinases have an ortholog in mouse, 15% and 17% have orthologs in worm and fly, respectively, and only 6% have orthologs in yeast. In contrast, 70% of the human 26S proteasome subunits, and 44% of the human RNA polymerase compo- nents are conserved in yeast. Second, complexes are more conserved across larger evolutionary distances compared to random subsets of proteins. Among these, sta- ble complexes are more conserved than the transient ones. For example, stable complexes in human—including the 26S proteasome, 40S and 60S ribosomal pro- teins, eukaryotic initiation factor (eIF)-2 complexes, the origin recognition complex (ORC) and minichromosome maintenance (MCM) complexes—are also detected in multiple organisms and these are observed to be stable. Von Mering et al. [2002] had previously suggested that the yeast PPI network is enriched for evolutionary conserved proteins, and therefore it is likely that these proteins contribute to an abundance of stable and conserved complexes observed across organisms. Van Dam and Snel [2008] studied rewiring of protein complexes between H. sapiens (human) and S. cerevisae (yeast) by mapping bona fide protein complexes onto PPI networks for these species. The PPI networks were constructed using data from multiple high-throughput TAP/MS and Y2H experiments [Ito et al. 2000, Uetz et al. 2000, Gavin et al. 2002, Gavin et al. 2006, Krogan et al. 2006]. The complexes were taken from Reactome (391) [Vastrik et al. 2007] for human, and MIPS (217) [Mewes et al. 2006] and SGD (225) [Cherry et al. 2012] for yeast. These yielded 5,960 co-complexed PPIs for human, and 34,686 co-complexed PPIs for yeast. Of the 5,960 co-complexed human PPIs, 2,216 were absent in yeast because both proteins lacked an ortholog in yeast, and 1,828 co-complexed PPIs were disrupted because one of the two proteins lacked an ortholog. Of the remaining 1,916 PPIs, about 10% (or ∼192) were not co-purified in large-scale experiments. This implied that about 90% of 1,916 (or ∼1,724) co-complex PPIs in human, which had yeast orthologs and also belonged the same complex in yeast, were conserved. Based on this, the authors concluded that the evolutionary dynamics of protein complexes is not the result of extensive network rewiring, but mainly due to acquisition or loss of protein complex subunits. To illustrate this, the authors considered the example of the eukaryotic initiation factor (eIF3) complex (Figure 7.3): Although the eIF3 complex in human (green) has expanded compared to yeast (yellow), all yeast proteins are also part of the same complex in human (light green) and these have conserved their interactions. Modifications to the complex during evolution have mainly been through acquisition of new proteins (green). Current evolutionary models for the evolution of protein interaction networks and protein complexes are based on two types of genetic events: gene duplication and gene fusion [Yamada and Bork 2009]. Gene duplication events typically result 180 Chapter 7 Identifying Evolutionarily Conserved Protein Complexes

IF3C

IF31 IF36

IF39 HCR1 IF32

TIF34 IF3A PRT1 IF33 IF34 TIF35 RPG1 IF38

IF37 NIP1 IF35 Yeast Human

Figure 7.3 Evolution of the eukaryotic initiation factor (eIF3) complex from yeast to human [Van Dam and Snel 2008]. Van Dam and Snel [2008] hypothesized that the differences in protein complexes between yeast and human is not the result of extensive network rewiring, but rather by acquisition or loss of proteins within protein complexes, as in the case of the eIF3 complex. The yeast eIF3 complex is a subset of the human eIF3 complex.

in the acquiring of new protein subunits within protein networks and complexes. Here, if a gene encoding a protein within a complex is duplicated, initially the two identical genes will still form the same homogeneous protein complex. However, over time as the two genes diverge in sequence and become paralogs, the parologs could still form the same complex, but the complex could become exclusively het- erogeneous. This phenomenon is particularly observed in eukaryotic complexes, in which multiple gene duplication events often occur, resulting in paralogs of a single protein. For example, the eukaryotic 20S and 26S proteasome contains 14 paralogous genes [Fort et al. 2015]. Marsh and Teichmann [2014] found that the evo- lutionarily more recent paralogs tend to be more structurally flexible than the more ancient paralogs. The reason appears to be that increasing flexibility of the paralogs enables formation of complexes with more distinct subunits. In other words, the more distinct subunit types are within a complex, the more flexible those subunits tend to be, and the more flexible the complex tends to be [Marsh and Teichmann 2014]. In other cases, paralog switching can occur within complexes, whereby one paralog is replaced by another paralog under certain conditions. For example, Ori 7.3 Identifying Conserved Complexes from Interolog Networks, II 181 et al. [2016] found 16 instances of paralog switching (paralog pairs) that were co- regulated, yet were antagonistically incorporated into distinct variants of the same complex. Gene fusion events, on the other hand, reduce the number of protein subunits within complexes. Here, two previously separate genes fuse into one open reading frame. Gene fusion results in the formation of multidomain or “mosaic” proteins [Bjorklund¨ et al. 2005, Pasek et al. 2006]. There are now approximately 10,000 known gene fusions in the human genome alone [Mertens et al. 2015, Latysheva et al. 2016], with 90% of these discovered in the last five years using deep-sequencing technology. Interestingly, the parents of these fusion proteins occupy central po- sitions in protein interaction networks [Latysheva et al. 2016]. These gene fusion events commonly involve pairs of genes encoding subunits of the same protein complex, thereby providing a mechanism to shrink the protein complex without disrupting the overall structure or assembly of the complex [Marsh and Teichmann 2014]. Events such as gene fission wherein a single gene splits into two [Kummerfeld and Teichmann 2005], and gene loss wherein a genome can suffer the loss of a gene either due to an abrupt mutational event or as a consequence of a slow process of accumulation of mutations [Albalat and Canestro˜ 2016], can also result in changes to protein complexes; however, these mechanisms are far less common. Nguyen et al. [2013] noted that although several proteins are conserved by se- quence similarity between yeast and human (e.g., yeast RAD9 and human hRAD9), there are many other proteins that do not show any sequence conservation (e.g., breast cancer susceptibility gene BRCA1 in human) and yet perform core func- tions (e.g., cell cycle and DNA-damage response, DDR) that are conserved between yeast and human. These proteins retain conserved functional domains (e.g., the BRCT domain present in yeast RAD9 and human hRAD9 is also present in the non- conserved human proteins BRCA1 and 53BP1); all these proteins play vital roles in DNA-damage response [Bork et al. 1997]. Therefore, considering functional conser- vation by integrating domain similarity rather than only whole-sequence similarity is important to understanding conservation patterns of complexes [Jin et al. 2009]. Based on this, Nguyen et al. constructed interolog networks by matching of pro- tein domains between yeast and human, in addition to using sequence similarity, to study conservation of complexes between the two species. Clustering this interolog network identified about 22% more yeast and 11% more human complexes that had conserved functions (based on MIPS functional categories) compared to clustering the interolog network constructed from only sequence similarity. In particular, this procedure identified 79 yeast complexes that matched 219 human complexes (list available from http://sites.google.com/site/cocinhy/), with several yeast complexes 182 Chapter 7 Identifying Evolutionarily Conserved Protein Complexes

matching multiple human complexes and vice versa. For example, the transcription factor (TF) D (TFIID) complex from yeast matched the TFIID family of six complexes from human. Based on this, the authors suggested that complexes in higher-order organisms have undergone considerable reorganization, which has been effected through movement of proteins between complexes, duplication of proteins, fusion of multiple domains into one protein (mosaic proteins), and redistribution of do- mains between proteins, apart from mere gain or loss of proteins (Figure 7.4)[Bork et al. 1997, Lan and Pritchard 2016, Vandersluis et al. 2007]. These changes have resulted in considerable rewiring within and between complexes from lower-order to higher-order organisms. The formation of any multiprotein complex (with more than two proteins) is necessarily more complicated than a simple binary interaction. We can posit that protein complex assembly is likely to occur via an energetically favorable and or- dered assembly pathway as the odds of all protein subunits simultaneously coming together in the correct orientation to form a fully assembled complex would be infinitesimally small [Dill and Chan 1997]. Marsh et al. [2013] and Marsh and Te- ichmann [2014, 2015] approached the problem of identifying these pathways of protein-complex assembly by studying gene fusion events. As mentioned earlier, gene fusion occurs when two previously distinct genes become fused into a single open reading frame. Gene fusion is the most common mechanism by which mul- tidomain proteins acquire new domains in both bacteria and eukaryotes [Bjorklund¨ et al. 2005, Pasek et al. 2006]. Proteins encoded by different genes in one organism but fused together in another are likely to physically interact or at least be func- tionally related. Therefore, gene fusion provides a mechanism by which protein complex assembly pathways can be studied: (i) a fusion event can be compati- ble with and conserve the existing assembly; and (ii) alternatively, the event can disrupt the order of the assembly. To analyze assembly pathways, Marsh et al. employed nine heteromeric complexes with known structures from Protein Data Bank [Berman et al. 2000], and experimentally characterized assembly and disas- sembly pathways using nanoelectrospray ionization (nESI)-MS experiments. These complexes provided evidence for evolutionary selection of assembly-conserving gene fusion events (higher than expected by chance), however with some excep- tions wherein fusion events did in fact disrupt the assembly. Marsh et al. also noted that gene fusion events modulated assembly pathways through the evolu- tion. In particular, there is evolutionary selection for fusion events that optimized the assembly—for example, if events involving two subunits that share interaction partners result in fewer intermolecular surfaces in the fused complex, these events are selected. These events also optimise network topologies by reducing the num- 7.3 Identifying Conserved Complexes from Interolog Networks, II 183

g v

g b b1

mv b2

b1b2

m v b

y v yb

Protein complexes in species A

Protein complexes in species B

Figure 7.4 Reorganization of protein complexes from a lower-order species A to a higher-order species B identified by mapping of functional domains (labeled using small alphabets) between proteins (small circles). The following forms of domain conservation are likely to have occurred between the proteins of species A and B: (i) a protein and all its domains are conserved between the species (e.g., proteins carrying domains g and v); (ii) two likely paralogous proteins with similar domains are fused into one protein (e.g., proteins with domains b1 and b2 fused into a protein carrying both the domains, b1b2); (iii) two non- paralogous proteins carrying distinct domains are fused into one multidomain “mosaic” protein (e.g., proteins with domains b and y fused into one protein carrying both the domains, yb); and (iv) a multi-domain protein is split into multiple proteins, each carrying a subset of the domains (e.g., protein with domains mv split into two proteins, one carrying m and the other carrying v).

ber of discrete protein interactions, leading to conservation of complexed regions within the networks. Hopf et al. [2014] substantiated these observations by analyz- ing 3D structures of 76 protein interactions from E. coli in which the proteins within each pair had close genomic distance (e.g., an operon). Hopf et al. observed that 184 Chapter 7 Identifying Evolutionarily Conserved Protein Complexes

residues within these protein pairs tend to co-evolve due to interaction and func- tional constraints, and this “evolutionarily coupling” tends to conserve interaction patterns even within large complexes. Extending the work by Marsh et al., Ahnert et al. [2017] present a “periodic table” of protein complexes which lists complexes in quaternary structure space and provides a framework for understanding their evolution. Protein Complex Prediction in the Era of Systems Biology

The whole is greater than the sum of its parts. —Aristotle (384–322 BC), Greek philosopher and scientist

The increasing availability of resources for protein interactions and the ability to reconstruct complexes with better accuracy and coverage comes with the opportu- nity to study complexes at a systems level, i.e., how these complexes interact with one other and are organized into higher level pathways that drive cellular signaling 8and functions. Gaining a global view of protein complexes is important to enrich our understanding of cell biology and to decipher mechanisms underlying complex diseases. Several complex diseases including cancer are considered “systems or net- work diseases” [Hornberg et al. 2006, Barab´asi et al. 2011], and protein complexes constitute important components of this “disease network.” Protein complexes are the basic building blocks of cellular signaling and meta- bolic networks originating from a myriad of cross-talking molecular relationships. These relationships are formed between proteins across complexes that interact to transduce signals or to share and exchange metabolites [Pawson and Nash 2000]. Many of these relationships are transient in time and space, and determine the dynamic behavior of the networks.

Constructing the Network of Protein Complexes 8.1 A complexosome or complexome network is a network of the entire complement of protein complexes in an organism. Also referred to as complex-complex interaction (CCI) network, the complexome network enhances our understanding of how pro- tein complexes are organized, and how they communicate and function together 186 Chapter 8 Protein Complex Prediction in the Era of Systems Biology

in cells. Existing studies have characterized interactions in CCI networks based on the number and kind of interactions between proteins from different complexes— as densely or weakly connected, and as transient or stable interactions between the protein complexes [Clancy and Hovig 2014]. These have enabled global analysis of protein complexes based on their interactions or interconnectivities. Li et al. [2011] characterized interactions between complexes in terms of the number of shared protein members where each complex is represented as a node, and two nodes are connected by an edge weighted by the number of shared pro- tein members between the two complexes represented by the nodes. The resulting complexesome network showed scale-free properties. The clustering of biological processes within the network was non-random, thus indicating that complexes per- forming similar functions closely interact and are clustered together topologically. Several functional relationships were inferred from the CCI network, and in particu- lar Li et al. assigned putative functions to 44 previously uncharacterized complexes using “guilt-by-association” (GBA). GBA refers to the principle that genes or pro- teins with related functions tend to share properties such as genetic or physical interactions [Oliver 2000]. Wang et al. [2009] developed a unified network called “ComplexNet” by char- acterizing complex-complex and complex-protein interactions. Wang et al. used complexes from MIPS, SGD, and their own hand-curated set to build a list of 543 yeast complexes containing 1,100 unique proteins with an average of 4.9 proteins per complex. These complexes were integrated using the purification enrichment- scored “Consolidated network” [Collins et al. 2007] (Chapter 2). For a pair of com- plexes, a CCI was predicted based on the number of high-confidence interactions between protein members of the two complexes. Two complexes were considered to interact if the enrichment of reliable interactions was more than 20 standard devia- tions above the expected (computed under the condition that reliable interactions were randomly distributed). This gave a CCI network of 131 interactions between 421 complexes, which was called “ComplexNet,” whereas the interactions between the remaining complexes was treated as unknown. Analysis of ComplexNet re- vealed previously characterized cross-talks between complexes (e.g., between hi- stones and several chromatin modifiers including the SWI1 (switching deficient), HAT1 (histone acetyltransferase), and RSC (chromatin structure remodeling) com- plexes), and cross-talks that were previously unknown (e.g., the complex {Yer071c, Yir003w} with {Cap1, Cap2}). These novel interactions were confirmed by high- throughput fluorescence microscopy. Sprinzak et al. [2006] used logistic regression (LR) trained on features from inter- acting and non-interacting protein pairs, to infer stable protein interactions within 8.1 Constructing the Network of Protein Complexes 187 complexes and transient interactions between complexes. The interacting pairs in- cluded 8,695 experimentally validated pairs, of which 1,466 were intra-complex, and 7,229 were inter-complex interactions. The non-interacting pairs included 5,217,000 randomly generated pairs between 17,800 intra-complex protein pairs and 5,199,200 inter-complex pairs. The features for the training included: domain- domain interaction patterns, co-localization and shared cellular process terms, mRNA co-expression, and phylogenetic profiles of protein interactions within and between bona fide complexes. The LR was validated using five-fold cross validation (80% of the pairs for training and 20% for testing). At 50% specificity, similar to that of genome-wide Y2H experiments, the LR achieved 26% sensitivity, which is more than two-fold higher than Y2H, in predicting intra- and inter-complex interactions. Among the correctly predicted inter-complex interactions were those between the coat protein complex COPII and the soluble N-ethylmaleimide-sensitive fusion at- tachment receptor (SNARE) complexes t-SNARE and v-SNARE, which play a role in cellular vesicle transport. In a similar manner, Clancy et al. [2013] used a statistical approach to detect CCIs in human and yeast complexomes based on the number of observed inter- actions between protein pairs of expertly curated complexes vs. the interactions expected to be present in randomized pairs. Using 1,216 human complexes in CO- RUM and 471 yeast complexes in CYC2008 databases, this resulted in a CCI network with 375 and 428 interactions in human and yeast, respectively. Clustering these CCIs revealed 96 different communities including 15 “meta-communities” of com- plexes. These communities showed higher than random enrichment for similar GO terms within the communities. These communities represent functional and spatial modularity within cells, and demonstrate an organized structure of the pro- teome across multiple levels. Malovannaya et al. [2011] present a complex-complex interaction (CCI) resource EPICOME (http://www.epicome.org/) that hosts information on the analysis of the human endogenous complexome, based on >3,000 immunoprecipitation experi- ments that were sequenced with mass spectrometry. The web portal allows users to query CCI networks of interacting proteins grouped in minimal endogenous modules (MEMOs). MEMOs represent complexes of proteins with stoichiomet- ric interdependence across the entire IP/MS dataset and serve as the conceptual building blocks of the protein complexome. Each of the human protein-coding gene products is assigned to one MEMO only. MEMOs are then used to reconsti- tute all distinct protein complex “isoforms,” which are called unique cores (uni- COREs). Higher-order, weaker, and non-stoichiometric associations between dif- ferent MEMOs/uniCOREs form the complex-complex interaction (CCI) networks. 188 Chapter 8 Protein Complex Prediction in the Era of Systems Biology

Integration of structural information with PPI networks has helped to under- stand the structural basis for inter- and intra-complex interactions. For example, Kim et al. [2006, 2008] developed Structural Interaction Network (SIN) for yeast by integrating 3D structure and PPI network data. In SIN, the authors used 3D structures to distinguish the interfaces of each interaction: If two or more proteins interact with a common partner protein in the PPI network, and if they use the same surface of the partner, the interactions are classified mutually exclusive; conversely, if they use different interfaces, the interactions are simultaneously possible. The SIN constructed this way consisted of 873 proteins and 1,269 interactions, of which 438 were mutually exclusive. The SIN contained 147 known protein complexes. Singlish-interface hubs tend to use the same surface to interact with multiple part- ners, and are more central to the network, thus participating in multiple complexes (inter-complex interactions). On the other hand, multi-interface hubs correspond to central members within protein complexes. Based on this, the authors added a “layer of structural information” to hubs within and between complexes. Transient interactions between complexes have also been studied using motion dynamics of interacting surfaces between the complexes. For example, Bhardwaj et al. [2011] proposed the Dynamic Sructural Interaction Network (DynaSIN) which in- tegrates motion dynamics with SIN for the yeast complexosome. DynaSIN analysis suggested that the degree of conformational change that a protein can undertake (its flexibility) to interact with partners depends on the number of its interaction surfaces. So, multi-surface hubs are highly flexible and use different surfaces to interact with different partners. Moreover, permanent interactions involve larger structural surfaces than do transient interactions, and permanent interactions are likely to occur within complexes. Transient interfaces interact with fewer classes of proteins, and occur in inter-complex interactions. In another interesting study [Harding and Hancock 2008a, Harding and Han- cock 2008b], it was demonstrated that clusters of rat-sarcoma-viral-oncogene (RAS) complexes at the membrane operate like a “switch” to activate the mitogen- activated-protein-kinase (MAPK) pathway, converting ligand-receptor complex in- puts into a digital-like output. Similarly, extracellular-signal-regulated kinase (ERK) complexes were found to behave as multiple distributed messengers, interact- ing with other complexes to carry out tightly regulated cellular decisions [Von Kriegsheim et al. 2008]. Other studies have combined spatial organization to un- derstand the dynamics of complexes during T-cell signaling [Dustin and Groves 2012], the shuttling of the transcription machinery to and from the nucleus and cytosol [Bauer et al. 2011], and MAPK signaling [Tian and Song 2012]. 8.2 Identifying Protein Complexes Across Phenotypes 189

The above-mentioned advances are now gradually paving the way to model and simulate the behavior of protein complexes within the CCI network at a systems level. This is effected through the development of rule-based modeling method- ologies, and by using modeling languages such as Kappa [Danos and Laneve 2004] and BioNetGen [Blinov et al. 2004]. The rule-based methodology was recently ap- plied to understand neuronal complexes in yeast [Sorokina et al. 2011] and human [Von Eichborn et al. 2013]. These investigations have helped delineate the dynamic rewiring of complexosomes, and have thereby enabled discovery of novel disease mechanisms and drug targets, as we shall see in the following sections.

Identifying Protein Complexes Across Phenotypes 8.2 The phenome is the set of all phenotypes expressed by a cell or organism. Early studies on S. cerevisiae and C. elegans mutants generated using gene knockouts and RNA-mediated interference found that genes associating with common phenotypic profiles tended to belong to the same or similar complexes [Dudley et al. 2005, Sonnichsen¨ et al. 2005]. For example, Dudley et al. [2005] measured the growth fitness of over 4,700 yeast mutants under 21 growth conditions and found that genes that were grouped by common phenotypic profiles formed gene modules with similar functions. Specifically, the phenotypic profile for a gene g ∈ G, from a set G of genes, is represented as a vector X(g) with each element assigneda1ifthe deletion strain did not grow, ora0ifitdid. The phenotype similarity between two

genes g1 and g2 is measured as the cosine of the angle between the two phenotype X(g1).X(g2) vectors, and is computed as cos θ =    . The resulting similarity matrix X(g1) X(g2) between the |G|×|G| genes is clustered to obtain gene modules. Dudley et al. found that these gene modules significantly overlapped with 107 of the 266 protein complexes from MIPS [Mewes et al. 2006], and 132 of the 232 protein complexes from Gavin et al. [2006]. This meant that protein complexes tend to contain proteins showing similar phenotypic profiles and, in particular, proteins whose knockouts have at least one growth defect in common. Further interrogation of these protein complexes showed that 14 MIPS and 23 Gavin et al. complexes included synthetic lethal (SL) interactions between the protein complex members. In a SL interaction between two genes, the knockout of both genes results in cellular or organismal lethality, although the knockout of either one of the genes alone does not affect cell viability [Bridges 1922, Boone et al. 2007, Srihari et al. 2015b, Costanzo et al. 2010, 2016]. For example, two of the complexes identified were the Polymerase Associated Factor (PAF) complex {Cdc73, Paf1, Leo1, Rtf1} and the Sap30 histone 190 Chapter 8 Protein Complex Prediction in the Era of Systems Biology

deacetylase complex {Sap30, Cti6, Dep1, Pho23, Ume1, Ume6}, with inter-complex SL interactions between Cdc73 and Sap30, Dep1 and Pho23. During RNAi screening of ∼19,000 genes for possible roles in the embryonic development of C. elegans, Sonnichsen¨ et al. [2005] found that genes associated with distinct embryonic phenotypes and developmental events were clustered into distinctive biochemical pathways. In particular, common components of protein complexes including those from the anaphase-promoting complex, minichromo- some maintenance, nuclear pores, and ribosomes were associated with distinct common phenotypes. In a comparison against 20 different predictors spanning gene co-expression, phylogenetic profiles, physical and genetic interactions, and co-complex member- ships from 4,200 yeast strains across 82 conditions, Fraser and Plotkin [2007] found that protein complexes were the best predictors of phenotypes in yeast strains. These findings were reconfirmed in an independent large-scale analysis by Michaut et al. [2011] who used genetic knockout data from yeast spanning 191,890 genetic interactions (GIs) between 4,415 genes [Costanzo et al. 2010, Boone et al. 2007]. GIs are defined between pairs of genes whose co-perturbations alter the phenotype of the cell or organism (measured against a pool of normal phenotypes) [Beyer et al. 2007]. GIs are either positive, i.e., the co-perturbation of both genes in the pair en- hances a particular phenotype, or are negative, i.e., the co-perturbation subdues the phenotype. For example, synthetic lethality is a negative GI associated with lethality or severe loss of fitness, whereas synthetic viability is a positive GI associated with enhanced survival. Cellular processes were found to be predominantly “monochro- matic,” i.e., they are made up of predominantly positive or predominantly negative GIs [Segr´e et al. 2004]. Michaut et al. found that protein complexes could explain most of these monochromatic GIs, i.e., most complexes tended to be either en- tirely positive or entirely negative, and these covered 98% of all monochromatic GIs in yeast. Lu et al. [2013] employed the idea of monochromatic GIs within complexes to predict new GIs between proteins in yeast and human. Specifically, Lu et al. considered two scenarios (Figure 8.1). In the first scenario, proteins A and B have an asymmetric functional relationship, where the function of A depends on B but not vice versa. This asymmetry is due to the presence of a third protein C, which can compensate for a mutant of A (functional bufferring). In this case, A and C are predicted to have a negative GI (Figure 8.1(a)). For example, the asymmetry in Cln1p-Cdc28p complex, where the function of the cyclin Cln1p requires the kinase Cdc28p to transcriptionally activate the CLN1 gene; on the other hand, the function of Cdc28 does not depend on Cln1p, as the presence of Cln2p, when 8.2 Identifying Protein Complexes Across Phenotypes 191

C

C

A BAB (a) Scenario 1 (b) Scenario 2

Figure 8.1 Two scenarios for predicting negative genetic interactions (GIs) [Lu et al. 2013]. In the first scenario, proteins A and B have an asymmetric functional relationship, where the function of A depends on B but not vice versa. This asymmetry is due to the presence of a third protein C, which can compensate for a mutant of A. In the second scenario, enzymes A and B are involved in a branched metabolic pathway where the asymmetry is due to enzyme C which can catalyze the reaction in the absence of A. In both scenarios, a negative GI is predicted between A and C.

CLN2 is transcriptionally activated, can compensate for the absence of Cln1p to activate Cdc28p [Lu et al. 2013, Richardson et al. 1989]. However, the deletion of both CLN1 and CLN2 genes reduces the fitness of the cell [Mani et al. 2008]. Therefore, Cln1p and Cln2p have a negative GI. In the second scenario, enzymes A and B are involved in a branched metabolic pathway where the asymmetry is due to enzyme C which can catalyze the reaction in the absence of A. Here, again, a negative GI is predicted between A and C (Figure 8.1(b)). For example, in amino- acid biosynthesis pathway in S. cerevisiae, although there is frequent absence of the enzyme homoserine kinase (thrB), the presence of aspartate semialdehyde dehydrogenase (asd) maintains the synthesis of lysine, methionine, and threonine [Notebart et al. 2009]. The Lu et al. approach predicted that 75% of the protein complexes from S. cerevisiae contain negative GIs (monochromatic GIs). Diss et al. [2013] used genetic perturbations within complexes to study molec- ular mechanisms underlying genotype-phenotype relationships. Using one non- essential complex, the retromer complex, and one large essential complex, the nuclear pore complex (NPC) in yeast, Diss et al. found that protein complexes tended to restructure their PPI architecture in response to perturbations. In par- ticular, the authors found that, in the case of the retromer complex, which shows only positive GIs, the complex failed to assemble upon any genetic perturbation. On the other hand, the NPC complex, which shows mostly negative GIs, globally rearranged its interactions. Next, Diss et al. asked how deletion of gene A affects the PPI between proteins B and C if A has a GI with B and/or C. The authors found a 192 Chapter 8 Protein Complex Prediction in the Era of Systems Biology

significant association between GI scores and the perturbation of PPIs, with asso- ciation between negative GIs and PPIs stronger compared to that between positive GIs and PPIs. This indicated that a negative GI reflects the ability of one gene (here B and/or C) to buffer the loss of the other (A) by restructuring its PPIs, thus revealing a link between GIs and perturbation of PPIs.

Identifying Protein Complexes in Diseases 8.3 The central principle of network medicine [Barab´asi et al. 2011, Pawson and Linding 2008, Goh et al. 2007, Feldman et al. 2008, Zhou et al. 2014] states that a disease phenotype is rarely a consequence of an abnormality in a single gene product (pro- tein) but instead is a result of interactions between multiple cellular (pathological) processes as a “complex-disease network.” This emergent property of the interac- tion network implies that the impact of a specific genetic defect is not restricted to the gene product that harbors the defect, but can spread along the interactions of the network and alter the activity of gene products that otherwise harbor no defects. Therefore, understanding the network context of any genetic defect is essential in determining the phenotypic impact of the defect [Barab´asi et al. 2011, Schadt et al. 2009]. Over the years, a large number of studies (see Barab´asi et al. [2011] for an excellent review) have sought to delineate this disease network (i.e., the network of genes under a disease condition), for example, by studying network character- istics of disease genes that distinguish them from other genes [Jonsson and Bates 2006, Xu and Li 2006, Wu et al. 2008], or by quantifying network characteristics that differentiate disease phenotypes from normal phenotypes or between differ- ent disease (cancer) phenotypes [Chuang et al. 2007, Hofree et al. 2013, Wu et al. 2010]. The genesis of complex diseases, including cancer, is now viewed as a com- binatorial disease mechanism wherein defects in multiple genes interact to confer disease phenotypes. Consequently, complex diseases are now termed as “network or systems diseases” [Hornberg et al. 2006], and tackling these diseases require a combinatorial approach wherein combinations of genes need to be targeted (simul- taneously or in some order) to achieve the desired redemption—a concept termed “network approach to medicine” or network pharmacology [Barab´asi et al. 2011, Csermely et al. 2013, Poornima et al. 2016]. An in-depth review of all approaches under the network medicine paradigm is beyond the scope of this book, and readers are referred to excellent reviews on the topic [Barab´asi et al. 2011, Pawson and Linding 2008, Zanzoni et al. 2009, Chan and Loscalzo 2012]. Here, we briefly cover some important (computational) approaches developed over the last few years in network medicine, but we discuss 8.3 Identifying Protein Complexes in Diseases 193 in more depth how complexes form a central part of the disease network and how the ability to predict complexes (from disease networks) has further enhanced our understanding of diseases. The earliest references to network medicine [Barab´asi 2007, Lim et al. 2006, Erg¨unet al. 2007, Lemberger 2007] suggested that to understand disease mecha- nisms it is not sufficient to know the precise list of “disease genes,” but instead we should try to map out the detailed wiring diagram of the various cellular com- ponents influenced by these genes and their gene products (of note, it is rather funny to talk today about disease genes within double quotes). These studies further suggested that it is necessary to view diseases as the breakdown of selected “gene modules” rather than single or small groups of genes. Genes within these modules give rise to different paths to disease-inducing mechanisms, and therefore often many genes are linked to the same or similar disease phenotype. Consequently, the genes associated with different diseases can also overlap, and so the phenotypic effects of one disease can potentially impact other diseases via these overlapping sets of genes. The systematic mapping of these disease dependencies has led to the concept of diseaseome or diseasome wherein nodes represent diseases, and the connections between the nodes represent the overlap between the disease genes from different diseases. In the disease network, the effects of drugs are not limited to the molecules to which they bind, but instead these effects can spread across the entire network. Therefore, the therapeutic effect of a drug is related to the network neighborhood of the disease gene(s) targeted by the drug: The smaller the network neighborhood, the higher is the therapeutic efficacy of the drug (higher drug specificity) [Guney et al. 2015]. Likewise, any side effect of the drug could also be inherently a network phenomena [Barab´asi et al. 2011]. But corollarily, by appropriately exploiting the “spread” of this drug effect in the network, it is possible to (simultaneously or in a certain order) target multiple regions of the network, and thereby decimate disease mechanisms or counter emergence of drug resistance more effectively. Moreover, the drugs targeting a set of genes to treat a certain disease could be repurposed to treat other diseases driven by the same set or closely related set of genes in the diseasome. About 10% of human genes have a known disease association (OMIM) [Hamosh et al. 2005], and it is important to understand whether and what properties these disease genes have in the disease network that distinguish them from other genes. Specifically, some of the questions that arise include: Are disease genes located randomly or do they belong to distinguishable “disease regions” in the network? Do the topological properties of these genes correlate with their disease-associated 194 Chapter 8 Protein Complex Prediction in the Era of Systems Biology

properties? Do disease genes work together to form disease modules? And do these disease genes and their modules interact with other genes? Evidence from model organisms indicates that essential genes tend to encode “global hub” proteins and are located functionally central within protein interac- tion networks [Jeong et al. 2001, Ning et al. 2010, He and Zhang 2006]. Moreover, particularly in humans, it has been suggested that disease genes tend to have a higher number of PPIs than non-disease genes [Cai et al. 2010]. However, whether disease genes tend to be enriched with essential genes remains unclear. Goh et al. [2007] studied the relationship between essential genes, hubs, and disease genes in human disease PPI networks, and suggested that: (i) essential genes have a strong tendency to encode hubs or be associated with hubs, and these are expressed in multiple tissues and are located at the functional center of the network; however, (ii.a) most disease genes are not essential genes, and (ii.b) these non-essential dis- ease genes generally do not encode hubs. Disease genes tend to be tissue-specific and are located at the functional periphery of the network. However, importantly, although disease genes do not encode global hubs as the essential genes do, these disease genes locally influence the genes around them (“local hubs”), thereby giv- ing them a tendency to form “disease modules.” These disease modules correspond to specific biochemical processes through which the disease manifests. These pro- cesses can be broken down into specific complexes and functional modules (con- stituting specific disease pathways; for example, in cancer [Hanahan and Weinberg 2010, Hanahan and Weinberg 2011]) that give rise to the disease phenotypes.

Identifying Disease-Associated Protein Complexes Observations from studies (including the ones discussed earlier) involving model organisms, wherein mutations in single genes that interact or are part of the same protein complex result in similar phenotypes, have also been observed in humans [Goh et al. 2007, Gandhi et al. 2006]. For example, this has been observed in various inherited ataxias [Lim et al. 2006], dysplasia, and epilepsy [Oti et al. 2006]. These findings therefore hint at a widespread association between protein complexes and human diseases in a manner such that defects in several proteins, individually or in combination, can cause overlapping phenotypes. To map this association between complexes and disease-phenotypes, Lage et al. [2007] built a “phenome-interactome network” in which complexes identified by clustering a high-confidence human PPI network were linked to diseases in OMIM [Hamosh et al. 2005] using text mining. The PPI network consisted of ∼62,000 high-confidence interactions among ∼8,500 proteins. This resulted in 506 disease- associated complexes with at least one protein member (disease-causing protein) in 8.3 Identifying Protein Complexes in Diseases 195 each complex linked to a disease phenotype in OMIM. The complexes were ranked based on the phenotypes assigned to their members—complexes with greater frac- tion of disease-associated proteins were ranked higher. A Bayesian predictor was then trained to rank disease-causing proteins using collective evidence from the complexes, and the predictor was validated using five-fold cross-validation on a total of 1,404 links containing on average 109 candidate proteins. The approach predicted a total of 113 candidate proteins associated with 91 diseases, with some of the proteins being previously uncharacterized. For example, the predictor associ- ated a novel gene locus LOC130951 with a genetic group of disorders called Retinitis pigmentosa. The LOC130951 protein product is evolutionarily conserved and in- teracts with cone-rod homeobox (CRX), a homeobox transcription factor known to be associated with Retinitis pigmentosa and cone-rod dystrophy; therefore, the novel protein LOC130951 was suggested to be a suitable candidate for further in- vestigation. Similarly, receptor-interacting serine/threonine protein kinase (RIPK1) belonging to the complex {RIPK1, tumor necrosis factor receptor 2 (TNFRSF1B), tumor necrosis factor precursor (TNF), tumor necrosis factor receptor precursor (TNFRSF1A)}, was suggested as a candidate for investigation in inflammatory bowel disease. Vanunu et al. [2010] took a similar approach to prioritise disease-associated complexes, but used a network-propagation strategy via the PPI network to estab- lish association links between complexes and OMIM diseases. In their method called PRIoritizatioN and Complex Elucidation (PRINCE), a human PPI network G =V , E consisting of |V |=9,998 proteins and |E|=41,072 interactions was constructed. Given a query protein q with known association to a disease, the goal was to prioritize all proteins v ∈ V (that are not known to be associated with q) with respect to q. For this, Vanunu et al. adopted a network-propagation approach, us- ing a smoothing neigborhood function F(v). If the association F (u), u ∈ N(v), with q is known, then F(v)is computed as follows: F(v)= α . F (u) . w(v, u) + (1 − α) . Y(v), (8.1) u∈N(v) where w is the weighting function for the interactions, Y : V → [0, 1] represents a prior knowledge function which assigns positive values to proteins that are known to be associated with q and zero otherwise, and α is used to control the relative importance of the two parts of the equation. Known disease-protein associations were extracted from GeneCards [Safran et al. 2002, Safran et al. 2010], giving 1,369 diseases and 1,043 proteins. Complexes that contained high-scoring disease pro- teins were then predicted as follows. The top 100 scoring proteins with respect 196 Chapter 8 Protein Complex Prediction in the Era of Systems Biology

to the known disease proteins within the entire network were assigned as com- plex seeds. To each seed, a neighboring protein with the highest score, and up to 20 proteins per seed, was added. Next, the complexes were refined by either removing complexes with fewer than 4 proteins, or by merging complexes that over- lapped more than 50%. This process resulted in 566 disease-associated complexes, of which about 61% contained on average about 3.6 proteins per complex with evi- dence for disease associations. For example, the top predictions for prostate cancer included the breast cancer susceptibility gene 1 (BRCA1), tumor protein 53 (TP53) and nijmegen breakage syndrome 1 (NBS1), which are tumor suppressors known to be involved in the disease. PRINCE was also able to predict novel associations be- tween excision repair cross-complementation 5 and 2 (ERCC5 and ERCC2) proteins and Microcephalic Osteodysplatic Primordial Dwarfism (MOPD). Chen et al. [2014] took a similar approach to Vanunu et al. by associating complexes to diseases using information-flow maximization via the PPI network. Their approach identified ten complexes which included ataxia telangiectasia mutated (ATM), BRCA1, cadherin 1 (CDH1), RAD51 recombinase (RAD51), and TP53 with known associations with breast cancer. Yang et al. [2011] developed a random-walk based method, RWPCN (Ran- dom Walker on Protein Complex Network), for predicting disease genes using a complexosome-phenome network constructed from human protein complexes and disease phenotypes. RWPCN begins by constructing a phenotype network, which represents similarities between phenotypes in OMIM based on Medical Sub- ject Headings (MeSH). The 2009 version of MeSH contains 25,186 concepts, which are arranged as a hierarchical tree. Locations on the tree carry systematic labels known as tree numbers. For example, the concept “Digestive System Neoplasms” has the tree numbers C06.301 and C04.588.274, where C stands for Diseases, C06 for Digestive System Diseases, and C06.301 for Digestive System Neoplasms; C04 for Neoplasms, C04.588 for Neoplasms by Site, and C04.588.274 for Digestive System Neoplasms. MeSH-based similarity computation works by representing each phenotype pt ∈ = ≤ ≤ PT as a feature vector pt (x1, x2,...,xl), where each dimension 1 i l in the vector represents a MeSH concept, and each feature value xi is the weight th = of the i concept. Given two phenotype feature vectors pti (pti1, pti2,...,ptil) = and ptj (ptj1, ptj2,...,ptjl), the phenotype similarity is measured using cosine similarity

l pt . pt = k=1 ik jk sim(pti , ptj ) . (8.2) l 2 . l 2 k=1 ptik k=1 ptjk 8.3 Identifying Protein Complexes in Diseases 197

Phenotype similarity values in the range [0,0.3] are believed to be uninforma- tive and noisy, while those in [0.6,1] are considered to be reliable [Van Driel = et al. 2006]. Therefore, phenotype similarities are rescaled as L(sim(pti , ptj )) . + 1/(1 + ec sim(pti ,ptj ) d) using c =−15 and d = log 9999. For each query phenotype pti, a neighborhood phenotype set N(pti) is created using the top-k similar dis- ease phenotypes. Let dis(pti) be the set of causative genes for the phenotype pti, as annotated in OMIM. A seed-disease gene set with respect to pti is created as S =∪ dis(pt ) s ∈ S ptj ∈N(pti) j , and for each disease gene ,aseed score is assigned, = seed_score(s, pti) L(sim(ptj , pti)). (8.3) ptj ∈N(pti),s∈dis(ptj ) Next, protein complexes C are identified from an input PPI network G =V , E by network clustering. For each protein complex C ∈ C, a phenotype score with respect to pti is assigned, = F0(C, pti) density(C) . seed_score(s, pti), (8.4) s∈C where density(C) is the density of the complex C with respect to the network G. Alternatively, one could use known complexes from any reference database such as CORUM, by setting density(C) to 1. All complexes are scored in this manner with respect to a query disease phenotype where these complexes contain genes annotated as causative for the phenotype. Next, a complex-to-complex (C2C) network is built using protein complexes as “super nodes,” as follows. For each complex C, the super node for C is represented =  as C VC , EC , where VC is the set of all proteins in C, and EC is the set of all

PPIs among the proteins in VC . Given two super nodes A and B , a C2C link is quantified as u∈V ,v∈V ,u,v ∈V ∩V I(u, v) E (A , B ) = A B A B , (8.5) C2C | |−| ∩ | | |−| ∩ | ( VA VA VB ) . ( VB VA VB ) where I(u, v) = 1, if (u, v) ∈ E, and 0, otherwise. The score assigned to a complex is then propagated through the C2C network to neighboring complexes, using a flow propagation approach, using steps r such that = − . + . ≥ Fr (1 α) WC2CFr−1 α F0 (r 2), (8.6) = F1 F0, where WC2C is the column-normalized form transpose of the connectivity matrix ∈ WC2C of the C2C network, and α (0, 1) provides a weighting for spreading the 198 Chapter 8 Protein Complex Prediction in the Era of Systems Biology

| − |≤ −10 prior information. At the end of the iterations, the difference Fr Fr−1 10 , and Fr is said to have reached a steady state. Finally, genes are scored based on their association with complexes. Given a candidate gene g, its association with

phenotype pti is computed as = D(g, pti) Fr(C, pti), (8.7) C∈C C where is the set of complexes containing g, and Fr(C, pti) denotes the proba- ∈ C bility that complex C is associated with phenotype pti at steady state. Because mutations in genes shared by multiple complexes may lead to similar phenotypes, the scores of their shared genes is the accumulated score of complexes that con- tain them. Using a human PPI network of 34,364 interactions among 8,919 proteins from the HPRD database [Peri et al. 2004, Prasad et al. 2009], 379 complexes from CORUM [Reuepp et al. 2008, 2010], and 1,428 genotype-phenotype associations from BioMART [Durinck et al. 2005], RWPCN was able to identify 253 disease genes and 1,614 gene-phenotype associations. This included 274 novel associations where the disease genes were previously unknown. For example, RWPCN identified known and putative causal genes for breast cancer including elac ribonuclease Z 2 (ELAC2), histone deacetylase 1 (HDAC1), histone deacetylase 2 (HDAC2), LIM domain only 4 (LMO4), phosphatase and tensin homolog (PTEN), retinoblastoma binding protein 8 (RBBP8), ribonuclease L (RNASEL), and zinc finger protein 350 (ZNF350). Since genes associated with similar disease phenotypes cluster into the same or similar complexes and functional modules, the drugs targeting a set of genes for a particular disease could be repurposed for another disease with a similar phenotype profile. In the study by Suthram et al. [2010], diseases were clustered based on their gene-expression profiles, and gene modules were scored using the combined “ac- tivity of genes” within these modules in the diseases. Microarray gene-expression datasets across all control and diseases were normalized using Z-transformation to allow for direct comparison of the expression values across all samples. The

activity of a gene i in disease s was then computed as the t-statistic Ssr(i) of its Z-transformed score between the disease sample s and the control sample r. The

module score Msr(j) for each gene module j in disease s was computed as the mean of the gene activity scores Ssr(i) of its component genes i. This resulted in   a vector Msr of module scores for each disease s. The correlation between any two diseases s and t was computed using the partial correlation coefficient of their module vectors conditioned on the control samples r, as follows: 8.3 Identifying Protein Complexes in Diseases 199

 −  .   = Mst Msr Mtr Cst r   . (8.8) . −| |2 −| |2 (1 Msr ) . (1 Mtr

Note that each disease may have its own control sample, and therefore the respec- tive control samples are used (as the sample r) in the above equation. Using 4,620 modules identified from a human PPI network of ∼35,000 interactions among ∼9,300 proteins, the method was able to compute module scores in 54 human diseases. Of these, 70 proteins within 59 modules had known target-drugs in the DrugBank database. Suthram et al. [2010] observed that these drugs targeting the 70 proteins could potentially be used to treat 65 diseases based on similarity of disease profiles and module associations.

Understanding Dynamic Rewiring of Protein Complexes in Diseases One of the aims of network medicine is to identify regions (gene modules) of the disease network that confer the disease phenotype. While the studies highlighted above have attempted to tackle this question, in general this still remains a chal- lenge because of the dynamic nature of cellular networks: Disease networks are wired differently from normal networks. Therefore, the genes and modules inferred from these “static” maps (reflecting predominantly normal conditions) might not always hold in disease conditions. This has led to the concept of differential network biology [Ideker and Krogan 2012] wherein the analysis focuses on regions of the net- work that are rewired (differential) between normal and diseases, to identify disease genes and modules. In general, the idea of differential analysis has been around for quite some time (e.g., in the analysis of differentially expressed genes between con- ditions [Liang and Pardee 1992, Winzeler et al. 1999]). Consequently, the analysis of differential networks can be considered as a “network extension” of differential gene analysis. In fact, orthology or interolog networks (covered in Chapter 6) that are used to infer conserved interactions between species are the complemented forms of differential networks between the networks of two species. An in-depth discussion on differential network biology is beyond the scope of this book, and readers are referred to excellent reviews and software tools on the topic [Ideker and Krogan 2012, Gross and Ideker 2015, Yeger-Lotem and Sharan 2015, Goenawan et al. 2016, Lambert et al. 2013]. Here, we briefly introduce the concept, and but discuss it in the context of protein complexes in more depth. The first step in differential network biology is to map physical (or genetic) inter- actions across conditions. Barrios-Rodiles et al. [2005] developed a luminescence- based mammalian interactome mapping (LUMIER) strategy to identify pairwise PPIs among a set of human transcription factors with and without stimulation by 200 Chapter 8 Protein Complex Prediction in the Era of Systems Biology

transforming growth factor β (TGFβ). LUMIER was based on immunoprecipitation of prey proteins under the two conditions. LUMIER recorded the changes in interac- tion strengths between the two conditions; however, it did not provide quantitative scores for these changes. Bisson et al. [2011] developed an affinity purification- selected reaction monitoring (AP-SRM) system to map quantitative changes in in- teractions with growth factor receptor-bound protein 2 (Grb2) in HEK293T cells at six time points after stimulation with six growth factors. SRM measured the peak intensities for each peptide, and intensity fold changes were computed for each of the proteins between the conditions. The statistical significance of these changes was evaluated using t-test. Analysis of the significantly differential interac- tions showed that the compositions of Grb2 complexes were remarkably different between the conditions, and depended on the growth factor used for the stimu- lation. Workman et al. [2006] used a ChIP-CHIP approach to study transcriptional rewiring with respect to yeast TF binding after exposure to the DNA-damaging agent methyl methanesulfonate, for 30 different TFs. Dixon et al. [2008, 2009] and Roguev et al. [2008] compared GIs between budding and fission yeasts. These differential GIs in combination with differential PPIs have given a unique perspective to under- stand genetic dependencies. For example, Roguev et al. [2008] found that, although protein complexes are highly conserved between the two yeast species, the genetic interactions between the complexes are not. These findings have important impli- cations in human drug-target identification, e.g., the GIs (synthetic lethal) inferred from yeast as molecular targets in cancer were found to not hold up in human although several protein complexes containing these target genes are conserved between yeast and human [Nguyen et al. 2013, Srihari et al. 2015b]. Other studies have sought to understand differential wiring of networks as a result of differences in genomic variants (the “variome”) [Creixell et al. 2015a, Creixell et al. 2015b, Wang et al. 2015]. For example, Creixell et al. [2015a, 2015b] developed a method KINspect to map kinase-substrate interaction specificity at the level of individual protein residues, and how mutations in different residues determine this specificity (the determinants of specificity or DoS). By integrating the results of KINspect with those from other methods, the authors developed ReKINect to characterize mutations that rewire the interaction network, which the authors refer to as “network-attacking mutations” or NAMs. Ryang et al. [2013] and Fraser et al. [2013] present excellent reviews on the structural impact of muta- tions on proteins, and their consequences on the rewiring of protein interactions and disruption of signaling networks at a systems level, and their effect on or- ganismal fitness and development of diseases. The EMBL-EBI maintains a dataset of over 15,000 NAMs (http://www.ebi.ac.uk/intact/resources/datasets#mutationDs) 8.3 Identifying Protein Complexes in Diseases 201 that have been experimentally shown to disrupt interactions or enable new ones, and thus have downstream consequences in signaling pathways and cellular func- tion. Porta-Pardo et al. [2015] present an in silico predicted list of 103 structural interfaces involved in protein interactions and that are enriched with somatic mu- tations in different cancers. Vandin et al. [2011] developed a method HotNet to identify significantly mutated subnetworks in cancer. In their HotNet model, Vandin et al. assume that mutated genes in networks spread their influence to surrounding genes, and therefore the (surrounding) network context of any mutated gene is important to quantify its contribution toward the cancer or disease. Vandin et al. compute this influence for each gene in the network using a diffusion approach, where genes spread their influence like the flow of a fluid through the network such that at each node, the fluid is lost at a constant rate until the system reaches an equilibrium. This equilibrium distribution provides the influence weights on the connections between the genes. Let G be the set of genes to be tested, and S be the set of samples (patients). The model begins with a biomolecular interaction network such as a PPI or a functional ∈ s interaction network. For a sample s S, let fv (t) denote the amount of fluid at node ∈ s = s s s (gene) v G at time t, and let f (t) [f1 (t), f2 (t),...,fn (t)] be the column vector of fluid at all genes in the network at time t. Fluid is lost at each node at a constant = + rate γ . Let L be the Laplacian of the network, and Lγ L γI. The dynamics of this continuous-time process are governed by the vector equation:

df s(t) =−L f s(t) + bsu(t), (8.9) dt γ where bs is the elementary unit vector with 1 at the sth place and 0 otherwise, and u(t) is the unit step function. As t →∞, the system reaches steady state, and the s = −1 s equilibrium distribution of the fluid density on the network is given by f Lγ b . Computing the diffusion process gives, for each pair of genes (u, v), the influence I(u, v) that u has on v (this is not symmetric in general, i.e., I(u, v) = I(v, u)). From this, an influence network is constructed with the weight on each edge in the influence network set to w(u, v) = min(I (u, v), I(v, u)). By combining these networks across the patients (i.e., adding the values on the edges from all patient networks), the edges that accummulate a high amount of influence across the patients are identified. By clustering this combined network, subnetworks that show high influence values across the patients can be identified. Leiserson et al. [2015] applied HotNet on ∼3,280 samples across 12 cancer types from The Cancer Genome Atlas (TCGA) (http://cancergenome.nih.gov/)to 202 Chapter 8 Protein Complex Prediction in the Era of Systems Biology

identify 16 significantly mutated subnetworks comprising of well-known cancer signaling pathways from KEGG [Kanehisa and Goto 2000] and less-characterized but novel pathways in these cancers. Among the subnetworks Leiserson et al. identified were those containing key cancer genes including phosphatidylinositol- 4,5-bisphosphate 3-kinase catalytic subunit alpha (PIK3CA) and epidermal growth factor receptor (EGFR), and these constituted key cancer pathways including re- ceptor tyrosine kinase (RTK) signaling and PI3K signaling with known roles in several cancers. Importantly, several complexes including the SWI/SNF chromatin- remodeling complex were reported with mutations in at least 1.5% of the samples from each of the 12 cancer types. Similarly, the BAP1 complex was mutated in at least six samples from each cancer type. The novel subnetworks included the co- hesin and condensin complexes which were mutated in at least 7.3% of the samples. Hofree et al. [2013] developed a network-based stratification (NBS) approach to integrate mutations in tumors using gene networks and to stratify cancer pa- tients into different cancer subtypes. Specifically, mutations for each patient are represented as a profile of binary (0, 1) states of genes, in which 0 indicates no mu- tation and 1 indicates one or more mutations for a gene in the patient. For each patient independently, these mutation profiles are projected onto a gene network

G, to generate an initial patient-by-gene matrix F0. Next, the mutation profile is “smoothed” over the entire network using a network propagation approach. Net- work propagation simulates random walks (with restarts) on the network according to the function,

= + − Ft+1 α . FtG (1 α) . F0, (8.10)

where G is the degree-normalized adjacency matrix of G, created by multiplying the adjacency matrix of G by a diagonal matrix with the inverse of its row (or column) sums on the diagonal. The parameter α governs the distance that a mutation signal is allowed to diffuse through the network during the propagation. This procedure || − || −6 is run iteratively until Ft+1 converges ( Ft+1 Ft <10 ). The result of the network smoothing are gene profiles in which the state of each gene is no longer binary, but reflects its proximity to mutated genes in the network. Subnetworks that carry larger network-smoothed values represent highly mutated subnetworks in the network. Following this network smoothing, patient profiles are clustered into a predefined number of cancer subtypes. Hofree et al. integrated three sources of interaction data—from HumanNet [Lee et al. 2011], STRING [Szklarczyk et al. 2011], and PathwayCommons [Cerami 8.3 Identifying Protein Complexes in Diseases 203 et al. 2011]—to generate the gene network. Mutation profiles from patients with high-grade serous ovarian carcinoma, uterine endometrial carcinoma, and lung adenocarcinoma, downloaded from TCGA, were used in the study. This data con- tained 356 patients with a total 9,850 mutated genes from the ovarian cancer co- hort, 248 patients with 17,968 mutated genes from the uterine cancer cohort, and 381 patients with 15,967 mutated genes from the lung cancer cohort. NBS was able to cluster these patients into clinically relevant cancer subtypes that corre- lated with the differences in survival probabilities over time for these subtypes. These subtypes showed differences in the course of the disease (e.g., aggressive vs. non-aggressive) and in patient survival times. The gene subnetworks that car- ried higher network-smoothed values were prognostic, and corresponded to com- plexes and pathways known to drive tumorigenesis in these cancers. For exam- ple, a subnetwork containing DDR proteins including ATM, ataxia telangiectasia, and rad3-related (ATR), BRCA1, BRCA2, checkpoint kinase 2 (CHEK2), and RAD51 highlighted deficiencies in the DNA double-strand break repair pathway, a con- dition referred to as “BRCAness” [McCabe et al. 2006, Lord and Ashworth 2006]. Patients with cancers involving this condition show better treatment response to platinum-based chemotherapy [Konstantinopoulos et al. 2010]: Platinum-based compounds induce DNA cross-links and other DNA lesions, which cannot be re- paired by the already-defective DDR pathway in cancer cells, thereby resulting in accumulation of high levels of DNA damage and cancer cell death [Liu et al. 2014]. Srihari and Ragan [2013] developed a method CONTOUR (Cancer OncogeNes from disrupTed mOdUles and their relationships) to track protein complexes that are disrupted—have undergone interaction rewiring or have changed protein com- position through loss or gain of proteins—between conditions in human cancers. CONTOUR works in three steps (Figure 8.2). In the first step, CONTOUR scores PPIs in individual conditions (e.g., normal and cancer, or subtypes of cancer) us- ing gene co-expression profiles to build conditional PPI networks. A human PPI =  network H VH , EH is used as a backbone network. Given a condition with n ρ ρ samples (patients), two distributions EH and E\EH of gene co-expression values (as Pearson correlation values computed across the n samples) for gene pairs in \ EH (interacting set) and E EH (non-interacting set), respectively, are constructed. \ Here, E EH is the set of protein pairs that are known not to interact (built from proteins belonging to complexes from CORUM that do not share the same localiza- tion). These distributions are constructed by binning the co-expression values for the gene pairs into bins of size 0.1. Based on these distributions, the probability 204 Chapter 8 Protein Complex Prediction in the Era of Systems Biology

Normal 1 1 Normal 2 2

s t

+ Identify and match complexes between conditions

Generic PPI network Tumor

Gene expression, Tumor mutation profiles Conditional Dysfunctional PPI networks complexes

Figure 8.2 CONTOUR workflow to identify dysregulated protein complexes from PPI networks constructed under normal and cancer conditions [Srihari and Ragan 2013].

that a pair (p, q) with co-expression falling in [x, x + 0.10) coincides with a real interaction is computed as

|ρ(x)| = ∈ | = EH Pρ(p, q) Pr[(p, q) EH ρ] , (8.11) |ρ(x)|+|ρ(x) | EH E\EH

where |ρ(x)| and |ρ(x) | are the frequencies of the bins (i.e., corresponding to EH E\EH x x + ) (p q) ρ ρ [ , 0.10 ) containing , in the two distributions EH and E\EH , respec- tively. These probabilities are used to weight protein pairs in H . Since normal and cancer conditions are expected to have very different gene expression profiles, this procedure results in very different distributions for the two conditions, thereby yielding two different PPI networks (i.e., networks with different weights for the protein pairs; the difference between the two distributions is assessed using a Kolmogorov–Smirnov test). In the second step, protein complexes are identified from each of these condi- tional PPI networks using a clique-based clustering method similar to CMC [Liu et al. 2009] (Chapter 3). Briefly, given a network, the procedure first ranks all max- imal cliques in the network based on their weighted interaction densities, and traverses down this list while merging highly overlapping cliques. Since the normal 8.3 Identifying Protein Complexes in Diseases 205 and cancer PPI networks have significantly different distributions for the interac- tion weights (from the above step), the maximal cliques identified from the two networks will show different ranking orders. Consequently, the produced protein complexes after the ranked-merging process are allowed to slightly differ in their protein compositions between the two conditions. In the third step, pairs of complexes identified from the two networks are matched such that the complexes within a pair: (i) have the same or similar protein composition and (ii) show significant difference in their expression profiles. Let S and T be two protein complexes detected from the normal and tumor conditions, respectively. The two complexes are deemed as matched if they show a Jaccard similarity J(S, T)≥ 0.70 (i.e., the same or similar protein composition). Next, a co-expression score is computed for S in the normal condition as | | p∈S,q∈S PCC(p, q) score(S) = 2 . (8.12) |S| . (|S|−1|) The co-expression score(T ) is similarly computed for complex T in the tumor condition. The matched pairs {S, T } are then ranked in non-increasing order of the differences in their scores |score(S) − score(T )|. Using a human PPI network of 5,824 proteins and 29,600 interactions, and 39 normal-tumor matched gene expression profiles from pancreatic adenocarcinoma patients, the CONTOUR procedure identified 574 and 576 normal and tumor com- plexes, respectively. By matching these complexes, 45 matched pairs of complexes with difference in scores ≥ 0.1 were discovered between normal to tumor. Gene On- tology and KEGG pathway-based evaluation of the top altered complexes showed enrichment for cancer pathways including KRAS signaling, TGF-β-signaling, Wnt- signaling, P53-apoptosis, cell-cycle, and DNA-repair pathways. The analysis also revealed that some complexes had changed their protein compositions to recruit oncogenic proteins and eliminate tumor suppressors. For example, the complex [HDAC2, HDAC1, metastasis associated 1 family member 1 (MTA1), MTA2, chro- modomain helicase DNA-binding protein 4 (CHD4), and B-cell CLL/lymphoma 11B (BCL11B)] in normal had changed to [HDAC2, HDAC1, MTA2, MTA1, CHD4, SIN3 transcription regulator family member A (SIN3A) and sex determining region Y (SRY)-Box 2 (SOX2)] in tumor, with the tumor suppressor BCL11B replaced by onco- genes SIN3A and SOX2. In a follow-up work, Srihari et al. [2014a] applied CONTOUR to study rewiring in protein complexes between different subtypes of familial breast cancers. This analysis identified several DDR complexes that were overexpressed in subtypes of breast cancers. It is hypothesized that these complexes compensate for loss of DDR 206 Chapter 8 Protein Complex Prediction in the Era of Systems Biology

genes including BRCA1/2 in these cancers. The analysis also identified transcrip- tion factors responsible for upregulation of these compensatory complexes, thus highlighting a higher level transcriptional rewiring responsible for overexpression of complexes in cancers. Wang et al. [2015] subsequently applied CONTOUR to identify differential complexes in clear-cell renal carcinoma, whereas Liu et al. [2015] used CONTOUR to study complexes in narcolepsy. Ori et al. [2016] studied the compositional variations of 182 mammalian com- plexes, covering 2,048 proteins, across temporal conditions using two large-scale proteomic datasets. The first dataset is from 11 human cancer cell lines that cov- ered most cancer types such as carcinoma, leukemia, sarcoma, and glioblastoma [Geiger et al. 2012]. The second dataset is from mouse embryonic fibroblasts repro- grammed into induced pluripotent stem cells (iPSC) over 15 days with five distinct temporal states in total. Ori et al. found that 22% of the protein complex mem- bers were differentially expressed in at least one of the conditions tested, while the majority (78%) were core complex members that remained invariant in expres- sion and abundance. More than half of the protein complexes had at least 20% of their members differentially expressed in one of the cell lines or states. A ma- jority (102 out of 182, 56%) of the complexes were categorized as variable, based on the ratio of core to variable members, whereas the remaining (44%) were cat- egorized as stable. The stable complexes were enriched for GO terms related to housekeeping biological processes including transcription, RNA processing and translation, and energy production. In contrast, the 102 variable complexes were enriched for regulators of chromatin structure and epigenetic modifications, in- cluding the switch/sucrose non-fermentable (SWI/SNF), nucleosome remodeling deacetylase (NuRD), and INO80 (inositol requiring) complexes. Ori et al. found that, among other factors including expression of proteins and posttranslational mod- ifications, paralog switching is a wide-spread mechanism that modulates the com- position of protein complexes. In particular, the chromatin-remodeling and NuRD complexes displayed extensive paralog switching across the states of human can- cer cell lines and mouse stem-cell differentiation. For example, between the HeLa and HEK293 cells, switch was observed between the NuRD members methyl-cpg binding domain 2 (MBD2) and MBD3 proteins, with higher abundance of MBD3- containing NuRD complexes in HEK293 cells. Moreover, Ori et al. demonstrated that the variable complex members from the 182 complexes had a robust discrim- inative power to distinguish between normal and colon adenocarcinoma samples, which was comparable to that of a whole-proteome profiling dataset containing 7,576 proteins [Wisniewski et al. 2012]. Zhao et al. [2013] identified disease-associated complexes by measuring the dif- ferential abundances of complexes between tissue-conditions in human cancers. 8.3 Identifying Protein Complexes in Diseases 207

The abundance of each complex was computed based on the abundance of its pro- tein components as follows. Assume pi is the abundance (number of copies) of protein i (a total of N proteins), and cj is the abundance of complex j (a total of M complexes). The number of copies of protein i in complex j is denoted as Sij , where = Sij 0 if the complex j does not contain protein i. In the ideal situation where all protein copies in a cell are used up in the formation of complexes, the variable sets { } { } pi and cj satisfy

M = pi Sij cj . (8.13) j=1 N However, the number of protein copies i=1 pi is usually larger than the number used up to constitute the M complexes. Therefore, instead of trying to find an exact solution for the above equation, Zhao represented this as an optimization procedure given by the objective function ⎡ ⎤ N M = ⎣ − ⎦ δ 1 (Sij . cj /pi) , (8.14) i=1 j=1 and minimized the deviation δ using linear programming. The abundances of pro- teins were estimated based on their mRNA expression levels. By applying this pro- cedure, Zhao et al. estimated the abundances of 2,837 CORUM complexes across 39 normal and cancer tissues. Using the average abundance levels across all normal tissues as the control, a total of 106 and 209 complexes were found to be over- and under-abundant, respectively, with a change more than a factor of two, across the different normal tissues. A total of 283 and 297 complexes were over- and under- abundant between normal and cancer tissues, respectively. Of these, 227 complexes that were enriched in brain cancer included the survival of motor neurons (SMN) complex, eukaryotic initiation factor (eIF) complex, and the cyclin-dependent ki- nase 1 (CDK1)-cyclin A2 (CCNA2) complex, which have known roles in brain tumors. Srihari et al. [2014b] analyzed rewiring of complexes between stages of cancer development by integrating PPI and time-course gene-expression datasets using a Boolean model. In this model, for each interacting protein pair (p, q),ifp and q are strongly positively co-expressed, i.e., PCC(p, q) ≥ t, then the interaction (p, q) is modeled as p ⊕¯ q (NOT EXCLUSIVE OR), where t is a preset threshold. Similarly, if p and q are strongly negatively co-expressed, i.e., PCC(p, q) ≤−t, then the in- teraction (p, q) is modeled as p ⊕ q (EXCLUSIVE OR). When the Boolean clause for the interaction (p, q) evaluates to 1, i.e., satisfied, it reflects the co-expression re- lationship between p and q. Here, p ⊕¯ q represents the case where p and q are 208 Chapter 8 Protein Complex Prediction in the Era of Systems Biology

simultaneously up-regulated or down-regulated, i.e., positive co-expression. On the other hand, p ⊕ q represents the case where only one of p or q is 1(0) and the other 0(1), i.e., negative co-expression. This model is used to construct conditional Boolean networks for different conditions, e.g., normal and cancer. A Boolean net- work in a condition is considered satisfied if all the interactions in it are satisfied. A Boolean network in one condition (e.g., normal) transforms into a Boolean network in another condition (e.g., cancer) via three kinds of edits—loss, gain or toggle— of its Boolean interactions. Toggles represent changes from p ⊕ q to p ⊕¯ q or vice versa. When a Boolean network in one condition transforms to a network in an- other condition, proteins need to be flipped (i.e., changes from 0 to 1 or vice versa) to keep the network satisfied. The authors hypothesized that the network always flipped the minimum number of proteins, and modeled the problem of finding this minimum set of flipped proteins as an optimization problem called min flip. The authors proposed a fixed-parameter tractable algorithm to solve min flip. Apply- ing this model and approach to a human PPI network of 29,600 interactions and 5,824 proteins, and using gene expression data from 39 normal-tumor matched samples from pancreatic cancers and from 19 BRCA1 and 30 BRCA2 breast cancer samples, the authors identified key cancer genes among the flipped genes. This suggested that cancer genes were responsible for rewiring (transforming) the PPI networks between these conditions. Applying the approach to a mouse PPI net- work of 3,215 interactions and 1,146 proteins, and using gene expression data from five time-points post spinal-cord injury in mice, the authors found that often dis- tinct (non-overlapping) subsets of proteins were flipped during the different stages post injury, and that sometimes these gene subsets belonged to different biological processes. This observation suggested that robustness of disease mechanisms in- cluding that of cancer partly stems from the “passing of the baton” between genes at different stages—different genes or genes from different biological processes are involved in different stages of tumor progression.

Enhancing Quantitative Proteomics Using PPI Networks 8.4 and Protein Complexes The above-described applications of PPI networks and protein complexes to study phenotypes and the differences between these phenotypes (e.g., between yeast mu- tants, or between human disease conditions) can be categorized into two classes: (i) non-quantitative and (ii) quantitative applications. Non-quantitative applications employ PPI networks and protein complexes mainly in a qualitative or descriptive manner, such as functionally annotating a set of proteins that come up in an analy- 8.4 Enhancing Quantitative Proteomics Using PPI Networks and Protein Complexes 209

sis, or to explain the mechanistic basis of an observed phenotype. For example, in the study by Dudley et al. [2005] described earlier, the authors identified gene modules using phylogenetic profiles of genes from 4,700 yeast mutants under 21 growth conditions; then the authors mapped these gene modules onto known pro- tein complexes from MIPS [Mewes et al. 2006] and the study by Gavin et al. [2006], to understand the functions of genes responsible for differences in growth fitness of these yeast mutants. Therefore, here, known protein complexes were only used to explain the functions of the genes in question. These non-quantitative applications are important; however, they are insufficient to drive conceptual advancement: PPI networks and protein complexes are used here mainly as auxiliary support, and the results are often not generalizable. On the other hand, quantitative applications em- ploy PPI networks and protein complexes as reproducible means, for example, as relevant (network) features to make phenotype predictions or to construct diagnos- tic models. For example, the HotNet application by Vandin et al. [2011] identifies network features (“hot” regions from the molecular network) than can effectively classify patients based on disease (cancer) subtypes. The central principle behind network-based proteomics (NBP) [Goh and Wong 2016] states that, apart from qualitative or descriptive purposes, networks can also be used quantitatively to improve or enhance experimental (proteomics) tech- niques. For example, network-based approaches can help resolve limitations in protein coverage [Goh et al. 2013a, Gwinner et al. 2013] and inconsistencies aris- ing from proteomics screens [Goh et al. 2015, Goh et al. 2012, Goh et al. 2013b, Goh and Wong 2016a, Goh and Wong 2016b]. However, transitioning from qual- itative to quantitative applications of networks is non-trivial. Nevertheless, some of the methods described in the sections above (e.g., Lage et al. [2007], Vanunu et al. [2010], Yang et al. [2011], Vandin et al. [2011], Hofree et al. [2013], Srihari and Ragan [2013]) represent success stories in this direction. Here, we will discuss a few other recent network-based approaches that fit into this paradigm.

Enhancing Protein Coverage and Consistency of Proteomics Screens The first set of approaches employ PPI networks to improve protein coverage and consistency of proteomics screens. Mass spectrometry (MS)-based proteomics is widely used for profiling systems-wide protein expression changes [Cox and Mann 2007]. However, MS-based proteomics tend to have limitations with protein cov- erage (inability to detect the entire proteome) [Goh et al. 2013a, Gwinner et al. 2013, White et al. 2011], and consistency (poor reproducibility and inter-sample agreement) [Goh et al. 2015, Goh et al. 2013b, Liu et al. 2004]. A primary reason for the lack of coverage is the use of thresholds on the generated MS data, which 210 Chapter 8 Protein Complex Prediction in the Era of Systems Biology

although increase the stringency and thus overall data accuracy, are rather arbi- trary and tend to disregard most of the generated data. On the other hand, even basic information on the participation of proteins within complexes can help to rescue several eliminated proteins. The following example illustrates this idea [Goh and Wong 2016]. Consider a proteomics screen that has a false-discovery rate (i.e., noise in the screen) of 25%, and consider a protein complex Z comprising proteins {A, B, C, D, E}. Assume that the screen detects proteins A, B, C, and D, but not E. It is reasonable to postulate that the probability that Z is present is proportional to the fraction of its constituent proteins being detected correctly. Therefore, the probability that Z is present is 60%[= 4 × (1 − 0.25)/5]. The presence of a protein complex implies the presence of all its constituent proteins. Thus, the probability that the reported protein A is present is 100% given the presence of the complex Z, and is 75%(= 1 − 0.25) given the absence of the complex Z. The overall proba- bility that A is present is thus 90% = [100% × 0.6 + 75% × (1 − 0.6)]. Repeating the argument verbatim, each of the other reported proteins B, C, and D also has 90% probability of being present. In contrast, any single unconnected reported protein has a lower chance, 75%(= 1 − 0.25), of being actually present in the sample. Using almost the same argument, the probability that the unreported protein E is present is not unknown and not zero; instead, it is at least 60%(= 100% × 0.6). Therefore, knowing that the complex Z, most of whose proteins are reported by the screen, also contains protein E boosts E’s chances of being actually present even though E is not reported by the screen. This can be confirmed by analyzing the original MS data (before thresholding) for evidence of existence of E. A simple network-based approach that enhances protein coverage in proteomics screens is the Clique Enrichment Analysis (CEA), proposed by Li et al. [2009]. CEA is based on the assumption that an eliminated protein is more likely to be present in the original sample if it is a member of a clique for which other members have been confidently identified in the same sample. CEA first groups proteins into confident proteins and non-confident proteins after thresholding and protein assembly, and then maps them onto a PPI network. Next, CEA enumerates all maximal cliques from the network. For each identified clique, an enrichment score derived from Fisher’s exact test is used to evaluate the enrichment of confident proteins in the clique. All non-confident proteins that co-exist in a clique enriched with confident proteins are thus rescued, whereas others are discarded. The rescued proteins are rescreened in a targeted manner, or simply checked against the original MS data for evidence of existence. The enrichment threshold is set using a cross-validation approach, to achieve the desired sensitivity and specificity. Cliques represent only a subset of protein complexes, i.e., not all protein com- plexes form cliques in the PPI network. Therefore, instead a reference set of known 8.4 Enhancing Quantitative Proteomics Using PPI Networks and Protein Complexes 211

protein complexes, or protein complexes predicted from any of the plethora of pre- diction methods already available (e.g., MCODE [Bader and Hogue 2003], MCL [Van Dongen 2000, Leal-Pereira et al. 2004], CFinder [Adamcsek et al. 2006], PCP [Chua et al. 2008], and CMC [Liu et al. 2009]) can be employed in place of cliques to im- prove protein coverage from proteomics screens. Proteomics Expansion Pipeline (PEP), proposed by Goh et al. [2011], employs protein complexes to detect un- reported proteins. PEP first maps high-confidence proteins from the proteomics screen onto a large PPI network, and expands around them to immediate neigh- bors to build seeds. The subnetwork covered by these seeds is then clustered using CFinder [Adamcsek et al. 2006]. Each identified cluster is ranked based on the av- erage expression value of the proteins it contains. Proteins in high-ranking clusters that are eliminated in the proteomics screen are then checked against the original MS data for evidence of existence. In a related approach, Managbanag et al. [2008] used proteins appearing on the shortest paths between high-confidence proteins in the PPI network, to rescue eliminated proteins. In another work, Goh et al. [2013a] presented an evaluation of several methods, in addition to proposing a new method functional-class scoring (FCS), for missing protein prediction. Given a set of MS-reported proteins, the set is intersected with a protein complex (of size at least 4). The significance of the size of the intersection is tested using the hypergeometric test. If deemed significant, other proteins in the complex are predicted as present. Goh et al. [2013a] report an experiment where the mass spectra were matched against a small protein database and a larger protein database to report two lists of detected proteins. The first (smaller) list of proteins was then analyzed using FCS (with reference complexes from CORUM) to predict missing proteins. The predicted missing proteins were checked against the second (larger) list of reported proteins. About 96% of the extra reported proteins in the second list was among those predicted by FCS, while about 53% of those missing proteins predicted by FCS could be confirmed in the second list. The paper also ran a similar experiment using PEP to predict the missing proteins. In this case, only about 32% of the extra reported proteins in the second list was among those predicted by PEP, while only about 15% of those missing proteins predicted by PEP could be confirmed in the second list.

Improving Quantitative Comparison Between Proteomics Profiles Quantitative comparison of samples is central to proteomics. This is important to checking whether: (i) protein expression measurements from one sample or a batch of samples is reproducible in another batch of samples; and (ii) protein expression measurements in two sets of samples are different in a statistically 212 Chapter 8 Protein Complex Prediction in the Era of Systems Biology

significant manner. However, often protein expression measurements become dif- ficult to compare. This is likely due to: (i) noise and coverage of the proteome at the level of individual samples; and (ii) limitation of current statistical techniques, for example, because of insufficient sample size. To improve quantitative compar- ison between samples (i.e., improve the statistical power of proteomics analysis methods), additional dimensions present in the problem have to be brought into consideration. Gene Set Enrichment Analysis (GSEA), proposed by Subramanian et al. [2005], takes such an approach, by allowing the statistical analysis to be shifted from individual genes to groups of genes. The underlying idea here is simple: If a majority of genes or proteins in a group have small but correlated expression-level changes, then this can still result in a high statistical significance for the group as a whole. This idea can be further rationalized if the differentially expressed genes or proteins are grouped in a functionally relevant manner (e.g., those belonging to the same ={ } pathway). In GSEA [Subramanian et al. 2005], the list L g1, g2,...,gN of N genes in question are first ranked according to their differential expression r(gi) between two sets of samples or phenotypes. Given an a priori defined set S of genes (e.g., a curated pathway [Kanehisa and Goto 2000]), GSEA computes an enrichment score (ES) for the ranked list L with respect to S.IfS is enriched with top-ranked genes in L, then the pathway or biological knowledge represented by S is considered to be differentially expressed between the two phenotypes, suggesting that the pathway is dysfunctional or rewired. The ES is computed by walking down L and evaluating the fraction of genes in S (“hits”) weighted by their differential expression, and the fraction of genes not in S (“misses”) present up to a given position i in L,

|r(g )|p = j = | |p Phit(S, i) , where NR r(gk) NR gi∈S,j≤i gk∈S (8.15) 1 P (S, i) = , miss N −|S| gj ∈S,j≤i

where the exponent p controls the weights of the genes in L. The ES is the maxi- − mum deviation from zero of (Phit Pmiss). For a randomly distributed S, the score ES(S) will be relatively small; but if it is concentrated to top or the bottom of the list, or otherwise nonrandomly distributed, then ES(S) will be correspondingly high. The significance of ES can then be estimated using a randomization proce-

dure: By comparing the observed ES(S) with the distribution of scores ESNULL computed with, for example, 1,000 rounds of random permutations of pheno- type labels. This way, even if individual genes are only moderately differentially 8.4 Enhancing Quantitative Proteomics Using PPI Networks and Protein Complexes 213

112 123

6 3

5 4 654

(a) (b)

Figure 8.3 Two scenarios for group-based analysis of differentially expressed genes: (a) the entire group of genes (1–6) is predicted to be differentially expressed although only a subset of genes (2, 3, and 4) are truly responsible for the phenotype; (b) information on the network connectivity between the genes can identify the two genes 2 and 3 as the upstream regulators contributing to the differential expression of the regulatee gene 4; therefore, only the subnetwork of genes 2, 3, and 4 is predicted to be differentially expressed.

expressed, grouping them together in a functionally relevant manner boosts their combined significance for the observed differences between the phenotypes. The effectiveness of GSEA has been demonstrated in a number of studies including for identifying enriched pathways of differentially expressed genes between vari- ous cancer phenotypes [Subramanian et al. 2005], and GSEA is available as part of several systems-biology analytical tools [Huang et al. 2009, Cline et al. 2007]. A Java implementation of GSEA is available from http://www.broad.mit.edu/gsea/. Another popular group-of-genes analysis method is Pathifier, proposed by Drier et al. [2013]. Similar to GSEA, Pathifier also works on an a priori defined set P of genes (e.g., a curated pathway [Kanehisa and Goto 2000]), and assigns each sample

i a pathway dysregulation score (PDS) DP (i) with respect to that pathway. The PDS estimates the extent to which the expression of P deviates in sample i from a control set of samples. Each such sample i is considered a point in the |P | dimensional space defined by the set of genes in the pathway P . The entire set of samples thus forms a cloud of points in this space. Pathifier then computes the (nonlinear) “principal curve” that passes through the “middle” of this cloud; the principal curve captures the variation of the cloud in the space. The algorithm by Hastie and Stuetzle [1989] is used to compute this principal curve. Each sample is then

projected onto this curve, and the PDS for a sample i is the distance DP (i) along this curve for the projection of sample i from the control set of samples. Pathifier was demonstrated to work particularly well for classifying of cancer samples [Drier et al. 2013, Liu et al. 2016]. An implementation of Pathifier is available from http:/ /www.weizmann.ac.il/pathifier/. 214 Chapter 8 Protein Complex Prediction in the Era of Systems Biology

However, a shortcoming of whole-pathway analysis is that rarely the entire path- way is responsible for the observed phenotypic difference—only a small subnet- work of the pathway is truly responsible—and, when only a subnetwork is respon- sible, a whole-pathway analysis may often miss the pathway of interest. A significant subnetwork of the pathway potentially provides a more precise hypothesis that ex- plains the phenotype compared to the entire pathway. For example, in an analysis by Liu et al. [2016] of DNA-damage response pathways in sporadic breast cancers using Pathifier [Drier et al. 2013], it was found that rarely was the entire pathway dysregu- lated in samples with the observed clinical phenotype; instead, subnetworks of the pathways, mainly around genes affected by genomic alterations in these samples, correlated better with the phenotype. To identify this precise (smaller) subnetwork, it is necessary to take into account the connectivity between the genes. The follow- ing example illustrates this idea (Figure 8.3)[Sivachenko et al. 2007]. In the first scenario (Figure 8.3(a)), a whole-pathway analysis may pick the entire set of genes (1–6) as significant based on the presence of one strongly differentially expressed gene (gene 4; dark grey-filled circle), and two weakly differentially expressed genes (genes 2 and 3; light grey-filled circles). However, the exact subset and the order of regulation between these genes cannot be confirmed. On the other hand, in the second scenario (Figure 8.3(b)), information on the connectivity between the genes can identify the two genes 2 and 3 as the two upstream regulators contributing to the differential expression of the regulatee gene 4. Therefore, incorporating net- work connectivity can identify the precise subnetwork (genes 2, 3, and 4) that is responsible for the phenotype, and possibly the regulator-regulatee relationship between the genes within the subnetwork. The assumption (null hypothesis) in group-based analyses is that, for any gene, there is no association between being a target of a particular regulator and being differentially expressed. Therefore, if a regulator within a pathway has a total of k targets, then the probability that exactly m of these targets are differentially expressed by chance is   k − p(m, k) = qm(1 − q)k m, (8.16) m

(approximated by a binomial distribution), where q is the probability for a single randomly selected gene to be differentially expressed. Therefore, the probability for the k targets to contain M or more differentially expressed genes is equal to

k P(M, k) = p(m, k). (8.17) m=M 8.4 Enhancing Quantitative Proteomics Using PPI Networks and Protein Complexes 215

The significance for this can be ascertained using a permutation (randomization) test on the gene expression or phenotype data. To incorporate network connectiv- = (in) (in) (in) ity, a simple way is to define q Kd /Ktot , where Kd is the sum of in-degrees of (in) all differentially expressed genes, and Ktot is the total in-degree of all the genes in the network. This ratio provides the estimate for the probability of the regulator to randomly reconnect to a differentially expressed gene during random resampling of the network. A brute-force resampling, however, is computationally intensive, and an effective alternative using network randomization has been suggested (called Net- work Enrichment Analysis (NEA)) [Sivachenko et al. 2015, Sivachenko et al. 2007]. This randomization procedure breaks all edges in the network and reconnects them randomly. As a result, a regulator that picks targets randomly during this rewiring has k(in)(g) possibilities to reconnect to a particular target g, where k(in)(g) is the in-degree of g. Thus, this procedure can be used to build an effective sampling distribution while incorporating network connectivity. However, the randomization procedure above assumes the genes are mutually independent. This is obviously not true for genes in a pathway as, by definition, genes in a pathway are coordinated to perform a specific biological process. In short, the null hypothesis underlying this randomization is invalid, leading poten- tially to the null hypothesis being rejected by chance and thus more false positives. A more proper randomization procedure is to permute phenotype data, i.e., to per- mute class labels. Moreover, NEA identifies mainly trivial regulator-regulatee relationships, and therefore tends to produce very small subnetworks. Although these small subnet- works are useful, these still need to be pieced-in together to produce non-trivial subnetworks that can explain the underlying hypothesis better. SNet, proposed by Soh et al. [2011], is one way to identify non-trivial subnetworks. SNet first maps proteins that are highly expressed in most samples of the disease phenotype in question onto biological pathways and PPI networks. SNet discards all other pro- teins in these pathways and networks, causing these pathways to fragment into separate, but reasonably large, subnetworks. Those subnetworks that show sig- nificant difference in expression levels between disease and controls are declared significant. The details of SNet and its various refinements are discussed below.

Employing Protein Complexes to Analyze Proteomics Data In our opinion, the use of complete networks is not a compulsory requirement in analyzing proteomics data, but instead functional unit information (e.g., protein complexes) although imperfect, are easier to use and less problematic than whole networks [Goh and Wong 2016]. First, since protein complexes form basic units of 216 Chapter 8 Protein Complex Prediction in the Era of Systems Biology

biological function, and these obey a certain network connectivity structure (pro- teins within a complex are tightly connected to one another), protein complexes themselves can be used as (non-trivial) subnetworks to explain the underlying dif- ferences in phenotypes [Fraser and Plotkin 2007, Michaut et al. 2011]. Second, there exist centralized protein complex databases (e.g., CYC2008 [Pu et al. 2009] and MIPS [Mewes et al. 2008] for yeast, and CORUM [Reuepp et al. 2008] for human) that catalog expert-curated protein complexes, whereas there is no widely accepted centralized database of reference subnetworks. Third, standardizing protein com- plexes is relatively easy: Bona fide protein complexes are concisely represented as lists of their constituent proteins, which can be experimentally validated and do not depend on network-mining algorithms which yield different results depending on which networks and algorithms are used. Last, although still highly incomplete, the functional and subcellular localization annotations at the level of protein com- plexes are increasingly available in centralized protein complex databases, which makes it easier to leverage protein complexes as a controlled vocabulary of anno- tations for network-based proteomics. Two approaches that are based on the idea of using protein complexes to quan- tify phenotypic differences between conditions have been previously discussed [Srihari and Ragan 2013, Zhao et al. 2013], wherein differential expression for pro- tein complexes was correlated with molecular and clinical differences observed between cancer subtypes. Other approaches, commonly referred to as rank-based network algorithms [Goh and Wong 2016a, Goh and Wong 2016b], have been devel- oped since, and these include Proteomics Signature Profiling (PSP) [Goh et al. 2012, Goh et al. 2013b], SubNETworks (SNET) [Soh et al. 2011, Lim and Wong 2014] and its successors Fuzzy SNET (FSNET), Paired FSNET (PFSNet) [Lim and Wong 2014], and Extremely Small sample size SubNETworks (ESSNET) [Goh and Wong 2016b, Lim et al. 2015] (discussed below). PSP, proposed by Goh et al. [2012, 2013b], works as follows. For a given patient p, a set of k clusters, determined by an a priori set of protein complexes (e.g., from CORUM [Reuepp et al. 2008]), is used to define an unranked cluster vector C(p) = (p) (p) C1 ,...,Ck for the patient. Each element in the cluster vector measures (p) = the hit rate of the patient’s reported proteins in a proteomics screen, H (Ci) (p) (p) N (Ci)/N(Ci), where N (Ci) is the number of proteins from cluster Ci that are detected for patient p, and N(Ci) is the total number of proteins detected for cluster Ci (across all patients). The patient’s proteomics signature profile (PSP) is therefore a vector of fixed length k of hit rates. Let A and B be two phenotypes (e.g., stages of a disease) containing n and m patients, respectively. To check if a particular cluster 8.4 Enhancing Quantitative Proteomics Using PPI Networks and Protein Complexes 217

C is responsible for the difference between these two phenotypes, a t-statistic can be computed for the cluster

H (C) − H (C) t = A B , (8.18) S 1 + 1 HA,HB n m

where HA(C) and HB(C) are the mean hit rates for cluster C in the two sets of patients, and  (m − 1)S2 + (n − 1)S2 = HA HB SH H , (8.19) A, B m + n − 2

where SHA and SHB are the standard deviations of the hit rates for C in the two sets of patients. If this t-statistic is significant, then the cluster C is declared as significantly differentially expressed between the two phenotypes. The PSP approach can be generalized to other connectivity structures within the network. For example, instead of using only the protein complexes from CORUM, clusters derived from the PPI network using protein complex prediction methods can be used to define PSP profiles for patients. Goh et al. [2012] extended PSP to include graphlets, which are network patterns ranging from 2–5 nodes [Prˇzuljet al. 2004, Higham et al. 2008] (graphlets were discussed in Chapter 2). This resulted in a graphlet degree vector (GDV) for each patient. However, some positions in a graphlet are topologically equivalent—for example, the three points in a closed triangle are equivalent. To eliminate the resulting redundancy, the graphlet orbits are used: If a node of interest is involved in a graphlet position of topological equivalence (e.g., one point of a closed triangle), it only counts once. This way, there are 73 graphlet orbits from size 2–5, which are represented as a graphlet orbit degree vector of length 73. In the work by Goh et al. [2012], GDVs were used to derive clusters from the PPI network: Connected nodes that have similar GDVs are considered to belong to the same cluster. The clusters so generated are then used to define the cluster vector in PSP. Here, each element in the 73-length GDV indicates the number of counts a node of interest is found in the corresponding graphlet orbit. GDVs are then computed for all connected nodes in the network, and adjoining nodes that have similar GDVs are considered to belong to the same cluster. However, the sim- ilarity between two GDVs cannot be measured in a straightforward manner using typical distance measures such as Euclidean or Manhattan because of dependen- 218 Chapter 8 Protein Complex Prediction in the Era of Systems Biology

cies between the various orbits. Milenkovic and Prˇzulj [2008] introduced a distance

measure for this purpose. A 73-dimension vector W containing weights wi for each ∈{ } of the orbits i 1,...,73 was defined. To compute these weights wi, each orbit i is assigned an integer oi that is obtained by counting the number of orbits that affect orbit i. The weight wi is a function oi and is computed as:

= − log(oi) wi 1 . (8.20) log(73)

The distance for orbit i between two nodes u and v is measured as: log(u + 1) − log(v + 1) D (u, v) = w . i i , (8.21) i i + log(max(ui , vi)) 2

where ui and vi are the counts for the nodes u and v in orbit i. The total distance between the two nodes based on all 73 orbits is: 73 = D (u, v) D(u v) = i1 i , 73 , (8.22) i=1 wi which gives the GDV similarity between u and v as:

S(u, v) = 1 − D(u, v), (8.23)

since D(u, v) is a distance measure with value between 0 and 1. Nodes with high GDV similarity are considered to be in the same cluster. SNET, FSNET, and PFSNET are essentially improvement on top of each other. In

SNET [Lim and Wong 2014], given a protein gi and a tissue pk, the following score is = defined: fs(gi , pk) 1 if the protein gi is among the top α-percent (typically, α is 10) most abundant proteins in the tissue pk, and 0 otherwise. This is used to compute the proportion β(gi , Cj ) of tissues in a class Cj of tissues that have gi among their top α-percent most-abundant proteins, f (g , p ) β(g , C ) = s i k . (8.24) i j | | Cj pk∈Cj

The score of a tissue pk (regardless of its class) with respect to a protein complex S and class Cj is = score(S, pk , Cj ) fs(gi , pk) . β(gi , Cj ). (8.25) gi∈S 8.4 Enhancing Quantitative Proteomics Using PPI Networks and Protein Complexes 219

The function fSNET(S, X, Y , Cj ) for some protein complex S across two sets of samples (patients) X and Y with respect to class Cj is a t-statistic defined as

mean(S, X, C ) − mean(S, Y , C ) = j j fSNET(S, X, Y , Cj ) (8.26) var(S,X,Cj ) + var(S,Y ,Cj ) |X| |Y | | |+| | with X Y degrees of freedom (DOF). Here, mean(S, X, Cj ) and mean(S, Y , Cj ) are the mean of score(S, pk , Cj ) for pk in X and the mean of score(S, pk , Cj ) for pk in Y , respectively; var(S, X, Cj ) and var(S, Y , Cj ) are the variance of score(S, pk , Cj ) for pk in X and the variance of score(S, pk , Cj ) for pk in Y , respectively. The complex S is considered differential in X but not in Y if fSNET(S, X, Y , Cj ) is at the largest 5% extreme of the Student’s t-distribution. Given two classes, C1 and C2, the set of sig- { | nificant protein complexes returned by SNET is the union of S fSNET(S, C1, C2, C1) } { | } is significant and S fSNET(S, C2, C1, C2) is significant ; the former being com- plexes that are significantly consistently highly abundant in C1 but not C2, and the latter being complexes that are significantly consistently highly abundant in C2 but not C1. For FSNET [Goh and Wong 2016a], the definition of the function fs(gi , pk) is = changed to the following: fs(gi , pk) 1ifgi is among the top α1 percent (default = 10%) of the most-abundant proteins in pk, and fs(gi , pk) 0ifgi is not among the top α2 percent (default 20%) of the most-abundant proteins in pk. The range = between α1 and α2 is divided into n equal-sized bins (default n 4), and fs(gi , pk) is assigned the value 0.8, 0.6, 0.4, or 0.2 depending on which bin gi falls into for pk. The test statistic is defined analogously to fSNET, and given two classes C1 and C2 the set { | of significant complexes returned by FSNET is the union of S fSNET(S, C1, C2, C1) } { | } is significant and S fSNET(S, C2, C1, C2) is significant . If a protein pk in complex S is irrelevant to the difference between two classes = X and Y , then the paired difference of the scores of pk, that is, δ(S, pk , X, Y) − score(S, pk , X) score(S, pk , Y), is expected to be centered at 0. PFSNet [Lim and Wong 2014] assesses the significance of the deviation of this difference from 0 using a one-sample t-statistic defined as

mean(S, X, Y , Z) f (S, X, Y , Z) = , (8.27) PFSNet se(S, X, Y , Z)

where mean(S, X, Y , Z) and se(S, X, Y , Z) are the mean and standard error, re- { | } spectively, of the list δ(S, pk , X, Y) pk is a tissue in Z . The complex S is signifi- cantly consistently highly abundant in X but not in Y if fPFSNet(S, X, Y , Z)is among 220 Chapter 8 Protein Complex Prediction in the Era of Systems Biology

the largest 5% extreme of the Student’s t-distribution. The set of significant com- { | } plexes returned by PFSNet is the union of S fPFSNET(S, X, Y , Z) is significant and { | } S fPFSNET(S, Y , X, Z) is significant Given a protein p, a sample x in class X, and a sample y in class Y , the difference between the expression levels of p in x and y is called a paired difference. If a protein complex S containing k proteins is irrelevant to the difference between two classes X, containing n samples, and Y , containing m samples, then the M = k × n × m paired differences should be centered on 0. In ESSNET [Goh and Wong 2016b, Lim

et al. 2015], this is evaluated using a one-sample t-test where the test statistic Ts is defined as:

= μ Ts √ , (8.28) s/ M

where μ and s are the mean and standard deviation of the M paired differences.

The complex is considered relevant if the area above the observed Ts in the nominal distribution is < 0.05. However, the DOF for the theoretical t-distribution is not k × n × m, which is a gross overestimation. Instead, in ESSNET the DOF is hypothesized to lie between (n + m) and k × (n + m), depending on the extent of correlation of proteins within the complex. The DOF is set less conservatively at 0.5 × k × (n + m). The study by Goh and Wong [2016b] assessed ESSNET, PFSNET, QPSP, GSEA, and the classic hypergeometric enrichment analysis using simulated datasets and real datasets. For the simulated datasets, where the precise differential complexes are known, ESSNET was shown to dominate the other methods in terms of recall and precision of identifying these “planted” differential complexes. For the real datasets, since the precise differential complexes are unknown, assessment based on reproducibility and stability was used instead. Specifically, a real dataset was randomly split into two and each method was applied to the two sub-datasets inde- pendently; then the two resulting lists of significant complexes of each method were checked for concordance. A good method is expected to achieved high concordance between these two lists, indicating reproducibility of its significant complexes. Fur- ther, a real dataset was also subsampled many times, and each method was applied to these subsamples; then the proportion of subsamples in which a complex was declared significant by each method was tabulated. A good method is expected to reproducibly declare the same complexes to be significant in a large proportion of subsamples, indicating the stability of the method. In these tests, ESSNET and PFSNET were shown to have comparable reproducibility and stability and were far superior to other methods. The paper further showed that the false-positive rates of ESSNET and PFSNET were well controlled. This, together with their high repro- 8.5 Systems Biology Tools and Databases to Analyze Biomolecular Networks 221

ducibility and stability, implies the high consistency of these two methods. The study further showed that ESSNET is able to identify less-abundant complexes that are significant, whereas PFSNET is restricted to high-abundance complexes by de- sign. With recent advancements in proteomics technologies, and the increasing num- ber of studies adopting large-scale proteomics screens to catalog and study protein- expression levels across different experimental conditions and phenotypes [Kim et al. 2014, Uhl´en et al. 2010, Uhl´en et al. 2015, Geiger et al. 2012, Ellis et al. 2013, Mertins et al. 2016], it will be interesting to see how the above-described approaches can be applied to these screens, and to what extent these approaches mitigate the shortcomings of proteomics screens.

Systems Biology Tools and Databases to Analyze 8.5 Biomolecular Networks With significant emphasis on systems biology over the last decade, a number of open-source and proprietary software tools have been developed to analyze PPI, complexosome, phenome and diseaseome networks. Here, we highlight some of the popular ones (Table 8.1); some of these have also been mentioned in the earlier chapters. For a more detailed list and description of tools, the readers are referred to Agaptio et al. [2013], Berger et al. [2013], and Mitra et al. [2013]]. Cytoscape pro- vides a visualization platform to analyze large molecular networks including PPI, protein-DNA, and GI networks. Cytoscape enables integration of diverse datasets including gene expression [Cline et al. 2007], and annotation databases including Gene Ontology [Ashburner et al. 2000], KEGG [Kanehisa and Goto 2000], and Re- actome [Vastrik et al. 2007]. Cytoscape also supports development and installation of new “plugins” for different kinds of network analysis; see [Saito et al. 2012] for a description of these plugins (up-to-date as of 2012), and visit the Cytoscape App Store (http://apps.cytoscape.org/apps/with_tag/systemsbiology) for a list of popu- lar systems biology plugins. BIANA (Biologic Interactions and Network Analysis), BIPS (BIANA Interolog Prediction Server), DINA (DIfferential Network Analysis), MyProteinNet, and PINA (Protein Interaction Network Analysis) integrate diverse datasets and public annotation databases to aid elaborate multi-omic, tissue- and other context-specific network analysis. KEGG (Kyoto Encyclopedia of Genes and Genomes), KEGG Brite, Ingenuity Pathway Analysis®, InnateDB, NetPath, PathDIP, Pathway Commons, Pathway Studio®, and Reactome provide detailed pathway-level annotations and analysis platforms across a number of organisms. Consensus- PathDB integrates interaction and pathway data under different contexts, and aids 222 Chapter 8 Protein Complex Prediction in the Era of Systems Biology

analysis of metabolite lists, visualization of genes and metabolites, and gene-set analysis based on protein complexes and network modules. DisGeNet, KEGG Dis- ease, and MalaCards are focused more on diseases, and provide integrative infor- mation on genes, interactions, and pathways involved in human diseases. DGIdb and DrugBank provide interactions between drugs and their protein targets. Con- nectivity Map integrates information from a large number of gene-expression and drug-compound profiles, and enables the discovery of functional connections be- tween drugs, genes, and diseases. CellDesigner [Funahashi et al. 2007] is a mod- eling tool for biochemical networks and supports the Systems Biology Markup Language (SBML) [Hucka et al. 2003, Kitano et al. 2005] to model and analyze net- works. A comprehensive list of computational tools for systems biology is available from OMICS tools (http://omictools.com/ppis-c51-p1.html). A comprehensive list of more than 300 pathways and interaction databases is available from Pathguide (http://www.pathguide.org). 8.5 Systems Biology Tools and Databases to Analyze Biomolecular Networks 223 Hucka et al. 2003 , Smoot et al. 2010 ] Wagner et al. 2016 ] ® Reference Ingenuity Elsevier Kitano et al. 2005 ] [ Lamb et al. 2006 ] [ Garcia-Garcia et al. 2010 ] [ Garcia-Garcia et al. 2012 ] [ Krogan et al. 2015 ] [ Funahashi et al. 2007 , [ Kamburov et al. 2013 ] [ Shannon et al. 2003 , [ Huang et al. 2009 ] [ Griffith et al. 2013 , [ Gambardella et al. 2013 ] [ Pinero et al. 2015 ] [ Wishart et al. 2006 ] [ Lynn et al. 2008 ] [ Kanehisa et al. 2010 ] [ Kanehisa and Goto 2000 ] [ Rapaport et al. 2013 ] [ Basha et al. 2015 ] [ Brown et al. 2009 ] [ Xia et al. 2014 ] [ Kandasamy et al. 2010 ] [ Rahmati et al. 2016 ] [ Cerami et al. 2011 ] [ Cowley et al. 2012 ] [ Vastrik et al. 2007 ] abases to analyze interactome, complexosome, phenome, http://www.ingenuity.com/ http://www.elsevier.com/online-tools/pathway-studio URL http://www.broadinstitute.org/cmap/ http://sbi.imim.es/web/BIANA.php/ http://sbi.imim.es/BIPS.php/ http://www.ccmi.org/ http://www.systems-biology.org/002/ http://ConsensusPathDB.org/ http://www.cytoscape.org/ http://david.ncifcrf.gov/ http://dgidb.genome.wustl.edu/ http://dina.tigem.it/ http://www.disgenet.org/ http://www.drugbank.ca/ http://www.innatedb.com/ http://www.genome.jp/kegg/disease/ http://www.genome.jp/kegg/pathway.html http://www.malacards.org/ http://netbio.bgu.ac.il/myproteinnet/ http://ophid.utoronto.ca/navigator/ http://www.networkanalyst.ca/ http://www.netpath.org/ http://ophid.utoronto.ca/pathdip/ http://www.pathwaycommons.org/pc/home.do http://cbg.garvan.unsw.edu.au/pina/ http://www.reactome.org/ ® ® Connectivity Map Tool/Database BIANA BIPS CellDesigner ConsensusPathDB DAVID DINA DrugBank InnateDB KEGG Pathway MyProteinNet NetworkAnalyst PathDIP Pathway Commons Reactome Cancer Cell Map Initiative Cytoscape DGIdb DisGeNet Ingenuity Pathway Analysis KEGG Disease MalaCards NAViGaTOR NetPath Pathway Studio PINA Examples of systems biology tools and dat and diseasome networks Table 8.1 Conclusion

The most exciting phrase to hear in science, the one that heralds new discoveries, is not “Eureka!” but “That’s funny...” —Issac Asimov (1920–1992), Russian-born American biochemist (as quoted in Wainer and Lysen [2009])

Proteins and their complexes are central protagonists at the cellular level and drive almost all biological processes within cells. Protein complexes are known to be of the order of hundreds even in the simplest of eukaryotic (unicellular) organ- isms (e.g., more than 400 protein complexes have been identified for the yeast S. cerevisiae) and of the order of thousands in the cells of higher-order (multicellu- lar) organisms (e.g., in mammalian cells). A faithful attempt toward identification 9and characterization of the entire complement of protein complexes is crucial to elucidate the functioning of the cellular machinery. Ever since large-scale mapping of protein-protein interactions began on a high- throughput scale in the early 2000s, protein complex prediction methods have played a crucial role in analyzing the large amounts of protein interaction data for exhaustive and accurate characterization of protein complexes. These predic- tion methods have identified several novel complexes or novel components within already known complexes that have subsequently been validated and have thus en- hanced our knowledge of characterized protein complexes. This book is dedicated to the two decades of efforts that have gone into developing, improving, and evaluat- ing computational methods for protein complex prediction, and the contributions of these methods toward understanding the functional organization within cells. Considering how well researched the area of protein complex prediction has been—more than 30 different methods proposed over the last 15 years—here, we have classified and described these methods based on a number of criteria, namely (i) algorithmic strategies used—agglomerative vs. divisive; (ii) biological in- formation employed (e.g., core-attachment structure and functional annotations); (iii) topologies of the protein complexes detected—large vs. small and dense vs. 226 Chapter 9 Conclusion

sparse; (iv) dynamics of protein complexes—consecutively expressed/assembled vs. periodically expressed/assembled, and structurally ordered vs. disordered (fuzzy); and (v) evolutionary conservation—partially vs. fully conserved, and ancient (con- served from lower-order species) vs. recent (conserved only in vertebrates); see Tables 9.1–9.3. Chapter 3 was dedicated to the more classical methods that make use of pri- marily the PPI network topology to identify protein complexes. These included the earliest methods, namely MCODE [Bader and Hogue 2003], MCL [Van Dongen 2000, Enright et al. 2002, Leal-Pereira et al. 2004], LCMA [Li et al. 2005], and CFinder [Adamcsek et al. 2006], as well as their improvements, namely HACO [Wang et al. 2009] and ClusterONE [Nepusz et al. 2012] (which improve MCODE and MCL by allowing cluster overlaps); CMC [Liu et al. 2009] (which incorporates interaction weights while building complexes); COACH [Wu et al. 2009], CORE [Leung et al. 2009], MCL-CAw [Srihari et al. 2010], and CACHET [Wu et al. 2012] (which incor- porate core-attachment information); and RNSC [King et al. 2004], DECAFF [Li et al. 2007] and PCP [Chua et al. 2008] (which incorporate functional information); see Table 9.1. Based on our comprehensive evaluation of these methods in Chap- ter 4, we noted that most methods performed well in detecting large (size≥4) and dense protein complexes from PPI networks that are filtered (high-confidence) and show high protein and interaction coverage; however, the performance of these methods dipped severely when evaluated on networks containing considerable noise (the methods produced inaccurate complexes or lumped together smaller complexes) or lacking sufficient coverage (the methods missed many sparse com- plexes). This is typically the case when the methods were applied to PPI networks from human and other less-characterized organisms compared to the networks from well-characterized model organisms (yeast). While these limitations are ex- pected to be mitigated eventually as larger and better-quality PPI datasets are gen- erated for non-model organisms including human, for now, detection of sparse and small complexes and deconvoluting overlapping complexes remain important open challenges in the area. Chapter 5 discussed these challenges by providing specific examples of complexes that are difficult to detect, and described meth- ods that attempt to tackle each of these challenges; summarized in Table 9.2. An important category of protein complexes that are often underrepresented in traditional experimental datasets are membrane protein complexes. This is due to the non-detectability of interactions among membrane proteins using tradi- tional experimental protocols (Y2H and AP/MS). Chapter 5 discussed some recent improvements in experimental protocols (e.g., PCA, MYTH, and MaMTH) to de- tect membrane protein interactions and how the prediction of membrane protein Chapter 9 Conclusion 227 complexes improved by applying existing prediction methods on these enhanced datasets. In addition to the above-mentioned challenges, it is also important to realize that protein interactions and therefore protein complexes are dynamic entities, whose formation and composition depend on the cellular context, e.g., availability and concentration (abundance), posttranslational modifications, and subcellular localization of the proteins [Hein et al. 2015]. However, traditional proteomics techniques do not provide such contextual information on the interactions, and consequently it becomes necessary to integrate information from other sources to discern the dynamics of protein complexes. Chapter 6 described methods that inte- grate temporal (e.g., in the form of gene-expression profiles), interaction-domain, 3D-structural, and localization information to identify dynamic protein complexes; see Table 9.3. An important category of dynamic protein complexes are the struc- turally disordered or fuzzy complexes, which change their structure and/or com- position depending on the cellular context. This class of protein complexes is not easy to predict solely by analyzing PPI networks, and methods incorporating dock- ing or molecular-dynamics simulations are required. Chapter 6 discussed ongoing efforts to discern fuzzy structures and their roles in cellular functioning. Protein complexes, being fundamental functional units, reveal important in- sights into how cellular functions and processes have evolved. In particular, conser- vation of core functions within cells (e.g., maintaining the integrity of the genome, cell-cycle regulation, intracellular transport, and metabolism) can be understood by studying protein complexes conserved through the evolution. Toward this end, computational methods have been proposed to identify conserved protein interac- tions and complexes; these methods formed the subject of Chapter 7. Importantly, these methods have helped to identify new protein complexes in higher-order or- ganisms whose PPI networks are less-characterized and sparse (e.g., in the case of human) by extrapolating interactions using orthology mapping from model and other well-characterized organisms (e.g., yeast, fruit fly, and nematode). Finally, in today’s era of systems biology where complex phenotypes are per- ceived as an intricate interplay of multiple biomolecules, accurate prediction of protein assembles (protein complexes and pathways) that drive these phenotypes has become crucial. In this regard, protein complex prediction methods have been applied to a number of contexts and datasets (e.g., from yeast-mutant to stress- response and complex-disease datasets); see Table 9.3. Chapter 8 higlighted some of these applications and the discoveries (e.g., in the form of new disease genes, drug-targets, or relevance to clinical outcomes) that have potentially resulted from these applications. It is worthwhile to note that a fundamental problem as protein 228 Chapter 9 Conclusion

complex prediction has far-reaching applications—from mapping cellular orga- nization to understanding phenotypic transformation and elucidating complex- disease mechanisms. Improvements in protein complex prediction and other PPI network analysis methods have enabled better quantification and enhancement of proteomics measurements, thus further enabling closer interplay between ex- perimental and computational techniques for better characterization of cellular function and organization. We hope that our book: (i) has served as a detailed guide to understanding the algorithmic and conceptual underpinnings of protein complex prediction; (ii) has provided valuable insights to inspire future research in the area, especially for tack- ling the open challenges; and (iii) will inspire new applications, such as in network proteomics and identifying disease genes and drug targets. We foresee that new ex- perimental technologies will emerge, and the consideration and exploitation of the newly generated data (as well as more ingenious use of existing data) will greatly advance the sensitivity and precision of protein complex prediction. And conse- quently, the current repertoire of protein complexes will be greatly expanded. This in turn will call for methodologies for inferring and understanding the function of protein complexes, their role in diseases, etc. It will also call for new interesting ideas to better analyze, for example, clinical profiling data in the context of pro- tein complexes, leading to more accurate identification of disease drivers, more robust signatures for disease diagnosis and prognosis, and more successful disease treatments. Table 9.1 “Classical” methods for protein complex prediction, which were described in Chapter 3 a

Classification Method Year Source Reference

PPI network clustering MCODE 2003 http://apps.cytoscape.org/apps/mcode [Bader and Hogue 2003]

MCL 2000, 2004 http://micans.org/mcl/ [Van Dongen 2000, Leal-Pereira et al. 2004]

http://apps.cytoscape.org/apps/clustermaker

SPC 2003 http://www.weizmann.ac.il/complex/compphys/sites/ complex.compphys/...

files/uploads/software/clustering_prl.ps [Spirin and Mirny 2003, Blatt et al. 1996]

LCMA 2005 Not available [Li et al. 2005]

CFinder 2006 http://www.cfinder.org/ [Adamcsek et al. 2006]

DPClus 2006 http://kanaya.naist.jp/DPClus/ [Altaf-Ul-Amin et al. 2006]

IPCA 2008 http://netlab.csu.edu.cn/bioinformatics/limin/IPCA/ [Li et al. 2008]

SuperComplex 2008 http://www.cs.cmu.edu/~qyj/SuperComplex/ [Qi et al. 2008]

CMC 2009 http://www.comp.nus.edu.sg/~wongls/projects/... [Liu et al. 2009]

complexprediction/CMC-26may09/

HACO 2010 http://www.bio.ifi.lmu.de/Complexes/ProCope/ [Friedel et al. 2009, Wang et al. 2009]

ClusterONE 2012 http://apps.cytoscape.org/apps/clusterone [Nepusz et al. 2012]

Ensemble 2012 http://compbio.ddns.comp.nus.edu.sg/~cherny/SWC/ [Yong et al. 2012] Core-attachment CORE 2009 http://alse.cs.hku.hk/complexes/ [Leung et al. 2009] structure COACH 2009 http://www1.i2r.a-star.edu.sg/~xlli/coach.zip [Wu et al. 2009] MCL-CAw 2009, 2010 http://sites.google.com/site/mclcaw/ [Srihari et al. 2009, Srihari et al. 2010] CACHET 2012 http://www1.i2r.a-star.edu.sg/~xlli/CACHET/CACHET.htm [Wu et al. 2012]

Functional RNSC 2004 http://www.cs.utoronto.ca/~juris/data/ppi04/ [King et al. 2004] homogeneity DECAFF 2007 http://www1.i2r.a-star.edu.sg/~xlli/DECAFF/complexes.html [Li et al. 2007] PCP 2008 http://www.comp.nus.edu.sg/~wongls/projects/... [Chua et al. 2008] complexprediction/PCP-3aug07/

a. These methods are based solely on the PPI network topology or include additional information (here, core-attachment and functional homogeneity) to predict protein complexes. Table 9.2 Protein complex prediction methods: Extended list I a

Classification Method Year Source Reference

Sparse complexes CFA 2010 http://www.bioinf.cs.ipm.ir/softwares/cfa/CFA.rar [Habibi et al. 2010]

SPARC 2012 http://sites.google.com/site/sparc/ [Srihari and Leong 2012a]

Fan 20120 http://faculty.cse.tamu.edu/shsze/ndcomplex [Fan et al. 2012]

SWC + COMBINED 2012 http://compbio.ddns.comp.nus.edu.sg/~cherny/SWC/ [Yong et al. 2012]

Overlapping complexes Gagneur 2004 Not available [Gagneur et al. 2004]

Zotenko 2006 Not available [Zotenko et al. 2006]

SPIN 2010 http://code.google.com/p/simultaneous-pin/ [Jung et al. 2010]

Ozawa 2010 Not available [Ozawa et al. 2010]

Liu 2011 Not available [Liu et al. 2011]

DACO 2014 http://sourceforge.net/projects/dacoalgorithm/ [Will and Helms 2014]

Small complexes Ruan 2013,2014 http://imi.kyushu-u.ac.jp/~om/NBC-HD-PCP/NBC-HD-PCP.jar [Ruan et al. 2013, Ruan et al. 2014] PPSampler 2013 http://imi.kyushu-u.ac.jp/~om/PPSamplerVer1.2/PPSamplerVer1_2.exe [Tatsuke and Maruyama 2013] SSS + EXTRACT 2014 http://www.comp.nus.edu.sg/~wongls/projects/... [Yong et al. 2014] ...complexprediction/sss-3dec2014.zip PPSampler2 2014 http://imi.kyushu-u.ac.jp/~om/PPSamplerVer2.3/PPSampler2.exe [Widita and Maruyama 2013] TRIBAL 2014 http://www.nature.com/articles/srep04262#s1 [Zaki and Mora 2014] Integrated pipeline Integrated 2015 http://www.comp.nus.edu.sg/~wongls/projects/... complexprediction/integrated-v1 Recent methods SiComPre 2015 http://www.cosbi.eu/research/prototypes/sicompre [Rizzetto et al. 2015] ClusterEP 2016 Not available [Liu et al. 2016]

a. This table includes methods developed to identify sparse, small, and overlapping protein complexes. Chapter 9 Conclusion 231 Marsh and Teichmann 2015 ] Ou-Yang et al. 2014 ] Marsh et al. 2013 , Goh and Wong 2016b ] Leiserson et al. 2015 ] Kelley et al. 2004 ] Li et al. 2012 , Goh et al. 2013b ] [ De Lichtenberg et al. 2005 ] [ Park and Bader 2012 ] [ Srihari and Leong 2012c ] [ Hanna et al. 2015 ] [ Van Dam and Snel 2008 ] [ Singh et al. 2008 ] [ Pache et al. 2012 ] [ Nguyen et al. 2013 ] [ Lim and Wong 2014 ] [ Lim et al. 2015 ] [ Goh and Wong 2016a , Reference [ Dudley et al. 2005 ] [ Fraser and Plotkin 2007 ] [ Lage et al. 2007 ] [ Vanunu et al. 2010 ] [ Yang et al. 2011 ] [ Vandin et al. 2011 , [ Hofree et al. 2013 ] [ Srihari and Ragan 2013 ] [ Zhao et al. 2013 ] [ Ori et al. 2016 ] a 2012 2015 2008 2012 2013 2013, 2014, 2015 [ Marsh and Teichmann 2014 , 2014 2015, 2016 Year 2005 2007 2010 2011 2011, 2015 2013 2013 2013 2016 De LichtenbergTang, Li, Ou-YangPark & Bader 2005 Srihari & 2011, Leong 2012, 2014Hanna [ Tang etPATHBLAST al. 2011 , 2012 SharanHirsh & Sharan 2003, 2004Van Dam & SnelIsoRank 2007NetAligner [ Kelley 2005 et al. 2008 2003 , Nguyen Marsh [ Hirsh and Sharan 2007 ] [ Sharan et al. 2005b ] PSPSNET FSNET and PFSNETESSNET 2015, 2016 2012 [ Goh et al. 2012 , Method Dudley Fraser & PlotkinLage Vanunu 2007 RWPCN HotNet NBS CONTOUR Zhao Ori n methods: Extended list II Classification Dynamic complexes Evolutionarily conserved complexes Complexes in network- based proteomics Complexes in phenotypes and diseases Protein complex predictio a. The table includes methods developed to identify dynamic, evolutionarily conserved, and disease complexes. Table 9.3 References

B. Adamcsek, G. Palla, I. J. Farkas, I. Der´enyi, and T. Vicsek. 2006. CFinder: locating cliques and overlapping modules in biological networks. Bioinformatics, 22(8): 1021–1023. DOI: 10.1093/bioinformatics/btl039. 64, 69, 70, 211, 226, 229 R. Aebersold and M. Mann 2003. Mass spectrometry-based proteomics. Nature 422(6928): 198–207. DOI: 10.1038/nature01511. 2 P. Aloy and R. B. Russell. 2006. Structural systems biology: modelling protein interactions. Nat. Rev. Mol. Syst. Biol., 7: 188–197. DOI: 10.1016/j.sbi.2007.05.005. 119 G. Agaptio, P. H. Guzzi, and M. Cannataro. 2013. Visualization of protein interaction networks: problems and solutions. BMC Bioinform., 14(Suppl. 1): S1. DOI: 10.1186/ 1471-2105-14-S1-S1. 31, 221 S. Agarwal, C. M. Deane, M. A. Porter, and N. S. Jones. 2010. Revisiting date and party hubs: novel approaches to role assignment in protein interaction networks. PLoS Comp. Biol., 6(6): e1000817. DOI: 10.1371/journal.pcbi.1000817. 149 S. E. Ahnert, J. A. Marsh, H. Hern´andez, C. A. Robinson, and S. A. Teichmann. 2017. Principles of assembly reveal a periodic table of protein complexes. Science, 350(6266): aaa2245. DOI: 10.1126/science.aaa2245. 184 R. Albalat and C. Canestro.˜ 2016. Evolution by gene loss. Nat. Rev. Genet., 17: 379–391. DOI: 10.1038/nrg.2016.39. 181 R. Albert and A. L. Barab´asi. 2002. Statistical mechanics of complex networks. Rev. Modern Phy., 74(1): 47–97. DOI: 10.1103/RevModPhys.74.47. 26, 28 B. Alberts. 1998. The cells as a collection of protein machines: preparing the next generation of molecular biologists. Cell, 92(3): 291–294. DOI: 10.1016/S0092-8674(00)80922-8. 1 P. Aloy, B. Bottcher,¨ H. Ceulemans, et al. 2004. Structure-based assembly of protein complexes in yeast. Science, 303(5666): 2026–2029. DOI: 10.1126/science.1092645. 11, 12, 112 P. Aloy and R. B. Russell. 2002. Interrogating protein interaction networks through structural biology. Proc. Natl. Acad. Sci. USA 2002, 99(9): 5896–5901. DOI: 10.1073/ pnas.092147999. 158 234 References

M. Altaf-Ul-Amin, Y. Shinbo, K. Mihara, et al. 2006. Development and implementation of an algorithm for detection of protein complexes in large interaction networks. BMC Bioinform., 7: 207. DOI: 10.1186/1471-2105-7-207. 64, 70, 229 C. W. Anderson and E. Appella. 2004. Signaling to the p53 tumor suppressor through pathways activated by genotoxic and nongenotoxic stress. In R. A. Bradshaw and E. A. Dennis, editors. Handbook of Cell Signaling, 237–247. Academic Press, New York. DOI: 10.1016/B978-0-12-374145-5.00264-3. 160 G. Apic, W. Huber, and S. A. Teichmann. 2003. Multi-domain protein families and domain pairs: comparison with known structures and a random model of domain recombination. J. Struct. Funct. Genom., 4: 67–78. 118 B. Aranda, H. Blankenburg, S. Kerrien, et al. 2011. PSICQUIC and PSISCORE: accessing and scoring molecular interactions. Nat. Meth., 8: 528–529. DOI: 10.1038/nmeth.1637. 51 M. Ashburner, C. A. Ball, J. A. Blake, et al. 2000. Gene Ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet., 25(1): 25–29. DOI: 10.1038/ 75556. 33, 42, 55, 85, 170, 221 S. Asur, D. Ucar, and S. Parthasarathy. 2007. An ensemble framework for clustering protein- protein interaction networks. Bioinformatics, 23(ISMB 2007 issue): i29–i40. DOI: 10.1093/bioinformatics/btm212. 78 C. Bokel.¨ 2008. EMS screens: From mutagenesis to screening and mapping. Meth. Mol. Biol., 420: 119–138. DOI: 10.1007/978-1-59745-583-1_7. 4 M. Babu, J. Vlasblom, S. Pu, et al. 2012. Interaction landscape of membrane-protein complexes in Saccharomyces cerevisiae. Nature, 489(7417): 585–589. DOI: 10.1038/ nature11354. 21, 23, 141 G. D. Bader, D. Betel, and C.W.V Hogue. 2003. BIND: the Biomolecular Interaction Network Database. Nucleic Acids Res., 31(1): 248–250. DOI: 10.1093/nar/gkg056. 23, 24 J. S. Bader, A. Chaudhuri, J. M. Rothberg, and J. Chant. 2004. Gaining confidence in high-throughput protein interaction networks. Nat. Biotechnol., 22(1): 78–85. DOI: 10.1038/nbt924. 6, 9, 19, 93 G. D. Bader and C. W. Hogue. 2002. Analyzing yeast protein-protein interaction data obtained from different sources. Nat. Biotechnol., 20: 991–997. DOI: 10.1038/nbt1002-991. 6, 19, 35, 36, 93, 172 G. D. Bader and C. W. Hogue. 2003. An automated method for finding molecular complexes in large protein interaction networks. BMC Bioinform., 4: 2. DOI: 10.1186/1471-2105- 4-2. 61, 64, 70, 94, 140, 211, 226, 229 A. Bairoch and R. Apweiler. 1996. The SWISS-PROT protein sequence data bank and its new supplement TREMBL. Nucleic Acids Res., 24(1): 21–25. DOI: 10.1093/nar/25.1.31 . 2, 33 T. A. Baker and S. P Bell. 1998. Polymerases and replisome: machines within machines. Cell, 92(3): 295–305. DOI: 10.1016/S0092-8674(00)80923-X. 1 References 235

D. Baltimore, P. Berg, M. Botchan, et al. 2015. Biotechnology. A prudent path forward for genomic engineering and germline gene modification. Science, 348: 36–38. DOI: 10.1126/science.aab1028. 4 A. L. Barab´asi, N. Gulbahce, and J. Loscalzo. 2011. Network medicine: A network-based approach to human disease. Nat. Rev. Genet., 12(1): 56–68. DOI: 10.1038/nrg2918. 185, 192, 193 A. L. Barab´asi and R. Albert. 1999. Emergence of scaling in random networks. Science, 286(5439): 509–512. DOI: 10.1126/science.286.5439.509. 26, 27, 28, 47 A. L. Barab´asi. 2007. Network medicine - from obesity to the ‘Diseasome’. New Engl. J. Med., 357: 404–407. DOI: 10.1056/NEJMe078114. 193 F. N. Barrera, M. L. Renart, J. A. Poveda, et al. 2008. Protein self-assembly and lipid binding in the folding of the potassium channel KcsA. Biochemistry, 47(7): 2123–2133. DOI: 10.1021/bi700778c. 22 T. Barrett, S. E. Wilhite, P. Ledoux, et al. 2013. NCBI GEO: archive for functional genomics datasets: Update. Nucleic Acids Res., 41(Database issue): D991–995. DOI: 10.1093/nar/ gks1193. 154 M. Barrios-Rodiles, K. R. Brown, B. Ozdamar, et al. 2005. High-throughput mapping of a dynamic signaling network in mammalian cells. Science, 307(5715): 1621–1625. DOI: 10.1126/science.1105776. 199 O. Basha, D. Flom, R. Bashir, et al. 2015. MyProteinNet: build up-to-date protein interaction networks for organisms, tissues and user-defined contexts. Nucleic Acids Res., 43(Web Server issue): W258–W263. DOI: 10.1093/nar/gkv515. 51, 223 N. N. Batada, L. A. Shepp, and D. O. Siegmund. 2004. Stochastic model of protein-protein interaction: Why signaling proteins need to be colocalized. Proc. Natl. Acad. Sci. USA, 101(17): 6445–6449. DOI: 10.1073/pnas.0401314101. 43 N. N. Batada, T. Reguly, A. Breitkreutz, L. Boucher, et al. 2006. Stratus not altocumulus: A new view of the yeast protein interaction network. PLoS Biol., 4(10): e317. DOI: 10.1371/journal.pbio.0040317. 149 S. Bauer, P. N. Robinson, and J. Gagneur. 2011. Model-based gene set analysis for bioconductor. Bioinformatics, 27(13): 1882–1883. DOI: 10.1093/bioinformatics/ btr296. 188 J. J. Benschop, N. Brabers, D. van Leenan, et al. 2010. A consensus of core protein complex compositions for Saccharomyces cerevisiae. Mol. Cell, 38(6): 916–928. 77 B. Berger, J. Peng, and M. Singh. 2013. Computational solutions for omics data. Nat. Rev. Genet., 14(5): 333–346. DOI: 10.1038/nrg3433. 221 T. Bergg´ard, S. Linse, and P. James. 2007. Methods for the detection and analysis of protein- protein interactions. Proteomics, 7(16): 2833–2842. DOI: 10.1002/pmic.200700131. 4 R. B. Berlow, H. J. Dyson, and P. E. Wright. 2015. Functional advantages of dynamic protein disorder. FEBS Lett, 19(A): 2433–2440. DOI: 10.1016/j.febslet.2015.06.003. 146, 148 236 References

H. M. Berman, J. Westbrook, Z. Feng, et al. 2000. The Protein Data Bank. Nucleic Acids Res., 28(1): 235–242. DOI: 10.1155/2010/257512. 56, 147, 158, 161, 182 A. Beyer, S. Bandyopadhyay, and T. Ideker. 2007. Integrating physical and genetic maps: from genomes to interaction networks. Nat. Rev. Genet., 8(9): 699–710. DOI: 10.1038/ nrg2144. 190 N. Bhardwaj, A. Abyzov, D. Clarke, et al. 2011. Integration of protein motions with molecular networks reveals different mechanisms for permanent and transient interactions. Protein Sci., 20(10): 1745–1754. DOI: 10.1002/pro.710. 188 N. Bisson, D. A. James, G. Ivosev, et al. 2011. Selected reaction monitoring mass spectrometry reveals the dynamics of signaling through the GRB2 adaptor. Nat. Biotechnol., 29(7): 653–658. DOI: 10.1038/nbt.1905. 200 A. K. Bjorklund,¨ D. Ekman, S. Light, et al. 2005. Domain rearrangements in protein evolution. J. Mol. Biol., 353: 911–923. DOI: 10.1016/j.jmb.2005.08.067. 181, 182 M. Blatt, S. Wiseman, and E. Domany. 1996. Superparamagnetic clustering of data. Phys. Rev. Lett., 76(18): 3251–3254. 64, 66, 67, 229 M. L. Blinov, J. R. Faeder, B. Goldstein, and W. S. Hlavacek. 2004. BioNetGen: software for rule-based modeling of signal transduction based on the interactions of molecular domains. Bioinformatics, 20(17): 3289–3291. DOI: 10.1093/bioinformatics/bth378. 189 P. Blohm, G. Frishman, P. Smialowski, et al. 2014. Negatome 2.0: a database of non- interacting proteins derived by literature mining, manual annotation and protein structure analysis. Nucleic Acids Res., 42(Database issue): D396–D400. DOI: 10.1093/ nar/gkt1079. 55 V. A. Blomen, P. Majek, L. T. Jae, et al. 2015. Gene essentiality and synthetic lethality in haploid human cells. Science, 350(6264): 1092–1096. DOI: 10.1126/science.aac7557. 4 T. Bloome, K. Vandepoele, S. De Bodt, et al. 2006. The gain and loss of genes during 600 million years of vertebrate evolution. Genome Biol., 7: R43. DOI: 10.1186/gb-2006-7-5- r43. 165 B. Bollob´as. 1985. Random Graphs. London Mathematical Society Monographs, Academic Press, London, 447. 25, 27 C. Boone, H. Bussey, and B. J. Andrews. 2007. Exploring genetic interactions and networks with yeast. Nat. Rev. Genet. 2007, 8(6): 437–449. DOI: 10.1038/nrg2085. 189, 190 P. Bork, K. Hoffman, P. Bucher, et al. 1997. A superfamily of conserved domains in DNA- damage response cell cycle checkpoint proteins. FASEB J., 11(1): 68–76. 166, 181, 182 P. M. Bowers, M. Pellegrini, M. J. Thompson, et al. 2004. Prolinks: a database of protein functional linkages derived from coevolution. Genome Biol., 5: R35. DOI: 10.1186/gb- 2004-5-5-r35. 52, 55 References 237

G. E. P. Box. 1979. Robustness in the strategy of scientific model building. In R. L. Launer and G. N. Wilkinson, editors. Robustness in Statistics, Academic Press New York. DOI: 10.1016/B978-0-12-438150-6.50018-2. 59 A. Br¨uckner,C. Polge, N. Lentze, et al. 2009. Yeast two-hybrid, a powerful tool for systems biology. Int. J. Mol. Sci., 10(6): 2763–2788. DOI: 10.3390/ijms10062763. 17, 20, 21 B. J. Breitkreutz, C. Stark, and M. Tyers. 2003. Osprey: A network visualization system. Genome Biol., 4: R22. DOI: 10.1186/gb-2003-4-3-r22. 33 C. B. Bridges. 1922. The origin of variation. Amer. Nat., 56: 51–63. DOI: 10.1086/279847. 189 S. Brohee and J. van Helden. 2006. Evaluation of clustering algorithms for protein-protein interaction networks. BMC Bioinform., 7: 488. DOI: 10.1186/1471-2105-7-488. 66, 82, 91 J. A. Brown, G. Sherlock, C. L. Myers, et al. 2006. Global analysis of gene function in yeast by quantitative phenotypic profiling. Mol. Syst. Biol., 2: 2006.001. DOI: 10.1038/ msb4100043. 17 K. R. Brown, D. Otasek, M. Ali, et al. 2009. NAViGaTOR: Network Analysis, Visualization and Graphing Toronto. Bioinformatics, 25(24): 3327–3329. DOI: 10.1093/bioinformatics/ btp595. 33, 223 K. R. Brown and I. Jurisca. 2007. Unequal evolutionary conservation of human protein interactions in interologous networks. Genome Biol., 8: R95. DOI: 10.1186/gb-2007-8- 5-r95. 51, 178 K. R. Brown and I. Jurisica. 2005. Online predicted human interaction database. Bioinformatics, 21(9): 2076–2082. DOI: 10.1093/bioinformatics/bti273. 24, 25, 51, 56, 169 B. Bryson 2003. The stuff of life. In A Brief History of Nearly Everything: Chapter 26. Black Swan (UK) Broadway Books (US), 140. 2 C. J. Bult, J. A. Kadin, J. E. Richardson, et al. 2010. The Mouse Genome Database: enhancements and updates. Nucleic Acids Res., 38(Database issue): D586–D592. DOI: 10.1093/nar/gkp880. 3 G. Butland, J. M. Peregrin-Alvarez, J. Li, et al. 2005. Interaction network containing conserved and essential protein complexes in Escherichia coli. Nature, 433(7025): 531–537. DOI: 10.1038/nature03239. 6 B. Byrne and S. Iwata. 2002. Membrane protein complexes. Curr. Opin. Struct. Biol., 12(2): 239–243. DOI: 10.1016/S0959-440X(02)00316-0. 22 J. J. Cai, E. Borenstein, and D. A. Petrov. 2010. Broker genes in human disease. Genome Biol. Evol., 2: 815–825. DOI: 10.1093/gbe/evq064. 194 E. P. Carpenter, K. Beis, A. D. Cameron, and S. Iwata. 2008. Overcoming the challenges of membrane protein crystallography. Curr. Opin. Struct. Biol., 18(5): 581–586. DOI: 10.1016/j.sbi.2008.07.001. 22 238 References

E. G. Cerami, B. E. Gross, E. Demir, et al. 2011. Pathway Commons, a web resource for biological pathway data. Nucleic Acids Res., 39(Database issue): D685–D690. DOI: 10.1093/nar/gkq1039. 56, 203, 223 S. Y. Chan and J. Loscalzo. 2012. The emerging paradigm of network medicine in the study of human disease. Circ. Res., 111(3): 359–374. DOI: 10.1161/CIRCRESAHA.111.258541. 192 A. Chatr-Aryamontri, A. Ceol, L. M. Palazzi, et al. 2007. MINT: the Molecular INTeraction database. Nucleic Acids Res., 35(Database issue): D572–D574. DOI: 10.1093/nar/ gkl950. 20, 24, 49, 170 A. Chatr-Aryamontri, B. J. Breitkreutz, R. Oughtred, et al. 2015. The BioGRID interaction database: 2015 update. Nucleic Acids Res., 43(Database issue): D470–D478. DOI: 10.1093/nar/gku1204. 8, 55 E. Chautard, M. Fatoux-Ardore, L. Ballut, N. Thierry-Mieg, and S. Ricard-Blum. 2011. MatrixDB, the extracellular matrix interaction database. Nucleic Acids Res., 39: D235– D240. DOI: 10.1093/nar/gkq830. 51 J .Y. Chen, S. Mamidipalli, and T. Huan. 2009. HAPPI: an Online Database of Comprehensive Human Annotated and Predicted Protein Interactions. BMC Genom., 10(Suppl 1): S16. 24 W. H. Chen, P. Minguez, M. J. Lercher, and P. Bork. 2012. OGEE: an online gene essentiality database. Nucleic Acids Res., D40: D901–D906. DOI: 10.1093/nar/gkr986. 2 B. Chen, W. Fan, J. Liu, and F. X. Wu. 2014. Identifying protein complexes and functional modules—from static PPI networks to dynamic PPI networks. Brief Bioinform., 15(2): 177–194. DOI: 10.1093/bib/bbt039. 61 Y. Chen, T. Jiang, and R. Jiang. 2014. Uncover disease genes by maximizing information flow in the phenome-interactome network. Bioinformatics, 27(13): 67–76. DOI: 10.1093/bioinformatics/btr213. 196 Y. Cheng and G. M. Church. 2000. Biclustering of expression data. In: Proc. Eighth Int. Conf. Intell. Syst. Mol. Biol., 93–103. 154 M. J. Cherry, E. L. Hong, C. Amundsen, et al. 2012. Saccharomyces Genome Database: the genomics resource of budding yeast. Nucleic Acids Res., 40(Database issue): D700–D705. DOI: 10.1093/nar/gkr1029. 2, 3, 179 C. T. Chien, P. L. Bartel, R. Sternglanz, and S. Fields. 1991. The two-hybrid system: a method to identify and clone genes for proteins that interact with a protein of interest. Proc. Natl. Acad. Sci. USA, 88: 9578–9582. 17 H. Choi, B. Larsen, Z. Y. Lin, et al. 2011. SAINT: probabilistic scoring of affinity purification- mass spectrometry data. Nat. Meth., 8: 70–73. 37, 40 C. Chothia and J. Gough. 2009. Genomic and structural aspects of protein evolution. Biochem. J., 419: 15–28. DOI: 10.1042/BJ20090122. 118 References 239

H. N. Chua, W. K. Sung, and L. Wong. 2006. Exploiting indirect neighbours and topological weight to predict protein function from protein-protein interactions. Bioinformatics, 22(13): 1623–1630. DOI: 10.1093/bioinformatics/btl145. 37, 44, 54, 86, 123 H. N. Chua, K. Ning, W. K. Sung, et al. 2008. Using indirect protein-protein interactions for protein complex prediction. J. Bioinform. Comp. Biol., 6(3): 435–466. DOI: 10.1142/ 9781860948732_0014. 64, 85, 86, 107, 211, 226, 229 H. N. Chua, W. Hugo, G. Liu, et al. 2009. A probabilistic graph-theoretic approach to integrate multiple predictions for the protein-protein subnetwork prediction challenge. Ann. NY Acad. Sci., 1158: 224–233. 48 H. Chuang, E. Lee, Y. T. Liu, D. Lee, and T. Ideker. 2007. Network-based classification of breast cancer metastasis. Mol. Syst. Biol., 3: 140. DOI: 10.1038/msb4100180 . 192 T. Clancy, E. A. Rødland, S. Nygard, and E. Hovig. 2013. Predicting physical interactions between protein complexes. Mol. Cell Proteom., 12(6): 1723–1734. DOI: 10.1074/mcp .O112.019828. 187 T. Clancy and E. Hovig. 2014. From proteomes to complexomes in the era of systems biology. Proteomics, 14(1): 24–41. DOI: 10.1002/pmic.201300230. 186 M. S. Cline, M. Smoot, E. Cerami, A. Kuchinsky, et al. 2007. Integration of biological networks and gene expression data using Cytoscape. Nat. Protoc., 2(10): 2366–2382. DOI: 10.1371/journal.pone.0043092. 213, 221 S. R. Collins, P. Kemmeren, X. C. Zhao, et al. 2007. Toward a comprehensive atlas of the physical interactome of Saccharomyces cerevisiae. Mol. Cell Proteom., 6(3): 439–450. DOI: 10.1074/mcp.M600381-MCP200. 37, 42, 93, 108, 111, 141, 151, 186 M. Costanzo, A. Baryshnikova, J. Bellay, et al. 2010. The genetic landscape of a cell. Science, 327(5964): 425–431. DOI: 10.1126/science.1180823. 189, 190 M. Costanzo, B. VanderSluis, E. N. Koch, et al. 2016. A global genetic interaction network maps a wiring diagram of cellular function. Science, 353(6306): aaf1420-1. DOI: 10.1126/science.aaf1420. 189 M. J. Cowley, M. Pinese, K. S. Kassahn, et al. 2012. PINA v2.0: mining interactome modules. Nucleic Acids Res., 40(Database issue): D862–D865. DOI: 10.1093/nar/gkr967. 223 J. Cox and M. Mann. 2007. Is proteomics the new genomics? Cell, 130: 395–398. DOI: 10.1016/j.cell.2007.07.032. 2, 209 P. Cramer, D. A. Bushnell, and R. D. Kornberg. 2001. Structural basis of transcription: RNA polymerase II at 2.8 Angstrom resolution. Science, 292(5523): 1863–1876. DOI: 10.1126/science.1059493. 148 P. Creixell, A. Palmeri, C. J. Miller, et al. 2015a. Unmasking determinants of specificity in the human kinome. Cell, 163(1): 187–201. DOI: 10.1016/j.cell.2015.08.057. 200 P. Creixell, E. M. Schoof, C. D. Simpson, et al. 2015b. Kinome-wide decoding of network- attacking mutations rewiring cancer signaling. Cell, 163(1): 202–217. DOI: 10.1016/j .cell.2015.08.056. 200 240 References

P. Csermely, T. Korcsm´aros, H. J. M. Kiss, G. London, and R. Nussinov. 2013. Structure and dynamics of molecular networks: a novel paradigm of drug discovery: A comprehensive review. Pharmacol. Therapeut., 138(3): 333–408. DOI: 10.1016/j .pharmthera.2013.01.016. 192 M. E. Cusick, H. Yu, A. Smolyar, et al. 2009. Literature-curated protein interaction datasets. Nat. Meth., 6: 39–46. DOI: 10.1038/nmeth.1284. 35, 36, 52 D. O. Daley. 2008. The assembly of membrane proteins into complexes. Curr. Opin. Struct. Biol., 18(4): 420–424. DOI: 10.1016/j.sbi.2008.04.006. 143 T. Dandekar, B. Snel, M. Huynen, and P. Bork. 1998. Conservation of gene order: a fingerprint of proteins that physically interact. Trend Biochem. Sci., 23(9): 324–328. DOI: 10.1016/ S0968-0004(98)01274-2. 52, 55 V. Danos and C. Laneve. 2004. Formal molecular biology. Theor. Comp. Sci. 2004, 325(1): 69–110. 189 J. Das, T. V. Vo, X. Wei, J. C. Mellor, et al. 2013. Cross-species protein interactome mapping reveals species-specific wiring of stress response pathways. Sci. Sig., 6(276): ra38. DOI: 10.1126/scisignal.2003350. 169 N. E. Davey, J. L. Cowan, D. C. Shields, T. J. Gibson, et al. 2012. SLiMPrints: conservation- based discovery of functional motif fingerprints in intrinsically disordered protein regions. Nucleic Acids Res. 2012, 40: 10628–10641. DOI: 10.1093/nar/gks854. 164 T. De Domenico, I. Walsh, T. J. Martin, and S. C. Tosatto. 2012. MobiDB: a comprehensive database of intrinsic protein disorder annotations. Bioinformatics, 28: 2080–81. 156 J. De la Cruz, D. Kressler, and P. Linder. 1999. Unwinding RNA in Saccharomyces cerevisiae: DEAD-box proteins and related families. Trend Biochem. Sci., 24: 192–198. DOI: 10.1016/S0968-0004(99)01376-6. 172 U. De Lichtenberg, L. J. Jensen, S. Brunak, and P. Bork. 2005. Dynamic complex formation during the yeast cell cycle. Science, 307(5710): 724–727. DOI: 10.1126/science .1105103. 150, 152, 231 J. De Ruyck, G. Brysbaert, R. Blossey, and M. F. Lensink. 2016. Molecular docking as a popular tool in drug design, an in silico travel. Adv. Appl. Bioinform. Chem., 9: 1–11. DOI: 10.2147/AABC.S105289. 158 S. De Vries and M. Zacharias. 2013. Flexible docking and refinement with a coarse-grained protein model using ATTRACT. Proteins, 81: 2167–2174. DOI: 10.1002/prot.24400. 158 M. Deng, S. Mehta, F. Sun, and T. Chen. 2002. Inferring domain-domain interactions from protein-protein interactions. Genome Res., 12: 1540–1548. DOI: 10.1101/gr.153002. 118 I. Der´enyi, G. Palla, and T. Vicsek. 2005. Clique percolation in random networks. Phys. Rev. Lett., 94: 160202. 69 References 241

G. Dey, A. Jaimovich, S. R. Collins, A. Seki, and T. Meyer. 2015. Systematic discovery of human gene function and principles of modular organization through phylogenetic profiling. Cell Rep., 10(6): 993–1006. DOI: 10.1016/j.celrep.2015.01.025. 167 L. R. Dice. 1945. Measures of the amount of ecologic association between species. Ecology, 26(3): 297–302. DOI: 10.2307/1932409. 39 R. Diestel. 2000. Graph Theory. Springer-Verlag, New York. 112 K. A. Dill and H. S. Chan. 1997. From Levinthal to pathways to funnels. Nat. Struct. Biol., 4(1): 10–19. DOI: 10.1038/nsb0197-10. 182 F. M. Disfani, W. L. Hsu, M. J. Mizianty, et al. 2012. MoRFpred, a computational tool for sequence-based prediction and characterization of short disorder-to-order transitioning binding regions in proteins. Bioinformatics, 28: i75-i83. DOI: 10.1093/ bioinformatics/bts209. 164 G. Diss, A. K. Dub´e, et al. 2013. A systematic approach for the genetic dissection of protein complexes in living cells. Cell Rep., 3(6): 2155–2167. DOI: 10.1016/j.celrep.2013.05 .004. 191 S. J. Dixon, Y. Fedyshyn, J. L. Y. Koh, et al. 2008. Significant conservation of synthetic lethal genetic interaction networks between distantly related eukaryotes. Proc. Natl. Acad. Sci. USA, 105(43): 16653–16658. DOI: 10.1073/pnas.0806261105. 200 S. J. Dixon, M. Costanzo, A. Baryshnikova, B. Andrews, and C. Boone. 2009. Systematic mapping of genetic interaction networks. Ann. Rev. Genet., 43: 601–625. DOI: 10.1146/annurev.genet.39.073003.114751. 200 N. T. Doncheva, Y. Assenov, F. S. Domingues, and M. Albercht. 2012. Topological analysis and interactive visualization of biological networks and protein structures. Nat. Protoc., 7: 670–685. DOI: 10.1038/nprot.2012.004. 31 Z. Doszt´anyi, V. Csizmok, P. Tompa, and I. Simon I. 2005. The pairwise energy content estimated from amino acid composition discriminates between folded and intrinsically unstructured proteins. J. Mol. Biol., 347: 827–839. DOI: 10.1016/j.jmb .2005.01.071. 157, 163 E. J. P. Douzery, E. A. Snell, E. Bapteste, et al. 2004. The timing of eukaryotic evolution: Does a relaxed molecular clock reconcile proteins and fossils? Proc. Natl. Acad. Sci. USA, 101(43): 15386–15391. DOI: 10.1073/pnas.0403984101. 4 K. Drew, C. Lee, R. L. Huizar, et al. 2017. A synthesis of over 9,000 mass spectrometry experiments reveals the core set of human protein complexes. BioRxiv. DOI: 10.1101/ 092361. 12, 13 Y. Drier, M. Sheffer, and E. Domany. 2013. Pathway-based personalized analysis of cancer. Proc. Natl. Acad. Sci. USA, 110(16): 6388–6393. DOI: 10.1073/pnas.1219651110. 213, 214 A. M. Dudley, D. M. Janse, A. Tanay, R. Shamir, and G. M. Church. 2005. A global view of pleiotropy and phenotypically derived gene function in yeast. Mol. Syst. Biol.,1: E1–E11. DOI: 10.1038/msb4100004. 189, 209, 231 242 References

T. Dufree, K. Becherer, P. L. Chen, et al. 1993. The retinoblastoma protein associates with the protein phosphatase type 1 catalytic subunit. Genes Dev., 7(4): 555–569. DOI: 10.1101/gad.7.4.555. 17 W. H. Dunham, M. Mullin, and A. C. Gingras. 2012. Affinity-purification coupled to mass spectrometry: basic principles and strategies. Proteomics, 12(10): 1576–1590. DOI: 10.1002/pmic.201100523. 15, 19 K. A. Dunker, D. J. Lawson, C. J. Brown, et al. 2001. Intrinsically disordered protein. J. Mol. Graph. Model, 19(1): 26–59. 146, 156 A. K. Dunker, C. J. Brown, and L. Obradovi´c. 2002. Identification and functions of usefully disordered proteins. Adv. Protein Chem., 62: 25–49. DOI: 10.1016/S0065- 3233(02)62004-2. 156 K. A. Dunker, M. S. Cortese, P. Romero, et al. 2005. Flexible nets. The roles of intrinsic disorder in protein interaction networks. FEBS J., 272: 5129–48. DOI: 10.1111/j.1742- 4658.2005.04948.x. 159, 160 K. A. Dunker, S. E. Bondos, F. Huang, and C. J. Oldfield. 2015. Intrinsically disordered proteins and multicellular organisms. Semin. Cell. Dev. Biol., 37: 44–55. DOI: 10.1016/ j.semcdb.2014.09.025. 146, 156, 160, 162 I. S. Dunn. 2010. Searching for Molecular Solutions: Empirical Discovery and Its Future. John Wiley & Sons, New Jersey. 15 S. Durinck, Y. Moreau, A. Kasprzyk, et al. 2005. BioMart and Bioconductor: a powerful link between biological databases and microarray data analysis. Bioinformatics, 21: 3439–3340. DOI: 10.1093/bioinformatics/bti525. 198 M. L. Dustin and J. T. Groves. 2012. Receptor signaling clusters in the immune synapse. Ann. Rev. Biophys., 41: 543–556. 188 H. J. Dyson and P. E. Wright. 2005. Intrinsically unstructured proteins and their functions. Nat. Rev. Mol. Cell Biol., 6: 197–208. DOI: 10.1038/nrm1589. 156 T. Ehrenberger, L. C. Cantley, and M. B. Yaffe. 2015. Computational prediction of protein- protein interactions. Meth. Mol. Bio., 1278: 57–75. DOI: 10.1007/978-1-4939-2425-7_4. 52 D. Ekman, S. Light, A. K. Bjorklund, and A. Elofsson. 2006. What properties characterize the hub proteins of the protein-protein interaction network of Saccharomyces cerevisiae?. Genome Biol., 7(6): R45-10.1186. DOI: 10.1186/gb-2006-7-6-r45. 159 M. J. Ellis, M. Gillette, S. A. Carr, et al. 2013. Connecting genomic alterations to cancer biology with proteomics: The NCI clinical proteomic tumor analysis consortium. Cancer Discov., 3: 1108. DOI: 10.1158/2159-8290. 221 A. J. Enright, S. M. van Dongen, and C. A. Ouzounis. 2002. An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res., 30(7): 1575–1584. DOI: 10.1093/nar/30.7.1575. 65, 142, 226 References 243

A. J. Enright and C. A. Ouzounis. Functional associations of proteins in entire genomes by means of exhaustive detection of gene fusions. Genome Biol., 2(9): 34. DOI: 10.1186/gb-2001-2-9-research0034. 54 J. M. Enserink and R. D. Kolodner. 2010. An overview of Cdk1-controlled targets and processes. Cell Div., 5(11). DOI: 10.1186/1747-1028-5-11. 145, 148, 159 P. Erdos¨ and A. R´enyi. 1960. On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci. 1960, (Ser. A5): 17–61. 25, 27, 70 A. Erg¨un,C. A. Lawrence, M. A. Kohanski, T. A. Brennan, and J. J. Collins. 2007. A network biology approach to prostate cancer. Mol. Syst. Biol., 3: 82. DOI: 10.1038/msb4100125. 193 M. D. Ermolaeva, O. White, and S. L. Salzberg. 2001. Prediction of operons in microbial genomes. Nucleic Acids Res., 29(5): 1216–1221. DOI: 10.1093/nar/29.5.1216. 52 H. Eves. 1998. Return to Mathematical Circles. Prindle, Weber and Schmidt Series in Mathematics, Boston. 107 R. M. Ewing, P. Chu, F. Elisma, et al. 2007. Large-scale mapping of human protein-protein interactions by mass spectrometry. Mol. Syst. Biol., 3: 89. DOI: 10.1038/msb4100134. 20 C. Ezzell. 2002. Proteins rule. Sci. Amer., 44–45. 1 J. H. Fan, J. Chen, and S. H. Sze. 2012. Identifying complexes from protein interaction networks according to different types of neighborhood density. J. Comp. Biol., 19(12): 1284–94. 114, 230 U. M. Fayyad and K. B. Irani. 1993. Multi-interval discretization of continuous-valued attributes for classification learning. Proc. 13th Int. Joint Conf. Arti. Intel., Morgan Kauffman, 1022–1027. 49, 127 I. Feldman, A. Rzhetsky, and D. Vitkup. 2008. Network properties of genes harboring inherited disease mutations. Proc. Natl. Acad. Sci. USA, 105(11): 4323–4328. DOI: 10.1073/pnas.0701722105. 192 S. Fields and O. Song. 1989. A novel genetic system to detect protein-protein interactions. Nature, 340(6230): 245–246. DOI: 10.1038/340245a0. 6, 17 R. L. Finley Jr. and R. Brent. 1994. Interaction mating reveals binary and ternary connections between Drosophila cell cycle regulators. Proc. Natl. Acad. Sci. USA, 91(26): 12980– 12984. 17 R. D. Finn, M. Marshall, and A. Bateman. 2005. iPfam: visualization of protein-protein interactions in PDB at domain and amino acid resolutions. Bioinformatics, 21(3): 410–412. DOI: 10.1093/bioinformatics/bti011. 121, 170 P. Flicek, M. R. Amode, D. Barrell, et al. 2011. Ensembl 2011. Nucleic Acids Res., 39(Database issue): D800–D806. DOI: 10.1093/nar/gkq1064. 170 I. Florian, N. Macha, M. Bertrand, et al. 2005. Proviz: protein interaction visualization and exploration. Bioinformatics, 21(2): 272–274. DOI: 10.1093/bioinformatics/bth494. 33 244 References

J. H. Fong, L. Y. Geer, A. R. Panchenko, and S. H. Bryant. 2007. Modeling the evolution of protein domain architectures using maximum parsimony. J. Mol. Biol., 366: 307–315. DOI: 10.1016/j.jmb.2006.11.017. 118 P. Fort, A. V. Kajava, F. Delsuc, and O. Coux. 2015. Evolution of proteasome regulators in eukaryotes. Genome Biol. Evol., 7(5): 1363–1379. DOI: 10.1093/gbe/evv068. 180 E. J. Foss, D. Radulovic, S. A. Shaffer, et al. 2007. Genetic basis of proteome variation in yeast. Nat Genet., 39: 1369–1375. DOI: 10.1038/ng.2007.22. 2 J. S. Fraser, J. D. Gross, and N. J. Krogan. 2013. From Systems to structure: bridging networks and mechanism. Mol. Cell 2013, 49(2): 222–231. DOI: 10.1016/j.molcel.2013.01.003. 200 H .B. Fraser and J. B. Plotkin. 2007. Using protein complexes to predict phenotypic effects of gene mutation. Genome Biol., 8: R252. DOI: 10.1186/gb-2007-8-11-r252. 190, 216, 231 C. C. Friedel, J. Krumsiek, and R. Zimmer. 2009. Bootstrapping the interactome: unsupervised identification of protein complexes in yeast. J. Comp. Biol., 16(8): 971–987. DOI: 10.1007/978-3-540-78839-3_2. 37, 39, 64, 95, 229 M. Friedman. 1953. Essays in Positive Economics Part I - The Methodology of Positive Economics. University of Chicago Press, Chicago, 3–43. 91 S. Fukuchi, S. Sakamoto, Y. Nobe, et al. 2012. IDEAL: Intrinsically Disordered Proteins with Extensive Annotations and Literature. Nucleic Acids Res., 40: D507–D511. DOI: 10.1093/nar/gkr884. 156 K. Fukunaga. 1990. Introduction to Statistical Pattern Recognition.Academic Press, San Diego. 66 A. Funahashi, M. Morohashi, Y. Matsuoka, A. Jouraku, and H. Kitano. 2007. CellDesigner: a graphical biological network editor and workbench interfacing simulator. Intro. Syst. Biol., 422–434. DOI: 10.1007/978-1-59745-531-2_21. 222, 223 M. Fuxreiter and P. Tompa. 2012. Fuzzy complexes: a more stochastic view of protein function. Fuzziness: Structural Disorder in Protein Complexes (Edited by Monika Fuxreiter and Peter Tompa, 2012 Landes Bioscience and Springer Science). 162 U. Gu¨uldener,M. M¨unsterkotter,¨ G. Kastenm¨uller,et al. 2005. CYGD: the Comprehensive Yeast Genome Database. Nucleic Acids Res., 33(Database issue): D364–368. DOI: 10.1093/nar/gki053. 24, 25 J. Gagneur, R. Krause, T. Bouwmeester, and G. Casari. 2004. Modular decomposition of protein-protein interaction networks. Genome Biol., 5: R57. DOI: 10.1186/gb-2004-5- 8-r57. 123, 230 M. Y. Galperin and E. V. Koonin. 2000. Who’s your neighbor? New computational approaches for functional genomics. Nat. Biotechnol., 18(6): 609–613. DOI: 10.1038/76443. 53, 55, 167 References 245

G. Gambardella, M. N. Moretti, R. de Cegli, et al. 2013. Differential network analysis for the identification of condition-specific pathway activity and regulation. Bioinformatics, 29(14): 1776–1785. DOI: 10.1093/bioinformatics/btt290. 223 T. K. Gandhi, J. Zhong, S. Mathivanan, et al. 2006. Analysis of the human protein interactome and comparison with yeast, worm and fly interaction datasets. Nat Genet., 38(3): 285– 293. DOI: 10.1038/ng1747. 194 J. Garcia-Garcia, E. Guney, R. Aragues, J. Planas-Iglesias, and B. Oliva. 2010. BIANA: A software framework for compiling biological interactions and analyzing networks. BMC Bioinform., 11: 56. 169, 223 J. Garcia-Garcia, S. Schleker, J. Klein-Seetharaman, and B. Oliva. 2012. BIPS: BIANA Interolog Prediction Server. A tool for protein-protein interaction inference. Nucleic Acids Res., 40(Web server issue): W147–W151. DOI: 10.1093/nar/gks553. 169, 223 N. P. Gauthier, L. J. Jensen, R. Wernersson, S. Brunak, et al. 2008. Cyclebase.org - a comprehensive multiorganism online database of cell-cycle experiments. Nucleic Acids Res., 36(Database issue): D854–859. DOI: 10.1093/nar/gkm729. 151 A. C. Gavin, M. Bosche, R. Krause, et al. 2002. Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature, 415: 141–147. DOI: 10.1038/ 415141a. 6, 20, 40, 179 A. C. Gavin, P. Aloy, P. Grandi, et al. 2006. Proteome survey reveals modularity of the yeast cell machinery. Nature, 440: 631–636. DOI: 10.1038/nature04532. 6, 20, 34, 37, 38, 39, 40, 78, 79, 93, 99, 108, 134, 140, 141, 163, 179, 189, 209 F. Gavril. 1974. The intersection graphs of subtrees in trees are exactly the chordal graphs. J. Combinatorial Theor., 16(1): 47–56. DOI: 10.1016/0095-8956(74)90094-X. 124 T. Geiger, A. Wehner, C. Schaab, J. Cox, and M. Mann. 2012. Comparative proteomic analysis of eleven common cell lines reveals ubiquitous but varying expression of most proteins. Mol. Cell Proteom., 11(3): M111.014050. DOI: 10.1074/mcp.M111.014050. 206, 221 G. Getz, E. Levine, and E. Domany. 2000. Coupled two-way clustering analysis of gene microarray data. Proc. Natl. Acad. Sci. USA, 97(22): 12079–12084. DOI: 10.1073/pnas .210134797. 66 G. Getz, M. Vendruscolo, D. Sachs, and E. Domany. 2002. Automated assignment of SCOP and CATH protein structure classifications from FSSP scores. Proteins: Struct. Func. Bioinform., 46(4): 405–415. 66 T. A. Gibson and D. S. Goldberg. 2009. Reverse engineering the evolution of protein interaction networks. Pac. Symp. Biocomput.: 190–202. 4 L. Giot, J. S. Bader, C. Brouwer, et al. 2003. A protein interaction map of Drosophila melanogaster. Science, 302(5651): 1727–1736. DOI: 10.1126/science.1090289. 18, 29 F. Glover. 1986. Future paths for integer programming and links to artificial intelligence. Comp. Oper. Res., 13(5): 533–549. DOI: 10.1016/0305-0548(86)90048-1. 85 246 References

I. H. Goenawan, K. Bryan, and D. J. Lynn. 2016. DyNet: visualization and analysis of dynamic molecular interaction networks. Bioinformatics, 32(17): 2713–2715. DOI: 10.1093/ bioinformatics/btw187. 199 K. Goh, M. Cusick, D. Valle, et al. 2007. The human disease network. Proc. Natl. Acad. Sci. USA, 104(21): 8685–8690. DOI: 10.1073/pnas.0701361104. 192, 194 W. W. B. Goh, Y. H. Lee, R. M. Zubaidah, J. Jin, et al. 2011. Network-based pipeline for analyzing MS data: an application toward liver cancer. J. Proteome Res., 10: 2261– 2272. DOI: 10.1021/pr1010845. 211 W. W. B. Goh, Y. H. Lee, Z. M. Ramdzan, et al. 2012. Proteomics Signature Profiling (PSP): A novel contextualization approach for cancer proteomics. J. Proteome Res., 11: 1571–1581. DOI: 10.1021/pr200698c. 209, 216, 217, 231 W. W. B. Goh, M. J. Sergot, J. C. Sng, and L. Wong. 2013a. Comparative network-based recovery analysis and proteomic profiling of neurological changes in valproic acid-treated mice. J. Proteome Res., 12(5): 2116–2127. DOI: 10.1021/pr301127f. 209, 211 W. W. B Goh, M. Fan, H. S. Low, et al. 2013b. Enhancing the utility of Proteomics Signature Profiling (PSP) with Pathway Derived Subnets (PDSs), performance analysis and specialised ontologies. BMC Genom., 14: 35. DOI: 10.1186/1471-2164-14-35. 209, 216, 231 W. W. B Goh, T. Guo, R. Aebersold, and L. Wong. 2015. Quantitative proteomics signature profiling based on network contextualization. Biol. Direct, 10: 71. DOI: 10.1186/ s13062-015-0098-x. 209 W. W. B. Goh and L. Wong. 2016. Integrating networks and proteomics: moving forward. Trend Biotechnol., 34(12): 951–959. DOI: 10.1016/j.tibtech.2016.05.015. 209, 210, 215 W. W. B. Goh and L. Wong. 2016a. Evaluating feature-selection stability in next-generation proteomics. J. Bioinform. Comp. Biol., 14(5): 1650029. 209, 216, 219, 231 W. W. B. Goh and L. Wong. 2016b. Advancing clinical proteomics via analysis based on biological complexes: A tale of five paradigms. J. Proteome Res., 15(9): 3167–3179. DOI: 10.1021/acs.jproteome.6b00402. 209, 216, 220, 231 E. A. Golemis and P. D. Adams (Editors). 2002. Protein-Protein Interactions: A Molecular Cloning Manual. Cold Spring Harbor Laboratory Press, New York. 6, 7, 15, 16, 19 S. Gong, G. Yoon, I. Jang, et al. 2005. PSIbase: a database of Protein Structural Interactome map (PSIMAP). Bioinformatics, 21(10): 2541–2543. DOI: 10.1093/bioinformatics/ bti366. 119 M. Griffith, O. L. Griffith, A. C. Coffman, et al. 2013. DGIdb - Mining the druggable genome. Nat. Meth., 10(12): 1209–1210. 223 A. V. Grinberg, C. D. Hu, and T. K. Kerppola. 2004. Visualization of Myc/Max/Mad family dimers and the competition for dimerization in living cells. Mol. Cell Biol., 24(10): 4294–4308. DOI: 10.1128/MCB.24.10.4294-4308.2004. 17, 21 References 247

A. M. Gross and T. Ideker. 2015. Molecular networks in context. Nat. Biotech., 33: 720–721. DOI: 10.1038/nbt.3283. 199 E. Guney, J. Menche, M. Vidal, and A. L. Barab´asi. 2015. Network-based in silico drug efficacy screening. Nat. Comm., 7: 10331. DOI: 10.1038/ncomms10331. 193 K. G. Guruharsha, J. F. Rual, B. Zhai, et al. 2011. A protein complex network of Drosophila melanogaster. Cell, 147(3): 690–703. DOI: 10.1016/j.cell.2011.08.047. 6 F. Gwinner, A. E. Acosta-Martin, L. Boytard, et al. 2013. Identification of additional proteins in differential proteomics using protein interaction networks. Proteomics, 13(7): 1065–1076. DOI: 10.1002/pmic.201200482. 209 J. Gyuris, E. Golemis, H. Chertkov, and R. Brent. 1993. Cdil1, a human G1 and S phase protein phosphatase that associates with Cdk2. Cell, 75: 791–803. 17 M. Habibi, C. Eslahchi, and L. Wong. 2010. Protein complex prediction based on k-connected subgraphs in protein interaction network. BMC Syst. Biol., 4: 129. DOI: 10.1186/1752- 0509-4-129. 112, 230 A. Hamosh, A. F. Scott, J. S. Amberger, et al. 2005. Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res., 33(Database issue): D514–D517. DOI: 10.1093/nar/gki033. 193, 194 J. D. J. Han, N. Bertin, T. Hao, D. S. Goldberg, et al. 2004. Evidence for dynamically organized modularity in the yeast protein-protein interaction network. Nature, 430: 88–93. DOI: 10.1038/nature02555. 149, 153, 158 J. H. Han, S. Batey, A. A. Nickson, S. A. Teichmann, and J. Clark. 2007. The folding and evolution of multidomain proteins. Nat. Rev. Mol. Cell Biol., 8: 319–330. DOI: 10.1038/ nrm2144. 118 proofnoteAU: new question. Should the dates here be 2000 rather than 2010? 194 D. Hanahan and R. A. Weinberg. 2010. The hallmarks of cancer. Cell, 100: 57–70. DOI: 10.1016/S0092-8674(00)81683-9. 194 D. Hanahan and R. A. Weinberg. 2011. Hallmarks of cancer: the next generation. Cell, 144: 646–674. DOI: 10.1016/j.cell.2011.02.013. 194 S. Hahn. 2004. Structure and mechanism of the RNA Polymerase II transcription machinery. Nat. Struct. Mol. Biol., 11(5): 394–403. DOI: 10.1038/nsmb763. 148 E. M. Hanna, N. Zaki, and A. Amin. 2015. Detecting protein complexes in protein interaction networks modeled as gene expression biclusters. PLoS One, 10(12): e0144163. DOI: 10.1371/journal.pone.0144163. 154, 231 A. S. Harding and J. F. Hancock. 2008a. Using plasma membrane nanoclusters to build better signaling circuits. Trend Cell Biol., 18(8): 364–371. DOI: 10.1016/j.tcb.2008.05.006. 188 A. S. Harding and J. F. Hancock. 2008b. Ras nanoclusters: combining digital and analog signaling. Cell Cycle, 15: 127–34. DOI: 10.4161/cc.7.2.5237. 188 248 References

G. T. Hart, A. K. Ramani, and E. M. Marcotte. 2006. How complete are current yeast and human protein-interaction networks? Genome Biol., 7(11): 120. DOI: 10.1186/gb- 2006-7-11-120. 52, 99 G. T. Hart, I. Lee, and E. M. Marcotte. 2007. A high-accuracy consensus map of yeast protein complexes reveals modular nature of gene essentiality. BMC Bioinform., 8: 236. DOI: 10.1186/1471-2105-8-236. 37, 40 T. Hart, K. R. Brown, F. Sircoulomb, R. Rottapel, and J. Moffat. 2014. Measuring error rates in genomic perturbation screens: gold standards for human functional genomics. Mol. Bio. Syst., 10(7): 733. DOI: 10.15252/msb.20145216. 4 T. Hart, M. Chandrashekar, M. Areggar, Z. Steinhart, et al. 2015. High-resolution CRISPR screens reveal fitness genes and genotype-specific cancer liabilities. Cell, 163(6): 1515–1526. DOI: 10.1016/j.cell.2015.11.015. 4 L. H. Hartwell, J. J. Hopfield, S. Leibler, and A. W. Murray. 1999. From molecular to modular cell biology. Nature, 402: C47–C51. DOI: 10.1038/35011540. 8 P. H. Harvey, A. F. Read, and S. Nee. Why ecologists need to be phylogenetically challenged. J. Ecol., 85: 535–536. 53 P. H. Harvey and M. Pagel. 1991. The Comparative Method in Evolutionary Biology. Oxford University Press. 53 T. Hastie and W. Stuetzle. 1989. Principal curves. J. Am. Stat. Assoc., 84(406): 502–516. 213 J. Hasty and J. J. Collins. 2001. Protein interactions. Unspinning the web. Nature, 411(6833): 30–31. DOI: 10.1038/35075182. 159 T. H. Haveliwala. 2003. Topic-sensitive PageRank: a context-sensitive ranking algorithm for Web search. IEEE Trans. Knowl. Data Eng., 15(4): 784–796. DOI: 10.1109/TKDE.2003 .1208999. 46 P. C. Havugimana, G. T. Hart, T. Nepusz, et al. 2012. A census of human soluble protein complexes. Cell, 150(5): 1068–1081. DOI: 10.1016/j.cell.2012.08.011. 11, 12, 165 X. He and J. Zhang. 2006. Why do hubs tend to be essential in protein networks? PLoS Genet., 2(6): e88. DOI: 10.1371/journal.pgen.0020088. 194 H. Hegyi, H. Schad, and P. Tompa. 2007. Structural disorder promotes assembly of protein complexes. BMC Struct. Biol., 7: 65. DOI: 10.1186/1472-6807-7-65. 163 M. Y. Hein, N. C. Hubner, I. Poser, et al. 2015. A human interactome in three quantitative dimensions organized by stoichiometries and abundances. Cell, 163: 712–723. DOI: 10.1016/j.cell.2015.09.053. 145, 146, 157, 227 J. Herlocker, J. A. Konstan, and J. Riedl. 2002. An empirical analysis of design choices in neighborhood-based collaborative filtering algorithms. Inform. Retriev., 5(4): 287–310. DOI: 10.1023/A:1020443909834. 45 H. Hermjakob, L. Montecchi-Palazzi, C. Lewington, et al. 2004. IntAct: an open source molecular interaction database. Nucleic Acids Res., 32(Database issue): D452–D455. DOI: 10.1093/nar/gkh052. 20, 24, 49, 93, 158, 163 References 249

J. M. Herrmann and S. Funes. Biogenesis of cytochrome oxidase-sophisticated assembly lines in the mitochondrial inner membrane. Gene, 354: 43–52. DOI: 10.1016/j.gene .2005.03.017. 143 D. J. Higham, M. Rasajski, and N. Prˇzulj.2008. Fitting a geometric graph to a protein-protein interaction network. Bioinformatics, 24(8): 1093–1099. DOI: 10.1093/bioinformatics/ btn079. 37, 47, 55, 217 E. Hirsh and R. Sharan. 2007. Identification of conserved protein complexes based on a model of protein network evolution. Bioinformatics, 23(2): e170–e176. DOI: 10.1093/ bioinformatics/btl295. 176, 231 Y. Ho, A. Gruhler, A. Heilbut, et al. 2002. Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature, 415(6868): 180–183. DOI: 10.1038/415180a. 6, 15, 16, 20, 40 U. Hobohm and C. Sander. 1994. Enlarged representative set of protein structures. Protein Sci., 3: 522–24. 156 R. Hoffman and A. Valencia. A gene network for navigating the literature. 2004. Nat Genet., 36(664). DOI: 10.1038/ng0704-664. 57 M. Hofree, J. P. Shen, H. Carter, A. Gross, and T. Ideker. 2013. Network-based stratification of tumor mutations. Nat. Meth., 10: 1108–1115. DOI: 10.1038/nmeth.2651. 192, 202, 209, 231 S. D. Hooper and P. Bork. 2005. Medusa: A simple tool for interaction graph analysis. Bioinformatics, 21(24): 4432–4433. DOI: 10.1093/bioinformatics/bti696. 33 T. A. Hopf, P. I. Sch¨arfe, et al. 2014. Sequence co-evolution gives 3D contacts and structures of protein complexes. eLife, 3: e03430. DOI: 10.7554/eLife.03430. 183 F. Hormozdiari, P. Berenbrink, N. Pr˜zulj,and S. C. Sahinalp. 2007. Not All scale-free networks are born equal: the role of the seed graph in ppi network evolution. PLoS Comp. Biol., 3(7): e118. DOI: 10.1371/journal.pcbi.0030118. 27 J. J. Hornberg, F. J. Bruggeman, H. V. Westerhoff, and J. Lankelma. 2006. Cancer: A systems biology disease. Biosystems, 83: 81–90. DOI: 10.1016/j.biosystems.2005.05.014. 185, 192 E. Huala, A. W. Dickerman, M. Garcia-Hernandez, et al. 2001. The Arabidopsis Information Resource (TAIR): a comprehensive database and web-based information retrieval, analysis and visualization system for a model plant. Nucleic Acids Res., 29(1): 102–105. DOI: 10.1093/nar/29.1.102. 3 H. Huang, B. M. Jedynak, and J. S. Bader. 2007. Where have all the interactions gone? Estimating the coverage of two-hybrid protein interaction maps. PLoS Comp. Biol., 3(11): e214. DOI: 10.1371/journal.pcbi.0030214. 52 D. W. Huang, B. T. Sherman, and R. A. Lempicki. 2009. Systematic and integrative analysis of large gene lists using DAVID bioinformatics resources. Nat. Protoc., 4: 44–57. DOI: 10.1038/nprot.2008.211. 213, 223 250 References

M. Hucka, A. Finney, et al. 2003. The systems biology markup language (SBML): a medium for representation and exchange of biochemical network models. Bioinformatics, 19(4): 524–531. DOI: 10.1093/bioinformatics/btg015. 222, 223 S. Hunter, R. Apweiler, T. K. Attwood, et al. 2009. InterPro: the integrative protein signature database. Nucleic Acids Res., 37(Database issue): D211–D215. DOI: 10.1093/nar/ gkn785. 119 E. L. Huttlin, L. Ting, R. J. Bruckner, et al. 2015. The BioPlex Network: A systematic exploration of the human interactome. Cell, 162(2): 425–440. DOI: 10.1016/j.cell .2015.06.043. 11, 12, 41 E. L. Huttlin, R. J. Bruckner, J. A. Paulo, et al. Architecture of the human interactome defines protein communities and disease networks. Nature (2017). DOI: 10.1038/ nature22366. 12 M. Huynen, B. Snel, W. Lathe, and P. Bork. 2000. Predicting protein function by genomic context: quantitative evaluation and qualitative inferences. Genome Res., 10(8): 1204–1210. DOI: 10.1101/gr.10.8.1204. 52, 55 T. Ideker and N. J. Krogan. 2012. Differential network biology. Mol. Syst. Biol., 8: 565. DOI: 10.1038/msb.2011.99. 199 N. Ishii, K. Nakahigashi, T. Baba, et al. 2007. Multiple high-throughput analyses monitor the response of E. coli to perturbations. Science, 316 (5824): 593–597. DOI: 10.1126/ science.1132067. 3 T. Ito, K. Tashiro, S. Muta, et al. 2000. Toward a protein-protein interaction map of the budding yeast: A comprehensive system to examine two-hybrid interactions in all possible combinations between the yeast proteins. Proc. Natl. Acad. Sci. USA, 97(3): 1143–1147. DOI: 10.1073/pnas.97.3.1143. 6, 15, 16, 18, 29, 179 A. B. Jaffe, P. Aspenstrom, and A. Hall. 2004. Human CNK1 acts as a scaffold protein, linking Rho and Ras signal transduction pathways. Mol. Cell Biol., 24 (4): 1736–1746. DOI: 10.1128/MCB.24.4.1736-1746.2004. 159 M. J. Marinissen and J. S. Gutkind. 2005. Scaffold proteins dictate Rho GTPase-signaling specificity. Trend Biochem. Sci., 30(8): 423–426. DOI: 10.1016/j.tibs.2005.06.00. 159 S. Jain and G. D. Bader. 2010. An improved method for scoring protein-protein interactions using semantic similarity within the gene ontology. BMC Bioinform., 11: 562. DOI: 10.1186/1471-2105-11-562. 37, 42, 43, 55 J. Janin. 2010. Protein-protein docking tested in blind predictions: the CAPRI experiment. Mol. Biosyst., 6(12): 2351–2362. DOI: 10.1039/C005060C. 158 R. Jansen, D. Greenbaum, and M. Gerstein. 2002. Relating whole-genome expression data with protein-protein interactions. Genome Res., 12(1): 37–46. DOI: 10.1101/gr.205602. 55 R. Jansen, H. Yu, D. Greenbaum, et al. 2003. A Bayesian networks approach for predicting protein-protein interactions from genomic data. Science, 302(5644): 449–453. DOI: 10.1126/science.1087361. 55 References 251

A. L. Jansma, M. A. Martinez-Yamout, R. Liao, et al. 2014. The high-risk HPV16 E7 oncoprotein mediates interaction between the transcriptional coactivator CBP and the retinoblastoma protein pRb. J. Mol. Biol., 426: 4030–4048. DOI: 10.1016/j .jmb.2014.10.021. 146 H. Jeong, S. P. Mason, A. L. Barab´asi, and Z. N. Oltvai. 2001. Lethality and centrality in protein networks. Nature, 411: 41–42. DOI: 10.1038/35075138. 29, 158, 194 J. Jin, X. Xie, C. Chen, et al. 2009. Eukaryotic protein domains as functional units of cellular evolution. Sci. Sig., 2(98): ra76. DOI: 10.1126/scisignal.2000546. 181 B. Jirgensons. 1966. Classification of proteins according to conformation. Makromol. Chem., 91: 74–86. DOI: 10.1002/macp.1966.020910105. 156 D. T. Jones and D. Cozzetto. 2014. DISOPRED3: precise disordered region predictions with annotated protein-binding activity. Bioinformatics, 31(6): 857–863. DOI: 10.1093/ bioinformatics/btu744. 164 D. T. Jones and J. J. Ward. 2003. Prediction of disordered regions in proteins from position specific score matrices. Proteins, 53(Suppl 6): 573–578. DOI: 10.1002/prot.10528. 157 P. F. Jonsson and P. A. Bates. 2006. Global topological features of cancer proteins in the human interactome. Bioinformatics, 22(18): 2291–2297. DOI: 10.1093/bioinformatics/ btl390. 192 J. D. Jordon, E. M. Landau, and R. Iyengar R. 2000. Signaling networks: The origins of cellular multitasking. Cell, 103(2): 193–2000. DOI: 10.1016/S0092-8674(00)00112-4. 146 D. Juan, F. Pazos, and A. Valencia. 2008. High-confidence prediction of global interactomes based on genome-wide coevolutionary networks. Proc. Natl. Acad. Sci. USA, 105(3): 934–939. DOI: 10.1073/pnas.0709671105. 53, 166 S. H. Jung, B. Hyun, W. H. Jang, et al. 2010. Protein complex prediction based on simultaneous protein interaction network. Bioinformatics, 26(3): 385–391. DOI: 10.1093/bioinformatics/btp668. 119, 120, 122, 230 T. Kocher¨ and G. Superti-Furga. 2007. Mass spectrometry-based functional proteomics: from molecular machines to protein networks. Nat. Meth., 4: 807–815. DOI: 10.1038/ nmeth1093. 7, 15, 19 A. H. Kachroo, J. M. Laurent, C. M. Yellman, et al. 2015. Systematic humanization of yeast genes reveals conserved functions and genetic modularity. Science, 348(6237): 921–925. DOI: 10.1126/science.aaa0769. 4 R. K. Kalathur, J. P. Pinto, M. A. Hern´andez-Prieto, et al. 2014. UniHI 7: an enhanced database for retrieval and interactive analysis of human molecular interaction networks. Nucleic Acids Res., 42(Database issue): D408–D414. DOI: 10.1093/nar/gkt1100. 51 G. Kalna and D. J. Higham. 2007. A clustering coefficient for weighted networks, with application to gene expression data. AI Commun., 20: 263–271. 60 A. Kamburov, U. Stelzl, and R. Herwig. 2012. IntScore: a web tool for confidence scoring of biological interactions. Nucleic Acids Res., 40(Web server issue): W140–W146. DOI: 10.1093/nar/gks492. 51 252 References

A. Kamburov, U. Stelzl, H. Lehrach, and R. Herwig. 2013. The ConsensusPathDB interaction database: 2013 update. Nucleic Acids Res., 41(Database issue): D793–D800. DOI: 10.1093/nar/gks1055. 223 K. Kandasamy, S. S. Mohan, et al. 2010. NetPath: a public resource of curated signal transduction pathways. Genome Biol., 11: R3. DOI: 10.1186/gb-2010-11-1-r3. 223 M. Kanehisa, S. Goto, M. Furumichi, M. Tanabe, and M. Hirakawa. 2010. KEGG for representation and analysis of molecular networks involving diseases and drugs. Nucleic Acids Res., 38(Database issue): D355–D360. DOI: 10.1093/nar/gkp896. 223 M. Kanehisa and S. Goto. 2000. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res., 28(1): 27–30. DOI: 10.1093/nar/28.1.27. 32, 33, 202, 212, 213, 221, 223 K. Ning, H. K. Ng, S. Srihari, et al. 2010. Examination of the relationship between essential genes in PPI network and hub proteins in reverse nearest neighbor topology. BMC Bioinform., 11: 505. DOI: 10.1186/1471-2105-11-505. 194 M. Kapushesky, I. Emam, E. Holloway, et al. 2010. Gene Expression Atlas at the European Bioinformatics Institute. Nucleic Acids Res., 38(Database issue): D69–D74. DOI: 10.1093/nar/gkp788. 169 G. Kar, A. Gursoy, and O. Keskin. 2009. Human cancer protein-protein interaction network: A structural perspective. PLoS Comp. Biol., 5(12): e1000601. DOI: 10.1371/journal .pcbi.1000601. 159 J. B. Karpinka, J. D. Fortriede, K. A. Burns, et al. 2015. Xenbase, the Xenopus model organism database: new virtualized system, data types and genomes. Nucleic Acids Res., 43(Database issue): D756–D763. DOI: 10.1093/nar/gku956. 3 L. Kaufman and P. J. Rousseeuw. 2009. Finding Groups in Data: An Introduction to Cluster Analysis. Wiley-Interscience, New York. 73 E. C. Keilhauer, M. Y. Hein, and M. Mann. 2015. Accurate protein complex retrieval by affinity enrichment mass spectrometry (AE-MS) rather than affinity purification mass spectrometry (AP-MS). Mol. Cell Proteom., 14: 120–135. 142 B. P. Kelley, R. Sharan, R. M. Karp, et al. 2003. Conserved pathways within bacteria and yeast as revealed by global protein network alignment. Proc. Natl. Acad. Sci. USA, 100(20): 11394–11399. DOI: 10.1073/pnas.1534710100. 170, 171, 231 B. P. Kelley, B. Yuan, F. Lewitter, et al. 2004. PathBLAST: a tool for alignment of protein interaction networks. Nucleic Acids Res., 32(Web server issue): W83–W88. DOI: 10.1093/nar/gkh411. 170, 171, 231 M. Kellis, B. W. Birren, and E. S. Lander. 2004. Proof and evolutionary analysis of ancient genome duplication in the yeast Saccharomyces cerevisiae. Nature, 428: 617–624. DOI: 10.1038/nature02424. 172 S. Kerrien, B. Aranda, L. Breuza, et al. 2012. The IntAct molecular interaction database in 2012. Nucleic Acids Res., 40(Database issue): D841–D846. DOI: 10.1093/nar/gkr1088. 20, 24, 49, 163 References 253

I. M. Keseler, C. Bonavides-Martinez, J. Collado-Vides, et al. 2009. EcoCyc: a comprehensive view of Escherichia coli biology. Nucleic Acids Res., 37(Database issue): D464–D470. DOI: 10.1093/nar/gkn751. 3

O. Keskin, N. Tuncbag, and A. Gursoy. 2016. Predicting protein-protein interactions from the molecular to the proteome level. Chem. Rev., 116(8): 4884–4909. DOI: 10.1021/acs .chemrev.5b00683. 52

F. Kiefer, K. Arnold, et al. The SWISS-MODEL repository and associated resources. Nucleic Acids Res., 37(Database issue): D387–D392. DOI: 10.1093/nar/gkn750. 161

P. M. Kim, L. J. Lu, Y. Xia, and M. B. Gerstein. 2006. Relating three-dimensional structures to protein networks provides evolutionary insights. Science, 314(5807): 1938–1941. DOI: 10.1126/science.1136174. 119, 158, 188

P. M. Kim, A. Sboner, Y. Xia, and M. B. Gerstein. 2008. The role of disorder in interaction networks: A structural analysis. Mol. Syst. Biol., 4: 179. DOI: 10.1038/msb.2008.16. 188

M. S. Kim, S. M. Pinto, D. Getnet, et al. 2014. A draft map of the human proteome. Nature, 509: 575–581. DOI: 10.1038/nature13302. 2, 3, 221

A. D. King, N. Przulj, and I. Jurisca. 2004. Protein complex prediction via cost-based clustering. Bioinformatics, 20(17): 3013–3020. DOI: 10.1093/bioinformatics/bth351. 64, 85, 94, 107, 226, 229

H. Kitano, A. Funahashi, Y. Matsuoka, and K. Oda. 2005. Using process diagrams for the graphical representation of biological networks. Nat. Biotech., 23: 961–966. DOI: 10.1038/nbt1111. 222, 223

S. Kittanakom, M. Chuk, V. Wong, et al. 2009. Analysis of membrane protein complexes using the split-ubiquitin membrane yeast two-hybrid (MYTH) system. Meth. Mol. Biol., 548: 247–271. DOI: 10.1007/978-1-59745-540-4_14. 16, 17, 22, 141

J. Kohler,¨ J. Baumbach, J. Taubert, M. Specht, A. Skusa, A. R¨uegg,C. Rawlings, P. Verrier, and S. Philippi. 2006. Graph-based analysis and visualization of experimental results with ONDEX. Bioinformatics, 22(11), 1383–1390. DOI: 10.1093/bioinformatics/btl081. 33

K. Komurov and M. White. 2007. Revealing static and dynamic modular architecture of the eukaryotic protein interaction network. Mol. Syst. Biol., 3: 110. DOI: 10.1038/ msb4100149. 150

P. A. Konstantinopoulos, D. Spentzos, B. Y. Karlan, et al. 2010. Gene expression profile of BRCAness that correlates with responsiveness to chemotherapy and with outcome in patients with epithelial ovarian cancer. J. Clin. Oncol., 28(22): 3555–3561. DOI: 10.1200/JCO.2009.27.5719. 203

M. Kotlyar, C. Pastrello, F. Pivetta, et al. 2015. In silico prediction of physical protein interactions and characterization of interactome orphans. Nat. Meth., 12: 79–84. DOI: 10.1038/nmeth.3178. 51 254 References

M. Kotlyar, C. Pastrello, N. Sheahan, and I. Jurisica. 2016. Integrated interactions database: tissue-specific view of the human and model organism interactomes. Nucleic Acids Res., 44(Database issue): D536–D541. DOI: 10.1093/nar/gkv1115. 51 D. Kozakov, D. Beglov, T. Bohnuud, S. E. Mottarella, B. Xia, D. R. Hall, and S. Vajda. 2013. How good is automated protein docking? Proteins, 81: 2159–2166. DOI: 10.1002/prot .24403. 158 N. J. Krogan, G. Cagney, H. Yu, et al. 2006. Global landscape of protein complexes in the yeast Saccharomyces cerevisiae. Nature, 440: 637–643. DOI: 10.1038/nature04670. 6, 20, 34, 37, 39, 40, 42, 93, 108, 112, 141, 179 N. J. Krogan, S. Lippman, D. A. Agard, A. Ashworth, and T. Ideker. 2015. The Cancer Cell Map Initiative: Defining the hallmark networks of cancer. Mol. Cell, 58(4): 690–698. DOI: 10.1016/j.molcel.2015.05.008. 223 A. Kumar, S. Agarwal, J. A. Heyman, et al. 2002. Subcellular localization of the yeast proteome. Genes Dev., 16(6): 707–719. DOI: 10.1101/gad.970902. 55 S. K. Kummerfeld and S. A. Teichmann. 2005. Relative rates of gene fusion and fission in multi-domain proteins. Trend Genet., 21: 25–30. DOI: 10.1016/j.tig.2004.11.007. 181 A. Laganowsky, E. Reading, J. T. S. Hopper, and C. V. Robinson. 2013. Mass spectrometry of intact membrane protein complexes. Nat. Protoc., 8: 639–651. DOI: 10.1038/nprot .2013.024. 142 K. Lage, E. O. Karlberg, Z. M. Størling, et al. 2007. A human phenome-interactome network of protein complexes implicated in genetic disorders. Nat. Biotech., 25(3): 309–316. DOI: 10.1038/nbt1295. 194, 209, 231 S. Lalonde, D. W. Ehrhardt, D. Loque, et al. 2008. Molecular and cellular approaches for the detection of protein-protein interactions: latest techniques and current limitations. Plant J., 53: 610–635. DOI: 10.1111/j.1365-313X.2007.03332.x. 16, 17, 22, 141 S. Lalonde, A. Sero, R. Pratelli, et al. 2010. A Membrane Protein/Signaling Protein Interaction Network for Arabidopsis Version AMPv2. Front. Physiol., 1: 24. DOI: 10.3389/fphys .2010.00024. 16, 17, 23 J. Lamb, E. D. Crawford, D. Peck, et al. 2006. The Connectivity Map: using gene-expression signatures to connect small molecules, genes and disease. Science, 313(5795): 1929–35. DOI: 10.1126/science.1132939. 223 J. P. Lambert, G. Ivosev, A. L. Couzens, et al. 2013. Mapping differential interactomes by affinity purification coupled with data-independent mass spectrometry acquisition. Nat. Meth., 10(12): 1239–45. 199 L. S. Lambeth and C. A. Smith. 2013. Short hairpin RNA-mediated gene silencing. Meth. Mol. Biol., 942: 205–32. DOI: 10.1007/978-1-62703-119-6_12. 4 X. Lan and J. K. Pritchard. 2016. Coregulation of tandem duplicate genes slows evolution of subfunctionalization in mammals. Science, 352(6288): 1009–1013. DOI: 10.1126/ science.aad8411. 182 References 255

M. Laplante and D. M. Sabatini. 2012. mTOR signaling in growth control and disease. Cell, 149(2): 274–293. DOI: 10.1016/j.cell.2012.03.017. 103 N. S. Latysheva, M. E. Oates, L. Maddox, et al. 2016. Molecular principles of gene fusion mediated rewiring of protein interaction networks in cancer. Mol. Cell, 63(4): 579–592. DOI: 10.1016/j.molcel.2016.07.008. 181 J. M. Laurent, J. H. Young, A. H. Kachroo, and E. M. Marcotte. 2015. Efforts to make and apply humanized yeast. Brief Funct. Genom., 15(2): 155–163. DOI: 10.1093/bfgp/elv041. 4 M. Lazarou, M. McKenzie, A. Ohtake, D. R. Thorburn, and M. T. Ryan. 2007. Analysis of the assembly profiles for mitochondrial- and nuclear-DNA-encoded subunits into complex I. Mol. Cell Biol., 27(12): 4228–4237. DOI: 10.1128/MCB.00074-07. 142 J. B. Leal-Pereira, A. J. Enright, and O. A. Ouzounis. 2004. Detection of functional modules from protein interaction networks. Prot.: Struct. Funct., Bioinform., 54: 49–57. DOI: 10.1002/prot.10505. 64, 66, 211, 226, 229 H. Lee, K. H. Mok, R. Muhandiram, et al. 2000. Local structural elements in the mostly unstructured transcriptional activation domain of human p53. J. Biol. Chem., 275 (38): 29426–29432. DOI: 10.1074/jbc.M003107200. 160 I. Lee, U. M. Blom, P. I. Wang, et al. 2011. Prioritizing candidate disease genes by network- based boosting of genome-wide association data. Genome Res., 21(7): 1109–1121. DOI: 10.1101/gr.118992.110. 51, 56, 202 M. D. M. Leiserson, F. Vandin, H. T. Wu, et al. 2015. Pan-cancer network analysis identifies combinations of rare somatic mutations across pathways and protein complexes. Nat Genet., 47: 106–114. DOI: 10.1038/ng.3168. 201, 231 T. Lemberger. 2007. Systems biology in human health and disease. Mol. Syst. Biol., 3: 136. DOI: 10.1038/msb4100175. 193 I. Letunic, T. Doerks, and P. Bork. 2012. SMART 7: recent updates to the protein domain annotation resource. Nucleic Acids Res., 40(Database issue): D302–D305. DOI: 10.1093/nar/gkr931. 141 H. Leung, Q. Xiang, S. M. Yiu, and F. Y. Chin. 2009. Predicting protein complexes from PPI data: A core-attachment approach. J. Comput. Biol., 16(2): 133–144. DOI: 10.1089/ cmb.2008.01TT. 64, 79, 107, 226, 229 S. Li, C. M. Armstrong, N. Bertin, et al. 2004. A map of the interactome network of the metazoan C. elegans. Science, 303(5657): 540–543. DOI: 10.1126/science.1091403. 6 T. Li, R. Wernersson, R. B. Hansen, et al. 2017. A scored human protein-protein interaction network to catalyze genomic interpretation. Nat. Meth., 14: 61–64. X. L. Li, S. H. Tan, C. S. Foo, and S. K. Ng. 2005. Interaction graph mining for protein complexes using local clique merging. Genome Inform. (World Scientific), 16(2): 260–269. 64, 68, 71, 89, 226, 229 J. Li, G. Liu, H. Li, et al. 2007. Maximal biclique subgraphs and closed pattern pairs of the adjacency matrix: A one-to-one correspondence and mining algorithms. IEEE Trans. Knowl. Data Eng., 19: 1625–1637. 84 256 References

X. L. Li, C. S. Foo, and S. K. Ng. 2007. Discovering protein complexes in dense reliable neighborhoods of protein interaction networks. Proc. Comput. Syst. Bioinform. Conf., 6: 157–168. 64, 85, 88, 107, 226, 229

M. Li, J. Chen, J. X. Wang, et al. 2008. Modifying the DPClus algorithm for identifying protein complexes based on new topological structures. BMC Bioinform., 9: 398. DOI: 10.1186/1471-2105-9-398. 64, 71, 94, 229

J. Li, L. J. Zimmerman, B. H. Park, et al. 2009. Network-assisted protein identification and data interpretation in shotgun proteomics. Mol. Syst. Biol., 5: 303. DOI: 10.1038/msb .2009.54. 210

X. Li, M. Wu, C. K. Kwoh, and S. K. Ng. 2010. Computational approaches for detecting protein complexes from protein interaction networks: A survey. BMC Genom., 11(Suppl 1): S3. DOI: 10.1186/1471-2164-11-S1-S3. 60

S. S. Li, K. Xu, and M. R. Wilkins. 2011. Visualization and analysis of the complexome network of Saccharomyces cerevisiae. J. Proteome Res., 10(10): 4744–4756. DOI: 10.1021/pr200548c. 186

M. Li, X. Wu, J. Wang, and Y. Pan. 2012. Towards the identification of protein complexes and functional modules by integrating PPI network and gene expression data. BMC Bioinform., 13: 109. 152, 231

J. Li, Y. Lu, R. Akbani, Z. Ju, et al. 2013. TCPA: a resource for cancer functional proteomics data. Nat. Meth., 10: 1046–1047. 4

T. Li, R. Wernersson, R. B. Hansen, et al. 2017. A scored human protein-protein interaction network to catalyze genomic interpretation. Nat. Meth. 14: 61–64. DOI: 10.1038/ nmeth.4083. 51

P. Liang and A. B. Pardee. 1992. Differential display of eukaryotic messenger RNA by means of the polymerase chain reaction. Science, 257(5072): 967–971. DOI: 10.1126/science .1354393. 199

J. Lim, T. Hao, C. Shaw, et al. 2006. A protein-protein interaction network for human inherited ataxias and disorders of purkinje cell degeneration. Cell, 125(4): 801–814. DOI: 10.1016/j.cell.2006.03.032. 193, 194

K. Lim, Z. Li, K. P. Choi, and L. Wong. 2015. A quantum leap in the reproducibility, precision and sensitivity of gene expression profile analysis even when sample size is extremely small. J. Comp. Biol. Bioinform., 13(4): 1550018. DOI: 10.1142/S0219720015500183. 216, 220, 231

K. Lim and L. Wong. 2014. Finding consistent disease subnetworks using PFSNet. Bioinformatics, 30: 189–196. 216, 218, 219, 231

R. Linding, L. J. Jensen, F. Diella, P. Bork, T. J. Gibson, and R. B. Russell. 2003a. Protein disorder prediction: Implications for structural proteomics. Structure, 11: 1453–1459. DOI: 10.1016/j.str.2003.10.002. 157 References 257

R. Linding, R. B. Russell, V. Neduva, and T. J. Gibson. 2003b. GlobPlot: Exploring protein sequences for globularity and disorder. Nucleic Acids Res., 31: 3701–3708. DOI: 10.1093/nar/gkg519. 157 H. Liu, R. G. Sadygov, and J. R. Yates III. 2004. A model for random sampling and estimation of relative protein abundance in shotgun proteomics. Ann. Chem. 2004, 76: 4193– 4201. DOI: 10.1021/ac0498563. 209 G. Liu, J. Li, and L. Wong. 2008. Assessing and predicting protein interactions using both local and global network topological metrics. Genome Inform., 21: 138–149. DOI: 10.11234/gi1990.21.138. 37, 44, 45, 55, 127 J. S. Liu. 2008. Monte Carlo Strategies in Scientific Computing. Springer, New York. 130 G. M. Liu, H. N. Chua, and L. Wong. 2009. Complex discovery from weighted PPI networks. Bioinformatics, 25(15): 1891–1897. DOI: 10.1093/bioinformatics/btp311. 64, 71, 94, 95, 204, 211, 226, 229 G. Liu, C. H. Yong, H. N. Chua, and L. Wong. 2011. Decomposing PPI networks for complex discovery. Proteome Sci., S1: S15. DOI: 10.1186/1477-5956-9-S1-S15. 112, 122, 136, 230 C. Liu, S. Srihari, et al. 2014. A fine-scale dissection of the DNA double-strand break repair machinery and its implications for breast cancer therapy. Nucleic Acids Res., 42(10): 6106–6127. DOI: 10.1093/nar/gku284. 203 Z. Liu, J. Zhao, Y. Tan, M. Tang, and G. Li. 2015. Systematic tracking of dysregulated modules identifies disrupted pathways in narcolepsy. Int. J. Clin. Exp. Med., 8(6): 9384–9393. 206 C. Liu, S. Srihari, S. Lal, et al. 2016. Personalised pathway analysis reveals association between dna repair pathway dysregulation and chromosomal instability in sporadic breast cancer. Mol. Oncol., 10(1): 179–193. DOI: 10.1016/j.molonc.2015.09.007. 213, 214 Q. Liu, J. Song, and J. Li. 2016. Using contrast patterns between true complexes and random subgraphs in PPI networks to predict unknown protein complexes. Sci. Rep., 6: 21223. DOI: 10.1038/srep21223. 138, 140, 230 H. Lodish, A. Berk, S. L. Zipursky, P. Matsudaira, D. Baltimore, and J. Darnell. 2000. Molecular Cell Biology, 4th edition, W. H. Freeman, New York. 22 C. J. Lord and A. Ashworth. 2006. BRCAness revisited. Nat. Rev. Cancer, 16: 110–120. 203 X. Lu, P. R. Kensche, M. A. Huynen, and R. A. Notebaart. 2013. Genome evolution predicts genetic interactions in protein complexes and reveals cancer drug targets. Nat. Comm., 4: 2124. 190, 191 X. Luo, Z. You, M. Zhou, et al. 2015. A highly efficient approach to protein interactome mapping based on collaborative filtering framework. Scient. Rep., 5: 7702. DOI: 10.1038/srep07702. 37, 45 N. M. Luscombe, M. M. Babu, H. Yu, M. Snyder, S. A. Teichmann, and M. Gerstein. 2004. Genomic analysis of regulatory network dynamics reveals large topological changes. Nature, 431: 308–312. DOI: 10.1038/nature02782. 150 258 References

D. J. Lynn, G. L. Winsor, C. Chan, et al. 2008. InnateDB: Facilitating systems-level analyses of the mammalian innate immune response. Mol. Syst. Biol., 4(1): 218. DOI: 10.1038/ msb.2008.55. 24, 51, 223 J. P. Mackay, M. Sunde, J. A. Lowry, et al. 2007. Protein interactions: is seeing believing? Trend Biochem. Sci., 32(12): 530–531. 35, 36 G. Maddalo, F. Stenberg-Bruzell, et al. 2011. Systematic analysis of native membrane protein complexes in Escherichia coli. J. Proteome Res., 10: 1848–1859. DOI: 10.1021/ pr101105c. 143 N. Malhis, E. T. C. Wong, R. Nassar, and J. Gsponer. 2015. Computational identification of MoRFs in protein sequences. PloS One, 31(11): 1738–1744. DOI: 10.1093/ bioinformatics/btv060. 164 A. Malovannaya, R. B. Lanz, S. Y. Jung, et al. 2011. Analysis of the human endogenous coregulator complexome. Cell, 145(5): 787–799. DOI: 10.1016/j.cell.2011.05.006. 187 J. R. Managbanag, T. M. Witten, D. Bonchev, L. A. Fox, et al. 2008. Shortest-path network analysis is a useful approach toward identifying genetic determinants of logevity. PLoS One, 3: e3802. DOI: 10.1371/journal.pone.0003802. 211 R. Mani, R. P. Onge, J. L. Hartman, G. Giaever, and F. P. Roth. 2008. Defining genetic in- teraction. Proc. Natl. Acad. Sci. USA, 105: 3461–3466. DOI: 10.1073/pnas.0712255105. 191 E. M. Marcotte, M. Pellegrini, H. L. Ng, et al. 1999. Detecting protein function and protein- protein interactions from genome sequences. Science, 285(5428): 751–753. DOI: 10.1126/science.285.5428.751. 54 R. Marcotte, A. Sayad, K. R. Brown, et al. 2016. Functional genomic landscape of human breast cancer drivers, vulnerabilities and resistance. Cell, 164(1-2): 293–309. DOI: 10.1016/j.cell.2015.11.062. 4 J. A. Marsh, H. Hern´andez, Z. Hall, et al. 2013. Protein complexes are under evolutionary selection to assemble via ordered pathways. Cell, 153(2): 461–470. DOI: 10.1016/j.cell .2013.02.044. 182, 231 J. A. Marsh and S. A. Teichmann. 2014. Protein flexibility facilitates quaternary structure assembly and evolution. PLoS Biol., 12: e1001870. DOI: 10.1371/journal.pbio.1001870. 180, 181, 182, 231 J. A. Marsh and S. A. Teichmann. 2015. Structure, dynamics, assembly and evolution of protein complexes. Ann. Rev. Biochem., 84: 551–575. DOI: 10.1146/annurev-biochem- 060614-034142. 182, 231 S. Maslov and K. Sneppen. 2002. Specificity and stability in topology of protein networks. Science, 292(5569): 910–913. 29 L. R. Matthews, P. Vaglio, J. Reboul, et al. 2001. Identification of potential interaction networks using sequence-based searches for conserved protein-protein interactions or ‘interologs’. Genome Res., 11(12): 2120–2126. DOI: 10.1101/gr.205301. 166 References 259

N. McCabe, N. C. Turner, C. J. Lord, et al. 2006. Deficiency in the repair of DNA damage by homologous recombination and sensitivity to poly(adp-ribose) polymerase inhibition. Cancer Res., 66(16): 8109–8115. DOI: 10.1158/0008-5472.CAN-06-0140. 203 M. D. McDowall, M. A. Harris, A. Lock, K. Rutherford, et al. 2015. PomBase 2015: updates to the fission yeast database. Nucleic Acids Res., 43(Database issue): D665–D661. DOI: 10.1093/nar/gku1040. 3 B. H. M. Meldal, O. Forner-Martinez, M. C. Costanzo, et al. 2015. The complex portal—an encyclopaedia of macromolecular complexes. Nucleic Acids Res. 2015, 43(Database issue): D479–D484. DOI: 10.1093/nar/gku975. 11, 12 D. Mellacheruvu, Z. Wright, A. L. Couzens, J. P. Lambert, et al. 2013. The CRAPome: a contaminant repository for affinity purification-mass spectrometry data. Nat. Meth., 10: 730–736. 142 A. Mendenhall and A. Hodge. 1998. Regulation of Cdc28 cyclin-dependent protein kinase activity during the cell cycle of the yeast Saccharomyces cerevisiae. Microbiol. Mol. Biol. Rev., 62(4): 1191–1243. 145, 148, 159 X. Y. Meng, H. X. Zhang, M. Mezei, and M. Cui. 2011. Molecular docking: a powerful approach for structure-based drug discovery. Curr. Comput. Aided Drug. Des., 7(2): 146–157. 158 K. Menger. 1927. Zur allgemeinen Kurventheorie. Fundament. Math., 10: 96–115. 112 F. Mertens, B. Johansson, T. Fioretos, and F. Mitelman. 2015. The emerging complexity of gene fusions in cancer. Nat. Rev. Cancer, 15: 371–381. DOI: 10.1038/nrc3947. 181 P. Mertins, D. R. Mani, K. V. Ruggles, et al. 2016. Proteogenomics connects somatic mutations to signaling in breast cancer. Nature, 534: 55–62. DOI: 10.1038/ nature18003. 221 B. Meszaros, I. Simon, and Z. Dosztanyi. 2009. Prediction of protein binding regions in disordered proteins. PLoS Comp. Biol., 5: e1000376. DOI: 10.1371/journal.pcbi .1000376. 164 H. W. Mewes, D. Frishman, K. F. Mayer, et al. 2006. MIPS: analysis and annotation of proteins from whole genomes in 2005. Nucleic Acids Res., 34(Database issue): D169–D172. DOI: 10.1093/nar/gkj148. 34, 55, 88, 108, 112, 140, 150, 179, 189, 209 H. W. Mewes, S. Dietmann, D. Frishman, et al. 2008. MIPS: analysis and annotation of genome information in 2007. Nucleic Acids Res., 36(Database issue): D196–D201. DOI: 10.1093/nar/gkm980. 11, 12, 24, 216 H. Mi, A. Muruganujan, J. T. Casagrande, and P. D. Thomas. 2013. Large-scale gene function analysis with the PANTHER classification system. Nat. Protoc., 8(8): 1551–1566. DOI: 10.1038/nprot.2013.092. 42 M. Michaut, A. Baryshnikova, M. Costanzo, et al. 2011. Protein complexes are central in the yeast genetic landscape. PLoS Comp. Biol., 7(2): e1001092. 190, 216 260 References

S. W. Michnick. 2003. Protein fragment complementation strategies for biochemical network mapping. Curr. Opin. Biotechnol., 14(6): 610–617. DOI: 10.1016/j.copbio .2003.10.014. 15, 16, 21 T. Milenkovic and N. Prˇzulj.2008. Uncovering biological network function via graphlet degree signatures. Cancer Inform., 6: 257–273. 218 J. P. Miller, R. S. Lo, A. Ben-Hur, et al. 2005. Large-scale identification of yeast integral membrane protein interactions. Proc. Natl. Acad. Sci. USA 2005, 102: 12123–12128. DOI: 10.1073/pnas.0505482102. 22, 23, 141 N. Mirkovic, Z. Li, A. Parnassa, and D. Murray. 2007. Strategies for high-throughput comparative modeling: Applications to leverage analysis in structural genomics and protein family organization. Prot.: Struct., Func. Bioinform., 66(4): 766–777. 56 K. Mitra, A. R. Carvunis, S. K. Ramesh, and T. Ideker. 2013. Integrative approaches for finding modular structure in biological networks. Nat. Rev. Genet., 14: 719–732. DOI: 10.1038/nrg3552. 221 E. Monachino, L. M. Spenkelink, and A. M. van Oijen. 2016. Watching cellular machinery in action, one molecule at a time. J. Cell Biol., 216(1): 41–51. DOI: 10.1083/jcb.201610025. 164 M. Morell, S. Ventura, and F. X. Aviles. 2009. Protein complementation assays: Approaches for the in vivo analysis of protein interactions. FEBS Lett., 583(11): 1684–1691. DOI: 10.1134/S0026893312050020. 21 D. O. Morgan. 1995. Principles of CDK regulation. Nature, 374: 131–134. DOI: 10.1038/ 374131a0. 145, 148, 159 J. H. Morris, G. M. Knudsen, E. Verschueren, et al. 2014. Affinity purification-mass spectrometry and network analysis to understand protein-protein interactions. Nat. Prot., 9(11): 2539–2554. DOI: 10.1038/nprot.2014.164. 15, 31 R. Mosca, A. C´eol, and P. Aloy. 2013. Interactome3D: adding structural details to protein networks. Nat. Meth., 10: 47–53. 159 T. Nepusz, H. Yu, and A. Paccanaro. 2012. Detecting overlapping protein complexes in protein-protein interaction networks. Nat. Meth., 9: 471–472. 64, 72, 94, 226, 229 A. I. Nesvizhskii. 2012. Computational and informatics strategies for identification of specific protein interaction partners in affinity purification mass spectrometry experiments. Proteomics, 12(10): 1639–1655. DOI: 10.1002/pmic.201100537. 41 M. E. J. Newman. 2004. Fast algorithm for detecting community structure in networks. Phys. Rev. E Statist., Nonlin. Soft Matt. Phys., 69: 066133. DOI: 10.1103/PhysRevE.69.066133. 8, 9 M. E. J. Newman. 2010. Networks: An Introduction. Oxford University Press, Oxford, UK. 9, 60 S. K. Ng, Z. Zhang, S. H. Tan, and K. Lin. 2003. InterDom: a database of putative interacting protein domains for validating predicted protein interactions and complexes. Nucleic Acids Res., 31: 251. DOI: 10.1093/nar/gkg079. 121 References 261

P. V. Nguyen, S. Srihari, and H. W. Leong. 2013. Identifying conserved protein complexes between species by constructing interolog networks. BMC Bioinform., 14: S8. DOI: 10.1186/1471-2105-14-S16-S8. 166, 181, 200, 231 N. Orlev, R. Shamir, and Y. Shiloh. 2004. Pivot: protein interacions visualization tool. Bioinformatics, 20(3): 424–425. DOI: 10.1093/bioinformatics/btg426. 33 M. A. Nooren and J. M. Thornton. 2003. Diversity of protein-protein interactions. EMBO J., 22(14): 3486–3492. DOI: 10.1093/emboj/cdg359. 5, 145 R. A. Notebaart, P. R. Kensche, M. A. Huynen, and B. E. Dutilh. 2009. Asymmetric relationships between proteins shape genome evolution. Genome Biol., 10: R19. DOI: 10.1186/gb-2009-10-2-r19. 191 P. Nurse. 2001. Cyclin dependent kinases and cell cycle control. Nobel Prize Lecture, December 9, 2011. DOI: 10.1023/A:1022017701871. 145, 159 G. Ostlund,¨ T. Schmitt, K. Forslund, et al. 2010. InParanoid 7: new algorithms and tools for eukaryotic orthology analysis. Nucleic Acids Res., 38(Database issue): D196–D203. DOI: 10.1093/nar/gkp931. 4, 174 M. E. Oates, P. Romero, T. Ishida, et al. 2013. D2P2: database of disordered protein predictions. Nucleic Acids Res., 41(Database issue): D508-D516. DOI: 10.1093/nar/ gks1226. 157 J. C. Obenauer and M. B. Yaffe. 2004. Computational prediction of protein-protein interactions. Meth. Mol. Biol., 261: 445–468. DOI: 10.1385/1-59259-762-9:445. 52 K. P. O’Brien, M. Remm, and E. L. L. Sonnhammer. 2005. InParanoid: a comprehensive database of eukaryotic orthologs. Nucleic Acids Res., 33(Database issue): D476–D480. DOI: 10.1093/nar/gki107. 4, 174 K. T. O’Brien, N. J. Haslam, and D. C. Shields. 2013. SLiMScape: a protein short linear motif analysis plugin for Cytoscape. BMC Bioinform., 14: 224. 164 C. J. Oldfield and K. A. Dunker. 2014. Intrinsically disordered proteins and intrinsically disordered protein regions. Ann. Rev. Biochem., 83: 553–584. DOI: 10.1146/annurev- biochem-072711-164947. 156 S. Oliver. 2000. Proteomics: Guilt-by-association goes global. Nature, 403: 601–603. DOI: 10.1038/35001165. 55, 186 S. Orchard, S. Kerrien, S. Abbani, et al. 2012. Protein interaction data curation: the International Molecular Exchange (IMEx) consortium. Nat. Meth., 9: 345–350. DOI: 10.1038/nmeth.1931. 23 A. Ori, M. Iskar, K. Buczak, P. Kastritis, et al. 2016. Spatiotemporal variation of mammalian protein complex stoichiometries. Genome Biol., 17: 47. DOI: 10.1186/s13059-016- 0912-5. 12, 13, 181, 206, 231 M. Oti, B. Snel, M. A. Huynen, and H. G. Brunner. 2006. Predicting disease genes using protein-protein interactions. J. Med. Genet., 43: 691–698. DOI: 10.1136/jmg.2006 .041376. 194 262 References

L. Ou-Yang, D. Q. Dai, X. L. Li, et al. 2014. Detecting temporal protein complexes from dynamic protein-protein interaction networks. BMC Bioinform., 15: 335. DOI: 10.1186/1471-2105-15-335. 152, 231 Y. Ozawa, R. Saito, S. Fujimori, et al. 2010. Protein complex prediction via verifying and reconstructing the topology of domain-domain interactions. BMC Bioinform., 11: 350. 120, 121, 122, 130, 230 R. A. Pache, A. C´eol, and P. Aloy P. 2012. NetAligner: a network alignment server to compare complexes, pathways and whole interactomes. Nucleic Acids Res., 40(W1): W157– W161. DOI: 10.1093/nar/gks446. 174, 231 P. J. Paddison, A. A. Caudy, E. Bernstein, G. J. Hannon, and D. S. Conklin. 2002. Short hairpin RNAs (shRNAs) induce sequence-specific silencing in mammalian cells. Genes Develop., 16: 948–958. DOI: 10.1101/gad.981002. 4 P. Pagel, S. Kovac, M. Oesterheld, et al. 2005. The MIPS mammalian protein-protein interaction database. Bioinformatics, 21(6): 832–834. DOI: 10.1093/bioinformatics/ bti115. 24, 25 Y. Park and J. S. Bader. 2012. How networks change with time. Bioinformatics, 28(12): 40–48. 155, 231 S. Pasek, J. L. Risler, and P. Br´ezellec P. 2006. Gene fusion/fission is a major contributor to evolution of multi-domain bacterial proteins. Bioinformatics, 22: 1418–1423. DOI: 10.1093/bioinformatics/btl135. 181, 182 S. S. Patel, B. J. Belmont, J. M. Sante, et al. 2007. Natively unfolded nucleoporins gate protein diffusion across the nuclear pore complex. Cell, 129: 83–96. DOI: 10.1016/j.cell.2007 .01.044. 162 A. Patil, K. Kinoshita, and H. Nakamura. 2010. Hub promiscuity in protein-protein interaction networks. Int. J. Mol. Sci., 11: 1930–1943. DOI: 10.3390/ijms11041930. 159 A. Patil, K. Nakai, and H. Nakamura. 2011. HitPredict: a database of quality assessed protein-protein interactions in nine species. Nucleic Acids Res., 39(Database issue): D744–D749. DOI: 10.1093/nar/gkq897. 51 A. Patil and H. Nakamura. 2006. Disordered domains and high surface charge confer hubs with the ability to interact with multiple proteins in interaction networks. FEBS Lett., 580: 2041–2045. DOI: 10.1016/j.febslet.2006.03.003. 159 N. P. Pavletich. 1999. Mechanisms of cyclin-dependent kinase regulation: structures of Cdks, their cyclin activators and Cip and INK4 inhibitors. J. Mol. Biol., 287: 821–828. DOI: 10.1006/jmbi.1999.2640. 160 G. A. Pavlopoulos, M. Secrier, C. N. Moschopoulos, et al. 2011. Using graph theory to analyze biological networks. BioData Mining, 28: 4–10. DOI: 10.1186/1756-0381-4-10. 33 T. Pawson and R. Linding. 2008. Network medicine. FEBS Lett., 582(8): 1266–1270. DOI: 10.1016/j.febslet.2008.02.011. 192 T. Pawson and P. Nash. 2000. Protein-protein interactions define specificity in signal transduction. Genes Dev., 14: 1027–1047. DOI: 10.1101/gad.14.9.1027. 146, 185 References 263

M. Pellegrini, E. M. Marcotte, M. J. Thompson, et al. 1999 Assigning protein functions by comparative genome analysis: Protein phylogenetic profiles. Proc. Natl. Acad. Sci. USA, 96(8): 4285–4288. 53, 55, 166 M. Pellegrini. 2012. Using phylogenetic profiles to predict functional relationships. Meth. Mol. Biol., 804: 167–177. DOI: 10.1007/978-1-61779-361-5_9. 53, 55 M. Penrose. 2003. Geometric Random Graphs. Oxford University Press, Oxford, UK. 29 S. Peri, J. D. Navarro, T. Z. Kristiansen, et al. 2004. Human protein reference database as a discovery resource for proteomics. Nucleic Acids Res., 32(Database issue): D497–D501. DOI: 10.1093/nar/gkh070. 24, 25, 198 M. Persico, A. Ceol, C. Gavrila, et al. 2005. HomoMINT: an inferred human network based on orthology mapping of protein interactions discovered in model organisms. BMC Bioinform., 6(Suppl 4): S21. 24, 170 C. Pesquita, D. Faria, H. Bastos, et al. 2008. Metrics for GO based protein semantic similarity: a systematic evaluation. BMC Bioinform., 9(Suppl 5): S4. 42 J. Petschnigg, V. Wong, J. Snider, and I. Stagljar. 2012. Investigation of membrane protein interactions using the split-ubiquitin membrane yeast two-hybrid system. Meth. Mol. Biol., 812: 225–244. DOI: 10.1007/978-1-61779-455-1_13. 22 J. Petschnigg, B. Groisman, M. Kotlyar, et al. 2014. The mammalian-membrane two-hybrid assay (MaMTH) for probing membrane-protein interactions in human cells. Nat. Meth., 11: 585–592. 16, 17, 23 A. Picco, I. Irastorza-Azcarate, T. Specht, D. Boke,¨ et al. 2017. The in vivo architecture of the exocyst provides structural basis for exocytosis. Cell, 168(3): 400–412. DOI: 10.1016/j.cell.2017.01.004. 164 P. Picotti, M. Cl´ement-Ziza, H. Lam, et al. 2013. A complete mass-spectrometric map of the yeast proteome applied to quantitative trait analysis. Nature, 494(7436): 266–270. DOI: 10.1038/nature11835. 3 U. Pieper, N. Eswar, F. P. Davis, et al. 2006. MODBASE: a database of annotated comparative protein structure models and associated resources. Nucleic Acids Res., 34(Suppl 1): D291–D295. DOI: 10.1093/nar/gkj059. 56 B. G. Pierce, K. Wiehe, H. Hwang, B. H. Kim, et al. 2014. ZDOCK server: interactive docking prediction of protein-protein complexes and symmetric multimers. Bioinformatics, 30(12): 1771–1773. DOI: 10.1093/bioinformatics/btu097. 158 J. Pinero,˜ N. Queralt-Rosinach, A.´ Bravo, et al. 2015. DisGeNET: a discovery platform for the dynamical exploration of human diseases and their genes. Database (Oxford), bav028. DOI: 10.1093/database/bav028. 223 J. J. Pitt. 2009. Principles and applications of liquid chromatography-mass spectrometry in clinical biochemistry. Clin. Biochem. Rev., 30(1): 19–34. 19 P. Poornima, J. D. Kumar, Q. Zhao, et al. 2016. Network pharmacology of cancer: From understanding of complex interactomes to the design of multi-target specific 264 References

therapeutics from nature. Pharmacol. Res., 111. DOI: 10.1016/j.phrs.2016.06.018. 192 E. Porta-Pardo, L. Garcia-Alonso, T. Hrabe, et al. 2015. A pan-cancer catalogue of cancer driver protein interaction interfaces. PLoS Comp. Biol., 11(10): e1004518. DOI: 10.1371/journal.pcbi.1004518. 201 S. Powell, D. Szklarczyk, K. Trachana, et al. 2012. eggNOG v3.0: orthologous groups covering 1133 organisms at 41 different taxonomic ranges. Nucleic Acids Res., 40(Database issue): D284–D289. DOI: 10.1093/nar/gkr1060. 170 N. Prˇzulj,D. G. Corneil, and I. Jurisica. 2004. Modeling interactome: scale-free or geometric? Bioinformatics, 20(18): 3508–3515. DOI: 10.1093/bioinformatics/bth436. 29, 30, 37, 47, 55, 217 K. T. S. Prasad, G. R. Kandasamy, S. Keerthikumar, et al. 2009. Human Protein Reference Database—2009 update. Nucleic Acids Res., 37(Database issue): D767–D772. DOI: 10.1093/nar/gkn892. 24, 25, 140, 198 M. N. Price, K. H. Huang, E. J. Alm, and A. P. Arkin. 2005. A novel method for accurate operon predictions in all sequenced prokaryotes. Nucleic Acids Res., 33(3): 880–892. DOI: 10.1093/nar/gki232. 52 Y. Pritykin and M. Singh. 2013. Simple topological features reflect dynamics and modularity in protein interaction networks. PLoS Comp. Biol., 9(10): e1003243. DOI: 10.1371/ journal.pcbi.1003243. 149 K. D. Pruitt. 1998. WebWise: Guide to The Institute for Genomic Research website. Genome Res., 8: 1000–1004. DOI: 10.1101/gr.8.10.1000. 172 T. M. Przytycka, M. Singh, and D. K. Slonim. 2010. Toward the dynamic interactome: it’s about time. Brief Bioinform., 11(1): 15–29. DOI: 10.1093/bib/bbp057. 145 S. Pu, J. Vlasblom, A. Emili, J. Greenblatt, and S. J. Wodak. 2007. Identifying functional modules in the physical interactome of Saccharomyces cerevisiae. Proteomics, 7 (6): 944–960. DOI: 10.1002/pmic.200600636. 66, 91, 112 S. Pu, J. Vlasblom, A. Turinsky, et al. 2015. Extracting high confidence protein interactions from affinity purification data: At the crossroads. J. Proteomics, 118: 63–80. DOI: 10.1016/j.jprot.2015.03.009. 50 S. Pu, J. Wong, B. Turner, et al. 2009. Up-to-date catalogue of yeast protein complexes. Nucleic Acids Res., 37(3): 825–831. DOI: 10.1093/nar/gkn1005. 2, 11, 12, 93, 129, 130, 132, 142, 216 O. Puig, F. Caspary, G. Rigaut, et al. 2001. The tandem affinity purification (TAP) method: a general procedure of protein complex purification. Methods 24(3): 218–229. DOI: 10.1006/meth.2001.1183. 7 Y. Qi, F. Balem, C. Faloutsos, et al. 2008. Protein complex identification by supervised graph local clustering. Bioinformatics, 24 (ISMB 2008 issue): i250-i258. DOI: 10.1093/ bioinformatics/btn164. 64, 74, 75, 139, 229 References 265

E. Quevillon, V. Silventoinen, S. Pillai, et al. 2005. InterProScan: protein domains identifier. Nucleic Acids Res., 33(Suppl 2): W116-W120. DOI: 10.1093/nar/gki442. 119, 169 S. Rahmati, M. Abovsky, C. Pastrello, and I. Jurisica. 2016. PathDIP: an annotated resource for known and predicted human gene-pathway associations and pathway enrichment analysis. Nucleic Acids Res., 45(D1): D419–D426. DOI: 10.1093/nar/gkw1082. 223 C. J. Rain, L. Selig, H. De Reuse, et al. 2001. The protein-protein interaction map of Helicobacter pylori. Nature, 409(6817): 211–215. DOI: 10.1038/35051615. 18 D. Rapaport. 2005. How does the TOM complex mediate insertion of precursor proteins into the mitochondrial outer membrane? J. Cell. Biol., 171(3): 419–423. DOI: 10.1083/jcb .200507147. 142 N. Rapaport, N. Nativ, G. Stelzer, M. Twik, et al. 2013. MalaCards: An integrated compendium for diseases and their annotation. Database, Volume 2013, bat018. DOI: 10.1093/ database/bat018. 223 E. Ravasz and A. L. Barab´asi. 2003. Hierarchical organization in complex networks. Phys. Rev. E, 67: 026112. DOI: 10.1103/PhysRevE.67.026112. 28, 29 S. Razick, G. Magklaras, and I. M. Donaldson. 2008. iRefIndex: a consolidated protein interaction database with provenance. BMC Bioinform., 9: 405. DOI: 10.1186/1471- 2105-9-405. 24, 25, 51 I. Remy, F. X. Campbell-Valois, and S. W. Michnick. 2007. Detection of protein-protein interactions using a simple survival protein-fragment complementation assay based on the enzyme dihydrofolate reductase. Nat. Prot., 2: 2120–2125. DOI: 10.1038/nprot .2007.266. 15, 16, 21 I. Remy I and S. W. Michnick. 2004. Mapping biochemical networks with protein-fragment complementation assays. Meth. Mol. Biol., 261: 411–426. DOI: 10.1385/1-59259-762- 9:411. 15, 16, 21 P. Resnick. 1995. Using information content to evaluate semantic similarity in a taxonomy. Proc. 14th Int. Joint Conf. Artifi. Intell., Vol I: 448–453. 42 A. Ruepp, B. Brauner, I. Dunger-Kaltenbach, et al. 2008. CORUM: the comprehensive resource of mammalian protein complexes. Nucleic Acids Res., 36(Database issue): D646–D650. DOI: 10.1093/nar/gkm936. 11, 12, 93, 129, 140, 198, 216 S. Y. Rhee, W. Beavis, T. Z. Berardini, et al. 2003. The Arabidopsis Information Resource (TAIR): a model organism database providing a centralized, curated gateway to Arabidopsis biology, research materials and community. Nucleic Acids Res., 31(1): 224–228. DOI: 10.1093/nar/gkg076. 3 H. E. Richardson, C. Wittenberg, F. Cross, and S. I. Reed. 1989. An essential G1 function for cyclin-like proteins in yeast. Cell, 59: 1127–1133. DOI: 10.1016/0092-8674(89)90768-X. 191 G. Rigaut, A. Shevchenko, B. Rutz, et al. 1999. A generic protein purification method for protein complex characterization and proteome exploration. Nat. Biotechnol., 17(10): 1030–1032. DOI: 10.1038/13732. 6, 7, 15, 16, 19 266 References

S. Rizzetto, C. Priami, and A. Csik´asz-Nagy. 2015. Qualitative and quantitative protein complex prediction through proteome-wide simulations. PLoS Comp. Biol., 11(10): e1004424. DOI: 10.1371/journal.pcbi.1004424. 140, 141, 230 C. V. Robinson, A. Sali, and W. Baumeister. 2007. The molecular sociology of the cell. Nature, 450: 973–982. DOI: 10.1038/nature06523. 5 A. Roguev, S. Bandyopadhyay, M. Zofall, et al. 2008. Conservation and rewiring of functional modules revealed by an epistasis map in fission yeast. Science 2008, 322(5900): 405–410. DOI: 10.1126/science.1162609. 200 T. Rolland, M. Tasan, B. Charloteaux, et al. 2014. A proteome-scale map of the human interactome network. Cell, 159(5): 1212–1226. DOI: 10.1016/j.cell.2014.10.050. 6, 18, 24, 25, 50 P. Romero, Z. Obradovic, and A. K. Dunker. 1997. Sequence data analysis for long disordered regions prediction in the calcineurin family. Genome Inform., 8: 110–124. DOI: 10.11234/gi1990.8.110. 157 J. F. Rual, K. Venkatesan, T. Hao, et al. 2005. Towards a proteome-scale map of the human protein-protein interaction network. Nature, 437: 1173–1178. DOI: 10.1038/ nature04209. 18 P. Ruan, M. Hayashida, O. Maruyama, and T. Akutsu. 2013. Prediction of heterodimeric protein complexes from weighted protein-protein interaction networks using novel features and kernel functions. PLoS One, 8(6): e65265. DOI: 10.1371/journal.pone .0065265. 130, 230 P. Ruan, M. Hayashida, O. Maruyama, and T. Akutsu. 2014. Prediction of heterotrimeric protein complexes by two-phase learning using neighbouring kernels. BMC Bioinform., 15(Suppl 2): S6. 130, 230 A. Ruepp, B. Waegele, M. Lechner, et al. 2010. CORUM: the comprehensive resource of mammalian protein complexes - 2009. Nucleic Acids Res., 38(Database issue): D497–D501. DOI: 10.1093/nar/gkp914. 11, 12, 140, 198 D. Russel, K. Lasker, B. Webb, et al. 2012. Putting the pieces together: Integrative modeling platform software for structure determination of macromolecular assemblies. PLoS Biol., 10: e1001244. DOI: 10.1371/journal.pbio.1001244. 158 C. J. Ryan, A. Roguev, K. Patrick, J. Xu, et al. 2012. Hierarchical modularity and the evolution of genetic interactomes across species. Mol. Cell, 46: 691–704. DOI: 10.1016/j.molcel .2012.05.028. 122 C. J. Ryan, P. Cimerman, Z. A. Szpiech, et al. 2013. High-resolution network biology: connecting sequence with function. Nat. Rev. Genet., 14: 865–879. DOI: 10.1038/ nrg3574. 200 M. Safran, I. Solomon, O. Shmueli, et al. 2002. GeneCards 2002: towards a complete, object-oriented, human gene compendium. Bioinformatics, 18(11): 1542–1543. DOI: 10.1093/bioinformatics/18.11.1542. 2, 195 References 267

M. Safran, I. Dalah, J. Alexander, et al. 2010. GeneCards Version 3: the human gene integrator. Database (Oxford), baq020. DOI: 10.1093/database/baq020. 2, 195 R. Saito, M. E. Smoot, K. Ono, J. Ruscheinski, et al. 2012. A travel guide to Cytoscape plugins. Nat. Meth., 9(11): 1069–1076. 221 L. Salwinski, C. S. Miller, A. J. Smith, et al. 2004. The Database of Interacting Proteins: 2004 update. Nucleic Acids Res., 32(Database issue): D449–D451. DOI: 10.1093/nar/gkh086. 24 N. E. Sanjana, O. Shalem, and F. Zhang. 2014. Improved vectors and genome-wide libraries for CRISPR screening. Nat. Meth., 11: 783–784. 4 M. E. Sardiu, L. Florens, and M. P. Washburn. 2009. Evaluation of clustering algorithms for protein complex and protein interaction network assembly. J. Proteome Res., 8(6), 2944–2952. 73, 91 S. N. Savvides, S. Raghunathan, K. Futterer, et al. 2004. The C-terminal domain of full-length E. coli SSB is disordered even when bound to DNA. Protein Sci., 13(7): 1942–1947. DOI: 10.1110/ps.04661904. 163 E. Schadt, S. H. Friend, and D. A. Shaywitz. 2009. A network view of disease and compound screening. Nat. Rev. Drug Discov., 8: 286–295. DOI: 10.1038/nrd2826. 192 M. H. Schaefer, J. F. Fontaine, A. Vinayagam, et al. 2012. HIPPIE: Integrating protein interaction networks with experiment based quality scores. PLoS One 2012, 7(2): e31826. DOI: 10.1371/journal.pone.0031826. 51 D. Segr´e, A. De Luna, G. M. Church, and R. Kishony. 2004. Modular epistasis in yeast metabolism. Nat Genet., 37: 77–83. DOI: 10.1038/ng1489. 190 B. Seth, I. Ravi, and M. Avi. 2007. AVIS: AJAX viewer of interactive signaling networks. Bioinformatics, 23(20): 2803–2805. DOI: 10.1093/bioinformatics/btm444. 33 O. Shalem, N. E. Sanjana, and F. Zhang. 2015. High-throughput functional genomics using CRISPR/Cas9. Nat. Rev. Genet., 16: 299–311. DOI: 10.1038/nrg3899. 4 P. Shannon, A. Markiel, O. Ozier, et al. 2003. Cytoscape: A software environment for integrated models of biomolecular interaction networks. Genome Res., 13(11): 2498– 2504. DOI: 10.1101/gr.1239303. 32, 33, 164, 223 R. Sharan, S. Suthram, R. M. Kelley, et al. 2005a. Conserved patterns of protein interaction in multiple species. Proc. Natl. Acad. Sci. USA, 102(6): 1974–1979. DOI: 10.1073/pnas .0409522102. 172 R. Sharan, T. Ideker, B. Kelley, R. Shamir, and R. M. Karp. 2005b. Identification of protein complexes by comparative analysis of yeast and bacterial protein interaction data. J. Comp. Biol., 12(6): 835–846. 174, 176, 231 M. Shimoyama, J. D. Pons, T. G. Hayman, et al. 2015. The Rat Genome Database 2015: genomic, phenotypic and environmental variations and disease. Nucleic Acids Res., 43(Database issue): D743–D750. DOI: 10.1093/nar/gku1026. 3 268 References

B. A. Shoemaker and A. R. Panchenko. 2007. Deciphering protein-protein interactions. Part I. Experimental techniques and databases. PLoS Comp. Biol., 3(3): e42. DOI: 10.1371/journal.pcbi.0030042. 17, 21, 36, 53, 166 A. Sigalov, D. Aivazian, and L. Stern. 2004. Homooligomerization of the cytoplasmic domain of the T cell receptor zeta chain and of other proteins containing the immunoreceptor tyrosine-based activation motif. Biochemistry, 43(7): 2049–2061. 163 J. M. Silva, K. Marran, J. S. Parker, et al. 2008. Profiling essential genes in human mammary cells by multiplex RNAi screening. Science, 319 (5863): 617–620. DOI: 10.1126/science .1149185. 4 R. Singh, J. Xu, and B. Berger. 2008. Global alignment of multiple protein interaction networks with application to functional orthology detection. Proc. Natl. Acad. Sci. USA, 105: 12763–12768. DOI: 10.1073/pnas.0806627105. 173, 231 R. Singh, D. Park, J. Xu, et al. 2010. Struct2Net: a web service to predict protein-protein interactions using a structure-based approach. Nucleic Acids Res., 38(Web server issue): W508–W515. DOI: 10.1093/nar/gkq481. 56 M. Sipiczki. 2000. Where does fission yeast sit on the tree of life? Genome Biol., 1(2): 1011.1-1011.4. DOI: 10.1186/gb-2000-1-2-reviews1011. 168 A. Y. Sivachenko, A. Yuryev, N. Daraselia, and I. Mazo. 2007. Molecular networks in microarray analysis. J. Bioinfo. Comp. Biol., 5(2b): 429–456. 214, 215 A. Y. Sivachenko, A. Yuryev, N. Daraselia, and I. Mazo. 2005. Identifying local gene expression patterns in biomolecular networks. Proc. Comp. Syst. Bioinform. Conf., Stanford University, CA, 180–184. 215 M. E. Smoot, K. Ono, J. Ruscheinski, et al. 2010. Cytoscape 2.8: New features for data integration and network visualization. Bioinformatics, 27: 431–432. DOI: 10.1093/ bioinformatics/btq675. 32, 33, 164, 174, 223 B. Snel, P. Bork, and M. Huynen. 2000. Genome evolution. Gene fusion versus gene fission. Trend Genet., 16(1): 9–11. DOI: 10.1016/S0168-9525(99)01924-1. B. Snel, V. V. Noort, and M. A. Huynen. 2004. Gene co-regulation is highly conserved in the evolution of eukaryotes and prokaryotes. Nucleic Acids Res., 32(16): 4725–4731. DOI: 10.1093/nar/gkh815. 52 J. Snider, M. Kotlyar, P. Saraon, et al. 2015. Fundamentals of protein interaction network mapping. Mol. Syst. Biol., 11: 848. DOI: 10.15252/msb.20156351. 17 D. Soh, D. Dong, Y. Guo, and L. Wong. 2011. Finding consistent disease subnetworks across microarray datasets. BMC Bioinform., 12: S15. 215, 216 B. Sonnichsen,¨ L. B. Koski, A. Walsh, et al. 2005. Full-genome RNAi profiling of early embryogenesis in Caenorhabditis elegans. Nature, 434(7032): 462–469. DOI: 10.1038/nature03353. 189, 190 T. Sørensen. 1948. A method of establishing groups of equal amplitude in plant sociology based on similarity of species and its application to analyses of the vegetation on Danish commons. Kongelige Danske Videnskabernes Selskab, 5(4): 1–34. 39 References 269

O. Sorokina, A. Sorokin, and J. D. Armstrong. 2011. Towards a quantitative model of the post-synaptic proteome. Mol. Biosyst., 7(10): 2813–2823. DOI: 10.1039/C1MB05152K. 189 M. E. Sowa, E. J. Bennett, S. P. Gygi, and J. W. Harper. 2009. Defining the human deubiquitinating enzyme interaction landscape. Cell, 138(2): 389–403. DOI: 10.1016/j.cell.2009.04.042. 37, 40 V. Spirin and L. A. Mirny. 2003. Protein complexes and functional modules in molecular networks. Proc. Natl. Acad. Sci. USA, 100(21): 12123–12128. DOI: 10.1073/pnas .2032324100. 8, 34, 59, 64, 68, 229 J. Sprague, E. Doerry, S. Douglas, M. Westerfield, et al. 2001. The Zebrafish Information Network (ZFIN): a resource for genetic, genomic and developmental research. Nucleic Acids Res., 29(1): 87–90. DOI: 10.1093/nar/29.1.87. 3 R. Sprangers, A. Velyvis, and L. E. Kay. 2007. Solution NMR of supramolecular complexes: providing new insights into function. Nat. Meth., 4: 697–703. 164 E. Sprinzak, Y. Altuvia, and H. Margalit. 2006. Characterization and prediction of protein- protein interactions within and between complexes. Proc. Natl. Acad. Sci. USA, 103(40): 14718–14723. DOI: 10.1073/pnas.0603352103. 186 S. Srihari, K. Ning, and H. W. Leong. 2009. Refining Markov Clustering for protein complex prediction by incorporating core-attachment structure. Genome Inform., 23(1): 159– 168. DOI: 10.1142/9781848165632_0015. 64, 79, 82, 229 S. Srihari, K. Ning, and H. W. Leong. 2010. MCL-CAw: A refinement of MCL for detecting yeast complexes from weighted PPI networks by incorporating core-attachment structure. BMC Bioinform., 11: 504. DOI: 10.1186/1471-2105-11-504. 64, 79, 82, 107, 112, 151, 226, 229 S. Srihari, P. B. Madhamshettiwar, S. Song, et al. 2014a. Complex-based analysis of dysregulated cellular processes in cancer. BMC Syst. Biol., 8(Suppl 4): S1. DOI: 10.1186/1752-0509-8-S4-S1. 205 S. Srihari, V. Raman, H. W. Leong, and M. A. Ragan. 2014b. Evolution and controllability of cancer networks: a boolean perspective. IEEE/ACM Tran. Comp. Biol. Bioinform., 11(1): 83–94. 207 S. Srihari, C. H. Yong, A. Patil, and L. Wong. 2015a. Methods for protein complex prediction and their contributions towards understanding the organization, function and dynamics of complexes. FEBS Lett., 589(19A): 2590–2602. 59, 60, 91 S. Srihari, J. Singla, L. Wong, and M .A. Ragan. 2015b. Inferring synthetic lethal interactions from mutual exclusivity of genetic events in cancer. Biol. Dir., 10: 57. DOI: 10.1186/ s13062-015-0086-1. 189, 200 S. Srihari and H. W. Leong. 2012a. Employing functional interactions for characterisation and detection of sparse complexes from yeast PPI networks. Int. J. Bioinform. Res. Appl., 8(3/4): 286–304. 50, 52, 77, 108, 114, 116, 117, 230 270 References

S. Srihari and H. W. Leong. 2012b. A survey of computational methods for protein complex prediction from protein interaction networks. J. Bioinform. Comp. Biol., 11(2): 1230002. DOI: 10.1142/S021972001230002X. 59, 60, 91 S. Srihari and H. W. Leong. 2012c. Temporal dynamics of protein complexes in PPI Networks: a case study using yeast cell cycle dynamics. BMC Bioinform., 13(Suppl 17): S16. DOI: 10.1186/1471-2105-13-S17-S16. 151, 152, 231 S. Srihari and M. A. Ragan. 2013. Systematic tracking of dysregulated modules iden- tifies novel genes in cancer. Bioinformatics, 29(12): 1553–1561. DOI: 10.1093/ bioinformatics/btt191. 203, 204, 209, 216, 231 I. Stagljar, C. Korostensky, N. Johnsson, and S. te Heesen. 1998. A genetic system based on split-ubiquitin for the analysis of interactions between membrane proteins in vivo. Proc. Natl. Acad. Sci. USA 1998, 95(9): 5187–5192. DOI: 10.1073/pnas.95.9.5187. 22 C. Stark, B. J. Breitkreutz, A. Chatr-Aryamontri, et al. 2011. The BioGRID Interaction Database: 2011 update. Nucleic Acids Res., 39(Database issue): D698–D704. DOI: 10.1093/nar/gkq1116. 6, 8, 23, 24, 32, 33, 49, 55, 93, 112, 121 L. D. Stein, P. Sternberg, R. Durbin, et al. 2001. WormBase: network access to the genome and biology of Caenorhabditis elegans. Nucleic Acids Res., 29: 82–86. DOI: 10.1093/ nar/29.1.82. 3 A. Stein, A. C´eol, and P. Aloy. 2011. 3DID: identification and classification of domain-based interactions of known three-dimensional structure. Nucleic Acids Res., 39(Database issue): D718–D723. DOI: 10.1093/nar/gkq962. 170 U. Stelzl, U. Worm, M. Lalowski, et al. 2005. A human protein-protein interaction network: a resource for annotating the proteome. Cell, 122(6): 957–968. DOI: 10.1016/j.cell .2005.08.029. 18 M. P. H. Stumpf, T. Thorne, E. de Silva, et al. 2008. Estimating the size of the human interactome. Proc. Natl. Acad. Sci. USA, 105(19): 6959–6964. DOI: 10.1073/pnas .0708078105. 52 A. Subramanian, P. Tamayo, V. Mootha, S. Mukherjee, et al. 2005. Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. Proc. Natl. Acad. Sci. USA, 102(43): 15545–15550. DOI: 10.1073/pnas .0506580102. 212, 213 K. Suhre. 2007. Inference of gene function based on gene fusion events: the Rosetta-stone method. Meth. Mol. Biol., 396: 31–41. DOI: 10.1007/978-1-59745-515-2_3. 54 J. Sun, J. Xu, Z. Liu, et al. 2005. Refined phylogenetic profiles method for predicting protein- protein interactions. Bioinformatics, 21(16): 3409–3415. 53 S. Sun, F. Yang, G. Tan, M. Costanzo, et al. 2016. An extended set of yeast-based functional assays accurately identifies human disease mutations. Genome Res., 26: 670–680. DOI: 10.1101/gr.192526.115. 4 S. Suthram, J. T. Dudley, A. P. Chiang, et al. 2010. Network-based elucidation of human disease similarities reveals common functional modules enriched for pluripotent References 271

drug targets. PLoS Comp. Biol., 6(2): e1000662. DOI: 10.1371/journal.pcbi.1000662. 198, 199 D. Szklarczyk, A. Franceschini, M. Kuhn, et al. 2011. The STRING database in 2011: functional interaction networks of proteins, globally integrated and scored. Nucleic Acids Res. 2011, 39(Database issue): D561–D568. DOI: 10.1093/nar/gkq973. 24, 25, 49, 51, 52, 126, 161, 202 X. Tang, J. Wang, B. Liu, et al. 2011. A comparison of the functional modules identified from time course and static PPI network data. BMC Bioinform., 12: 339. DOI: 10.1186/1471- 2105-12-339. 152, 231 K. Tarassov, V. Messier, C. R. Landry, et al. 2008. An in vivo map of the yeast protein interactome. Science, 320(5882): 1465–1470. DOI: 10.1126/science.1153878. 16, 21 D. Tatsuke and O. Maruyama. 2013. Sampling strategy for protein complex prediction using cluster size frequency. Gene, 518(1): 152–158. DOI: 10.1016/j.gene.2012.11.050. 130, 230 S. A. Teichmann and M. M. Babu. 2002. Conservation of gene co-regulation in prokaryotes and eukaryotes. Trends Biotechnol., 20(10): 407–410. 52 G. Teo, G. Liu, J. Zhang, et al. 2014 SAINTexpress: improvements and additional features in Significance Analysis of Interactome software. J. Proteomics, 100: 37–43. DOI: 10.1016/j.jprot.2013.10.023. 37, 40 L. J. Terry, E. B. Shows, and S. R. Wente. 2007. Crossing the nuclear envelope: hierarchical regulation of nucleocytoplasmic transport. Science, 318(5855): 1412–1416. DOI: 10.1126/science.1142204. 162 A. Theocharidis, S. van Dongen, A. J. Enright, and T. C. Freeman. 2009. Network visualization and analysis of gene expression data using BioLayout Express (3D). Nat. Prot., 4(10): 1535–1550. DOI: 10.1038/nprot.2009.177. 33 The FlyBase Consortium. 1996. FlyBase—The Drosophila database. Nucleic Acids Res., 24: 53–56. DOI: 10.1093/nar/24.1.53. 3 T. Tian and J. Song. 2012. Mathematical modelling of the map kinase pathway using proteomic datasets. PLoS One, 7(8): e42230. DOI: 10.1371/journal.pone.0042230. 188 E. Tomita, A. Tanaka, and H. Takahashi. 2006. The worst-case time complexity for generating all maximal cliques and computational experiments. Theor. Comput. Sci., 363: 28–42. DOI: 10.1016/j.tcs.2006.06.015. 72, 87 P. Tompa and M. Fuxreiter. 2008. Fuzzy complexes: polymorphism and structural disorder in protein-protein interactions. Trend Biochem. Sci., 33: 2–8. DOI: 10.1016/j.tibs.2007 .10.003. 156, 157, 162 M. Torchala, I. H. Moal, R. A. Chaleil, J. Fernandez-Recio, and B. A. Bates. SwarmDock: a server for flexible protein-protein docking. Bioinformatics, 29: 807–809. DOI: 10.1093/bioinformatics/btt038. 158 272 References

T. Tsuji, T. Yoda, and T. Shirai. 2015. Deciphering supramolecular structures with protein- protein interaction network modeling. Sci. Rep., 5: 16341. DOI: 10.1038/srep16341. 158 B. Turner, S. Razick, A. L. Turinsky, J. Vlasblom, et al. 2010. iRefWeb: interactive analysis of consolidated protein interaction data and their supporting evidence. Database (Oxford), baq023. DOI: 10.1093/database/baq023. 24, 25, 51, 141 P. Uetz, L. Giot, G. Cagney, et al. 2000. A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature, 403: 623–627. DOI: 10.1038/ 35001009. 6, 15, 16, 18, 29, 179 M. Uhl´en, P. Oksvold, L. Fagerberg, et al. 2010. Towards a knowledge-based Human Protein Atlas. Nat. Biotech., 1248–1250. DOI: 10.1038/nbt1210-1248. 2, 3, 221 M. Uhl´en, L. Fagerberg, B. M. Hallstrom,¨ et al. 2015. Proteomics. Tissue-based map of the human proteome. Science, 347(6220): 1260419. DOI: 10.1126/science.1260419. 2, 3, 221 The UniProt Consortium 2015. UniProt: A hub for protein information. Nucleic Acids Res., 43(Database issue): D204-D212. DOI: 10.1093/nar/gku989. 2, 33 V. N. Uversky, C. J. Oldfield, and A. K. Dunker. 2005. Showing your ID: intrinsic disorder as an ID for recognition, regulation and cell signaling. J. Mol. Recognit., 18(5): 343–384. DOI: 10.1002/jmr.747. 156 A. Valencia and F. Pazos. 2002. Computational methods for the prediction of protein interac- tions. Curr. Opin. Struct. Biol., 12(3): 368–373. DOI: 10.1016/S0959-440X(02)00333-0. 52 T. J. P. Van Dam and B. Snel. 2008. Protein complex evolution does not involve extensive network rewiring. PLoS Comp. Biol., 4(7): e1000132. 179, 180, 231 S. M. Van Dongen. 2000. Graph clustering by flow simulation. Ph.D. Thesis, University of Utrecht. 64, 65, 82, 211, 226, 229 M. A. Van Driel, J. Bruggeman, G. Vriend, et al. 2006. A text-mining analysis of the human phenome. Eur. J. Human Genet., 14: 535–542. DOI: 10.1038/sj.ejhg.5201585. 197 K. Van Roey, B. Uyar, R. J. Weatheritt, et al. 2014. Short linear motifs: ubiquitous and functionally diverse protein interaction modules directing cell regulation. Chem. Rev., 114: 6733–6778. DOI: 10.1021/cr400585q. 146 B. VanderSluis, J. Bellay, G. Musso, et al. 2007. Genetic interactions reveal the evolutionary trajectories of duplicate genes. Mol. Syst. Biol., 6: 429. DOI: 10.1038/msb.2010.82. 182 F. Vandin, E. Upfal, and B. J. Raphael. 2011. Algorithms for detecting significantly mutated pathways in cancer. J. Comp. Biol., 18(3): 507–522. DOI: 10.1089/cmb.2010.0265. 201, 209, 231 O. Vanunu, O. Magger, E. Ruppin, T. Shlomi, and R. Sharan. 2010. Associating genes and protein complexes with disease via network propagation. PLoS Comp. Biol., 6(1): e1000641. DOI: 10.1371/journal.pcbi.1000641. 195, 209, 231 References 273

M. Varadi, S. Kosol, P. Lebrun, et al. 2014. pE-DB: the database of structural ensembles of intrinsically disordered and denatured proteins. Nucleic Acids Res., 42(Database issue): D326-D335. DOI: 10.1093/nar/gkt960. 157, 164 I. Vastrik, P. D’Eustachio, E. Schmidt, et al. 2007. Reactome: a knowledge base of biologic pathways and processes. Genome Biol., 8: R39. DOI: 10.1186/gb-2007-8-3-r39. 179, 221, 223 D. V. Veres, D. M. Gyurko,´ B. Thaler, et al. 2015. ComPPI: a cellular compartment- specific database for protein-protein interaction network analysis. Nucleic Acids Res., 43(Database issue): D485–D493. DOI: 10.1093/nar/gku1007. 51 M. Vidal. 2016. How much of the human protein interactome remains to be mapped? Sci. Signal., 9: eg7. DOI: 10.1126/scisignal.aaf6030. 52 A. J. Vilella, J. Severin, A. Ureta-Vidal, et al. 2009. EnsemblCompara GeneTrees: Complete, duplication-aware phylogenetic trees in vertebrates. Genome Res., 19(2): 327–335. DOI: 10.1101/gr.073585.107. 53 A. Vinayagam, Y. Hu, M. Kulkarni, C. Roesel, et al. 2013. Protein complex-based analysis framework for high-throughput datasets. Sci. Signal., 6(264): rs5. DOI: 10.1126/ scisignal.2003629. 12, 13 B. Vladimir and M. Andrej. 2004. Pajek, analysis and visualization of large networks. In M. Junger, P. Mutzel, G. Farin, H-C. Hege, D. Hoffman, C. R. Johnson, and K. Polthier, editors. Graph Drawing Software, 77–103. Springer, Berlin Heidelberg. 33 T. V. Vo, J. Das, M. J. Meyer, N. A. Cordero, et al. 2016. A proteome-wide fission yeast interactome reveals network evolution principles from yeasts to human. Cell, (1/2): 310–323. DOI: 10.1016/j.cell.2015.11.037. 18, 21, 168 K. Voevodski, S. H. Teng, and Y. Xia. 2009. Spectral affinity in protein networks. BMC Syst. Biol., 3: 112. DOI: 10.1186/1752-0509-3-112. 37, 46 C. Vogel, M. Bashton, N. D. Kerrison, C. Chothia, and S. A. Teichmann. 2004. Structure, function and evolution of multidomain proteins. Curr. Opin. Struct. Biol., 14: 208–216. DOI: 10.1016/j.sbi.2004.03.011. 118 J. Von Eichborn, M. Dunkel, B. O. Gohlke, et al. 2013. SynSysNet: integration of experimental data on synaptic protein-protein interactions with drug-target relations. Nucleic Acids Res., 41(Database issue): D834–D840. DOI: 10.1093/nar/gks1040. 189 G. Von Heijne. 2007. The membrane protein universe: what’s out there and why bother? J. Int. Med., 261(6): 543–557. 22 A. Von Kriegsheim, D. Baiocchi, M. Birtwistle, et al. 2009. Cell fate decisions are specified by the dynamic ERK interactome. Nat. Cell Biol., 11(12): 1458–1464. DOI: 10.1038/ ncb1994. 188 C. Von Mering, R. Krause, B. Snel, et al. 2002. Comparative assessment of large-scale data sets of protein-protein interactions. Nature, 417(6887): 399–403. DOI: 10.1038/nature750. 6, 9, 19, 35, 36, 52, 179 274 References

C. Von Mering, M. Huynen, D. Jaeggi, et al. 2003. STRING: a database of predicted functional associations between proteins. Nucleic Acids Res., 31(1): 258–261. DOI: 10.1093/nar/ gkg034. 24, 25, 49, 51, 52, 126, 161 A. H. Wagner, A. C. Coffman, B. J. Ainscough, N. C. Spies, et al. 2016. DGIdb 2.0: mining clinically relevant drug-gene interactions. Nucleic Acids Res., 44(Database issue): D1036–44. DOI: 10.1093/nar/gkv1165. 223 A. Wagner and D. A. Fell. 2001. The small world inside large metabolic networks. Proc. Biol. Sci., 268(1478): 1803–1810. DOI: 10.1098/rspb.2001.1711. 28 H. Wainer and S. Lysen. 2009. A window on data can be a window on discovery. Amer. Scientist, 97(4): 272. 225 A. J. M. Walhout, R. Sordella, X. Lu, et al. 2000. Protein interaction mapping in C. elegans using proteins involved in vulval development. Science, 287: 116–122. DOI: 10.1126/ science.287.5450.116. 166 I. Walsh, M. Giollo, T. D. Domenico, C. Ferrari, et al. 2015 Comprehensive large-scale assessment of intrinsic protein disorder. Bioinformatics, 31(2): 201–208. DOI: 10.1093/bioinformatics/btu625. 156, 161 C. Wan, B. Borgeson, S. Phanse, et al. 2015. Panorama of ancient metazoan macromolecular complexes. Nature, 525(7569): 339–344. DOI: 10.1038/nature14877. 11, 12, 165 H. Wang, B. Kakaradov, S. R. Collins, et al. 2009. A Complex-based Reconstruction of the Saccharomyces cerevisiae Interactome. Mol. Cell Proteom., 8(6): 1361–1381. DOI: 10.1074/mcp.M800490-MCP200. 64, 73, 186, 226, 229 J. Wang, M. Li, Y. Deng, and Y. Pan. 2010. Recent advances in clustering methods for protein interaction networks. BMC Genomics, 11(Suppl 3): S10 DOI: 10.1186/1471-2164-11- S3-S10. 61 X. Wang, X. Wei, B. Thijssen, et al. 2012. Three-dimensional reconstruction of protein networks provides insight into human genetic disease. Nat. Biotechnol., 30(2): 159– 166. DOI: 10.1038/nbt.2106. 56 J. Wang, X. Peng, M. Li, and Y. Pan. 2013. Construction and application of dynamic protein interaction network based on time-course gene-expression data. Proteomics, 13(2): 301–312. DOI: 10.1186/1477-5956-11-S1-S20. 153 T. Wang, J. J. Wei, D. M. Sabatini, and E. S. Lander. 2014. Genetic screens in human cells using the CRISPR-Cas9 system. Science, 343(6166): 80–84. DOI: 10.1126/science.1246981. 4 S. M. Wang, Z. Q. Sun, H. Y. Li, et al. 2015. Temporal identification of dysregulated genes and pathways in clear cell renal cell carcinoma based on systematic tracking of disrupted modules. Comp. Math. Meth. Med., 313740. 206 Y. Wang, N. Sahni, and M. Vidal. 2015. Global edgetic rewiring in cancer networks. Cell Syst., 1(4): 251–253. DOI: 10.1016/j.cels.2015.10.006. 200 T. Wang, K. Birsoy, N. W. Hughes, et al. 2015. Identification and characterization of essential genes in the human genome. Science, 350(6264): 1096–1101. DOI: 10.1126/science .aac7041. 4 References 275

D. Warde-Farley, S. L. Donaldson, O. Comes, et al. 2010. The GeneMANIA prediction server: biological network integration for gene prioritization and predicting gene function. Nucleic Acids Res., 38(Web server issue): W214–W220. DOI: 10.1093/nar/gkq537. 51, 55 M. P. Washburn. 2016. There is no human interactome. Genome Biol., 17: 48. DOI: 10.1186/ s13059-016-0913-4. 145 D. J. Watts and S. H. Strogatz. 1998. Collective dynamics of ‘small-world’ networks. Nature, 393: 440–442. DOI: 10.1038/30918. 26, 28, 47 J. Weiner 3rd, F. Beaussart, and E. Bornberg-Bauer. 2006. Domain deletions and substitutions in the modular protein evolution. FEBS J., 273: 2037–2047. DOI: 10.1111/j.1742-4658.2006.05220.x. 118 R. G. Welch. 1977. On the role of organized multienzyme systems in cellular metabolism: A general synthesis. Prog. Biophys. Mol. Biol., 32(2), 103–189. DOI: 10.1016/0079- 6107(78)90019-6. xiii G. R. Welch. 2009. The ‘fuzzy’ interactome. Trend Biochem. Sci., 34(1): 1–2. DOI: 10.1016/j .tibs.2008.10.007. 35, 36 M. Wells, H. Tidow, T. J. Rutherford, P. Markwick, et al. 2008. Structure of tumor suppressor p53 and its intrinsically disordered N-terminal transactivation domain. Proc. Natl. Acad. Sci. USA, 105: 5762–5767. DOI: 10.1073/pnas.0801353105. 160 M. Y. White, D. A. Brown, S. Sheng, et al. 2011. Parallel proteomics to improve coverage and confidence in the partially annotated Oryctolagus cuniculus mitochondrial proteome. Mol. Cell Proteom., 10: M110004291. 209 C. K. Widita and O. Maruyama. 2013. PPSampler2: Predicting protein complexes more accurately and efficiently by sampling. BMC Syst. Biol., 7(Suppl 6): S14. DOI: 10.1186/ 1752-0509-7-S6-S14. 132, 230 M. Wilhelm, J. Schlegl, H. Hahne, et al. 2014. Mass-spectrometry-based draft of the human proteome. Nature, 509: 582–587. DOI: 10.1038/nature13319. 2, 3 M. R. Wilkins, J. C. Sanchez, A. A. Gooley, et al. 1996. Progress with proteome projects: Why all proteins expressed by a genome should be identified and how to do it. Biotech. Genet. Eng. Rev., 13: 19–50. DOI: 10.1080/02648725.1996.10647923. 2 T. Will and V. Helms. 2014. Identifying transcription factor complexes and their roles. Bioinformatics, 30(17): i415–i421. DOI: 10.1093/bioinformatics/btu448. 121, 122, 230 E. A. Winzeler, D. D. Shoemaker, A. Astromoff, et al. 1999. Functional characterization of the S. cerevisiae genome by gene deletion and parallel analysis. Science, 285(5429): 901–906. DOI: 10.1126/science.285.5429.901. 199 D. S. Wishart, C. Knox, A. C. Guo, S. Shrivastava, et al. 2006. DrugBank: a comprehensive resource for in silico drug discovery and exploration. Nucleic Acids Res., 34(Database issue): D668–72. DOI: 10.1093/nar/gkj067. 223 276 References

J. R. Wisniewski, P. Ostasiewicz, K. Dus, et al. 2012. Extensive quantitative remodeling of the proteome between normal colon tissue and adenocarcinoma. Mol. Syst. Biol.,8: 611. DOI: 10.1038/msb.2012.44. 206

Y. I. Wolf, I. B. Rogozin, A. S. Kondrashov, and E. V. Koonin. 2001. Genome alignment, evolution of prokaryotic genome organization and prediction of gene function using genomic context. Genome Res., 11: 356–372. DOI: 10.1101/gr.161901. 166

V. Wood, M. A. Harris, M. D. McDowall, K. Rutherford, et al. 2012. PomBase: a comprehensive online resource for fission yeast. Nucleic Acids Res., 40(Database issue): D695–D699. DOI: 10.1093/nar/gkr853. 3

C. T. Workman, H. C. Mak, S. McCuine, et al. 2006. A systems approach to mapping DNA-damage response pathways. Science, 312(5776): 1054–1059. DOI: 10.1126/ science.1122088. 200

M. Wozniak, L. Wong, and J. Tiuryn. 2011. CAMBer: An appproach to support comparative analysis of multiple bacterial strains. BMC Genomics, 12(Suppl. 2): S6. DOI: 10.1186/ 1471-2164-12-S2-S6. 166

M. Wozniak, L. Wong, and J. Tiuryn. 2014. eCAMBer: Efficient support for large-scale comparative analysis of multiple bacterial strains. BMC Bioinform., 15: 65. DOI: 10.1186/1471-2105-15-65. 166

P. E. Wright and H. J. Dyson. 1999. Intrinsically unstructured proteins: re-assessing the protein structure-function paradigm. J. Mol. Biol., 293: 321–31. DOI: 10.1006/jmbi .1999.3110. 156

X. Wu, R. Jiang, M. Q. Zhang, and S. Li. 2008. Network-based global inference of human disease genes. Mol. Syst. Biol., 4: 189. DOI: 10.1038/msb.2008.27. 192

M. Wu, X. Li, and S. K. Ng. 2009. A core-attachment based method to detect protein complexes in PPI networks. BMC Bioinform., 10: 169. DOI: 10.1186/1471-2105-10-169. 64, 79, 81, 94, 107, 112, 226, 229

G. Wu, X. Feng, and L. Stein. 2010. A human functional protein interaction network and its application to cancer data analysis. Genome Biol., 11: R53. DOI: 10.1186/gb-2010-11- 5-r53. 192

M. Wu, X. Li, C. K. Kwoh, S. K. Ng, and L. Wong. 2012. Discovery of Protein Complexes with Core-Attachment Structures from TAP Data. J. Comput. Biol., 19(9): 1027–1042. DOI: 10.1089/cmb.2010.0293. 64, 79, 83, 107, 134, 226, 229

I. Xenarios, L. Salwinski, X. J. Duan, et al. 2002. DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions. Nucleic Acids Res., 30(1): 303–305. DOI: 10.1093/nar/30.1.303. 23, 24, 29, 154

J. Xia, M. J. Benner, and R. E. W. Hancock. 2014. NetworkAnalyst - integrative approaches for protein-protein interaction network analysis and visual exploration. Nucleic Acids Res., 42(Web server issue): W167–W174. DOI: 10.1093/nar/gku443. 223 References 277

Q. Xie, G. Arnold, P. Romero, et al. 1998. The sequence attribute method for determining relationships between sequence and protein disorder. Genome Inform., 9: 193–200. 156 T. Xu, L. F. Du, and Y. Zhou. 2008. Evaluation of GO-based functional similarity measures using S. cerevisiae protein interaction and expression profile data. BMC Bioinform., 9: 472. 42 J. Xu and Y. Li. 2006. Discovering disease-genes by topological features in human protein-protein interaction network. Bioinformatics, 22(22): 2800–2805. DOI: 10.1093/bioinformatics/btl467. 192 G. Yachdav, E. Kloppmann, L. Kajan, et al. 2014. PredictProtein–An open resource for online prediction of protein structural and functional features. Nucleic Acids Res., 42(Web server issue): W337–W343. DOI: 10.1093/nar/gku366. 56 T. Yamada and P. Bork. 2009. Evolution of biomolecular networks—lessons from metabolic and protein interactions. Nat. Rev. Mol. Cell Biol., 10: 791–803. DOI: 10.1038/nrm2787. 166, 169, 179 P. Yang, X. Li, M. Wu, C. K. Kwoh, and S. Ng. 2011. Inferring gene-phenotype associations via global protein complex network propagation. PLoS One, 6(7): e21502. DOI: 10.1371/journal.pone.0021502. 196, 209, 231 Z. Yao, K. Darowski, N. St-Denis, V. Wong, et al. 2017. A global analysis of the receptor tyrosine kinase-protein phosphatase interactome. Mol. Cell, 65(2): 347–360. DOI: 10.1016/j.molcel.2016.12.004. 16, 17, 23 E. Yeger-Lotem and R. Sharan. 2015. Human protein interaction networks across tissues and diseases. Front. Genet., 6: 257. DOI: 10.3389/fgene.2015.00257. 199 S. O. Yesylevskyy, V. N. Kharkyanen, and A. P. Demchenko. 2006. Dynamic protein domains: identification, interdependence and stability. Biophys. J.,91: 670–685. DOI: 10.1529/ biophysj.105.078584. 118 C. H.Yong, G. Liu, H. N. Chua, and L. Wong. 2012. Supervised maximum-likelihood weighting of composite protein networks for complex prediction. BMC Syst. Biol., 6(Suppl 2): S13. DOI: 10.1186/1752-0509-6-S2-S13. 49, 64, 77, 117, 136, 229, 230 C. H. Yong, O. Maruyama, and L. Wong. 2014. Discovery of small protein complexes from PPI networks with size-specific supervised weighting. BMC Syst. Biol., 8(Suppl 5): S3. DOI: 10.1186/1752-0509-8-S5-S3. 126, 136, 230 C. H. Yong and L. Wong. 2015a. From the static interactome to dynamic protein complexes: three challenges. J. Bioinform. Comput. Biol., 13(2): 1571001. DOI: 10.1142/S0219720015710018. 93, 107, 109, 110, 136 C. H. Yong and L. Wong. 2015b. Prediction of problematic complexes from PPI networks: sparse, embedded and small complexes. Biol. Dir., 10: 40. 136, 137 S. H. Yook, Z. N. Oltvai, and A. L. Barab´asi. 2004. Functional and topological characterization of protein interaction networks. Proteomics, 4(4): 928–942. DOI: 10.1002/pmic .200300636. 28 278 References

H. Yu, N. M. Luscombe, H. X. Lu, et al. 2004. Annotation transfer between genomes: protein- protein interologs and protein-dna regulogs. Genome Res., 14(6): 1107–1118. DOI: 10.1101/gr.1774904. 166, 168 H. Yu, P. Braun, M. A. Yildirim, et al. 2008. High-quality binary protein interaction map of the yeast interactome network. Science, 322: 104–109. DOI: 10.1126/science.1158684. 24, 25 H. Yu, L. Tardivo, S. Tam, et al. 2011. Next-generation sequencing to generate interactome datasets. Nat. Meth., 8: 478–480. 24, 25 J. Zahiri, J. H. Bozorgmehr, and A. Masoudi-Nejad. 2013. Computational prediction of protein-protein interaction networks: algorithms and resources. Curr. Genom., 14(6): 397–414. DOI: 10.2174/1389202911314060004. 52 N. Zaki and A. Mora. 2014. A comparative analysis of computational approaches and algorithms for protein subcomplex identification. Sci. Rep., 4: 4262. DOI: 10.1038/ srep04262. 134, 135, 230 A. Zanzoni, L. Montecchi-Palazzi, M. Quondam, et al. 2002. MINT: a Molecular INTeraction database. FEBS Lett., 513(1): 135–140. DOI: 10.1093/nar/gkl950. 20, 24, 49, 170 A. Zanzoni, M. Soler-Lopez,´ and P. Aloy. 2009. A network medicine approach to human disease. FEBS Lett., 583(11): 1759–1765. DOI: 10.1016/j.febslet.2009.03.001. 192 B. Zhang, B. H. Park, T. Karpinets, and N. F. Samatova. 2008. From pull-down data to protein interaction networks and complexes with biological relevance. Bioinformatics, 24(7): 979–986. DOI: 10.1093/bioinformatics/btn036. 34, 37, 39, 44, 54, 135 Q. C. Zhang, D. Petrey, L. Deng, et al. 2012. Structure-based prediction of protein-protein interactions on a genome-wide scale. Nature, 490(7421): 556–560. DOI: 10.1038/ nature11503. 51, 56 Q. C. Zhang, D. Petrey, J. I. Garzon,´ et al. 2013. PrePPI: a structure-informed database of protein-protein interactions. Nucleic Acids Res., 41(Database issue): D828–D833. DOI: 10.1093/nar/gks1231. 51, 56 A. Zhang. 2009. Protein Interaction Networks: Computational Analysis. Cambridge University Press, New York. 5 J. Zhao, S. H. Lee, M. Huss, and P. Holme. 2013. The network organization of cancer- associated protein complexes in human tissues. Scient. Rep., 3: 1583. DOI: 10.1038/ srep01583. 206, 216, 231 X. Zhou, J. Menche, A. L. Barab´asi, and A. Sharma. 2014. Human symptoms-disease network. Nat. Comm., 5: 4212. DOI: 10.1038/ncomms5212. 192 E. Zotenko, K. S. Guimar˜aes, R. Jothi, and T. Przytycka. 2006. Decomposition of overlapping protein complexes: A graph theoretical method for analyzing static and dynamic protein associations. Alg. Mol. Biol., 1: 7. DOI: 10.1186/1748-7188-1-7. 123, 125, 230