Understanding Protein-Protein Interaction Networks

Understanding Protein-protein Interaction Networks Thesis submitted for the degree of “Doctor of Philosophy” by Ariel Jaimovich Submitted to the Senate of the Hebrew University September 2010 Supervised by Prof. Hanah Margalit and Prof. Nir Friedman i Abstract All living organisms consist of living cells and share basic cellular mechanisms. Amazingly, all those cells, whether from a bacterium or a human being, although different in their structure and complexity, comprise the same building blocks of macro molecules: DNA, RNA, and proteins. Proteins play major roles in all cellular processes: they create signaling cascades, regulate almost every process in the cell, act as selective porters on the cell membrane, accelerate chemical reactions, and many more. In most of these tasks, proteins work in concert, by creating complexes of varying sizes, modifying one another and transporting other proteins. These interactions vary in many aspects: they might take place under specific conditions, have different biophysical properties, different functional roles, etc. Identifying and characterizing the full repertoire of interacting protein pairs are of crucial importance for understanding the functionality of a living cell. In the last decade, development of new technologies allowed large-scale measurements of interaction networks. In turn, many studies used the results of such assays to gain functional insights into specific proteins and specific pathways as well as learn about the more global characteristics of the interaction network. Unfortu- nately, the experimental noise in the large-scale assays makes such analyses hard, challenging the development of advanced computational approaches towards this goal. In the first part of my PhD work I used the language of relational graphical models to suggest a novel statistical framework for representing interaction networks. This framework enables taking into account uncertainty about the observed large-scale measurements, while investigating the interaction network properties. Specifically, it allows simultaneous prediction of all interactions given the results of large-scale experimental assays. I applied this model to noisy ob- servations of protein-protein interactions and showed how such simultaneous pre- dictions enable intricate information flow between the interactions, allowing for better prediction of missing information. However, application of this model to the entire interaction network would require creation of a huge model over millions ii of interactions. Thus, I turned to develop tools that will allow realistic representa- tion of such models over very large interaction networks and also enable efficient computation of approximate answers to probabilistic queries. Such tools facilitate learning the properties of these models from experimental data, while taking into account the uncertainty arising from experimental noise. Importantly, I created a code framework to allow efficient implementation of these (and other) algorithms, and devoted a special effort to provide an implementation of such ideas to general models. To date this library has been used in a wide variety of applications, such as protein design algorithms and object localization in cluttered images. The last part of my PhD research addressed genetic interaction networks. In this work, to- gether with Ruty Rinott, we showed how analysis of the network properties leads to novel biological insights. We devised an algorithm that used data from genetic interactions to create an automatic organization of the genes into functionally co- herent modules. Next, we showed how using additional information on genetic screens performed under a range of chemical perturbations sheds light on the cellular function of specific modules. As large-scale screens of genetic interactions are becoming a widely used tool, our method should be a valuable tool for extract- ing insights from these results. iii Contents 1 Introduction 1 1.1 Protein-protein interactions . 1 1.1.1 Large-scale identification of protein-protein interactions . 2 1.1.2 Genetic interactions as a tool to decipher protein function . 5 1.1.3 From interactions to networks . 7 1.1.4 Uncertainty in interaction networks . 9 1.2 Probabilistic graphical models . 9 1.2.1 Markov networks . 10 1.2.2 Approximate inference in Markov networks . 12 1.2.3 Relational graphical models . 14 1.3 Research Goals . 15 2 Paper chapter: Towards an integrated protein-protein interaction network: a relational Markov network approach 17 3 Paper chapter: Template based inference in symmetric relational Markov random fields 38 4 Paper chapter: FastInf - an efficient approximate inference library 48 5 Paper chapter: Modularity and directionality in genetic interaction maps 53 6 Discussion and conclusions 63 6.1 Learning relational graphical models of interaction networks . 63 6.2 Lifted inference in models of interaction networks . 66 6.3 In-vivo measurements of protein-protein interactions . 68 6.4 Analysis of genetic interaction maps . 68 6.5 Implications of our methodology for analysis of networks . 72 6.6 Integration of interaction networks with other data sources . 74 iv 6.7 Open source software . 74 6.8 Concluding remarks . 75 v 1 Introduction In the last couple of decades, large-scale data have accumulated for many types of interactions, varying from social interactions through links between pages of the world wide web and to various types of biological relations between proteins. Visualization of such data as networks and analysis of the properties of these networks has proven useful to explore these complex systems (Alon, 2003; Yamada and Bork, 2009; Boone et al., 2007; Handcock and Gile, 2010). In this thesis I will concentrate on analysis of protein-protein interaction networks, introducing novel methods that should be valuable also for the analysis of different kinds of networks. 1.1 Protein-protein interactions All living organisms consist of living cells and share basic cellular mechanisms. Amazingly, all those cells, whether from a bacterium or a human being, although different in their structure and complexity, comprise the same building blocks of macromolecules: DNA, RNA, and proteins. Protein sequences are encoded in DNA and synthesized by a well known path- way that is often referred to as the central dogma of molecular biology (Figure 1). It states that the blueprint of each cell is encoded in its DNA sequence. The DNA is replicated (and so the blueprint is passed on to the cell’s offsprings), and part of it is transcribed to messenger RNA (mRNA) which is, in turn, translated into proteins. After translation of a mRNA to a protein there are still many processes the protein has to undergo before it is functional. First it has to fold into the cor- rect three dimensional structure, then it has to be transported to a specific cellular localization, and often it has to undergoes specific modifications. There are many types of proteins in a cell (e.g., 6000 proteins types in the budding yeast Saccha- romyces cerevisiae) and each of them can be expressed in one copy or in thou- sands of copies (Ghaemmaghami et al., 2003). The expression levels are tightly regulated, and thus similar cells can have drastically different sets of expressed 1 Figure 1: The Central Dogma of Biology The DNA sequence (left) holds the blueprint of all cells. Part of it is transcribed to messenger RNA (mRNA; middle) which is translated to proteins (right) proteins under different conditions. Proteins play major roles in all cellular processes: they create signaling cascades, regulate almost every process in the cell, act as selective transporters on the cell membrane, accelerate chemical reactions, and many more. In most of these tasks, proteins work in concert, by creating complexes of varying sizes, modifying one another, and transporting other proteins. These complexes act as small machineries that perform a variety of tasks in the cell. They take part in cell metabolism, signal transduction, DNA transcription and duplication, DNA dam- age repair, and many more. Identifying and characterizing the full repertoire of these cellular machineries and the interplay between them are of crucial importance for understanding the functionality of a living cell. 1.1.1 Large-scale identification of protein-protein interactions Literature curated datasets, gathering information from many small-scale assays were the first to offer a proteome-scale coverage of protein-protein interactions (Mewes et al., 1998; Xenarios et al., 2000). These datasets have also served as a valuable resource for computational methods that used them to train models that can predict protein-protein interactions from genomic and evolutionary information sources (Marcotte et al., 1999a; Pellegrini et al., 1999). The first method to directly query protein-protein interactions in a systematic 2 A B C D Figure 2: Identifying protein-protein interactions using the yeast two-hybrid method: (A) A Transcription factor has a binding domain and an activation domain. (B) The bait protein is fused to the DNA binding domain and the prey protein is fused to the transcription activation domain (C) If the proteins interact, RNA polymerase is recruited to the promoter and the reporter gene is transcribed (D) If there is no physical interaction, the reporter gene is not transcribed. manner was a high throughput adaptation of the yeast two-hybrid method (Uetz et al., 2000; Ito et al., 2001). In this method a DNA binding domain is fused to a ’bait’ protein

Understanding Protein-Protein Interaction Networks

RECENT ADVANCES in BIOLOGY, BIOPHYSICS, BIOENGINEERING and COMPUTATIONAL CHEMISTRY

Program Book

Modeling and Analysis of RNA-Seq Data: a Review from a Statistical Perspective

24019 - Probabilistic Graphical Models

Curriculum Vitae—Nir Friedman

Learning Belief Networks in the Presence of Missing Values and Hidden Variables

Using Bayesian Networks to Analyze Expression Data

Speakers Info

RNA Velocity Analysis for Pertrub-Seq Mesert Kebed

David Van Dijk

UNIVERSITY of CALIFORNIA RIVERSIDE RNA-Seq

Bayesian Group Factor Analysis with Structured Sparsity