ANALYZING AND MODELING LARGE BIOLOGICAL NETWORKS: INFERRING SIGNAL TRANSDUCTION PATHWAYS

by

GÜRKAN BEBEK

Submitted in partial fulfillment of the requirements for the Degree of Doctor of Philosophy

Electrical Engineering and Computer Science Department

CASE WESTERN RESERVE UNIVERSITY

January 2007

CASE WESTERN RESERVE UNIVERSITY

SCHOOL OF GRADUATE STUDIES

We hereby approve the dissertation of

Gürkan Bebek ______

candidate for the Ph.D. degree *.

Jiong Yang (signed)______(chair of the committee)

S. Cenk Sahinalp ______

Tekin Ozsoyoglu ______

Mark Adams ______

Jing Li ______

______

January 12, 2007 (date) ______

*We also certify that written approval has been obtained for any proprietary material contained therein.

to Gamze...

Contents

List of Tables

List of Figures

1 Introduction
1.1 Background
1.1.1 Graph Theoretic Definitions
1.1.2 Signal Transduction Pathways
1.1.3 Protein-Protein Interactions
1.1.4 Discovery of Protein-Protein Interactions
1.2 Contributions

2 Evolutionary Models of Proteome Networks
2.1 Biological Networks
2.1.1 The Evolution of Protein-Protein Interactions
2.1.2 Random Network Models
2.1.3 Properties of Networks
2.2 Proteome Growth Model
2.3 Analysis of the Proteome Growth Model
2.3.1 Properties of the pure duplication model
2.3.2 On the degree distribution of the proteome growth model
2.4 Discussion

3 Enhanced Duplication Model
3.1 Sequence Similarity Distribution in the Proteome
3.2 Enhanced Model Based on Sequence Similarity
3.3 Discussion

4 Discovering Signaling Pathways: PathFinder
4.1 PathFinder
4.1.1 Preliminary
4.2 Methods
4.2.1 Mapping Proteins to Functional Annotations
4.2.2 Mining Association Rules from Known Pathways
4.2.3 Constructing a Weighted Protein-Protein Interaction Network
4.2.4 Searching for Pathway Segments
4.3 Experiments on the Yeast Proteome Network
4.4 Discussion

5 Conclusions and Reflections

Bibliography

List of Tables

3.1 The average clustering coefficients of the DIP Protein-Protein Interaction Network, Proteome Growth Model, and the Enhanced Model
4.1 Binary Table Example
4.2 PathFinder Search Results

List of Figures

2-1 ℓ-hop
2-2 Percentage of singletons in the pure duplication model
2-3 Average degree of non-singleton nodes in pure duplication model

3-1 Degree distribution of the Yeast and the proteome growth model interaction networks
3-2 ℓ-hop degree distribution comparison of the Yeast and Proteome Growth Model
3-3 Distribution of pairwise sequence similarity of yeast proteins
3-4 Aggregate distribution of pairwise sequence similarity of yeast proteins
3-5 Enhanced Model Based on Sequence Similarity
3-6 Degree distribution of the proteome sequence similarity networks
3-7 Degree distribution of the interaction networks
3-8 ℓ-hop degree distribution of the yeast, proteome growth model and the sequence similarity enhanced model

4-1 MAP Kinase Pathways
4-2 PathFinder
4-3 Two interacting proteins and their linked annotation terms
4-4 Association Rule Mining Parameters
4-5 PathFinder Ste7-Dig2 Simple Path Results
4-6 PathFinder Ste7-Dig2 Signaling Pathway Segment Results
4-7 The Pheromone Response Signaling Pathway
4-8 The High Osmolarity Signaling Pathway

Acknowledgements

It is with great pleasure that I thank those who have helped me in my Ph.D. studies.

I would like to acknowledge Dr. S. Cenk Şahinalp for recruiting me as a graduate student, and for his guidance throughout my education. After his move to Vancouver, Canada, he offered me his continued help in finishing this program, both financially and academically. I have learned a great many skills from him, and I will remain a supporter of Dr. Şahinalp after my graduation.

I am very grateful to Dr. Jiong Yang for agreeing to take over my advisory duties and for helping me accelerate my studies. I appreciate his financial support during my last years and his guidance throughout my studies since he moved to Case. His guidance on finding interesting problems and accurate approaches should be mentioned here. I would also like to thank him for being my dissertation committee chair.

I would like to express my gratitude to Prof. Meral Özsoyoğlu and Prof. Tekin Özsoyoğlu for their help and guidance during these last five years. It has always been an inspiration to see their academic achievements. I especially would like to mention Prof. Tekin Özsoyoğlu's support and priceless advice during my last year of study.

I would like to thank Prof. Tekin Özsoyoğlu, Dr. Mark Adams, and Dr. Jing Li for being on my dissertation committee. I deeply appreciate their input to this dissertation and my research.

Soon after I met my wife, I was privileged to be introduced to the Wise family, to whom I am eternally indebted, as I have gained so much from them. I always feel welcome among them, and I am happy to make them proud by finishing this degree. Here, I would like to mention Mrs. Marilyn Wise for her support in every aspect of my life and for sharing her spiritual enlightenment with me. I appreciate her being my mother here in the United States. I would also like to acknowledge the moral support of Mr. Jonathon K. Wise and Ms. Cheryl Davis. Mr. Jonathon K. Wise has been a great role model, whom my wife and I respect and to whom we always look for guidance.

I would like to acknowledge my lab friends, Can Alkan, Emre Karakoç, and Eray Tüzün. Although we have been separated by moves and graduations, they were a great support in this accomplishment. I also appreciate Mr. Brendan Eliott for proofreading my dissertation.

Finally, I appreciate more than anything the support and understanding of my beautiful wife, Gamze, throughout my Ph.D. program. I cannot express enough how thankful I am for her encouragement, help and endless patience. Without her I would not have finished this study.

Gürkan Bebek, Ph.D.
August 2006

Analyzing and Modeling Large Biological Networks: Inferring Signal Transduction Pathways

Abstract

by

Gürkan Bebek

Large-scale two-hybrid screens have generated a wealth of information describing potential protein-protein interactions (PPIs). When interacting proteins are associated with each other to generate networks, a map of the cell picturing potential signaling pathways and interactive complexes is formed.

PPI networks satisfy the small-world property, and their degree distribution follows a power law. Recently, duplication-based random graph models have been proposed to emulate the evolution of PPI networks and to satisfy these two graph theoretical properties. In this work, we show that the previously proposed model of Pastor-Satorras et al. (2003) does not generate a power-law degree distribution with exponential cutoff as claimed, and that the more restrictive model by Chung et al. (2003) cannot be interpreted unconditionally. It is possible to slightly modify these models to ensure that they generate a power-law degree distribution. However, even after this modification, the more general ℓ-hop degree distributions achieved by these models, for ℓ > 1, are very different from that of the yeast proteome network. We address this problem by introducing a new network growth model that takes into account the sequence similarity between pairs of proteins as well as their interactions. The new model captures the ℓ-hop degree distribution of the yeast PPI network for all ℓ > 0, as well as the immediate degree distribution of the sequence similarity network.

We further utilize the PPI networks to discover possible pathway segments. Discovering signal transduction pathways has been an arduous problem, even with the use of systematic genomic, proteomic and metabolomic technologies. Interpreting and processing the enormous amount of data that these technologies produce is a challenging computational problem. In this work we present a new framework to identify signaling pathways in PPI networks. Our goal is to find biologically significant pathway segments in a given interaction network. First, we discover association rules based on known signal transduction pathways and their functional annotations. Given a pair of starting and ending proteins, our methodology returns candidate pathway segments between these two proteins. These candidate pathway segments are further filtered by their levels. In our study, we used the S. cerevisiae interaction network and microarray data to successfully reconstruct signal transduction pathways in yeast.

Chapter 1

Introduction

Aristotle (384-322 B.C.) is known as the originator of the scientific study of life. Aristotle himself wrote around 146 books on the subject. Throughout the past 24 centuries of biological studies, there is no doubt that the greatest advances in this field have been made during the last century. The sequencing of genomes is one of the key achievements towards understanding the cellular machinery and biological diversity.

It is likely that the first organisms were unicellular prokaryotes. The diversity of life is derived through evolution, the process of amplification and mutation of genetic material, followed by varying proportions of chance and natural selection. Detrimental agents such as toxins, radiation, and viruses may alter the genome sequence by point mutations, insertions, or deletions of nucleotides. Sometimes these mechanisms modify the genetic material in favor of the organism, increasing its likelihood of survival, but most often its chances are decreased, or there is a negligible effect on the organism. These changes to the genomic content may lead to changes in cellular networks, which may have consequences for the tissue or sometimes for the organism as a whole.

Significant technological progress and the completion of many genome sequencing projects, including that of the human, have provided us with a reasonably detailed view of the cell. From this new point of view, i.e., our new knowledge of cellular networks, we have the means to understand the principles underlying the dynamic behavior of cells. However, this will require integration of theoretical and experimental approaches at a variety of levels.

One of the biggest challenges awaiting scientists is correlating the genome with the proteome to explain its biological function. Exploiting the genome to understand life in both diseased and healthy states will make it possible to develop new therapeutic approaches. Moreover, these approaches should be able to mimic the behavior of the systems over a wide variety of conditions. To achieve this goal, any model should be based on and fully constrained by experimental data. Hence, the amount and quality of the available experimental data will determine the reliability of the model. Also, computational models should represent the biological systems as accurately as possible.

Today, scientists have access to DNA microarray technology that can simultaneously measure the mRNA expression responses of practically every gene under various conditions, producing hundreds of thousands of individual data points. Similarly, high-throughput yeast two-hybrid and mass spectrometry experiments have identified thousands of pairwise protein-protein interactions. A recent study in systems biology demonstrated that by integrating these diverse data types and assimilating them into biological models, one can predict cellular behaviors (Ideker et al., 2001). However, synthesizing these data into models of pathways and networks remains a significant challenge.

In other words, the use of systematic genomic, proteomic and metabolomic technologies has introduced new paths toward understanding these phenomena. However, these techniques produce an enormous amount of data, and interpreting and processing this data is now a challenging computational problem. In short, we are asked to improve the quality and predict the missing components, to analyze and model the dynamics of these phenomena, and to integrate this knowledge with other known biological data.

The molecular and genetic mechanisms underlying cell proliferation, differentiation, dynamics, and death, and their involvement in embryonic development and cancer, are still being studied. In general, little is known about which molecules are specific to these cell activities. Therefore, the characterization of molecular interactions and complexity is crucial in reaching a full understanding of these biological dynamics.

Here we focus on the molecular level of activities in the cell. Among the many activities that the cell performs, signal transduction is the primary means by which cells coordinate their metabolic, morphological, and genetic responses to environmental cues such as growth factors, hormones, nutrients, osmolarity, and other chemical and physical stimuli. Thus, the analysis and discovery of these pathways can help us to understand the modifications of the cell on its way from a normal, healthy state to a transformed one, and finally to cancer generation or death.

The genetic information in a cell is converted into functional components, proteins, through the transcription and translation processes. Hence, the set of proteins in a cell (also called the proteome) can describe the underlying events and associations at a given state. A molecular-level view of a cell can be created by mapping these proteins, together with their interactions, onto a network that represents the associations of these protein molecules. These associations are important for many biological phenomena. For instance, signals from the exterior of a cell are mediated to the inside of that cell by interacting signaling molecules.

Traditionally, the discovery of the molecular components of signaling networks in organisms has relied upon the use of gene knockouts and epistasis analysis. Although these methods have been highly effective in generating detailed descriptions of specific linear signaling pathways, our knowledge of complex signaling networks and their interactions remains incomplete.

We approach the identification problem described above. It is desirable to have new computational methods that capture molecular details from high-throughput genomic and proteomic data in an automated fashion. In this work, we utilize graph theoretic and data mining techniques to accomplish this goal. These techniques are an integral part of this research process, since they allow us to analyze this data and link it to other information (e.g., functionality and regulation) by combining theoretical and experimental approaches. This analysis leads us to build efficient models that can predict those under-studied phenomena. For all these reasons, graph theoretic approaches are becoming an important part of computational biology.

In this work, we first focus on analyzing known datasets and models of protein-protein interaction networks in model organisms. Understanding protein-protein interactions is important in investigating signaling pathways. We present an in-depth analysis of the currently available models of protein-protein interaction networks and examine their properties. We then use the results of our investigation of the relationships among the elements of these networks to develop a new model for better understanding of the evolutionary processes.

Next, we present our results on discovering signaling pathway segments. Using the underlying information in the signaling pathways and the protein-protein interaction networks that we have acquired previously (Bebek et al., 2006b), we focus on discovering these smaller functional networks. In most single-celled organisms, the variety of signal transduction pathways influences the number of ways the cell can react and respond to its environment. Discovering signal transduction pathways has been a hard problem. Although we now have access to systematic genomic, proteomic and metabolomic technologies, the enormous amount of data we acquire through these technologies creates computationally challenging problems of interpreting and processing this data.

Here, we present a new framework to identify signaling pathways in protein-protein interaction networks. Given a protein-protein interaction network of an organism, we would like to discover biologically significant pathway segments. First, we reveal association rules based on known signaling pathways and their functional annotations. The methodology developed can successfully search for pathway segments on a given protein-protein interaction network. In our study, we have used the S. cerevisiae interaction network with microarray data and known signaling pathways to develop and test our models.

In this chapter, we first give basic graph theoretic definitions and then introduce relevant biological concepts. Then, a short summary of the results and an overview of the thesis are presented.

1.1 Background

1.1.1 Graph Theoretic Definitions

A graph (or network) is a set of objects called nodes or vertices connected by links called arcs or edges. A graph is usually denoted by $G$, or $G(V, E)$, where $V$ is the set of vertices (nodes) and $E \subseteq V \times V$ is the set of edges (arcs) connecting the vertices. The size of a graph is the number of its edges, i.e., $|E|$.

The most common type of graph is called a simple graph. In simple graphs, at most one edge (i.e., either one edge or no edges) may connect any two vertices. If multiple edges are allowed between vertices, the graph is known as a multigraph. Vertices are usually not allowed to be self-connected, but this restriction is sometimes relaxed to allow loops, where a loop is an edge whose end vertices are the same vertex. A graph that may contain multiple edges and loops is called a pseudograph.

A subgraph of a graph $G$ is a graph whose vertex and edge sets are subsets of those of $G$. A supergraph of a graph $G$ is a graph that contains $G$ as a subgraph. A graph is directed if its edges are directed (pointing toward one of their ends) and undirected otherwise. A graph is complete (also called a clique) if every node has a connecting edge to every other node. The complete graph on $n$ vertices is often denoted by $K_n$, and $K_n$ has $n(n-1)/2$ edges.

Nodes that share an edge are called adjacent. The degree of a node is the number of edges incident with the node, i.e., a measure of immediate adjacency. In directed graphs, the in-degree of a node is the number of edges ending at the node, and the out-degree is the number of edges beginning at the node. A vertex of degree zero is an isolated vertex. If $E$ is finite, then the total sum of vertex degrees is equal to twice the number of edges. A degree sequence is a list of the degrees of a graph in non-increasing order. A non-increasing sequence of integers is realizable if it is the degree sequence of some graph.
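The degree definitions above can be checked on a small example. The following is a minimal Python sketch, assuming an undirected simple graph stored as a vertex list and an edge list. The toy graph and the helper names (`degrees`, `degree_sequence`, `complete_graph`) are illustrative assumptions and are not taken from this work.

```python
from itertools import combinations


def degrees(vertices, edges):
    """Map each vertex to its degree in an undirected simple graph."""
    deg = {v: 0 for v in vertices}
    for u, v in edges:
        # Each edge is incident with both of its end vertices.
        deg[u] += 1
        deg[v] += 1
    return deg


def degree_sequence(vertices, edges):
    """List the vertex degrees in non-increasing order."""
    return sorted(degrees(vertices, edges).values(), reverse=True)


def complete_graph(n):
    """Build K_n: every pair of distinct vertices is joined by one edge."""
    vertices = list(range(n))
    edges = list(combinations(vertices, 2))
    return vertices, edges


# Hypothetical toy graph; "e" is an isolated vertex (degree zero).
V = ["a", "b", "c", "d", "e"]
E = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]

deg = degrees(V, E)
seq = degree_sequence(V, E)

# The sum of the degrees equals twice the number of edges.
degree_sum_is_twice_size = sum(deg.values()) == 2 * len(E)

# K_5 should have 5 * 4 / 2 = 10 edges.
K_vertices, K_edges = complete_graph(5)
```

Storing edges as a plain list keeps the sketch close to the definition $G(V, E)$; an adjacency list would be the more usual choice for larger networks.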

The set of neighbors, called the (open) neighborhood $N_G(v)$ of a vertex $v$ in a graph $G$, consists of all vertices adjacent to $v$, not including $v$ itself. When $v$ is also included, it is called a closed neighborhood, denoted by $N_G[v]$. When stated without any qualification, a neighborhood is assumed to be open.

A path in a graph is a sequence of vertices such that from each of its vertices there is an edge to the successor vertex. The length of a path is the number of edges that the path uses, counting multiple edges multiple times. On a path, the first vertex is called the start vertex and the last vertex is called the end vertex. Both of them are called end or terminal vertices of the path. The other vertices in the path are internal vertices. If it is possible to establish a path from any vertex to any other vertex of a graph, the graph is said to be connected; otherwise, the graph is disconnected. A cycle is a path such that the start vertex and end vertex are the same. In a directed graph, the same concepts apply with the edges being directed from each vertex to its successor.

A path with no repeated vertices is called a simple path, and a cycle with no repeated vertices aside from the start/end vertex is a simple cycle. A simple cycle that includes every vertex of the graph is known as a Hamiltonian cycle. Two paths are independent (alternatively, internally vertex-disjoint) if they do not have any internal vertex in common. A cycle (trail) is Eulerian if it uses all edges precisely once. A graph that contains an Eulerian trail is traversable. A graph that contains an Eulerian cycle is an Eulerian graph.

A weighted graph is a graph in which each edge is given a numerical weight. A weighted graph is therefore a special type of labeled graph in which the labels are numbers. The weight of a path in a weighted graph is the sum of the weights of the traversed edges.
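To make the notions of path, simple path, and path weight concrete, here is a short Python sketch. It assumes an undirected weighted graph without loops, stored as a dictionary keyed by unordered vertex pairs. The edge weights and function names are hypothetical and serve only as an illustration.

```python
# Hypothetical weighted graph: each key is an unordered pair of end vertices.
WEIGHTS = {
    frozenset(("a", "b")): 1.0,
    frozenset(("b", "c")): 2.5,
    frozenset(("c", "d")): 0.5,
    frozenset(("a", "c")): 4.0,
}


def is_path(weights, seq):
    """True if every vertex in seq is joined by an edge to its successor."""
    return all(frozenset((u, v)) in weights for u, v in zip(seq, seq[1:]))


def is_simple_path(weights, seq):
    """True if seq is a path that repeats no vertex."""
    return is_path(weights, seq) and len(set(seq)) == len(seq)


def path_length(seq):
    """Number of edges used by the path."""
    return len(seq) - 1


def path_weight(weights, seq):
    """Sum of the weights of the traversed edges."""
    if not is_path(weights, seq):
        raise ValueError("the vertex sequence is not a path in this graph")
    return sum(weights[frozenset((u, v))] for u, v in zip(seq, seq[1:]))


simple = ["a", "b", "c", "d"]   # a simple path with three edges
cycle = ["a", "b", "c", "a"]    # a cycle: start and end vertex coincide
simple_weight = path_weight(WEIGHTS, simple)
```

Using `frozenset` keys makes the lookup independent of the order in which the two end vertices are given, which matches the undirected case; a directed graph would use ordered tuples instead.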

The distance $d_G(u, v)$ between two (not necessarily distinct) vertices $u$ and $v$ in a graph $G$ is the length of a shortest path between them. The subscript $G$ is usually dropped when there is no danger of confusion. When $u = v$, the distance is 0. When $u$ and $v$ are unreachable from each other, their distance is defined to be infinity. The eccentricity of a vertex $v$ in a graph $G$ is the maximum distance from $v$ to any other vertex. The diameter of a graph $G$ is the maximum eccentricity over all vertices in the graph, while the radius is the minimum.
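For an unweighted graph, these quantities can be computed with breadth-first search. The sketch below is one possible Python illustration, assuming an undirected graph given as an adjacency dictionary; the toy graph and the function names are assumptions made for this example only.

```python
import math
from collections import deque


def bfs_distances(adj, source):
    """Shortest-path lengths (in edges) from source to every reachable vertex."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist


def distance(adj, u, v):
    """d(u, v); infinity when v is unreachable from u."""
    return bfs_distances(adj, u).get(v, math.inf)


def eccentricity(adj, v):
    """Maximum distance from v to any other vertex."""
    dist = bfs_distances(adj, v)
    if len(dist) < len(adj):
        return math.inf  # some vertex is unreachable from v
    return max(dist.values())


def diameter(adj):
    """Maximum eccentricity over all vertices."""
    return max(eccentricity(adj, v) for v in adj)


def radius(adj):
    """Minimum eccentricity over all vertices."""
    return min(eccentricity(adj, v) for v in adj)


# Hypothetical toy graph: the path a-b-c-d with an extra vertex e attached to c.
ADJ = {
    "a": ["b"],
    "b": ["a", "c"],
    "c": ["b", "d", "e"],
    "d": ["c"],
    "e": ["c"],
}
```

On this toy graph, `distance(ADJ, "a", "d")` evaluates to 3, `diameter(ADJ)` to 3 and `radius(ADJ)` to 2. Running one search per vertex is adequate for small examples; large interaction networks may call for sampling or more specialized algorithms.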

1.1.2 Signal Transduction Pathways

The molecular components involved in cellular signaling form signal transduction pathways. A signal transduction pathway (signaling pathway) in a cell is composed of the following events. First, a signaling molecule arrives outside of the cell and interacts with the receptor on the extracellular surface of the cell membrane. Next, the receptor interacts with intracellular pathway components, starting a cascade of protein interactions that propagates the signal within the cell. Finally, the signal arrives at its final destination, or molecular target, and evokes a functional response in the cell.

Important biotechnological advances in recent years have allowed increasingly detailed studies of a variety of signaling pathways. These advances include production of recombinant DNA, the polymerase chain reaction (PCR) (Alberts et al., 2002), gel electrophoresis (Vincens and Tarroux, 1988), microarrays (DeRisi and Iyer, 1999), and the serial analysis of gene expression (SAGE) (Velculescu et al., 1995). The development of such techniques is still continuing, and large-scale assays of peptides and protein-DNA binding activity are becoming more feasible (Abbott, 2002).

A signaling molecule may be a protein, small peptide, amino acid, nucleotide, steroid, retinoid, fatty acid derivative or a dissolved gas. There are different types of signaling systems, and they differ in signal origin.

Paracrine signaling is a form of cell signaling in which the target cell is close to the signal-releasing cell, and the signal chemical is broken down too quickly to be carried to other parts of the body. In mature organisms, paracrine signaling functions include responses to allergens, repairs to damaged tissue, formation of scar tissue, and clotting. Examples of paracrine agents are growth factors, somatostatin and histamine. In paracrine signaling, the signal originates from a nearby cell and thus causes only localized effects.

Endocrine signaling molecules are called hormones. After their release into the bloodstream, where they are present at very low concentrations, target cells with high-affinity receptors extract the hormone from the blood. The endocrine system links the brain to the organs that control body metabolism, growth and development, and reproduction. In endocrine signaling, hormones are secreted into the bloodstream and thus may be received by a cell some distance from the origin of the signal.

In synaptic signaling, a signaling molecule is released into a synaptic cleft from one neuron and received by another neuron. Here, synapses allow nerve cells to communicate with one another through axons and dendrites, converting electrical impulses into chemical signals.

Finally, a cell may send a signal to itself, which is known as autocrine signaling (Alberts et al., 2002).

There are many types of signaling molecules and also many different receptors that may be present on a given cell at a given time. The set of receptors and the density and location of each receptor on the cell surface depend on cell type and on the current state and environment of the cell. Hence, the same stimulus will often cause different responses in different cells.

The main method of signal transduction occurs through structural changes of pathway components. A given protein will affect the conformation of one or several other proteins, activating or inhibiting those proteins and thus propagating the signal down the pathway (Alberts et al., 2002). The trigger for signal propagation often occurs with the binding of the signaling molecule to the receptor, which causes a conformational change in the receptor. Within the cell, signal propagation depends heavily on the actions of protein kinases and protein phosphatases. Most of the intracellular portions of signaling pathways consist of cascades of protein phosphorylations and dephosphorylations. Each step leads to the activation or the inhibition of downstream events or feeds back on upstream events.

The traditional view of signal transduction has been as a linear sequence of phosphorylation events proceeding from the cell surface to the ultimate intracellular target. However, it has become increasingly clear that the propagation of signals in the cell is not a simple chain of events, but a complex network of interacting pathways and regulatory feedback mechanisms (Neves and Iyengar, 2002).

The responses to signaling can include activation of enzyme activity, changes in cytoskeleton organization, changes in ion permeability, activation of DNA and/or RNA synthesis, as well as many other aspects of cell function (Alberts et al., 2002). Through such changes, signaling pathways can control cellular functions such as growth, maturation, proliferation, and differentiation. These vital functions underline the importance of studying signal transduction pathways.

1.1.3 Protein-Protein Interactions

Proteins in the cell are polymers made up of a specific chain of amino acids. The cell reads the genetic information and uses it to construct these macromolecules through the transcription and translation processes. Proteins in the cell work together to achieve a particular function, and often physically associate with each other to function or to form a more complex structure.

Protein-protein interactions refer to the associations between protein molecules. These associations are important for many biological functions. For instance, signals from the exterior of a cell are mediated to the inside of that cell by interacting signaling molecules.

A protein-protein interaction might last for a long time, forming part of a protein complex, or one protein may carry another. Moreover, a protein may interact briefly with another protein just to modify it, as in the phosphorylation of a target protein by a protein kinase. Interactions are important to most biological processes. Many proteins need to interact with other proteins to perform their functions properly. Thus, knowledge about the interacting proteins is crucial in the understanding of biological functions.

Model organisms, species that are extensively studied to understand biological phenomena, were the first organisms whose genomes were sequenced. In eukaryotes, several yeasts, particularly Saccharomyces cerevisiae ("baker's" or "budding" yeast), have been widely studied. Since the sequencing of S. cerevisiae (Goffeau et al., 1996), systematic genome-wide studies of protein interactions have been conducted on this organism. After the publication of the S. cerevisiae genome sequence, several computational methods based on genomic context were developed for protein-protein interaction (PPI) prediction (Fields and Song, 1989, Gavin et al., 2002, Ho et al., 2002, Ito et al., 2001, Uetz et al., 2000).

Today, more than 400 genomes have been completely sequenced and more than 1700 projects are still in progress¹ (Liolios et al., 2006). The proteomes of these genomes have been at least partially mapped, but the functions of many proteins are unknown. Identification of the physical interactions in which these proteins participate may reveal their function.

1.1.4 Discovery of Protein-Protein Interactions

The physical interactions between proteins can be detected by the characterization of individual interactions. In the past, metabolic reactions mostly were identified through laborious studies of individual enzymes. Moreover, the function of a newly sequenced gene may be inferred from its homology to a protein with an identified function. However, in recent years, high-throughput studies have been developed in which protein-protein interactions may be identified through genome wide tech- niques. As a result, in the past few years, the number of known protein-protein inter- actions have increased significantly. The two most important methods used in iden- tifying protein-protein interactions are affinity purification followed by mass spec- trometry, which is a common technique for identifying protein complexes (Gavin et al., 2002, Ho et al., 2002), and the yeast two-hybrid method used for identifying individual protein-protein interactions (Fields and Song, 1989, Ito et al., 2001, Uetz et al., 2000). The yeast two-hybrid method (Y2H) (Fields and Song, 1989) can be used to determine if two particular proteins interact. First, the two proteins of interest, P1 and P2, are fused to two different proteins. P1 (often called bait) is fused to a DNA- binding protein, which binds to a specific stretch of DNA slightly upstream of the reporter gene, a gene encoding protein that reports the presence of an interaction between P1 and P2. On the other hand, P2 (often called prey) is fused to an acti- vating domain, which activates the transcription of the reporter gene. The reporter

1 Refer to the Genomes Online Database for recent statistics of genome sequencing projects at http://www.genomesonline.org (Liolios et al., 2006)

11 gene will not be transcribed unless an activating domain is present, and activating domain is only present if P1 interacts with P2. Therefore, a signal is observed only when P1 and P2 interact with each other. The two-hybrid method was efficiently adapted for systematic large-scale stud- ies. This technique has been used to study the entire proteome of S. cerevisiae (Ito et al., 2001, Uetz et al., 2000), Caenorhabditis elegans (Li et al., 2004, Walhout et al., 2000b) and Drosophila melanogaster (Giot et al., 2003). Although the yeast two-hybrid method is sensitive enough to detect transient as well as stable inter- actions, the method is not very accurate in detecting interactions. As many as 50- 90% of the initially published interactions are probably erroneous (false positives) (Deane et al., 2002, Sprinzak et al., 2003). Moreover, Deane et al. (2002) showed through an analysis based on the agreement of the interactions and expression data that more than half of these interactions are biologically irrelevant. In addition to these, there are a large number of known interactions between pro- teins which are missed in two-hybrid systems (false negatives) (Aloy and Russell, 2002). A two-hybrid false-negative rate of 45% was estimated by Walhout et al. (2000b) in their C. elegans study. The large number of false negatives is likely to be caused by proteins that only interact when certain activation signals have induced conformational changes in one or both of the interacting proteins (Ito et al., 2001). Also, the unnatural mechanism of fused proteins within a compartment, the nu- cleus, where most of the proteins do not naturally interact, is a likely cause for the absence of known protein-protein interactions in the two-hybrid screens. Finally, membrane protein interactions are unlikely to be detectable by Y2H. Another technique for physical interaction discovery is the affinity purification method. 
Affinity purification methods do not identify individual interactions between proteins, but are used to determine which proteins appear in complexes together. In tandem affinity purification (TAP) (Rigaut et al., 1999), some proteins are selected as baits, which are used to fish for the prey proteins that form a complex with the bait. In TAP, baits are fused with two affinity tags. The tags are used to attach the bait to an affinity chromatography column in two tandem steps. Throughout this methodology, stringent purification steps prevent detection of transient interactions within complexes. Hence, mostly stable complexes are found. Furthermore, the exact interactions between the proteins in the complexes detected by TAP are not determined. For example, some of the proteins in a complex interact directly with each other, while others are at the outskirts of the complex and are not in direct proximity with each other, although they are likely to be functionally related.

Affinity purification methods have also been successfully applied to large-scale studies of the proteome of S. cerevisiae (Gavin et al., 2002, Ho et al., 2002). However, the overlap between the protein-protein interactions discovered with affinity purification methods and those discovered with two-hybrid screens is small (Gavin et al., 2002, Ho et al., 2002). This is partially because the two methods complement each other (Aloy and Russell, 2002). It also shows that both methods suffer from their own shortcomings.

The high-throughput methods mentioned above are comprehensive and are not biased towards the expectations of individual researchers. However, the overlap of independent studies is fairly small (Uetz and Finley, 2005). For instance, the two independent yeast two-hybrid screens of S. cerevisiae had only ∼20% overlap (Ito et al., 2001). The low overlap between similar studies might be caused by limited coverage of the whole yeast interactome, or by false positives. Cornell et al. (2004) showed through their analysis that, among these high-throughput methods, TAP is the most reliable.
By definition, the protein-protein interactions detected by two-hybrid screens are different in nature from those detected by low-throughput yeast experiments (a collection can be found at the Munich Information Center for Protein Sequences (MIPS) (Mewes et al., 1999)). Since there is little overlap between the interaction pairs produced by these methods, there is a need for reliable validation measures.

Interactions which have been identified in low-throughput experiments are considered more reliable. Although there are problems associated with high-throughput methodologies, efforts to improve their outcome are still in progress. Recent technical improvements in pooling strategies indicate that the accuracy of high-throughput yeast two-hybrid screens could be significantly increased, while the number of screens is simultaneously decreased (Jin et al., 2006).

A protein-protein interaction that has been observed independently more than once is considered to be more reliable. However, false positives might be reproducible in some cases (Fields, 2005). Moreover, computational methods are frequently utilized to indicate the reliability of protein-protein interactions. For instance, functional annotations and sub-cellular localizations may provide an indication of the reliability of a particular protein-protein interaction, since proteins which are predicted to be active in the same sub-cellular location and have related functions are more likely to interact (Sprinzak et al., 2003). In addition, expression patterns for proteins in the same complex are expected to be correlated (Jansen et al., 2002). In other words, interacting proteins are often co-expressed (Grigoriev, 2003, Jansen et al., 2002). Furthermore, structural information (Aloy and Russell, 2002, Edwards et al., 2002) and functional annotations (Marcotte et al., 1999) can be used to validate protein-protein interactions.

Almost all of the interactions that are discovered through experiments are collected in public databases. Today, there is a growing number of public databases that present protein-protein interaction data for multiple organisms.
The most comprehensive databases are the Munich Information Center for Protein Sequences (MIPS) (Mewes et al., 1999), the Database of Interacting Proteins (Xenarios et al., 2002), the Biomolecular Interaction Network Database (BIND) (Bader et al., 2003), the BioGRID General Repository for Interaction Datasets (Stark et al., 2006), the Molecular Interaction Database (MINT) (Zanzoni et al., 2002), and the Online Predicted Human Interaction Database (Brown and Jurisica, 2005), among others.

1.2 Contributions

The discovery of the protein-protein interaction network topology (Jeong et al., 2001, Wagner, 2001) accelerated efforts to understand the growth of these networks. These networks drew more attention after it was observed that they share topological properties with many other networks (Aiello and Chung, 2001, Aiello et al., 2000, Berger et al., 2003, Bollobás et al., 2003, Bollobás et al., 2001, Cooper and Frieze, 2003, Kleinberg et al., 1999). Since then, developing evolutionary network models that successfully reproduce the growth of proteome networks has been a great challenge (Bhan et al., 2002, Pastor-Satorras et al., 2003, Vazquez et al., 2003). To accomplish this task, known biological theories and empirical studies were taken into account. The most promising model, which is named the proteome growth model in Pastor-Satorras et al. (2003), was described independently in Bhan et al. (2002), Pastor-Satorras et al. (2003) and Vazquez et al. (2003).

The proteome growth model is based on Ohno's theory of genome growth (Ohno, 1970). In this model, the two underlying mechanisms for genome evolution are gene duplication and point mutations. In terms of gene functionality, after a gene duplication event, one of the genes may accumulate deleterious mutations and be lost, or both copies of the gene may be retained. The proteome growth model emulates these processes by growing a network via node duplications and then modifying the connectivity of the nodes by mechanisms that reflect point mutations. Through analysis of this model, different claims were made about the degree distribution it would generate. Earlier network generation models used for emulating the growth of similar networks were known to follow a power-law degree distribution. Pastor-Satorras et al. (2003) showed that the proteome growth model would generate a degree distribution that follows a power-law with exponential cutoff. In another study, using a more restrictive model, Chung et al. (2003) showed that these networks would follow a power-law degree distribution.

Here, we further investigate the degree distribution generated with the proteome growth model. First, we analyze the proteome growth model of Pastor-Satorras et al. (2003), and show that this model does not generate the power-law degree distribution with exponential cutoff as claimed, and that the result for the more restrictive model of Chung et al. (2003) cannot be interpreted unconditionally. Analyzing the networks, we observed that global features of networks, such as the degree distribution, might actually be misleading. In this work, we also introduce a new measure called ℓ-hop for network comparison. We address the shortcomings of the proteome growth model through the more general ℓ-hop degree distribution for ℓ > 1. We make more observations about the model, and further study its original basis, i.e. Ohno's theory of genome growth. We then introduce a new network growth model that takes into account the sequence similarity between pairs of proteins (as a binary relationship) as well as their interactions. The new model captures not only the ℓ-hop degree distribution of the yeast protein interaction network for all ℓ > 0, but also the immediate degree distribution of the sequence similarity network, which again seems to follow a power-law.

We further utilize protein-protein interaction networks to discover possible pathway segments. Protein-protein interactions of an organism may lay out the proteins by their functional relationships.
These associations improve our understanding of cellular functions and help identify unknown proteins and their functions. Techniques like the two-hybrid system (Giot et al., 2003, Ito et al., 2001, Li et al., 2004, Reboul et al., 2003, Uetz et al., 2000) or affinity purification followed by mass spectrometry (Gavin et al., 2002, Ho et al., 2002) were developed to uncover physical interactions between proteins. These experiments identify only a small fraction of the total protein-protein interaction network (Bader and Hogue, 2002, Edwards et al., 2002, Grigoriev, 2003, Ito et al., 2002, von Mering et al., 2002, Walhout et al., 2000a,b).

There are many studies in which signaling pathways were modeled using various approaches. Previously, signaling pathways were modeled as modular kinetic simulations of biochemical networks (Neves and Iyengar, 2002) and by detailed integration of biochemical properties of the pathways (Choi et al., 2004). In another recent study, Bayesian networks were applied to multi-variable cell data to infer signaling pathways (Sachs et al., 2005).

In this thesis, we present a new framework, called PathFinder, to identify signaling pathways in protein-protein interaction networks. To find biologically significant pathway segments in a given interaction network, we first discover association rules based on known signal transduction pathways and their functional annotations. Given a pair of starting and ending proteins, our methodology returns candidate pathway segments between these two proteins. These candidate pathway segments are further filtered by their gene expression levels. In our study, we use the S. cerevisiae interaction network with microarray data and are able to successfully reconstruct signal transduction pathways of yeast.

The rest of the thesis is organized as follows. In Chapter 2 we analyze the evolutionary models of proteome networks. Using our analysis, in Chapter 3 we discuss our observations about evolutionary models and protein-protein interaction networks and introduce an enhanced duplication model based on protein sequence similarity.
In Chapter 4 we focus on discovering signaling pathways utilizing other biological networks and present our experimental results on S. cerevisiae datasets. Final conclusions are drawn in Chapter 5.

Chapter 2

Evolutionary Models of Proteome Networks

Small world phenomena and power-law degree distributions have previously been observed in a number of naturally occurring graphs such as communication networks (Faloutsos et al., 1999), web graphs (Aiello et al., 2000, Barabasi and Albert, 1999, Bollobás et al., 2001, Cooper and Frieze, 2003, Kleinberg et al., 1999, Kumar et al., 2000), research citation networks (Redner, 1998), human language graphs (Ferrer i Cancho and Solé, 2001), neural nets (Watts and Strogatz, 1998), etc. These two properties cannot be observed in the classical random graph models studied by Erdős and Rényi (Erdős and Rényi, 1959), in which the edges between pairs of nodes are determined independently. However, it is possible to generate graphs that satisfy these properties by an iterative process that adds one new node to the graph at each step (Aiello and Chung, 2001, Aiello et al., 2000, Berger et al., 2003, Bollobás et al., 2003, Bollobás et al., 2001, Cooper and Frieze, 2003, Kleinberg et al., 1999). The new node is then connected to some $b$ of the existing nodes ($b$ can be a constant or an independent random variable), each of which is chosen with probability proportional to its degree. Unfortunately, such a preferential attachment model does not capture the essence of genome evolution and hence cannot be used to model proteome networks.

The structure of the yeast protein-protein interaction network seems to reveal two interesting graph theoretic properties (Jeong et al., 2001, Wagner, 2001): (1) The degree distribution of the nodes (i.e. the proportion of nodes with degree $k$ as a function of $k$) approximates a power-law (i.e. is approximately $ck^{-b}$ for some constants $c$, $b$). (2) The graph exhibits the small world effect.

According to Ohno's model (Ohno, 1970), the two underlying mechanisms for genome evolution are gene duplication and point mutations. In terms of gene functionality, after a gene duplication event, one of the genes may accumulate deleterious mutations and be lost, or both copies of the gene may be retained. Two possible evolutionary reasons for keeping both copies can be (1) selection for increased levels of expression, or (2) divergence of gene function (Nadeau and Sankoff, 1997, Seoighe and Wolfe, 1999b). In this framework, functional divergence can be produced through complementary degeneration, where each daughter gene retains only a subset of the functions of the parent, or (rarely) if one daughter acquires a new function (Force et al., 1999). Although the duplicated regions of the genomes have been described and listed before (for instance for S. cerevisiae (Seoighe and Wolfe, 1999a, Wolfe and Shields, 1997)), there is no known scheme for how duplications formed the current shape of the genomes. Recent work, thus, has focused on random graph models that grow via node duplications and get modified by mechanisms that emulate point mutations.

Among these studies, the most promising one, which is named the proteome growth model in Pastor-Satorras et al. (2003), was described independently in Bhan et al. (2002), Pastor-Satorras et al. (2003) and Vazquez et al. (2003). In this model, the network grows in iterations. The model starts with a set of connected vertices of size $N_0$.
In each iteration, a gene, or the associated protein represented by a node, is chosen uniformly at random and is duplicated with all of its edges. After the duplication step, there is a divergence step that emulates mutations. Each edge of the new node is deleted with probability $q$ ($= 1 - p$), followed by inserting edges between the new node and every other node with probability $r/t$, where $t$ is the total number of nodes and $r$ is a constant. In Pastor-Satorras et al. (2003), by adjusting the parameters $q$ and $r$ and using a small seed graph ($N_0 = 5$), the proteome growth model was used to approximate the degree distribution of the yeast proteome network.

The first serious study to formally analyze the degree distribution of the proteome growth model was by Pastor-Satorras et al. (2003), who claimed that the degree distribution of both the yeast proteome network and the proteome growth model is a power-law with exponential cut-off. This means that the fraction of nodes with degree $k$ among all nodes is independent of time and is approximated by $f_k = c k^{-b} \cdot a^{-k}$; here $a$, $b$, $c$ are constants. However, they make a number of simplifying assumptions in their analysis to get this result. For instance, they approximate the probability of generating a node with degree $k$ by the probability of duplicating a node with degree $k + 1$ only and subsequently deleting a single edge. This assumption also reduces the number of singletons. They further approximate this probability with a function linear in $k$.

A more recent analysis of the degree distribution of the proteome growth model, for the special case $r = 0$, is given by Chung et al. (2003). As per Chung et al. (2003), we will refer to this special case as the pure duplication model. In contrast to Pastor-Satorras et al. (2003), Chung et al. (2003) claim that the fraction of nodes with degree $k$ is independent of time and is of the form $f_k = c k^{-b}$; here $b$ is a function of $p = 1 - q$, and values of $b \le 2$ are possible for some $p$. The pure duplication model creates singleton nodes, i.e. nodes that are not connected to any other node of the graph. Since a node can only get a new edge if one of its neighbors is copied, a singleton will remain a singleton during the whole graph generation process. Note that in this model all non-singleton nodes form one connected component.
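The duplication and divergence steps described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in any of the cited studies: the function and parameter names are hypothetical, the seed graph is taken to be a path on $N_0$ nodes (the text only requires a connected seed), and $t$ is taken to be the number of nodes present before the new node is added.

```python
import random


def proteome_growth(n_final, q, r, n0=5, seed=0):
    """Grow a network by node duplication followed by divergence.

    Returns an adjacency dict {node: set(neighbors)}.
    Assumption: the seed graph is a path on n0 nodes.
    """
    rng = random.Random(seed)
    adj = {i: set() for i in range(n0)}
    for i in range(n0 - 1):
        adj[i].add(i + 1)
        adj[i + 1].add(i)

    while len(adj) < n_final:
        t = len(adj)               # nodes present before this iteration
        new = t                    # label of the duplicate
        parent = rng.randrange(t)  # node chosen uniformly at random
        # Divergence, part 1: each inherited edge is deleted with
        # probability q, i.e. kept with probability p = 1 - q.
        kept = {v for v in adj[parent] if rng.random() >= q}
        # Divergence, part 2: an edge to every other node is inserted
        # with probability r / t.
        extra = {v for v in range(t) if rng.random() < r / t}
        adj[new] = kept | extra
        for v in adj[new]:
            adj[v].add(new)
    return adj


def singleton_fraction(adj):
    """Fraction of nodes that have no neighbors."""
    return sum(1 for nb in adj.values() if not nb) / len(adj)
```

Setting `r=0` gives the pure duplication model. Calling `singleton_fraction(proteome_growth(n, q=0.5, r=0.0))` for increasing `n` is one way to observe empirically how the share of singletons evolves as the network grows.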

In a separate work, van Noort et al. (2004) show that the gene coexpression network of S. cerevisiae has scale-free and small-world properties. By using the homology relations between the genes in the coexpression network, they present a model which can generate networks with similar scale-free and small-world properties. The model starts with a number of genes, each of which has a number of transcription factor binding sites (TFBSs); genes sharing a minimum number of TFBSs are considered coexpressed. At every time step, each gene can be duplicated or deleted with certain probabilities. Also, a TFBS of a gene can be deleted, or a new TFBS from another gene can be acquired, with certain corresponding probabilities. In contrast to other approaches, deleting or inserting a TFBS of a gene removes or adds a whole set of links to the gene. Hence, in their approach, the connections of genes are considered in groups. van Noort et al. (2004) claim that the model generates a degree distribution with a slope similar to that of the coexpression network of S. cerevisiae.¹ Additionally, the average clustering coefficient² and the shortest path length of the networks were compared. Although these measures help in understanding the topology of a network, they are not sufficient to claim that two networks are similar.

¹ Numerical results were not presented in van Noort et al. (2004). Hence, the simulation results leave some question about how close the degree distribution, i.e. the power-law exponent, was.
² The clustering coefficient of a node is the ratio between the actual number of edges between neighbors of the node and the maximum possible number of edges between these neighbors. The average clustering coefficient of a network is the average of the clustering coefficients over all nodes in the network (Watts and Strogatz, 1998).

There is also another study, presented by Przulj et al. (2004), in which a different approach for modeling these networks has been studied. Przulj et al. (2004) claim that a random geometric model better captures the currently accepted protein-protein interaction networks. A geometric disc graph is formed by connecting two nodes of the graph with an edge if their distance in the metric space is smaller than a certain threshold. Przulj et al. (2004) argue that the scale-free property of the proteomes is a result of the noise in the currently available data and that the degree distribution of such networks should follow the Poisson distribution. By counting the number of different motifs in the networks, they form a measure of local network structure and use it to compare different models with the available proteomes. According to the experiments they carried out, a three dimensional geometric disc graph with the same number of nodes but six times the number of edges has a similar number of motifs to the proteomes they worked on. Although the network motifs considered capture local properties of the networks, Przulj et al. (2004) (1) do not take into account Ohno's theory (Ohno, 1970), which states that the proteome network should be generated through a process that attributes genome sequence growth and evolution to subsequent gene duplications followed by mutations on the gene sequences, and (2) do not consider global properties of the networks, such as the average degree or the degree distribution, before drawing conclusions. Moreover, the work is vague on how scale-free networks are formed. For instance, there are many models available that can generate scale-free networks, but not every scale-free network is necessarily generated by emulating proteome network growth, i.e. duplication and divergence.

The most recent study, presented by Ispolatov et al. (2005), focuses on duplication-divergence models with completely asymmetric divergence. In a completely asymmetric divergence process, links are removed from the duplicated node only. Ispolatov et al. (2005) examine this model, in which the evolution is characterized by a single parameter, the link retention probability.
They claim that this single-parameter duplication-divergence network growth model can approximate the degree distribution of real protein-protein interaction networks. Although their model generates similar degree distributions, the generated network lacks local structural similarity to the real one. For instance, this model would not generate any triangular subgraphs (cliques of three nodes), since duplication would generate only cycles of even length or degree-one nodes. However, cycles of every size exist in vast numbers in the real proteome network.

In these studies, the protein-protein interactions identified by high-throughput yeast two-hybrid screens or inferred from mass spectrometry of coimmunoprecipitated protein complexes were considered. However, analysis based on the agreement of the interaction and expression data shows that fewer than half of these interactions are biologically relevant (Deane et al., 2002). In a recent study, Han et al. (2005) showed that low coverage makes determination of the true topology of the network difficult. Han et al. (2005) also showed, by sampling networks in the way these experiments do (since the experiments only reveal partial networks), that regardless of the topology of the underlying network, the sampled subnetwork would have a degree distribution similar to a power-law. In other words, according to these experiments, it is not clear whether the proteome network follows a power-law degree distribution or not.

However, in this work, we assume that the proteome network should be generated through a process that attributes genome sequence growth and evolution to subsequent gene duplications followed by mutations on the gene sequences. Previously, it has been shown that this process would generate a network that follows a power-law degree distribution (Bhan et al., 2002, Pastor-Satorras et al., 2003, Vazquez et al., 2003). Moreover, we show that the degree distribution of the proteome growth model follows a power-law.

In this chapter, first in Section 2.1 we introduce biological networks and then focus on the evolution of protein-protein interaction networks. We briefly describe topological properties of networks in Section 2.1.3. Next, in Section 2.1.2 we introduce random network models that are widely studied for modeling large networks.
In Section 2.2 the proteome growth model of Pastor-Satorras et al. (2003) is introduced.

Our specific contributions are presented in the following sections. We first show in Section 2.3 that the (expected) proportion of singletons generated by the pure duplication model ($r = 0$) grows in time. In fact, the only limiting (time independent) solution is $f_0 = 1$ and $f_k = 0$ for all $k > 0$. Note that for the case $p = q = 0.5$ the average degree of nodes in the pure duplication model does not change over time (see Lemma 3). Together with the fact that the fraction of singletons increases in time, this implies that (i) the average degree of non-singletons must increase in time and (ii) there is a single connected component of size $o(t)$ with increasing average degree. It is quite possible that this connected component of the network generated by the pure duplication model exhibits a power-law degree distribution with parameter $b \le 2$; however, this is difficult to establish.

In the rest of Section 2.3, we show that the degree distribution of the proteome growth model (in fact, any random model based on duplications) does not follow a power-law with exponential cut-off as claimed in Pastor-Satorras et al. (2003). We achieve this by showing a bound for the maximum degree of the proteome growth model and contrasting it with that of a network which exhibits a power-law with exponential cut-off.
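The remark that the average degree stays constant for $p = q = 0.5$ in the pure duplication model can be checked with a short expectation argument. The following is only a sketch and is not meant to reproduce the proof of Lemma 3; the symbols $E_t$ (number of edges when the network has $t$ nodes) and $\bar{d}_t = 2E_t/t$ (average degree) are introduced here for illustration.

```latex
% The duplicated node is chosen uniformly at random, so its expected
% degree is \bar{d}_t = 2E_t/t, and each copied edge is kept with probability p:
\mathbb{E}\left[E_{t+1} \mid E_t\right] = E_t + p \cdot \frac{2E_t}{t}
                                        = E_t \left(1 + \frac{2p}{t}\right)

% Dividing by the new number of nodes gives the expected average degree:
\mathbb{E}\left[\bar{d}_{t+1} \mid E_t\right]
    = \frac{2E_t}{t+1}\left(1 + \frac{2p}{t}\right)
    = \bar{d}_t \cdot \frac{t + 2p}{t + 1}
```

For $p = 1/2$ the factor $(t + 2p)/(t + 1)$ equals 1, so the expected average degree is unchanged from one step to the next. Under the same sketch, the factor is below 1 for $p < 1/2$ and above 1 for $p > 1/2$.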

2.1 Biological Networks

A network (graph) is a collection of points, called nodes or vertices, together with arcs connecting these points, called edges (refer to Section 1.1.1 for definitions).

Biological networks, representations of biological relationships, have been constructed to describe various biological phenomena. These networks vary from networks describing the biochemical wiring of the cell to higher level networks such as neuronal networks or the food web. Recent studies on the analysis of genomes have increased the number and importance of cellular networks. The most common cellular networks are described below.

A metabolic network is a network of pathways where metabolic substrates and products are connected with directed edges. Each arc indicates that a metabolic reaction acts on a given substrate and produces a given product. Studying metabolic networks allows in-depth insight into the molecular mechanisms of a particular organism (Francke et al., 2005). Examples of metabolic pathways include glycolysis, the Krebs cycle, the pentose phosphate pathway, etc. In simplified terms, the construction of a metabolic network involves collecting all of the relevant metabolic information of an organism and then compiling it into a network suitable for the various types of analysis to be performed. The correlation between the genome and metabolism is made by searching gene databases, such as KEGG (Kanehisa and Goto, 2000), for particular genes by inputting enzyme or protein names. In short, metabolic networks are powerful tools for studying and modeling metabolism.

A genetic regulatory network (also called a GRN or gene regulatory network) describes gene expression, i.e. the production of proteins from the genomic code by transcription and translation. Expression of a gene can be controlled by the presence of other activating or inhibiting proteins, and thus the genome forms a switching network with nodes representing proteins and directed edges representing dependence of protein production on other proteins. In other words, genetic regulatory networks are the on-off switches and rheostats of a cell operating at the gene level. They dynamically orchestrate the level of expression for each gene in the genome by controlling whether and how vigorously that gene will be transcribed into RNA. Each RNA transcript then functions as the template for synthesis of a specific protein by the process of translation.

Likewise, the transcriptional (regulation) network can be represented as a directed graph.
Transcriptional interactions show the relationships between transcription factors and the operons they regulate. In transcriptional (regulation) networks, each node represents an operon, a group of contiguous genes that are transcribed into a single mRNA molecule, and edges represent direct transcriptional interactions. Each edge is directed from an operon that encodes a transcription factor to an operon that is regulated by that transcription factor (Shen-Orr et al., 2002).

Finally, protein-protein interaction networks represent undirected interactions among proteins. In other words, a protein-protein interaction network (interactome) is a graph in which each node represents a protein and each (undirected) edge represents an interaction. A graph including all proteins in an organism and all possible interactions between these proteins can be called the proteome network of that organism. The interactions in these networks are important to most biological processes, since many proteins need to interact with other proteins to perform their functions properly. Hence, knowledge about the interactions between proteins is crucial for understanding biological functions.

In this work, we focus on models developed for generating protein-protein interaction networks. A protein-protein interaction network of an organism lays out the proteins by their functional relationships. This improves our understanding of cellular functions. We would like to further understand the underlying forces that have generated these networks by using network generation models. In the following section, the evolution of protein-protein interaction networks is explained in detail. We use these evolutionary processes in the further development of network models.

2.1.1 The Evolution of Protein-Protein Interactions

The complete genome analysis of model organisms showed how gene and genome duplication events have shaped genomes over time. Remarkably, 30% of the Saccharomyces cerevisiae genome, 40% of that of Drosophila melanogaster, 50% of that of Caenorhabditis elegans, and 38% of the human genome are composed of duplicated genes (Li et al., 2001, Rubin et al., 2000). According to Ohno's theory (Ohno, 1970), such duplication events should have provided genetic raw material, a source of evolutionary novelties, that could have led to the emergence of new genes and functions through mutations followed by natural selection. Recently, there has been an enormous increase in genomic knowledge. However, the patterns by which gene duplications might give rise to new gene functions over the course of evolution are not completely understood. This is mainly due to the fact that there are very few ways of experimentally investigating the evolution of function in duplicated genes.

The two underlying mechanisms for genome evolution are gene duplication and point mutations followed by natural selection (Ohno, 1970). After a gene duplication event, one of the genes may accumulate deleterious mutations and be lost, or both copies of the gene may be retained. Two possible evolutionary reasons for keeping both copies can be (1) selection for increased levels of expression, or (2) divergence of gene function (Nadeau and Sankoff, 1997, Seoighe and Wolfe, 1999b). In this framework, functional divergence can be produced through complementary degeneration, where each daughter gene retains only a subset of the functions of the parent, or (rarely) if one daughter acquires a new function (Force et al., 1999). Although the duplicated regions of the genomes have been described and listed before (for instance for S. cerevisiae (Seoighe and Wolfe, 1999a, Wolfe and Shields, 1997)), there is no certain scheme to explain how duplications formed the current shape of the genomes.

Moreover, closely related organisms share similar proteins. Hence, the interactions among these proteins are also preserved across different organisms. It has been observed that many of the interactions present in yeast appear to also be present in C. elegans, although the protein-protein interactions of the eukaryotic intracellular parasite Plasmodium falciparum show little similarity with those of other eukaryotes (Suthram et al., 2005). Understanding how protein interactions evolve would improve our understanding of the evolution of new functions.

2.1.2 Random Network Models

Measuring global and local properties of real-world complex networks led to the proposal of multiple models to generate such networks. The random graph model is the first model presented to form a complex network (Erdős and Rényi, 1959). Since its introduction, the random graph model has been used to model networks with no apparent design, and the work of Erdős and Rényi (1959) turned this field into a significant research area (Bollobas, 2001). The small-world model (Watts and Strogatz, 1998), which was motivated by clustering coefficients, interpolates between highly clustered regular ring lattices and random graphs. The scale-free model (Barabasi and Albert, 1999) was motivated by the discovery of the power-law degree distribution. Geometric random graph models (Penrose, 2003) have been used to model real-world networks such as electrical power grids and protein structure networks (Milo et al., 2004).
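To make the geometric random graph model concrete, the following minimal sketch places points uniformly at random in a unit cube and connects two points whenever their Euclidean distance is smaller than a threshold, as in the geometric disc graph discussed earlier in this chapter. The function name, the unit-cube domain, and the parameter names are assumptions made for illustration; they are not taken from Penrose (2003) or Przulj et al. (2004).

```python
import math
import random


def geometric_random_graph(n, radius, dim=3, seed=0):
    """Sample n points in the unit cube [0, 1]^dim and connect every
    pair whose Euclidean distance is smaller than `radius`.

    Returns (points, edges), where edges is a set of pairs (i, j), i < j.
    """
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n)]
    edges = set()
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(points[i], points[j]) < radius:
                edges.add((i, j))
    return points, edges
```

The quadratic pairwise loop keeps the sketch short; for large `n` a spatial index would be the usual choice, but that is beyond what this illustration needs.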

Random Graphs

Random graphs are networks in which every pair of nodes is joined by an edge independently and with the same probability p. Erdős and Rényi (1959) defined several versions of the random graph model, of which the most commonly used is denoted by $G_{n,p}$. In this model, each possible edge in the graph on n vertices is present with probability p. The properties of $G_{n,p}$ are often expressed in terms of the average degree z of a node. The average number of edges in the random graph $G_{n,p}$ is $\frac{n(n-1)}{2}p$. Since every edge connects two vertices, the average degree of a node is $z = (n-1)p$, which is approximately $np$ for large n. These graphs have many properties that can be calculated exactly in the limit of large n, which makes them appealing as models of real networks.

Random graphs have been used to describe many phenomena such as gene networks (Kauffman, 1969), ecosystems (May, 1973), and computer viruses (Kephart and White, 1991). Although random graph models reasonably approximate some properties of these real-world networks, there are still differences between the two. The first difference, which drew a lot of attention to this area, is that real-world networks appear to have power-law degree distributions (Barabasi and Albert, 1999, Faloutsos et al., 1999). In other words, a small but not negligible fraction of the vertices in these networks has a very large degree. Secondly, while real-world networks have strong clustering, the random graph model does not (Watts and Strogatz, 1998). The edges of a random graph are by definition present independently of one another, so two neighbors of a node are adjacent with probability p. Thus the clustering coefficient of a random graph is p.
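The two quantities discussed above, the average degree $(n-1)p$ and a clustering coefficient close to p, can be checked empirically. The following is a minimal sketch using only the Python standard library; the function names, the values n = 400 and p = 0.05, and the convention that nodes with fewer than two neighbors contribute zero clustering are illustrative assumptions, not part of the original text.

```python
import random
from itertools import combinations


def gnp(n, p, rng):
    """Sample G(n, p): each of the n(n-1)/2 pairs is joined independently with probability p."""
    adj = [set() for _ in range(n)]
    for u, v in combinations(range(n), 2):
        if rng.random() < p:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def average_degree(adj):
    return sum(len(nb) for nb in adj) / len(adj)


def average_clustering(adj):
    """Average over all nodes of (edges among neighbours) / (possible edges among neighbours)."""
    total = 0.0
    for nb in adj:
        d = len(nb)
        if d < 2:
            continue  # assumed convention: such nodes contribute 0
        links = sum(1 for a, b in combinations(nb, 2) if b in adj[a])
        total += 2.0 * links / (d * (d - 1))
    return total / len(adj)


n, p = 400, 0.05
adj = gnp(n, p, random.Random(1))
print("average degree:", average_degree(adj), "expected:", (n - 1) * p)
print("clustering:", average_clustering(adj), "edge probability:", p)
```

Both printed pairs should be close to each other for moderately large n; the agreement is statistical, not exact.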

Generalized Random Graphs

Given a degree distribution, a random graph can be generated by assigning node i a degree $k_i$ from the given degree sequence, and then choosing pairs of nodes uniformly at random to make edges so that the assigned degrees are preserved. Once all assigned degrees have been used up in linking the vertices with edges, the resulting random graph has the original degree distribution. The only requirement for this model is that the degrees sum to an even number.
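One common way to realize the construction above is "stub matching": every node receives as many half-edges as its assigned degree, and the half-edges are paired uniformly at random. The sketch below assumes this particular technique (the text does not name one) and, as an additional simplification, allows self-loops and repeated edges; the degree sequence is a made-up example.

```python
import random


def configuration_model(degrees, rng):
    """Pair up half-edges ("stubs") uniformly at random; returns a list of edges.

    Assumption of this sketch: self-loops and multi-edges are not filtered out.
    """
    if sum(degrees) % 2 != 0:
        raise ValueError("degrees must sum to an even number")
    # node i appears degrees[i] times in the stub list
    stubs = [node for node, k in enumerate(degrees) for _ in range(k)]
    rng.shuffle(stubs)
    return list(zip(stubs[0::2], stubs[1::2]))


degrees = [3, 2, 2, 1, 1, 1]  # hypothetical degree sequence with an even sum
edges = configuration_model(degrees, random.Random(0))

# Recount the degrees from the edge list; a self-loop counts twice.
realized = [0] * len(degrees)
for u, v in edges:
    realized[u] += 1
    realized[v] += 1
print(edges, realized)
```

The recount reproduces the assigned sequence by construction, which is the property the paragraph relies on; the even-sum requirement is what guarantees that every stub finds a partner.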

Random Geometric Graphs

A random geometric graph G(n, r) is a geometric graph with n nodes that correspond to n independently and uniformly distributed points in a metric space. Two nodes are adjacent if the distance between the corresponding points is at most r. The distance between two points is usually the Euclidean or the Manhattan distance. Since these networks are constructed over a metric space, they have boundaries that exhibit different properties from the inner parts of the graph; i.e., the boundaries of these graphs are sparser than the interiors.

Small-world Networks

Networks of many biological, social, and artificial systems often exhibit small-world topology. The main characteristic of these networks is a large clustering coefficient that is independent of the network size, together with a small diameter. Watts and Strogatz (1998) proposed a one-parameter model of networks that interpolates between an ordered finite-dimensional lattice and a random graph. Starting from a ring lattice³ with n nodes and m edges in which every node is adjacent to its first k neighbors on the ring, each edge is rewired (one of its end nodes is changed) uniformly at random with probability p, not allowing self-loops or multiple edges. This rewiring process introduces an average of $\frac{pnk}{2}$ long-range edges. The parameter p can be adjusted to create a network as desired.

The diameter (also called the characteristic path length, L(p)) and the average clustering coefficient C(p), both functions of the rewiring probability p, are the distinctive structural properties of small-world networks (Watts and Strogatz, 1998). In their study, Watts and Strogatz (1998) stated that the regular lattice at p = 0 is a highly clustered large world in which the diameter grows linearly with n. On the other hand, as p approaches 1 the model converges to a random graph, which is a poorly clustered small world in which the diameter grows logarithmically with n. Moreover, the clustering coefficient is not tied to the growth of the diameter. These distinguishing properties of small-world topology were observed in the collaboration graph of actors in feature films, the neural network of C. elegans, and the electrical power grid of the western United States (Watts and Strogatz, 1998). Hence, this model was claimed to be generic for many large sparse networks found in nature. This work prompted other researchers to focus on this area, and many additional networks with small-world properties were reported (Aiello et al., 2000, Barabasi and Albert, 1999, Bollobás et al., 2001, Cooper and Frieze, 2003, Faloutsos et al., 1999, Ferrer i Cancho and Solé, 2001, Kleinberg et al., 1999, Kumar et al., 2000, Redner, 1998).

³ A ring lattice is a set of n nodes placed on a ring, where every node has an edge to its k nearest neighbors on the perimeter of the ring. For instance, if k = 2, the nodes have edges to their immediate neighbors and to the nodes next to their neighbors.
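The rewiring procedure can be sketched as follows. This sketch assumes the convention in which k is even and counts the neighbors on both sides of a node (k/2 per side), so that the lattice has nk/2 edges and rewiring each with probability p changes pnk/2 edges on average; the function name and the parameter values are illustrative.

```python
import random


def watts_strogatz(n, k, p, rng):
    """Ring lattice on n nodes (k/2 neighbours per side, k even), then rewire each edge w.p. p."""
    adj = [set() for _ in range(n)]
    for u in range(n):
        for j in range(1, k // 2 + 1):
            v = (u + j) % n
            adj[u].add(v)
            adj[v].add(u)
    rewired = 0
    for u in range(n):
        for j in range(1, k // 2 + 1):
            v = (u + j) % n
            if v in adj[u] and rng.random() < p:
                # new end node: no self-loops, no duplicate edges
                candidates = [w for w in range(n) if w != u and w not in adj[u]]
                if not candidates:
                    continue
                w = rng.choice(candidates)
                adj[u].discard(v)
                adj[v].discard(u)
                adj[u].add(w)
                adj[w].add(u)
                rewired += 1
    return adj, rewired


n, k, p = 200, 4, 0.1
adj, rewired = watts_strogatz(n, k, p, random.Random(0))
print("rewired edges:", rewired, "expected about:", p * n * k / 2)
```

Rewiring moves one end of an edge and never adds or removes edges, so the total number of edges stays at nk/2 for every p.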

Scale-Free Networks

The Internet backbone (Faloutsos et al., 1999), metabolic reaction networks (Jeong et al., 2000), the world wide web (Broder et al., 2000), and the telephone call network (Abello et al., 1998) were also discovered to share the same connectivity pattern; i.e., the degree distributions of these networks decay as a power law $P(k) \approx k^{-\gamma}$, where $\gamma \approx 2.1$ to $2.4$.

Scale-free networks were first discovered by Simon (1955). More recently, Albert et al. (1999) and Barabasi and Albert (1999) drew attention to these networks with their preferential attachment model. Such distributions emerge from a stochastic growth model in which new nodes are added continuously and attach preferentially to existing nodes, with probability proportional to the degree of the target node (Barabasi and Albert, 1999). In other words, higher-degree nodes acquire further attachments over time, and the resulting degree distribution is $P(k) \approx k^{-\gamma}$.

Scale-free networks have a smaller average path length (Barabasi and Albert, 1999), showing that a heterogeneous scale-free topology is more efficient in bringing nodes close together than the homogeneous random graph topology. Bollobas et al. (2003) also showed that the average path length, l, satisfies $l \approx \frac{\ln n}{\ln \ln n}$.

An interesting property of scale-free networks that drew a lot of attention is that they are resistant to random failures, since a few high-degree hubs (highly connected nodes) dominate their topology; i.e., failures of low-degree nodes do not affect the network (Albert et al., 2000). However, such networks are vulnerable to intentional attacks on the hubs. Similar implications were also drawn for the Internet (Bornholdt and Ebel, 2001), the design of therapeutic drugs (Jeong et al., 2000), and the evolution of metabolic networks (Jeong et al., 2000, Wagner, 2001).
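The preferential attachment growth process described above can be sketched in a few lines. The choices made here are assumptions of the sketch rather than details given in the text: the process starts from a small complete graph, every new node attaches to m distinct existing nodes, and degree-proportional sampling is implemented by drawing uniformly from a list that holds each node once per incident edge.

```python
import random
from collections import Counter


def preferential_attachment(n, m, rng):
    """Grow a graph to n nodes; each new node links to m distinct nodes chosen
    with probability proportional to their current degree."""
    # seed graph: complete graph on m + 1 nodes, so every node has positive degree
    edges = [(u, v) for u in range(m + 1) for v in range(u)]
    # every node appears in `ends` once per incident edge
    ends = [x for e in edges for x in e]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(ends))  # uniform over ends == proportional to degree
        for t in targets:
            edges.append((new, t))
            ends.extend((new, t))
    return edges


n, m = 2000, 2
edges = preferential_attachment(n, m, random.Random(0))
degree = Counter(x for e in edges for x in e)
print("average degree:", 2 * len(edges) / n, "maximum degree:", max(degree.values()))
```

The gap between the average and the maximum degree illustrates the hubs mentioned in the text: a few early nodes accumulate a degree far above the average.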

2.1.3 Properties of Networks

Large networks are mostly analyzed by their global and local properties. The most extensively studied global properties of networks include the diameter⁴ (Albert et al., 1999), clustering (Hartuv and Shamir, 2000), and degree distribution (Newman et al., 2001).

⁴ The diameter of a graph G, diam(G), is the maximum of all shortest path lengths between any two nodes in G.

Despite their large size, these networks mostly have small diameters. This is often referred to as the small-world property (see Section 2.1.2) (Watts and Strogatz, 1998).

A network shows clustering if the probability of a pair of nodes being adjacent is higher when the two nodes have a common neighbor. The clustering coefficient, C, is defined as the average probability that two neighbors of a given node are adjacent (Watts and Strogatz, 1998). In more formal terms, if a vertex v in the network has $d_v$ neighbors, the ratio between the number of edges $E_v$ among the neighbors of v and the largest possible number of edges among them, $\frac{d_v(d_v-1)}{2}$, is called the clustering coefficient of node v and is denoted by $C_v$; i.e., $C_v = \frac{2E_v}{d_v(d_v-1)}$. The clustering coefficient C of the whole network is the average of $C_v$ over all nodes of the network. It has been observed that complex, real-world networks exhibit a large degree of clustering (Aiello et al., 2000, Barabasi and Albert, 1999, Bollobás et al., 2001, Cooper and Frieze, 2003, Faloutsos et al., 1999, Ferrer i Cancho and Solé, 2001, Kleinberg et al., 1999, Kumar et al., 2000, Redner, 1998, Watts and Strogatz, 1998).

The degree distribution of a network gives the probability P(k) that a randomly selected node of the network has degree k. Most large real-world networks have non-Poisson degree distributions. These networks are observed to have a degree distribution following a power law, $P(k) \approx k^{-\gamma}$, and are called scale-free networks (Barabasi and Albert, 1999).

Networks are also analyzed for their local properties. Such local properties were previously identified by searching for small over-represented patterns in a network (Itzkovitz and Alon, 2005, Milo et al., 2004, 2002, Shen-Orr et al., 2002). In this approach, the motifs of a large network are identified as small subgraphs that appear significantly more frequently in the network than in randomized networks. Milo et al. (2002) showed that different real-world networks contain different motifs. Hence, real-world networks were grouped into super-families according to their local properties (Milo et al., 2004).

Local properties of large networks are more reliable resources for identifying and modeling them when the networks are incomplete or not completely verified. For such networks, the global properties might be biased or misleading, while the local structures are more likely to be complete and reliable. In this work, as well as examining the global properties of large networks, we have also checked local properties to make sure that the incompleteness of the networks does not affect our analysis.

In Chapter 3 we develop a more refined measure of structural similarity by comparing the ℓ-hop degree distributions of two networks. In a given network, the ℓ-hop degree of a node is defined as the total number of unique nodes it can reach in at most ℓ hops (see Figure 2-1). The ℓ-hop degree measures each node's connectivity at multiple reach levels, and so guards against misleading conclusions drawn from global properties alone. Clearly the 1-hop degree of a node is its own degree. Previous studies observed the degree distributions of various networks (the 1-hop degree distribution) and concluded that power-law behavior would be enough to classify these networks (Albert et al., 2000, Bhan et al., 2002, Jeong et al., 2000, Pastor-Satorras et al., 2003, Vazquez et al., 2003). However, looking at the ℓ-hop degree distributions of the graphs reveals further topological properties of these networks. In this work, we have observed that the previously presented protein-protein interaction network generation models produce networks whose topology differs from that of the real network, although their degree distributions are similar (see Chapter 3 for experimental results).

[Figure 2-1: example network with the nodes reachable from one node marked by hop level, ℓ = 1, ℓ = 2, ℓ = 3.]

Figure 2-1: ℓ-hop degree. Given a graph G(V, E), the ℓ-hop degree of a node is defined as the total number of unique nodes it can reach in at most ℓ hops. Above, the grey node in this network has 3 immediate neighbors. Thus, the 1-hop degree of the grey node is 3. The 2-hop and 3-hop degrees of this node are 9 and 15, respectively.
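The ℓ-hop degree defined above can be computed with a breadth-first search that stops expanding at depth ℓ. The sketch below is one straightforward way to do this; the example graph is a hypothetical tree, chosen only so that the counts agree with the values 3, 9 and 15 quoted in the caption of Figure 2-1, and it is not claimed to be the network drawn in that figure.

```python
from collections import deque


def l_hop_degree(adj, source, l):
    """Number of distinct nodes other than `source` within distance at most l."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if dist[u] == l:
            continue  # do not expand beyond l hops
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return len(dist) - 1  # exclude the source itself


# Hypothetical tree: node 0 has 3 children, each child has 2 children,
# and each grandchild has 1 child.
tree_edges = [(0, 1), (0, 2), (0, 3),
              (1, 4), (1, 5), (2, 6), (2, 7), (3, 8), (3, 9),
              (4, 10), (5, 11), (6, 12), (7, 13), (8, 14), (9, 15)]
adj = {}
for u, v in tree_edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

hop_degrees = [l_hop_degree(adj, 0, l) for l in (1, 2, 3)]
print(hop_degrees)
```

Collecting `l_hop_degree` over all nodes for a fixed ℓ gives the ℓ-hop degree distribution that is compared between networks; for ℓ = 1 this is the ordinary degree distribution.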

2.2 Proteome Growth Model

Among the various studies of evolutionary models of protein-protein interaction networks, the most promising is the one named the proteome growth model in Pastor-Satorras et al. (2003), which was also described independently in Bhan et al. (2002) and Vazquez et al. (2003). The proteome growth model works in iterations. Starting with a set of connected vertices of size $N_0$, in each iteration t, (i) one existing node (representing a gene or an associated protein) is chosen uniformly at random and is "duplicated" with all its edges. To emulate mutations, the duplication step is followed by a divergence step in which (ii) each edge of the new node is deleted with probability q. This is followed by (iii) inserting edges between the new node and every other node with probability r/t, where t is the total number of nodes and r is a constant. In these studies, with the right selection of the parameters q and r, and starting from a connected ring of size five ($N_0 = 5$), the proteome growth model approximates the degree distribution of a given proteome interaction network well.

We first define the proteome growth model formally. The model grows iteratively in discrete time steps. Let $G(t-1)$ be the network at the end of time step $t-1$. In time step t exactly one new node is generated, which is denoted by $v_t$. For any node $v_s$, we denote its degree (or expected degree if the context is clear) at time step $t \ge s$ by $d_s(t)$.

(i) At each time step t, the new node $v_t$ is generated by picking one of the nodes w in $G(t-1)$ uniformly at random and "duplicating" it to create $v_t$; i.e., $v_t$ will initially be connected to all neighbors of w.

(ii) The edges incident to $v_t$ are updated through the following random process. Each edge e is considered independently and is deleted with probability $q$ $(= 1-p)$. Then, each node u which is not connected to $v_t$ is considered independently and an edge between u and $v_t$ is created with probability r/t.

The pure duplication model is the special case of the proteome growth model with r = 0. In Section 2.3 we show that this special case cannot achieve the power-law degree distribution stated by Chung et al. (2003). To address this problem, the pure duplication model can be modified via a new step (iii) in which $v_t$ is connected to a node chosen uniformly at random (either at all times or only if it has become a singleton at the end of step (ii)). As a result, $v_t$ never has degree 0.
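The duplication, deletion and insertion steps of the proteome growth model can be simulated directly. The following is a minimal sketch under stated assumptions: the graph is stored as a list of neighbor sets, t in the probability r/t is taken to be the number of nodes present before the new node is added, and the run length of 1500 steps is arbitrary. The parameter values q = 0.535 and r = 0.08 are the ones quoted later in the text for the yeast data; setting r = 0 gives the pure duplication model.

```python
import random


def proteome_growth(steps, q, r, rng, n0=5):
    """Simulate the proteome growth model starting from a ring of n0 nodes.

    Returns a list of neighbour sets; node i's neighbours are adj[i].
    """
    adj = [{(i - 1) % n0, (i + 1) % n0} for i in range(n0)]
    for _ in range(steps):
        t = len(adj)              # nodes present before this step (assumed meaning of t)
        w = rng.randrange(t)      # (i) pick the node to duplicate uniformly at random
        # (ii) each copied edge is deleted independently with probability q
        kept = {u for u in adj[w] if rng.random() >= q}
        # then every node not adjacent to the new node gains an edge with probability r/t
        added = {u for u in range(t) if u not in kept and rng.random() < r / t}
        new = t
        adj.append(kept | added)
        for u in adj[new]:
            adj[u].add(new)
    return adj


adj = proteome_growth(1500, q=0.535, r=0.08, rng=random.Random(0))
singletons = sum(1 for nb in adj if not nb)
edges = sum(len(nb) for nb in adj) // 2
print(len(adj), "nodes,", edges, "edges,", singletons, "singletons")
```

The insertion step scans all existing nodes, so a run costs time quadratic in the number of steps; that is acceptable for a sketch but would need a smarter sampling scheme for runs of the size reported in the figures.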

2.3 Analysis of the Proteome Growth Model

In this section, we review previous analyses of the proteome growth model and present corrections to them. Informally, the problem addressed here is to show that the network generated by the proteome growth model described above follows a power-law degree distribution. Such a result would give mathematical support to previous experimental observations that the proteome growth model generates networks with power-law degree distributions.

Formally, let $\mathbf{F}_k(t)$ denote the number of nodes of degree k at the end of step t in the random process and let $\mathbf{F}(t) = (\mathbf{F}_0(t), \mathbf{F}_1(t), \cdots)$ be the degree sequence. Also let $F_k(t) = \mathbb{E}\,\mathbf{F}_k(t)$ be the expected value, and $f_k(t) = F_k(t)/t$ the expected fraction of nodes of degree k. Finally let $\mathbf{e}(t)$ be the number of edges in G(t) and $e(t) = \mathbb{E}\,\mathbf{e}(t)$; similarly let $\mathbf{h}(t)$ be the average degree of a node (averaged over all nodes) in G(t), and $h(t) = \mathbb{E}\,\mathbf{h}(t)$. We say a model follows a power-law degree sequence if we can find constants $b, c > 0$ such that $f_k(t) \to f_k$ as $t \to \infty$, where $f_k = (1 + O(1/k))\,c\,k^{-b}$.

We first show in Section 2.3.1 that the fraction of singletons in the pure duplication model grows with time, and that the limiting solution for this growth is $F_0(t) \to t$. Section 2.3.2 is on the analysis in Pastor-Satorras et al. (2003), which predicts that the proteome growth model has a degree distribution in the form of a power law with exponential cut-off; i.e., there exist constants a, b, c such that, as $t \to \infty$, we have $f_k(t) \sim c\,k^{-b}a^{-k}$ for $k \to \infty$. We show that this cannot be true by demonstrating that the expected maximum degree for a power law with exponential cut-off is $O(\log t)$, whereas the proteome growth model has an expected maximum degree of $\Omega(t^p)$.

2.3.1 Properties of the pure duplication model

The pure duplication model is a special case of the proteome growth model with parameter r = 0 (Chung et al., 2003). Since no additional edges are created after the duplication event, when a node of degree k is duplicated there is a probability $q^k$ that the new node loses all of its edges during the divergence step and becomes a singleton. The fraction of singletons in the pure duplication model grows with time in such a way that $F_0(t) \to t$ is the only consistent limiting solution. This implies that, unless $f_k = 0$ for $k \ge 1$, $F_k(t) \ne t f_k$, where $f_k$ is a time-independent solution for the limiting proportion of nodes of degree k. In fact, for the particularly interesting case $p = q = 1/2$, we show that the expected number of non-singletons at time step t is between $\Omega(\sqrt{t})$ and $O(t/\log\log t)$. This contradicts the assumption in Eqn (6) of Chung et al. (2003). Thus, without some modification, the pure duplication model of Chung et al. (2003) cannot have a power-law degree distribution of the form $F_k(t) \sim c\,t\,k^{-b}$ for any constants c, b.

Lemma 1. The expected proportion of singletons, $f_0(t)$, in the pure duplication model is a non-decreasing function of t and tends to a limit $f_0 \le 1$. If we also have that $f_k(t) \to f_k$ for $k \ge 1$, then $f_0 = 1$ and $f_k = 0$ for $k \ge 1$.

Proof. We have the following recurrence for singletons in the pure duplication model:

$$F_0(t+1) = F_0(t) + \sum_{k \ge 0} \frac{F_k(t)\,q^k}{t}.$$

Thus, writing $F_k(t) = t f_k(t)$, we have

$$(t+1)\bigl(f_0(t+1) - f_0(t)\bigr) = \sum_{k \ge 1} f_k(t)\,q^k \ge 0,$$

and we see that $f_0(t+1) \ge f_0(t)$. As $f_0(t) \le 1$ it follows that $f_0(t) \to f_0 \le 1$ from below as $t \to \infty$.

Suppose that for some constant $k \ge 1$, $f_k(t) \to f_k > 0$; then $\sum_{k \ge 1} f_k q^k = c > 0$. Thus there exists T such that for $t \ge T$, $\sum_{k \ge 1} f_k(t)\,q^k \ge c/2 > 0$ and

$$f_0(t+1) \ge f_0(t) + \frac{c}{2(t+1)}.$$

Iterating this we get

$$f_0(t) \ge \frac{c}{2}\log(t/T) + O(1/T) + f_0(T),$$

i.e., $f_0(t) > 1$ for t large enough, which is impossible. □

This lemma excludes the existence of power-law solutions $f_k \sim c\,k^{-b}$ for finite $k \ge 1$ (which are suggested in Chung et al. (2003)), but we cannot exclude non-limiting degree distributions by this argument.

It is possible to obtain a tighter estimate on the proportion of singletons in the network for the particularly interesting case $p = q = 1/2$. As per Lemma 3 (see below), this case preserves the (expected) average degree of the nodes throughout the generation of G(t). Thus, $e(t) = e(0) \cdot t$ (where e(0) is the number of edges of G(0)).

Lemma 2. Consider the case $p = q = 1/2$. Let $\mathbf{F}^+(t) = t - \mathbf{F}_0(t)$ be the number of non-singleton nodes at time t, and $F^+ = \mathbb{E}\,\mathbf{F}^+$. Then there are constants $c_1, c_2 > 0$ such that $c_1\sqrt{t} \le F^+(t) \le c_2\,t/\log\log t$.

Proof. We have the following recurrence:

$$F^+(t+1) = F^+(t) + \frac{1}{t}\sum_{k \ge 0} F_k(t)\bigl(1 - (1/2)^k\bigr) \tag{2.1}$$

Thus:

$$F^+(t+1) = F^+(t) + \frac{F^+(t)}{t} - \frac{F^+(t)}{t}\sum_{k \ge 1}\frac{F_k(t)}{F^+(t)}\,\frac{1}{2^k} \tag{2.2}$$

Each term of the sum in (2.2) has $1/2^k \le 1/2$ and the weights $F_k(t)/F^+(t)$ sum to 1, so $F^+(t+1) \ge F^+(t)\bigl(1 + \frac{1}{2t}\bigr)$. As $F_1(t) \le F^+(t)$, one can easily check that $F^+(t) \ge F^+(0)\sqrt{t}$, giving the lower bound.

Now let $g(k) = 1/2^k$, which is convex, and thus for any set of $\lambda_k$ for which $\sum_k \lambda_k = 1$, we must have $\sum_k \lambda_k\,g(k) \ge g\bigl(\sum_k k\lambda_k\bigr)$. Now pick $\lambda_k = F_k(t)/F^+(t)$. We have $\sum_k k F_k(t) = 2e(t) = 2e(0)t$. Thus:

$$\sum_{k \ge 1}\frac{F_k(t)}{F^+(t)}\left(\frac{1}{2}\right)^{k} \ge \left(\frac{1}{2}\right)^{2e(t)/F^+(t)} \tag{2.3}$$

By substituting (2.3) into (2.2) and using $e(t) = e(0)t$ we get:

$$F^+(t+1) \le F^+(t) + \frac{F^+(t)}{t}\left(1 - \left(\frac{1}{2}\right)^{2e(0)t/F^+(t)}\right).$$

This is only satisfied if $F^+(t) \le c_2\,t/\log\log t$. This can be verified as follows. Let $c_2 = 4e(0)\log 2$. Either $F^+(t) \le c_2\,t/\log\log t$, or if not we can substitute this lower bound into the exponent on the right-hand side and iterate the recurrence on t to obtain a contradiction. □

Lemma 3 (below) states that the expected number of edges is $e(t) = c\,t^{2p}$ and consequently the expected average degree is $h(t) = 2c\,t^{2p-1}$. Thus for p < 0.5 the average degree decreases over time and for p > 0.5 it increases. Only for p = 0.5 does the average degree remain constant; however, as the proportion of singletons is $\ge 1 - O\!\left(\frac{1}{\log\log t}\right)$ due to Lemma 2, the average degree of non-singletons (which all form a single connected component) is $\ge c\log\log t$.

Proposition 2.1. The power-law exponent b in Chung et al. (2003) is given by the solution of $1 = bp - p + p^{b-1}$ and has the value 2 when $p = 1/2$. This is incompatible with $e(t) = 2e(0)t$ unless the connected component is of size $o(t)$.

To see this, recall that $\sum_k k F_k(t) = 2e(t)$. Under the assumption that we have a power-law degree distribution at $p = 1/2$, then $F_k(t) \sim c\,k^{-2}\,t$ and

$$e(t) = \frac{ct}{2}\sum_{k \ge 1}\left(1 + O\!\left(\frac{1}{k}\right)\right)k^{-1}.$$

However $\sum_{k=1}^{k^*} k^{-1}$ diverges as $k^* \to \infty$, and we cannot have $e(t) = 2e(0)t$ unless we truncate $k^*$ at a finite value. Lemma 4 (below) sets the expected maximum degree in the pure model at $\Omega(t^p)$, and the power-law assumption itself is not compatible with $k^*$ being finite.

It is however still possible that a power law with exponent b = 2 holds for the connected component C. Putting $k^* = O(t^{1/2})$ we see that $\sum k^{-1} = O(\log t)$, which gives $e(t) = 2e(0)t$ provided $|C| = O(t/\log t)$, in accordance with the results of Lemma 2.

[Figure 2-2 plot: "Fraction of Singletons". x-axis: percentage of running time; y-axis: percentage of singletons; one curve for each of p = 0.4, 0.5, 0.6, 0.7.]

Figure 2-2: Percentage of singletons in the pure duplication model as a function of time. Each curve is plotted for a different value of p with an initial graph of 5-circle. For each p value the experiments are run until 1000000 non-singleton nodes are created and the results are averaged over 100 experiments.

Lemma 3. The expected total number of edges and the expected average degree of nodes at step t satisfy

$$e(t) \sim e(0)\,t^{2p} \quad\text{and}\quad h(t) \sim h(0)\,t^{2p-1}.$$

Proof. The number of edges at time t + 1 in terms of the number of edges at time t is

$$\mathbb{E}\bigl(\mathbf{e}(t+1) \mid \mathbf{e}(t)\bigr) = \mathbf{e}(t) + \frac{1}{t}\sum_{s \le t} p\,d_s(t).$$

The first term is trivial; the second term is obtained by considering the possibility that each given node $v_s$ is duplicated at time t; then $p\,d_s(t)$ is the expected number of its edges retained. Because the sum of the degrees of all nodes is twice the number of edges, we have, taking expectations again, that

$$e(t+1) = \left(1 + \frac{2p}{t}\right)e(t),$$

which has a solution $e(t) \sim e(0)\,t^{2p}$. □

Figure 2-2 shows the percentage of singletons in the network over time for different values of p. The model was run until 1000000 non-singleton nodes were created. The plot uses a linear scale on the y-axis (percentage of singletons) and a logarithmic scale on the x-axis (running time).

Figure 2-3 shows the average degree over time for different values of p. Again, the model was run until 1000000 non-singleton nodes were created. The average degree of the network increases with time, and the larger the value of p, the larger the increase of the average degree.
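The growth rate claimed in Lemma 3 can be checked numerically by iterating the recurrence $e(t+1) = (1 + 2p/t)\,e(t)$ and estimating the exponent from two horizons. This is a sketch of a sanity check on the recurrence only, not a simulation of the model; the starting values (5 edges at t = 5, as for a ring of five nodes) and the horizons are illustrative assumptions.

```python
import math


def expected_edges(e0, p, t0, t_end):
    """Iterate e(t+1) = (1 + 2p/t) e(t), starting from e(t0) = e0, up to e(t_end)."""
    e = e0
    for t in range(t0, t_end):
        e *= 1.0 + 2.0 * p / t
    return e


t1, t2 = 1_000, 100_000
slopes = {}
for p in (0.4, 0.5, 0.6, 0.7):
    e1 = expected_edges(5.0, p, 5, t1)
    e2 = expected_edges(5.0, p, 5, t2)
    # slope of log e(t) against log t between the two horizons
    slopes[p] = math.log(e2 / e1) / math.log(t2 / t1)
    print(p, slopes[p], 2 * p)
```

The estimated slope should sit close to 2p for every p. For p = 0.5 the product telescopes, so the number of edges grows exactly linearly and the average degree stays constant, as stated in the text.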

2.3.2 On the degree distribution of the proteome growth model

In this section we analyze the proteome growth model defined in Pastor-Satorras et al. (2003). The analysis presented there predicts that the proteome growth model has a degree distribution in the form of a power law with exponential cut-off; i.e., there exist constants a, b, c such that, as $t \to \infty$, we have $f_k(t) \sim c\,k^{-b}a^{-k}$ for $k \to \infty$. Pastor-Satorras et al. (2003) mentioned in the same work that this degree distribution equation was unsatisfactory, since model simulations did not correspond to the results shown. Although they made the simplifying assumption that the deletion parameter is much smaller than 1/2, the resulting equations only hold for values greater than 1/2. The next lemma shows that the degree distribution of the proteome growth model cannot be a power law with exponential cut-off as suggested (Pastor-Satorras et al., 2003).

[Figure 2-3 plot: "Average Degree of the Connected Component". x-axis: percentage of running time; y-axis: average degree; one curve for each of p = 0.4, 0.5, 0.6, 0.7.]

Figure 2-3: Average degree of non-singleton nodes in the pure duplication model as a function of time. Each curve is plotted for a different value of p with an initial graph of 5-circle. For each p value the experiments are run until 1000000 non-singleton nodes are created and the results are averaged over 100 experiments.

Lemma 4. Let a, b, c > 0 be constants. The degree distribution of the proteome growth model cannot be of the form $F_k(t) \sim c\,t\,k^{-b}a^{-k}$ as claimed in Pastor-Satorras et al. (2003).

Proof. Denote by $k_{\max}$ the expected maximum degree in G(t). Assume an exponential cut-off, i.e., $F_k(t) \sim t\,c\,k^{-b}a^{-k}$. Then $\sum_{k \ge k_0} F_k(t) = o(1)$ for $k_0 > \log t/\log a$, and so $k_{\max} = O(\log t/\log a)$.

On the other hand, consider the expected degree of the node $v_s$ at time t + 1, which is a non-decreasing function of t. Even in the worst-case situation (r = 0) we have:

$$d_s(t+1) = d_s(t) + \frac{d_s(t)}{t}\,p \tag{2.4}$$

as the degree of $v_s$ can only increase if one of its neighbors is picked at time t and the edge is retained. Thus:

$$d_s(t+1) = d_s(t)\left(1 + \frac{p}{t}\right) = d_s(s)\cdot\left(1 + \frac{p}{s}\right)\left(1 + \frac{p}{s+1}\right)\cdots\left(1 + \frac{p}{t}\right).$$

Since $\log(1+x) = x - O(x^2)$ we have

$$\exp\left(\sum_{\tau=s}^{t}\log(1 + p/\tau)\right) \sim \exp\left(p\sum_{\tau=s}^{t} 1/\tau\right) \sim e^{p\log(t/s)},$$

which implies that $d_s(t+1) = \Omega\bigl(d_s(s)\,(t/s)^p\bigr)$ and that $k_{\max} = \Omega(t^p)$, contradicting the claim. □

Here, we showed that the degree distribution of the proteome growth model cannot follow a power law with exponential cut-off by demonstrating that the expected maximum degree for a power law with exponential cut-off is $O(\log t)$, whereas the proteome growth model has an expected maximum degree of $\Omega(t^p)$. We finally prove that for r > 0 there are no degenerate limiting solutions of the form $f_0 = 1$, $f_k = 0$, $k \ge 1$ for the proteome growth model of Pastor-Satorras et al. (2003).
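The key step in the proof of Lemma 4 is that the product $\prod_{\tau=s}^{t}(1 + p/\tau)$ grows like $(t/s)^p$, which is polynomial in t and therefore eventually exceeds any multiple of $\log t$. The sketch below evaluates the product numerically; the values s = 5 and p = 0.5 are illustrative assumptions.

```python
import math


def degree_growth(p, s, t):
    """Product (1 + p/s)(1 + p/(s+1)) ... (1 + p/t) from the proof of Lemma 4."""
    prod = 1.0
    for tau in range(s, t + 1):
        prod *= 1.0 + p / tau
    return prod


p, s = 0.5, 5
rows = []
for t in (10**3, 10**4, 10**5):
    growth = degree_growth(p, s, t)
    # (t, growth factor, ratio to (t/s)^p, log t for comparison)
    rows.append((t, growth, growth / (t / s) ** p, math.log(t)))
for row in rows:
    print(row)
```

The third column stays within a constant factor of 1 while the growth factor itself pulls away from log t, which is the gap between $\Omega(t^p)$ and $O(\log t)$ that the proof exploits.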

Lemma 5. For any constant r > 0, the proteome growth model does not have a degenerate limiting solution of the form $f_0 = 1$, $f_k = 0$, $k \ge 1$.

Proof. We have the following recurrence for the expected number of singletons:

$$F_0(t+1) = F_0(t) + \sum_{k \ge 0}\frac{F_k(t)}{t}\,q^k\left(1 - \frac{r}{t}\right)^{t} - \frac{r}{t}\,F_0(t).$$

Assuming the existence of a limiting solution $F_k(t) = f_k t$ we have (after taking limits):

$$(1 + r - e^{-r}) \cdot f_0 = e^{-r}\sum_{k \ge 1} f_k\,q^k.$$

If $f_0 = 1$ then $\sum_{k \ge 1} f_k q^k = 0$, but $1 + r - e^{-r} > 0$ for r > 0, contradicting this. □

In summary, we have shown that the proteome growth model cannot generate a network whose degree distribution is a power law with an exponential cut-off, contrary to the previous analysis of Pastor-Satorras et al. (2003).

2.4 Discussion

In this chapter, we focused on the previous models developed for generating protein-protein interaction networks. The protein-protein interaction network (proteome) of an organism lays out the proteins by their functional relationships, which improves our understanding of cellular functions. We would like to further understand the underlying forces that have generated these networks by using network generation models.

The discovery of the protein-protein interaction network topology brought a wider perspective to the understanding of molecular biology (Jeong et al., 2001, Wagner, 2001). In particular, the discovery that this topology is shared with many other networks, including cellular networks such as metabolic pathways, drew more attention to these networks (Aiello and Chung, 2001, Aiello et al., 2000, Berger et al., 2003, Bollobas et al., 2003, Bollobás et al., 2001, Cooper and Frieze, 2003, Kleinberg et al., 1999). There have been constant efforts to come up with a model that successfully reproduces the evolution of the proteome and the features displayed by its real counterparts (Bhan et al., 2002, Vazquez et al., 2003, Pastor-Satorras et al., 2003). Known biological theories and empirical studies were taken into account in developing such models.

In this chapter we have provided an introduction to the modeling of biological networks. We have studied previous analyses of the degree distribution of the proteome growth model made by Chung et al. (2003) and Pastor-Satorras et al. (2003). In both studies, simplifying assumptions were made to analyze the proteome growth model, which led to contradictory results.

First, Chung et al. (2003) claim that the fraction of nodes with degree k is independent of time and is of the form $f_k = c\,k^{-b}$; here b is a function of $p = 1 - q$, and values of $b \le 2$ are possible for some p. The pure duplication model creates singleton nodes, i.e., nodes that are not connected to any other node of the graph. Since a node can only get a new edge if one of its neighbors is copied, a singleton remains a singleton during the whole graph generation process. Note that in this model all non-singleton nodes form one connected component. By showing that the (expected) proportion of singletons generated by the pure duplication model grows in time, we have proved that without any modifications the pure duplication model of Chung et al. (2003) cannot have a power-law degree distribution of the form $F_k(t) \sim c\,t\,k^{-b}$ for any constants c, b.

Secondly, we have analyzed the work of Pastor-Satorras et al. (2003). In that work the degree distribution of the proteome growth model (in fact, of any random model based on duplications) is claimed to follow a power law with exponential cut-off. However, we have shown that the degree distribution of the proteome growth model cannot follow a power law with exponential cut-off by demonstrating that the expected maximum degree for a power law with exponential cut-off is $O(\log t)$, whereas the proteome growth model has an expected maximum degree of $\Omega(t^p)$.

In parallel to these studies, in Bebek et al. (2006a) we proved that the proteome growth model for r > 0 and a slightly modified version of the pure duplication model indeed achieve a power-law degree distribution as per the yeast proteome network. However, a more general measure for capturing the topological properties of a network is the ℓ-hop degree distribution for all ℓ > 0. Under this measure (for ℓ > 1), we will show that the proteome growth model is quite different from the yeast proteome network.

Here, we have not only disproved the proposed analyses of the growth models, but also provided a better understanding of the underlying forces in generating such networks. Originally the models proposed were based on the two underlying mechanisms for genome evolution: gene duplication and point mutations followed by natural selection (Ohno, 1970). The results presented in this chapter suggest that a deeper analysis of the interaction network is needed before it can be properly modeled. In order to build a successful model of the evolution of interaction networks, the dynamics of the network and its growth process should be better understood. Previously, global properties of these networks were observed, and their similarities to other well-known phenomena led researchers to model them with methods used to generate those other networks.

It is appropriate to mention here the difficulty of understanding these networks in depth, given the inaccuracy of the experimental results. However, by looking at more locally originated measures, such as the ℓ-hop degree distribution, one can minimize the effect of such misleading observations. In short, a better mechanism for generating such networks is needed, and this can only be achieved by better understanding the current state of the known networks and correlating it with empirical evidence, which will improve or replace the previously proposed network generation models.

In Chapter 3, we will present our sequence similarity enhanced model, which is based on the observation that the interactions of sequence-wise similar proteins are highly correlated. The model thus employs sequence similarity edges between pairs of nodes/proteins to better capture the mechanisms for updating the interactions after a duplication event. Our model not only captures the degree distribution of the yeast proteome network, but also yields a much better approximation to its ℓ-hop degree distribution for ℓ > 1. Moreover, we have observed that the average clustering coefficients of networks generated by this model and of the original proteome network are almost equal.

Chapter 3

An Enhanced Duplication Model Based on Protein Sequence Similarity

The proteome growth model approximates the degree distribution of the yeast proteome network well, as observed previously in Pastor-Satorras et al. (2003). In fact, Bebek et al. (2006a) showed that this degree distribution is simply a power-law (for r > 0). In Figure 3-1 we compare the degree distribution of the yeast proteome network from the Database of Interacting Proteins (DIP) (Xenarios et al., 2002) to that of the (modified) proteome growth model with the best fitting¹ parameters p = 0.465 and r = 0.08. Yeast is estimated to have about 6700 genes. The current DIP yeast data set has ∼15000 interactions among ∼4000 proteins. In other words, more than 2000 singleton proteins (proteins with no known interactions) exist in the yeast proteome. The degree distribution studies are focused on the largest connected components of networks, since singleton nodes have no particular role in this distribution. Although the DIP database is incomplete and includes several interactions which are not commonly observed, it still provides the most comprehensive protein-protein interaction data for the yeast S. cerevisiae. As observed earlier, the degree distribution of the yeast proteome network is very similar to that of the proteome growth model with the above parameters.

¹ For all plots, the fits were achieved by calculating the average slope in both curves.

[Figure 3-1 plot: "Degree Distribution of the Protein-Protein Interaction Networks"; log-log axes, degree (x) versus # nodes (y); series: DIP Yeast Data, General Model]

Figure 3-1: The degree distribution of the proteome interaction network of the yeast and that of the proteome growth model with parameters q = 0.535, r = 0.08.

The degree distribution is one possible measure for testing the structural similarity of two networks. Unfortunately, structurally very different networks can have identical degree distributions. For example, in an (infinite) 2-dimensional grid all nodes have degree 4, as do all nodes in a collection of cliques of size 5. The grid obviously forms a single connected component, whereas the 5-cliques are not connected to each other at all. Thus it is desirable to use additional measures for testing the similarity of two networks more accurately. A more refined measure of structural similarity is obtained by comparing the ℓ-hop degree distribution of the proteome growth model and the yeast proteome network (see Section 2.1.3).

[Figure 3-2 plot: six panels (1 hop through 6 hop); x-axis: degree of the observed node; y-axis: number of nodes; series: Yeast, Model]

Figure 3-2: The ℓ-hop degree distribution of the yeast proteome network and the proteome growth model. A typical node reaches all nodes in the network in 7 hops, thus the ℓ-hop degree distribution is given for ℓ < 7.

In Figure 3-2 we plot the average ℓ-hop degree of nodes as a function of their degree, both for the proteome growth model and the yeast proteome network. By definition, the 1-hop degree distribution is a straight line with slope 1. Notice that for ℓ > 2 the ℓ-hop degree distribution of the yeast proteome network is very different from that of the proteome growth model. In fact, for ℓ > 2, the number of nodes that can be reached by a typical node in the yeast proteome network is much higher than that observed in the proteome growth model. We observed this qualitative difference for the proteome growth model with all parameter choices we tested.

In order to capture the ℓ-hop degree distribution of the yeast proteome network for ℓ > 2, we develop a more refined model that aims to emulate the divergence mechanisms in proteome network evolution more accurately. This model, which we call the sequence similarity enhanced model, exhibits a degree distribution very similar to that of the yeast network while also capturing its ℓ-hop degree distribution (see Section 3.2 for details and comparison). We provide the details of our enhanced model in the next sections.
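The grid-versus-cliques example and the ℓ-hop degree comparison can be illustrated with a short sketch. It assumes that the ℓ-hop degree of a node is the number of other nodes reachable within at most ℓ hops (the formal definition is in Section 2.1.3), and it uses a finite wrap-around grid as a stand-in for the infinite grid. The function names are illustrative only.

```python
from collections import deque


def l_hop_degree(adj, source, max_hops):
    """Count nodes other than `source` reachable within `max_hops` edges (BFS)."""
    seen = {source}
    frontier = deque([(source, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in adj[node]:
            if neighbor not in seen:
                seen.add(neighbor)
                frontier.append((neighbor, dist + 1))
    return len(seen) - 1


def torus_grid(n):
    """n x n grid with wrap-around edges; every node has degree 4 (n >= 5).

    Assumption: a finite stand-in for the infinite 2-dimensional grid.
    """
    return {
        (i, j): {((i + 1) % n, j), ((i - 1) % n, j),
                 (i, (j + 1) % n), (i, (j - 1) % n)}
        for i in range(n) for j in range(n)
    }


def disjoint_cliques(count, size):
    """`count` separate cliques with `size` nodes each."""
    adj = {}
    for c in range(count):
        members = {(c, k) for k in range(size)}
        for m in members:
            adj[m] = members - {m}
    return adj


grid = torus_grid(6)
cliques = disjoint_cliques(5, 5)
for hops in (1, 2, 3):
    print(hops, l_hop_degree(grid, (0, 0), hops),
          l_hop_degree(cliques, (0, 0), hops))
```

Both networks have degree 4 everywhere, so their 1-hop values coincide, but the grid keeps reaching new nodes with each additional hop while a clique member never reaches more than its four neighbors.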

3.1 Sequence Similarity Distribution in the Yeast Proteome

A mathematical model for capturing proteome network evolution should take into account Ohno's theory, which attributes genome sequence growth and evolution to successive gene duplications followed by mutations on the gene sequences. The proteome growth model implements gene duplications through a uniformly random node selection process. The mutations are implemented through random edge deletions and insertions. A more refined mutation model may take into account the sequence composition of genes and their associated proteins, and also their pairwise similarity levels.

Sequence alignments and the sequence similarity scores acquired through these alignments are the most common tools for comparing amino acid or DNA sequences. Different scoring schemes were developed over the years, such as the PAM (Point Accepted Mutation) matrices (Dayhoff et al., 1978) and the BLOSUM (BLOck SUbstitution Matrix) matrices (Henikoff and Henikoff, 1992). However, most statistical methods rely on a different kind of measure, the distance measure. It is therefore advantageous to derive a distance measure from a given similarity measure.

Given two protein sequences A and B, one way to measure their similarity level is through their global alignment score S(A, B). One of the first amino acid substitution matrices used in alignments, the PAM matrix, was developed by Dayhoff et al. (1978). Dayhoff's PAM matrix, which was derived from comparisons of closely related proteins, turned out not to work very well for aligning evolutionarily divergent sequences. Henikoff and Henikoff (1992) developed the Blosum series of matrices to rectify this problem. These matrices are constructed using multiple alignments of evolutionarily divergent proteins. The Blosum62 matrix is calculated from observed substitutions between proteins sharing 62% sequence identity or less (Henikoff and Henikoff, 1992). Empirically, the Blosum matrices have performed very well and Blosum62 has become a standard for many protein alignment programs (Eddy, 2004).

Definition 3.1. A metric space is a set of objects with a distance function, d, satisfying the following for every three objects A, B and C:

1. Positivity: d(A, B) ≥ 0 and d(A, B) = 0 iff A = B

2. Symmetry: d(A, B) = d(B, A)

3. Triangle Inequality: d(A, C) + d(B, C) ≥ d(A, B)

A simple method for computing distance from similarity scores for proteins was introduced by Fischer (2002). The distance was calculated using an analogy with vector spaces and inner products. Blosum62 obeys the metric rules (positivity, symmetry and triangle inequality) under this distance measure, d = 1 − ss (where ss is the standardized similarity score), whereas other matrices, such as PAM, do not (Fischer, 2002). Formally, the normalized similarity score of A and B, ss(A, B), can be defined as

\[ ss(A, B) = \frac{S(A, B)}{S(A, A) + S(B, B) - S(A, B)}. \]

This normalization of the similarity score, which accounts for the length of the sequences, gives ss(A, A) = 1 = 100%. Hence, d(A, B) = 1 − ss(A, B) forms a metric. This turns out to be quite useful for our purposes. A metric space establishes bounds on sequence similarity: by the triangle inequality, the distances from A and B to a third protein C bound the distance between A and B. Thus, we can improve the proteome growth model by incorporating this fact into the model. Therefore, throughout this study the Blosum62 matrix is used for sequence alignments.
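The normalization above can be sketched in a few lines. This is a minimal illustration, not the pipeline used in this study: it assumes a toy match/mismatch/gap scoring scheme in place of Blosum62 and a plain Needleman-Wunsch global alignment, so the metric property reported by Fischer (2002) for Blosum62 is not guaranteed to carry over. The function names and scoring values are hypothetical.

```python
def global_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Needleman-Wunsch global alignment score S(a, b).

    Assumption: a toy scoring scheme stands in for a substitution
    matrix such as Blosum62.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                            prev[j] + gap,       # gap in b
                            curr[j - 1] + gap))  # gap in a
        prev = curr
    return prev[-1]


def normalized_similarity(a, b, score=global_alignment_score):
    """ss(A, B) = S(A, B) / (S(A, A) + S(B, B) - S(A, B))."""
    s_ab = score(a, b)
    return s_ab / (score(a, a) + score(b, b) - s_ab)


def distance(a, b, score=global_alignment_score):
    """d(A, B) = 1 - ss(A, B)."""
    return 1.0 - normalized_similarity(a, b, score)


print(normalized_similarity("ACDE", "ACDE"))
print(distance("ACDE", "ACDF"))
```

Passing a different `score` function (for example one backed by a real substitution matrix) leaves the normalization unchanged, which is the point of standardizing the score.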

The E-value of an alignment score is defined as the number of different alignments with scores equivalent to or better than the alignment score that are expected to occur in a database search by chance (Pearson and Lipman, 1988). In short, the lower the E-value, the more significant the score. Having a metric space and comparing distances in this metric is quite different from using E-values. The E-value of an alignment score is a statistical measure giving information about a single alignment. On the other hand, once a metric space is established, one can use it to bound the distances even for alignments that have not been computed. This is quite useful for our purpose, since in building a model we can employ this fact for similarity comparisons of (un)known nodes in our model.

Once the similarity between two proteins is determined via the above measure, one can depict how protein sequences relate to each other by plotting the distribution of their pairwise similarities. Such plots are provided for the yeast proteome in Figures 3-3 and 3-4. The yeast genome was downloaded from the Saccharomyces Genome Database (Cherry et al., 1997) and the pairwise alignments of the ∼6700 protein coding sequences were computed via FASTA align (Pearson and Lipman, 1988) with default parameters.

A similar threshold study was also conducted to correlate sequence similarity with functionality (Nielsen et al., 1996). Although Nielsen et al. (1996) observed similar distributions, they did not standardize the score. Hence, their results varied across different experimental setups. In Figure 3-3, we display the number of protein pairs whose normalized similarity score is in the range x% ± 0.05% for varying values of x. The pairwise similarity distribution has a peak value at ∼50% followed by a very sharp drop. Since a standardized similarity function is used for this observation, the scoring scheme used, i.e. the parameters used for the alignments, has minimal effect on the threshold value.

[Figure 3-3 plot: "Yeast Proteome Sequence Similarity Distribution"; x-axis: x% similarity (20 to 100); y-axis: # of pairs (0 to 300000)]

Figure 3-3: Distribution of sequence similarity between pairs of yeast proteins (granularity: 0.1%).

The same distribution is depicted from a different perspective in Figure 3-4. Here the number of protein pairs whose normalized similarity score is at least x% is plotted for x ∈ [20, 100]. Observe that most pairs have a similarity score below a threshold value of ∼50% and comparatively very few pairs have a similarity score above that threshold value. The step function behavior of the normalized similarity score suggests that the pairs of proteins can be divided into two classes: protein pairs whose similarity scores are above the threshold are similar; the other protein pairs are dissimilar.

Through an investigation of the yeast proteome network we observed that sequence-wise similar proteins have similar interaction patterns.

Observation 1. Given three proteins A, B, C, if A and B are sequence-wise similar and A interacts with C, then the chance that B interacts with C is ∼21 times that between an arbitrary pair of proteins.

Another observation we made was on the correlation between sequence similarities of protein triplets:

Observation 2. Given three proteins A, B, C, if A−B and B−C are pairwise similar, then with ∼65% chance A−C are similar.

[Figure 3-4 plot: "Yeast Proteome Sequence Aggregate Similarity Distribution"; x-axis: at least x% similarity (20 to 100); y-axis: # of pairs, logarithmic (100 to 1e+08)]

Figure 3-4: Aggregate distribution of pairwise sequence similarity of yeast proteins (aggregation performed from right to left).

Observation 2 is not very surprising, as the distance measure we have formed by using the normalized similarity score forms a metric. Also, the number of protein pairs whose similarity score is above the threshold value is distributed uniformly over the range [50% − 100%]. Nevertheless, it will be quite useful in establishing our enhanced proteome growth model, which we describe in the next section.

In this section we investigated the sequence-level associations of interacting proteins. Our goal is to incorporate the sequence composition of genes and their associated proteins in order to establish a more refined mutation model. In the next section, using our observations, we improve the proteome growth model presented in Section 2.2.
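The kind of statistic behind Observation 2 can be sketched as follows. The sketch assumes that pairs are classified as similar when their normalized score is at least a threshold (50% here, following the step behavior described above) and that the reported chance is the fraction of similar pairs A−B and B−C for which A−C is also similar. The exact counting procedure used for the yeast data is not spelled out in the text, and the scores below are made up for illustration.

```python
from itertools import combinations


def similar_pairs(ss_scores, threshold=0.5):
    """Keep the protein pairs whose normalized similarity reaches the threshold."""
    return [pair for pair, ss in ss_scores.items() if ss >= threshold]


def triple_closure_rate(pairs):
    """Fraction of similar pairs (A, B), (B, C) for which (A, C) is also similar."""
    adj = {}
    for a, b in pairs:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    total = closed = 0
    for center, neighbors in adj.items():
        # every unordered pair of neighbors forms a path A - center - C
        for a, c in combinations(neighbors, 2):
            total += 1
            if c in adj[a]:
                closed += 1
    return closed / total if total else 0.0


# Hypothetical normalized similarity scores for four proteins.
scores = {("A", "B"): 0.8, ("B", "C"): 0.7, ("A", "C"): 0.6,
          ("C", "D"): 0.9, ("A", "D"): 0.2}
pairs = similar_pairs(scores)
print(triple_closure_rate(pairs))
```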

3.2 Enhanced Model Based on Sequence Similarity

Based on our observations on sequence similarity and its implications for protein-protein interactions, we develop a more refined network generation model below. Our new model, which we call the sequence similarity enhanced model, modifies the step for updating the interaction edges of a duplicated node through the use of additional edges indicating sequence similarity. Thus the new model has two types of edges: interaction edges connecting proteins that interact with each other, and sequence similarity edges connecting proteins that are similar.

Similar to the proteome growth model, our sequence similarity enhanced model works in discrete time steps. Let G(t − 1) be the network at the end of time step t − 1. At each time step t, a new node v_t is generated, again by picking one of the nodes w in G(t − 1) uniformly at random and duplicating it to create v_t; i.e. v_t will initially be connected to all similarity neighbors and interaction neighbors of w. The new node v_t will also be connected to w by a similarity edge. The following random process updates the similarity edges of v_t:

1. The similarity edge between v_t and w is deleted with probability δ.

2. Each remaining similarity edge is considered independently and is deleted with probability q′ (= 1 − p′).

3. For each pair of similarity edges (v_t, u) and (u, u′), a similarity edge (v_t, u′) is created with probability (p′)².

4. The interaction edges of v_t are updated:

(a) Each interaction edge is considered independently and is deleted with probability q (= 1 − p).

(b) For each node u which is not initially connected to v_t, a new edge (u, v_t) is created independently with probability r/t.

(c) For each interaction edge (v_t, u) and each similarity edge (u, u′), a new interaction edge (v_t, u′) is created with probability α = 0.03 (∼21 times the chance of having an interaction edge between an arbitrary pair of nodes, following Observation 1).

[Figure 3-5 diagram: node w in G(t − 1) and its duplicate v_t, with edge labels r/t, (p′)², α, p, p′ and 1 − δ. Legend: interaction edges of G(t − 1); similarity edges of G(t − 1); similarity edge created between (v_t, w) (step 1); similarity edges created via duplication (steps 1 and 2); similarity edge created between (v_t, u′) (step 3); interaction edges created via duplication (steps 1 and 4.a); interaction edges created at random (step 4.b); interaction edges created through similarity edges (step 4.c)]

Figure 3-5: Enhanced Model Based on Sequence Similarity. An iteration of the Enhanced Model Based on Sequence Similarity is shown. While solid edges belong to state G(t − 1), the dashed edges are possible additions to the graph, with associated generation probabilities in the same colors. The red node, v_t, is the new node duplicated from w at time t. In the figure, nodes u and u′ are not shown since these actually represent more than one node: u is in the set of neighboring nodes of v_t, and u′ is in the set of neighboring nodes of u. For simplicity, the possible edges with probability r/t considered in step 4.b are not shown.
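One iteration of the update process can be sketched as below. This is an illustrative reading of steps 1 through 4, not the code used for the experiments. Several details are assumptions: nodes carry integer labels, step 3 is applied to the similarity edges of v_t that survive steps 1 and 2, "initially connected" in step 4.b refers to the interaction edges inherited from w, and t is taken to be the number of nodes before the duplication. The starting graph is a ring of interaction edges without the random similarity edges mentioned in the text.

```python
import random


def enhanced_model_step(sim, inter, t, p, p_sim, r, delta, alpha, rng=random):
    """Apply one duplication step in place and return the new node label.

    `sim` and `inter` map each integer node label to the set of its
    similarity neighbors and interaction neighbors, respectively.
    `p_sim` plays the role of p' in the text.
    """
    def drop(graph, a, b):
        graph[a].discard(b)
        graph[b].discard(a)

    def add(graph, a, b):
        graph[a].add(b)
        graph[b].add(a)

    nodes = list(sim)
    w = rng.choice(nodes)   # node selected uniformly at random
    v = max(nodes) + 1      # new node v_t (assumes integer labels)

    # Duplication: v_t inherits all neighbors of w and gains a similarity edge to w.
    sim[v], inter[v] = set(), set()
    for u in list(sim[w]):
        add(sim, v, u)
    for u in list(inter[w]):
        add(inter, v, u)
    add(sim, v, w)

    # Step 1: delete the similarity edge (v_t, w) with probability delta.
    if rng.random() < delta:
        drop(sim, v, w)
    # Step 2: delete every other similarity edge with probability 1 - p'.
    for u in list(sim[v]):
        if u != w and rng.random() >= p_sim:
            drop(sim, v, u)
    # Step 3: close similarity paths v_t - u - u' with probability (p')^2.
    for u in list(sim[v]):
        for u2 in list(sim[u]):
            if u2 != v and u2 not in sim[v] and rng.random() < p_sim ** 2:
                add(sim, v, u2)
    # Step 4.a: delete each inherited interaction edge with probability 1 - p.
    inherited = set(inter[v])
    for u in inherited:
        if rng.random() >= p:
            drop(inter, v, u)
    # Step 4.b: connect to nodes that were not inherited neighbors, probability r/t.
    for u in nodes:
        if u not in inherited and rng.random() < r / t:
            add(inter, v, u)
    # Step 4.c: interaction edge to similarity neighbors of interaction partners.
    for u in list(inter[v]):
        for u2 in list(sim[u]):
            if u2 != v and u2 not in inter[v] and rng.random() < alpha:
                add(inter, v, u2)
    return v


# Usage: grow a small network from a ring of 16 nodes.
rng = random.Random(0)
n0 = 16
sim = {i: set() for i in range(n0)}
inter = {i: {(i - 1) % n0, (i + 1) % n0} for i in range(n0)}
for _ in range(200):
    enhanced_model_step(sim, inter, t=len(sim), p=0.4, p_sim=0.225,
                        r=0.04, delta=0.7, alpha=0.03, rng=rng)
print(len(inter), sum(len(nbrs) for nbrs in inter.values()) // 2)
```

Keeping the two edge types in separate adjacency maps makes it easy to see that similarity edges influence the interaction edges only through step 4.c.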

At the time of duplication, v_t and w are sequence-wise identical and thus each similarity edge (u, w) is duplicated as (u, v_t). Step (1) of the similarity edge update process maintains the edge (w, v_t) with probability 1 − δ. Here, the parameter δ is the measure of divergence. In other words, the mutation events that occurred after the duplication event are reflected on the new edge by the deletion parameter δ. Step (2) maintains every other similarity edge (u, v_t) with probability p′. Finally, Step (3) imposes Observation 2 on the constructed network.

The interaction edge update process, in particular Steps (4.a) and (4.b), works similarly to that in the proteome growth model. The only difference is in Step (4.c), where similarity edges are used to update interaction edges in order to impose Observation 1 on the constructed network.

The sequence similarity edges in the network are determined by two parameters, δ and p′. It is possible to estimate the values of δ and p′ in the yeast proteome network by fitting the sequence similarity degree distribution of the model to the sequence similarity degree distribution of the yeast proteome network. We have searched for the best fitting degree distribution parameters in increments of 0.005 using the chi-square goodness-of-fit test. The experiments were repeated 100 times and averaged at the end. A ring of size 16 was used as a starting graph, with random similarity edges connecting proteins under our distance metric defined above. The best fitting sequence similarity degree distribution is achieved for δ = 0.7 and p′ = 0.225 and is given in Figure 3-6.

[Figure 3-6 plot: "Degree Distribution of Similarity Networks"; log-log axes, degree (x) versus # nodes (y); series: DIP Yeast Data, Sequence Enhanced Model]

Figure 3-6: The degree distribution of the proteome sequence similarity network of the yeast and that of the enhanced model with parameters δ = 0.7 and p′ = 0.225.

Based on the above values of δ and p′, it is possible to estimate the other two parameters, r and p, that determine the interaction edges. Once again, we have searched for the best fitting degree distribution parameters in increments of 0.005 using the chi-square goodness-of-fit test. The best fit of the interaction degree distribution of the model to the yeast proteome network degree distribution is achieved at q = 0.6 and r = 0.04 and is given in Figure 3-7. The search processes in both cases converged to a single point in the parameter space. Thus we have acquired a unique solution for the yeast proteome network. By fitting the degree distributions to the actual network distributions, we have shown that the enhanced model can generate networks that match the yeast network at least as well as those of the proteome growth model. Although we have a larger parameter set to adjust, we were still able to get a unique solution.

[Figure 3-7 plot: "Degree Distribution of the Protein-Protein Interaction Networks"; log-log axes, degree (x) versus # nodes (y); series: DIP Yeast Data, Sequence Enhanced Model]

Figure 3-7: The degree distribution of the proteome interaction network of the yeast and that of the enhanced model with parameters q = 0.6, r = 0.1, δ = 0.7 and p′ = 0.225.

Table 3.1: The average clustering coefficients of the DIP Protein-Protein Interaction Network, Proteome Growth Model, and the Enhanced Model

Network                   Clustering Coefficient
DIP PPIN                  0.39
Proteome Growth Model     0.33 ± 0.01
Enhanced Model            0.37 ± 0.01

The clustering coefficient of a node is the ratio between the actual number of edges between neighbors of a node and the maximum possible number of edges between these neighbors. The average clustering coefficient of a network is the average of the clustering coefficients over all nodes in the network (Watts and Strogatz, 1998) (see Section 2.1.3 for more details). Previously, this measure was widely used for network similarity comparisons (Aiello et al., 2000, Albert et al., 2000, Barabasi and Albert, 1999, Bhan et al., 2002, Bollobás et al., 2001, Faloutsos et al., 1999, Ferrer i Cancho and Solé, 2001, Jeong et al., 2000, Kleinberg et al., 1999, Kumar et al., 2000, Pastor-Satorras et al., 2003, Redner, 1998, Vazquez et al., 2003, Watts and Strogatz, 1998). In short, the average clustering coefficient is a single value representing the average local structure of a network. In Table 3.1, the average clustering coefficients of the yeast proteome network, the proteome growth model and the enhanced model are shown. The values for the models are calculated using the resulting networks that have the best fitting degree distributions with the yeast proteome network.
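The clustering coefficient definition used for Table 3.1 can be written out directly. In this small sketch, the convention that nodes with fewer than two neighbors contribute a coefficient of 0 is an assumption, since the text does not state how such nodes are handled.

```python
def clustering_coefficient(adj, node):
    """Edges among the neighbors of `node` divided by the maximum possible number."""
    neighbors = adj[node]
    k = len(neighbors)
    if k < 2:
        return 0.0  # assumed convention for nodes with fewer than two neighbors
    # each edge between two neighbors is seen from both endpoints, hence // 2
    links = sum(1 for a in neighbors for b in adj[a] if b in neighbors) // 2
    return links / (k * (k - 1) / 2)


def average_clustering(adj):
    """Mean clustering coefficient over all nodes of the network."""
    return sum(clustering_coefficient(adj, n) for n in adj) / len(adj)


# A triangle A-B-C with a pendant node D attached to C.
example = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "B", "D"}, "D": {"C"}}
print(average_clustering(example))
```

In the example, A and B have coefficient 1, C has 1/3 (one edge among three neighbors) and D has 0, so a single average hides quite different local structures.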

[Figure 3-8 plot: six panels (1 hop through 6 hop); x-axis: degree of the observed node; y-axis: number of nodes; series: DIP S. cerevisiae Network, Sequence Enhanced Model, Proteome Growth Model]

Figure 3-8: The ℓ-hop degree distribution of (i) the yeast proteome network, (ii) the proteome growth model, and (iii) the sequence similarity enhanced model. A typical node reaches all nodes in the network in 7 hops, thus the ℓ-hop degree distribution for ℓ ≥ 7 is not very meaningful.

Here, although the average clustering coefficients, like the degree distributions, give similar results for the networks, in reality the generated networks are not quite similar. In Figure 3-8, we compare the ℓ-hop degree distributions of the sequence similarity enhanced model, the proteome growth model, and the yeast proteome network. Our sequence similarity enhanced model accurately captures the ℓ-hop degree distribution of the yeast proteome network for all values of ℓ.

3.3 Discussion

In this chapter, we presented our sequence similarity enhanced model. First, we analyzed other models and came to the conclusion that the models proposed for generating protein-protein interaction networks were not sufficient (Chung et al., 2003, Pastor-Satorras et al., 2003). In earlier studies, global measures like the degree distribution, or more superficial values such as the average clustering coefficient, were used to compare these models with known biological networks (Albert et al., 2000, Bhan et al., 2002, Jeong et al., 2000, Pastor-Satorras et al., 2003, Vazquez et al., 2003). However, through a new measure that we have developed, called the ℓ-hop degree distribution, we showed that these measures might be quite misleading (see Section 2.1.3 for more information).

When they first became publicly available, high-throughput data sets (Fields and Song, 1989, Gavin et al., 2002, Ho et al., 2002, Ito et al., 2001, Uetz et al., 2000) were considered a valuable resource through their network representations. Modeling these networks was a challenging task aimed at further understanding their evolution. However, with further analysis of these networks, the quality of the data was questioned, and how incomplete or false these networks might be was evaluated (Aloy and Russell, 2002, Deane et al., 2002, Sprinzak et al., 2003, Walhout et al., 2000b).

Large networks are analyzed by investigating their global and local topological properties. However, because of the high false positive and false negative rates, the global properties of these networks might be biased or misleading. On the other hand, the local structures of these networks are more likely to be complete and reliable. Therefore, local properties of large networks, which are more accurate resources for identification and modeling, were considered as well. Previously, such properties were identified by focusing on finding small over-represented patterns in a network (Itzkovitz and Alon, 2005, Milo et al., 2004, 2002, Shen-Orr et al., 2002). In this work, observing the shortcomings of global features, we have also checked local properties to minimize these effects.

In short, through our observations we have concluded that a better mechanism for generating such networks is needed, and this can only be achieved by better understanding the current state of the known networks and correlating it with empirical evidence. The original theory behind modeling evolutionary steps was presented by Ohno (1970). This theory states that the two underlying mechanisms for genome evolution are gene duplication and point mutations. Although the earlier models represent the duplication event quite simply, their differentiation step, which is meant to emulate point mutations, was not sufficient.

To overcome these shortcomings, we have focused on the sequence properties of the proteome. Point mutations change the sequence information of the genes. These changes should also be incorporated into a model that is developed for modeling evolution. As a result, a model based on interactions of sequence-wise similar proteins was presented. The model employs sequence similarity edges between pairs of nodes/proteins to better capture the mechanisms for updating the interactions after a duplication event.

By growing the set of relationships in the networks using sequence information, we have exploited more features of proteins. However, developing a measurement system to use sequence information is a challenging task. This task requires an understanding of the actual basis of the model, which is also used for emulating evolutionary processes. As a result, by combining these features, we have better captured the growth of the proteome. As we were the first to do this, it is no surprise that we have accomplished a closer match in the end.

When compared with previous models, the model presented here closely captures topological properties such as the degree distribution or the average clustering coefficient of the yeast proteome network. Moreover, the model yields a much better approximation to the yeast proteome network's ℓ-hop degree distribution for ℓ > 1.

The new model employs five parameters, p, p′, r, δ and α, to generate networks, whereas previous models had about half of this number. Here, α was derived through observations made, and the rest of the parameters were adjusted through simulations. We have not only improved the model, but also increased the degrees of freedom representing evolutionary forces. For instance, one of the newly introduced parameters in this new model, δ, represents the rate of the mutation events differentiating the sequence of the duplicated node. In short, by accurately transforming the observations into functional steps and inputs, improvements were accomplished on this evolutionary model.

In conclusion, we have developed a new measure to better understand the complex dynamics of biological networks. By deeper analysis of the actual components of these networks and understanding the underlying forces of evolution, we were able to improve the protein-protein interaction network growth model. As a result, we better captured the growth of the protein-protein interaction network.

Chapter 4

Discovering Signaling Pathways: PathFinder

The completion of sequencing efforts for various genomes has inspired many other studies to discover the functionality of the genes in these genomes. Techniques like the two-hybrid system or affinity purification have been widely applied to uncover physical interactions between proteins. The application of such techniques to different organisms such as yeast, bacteria, fly and worm revealed interaction maps of the cells. These maps drew a significant amount of attention and raised many questions.

Signaling pathways are chains of interacting proteins, in which proteins interact with each other to enable or disable their partners towards a biologically identified end-result (Alberts et al., 2002). These pathways transmit stimuli from outside of the cell to transcription factors, which in turn regulate gene expression. Bacteria and other one-cell organisms are capable of a variety of signal transduction processes. The number of these processes shows how many ways the organism can react and respond to its environment. Discovery of these signaling pathways is an important task for domain biologists, and by finding new pathways, domain scientists can gain new insights into how a cell functions. To a great degree, mapping the key signaling molecules in pathways is fundamentally necessary for future drug discovery efforts.

Detailed studies on how cells respond to their environment began only in the mid-1970s. Early findings were concentrated on macroresponses such as chemotaxis. In the late 1970s, it was shown that bacterial chemotaxis was proportional to the amount of receptor present on the cell surface. As cancer studies revealed enzymes known as kinases to be related to cancer transformation, the concept of a molecular cascade or linear signal transduction pathway was established (Smith et al., 1986). Today, it is known that the final recipients of these pathways are most often transcription factors that act on DNA and regulate gene expression. Soon after the discovery of the first cascades, it was shown that these multiple linear pathways act in networks (Cook and McCormick, 1993, Wu et al., 1993).

Because of their link to diseases, signaling pathways are widely investigated by all major pharmaceutical and biotechnology companies as well as clinical research laboratories. Researchers working on signaling pathways currently face two major challenges. First, many pathways are still far from being completely understood. To address this challenge, signal transduction pathways are being deciphered one by one. Second, pathway components suspected of participating in disease will have to be correlated with genes.

In the past, there were continuous efforts to map signal transduction pathways and link them to disease, and interesting methods for mapping pathways were used. Pathway mapping efforts rely on the ability to isolate complexes of interacting proteins in their native states and obtain sequence information, for example by mass spectrometry. This approach is the cornerstone of integrated efforts involving genomics and proteomics. Traditionally, molecular components of signaling networks in yeast and mammals are discovered by gene knockout experiments¹ and epistasis² analysis (Bateson, 1909). In summary, discovering signaling pathways can be a time-consuming and expensive process.

In most cases, gene sequences alone are not sufficient to provide enough understanding of how cells function. Molecules interacting inside the cells are very powerful tools for revealing this dynamic system. Various genomic and proteomic approaches are contributing toward mapping signal transduction and linking it to diseases. Signal transduction is a fundamental process of living cells. When molecules participating in these cascades interact and regulate each other in good order, a healthy cell and a healthy organism are observed. However, if these cascades are perturbed by missing links in the chain, or altered by mutations, then the pathways become distorted and a disease state occurs. Therefore, focusing on signaling pathways to answer questions about the causes of disease, and developing new drugs accordingly, is an important area of study.

Protein-protein interactions of an organism may lay out the proteins by their functional relationships. These associations improve our understanding of cellular functions and help identify unknown proteins and their functions. Techniques like the two-hybrid system (Giot et al., 2003, Ito et al., 2001, Li et al., 2004, Reboul et al., 2003, Uetz et al., 2000) or affinity purification followed by mass spectrometry (Gavin et al., 2002, Ho et al., 2002) have been developed to uncover physical interactions between proteins. These experiments identify only a small fraction of the total protein-protein interaction network (Bader and Hogue, 2002, Edwards et al., 2002, Grigoriev, 2003, Ito et al., 2002, von Mering et al., 2002, Walhout et al., 2000a,b).

¹ An organism is engineered to lack the expression and activity of one or more genes. Knockout experiments are often used to determine the functional role of a specific gene in the organism by studying the defects caused by the resulting mutation.

² The interaction between two or more genes to control a single phenotype. Epistasis was introduced by Bateson (1909). If a mutation in one gene masks the effects of a mutation in a second gene, then the first gene is said to be epistatic to the second.

67 Signaling pathways have been an active research area during in recent history. There are many studies in which signaling pathways were modeled using various approaches. Previously, signaling pathways were modeled as modular kinetic simu- lations of biochemical networks (Neves and Iyengar, 2002) and by detailed integra- tion of biochemical properties of the pathways (Choi et al., 2004). In another recent study, Bayesian Networks are applied to multi-variable cell data to infer signaling pathways (Sachs et al., 2005). One of the approaches to reveal signaling pathways was proposed by Grae- ber and Eisenberg (2001). In their study, Graeber and Eisenberg (2001) worked on identification of autocrine receptor signaling loops. Knowing that in some autocrine pathways, the ligand and receptor are regulated by coupled mechanisms at the level of transcription, Graeber and Eisenberg (2001) searched for correlated mRNA ex- pressions. In their study, they have identified examples of ligand and receptor pairs in five cancer based gene expression datasets. Although this study has identified couple signaling loops, the results are limited within the hypothesis of the study, i. e. identifying only a subset of the pathways. Moreover, the methodology in this study relies only on differences in expression profiles. Therefore, the results pre- sented are likely to be biased (which was also noted in this study because of the false positives in experimental results). Protein-protein interaction maps were previously suggested for generating reg- ulatory networks and cellular modeling (Tucker et al., 2001). They have been also used to predict metabolic pathways (Zien et al., 2000). Steffen et al. (2002) used expression data to rank candidate pathways of interacting proteins. Steffen et al. (2002) proposed an exhaustive search method utilizing k-means clustering algorithm on an integrated network to identify signaling pathways in a protein-protein interaction network. 
In this work, microarray expression profiles are used to form clusters in order to classify whether two proteins are functionally related or not. In order to measure the significance of the interactions, the authors

score the interactions by their similarity in expression patterns. In theory, the signaling pathways are embedded in the protein-protein interaction network (Steffen et al., 2002). Since the proteins used in a signaling network must be present when the network is activated, the genes encoding these proteins must be transcribed at approximately the same time, and under the same environmental conditions in which the signaling network is required.

In their methodology, Steffen et al. (2002) pick all possible linear paths of a specified length through the interaction map, starting at any membrane protein and ending at any DNA-binding protein. These paths are then ranked by microarray expression data according to the degree of similarity in the expression profiles of pathway members. Linear pathways that have common starting points and endpoints and the highest ranks are then combined into the final model of branched networks. Steffen et al. (2002) were able to extract highly correlated proteins from a protein-protein interaction network.

A similar integration was also presented by Liu and Zhao (2004), who developed an approach for predicting the order of signaling pathway components, assuming all the components of the pathways are known. Their methodology is built on a score function that integrates protein-protein interaction data and microarray gene expression data. The integrated approach, when compared to the individual datasets, returned better identification of the order of the pathway components.

Another approach to a similar problem with an integrated solution was presented by Ideker et al. (2001). In this work, a model for cellular pathways is developed based on perturbations to critical pathway components. These were analyzed using DNA microarrays, quantitative proteomics, and databases of known physical interactions.
In this approach, an initial model with possible pathway interactions is first proposed. This initial draft is drawn from previous genetic and biochemical research. Secondly, each component of the pathway is perturbed through a series of genetic or environmental manipulations, and the corresponding global cellular responses to each perturbation are detected and measured. Technologies for large-scale mRNA- and protein-expression measurement are used for these perturbation measurements. Third, the observed mRNA and protein responses are integrated with the current model of the candidate pathway and known interactions. Finally, new hypotheses are formulated to explain the observations not predicted by the model, and the first three steps are repeated as more perturbation experiments are needed. Using this approach, Ideker et al. (2001) identified the yeast galactose-utilization pathway. Although this iterative method is very thorough in revealing the functionality of a pathway, it might require considerable time and effort before the model is finalized. On the other hand, this study is one of the first to propose a methodology for predicting cellular behaviors by integrating diverse data types and assimilating them into biological models (Ideker et al., 2001).

More recently, Scott et al. (2006) presented a methodology in which the search for signaling pathways is carried out on a weighted network. Scott et al. (2006) utilized the color coding algorithm, a randomized algorithm for finding simple paths and simple cycles of a specified length in a graph (Alon et al., 1994). In their work, they first assign weights to interactions by integrating the number of experimental observations made, expression values, and small-world clustering coefficients of the proteins on the interaction network. Using the color coding algorithm, they pick the highest possible scoring simple paths. These paths are then merged to identify a pathway between any two given proteins. As a result, they were able to verify some already known pathways with sixty percent accuracy.

In another study, Shlomi et al. (2006) developed a framework, namely QPath, to search for homologous pathways in a given protein-protein interaction network.
While searching for homologous structures, QPath also considers insertions and deletions of proteins in the identified pathways. Although QPath is yet another signaling pathway search tool, the requirement of a known pathway for finding a homologous pathway narrows its functionality. Unlike previous studies, QPath requires a known homologous pathway of another organism to find (unknown) pathways for an organism. In their study, Shlomi et al. (2006) were able to find conserved (metabolic and signaling) pathways of fly using a collection of protein pathways from yeast.

In the next section we describe the PathFinder system to discover possible pathway segments. Here we address the problem of finding biologically meaningful (and interesting) pathway segments utilizing protein-protein interaction networks. Previously, other datasets and integrated models, such as multi-variable cell data (Sachs et al., 2005), large-scale mRNA and protein-expression measurements (Ideker et al., 2001), and biochemical networks (Neves and Iyengar, 2002), were also used. By taking advantage of existing protein-protein interaction datasets, wider coverage of the cellular mechanisms was acquired and faster synthesis of functional components was made at the same time (Graeber and Eisenberg, 2001, Ideker et al., 2001, Scott et al., 2006, Steffen et al., 2002). The above studies have proposed different approaches to this identification problem. However, simplified, efficient and accurate results with large coverage could not be achieved (Graeber and Eisenberg, 2001, Ideker et al., 2001, Scott et al., 2006, Steffen et al., 2002).

The rest of the chapter is organized as follows. An introduction to our methodology is presented next. The problem is formally introduced in Section 4.1.1. In Section 4.2, we explain our methodology in detail. Section 4.3 presents the application of our method to yeast protein-protein interaction and microarray data and a discussion of the results. Finally, conclusions are drawn in Section 4.4.

4.1 PathFinder

Our goal is to find biologically meaningful (and interesting) paths efficiently. However, there are many challenges to this problem. First, it is well known that high-throughput interaction discovery methods can generate false negatives (report a true interaction as not existing) or false positives (predict a relationship between two unrelated proteins). Although the previously presented work tried to incorporate other data sources to resolve these problems, the results presented are sparse and the methodologies are not focused enough to relate protein-protein interaction networks to signaling pathways. Secondly, different proteins may have similar biological functions. In other words, different pathways might have different proteins with the same functionality. Therefore, it could be troublesome to use interactions in one (known) pathway to infer interactions in another (unknown) pathway.

To address the above challenges, we present a new systematic method, namely PathFinder, for this identification problem. In this framework, we first collect functional annotations of proteins in the known pathways. The aim of mapping proteins to functional annotations is to capture underlying characteristics of the known pathways. These characteristics are then used to search for possible (unknown) pathway segments. Association rule mining, a procedure to collect data attributes that are statistically related in the underlying data, is used to discover patterns from pathways.

The number of discovered signaling pathways has increased for each organism in recent years. In a recent study, by integrating protein-protein interaction networks and sequence information, Sharan et al. (2005) showed that, among different organisms, a significant number of interaction patterns are highly conserved. Moreover, in Section 3.1, we observed that sequence-wise similar proteins share similar interaction patterns in the same organism. This is mostly due to the duplication of genes or a portion of the genome during the evolutionary processes.
Hence, multiple proteins sharing the same function do exist and are being discovered.

By collecting the underlying functional patterns of signaling pathways, we build a library of templates. These rules are then used to evaluate candidate pathway segments for possible occurrences of these rules. Hence, by extracting associations of the proteins in pathways, one can easily identify whether the proteins in a given sequence are associated with each other in a similar way or not.

We use the Apriori algorithm, an efficient association rule mining algorithm developed by Agrawal et al. (1993) and Agrawal and Srikant (1994), to acquire association rules for signaling pathways. We then query our rule set for a pair of starting and ending proteins for a pathway segment. Finally, gene expression profiles are integrated into our methodology to obtain more accurate information on protein-protein interactions, and the results are filtered by their expression profiles in the cell. This is because, if the participants of an interaction are also in a signaling pathway, then the genes producing the associated proteins should be expressed at the same time. Previously, Graeber and Eisenberg (2001), Liu and Zhao (2004), and Steffen et al. (2002) used this fact to discover signaling pathways from available gene expression profiles of S. cerevisiae.

Using this method, we can accurately recover previously published pathway segments. In order to illustrate the success of our approach, we show that our method accurately recovers segments of well-known MAP kinase pathways (Figure 4-1) and performs better than previous studies. In the next section we formally define the problem.

4.1.1 Preliminary

We now formalize the problem investigated in this work. Let $(V, E, corr)$ be the protein interaction network, where $V = \{p_0, p_1, p_2, \ldots, p_n\}$ is the set of all proteins, $E = \{e = (p_i, p_j) \mid p_i, p_j \in V\}$ is the set of interactions among these proteins, and $corr$ is the weight function on the edges in $E$. In this work, $corr(e)$ is a function of the expression levels of the proteins associated with edge $e$ and gives the expression correlation of these two proteins (the weight function $corr(e)$ is discussed in detail in Section 4.2.3).

[Figure 4-1 diagram: Saccharomyces cerevisiae (yeast) MAPK signaling pathway. Stimuli shown: pheromone response, hypotonic shock, high osmolarity, starvation. Kinase tiers labeled: MAPKKK, MAPKK, MAPK. Outcomes shown: cell cycle arrest, cell wall remodeling, osmolyte synthesis, mating, filamentation.]

Figure 4-1: The MAP kinase pathways downloaded from the KEGG database (Kanehisa and Goto, 2000).

A pathway segment of length $l$ is an ordered list or sequence of distinct proteins in $V$ such that each consecutive pair is in $E$, and is a part of known or unknown pathways. In other words, a pathway segment is a simple path in a protein-protein interaction network.

Given a weighted protein-protein interaction network $(V, E, corr)$ with proteins $V$, interactions $E$, and $corr(e)$ showing the expression correlation between connected proteins, the starting ($p_1$) and ending ($p_l$) nodes of a pathway segment from this network, and lower and upper bounds on the pathway segment length of interest, $(l_{lower}, l_{upper})$, our objective is to discover pathways between $p_1$ and $p_l$ within this network with length $l$, where $l_{lower} \le l \le l_{upper}$.
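As an illustration of the problem statement, the following is a minimal sketch of enumerating simple paths between two proteins under length bounds. It is not the search procedure of PathFinder itself; the function name `simple_paths`, the adjacency-dictionary representation, and the toy protein names are assumptions made for this example, and the length of a segment is taken to be the number of proteins on it.

```python
from typing import Dict, Iterator, List, Set


def simple_paths(adj: Dict[str, Set[str]], start: str, end: str,
                 l_lower: int, l_upper: int) -> Iterator[List[str]]:
    """Yield simple paths from start to end whose number of proteins l
    satisfies l_lower <= l <= l_upper (hypothetical helper)."""
    path = [start]
    on_path = {start}

    def extend(node: str) -> Iterator[List[str]]:
        if node == end:
            # A segment ends at the target; report it if long enough.
            if l_lower <= len(path) <= l_upper:
                yield list(path)
            return
        if len(path) == l_upper:
            return  # adding another protein would exceed the upper bound
        for nxt in sorted(adj.get(node, ())):
            if nxt not in on_path:  # keep the path simple (distinct proteins)
                path.append(nxt)
                on_path.add(nxt)
                yield from extend(nxt)
                path.pop()
                on_path.remove(nxt)

    yield from extend(start)


# Toy undirected interaction network (protein names are placeholders).
toy_network = {
    "A": {"B", "C"},
    "B": {"A", "C", "D"},
    "C": {"A", "B", "D"},
    "D": {"B", "C"},
}
segments = list(simple_paths(toy_network, "A", "D", 3, 3))
```

Exhaustive enumeration of this kind grows quickly with the upper bound, which is why the bounds $(l_{lower}, l_{upper})$ are part of the query.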

4.2 Methods

Computational approaches to discover signaling pathways were previously presented (Choi et al., 2004, Graeber and Eisenberg, 2001, Ideker et al., 2001, Neves and Iyengar, 2002, Sachs et al., 2005, Scott et al., 2006, Shlomi et al., 2006, Steffen et al., 2002). In each study, a different set of datasets and methodologies is utilized for revealing pathways. In this work, we build a (search) system utilizing properties of the signaling pathways and the proteins in these pathways. More precisely, we first determine the underlying relationships of pathways by capturing the characteristics of the pathways' structure. Our goal is to build a system that can answer queries on possible pathway segments using this acquired information.

In this study we focus on protein-protein interactions and on correlating the underlying proteins with their gene expression profiles. We also investigate signaling pathways and their functional annotations to better capture their underlying structure. Here, we compare our results with the two most promising other studies in terms of coverage of the cellular mechanisms and speed of the methodology (Scott et al., 2006, Steffen et al., 2002).

Our hypothesis is that, given association rules that capture the pathways and a weighted protein interaction network that contains reliable interactions, a pathway segment should belong to a pathway if it contains at least a given number of these rules and has an average interaction weight above a given threshold.

In this study, we verify our hypothesis by carrying out experiments on the publicly available S. cerevisiae data set. The steps taken in this process are described in detail in the following sections. First, functional patterns of known signaling pathways are discovered by mapping pathway proteins to their functional annotations and then by mining association rules from this data set (Sections 4.2.1 and 4.2.2). Secondly, a weighted protein-protein interaction network is created by combining microarray expression data with the protein-protein interaction network (Section 4.2.3). Finally, given a search query of interest (a starting and ending protein pair and bounds on the length of the pathway segment), paths connecting this protein pair are checked for the existence of any association rules. The paths that contain a significant number of rules and are above a certain co-expression level are returned as results (Section 4.2.4).
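The hypothesis above can be sketched as a simple acceptance test on a candidate segment. This is only an illustration under stated assumptions: the helper names (`count_rule_hits`, `mean_weight`, `accept_segment`), the way rule occurrences are counted (one hit per annotation pair of consecutive proteins that matches a rule), and all toy values are hypothetical and are not taken from the PathFinder implementation.

```python
from itertools import product
from typing import Dict, FrozenSet, List, Set, Tuple

Rule = Tuple[str, str]  # (antecedent annotation, consequent annotation)


def count_rule_hits(path: List[str], annotations: Dict[str, Set[str]],
                    rules: Set[Rule]) -> int:
    """Count annotation pairs of consecutive proteins that match a rule."""
    hits = 0
    for p, q in zip(path, path[1:]):
        pairs = set(product(annotations.get(p, ()), annotations.get(q, ())))
        hits += len(pairs & rules)
    return hits


def mean_weight(path: List[str],
                weight: Dict[FrozenSet[str], float]) -> float:
    """Average edge weight along the path; weights are assumed to be
    precomputed |corr(e)| values keyed by the unordered protein pair."""
    edges = [frozenset(e) for e in zip(path, path[1:])]
    return sum(weight[e] for e in edges) / len(edges)


def accept_segment(path, annotations, rules, weight,
                   min_rules: int, min_weight: float) -> bool:
    """Accept a segment if it matches enough rules and is co-expressed."""
    return (count_rule_hits(path, annotations, rules) >= min_rules
            and mean_weight(path, weight) >= min_weight)


# Toy data (placeholder proteins, annotations, rules, and weights).
annotations = {"A": {"g1"}, "B": {"g2", "g3"}, "D": {"g4"}}
rules = {("g1", "g2"), ("g3", "g4")}
weight = {frozenset({"A", "B"}): 0.5, frozenset({"B", "D"}): 0.75}
accepted = accept_segment(["A", "B", "D"], annotations, rules, weight,
                          min_rules=2, min_weight=0.6)
```

In this toy case the segment matches two rules and has an average weight of 0.625, so it passes both thresholds.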

4.2.1 Mapping Proteins to Functional Annotations

The first step in the PathFinder framework is to capture the underlying characteristics of known signaling pathways. Signaling pathways are by definition chains of interacting proteins, in which proteins enable or disable one another along the path, each modifying its successor. Thus proteins on a signaling cascade interact with each other with a common functionality. Therefore, by looking at the functional annotations of proteins, one can better capture the true relationships among proteins than with binary (0/1 or yes/no) protein-protein interactions alone. Henceforth, using functional annotations instead of the proteins themselves would

[Figure 4-2 diagram: flowchart with the following boxes. Inputs: Query ($p_{start}$, $p_{end}$, $l_{lower}$, $l_{upper}$); Protein-Protein Interaction Network; Gene Expression Profiles; Signaling Pathways; Gene Ontology Database. Intermediate stages: Weighted Protein-Protein Interaction Network; Functional Rules; Association Rule Mining; Association Rules; Filtered Simple Paths. Output: Possible Signaling Pathway Segment.]

Figure 4-2: The PathFinder framework. The entities on top represent the initial datasets. Notice that the query parameters are first used on the weighted protein-protein interaction network. Also, once the association rules are found using association rule mining and the weighted protein-protein interaction network is formed, the system can be queried repeatedly for different pathway segments.

better capture the true relationships among proteins in known pathways. Here, to represent relationships among proteins, we propose to use functional annotations of the proteins rather than just protein names.

In an organism, multiple proteins may have similar biological functions, i.e. one protein may replace another protein in a pathway if they both have the same biological functionality, e.g. the same set of gene annotation terms. Biological annotations, e.g. (Ashburner et al., 2000), provide a basis to find functionally similar proteins. Thus, this will strengthen the accuracy of our pathway segment prediction system.

In this study, the training data for pathways is collected from various pathway databases (Campagne et al., 2004, Gough et al., 2004, Kanehisa and Goto, 2000, Mewes et al., 1999). Currently, the databases for signaling pathways provide superficial images of the pathways. These databases are first converted to tuples, where each interaction is kept with the proteins that are involved.

Next, functional annotations of pathway proteins are collected and kept as their functionality sets. A functional annotation scheme for mapping the proteins in an organism, such as the Gene Ontology (Ashburner et al., 2000) or FunCat (Ruepp et al., 2004), can be used to describe proteins and their interactions with other macromolecules. In this study, we choose the Gene Ontology (Ashburner et al., 2000) as the annotation scheme to demonstrate the usefulness of our model. The hierarchical nature of the Gene Ontology improves the understanding of the functional assignments. However, our system is not dependent on Gene Ontology annotations, and can easily be extended or converted to other annotations if needed.

Proteins in an organism might have multiple functional annotations. In this framework, a set of annotations is kept for each protein.
Using these annotation sets, functional relationships among proteins are established (see Figure 4-3). Each annotation of a protein is linked with the annotations of its interacting neighbor, and a network of annotation links is formed.

For instance, given two interacting proteins, $(p_i, p_j) \in E$, and the functional annotations of the two, $F_{p_i}$ and $F_{p_j}$, all possible annotation pairs between these two proteins would be

$$S = F_{p_i} \times F_{p_j} = \{\{a_1, b_1\}, \{a_2, b_1\}, \ldots, \{a_n, b_{m-1}\}, \{a_n, b_m\}\}.$$

Here, all possible combinations are examined, since they represent different biological functions. This multiset $S$, with cardinality $|F_{p_i}| \cdot |F_{p_j}|$, denotes all possible annotation links between two proteins $p_i, p_j \in V$. Hence, $|F_{p_i}| \cdot |F_{p_j}|$ annotation tuples are generated to capture the functional relationship between $p_i$ and $p_j$.

[Figure 4-3 diagram: protein P1 with annotation terms a1, a2, a3, interacting with protein P2 with annotation terms a2, a4.]

Figure 4-3: Two interacting proteins, P1 and P2, with their respective annotation terms. An annotation link is created by linking an annotation from the first protein's annotation set to an annotation of the second protein. Above, the tuples {(a1, a2), (a1, a4), (a2, a2), (a2, a4), (a3, a2), (a3, a4)} are shown. There are a total of 3 × 2 = 6 annotation links between P1 and P2.

Let $S$ denote all the tuples generated from all known pathways, i.e. $S = \{(a_i, b_j) \mid (p_i, p_j) \in E,\ a_i \in F_{p_i},\ b_j \in F_{p_j},\ \text{and } a_i, b_j \in G\}$. Although $S$ consists of every possible match, it is quite intuitive that not every tuple in $S$ is meaningful. The Apriori algorithm (Agrawal and Srikant, 1994) (see Section 4.2.2) is used to extract association rules among these annotation tuples.
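The construction of annotation links can be sketched as a Cartesian product over the annotation sets of each interacting pair. The sketch below is illustrative only: the function name `annotation_links` and the dictionary layout are assumptions, and the toy data reuse the annotation terms of Figure 4-3.

```python
from itertools import product
from typing import Dict, Iterable, List, Set, Tuple


def annotation_links(interactions: Iterable[Tuple[str, str]],
                     annotations: Dict[str, Set[str]]) -> List[Tuple[str, str]]:
    """Return every (a, b) pair with a in F_{p_i} and b in F_{p_j}
    for each interaction (p_i, p_j); duplicates are kept (multiset)."""
    links: List[Tuple[str, str]] = []
    for p_i, p_j in interactions:
        # sorted() only makes the output order reproducible
        links.extend(product(sorted(annotations[p_i]),
                             sorted(annotations[p_j])))
    return links


# Annotation sets of the two proteins shown in Figure 4-3.
figure_annotations = {"P1": {"a1", "a2", "a3"}, "P2": {"a2", "a4"}}
links = annotation_links([("P1", "P2")], figure_annotations)
```

For this single interaction the result contains $3 \times 2 = 6$ annotation links, matching the count given in the figure caption.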

4.2.2 Mining Association Rules from Known Pathways

Association rule mining is a powerful pattern discovery method that has been quite popular since its introduction (Agrawal et al., 1993, Agrawal and Srikant, 1994). This technique is used to discover elements that frequently occur together. In this study, association rule mining is used for finding interesting and useful patterns in the functional relationships of the proteins in an organism. Association rule discovery is described first, followed by the well-known Apriori algorithm.

Association Rule Discovery

Knowledge Discovery in Databases (Data Mining) is the process of discovering interesting and previously unknown information from data. There is no restriction on the type of data that can be analyzed by data mining. Hence, in general, data mining can help in making sense of large amounts of data.

The association rule discovery problem was introduced by Agrawal et al. (1993) and Agrawal and Srikant (1994). Association rules were initially introduced to analyze supermarket transactions for discovering the buying patterns of customers. A supermarket transaction consists of a list of items purchased by a customer, and the supermarket database consists of a list of such transactions. An association rule has the form $X \rightarrow Y$, where $X$ (antecedent) and $Y$ (consequent) represent sets of items. Such a rule indicates that customers who have bought the items in $X$ tend to also buy the items in $Y$.

To assess the importance and interestingness of the association, two measures are used. The support of the rule indicates the fraction of transactions that confirm the rule, i.e. the transactions in which both $X$ and $Y$ exist. The confidence of the rule represents the ratio between the number of transactions that contain $X \cup Y$ and the number of transactions that contain $X$. For instance, assume that we are given the rule cereal $\rightarrow$ milk with a support of 40% and a confidence of 50%. The rule indicates that 40% of the transactions involve the buying of cereal and milk, and that 50% of the transactions that involve buying cereal also involve purchasing milk. The association rule discovery problem consists of discovering all association rules with support and confidence larger than some given thresholds.

More formally, let $I = \{i_1, \ldots, i_n\}$ be a set whose elements are called items and let $T = \{t_1, \ldots, t_m\}$ be a multiset whose elements are called transactions, where each $t_i$ represents a subset of $I$. The multiset $T$ can be stored in a special table with binary attributes $I$ and an additional transaction identifier attribute tid, which is called a binary database or binary table (see Table 4.1). In a binary table, the presence of an item in a transaction is indicated by a value of 1 and its absence by a value of 0.

Table 4.1: An example binary table for the item set $I = \{i_1, i_2, i_3\}$ and transactions $T = \{\{i_3\}, \{i_2, i_3\}, \{i_1, i_2\}, \{i_1, i_3\}, \{i_1, i_2, i_3\}\}$.

tid   i1   i2   i3
1     0    0    1
2     0    1    1
3     1    1    0
4     1    0    1
5     1    1    1

In the case of mining association rules, for simplicity, we will assume that databases are binary. All algorithms can be easily modified to work with non-binary databases as well.

Definition 4.1. Let $I = \{i_1, \ldots, i_n\}$ be a set whose elements are called items and let $T = \{t_1, \ldots, t_m\}$ be a multiset of transactions, where each transaction $t_i \subseteq I$. A subset of $I$ is called an itemset. An itemset of cardinality $k$ is called a $k$-itemset. The support of an itemset $I$, $supp(I)$, is defined as:

$$supp(I) = \frac{|\{t \in T \mid t \supseteq I\}|}{|T|} \qquad (4.1)$$

We define $supp(\emptyset) = 1$.

An association rule is an implication of the form $A \rightarrow B$, where $A$ and $B$ are two disjoint itemsets called the antecedent ($antc(r)$) and consequent ($cons(r)$) of rule $r$. Thus, the items of rule $r$ are $items(r) = antc(r) \cup cons(r)$.

The support value for association rule $r$ is:

$$supp(r) = supp(items(r)) \qquad (4.2)$$

The confidence value for association rule $r$ is:

$$conf(r) = \frac{supp(items(r))}{supp(antc(r))} \qquad (4.3)$$

Note that $0 \le supp(r) \le 1$ and $0 \le conf(r) \le 1$.

Based on the above definitions, the association rule mining problem is defined as follows: given user-specified minimum support and minimum confidence thresholds, denoted as minsupp and minconf, discover all association rules from $T$ having support and confidence greater than or equal to minsupp and minconf, respectively.
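To make Equations 4.1 to 4.3 concrete, the sketch below evaluates support and confidence on the transactions of Table 4.1. The function names `supp` and `conf` mirror the notation of Definition 4.1, but the code itself (including the use of exact fractions and the constant `TRANSACTIONS`) is an illustrative assumption rather than part of the original method.

```python
from fractions import Fraction
from typing import Iterable, List, Set

# Transactions of Table 4.1, one set of items per tid.
TRANSACTIONS: List[Set[str]] = [
    {"i3"}, {"i2", "i3"}, {"i1", "i2"}, {"i1", "i3"}, {"i1", "i2", "i3"},
]


def supp(itemset: Iterable[str], transactions: List[Set[str]]) -> Fraction:
    """Equation 4.1: fraction of transactions that contain the itemset."""
    items = frozenset(itemset)
    containing = sum(1 for t in transactions if items <= t)
    return Fraction(containing, len(transactions))


def conf(antecedent: Iterable[str], consequent: Iterable[str],
         transactions: List[Set[str]]) -> Fraction:
    """Equation 4.3: supp(items(r)) / supp(antc(r))."""
    a, c = frozenset(antecedent), frozenset(consequent)
    return supp(a | c, transactions) / supp(a, transactions)


support_i1_i2 = supp({"i1", "i2"}, TRANSACTIONS)         # rule i1 -> i2
confidence_i1_i2 = conf({"i1"}, {"i2"}, TRANSACTIONS)
```

Here the rule $\{i_1\} \rightarrow \{i_2\}$ has support 2/5, since transactions 3 and 5 contain both items, and confidence 2/3, since three transactions contain $i_1$.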

Definition 4.2. Let minsupp be a support threshold. An itemset having support greater than or equal to minsupp is called frequent. A frequent itemset that is maximal with respect to set inclusion is called large.

Association rule mining is completed in two steps (Agrawal and Srikant, 1994). First, all frequent itemsets are found. This is a computationally intensive task, since all frequent itemsets are listed. However, there are methods that search for only a subset of the frequent itemsets that summarizes, or allows one to infer, the information on all frequent itemsets. For instance, a subset can represent a larger itemset (Roberto J. Bayardo, 1998). Secondly, all association rules are generated from the frequent itemsets. However, the number of generated rules is usually too large to inspect by eye. To overcome this problem, one can define a subset of rules from which all others can be inferred using a set of inference rules.

A trivial algorithm for mining frequent itemsets would just compute the support of all possible itemsets. Practical algorithms attempt to minimize the number of nodes processed during the traversal of itemsets. Ideally, an algorithm should calculate support values only for the itemsets that are actually frequent. Since it is impossible to know in advance which itemsets are frequent, current algorithms minimize the number of itemsets that they examine while making sure that no frequent itemsets are missed.

Mining Association Rules: The Apriori Algorithm

After the frequent itemsets are generated, association rules are computed. A simple algorithm is the following: for each frequent itemset $I$, and for each of its non-empty subsets $I_a$, generate the rule $I_a \rightarrow I - I_a$ if its confidence is equal to or greater than minconf. This algorithm would generate all possible association rules.

Agrawal and Srikant (1994) introduced an improved version of this algorithm, called Apriori. This algorithm mines the frequent itemsets in a bottom-up, breadth-first fashion, and works iteratively. The algorithm starts initially with a collection of candidate itemsets of size k = 1. At iteration k, Apriori starts with a collection of possibly frequent k-itemsets, called candidate itemsets, and scans the data to determine which candidates are frequent. The algorithm then generates candidates for the next iteration. The procedure that generates these itemsets minimizes the number of non-frequent candidates generated from frequent k-itemsets. The candidate itemsets are stored in a hash tree to minimize the query time. Once the hash tree has been filled with a set of candidate itemsets, one can use it to retrieve those itemsets that could be included in a transaction.

Algorithm 1 THE APRIORI ALGORITHM

APRIORI(T, I)
    L_1 ← {large 1-itemsets}
    k ← 2
    while L_{k-1} ≠ ∅ do
        C_k ← Generate(L_{k-1})
        for transaction t ∈ T do
            C_t ← Subset(C_k, t)
            for candidate c ∈ C_t do
                count[c] ← count[c] + 1
            end for
        end for
        L_k ← {c ∈ C_k | count[c] ≥ I}
        k ← k + 1
    end while
    Return ∪_k L_k

In short, the algorithm generates candidate itemsets of length k from itemsets of length k − 1. Then, candidates with an infrequent sub-pattern are pruned. According to the downward closure lemma, the generated candidate set contains all frequent itemsets of length k. Next, the whole transaction database is scanned to determine the frequent itemsets among the candidates.
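The level-wise procedure of Algorithm 1 can be sketched in a few lines. The version below is a simplified illustration, not the original implementation: it counts candidates with a plain scan instead of the hash tree used for the Subset step, it takes an absolute count threshold `min_count` (an assumed parameter name), and the example data are the transactions of Table 4.1.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Iterable, List, Set


def apriori(transactions: Iterable[Set[str]],
            min_count: int) -> Dict[FrozenSet[str], int]:
    """Return every itemset contained in at least min_count transactions,
    mapped to its count (simplified, no hash tree)."""
    data: List[FrozenSet[str]] = [frozenset(t) for t in transactions]

    # L_1: frequent 1-itemsets
    counts: Dict[FrozenSet[str], int] = {}
    for t in data:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    level = {s for s, n in counts.items() if n >= min_count}
    frequent = {s: counts[s] for s in level}

    k = 2
    while level:
        # Generate: join frequent (k-1)-itemsets into k-itemsets ...
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # ... and prune candidates that have an infrequent (k-1)-subset.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level
                             for s in combinations(c, k - 1))}
        # Scan the database and count each surviving candidate.
        counts = {c: 0 for c in candidates}
        for t in data:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        level = {c for c, n in counts.items() if n >= min_count}
        frequent.update({c: counts[c] for c in level})
        k += 1
    return frequent


table_4_1 = [{"i3"}, {"i2", "i3"}, {"i1", "i2"}, {"i1", "i3"},
             {"i1", "i2", "i3"}]
frequent_itemsets = apriori(table_4_1, min_count=2)
```

With a threshold of two transactions, all three single items and all three pairs are frequent, while the full set $\{i_1, i_2, i_3\}$ occurs only once and is discarded.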

Mining Association Rules from Annotations Links

First, a set of relationships is formed by pairing the functional annotations of two consecutive proteins on a pathway. Association rules are pairs from this set that satisfy constraints on measures of significance and interest (these measures are determined by calculating the support and confidence of annotations, described below). The aim of mining association rules is to discover rules, $P \Rightarrow Q$, that stand out from all other pairs in terms of significance and interest. In this work, the support-confidence framework is used to measure significance and interest.

Support is used as a measure of the significance (importance) of an annotation. In this framework, support is defined on annotations and gives the proportion of annotation links that contain the annotation. Basically, this measure uses the count of pairs (also called a frequency constraint). An annotation with a support value higher than a set minimum support threshold is called a frequent or large annotation. Confidence is defined as the probability of seeing the item implied by a rule ($Q$, the consequent item) under the condition that the relationship also contains the first item ($P$, the antecedent item).

In short, an association rule is an expression $P \Rightarrow Q$, where $P$ and $Q$ are functional annotations. The meaning of such rules is quite intuitive: if $P$ is an annotation of a pathway protein, then a protein with annotation $Q$ is likely to appear in this pathway as well. To derive association rules, the co-occurrences of different annotations that appear with the greatest frequencies are identified. A higher support value shows that the rule, or the relationship between two annotations, has appeared more than a certain number of times. Therefore, a rule with high support would be a pair of annotations describing a common functional relationship on known pathways.

The confidence of the rule, on the other hand, is the probability that a relationship containing $P$ also contains $Q$. Higher confidence values indicate that the rule predicts the co-occurrence of the annotations more reliably. So, if a rule containing $P$ and $Q$ is known to have a high confidence value, then $Q$ is more likely to appear whenever annotation $P$ is seen.

More formally, let us assume that $C$ is a collection of such annotation pairs. Given an association rule, $R \in C$, this rule implies that whenever one of the annotations in $R$ is $P$, the other should probably be $Q$. The fraction of relationships $R$ supporting $P$ with respect to all pairs is called the support of $P$, $support(P) = |\{R \in C \mid P \subseteq R\}| / |C|$. The rule confidence is defined as the percentage of relations containing $Q$ in addition to $P$ with regard to the overall number of relations containing $P$. In other words, rule confidence is the probability $\Pr(Q \subseteq R \mid P \subseteq R)$.

Association rule mining can be challenging, since the number of rules grows exponentially with the number of items. Recently developed algorithms are able to prune this vast search space efficiently based on predefined minimum thresholds. In this study, the Apriori algorithm (Agrawal et al., 1993, Agrawal and Srikant, 1994) is used in order to acquire association rules for signaling pathways (see Section 4.2.2). In this methodology, support is first used to find frequent (significant) association rules. Then confidence is used in a second step to produce rules from the frequent itemsets that exceed a minimum confidence threshold.

By collecting the underlying functional patterns of signaling pathways, a library of templates is formed. These rules are then used to evaluate candidate pathway segments for possible occurrences of these rules. Hence, by extracting associations of the proteins in pathways, one can easily identify whether the proteins in a given sequence are interacting with each other in a similar way or not.
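The two-step filtering by support and confidence can be illustrated on a small collection of annotation pairs. The sketch below rests on assumptions that go beyond the text: pairs are treated as ordered (the first element is taken as the antecedent $P$, the second as the consequent $Q$), support is computed for the pair as in Equation 4.2, and the function name `mine_pair_rules` and the toy annotation labels are hypothetical.

```python
from collections import Counter
from fractions import Fraction
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def mine_pair_rules(links: List[Pair], min_support: Fraction,
                    min_confidence: Fraction
                    ) -> Dict[Pair, Tuple[Fraction, Fraction]]:
    """Keep rules P => Q whose support and confidence reach the thresholds.

    support(P => Q)    = (# links equal to (P, Q)) / (# links)
    confidence(P => Q) = (# links equal to (P, Q)) / (# links starting with P)
    """
    total = len(links)
    pair_count = Counter(links)
    antecedent_count = Counter(p for p, _ in links)
    rules = {}
    for (p, q), n in pair_count.items():
        support = Fraction(n, total)
        confidence = Fraction(n, antecedent_count[p])
        if support >= min_support and confidence >= min_confidence:
            rules[(p, q)] = (support, confidence)
    return rules


# Toy annotation links (labels are placeholders, not real GO terms).
toy_links = [("a1", "b1"), ("a1", "b1"), ("a1", "b2"), ("a2", "b1")]
toy_rules = mine_pair_rules(toy_links, Fraction(1, 2), Fraction(1, 2))
```

In this toy collection only the rule a1 ⇒ b1 survives: it covers half of all links and two of the three links whose antecedent is a1.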

4.2.3 Constructing a Weighted Protein-Protein Interaction Network

The protein-protein interactions identified to date are mostly acquired by high-throughput yeast two-hybrid screens or inferred from the mass spectrometry of coimmunoprecipitated protein complexes. However, analysis based on the agreement of the interaction and expression data shows that fewer than half of these interactions are biologically relevant (Deane et al., 2002). Therefore, even if two proteins are observed to have an interaction, they may not be functionally related. This creates a challenge in identifying the true positive interactions that are needed for further discoveries, such as identifying the signaling pathways.

If an interaction participates in a signaling pathway, then the genes producing the associated proteins should be co-expressed and might be co-regulated. Previously, in separate studies, Steffen et al. (2002) and Graeber and Eisenberg (2001) used this fact to discover signaling pathways from available gene expression profiles of S. cerevisiae (Steffen et al., 2002) and human cancer cells (Graeber and Eisenberg, 2001). Liu and Zhao (2004) used the same approach to order signaling pathways. In this study, gene expression profiles are integrated into our methodology to obtain more accurate information on protein-protein interactions.

Systematic methods for organizing gene expression data require a means of measuring quantitatively whether two expression profiles are similar to each other or not. First, the values that form the expression profile for a single gene are considered as a series of coordinates. This definition makes it easier to consider the data for a microarray experiment as a matrix. In this matrix, the genes define the rows and the experiments define the columns. Hence, one can easily use standard mathematical techniques to measure the similarity of these profiles. One similarity measure that can thus be used is the Pearson correlation, which is essentially a measure of how similar the directions of two expression vectors are. The Pearson correlation treats the vectors as if they are the same (unit) length, and is thus insensitive to the amplitude of changes that may be seen in the expression profiles.

A weighted interaction network is thus formed by calculating the Pearson correlation coefficient of the interacting pairs' gene expression levels. In other words, for each pair of interacting proteins $p_i$ and $p_j$, $e = (p_i, p_j)$ where $e \in E$, $corr(e)$ is the Pearson correlation of the interacting pair of proteins. Thus, for two protein expression levels with means $\bar{p}_i$ and $\bar{p}_j$, and standard deviations $S_{p_i}$ and $S_{p_j}$ respectively, the Pearson correlation coefficient for $p_i$ and $p_j$ is:

\[
\mathrm{corr}(e) = \sum_{1 \le x \le n} \frac{(p_{ix} - \bar{p}_i)(p_{jx} - \bar{p}_j)}{(n-1)\, S_{p_i} S_{p_j}} \qquad (4.4)
\]

Here $p_{ix}$ denotes the expression level associated with $p_i$ in experiment $x$, and $n$ is the number of experiments. Notice that the correlation of two interacting proteins is the Pearson correlation of the expression levels of the genes that produce these proteins. If gene $g_1$ produces $p_i$ and gene $g_2$ produces $p_j$, then the correlation of $p_i$ and $p_j$ is measured as the correlation of the expression levels of $g_1$ and $g_2$. The Pearson correlation coefficient measures the strength of a linear relationship between two expression levels of gene products, and lies between $-1$ and $+1$. By definition, a positive correlation is evidence of a general tendency that high expression levels of $g_1$ are associated with high levels of $g_2$, and low levels of $g_1$ with low levels of $g_2$. On the other hand, a negative correlation is evidence of a general tendency that high expression levels of $g_1$ are associated with low levels of $g_2$, and low levels of $g_1$ with high levels of $g_2$. The closer the correlation is to $+1$ or $-1$, the closer the two genes are to a perfect linear relationship. Hence, in this study, the absolute value of corr(e) is used to capture inhibitory activity of genes (negative correlation) as well as activation of genes (positive correlation). In most cases, as $|\mathrm{corr}(e)|$ approaches zero, there is little or no association between the pair of genes under consideration. The correlation of the expression levels of two genes is usually some evidence on whether the produced proteins are biologically related. In our study, by carrying out an experiment over the pathways, we observed that for a given pathway the average of the absolute values of the correlations of consecutive proteins on the pathway³ is higher than 0.6.
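The edge weighting of Equation 4.4 can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in this study: the function names `pearson` and `weight_network`, the dictionary layout, and the toy expression profiles are all assumptions made for the example.

```python
import math

def pearson(x, y):
    """Sample Pearson correlation of two equal-length expression profiles (Eq. 4.4)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length, at least 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sample standard deviations, using the (n - 1) denominator of Eq. 4.4.
    s_x = math.sqrt(sum((v - mean_x) ** 2 for v in x) / (n - 1))
    s_y = math.sqrt(sum((v - mean_y) ** 2 for v in y) / (n - 1))
    if s_x == 0 or s_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return cross / ((n - 1) * s_x * s_y)

def weight_network(edges, expression):
    """Assign |corr(e)| to every interacting pair e = (p_i, p_j)."""
    return {(a, b): abs(pearson(expression[a], expression[b])) for a, b in edges}

# Hypothetical toy profiles: g2 rises with g1, g3 falls as g1 rises.
expression = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.0],
    "g3": [4.0, 3.0, 2.0, 1.0],
}
weights = weight_network([("g1", "g2"), ("g1", "g3")], expression)
```

Taking the absolute value means that the activating pair (g1, g2) and the inhibitory pair (g1, g3) both receive a weight close to 1, which is the behavior described above.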

4.2.4 Searching for Pathway Segments

PathFinder is used to search for signaling pathway segments. For this task, the acquired association rules underlying the characteristics of signaling pathways are used. As per our hypothesis, if a candidate pathway segment has characteristics similar to some known pathways, it is highly probable that this segment belongs to some known or unknown pathway as well. In other words, given association rules that capture the characteristics of some known pathways, and a weighted protein interaction network, a pathway segment should belong to a pathway if it contains at least a given number of these rules and the average weight of its interactions is above a given threshold.

Given the initial and end states of a pathway segment, we checked every possible path with length between $l_{lower}$ and $l_{upper}$ forming a simple path⁴. Although an exhaustive search seems impractical, the algorithm's run time is acceptable. While randomized approaches might skip some of the paths, our algorithm outputs all of them. Here, a modified version of depth-first search (DFS), All Simple Paths Depth-First Search, is used (see Algorithm 2).

Depth-first search (DFS) is an uninformed search that progresses by expanding

³ This observation is based on currently available data. The yeast interaction network is downloaded from the MIPS Comprehensive Yeast Genome Database (Mewes et al., 1999), yeast microarray data is downloaded from the Stanford Microarray Database (Ball et al., 2005), and pathways are collected from KEGG (Kanehisa and Goto, 2000).

⁴ In our experiments we observed that $3 \le l \le 4$ returns the most positive results. This is due to the nature of the data that is used. Most known (connected) pathway segments in the yeast protein-protein interaction network are 3 to 4 proteins long.

the first child node of the search tree that appears, thus going deeper and deeper until a goal node is found or until it hits a node that has no children. Then the search backtracks, returning to the most recent node it had not finished exploring. This process continues until we have discovered all the vertices reachable from the original source vertex (Cormen et al., 2001).

PathFinder utilizes DFS since, with a couple of small alterations, the algorithm can return all possible simple paths of length l between two nodes. First, since the length of the path is constrained, the algorithm stops going deeper after hitting the lth node and returns to the most recent node it had not finished exploring. Second, if the end node is reached in fewer than l steps, the algorithm again stops searching deeper in the graph. DFS marks the nodes as they are visited, so no paths are repeated in this discovery process. The search then removes these marks just before returning from the recursive call. The time complexity of this modified algorithm is larger than the original DFS algorithm's run time, $\Theta(V + E)$, since the marked nodes might be visited again. Here, the recursive procedure is invoked for each vertex $v \in V$ that is fewer than l hops away from the starting node, in most cases more than once. Hence, the for loop is executed $(|Adj[v]|)^l$ times in the average case. In the worst case, all simple paths between two nodes can be found in $O(V^l)$ time.

Candidate pathway segments are then filtered through the association rules as follows. First, the functional annotations of each pair of interacting proteins are acquired. The annotation pairs formed from these two sets are then checked for a match with a tuple from the association rule set. For each path, a counter is kept, and the paths with the most rules are picked⁵. The resulting simple path candidates are generated from the weighted interaction network. For each path, an average expression correlation coefficient is calculated. These paths are then filtered based on their average value of coefficients

⁵ In our experiments, we picked the top 10% of the paths.

Algorithm 2 ALL SIMPLE PATHS DEPTH-FIRST SEARCH

    ALL-SIMPLE-PATHS-DFS(G, start, end)
      if STACK-SIZE(S) ≥ l then
        Return
      else
        PUSH(S, start)
        if start = end then
          PRINT-STACK(S)
        else
          for each v ∈ Adj[start] and v ∉ S do
            ALL-SIMPLE-PATHS-DFS(G, v, end)
          end for
        end if
        POP(S, start)
      end if

with respect to the threshold. The threshold for the average expression correlation is determined by calculating the average value for known pathways. For the data in our experiments, the average threshold was 0.6. The candidate paths with average coefficients higher than the threshold are returned as the query results. An empty set is returned when every path is eliminated through these steps.
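As a rough illustration of Algorithm 2, the bounded search can be written as a short recursive Python function. This is a sketch under stated assumptions rather than the PathFinder implementation itself: the name `all_simple_paths`, the adjacency dictionary, and the toy network are hypothetical, and path length is counted in proteins (the stack size), as in the pseudocode.

```python
def all_simple_paths(adj, start, end, l_lower, l_upper):
    """Return every simple path from start to end that contains between
    l_lower and l_upper proteins (inclusive)."""
    paths = []
    stack = []  # current path; it also serves as the set of marked nodes

    def visit(node):
        if len(stack) >= l_upper:      # length bound reached, do not go deeper
            return
        stack.append(node)             # PUSH: mark the node
        if node == end:
            if len(stack) >= l_lower:
                paths.append(list(stack))
        else:
            for neighbor in adj.get(node, ()):
                if neighbor not in stack:   # keep the path simple
                    visit(neighbor)
        stack.pop()                    # POP: remove the mark on return

    visit(start)
    return paths

# Hypothetical undirected toy network with edges A-B, A-C, B-C, B-D, C-D.
adj = {
    "A": ["B", "C"],
    "B": ["A", "C", "D"],
    "C": ["A", "B", "D"],
    "D": ["B", "C"],
}
paths = all_simple_paths(adj, "A", "D", 3, 4)
```

Because marks are removed when the recursion returns, a node such as C can appear on several different paths, which is why the run time exceeds that of a plain DFS.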

4.3 Experiments on the Yeast Proteome Network

In this study, we carried out experiments on currently publicly available S. cerevisiae (budding yeast) data to validate our model. The protein-protein interaction data is downloaded from the MIPS Comprehensive Yeast Genome Database (CYGD) (Mewes et al., 1999). The latest interaction network, dated December 2005, contains 15457 interactions among 3557 proteins. The gene expression calculations are done using the yeast microarray data downloaded from the Stanford Microarray Database (SMD) (Ball et al., 2005), and signaling pathways are collected from various databases (Campagne et al., 2004, Gough et al., 2004, Kanehisa and Goto, 2000, Mewes et al., 1999).

Our methodology is implemented and tested for finding MAP kinase signaling pathway segments (Figure 4-1) from the weighted yeast protein-protein interaction network generated from the MIPS and SMD databases. First, we picked two different random proteins from a known signaling pathway. We identified the pathway and removed it from our training data.

In our experiments, we created a large set of rules generated from known signaling pathways. We collected the protein pairs of the pathways and matched them with their annotations as described in Section 4.2.1. These annotation pairs are mined with the Association Rule Discovery methodology described in Section 4.2.2. The pathway interactions that we collected generated more than $10^5$ associations among functional annotations. Given that our collection contains fewer than $10^3$ pathway interactions, we experimented with the support and confidence measures to generate a network of annotations that is small yet covers almost every pathway interaction link at hand. Given these requirements, we considered values between $10^{-3}$ and $10^{-7}$ for support and between 0.1 and $10^{-5}$ for confidence. We then checked the size of the association rule set generated by these parameters. We observed that with about 2500 rules the association rules cover our pathways and give highly filtered annotation associations. Consistent with this, the data used in this study generated the best results with the 2582 association rules obtained when the support value is 0.0001 and the confidence is 0.001 (see Figure 4-4).
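To make the role of the two thresholds concrete, the following Python sketch filters annotation-pair rules by support and confidence. It assumes the standard definitions (support is the fraction of all observed annotation pairs that equal the rule; confidence is the fraction of pairs with the same left-hand annotation that equal the rule), which may differ in detail from Section 4.2.2. The function name `mine_rules` and the annotation labels are hypothetical, and this direct counting is a simplification of the Apriori-based procedure used in this study.

```python
from collections import Counter

def mine_rules(annotation_pairs, min_support, min_confidence):
    """Keep the rules (a -> b) whose support and confidence reach the thresholds."""
    total = len(annotation_pairs)
    pair_counts = Counter(annotation_pairs)
    lhs_counts = Counter(a for a, _ in annotation_pairs)
    rules = {}
    for (a, b), count in pair_counts.items():
        support = count / total             # share of all observed pairs
        confidence = count / lhs_counts[a]  # share of pairs starting with a
        if support >= min_support and confidence >= min_confidence:
            rules[(a, b)] = (support, confidence)
    return rules

# Hypothetical annotation pairs collected from consecutive pathway proteins.
pairs = (
    [("kinase", "kinase")] * 3
    + [("kinase", "transcription factor")]
    + [("receptor", "G-protein")]
)
loose = mine_rules(pairs, min_support=0.1, min_confidence=0.1)
strict = mine_rules(pairs, min_support=0.3, min_confidence=0.5)
```

Lowering either threshold enlarges the rule set, which is the trade-off explored when choosing the support and confidence values.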
After forming our association rules from other known pathways, we searched for the known pathway on the weighted interaction network. For each pathway

[Figure 4-4 plot: number of rules (logarithmic axis, 1 to 100000) versus support (1e-05 to 0.01), with one curve for each confidence value: 0.00001, 0.0001, 0.001, 0.01, 0.02, 0.04, 0.06, 0.08, 0.1.]

Figure 4-4: Association Rule Mining Parameters. For each fixed confidence value between 0.1 and 0.00001, the support values and corresponding rule set sizes are plotted. The red horizontal line represents the 2500 cutoff point.

segment we are looking for, we queried the interaction network for paths with lengths within ±1 of the original length acquired from the pathway. After conducting 15 random experiments, we had 70% recall and 34.6% precision in recovering the pathway segments.

Recall is the percentage of the recovered edges with respect to the original pathway segment that we are looking for. So, if all edges are found in the resulting segment, we say that the recall for that experiment is 100%. These numbers are mostly affected by the incompleteness of the interaction network, i.e. an interaction existing on a pathway may not occur in the interaction network. Therefore some of the pathway segments were impossible to rediscover completely. However, all of the experiments we carried out returned pathway segments as a result (see Table 4.2).

Precision is the ratio of the number of edges that are truly recovered to the total number of edges in the resulting set. This ratio increases when the discovered pathway segments have fewer supplementary links. However, considering the current condition of the interaction network, with its high false positive and false negative rates, the rate we acquired is expected.

Below we discuss one of the experimental queries we conducted over the S. cerevisiae datasets.

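Path 1: Ste7 -(0.749)- Ste5 -(0.697)- Kss1 -(0.615)- Dig1 -(0.567)- Dig2; average r = 0.658
Path 2: Ste7 -(0.749)- Ste5 -(0.719)- Ste11 -(0.685)- Kss1 -(0.648)- Dig2; average r = 0.700
Path 3: Ste7 -(0.636)- Ste11 -(0.567)- Fus3 -(0.629)- Dig1 -(0.567)- Dig2; average r = 0.600
Path 4: Ste7 -(0.636)- Ste11 -(0.685)- Kss1 -(0.615)- Dig1 -(0.567)- Dig2; average r = 0.626
Path 5: Ste7 -(0.636)- Ste11 -(0.719)- Ste5 -(0.697)- Kss1 -(0.648)- Dig2; average r = 0.675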

Figure 4-5: Ste7-Dig2 Simple Path Results. Simple paths with a high number of association rules (top 10%) connecting Ste7 and Dig2 are shown above. The weight of each edge, the Pearson correlation coefficient of the adjacent proteins' gene expression profiles, is shown next to the edge. The average Pearson correlation coefficient is given for each simple path. The averages of all five paths are greater than or equal to the experimental threshold, 0.6.

Let Ste7 and Dig2, two proteins of the S. cerevisiae proteome, be the protein pair given for a pathway segment discovery query. Given the PathFinder system trained with Gene Ontology annotations (Ashburner et al., 2000) and a weighted protein-protein interaction network established using MIPS (Mewes et al., 1999) and SMD (Ball et al., 2005), simple paths of length $3 \le l \le 5$ starting from Ste7 and ending at Dig2 are searched. The currently available datasets return 38 possible paths for the Ste7-Dig2 pair. Next, these paths are sorted by the number of association rules associated with the links on each path. Since we have 38 simple paths, the top five scoring paths are picked for further examination (five instead of four paths are picked in this case because of a tie in scoring) (see Figure 4-5). In the next step, these paths are filtered by their average weight. All five of these paths had average Pearson correlation coefficients at or above our experimentally determined threshold value of 0.6. Note that in some cases no simple path passed this filtering process. In such cases, we relaxed the threshold to examine PathFinder results, and got positive results with a threshold of 0.5, which is the lowest value we have considered (Table 4.2).

The resulting paths are then merged into a candidate subnetwork representing a possible pathway segment between Ste7 and Dig2. Next, the PathFinder result shown in Figure 4-6 is compared to the known Ste7-Dig2 pathway segment in Figure 4-1. Here, recall for this query is 50%, since the Ste7-Fus3, Fus3-Dig2, Ste7-Kss1, and Ste5-Fus3 edges are not recovered, while the Fus3-Dig1, Kss1-Dig1, Kss1-Dig2 and Ste5-Fus3 edges are recovered by PathFinder. The precision, on the other hand, is 70%, since the resulting network has ten edges, seven of which exist in the MAP Kinase Signaling Pathways (Kanehisa and Goto, 2000). In short, for the Ste7-Dig2 query, PathFinder successfully recovered relationships among protein pairs.
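The filtering and merging steps of the Ste7-Dig2 query can be retraced with a small Python sketch. The five paths and the edge weights below are transcribed from Figure 4-5; the variable and function names are hypothetical, and rounding the averages to three decimals (the precision reported in the figure) is an assumption made so that a path averaging 0.600 passes the 0.6 threshold.

```python
# Edge weights (absolute Pearson correlations) transcribed from Figure 4-5.
WEIGHTS = {
    frozenset(("Ste7", "Ste5")): 0.749,
    frozenset(("Ste7", "Ste11")): 0.636,
    frozenset(("Ste5", "Kss1")): 0.697,
    frozenset(("Ste5", "Ste11")): 0.719,
    frozenset(("Ste11", "Fus3")): 0.567,
    frozenset(("Ste11", "Kss1")): 0.685,
    frozenset(("Kss1", "Dig1")): 0.615,
    frozenset(("Kss1", "Dig2")): 0.648,
    frozenset(("Fus3", "Dig1")): 0.629,
    frozenset(("Dig1", "Dig2")): 0.567,
}

# The five top-scoring simple paths between Ste7 and Dig2 (Figure 4-5).
PATHS = [
    ["Ste7", "Ste5", "Kss1", "Dig1", "Dig2"],
    ["Ste7", "Ste5", "Ste11", "Kss1", "Dig2"],
    ["Ste7", "Ste11", "Fus3", "Dig1", "Dig2"],
    ["Ste7", "Ste11", "Kss1", "Dig1", "Dig2"],
    ["Ste7", "Ste11", "Ste5", "Kss1", "Dig2"],
]
THRESHOLD = 0.6

def path_edges(path):
    """Undirected edges between consecutive proteins on a path."""
    return [frozenset(pair) for pair in zip(path, path[1:])]

def average_weight(path):
    """Mean edge weight of a path, rounded to three decimals."""
    edges = path_edges(path)
    return round(sum(WEIGHTS[e] for e in edges) / len(edges), 3)

# Keep the paths at or above the threshold, then merge them into one subnetwork.
kept = [p for p in PATHS if average_weight(p) >= THRESHOLD]
merged = set().union(*(path_edges(p) for p in kept))
```

With these inputs all five paths are kept, and merging them yields ten distinct edges, matching the size of the resulting network reported for this query.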
In Table 4.2, the results of the randomly picked protein pair queries are shown. Each query is run for Gene1 and Gene2 for $3 \le l \le 4$. Pathways that include the

[Figure 4-6 diagram: merged network over the proteins Ste7, Ste5, Ste11, Kss1, Fus3, Dig1 and Dig2.]

Figure 4-6: PathFinder Ste7-Dig2 Signaling Pathway Segment Results. The simple paths between Ste7 and Dig2 of S. cerevisiae are merged to form a signaling network as shown above. The bold edges highlight the edges that exist in the current MAP Kinase Signaling Pathways (Kanehisa and Goto, 2000).

query proteins are removed from the training set before each search. The total number of paths found is shown in the third column (Paths). The top 10% of these paths are kept. An average Pearson correlation coefficient is calculated for each path, and the paths whose averages are lower than the Pearson threshold are eliminated. In case no paths made it to the results, the threshold was decreased (column 5). The resulting set of paths (#Paths) is then checked for accuracy (see Figure 4-1). Recall is the percentage of the recovered edges with respect to the original pathway segment that we are looking for. So, if all edges are found in the resulting segment, we say that the recall for that experiment is 100%. Precision is the ratio of the number of edges that are truly recovered to the total number of edges in the resulting set (refer to Bebek and Yang (2006) for details).

Pheromone response, filamentous growth and cell wall integrity pathways were

Table 4.2: PathFinder Search Results. Randomly picked protein pairs from S. cerevisiae MAP Kinase Signaling Pathways are queried for possible pathway segments. Pathways that include the query proteins are removed from the training set before each search. In this table, detailed information on each search result is shown. In the last row, average recall and precision values are calculated (Bebek and Yang, 2006).

 #   Gene1   Gene2   Paths   Top10%   Pearson   #Paths   Recall   Precision
 1   Mkk2    Pkc1      275       28       0.6        5    1        0.25
 2   Tec1    Dig2        6        1       0.6        1    1        1
 3   Kss1    Ste11      53        5       0.6        5    1        0.1818
 4   Sln1    Ssk2        8        2       0.5        2    0        0
 5   Pbs2    Ste20      17        2       0.6        1    0.5      0.25
 6   Bem1    Ste18      58        6       0.6        4    0        0
 7   Ste12   Tec1        1        1       0.6        1    1        1
 8   Bck1    Rho1     1002      100       0.5       19    0.5      0.0238
 9   Mkk1    Pkc1      271       27       0.6        8    1        0.1818
10   Ste11   Pbs2        1        1       0.6        1    1        1
11   Bni1    Cdc24    1495      150       0.5      107    0.5      0.0062
12   Ste7    Kss1       48        5       0.6        5    1        0.125
13   Ste2    Ste4        9        5       0.6        5    1        0.2222
14   Ste7    Dig2       38        5       0.6        5    0.5      0.7
15   Pbs2    Ste20      17        2       0.6        1    0.5      0.25

     Average                                               0.700    0.346
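The recall and precision columns of Table 4.2 can be illustrated with a short Python sketch. The function below uses a single reference edge set for both measures, which is a simplification (in the Ste7-Dig2 example, recall is taken against the queried segment and precision against the MAP kinase pathways). The names `undirected` and `recall_and_precision` and the toy edge lists are hypothetical; the two value lists are copied from the table.

```python
def undirected(edges):
    """Treat each interaction as an unordered protein pair."""
    return {frozenset(edge) for edge in edges}

def recall_and_precision(reference_edges, predicted_edges):
    """Recall: recovered edges / reference edges.
    Precision: recovered edges / predicted edges."""
    reference = undirected(reference_edges)
    predicted = undirected(predicted_edges)
    recovered = reference & predicted
    recall = len(recovered) / len(reference) if reference else 0.0
    precision = len(recovered) / len(predicted) if predicted else 0.0
    return recall, precision

# Hypothetical toy segment (4 edges) and prediction (3 edges, 2 of them correct).
reference = [("A", "B"), ("B", "C"), ("C", "D"), ("A", "C")]
predicted = [("B", "A"), ("C", "B"), ("C", "E")]
recall, precision = recall_and_precision(reference, predicted)

# Recall and precision columns of Table 4.2, rows 1 to 15.
TABLE_RECALL = [1, 1, 1, 0, 0.5, 0, 1, 0.5, 1, 1, 0.5, 1, 1, 0.5, 0.5]
TABLE_PRECISION = [0.25, 1, 0.1818, 0, 0.25, 0, 1, 0.0238,
                   0.1818, 1, 0.0062, 0.125, 0.2222, 0.7, 0.25]
mean_recall = sum(TABLE_RECALL) / len(TABLE_RECALL)
mean_precision = sum(TABLE_PRECISION) / len(TABLE_PRECISION)
```

Averaging the two columns this way gives values consistent with the 0.700 and 0.346 reported in the last row of the table.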

[Figure 4-7 diagram: Saccharomyces cerevisiae MAPK pheromone response pathway in four panels: A. KEGG, B. PathFinder, C. Color-Coding, D. NetSearch.]

Figure 4-7: The Pheromone Response Signaling Pathway. (A) The main chain of the pheromone pathway downloaded from the KEGG Database (Kanehisa and Goto, 2000). (B) The output of our PathFinder implementation, where each query is marked with a surrounding box. (C) The color-coding algorithm output for the pheromone pathway by Scott et al. (2006). (D) The NetSearch program prediction for the pheromone pathway by Steffen et al. (2002).

The interactions that do not exist in the interaction network are shown with dashed edges in (A). Notice that in (A) and (B) the boxes show the corresponding segments of the pathways considered as a query. Also notice that in (B), (C) and (D) the proteins that are not on the main chain of the pathway, as shown in (A), are not colored, whereas the proteins on the main chain are colored grey.

also analyzed previously (Scott et al., 2006, Steffen et al., 2002). PathFinder is also used to recover these specific pathways.

The pheromone response pathway (Figure 4-1) triggers the yeast cell for mating by inducing polarized cell growth toward a mating partner. This pathway consists of ten cascading proteins, with additional proteins assisting or binding on the sides (Kanehisa and Goto, 2000). First, the pathway segments in the query are excluded from the training data set. Given the yeast protein-protein interaction network and the starting and ending protein pair of the pheromone response pathway, our model generated the pathway segments shown in Figure 4-7B. However, note that the currently available protein-protein interaction network is known to be incomplete. When the yeast network interactions (Mewes et al., 1999) are compared with the pheromone response pathway interactions (Kanehisa and Goto, 2000), the interactions between Ste4/Ste18 and Cdc42, and between Ste20 and Ste11, do not exist in the yeast protein-protein interaction network.

Given the yeast protein-protein interaction network and the starting and end protein pairs of two segments of the pheromone response signaling pathway, PathFinder generated the pathway segments shown in Figure 4-7B. The pheromone response pathway is divided into three segments for experimental purposes. The separation points were picked according to the topological properties of the pathway, i.e. the missing links are considered as points of separation. Hence, the protein chain of length ten is divided into three segments. We tested our methodology by running queries for the pathway segments that have more than two proteins. Given the starting and ending proteins for the two segments under consideration, (Ste2, Ste4/Ste18) and (Ste11, Ste12), PathFinder returned the merged network in Figure 4-7B.
The first query returned the proteins Ste2, Gpa1, Ste4 for paths of length three and Ste2, Gpa1, Ste11 and Ste18 for paths of length four. The second query returned Ste11, Ste7, Fus3 or Kss1, Dig1/Dig2, Ste12 for segments of length five. No other significant result is found for other simple path lengths. So, given that there are missing links, our method actually recovered all of the links available to us on this pathway. Moreover, the resulting pathway segments of PathFinder consist only of proteins from the original pathway and Kss1 (a MAP kinase). Previous methodologies failed to accomplish this (see Figure 4-7).

Kss1, a MAP kinase, does not have a part in the pheromone signaling pathway. However, as discovered with PathFinder in this experiment and earlier by Madhani et al. (1997), the Kss1 MAP kinase, by switching places with Fus3, a MAP kinase regulating mating, regulates filamentation and invasion. Remarkably, with similarities in kinase-dependent activation functions, Kss1 and Fus3 each have a distinct kinase-independent inhibitory function. This example shows how PathFinder, a model built on characteristics of pathways, is capable of filtering proteins according to their functions.

The high osmolarity glycerol (HOG) MAP kinase pathway (Figure 4-8A) is activated by increased environmental osmolarity and results in a rise of the cellular glycerol concentration to adapt the intracellular osmotic pressure. The HOG MAP kinase pathway has missing interactions in the yeast interaction network that separate the pathway into sub-paths. A previous study (Steffen et al., 2002) noted that this prevents attempts to discover this specific pathway. Although this poses a great challenge to our model as well, we have successfully recovered a given segment of the HOG MAP kinase pathway using our PathFinder framework.

In this second example, a segment of the HOG MAP kinase pathway is searched via our model. Sln1 and Ssk2 are picked as the starting and ending protein pair, since the Ssk2-Pbs2 link does not exist in the interaction network (see the box in Figure 4-8A). After searching for the pathway with our methodology, we acquired the pathway segment shown in Figure 4-8B.
When the results are compared to the original pathway, all known interactions

[Figure 4-8 diagram in two panels. A (KEGG): SLN1, YPD1, SSK1, SSK2, PBS2, HOG1, MSN2/MSN4. B (PathFinder): SLN1, PTP2, YPD1, SSK1, SSK2.]

Figure 4-8: The High Osmolarity Signaling Pathway. (A) The main chain of the high osmolarity pathway downloaded from the KEGG Database (Kanehisa and Goto, 2000). (B) PathFinder output for the query marked with a box (Sln1-Ssk2 segment). In (A), dashed edges indicate interactions that do not exist in the database. In (B), the proteins that are not on the main chain of the pathway are not colored, whereas the proteins on the main chain are colored grey.

and all of the interacting proteins on this pathway are recovered. An additional protein (white node in Figure 4-8B) that interacts with this pathway is also found by our method. This additional protein, Ptp2, which does not exist in the KEGG database for this pathway, is actually closely related to it. Ptp2 is involved in the inactivation of MAP kinase during osmolarity sensing (Mewes et al., 1999, Warmka et al., 2001). At first, this additional protein on the resulting pathway looks like a false positive. However, the literature shows that Ptp2 is related to this pathway segment. Hence, our methodology might also be used to extend known signaling pathways.

4.4 Discussion

In this chapter, we have worked towards creating a tool to discover possible signaling pathways from biological networks. We have accomplished this goal and successfully recovered known signaling pathway segments.

The hypothesis was as follows: assuming that we are given association rules that capture the underlying relationships of the proteins on pathways, and a weighted protein-protein interaction network acquired by integrating gene expression profiles with the proteome network of an organism, a pathway segment should belong to a pathway if it contains at least a given number of these association rules and has an average weight of interactions above a given threshold.

We verified our hypothesis by developing the PathFinder search tool. We showed that, using biological annotations to capture the characteristics of known signaling pathways, one can successfully return possible signaling pathway segments. These segments are retrieved and validated against the acquired association rules. By carrying out experiments on publicly available S. cerevisiae datasets, we showed our methodology's advantages.

In our experiments, we also observed that by building the model on characteristics of pathways, we were able to filter proteins by their functions. This is significant, because signaling pathways are defined as associations of proteins working towards a common goal, and our system has successfully captured this. Another observation we made was that in certain instances PathFinder also returned additional proteins when compared to actual signaling pathways. Although these might be seen as false positives in terms of signaling pathways, we have observed that these proteins are actually related to these pathways. PathFinder was extending the pathways by returning associates of the original pathway proteins. This fact is one of the reasons for the low precision values we acquired through random experiments. On the other hand, the average recall value is much higher, which reflects the success of this methodology. However, there is still room for improvement, because false positives in the protein-protein interaction data might have affected this value.

In the experiments carried out on the yeast MAP kinase pathways, our failure to query the whole pheromone response or HOG pathway in one shot underlines the fact that missing interactions (false negatives) are a more significant obstacle than false-positive interactions. This was also mentioned in previous studies (Steffen et al., 2002). Missing interactions cannot be created by our search methodology, but we have eliminated false-positive interactions through the integration of additional gene expression profiles.

In this methodology we have integrated four different data sources, namely protein-protein interaction networks, gene expression profiles, functional annotations, and signaling pathway chains. The system developed was trained and searched over these datasets. Hence, by connecting these pieces of biological information with each other, we have formed a complex but meaningful system that can answer queries of interest. At the molecular level of the cell, it is hard to predict certain behaviors of proteins or genes, such as their interaction partners. However, by associating known facts about these proteins, we were able to infer further knowledge about these entities.

Moreover, there are three different choices that we have made in this tool. The first one is choosing the functional annotations. Currently, Gene Ontology annotations (Ashburner et al., 2000) are quite comprehensive for studied proteins.
However, this functional information can be switched with another annotation set for comparison. Secondly, we have chosen to use the Apriori algorithm, which is quite efficient in determining association rules from a larger set of associations. This algorithm's performance, including its running time, is considered quite satisfactory. The last choice made was the path discovery algorithm. We chose to exhaustively search for every possible path between two nodes. This is not practical for many computational applications. However, given the quality and the size of our data, this was not a major drawback. PathFinder returned search results within acceptable time limits, and performed better than other probabilistic methodologies (Scott et al., 2006).

In conclusion, in this chapter we have designed a tool. Today, in most biological studies, strong conclusions are drawn only after the pathways of the corresponding phenomena are identified. In many proteomics studies, experimental results identifying up-regulation or down-regulation of a protein are no longer considered an acceptable conclusion. We are asked to associate these proteins with the corresponding pathways. Hence, our tool will be beneficial in a large spectrum of studies.

Chapter 5

Conclusions and Reflections

Thesis Summary

This dissertation developed methodologies and models to tackle two problems. First, we showed that the degree distribution of the pure duplication model (r = 0) cannot follow a power-law as stated in Chung et al. (2003). We then showed that the degree distribution of the proteome growth model cannot follow a power-law with exponential cut-off as stated in Pastor-Satorras et al. (2003). We have developed the ℓ-hop degree distribution, a new measure to better capture topological properties of networks, in which local properties of the networks are more visible. We observed that the proteome growth model does not capture the more assured ℓ-hop degree distribution of the yeast proteome network for ℓ > 1.

We have introduced a new model by considering the sequence similarity between protein pairs as a binary relationship, in addition to their interactions. This new enhanced duplication model is shown to accurately capture the ℓ-hop degree distribution of the yeast interaction network for all ℓ > 0. Furthermore, the model yields a better approximation to the degree distribution of the yeast similarity network than the previous models. We have also observed that proteins with similar sequences share similar interaction patterns in the same organism. This is most likely due to the duplication of genes or a portion of the genome during evolutionary processes.

Secondly, using our previous observations and additional investigation of signaling pathways and protein-protein interaction networks, we proposed the PathFinder system to find possible signaling pathway segments. Assume that we are given association rules that capture the pathways and a weighted protein-protein interaction network. The model is built on the hypothesis that a pathway segment belongs to a pathway if it contains at least a given number of rules and has an average weight of interactions above a given threshold. First, proteins were represented using biological annotations. We then employed association rules to capture the characteristics of known signaling pathways. Given a query, the paths in the protein interaction network are retrieved and validated against the acquired association rules. We have verified our hypothesis and demonstrated the advantages of the model by carrying out experiments on a publicly available S. cerevisiae data set.

Final Discussions

The discovery of the protein-protein interaction network topology brought a wider perspective for understanding molecular biology (Jeong et al., 2001, Wagner, 2001). In particular, the discovery that this topology is shared with many other networks and cellular networks, such as metabolic pathways, drew more attention to protein-protein interaction networks (Aiello and Chung, 2001, Aiello et al., 2000, Berger et al., 2003, Bollobás et al., 2003, Bollobás et al., 2001, Cooper and Frieze, 2003, Kleinberg et al., 1999).

The study of proteome networks and their growth is quite substantial. By laying out the proteins by their relationships, one can deduce further functional relationships on this network. To improve our understanding of cellular functions, and of how they were generated, network generation models are appropriate tools. In other words, modeling network growth is a valuable tool for capturing the structure of large networks.

The main challenge in developing such models is addressing the structure of these large networks. With the current state of protein-protein interaction networks, this task is even harder than before. The current protein-protein interaction network was shown to have a substantial number of false positives and negatives. Moreover, as these networks are large, when they are plotted on a piece of paper it is almost impossible to understand any of their properties. Therefore many global and local features were used to compare these networks. These measures vary in representation and significance. While the diameter or the average clustering coefficient of a network is a single value, the degree distribution is a two-dimensional plot of degrees. By definition, these measures can only capture a limited amount of data. However, using multiple measures at the same time, or finding new measures that can discover new features of networks, would increase our understanding of networks.

In this thesis we first gave an introduction to modeling biological networks. We studied the previous analyses of the degree distribution of the proteome growth model by Chung et al. (2003) and Pastor-Satorras et al. (2003). In both studies, only a limited number of features were tested to capture network similarity. Any mistake in modeling or analysis of the networks could have been prevented by further analysis of these networks. Observing the deficiency of these measurements, we have introduced a more general measure called the ℓ-hop degree distribution.

Here, we have also disproved the proposed analysis of the growth models (Chung et al., 2003, Pastor-Satorras et al., 2003). In these studies, simplifying assumptions led to results that contradict the model. During this analysis we came to better understand the underlying forces in generating such networks.
Originally, the proposed models were based on the two underlying mechanisms of genome evolution: gene duplication and point mutations followed by natural selection (Ohno, 1970), i.e. after a gene duplication event, one of the genes may accumulate deleterious mutations and be lost, or both copies of the gene may be retained. However, we observed that the differentiation step of the proteome growth model was not sufficient to reflect the original theory. To better capture the growth process we incorporated sequence information of the proteome. We succeeded in improving the proteome growth model by exploiting the dynamics of the network and its growth process.

In developing the sequence similarity enhanced model, we learned that, when staying within the limits of known measures, it is more difficult to see the improvements we made. Although the proteome growth model generated networks that were highly similar with respect to known topological properties (e.g. the degree distribution, the average clustering coefficient, the diameter, etc.), it failed when compared using more extensive measures (e.g. the ℓ-hop degree distribution).

The proteome growth model was the starting point of this study and was helpful in understanding the phenomena and developing the enhanced model. However, the earlier analyses of these duplication-based models were not as successful. In both of the above-mentioned studies, simplifying assumptions were made to analyze the proteome growth model, which led to results that contradict the model. Although it is very difficult to analyze these models, we were able to prove that the proteome growth model, with a small alteration, generates a power-law degree distribution (Bebek et al., 2006a). On the other hand, we were not able to analyze the enhanced model due to its large number of variables. The enhanced model employs five parameters, which represent different aspects of the evolutionary forces that have shaped the proteome. Although our efforts to analyze this model were unsuccessful, we showed with simulations that the enhanced model generates networks that are more similar to the observed network than those of the proteome growth model.
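To make the ℓ-hop degree distribution discussed above concrete, the following is a minimal Python sketch. It assumes that the ℓ-hop degree of a node is the number of distinct other nodes reachable within at most ℓ hops, so that ℓ = 1 gives the ordinary degree; the formal definition given in the earlier chapters takes precedence if it differs. The function names and the toy graph are illustrative only and are not taken from the original study.

```python
from collections import Counter, deque


def l_hop_degree(adj, source, l):
    """Count distinct nodes (excluding source) within at most l hops of source.

    Assumed definition for illustration; adj maps each node to its neighbours.
    """
    seen = {source}
    frontier = deque([(source, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == l:
            continue  # do not expand beyond l hops
        for neighbour in adj[node]:
            if neighbour not in seen:
                seen.add(neighbour)
                frontier.append((neighbour, dist + 1))
    return len(seen) - 1


def l_hop_degree_distribution(adj, l):
    """Histogram mapping each l-hop degree to the number of nodes having it."""
    return Counter(l_hop_degree(adj, node, l) for node in adj)


# Hypothetical toy network: a simple path a - b - c - d.
toy = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}

if __name__ == "__main__":
    for hops in (1, 2, 3):
        print(hops, dict(l_hop_degree_distribution(toy, hops)))
```

Comparing such histograms for several values of ℓ between a simulated network and an observed one tests more of the local structure than the plain degree distribution does, which is the motivation stated above.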
The second problem we approached in this thesis was identifying signaling pathway segments in protein-protein interaction networks. We worked towards creating a tool that discovers possible pathways by integrating various data sources and querying them intelligently. We accomplished this goal and successfully recovered known signaling pathway segments with our tool.

In this work, we relied heavily on the given datasets. We used functional annotations to capture the characteristics of known signaling pathways. To return possible signaling pathway segments, we associated these annotations with proteins and derived association rules. However, in the proteomes of model organisms, only well-studied proteins have an extensive set of annotations. This is a drawback of our methodology. Nevertheless, by averaging these rules over whole candidate paths, we recovered better signaling pathways in the end.

PathFinder is based on four different data sources, i.e. protein-protein interaction networks, gene expression profiles, functional annotations, and signaling pathway chains. We integrated these datasets to form a complex but meaningful system that can answer queries of interest by associating the known facts about these proteins. Through experiments on publicly available S. cerevisiae data, we showed our methodology’s success and advantages.

Among these four datasets, the most controversial is the set of protein-protein interaction networks generated through high-throughput experiments. In our experiments on the yeast MAP kinase pathways, our failure to query the whole Pheromone Response or HOG pathway in one shot underlines this fact. Missing interactions cannot be created by our search methodology, but we have eliminated false-positive interactions through the integration of additional data sources: such interactions are ignored as a result of the bias imposed by ranking paths according to the similarity of expression profiles. In this study, we used additional data sources to minimize the defects that can be caused by erroneous datasets.
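The idea of weighting interactions by expression similarity, summarized above, can be sketched as follows. This is only an illustration under stated assumptions: it uses the Pearson correlation of two expression profiles as the similarity score and silently drops interactions for which a profile is missing. The weighting scheme actually used by PathFinder is the one defined in the Methods chapter and may differ; the function names and protein identifiers below are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant profile carries no co-expression signal
    return cov / sqrt(var_x * var_y)


def weight_interactions(interactions, expression):
    """Attach an expression-similarity weight to each interaction.

    interactions: iterable of (protein, protein) pairs.
    expression:   dict mapping a protein to its list of expression values.
    Pairs lacking a profile for either protein are skipped (an assumption).
    """
    weighted = {}
    for u, v in interactions:
        if u in expression and v in expression:
            weighted[(u, v)] = pearson(expression[u], expression[v])
    return weighted


# Hypothetical toy data, not measurements from any real experiment.
toy_expression = {
    "P1": [1.0, 2.0, 3.0],
    "P2": [2.0, 4.0, 6.0],
    "P3": [3.0, 2.0, 1.0],
}
toy_interactions = [("P1", "P2"), ("P1", "P3"), ("P1", "P4")]

if __name__ == "__main__":
    print(weight_interactions(toy_interactions, toy_expression))
```

Under this sketch, an interaction between proteins with dissimilar expression profiles receives a low weight, so a spurious interaction of that kind would tend to rank poorly rather than being deleted outright.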
Signaling pathways are associations of proteins working towards a common goal. In our experiments PathFinder filtered proteins according to their known functions, since the model was built on the characteristics of pathways. Hence, our system successfully captured the basis of pathways. We also observed that, in certain instances, PathFinder returned additional proteins along with the actual signaling pathways. Although these might be considered false positives in terms of signaling pathways, we observed that these proteins are actually related to these pathways. By returning associates of the original pathway proteins, PathFinder extended the pathways. This is one of the reasons for the low precision values in the random experiments. On the other hand, we obtained a much higher recall value, which reflects the success of this methodology. However, there is still room for improvement, since false positives in the protein-protein interaction data might have affected this value.

PathFinder is a system built from various data sources and methodologies. During its development we made various choices among different data sources. We chose Gene Ontology annotations (Ashburner et al., 2000) over other functional annotations since they are widely used and quite comprehensive for studied proteins. Other choices can also be applied to PathFinder for comparison. We observed that, for higher success rates, PathFinder should be based on the most comprehensive and error-free data sources available, and we preferred such sources from the start. Moreover, we chose to exhaustively search for every possible path between two nodes. This exponential-time computation was not a major drawback for our application: PathFinder returned search results within acceptable time limits.

In this thesis, we approached the protein-protein interaction network through two different problems. In our first study, we worked on generation models of such networks, i.e. how one can generate networks with similar properties.
In the second study, we utilized a network generated by millions of years of evolution to discover pathway segments. Common to both studies, we developed methodologies that incorporate other relationships of the proteome to increase our knowledge base. As a result, we worked on networks that are based on protein-protein interaction networks but that also carry additional evolutionary and functional information. These simple integrations increased our accuracy both in modeling these networks and in inferring other networks. This beneficial approach should be incorporated in every aspect of biological studies to minimize the uncertainty of biological datasets.

Future Directions

The first problem we addressed was to develop evolutionary network models for protein-protein interaction networks. We successfully developed an enhanced model that improves on the results of earlier models.

One of the biggest challenges in this thesis was attacking a problem that has no reliable initial data. The erroneous protein-protein interaction data had misled many researchers earlier as well. To overcome this problem, local structures of the networks were studied. Even with false positives, true associations can be captured by analyzing local structure. We developed a new measure, called the ℓ-hop degree distribution, which is more sensitive in identifying local structures.

The first future contribution to this framework could be developing new measures that improve the understanding of large and complex biological networks. Previously, network motifs were studied to capture local structures (Itzkovitz and Alon, 2005, Milo et al., 2004, 2002, Przulj et al., 2004, Shen-Orr et al., 2002). However, these studies presented no significant improvement to such evolutionary models.

In this work we also worked on inferring signaling pathway segments. We presented a methodology towards a solution for this important challenge of the post-genomic era. However, as always, there is still room for improvement.

High-throughput data have boosted the area of computational biology, since the amount of data to be analyzed required complex algorithms and new computational tools. Soon afterwards, it was shown that these datasets were not accurate enough to draw conclusions from, i.e. any analysis might be misleading. Here, a better approach to cope with the false negatives and false positives, which may account for up to 50-90% of the initially published data (Deane et al., 2002, Sprinzak et al., 2003), might be building a curated meta network. In this work we created a weighted network by integrating gene expression profiles with protein-protein interaction networks. By collecting experimental data that have been submitted to publicly available repositories for any organism, and using orthology information, which is now available for most of the model organisms (Remm et al., 2001), one can generate a network of interactions from highly accurate experimental results. Such a meta network would, in essence, drastically decrease the false negative rate of the interaction network and verify the existing interactions in the database (decreasing the number of false positives).

The number of false positives can also be decreased by using the sub-cellular localization of proteins. This would eliminate interactions among proteins that are impossible to observe. Moreover, this additional information would decrease the search space for any given query. To summarize, a significant improvement can be made by integrating these two approaches with our previous efforts to eliminate false positives. Such systematic methods for organizing and integrating different data sources to minimize the error rate in essential networks should be implemented first.

In our previous studies, for a given pair of initial and end states of a pathway segment, we checked every possible simple path whose length lies within the bounds l_lower and l_upper.
Although an exhaustive search seems impractical, the algorithm’s run time was acceptable compared to other, randomized methods. The major challenge in using this algorithm was false negatives. If a protein or an interaction of the actual pathway was missing from the search space of the query (missing links or nodes in the weighted interaction network), then there was no way of recovering such a pathway segment.

Sub-cellular localization of proteins can be used to generate additional candidate pathway segments. When pictured, this information places proteins at different locations in the cell. The proteins that are active in a pathway should have different sub-cellular localizations so that signals can travel through the cell, i.e. the proteins should most likely be lined up from the membrane towards the nucleus. As an extension to PathFinder, one can use the sub-cellular localization information of the proteins to look for chains that exhibit this ordering. Moreover, candidate paths can be generated by adding links or nodes to the search space as predictions of missing links or proteins. Hence, whenever a more significant path becomes possible through insertions, the model might be improved. In incremental steps, this approach can be extended to further insertions into candidate paths.
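The exhaustive search and the proposed localization extension described above can be sketched together in a few lines of Python. The sketch is hedged in several ways: it assumes path length is counted in edges, it uses a plain depth-first enumeration rather than PathFinder's actual implementation, and the compartment ordering, function names and toy data are all hypothetical.

```python
def simple_paths(adj, start, end, l_lower, l_upper):
    """Yield every simple path from start to end with l_lower..l_upper edges.

    adj maps each node to an iterable of neighbours. Length is measured in
    edges, which is an assumption made for this illustration.
    """
    path = [start]
    on_path = {start}

    def extend(node):
        edges = len(path) - 1
        if node == end:
            if l_lower <= edges <= l_upper:
                yield list(path)
            return  # a simple path cannot continue through its end node
        if edges == l_upper:
            return  # length bound reached without hitting the end node
        for neighbour in adj.get(node, ()):
            if neighbour not in on_path:
                path.append(neighbour)
                on_path.add(neighbour)
                yield from extend(neighbour)
                path.pop()
                on_path.remove(neighbour)

    yield from extend(start)


# Hypothetical compartment ordering: membrane first, nucleus last.
COMPARTMENT_RANK = {"membrane": 0, "cytoplasm": 1, "nucleus": 2}


def follows_localization_order(path, localization):
    """True if annotated proteins on the path never move back towards the membrane."""
    ranks = [COMPARTMENT_RANK[localization[p]] for p in path if p in localization]
    return all(a <= b for a, b in zip(ranks, ranks[1:]))


# Hypothetical toy network and localization annotations.
toy_adj = {
    "s": ["a", "b", "t"],
    "a": ["s", "b", "t"],
    "b": ["s", "a", "t"],
    "t": ["s", "a", "b"],
}
toy_loc = {"s": "membrane", "a": "cytoplasm", "b": "nucleus", "t": "nucleus"}

if __name__ == "__main__":
    for candidate in simple_paths(toy_adj, "s", "t", 2, 3):
        print(candidate, follows_localization_order(candidate, toy_loc))
```

The enumeration is exponential in the worst case, as noted above, so the length bounds are what keep it tractable. Treating localization as a filter on candidate paths, rather than as a hard constraint on the network, leaves room for proteins without annotations; that choice is ours for this sketch and is not prescribed by the text.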

Final Conclusions

At its core, what we have accomplished here is building models and developing methodologies that can represent the dynamics of the cell. First, we modeled the evolutionary processes that formed these dynamics. Second, we integrated biological resources to grow this knowledge base by developing a search tool.

In this work, we focused to a great extent on the model organism S. cerevisiae. However, networks and pathways are still being studied and validated. Therefore it can be expected that much will be learned about the networks and pathways of higher organisms in the upcoming years. Clearly, an important step will be completing the identification and annotation of proteomes. However, while much remains to be revealed about protein functionality and evolution in higher eukaryotes, there is still much to be discovered regarding protein-protein interaction networks and signaling pathways as well.

In conclusion, we have explored ways to improve the detailed view of the cell. Today, what is known about cells, the smallest entities of life, is still small. Although the universe expands every single second, what is inside a living cell does not. Still, much remains unknown before its dynamics can be completely understood. All in all, in this thesis, we have explored methodologies and developed models to further discover and explain life.

Bibliography

Abbott A (2002) Betting on tomorrow’s chips. Nature 415: 112–114.

Abello J, Buchsbaum AL, Westbrook J (1998) A Functional Approach to External Graph Algorithms. In: European Symposium on Algorithms, pp. 332–343.

Agrawal R, Imielinski T, Swami AN (1993) Mining Association Rules between Sets of Items in Large Databases. In: Buneman P, Jajodia S, editors, Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data, pp. 207–216, Washington, D.C.

Agrawal R, Srikant R (1994) Fast Algorithms for Mining Association Rules in Large Databases. In: VLDB 1994: Proceedings of the 20th International Conference on Very Large Data Bases, pp. 487–499, San Francisco, CA, USA: Morgan Kaufmann Publishers Inc.

Aiello W, Chung F (2001) Random Evolution in Massive Graphs. In: FOCS 2001: Proceedings of the 42nd IEEE symposium on Foundations of Computer Science, p. 510, Washington, DC, USA: IEEE Computer Society.

Aiello W, Chung F, Lu L (2000) A random graph model for massive graphs. In: STOC 2000: Proceedings of the thirty-second annual ACM symposium on Theory of computing, pp. 171–180, New York, NY, USA: ACM Press.

Albert R, Jeong H, Barabasi AL (1999) Internet: Diameter of the World-Wide Web. Nature 401: 130–131.

Albert R, Jeong H, Barabasi AL (2000) Error and attack tolerance of complex networks. Nature 406: 378.

Alberts B, et al (2002) Molecular Biology of the Cell [Book and CD-ROM]. Garland Science.

Alon N, Yuster R, Zwick U (1994) Color-Coding. Electronic Colloquium on Computational Complexity (ECCC) 1.

Aloy P, Russell RB (2002) Potential artefacts in protein-interaction networks. FEBS Lett 530: 253–254.

Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, et al. (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25: 25–29.

Bader GD, Betel D, Hogue CW (2003) BIND: the Biomolecular Interaction Network Database. Nucleic Acids Res 31: 248–250, URL http://bind.ca.

Bader GD, Hogue CW (2002) Analyzing yeast protein-protein interaction data obtained from different sources. Nat Biotech 20: 991–997.

Ball CA, Awad IAB, Demeter J, Gollub J, Hebert JM, et al. (2005) The Stanford Microarray Database accommodates additional microarray platforms and data formats. Nucleic Acids Research 33: D580+.

Barabasi AL, Albert R (1999) Emergence of Scaling in Random Networks. Science 286: 509–512.

Bateson W (1909) Mendel’s Principles of Heredity. Genetics Heritage Pr.

Bebek G, Berenbrink P, Cooper C, Friedetzky T, Nadeau J, et al. (2006a) The degree distribution of General Duplication Models. Theoretical Computer Science (to appear).

Bebek G, Berenbrink P, Cooper C, Friedetzky T, Nadeau J, et al. (2006b) Improved Duplication Models for Proteome Network Evolution. Lecture Notes in Bioinformatics (in press).

Bebek G, Yang J (2006) PathFinder: Mining Signal Transduction Pathway Segments from Protein Protein Interaction Networks. Tech. rep., Case Western Reserve University.

Berger N, Bollobas B, Borgs C, Chayes J, Riordan O (2003) Degree distribution of the FKP network model. In: Proc. ICALP, LNCS 2719, pp. 725–738.

Bhan A, Galas DJ, Dewey GT (2002) A duplication growth model of gene expression networks. Bioinformatics 18: 1486–1493.

Bollobas B (2001) Random Graphs. Cambridge University Press.

Bollobas B, Borgs C, Chayes J, Riordan O (2003) Directed scale-free graphs. In: SODA 2003: Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete algorithms, pp. 132–139, Philadelphia, PA, USA: Society for Industrial and Applied Mathematics.

Bollobás B, Riordan O, Spencer J, Tusnády G (2001) The degree sequence of a scale-free random graph process. Random Structures and Algorithms 18: 279–290.

Bornholdt S, Ebel H (2001) World-Wide Web scaling exponent from Simon’s 1955 model. Physical Review E 64: 035104.

Broder A, Kumar R, Maghoul F, Raghavan P, Rajagopalan S, et al. (2000) Graph structure in the Web. Computer Networks 33: 309–320.

Brown KR, Jurisica I (2005) Online predicted human interaction database. Bioinformatics 21: 2076–2082, URL http://ophid.utoronto.ca.

Campagne F, Neves S, Chang CW, Skrabanek L, Ram PT, et al. (2004) Quantitative information management for the biochemical computation of cellular networks. Sci STKE 2004.

Cherry JM, Ball C, Weng S, Juvik G, Schmidt R, et al. (1997) Genetic and physical maps of Saccharomyces cerevisiae. Nature 387: 67–73.

Choi C, Crass T, Kel A, Kel-Margoulis O, Krull M, et al. (2004) Consistent re-modeling of signaling pathways and its implementation in the TRANSPATH database. Genome Inform Ser 15: 244–254.

Chung F, Lu L, Dewey TG, Galas DJ (2003) Duplication models for biological networks. J Comput Biol 10: 677–687.

Cook SJ, McCormick F (1993) Inhibition by cAMP of Ras-dependent activation of Raf. Science 262: 1069–1072.

Cooper C, Frieze A (2003) A general model of web graphs. Random Structures and Algorithms 22: 311–335.

Cormen TH, Leiserson CE, Rivest RL, Stein C (2001) Introduction to Algorithms, Second Edition. The MIT Press.

Cornell M, Paton NW, Oliver SG (2004) A critical and integrated view of the yeast interactome. Comp Funct Genomics 5: 382–402.

Dayhoff M, Schwartz R, Orcutt B (1978) A model of evolutionary change in proteins. Atlas of Protein Sequence and Structure 5(3): 345–352.

Deane CM, Salwinski L, Xenarios I, Eisenberg D (2002) Protein interactions: two methods for assessment of the reliability of high throughput observations. Mol Cell Proteomics 1: 349–356.

DeRisi JL, Iyer VR (1999) Genomics and array technology. Curr Opin Oncol 11: 76–79.

Eddy SR (2004) Where did the BLOSUM62 alignment score matrix come from? Nat Biotechnol 22: 1035–1036.

Edwards AM, Kus B, Jansen R, Greenbaum D, Greenblatt J, et al. (2002) Bridging structural biology and genomics: assessing protein interaction data with known complexes. Trends Genet 18: 529–536.

Erdős P, Rényi A (1959) On random graphs. Publicationes Mathematicae Debrecen 6: 290–297.

Faloutsos M, Faloutsos P, Faloutsos C (1999) On Power-law Relationships of the Internet Topology. In: SIGCOMM, pp. 251–262.

Ferrer i Cancho R, Solé RV (2001) The small world of human language. Proc R Soc Lond B Biol Sci 268: 2261–2265.

Fields S (2005) High-throughput two-hybrid analysis. The promise and the peril. FEBS J 272: 5391–5399.

Fields S, Song O (1989) A novel genetic system to detect protein-protein interactions. Nature 340: 245–246.

Fischer I (2002) Similarity-preserving Metrics for Amino-acid Sequences. In: 22nd GIF Meeting on Challenges in Genomic Research: Neurodegenerative Diseases, Stem Cells, Bioethics.

Force A, Lynch M, Pickett BF, Amores A, Yan YL, et al. (1999) Preservation of Duplicate Genes by Complementary, Degenerative Mutations. Genetics 151: 1531–1545.

Francke C, Siezen RJ, Teusink B (2005) Reconstructing the metabolic network of a bacterium from its genome. Trends in Microbiology 13: 550–558.

Gavin AC, Bösche M, Krause R, Grandi P, Marzioch M, et al. (2002) Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature 415: 141–147.

Giot L, Bader JS, Brouwer C, Chaudhuri A, Kuang B, et al. (2003) A protein interaction map of Drosophila melanogaster. Science 302: 1727–1736.

Goffeau A, Barrell BG, Bussey H, Davis RW, Dujon B, et al. (1996) Life with 6000 genes. Science 274.

Gough NR, Adler EM, Ray LB (2004) Focus Issue: Cell Signaling–Making New Connections. Sci STKE 2004: 12.

Graeber TG, Eisenberg D (2001) Bioinformatic identification of potential autocrine signaling loops in cancers from gene expression profiles. Nat Genet 29: 295–300.

Grigoriev A (2003) On the number of protein-protein interactions in the yeast proteome. Nucleic Acids Res 31: 4157–4161.

Han JDJ, Dupuy D, Bertin N, Cusick ME, Vidal M (2005) Effect of sampling on topology predictions of protein-protein interaction networks. Nature Biotechnology 23: 839–844.

Hartuv E, Shamir R (2000) A clustering algorithm based on graph connectivity. Information Processing Letters 76: 175–181.

Henikoff S, Henikoff JG (1992) Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci U S A 89: 10915–10919.

Ho Y, Gruhler A, Heilbut A, Bader GD, Moore L, et al. (2002) Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature 415: 180–183.

Ideker T, Thorsson V, Ranish JA, Christmas R, Buhler J, et al. (2001) Integrated genomic and proteomic analyses of a systematically perturbed metabolic network. Science 292: 929–934.

Ispolatov I, Krapivsky PL, Yuryev A (2005) Duplication-divergence model of protein interaction network. Phys Rev E Stat Nonlin Soft Matter Phys 71.

Ito T, Chiba T, Ozawa R, Yoshida M, Hattori M, et al. (2001) A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci U S A 98: 4569–4574.

Ito T, Ota K, Kubota H, Yamaguchi Y, Chiba T, et al. (2002) Roles for the Two-hybrid System in Exploration of the Yeast Protein Interactome. Mol Cell Proteomics 1: 561–566.

Itzkovitz S, Alon U (2005) Subgraphs and network motifs in geometric networks. Phys Rev E Stat Nonlin Soft Matter Phys 71.

Jansen R, Greenbaum D, Gerstein M (2002) Relating whole-genome expression data with protein-protein interactions. Genome Res 12: 37–46.

Jeong H, Mason SP, Barabasi AL, Oltvai ZN (2001) Lethality and centrality in protein networks. Nature 411: 41–42.

Jeong H, Tombor B, Albert R, Oltvai ZN, Barabasi AL (2000) The large-scale organization of metabolic networks. Nature 407: 651.

Jin F, Hazbun T, Michaud GA, Salcius M, Predki PF, et al. (2006) A pooling-deconvolution strategy for biological network elucidation. Nature Methods 3: 183–189.

Kanehisa M, Goto S (2000) KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucl Acids Res 28: 27–30.

Kauffman SA (1969) Metabolic stability and epigenesis in randomly constructed genetic nets. J Theor Biol 22: 437–467.

Kephart JO, White SR (1991) Directed-graph epidemiological models of computer viruses. In: Proceedings., 1991 IEEE Computer Society Symposium on Research in Security and Privacy, pp. 343–359.

Kleinberg J, Kumar R, Raghavan P, Rajagopalan S, Tomkins A (1999) The Web as a graph: Measurements, models and methods. In: Proceedings of COCOON, pp. 1–17, Tokyo, Japan.

Kumar R, Raghavan P, Rajagopalan S, Sivakumar D, Tomkins A, et al. (2000) Stochastic models for the Web graph. In: FOCS 2000: Proceedings of the 41st Annual Symposium on Foundations of Computer Science, p. 57, Washington, DC, USA: IEEE Computer Society.

Li S, Armstrong CM, Bertin N, Ge H, Milstein S, et al. (2004) A map of the interactome network of the metazoan C. elegans. Science 303: 540–543.

Li WH, Gu Z, Wang H, Nekrutenko A (2001) Evolutionary analyses of the human genome. Nature 409: 847–849.

Liolios K, Tavernarakis N, Hugenholtz P, Kyrpides NC (2006) The Genomes On Line Database (GOLD) v.2: a monitor of genome projects worldwide. Nucleic Acids Research 34: D332–334.

Liu Y, Zhao H (2004) A computational approach for ordering signal transduction pathway components from genomics and proteomics Data. BMC Bioinformatics 5.

Madhani HD, Styles CA, Fink GR (1997) MAP kinases with distinct inhibitory functions impart signaling specificity during yeast differentiation. Cell 91: 673–684.

Marcotte EM, Pellegrini M, Ng HL, Rice DW, Yeates TO, et al. (1999) Detecting Protein Function and Protein-Protein Interactions from Genome Sequences. Science 285: 751–753.

May RM (1973) Stability and Complexity in Model Ecosystems. (MPB-6). Princeton University Press.

Mewes HW, Heumann K, Kaps A, Mayer K, Pfeiffer F, et al. (1999) MIPS: a database for genomes and protein sequences. Nucleic Acids Res 27: 44–48, URL http://mips.gsf.de.

Milo R, Itzkovitz S, Kashtan N, Levitt R, Shen-Orr S, et al. (2004) Superfamilies of evolved and designed networks. Science 303: 1538–1542.

Milo R, Shen-Orr S, Itzkovitz S, Kashtan N, Chklovskii D, et al. (2002) Network motifs: simple building blocks of complex networks. Science 298: 824–827.

Nadeau JH, Sankoff D (1997) Comparable Rates of Gene Loss and Functional Divergence After Genome Duplications Early in Vertebrate Evolution. Genetics 147: 1259–1266.

Neves SR, Iyengar R (2002) Modeling of signaling networks. Bioessays 24: 1110–1117.

Newman MEJ, Strogatz SH, Watts DJ (2001) Random graphs with arbitrary degree distributions and their applications. Physical Review E 64: 026118.

Nielsen H, Engelbrecht J, von Heijne G, Brunak S (1996) Defining a similarity threshold for a functional protein sequence pattern: the signal peptide cleavage site. Proteins 26: 165–177.

Ohno S (1970) Evolution by gene duplication. Springer-Verlag.

Pastor-Satorras R, Smith E, Solé RV (2003) Evolving protein interaction networks through gene duplication. J Theor Biol 222: 199–210.

Pearson WR, Lipman DJ (1988) Improved tools for biological sequence comparison. Proc Natl Acad Sci U S A 85: 2444–2448.

Penrose M (2003) Random Geometric Graphs (Oxford Studies in Probability). Oxford University Press, USA.

Przulj N, Corneil DG, Jurisica I (2004) Modeling interactome: scale-free or geometric? Bioinformatics 20: 3508+.

Reboul J, Vaglio P, Rual JF, Lamesch P, Martinez M, et al. (2003) C. elegans ORFeome version 1.1: experimental verification of the genome annotation and resource for proteome-scale protein expression. Nat Genet 34: 35–41.

Redner S (1998) How Popular is Your Paper? An Empirical Study of the Citation Distribution. Eur Phys Jour B 4: 131–134.

Remm M, Storm CE, Sonnhammer EL (2001) Automatic clustering of orthologs and in-paralogs from pairwise species comparisons. J Mol Biol 314: 1041–1052, URL http://inparanoid.cgb.ki.se.

Rigaut G, Shevchenko A, Rutz B, Wilm M, Mann M, et al. (1999) A generic protein purification method for protein complex characterization and proteome exploration. Nat Biotechnol 17: 1030–1032.

Roberto J Bayardo J (1998) Efficiently mining long patterns from databases. In: SIGMOD 1998: Proceedings of the 1998 ACM SIGMOD international conference on Management of data, pp. 85–93, New York, NY, USA: ACM Press.

Rubin GM, Yandell MD, Wortman JR, Gabor Miklos GL, Nelson CR, et al. (2000) Comparative genomics of the eukaryotes. Science 287: 2204–2215.

Ruepp A, Zollner A, Maier D, Albermann K, Hani J, et al. (2004) The FunCat, a functional annotation scheme for systematic classification of proteins from whole genomes. Nucleic Acids Res 32: 5539–5545.

Sachs K, Perez O, Pe’er D, Lauffenburger DA, Nolan GP (2005) Causal Protein-Signaling Networks Derived from Multiparameter Single-Cell Data. Science 308: 523–529.

Scott J, Ideker T, Karp RM, Sharan R (2006) Efficient algorithms for detecting signaling pathways in protein interaction networks. J Comput Biol 13: 133–144.

Seoighe C, Wolfe KH (1999a) Updated map of duplicated regions in the yeast genome. Gene 238: 253–261.

Seoighe C, Wolfe KH (1999b) Yeast genome evolution in the post-genome era. Curr Opin Microbiol 2: 548–554.

Sharan R, Suthram S, Kelley RM, Kuhn T, McCuine S, et al. (2005) Conserved patterns of protein interaction in multiple species. Proc Natl Acad Sci U S A 102: 1974–1979.

Shen-Orr SS, Milo R, Mangan S, Alon U (2002) Network motifs in the transcriptional regulation network of Escherichia coli. Nat Genet 31: 64–68.

Shlomi T, Segal D, Ruppin E, Sharan R (2006) QPath: a method for querying pathways in a protein-protein interaction network. BMC Bioinformatics 7: 199.

Simon H (1955) On a class of skew distribution functions. Biometrika 42: 425–440.

Smith MR, Degudicibus SJ, Stacey DW (1986) Requirement for c-ras proteins during viral oncogene transformation. Nature 320: 540–543.

Sprinzak E, Sattath S, Margalit H (2003) How reliable are experimental protein-protein interaction data? J Mol Biol 327: 919–923.

Stark C, Breitkreutz BJ, Reguly T, Boucher L, Breitkreutz A, et al. (2006) BioGRID: a general repository for interaction datasets. Nucleic Acids Research 34: D535, URL http://www.thebiogrid.org.

Steffen M, Petti A, Aach J, D’haeseleer P, Church G (2002) Automated modelling of signal transduction networks. BMC Bioinformatics 3.

Suthram S, Sittler T, Ideker T (2005) The Plasmodium protein network diverges from those of other eukaryotes. Nature 438: 108–112.

Tucker CL, Gera JF, Uetz P (2001) Towards an understanding of complex protein networks. Trends in Cell Biology 11: 102–106.

Uetz P, Finley RL (2005) From protein networks to biological systems. FEBS Lett 579: 1821–1827.

Uetz P, Giot L, Cagney G, Mansfield TA, Judson RS, et al. (2000) A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature 403: 623–627.

van Noort V, Snel B, Huynen M (2004) The yeast coexpression network has a small-world, scale-free architecture and can be explained by a simple model. EMBO Reports 5: 280–284.

Vazquez A, Flammini A, Maritan A, Vespignani A (2003) Modeling of protein interaction networks. Complexus 1: 38.

Velculescu VE, Zhang L, Vogelstein B, Kinzler KW (1995) Serial analysis of gene expression. Science 270: 484–487.

Vincens P, Tarroux P (1988) Two-dimensional electrophoresis computerized processing. Int J Biochem 20: 499–509.

von Mering C, Krause R, Snel B, Cornell M, Oliver SG, et al. (2002) Comparative assessment of large-scale data sets of protein-protein interactions. Nature 417: 399–403.

Wagner A (2001) The yeast protein interaction network evolves rapidly and contains few redundant duplicate genes. Mol Biol Evol 18: 1283–1292.

Walhout AJ, Boulton SJ, Vidal M (2000a) Yeast two-hybrid systems and protein interaction mapping projects for yeast and worm. Yeast 17: 88–94.

Walhout AJ, Sordella R, Lu X, Hartley JL, Temple GF, et al. (2000b) Protein Interaction Mapping in C. elegans Using Proteins Involved in Vulval Development. Science 287: 116–122.

Warmka J, Hanneman J, Lee J, Amin D, Ota I (2001) Ptc1, a type 2C Ser/Thr phosphatase, inactivates the HOG pathway by dephosphorylating the mitogen-activated protein kinase Hog1. Mol Cell Biol 21: 51–60.

Watts DJ, Strogatz SH (1998) Collective dynamics of ‘small-world’ networks. Nature 393: 440–442.

Wolfe KH, Shields DC (1997) Molecular evidence for an ancient duplication of the entire yeast genome. Nature 387: 708–713.

Wu J, Dent P, Jelinek T, Wolfman A, Weber MJ, et al. (1993) Inhibition of the EGF-activated MAP kinase signaling pathway by adenosine 3’,5’-monophosphate. Science 262: 1065–1069.

Xenarios I, Salwinski L, Duan XJ, Higney P, Kim SM, et al. (2002) DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions. Nucleic Acids Res 30: 303–305, URL http://dip.doe-mbi.ucla.edu.

Zanzoni A, Montecchi-Palazzi L, Quondam M, Ausiello G, Helmer-Citterich M, et al. (2002) MINT: a Molecular INTeraction database. FEBS Lett 513: 135–140, URL http://cbm.bio.uniroma2.it/mint/index.html.

Zien A, Küffner R, Zimmer R, Lengauer T (2000) Analysis of Gene Expression Data with Pathway Scores. In: Altman R, et al., editors, ISMB00, pp. 407–417, La Jolla, CA: AAAI.
