STUDIES ON THE TOPOLOGY, MODULARITY, ARCHITECTURE AND ROBUSTNESS OF THE PROTEIN-PROTEIN INTERACTION NETWORK OF BUDDING YEAST SACCHAROMYCES CEREVISIAE

DISSERTATION

Presented in Partial Fulfillment of the Requirements for

The Degree Doctor of Philosophy in the

Graduate School of The Ohio State University

By

Jingchun Chen, M.S.

*****

The Ohio State University 2006

Dissertation Committee:

Professor Bo Yuan, Adviser

Professor Tsonwin Hai

Professor Roger Briesewitz

Professor Ralf Bundschuh

Approved by: Adviser, Integrated Biomedical Science Graduate Program

ABSTRACT

In this dissertation, statistical mechanics, graph theory, and machine learning methods have been used to study the topology, modularity, organization and robustness of the protein-protein interaction network of budding yeast Saccharomyces cerevisiae.

The protein-protein interaction dataset is obtained by combining high-confidence interactions and is validated from multiple perspectives. Statistical mechanics is then used to analyze the connectivity distribution, graph spectrum, shortest path distance and clustering coefficients of the network, which indicate that the network is both scale-free and modular. Microarray expression profiles are used to compute the weight for each interaction, and the network is represented as a weighted undirected graph. An edge betweenness-based algorithm is developed and applied to the graph, and a set of functional modules is identified in the network. The functional modules are then validated rigorously against gene annotation, growth phenotype and protein complexes. It is found that genes in the same functional module exhibit similar deletion phenotypes, and that known protein complexes are largely contained in the functional modules.

To find out the organization of the yeast proteome network, the relationship between the expression profiles of hubs and their interacting partners is analyzed. The results indicate that subpopulations of hubs exist in the yeast proteome network, which are classified as core, local and global hubs. By examining these hub populations from the perspectives of protein complexes, interaction overlap, clustering coefficients, module connectivity, and visualization, it is found that global hubs form the backbone of module-module interaction, while core hubs are organizers within functional modules. In addition, analysis of the interactions between the hubs indicates that each of the three types of hubs preferentially interacts with hubs from the same population, which suggests an ordered architecture for the network and the existence of a central processing subnetwork at both the global and the functional module level. Gene expression changes of the hub populations in cellular responses are then analyzed to gain insights into the dynamics of module-module interactions, and the results suggest that global hubs are the major and early responders in cellular response. Next, network breakdown simulation and graph spectrum are used to examine the contributions of each hub population to the robustness of the yeast proteome network. The results indicate that network organizers contribute most to the robustness at both global and local levels. Finally, it is found that genes contributing most to the robustness of functional modules, not that of the entire network, are more likely to be essential.

Dedicated to my parents

ACKNOWLEDGMENTS

I wish to thank my adviser, Professor Bo Yuan, for his enthusiasm about science, scientific guidance, professional advice, intellectual assistance, and both financial and personal support. Without any of these this dissertation would never have been completed.

I wish to thank the members of my dissertation advisory committee, Dr. Tsonwin Hai, Dr. Roger Briesewitz, and Dr. Ralf Bundschuh, for their advice, help and critical evaluation of this dissertation research.

I am indebted to Solomon Gibbs, who helped me tremendously both academically and personally during the first three years of this research. I wish to thank Russell Sears, Fa Zhang, and all other colleagues for their help and stimulating discussions on this dissertation research.

I also wish to thank those nameless people who made their software available to the public free of charge, which greatly enhanced this dissertation research.

I am most grateful to my wife, Xu Huang, for her unconditional support, enduring belief and incredible patience during the course of my scientific pursuit. I am in deep debt to all my family whose love has made this dissertation more meaningful.

VITA

March 25, 1971 ……………………… Born – Nanchong, Sichuan Province, China

June 1993 …………………………… B.S. Biochemistry, Sichuan University

June 1996 …………………………… M.S. Cell Biology, Sichuan University

March 2001 …………………………. M.S. Biochemistry, The Ohio State University

August 2002 ………………………… M.S. Computer Science, Wright State University

1998 – 1999 ………………………… Teaching Associate, The Ohio State University

1999 – 2001 …………………………. Research Associate, The Ohio State University

2001 – 2002 …………………………. Research Associate, Wright State University

2002 – present ……………………….. Research and Teaching Associate, The Ohio State University

PUBLICATIONS

1. Chen, J., and Yuan B. (2006). “Detecting functional modules in the yeast protein-protein interaction network”. Bioinformatics, in press.

2. Zhang, F., Liu, Z., Chen, J. and Yuan B. "The construction of structural templates for the modeling of conserved protein domains". International Conference on Bioinformatics and its Applications (ICBA'04), December 16-19, 2004, Fort Lauderdale, Florida, USA.

3. Ozer, H., Chen, J., Zhang, F. and Yuan B. "Clustering of eukaryotic orthologs using the Markov graph-flow algorithm". International Conference on Bioinformatics and its Applications (ICBA'04), December 16-19, 2004, Fort Lauderdale, Florida, USA.

4. Okamoto, Y., Chaves, A., Chen, J. et al. (2001). "Transgenic mice with cardiac-specific expression of activating transcription factor 3, a stress-inducible gene, have conduction abnormalities and contractile dysfunction." American Journal of Pathology, 159(2): 639-650.

5. Chen, J. and Shi, A. (1998). "Cytological observation on the fertilization of Anodonta woodiana (Elliptica)." Journal of Fisheries of China, 22(1): 78-80.

6. Shi, A. and Chen, J. (1997). "Mussel breeding and pearl cultivation." Sichuan Sci. & Tech. Press, pp. 1-36.

7. Chen, J. and Shi, A. (1996). "Malacozoan immunobiology research: A review." Acta Hydrobiologia Sinica, 20(1): 74-78.

8. Zhou, L., Li, J., Zheng, Y. and Chen, J. (1995). "Purification and partial characterization of Endo-Polygalacturonase from commercial pectinase of Aspergillus niger." Chinese Biochemical Journal, 11(4): 446-451.

FIELDS OF STUDY

Major Field: Integrated Biomedical Science Graduate Program

TABLE OF CONTENTS

Abstract
Dedication
Acknowledgments
Vita
List of Tables
List of Figures

Chapters:

1. Introduction
2. Data integration and validation
      Materials and methods
      Results
      Discussion
3. Network topology
      Materials and methods
      Results
      Discussion
4. Module detection and validation
      Materials and methods
      Results
      Discussion
5. Module organizations
      Materials and methods
      Results
      Discussion
6. Network robustness
      Materials and methods
      Results
      Discussion
7. Conclusions and future directions

Bibliography

Appendix A

LIST OF TABLES

2.1. The reliability of representative high-throughput methods
2.2. The datasets of three representative studies
2.3. The protein coverage of the dataset among the functional categories
3.1. The diameter and shortest path length of the networks
3.2. The clustering coefficients of the networks
5.1. Three types of hubs
5.2. The number of unique interactors in each hub population

LIST OF FIGURES

2.1. The protein coverage of the dataset among the functional categories
2.2. The interaction intensities between the functional categories of the CYGD annotation system
2.3. The interaction intensities between the functional categories of GO annotation system
2.4. Boxplot for the functional similarities between interacting proteins
2.5. The interaction intensities between the localization categories of CYGD database
2.6. The interaction intensities between the localization categories of SGD database
2.7. The scale-free fitting of the datasets
3.1. The connectivity distribution of the networks
3.2. The spectral density of the networks
3.3. The shortest distance distribution of the networks
3.4. The clustering coefficients of the networks
4.1. Edge betweenness based on non-redundant shortest path
4.2. Algorithm performances on the partitioning of an artificial network
4.3. Identified functional modules are densely connected subgraphs
4.4. Genes within a functional module confer similar deletion phenotype
4.5. The comparison of phenotype divergence
4.6. Protein complexes are contained in functional modules
4.7. The segregation module
5.1. Bimodal distribution of co-expression correlations among hubs
5.2. Correlation between connectivity and PCC for hubs
5.3. Correlation between PCC and in-degree/out-degree for hubs
5.4. Distribution of mean PCC for subpopulations of hubs
5.5. The clustering coefficients of the hub populations
5.6. The module connectivity of the hubs
5.7. The schematic of the hub-hub interactions between functional modules
5.8. The hubs represented in protein complexes
5.9. The interactions between hub types
5.10. The hub populations involved in signal transduction
5.11. The expression change of hubs in cellular response
5.12. The expression change profiles of hub in response to α-factor
5.13. A model for the modular organization of the yeast proteome network
6.1. Module connectivity and network robustness
6.2. Global hubs are important for network robustness
6.3. Graph spectral change due to sequential removal of hubs
6.4. Correlations between spectral change and connectivity
6.5. The correlation between module connectivity and spectral change for proteins with the same pairwise connectivity
6.6. Spectral change upon individual hub removal
6.7. Spectral changes of functional modules upon individual hub removal
6.8. The percentage of essential genes among the hub populations
6.9. Synthetic lethality and graph spectral change

CHAPTER 1

INTRODUCTION

Biology is a systems science

Heredity has fascinated and benefited human beings throughout the long history of agriculture. How to breed plants and animals to produce useful hybrids has been a primary motivation for studying inheritance. Even though countless people have contributed to this fundamental discipline of biology since ancient times, it was Gregor Mendel, the Austrian monk, who laid the foundation of modern genetics. The concept of the “inheritance unit” in Mendel’s genetics, envisioned through his pea-breeding experiments, for the first time in history linked an organism’s traits to separable entities in the cell. Subsequently, Thomas Hunt Morgan and his colleagues established that genes are the physical entities of the unit of heredity. Identifying and characterizing genes then became the main tasks in the endeavor to explain biological phenomena. The discovery of the double helix structure of deoxyribonucleic acid (DNA) in the middle of the twentieth century (Watson and Crick, 1953), together with the subsequent establishment of the central dogma (Crick, 1970) and the development of technologies in genetic engineering, transformed biological research. Laboratory manipulation of genetic materials at unprecedented precision laid the foundation for gene discovery in all species at increasing speed. Biological phenomena, particularly human diseases, have never been so well understood at the level of molecular behaviors. In many cases, a trait of an organism can be traced back to a particular gene, a peptide fragment, a well-defined patch on a protein surface, or even a specific residue.

However, genes usually do not function by themselves. Most genes, if not all, interact with other genes, modulating their functions and/or being modulated by them. A gene's function can be defined properly only in the context of its interacting partners. For example, when ATF3 binds to itself and forms a homodimer, it is a transcription repressor. However, when ATF3 binds to gadd153/Chop10 to form a heterodimer, it is non-functional (Chen et al., 1996) (Wolfgang et al., 1997). More general examples include the proteins involved in signal transduction, whose function is to interact with other proteins and transmit signals. Among them are so-called scaffold proteins. The role of a scaffold protein is to recruit the components of a pathway to form a protein complex, which facilitates signal transmission. As a consequence of the complicated interactions between genes, in most cases it is impossible to establish a simple gene-to-phenotype relationship like the one Mendel discovered in his studies of the colors of pea flowers, the shapes of pea seeds, etc. From the perspective of biomedical science, a disease is very often the outcome of mutations in more than one gene (Ghosh and Collins, 1996). Mechanistic studies of these diseases will certainly fail if the involved genes are only studied individually (Strohman, 2002). Even if only one allele is implicated in a disease, it is still very important to understand how the mutation affects the interactions between this gene and its interacting genes. Therefore, to really understand the functions of genes and the complex relationship between genotype and phenotype, we have to study how the cell functions at the systems level.

Concepts of systems biology are, of course, nothing new (Kitano, 2002b). However, biological research at the systems level became possible only very recently. The completion of whole-genome sequencing projects for human and many other species greatly enhanced the gene discovery process. For example, over 6000 Open Reading Frames (ORFs) have been identified in the genome of the budding yeast Saccharomyces cerevisiae. Furthermore, large-scale studies of functional genomics have been made possible by the development of high-throughput technologies. In particular, microarray technology made it possible to monitor the expression of thousands of genes simultaneously; yeast two-hybrid screening, affinity chromatography, synthetic lethality and other techniques were successfully applied to study the protein-protein interactions in the whole proteomes of the yeast S. cerevisiae (Uetz et al., 2000) (Ito et al., 2001) (Ho et al., 2002) (Gavin et al., 2002), the worm C. elegans (Li et al., 2004) and the fruit fly D. melanogaster (Giot et al., 2003). All these studies greatly contributed to a knowledge base that is now rich enough for us to study gene-gene interactions at the systems level.

Even though a systems approach towards biology is both necessary and possible, the complexity of this approach is overwhelming. Genes interact with each other in many different ways. A gene, such as a transcription factor, may interact with another gene by controlling its transcription. A gene, such as a kinase, may also interact with another gene by phosphorylating its protein product. The main theme of gene-gene interaction, however, is protein-protein interaction. But even this subsystem of protein-protein interaction is still very complex. It has been suggested that, on average, one protein interacts with 5 other proteins in a typical eukaryotic cell (Tucker et al., 2001). Furthermore, unlike many other complex systems studied in statistical mechanics, biological systems consist of elements that are not identical to one another. Instead, each protein is a functional entity that behaves differently (Kitano, 2002a). This increases the complexity of the protein-protein interaction subsystem even further. Therefore, at this early stage of systems biology, a further simplification is necessary. One suitable representation of the protein-protein interactions in a cell is an undirected graph, or a network, in which proteins are the nodes and the interactions are the connections.

With such a simplified representation, what can be learned about the actual complex proteome network? A series of recent pioneering studies have indeed discovered some universal features of biological networks, such as the power law distribution of the connectivities in scale-free networks (Albert and Barabasi, 2002). In a random network, the connectivities of the nodes follow a Poisson distribution; but in a scale-free network, this distribution follows a power law. Many biological networks, including protein-protein interaction networks, were shown to fit well into this model. Even though the scale-free model may shed light on some system behaviors of a network, such as error and attack tolerance (Barabasi and Oltvai, 2004), the functional module has recently emerged as a more intriguing concept that may help us understand the organization and dynamics of complex networks, such as proteome networks (Hartwell et al., 1999).
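The contrast between the two connectivity distributions mentioned above can be made concrete with a small numerical sketch. The mean connectivity of 4 and the power-law exponent of 2.5 below are illustrative assumptions, not values taken from this dissertation, and the function names are hypothetical.

```python
import math

def poisson_pmf(k, mean):
    """P(k) for a random network whose connectivities are Poisson distributed."""
    return math.exp(-mean) * mean ** k / math.factorial(k)

def power_law_pmf(k, gamma, k_max=1000):
    """P(k) proportional to k**(-gamma), normalized over k = 1..k_max."""
    norm = sum(j ** -gamma for j in range(1, k_max + 1))
    return k ** -gamma / norm

# Assumed illustrative parameters: mean connectivity 4, exponent 2.5.
for k in (1, 5, 20, 50):
    print(k, poisson_pmf(k, 4.0), power_law_pmf(k, 2.5))
```

Under these assumptions the Poisson probability collapses quickly once k moves far above the mean, whereas the power law keeps a heavy tail, which is why highly connected hubs are expected in a scale-free network but are essentially absent from a random one.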

Biological systems are modular

Intuitively, biology is organized as functional modules in a hierarchical fashion (Barabasi and Oltvai, 2004). The protein-protein interaction network itself is a module in this hierarchy. A functional module is a cellular entity that performs a certain function that is relatively independent from other modules (Hartwell et al., 1999). This intuitive and simple concept makes it justifiable to study cell functions through a step-by-step approach. In other words, since functional modules are relatively independent in function, we can try to first understand the behaviors of individual modules and then put them together to assemble the big picture. Even though many questions need to be addressed to really understand the functional organization of the cell, the identification of the functional modules is an important first step. Of course, with protein-protein interactions being simplified into nodes and connections, it is almost impossible to dissect the true functional organization of the cell. But given this representative abstraction, it is still possible to gain insights into the high-level organization of cell functions by studying the connectivities among proteins.

Recently a number of network partition algorithms have been designed and applied to find community and modular structures in complex networks, including gene transcription control networks and protein-protein interaction networks (Girvan and Newman, 2002) (Rives and Galitski, 2003) (Spirin and Mirny, 2003) (Tanay et al., 2004) (Pereira-Leal et al., 2004) (Xiong et al., 2005). One common theme shared by most of these studies is that networks are represented as unweighted graphs. In other words, all the interactions are considered to be the same. Even though they do capture the essential features of many complex networks, unweighted graph representations impose a big limitation on the study of protein-protein interaction networks. Protein-protein interaction networks have a very high degree of inter-modular crosstalk (Rives and Galitski, 2003), which makes it very difficult to partition them based solely on topology. On the other hand, in recent years high-throughput studies have generated a huge amount of functional genomic data. In particular, microarray technology has been successfully applied to study the expression of yeast genes under all kinds of conditions, and the results of these studies are centralized for public access (Ball et al., 2005). It is therefore highly desirable to develop new methods that take advantage of these functional genomics data and partition the proteome network in a biologically more meaningful way.

Without doubt, validating the biological relevance of the identified modules is a very important step. One of the most important validations is through growth phenotype. Since a functional module is a relatively independent functional unit, genes in the same functional module are likely to exhibit similar phenotypes. Therefore, comparing gene phenotypes is one way to tell whether the detected modules are truly biologically meaningful. After all, evidence of modularity at the phenotype level will justify a divide-and-conquer approach towards systems biology.

The organizations of functional modules

Of course, the identification of functional modules is just the beginning of understanding cell functions. To truly understand the dynamics of a complex network, it is key to understand the links between its subsystems (Barrat et al., 2004). Cellular networks are no exception. Even though a module is relatively independent in carrying out a specific function, responses to intracellular or extracellular perturbations usually involve multiple cellular processes or functional modules. It is the communication between functional modules, not the functional modules themselves, that shapes the behaviors of cells and organisms (Rives and Galitski, 2003) (Hartwell et al., 1999).

It can be imagined that understanding the connections between the functional modules is even more difficult than identifying the modules. One way to simplify the problem is to first draw a “sketch” of the module-module connections by focusing on the hubs, the proteins that are highly connected with other proteins. Since hubs interact with many other proteins in the network, they are the major components for communication in the cell (Barabasi and Oltvai, 2004). However, this does not mean that all the hubs are equally important in connecting the functional modules. In their study of the filamentation network of yeast, Rives and Galitski briefly discussed the roles of some specific proteins in intermodule communication (Rives and Galitski, 2003). From a more general perspective, Han and his colleagues proposed a dynamic model for the yeast proteome network, in which different types of hubs play different roles in the network organization (Han et al., 2004). Even though it is very thought-provoking, this model is based on indirect observations of mRNA expression profiles. With functional modules identified and validated, the roles of hubs in the organization of functional modules can be examined more closely. Furthermore, with such a sketch of module-module connections, more insights may be gained into the dynamics of module-module interactions by analyzing functional genomics data, particularly gene expression profiles under different conditions.

Network robustness

With this preliminary yet high-level understanding of the modular organization of the proteome network, one may again come back to the basic question regarding the relationship between gene and phenotype. Even though there is still a long way to go before we can describe the contribution of each single gene to the behavior of the whole cell, it is now possible to gain insights into the importance of genes to the whole system in a general sense by studying network robustness. Network robustness is the capacity of a network to maintain its topological and functional integrity in response to internal or external perturbations. Studies on the Barabasi-Albert model have shown that, compared to random networks, scale-free networks are more robust against random failure but are more vulnerable to hub attacks (Albert et al., 2000). However, the contributions of hubs to network robustness have not been studied from the point of view of modularity. Are the hubs connecting functional modules more important than others? What are the differences between a gene’s contributions to the robustness of the whole network vs. the robustness of a functional module? Answers to such questions may help us link a gene’s functional importance to its topological context.
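The error and attack tolerance result cited above can be illustrated with a toy simulation. The sketch below is not the breakdown simulation used later in this dissertation. It assumes a preferential-attachment graph of 1000 nodes with 2 links per new node, removes 15% of the nodes either at random or in order of decreasing connectivity (ranked once, without recomputing after each removal), and reports the size of the largest connected component. All function names and parameter values are hypothetical.

```python
import random

def preferential_attachment_graph(n, m, rng):
    """Grow a graph in which each new node links to m existing nodes,
    chosen with probability proportional to their current connectivity."""
    adj = {i: set() for i in range(n)}
    stubs = []  # every node appears once per incident edge
    for i in range(m + 1):  # seed: complete graph on m + 1 nodes
        for j in range(i + 1, m + 1):
            adj[i].add(j)
            adj[j].add(i)
            stubs.extend((i, j))
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(stubs))
        for t in targets:
            adj[new].add(t)
            adj[t].add(new)
            stubs.extend((t, new))
    return adj

def largest_component_size(adj, removed):
    """Size of the largest connected component after deleting `removed`."""
    seen = set(removed)
    best = 0
    for start in adj:
        if start in seen:
            continue
        seen.add(start)
        stack, size = [start], 0
        while stack:
            node = stack.pop()
            size += 1
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        best = max(best, size)
    return best

def compare_failure_and_attack(n=1000, m=2, fraction=0.15, seed=0):
    """Return (largest component after random failure, after hub attack)."""
    rng = random.Random(seed)
    adj = preferential_attachment_graph(n, m, rng)
    k = int(fraction * n)
    random_removed = rng.sample(sorted(adj), k)
    hubs_removed = sorted(adj, key=lambda v: len(adj[v]), reverse=True)[:k]
    return (largest_component_size(adj, random_removed),
            largest_component_size(adj, hubs_removed))

print(compare_failure_and_attack())
```

In this toy setting the largest component should remain much larger after random failure than after removal of the most connected nodes, which is the qualitative behavior described for scale-free networks.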

In addition, previous studies on network robustness only looked at large-scale perturbations that usually cause drastic changes, such as total network breakdown. In reality, such total collapse almost never happens. Instead, the most common changes to networks involve only very few components. For example, many diseases are caused by mutations of one or a few genes. How can network robustness be measured when only one or a few vertices are removed and no network breakdown occurs? Under such conditions, conventional measures such as connected components or network diameter are not applicable. Other mathematical concepts have to be borrowed for this purpose.

The outline of this dissertation

The fundamental goal of this dissertation is to identify the functional modules in the protein-protein interaction network of the budding yeast Saccharomyces cerevisiae, to reveal the organization of the functional modules, and to analyze the network robustness in the context of gene essentiality. Budding yeast is chosen as the model organism mainly because it is the most studied one, and comprehensive datasets of various types are accessible to the public. This dissertation combines approaches from graph theory, pattern recognition, and statistical mechanics to achieve these goals.

In Chapter 2, the known protein-protein interaction datasets of yeast are integrated, filtered and rigorously validated. In Chapter 3, statistical mechanics methods are used to characterize the scale-free and modular features of the yeast proteome network. In Chapter 4, the network is represented as a weighted graph by integrating gene expression datasets, and an algorithm is developed to partition the network into functional modules, which are then validated from multiple perspectives. In Chapter 5, the organization of functional modules is analyzed from the perspective of hub-hub interactions, and the roles of different subpopulations of hubs in module organization are analyzed. In Chapter 6, the network robustness is examined from both the network breakdown and the graph spectrum perspectives, and the relationship between network robustness and gene essentiality is explored. Finally, conclusions and discussions on future directions are given in Chapter 7.

CHAPTER 2

DATA INTEGRATION AND VALIDATION

Traditionally, protein-protein interactions are identified using biochemical or genetic methods. Co-immunoprecipitation (Co-IP), for example, is a popular biochemical method that uses one antibody to bring down the protein of interest and uses other antibodies to detect its interacting partners. Another powerful method, yeast two-hybrid screening, is a genetic approach. In this method, the DNA-binding domain and the transactivation domain of the transcription activator GAL4 are individually fused with two proteins of interest. If the two proteins physically interact, the two domains are brought into proximity to restore the function of GAL4, which can be detected through a reporter gene (Fields and Song, 1989). With the coming of the genomic era, new genes have been discovered at an increasingly fast pace, and the potential interactions between them remain to be identified. To meet this overwhelming challenge, a number of high-throughput techniques have been developed. Yeast two-hybrid screening, coupled with protein arrays and automation workstations, has been used to analyze large-scale protein-protein interactions in yeast (Uetz et al., 2000), worm (Li et al., 2004), and fly (Giot et al., 2003). In the Tandem Affinity Purification (TAP) method (Rigaut et al., 1999) and the High-throughput Mass-Spectrometry Protein Complex Identification (HMS-PCI) method (Mann et al., 2001), tagged proteins are used to purify protein complexes, which are then analyzed by mass spectrometry to identify the complex components (Gavin et al., 2002) (Ho et al., 2002). Indirect interactions between genes can be inferred from synthetic lethality (Tong et al., 2004) or mRNA co-expression (Ge et al., 2001). Still other methods use comparative genomics to predict gene-gene interactions (Enright et al., 1999) (Dandekar et al., 1998).

Powerful and high-throughput as they are, these systematic approaches have their own biases and drawbacks (von Mering et al., 2002). For example, in yeast two-hybrid screening, the targets are put in the nucleus, which may not be the physiological environment for many proteins. Self-activators may also produce false positives (Bader et al., 2004). Mass-spectrometry-coupled protein complex purification methods may miss transient and loose interactions, and tagging may disrupt protein complex formation (von Mering et al., 2002). Correlated mRNA expression analysis does not offer proof of physical interactions, and it is very sensitive to the choice of clustering algorithms as well as parameters. As a result of biased coverage and false positives, only about 2400 interactions are supported by more than one study (von Mering et al., 2002). Comparative assessment showed that none of these methods achieves a significant degree of reliability (Table 2.1). The number of consensus interactions supported by all these high-throughput studies is even smaller, at the level of hundreds (Bader et al., 2004). It is therefore a serious concern how these protein-protein interaction datasets should be used for data mining. This issue of reliability has been addressed by several studies in the past few years (Bader et al., 2004) (Chen et al., 2005) (Deane et al., 2002) (Deng et al., 2003) (D'Haeseleer and Church, 2004) (Patil and Nakamura, 2005) (Sprinzak et al., 2003) (von Mering et al., 2002). All these studies laid the ground for selecting a proteomics dataset that is both reliable and representative.

Method                 Total Interactions    Reliable Interactions    Reliable Percentage
TAP                    18028                 2062                     11%
Synthetic Lethality    886                   76                       9%
HMS-PCI                33015                 1931                     6%
Yeast Two-Hybrid       5126                  215                      4%
Prediction             7446                  194                      3%
Coexpression           16496                 355                      2%

Table 2.1: The reliability of representative high-throughput methods. Based on von Mering (von Mering et al., 2002). The reliable interactions are those given high confidence scores.

To further ensure the quality of the entire study, the selected dataset should be rigorously validated. Generally speaking, two interacting proteins are more likely to belong to the same biological process or to be localized to the same compartment of the cell. This was confirmed by comparative studies (von Mering et al., 2002), and therefore provides ways for independent validation. Another way to assess the quality of protein-protein interaction datasets relies on the fact that the degree distribution follows a power law in a scale-free network (Albert and Barabasi, 2002). Spurious interactions in a noisy dataset should occur randomly, and therefore the degree distribution of a noisy dataset will deviate from the expected power-law decay.
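One simple way to quantify how closely a degree distribution follows a power law is a least-squares line fit on log-log axes. The sketch below is a rough check under that assumption and is not necessarily the fitting procedure used for the datasets in this chapter; `fit_power_law` is a hypothetical helper.

```python
import math
from collections import Counter

def fit_power_law(degrees):
    """Fit log P(k) = slope * log k + c by ordinary least squares.

    Returns (slope, r_squared). A slope of -gamma with r_squared near 1
    is consistent with P(k) ~ k**(-gamma); a poor fit suggests deviation.
    """
    counts = Counter(d for d in degrees if d > 0)
    total = sum(counts.values())
    xs = [math.log(k) for k in counts]
    ys = [math.log(c / total) for c in counts.values()]
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx, sxy * sxy / (sxx * syy)

# Synthetic degrees whose frequencies are exactly proportional to k**-2.
example = [k for k in range(1, 7) for _ in range(3600 // (k * k))]
slope, r_squared = fit_power_law(example)
print(slope, r_squared)
```

Straight-line fitting on log-log axes is sensitive to sparsely populated high-degree bins, so for real data a binned or maximum-likelihood estimate may be preferable.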


Materials and Methods

Datasets. Protein-protein interaction datasets of Saccharomyces cerevisiae were downloaded from the publishers' websites, http://www.nature.com/nature/journal/v417/n6887/suppinfo/nature750.html and http://www.nature.com/nbt/journal/v22/n1/suppinfo/nbt924_S1.html, and from the researchers' website, http://helix.protein.osaka-u.ac.jp/htp/. The curated protein-protein interactions from the literature were downloaded from the

Comprehensive Yeast Genome Database (CYGD) at the Munich Information Center for

Protein Sciences (MIPS) (http://mips.gsf.de/genre/proj/yeast/). The protein localization datasets were downloaded from CYGD and the Saccharomyces Genome Database (SGD) (http://www.yeastgenome.org/). Gene annotations and grouping were downloaded from CYGD and the Gene Ontology (GO) website (http://www.geneontology.org).

Protein interaction intensity. The assessment of interaction intensity between annotation groups followed that of von Mering (von Mering et al., 2002). Briefly, each yeast gene was assigned a group according to either its cellular process annotation or its cellular component localization. Genes with ambiguous annotations were discarded. For each pair of groups, the number of interactions between the genes in the two groups was counted, and the maximum number of possible interactions between the two groups was calculated. The protein interaction intensity of two groups was defined as the ratio between the number of interactions and the maximum number of possible interactions.
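The intensity calculation described above can be sketched in Python as follows. This is a minimal illustration rather than the code used in this study: the function name, the input layout, and the choice of n(n-1)/2 as the maximum number of interactions within a single group (versus n1 x n2 between two different groups) are assumptions.

```python
from collections import Counter
from itertools import combinations_with_replacement


def interaction_intensity(group_of, interactions):
    """Return {(g1, g2): observed / maximum possible} for every pair of groups.

    group_of: dict mapping gene -> group label (ambiguous genes already removed).
    interactions: iterable of (gene_a, gene_b) pairs, treated as undirected.
    """
    sizes = Counter(group_of.values())
    observed = Counter()
    seen = set()
    for a, b in interactions:
        if a == b or a not in group_of or b not in group_of:
            continue  # skip self-interactions and genes without a group
        edge = frozenset((a, b))
        if edge in seen:
            continue  # count each undirected interaction only once
        seen.add(edge)
        pair = tuple(sorted((group_of[a], group_of[b])))
        observed[pair] += 1

    intensity = {}
    for g1, g2 in combinations_with_replacement(sorted(sizes), 2):
        if g1 == g2:
            # Assumption: within one group the maximum is "n choose 2".
            possible = sizes[g1] * (sizes[g1] - 1) // 2
        else:
            possible = sizes[g1] * sizes[g2]
        intensity[(g1, g2)] = observed[(g1, g2)] / possible if possible else 0.0
    return intensity
```

The diagonal entries (g, g) of the returned dictionary correspond to the diagonal triangles of the grey-scale plots, and the remaining entries to the off-diagonal squares.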

Functional similarity. The similarity of annotation between two interacting proteins was calculated based on information content (Lord et al., 2003a) (Lord et al.,

2003b). Briefly, assume that the number of genes annotated by a GO term T in an organism is a, and the total number of annotation terms for all the genes in the organism is b; then the information content of term T (with respect to the given organism) is C = -log(a/b).

The annotation similarity between two genes is the maximum value of the information contents of all the GO terms shared by the two genes.
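As an illustration of this definition, the following Python sketch computes the information content of each term and the similarity of two genes. The function names, the term labels and the use of the natural logarithm are assumptions; the text does not specify the logarithm base. The sketch also assumes that each gene's term set already includes the ancestors of its annotated terms.

```python
import math


def information_content(term_counts, total_annotations):
    """C(T) = -log(a / b), where a is the number of genes annotated by T
    and b is the total number of annotation terms in the organism."""
    return {
        term: -math.log(a / total_annotations)
        for term, a in term_counts.items()
        if a > 0
    }


def annotation_similarity(terms_a, terms_b, ic):
    """Maximum information content over the terms shared by two genes.

    Returns 0.0 when the genes share no term (an assumption of this sketch).
    """
    shared = set(terms_a) & set(terms_b)
    return max((ic[t] for t in shared), default=0.0)
```

A rarely used term yields a high information content, so two genes that share a very specific term score higher than two genes that share only a general one.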

Degree distribution fitting for scale-free network. The degree distribution fitting

followed Barabasi (Albert and Barabasi, 2002). Briefly, the connectivity (degree) of each

vertex in the network was counted. The distribution histogram of the connectivities of all

the vertices was computed, and the logarithm of the distribution was fitted by least squares to a straight line using the statistical software R. The coefficient of determination (R²) was computed to assess the quality of the fit.
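The fitting procedure can be sketched as below. The fit in this study was performed in R on a binned histogram; the Python version here is only an illustrative assumption that fits the density of each distinct degree without binning, and all function names are hypothetical.

```python
import math
from collections import Counter


def degree_counts(edges):
    """Connectivity (degree) of every vertex in an undirected edge list."""
    deg = Counter()
    for a, b in edges:
        deg[a] += 1
        deg[b] += 1
    return deg


def degree_density(edges):
    """Fraction of vertices having each observed degree k."""
    deg = degree_counts(edges)
    hist = Counter(deg.values())
    n = len(deg)
    return {k: c / n for k, c in sorted(hist.items())}


def fit_log_log(ks, densities):
    """Least-squares line ln(density) = intercept + slope * ln(k).

    Returns (slope, intercept, r_squared); for a power law the slope
    estimates -r, the negative of the scaling exponent.
    """
    xs = [math.log(k) for k in ks]
    ys = [math.log(d) for d in densities]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    syy = sum((y - my) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = my - slope * mx
    r_squared = (sxy * sxy) / (sxx * syy) if syy > 0 else 1.0
    return slope, intercept, r_squared
```

For example, `fit_log_log(*zip(*degree_density(edges).items()))` fits the densities returned by `degree_density` for a given edge list `edges`.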

Results

Data integration. Three representative studies addressed the issue of confidence in the protein-protein interaction datasets of Saccharomyces cerevisiae (Bader et al., 2004)

(Patil and Nakamura, 2005) (von Mering et al., 2002). Each of the three studies categorizes known interactions into high, medium or low confidence groups. As shown in

Table 2.2, in each of the studies, only a small portion of the evaluated interactions are highly reliable. In addition, only a small portion of the known yeast ORFs are covered in high confidence datasets. This is particularly true if the coverage of the dataset is high.

The high confidence interactions from the three datasets were selected and merged by taking their union. In addition, since these studies focused on high-throughput data, 1759 literature-based interactions were retrieved from the well-curated CYGD database and added to the dataset. With redundant entries removed, the final dataset contains 10936 unique interactions among 3409 known or hypothetical proteins of Saccharomyces cerevisiae.

Dataset                 von Mering (a)   Bader (b)       Patil (c)
Interactions  Total     78390            47783           12674
              High      2428 (3%)        5627 (12%)      5280 (42%)
              Medium    8448 (11%)       7102 (15%)      1019 (8%)
              Low       67514 (86%)      35054 (73%)     6375 (50%)
Proteins      Total     5313             6707            4202
              High      988 (19%)        2759 (41%)      2534 (60%)
              Medium    2288 (43%)       2544 (38%)      815 (19%)
              Low       5058 (95%)       4028 (60%)      3440 (82%)
a - (von Mering et al., 2002). b - (Bader et al., 2004). c - (Patil and Nakamura, 2005).

Table 2.2. The datasets of three representative studies

Since the proteins covered in the dataset account for only approximately two thirds of the yeast ORFs, it is important to ensure that no obvious bias exists in the representations of the functional categories in the cell. As shown in Table 2.3 and Figure

2.1, the protein coverage for each individual functional category varies between 31.94%

(DNA metabolism) and 100% (regulation). Of the 34 functional categories, 26 have a coverage of at least 50%. These results indicate that, even though not comprehensive, the final dataset has a broad and balanced coverage among the functional categories in the cell. It is worth noting that over one third of the involved proteins are currently not annotated.


Functional Category                          Involved Proteins   Total Proteins   Coverage (%)
amino acid and derivative metabolism         90                  125              72
Budding                                      34                  36               94
electron transport                           5                   11               45
energy pathways                              24                  41               59
lipid metabolism                             68                  133              51
membrane organization and biogenesis         2                   3                67
Morphogenesis                                26                  39               67
nuclear organization and biogenesis          8                   9                89
organelle organization and biogenesis        97                  136              71
protein biosynthesis                         129                 212              61
protein catabolism                           66                  97               68
protein modification                         95                  121              79
carbohydrate metabolism                      55                  83               66
response to stress                           46                  84               55
ribosome biogenesis and assembly             33                  47               70
RNA metabolism                               188                 216              87
signal transduction                          48                  66               73
Sporulation                                  22                  45               49
Transcription                                170                 215              79
Transport                                    318                 550              58
vitamin metabolism                           15                  25               60
Regulation                                   7                   7                100
Behavior                                     1                   2                50
cell cycle                                   141                 175              81
other cellular process                       56                  83               67
other development                            4                   6                67
other metabolism                             195                 296              66
other physiological process                  36                  97               37
cell homeostasis                             22                  46               47
cell wall organization and biogenesis        25                  54               47
coenzymes and prosthetic group metabolism    18                  50               36
Conjugation                                  16                  26               62
Cytokinesis                                  5                   11               45
DNA metabolism                               23                  72               32

Table 2.3. The protein coverage of the dataset among the functional categories


Figure 2.1. The protein coverage of the dataset among the functional categories


In the following sections, the reliability of the interaction dataset is evaluated from the perspectives of functional annotation, cellular colocalization and scale-free fitting.

Comparisons are carried out between the selected reliable dataset (referred to as

“reliable”) and other datasets containing unreliable interactions.

Validation against functional annotation. If two proteins interact with each other, they are likely to be in the same functional category; an interaction between two functionally unrelated proteins is likely a false positive (von Mering et al., 2002). The

yeast proteins of known functions were mapped into 16 high-level categories in the

CYGD annotation system, and the interactions were mapped to corresponding category

pairs. The interaction intensity for each category pair is visualized as a grey-scale plot. As shown in Figure 2.2, the plot for the medium/low-confidence dataset contains many dark spots in the off-diagonal region, which represent putative false positives. Out of 4466

interactions, only 1815 (40.64%) are between the same functional categories (diagonal

triangles). In contrast, the plot for the reliable dataset is much cleaner in the off-diagonal

region, and the dark spots are more concentrated on the diagonal. In fact, 538 (64.35%)

out of 836 interactions are between the same functional categories. The same experiment

using GO annotations yielded essentially the same results (Figure 2.3). In the

medium/low-confidence dataset, 2208 (26.67%) out of 8283 interactions are between the

same categories. In the reliable dataset the ratio is 754 (52.47%) out of 1437. These

results indicate that the selected dataset is enriched with reliable interactions.


Figure 2.2. The interaction intensities between the functional categories of the CYGD annotation system. Yeast genes are classified into groups according to the annotation in the CYGD database. In each plot, x and y axes are the cellular processes, and the grey scale of each intersecting area represents the interaction density between the two groups. The diagonal triangles represent the interactions between the same functional group, and the off-diagonal squares represent those between different groups. (A) The medium/low confidence dataset. (B) The reliable dataset.

Functional categories: A: metabolism B: energy C: cell cycle and DNA processing D: transcription E: protein synthesis F: protein fate G: protein with binding function or cofactor requirement H: protein activity regulation I: cellular transport, transport facilitation and transport routes J: cellular communication/signal transduction mechanism K: cell rescue, defense and virulence L: interaction with the cellular environment M: transposable elements, viral and plasmid proteins N: cell fate O: biogenesis of cellular components P: cell type differentiation


Figure 2.3. The interaction intensities between the functional categories of GO annotation system. Yeast genes are classified into groups according to the annotation in GO database. In each plot, x and y axes are the cellular processes, and the grey scale of each intersecting area represents the interaction density between the two groups. The diagonal triangles represent the interactions between the same functional group, and the off-diagonal squares represent those between different groups. (A) The medium/low confidence dataset. (B) The reliable dataset.

Functional categories: 0: amino acid and derivative metabolism 1: budding 2: carbohydrate metabolism 3: cell cycle 4: cell homeostasis 5: cell wall organization and biogenesis 6: coenzymes and prosthetic group metabolism 7: conjugation 8: cytokinesis 9: DNA metabolism 10: electron transport 11: energy pathways 12: lipid metabolism 13: membrane organization and biogenesis 14: morphogenesis 15: nuclear organization and biogenesis 16: organelle organization and biogenesis 17: protein biosynthesis 18: protein catabolism 19: protein modification 20: response to stress 21: ribosome biogenesis and assembly 22: RNA metabolism 23: signal transduction 24: sporulation 25: transcription 26: transport 27: vitamin metabolism 28: regulation 29: behavior 30: other cellular process 31: other development 32: other metabolism 33: other physiological process

Another measure for the functional similarity between two interacting proteins is

semantic similarity (Lord et al., 2003a) (Lord et al., 2003b). Two genes are functionally

similar to each other if they belong to a very specific pathway, and therefore share a GO term with high information content. In contrast, a GO term shared by two unrelated genes is likely to be less specific and therefore has low information content. Therefore the functional similarity between two proteins can be inferred from the information content of the most specific term shared by the two genes. As shown by the boxplot in

Figure 2.4, both the extreme value line and the 3rd quartile line are much higher in the high confidence dataset than in the medium or low confidence dataset. The medium and low confidence datasets also contain interactions with very high information content, but the boxplot indicates that these data points are likely outliers. Two-sample comparisons by the Wilcoxon test showed that, overall, the interacting proteins in the reliable dataset are functionally more similar to each other than those in the medium or low confidence dataset (p-values < 2.2e-16).

Validation against protein localization. For two proteins to physically interact with each other, they must be localized to the same compartment in the cell. Therefore an interaction is probably a false positive if the interacting proteins do not colocalize and thus have no chance to physically encounter each other. High-throughput localization studies (Kumar et al., 2002) made it possible to use this rationale to validate

protein-protein interactions.


Figure 2.4. Boxplot for the functional similarities between interacting proteins. The notch in the box indicates the sample median; the two box ends represent the first quartile and third quartile lines, respectively; the dashed lines extend to the minimum and the maximum, respectively; the isolated data points beyond the dashed lines are outliers.

Similar to the experiment of functional validation, yeast genes were grouped according to compartment localization, and protein interaction intensities were calculated for each pair of compartments. As shown in Figure 2.5, the medium/low confidence dataset contains many possibly spurious interactions represented by off-diagonal dark spots. Based on the localization annotation of CYGD, only 7464 (41.33%) out of 18056 interactions are between colocalized proteins. In contrast, the reliable dataset contains 2230 colocalized interacting pairs, which account for over 70% of the interactions.

Similar results were obtained using the localization data annotated in SGD (Figure 2.6).

Overall, these results confirm the high reliability of the selected interactions from the perspective of protein localization.


Figure 2.5. The interaction intensities between the localization categories of CYGD database. Yeast genes are classified into groups according to the localization annotations in the CYGD database. In each plot, x and y axes are the cellular compartments, and the grey scale of each intersecting area represents the interaction density between the two groups. Diagonal triangles represent interactions between colocalized proteins, and off-diagonal squares represent those between different compartments. (A) The medium/low confidence dataset. (B) The reliable dataset.

Cellular compartments: A: extracellular B: bud C: cell wall D: cell periphery E: plasma membrane F: endomembrane G: H: cytoskeleton I: ER J: Golgi K: transport vesicles L: nucleus M: mitochondria N: peroxisome O: endosome P: vacuole Q: lipid particles


Figure 2.6. The interaction intensities between the localization categories of SGD database. Yeast genes are classified into groups according to the localization annotations in the SGD database. In each plot, x and y axes are the cellular compartments, and the grey scale of each intersecting area represents the interaction density between the two groups. Diagonal triangles represent interactions between colocalized proteins, and off-diagonal squares represent those between different compartments. (A) The medium/low confidence dataset. (B) The reliable dataset.

Cellular compartments: A: nucleus B: C: nuclear-periphery D: ER E: ER-to-Golgi F: Golgi G: Golgi-to-ER H: Golgi-to-vacuole I: vacuole J: vacuolar-membrane K: endosome L: peroxisome M: N: actin O: microtubule P: spindle- Q: bud-neck R: cell-periphery S: cytoplasm T: lipid-particle U: punctate-composite


Figure 2.7. The scale-free fitting of the datasets. The histogram of the connectivities in the reliable dataset (A) or the whole dataset (B) was computed with seven bins. The densities of the bins (y-axis) were then plotted against the bin midpoints (x-axis) on a logarithmic scale, and the best linear fit is shown as a straight line.

Validation through scale-free fitting. Recently many naturally occurring complex

networks have been shown to have a scale-free topology (Albert and Barabasi, 2002). In

particular, the protein-protein interaction network of yeast was also shown to be scale-

free (Jeong et al., 2001). In a scale-free network, the degree distribution of the vertices

follows a power law, i.e. $\rho(k) \sim k^{-r}$. On a logarithmic scale this distribution approximates a straight line, i.e. $\ln \rho(k) \sim -r \ln k$, where r is called the scaling exponent, a characteristic

factor of the network. A reliable dataset of protein-protein interactions will likely form a

network that fits well into the scale-free model, and therefore its degree distribution

should fit into a straight line. As shown in Figure 2.7, the linear fitting of degree

distribution is very good for the reliable dataset (R² = 0.95). This implies that the network conforms well to the scale-free topology. In contrast, the fitting quality dropped noticeably for the whole dataset (R² = 0.87). This is likely due to false interactions, which tend to be random and distort the scale-free topology. Taken together, these observations further confirm that the dataset used in this study is highly

reliable.

Discussion

Data quality is an important issue in data mining in any field. Bioinformatics and

computational biology are no exceptions. As a matter of fact, data obtained in functional

genomics, such as high-throughput proteomics and gene expression analysis, are

particularly prone to errors due to the difficulties in data confirmation and cross-

validation (Grunenfelder and Winzeler, 2002). Studies using Saccharomyces cerevisiae

as the model organism could benefit from the availability of multiple high-throughput

datasets, but they have to deal with the problem of low data quality. Recently a number

of studies have evaluated yeast proteomics data from the perspective of independent

evidence (von Mering et al., 2002), statistical and topological descriptors (Bader et al.,

2004), and combination of annotation, and protein domains (Patil

and Nakamura, 2005), among many others. This dissertation benefits from these studies

by taking the union of the highly reliable interactions to yield the largest reliable dataset

ever for the yeast protein-protein interaction network. The main reason that the union instead of the intersection of these evaluated interactions was used is the data size. Furthermore, by combining these datasets it is expected that the systematic errors, represented by the biases of different high-throughput methods, will be minimized. Of course, random errors still remain in the final dataset, but they are likely to be insignificant due to the filtering in these studies.

It should be noted that the protein coverage of the dataset is more representative than comprehensive. The 3409 proteins account for only about two thirds of the 5585 confirmed ORFs of Saccharomyces cerevisiae. It is possible to lower the filter threshold to increase the coverage. However, even if all the medium-confidence interactions are included, the number of proteins covered is still less than 4000, while the number of interactions is more than doubled (data not shown). This suggests that the spurious interactions between highly abundant proteins are so prevalent that even sophisticated methods (either detection or computational filtering) are unable to recognize the weak signals of transient interactions or the interactions between low abundance proteins.

In the experiments that use interaction intensity to validate the dataset, off-diagonal dark spots are regarded as spurious interactions in general. However, this is not always the case. Interactions between different functional groups certainly occur in the cell, and they should occur on a non-negligible scale. For example, in Figure 2.2B, group J proteins interact intensively with quite a few other groups. This is to be expected, since group J represents cellular communication and signal transduction. Similarly, proteins that do not colocalize may also interact, since the cell is a highly dynamic system. For example, in Figure 2.5B, group F strongly interacts with groups J, K and P. Again this is not surprising, since F represents endomembrane, while J, K and P represent Golgi, transport vesicles and vacuole, respectively. They all belong to the intracellular membrane system, and they interact with each other dynamically and constantly.

Therefore, such experiments should be used for validating the dataset as a whole, not individual interactions.

Finally, it is worth emphasizing again that the dataset obtained here and used in the following studies is by no means comprehensive or accurate. It is not comprehensive since it only covers about two thirds of the yeast genes. It is not accurate in that, besides false positives and false negatives, no details about the interactions are represented. For example, a protein kinase may activate or repress the function of its target, which will not be distinguishable in the dataset of simple interactions. But still, this dataset, represented as a network in which proteins are nodes and interactions are connections, does capture the majority of the basic relationships between interacting proteins. Given such a high- level view of the protein-protein interactions in yeast, many questions can still be addressed. The next chapter will try to answer the question, what kind of network is it from the topology perspective?


CHAPTER 3

NETWORK TOPOLOGY

Complex networks can be mapped to a few different types of graphs in graph theory. Some of the earliest works on phase transitions led to the proposal of random graph theory. A random graph can be generated by randomly connecting two vertices with certain probability. In a random network, the connectivities of the vertices follow a

Poisson distribution. Some engineering, social and biological networks were found to be random networks (Albert and Barabasi, 2002). In 1999 Barabasi and Albert discovered that in many real-life networks, the connectivities of the vertices do not conform to a Poisson distribution, but follow a power law distribution (Barabasi and Albert, 1999). They named such networks scale-free networks, and proposed a model to explain the origin of the power law distribution (Albert and Barabasi, 2002). Scale-free networks differ greatly from random networks in network robustness (Albert et al.,

2000). Recently, modular networks started to catch the attention of scientists (Girvan and

Newman, 2002) (Ihmels et al., 2002) (Newman and Girvan, 2004). In a modular network, the topology shows patterns of community structures. A community, or a module, is a subgraph (subnetwork) in which components have more connections with each other than with other vertices in the network.


Different types of networks have different topologies, which also determine the

dynamics of network behaviors. For example, in a scale-free network, the small fraction

of vertices with high connectivities (hubs) is critical for its robustness. On the other hand, in a modular network the robustness may be determined at both module and network levels. Given the protein-protein interaction network of yeast, the first natural question to ask is, what is its topology? Since networks of different topologies have different behaviors, answers to this question may dictate what to study and how to study this network. Even though the concept that biological systems are modular is very intuitive (Hartwell et al., 1999), independent examinations are still needed to show that modularity indeed exists in the yeast proteome network. In this chapter, the protein-protein interaction network of Saccharomyces cerevisiae is characterized by comparing it side-by-side with computer-generated networks with the same number of vertices, (approximately) the same number of edges, but different topologies. The assessment covers four characteristic features of networks, namely the connectivity distribution, the graph spectrum, the shortest path length, and the clustering coefficient.

Materials and Methods

Computer generated networks. First, the yeast protein-protein interaction network is represented as an undirected graph G(V, E), where V represents the proteins and E represents the interactions. Then the connection probability of the entire network is calculated by

(i) $p = \frac{2|E|}{|V|(|V| - 1)}$

where |V| and |E| are the numbers of vertices and edges, which gives p = 0.00188. For the random network, first a set of 3409 vertices was generated.

Then for each pair of vertices, a connection was made with probability p=0.00188. The

scale-free network was generated according to the preferential attachment model

proposed by Barabasi and Albert (Albert and Barabasi, 2002). The initial number of

vertices was set to m0=5. For each iteration, the number of connections generated for the added vertex is m=6, which is the integer value that is closest to 6.4, the average number of interactions in the yeast proteome network. For the modular network, first a set of

3409 vertices were randomly divided into groups (modules), the size of which was drawn

from a uniform distribution in [10, 50]. The range of this distribution was chosen based

on the size range of functional modules that was proposed previously (Spirin and Mirny,

2003), and it was adjusted a little in order to get the desired number of edges when

combined with the connection probabilities (see below). Then for each pair of vertices, a

connection was added with probability p_in if the two vertices were in the same group, or p_out otherwise. For each module, p_in was drawn from a uniform distribution in [0.01, 0.3]; p_out was set to 0.0004. The approximate ratio of p_in to p_out was chosen as follows. The average size of a module is (10+50)/2 = 30. In a network of 3409 vertices, there are about 110 modules. If p_in and p_out are equal, each vertex will be about 110 times more likely to connect with vertices outside its module than with vertices inside it, in which case the definition of a module is violated. To obtain a good modular structure, p_in should be significantly larger than 110 times p_out.

The actual values of p_in and p_out were chosen as stated in order to get the desired number of edges. For both the module size and the connectivity probability, randomized values


were used to introduce certain variations into the topology so that the obtained network is

more likely to represent real life networks. For each type of network, the procedure was

repeated multiple times until the number of edges was close enough to the number of edges

in the yeast network. The final networks used are a random network with exactly 10936

edges, a scale-free network with 10416 edges, and a modular network with 10965 edges.
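The generation of the random and modular networks described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the program used in this study: the function names are hypothetical, the last module may be smaller than the lower bound of the size range, and the preferential attachment network is omitted.

```python
import random
from itertools import combinations


def connection_probability(n_vertices, n_edges):
    """Equation (i): p = 2E / (V (V - 1))."""
    return 2 * n_edges / (n_vertices * (n_vertices - 1))


def random_network(n, p, rng):
    """Connect every pair of the n vertices independently with probability p."""
    return [(u, v) for u, v in combinations(range(n), 2) if rng.random() < p]


def modular_network(n, rng, size_range=(10, 50), p_in_range=(0.01, 0.3),
                    p_out=0.0004):
    """Randomly partition n vertices into modules, then add edges with a
    per-module probability p_in inside modules and p_out between modules."""
    vertices = list(range(n))
    rng.shuffle(vertices)
    module_of, p_in = {}, []
    start = 0
    while start < n:
        size = rng.randint(*size_range)  # module size, uniform in the range
        for v in vertices[start:start + size]:
            module_of[v] = len(p_in)     # index of the module being filled
        p_in.append(rng.uniform(*p_in_range))
        start += size
    edges = []
    for u, v in combinations(range(n), 2):
        same = module_of[u] == module_of[v]
        p = p_in[module_of[u]] if same else p_out
        if rng.random() < p:
            edges.append((u, v))
    return edges, module_of
```

With `rng = random.Random(seed)`, calling `modular_network(3409, rng)` for different seeds and keeping a network whose edge count is close to 10936 mirrors the repeated generation described in the text. Looping over all vertex pairs is simple but takes a few seconds at this size.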

Graph spectral density. For each network, first the adjacency matrix of size 3409

by 3409 was generated. Then the 3409 eigenvalues were computed using the R statistics package and sorted to give the graph spectrum of the network. The density of the spectrum was computed in R with a Gaussian kernel and a bandwidth of 0.1. A normalization factor was calculated to rescale the axes,

(ii) $f = Np(1 - p)$

where N = 3409 and p = 0.00188.

Clustering coefficient. The clustering coefficient of a vertex in a network measures the inter-connectivity between its neighbors. For a given vertex P which connects with k other vertices Q_i (i = 1, 2, …, k), the maximum number of possible connections between the neighbors is c_max = k(k-1)/2. If the actual number of connections between the neighbors is q, the clustering coefficient of P is defined as the ratio of q to c_max:

(iii) $C = \frac{2q}{k(k-1)}$
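The definition above translates directly into a short Python sketch. The function names are hypothetical, and returning zero for vertices with fewer than two neighbors (where the ratio is undefined) is an assumption.

```python
from itertools import combinations


def build_adjacency(edges):
    """Adjacency sets of an undirected graph; self-loops are ignored."""
    adj = {}
    for a, b in edges:
        if a == b:
            continue
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def clustering_coefficient(adj, vertex):
    """C = 2q / (k (k - 1)), where k is the number of neighbors of the
    vertex and q the number of connections among those neighbors."""
    neighbors = adj.get(vertex, set())
    k = len(neighbors)
    if k < 2:
        return 0.0  # assumption: the undefined case is reported as zero
    q = sum(1 for u, v in combinations(neighbors, 2) if v in adj[u])
    return 2 * q / (k * (k - 1))
```

For a triangle with one extra vertex attached to a corner, the two plain corners have C = 1, the corner carrying the extra vertex has C = 1/3, and the extra vertex has C = 0.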


Results

Connectivity distribution. As stated above, the distribution of connectivity

distinguishes a scale-free network. The power law distribution of connectivity states that,

in a scale-free network, the probability that any vertex has a connectivity of k decays as a power of k, i.e. $P(k) \sim k^{-r}$, where r is called the scaling exponent. Plotted at logarithmic scale, the distribution approximates a straight line of slope -r. To examine this property, the connectivity distributions were compared between the yeast proteome network and those of the computer-generated networks. As expected, the histogram of the connectivities of the random network approximates a Poisson distribution (Figure

3.1A), while that of the scale-free network has a power law tail (Figure 3.1B, E). The distribution of the modular network is similar to that of the random network, but the

frequencies of low connectivities are much higher (Figure 3.1C). The histogram for the

yeast proteome network is very similar to that of the scale-free network, with a long

power law tail (Figure 3.1D). At logarithmic scale, the distribution approximates a

straight line of slope –2.35 (Figure 3.1F). This characteristic exponent value is very close

to those of metabolic and protein networks studied previously (Jeong et al., 2000) (Jeong

et al., 2001). These results indicate that the protein-protein interaction network of yeast

has properties typical of scale-free networks.


Figure 3.1. The connectivity distribution of the networks. (A) The random network. (B) The scale-free network. (C) The modular network. (D) The yeast proteome network. (E) The density of connectivity in the scale-free network at logarithmic scale. The best linear model is shown. (F) The density of connectivity in the yeast proteome network at logarithmic scale. The best linear model is shown.

Graph spectrum. The spectrum of a graph is the set of eigenvalues of its adjacency matrix. The spectral density of a graph is directly indicative of its topological features. For a random graph, the spectral density resembles a semicircle. But for a

scale-free network, the spectral density shows a high and sharp peak at zero (Albert and

Barabasi, 2002). As shown in Figure 3.2, the spectral density of the computer generated

random network has a shape resembling a semicircle. The modular network has a similar

spectral density profile. In contrast, the spectral density of the yeast protein-protein

interaction network has a peak well above the semicircle, and is similar to the scale-free

network. However, the peak is not as sharp as that of the scale-free network, and the

height of the peak is also lower. This observation suggests that the yeast proteome

network may have topological features of both scale-free networks and modular (or

random) networks.

Shortest path length and diameter. A path in a graph is a set of consecutive

edges from a source vertex to a target vertex. Multiple paths may exist between any two

vertices. The shortest path length between two vertices is the distance of the path that is

the shortest among all the possible paths. The average of all the pairwise shortest path lengths was first analyzed in the studies that are now popularly known by the terms “six degrees of separation” and “small world” (Almaas et al., 2002) (Watts and Strogatz, 1998).

Likewise, the maximum of all the pairwise shortest path lengths is another measure for

the small-world property, and is called network diameter.


Figure 3.2. The spectral density of the networks. For the purpose of showing the semicircle distribution of the random network, the x-axis (spectrum) is shrunk by the normalization factor, and the y-axis (density) is expanded by the normalization factor.

To examine the small world property of the yeast proteome network, the Floyd-Warshall all-pairs shortest path algorithm was applied to compute the pairwise shortest path distances for each of the four networks. As shown in Figure 3.3 and Table

3.1, the scale-free network has the smallest average path length and network diameter.

The random network has the same network diameter as the scale-free network but a larger average path length. The modular network has an even larger diameter and a larger average path length. Interestingly, the diameter of the yeast proteome network is 13, which is the largest of all four networks. Its average path length is also longer than that of the random or scale-free network, and is close to that of the modular network. The distribution of the shortest distance in the yeast proteome network is also broader than that of the scale-free network or the random network, and is more similar to that of the modular network.

These results suggest that significant modularity may exist in the protein-protein interaction network of Saccharomyces cerevisiae.
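The all-pairs computation of the average shortest path length and the diameter can be sketched as below. This is a plain Python illustration with hypothetical function names. It assumes vertices are labelled 0 to n-1, every edge has length 1, and unreachable pairs are excluded from both statistics. The cubic running time makes pure Python slow for a network of 3409 vertices, so the sketch is meant for small examples.

```python
import math


def floyd_warshall(n, edges):
    """All-pairs shortest path lengths of an undirected, unweighted graph."""
    dist = [[math.inf] * n for _ in range(n)]
    for i in range(n):
        dist[i][i] = 0
    for u, v in edges:
        dist[u][v] = dist[v][u] = 1
    for k in range(n):
        dk = dist[k]
        for i in range(n):
            dik = dist[i][k]
            if dik == math.inf:
                continue  # no path from i through k
            di = dist[i]
            for j in range(n):
                if dik + dk[j] < di[j]:
                    di[j] = dik + dk[j]
    return dist


def path_statistics(dist):
    """Return (average shortest path length, diameter) over connected pairs."""
    finite = [
        d
        for i, row in enumerate(dist)
        for j, d in enumerate(row)
        if i < j and d != math.inf
    ]
    if not finite:
        return 0.0, 0
    return sum(finite) / len(finite), max(finite)
```

For a chain of four vertices, `path_statistics(floyd_warshall(4, [(0, 1), (1, 2), (2, 3)]))` gives an average of 10/6 and a diameter of 3.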

Figure 3.3. The shortest distance distribution of the networks

Network                     Yeast   Modular   Scale-Free   Random
Average shortest distance   4.9     5.2       3.8          4.5
Diameter                    13      10        8            8

Table 3.1. The diameter and shortest path length of the networks


Clustering coefficients. The clustering coefficient of a vertex in a network is a measure of the inter-connectivity between its neighbors. In the context of a subgraph, high clustering coefficients among the vertices indicate a densely connected subgraph; in other words, the vertices form a local community. The clustering coefficient is therefore yet another way to characterize complex networks. Random networks have very low clustering coefficients overall, while scale-free networks and modular networks have higher clustering coefficients (Barabasi and Oltvai, 2004) (Albert and Barabasi, 2002). The results in this chapter so far suggest that the yeast proteome network is both scale-free and modular (Figure 3.2, 3.3). What do the clustering coefficients tell us about its topology? As shown in Figure 3.4 and Table 3.2, over 95% of the vertices in the random network have a clustering coefficient of zero. The scale-free network has overall higher clustering coefficients than the random network, but the majority of them are still zero. As expected, the modular network has much higher clustering coefficients overall.

Interestingly, the clustering coefficients in the yeast network are much higher than those of the scale-free network and are very close to those of the modular network. The distribution of the clustering coefficients is also very similar to that of the modular network, with essentially no outliers. Since the clustering coefficient is a direct indicator of local community structure, this observation strongly suggests that the topology of the yeast proteome network is modular.
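For illustration, the per-vertex quantity and the two summaries reported in Table 3.2 can be sketched as follows. The sketch assumes the standard local clustering coefficient (the number of links among the neighbors of a vertex divided by the number of possible links among them) and assigns zero to vertices with fewer than two neighbors; the text does not spell out either convention. The function name and the toy graph are hypothetical.

```python
def clustering_coefficients(adjacency):
    """Local clustering coefficient of every vertex.

    adjacency: dict mapping each vertex to the set of its neighbors.
    Assumption: vertices with fewer than two neighbors get a coefficient of 0.
    """
    coefficients = {}
    for vertex, neighbors in adjacency.items():
        k = len(neighbors)
        if k < 2:
            coefficients[vertex] = 0.0
            continue
        ordered = sorted(neighbors)
        # Count the edges that exist among the neighbors of this vertex.
        links = sum(1
                    for i, a in enumerate(ordered)
                    for b in ordered[i + 1:]
                    if b in adjacency[a])
        coefficients[vertex] = 2.0 * links / (k * (k - 1))
    return coefficients

# Hypothetical toy graph: a triangle A-B-C with a pendant vertex D on C.
toy = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}
cc = clustering_coefficients(toy)
nonzero_percent = 100.0 * sum(1 for c in cc.values() if c > 0) / len(cc)
mean_cc = sum(cc.values()) / len(cc)
```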


Figure 3.4. The clustering coefficients of the networks

Network        Yeast   Modular   Scale-Free   Random
Non-zero (%)   70      84        23           4
Mean           0.33    0.35      0.02         0

Table 3.2. The clustering coefficients of the networks

Discussion

The protein-protein interaction network of Saccharomyces cerevisiae has been found to exhibit scale-free property (Jeong et al., 2001). However, that observation was based on a small dataset of 2240 interactions between 1870 yeast proteins. The network covered less than one third of the yeast proteome. In this study, the scale-free topology is confirmed on a much larger dataset, which contains five times as many interactions and


covers almost twice as many proteins. This is the first time the scale-free feature of the

yeast proteome network is observed on a reliable and representative dataset.

Other than the scale-free property, little is known about the topology of the yeast

proteome network. A number of studies focused on finding functional modules in the network (Spirin and Mirny, 2003) (Tanay et al., 2004) (Xiong et al., 2005). However,

none of these works had looked at the overall characteristics of the network topology for

signs of modularity. In a way the modularity was assumed to exist in the network and

algorithms were applied directly to look for functional modules. No matter what

topology it has, a network will give a set of modules as the result of any given partition

algorithm. However, if a network, such as a random network, does not have the property

of modularity, the identified modules are not likely to be very meaningful. Results in

this study for the first time indicate that, besides being scale-free, the yeast proteome

network is also modular. The distribution of the shortest path length significantly

deviates from that of typical small-world networks, and is similar to that of modular

networks (Figure 3.3). More importantly, the clustering coefficients of the proteins are more than an order of magnitude higher than those of the random or scale-free networks, and are very close to those of the modular network (Figure 3.4). These results lay the groundwork for

identifying the functional modules in the yeast proteome network, which is described in

Chapter 4.


CHAPTER 4

MODULE DETECTION AND VALIDATION

Biology is organized as functional modules. As a critical level of the biological hierarchy, functional modules are cellular entities that perform certain biological

functions, which are relatively independent from each other (Barabasi and Oltvai, 2004)

(Hartwell et al., 1999). Revealing modular structures in biological networks will help us

tremendously in understanding how cells function (Hartwell et al., 1999) (Bork et al.,

2004). Many questions remain to be answered, but the detection of the functional

modules is a first step.

Recently a number of network partition algorithms have been designed to find

community and modular structures in complex networks. On the basis of shortest path

algorithm in graph theory, Girvan and Newman generalized the concept of vertex

betweenness to edges to distinguish between inter-community edges and intra-community edges. They designed an algorithm that iteratively removes the edges of the highest betweenness until a given network breaks into the desired number of clusters (Girvan

and Newman, 2002). Building on this work, Parisi and colleagues strengthened the

definition of community and proposed a local topology-based concept of “edge clustering coefficient” to replace the global edge betweenness measurement (Radicchi et al., 2004).


In another study, using shortest-distance as a metric, Rives and Galitski applied a

hierarchical clustering algorithm to reveal the modular organization of yeast signaling

networks (Rives and Galitski, 2003). Spirin and Mirny combined clique detection,

superparamagnetic clustering (SPC) and Monte Carlo optimization (MC) to search for

functional modules in the yeast protein network (Spirin and Mirny, 2003). Berg and

Lassig used a probabilistic model to expand the motif concept and proposed a local graph

alignment algorithm to detect such probabilistic motifs in the transcription network of E.

coli (Berg and Lassig, 2004). More recently, Xiong and colleagues applied an

association pattern discovery method to find the “hypercliques” (functional modules) in

the yeast proteome network (Xiong et al., 2005). One common theme shared by these

studies is that networks were represented as unweighted graphs. Even though they do

capture essential features of many complex networks, unweighted graph representations

will impose a big limitation on the study of biological networks. Protein-protein

interaction networks, in particular, have a very high degree of inter-module crosstalk

(Rives and Galitski, 2003), which makes it very difficult to partition them using

algorithms based solely on topology. Some recent works do take this into consideration

and use weighted graph representations. Shamir and his colleagues applied a biclustering

algorithm to the integrated genomic data to partition the molecular network of yeast

(Tanay et al., 2004). However, their weighting scheme is applied on the bipartite graph

to represent the level of association between genes and properties, not between pairs of

interacting genes. Another interesting work is from Ouzounis’s group (Pereira-Leal et al., 2004). They first transformed the yeast protein interaction network into a line graph,


and then applied a graph flow-based clustering algorithm to find functional modules. In

their work, the weight of an edge represents the level of confidence attributed to that

interaction, which may not indicate the functional correlation between the two proteins.

In recent years high-throughput studies have generated a huge amount of functional

genomic data. In particular, microarray technology has been applied to study yeast gene

expressions under all kinds of conditions, and the results of these studies are centralized

for public access (Ball et al., 2005). It is therefore highly desirable to develop new methods that take advantage of functional genomics information and partition

protein-protein interaction networks in a biologically more meaningful way.

Given that the protein-protein interactions of yeast are represented as a highly abstracted network, the modular structure identified this way may not necessarily represent the actual functional organization in the cell. The biological relevance of the modules has to be validated against independent knowledge, such as functional genomics information. In particular, gene phenotype may be used to rigorously examine whether the modules are biologically meaningful. Since genes in the same functional module are involved in a relatively independent cellular function, they may carry similar phenotypes.

Such a validation is very important, since it is the modularity at the phenotypic level, not that at the topological level, that actually interests biologists.

Materials and Methods

Data sources. The gene expression datasets of Saccharomyces cerevisiae were retrieved from the Expression Connection database at SGD (http://db.yeastgenome.org/cgi-bin/expression/expressionConnection.pl). The yeast gene deletion phenotype dataset was retrieved from the authors’ website (http://genomics.lbl.gov/YeastFitnessData/websitefiles/). The protein complex datasets were downloaded from SGD (ftp://genome-ftp.stanford.edu/pub/yeast/data_download/literature_curation/go_protein_complex_slim.tab) and MIPS (http://mips.gsf.de/genre/proj/yeast/).

Weighted graph representation. The protein-protein interaction network of yeast is represented as a weighted graph G = (V, E). The vertices of the graph are the set of unique proteins, and therefore |V| = 3409. The edges of the graph are the interactions, and therefore |E| = 10899.

To add weights to the edges, a total of 265 microarray datasets were downloaded from Saccharomyces Genome Database (SGD). The raw data are expression change ratios, which were transformed into Z-scores so that data from different experiments were comparable. If the expression of a given gene g in a microarray experiment m is changed by the ratio r, the normalized Z-score is

i)  $Z_g^m = (r - \mu) / \sigma$

where µ is the mean of all the data in that experiment, and σ is the standard deviation.

The edge weight is defined as the average of the Z-score differences over all the experiments. For a given interaction between protein Pi and protein Pj, the weight is

ii)  $W_{i,j} = \frac{1}{n} \sum_{m=1}^{n} \left( Z_i^m - Z_j^m \right)$

where n is the total number of microarray experiments in the dataset. This way the weight represents the “dissimilarity” between the expression profiles of two genes, which is the equivalent of “distance” in graph theory.
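The normalization and weighting steps can be sketched as below. This is only an illustration under stated assumptions: the population standard deviation is used for σ, every gene is assumed to have a value in every experiment, and the difference in equation (ii) is taken exactly as written, with no absolute value or squaring. None of these choices is confirmed by the text, and the function names and toy data are hypothetical.

```python
from statistics import mean, pstdev

def z_scores(ratios):
    """Convert the expression change ratios of one experiment into Z-scores.

    ratios: dict mapping gene name -> expression change ratio r.
    Assumption: sigma is the population standard deviation of the experiment.
    """
    values = list(ratios.values())
    mu = mean(values)
    sigma = pstdev(values)
    return {gene: (r - mu) / sigma for gene, r in ratios.items()}

def edge_weight(z_by_experiment, gene_i, gene_j):
    """Average Z-score difference between two genes over all experiments.

    Assumption: the signed difference is averaged, as the formula is printed.
    """
    differences = [z[gene_i] - z[gene_j] for z in z_by_experiment]
    return sum(differences) / len(differences)

# Hypothetical toy data: two experiments, three genes.
experiments = [
    {"g1": 1.0, "g2": 2.0, "g3": 3.0},
    {"g1": 2.0, "g2": 2.0, "g3": 5.0},
]
normalized = [z_scores(experiment) for experiment in experiments]
w_12 = edge_weight(normalized, "g1", "g2")
```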

Betweenness-based partitioning algorithm for weighted graph. Girvan and

Newman first proposed the concept of edge betweenness in the context of network community (Girvan and Newman, 2002). The idea was that inter-community edges are more likely to be on some shortest paths than intra-community edges. Therefore by computing the all-against-all shortest paths of a graph and calculating the number of times each edge is traveled, one can identify the linkers between communities. By removing these linkers step-by-step one would eventually obtain the community structure of a graph, which is represented as a hierarchical tree (Girvan and Newman, 2002). This algorithm (GN for short) is intuitively very appealing. However, not all interactions are equally important within a network. Some interactions may be used more frequently than others. With the yeast protein-protein interaction network being represented as a weighted graph, the GN algorithm was extended so that the shortest path was based on edge weights. The all-against-all shortest path was computed using the Floyd-Warshall algorithm.
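As a minimal sketch of the weighted extension, the code below computes all-against-all shortest paths with the Floyd-Warshall algorithm and then, for each edge, counts the vertex pairs that have a shortest path running through it. It assumes strictly positive edge weights, and it counts a pair once even if several equally short paths exist, which simplifies the original GN bookkeeping. The function names and the toy graph are hypothetical.

```python
INF = float("inf")

def floyd_warshall(vertices, weights):
    """All-pairs shortest distances; weights maps (u, v) -> positive weight."""
    dist = {u: {v: (0.0 if u == v else INF) for v in vertices} for u in vertices}
    for (u, v), w in weights.items():
        dist[u][v] = dist[v][u] = min(dist[u][v], w)
    for k in vertices:
        for i in vertices:
            for j in vertices:
                if dist[i][k] + dist[k][j] < dist[i][j]:
                    dist[i][j] = dist[i][k] + dist[k][j]
    return dist

def edge_betweenness(vertices, weights, tol=1e-9):
    """Number of vertex pairs with a shortest path through each edge."""
    dist = floyd_warshall(vertices, weights)
    counts = {}
    for (u, v), w in weights.items():
        pairs = 0
        for index, s in enumerate(vertices):
            for t in vertices[index + 1:]:
                d = dist[s][t]
                if d == INF:
                    continue
                # The edge lies on a shortest s-t path if going through it
                # (in either direction) is as short as the shortest distance.
                via_uv = dist[s][u] + w + dist[v][t]
                via_vu = dist[s][v] + w + dist[u][t]
                if abs(via_uv - d) <= tol or abs(via_vu - d) <= tol:
                    pairs += 1
        counts[(u, v)] = pairs
    return counts

# Hypothetical toy graph: two triangles joined by the bridge C-D.
toy_vertices = ["A", "B", "C", "D", "E", "F"]
toy_weights = {("A", "B"): 1.0, ("B", "C"): 1.0, ("A", "C"): 1.0,
               ("C", "D"): 1.0,
               ("D", "E"): 1.0, ("E", "F"): 1.0, ("D", "F"): 1.0}
betweenness = edge_betweenness(toy_vertices, toy_weights)
bridge = max(betweenness, key=betweenness.get)
```

In a full partitioning run the edge with the highest score would be removed and the scores recomputed, which is repeated until the stopping condition is met.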

Besides this extension, an important modification was also made to the measurement of edge betweenness. In the GN algorithm, the betweenness of an edge is essentially the number of all-against-all shortest paths that run through it. In the example graph shown in Figure 4.1A, there are two subgraphs. In the left subgraph the edge CD has a betweenness of 24. This is because it is the only bridge that connects vertices A, B, C and vertices D, E, F, G, H, I, J, K, and therefore a total of 3 x 8 = 24 distinct all-against-all shortest paths pass through it. Similarly, in the right subgraph, the edge ST has a betweenness of 4 x 5 = 20. It can be shown that, in the whole graph, edge CD has the highest betweenness. Therefore edge CD is removed at this step. However, by simple visual inspection we tend to say that edge ST is a better candidate, since it connects two communities {P, Q, R, S} and {T, U, V, W, X}, and that the left subgraph is a separate community. From the topological point of view, the original definition of betweenness may lead to unbalanced partitioning under certain circumstances.

Figure 4.1. Edge betweenness based on non-redundant shortest paths. (A) An example graph containing two subgraphs. (B) A bipartite graph representing all the shortest paths passing through edge ST. Vertices P, Q, R, S are the end vertices on one side of the edge, and vertices T, U, V, W, X are on the other side. An edge is drawn between two vertices if there is a shortest path between them that passes through ST. One set of non-redundant paths is shown by the four dark edges P-T, Q-U, R-V and S-X.

To resolve this issue the idea of “non-redundancy” is introduced into the computation of edge betweenness. When counting the number of shortest paths for an edge, the end points must be distinct. For example, when counting the shortest paths that go through edge ST, if path P-T is counted, no other path that starts or ends with P (P-U, P-V, P-W, P-X) or T (T-Q, T-R, T-S) should be counted (Figure 4.1B). Based on this idea, the betweenness of an edge is the maximum number of non-redundant all-against-all shortest paths passing through it. It was expected that this change would keep the intuitiveness of the original algorithm, while making it more robust against unbalanced partitioning.

For the implementation of this new definition, the Maximum Bipartite Matching (MBM) algorithm is well suited (Figure 4.1B). Following the Floyd-Warshall algorithm, all the shortest paths passing through the given edge are identified. Then the end vertices of all the paths are divided into two groups, depending on which side each vertex sits with respect to the given edge. A bipartite graph is constructed on the two vertex groups. Each shortest path is converted to an edge in the bipartite graph. Finally, the MBM algorithm is applied to find the maximum matching number, which is the betweenness of the given edge. In the example shown in Figure 4.1, edge CD has a betweenness of 3, and edge ST has a betweenness of 4 (Figure 4.1B). Therefore edge ST is removed to give a more meaningful result.
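The matching step can be sketched with a simple augmenting-path routine (Kuhn's algorithm). The text does not say which MBM implementation was used, so this choice is an assumption, as are the function name and the input format: a list of (left end vertex, right end vertex) pairs, one per shortest path through the edge, with the end vertices already assigned to their side of the edge.

```python
def non_redundant_betweenness(path_endpoints):
    """Maximum number of shortest paths with pairwise distinct end vertices.

    path_endpoints: iterable of (left, right) pairs, one per shortest path
    through the edge under consideration. Each pair is an edge of the
    bipartite graph; the result is the size of a maximum matching.
    """
    neighbors = {}
    for left, right in path_endpoints:
        neighbors.setdefault(left, []).append(right)
    matched_left_of = {}  # right vertex -> left vertex it is matched with

    def try_augment(left, visited):
        # Depth-first search for an augmenting path starting at `left`.
        for right in neighbors[left]:
            if right in visited:
                continue
            visited.add(right)
            if (right not in matched_left_of
                    or try_augment(matched_left_of[right], visited)):
                matched_left_of[right] = left
                return True
        return False

    return sum(1 for left in neighbors if try_augment(left, set()))

# End vertices from the example in the text: every vertex on one side of the
# edge has a shortest path through it to every vertex on the other side.
st_paths = [(s, t) for s in "PQRS" for t in "TUVWX"]
cd_paths = [(s, t) for s in "ABC" for t in "DEFGHIJK"]
st_score = non_redundant_betweenness(st_paths)
cd_score = non_redundant_betweenness(cd_paths)
```

When every end vertex on one side is paired with every end vertex on the other, the score reduces to the size of the smaller side, which is why a bridge to a very small group no longer dominates.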

Quantitative definition of module. Communities, or modules, have been loosely referred to as “densely connected subgraphs”. However, many quantitative definitions for this concept exist in the literature (Radicchi et al., 2004). For simplicity the term “in-degree” (kin) of a vertex is used to represent the number of its within-subgraph connections, and the term “out-degree” (kout) to represent the number of its outside-subgraph connections. Please note that these are not the notations used in a directed graph to denote incoming and outgoing edges. A necessary condition for a subgraph to be called a module is that the sum of the in-degrees of all the vertices in the subgraph is greater than the sum of the out-degrees. This is a weak definition. A much stronger definition requires that for every vertex in the subgraph the in-degree is larger than the out-degree. However, this latter definition is too stringent for a real-life network, which may have complicated crosstalk between modules. Furthermore, even if the modularity of a network is so clear-cut that every module satisfies such a strong definition, the algorithm has to be perfect to actually find the modules.


To address this issue, a quantitative definition of community is proposed that is both strong and practical. Let kin and kout be the in-degree and out-degree of a vertex, respectively. A subgraph of n vertices is a module if

iii)  $\sum_{i=1}^{n} k_{in}^{i} > \sum_{i=1}^{n} k_{out}^{i}$

iv)  $\{k_{in}^{1}, k_{in}^{2}, \ldots, k_{in}^{n}\} \gg \{k_{out}^{1}, k_{out}^{2}, \ldots, k_{out}^{n}\}$

The first criterion is the weak definition as stated above. The second criterion states that, collectively, the in-degrees of the vertices in the subgraph are significantly greater than the out-degrees. This is less stringent than the strong definition, but it still captures the essence of the concept “densely connected subgraph”. In implementing this second criterion, the Wilcoxon two-sample test was applied to compare the in-degrees and out-degrees, and a p-value of 0.01 was used as the cutoff for significance.
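The two criteria can be combined into a single check, sketched below. The rank-sum test is written out with a plain normal approximation, midranks for ties, no tie or continuity correction, and a one-sided alternative (in-degrees larger than out-degrees). These are simplifying assumptions made for illustration; an exact or tie-corrected test from a statistics package would normally be preferred. All function names are hypothetical.

```python
from math import erfc, sqrt

def in_out_degrees(adjacency, members):
    """In-degrees and out-degrees of the vertices of a candidate subgraph.

    adjacency: dict mapping each vertex to the set of its neighbors.
    """
    members = set(members)
    k_in = [len(adjacency[v] & members) for v in members]
    k_out = [len(adjacency[v] - members) for v in members]
    return k_in, k_out

def rank_sum_p_value(x, y):
    """One-sided Wilcoxon rank-sum p-value for 'x tends to exceed y'.

    Assumptions: normal approximation, midranks for ties, no corrections.
    """
    pooled = sorted([(value, 0) for value in x] + [(value, 1) for value in y])
    ranks = [0.0] * len(pooled)
    start = 0
    while start < len(pooled):
        end = start
        while end < len(pooled) and pooled[end][0] == pooled[start][0]:
            end += 1
        midrank = (start + 1 + end) / 2.0  # mean of ranks start+1 .. end
        for k in range(start, end):
            ranks[k] = midrank
        start = end
    n1, n2 = len(x), len(y)
    rank_sum_x = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    z = (u - n1 * n2 / 2.0) / sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    return 0.5 * erfc(z / sqrt(2.0))  # upper-tail normal probability

def is_module(adjacency, members, alpha=0.01):
    """Criterion (iii) plus criterion (iv) at significance level alpha."""
    k_in, k_out = in_out_degrees(adjacency, members)
    if sum(k_in) <= sum(k_out):
        return False
    return rank_sum_p_value(k_in, k_out) < alpha
```

With heavily tied degree values the uncorrected approximation tends to be conservative, so borderline subgraphs could be classified differently by a corrected test.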

Computer-generated graphs with community structures. For the purpose of testing the algorithm, artificial graphs with known community structures were generated exactly as the examples in the original GN-algorithm paper. Briefly, each graph contains 128 vertices that are divided into 4 communities, each of which contains 32 vertices. Between each pair of vertices an edge is added with a certain probability. The probability is pin if the two vertices are within the same community, and pout if the two vertices belong to different communities. The value of pout is varied to produce graphs with different levels of crosstalk: the higher pout is, the more crosstalk exists between communities. The probability pin is chosen accordingly so that the average number of connections per vertex is 16.
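A generator for such test graphs might look like the following sketch. It is parameterized by the expected number of inter-community edges per vertex (called z_out here, a hypothetical name), from which pout and pin are derived so that the expected degree stays at 16. The function name, the seed argument and the parameter names are assumptions made for illustration.

```python
import random

def planted_partition_graph(z_out, n_communities=4, community_size=32,
                            mean_degree=16, seed=None):
    """Random graph with known communities.

    z_out: expected number of inter-community edges per vertex.
    Returns (edges, community), where community[v] is the label of vertex v.
    """
    rng = random.Random(seed)
    n = n_communities * community_size
    z_in = mean_degree - z_out
    # Each vertex has community_size - 1 possible partners inside its own
    # community and n - community_size possible partners outside it.
    p_in = z_in / (community_size - 1)
    p_out = z_out / (n - community_size)
    community = [v // community_size for v in range(n)]
    edges = []
    for u in range(n):
        for v in range(u + 1, n):
            p = p_in if community[u] == community[v] else p_out
            if rng.random() < p:
                edges.append((u, v))
    return edges, community

# Example: a graph with about 6 inter-community edges per vertex.
edges, community = planted_partition_graph(z_out=6, seed=1)
observed_mean_degree = 2 * len(edges) / len(community)
```

The fraction of correctly classified vertices can then be obtained by comparing the labels in `community` with the four groups returned by a partitioning algorithm.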


Functional module datasets from previous studies. For the purpose of comparison, two sets of functional modules were obtained, which were identified in two previous studies. One dataset was kindly provided by Victor Spirin and was retrieved from http://insilico.mit.edu/modules/allOurClusters.html (Spirin and Mirny, 2003). The other dataset was retrieved from http://www.cs.tau.ac.il/%7Ershamir/samba/ (Tanay et al., 2004).

Results

Testing the algorithm on computer-generated graphs. Before being applied to the yeast proteome network, the new partition algorithm was first tested on artificial graphs produced exactly as the examples in the original GN-algorithm paper. For each graph produced, first the algorithm was applied until four communities were obtained.

Then the produced communities were compared with the actual communities to calculate the fraction of vertices that were classified correctly. As shown in Figure 4.2, when inter-community crosstalk is low, the new algorithm can correctly find the community structures in the graph. As the number of edges between communities increases, the rate of misclassification increases. Interestingly, if the inter-community edges are few, the new algorithm tends to make slightly more mistakes than the GN algorithm. However, when the number of inter-community edges per vertex rises above 6, the new algorithm begins to outperform the GN algorithm. In other words, the new algorithm is capable of discerning more complex crosstalk. These results suggest that the extension and modifications of the GN algorithm make it more robust against noise and the blurring of community boundaries. Since real-life networks are complex and their community structures intricate, the extended algorithm may produce more meaningful results in real applications.

Figure 4.2. Algorithm performances on the partitioning of an artificial network. An artificial network was constructed so that the 128 vertices were divided into four communities. The original GN algorithm and the extended GN algorithm were applied to partition this network, and the classification accuracy was assessed by the percentage of vertices that are clustered correctly.

Partition of the protein-protein interaction network of yeast. With the validity confirmed, the algorithm was then applied to the protein-protein interaction network of yeast. Following the perspectives of Hartwell (Bork et al., 2004) (Hartwell et al., 1999) and Spirin and Mirny (Spirin and Mirny, 2003), the algorithm was set to terminate when no subgraph had more than five vertices. Then the definition of module was applied to these candidate subgraphs and a total of 266 functional modules were obtained. Out of the 3409 proteins in the network, 3150 (92.4%) are included in these modules. This indicates a good coverage among the functionalities of the yeast, and a good sensitivity that is likely the result of the combination of the modified algorithm and the proposed filtering criteria. The module sizes range from 5 to 98, 56.2% of which fall within five to twenty-five, a size range proposed by Spirin and Mirny (Spirin and Mirny, 2003). A list of these modules is available in Appendix A, along with a preliminary annotation based on Gene Ontology (GO).

Validation through connectivity density. For the purpose of validating the functional modules from the topology perspective, a simple measurement is proposed, namely the connectivity density. The connectivity density of a subgraph is the ratio of total in-degrees to the total number of connections. The density of a subgraph is therefore always between 0 and 1, and according to the weak definition, the density of a module should be between 0.5 and 1. The lower the density, the less likely the subgraph is to be a module. The next issue is which control to use for comparison. For a given module, a simple control would be a set of the same number of proteins randomly picked from the network. However, a random set of proteins is unlikely to be connected, and therefore such a comparison is not very convincing. Instead, a more rigorous control was used. For each functional module, a small portion of the proteins in the module was randomly replaced with the same number of proteins outside this module. The replacement proteins are connected with the proteins in the module but do not belong to it. In this way, the control is guaranteed to be connected. Comparison to such controls is equivalent to asking the question: if the module is shifted a little, do we get a less connected or a more connected subgraph?
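The density measure and the replacement control can be sketched as follows. The density is computed as the sum of in-degrees divided by the sum of in-degrees and out-degrees. The control draws its replacements from the outside neighbors of the original module; the rounding rule for the number of replaced proteins and the way replacements are sampled are assumptions, and unlike the procedure described in the text this simple version does not verify that the resulting control is connected. All names are hypothetical.

```python
import random

def connectivity_density(adjacency, members):
    """Ratio of total in-degree to total degree for a set of vertices.

    adjacency: dict mapping each vertex to the set of its neighbors.
    """
    members = set(members)
    total_in = sum(len(adjacency[v] & members) for v in members)
    total_out = sum(len(adjacency[v] - members) for v in members)
    total = total_in + total_out
    return total_in / total if total else 0.0

def replacement_control(adjacency, members, fraction, rng):
    """Replace a fraction of the module with neighboring outside vertices.

    Assumptions: at least one vertex is replaced, the count is rounded to
    the nearest integer, and replacements are sampled uniformly from the
    outside neighbors of the original module.
    """
    members = set(members)
    boundary = set().union(*(adjacency[v] for v in members)) - members
    n_replace = min(max(1, round(fraction * len(members))), len(boundary))
    removed = rng.sample(sorted(members), n_replace)
    added = rng.sample(sorted(boundary), n_replace)
    return (members - set(removed)) | set(added)

# Hypothetical toy network: an 8-vertex clique (vertices 0-7) in which every
# member has one additional neighbor outside the clique (vertices 8-15).
toy = {v: set() for v in range(16)}
for u in range(8):
    for v in range(u + 1, 8):
        toy[u].add(v)
        toy[v].add(u)
    toy[u].add(u + 8)
    toy[u + 8].add(u)

module = set(range(8))
module_density = connectivity_density(toy, module)
control = replacement_control(toy, module, 0.15, random.Random(0))
control_density = connectivity_density(toy, control)
```

Averaging the control density over repeated draws, as was done with 20 repetitions in the analysis, reduces the influence of any single random replacement.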

Figure 4.3 is a scatter plot of the connectivity densities of the functional modules and their controls. For most of the modules, replacing 15% of the components causes the connectivity density to decrease significantly. For many of them, the density drops below 0.5, suggesting that they no longer even qualify as functional modules. If more proteins are replaced (30%), the connectivity densities decrease even more. The replacement experiment was repeated 20 times, each time with a different set of proteins being randomly replaced. These observations suggest that the identified modules are indeed densely connected local subgraphs, and thus are good candidates for functional modules in the yeast protein network.

Genes in the same functional module confer similar phenotype. Since a functional module performs a relatively independent cellular function, a similar phenotype is expected to appear if the genes in the same module are knocked out. To verify this, each gene’s phenotype was represented as a vector of 31 dimensions, which correspond to the 31 experimental conditions (Giaever et al., 2002). The Euclidean distance of two vectors was used to represent the phenotype difference between two genes, and the average difference of all gene pairs in a module was used to represent its phenotype divergence. Figure 4.4A shows the distribution of the phenotype divergence of all the identified modules. For 185 out of the 254 (72.8%) modules, the phenotype divergence is lower than the average phenotype difference over all the yeast ORFs. In other words, genes in the same functional modules display more similar phenotypes than


Figure 4.3. Identified functional modules are densely connected subgraphs. In this scatter plot, each data point represents the connectivity density of a functional module (x-axis) and its replacement control (y-axis). The dashed line is y=x, which means that the connectivity density is the same for the module and its control. Any data point above the line corresponds to the case where the control has a higher connectivity density, while data points below the line represent the case where the control has a lower connectivity density than the actual functional module. The further below the line, the less dense the control is compared to the original module. Each data point is the average of 20 randomization experiments.


Figure 4.4. Genes within a functional module confer similar deletion phenotype. (A) Histogram of the phenotype divergence of the functional modules. The dashed line indicates the phenotype difference averaged over all pairs of the yeast genes. (B) The scatter plot of phenotype divergence and module size. The Pearson correlation coefficient is 0.01.

those in different functional modules. To exclude the possibility of artifact due to module size difference, the relationship between module size and phenotype divergence was also examined, and no significant correlation was found (Figure 4.4B). This is also confirmed by Pearson correlation analysis (r=0.01).
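The phenotype divergence of a module, as defined above, is the mean Euclidean distance over all gene pairs in the module. A minimal sketch follows; the function name is hypothetical, genes without phenotype data are assumed to be skipped, and the toy vectors have two dimensions rather than the 31 used in the analysis.

```python
from itertools import combinations
from math import dist

def phenotype_divergence(phenotypes, genes):
    """Mean pairwise Euclidean distance between phenotype vectors.

    phenotypes: dict mapping gene -> sequence of per-condition values.
    genes: the members of one module.
    Assumption: genes lacking a phenotype vector are ignored, and None is
    returned when fewer than two genes remain.
    """
    with_data = [g for g in genes if g in phenotypes]
    pairs = list(combinations(with_data, 2))
    if not pairs:
        return None
    return sum(dist(phenotypes[a], phenotypes[b]) for a, b in pairs) / len(pairs)

# Hypothetical toy data with two experimental conditions.
toy_phenotypes = {"a": (0.0, 0.0), "b": (3.0, 4.0), "c": (0.0, 0.0)}
divergence = phenotype_divergence(toy_phenotypes, ["a", "b", "c"])
```

Applying the same function to all annotated genes at once gives the genome-wide average that serves as the reference line in Figure 4.4A, although for thousands of genes the number of pairs becomes large.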

To further confirm these observations, 20 randomization experiments were carried out where 30% of the proteins were replaced in each module to generate controls, and phenotype divergence was compared between each module and its control. It was found that for 60.9% of the modules, randomization increases phenotype divergence. Overall, the controls have higher phenotype divergence than the original modules (p-value < 0.001). Altogether, these results suggest that most of the identified modules are not only topologically meaningful, but they are also biologically relevant.

Phenotype similarity was also shown to be correlated with functional similarities between genes in another independent study (Gunsalus et al., 2005), which also supports its use for validating biological significance. To further evaluate the new method, the phenotype divergence of the modules was compared with that of two previous studies, which used a biclustering method on integrated functional genomics data (Tanay et al., 2004) or a mixture of three different methods (Spirin and Mirny, 2003). As shown in Figure 4.5, the phenotype divergences of the modules in this study are comparable to those of the mixed methods (p-value=0.46), both of which are significantly lower than those of the biclustering method (p-values < 0.002). It is worth noting that in this study 266 functional modules were detected, which is more than the number found by the biclustering method (205) and about three times the number found by the mixed techniques (90).

This again suggests that the new algorithm is capable of finding biologically relevant functional modules.

Figure 4.5. The comparison of phenotype divergence. Shown is the mean phenotype divergence of each set of modules, error bar being the standard deviation. “GN Ext.” refers to the result in this study; “Spirin03” refers to the result of Spirin et al. (Spirin and Mirny, 2003); “Tanay04” refers to the result of Tanay et al. (Tanay et al., 2004).

Known protein complexes are largely contained in the functional modules. A protein complex is an aggregate of multiple proteins that interact with each other and perform certain biological activities (Gingras et al., 2005). Since this is conceptually very similar to the definition of a functional module, a question to ask is whether the new algorithm could detect protein complexes in their entirety, or whether they would be randomly divided into fragments during partitioning. To answer this question, each protein complex was matched against the identified modules and the maximum overlap between each complex and the functional modules was calculated. As shown in Figure 4.6A, a majority of the 194 protein complexes annotated by the Comprehensive Yeast Genome Database (CYGD) at MIPS are largely contained in the identified modules (overlap > 0.75). A total of 98 protein complexes (51%) were identified in their entirety by our algorithm. Since small protein complexes are likely to be contained in large functional modules by chance, this analysis was also applied to the subset of large protein complexes. Of the 78 complexes that contain five or more proteins, 45 are largely contained in the functional modules, and 23 were identified completely. Similar results were obtained by analyzing the protein complex dataset annotated by the Saccharomyces Genome Database (SGD) (Figure 4.6B).
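The maximum overlap ratio used in this analysis can be sketched in a few lines: for each complex, the largest intersection with any single module is divided by the size of the complex. The function name and the toy data are hypothetical.

```python
def max_overlap_ratio(complex_members, modules):
    """Largest fraction of a complex that falls inside any single module.

    complex_members: the proteins of one complex.
    modules: iterable of protein collections, one per functional module.
    """
    complex_members = set(complex_members)
    best = max((len(complex_members & set(module)) for module in modules),
               default=0)
    return best / len(complex_members)

# Hypothetical toy data: a four-protein complex split over two modules.
toy_complex = {"a", "b", "c", "d"}
toy_modules = [{"a", "b", "c", "x"}, {"d", "y"}]
ratio = max_overlap_ratio(toy_complex, toy_modules)
completely_contained = ratio == 1.0
```

A complex counts as identified in its entirety when the ratio equals 1, and as largely contained when the ratio exceeds the 0.75 threshold used in Figure 4.6.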

To further confirm these results, the overlapping analysis was applied against the control modules obtained by replacing 15% of the module components. Then, for each of the protein complexes in the CYGD database, the overlap ratios before and after the replacement were compared. As shown in Figure 4.6C, for the majority of the protein complexes (107 out of 194), the overlap ratios with the control modules are lower than the overlap with the actual functional modules. The overlap ratios remain unchanged for about one third of the complexes (67 out of 194). Only a very small portion of the complexes (20 out of 194) are better overlapped with the controls. In addition, the number of completely contained complexes decreases to 73. It should be noted that in the controls only a small portion (15%) of the module components are replaced, while the overlapping with the complexes changed significantly. These analyses show that known protein complexes are largely contained in the identified modules, and many of them are


Figure 4.6. Protein complexes are contained in functional modules. Protein complexes annotated either in CYGD (panel A) or SGD (panel B) were matched against the identified functional modules. The overlap between each complex and a functional module was identified and the ratio of overlap to the size of the complex was calculated. The maximum overlap ratios of all the complexes are shown as histograms. For either panel A or panel B, the inset shows the overlap ratios for the subset of the protein complexes with five or more components. (C) Scatter plot of overlap ratio. Each data point represents the maximum overlap ratio with the actual functional modules (x-axis) and with the control modules (y-axis). The dashed line is the diagonal (y=x), representing the cases where the overlap ratios remain unchanged. The data points above the line represent the complexes that overlap better with the controls; data points below the line represent the complexes that overlap better with the actual functional modules.

identified completely, without a single component missing. This further indicates that the new algorithm is capable of detecting functional modules that are biologically meaningful.

Figure 4.7. The chromosome segregation module. This module contains 18 proteins, most of which are annotated with chromosome segregation or cell division functions. Proteins represented by circles form the Ndc80 protein complex; proteins represented by squares are the components of the DAM1-DUO1 protein complex. Each protein complex contains a complete subgraph of 4 vertices (C4).

The chromosome segregation functional module: An example. Figure 4.7 shows an example of one of the functional modules identified in the protein-protein interaction network of yeast. This functional module has 18 proteins. In the CYGD annotation, the genes Nuf2, Spc25, Dam1, Duo1, Tid3, Spc34, Dad1, Dad2, Ask1, Spc24 and Spc19 are annotated as playing a role in chromosome segregation, or spindle pole body, or both. Ulp1 and Smc5 are annotated as playing a role in mitotic cell division.

Cnn1 is annotated with meiosis. Similar annotations are given in the Gene Ontology (GO) database. This functional module is evidently the core machinery responsible for the separation of chromosomes. Out of the 18 genes, 14 have a lethal deletion phenotype.

This is consistent with the fact that chromosome segregation is a housekeeping process for budding yeast, just like for any other organism.

As discussed above, this functional module contains two protein complexes. NUF2, TID3, SPC24 and SPC25 form the highly conserved Ndc80 protein complex. This complex is the core of the kinetochore (Asakawa et al., 2005), and is responsible for proper alignment and attachment of chromosomes (Wei et al., 2005). DAM1, DAD1, DAD2, DUO1, ASK1, SPC19 and SPC34 form the DAM1-DUO1 protein complex. This complex is a ring-shaped interface between microtubule and kinetochore, and it is capable of translating the force generated by microtubule depolymerization into movement to facilitate chromosome segregation (Westermann et al., 2006) (Westermann et al., 2005). It is interesting to note that each of the two protein complexes contains a complete subgraph (clique) of size 4. Since functional modules are densely connected subgraphs, cliques are indeed expected to appear more frequently within them.

YLR419w is the only component in this module that has no definite functional annotation in either CYGD or SGD. The most recent SGD and GO annotations regard it as a hypothetical protein with ATP-dependent helicase activity, based on the homology of a small portion of the amino acid sequence. Likewise, in CYGD it is called a putative helicase. Based on the fact that YLR419w is an integral component of this functional module, its biological function may be related to chromosome segregation. Several lines of evidence are in line with this prediction. First of all, evolutionary studies showed that this gene belongs to a family of helicases with very diverse functions, many of which have multiple functions (Sanjuan and Marin, 2001). Secondly, overexpression of a dominant negative form of RHA, an RNA helicase, causes aberrant mitosis with extra centrosomes and tetraploidy in human breast epithelial cells, suggesting its role in centrosome formation and chromosome segregation (Schlegel et al., 2003). In addition, RUVBL1/TIP49a, a human ATP-dependent helicase, was shown to associate with tubulins and colocalize with the centrosome and mitotic spindle (Gartner et al., 2003). These results are consistent with the predicted function of YLR419w.

Discussion

Identifying the functional modules in a protein-protein interaction network is a first step in understanding the organization and dynamics of cell functions. To ensure that the identified modules are biologically meaningful, network-partitioning algorithms should take into account not only topological features but also functional relationships. Furthermore, the identified functional modules should be validated rigorously against available biological knowledge.

The protein-protein interaction network of yeast used here was obtained through integrating the high confidence datasets from three rigorous studies. With 10936 interactions between 3409 proteins, this network is very complex. Uncovering the modular structure of such a network is a challenging task. To make things worse, not all the interactions are stable, and not all the interactions occur at the same time. In other words, the network is not a real snapshot of the interactions in yeast, but an overlap of many snapshots. How much confidence can be placed in the results obtained from such a network? This study addressed both issues by using expression dissimilarity as the weights for the interactions. First of all, adding weights based on domain knowledge to represent the strengths of connections can enhance the network analysis (Barrat et al., 2004) (Newman, 2001). Secondly, the algorithm is shortest-path based, which makes the use of weights highly desirable. In the case of this study, in order to obtain functional modules with biological significance, it is highly desirable to incorporate functional genomics information into the partitioning process. Based on the notion that, in general, genes in the same functional module are more likely to be co-regulated than genes from different modules, the expression dissimilarity was computed by taking into consideration hundreds of microarray expression profiles. If the distance between two interacting proteins is very small, it means that the two corresponding genes’ expression profiles are very similar. In other words, they are co-regulated. Unstable, transient interactions will likely have a larger distance due to less correlated expression profiles. Therefore, by using expression dissimilarity as the weight, the functional correlations between interacting partners are emphasized. The modular structures obtained in such a weighted graph very likely represent real functional modules.

The betweenness-based partitioning algorithm has proved to be intuitive and powerful in module detection in real world networks (Girvan and Newman, 2002). In this study an extended algorithm was developed to partition weighted graphs, which can be used on other types of networks. For example, expression profiles can also be used to add weights to graphs representing transcriptional networks (Ihmels et al., 2002), and the new algorithm can be used on such datasets to identify regulatory modules in yeast or other organisms. Of course, quite a few algorithms have recently been developed to study the modularity in biological networks, as described in the introduction. Due to certain limitations, only limited comparisons were made with two of those studies. Ideally a more comprehensive analysis of these methods, such as a competition style study, would greatly benefit the community by guiding future investigations in this field.

By comparing the pairwise phenotype difference it was found that, in general, genes confer a similar deletion phenotype if their protein products belong to the same functional module. This result has significant relevance for showing that biology is modular. Even though various methods have been developed to detect functional modules at the topological level, it is the modularity at the functional and the phenotypical level that interests us. The results in this study indicate that phenotype similarity could be used to evaluate the biological significance of functional modules detected using topological features. Furthermore, with more detailed analysis, phenotype data can be very valuable resources for understanding biological processes. For example, certain treatments may cause growth defects among the deletion mutants for a given module. This would be strong evidence that this module is involved in the cellular response to these treatments.

In this study, an example was shown in which the function of a gene was predicted based on the associated functional module. The yeast gene YLR419w is not annotated in any detail due to lack of significant homology to any known gene. However, through functional module classification its biological function was predicted to be related to chromosome segregation. In recent years, whole genome sequencing projects have generated a huge amount of DNA sequence information, and various sophisticated gene finding algorithms have been applied to find large numbers of Open Reading Frames (ORFs). Even though homology-based gene annotation provides the first clue to the biological functions of new ORFs, functional genomics-based annotation methods have become increasingly important (Troyanskaya, 2005) (Bentley, 2000). This is particularly true for those ORFs with poor or no homology. This study showed that functional module detection could be yet another complementary method for gene annotation.


CHAPTER 5

MODULE ORGANIZATIONS

Through the above studies a set of functional modules has been identified in the yeast proteome network, and these modules were shown to be biologically relevant. However, this is just a first step in understanding the functional organization of the cell. To truly understand the dynamics of a complex network it is key to understand the links between its subsystems (Barrat et al., 2004). Cellular networks are no exception. It is the communications between functional modules, not the functional modules themselves, that shape the behaviors of cells and organisms (Hartwell et al., 1999). Even though a module is relatively independent in carrying out a specific cellular function, responses to intracellular or extracellular perturbations usually involve multiple cellular processes or functional modules. The coordination between these modules is the key to cellular homeostasis (Hartwell et al., 1999) (Rives and Galitski, 2003). For example, during DNA replication in a eukaryotic cell, the responses to the occurrence of DNA damage include replication fork stalling, replisome stabilization, recombination repression, checkpoint activation, DNA repair, and replication resumption or restart, among many other events (Branzei and Foiani, 2005).


It goes without saying that the communications between the functional modules in the cell are very complex. As a first step, one should at least understand the static picture of the connections between them. In particular, the roles of hubs in the module-module connections are of primary interest. The hubs in the protein-protein interaction network are those proteins that are highly connected with other proteins. It has been proposed that hubs are the main components of cellular communications (Barabasi and Oltvai, 2004).

Of course, not every hub plays the same role. Through analyzing the gene expression correlations between hubs and their interactors, Han and his colleagues proposed a dynamic architecture model for the yeast protein-protein interaction network, in which different types of hubs play different roles in the network organization. The ‘party hubs’ represent the principal connectors within modules, while the ‘date hubs’ represent the higher-level connectors that organize functional modules into a dynamic proteome (Han et al., 2004). Even though it is very insightful on the functional organizations in yeast, this model was based on indirect observations on mRNA expression profiles. Now that functional modules are identified and validated, close and direct examinations on the hub subpopulations could provide more insights into the organizations of functional modules in the yeast proteome network.

Materials and Methods

Gene expression correlation. The gene expression data contain the profiles of 265 microarray experiments (Chapter 4). The expression level in each microarray experiment was treated as a random variable, and therefore the expression profile of each gene is a vector of 265 dimensions. The expression correlation between two genes was measured by the Pearson Correlation Coefficient (PCC) between the two vectors. The overall level of the expression correlation between a gene and its interacting neighbors was measured by the mean of the PCCs between the gene and its neighbors.
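As a minimal sketch of this measure, the mean PCC between a gene and its neighbors can be computed as follows. The data layout (a dictionary of expression vectors and a dictionary of neighbor lists), the function names, and the short toy profiles are assumptions made for illustration; the real profiles have 265 entries.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    # Undefined (division by zero) if either profile is constant.
    return sxy / sqrt(sxx * syy)


def mean_neighbor_pcc(gene, profiles, neighbors):
    """Mean PCC between a gene and each of its interacting neighbors."""
    pccs = [pearson(profiles[gene], profiles[n]) for n in neighbors[gene]]
    return sum(pccs) / len(pccs)


# Hypothetical toy profiles with four "experiments" each.
profiles = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.0],  # perfectly correlated with g1
    "g3": [4.0, 3.0, 2.0, 1.0],  # perfectly anti-correlated with g1
}
neighbors = {"g1": ["g2", "g3"]}

print(mean_neighbor_pcc("g1", profiles, neighbors))
```

The toy case also shows how averaging can hide structure: one strongly positive and one strongly negative partner yield a mean near zero.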

The visualization of hub-hub interactions between functional modules. Each of the 45 functional modules was given an identification number between 1 and 45, and was referred to as Mi. For an interaction between protein A in module Ma and protein B in module Mb, two data points were generated. The data point representing protein A is (x1, y1), where x1 was drawn from a uniform distribution u(a-1, a) and y1 was drawn from a uniform distribution u(b-1, b). In other words, it was picked at random within the square at the intersection of Ma and Mb in the grid. Likewise, the data point representing protein B was drawn randomly from the square at the intersection of Mb and Ma in the grid. In the schematic representation, each hub type was plotted with a different symbol. In this way, both the types and the densities of hub-hub interactions between each pair of functional modules are visualized.
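The coordinate generation for one interaction can be sketched as below. The function name `hub_hub_points` and the use of a seeded generator are illustrative assumptions; only the two uniform draws per point come from the description above.

```python
import random


def hub_hub_points(a, b, rng):
    """Plot coordinates for an interaction between a hub in module Ma and a hub in Mb.

    Module ids are 1-based, so module a occupies the interval [a-1, a] on each axis.
    """
    point_a = (rng.uniform(a - 1, a), rng.uniform(b - 1, b))  # protein A, square (a, b)
    point_b = (rng.uniform(b - 1, b), rng.uniform(a - 1, a))  # protein B, square (b, a)
    return point_a, point_b


rng = random.Random(0)  # fixed seed so the sketch is repeatable
pa, pb = hub_hub_points(3, 7, rng)
print(pa, pb)
```

The random jitter inside each unit square keeps overlapping interactions visible, so the density of points in a square reflects the number of hub-hub interactions between the two modules.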

Results

Hubs differentially co-express with interacting partners. Han and his colleagues checked the expression profiles of hubs in the protein-protein interaction network of yeast, and found that two populations of hubs showed different expression correlation with their interacting partners (Han et al., 2004). To check if this result was reproducible on the larger dataset used in this study, the Pearson Correlation Coefficients (PCCs) between the expression profiles of each gene and its interacting partners were calculated, and the mean of the PCCs was taken as a measure for its overall co-expression correlation. As shown in Figure 5.1, the expression correlation of the whole gene population displays a narrow, approximately Gaussian distribution, centered at 0.1 and skewed slightly to the right. However, for the hubs with 15 or more connections, the distribution is clearly bimodal. The left peak largely overlaps that of the whole population, while the right peak is centered at 0.6. This reproducible result confirms that there are subpopulations of hubs that may serve different roles in the organization of the proteome network of yeast.

Figure 5.1. Bimodal distribution of co-expression correlations among hubs. The solid line represents the whole gene population, while the dashed line represents the population of hubs with 15 or more interactions.

Gene expressions are positively correlated within functional modules but not between functional modules. The fact that hubs are more likely to have high PCC suggests a positive correlation between connectivity and PCC. However, as shown by the scatter plot in Figure 5.2, this correlation is not very strong (correlation coefficient = 0.274). Since the mean PCC is used, one explanation is that the expression correlation between a hub and its neighbors may vary significantly, and the averaging masks the real relationship. To check this possibility, the connectivity of each hub is divided into the interactions with proteins within its own module (referred to as in-degree, ki) and the interactions with proteins outside its own module (referred to as out-degree, ko). As shown in Figure 5.3, a hub’s PCC is positively correlated with its in-degree, but is negatively correlated with its out-degree. This differential correlation is more obvious for the hubs with even higher connectivity (Figure 5.3C, D). This observation confirms that significant coexpression exists within functional modules but not between different modules. In other words, the genes in the same functional module are more likely to be co-regulated. More importantly, this experiment implies that in-degree and out-degree could be used as distinguishing factors to classify hubs into subpopulations.
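The split of a hub's connectivity into in-degree and out-degree can be sketched as follows. The dictionaries `neighbors` and `module_of` and the function name are hypothetical; the sketch also assumes that a partner with no module assignment counts toward the out-degree, which the text does not specify.

```python
def in_out_degree(protein, neighbors, module_of):
    """Return (k_in, k_out): partners inside and outside the protein's own module."""
    own_module = module_of[protein]
    partners = neighbors[protein]
    k_in = sum(1 for p in partners if module_of.get(p) == own_module)
    k_out = len(partners) - k_in  # assumption: unassigned partners count as outside
    return k_in, k_out


# Hypothetical toy network: hub "h" sits in module 1.
neighbors = {"h": ["a", "b", "c", "d"]}
module_of = {"h": 1, "a": 1, "b": 1, "c": 2, "d": 3}

print(in_out_degree("h", neighbors, module_of))
```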

Figure 5.2. Correlation between connectivity and PCC for hubs. The x-axis is connectivity, and the y-axis is mean PCC. The hubs are the proteins with 15 or more interactions.


Figure 5.3. Correlation between PCC and in-degree/out-degree for hubs. (A) Scatter plot of mean PCC and in-degree for hubs with 15 or more interactions. p-value<2.2e-16. (B) Scatter plot of mean PCC and out-degree for hubs with 15 or more interactions. p-value=0.006. (C) Scatter plot of mean PCC and in-degree for hubs with 30 or more interactions. p-value=7.7e-11. (D) Scatter plot of mean PCC and out-degree for hubs with 30 or more interactions. p-value=0.0001.


Subpopulations of hubs exist in the yeast proteome network. To further explore this differential relationship and gain insights into the organization of the proteome network, the hubs (k >= 15) are further divided into two groups. For the hubs in the first group, the out-degree is larger than the in-degree; for the hubs in the second group, the out-degree is equal to or smaller than the in-degree. As shown in Figure 5.4, the mean PCCs of the first group exhibit a normal distribution centered at 0.1. Interestingly, the mean PCCs of the second group show a bimodal distribution. The left peak is centered at 0.1, while the right peak is centered at 0.65. The two peaks are cleanly separated at PCC=0.4. This bimodal distribution resembles that of the whole hub population (Figure 5.1), though here it is exhibited by one of the subpopulations. This result further suggests that there are different types of hubs in the yeast proteome network, which may serve different roles both topologically and functionally.

Combining these results with the bimodal distribution of the whole hub population, it becomes clear that the high PCC population (PCC>=0.4) consists almost exclusively of hubs with ki>=ko, which will be referred to as “core hubs”. On the other hand, the low PCC population (PCC<0.4) can be further divided into two subpopulations, those with ki>=ko and those with ki<ko, which will be referred to as “local hubs” and “global hubs”, respectively (Table 5.1). In the following, the reason for these names will be shown by examining the organizational roles these hubs play in the proteome network of yeast.


Figure 5.4. Distribution of mean PCC for subpopulations of hubs. The hubs are proteins with at least 15 interactions. The solid line represents the hubs with more interactions outside of the module; the dashed line represents the hubs with more interactions within the module.

Hub type          Core     Local     Global
Connectivity      >=15     >=15      >=15
ki vs. ko         --       ki>=ko    ki<ko
PCC               >=0.4    <0.4      <0.4
Number of hubs    102      168       95

Table 5.1. Three types of hubs
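The classification rules can be written as a small function. This is a sketch under stated assumptions rather than the original implementation: the thresholds (connectivity of at least 15, PCC cut at 0.4) come from the text, the function name `classify_hub` is hypothetical, and a tie between in-degree and out-degree is assigned to the local type, following the grouping in which the out-degree is equal to or smaller than the in-degree.

```python
def classify_hub(k_in, k_out, mean_pcc, min_degree=15, pcc_cut=0.4):
    """Return 'core', 'local', 'global', or None if the protein is not a hub."""
    if k_in + k_out < min_degree:
        return None  # not a hub
    if mean_pcc >= pcc_cut:
        return "core"  # high co-expression with partners, regardless of ki vs. ko
    # Low co-expression: split by where most interactions fall.
    return "local" if k_in >= k_out else "global"


print(classify_hub(12, 5, 0.65))
print(classify_hub(10, 6, 0.10))
print(classify_hub(4, 20, 0.10))
```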

The roles of the hubs in the organization of functional modules. As a first step to reveal the organizational roles of the hubs in the network, the level of overlap among the interactions of each hub type was examined. If the interacting partners of a hub population largely overlap with each other, their roles are likely confined to local communities. As shown in Table 5.2, global hubs have the highest number of unique interactors per protein. This is not due to higher connectivity, because the difference is even more obvious after normalization against the number of interactions. This suggests that global hubs are involved in the overall organization of the yeast proteome network. On the other hand, core hubs have the largest overlap among their interacting partners. This suggests that they may play important roles in the organization within functional modules.

Hub type                                     Core    Local    Global
Unique interactor per hub                    4.0     6.4      13.9
Unique interactor per hub per interaction    0.1     0.3      0.5

Table 5.2. The number of unique interactors in each hub population

Next, since the clustering coefficient is a direct indicator of local community structures (Chapter 3), local organizers are likely to have high clustering coefficients, while global organizers are likely to have low coefficients. As shown in Figure 5.5, global hubs have the lowest clustering coefficients among all the hubs, consistent with the above observation that their interactions are highly dispersed. In contrast, core hubs have the highest clustering coefficients, which recapitulates the high overlap among their interacting partners. This result further suggests that they play different roles in the modular organization of the yeast proteome network.
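For reference, the clustering coefficient of a node can be sketched as below. The sketch assumes the standard definition for an unweighted, undirected graph (the number of links among a node's neighbors divided by the number of possible links, k(k-1)/2); the exact definition used in Chapter 3 is not restated here, and the adjacency-set layout is hypothetical.

```python
from itertools import combinations


def clustering_coefficient(node, adj):
    """Fraction of neighbor pairs of `node` that are themselves connected."""
    nbrs = adj[node]
    k = len(nbrs)
    if k < 2:
        return 0.0  # convention for nodes with fewer than two neighbors
    links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
    return 2 * links / (k * (k - 1))


# Hypothetical toy graph: hub "h" has three neighbors, of which only a and b interact.
adj = {
    "h": {"a", "b", "c"},
    "a": {"h", "b"},
    "b": {"h", "a"},
    "c": {"h"},
}

print(clustering_coefficient("h", adj))
```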


Figure 5.5. The clustering coefficients of the hub populations

Besides interaction overlap and clustering coefficient, an even more direct indicator of organizational roles is module connectivity. A hub is connected to a module if it interacts with at least one protein in the module. The module connectivity of a hub is measured by the number of connected modules divided by its connectivity. As shown in Figure 5.6, core hubs have the lowest module connectivity among the three populations. This indicates that the interactors of core hubs not only overlap with each other, but are also concentrated within very few modules. This is consistent with the idea that they are organizers within functional modules. The module connectivity of local hubs is significantly higher than that of core hubs (p-value<0.0001), consistent with the previous results on interaction overlap. Global hubs have significantly higher module connectivity than both the core and local hubs (p-values < 0.0001). This means that the interactors of global hubs are widely spread out among the functional modules, further suggesting their roles as high-level organizers of the yeast proteome network.
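The module connectivity defined above can be sketched as follows. The data layout and function name are hypothetical, and the sketch assumes that the hub's own module counts as a connected module and that partners without a module assignment are ignored in the numerator.

```python
def module_connectivity(hub, neighbors, module_of):
    """Number of distinct modules touched by a hub, divided by its connectivity."""
    partners = neighbors[hub]
    connected = {module_of[p] for p in partners if p in module_of}
    return len(connected) / len(partners)


# Hypothetical toy network: four partners spread over three modules.
neighbors = {"h": ["a", "b", "c", "d"]}
module_of = {"h": 1, "a": 1, "b": 1, "c": 2, "d": 3}

print(module_connectivity("h", neighbors, module_of))
```

A hub whose partners all sit in one module approaches 1/k, while a hub with every partner in a different module reaches 1.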

Figure 5.6. The module connectivity of the hubs. Shown are the mean of the module connectivity for each hub type with standard error.

To summarize the above observations, the hub-hub interactions between the functional modules are visualized in Figure 5.7. In this schematic representation, each data point represents a hub that connects the two functional modules identified by its x and y axis position (see Materials and Methods). Overall, core hubs are involved in very limited connections between just a few modules. Local hubs are involved in more module connections than core hubs, but aggregation in certain modules is still obvious. In contrast, data points for global hubs are spread out across the whole matrix, including those areas where the core or local hubs are completely absent. Taken together, these observations suggest that global hubs are the major components of module-module organization, while core hubs are the organizers within functional modules.


Figure 5.7. The schematic of the hub-hub interactions between functional modules. The x-axis and y-axis are the functional modules. Each data point in the square (i, j) represents a hub (of a specific type) in module Mi that interacts with another hub in module Mj. For any data point in the square (i, j), there is a corresponding data point in the square (j, i) representing its interactor.


Yet another question is, what are the differences between the roles of core hubs and local hubs? In particular, since functional modules usually contain protein complexes as components, how are the proteins involved in complex formation represented in these hub subpopulations? The protein complex dataset from the CYGD database contains 194 yeast protein complexes formed by 981 unique yeast proteins. Among these proteins, there are 39 core hubs, 122 local hubs, and 37 global hubs, which account for 38.2%, 72.6%, and 38.9% of the respective hub populations (Figure 5.8). This observation suggests that, within a functional module, local hubs form protein complexes, which may be regulated by core hubs to execute the function of the module.
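The percentages follow directly from the counts given above and the population sizes in Table 5.1 (102 core, 168 local and 95 global hubs). A short check, with hypothetical variable names:

```python
# Hubs found in CYGD protein complexes, and the size of each hub population.
in_complex = {"core": 39, "local": 122, "global": 37}
population = {"core": 102, "local": 168, "global": 95}

percent = {t: 100 * in_complex[t] / population[t] for t in population}
for hub_type, value in percent.items():
    print(f"{hub_type}: {value:.1f}%")
```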

Figure 5.8. The hubs represented in protein complexes. Shown is the percentage of each hub population that is involved in protein complex formation.


Hubs preferentially interact with hubs of the same type. With the primary roles of the hubs clarified, the next logical question is how these different types of hubs interact with each other. For each hub, the relative concentrations of its interactors in each of the three subpopulations were calculated, and the interactor concentrations of each hub type were compared to determine their interaction preference. As shown by the boxplot in Figure 5.9, each hub preferentially interacts with the hubs of its own type. Since these are relative concentrations, the result is not an artifact caused by the size difference between the subpopulations. Such preference is statistically very significant (all p-values < 10^-6). This observation reveals some characteristics of the fine structure in the yeast proteome network (see Discussion).
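One possible reading of the relative concentration is the fraction of each hub population that interacts with a given hub, which matches the description in the caption of Figure 5.9. The sketch below implements that reading only; the function name and data layout are hypothetical, and it assumes the hub itself is not removed from the denominator of its own type.

```python
def interactor_concentration(hub, neighbors, hub_type):
    """For each hub type, the fraction of that population interacting with `hub`."""
    sizes = {}
    for t in hub_type.values():
        sizes[t] = sizes.get(t, 0) + 1
    hits = {t: 0 for t in sizes}
    for partner in set(neighbors[hub]):
        if partner in hub_type:  # partners that are not hubs are ignored
            hits[hub_type[partner]] += 1
    return {t: hits[t] / sizes[t] for t in sizes}


# Hypothetical toy data: 2 core, 4 local and 2 global hubs.
hub_type = {
    "c1": "core", "c2": "core",
    "l1": "local", "l2": "local", "l3": "local", "l4": "local",
    "g1": "global", "g2": "global",
}
neighbors = {"c1": ["c2", "l1", "x"]}  # "x" is a non-hub partner

conc = interactor_concentration("c1", neighbors, hub_type)
print(conc)
```

Dividing by the population size is what makes the comparison independent of how many hubs of each type exist.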


Figure 5.9. The interactions between hub types. For each group G1, bar G2 represents the fraction of G2 hubs that are interacting partners of each hub in G1. Overall, this boxplot shows the interacting preference between each type of hubs.


The dynamics of module-module interactions in cellular response. Through the above topological studies, a picture could be drawn on how the hub subpopulations organize the functional modules in the yeast proteome network (see Discussion). However, this high-level view is very preliminary and static. To gain insights into the dynamics of module-module interactions, the involvement of each hub population in signal transduction pathways was examined first. As shown in Figure 5.10, the three hub types have very different representations in cell signaling. Approximately 15.7% of global hubs are involved in cell signaling. The representation of signaling molecules in the local hub population is much lower, at about 7.7%. Strikingly, among the 102 core hubs, only one is annotated as playing a role in signal transduction. This observation suggests that signaling pathways are mainly involved in the communications between functional modules, but they do not play significant roles in the communications within functional modules. This further suggests that the roles of global hubs in module-module interactions are not static, but indeed dynamic. This also partly explains why the global and local hubs show low expression correlations with their interacting partners. Generally speaking, signaling molecules are regulated mainly through post-translational modifications, such as phosphorylation and dephosphorylation, while changes at the transcript level are rather secondary.


Figure 5.10. The hub populations involved in signal transduction. A hub is considered to be involved in signal transduction if it is annotated by CYGD as a signaling molecule or as being modified by phosphorylation and dephosphorylation.

In order to further understand how the hub populations are involved in the module-module interactions during cellular response, the expression profiles of the hubs within each microarray experiment were examined. If it is the dynamic communications between modules, not the functional modules themselves, that largely characterize the behaviors of cells and organisms (Hartwell et al., 1999), then one would expect to see more dramatic changes of gene expression among global hubs than among local or core hubs. As shown in Figure 5.11A, the expression changes of global hubs are significantly larger than those of core hubs (p<0.005) or local hubs (p<2.2e-16). Furthermore, for a given stimulus, global hubs are much more likely to undergo significant expression change (increased or decreased by at least 2 fold) than local hubs (p<4.1e-6) or core hubs (p<2.2e-16) (Figure 5.11B).
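Counting the hubs with a significant change in one experiment can be sketched as below. The sketch assumes that expression changes are stored as log2 ratios, so that a change of at least 2 fold in either direction corresponds to an absolute value of at least 1; the storage format is an assumption, as are the names used.

```python
from math import log2


def count_significant(log2_ratios, hubs, fold=2.0):
    """Number of hubs whose expression rose or fell by at least `fold`."""
    cutoff = log2(fold)  # 2 fold corresponds to |log2 ratio| >= 1
    return sum(1 for h in hubs if abs(log2_ratios[h]) >= cutoff)


# Hypothetical log2 ratios for four hubs in a single microarray experiment.
experiment = {"h1": 1.3, "h2": -0.2, "h3": -1.0, "h4": 0.9}

print(count_significant(experiment, ["h1", "h2", "h3", "h4"]))
```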


Figure 5.11. The expression change of hubs in cellular response. (A) For each microarray experiment, the mean change of each hub type was computed. Shown are the mean changes for each hub type across the 265 microarray experiments. (B) For each microarray experiment, the number of hubs that undergo significant expression change (at least two fold) was counted. Shown are the numbers of hubs with significant changes across the 265 microarray experiments. (C) For a given hub, the variance of the expression changes across the 45 mating experiments was computed. Shown are the variances of all the hubs in each subpopulation.


Since global hubs undergo significant expression changes (Figure 5.11A, B), they preferentially interact with each other (Figure 5.9), and yet their expression profiles do not strongly correlate with each other (Table 1), it is very likely that global hubs respond differently to different stimuli. To examine this hypothesis, a subset of experiments was analyzed in which yeast was treated with agents that are related to mating. For each hub the variance of the expression changes across these experiments was computed. As shown in Figure 5.11C, global hubs show much higher variations than the core or local hubs (p<2.2e-16). This indicates that even though the treatments are all mating-related, different agents may induce different responses in the communicating activities between functional modules. In contrast, the responses of the core and local hubs are more similar across the experiments, which may represent the changes in the functional modules that are most relevant to mating.
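The per-hub variance across a set of experiments can be sketched with the standard library. The sketch assumes the sample variance is meant (the text does not say whether sample or population variance was used), and the toy values are hypothetical.

```python
from statistics import variance


def expression_variance(changes):
    """Sample variance of each hub's expression change across experiments."""
    return {hub: variance(values) for hub, values in changes.items()}


# Hypothetical expression changes across three mating-related experiments.
changes = {
    "global_hub": [2.0, -2.0, 0.0],  # responds differently to each agent
    "core_hub": [0.5, 0.5, 0.5],     # responds identically to each agent
}

var = expression_variance(changes)
print(var)
```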

Last, let us look at one specific example in which the expression of yeast genes was tracked after treatment with α-factor, a mating pheromone. Starting from the time of treatment and up to 2 hours after treatment, there are 33 hubs that changed their expression by at least 2 fold at some point. Among them are 2 core hubs, 12 local hubs and 19 global hubs. As shown in Figure 5.12, each hub type shows a distinct and intriguing expression pattern. First of all, most global hubs responded almost immediately to the treatment, as evidenced by the expression changes at 15 min. In contrast, local hubs responded much later. With one exception, no local hub showed an expression change until 45 min to 60 min after treatment. Secondly, after the initial response, the expression of most global hubs stayed largely unchanged up to the last time point (120 min). In contrast, the expression changes of local hubs were transient; most of them quickly returned to the baseline. The responses of the two core hubs appeared later than those of global hubs but earlier than those of local hubs. Even though no pattern can be inferred from just two genes, it is noteworthy that one core hub, NOP14, dramatically changed expression level across the time course and is the hub that maintained the most drastic change (complete repression) at the time point of 120 min. In addition, the spikes of the two core hubs appear to precede those of local hubs, which suggests that the response of core hubs may regulate the responses of local hubs. Putting these patterns in the context of the static architecture of the network, it is likely that global hubs represent the early response genes to the treatment. These hubs may respond to the treatment by coordinating the behaviors of the functional modules that are related to mating. Sustained changes of global hubs may eventually lead to the responses from the core and local hubs within those functional modules.


Figure 5.12. The expression change profiles of hubs in response to α-factor.


Discussion

Han and his colleagues examined the gene expression correlations between hub proteins and their interacting neighbors, and they identified two subpopulations of hubs in the yeast protein-protein interaction network. The “party hubs”, which co-express with their neighbors, represent the principal connectors within functional modules; the “date hubs”, which do not co-express with their neighbors, represent the higher-level connectors that organize the modules into a dynamic proteome (Han et al., 2004). The studies described in this chapter imply that not all date hubs are high-level organizers. The hubs that have more connections between modules than within modules (“global hubs” in this study) are likely to be the real global organizers of functional modules in the proteome network. However, those hubs that do not co-express with neighbors and have more connections within modules (“local hubs” in this study) are mostly involved in protein complex formation, and their roles are largely limited to within functional modules. Furthermore, the results in this chapter reveal that hubs preferentially interact with other hubs of the same type. As local hubs are mostly involved in protein complex formation and protein complexes are particularly densely connected entities, it is not surprising that local hubs are highly connected to each other. The fact that core hubs also prefer each other suggests an ordered architecture within functional modules. Suppose that core hubs showed no preference between core and local hubs, or that they interacted with more local hubs than core hubs. Then each core hub would likely connect with many protein complexes, many of which might overlap with each other. This, together with other connections between protein complexes, would make the organization patterns within functional modules very complex and intricate. The fact that core hubs preferentially interact with each other suggests that they themselves form a subnetwork inside functional modules. This subnetwork may serve as the central control unit that coordinates the behaviors and functions of the protein complexes. Based on these observations, a high-level model is proposed to represent the architecture of the yeast proteome network (Figure 5.13).

Figure 5.13. A model for the modular organization of the yeast proteome network. Within functional modules, local hubs are the major components of protein complexes, which are organized and coordinated by subnetworks of core hubs to form functional modules. Global hubs are the major connectors between functional modules and serve as the high-level organizers of the proteome network.


Lee Hartwell has raised the question of how a cell integrates information that originates both internally and externally (Hartwell et al., 1999). “Does cellular integration merely emerge from a web of pairwise connections between different sensory modules, or are there specific modules that act as a cellular equivalent of the central nervous system – integrating information and resolving conflicts?” (page C48, middle column, last paragraph). A similar point of view was raised through studies on transcriptional networks (Petti and Church, 2005). The studies in this chapter suggest that a “central nervous system” may indeed exist. Since global hubs are involved extensively in the connections between functional modules, and they also preferentially interact with each other, it is possible that global hubs, or at least some of them, form a subnetwork that serves as a “central nervous system” for information processing and integration. This is at least partly supported by the observation that, after α-factor treatment, global hubs respond immediately with changes at the transcript level, and the changes are sustained during the course of the response (Figure 5.12). It is worth noting that, even though the immediate response of global hubs may be reminiscent of the behavior of Immediate Early Genes (IEGs), these two populations are unlikely to be the same. The key difference is that the activation of IEGs is usually transient, while the expression changes of global hubs tend to be prolonged.

It is very intriguing that local hubs are over-represented in protein complexes (Figure 5.8), and that they have low expression correlation with their interacting partners (Table 5.1). This suggests that protein complex formation is a very dynamic process. It is likely that not all the components need to be present at the same time. Some proteins are added into complexes as needed. It also implies that the same complex is not necessarily formed under different conditions. The composition of a protein complex may change significantly to achieve an appropriate response under a given condition. Furthermore, since local hubs have more connections within functional modules (Table 5.1), this dynamic nature can also be extended to the organization of functional modules.

In this study, hubs were classified into subpopulations at finer resolution. It should be noted that, in a system as complex as the cell, clear-cut boundaries may not exist for almost anything. Even the boundaries between functional modules should not be regarded as absolute. Therefore some hubs may play multiple roles, or even roles different from those designated here (Han et al., 2004). However, based on the differences between these hubs as revealed by multiple studies in this chapter, it can be said with confidence that such a classification of the hubs points in the right direction. As more data from different domains are taken into consideration, the architecture of the yeast proteome network will become clearer.


CHAPTER 6

NETWORK ROBUSTNESS

The robustness of a network is its capacity to maintain its structural and functional integrity in response to internal failure or external attack. How networks behave at the time of internal failure or external attack is of great concern in many real-life situations (Sharom et al., 2004) (Albert and Barabasi, 2002). In biological networks, examples of internal failure may include genetic mutations that cause gene dysfunction (genetic networks), environmental changes that cause nutrient deprivation (metabolic networks), or natural disasters that cause sudden extinction of species (food webs). A common example of external attack is a gene or gene product being specifically inactivated by invading pathogens. A classical example is that the E1A protein of adenovirus specifically interacts with and inactivates the tumor suppressor protein Retinoblastoma (Rb), which leads to cell cycle initiation and cell transformation (Nielsch et al., 1991). As an example in medicine, many drugs work by binding and inactivating a specific target protein.

Studies on network robustness started only very recently. One of the earliest works on the robustness of scale-free networks showed that the power law distribution of connectivity is a characteristic feature of networks that are optimized for robust performance under environmental uncertainties (Carlson and Doyle, 1999) (Carlson and Doyle, 2002). From another perspective, Barabasi and his colleagues showed that, compared to random networks, scale-free networks are more robust against internal failure (random removal of edges) but are less robust against targeted attacks (hub removal) (Albert et al., 2000). They showed that this principle is applicable to the metabolic networks of microbes (Jeong et al., 2000), and that it can be used to explain gene essentiality in yeast (Jeong et al., 2001). One drawback of these studies is that network robustness was examined under large-scale removal of vertices. In reality, such disastrous collapse of networks happens very rarely. Instead, network perturbations usually involve very few components. This is particularly true in cellular networks. For example, uncontrolled growth of a tumor, either benign or malignant, usually originates from mutations of very few genes in cell cycle control. Large-scale inactivation of gene function will cause the cell to die, which is obviously not beneficial to tumor progression.

Therefore, a new method is needed to study network robustness when one or only a few vertices are removed and no network breakdown occurs. Graph spectrum is one of the network representations that may be used to address this issue. The Laplacian spectrum of a graph is the set of eigenvalues of its Laplacian matrix. The Laplacian spectrum of a network is not only closely related to network dynamics, it is also an invariant characterization of the network. It has been shown that, with very few exceptions, two graphs with identical spectra are indeed topologically identical (Cvetkovic et al., 1995). Since spectral distance is such a sensitive measure for the topological difference between two graphs, it may also be used to measure the changes of network topology when vertices are removed.

Although modularity was proposed to be a feature associated with network robustness (Hartwell et al., 1999) (Maslov and Sneppen, 2002), there are not many detailed studies on real networks. Han and his colleagues examined the roles of hubs in yeast proteome network and suggested that different types of hubs may be of different importance to genetic robustness (Han et al., 2004). With the modular architecture of the yeast proteome network being investigated (Chapter 5), network robustness analysis can be carried out from the perspective of modular organizations in more detail.

Materials and Methods

Laplacian spectral distance. Let the adjacency matrix of a graph G(V, E) be A, and let D be the diagonal matrix whose diagonal entries are the vertex degrees. The Laplacian matrix L is obtained by L=D-A. The set of eigenvalues of L, sorted in ascending order, is called the Laplacian spectrum of G.
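As a small worked illustration of these definitions (a toy graph chosen here for clarity, not one taken from the yeast network), consider the path graph on three vertices, in which vertex 2 is connected to vertices 1 and 3. Taking D to hold the vertex degrees on its diagonal, the matrices are:

```latex
\[
A = \begin{pmatrix} 0 & 1 & 0 \\ 1 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix},\quad
D = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 2 & 0 \\ 0 & 0 & 1 \end{pmatrix},\quad
L = D - A = \begin{pmatrix} 1 & -1 & 0 \\ -1 & 2 & -1 \\ 0 & -1 & 1 \end{pmatrix}
\]
\[
\det(L - \lambda I) = -\lambda(\lambda - 1)(\lambda - 3)
\quad\Longrightarrow\quad
\text{Laplacian spectrum} = \{0,\ 1,\ 3\}
\]
```

The single zero eigenvalue reflects the fact that this toy graph is connected; the remaining N-1 = 2 eigenvalues are nonzero.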

The distance between two graph spectra was calculated according to Ipsen and Mikhailov (Ipsen and Mikhailov, 2002). Briefly, the density of a spectrum is defined as a sum of narrow Lorentz distributions

i) $\rho(\omega) = K \sum_{k=1}^{N-1} \frac{r}{(\omega - \omega_k)^2 + r^2}$

where N is the number of vertices; ωk represents the N-1 nonzero eigenvalues in the spectrum; r is the scale parameter and is set to 0.08; K is the normalization factor chosen so that the integral of the density over [0, +∞) is 1. The distance between two graphs with spectral densities ρ1(ω) and ρ2(ω) is defined as

ii) $\epsilon = \int_0^{+\infty} [\rho_1(\omega) - \rho_2(\omega)]^2 \, d\omega$

R statistical packages were used for computing the densities and integrals.
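The calculation in equations (i) and (ii) can be sketched as follows. This is a minimal illustration in Python with NumPy, not the R code used in this study; the function names, the grid size, and the upper truncation point of the integral are assumptions made here. The normalization factor K uses the closed-form integral of each Lorentz peak over [0, +∞), which is π/2 + arctan(ωk/r).

```python
import numpy as np

def laplacian_eigenvalues(adj):
    """Eigenvalues of L = D - A in ascending order (adj: symmetric 0/1 matrix)."""
    adj = np.asarray(adj, dtype=float)
    lap = np.diag(adj.sum(axis=1)) - adj
    return np.linalg.eigvalsh(lap)

def lorentz_density(omegas_k, grid, r=0.08):
    """Equation (i): normalized sum of Lorentz peaks centred on each omega_k."""
    peaks = r / ((grid[:, None] - omegas_k[None, :]) ** 2 + r ** 2)
    # 1/K: closed-form integral of every peak over [0, +inf).
    norm = np.sum(np.pi / 2.0 + np.arctan(omegas_k / r))
    return peaks.sum(axis=1) / norm

def spectral_distance(adj1, adj2, r=0.08, n_grid=20001):
    """Equation (ii), integrated numerically with the trapezoid rule."""
    # Drop the smallest eigenvalue (zero for a connected graph), keeping N-1 values.
    w1 = laplacian_eigenvalues(adj1)[1:]
    w2 = laplacian_eigenvalues(adj2)[1:]
    # Assumption: the integrand is negligible beyond the largest eigenvalue + 100 r.
    upper = max(w1.max(), w2.max()) + 100.0 * r
    grid = np.linspace(0.0, upper, n_grid)
    sq = (lorentz_density(w1, grid, r) - lorentz_density(w2, grid, r)) ** 2
    return float(np.sum((sq[1:] + sq[:-1]) * np.diff(grid)) / 2.0)

def remove_vertex(adj, v):
    """Adjacency matrix with row and column v deleted (vertex removal)."""
    adj = np.asarray(adj)
    return np.delete(np.delete(adj, v, axis=0), v, axis=1)
```

Under these assumptions, the spectral change caused by deleting vertex v would be obtained as `spectral_distance(adj, remove_vertex(adj, v))`. For a network with thousands of vertices, the dense grid-by-eigenvalue array in `lorentz_density` becomes large, so the grid would need to be processed in chunks.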

Synthetic lethality data analysis. The synthetic lethality data were downloaded from the publisher’s website (Tong et al., 2004). Three datasets were selected for analysis based on the number of synthetic lethal genes in each dataset. The datasets with fewer than 30 synthetic lethal genes were discarded due to concerns about the statistical significance of the results. For each dataset, the bait gene was deleted from the original proteome network of yeast, and the resulting network was used to represent the “sensitized” network. Then each synthetic lethal gene was individually deleted from the sensitized network and the Laplacian spectral change was computed. The same number of non-synthetic lethal genes was randomly selected to form the control set, and the spectral changes for the control genes were also computed. The two sets of spectral changes were compared using the Wilcoxon two-sample test for statistical significance.
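The comparison step can be sketched as below. This is an illustrative standard-library Python version of a two-sided Wilcoxon rank-sum test, not the routine used in this study; it relies on the normal approximation without continuity or tie corrections, so its p-values may differ slightly from those of a statistics package. The function names and the seeded control sampling are assumptions made here.

```python
import math
import random

def rank_sum_test(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation).

    Returns (U statistic for sample x, approximate p-value).
    """
    pooled = sorted((value, group) for group, sample in enumerate((x, y))
                    for value in sample)
    rank_sum_x = 0.0
    i, n = 0, len(pooled)
    while i < n:
        j = i
        while j < n and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # tied values share the mean of ranks i+1..j
        rank_sum_x += avg_rank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n1, n2 = len(x), len(y)
    u = rank_sum_x - n1 * (n1 + 1) / 2.0
    mean_u = n1 * n2 / 2.0
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (u - mean_u) / sd_u
    return u, math.erfc(abs(z) / math.sqrt(2.0))

def control_sample(non_lethal_genes, size, seed=0):
    """Randomly pick `size` control genes (the seed is only for reproducibility)."""
    return random.Random(seed).sample(sorted(non_lethal_genes), size)
```

In use, `x` would hold the spectral changes of the synthetic lethal genes and `y` those of the equally sized control set.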

Results

Global organizers are important for network robustness. When vertices are sequentially removed, a network will gradually disintegrate and break down into disconnected subgraphs, which are called components. The rate at which the network progressively breaks down into components is a measure of network robustness (Albert and Barabasi, 2002). In general, removal of hubs will disintegrate networks faster than removal of non-hubs (Albert et al., 2000). However, as shown in Chapter 5, in a modular network like the yeast proteome network, subpopulations of hubs exist. The global organizers have the highest module connectivity but are not necessarily the most connected proteins in terms of pairwise interactions. Are the global organizers also the true hubs of the network from the robustness perspective? To answer this question, network breakdown was tracked when the proteins were sequentially removed. The removal started with the most connected protein, based on either pairwise interactions or module connectivities. As shown in Figure 6.1A, the yeast proteome network breaks down much faster when the proteins with high module connectivities are removed than when the proteins with many pairwise interactions are removed. To further confirm this, the network breakdown experiment was repeated so that at each removal the protein was selected randomly from among the highly connected proteins rather than in rank order. As in the sequential removal experiment, the removal of proteins with high module connectivities causes the network to break down much faster (Figure 6.1B). These results indicate that module connectivity is actually a better measure of a protein’s contribution to network robustness, and suggest that global organizers are also important for network robustness.

To further explore the roles played by different types of hubs, the network breakdown was tracked when proteins were picked from different hub populations for removal. As shown in Figure 6.2A, the removal of core hubs, the organizers within functional modules, causes the least disintegration of the entire network. In contrast, the removal of global hubs causes the most drastic breakdown of the network. Random removal gives the same result (Figure 6.2B). This observation further confirms that, in the yeast proteome network, the hubs involved in global organization are more important than local hubs from the robustness perspective.
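The breakdown experiments described above amount to counting connected components after each removal. The following is a minimal standard-library Python sketch under assumed data structures (an adjacency dictionary mapping each protein to the set of its neighbors, and a score dictionary holding either pairwise or module connectivity); the names are illustrative and not taken from the original analysis.

```python
def count_components(adj, removed=frozenset()):
    """Number of connected components among the vertices not in `removed`."""
    seen = set(removed)
    components = 0
    for start in adj:
        if start in seen:
            continue
        components += 1
        seen.add(start)
        stack = [start]
        while stack:  # depth-first traversal of one component
            u = stack.pop()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
    return components

def removal_order(score):
    """Vertices from highest to lowest score (ties broken by name)."""
    return sorted(score, key=lambda v: (-score[v], v))

def breakdown_curve(adj, order):
    """Component counts before any removal and after each removal in `order`."""
    removed = set()
    curve = [count_components(adj)]
    for v in order:
        removed.add(v)
        curve.append(count_components(adj, removed))
    return curve

# Toy graph: hub "h" joined to a, b and c, plus one edge between a and b.
toy = {"h": {"a", "b", "c"}, "a": {"h", "b"}, "b": {"h", "a"}, "c": {"h"}}
degree = {v: len(neighbors) for v, neighbors in toy.items()}
curve = breakdown_curve(toy, removal_order(degree))
```

Passing module connectivities instead of degrees to `removal_order` would give the second curve of the comparison; the random variant would shuffle the set of highly connected proteins and average the curves over repeated runs.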


Figure 6.1. Module connectivity and network robustness. The number of components in the yeast proteome network was tracked when hubs were removed. (A) The removal was sequential, starting with the most connected hub, based either on pairwise interaction (k) or on module connectivity (km). (B) The removal was random among the highly connected hubs. The experiment was repeated 100 times and the average of the 100 runs is shown.


Figure 6.2. Global hubs are important for network robustness. The number of components in the yeast proteome network was tracked when different types of hubs were removed. (A) Sequential removal, starting with the most connected hub. (B) Random removal. The average of 100 repeats is shown.

Graph spectrum for studying network robustness. Currently network breakdown is widely used to study network robustness. Despite its intuitive appeal, network breakdown does not closely resemble real-life situations. Disastrous system failure due to large-scale component malfunction does not happen very often in real-life networks. This is particularly true for biological systems. New methods are needed to measure the small changes in networks caused by malfunctions of very few components.

Graph spectrum is one of the network representations that may be used to address this issue. Using spectral distance as the sole objective function, Ipsen and Mikhailov successfully evolved random graphs into graphs of desired topology (Ipsen and Mikhailov, 2002). Since spectral distance is such a sensitive measure for the topological difference between two graphs, can it also be used to measure the changes of network topology when very few vertices are removed? To test this, the hubs in the three subpopulations were sequentially removed from the yeast proteome network. After each removal, the spectral distance between the new network and the original network was computed. As shown in Figure 6.3, the removal of core hubs caused the smallest changes in spectrum, and the removal of global hubs led to the most dramatic changes. This result suggests that spectral distance could be a good measure of network change in a robustness simulation experiment.


Figure 6.3. Graph spectral change due to sequential removal of hubs. In each experiment, hubs were sequentially removed from the yeast proteome network, starting from the most connected. After each removal, the spectral distance was computed between the new network and the original network.

However, one question remains to be answered: is the Laplacian spectrum sensitive enough to capture the network changes when only one vertex or very few vertices are removed? To answer this question, each protein was individually removed from the network, and the spectral change was computed for each removal. As shown in the scatter plot in Figure 6.4A, there is an obvious positive correlation between a protein’s connectivity and the spectral change that results from its removal. The Pearson correlation coefficient between connectivity and spectral distance is 0.584. As discussed above, the module connectivity of a protein is also an important aspect of network robustness (Figure 6.1). Figure 6.4B shows that a similar positive correlation exists between spectral distance and module connectivity (Pearson correlation coefficient = 0.591). These results suggest that graph spectral change is a sensitive measure for tracking network changes when only a single vertex is removed and no network breakdown occurs.


Figure 6.4. Correlations between spectral change and connectivity. (A) The correlation between spectral change and pairwise connectivity. (B) The correlation between spectral change and module connectivity.

The Pearson correlation coefficients from the above studies indicate a positive correlation between spectral change and connectivity, but the correlation is not very strong. This is not difficult to understand. After all, graph spectrum represents the topology of the entire network, not just the number of connections. Even if two proteins have the same number of pairwise interactions, their removal does not necessarily cause exactly the same amount of change in topology. If one of them is connected with more functional modules, its removal may cause more significant changes in the graph spectrum.


To further explore this relationship, proteins were divided into groups such that proteins within the same group have the same number of interactions. Then within each group the Pearson correlation coefficient was computed between module connectivity and spectral change. As shown in Figure 6.5A, most of the Pearson correlation coefficients are positive, suggesting a positive correlation. In fact, over one third of the coefficients are greater than 0.4, suggesting a strong positive correlation. To find out which population of proteins shows the positive correlation between module connectivity and spectral change, the Pearson coefficient obtained for each group is plotted against the pairwise connectivity for that group. As shown in Figure 6.5B, high Pearson coefficients are concentrated in the high connectivity region. In other words, these proteins are hubs. This is a very intriguing result. It suggests that for hubs in the protein-protein interaction network, the module connectivity is more relevant to how much a protein contributes to network robustness, while the actual number of pairwise interactions is not as important. This has been shown in Figure 6.1 in the simulation of network breakdown, but here the result is even more meaningful since it is based on single protein removal.
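The grouping analysis described above can be sketched as follows. This is a minimal standard-library Python illustration; the record layout (pairwise connectivity, module connectivity, spectral change) and the minimum group size are assumptions made here, since the text does not state how very small groups were handled.

```python
import math
from collections import defaultdict

def pearson(xs, ys):
    """Pearson correlation coefficient, or None if either variable is constant."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    return sxy / math.sqrt(sxx * syy)

def correlation_by_degree(records, min_size=3):
    """Map each pairwise connectivity k to the Pearson coefficient between
    module connectivity and spectral change among proteins with that k.

    records: iterable of (k, module_connectivity, spectral_change) tuples.
    min_size: hypothetical cutoff; smaller groups are skipped.
    """
    groups = defaultdict(list)
    for k, km, change in records:
        groups[k].append((km, change))
    result = {}
    for k, pairs in groups.items():
        if len(pairs) < min_size:
            continue
        r = pearson([km for km, _ in pairs], [change for _, change in pairs])
        if r is not None:
            result[k] = r
    return result
```

The values of the returned dictionary correspond to the histogram in panel A, and its (key, value) pairs to the scatter plot in panel B.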

Contributions of individual hubs to network robustness. With these reassuring observations, the method of Laplacian spectral change can now be used to analyze the contributions of individual hubs to the robustness of the yeast proteome network. As shown in Figure 6.6, individual removal of global hubs causes more significant changes to network topology than removal of core hubs (p-value < 10^-13) or local hubs (p-value < 10^-5), while the removal of core hubs causes smaller changes than the removal of any other type of hub (all p-values < 0.001). These results suggest that, even at the single protein level, the global organizers are the major contributors to the robustness of the yeast protein-protein interaction network.

Figure 6.5. The correlation between module connectivity and spectral change for proteins with the same pairwise connectivity. (A) The histogram of the Pearson correlation coefficients between module connectivity and spectral changes. (B) The scatter plot of Pearson coefficients and connectivity. Each data point represents one group of proteins; the x-axis is the pairwise connectivity of the proteins in that group; the y-axis is the Pearson coefficient between the spectral changes and the module connectivities in that group.

Figure 6.6. Spectral change upon individual hub removal

As indicated in Chapter 5, different hubs play different roles at different levels in the yeast proteome network. It would be interesting to see whether the hubs contribute differently to the robustness of functional modules than to the robustness of the entire network. To address this issue, each functional module was extracted from the network and treated as an independent network. Then each hub was individually removed from each of these networks, and the spectral changes were computed with respect to each module. As shown by the boxplot in Figure 6.7, the spectral changes largely overlap with each other for the core, local and global hubs. However, when outliers are not considered, the spectral changes of core hubs are larger than those of local hubs, although only marginally so (p-value=0.053), and significantly larger than those of global hubs (p-value=5.4e-5); the spectral changes of local hubs are also larger than those of global hubs (p-value=0.012). This suggests that local organizers contribute more to module robustness than the global organizers.


Figure 6.7. Spectral changes of functional modules upon individual hub removal

The above studies show that in a modular network like the yeast protein-protein interaction network, high-level organizers are the major contributors to network robustness, and organizers within functional modules contribute most to the topological integrity of individual functional modules. An important question to ask is how these observations correlate with biological observations. One of the most important biological observations relevant to network robustness is gene essentiality, which was obtained by monitoring the growth of yeast mutants, each of which harbors a deletion mutation of one specific gene (Giaever et al., 2002). To examine the relationship between network robustness at the topological level and functional essentiality at the phenotypic level, the distribution of essential yeast genes among the hub populations was calculated. As shown in Figure 6.8, most of the core hubs are essential (86.3%), compared to the local and global hubs, of which less than half are essential. These results suggest that the organizers of functional modules (core hubs) contribute more than global hubs to the survival of yeast as an organism. In other words, the functional integrity of individual functional modules may be more important to organism survival than that of the entire network.

Figure 6.8. The percentage of essential genes among the hub populations

Another area where network robustness analysis may be potentially very useful is genetic lethality screening. Genetic lethality refers to the situation where the deletion of two genes individually does not result in cell death, while the deletion of both genes does (Tong et al., 2004). Genetic lethality has been proposed as a tool for screening potential cancer drug targets (Hartwell et al., 1997). One issue associated with this approach is that the number of mutants to be generated and the number of experiments to be done grow combinatorially. Network robustness simulation might be used here as an in silico pre-screening process in which potential synthetic lethal genes may be identified as those whose deletion causes significant network changes. These potential targets can then be further explored by experimental methods. Obviously network breakdown cannot be used in this case, while graph spectrum fits such needs better. To check this possibility, three datasets of synthetic lethality were analyzed. As shown in Figure 6.9, for two of the three datasets (ARP2 and BNI1), the deletion of the synthetic lethal genes causes larger spectral changes than the deletion of the non-synthetic lethal genes does. For the third dataset (BIM1), there is no significant difference. This preliminary result suggests that graph spectrum may be developed into a computational tool for pre-screening of synthetic lethal genes.

Figure 6.9. Synthetic lethality and graph spectral change. The background mutation refers to the gene that is mutated in the bait strain used in the screening. Shown is the mean spectral change upon removal of either the synthetic lethal genes or the non-synthetic lethal genes, the error bars being standard errors of each dataset. p-values: ARP2 – 0.069; BIM1 – 0.854; BNI1 – 0.0009.


Discussion

The topic of network robustness has so far been addressed only using network breakdown simulations where vertices or edges are successively removed (Albert and Barabasi, 2002) (Albert et al., 2000). Yet in reality network failure usually does not happen at this scale, and therefore such simulations do not always provide practical insights into real systems. In this chapter, the issue of network robustness is addressed under situations where only very minor changes occur in the network. It is shown for the first time that graph spectral distance may be used to measure the contribution of a vertex to the robustness of the network at both the local and the global level. This could be very useful when the importance of some components of the network needs to be judged, particularly if these components have a similar or the same number of connections. For example, assume that for a specific type of cancer there are a few potential drug targets to choose from to develop an anticancer drug. The purpose is to choose a target whose inhibition will maximize the effects on cancer cells (efficacy) and minimize the effects on normal cells (side effects), which is critical for any drug to be successful (Wermuth, 2004). In this case, one can compile the proteome networks of the cancer cell and the normal cell and represent them as two graphs. Then each potential drug target can be individually removed from the two networks and the graph spectral changes computed. The target that causes the largest spectral change in the cancer network and the smallest change in the normal network may be the best candidate. Alternatively, certain threshold values of spectral change can be set for both networks, and graph spectral changes can be used to select targets for further testing. It is worth noting that even if a complete proteome network is not available, local subgraphs compiled from relevant pathways may still be used to obtain useful information.
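The two selection strategies outlined above can be illustrated with a short Python sketch. It assumes that the spectral change of each candidate target has already been computed in both networks and stored in two dictionaries; the function names, the scoring by simple difference, and the threshold parameters are hypothetical choices made for illustration, not part of this study.

```python
def rank_targets(cancer_change, normal_change):
    """Candidates sorted by (change in cancer network) - (change in normal network),
    largest difference first. Scoring by simple difference is an assumption."""
    common = set(cancer_change) & set(normal_change)
    return sorted(common,
                  key=lambda t: cancer_change[t] - normal_change[t],
                  reverse=True)

def filter_targets(cancer_change, normal_change, min_cancer, max_normal):
    """Candidates whose spectral change is at least `min_cancer` in the cancer
    network and at most `max_normal` in the normal network."""
    return sorted(t for t in cancer_change
                  if t in normal_change
                  and cancer_change[t] >= min_cancer
                  and normal_change[t] <= max_normal)
```

Either function would only shortlist candidates for further testing; the choice of score and thresholds would have to be calibrated on real data.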

Of course, the application of graph spectral change in network robustness does not have to involve component removal. It could be equally useful when network changes need to be assessed after adding extra components into the network. In the scenario of drug development, it may also help in making choices when enzyme agonists are the desired therapeutic agents. For example, many signal transduction pathways consist of cascading kinases, such as the MAP kinase pathway (Qi and Elion, 2005) (Schwartz and Madhani, 2004). An agonist for any kinase in the cascade may activate this pathway, yet each agonist may affect the behavior of the network differently. Although many other issues have to be considered here, graph spectrum could be a convenient in silico tool for initial assessments.

It has been shown that in the protein-protein interaction network of yeast, hubs are generally more likely to be essential than non-hubs (Jeong et al., 2001). In this chapter, it is shown that not all hubs are equally important for the survival of an organism. Core hubs, which may form control units within functional modules (Chapter 5), are more likely to be essential than other types of hubs. This observation not only confirms the organizational roles of these hubs as discussed in Chapter 5, it also points to potential applications. Currently protein-protein interaction data are available at genomic scale for S. cerevisiae, C. elegans, and D. melanogaster. But due to technical difficulties, phenotypic studies have not been attempted at large scale in organisms other than S. cerevisiae (Giaever et al., 2002). By combining the results in this chapter with datasets of homologous genes, it may be possible to gain useful insights into gene essentiality in higher organisms, including mouse and human.

It should be noted that, no matter how sophisticated an algorithm is, topological features alone will not be able to explain in full the essentiality of all genes. Many metabolic enzymes catalyze reactions in essential metabolic pathways. They may not interact with any other protein, but their metabolic intermediates or end products are indispensable for the organism. Therefore computational results should not be taken as the sole metric for the importance of a gene. Rather, they should be used as one of the tools that may help us fully understand the functions of genes.


CHAPTER 7

CONCLUSIONS AND FUTURE PERSPECTIVES

In this dissertation, the protein-protein interaction network of the budding yeast Saccharomyces cerevisiae was studied from the perspectives of topology, modularity, organization and robustness.

In the introduction (Chapter 1), the need for a transition towards network biology was first addressed from the points of view of both necessity and feasibility. Then, focusing on the theme of modularity, recent progress on detecting functional modules in the yeast proteome network was summarized, followed by a discussion of the importance of integrating functional genomics information. Next, the need for studying the organization of the functional modules and the architecture of the network was discussed. Last, the issue of network robustness was presented, along with the limitations of previous studies and the approaches to improve on them. The chapter concluded with an outline of the entire thesis.

In Chapter 2, the need for and the process of integrating the reliable datasets of protein-protein interactions were first described. The dataset was then assessed for protein coverage, and the result indicated that the dataset has a representative and balanced coverage over all the functional categories in the yeast proteome. Then the dataset was validated from the perspectives of functional similarity, cellular colocalization, and scale-free model fitting. The results indicated that the dataset used in this thesis is enriched with interactions between proteins of similar functions and interactions between colocalized proteins. It was also shown that the network represented by the dataset fits the scale-free network model. Overall this chapter served as the quality control unit for this dissertation research, providing assurance that the results and conclusions in the following sections are valid.

In Chapter 3, statistical mechanics was used to characterize the overall topology of the protein-protein interaction network. It was shown that the yeast proteome network has the characteristics of scale-free networks in terms of connectivity distribution and graph spectrum, but it has the properties of modular networks in terms of shortest path distance and clustering coefficient. This chapter revealed the modularity in the network and provided justification for analyzing the network from the perspective of functional modules.

Chapter 4 focused on the detection of functional modules. Gene expression profiles were first integrated into the yeast proteome network, which was then represented as a weighted graph. Then an algorithm was developed and applied to partition the network into subgraphs, which were then filtered using proposed criteria to obtain a set of functional modules. The functional modules were then rigorously validated from the perspectives of local connectivity, growth phenotype, and protein complexes. It was shown that the functional modules are densely connected local subgraphs, that genes in the same functional module exhibit similar deletion phenotypes, and that known protein complexes are largely contained in the functional modules. These results strongly suggested that the detected functional modules are not only topologically valid, but also biologically meaningful. The chapter concluded with a discussion of a specific example of functional modules.

In Chapter 5 the organization of the yeast proteome network was studied. First the relationship between the gene expression profiles of hubs and their interacting proteins was analyzed, and the results indicated that subpopulations of hubs exist in the yeast proteome network. Based on this result and the connectivity distribution between and within functional modules, the hub proteins were classified into three categories, namely the core, local, and global hubs. By examining these hub populations from the perspectives of protein complexes, interaction overlap, clustering coefficients, and module connectivity, it was established that the global hubs form the backbone of module-module connection, while core hubs are organizers within functional modules. Next, analysis of the interactions between the hubs indicated that each of the four types of hubs preferentially interacts with hubs from the same population, which suggests an ordered architecture for the network and the existence of central processing subnets at both the global and the functional module level. Last, the gene expression changes of the hubs in cellular responses were analyzed to gain insights into the dynamics of module-module interactions, and the results suggest that global hubs are the major and early responders in cellular response.

Chapter 6 addressed the issue of network robustness. First, network breakdown simulation was used to examine the contribution of each hub population to the overall robustness of the yeast proteome network. Next, comparison and correlation studies showed that graph spectral change could be used as a sensitive measure for studying network robustness under minor disturbance. Studies of the spectral changes at both the network and functional module levels then indicated that network organizers contribute most to robustness at both the global and local levels. Finally, a correlation was established between network robustness and gene essentiality, which suggests many potential applications of this new technique.
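As a rough illustration of measuring robustness through a spectral change, the sketch below compares the largest eigenvalue of the adjacency matrix before and after a single node is removed. Using the largest adjacency eigenvalue is an assumption made only for this example; the spectral measure actually used in Chapter 6 is not restated in this summary. The function names are hypothetical, and the graph is assumed to be undirected and stored as a dictionary of neighbor sets.

```python
import math


def largest_eigenvalue(adj, iterations=500):
    """Largest adjacency eigenvalue, estimated by power iteration on A + I."""
    nodes = list(adj)
    if not nodes:
        return 0.0
    x = {u: 1.0 for u in nodes}
    for _ in range(iterations):
        # y = (A + I) x; the identity shift avoids oscillation on bipartite graphs.
        y = {u: x[u] + sum(x[v] for v in adj[u]) for u in nodes}
        norm = math.sqrt(sum(t * t for t in y.values()))
        x = {u: t / norm for u, t in y.items()}
    # Rayleigh quotient x^T A x for the unit vector x.
    return sum(x[u] * sum(x[v] for v in adj[u]) for u in nodes)


def remove_node(adj, node):
    """Copy of the graph without the given node and its edges."""
    return {u: {v for v in vs if v != node} for u, vs in adj.items() if u != node}


def spectral_change(adj, node):
    """Drop in the largest eigenvalue caused by deleting one node."""
    return largest_eigenvalue(adj) - largest_eigenvalue(remove_node(adj, node))
```

In a star-shaped toy graph, deleting the central node produces a much larger spectral change than deleting a peripheral node, which is the kind of contrast between organizers and other nodes that such a measure is meant to capture.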

Overall, this dissertation studied many aspects of the protein-protein interaction network of Saccharomyces cerevisiae and produced a number of original findings, including: 1) significant modularity exists in the yeast proteome network; 2) genes in the same functional module confer similar growth phenotypes; 3) there are multiple subpopulations of hubs, each of which plays a distinct role in the organization of the yeast proteome network; 4) there exists some form of central subnet at both the global and local levels that is responsible for information integration and processing; 5) network organizers are important for network robustness.

This dissertation also made significant contributions to the field of systems biology, including: 1) the protein-protein interaction dataset of Saccharomyces cerevisiae used in this study is the largest dataset with balanced protein coverage, and its reliability was validated from multiple perspectives, so it can be used by other scientists for topics of their own interest; 2) the functional modules detected in this study were rigorously validated and can be used by both bench scientists and computational biologists for guiding experimental design or for other data mining projects; 3) the graph spectrum approach to robustness studies provides a new method for examining the importance of genes at fine resolution.

Yet, as is always true in scientific discovery, expanding knowledge opens up more unknowns, and many more topics remain to be explored. As discussed in Chapter 2, about one third of the yeast genes are still not included in the proteome network. These likely represent proteins of low expression or proteins with transient interactions. Some of these interactions may already exist in the high-throughput dataset, while others are waiting to be identified in laboratories. Improvements in both experimental and computational methods are therefore needed to obtain a truly comprehensive dataset of protein-protein interactions in yeast.

The weighting of interactions is a topic that can be pursued in many different directions. Instead of using the whole set of expression profiles, one could select a subset of profiles for weight calculation. If machine learning techniques are used for microarray selection, the weighting scheme of the interactions may be more representative of the actual relationships within local clusters (Ihmels et al., 2002). To carry this one step further, a specific subset of profiles could be selected for each individual interaction to represent the true functional correlation between two interacting proteins. Of course, data integration need not be limited to gene expression, but can include other functional genomics information. One great benefit of consolidating multiple datasets is that it can compensate for missing data in each individual dataset.
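The idea of computing interaction weights from a selected subset of expression profiles can be sketched as follows. The sketch assumes that the weight of an interaction is the Pearson correlation between the expression profiles of the two interacting genes, which is an assumption for illustration rather than a restatement of the weighting scheme defined earlier in the dissertation. The function names and the optional conditions argument (a list of array indices to use) are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # a flat profile carries no correlation information
    return cov / (sd_x * sd_y)


def weight_interactions(interactions, profiles, conditions=None):
    """Weight each interaction by the correlation of its two profiles.

    interactions: iterable of (gene_a, gene_b) pairs
    profiles:     dict mapping gene -> list of expression values
    conditions:   optional list of array indices; None means use all arrays
    """
    weights = {}
    for a, b in interactions:
        if a not in profiles or b not in profiles:
            continue  # skip interactions with missing expression data
        pa, pb = profiles[a], profiles[b]
        if conditions is not None:
            pa = [pa[i] for i in conditions]
            pb = [pb[i] for i in conditions]
        weights[(a, b)] = pearson(pa, pb)
    return weights
```

Passing a different conditions list for each interaction would correspond to the per-interaction profile selection suggested above; how those subsets should be chosen is left open here.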

Another important direction for further study is partition algorithms. As discussed in the introduction to Chapter 4, many algorithms have been developed and applied to biological networks. It is possible to combine the results from each of these methods and take the consensus as the final partition. In fact, the author of this thesis has already conceived a genetic algorithm-based framework for this purpose. Even though it was not completed due to time constraints, preliminary results on artificial datasets were very promising (data not shown). It is beyond doubt that a multi-weighted graph representation, combined with elegant partitioning algorithms, will yield unprecedented insights into the functional organization of the cell.
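One simple way to take a consensus over several partitions is sketched below. It is not the genetic algorithm-based framework mentioned above, whose details are not given; instead it uses a plain co-association rule as a stand-in: two nodes are placed in the same consensus group when more than a chosen fraction of the input partitions put them together. The function name and the threshold parameter are hypothetical.

```python
from itertools import combinations


def consensus_partition(partitions, threshold=0.5):
    """Merge nodes that share a cluster in more than `threshold` of the partitions.

    partitions: list of dicts, each mapping node -> cluster label,
                all defined over the same set of nodes.
    """
    nodes = sorted(partitions[0])
    parent = {u: u for u in nodes}

    def find(u):
        # Union-find lookup with path halving.
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    for u, v in combinations(nodes, 2):
        together = sum(1 for p in partitions if p[u] == p[v])
        if together / len(partitions) > threshold:
            parent[find(u)] = find(v)

    groups = {}
    for u in nodes:
        groups.setdefault(find(u), []).append(u)
    return sorted(groups.values())
```

Because groups are formed by chaining pairwise agreements, a node can end up grouped with another node it rarely co-occurs with; a stricter threshold reduces this effect.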

In this dissertation, a novel method was developed to study network robustness under minor perturbations using graph spectrum. This opens up many research avenues.

With phenotype data now available at the genomic scale, one interesting topic is the relationship between gene deletion phenotype and network robustness. Studies along this line may lead to the discovery of mechanisms of quantitative trait-associated complex diseases. In a quantitative trait-associated disease, multiple genes are implicated, each of which partly contributes to the disease (Newton-Cheh and Hirschhorn, 2005). How does each individual gene contribute to the disruption of the network? How do they disturb the system when combined? Studies on graph spectral change and small-scale network disturbance may lead to mathematical models for answering such questions. Recently it was proposed to use genome-wide association studies for these complex diseases (Hirschhorn and Daly, 2005), which echoes the urgency for systematic approaches in the domain of computational biology. One difficulty for genome-wide SNP screening is the detection of rare alleles (Hirschhorn and Daly, 2005). Whether graph spectral change can be used to build computational tools to detect such alleles will be a very interesting topic.

Still many more questions remain to be answered. What are the functions of those modules? How do they act cooperatively to respond to perturbations? How does the protein-protein interaction network evolve? How do functional modules evolve? Even though no answer to any of these questions is yet in sight, the combination of experimental approaches, computational approaches, and human curiosity shall certainly move us closer to the answers day by day.


BIBLIOGRAPHY

Albert, R. et al. (2000) Error and attack tolerance of complex networks. Nature., 406, 378-82.

Albert, R. and Barabasi, A.L. (2002) Statistical mechanics of complex networks. Reviews of Modern Physics, 74, 47-97.

Almaas, E. et al. (2002) Characterizing the structure of small-world networks. Phys Rev Lett., 88, 098101. Epub 2002 Feb 14.

Asakawa, H. et al. (2005) Dissociation of the nuf2-ndc80 complex releases centromeres from the spindle-pole body during meiotic prophase in fission yeast. Mol Biol Cell., 16, 2325-38. Epub 2005 Feb 23.

Bader, J.S. et al. (2004) Gaining confidence in high-throughput protein interaction networks. Nat Biotechnol, 22, 78-85. Epub 2003 Dec 14.

Ball, C.A. et al. (2005) The Stanford Microarray Database accommodates additional microarray platforms and data formats. Nucleic Acids Res, 33, D580-2.

Barabasi, A.L. and Albert, R. (1999) Emergence of scaling in random networks. Science., 286, 509-12.

Barabasi, A.L. and Oltvai, Z.N. (2004) Network biology: Understanding the cell's functional organization. Nat Rev Genet, 5, 101-13.

Barrat, A. et al. (2004) The architecture of complex weighted networks. Proc Natl Acad Sci U S A., 101, 3747-52. Epub 2004 Mar 8.

Bentley, D.R. (2000) The Human Genome Project--an overview. Med Res Rev., 20, 189-96.

Berg, J. and Lassig, M. (2004) Local graph alignment and motif search in biological networks. Proc Natl Acad Sci U S A, 101, 14689-94. Epub 2004 Sep 24.

Bork, P. et al. (2004) Protein interaction networks from yeast to human. Curr Opin Struct Biol, 14, 292-9.


Branzei, D. and Foiani, M. (2005) The DNA damage response during DNA replication. Curr Opin Cell Biol, 17, 568-75.

Carlson, J.M. and Doyle, J. (1999) Highly optimized tolerance: A mechanism for power laws in designed systems. Phys Rev E Stat Phys Plasmas Fluids Relat Interdiscip Topics, 60, 1412-27.

Carlson, J.M. and Doyle, J. (2002) Complexity and robustness. Proc Natl Acad Sci U S A, 99, 2538-45.

Chen, B.P. et al. (1996) Analysis of atf3, a transcription factor induced by physiological stresses and modulated by gadd153/chop10. Mol Cell Biol, 16, 1157-68.

Chen, J. et al. (2005) Discovering reliable protein interactions from high-throughput experimental data using network topology. Artif Intell Med., 35, 37-47.

Crick, F.H.C. (1970) Central dogma of molecular biology. Nature, 227, 561-563.

Dandekar, T. et al. (1998) Conservation of gene order: A fingerprint of proteins that physically interact. Trends Biochem Sci., 23, 324-8.

Deane, C.M. et al. (2002) Protein interactions: Two methods for assessment of the reliability of high throughput observations. Mol Cell Proteomics., 1, 349-56.

Deng, M. et al. (2003) Assessment of the reliability of protein-protein interactions and protein function prediction. Pac Symp Biocomput., 140-51.

D'Haeseleer, P. and Church, G.M. (2004) Estimating and improving protein interaction error rates. Proc IEEE Comput Syst Bioinform Conf., 216-23.

Enright, A.J. et al. (1999) Protein interaction maps for complete genomes based on gene fusion events. Nature., 402, 86-90.

Fields, S. and Song, O. (1989) A novel genetic system to detect protein-protein interactions. Nature., 340, 245-6.

Gartner, W. et al. (2003) The atp-dependent helicase ruvbl1/tip49a associates with tubulin during mitosis. Cell Motil Cytoskeleton., 56, 79-93.

Gavin, A.C. et al. (2002) Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature, 415, 141-7.


Ge, H. et al. (2001) Correlation between transcriptome and interactome mapping data from Saccharomyces cerevisiae. Nat Genet, 29, 482-6.

Ghosh, S. and Collins, F.S. (1996) The geneticist's approach to complex disease. Annu Rev Med., 47, 333-53.

Giaever, G. et al. (2002) Functional profiling of the Saccharomyces cerevisiae genome. Nature, 418, 387-91.

Gingras, A.C. et al. (2005) Advances in protein complex analysis using mass spectrometry. J Physiol., 563, 11-21. Epub 2004 Dec 20.

Giot, L. et al. (2003) A protein interaction map of Drosophila melanogaster. Science, 302, 1727-36. Epub 2003 Nov 6.

Girvan, M. and Newman, M.E. (2002) Community structure in social and biological networks. Proc Natl Acad Sci U S A, 99, 7821-6.

Grunenfelder, B. and Winzeler, E.A. (2002) Treasures and traps in genome-wide data sets: Case examples from yeast. Nat Rev Genet, 3, 653-61.

Gunsalus, K.C. et al. (2005) Predictive models of molecular machines involved in Caenorhabditis elegans early embryogenesis. Nature., 436, 861-5.

Han, J.D. et al. (2004) Evidence for dynamically organized modularity in the yeast protein-protein interaction network. Nature, 430, 88-93. Epub 2004 Jun 09.

Hartwell, L.H. et al. (1997) Integrating genetic approaches into the discovery of anticancer drugs. Science., 278, 1064-8.

Hartwell, L.H. et al. (1999) From molecular to modular cell biology. Nature, 402, C47-52.

Hirschhorn, J.N. and Daly, M.J. (2005) Genome-wide association studies for common diseases and complex traits. Nat Rev Genet., 6, 95-108.

Ho, Y. et al. (2002) Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature, 415, 180-3.

Ihmels, J. et al. (2002) Revealing modular organization in the yeast transcriptional network. Nat Genet, 31, 370-7.

Ipsen, M. and Mikhailov, A.S. (2002) Evolutionary reconstruction of networks. Phys Rev E Stat Nonlin Soft Matter Phys, 66, 046109. Epub 2002 Oct 14.


Ito, T. et al. (2001) A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proc Natl Acad Sci U S A, 98, 4569-74. Epub 2001 Mar 13.

Jeong, H. et al. (2000) The large-scale organization of metabolic networks. Nature, 407, 651-4.

Jeong, H. et al. (2001) Lethality and centrality in protein networks. Nature, 411, 41-2.

Kitano, H. (2002a) Computational systems biology. Nature, 420, 206-10.

Kitano, H. (2002b) Systems biology: A brief overview. Science, 295, 1662-4.

Kumar, A. et al. (2002) Subcellular localization of the yeast proteome. Genes Dev, 16, 707-19.

Li, S. et al. (2004) A map of the interactome network of the metazoan C. elegans. Science, 303, 540-3. Epub 2004 Jan 2.

Lord, P.W. et al. (2003a) Investigating semantic similarity measures across the gene ontology: The relationship between sequence and annotation. Bioinformatics, 19, 1275-83.

Lord, P.W. et al. (2003b) Semantic similarity measures as tools for exploring the gene ontology. Pac Symp Biocomput, 601-12.

Mann, M. et al. (2001) Analysis of proteins and proteomes by mass spectrometry. Annu Rev Biochem., 70, 437-73.

Maslov, S. and Sneppen, K. (2002) Specificity and stability in topology of protein networks. Science, 296, 910-3.

Newman, M.E. (2001) Scientific collaboration networks. II. Shortest paths, weighted networks, and centrality. Phys Rev E Stat Nonlin Soft Matter Phys, 64, 016132. Epub 2001 Jun 28.

Newman, M.E. and Girvan, M. (2004) Finding and evaluating community structure in networks. Phys Rev E Stat Nonlin Soft Matter Phys, 69, 026113. Epub 2004 Feb 26.

Newton-Cheh, C. and Hirschhorn, J.N. (2005) Genetic association studies of complex traits: Design and analysis issues. Mutat Res., 573, 54-69.

Nielsch, U. et al. (1991) Adenovirus e1a-p105(rb) protein interactions play a direct role in the initiation but not the maintenance of the rodent cell transformed phenotype. Oncogene., 6, 1031-6.

Patil, A. and Nakamura, H. (2005) Filtering high-throughput protein-protein interaction data using a combination of genomic features. BMC Bioinformatics., 6, 100.

Pereira-Leal, J.B. et al. (2004) Detection of functional modules from protein interaction networks. Proteins, 54, 49-57.

Petti, A.A. and Church, G.M. (2005) A network of transcriptionally coordinated functional modules in Saccharomyces cerevisiae. Genome Res., 15, 1298-306. Epub 2005 Aug 18.

Qi, M. and Elion, E.A. (2005) Map kinase pathways. J Cell Sci., 118, 3569-72.

Radicchi, F. et al. (2004) Defining and identifying communities in networks. Proc Natl Acad Sci U S A, 101, 2658-63. Epub 2004 Feb 23.

Rigaut, G. et al. (1999) A generic protein purification method for protein complex characterization and proteome exploration. Nat Biotechnol., 17, 1030-2.

Rives, A.W. and Galitski, T. (2003) Modular organization of cellular networks. Proc Natl Acad Sci U S A, 100, 1128-33. Epub 2003 Jan 21.

Sanjuan, R. and Marin, I. (2001) Tracing the origin of the compensasome: Evolutionary history of deah helicase and myst acetyltransferase gene families. Mol Biol Evol., 18, 330-43.

Schlegel, B.P. et al. (2003) Overexpression of a protein fragment of rna helicase a causes inhibition of endogenous brca1 function and defects in ploidy and cytokinesis in mammary epithelial cells. Oncogene., 22, 983-91.

Schwartz, M.A. and Madhani, H.D. (2004) Principles of map kinase signaling specificity in Saccharomyces cerevisiae. Annu Rev Genet., 38, 725-48.

Sharom, J.R. et al. (2004) From large networks to small molecules. Current Opinion in Chemical Biology, 8, 81-90.

Spirin, V. and Mirny, L.A. (2003) Protein complexes and functional modules in molecular networks. Proc Natl Acad Sci U S A, 100, 12123-8. Epub 2003 Sep 29.

Sprinzak, E. et al. (2003) How reliable are experimental protein-protein interaction data? J Mol Biol., 327, 919-23.

Strohman, R. (2002) Maneuvering in the complex path from genotype to phenotype. Science, 296, 701-3.


Tanay, A. et al. (2004) Revealing modularity and organization in the yeast molecular network by integrated analysis of highly heterogeneous genomewide data. Proc Natl Acad Sci U S A, 101, 2981-6. Epub 2004 Feb 18.

Tong, A.H. et al. (2004) Global mapping of the yeast genetic interaction network. Science, 303, 808-13.

Troyanskaya, O.G. (2005) Putting microarrays in a context: Integrated analysis of diverse biological data. Brief Bioinform., 6, 34-43.

Tucker, C.L. et al. (2001) Towards an understanding of complex protein networks. Trends Cell Biol, 11, 102-6.

Uetz, P. et al. (2000) A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature, 403, 623-7.

von Mering, C. et al. (2002) Comparative assessment of large-scale data sets of protein-protein interactions. Nature, 417, 399-403.

Watson, J.D. and Crick, F.H.C. (1953) A structure for deoxyribose nucleic acid. Nature, 171, 737-8.

Watts, D.J. and Strogatz, S.H. (1998) Collective dynamics of 'small-world' networks. Nature, 393, 440-2.

Wei, R.R. et al. (2005) Molecular organization of the ndc80 complex, an essential kinetochore component. Proc Natl Acad Sci U S A., 102, 5363-7. Epub 2005 Apr 4.

Wermuth, C.G. (2004) Selective optimization of side activities: Another way for drug discovery. J Med Chem., 47, 1303-14.

Westermann, S. et al. (2005) Formation of a dynamic kinetochore-microtubule interface through assembly of the dam1 ring complex. Mol Cell., 17, 277-90.

Westermann, S. et al. (2006) The dam1 kinetochore ring complex moves processively on depolymerizing microtubule ends. Nature, 440, 565-569.

Wolfgang, C.D. et al. (1997) Gadd153/chop10, a potential target gene of the transcriptional repressor atf3. Mol Cell Biol, 17, 6700-7.

Xiong, H. et al. (2005) Identification of functional modules in protein complexes via hyperclique pattern discovery. Pac Symp Biocomput, 221-32.


APPENDIX A

THE LIST OF THE FUNCTIONAL MODULES


Module ID: The identification number given to each module.

Component Genes: The genes contained in each module. Gene names are used when available; ORF names are used otherwise.

GO Term: For each functional module, the most specific GO term in the biological process domain that annotates at least half of the genes in the module. This may provide clues to the biological function of a given module.

GO Description: The text description of each GO term.
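The GO term selection rule described in this legend can be sketched as follows. The sketch assumes that each gene's annotation set already includes all ancestor terms and that "most specific" means the greatest depth in the GO hierarchy; both are assumptions for illustration, and the function name, the depth table, and the toy term names in the usage note are hypothetical.

```python
def representative_term(module_genes, annotations, depth):
    """Pick the deepest term that annotates at least half of the module's genes.

    module_genes: list of gene names in the module
    annotations:  dict mapping gene -> set of terms (ancestors included)
    depth:        dict mapping term -> depth in the hierarchy (root = 0)
    Returns None when no term reaches the one-half coverage threshold.
    """
    counts = {}
    for gene in module_genes:
        for term in annotations.get(gene, ()):
            counts[term] = counts.get(term, 0) + 1
    needed = len(module_genes) / 2
    eligible = [term for term, count in counts.items() if count >= needed]
    if not eligible:
        return None
    # Prefer deeper terms; break ties by coverage, then by name for determinism.
    return max(eligible, key=lambda term: (depth[term], counts[term], term))
```

For a four-gene module in which a leaf term annotates two genes, its parent three, and the root all four, the function returns the leaf term, since it is the deepest term that still covers half of the module.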

Module ID | Component Genes | GO Term | GO Description
M1 | BBC1, DAP1, ECM33, FUS2, GAL1, GAL3, GAL4, GAL80, HRT1, MMS1, MST27, MST28, PMA1, PMA2, RIX7, RVS161, TPO5, YBP2, YBR108W, YGP1, YIL001W, YJL200C | P0050875 | cellular physiological process
M2 | APC1, APC11, APC2, APC4, APC5, APC9, CDC16, CDC23, CDC26, CDC27, DOC1, LEU3, SWM1, TCB3 | P0007049 | cell cycle
M3 | ATP3, ATP5, ATP7, CAF4, CDC123, DEF1, DMA1, DMA2, ECM17, ENT2, ERG1, FUR1, GDE1, HMS1, MCR1, MET1, MET10, MET14, MET16, OSH7, PRP12, PXR1, RRP3, SEC65, SLX1, SOD2, SRP14, SRP21, SRP54, SRP68, SRP72, STI1, THG1, TSA2, YBL029W, YBT1, YER077C, YGR266W, YIL039W, YIL082W, YKR051W, YKU80, YLR271W, YNL311C, YPR003C | P0050875 | cellular physiological process
M4 | CFT1, CFT2, FIP1, MPE1, PAP1, PFS2, PTA1, PTI1, REF2, RNA14, SSU72, SYC1, YSH1, YTH1 | P0006379 | mRNA cleavage
M5 | MET16, RRP3, SLX1, YBT1, YGR266W, YKR051W, YLR271W, YPR003C | P0050875 | cellular physiological process
M6 | CYC2, RPS18A, YCL056C, YDL157C, YNL100W | P0008150 | biological process
M7 | BUD27, GIM3, GIM4, GIM5, PAC10, TUB4, YKE2 | P0007021 | tubulin folding
M8 | ACE2, ATP1, ATP2, DRE2, ECM30, MNL1, PUS1, RPL30, SAH1, TIM11, UBP15, YDR111C, YER064C, YJL075C, YJR098C, YNL157W, YNL313C | P0050875 | cellular physiological process
M9 | AAC3, CDC14, CIS1, ELP6, GRH1, HAP2, HAP3, HAP4, HAP5, HEX3, HOR2, IKI1, ILV1, ILV2, ILV6, OPY1, PGM1, PTP2, QRI8, RAD59, RFA3, RHR2, RIM1, RNH1, SCJ1, SLX8, SPE3, SPE4, SSK1, SSK2, TOS1, UBC8, VAS1, YGR146C, YJR141W, YLR327C, YNL063W, YNL217W, YPL166W, YPL201C, YPR085C, YRA2 | P0050875 | cellular physiological process
M10 | ASK1, ASN1, CNN1, DAD1, DAD2, DAM1, DUO1, NUF2, SMC5, SPC19, SPC24, SPC25, SPC34, SPO71, TID3, TRM2, ULP1, YLR419W | P0050875 | cellular physiological process
M11 | ATG26, BUD3, BUD4, IMG1, IMG2, MHR1, MRP20, MRP7, MRPL1, MRPL10, MRPL16, MRPL17, MRPL19, MRPL20, MRPL23, MRPL24, MRPL27, MRPL28, MRPL3, MRPL35, MRPL36, MRPL39, MRPL4, MRPL44, MRPL51, MRPL6, MRPL7, MRPL8, MRPL9, PKH2, PRP31, RIM2, TIM21, YER078C, YPL183W-A | P0019538 | protein metabolism
M12 | BUD23, POP1, POP3, POP4, POP5, POP6, POP7, POP8, RPP1, SNM1, TSR2 | P0008033 | tRNA processing
M13 | ARX1, BRX1, CDC21, CIC1, DBP10, DBP2, DBP9, DNL4, DPB3, DPB4, DRS1, EBP2, ERB1, FPR4, GAR1, HAS1, HUL5, IME4, IPI3, KAR4, LHP1, LOC1, LSG1, MAK11, MAK16, MAK21, MAK5, MRT4, MUB1, MUM2, NIP7, NMD3, NOC2, NOC3, NOG1, NOG2, NOP12, NOP15, NOP16, NOP2, NOP4, NOP7, NSA1, NSA2, NUG1, POL2, PUF6, REI1, RLP24, RLP7, RMI1, RPF1, RPL5, RPS2, RRP1, RRP14, RRP15, RSA3, SDA1, SPB1, SPB4, SPO14, SRP101, SRP102, SSF1, SSQ1, TIF4631, TIF4632, TIF6, UBR2, URB1, VPS73, YDR198C, YDR412W, YEN1, YGL036W, YGR150C, YJL122W, YMR163C, YOR1, YOR227W, YPL009C, YTM1 | P0050875 | cellular physiological process
M14 | ATG9, BUD13, CYS4, HAT1, HCR1, HUA1, ILS1, KIP3, KTR5, MAS1, MAS2, MET22, MSH1, NIP1, PML1, PRT1, RLI1, RNR3, RPG1, SMC4, TIF34, TIF35, TIF5, UBP14, UBP2, YGL185C, YLR065C, YNL134C, YOL087C | P0019538 | protein metabolism
M15 | AAH1, ABD1, ASR1, CCL1, FLO8, GNA1, GNP1, IWR1, KIN28, PBP4, PCL10, PCP1, RBA50, RPB10, RPB11, RPB2, RPB3, RPB4, RPB5, RPB7, RPB8, RPB9, RPO21, SHR3, SPT5, SRX1, SWC5, TFB3, TFG1, TFG2, YBR139W, YBR280C, YDR131C, YHL021C, YLR225C | P0006139 | nucleobase, nucleoside, nucleotide and nucleic acid metabolism
M16 | CSI1, CSN9, PCI8, RRI1, RRI2, URH1 | P0000338 | protein deneddylation
M17 | MDM10, MPM1, TOM20, TOM22, TOM40, TOM5, TOM6, TOM7, YHR003C | P0017038 | protein import

M18 | CSI1, CSN9, PCI8, RRI1, RRI2, URH1, YIH1, YJL160C | P0000338 | protein deneddylation
M19 | BUD32, GRX3, GRX4, IDP2, KAE1, SGO1, YGL220W, YLL029W | P0050875 | cellular physiological process
M20 | ACH1, AFR1, ATF2, ERG13, ERG6, GCS1, GND2, GRX1, IQG1, LRO1, MET6, NCS2, NTF2, PEX29, PEX30, PEX31, PIS1, PRM2, RRN3, SEN15, SNA3, SNU13, SNX41, TVP15, TVP23, TVP38, WTM1, WTM2, YDL089W, YGR026W, YIP1, YIP4, YIP5, YMR266W, YOP1, YOR097C, YOR283W, YPL095C | P0050875 | cellular physiological process
M21 | ADK1, AVO1, BCP1, CDC25, CEG1, CET1, DOS2, RAS1, RAS2, RKM1, RPL23A, SAS5, SDC25, SES1, SOD1, SPT21, SSA3, SSA4, UBC1, YBL036C, YOL098C | P0050875 | cellular physiological process
M22 | ARP9, CAR2, CDC15, CRZ1, CSM1, CYC1, DBP8, DEG1, DSE3, ERG24, FAT1, GRE2, GYL1, HEF3, KNS1, LRS4, MAM1, MPC54, NMA1, OSH3, OYE3, PLP2, SAS10, SDP1, SET5, SFB3, SGM1, SNF12, SNF2, SNF5, SNF6, SPO12, SUM1, SWI3, TAF14, TGL1, THI21, THI22, YBR028C, YBR194W, YBR281C, YCK1, YCK3, YDL086W, YFL049W, YGR111W, YGR154C, YLR132C, YNL191W, YOR215C, YOR252W, YPR118W, YPR152C | P0050875 | cellular physiological process
M23 | AGE1, BDH1, CUE3, HNT3, NCE102, RPL4B, SER3, SFA1, YBR042C, YDR266C, YDR531W, YGL039W, YOR246C | P0050875 | cellular physiological process
M24 | ATG14, BEM4, BNI4, CDC10, CDC11, CDC12, CDC3, CHS3, DOG2, GIN4, MER1, MSK1, NFI1, PAA1, PTC3, PTC4, RHO2, SEF1, SHS1, SKT5, SPR28, SPR3, TCB1, VHS1, VPS15, VPS30, VPS34, VPS38, YDR186C, YGR205W | P0050875 | cellular physiological process
M25 | ARL1, BET1, EMP24, GDS1, HIP1, HPR5, IMH1, LST7, NUM1, OPI1, OSH2, PUT3, SAR1, SCS2, SEC16, SEC20, SEC22, SEC23, SEC24, SEC31, SED4, SFB2, STT4, SWH1, TIF3, UFE1, VPS52, VPS53, VPS54, YDL110C | P0045184 | establishment of protein localization
M26 | AIP1, APC1, APC11, APC2, APC4, APC5, APC9, CDC16, CDC23, CDC26, CDC27, CYR1, DOC1, GCN1, GCN20, KAP120, KEL3, LEU3, PHO80, PHO81, RAD51, RAD55, RAD57, SAP1, SPT2, SRV2, SWM1, TCB3 | P0050875 | cellular physiological process
M27 | BEM1, EXO84, RGA2, RSR1, SEC10, SEC15, SEC3, SEC5, SEC6, SEC8, VPS9, YMR002W | P0045184 | establishment of protein localization
M28 | ARL1, BET1, DLD3, DMC1, DPM1, EMP24, ERP1, ERP2, ERV25, FET5, FTH1, GDS1, GEA2, GPI10, HIP1, HPR5, ILV3, IMH1, KAP123, LCB1, LCB2, LEU1, LRG1, LST7, MID2, NUM1, OPI1, OSH2, PDC5, PHB1, PUT3, RDH54, RDI1, RHO1, RHO4, ROM2, SAC7, SAM1, SAM2, SAR1, SCS2, SEC16, SEC20, SEC22, SEC23, SEC24, SEC31, SED4, SFB2, SLG1, STT4, SWH1, TIF3, TRP2, TRP3, TRP4, UFE1, URA1, VPS52, VPS53, VPS54, YDL110C, YFR044C, ZEO1 | P0050875 | cellular physiological process
M29 | COP1, DID4, GPT2, PCH2, RET2, RET3, RTT105, SEC21, SEC26, SEC28, SLN1, SOH1, ULP2, UTR4, YBR242W, YDR063W, YER079W, YER184C, YGL157W, YIL137C, YLR108C, YPD1, YRR1 | P0050875 | cellular physiological process
M30 | ACA1, BSP1, DOT5, DYN1, ECM5, GLY1, GTS1, HSC82, ICL1, LIP2, MAC1, MSB1, PLB1, POG1, PPG1, PPT1, SFL1, VTH2, YBR138C, YDL233W, YFR006W, YIL057C | P0050875 | cellular physiological process
M31 | IMG1, IMG2, MHR1, MRP20, MRPL1, MRPL19, MRPL24, MRPL3, MRPL36, MRPL39, MRPL4, MRPL44, MRPL6, MRPL7, MRPL9, YPL183W-A | P0006412 | protein biosynthesis
M32 | AHC1, AME1, APN2, AQY2, BIR1, BNA5, CBF1, CBF2, CDC34, CDC4, CDC53, CEP3, CHL4, COS111, CSE4, CTF13, CTF19, CTF3, FES1, GCN4, GUF1, HRT3, HSP42, IML3, MCM16, MCM21, MCM22, MET28, MET30, MET31, MET4, MIF2, NKP1, NKP2, OKP1, RUB1, SGT1, SIS1, SKP1, SWE1, UBA3, UBC12, UBC9, UBI4, UFO1, ULA1, YCR082W, YJL149W, YLR224W, YLR267W, YLR352W | P0050875 | cellular physiological process
M33 | AGP1, CHS2, EPT1, ERG9, GAA1, GDA1, GTT1, GTT2, GTT3, ISR1, MEP1, TAT2, USE1, VRG4, YIL023C | P0050875 | cellular physiological process
M34 | SNO1, SNO2, SNO3, SNO4, SNZ1, SNZ2, SNZ3 | P0009228 | thiamin biosynthesis
M35 | AAT2, ACH1, ACS2, AFR1, ATF2, ERG10, ERG13, ERG6, GAS1, GCS1, GLC3, GND2, GPH1, GRS1, GRX1, HRR25, IQG1, LRO1, LTV1, MBP1, MET6, MNN1, NCS2, NTF2, PEX29, PEX30, PEX31, PFS1, PGM2, PIS1, PRM2, RPS3, RRN3, SEN15, SKS1, SNA3, SNU13, SNX41, SPO13, SUI1, SWI6, TSR1, TVP15, TVP23, TVP38, WTM1, WTM2, YDL089W, YDL124W, YGR026W, YIP1, YIP4, YIP5, YMR266W, YOL131W, YOP1, YOR097C, YOR283W, YPL095C | P0050875 | cellular physiological process

M36 | BFR2, BMS1, CHD1, CKA1, CKA2, CKB1, CKB2, DHR2, DIM1, DIP2, ECM16, EMG1, ENP1, ENP2, HIS7, HOT1, HTA2, IMP3, IMP4, KRE33, KRI1, KRR1, MPP10, MRD1, NAN1, NOC4, NOP1, NOP14, OGG1, PNO1, POB3, POL5, RCL1, RIO2, RRP7, SET2, SLX9, SOF1, TPT1, UTP10, UTP15, UTP20, UTP21, UTP22, UTP30, UTP4, UTP6, UTP7, YER030W, YGR054W, YHL035C, YLR003C | P0006364 | rRNA processing
M37 | BCD1, BNA1, BRE1, GCD10, GCD14, GLT1, IST2, LGE1, NBP1, RTT102, RTT106, SKG6, SUA7, TEL2, YFL052W, YLR455W, YML108W, YMR315W, YNG1 | P0006139 | nucleobase, nucleoside, nucleotide and nucleic acid metabolism
M38 | AME1, CEP3, CHL4, CSE4, CTF19, CTF3, IML3, MCM16, MCM21, MCM22, MIF2, NKP1, NKP2, OKP1, UBC9 | P0007059 | chromosome segregation
M39 | AAP1, PEP3, PEP5, VAM6, VPS16, VPS33, VPS41, VPS8 | P0042145 | homotypic vacuole fusion, non-autophagic
M40 | DST1, INM1, NMD5, NUP159, NUP82, RPL16A, RPL16B, RPL25, RPL26B, RPL34A, RPL34B, RPL35B, RPS20, RPS22A, SXM1, YPR015C | P0006412 | protein biosynthesis
M41 | ALR1, ARE1, ARG80, BUD23, BUD32, COG3, CSE2, EAF6, FAA3, FAB1, FYV7, GAL11, GAT1, GDH2, GRX3, GRX4, HHO1, HXK2, IDP2, IMD2, IMD3, IME1, KAE1, MAG1, MAG2, MED1, MED11, MED2, MED4, MED6, MED7, MED8, MLP1, MMS22, MSC2, MSC3, MSH5, NUT1, NUT2, PGD1, POP1, POP3, POP4, POP5, POP6, POP7, POP8, RGR1, ROX3, RPP1, RTT101, RTT107, SGO1, SIN4, SNM1, SRB2, SRB4, SRB5, SRB6, SRB7, SRB8, SSE2, SSN2, SSN3, SSN8, SUB1, TAH18, TSR2, YAP1, YBP1, YFR038W, YGL024W, YGL220W, YIL077C, YLL029W, YML082W, YMR1, YMR102C, YMR114C, YNR024W, ZRG17 | P0006139 | nucleobase, nucleoside, nucleotide and nucleic acid metabolism
M42 | DDC1, LEU4, LOT5, LOT6, MEC3, MMM1, MRS11, MRS5, NDL1, PAC1, RAD17, SRN2, STM1, SUV3, TIM18, TIM22, TIM54, TIM9, TRM10, URA6, YBR197C, YCR016W, YLR051C, YNR063W | P0050875 | cellular physiological process
M43 | ADH3, ADK1, ASF1, AVO1, BCP1, CDC25, CEG1, CET1, DOS2, FUM1, FYV10, GID7, GID8, HIR3, KCC4, MOH1, MSO1, NHA1, PGI1, PTC2, RAD10, RAD53, RAS1, RAS2, RKM1, RMD5, RNR1, RNR2, RNR4, RPL23A, SAS5, SDC25, SEC1, SES1, SML1, SOD1, SPT21, SSA3, SSA4, SWI4, UBC1, VID24, VID28, VID30, YAL027W, YBL036C, YHL010C, YKR017C, YOL098C, YTA7 | P0050875 | cellular physiological process
M44 | ATF2, ERG6, GCS1, PEX29, PEX30, PEX31, PIS1, SNA3, SNX41, TVP15, TVP23, TVP38, YDL089W, YGR026W, YIP1, YIP4, YIP5, YOP1, YOR097C, YPL095C | P0008150 | biological process
M45 | FYV10, GID7, GID8, MOH1, NHA1, PTC2, RAD53, RMD5, SWI4, VID24, VID28, VID30, YHL010C, YKR017C, YTA7 | P0050791 | regulation of physiological process
M46 | CDC46, CDC47, CTF4, DIA2, ELC1, LST4, MCM10, MCM3, MCM6, MNP1, NCL1, RAD14, RAD4, RAD7, RPP0, RPP1A, RPP1B, RPP2B, SEC61, SKY1, SSF2, SSS1, TIR3, YER067W, YHR087W, YLR287C | P0050875 | cellular physiological process
M47 | BPT1, IMP1, IMP2, MRL1, RGT2 | P0046907 | intracellular transport
M48 | AEP1, AEP2, ASM4, BRF1, CLB5, COG1, COG6, COG7, COX12, DST1, ECM1, FIN1, FZF1, HRP1, INM1, KIN1, LEM3, LYS2, LYS5, MIH1, MRPL15, MTF2, MTW1, NAB2, NIC96, NMD5, NSP1, NUP157, NUP159, NUP188, NUP192, NUP49, NUP53, NUP57, NUP82, PCA1, PHO86, PIB1, POM152, PSE1, PSP1, RDS2, RPC17, RPL16A, RPL16B, RPL25, RPL26B, RPL34A, RPL34B, RPL35B, RPS20, RPS22A, SIC1, SPO74, SRL3, STB1, SXM1, TCM10, TFC4, TRX2, VAC14, YAL061W, YBR239C, YDL063C, YHR199C, YIL092W, YJL057C, YLR326W, YML125C, YNL035C, YPL014W, YPR015C | P0050875 | cellular physiological process
M49 | CCZ1, DSS4, GDI1, HSV2, RBD2, RGP1, RIC1, SEC4, VPS21, YPT1, YPT10, YPT31, YPT32, YPT52, YPT6, YPT7 | P0045184 | establishment of protein localization
M50 | FYV10, GID7, GID8, MOH1, NHA1, RMD5, VID24, VID28, VID30, YHL010C, YKR017C | P0006090 | pyruvate metabolism
M51 | GZF3, OCA1, POL4, RHO5, SIW14, YCR095C, YHL029C, YNL056W | P0008150 | biological process
M52 | AIR1, ATG1, BAT1, BAT2, CCT2, CCT3, CCT5, CCT6, CDC13, CDC31, CDC37, CDC5, CTK1, CTK2, CTK3, FMN1, FPR3, FUN12, FUN30, GLC8, HRB1, HTZ1, IMD1, IMD4, IME2, IRR1, KAR1, KIC1, LIA1, MCD1, MDM31, MIS1, MMS4, MPS3, MTR10, MTR4, MUS81, NPL3, NSR1, PAP2, PDI1, PLC1, PPQ1, PPZ2, PRP16, PSK1, PSK2, PWP1, RAI1, RAT1, RPC10, RPL2A, RVB1, RVB2, SCC4, SDS22, SGD1, SGN1, SHM1, SMC1, SMC2, SMC3, SOG2, STB3, STN1, TCP1, TEL1, TIF11, TOR2, UTP5, VHS3, YDL156W, YGR102C, YGR130C, YLR030W, YLR118C | P0050875 | cellular physiological process
M53 | AGP1, CHS2, EPT1, ERG25, ERG26, ERG27, ERG28, ERG9, GAA1, GDA1, GTT1, GTT2, GTT3, INP54, ISR1, MEP1, MLH1, MLH2, MLH3, OM45, PMS1, RTN2, SEC11, SHO1, SLH1, SPC1, TAT2, TEP1, TRM1, USE1, VRG4, YER049W, YIL023C, YJL202C, YNL181W | P0050875 | cellular physiological process
M54 | ESP1, GTR1, GTR2, HXT10, MEH1, MVP1, NOP8, SLM4, YCR015C, YGR203W, YJL097W | P0008150 | biological process
M55 | AGP1, CHS2, EPT1, ERG25, ERG26, ERG27, ERG28, ERG9, ESP1, GAA1, GDA1, GTR1, GTR2, GTT1, GTT2, GTT3, HXT10, INP54, ISR1, MDM10, MEH1, MEP1, MLH1, MLH2, MLH3, MPM1, MVP1, NOP8, OM45, PMS1, RTN2, SAM37, SEC11, SHO1, SLH1, SLM4, SPC1, TAT2, TEP1, TOM20, TOM22, TOM40, TOM5, TOM6, TOM7, TOM70, TRM1, USE1, VRG4, YCR015C, YER049W, YGR203W, YHR003C, YIL023C, YJL097W, YJL202C, YNL181W | P0050875 | cellular physiological process
M56 | AAC3, ATP3, ATP5, ATP7, CAF4, CDC123, CDC14, CIS1, DEF1, DMA1, DMA2, ECM17, ELP6, ENT2, ERG1, FUR1, GDE1, GRH1, HAP2, HAP3, HAP4, HAP5, HEX3, HMS1, HOR2, IKI1, ILV1, ILV2, ILV6, MCR1, MET1, MET10, MET14, MET16, OPY1, OSH7, PGM1, PRP12, PTP2, PXR1, QRI8, RAD59, RFA3, RHR2, RIM1, RNH1, RRP3, SCJ1, SEC65, SLX1, SLX8, SOD2, SPE3, SPE4, SRP14, SRP21, SRP54, SRP68, SRP72, SSK1, SSK2, STI1, THG1, TOS1, TSA2, UBC8, VAS1, YBL029W, YBT1, YER077C, YGR146C, YGR266W, YIL039W, YIL082W, YJR141W, YKR051W, YKU80, YLR271W, YLR327C, YNL063W, YNL217W, YNL311C, YPL166W, YPL201C, YPR003C, YPR085C, YRA2 | P0050875 | cellular physiological process
M57 | AAP1, HOG1, PEP3, PEP5, RCK1, RCK2, ROD1, VAM6, VPS16, VPS33, VPS41, VPS8 | P0042145 | homotypic vacuole fusion, non-autophagic
M58 | AEP1, AEP2, ASM4, CLB5, COG1, COG6, COG7, COX12, FIN1, HRP1, KIN1, MIH1, MRPL15, MTF2, MTW1, NAB2, NIC96, NSP1, NUP157, NUP188, NUP192, NUP49, NUP53, NUP57, PCA1, PHO86, POM152, PSP1, SIC1, SPO74, SRL3, STB1, TCM10, VAC14, YDL063C, YIL092W, YJL057C, YML125C, YNL035C, YPL014W | P0050875 | cellular physiological process
M59 | ACT1, ARC1, DIG1, DIG2, DLD2, EMI2, FIG1, FRS1, FRS2, FUS1, FUS3, GLK1, MTG2, NSE3, NSE4, PTR3, PUS6, QRI1, SPH1, SSY1, SSY5, STE12, STE7, THS1, TPM2, TUP1, TWF1, YGL057C, YIR014W | P0050875 | cellular physiological process
M60 | AAD3, CKI1, CTR3, DAL3, GCR1, GCR2, GLN1, HSP104, HXK1, IRA1, IRA2, MKS1, PMD1, PRS1, PRS2, PRS3, PRS4, PRS5, RIM11, TPS1, TPS2, TPS3, TRM7, TSC10, TSL1, YER163C, YGR277C, YIL177C, YKL215C, YMR251W, YUH1 | P0050875 | cellular physiological process
M61 | ADE2, AFG3, APT1, ARE2, BIO3, BRE2, CCT4, CYM1, DOA4, DYN2, EAF5, ERR1, FRE3, HAA1, HMG2, HSP26, LAT1, LPD1, MET2, MEX67, MIP6, MSN5, MTR2, NBP2, NTE1, NUP100, OSM1, PAC11, PBN1, PBS2, PDA1, PDB1, PDX1, PEP8, PRB1, PRX1, PTC1, SDC1, SDH1, SDH2, SDH3, SET1, SKM1, SPP1, SPS1, SUP35, SWD1, SWD3, TCM62, VPS17, VPS29, VPS35, VPS5, VPS74, YRM1, YTA12 | P0050875 | cellular physiological process
M62 | COP1, GPT2, RET2, RET3, SEC21, SEC26, SEC28, SLN1, YBR242W, YIL137C, YPD1 | P0045184 | establishment of protein localization
M63 | CDC2, CDC33, CDC73, CTR9, EXG1, FAA4, HYS2, LEO1, LSB1, MDH2, PAF1, PET112, POL1, POL32, RTF1, SBP1, SPL2, SPT16, VTS1, YAL049C, YDL025C, YEL023C, YGR016W, YGR058W, YHR009C, YIL108W, YKR018C, YLR392C, YPK2 | P0050875 | cellular physiological process
M64 | ACE2, AHP1, ATP1, ATP17, ATP2, ATP4, BFA1, BZZ1, CIT2, CYT2, DCS1, DCS2, DDR48, DNM1, DRE2, ECM30, EDC1, GAC1, GDB1, GIP1, GIP2, GLC7, GLG2, GPG1, GSY1, GSY2, HMO1, HOP1, KAP122, LIF1, MCA1, MDV1, MGT1, MHP1, MNL1, NEJ1, NNT1, NUP1, NUP170, PBP2, PIG1, PIG2, PNC1, PRR1, PUS1, RAD18, RAD54, RAD6, RED1, REG2, REX4, RNT1, RPL30, RTT103, SAH1, SCD5, SLF1, SNO1, SNO2, SNO3, SNO4, SNZ1, SNZ2, SNZ3, SRP40, STE23, TIM11, TOA1, TOA2, UBA1, UBC6, UBP15, UBR1, UGA2, YAK1, YCR087C-A, YDR111C, YDR132C, YER064C, YGL081W, YJL075C, YJR098C, YKL023W, YKL056C, YKR064W, YLR053C, YLR241W, YML131W, YNL157W, YNL313C, YPI1, YPL245W, YPL247C | P0050875 | cellular physiological process
M65 | CDC2, CDC73, CTR9, HYS2, LEO1, PAF1, POL1, POL32, RTF1, SPT16, VTS1, YAL049C, YDL025C, YHR009C | P0006139 | nucleobase, nucleoside, nucleotide and nucleic acid metabolism
M66 | DAL80, EXO1, GZF3, HDA1, HDA2, HDA3, MSH2, MSH3, MSH6, OCA1, POL4, PRO2, PSY3, RHO5, SGS1, SHU1, SHU2, SIW14, TOP3, YBR246W, YCR095C, YDR520C, YHL029C, YLR046C, YNL056W | P0050875 | cellular physiological process
M67 | MRP1, MRP13, MRP4, MRP51, MRPS17, MRPS18, MRPS28, MRPS9, NAM9, RSM10, RSM22, RSM23, RSM26, RSM7 | P0006412 | protein biosynthesis
M68 | ACO1, ADE12, ARO1, CAM1, CCT8, CPA1, CPA2, ECM29, EFB1, FAS1, FAS2, FDH1, FUI1, GFA1, GON7, GUS1, HAT2, HIF1, HSM3, IDH2, LYS12, NAS2, NAS6, PRE5, RAD23, RLM1, RPN10, RPN11, RPN12, RPN13, RPN14, RPN2, RPN3, RPN4, RPN5, RPN6, RPN7, RPN8, RPN9, RPT1, RPT2, RPT3, RPT4, RPT5, RPT6, RTN1, SMP1, STS1, TEF2, TEF4, TIF2, UBP6, URA7, URA8, YBR025C, YBR238C, YKL161C, YKR043C | P0050875 | cellular physiological process
M69 | AVT2, AXL1, BTT1, CAC2, CCE1, CPR1, CRC1, DBF20, EGD2, HIT1, HOS2, HOS4, HSH49, HSP12, HST1, MLC2, MSI1, NPR1, RLF2, RSA1, RSC1, SET3, SIF2, SNT1, TOS8, YHR209W, YMR027W, YMR155W, YRF1-3, ZDS1 | P0006139 | nucleobase, nucleoside, nucleotide and nucleic acid metabolism
M70 | ARP8, BAS1, BCD1, BNA1, BRE1, CYC2, ESC8, GCD10, GCD14, GLT1, IES1, IES3, IES5, IOC2, IOC3, IST2, ISW1, ISW2, ITC1, LDB7, LGE1, LTP1, MOT1, NBP1, NHP10, NHP6B, NPL6, PHO2, PHO4, PSF1, PSF2, REB1, RFX1, RPS18A, RSC2, RSC3, RSC4, RSC58, RSC6, RSC8, RSC9, RTT102, RTT106, SFH1, SHM2, SKG6, SLD5, SMI1, STH1, STR3, SUA7, TEL2, TFC7, VPS1, YCL056C, YDL157C, YEF3, YFL052W, YGL242C, YLR455W, YML108W, YMR315W, YNG1, YNL100W, YPL077C | P0006139 | nucleobase, nucleoside, nucleotide and nucleic acid metabolism
M71 | DBR1, DCP1, DCP2, DHH1, EDC3, GDH1, KEM1, LSM1, LSM2, LSM3, LSM4, LSM5, LSM6, LSM7, NAM7, NMD2, NMD4, PAT1, PRP24, PRP38, PRP6, RPM2, RPS28B, SCM3, SNU23, UPF3, YBR094W, YLR211C, YLR446W | P0016071 | mRNA metabolism

M72 | AGP1, CHS2, EPT1, ERG9, GAA1, GDA1, GTT1, GTT2, GTT3, ISR1, MEP1, MLH1, MLH2, MLH3, PMS1, TAT2, TEP1, TRM1, USE1, VRG4, YIL023C | P0050875 | cellular physiological process
M73 | BCD1, BNA1, BRE1, CYC2, GCD10, GCD14, GLT1, IST2, LDB7, LGE1, NBP1, PSF1, PSF2, RPS18A, RTT102, RTT106, SKG6, SLD5, SUA7, TEL2, YCL056C, YDL157C, YFL052W, YLR455W, YML108W, YMR315W, YNG1, YNL100W, YPL077C | P0050875 | cellular physiological process
M74 | ALG5, AMS1, ANP1, APE3, ARD1, ASC1, ATG19, BET3, BET5, DPB11, DRS2, ENO1, ERO1, ERV41, ERV46, FKS1, GLR1, GSG1, GYP6, HOC1, KAR2, KRE11, KRS1, LAP4, MAS6, MDJ1, MGE1, MNN10, MNN9, NAT1, NEM1, NPR2, NUP120, NUP133, NUP145, NUP84, NUP85, OST1, PDR5, PKC1, POX1, RRS1, RUP1, SEC13, SEC62, SEC63, SEC66, SEC72, SEH1, SKN1, SLC1, SPO7, SSC1, SSL1, STT3, SWP1, TFA1, TFA2, TFB1, TFB4, TIM17, TIM44, TRS120, TRS130, TRS20, TRS23, TRS31, TRS33, VAN1, VBA3, WBP1, YAL053W, YDR128W, YER007C-A, YER182W, YFL034W, YHL039W, YHR080C, YHR113W, YJR014W, YPL222W | P0050875 | cellular physiological process
M75 | CSL4, DIS3, LRP1, MTR3, RRP4, RRP40, RRP42, RRP43, RRP46, RRP6, SKI6, SKI7, YIR035C, YLR345W | P0006402 | mRNA catabolism
M76 | ADA2, ARO9, ECM19, GCN5, HFI1, NGG1, PDR1, PIF1, SGF29, SPT20, SPT3, SPT7, SPT8, TAF12, TAF5, TAF6, TAF7, TRA1, UBP8, VPS51, YIM1, YNG2 | P0019219 | regulation of nucleobase, nucleoside, nucleotide and nucleic acid metabolism
M77 | AME1, CHL4, IML3, MCM22, NKP1 | P0008150 | biological process
M78 | COG8, DBP5, GCD1, GCD11, GCD2, GCD6, GCD7, GCN3, GFD1, GLE1, GLE2, HAM1, IST1, NUP42, PRO1, RIP1, SCW4, SUI2, SUI3, XKS1, YHL008C, YHL018W | P0050875 | cellular physiological process
M79 | ARK1, CHC1, CLC1, ENT1, ENT3, GGA2, IZH4, PAN1, PAN2, PAN3, SWA2, YAP1801, YAP1802, YOR111W | P0006810 | transport
M80 | ADE17, ADE5, ADE6, AHA1, APE2, ARA1, CTF18, CTF8, DPB2, ELG1, GCY1, MAP2, MDH3, NPA3, PFK2, POL12, PRI1, PRI2, RAD24, RFC1, RFC2, RFC3, RFC4, RFC5, RNQ1, SNT2, SPE2, USO1, VPS13, YLR413W, YMR181C, YOR262W, YPT35 | P0050875 | cellular physiological process
M81 | BUB1, BUB3, CAK1, CAT8, CDC20, CDC28, CDC6, CDH1, CKS1, CLB1, CLB2, CLB3, CLB4, CLB6, CLN1, CLN2, DAL7, DAL82, HSL1, MAD1, MAD2, MAD3, MGS1, NMA2, PCM1, RGT1, SAP4, STD1, YDR196C, YDR314C, YKR077W, YML011C, ZPR1 | P0007049 | cell cycle
M82 | COG8, COP1, DBP5, DID4, GCD1, GCD11, GCD2, GCD6, GCD7, GCN3, GFD1, GLE1, GLE2, GPT2, HAM1, IST1, NUP42, PCH2, PRO1, RET2, RET3, RIP1, RTT105, SCW4, SEC21, SEC26, SEC28, SLN1, SOH1, SUI2, SUI3, ULP2, UTR4, XKS1, YBR242W, YDR063W, YER079W, YER184C, YGL157W, YHL008C, YHL018W, YIL137C, YLR108C, YPD1, YRR1 | P0050875 | cellular physiological process
M83 | ATG26, BUD3, BUD4, IMG1, IMG2, MHR1, MRH4, MRP20, MRP49, MRP7, MRP8, MRPL1, MRPL10, MRPL13, MRPL16, MRPL17, MRPL19, MRPL20, MRPL23, MRPL24, MRPL25, MRPL27, MRPL28, MRPL3, MRPL35, MRPL36, MRPL39, MRPL4, MRPL44, MRPL51, MRPL6, MRPL7, MRPL8, MRPL9, PKH2, PRP31, RIM2, SMP2, TIM21, YDR115W, YER078C, YMR210W, YPL183W-A, ZRC1 | P0050875 | cellular physiological process
M84 | AFG2, AVT6, BOS1, CAD1, DUS3, FLR1, GIS3, LAC1, LAG1, LIP1, MCH2, NUP116, PET122, PET494, PET54, SPC2, TYE7, UTR1, YAP3, YAR044W, YEL041W, YOR062C, YOR289W | P0050875 | cellular physiological process
M85 | AFG3, APT1, BRE2, CYM1, DOA4, LAT1, LPD1, MTR2, NTE1, OSM1, PDA1, PDB1, PDX1, SDC1, SET1, SPP1, SWD1, SWD3, YTA12 | P0050875 | cellular physiological process
M86 | CAC2, CPR1, CRC1, HOS2, HOS4, HSH49, HSP12, HST1, MLC2, MSI1, NPR1, RLF2, RSC1, SET3, SIF2, SNT1, TOS8, YHR209W, YMR027W, ZDS1 | P0016568 | chromatin modification
M87 | ABP1, AKL1, ARG81, BNI1, FAL1, GRR1, HOF1, HTS1, LYS9, PFY1, PMT6, PRK1, RHO3, SCP1, TRM11, TRM112, TRM9, UBP5, YDR140W, YEL043W, YIL151C, YNL152W, YOR164C, YPS3 | P0050875 | cellular physiological process
M88 | AAH1, ABD1, ASR1, BUD27, CCL1, FLO8, GIM3, GIM4, GIM5, GNA1, GNP1, IVY1, IWR1, KIN28, PAC10, PBP4, PCL10, PCP1, RBA50, RPB10, RPB11, RPB2, RPB3, RPB4, RPB5, RPB7, RPB8, RPB9, RPO21, SHR3, SPC72, SPC97, SPC98, SPT5, SRX1, SWC5, TFB3, TFG1, TFG2, TUB4, YBR139W, YBR280C, YDR131C, YHL021C, YKE2, YLR225C | P0050875 | cellular physiological process
M89 | ADP1, CTL1, EAP1, FZO1, IPP1, NDD1, NEW1, PEX18, PEX2, PEX21, PEX7, PPA2, SPC29, TUS1, YCK2, YDL203C, YPL066W | P0050875 | cellular physiological process
M90 | CMD1, CMK1, CNA1, CNB1, CUE5, DML1, DOG1, ECM25, EDE1, END3, GAL2, GYP5, HOM6, KEL2, KEX2, KIN2, KRE6, LAS17, LSB3, LSB6, MET18, MLC1, MYO2, MYO3, MYO4, MYO5, PCL9, PEP1, RPL10, RVS167, SHE2, SHE3, SHE4, SLA1, SLA2, SPC110, SQT1, SUL2, SYP1, UBP7, UBP9, UBX6, URA5, VRP1, YDR219C, YDR267C, YDR348C, YER156C, YFR042W, YHM2, YHR122W, YHR182W, YJL045W, YLR243W, YLR422W, YMR295C, YNR065C, YOL111C, YSC84 | P0050875 | cellular physiological process
M91 | MDM10, MPM1, SAM37, TOM20, TOM22, TOM40, TOM5, TOM6, TOM7, TOM70, YHR003C | P0017038 | protein import
M92 | ADH1, ADH5, APP1, AVT2, AXL1, BTT1, CAC2, CCA1, CCE1, CDC43, COF1, CPR1, CRC1, CRN1, DBF20, DNA2, ECM32, EGD2, ENO2, FAR11, FBA1, GPM1, HIT1, HOS2, HOS4, HSH49, HSP12, HST1, KRE29, LCD1, LYS14, MGM101, MLC2, MPH1, MSI1, NPR1, PAN6, PDC1, PGK1, PSO2, RAM1, RAM2, RFA1, RFA2, RLF2, RSA1, RSC1, SEN2, SEN34, SEN54, SET3, SIF2, SNT1, SPF1, SUP45, TOS8, YHR209W, YMR027W, YMR155W, YRF1-3, ZDS1 | P0050875 | cellular physiological process
M93 | MMM1, MRS11, MRS5, TIM18, TIM22, TIM54, TIM9 | P0017038 | protein import
M94 | GIM3, GIM4, GIM5, PAC10, TUB4 | P0007021 | tubulin folding
M95 | ADP1, AKR1, ARP7, ATC1, BCK2, BEM3, BOI1, BOI2, BUD6, CDC42, CDC54, CLA4, CTL1, CTS1, DOP1, DSE1, EAP1, FZO1, GIC1, GIC2, GPA1, IPP1, KSS1, MAK10, MAK3, MAK31, MPT5, MSB2, MSB3, MSB4, MSE1, NDD1, NEW1, PEA2, PEX18, PEX2, PEX21, PEX7, PIM1, PPA2, PYC1, RGA1, SAN1, SEN1, SPA2, SPC29, SST2, STE11, STE18, STE2, STE4, STE5, STE50, SYG1, TSC11, TUS1, VPS62, YCK2, YDL203C, YDR239C, YEL048C, YGL015C, YIL055C, YPL066W, YPL113C, YPR115W, ZDS2, ZTA1 | P0050875 | cellular physiological process
M96 | AHC1, APN2, CDC34, CDC4, CDC53, COS111, FES1, GCN4, GUF1, HRT3, HSP42, MET30, MET31, RUB1, SGT1, SIS1, SKP1, SWE1, UBA3, UBC12, UBI4, UFO1, ULA1, YCR082W, YJL149W, YLR224W, YLR267W, YLR352W | P0050875 | cellular physiological process
M97 | ATP12, ATP14, BUD14, COX5B, ELM1, IDP3, KIN3, MEP2, MMF1, MMT1, NTA1, QCR8, RAV1, TFP1, VMA10, VMA13, VMA2, VMA4, VMA5, VMA7, VMA8, VPH1, YAT1, YER036C, YGL226W, YGR117C, YIL157C, YLR168C, YMR003W, ZUO1 | P0006810 | transport
M98 | ABP1, AKL1, ARG81, BNI1, CDC9, CNS1, CPR7, ECM10, ERR3, FAL1, GRR1, HGH1, HIS4, HOF1, HSP82, HTS1, LSM12, LYS9, MEK1, PBP1, PFY1, PMT6, POL30, PRK1, PUF3, RAD27, RHO3, SBA1, SCP1, STB5, STP1, TAD2, TIS11, TRM11, TRM112, TRM9, TUB3, UBP5, YDR140W, YEL043W, YGR052W, YIL151C, YNL152W, YOR164C, YOR378W, YPS3 | P0050875 | cellular physiological process
M99 | ALR1, ARG80, BUD23, COG3, CSE2, EAF6, FAA3, FAB1, FYV7, GAL11, GAT1, GDH2, HXK2, IME1, MED1, MED11, MED2, MED4, MED6, MED7, MED8, MLP1, MMS22, MSC2, MSH5, NUT1, NUT2, PGD1, POP1, POP3, POP4, POP5, POP6, POP7, POP8, RGR1, ROX3, RPP1, RTT101, RTT107, SIN4, SNM1, SRB2, SRB4, SRB5, SRB6, SRB7, SRB8, SSE2, SSN2, SSN3, SSN8, SUB1, TAH18, TSR2, YAP1, YBP1, YFR038W, YGL024W, YIL077C, YML082W, YMR1, YMR102C, YMR114C, YNR024W, ZRG17 | P0006139 | nucleobase, nucleoside, nucleotide and nucleic acid metabolism
M100 | ORC1, ORC2, ORC3, ORC4, ORC5, ORC6 | P0006260 | DNA replication
M101 | ABP1, ACA1, ADO1, ADR1, AKL1, ARG81, BMH1, BNI1, BNR1, BSP1, CAP1, CAP2, CDC9, CNS1, CPR7, CYK3, DOT5, DYN1, ECM10, ECM5, ERR3, FAL1, GLY1, GRR1, GTS1, HGH1, HIS4, HOF1, HSC82, HSP82, HTS1, ICL1, INP52, KCS1, LIP2, LSM12, LYS9, MAC1, MEK1, MSB1, NTH1, PBP1, PFY1, PLB1, PMT6, POG1, POL30, PPG1, PPT1, PRK1, PTP1, PUF3, RAD27, RHO3, RNH202, RNH203, SBA1, SCP1, SFL1, SOK1, STB5, STP1, STU1, SVL3, TAD2, TIS11, TOP2, TRM11, TRM112, TRM9, TUB3, UBP5, VTH2, YBR138C, YDL233W, YDR140W, YDR366C, YEL043W, YER071C, YFR006W, YFR016C, YFR017C, YGR052W, YIL057C, YIL151C, YIR003W, YNL152W, YOR164C, YOR378W, YPS3, YTA6 | P0050875 | cellular physiological process
M102 | END3, GAL2, GYP5, HOM6, LAS17, LSB6, MYO3, MYO5, PCL9, PEP1, RPL10, RVS167, SQT1, UBP7, VRP1, YHM2, YLR243W, YNR065C | P0006996 | organelle organization and biogenesis
M103 | ARC15, ARC18, ARC19, ARC35, ARP3 | P0030833 | regulation of actin filament polymerization
M104 | ARX1, BRX1, CDC21, CIC1, DBP10, DBP2, DBP9, DPB3, DPB4, DRS1, EBP2, GAR1, HAS1, IPI3, LHP1, LSG1, MAK21, MAK5, MRT4, NIP7, NMD3, NOC2, NOC3, NOG1, NOG2, NOP12, NOP15, NOP16, NOP2, NOP4, NOP7, NSA1, NSA2, NUG1, POL2, PUF6, REI1, RLP24, RLP7, RMI1, RPF1, RRP1, RRP14, RRP15, RSA3, SDA1, SPB1, SPB4, SRP101, SRP102, SSF1, TIF4631, TIF4632, URB1, VPS73, YGR150C, YJL122W, YMR163C, YOR1, YPL009C, YTM1 | P0050875 | cellular physiological process
M105 | ADY3, CCZ1, DON1, DSS4, FRT2, GDI1, HST3, HSV2, JJJ1, LAS1, MRS6, RBD2, RGP1, RIC1, RPL37A, SEC4, SFP1, SSP1, VPS21, YGR017W, YLR072W, YPT1, YPT10, YPT31, YPT32, YPT52, YPT53, YPT6, YPT7 | P0045184 | establishment of protein localization
M106 | BBP1, BFR1, BUB2, CBK1, CPR3, DDI1, DUT1, GAS2, GBP2, HPR1, HSP10, HSP60, ISM1, MCX1, MFT1, MOB2, NUP2, RLR1, SEC53, SIP2, SSD1, SUB2, TEM1, TEX1, THO1, THP2, UGP1, YGR066C, YGR153W, YIR016W, YJL218W | P0050875 | cellular physiological process
M107 | AOS1, CDC48, CMK2, DSK2, HCH1, HMF1, NPL4, RTG1, RTG3, SGT2, SHP1, UBA2, UFD1, UFD2, YDR049W, YPL236C, YPS7 | P0019538 | protein metabolism
M108 | BBC1, DAP1, ECM33, FAR10, FAR3, FAR7, FAR8, FUS2, GAL1, GAL3, GAL4, GAL80, GPA2, GPB1, GPB2, GPR1, HFM1, HRT1, LSP1, MMS1, MST27, MST28, OYE2, PHO84, PIL1, PKH1, PMA1, PMA2, RIX7, RVS161, TPI1, TPO5, VPS64, YBP2, YBR108W, YGP1, YIL001W, YJL200C, YRF1-4 | P0050875 | cellular physiological process
M109 | ACT1, ARC1, DAL80, DIG1, DIG2, DLD2, EMI2, EXO1, FIG1, FRS1, FRS2, FUS1, FUS3, GLK1, GZF3, HDA1, HDA2, HDA3, MSH2, MSH3, MSH6, MTG2, NSE3, NSE4, OCA1, POL4, PRO2, PSY3, PTR3, PUS6, QRI1, RHO5, SGS1, SHU1, SHU2, SIW14, SPH1, SSY1, SSY5, STE12, STE7, THS1, TOP3, TPM2, TUP1, TWF1, YBR246W, YCR095C, YDR520C, YGL057C, YHL029C, YIR014W, YLR046C, YNL056W | P0050875 | cellular physiological process
M110 | DEF1, ECM17, MET1, MET10, MET16, RRP3, SLX1, SOD2, TSA2, YBT1, YER077C, YGR266W, YIL039W, YKR051W, YKU80, YLR271W, YPR003C | P0050875 | cellular physiological process
M111 | DAL80, PSY3, SHU1, SHU2, YBR246W, YDR520C, YLR046C | P0008150 | biological process
M112 | AGE1, ARO4, BDH1, CUE3, ERG3, HNT3, NCE102, PHO12, RPL4B, SER3, SER33, SFA1, SRY1, YBR042C, YDR266C, YDR531W, YGL039W, YOR246C, YPL150W | P0050875 | cellular physiological process
M113 | MRH4, MRP49, MRP8, MRPL13, MRPL25, SMP2, YDR115W, YMR210W, ZRC1 | P0008150 | biological process
M114 | BPT1, BUD7, COG4, COR1, CTA1, DCI1, DSL1, ECI1, GLO2, GPD1, GPD2, IMP1, IMP2, MAE1, MDM30, MRL1, ORC1, ORC2, ORC3, ORC4, ORC5, ORC6, PEX13, PEX14, PEX17, PEX5, QCR2, RAD1, RBK1, RGT2, SLX4, SMK1, STE20, SWR1, TIP20, TRR2, YBL010C, YGR263C, YKR022C, YOR051C | P0050875 | cellular physiological process
M115 | APL2, APL4, APM1, APM2, APS1, BUD9, DAK2, ECM14, HNT2, HYM1, OMS1, PLB3, QCR7, SPR1, TRL1, UBC4, UFD4, VAC7, VMA22, VMA6, VPH2, YBL104C, YFR043C, YFR045W | P0050875 | cellular physiological process
M116 | BUD23, POP3, POP4, POP5, POP6, POP7, POP8, RPP1 | P0008033 | tRNA processing
M117 | ANP1, ARD1, ASC1, DRS2, ENO1, FKS1, HOC1, KAR2, MNN10, MNN9, NAT1, NEM1, PDR5, PKC1, SEC62, SEC63, SEC66, SEC72, SPO7, YAL053W, YER007C-A, YHR080C, YJR014W | P0050875 | cellular physiological process
M118 | AGP1, ERG9, GAA1, GTT1, GTT2, MEP1, VRG4, YIL023C | P0050875 | cellular physiological process
M119 | FAR10, FAR3, FAR7, FAR8, GPA2, GPB1, GPB2, GPR1, HFM1, LSP1, OYE2, PHO84, PIL1, PKH1, TPI1, VPS64, YRF1-4 | P0050875 | cellular physiological process
M120 | ANP1, HOC1, KAR2, MNN10, MNN9, SEC62, SEC63, SEC66, SEC72, YHR080C | P0016043 | cell organization and biogenesis
M121 | AGE1, BDH1, CUE3, ERG3, HNT3, NCE102, PHO12, RPL4B, SER3, SER33, SFA1, YBR042C, YDR266C, YDR531W, YGL039W, YOR246C | P0050875 | cellular physiological process
M122 | APL1, APL3, APL5, APL6, APM3, APM4, APS2, APS3, BRO1, CRR1, CYC8, HSL7, IOC4, IRE1, KAR5, KIP1, LST8, MEU1, MSL5, MUD2, NRG1, PPZ1, PRP2, PRP22, PRP39, REV1, RIM13, RIM20, SMY2, SNF7, SNP1, SWI1, TKL1, TKL2, TOR1, VPS24, VPS4, YCR101C, YDL036C, YDR541C, YER087W, YER128W, YFR039C, YGR122W, YIL130W, YMR009W, YPL105C | P0050875 | cellular physiological process
M123 | EST3, PRE10, PRE2, PRE3, PRE6, PRE7, PRE8, PRE9, PUP2, PUP3, SCL1, UMP1 | P0006511 | ubiquitin-dependent protein catabolism
M124 | ADR1, BMH1, BNR1, CAP1, CAP2, CYK3, KCS1, NTH1, SOK1, STU1, SVL3, YDR366C, YER071C, YFR016C, YFR017C, YIR003W | P0050875 | cellular physiological process
M125 | CDC73, CTR9, LEO1, PAF1, RTF1 | P0006368 | RNA elongation from Pol II promoter
M126 | ABF2, ARP4, CDC45, CHA4, CTI6, DEP1, DOT6, EAF3, EPL1, ESA1, FAA2, FKH2, INO80, PHO23, RPD3, RXT3, SAP30, SDS3, SIN3, STB2, STB4, SWC4, SWI5, UME1, UME6, VID21, YAF9, YBR095C, YMR075W | P0019219 | regulation of nucleobase, nucleoside, nucleotide and nucleic acid metabolism
M127 | ADE3, ALA1, AMN1, ATG10, ATG11, ATG12, ATG16, ATG3, ATG4, ATG5, ATG7, ATG8, BIK1, BIM1, CHK1, CYS3, DED81, DPS1, DUN1, ERG20, FAP1, FET3, FPR1, HOM2, HOM3, IDH1, ILV5, ISC10, IST3, KAR9, KTR3, MLP2, MRE11, NTG1, NUC1, PHM8, PPX1, RAD28, RAD50, RAD9, RBL2, SSK22, SSZ1, STU2, TFC1, TFC3, TFC6, TFC8, TRP5, TRZ1, TUB1, TUB2, TYR1, XRS2, YBR063C, YDR051C, YDR341C, YLR152C, YMR087W, YMR226C, YNL208W, YOR285W | P0050875 | cellular physiological process
M128 | ACE2, ATP1, ATP2, BZZ1, DRE2, ECM30, EDC1, KAP122, MNL1, NUP1, NUP170, PIG2, PNC1, PUS1, RNT1, RPL30, RTT103, SAH1, SLF1, SNO1, SNO2, SNO3, SNO4, SNZ1, SNZ2, SNZ3, TIM11, TOA1, TOA2, UBP15, YDR111C, YDR132C, YER064C, YJL075C, YJR098C, YKR064W, YML131W, YNL157W, YNL313C | P0050875 | cellular physiological process
M129 | DYN1, ECM5, GLY1, HSC82, ICL1, LIP2, MAC1, MSB1, PLB1, PPG1, PPT1, VTH2, YFR006W | P0050875 | cellular physiological process
M130 | ARC15, ARC18, ARC19, ARC35, ARC40, ARP2, ARP3, ATG9, BCK1, BUD13, BUL1, CDC7, CYS4, DBF4, FYV8, GAL7, HAT1, HCR1, HUA1, ILS1, KIP3, KTR5, LHS1, MAS1, MAS2, MCM2, MET22, MKK1, MKK2, MNT3, MSH1, NIP1, PDC6, PML1, PRT1, PST2, RGD1, RLI1, RNR3, RPG1, SKG3, SLT2, SMC4, THI3, TIF34, TIF35, TIF5, UBP14, UBP2, YBR052C, YCP4, YGL185C, YLR065C, YLR253W, YNL040W, YNL134C, YOL087C | P0050875 | cellular physiological process


M131 | MRH4, MRP49, MRPL13, MRPL25, YDR115W | P0008150 | biological process
M132 | ABF2, ADA2, ARO9, ARP4, ASK10, BIT61, CDC45, CHA4, CTI6, CUE2, DEP1, DOT6, EAF3, ECM19, ECM22, EPL1, ESA1, FAA2, FKH2, GCN5, HFI1, INO80, NGG1, PDR1, PHO23, PIF1, RPD3, RXT3, SAP30, SDS3, SER2, SGF29, SGF73, SIN3, SNF8, SPT15, SPT20, SPT3, SPT7, SPT8, STB2, STB4, SWC4, SWI5, TAF1, TAF10, TAF11, TAF12, TAF13, TAF2, TAF3, TAF4, TAF5, TAF6, TAF7, TAF8, TAF9, TBF1, TRA1, UBP8, UME1, UME6, VID21, VPS20, VPS25, VPS36, VPS51, YAF9, YBR095C, YGR071C, YIM1, YMR075W, YNG2 | P0019219 | regulation of nucleobase, nucleoside, nucleotide and nucleic acid metabolism
M133 | APL2, APL4, APM1, APM2, APS1, HYM1, QCR7, UBC4, UFD4, VAC7, YBL104C, YFR043C, YFR045W | P0006810 | transport
M134 | ARC15, ARC18, ARC19, ARC35, ARC40, ARP2, ARP3, RGD1, YNL040W | P0030036 | actin cytoskeleton organization and biogenesis
M135 | BRO1, RIM13, RIM20, SNF7, VPS24, VPS4, YDR541C, YER128W, YGR122W | P0050875 | cellular physiological process
M136 | DYN1, ECM5, GLY1, ICL1, LIP2, MAC1, MSB1, VTH2, YFR006W | P0050875 | cellular physiological process
M137 | AAD3, ATG14, BEM4, BNI4, BUB1, BUB3, CAK1, CAT8, CDC10, CDC11, CDC12, CDC20, CDC28, CDC3, CDC6, CDH1, CHS3, CKI1, CKS1, CLB1, CLB2, CLB3, CLB4, CLB6, CLN1, CLN2, CTR3, DAL3, DAL7, DAL82, DOG2, GCR1, GCR2, GIN4, GLN1, HSL1, HSP104, HXK1, IRA1, IRA2, MAD1, MAD2, MAD3, MER1, MGS1, MKS1, MSK1, NFI1, NMA2, PAA1, PCM1, PMD1, PRS1, PRS2, PRS3, PRS4, PRS5, PTC3, PTC4, RGT1, RHO2, RIM11, SAP4, SEF1, SHS1, SKT5, SPR28, SPR3, STD1, TCB1, TPS1, TPS2, TPS3, TRM7, TSC10, TSL1, VHS1, VPS15, VPS30, VPS34, VPS38, YDR186C, YDR196C, YDR314C, YER163C, YGR205W, YGR277C, YIL177C, YKL215C, YKR077W, YML011C, YMR251W, YUH1, ZPR1 | P0050875 | cellular physiological process
M138 | APN2, GCN4, MET31, RUB1, UBA3, UBC12, ULA1 | P0050875 | cellular physiological process
M139 | CDC73, CTR9, LEO1, PAF1, RTF1, VTS1, YAL049C, YDL025C, YHR009C | P0006368 | RNA elongation from Pol II promoter
M140 | ADE4, AYR1, BGL2, BTN2, CBP6, CHO1, COG8, COP1, CRM1, CSE1, CTR1, DBP5, DIA4, DID4, GCD1, GCD11, GCD2, GCD6, GCD7, GCN3, GFD1, GLE1, GLE2, GLO3, GPT2, GSP1, GSP2, HAM1, HEM15, IBD2, IST1, KAP104, LIN1, MET3, MOG1, NDE1, NUP42, OAC1, PCH2, PET10, PET9, PEX19, PEX3, PRO1, PSD1, RET2, RET3, RIP1, RNA1, RTT105, SAC1, SCW4, SEC21, SEC26, SEC27, SEC28, SEC7, SER1, SLN1, SOH1, SRM1, SUI2, SUI3, ULP2, UTR4, XKS1, YBR187W, YBR242W, YCR076C, YDL193W, YDR063W, YER079W, YER184C, YGK3, YGL157W, YHL008C, YHL018W, YIL137C, YKU70, YLR031W, YLR108C, YMR124W, YPD1, YPR004C, YRB1, YRB2, YRR1 | P0050875 | cellular physiological process
M141 | AGP1, ERG9, GAA1, GTT1, GTT2, VRG4, YIL023C | P0050875 | cellular physiological process
M142 | CFT1, CFT2, CLP1, FIP1, MPE1, PAP1, PCF11, PFS2, PTA1, PTI1, REF2, RNA14, RNA15, RRN9, SSU72, SYC1, YDL218W, YSH1, YTH1 | P0006379 | mRNA cleavage
M143 | FYV10, GID8, MOH1, RMD5, VID28, VID30 | P0006090 | pyruvate metabolism
M144 | ARR4, CPR2, CPR5, CRH1, CUE4, ERV2, FOL2, GIR2, GSF2, INO1, MDM39, MSN4, POR1, RBG1, RBG2, RMD7, RPN1, SCW10, SKG1, SUC2, YBR014C, YDL121C, YGR250C | P0050875 | cellular physiological process
M145 | AAH1, ABD1, ADH2, ASR1, BUD27, CCL1, CPR6, DOA1, ESS1, FLO8, GIM3, GIM4, GIM5, GNA1, GNP1, HYR1, ISA2, IVY1, IWR1, KIN28, LOS1, MAF1, MMS2, NAP1, NGL2, PAC10, PBP4, PCL10, PCP1, PDR8, QNS1, RBA50, RET1, REX2, RPA12, RPA135, RPA14, RPA190, RPA43, RPA49, RPB10, RPB11, RPB2, RPB3, RPB4, RPB5, RPB7, RPB8, RPB9, RPC11, RPC19, RPC25, RPC31, RPC34, RPC37, RPC40, RPC53, RPC82, RPO21, RPO26, RPO31, SAC6, SHR3, SPC72, SPC97, SPC98, SPT5, SRX1, SWC5, TBS1, TFB3, TFG1, TFG2, TOM1, TRR1, TUB4, UBC13, YBR139W, YBR280C, YDL129W, YDR131C, YFL042C, YHL021C, YKE2, YLR225C, YOL070C, YOR220W | P0006139 | nucleobase, nucleoside, nucleotide and nucleic acid metabolism
M146 | ARL1, BET1, EMP24, HIP1, IMH1, LST7, PUT3, SAR1, SEC16, SEC20, SEC22, SEC23, SEC24, SEC31, SED4, SFB2, TIF3, UFE1, VPS52, VPS53, VPS54, YDL110C | P0045184 | establishment of protein localization
M147 | BET3, BET5, GSG1, GYP6, KRE11, TRS120, TRS130, TRS23, TRS31, TRS33 | P0006888 | ER to Golgi transport
M148 | APL1, APL3, APL5, APL6, APM3, APM4, APS2, APS3, HSL7, KAR5, LST8, MSL5, MUD2, PRP2, PRP22, PRP39, SMY2, SWI1, TOR1, YDL036C, YER087W, YIL130W, YPL105C | P0050875 | cellular physiological process
M149 | CPD1, GOS1, LAP3, MYO1, NYV1, PEP12, SEC17, SEC18, SEC9, SED5, SFT1, SFT2, SLY1, SNC1, SNC2, SRO7, SRO77, SSO1, SSO2, SYN8, TLG1, TLG2, VAM3, VAM7, VPS45, VPS68, VTI1, YIF1, YIP3, YKT6, ZRT2 | P0046907 | intracellular transport
M150 | BCP1, CDC25, DOS2, RAS1, RAS2, RKM1, RPL23A, SAS5, SDC25, SPT21, SSA4, YOL098C | P0050875 | cellular physiological process
M151 | ALG5, APE3, DPB11, ERO1, ERV41, ERV46, GLR1, MAS6, MDJ1, MGE1, NPR2, NUP120, NUP133, NUP145, NUP84, NUP85, OST1, POX1, RRS1, RUP1, SEC13, SEH1, SLC1, SSC1, STT3, SWP1, TIM17, TIM44, VAN1, VBA3, WBP1, YDR128W, YER182W, YHL039W, YPL222W | P0050875 | cellular physiological process
M152 | ADE2, EAF5, ERR1, HAA1, HSP26, MET2, MEX67, MIP6, MSN5, NUP100, PBN1, PEP8, PRB1, SDH1, SDH2, SDH3, SPS1, SUP35, TCM62, VPS17, VPS29, VPS35, VPS5, VPS74 | P0006810 | transport
M153 | BFR2, BMS1, BUD21, CHD1, CKA1, CKA2, CKB1, CKB2, DHR2, DIM1, DIP2, ECM16, EMG1, ENP1, ENP2, HCA4, HIS7, HOT1, HTA2, IMP3, IMP4, KRE33, KRI1, KRR1, LCB3, LCP5, MPP10, MRD1, MVD1, NAN1, NOC4, NOP1, NOP14, NOP58, OGG1, PNO1, POB3, POL5, PRM8, PTC5, PWP2, RCL1, RIO2, RML2, RRP7, RRP9, SET2, SIK1, SLX9, SLY41, SOF1, TPT1, UTP10, UTP11, UTP14, UTP15, UTP18, UTP20, UTP21, UTP22, UTP30, UTP4, UTP6, UTP7, YER030W, YGL146C, YGR054W, YGR210C, YHL035C, YLR003C, YOR059C | P0006364 | rRNA processing
M154 | CDD1, CSL4, CSM2, CTT1, CUP2, DIS3, DYS1, FCY1, GDH3, HLR1, HPT1, ITT1, LRP1, MCM1, MES1, MET17, MGA1, MTR3, NUP60, PCT1, PDS1, PRP5, PSA1, RRP4, RRP40, RRP42, RRP43, RRP45, RRP46, RRP6, SKI2, SKI3, SKI6, SKI7, SNF11, SRL2, SRP1, TDH2, THI6, TIR1, TPO1, UBA4, URM1, WWM1, YBR137W, YHR112C, YIR035C, YJR056C, YKL069W, YKR011C, YLR345W, YML053C | P0050875 | cellular physiological process
M155 | BCD1, BNA1, GCD10, GCD14, GLT1, IST2, NBP1, RTT102, RTT106, SUA7, TEL2, YFL052W, YLR455W, YML108W, YMR315W, YNG1 | P0006139 | nucleobase, nucleoside, nucleotide and nucleic acid metabolism
M156 | AAD3, ATG14, BEM4, BNI4, CDC10, CDC11, CDC12, CDC3, CHS3, CKI1, CTR3, DAL3, DOG2, GCR1, GCR2, GIN4, GLN1, HSP104, HXK1, IRA1, IRA2, MER1, MKS1, MSK1, NFI1, PAA1, PMD1, PRS1, PRS2, PRS3, PRS4, PRS5, PTC3, PTC4, RHO2, RIM11, SEF1, SHS1, SKT5, SPR28, SPR3, TCB1, TPS1, TPS2, TPS3, TRM7, TSC10, TSL1, VHS1, VPS15, VPS30, VPS34, VPS38, YDR186C, YER163C, YGR205W, YGR277C, YIL177C, YKL215C, YMR251W, YUH1 | P0050875 | cellular physiological process
M157 | ARR4, CPR2, CPR5, CRH1, CUE4, ERV2, GSF2, INO1, MDM39, MSN4, RMD7, SCW10, SKG1, YBR014C, YDL121C | P0050875 | cellular physiological process
M158 | ARR4, CPR2, CPR5, CRH1, CUE4, ERV2, FOL2, GSF2, INO1, MDM39, MSN4, POR1, RMD7, SCW10, SKG1, YBR014C, YDL121C | P0050875 | cellular physiological process
M159 | AIP1, APC1, APC11, APC2, APC4, APC5, APC9, ARK1, CDC16, CDC23, CDC26, CDC27, CHC1, CLC1, CYR1, DOC1, ENT1, ENT3, GCN1, GCN20, GGA2, IZH4, KAP120, KEL3, LEU3, MKT1, PAN1, PAN2, PAN3, PFK1, PHO80, PHO81, PMI40, RAD51, RAD55, RAD57, SAP1, SCP160, SPT2, SRV2, SSE1, SSL2, SWA2, SWM1, TCB3, YAP1801, YAP1802, YGR067C, YHB1, YNL045W, YOR111W | P0050875 | cellular physiological process
M160 | AFG3, APT1, ARE2, BIO3, BRE2, CCT4, CYM1, DOA4, DYN2, FRE3, HMG2, LAT1, LPD1, MTR2, NBP2, NTE1, OSM1, PAC11, PBS2, PDA1, PDB1, PDX1, PRX1, PTC1, SDC1, SET1, SKM1, SPP1, SWD1, SWD3, YRM1, YTA12 | P0050875 | cellular physiological process
M161 | BPT1, GPD1, GPD2, IMP1, IMP2, MDM30, MRL1, RGT2, STE20, TRR2 | P0006810 | transport
M162 | BBC1, CMD1, CMK1, CNA1, CNB1, CUE5, DAP1, DML1, DOG1, ECM25, ECM33, EDE1, END3, FAR10, FAR3, FAR7, FAR8, FUS2, GAL1, GAL2, GAL3, GAL4, GAL80, GPA2, GPB1, GPB2, GPR1, GYP5, HFM1, HOM6, HRT1, KEL2, KEX2, KIN2, KRE6, LAS17, LSB3, LSB6, LSP1, MET18, MLC1, MMS1, MST27, MST28, MYO2, MYO3, MYO4, MYO5, OYE2, PCL9, PEP1, PHO84, PIL1, PKH1, PMA1, PMA2, RIX7, RPL10, RVS161, RVS167, SHE2, SHE3, SHE4, SLA1, SLA2, SPC110, SQT1, SUL2, SYP1, TPI1, TPO5, UBP7, UBP9, UBX6, URA5, VPS64, VRP1, YBP2, YBR108W, YDR219C, YDR267C, YDR348C, YER156C, YFR042W, YGP1, YHM2, YHR122W, YHR182W, YIL001W, YJL045W, YJL200C, YLR243W, YLR422W, YMR295C, YNR065C, YOL111C, YRF1-4, YSC84 | P0050875 | cellular physiological process
M163 | ARP1, ARP10, ATG17, AVO2, BDP1, BSC5, CDC19, CHS1, CMP2, DIP5, GRX5, JNM1, KOG1, KSP1, LDB18, MCK1, MDM36, NIP100, PNT1, PPH3, PSY2, PYK2, SLM1, SLM2, SNX4, TRM3, YAP6, YBL046W, YBR270C, YDR357C, YGL250W, YJR008W, YJR011C, YJR024C, YNR068C, ZIP1 | P0050875 | cellular physiological process
M164 | ALG5, APE3, DPB11, ERO1, ERV41, ERV46, GLR1, NUP120, NUP133, NUP145, NUP84, NUP85, OST1, RRS1, SEC13, SEH1, SLC1, STT3, SWP1, VAN1, VBA3, WBP1, YHL039W | P0050875 | cellular physiological process
M165 | BNA6, CDD1, CSL4, CSM2, CTT1, CUP2, DCD1, DIS3, DYS1, ECM31, FCY1, GDH3, HEM2, HLR1, HPA2, HPA3, HPT1, HUA2, ISN1, ITT1, KAP95, LRP1, MCM1, MES1, MET17, MGA1, MHT1, MTR3, MUK1, NUP60, PCT1, PDS1, PRO3, PRP5, PSA1, RIB4, RNP1, RRP4, RRP40, RRP42, RRP43, RRP45, RRP46, RRP6, SAE2, SKI2, SKI3, SKI6, SKI7, SNF11, SRL2, SRP1, TDH2, THI6, TIR1, TPO1, UBA4, URM1, WWM1, YBR137W, YHR112C, YIR035C, YJR056C, YKL069W, YKR011C, YLR345W, YML053C, YNK1 | P0050875 | cellular physiological process
M166 | ADA2, ARO9, ASK10, BIT61, CUE2, ECM19, ECM22, GCN5, HFI1, NGG1, PDR1, PIF1, SER2, SGF29, SGF73, SNF8, SPT15, SPT20, SPT3, SPT7, SPT8, TAF1, TAF10, TAF11, TAF12, TAF13, TAF2, TAF3, TAF4, TAF5, TAF6, TAF7, TAF8, TAF9, TBF1, TRA1, UBP8, VPS20, VPS25, VPS36, VPS51, YGR071C, YIM1, YNG2 | P0019219 | regulation of nucleobase, nucleoside, nucleotide and nucleic acid metabolism
M167 | AOS1, CDC48, CFT1, CFT2, CLP1, CMK2, DSK2, FIP1, FIR1, HCH1, HMF1, KRE1, MPE1, NBP35, NPL4, PAB1, PAP1, PCF11, PFS2, PTA1, PTI1, REF2, RNA14, RNA15, RRN9, RTG1, RTG3, SGT2, SHP1, SIM1, SSU72, SYC1, TAO3, UBA2, UFD1, UFD2, YBR141C, YDL218W, YDR049W, YPL236C, YPS7, YSH1, YTH1 | P0050875 | cellular physiological process
M168 | FYV10, GID7, GID8, MOH1, RMD5, VID24, VID28, VID30 | P0006090 | pyruvate metabolism
M169 | AGX1, BCY1, BNA6, CAT2, CDD1, CLG1, COX4, COX5A, COX6, COX9, CSL4, CSM2, CTT1, CUP2, DCD1, DIS3, DYS1, ECM31, FCY1, FRM2, GDH3, HEM2, HLR1, HPA2, HPA3, HPT1, HUA2, ISN1, ITT1, KAP95, LRP1, MCM1, MES1, MET17, MGA1, MHT1, MMR1, MTR3, MUK1, NUP60, PCL1, PCL2, PCL5, PCL6, PCL7, PCL8, PCT1, PDS1, PHO85, PRO3, PRP5, PSA1, RIB4, RIM15, RNP1, RRP4, RRP40, RRP42, RRP43, RRP45, RRP46, RRP6, SAE2, SAT4, SKI2, SKI3, SKI6, SKI7, SKI8, SNF11, SOR1, SOR2, SPO11, SRL2, SRP1, TDH2, THI6, TIR1, TPK1, TPK2, TPK3, TPO1, UBA4, URM1, WHI4, WWM1, YBR137W, YHR033W, YHR112C, YIR035C, YJR056C, YKL069W, YKR011C, YLR177W, YLR345W, YML053C, YNK1, YOL075C | P0050875 | cellular physiological process
M170 | AME1, CHL4, CTF3, IML3, MCM16, MCM21, MCM22, MIF2, NKP1, NKP2 | P0007059 | chromosome segregation
M171 | SEC65, SRP14, SRP21, SRP54, SRP68, SRP72 | P0045184 | establishment of protein localization
M172 | DCP1, DHH1, GDH1, KEM1, LSM1, LSM2, LSM3, LSM4, LSM5, LSM6, LSM7, PAT1, PRP24, PRP38, PRP6, SCM3, SNU23, YLR211C, YLR446W | P0000398 | nuclear mRNA splicing, via spliceosome
M173 | CAT2, COX4, COX5A, COX6, COX9, SKI8, SPO11, YOL075C | P0006810 | transport
M174 | EXO84, SEC10, SEC3, SEC5, SEC6, SEC8, VPS9, YMR002W | P0045184 | establishment of protein localization
M175 | CAF4, ENT2, OSH7, SEC65, SRP14, SRP21, SRP54, SRP68, SRP72, YBL029W | P0045184 | establishment of protein localization
M176 | APL2, APL4, APM2, APS1, YFR043C, YFR045W | P0016192 | vesicle-mediated transport
M177 | POP4, POP5, POP6, POP7, POP8, RPP1 | P0008033 | tRNA processing
M178 | AAC1, ADH4, ADH6, AGE1, ARG1, ARG3, ARG4, ARG5, ARG8, ARO4, BDH1, CAT5, CUE3, ERG3, FRE1, HIS3, HIS5, HNT3, IDP1, LSC1, LSC2, LYS1, LYS20, MSG5, MSU1, NCE102, NST1, PHO12, PRP28, PTC7, PYC2, RAD26, RAD3, RPL4B, SER3, SER33, SFA1, SLD2, SRY1, TAL1, THR4, TOS4, YBR042C, YBR184W, YDR266C, YDR326C, YDR531W, YGL039W, YJL068C, YKL075C, YOR246C, YPL150W, YPL229W | P0050875 | cellular physiological process
M179 | ADA2, ARO9, ASK10, ECM19, ECM22, GCN5, HFI1, NGG1, PDR1, PIF1, SGF29, SPT15, SPT20, SPT3, SPT7, SPT8, TAF1, TAF10, TAF12, TAF2, TAF5, TAF6, TAF7, TAF8, TRA1, UBP8, VPS51, YIM1, YNG2 | P0019219 | regulation of nucleobase, nucleoside, nucleotide and nucleic acid metabolism
M180 | AAR2, ADE8, ARO8, BRR1, BUD31, BUR2, CBC2, CCC1, CDC40, CEF1, CLF1, CSN12, CWC2, CWC22, CWC23, DIB1, EGD1, FOB1, GCN2, GPX2, IFH1, ISY1, LEA1, LUC7, MSH4, MTD1, MUD1, NAB3, NAB6, NAM8, NGR1, NRD1, NRP1, NTC20, PET111, PRP18, PRP19, PRP40, PRP42, PRP43, PRP45, PRP8, PUB1, PXA1, PXA2, QDR3, REV3, REV7, RSE1, SCM4, SEC39, SGV1, SLU7, SMB1, SMD1, SMD2, SMD3, SMX3, SNT309, SNU114, SNU56, SNU71, SRL1, SRO9, STO1, SUT1, SWC7, SYF1, SYF2, VBA1, YBL081W, YDL144C, YHC1, YJU2, YLR424W, YNL213C, YNL224C | P0006139 | nucleobase, nucleoside, nucleotide and nucleic acid metabolism
M181 | CRZ1, CSM1, CYC1, DEG1, DSE3, ERG24, FAT1, GYL1, LRS4, MAM1, OSH3, PLP2, SAS10, SET5, SFB3, SGM1, SPO12, SUM1, TGL1, YBR194W, YBR281C, YLR132C, YNL191W, YOR215C, YOR252W, YPR118W, YPR152C | P0050875 | cellular physiological process
M182 | CCT8, ECM29, FDH1, FUI1, GON7, LYS12, NAS6, PRE5, RAD23, RPN10, RPN11, RPN12, RPN13, RPN14, RPN2, RPN3, RPN5, RPN6, RPN7, RPN8, RPT1, RPT2, RPT3, RPT4, RPT5, RPT6, STS1, URA7, URA8 | P0050875 | cellular physiological process
M183 | AAP1, CAF20, CWC21, GAL83, GIS4, GLN3, HOG1, NRG2, PEP3, PEP5, PIN4, RCK1, RCK2, REG1, ROD1, SCW11, SIP1, SIP3, SIP4, SNF1, SNF4, TOS3, URA4, VAM6, VPS16, VPS33, VPS41, VPS8, YER129W, YKR096W, YMR086W, YMR291W | P0050875 | cellular physiological process
M184 | ALR1, COG3, EAF6, FAA3, FAB1, FYV7, GAT1, GDH2, HXK2, MED1, MED11, MED2, MED4, MED6, MED7, MED8, MLP1, MMS22, MSC2, MSH5, NUT1, PGD1, ROX3, RTT101, RTT107, SIN4, SRB6, SSE2, SSN2, SSN3, SSN8, TAH18, YAP1, YBP1, YGL024W, YIL077C, YMR102C, ZRG17 | P0006139 | nucleobase, nucleoside, nucleotide and nucleic acid metabolism
M185 | BEM1, BPT1, BUD7, CDC24, CLN3, COG4, COR1, CTA1, DCI1, DDC1, DSL1, ECI1, EXO84, FAR1, GLO2, GPD1, GPD2, IMP1, IMP2, LEU4, LOT5, LOT6, MAE1, MDM30, MEC3, MIR1, MMM1, MRL1, MRS11, MRS5, NDL1, NSL1, ORC1, ORC2, ORC3, ORC4, ORC5, ORC6, PAC1, PEX13, PEX14, PEX17, PEX5, QCR2, RAD1, RAD17, RBK1, RGA2, RGT2, RSR1, SEC10, SEC15, SEC3, SEC5, SEC6, SEC8, SLX4, SMK1, SRN2, SSA1, SSA2, STE20, STM1, SUV3, SWR1, TAT1, TIM18, TIM22, TIM54, TIM9, TIP20, TOS2, TRM10, TRR2, URA6, VPS9, YBL010C, YBR197C, YCR016W, YDJ1, YGR263C, YHR020W, YKR022C, YLR051C, YMR002W, YNR063W, YOR051C, YOR060C | P0050875 | cellular physiological process
M186 | APL2, APL4, APM1, APM2, APS1, ATP12, ATP14, BUD14, BUD9, COX5B, DAK2, ECM14, ELM1, HNT2, HYM1, IDP3, KIN3, MEP2, MMF1, MMT1, NTA1, OMS1, PLB3, QCR7, QCR8, RAV1, SPR1, TFP1, TRL1, UBC4, UFD4, VAC7, VMA10, VMA13, VMA2, VMA22, VMA4, VMA5, VMA6, VMA7, VMA8, VPH1, VPH2, YAT1, YBL104C, YER036C, YFR043C, YFR045W, YGL226W, YGR117C, YIL157C, YLR168C, YMR003W, ZUO1 | P0006810 | transport
M187 | AAD3, CTR3, DAL3, GLN1, HXK1, MKS1, TPS3, TSC10, YIL177C, YMR251W | P0050875 | cellular physiological process
M188 | ARL3, BLM10, BRE5, CCT7, CDC55, CIN1, CIN5, EST3, EXO70, NET1, PPE1, PPH21, PPH22, PRE1, PRE10, PRE2, PRE3, PRE4, PRE6, PRE7, PRE8, PRE9, PUP1, PUP2, PUP3, RAP1, RDR1, RIF1, RIF2, ROX1, RTS1, RTS3, SCL1, SIR1, SIR2, SIR3, SIR4, SPP41, TAP42, TDH3, TIP41, TPD3, UBP3, UMP1, VID22, YFL006W, YKL206C, YLR199C | P0050875 | cellular physiological process
M189 | IME4, KAR4, MUM2, SPO14, YGL036W | P0007126 | meiosis
M190 | RPL25, RPL26B, RPL35B, RPS20, RPS22A | P0006412 | protein biosynthesis
M191 | ATP12, ATP14, BUD14, COX5B, IDP3, KIN3, MMF1, MMT1, QCR8, YAT1, YGL226W, YIL157C, YLR168C, YMR003W, ZUO1 | P0050875 | cellular physiological process
M192 | ASK1, ASN1, ATG13, CNN1, COG2, CSI1, CSN9, DAD1, DAD2, DAM1, DUO1, NUF2, NVJ1, PCI8, RRI1, RRI2, SMC5, SPC19, SPC24, SPC25, SPC34, SPO71, TID3, TRM2, ULP1, URH1, VAB2, VAC8, YGL079W, YIH1, YJL160C, YKL061W, YLR419W, YNL086W | P0050875 | cellular physiological process
M193 | CFT1, CFT2, CLP1, FIP1, FIR1, KRE1, MPE1, NBP35, PAB1, PAP1, PCF11, PFS2, PTA1, PTI1, REF2, RNA14, RNA15, RRN9, SIM1, SSU72, SYC1, TAO3, YBR141C, YDL218W, YSH1, YTH1 | P0006379 | mRNA cleavage
M194 | BUD27, GIM3, GIM4, GIM5, IVY1, PAC10, SPC72, SPC97, SPC98, TUB4, YKE2 | P0007020 | microtubule nucleation
M195 | ADY3, AFG2, AVT6, BOS1, CAD1, CCZ1, DON1, DSS4, DUS3, FLR1, FRT2, GDI1, GIS3, HST3, HSV2, JJJ1, LAC1, LAG1, LAS1, LIP1, MCH2, MRS6, NUP116, PET122, PET494, PET54, RBD2, RGP1, RIC1, RPL37A, SEC4, SFP1, SPC2, SSP1, TYE7, UTR1, VPS21, YAP3, YAR044W, YEL041W, YGR017W, YLR072W, YOR062C, YOR289W, YPT1, YPT10, YPT31, YPT32, YPT52, YPT53, YPT6, YPT7 | P0050875 | cellular physiological process
M196 | ALR1, ARG80, COG3, CSE2, EAF6, FAA3, FAB1, FYV7, GAL11, GAT1, GDH2, HXK2, IME1, MED1, MED11, MED2, MED4, MED6, MED7, MED8, MLP1, MMS22, MSC2, MSH5, NUT1, NUT2, PGD1, RGR1, ROX3, RTT101, RTT107, SIN4, SRB2, SRB4, SRB5, SRB6, SRB7, SRB8, SSE2, SSN2, SSN3, SSN8, SUB1, TAH18, YAP1, YBP1, YFR038W, YGL024W, YIL077C, YML082W, YMR1, YMR102C, YMR114C, YNR024W, ZRG17 | P0006139 | nucleobase, nucleoside, nucleotide and nucleic acid metabolism
M197 | AIR2, BBP1, BFR1, BUB2, CBK1, CPR3, CSM3, DDI1, DED1, DUT1, ECM11, ECM15, ELP2, ELP3, ELP4, GAS2, GBP2, HBS1, HMLALPHA2, HMT1, HPR1, HSP10, HSP60, IKI3, ISM1, MCX1, MFT1, MOB2, MRPS35, NIS1, NUP2, PHD1, RIB1, RIB3, RIB5, RIS1, RLR1, SEC53, SIP2, SMT3, SSD1, SUB2, TDP1, TEM1, TEX1, THO1, THP2, TOF1, TOF2, TOP1, UGP1, URK1, YDR020C, YFR011C, YGR066C, YGR153W, YIR016W, YJL218W, YLR456W, YMR233W, YPL068C, YPR172W, ZMS1 | P0050875 | cellular physiological process
M198 | ADE17, ADE5, ADE6, AHA1, APE2, ARA1, ARH1, CAF130, CAF16, CAF17, CAF40, CAR1, CCR4, CDC36, CDC39, CDC60, CLU1, CTF18, CTF8, DBF2, DPB2, ELG1, FAA1, FCP1, GCY1, GND1, JJJ3, KEL1, KGD1, KGD2, MAP2, MDH1, MDH3, MEC1, MOB1, MOT2, MPS1, NOT3, NOT5, NPA3, PFK2, POL12, POP2, PRI1, PRI2, RAD24, RFC1, RFC2, RFC3, RFC4, RFC5, RNQ1, SNT2, SPC42, SPE2, USO1, VPS13, YBR255W, YCR079W, YIR024C, YLR413W, YMR181C, YMR31, YNL092W, YOR262W, YPT35 | P0050875 | cellular physiological process
M199 | CTA1, DCI1, ECI1, GLO2, PEX13, PEX14, PEX17, PEX5, RBK1, YGR263C | P0006810 | transport
M200 | ADO1, ADR1, BMH1, BNR1, CAP1, CAP2, CYK3, INP52, KCS1, NTH1, PTP1, RNH202, RNH203, SOK1, STU1, SVL3, TOP2, YDR366C, YER071C, YFR016C, YFR017C, YIR003W, YTA6 | P0050875 | cellular physiological process

148

ECM17, MET1, MET10, MET16, RRP3, SLX1, cellular physiological M201 YBT1, YGR266W, YKR051W, YLR271W, P0050875 process YPR003C ATP3, ATP5, ATP7, CAF4, ENT2, FUR1, HMS1, M202 MCR1, OSH7, SEC65, SRP14, SRP21, SRP54, P0006810 transport SRP68, SRP72, YBL029W ATG13, COG2, CSI1, CSN9, NVJ1, PCI8, RRI1, cellular physiological M203 RRI2, URH1, VAB2, VAC8, YGL079W, YIH1, P0050875 process YJL160C, YKL061W, YNL086W ACC1, ARR4, ATE1, ATG18, ATG2, ATG20, AVT4, CPR2, CPR5, CRH1, CUE4, DTD1, DUR1, ERV2, FOL2, GIR2, GSF2, HRK1, INO1, MDM39, MDS3, MIA40, MSN4, POR1, RBG1, RBG2, cellular physiological M204 P0050875 RMD7, RMT2, RPN1, SAP155, SAP185, SAP190, process SCW10, SIT4, SKG1, SUC2, TUF1, YBR014C, YBR225W, YBR261C, YDL121C, YGR250C, YMR196W, YMR209C, YNL187W ARL3, EST3, PRE1, PRE10, PRE2, PRE3, PRE6, ubiquitin-dependent M205 PRE7, PRE8, PRE9, PUP2, PUP3, SCL1, UMP1, P0006511 protein catabolism YKL206C, YLR199C regulation of nucleobase, ABF2, CDC45, CHA4, CTI6, DEP1, DOT6, EAF3, nucleoside, M206 FKH2, INO80, PHO23, RPD3, RXT3, SAP30, SDS3, P0019219 nucleotide and SIN3, STB2, UME1, UME6, YBR095C, YMR075W nucleic acid metabolism ESP1, GTR1, GTR2, HXT10, MDM10, MEH1, MPM1, MVP1, NOP8, SAM37, SLM4, TOM20, M207 P0006810 transport TOM22, TOM40, TOM5, TOM6, TOM7, TOM70, YCR015C, YGR203W, YHR003C, YJL097W BAS1, BCD1, BNA1, BRE1, CYC2, GCD10, GCD14, GLT1, IST2, LDB7, LGE1, NBP1, PHO2, PHO4, PSF1, PSF2, RPS18A, RTT102, RTT106, cellular physiological M208 SHM2, SKG6, SLD5, SMI1, SUA7, TEL2, P0050875 process YCL056C, YDL157C, YFL052W, YGL242C, YLR455W, YML108W, YMR315W, YNG1, YNL100W, YPL077C APC1, APC11, APC2, APC4, APC5, APC9, CDC16, M209 P0007049 cell cycle CDC23, CDC26, CDC27, DOC1, LEU3 ADP1, AKR1, CTL1, CTS1, EAP1, FZO1, GPA1, IPP1, MAK10, MAK3, MAK31, NDD1, NEW1, cellular physiological M210 PEX18, PEX2, PEX21, PEX7, PPA2, SPC29, P0050875 process STE11, STE18, STE2, STE4, STE5, STE50, SYG1, TUS1, VPS62, YCK2, YDL203C, YPL066W, ZTA1


M211 | ADH3, ASF1, FUM1, FYV10, GID7, GID8, HIR3, KCC4, MOH1, MSO1, NHA1, PGI1, PTC2, RAD10, RAD53, RMD5, RNR1, RNR2, RNR4, SEC1, SML1, SWI4, VID24, VID28, VID30, YAL027W, YHL010C, YKR017C, YTA7 | P0050875 | cellular physiological process
M212 | AGX1, CLG1, MMR1, PCL1, PCL2, PCL5, PCL6, PCL7, PCL8, PHO85, RIM15, SAT4, SOR1, SOR2, WHI4 | P0050875 | cellular physiological process
M213 | DDC1, LEU4, LOT5, LOT6, MEC3, NDL1, PAC1, RAD17, SRN2, STM1, SUV3, TRM10, URA6, YBR197C, YCR016W, YLR051C, YNR063W | P0050875 | cellular physiological process
M214 | AGE1, ARO4, BDH1, CUE3, ERG3, HNT3, NCE102, NST1, PHO12, RPL4B, SER3, SER33, SFA1, SRY1, YBR042C, YDR266C, YDR531W, YGL039W, YKL075C, YOR246C, YPL150W, YPL229W | P0050875 | cellular physiological process
M215 | ACA1, ADO1, ADR1, BMH1, BNR1, BSP1, CAP1, CAP2, CYK3, DOT5, DYN1, ECM5, GLY1, GTS1, HSC82, ICL1, INP52, KCS1, LIP2, MAC1, MSB1, NTH1, PLB1, POG1, PPG1, PPT1, PTP1, RNH202, RNH203, SFL1, SOK1, STU1, SVL3, TOP2, VTH2, YBR138C, YDL233W, YDR366C, YER071C, YFR006W, YFR016C, YFR017C, YIL057C, YIR003W, YTA6 | P0050875 | cellular physiological process
M216 | BRF1, DST1, ECM1, FZF1, INM1, LEM3, LYS2, LYS5, NMD5, NUP159, NUP82, PIB1, PSE1, RDS2, RPC17, RPL16A, RPL16B, RPL25, RPL26B, RPL34A, RPL34B, RPL35B, RPS20, RPS22A, SXM1, TFC4, TRX2, YAL061W, YBR239C, YHR199C, YLR326W, YPR015C | P0050875 | cellular physiological process
M217 | AAP1, CAF20, CDC2, CDC33, CDC73, CTR9, CWC21, EXG1, FAA4, GAL83, GIS4, GLN3, HOG1, HYS2, LEO1, LSB1, MDH2, NRG2, PAF1, PEP3, PEP5, PET112, PIN4, POL1, POL32, RCK1, RCK2, REG1, ROD1, RTF1, SBP1, SCW11, SIP1, SIP3, SIP4, SNF1, SNF4, SPL2, SPT16, TOS3, URA4, VAM6, VPS16, VPS33, VPS41, VPS8, VTS1, YAL049C, YDL025C, YEL023C, YER129W, YGR016W, YGR058W, YHR009C, YIL108W, YKR018C, YKR096W, YLR392C, YMR086W, YMR291W, YPK2 | P0050875 | cellular physiological process
M218 | CYB2, EHD3, FYV4, MAM33, MRP1, MRP13, MRP21, MRP4, MRP51, MRPS16, MRPS17, MRPS18, MRPS28, MRPS9, NAM9, PUS7, RSM10, RSM22, RSM23, RSM24, RSM25, RSM26, RSM27, RSM7, TYS1, YOR356W | P0006412 | protein biosynthesis
M219 | BMH2, BOP3, ECM13, FKH1, FRQ1, MTH1, NTH2, PIK1, RIM101, RTG2, SEC2, SNF3, UBP12, URE2, YMR144W, ZAP1 | P0050875 | cellular physiological process
M220 | AMS1, ANP1, ARD1, ASC1, ATG19, BET3, BET5, DRS2, ENO1, FKS1, GSG1, GYP6, HOC1, KAR2, KRE11, KRS1, LAP4, MNN10, MNN9, NAT1, NEM1, PDR5, PKC1, SEC62, SEC63, SEC66, SEC72, SKN1, SPO7, SSL1, TFA1, TFA2, TFB1, TFB4, TRS120, TRS130, TRS20, TRS23, TRS31, TRS33, YAL053W, YER007C-A, YFL034W, YHR080C, YHR113W, YJR014W | P0050875 | cellular physiological process
M221 | CSM1, DSE3, GYL1, LRS4, MAM1, OSH3, PLP2, SPO12, SUM1, YBR194W, YLR132C, YOR252W, YPR118W, YPR152C | P0050875 | cellular physiological process
M222 | ARC15, ARC18, ARC19, ARC35, ARC40, ARP2, ARP3, BCK1, BUL1, CDC7, DBF4, FYV8, GAL7, LHS1, MCM2, MKK1, MKK2, MNT3, PDC6, PST2, RGD1, SKG3, SLT2, THI3, YBR052C, YCP4, YLR253W, YNL040W | P0050875 | cellular physiological process
M223 | ASK1, DAD1, DAD2, DAM1, DUO1, SPC19, SPC34, TRM2, ULP1 | P0000071 | mitotic spindle assembly
M224 | ORC1, ORC2, ORC3, ORC4, ORC5, ORC6, RAD1, SLX4, SWR1, YOR051C | P0006260 | DNA replication
M225 | AME1, AQY2, BIR1, BNA5, CBF1, CBF2, CEP3, CHL4, CSE4, CTF13, CTF19, CTF3, IML3, MCM16, MCM21, MCM22, MET28, MET4, MIF2, NKP1, NKP2, OKP1, UBC9 | P0050875 | cellular physiological process
M226 | ATP12, ATP14, COX5B, IDP3, MMF1, MMT1, YAT1, YGL226W, YIL157C, YLR168C, YMR003W | P0050875 | cellular physiological process
M227 | BPT1, GPD1, GPD2, IMP1, IMP2, MDM30, MRL1, ORC1, ORC2, ORC3, ORC4, ORC5, ORC6, RAD1, RGT2, SLX4, STE20, SWR1, TRR2, YOR051C | P0050875 | cellular physiological process
M228 | ARP8, ESC8, IES1, IES3, IES5, IOC2, IOC3, ISW1, ISW2, ITC1, LTP1, MOT1, NHP10, NHP6B, NPL6, REB1, RFX1, RSC2, RSC3, RSC4, RSC58, RSC6, RSC8, RSC9, SFH1, STH1, STR3, TFC7, VPS1, YEF3 | P0006338 | chromatin remodeling
M229 | BFR2, BMS1, BUD21, CBF5, CHD1, CKA1, CKA2, CKB1, CKB2, COQ2, DHR2, DIM1, DIP2, ECM16, EMG1, ENP1, ENP2, ERG2, FAP7, FET4, HCA4, HEK2, HIS7, HOT1, HTA2, IMP3, IMP4, JSN1, KRE33, KRI1, KRR1, LCB3, LCP5, MPP10, MRD1, MVD1, NAF1, NAN1, NHP2, NOC4, NOP1, NOP14, NOP58, OGG1, PNO1, POB3, POL5, PRM8, PTC5, PWP2, RCL1, RIO2, RML2, ROK1, RRP7, RRP9, SET2, SHQ1, SIK1, SLX9, SLY41, SOF1, TPT1, UTP10, UTP11, UTP13, UTP14, UTP15, UTP18, UTP20, UTP21, UTP22, UTP30, UTP4, UTP6, UTP7, UTP8, UTP9, YER030W, YGL146C, YGR054W, YGR210C, YHL035C, YLR003C, YMR310C, YOR059C | P0007046 | ribosome biogenesis
M230 | CCT8, FUI1, GON7, NAS6, PRE5, RPN10, RPN11, RPN12, RPN13, RPN14, RPN2, RPN3, RPN5, RPN6, RPN7, RPN8, RPT1, RPT2, RPT3, RPT4, RPT5, RPT6, STS1 | P0030163 | protein catabolism
M231 | ACH1, AFR1, ERG13, GND2, GRX1, IQG1, LRO1, MET6, NCS2, NTF2, PRM2, RRN3, SEN15, SNU13, WTM1, WTM2, YMR266W, YOR283W | P0050875 | cellular physiological process
M232 | ARE2, BIO3, CCT4, DYN2, FRE3, HMG2, NBP2, PAC11, PBS2, PRX1, PTC1, SKM1, YRM1 | P0050875 | cellular physiological process
M233 | BEM2, CYB2, EHD3, FYV4, MAM33, MRP1, MRP10, MRP13, MRP17, MRP21, MRP4, MRP51, MRPL38, MRPS16, MRPS17, MRPS18, MRPS28, MRPS5, MRPS8, MRPS9, NAM9, PET123, PSD2, PUS7, RSM10, RSM19, RSM22, RSM23, RSM24, RSM25, RSM26, RSM27, RSM7, TYS1, UBP10, YOR205C, YOR356W | P0006412 | protein biosynthesis
M234 | BMH2, ECM13, FRQ1, MTH1, NTH2, PIK1, RTG2, SNF3 | P0050875 | cellular physiological process
M235 | MRP1, MRP13, MRP21, MRP4, MRP51, MRPS17, MRPS18, MRPS28, MRPS9, NAM9, PUS7, RSM10, RSM22, RSM23, RSM26, RSM27, RSM7, YOR356W | P0006412 | protein biosynthesis
M236 | BDF1, BDF2, BEM2, BUD20, CIN8, CYB2, EHD3, FHL1, FYV4, GPI15, HCS1, HHF1, HHT1, HIR1, HIR2, HMLALPHA1, HTA1, HTB1, HTB2, INO2, INO4, IPI1, IWS1, KAP114, MAM33, MDN1, MRP1, MRP10, MRP13, MRP17, MRP21, MRP4, MRP51, MRPL38, MRPS16, MRPS17, MRPS18, MRPS28, MRPS5, MRPS8, MRPS9, NAM9, NOB1, PEP4, PET123, PRM1, PSD2, PUS2, PUS7, RAD16, RIX1, RSM10, RSM19, RSM22, RSM23, RSM24, RSM25, RSM26, RSM27, RSM7, SMC6, SPT4, SPT6, TEC1, TYS1, UBP10, WRS1, YCR072C, YLR247C, YMR317W, YOL054W, YOR205C, YOR356W | P0019538 | protein metabolism
M237 | IMG1, IMG2, MHR1, MRP20, MRP7, MRPL1, MRPL10, MRPL17, MRPL19, MRPL23, MRPL24, MRPL27, MRPL3, MRPL35, MRPL36, MRPL39, MRPL4, MRPL44, MRPL6, MRPL7, MRPL8, MRPL9, PRP31, RIM2, YER078C, YPL183W-A | P0006412 | protein biosynthesis
M238 | CIS1, ELP6, GRH1, HAP2, HAP3, HAP4, HAP5, IKI1, YGR146C, YLR327C, YNL063W, YPL166W, YPR085C, YRA2 | P0050791 | regulation of physiological process
M239 | AMS1, ATG19, BET3, BET5, GSG1, GYP6, KRE11, KRS1, LAP4, SKN1, SSL1, TFA1, TFA2, TFB1, TFB4, TRS120, TRS130, TRS20, TRS23, TRS31, TRS33, YFL034W, YHR113W | P0050875 | cellular physiological process
M240 | CSL4, DIS3, LRP1, MTR3, RRP4, RRP40, RRP42, RRP43, RRP45, RRP46, RRP6, SKI2, SKI3, SKI6, SKI7, TIR1, YIR035C, YLR345W | P0006402 | mRNA catabolism
M241 | CDC14, CIS1, ELP6, GRH1, HAP2, HAP3, HAP4, HAP5, IKI1, RIM1, RNH1, SPE3, SPE4, VAS1, YGR146C, YLR327C, YNL063W, YPL166W, YPR085C, YRA2 | P0050875 | cellular physiological process
M242 | CUE5, ECM25, EDE1, END3, GAL2, GYP5, HOM6, KEX2, LAS17, LSB3, LSB6, MYO3, MYO5, PCL9, PEP1, RPL10, RVS167, SLA1, SLA2, SQT1, SYP1, UBP7, VRP1, YDR348C, YHM2, YJL045W, YLR243W, YLR422W, YMR295C, YNR065C, YSC84 | P0006810 | transport
M243 | ANB1, BMH2, BOP3, CPD1, CSR2, DIA1, ECM13, FKH1, FRQ1, GOS1, HXT6, HXT7, HYP2, LAP3, MTH1, MYO1, NTH2, NYV1, PEP12, PIK1, PSR1, PSR2, RIM101, RSP5, RTG2, SEC17, SEC18, SEC2, SEC9, SED5, SFT1, SFT2, SLY1, SNC1, SNC2, SNF3, SRO7, SRO77, SSO1, SSO2, SYN8, TLG1, TLG2, UBP12, URE2, VAM3, VAM7, VPS45, VPS68, VTI1, WHI2, YIF1, YIP3, YKT6, YMR144W, YOR385W, ZAP1, ZRT2 | P0006810 | transport
M244 | ARO8, BRR1, BUR2, CBC2, CCC1, FOB1, IFH1, LUC7, MSH4, MTD1, MUD1, NAB3, NAB6, NAM8, NGR1, NRD1, NRP1, PET111, PRP18, PRP40, PRP42, PUB1, PXA1, PXA2, QDR3, REV3, REV7, RSE1, SEC39, SGV1, SLU7, SMB1, SMD1, SMD2, SMD3, SNU56, SNU71, SRL1, STO1, SUT1, VBA1, YBL081W, YDL144C, YHC1 | P0006139 | nucleobase, nucleoside, nucleotide and nucleic acid metabolism
M245 | BCD1, BNA1, BRE1, CYC2, GCD10, GCD14, GLT1, IST2, LGE1, NBP1, RPS18A, RTT102, RTT106, SKG6, SUA7, TEL2, YCL056C, YDL157C, YFL052W, YLR455W, YML108W, YMR315W, YNG1, YNL100W | P0050875 | cellular physiological process

M246 | CKI1, GCR1, GCR2, IRA1, IRA2, PMD1, PRS1, PRS2, PRS3, PRS4, PRS5, RIM11, TRM7, YER163C, YGR277C, YUH1 | P0006139 | nucleobase, nucleoside, nucleotide and nucleic acid metabolism
M247 | ARX1, BRX1, CDC21, CIC1, DBP10, DBP2, DBP9, DPB3, DPB4, DRS1, EBP2, ERB1, FPR4, GAR1, HAS1, IPI3, LHP1, LOC1, LSG1, MAK21, MAK5, MRT4, MUB1, NIP7, NMD3, NOC2, NOC3, NOG1, NOG2, NOP12, NOP15, NOP16, NOP2, NOP4, NOP7, NSA1, NSA2, NUG1, POL2, PUF6, REI1, RLP24, RLP7, RMI1, RPF1, RRP1, RRP14, RRP15, RSA3, SDA1, SPB1, SPB4, SRP101, SRP102, SSF1, SSQ1, TIF4631, TIF4632, UBR2, URB1, VPS73, YDR412W, YEN1, YGR150C, YJL122W, YMR163C, YOR1, YOR227W, YPL009C, YTM1 | P0050875 | cellular physiological process
M248 | BET3, BET5, GSG1, KRE11, TRS120, TRS23, TRS31, TRS33 | P0006888 | ER to Golgi transport
M249 | BEM1, CDC24, CLN3, DDC1, EXO84, FAR1, LEU4, LOT5, LOT6, MEC3, MIR1, MMM1, MRS11, MRS5, NDL1, NSL1, PAC1, RAD17, RGA2, RSR1, SEC10, SEC15, SEC3, SEC5, SEC6, SEC8, SRN2, SSA1, SSA2, STM1, SUV3, TAT1, TIM18, TIM22, TIM54, TIM9, TOS2, TRM10, URA6, VPS9, YBR197C, YCR016W, YDJ1, YHR020W, YLR051C, YMR002W, YNR063W, YOR060C | P0050875 | cellular physiological process
M250 | ARG81, LYS9, TRM11, TRM112, TRM9, YDR140W, YEL043W, YIL151C, YOR164C, YPS3 | P0050875 | cellular physiological process
M251 | DAL80, GZF3, OCA1, POL4, PSY3, RHO5, SHU1, SHU2, SIW14, YBR246W, YCR095C, YDR520C, YHL029C, YLR046C, YNL056W | P0008150 | biological process
M252 | AAD3, CTR3, DAL3, GLN1, HSP104, HXK1, MKS1, TPS1, TPS2, TPS3, TSC10, TSL1, YIL177C, YKL215C, YMR251W | P0050875 | cellular physiological process
M253 | ATP12, ATP14, COX5B, IDP3, MMF1, MMT1, QCR8, YAT1, YGL226W, YIL157C, YLR168C, YMR003W | P0006810 | transport
M254 | AGX1, BCY1, CAT2, CLG1, COX4, COX5A, COX6, COX9, FRM2, MMR1, PCL1, PCL2, PCL5, PCL6, PCL7, PCL8, PHO85, RIM15, SAT4, SKI8, SOR1, SOR2, SPO11, TPK1, TPK2, TPK3, WHI4, YHR033W, YLR177W, YOL075C | P0050875 | cellular physiological process
M255 | APC1, APC11, APC2, APC4, APC5, APC9, CDC16, CDC26 | P0007049 | cell cycle
M256 | BET3, BET5, GSG1, GYP6, KRE11, SKN1, TRS120, TRS130, TRS20, TRS23, TRS31, TRS33 | P0006888 | ER to Golgi transport
M257 | ARR4, CPR2, CPR5, CRH1, CUE4, ERV2, GSF2, INO1, MDM39, MSN4, RMD7, SKG1, YBR014C | P0050875 | cellular physiological process
M258 | ERG25, ERG26, ERG27, ERG28, INP54, OM45, RTN2, SEC11, SHO1, SLH1, SPC1, YER049W, YJL202C, YNL181W | P0050875 | cellular physiological process
M259 | BZZ1, EDC1, KAP122, NUP1, NUP170, PIG2, PNC1, RNT1, RTT103, SLF1, SNO1, SNO2, SNO3, SNO4, SNZ1, SNZ2, SNZ3, TOA1, TOA2, YDR132C, YKR064W, YML131W | P0050875 | cellular physiological process
M260 | ATG26, BUD3, BUD4, CDC46, CDC47, CTF4, DIA2, ELC1, IMG1, IMG2, LST4, MCM10, MCM3, MCM6, MHR1, MNP1, MRH4, MRP20, MRP49, MRP7, MRP8, MRPL1, MRPL10, MRPL13, MRPL16, MRPL17, MRPL19, MRPL20, MRPL23, MRPL24, MRPL25, MRPL27, MRPL28, MRPL3, MRPL35, MRPL36, MRPL39, MRPL4, MRPL44, MRPL51, MRPL6, MRPL7, MRPL8, MRPL9, NCL1, PKH2, PRP31, RAD14, RAD4, RAD7, RIM2, RPP0, RPP1A, RPP1B, RPP2B, SEC61, SKY1, SMP2, SSF2, SSS1, TIM21, TIR3, YDR115W, YER067W, YER078C, YHR087W, YLR287C, YMR210W, YPL183W-A, ZRC1 | P0050875 | cellular physiological process
M261 | CDC123, DEF1, DMA1, DMA2, ECM17, ERG1, GDE1, MET1, MET10, MET14, MET16, PRP12, PXR1, RRP3, SLX1, SOD2, STI1, THG1, TSA2, YBT1, YER077C, YGR266W, YIL039W, YIL082W, YKR051W, YKU80, YLR271W, YNL311C, YPR003C | P0050875 | cellular physiological process
M262 | ECM33, GAL1, GAL3, GAL4, GAL80, MMS1, YGP1 | P0006012 | galactose metabolism
M263 | CIS1, GRH1, HAP2, YGR146C, YNL063W, YPL166W, YPR085C, YRA2 | P0050875 | cellular physiological process
M264 | GOS1, LAP3, NYV1, SEC17, SEC18, SED5, SFT1, SLY1, TLG1, TLG2, VAM3, VAM7, VPS45, VPS68, VTI1, YIF1, YIP3, YKT6, ZRT2 | P0045184 | establishment of protein localization
M265 | CCT8, ECM29, FDH1, FUI1, GON7, HAT2, HIF1, HSM3, LYS12, NAS6, PRE5, RAD23, RPN10, RPN11, RPN12, RPN13, RPN14, RPN2, RPN3, RPN5, RPN6, RPN7, RPN8, RPN9, RPT1, RPT2, RPT3, RPT4, RPT5, RPT6, RTN1, STS1, UBP6, URA7, URA8 | P0050875 | cellular physiological process

M266 | ARP1, ARP10, ASK1, ASN1, ATG13, ATG17, AVO2, BDP1, BSC5, CDC19, CHS1, CMP2, CNN1, COG2, CSI1, CSN9, DAD1, DAD2, DAM1, DIP5, DUO1, GRX5, JNM1, KOG1, KSP1, LDB18, MCK1, MDM36, NIP100, NUF2, NVJ1, PCI8, PNT1, PPH3, PSY2, PYK2, RRI1, RRI2, SLM1, SLM2, SMC5, SNX4, SPC19, SPC24, SPC25, SPC34, SPO71, TID3, TRM2, TRM3, ULP1, URH1, VAB2, VAC8, YAP6, YBL046W, YBR270C, YDR357C, YGL079W, YGL250W, YIH1, YJL160C, YJR008W, YJR011C, YJR024C, YKL061W, YLR419W, YNL086W, YNR068C, ZIP1 | P0050875 | cellular physiological process
