COMPUTATIONAL PREDICTION OF PROTEIN-PROTEIN INTERACTIONS ON

THE PROTEOMIC SCALE USING BAYESIAN ENSEMBLE OF MULTIPLE

FEATURE DATABASES

A Dissertation

Presented to

The Graduate Faculty of The University of Akron

In Partial Fulfillment

of the Requirements for the Degree

Doctor of Philosophy

Vivek Kumar

December, 2011

COMPUTATIONAL PREDICTION OF PROTEIN-PROTEIN INTERACTIONS ON

THE PROTEOMIC SCALE USING BAYESIAN ENSEMBLE OF MULTIPLE

FEATURE DATABASES

Vivek Kumar

Dissertation

Approved:

Advisor: Dr. Dale H. Mugler

Committee Member: Dr. Daniel B. Sheffer

Committee Member: Dr. George C. Giakos

Committee Member: Dr. Amy Milsted

Committee Member: Dr. Daniel L. Ely

Accepted:

Department Chair: Dr. Daniel B. Sheffer

Dean of the College: Dr. George K. Haritos

Dean of the Graduate School: Dr. George R. Newkome

Date: ______

ABSTRACT

In the post-genomic world, one of the most important and challenging problems is to understand protein-protein interactions (PPIs) on a large scale. They are integral to the underlying mechanisms of most of the fundamental cellular processes. A number of experimental methods such as protein affinity chromatography, affinity blotting, and immunoprecipitation have traditionally helped in detecting PPIs on a small scale.

Recently, high-throughput methods have made available an increasing amount of PPI data. However, these data contain a significant amount of erroneous information in the form of false positives and false negatives and show little overlap among PPIs pooled from different methods, which severely limits their reliability. Because of such limitations, computational predictions are emerging to narrow down the set of putative PPIs.

In this dissertation, a novel computational PPI predictor was devised to predict PPIs with high accuracy. The PPI predictor integrates a number of proteomic features derived from biological databases. The features chosen for the purpose of this research were gene expression, gene ontology, MIPS functions, sequence patterns such as motifs and domains, and protein essentiality. While these features have little or no correlation with each other, they share some degree of relationship with the ability of proteins to interact with each other. Therefore, novel feature-specific approaches were devised to characterize that relationship. Text mining and network topology based approaches were also studied. Gold Standard data comprising high-confidence PPIs and non-PPIs were used as evidence of interaction or lack thereof.

The predictive power of the individual features was integrated using Bayesian methods. The average accuracy, based on 10-fold cross-validation, was found to be 0.9396. Since all the features are computed on the proteomic scale, the Bayesian integration yields likelihood values for all possible combinations of proteins in the proteome. This has the added benefit of making it possible to list putative PPIs in decreasing order of confidence, as measured by their likelihood values.

Integration of novel PPIs with other relevant biological information using Semantic Web representation was examined to better understand the underlying mechanisms of diseases and to support novel target identification for drug discovery.

ACKNOWLEDGEMENTS

I am deeply indebted to many people who have contributed to the completion of my research and graduate studies at The University of Akron. Without their support, patience and guidance, this dissertation work simply would not have been possible. It is to them I owe my deepest gratitude.

First and foremost, I would like to thank my advisor, Dr. Dale H. Mugler, for providing me with the opportunity to conduct this interdisciplinary research work under his supervision. I sincerely appreciate his invaluable guidance and unwavering support throughout my graduate studies.

I would like to express my deepest gratitude to the department chair, Dr. Daniel B. Sheffer, for helping me understand the intricacies of statistics, which was vital in laying the foundation for multivariate statistics and machine learning, the tools that will find their use well beyond my doctoral research. I am also grateful to Dr. Amy Milsted for teaching the advanced concepts in molecular biology and to Dr. Richard Laundraville for providing the opportunity to conduct experiments with a variety of wet lab molecular biology techniques. This experience proved integral to my research work in proteomics.

I owe special thanks to my committee members Dr. George C. Giakos and Dr. Daniel L. Ely for their timely guidance and feedback regarding the contents and format of this dissertation.

I am particularly indebted to the president of the university, Dr. Luis M. Proenza, for recognizing my research efforts by providing financial assistance for three years, in addition to the graduate scholarship offered by the department.

Lastly, I would like to thank my family and friends for their unending love, encouragement and support at all stages of my doctoral studies. I will always be indebted to my parents who raised me with a love for science and nature, and encouraged me to live a life of inquiry.

TABLE OF CONTENTS

Page

LIST OF FIGURES ...... xi

CHAPTER

I. INTRODUCTION ...... 1

1.1 Protein-Protein Interactions: An Introduction ...... 1

1.2 Motivation ...... 4

1.3 Objectives ...... 5

1.4 Organization of the Dissertation ...... 6

II. LITERATURE REVIEW ...... 8

2.1 Yeast as Model Organism for Research in Molecular Biology ...... 8

2.2 Types of Protein-Protein Interactions ...... 10

2.3 High-throughput Methods for Detecting PPIs...... 11

2.3.1 Kinetics of Protein-Protein Interaction Assays ...... 12

2.3.2 Yeast Two-Hybrid (Y2H) System ...... 14

2.3.3 Variations of Yeast Two-Hybrid (Y2H) System ...... 19

2.3.4 Protein Fragment Complementation Assays ...... 21

2.3.5 Co-immunoprecipitation ...... 23

2.3.6 Protein Microarrays ...... 25

2.4 PPI Databases ...... 27

2.4.1 Database of Interacting Proteins (DIP) ...... 27

2.4.2 Biological General Repository for Interaction Datasets (BioGRID) ...... 28

2.4.3 Biomolecular Interaction Network Database (BIND) ...... 29

2.4.4 IntAct ...... 30

2.4.5 Molecular INTeraction (MINT) ...... 30

2.5 Computational Approaches to Predict PPIs...... 31

2.6 Protein-Protein Interaction Topology and Prediction ...... 33

2.7 Genomic Sequences and Protein-Protein Interactions ...... 36

2.8 Motifs, Domains and Protein-Protein Interactions ...... 37

2.9 Gene Ontology and Protein-Protein Interactions ...... 40

2.9.1 GO Topology Based Semantic Similarity...... 42

2.9.2 Information Theory Based Semantic Similarity ...... 43

2.9.3 Hybrid Approach Based Semantic Similarity ...... 45

2.10 Gene Expression and Protein-Protein Interactions ...... 46

2.11 Protein Essentiality and Protein-Protein Interactions ...... 50

2.12 Text Mining and Protein-Protein Interactions ...... 52

2.13 Protein-Protein Interaction Prediction using Integrative Approaches ...... 56

III. MATERIALS AND METHODS ...... 58

3.1 Research Hypothesis ...... 58

3.2 ORFs – Interchangeability with Genes and Proteins...... 60

3.3 Proposed PPI Prediction Techniques ...... 63

3.3.1 Gene Annotations from Gene Ontology ...... 64

3.3.2 Functional Categories from MIPS Functional Catalog ...... 66

3.3.3 Gene Expression ...... 68

3.3.4 Genomic Motifs and Domains ...... 70

3.3.5 Essentiality ...... 70

3.4 Gold Standard Datasets ...... 71

3.5 Bayesian Integration ...... 73

3.6 Validation of Predicted PPIs ...... 74

3.7 Cross-validation of Predicted PPIs ...... 75

IV. RESULTS AND ANALYSIS ...... 77

4.1 Gene Ontology Feature Analysis ...... 77

4.2 MIPS Feature Analysis ...... 79

4.3 Gene Expression Feature Analysis ...... 81

4.4 Motif Feature Analysis ...... 84

4.5 Essentiality Feature Analysis...... 85

4.6 Comparison of Features...... 86

4.7 Bayesian Integration of Features and Validation ...... 88

4.8 Cross-validation of Combined Likelihood ...... 89

4.9 Visualization of Predicted PPIs ...... 90

V. APPLICATIONS ...... 127

5.1 Rational Drug Discovery - Promise and Limitations ...... 128

5.1.1 PPI and Target Validation ...... 131

5.1.2 PPI and High Throughput Screening ...... 133

5.1.3 Making Rational Drug Discovery Optimal ...... 135

5.2 PPI - Applications using Graph Theoretic Analysis ...... 137

5.2.1 Graph Theoretic Analysis for Drug Target Identification ...... 138

5.2.2 Graph Theoretic Analysis for Cancer Protein Identification ...... 139

5.3 PPI - Applications in Systems Biology ...... 141

5.3.1 Semantic Web for Network Integration and Analysis ...... 143

5.3.2 PPI and Information Retrieval for Pathway Construction ...... 147

5.3.3 PPI and Understanding the Mechanism of Action in a Disease ...... 148

5.3.4 PPI and Target Identification in Drug Discovery ...... 149

5.3.5 PPI and Comparative Genomics ...... 151

VI. CONCLUSION...... 153

REFERENCES ...... 158

LIST OF FIGURES

Figure Page

1.1 Frequency distribution of number of publications with respect to year of publication for proteome (Left), PPI (Middle) and all publications (Right) ...... 2

1.2 Fraction of all publications (in percent) in the area of proteome (Left) and protein-protein interaction (Right) ...... 3

2.1 Yeast two-hybrid (Y2H) system with reporter gene ...... 14

2.2 Lack of interaction between the bait and prey proteins results in the failure of reporter gene expression...... 15

2.3 Successful interaction between bait and prey proteins results in the expression of reporter gene...... 15

2.4 Y2H system with bait and prey hybrid proteins to detect PPIs. Successful interaction leads to activation of the transcription apparatus, leading to expression of associated reporter gene ...... 16

2.5 Yeast haploid mating types ‘α’ and ‘a’ can be used for bait and prey vectors...... 17

2.6 Three-hybrid system employs an additional molecule, in this case an RNA molecule, to facilitate bait-prey interaction...... 20

2.7 Steps involved in the detection of PPIs using the co-immunoprecipitation technique...... 24

2.8 Functional protein microarray pipeline ...... 26

2.9 DNA microarray pipeline ...... 48

2.10 Parse tree for an example clause - “the interaction between JUN and FOS”...... 54

3.1 Frequency distribution of ORFs with respect to ORF length (Left). Log version of the left histogram to highlight the frequency distribution for higher ORF length values (Right)...... 62

4.1 Frequency distribution of Yeast ORFs with respect to the associated GO terms (Left) and GO terms with respect to the associated yeast ORFs (Right) ...... 91

4.2 Frequency distribution breakdown of GO terms with respect to the Yeast ORFs from figure 4.1 into two histograms: Left (0-100), Right (101-MAX) ...... 92

4.3 Yeast PPI similarity based on GO feature: Proteomic view ...... 93

4.4 Yeast PPI similarity based on GO feature: 100x zoom ...... 93

4.5 ROC curve of PPI prediction based on gene ontology feature...... 94

4.6 Distribution of newly predicted potential PPIs and True Positive/False Positive ratio as a function of likelihood values based on gene ontology feature...... 94

4.7 Frequency distribution of Yeast ORFs with respect to the associated MIPS IDs (Left) and MIPS IDs with respect to the associated yeast ORFs (Right) ...... 95

4.8 Frequency distribution breakdown of MIPS IDs with respect to the Yeast ORFs from figure 4.7 into two histograms: Left (0-100), Right (101-MAX) ...... 96

4.9 Yeast PPI similarity based on MIPS feature: Proteomic view ...... 97

4.10 Yeast PPI similarity based on MIPS feature: 100x zoom ...... 97

4.11 ROC curve of PPI prediction based on MIPS feature...... 98

4.12 Distribution of newly predicted potential PPIs and True Positive/False Positive ratio as a function of likelihood values based on MIPS feature...... 98

4.13 Clustergram of the 2,549 most significant ORFs based on gene expression - Jet Colormap. The tree-graphs on the top and left show the hierarchical clustering of ORFs and experimental conditions respectively, based on the gene expression values...... 99

4.14 Clustergram of the first 1,000 ORFs based on gene expression – Red-Green Colormap. The tree-graphs on the top and left show the hierarchical clustering of ORFs and experimental conditions respectively, based on the gene expression values...... 100

4.15 Hierarchical tree visualizing the clustering of entire yeast transcriptome of 5,666 ORFs ...... 101

4.16 Hierarchical tree of entire yeast transcriptome of 5,666 ORFs from figure 4.15 collapsed down to 1% of its size...... 102

4.17 Hierarchical clustering of a subset of proteome-wide profiles into 16 clusters...... 103

4.18 k-Means clustering of a subset of proteome-wide profiles into 16 clusters...... 104

4.19 Centroids of k-Means clusters from figure 4.18...... 105

4.20 Yeast PPI similarity based on gene expression feature: Proteomic view ...... 106

4.21 Yeast PPI similarity based on gene expression feature: 100x zoom ...... 106

4.22 ROC curve of PPI prediction based on gene expression feature...... 107

4.23 Distribution of newly predicted potential PPIs and True Positive/False Positive ratio as a function of likelihood values based on gene expression feature...... 107

4.24 Frequency distribution of Yeast ORFs with respect to the associated motifs (Left) and motifs with respect to the associated yeast ORFs (Right) ...... 108

4.25 Frequency distribution breakdown of motifs with respect to the Yeast ORFs from figure 4.24 into two histograms: Left (0-10), Right (11-MAX) ...... 109

4.26 Yeast PPI similarity based on motif feature: Proteomic view ...... 110

4.27 Yeast PPI similarity based on motif feature: 100x zoom ...... 110

4.28 ROC curve of PPI prediction based on motif feature...... 111

4.29 Distribution of newly predicted potential PPIs and True Positive/False Positive ratio as a function of likelihood values based on motif feature...... 111

4.30 Yeast PPI similarity based on essentiality feature: Proteomic view ...... 112

4.31 Yeast PPI similarity based on essentiality feature: 100x zoom ...... 112

4.32 ROC curve of PPI prediction based on essentiality feature...... 113

4.33 Distribution of newly predicted potential PPIs and True Positive/False Positive ratio as a function of likelihood values based on essentiality feature...... 113

4.34 Scatter plot of yeast feature similarity matrices ...... 114

4.35 Distribution of normalized PPI similarity values for a uniform set of features ...... 115

4.36 Correlation among five different yeast feature similarity matrices...... 116

4.37 Yeast Gold Standard Positives: Proteomic view ...... 117

4.38 Yeast Gold Standard Positives: 100x zoom ...... 117

4.39 Yeast Gold Standard Negatives: Proteomic view ...... 118

4.40 Yeast Gold Standard Negatives: 100x zoom ...... 118

4.41 Frequency distribution of the combined log(likelihood) values based on the combined yeast features (Left). Log version of the left histogram to highlight the high log(likelihood) values associated with relatively much lower frequencies...... 119

4.42 Gold Set Positive membership as a function of the combined log(likelihood) values for combined yeast features...... 120

4.43 Gold Set Negative membership as a function of the combined log(likelihood) values for combined yeast features...... 120

4.44 ROC curve of PPI prediction, considering all five features simultaneously...... 121

4.45 Distribution of newly predicted potential PPIs and True Positive/False Positive ratio as a function of likelihood values, considering all five features simultaneously...... 121

4.46 ROC curve of PPI prediction, considering all five features simultaneously, in one of the iterations from 10-fold cross-validation...... 122

4.47 Distribution of newly predicted potential PPIs and True Positive/False Positive ratio as a function of likelihood values, considering all five features simultaneously, in one of the iterations from 10-fold cross-validation...... 122

4.48 PPI Predictor accuracy for the combined yeast features across 10 iterations from 10-fold cross-validation ...... 123

4.49 The 99% confidence interval for the 10-fold cross-validation of prediction accuracy ...... 123

4.50 Degree Sorted Circular Layout of top 500 PPIs based on likelihood ratios. Novel PPI predictions, GSP PPIs and GSN PPIs are shown in green, red and blue respectively...... 124

4.51 An alternative view of the PPIs from figure 4.50 using Radial Tree Layout. Novel PPI predictions, GSP PPIs and GSN PPIs are shown in green, red and blue respectively...... 125

4.52 Circular Layout of top 500 PPIs based on likelihood ratios. Red, green and blue colors represent PPI predictions made using 3, 4 and 5 features respectively...... 126

5.1 Rational drug discovery and development pipeline ...... 129

5.2 Activities carried out during rational drug discovery process ...... 130

5.3 Protein Interaction Network for target validation using pathway analysis ...... 132

5.4 HTS followed by lead assembly to cover large chemical space ...... 134

5.5 Mapping of targets (proteins) and drugs (small molecules) on to the pathway ...... 135

5.6 RDF triplet and its relation to the RDF data model ...... 144

5.7 RDF/XML serialization of a simplified version of PPI data-model ...... 145

5.8 Graphical representation of the PPI example generated by RDF Gravity ...... 146

5.9 Pathway development by semantic integration of databases ...... 148

5.10 Protein Interaction Network across species along with gene specific phylogenetic tree ...... 151

CHAPTER I

INTRODUCTION

1.1 Protein-Protein Interactions: An Introduction

In the post-genomic era of today, one of the most important and challenging problems is to understand protein interactions on a large scale [155]. A protein interacts with another protein to carry out a biological function inside or outside a cell. However, such an interaction rarely takes place in isolation. Protein-protein interactions (PPIs) are fundamental to virtually every cellular process. Almost all important molecular processes are carried out by molecular machines that are themselves organized in terms of components consisting of a large number of PPIs. In fact, all protein modifications are catalyzed by protein-modifying enzymes such as protein kinases, protein phosphatases, glycosyl transferases, acyl transferases and proteases. These enzymes depend upon PPIs to regulate basic cellular processes such as cell growth, cell cycle, metabolic pathways, and signal transduction.

PPIs can be broadly classified as protein complexes or transient PPIs, depending on the stability and the duration of the interaction. It is uncommon to see a reference to any major research topic in molecular biology, including DNA replication, transcription, translation, splicing, secretion, cell cycle control, signal transduction, and intermediary metabolism, without identifying the role of protein complexes as essential components [119]. Similarly, a large number of transient PPIs are known to control cellular processes at the molecular level. Transient PPIs also play a key role in the recruitment and assembly of the transcription complex to specific promoters, the transport of proteins across membranes, the folding of native proteins catalyzed by chaperonins, individual steps of the translation cycle, and the breakdown and re-formation of subcellular structures during the cell cycle [119].

Figures 1.1 and 1.2 show the amount of research activity in the area of proteomics in general and protein-protein interaction in particular, based on information retrieved from the PubMed database maintained by the National Center for Biotechnology Information (NCBI) at the U.S. National Library of Medicine (NLM), located at the National Institutes of Health (NIH).

Figure 1.1 Frequency distribution of number of publications with respect to year of publication for proteome (Left), PPI (Middle) and all publications (Right).

In figure 1.1, the left and middle graphs show a steep rise in the number of yearly publications in the areas of proteomics and protein-protein interaction respectively, and the right graph shows the number of all yearly publications. There are no publications in the area of proteomics prior to 1994 because the terms ‘proteome’ and ‘proteomics’ were first coined in 1994. Figure 1.2 shows the relative rise in research in proteomics and protein-protein interaction compared to the sum total of all areas of research in PubMed publications. The figure also shows that approximately 20% of recent publications in the area of proteomics are related to protein-protein interaction.

Figure 1.2 Fraction of all publications (in percent) in the area of proteome (Left) and protein-protein interaction (Right).

1.2 Motivation

Genome sequencing opened up the possibility of identifying genes as the functional units of blueprints underlying the biological processes at a genomic scale. Microarray hybridization experiments have made it possible to study and identify the state of biological processes at any given time or during a time frame that entails critical biological transformations taking place in a cell. However, understanding the cellular processes also requires identifying proteins as the structural and functional units of biological processes. A complete network of PPIs related to a species, also known as the interactome, is key to understanding the structure and behavior of the cellular subsystems. Traditional approaches for studying the biochemistry of proteins have focused on single molecules, one experiment at a time. While this approach is helpful in gaining a fundamental understanding of a few key functions, it is slow and labor intensive, and therefore not practical at the proteomic scale [59].

The large scale PPI experimental studies have proved to be indispensable in creating a systems view of the cell in the context of the complex relationships exhibited by the cellular components and their subunits, specifically, protein complexes and functional protein modules. Towards that goal, large-scale experiments have been undertaken to study PPIs in model organisms such as Saccharomyces cerevisiae [9, 42, 44, 85, 96] and Plasmodium falciparum [87] and complex multicellular organisms such as Caenorhabditis elegans [26, 50], Drosophila melanogaster [54], and Homo sapiens [44]. However, the PPI experimental techniques are prone to significant numbers of false positives and false negatives and to other practical limitations inherent to the biophysical nature of the experimental design. Over the last decade, a number of computational models have emerged to complement the experimental techniques in order to create a more comprehensive view of the protein interaction networks.

Parallel to the evolution of the experimental techniques, most computational approaches for predicting protein interactions have been evolving as data on a wider spectrum of protein features become available and as new paradigms are explored to assess their suitability to learn and model the underlying biological phenomena at the molecular level. These computational approaches, in turn, can be combined through ensemble prediction to take advantage of their individual predictive abilities. With this perspective, the dissertation aims to devise novel feature-specific computational approaches, which are then integrated using a Bayesian probabilistic model.
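The idea of integrating feature-specific predictions can be written compactly. The following is only a sketch of one common formulation, a naive Bayes combination that assumes the features are conditionally independent given the interaction status. The notation ($I$, $f_i$, $L_i$, $O$) is introduced here for illustration and is not taken from the dissertation's own derivation.

```latex
O_{\text{post}}
  \;=\; \frac{P(I \mid f_1, \dots, f_n)}{P(\neg I \mid f_1, \dots, f_n)}
  \;=\; O_{\text{prior}} \prod_{i=1}^{n} L_i ,
\qquad
L_i \;=\; \frac{P(f_i \mid I)}{P(f_i \mid \neg I)} ,
\qquad
O_{\text{prior}} \;=\; \frac{P(I)}{P(\neg I)} .
```

Here $I$ is the event that a protein pair interacts, $f_i$ is the observed value of the $i$-th feature for that pair, and $L_i$ is the likelihood ratio contributed by that feature. Under this assumption, a feature raises the odds of interaction when $L_i > 1$ and lowers them when $L_i < 1$, so weak but independent lines of evidence can reinforce one another.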

1.3 Objectives

This dissertation is an effort to systematically study the key protein features in order to improve PPI prediction. Towards this goal, this dissertation has identified four major objectives:

1. Data management: Identifying, extracting and integrating large datasets from a variety of data sources such as biological databases and individual studies.

2. Studying protein features: Using existing techniques and devising new ones to study the relationships between protein features and protein-protein interactions, based on data from the first objective.

3. Integration: Integrating the relationships obtained from the second objective to harness their complementary strengths in an effort to improve the predictive power of the PPI prediction model as a whole.

4. Validation: Performing Receiver Operating Characteristic (ROC) curve analysis to validate PPI predictions against the test data in a 10-fold cross-validation.
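The integration and validation objectives can be illustrated with a small, self-contained sketch. It is not the dissertation's implementation: it assumes discretized features, a naive Bayes style sum of per-feature log-likelihood ratios estimated from labeled positive and negative pairs, additive smoothing, stratified folds, and the area under the ROC curve (AUC) as the per-fold summary. All function names and the toy data are hypothetical.

```python
import math
import random
from collections import Counter


def fit_log_likelihood_ratios(rows, labels, pseudocount=1.0):
    """For each feature, map each observed bin to log(P(bin | PPI) / P(bin | non-PPI))."""
    tables = []
    for j in range(len(rows[0])):
        pos = Counter(r[j] for r, y in zip(rows, labels) if y == 1)
        neg = Counter(r[j] for r, y in zip(rows, labels) if y == 0)
        bins = set(pos) | set(neg)
        # Additive smoothing keeps ratios finite when a bin is missing from one class.
        n_pos = sum(pos.values()) + pseudocount * len(bins)
        n_neg = sum(neg.values()) + pseudocount * len(bins)
        tables.append({
            b: math.log((pos[b] + pseudocount) / n_pos)
            - math.log((neg[b] + pseudocount) / n_neg)
            for b in bins
        })
    return tables


def combined_score(tables, row):
    """Sum per-feature log-likelihood ratios; bins unseen in training add 0."""
    return sum(t.get(b, 0.0) for t, b in zip(tables, row))


def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes are required")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def stratified_folds(labels, k, seed=0):
    """Split indices into k folds, spreading each class evenly across folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in (0, 1):
        members = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(members)
        for position, i in enumerate(members):
            folds[position % k].append(i)
    return folds


def cross_validated_auc(rows, labels, k=10, seed=0):
    """Fit on k-1 folds, score the held-out fold, and return one AUC per fold."""
    aucs = []
    for test in stratified_folds(labels, k, seed):
        held_out = set(test)
        train = [i for i in range(len(rows)) if i not in held_out]
        tables = fit_log_likelihood_ratios([rows[i] for i in train],
                                           [labels[i] for i in train])
        scores = [combined_score(tables, rows[i]) for i in test]
        aucs.append(roc_auc(scores, [labels[i] for i in test]))
    return aucs


if __name__ == "__main__":
    # Toy, made-up data: two discretized features per protein pair.
    rng = random.Random(1)
    labels = [1] * 100 + [0] * 100
    rows = [("high" if rng.random() < (0.8 if y else 0.3) else "low",
             "shared" if rng.random() < (0.6 if y else 0.2) else "none")
            for y in labels]
    aucs = cross_validated_auc(rows, labels)
    print("mean AUC over folds:", sum(aucs) / len(aucs))
```

Fitting the likelihood ratios only on the training folds is the point of the exercise: the held-out pairs play the role of test data, so the per-fold scores estimate how the combined predictor would behave on pairs it has not seen.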

1.4 Organization of the Dissertation

The remainder of the dissertation is organized in five chapters which are briefly summarized next. Chapter 2 provides a multifaceted overview of the research in the area of protein-protein interaction. It begins by introducing Saccharomyces cerevisiae as the model organism used in this research. Then, it explores the underlying chemical kinetics of the PPI assays used in detecting PPIs. The chapter discusses some of the widely used high throughput PPI identification techniques such as yeast two-hybrid, protein fragment complementation, co-immunoprecipitation and protein microarrays. A number of PPI databases exist which store results of protein interaction studies. They are explored next.

Next, a number of proteomic features have been explored in the context of their relationship with protein-protein interactions. They include protein interaction network topology, protein sequences, protein sequence patterns such as motifs and domains, gene ontology, gene expression, protein essentiality and text mining. The chapter concludes with an overview of the ensemble techniques to integrate multiple predictions.

Chapter 3 discusses the proposed computational techniques for PPI prediction. First, the research objective is established as an alternative to the null hypothesis. The five proposed prediction techniques are studied next, followed by their integration using a Bayesian probabilistic model. Finally, the ensemble prediction results are validated against Gold Standard data in a 10-fold cross-validation. Chapter 4 discusses the results of the techniques studied in the previous chapter.

Chapter 5 discusses the applications of improved PPI prediction. First, the role of protein-protein interaction in rational drug discovery, particularly in the context of drug-target validation, high throughput screening and molecular pathways, is discussed. Second, the topology of the PPI network is examined in the context of identification of novel drug-protein targets for small molecule drug discovery. Third, applications of protein-protein interaction in systems biology are explored, particularly the role of Semantic Web and ontologies in information representation, retrieval and integration with PPI data to help better understand disease etiology, target identification in drug discovery and comparative genomics. Chapter 6 concludes the dissertation by confirming that the objectives of this research were met at the 1% level of significance using a 10-fold cross-validation. It also briefly discusses how this work can be extended in future research.

CHAPTER II

LITERATURE REVIEW

This chapter discusses the existing high-throughput methods and the computational approaches to predict PPIs. The chapter begins with a brief introduction of the importance of yeast as a model system in proteomic research. Then, it briefly discusses the broad categories of PPIs followed by a review of the large scale PPI detection experiments and databases relevant to this dissertation. The rest of the chapter focuses on a variety of computational methods which have co-evolved to overcome the limitations of the high-throughput PPI detection methods. Among them, the most prominent approaches are based on the following features: network topology, genomic sequences, protein domains, gene ontology, gene expression, protein essentiality, protein MIPS (Munich Information Center for Protein Sequences) functional classification and text mining. The chapter ends with a brief review of the computational approaches to integrate heterogeneous data sources.

2.1 Yeast as Model Organism for Research in Molecular Biology

Yeast is considered one of the oldest domesticated species, with records of its use in brewing dating back to 6000 B.C. S. cerevisiae is unique among the roughly 700 species of the yeast family (a subgroup of more than 700,000 different fungi species) for its tremendous impact in the food and drug industry. It was first introduced in genetics as an experimental organism by O. Winge in the mid-1930s, at a time when Drosophila, corn and Neurospora were the principal organisms of genetic research [127]. S. cerevisiae has been used as an experimental system in molecular biology research since the 1960s [42].

In the early 1980s, human hepatitis B vaccine was first produced using recombinant yeast cells [99]. In 1996, S. cerevisiae became the first eukaryotic organism to have its complete genome sequenced [36, 55]. Once again, in the post-genomic era, S. cerevisiae has emerged as a model organism for research in functional genomics and proteomics.

There are a number of reasons why S. cerevisiae is ideally suited as a model organism for molecular biology research in general and for this dissertation in particular.

1. It is a eukaryotic organism and is also unicellular, which makes it easy to grow in well-defined media with complete control over the environmental parameters. At the same time, its biology is in many ways similar to that of humans and other metazoans.

2. It supports rapid growth as dispersed cells, replica plating, mutant isolation and a complex DNA transformation system [53].

3. It is non-pathogenic, which makes it safe to handle in routine laboratory experiments.

4. Unlike many other micro-organisms, it is available in two forms, diploid and haploid, which are viable in the presence of a variety of markers. Therefore, it is easy to conduct recessive mutation experiments on haploids and complementation tests on diploids.

5. Since integrative recombination of transforming DNA in yeast happens through homologous recombination, cloned yeast DNA along with foreign sequences on a plasmid can be easily directed to a given location in the genome [137].

6. Each of its more than 6,000 ORFs has been knocked out one at a time and tagged so that they can be studied in parallel.

In addition, yeast cells carry out a number of basic functions which are significantly similar to those in higher eukaryotes, particularly humans and other metazoans [42]. Some of these are as follows:

• Biosynthetic pathways and their regulation

• Cell division and cell cycle

• DNA replication, recombination and repair

• Transcriptional regulation and activation

• Signal transduction pathways

• Stress responses

• Nuclear-mitochondrial interactions

• Regulatory mechanism mediated by chromatin

2.2 Types of Protein-Protein Interactions

PPIs differ in their biochemical nature based on their composition, affinity, and lifetime of association. Nooren et al. [109] broadly categorize PPIs based on three different criteria:

1. Homo-oligomeric complexes versus hetero-oligomeric complexes: based on whether PPIs occur between identical or non-identical chains.

2. Obligate versus non-obligate complexes: based on whether the components of the complexes are independently and structurally stable in vivo.

3. Transient versus permanent complexes: based on the lifetime of the complex. A permanent interaction is very stable and exists only as a complex, while a transient interaction associates and dissociates in vivo.

While such categorization serves an important role in broadly understanding PPIs, many PPIs exist on a broad continuum across the categories described above. Furthermore, an interaction that is primarily transient in vivo may become permanent under certain cellular conditions.

At a molecular level, it is the non-covalent contacts among residue side chains that form the basis of protein folding, protein assembly and PPIs, which in turn facilitate many of the biological functions. While protein interactions leading to protein folding or assembly of folded chains into multi-chain proteins are permanent, others such as the receptor-ligand interactions or the signal transduction interactions are transient. Ofran et al. [110] have studied the differences in the nature of these interactions as a function of their sequence composition and residue-residue contact preferences of their mutual interfaces.

2.3 High-throughput Methods for Detecting PPIs

A number of experimental methods exist which have traditionally helped in detecting PPIs on a small scale. These methods include protein affinity chromatography, affinity blotting, immunoprecipitation and cross-linking [119]. These approaches are fundamentally based on the idea of binding assays and ‘Incubate-Wash-Detect’. The target is set up as the interaction partner fused to a solid support while the ligand is added in solution for the incubation to take place between the interacting partners. Unbound ligand is washed away and the amount of the bound ligand is assessed using fluorescence, radioactivity or some other measurable indicator. There are four essential requirements for the end-point measurement to be meaningful [19].

1. The target and ligand must be allowed to incubate for long enough for them to interact and reach steady-state equilibrium in the interaction chemistry.

2. In order to prevent any inadvertent dissociation and subsequent release of the ligand bound to the target, the solid support must be washed quickly.

3. For the dose-response curves of the interaction to be valid, ligand concentration must be similar to the steady-state affinity of the target-ligand interaction.

4. The amount of bound ligand must be measured precisely.

2.3.1 Kinetics of Protein-Protein Interaction Assays

It is common to find variations with respect to the assay requirements and assay setup in different laboratories. Also, unlabeled ligands are often used in combination with the labeled ligands to assess the specificity of the interaction between the labeled ligand and its target. An understanding of the effects of parameters, such as concentration of ligands, target, incubation period and chemical equilibrium factors, on the interaction signal is important when comparing experiments conducted under differing circumstances.

For the sake of simplicity, consider the scenario of a monovalent (1:1) interaction between interaction partners A and B, where A is a ligand, B is a target and AB is an interaction (a transient or stable complex) [19].


\[ A + B \rightleftharpoons AB \qquad (1) \]

Using the basics of chemical kinetics, the strength of the interaction can be described by the equilibrium dissociation constant, $K_D$, where [A], [B] and [AB] are the molar concentrations of free ligand A, free target B and the complex AB respectively.

\[ K_D = \frac{[A][B]}{[AB]} \qquad (2) \]

The rate of formation of complex AB at any moment can be expressed as a function of the concentration levels of the free reactants, the association rate constant $k_a$ (M$^{-1}$ s$^{-1}$) and the dissociation rate constant $k_d$ (s$^{-1}$):

\[ \frac{d[AB]}{dt} = k_a [A][B] - k_d [AB] \qquad (3) \]

At equilibrium, [AB] reaches a steady state and therefore,

\[ 0 = k_a [A][B] - k_d [AB] \]

Combining equations (2) and (3),

\[ \frac{k_d}{k_a} = \frac{[A][B]}{[AB]} = K_D \qquad (4) \]

In assays, it is customary to have significantly higher levels of a ligand relative to the target in order to avoid depletion, hence [A] >> [B]. Assuming the total amount of target is $[B_{\mathrm{total}}]$, the free amount of target being [B] and the bound amount of B is [AB],

\[ [B] = [B_{\mathrm{total}}] - [AB] \qquad (5) \]

Combining equations (2) and (5),

\[ [AB] = [B_{\mathrm{total}}] \, \frac{[A]}{[A] + K_D} \]

\[ \frac{[AB]}{[B_{\mathrm{total}}]} = \frac{[A]}{[A] + K_D} \qquad (6) \]


Thus the relative amount of bound target [AB] with respect to the total amount of target $[B_{\mathrm{total}}]$ is dependent on the ligand concentration and the equilibrium dissociation constant $K_D$, and not on the excess of ligand concentration compared to the target concentration.
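The relationship in equation (6) can be checked numerically. The following is a minimal sketch, not part of the original analysis: the function names and the rate constants used in the test values are hypothetical, and the simulation assumes that the ligand is in such excess that [A] stays constant.

```python
def fraction_bound(ligand_conc, k_d_eq):
    """Equation (6): [AB]/[B_total] = [A] / ([A] + K_D).

    ligand_conc and k_d_eq must be in the same concentration units (e.g. M).
    """
    if ligand_conc < 0 or k_d_eq <= 0:
        raise ValueError("concentration must be >= 0 and K_D must be > 0")
    return ligand_conc / (ligand_conc + k_d_eq)


def simulate_bound_complex(ligand_conc, b_total, k_on, k_off, dt, steps):
    """Forward-Euler integration of equation (3), starting from [AB] = 0.

    Assumption: ligand is in large excess, so [A] is treated as constant
    and free target is [B] = [B_total] - [AB], as in equation (5).
    """
    ab = 0.0
    for _ in range(steps):
        free_b = b_total - ab
        ab += dt * (k_on * ligand_conc * free_b - k_off * ab)
    return ab


# Hypothetical values: k_on = 1e6 M^-1 s^-1, k_off = 1e-3 s^-1, so K_D = 1 nM.
k_on, k_off = 1e6, 1e-3
k_d_eq = k_off / k_on
ab_final = simulate_bound_complex(1e-9, 1e-11, k_on, k_off, dt=1.0, steps=10000)
```

With the ligand concentration equal to $K_D$, both the closed-form expression and the simulated steady state give half of the target bound, regardless of how large the ligand excess over the target is.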

The assay-based methods for PPI have generally not been scalable. However, new high-throughput techniques have emerged over the past decade. Some of the most widely used large scale PPI approaches are described next.

2.3.2 Yeast Two-Hybrid (Y2H) System

The yeast two-hybrid system is used to identify and/or validate novel PPIs. This high-throughput genetic process was originally developed by Fields and Song in yeast cells [45]. It is based on the principle that the yeast transcription factor GAL4 can activate a reporter gene only when its DNA-binding domain (DBD) and its transcription activation domain (AD) are physically linked. Y2H systems are designed to detect pairs of interacting proteins in yeast in vivo [14, 70]. The two proteins being tested for interaction are fused with a DBD and an AD to form the bait hybrid and the prey hybrid respectively (see figure 2.1).

Figure 2.1 Yeast two-hybrid (Y2H) system with reporter gene.


Absence of interaction between the bait and prey proteins results in disruption of

the transcription apparatus, resulting in the failure of reporter gene expression (see figure

2.2). On the other hand, a successful interaction between the bait and prey hybrid proteins

results in the formation of functional transcription factor leading to the transcription of

the reporter gene (see figure 2.3). The transcriptional regulator used by Fields and Song

[45] consisted of DBD at position 1-147 residues on bait and AD at position 768-881

residues on prey. However, many other variants of the transcription factor have been

designed.

Figure 2.2 Lack of interaction between the bait and prey proteins results in the failure of reporter gene expression.

Figure 2.3 Successful interaction between bait and prey proteins results in the expression of reporter gene.


As shown in figure 2.4, the bait vector consists of DBD and the bait gene, while

the prey vector consists of AD and the prey gene. Successful protein-protein interaction

leads to the expression of the reporter gene. LacZ, as reporter gene, causes yeast cells to

turn blue on X-gal plates. His3 reporter gene, when expressed, manufactures histidine

thus allowing the yeast colonies to survive in a histidine deficient (His -) medium.

Figure 2.4 Y2H system with bait and prey hybrid proteins to detect PPIs. Successful interaction leads to activation of the transcription apparatus, leading to expression of associated reporter gene.

Yeast cells are ideally suited for Y2H system since two different haploid mating types, for example ‘ α’ and ‘a’, can be grown to have two different vectors – bait and prey

(see figure 2.5). The mating of the two haploid cells results in a diploid cell which


contains both vectors. Even though the Y2H system can be used for testing individual

PPIs, it is most useful when screening PPIs on a proteomic scale. Assuming that there are

around 6,000 proteins in S. cerevisiae, this would necessitate an examination of 6,000 x

6,000 combinations, in which every protein is tried both as bait and as prey.

Figure 2.5 Yeast haploid mating types ‘α’ and ‘a’ can be used for bait and prey vectors.

First, each open reading frame (for details on open reading frames, see section 3.2) in the genome is amplified by Polymerase Chain Reaction (PCR) and cloned into two different vectors, one carrying the DBD and the other carrying the AD of the GAL4 transcription factor. The vector with multiple cloning site downstream of GAL4-DBD will result in a 3’ fusion of GAL4-DBD to the bait protein. Similarly, the vector with


multiple cloning site upstream of GAL4-AD will result in a 5’ fusion of GAL4-AD to the

prey protein. In an array-based Y2H system [23], the prey vectors are transformed into

haploid yeast cells individually and arranged into agar plates with 96 or 384 wells.

Haploid bait cells are spotted on the top of each array position resulting in diploid cells

which are capable of expressing both the prey and bait proteins. This process is repeated

for all combinations using a robot. Using reporter enzymes, a potential PPI can be

identified to a given array position.
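The scale of such a proteome-wide screen can be estimated with a short calculation. The sketch below is illustrative only: it assumes one prey strain per array position and a full all-against-all design, and the function names are hypothetical.

```python
import math


def pairwise_tests(n_proteins):
    """Number of bait-prey combinations when every protein is used
    both as bait and as prey (n x n design)."""
    return n_proteins * n_proteins


def plates_per_bait(n_preys, wells_per_plate):
    """Number of array plates needed to test one bait against all preys,
    assuming one prey strain per array position."""
    return math.ceil(n_preys / wells_per_plate)


total_tests = pairwise_tests(6000)
plates_96 = plates_per_bait(6000, 96)
plates_384 = plates_per_bait(6000, 384)
```

For roughly 6,000 yeast proteins this gives 36 million combinations, which is why the array-based procedure relies on robotic handling.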

Roger Brent et al. developed a LexA-based Y2H system [56]. This system

consisted of DBD derived from the prokaryotic LexA protein, which is in fact a

transcriptional repressor in the presence of LexA operators. The AD was derived from an

88-residue peptide (B42) from E. coli. The transcription of the reporter gene is

detected by the color change in the growth medium of the yeast cells. A subsequent

variant of the LexA system employs a GAL4-AD or the activation domain of the herpes simplex virus

VP16 protein as AD in Y2H [161].

Similar to transcription factor variants, a number of reporter gene variants exist in

the Y2H system. They are integrated into the yeast genome under the control of artificial

promoter constructs which consist of an upstream activation sequence (UAS) and a

TATA box (5’-TATAAA-3’ or a variant). One of the common reporter genes LacZ

encodes β-galactosidase, whose activity can be quantitatively assessed using a variety of assays. However, it is not well suited for Y2H based on library screening. There also exist reporter genes for metabolic enzymes involved in the metabolism of histidine (His3), leucine (Leu2), adenine and uracil (Ura3). For instance, successful interaction of the proteins of interest will lead to expression of a reporter enzyme such as His3, which will enable the


Y2H yeast cells to survive on a histidine deficient medium. Using multiple reporter genes

with their specific promoters in the same Y2H system can reduce the possibility of false

positives.

Since Y2H is an in vivo technique, transient and unstable interactions can be detected with greater accuracy. It enables a highly sensitive detection of PPIs and can also be used to narrow down the protein regions mediating the interactions [71].

However, membrane proteins, when localized in the nucleus, misfold or do not fold the way they do in their native subcellular locations. In addition, cytoplasmic proteins do not undergo appropriate post-translational modifications in the nucleus. Therefore, this approach is better suited to protein interactions that take place inside the nucleus. Baits or preys that show an unusually high number of interactions (generally more than 25) are prone to false positive errors. Some proteins may also activate the reporter gene on their own, without an interacting partner, which leads to additional false positives.

2.3.3 Variations of Yeast Two-Hybrid (Y2H) System

There are several variations of the Y2H system which differ somewhat in their underlying principle. Toxic effects or steric problems associated with large proteins often make protein interaction more difficult to achieve, leading to false negatives. Some proteins bind with RNA or additional small molecules and modify their conformation prior to interacting with other proteins. As shown in figure 2.6, the RNA three-hybrid system addresses some of these concerns by allowing the two fusion proteins to interact via an intermediary RNA molecule in order to initiate reporter gene transcription [96].


Figure 2.6 Three-hybrid system employs an additional molecule, in this case an RNA molecule, to facilitate bait-prey interaction.

Sos Recruitment System (SRS) offers an alternative to the conventional Y2H in

identifying PPIs taking place in the cytoplasm instead of the nucleus [12]. It involves

recruiting hSos, a Ras guanyl nucleotide exchange factor (RasGEF), to the plasma

membrane so that it can activate Ras. However, localization of hSos to the

plasma membrane requires an interaction between two fusion proteins – hSos fused to a

protein of interest (bait) and the target protein (prey) fused to the plasma membrane

localization signal. Thus Sos recruitment system is based on the dependence of the Ras

signaling pathway on the protein-protein interaction for its activation [20]. Ras

Recruitment System (RRS) further improves upon the effectiveness of SRS in detecting

PPIs [15].

There also exist bacterial two hybrid (B2H) systems based on the principle of

reconstitution of adenylate cyclase [80] and another one based on the transcription

activation of reporter gene by RNA polymerase recruitment assisted by protein-protein

interactions [77]. Similarly, there are mammalian two hybrid (M2H) systems based on


the principle of activation of cytokine signaling [40] and another one based on the

reconstitution of active transcription factor assisted by protein-protein interactions [93].

2.3.4 Protein Fragment Complementation Assays

These are protein complementation techniques that are based on the principle of co-expression of two hybrid fusion proteins to identify PPIs. In essence, the gene encoding the reporter enzyme is split into two or more fragments. One fragment is fused to the gene for the bait and another one to the gene for a potential prey, thus resulting in two types of plasmids. When the cell is co-transfected with both plasmids, the two fusion proteins are expressed together; an interaction between bait and prey brings the fragments together and restores the enzymatic activity of the reconstituted reporter enzyme [154].

The reporter gene fragments are cleaved such that the newly created N- and C-

termini are located on the same face of the parent protein. This obviates the need for long

peptide linkers between the fragments in the reconstituted protein and prevents the

disruption of orientation-dependence of the PPI being investigated. The reporter protein

used in a complementation assay should ideally satisfy the following requirements [102]:

1. It is relatively small and monomeric.

2. It is well studied with respect to its structure and function.

3. It can be measured easily with simple assays in vivo and in vitro .

4. It can be over-expressed in toxic environment in both eukaryotes and prokaryotes.

The common protein complementation approaches essentially vary based on their

reporter activity and are summarized as follows:


• DHFR-based Survival Assay: Pelletier et al. devised an oligomerization-assisted

complementation of fragments of an enzyme murine dihydrofolate reductase

(mDHFR) [116, 125]. They fused the N- and C- terminal fragments of mDHFR to

oligomerizing domains of GCN4 leucine zipper sequences and expressed the

fragments in an E. coli culture in which the endogenous DHFR activity was

inhibited by trimethoprim. Reconstitution of active mDHFR from the co-expressed fragments was

demonstrated by the restored growth of the inhibited bacterial colonies. This principle

can be used to detect PPIs since the reporter enzyme DHFR, when reconstituted

by the PPI, allows the cell culture to grow in the presence of a selective agent

such as methotrexate, a cell growth inhibitor [116, 125].

• Bimolecular Fluorescence Complementation (BiFC) Assay: In this approach,

protein-protein interaction leads to the reconstitution of a functional fluorophore,

which serves as the reporter, from two non-fluorescent protein fragments [63].

• SplitTEV Protease Complementation Assay: Protein-protein interaction leads to

formation of an active TEV (Tobacco Etch Virus) protease which activates the

release of membrane-bound transcription factor [165].

• Split-ubiquitin Complementation Assay: Protein-protein interaction leads to

reassembly of ubiquitin (Ub) which is also fused to a reporter protein on its C-

terminus. Fully intact Ub is susceptible to cleavage at its C-terminus by the Ub

specific proteases. Cleaving releases the reporter protein thus indicating the

presence of the PPI [75, 149].


2.3.5 Co-immunoprecipitation

In this approach, individual proteins are tagged and then used to isolate protein complexes, which are studied using mass spectrometry. This technique is also known as the co-precipitation/mass-spectrometry approach. Figure 2.7 summarizes the steps involved in detection of PPIs using this approach [85]. These are:

1. An affinity tag is attached to a target protein (bait). A generic tag such as FLAG

tag (an octapeptide with a sequence: N-DYKDDDDK-C) can be used as affinity

tag [37, 62].

2. Bait vector is transfected into cell culture to allow the tagged bait protein to bind

with one or more prey proteins resulting in bait complex.

3. Sepharose beads fused with a number of protein A tags are incubated with anti-

FLAG antibodies.

4. Antibodies and the lysed cell culture are allowed to incubate together resulting in

the binding of the bait complex with the antibodies on the beads.

5. The bait proteins along with the associated proteins shared by the common protein

complex are precipitated on an affinity column specific to the given tag, for

example, FLAG tag.

6. The protein complexes are separated by using one-dimensional gel

electrophoresis, more specifically SDS-PAGE (Sodium Dodecyl Sulfate – Poly

Acrylamide Gel Electrophoresis).

7. Proteins are excised from the gel, digested with trypsin, and analyzed by mass

spectrometry.


Using co-immunoprecipitation technique, Gavin et al. [50] identified 1,440 proteins in 232 protein complexes in yeast. They were able to assign a new function to one or more proteins in 91% of the complexes. Using a similar approach, Ho et al. [61] started with an initial set of 725 bait proteins and identified 3,617 PPIs among 1,578 proteins. Using a set of 86 bait proteins related to DNA-damage response, they were able to create an almost complete picture of the DNA-damage response network in yeast.

However, both groups observed a number of false positives in their interactions and failed to identify many of the well known PPIs.

Figure 2.7 Steps involved in the detection of PPIs using the co-immunoprecipitation technique.


2.3.6 Protein Microarrays

Protein microarrays have recently emerged as a multiplex approach for a number of proteomic applications including identification of PPIs, protein kinase substrates, targets of biologically active small molecules, transcription factor protein activation and protein-phospholipid interactions. Protein microarrays can be broadly classified into three categories [59]:

• Analytical Microarrays: They are used to profile complex mixture of proteins (for

example, from disease tissue sample) for differential expression or clinical

diagnostics. A library of antibodies, aptamers or affibodies is arrayed on the glass

slide to probe the proteins from tissue sample. The proteins are profiled for

binding affinities, specificities and protein expression levels.

• Reverse Phase Microarrays: They are used to identify altered proteins of interest,

for example, post-translational protein modifications altered by a disease [146].

The lysate from various tissue samples is arrayed onto a nitrocellulose slide using

a contact pin microarrayer and the slide is probed with antibodies against the

protein of interest. Detection is made with chemiluminescent, fluorescent or

colorimetric assays.

• Functional Microarrays: They are used to study biochemical activities at

proteomic scale, including protein interactions such as protein-protein, protein-

DNA, protein-RNA, protein-phospholipid, and protein-small molecule [60, 169].

Full-length functional proteins or protein domains are arrayed on a single chip to

be probed by the interacting molecules.


The proteins can be oriented on the slide surface in a random or uniform manner.

Affinity tag surfaces are used for uniform orientation. On the other hand, coating the glass surface with nitrocellulose, gel pads or poly-L-lysine creates a random orientation of proteins on the array [9, 27, 84, 152, 170]. Functional protein microarrays are of particular significance for this dissertation because of their PPI detection capability. A step-wise workflow of functional microarrays [31] is shown in figure 2.8.

Figure 2.8 Functional protein microarray pipeline.


First, a protein library of the yeast proteins fused with Glutathione-S-Transferase

(GST) or His6 tags is prepared. GST tags bind to slide via glutathione and His6 tags bind

via nickel (Ni). The fusion constructs are expressed under the control of GAL1 (galactose

inducible) or CUP1 (copper inducible) promoters. As seen in the figure, His6 tags on the

library protein bind to Ni ions on the glass slide. The slide is incubated with the probe

molecules (for example, calmodulin proteins which are bound to calcium or

phospholipids) fused with biotin to allow for protein interaction to take place. The bound

proteins are detected by adding streptavidin conjugated to a fluorescent dye such as Cy3.

2.4 PPI Databases

There exist a number of specialized biological databases which store biological

interactions, particularly PPIs. These databases can be used not only to study an

individual PPI but also to examine large scale regulatory and signaling pathways and

protein interaction networks at the cellular level. PPI databases can be analyzed using

network visualization and data integration tools such as Cytoscape [136, 142]. Cytoscape

has an extensible open software architecture which is implemented on the Java platform.

This permits dynamic addition of specialized plugins that can provide additional

functionalities such as interaction network analysis, molecular profiling, visualization,

and support for databases.

2.4.1 Database of Interacting Proteins (DIP)

DIP [130] catalogs the experimentally validated PPIs across 372 different species.

In addition to the manually curated data, it stores computationally derived data based on


the PPI networks extracted from its high quality, core DIP dataset. There are a total of

71,276 PPIs derived from 69,471 distinct experiments describing interactions among

23,201 proteins. For Saccharomyces cerevisiae, it stores 23,860 PPIs derived from

16,444 experiments relating 5,051 proteins. It also contains PPI data for Homo sapiens,

Drosophila melanogaster, Escherichia coli, Caenorhabditis elegans, Helicobacter pylori

and other well known model organisms.

2.4.2 Biological General Repository for Interaction Datasets (BioGRID)

BioGRID is a comprehensive and curated online repository of protein-protein and genetic

interactions. Initially, it was set up with database partitions corresponding to different

species but the current version provides an integrated data output through its web user-

interface thus allowing investigations across multiple species simultaneously. The

physical and genetic interaction data is extracted from the literature and compiled by the

BioGRID curation team. BioGRID contains two types of interactions [150]:

• Raw Interactions: A raw interaction is uniquely identified by the interactors, the

directionality of interaction, experimental system and the primary publication.

Therefore, X → Y and Y → X are considered two different interactions.

• Non-Redundant Interactions: A non-redundant interaction is uniquely identified

by the interactors alone, regardless of the directionality of interaction,

experimental system and the primary publication.
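The distinction between the two interaction types can be illustrated with a small sketch. It is not BioGRID's own code or file format: the record layout (interactor A, interactor B, experimental system, publication) and the sample values are hypothetical, chosen only to mirror the definitions above.

```python
def non_redundant(raw_interactions):
    """Collapse raw interaction records to unordered interactor pairs.

    Each raw record is a tuple (interactor_a, interactor_b, system, publication).
    Direction, experimental system and publication are ignored, so
    X -> Y and Y -> X map to the same non-redundant interaction.
    """
    return {frozenset((a, b)) for a, b, _system, _publication in raw_interactions}


# Hypothetical raw records
raw = [
    ("X", "Y", "Two-hybrid", "pub1"),
    ("Y", "X", "Two-hybrid", "pub1"),           # reverse direction: distinct raw record
    ("X", "Y", "Affinity Capture-MS", "pub2"),  # different system and publication
    ("X", "Z", "Two-hybrid", "pub1"),
]

raw_count = len(set(raw))
unique_pairs = non_redundant(raw)
```

Here four raw interactions reduce to two non-redundant interactions, X-Y and X-Z.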

As of version 3.1.79 (August 2011), BioGRID has 271,151 raw interactions and

181,830 non-redundant interactions derived from 10,305 unique publications covering


6,107 unique proteins for S. cerevisiae. Similarly, BioGRID has 405,680 raw interactions

and 279,872 non-redundant interactions derived from 27,563 unique publications

covering 34,281 unique proteins for 50 different species [1].

2.4.3 Biomolecular Interaction Network Database (BIND)

The BIND repository stores molecular interactions, molecular complexes and

pathways with the goal of providing insight into cellular development and disease

pathogenesis at the molecular scale [16]. Molecular interactions can be among proteins,

DNAs, RNAs, ligands, molecular complexes, small molecules or even photons. The

molecular interaction record also provides information on aliases, sub-cellular locations,

species, and sequence and experimental conditions related to an interacting molecule.

The pathway record specifies the network of interactions related to a cellular process. It

also records phenotypes and diseases related to the pathway. It includes atomic level

information related to the chemical reactions, photochemical activation and

conformational changes in the interactions.

In addition to storage and query capabilities, BIND also provides a variety of

data-mining and visualization tools [15]. For instance, BIND provides a graphical

analysis tool to study the functional domains involved in protein interactions. Similarly,

BIND lets researchers partition a protein interaction network into specific subsets using a

clustering tool. It allows users to map pathways across multiple species and can be used

to produce data to simulate the kinetics of a given pathway. Many of these functionalities

are possible because BIND is able to abstract out the complexity of the underlying


biological data to a form that can be subject to graph theory methods and other data-

mining algorithms.

2.4.4 IntAct

IntAct is a molecular interaction database and toolkit for storage, analysis and

presentation of protein interactions. Interaction data is derived from the literature or from

direct data uploads by expert curators based on a deep annotation model capturing

experimental details [10]. IntAct uses a number of controlled vocabularies, the main one

being the Molecular Interaction ontology of the Proteomics Standard Initiative (PSI-MI),

to describe experimental details, binding sites, protein tags and mutations [3]. IntAct web

interface supports analyzing interaction networks in the context of gene ontology

annotation [13] or InterPro [66] domains of the interacting proteins. It also supports a

web service to retrieve interaction networks in an XML (Extensible Markup Language)

format.

As of August 2011, IntAct consists of 57,665 proteins that participate in 183,026 interactions and complexes [3]. The n-ary interactions in complexes can be expanded to 268,577 binary interactions based on the spoke model.
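The spoke expansion of an n-ary interaction can be sketched as follows. This is an illustration rather than IntAct's implementation; the function names are hypothetical, and the matrix expansion (all pairs within the complex) is shown only for contrast.

```python
from itertools import combinations


def spoke_expansion(bait, preys):
    """Spoke model: the bait is paired with every co-purified prey."""
    return [(bait, prey) for prey in preys]


def matrix_expansion(members):
    """Matrix model (for contrast): every pair of complex members is linked."""
    return list(combinations(members, 2))


# A hypothetical complex purified with bait A and preys B, C, D
spoke_pairs = spoke_expansion("A", ["B", "C", "D"])
matrix_pairs = matrix_expansion(["A", "B", "C", "D"])
```

For a complex with n members, the spoke model yields n - 1 binary interactions, whereas the matrix model yields n(n - 1)/2, so the spoke model is the more conservative expansion.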

2.4.5 Molecular INTeraction (MINT)

MINT is a public repository for molecular interaction records extracted by expert curators from experiments published in peer-reviewed journals [26]. It primarily focuses on experimentally verified PPIs (including both direct and indirect relationships). PPI detection methods include variations of two hybrid approach, peptide and protein arrays,


variations of immunoprecipitation approach, tandem affinity purification, X-ray

crystallography and fluorescence microscopy among others [5]. Additional annotations

such as kinetic and binding constants and interacting domains are stored, if available.

Genetic or computationally inferred PPIs are not included. MINT uses the PSI-MI

standards for annotations and representation of molecular interactions. Information

retrieval from MINT supports PSI1.0-XML and PSI2.5-XML in addition to flat file

format. The entire database can be freely accessed interactively using online queries or as

a batch process using FTP (File Transfer Protocol).

MINT also includes HomoMINT [117], VirusMINT [28] and DOMINO [25].

HomoMINT extends experimentally verified PPIs from model organisms to the orthologous proteins in Homo sapiens. VirusMINT stores all PPIs between human and viral proteins with the goal of integrating them with HomoMINT. DOMINO stores PPIs mediated by protein recognition modules or domains.

2.5 Computational Approaches to Predict PPIs

Large scale high-throughput methods have made available an increasing amount

of PPI data in the last decade. However, this data contains a significant amount of

erroneous information in the form of false positives and false negatives [85]. von Mering

et al. [162] estimated that more than half of the interactions in the high-throughput PPI data they studied

were spurious. They based this conclusion on the linkage frequency, that is,

the frequency with which PPIs link proteins of the same functional categories. Their

reference dataset based on Munich Information Center for Protein Sequences (MIPS) and


Yeast Proteome Database (YPD) showed a linkage frequency of more than 80% whereas

the high-throughput data had a linkage frequency of around 21%.
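The linkage frequency described above can be computed along the following lines. This is a minimal sketch under stated assumptions: the function name and data layout are hypothetical, a pair counts as linked when the two proteins share at least one functional category, and pairs involving an unannotated protein are skipped, which may differ from the procedure used in [162].

```python
def linkage_frequency(interactions, categories):
    """Fraction of annotated interactions linking proteins that share
    at least one functional category.

    interactions: iterable of (protein_a, protein_b) pairs
    categories:   dict mapping protein -> set of functional categories
    """
    linked = 0
    total = 0
    for a, b in interactions:
        cats_a = categories.get(a)
        cats_b = categories.get(b)
        if not cats_a or not cats_b:
            continue  # assumption: skip pairs with an unannotated protein
        total += 1
        if cats_a & cats_b:
            linked += 1
    return linked / total if total else 0.0


# Hypothetical toy data
toy_categories = {
    "P1": {"transcription"},
    "P2": {"transcription", "cell cycle"},
    "P3": {"metabolism"},
    "P4": {"cell cycle"},
}
toy_interactions = [("P1", "P2"), ("P1", "P3"), ("P2", "P4"), ("P3", "P4"), ("P1", "P9")]
toy_frequency = linkage_frequency(toy_interactions, toy_categories)
```

In the toy data, two of the four annotated pairs share a category, giving a linkage frequency of 0.5.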

Out of about 80,000 yeast PPIs pooled from different high-throughput methods, only about 2,400 PPIs were covered by more than one method [162]. This was attributed to one or more of the three possible causes:

1. None of the high-throughput methods has fully explored the entire interactome.

2. Some methods by design are better tuned to detecting certain types of PPIs and

therefore ignore many true interactions thus generating many false negatives.

3. Most of the methods have a low threshold of PPI detection and generate many

false positives.

Weak dynamic interactions and interactions with inducible proteins such as proteases are often below the sensitivity threshold of the assays. Such PPIs are therefore often missed, producing false negatives. To further illustrate this limitation, consider the serine-cysteine protease TYSND1, an example of a low abundance, inducible protein which participates in weak interactions. TYSND1 translocates from the cytosol to peroxisomes, where it cleaves specific PTS1- and PTS2-containing enzymes involved in the fatty acid beta-oxidation pathway [86]. The resulting polypeptides may in turn interact with each other to form a beta-oxidation enhancing complex. Even though researchers had hypothesized the presence of some kind of peroxisomal enzyme-processing protein like TYSND1 for decades, they were not able to detect and characterize an actual protein. It was eventually detected by using highly sensitive multiplexed isobaric tagging tandem mass spectrometry [68].


Because of the aforementioned limitations inherent to the experimental

approaches for detecting PPIs, computational predictions are proving to be instrumental

in narrowing down the set of putative PPIs and in assigning varying degrees of

confidence measures to experimentally derived PPIs. A number of computational

approaches have been developed over the years to predict PPIs by analyzing large scale

experimental data from genomic and proteomic databases and publications, which could

be directly or indirectly related to protein interactions.

The rest of this chapter focuses on these different computational paradigms.

While the approaches discussed here may have their origins rooted in their individual objectives in functional genomics and proteomics, they are being cited together in this chapter in an effort to provide a single frame of reference for factors contributing to the computational prediction of PPIs.

2.6 Protein-Protein Interaction Topology and Prediction

The biological macromolecules such as DNA, RNA, proteins and other small

cellular molecules have complex mutual structural and functional relationships and are

best represented as networks. This section focuses specifically on the geometry and

topology of a PPI network (also known as protein interaction network) and their impact

on the underlying behavior of PPIs.

A PPI network is usually represented as a graph where nodes represent the

proteins and the edges represent the protein-protein relationships. The connectivity of a

protein is represented by the degree of the node, which is simply the number of neighbors

connected to a node. The path between two nodes is a sequence of adjacent nodes that


connects the two nodes. The number of edges in a path is called the path length. The shortest

path length between two nodes is the minimum number of edges that must be traversed

to get from one node to the other. In a directed graph, the directionality of the edges

must also be taken into account. The mean path length of a network is calculated as

the average of the shortest path lengths over all possible pairs of nodes in the graph.

The degree distribution of a PPI network [18] is a measure of the probability that a given

node possesses a given degree or connectivity.
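These graph quantities can be computed directly from an edge list. The sketch below is illustrative: it treats the PPI network as an undirected, unweighted graph, uses hypothetical function names and toy data, and leaves unreachable node pairs out of the mean path length.

```python
from collections import Counter, deque


def build_graph(edges):
    """Adjacency sets for an undirected graph given (protein_a, protein_b) edges."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def shortest_path_lengths(graph, source):
    """Breadth-first search: minimum number of edges from source to each reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist


def mean_path_length(graph):
    """Average shortest path length over all reachable pairs of distinct nodes."""
    total = 0
    pairs = 0
    for source in graph:
        for target, d in shortest_path_lengths(graph, source).items():
            if target != source:
                total += d
                pairs += 1
    return total / pairs if pairs else 0.0


def degree_distribution(graph):
    """P(k): fraction of nodes that have degree k."""
    counts = Counter(len(neighbors) for neighbors in graph.values())
    n = len(graph)
    return {k: c / n for k, c in counts.items()}


# Hypothetical toy network
toy_graph = build_graph([("A", "B"), ("A", "C"), ("A", "D"), ("B", "C")])
toy_degrees = {node: len(nbrs) for node, nbrs in toy_graph.items()}
toy_mean_path = mean_path_length(toy_graph)
toy_p_k = degree_distribution(toy_graph)
```

In the toy network, protein A has degree 3, the shortest path from B to D has length 2, and the mean path length is 4/3.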

Most biological networks, including PPI networks, are known to be ‘scale-free’ networks: their degree distribution is such that only a few proteins (hubs) interact with a large number of proteins, while the large majority of proteins interact with only a few partners [16, 22, 28, 53]. Mathematically speaking, the degree distribution of a scale-free network can be approximated by $P(k) \approx k^{-\lambda}$, where $P(k)$ is the probability that a given node will have degree $k$ and $\lambda$ is a network constant such that $2 \le \lambda \le 3$. Another property of interest in PPI networks is the clustering coefficient ($C_i$), which measures the degree of mutual connectivity among the nodes surrounding a given node $i$. It is mathematically defined as

$$C_i = \frac{2 n_i}{k_i (k_i - 1)}$$

where $n_i$ represents the number of actual links among the neighbors of node $i$ and $k_i$ is the degree or connectivity of node $i$. The degree distribution function and clustering coefficients can be used to characterize individual biological networks [7].
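As an illustration of the clustering coefficient formula, the following sketch computes it for a node of an undirected graph stored as a dictionary of neighbor sets. The toy graph is hypothetical, and returning 0 for nodes with fewer than two neighbors is a convention assumed here, since the formula is undefined in that case.

```python
def clustering_coefficient(graph, i):
    """C_i = 2 * n_i / (k_i * (k_i - 1)) for an undirected adjacency-set graph."""
    neighbors = graph[i]
    k = len(neighbors)
    if k < 2:
        return 0.0  # assumed convention: the undefined case is reported as 0
    # n_i: number of links among the neighbors of i, each link counted once
    n = sum(1 for u in neighbors for v in graph[u] if v in neighbors and u < v)
    return 2.0 * n / (k * (k - 1))

# Hypothetical toy graph: A, B and C form a triangle; D hangs off A.
toy_graph = {
    "A": {"B", "C", "D"},
    "B": {"A", "C"},
    "C": {"A", "B"},
    "D": {"A"},
}
```

In this toy graph, node A has three neighbors with one link among them, so its coefficient is 2 x 1 / (3 x 2) = 1/3.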

One of the other characteristic features of PPI networks is the ‘small-world’ effect, whereby most nodes in a network are connected to any other node in the graph through a short path [144, 163]. However, the path length of scale-free networks such as PPI networks is much smaller than that of other networks that exhibit the small-world effect, such as randomly distributed networks [30, 32]. So, it would be more precise to describe PPI networks as exhibiting an ‘ultra small-world’ effect. This property allows local perturbations in a PPI network to spread throughout the network rapidly. The small-world effect is also a feature of social networks [103, 120], the Internet [7], scientific collaboration networks [107] and metabolic networks [43]. One place where PPI networks part ways with social networks in the ‘small-world’ analogy is hub-hub interactions. Highly connected protein complexes in a PPI network are seldom found to interact with each other, unlike the assortative behavior of social networks, in which well-connected entities have very high odds of direct interactions with each other [97].

Schwikowski et al. [134] built upon the concept of the PPI network to predict protein functions in S. cerevisiae. They used the majority method to predict functions for an uncharacterized protein based on its interacting neighbors: the three most frequent annotations associated with the neighbor proteins were used to annotate the protein of interest. Based on the global analysis of 2,709 published interactions, they established 2,385 PPIs among 1,548 proteins. They observed that 63% of the interactions occurred between proteins sharing a common function and 76% of them related proteins from the same subcellular location. Their approach correctly predicted a functional category for 72% of the characterized proteins interacting with one or more characterized proteins. More importantly, they predicted one or more functions for 364 proteins of unknown functionality. However, their prediction is limited to uncharacterized proteins which are known to interact directly with proteins of known functionality.
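The majority method described above can be sketched as follows. This is an illustrative reading of the description, not the authors' implementation; the protein names, annotations and the alphabetical tie-breaking rule are assumptions.

```python
from collections import Counter

def predict_functions(protein, interactions, annotations, top_n=3):
    """Return the top_n most frequent annotations among the protein's neighbors."""
    counts = Counter()
    for a, b in interactions:
        if protein in (a, b):
            neighbor = b if a == protein else a
            counts.update(annotations.get(neighbor, ()))
    # Sort by descending count, then by name so that ties are deterministic
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [function for function, _ in ranked[:top_n]]

# Hypothetical data: YFG1 is uncharacterized and has three annotated neighbors.
toy_interactions = [("YFG1", "P1"), ("YFG1", "P2"), ("P3", "YFG1"), ("P1", "P2")]
toy_annotations = {
    "P1": {"transcription", "cell cycle"},
    "P2": {"transcription"},
    "P3": {"transcription", "DNA repair"},
}
```

As noted in the text, a protein with no annotated neighbors receives no prediction; here the function simply returns an empty list.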

2.7 Genomic Sequences and Protein-Protein Interactions

In general, proteins of similar sequences tend to fold into a similar structure and therefore are more likely to have similar functions. Pellegrini et al. [115] presented one of the earliest computational approaches towards PPI prediction based on genomic sequences. Their approach is based on the assumption that proteins participating in a shared pathway or a protein complex are more likely to co-evolve because evolutionary forces will either preserve or eliminate all the functionally linked proteins in a new species. This co-evolution or correlated evolution could be characterized by the phylogenetic profile of a protein, which is essentially a bit-array encoding the presence or absence of a protein over multiple genomes. They computed phylogenetic profiles for 4,290 proteins found in E. coli by comparing them against sixteen other genomes. Protein pairs which show a high degree of similarity in their phylogenetic profiles were found to be functionally linked significantly more often than protein pairs with a lower phylogenetic match [115].
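A phylogenetic profile and a simple profile comparison can be sketched as below. The sketch assumes that presence or absence in each genome has already been decided (in practice by a homology search), represents each genome as a hypothetical set of protein family names, and uses the fraction of matching bits as the similarity score, which is only one possible choice.

```python
def phylogenetic_profile(protein, genomes):
    """Bit-array: 1 if the genome contains the protein, 0 otherwise."""
    return [1 if protein in genome else 0 for genome in genomes]

def profile_similarity(p, q):
    """Fraction of genomes in which two profiles agree (1.0 means identical)."""
    matches = sum(1 for a, b in zip(p, q) if a == b)
    return matches / len(p)

# Hypothetical genomes, each listed as the set of protein families it contains.
toy_genomes = [{"A", "B", "C"}, {"A", "B"}, {"C"}, {"A", "B", "C"}]
profile_a = phylogenetic_profile("A", toy_genomes)
profile_b = phylogenetic_profile("B", toy_genomes)
profile_c = phylogenetic_profile("C", toy_genomes)
```

Proteins A and B are present and absent in exactly the same genomes, so under the co-evolution assumption they would be candidates for a functional link, whereas C would not.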

As an extension to Pellegrini’s approach, Marcotte et al. [95] proposed that if two proteins in the genome of a species have homologs which are found to be fused into a single protein chain in the genome of another species, then the two proteins have a high likelihood of interacting together. For instance, Gyr A (804 amino acids) and Gyr B (875 amino acids) subunits of DNA gyrase in E. coli are found to be fused into a single chain of Topoisomerase II (1429 amino acids) in yeast. They searched a total of 4,290 protein sequences in E. coli against genomes of other species for patterns of .

They found 6,809 pairs of mutually non-homologous sequences where both the proteins had statistically significant similarity to a single protein in a different genome, thus yielding 6,809 putative PPIs. Using a similar approach, they found 45,502 putative PPIs in the yeast genome.

Overbeek et al. [111] tried to predict functional coupling among genes which are

conserved in gene clusters across multiple genomes. They generalized this approach to

detect common classes of functionally linked genes such as transport and signal

transduction clusters and reconstructed several major metabolic and functional

subsystems.

2.8 Motifs, Domains and Protein-Protein Interactions

Standard homology searches for proteins of similar fold and functionality are reasonably successful when sequence similarity is higher than 50%. In the 30%-50% similarity range, there could be serious errors in predicting protein structure, often in protein loop structures. In the range below 30% (also called twilight zone), serious errors occur more frequently, especially in predicting basic protein folds. In such cases, where two sequences have similar chains of a small number of critical residues, standard homology approaches are ineffective. Fortunately, there exist databases of patterns or signatures of short residue chains, such as Prosite [65], which account for functional similarity even in sequences which share low sequence similarity. Such patterns exist in different forms such as motifs or domains [104].

• Sequence Motif: It is a conserved pattern of amino acids that is found in a

sufficient number of proteins to have a biological significance. For instance, the

proteins with similar sequence motifs share similar biochemical activities.

• Structural Motif: When adjacent secondary structural elements of a protein fold into a specific three-dimensional configuration, the result is a super-secondary structure or fold called a structural motif. For instance, the helix-loop-helix motif consists of two alpha helices connected by a loop of amino acids and is often associated with transcription factors. Similarly, the zinc finger motif consists of two beta strands with a folded alpha helical end bound to a zinc ion and is associated with DNA binding proteins.

• Sequence Domain: It is an extended sequence pattern (longer than motif)

identified using a multiple sequence alignment method. It signifies some degree

of common evolutionary ancestry among the aligned sequences. It may be as long

as the entire protein sequence (homeomorphic domain) or more frequently a

smaller segment of the sequence. Also, a complex domain may itself be composed

of smaller domains fused by evolutionary forces.

• Structural Domain: It is a segment of a polypeptide chain that folds into an independently stable three-dimensional tertiary structure and therefore functions and evolves independently of other segments. Because of its inherent stability, it can be genetically swapped from one protein to another, resulting in chimeric proteins.

Domains vary in size but usually are around 200 amino acids or less, though the largest one known is 907 residues long. 49% of all domains lie in the range of 51 to 150 residues. The protein with the largest number of domains is known to contain 13 domains [34].

A domain in a protein interacts with another domain in the same or another protein in order to stabilize its tertiary structure. Therefore, the structural principles of domains and their functional behaviors are often good indicators of functional behaviors of the protein as a whole [39, 41].

Most of the proteins that regulate cellular and sub-cellular processes are composed of a variety of small domains which have unique binding specificities and functions. These processes include signal transduction, cell-cycle regulation, proteolysis, secretory pathway regulation, cytoskeletal organization regulation and gene expression regulation. These interaction domains act as recognition modules and enable the proteins to interact with suitable ligands in order to run the aforementioned molecular biological processes. Interaction domains can be categorized into different families on the basis of sequence, structure and ligand-binding properties. For instance, among the proteins of signal transduction pathways, the SH3, WW and EVH1 domain families recognize proline-rich sequences; the SH2 and PTB families recognize phosphotyrosine-based sequences; and 14-3-3, FHA, PBD and WD40 recognize and bind to phosphoserine and phosphothreonine motifs [114].

Interaction domain motifs are conserved sequences or structures uniquely shaped by evolution for specific protein-protein interfaces. When the interacting proteins diverge evolutionarily, the participating motifs are rendered degenerate. Conversely, convergent evolutionary forces on previously unrelated proteins may create common motifs [164].

For instance, tetratricopeptide repeat (TPR), a PPI module, is a degenerate 34 residue pattern found in a number of functionally dissimilar proteins. It is present in tandem arrays of 3-16 motifs, which act together to form a scaffold in order to facilitate PPIs and

assembly of protein complexes [8, 14]. It is hypothesized that TPR is an ancient PPI

module that has been adapted by different proteins for specific functions. Similarly, Zinc

finger (ZnF) motifs have been identified for their role in PPIs, though they are more

commonly associated with DNA binding domains. They have been found to have

conserved C-X2-C-X12-H-X1-5-C sequence. These CCHC-type ZnFs exhibit similarity

to well-known CCHH-type ZnFs [98].

Sprinzak and Margalit [148] characterized a pair of interacting proteins by their sequence signatures (or motifs) and identified the over-represented sequence signature pairs by looking for sequence signatures that were more frequent than expected at random based on all possible combinations of sequence signature pairs. They demonstrated that over-represented sequence signatures can be used as identifiers of interacting proteins and can significantly reduce the search space of the interacting proteins.

2.9 Gene Ontology and Protein-Protein Interactions

The Gene Ontology (GO) project is a collaborative effort by the GO consortium to standardize the use of concepts and terms related to gene products [13]. This is ensured by enforcing a controlled, structured ontology which facilitates a consistent response to queries related to gene products. The GO consortium maintains gene ontologies in three broad areas, namely, biological process, molecular function, and cellular component. In addition, the GO consortium also maintains associations (better known as annotations) between genes or gene products and their matching ontology terms. While the gene ontologies are species-independent, the gene annotations are specific to species and are supported by collaborating species-specific database groups.

The ontology data can be stored in many formats including OBO-plain text,

OBO-XML, OWL and GO RDF/XML [2]. OBO (Open Biomedical Ontologies) provides

controlled vocabularies for portability across different biological and medical domains.

OBO-XML is the serialized XML version of OBO and uses RELAX-NG schema with a

compact syntax instead of XML syntax. RDF/XML, an RDF-style XML, is another format that the GO consortium offers, along with a DTD (Document Type Definition), for storing and retrieving GO ontology and annotations. OWL (Web Ontology Language) is a more recent standard for web-enabled ontologies supported by the W3C (World Wide Web Consortium) [4]. It uses RDF/XML based serialization and builds heavily on the Semantic Web. New tools, including Protege, have emerged to help map the OBO format to OWL so that GO ontologies can take advantage of the powerful features of OWL. Also, new biomedical ontologies such as BioPAX (Biological Pathway

Exchange) have been created using OWL. The GO consortium develops and catalogs a

variety of tools including third party products related to creation, maintenance and use of

gene ontologies and gene annotations.

As noted earlier, the GO ontology is built around the idea of controlled and standardized concepts or terms, henceforth called GO terms. This has led to extensive research work around the idea of quantitatively evaluating the semantics and relationships among the GO terms and consequently among the gene products represented by the GO terms. GO term similarity can be measured as the degree of commonality or overlap between any two GO terms. Similarly, the distance between two

GO terms can be measured as the degree of dissimilarity or non-overlap between them.

While some approaches study a GO term relationship as a similarity measure and derive a distance measure from it, other approaches lend themselves naturally to studying the relationship between two GO terms as a function of their distance, with the similarity measure derived indirectly.

Gene Ontology is structured as a directed acyclic graph (DAG) where a node

represents the GO term and an edge represents the relationship between two GO terms.

The edges are directed and can take one of two forms: ‘is-a’ or ‘is-part-of’. This is analogous to the use of relationship types among objects in an object-oriented

paradigm from computer science. Henceforth, the terms such as ‘node’ or ‘GO term’ or

‘concept’ will be used interchangeably. Also, the rest of this discussion is applicable

specifically in the context of the Gene Ontology unless noted otherwise.

Most of the methods related to measuring similarity among GO terms are based on GO graph structure or information-content of the GO nodes or some form of empirical combination of the two approaches.

2.9.1 GO Topology Based Semantic Similarity

Based on this approach, the similarity between two nodes is calculated as a

function of one or more paths between the nodes. Usually, the shortest path between the

nodes is chosen. Rada et al. [123] proposed one of the earliest similarity measures based on path distance between two nodes. The distance between two nodes A and B is given by $D_{AB} = D_{AC} + D_{BC}$, where C is the ‘Most Specific Ancestor’ of A and B. For a DAG such as the GO graph, the most specific common ancestor is often termed the ‘Lowest Common Ancestor’ (LCA). This measure can be further extended to include other graph parameters such as the depth of the lowest common ancestor (the depth of a term in GO is calculated as the path length between the GO root and the GO term), edge weights (higher weights can be assigned to deeper nodes), and edge density around the nodes (calculated as a function of the number of edges for a given node). While this approach is intuitive and is

based on the well established framework of graph theory, it faces many challenges in the

context of GO. For instance, the depth of a node in GO is not a consistent indicator of its

semantic significance and a leaf term at a certain level could be just as significant as

another leaf term at a much lower or higher level. Thus, the depth of a term may reflect

the amount of biological research undertaken in that area instead of the intrinsic semantic

significance of the term.
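The path-based distance $D_{AB} = D_{AC} + D_{BC}$ can be sketched on a small DAG as follows. The sketch assumes the ontology is given as a child-to-parents mapping and picks the common ancestor C that minimizes the summed distance; that selection rule, the term names and the function names are illustrative assumptions.

```python
from collections import deque

def ancestor_distances(term, parents):
    """Fewest edges from a term up to itself (0) and to each of its ancestors."""
    dist = {term: 0}
    queue = deque([term])
    while queue:
        node = queue.popleft()
        for parent in parents.get(node, ()):
            if parent not in dist:
                dist[parent] = dist[node] + 1
                queue.append(parent)
    return dist

def path_distance(a, b, parents):
    """D_AB = D_AC + D_BC, minimized over the common ancestors C of a and b."""
    da = ancestor_distances(a, parents)
    db = ancestor_distances(b, parents)
    common = set(da) & set(db)
    return min(da[c] + db[c] for c in common)

# Hypothetical ontology fragment (child -> list of parents).
toy_parents = {
    "metabolism": ["root"],
    "binding": ["root"],
    "lipid metabolism": ["metabolism"],
    "protein metabolism": ["metabolism"],
    "DNA binding": ["binding"],
}
```

Two sibling terms are at distance 2 through their shared parent, while terms from different branches can only be joined through the root and are therefore further apart.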

2.9.2 Information Theory Based Semantic Similarity

In this approach, the frequency with which a term is found in the corpus (gene annotation data source) dictates its semantic significance. Intuitively speaking, a parent term occurs in a corpus if any of its descendant terms is found in the corpus. Thus, the usage frequency for a given term also includes the usage frequency of all its descendants. The probability $P_A$ of occurrence of a GO term A is defined as its frequency in the corpus divided by the highest frequency of any term in the corpus (usually that of the root term). Thus

$$P_A = \frac{\mathrm{frequency}(A)}{\mathrm{frequency}(\mathrm{root})}$$

On the same note, $P_{root} = \mathrm{frequency}(\mathrm{root}) / \mathrm{frequency}(\mathrm{root})$. As expected, the probability of the root term is 1 because there is always a root term in the GO graph. To account for the possibility of orphan terms (terms lacking ancestors) in GO, $P_A$ can instead be defined as $P_A = \mathrm{frequency}(A) / N$, where $N$ is the total number of gene products in the given corpus. The information content (IC) of the term A is given by

$$IC_A = -\log(P_A)$$
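The frequency propagation and the information content calculation can be sketched as follows. The sketch assumes a child-to-parents mapping and direct annotation counts per term, and it counts annotation occurrences rather than unique gene products, which is a simplification; all names and numbers are hypothetical.

```python
import math

def term_frequencies(annotation_counts, parents):
    """Add each term's direct count to the term itself and to all its ancestors."""
    freq = {}
    for term, count in annotation_counts.items():
        stack, seen = [term], set()
        while stack:  # collect the term together with all of its ancestors
            node = stack.pop()
            if node not in seen:
                seen.add(node)
                stack.extend(parents.get(node, ()))
        for node in seen:
            freq[node] = freq.get(node, 0) + count
    return freq

def information_content(freq, root):
    """IC_A = -log(P_A), with P_A = frequency(A) / frequency(root)."""
    return {term: -math.log(f / freq[root]) for term, f in freq.items()}

# Hypothetical ontology fragment and direct annotation counts.
ic_parents = {
    "metabolism": ["root"],
    "binding": ["root"],
    "lipid metabolism": ["metabolism"],
    "protein metabolism": ["metabolism"],
}
ic_counts = {"lipid metabolism": 2, "protein metabolism": 2, "binding": 4}
toy_freq = term_frequencies(ic_counts, ic_parents)
toy_ic = information_content(toy_freq, "root")
```

The root collects every annotation, so its probability is 1 and its information content is 0, while rarer, more specific terms receive higher values.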

One of the earliest applications of the ideas of information theory in semantic similarity can be traced to the WordNet application, which provides a semantic lexical dictionary [44]. Similar to the topological distance approach, most of the information theory based similarity measures depend on computing similarity with respect to the most specific ancestor and therefore rely on the IC value of the LCA. Here are some of the most cited methods of semantic similarity based on information theory:

• Resnik [126] defined semantic similarity between two GO terms as

$Sim(A,B) = IC_{LCA(A,B)}$. Thus, $Sim(A,B) \in [0, \infty)$

Since the similarity depends only on the LCA of terms A and B, this measure does not

differentiate between any two pairs of terms which have the same LCA and therefore

ignores the information inherent in the participating terms A and B.

• Lin [91] proposed a measure which tries to alleviate the aforementioned problem with

Resnik’s model by incorporating the IC values of the participating terms. According

to Lin,

$$Sim(A,B) = \frac{2 \times \log P_{LCA(A,B)}}{\log P_A + \log P_B}$$

which could be mathematically rephrased as

$$Sim(A,B) = \frac{2 \times IC_{LCA(A,B)}}{IC_A + IC_B} = \frac{IC_{LCA(A,B)}}{\mathrm{Avg}(IC_A, IC_B)}$$

Thus, $Sim(A,B) \in [0, 1]$.

Lin’s model suggests some improvements by incorporating the IC values of child

terms. However, this similarity measure can be interpreted as a ratio of two IC values

- parent IC and average child IC values, and thus provides a relative similarity

measure between the child and the parent terms. One of the limitations of this model

is when both the child terms (and as a result the LCA) are close to the root of the GO

graph. This results in very similar IC values for the child and the parent terms, thus

erroneously inflating the semantic similarity between the two child terms.

• A simplified version of Jiang and Conrath model [74] defines distance between two

terms as summation of edge weights along the shortest path linking the two terms.

$$Dist(A,B) = wt(A, M_1) + wt(M_1, M_2) + \dots + wt(M_I, LCA) + wt(B, N_1) + wt(N_1, N_2) + \dots + wt(N_J, LCA)$$

In other words,

$$Dist(A,B) = (IC_A - IC_{M_1}) + (IC_{M_1} - IC_{M_2}) + \dots + (IC_{M_I} - IC_{LCA}) + (IC_B - IC_{N_1}) + (IC_{N_1} - IC_{N_2}) + \dots + (IC_{N_J} - IC_{LCA})$$

where $M_1, M_2, \dots, M_I$ are the nodes leading up to the LCA from node A and $N_1, N_2, \dots, N_J$ are the nodes leading up to the LCA from node B. Because the intermediate terms cancel, this can be further simplified to

$$Dist(A,B) = 2 \log P_{LCA(A,B)} - \log P_A - \log P_B = IC_A + IC_B - 2\, IC_{LCA(A,B)}$$

This model, like Lin’s model, suggests some improvements over Resnik’s model and

the distance measure could be interpreted as a relative measure of IC values of parent

and child terms. However, like Lin’s model, this model gives inflated values of

semantic similarity for terms close to the root of the GO graph.
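The three information theory based measures listed above can be compared side by side with a small sketch. It assumes the IC values and the LCA of each term pair are already available; the IC numbers and term names below are hypothetical.

```python
def resnik(ic, a, b, lca):
    """Resnik: Sim(A,B) = IC of the lowest common ancestor."""
    return ic[lca]

def lin(ic, a, b, lca):
    """Lin: Sim(A,B) = 2 * IC_LCA / (IC_A + IC_B)."""
    return 2.0 * ic[lca] / (ic[a] + ic[b])

def jiang_conrath_distance(ic, a, b, lca):
    """Simplified Jiang and Conrath: Dist(A,B) = IC_A + IC_B - 2 * IC_LCA."""
    return ic[a] + ic[b] - 2.0 * ic[lca]

# Hypothetical information content values for a small ontology fragment.
example_ic = {
    "root": 0.0,
    "metabolism": 1.0,
    "lipid metabolism": 2.0,
    "protein metabolism": 3.0,
}
pair = ("lipid metabolism", "protein metabolism", "metabolism")
```

Note that Resnik's value depends only on the LCA, whereas the other two measures also change with the IC values of the two terms themselves. Lin's ratio is undefined when both terms have an IC of zero, a case this sketch does not handle.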

2.9.3 Hybrid Approach Based Semantic Similarity

Similarity models based on this approach try to incorporate the strengths of the

two broad approaches discussed earlier. These models are formulated to take into account

the topological and informational aspects of the GO terms.

The Jiang and Conrath model presented earlier is a simplified version of a more generic model which defines the edge weight between a child node A and its immediate ancestor node C as

$$wt(A,C) = \left[\beta + (1-\beta)\frac{\bar{E}}{E(C)}\right] \times \left[\frac{d(C)+1}{d(C)}\right]^{\alpha} \times \left[IC_A - IC_C\right] \times T(A,C)$$

where $d(C)$ is the depth of the node C in the GO hierarchy, $E(C)$ is the number of edges linked to node C (in other words, the local density of C), $\bar{E}$ is the global average density of GO (total number of edges in GO divided by total number of terms in GO), and $T(A,C)$ is the link relation/type factor. The parameters $\alpha$ ($\alpha \ge 0$) and $\beta$ ($0 \le \beta \le 1$) control the degree to which the node depth and density factors, respectively, contribute to the edge weight $wt(A,C)$. As $\alpha \to 0$ and $\beta \to 1$, these factors become less significant; for $\alpha = 0$ and $\beta = 1$, with the link type factor $T(A,C)$ set to 1, the edge weight reduces to $IC_A - IC_C$ and the model turns into the simplified Jiang and Conrath model described earlier.
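The generalized edge weight can be written as a short function to make the roles of $\alpha$ and $\beta$ concrete. The parameter names and the numeric inputs are hypothetical, and the parent depth is assumed to be greater than zero so that the depth factor is defined.

```python
def edge_weight(ic_child, ic_parent, depth_parent, edges_parent,
                avg_density, alpha=0.0, beta=1.0, link_type=1.0):
    """wt(A,C) = [beta + (1-beta)*E_avg/E(C)] * [(d(C)+1)/d(C)]**alpha
                 * (IC_A - IC_C) * T(A,C)."""
    density_factor = beta + (1.0 - beta) * avg_density / edges_parent
    depth_factor = ((depth_parent + 1.0) / depth_parent) ** alpha
    return density_factor * depth_factor * (ic_child - ic_parent) * link_type

# Hypothetical inputs: IC_A = 3.0, IC_C = 1.0, d(C) = 2, E(C) = 4, average density 2.0
simplified = edge_weight(3.0, 1.0, 2, 4, 2.0)                      # alpha=0, beta=1
weighted = edge_weight(3.0, 1.0, 2, 4, 2.0, alpha=1.0, beta=0.5)
```

With the default values the weight is simply the IC difference, 2.0, as in the simplified model; with $\alpha = 1$ and $\beta = 0.5$ the density factor is 0.75 and the depth factor is 1.5, giving 2.25.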

A number of studies have used Gene Ontology based functional categories to

annotate uncharacterized proteins known to cluster with known proteins [100]. Vazquez

et al. [160] suggested assigning proteins to the functional classes based on the hypothesis

that proteins from different functional classes exhibit minimum protein interactions with

each other. Speer et al. [145] used GO-based semantic similarity measures to validate

gene expression based biological clusters.

2.10 Gene Expression and Protein-Protein Interactions

A DNA microarray is a two dimensional array of thousands of microscopic DNA

spots (better known as probes) chemically attached to the glass or silicon array surface by

covalent bonds. A DNA microarray is also referred to as a gene array or gene chip or

biochip. The underlying principle of DNA microarrays is hybridization, a process by which nucleic acid strands from a biological sample of interest bind to the complementary nucleic acid strands in the probes, resulting in double stranded molecules which are deposited on the chip (see figure 2.9). In contrast, the non-complementary nucleic acid strands from the sample do not hybridize and get washed away from the chip.

Broadly speaking, there are two types of DNA microarray technologies: the complementary DNA (cDNA) dual channel array and the oligonucleotide single channel array. cDNA is the DNA synthesized by reverse transcriptase using messenger RNA (mRNA) as the template. cDNA microarrays have cDNA probes (100-3,000 base pairs long) which are readily obtained from cDNA libraries. In order to use cDNA microarrays [16, 57, 82-84], two different mRNA samples are reverse transcribed to cDNA samples and labeled fluorescently with two color tags: green Cyanine 3 (Cy3) and red Cyanine 5 (Cy5). Both samples are hybridized to a single array, producing a composite image of hybridization, which is scanned and preprocessed to study the relative gene expression levels of the two sources.

In contrast, oligonucleotide microarrays are fabricated by the process of

photolithography and use oligonucleotide probes of two types - perfect match probes

(PM) represent short segments for a gene and mismatch probes (MM) are created by

putting a mismatching (in other words, non-complementary) nucleotide in the middle of a

PM probe [6, 28, 51]. Both the probe types use the same fluorescent dye (such as phycoerythrin); however, two different mRNA samples are each hybridized to a different array. The two hybridization images are studied and compared for their respective absolute expression levels. Based on their length, oligonucleotide probes are synthesized as short probes of 15-25 nucleotides or long probes of 50-120 nucleotides, and the two types differ in their hybridization specificity and manufacturing specifications.

Figure 2.9 DNA microarray pipeline.

In stark contrast to the conventional methods of studying gene expression in vivo, DNA microarrays allow gene expression analysis for a given cell type, for a given time period and under desired experimental conditions [52]. They are commonly used to discriminate between various tissue samples, such as healthy versus cancerous tissues, and to identify or at least narrow down the participating genes by studying their differential gene expression. In addition to disease diagnosis and toxicogenomics, they are helpful in comparative genomic hybridization studies for genome-wide detection of chromosomal aberrations and for identifying the single-nucleotide polymorphisms (SNPs) responsible for various genetic disorders. Gene expression data derived from microarray studies have the potential to help decipher new pathways and predict new functions for uncharacterized genes by relating their gene expression profiles to those of genes of known function.

In one of the earliest works related to the use of microarrays to study genome-wide mRNA expression, Eisen et al. [38] observed that genes of known similar functions tend to cluster together based on their expression profiles. They used standard correlation techniques to interpret the DNA microarray hybridization data. They showed that genome-wide expression patterns can be used as measures that provide insight into the underlying cellular processes. Moreover, the functions of poorly characterized or novel genes could be predicted from the functions of well annotated genes with similar expression profiles.
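To make the idea of comparing expression profiles concrete, the sketch below computes the Pearson correlation coefficient between two profiles. Pearson correlation is used here as one common choice and is an assumption; the text only states that standard correlation techniques were used. The expression values are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)

# Hypothetical expression levels of three genes measured at four time points.
gene1 = [1.0, 2.0, 3.0, 4.0]
gene2 = [2.0, 4.0, 6.0, 8.0]   # rises together with gene1
gene3 = [4.0, 3.0, 2.0, 1.0]   # falls while gene1 rises
```

Genes whose profiles correlate strongly, such as gene1 and gene2 here, would be grouped into the same cluster by a correlation-based clustering procedure.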

Ge et al. [51] developed a transcriptome-interactome correlation mapping to compare interactions between proteins encoded by genes in the same expression profile cluster with those between proteins encoded by genes in different clusters, and inferred that genes with similar expression profiles are more likely to encode interacting proteins. Grigoriev [57] studied, at the proteomic scale, the relationship between similarity of gene expression patterns for a pair of genes and protein interactions for S. cerevisiae and the much simpler genome of bacteriophage T7. The study suggested that the protein pairs encoded by co-expressed genes interact with each other more frequently than with random proteins. Jansen et al. [72] extended the use of mRNA expression levels to study their relationship with protein-protein interactions. They observed that proteins participating in a permanent complex (one that is stable over its lifetime), such as the ribosome or proteasome, tend to show significant co-expression with respect to their absolute mRNA levels as well as expression profiles. On the other hand, proteins in a transient complex, or in aggregated genomic datasets from experimental methods such as Y2H experiments, have a weaker relationship with gene expression. It is surmised that large scale experimental datasets are prone to noise from sources such as false positives, false negatives, and inconsistent physiological conditions associated with them, thus making it difficult to decipher clear trends of correlation with gene expression levels. Similarly, interactions in a transient complex may occur under physiological conditions which are not subject to a significant gene expression activity, thereby diminishing the relationship between gene expression and protein-protein interactions in such cases.

2.11 Protein Essentiality and Protein-Protein Interactions

Essential genes are the minimal set of indispensable genes critical to the survival of a cell [69, 82]. While the concept of essentiality in itself represents a key milestone of understanding for a molecular biologist, there are many practical applications of essentiality in genes and proteins. Antibiotics are designed to paralyze the cellular processes of a microbe at one or more of its pathways by targeting the essential genes or gene products [78]. A number of techniques have been used to identify essential genes, including antisense RNA for Staphylococcus aureus [73], transposon mutagenesis for Mycoplasma genitalium [67], high density transposon mutagenesis for Haemophilus influenzae [6], mariner-based transposon for Vibrio cholerae [78], comparative genomics for M. genitalium and H. influenzae [105], and genetic footprinting in yeast [141].

Thatcher et al. [156] further extended the concept of essentiality to marginal essentiality based on the marginal benefit hypothesis by identifying individually small but significant contributions of non-essential genes to the viability of cells. Yu et al. [167] have computed a measure of marginal essentiality for yeast based on four large-scale gene knockout experiments that study the protein essentiality with respect to the following:

• Growth rate [151]

• Phenotypes under diverse environments [74]

• Sporulation efficiency [35]

• Sensitivity to small molecules [168]

It was observed that essential proteins differ significantly from non-essential

proteins with respect to the major topological characteristics in a protein interaction

network. More specifically, essential proteins have a higher average degree (number of

direct interactions) and therefore they are more likely to interact than non-essential

proteins. Essential proteins have a higher clustering coefficient which means they are

more likely to interact with other proteins in a clique-like pattern. Also, they have lower

characteristic path length (average distance between nodes) and diameter (maximum

inter-node distance) which implies that they are more closely connected with other

essential proteins.

A similar study done within regulatory networks involving genes and transcription factors provides further insight into the behavior of essential proteins [167]. It was observed that genes regulated by the fewest transcription factors were more likely to be essential genes. It is hypothesized that essential genes are often ‘house-keeping genes’ which are consistently expressed at a higher level and do not need to be regulated through many transcription factors, so that their functioning is predictably stable. On a similar note, it was observed that a more promiscuous regulatory protein (associated with transcription of many genes) was more likely to be an essential protein, since its deletion will impact the expression of a larger number of genes.

Essential genes are also more likely to be associated with a greater number of

functions. This was confirmed by identifying the functions associated with different

genes based on the MIPS classification of gene functions [101].

2.12 Text Mining and Protein-Protein Interactions

Recent advances in natural language processing (NLP) and machine learning have

made it possible to apply these techniques to the general problem of event extraction [8].

NLP based text processing approaches are now being developed for identification of one

such type of event - protein-protein interaction, using the biomedical literature [22].

These approaches can be described in terms of four primary steps that are applied

sequentially.

The first step consists of “named entity recognition” (NER) followed by normalization. Named entity recognition identifies the entities, i.e. proteins, present in the text being processed, for example, the JUN and FOS proteins. NER can be as simple as identifying terms/tokens in the text that are present in an extensive, curated named entity list [79]. In the case of PPI identification, such a list will contain the names of proteins (e.g. JUN, FOS) and their synonyms (e.g. c-JUN, c-FOS, AP-1, AP1) derived from various sources. A more advanced NER system will identify named entities using language specific shape features. English shape features may include features of the words/tokens in the text, e.g. upper case, lower case, capitalized form, punctuation etc. [79]. Most advanced NER systems are hierarchical, as they first identify the most common named entities based on their presence in the named entity list. This is followed by treating NER as a sequence labeling classification problem using shape and syntax features for classification. A sequence labeling classification problem consists of training a classifier to predict the class of a token T1 based on the classes of the tokens in the context window around T1 [79]. For example, in the following sentence, “FOS binds JUN to form AP-1.”, a classifier using a context window of size 3 will attempt to predict

the class of token “JUN” using features assigned to tokens “binds” and “to”. Once NER

is complete, normalization is achieved by mapping the named entities to unique

identifiers stored in curated databases such as UniProt [22]. Normalization ensures that

all synonyms of a protein e.g. JUN and c-JUN are assigned a single, unique identifier.
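
The simplest form of the first step, list-based NER followed by normalization, can be sketched as follows. This is a minimal illustration rather than the pipeline of any cited system: the synonym table and the identifiers (`ID_JUN`, `ID_FOS`) are hypothetical placeholders, not real UniProt accessions.

```python
import re

# Hypothetical synonym list: surface form -> normalized identifier.
# The identifiers are placeholders, not real UniProt accessions.
SYNONYMS = {
    "JUN": "ID_JUN",
    "c-JUN": "ID_JUN",
    "FOS": "ID_FOS",
    "c-FOS": "ID_FOS",
}

def recognize_proteins(text):
    """Return (surface form, normalized id) for every listed name found in text."""
    # Longest names first so that "c-JUN" is tried before "JUN".
    names = sorted(SYNONYMS, key=len, reverse=True)
    pattern = re.compile(
        r"(?<![\w-])(" + "|".join(map(re.escape, names)) + r")(?![\w-])"
    )
    return [(m.group(1), SYNONYMS[m.group(1)]) for m in pattern.finditer(text)]

mentions = recognize_proteins("FOS binds c-JUN to form AP-1.")
```

Both “JUN” and “c-JUN” map to the same placeholder identifier, which is the effect normalization is meant to achieve.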

The second step involves tokenizing the text on sentence boundaries to extract

sentences. If necessary, sentences with multiple clauses are split into simpler sentences

with a single clause. These sentences, with one clause, are checked for co-occurrence and

sentences with less than two protein names are eliminated. Sentences with two or more

proteins are then converted to a parse tree format using syntax based parsing. Sentences

in parse tree format are stored for further processing [22]. A parse tree for the clause, “the

interaction between JUN and FOS”, is shown in figure 2.10. Interacting proteins are

normally described in the literature using a finite set of relationship terms such as

“interacts”, “binds”, “activates” etc.
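
The sentence extraction and co-occurrence filter of the second step can be sketched as below. The sentence splitter is deliberately naive, the protein list is hypothetical, and clause splitting and syntactic parsing are omitted.

```python
import re

PROTEINS = {"JUN", "FOS", "ATF2"}  # hypothetical name list

def split_sentences(text):
    """Naive splitter: break after '.', '?' or '!' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.?!])\s+", text) if s.strip()]

def cooccurring(text):
    """Keep sentences that mention at least two distinct listed proteins."""
    kept = []
    for sentence in split_sentences(text):
        tokens = re.findall(r"[A-Za-z0-9-]+", sentence)
        found = {tok for tok in tokens if tok in PROTEINS}
        if len(found) >= 2:
            kept.append((sentence, sorted(found)))
    return kept

text = ("FOS binds JUN to form AP-1. JUN is a transcription factor. "
        "ATF2 was not studied.")
pairs = cooccurring(text)
```

Only the first sentence survives the filter, since the other two mention a single listed protein each.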

NP = Noun Phrase, PP = Prepositional Phrase, DT = Determiner, NN = Noun,

IN = Preposition, VP = Verb Phrase, and CC = Conjunction.

Figure 2.10 Parse tree for an example clause - “the interaction between JUN and FOS”.

The third step processes the sentences in the parsed tree format to identify

potential pairs of interacting proteins and the nature of their interaction/relationship.

Since different types of relationships between interacting proteins lead to differences in

sentence structure, the processing of the sentences stored as parse trees varies based on

the type of interaction being investigated [22]. For example, for the relationship of type

“interaction” captured by the fragment “interaction between JUN and FOS”, the

following form can be used to capture the interaction:

FORM: REL(noun) word* PRO_i word* PRO_j;

REL = interaction, PRO_i = JUN, PRO_j = FOS

word* = one or more words e.g. “between” or “and”

Processing using this form will result in identification of JUN and FOS as potential pairs of proteins participating in an interaction of type “interaction”.
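
A form of this kind can be approximated on flat text with a regular expression, as in the sketch below. This is an assumption-laden simplification: the approach described above operates on parse trees, whereas this example matches token sequences only, uses hypothetical name lists, and caps `word*` at three filler words.

```python
import re

PROTEINS = ["JUN", "FOS"]               # hypothetical name list
REL_NOUNS = ["interaction", "binding"]  # hypothetical relationship nouns

prot = "|".join(map(re.escape, PROTEINS))
rel = "|".join(map(re.escape, REL_NOUNS))

# REL(noun) word* PRO_i word* PRO_j, with word* limited to 1-3 filler words.
FORM = re.compile(
    rf"\b(?P<rel>{rel})\b(?:\s+\w+){{1,3}}?\s+(?P<pro_i>{prot})\b"
    rf"(?:\s+\w+){{1,3}}?\s+(?P<pro_j>{prot})\b"
)

def match_form(clause):
    """Return (relationship, protein i, protein j) or None if the form does not match."""
    m = FORM.search(clause)
    return (m.group("rel"), m.group("pro_i"), m.group("pro_j")) if m else None

result = match_form("the interaction between JUN and FOS")
```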

In the fourth step, features of the candidate interacting pairs are identified for further use with a machine learning based classification technique [22]. Features used for classification can be “keywords” e.g. relationship terms such as “binds”, “activates”, “interacts”. Alternatively, they can be “distances” e.g. the number of word-tokens between proteins and relationship keywords in the sentence. Features can also be the “part-of-speech” value of the root node connecting proteins in the parse tree e.g. “Noun Phrase” in the parse tree above [22]. Lastly, features can also be defined in terms of syntactical attributes of the parse tree such as the shortest path between interacting protein pairs in the parse tree. {JUN-NN-VP-NN-FOS} will constitute the shortest path between interacting proteins JUN and FOS in the parse tree above. Features identified in this step are fed to a supervised classification technique such as a Support Vector Machine (SVM), to develop a classification model for the interacting proteins [22].

In spite of the initial success of NLP and machine learning techniques in identification of PPIs, significant limitations still prevent their use for fully automatic detection. Identification of PPIs based on co-occurrence alone leads to high recall but suffers from low precision. Only 30% of protein pairs co-occurring in a sentence have been found to have a real interaction [8]. Rules-based approaches, with rules specific to the type of interaction between proteins, can help reduce the 70% false positive rate and therefore improve precision, but result in a significant drop in recall. Moreover, a rules-based approach is specific to the corpus used for training and not generalizable across the vast biomedical literature [8].

Another limitation of current NLP techniques is that identification of PPIs is limited to simple sentences i.e. sentences with one clause. However, it is common for the information needed for identification of PPIs to be spread over multiple sentences, paragraphs and sometimes even papers [8]. Currently existing techniques are not sophisticated enough to address such situations.

2.13 Protein-Protein Interaction Prediction using Integrative Approaches

One of the integrative approaches for prediction is based on association rule mining which essentially tries to discover new relations among existing variables or features of a dataset. While such approaches have been well researched in the area of marketing and web usage statistics, their application in the area of computational biology is more recent. A simple example of association rule mining in the context of a proteomic study could look like this:

IF ((protein P_i has feature F_p) AND (protein P_j has feature F_q) AND (protein P_i and protein P_j have feature F_r OR feature F_s)) THEN (protein P_i and protein P_j interact)

By extracting and integrating a large set of such rules from a database of protein features, complex relationships could be identified with a certain degree of confidence. This could be developed into a knowledge data discovery (KDD) system that can be mined for higher level constructs via an expert system using a forward or backward chaining rules engine.
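
The degree of confidence attached to such a rule is commonly expressed through its support and confidence. The sketch below evaluates one rule on a handful of invented records. As a simplifying assumption, the per-protein features of the rule above are collapsed into a single feature set per protein pair, and all feature labels and outcomes are made up for illustration.

```python
# Each record: (features of the protein pair, whether the pair is known to interact).
# Feature labels and records are invented purely for illustration.
records = [
    ({"Fp", "Fq", "Fr"}, True),
    ({"Fp", "Fq", "Fs"}, True),
    ({"Fp", "Fq"}, False),
    ({"Fp", "Fq", "Fr"}, False),
    ({"Fq", "Fs"}, False),
]

def rule_fires(features):
    """Antecedent of the rule: Fp AND Fq AND (Fr OR Fs)."""
    return {"Fp", "Fq"} <= features and bool({"Fr", "Fs"} & features)

# Outcomes of the records for which the antecedent holds.
fired = [interacts for features, interacts in records if rule_fires(features)]

# Support: fraction of all records where the rule fires and the pair interacts.
support = sum(fired) / len(records)
# Confidence: fraction of rule firings that correspond to true interactions.
confidence = sum(fired) / len(fired)
```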

Oyama et al. [112] proposed a data mining method to discover association rules

related to a set of 4,307 unique PPIs pooled together from the YPD database, MIPS

database and large scale experiments by Ito et al. [70] and Uetz et al. [159]. For the interacting proteins, they identified seven types of protein features from various publications and public genome databases. These feature types were defined as functional or primary structural aspects of the protein set and include Yeast Protein Database (YPD) categories, Enzyme Commission (EC) number, SWISS-PROT/PIR (Protein Information

Resource) keywords, PROSITE motifs, amino acid bias (based on the amino acid residue enrichment), segment clusters (based on protein sequence homology) and amino acid patterns (patterns known to bind to specific protein domains). Based on these feature types, a total of 5,241 features were created. They identified 6,367 association rules based on the PPIs and features and evaluated them against scoring measures. In the process, novel knowledge was mined which was valuable in characterizing existing and new PPIs

[112].

Kotlyar et al. [83] proposed a similar classifier based on association mining to

discover the association rules. However, they also predicted new PPIs and assigned a

confidence measure to every prediction. They integrated different types of evidence for

PPIs such as detection by high-throughput methods, sub-cellular co-localizations, and

structural domains as features of PPI datasets. They used gene expression data and

protein localization information for rule validation.

CHAPTER III

MATERIALS AND METHODS

3.1 Research Hypothesis

Given a set of PPIs in S. cerevisiae, $P(\mathrm{PPI})$ is the a priori probability of detecting a PPI and $P(\overline{\mathrm{PPI}})$ is the a priori probability of detecting a non-PPI. Therefore,

$P(\overline{\mathrm{PPI}}) = 1 - P(\mathrm{PPI})$. Since Odds = Probability / (1 – Probability),

$O(\mathrm{PPI}) = P(\mathrm{PPI}) / P(\overline{\mathrm{PPI}})$.

The goal of this dissertation is to devise a Bayesian approach, referred to as PPI

Predictor in the rest of this section, which integrates a number of features from the S.

cerevisiae proteome to improve the discriminative capability of the predictor to better

classify PPIs as positive or negative. The proposed predictor should show higher a

posteriori odds for a positive PPI and conversely show lower a posteriori odds for a non-

PPI. In other words,

O(PPI | PPI Predictor) > O(PPI) for a positive PPI, and

O(PPI | PPI Predictor) < O(PPI) for a non-PPI.

This objective can be formulated in terms of a statistical hypothesis, as described

next. According to the null hypothesis, there is no improvement in the ability to correctly

classify the PPIs as positive or negative, given the PPI Predictor. This dissertation claims

to reject the null hypothesis in favor of an alternative hypothesis of significant

improvement in correctly classifying the PPIs as positive or negative, given the PPI

Predictor. The discriminative ability of a predictor is also known as the classification accuracy or simply accuracy. In the absence of a predictor, the accuracy of correctly classifying an event as positive or negative is 0.5. Therefore, the two hypotheses can be framed as follows:

Null Hypothesis, $H_0$: Accuracy of PPI Predictor ≤ 0.5

Alternative Hypothesis (Claim), $H_a$: Accuracy of PPI Predictor > 0.5

Since the ratio of posterior and prior odds of an event is called the likelihood ratio, a good predictor will show a likelihood ratio of greater than one for a true PPI and a value of less than one for a non-PPI. Using the Gold Standard dataset, the likelihood ratios of all possible PPIs in the yeast proteome will be computed. The likelihood ratio of a given PPI could be conceived as a measure of confidence in that PPI. An additional objective of this dissertation is, therefore, to assign a confidence measure to the PPIs and rank the PPIs in decreasing order of their confidence measures. A set of PPIs with a confidence measure above a given threshold could then be selected for further proteomic studies as putative PPIs of interest.
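
The relationship between prior odds, likelihood ratio and posterior odds, and the use of the likelihood ratio as a ranking score, can be sketched as follows. The prior probability, the pair labels and the likelihood ratios are all assumed values chosen for illustration.

```python
def odds(p):
    """Odds = probability / (1 - probability)."""
    return p / (1.0 - p)

prior_odds = odds(0.01)  # assumed prior probability of a PPI, illustration only

# Hypothetical likelihood ratios for three candidate pairs.
likelihood = {("A", "B"): 40.0, ("A", "C"): 0.3, ("B", "C"): 2.5}

# Posterior odds = prior odds x likelihood ratio.
posterior_odds = {pair: prior_odds * lr for pair, lr in likelihood.items()}

# Keep pairs above a confidence threshold, ranked in decreasing order.
threshold = 1.0
ranked = sorted(
    (pair for pair, lr in likelihood.items() if lr > threshold),
    key=lambda pair: likelihood[pair],
    reverse=True,
)
```

A likelihood ratio above one raises the odds relative to the prior, while a ratio below one lowers them, which mirrors the two inequalities stated for the PPI Predictor.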

Receiver Operating Characteristic (ROC) curves are commonly used to measure

the accuracy of classifiers or predictors and are used in this dissertation to assess the

accuracy of the PPI Predictor. In order to validate this claim, a multi-fold cross-validation

technique is used against the Gold Standard data. The Gold Standard dataset is known to

contain a widely published and validated set of PPIs, and is used as a reference dataset

for training as well as testing the hypothesis.
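
How an ROC curve and its area summarize accuracy can be illustrated with the short sketch below. The scores and labels are invented, tied scores are not handled, and the cross-validation loop is left out. An area of 0.5 corresponds to the no-predictor baseline named in the null hypothesis.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by lowering the cut-off through the scores."""
    pos = sum(labels)
    neg = len(labels) - pos
    points, tp, fp = [(0.0, 0.0)], 0, 0
    for _, label in sorted(zip(scores, labels), key=lambda x: -x[0]):
        if label:
            tp += 1
        else:
            fp += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Trapezoidal area under the ROC curve."""
    return sum(
        (x2 - x1) * (y1 + y2) / 2
        for (x1, y1), (x2, y2) in zip(points, points[1:])
    )

scores = [9.0, 4.0, 2.5, 0.8, 0.3]  # e.g. likelihood ratios (invented)
labels = [1, 1, 0, 1, 0]            # 1 = gold standard positive, 0 = negative
area = auc(roc_points(scores, labels))
```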

3.2 ORFs – Interchangeability with Genes and Proteins

While the general concepts of gene, protein and gene product are widely used and

understood, there are certain assumptions this dissertation makes with regard to the usage

of terms such as ORFs and their relevance, and at times ORF interchangeability with

genes or proteins, in the context of the various PPI prediction features including gene

expression, gene ontology, essentiality, motif and MIPS functions.

Before delving into the specifics of the datasets used for this dissertation, it is

therefore imperative to discuss such assumptions and the terminology used by the yeast

research community in the context of the terms and features of the datasets. On the same

note, it seems equally pertinent to provide a high-level overview of the S. cerevisiae

genome and proteome.

An open reading frame (ORF) is a series of codons deduced from a DNA

sequence that starts with a 5’ initiation codon (ATG) and ends with one of the 3’

termination codons (TAG, TGA, TAA) and represents a putative or known gene. An

ORF can potentially translate to a polypeptide or RNA. An ORF is, however, not

equivalent to a gene or locus unless there exists a known phenotype associated with a

mutation in the ORF, or an mRNA transcript or gene product generated from the ORF DNA

sequence. The yeast genome sequencing project requires that an ORF be long enough to

encode a protein of 100 or more amino acids [135]. However, if a gene has already been

characterized and localized to the , it can be less than 100 codons in length.

ORF is used interchangeably with the term Coding Sequence (CDS) by the

Saccharomyces community in general, and Saccharomyces Genome Database (SGD) in

particular [135].

Saccharomyces ORFs are commonly classified as one of the three types [135]:

1. Verified ORFs: There exists experimental evidence that these ORFs have an

associated gene product. They are very likely to have orthologs in one or more

Saccharomyces species. Most of the named genes map to this class.

2. Uncharacterized ORFs: These ORFs have orthologs in one or more species which

makes them good candidates as protein encoding ORFs. However, there exists no

specific experimental data to support that they generate gene products in

Saccharomyces species.

3. Dubious ORFs: They are unlikely to encode an expressed protein. They are

often small and overlap with a larger ‘Verified’ or ‘Uncharacterized’ ORF.

Dubious ORFs are likely to have one or more of the following ORF properties:

• They are not conserved in other Saccharomyces species.

• There exists no published experimental evidence that they translate into

gene products.

• Mutating them causes the same phenotype as one known for a mutated

gene that the dubious ORF is known to overlap with.

• They do not contain an intron.

As of July 30, 2011, the S. cerevisiae genome consists of 12,157,105 DNA base pairs across 16 chromosomes. A total of 6,607 ORFs have been identified. 4,933 (74.7%) of them are ‘verified’, 865 (13.1%) are ‘uncharacterized’ and the remaining 809 (12.2%) are marked as ‘dubious’. This means there possibly exists a gene (verified or uncharacterized) every 2 kilobases of the DNA. Most bacteria have a gene density of about one gene per kilobase but, compared to higher eukaryotes, yeast exhibits a very high gene density. The S. cerevisiae genome also includes 299 tRNAs, 77 snoRNAs, 27

rRNAs and 6 snRNAs [135]. Figure 3.1 shows frequency distribution of ORFs with

respect to ORF length, with an average ORF length of 451 amino acids. Approximately

36% of yeast ORFs have up to 250 amino acids, while only around 4% of yeast ORFs

have 1250 or more amino acids.
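
The ORF percentages and the gene density quoted above follow directly from the stated counts, as the following arithmetic check shows. Only numbers given in the text are used.

```python
genome_bp = 12_157_105
orfs = {"verified": 4_933, "uncharacterized": 865, "dubious": 809}

total = sum(orfs.values())
# Share of each ORF class, in percent, rounded to one decimal place.
shares = {name: round(100 * count / total, 1) for name, count in orfs.items()}

# Putative genes are the verified and uncharacterized ORFs.
putative_genes = orfs["verified"] + orfs["uncharacterized"]
bp_per_gene = genome_bp / putative_genes  # roughly 2,100 bp, i.e. about 2 kilobases
```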

Figure 3.1 Frequency distribution of ORFs with respect to ORF length (Left). Log version of the left histogram to highlight the frequency distribution for higher ORF length values (Right).

The Saccharomyces community indexes public domain yeast databases, in the area of

genomics and proteomics, by yeast ORFs. The ORFs used in this dissertation originate

from a number of yeast databases that also provide information relating to various yeast

features either directly or indirectly. Details of each of the individual features are covered in the subsequent sections of this chapter. Since one of the goals of this dissertation is to devise a strategy to identify novel PPIs, it starts out by casting a wide net on the ORFs in the yeast databases of interest and concludes with a list of PPIs in decreasing order of their relevance as dictated by the associated likelihood values. Therefore, the proteomic data used in this research includes ‘verified’ ORFs, and extends to ‘uncharacterized’ and

‘dubious’ ORFs, when there exists sound experimental data relating to a feature of interest for a given ORF.

3.3 Proposed PPI Prediction Techniques

Out of a broad set of features that characterize yeast, five have been chosen since there exists some degree of association between the chosen features and the protein interactions. The chosen features are gene expression, gene ontology, MIPS functions, sequence patterns such as motifs and domains, and protein essentiality. Each of these features has been subject to a distinct approach that best relates the feature based similarity between any two proteins in the proteome with the presence or absence of interactions between the same pair of proteins. Therefore, the objective of each of the feature specific techniques is to provide a measure of a relationship between the feature and the protein-protein interactions in the form of likelihood values. The individual feature based predictions are then integrated using Bayesian ensemble (see section 3.5).

The prediction approaches, described in the rest of this section, are devised to identify a relationship between any given pair of proteins based on a given feature, thus resulting in a similarity measure among all possible combinations of proteins considered

for the given feature. These approaches, however, exclude homodimeric PPIs (when both the interacting proteins are identical) because similarity measures for such instances yield by default the highest permissible similarity values for the given feature and do not provide any meaningful evidence towards the likelihood of a PPI. Also, the Gold

Standard data used in the dissertation does not support PPI instances where the participating proteins are identical.

3.3.1 Gene Annotations from Gene Ontology

The gene annotation data used in this dissertation refers to the May 2009 release

of the S. cerevisiae gene association data. It contains 86,465 annotations across 6,352 S.

cerevisiae ORFs of which 35.6%, 29.7% and 34.7% are related to Biological Process,

Molecular Function and Cellular Component categories respectively. These annotations

conform to the ‘for-public version’ of the Gene Ontology of the OBO (Open Biomedical

Ontologies) v1.2 format maintained by the Gene Ontology Consortium [13]. There are

28,440 GO terms in the March 2009 release of the GO database out of which 1,390

obsolete terms were filtered out. After further eliminating terms with no directly

associated S. cerevisiae ORFs, the GO dataset was reduced to 5,545 GO terms.

Most of the approaches that study gene ontology could be broadly categorized as

either “topology based” or “information theory based”. Some models have made an effort

to combine the two approaches using some form of empirical parameters. For instance,

Jiang and Conrath [74] use the empirical parameters α and β to assign weights to node depth and density factors respectively (see section 2.9.3). This dissertation proposes a novel technique to derive a proteome-wide measure of the information contents of a node

in GO DAG weighted by all its ancestor nodes derived from the global topology of the GO DAG, without introducing any empirical parameters to combine the two existing types of approaches. More specifically, here are the steps of this approach:

1. Load Gene Ontology data.

2. In order to optimize for large scale computations, filter out the following GO

fields from the GO database – Synonym and Definition. This also helps to

optimize post processing steps.

3. Filter out obsolete GO Terms

4. Load Gene Annotations for S. cerevisiae .

5. Once again, to optimize for large scale computations, filter out the following

Annotation fields from the Annotations database – Database, DB_Object_ID,

Dqualifier, DBReference, WithFrom, DB_Object_Name, DB_Object_Type,

Taxon, Date, Assigned_by.

6. Synchronize the GO database and the Annotations database, making sure the

annotations are still valid.

7. Based on the GO topology, create a proteome-wide association matrix between

the ORF and the GO terms and their ancestors.

8. Compute Information Contents (ICs) for GO terms and ORFs.

9. Recompute the association matrix weighted by GO and ORF ICs.

10. Compute pairwise similarity between all the 6,352 ORFs.
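
Steps 7 to 10 can be illustrated on a toy DAG as follows. This sketch is not the dissertation's measure, whose exact weighting is not spelled out in this section: the frequency-based definition of information content and the IC-weighted Jaccard similarity are assumptions, and all term and ORF names are invented.

```python
import math

# Toy DAG: child term -> parents. Term and ORF names are invented.
PARENTS = {"root": [], "A": ["root"], "B": ["root"], "A1": ["A"], "AB": ["A", "B"]}
ANNOTATIONS = {"orf1": {"A1"}, "orf2": {"AB"}, "orf3": {"B"}, "orf4": {"A1"}}

def ancestors(term):
    """Return the term together with all of its ancestors in the DAG."""
    seen, stack = set(), [term]
    while stack:
        t = stack.pop()
        if t not in seen:
            seen.add(t)
            stack.extend(PARENTS[t])
    return seen

# Association "matrix": ORF -> directly annotated terms plus their ancestors.
assoc = {
    orf: set().union(*(ancestors(t) for t in terms))
    for orf, terms in ANNOTATIONS.items()
}

# Information content: terms used by fewer ORFs carry more information.
n_orfs = len(assoc)
ic = {
    t: -math.log(sum(t in terms for terms in assoc.values()) / n_orfs)
    for t in PARENTS
}

def similarity(a, b):
    """IC-weighted overlap of two ORFs' term sets (weighted Jaccard, an assumption)."""
    shared = sum(ic[t] for t in assoc[a] & assoc[b])
    total = sum(ic[t] for t in assoc[a] | assoc[b])
    return shared / total if total else 0.0
```

In this toy example the root term is shared by every ORF and therefore contributes no information, so two ORFs that share only the root receive a similarity of zero.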

The MySQL version of the GO ontology and annotations enables the use of powerful tools (based on Perl and SQL) that can be used to data-mine various GO based relations. In addition to the Matlab-based implementation, various SQL queries were developed to mine information such as the number of paths (shortest and otherwise) connecting two GO terms and path distances between two GO terms.

The key distinguishing feature and advantage of this approach is that it requires a

proteome-wide computation of Information Content values and is a rigorous, non-

empirical solution unaffected by the choice of the empirical parameters that most of the

contemporary approaches depend upon.

3.3.2 Functional Categories from MIPS Functional Catalog

The MIPS functional catalog data [101] consists of 15,823 unique ORF-MIPS function associations among 460 MIPS catalog functions and 4,779 S. cerevisiae ORFs.

Two of the 1,362 MIPS functions (ID:98 corresponding to “CLASSIFICATION NOT

YET CLEAR-CUT” and ID:99 corresponding to “UNCLASSIFIED PROTEINS”) are not meaningful as a source of annotation for this research. As a result, 1,394 annotations corresponding to MIPS Catalog IDs 98 and 99 have been filtered out of the original data.

A total of 196 functional categories were found to be redundant with respect to the functional descriptors even though they refer to non-redundant MIPS functional category identifiers. Eliminating the 196 redundant functional categories resulted in a set of 14,388

ORF functional category association tuples. It is important to note that the S. cerevisiae MIPS data source refers to a subset of the broader and generic MIPS function catalog comprising 1,362 unique functions.

Unlike the GO graph structure, MIPS functional categories are based on the idea of a strictly hierarchical, tree data structure [101]. A functional category [128] in MIPS

could be represented as $A_i.B_{ij}.C_{ijk}.D_{ijkl}.E_{ijklm}.F_{ijklmn}$, where functional category $A_i$ contains a more specific category $B_{ij}$, which in turn contains category $C_{ijk}$ and so on. The specificity of a functional category is proportional to the level of subcategories associated with it. A protein can be annotated to one or more MIPS functional categories. Similarly, a MIPS functional category is usually annotated to multiple proteins.

Most of the approaches related to MIPS functional categories depend upon the

hierarchical nature of the functional categories to identify relations between a pair of

proteins. For instance, a protein annotated to the functional category ‘43.03.07.02.01.03’

is considered functionally closer to a protein annotated to ‘43.03.07.02’ than to a protein

annotated to ‘43.03.23’. While the approaches based on this concept are intuitive and

easy to implement, they lack a uniform and rigorous measure to identify the underlying

similarity among MIPS functional categories on the proteomic scale.
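
The hierarchy-based notion of closeness described above can be sketched by counting the leading levels that two category identifiers share. This illustrates the existing intuitive approach, not the measure proposed in this dissertation. The function names are hypothetical.

```python
def ancestors(category_id):
    """'43.03.07' -> ['43', '43.03', '43.03.07'] (the category and its ancestors)."""
    parts = category_id.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]

def shared_depth(a, b):
    """Number of leading hierarchy levels two category IDs have in common."""
    depth = 0
    for x, y in zip(a.split("."), b.split(".")):
        if x != y:
            break
        depth += 1
    return depth

deep = "43.03.07.02.01.03"
near = shared_depth(deep, "43.03.07.02")  # shares four levels
far = shared_depth(deep, "43.03.23")      # shares two levels
```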

The proposed approach creates a novel measure of semantic information inherent to the individual functional categories by virtue of their usage frequency in the MIPS corpus. It builds upon the inherent hierarchical semantic structure to compute the proteome-wide measure of information content for every corpus term. This technique consists of the following steps:

1. Load MIPS Functional Catalog database.

2. Remove functional categories which are redundant with respect to MIPS

functional descriptors.

3. Load MIPS functional annotations for S. cerevisiae

4. Remove those annotations which are redundantly associated with different

evidence descriptors and PubMed descriptors.

5. Convert MIPS functional IDs from hierarchical IDs to normalized IDs.

6. Based on the MIPS functional catalog, create a proteome-wide association matrix

between the ORFs and the MIPS functional categories and their ancestors.

7. Compute ICs for MIPS functional categories and ORFs.

8. Recompute the association matrix weighted by MIPS and ORF ICs.

9. Compute pairwise similarity between all the 4,779 ORFs.

3.3.3 Gene Expression

The gene expression data for S. cerevisiae was obtained from four different

sources.

1. Cell cycle data: It consists of gene expression data of 6,178 ORFs across 77

different stages and experimental conditions related to cell cycle [147].

2. DNA damage response data: It consists of gene expression data of 6,167 ORFs

across 52 experimental parameters [47].

3. Rosetta data: It consists of gene expression data of 6,132 ORFs across 300

experimental conditions [64].

4. Stress response data: It includes gene expression in response to a variety of stress

factors such as exposing cells to heat shock at varying temperatures for different

time periods and exposure to a variety of chemical agents. It consists of gene

expression data of 6,152 ORFs across 173 different stress factors [48].

The data from the aforementioned sources was combined into a single data source of

6,329 ORF expressions across 602 biological sample points of interest. The ORFs whose

gene expression data was missing for more than 5% of the experimental/sample points in the original dataset, because of experimental limitations or other factors, were eliminated from the dataset. For the rest, it was considered prudent to compute the best estimate for a missing value for a given ORF, instead of rejecting the entire vector of gene expression patterns for a given ORF due to a very small number of missing samples out of over 600 sample points. In such cases, the missing values were estimated based on the mean expression value for that ORF. After this step, the transcriptome consisted of 5,666 expression profiles across 602 sample points.
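
The filtering and imputation described above can be sketched as follows. The toy profiles have 40 sample points rather than 602, missing values are represented by `None`, and the ORF names are invented.

```python
def clean_profiles(profiles, max_missing=0.05):
    """Drop profiles with too many missing values (None); mean-impute the rest."""
    cleaned = {}
    for orf, values in profiles.items():
        missing = sum(v is None for v in values)
        if missing / len(values) > max_missing:
            continue  # more than 5% missing: remove the whole profile
        observed = [v for v in values if v is not None]
        mean = sum(observed) / len(observed)
        cleaned[orf] = [mean if v is None else v for v in values]
    return cleaned

# Toy data with 40 sample points per ORF.
profiles = {
    "orfA": [1.0] * 39 + [None],       # 2.5% missing: kept and imputed
    "orfB": [0.5] * 30 + [None] * 10,  # 25% missing: removed
}
cleaned = clean_profiles(profiles)
```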

Because of their ease of use, linear correlation techniques have been used widely by the microarray research community, dating back to one of the earliest works by Eisen et al. [38]. However, gene expression data is known to exhibit non-linear variations and can be analyzed for patterns of non-linear correlation using a mutual information model.

This dissertation builds upon the concept of the mutual information in its application to gene expression profiles. The gene expression based approach consists of the following steps:

1. Load gene expression data from multiple sources.

2. Integrate the data into a single data source.

3. Remove the expression profiles which are missing significant data (data missing

for greater than 5% of the experimental conditions).

4. Compute the maximum likelihood estimates of the missing data for the remaining

profiles.

5. Compute pairwise correlation based on mutual information theory for all the

5,666 ORFs.
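
A histogram-based estimate of mutual information between two expression profiles can be sketched as below. The equal-width binning and the number of bins are assumptions, since the estimator actually used is not specified in this section.

```python
import math
from collections import Counter

def discretize(values, bins=3):
    """Map each value to an equal-width bin index."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # guard against constant profiles
    return [min(int((v - lo) / width), bins - 1) for v in values]

def mutual_information(x, y, bins=3):
    """I(X;Y) in bits, estimated from the joint histogram of two profiles."""
    bx, by = discretize(x, bins), discretize(y, bins)
    n = len(bx)
    px, py, pxy = Counter(bx), Counter(by), Counter(zip(bx, by))
    return sum(
        (c / n) * math.log2((c / n) / ((px[i] / n) * (py[j] / n)))
        for (i, j), c in pxy.items()
    )

x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
linear = mutual_information(x, [2 * v for v in x])
nonlinear = mutual_information(x, [(v - 2.5) ** 2 for v in x])
```

The second profile is a symmetric quadratic function of the first, so its linear correlation with `x` is zero, yet its estimated mutual information is clearly positive. This is the kind of non-linear dependence the approach is meant to capture.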

3.3.4 Genomic Motifs and Domains

The use of domains and protein sequence information to predict PPIs has been well studied (see sections 2.7 and 2.8). This dissertation introduces a novel information-theoretic measure of similarity based on the domains and motifs found in the proteins participating in the PPI. The motif data [101] consists of 4,172 unique motifs associated with 2,379 unique S. cerevisiae ORFs and 9,378 unique ORF-motif associations along with their p-values. The proposed technique, based on utilizing motif information for prediction of PPI, consists of the following steps:

1. Load the motifs from the eMOTIFS database [101], which contains 9,378

associations between 2,379 ORFs and 4,172 unique motifs along with the p-value

of the evidence.

2. Filter out the following fields in order to optimize computations – ‘ORF sequence

length’, ‘start_coord’, ‘stop_coord’, ‘URL to domain site’ and ‘description’.

3. Compute the pairwise similarity among all the 2,379 ORFs based on the dot

product of the respective p-value vectors.
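
Step 3 can be sketched with sparse vectors, as below. The p-values are used exactly as stated in the step. Whether they are transformed before the dot product is not specified here, so no transformation is applied, and all motif names and values are invented.

```python
# ORF -> {motif: p-value}; names and values are invented for illustration.
MOTIFS = {
    "orf1": {"m1": 1e-6, "m2": 1e-3},
    "orf2": {"m1": 1e-5, "m3": 1e-4},
    "orf3": {"m4": 1e-8},
}

def dot(a, b):
    """Sparse dot product: only motifs present in both ORFs contribute."""
    return sum(a[m] * b[m] for m in a.keys() & b.keys())

def pairwise(motifs):
    """Dot product for every unordered pair of distinct ORFs."""
    orfs = sorted(motifs)
    return {
        (p, q): dot(motifs[p], motifs[q])
        for i, p in enumerate(orfs)
        for q in orfs[i + 1:]  # excludes self-pairs
    }

scores = pairwise(MOTIFS)
```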

3.3.5 Essentiality

The essentiality data was obtained from MIPS data source. It consists of

8,130,528 pairs of protein interactions among 4,033 ORFs, identified as ‘essential -

essential’, ‘essential - non-essential’, ‘non-essential – essential’ and ‘non-essential - non-

essential’ based on the role of the participating protein in the protein interaction.

It has been proposed that protein complexes (especially binary complexes) are

more likely to be composed of proteins which are all essential in contrast to the

complexes formed of all non-essential or a mixture of both essential and non-essential proteins (see section 2.11). Towards that end, all possible protein-protein pairs were classified into “Both Essential”, “Both Non-essential” and “One Essential and One Non-essential” pairs based on the available PPI data. This data was then used to compute likelihood values based on the overlap with the Gold Standard dataset.
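
The pair classification can be sketched as follows, with invented essentiality labels. The last line checks that the number of unordered pairs among 4,033 ORFs equals the 8,130,528 pairs quoted above.

```python
from itertools import combinations
from math import comb

# Invented essentiality labels for four ORFs.
ESSENTIAL = {"orf1": True, "orf2": True, "orf3": False, "orf4": False}

LABELS = {
    2: "Both Essential",
    1: "One Essential and One Non-essential",
    0: "Both Non-essential",
}

def pair_class(a, b):
    """Classify a pair by how many of its two members are essential."""
    return LABELS[ESSENTIAL[a] + ESSENTIAL[b]]

classes = {pair: pair_class(*pair) for pair in combinations(sorted(ESSENTIAL), 2)}

# Number of unordered pairs among 4,033 ORFs.
n_pairs = comb(4033, 2)
```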

3.4 Gold Standard Datasets

Computation of the PPI likelihood for individual and integrated features using PPI

Predictor requires a reference data set. The reference data set used in this dissertation

comprises both a set of Gold Standard Positive (GSP) data and a set of Gold Standard

Negative (GSN) data. The GSP dataset contains known protein interactions of high

confidence. GSN data, on the other hand, refers to the lack of protein interactions as there

is no established methodology to identify or prove non-existence of a given protein

interaction. However, one of the intuitive approaches to build GSN data relies on the

protein localization information and is based on the assumption that proteins from

different sub-cellular compartments are unlikely to interact.

One of the most commonly cited sources of GSP data in the protein interaction

research is the MIPS complex catalog. Manually curated protein complexes cataloged in

this database have been widely used as benchmarks in the S. cerevisiae research

community. MIPS complexes are hierarchically organized to capture the dynamic nature

of complexes when the larger complexes (such as pre-initiation complexes, kinetochores,

spliceosome, proteasome) are formed by the fusion of smaller ones. However, this leads to

many spurious PPIs since not all proteins from two different sub-complexes of a larger

complex are bound to interact with each other. Since this research uses a matrix model to represent PPIs (both reference and predicted), the MIPS complex catalog has the potential to introduce a significant number of false positive PPIs in the reference dataset, which in turn will introduce additional false positives in the predicted PPIs. In addition to the less than optimal hierarchical structure, the MIPS complex catalog lacks direct links to the published evidence [122].

In order to address these concerns, this dissertation builds on a more comprehensive dataset, namely CYC2008 [122]. The CYC2008 database consists of

1,920 associations among 1,627 unique proteins and 408 unique protein complexes as opposed to 215 heteromeric complexes found in the MIPS complex catalog. Out of 408 complexes in the database, 326 are supported by small-scale experimental evidence. An additional 82 complexes have been derived from the SGD database and have been manually curated against the source publications that reported them. The modular nature of the complexes in this dataset minimizes the potential for a false positive PPI in the reference dataset. Using this dataset, a total of 11,255 PPIs were identified as GSP. The numbers of PPIs identified with exactly 1, 2, 3, 4 and 5 complexes are 10,692, 506, 30, 6, and 21 respectively. There are no PPIs with membership in 6 or more complexes.

Since data sources underlying GSP and GSN sets represent the most accurate PPI classification available, it is logical to assume that a given PPI cannot be part of both the GSP and GSN datasets. However, in reality, there exist a very small number of PPIs which violate this assumption. More specifically, there are 409 PPIs which are common to the GSP set of 11,255 PPIs and GSN set of 2,705,844 PPIs. This is, however, not entirely unexpected as the two datasets were created independently based on two different methodologies. However, given the extremely small number of these PPIs, their presence is of little significance for the purpose of this dissertation.
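
The construction of the two reference sets can be sketched as follows. The complexes, proteins and compartments are invented. The positives follow the matrix model, in which every pair within a complex counts as interacting, and the negatives follow the localization assumption described earlier in this section.

```python
from itertools import combinations

# Invented complexes and sub-cellular localizations.
COMPLEXES = {"cplx1": {"p1", "p2", "p3"}, "cplx2": {"p3", "p4"}}
LOCATION = {
    "p1": "nucleus", "p2": "nucleus", "p3": "nucleus",
    "p4": "nucleus", "p5": "mitochondrion", "p6": "cytoplasm",
}

def gold_positives(complexes):
    """Matrix model: every pair of proteins within a complex is a positive."""
    return {
        frozenset(pair)
        for members in complexes.values()
        for pair in combinations(sorted(members), 2)
    }

def gold_negatives(location):
    """Pairs of proteins annotated to different compartments."""
    return {
        frozenset((a, b))
        for a, b in combinations(sorted(location), 2)
        if location[a] != location[b]
    }

gsp = gold_positives(COMPLEXES)
gsn = gold_negatives(LOCATION)
overlap = gsp & gsn  # pairs claimed by both sets
```

In this toy example the overlap is empty. With real data the two sets are built from independent sources, so a small overlap such as the 409 pairs mentioned above can occur.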

3.5 Bayesian Integration

A meaningful naïve Bayes integration of multiple features requires that the participating features be uncorrelated. In order to ascertain that the different features participating in Bayesian integration are relatively independent of each other, the mutual correlation value of the PPI similarity matrices for these features was calculated. The following steps were used to implement the Bayesian integration approach:

1. Remap all the five feature similarity matrices (see section 3.3), including the Gold

Standard matrices, against the same ordered list of ORFs so that they can be

compared with each other in the context of the same ORF vector.

2. For every proteomic feature, identify suitable cut-off values and compute the

measures such as - True Positives, False Positives, True Negatives, False

Negatives and their corresponding Likelihood Ratios at these cut-offs. This is

done so that individual predictive power of each of the features can be computed.

3. Compute cross-feature likelihood ratios by integrating the five features.

Statistically speaking, the combined likelihood ratio of n features, say $F_1, F_2, \ldots, F_n$, can be expressed as the product of their individual likelihood ratios. In other words, $L(F_1, F_2, \ldots, F_n) = \prod_{i=1}^{n} L(F_i) = L(F_1) \cdot L(F_2) \cdots L(F_n)$. This enables the

computation of the joint predictive power of multiple features based on their

individual predictive power, computed in step 2.

Bayesian integration permits partial integration of features, in other words, not all

of the features need to be available for prediction. Although validation results

based on inclusion of all of the features are presented, prediction of novel PPIs

can be carried out even when evidence from some of the features is absent.

4. Eliminate the low-likelihood ratios (less than one) and the corresponding PPIs.

This filtered out 7,213 likelihood ratios and associated PPIs, leaving behind

19,667 likelihood ratios and associated PPIs to process further.

5. Sort the likelihood ratios in descending order while preserving the order of the

corresponding PPIs.
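
The integration steps above can be sketched in Python as follows. This is a minimal illustration and not the implementation used in this dissertation: the pair labels, the per-feature likelihood ratios, and the function names `bin_likelihood_ratio` and `combined_likelihood` are hypothetical. The sketch assumes that the likelihood ratio of a feature bin is the fraction of GSP pairs falling in that bin divided by the fraction of GSN pairs falling in it, and that a feature with no evidence for a pair is simply left out of the product.

```python
from math import prod

def bin_likelihood_ratio(pos_in_bin, pos_total, neg_in_bin, neg_total):
    """Assumed definition: P(bin | GSP) / P(bin | GSN) for one feature bin."""
    return (pos_in_bin / pos_total) / (neg_in_bin / neg_total)

def combined_likelihood(feature_lrs):
    """Naive Bayes combination: product of the available per-feature ratios.

    Entries set to None (no evidence from that feature) are skipped,
    which corresponds to partial integration of features.
    """
    return prod(lr for lr in feature_lrs if lr is not None)

# Hypothetical likelihood ratios of five features for three candidate pairs.
pairs = {
    ("ORF_A", "ORF_B"): [8.0, 2.5, None, 1.2, 0.9],
    ("ORF_C", "ORF_D"): [0.4, 0.5, 1.1, None, None],
    ("ORF_E", "ORF_F"): [3.0, None, 2.0, None, 1.5],
}

scores = {pair: combined_likelihood(lrs) for pair, lrs in pairs.items()}
# Step 4: eliminate pairs whose combined likelihood ratio is less than one.
kept = {pair: lr for pair, lr in scores.items() if lr >= 1.0}
# Step 5: sort in descending order, keeping each ratio attached to its pair.
ranked = sorted(kept.items(), key=lambda item: item[1], reverse=True)
```

Keeping each likelihood ratio paired with its PPI in a single tuple is one simple way to preserve the correspondence required in step 5.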

3.6 Validation of Predicted PPIs

The results of the individual prediction based on each of the features, as well as from the ensemble of features based on Bayesian integration, can be validated using the

ROC curve. Here are the steps taken towards validation of predictions using individual features as well as their integrated feature set.

1. For a given likelihood ratio, compute the feature overlap with both, Gold Standard

Positives and Negatives.

2. For each of the likelihood cutoff values for a feature, compute the ratio of True Positives to False Positives and identify the relationship between the cutoff likelihood and the TP:FP ratio.

3. For each of the likelihood cut-off values, identify the potentially novel PPIs by

eliminating the already known positives and negatives found in Gold Standard

data. Also, identify the relationship between the cutoff likelihood and the number

of potentially new PPIs.

4. Generate an ROC curve based on the True Positive Ratio (TPR) and the False Positive Ratio (FPR) for each of the features, where $\mathrm{TPR} = \frac{TP}{TP + FN}$ and $\mathrm{FPR} = \frac{FP}{FP + TN}$.
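
The ROC construction in step 4 can be illustrated with a short Python sketch. The scores and labels below are made-up toy values, the function names `roc_points` and `auc` are hypothetical, and the use of the trapezoidal rule for the area under the curve is an assumption; the text only states which quantities are plotted.

```python
def roc_points(scores, labels):
    """Return (FPR, TPR) points obtained by sweeping the likelihood cutoff.

    scores: likelihood ratios of the pairs found in the gold standard
    labels: True for a GSP pair, False for a GSN pair
    """
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    points = [(0.0, 0.0)]
    for cutoff in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= cutoff and not y)
        # TP + FN = n_pos and FP + TN = n_neg, so these are TPR and FPR.
        points.append((fp / n_neg, tp / n_pos))
    return points

def auc(points):
    """Area under the ROC curve using the trapezoidal rule (assumed)."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Toy example: six gold-standard pairs with their likelihood ratios.
scores = [9.0, 7.5, 4.0, 3.0, 1.5, 0.8]
labels = [True, True, False, True, False, False]
curve = roc_points(scores, labels)
area = auc(curve)
```

For this toy input, 8 of the 9 positive-negative pairings are ranked correctly, so the area evaluates to 8/9.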

3.7 Cross-validation of Predicted PPIs

In addition to the validation process described above, a multi-fold cross validation can be used to assess the predictive power of the PPI Predictor. This involves randomly partitioning the data into two non-overlapping datasets - training set and test set. Analysis is done on the training set and the results validated against the test set. To reduce variability introduced due to partitioning, this process is repeated multiple times and the validation results are averaged over all the iterations.

This dissertation uses a 10-fold cross validation. More specifically, it consists of the following steps:

1. The Gold Standard Positive and Negative datasets were randomly divided into 10

parts each.

2. In each iteration, 9 parts (of both GSP and GSN) were used as training data and 1

part as testing data.

3. Cross-feature likelihood values and associated PPIs were computed based on the

training data.

4. TP:FP ratios and number of potentially new PPIs were computed based on the

predicted PPIs and Gold Standard test data. These were plotted against the cutoff

likelihood values from the training data.

5. For each of the likelihood cutoffs from the training data, TPR and FPR were

computed based on the test data, and the ROC curve was generated for each

iteration.

6. The accuracy values computed during the 10 iterations were averaged.
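
The partitioning in steps 1 and 2 can be sketched as follows. This is an illustrative outline only: the function name `k_fold_splits`, the fixed random seed, and the stand-in pair labels are assumptions, and the likelihood computation of steps 3 to 5 is omitted.

```python
import random

def k_fold_splits(items, k=10, seed=0):
    """Yield (training, test) lists for k-fold cross-validation."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled items into k parts of near-equal size.
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

gsp = [f"pos_{i}" for i in range(50)]    # stand-ins for GSP pairs
gsn = [f"neg_{i}" for i in range(200)]   # stand-ins for GSN pairs

# Each iteration pairs one GSP split with one GSN split:
# 9 parts of each set for training and 1 part of each set for testing.
splits = list(zip(k_fold_splits(gsp), k_fold_splits(gsn)))
```

The per-iteration accuracies would then be averaged as described in step 6.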

CHAPTER IV

RESULTS AND ANALYSIS

This chapter presents the results of the techniques described in chapter three.

Therefore, it mirrors the previous chapter in its organization. Results stemming from

implementation of the algorithm for each feature including eventual computation of

similarity measures are presented. Statistical techniques including ROC curves have been

used to validate the findings for each of the features and their combined prediction

capability using Bayesian integration. A 10-fold cross validation has been used to further

validate the results of the combined prediction based on all the features.

4.1 Gene Ontology Feature Analysis

This section describes results obtained during the application of the Gene

Ontology feature in the prediction of novel PPIs. Figure 4.1 (left plot) shows a frequency distribution of 6,352 ORFs plotted against the number of associated GO terms from the

GO ontology. The ORFs with 45 to 55 GO terms are most frequently present and account for almost 18% of all the ORFs. Highly annotated ORFs, with as many as 200-280 associated GO terms, occur sparsely since rarely does a protein participate in a wide range (over 200) of biological processes and molecular functions across different cellular locations. However, a significant number of ORFs (around 12%) exist with very few

annotations (3 to 17) as such proteins are highly specialized with respect to their participation in molecular functions and biological processes.

Similarly, figure 4.1 (right plot) shows a frequency distribution of 5,545 GO terms against the number of associated yeast ORFs. Figure 4.2 shows a more detailed breakdown of the GO term frequency distribution from figure 4.1, with the left and right histograms displaying high and low frequency distributions respectively. 24% of the GO terms are associated with no more than one ORF while 77% of the GO terms are associated with no more than 20 ORFs. These terms denote molecular functions and biological processes that are highly specialized and not commonly shared by yeast ORFs.

However, approximately 2% of the GO terms are associated with as many as 500 ORFs.

These terms are found closer to the root of the GO hierarchy and are quite generic in nature.

Figure 4.3 shows a proteome-wide view of PPI similarity values, among all possible combinations of 6,352 ORFs for which GO annotations were available, in a

6,352 x 6,352 matrix. Since the similarity values vary widely from 0 to over 10,000, the similarity distribution is shown on a logarithmic scale to stretch the visual range. In order to better illustrate the distribution of similarity values, a smaller scale matrix is created from the proteomic matrix by randomly selecting 1% of the ORFs. Figure 4.4 shows GO based PPI similarity values of this 64 x 64 matrix, magnified 100 times.
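
The down-sampling used for the zoomed views can be sketched as below. The function name `subsample_matrix`, the seed, and the toy matrix are hypothetical, and rounding the sample size to the nearest integer is an assumption, although it does reproduce the reduced matrix sizes quoted in this chapter.

```python
import random

def subsample_matrix(matrix, fraction=0.01, seed=0):
    """Keep a random subset of ORFs, using the same indices for rows and columns."""
    n = len(matrix)
    size = max(1, round(n * fraction))
    idx = sorted(random.Random(seed).sample(range(n), size))
    return idx, [[matrix[i][j] for j in idx] for i in idx]

# Toy symmetric matrix for 200 ORFs; the values are placeholders only.
n = 200
toy = [[abs(i - j) for j in range(n)] for i in range(n)]
idx, small = subsample_matrix(toy, fraction=0.01)

# 1% of the ORF counts for the five features, rounded to the nearest integer.
sizes = [round(count * 0.01) for count in (6352, 4779, 5666, 2379, 4033)]
```

Here `sizes` evaluates to 64, 48, 57, 24 and 40, matching the dimensions of the zoomed matrices reported for the GO, MIPS, gene expression, motif and essentiality features.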

Figure 4.5 shows an ROC curve of GO similarity values validated against the

Gold Standard data. The accuracy of the GO predictor in classifying PPIs is computed as area under the ROC curve and is found to be 0.9812 as opposed to an accuracy of 0.5 when a prior likelihood measure (i.e. without a feature based predictor) is used. Figure

4.6 shows a plot of the TP:FP ratio as a function of cutoff likelihood values, which indicates that an increase in the likelihood ratio increases the probability of correctly identifying a true

PPI as opposed to identifying a non-PPI as a true PPI. Figure 4.6 also shows a second plot with the number of newly predicted potential PPIs against the cutoff likelihood values. Newly predicted potential PPIs exclude well known PPIs and non-PPIs from GSP and GSN datasets respectively and contain some proportion of true positives and false positives depending on the likelihood ratio. Lower numbers of newly predicted PPIs with higher cutoff likelihood ratio are expected and indicate that the probability of finding new

PPIs drops if the associated cutoff likelihood ratio is raised.
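
The two curves described above, the TP:FP ratio and the number of newly predicted potential PPIs at each cutoff, can be computed along the following lines. The pair labels, likelihood ratios and gold-standard memberships are toy values, and the function name `tp_fp_and_novel` is hypothetical.

```python
def tp_fp_and_novel(predictions, gsp, gsn, cutoffs):
    """For each cutoff, count TP, FP and pairs found in neither gold set."""
    rows = []
    for c in cutoffs:
        selected = {pair for pair, lr in predictions.items() if lr >= c}
        tp = len(selected & gsp)            # predicted pairs that are in GSP
        fp = len(selected & gsn)            # predicted pairs that are in GSN
        novel = len(selected - gsp - gsn)   # potentially new PPIs
        ratio = tp / fp if fp else float("inf")
        rows.append((c, tp, fp, ratio, novel))
    return rows

# Toy likelihood ratios for eight candidate pairs.
predictions = {"p1": 50.0, "p2": 20.0, "p3": 12.0, "p4": 6.0,
               "p5": 3.0, "p6": 2.0, "p7": 1.5, "p8": 1.1}
gsp = {"p1", "p3", "p5"}
gsn = {"p4", "p7"}
table = tp_fp_and_novel(predictions, gsp, gsn, cutoffs=[1, 5, 10])
```

In this toy case the TP:FP ratio rises from 1.5 to 2.0 as the cutoff moves from 1 to 5, while the count of potentially new PPIs falls from 3 to 1, which is the qualitative trade-off described in the text.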

4.2 MIPS Feature Analysis

This section describes results obtained during the application of the MIPS feature in the prediction of novel PPIs. Figure 4.7 (left plot) shows a histogram of the distribution of 4,779 ORFs against the number of associated MIPS IDs from the MIPS

functional catalog. Highly annotated ORFs occur sparsely (approximately 6%) since only

a small number of promiscuous proteins participate in a wide range (20-45) of MIPS

functions. However, the majority of ORFs (74%) are associated with a small number of

MIPS functions (<10).

Figure 4.7 (right plot) shows a distribution of 506 MIPS catalog functions against

the number of associated yeast ORFs. Figure 4.8 shows a more detailed breakdown of the

MIPS functions frequency distribution from figure 4.7, with the left and right histograms

displaying high and low frequency distributions respectively. 10% of the MIPS functions

are associated with no more than one ORF while 55% of the MIPS functions are

associated with no more than 20 ORFs. These MIPS functions denote molecular functions and biological processes specific to the yeast ORFs they annotate. On the other end of the spectrum, approximately 3% of the MIPS functions are associated with as many as 500 ORFs. These functions are, therefore, very generic in nature.

Figure 4.9 shows a proteome-wide distribution of PPI similarity values among all possible combinations of 4,779 ORFs for which MIPS catalog functions were available.

Similarity values in this 4,779 x 4,779 matrix vary in the range of 0 to greater than 25,000 and are plotted on a logarithmic scale. For a closer look at the distribution of similarity values, 1% of the ORFs are selected at random and their PPI similarity values using

MIPS feature are shown in figure 4.10 as a 48 x 48 matrix.

Similar to GO terms based similarity, figure 4.11 shows an ROC curve of MIPS similarity values validated against the same set of Gold Standard data. The accuracy of

MIPS predictor in classifying PPIs is once again computed using the area under the ROC curve. It is determined to be 0.9582, significantly larger than the accuracy of 0.5 based on a prior likelihood of prediction. Figure 4.12 shows a plot of TP-FP ratio as a function of cutoff likelihood values. Similar to the use of GO feature, it indicates that probability of correctly identifying a true PPI increases with a corresponding increase in the cutoff likelihood ratio. Figure 4.12 also shows the associated distribution of newly predicted potential PPIs as a function of cutoff likelihood ratio. An increase in likelihood ratio comes at the cost of associated decrease in number of novel PPIs predicted using this measure.

4.3 Gene Expression Feature Analysis

As part of PPI prediction using gene expression, expression profiles of 5,666 transcripts depicting 602 experimental conditions are analyzed for similarity with each other. While all 5,666 profiles were used towards the goal of novel PPI prediction based on gene expression feature, a smaller subset of profiles was used at times for better visualization of the gene expression profiles and their clustering in the figures ahead.

These figures provide complementary perspectives to help understand the similarity among the underlying expression profiles using a variety of microarray analysis tools.

Figures 4.13 and 4.14 show clustergrams based on subsets of the transcripts in the S. cerevisiae transcriptome. A clustergram is a graphical representation used to

visualize the relationships among the gene expression profiles. More specifically, it is a

visual representation of the gene expression matrix. It may also consist of tree-graphs on

one or both sides of the matrix in order to illustrate the clustering of the profiles. Figure

4.13 shows a clustergram that displays the 2,549 most biologically significant gene

expression profiles out of the total available transcriptome of 5,666 profiles. As seen in the figure, the gene expression matrix consists of 602 experimental conditions on the

horizontal axis and 2,549 profiles on the vertical axis. The value of gene expression for a

given ORF and experimental condition is represented by the corresponding color. The

tree-graphs on the top and left represent visually the clustering of ORFs and experimental

conditions respectively, based on the gene expression values. A number of filtering

criteria were used to select the subset of 2,549 profiles. First, the expression profiles that

exhibited little variance over the 602 experimental conditions were filtered out. Then, the

expression profiles with low entropy were eliminated. Figure 4.14 similarly shows a

clustergram illustrating the hierarchical clustering of a subset consisting of the first 1,000 profiles in the transcriptome in a more traditional red-green colormap.

Figure 4.15 shows in detail the tree-graph that visualizes the hierarchical clustering of the entire transcriptome based on 5,666 ORFs. Each of the 5,666 expression profiles is a leaf branch in the tree. A distance (dissimilarity) threshold of 0.75 was picked which assigns a unique color to each group of nodes where the linkage distance is less than 0.75. Figure 4.16 shows a scaled down version of tree-graph from figure 4.15.

For the sake of visual clarity in the tree representation, the lowest nodes of the tree have been merged up thus yielding a tree of 57 branches (scaled down to 1%) instead of 5,666 branches. A distance (dissimilarity) threshold of 0.9 was picked which assigns a unique color to each group of nodes where the linkage distance is less than 0.9. Most of the branches in this figure represent multiple ORF nodes. For instance, branch ‘40’ represents the merged expression profiles of 9 ORFs (YDR252W, YEL062W, YEL064C,

YER007W, YGR071C, YLR422W, YOL001W, YOL003C and YPL216W). Similarly, branch ‘3’ and ‘8’ represent 1,979 and 478 ORFs respectively. On the other hand, branch

‘37’ represents just one ORF, YDL221W.

As discussed in section 2.10, genes with similar expression profiles are more likely to encode interacting proteins [45]. In other words, the ORFs that share a given gene expression profile cluster are more likely to interact with each other than with ORFs from another cluster. Figures 4.17 to 4.19 show how gene expression profiles can be grouped into finite number of clusters, which can be used to gain an insight into the correlation between transcriptome (gene expression profiles) and interactome (PPIs). In each of the figures, horizontal axis represents experimental conditions and vertical axis

represents the gene expression values. Figure 4.17 shows the first 1,000 profiles clustered into 16 hierarchical clusters. Figure 4.18 shows the first 1,000 profiles grouped into 16 clusters based on k-means clustering. Figure 4.19 shows the representative profiles of the

16 clusters from figure 4.18.

Figure 4.20 shows the corresponding proteome-wide view of PPI similarity values for all pair-wise combinations of 5,666 ORFs for which expression profiles were available. The similarity values in the associated 5,666 x 5,666 matrix vary from -1 to 1 and the similarity distribution is shown on a linear scale. In order to better visualize the distribution of gene expression based similarity values, a smaller scale sample was created from the proteome-wide similarity matrix by randomly sampling 1% of the ORFs. Figure 4.21 shows gene expression based PPI similarity values of this matrix of 57 x 57 ORFs, effectively magnifying the display 100 times.

Figure 4.22 shows an ROC curve of gene expression based similarity values validated against the Gold Standard data. Once again, the accuracy of gene expression based predictor in classifying PPIs is computed as area under the ROC curve. With a value of 0.8407, it was significantly larger than a prior value of 0.5 when no feature based predictor is used. Figure 4.23 shows a plot of TP-FP ratio as a function of cutoff likelihood values. Similar to GO terms and MIPS annotation, with use of gene expression as a feature, an increase in likelihood ratio leads to a corresponding increase in the probability of correctly identifying a true PPI as opposed to incorrectly identifying a non-

PPI as true PPI. Figure 4.23 also shows another plot illustrating that the probability of finding new potential PPIs drops when the associated cutoff likelihood ratio is raised.

4.4 Motif Feature Analysis

Results of using sequence motifs as an ORF feature for predicting novel PPIs are presented next. Figure 4.24 (left plot) shows a frequency distribution of 2,379 ORFs against the number of associated sequence motifs. Approximately 45% of the yeast ORFs are known to consist of only 1 or 2 motifs. On the other hand, more than 7% of the ORFs consist of more than 10 motifs, with some consisting of as many as 30 motifs. Figure 4.24

(right plot) shows the frequency distribution of 4,172 available motifs against the corresponding number of associated yeast ORFs. Figure 4.25 shows the same distribution split into two for improved visualization, with the left and right histograms displaying high and low motif frequency distributions respectively. 98% of the motifs are associated with no more than 10 yeast ORFs, with 79% of the motifs associated with just 1 or 2 yeast ORFs. However, close to 1% of the motifs are associated with more than 20 ORFs.

Figure 4.26 shows a proteome-wide view of PPI similarity values computed using

the motif feature for all pair-wise combinations of 2,379 ORFs known to be associated

with sequence motifs. The similarity distribution is shown on a linear scale from 0 to 80.

In order to better illustrate the distribution of similarity values, once again a smaller scale

sample was created from the 2,379 x 2,379 proteomic matrix by randomly sampling 1%

of the ORFs. Figure 4.27 shows motif based PPI similarity values of the resulting 24 x 24

matrix.

Figure 4.28 shows an ROC curve of motif based similarity values validated

against the Gold Standard data. As usual, the accuracy of the motif predictor in classifying

PPIs was evaluated by measuring area under the ROC curve. A value of 0.6297 for this

measure was superior to an accuracy of 0.5 corresponding to an absence of a feature

predictor. Figure 4.29 shows a plot of TP-FP ratio as a function of cutoff likelihood values and indicates that increase in likelihood ratios increases the probability of correctly identifying a true PPI as opposed to incorrectly identifying a non-PPI as true

PPI. As with other measures, Figure 4.29 shows a second plot illustrating that the probability of finding new PPIs drops when the associated cutoff likelihood ratio is increased. This is reflected by the lower number of newly predicted potential PPIs with higher cutoff likelihood ratios in the graph.

4.5 Essentiality Feature Analysis

Figure 4.30 shows the results of a proteome-wide view of PPI similarity values for all pair-wise combinations of 4,033 ORFs, for which essentiality data was available.

Depending on the essentiality of the ORFs, the similarity value of a given PPI can be grouped into one of 3 categories: both essential, a combination of essential and non-essential, and both non-essential. To better visualize the distribution of similarity values, a 100x magnified view was created by down-sampling the 4,033 x 4,033 matrix. This was achieved by randomly selecting 1% of all ORFs and storing the corresponding similarity values in a 40 x 40 matrix. Figure 4.31 displays the resulting matrix containing essentiality based PPI similarity values of the selected ORFs.

Figure 4.32 shows an ROC curve of essentiality based similarity values validated against the Gold Standard data. The accuracy of essentiality based predictor in classifying

PPIs can be measured by the area under the ROC curve. A value of 0.683 for this measure reflects the improvement in accuracy resulting from use of essentiality as a feature, when compared against an accuracy value of 0.5 for the case when no feature

predictor is used. Figure 4.33 shows a plot of TP-FP ratio as a function of cutoff likelihood values. An increase in likelihood ratio corresponds to a consistent and gradual increase in the probability of correctly identifying a true PPI as opposed to identifying a non-PPI as true PPI. Figure 4.33 also shows a second plot illustrating that number of newly predicted potential PPIs (and therefore, the probability of finding new PPIs) drops if the associated cutoff likelihood ratio is raised.

4.6 Comparison of Features

Features selected to compute similarity values in this research originate from distinctly different sources. Therefore, not only do they differ fundamentally from each other, but the techniques designed to compute similarity are also specialized to each feature's context and yield similarity values applicable only within that context.

Figure 4.34 shows a set of pair-wise scatter plots of the five features used for PPI prediction. These plots provide an effective tool for qualitatively evaluating the relationship between similarity values of each feature in relation to the other. The widely distributed nature of the scatter in these plots substantiates the independent origin of underlying features.

Figure 4.35 shows a plot of the similarity values corresponding to a small, random subset of the PPIs common to all the features. In spite of being normalized to the same scale, these features show little visual correlation.

A quantitative understanding of the relationships among the various features can be achieved by computing a proteome-wide correlation among the similarity values for each of the features. Figure 4.36 shows the distribution of the correlation coefficients

among the similarity values of various features. For example, a correlation coefficient of approximately -0.07 between GO and ESS shows little or no correlation between the PPI similarity matrices corresponding to GO and ESS. The correlation coefficient distribution indicates a weak correlation of motif sequences with gene expression, GO and MIPS with approximate correlation values of 0.21, 0.32 and 0.30 respectively. This shows the relative importance of the role of motifs and domains in influencing gene expression behaviors. Also, it illustrates the role of motifs in influencing protein properties such as molecular functions, biological processes and cellular localizations as is evident in a weak relation with GO and MIPS. A relatively higher, but still poor correlation of approximately 0.55 between GO and MIPS features further indicates that in spite of commonality around functional annotations, these features provide independent measures of similarity as they are curated based on different principles.
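
A pair-wise correlation of this kind can be sketched as follows, assuming that the Pearson coefficient is computed over similarity values flattened against the same ordered list of ORF pairs. The function name `pearson` and the similarity values are hypothetical; the toy values are chosen so that the results are easy to check and are not representative of the real features.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Placeholder similarity values for the same five ORF pairs under each feature.
features = {
    "GO":   [1.0, 2.0, 3.0, 4.0, 5.0],
    "MIPS": [2.0, 4.0, 6.0, 8.0, 10.0],
    "ESS":  [5.0, 4.0, 3.0, 2.0, 1.0],
}

# One coefficient per unordered pair of features.
corr = {(a, b): pearson(features[a], features[b])
        for a, b in combinations(features, 2)}
```

With five features, the same dictionary comprehension yields the ten pair-wise coefficients summarized in figure 4.36.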

Lack of significant correlation across all pair-wise comparisons among the five features establishes independence among the predictors based on these features. This makes it possible to create a framework for integration of these features using Bayesian integration approach as discussed in section 3.5.

A comparison of ROC curves of the five features suggests that gene ontology and

MIPS are the best predictors of the five, followed by gene expression at third position.

Also, sequence motif and essentiality features are relatively poor predictors. However, a

Bayesian integration is able to take advantage of all the features (including the weak ones) in order to maintain a high overall predictive power and high coverage across multiple features.

4.7 Bayesian Integration of Features and Validation

Figure 4.37 shows a proteome-wide view of the GSP dataset, which consists of high confidence PPIs, as a 1,602 x 1,602 matrix. The GSP dataset is made up of 11,255 PPIs, a mere 0.05% of all possible binary PPIs based on the 6,472 different ORFs used in this dissertation. In order to better visualize the distribution of GSPs in relation to all possible interactions between ORFs, a smaller scale sample was created by randomly sampling 1% of the ORFs in Figure 4.37. Figure 4.38 shows the resulting GSP matrix of 16 x 16 ORFs, an effective magnification of 100 times.

Figure 4.39 shows a proteome-wide view of GSN dataset which consists of high confidence non-PPIs as a 2,902 x 2,902 matrix. The GSN dataset consists of 2,705,844 non-PPIs, 12.9% of all possible binary PPIs based on 6,472 different ORFs used in this dissertation. Similar to the case with GSPs, a smaller scale sample was created by randomly sampling 1% of the ORFs. Figure 4.40 shows the smaller scale GSN matrix of

29 x 29 ORFs, in effect magnified 100 times.
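
The proportions quoted for the GSP and GSN datasets can be checked with a few lines of Python. The sketch assumes that "all possible binary PPIs" means unordered pairs of distinct ORFs; the variable names are illustrative.

```python
from math import comb

n_orfs = 6472
all_pairs = comb(n_orfs, 2)           # unordered pairs of distinct ORFs
gsp_share = 11_255 / all_pairs        # Gold Standard Positives
gsn_share = 2_705_844 / all_pairs     # Gold Standard Negatives

print(f"{all_pairs:,} pairs; GSP {gsp_share:.2%}; GSN {gsn_share:.1%}")
# prints "20,940,156 pairs; GSP 0.05%; GSN 12.9%"
```

Under this assumption the computed shares agree with the 0.05% and 12.9% stated in the text.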

Figure 4.41 shows a frequency distribution of the likelihood ratios for combined yeast features. The dominant majority of PPIs (existing and newly predicted) exhibit a likelihood value of less than 10 and the number of PPIs decreases as the associated cutoff likelihood value increases. However, there exists a small set of PPIs with likelihood values in the range of $10^3$ to $10^8$, which exhibit high odds of being true PPIs. While the left histogram shows the frequency distribution on the linear scale, the right histogram shows the frequency distribution on a log scale to better illustrate the log(likelihood) values in the range of 3 to 8.

The plot in figure 4.42 illustrates the degree to which PPIs of a given likelihood are members of GSP, that is, overlap with GSP. GSP membership of the PPIs predicted using the Bayesian integration technique shows a continuous increase as the likelihood values are increased. Figure 4.43 presents the corresponding result for GSN membership: overlap with GSN decreases as the likelihood values increase.

Figure 4.44 shows an ROC curve corresponding to the PPI prediction based on

Bayesian integration of the five features. Measured using area under the curve, the integrated technique exhibits a high prediction accuracy of 0.9583. Based on this result, the null hypothesis is rejected and the alternative hypothesis ($H_a$: Accuracy of PPI Predictor > 0.5) is accepted.

Figure 4.45 shows a distribution of newly predicted potential PPIs (in red)

against the cutoff for likelihood ratio. This indicates a decrease in the probability of

finding new PPIs with increase in likelihood ratio. The figure also shows (in blue) that

the relative probability of predicting truly positive PPIs compared to false positives

increases as the cutoff likelihood increases.

4.8 Cross-validation of Combined Likelihood

Using a multi-fold cross validation approach, 10 iterations were run. Figures 4.46 and 4.47 show the results of one such iteration with an accuracy of 0.9375 for the PPI prediction and a 99% confidence interval of [0.9254, 0.9911]. Figure 4.48 shows the results of running 10 iterations. All of them show accuracies comparable to the results obtained without cross-validation. The average accuracy based on all the 10 iterations

and all the five features simultaneously is 0.9396. Figure 4.49 shows a 99% confidence interval for the 10-fold cross-validation combined prediction accuracy.

4.9 Visualization of Predicted PPIs

Figures 4.50 to 4.52 illustrate protein interaction networks constructed using a representative set of predicted PPIs. For the sake of improved visualization, the top 500 PPIs were selected based on their likelihood ratios. Different categories of PPIs are displayed using different color codes for the corresponding interaction edges in the network. The figures were generated using Cytoscape version 2.8.

Figure 4.50 shows the distribution of newly predicted potential PPIs, PPIs from

GSP set, and PPIs from GSN set sorted by node degree in a circular layout. The figure confirms the presence of relatively high proportion of new predictions (in green) and

GSP PPIs (in red) as compared to GSN PPIs (in blue). Three edges in yellow indicate that the corresponding PPIs are known to be members of both GSP and GSN, which is infrequent but not surprising as explained in section 3.4. Figure 4.51 shows an alternative visualization of the selected PPIs from figure 4.50 using a radial tree layout.

Figure 4.52 shows the distribution of the number of features used to predict a given PPI. As mentioned in section 3.5, the Bayesian approach allows prediction based on integration of only a subset of the features. Red, green and blue colors represent PPI predictions made using 3, 4 and 5 features respectively.

Figure 4.1 Frequency distribution of Yeast ORFs with respect to the associated GO terms (Left) and GO terms with respect to the associated yeast ORFs (Right).

Figure 4.2 Frequency distribution breakdown of GO terms with respect to the Yeast ORFs from figure 4.1 into two histograms: Left (0-100), Right (101-MAX).

Figure 4.3 Yeast PPI similarity based on GO feature: Proteomic view.

Figure 4.4 Yeast PPI similarity based on GO feature: 100x zoom.

Figure 4.5 ROC curve of PPI prediction based on gene ontology feature.

Figure 4.6 Distribution of newly predicted potential PPIs and True Positive/False Positive ratio as a function of likelihood values based on gene ontology feature.

Figure 4.7 Frequency distribution of Yeast ORFs with respect to the associated MIPS IDs (Left) and MIPS IDs with respect to the associated yeast ORFs (Right).

Figure 4.8 Frequency distribution breakdown of MIPS IDs with respect to the Yeast ORFs from figure 4.7 into two histograms: Left (0-100), Right (101-MAX).

Figure 4.9 Yeast PPI similarity based on MIPS feature: Proteomic view.

Figure 4.10 Yeast PPI similarity based on MIPS feature: 100x zoom.

Figure 4.11 ROC curve of PPI prediction based on MIPS feature.

Figure 4.12 Distribution of newly predicted potential PPIs and True Positive/False Positive ratio as a function of likelihood values based on MIPS feature.

Figure 4.13 Clustergram of the 2,549 most significant ORFs based on gene expression - Jet Colormap. The tree-graphs on the top and left show the hierarchical clustering of ORFs and experimental conditions respectively, based on the gene expression values.

Figure 4.14 Clustergram of the first 1,000 ORFs based on gene expression – Red-Green Colormap. The tree-graphs on the top and left show the hierarchical clustering of ORFs and experimental conditions respectively, based on the gene expression values.

Figure 4.15 Hierarchical tree visualizing the clustering of entire yeast transcriptome of 5,666 ORFs.

Figure 4.16 Hierarchical tree of entire yeast transcriptome of 5,666 ORFs from figure 4.15 collapsed down to 1% of its size.

Figure 4.17 Hierarchical clustering of a subset of proteome-wide profiles into 16 clusters.

Figure 4.18 k-Means clustering of a subset of proteome-wide profiles into 16 clusters.

Figure 4.19 Centroids of k-Means clusters from figure 4.18.

Figure 4.20 Yeast PPI similarity based on gene expression feature: Proteomic view.

Figure 4.21 Yeast PPI similarity based on gene expression feature: 100x zoom.

Figure 4.22 ROC curve of PPI prediction based on gene expression feature.

Figure 4.23 Distribution of newly predicted potential PPIs and True Positive/False Positive ratio as a function of likelihood values based on gene expression feature.

Figure 4.24 Frequency distribution of Yeast ORFs with respect to the associated motifs (Left) and motifs with respect to the associated yeast ORFs (Right).

Figure 4.25 Frequency distribution breakdown of motifs with respect to the Yeast ORFs from figure 4.24 into two histograms: Left (0-10), Right (11-MAX).

Figure 4.26 Yeast PPI similarity based on motif feature: Proteomic view.

Figure 4.27 Yeast PPI similarity based on motif feature: 100x zoom.

Figure 4.28 ROC curve of PPI prediction based on motif feature.

Figure 4.29 Distribution of newly predicted potential PPIs and True Positive/False Positive ratio as a function of likelihood values based on motif feature.

Figure 4.30 Yeast PPI similarity based on essentiality feature: Proteomic view.

Figure 4.31 Yeast PPI similarity based on essentiality feature: 100x zoom.

Figure 4.32 ROC curve of PPI prediction based on essentiality feature.

Figure 4.33 Distribution of newly predicted potential PPIs and True Positive/False Positive ratio as a function of likelihood values based on essentiality feature.

Figure 4.34 Scatter plot of yeast feature similarity matrices.

Figure 4.35 Distribution of normalized PPI similarity values for a uniform set of features.


Figure 4.36 Correlation among five different yeast feature similarity matrices.


Figure 4.37 Yeast Gold Standard Positives: Proteomic view.

Figure 4.38 Yeast Gold Standard Positives: 100x zoom.


Figure 4.39 Yeast Gold Standard Negatives: Proteomic view.

Figure 4.40 Yeast Gold Standard Negatives: 100x zoom.


Figure 4.41 Frequency distribution of the combined log(likelihood) values based on the combined yeast features (Left). Log version of the left histogram to highlight the high log(likelihood) values associated with relatively much lower frequencies (Right).


Figure 4.42 Gold Set Positive membership as a function of the combined log (likelihood) values for combined yeast features.

Figure 4.43 Gold Set Negative membership as a function of the combined log (likelihood) values for combined yeast features.


Figure 4.44 ROC curve of PPI prediction, considering all five features simultaneously.

Figure 4.45 Distribution of newly predicted potential PPIs and True Positive/False Positive ratio as a function of likelihood values, considering all five features simultaneously.


Figure 4.46 ROC curve of PPI prediction, considering all five features simultaneously, in one of the iterations from 10-fold cross-validation.

Figure 4.47 Distribution of newly predicted potential PPIs and True Positive/False Positive ratio as a function of likelihood values, considering all five features simultaneously, in one of the iterations from 10-fold cross-validation.


Figure 4.48 PPI Predictor accuracy for the combined yeast features across 10 iterations from 10-fold cross-validation.

Figure 4.49 The 99% confidence interval for the 10-fold cross-validation of prediction accuracy.


Figure 4.50 Degree Sorted Circular Layout of top 500 PPIs based on likelihood ratios. Novel PPI predictions, GSP PPIs and GSN PPIs are shown in green, red and blue respectively.


Figure 4.51 An alternative view of the PPIs from figure 4.50 using Radial Tree Layout. Novel PPI predictions, GSP PPIs and GSN PPIs are shown in green, red and blue respectively.


Figure 4.52 Circular Layout of top 500 PPIs based on likelihood ratios. Red, green and blue colors represent PPI predictions made using 3, 4 and 5 features respectively.


CHAPTER V

APPLICATIONS

This chapter examines the applications of protein-protein interactions from three different perspectives. First, applications of protein-protein interactions are examined in the context of the ‘Rational Drug Discovery’ paradigm. After introducing the paradigm and its potential and limitations, applications of protein-protein interactions are discussed in the context of ‘Target Validation’ and use of High Throughput Screening (HTS) during the ‘Hit Identification/Lead Selection’ phase of rational drug discovery. Next, the role of protein-protein interactions in pathway construction and their potential in addressing core limitations of the ‘Rational Drug Discovery’ process are discussed.

Second, the topology of the PPI network is examined in the context of identification of novel drug-protein targets for small molecule drug discovery. Use of graph theoretic measures in describing topology and identification of novel PPIs for drug discovery and cancer therapeutics is described.

Third, applications of protein-protein interactions in the rapidly growing field of systems biology are considered. Contrary to the reductionist approach of molecular biology, systems biology builds on an integrative approach to study emergent properties of the system that requires an integration of multiple components/subsystems. Such integration requires representation of components using a framework amenable to semantic integration and reasoning. Use of the Semantic Web to represent the PPI network, its integration with other components and subsequent analysis is described. Finally, applications of protein-protein interactions in systems biology are explored, particularly the role of the Semantic Web and ontologies in information representation, retrieval and integration with PPI data to help better understand disease etiology, target identification in drug discovery and comparative genomics.

5.1 Rational Drug Discovery - Promise and Limitations

Most drugs are low molecular weight (< 800 Daltons) organic compounds termed small molecules [108]. They act by binding to specific binding sites on proteins and alter the mode of functioning of proteins in select cell organelles, tissue types or organs.

Rational drug discovery, therefore, can be defined as the process of identifying these small molecules that selectively act at the site of their action to alter the behavior of the target protein in a desirable manner without causing unacceptable side-effects and/or toxicity [108, 124].

Different stages of the drug discovery pipeline, as shown in figure 5.1, involve: identification of the target protein; validation of the target; identification of the hits (small molecules that show activity against the target); selecting promising hits, also called leads, based on the nature of the activity; and optimization of the lead’s availability/efficacy profile before the lead can advance to pre-clinical development [157]. A lead’s pharmacokinetic (PK) and pharmacodynamic (PD) properties are the primary contributing factors to its availability at the site of action and its efficacy respectively [108].


Figure 5.1 Rational drug discovery and development pipeline.

Rational drug discovery has significant advantages over the classical drug discovery process that was the standard until the 1980s [124]. Figure 5.2 lists salient activities carried out during rational drug discovery. Target identification and validation in the initial stages of the drug discovery process require increased understanding of the physiological and pathological processes underlying the disease. Hit identification based on this understanding of interaction between the target protein and structure of small molecules is essential to developing the structure activity relationship (SAR) of promising hits [108]. Comparative evaluation of these hits can lead to the construction of a pharmacophore, a minimum set of molecular features necessary to create a favorable activity profile against a specific target protein. Understanding of SAR and pharmacophores of lead compounds is critical to developing backup compounds/leads if the primary lead compound fails due to an unexpected PK/PD profile that cannot be further optimized during the lead optimization phase [108].

Figure 5.2 Activities carried out during rational drug discovery process.

Although adopting the rational drug discovery process has led to significant progress in the identification of new targets, the paradigm shift is yet to result in a significant increase in the number of new chemical entities (NCEs) approved by the Food and Drug Administration (FDA) [131]. A closer look at the underlying causes is necessary in order to recognize the important role our understanding of protein-protein interactions can play, not only in addressing the current limitations but also in the optimization and subsequent success of the rational drug discovery and development process.

Use of ’omics techniques such as Genomics, Proteomics and Metabolomics during the Rational Drug Discovery process has resulted in a plethora of new targets being identified for almost every major disease. However, our limited understanding of the context in which these novel targets operate has led to an equally spectacular attrition of these targets (and associated leads) during target validation and the later stages of drug discovery and development. This has led to a situation widely described as ‘Target Rich, Lead Poor’ [131]. As explained ahead, our understanding of protein-protein interactions in the context of etiology of the disease can significantly improve the success rate of targets after validation.

Hit identification requires screening the chemical compounds against an assay containing the target. Advent of high throughput screening and its use in drug discovery has made it possible to successfully screen chemical libraries of more than a million compounds [108]. However, the chemical space is of the order of 10^60 [81] and even with a tremendous improvement in screening capacity, identification of optimal leads against a target continues to be akin to looking for the proverbial needle in the haystack. Once again, as described next, an improved understanding of protein-protein interactions followed by mapping of known ligands to their corresponding targets can help design focused libraries that can cover the chemical space intelligently.

5.1.1 PPI and Target Validation

Validation of a target during the drug discovery process is necessary to ensure that the target being investigated plays the primary role in causing the disease and/or an alteration in the behavior of this target will lead to reversal of the disease/symptoms [166]. However, it is now well known that proteins interact and collaborate with other proteins to carry out their function. Therefore, the process of target validation can be significantly improved by first identifying the PPIs this target participates in and then developing a PPI sub-network involving the target protein [166]. As shown in figure 5.3, this sub-network can additionally be annotated by the properties of its member proteins such as function, localization, known ligands, known binding sites, and nature of interaction (activation/inhibition).

Figure 5.3 Protein interaction network for target validation using pathway analysis.

Function of the target protein can then be identified by the role it plays in its sub-network. For example, supervised learning techniques such as k-Nearest Neighbor (KNN) and Support Vector Machines (SVM) can be used to predict the function of the target protein based on the function of the proteins it interacts with.
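The neighbor-based prediction described above can be sketched as a simple vote among annotated interaction partners. This is a minimal illustration rather than the method used in this work: the function name `predict_function`, the use of an interaction score as the similarity measure, and the toy proteins and labels are all assumptions introduced for the example.

```python
from collections import Counter


def predict_function(target, interactions, known_functions, k=3):
    """Predict a function label for `target` by majority vote over its
    k highest-scoring interaction partners that have a known function."""
    partners = []
    for a, b, score in interactions:
        # Interactions are undirected, so the target may appear on either side.
        if a == target:
            partners.append((score, b))
        elif b == target:
            partners.append((score, a))
    annotated = [(s, p) for s, p in partners if p in known_functions]
    annotated.sort(reverse=True)  # strongest interactions first
    votes = Counter(known_functions[p] for _, p in annotated[:k])
    if not votes:
        return None  # no annotated partner to learn from
    return votes.most_common(1)[0][0]


# Hypothetical toy data: (protein, protein, interaction score).
interactions = [
    ("T", "A", 0.9),
    ("T", "B", 0.8),
    ("C", "T", 0.7),
    ("T", "D", 0.2),
]
known_functions = {
    "A": "DNA repair",
    "B": "DNA repair",
    "C": "transport",
    "D": "transport",
}
prediction = predict_function("T", interactions, known_functions, k=3)
print(prediction)
```

With k = 3 the weakest partner (D) is ignored, so two of the three votes go to the function shared by A and B. A full KNN or SVM classifier would typically use a richer feature vector per protein than a single interaction score.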

Since perturbation in the underlying interaction network is central to most diseases, superposition of an associated perturbation (for instance, gene expression data) on to the network/pathway can be invaluable in understanding the role of the target protein and subsequently in its validation and selection as a potential druggable target [166].

If the target’s role in the interaction network has been validated as critical to the manifestation of a disease, it can act as a molecular biomarker of the disease [90].

Molecular biomarkers, such as p53 concentration, are routinely used in lieu of the clinical endpoints (cancer) during the drug discovery process to accelerate drug discovery and development [17].

In the event the target protein is determined to be a non-viable candidate because of either its secondary role in the disease process or its lack of druggability (ability to bind to small molecules and undergo functional modification), our understanding of the PPI sub-network involving this target provides the best possible chance of identifying backup targets that have a high probability of successful validation [166]. This can potentially save a very costly drug discovery venture from imminent collapse.

5.1.2 PPI and High Throughput Screening

Understanding of protein-protein interactions can assist in high throughput screening in several ways. First, an improved understanding of interactions and binding sites of the protein with other proteins/peptides provides insights into the nature of the binding. This can help in developing insights into the pharmacophores of the hit compounds [89]. Subsequent development of a screening library around the pharmacophores can significantly improve our odds of finding the optimal lead in the vast chemical space [89].

Second, a protein in an interaction network will usually interact with more than one other protein. Understanding of these interactions and associated binding sites can help in identifying fragments (partial molecules) that exhibit complementary binding to the target. These fragments can be used to develop a fragment-centric screening library for HTS [139]. Due to the combinatorial nature of the chemical space, a small number of fragments in the screening library can cover a vast area in chemical space. As an illustration, if a small molecule is made of 3 segments, a library containing n, m & p instances of the three segments respectively can produce n*m*p small molecules. Therefore, screening with these n+m+p fragments will be equivalent to screening with n*m*p small molecules. As shown in figure 5.4, ‘Fragment-like Hits’ identified during the screening can then be used as input to a set of related techniques called Fragment Expansion, Linking and Assembly to develop leads against the target [24].

Figure 5.4 HTS followed by lead assembly to cover large chemical space.
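The combinatorial argument in this section can be checked with a short enumeration. The sketch below is illustrative only: the fragment labels are placeholders rather than real chemical fragments, and joining labels with a hyphen stands in for actual fragment linking chemistry.

```python
from itertools import product


def enumerate_molecules(seg1, seg2, seg3):
    """Enumerate every three-segment molecule that the fragments can form."""
    return ["-".join(parts) for parts in product(seg1, seg2, seg3)]


# Hypothetical fragment library with n = 3, m = 2 and p = 4 instances.
seg1 = ["A1", "A2", "A3"]
seg2 = ["B1", "B2"]
seg3 = ["C1", "C2", "C3", "C4"]

molecules = enumerate_molecules(seg1, seg2, seg3)
fragments_screened = len(seg1) + len(seg2) + len(seg3)  # n + m + p
molecules_covered = len(molecules)                      # n * m * p
print(fragments_screened, molecules_covered)
```

Here 9 fragments stand in for 24 assembled molecules. The gap widens quickly as the library grows, since the number of fragments grows additively while the number of covered molecules grows multiplicatively.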

5.1.3 Making Rational Drug Discovery Optimal

Identification of PPIs and development of the protein interactome is critical to the development of signaling pathways. These pathways capture the interaction of a protein with other proteins and ligands and therefore play an important role in improving our understanding of the physiology and pathology of the disease of interest [131]. This has necessitated incorporation of a target-based, pathway-centric approach into the rational drug discovery process. As shown in figure 5.5, it is achieved by first using PPIs to develop the pathway and then mapping disease targets (proteins) and associated leads (drugs) to the pathway [131].

Figure 5.5 Mapping of targets (proteins) and drugs (small molecules) on to the pathway.

Mapping of the targets to the pathway can not only provide support to the target’s mechanistic hypothesis but also help identify alternative druggable targets (proteins with which this target interacts as part of the mechanism of action). This, in turn, provides insight into the therapeutic value of the target and its modulation by the lead (target’s druggability) [131-132].

Since most diseases are multi-factorial, mapping of targets to the pathway can also help in the identification of associated proteins. When target identification is based on weak disease-association studies, information about PPIs can help validate the target [166]. Pathways built using PPIs can also help in interpretation of differential expression (of genes, for instance) at a functional level [138].

Similarly, mapping of leads (drugs) to the PPI based pathway helps map chemical space (lead) onto the biological space (target). Identification of the chemical space corresponding to a disease pathway can lead to focused development of a chemical library that can be screened quickly and cost-effectively [21]. Moreover, a known drug can be mapped to the pathway via the protein with which it interacts. Interacting proteins can then be evaluated for their druggability using screening against focused chemical libraries (libraries developed with a focus towards specific protein function/families) to develop new leads [21].

Lastly, an improved understanding of the binding sites of the protein with other proteins/peptides can help in developing small molecule drugs for targets that primarily interact with larger molecules/peptides in vivo [108]. Information about binding sites and the nature of the interaction can help with in silico drug design using molecular modeling, a process called molecular docking [108]. Because this process models the interactions of the ligand with the target entirely in silico, it reduces the time and cost of the drug discovery process.

5.2 PPI - Applications using Graph Theoretic Analysis

Graph theoretic analysis of the protein interaction network can not only provide important insights into the nature of PPIs but also assist us in applying these insights in understanding the physiology and pathology of biological processes. Such analysis may involve understanding and characterizing the topology of these networks using graph theoretic measures. These measures can be node-centric or network-centric. Node-centric measures characterize a node in relation to other nodes and connections between them.

Network-centric measures characterize a sub-graph topology or connectivity. Many of these measures are described earlier in section 2.6. Two additional network topology concepts that will be used ahead are described next:

• k-core Analysis: It is an algorithm used to identify nodes with large degree and therefore, identify nodes that form the core of the network. The algorithm is implemented by iteratively applying the k-filtering step in which nodes with a degree smaller than ‘k’ are gradually deleted until the degree of the remaining nodes is greater than or equal to ‘k’.

• k-clique Community: In order to compute the k-clique community, first all of the k-cliques for a predetermined value of ‘k’ are identified in the network. A k-clique is a set of k nodes that are maximally connected, in other words, each node in the set is directly connected to all other nodes. A k-clique is also called a k-complete subgraph. Once all k-cliques are identified, the k-clique community of a k-clique can be computed. The k-clique community of a k-clique is its union with all k-cliques that can be reached from it through a series of adjacent k-cliques, where cliques sharing k-1 nodes with the k-clique are considered as adjacent. k-clique communities are effective in identifying central hubs in the protein interactome.
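Both concepts can be sketched in a few lines for small networks. The code below is an illustrative implementation under simplifying assumptions: the graph is a hypothetical toy network, the helper names are invented for this example, and the k-cliques are found by brute-force enumeration, which is only practical for small graphs.

```python
from itertools import combinations


def build_adjacency(edges):
    """Build an undirected adjacency map {node: set of neighbors}."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def k_core(adj, k):
    """Apply the k-filtering step until every remaining node has degree >= k."""
    nodes = set(adj)
    changed = True
    while changed:
        changed = False
        for n in list(nodes):
            # Degree is counted only among nodes that have not been deleted.
            if len(adj[n] & nodes) < k:
                nodes.discard(n)
                changed = True
    return nodes


def k_cliques(adj, k):
    """Enumerate all sets of k mutually connected nodes (brute force)."""
    return [frozenset(c) for c in combinations(sorted(adj), k)
            if all(v in adj[u] for u, v in combinations(c, 2))]


def k_clique_communities(adj, k):
    """Merge k-cliques that are adjacent, i.e. that share k-1 nodes."""
    cliques = k_cliques(adj, k)
    parent = list(range(len(cliques)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(cliques)), 2):
        if len(cliques[i] & cliques[j]) == k - 1:
            parent[find(i)] = find(j)
    groups = {}
    for i, clique in enumerate(cliques):
        groups.setdefault(find(i), set()).update(clique)
    return sorted(sorted(g) for g in groups.values())


# Hypothetical toy network: two triangles sharing an edge, plus a third triangle.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("B", "D"), ("C", "D"),
         ("D", "E"), ("E", "F"), ("E", "G"), ("F", "G"), ("G", "H")]
adj = build_adjacency(edges)

core_2 = k_core(adj, 2)
communities_3 = k_clique_communities(adj, 3)
print(sorted(core_2))
print(communities_3)
```

In this toy network the 2-core drops only the pendant node H, while the 3-clique communities separate the two overlapping triangles (A, B, C, D) from the isolated triangle (E, F, G).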

5.2.1 Graph Theoretic Analysis for Drug Target Identification

Graph theoretic analysis can be used in identification of new drug targets and validation of target proteins associated with weak evidence. The underlying approach consists of the following steps:

1. Building PPI sub-network involving the target.

2. Annotating the sub-network with known drug-protein targets.

3. Computing graph theoretic measures defined above for the nodes/proteins in the sub-network.

4. Selection of the graph theoretic measures which are relevant to the identification of drug-target proteins.

5. Using these measures to identify novel drug-target proteins and/or validate novel target proteins using classification techniques such as KNN and SVM.

The sub-network, in step #1, can be built by merging interactions identified using experimental and computational techniques. These are then annotated with known drug-protein targets, such as those extracted from the DrugBank database [171]. Graph theoretic measures can then be computed for each node in the sub-network. However, in order to identify the measures in step #4, we need to identify a negative set, a set of proteins that are very unlikely to be drug-protein targets. This can be achieved by first identifying an essential set of proteins that are required for fundamental cellular processes. Ubiquitously expressed proteins in the human proteome can be used to identify this essential set [158]. All known drug-protein targets have also been shown to be members of an essential protein set [171]. Therefore, proteins in the sub-network that are not members of the essential protein set can be used in the negative set.

Using the known drug-target proteins and the proteins in the negative set, we can identify the measures that are relevant for classification of novel proteins. It has been shown that drug-target proteins exhibit a higher level of connectivity among themselves [171]. Similarly, drug-targets display lower values of the average shortest path to other nodes in the subgraph [171]. Therefore, drug-protein target subgraph connectivity and mean path to subgraph measures can be used to identify novel drug targets in the interactome. k-core analysis of the protein interaction network has shown that drug target proteins form the topological centers of the network [171]. Such an analysis, with high values of ‘k’, can help identify these centers that will invariably contain proteins that are good candidates as drug-targets.
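The two measures named above, connectivity to known drug targets and mean shortest path to them, can be sketched as follows. This is a minimal illustration: the function names, the toy network and the choice of which nodes count as known targets are assumptions made for the example, and path lengths are computed by breadth-first search on an unweighted graph.

```python
from collections import deque


def shortest_path_lengths(adj, source):
    """Breadth-first search distances from `source` to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def mean_path_to_targets(adj, node, targets):
    """Average shortest-path length from `node` to the known drug targets."""
    dist = shortest_path_lengths(adj, node)
    lengths = [dist[t] for t in targets if t != node and t in dist]
    return sum(lengths) / len(lengths) if lengths else float("inf")


def target_connectivity(adj, node, targets):
    """Number of direct interaction partners that are known drug targets."""
    return len(adj[node] & (targets - {node}))


# Hypothetical sub-network; T1, T2 and T3 play the role of known drug targets.
adj = {
    "T1": {"T2"},
    "T2": {"T1", "X"},
    "X": {"T2", "T3"},
    "T3": {"X", "Y"},
    "Y": {"T3", "Z"},
    "Z": {"Y"},
}
targets = {"T1", "T2", "T3"}

for candidate in ("X", "Z"):
    print(candidate,
          target_connectivity(adj, candidate, targets),
          round(mean_path_to_targets(adj, candidate, targets), 2))
```

Candidate X sits among the known targets (two direct target neighbors, short mean path), whereas Z lies on the periphery. Under the observations cited above, X would be the more plausible novel target; in practice these values would be fed to a classifier rather than compared by eye.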

5.2.2 Graph Theoretic Analysis for Cancer Protein Identification

PPIs and associated signal transduction play a critical role in the development of cancer. Significant progress has been made in identification of small molecules that inhibit these interactions and provide therapeutic value. These interactions are involved in cellular processes such as regulation of apoptosis and cell growth/division. Therefore, identification of novel proteins and associated interactions is vital to developing small-molecule antagonists that can inhibit these interactions and move cancer research closer to finding a cure [11].

Identification of cancer proteins using graph theoretic analysis consists of the following steps:

1. Build a PPI network using information from human protein interaction databases such as HPRD (Human Protein Reference Database) & BIND (see section 2.4).

2. Annotate the interaction network with known cancer genes obtained from a cancer database such as CGCD (Cancer Gene Census Database).

3. Identify the essential genes as genes in the genome that are ubiquitously expressed. Control genes are identified as genes that are neither cancer genes nor essential genes.

4. Compute graph-theoretic measures for the cancer (recessive/dominant), essential and control genes and identify the measures relevant for identification of cancer proteins.

5. Use the relevant measures to classify novel proteins/interactions into cancer/non-cancer genes.

Analysis in step #4 has shown that cancer proteins tend to have higher degree than control genes [76, 153]. This observation is in accordance with a higher level of connectivity observed in disease genes as compared to non-disease genes [76]. Similarly, cancer genes in the interactome have been shown to possess a higher degree of betweenness [153]. This is expected, as highly connected nodes are expected to facilitate communication between nodes with which they interact. Cancer proteins also exhibit lower values of mean shortest path to other cancer proteins than non-cancer proteins. This indicates that a cancer protein exhibits closer collaboration with other cancer proteins than non-cancer proteins [153].

Both cancer and essential proteins have been found to have lower clustering coefficients than control proteins. This indicates that, even though cancer proteins possess a high level of connectivity, their neighbor proteins are less likely to connect to each other when compared against the control proteins [153]. Since cancer proteins possess a higher level of betweenness, associated low clustering coefficients further assert their importance in facilitating communication between the nodes with which they interact. The k-clique community analysis of cancer proteins indicates that they participate in central hubs of the interaction network rather than peripheral ones. This further substantiates their participation in sub-networks that are central to the interactome [76].

The measures such as connectivity, betweenness, average shortest path distance, clustering coefficients and k-clique analysis can be further used to analyze essential proteins that exhibit characteristics similar to that of cancer proteins. Results of such analysis can be used to identify novel cancer proteins and PPIs that can be targeted for the development of small molecule drugs [11].
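Three of the measures discussed in this section (degree, clustering coefficient and betweenness) can be computed with the short sketch below. It is an illustration on a hypothetical toy network, not the analysis reported in the cited studies; betweenness is computed with Brandes' algorithm for unweighted graphs, and the function names are chosen for this example.

```python
from collections import deque
from itertools import combinations


def clustering_coefficient(adj, node):
    """Fraction of pairs of neighbors of `node` that are themselves connected."""
    nbrs = adj[node]
    k = len(nbrs)
    if k < 2:
        return 0.0
    links = sum(1 for u, v in combinations(nbrs, 2) if v in adj[u])
    return 2 * links / (k * (k - 1))


def betweenness(adj):
    """Unnormalized betweenness centrality (Brandes' algorithm, unweighted)."""
    bc = dict.fromkeys(adj, 0.0)
    for s in adj:
        stack = []
        preds = {v: [] for v in adj}
        sigma = dict.fromkeys(adj, 0)  # number of shortest paths from s
        sigma[s] = 1
        dist = {s: 0}
        queue = deque([s])
        while queue:
            v = queue.popleft()
            stack.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = dict.fromkeys(adj, 0.0)
        while stack:
            w = stack.pop()
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != s:
                bc[w] += delta[w]
    # Each unordered pair is counted twice in an undirected graph.
    return {v: b / 2 for v, b in bc.items()}


# Hypothetical network: hub H with four partners, plus a separate triangle P-Q-R.
adj = {
    "H": {"A", "B", "C", "D"},
    "A": {"H"}, "B": {"H"}, "C": {"H"}, "D": {"H"},
    "P": {"Q", "R"}, "Q": {"P", "R"}, "R": {"P", "Q"},
}
bc = betweenness(adj)
for node in ("H", "P"):
    print(node, len(adj[node]), clustering_coefficient(adj, node), bc[node])
```

The hub H shows the profile described above for cancer proteins: high degree, high betweenness and a low clustering coefficient, because its partners do not interact with one another. Node P, in a tightly knit triangle, shows the opposite pattern.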

5.3 PPI - Applications in Systems Biology

Systems Biology can be defined as the study of interactions between components of a biological system, and how these interactions lead to emergent properties of the system such as function and behavior of the system [143]. While a reductionist approach in biology has been successful in identifying components and, to some extent, interactions between them, it has largely failed in furthering our understanding of how system properties emerge. Systems biology, via simultaneous qualitative and/or quantitative analysis of biological networks encompassing multiple components, utilizes the integrative approach to help us understand system level function and behavior, for example, induction of angiogenesis by a tumor [133].

Since PPIs are at the core of determining function and behavior of a component (for example, cell cycle regulation) or system (for example, adaptation of the visual system to light), the PPI network provides a natural choice of a biological network around which a blueprint of multiple interacting networks/components can be developed.

This approach is at the core of numerous systems biology applications that have significantly advanced our understanding of complex biological and pathological systems. The framework underlying this approach has three key features. These are:

1. Representation of the protein interaction network using a semantic model that is extensible and domain-agnostic. Extensibility of the model allows semantic integration of the protein interaction network with other biological networks of interest. Being agnostic to the underlying domain allows representation and integration with a diverse set of networks.

2. Depending on the context of the problem, extending and/or integrating the PPI network with other relevant sources and/or networks. For example, the PPI network can be integrated with the gene ontology (GO) network to infer associations between GO terms on the basis of associated protein interactions [29].

3. Analyzing the integrated multiple networks to study emergent properties of interest. In order to successfully carry out these analyses, representation of the integrated networks using semantic models in feature #1 will have to support querying and reasoning.

5.3.1 Semantic Web for Network Integration and Analysis

In order to represent PPIs in a form that allows domain-agnostic integration with other sources of information and supports reasoning-based analysis, the knowledge representation mechanism should be able to capture both the underlying structure and the semantics of the information using domain-agnostic constructs that support inferencing.

Such a knowledge representation mechanism should provide an open standard for capturing structure by specifying a syntax that supports markup. It should also provide formal means for modeling primitives to capture meaning of the concepts being represented and a grammar for modeling relationships between the primitives/concepts.

These primitives, such as classes and their properties, can then be used to capture complex concepts and relationships, and develop domain-specific vocabularies or ontologies [113]. One framework that provides such a knowledge representation mechanism and is being used in a number of systems biology applications based on the use of protein-protein interactions is described next.

5.3.1.1 Resource Description Framework (RDF)

RDF is a framework for describing web resources using XML for underlying representation/serialization. It provides a standard mechanism for making statements about resources. These statements provide a description of the resources via attribute-value pairs and form the basis of knowledge representation [121].

RDF represents information in the form of a triplet analogous to the subject, verb and object in a sentence. This simple but powerful and flexible mechanism of describing information makes RDF domain-agnostic. Moreover, this representation allows for integrating information from diverse sources to form a virtual knowledgebase of heterogeneous information. Explicit semantics described by the RDF statements can be used to make useful inferences while performing intelligent queries.
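The triplet idea, and the way it lets information from different sources be combined, can be sketched without any RDF library by treating each statement as a (subject, predicate, object) tuple. The namespace `http://example.org/ppi#`, the predicate names and the two toy sources below are assumptions made for this illustration and do not come from any real vocabulary.

```python
# Hypothetical namespace and predicate names, for illustration only.
EX = "http://example.org/ppi#"
P53, MDM2 = EX + "p53", EX + "MDM2"

# Statements from a hypothetical interaction source.
interaction_source = {
    (P53, EX + "interactsWith", MDM2),
}

# Statements about the same resources from a second hypothetical source.
annotation_source = {
    (P53, EX + "name", "p53"),
    (MDM2, EX + "name", "MDM2"),
}

# Because both sources identify resources by the same URIs,
# integration is simply the union of the two statement sets.
knowledge_base = interaction_source | annotation_source


def match(triples, subject=None, predicate=None, obj=None):
    """Return the statements matching a pattern; None acts as a wildcard."""
    return sorted(t for t in triples
                  if (subject is None or t[0] == subject)
                  and (predicate is None or t[1] == predicate)
                  and (obj is None or t[2] == obj))


about_p53 = match(knowledge_base, subject=P53)
partners = [o for _, _, o in match(knowledge_base, subject=P53,
                                   predicate=EX + "interactsWith")]
print(about_p53)
print(partners)
```

The pattern-matching function is a crude stand-in for a semantic query language: a query is a triple in which some positions are left open. Real RDF stores add typed literals, blank nodes and inference on top of this basic model.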

5.3.1.2 RDF Data Model

The RDF data model, as shown in figure 5.6, consists of resources, literals and statements. Resources are analogous to nouns and are associated with a Uniform Resource Identifier (URI). Property is a specific type of resource that can be used to describe both the attributes of a resource and the relationship between two resources. Since properties are themselves resources, they can also be described by properties [121].

Figure 5.6 RDF triplet and its relation to the RDF data model.

5.3.1.3 RDF Representation/Serialization Using RDF/XML

RDF/XML provides a standard way to serialize knowledge represented using a collection of RDF statements. This representation has the advantage of building on the already standardized XML technology features such as URIs and XML Namespaces.

Figure 5.7 shows an RDF-based example depicting a simplified version of a PPI data-model for protein ‘p53’ interacting with protein ‘MDM2’.

1:

2:

3: xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"

4: xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"

5: xmlns:ppi="http://www.ppi.com/ppis#">

6:

7: p53

8:

9:

10: MDM2

11:

12:

13:

14:

15:

Figure 5.7 RDF/XML serialization of a simplified version of PPI data-model.
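A comparable RDF/XML serialization can be produced and read back with Python's standard `xml.etree.ElementTree` module, as sketched below. The two namespace URIs are the ones declared in figure 5.7; the element and property names (`ppi:Protein`, `ppi:name`, `ppi:interactsWith`) are hypothetical choices made for this sketch and are not taken from the figure.

```python
import xml.etree.ElementTree as ET

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
PPI_NS = "http://www.ppi.com/ppis#"

# Ask ElementTree to use readable prefixes when serializing.
ET.register_namespace("rdf", RDF_NS)
ET.register_namespace("ppi", PPI_NS)


def q(namespace, local_name):
    """Build a qualified name in ElementTree's {namespace}local notation."""
    return "{%s}%s" % (namespace, local_name)


def add_protein(root, name, interacts_with=None):
    """Describe one protein; element and property names are hypothetical."""
    node = ET.SubElement(root, q(PPI_NS, "Protein"),
                         {q(RDF_NS, "about"): PPI_NS + name})
    label = ET.SubElement(node, q(PPI_NS, "name"))
    label.text = name
    if interacts_with is not None:
        ET.SubElement(node, q(PPI_NS, "interactsWith"),
                      {q(RDF_NS, "resource"): PPI_NS + interacts_with})
    return node


def extract_interactions(xml_text):
    """Recover (subject URI, object URI) pairs for ppi:interactsWith."""
    parsed = ET.fromstring(xml_text)
    pairs = []
    for protein in parsed.findall(q(PPI_NS, "Protein")):
        subject = protein.get(q(RDF_NS, "about"))
        for link in protein.findall(q(PPI_NS, "interactsWith")):
            pairs.append((subject, link.get(q(RDF_NS, "resource"))))
    return pairs


root = ET.Element(q(RDF_NS, "RDF"))
add_protein(root, "p53", interacts_with="MDM2")
add_protein(root, "MDM2")

rdf_xml = ET.tostring(root, encoding="unicode")
interactions = extract_interactions(rdf_xml)
print(rdf_xml)
print(interactions)
```

Each `ppi:Protein` element is the subject of its statements, each child element is a predicate, and the child's text or `rdf:resource` attribute is the object. This is plain XML handling rather than a full RDF parser, so it only understands the one layout it produces.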

Figure 5.8 below shows the graphical representation of the RDF/XML based PPI example shown above in figure 5.7. The RDF/XML graph was generated using RDF Gravity (RDF Graph Visualization Tool). RDF Gravity is based on the Jena Semantic Web toolkit and the JUNG (Java Universal Network Graph) specification. RDF triplets consisting of {Subject, Predicate, Object} can be seen in the graph.

Figure 5.8 Graphical representation of the PPI from figure 5.7 generated by RDF Gravity.

While RDF provides the abstract data-model, RDF Schema (RDFS) provides RDF’s vocabulary description language. RDFS is essentially a semantic extension of RDF and builds on the RDF data model to develop vocabulary (ontological primitives) that can be used to capture knowledge semantics. It adds semantics for describing properties and classes of RDF resources and generalization hierarchies of these properties and classes. Web Ontology Language (OWL) further builds on RDFS to provide mechanisms for advanced semantic concepts, property restrictions (for example, cardinality constraints) and inferencing capabilities [88].

5.3.2 PPI and Information Retrieval for Pathway Construction

A 2011 update of the molecular biology database collection contains 1,330 publicly accessible databases related to various areas of molecular biology [46]. As the number of databases in the collection has grown over the years, there has been a concern that the proliferation is doing little towards building a single body of knowledge required to use them effectively. In order to encourage integration between complementary databases, a consortium of scientists has recommended a standard to promote consistency, inter-operability and adherence to semantic and syntactic standards [49]. Semantic representation of PPIs can be particularly useful in facilitating integration among suitable databases, leading to the development of well annotated pathways designed with specific systems biology applications in mind [140]. Such an approach consists of the following steps:

1. Creating semantic representation of the PPIs using one or more interaction database(s).

2. Traversing the semantic representation to extract a semantic representation of each protein in the interactome, for example p53 and MDM2.

3. Using this information to query the complementary databases via a semantic interface. As an illustration, see figure 5.9 where each of the databases A, B, C & D is accessed via a semantic link-hub whose purpose is to translate between multiple ID-spaces, for example, between protein-ID in the interactome and database A.

4. Merging the information from each of the databases with the semantic representation of the protein used in the query.

This approach to semantic integration can also be used with text mining. In such a case, the interactome is traversed and semantic information for each protein is used to create a document vector containing terms that are important for text mining. This document vector is then used to query the document-management-systems containing documents of interest [140].

Figure 5.9 Pathway development by semantic integration of databases.

5.3.3 PPI and Understanding the Mechanism of Action in a Disease

Understanding the mechanism of action of a disease is of vital importance when developing a diagnostic or therapeutic approach for that disease. Knowledge of protein-protein interactions can play a central role in furthering our understanding of the underlying mechanism of action.

This can be achieved by:

1. Developing a semantic representation of the interactome using a standard ontology, for example a BioPAX ontology based representation of Reactome [92].

2. Extracting Entrez Gene (EG) records of the genes known to be involved in the diseases under study [94].

3. Converting the EG records extracted in step #2 into an RDF format suitable for integration with the semantic representation of PPIs from step #1.

4. Semantically integrating the representations from steps #1 and #3.

5. Analyzing the integrated representation using semantic query languages [129].
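
The integration and query steps can be illustrated with a toy triple store. The sketch below assumes that both representations are reduced to (subject, predicate, object) triples that share identifiers; the predicates and IDs are hypothetical, and a real pipeline would more likely use an RDF store with a semantic query language such as SPARQL rather than Python sets.

```python
def match(triples, s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


# Step 1 (hypothetical data): interactome expressed as triples.
ppi_triples = {
    ("protein:P1", "interactsWith", "protein:P2"),
    ("protein:P2", "interactsWith", "protein:P3"),
}

# Steps 2 and 3 (hypothetical data): gene records expressed as triples.
gene_triples = {
    ("gene:G1", "encodes", "protein:P1"),
    ("gene:G1", "associatedWith", "disease:D1"),
}

# Step 4: with shared identifiers, integration reduces to a set union.
graph = ppi_triples | gene_triples


def partners_of_disease_proteins(graph, disease):
    """Step 5: proteins interacting with products of disease-linked genes."""
    partners = set()
    for gene, _, _ in match(graph, p="associatedWith", o=disease):
        for _, _, protein in match(graph, s=gene, p="encodes"):
            partners.update(o for _, _, o in match(graph, s=protein, p="interactsWith"))
            partners.update(s for s, _, _ in match(graph, p="interactsWith", o=protein))
    return sorted(partners)


if __name__ == "__main__":
    print(partners_of_disease_proteins(graph, "disease:D1"))
```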

5.3.4 PPI and Target Identification in Drug Discovery

The goal of target identification in a drug discovery pipeline is to identify the protein(s) that will make good target(s) against which a small-molecule drug can be developed during subsequent stages of the pipeline. One of the challenges facing scientists today is to analyze the function of proteins in the larger biological context so that druggable target proteins can be identified and prioritized. While PPIs are helpful in determining protein function, target identification requires an understanding of the biological context in which the interaction occurs. This context may include the temporal sequence of the interaction, differential expression of the proteins and associated phenotypes, the nature of the interaction (activation/inhibition), the location of the interaction, and genetic linkage of the corresponding genes with the disease [58, 106].

The process for developing a prioritized list of targets for the disease may include:

1. Developing a semantic representation of the PPIs.

2. Identifying candidate gene/protein targets from the genetic linkage and/or gene expression analyses carried out for the disease.

3. Extracting Gene Ontology (GO) information for the candidate genes using the NCBI Entrez Gene database. This will provide information related to the molecular function, biological processes and cellular components of the candidate gene.

4. Converting the information from step #2 into semantic form using RDF/XML.

5. Extracting phenotypes and clinical features associated with the disease from phenotype databases of the disease, and converting this information into semantic form using RDF/XML.

6. Extending the PPI representation from step #1 by integrating it with the semantic representations developed in steps #3, #4 and #5.

7. Ranking the proteins involved in the disease pathway/process based on cellular location and the nature of the cellular processes affected.

8. Selecting the top-ranked genes/proteins for the target validation step of the drug discovery pipeline.

If the primary target identified is not druggable (does not respond to small molecule ligands) or has a poor therapeutic window (dose required for efficacy results in toxicity), determining the downstream protein interaction cascade can be useful in identifying multiple downstream targets. Several of these targets can then be modulated simultaneously to achieve the desired cumulative efficacy while using a lower dose of drug for each individual target [118].

Semantic integration of the PPI representation in step #5 with toxicity data can further help in selecting targets that are at low risk of failure during the late stages of the drug discovery and development pipeline.

5.3.5 PPI and Comparative Genomics

PPI networks of different organisms (see figure 5.10) can be used to carry out comparative genomic studies [33].

Figure 5.10 Protein interaction network across species along with gene specific phylogenetic tree.

This can be achieved by the following steps:

1. Develop a semantic representation of the interactome of each species.

2. Connect the orthologous proteins across each pair of interactomes used in the comparative genomic study.

3. Analyze the integrated representation in light of the phylogenetic trees of individual proteins across the species.

The advantage of this approach is that the evolutionary relationship of the genes or proteins can be studied in the larger context of the pathways in which these proteins participate.
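
One simple analysis that becomes possible once orthologs are connected is the search for interactions conserved between two species. The sketch below assumes each interactome is a list of undirected protein pairs and that orthology is given as a one-to-one mapping; the protein IDs are hypothetical, and real ortholog relations are often many-to-many.

```python
def conserved_interactions(edges_a, edges_b, orthologs_a_to_b):
    """Return interactions in species A whose orthologous pair also interacts in B."""
    # Undirected edges: store each pair as a frozenset so order does not matter.
    edges_b_set = {frozenset(edge) for edge in edges_b}
    conserved = []
    for u, v in edges_a:
        ortholog_u = orthologs_a_to_b.get(u)
        ortholog_v = orthologs_a_to_b.get(v)
        if ortholog_u is None or ortholog_v is None:
            continue  # at least one protein has no known ortholog
        if frozenset((ortholog_u, ortholog_v)) in edges_b_set:
            conserved.append(((u, v), (ortholog_u, ortholog_v)))
    return conserved


if __name__ == "__main__":
    species_a = [("a1", "a2"), ("a2", "a3")]
    species_b = [("b2", "b1"), ("b3", "b4")]
    orthologs = {"a1": "b1", "a2": "b2", "a3": "b3"}
    print(conserved_interactions(species_a, species_b, orthologs))
```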


CHAPTER VI

CONCLUSION

In the post-genomic world, one of the most important and challenging problems is to understand PPIs on a large scale. PPIs are integral to the underlying mechanisms of fundamental cellular processes such as DNA replication, transcription, translation, splicing, secretion, cell cycle control, signal transduction, and intermediary metabolism.

There exist a number of specialized biological databases which store protein interactions of various kinds. Some of the well-known PPI databases are DIP, BioGRID, IntAct, BIND and MINT. These databases can be used not only to study an individual PPI but also to research larger scale regulatory and signaling pathways and protein interaction networks at the cellular and systemic level.

A number of experimental methods have traditionally helped in detecting PPIs on a small scale. These methods include protein affinity chromatography, affinity blotting, immunoprecipitation and cross-linking. Unfortunately, these methods are limited in their throughput. Recently, large-scale high-throughput methods have made available an increasing amount of PPI data. However, this data contains a significant amount of erroneous information in the form of false positives and false negatives [85]. von Mering et al. [162] estimated that more than half of the interactions in the high-throughput PPI data they studied were spurious. In another study, out of about 80,000 yeast PPIs pooled from different high-throughput methods, only about 2,400 PPIs were covered by more than one method [162], severely limiting their reliability.

Because of the limitations inherent to the experimental approaches for detecting PPIs, computational predictions are proving to be instrumental in narrowing down the set of putative PPIs and in assigning confidence measures to experimentally derived PPIs.

The goal of this dissertation is to devise a computational predictor, called “PPI Predictor”, to predict PPIs with high accuracy. In other words, relative to a random predictor, the PPI Predictor should be able to increase the associated likelihood for true PPIs and decrease the associated likelihood for non-PPIs. The discriminative ability of the PPI Predictor to correctly classify PPIs as positive or negative is measured as classification accuracy. In order to achieve that goal, the PPI Predictor integrates a number of proteomic features. While the principles of this research are essentially applicable to any species, for the purpose of this research S. cerevisiae (baker’s yeast) was chosen as the model organism. Therefore, the proteomic features used in this dissertation refer to the databases directly or indirectly derived from molecular biology experiments related to yeast.

The features chosen for the purpose of this research were gene expression, gene ontology, MIPS functions, sequence patterns such as motifs and domains, and protein essentiality. While these features have little or no correlation with each other, they all share some degree of relationship with the ability of proteins to interact with each other. Therefore, each of the features is a suitable candidate as a classifier with some power to predict PPIs. In order to harness the predictive power of each of the individual features, feature specific approaches were devised to characterize the relationship between the feature based similarity for a given pair of proteins and the probability that the same pair of proteins interact with each other. Gold Standard data comprising high confidence PPIs and non-PPIs was used as evidence of interaction or lack thereof. Three of these approaches are novel in their design and application while the other two adopt already known techniques towards the goal of this dissertation.

A comparison of the ROC curves of the five features suggested that GO and MIPS are the best predictors, followed by gene expression as the third best. The motif and essentiality features are relatively poor predictors. However, a Bayesian integration is able to take advantage of all the features (including the weak ones) in order to maintain a high overall predictive power and high coverage across multiple features.

The predictive power of the individual features was integrated using Bayesian methods. Since all the features were computed on a proteomic scale, the Bayesian integration yields likelihood values for all possible pairs of proteins in the proteome. This has the added benefit of making it possible to list putative PPIs in decreasing order of confidence, expressed as likelihood values.
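
The idea of combining per-feature evidence and ranking protein pairs can be sketched as follows. This is a minimal sketch under a naive Bayes assumption (features conditionally independent given interaction status), so that the combined likelihood ratio is the product of the per-feature likelihood ratios; whether this matches the exact formulation used in this dissertation is an assumption, and the protein IDs and likelihood ratios are hypothetical. A feature with no data for a pair is represented as None and treated as uninformative (ratio of 1).

```python
from math import prod


def combined_likelihood_ratio(feature_lrs):
    """Multiply per-feature likelihood ratios, skipping missing features (None)."""
    return prod(lr for lr in feature_lrs if lr is not None)


def rank_pairs(pair_to_lrs):
    """Return (pair, combined likelihood ratio) sorted from most to least likely."""
    scored = {pair: combined_likelihood_ratio(lrs)
              for pair, lrs in pair_to_lrs.items()}
    return sorted(scored.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Hypothetical likelihood ratios for three protein pairs.
    evidence = {
        ("P1", "P2"): [4.0, 2.0, None, 0.5],
        ("P1", "P3"): [0.5, 0.5],
        ("P2", "P3"): [8.0],
    }
    for pair, score in rank_pairs(evidence):
        print(pair, score)
```

Skipping missing features is what allows weak but widely available features to extend coverage without penalizing pairs for which a strong feature is unavailable.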

One way to measure the accuracy of a classifier is using ROC curves. The area under the curve (AUC) is often used as a measure of classification accuracy and is used in this research to assess the performance of the PPI Predictor. Therefore, another way to frame the goal of this research is to evaluate the null hypothesis which states that the accuracy of the PPI Predictor is 0.5 (the same as a random predictor). The null hypothesis was rejected in favor of the alternative hypothesis, which claims that the accuracy of the PPI Predictor is greater than 0.5. An accuracy of 0.9583 was obtained based on the integrated proteomic features trained against the Gold Standard data and was found to be significantly higher than 0.5 at the 1% level of significance.
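
To make the link between the area under the ROC curve and a random predictor concrete, the sketch below computes the area as the probability that a randomly chosen positive example scores higher than a randomly chosen negative one (ties count one half). This rank-based formulation is a standard equivalent of the area under the ROC curve; it is shown for illustration and is not necessarily how the values reported here were computed. The scores and labels are hypothetical.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties contribute half a win
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    print(roc_auc([0.9, 0.8, 0.4, 0.3], [1, 0, 1, 0]))
```

A predictor that assigns the same score to every pair wins exactly half of the comparisons, which is why 0.5 serves as the value under the null hypothesis.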

In order to validate the results of the research, 10-fold cross-validation was used whereby 10 iterations were run by randomly dividing the Gold Standard data into 9 parts for training and 1 part for testing. Based on the combined features, an ROC curve was generated and the accuracy and likelihood values were computed for each iteration. The accuracy values computed during these 10 iterations were averaged to reduce the variability attributable to any individual iteration. The average accuracy, based on 10-fold cross-validation, was found to be 0.9396.
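
The cross-validation procedure can be sketched as below. The sketch assumes the common variant in which the data is shuffled once and split into 10 disjoint folds, each used exactly once for testing; the function `train_and_score` is a hypothetical callback standing in for training the predictor on the training part and returning its accuracy on the test part.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Shuffle item indices and split them into k disjoint folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(examples, train_and_score, k=10, seed=0):
    """Average the score over k train/test splits of the examples."""
    scores = []
    for test_indices in k_fold_indices(len(examples), k, seed):
        held_out = set(test_indices)
        train = [ex for i, ex in enumerate(examples) if i not in held_out]
        test = [examples[i] for i in test_indices]
        scores.append(train_and_score(train, test))
    return sum(scores) / len(scores)


if __name__ == "__main__":
    data = list(range(100))
    # Dummy scorer: reports the fraction of the data held out in each fold.
    print(cross_validate(data, lambda train, test: len(test) / len(data)))
```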

Due to the fundamental role PPIs play in the determination of biological function, improved accuracy in their detection is crucially important in a number of areas. Identification of novel PPIs can directly impact and enhance the quality of the signaling pathways crucial to the manifestation of a disease. This in turn can lead to early validation of novel targets and further improve the effectiveness of high throughput screening during the lead identification phase of the rational drug discovery process. Use of the Semantic Web in the representation of PPI networks is of crucial importance in systems biology research. Integration of novel PPIs with other relevant biological information using ontological representation of PPI networks can help in understanding the mechanism of action of a disease, leading to novel target identification for drug discovery.

This research is highly cross-disciplinary in nature and traverses multiple domains including molecular biology, computational statistics and computer science. This makes it possible to extend this research work through the application of new perspectives and tools from the aforementioned areas. This research work can be extended in the future to explore the suitability of additional proteomic features. That would require devising specific techniques to study those additional features. During this research a few additional features were explored. Two of them, network topology and text mining, are described in detail in the literature review chapter. In addition, this work can also be extended to include other species, including humans. Lastly, alternative machine learning techniques such as bootstrap aggregating, boosting and logistic regression can be applied to integrate multiple proteomic features.


REFERENCES

1. BioGRID Database Statistics. cited 2011; Available from: http://wiki.thebiogrid.org/doku.php/statistics

2. File Format Guide. cited 2011; Available from: http://www.geneontology.org/GO.format.shtml

3. IntAct. cited 2011; Available from: http://www.ebi.ac.uk/intact/

4. Mapping OBO to OWL. cited 2011; Available from: http://www.bioontology.org/wiki/index.php/OboInOwl:Main_Page

5. Mint Database. cited 2011; Available from: http://mint.bio.uniroma2.it

6. Akerley BJ, Rubin EJ, Novick VL, Amaya K, Judson N, Mekalanos JJ. A genome-scale analysis for identification of genes required for growth or survival of Haemophilus influenzae. Proceedings of the National Academy of Sciences. 2002; 99(2):966-971.

7. Albert R, Barabasi A. Statistical mechanics of complex networks. Rev Mod Phys. 2002; 74:47-97.

8. Ananiadou S, Pyysalo S, Tsujii J, Kell DB. Event extraction for systems biology by text mining the literature. Trends Biotechnol. 2010; 28(7):381-390.

9. Angenendt P, Glokler J, Murphy D, Lehrach H, Cahill DJ. Toward optimized antibody microarrays: a comparison of current microarray support materials. Anal Biochem. 2002; 309(2):253-260.

10. Aranda B, Achuthan P, Alam-Faruque Y, Armean I, Bridge A, Derow C, Feuermann M, Ghanbarian AT, Kerrien S, Khadake J, Kerssemakers J, Leroy C, Menden M, Michaut M, Montecchi-Palazzi L, Neuhauser SN, Orchard S, Perreau V, Roechert B, van Eijk K, Hermjakob H. The IntAct molecular interaction database in 2010. Nucleic Acids Res. 2010; 38(Database issue):D525-531.

11. Arkin M. Protein-protein interactions and cancer: small molecules going in for the kill. Curr Opin Chem Biol. 2005; 9(3):317-324.

12. Aronheim A. Improved efficiency Sos recruitment system: expression of the mammalian GAP reduces isolation of Ras GTPase false positives. Nucleic Acids Research. 1997; 25(16):3373-3374.

13. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G. Gene Ontology: tool for the unification of biology. The Gene Ontology Consortium. Nature Genet. 2000; 25:25-29.

14. Auerbach D, Thaminy S, Hottiger MO, Stagljar I. Post-yeast-two hybrid era of interactive proteomics: facts and perspectives. Proteomics. 2002; 2:611-623.

15. Bader GD, Betel D, Hogue CW. BIND: the Biomolecular Interaction Network Database. Nucleic Acids Res. 2003; 31(1):248-250.

16. Bader GD, Hogue CW. BIND--a data specification for storing and describing biomolecular interactions, molecular complexes and pathways. Bioinformatics. 2000; 16(5):465-477.

17. Bakhtiar R. Biomarkers in drug discovery and development. J Pharmacol Toxicol Methods. 2008; 57(2):85-91.

18. Barabasi AL, Oltvai ZN. Network biology: understanding the cell’s functional organization. Nature Reviews. 2004; 5:101-113.

19. Björkelund H, Gedda L, Andersson K. Avoiding false negative results in specificity analysis of protein–protein interactions. Journal of Molecular Recognition. 2011; 24(1):81-89.

20. Broder YC, Katz S, Aronheim A. The Ras recruitment system, a novel approach to the study of protein-protein interactions. Current biology. 1998; 8(20):1121-1130.

21. Brown D, Superti-Furga G. Rediscovering the sweet spot in drug discovery. Drug Discov Today. 2003; 8(23):1067-1077.

22. Bui QC, Katrenko S, Sloot PM. A hybrid approach to extract protein-protein interactions. Bioinformatics. 2011; 27(2):259-265.

23. Cagney G, Uetz P, Fields S. High-throughput screening for protein-protein interactions using two-hybrid assay. Methods Enzymol. 2000; 328:3-14.

24. Carr RA, Congreve M, Murray CW, Rees DC. Fragment-based lead discovery: leads by design. Drug Discov Today. 2005; 10(14):987-992.

25. Ceol A, Chatr-aryamontri A, Santonico E, Sacco R, Castagnoli L, Cesareni G. DOMINO: a database of domain-peptide interactions. Nucleic Acids Res. 2007; 35(Database issue):D557-560.

26. Ceol A, Chatr Aryamontri A, Licata L, Peluso D, Briganti L, Perfetto L, Castagnoli L, Cesareni G. MINT, the molecular interaction database: 2009 update. Nucleic Acids Res. 2010; 38(Database issue):D532-539.

27. Charles PT, Goldman ER, Rangasammy JG, Schauer CL, Chen MS, Taitt CR. Fabrication and characterization of 3D hydrogel microarrays to measure antigenicity and antibody functionality for biosensor applications. Biosens Bioelectron. 2004; 20(4):753-764.

28. Chatr-aryamontri A, Ceol A, Peluso D, Nardozza A, Panni S, Sacco F, Tinti M, Smolyar A, Castagnoli L, Vidal M, Cusick ME, Cesareni G. VirusMINT: a viral protein interaction database. Nucleic Acids Res. 2009; 37(Database issue):D669-673.

29. Chen H, Ding L, Wu Z, Yu T, Dhanapalan L, Chen JY. Semantic web for integrated network analysis in biomedicine. Brief Bioinform. 2009; 10(2):177-192.

30. Chung F, Lu L. The average distances in random graphs with given expected degrees. Proc Natl Acad Sci USA. 2002; 99(25):15879-15882.

31. Clark DP. Molecular biology: academic cell update. Amsterdam; Boston: Academic Press/Elsevier; 2010.

32. Cohen R, Havlin S. Scale-free networks are ultra small. Phys Rev Lett. 2003; 90:058701.

33. de Silva E, Stumpf MP. Complex networks and simple models in biology. J R Soc Interface. 2005; 2(5):419-430.

34. Dengler U, Siddiqui AS, Barton GJ. Protein structural domains: Analysis of the 3Dee domains database. Proteins: Structure, Function, and Genetics. 2001; 42(3):332-344.

35. Deutschbauer AM, Williams RM, Chu AM, Davis RW. Parallel phenotypic analysis of sporulation and postgermination growth in Saccharomyces cerevisiae. Proceedings of the National Academy of Sciences. 2002; 99(24):15530-15535.

36. Dujon B. The yeast genome project: what did we learn? Trends in Genetics. 1996; 12(7):263-270.

37. Einhauer A, Jungbauer A. The FLAG™ peptide, a versatile fusion tag for the purification of recombinant proteins. Journal of Biochemical and Biophysical Methods. 2001; 49(1-3):455-465.

38. Eisen MB, Spellman PT, Brown PO, Botstein D. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA. 1998; 95(25):14863-14868.

39. Elia AE, Cantley LC, Yaffe MB. Proteomic screen finds pSer/pThr-binding domain localizing Plk1 to mitotic substrates. Science. 2003; 299(5610):1228-1231.

40. Eyckerman S, Verhee A, der Heyden JV, Lemmens I, Ostade XV, Vandekerckhove J, Tavernier J. Design and application of a cytokine-receptor-based interaction trap. Nat Cell Biol. 2001; 3(12):1114-1119.

41. Fan JS, Zhang M. Signaling complex organization by PDZ domain proteins. Neurosignals. 2002; 11(6):315-321.

42. Feldmann H. Yeast: Molecular and Cell Biology [Paperback]. Wiley-VCH; 2010. p. 348.

43. Fell DA, Wagner A. The small world of metabolism. Nat Biotechnol. 2000; 18(11):1121-1122.

44. Fellbaum C, editor. WordNet. An electronic lexical database. Massachusetts, Cambridge: MIT Press; 1998.

45. Fields S, Song O. A novel genetic system to detect protein-protein interactions. Nature. 1989; 340(6230):245-246.

46. Galperin MY, Cochrane GR. The 2011 Nucleic Acids Research Database Issue and the online Molecular Biology Database Collection. Nucleic Acids Res. 2011; 39(Database issue):D1-6.

47. Gasch A, Huang M, Metzner S, Botstein D, Elledge S, Brown P. Genomic expression responses to DNA-damaging agents and the regulatory role of the yeast ATR homolog Mec1p. Mol Biol Cell. 2001; 12:2987-3003.

48. Gasch A, Spellman P, Kao C, Carmel-Harel O, Eisen M, Storz G, Botstein D, Brown P. Genomic expression programs in the response of yeast cells to environmental changes. Mol Biol Cell. 2000; 11:4241-4257.

49. Gaudet P, Bairoch A, Field D, Sansone SA, Taylor C, Attwood TK, Bateman A, Blake JA, Bult CJ, Cherry JM, Chisholm RL, Cochrane G, Cook CE, Eppig JT, Galperin MY, Gentleman R, Goble CA, Gojobori T, Hancock JM, Howe DG, Imanishi T, Kelso J, Landsman D, Lewis SE, Mizrachi IK, Orchard S, Ouellette BF, Ranganathan S, Richardson L, Rocca-Serra P, Schofield PN, Smedley D, Southan C, Tan TW, Tatusova T, Whetzel PL, White O, Yamasaki C. Towards BioDBcore: a community-defined information specification for biological databases. Nucleic Acids Res. 2011; 39(Database issue):D7-10.

50. Gavin AC, Bosche M, Krause R, Grandi P, Marzioch M, Bauer A, Schultz J, Rick JM, Michon AM, Cruciat CM, Remor M, Hofert C, Schelder M, Brajenovic M, Ruffner H, Merino A, Klein K, Hudak M, Dickson D, Rudi T, Gnau V, Bauch A, Bastuck S, Huhse B, Leutwein C, Heurtier MA, Copley RR, Edelmann A, Querfurth E, Rybin V, Drewes G, Raida M, Bouwmeester T, Bork P, Seraphin B, Kuster B, Neubauer G, Superti-Furga G. Functional organization of the yeast proteome by systematic analysis of protein complexes. Nature. 2002; 415(6868):141-147.

51. Ge H, Liu Z, Church GM, Vidal M. Correlation between transcriptome and interactome mapping data from Saccharomyces cerevisiae. Nature Genetics. 2001; 29(4):482-486.

52. Gershon D. Microarray technology: an array of opportunities. Nature. 2002; 416(6883):885-891.

53. Gietz RD, Woods RA. Transformation of yeast by lithium acetate/single-stranded carrier DNA/polyethylene glycol method. Methods Enzymol. 2002; 350:87-96.

54. Giot L, Bader JS, Brouwer C, Chaudhuri A, Kuang B, Li Y, Hao YL, Ooi CE, Godwin B, Vitols E, Vijayadamodar G, Pochart P, Machineni H, Welsh M, Kong Y, Zerhusen B, Malcolm R, Varrone Z, Collis A, Minto M, Burgess S, McDaniel L, Stimpson E, Spriggs F, Williams J, Neurath K, Ioime N, Agee M, Voss E, Furtak K, Renzulli R, Aanensen N, Carrolla S, Bickelhaupt E, Lazovatsky Y, DaSilva A, Zhong J, Stanyon CA, Finley RL, Jr., White KP, Braverman M, Jarvie T, Gold S, Leach M, Knight J, Shimkets RA, McKenna MP, Chant J, Rothberg JM. A protein interaction map of Drosophila melanogaster. Science. 2003; 302(5651):1727-1736.

55. Goffeau A, Barrell BG, Bussey H, Davis RW, Dujon B, Feldmann H, Galibert F, Hoheisel JD, Jacq C, Johnston M, Louis EJ, Mewes HW, Murakami Y, Philippsen P, Tettelin H, Oliver SG. Life with 6000 genes. Science. 1996; 274(5287):546, 563-547.

56. Golemis EA, Khazak V. Alternative yeast two-hybrid systems. The interaction trap and interaction mating. Methods Mol Biol. 1997; 63:197-218.

57. Grigoriev A. A relationship between gene expression and protein interactions on the proteome scale: analysis of the bacteriophage T7 and the yeast Saccharomyces cerevisiae. Nucleic Acids Res. 2001; 29(17):3513-3519.

58. Gudivada RC, Qu XA, Chen J, Jegga AG, Neumann EK, Aronow BJ. Identifying disease-causal genes using Semantic Web-based representation of integrated genomic and phenomic knowledge. J Biomed Inform. 2008; 41(5):717-729.

59. Hall DA, Ptacek J, Snyder M. Protein microarray technology. Mech Ageing Dev. 2007; 128(1):161-167.

60. Hall DA, Zhu H, Zhu X, Royce T, Gerstein M, Snyder M. Regulation of gene expression by a metabolic enzyme. Science. 2004; 306(5695):482-484.

61. Ho Y, Gruhler A, Heilbut A, Bader GD, Moore L, Adams SL, Millar A, Taylor P, Bennett K, Boutilier K, Yang L, Wolting C, Donaldson I, Schandorff S, Shewnarane J, Vo M, Taggart J, Goudreault M, Muskat B, Alfarano C, Dewar D, Lin Z, Michalickova K, Willems AR, Sassi H, Nielsen PA, Rasmussen KJ, Andersen JR, Johansen LE, Hansen LH, Jespersen H, Podtelejnikov A, Nielsen E, Crawford J, Poulsen V, Sorensen BD, Matthiesen J, Hendrickson RC, Gleeson F, Pawson T, Moran MF, Durocher D, Mann M, Hogue CW, Figeys D, Tyers M. Systematic identification of protein complexes in Saccharomyces cerevisiae by mass spectrometry. Nature. 2002; 415(6868):180-183.

62. Hopp TP, Prickett KS, Price VL, Libby RT, March CJ, Pat Cerretti D, Urdal DL, Conlon PJ. A Short Polypeptide Marker Sequence Useful for Recombinant Protein Identification and Purification. Nat Biotech. 1988; 6(10):1204-1210.

63. Hu CD, Kerppola TK. Simultaneous visualization of multiple protein interactions in living cells using multicolor fluorescence complementation analysis. Nat Biotechnol. 2003; 21(5):539-545.

64. Hughes T, Marton M, Jones A, Roberts C, Stoughton R, Armour C, Bennett H, Coffey E, Dai H, He Y. Functional discovery via a compendium of expression profiles. Cell. 2000; 102:109-126.

65. Hulo N, Bairoch A, Bulliard V, Cerutti L, De Castro E, Langendijk-Genevaux PS, Pagni M, Sigrist CJ. The PROSITE database. Nucleic Acids Res. 2006; 34(Database issue):D227-230.

66. Hunter S, Apweiler R, Attwood TK, Bairoch A, Bateman A, Binns D, Bork P, Das U, Daugherty L, Duquenne L, Finn RD, Gough J, Haft D, Hulo N, Kahn D, Kelly E, Laugraud A, Letunic I, Lonsdale D, Lopez R, Madera M, Maslen J, McAnulla C, McDowall J, Mistry J, Mitchell A, Mulder N, Natale D, Orengo C, Quinn AF, Selengut JD, Sigrist CJ, Thimma M, Thomas PD, Valentin F, Wilson D, Wu CH, Yeats C. InterPro: the integrative protein signature database. Nucleic Acids Res. 2009; 37(Database issue):D211-215.

67. Hutchison CA, Peterson SN, Gill SR, Cline RT, White O, Fraser CM, Smith HO, Craig Venter J. Global Transposon Mutagenesis and a Minimal Mycoplasma Genome. Science. 1999; 286(5447):2165-2169.

68. Islinger M, Luers GH, Li KW, Loos M, Volkl A. Rat liver peroxisomes after fibrate treatment. A survey using quantitative mass spectrometry. J Biol Chem. 2007; 282(32):23055-23069.

69. Itaya M. An estimation of minimal genome size required for life. Febs Letters. 1995; 362(3):257-260.

70. Ito T, Chiba T, Ozawa R, Yoshida M, Hattori M, Sakaki Y. A comprehensive two-hybrid analysis to explore the yeast protein interactome. Proceedings of the National Academy of Sciences of the United States of America. 2001; 98(8):4569-4574.

71. Ito T, Ota K, Kubota H, Yamaguchi Y, Chiba T, Sakuraba K, Yoshida M. Roles for the two-hybrid system in exploration of the yeast protein interactome. Mol Cell Proteomics. 2002; 1(8):561-566.

72. Jansen R, Greenbaum D, Gerstein M. Relating whole-genome expression data with protein-protein interactions. Genome Res. 2002; 12(1):37-46.

73. Ji Y, Zhang B, Van Horn SF, Warren P, Woodnutt G, Burnham MKR, Rosenberg M. Identification of Critical Staphylococcal Genes Using Conditional Phenotypes Generated by Antisense RNA. Science. 2001; 293(5538):2266-2269.

74. Jiang JJ, Conrath DW. Semantic Similarity Based on Corpus Statistics and Lexical Taxonomy. Proc Int’l Conf Research in Computational Linguistics, ROCLING X; 1997.

75. Johnsson N, Varshavsky A. Split ubiquitin as a sensor of protein interactions in vivo. Proc Natl Acad Sci USA. 1994; 91(22):10340-10344.

76. Jonsson PF, Bates PA. Global topological features of cancer proteins in the human interactome. Bioinformatics. 2006; 22(18):2291-2297.

77. Joung JK, Ramm EI, Pabo CO. A bacterial two-hybrid selection system for studying protein-DNA and protein-protein interactions. Proc Natl Acad Sci USA. 2000; 97(13):7382-7387.

78. Judson N, Mekalanos JJ. TnAraOut, A transposon-based approach to identify and characterize essential bacterial genes. Nature biotechnology. 2000; 18(7):740-745.

79. Jurafsky D, Martin JH. Speech and Language Processing. Second ed: Prentice Hall; 2008. p. 725-778.

80. Karimova G, Pidoux J, Ullmann A, Ladant D. A bacterial two-hybrid system based on a reconstituted signal transduction pathway. Proc Natl Acad Sci USA. 1998; 95(10):5752-5756.

81. Kirkpatrick P, Ellis C. Chemical space. Nature Insight. 2004; 432(7019):823-823.

82. Kobayashi K, Ehrlich SD, Albertini A, Amati G, Andersen KK, Arnaud M, Asai K, Ashikaga S, Aymerich S, Bessieres P, Boland F, Brignell SC, Bron S, Bunai K, Chapuis J, Christiansen LC, Danchin A, Debarbouille M, Dervyn E, Deuerling E, Devine K, Devine SK, Dreesen O, Errington J, Fillinger S, Foster SJ, Fujita Y, Galizzi A, Gardan R, Eschevins C, Fukushima T, Haga K, Harwood CR, Hecker M, Hosoya D, Hullo MF, Kakeshita H, Karamata D, Kasahara Y, Kawamura F, Koga K, Koski P, Kuwana R, Imamura D, Ishimaru M, Ishikawa S, Ishio I, Le Coq D, Masson A, Mauel C, Meima R, Mellado RP, Moir A, Moriya S, Nagakawa E, Nanamiya H, Nakai S, Nygaard P, Ogura M, Ohanan T, O'Reilly M, O'Rourke M, Pragai Z, Pooley HM, Rapoport G, Rawlins JP, Rivas LA, Rivolta C, Sadaie A, Sadaie Y, Sarvas M, Sato T, Saxild HH, Scanlan E, Schumann W, Seegers JF, Sekiguchi J, Sekowska A, Seror SJ, Simon M, Stragier P, Studer R, Takamatsu H, Tanaka T, Takeuchi M, Thomaides HB, Vagner V, van Dijl JM, Watabe K, Wipat A, Yamamoto H, Yamamoto M, Yamamoto Y, Yamane K, Yata K, Yoshida K, Yoshikawa H, Zuber U, Ogasawara N. Essential Bacillus subtilis genes. Proc Natl Acad Sci USA. 2003; 100(8):4678-4683.

83. Kotlyar M, Jurisica I. Predicting protein-protein interactions by association mining. Inf Syst Front. 2006; 8:37-47.

84. Kramer A, Feilner T, Possling A, Radchuk V, Weschke W, Burkle L, Kersten B. Identification of barley CK2alpha targets by using the protein microarray technology. Phytochemistry. 2004; 65(12):1777-1784.

85. Kumar A, Snyder M. Protein complexes take the bait. Nature. 2002; 415(6868):123-124.

86. Kurochkin IV, Mizuno Y, Konagaya A, Sakaki Y, Schonbach C, Okazaki Y. Novel peroxisomal protease Tysnd1 processes PTS1- and PTS2-containing enzymes involved in beta-oxidation of fatty acids. EMBO J. 2007; 26(3):835-845.

87. LaCount DJ, Vignali M, Chettier R, Phansalkar A, Bell R, Hesselberth JR, Schoenfeld LW, Ota I, Sahasrabudhe S, Kurschner C, Fields S, Hughes RE. A protein interaction network of the malaria parasite Plasmodium falciparum. Nature. 2005; 438(7064):103-107.

88. Lacy LW. Owl: Representing Information Using the Web Ontology Language: Trafford Publishing; 2005.

89. Langer T, Wolber G. Pharmacophore definition and 3D searches. Drug Discovery Today: Technologies. 2004; 1(3):203-207.

90. Lewin DA, Weiner MP. Molecular biomarkers in drug development. Drug Discov Today. 2004; 9(22):976-983.

91. Lin D. An Information-Theoretic Definition of Similarity. Proc 15th Int’l Conf Machine Learning; 1998. p. 296-304.

92. Luciano JS. PAX of mind for pathway researchers. Drug Discov Today. 2005; 10(13):937-942.

93. Luo Y, Batalao A, Zhou H, Zhu L. Mammalian two-hybrid system: a complementary approach to the yeast two-hybrid system. BioTechniques. 1997; 22(2):350-352.

94. Maglott D, Ostell J, Pruitt KD, Tatusova T. Entrez Gene: gene-centered information at NCBI. Nucleic Acids Res. 2005; 33(Database issue):D54-58.

95. Marcotte EM, Pellegrini M, Ng HL, Rice DW, Yeates TO, Eisenberg D. Detecting protein function and protein-protein interactions from genome sequences. Science. 1999; 285(5428):751-753.

96. Marti F, Xu CW, Selvakumar A, Brent R, Dupont B, King PD. LCK-phosphorylated human killer cell-inhibitory receptors recruit and activate phosphatidylinositol 3-kinase. Proc Natl Acad Sci USA. 1998; 95(20):11810-11815.

97. Maslov S, Sneppen K. Specificity and stability in topology of protein networks. Science. 2002; 296(5569):910-913.

98. Matthews JM, Kowalski K, Liew CK, Sharpe BK, Fox AH, Crossley M, MacKay JP. A class of zinc fingers involved in protein-protein interactions biophysical characterization of CCHC fingers from fog and U-shaped. Eur J Biochem. 2000; 267(4):1030-1038.

99. McAleer WJ, Buynak EB, Maigetter RZ, Wampler DE, Miller WJ, Hilleman MR. Human hepatitis B vaccine from recombinant yeast. Nature. 1984; 307(5947):178-180.

100. McDermott J, Bumgarner R, Samudrala R. Functional annotation from predicted protein interaction networks. Bioinformatics. 2005; 21(15):3217-3226.

101. Mewes H, Frishman D, Guldener U, Mannhaupt G, Mayer K, Mokrejs M, Morgenstern B, Munsterkotter M, Rudd S, Weil B. MIPS: a database for genomes and protein sequences. Nucleic Acids Res. 2002; 30:31-34.

102. Michnick SW, Remy I, Campbell-Valois FX, Vallee-Belisle A, Pelletier JN. Detection of protein-protein interactions by protein fragment complementation strategies. Methods Enzymol. 2000; 328:208-230.

103. Milgram S. The small world problem. Psychology Today. 1967; 2:60.

104. Mount DW. Terms Used for Classifying Proteins Structures and Sequences. Bioinformatics: Sequence and Genome Analysis, Second Edition. Cold Spring Harbor Laboratory Press; 2004. p. 391-393.

105. Mushegian AR, Koonin EV. A minimal gene set for cellular life derived by comparison of complete bacterial genomes. Proc Natl Acad Sci USA. 1996; 93(19):10268-10273.

106. Newman A, Hunter J, Li YF, Bouton C, Davis M. The BioMANTA Ontology: The Integration of Protein-Protein Interaction Data. Interdisciplinary Ontology Conference 2008. 2008.

107. Newman ME. Scientific collaboration networks. I. Network construction and fundamental results. Phys Rev E Stat Nonlin Soft Matter Phys. 2001; 64(1 Pt 2):016131.

108. Ng R. Drugs: From Discovery to Approval: Wiley-Blackwell; 2008. p. 1-138.

109. Nooren IMA, Thornton JM. Diversity of protein-protein interactions. EMBO J. 2003; 22:3486-3492.

110. Ofran Y, Rost B. Analyzing six types of protein-protein interfaces. J Mol Biol. 2003; 325:377-387.

111. Overbeek R, Fonstein M, D’Souza M, Pusch GD, Maltsev N. The use of gene clusters to infer functional coupling. Proc Natl Acad Sci USA. 1999; 96:2896.

112. Oyama T, Kitano K, Satou K, Ito T. Extraction of knowledge on protein-protein interaction by association rule discovery. Bioinformatics. 2002; 18(5):705-714.

113. Passin TB. Explorer’s Guide to the Semantic Web: Manning Publications; 2004.

114. Pawson T, Nash P. Assembly of cell regulatory systems through protein interaction domains. Science. 2003; 300(5618):445-452.

115. Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D, Yeates TO. Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc Natl Acad Sci USA. 1999; 96(8):4285-4288.

116. Pelletier JN, Campbell-Valois FX, Michnick SW. Oligomerization domain-directed reassembly of active dihydrofolate reductase from rationally designed fragments. Proc Natl Acad Sci USA. 1998; 95(21):12141-12146.

117. Persico M, Ceol A, Gavrila C, Hoffmann R, Florio A, Cesareni G. HomoMINT: an inferred human network based on orthology mapping of protein interactions discovered in model organisms. BMC Bioinformatics. 2005; 6 Suppl 4:S21.

118. Petricoin EF, Liotta LA. Clinical applications of proteomics. J Nutr. 2003; 133(7 Suppl):2476S-2484S.

119. Phizicky EM, Fields S. Protein-protein interactions: methods for detection and analysis. Microbiol Rev. 1995; 59:94-123.

120. Pool IS, Milgram S, Newcomb T, Kochen M. The Small world. Norwood, N.J.: Ablex Pub.; 1989.

121. Powers S. Practical RDF: O'Reilly Media; 2003.

122. Pu S, Wong J, Turner B, Cho E, Wodak SJ. Up-to-date catalogues of yeast protein complexes. Nucleic Acids Res. 2009; 37(3):825-831.

123. Rada R, Milli H, Bicknell E, Blettner M. Development and Application of a metric on Semantic Nets. IEEE Trans on Systems, Man, and Cybernetics. 1989; 19(1):17-30.

124. Ratti E, Trist D. The continuing evolution of the drug discovery process in the pharmaceutical industry. Farmaco. 2001; 56(1-2):13-19.

125. Remy I, Campbell-Valois FX, Michnick SW. Detection of protein-protein interactions using a simple survival protein-fragment complementation assay based on the enzyme dihydrofolate reductase. Nat Protoc. 2007; 2(9):2120-2125.

126. Resnik P. Semantic Similarity in a Taxonomy: An Information-Based Measure and its Application to Problems of Ambiguity in Natural Language. J Artificial Intelligence Research. 1999; 11:95-130.

127. Roman H. Development of Yeast as an Experimental Organism. CSH Monographs - The Molecular Biology of the Yeast Saccharomyces: Life Cycle and Inheritance. 1981; 11(A):9.

128. Ruepp A, Zollner A, Maier D, Albermann K, Hani J, Mokrejs M, Tetko I, Guldener U, Mannhaupt G, Munsterkotter M, Mewes HW. The FunCat, a functional annotation scheme for systematic classification of proteins from whole genomes. Nucleic Acids Res. 2004; 32(18):5539-5545.

129. Sahoo SS, Bodenreider O, Rutter JL, Skinner KJ, Sheth AP. An ontology-driven semantic mashup of gene and biological pathway information: application to the domain of nicotine dependence. J Biomed Inform. 2008; 41(5):752-765.

130. Salwinski L, Miller CS, Smith AJ, Pettit FK, Bowie JU, Eisenberg D. The Database of Interacting Proteins: 2004 update. Nucleic Acids Res. 2004; 32(Database issue):D449-451.

131. Sams-Dodd F. Target-based drug discovery: is something wrong? Drug Discov Today. 2005; 10(2):139-147.

132. Sams-Dodd F. Drug discovery: selecting the optimal approach. Drug Discov Today. 2006; 11(9-10):465-472.

133. Sauer U, Heinemann M, Zamboni N. Genetics. Getting closer to the whole picture. Science. 2007; 316(5824):550-551.

134. Schwikowski B, Uetz P, Fields S. A network of protein-protein interactions in yeast. Nat Biotechnol. 2000; 18(12):1257-1261.

135. SGD. SGD project - Saccharomyces Genome Database. [cited 2011]. Available from: http://www.yeastgenome.org/

136. Shannon P, Markiel A, Ozier O, Baliga N, Wang J, Ramage D, Amin N, Schwikowski B, Ideker T. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 2003; 13(11):2498-2504.

137. Sherman F. Getting started with yeast from Fred Sherman. Methods Enzymol. 2003; 350:3-41.

138. Shon J, Park JY, Wei L. Beyond similarity-based methods to associate genes for the inference of function. Biosilico. 2003; 1(3):89-96.

139. Siegal G, Ab E, Schultz J. Integration of fragment screening and library design. Drug Discov Today. 2007; 12(23-24):1032-1039.

140. Smith AK, Cheung KH, Yip KY, Schultz M, Gerstein MK. LinkHub: a Semantic Web system that facilitates cross-database queries and information retrieval in proteomics. BMC Bioinformatics. 2007; 8 Suppl 3:S5.

141. Smith V, Chou KN, Lashkari D, Botstein D, Brown PO. Functional Analysis of the Genes of Yeast Chromosome V by Genetic Footprinting. Science. 1996; 274(5295):2069-2074.

142. Smoot ME, Ono K, Ruscheinski J, Wang PL, Ideker T. Cytoscape 2.8: new features for data integration and network visualization. Bioinformatics. 2011; 27(3):431-432.

143. Snoep J, Westerhoff H. From isolation to integration, a systems biology approach for building the Silicon Cell. In: Alberghina L, Westerhoff HV, editors. Springer Berlin / Heidelberg; 2005. p. 13-30.

144. Sole RV, Pastor-Satorras R, Smith E, Kepler TB. A model of large-scale proteome evolution. Adv Complex Systems. 2002; 5:43-54.

145. Speer N, Spieth C, Zell A. Biological Cluster Validity Indices Based on the Gene Ontology. 2005. p. 429-439.

146. Speer R, Wulfkuhle JD, Liotta LA, Petricoin EF, 3rd. Reverse-phase protein microarrays for tissue-based analysis. Curr Opin Mol Ther. 2005; 7(3):240-245.

147. Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, Futcher B. Comprehensive Identification of Cell Cycle-regulated Genes of the Yeast Saccharomyces cerevisiae by Microarray Hybridization. Mol Biol Cell. 1998; 9(12):3273-3297.

148. Sprinzak E, Margalit H. Correlated sequence-signatures as markers of protein-protein interaction. J Mol Biol. 2001; 311(4):681-692.

149. Stagljar I, Korostensky C, Johnsson N, te Heesen S. A genetic system based on split-ubiquitin for the analysis of interactions between membrane proteins in vivo. Proc Natl Acad Sci USA. 1998; 95(9):5187-5192.

150. Stark C, Breitkreutz BJ, Reguly T, Boucher L, Breitkreutz A, Tyers M. BioGRID: a general repository for interaction datasets. Nucleic Acids Res. 2006; 34(Database issue):D535-539.

151. Steinmetz LM, Scharfe C, Deutschbauer AM, Mokranjac D, Herman ZS, Jones T, Chu AM, Giaever G, Prokisch H, Oefner PJ, Davis RW. Systematic screen for human disease genes in yeast. Nat Genet. 2002; 31(4):400-404.

152. Stillman BA, Tonkinson JL. FAST slides: a novel surface for microarrays. BioTechniques. 2000; 29(3):630-635.

153. Sun J, Zhao Z. A comparative study of cancer proteins in the human protein-protein interaction network. BMC Genomics. 2010; 11 Suppl 3:S5.

154. Tarassov K, Messier V, Landry CR, Radinovic S, Serna Molina MM, Shames I, Malitskaya Y, Vogel J, Bussey H, Michnick SW. An in vivo map of the yeast protein interactome. Science. 2008; 320(5882):1465-1470.

155. Teichmann SA. Principles of protein-protein interactions. Bioinformatics. 2002; 18 Suppl 2:S249.

156. Thatcher JW, Shaw JM, Dickinson WJ. Marginal fitness contributions of nonessential genes in yeast. Proc Natl Acad Sci USA. 1998; 95(1):253-257.

157. Tonkens R. An overview of the drug development process. Physician Exec. 2005; 31(3):48-52.

158. Tu Z, Wang L, Xu M, Zhou X, Chen T, Sun F. Further understanding human disease genes by comparing with housekeeping genes and other genes. BMC Genomics. 2006; 7:31.

159. Uetz P, Giot L, Cagney G, Mansfield TA, Judson RS, Knight JR, Lockshon D, Narayan V, Srinivasan M, Pochart P, Qureshi-Emili A, Li Y, Godwin B, Conover D, Kalbfleisch T, Vijayadamodar G, Yang M, Johnston M, Fields S, Rothberg JM. A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae. Nature. 2000; 403(6770):623-627.

160. Vazquez A, Flammini A, Maritan A, Vespignani A. Global protein function prediction from protein-protein interaction networks. Nat Biotechnol. 2003; 21(6):697-700.

161. Vojtek AB, Hollenberg SM, Cooper JA. Mammalian Ras interacts directly with the serine/threonine kinase Raf. Cell. 1993; 74(1):205-214.

162. von Mering C, Krause R, Snel B, Cornell M, Oliver SG, Fields S, Bork P. Comparative assessment of large-scale data sets of protein-protein interactions. Nature. 2002; 417(6887):399-403.

163. Wagner A. The yeast protein interaction network evolves rapidly and contains few redundant duplicate genes. Mol Biol Evol. 2001; 18(7):1283-1292.

164. Walker-Taylor A, Jones DT. Computational Methods for Predicting Protein-Protein Interactions. In: Waksman G, editor. Proteomics and Protein-Protein Interactions: Springer US; 2005. p. 89-114.

165. Wehr MC, Laage R, Bolz U, Fischer TM, Grunewald S, Scheek S, Bach A, Nave KA, Rossner MJ. Monitoring regulated protein-protein interactions using split TEV. Nat Methods. 2006; 3(12):985-993.

166. Whittaker PA. The role of bioinformatics in target validation. Drug Discovery Today: Technologies. 2004; 1(2):125-133.

167. Yu H, Greenbaum D, Xin Lu H, Zhu X, Gerstein M. Genomic analysis of essentiality within protein networks. Trends Genet. 2004; 20(6):227-231.

168. Zewail A, Xie MW, Xing Y, Lin L, Zhang PF, Zou W, Saxe JP, Huang J. Novel functions of the phosphatidylinositol metabolic pathway discovered by a chemical genomics screen with wortmannin. Proc Natl Acad Sci USA. 2003; 100(6):3345-3350.

169. Zhu H, Bilgin M, Bangham R, Hall D, Casamayor A, Bertone P, Lan N, Jansen R, Bidlingmaier S, Houfek T, Mitchell T, Miller P, Dean RA, Gerstein M, Snyder M. Global analysis of protein activities using proteome chips. Science. 2001; 293(5537):2101-2105.

170. Zhu H, Bilgin M, Snyder M. Proteomics. Annu Rev Biochem. 2003; 72:783-812.

171. Zhu M, Gao L, Li X, Liu Z. Identifying drug-target proteins based on network features. Sci China C Life Sci. 2009; 52(4):398-404.
