
Prediction of Protein-Protein Interactions and Essential Genes Through Data Integration

by

Max Kotlyar

A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy, Graduate Department of Medical Biophysics, University of Toronto

Copyright © 2011 by Max Kotlyar

Abstract

Prediction of Protein-Protein Interactions and Essential Genes Through Data Integration

Max Kotlyar

Doctor of Philosophy

Graduate Department of Medical Biophysics

University of Toronto

2011

The currently known network of human protein-protein interactions (PPIs) is providing new insights into diseases and helping to identify potential therapies. However, according to several estimates, the known interaction network may represent only 10% of the entire interactome, indicating that more comprehensive knowledge of the interactome could have a major impact on understanding and treating diseases. The primary aim of this thesis was to develop computational methods to provide increased coverage of the interactome. A secondary aim was to gain a better understanding of the link between networks and phenotype, by analyzing essential mouse genes.

Two algorithms were developed to predict PPIs and provide increased coverage of the interactome: FpClass and mixed co-expression. FpClass differs from previous PPI prediction methods in two key ways: it integrates both positive and negative evidence for protein interactions, and it identifies synergies between predictive features. Through these approaches FpClass provides interaction networks with significantly improved reliability and interactome coverage. Compared to previous predicted human PPI networks, FpClass provides a network with over 10 times more interactions, about 2 times more proteins and a lower false discovery rate. This network includes 595 disease-related proteins from OMIM and Cancer Census which have no previously known interactions. The second method, mixed co-expression, aims to predict transient PPIs, which have proven difficult to detect by computational and experimental methods. Mixed co-expression makes predictions using gene co-expression and performs significantly better (p < 0.05) than the previous method for predicting PPIs from co-expression. It is especially effective for identifying interactions of transferases and signal transduction proteins.

For the second aim of the thesis, we investigated the relationship between gene essentiality and diverse gene/protein features based on gene expression, PPI and gene co-expression networks, gene/protein sequence, Gene Ontology, and orthology. We identified non-redundant features closely associated with essentiality, including centrality in PPI and gene co-expression networks. We found that no single predictive feature was effective for all essential genes; most features, including centrality, were less effective for genes associated with postnatal lethality and infertility. These results suggest that understanding phenotype will require integrating measures of network topology with information about the biology of the network's nodes and edges.

Acknowledgements

Many people helped me complete this thesis and made the PhD experience much more enjoyable. First, I would like to thank my supervisor, Dr. Igor Jurisica, for providing a huge amount of help with all aspects of the thesis and great encouragement throughout the whole process. I would also like to thank the Jurisica Lab, past and present. Special thanks to: Kristen, for amazing editing help and great collaborations; Daniela, for solidarity throughout the PhD process and very helpful thesis edits; Abraham, for excellent editing and ideas; Levi, for helping me adopt R and introducing the concept of apples with peanut butter; Dan, for urgent help with editing and biology questions; Kevin, for lots of help with the cluster, I2D, and biology questions; Yun, for text mining and very helpful ideas; Richard, for getting me through many computer crashes; Christian, for keeping the cluster going despite my repeated attempts to crash it; Ali, Dave, and Wing, for lots of Navigator help; Adrian, Attila, and Frederic, for help with I2D; Elize, for great solidarity and support; Anthony, for JavaScript help; Amira, Chiara, Fiona, and Periklis, for lots of help with slides and biology; Sara and Serene, for encouragement and reviewing ideas; Mahima and Dene, for great discussions and debates. And of course family, for incredible support; no thesis without them.

Contents

1 Introduction
1.1 Mapping the human interactome
1.2 Types of PPIs
1.3 Experimental PPI Identification Methods
1.3.1 Small-scale screens
1.3.2 HTP screens
1.4 Computational PPI Identification Methods
1.4.1 Methods using genomic analysis
1.4.2 Methods using protein primary structure
1.4.3 Methods using protein domains
1.4.4 Methods using protein tertiary structure
1.4.5 Methods analyzing interaction networks
1.4.6 Methods using data integration
1.5 PPIs networks and phenotype
1.6 Summary of Research Contributions
1.6.1 Chapter 2
1.6.2 Chapter 3
1.6.3 Chapter 4

2 Predicting human PPIs using non-independent features
2.1 Introduction
2.2 Results
2.2.1 Complementarity of computational and HTP experimental methods
2.2.2 False discovery rates of predictions
2.2.3 Coverage of predicted networks
2.2.4 Evidence for novel predicted PPIs
2.3 Discussion
2.3.1 Evaluating predictions
2.3.2 Using predictive features in novel ways
2.3.3 Predicting interactions of disease genes
2.4 Methods
2.4.1 Predicting interactions
2.4.2 Calculating enrichment of PPI sets among predictions
2.5 Supplemental Materials and Methods
2.5.1 Links to data tables
2.5.2 Datasets
2.5.3 Prediction overview
2.5.4 Features of individual proteins: description and sources
2.5.5 Calculating interaction scores from features of individual proteins
2.5.6 Features of protein pairs
2.5.7 Score integration

3 Predicting Transient PPIs From Gene Co-expression
3.1 Introduction
3.2 Results
3.2.1 Correlation distributions of stably and transiently interacting proteins have significant differences
3.2.2 Transient interactions have significantly lower correlations than stable interactions
3.2.3 Predictions of PPIs from gene co-expression are improved by considering local information
3.2.4 Mixed approach improves recall of different interaction types
3.2.5 Mixed and local approaches work best when expression datasets have genes with high variance
3.3 Discussion
3.4 Conclusions
3.5 Methods
3.5.1 Selection of gene expression datasets
3.5.2 Processing of gene expression datasets
3.5.3 Transient and stable interactions
3.5.4 Transiently and stably interacting proteins
3.5.5 Selection of protein pairs for test sets
3.5.6 Testing for significant differences in correlation distribution tails
3.5.7 Interaction prediction approaches
3.5.8 Calculating significance of performance differences
3.5.9 Selection of functional categories
3.5.10 Software
3.6 Supplemental Materials
3.6.1 Links to data tables

4 Predicting Essential Mouse Genes
4.1 Introduction
4.2 Results
4.2.1 Defining essential and non-essential mouse genes
4.2.2 Identifying features related to essentiality
4.2.3 Key features of essential genes
4.2.4 Differences among essential genes
4.2.5 Predicting essential genes
4.3 Discussion
4.3.1 Features from gene expression
4.3.2 Features based on gene and protein sequence
4.3.3 Networks and essentiality
4.4 Conclusions
4.5 Methods
4.5.1 Gene phenotypes
4.5.2 Gene expression
4.5.3 Gene co-expression networks
4.5.4 PPI networks
4.5.5 Gene length
4.5.6 Gene Ontology
4.5.7 Ortholog information
4.5.8 Protein structure
4.5.9 Protein disorder
4.5.10 Sequence biases
4.5.11 Assessing relationships between predictive variables and essentiality
4.5.12 Identifying non-redundant predictive features
4.5.13 Identifying significant differences between different types of essential genes
4.5.14 Predicting essential genes
4.6 Supplemental Materials
4.6.1 Links to data tables

5 Discussion
5.1 FpClass: extending coverage of the human interactome
5.1.1 Contributions and limitations
5.2 Mixed co-expression: identifying transient interactions and interaction context
5.2.1 Contributions and limitations
5.3 Characterizing and predicting essential mouse genes
5.3.1 Contributions and Limitations
5.4 Future improvements to introduced methods
5.5 Future applications of methods and results
5.5.1 Prediction of pathways and functional modules
5.5.2 Prediction of protein function
5.5.3 Prediction of disease genes
5.5.4 Identification of module regulators
5.6 Conclusions

Bibliography

Abbreviations

AUC Area under the receiver operating characteristic (ROC) curve

CDS Coding sequence

DIP Database of Interacting Proteins

FDR False discovery rate (= false positives / (false positives + true positives))

FPR False positive rate (= false positives / (false positives + true negatives))

GO Gene Ontology

HMS-PCI High-throughput protein complex identification

HPRD Human Protein Reference Database

HTP High-throughput

I2D Interologous Interaction Database

LUMIER Luminescence-based Mammalian Interactome Mapping

mRNA Messenger ribonucleic acid

OMIM Online Mendelian Inheritance in Man database

PDB Protein Data Bank

PPI Protein-protein interaction

PTM Post-translational modification

ROC Receiver operating characteristic

TPR True positive rate (= true positives / (true positives + false negatives))

TAP Tandem affinity purification

Y2H Yeast two-hybrid

Measures for evaluating binary predictions

coverage = true positives / (true positives + false negatives)

false discovery rate = false positives / (false positives + true positives)

false positive rate = false positives / (false positives + true negatives)

precision = true positives / (true positives + false positives)

recall = true positives / (true positives + false negatives)

sensitivity = true positives / (true positives + false negatives)

true positive rate = true positives / (true positives + false negatives)
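As a minimal sketch, the measures listed above can be computed from the four confusion-matrix counts. The class and function names below are illustrative choices and do not come from the thesis; the example counts are made up.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts of binary prediction outcomes (hypothetical container)."""
    tp: int  # true positives
    fp: int  # false positives
    tn: int  # true negatives
    fn: int  # false negatives


def false_discovery_rate(c: ConfusionCounts) -> float:
    # Fraction of predicted interactions that are wrong.
    return c.fp / (c.fp + c.tp)


def false_positive_rate(c: ConfusionCounts) -> float:
    # Fraction of true non-interactions that are predicted to interact.
    return c.fp / (c.fp + c.tn)


def precision(c: ConfusionCounts) -> float:
    # Fraction of predicted interactions that are correct (1 - FDR).
    return c.tp / (c.tp + c.fp)


def recall(c: ConfusionCounts) -> float:
    # Identical to coverage, sensitivity and true positive rate.
    return c.tp / (c.tp + c.fn)


# Made-up counts, for illustration only.
example = ConfusionCounts(tp=60, fp=40, tn=860, fn=40)
```

Coverage, recall, sensitivity and true positive rate share one formula, so a single function covers all four.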

Glossary

Affinity precipitation

An experiment where a polyclonal antibody or an epitope tag is used to remove a "bait" protein and its interaction partners from extracts, and the interaction partners are identified by immunoblot with polyclonal antibodies or other epitope tags. (http://www.yeastgenome.org/help/glossary.html)

Dosage lethality

An experiment where overexpression of a gene causes death if a second gene is mutated or deleted. (http://www.yeastgenome.org/help/glossary.html)

Floxed/Frt experiments

Conditional experiments where loxP or frt sites are inserted on each side of the coding sequence of a gene to enable Cre- or Flp-mediated recombination at a particular time or in a specific tissue. (http://www.informatics.jax.org/userdocs/allele_types.shtml)

Frequent pattern mining

Identifying all frequent patterns (e.g., itemsets, subsequences, or substructures) that appear in a data set with a frequency no less than a user-specified threshold [76].

Gene co-expression

The correlation between the expression levels of two genes across multiple conditions.

Gene knockout

A genetic technique which renders a gene inoperative by disrupting its coding sequence such that mRNA is not produced. (http://www.informatics.jax.org/userdocs/allele_types.shtml)

Interactome

The entire set of protein-protein interactions in an organism [43].

Orthologs

Homologous proteins that are present in different species and resulted from speciation events [170].

Paralogs

Homologous proteins that are present in the same species and resulted from gene duplication [170].

Postnatal lethality

Death occurring between postnatal day 2 and weaning age. (http://www.informatics.jax.org/searches/MP_form.shtml)

Prenatal/perinatal lethality

Death occurring between fertilization and birth. (http://www.informatics.jax.org/searches/MP_form.shtml)

Protein disorder

Protein segments or entire proteins which do not fold completely [52].

Chapter 1

Introduction

1.1 Mapping the human interactome

All cellular processes depend on protein-protein interactions (PPIs) [35]. Identifying and characterizing PPIs is therefore essential for understanding life at a molecular level.

A comprehensive human PPI network can help provide a mechanistic understanding of diseases, resulting in accelerated identification of drug targets and improved target prioritization [150, 7]. The currently known human PPI network is far from complete [79] but has already provided new perspectives on diseases [66], new candidate disease genes [104], and interaction subnetworks implicated in diseases [36].

With current technologies it may be possible to map large portions of human and model organism interactomes [156], but there are still formidable challenges to obtaining complete interactomes. The size of the human interactome has been estimated at 154,000-369,000 interactions by Hart et al. [79] and at roughly 650,000 interactions by Stumpf et al. [175]. However, these estimates may be too small since they are based on the results of current experimental PPI detection methods and may not fully account for transient and condition-specific interactions [175]. Also, these estimates do not consider alternative splicing and post-translational modifications of proteins. Nevertheless, if the estimate of Stumpf et al. [175] is correct then the currently known interactome is about 10% complete. Raising coverage to over 50% while maintaining low false discovery rates (FDR; footnote 1) would require a great investment of resources, but would likely be achievable with current PPI detection methods [156]. The main challenges are the high false positive and false negative rates of PPI detection methods, and the immense search space that has to be tested for interactions: assuming roughly 20,500 human protein-coding genes [37, 141], there are over 210 million possible protein pairs; if splice variants are considered, the number rises to more than 770 million (footnote 2), since 92-94% of human protein-coding genes have splice variants [193].
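The search-space figures quoted above can be checked with a short calculation. This sketch assumes, as the accompanying footnote does, that the number of transcripts is the gene count multiplied by 1.92; the variable names are illustrative.

```python
from math import comb

GENES = 20_500  # approximate number of human protein-coding genes

# Unordered pairs of distinct proteins, one protein per gene.
protein_pairs = comb(GENES, 2)

# Footnote assumption: 92% of genes contribute one additional splice variant,
# so the transcript count is GENES * 1.92 (integer arithmetic avoids rounding).
transcripts = GENES * 192 // 100
transcript_pairs = comb(transcripts, 2)

print(f"{protein_pairs:,} protein pairs")
print(f"{transcript_pairs:,} transcript pairs")
```

The two results are 210,114,750 and 774,585,120, consistent with "over 210 million" and "more than 770 million".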

Experimental PPI detection methods can be divided into two groups: small-scale and high-throughput (HTP) screens. Roughly half of known human PPIs were identified exclusively by small-scale screens, methods that are often considered reliable [190, 175, 44] but identify only a few interactions at a time and do not scale efficiently to provide high coverage of the interactome. HTP screens [22, 154, 147, 174, 109, 56] identify up to several thousand interactions per screen but their FDR and sensitivity (footnote 3) rates may pose problems. Estimates of their FDR have varied greatly [79], depending on the evaluation approach. HTP studies by Rual et al. [147] and Stelzl et al. [174] tested subsets of their detected interactions by small-scale screens and obtained FDRs of 22% and 38% [174], respectively. When PPIs from these studies were evaluated by a computational approach [79] the FDRs were 87% and 98% respectively, although there may have been insufficient data for this assessment approach [79]. The sensitivity rates of HTP methods may be a greater concern than their FDRs. Braun et al. [23] evaluated 5 HTP methods and determined sensitivity rates of 21% to 36%, with FDRs of 0% to 11%. Combining the 5 methods resulted in sensitivity of 59% and FDR of 14%. The main strategies for improving the FDR and sensitivity of HTP screens include executing repeated screens [206], using several types of HTP screens [23], or combining HTP screens and computational PPI prediction methods [156].

Footnote 1: false discovery rate = (false positives) / (false positives + true positives)

Footnote 2: Assuming 20,500 human genes, of which 92% have multiple splice variants, the possible number of transcript pairs is $\binom{20{,}500 \times 1.92}{2} > 770{,}000{,}000$.

Footnote 3: sensitivity = (true positives) / (true positives + false negatives)

Computational methods provide an alternative approach for identifying PPIs. Since they do not directly detect interactions, they cannot provide the same level of confirmation as experimental methods. However, their FDRs are comparable with experimental methods and they provide an efficient means for dealing with the immense search space of potential interactions. Taking full advantage of these strengths depends on effectively integrating computational methods with experimental screens [156]. In an integrated approach, computational methods would provide preliminary probabilities of interaction for all protein pairs. These probabilities would not replace experimental testing but would help determine the number of experimental screens required for establishing a high level of confidence in an interaction. For example, if the computationally derived probability is high, then only a few experimental screens would be required to establish a high degree of certainty in the interaction; conversely, if the computed probability is low, more screens would be needed. Schwartz et al. [156] showed that such an integrated strategy would be the most efficient approach for achieving a comprehensive (≥ 50% coverage) and reliable (5% FDR) mapping of the interactome. They estimated that with current experimental methods alone, achieving 50% coverage of the interactome would require 969,000 screens, or approximately 403 person-years. By combining experimental screens and computational predictions, the number of screens could be reduced to 69,000, or approximately 29 person-years. No laboratories have undertaken a full-scale integrated mapping project yet, but methods developed in this thesis have provided interaction predictions for all human protein pairs, with FDR comparable to HTP screens and significantly lower than previous PPI prediction methods.
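The intuition that a high computational prior reduces the number of confirming screens can be illustrated with a simple Bayesian update. This is only a sketch under strong assumptions and is not the model of Schwartz et al. [156]: screens are treated as independent, every screen is assumed to report the interaction, and the sensitivity and false positive rate values are hypothetical.

```python
def posterior_after_positive_screens(prior: float, sensitivity: float,
                                     false_positive_rate: float,
                                     n_screens: int) -> float:
    """Posterior probability of interaction after n independent positive screens."""
    likelihood_ratio = sensitivity / false_positive_rate
    prior_odds = prior / (1.0 - prior)
    posterior_odds = prior_odds * likelihood_ratio ** n_screens
    return posterior_odds / (1.0 + posterior_odds)


def screens_needed(prior: float, sensitivity: float, false_positive_rate: float,
                   target: float = 0.95, max_screens: int = 100) -> int:
    """Smallest number of positive screens that lifts the posterior to the target."""
    for n in range(max_screens + 1):
        if posterior_after_positive_screens(
                prior, sensitivity, false_positive_rate, n) >= target:
            return n
    raise ValueError("target not reached within max_screens")


# Hypothetical screen characteristics: 80% sensitivity, 10% false positive rate.
high_prior = screens_needed(prior=0.5, sensitivity=0.8, false_positive_rate=0.1)
low_prior = screens_needed(prior=0.001, sensitivity=0.8, false_positive_rate=0.1)
```

With these made-up values, a pair with a prior of 0.5 needs 2 positive screens to reach 95% confidence, while a pair with a prior of 0.001 needs 5.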

While HTP experimental screens and computational predictions may provide a more comprehensive picture of the interactome, using this integrated experimental-computational network to understand diseases poses additional challenges. This task also requires a combination of computational and experimental methods. Several studies have combined interaction networks with gene expression data to predict disease status [36, 106, 126] and age [57]. Interaction networks combined with previously known gene phenotypes have been used to identify new disease genes [99, 32, 124], essential genes [80, 211, 133] and genes associated with specific tissues [42].

1.2 Types of PPIs

Binary PPIs can be categorized based on their composition, effect on protein stability and duration [128]. In terms of composition, interactions can be divided into homo-oligomeric complexes, formed by two identical chains, and hetero-oligomeric complexes, formed by non-identical chains [128]. Based on stability, interactions can be divided into obligate and non-obligate complexes. Proteins in obligate complexes do not form stable structures on their own, while proteins in non-obligate complexes can form stable structures independently [128]. Many hetero-oligomeric complexes are non-obligate, while homo-oligomeric complexes are often obligate [97, 128]. In terms of duration, interactions can be divided into transient complexes, lasting seconds or less, and permanent (or stable) complexes, lasting minutes or hours [148]. Transient complexes have a typical half-life of 0.1 to 1 seconds, and cannot be readily detected by co-purification, co-crystallization [148], or gene co-expression [91]. They are generally difficult to study because the two interacting proteins and the conditions under which they bind have to be known before a detection method is applied [139]. Transient interactions play numerous critical roles in the cell. They are involved in all protein modifications, such as phosphorylation and acetylation. Through such modifications, they regulate most cellular processes including cell growth, cell cycle, metabolic pathways, protein transport and signal transduction [139]. Permanent complexes have a typical half-life of 12 minutes to 19 hours, and can be readily detected by methods such as co-purification [148]. Permanent complexes include metabolic enzymes, core RNA polymerase, DNA replication complexes, nuclear pore complexes, ribosomes, and spliceosomes [139].

1.3 Experimental PPI Identification Methods

1.3.1 Small-scale screens

Small-scale screens comprise about 12 main approaches including affinity chromatography, affinity precipitation, dosage lethality, biochemical assays, and synthetic lethality [25]. Studies based on small-scale screens often use several methods to detect interactions and follow up by investigating the biological relevance of the interactions [48]. As a result, these interactions are often considered reliable [190, 175, 44]. However, the interactions are not reported in an easily accessible format: they are usually within the text of research articles, and are not presented in a systematic format. Extracting them from individual articles, by manual curation or text mining, can result in false positives and false negatives [23, 44, 168, 127]. Text mining is fast but often unreliable; on a manually curated test set, it achieved 57% FDR and 0.9% sensitivity [127]. However, manual curation is a slow and laborious process; a consortium of most major curation efforts, The International Molecular Exchange Consortium (IMEx), curates only 13 biological journals, and 10 of these only since 2005 (footnote 4).

1.3.2 HTP screens

High-throughput methods include yeast two-hybrid systems [184, 88, 147, 174], HTP mass spectrometry-based methods [81, 61, 22, 56], and protein chips [210, 209, 202]. HTP studies identify up to several thousand interactions per screen, and have provided over 15,000 human interactions so far.

Footnote 4: http://www.imexconsortium.org/about-imex

HTP Yeast Two-Hybrid

The yeast two-hybrid (Y2H) method [184, 88, 147, 174] takes advantage of the fact that gene transcription requires the binding of two domains of a transcriptional activator protein. These domains are called the DNA-binding domain and the activator domain. For two-hybrid analysis, each domain is fused to one of two candidate interacting proteins. If these proteins interact then a functional transcriptional activator is formed. This triggers the transcription of a reporter gene, which gives an observable change in phenotype.

Protein complex purification techniques using mass spectrometry

Two HTP techniques, high-throughput mass spectrometric protein complex identification (HMS-PCI) [81] and tandem affinity purification (TAP) [61, 22], are able to purify entire protein complexes and determine their constituents. In these techniques the protein whose partners are sought is called the bait and its interacting partners are called the prey. Both HMS-PCI and TAP begin with a step where short tag sequences are fused to the coding sequences of bait proteins. HMS-PCI uses a Flag tag while TAP uses a TAP tag [53]; other tags used for affinity purification include His6, HA, and myc. The tags allow bait proteins to be extracted from a mixture of cellular contents. If a bait protein is part of a complex, then the complex is extracted along with the bait. The constituents of a complex are separated by gel electrophoresis and identified by mass spectrometry (MS). Although TAP and HMS-PCI follow the same overall strategy, they use different techniques for each of the steps: tagging the proteins, purifying complexes and identifying complex constituents [90]. Differences in the results of the methods can be due to the techniques used for tagging proteins. HMS-PCI transfects cells with plasmids carrying the coding sequence of a bait protein fused to a tag sequence. Although TAP often follows a similar approach, in some cases it can attach a tag to the endogenous, chromosomal coding sequence of a bait protein [61].

Protein microarrays

A protein microarray consists of a microscope slide with thousands of proteins immobilized on its surface [210, 209, 96, 202]. Labelled target proteins are added to the chip and may bind to some of the immobilized proteins. Unbound targets are washed away while bound targets are detected by various methods such as interaction with a fluorescent dye.

Screens providing indirect HTP evidence for PPIs

Two types of HTP experimental methods can provide indirect evidence for PPIs: gene expression microarrays [62] and HTP synthetic lethal detection [181]. Gene co-expression has been used to predict PPIs by many studies, e.g., [62, 91, 190, 92, 145, 158, 27, 144]. The main idea is that two genes which have correlated expression across various conditions are more likely to encode interacting proteins; this relationship was shown to be statistically significant [62]. However, gene co-expression is generally a weak predictor of interaction. Interacting proteins may not have co-expressed genes, especially if their interactions are transient, and conversely, co-expressed genes may be unrelated to protein interaction, for example, if the genes are in two pathways activated by the same stimulus. In addition, several studies [91, 27, 144] have suggested that co-expression cannot predict transient interactions, although our results indicate that this may be possible. Furthermore, these studies found that the predictive value of co-expression is largely limited to very stable interactions such as those in the ribosome [27]. Due to these limitations co-expression has often been used in conjunction with other types of interaction evidence such as orthology data, and protein domains [92, 145, 144].
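Gene co-expression, as defined in the glossary, is the correlation between the expression levels of two genes across multiple conditions. A minimal sketch using the Pearson correlation coefficient is shown here; the choice of Pearson correlation, the function name, and the expression values are assumptions for illustration and are not taken from the cited studies.

```python
import math
from typing import Sequence


def pearson_correlation(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation between two expression profiles of equal length."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    covariance = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant profile")
    return covariance / math.sqrt(var_x * var_y)


# Made-up expression levels of three genes across four conditions.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]   # rises with gene_a
gene_c = [8.0, 6.0, 4.0, 2.0]   # falls as gene_a rises

r_ab = pearson_correlation(gene_a, gene_b)
r_ac = pearson_correlation(gene_a, gene_c)
```

In a prediction setting, a gene pair would be reported as a candidate interaction when its correlation exceeds a chosen threshold; the threshold is a free parameter that this sketch does not fix.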

An important benefit of gene expression data is its ability to provide context for an interaction: for example, the location of the interaction (i.e., tissue or cell type), its timing and its associated phenotype. Bossi et al. [21] assigned an interacting protein pair to a tissue if the corresponding gene pair was expressed in the tissue, above a given threshold. The time when an interaction begins and how long it lasts can be estimated in a similar way, if gene expression data is available for different time points [47]. The duration or stability of an interaction can be estimated from gene co-expression; a group of interacting proteins with highly co-expressed genes suggests a stable complex, whereas lack of gene co-expression suggests transient interactions [78]. An interacting protein pair can be associated with a phenotype if the corresponding genes are differentially expressed or differentially co-expressed in the phenotype [36, 178].

HTP synthetic lethal detection [41] provides another type of indirect evidence for PPIs. Two genes have a synthetic lethal interaction if they cause cell death when they are both mutated, but a mutation of either gene alone is not lethal. Such gene pairs are often functionally related and their proteins are slightly more likely to interact physically [181]. Most often, the genes are present in two pathways that have redundant or complementary functions, although the genes may also be present in the same pathway or in different parts of the interaction network [100].

1.4 Computational PPI Identification Methods

A large number of computational methods have been developed for predicting PPIs. These methods use a variety of machine learning, statistical and graph-theory approaches. The methods also differ in terms of the types of evidence used for making predictions; many methods consider a single type of evidence while some attempt to integrate diverse types. Computational methods can be grouped into six categories, based on the evidence they consider: genomic sequences of multiple species, protein primary structure, protein domains, protein tertiary structure, topology of interaction networks, or a combination of different evidence types. Most computational methods have not been systematically evaluated. Although descriptions of computational methods generally include testing results, the datasets used for testing are often quite different.

1.4.1 Methods using genomic analysis

A number of methods compare the genomes of multiple species to predict PPIs [187]. They look for pairs of genes that are close together on a chromosome in different species [45, 130] (conservation of gene neighborhood), gene pairs with a similar pattern of presence and absence across multiple genomes [138] (co-occurrence), or gene pairs that are found as a single gene in certain genomes [55] (gene fusion). Genes with any of these 3 properties are likely to be functionally related and may encode interacting proteins. In yeast, these methods had FDR and sensitivity rates of about 10% [190].
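The co-occurrence idea can be sketched by comparing presence/absence profiles of two genes across a set of genomes. The similarity measure used here (the fraction of genomes in which both profiles agree) and all names and data are assumptions chosen for illustration; the method in [138] may score profiles differently.

```python
from typing import Sequence


def profile_similarity(profile_a: Sequence[int], profile_b: Sequence[int]) -> float:
    """Fraction of genomes in which two genes are both present or both absent.

    Each profile holds one entry per genome: 1 if the gene is present, 0 if absent.
    """
    if len(profile_a) != len(profile_b) or not profile_a:
        raise ValueError("profiles must be non-empty and of equal length")
    agreements = sum(1 for a, b in zip(profile_a, profile_b) if a == b)
    return agreements / len(profile_a)


# Made-up profiles across four genomes.
gene_x = (1, 0, 1, 1)
gene_y = (1, 0, 1, 0)   # agrees with gene_x in three of four genomes
gene_z = (0, 1, 0, 0)   # the exact opposite pattern of gene_x

similarity_xy = profile_similarity(gene_x, gene_y)
similarity_xz = profile_similarity(gene_x, gene_z)
```

Gene pairs with high similarity would be treated as candidates for a functional relationship; the cutoff is left open here.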

1.4.2 Methods using protein primary structure

Several methods predict interactions based only on protein sequences [19, 120, 162, 71, 140, 205]. The sequences are often augmented or replaced with the physicochemical properties of residues [19, 120, 71] (e.g., charge, hydrophobicity) or frequencies of residue combinations [162]. Many of the methods [19, 120, 162, 71, 205] use support vector classifiers for determining a mapping from sequence-based information to interaction status. The classifiers are trained on interactions from small-scale screens.

Conceptually similar to predictions from primary structure are approaches that use protein homology to predict interactions. An approach called the paralogous verification method (PVM) [48] predicts interactions using paralogs, homologous proteins that are present in the same species and resulted from gene duplication [170]. A pair of proteins, (A, B), is predicted to interact if paralogs of A are known to interact with paralogs of B. This approach was tested on a dataset comprising yeast interactions from small-scale screens and non-interactions represented by randomly chosen yeast protein pairs. The method had a sensitivity of 40% and an FDR of 1%. The sensitivity of this approach is limited by the number of proteins with paralogs and the number of known interactions among paralogs.

Interologous prediction is an approach similar to PVM that uses orthologs to predict interactions [191, 27, 83]. Orthologs are homologous proteins that are present in different species and resulted from speciation events [170]. Interologous prediction reports two proteins as interacting if their orthologs in another species are known to interact. This method has predicted over 25,000 human interactions, but is mainly effective for identifying stable complexes rather than transient interactions, and cannot identify species-specific interactions [27]. In a study by Yu et al. [206], predictions of yeast PPIs based on orthologs in worm had 46% FDR but only 0.2% sensitivity, largely due to few known PPIs in worm at the time of the study.
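The interologous prediction rule described above can be sketched as a lookup over ortholog mappings. The data structures, function name, and protein identifiers are hypothetical; real implementations such as those in [191, 27, 83] may add confidence filters that are not shown.

```python
from itertools import combinations


def predict_interologs(orthologs: dict[str, set[str]],
                       known_source_ppis: set[frozenset[str]]) -> set[frozenset[str]]:
    """Predict target-species PPIs from known PPIs in a source species.

    orthologs maps each target-species protein to its source-species orthologs.
    A target pair is predicted to interact if any pair of their orthologs
    is a known interaction in the source species.
    """
    predicted = set()
    for protein_a, protein_b in combinations(sorted(orthologs), 2):
        if any(frozenset((ortholog_a, ortholog_b)) in known_source_ppis
               for ortholog_a in orthologs[protein_a]
               for ortholog_b in orthologs[protein_b]):
            predicted.add(frozenset((protein_a, protein_b)))
    return predicted


# Made-up example: three human proteins with yeast orthologs.
human_to_yeast = {"H1": {"Y1"}, "H2": {"Y2"}, "H3": {"Y3"}}
yeast_ppis = {frozenset(("Y1", "Y2"))}

predictions = predict_interologs(human_to_yeast, yeast_ppis)
```

Here only the pair (H1, H2) is predicted, because Y1 and Y2 are the only orthologs with a known interaction. The paralogous verification method follows the same pattern, with paralogs in the same species taking the place of orthologs.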

1.4.3 Methods using protein domains

Domains can be defined as elementary units of protein structure and evolution, which can fold and function independently. PPIs often depend on the presence of specific protein domains [136, 137]. A number of studies predict interactions based on the presence of particular domains on candidate interacting proteins [172, 200, 49, 102]. These studies determine associations between domains and interactions by analyzing domains on protein pairs that have been identified as interacting by experimental PPI detection methods. First, the studies count occurrences of domain pairs (a, b) on pairs of interacting proteins, such that domain a is present on one protein and domain b is present on the other. Then, domain pairs are associated with an interaction if they are enriched among interacting protein pairs. The calculation of enrichment varies between studies. Sprinzak et al. [172] calculated enrichment as a log-odds ratio: $\log_2\left(P_{ab} / (P_a P_b)\right)$, where $P_{ab}$ was the frequency of domain pair ab among interacting protein pairs, and $P_a$, $P_b$ were the frequencies of the individual domains among interacting protein pairs. To predict whether a given protein pair interacts, all domain pairs on the two proteins were considered; the highest log-odds ratio was determined and compared to a threshold. Several studies [49, 102] expanded this approach to account for false positives and false negatives of PPI detection methods.

These studies first calculated probabilities of domain-domain interactions based on the set of detected protein pairs and the estimated false positive and false negative rates of the detection methods. Then, the probability of two proteins interacting was calculated from the interaction probabilities of their domains: $P_{AB} = 1 - \prod_{ab \in AB} (1 - P_{ab})$, where $P_{AB}$ is the probability of interaction between proteins A and B, ab are a pair of domains on these proteins, and $P_{ab}$ is the probability of an interaction between the domains.

Domain-based methods have a number of limitations. About 80% of interactions do not occur through domain-domain binding [155] – for example, they may involve short linear motifs. Even when binding occurs through domains, interactions may depend on additional factors such as post-translational modifications and sub-cellular localization. Also, an interaction may require several domains on a protein but most studies assume that domains interact independently.
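The two scoring schemes described in this section can be sketched as follows. This is an illustrative sketch only: the function names, domain labels, and numeric values are hypothetical, and the domain-level frequencies and probabilities are assumed to have been estimated beforehand. The first function computes the log-odds enrichment of a domain pair, the second takes the highest log-odds score over all domain pairs on two proteins, and the third combines domain-level probabilities under the independence assumption noted above.

```python
import math
from itertools import product

def log_odds(p_ab, p_a, p_b):
    """Enrichment of a domain pair: log2(P_ab / (P_a * P_b))."""
    return math.log2(p_ab / (p_a * p_b))

def max_log_odds(domains_a, domains_b, scores):
    """Highest log-odds score over all domain pairs on two proteins."""
    return max(
        (scores.get(frozenset((a, b)), float("-inf"))
         for a, b in product(domains_a, domains_b)),
        default=float("-inf"),
    )

def protein_interaction_probability(domains_a, domains_b, p_domain):
    """P_AB = 1 - prod(1 - P_ab), treating domain pairs as independent."""
    prob_no_interaction = 1.0
    for a, b in product(domains_a, domains_b):
        prob_no_interaction *= 1.0 - p_domain.get(frozenset((a, b)), 0.0)
    return 1.0 - prob_no_interaction

# Hypothetical values for illustration only.
scores = {frozenset(("SH3", "PRM")): 2.0, frozenset(("SH2", "PRM")): -1.0}
p_domain = {frozenset(("SH3", "PRM")): 0.5, frozenset(("SH2", "PRM")): 0.2}

enrichment = log_odds(0.02, 0.1, 0.05)              # log2(0.02 / 0.005)
best = max_log_odds({"SH3", "SH2"}, {"PRM"}, scores)
p_interact = protein_interaction_probability({"SH3", "SH2"}, {"PRM"}, p_domain)
```

With these toy values the combined probability is 1 − (0.5 × 0.8) = 0.6, which is higher than either domain-level probability alone.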

1.4.4 Methods using protein tertiary structure

Docking methods use the tertiary structure of two proteins to predict the structure of their interaction complex [167]. These methods try to find the best structure for the complex, based on the shape and electrostatic complementarity between protein surfaces.

The methods do not predict whether a given protein pair interacts [116, 4] – instead, they search for the optimal fit between two proteins. Applying these methods on a wide scale is difficult because of their high computational requirements [84].

Several methods use a combination of primary and tertiary protein structure to predict interactions [3, 116]. These methods first determine whether members of a given pair of proteins have similar sequences to proteins in a solved complex. Then the methods assess whether the candidate proteins could form a similar complex; often, if proteins have greater than 25% sequence identity they interact in the same way [2]. The methods can predict interactions and some of the properties of the resulting complexes.

A method proposed by Hue et al. [84] starts with the tertiary structures of candidate interacting proteins and considers whether their structures are similar to those of proteins in solved complexes. The output is a binary value indicating if interaction occurs. So far, the method has only been used to predict interactions between protein domains rather than full length proteins.

An important limitation of methods using protein tertiary structure is the availability of solved structures for individual proteins and for complexes. Also, the methods are less effective for interactions involving conformational changes at the interface, or binding at unstructured protein regions [4]. Transient interactions have both of these properties – they often involve a protein domain binding to a short, unstructured polypeptide section [4].

1.4.5 Methods analyzing interaction networks

Methods using network topology [151, 67, 9, 10, 152, 142, 158] consider a set of PPIs as a graph, with proteins represented by nodes and interactions by edges. A central motivation for these methods is that PPI networks have densely connected local neighbourhoods [67], meaning that two interacting proteins are likely to have many common interaction partners. Therefore, two proteins are predicted to interact if they have many known interaction partners in common. The main limitation of topology-based approaches is the high error rate of current PPI networks, in terms of false positives and especially false negatives. For example, over 35% of human proteins have no known interactions (footnote 5).
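The shared-partner idea can be illustrated with a short sketch. It is a simplified stand-in for the cited topology-based methods, which use more elaborate scores; the function names, the threshold `min_shared`, and the toy network are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations

def build_adjacency(edges):
    """Undirected adjacency sets from a list of (protein, protein) edges."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    return dict(adjacency)

def shared_partner_predictions(edges, min_shared=2):
    """Predict non-adjacent pairs that share at least min_shared partners."""
    adjacency = build_adjacency(edges)
    predictions = {}
    for u, v in combinations(sorted(adjacency), 2):
        if v in adjacency[u]:
            continue  # already a known interaction
        shared = len(adjacency[u] & adjacency[v])
        if shared >= min_shared:
            predictions[(u, v)] = shared
    return predictions

# Toy network: A and B both bind C and D; E binds only A.
edges = [("A", "C"), ("A", "D"), ("B", "C"), ("B", "D"), ("A", "E")]
predictions = shared_partner_predictions(edges)
```

A protein with no known interactions never appears in the adjacency map, so no prediction can be made for it, which is the false-negative limitation described above.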

1.4.6 Methods using data integration

Since each type of PPI evidence has limitations, many studies have tried to integrate different types of evidence [92, 15, 143, 145, 158]. Commonly used evidence types have included gene co-expression, function and localization similarity, and complementary domains. Algorithms that have been used for predicting PPIs from these data include Naive Bayes [92, 145, 143, 158], decision trees [143], logistic regression [11, 143], random forests [203, 143], and support vector machines [15, 143]. Qi et al. [143] assessed the predictive value of different types of evidence and different prediction algorithms. Their test data comprised yeast interacting protein pairs detected by small-scale screens and randomly chosen pairs of yeast proteins. Predictive features for these pairs included similarity of Gene Ontology [8] annotation (function, process and localization), detection by HTP experimental methods, gene co-expression, synthetic lethality, domain complementarity, genomic evidence (gene neighbourhood/gene fusion/gene co-occurrence) and presence of interacting orthologs. The most effective prediction algorithm was random forest and the top five features, ranked from highest to lowest importance, were gene co-expression, process similarity, localization similarity, detection by HTP mass spectrometry, and function similarity. The high ranking of gene co-expression was surprising given that other studies [190, 158] ranked it below most other features. von Mering et al. [190] compared different types of HTP evidence for yeast PPIs, and found that HTP mass spectrometry evidence was most effective. Scott et al. [158] predicted human PPIs using diverse evidence types and found that network topology was most effective. Studies which compared predictions from individual evidence types with predictions from integrated evidence [15, 143, 158] found that integration produced at least a 50% improvement in performance (measured as AUC50; footnote 6) over the best individual evidence type.

Footnote 5: Based on I2D [27], version 1.71.
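The AUC50 measure used in these comparisons is the area under the ROC curve restricted to the first 50 false positive cases. A minimal sketch of such a truncated score follows. The normalization (dividing by n times the number of positives, so that a perfect ranking scores 1) and the handling of lists with fewer than n negatives are assumptions, since the cited studies may differ in these details; ties in score are not treated specially.

```python
def auc_n(scored_pairs, n):
    """Normalized area under the ROC curve up to the n-th false positive.

    scored_pairs: iterable of (score, is_interaction) tuples.
    """
    ranked = sorted(scored_pairs, key=lambda item: item[0], reverse=True)
    total_positives = sum(1 for _, is_pos in ranked if is_pos)
    true_pos = false_pos = area = 0
    for _, is_pos in ranked:
        if is_pos:
            true_pos += 1
        else:
            false_pos += 1
            area += true_pos  # true positives ranked above this false positive
            if false_pos == n:
                break
    # Assumption: with fewer than n negatives, the curve is extended flat.
    area += (n - false_pos) * true_pos
    return area / (n * total_positives)

# Toy ranking (hypothetical scores): positive, positive, negative, positive, ...
toy = [(0.9, True), (0.8, True), (0.7, False),
       (0.6, True), (0.5, False), (0.4, False)]
score = auc_n(toy, n=2)  # (2 + 3) / (2 * 3)
```

Because only the top of the ranking contributes, the score rewards methods whose highest-confidence predictions are correct, which matches how predicted PPIs are typically used.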

1.5 PPI networks and phenotype

Several computational studies have associated phenotypes with nodes, edges, and modules in PPI networks [142, 180, 118, 178]. These associations have been based on gene expression data, network topology, combinations of the two, or combinations of topology with known gene phenotypes. Gene expression data can identify phenotype-specific parts of a network in two ways: differentially expressed genes can identify phenotype-specific nodes and differentially co-expressed genes (i.e., co-expressed only among samples of a certain phenotype) can identify phenotype-specific edges [118, 178].

Footnote 6: AUC for the first 50 false positive cases.

Network topology can identify nodes with an essential phenotype (i.e., genes that are essential for the survival of an organism). Essential genes tend to be nodes whose removal causes large disruptions in the network – disconnecting it or increasing average path length. They often have high degrees, are central in the network (i.e., close to many other nodes), and correspond to articulation points (nodes whose removal disconnects the network) [93, 142, 180].
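Two of the topological properties mentioned above, degree and articulation points, can be computed with a short sketch. The brute-force search below (remove each node in turn and test connectivity) is chosen for clarity rather than efficiency and assumes the input network is connected; the function names and toy network are hypothetical.

```python
from collections import defaultdict

def build_adjacency(edges):
    """Undirected adjacency sets from a list of (node, node) edges."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    return dict(adjacency)

def is_connected(adjacency, removed=None):
    """Depth-first search over the network with one node optionally removed."""
    nodes = [n for n in adjacency if n != removed]
    if not nodes:
        return True
    seen = {nodes[0]}
    stack = [nodes[0]]
    while stack:
        u = stack.pop()
        for v in adjacency[u]:
            if v != removed and v not in seen:
                seen.add(v)
                stack.append(v)
    return len(seen) == len(nodes)

def articulation_points(adjacency):
    """Nodes whose removal disconnects an otherwise connected network."""
    return {n for n in adjacency if not is_connected(adjacency, removed=n)}

# Toy network: a triangle A-B-C with a tail C-D-E.
edges = [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D"), ("D", "E")]
adjacency = build_adjacency(edges)
degree = {node: len(partners) for node, partners in adjacency.items()}
cut_nodes = articulation_points(adjacency)
```

In the toy network, C has the highest degree and is also an articulation point, while D is an articulation point despite its low degree, which shows that the two properties are related but not equivalent.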

Combining network topology with previously known gene phenotypes can help identify new associations between genes and phenotypes. The main idea is that genes which are close in a PPI network are likely to share the same phenotypes. Several studies have predicted gene phenotypes based on known phenotypes of proximate genes. Different definitions of proximity have been considered, including direct interaction [129], length of shortest path [63], and scores based on random walks [104, 124]. A random walk incorporates the idea of connectivity in its evaluation of proximity – a higher number of possible paths between nodes increases their proximity. Connectivity was also the main criterion used by Cox et al. [42] for identifying genes with a placenta phenotype.
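The effect of connectivity on random-walk proximity can be seen in a small sketch of a random walk with restart. This is a generic formulation, not the exact procedure of the cited studies; the restart probability, iteration count, function name, and toy network are assumptions, and every node is assumed to have at least one neighbour.

```python
def random_walk_with_restart(adjacency, seeds, restart=0.3, iterations=100):
    """Visiting probabilities of a walker that restarts at the seed nodes."""
    nodes = sorted(adjacency)
    start = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    prob = dict(start)
    for _ in range(iterations):
        updated = {n: restart * start[n] for n in nodes}
        for u in nodes:
            # Spread the non-restarting mass evenly over the neighbours of u.
            share = (1.0 - restart) * prob[u] / len(adjacency[u])
            for v in adjacency[u]:
                updated[v] += share
        prob = updated
    return prob

# Toy network: X is reachable from seed S by two paths, Y by only one.
# Both X and Y are at shortest-path distance 2 from S.
adjacency = {
    "S": {"M1", "M2", "M3"},
    "M1": {"S", "X"}, "M2": {"S", "X"}, "M3": {"S", "Y"},
    "X": {"M1", "M2"}, "Y": {"M3"},
}
scores = random_walk_with_restart(adjacency, seeds={"S"})
```

A shortest-path measure would rank X and Y equally, whereas the random-walk score ranks X above Y because more paths connect it to the seed.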

Combining network topology and expression data can identify network modules that are accurate predictors of phenotype. This approach searches for groups of genes that are highly interconnected in the network, and whose mean expression is correlated with phenotype. Using this approach, Chuang et al. [36] identified network modules that were biomarkers of breast cancer metastasis and Fortney et al. [57] identified modules that were biomarkers of worm aging. These modules were better predictors of phenotype than sets of the most differentially expressed genes.

1.6 Summary of Research Contributions

1.6.1 Chapter 2

Experimentally identified PPIs may comprise only 10% of the human interactome and previously predicted human PPI networks [145, 158] also likely contain <10% of the interactome. In order to improve interactome coverage we developed FpClass – a data mining-based algorithm for predicting PPIs. Compared to previous predicted human PPI networks [145, 158], FpClass provides a network with lower FDR, over 10 times more interactions and over 2 times more proteins. Also, FpClass can help reduce the false negative rate of HTP experimental PPI screens; in our testing it detected over 90% of interactions that were missed by HTP screens.

1.6.2 Chapter 3

Transient PPIs are often difficult to detect by either experimental or computational methods [139, 4, 27]. We developed the mixed co-expression algorithm to improve predictions of transient interactions from gene co-expression. Compared to the previous co-expression based prediction approach [145, 158], mixed co-expression predicts significantly more PPIs at the same FDR. The largest improvements occur for interactions of transferases and signal transduction proteins; recall of these interactions improves 2-fold.

1.6.3 Chapter 4

Essential genes have been linked with centrality in PPI networks and with other network properties [93, 74]. However, this relationship can occur due to biases in PPI networks [206], and in some networks the relationship does not exist [206, 133]. In cases where the relationship is found, it is unclear whether centrality is directly related to essentiality or through properties correlated with essentiality, such as evolution rate. Furthermore, in previous work, the relationships between essentiality and network topology have been mostly studied in yeast. We investigated the features of mouse essential genes, including their network properties. We identified a set of 22 non-redundant features closely related to essentiality. Centrality in PPI and gene co-expression networks were among these top features. We also investigated whether essential mouse genes comprise several distinct subtypes. We identified significant differences between essential genes associated with different tissues and with different developmental stages.

Chapter 2

Predicting human PPIs using non-independent features

This chapter is adapted from the following paper under revision: Max Kotlyar, Yun Niu, Zhiyong Ding, Gordon B. Mills, and Igor Jurisica. Predicting human protein-protein interactions using non-independent features.

Abstract

Mapping a high percentage of the human interactome efficiently and quickly may benefit from combining high-throughput experimental methods and computational predictions. We introduce a novel prediction method optimized for this strategy. Our data mining-based prediction algorithm, FpClass, systematically integrates diverse features of individual proteins and of protein pairs. Importantly, it integrates both positive and negative evidence for protein interactions, and identifies synergies between predictive features (i.e., sets of predictive features that are especially effective as a group). These approaches increase the number of protein pairs with predictive evidence, enabling extensive coverage of the interactome – a key contribution to an intertwined computational-experimental mapping strategy.


Compared to high confidence predicted interaction networks from previous studies, our predicted network contains twice as many proteins (11,029) at the same false discovery rate (FDR). To provide a reasonable estimate of prediction accuracy, we comprehensively evaluated FpClass predictions by multiple approaches, including cross-validation on a gold standard dataset, and comparisons of our proteome-wide predictions against diverse high-throughput and small-scale experimental datasets. These comparisons suggest that the FDR of our top 32,234 predictions is between 18% and 54%, at least 1.5 times lower than the FDRs of previous predicted networks of similar size. We confirmed a subset of our novel predictions with evidence from PubMed extracted by text mining and with evidence from the Protein Data Bank. In addition, we experimentally validated a number of our predictions for AKT1 and PIK3R1, which are implicated in many cancers. Our predictions will be available as part of the Interologous Interaction Database (I2D) ver. 2.0 (http://ophid.utoronto.ca/I2D).

2.1 Introduction

Identifying human protein-protein interactions (PPIs) not only adds to our understanding of fundamental biological processes, but also provides insights into human diseases and accelerates the development of new therapies. PPIs have been widely used to predict protein function [28], functional modules [171], and protein complexes [10, 103]. The known human interactome has provided new perspectives on diseases [66], and has helped to identify new candidate disease genes [104], as well as PPI subnetworks implicated in diseases [36]. PPIs can also help identify drug targets [150, 70] and accelerate drug discovery [7].

However, the known human interactome may only be ∼10% complete [79, 175], and increasing coverage with current experimental methods will be challenging [156]. Roughly half of known human PPIs were identified exclusively by small-scale experiments, which are not intended for providing high coverage. Furthermore, PPIs from small-scale screens are not presented in a systematic format, and extracting them from research articles, whether by manual curation or text mining, can result in false positives and false negatives [23, 44, 168, 127]. High-throughput (HTP) studies [22, 154, 147, 174, 109, 56] have provided over 15,000 human PPIs and could potentially provide high coverage of the interactome. However, their FDRs (footnote 1) have often been high and their sensitivity rates (footnote 2) low. Estimates of their FDRs have varied considerably [79], depending on the evaluation method. One approach tests a subset of detected interactions by small-scale screens. Using this strategy, HTP yeast 2-hybrid studies by Rual et al. [147] and Stelzl et al. [174] reported FDRs of no more than 22% and 38%, respectively. A second evaluation approach compares the overlap of detected interactions with two other interaction datasets [50]. Based on this method, the FDRs were 87% for Rual et al. [147] and 98% for Stelzl et al. [174] – however, there may have been insufficient data for this strategy [79], especially in the case of Stelzl et al. [174]. The sensitivity rates of five HTP methods were evaluated by Braun et al. [23] and ranged from 21% to 36%, while FDR ranged from 0% to 11%. Combining all five methods gave a sensitivity of 59%, with an FDR of 14%. Thus, while false positives are an important issue, to achieve high coverage with HTP methods, false negatives will need to be reduced. This may require using repeated screens [206], multiple experimental methods [23], or combined experimental and computational PPI detection methods [156].

Footnote 1: FDR = (false positives) / (false positives + true positives)
Footnote 2: sensitivity = (true positives) / (true positives + false negatives)

Schwartz et al. [156] investigated a number of strategies for achieving high coverage of human and model organism interactomes, and concluded that the best approach is a combination of computational predictions and pooled HTP screens. In this approach, a computational method assigns preliminary interaction probabilities to all pairs of proteins in the organism. These probabilities are then used in two ways. First, the probabilities prioritize protein pairs for further testing by experimental HTP screens. Second, the probabilities determine how many times a protein pair needs to be tested by experimental HTP screens to reliably confirm its interaction (or its lack of interaction). For example, if the preliminary probability of interaction is high, then detection by few HTP screens may be sufficient to confirm the interaction. If the preliminary probability is low, then more HTP screens may be required, to ensure that the interaction is not a false positive.
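The relationship between the preliminary probability and the number of confirming screens can be illustrated with a simple Bayesian update. This sketch is not the model of Schwartz et al. [156]: it assumes that screens are independent and share a single sensitivity and false positive rate, and the numeric values and function names are hypothetical.

```python
def posterior_after_detections(prior, sensitivity, false_positive_rate, k):
    """Probability of interaction after k independent positive screens."""
    likelihood_ratio = sensitivity / false_positive_rate
    odds = prior / (1.0 - prior) * likelihood_ratio ** k
    return odds / (1.0 + odds)

def detections_needed(prior, sensitivity, false_positive_rate,
                      target=0.95, max_screens=50):
    """Smallest number of positive screens that lifts the posterior to target."""
    for k in range(max_screens + 1):
        if posterior_after_detections(prior, sensitivity,
                                      false_positive_rate, k) >= target:
            return k
    return None

# Hypothetical screen characteristics: 30% sensitivity, 1% false positive rate.
high_prior = detections_needed(prior=0.5, sensitivity=0.3,
                               false_positive_rate=0.01)
low_prior = detections_needed(prior=0.001, sensitivity=0.3,
                              false_positive_rate=0.01)
```

Under these assumed values, a pair with a preliminary probability of 0.5 is confirmed by a single detection, while a pair with a preliminary probability of 0.001 needs three, which is the saving in screens that computational prioritization aims to provide.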

Such an integrated computational-experimental approach could greatly reduce the number of HTP screens required for achieving high coverage of the interactome. Schwartz et al. [156] estimated that achieving 50% coverage of the human interactome using only experimental methods could require 969,000 screens, or approximately 403 person-years. With computational predictions the number of screens could be reduced to 69,000, or approximately 29 person-years [156].

Current computational methods would likely be effective in such a combined strategy, as shown by Schwartz et al. [156] in a proof of concept study. However, since computational prediction studies [33, 143, 145, 158, 194] have not aimed for integration with HTP experimental methods, it may be possible to substantially improve the contribution of predictions to such a strategy. The primary aim of prediction studies has been to identify PPIs with low FDR, usually assessed by testing against gold standard PPIs based on small-scale screens. Low FDR greatly contributes to a combined strategy but may be less important than having predictions for a large number of protein pairs. In a combined approach, the primary role of computational methods is to prioritize a large number of protein pairs for further testing, rather than to provide a limited set of highly accurate PPIs. Also, in a combined strategy, computational methods would need to minimize biases associated with training and testing on PPIs from small-scale screens.

Here we present a computational method for predicting human PPIs that optimizes an integrated computational-experimental strategy. Our method, FpClass, uniquely integrates diverse data types commonly used for PPI prediction [92, 145, 158], including predictive features based on protein sequence and structure, orthology, network topology, Gene Ontology [8], and gene co-expression. FpClass differs from previous prediction approaches in two major ways: 1) it does not assume independence of predictive features and 2) it predicts non-interactions as well as interactions.

Previous studies using feature integration [92, 111, 33, 15, 117, 145, 177, 143, 204, 158, 71] assumed that predictive features are largely independent. This was reflected in their use of individual feature types (e.g., protein domains) and their integration of different feature types (e.g., protein domains and protein localization). For example, it was often assumed that multiple domains on a protein behave independently – an interaction occurred between single domains on two proteins and was unaffected by the presence of additional domains on the proteins. Similarly, the binding affinity of protein domains was assumed to be unaffected by the localization of their proteins.

FpClass considers a wide range of such potential dependencies among features. It assumes that any combination of a protein’s features may affect the protein’s interactions. This combination may comprise one or more features of the same type (e.g., protein domains) or of different types (e.g., protein domains, localizations, post-translational modifications, etc.). FpClass assumes that a pair of combinations, each on a different protein, may facilitate interaction – just like a pair of protein domains may facilitate interaction. FpClass treats a combination as a single predictive feature, such as a protein domain. In other words, predictive features used by FpClass include individual features (e.g., Myb DNA binding domain) and feature combinations (e.g., {Myb DNA binding domain, Homeodomain-related}). By using combinations as predictive features, the total number of predictive features greatly increases. Although most of these combinations are weak predictors, they provide information about a large number of protein pairs, thereby increasing coverage and enabling a more extensive prioritization of protein pairs for further testing. Also, some combinations have strong predictive value and reduce the FDR of high confidence predictions.
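The idea of treating feature combinations as predictive features can be illustrated by enumerating them for a pair of proteins. This sketch shows only the enumeration step and is not the FpClass mining procedure, which is not described in this section; the function names, the size limit `max_size`, and the localization feature are assumptions made for illustration.

```python
from itertools import combinations, product

def feature_combinations(features, max_size=2):
    """All non-empty subsets of a protein's features, up to max_size items."""
    items = sorted(features)
    return [frozenset(subset)
            for size in range(1, max_size + 1)
            for subset in combinations(items, size)]

def combination_pairs(features_a, features_b, max_size=2):
    """Pairs of (combination on protein A, combination on protein B)."""
    return list(product(feature_combinations(features_a, max_size),
                        feature_combinations(features_b, max_size)))

protein_a = {"Myb DNA binding domain", "Homeodomain-related"}
protein_b = {"nucleus"}  # hypothetical localization feature
combos_a = feature_combinations(protein_a)
pairs = combination_pairs(protein_a, protein_b)
```

Protein A contributes three candidate features (two individual domains and their combination), so the number of candidate feature pairs grows quickly with the number of annotations, which is why the total number of predictive features increases so greatly.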

FpClass assumes that just as some pairs of features can be associated with interaction, other feature pairs can be associated with absence of interaction. It assumes that feature pairs indicating absence of interaction are those that occur among known PPIs less frequently than expected by chance. By predicting non-interactions it is possible to assign many protein pairs to a low priority category for further testing.

We trained and evaluated FpClass on protein pairs from both small-scale and HTP screens. First, we evaluated FpClass by cross-validation on a gold standard dataset, where positive cases (22,219 PPIs) were interactions reported by at least two studies, either small-scale or HTP, and negative cases were randomly selected protein pairs (2,221,900 pairs). Next, we trained FpClass on the gold standard dataset and predicted interactions for all pairs of human proteins. We assessed these predictions in three ways: 1) their contribution to HTP screens – by identifying interactions missed by the screens, and confirming correctly detected interactions; 2) their reliability, measured as FDR; 3) their coverage of the interactome. To assess whether FpClass could contribute to HTP screens, we analyzed our predictions for a set of 182 protein pairs, which had been used by Braun et al. [23] for evaluating HTP screens. Half of these pairs were high confidence interactions confirmed by multiple small-scale screens and the other half were randomly chosen protein pairs. Individual HTP screens evaluated by Braun et al. [23] detected up to 36% of the high confidence interactions with an FDR up to 4%; FpClass identified 91% at 1% FDR – thus, identifying many missed interactions and confirming those correctly detected. To assess the FDR of FpClass, we analyzed the overlaps between our high confidence, proteome-wide predictions and PPI datasets from diverse experimental methods. We found that previous prediction studies had lower overlaps with almost all experimental datasets, indicating that the FDR of our method is at least 20% lower than previous prediction methods. To assess the interactome coverage of FpClass we compared its predictions to the union of five major curated PPI databases. For predictions that were not present in curated databases we found supporting evidence by text mining of PubMed abstracts and analysis of co-crystallized protein pairs in PDB. In addition, using new Retrovirus-based Protein Complementation Assay (RePCA) screens [51], we obtained statistically significant validation of predictions for critical components of the phosphatidylinositol pathway, AKT1 (p = 2.8e-3) and PIK3R1 (p = 9.3e-3).

2.2 Results

2.2.1 Complementarity of computational and HTP experimental methods

Achieving a comprehensive mapping of the interactome with HTP experimental screens requires testing each protein pair by multiple screens, in order to limit false positives and false negatives. In an integrated computational-experimental mapping strategy, computational prediction replaces a portion of the experimental screens [156]. For this strategy to be effective, predictions need to identify some of the interactions detected by a given experimental screen, and ideally, some of the interactions that are missed by the screen.

To evaluate how well our method meets these requirements, we applied it to a PPI dataset used by Braun et al. [23] to test five HTP experimental methods. On this dataset, individual HTP methods achieved true positive rates (TPR; footnote 3) from 21% to 36% and FDRs of up to 11%. The union of the methods had a TPR of 59% and an FDR of 14%. FpClass achieved 91% TPR with 1% FDR – identifying most of the interactions (49/55) detected by the five HTP methods and almost all of the interactions missed by the HTP methods (35/37) (Figure 2.1). This suggests that FpClass could make an important contribution to an integrated interactome mapping strategy. However, it should be noted that the dataset of Braun et al. [23] has a small number of interactions and certain biases, mainly the fact that the positive test cases contain extensively annotated proteins.

Footnote 3: TPR = (true positives) / (true positives + false negatives)

Figure 2.1: True positive and false positive protein pairs identified by HTP experimental methods and by FpClass. (a) True positives of different PPI detection methods. (b) False positives of different PPI detection methods. Methods shown in both panels: Lumier, MAPPIT, Y2H, PCA, Wnappa, FpClass.

2.2.2 False discovery rates of predictions

FDR can have a large impact on the efficiency of an integrated mapping strategy. Simulations by Schwartz et al. [156] showed that a 7% reduction in FDR was associated with almost a 10-fold reduction in the number of HTP screens required to map 50% of the interactome. In general, FDR is the main criterion for evaluating predictions – it expresses their reliability and determines how many resources to invest in their validation. We evaluated FDR and other performance measures on two datasets: a gold standard dataset of ∼2.2 million protein pairs and a set of ∼227 million protein pairs containing unique protein pairs based on 21,303 human genes (see Materials and Methods; footnote 4). We conducted cross-validation on the gold standard data and determined precision (1-FDR), recall (sensitivity; footnote 5), TPR, and FPR (Figure 2.2). However, using a single gold standard dataset for evaluation has several limitations: comparing performance to other methods is not possible unless they have been applied to the same dataset, and estimating performance on new data is difficult due to possible biases (e.g., the ratio of positive to negative cases) in the gold standard dataset or in the new data.

Footnote 4: We selected a set of 21,303 proteins from UniProt [40] with unique Entrez Gene IDs [119], and obtained predictions for all pairs of these proteins except self-interactors.
Footnote 5: We include recall because we generate precision-recall curves of performance.

Figure 2.2: Testing of FpClass by 10-fold cross-validation on gold standard data. (a) Performance of predictive features measured as AUC100 scores – areas under the ROC curve for up to 100 false positive cases. (b) Precision-recall of individual predictive features and of all features combined. Features shown: domains, PTMs, phys/chem, GO comp, GO func, GO proc, co-expr, ortho, topo, and all features.

To address these limitations, we computed proteome-wide PPI predictions and compared a set of our top predictions against PPI sets derived from diverse methods. Our objectives were to determine whether FpClass could identify PPIs reported by diverse experimental platforms, and to compare its performance with previous prediction methods. As a first step, we obtained top predictions from FpClass and from two previous computational prediction studies: Rhodes et al. [145], and Scott et al. [158]. From each of these predicted datasets, we removed any PPIs that had been used as positive training cases either in our study or in the other two studies. This left 32,234 predictions from FpClass, 35,950 from Rhodes et al. [145], and 32,440 from Scott et al. [158] (Figure 2.3). Next, we compared these 3 sets against 11 experimental PPI datasets: 10 datasets from HTP experimental methods [14, 202, 22, 46, 56, 98, 154, 109, 147, 174] and a dataset of PPIs from small-scale screens. Before using these experimental datasets, we removed PPIs that had been positive training cases for any of the prediction methods (Figure 2.4). Since positive training cases for FpClass were PPIs supported by at least two PubMed IDs, removing these PPIs ensured that the PPIs remaining in a given experimental dataset were exclusive to that dataset. We then determined the overlap of the three predicted datasets with each of the experimental datasets (Figure 2.5), and computed ratios of the observed overlap to the expected overlap (Figure 2.6). The overlaps of our predictions with 10 of the experimental studies were higher than the overlaps of other predicted sets with these studies. Also, our overlaps with 9 of the studies were over 50 times greater than expected by chance (p < 9.4e-6), and with 5 of the studies our overlaps were over 100 times greater than expected by chance (p < 1.2e-12).

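Figure 2.3: Numbers of interactions in high confidence predicted PPI datasets. Venn diagram regions: Rhodes05 only, 32,673; Scott07 only, 26,527; FpClass only, 25,898; Rhodes05 and Scott07 only, 1,103; Rhodes05 and FpClass only, 1,526; Scott07 and FpClass only, 4,165; all three datasets, 645.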

The overlap of a PPI dataset with two other sets can be used to estimate its FDR [50]. Several studies [79, 158] have used this approach to calculate the FDR of predicted PPIs. The approach estimates the FDR of a predicted set of interactions, D, by determining the overlap of D with a reference set, R, containing trusted PPIs, and with a dataset, D’, based on a detection method similar to that of D (Figure 2.7). We used this approach to compute 7 FDR estimates for each predicted dataset. These estimates used different experimental datasets as reference sets (Figure 2.8). For dataset D’, we could choose either of the two remaining predicted datasets; to reduce bias, we determined an estimate for both choices of D’ and computed the mean. The 7 FDRs for a given predicted dataset were fairly consistent: 18%-54% for our predictions, 41%-80% for Scott et al. [158] and 71%-95% for Rhodes et al. [145].
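The overlap-based estimate can be illustrated with a short Python sketch. It is not the code used in this study. It assumes the following reading of the regions in Figure 2.7: I is the set of PPIs shared by D, D’ and R; II is shared by D and R only; III is shared by D and D’ only; IV is the estimated number of remaining true positives in D (IV = II × III / I); and V is everything else in D, taken as false positives. The function names and toy inputs are hypothetical.

```python
def estimate_fdr(d, d_prime, ref):
    """Estimate the FDR of predicted set d from its overlaps with d_prime and ref.

    Assumed region definitions (see lead-in): I, II, III are observed overlaps,
    IV is the inferred number of unobserved true positives, V the false positives.
    """
    region_i = len(d & d_prime & ref)       # in D, D' and R
    region_ii = len((d & ref) - d_prime)    # in D and R, not in D'
    region_iii = len((d & d_prime) - ref)   # in D and D', not in R
    if region_i == 0:
        raise ValueError("triple overlap is empty; the estimate is undefined")
    region_iv = region_ii * region_iii / region_i   # IV = II x III / I
    region_v = len(d) - region_i - region_ii - region_iii - region_iv
    return max(region_v, 0.0) / len(d)


def mean_fdr(d, alternatives, ref):
    """Average the estimate over several choices of D', as done for the two
    remaining predicted datasets in the text."""
    return sum(estimate_fdr(d, alt, ref) for alt in alternatives) / len(alternatives)
```

In this sketch a PPI can be any hashable object, for example a `frozenset` of two protein identifiers, so that the pair (A, B) equals the pair (B, A).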

The fact that certain experimental datasets, especially small-scale screens, had particularly high overlaps with the predicted datasets (Figure 2.6) does not indicate that these datasets are the most reliable. It is likely that prediction methods have biases that increase their overlaps with certain experimental datasets. For example, prediction methods are probably most effective with well-annotated proteins and such proteins may be more frequent among interactions from small-scale screens, as these screens have been used for a longer time.

Figure 2.4: Numbers of interactions in datasets from experimental screens. These datasets served as reference sets for evaluating computational methods. When comparing predicted and experimental datasets, PPIs used for training computational methods were excluded from all datasets. Dataset sizes are shown before (blue) and after (red) exclusion of training cases. Datasets shown: Y2H Lim06, MS Sato04, MS Ewing07, MS deHoog04, Y2H Rual05 low, Y2H Stelzl05 low, Y2H Rual05 high, MS Jorgensen09, Y2H Stelzl05 high, Y2H Stelzl05 med, Peptide array Wu07, MS Bouwmeester04, small-scale (<10 PPI), and LUMIER Barrios-Rodiles05.

2.2.3 Coverage of predicted networks

A second important criterion for evaluating predicted networks is their coverage of the interactome. Although coverage is difficult to assess, it is the main contribution of predicted networks: they provide information about sections of the interactome sparsely covered by experimental methods. We investigated the coverage of FpClass by taking increasingly larger sets of our top predictions, evaluating their FDR, and evaluating their overlap with experimentally determined PPI networks. Figures 2.9 and 2.10 show the number of PPIs and the number of proteins in predicted networks, relative to the FDR of the networks. To calculate FDR we used PPIs from small-scale screens as the reference set, since they comprised more interactions than HTP datasets (Figure 2.4) and they are often considered more reliable [190, 175], although there is evidence to the contrary [188, 44]. In comparison to networks from Rhodes et al. [145] and Scott et al. [158], our networks were considerably larger at equivalent FDRs, containing over ten times as many interactions and more than twice as many proteins. Our 50% FDR network is available at http://www.cs.utoronto.ca/~juris/data/SciSig/KotlyarSuppTable3.txt.

Figure 2.5: Overlaps of computational datasets with experimental datasets. Bars show the number of PPIs shared by each predicted dataset (Rhodes05, Scott07, FpClass) with each experimental dataset of Figure 2.4.

Figure 2.6: Ratios of observed to expected overlaps between high confidence predicted PPIs and PPIs from experimental screens. Bars show the ratio for each predicted dataset (Rhodes05, Scott07, FpClass) and each experimental dataset of Figure 2.4.

Figure 2.7: Estimating false positive rates of computational datasets. D’Haeseleer and Church [50] proposed a method of estimating false positive rates by considering the overlaps of three PPI datasets. Their approach was used by Hart et al. [79] and by Scott et al. [158] to evaluate PPI predictions. In this approach, two similar datasets are compared against a reference set. It is assumed that the overlap of any two datasets contains largely true positive PPIs. The number of true positives, IV, is calculated from the numbers of shared PPIs: $\mathrm{IV} = (\mathrm{II} \times \mathrm{III}) / \mathrm{I}$. Once the number of true positives is determined, it is used to calculate the number of false positives, V, and the false positive rate.

Figure 2.8: FDRs of high confidence predicted PPIs, based on overlaps with PPIs from various experimental screens. Bars show the FDR (%) of Rhodes05, Scott07 and FpClass for each of 7 reference sets: Y2H Lim06, MS Ewing07, MS Jorgensen09, Peptide array Wu07, MS Bouwmeester04, small-scale (<10 PPI), and LUMIER Barrios-Rodiles05.

To estimate the completeness of our networks (i.e., the fraction of the interactome captured by our networks) we compared them against several experimentally determined PPI datasets: one dataset comprising the union of human interactions from major PPI databases, and five datasets corresponding to PPIs of the 5 highest degree proteins in the first dataset. Interactions of high degree proteins may represent small subspaces of the interactome (1 protein × most others) that have been mapped more completely than other regions. Figure 2.11 shows the percentages of these datasets included in predicted networks with various FDRs. We considered these percentages as estimates of network completeness. The percentages were very similar for the union of databases and for three of the high degree proteins, reaching 20% at an FDR of ∼20% and rising to 40% at ∼70% FDR.

Figure 2.9: Numbers of interactions in predicted networks relative to the networks’ FDRs. Networks from FpClass are shown in blue. The high confidence network of Scott et al. [158] is shown in green and the network of Rhodes et al. [145] is shown in red. Dashed lines link these networks to FpClass networks with equivalent FDRs or equivalent numbers of interactions. Labelled points (FDR %, PPIs): FpClass (15, 32,440), (50, 178,738), (68, 410,196), (81, 1,513,363); Scott07 (68, 32,440); Rhodes05 (81, 35,947).

Figure 2.10: Numbers of proteins in predicted networks relative to the networks’ FDRs. Networks from FpClass are shown in blue. The high confidence network of Scott et al. [158] is shown in green and the network of Rhodes et al. [145] is shown in red. Dashed lines link these networks to FpClass networks with equivalent FDRs or equivalent numbers of proteins. Labelled points (FDR %, proteins): FpClass (15, 4,987), (50, 9,372), (68, 11,029), (81, 17,453); Scott07 (68, 4,987); Rhodes05 (81, 5,637).

2.2.4 Evidence for novel predicted PPIs

Evaluations of predictions are heavily dependent on PPI databases, but only a small fraction of the human interactome is present in databases. To assess PPI predictions not found in databases, we considered three alternative sources of evidence: text mining of PubMed, analysis of co-complexed protein pairs in the PDB database [17], and results of recent experimental PPI screens. Using the text mining method of Niu et al. [127] we identified 460 likely PPIs not present in curated databases6. In the PDB database we identified 711 candidate PPIs by searching for complexes comprising two different proteins, present as two or more distinct chains. Most of these protein pairs were present in PPI databases but 106 were absent. Our predictions identified significant fractions of novel PPIs from text mining and from PDB (Figure 2.12). At all FDR levels, the overlap of our predicted networks with each of these novel PPI sets was statistically significant (p = 7.2e-9). An example of a novel interaction from the PDB that was predicted with high confidence (among our top 7,000 proteome-wide predictions) is the binding of IRAK2 and IRAK4 [112], signalling mediators in the TLR/IL-1R superfamily.

6We obtained PPIs of major curated databases through I2D [27] version 1.8, which integrates interactions from IntAct [6], MINT [31], HPRD [101], and BioGRID [173], as well as other curated and HTP datasets.

Figure 2.11: Completeness of predicted networks. We evaluated the completeness of our predicted networks in two ways: by comparing against a set of PPIs encompassing major PPI databases and by comparing against PPIs of high degree proteins. We obtained PPIs encompassing major curated databases by downloading I2D [27] version 1.8 and selecting PPIs with the following sources: BioGRID [173], HPRD [101], IntAct [6], and MINT [31]. We refer to this PPI set as I2D∗. In this set, we identified the five highest degree proteins (GRB2, YWHAZ, TRAF6, P53, IKBKE) and defined separate sets for their PPIs. We then analyzed the overlap between our predicted networks and 6 PPI sets: I2D∗ and five sets containing PPIs of high degree proteins. The curves represent the percentages of these PPI sets included in predicted networks (y-axis: % PPIs identified), relative to the FDRs of the networks (x-axis: false discovery rate, %). Set sizes: I2D∗, 66,217 PPIs; GRB2, 647; YWHAZ, 400; TRAF6, 381; P53, 346; IKBKE, 315.

Figure 2.12: Evidence for predicted interactions not present in PPI databases. We defined a PPI set, I2D∗, encompassing major curated databases, by downloading I2D [27] version 1.8 and selecting PPIs with the following sources: BioGRID [173], HPRD [101], IntAct [6], and MINT [31]. We identified candidate novel interactions, absent from I2D∗, by using text mining of biological literature and analysis of complexes in the PDB database. We then determined if our predictions were identifying these novel interactions. The three curves indicate the percentages of novel PPIs, and I2D∗ PPIs, in predicted networks with various FDRs (x-axis: false discovery rate, %; y-axis: % PPIs identified). Set sizes: I2D∗, 66,217 PPIs; novel PPIs from text mining, 460; novel PPIs from PDB, 105.

To provide further support for novel interactions, we selected two candidate proteins, AKT1 and PIK3R1, and used the RePCA method [51] to validate their predicted interactions. At 50% FDR we predicted 3 of 40 PPIs detected for AKT1 and 5 of 26 PPIs detected for PIK3R1. These overlaps are significant: p = 2.8e-3 for AKT1 and p = 9.3e-3 for PIK3R1. Figure 2.13 shows the overlaps between the newly identified interactions, our 50% FDR network and interactions in curated databases. Some of the interactions identified by RePCA were supported by additional evidence: detection by a different experimental method or detection of a related protein pair by a different method. These high confidence RePCA interactions had more significant overlaps with our 50% FDR network. Our network contained 3 of 15 high confidence AKT1 interactions (p = 1.5e-4) and 5 of 12 high confidence PIK3R1 interactions (p = 2.0e-4).

Figure 2.13: Interactions of AKT1 and PIK3R1 in PPI databases, a recent experimental screen and 50% FDR predicted networks. (a) Interactions of AKT1. (b) Interactions of PIK3R1.

2.3 Discussion

We have developed a method for predicting PPIs that is particularly suited for an integrated interactome mapping strategy, in which HTP experimental methods are combined with computational predictions. Our method, FpClass, uses two novel approaches for interaction prediction: it considers combinatorial effects among predictive features and it associates predictive features with absence as well as presence of interaction. Testing results indicate that FpClass is complementary to HTP experimental methods. It confirms PPIs detected by HTP methods and identifies potential false negatives missed by these methods. Compared to previous human PPI prediction datasets, our results have higher overlaps with both HTP and small-scale experimental methods. The FDR of our predictions, calculated on the basis of these overlaps, is lower than that of previous human PPI prediction datasets. Consequently, at the FDR of previous methods, FpClass predicts over 10 times as many interactions and more than twice as many proteins, providing substantially higher coverage of the interactome, which is an important feature for integrated mapping.

2.3.1 Evaluating predictions

To comprehensively evaluate predictions made by FpClass, we considered top predicted PPIs and compared them to results of diverse experimental methods. One concern with this approach is that datasets from experimental methods are often relatively small, especially after removing PPIs used for training. However, overlaps between high confidence predictions and experimental datasets were consistently much higher than expected by chance and were statistically significant; overlaps with all datasets had p-values < 8.2e-4, and with 7 datasets, < 3.1e-23. Therefore, the overlaps are unlikely to be the result of sampling effects. Also, by considering overlaps with diverse experimental datasets, there is less concern that results are affected by biases of a particular dataset.

FDRs for Rhodes et al. [145] and Scott et al. [158], calculated on the basis of these overlaps, were consistent with previously reported values [79, 158]. For Rhodes et al. [145], FDR ranged from 71% to 95%; previously reported rates were 83% to 87% [79] and 65% to 83% [158]. For Scott et al. [158], FDR ranged from 41% to 80%; previously reported rates ranged from 65% to 89% [158]. Concordance with previous FDR results lends support to the FDR values calculated for FpClass, 18% to 54%.

2.3.2 Using predictive features in novel ways

Our method considers three novel relationships between predictive features and PPIs. First, it considers that if a protein has several features of the same type (e.g., multiple domains), these features may jointly affect the protein’s interactions. Secondly, it considers that just as interactions may be facilitated (or hindered) by a pair of domains on two proteins (one domain on each protein), interactions may also be affected by a pair of features of different types (e.g., a specific domain on one protein and a specific post-translational modification on the other). Thirdly, it considers that a pair of features (or feature sets), present on two proteins, may reduce the chances of interaction, rather than increase the chances.

Our predictions were improved by considering these three relationships, but previous evidence for these relationships is limited. Few studies have considered how sets of protein features affect protein interactions. Using combinations of protein domains has been shown to improve PPI predictions [78, 195], although it was not shown that multiple domains were required for interaction. A number of studies [189, 38] have shown that certain sets of protein domains occur together much more frequently than expected by chance. Proteins having the same domain sets are more likely to have similar functions and to interact with each other [38]. Feature sets identified by FpClass may similarly characterize groups of proteins with the same role and interaction profile; they may not necessarily correspond to structural features involved in protein-protein binding.

The relationship between protein features and absence of interaction has not been widely studied either. Several studies [153, 165] have focused on identifying non-interacting protein pairs, but few protein features have been related to absence of interaction.

Although the novel relationships used by our method have not been experimentally validated, they contribute to a large fraction of our proteome-wide predictions (Figure 2.14).

Figure 2.14: The total number of proteome-wide predictions and the numbers of predictions with no evidence, evidence that includes feature sets, and evidence that indicates absence of interaction. When no evidence was available, either positive or negative, protein pairs were predicted with a probability of ∼0.01, similar to the class ratio in the training data. Horizontal bars (x-axis: protein pairs, in millions): total predictions, no evidence, feature set evidence, non-interaction evidence.

Evidence for non-interaction may be especially helpful for highlighting false positive results of HTP experimental methods. Previous PPI prediction approaches assigned a low score to a protein pair if it lacked certain types of interaction evidence – a low score did not indicate that an interaction would not occur. Thus, if an interaction was detected by an experimental method but received a low prediction score, there was no reason to question the interaction. However, if low scores reflect evidence for non-interaction, then they may identify potential false positive interactions.

2.3.3 Predicting interactions of disease genes

At 50% FDR, FpClass predicted a human protein interaction network comprising 178,738 interactions and 9,372 proteins. This network provides a large number of candidate interactions for 424 cancer genes from the Cancer Project [59] and 12,394 disease genes from OMIM [122]. For both of these gene sets, we determined the degree distribution of corresponding proteins in a PPI network comprising major PPI databases7. We then extended this network with our 50% FDR predictions and recalculated the degree distributions of disease related proteins. Our predictions provided candidate interactions for 11 cancer genes and 584 OMIM genes which had no interactions in current PPI databases (i.e., the protein products of these genes had no known interactions). The number of genes with fewer than 10 interactions decreased by 55.2% for cancer genes and by 34.4% for OMIM genes (Figures 2.15 and 2.16).

7We used interactions from IntAct [6], MINT [31], HPRD [101], and BioGRID [173], contained in I2D [27] version 1.8.

(Plot for Figure 2.15. x-axis: degree (k); y-axis: % proteins with degree <= k; curves: experimental PPI network, and experimental & FpClass PPI network.)

Figure 2.15: Cumulative degree distribution of proteins from the Cancer in a

PPI network based on experimental screens and a network combining experimental screens with 50% FDR predictions. In the experimental network, many cancer proteins have few interactions: 27% have fewer than 5 interactions and 44% have fewer than 10. In the combined experimental and predicted network, these numbers are reduced by about half: 14% of cancer proteins have fewer than 5 interactions and 20% have fewer than 10.

Figure 2.16: Cumulative degree distribution of OMIM proteins in a PPI network based on experimental screens and a PPI network combining experimental screens and 50% FDR predictions. In the experimental network, 59% of OMIM proteins have fewer than 5 interactions and 74% have fewer than 10. In the combined network, these numbers are reduced to 39% and 50%, respectively. Axes: degree (k) versus % proteins with degree <= k; curves: experimental PPI network, and experimental & FpClass PPI network.

2.4 Methods

2.4.1 Predicting interactions

Our data mining-based approach for predicting PPIs comprised four stages: 1) identifying sets of co-occurring features, 2) defining a training set, 3) training the classifier and 4) testing the classifier (Figure 2.17). The aim of the first stage was to identify sets of features that could act co-operatively to affect interactions. We assumed that such sets could consist of any protein features that frequently co-occurred. To identify such sets we assembled a dataset comprising human proteins annotated with domains, post-translational modifications (PTMs), structural-chemical properties, localizations, functions, and processes. Using a novel frequent pattern mining algorithm, FpClass, we identified sets of features that co-occurred on proteins more frequently than expected by chance (see Supplemental Materials and Methods (next section) for a detailed description).
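To make the idea of "co-occurring more frequently than expected by chance" concrete, the following Python sketch scores pairs of features by the ratio of observed to expected co-occurrence under independence. It is a simplified illustration only, not the frequent pattern mining algorithm of FpClass: it handles only sets of size 2, and the function name, the thresholds `min_support` and `min_ratio`, and the toy annotations are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurring_pairs(annotations, min_support=2, min_ratio=1.5):
    """Return feature pairs that co-occur on proteins more often than expected.

    annotations: dict mapping a protein ID to an iterable of feature labels.
    Expected co-occurrence assumes the two features are independent.
    """
    n_proteins = len(annotations)
    single = Counter()
    pair = Counter()
    for features in annotations.values():
        unique = sorted(set(features))
        single.update(unique)
        pair.update(combinations(unique, 2))

    selected = {}
    for (a, b), observed in pair.items():
        expected = single[a] * single[b] / n_proteins
        ratio = observed / expected
        if observed >= min_support and ratio >= min_ratio:
            selected[(a, b)] = ratio
    return selected
```

For example, with four proteins of which two carry both `dom_i` and `dom_j`, the pair is observed twice against an expectation of one, giving a ratio of 2.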

In the second stage, we assembled a training dataset using these feature sets. The dataset contained ∼2.2 million protein pairs: 22,219 PPIs reported by at least two studies (positive cases) and 2,221,900 random protein pairs, excluding any experimentally identified interactions (negative cases). The data was highly biased in favour of negative cases to reflect the fact that protein-protein interactions constitute a small fraction of all possible protein pairs in a cell. Each protein pair in the training dataset was annotated with features of the individual proteins (e.g., domains) and features of the pair (e.g., presence of interacting orthologs in a model organism). Features of single proteins included original annotations as well as feature sets; original annotations were considered as feature sets of length 1 (Figure 2.17.2).
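The negative cases number exactly 100 times the positive cases (22,219 × 100 = 2,221,900). A minimal Python sketch of drawing random non-interacting pairs is shown below. It is an illustration under stated assumptions rather than the procedure used here: the function name, the fixed random seed and the toy identifiers are hypothetical, and any additional filtering of proteins is left out.

```python
import random


def sample_negative_pairs(proteins, known_ppis, n_pairs, seed=0):
    """Draw n_pairs random protein pairs that are not known interactions.

    proteins:   iterable of protein IDs
    known_ppis: iterable of (a, b) pairs with experimental evidence
    Pairs are unordered and self-pairs are never produced.
    """
    protein_list = sorted(set(proteins))
    protein_set = set(protein_list)
    known = {frozenset(p) for p in known_ppis}

    # Guard against requesting more negatives than exist.
    total = len(protein_list) * (len(protein_list) - 1) // 2
    blocked = sum(1 for p in known if len(p) == 2 and p <= protein_set)
    if n_pairs > total - blocked:
        raise ValueError("not enough candidate pairs")

    rng = random.Random(seed)
    negatives = set()
    while len(negatives) < n_pairs:
        a, b = rng.sample(protein_list, 2)   # two distinct proteins
        candidate = frozenset((a, b))
        if candidate not in known:
            negatives.add(candidate)
    return negatives
```

Rejection sampling of this kind is efficient when known interactions are a small fraction of all pairs, which is the situation described above.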

Figure 2.17: Stages of the prediction process: identification of feature sets which may act co-operatively to affect interactions, assembly of training data, training of the classifier and testing of the classifier.

1. Identifying feature sets. Input: proteins annotated with features (domains, PTMs, functions, processes, etc.). Frequent pattern mining yields sets of frequently co-occurring features, e.g., {dom i, dom j}, {PTM k, func n}, {dom j, dom n, loc q}. Output: proteins annotated with original features and sets of co-occurring features.

2. Defining training data. Input: proteins annotated with features and feature sets; interacting and non-interacting protein pairs (22,219 positive, 2,221,900 negative) annotated with features of the pair (gene co-expression, interolog information, topology, etc.). Output: protein pairs annotated with features / feature sets of individual proteins and of protein pairs.

3. Training. Input: protein pairs annotated with features / feature sets of individual proteins and of protein pairs. Scoring functions: for pairs of feature sets {Si}, {Sj}, enrichment or deficiency in positive PPIs; for gene expression, Pearson ρ; for orthology, interacting orthologs; for topology, shared neighbors. Scores are integrated into a single interaction score and a probability of interaction. Output: scores of feature set pairs and of features, indicating their contributions to interaction; probabilities of interaction for protein pairs.

4. Testing. Input: protein pairs annotated with features / feature sets of individual proteins and of protein pairs; scores of feature set pairs and of features. Output: probabilities of interaction for protein pairs, from cross-validation on training data and from proteome-wide predictions (21,303 proteins and 226,898,253 protein pairs).

Our classifier comprised several functions that mapped different types of evidence to interaction scores, indicating the confidence of interaction. A single function was applied to all pairs of feature sets, (Si, Sj), where Si and Sj were associated with different members of a protein pair. The function determined an odds ratio representing the level of enrichment or deficiency of feature set pairs among positive training cases. Several functions were applied to the remaining feature types: Pearson correlation to gene expression data, simple binary functions to orthology data (indicating presence or absence of interacting orthologs), and three functions to topology data, which measured the tendency of a protein pair to share the same neighbours in a PPI network. Scores from all functions were converted to probabilities of interaction; for a given score, s, the probability was the fraction of positive cases among all training cases with scores ≥ s. All probabilities were integrated into a single value, which was again converted to a probability (see Supplemental Materials and Methods).
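The score-to-probability conversion described above (fraction of positive cases among training cases with scores ≥ s) can be sketched as follows. This is an illustrative Python implementation, not the original code; the function name is hypothetical, and returning 0.0 when no training case reaches the queried score is an assumption made only to keep the function total.

```python
import bisect


def fit_score_to_probability(scores, labels):
    """Build a function mapping a score s to the fraction of positive
    training cases among all training cases with score >= s.

    scores: list of numeric training scores
    labels: list of 0/1 (or False/True) class labels, same length
    """
    cases = sorted(zip(scores, labels))           # ascending by score
    sorted_scores = [s for s, _ in cases]

    # positives_from[k] = number of positives among cases[k:]
    positives_from = [0] * (len(cases) + 1)
    for k in range(len(cases) - 1, -1, -1):
        positives_from[k] = positives_from[k + 1] + (1 if cases[k][1] else 0)

    def probability(s):
        k = bisect.bisect_left(sorted_scores, s)  # first case with score >= s
        n_cases = len(cases) - k
        if n_cases == 0:
            return 0.0   # assumption: no training case scores this high
        return positives_from[k] / n_cases

    return probability
```

Precomputing the suffix counts means each lookup costs only a binary search, which matters when the mapping is applied to hundreds of millions of protein pairs.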

2.4.2 Calculating enrichment of PPI sets among predictions

We evaluated predictions by assessing their enrichment (overlap) with respect to various PPI datasets that we considered as validation sets. We calculated the significance of enrichment based on the hypergeometric distribution, H(N, M, n, m), which gives probabilities of sampling without replacement. A commonly used example describing the use of this distribution is as follows: given an urn containing N marbles of which M are white, determine the probability of sampling n marbles without replacement and obtaining m white marbles. In our calculations of enrichment, N was the number of proteome-wide predictions (∼227 million), M was the number of interactions in a given validation set, n was the number of predictions in our selected network (e.g., our 50% FDR network, comprising 178,738 interactions), and m was the number of interactions common to our network and the validation set. Our reported p-values were calculated using the cumulative hypergeometric distribution. P-values represented the probability that the enrichment would be as high or higher than observed, due to chance. They were calculated as follows:

$$P(\text{common interactions} \ge m) = \sum_{i=m}^{\min(M,n)} H(N, M, n, i) \qquad (2.1)$$
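Equation 2.1 can be evaluated with the Python standard library alone, as in the sketch below. It is an illustration rather than the code used for the reported p-values; the function names are hypothetical. Binomial coefficients are computed in log space with `math.lgamma` so that population sizes on the order of 10^8 do not overflow. The observed-to-expected ratio assumes that the expected overlap under random sampling is nM/N, the mean of the hypergeometric distribution.

```python
from math import exp, lgamma


def log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def hypergeom_pmf(N, M, n, i):
    """H(N, M, n, i): probability of i white marbles when drawing n marbles
    without replacement from N marbles of which M are white."""
    if i < max(0, n - (N - M)) or i > min(M, n):
        return 0.0
    return exp(log_comb(M, i) + log_comb(N - M, n - i) - log_comb(N, n))


def enrichment_p_value(N, M, n, m):
    """P(common interactions >= m), the sum in Equation 2.1."""
    return sum(hypergeom_pmf(N, M, n, i) for i in range(m, min(M, n) + 1))


def observed_to_expected(N, M, n, m):
    """Ratio of the observed overlap m to the overlap expected by chance."""
    return m / (n * M / N)
```

For very small p-values the individual terms can underflow to zero in double precision, so a production implementation would sum in log space or use a statistics library.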

2.5 Supplemental Materials and Methods

2.5.1 Links to data tables

1. positive training cases:
   http://www.cs.utoronto.ca/~juris/data/SciSig/KotlyarSuppTable1.txt

2. top 32,234 predictions:
   http://www.cs.utoronto.ca/~juris/data/SciSig/KotlyarSuppTable2.txt

3. 50% FDR predictions:
   http://www.cs.utoronto.ca/~juris/data/SciSig/KotlyarSuppTable3.txt

4. novel PPIs from text mining:
   http://www.cs.utoronto.ca/~juris/data/SciSig/KotlyarSuppTable4.txt

5. novel PPIs from PDB:
   http://www.cs.utoronto.ca/~juris/data/SciSig/KotlyarSuppTable5.txt

6. experimentally identified PPIs for AKT1:
   http://www.cs.utoronto.ca/~juris/data/SciSig/KotlyarSuppTable6.txt

7. experimentally identified PPIs for PIK3R1:
   http://www.cs.utoronto.ca/~juris/data/SciSig/KotlyarSuppTable7.txt

2.5.2 Datasets

PPI predictions were made for a gold standard set of human interacting and non-interacting protein pairs and for a proteome-wide set of human protein pairs. The proteins in these datasets were based on a set of 21,303 UniProt proteins (release 15.0) [39], with unique Entrez Gene [119] IDs (Dec. 15, 2009). In this set of proteins, referred to as UniProt-Entrez, a single protein corresponded to a single gene. Multiple splice variants of a gene were not included for two reasons: most predictive features were associated with genes rather than splice variants, and known interactions were rarely specific to particular splice variants. If multiple UniProt IDs were associated with the same Entrez Gene ID, we chose a single representative protein by two criteria: firstly, presence in Swiss-Prot over TrEMBL, and secondly, longer sequence length. In protein pairs used for training and testing, we replaced UniProt proteins sharing the same Entrez Gene ID with our chosen representative for that ID.
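The two selection criteria can be expressed as a simple ordered comparison. The Python sketch below is illustrative only; the record fields (`uniprot_id`, `gene_id`, `reviewed`, `length`) and the toy identifiers are hypothetical, with `reviewed` standing for membership in Swiss-Prot. How ties were broken is not stated in the text; here the first entry encountered is kept.

```python
def choose_representatives(entries):
    """Pick one UniProt entry per Entrez Gene ID.

    entries: iterable of dicts with keys
        'uniprot_id', 'gene_id', 'reviewed' (True for Swiss-Prot), 'length'.
    Preference: Swiss-Prot over TrEMBL first, then longer sequence.
    """
    best = {}
    for entry in entries:
        key = (entry["reviewed"], entry["length"])
        current = best.get(entry["gene_id"])
        if current is None or key > (current["reviewed"], current["length"]):
            best[entry["gene_id"]] = entry
    return {gene: entry["uniprot_id"] for gene, entry in best.items()}
```

Because tuples compare element by element, a reviewed entry always wins over an unreviewed one regardless of length, and length only decides among entries of the same review status.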

Training and part of testing were carried out on a gold standard set of interacting and non-interacting protein pairs. Interacting protein pairs were selected from the I2D database [27] (version 1.72). We chose 22,219 interactions, each supported by at least 2 PubMed IDs. These interactions covered 7,478 proteins. Non-interacting pairs were generated by randomly choosing pairs of UniProt-Entrez proteins, which had either domain or Gene Ontology annotations. From the non-interacting set we excluded any protein pairs with experimental evidence in I2D [27] (but included I2D PPIs based exclusively on orthology, since we used orthology as a predictive feature). The non-interacting set contained 2,221,900 protein pairs and 19,355 proteins. Self-interactors were excluded from interacting and non-interacting pairs. PPI predictions were also made for all pairs of distinct UniProt-Entrez proteins.
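As a worked check on the size of the proteome-wide set (a calculation added for illustration, using only the protein count stated above), the number of unordered pairs of distinct proteins among 21,303 is:

```latex
\binom{21303}{2} = \frac{21303 \times 21302}{2} = 226{,}898{,}253 \approx 2.27 \times 10^{8}
```

This agrees with the ∼227 million proteome-wide protein pairs for which predictions were made.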

2.5.3 Prediction overview

Our prediction approach was based on integrating interaction evidence from features of individual proteins and from features of protein pairs. Using each type of feature, we calculated an interaction score, proportional to a probability of interaction. Interaction scores were mapped to probabilities and integrated into a single value.

Nine types of predictive features were used, 6 associated with single proteins and 3 associated with protein pairs. This division reflected how the features were used: a single scoring method, based on data-mining, was applied to features of individual proteins and multiple scoring methods were applied to features of protein pairs.

2.5.4 Features of individual proteins: description and sources

Features of individual proteins comprised protein domains, post-translational modifications (PTMs), structural-chemical protein properties, and Gene Ontology annotations – cellular component, molecular function, and biological process. Domains were obtained from InterPro [85], version 21.0, and UniProt [39], release 15.0. Post-translational modifications were obtained from UniProt [39], release 15.0, and from the Human Protein Reference Database (HPRD) [101], release 8.0. The domains and PTMs of a protein were represented by a binary vector; each entry in the vector indicated whether a given domain or PTM was present in the protein.

Structural-chemical properties were determined from protein sequence by two programs: PSIPRED [29, 95], version 26, and the pepstats application from the European Molecular Biology Open Software Suite [146]. PSIPRED was used to predict the fraction of a protein’s residues in disordered regions, alpha helices, beta sheets and coils. Pepstats was used to calculate 11 chemical properties: charge, isoelectric point, and the molar percent of each physico-chemical class of amino acid. Each feature calculated by PSIPRED and pepstats was discretized into 7 intervals. An interval represented a range of percentiles; for example, the first interval contained values between the 0th and 2.5th percentiles. The 7 intervals were defined as follows: [0%,2.5%], [0%,10%], [0%,40%], [40%,60%], [60%,100%], [90%,100%], [97.5%,100%]. The intervals were overlapping in order to capture different levels of low and high values as well as intermediate values.
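The overlapping discretization can be sketched as follows. The percentile-rank definition used here (percentage of values less than or equal to the given value) and the function names are assumptions made for illustration; the text does not specify how ties or interval boundaries were handled.

```python
from bisect import bisect_right

# The 7 overlapping percentile intervals listed in the text.
INTERVALS = [
    (0.0, 2.5), (0.0, 10.0), (0.0, 40.0),
    (40.0, 60.0),
    (60.0, 100.0), (90.0, 100.0), (97.5, 100.0),
]


def percentile_rank(value, sorted_values):
    """Percentage of values in sorted_values that are <= value (an assumption)."""
    return 100.0 * bisect_right(sorted_values, value) / len(sorted_values)


def interval_flags(value, sorted_values):
    """Return 7 booleans, one per interval; several may be True at once."""
    rank = percentile_rank(value, sorted_values)
    return [lo <= rank <= hi for lo, hi in INTERVALS]
```

Because the intervals overlap, a very low value activates the three "low" intervals together, which lets later steps pick whichever level of "low" is most informative.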

Human Gene Ontology annotations were downloaded from the Gene Ontology [8] website (http://www.geneontology.org/GO.current.annotations.shtml) on Apr. 21, 2009. Proteins were annotated with cellular component, molecular function, and biological process terms specified in the downloaded file, as well as ancestors of these terms.

2.5.5 Calculating interaction scores from features of individual proteins

Our basic approach for predicting PPIs from features of individual proteins was to identify pairs of features enriched in known interactions (footnote 8), and use such feature pairs as rules for predicting interactions. This approach was proposed by Sprinzak et al. [172] and has been used to predict PPIs based on pairs of domains [172, 145, 158], and pairs of post-translational modifications [158] enriched in known interacting protein pairs. The main aspects of the approach have remained similar across different studies: a single feature type is chosen (e.g., protein domains), a pair of features of this type (one feature on each protein) provides evidence of interaction, and the strength of the evidence is proportional to the enrichment of the feature pair among known interactions.

We extended this approach by considering three possibilities: (1) an interaction could require the presence of several features in a protein, (2) the required features could be of the same or of different types, and (3) the presence of particular features in a pair of proteins could provide evidence for or against interaction. To take into account these possibilities, we made 3 changes to the original approach:

• we identified sets of features, of the same or different types, which co-occur in proteins;

• we identified pairs of feature sets that were either enriched or deficient among known PPIs;

• we filtered resulting feature set pairs to reduce redundancy and improve prediction accuracy.

Footnote 8: Such that each protein in an interacting pair would have one of the features.

Identifying sets of co-occurring features

The interactions in which a protein participates could require the presence of several features – we assumed that a group of such features is likely to co-occur in proteins more frequently than expected by chance. To identify such feature sets we implemented a data-mining algorithm based on frequent pattern growth [77] – a method which finds all sets of features that occur at least k times in a dataset, where k is a user-specified threshold. Input to the method consists of records, each comprising a set of features. In our case, a record was the set of features, such as domains and PTMs, associated with a single protein. Our full dataset contained 21,303 records, one for each protein. Output from frequent pattern growth consists of feature sets that are subsets of at least k records. For example, with a setting of k = 3, we identified the feature set {Myb DNA binding domain, Homeodomain-related}, which meant that these 2 domains were found together in at least 3 proteins. Feature sets that have at least k occurrences are referred to as frequent feature sets, and their number of occurrences is referred to as support.
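The notions of record, support, and frequent feature set can be illustrated with a small Python sketch. This is a brute-force enumeration, not frequent pattern growth: it is only practical for tiny inputs, and the `max_size` cap and function name are assumptions added for illustration.

```python
from collections import Counter
from itertools import combinations


def frequent_feature_sets(records, k, max_size=3):
    """Return {feature set: support} for sets contained in at least k records.

    records: iterable of sets of features, one set per protein.
    Brute force: every subset up to max_size of every record is counted.
    """
    counts = Counter()
    for features in records:
        items = sorted(features)
        for size in range(1, min(max_size, len(items)) + 1):
            for subset in combinations(items, size):
                counts[frozenset(subset)] += 1
    return {fs: n for fs, n in counts.items() if n >= k}
```

For instance, three records that all contain both a Myb DNA binding domain and a Homeodomain-related domain yield the pair as a frequent feature set with support 3 when k = 3.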

The task of identifying all frequent feature sets can be difficult because the number of sets may expand exponentially with the number of features, leading to prohibitively high run times and storage requirements. Frequent pattern growth reduces the run time by using a compressed data representation, which allows fast identification of a feature set’s support. If the support is below k, the feature set is not recorded and its supersets do not need to be considered since they will have the same or lower supports. Frequent pattern growth searches for frequent sets by starting with an empty set and expanding it one feature at a time; therefore, when supersets are not considered the search space is constrained. By using fast support look-ups to constrain the search space, frequent pattern growth achieves lower run times than previous pattern mining methods [65]. To further reduce run time and especially to reduce the number of feature sets, we made several modifications to frequent pattern growth.

• A feature set was discarded if its support was similar to that of its subsets, i.e., ≥ 80% of a subset’s support. Such feature sets were not recorded and their supersets were not considered. This was done because a set with similar support to a subset would provide similar information about interaction – the two sets would be present on most of the same proteins, and likely in many of the same interacting protein pairs.

• Whenever a feature f was considered as an extension for feature set FS, the support of FS ∪ f had to be substantially higher than expected by chance. The expected support of FS ∪ f was calculated as

$$E(\mathrm{sup}(FS \cup f)) = \frac{\mathrm{sup}(FS)}{N} \times \frac{\mathrm{sup}(f)}{N} \times N, \tag{2.2}$$

where N is the number of records in the dataset. To retain feature set FS ∪ f and expand it further, the following criterion had to be met:

$$\frac{\mathrm{sup}(FS \cup f)}{E(\mathrm{sup}(FS \cup f))} > 10. \tag{2.3}$$

• When feature f was considered as an extension for set FS, the probability of sup(FS ∪ f) being equal to or greater than its observed value had to be less than a minimum threshold. This probability, $P(\mathrm{sup}_{FS \cup f} \ge \mathrm{sup}(FS \cup f))$, was calculated with the hypergeometric distribution, H(N, M, n, m). A common way of describing the problem addressed by this distribution is with an ‘urn model’: given an urn containing N balls of which M are white, determine the probability of randomly taking n balls without replacement and obtaining m white balls. We used the cumulative hypergeometric probability and set the distribution parameters as follows: N = total records, M = sup(f), n = sup(FS), and m = sup(FS ∪ f). If the cumulative probability was higher than 0.05, feature set FS ∪ f was not recorded and its supersets were not considered.
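The three modifications above can be sketched as a single acceptance test for a candidate extension FS ∪ f. This is a minimal sketch under stated assumptions: the function names are hypothetical, the similarity filter is applied against every subset support passed in (the text does not say which subsets were compared), and the hypergeometric tail is computed directly with exact integer arithmetic rather than with a statistics library.

```python
from math import comb


def expected_support(sup_fs, sup_f, n_records):
    """Eq. 2.2: expected support of FS ∪ f if FS and f were independent."""
    return (sup_fs / n_records) * (sup_f / n_records) * n_records


def hypergeom_upper_tail(N, M, n, m):
    """P(X >= m) when drawing n balls without replacement from N, M of them white."""
    total = comb(N, n)
    upper = min(M, n)
    hits = sum(comb(M, x) * comb(N - M, n - x) for x in range(m, upper + 1))
    return hits / total


def keep_extension(sup_union, sup_fs, sup_f, n_records, subset_supports,
                   max_similarity=0.8, min_ratio=10.0, max_p=0.05):
    """Return True if FS ∪ f passes all three filters described in the text."""
    # Filter 1: discard sets whose support is >= 80% of a subset's support.
    if any(sup_union >= max_similarity * s for s in subset_supports):
        return False
    # Filter 2 (Eq. 2.3): observed/expected support must exceed 10.
    if sup_union / expected_support(sup_fs, sup_f, n_records) <= min_ratio:
        return False
    # Filter 3: cumulative hypergeometric probability must not exceed 0.05.
    return hypergeom_upper_tail(n_records, sup_f, sup_fs, sup_union) <= max_p
```

A rejected extension is neither recorded nor expanded, so each filter also prunes every superset of the rejected set from the search.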

After feature sets were identified we annotated all proteins with both their original features and with feature sets. No distinction was made between the two types of annotations; original features were considered as feature sets of length 1. We refer to the feature sets of a protein simply as features.

Identifying pairs of enriched and deficient feature sets

Once proteins were annotated, we determined the support of features and feature pairs among training cases (interacting and non-interacting protein pairs). For each feature i, we determined support values posSup_i and negSup_i among positive and negative training cases, respectively. Support was defined as the number of protein pairs where at least one of the proteins was annotated with the feature. Similarly, for each pair of features, (i, j), we calculated support values posSup_ij and negSup_ij among positive and negative cases, respectively. For feature pairs, support was defined as the number of protein pairs where one of the proteins had feature i and the other had feature j.

For each feature pair, (i, j), we used support values to calculate several measures quantifying enrichment or deficiency of the pair among positive training cases. Two measures, rPos and pPos, represented enrichment or deficiency relative to the expected occurrences of (i, j) among positive cases. We defined the expected occurrences of (i, j) among positive cases as

$$E(posSup_{ij}) = \frac{posSup_i}{nPos} \times \frac{posSup_j}{nPos} \times nPos, \tag{2.4}$$

where nPos is the number of positive training cases. rPos was the ratio between the observed and the expected support:

$$rPos = \frac{posSup_{ij}}{E(posSup_{ij})}. \tag{2.5}$$

If the value of rPos was greater than 1, pPos was the probability of posSup_ij being greater than or equal to its observed value, given the values of posSup_i and posSup_j. If rPos was less than 1, pPos was the probability of posSup_ij being less than or equal to its observed value. pPos was calculated by the cumulative hypergeometric distribution with the settings N = nPos, M = posSup_i, n = posSup_j, m = posSup_ij.

Two other measures, rAll and pAll, represented enrichment or deficiency of the feature pair (i, j) among positive cases, relative to the expected occurrences of (i, j) among negative cases. The number of expected occurrences among negative cases was defined as

$$E(negSup_{ij}) = \frac{negSup_i}{nNeg} \times \frac{negSup_j}{nNeg} \times nNeg, \tag{2.6}$$

where nNeg is the number of negative training cases. rAll was defined as

$$rAll = \frac{posSup_{ij}}{E(negSup_{ij})} \times \frac{nNeg}{nPos}. \tag{2.7}$$

rAll > 1 indicated enrichment of (i, j) among positive cases, while rAll < 1 indicated deficiency. pAll was defined as the cumulative hypergeometric probability with parameters N = nAll, M = nPos, n = posSup_ij + negSup_ij, m = posSup_ij, where nAll is the total number of training cases. If rAll was greater than 1, the right tail of the distribution was used; otherwise, the left tail was used. Feature pairs were considered to be enriched or deficient among positive cases if their values of pPos and pAll were less than 0.001.
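The four measures can be computed together, as in the following minimal sketch. The function and argument names are hypothetical; the use of the left tail when a ratio equals exactly 1 is an assumption, since the text only covers ratios above and below 1; and the sketch assumes non-zero expected supports.

```python
from math import comb


def hypergeom_tail(N, M, n, m, upper):
    """Cumulative hypergeometric probability: P(X >= m) if upper else P(X <= m)."""
    total = comb(N, n)
    if upper:
        xs = range(m, min(M, n) + 1)
    else:
        xs = range(max(0, n - (N - M)), m + 1)
    return sum(comb(M, x) * comb(N - M, n - x) for x in xs) / total


def enrichment(pos_i, pos_j, pos_ij, neg_i, neg_j, neg_ij, n_pos, n_neg):
    """Return (rPos, pPos, rAll, pAll) for one feature pair (i, j)."""
    exp_pos = pos_i / n_pos * pos_j / n_pos * n_pos    # Eq. 2.4
    exp_neg = neg_i / n_neg * neg_j / n_neg * n_neg    # Eq. 2.6
    r_pos = pos_ij / exp_pos                           # Eq. 2.5
    r_all = pos_ij / exp_neg * n_neg / n_pos           # Eq. 2.7
    # Right tail for enrichment, left tail otherwise (ratio == 1 is an assumption).
    p_pos = hypergeom_tail(n_pos, pos_i, pos_j, pos_ij, upper=r_pos > 1)
    p_all = hypergeom_tail(n_pos + n_neg, n_pos, pos_ij + neg_ij, pos_ij,
                           upper=r_all > 1)
    return r_pos, p_pos, r_all, p_all
```

A feature pair would then be retained as enriched or deficient when both returned probabilities fall below 0.001.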

Calculating interaction scores from feature pairs

A set, S, of enriched and deficient feature pairs was used to determine interaction scores. To calculate an interaction score for a protein pair (P_a, P_b), feature pairs (i, j) from S were selected such that i was a feature of one protein, and j was a feature of the other. Among the selected feature pairs, the pairs with the highest and lowest rAll values were identified. These pairs, fp_max and fp_min, provided the strongest evidence for and against interaction, respectively. Their rAll values, rAll_max and rAll_min, were used to set the interaction score as follows:

$$\text{interaction score}(P_a, P_b) = \begin{cases} rAll_{max} & rAll_{max} \times rAll_{min} > 1 \\ rAll_{min} & rAll_{max} \times rAll_{min} < 1 \\ 1 & \text{no feature pairs for } P_a, P_b \end{cases} \tag{2.8}$$

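The scoring rule for a protein pair can be sketched as follows, assuming the set S is held as a dictionary from feature pairs to rAll values. The names are hypothetical, and the case where the product of the highest and lowest rAll values equals exactly 1 is not covered by the rule as stated; this sketch returns the lowest value in that case, which is an assumption.

```python
def interaction_score(features_a, features_b, r_all_by_pair):
    """Score a protein pair from the rAll values of its matching feature pairs.

    features_a, features_b: sets of features of the two proteins.
    r_all_by_pair: {(i, j): rAll} for enriched and deficient feature pairs.
    """
    matched = [
        r for (i, j), r in r_all_by_pair.items()
        if (i in features_a and j in features_b)
        or (i in features_b and j in features_a)
    ]
    if not matched:
        return 1.0  # no evidence for or against interaction
    r_max, r_min = max(matched), min(matched)
    # Strongest evidence for interaction wins only if it outweighs the
    # strongest evidence against; a product of exactly 1 falls through to r_min.
    return r_max if r_max * r_min > 1 else r_min
```

For example, a pair matching feature pairs with rAll values 8.0 and 0.5 scores 8.0, while a pair matching 8.0 and 0.05 scores 0.05.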
Filtering feature sets

Using feature sets of length ≥ 1 resulted in a large number of redundant feature pairs: multiple feature pairs present in largely the same protein pairs. The presence of such feature pairs lowered prediction accuracy. This happened when feature pairs predicted the same true positive cases, but different false positive cases. For example, feature pairs (i, j) and (k, l) could each be fairly accurate, predicting nTP true positives and nFP false positives with $nFP = \frac{1}{5} nTP$. However, if they predict the same true positives but different false positives, then using them together results in nTP true positives and 2 × nFP false positives. Combining more such feature pairs would give a linear increase in the number of false positives.

To reduce this problem we ensured that only one feature pair could gain support from a positive training case. This was implemented with a greedy set cover algorithm. All feature pairs from S with rAll > 1 were placed in a set P. The feature pair, fp_max, with the highest rAll value was identified and moved from P to a set Q. Positive training cases where fp_max was present were identified. If a feature pair in P was present in k of these cases, its posSup value was lowered by k. Its rAll value was recalculated, and if the rAll value reached 1, the feature pair was removed from P. These steps were repeated until set P was empty. Interaction scores for test cases were then determined from feature pairs in Q and feature pairs in S which had rAll values less than 1.
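The greedy filtering steps can be sketched as below. The data layout is an assumption made for illustration: each feature pair is represented by the set of positive training cases in which it is present, and by a precomputed factor `scale = nNeg / (nPos × E(negSup))`, so that rAll equals the remaining positive support multiplied by `scale` (Eq. 2.7). Ties between equal rAll values are broken by dictionary order, and a pair is dropped once its rAll is at or below 1.

```python
def greedy_filter(pos_cases, scale):
    """Return the list Q of selected feature pairs, in order of selection.

    pos_cases: {feature pair: set of positive case ids where it is present}
    scale:     {feature pair: nNeg / (nPos * E(negSup))}, so rAll = support * scale
    """
    # Set P: pairs that are enriched (rAll > 1) before any filtering.
    remaining = {fp: set(cases) for fp, cases in pos_cases.items()
                 if len(cases) * scale[fp] > 1}
    selected = []
    while remaining:
        # Move the pair with the highest current rAll from P to Q.
        best = max(remaining, key=lambda fp: len(remaining[fp]) * scale[fp])
        covered = remaining.pop(best)
        selected.append(best)
        # Cases claimed by the selected pair no longer support other pairs.
        for fp in list(remaining):
            remaining[fp] -= covered
            if len(remaining[fp]) * scale[fp] <= 1:
                del remaining[fp]
    return selected
```

Each positive training case therefore contributes support to at most one selected feature pair, which is what limits the accumulation of false positives from redundant pairs.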

2.5.6 Features of protein pairs

Features of protein pairs consisted of information about interacting orthologs, gene co-expression and network topology. Information about interacting orthologs was taken from I2D [27], version 1.72 on Oct. 19, 2009. For a given protein pair, there were 5 interaction scores based on orthology information: each score was simply a binary value indicating the presence or absence of interacting orthologs in a particular model organism – mouse, rat, fly, worm, or yeast.

Gene co-expression information was based on 10 gene expression datasets from the Gene Expression Omnibus [12]: GDS596, GDS1221, GDS1289, GDS1329, GDS1618, GDS1730, GDS2250, GDS2545, GDS2780, and GDS2842. Each of these datasets contained over 8,000 transcripts measured in at least 15 samples. Each dataset was processed by the MAS 5.0 algorithm (http://www.affymetrix.com/support/technical/whitepapers.affx; Statistical algorithms description document), using the affy package (version 1.20.0) in R (version 2.8) [179]. Expression levels of each sample were mean centered and levels from multiple probe sets for the same gene were averaged. Using each dataset, Pearson correlation coefficients were calculated for all available gene pairs; these correlations were used as interaction scores.
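The post-processing steps (mean centering of samples, averaging of probe sets, and Pearson correlation) can be illustrated with a minimal Python sketch. It does not reproduce MAS 5.0 or the affy package; the data layout (a dictionary from probe set ID to a list of per-sample expression levels) and all function names are assumptions.

```python
from collections import defaultdict
from math import sqrt
from statistics import mean


def mean_center_samples(probe_rows):
    """Subtract each sample's mean (across probe sets) from its values."""
    n_samples = len(next(iter(probe_rows.values())))
    sample_means = [mean(row[s] for row in probe_rows.values())
                    for s in range(n_samples)]
    return {probe: [v - m for v, m in zip(row, sample_means)]
            for probe, row in probe_rows.items()}


def average_probe_sets(probe_rows, probe_to_gene):
    """Average the rows of all probe sets that map to the same gene."""
    grouped = defaultdict(list)
    for probe, row in probe_rows.items():
        grouped[probe_to_gene[probe]].append(row)
    return {gene: [mean(col) for col in zip(*rows)]
            for gene, rows in grouped.items()}


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)
```

Applying `pearson` to every pair of gene rows returned by `average_probe_sets` gives one co-expression score per gene pair and dataset.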

Topology information was based on a PPI network consisting of positive training cases. For a given protein pair (a, b), three interaction scores were calculated based on the known interactors of proteins a and b. The first score was simply the number of interactors shared by the two proteins:

$$S_{shared} = |I_c|, \tag{2.9}$$

where I_c is the set of shared interactors of proteins a and b. The second score, from Scott et al. [158], adjusted the number of shared interactors by the degrees of proteins a and b:

$$S_{Scott} = \frac{|E_c|}{1 + |E_a \setminus E_c| + |E_b \setminus E_c|}, \tag{2.10}$$

where E_c is the set of edges from proteins a and b to their shared interactors, E_a is the set of edges of protein a and E_a \ E_c is the set E_a minus the set E_c. Proteins a and b received a high score if most of their interactors were the same.
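The first two topology scores can be sketched on an adjacency-set representation of the network. The function name and data layout are assumptions. The sketch also assumes that each shared interactor contributes two edges to E_c (one from a and one from b), which is one reading of the definition of E_c.

```python
def topology_scores(network, a, b):
    """Return (S_shared, S_Scott) for proteins a and b.

    network: {protein: set of interacting proteins} (undirected PPI network).
    """
    neighbours_a = network.get(a, set())
    neighbours_b = network.get(b, set())
    shared = neighbours_a & neighbours_b
    s_shared = len(shared)                      # Eq. 2.9
    edges_c = 2 * len(shared)                   # edges from a and from b (assumption)
    edges_a_only = len(neighbours_a - shared)   # |E_a \ E_c|
    edges_b_only = len(neighbours_b - shared)   # |E_b \ E_c|
    s_scott = edges_c / (1 + edges_a_only + edges_b_only)   # Eq. 2.10
    return s_shared, s_scott
```

The constant 1 in the denominator keeps the score finite when all interactors of both proteins are shared.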

The third score, S_pShared, estimated the probability that proteins a and b would share at least their observed k common neighbours. This estimate depended on 4 variables: the degrees of a and b, the number of shared neighbours and the degrees of the shared neighbours. To derive this estimate we started with the simplifying assumption that all neighbours of a and b had equal degrees. We defined a such that deg(a) ≤ deg(b) and estimated the probability of exactly k shared neighbours as follows:

$$P(k \text{ shared}) = \binom{\deg(b)}{k} P(I_{an_b})^{k} \bigl(1 - P(I_{an_b})\bigr)^{\deg(a)-k}, \tag{2.11}$$

where $P(I_{an_b})$ is the probability of an interaction between a and a neighbour of b, $n_b$. We defined $P(I_{an_b})$ as:

$$P(I_{an_b}) = \frac{\deg(n_b) - 1}{|E| - \deg(b)}. \tag{2.12}$$

If degrees are not equal, the probability of k shared neighbours depends on the degrees of these neighbours. Since we were interested in the probability of the observed shared neighbours, $N_{ba}$, we defined our required probability, $P'$, as follows: the probability of a and b sharing at least k neighbours, with degrees similar to or lower than those in the set $N_{ba}$. Before calculating this probability we defined three sets containing neighbours of b: set $N_b$ containing all neighbours of b, set $N_{b\bar{a}}$ containing neighbours not shared with protein a, and the previously mentioned set $N_{ba}$ containing neighbours shared with protein a. We estimated $P'$ for exactly k shared neighbours as follows:

$$P'(k \text{ shared}) = \binom{|\{n \in N_b \mid \deg(n) \le \deg_{max}(N_{ba})\}|}{|N_{ba}|} \prod_{n_b \in N_{ba}} P(I_{an_b}) \prod_{n_b \in N_{b\bar{a}}} \bigl(1 - P(I_{an_b})\bigr), \tag{2.13}$$

where $\deg_{max}(N_{ba})$ is the maximum degree of neighbours in the set $N_{ba}$.

To calculate the probability of sharing k or more neighbours, we sorted neighbours in set $N_{b\bar{a}}$ from highest to lowest degree and calculated the probability as follows:

$$P'(\ge k \text{ shared}) = P'(k \text{ shared}) + P'(> k \text{ shared}),$$

where

$$P'(> k \text{ shared}) = \sum_{i=k+1}^{\deg(a)} P'(k \text{ shared}) \times \binom{|N_{b\bar{a}}|}{i-k} \prod_{j=1}^{i-k} P(I_{an_{b\bar{a},j}}) \prod_{j=i-k+1}^{\deg(a)-k} \bigl(1 - P(I_{an_{b\bar{a},j}})\bigr), \tag{2.14}$$

where $P(I_{an_{b\bar{a},j}})$ is the probability of an interaction between a and the j-th neighbour in set $N_{b\bar{a}}$.

2.5.7 Score integration

For each test case, interaction scores from the various feature types were integrated into a single score. Although previous studies used naive or semi-naive Bayes classifiers for integration, we obtained better results with a simple heuristic approach. First, the interaction score from each source was converted to a probability. If score $s_{i,j}$ was the score for protein pair i from source j, the probability of the score was calculated as follows:

$$P(s_{i,j}) = \frac{|\text{positive training cases with scores} \ge s_{i,j}|}{|\text{all training cases with scores} \ge s_{i,j}|}. \tag{2.15}$$
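The conversion in Eq. 2.15 can be sketched as a small function over labelled training scores. The function name and the data layout (a list of `(score, is_positive)` tuples) are assumptions, and the sketch simply raises an error when no training case reaches the given score.

```python
def score_to_probability(score, training):
    """Eq. 2.15: fraction of training cases scoring >= score that are positive.

    training: list of (score, is_positive) tuples for one score source.
    """
    labels = [is_positive for s, is_positive in training if s >= score]
    if not labels:
        raise ValueError("no training cases with scores >= the given score")
    return sum(labels) / len(labels)
```

With training cases scored 0.9 (positive), 0.8 (positive), 0.7 (negative), and 0.2 (negative), a score of 0.7 maps to a probability of 2/3.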

If fewer than 20 training cases had scores ≥ $s_{i,j}$, then the probability was calculated based on the 20th top score among training cases. Once all interaction scores were converted to probabilities, a single interaction score, $s_i$, for protein pair i was calculated as follows:

$$s_i = 1 - \prod_{j=1}^{n} \bigl(1 - P(s_{i,j})\bigr), \tag{2.16}$$

where n is the number of interaction score sources. The resulting score was also converted to a probability.

Chapter 3

Predicting Transient PPIs From Gene Co-expression

Abstract

Background

Gene pairs with correlated expression are more likely to encode interacting proteins. Although this association is weak, it is statistically significant and many studies have used gene co-expression to help predict protein-protein interactions (PPIs). However, the predictive value of gene co-expression has largely been limited to stable PPIs – transient interactions have been predicted poorly or not at all. Thus, interactions of proteins such as kinases have not been effectively identified by gene co-expression.

Results

We investigate the relationship between gene co-expression and PPIs, and present approaches for predicting both stable and transient PPIs from co-expression. First, we analyze correlation distributions of different genes – each distribution consisting of correlations between a given gene and all others in a gene expression dataset. We find that genes have significantly different correlation distributions (p < 4.7e-31), depending on whether their proteins participate in transient or stable interactions. Our PPI prediction approaches, referred to as local and mixed, account for these differences in correlation distributions. They assume that two genes, i and j, are more likely to encode interacting proteins if the correlation between i and j is higher than most correlations of i with other genes and most correlations of j with other genes. Thus, the correlation threshold for predicting a PPI is specific to each gene pair. By contrast, the previous approach for predicting PPIs from gene co-expression, referred to as global, predicts an interaction if the correlation of two genes is higher than a fixed threshold, used for all gene pairs.

We evaluate the local, mixed, and global methods to determine which ones are best for predicting various types of PPIs: transient, stable, and a combination of different types. Also, we evaluate whether the choice of best method depends on the gene expression dataset used for calculating correlations. To address these questions we apply the methods to 3 test sets, containing different types of human PPIs (transient, permanent, and combined) and random pairs of human proteins. We repeat predictions for these test sets using 42 gene expression datasets. We find that the mixed approach performs significantly better than the others (p < 0.05) on all 3 test sets and is especially effective at predicting interactions of signal transduction and transferase proteins. These improvements occur with most gene expression datasets but to varying degrees. We identify dataset properties, such as high gene expression variance, that optimize the performance of the mixed approach.

Conclusions

A gene’s correlation distribution can indicate whether its encoded protein participates in transient or stable interactions. By accounting for these differences it is possible to significantly improve predictions of transient and stable interactions from gene co-expression.

3.1 Introduction

Gene co-expression refers to correlation between the expression levels of two genes. Numerous studies have used gene co-expression to predict protein-protein interactions (PPIs) [91, 190, 92, 145, 158, 144] or to assess the reliability of PPIs reported by various sources [48, 27]. The association between gene co-expression and protein-protein interaction was first systematically shown by Ge et al. [62]. Their study clustered yeast genes by co-expression, and showed that genes from the same cluster were about 5 times more likely to have interacting proteins than genes from different clusters (p=9.8e-43). von Mering et al. [190] found that co-expression could predict about 10% of known yeast PPIs with a precision (footnote 1) of about 1%. Since co-expression on its own was not a strong predictor of PPIs, it was often used in combination with other predictive data types. For example, Ramani et al. [144] combined co-expression with orthology data, and showed that predictions of human PPIs were considerably improved when gene pairs were co-expressed in human and had co-expressed orthologs in other species. Other studies integrated gene co-expression with protein features such as Gene Ontology annotations [8] and protein domains [92, 145, 158].

Although gene co-expression is not the strongest evidence for PPIs, it has important benefits as a predictive feature: it is available for most genes, it provides information about interaction stability, and in some cases, it provides information about interaction context. Gene co-expression data can be determined for most human and model organism genes using databases such as Gene Expression Omnibus (GEO) [13]. Co-expression can indicate the stability of PPIs – very stable PPIs, with half-lives between 12 minutes and 19 hours [148], often correspond to gene pairs with very high correlations in multiple expression datasets [91, 27]. Such stable interactions characterize metabolic enzymes, core RNA polymerase, DNA replication complexes, nuclear pore complexes, ribosomes, and spliceosomes [139]. The third and possibly most important benefit of gene co-expression is its ability to provide context for PPIs. Context becomes evident when high co-expression occurs only in expression datasets associated with certain conditions or only among certain samples within datasets. For example, Taylor et al. [178] showed that in breast cancer patients with long term remission, genes encoding BRCA1 and its known interactors were highly correlated. In patients with poor outcome, these correlations were lower, suggesting that interactions of the BRCA1 protein were disrupted. Unfortunately, gene co-expression has often been unable to identify context; it has usually detected only stable interactions, which occur under most conditions. Interactions which occur only in certain contexts have been poorly detected by gene co-expression. This is especially true of transient interactions, which last between 0.1 and 1 seconds. Transient interactions play numerous critical roles. They are involved in all protein modifications, such as phosphorylation and acetylation. Through such modifications, they regulate most cellular processes including cell growth, cell cycle, metabolic pathways, transcription, protein transport and signal transduction [139]. Not surprisingly, proteins involved in transient interactions have been implicated in many diseases [123, 87].

Footnote 1: precision = true positives / (true positives + false positives)

The inability of co-expression to predict transient interactions has been reported by several studies [91, 27, 207]. Jansen et al. [91] showed that permanent yeast protein complexes could be readily identified by co-expression but transient complexes and PPIs detected by yeast 2-hybrid had only a weak relationship with co-expression. Similarly, Brown et al. [27] showed that protein pairs in stable yeast complexes had much higher co-expression than random protein pairs, while kinase-substrate interactions had similar co-expression to random protein pairs. (Co-expression or correlation of protein pairs refers to correlation of genes encoding the protein pairs.) Ramani et al. [144] suggested that integrating co-expression with orthology still would not provide reliable predictions of transient interactions.

We propose that the inability to detect transient PPIs by co-expression was partly due to the use of random protein pairs to create the background correlation distribution. Instead, we suggest comparing the correlation of two interacting proteins, P_a and P_b, against correlations of the same two proteins with all others (i.e., correlations of P_a with all proteins P_i and correlations of P_b with all proteins P_i, i ∉ {a, b}). We refer to correlations of random protein pairs as global information and to correlations involving candidate interacting proteins as local information.
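The contrast between global and local information can be illustrated with a small Python sketch. This is not the scoring function of the local or mixed methods; it only shows the idea of ranking a correlation within each protein's own correlations. The function names, the nested-dictionary layout, and the use of the lower of the two ranks are assumptions made for illustration.

```python
def symmetric(pairs):
    """Build {protein: {other protein: correlation}} from {(a, b): correlation}."""
    corr = {}
    for (a, b), r in pairs.items():
        corr.setdefault(a, {})[b] = r
        corr.setdefault(b, {})[a] = r
    return corr


def local_rank(corr, a, b):
    """Fraction of a's correlations with proteins other than b that are below corr(a, b)."""
    target = corr[a][b]
    others = [r for p, r in corr[a].items() if p != b]
    return sum(r < target for r in others) / len(others)


def local_score(corr, a, b):
    """Local information: how unusual corr(a, b) is for both a and b."""
    return min(local_rank(corr, a, b), local_rank(corr, b, a))


def global_call(corr, a, b, threshold):
    """Global information: one fixed correlation threshold for all protein pairs."""
    return corr[a][b] >= threshold
```

In a toy network where two proteins correlate at 0.5 with each other and at most 0.3 with everything else, a fixed threshold of 0.6 misses the pair, while its local score is 1.0 because 0.5 exceeds all other correlations of both proteins.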

These ideas are based on the following hypothesis: proteins which participate in transient interactions have different correlation distributions (footnote 2) than proteins which participate in stable interactions, and accounting for these differences can improve PPI predictions.

To explain our hypothesis we introduce a simple model which defines two types of proteins: synchronized and unsynchronized (Figure 3.1), involved in stable and transient interactions, respectively. A synchronized protein is highly correlated with many others (footnote 3), although it does not necessarily participate in many physical interactions. For example, the protein may be a member of a large stable complex, or may participate in a process involving many other proteins. The many high correlations of a synchronized protein cause its correlation distribution to have a heavy positive tail – for example, the 99th percentile of the distribution occurs at a high correlation (Figure 3.1(c)). Unsynchronized proteins have mostly low correlations with other proteins, and participate primarily in transient interactions. Their correlation distributions have a light positive tail, with the 99th percentile occurring at a moderately high correlation level (Figure 3.1(d)). Physical interactions of an unsynchronized protein have correlations that are among the highest in this distribution (e.g., close to the 99th percentile), and some of them may be recognizable as outliers. However, a synchronized protein could have many correlations that are as high or higher. Since random protein pairs may include synchronized proteins, correlations of random pairs may also be higher than those of unsynchronized proteins. Therefore, it may be difficult to distinguish correlations corresponding to transient interactions from correlations of random protein pairs. Recognizing the correlation of two transiently interacting proteins may require comparing it to other correlations of the two proteins.

Footnote 2: The correlation distribution of a protein refers to the distribution of correlations between the protein’s gene and all other genes in a gene expression dataset.

Footnote 3: The correlation of a protein refers to the correlation between the protein’s gene and another gene.

[Figure 3.1 panels. (a) Synchronized protein S: direct partners are few or many, with very high correlation with S (e.g., ρ = 0.8); indirect partners are many (e.g., > 100), with high correlation with S (e.g., ρ = 0.6). (b) Unsynchronized protein U: direct partners are few or many, with moderate correlation with U (e.g., ρ = 0.5); indirect partners are few or many, with low correlation with U (e.g., ρ = 0.3). (c) Correlations of S, (d) correlations of U, (e) correlations of both S and U; each plots density against Pearson correlation, with the 99th percentile marked in (c) and (d).]

Figure 3.1: Synchronized and unsynchronized proteins. Figure (a) shows the main properties of a synchronized protein S: very high correlation with several direct partners (physically interacting with S) and high correlation with a large number of other proteins, referred to as indirect partners (e.g., proteins in the same complex or process as S). Figure (b) shows the main properties of an unsynchronized protein: moderate correlation with several direct interactors and low correlation with other proteins. Figure (c) shows the correlation distribution of a synchronized protein; the distribution has a heavy positive tail, with the 99th percentile occurring at a relatively high correlation of 0.58. Correlations with direct interaction partners are above this threshold and correlations with indirect partners are close to this threshold. Figure (d) shows the correlation distribution of unsynchronized protein U, characterized by a light positive tail, with the 99th percentile occurring at a moderate correlation of 0.38. Direct and indirect partners of protein U occur above and around this value, respectively. Figure (e) shows the combined correlation distributions of proteins S and U. Direct partners of S occur at some of the highest values in this distribution, followed by indirect partners of S, and some of the direct partners of U. These are followed by random partners of S, and additional direct and indirect partners of U. Thus, it is difficult to distinguish direct interactions of U from random protein pairs involving S.

The main ideas of our hypothesis can be summarized as follows:

1. A protein involved in transient interactions has few high correlations, while some proteins with stable interactions (synchronized proteins) have many high correlations.

2. The correlation of a transient PPI (footnote 4) is typically lower than that of a stable PPI.

3. The correlation of a transient PPI is often higher than most correlations of transiently interacting proteins, but lower than many correlations of synchronized proteins and many correlations of random protein pairs.

4. The correlation associated with a transient interaction may be easiest to identify if it is compared against other correlations of the two transiently interacting proteins.

We test our hypothesis by analyzing correlations of transiently and stably interacting proteins. First we evaluate whether transiently interacting proteins have significantly lighter distribution tails than stably interacting proteins. Then we evaluate whether transient interactions have significantly lower median correlations than stable interactions. We use human kinases as representatives of transiently interacting proteins, and members of permanent human protein complexes as representatives of stably interacting proteins. We test for significant differences in correlations using 42 gene expression datasets (Supplemental Materials).

⁴ This refers to the correlation between the genes of two transiently interacting proteins.

After establishing evidence for our hypothesis, we present PPI prediction methods based on its ideas. To predict an interaction between proteins i and j, these methods consider whether the correlation between these proteins is higher than most correlations of i with other proteins and most correlations of j with other proteins. Consequently, the threshold for predicting an interaction is specific to each protein pair. By contrast, the previous method for predicting PPIs from co-expression evaluates whether the correlation of two proteins is higher than a fixed threshold (the same threshold is used for all protein pairs). We compare how well these three methods predict different types of PPIs: transient, stable, and a combination of various types. To ensure that results are not specific to a single gene expression dataset, we repeat the testing in 42 datasets. Lastly, we identify dataset properties that affect the performance of different prediction methods.

3.2 Results

3.2.1 Correlation distributions of stably and transiently interacting proteins have significant differences

We first tested whether proteins with stable interactions had different correlation distributions than proteins with transient interactions. We expected that proteins in stable interactions would have distributions with heavier positive tails. To conduct this test we used 238 human proteins from permanent complexes [208] to represent stably interacting proteins and 311 human kinases [82] to represent transiently interacting proteins. We determined correlation distributions of these proteins in 42 gene expression datasets. Each distribution comprised correlations between a protein from one of the two sets (i.e., a kinase or a complex member) and all other proteins represented in a given gene expression dataset (Figure 3.2). We then tested whether the 99th percentiles of correlation distributions were significantly different between stably and transiently interacting proteins. For example, in the Gene Atlas expression dataset [176], among stably interacting proteins, the 99th percentile occurred at a median correlation of 0.66 while among transiently interacting proteins it occurred at a median correlation of 0.45. This difference was statistically significant (p < 4.4e-39) based on the Wilcoxon rank-sum test. We repeated this test in 42 gene expression datasets and in all datasets except one (GDS2737), the 99th percentile had a significantly higher (p < 1.7e-5) correlation among complex members than among kinases (Figure 3.3).

[Figure 3.2 plot: density (y-axis) of Pearson correlation (x-axis, −1.0 to 1.0) for the kinase MKNK2 and the complex member PSMD7, with 99th percentiles marked at 0.40 and 0.73.]

Figure 3.2: Examples of correlation distributions. Using the Gene Atlas dataset, we calculated correlations between the kinase MKNK2 and all other proteins represented in the dataset (green). Similarly, we calculated correlations between PSMD7, a member of the proteasome protein complex, and all other proteins in the dataset (blue). In each distribution we determined the correlation corresponding to the 99th percentile.

[Figure 3.3 plot: median 99th percentile of Pearson correlation (y-axis) for kinases and complex members in each of the 42 gene expression datasets (x-axis, GDS identifiers).]

Figure 3.3: Median 99th percentiles of correlation distributions. We determined correlation distributions of complex members and kinases in 42 gene expression datasets. In each dataset, we determined the median 99th percentile among correlation distributions of complex members and among correlation distributions of kinases.

3.2.2 Transient interactions have significantly lower correlations than stable interactions

While our first test confirmed significant differences in the 99th percentiles of correlation distributions, it did not evaluate whether transient interactions have lower correlations than stable interactions. (Our first test considered the top 1% of correlations, but kinases used in our study had an average of 11 known substrates, representing roughly 0.05% of correlations.) To test whether transient and stable interactions had significantly different correlations, we calculated 3,265 correlations between kinases and their known substrates, and 3,623 correlations between co-complexed protein pairs. Although members of co-complexed pairs may not be in direct contact, including their correlations does not affect our conclusions – their presence makes our significance tests more stringent. We calculated correlations in 42 gene expression datasets, and tested for significant differences between the two interaction types, using the Wilcoxon rank-sum test. In all datasets, transient interactions had significantly lower median correlations (p < 1.8e-6) than stable interactions (Figure 3.4).

[Figure 3.4 plot: median Pearson correlation (y-axis) of kinase-substrate pairs and co-complexed pairs in each of the 42 gene expression datasets (x-axis, GDS identifiers).]

Figure 3.4: Median correlations of human kinase-substrate protein pairs and human co-complexed protein pairs from permanent protein complexes. We determined correlations of kinase-substrate pairs and of co-complexed pairs, in 42 gene expression datasets. In all datasets, kinase-substrate pairs had significantly lower median correlations than co-complexed pairs (p < 1.8e-6).

3.2.3 Predictions of PPIs from gene co-expression are improved by considering local information

We compared three methods for predicting PPIs from gene co-expression: a method used in previous studies, which we call global, and our new methods called local and mixed.

The global method predicts an interaction if the correlation of two proteins is above a given threshold. The same threshold is used for all protein pairs. This method could also provide confidence levels for interactions by assigning confidence scores proportional to correlations. We refer to the method as global because it evaluates correlations of all protein pairs by the same criteria – either a single correlation threshold or a single scale which maps correlations to interaction confidence levels.

The local method uses different evaluation criteria for each protein pair. The method is based on our hypothesis that the background distribution should contain correlations of the candidate interacting proteins. The local method predicts an interaction between two proteins, i and j, if their correlation, ρi,j, is substantially higher than their correlations with most other proteins (i.e., ρi,j is greater than correlations of i with most proteins and greater than correlations of j with most proteins). We assumed that the chances of interaction were highest if ρi,j could be considered an outlier in the correlation distributions of i and of j. To assess whether a given correlation was an outlier, we calculated an outlier score, using an approach similar to outlier detection by interquartile range [199] (see Methods).

The mixed method uses a combination of the global and local approaches (see Methods). It assumes that two proteins are most likely to interact if their correlation is high compared to other correlations of the two proteins, and compared to correlations of most other protein pairs. Consequently, the mixed method calculates two outlier scores for the correlation of a protein pair.

We evaluated the three prediction methods on three test sets of human protein pairs. The test sets, referred to as Strans, Sstable, and Smult, contained transient, stable, and multiple interaction types, respectively, as well as random protein pairs representing non-interactions. The numbers of protein pairs in the test sets were as follows – Strans contained 3,265 kinase-substrate pairs and 32,650 random pairs, Sstable contained 3,623 co-complexed protein pairs and 36,230 random pairs, and Smult contained 14,242 high confidence PPIs and 142,214 random pairs (see Methods). All high confidence PPIs were included in Smult, regardless of interaction type. We repeated predictions for each test set using 42 Affymetrix gene expression datasets. Thus, we evaluated each prediction method a total of 126 times – 42 times on 3 test sets. The procedure for evaluating a prediction method with a given expression dataset and a given test set consisted of two main steps:

1. Based on the expression data, the prediction method calculated interaction scores for each protein pair in the test set.

2. Each score in the test set was used as a threshold for predicting interactions. At each threshold, the true positive rate (TPR)⁵ and the false positive rate (FPR)⁶ were determined. Based on pairs of TPR and FPR values, the area under the Receiver Operating Characteristic curve (AUC) was calculated.
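The two evaluation steps above, together with the AUC100, ..., AUC1600 measures used in this section, can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the function names are hypothetical, and leaving the area in raw TPR × FPR units (no rescaling of the truncated area) is an assumption.

```python
def roc_points(scores, labels):
    """Cumulative (false positives, true positives) counts as the score
    threshold is lowered past each distinct score in the test set."""
    pairs = sorted(zip(scores, labels), key=lambda p: -p[0])
    points = [(0, 0)]
    fp = tp = 0
    i = 0
    while i < len(pairs):
        j = i
        # Pairs with tied scores cross the threshold together.
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            if pairs[j][1]:
                tp += 1
            else:
                fp += 1
            j += 1
        points.append((fp, tp))
        i = j
    return points


def auc_at_n_false_positives(scores, labels, max_fp=None):
    """Area under the ROC curve (TPR vs. FPR) up to `max_fp` false positives.

    max_fp=None uses all cases (ordinary AUC). The area is not rescaled;
    that choice is an assumption of this sketch.
    """
    n_pos = sum(1 for y in labels if y)
    n_neg = len(labels) - n_pos
    if max_fp is None or max_fp > n_neg:
        max_fp = n_neg
    area = 0.0
    points = roc_points(scores, labels)
    for (fp0, tp0), (fp1, tp1) in zip(points, points[1:]):
        if fp0 >= max_fp:
            break
        if fp1 > max_fp:
            # Interpolate linearly to the false positive cut-off.
            frac = (max_fp - fp0) / (fp1 - fp0)
            tp1 = tp0 + frac * (tp1 - tp0)
            fp1 = max_fp
        # Trapezoid between consecutive ROC points, in TPR x FPR units.
        area += (fp1 - fp0) / n_neg * (tp0 + tp1) / (2 * n_pos)
    return area
```

Under these assumptions, AUC100 for one expression dataset and one test set would be `auc_at_n_false_positives(scores, labels, max_fp=100)`, where `labels` marks interacting pairs as 1 and random pairs as 0.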

The mixed method generally performed better than the others on all test sets (Figures 3.5-3.7). It outperformed the other methods with most, though not all, gene expression datasets. To measure prediction performance we calculated AUC scores for different numbers of false positive cases: 100, 200, 400, 800, 1600, and all cases. We refer to these measures as AUC100, ..., AUC1600, and AUC, respectively. We used Wilcoxon signed-rank tests to determine whether two methods had significantly different performance on a given test set, at a given false positive level. (For example, one of the tests determined that on the Strans dataset, the 42 AUC100 scores of the mixed method were significantly higher than the corresponding scores of the global method.) We considered differences to be significant at p < 0.05. On the Strans (kinase-substrate) dataset the mixed method performed significantly better than the global method at all false positive levels, and significantly better than the local method at AUC100. On the Sstable (co-complexed pairs) dataset the mixed and local methods performed significantly better than the global method at all false positive levels. On the Smult (multiple PPI types) dataset the mixed and global methods performed significantly better than the local method at all false positive levels. The mixed method was significantly better than the global method at AUC100 and AUC200.

⁵ TPR = true positives / (true positives + false negatives)

⁶ FPR = false positives / (false positives + true negatives)

[Figure 3.5 plot: AUC100 (y-axis) of the mixed, local, and global methods for each gene expression dataset (x-axis, GDS identifiers).]

Figure 3.5: Performance of PPI prediction methods on the Strans test set. Predictions were made using 42 gene expression datasets (x-axis) and performance was evaluated as AUC100 scores (y-axis).

[Figure 3.6 plot: AUC100 (y-axis) of the mixed, local, and global methods for each gene expression dataset (x-axis, GDS identifiers).]

Figure 3.6: Performance of PPI prediction methods on the Sstable test set. Predictions were made using 42 gene expression datasets (x-axis) and performance was evaluated as AUC100 scores (y-axis).

[Figure 3.7 plot: performance (y-axis) of the mixed, local, and global methods for each gene expression dataset (x-axis, GDS identifiers).]

Figure 3.7: Performance of PPI prediction methods on the Smult test set. Predictions were made using 42 gene expression datasets (x-axis) and performance was evaluated as AUC100 scores (y-axis).

3.2.4 Mixed approach improves recall of different interaction types

Results on the test sets containing kinase-substrate and co-complexed pairs suggested that the mixed method was more effective for both transient and stable PPIs. However, these two test sets contained limited types of transient and stable interactions. To gain more information about performance on different types of PPIs, we focused on the test set containing multiple interaction types and categorized interactions using Gene Ontology (GO) functional annotations. We selected a set of 10 high-level functional annotations and labelled an interaction with an annotation if one or both of the interacting proteins carried the annotation (see Methods). Figure 3.8(a) shows the recall⁷ achieved for different categories using the mixed and global approaches with Gene Atlas data (GDS596) [176]. Recall was calculated at a precision of 60% (the default precision was 9% since interacting protein pairs comprised 1/11th of the test set). Figure 3.8(b) shows the percent improvement in recall of the mixed approach over the global approach. The mixed approach achieved higher recalls for all PPI categories. Categories where the global approach had particularly low recall – transferase, signal transducer, and transcription regulator – showed some of the biggest improvements with the mixed approach. Figure 3.9 shows distributions of percent improvement in 42 gene expression datasets. The trends seen in Gene Atlas data [176] were largely consistent across datasets, although many datasets showed less improvement for hydrolase and more improvement for nucleic acid binding. Figure 3.10 shows the significance of improvements; it indicates that the mixed approach achieved consistently better recall in different gene expression datasets. We calculated the significance of improvements in recall at different precision levels: 50%, 60%, 70%, and 80%. At precisions greater than 80%, predicted networks from both approaches became very small and contained few interaction categories. The best improvements in recall occurred at around 65% precision, although several categories – signal transducer, structural molecule, and transcription regulator – had significant improvements across a wide range of precisions (p < 0.05). Signal transduction and transcription regulation are associated with transient interactions [139].

⁷ recall = true positives / (true positives + false negatives)

[Figure 3.8 plots: (a) recall (x-axis, 0.00 to 0.05) of the global and mixed approaches and (b) % improvement in recall (x-axis, 0 to 60), for ten functional categories: catalytic, hydrolase, oxidoreductase, transferase, enzyme regulator, nucleic acid binding, signal transducer, structural molecule, transcription regulator, transporter.]

Figure 3.8: Recovery of PPIs involving different functional categories of proteins. Results are based on predictions at 60% precision, obtained from Gene Atlas data [176]. Figure (a) shows recall of different functional categories using the mixed and global approaches. Figure (b) shows the percent improvement in recall of the mixed approach over the global approach for different functional categories.
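The per-category recall at a fixed precision used in Section 3.2.4 can be illustrated with a short Python sketch. It is a sketch under stated assumptions, not the procedure used in this work: the function names are hypothetical, and choosing the lowest score threshold that still reaches the target precision is an assumption about how the operating point was selected.

```python
def threshold_for_precision(scores, labels, min_precision=0.6):
    """Lowest score threshold whose predictions (score >= threshold) reach
    `min_precision`; returns None if no threshold does."""
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    tp = fp = 0
    best = None
    i = 0
    while i < len(ranked):
        j = i
        # Pairs with tied scores are predicted together.
        while j < len(ranked) and ranked[j][0] == ranked[i][0]:
            if ranked[j][1]:
                tp += 1
            else:
                fp += 1
            j += 1
        if tp / (tp + fp) >= min_precision:
            best = ranked[i][0]
        i = j
    return best


def recall_by_category(scores, labels, categories, threshold):
    """Recall of true interactions within each functional category.

    categories[k] is the set of annotations attached to pair k, i.e. the
    annotations carried by one or both of its proteins.
    """
    found, total = {}, {}
    for score, label, cats in zip(scores, labels, categories):
        if not label:
            continue  # recall only counts true interactions
        for cat in cats:
            total[cat] = total.get(cat, 0) + 1
            if score >= threshold:
                found[cat] = found.get(cat, 0) + 1
    return {cat: found.get(cat, 0) / n for cat, n in total.items()}
```

Percent improvement of one method over another for a category would then be `100 * (recall_mixed - recall_global) / recall_global`, computed from two calls with each method's own scores and threshold.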

[Figure 3.9 plot: distributions of % improvement (x-axis, −100 to 500) for ten functional categories: catalytic, hydrolase, oxidoreductase, transferase, enzyme regulator, nucleic acid binding, signal transducer, structural molecule, transcription regulator, transporter.]

Figure 3.9: Percent improvement of the mixed approach over the global approach in recovering PPIs of different functional categories. For each functional category, the distribution of percent improvements is shown for 42 gene expression datasets.

3.2.5 Mixed and local approaches work best when expression datasets have genes with high variance

The ability of gene co-expression to predict interacting or co-complexed protein pairs varied greatly between gene expression datasets. We found that predictive performance with a given gene expression dataset depended on several dataset properties: the average variance of genes in the dataset, the number of samples in the dataset, and the correlation, corr_var,99, between the variance of genes and the 99th percentiles of their correlation distributions. These properties had significant positive correlations (p < 0.05) with predictive performance on Smult, measured as AUC100 scores (Table 3.1). Prediction methods using local information were more sensitive to dataset properties, especially to corr_var,99. This property indicates whether the 99th percentile of a gene can be explained by its variance – in some datasets, the higher the variance, the lower the 99th percentile. With these gene expression datasets the mixed approach provided little improvement over the global approach.

[Figure 3.10 heatmap: rows are the ten functional categories (catalytic, hydrolase, oxidoreductase, transferase, enzyme regulator, nucleic acid binding, signal transducer, structural molecule, transcription regulator, transporter); columns are precision levels 50, 60, 70, and 80; color scale from 10^-2 to 10^0.]

Figure 3.10: The mixed approach was significantly better than the global approach at recovering several functional categories of PPIs. Significance was assessed by comparing recalls of various functional categories, in 42 sets of PPI predictions – each predicted from a different gene expression dataset. Recall was calculated at 4 levels of precision: 50%, 60%, 70%, and 80% (x-axis). Precisions and functional categories where the mixed approach achieved significantly higher recall (p < 0.05) than the global approach are indicated in red.

Approach    Variance    corr_var,99    No. samples
global      9.36e-4     2.44e-2        3.16e-2
local       1.11e-4     9.54e-4        2.22e-3
mixed       3.65e-5     4.40e-3        3.67e-3

Table 3.1: P-values of Pearson correlations between dataset properties and AUC100 scores.

3.3 Discussion

Gene co-expression is commonly used to predict PPIs but has been largely ineffective at predicting transient protein interactions. This could be expected since transiently interacting proteins would not require consistently synchronized gene expression levels. However, our results suggest that even moderate levels of co-expression may be sufficient to identify transient interactions, if the co-expression is viewed in the context of specific proteins. The importance of context is shown by the fact that proteins involved in transient interactions have significantly lighter correlation distribution tails than proteins involved in stable interactions. The moderate correlation of a transient PPI may be an outlier in the correlation distribution of a transiently interacting protein. Therefore, interpreting whether a correlation is sufficiently high to indicate an interaction can depend on the two proteins in question. This idea underlies our local and mixed approaches for predicting PPIs by co-expression. The local method considered whether the correlation between two proteins was higher than most other correlations of these proteins. The mixed method assessed a correlation in two ways: relative to other correlations of the two proteins, and relative to correlations of all proteins. The previously used global method can be viewed as assessing a given correlation relative to correlations of all protein pairs.

Our testing showed that combining local and global information significantly improves PPI predictions but local information is not necessarily better than global. The local method performed significantly better than the global method on stable interactions but on transient interactions the two methods had similar performance, and on a combination of different interactions, the global method was significantly better. These results may be explained by the different correlation distributions of stably and transiently interacting proteins. When the local method is applied to stably interacting proteins, the criteria for predicting interactions are strict, since these proteins have heavy distribution tails and only very high correlations stand out as outliers. Using strict criteria is effective because stably interacting proteins have many high correlations, of which only a fraction correspond to physical interactions. The global method cannot effectively filter these numerous correlations since they are higher than correlations of most protein pairs. However, when predicting transient interactions, the local method becomes the less effective filter. Transiently interacting proteins have light distribution tails and moderate correlations may appear as outliers relative to these distributions. Moderate correlations may often be due to chance, but the local method views them as outliers and assumes that they indicate interactions. However, moderate correlations may also correspond to transient PPIs – such cases are correctly identified by the local method but missed by the global method.

The performance of methods depended not only on PPI types but also on gene expression datasets. Performance varied widely across expression datasets and different methods worked best with different expression datasets. These results raised two main questions: what factors determine the predictive value of a dataset, and is it possible to know the best method for a given dataset? Our analysis of expression datasets suggests that the best prediction performance was achieved when a dataset had a large number of samples and its genes had high variance. A large number of samples reduces the probability of a high correlation occurring by chance, thus lowering the number of false positives. High variance may help filter out high correlations that are false positives, and is especially important for the performance of the local and mixed methods. If a gene has low variance and several outlying expression levels, then it may have high correlations due to chance [198]. The local and mixed methods assume that such correlations correspond to PPIs if most correlations of the gene are low. Consequently, the local and mixed methods performed worst with datasets where genes had low variance. It may be possible to improve the performance of these methods either by replacing Pearson correlation with a more robust correlation measure or by identifying and discarding correlations driven by outliers and low variance.

While these approaches are likely to improve the local and mixed methods, the overall ability of co-expression to predict PPIs has important limitations. A key limiting factor is that some interacting protein pairs may have completely uncorrelated gene expression levels. Examples of such interactions are cyclin dependent kinases and their substrates [62]. Also, gene expression levels are often poorly correlated with protein levels [73, 69, 134]. While these factors increase the false negative rate of PPI predictions, the false positive rate can also be high. Co-expression may not be able to distinguish direct protein interactions from other relationships among proteins such as presence in the same complex or pathway, similar functionality, or participation in concurrent processes. However, identifying these relationships may be beneficial, and local information would likely help with these tasks.

3.4 Conclusions

Predictions of transient PPIs from gene co-expression are improved by considering local information – i.e., determining if the correlation of two candidate interacting proteins i and j is higher than most correlations of i with other proteins and most correlations of j with other proteins. Previous studies focused on global information – comparing the correlation of two candidate interacting proteins against a fixed threshold or, equivalently, against correlations of all other protein pairs. The value of local information comes from the fact that proteins which participate in transient interactions have significantly fewer high correlations than proteins that participate in stable interactions. While the correlation of two transiently interacting proteins is often relatively low, it may be recognizable as an outlier when compared to other correlations of the two proteins. Our mixed PPI prediction method, which combines local and global information, performed significantly better (p < 0.05) than methods which used only one type of information. Predictions of the mixed method were significantly better for both transient and stable PPIs.

3.5 Methods

3.5.1 Selection of gene expression datasets

The 42 gene expression datasets used for testing were downloaded from the Gene Expression Omnibus (GEO) [13] website on Feb. 27, 2009. These datasets had unique GDS IDs, were based on human Affymetrix chips, and contained at least 35 samples.

3.5.2 Processing of gene expression datasets

All gene expression datasets were normalized by the MAS 5.0 algorithm, using the affy R package [60], version 1.24.2. The datasets were log-transformed, and within each sample, the sample mean was subtracted from expression levels. If a gene was represented by multiple probe sets, only the probe set with the highest median expression level was kept. Co-expression of genes was calculated by Pearson correlation.

The above processing options were selected to optimize the performance of the global prediction method. The MAS 5.0 algorithm has been previously shown to be the most effective normalization method for the task of predicting PPIs by gene co-expression [110]. The other options were selected because they optimized the performance, measured as AUC100 scores, of the global method on the Smult dataset:

• mean centered samples gave significantly improved performance over non-centered samples (p < 0.05);

• representing a gene by the probe set with the highest median expression level gave higher average AUC100 scores than averaging the probe sets of a gene;

• Pearson correlation gave significantly higher AUC100 scores than Spearman correlation (p < 0.05).
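The processing steps of Section 3.5.2 that follow MAS 5.0 normalization can be sketched in Python with NumPy. This is an illustration only: the original processing used R, the function names are hypothetical, and both the base-2 logarithm and the order of operations (centering before probe set selection) are assumptions not stated in the text.

```python
import numpy as np


def preprocess(signal, probe_genes):
    """Log-transform, mean-center each sample, and keep one probe set per gene.

    signal: probe sets x samples array of positive, already normalized values.
    probe_genes: gene identifier for each row of `signal`.
    Returns (genes, matrix) with one row of `matrix` per gene in `genes`.
    """
    logged = np.log2(np.asarray(signal, dtype=float))  # base 2 is an assumption
    # Subtract each sample's (column's) mean from its expression levels.
    centered = logged - logged.mean(axis=0, keepdims=True)
    # For each gene keep the probe set with the highest median expression.
    medians = np.median(centered, axis=1)
    best_row = {}
    for row, gene in enumerate(probe_genes):
        if gene not in best_row or medians[row] > medians[best_row[gene]]:
            best_row[gene] = row
    genes = sorted(best_row)
    return genes, centered[[best_row[g] for g in genes]]


def coexpression(matrix):
    """Pearson correlation between every pair of genes (rows)."""
    return np.corrcoef(matrix)
```

For example, `genes, matrix = preprocess(signal, probe_genes)` followed by `coexpression(matrix)` yields a gene-by-gene correlation matrix whose row order matches `genes`.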

3.5.3 Transient and stable interactions

Transient interactions were represented by 3,526 human kinase-substrate protein pairs downloaded on October 31, 2010 from the PhosphoSitePlus website: http://www.phosphosite.org/downloads/Kinase_Substrate_Dataset [82]. Stable interactions were represented by 3,623 co-complexed protein pairs within permanent human protein complexes described by Zanivan et al. [208].

3.5.4 Transiently and stably interacting proteins

Transiently interacting proteins were represented by 311 human kinases from the set of kinase-substrate protein pairs described above. Stably interacting proteins were represented by 238 members of permanent human protein complexes described above.

3.5.5 Selection of protein pairs for test sets

Dataset Smult: multiple interaction types

The Smult dataset consisted of 14,271 high-confidence human interacting protein pairs and 142,710 random human protein pairs, representing non-interactions. Human PPIs were downloaded from the Interologous Interaction Database [27], version 1.71 (http://ophid.utoronto.ca/i2d). The set of 14,271 high confidence interactions was selected based on 2 criteria: (1) identification by two or more studies or (2) identification by a single study using small-scale screens (< 20 interactions), and presence in at least two of the following databases – DIP [203], MINT [31], IntAct [6], HPRD [101], or BioGrid [24]. Protein pairs mapping to the same probe set were eliminated. A set of 142,710 random protein pairs was generated based on proteins in the UniProt database [39], version 56.2. Pairs were eliminated if they were present in I2D, were involved in the same complex according to the CORUM [149] or Reactome [121] databases, or were involved in the same pathway based on the Reactome [121] database. Pairs mapping to the same probe set were also eliminated. If a gene expression dataset did not include data for certain protein pairs, additional random pairs were generated so that the ratio of interacting to non-interacting pairs represented in the dataset would remain 1:10. This ratio was skewed in favor of non-interacting pairs to reflect the fact that interacting protein pairs represent a small fraction of all protein pairs in a cell.
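The generation of random pairs described above can be illustrated with a short Python sketch. All names are hypothetical, and uniform sampling of protein pairs with a fixed seed is an assumption; the text does not specify the sampling scheme.

```python
import random


def sample_random_pairs(proteins, excluded, probe_set_of, n_pairs, seed=0,
                        max_attempts=1_000_000):
    """Draw `n_pairs` distinct random protein pairs representing non-interactions.

    excluded: set of frozenset({a, b}) pairs that must not be drawn (known
        interactions, pairs in the same complex, pairs in the same pathway).
    probe_set_of: maps each protein to its probe set identifier.
    """
    rng = random.Random(seed)
    pool = sorted(proteins)
    chosen = set()
    for _ in range(max_attempts):
        if len(chosen) == n_pairs:
            break
        a, b = rng.sample(pool, 2)  # two distinct proteins
        pair = frozenset((a, b))
        if pair in excluded or pair in chosen:
            continue
        if probe_set_of[a] == probe_set_of[b]:
            continue  # pairs mapping to the same probe set are eliminated
        chosen.add(pair)
    if len(chosen) < n_pairs:
        raise ValueError("not enough eligible random pairs")
    return chosen
```

To keep the 1:10 ratio for a given expression dataset, `n_pairs` would be set to ten times the number of interacting pairs represented in that dataset, with `proteins` restricted to those the dataset covers.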

Dataset Strans: kinase-substrate pairs

The Strans dataset consisted of 3,526 human kinase-substrate protein pairs and 35,260 random human protein pairs. Kinase-substrate protein pairs were downloaded on October 31, 2010, from the PhosphoSitePlus website: http://www.phosphosite.org/downloads/Kinase_Substrate_Dataset [82]. Random human protein pairs were selected as for Smult (described above).

Dataset Sstable: co-complexed protein pairs

The Sstable dataset consisted of 3,623 co-complexed protein pairs within permanent human protein complexes and 36,230 random human protein pairs. Permanent human protein complexes and their constituent proteins were based on the work of Zanivan et al. [208]. Random human protein pairs were selected as for Smult (described above).

3.5.6 Testing for significant differences in correlation distribution tails

In each gene expression dataset, Pearson correlations were calculated between all protein (gene) pairs. For each protein, the 99th percentile of correlations was identified. Wilcoxon rank-sum tests were used to test for significant differences in the 99th percentiles.

3.5.7 Interaction prediction approaches

To predict PPIs based on a gene expression dataset, the first step was to calculate correlations for all protein (gene) pairs represented in the dataset. The next steps were different for each prediction method and are described below.

Local prediction

For each protein in the test set, correlations were obtained between the protein and all others in the expression dataset. (These correlations were calculated before starting local prediction). Then, for each protein in the test set, the 99th percentile of correlations was identified. A correlation, ρij, between candidate interacting proteins i and j was mapped to two outlier scores oi, oj:

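\[ o_i = \frac{\rho_{ij} - 50^{\text{th}}\ \text{percentile}(D_i)}{99^{\text{th}}\ \text{percentile}(D_i) - 50^{\text{th}}\ \text{percentile}(D_i)} \tag{3.1} \]

\[ o_j = \frac{\rho_{ij} - 50^{\text{th}}\ \text{percentile}(D_j)}{99^{\text{th}}\ \text{percentile}(D_j) - 50^{\text{th}}\ \text{percentile}(D_j)}, \tag{3.2} \]

where Di and Dj were the correlation distributions of proteins i and j, respectively. A single outlier score was then calculated as follows:

\[ \mathit{mino}_{ij} = \min(o_i, o_j) \tag{3.3} \]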

These scores were calculated for all protein pairs in the test set. Proteins i and j were predicted to interact if their score, minoij, was above a given threshold. The mino scores of all protein pairs in the test set were used as thresholds for predicting interactions.
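The local score of Eqs. 3.1 to 3.3 can be sketched in a few lines of Python. This is an illustrative sketch rather than the original implementation; the function names are hypothetical, and the percentile definition (linear interpolation between order statistics) is an assumption, since the text does not state which definition was used.

```python
def percentile(values, q):
    """Return the q-th percentile (0 <= q <= 100) of values.

    Assumption: linear interpolation between order statistics.
    """
    xs = sorted(values)
    if not xs:
        raise ValueError("empty distribution")
    pos = (len(xs) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(xs) - 1)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)


def outlier_score(rho, dist):
    """Scale rho so that the 50th percentile of dist maps to 0 and the 99th to 1."""
    p50 = percentile(dist, 50)
    p99 = percentile(dist, 99)
    if p99 == p50:
        raise ValueError("degenerate distribution: 99th and 50th percentiles coincide")
    return (rho - p50) / (p99 - p50)


def min_o(rho_ij, dist_i, dist_j):
    """Local score (Eq. 3.3): the smaller of the two per-protein outlier scores."""
    return min(outlier_score(rho_ij, dist_i), outlier_score(rho_ij, dist_j))
```

Taking the minimum means a pair scores highly only when the correlation is unusual for both proteins, not just for one of them.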

Global prediction

The global method predicted interactions between protein pairs with correlations above a given threshold. The method could be implemented simply by calculating correlations between protein pairs. Our implementation was slightly different but produced exactly the same PPI predictions. For consistency with the local and mixed prediction approaches, we transformed correlations into global outlier scores (oG). These scores were later used for implementing the mixed prediction method. A global outlier score indicated whether a correlation was an outlier relative to correlations of all protein pairs represented in the expression dataset. The score was defined as follows:

\[ o_G = \frac{\rho_{ij} - 50^{\text{th}}\ \text{percentile}(D_G)}{99^{\text{th}}\ \text{percentile}(D_G) - 50^{\text{th}}\ \text{percentile}(D_G)}, \tag{3.4} \]

where DG is the distribution of correlations among all protein pairs represented in the expression data. These scores were calculated for all protein pairs in the test set. All the oG scores in the test set were then used as thresholds for predicting interactions.

Mixed prediction

The mixed approach combined the ideas of the global and local approaches. It considered whether a correlation was an outlier relative to the global distribution of correlations, DG (containing correlations between all pairs of proteins represented in an expression dataset), and relative to the local distributions of the two proteins. Using the distribution, DG, of all correlations in a gene expression dataset, the 50th and 99th percentiles were determined. The Pearson correlation, ρij, between candidate interacting proteins i and j was mapped to local outlier scores, oi (Eq. 3.1) and oj (Eq. 3.2), and to a global outlier score, oG (Eq. 3.4). The scores were then combined as follows:

\[ o_{\text{mixed}} = \text{mean}(\text{mean}(o_i, o_j),\ o_G). \tag{3.5} \]

We considered two other approaches for calculating omixed: mean(min(oi, oj), oG) and mean(max(oi, oj), oG). We chose the approach with the highest performance in Gene Atlas data [176].
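Eq. 3.5 and the two alternatives that were considered differ only in how the local scores are combined. The sketch below makes that explicit; the function name `mixed_score` and the `local_combiner` parameter are hypothetical, and the inputs are assumed to be outlier scores already computed with Eqs. 3.1, 3.2, and 3.4.

```python
def mixed_score(o_i, o_j, o_g, local_combiner="mean"):
    """Combine local scores (o_i, o_j) with the global score o_g.

    local_combiner="mean" corresponds to Eq. 3.5; "min" and "max"
    correspond to the two alternatives mentioned in the text.
    """
    combiners = {
        "mean": lambda a, b: (a + b) / 2.0,
        "min": min,
        "max": max,
    }
    if local_combiner not in combiners:
        raise ValueError("local_combiner must be 'mean', 'min', or 'max'")
    local = combiners[local_combiner](o_i, o_j)
    return (local + o_g) / 2.0


# Toy scores: o_i = 1.0, o_j = 3.0, o_G = 2.0.
toy_mean = mixed_score(1.0, 3.0, 2.0)         # (2.0 + 2.0) / 2
toy_min = mixed_score(1.0, 3.0, 2.0, "min")   # (1.0 + 2.0) / 2
toy_max = mixed_score(1.0, 3.0, 2.0, "max")   # (3.0 + 2.0) / 2
```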

3.5.8 Calculating significance of performance differences

The three prediction approaches had significant differences in their AUC scores (p < 0.05). Before testing for differences, the three approaches were applied to protein pairs of the test sets, using each expression dataset. This resulted in each prediction approach providing 126 sets of predictions. AUC scores were calculated for each prediction set. Wilcoxon signed-rank tests were then used to test for significant differences between the scores of prediction approaches.
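For reference, the AUC of one prediction set can be computed directly from its definition: the probability that a randomly chosen interacting pair receives a higher score than a randomly chosen non-interacting pair. The Python sketch below is illustrative only; the function name `auc` is hypothetical, and the quadratic pairwise loop is chosen for clarity rather than speed.

```python
def auc(scores, labels):
    """Area under the ROC curve for scores with binary labels (1 = interacting).

    Computed as the fraction of (positive, negative) pairs in which the
    positive has the higher score; ties count as one half.
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))
```

Applying such a function to each of the 126 prediction sets of two approaches would yield the paired AUC values that a signed-rank test compares.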

To test for significant differences in the recall of functional categories, Wilcoxon signed-rank tests were applied to the recalls of a given category, provided by the global and mixed approaches in 42 gene expression datasets.

3.5.9 Selection of functional categories

PPIs were divided into categories using 10 Gene Ontology [8] functional categories. These categories were a subset of GO slims downloaded on Mar. 10, 2010. The subset of categories was selected using the following criteria: (1) at least 500 proteins in the Smult dataset were annotated with the category; (2) no more than half of all proteins in Smult were annotated with the category; and (3) the overlap between members of any two categories was limited. The similarity in membership between categories i and j was calculated as follows:

\[ \text{membership similarity} = \frac{|S_i \cap S_j|}{|S_i \cup S_j|}, \tag{3.6} \]

where Si and Sj are the sets of proteins in categories i and j, respectively. Categories with the highest number of membership similarity scores > 0.5 were eliminated, until all remaining scores were < 0.5.
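The overlap filter based on Eq. 3.6 can be sketched as a greedy elimination loop. The sketch is illustrative and makes two assumptions not stated in the text: one category is removed per iteration, and ties are broken by alphabetical order of category name. The function names are hypothetical.

```python
def membership_similarity(set_i, set_j):
    """Eq. 3.6: size of the intersection divided by size of the union."""
    union = set_i | set_j
    return len(set_i & set_j) / len(union) if union else 0.0


def filter_categories(categories, threshold=0.5):
    """Remove categories until no pair has similarity above threshold.

    categories maps a category name to its set of proteins. At each
    iteration, the category involved in the most above-threshold pairs
    is removed (assumption: ties go to the alphabetically first name).
    """
    kept = dict(categories)
    while kept:
        names = sorted(kept)
        counts = {name: 0 for name in names}
        for idx, a in enumerate(names):
            for b in names[idx + 1:]:
                if membership_similarity(kept[a], kept[b]) > threshold:
                    counts[a] += 1
                    counts[b] += 1
        worst = max(names, key=lambda name: counts[name])
        if counts[worst] == 0:
            break  # all remaining pairs are at or below the threshold
        del kept[worst]
    return kept
```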

3.5.10 Software

All computations were done with R 2.10.0 and Perl 5.8.8, on an IBM HS21 cluster, Red Hat 4.1.2-14.

3.6 Supplemental Materials

3.6.1 Links to data tables

1. GDS IDs of gene expression datasets used for testing: http://www.cs.utoronto.ca/~juris/data/MaxThesis/Ch3/expressionDatasetIDs.txt

2. Strans dataset: http://www.cs.utoronto.ca/~juris/data/MaxThesis/Ch3/S_trans.txt

3. Sstable dataset: http://www.cs.utoronto.ca/~juris/data/MaxThesis/Ch3/S_perm.txt

4. Smult dataset: http://www.cs.utoronto.ca/~juris/data/MaxThesis/Ch3/S_mult.txt

Chapter 4

Predicting Essential Mouse Genes

Abstract

Essential genes are imperative for survival. In mammals essential genes are involved in development, maintenance of fundamental cellular processes, and vital tissue-specific functions. Identification of essential genes is required for understanding each of these mechanisms. Experimental methods for identifying mammalian essential genes are expensive and time consuming. Computational prediction of these genes offers a fast, preliminary alternative.

Here we present the first predictions of essential mouse genes. Our approach uses 5 types of predictive data: gene expression, networks of protein-protein interactions (PPIs) and co-expressed genes, Gene Ontology, gene and protein sequence, and orthology. We investigate the relationships between essentiality and 371 features derived from these data types, and identify a key subset of 22 non-redundant features strongly linked with essentiality. This subset includes features such as centrality in PPI networks, low rate of non-synonymous substitutions, and high protein disorder. Although linked with essentiality, these features exhibited significantly different values in different groups of essential genes, such as essential genes related to specific tissues, prenatal lethality, postnatal lethality, and infertility. Using our set of 22 features, we predicted essential genes in each of these groups, and the combined set of essential genes. For the combined set, we obtained 90% precision with 10% recall and 70% precision with 50% recall. Most groups of essential genes were predicted with similar rates, but precisions were substantially lower for several groups, especially genes associated exclusively with postnatal lethality.

4.1 Introduction

Essential genes are ones whose absence is lethal to an organism [160]. More specifically, they have been defined as genes required for survival to reproductive age and for successful reproduction [86, 108, 107]. In multicellular organisms, knowledge of essential genes helps answer questions about basic cellular processes, development, and tissue-specific functions that are vital for life. Essential genes in higher organisms are known to include tumour suppressor genes [5, 89] and have similarities to genes involved in disease [169, 105]. However, identifying essential genes in higher organisms such as mammals requires experimental methods that are time consuming and expensive. We propose a machine learning approach for predicting essential mouse genes from diverse data types including protein-protein interaction networks, Gene Ontology, gene and protein sequence, and orthology. This approach can have a complementary role to experimental methods, providing fast, preliminary estimates of gene essentiality. Such estimates can help prioritize genes for disruption experiments; for example, researchers building mouse models of disease could avoid genes that are likely to be embryonic lethal. Also, a computational approach can highlight distinguishing features of essential genes and can help identify distinct types of essential genes.

Thus far, computational methods have been used to predict essential genes in yeast and bacteria [94, 33, 160, 72]. These studies made predictions based on gene expression, sequence homology, PPIs, and genomic data. The strongest predictive features included low protein hydropathicity [160], low frequency of codons similar to the [160], high degree in PPI networks [94], low variance of gene expression [94], and presence of orthologs in other organisms [72]. Some of these features have also been linked with mammalian essential genes, or mammalian housekeeping genes, which are enriched for essential genes [183]. Mammalian essential genes have been associated with low variability of gene expression [34], slow evolution rates [107, 34], high degree in PPI networks [183], and large numbers of conserved sites [183]. Since many of these features are significantly correlated with each other [131, 74, 192], it has been difficult to establish which features are directly related to essentiality or indirectly related through correlation with another feature. For example, slow evolution rate and high gene expression have both been associated with essentiality [131, 34], but evolution rate is more closely linked with gene expression than with essentiality [132]. Independent features of essential genes have not been determined – for example, it is unclear which features would make unique contributions to predicting essentiality.

The above computational studies considered essential genes as a single group with shared characteristics. While this is a natural assumption for unicellular organisms, in multicellular organisms essential genes may comprise different categories. The Mammalian Phenotype Ontology [166] includes several categories which correspond to essentiality – prenatal lethality, postnatal lethality and infertility. Also, genes in these 3 categories are typically associated with additional phenotypes such as specific organ systems (e.g., cardiovascular), tissues (e.g., skin), and processes (e.g., aging). Essential genes annotated with some of these categories may represent different subtypes of essentiality.

In this study we search for non-redundant features shared by most essential mouse genes and investigate whether essential mouse genes comprise significantly different categories. We then predict genes in these categories and in the overall set of essential genes. First, we survey 371 gene and protein features for association with essentiality. These features are based on gene expression, PPI and gene co-expression networks, Gene Ontology [8], gene and protein sequence, and orthology. We identify 22 features which are significantly correlated with essentiality and contribute unique information to essentiality prediction. Secondly, we investigate whether these 22 features have significant differences between categories of essential genes. We cluster essential categories based on these features and identify groups of essential genes with significant differences in many features. Lastly, we predict essential genes and determine which features are most effective predictors overall and for specific essential categories.

4.2 Results

4.2.1 Defining essential and non-essential mouse genes

We defined essential and non-essential genes based on phenotype annotations in the Mouse Genome Database (MGD) [30]. We defined essential genes as ones with the phenotypes prenatal/perinatal lethality, postnatal lethality, or infertility, determined either through gene knockout or Floxed/Frt experiments. We defined non-essential genes as ones subjected to gene knockout, and not annotated with any essential phenotypes. The numbers of essential and non-essential genes were 2,252 and 2,490, respectively, leaving 18,320 genes uncharacterized.

In addition to the 3 essential phenotypes above, we considered a number of other potentially distinct categories of essential genes. We defined one of these categories, postnatal-only, as encompassing genes that are postnatal lethal but not prenatal lethal.

Many mouse genes possess multiple phenotypes, indicating that their disruption had different effects in different mice. Most genes with the postnatal lethality phenotype also had the prenatal/perinatal lethality phenotype. We defined the postnatal-only category to investigate whether certain gene or protein properties were associated exclusively with postnatal lethality.

Our other essential categories were related to organ systems (e.g., cardiovascular), tissues (e.g., skin), and processes (e.g., aging), in which essential genes were involved. Each of these categories consisted of essential genes (ones annotated with prenatal/perinatal lethality, postnatal lethality, or infertility) that also possessed a non-essential phenotype such as aging. We used these categories to investigate whether essential genes differed based on organs, tissues, or processes. Figure 4.1 shows the numbers of genes in our essential categories.

4.2.2 Identifying features related to essentiality

To identify distinguishing features of essential genes, we compiled an extensive set of gene and protein features based on 5 types of data: gene expression, PPI and gene co-expression networks, Gene Ontology annotations, gene and protein sequence, and orthology (Figure 4.2.(a); Methods sections 4.5.2 through 4.5.10 describe calculations of the features, Supplemental Materials include feature list). We included features previously associated with essentiality in E. coli [72], S. cerevisiae (Baker’s yeast) [94, 33, 160, 72], and M. musculus (mouse) [107, 34]. We also included many novel features such as protein disorder, rates of synonymous substitution, gene expression levels in specific tissues, protein domains, and properties of gene co-expression networks. We computed features for 4,742 essential and non-essential genes, subject to data availability.

To determine a subset of non-redundant features which best characterize essential genes, we followed 3 steps (Figure 4.2.(b)): 1) selection of features significantly correlated with essentiality, 2) within sets of features of the same type (e.g., protein domains), identification of predictive, non-redundant features, and 3) merging of the best features from different sets into a single non-redundant set. We applied these steps to one half of our essential and non-essential genes (i.e., 2,371 genes), which we used as a training set.

The remaining genes were later used to test the ability of selected features to predict essential genes.

[Figure 4.1: bar charts of gene counts (x-axis: No. Genes). Panel (a) categories: infertility, prenatal lethality, postnatal lethality, postnatal-only, total essential. Panel (b) categories: aging, cardiovascular, digestive, immune, liver, nervous, respiratory, skeleton, skin, urinary, tumorigenesis; bars show total and essential gene counts.]

Figure 4.1: Numbers of genes in different essential categories. (a) Categories based on essential phenotypes. We defined several phenotypes from the Mammalian Phenotype Ontology as being essential: infertility, prenatal lethality, and postnatal lethality. Many genes belonged to several of these categories. We added the category postnatal-only to represent genes that were postnatal lethal and not prenatal lethal. (b) Categories based on non-essential phenotypes. We defined a number of essential categories based on non-essential phenotypes (e.g., aging). These categories contained essential genes which also possessed a non-essential phenotype. For example, 657 essential mouse genes possessed the aging phenotype, and the total number of mouse genes with the aging phenotype was 1086.

[Figure 4.2, panel (a): the 371 features, grouped by data type.

• gene expression (52 features): specific tissues (e.g., skin); summary stats (e.g., variance)

• networks (56 features): PPI (e.g., degree); coexpression (e.g., degree)

• sequence/structure (200 features): disorder (e.g., % disorder); domains (e.g., Homeobox); gene length (e.g., CDS length); sequence bias (e.g., % Trp residues)

• Gene Ontology (50 features): localization (e.g., nucleus); function (e.g., kinase)

• orthology (13 features): evolution rates (e.g., Dn w.r.t. human); ortholog properties (e.g., orthology score)

Figure 4.2, panel (b): the selection process.

1. Start with 371 features. Identify features with significant correlation to essentiality: 191 remain.

2. Identify features which may predict essentiality. Within groups of related features (e.g., protein domains), keep features which (1) predict > 5% of essential genes with (2) > 55% precision and (3) higher precision than other features in the group: 26 remain.

3. Combine all remaining features and filter by the same criteria: 22 remain.]

Figure 4.2: Features investigated for association with essentiality and the process for selecting key features. (a) Features evaluated for links with essentiality. (b) Process for selecting a subset of non-redundant features, most closely linked with essentiality.

The first step of feature selection identified features significantly correlated with essentiality. For each feature, we calculated correlations and p-values with respect to 5 categories of essentiality: infertility, prenatal lethality, postnatal lethality, postnatal-only (postnatal lethality exclusive of prenatal lethality), and the union of these categories. The resulting 5 p-values were adjusted for multiple testing, and if the lowest p-value was < 0.05 then the feature was selected.
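The selection rule of the first step can be written compactly. The text does not state which multiple-testing adjustment was applied, so the sketch below assumes a Bonferroni correction purely for illustration; the function names are hypothetical.

```python
def bonferroni(p_values):
    """Bonferroni adjustment (an assumed method): multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def is_selected(p_values, alpha=0.05):
    """Select a feature if its lowest adjusted p-value is below alpha.

    p_values holds one p-value per essentiality category (5 in the text).
    """
    if not p_values:
        raise ValueError("no p-values supplied")
    return min(bonferroni(p_values)) < alpha
```

With 5 categories, this assumed correction selects a feature only if at least one raw p-value is below 0.01.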

The second step examined sets of similar features (e.g., features related to nucleotide substitution rates) and selected features that were non-redundant. A feature was considered non-redundant if it could predict at least 5% of essential genes better than other features in its set. For example, the feature elephant_Dn (rate of non-synonymous substitutions relative to elephant orthologs) predicted 64 essential genes more reliably than the feature human_Dn (i.e., these genes were assigned higher probabilities of essentiality by elephant_Dn than by human_Dn). Therefore, elephant_Dn was considered non-redundant relative to human_Dn. Similarly, when elephant_Dn was compared with other features related to substitution rates, it made better predictions for at least 5% of essential genes in the training set. Therefore, elephant_Dn was considered non-redundant in its set.
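The non-redundancy criterion can be sketched as a pairwise comparison of the probabilities that each feature assigns to essential genes. This is one possible reading of the criterion, not the original implementation: it assumes that a feature is compared against every other feature in its set separately, and it omits the precision requirement. All names are hypothetical.

```python
def fraction_better(probs_a, probs_b):
    """Fraction of shared essential genes to which feature A assigns a
    higher probability of essentiality than feature B.

    probs_a and probs_b map gene identifiers to probabilities.
    """
    genes = probs_a.keys() & probs_b.keys()
    if not genes:
        raise ValueError("features share no genes")
    better = sum(1 for gene in genes if probs_a[gene] > probs_b[gene])
    return better / len(genes)


def is_non_redundant(feature, probs_by_feature, min_fraction=0.05):
    """True if the feature beats every other feature in its set on at
    least min_fraction of essential genes (assumed pairwise reading)."""
    own = probs_by_feature[feature]
    return all(
        fraction_better(own, probs) >= min_fraction
        for name, probs in probs_by_feature.items()
        if name != feature
    )
```

Under this reading, two highly correlated features can both be retained, provided each outperforms the other on its own subset of genes.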

This approach helped deal with some of the issues which stem from high correlations among features related to essentiality. One of these issues is whether a given feature is related directly to essentiality or indirectly, through another feature. Another issue is whether two highly correlated features reflect the same underlying gene property and should be considered redundant; and similarly, whether features with lower correlation can be considered non-redundant. Our approach assumed that a feature was directly related to essentiality and was non-redundant if it predicted a subset of essential genes better than other features. In some cases, our approach concluded that highly correlated features were non-redundant (Figure 4.3.(a)), while less correlated features were redundant (Figure 4.3.(b)).

The final step of feature selection combined the non-redundant features of different data types and applied the same criteria as above to determine a final non-redundant set.

[Figure 4.3: two panels plotting precision (y-axis) against essential mouse genes sorted by human_Dn, the rate of non-synonymous mutations w.r.t. human orthologs (x-axis). Panel (a) compares human_Dn with elephant_Dn; panel (b) compares human_Dn with rat_Ds.]

Figure 4.3: Selecting non-redundant features. Each feature was used to predict essential genes. The y-axis represents the reliability of predictions, measured as precision. The x-axis orders genes by precision measurements obtained from a given feature. (a) Precisions of essential genes based on the feature human_Dn (the rate of non-synonymous substitutions with respect to human orthologs) and based on a similar feature, elephant_Dn. The x-axis represents essential genes sorted by precisions from human_Dn. The 2 features are highly correlated (ρ = 0.74, p < 2.2e−16) but often give different precisions (circles and curve do not correspond) and neither feature is clearly better than the other: for many genes human_Dn performs better (curve is higher than circles) and for many others, elephant_Dn performs better (circles are above curve). Each feature predicts ≥ 5% of essential genes better than the other, and therefore, the features are non-redundant with respect to each other. (b) Precisions of essential genes based on human_Dn and rat_Ds (the rate of synonymous substitutions with respect to rat orthologs). Precisions from human_Dn are mostly higher than from rat_Ds; therefore, rat_Ds is redundant, although its correlation with human_Dn (ρ = 0.14, p = 7.2e−11) is lower than the correlation of human_Dn with elephant_Dn.

4.2.3 Key features of essential genes

The feature selection process identified a set of 22 features, encompassing all main data types: gene expression, PPI and gene co-expression networks, gene and protein sequence, Gene Ontology, and orthology (Figure 4.4). The main features were as follows:

• features based on gene expression: variance, expression at embryonic day 10, and the ratio of maximum prenatal versus postnatal expression (expr:variance, expr:embryo_d10, expr:prebirth_postbirth_ratio);

• features based on networks: centrality (eigenvector centrality and closeness) in co-expression networks, closeness and betweenness in PPI networks (coexpMixed:evcent, coexpLocal:evcent, ppiAll:close, ppiHighConf:between);

• features based on gene and protein sequence: frequencies of specific nucleotides, codons, and residues, and the number of protein residues in disordered regions (seq:T, seq:CTG, seq:CCG, seq:Ser, disorder:n_residues);

• Gene Ontology features: nuclear localization (GO:nucleus, GO:DNA_binding, GO:nucleoplasm, GO:chromosome);

• features based on orthology: rates of synonymous and non-synonymous substitutions and prevalence of orthologs across species (ortho:human_Dn, ortho:elephant_Dn, ortho:armadillo_Ds, ortho:orthology_score).

[Figure 4.4: two panels, each listing the 22 selected features. Panel (a) x-axis: Corr. with essentiality; panel (b) x-axis: AUC.]

Figure 4.4: Non-redundant features most closely linked with essentiality. (a) Spearman correlation of features with essentiality. (b) AUC scores of features predicting essentiality.

Many of the top 22 features were significantly correlated (Figure 4.5), especially features based on the same data type. Many features based on different data types were also correlated – for example, high CCG and Serine frequencies were correlated with nuclear localization. Another group of correlated features included protein disorder, prevalence of orthologs and high prenatal gene expression.

[Figure 4.5: heatmap of pairwise correlations among the 22 features; colour scale from −1.0 to 1.0, with correlations of ±0.1 corresponding to p < 6e−5 and ±0.5 to p < 3e−140.]

Figure 4.5: Correlations among 22 non-redundant features closely linked with essentiality.

Gene co-expression and protein interaction networks

Four gene co-expression networks were generated based on 55 mouse gene expression datasets and 3 PPI networks were generated based on interactions in the Interologous Interaction Database (I2D) [27] (Methods sections 4.5.3 and 4.5.4). Features such as degree and centrality were calculated for genes and proteins in each of these networks.

Gene co-expression networks were calculated based on different ways of evaluating gene correlations. In a network referred to as global, two genes were connected by an edge if their Pearson correlation was high relative to most gene pairs in the gene expression dataset. Features based on this network were not among the top 22 features. In a second co-expression network, referred to as local, two genes were connected by an edge if their Pearson correlation with each other was substantially higher than their correlations with other genes in the dataset. Centrality calculated in this network (coexpLocal:evcent) was among the top 22 features. A third network, referred to as mixed, combined the approaches of the global and local networks. Centrality based on this network was also among the top features. A fourth co-expression network was formed by the union of the first three, but did not contribute any top features.

Protein-protein interaction networks differed according to the sources of their interactions. A high-confidence network, ppiHighConf, contained experimentally verified mouse interactions. A medium-confidence network, ppiMedConf, extended the previous network with interologous interactions based on human and rat. A network referred to as ppiAll extended the previous two networks with interologous interactions based on yeast, fly, and worm. The top 22 features included betweenness centrality in the ppiHighConf network and closeness centrality in the ppiAll network.

4.2.4 Differences among essential genes

We used the top predictive features to assess the extent of differences between categories of essential genes. We determined whether features that characterize essentiality have significantly different distributions between essential categories, and consequently, whether essential categories have many significant differences. First we selected 15 of our top features whose correlations with each other were < 0.5. Then, we tested whether each feature had significantly different values between a given category and all other essential genes (Figure 4.6.(a)). Resulting p-values were adjusted for multiple testing. Out of 15 features, 14 were significantly different in at least one category. Clustering essential categories based on their median values for the 15 features indicated that the largest differences were between prenatal and postnatal lethality. Prenatal lethality genes were characterized by centrality in gene co-expression and PPI networks, low gene expression variance, slow evolution, prevalence of orthologs, and nuclear localization. Postnatal lethality genes were characterized by opposite trends: lack of centrality in networks, higher gene expression variance, fewer orthologs, and less nuclear localization. Infertility genes were generally similar to postnatal lethality genes but had higher non-synonymous substitution rates. Other categories fell between prenatal and postnatal lethality. Essential genes in a particular tissue were sometimes more similar to non-essential genes in the same tissue (Figure 4.6.(b)) than to other essential genes.

[Figure 4.6: panel (a) is a heatmap of features (rows) against essential categories (columns), with a colour scale of p-values reaching 1e−20 in each direction; panel (b) is a density plot (y-axis: Density; x-axis: log(human_Dn)) with four curves: nervous, essential; nervous, non-essential; other essential; other non-essential.]

Figure 4.6: Differences among categories of essential genes. (a) Significant differences among essential gene categories. Colour intensity indicates significantly higher (red) or lower (blue) feature values in a given essential category. (b) Distributions of human_Dn values (rates of non-synonymous substitutions with respect to human orthologs). Essential genes (red dashed line) tend to have lower human_Dn values than non-essential genes (black dashed line). This is also the case among nervous system (NS) essential (solid red line) and non-essential genes (solid black line). However, all NS genes tend to have low human_Dn values. Consequently, based on this feature, NS essential genes are closer to NS non-essential genes than to other essential genes.

4.2.5 Predicting essential genes

We predicted essential genes using each of the 22 selected features separately and in combination (Figures 4.7 and 4.8). To make predictions, features were used as inputs to three classifiers (logistic regression, LogitBoost, and ADTree), and the predictions of the classifiers were averaged. Integrating features resulted in significantly better performance in all essential categories (p < 0.05). For the entire set of essential genes, we obtained 90% precision with 10% recall and 70% precision with 50% recall. Individual essential categories were predicted with similar rates, but precisions were substantially lower for several categories, especially postnatal-only, infertility, and urinary. In most categories, the top individual features were closeness centrality in protein networks (ppiAll:close), the rate of non-synonymous substitutions with respect to human orthologs (human_Dn), and nuclear localization (GO:nucleus). However, these features were less effective for infertility, urinary, and especially postnatal-only categories. In the latter category, the most effective feature was protein disorder.

[Figure 4.7: heatmap of AUC values (colour scale 0.4 to 0.8). Rows: the 22 individual features and the combined set. Columns: any, infertility, prenatal lethality, postnatal lethality, postnatal-only, aging, cardiovascular, digestive, immune, liver, nervous, respiratory, skeleton, skin, urinary, tumorigenesis.]

Figure 4.7: Prediction performance of individual and combined features, evaluated as area under ROC (AUC) in different categories of essential genes.

[Figure 4.8 panels: precision (y-axis) against recall (x-axis, 0.0 to 1.0). Legend entries recovered from the plots: all features, coexpMixed.evcent, ppiAll.close, disorder.n_residues, ortho.human_Dn.]

Figure 4.8: Precision-recall of all features combined and 5 individual features, in 4 categories of essential genes. (a) all essential genes (b) infertility (c) prenatal lethality (d) postnatal-only

4.3 Discussion

Through a systematic survey of gene and protein features, we identified a key set of 22 non-redundant features associated with mouse essential genes. These features encompass diverse data types including gene expression, PPI and gene co-expression networks, gene and protein sequence, Gene Ontology and orthology. Based on these features, we predicted essential mouse genes with 70% precision at 50% recall. Individual features achieved precisions that were at least 10% lower, and no single feature was better than others for all categories of essential genes. This indicates that integration of diverse features is crucial for accurately predicting essential genes. The need for diverse data types likely arises from the diverse nature of essential genes. We identified significant differences between all essential categories that were considered, including categories representing essential genes from specific tissues, organ systems, and processes. Some of the differences between categories confounded the differences between essential and non-essential genes – for example, based on the rate of non-synonymous substitution, essential nervous system genes were more similar to non-essential nervous system genes than to other essential genes. Diverse, non-redundant features may help compensate for such effects.

The key non-redundant features associated with essential mouse genes are summarized in Table 4.1. Many of these features have been directly linked with essentiality in yeast or bacteria but have not been tested in mammals. Several features can be indirectly linked with essentiality through another feature with a previously known connection to essentiality. For example, high protein disorder has been linked with high degree in yeast PPI networks [54], which is a property of essential yeast [93] and human genes [66]. Also, centrality in gene co-expression networks is correlated with high degree in these networks, which was shown to be a property of essential yeast genes [33]. For certain features, especially ones related to sequence bias, we could find no previously reported associations with essentiality.

Essential Feature          | AUC   | Effective For       | Ineffective For          | Previous Evidence
few non-syn. substitutions | 0.638 | most categories     | post-only, infertility   | yeast [192], mouse [34]
nuclear localization       | 0.628 | most categories     | urinary                  | yeast [94]
centrality PPI nets        | 0.610 | most categories     | post-only                | yeast/fly/worm [74]
orthologs across species   | 0.609 | most categories     | post-only                | yeast [72], mouse [34]
high protein disorder      | 0.601 | most categories     | -                        | indirect in yeast [54]
low thymine frequency      | 0.587 | most categories     | -                        | -
centrality coexp nets      | 0.579 | most categories     | post-only, infertility   | indirect in yeast [33]
low gene expr. variance    | 0.576 | most categories     | post-only, infertility   | mouse [34]
high prenatal gene expr    | 0.566 | most categories     | post, infertility, aging | -
low CTG frequency          | 0.565 | most categories     | post-only, urinary       | -
long gene length           | 0.565 | most categories     | post-only, urinary       | yeast [160, 72]
high CCG frequency         | 0.548 | digestive, skeleton | most categories          | -
high Ser frequency         | 0.527 | digestive           | most categories          | E. coli [72]

Table 4.1: Main distinguishing features of mouse essential genes.

4.3.1 Features from gene expression

Several studies have investigated the relationship between gene expression and gene essentiality in mice [34, 107]. They found that low variability in expression across different tissues is a key feature of essential genes and high expression level is weakly related to essentiality. Choi et al. [34] proposed that essential genes have low expression variability because they carry out core functions in cells – consequently, their proteins are always required at specific concentrations. High expression may also be explained by the need for low variability since changes in expression level would have less impact on the concentration of an abundant protein [34].

Our results confirmed that essential mouse genes are characterized by low variability in expression and, to a lesser extent, by high expression. We also found that a high ratio of prenatal to postnatal expression was strongly associated with essentiality. This is an expected feature of genes involved in embryogenesis, of which almost 90% are essential.

Features from orthology

The relationship between essentiality and evolution rate or prevalence of orthologs has been studied in diverse organisms including E. coli, yeast, and mouse [72, 34, 107]. In these organisms, essential genes had a slow rate of evolution and orthologs across many species. These properties result from strong negative selection; changes to essential genes are often fatal, and there is no progeny to carry on the changes. In our results, slow evolution and numerous orthologs were among the most distinguishing properties of essential genes.

4.3.2 Features based on gene and protein sequence

Biases of gene sequence have been linked with essentiality in yeast [72, 160] but not in higher organisms. A large number of sequence properties were associated with yeast essential genes including low hydrophobicity, low GC content, and low frequencies of certain amino acids (e.g., Trp, Met) and codons (e.g., TAC, TAT). We found many of the same trends in mouse essential genes but the key features were low thymine frequency, low CTG frequency, high CCG frequency and high Ser frequency – most of these biases have not been previously reported.

Low thymine frequency may help minimize occurrences of pyrimidine dimers – DNA lesions formed by thymine or cytosine bases through photochemical reactions [68]. The most frequent pyrimidine dimers involve pairs of thymines. Unrepaired pyrimidine dimers increase the chances of cancer. In most organisms, pyrimidine dimers can be repaired by photoreactivation – a process involving photolyase enzymes and exposure to sunlight. However, placental mammals (including mice and humans) are incapable of this process and rely on nucleotide excision repair to remove pyrimidine dimers. Consequently, low thymine frequency may have a stronger association with mammalian essential genes than with yeast essential genes.

The reasons for biases in CTG (Leu), CCG (Pro) and Ser frequencies were unclear. The biases may be partially explained by the propensity of these amino acids to participate in disordered protein regions. We found that essential proteins were significantly enriched for residues associated with disordered regions; the level of enrichment of residues was significantly correlated (p ≤ 0.00013) with their propensity for disordered regions (Figure 4.9). However, this provides only a partial explanation for the observed sequence biases since some of the biases were specific to codons rather than residues, and the biases were not redundant with respect to protein disorder.

Protein disorder was one of the best predictors of essentiality and was best at predicting postnatal-only genes. Several studies have shown that hubs in PPI networks are enriched for disordered regions [54, 135], possibly because these regions give proteins the flexibility to interact with multiple partners [135]. Disordered regions are also involved in DNA binding [201], a common property of essential genes that are involved in development.

[Figure 4.9 scatter plot, one point per amino acid. x-axis: enrichment in essential proteins (0.85 to 1.05). y-axis: propensity for intrinsic disorder, log(odds) (−0.4 to 0.6). Legend: significantly deficient in essential proteins (p < 0.05); no significant difference between essential and non-essential proteins; significantly enriched in essential proteins (p < 0.05).]

Figure 4.9: Propensity of residues to form disordered regions [113] is positively correlated with their enrichment in essential genes (Spearman correlation p ≤ 0.00013).

4.3.3 Networks and essentiality

The relationship between essentiality and PPI networks has been studied in yeast, worm, and fly [93, 142, 74, 54] and in human [66], but not in mouse. The properties most related to essentiality have been high degree, closeness centrality, and betweenness centrality. Our results confirm these findings. High degree proteins may be essential because their absence disrupts too many functions [93], or because a fraction of their many interactions are critical for the cell [80]. Reasons for the link between essentiality and closeness or betweenness centrality have not been determined. We propose that central genes are related to embryogenesis and peripheral genes are adult and tissue-specific. In our PPI data, embryogenesis genes were significantly more central (p < 0.05) than genes with other top-level mammalian phenotypes. We found similar trends in gene co-expression networks.

PPI networks were more predictive of essentiality than gene co-expression networks. However, the effectiveness of PPI networks may be partially due to biases. We found that all of our PPI networks were highly enriched for essential proteins (Table 4.2). This may indicate that experimental studies of PPIs have dealt more often with essential proteins than with non-essential proteins. This bias suggests that PPI networks may be mainly effective for predicting well-studied essential proteins. Among our gene co-expression networks, only the global network had the same bias, although to a lesser extent than PPI networks.

Network        | Vertices | Edges     | Enrichment (p-value) | Correlation (p-value)
coexp. global  | 11,313   | 1,091,692 | 9.6e-5               | 2.2e-11
coexp. local   | 12,195   | 127,748   | 6.6e-1               | 2.6e-10
coexp. mixed   | 15,289   | 1,005,688 | 5.2e-1               | 2.9e-14
coexp. all     | 15,905   | 1,597,641 | 7.9e-1               | 6.4e-13
high conf. PPI | 4,231    | 6,868     | 1.6e-31              | 1.4e-9
med. conf. PPI | 10,053   | 50,427    | 3.0e-9               | 5.6e-15
all PPI        | 10,869   | 79,089    | 1.8e-9               | 0

Table 4.2: PPI networks, gene co-expression networks, and essentiality. Enrichment of essential genes/proteins in a network was calculated relative to the fraction of essential genes among mouse genes with known phenotypes. Correlation with essentiality is the most significant p-value of correlations between network properties (e.g., degree) and essentiality.

4.4 Conclusions

Knowledge of essential mammalian genes can provide a better understanding of critical cellular, developmental, and tissue-specific processes. It can also accelerate the development of animal models of human disease, since embryonic lethal genes may be avoided in knockout experiments. However, current experimental methods for identifying essential mammalian genes require large investments of time and resources.

Our study shows that an automated approach can predict essential mammalian genes with high precision. One of the challenges of this task is that mammalian essential genes may differ according to the tissues or processes in which they are active. This raises the question of whether essential genes should be considered as a single group.

Our results show that most categories of essential genes do have common properties, such as low nucleotide substitution rates, that distinguish them from non-essential genes. However, many features are significantly different between categories of essential genes. Such differences can confound the differences between essential and non-essential genes.

Due to such effects, prediction of essential genes requires integration of diverse features. We identified a set of 22 non-redundant predictive features based on gene expression, PPI and gene co-expression networks, gene and protein sequence, Gene Ontology and orthology. Predictions based on all of these features combined were significantly better than predictions based on individual features. Features such as high protein disorder and low nucleotide substitution rates were effective predictors for most essential gene categories but no feature was best for all categories.

4.5 Methods

4.5.1 Gene phenotypes

Phenotypes of mouse genes were obtained from the Mouse Genome Informatics database version 4.35 [30]. The following phenotypes were defined as essential: lethality-prenatal/perinatal (MP:0005374), lethality-postnatal (MP:0005373), and infertility (MP:0001924). A gene was considered essential if it was annotated with one of these phenotypes, and the phenotype had been determined by gene knockout or Floxed/Frt. A gene was considered non-essential if it had been tested by gene knockout and was not annotated with an essential phenotype.

A training set, $S_{train}$, containing 1,126 essential and 1,745 non-essential genes was used for selecting gene and protein features related to essentiality. A test set, $S_{test}$, containing the same numbers of essential and non-essential genes was used for testing essentiality predictions.

Eleven non-essential phenotypes were included in the analysis. These phenotypes were selected by 3 criteria: a depth of 1 in the phenotype hierarchy, limited overlap (number of shared genes) with essential phenotypes and limited overlap with other non-essential phenotypes. A non-essential phenotype was excluded if it was annotated to > 65% of the genes of an essential phenotype. In addition, a non-essential phenotype was excluded if > 65% of its genes were shared with another non-essential phenotype, which was annotated to more genes.

4.5.2 Gene expression

Two types of predictive features were based on gene expression levels: summary gene expression statistics (e.g., median expression level across samples), and tissue-specific expression levels (e.g., expression level in placenta). Summary statistics were determined in 3 steps: preprocessing of gene expression datasets, calculation of statistics within datasets, and integration of statistics from different datasets. A total of 55 gene expression datasets were downloaded from the Gene Expression Omnibus (GEO) [13] website on Feb. 27, 2009. These datasets had unique GDS IDs, were based on mouse Affymetrix chips and contained at least 15 samples. All gene expression datasets were processed by the MAS 5.0 algorithm, using the affy package [60], version 1.24.2, in R, version 2.8. If a gene was represented by multiple probe sets, only the probe set with the highest median expression level was kept.

In each dataset, 7 summary statistics were calculated for every gene. Five of these were standard summary statistics: maximum, median, variance, interquartile range (IQR), and median absolute deviation (MAD). Two other measures were specific to gene expression data: the number of samples where the gene was expressed, and the gene’s tissue specificity [107]. A gene was considered to be expressed in a sample if its level was > 150. Tissue specificity was calculated as in [107]:

\[
\tau = \frac{\sum_{j=1}^{n} \left( 1 - \frac{\log_2 S(j)}{\log_2 S_{\max}} \right)}{n - 1}, \tag{4.1}
\]

where $n$ was the number of samples, $S(j)$ was the expression level for sample $j$, and $S_{\max}$ was the highest expression level of the gene across all samples. After these 7 measures were calculated, the dataset was normalized – values were log-transformed, and within each sample, the sample mean was subtracted from expression levels. Then the 5 standard measures were recalculated, resulting in 12 features per gene, in each dataset.
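As an illustration of Eq. 4.1, the following minimal Python sketch computes the tissue specificity index for a single gene. The function name `tissue_specificity` and the requirement that all expression levels be positive with a maximum above 1 are assumptions made for this example; they are not taken from [107] or from the original analysis, which was carried out in R.

```python
import math

def tissue_specificity(levels):
    """Tissue specificity (tau, Eq. 4.1) for one gene.

    levels: expression levels S(j) of the gene in each of n samples.
    Assumes (for this sketch) that every level is positive and that the
    maximum level exceeds 1, so that log2(S_max) is positive.
    """
    n = len(levels)
    if n < 2:
        raise ValueError("need at least two samples")
    log_max = math.log2(max(levels))
    if log_max <= 0:
        raise ValueError("maximum expression level must exceed 1")
    # Sum of (1 - log2 S(j) / log2 S_max) over all samples
    total = sum(1 - math.log2(s) / log_max for s in levels)
    return total / (n - 1)
```

Under these assumptions, a gene expressed at the same level in every sample scores 0, and a gene with a high level in one sample and a level of 1 elsewhere scores 1.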

Next, feature values from different datasets were combined, so that a given gene was associated with a total of 12 values, rather than 12 values in every dataset. First, in each dataset, the values of a feature were replaced with ranks. Then the ranks of a feature, $i$, of gene, $j$, were combined by calculating their median across datasets:

\[
f_{i,j} = \operatorname{median}_{k \in D} \bigl( \operatorname{rank}(f_{i,j,k}) \bigr), \tag{4.2}
\]

where $f_{i,j}$ was the integrated value of feature $i$ of gene $j$, $D$ was the set of gene expression datasets, and $\operatorname{rank}(f_{i,j,k})$ was the rank of feature $i$ of gene $j$ in dataset $k$.
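The rank-based integration of Eq. 4.2 can be sketched in Python as follows. The helper names, the use of average ranks for ties, and the choice to take the median only over datasets in which a gene appears are assumptions of this sketch; the text does not specify how ties or missing genes were handled.

```python
from statistics import median

def rank_values(values):
    """Map gene -> rank (1 = smallest value); tied values share their average rank."""
    ordered = sorted(values.items(), key=lambda kv: kv[1])
    ranks = {}
    i = 0
    while i < len(ordered):
        j = i
        # Extend j to cover the run of values tied with ordered[i]
        while j + 1 < len(ordered) and ordered[j + 1][1] == ordered[i][1]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[ordered[k][0]] = avg
        i = j + 1
    return ranks

def integrate_feature(datasets):
    """datasets: list of dicts mapping gene -> value of one feature in one dataset.

    Returns gene -> median of that gene's ranks across the datasets containing it.
    """
    per_gene = {}
    for values in datasets:
        for gene, r in rank_values(values).items():
            per_gene.setdefault(gene, []).append(r)
    return {gene: median(rs) for gene, rs in per_gene.items()}
```

Because ranks are not rescaled here, datasets with different numbers of genes contribute ranks on different scales; a real analysis might normalize ranks by dataset size first.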

4.5.3 Gene co-expression networks

A total of 32 predictive features represented network properties of genes in gene co-expression networks. The features were calculated in 4 steps: 1) preprocessing of gene expression datasets, 2) generation of co-expression networks for each dataset, 3) integration of co-expression networks, and 4) calculation of network properties for each gene in a network.

The first step, preprocessing of gene expression datasets, was carried out as in the previous section (4.5.2); datasets were processed by the MAS 5.0 algorithm, expression levels were log-transformed and samples were mean-centered. The same 55 gene expression datasets were used as above.

From each gene expression dataset, 4 co-expression networks were generated, referred to as global, local, mixed, and combined networks. In the global network, two genes were connected by an edge if their Pearson correlation was high relative to most gene pairs in the dataset. Edges were determined by calculating the Pearson correlation of gene pairs across all samples, and comparing the correlations against a given threshold – gene pairs with correlations above the threshold were connected by edges (threshold selection is described further on). In the local network, two genes were connected by an edge if their Pearson correlation with each other was substantially higher than their correlations with other genes in the dataset. Edges in the local network were calculated in 3 steps. First, Pearson correlation of all gene pairs was calculated. Second, for each gene, the correlation level corresponding to the 99th percentile was identified. Third, the Pearson correlation, $\rho_{ij}$, between genes $i$ and $j$ was mapped to 2 scores, $o_i$ and $o_j$, as follows:

\[
o_i = \frac{\rho_{ij} - 50^{\mathrm{th}}\,\mathrm{percentile}(D_i)}{99^{\mathrm{th}}\,\mathrm{percentile}(D_i) - 50^{\mathrm{th}}\,\mathrm{percentile}(D_i)} \tag{4.3}
\]

\[
o_j = \frac{\rho_{ij} - 50^{\mathrm{th}}\,\mathrm{percentile}(D_j)}{99^{\mathrm{th}}\,\mathrm{percentile}(D_j) - 50^{\mathrm{th}}\,\mathrm{percentile}(D_j)}, \tag{4.4}
\]

where $D_i$, $D_j$ were correlation distributions of genes $i$ and $j$, respectively. These two scores were combined into a single value, $mino_{ij}$, as follows:

\[
mino_{ij} = \min(o_i, o_j) \tag{4.5}
\]

If this score was above a given threshold, the two genes were connected by an edge in the local network.

The mixed network combined the ideas of the global and local approaches. First, a distribution of Pearson correlations, $D_G$, was determined for all gene pairs in a dataset. The 50th and 99th percentiles in this distribution were identified. The Pearson correlation, $\rho_{ij}$, between candidate interacting genes $i$, $j$ was mapped to a global score, $o_G$, as follows:

\[
o_G = \frac{\rho_{ij} - 50^{\mathrm{th}}\,\mathrm{percentile}(D_G)}{99^{\mathrm{th}}\,\mathrm{percentile}(D_G) - 50^{\mathrm{th}}\,\mathrm{percentile}(D_G)}, \tag{4.6}
\]

where $D_G$ was the global distribution of correlations. Scores $o_i$, $o_j$ were calculated as in the local approach. Correlation $\rho_{ij}$ was then mapped to a single score as follows:

\[
o_{mixed} = \operatorname{mean}(\operatorname{mean}(o_i, o_j), o_G), \tag{4.7}
\]

The combined network was simply the union of the edges from the global, local, and mixed networks.
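The local and mixed scores of Eqs. 4.3 to 4.7 can be sketched as below. All function names are hypothetical, and the percentile definition (linear interpolation between order statistics) is an assumption of this sketch, since the text does not say which percentile estimator was used.

```python
import math

def percentile(values, q):
    """q-th percentile (0-100), linearly interpolated between order statistics.

    The interpolation rule is an assumption of this sketch.
    """
    xs = sorted(values)
    pos = (len(xs) - 1) * q / 100
    lo = math.floor(pos)
    hi = math.ceil(pos)
    return xs[lo] + (xs[hi] - xs[lo]) * (pos - lo)

def scaled_score(rho, dist):
    """(rho - P50) / (P99 - P50) for a correlation distribution (Eqs. 4.3, 4.4, 4.6)."""
    p50 = percentile(dist, 50)
    p99 = percentile(dist, 99)
    return (rho - p50) / (p99 - p50)

def local_score(rho, dist_i, dist_j):
    """mino_ij = min(o_i, o_j) (Eq. 4.5)."""
    return min(scaled_score(rho, dist_i), scaled_score(rho, dist_j))

def mixed_score(rho, dist_i, dist_j, dist_global):
    """o_mixed = mean(mean(o_i, o_j), o_G) (Eq. 4.7)."""
    o_i = scaled_score(rho, dist_i)
    o_j = scaled_score(rho, dist_j)
    o_g = scaled_score(rho, dist_global)
    return ((o_i + o_j) / 2 + o_g) / 2
```

A correlation equal to the 99th percentile of a distribution maps to a score of 1 and the median maps to 0. The sketch does not guard against a distribution whose 50th and 99th percentiles coincide, which would cause a division by zero.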

To generate co-expression networks, gene pairs with scores above selected thresholds were connected by edges (scores were Pearson correlations in the global network, $mino$ scores in the local network and $o_{mixed}$ scores in the mixed network). Thresholds were selected so that connected gene pairs would be likely to encode interacting protein pairs. To identify these thresholds, a training set of protein pairs was used, comprising 6,334 high confidence PPIs (positive cases) from I2D [27] and 63,340 random protein pairs (negative cases). The score of each gene pair, $s_{ij}$, was converted to a likelihood ratio, $LR_{ij}$, of interaction as follows:

\[
LR_{ij} = \frac{P(\text{pos cases} \mid \text{scores} \le s_{ij})}{P(\text{neg cases} \mid \text{scores} \le s_{ij})}, \tag{4.8}
\]

where pos cases and neg cases were gene pairs in a network encoding interacting and random protein pairs, respectively, and scores was the set of gene pair scores in the network. Thresholds were defined as the smallest scores giving a likelihood ratio ≥ 10. Thus, for each gene expression dataset, 4 networks were generated where edges represented likelihood ratios of interaction ≥ 10.

Networks of the same type (e.g., local networks) from different expression datasets were integrated into a single network. First, for each gene pair, an integrated likelihood ratio was computed as the product of likelihood ratios from different gene expression datasets. Then, gene pairs with integrated likelihood ratios ≥ 100 were connected by edges. By this approach 4 integrated networks were generated, from local, global, mixed, and combined networks.
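A minimal sketch of this integration step is shown below. The function name, the data layout (one dictionary of gene-pair likelihood ratios per dataset), and the treatment of a pair that is absent from a dataset as contributing a factor of 1 are all assumptions made for illustration; the text does not state how missing pairs were handled.

```python
def integrate_networks(lr_by_dataset, threshold=100.0):
    """Combine per-dataset likelihood ratios into one integrated edge set.

    lr_by_dataset: list of dicts mapping a gene pair (a 2-tuple) to its
    likelihood ratio in one dataset. Pairs are treated as unordered.
    Returns the set of pairs (as frozensets) whose product of likelihood
    ratios across datasets is >= threshold.
    """
    combined = {}
    for lrs in lr_by_dataset:
        for pair, lr in lrs.items():
            key = frozenset(pair)
            # A pair missing from a dataset implicitly contributes a factor of 1
            combined[key] = combined.get(key, 1.0) * lr
    return {pair for pair, lr in combined.items() if lr >= threshold}
```

With a threshold of 100, a pair needs, for example, a likelihood ratio of 10 in two datasets, or a single ratio of at least 100, to appear in the integrated network.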

The final step for determining co-expression related features was the calculation of gene network properties. In each integrated network, eight network properties were determined for each vertex, $v$: degree, betweenness, closeness centrality, eigenvector centrality, PageRank, clustering coefficient, articulation point status, and size of connected component. Degree is the number of adjacent edges at the vertex. Betweenness is the number of shortest paths that traverse the vertex [58]. Closeness centrality is defined as the inverse of the average shortest path length to all other vertices in the graph:

\[
\text{closeness centrality}(v) = \frac{|V| - 1}{\sum_{i \neq v} d_{vi}}, \tag{4.9}
\]

where $V$ is the set of vertices in the network and $d_{vi}$ is the shortest path length from $v$ to vertex $i$ [58]. Eigenvector centrality scores indicate whether a vertex has neighbours that are central in the network – vertices with high eigenvector centralities are connected to many other vertices, which are also connected to many others. The scores are based on the first eigenvector of the graph adjacency matrix [20]. PageRank is defined in a similar way to eigenvector centrality; the PageRank of a vertex is high if many of its neighbours have high PageRank scores:

\[
\mathrm{PageRank}(v) = (1 - d) + d \sum_{i \in N_v} \frac{\mathrm{PageRank}(i)}{\mathrm{degree}(i)}, \tag{4.10}
\]

where $i$ is a neighbour of $v$, $N_v$ is the set of $v$’s neighbours, and $d$ is the damping factor [26]. Clustering coefficient is the probability that the neighbours of a vertex are connected [197]:

\[
\text{clustering coefficient}(v) = \frac{|\{e_{ij} \mid i, j \in N_v,\ e_{ij} \in E\}|}{\mathrm{degree}(v) \times (\mathrm{degree}(v) - 1)/2}, \tag{4.11}
\]

where $i$ and $j$ are neighbours of $v$ and $E$ is the set of edges in the network. Articulation point status indicates whether deletion of $v$ disconnects the graph. Component size is the number of vertices in the connected component containing $v$. These 8 features were calculated using the igraph package, version 0.5.1, in R, version 2.8. The 8 features were calculated for all vertices (genes) in each integrated network, resulting in up to 32 features for every gene. (Some genes had fewer than 32 features because they were not represented in all networks.)
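To make the definitions concrete, the sketch below computes three of the eight properties (degree, closeness centrality as in Eq. 4.9, and clustering coefficient as in Eq. 4.11) for a small undirected graph in plain Python. It is an illustration only and does not reproduce igraph's behaviour: the function names are hypothetical, closeness is defined here only for connected graphs, and vertices with fewer than two neighbours are given a clustering coefficient of 0 by assumption.

```python
from collections import deque
from itertools import combinations

def build_graph(edges):
    """Adjacency sets for an undirected graph given as (u, v) pairs."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj

def degree(adj, v):
    return len(adj[v])

def closeness(adj, v):
    """(|V| - 1) / sum of shortest path lengths from v (Eq. 4.9)."""
    dist = {v: 0}
    queue = deque([v])
    while queue:  # breadth-first search gives shortest paths in unweighted graphs
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    if len(dist) != len(adj):
        raise ValueError("graph is not connected")
    return (len(adj) - 1) / sum(dist.values())

def clustering_coefficient(adj, v):
    """Fraction of pairs of v's neighbours that are themselves connected (Eq. 4.11)."""
    k = len(adj[v])
    if k < 2:
        return 0.0  # assumption of this sketch; the ratio is undefined here
    links = sum(1 for i, j in combinations(adj[v], 2) if j in adj[i])
    return links / (k * (k - 1) / 2)
```

For a triangle A-B-C with a pendant vertex D attached to C, vertex C has degree 3, closeness 1, and clustering coefficient 1/3.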

4.5.4 PPI networks

A total of 24 predictive features represented network properties of proteins in PPI networks. The features were generated in 2 steps: selection of PPI networks and calculation of network properties for each protein in a network.

Protein interaction networks were based on mouse PPIs from the Interologous Interaction Database (I2D) [27]. From these interactions, three networks were generated, ppiHigh, ppiMed, and ppiAll, reflecting different levels of interaction reliability. ppiHigh consisted of experimentally validated mouse PPIs, comprising 4,231 proteins and 6,868 interactions. ppiMed contained interactions from ppiHigh as well as predicted mouse PPIs based on experimentally validated interactions among mouse orthologs in rat and human – resulting in a network with 10,053 proteins and 50,427 interactions. ppiAll contained interactions from the previous 2 networks as well as mouse PPIs based on orthologous interactions in yeast, worm, and fly – resulting in a network with 10,869 proteins and 79,089 interactions.

The second step consisted of calculating 8 network properties for each protein in each PPI network. The properties were the same ones as computed for genes in gene co-expression networks.

4.5.5 Gene length

Three features related to gene length were obtained for each gene: total gene length, coding sequence (CDS) length, and the ratio of CDS length to total length. Total gene length and CDS length were obtained from Ensembl [18] release 58, using BioMart.

4.5.6 Gene Ontology

Genes were annotated with function and localization categories based on a subset of generic GO slim categories [8], downloaded from http://www.ebi.ac.uk/QuickGO/GMultiTerm on July 8, 2010. Categories were selected according to two criteria: frequency among essential genes in the training set and number of co-occurrences with other categories. A category had to be annotated to at least 10 essential genes and no more than 2/3rds of essential genes in the training data. Also, if two categories co-occurred on more than 80% of the same essential genes, then the category with fewer occurrences was removed.

4.5.7 Ortholog information

Thirteen features were computed based on orthologs of mouse genes in various species. Twelve features were nucleotide substitution rates between mouse genes and their orthologs in several species: the rate of synonymous substitutions (Ds), the rate of non-synonymous substitutions (Dn), and the ratio, Dn/Ds, relative to orthologs in rat, human, elephant, and armadillo. The last feature, orthology score, indicated the extent to which a gene was preserved across species. An orthology score was highest if a mouse gene had orthologs in many species, orthologs in distantly related species and 1:1 orthologs. The score was calculated as follows:

\[
\text{orthology score}(g) = \sum_{s \in \{\text{sequenced species}\},\ s \neq \text{mouse}} f(s, g) \tag{4.12}
\]

\[
f(s, g) =
\begin{cases}
0 & g \text{ has no orthologs in } s \\
1 / |\{\text{mouse genes with 1:1 orthologs in } s\}| & g \text{ has a 1:1 ortholog in } s \\
1 / |\{\text{mouse genes with orthologs in } s\}| & g \text{ has 1:n or n:n orthologs in } s
\end{cases} \tag{4.13}
\]

Substitution rates and ortholog information were downloaded from Ensembl [18] release 58 using BioMart.
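Eqs. 4.12 and 4.13 can be sketched in Python as follows. The function name and the input layout (a dictionary per species mapping each mouse gene to an orthology type string) are hypothetical and chosen only for this example; they do not reflect the format of the Ensembl BioMart export.

```python
def orthology_score(gene, orthologs):
    """Orthology score of one mouse gene (Eqs. 4.12 and 4.13).

    orthologs: dict mapping species -> dict mapping mouse gene -> orthology
    type ('1:1', '1:n' or 'n:n'). Genes without orthologs in a species are
    simply absent from that species' dict.
    """
    score = 0.0
    for species, types in orthologs.items():
        kind = types.get(gene)
        if kind is None:
            continue  # f(s, g) = 0: no orthologs in this species
        if kind == "1:1":
            # Number of mouse genes with 1:1 orthologs in this species
            denom = sum(1 for t in types.values() if t == "1:1")
        else:
            # Number of mouse genes with any ortholog in this species
            denom = len(types)
        score += 1.0 / denom
    return score
```

Since fewer mouse genes have orthologs in distantly related species, each such ortholog contributes a larger term, and 1:1 orthologs contribute at least as much as 1:n or n:n orthologs in the same species.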

4.5.8 Protein structure

Protein structure comprised 22 features: 21 binary features representing the presence or absence of certain InterPro domains and 1 feature indicating the total number of unique domains on a protein. The 21 selected domains were ones occurring on at least 10 essential proteins in the training set. Domain annotations of proteins were downloaded from InterPro release 27.0.

4.5.9 Protein disorder

Two features related to protein disorder were determined for mouse proteins: the number of residues in disordered regions and the percentage of residues in disordered regions. Protein disorder was predicted using DISOPRED version 2.3 [196].

4.5.10 Sequence biases

Sequence biases comprised 103 features related to codon, residue and nucleotide composition. Protein and gene sequences were downloaded from Ensembl [18] release 58 using BioMart. For gene sequences, the longest CDS was selected. Most features represented the relative frequency of a particular residue, codon, or nucleotide. For example, the relative frequency of each codon, $c_i$, in gene, $g_j$, was calculated as follows:

\[
f(c_{i,j}) = \frac{\text{number of occurrences of } c_i \text{ in } g_j}{\text{number of codons in } g_j}, \tag{4.14}
\]

Of the 103 features, 30 were residue frequencies (including frequencies of individual residues and of residue types, e.g., polar), 64 were codon frequencies, and 4 were nucleotide frequencies. Three features were combinations of nucleotide, codon, or residue frequencies. Feature GC was the fraction of G or C nucleotides in a gene. Feature rare was the fraction of rare residues in a protein: cysteine, tryptophan, histidine, and methionine. Feature stop was the fraction of codons similar to the stop codon: TAC, TAT, TGC, TGT, or TGG. Also, two measures of codon bias, CAI and Nc, were computed with the codonW program (http://bioweb.pasteur.fr/seqanal/interfaces/codonw.html).
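The codon frequency of Eq. 4.14 and the GC and stop features can be illustrated with the short Python sketch below. The function names are hypothetical, and the sketch assumes an in-frame CDS string over A, C, G, T whose length is a multiple of 3; whether the terminal stop codon was counted in the original analysis is not stated, so here every codon in the input is counted.

```python
from collections import Counter

def codon_frequencies(cds):
    """Relative frequency of each codon in a coding sequence (Eq. 4.14)."""
    cds = cds.upper()
    if len(cds) == 0 or len(cds) % 3 != 0:
        raise ValueError("CDS length must be a positive multiple of 3")
    codons = [cds[i:i + 3] for i in range(0, len(cds), 3)]
    counts = Counter(codons)
    return {codon: n / len(codons) for codon, n in counts.items()}

def gc_fraction(seq):
    """Fraction of G or C nucleotides in a sequence (feature GC)."""
    seq = seq.upper()
    return sum(1 for base in seq if base in "GC") / len(seq)

def stop_like_fraction(cds):
    """Fraction of codons similar to a stop codon (feature stop)."""
    freqs = codon_frequencies(cds)
    return sum(freqs.get(c, 0.0) for c in ("TAC", "TAT", "TGC", "TGT", "TGG"))
```

For the toy sequence ATGCTGCTGTAC, the codons are ATG, CTG, CTG, and TAC, so the CTG frequency is 0.5 and the stop feature is 0.25.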

4.5.11 Assessing relationships between predictive variables and essentiality

The relationships between predictive features and essentiality were assessed in three ways: Spearman correlation, Mann-Whitney U tests and chi-square tests. Pearson correlation is commonly used to calculate correlation between a continuous variable (e.g., median gene expression) and a binary variable (e.g., essentiality) or between two binary variables [64]. Since our continuous variables were not normally distributed, we applied Spearman correlation instead. When one or both variables are binary, the values of the correlation coefficient become constrained as the distribution of the binary variable deviates further from 50:50 – however, the distribution of essentiality was 47.5:52.5. To test the statistical significance of feature-essentiality relationships we used Mann-Whitney U tests for continuous features such as median gene expression and chi-square tests for binary features such as GO annotations. These tests were used to assess associations between each predictive feature and 5 essential categories – prenatal lethality, postnatal lethality, postnatal-only, infertility, and these 4 categories combined. Thus, for each predictive feature, 5 p-values were determined. A multiple testing adjustment (false discovery rate [16]) was applied to these p-values, and features with a minimum p-value > 0.05 were eliminated from further analysis – 191 features remained (Supplemental Materials).

Correlations and statistical tests were carried out on the training set, $S_{train}$. Calculations were done with R, version 2.8.
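The filtering step can be sketched as follows, assuming that the false discovery rate adjustment is the Benjamini-Hochberg step-up procedure. That choice, the function names, and the `alpha` parameter are assumptions of this sketch; the text only cites [16] for the adjustment.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order.

    Assumed here to be the false discovery rate adjustment meant in the text.
    """
    m = len(pvalues)
    # Visit p-values from largest to smallest, tracking the running minimum
    order = sorted(range(m), key=lambda i: pvalues[i], reverse=True)
    adjusted = [0.0] * m
    running_min = 1.0
    for pos, idx in enumerate(order):
        rank = m - pos  # rank of this p-value in ascending order
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

def keep_feature(pvalues, alpha=0.05):
    """Keep a feature if its smallest adjusted p-value does not exceed alpha."""
    return min(bh_adjust(pvalues)) <= alpha
```

Each feature would be passed its 5 p-values, one per essential category; a feature is retained when at least one category remains significant after adjustment.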

4.5.12 Identifying non-redundant predictive features

A predictive feature was considered non-redundant if it could predict a subset of essential genes with higher precision than other features. Identification of non-redundant features was carried out in 3 steps: using each predictive feature to assign precisions to genes, comparing pairs of features based on precisions, and selecting non-redundant features.

To assign precisions to genes, based on a feature, $f_i$, genes were ordered by values of $f_i$; by increasing values if $f_i$ was positively correlated with essentiality and by decreasing values if the correlation was negative. The precision of an essential gene, $g_k$, based on feature $f_i$ was defined as the fraction of genes ranked higher than $g_k$ that were essential:

\[
\mathrm{precision}(g_k, f_i) = \frac{|\{g \mid \mathrm{rank}(g, f_i) > \mathrm{rank}(g_k, f_i),\ g \in G_{ess}\}|}{\mathrm{rank}(g_k, f_i)}, \tag{4.15}
\]

where $\mathrm{rank}(g, f_i)$ was the rank of gene $g$ based on feature $f_i$ and $G_{ess}$ was the set of essential genes within the training set. Thus, each feature was associated with a vector of precisions for essential genes.

After features were associated with precision vectors, pairs of features, $(f_i, f_j)$, were compared based on their precision vectors. The number of non-redundant predictions ($nr$) of $f_i$ relative to $f_j$ was defined as the number of genes whose precisions from $f_i$ were higher than precisions from $f_j$ by at least 0.05. Only genes with $f_i$ precisions ≥ 0.55 were included in this comparison; lower values were considered too close to the default prediction of 0.475 (i.e., the fraction of essential genes in the dataset).

\[
nr_{i,j} = |\{g_k \mid g_k \in G_{ess},\ \mathrm{precision}(g_k, f_i) \geq 0.55,\ \mathrm{precision}(g_k, f_i) > \mathrm{precision}(g_k, f_j) + 0.05\}| \tag{4.16}
\]
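A minimal sketch of Equation 4.16, assuming that the precision vectors of two features have already been computed and are stored as dictionaries keyed by essential gene. The function name, gene identifiers, and precision values are hypothetical.

```python
def non_redundant_count(prec_i, prec_j, min_precision=0.55, margin=0.05):
    """Count essential genes that feature i predicts more precisely than feature j.

    prec_i and prec_j map each essential gene to its precision under
    features i and j, respectively.
    """
    return sum(
        1
        for gene, p_i in prec_i.items()
        if p_i >= min_precision and p_i > prec_j[gene] + margin
    )


# Hypothetical precision vectors for four essential genes.
prec_a = {"g1": 0.90, "g2": 0.70, "g3": 0.56, "g4": 0.50}
prec_b = {"g1": 0.60, "g2": 0.68, "g3": 0.40, "g4": 0.30}
print(non_redundant_count(prec_a, prec_b))
print(non_redundant_count(prec_b, prec_a))
```

The count is not symmetric: in this toy example feature a has non-redundant predictions relative to feature b, but not the reverse.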

If the number of non-redundant predictions was at least 5% of essential genes in the training set then $f_i$ was considered non-redundant relative to $f_j$. After determining numbers of non-redundant predictions between all pairs of features, a set of non-redundant features was selected. These features were identified by a greedy algorithm which prioritized features with the most non-redundant predictions and the highest correlation with essentiality. Input to the algorithm comprised a set, $S_{candidates}$, of candidate non-redundant features and output comprised a set, $S_{selected}$, of selected non-redundant features. The algorithm consisted of two iterative steps:

1. Selection of the least redundant feature, $f_{best}$, from $S_{candidates}$. For each feature, $f_i \in S_{candidates}$, numbers of non-overlapping predictions were determined between $f_i$ and all other members $f_j$ in $S_{candidates}$, $i \neq j$. The lowest number of non-overlapping predictions, $nr_{min}(f_i)$, was identified (i.e., the worst case scenario, where $f_i$ was most redundant). After a value of $nr_{min}(f_i)$ was calculated for all features $f_i \in S_{candidates}$, the feature with the highest value was considered the least redundant feature, $f_{best}$:

   \[
   f_{best} = \operatorname*{arg\,max}_{f_i \in F} \min\big(nr(f_i, f_j),\ \forall f_j \in S_{candidates},\ j \neq i\big). \tag{4.17}
   \]

   If $f_{best}$ was not unique then the candidate with the highest absolute correlation with essentiality was chosen. Feature $f_{best}$ was added to the set of non-redundant features and removed from $S_{candidates}$.

2. Removal from $S_{candidates}$ of any features which lacked non-redundant predictions relative to $f_{best}$. If $S_{candidates}$ was empty, the algorithm was terminated; otherwise the steps were repeated.
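The two iterative steps can be sketched as follows. This is an illustration rather than the original implementation: it assumes the pairwise counts are available as a nested dictionary, and it interprets "lacked non-redundant predictions" as a count below the 5% threshold described earlier in this section. All names and values are hypothetical.

```python
def select_non_redundant(nr, correlation, threshold):
    """Greedy selection of non-redundant features.

    nr[a][b]     number of non-redundant predictions of feature a relative to b
    correlation  feature -> correlation with essentiality
    threshold    minimum count for a feature to be non-redundant relative to another
    """
    candidates = set(nr)
    selected = []
    while candidates:
        def worst_case(f):
            # Lowest count of f against any other remaining candidate.
            counts = [nr[f][g] for g in candidates if g != f]
            return min(counts) if counts else float("inf")

        # Step 1: highest worst-case count; ties broken by absolute correlation.
        best = max(
            sorted(candidates),
            key=lambda f: (worst_case(f), abs(correlation[f])),
        )
        selected.append(best)
        candidates.discard(best)
        # Step 2: drop candidates that add too little relative to the chosen feature.
        candidates = {f for f in candidates if nr[f][best] >= threshold}
    return selected


# Hypothetical pairwise counts for three features.
nr = {
    "A": {"B": 10, "C": 8},
    "B": {"A": 2, "C": 9},
    "C": {"A": 6, "B": 7},
}
correlation = {"A": 0.40, "B": 0.30, "C": -0.35}
print(select_non_redundant(nr, correlation, threshold=5))
```

In this toy example feature A is chosen first, feature B is then removed because it contributes only 2 predictions beyond A, and feature C is retained.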

4.5.13 Identifying significant differences between different types of essential genes

Essential genes were divided into 15 categories based on their essential and non-essential phenotypes (e.g., lethality-postnatal, aging). Non-redundant predictive features were tested for significant differences between a given category and all others, using Mann-Whitney U tests. For example, a Mann-Whitney U test was carried out to test whether orthology scores of genes in the lethality-postnatal category were significantly different from those of other essential genes. After testing all features for a given category, p-values were adjusted for multiple testing by the FDR approach [16].

4.5.14 Predicting essential genes

Gene essentiality was predicted by applying several classifiers to non-redundant predictive features, and calculating the unweighted mean of their predictions. The classifiers used for prediction were logistic regression, LogitBoost, and ADTree, from the WEKA software package [75]. Predictions were carried out by 10-fold cross-validation on the $S_{test}$ dataset. The three classifiers were chosen based on their cross-validation performance on the $S_{train}$ dataset.
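The combination step amounts to an unweighted mean of per-classifier predictions. The sketch below assumes each classifier outputs a probability of essentiality for each gene; the gene names and probabilities are hypothetical and are not WEKA output.

```python
def ensemble_prediction(classifier_probabilities):
    """Unweighted mean of the probabilities given by the individual classifiers."""
    return sum(classifier_probabilities) / len(classifier_probabilities)


# Hypothetical probabilities from logistic regression, LogitBoost and ADTree.
gene_probabilities = {
    "geneA": [0.9, 0.8, 0.7],
    "geneB": [0.2, 0.4, 0.3],
}
combined = {
    gene: ensemble_prediction(probs) for gene, probs in gene_probabilities.items()
}
print(combined)
```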

4.6 Supplemental Materials

4.6.1 Links to data tables

1. features significantly correlated with essentiality (p < 0.05) and phenotypes used for defining essential categories: www.cs.utoronto.ca/~juris/data/MaxThesis/Ch4/featureAndPhenotypeNames.xls

2. GDS IDs of gene expression datasets: www.cs.utoronto.ca/~juris/data/MaxThesis/Ch4/expressionDatasetIDs.txt

3. genes in the $S_{train}$ dataset: www.cs.utoronto.ca/~juris/data/MaxThesis/Ch4/S train.txt

4. genes in the $S_{test}$ dataset: www.cs.utoronto.ca/~juris/data/MaxThesis/Ch4/S test.txt

5. feature values: www.cs.utoronto.ca/~juris/data/MaxThesis/Ch4/featureValues.txt

Chapter 5

Discussion

Determining the identity and context (e.g., location, timing) of human PPIs could clarify disease mechanisms and accelerate the development of therapies. Consequently, many experimental and computational methods have been developed to accelerate detection of PPIs [163, 164], identify their context [159, 21], and understand their impact on phenotype [142, 178]. However, key steps for achieving a comprehensive, explanatory interactome model remain largely incomplete: only a small fraction of the interactome has been identified, the context of most interactions is unknown, and relationships between the interactome and different phenotypes are poorly understood. This thesis described our contributions to addressing these problems. The FpClass algorithm was developed to increase coverage of the interactome, while limiting the false discovery rate (FDR). The mixed co-expression algorithm was designed to increase coverage of transient PPIs, which have been particularly difficult to detect [139, 4, 27, 144]. At the same time, mixed co-expression can contribute information about the context of interactions, such as the tissues and conditions in which they occur. Our last study, characterizing and predicting essential mouse genes, aimed to understand essential phenotypes in terms of interaction networks, gene expression, genomic features, and orthology.

In this chapter we summarize the methods and results presented in the thesis, highlighting their strengths and limitations. We discuss possible improvements to our methods and how they may be applied to current problems in .

5.1 FpClass: extending coverage of the human interactome

Currently known human PPIs represent about 10% or less of the human interactome [79, 175]. FpClass was designed to extend coverage of the interactome – to provide probabilities of interaction for a large number of protein pairs, so that they could be prioritized for further validation. Most previous prediction methods focused on a single feature of candidate interacting protein pairs and consequently provided limited coverage. For example, phylogenetic profile methods required orthologs across different species, gene co-expression and interologous prediction identified mainly stable PPIs [91, 27], and domain-based prediction methods were limited to the 20% of PPIs that are mediated by domain-domain binding [4, 155]. Feature integration has been shown to give higher coverage and lower FDR than individual predictive features [92, 15, 145, 143, 158]. However, high-confidence human PPI networks from previous integrative methods contained fewer than 40,000 interactions – corresponding to less than 20% of the interactome [145, 158].

A possible reason for this limited coverage was that integrative studies assumed a large degree of independence between features. For example, studies that used protein domains and localizations assessed whether two proteins had a pair of compatible domains (one on each protein) and a pair of compatible localizations [143, 158]. Synergies between features were not considered – for example, a combination of domains on one protein binding to a combination of domains on another, or a pair of domains having increased binding affinity in a particular cellular compartment.

The motivating idea behind FpClass was that PPI predictions could be improved by considering such synergies. Our aim was to identify synergies between a large number of protein features including protein domains, localizations, functions, processes and post-translational modifications. Just as previous prediction methods identified pairs of compatible domains, our aim was to identify pairs of compatible feature sets – where each set could include any number of domains, localizations, and other properties annotated to a protein. However, this task is potentially intractable since millions of feature sets can be found on human proteins and we would need to assess compatibility between all pairs of these sets. In order to overcome this problem, we needed a way of identifying feature sets which are important for interaction, without having to test compatibility between all pairs of feature sets. Our solution was based on the following hypothesis: protein features which act co-operatively to enable interactions are likely to co-occur on proteins more often than expected by chance. In other words, to identify feature sets that are important for interaction, it may be sufficient to identify frequently co-occurring protein features. To identify such feature sets we implemented a data mining algorithm based on frequent pattern growth [77]. We then searched for pairs of feature sets which were enriched in experimentally determined PPIs, and used these feature set pairs to predict interactions.
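The idea of frequently co-occurring feature sets can be illustrated with a small sketch. It is a simplification under stated assumptions: it enumerates small feature subsets by brute force instead of using frequent pattern growth [77], and it applies a raw support threshold rather than a comparison against chance. The protein identifiers and annotations are hypothetical.

```python
from collections import Counter
from itertools import combinations


def frequent_feature_sets(protein_features, min_support, max_size=3):
    """Return {feature set: number of proteins carrying it} for sets meeting min_support."""
    counts = Counter()
    for features in protein_features.values():
        ordered = sorted(features)
        # Count every subset of 2..max_size features annotated to this protein.
        for size in range(2, max_size + 1):
            for subset in combinations(ordered, size):
                counts[frozenset(subset)] += 1
    return {fs: n for fs, n in counts.items() if n >= min_support}


# Hypothetical annotations (domains, localizations) for four proteins.
proteins = {
    "P1": {"SH2", "kinase", "membrane"},
    "P2": {"SH2", "kinase", "cytosol"},
    "P3": {"SH2", "kinase", "membrane"},
    "P4": {"zinc_finger", "nucleus"},
}
frequent = frequent_feature_sets(proteins, min_support=2)
for feature_set, support in sorted(frequent.items(), key=lambda kv: -kv[1]):
    print(sorted(feature_set), support)
```

Brute-force enumeration grows exponentially with the number of features per protein, which is why a pattern-growth approach is needed at the scale of the human proteome.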

5.1.1 Contributions and limitations

The major contribution of FpClass is improved coverage of the interactome; compared to previous prediction methods [145, 158], FpClass provides substantially larger predicted networks at lower FDRs. For example, FpClass predicted ∼178,000 PPIs at 50% FDR while previous methods predicted < 40,000 PPIs with > 68% FDR.

Also, FpClass may be able to reduce the error rate of HTP experimental PPI detection methods, by identifying PPIs that were missed by these methods. As part of our testing, we applied FpClass to a gold standard PPI dataset used by Braun et al. [23] to assess HTP experimental PPI detection methods. FpClass identified most of the interactions (49/55) detected by the HTP methods and almost all of the interactions missed by the HTP methods (35/37). The FDR of FpClass was 1% and the FDRs of HTP experimental methods ranged from 0% to 4%.

Although FpClass provides better human PPI predictions than previous methods, it still shares many of the same drawbacks, which limit its coverage and FDR. The coverage of FpClass is limited by 3 main factors: the number of experimentally validated human PPIs, the availability of GO and PTM annotations, and the availability of protein structure annotations related to binding interfaces. The factor with the greatest impact on coverage is the size of the current experimentally validated human PPI network. This network is used for training FpClass, and as a result, interaction types that are sparsely represented in the network (e.g., interactions of membrane proteins) are poorly predicted by FpClass. Also, the most effective predictive module of FpClass is network topology, which predicts an interaction when two proteins share many experimentally validated interaction partners. However, this module can only be applied to proteins within the experimentally validated PPI network. The second factor limiting the coverage of FpClass is the availability of GO and PTM annotations. GO annotations are available for 18,202 human proteins, but many proteins are annotated only with non-specific terms (e.g., ‘cytoplasm’), which provide little information for predicting interactions. For example, if the specificity of terms is measured as information content [115], then out of 16,318 human proteins with cellular component annotations, only 7,990 are annotated with terms that are more specific than ‘cytosol’. PTM annotations are limited to 11,541 proteins. The third factor affecting coverage is the fraction of proteins with structure annotations related to their binding interfaces. The only such annotations used by FpClass were domains, but just 20% of PPIs are mediated through domain-domain binding [155].
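The dependence of the network topology module on the known network can be seen in a small sketch. It assumes, for illustration only, that the evidence for a protein pair is the raw number of shared interaction partners; the scoring used by FpClass may differ, and the network below is hypothetical.

```python
def shared_partners(network, a, b):
    """Number of interaction partners shared by proteins a and b.

    Returns None when either protein is absent from the network, in which
    case topology-based evidence is unavailable for the pair.
    """
    if a not in network or b not in network:
        return None
    return len((network[a] & network[b]) - {a, b})


# Hypothetical experimentally validated network as adjacency sets.
network = {
    "A": {"C", "D", "E"},
    "B": {"C", "D", "F"},
    "C": {"A", "B"},
    "D": {"A", "B"},
    "E": {"A"},
    "F": {"B"},
}
print(shared_partners(network, "A", "B"))
print(shared_partners(network, "A", "Z"))
```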

There may also be limits to how much the FDR of FpClass can be reduced. The prediction approach of FpClass and most previous methods involves identifying complementary sets of proteins, such that members of one set preferentially interact with members of the other. For example, FpClass found that proteins containing a Somatostatin/Cortistatin C domain tend to interact with proteins in the Somatostatin receptor family. However, FpClass often cannot determine specific pairings of proteins from the two sets (i.e., it might only be able to report that a protein with a Somatostatin/Cortistatin C domain is likely to interact with some protein in the Somatostatin receptor family). This limitation occurs for predictions based on protein domains, GO annotations, PTMs, or properties determined from protein sequence, such as percent protein disorder. Several features used by FpClass – network topology, orthologous interactions (interologs), and gene co-expression – can help identify specific pairings, but not in all cases. The same problem affected previous prediction methods to a greater extent. Feature sets used by FpClass reduce the impact of the problem by providing more annotations for proteins, thus making predictions more individualized.

5.2 Mixed co-expression: identifying transient interactions and interaction context

Mixed co-expression improves predictions of transient interactions and can help identify conditions in which interactions occur (i.e., context such as localization, time, and strength). Transient interactions have been harder to identify than stable interactions, using either experimental [139, 4] or computational methods [4, 27]. Experimental detection has been especially difficult for transient interactions which only occur under rare conditions; the prevalence of such interactions is hard to estimate [175]. Computational prediction of transient interactions has been difficult due to several factors. Primarily, training sets used by prediction methods have contained fewer transient than stable interactions. Secondly, gene co-expression (as used by previous studies) and interolog information have been ineffective at predicting transient interactions [91, 27]. Thirdly, prediction methods based on protein structure have also been less effective; they usually assume that proteins interact through domains, but interfaces of transient interactions often involve short disordered protein segments [4]. However, computational methods such as NetworKIN [114] have predicted kinase-substrate interactions, which are a subset of transient interactions.

The second issue addressed by mixed co-expression is the absence of context for interactions. Previous studies inferred the context of some interactions from gene expression data [159, 21]; interactions were linked with tissues [21] or conditions [159] where their genes were expressed. In these studies, expression data suggested which proteins were present under given conditions but did not indicate which pairs were interacting. Thus, the link between conditions and interactions was indirect. Gene co-expression can provide a more direct link, since it helps identify specific interactions under given conditions. However, in previous studies, gene co-expression identified mainly stable interactions. For transient interactions, which are more sensitive to context, gene co-expression could not provide context information.

5.2.1 Contributions and limitations

Mixed co-expression improves predictions of stable and transient interactions from gene co-expression. In doing so, it contributes context, which is simply the set of conditions/samples in which co-expression occurs. To predict whether two proteins, i, j, interact, mixed co-expression calculates their correlation, ρ, and evaluates this correlation using local and global information. Evaluating based on local information means comparing ρ to correlations of i with other genes and to correlations of j with other genes. Evaluating based on global information means comparing ρ with correlations of all gene pairs in an expression dataset. Mixed co-expression calculates scores based on each evaluation and predicts an interaction if the average of the scores is above a given threshold. By using local information, the threshold for predicting an interaction becomes specific to each gene pair. The previous approach for predicting PPIs from gene co-expression, global co-expression, compared the correlation of each gene pair against the same fixed threshold. Mixed co-expression detected significantly more interactions than the global approach, at the same FDR. The biggest improvements occurred for interactions involving transferases and signal transduction proteins; recall of these interactions increased almost 2-fold.
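The combination of local and global evaluation can be sketched as follows. The scoring rule is an assumption made for illustration: each score is the fraction of reference correlations that fall below ρ, which may differ from the scores used in the thesis. The correlation values, gene names, and threshold are hypothetical.

```python
from itertools import combinations


def fraction_below(value, reference):
    """Fraction of reference correlations that are strictly below value."""
    return sum(1 for r in reference if r < value) / len(reference)


def mixed_score(corr, i, j):
    """Unweighted mean of the local and global scores for the gene pair (i, j).

    corr maps frozenset({a, b}) to the expression correlation of genes a and b
    and must contain every gene pair in the dataset.
    """
    genes = sorted({g for pair in corr for g in pair})
    rho = corr[frozenset((i, j))]
    others = [g for g in genes if g not in (i, j)]
    # Local information: correlations of i, and of j, with every other gene.
    local_ref = [corr[frozenset((i, g))] for g in others]
    local_ref += [corr[frozenset((j, g))] for g in others]
    # Global information: correlations of all remaining gene pairs.
    global_ref = [
        corr[frozenset(pair)]
        for pair in combinations(genes, 2)
        if set(pair) != {i, j}
    ]
    return (fraction_below(rho, local_ref) + fraction_below(rho, global_ref)) / 2


# Hypothetical correlations among four genes.
corr = {
    frozenset(("A", "B")): 0.5,
    frozenset(("A", "C")): 0.1,
    frozenset(("A", "D")): 0.2,
    frozenset(("B", "C")): 0.0,
    frozenset(("B", "D")): 0.3,
    frozenset(("C", "D")): 0.9,
}
threshold = 0.85  # hypothetical cut-off
score_ab = mixed_score(corr, "A", "B")
print(score_ab, score_ab > threshold)
```

In this toy dataset the pair (A, B) is not the most correlated pair globally, but its correlation is the highest that either gene has with any partner, so the local evaluation raises its combined score.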

However, the ability of gene co-expression to predict PPIs has inherent limitations. Generally, gene expression levels are poorly correlated with protein levels [73, 69, 134]. Some interacting protein pairs, such as cyclin dependent kinases and their substrates, have completely uncorrelated gene expression levels [62]. Also, co-expression may not be able to distinguish direct protein interactions from other relationships among proteins – such as presence in the same complex or pathway.

The predictive value of co-expression, both mixed and global, depends on the number of datasets in which co-expression is calculated and the properties of those datasets. The importance of the number of datasets has not been systematically tested; however, for FpClass, co-expression based on 10 datasets gave an AUC100 score that was 2.4 times higher than the best individual dataset. The predictive value of gene co-expression also depends on properties of gene expression datasets; datasets with more samples and higher gene variance provide better predictions.

5.3 Characterizing and predicting essential mouse genes

A key motivation for studying PPI networks is understanding phenotype. One of the first links that was established between large-scale PPI networks and phenotype involved essential genes/proteins. Essential proteins were shown to be central in yeast PPI networks [93]. However, later studies did not always support this link. Yu et al. [206] investigated the relationship between essentiality and centrality in yeast PPI networks from 3 sources: HTP yeast 2-hybrid (Y2H), affinity purification followed by mass spectrometry (MS), and literature-curated interactions. They found that centrality was correlated with essentiality only in the literature-based network, suggesting that essential proteins may be central because their interactions have been studied more extensively. Park and Kim [133] also found no correlation in a Y2H network, but did find correlation in an MS network. A further question is whether centrality is directly related to essentiality or indirectly related through other protein properties. For example, centrality, evolution rate, and gene expression level are significantly correlated with each other and with essentiality – it is unclear which of the 3 features are inherent properties of essential proteins. Also, the link between centrality and essentiality has been mainly studied in yeast – no previous studies have examined whether the link holds for mammalian essential proteins in different tissues.

5.3.1 Contributions and limitations

In chapter 4 we described our investigation of mouse essential genes (proteins). We surveyed 371 gene and protein features for links with essentiality and identified a set of 22 non-redundant features closely associated with essentiality. These features included high protein disorder, low rates of non-synonymous substitutions, nuclear localization and various measures of centrality (including closeness and betweenness) in PPI and gene co-expression networks. Centrality in PPI networks was one of the top features for predicting essential genes, but this was at least partly due to bias in PPI networks. The networks were significantly enriched for essential proteins, suggesting that as in yeast, interactions of essential proteins have been studied more extensively than those of non-essential proteins. However, centrality in gene co-expression networks was also an important predictive feature, and most co-expression networks had no bias in favour of essential genes. Based on these results, centrality appeared to be an inherent property of many, but not all, essential genes. Essential genes related to infertility, renal-urinary tissues and postnatal-lethality were poorly predicted by centrality. Essential genes related to different tissues and development times had significantly different properties. Possibly due to this factor, integrating diverse features was very effective at improving predictions of essential genes.

Although many features were significantly correlated with essentiality, each feature could distinguish only a fraction of essential genes from non-essential genes. For example, centrality was the feature with the greatest predictive value, yet it was relevant for less than 40% of essential genes. Also, it was largely ineffective for certain types of essential genes, especially ones associated exclusively with postnatal lethality.

A general limitation of our study was the limited number of known essential mouse genes. Out of roughly 22,000 mouse genes, 4,742 have known phenotypes and 47.5% of these are essential. The set of genes with known phenotypes may have certain biases; in particular, it may be enriched for essential genes since in yeast, only about 25% of genes are essential – although yeast may have a lower percentage of essential genes than multicellular organisms. Some of the trends we identified may change once phenotypes are established for more genes; however, the relationships we identified have high statistical significance.

5.4 Future improvements to introduced methods

Several modifications to FpClass could substantially improve its coverage and FDR. The largest increase in coverage could come from merging FpClass with mixed co-expression, and using more gene expression datasets. This could improve coverage of transient PPIs, since current predictive features of FpClass (e.g., domains, global co-expression, orthology) are less effective for transient PPIs. In addition, gene expression data is available for most genes, unlike GO annotations and experimentally determined interactions. The highest confidence predictions would likely remain unchanged since the most conclusive evidence would still be for stable interactions. Coverage could also be improved by using additional protein structure data, such as information about short linear motifs (SLiMs) and coiled-coil regions [155]. FpClass currently uses globular domains which mediate about 20% of human PPIs, while SLiMs may mediate up to 40% of PPIs [125]. Improving FDR may require a more sophisticated use of structural information than currently implemented in FpClass. The current approach identifies pairs of structural features associated with interaction, and assigns the same interaction scores to all protein pairs with those features. However, only a subset of these protein pairs are likely to interact. Binding specificity can be determined by sequence differences within or outside the binding interface [182]. Incorporating sequence-based prediction methods [15, 71, 205] into FpClass may help improve the specificity of predictions.

The performance of mixed co-expression would benefit most from integrating a large number of diverse gene expression datasets. Surveying large numbers of gene expression datasets for co-expressed genes has been effective at identifying gene pairs sharing similar functions, participating in the same pathway or physically interacting [1]. Using more datasets also increases the chances of detecting context-specific interactions; for example, a large number of cancer gene expression datasets may help identify cancer-linked PPIs. A second approach for improving the performance of mixed co-expression is to customize the calculation of co-expression for each gene expression dataset. Currently, the most effective version of mixed co-expression uses an unweighted mean of local and global information. However, the effectiveness of each information type depends on a given dataset; local information works best in datasets where genes have high expression variance, while global information is best in datasets with very low variance. A weighted mean, customized to each dataset, would likely perform better than the current unweighted mean.

Achieving better prediction and characterization of essential genes may also require customizing predictive features to specific contexts such as tissues, subcellular localizations, or biological processes. For example, in yeast PPI networks, localized centrality was found to be a better predictor of essentiality than global centrality [133]. Other predictive features, such as the rate of non-synonymous substitutions or percent disorder, could also be considered in a specific context – essential genes may have maximal or minimal values of these features relative to neighbouring proteins in PPI networks, or proteins in the same subcellular localization or tissue.

5.5 Future applications of methods and results

5.5.1 Prediction of pathways and functional modules

A large number of methods have been developed to identify modules [161] and linear signalling cascades [157] in PPI networks. However, the limited size of human PPI networks means that the pathways and modules identified by these methods are incomplete. The 50% FDR network from FpClass contains more than twice as many interactions as previous predicted [145, 158] and experimentally determined [27] human networks, suggesting that it could provide a more complete set of pathways and modules. Mixed co-expression could further improve identification of pathways and protein complexes – several studies have shown that integrating PPI networks with gene co-expression can produce modules with greater functional coherence [185, 186]. Since mixed co-expression identifies interactions of diverse proteins, it may identify a more comprehensive set of functional modules.

5.5.2 Prediction of protein function

The function of a protein can be predicted as the predominant function of its PPI module or more generally, of its neighbours in a PPI network [161]. Since FpClass and mixed co-expression increase the number of proteins with interactions, they may provide more complete modules, and may improve the coverage and accuracy of function predictions.
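Neighbour-based function prediction can be sketched as a majority vote over the annotations of a protein's interaction partners. This is a generic illustration of the approach rather than a method from the thesis; the network, annotations, and alphabetical tie-breaking rule are hypothetical.

```python
from collections import Counter


def predict_function(network, annotations, protein):
    """Most common function among the annotated neighbours of protein, or None."""
    votes = Counter()
    for neighbour in network.get(protein, ()):
        votes.update(annotations.get(neighbour, ()))
    if not votes:
        return None
    top = max(votes.values())
    # Break ties alphabetically so the result is deterministic.
    return min(f for f, n in votes.items() if n == top)


# Hypothetical network and annotations; protein X has no known function.
network = {"X": {"A", "B", "C"}, "A": {"X"}, "B": {"X"}, "C": {"X"}}
annotations = {
    "A": {"kinase activity"},
    "B": {"kinase activity", "DNA binding"},
    "C": {"DNA repair"},
}
predicted = predict_function(network, annotations, "X")
print(predicted)
```

A protein with no interactions, or whose neighbours are all unannotated, receives no prediction, which is why added interactions can increase the coverage of function prediction.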

5.5.3 Prediction of disease genes

Genes involved in disease can be predicted based on their proximity to previously known disease genes in a PPI network [63, 104, 124]. By increasing the number of proteins and interactions in PPI networks, FpClass could provide a more complete set of disease-related genes. Mixed co-expression could be especially helpful for predicting disease genes since it identifies interactions of signalling proteins, which have been implicated in many diseases [123, 87]. The importance of mixed co-expression for phenotype prediction was shown with essential genes – mixed co-expression was more effective than other co-expression based networks.
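One simple notion of proximity is the shortest path length to the nearest known disease gene. The sketch below computes it with a breadth-first search started from all known disease genes at once; it illustrates the general idea and is not one of the cited methods. The network and gene names are hypothetical.

```python
from collections import deque


def distance_to_disease_genes(network, disease_genes):
    """Shortest path length from each reachable protein to its nearest disease gene."""
    distance = {g: 0 for g in disease_genes if g in network}
    queue = deque(distance)
    while queue:
        node = queue.popleft()
        for neighbour in network[node]:
            if neighbour not in distance:
                distance[neighbour] = distance[node] + 1
                queue.append(neighbour)
    return distance


# Hypothetical network: D1 is a known disease gene, Z has no interactions.
network = {
    "D1": {"X"},
    "X": {"D1", "Y"},
    "Y": {"X"},
    "Z": set(),
}
distances = distance_to_disease_genes(network, {"D1"})
print(distances)
```

Proteins without interactions, such as Z here, cannot be scored at all, so each added interaction can bring further candidates within reach of known disease genes.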

5.5.4 Identification of module regulators

Integration of PPIs with gene co-expression has been the most successful approach for identifying functional modules [186]. However, co-expression primarily identified stable interactions and was therefore most effective at characterizing stable modules. Proteins involved in module regulation and in communication between modules are more likely to have transient interactions than proteins internal to modules. Therefore, mixed co-expression may identify the mechanisms of module regulation and communication.

5.6 Conclusions

This thesis made several contributions towards mapping the human interactome and understanding its connection with essentiality. The FpClass algorithm provided a more comprehensive mapping of the interactome than previous prediction methods, producing a network with a lower false discovery rate, over 10 times more interactions, and over 2 times more proteins. This network provided interactions for 595 OMIM and Cancer Gene Census proteins that had no experimentally determined interactions. Also, FpClass was able to enhance HTP experimental PPI detection methods – identifying over 90% of interactions missed by these methods. Integrating FpClass with HTP experimental methods could enable efficient and comprehensive mapping of the interactome.

Mixed co-expression addressed two common limitations of PPI networks: low availability of transient interactions and lack of interaction context. The main idea of mixed co-expression was that correlations should be evaluated based on both local and global information. Using local information meant that a correlation between two genes was interpreted in the context of these genes. This was beneficial because genes encoding transiently interacting proteins have significantly different correlation distributions than genes encoding stably interacting proteins. The outliers in each of these distributions may correspond to PPIs. Compared to the usual approach for predicting PPIs from gene expression, mixed co-expression identified significantly more interactions, and especially interactions involving signal transduction and transferase activity proteins.

Our analysis of essential mouse genes showed that centrality in experimentally determined PPI networks is one of the best predictors of essentiality. However, we found that mouse PPI networks are significantly enriched for essential proteins, suggesting that more comprehensive networks are needed to fully understand the relationship between PPIs and essentiality. On the other hand, centrality in gene co-expression networks was also an important feature of essential genes, and most co-expression networks were not biased in favour of essential genes. Centrality in a modified mixed co-expression network was a better predictor of essentiality than centrality in a network based on the previous, global co-expression approach.

The bias we found in current mouse PPI networks is an example of the challenges involved in understanding phenotype through networks. It underscores the fact that better interactome coverage is required for a more complete and more accurate understanding of phenotype. The methods developed in this thesis help to improve interactome coverage and may thereby contribute to a better understanding of phenotype.

Bibliography

[1] P. Adler, R. Kolde, M. Kull, A. Tkachenko, H. Peterson, J. Reimand, and J. Vilo. Mining for coexpression across hundreds of datasets using novel rank aggregation and visualization methods. Genome Biol, 10(12):R139, 2009.

[2] P. Aloy, H. Ceulemans, A. Stark, and R. B. Russell. The relationship between sequence and interaction divergence in proteins. J Mol Biol, 332(5):989–98, 2003.

[3] P. Aloy and R. B. Russell. Interrogating protein interaction networks through structural biology. Proc Natl Acad Sci U S A, 99(9):5896–901, 2002.

[4] P. Aloy and R. B. Russell. Structural systems biology: modelling protein interactions. Nat Rev Mol Cell Biol, 7(3):188–97, 2006.

[5] A. Amsterdam and N. Hopkins. Mutagenesis strategies in zebrafish for identifying

genes involved in development and disease. Trends Genet, 22(9):473–8, 2006.

[6] B. Aranda, P. Achuthan, Y. Alam-Faruque, I. Armean, A. Bridge, C. Derow,

M. Feuermann, A. T. Ghanbarian, S. Kerrien, J. Khadake, J. Kerssemakers,

C. Leroy, M. Menden, M. Michaut, L. Montecchi-Palazzi, S. N. Neuhauser, S. Or-

chard, V. Perreau, B. Roechert, K. van Eijk, and H. Hermjakob. The IntAct molec-

ular interaction database in 2010. Nucleic Acids Res, 38(Database issue):D525–31,

2010.

[7] M. R. Arkin and J. A. Wells. Small-molecule inhibitors of protein-protein interac-

tions: progressing towards the dream. Nat Rev Drug Discov, 3(4):301–17, 2004.

[8] M. Ashburner, C. A. Ball, J. A. Blake, D. Botstein, H. Butler, J. M. Cherry, A. P.

Davis, K. Dolinski, S. S. Dwight, J. T. Eppig, M. A. Harris, D. P. Hill, L. Issel-

Tarver, A. Kasarskis, S. Lewis, J. C. Matese, J. E. Richardson, M. Ringwald, G. M.

Rubin, and G. Sherlock. Gene ontology: tool for the unification of biology. The

Gene Ontology Consortium. Nat Genet, 25(1):25–9, 2000.

[9] G. D. Bader and C. W. Hogue. An automated method for finding molecular complexes

in large protein interaction networks. BMC Bioinformatics, 4:2, 2003.

[10] J. S. Bader. Greedily building protein networks with confidence. Bioinformatics,

19(15):1869–74, 2003.

[11] J. S. Bader, A. Chaudhuri, J. M. Rothberg, and J. Chant. Gaining confidence in

high-throughput protein interaction networks. Nat Biotechnol, 22(1):78–85, 2004.

[12] T. Barrett, T. O. Suzek, D. B. Troup, S. E. Wilhite, W. C. Ngau, P. Ledoux,

D. Rudnev, A. E. Lash, W. Fujibuchi, and R. Edgar. NCBI GEO: mining mil-

lions of expression profiles–database and tools. Nucleic Acids Res, 33(Database

issue):D562–6, 2005.

[13] T. Barrett, D. B. Troup, S. E. Wilhite, P. Ledoux, D. Rudnev, C. Evangelista,

I. F. Kim, A. Soboleva, M. Tomashevsky, K. A. Marshall, K. H. Phillippy, P. M.

Sherman, R. N. Muertter, and R. Edgar. NCBI GEO: archive for high-throughput

functional genomic data. Nucleic Acids Res, 37(Database issue):D885–90, 2009.

[14] M. Barrios-Rodiles, K. R. Brown, B. Ozdamar, R. Bose, Z. Liu, R. S. Donovan,

F. Shinjo, Y. Liu, J. Dembowy, I. W. Taylor, V. Luga, N. Przulj, M. Robinson,

H. Suzuki, Y. Hayashizaki, I. Jurisica, and J. L. Wrana. High-throughput mapping

of a dynamic signaling network in mammalian cells. Science, 307(5715):1621–5,

2005.

[15] A. Ben-Hur and W. S. Noble. Kernel methods for predicting protein-protein inter-

actions. Bioinformatics, 21 Suppl 1:i38–46, 2005.

[16] Y. Benjamini and Y. Hochberg. Controlling the false discovery rate: a practical

and powerful approach to multiple testing. Journal of the Royal Statistical Society

Series B, 57:289–300, 1995.

[17] H. M. Berman, T. Battistuz, T. N. Bhat, W. F. Bluhm, P. E. Bourne, K. Burkhardt,

Z. Feng, G. L. Gilliland, L. Iype, S. Jain, P. Fagan, J. Marvin, D. Padilla,

V. Ravichandran, B. Schneider, N. Thanki, H. Weissig, J. D. Westbrook, and

C. Zardecki. The Protein Data Bank. Acta Crystallogr D Biol Crystallogr, 58(Pt

6 No 1):899–907, 2002.

[18] E. Birney, T. D. Andrews, P. Bevan, M. Caccamo, Y. Chen, L. Clarke, G. Coates,

J. Cuff, V. Curwen, T. Cutts, T. Down, E. Eyras, X. M. Fernandez-Suarez, P. Gane,

B. Gibbins, J. Gilbert, M. Hammond, H. R. Hotz, V. Iyer, K. Jekosch, A. Ka-

hari, A. Kasprzyk, D. Keefe, S. Keenan, H. Lehvaslaiho, G. McVicker, C. Mel-

sopp, P. Meidl, E. Mongin, R. Pettett, S. Potter, G. Proctor, M. Rae, S. Searle,

G. Slater, D. Smedley, J. Smith, W. Spooner, A. Stabenau, J. Stalker, R. Storey,

A. Ureta-Vidal, K. C. Woodwark, G. Cameron, R. Durbin, A. Cox, T. Hubbard,

and M. Clamp. An overview of Ensembl. Genome Res, 14(5):925–8, 2004.

[19] J. R. Bock and D. A. Gough. Predicting protein–protein interactions from primary

structure. Bioinformatics, 17(5):455–60, 2001.

[20] P. Bonacich. Power and Centrality: A Family of Measures. American Journal of

Sociology, 92:1170–1182, 1987.

[21] A. Bossi and B. Lehner. Tissue specificity and the human protein interaction

network. Mol Syst Biol, 5:260, 2009.

[22] T. Bouwmeester, A. Bauch, H. Ruffner, P. O. Angrand, G. Bergamini,

K. Croughton, C. Cruciat, D. Eberhard, J. Gagneur, S. Ghidelli, C. Hopf, B. Huhse,

R. Mangano, A. M. Michon, M. Schirle, J. Schlegl, M. Schwab, M. A. Stein,

A. Bauer, G. Casari, G. Drewes, A. C. Gavin, D. B. Jackson, G. Joberty,

G. Neubauer, J. Rick, B. Kuster, and G. Superti-Furga. A physical and func-

tional map of the human TNF-alpha/NF-kappa B signal transduction pathway.

Nat Cell Biol, 6(2):97–105, 2004.

[23] P. Braun, M. Tasan, M. Dreze, M. Barrios-Rodiles, I. Lemmens, H. Yu, J. M.

Sahalie, R. R. Murray, L. Roncari, A. S. de Smet, K. Venkatesan, J. F. Rual,

J. Vandenhaute, M. E. Cusick, T. Pawson, D. E. Hill, J. Tavernier, J. L. Wrana,

F. P. Roth, and M. Vidal. An experimentally derived confidence score for binary

protein-protein interactions. Nat Methods, 6(1):91–7, 2009.

[24] B. J. Breitkreutz, C. Stark, T. Reguly, L. Boucher, A. Breitkreutz, M. Livstone,

R. Oughtred, D. H. Lackner, J. Bahler, V. Wood, K. Dolinski, and M. Tyers. The

BioGRID Interaction Database: 2008 update. Nucleic Acids Res, 36(Database

issue):D637–40, 2008.

[25] B. J. Breitkreutz, C. Stark, and M. Tyers. The GRID: the General Repository for

Interaction Datasets. Genome Biol, 4(3):R23, 2003.

[26] S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine.

Computer Networks and ISDN Systems, 30(1-7):107–117, 1998.

[27] K. R. Brown and I. Jurisica. Unequal evolutionary conservation of human protein

interactions in interologous networks. Genome Biol, 8(5):R95, 2007.

[28] C. Brun, F. Chevenet, D. Martin, J. Wojcik, A. Guenoche, and B. Jacq. Functional

classification of proteins for the prediction of cellular function from a protein-protein

interaction network. Genome Biol, 5(1):R6, 2003.

[29] K. Bryson, L. J. McGuffin, R. L. Marsden, J. J. Ward, J. S. Sodhi, and D. T.

Jones. Protein structure prediction servers at University College London. Nucleic

Acids Res, 33(Web Server issue):W36–8, 2005.

[30] C. J. Bult, J. T. Eppig, J. A. Kadin, J. E. Richardson, and J. A. Blake. The Mouse

Genome Database (MGD): mouse biology and model systems. Nucleic Acids Res,

36(Database issue):D724–8, 2008.

[31] A. Ceol, A. Chatr Aryamontri, L. Licata, D. Peluso, L. Briganti, L. Perfetto,

L. Castagnoli, and G. Cesareni. MINT, the molecular interaction database: 2009

update. Nucleic Acids Res, 38(Database issue):D532–9, 2010.

[32] J. Y. Chen, E. Youn, and S. D. Mooney. Connecting protein interaction data,

mutations, and disease using bioinformatics. Methods Mol Biol, 541:449–61, 2009.

[33] Y. Chen and D. Xu. Understanding protein dispensability through machine-

learning analysis of high-throughput data. Bioinformatics, 21(5):575–81, 2005.

[34] J. K. Choi, S. C. Kim, J. Seo, S. Kim, and J. Bhak. Impact of transcriptional

properties on essentiality and evolutionary rate. Genetics, 175(1):199–206, 2007.

[35] H. N. Chua and L. Wong. Increasing the reliability of protein interactomes. Drug

Discov Today, 13(15-16):652–8, 2008.

[36] H. Y. Chuang, E. Lee, Y. T. Liu, D. Lee, and T. Ideker. Network-based classification

of breast cancer metastasis. Mol Syst Biol, 3:140, 2007.

[37] M. Clamp, B. Fry, M. Kamal, X. Xie, J. Cuff, M. F. Lin, M. Kellis, K. Lindblad-

Toh, and E. S. Lander. Distinguishing protein-coding and noncoding genes in the

human genome. Proc Natl Acad Sci U S A, 104(49):19428–33, 2007.

[38] I. Cohen-Gihon, R. Nussinov, and R. Sharan. Comprehensive analysis of co-occurring

domain sets in yeast proteins. BMC Genomics, 8:161, 2007.

[39] The UniProt Consortium. The Universal Protein Resource (UniProt) in 2010.

Nucleic Acids Res, 38(Database issue):D142–8, 2010.

[40] UniProt Consortium. The Universal Protein Resource (UniProt) 2009. Nucleic

Acids Res, 37(Database issue):D169–74, 2009.

[41] M. Costanzo, A. Baryshnikova, J. Bellay, Y. Kim, E. D. Spear, C. S. Sevier,

H. Ding, J. L. Koh, K. Toufighi, S. Mostafavi, J. Prinz, R. P. St Onge, B. Van-

derSluis, T. Makhnevych, F. J. Vizeacoumar, S. Alizadeh, S. Bahr, R. L. Brost,

Y. Chen, M. Cokol, R. Deshpande, Z. Li, Z. Y. Lin, W. Liang, M. Marback, J. Paw,

B. J. San Luis, E. Shuteriqi, A. H. Tong, N. van Dyk, I. M. Wallace, J. A. Whit-

ney, M. T. Weirauch, G. Zhong, H. Zhu, W. A. Houry, M. Brudno, S. Ragibizadeh,

B. Papp, C. Pal, F. P. Roth, G. Giaever, C. Nislow, O. G. Troyanskaya, H. Bussey,

G. D. Bader, A. C. Gingras, Q. D. Morris, P. M. Kim, C. A. Kaiser, C. L. My-

ers, B. J. Andrews, and C. Boone. The genetic landscape of a cell. Science,

327(5964):425–31, 2010.

[42] B. Cox, M. Kotlyar, A. I. Evangelou, V. Ignatchenko, A. Ignatchenko, K. Whiteley,

I. Jurisica, S. L. Adamson, J. Rossant, and T. Kislinger. Comparative systems

biology of human and mouse as a tool to guide the modeling of human placental

pathology. Mol Syst Biol, 5:279, 2009.

[43] M. E. Cusick, N. Klitgord, M. Vidal, and D. E. Hill. Interactome: gateway into

systems biology. Hum Mol Genet, 14 Spec No. 2:R171–81, 2005.

[44] M. E. Cusick, H. Yu, A. Smolyar, K. Venkatesan, A. R. Carvunis, N. Simonis, J. F.

Rual, H. Borick, P. Braun, M. Dreze, J. Vandenhaute, M. Galli, J. Yazaki, D. E.

Hill, J. R. Ecker, F. P. Roth, and M. Vidal. Literature-curated protein interaction

datasets. Nat Methods, 6(1):39–46, 2009.

[45] T. Dandekar, B. Snel, M. Huynen, and P. Bork. Conservation of gene order: a

fingerprint of proteins that physically interact. Trends Biochem Sci, 23(9):324–8,

1998.

[46] C. L. de Hoog, L. J. Foster, and M. Mann. RNA and RNA binding proteins

participate in early stages of cell spreading through spreading initiation centers.

Cell, 117(5):649–62, 2004.

[47] U. de Lichtenberg, L. J. Jensen, S. Brunak, and P. Bork. Dynamic complex forma-

tion during the yeast cell cycle. Science, 307(5710):724–7, 2005.

[48] C. M. Deane, L. Salwinski, I. Xenarios, and D. Eisenberg. Protein interactions:

two methods for assessment of the reliability of high throughput observations. Mol

Cell Proteomics, 1(5):349–56, 2002.

[49] M. Deng, S. Mehta, F. Sun, and T. Chen. Inferring domain-domain interactions

from protein-protein interactions. Genome Res, 12(10):1540–8, 2002.

[50] P. D’Haeseleer and G. M. Church. Estimating and improving protein interaction

error rates. Proc IEEE Comput Syst Bioinform Conf, pages 216–23, 2004.

[51] Z. Ding, J. Liang, Y. Lu, Q. Yu, Z. Songyang, S. Y. Lin, and G. B. Mills. A

retrovirus-based protein complementation assay screen reveals functional AKT1-

binding partners. Proc Natl Acad Sci U S A, 103(41):15014–9, 2006.

[52] A. K. Dunker, Z. Obradovic, P. Romero, E. C. Garner, and C. J. Brown. Intrinsic

protein disorder in complete genomes. Genome Inform Ser Workshop Genome

Inform, 11:161–71, 2000.

[53] A. Dziembowski and B. Seraphin. Recent developments in the analysis of protein

complexes. FEBS Lett, 556(1-3):1–6, 2004.

[54] D. Ekman, S. Light, A. K. Bjorklund, and A. Elofsson. What properties character-

ize the hub proteins of the protein-protein interaction network of Saccharomyces

cerevisiae? Genome Biol, 7(6):R45, 2006.

[55] A. J. Enright, I. Iliopoulos, N. C. Kyrpides, and C. A. Ouzounis. Protein interaction

maps for complete genomes based on gene fusion events. Nature, 402(6757):86–90,

1999.

[56] R. M. Ewing, P. Chu, F. Elisma, H. Li, P. Taylor, S. Climie, L. McBroom-

Cerajewski, M. D. Robinson, L. O’Connor, M. Li, R. Taylor, M. Dharsee, Y. Ho,

A. Heilbut, L. Moore, S. Zhang, O. Ornatsky, Y. V. Bukhman, M. Ethier, Y. Sheng,

J. Vasilescu, M. Abu-Farha, J. P. Lambert, H. S. Duewel, I. I. Stewart, B. Kuehl,

K. Hogue, K. Colwill, K. Gladwish, B. Muskat, R. Kinach, S. L. Adams, M. F.

Moran, G. B. Morin, T. Topaloglou, and D. Figeys. Large-scale mapping of human

protein-protein interactions by mass spectrometry. Mol Syst Biol, 3:89, 2007.

[57] K. Fortney, M. Kotlyar, and I. Jurisica. Inferring the functions of longevity genes

with modular subnetwork biomarkers of Caenorhabditis elegans aging. Genome

Biol, 11(2):R13, 2010.

[58] L. C. Freeman. Centrality in Social Networks I: Conceptual Clarification. Social

Networks, 1:215–239, 1979.

[59] P. A. Futreal, L. Coin, M. Marshall, T. Down, T. Hubbard, R. Wooster, N. Rahman,

and M. R. Stratton. A census of human cancer genes. Nat Rev Cancer, 4(3):177–83,

2004.

[60] L. Gautier, L. Cope, B. M. Bolstad, and R. A. Irizarry. affy–analysis of Affymetrix

GeneChip data at the probe level. Bioinformatics, 20(3):307–15, 2004.

[61] A. C. Gavin, M. Bosche, R. Krause, P. Grandi, M. Marzioch, A. Bauer, J. Schultz,

J. M. Rick, A. M. Michon, C. M. Cruciat, M. Remor, C. Hofert, M. Schelder,

M. Brajenovic, H. Ruffner, A. Merino, K. Klein, M. Hudak, D. Dickson, T. Rudi,

V. Gnau, A. Bauch, S. Bastuck, B. Huhse, C. Leutwein, M. A. Heurtier, R. R. Cop-

ley, A. Edelmann, E. Querfurth, V. Rybin, G. Drewes, M. Raida, T. Bouwmeester,

P. Bork, B. Seraphin, B. Kuster, G. Neubauer, and G. Superti-Furga. Functional

organization of the yeast proteome by systematic analysis of protein complexes.

Nature, 415(6868):141–7, 2002.

[62] H. Ge, Z. Liu, G. M. Church, and M. Vidal. Correlation between transcriptome and

interactome mapping data from Saccharomyces cerevisiae. Nat Genet, 29(4):482–6,

2001.

[63] R. A. George, J. Y. Liu, L. L. Feng, R. J. Bryson-Richardson, D. Fatkin, and M. A.

Wouters. Analysis of protein sequence and interaction data for candidate disease

gene prediction. Nucleic Acids Res, 34(19):e130, 2006.

[64] G. V. Glass and K. D. Hopkins. Statistical methods in education and

psychology. Allyn and Bacon, Boston, 3rd edition, 1996.

[65] B. Goethals and M. J. Zaki. Advances in frequent itemset mining implementations:

report on FIMI’03. SIGKDD Explor. Newsl., 6(1):109–117, 2004.

[66] K. I. Goh, M. E. Cusick, D. Valle, B. Childs, M. Vidal, and A. L. Barabasi. The

human disease network. Proc Natl Acad Sci U S A, 104(21):8685–90, 2007.

[67] D. S. Goldberg and F. P. Roth. Assessing experimentally derived interactions in a

small world. Proc Natl Acad Sci U S A, 100(8):4372–6, 2003.

[68] D. S. Goodsell. The molecular perspective: ultraviolet light and pyrimidine dimers.

Oncologist, 6(3):298–9, 2001.

[69] D. Greenbaum, C. Colangelo, K. Williams, and M. Gerstein. Comparing protein

abundance and mRNA expression levels on a genomic scale. Genome Biol, 4(9):117,

2003.

[70] R. Guimera, M. Sales-Pardo, and L. A. Amaral. A network-based method for target

selection in metabolic networks. Bioinformatics, 23(13):1616–22, 2007.

[71] Y. Guo, L. Yu, Z. Wen, and M. Li. Using support vector machine combined

with auto covariance to predict protein-protein interactions from protein sequences.

Nucleic Acids Res, 36(9):3025–30, 2008.

[72] A. M. Gustafson, E. S. Snitkin, S. C. Parker, C. DeLisi, and S. Kasif. Towards the

identification of essential genes using targeted genome sequencing and comparative

analysis. BMC Genomics, 7:265, 2006.

[73] S. P. Gygi, Y. Rochon, B. R. Franza, and R. Aebersold. Correlation between

protein and mRNA abundance in yeast. Mol Cell Biol, 19(3):1720–30, 1999.

[74] M. W. Hahn and A. D. Kern. Comparative genomics of centrality and essentiality

in three eukaryotic protein-interaction networks. Mol Biol Evol, 22(4):803–6, 2005.

[75] M. Hall, E. Frank, G. Holmes, B. Pfahringer, P. Reutemann, and I. H. Witten.

The WEKA data mining software: an update. SIGKDD Explorations, 11, 2009.

[76] J. Han, H. Cheng, D. Xin, and X. Yan. Frequent pattern mining: current status

and future directions. Data Mining and Knowledge Discovery, 15:55–86, 2007.

[77] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation.

SIGMOD Rec., 29:1–12, May 2000.

[78] J. D. Han, N. Bertin, T. Hao, D. S. Goldberg, G. F. Berriz, L. V. Zhang, D. Dupuy,

A. J. Walhout, M. E. Cusick, F. P. Roth, and M. Vidal. Evidence for dynamically

organized modularity in the yeast protein-protein interaction network. Nature,

430(6995):88–93, 2004.

[79] G. T. Hart, A. K. Ramani, and E. M. Marcotte. How complete are current yeast

and human protein-interaction networks? Genome Biol, 7(11):120, 2006.

[80] X. He and J. Zhang. Why do hubs tend to be essential in protein networks? PLoS

Genet, 2(6):e88, 2006.

[81] Y. Ho, A. Gruhler, A. Heilbut, G. D. Bader, L. Moore, S. L. Adams, A. Millar,

P. Taylor, K. Bennett, K. Boutilier, L. Yang, C. Wolting, I. Donaldson, S. Schan-

dorff, J. Shewnarane, M. Vo, J. Taggart, M. Goudreault, B. Muskat, C. Alfarano,

D. Dewar, Z. Lin, K. Michalickova, A. R. Willems, H. Sassi, P. A. Nielsen, K. J. Ras-

mussen, J. R. Andersen, L. E. Johansen, L. H. Hansen, H. Jespersen, A. Podtele-

jnikov, E. Nielsen, J. Crawford, V. Poulsen, B. D. Sorensen, J. Matthiesen, R. C.

Hendrickson, F. Gleeson, T. Pawson, M. F. Moran, D. Durocher, M. Mann, C. W.

Hogue, D. Figeys, and M. Tyers. Systematic identification of protein complexes in

Saccharomyces cerevisiae by mass spectrometry. Nature, 415(6868):180–3, 2002.

[82] P. V. Hornbeck, I. Chabra, J. M. Kornhauser, E. Skrzypek, and B. Zhang. Phospho-

site: A bioinformatics resource dedicated to physiological protein phosphorylation.

Proteomics, 4(6):1551–61, 2004.

[83] T. W. Huang, C. Y. Lin, and C. Y. Kao. Reconstruction of human protein interolog

network using evolutionary conserved network. BMC Bioinformatics, 8:152, 2007.

[84] M. Hue, M. Riffle, J. P. Vert, and W. S. Noble. Large-scale prediction of protein-

protein interactions from structures. BMC Bioinformatics, 11:144, 2010.

[85] S. Hunter, R. Apweiler, T. K. Attwood, A. Bairoch, A. Bateman, D. Binns, P. Bork,

U. Das, L. Daugherty, L. Duquenne, R. D. Finn, J. Gough, D. Haft, N. Hulo,

D. Kahn, E. Kelly, A. Laugraud, I. Letunic, D. Lonsdale, R. Lopez, M. Madera,

J. Maslen, C. McAnulla, J. McDowall, J. Mistry, A. Mitchell, N. Mulder, D. Natale,

C. Orengo, A. F. Quinn, J. D. Selengut, C. J. Sigrist, M. Thimma, P. D. Thomas,

F. Valentin, D. Wilson, C. H. Wu, and C. Yeats. InterPro: the integrative protein

signature database. Nucleic Acids Res, 37(Database issue):D211–5, 2009.

[86] L. D. Hurst and N. G. Smith. Do essential genes evolve slowly? Curr Biol,

9(14):747–50, 1999.

[87] S. Huveneers, H. Truong, and H. J. Danen. Integrins: signaling, disease, and

therapy. Int J Radiat Biol, 83(11-12):743–51, 2007.

[88] T. Ito, T. Chiba, R. Ozawa, M. Yoshida, M. Hattori, and Y. Sakaki. A comprehen-

sive two-hybrid analysis to explore the yeast protein interactome. Proc Natl Acad

Sci U S A, 98(8):4569–74, 2001.

[89] T. Jacks. Tumor suppressor gene mutations in mice. Annu Rev Genet, 30:603–36,

1996.

[90] J. Janin and B. Seraphin. Genome-wide studies of protein-protein interaction. Curr

Opin Struct Biol, 13(3):383–8, 2003.

[91] R. Jansen, D. Greenbaum, and M. Gerstein. Relating whole-genome expression

data with protein-protein interactions. Genome Res, 12(1):37–46, 2002.

[92] R. Jansen, H. Yu, D. Greenbaum, Y. Kluger, N. J. Krogan, S. Chung, A. Emili,

M. Snyder, J. F. Greenblatt, and M. Gerstein. A Bayesian networks approach for

predicting protein-protein interactions from genomic data. Science, 302(5644):449–

53, 2003.

[93] H. Jeong, S. P. Mason, A. L. Barabasi, and Z. N. Oltvai. Lethality and centrality

in protein networks. Nature, 411(6833):41–2, 2001.

[94] H. Jeong, Z. N. Oltvai, and A. L. Barabasi. Prediction of Protein

Essentiality Based on Genomic Data. ComPlexUs, 1(1), 2003.

[95] D. T. Jones. Protein secondary structure prediction based on position-specific

scoring matrices. J Mol Biol, 292(2):195–202, 1999.

[96] R. B. Jones, A. Gordus, J. A. Krall, and G. MacBeath. A quantitative protein

interaction network for the ErbB receptors using protein microarrays. Nature,

439(7073):168–74, 2006.

[97] S. Jones and J. M. Thornton. Principles of protein-protein interactions. Proc Natl

Acad Sci U S A, 93(1):13–20, 1996.

[98] C. Jorgensen, A. Sherman, G. I. Chen, A. Pasculescu, A. Poliakov, M. Hsiung,

B. Larsen, D. G. Wilkinson, R. Linding, and T. Pawson. Cell-specific informa-

tion processing in segregating populations of Eph receptor ephrin-expressing cells.

Science, 326(5959):1502–9, 2009.

[99] M. G. Kann. Protein interactions and disease: computational approaches to un-

cover the etiology of diseases. Brief Bioinform, 8(5):333–46, 2007.

[100] R. Kelley and T. Ideker. Systematic interpretation of genetic interactions using

protein networks. Nat Biotechnol, 23(5):561–6, 2005.

[101] T. S. Keshava Prasad, R. Goel, K. Kandasamy, S. Keerthikumar, S. Kumar,

S. Mathivanan, D. Telikicherla, R. Raju, B. Shafreen, A. Venugopal, L. Balakrishnan,

A. Marimuthu, S. Banerjee, D. S. Somanathan, A. Sebastian, S. Rani, S. Ray,

C. J. Harrys Kishore, S. Kanth, M. Ahmed, M. K. Kashyap, R. Mohmood, Y. L.

Ramachandra, V. Krishna, B. A. Rahiman, S. Mohan, P. Ranganathan, S. Ram-

abadran, R. Chaerkady, and A. Pandey. Human Protein Reference Database–2009

update. Nucleic Acids Res, 37(Database issue):D767–72, 2009.

[102] I. Kim, Y. Liu, and H. Zhao. Bayesian methods for predicting interacting protein

pairs using domain information. Biometrics, 63(3):824–33, 2007.

[103] A. D. King, N. Przulj, and I. Jurisica. Protein complex prediction via cost-based

clustering. Bioinformatics, 20(17):3013–20, 2004.

[104] S. Kohler, S. Bauer, D. Horn, and P. N. Robinson. Walking the interactome for

prioritization of candidate disease genes. Am J Hum Genet, 82(4):949–58, 2008.

[105] F. A. Kondrashov, A. Y. Ogurtsov, and A. S. Kondrashov. Bioinformatical assay

of human gene morbidity. Nucleic Acids Res, 32(5):1731–7, 2004.

[106] L. Li, K. Zhang, J. Lee, S. Cordes, D. P. Davis, and Z. Tang. Discovering cancer

genes by integrating network and functional properties. BMC Med Genomics, 2:61,

2009.

[107] B. Y. Liao, N. M. Scott, and J. Zhang. Impacts of gene essentiality, expression

pattern, and gene compactness on the evolutionary rate of mammalian proteins.

Mol Biol Evol, 23(11):2072–80, 2006.

[108] B. Y. Liao and J. Zhang. Null mutations in human and mouse orthologs frequently

result in different phenotypes. Proc Natl Acad Sci U S A, 105(19):6987–92, 2008.

[109] J. Lim, T. Hao, C. Shaw, A. J. Patel, G. Szabo, J. F. Rual, C. J. Fisk, N. Li,

A. Smolyar, D. E. Hill, A. L. Barabasi, M. Vidal, and H. Y. Zoghbi. A protein-

protein interaction network for human inherited ataxias and disorders of Purkinje

cell degeneration. Cell, 125(4):801–14, 2006.

[110] W. K. Lim, K. Wang, C. Lefebvre, and A. Califano. Comparative analysis of

microarray normalization procedures: effects on reverse engineering gene networks.

Bioinformatics, 23(13):i282–8, 2007.

[111] N. Lin, B. Wu, R. Jansen, M. Gerstein, and H. Zhao. Information assessment on

predicting protein-protein interactions. BMC Bioinformatics, 5:154, 2004.

[112] S. C. Lin, Y. C. Lo, and H. Wu. Helical assembly in the MyD88-IRAK4-IRAK2

complex in TLR/IL-1R signalling. Nature, 465(7300):885–90, 2010.

[113] R. Linding, L. J. Jensen, F. Diella, P. Bork, T. J. Gibson, and R. B. Russell. Protein

disorder prediction: implications for structural proteomics. Structure, 11(11):1453–

9, 2003.

[114] R. Linding, L. J. Jensen, G. J. Ostheimer, M. A. van Vugt, C. Jorgensen, I. M. Miron,

F. Diella, K. Colwill, L. Taylor, K. Elder, P. Metalnikov, V. Nguyen, A. Pasculescu,

J. Jin, J. G. Park, L. D. Samson, J. R. Woodgett, R. B. Russell, P. Bork, M. B.

Yaffe, and T. Pawson. Systematic discovery of in vivo phosphorylation networks.

Cell, 129(7):1415–26, 2007.

[115] P. W. Lord, R. D. Stevens, A. Brass, and C. A. Goble. Investigating semantic

similarity measures across the Gene Ontology: the relationship between sequence

and annotation. Bioinformatics, 19(10):1275–83, 2003.

[116] L. Lu, A. K. Arakaki, H. Lu, and J. Skolnick. Multimeric threading-based prediction

of protein-protein interactions on a genomic scale: application to the Saccharomyces

cerevisiae proteome. Genome Res, 13(6A):1146–54, 2003.

[117] L. J. Lu, Y. Xia, A. Paccanaro, H. Yu, and M. Gerstein. Assessing the limits of

genomic data integration for predicting protein networks. Genome Res, 15(7):945–53,

2005.

[118] X. Ma, H. Lee, L. Wang, and F. Sun. CGI: a new approach for prioritizing genes by

combining gene expression and protein-protein interaction data. Bioinformatics,

23(2):215–21, 2007.

[119] D. Maglott, J. Ostell, K. D. Pruitt, and T. Tatusova. Entrez Gene: gene-centered

information at NCBI. Nucleic Acids Res, 35(Database issue):D26–31, 2007.

[120] S. Martin, D. Roe, and J. L. Faulon. Predicting protein-protein interactions using

signature products. Bioinformatics, 21(2):218–26, 2005.

[121] L. Matthews, G. Gopinath, M. Gillespie, M. Caudy, D. Croft, B. de Bono, P. Garap-

ati, J. Hemish, H. Hermjakob, B. Jassal, A. Kanapin, S. Lewis, S. Mahajan, B. May,

E. Schmidt, I. Vastrik, G. Wu, E. Birney, L. Stein, and P. D’Eustachio. Reactome

knowledgebase of human biological pathways and processes. Nucleic Acids Res,

37(Database issue):D619–22, 2009.

[122] V. A. McKusick. Mendelian Inheritance in Man and its online version, OMIM. Am

J Hum Genet, 80(4):588–604, 2007.

[123] R. T. Moon, A. D. Kohn, G. V. De Ferrari, and A. Kaykas. WNT and beta-catenin

signalling: diseases and therapies. Nat Rev Genet, 5(9):691–701, 2004.

[124] S. Navlakha and C. Kingsford. The power of protein interaction networks for

associating genes with diseases. Bioinformatics, 26(8):1057–63, 2010.

[125] V. Neduva and R. B. Russell. Linear motifs: evolutionary interaction switches.

FEBS Lett, 579(15):3342–5, 2005.

[126] R. K. Nibbe, M. Koyuturk, and M. R. Chance. An integrative -omics approach to

identify functional sub-networks in human colorectal cancer. PLoS Comput Biol,

6(1):e1000639, 2010.

[127] Y. Niu, D. Otasek, and I. Jurisica. Evaluation of linguistic features useful in

extraction of interactions from PubMed; application to annotating known, high-

throughput and predicted interactions in I2D. Bioinformatics, 26(1):111–9, 2010.

[128] I. M. Nooren and J. M. Thornton. Diversity of protein-protein interactions. EMBO

J, 22(14):3486–92, 2003.

[129] M. Oti, B. Snel, M. A. Huynen, and H. G. Brunner. Predicting disease genes using

protein-protein interactions. J Med Genet, 43(8):691–8, 2006.

[130] R. Overbeek, M. Fonstein, M. D’Souza, G. D. Pusch, and N. Maltsev. Use of conti-

guity on the chromosome to predict functional coupling. In Silico Biol, 1(2):93–108,

1999.

[131] C. Pal, B. Papp, and L. D. Hurst. Genomic function: Rate of evolution and gene

dispensability. Nature, 421(6922):496–7; discussion 497–8, 2003.

[132] C. Pal, B. Papp, and M. J. Lercher. An integrated view of protein evolution. Nat

Rev Genet, 7(5):337–48, 2006.

[133] K. Park and D. Kim. Localized network centrality and essentiality in the yeast-

protein interaction network. Proteomics, 9(22):5143–54, 2009.

[134] L. E. Pascal, L. D. True, D. S. Campbell, E. W. Deutsch, M. Risk, I. M. Coleman,

L. J. Eichner, P. S. Nelson, and A. Y. Liu. Correlation of mRNA and protein levels:

cell type-specific gene expression of cluster designation antigens in the prostate.

BMC Genomics, 9:246, 2008.

[135] A. Patil and H. Nakamura. Disordered domains and high surface charge confer

hubs with the ability to interact with multiple proteins in interaction networks.

FEBS Lett, 580(8):2041–5, 2006.

[136] T. Pawson, G. D. Gish, and P. Nash. SH2 domains, interaction modules and cellular

wiring. Trends Cell Biol, 11(12):504–11, 2001.

[137] T. Pawson and P. Nash. Assembly of cell regulatory systems through protein

interaction domains. Science, 300(5618):445–52, 2003.

[138] M. Pellegrini, E. M. Marcotte, M. J. Thompson, D. Eisenberg, and T. O. Yeates.

Assigning protein functions by comparative genome analysis: protein phylogenetic

profiles. Proc Natl Acad Sci U S A, 96(8):4285–8, 1999.

[139] E. M. Phizicky and S. Fields. Protein-protein interactions: methods for detection

and analysis. Microbiol Rev, 59(1):94–123, 1995.

[140] S. Pitre, C. North, M. Alamgir, M. Jessulat, A. Chan, X. Luo, J. R. Green, M. Du-

montier, F. Dehne, and A. Golshani. Global investigation of protein-protein in-

teractions in yeast Saccharomyces cerevisiae using re-occurring short polypeptide

sequences. Nucleic Acids Res, 36(13):4286–94, 2008.

[141] K. D. Pruitt, J. Harrow, R. A. Harte, C. Wallin, M. Diekhans, D. R. Maglott,

S. Searle, C. M. Farrell, J. E. Loveland, B. J. Ruef, E. Hart, M. M. Suner, M. J.

Landrum, B. Aken, S. Ayling, R. Baertsch, J. Fernandez-Banet, J. L. Cherry,

V. Curwen, M. Dicuccio, M. Kellis, J. Lee, M. F. Lin, M. Schuster, A. Shkeda,

C. Amid, G. Brown, O. Dukhanina, A. Frankish, J. Hart, B. L. Maidak, J. Mudge,

M. R. Murphy, T. Murphy, J. Rajan, B. Rajput, L. D. Riddick, C. Snow, C. Stew-

ard, D. Webb, J. A. Weber, L. Wilming, W. Wu, E. Birney, D. Haussler, T. Hub-

bard, J. Ostell, R. Durbin, and D. Lipman. The consensus coding sequence (CCDS)

project: Identifying a common protein-coding gene set for the human and mouse

genomes. Genome Res, 19(7):1316–23, 2009.

[142] N. Przulj, D. A. Wigle, and I. Jurisica. Functional topology in a network of protein

interactions. Bioinformatics, 20(3):340–8, 2004.

[143] Y. Qi, Z. Bar-Joseph, and J. Klein-Seetharaman. Evaluation of different biolog-

ical data and computational classification methods for use in protein interaction

prediction. Proteins, 63(3):490–500, 2006.

[144] A. K. Ramani, Z. Li, G. T. Hart, M. W. Carlson, D. R. Boutz, and E. M. Marcotte.

A map of human protein interactions derived from co-expression of human mRNAs

and their orthologs. Mol Syst Biol, 4:180, 2008.

[145] D. R. Rhodes, S. A. Tomlins, S. Varambally, V. Mahavisno, T. Barrette,

S. Kalyana-Sundaram, D. Ghosh, A. Pandey, and A. M. Chinnaiyan. Probabilistic

model of the human protein-protein interaction network. Nat Biotechnol, 23(8):951–

9, 2005.

[146] P. Rice, I. Longden, and A. Bleasby. EMBOSS: the European Molecular Biology

Open Software Suite. Trends Genet, 16(6):276–7, 2000.

[147] J. F. Rual, K. Venkatesan, T. Hao, T. Hirozane-Kishikawa, A. Dricot, N. Li, G. F.

Berriz, F. D. Gibbons, M. Dreze, N. Ayivi-Guedehoussou, N. Klitgord, C. Simon,

M. Boxem, S. Milstein, J. Rosenberg, D. S. Goldberg, L. V. Zhang, S. L. Wong,

G. Franklin, S. Li, J. S. Albala, J. Lim, C. Fraughton, E. Llamosas, S. Cevik,

C. Bex, P. Lamesch, R. S. Sikorski, J. Vandenhaute, H. Y. Zoghbi, A. Smolyar,

S. Bosak, R. Sequerra, L. Doucette-Stamm, M. E. Cusick, D. E. Hill, F. P. Roth,

and M. Vidal. Towards a proteome-scale map of the human protein-protein inter-

action network. Nature, 437(7062):1173–8, 2005.

[148] J. Rudolph. Inhibiting transient protein-protein interactions: lessons from the

Cdc25 protein tyrosine phosphatases. Nat Rev Cancer, 7(3):202–11, 2007.

[149] A. Ruepp, B. Waegele, M. Lechner, B. Brauner, I. Dunger-Kaltenbach, G. Fobo,

G. Frishman, C. Montrone, and H. W. Mewes. CORUM: the comprehensive resource

of mammalian protein complexes–2009. Nucleic Acids Res, 38(Database

issue):D497–501, 2010.

[150] H. Ruffner, A. Bauer, and T. Bouwmeester. Human protein-protein interaction

networks and the value for drug discovery. Drug Discov Today, 12(17-18):709–16,

2007.

[151] R. Saito, H. Suzuki, and Y. Hayashizaki. Interaction generality, a measurement to

assess the reliability of a protein-protein interaction. Nucleic Acids Res, 30(5):1163–

8, 2002.

[152] R. Saito, H. Suzuki, and Y. Hayashizaki. Construction of reliable protein-protein

interaction networks with a new interaction generality measure. Bioinformatics,

19(6):756–63, 2003.

[153] O. Sanchez-Graillet and M. Poesio. Negation of protein-protein interactions: anal-

ysis and extraction. Bioinformatics, 23(13):i424–32, 2007.

[154] S. Sato, C. Tomomori-Sato, T. J. Parmely, L. Florens, B. Zybailov, S. K. Swanson,

C. A. Banks, J. Jin, Y. Cai, M. P. Washburn, J. W. Conaway, and R. C. Conaway.

A set of consensus mammalian mediator subunits identified by multidimensional

protein identification technology. Mol Cell, 14(5):685–91, 2004.

[155] S. E. Schelhorn, T. Lengauer, and M. Albrecht. An integrative approach for pre-

dicting interactions of protein regions. Bioinformatics, 24(16):i35–41, 2008.

[156] A. S. Schwartz, J. Yu, K. R. Gardenour, R. L. Finley Jr., and T. Ideker.

Cost-effective strategies for completing the interactome. Nat Methods, 6(1):55–61, 2009.

[157] J. Scott, T. Ideker, R. M. Karp, and R. Sharan. Efficient algorithms for detecting

signaling pathways in protein interaction networks. J Comput Biol, 13(2):133–44,

2006.

[158] M. S. Scott and G. J. Barton. Probabilistic prediction and ranking of human

protein-protein interactions. BMC Bioinformatics, 8:239, 2007.

[159] E. Segal, M. Shapira, A. Regev, D. Pe’er, D. Botstein, D. Koller, and N. Fried-

man. Module networks: identifying regulatory modules and their condition-specific

regulators from gene expression data. Nat Genet, 34(2):166–76, 2003.

[160] M. Seringhaus, A. Paccanaro, A. Borneman, M. Snyder, and M. Gerstein. Predict-

ing essential genes in fungal genomes. Genome Res, 16(9):1126–35, 2006.

[161] R. Sharan, I. Ulitsky, and R. Shamir. Network-based prediction of protein function.

Mol Syst Biol, 3:88, 2007.

[162] J. Shen, J. Zhang, X. Luo, W. Zhu, K. Yu, K. Chen, Y. Li, and H. Jiang. Predicting

protein-protein interactions based only on sequences information. Proc Natl Acad

Sci U S A, 104(11):4337–41, 2007.

[163] B. A. Shoemaker and A. R. Panchenko. Deciphering protein-protein interactions.

Part I. Experimental techniques and databases. PLoS Comput Biol, 3(3):e42, 2007.

[164] B. A. Shoemaker and A. R. Panchenko. Deciphering protein-protein interactions.

Part II. Computational methods to predict protein and domain interaction part-

ners. PLoS Comput Biol, 3(4):e43, 2007.

[165] P. Smialowski, P. Pagel, P. Wong, B. Brauner, I. Dunger, G. Fobo, G. Frishman,

C. Montrone, T. Rattei, D. Frishman, and A. Ruepp. The Negatome database:

a reference set of non-interacting protein pairs. Nucleic Acids Res, 38(Database

issue):D540–4, 2010.

[166] C. L. Smith, C. A. Goldsmith, and J. T. Eppig. The mammalian phenotype on-

tology as a tool for annotating, analyzing and comparing phenotypic information.

Genome Biol, 6(1):R7, 2005.

[167] G. R. Smith and M. J. Sternberg. Prediction of protein-protein interactions by

docking methods. Curr Opin Struct Biol, 12(1):28–35, 2002.

[168] L. Smith, L. K. Tanabe, R. J. Ando, C. J. Kuo, I. F. Chung, C. N. Hsu, Y. S.

Lin, R. Klinger, C. M. Friedrich, K. Ganchev, M. Torii, H. Liu, B. Haddow, C. A.

Struble, R. J. Povinelli, A. Vlachos, W. A. Baumgartner Jr., L. Hunter, B. Carpenter,

R. T. Tsai, H. J. Dai, F. Liu, Y. Chen, C. Sun, S. Katrenko, P. Adriaans,

C. Blaschke, R. Torres, M. Neves, P. Nakov, A. Divoli, M. Mana-Lopez, J. Mata,

and W. J. Wilbur. Overview of BioCreative II gene mention recognition. Genome

Biol, 9 Suppl 2:S2, 2008.

[169] N. G. Smith and A. Eyre-Walker. Human disease genes: patterns and predictions.

Gene, 318:169–75, 2003.

[170] E. L. Sonnhammer and E. V. Koonin. Orthology, paralogy and proposed classifi-

cation for paralog subtypes. Trends Genet, 18(12):619–20, 2002.

[171] V. Spirin and L. A. Mirny. Protein complexes and functional modules in molecular

networks. Proc Natl Acad Sci U S A, 100(21):12123–8, 2003.

[172] E. Sprinzak and H. Margalit. Correlated sequence-signatures as markers of protein-

protein interaction. J Mol Biol, 311(4):681–92, 2001.

[173] C. Stark, B. J. Breitkreutz, T. Reguly, L. Boucher, A. Breitkreutz, and M. Ty-

ers. BioGRID: a general repository for interaction datasets. Nucleic Acids Res,

34(Database issue):D535–9, 2006.

[174] U. Stelzl, U. Worm, M. Lalowski, C. Haenig, F. H. Brembeck, H. Goehler,

M. Stroedicke, M. Zenkner, A. Schoenherr, S. Koeppen, J. Timm, S. Mintzlaff,

C. Abraham, N. Bock, S. Kietzmann, A. Goedde, E. Toksoz, A. Droege, S. Krobitsch,

B. Korn, W. Birchmeier, H. Lehrach, and E. E. Wanker. A human

protein-protein interaction network: a resource for annotating the proteome. Cell,

122(6):957–68, 2005.

[175] M. P. Stumpf, T. Thorne, E. de Silva, R. Stewart, H. J. An, M. Lappe, and

C. Wiuf. Estimating the size of the human interactome. Proc Natl Acad Sci U S

A, 105(19):6959–64, 2008.

[176] A. I. Su, T. Wiltshire, S. Batalov, H. Lapp, K. A. Ching, D. Block, J. Zhang, R. So-

den, M. Hayakawa, G. Kreiman, M. P. Cooke, J. R. Walker, and J. B. Hogenesch.

A gene atlas of the mouse and human protein-encoding transcriptomes. Proc Natl

Acad Sci U S A, 101(16):6062–7, 2004.

[177] A. Szilagyi, V. Grimm, A. K. Arakaki, and J. Skolnick. Prediction of physical

protein-protein interactions. Phys Biol, 2(2):S1–16, 2005.

[178] I. W. Taylor, R. Linding, D. Warde-Farley, Y. Liu, C. Pesquita, D. Faria, S. Bull,

T. Pawson, Q. Morris, and J. L. Wrana. Dynamic modularity in protein interaction

networks predicts breast cancer outcome. Nat Biotechnol, 27(2):199–204, 2009.

[179] R Development Core Team. R: A Language and Environment for Statistical Com-

puting. 2008.

[180] K. L. Tew, X. L. Li, and S. H. Tan. Functional centrality: detecting lethality of

proteins in protein interaction networks. Genome Inform, 19:166–77, 2007.

[181] A. H. Tong, M. Evangelista, A. B. Parsons, H. Xu, G. D. Bader, N. Page, M. Robin-

son, S. Raghibizadeh, C. W. Hogue, H. Bussey, B. Andrews, M. Tyers, and

C. Boone. Systematic genetic analysis with ordered arrays of yeast deletion mu-

tants. Science, 294(5550):2364–8, 2001.

[182] R. Tonikian, Y. Zhang, S. L. Sazinsky, B. Currell, J. H. Yeh, B. Reva, H. A. Held,

B. A. Appleton, M. Evangelista, Y. Wu, X. Xin, A. C. Chan, S. Seshagiri, L. A.

Lasky, C. Sander, C. Boone, G. D. Bader, and S. S. Sidhu. A specificity map for

the PDZ domain family. PLoS Biol, 6(9):e239, 2008.

[183] Z. Tu, L. Wang, M. Xu, X. Zhou, T. Chen, and F. Sun. Further understanding

human disease genes by comparing with housekeeping genes and other genes. BMC

Genomics, 7:31, 2006.

[184] P. Uetz, L. Giot, G. Cagney, T. A. Mansfield, R. S. Judson, J. R. Knight, D. Lock-

shon, V. Narayan, M. Srinivasan, P. Pochart, A. Qureshi-Emili, Y. Li, B. Godwin,

D. Conover, T. Kalbfleisch, G. Vijayadamodar, M. Yang, M. Johnston, S. Fields,

and J. M. Rothberg. A comprehensive analysis of protein-protein interactions in

Saccharomyces cerevisiae. Nature, 403(6770):623–7, 2000.

[185] I. Ulitsky and R. Shamir. Identification of functional modules using network topol-

ogy and high-throughput data. BMC Syst Biol, 1:8, 2007.

[186] I. Ulitsky and R. Shamir. Identifying functional modules using expression profiles

and confidence-scored protein interactions. Bioinformatics, 25(9):1158–64, 2009.

[187] A. Valencia and F. Pazos. Computational methods for the prediction of protein

interactions. Curr Opin Struct Biol, 12(3):368–73, 2002.

[188] K. Venkatesan, J. F. Rual, A. Vazquez, U. Stelzl, I. Lemmens, T. Hirozane-

Kishikawa, T. Hao, M. Zenkner, X. Xin, K. I. Goh, M. A. Yildirim, N. Simonis,

K. Heinzmann, F. Gebreab, J. M. Sahalie, S. Cevik, C. Simon, A. S. de Smet,

E. Dann, A. Smolyar, A. Vinayagam, H. Yu, D. Szeto, H. Borick, A. Dricot, N. Kl-

itgord, R. R. Murray, C. Lin, M. Lalowski, J. Timm, K. Rau, C. Boone, P. Braun,

M. E. Cusick, F. P. Roth, D. E. Hill, J. Tavernier, E. E. Wanker, A. L. Barabasi, and

M. Vidal. An empirical framework for binary interactome mapping. Nat Methods,

6(1):83–90, 2009.

[189] C. Vogel, C. Berzuini, M. Bashton, J. Gough, and S. A. Teichmann. Supra-domains:

evolutionary units larger than single protein domains. J Mol Biol, 336(3):809–23,

2004.

[190] C. von Mering, R. Krause, B. Snel, M. Cornell, S. G. Oliver, S. Fields, and P. Bork.

Comparative assessment of large-scale data sets of protein-protein interactions.

Nature, 417(6887):399–403, 2002.

[191] A. J. Walhout, R. Sordella, X. Lu, J. L. Hartley, G. F. Temple, M. A. Brasch,

N. Thierry-Mieg, and M. Vidal. Protein interaction mapping in C. elegans using

proteins involved in vulval development. Science, 287(5450):116–22, 2000.

[192] D. P. Wall, A. E. Hirsh, H. B. Fraser, J. Kumm, G. Giaever, M. B. Eisen, and

M. W. Feldman. Functional genomic analysis of the rates of protein evolution.

Proc Natl Acad Sci U S A, 102(15):5483–8, 2005.

[193] E. T. Wang, R. Sandberg, S. Luo, I. Khrebtukova, L. Zhang, C. Mayr, S. F.

Kingsmore, G. P. Schroth, and C. B. Burge. Alternative isoform regulation in

human tissue transcriptomes. Nature, 456(7221):470–6, 2008.

[194] J. Wang, C. Li, E. Wang, and X. Wang. Uncovering the rules for protein-protein

interactions from yeast genomic data. Proc Natl Acad Sci U S A, 106(10):3752–7,

2009.

[195] R. S. Wang, Y. Wang, L. Y. Wu, X. S. Zhang, and L. Chen. Analysis on multi-

domain cooperation for predicting protein-protein interactions. BMC Bioinformat-

ics, 8:391, 2007.

[196] J. J. Ward, J. S. Sodhi, L. J. McGuffin, B. F. Buxton, and D. T. Jones. Prediction

and functional analysis of native disorder in proteins from the three kingdoms of

life. J Mol Biol, 337(3):635–45, 2004.

[197] S. Wasserman and K. Faust. Social Network Analysis: Methods and Applications.

Cambridge University Press, Cambridge, 1994.

[198] R. R. Wilcox. Introduction to robust estimation and hypothesis testing.

Statistical modeling and decision science. Academic Press, San Diego, CA, 1997.

[199] R. R. Wilcox. Introduction to robust estimation and hypothesis testing.

Statistical modeling and decision science. Elsevier/Academic Press, Amsterdam; Boston,

2nd edition, 2005.

[200] J. Wojcik, I. G. Boneca, and P. Legrain. Prediction, assessment and validation of

protein interaction maps in bacteria. J Mol Biol, 323(4):763–70, 2002.

[201] P. E. Wright and H. J. Dyson. Intrinsically unstructured proteins: re-assessing the

protein structure-function paradigm. J Mol Biol, 293(2):321–31, 1999.

[202] C. Wu, M. H. Ma, K. R. Brown, M. Geisler, L. Li, E. Tzeng, C. Y. Jia, I. Jurisica,

and S. S. Li. Systematic identification of SH3 domain-mediated human protein-

protein interactions by peptide array target screening. Proteomics, 7(11):1775–85,

2007.

[203] I. Xenarios, L. Salwinski, X. J. Duan, P. Higney, S. M. Kim, and D. Eisenberg.

DIP, the Database of Interacting Proteins: a research tool for studying cellular

networks of protein interactions. Nucleic Acids Res, 30(1):303–5, 2002.

[204] K. Xia, D. Dong, and J. D. Han. IntNetDB v1.0: an integrated protein-protein

interaction network database generated by a probabilistic model.

BMC Bioinformatics, 7:508, 2006.

[205] C. Y. Yu, L. C. Chou, and D. T. Chang. Predicting protein-protein interactions

in unbalanced data using the primary structure of proteins. BMC Bioinformatics,

11:167, 2010.

[206] H. Yu, P. Braun, M. A. Yildirim, I. Lemmens, K. Venkatesan, J. Sahalie,

T. Hirozane-Kishikawa, F. Gebreab, N. Li, N. Simonis, T. Hao, J. F. Rual, A. Dri-

cot, A. Vazquez, R. R. Murray, C. Simon, L. Tardivo, S. Tam, N. Svrzikapa, C. Fan,

A. S. de Smet, A. Motyl, M. E. Hudson, J. Park, X. Xin, M. E. Cusick, T. Moore,

C. Boone, M. Snyder, F. P. Roth, A. L. Barabasi, J. Tavernier, D. E. Hill, and

M. Vidal. High-quality binary protein interaction map of the yeast interactome

network. Science, 322(5898):104–10, 2008.

[207] M. Zampieri, N. Soranzo, D. Bianchini, and C. Altafini. Origin of co-expression

patterns in E. coli and S. cerevisiae emerging from reverse engineering algorithms.

PLoS One, 3(8):e2981, 2008.

[208] S. Zanivan, I. Cascone, C. Peyron, I. Molineris, S. Marchio, M. Caselle, and F. Bus-

solino. A new computational approach to analyze human protein complexes and

predict novel protein interactions. Genome Biol, 8(12):R256, 2007.

[209] H. Zhu, M. Bilgin, R. Bangham, D. Hall, A. Casamayor, P. Bertone, N. Lan,

R. Jansen, S. Bidlingmaier, T. Houfek, T. Mitchell, P. Miller, R. A. Dean, M. Ger-

stein, and M. Snyder. Global analysis of protein activities using proteome chips.

Science, 293(5537):2101–5, 2001.

[210] H. Zhu, J. F. Klemic, S. Chang, P. Bertone, A. Casamayor, K. G. Klemic, D. Smith,

M. Gerstein, M. A. Reed, and M. Snyder. Analysis of yeast protein kinases using

protein chips. Nat Genet, 26(3):283–9, 2000.

[211] E. Zotenko, J. Mestre, D. P. O’Leary, and T. M. Przytycka. Why do hubs in the

yeast protein interaction network tend to be essential: reexamining the connection

between the network topology and essentiality. PLoS Comput Biol, 4(8):e1000140,

2008.