Predicting False Positives of Protein-Protein Interaction Data by Semantic Similarity Measures§ George Montañez1 and Young-Rae Cho*,2

Send Orders of Reprints at [email protected] Current Bioinformatics, 2013, 8, 000-000 1 Predicting False Positives of Protein-Protein Interaction Data by Semantic Similarity Measures§ George Montañez1 and Young-Rae Cho*,2 1Machine Learning Department, Carnegie Melon University, Pittsburgh, PA 15213, USA 2Bioinformatics Program, Department of Computer Science, Baylor University, Waco, TX 76798, USA Abstract: Recent technical advances in identifying protein-protein interactions (PPIs) have generated the genomic-wide interaction data, collectively collectively referred to as the interactome. These interaction data give an insight into the underlying mechanisms of biological processes. However, the PPI data determined by experimental and computational methods include an extremely large number of false positives which are not confirmed to occur in vivo. Filtering PPI data is thus a critical preprocessing step to improve analysis accuracy. Integrating Gene Ontology (GO) data is proposed in this article to assess reliability of the PPIs. We evaluate the performance of various semantic similarity measures in terms of functional consistency. Protein pairs with high semantic similarity are considered highly likely to share common functions, and therefore, are more likely to interact. We also propose a combined method of semantic similarity to apply to predicting false positive PPIs. The experimental results show that the combined hybrid method has better performance than the individual semantic similarity classifiers. The proposed classifier predicted that 58.6% of the S. cerevisiae PPIs from the BioGRID database are false positives. Keywords: Gene Ontology, protein-protein interactions, semantic similarity. 1. INTRODUCTION determined by large-scale experimental and computational approaches include an extremely large number of false Proteins interact with each other for biochemical stability positives, i.e., a significantly large fraction of the putative and functionality, building protein complexes as larger interactions detected must be considered spurious because functional units. PPIs therefore play a key role in biological they cannot be confirmed to occur in vivo [21,22,23]. processes within a cell. Recently, high-throughput Filtering PPI data is thus a critical preprocessing step to experimental techniques, such as yeast two-hybrid system improve analysis accuracy when handling interactome. The [1,2,3,4], mass spectrometry [5,6] and synthetic lethality erroneous interaction data can be curated by other resources screening [7], have made remarkable advances in identifying which are used to judge the level of functional associations PPIs on a genome-wide scale, collectively referred to as the of interacting protein pairs, such as gene expression profiles interactome. Since the evidence of interactions provides [24,25]. insights into the underlying mechanisms of biological processes, the availability of a large amount of PPI data has A recent study [26] has suggested the integration of GO introduced a new paradigm towards functional data to assess the validity of PPIs through measuring characterization of proteins on a system level [8,9]. semantic similarity of interacting proteins. GO [27] is a repository of biological ontologies and annotations of genes Over the past few years, systematic analysis of the and gene products. Although the annotation data on GO are interactome by theoretical and empirical studies has been in created by the published evidence resulted from mostly the spotlight in the field of bioinformatics [10,11,12]. In unreliable high-throughput experiments, they are frequently particular, a wide range of computational approaches have used as a benchmark for functional characterization because been applied to the protein interaction networks for of their comprehensive information. functional knowledge discovery, for instance, function prediction of uncharacterized genes or proteins [13,14,15], Functional similarity between proteins can be quantified functional module detection [16,17,18], and signaling by semantic similarity, a function that returns a numerical pathway identification [19,20]. Although the automated value reflecting closeness in meaning between two methods are scalable and robust, their accuracy is limited ontological terms annotating the proteins [28]. Since an because of unreliability of interaction data. The PPIs interaction of a protein pair is interpreted as their strong functional association, one can measure the reliability of protein-protein interactions using semantic similarity: *Address correspondence to this author at the Bioinformatics Program, proteins with higher semantic similarity are more likely to Department of Computer Science, Baylor University, One Bear Place # 97356, Waco, TX 76798, USA; Tel: 1-254-710-3385; Fax: 1-254-710-3889; interact with each other than those with low semantic E-mails: [email protected], [email protected] similarity. Therefore, absent of true information identifying which proteins actually interact, semantic similarity can be an indirect indicator of such interactions. §Part of information included in this article has been previously published in Proceedings of IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology, 2012. 1574-8936/13 $58.00+.00 © 2013 Bentham Science Publishers 2 Current Bioinformatics, 2013, Vol. 8, No. 3 Montañez and Cho In this article, we assess reliability of PPIs determined methods calculate the path length between terms in an experimentally and computationally. The performance of ontology as their similarity. Information content-based existing semantic similarity measures is analyzed in terms of methods use an information-theoretic measure based on the functional consistency, including the combinations of the notion of term likelihood to assign higher values to terms measures which achieve improved performance over the that have higher specificity. Common term-based methods previous methods. These semantic similarity measures are consider the number of shared ancestor terms in an ontology applied to identify false positive PPIs in current S. cerevisiae to assign a similarity value. Hybrid methods incorporate PPI databases. The experimental results show that the aspects of two different categories. The semantic similarity combined hybrid method has better performance than the measures in these four categories are summarized in Table 1. individual semantic similarity classifiers. The proposed combined classifier predicted that 58.6% of the S. cerevisiae 2.2.1. Path Length-Based Methods (Edge-Based Method) PPIs from the BioGRID database [29] are false positives. Path length-based methods calculate semantic similarity by measuring the shortest path length between two terms. 2. METHODS The path length can be normalized with the maximum depth of the ontology, which represents the longest path length out 2.1. Gene Ontology (GO) of all shortest paths from the root to leaf nodes. An ontology is a formal way of representing knowledge length(C1,C2 ) which is described by concepts and their relationships [30]. sim path ()C1,C2 = log 2 depth As a collaborative effort to specify bio-ontologies, GO addresses the need for consistent descriptions of genes and where length(C1,C2) is the shortest path length between two gene products across species [31]. It provides a collection of terms C and C in an ontology. well-defined biological concepts, called GO terms, spanning 1 2 three domains: biological processes, molecular functions and The semantic similarity is also measured by the depth to cellular components. GO is structured as a Directed Acyclic the most specific common ancestor (SCA) of two terms, i.e. Graph (DAG) by specifying general-to-specific relation- the shortest path length from the root to SCA [35]. The ships, such as “is-a” and “part-of”, between parent and child longer the path length to SCA from two terms is, the more terms. similar they are in meaning. Wu and Palmer [36] normalized the depth to the SCA by the average depth to the terms. This As another intriguing feature, GO maintains annotations normalized measure is used to adjust the similarity distorted for genes and gene products to their most specific GO terms, through the depths of the terms of interest. called direct annotations. Because of general-to-specific relationships in the ontology structure, a gene that is 2 length(Croot ,Csca ) simWu ()C1,C2 = annotated to a specific term is also annotated to all its parent length(C ,C ) + length(C ,C ) + 2 terms on the paths towards the root. These are called inferred root sca root sca annotations. Considering both direct and inferred annotat- length(Croot ,Csca ) ions, we can quantify the specificity of a GO term by the where Croot denotes the root term and Csca is the most proportion of the number of annotated genes on the term to specific common ancestor term of C and C . the total number of annotated genes in the ontology. Suppose 1 2 Gi and Gj are the sets of genes annotated to the GO term ti To compute functional similarity between two proteins, and tj, respectively, and ti is a parent term of tj. The size of we take into consideration semantic similarity between Gi, | Gi |, is always greater than or equals to | Gj |. pairwise combinations of the terms having direct annotations of the proteins. These path length-based methods are Note that a gene can be annotated to multiple GO terms. applicable to the well-balanced ontology in which each edge Suppose a gene

Predicting False Positives of Protein-Protein Interaction Data by Semantic Similarity Measures§ George Montañez1 and Young-Rae Cho*,2

Gene Discovery and Annotation Using LCM-454 Transcriptome Sequencing Scott J

Unicarbkb: Building a Standardised and Scalable Informatics Platform for Glycosciences Research

The ELIXIR Core Data Resources: Fundamental Infrastructure for The

Hierarchical Classification of Gene Ontology Terms Using the Gostruct

Gene Ontology Analysis with Cytoscape

To Find Information About Arabidopsis Genes Leonore Reiser1, Shabari

Use and Misuse of the Gene Ontology Annotations

EMBL-EBI Powerpoint Presentation

Webnetcoffee

An Open-Sourced Bioinformatic Pipeline for the Processing of Next-Generation Sequencing Derived Nucleotide Reads

Wormbase Curation Interfaces and Tools

Sequence Alignment/Map Format Specification