Effective Pre-Processing Strategies for Functional Clustering of a Protein-Protein Interactions Network

Effective Pre-processing Strategies for Functional Clustering of a Protein-Protein Interactions Network Duygu Ucar, Srinivasan Parthasarathy, Sitaram Asur and Chao Wang Department of Computer Science and Engineering, The Ohio State University Contact: srini ¡ @cse.ohio-state.edu Abstract clustering and graph partitioning algorithms have been ap- plied in this domain with mixed results. The cluster as- In this article we present novel preprocessing techniques, signments have been found to vary significantly with the based on topological measures of the network, to identify algorithm and its parameters. In this paper, we examine the clusters of proteins from Protein-protein interaction (PPI) application of two different clustering approaches - a hierar- networks wherein each cluster corresponds to a group of chical agglomerative clustering algorithm and a multi-way functionally similar proteins. The two main problems with graph partitioning algorithm. analyzing Protein-protein interaction networks are their scale-free property and the large number of false positive The primary property of the PPI network that is detri- interactions that they contain. Our preprocessing tech- mental to traditional graph partitioning or clustering is its niques use a key transformation and separate weighting scale-free topology [16]. In scale-free networks, the node functions to effectively eliminate suspect edges, potential degree distribution follows a power-law as ¢¤£¦¥¨§ © . false positives, from the graph. A useful side-effect of this This produces a skew in the data-space, with a few highly transformation is that the resulting graph is no longer scale connected proteins (hubs) linking the rest of the proteins free. We then examine the application of two well-known to the system, which several conventional clustering algo- clustering techniques, namely Hierarchical and Multilevel rithms cannot handle effectively. An additional problem is Graph Partitioning on the reduced network. We define suit- the unreliability of the interaction data. PPI data are typ- able statistical metrics to evaluate our clusters meaning- ically obtained from high-throughput screen sources such fully. From our study, we discover that the application of as the Yeast two-Hybrid (Y2H) system [8]. Although this clustering on the pre-processed network results in signifi- system produces a large number of protein interactions, the cantly improved, biologically relevant and balanced clus- resultant interactions have been shown to include a large ters when compared with clusters derived from the origi- number of false positives [7] . These are caused mostly due nal network. We strongly believe that our strategies would to the testing of increasingly arbitrary protein-protein inter- prove invaluable to future studies on prediction of protein actions. The interactions data, therefore, includes several functionality from PPI networks. physical interactions with no biological significance. In this paper, we pre-process the data using Line graph transformation based on two topological metrics to trans- 1 Introduction form the PPI network into a sparser network with reduced interactions. Our aim is to show that the transformed graph Protein-protein interaction (PPI) networks are believed contains fewer false positives and leads to a more biologi- to be important sources of information related to biological cally relevant partitioning than the original graph. To val- processes and complex metabolic functions of the cell. The idate our results, we use the Gene Ontology (GO) consor- presence of biologically relevant functional modules in PPI tium database [5], which provides structured vocabularies networks has been theorized by many researchers [6, 11, (ontologies) to annotate genes in terms of their associations 18] . The task of extracting these functional modules for in biological processes, molecular functions and cellular the purposes of functional prediction and identification is components. The primary purpose of these annotations is to an active research area in functional genomics. provide a common terminology to identify biologically rel- Clustering techniques are adequate to extract tightly con- evant associations among genes. To summarize, our main nected modules from the interaction network. However, it contributions in this paper are : has been observed that no single clustering algorithm can adequately reflect the underlying biological functions of Novel pre-processing strategies to eliminate redundant proteins in their clusters [13]. Hence, various traditional false positive interactions from the original PPI dataset 1 Application of two different clustering algorithms with a dense subnetwork), it is obvious that the proposed inter- suitable validation to obtain biologically meaningful action is supported by several other interactions. Hence, results from the cleaned dataset. the edges (interactions) that are not part of dense subnetworks are more likely to be interactions that are falsely ob- 2 Methodology tained. Edges that connect subnetworks are also potential false interactions. Hence we use topological metrics of the network, namely the Clustering Coefficient and Centrality 2.1 Dataset (Betweenness and Closeness), to quantify the possibility of The Database of Interacting Proteins (DIP) [19] is an on- an interaction being false. line database, that accumulates experimentally determined protein-protein interactions from different sources. In this 2.2.1 Clustering Coefficient work, we focus on the budding Yeast (Saccharomyces Cere- The Clustering Coefficient [17], is a metric commonly em- visiae) proteome since it is a well-studied organism with ployed to identify well-connected sub-components in net- large amounts of interactions data. As of May 2005, the works. It represents the interconnectivity of a node's neigh- database contains 4741 Yeast proteins having 15428 inter- bors. The Clustering Coefficient of a node in a graph can actions. For the purpose of this study, the network can be be defined as follows: visualized as a graph with nodes representing proteins and the edges between them denoting the corresponding inter- £§ (1) ¥ £¦¥ § actions. Hence, we use the terms network and graph inter- changeably in this paper. where denotes the number of triangles that go through node . The denominator gives the maximum number of 2.2 Preprocessing triangles that can go through node . It is implied that nodes The majority of the interactions on the DIP database are having high Clustering Coefficient have neighbors that have obtained using the Yeast two-hybrid (Y2H) system. Fields higher probability to be neighbors. and Song first described the features of the Y2H system in 2.2.2 Centrality 1989 [8]. It has become one of the most commonly used technologies to detect protein-protein interactions. Its main The Centrality of a node in a network is a measure of the advantages are its simplicity, low cost and high throughput. structural importance of the node. There are three impor- However, it is burdened by a tendency to produce a large tant aspects of Centrality: Degree, Closeness, and Between- number of false positives. A number of studies made to as- ness. In this work we use Betweenness and Closeness as sess the quality of the data have demonstrated large number they are more informative than degree and more suitable of erroneously identified interactions. Hence, the biologi- for this problem. cal relevance of interacting proteins obtained from this sys- Betweenness Centrality: Betweenness [10], is a measure tem needs to be re-affirmed. In this paper, we provide a of the centrality of a node and its influence over data flows pre-processing technique that uses two different metrics to in the network. For a node , it is normally calculated as the identify and eliminate interactions which are most likely to fraction of the shortest geodesic paths between node pairs ! £"$#&%§ be false. We then proceed to partition the PPI graph using that pass through node . More precisely, if is the % only the edges (interactions) which we believe to be reli- number of paths from " to that pass through node in a able. graph ' having nodes, then the Betweenness Centrality Although, it is impossible without experimental exami- of node can be calculated as *,+.- - /1032 £"$#&%§ nation to determine if the interactions eliminated are indeed ! ( false, we believe that the presence of balanced, biologically £§) (2) £ §4£ § significant clusters on the cleaned data serves as a prelimi- nary validation of our technique. There has been work done Closeness Centrality : Closeness Centrality [9], is a mea- by Saito et al [14], to eradicate false positive interactions sure of the closeness of a node, on average, to all the other using a metric called Interaction Generality. However, this ' nodes. Formally the closeness of a node in a graph is metric focuses only on the degree of individual proteins defined by the follo wing expression: without considering the topology of the network. We believe that the degree, by itself, is not sufficient. It is impor- * - 9:032 78 £.5§6 (3) tant to consider connectivity and density of sub-networks to !;£.;#3<=§ adequately deal with false positive interactions. Our technique is therefore governed by the following intuition. If a where !;£.;#3<>§ denotes the pairwise geodesic distance be- < node is strongly connected to its neighbors (i.e., lies inside tween node and . denotes the number of reachable 7 2 nodes from node . Due to the scale free property,

Load more