Clustering Proteins from Interaction Networks for the Prediction of Cellular Functions C
Total Page:16
File Type:pdf, Size:1020Kb
Clustering proteins from interaction networks for the prediction of cellular functions C. Brun, C Herrmann, A Guenoche To cite this version: C. Brun, C Herrmann, A Guenoche. Clustering proteins from interaction networks for the prediction of cellular functions. BMC Bioinformatics, BioMed Central, 2004, 5 (1), pp.95. 10.1186/1471-2105- 5-95. hal-01596222 HAL Id: hal-01596222 https://hal-amu.archives-ouvertes.fr/hal-01596222 Submitted on 19 Dec 2018 HAL is a multi-disciplinary open access L’archive ouverte pluridisciplinaire HAL, est archive for the deposit and dissemination of sci- destinée au dépôt et à la diffusion de documents entific research documents, whether they are pub- scientifiques de niveau recherche, publiés ou non, lished or not. The documents may come from émanant des établissements d’enseignement et de teaching and research institutions in France or recherche français ou étrangers, des laboratoires abroad, or from public or private research centers. publics ou privés. Distributed under a Creative Commons Attribution| 4.0 International License BMC Bioinformatics BioMed Central Methodology article Open Access Clustering proteins from interaction networks for the prediction of cellular functions Christine Brun1, Carl Herrmann*1 and Alain Guénoche2 Address: 1Laboratoire de Génétique et Physiologie du Développement, IBDM, CNRS/INSERM/Université de la Méditerranée and 2lnstitut de Mathématiques de Luminy CNRS Parc Scientifique de Luminy, Case 907, 13288 Marseille Cedex 9, France Email: Christine Brun - [email protected]; Carl Herrmann* - [email protected]; Alain Guénoche - [email protected] * Corresponding author Published: 13 July 2004 Received: 29 March 2004 Accepted: 13 July 2004 BMC Bioinformatics 2004, 5:95 doi:10.1186/1471-2105-5-95 This article is available from: http://www.biomedcentral.com/1471-2105/5/95 © 2004 Brun et al; licensee BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL. Abstract Background: Developing reliable and efficient strategies allowing to infer a function to yet uncharacterized proteins based on interaction networks is of crucial interest in the current context of high-throughput data generation. In this paper, we develop a new algorithm for clustering vertices of a protein-protein interaction network using a density function, providing disjoint classes. Results: Applied to the yeast interaction network, the classes obtained appear to be biological significant. The partitions are then used to make functional predictions for uncharacterized yeast proteins, using an annotation procedure that takes into account the binary interactions between proteins inside the classes. We show that this procedure is able to enhance the performances with respect to previous approaches. Finally, we propose a new annotation for 37 previously uncharacterized yeast proteins. Conclusion: We believe that our results represent a significant improvement for the inference of cellular functions, that can be applied to other organism as well as to other type of interaction graph, such as genetic interactions. Background posed a prediction method in which a protein of While more data become available, analyzing protein- unknown function is assigned the three most frequent cel- protein interaction (PPI) networks appears as a particu- lular functions represented among its direct interaction larly effective way to make functional predictions for pro- partners. This approach is strictly local, as it does not take teins of unknown function. Most studies so far focused on into account the graph as a whole but only the immediate the baker's yeast S. cerevisiae due to large available datasets protein neighborhood. However, we have good reasons to [1,2], but recent experimental data for D. melanogaster [3] believe that the organization of proteins inside the inter- will most probably broaden the field of investigations. It actome goes beyond the one-step separation. Protein is therefore of crucial interest to develop reliable and effi- complexes and pathways are an example of more compli- cient strategies allowing to infer a function to yet unchar- cated relationships between proteins involved in a same acterized proteins based on interaction data. biological process. Indeed, other methods focused on the fact that proteins sharing a significant number of interac- It was soon noticed that proteins of similar cellular func- tion partners are likely to participate in common cellular tions tend to lie within a short distance in the interaction processes as proposed by Jacq (2001). Recently, we have graph. Based on this property, Schwikowski et al. [4] pro- designed PRODISTIN [5,6], a method in which distance Page 1 of 11 (page number not for citation purposes) BMC Bioinformatics 2004, 5:95 http://www.biomedcentral.com/1471-2105/5/95 values between all protein pairs are computed from the the clustering algorithm which must reflect the biological number of common and specific interaction partners and reality, and (b) the annotation procedure within classes. used to build a classification tree. Functional classes are defined according to the tree topology and to the number Results of proteins sharing functional annotations. The func- Graph, classes and partitions tional predictions for yet uncharacterized proteins are We analyze the interaction network as a graph, such that then proposed based on their belonging to a particular proteins are the vertices and each interaction is an undi- functional class. An alternative method for predicting bio- rected edge. Our aim is to build clusters of proteins shar- logical functions was proposed [7], which ranks protein ing a high percentage of interactions, as this appears to be pairs according to their probability for having the experi- a strong indicator of biological relatedness. We use the mentally measured number of common interaction part- Czekanowski-Dice distance. D, because it increases the ners. A different method, which does not rely on common weight of shared interactors, and because two proteins interaction partners was suggested by Vasquez et al. [8]; it having no common interactors will get the maximum dis- optimizes the annotations of uncharacterized proteins tance value, while those interacting with exactly the same such as to minimize the number of interactions between set of proteins will have zero value: proteins of different functional groups. These latter two approaches, while based on the interaction network, do Int() i∆ Int() j not define any clusters of proteins. Biological knowledge Dij(),,= teaches us that dense protein-protein interactions are the Int() i∪ Int() j+ Int() i∩ Int() j sign of the common involvement of those proteins in cer- tain biological processes. We therefore tried to select dense in which i and j denote two proteins, Int(i) and Int(j) are classes of proteins sharing a high percentage of interac- the lists of their interactors plus themselves (to decrease tions in the interaction graph. Several clustering algo- the distance between proteins interacting with each other) rithms applied to protein interaction graphs have been and ∆ is the symmetrical difference between the two sets. proposed so far [9,10]. They are based on a density func- From the distance matrix, we first build another valued tion evaluated in each vertex x which is computed from graph Γ = (X, E), then we evaluate a density function De the number of edges in its neighborhood. We adopted in each vertex to perform clustering only from Γ and De. another approach, computing first an appropriate dis- tance between vertices. Generally the length of a shortest Graph path or the Czekanowski-Dice distance are used, and a Given a distance matrix, D : X × X → ޒ , the first operation classical clustering method is then applied [6,11]. We will is to select a degree δ which works as a threshold. From present an alternative algorithm using the Czekanowski- any element x, the distance values D(x,y) are ranked in Dice distance as in [6]. From the distance matrix, a new increasing order and the δ-th value gives the σx threshold. graph Γ is built, which is then partitioned into disjoint Then, we take as edges in E all the pairs (x,y) such that classes of proteins using an appropriate density function. D(x,y) ≤ σ . Let n = |X|, m = |E| and Γ = (X, E) be the cor- Our algorithm differs from similar approaches in many x δ responding graph. It is not a classical threshold graph on ways : 1) the graph Γ is not a classical threshold graph, in D, since the threshold value is not the same for all the which edges are selected when their length is lower than a vertices. threshold value, and 2) we use the valuation of the edges to measure a density in each vertex and 3) we perform Moreover, it is not a regular graph with degree δ either, progressive clustering. because the edge selection process is not symmetrical. Consequently, there can be more than δ vertices incident The resulting classes are assigned a biological function to x. according to the functional annotations of their members following a classical majority rule. Finally, a refined anno- When there is no ambiguity on the δ value, the graph will tation procedure is proposed to predict the cellular func- simply be denoted Γ. For any part Y of X, let Γ (Y) be the tion(s) of uncharacterized proteins, taking into account set of vertices not in Y that are adjacent to Y. Thus, the the function(s) assigned to the class and the direct interac- neighborhood of x is denoted Γ (x), the degree of a vertex tion partners of uncharacterized protein present within x is Dg(x) = |Γ (x)|. the class. Hence, interaction data is used at two levels (see figure 1): first to define the classes using our partitioning Density function algorithm, but also to annotate uncharacterized proteins For each vertex x, we compute a density value denoted once the classes have been formed.