A Graph Modification Approach for Finding Core–Periphery Structures in Protein Interaction Networks Sharon Bruckner1, Falk Hüffner2 and Christian Komusiewicz2*
Total Page:16
File Type:pdf, Size:1020Kb
Bruckner et al. Algorithms for Molecular Biology (2015) 10:16 DOI 10.1186/s13015-015-0043-7 RESEARCH Open Access A graph modification approach for finding core–periphery structures in protein interaction networks Sharon Bruckner1, Falk Hüffner2 and Christian Komusiewicz2* Abstract The core–periphery model for protein interaction (PPI) networks assumes that protein complexes in these networks consist of a dense core and a possibly sparse periphery that is adjacent to vertices in the core of the complex. In this work, we aim at uncovering a global core–periphery structure for a given PPI network. We propose two exact graph-theoretic formulations for this task, which aim to fit the input network to a hypothetical ground truth network by a minimum number of edge modifications. In one model each cluster has its own periphery, and in the other the periphery is shared. We first analyze both models from a theoretical point of view, showing their NP-hardness. Then, we devise efficient exact and heuristic algorithms for both models and finally perform an evaluation on subnetworks of the S. cerevisiae PPI network. Keywords: Protein complexes, Graph classes, NP-hard problems Background core or its attachment (in each step, a vertex maximizing A fundamental task in the analysis of PPI networks is some specific objective function is chosen). The aim of the identification of protein complexes and functional these approaches was to predict protein complexes [4,5] modules. Herein, a basic assumption is that complexes or to reveal biological features that are correlated with in a PPI network are strongly connected among them- topological properties of core–periphery structures in selves and weakly connected to other complexes [1]. networks [3]. In this work, we use core–periphery mod- This assumption is usually too strict. To obtain a more eling in a different context. Instead of searching for local realistic network model of protein complexes, several core–periphery structures, we attempt to unravel a global approaches incorporate the core–attachment model of core–periphery structure in PPI networks. protein complexes [2]. In this model, a complex is con- To this end, we hypothesize that the true network con- jectured to consist of a stable core plus some attachment sists of several core–periphery structures. We propose proteins, which have only transient interactions with the two precise models to describe this. In the first model, core. In graph-theoretic terms, the core thus is a dense the core–periphery structures are disjoint. In the second subnetwork of the PPI network. The attachment (or: model, the peripheries may interact with different cores, periphery) is less dense, but has edges to one or more but the cores are disjoint. Then, we fit the input data to cores. each formal model and evaluate the results on several PPI Current methods employing this type of modeling are networks. based on seed growing [3-5]. Here, an initial set of promis- Our approach. In spirit, our approach is related to the ing small subgraphs is chosen as cores. Then, each core clique-corruption model of the CAST algorithm for gene is separately greedily expanded by adding vertices to its expression data clustering [6]. In this model, the input is a similarity graph where edges between vertices indicate *Correspondence: [email protected] similarity. The hypothesis is that the objects correspond- 2 Institut für Softwaretechnik und Theoretische Informatik, TU Berlin, ing to the vertices belong to disjoint biological groups of Ernst-Reuter-Platz 7, 10587 Berlin, Germany Full list of author information is available at the end of the article similar objects, the clusters. In the case of gene expres- sion data, these are assumed to be groups of genes with © 2015 Bruckner et al.; licensee BioMed Central. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated. Bruckner et al. Algorithms for Molecular Biology (2015) 10:16 Page 2 of 13 the same function. Assuming perfect measurements, the connecting the cores. In this model, we thus assume similarity graph is a cluster graph. that the cores are disjoint cliques and the vertices of the periphery are an independent set. Such graphs are called Definition 1. A graph G is a cluster graph if each con- monopolar [11]. nected component of G is a clique. Definition 3. A graph is monopolar if its vertex set Because of stochastic measurement noise, the input can be two-partitioned into V1 and V2 such that G[V1] graph is not a cluster graph. The task is to recover the is an independent set and G[V2] is a cluster graph. The underlying cluster graph from the input graph. Under partition (V1, V2) is called monopolar partition. the assumption that the errors are independent, the most likely cluster graph is one that disagrees with the input Again, we call the vertices in V1 periphery vertices and graph on a minimum number of edges. Such a graph the vertices in V2 core vertices. Our second fitting model can be found by applying a minimum number of edge now is the following. modifications (that is, edge insertions or edge deletions) to the input graph. This paradigm directly leads to the MONOPOLAR EDITING = ( ) optimization problem CLUSTER EDITING [7-9]. Input: An undirected graph G V, E . We now apply this approach to our hypothesis that there Task: Transform G into a monopolar graph by is a global core–periphery structure in the PPI networks. applying a minimum number of edge modifications In both models detailed here, we assume that all proteins and output a monopolar partition. of each core interact with each other; this implies that each Figure 1 shows an example graph along with optimal core is a clique. We also assume that the proteins in the solutions for SPLIT CLUSTER EDITING and MONOPO- periphery interact only with the cores but not with each LAR EDITING and, for comparison, CLUSTER EDITING. other. Hence, the peripheries are independent sets. Clearly, the models behind SPLIT CLUSTER EDITING and In the first model, we assume that ideally the protein MONOPOLAR EDITING are simplistic and cannot com- interactions give rise to vertex-disjoint core–periphery pletely reflect biological reality. For example, subunits of structures, that is, there are no interactions between protein complexes consisting of two proteins that first different cores and no interactions between cores and interact with each other and subsequently with the core peripheries of other cores. Then each connected compo- of a protein complex are supported by neither of our nent has at most one core which is a clique and at most one models. Nevertheless, our models are less simplistic than periphery which is an independent set. This is precisely pure clustering models that attempt to divide protein the definition of a split graph. interaction networks into disjoint dense clusters. Further- more, there is a clear trade-off between model complexity, = ( ) Definition 2. AgraphG V, E is a split graph if V algorithmic feasibility of models, and interpretability. can be partitioned into V1 and V2 such that G[V1] is an independent set and G[V2] is a clique. Further related work. In the following, we point to some We call the vertices in V1 periphery vertices and the related work in the literature that is not directly relevant vertices in V2 core vertices. Note that the partition for a for our algorithms and their evaluation but either consid- split graph is not always unique. Split graphs have been ers models of core–periphery structure or optimization previously used to model core–periphery structures in problems that are related to SPLIT CLUSTER EDITING or social networks [10]. There, however, the assumption is MONOPOLAR EDITING. that the network contains exactly one core–periphery Della Rossa et al. [12] proposed to compute core– structure. We assume that each connected component is periphery profiles that assign to each vertex a numeri- a split graph; we call graphs with this property split clus- cal coreness value. The computation of these values is ter graphs. Our fitting model is described by the following based on a heuristic random-walk model. Their evalua- optimization problem. tion showed that the S. cerevisiae PPI network exhibits a clear core–periphery structure which significantly devi- SPLIT CLUSTER EDITING ates from random networks with the same degree distri- Input: An undirected graph G = (V, E). bution. An adaption of the Markov Clustering algorithm Task: Transform G into a split cluster graph by MCL that incorporates the core-attachment model for applying a minimum number of edge modifications. protein complexes was presented by Srihari et al. [13]. The SPLIT EDITING problem is closely related to SPLIT In our second model, we allow the vertices in the periph- CLUSTER EDITING as it asks to transform a graph into a ery to be attached to an arbitrary number of cores, thereby (single) split graph by at most k edge modifications. SPLIT Bruckner et al. Algorithms for Molecular Biology (2015) 10:16 Page 3 of 13 Figure 1 An example input and optimal solutions to CLUSTER EDITING,SPLIT CLUSTER EDITING,andMONOPOLAR EDITING. Dashed edges are edge deletions, bold edges are edge insertions. CLUSTER EDITING and SPLIT CLUSTER EDITING produce the same two clusters but SPLIT CLUSTER EDITING assigns the blue vertex of the size-four cluster to the periphery. In an optimal solution to MONOPOLAR EDITING the two blue vertices are in the periphery which is shared between two clusters.