The Bottlenecks in Biological Networks

Hamidreza Mahyar∗†¹, Elahe Ghalebi K.†¹, Hamid R. Rabiee‡, Radu Grosu†
∗Department of Science and Engineering, Sharif University of Technology
†Institute of Computer Engineering, Vienna University of Technology
‡Department of Computer Engineering, Sharif University of Technology
Email: ∗[email protected], †{elahe.ghalebi, [email protected], [email protected]

Abstract

A well-established goal in the study of biological networks is to relate topological features to the functional properties of biological systems. Nodes with high betweenness centrality, known as bottlenecks, have surprising functional and dynamic properties in biological networks. Bottlenecks, which have many shortest paths passing through them, are connector hubs for many inter-modular connections to nodes of different modules. Detecting bottleneck nodes is therefore of great interest. In this paper, we propose a new approach to efficiently identify the bottlenecks in biological networks using compressive sensing with indirect measurements. The method uses only local information at each node, so it is applicable to large, real-world and unknown networks for which global methods are often infeasible. We evaluated the performance of the proposed method through extensive simulations on several biological networks. The results show that our algorithm outperforms the best existing methods, with notable improvements in terms of F-measure.

Index Terms

Biological Networks, Bottlenecks, Hubs, Network Centrality

I. INTRODUCTION

A wide range of real-world biological systems can be modeled as networks (graphs) [1]. For example, in protein interaction networks, proteins are represented by the nodes (vertices) and the interactions between them by the links (edges) of the graph [2].
As another example, the topological layout of the connectome has been quantified by representing a nervous system (such as that of the nematode worm Caenorhabditis elegans) as a graph in which each node denotes a neuron and each link denotes a synaptic connection between neurons [3].

Identifying important nodes is a substantial problem in the structural analysis of such networks [4]. Network centrality quantifies the relative importance of nodes in a network according to their topological features in the graph structure. Betweenness centrality is a prominent measure that expresses the importance of a node as the fraction of shortest paths in the network that pass through that node. Detecting nodes with high betweenness centrality is an essential task in the structural analysis of biological networks and has many applications. In a protein network, bottlenecks (i.e. high betweenness centrality nodes that control information flow in the network) are key connector proteins with surprising functional and dynamic properties [2]. In a neuronal connectome, the rich-club neurons with high betweenness centrality are connector hubs for many inter-modular links to nodes of different modules [3].

Various exact and approximation algorithms have been proposed in the literature to identify bottlenecks in biological networks [2], [5], [6], [7]. One major disadvantage of these approaches is that they assume full knowledge of the network topology, which is often unrealistic. Missing data in networks is almost unavoidable, because several limitations (e.g. the sheer size of large-scale real networks) may hinder access to complete network data. The second main drawback of these methods is that they assume "direct measurement" of each network node, which can be practically difficult, costly and in some cases impossible because of the scale and limited accessibility of real-world biological networks.
In this paper, we propose a new approach, with indirect aggregated measurements and without full network topology, to efficiently and accurately detect the bottlenecks in biological networks. To this end, we use "compressive sensing" [8], [9], which aims to effectively recover sparse high-dimensional data from a much smaller number of non-adaptive indirect measurements [10], [11], [12], [13].

¹ Authors contributed equally

II. PROBLEM FORMULATION

We consider a biological network expressed by a graph $G = (V, E)$, where $V$ denotes the set of nodes (vertices) with cardinality $|V| = n$, and $E$ is the set of links (edges). For a node $v \in V$, let $N(v) \subset V$ be its neighborhood set, $\deg(v) = |N(v)|$ its degree, and $\mathrm{Ego}(v)$ its one-hop adjacency matrix. Let $B(v)$ denote the global betweenness centrality of node $v \in V$, defined as [14]:

$$B(v) = \sum_{u, w,\; u \neq w} \frac{\sigma_{uw}(v)}{\sigma_{uw}} \tag{1}$$

where $\sigma_{uw}$ is the total number of shortest paths between $u, w \in V$, $u \neq w$, and $\sigma_{uw}(v)$ is the number of such paths going through node $v$.

Suppose every node $i \in V$ has a real value $x_i$. The vector $x = (x_i,\ i = 1, 2, \ldots, n)$ is a $k$-sparse data vector if $\|x\|_0 = k$, where $\|\cdot\|_0$ denotes the number of non-zero elements. In the problem addressed in this paper, the number of bottlenecks (i.e. top-$k$ betweenness centrality nodes) is much smaller than the total number of nodes in the biological network ($k \ll n$). In compressive sensing over networks, we have $m$ independent indirect measurements, such that $m \ll n$. We are then interested in identifying specific nodes, i.e. bottlenecks, from these measurements.

Let $x \in \mathbb{R}^n$ be a non-negative data vector whose $p$-th entry represents the value over node $p$, and let $y \in \mathbb{R}^m$ denote the measurement vector whose $q$-th entry is the sum of the values of the nodes in a connected sub-graph corresponding to measurement $q$ in the network. Let $A$ be an $m \times n$ measurement matrix whose $i$-th row corresponds to the $i$-th measurement.
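The measurement model above can be illustrated on a toy instance. The following is a minimal sketch, not the authors' implementation: it assumes each row of A is a 0/1 indicator of the nodes covered by one measurement, uses hypothetical node values and node groups, and does not check that each group forms a connected sub-graph. The helper names `build_measurement_matrix` and `measure` are illustrative only.

```python
def build_measurement_matrix(measurements, n):
    """Return an m x n 0/1 matrix; row i marks the nodes in measurement i."""
    A = [[0] * n for _ in measurements]
    for i, nodes in enumerate(measurements):
        for j in nodes:
            A[i][j] = 1
    return A


def measure(A, x):
    """Compute y = A x using plain Python lists."""
    return [sum(a_ij * x_j for a_ij, x_j in zip(row, x)) for row in A]


# Hypothetical toy data: n = 6 nodes, of which k = 2 carry a non-zero value.
x = [0.0, 5.0, 0.0, 0.0, 3.0, 0.0]
k = sum(1 for value in x if value != 0)  # ||x||_0

# Three hypothetical measurements, each a group of node indices
# (assumed, not verified, to induce a connected sub-graph).
measurements = [{0, 1, 2}, {2, 3, 4}, {1, 4, 5}]

A = build_measurement_matrix(measurements, n=len(x))
y = measure(A, x)
print(k, y)  # 2 [5.0, 3.0, 8.0]
```

Each entry of `y` is an aggregate over several nodes, so no node value is observed directly; recovering the sparse `x` from `y` and `A` is the compressive sensing step.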
$A_{i,j} = 1$ if and only if the $i$-th measurement includes node $j$, and $A_{i,j} = 0$ otherwise. Thus, we can formulate this problem as a linear system $y = Ax$.

III. PROPOSED METHOD

In this section, we introduce our proposed approach to efficiently detect the bottlenecks (i.e. the $k$ highest betweenness centrality nodes) in a biological network via compressive sensing. In this algorithm, we first construct a feasible measurement matrix $A$ with $m$ independent measurements and its corresponding measurement vector $y$, then we recover the sparse approximation of the data vector.

To construct a measurement of length $l$, the following steps are performed. First, a start node is selected uniformly at random from the set of all nodes $V$ in the network $G$. Second, the start node is added to the visited set and its neighbors to the neighbor set. These two sets are initialized to empty (NULL) for each measurement. Third, a next node is selected among the nodes in the neighbor set based on their local betweenness scores [15]. This local score can be computed in a parallel or distributed way, since each node communicates only with its 1-hop neighborhood. Fourth, the selected next node is added to the visited set and removed from the neighbor set, then its neighbors are added to the neighbor set. The latter two steps are repeated $l$ times to obtain a measurement of length $l$.

Each measurement is considered as one row in the measurement matrix, so that $A_{i,j} = 1$ if node $j$ is in the visited set for measurement $i$ and $A_{i,j} = 0$ otherwise. The cumulative sum of local values over the nodes in the visited set for measurement $i$ is assigned to the corresponding entry of the measurement vector $y$. Thus, we construct the matrix $A_{m \times n}$ and the vector $y_{m \times 1}$ from $m$ independent measurements, which can be generated in parallel. Finally, to find the sparse approximation $\hat{x}$ from the linear sketch $y = Ax$, we solve the program

$$\hat{x} = \arg\min_{x} \; \|x\|_1 + \|Ax - y\|_2^2.$$

IV. EXPERIMENTAL EVALUATION

In this section, we experimentally evaluate the performance of our algorithm (called CS-HubDet) on real-world biological networks under various configurations.

A. Datasets

We considered three well-known biological networks: (1) the neuronal connectome of the nematode worm Caenorhabditis elegans, described anatomically at the cellular scale as 2359 synaptic connections between 297 neurons [16]; (2) the yeast protein-protein interaction network, with 2361 nodes and 7182 links [17]; (3) the meta-analysis network of human whole-brain functional co-activations, with a comparable resting-state fMRI network and node coordinates, with 638 nodes and 18625 links [18].

[Figure 1: three panels plotting F-measure (vertical axis, 0 to 0.8) against k/n (horizontal axis, 0 to 0.4) for CS-HubDet, CS-TopCent and RW: (a) Human-Brain Co-activations, (b) C. Elegans Neuronal Connectome, (c) Yeast Protein Interactions.]

Fig. 1. Effect of sparsity $k$ on the accuracy of CS-HubDet in comparison with CS-TopCent and RW in terms of F-measure. For each method in each network, we ran $0.2n$ measurements of length $0.4n$.

B. Settings

To evaluate the performance of the proposed approach, CS-HubDet, we considered both precision and recall. Precision is the number of correctly identified bottleneck nodes divided by the total number of detected nodes. Recall is the number of correctly detected bottleneck nodes divided by the total number of true bottleneck nodes in the network.
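The measurement construction of Section III and the metrics above can be sketched together on a toy graph. This is an illustrative sketch under stated assumptions, not the CS-HubDet implementation: node degree stands in for the local betweenness score of [15], the next node is sampled in proportion to that score, already visited nodes are kept out of the neighbor set, recall is computed against the set of true bottlenecks, the F-measure is taken as the usual harmonic mean of precision and recall, and the graph, the detected set and the ground-truth set are all hypothetical. The sparse recovery step is omitted.

```python
import random


def local_score(adj, v):
    # ASSUMPTION: degree is a stand-in for the local betweenness score of [15],
    # whose exact definition is not given in this section.
    return len(adj[v])


def build_measurement(adj, length, rng):
    """Grow one measurement: a start node plus up to `length` selected nodes."""
    start = rng.choice(sorted(adj))          # step 1: uniform random start node
    visited = [start]                        # step 2: visited set ...
    neighbors = set(adj[start])              # ... and neighbor set
    for _ in range(length):
        if not neighbors:                    # nothing left to expand
            break
        candidates = sorted(neighbors)
        weights = [local_score(adj, u) for u in candidates]
        # step 3 (ASSUMPTION): sample in proportion to the local score
        nxt = rng.choices(candidates, weights=weights, k=1)[0]
        # step 4: move the node to the visited set, then expand the neighbor set
        visited.append(nxt)
        neighbors.discard(nxt)
        neighbors.update(u for u in adj[nxt] if u not in visited)
    return visited


def f_measure(detected, truth):
    """Return (precision, recall, F) for two sets of node ids."""
    hits = len(set(detected) & set(truth))
    precision = hits / len(detected) if detected else 0.0
    recall = hits / len(truth) if truth else 0.0
    f = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f


# Hypothetical toy graph: two triangles joined through node 3.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4],
       4: [3, 5, 6], 5: [4, 6], 6: [4, 5]}
rng = random.Random(0)
measurements = [build_measurement(adj, length=3, rng=rng) for _ in range(4)]

n = len(adj)
A = [[1 if j in row else 0 for j in range(n)] for row in measurements]
y = [sum(local_score(adj, v) for v in row) for row in measurements]

# Hypothetical detection result scored against hypothetical ground truth.
print(f_measure(detected={2, 3, 5}, truth={2, 3, 4}))
```

Because every selected node comes from the neighbor set, each measurement covers a connected sub-graph, which matches the requirement on the entries of y in Section II.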
