
The Bottlenecks in Biological Networks

Hamidreza Mahyar∗†1, Elahe Ghalebi K.†1, Hamid R. Rabiee‡, Radu Grosu† ∗Department of Science and Engineering, Sharif University of Technology †Institute of Computer Engineering, Vienna University of Technology ‡Department of Computer Engineering, Sharif University of Technology Email: ∗[email protected], †{elahe.ghalebi, radu.grosu}@tuwien.ac.at, ‡[email protected]

Abstract A well-established goal in the study of biological networks is to investigate the relations between their topological features and functional properties. Nodes with high betweenness centrality, known as bottlenecks, have surprising functional and dynamic properties in biological networks. Bottlenecks, which have many shortest paths passing through them, are connector hubs for many inter-modular connections to nodes of different modules. Detecting bottleneck nodes in these networks is therefore of great interest. In this paper, we propose a new approach to efficiently identify the bottlenecks in biological networks, using compressive sensing with indirect measurements. The method uses only the local information at each node, so it is applicable to large real-world and unknown networks for which global methods are often infeasible. We evaluated the performance of the proposed method through extensive simulations on several biological networks. The results show that our algorithm outperforms the best existing methods, with notable improvements in terms of F-measure.

Index Terms Biological Networks, Bottlenecks, Hubs, Network Centrality

I. INTRODUCTION A wide range of real-world biological systems can be modeled as networks (graphs) [1]. For example, in protein interaction networks, proteins are represented by the nodes (vertices) and the interactions between them by the links (edges) of the graph [2]. As another example, the topological layout of the neuronal connectome has been quantified by representing a nervous system (such as that of the nematode worm Caenorhabditis elegans) as a graph in which each node denotes a neuron and each link denotes a synaptic connection between neurons [3]. Identifying important nodes in the structural analysis of such networks has been a substantial problem [4]. Network centrality quantifies the relative importance of nodes in a network according to their topological features in the graph structure. Betweenness centrality is a prominent measure that represents the importance of a node in terms of the fraction of shortest paths going through that node within the network. Detecting nodes with high betweenness centrality is an essential task in the structural analysis of biological networks and has many applications. In a protein network, bottlenecks (i.e., high betweenness centrality nodes that control information flow in the network) are key connector proteins with surprising functional and dynamic properties [2]. In a neuronal connectome, the rich-club neurons with high betweenness centrality are connector hubs for many inter-modular links to nodes of different modules [3]. Various exact and approximation algorithms have been proposed in the literature to identify bottlenecks in biological networks [2], [5], [6], [7]. One major disadvantage of these approaches is that they assume full knowledge of the network's topological structure, which is often unrealistic. Missing data is almost unavoidable, because several limitations (e.g., the sheer scale of real networks) may hinder access to complete network data.
The second main drawback of these methods is that they assume “direct measurement” of each network node, which can be practically difficult, costly, and in some cases impossible, because of scalability and accessibility constraints in real-world biological networks. In this paper, we propose a new approach, with indirect aggregated measurements and without full knowledge of the network topology, to efficiently and accurately detect the bottlenecks in biological networks. To this end, we use “compressive sensing” [8], [9], which aims to effectively recover sparse high-dimensional data from a much smaller number of non-adaptive indirect measurements [10], [11], [12], [13].

1 Authors contributed equally

II. PROBLEM FORMULATION We consider a biological network expressed by a graph G = (V, E), where V denotes the set of nodes (vertices) with cardinality |V| = n, and E is the set of links (edges). For a node v ∈ V, let N(v) ⊂ V be its neighborhood set, deg(v) = |N(v)| its degree, and Ego(v) its one-hop ego network. Let B(v) denote the global betweenness centrality of node v ∈ V, defined as [14]:

$$B(v) = \sum_{u, w \in V,\; u \neq w} \frac{\sigma_{uw}(v)}{\sigma_{uw}} \qquad (1)$$

where $\sigma_{uw}$ is the total number of shortest paths between nodes $u, w \in V$, $u \neq w$, and $\sigma_{uw}(v)$ is the number of such paths going through node $v$. Suppose every node $i \in V$ has a real value $x_i$. The vector $x = (x_i,\ i = 1, 2, \ldots, n)$ is a k-sparse data vector if $\|x\|_0 = k$, where $\|\cdot\|_0$ denotes the number of non-zero elements of a vector. In the problem addressed in this paper, the number of bottlenecks (i.e., top-k betweenness centrality nodes) is much smaller than the total number of nodes in the biological network ($k \ll n$). In compressive sensing over networks, we have m independent indirect measurements, with $m \ll n$, and we are interested in identifying specific nodes, i.e., bottlenecks, from these measurements. Let $x \in \mathbb{R}^n$ be a non-negative data vector whose p-th entry represents the value over node p, and let $y \in \mathbb{R}^m$ denote the measurement vector whose q-th entry is the sum of the values of the nodes in the connected sub-graph corresponding to measurement q. Let A be an $m \times n$ measurement matrix whose i-th row corresponds to the i-th measurement: $A_{i,j} = 1$ if and only if the i-th measurement includes node j, and $A_{i,j} = 0$ otherwise. Thus, we can formulate this problem as the linear system $y = Ax$.
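As an illustration of Equation (1), the following minimal Python sketch computes B(v) by brute force on a small toy graph. The function names and the toy graph are illustrative assumptions, not part of the original method. The sketch assumes an unweighted, undirected graph, sums over ordered pairs (u, w), and excludes the endpoints u and w from the intermediate nodes v.

```python
from collections import deque


def shortest_path_counts(adj, source):
    """BFS from `source`; returns (dist, sigma), where sigma[t] is the
    number of shortest paths from `source` to t."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:            # first time w is reached
                dist[w] = dist[u] + 1
                sigma[w] = 0
                queue.append(w)
            if dist[w] == dist[u] + 1:   # u precedes w on a shortest path
                sigma[w] += sigma[u]
    return dist, sigma


def betweenness(adj):
    """Equation (1): sum over ordered pairs (u, w), u != w, of
    sigma_uw(v) / sigma_uw, with v different from u and w (assumption)."""
    info = {s: shortest_path_counts(adj, s) for s in adj}
    B = {v: 0.0 for v in adj}
    for u in adj:
        dist_u, sigma_u = info[u]
        for w in adj:
            if w == u or w not in dist_u:
                continue
            for v in adj:
                if v == u or v == w or v not in dist_u:
                    continue
                dist_v, sigma_v = info[v]
                # v lies on a shortest u-w path iff the distances add up
                if w in dist_v and dist_u[v] + dist_v[w] == dist_u[w]:
                    B[v] += sigma_u[v] * sigma_v[w] / sigma_u[w]
    return B


# Toy graph (hypothetical): two triangles joined through bridge node 3.
toy_adj = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4},
    4: {3, 5, 6}, 5: {4, 6}, 6: {4, 5},
}
toy_B = betweenness(toy_adj)  # node 3 gets the largest score, 18.0
```

Note that this direct computation needs the full topology, which is exactly the requirement the compressive sensing formulation is meant to avoid; it is useful here only as ground truth for small examples.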

III. PROPOSED METHOD In this section, we introduce our approach to efficiently detect the bottlenecks (i.e., the k highest betweenness centrality nodes) in a biological network via compressive sensing. We first construct a feasible measurement matrix A with m independent measurements and its corresponding measurement vector y, and then recover a sparse approximation of the data vector. A measurement of length l is constructed as follows. First, a start node is selected uniformly at random from the set of all nodes V in the network G. Second, the start node is added to the visited set and its neighbors to the neighbor set; both sets are initialized to empty for each measurement. Third, the next node is selected among the nodes in the neighbor set according to their local betweenness scores [15]. This local score can be computed in a parallel or distributed way, even if each node communicates only with its 1-hop neighborhood. Fourth, the selected node is added to the visited set and removed from the neighbor set, and its neighbors are added to the neighbor set. The last two steps are repeated l times to obtain a measurement of length l. Each measurement forms one row of the measurement matrix, so that $A_{i,j} = 1$ if node j is in the visited set of measurement i and $A_{i,j} = 0$ otherwise. The sum of the local values over the nodes in the visited set of measurement i is stored in the corresponding entry of the measurement vector y. Thus, we construct the matrix $A_{m \times n}$ and the vector $y_{m \times 1}$ from m independent measurements, which can be generated in parallel. Finally, to find the sparse approximation $\hat{x}$ from the linear sketch $y = Ax$, we solve the program

$$\hat{x} = \arg\min_{x} \; \|x\|_1 + \|Ax - y\|_2^2.$$
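The measurement construction described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the local score is taken to be the ego-network betweenness of [15], the local value of a node is assumed to equal that score, the next node is drawn with probability proportional to its score (plus a small constant so zero-score nodes remain reachable), and a measurement of length l is taken to mean l visited nodes. All function names are hypothetical.

```python
import random


def ego_betweenness(adj, v):
    """Betweenness of v inside its one-hop ego network: each pair of
    non-adjacent neighbours (a, b) contributes 1 / (number of two-hop
    a-b paths inside the ego network, one of which goes through v)."""
    ego = adj[v] | {v}
    nbrs = sorted(adj[v])
    score = 0.0
    for i, a in enumerate(nbrs):
        for b in nbrs[i + 1:]:
            if b not in adj[a]:
                score += 1.0 / len(adj[a] & adj[b] & ego)
    return score


def build_measurement(adj, local, length, rng):
    """Grow one connected visited set of (at most) `length` nodes."""
    start = rng.choice(sorted(adj))           # uniform random start node
    visited = {start}
    frontier = set(adj[start])                # the "neighbor set"
    while len(visited) < length and frontier:
        candidates = sorted(frontier)
        # Assumption: selection probability proportional to local score.
        weights = [local[c] + 1e-9 for c in candidates]
        nxt = rng.choices(candidates, weights=weights, k=1)[0]
        visited.add(nxt)
        frontier.discard(nxt)
        frontier |= adj[nxt] - visited        # unvisited neighbours only
    return visited


def build_system(adj, m, length, seed=0):
    """Return (A, y, local): m binary rows and their aggregated values."""
    rng = random.Random(seed)
    nodes = sorted(adj)
    index = {v: j for j, v in enumerate(nodes)}
    local = {v: ego_betweenness(adj, v) for v in nodes}
    A, y = [], []
    for _ in range(m):
        visited = build_measurement(adj, local, length, rng)
        row = [0] * len(nodes)
        for v in visited:
            row[index[v]] = 1                 # A[i][j] = 1 iff node j visited
        A.append(row)
        y.append(sum(local[v] for v in visited))
    return A, y, local
```

Because each row depends only on its own random start and on one-hop information, the m calls to `build_measurement` are independent and could be run in parallel, in line with the description above.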

IV. EXPERIMENTAL EVALUATION In this section, we experimentally evaluate the performance of our algorithm (called CS-HubDet) on real-world biological networks under various configurations.

A. Datasets We considered three well-known biological networks: (1) the neuronal connectome of the nematode worm Caenorhabditis elegans, described anatomically at the cellular scale as 2359 synaptic connections between 297 neurons [16]; (2) the yeast protein-protein interaction network with 2361 nodes and 7182 links [17]; (3) the meta-analysis network of human whole-brain functional co-activations with a comparable resting-state fMRI network and node coordinates, with 638 nodes and 18625 links [18].

[Figure 1 plot: three panels, (a) Human-Brain Co-activations, (b) C. Elegans Neuronal Connectome, (c) Yeast Protein Interactions. Horizontal axis: k/n (0 to 0.4); vertical axis: F-measure (0 to 0.8). Curves: CS-HubDet, CS-TopCent, RW.]

Fig. 1. Effect of sparsity k on the accuracy of CS-HubDet in comparison with CS-TopCent and RW in terms of F-measure. For each method on each network, we used 0.2n measurements of length 0.4n.

B. Settings To evaluate the performance of the proposed approach, CS-HubDet, we considered both precision and recall. Precision is the number of correctly identified bottleneck nodes divided by the total number of detected nodes. Recall is the number of correctly detected bottleneck nodes divided by the total number of true bottleneck nodes in the network. Overall, we used the F-measure, which is the harmonic mean of precision and recall:

$$\text{F-measure} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \qquad (2)$$

For comparison with previous work, we also implemented the method presented by Mahyar et al. [19] (CS-TopCent), which efficiently detects central nodes in networks under the same conditions as our approach (i.e., without full knowledge of the network topology and with indirect measurements). In addition, we compared CS-HubDet and CS-TopCent with the approach introduced by Xu et al. [20] (RW), which is the best baseline method for sparse signal recovery in networks using compressive sensing. For the optimization step, we used the CVXPY package [21], a convex optimization modeling layer for Python. Each of these methods has a source of randomness, so in each test case we repeated the experiments 10 times. Each point in a figure represents the mean value over all repetitions, shown with its asymmetric standard deviation, which quantifies the variation of the F-measure at that point.
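To make the optimization step concrete, the sketch below minimizes the objective ||x||_1 + ||Ax - y||_2^2 from Section III. It is an illustration only: the experiments use CVXPY, whereas this sketch swaps in a plain iterative soft-thresholding (ISTA) loop in pure Python so that it has no dependencies. The function names, the iteration count, and the step size (derived from the bound lambda_max(A^T A) <= ||A||_F^2) are assumptions. In CVXPY, the same objective can be written with the `norm1` and `sum_squares` atoms.

```python
import math


def soft_threshold(z, t):
    """Proximal operator of t * |z|: shrink z toward zero by t."""
    return math.copysign(max(abs(z) - t, 0.0), z)


def residual(A, y, x):
    """Return Ax - y as a list."""
    return [sum(a * xj for a, xj in zip(row, x)) - yi for row, yi in zip(A, y)]


def objective(A, y, x):
    """||x||_1 + ||Ax - y||_2^2."""
    return sum(abs(xj) for xj in x) + sum(r * r for r in residual(A, y, x))


def recover(A, y, iters=500):
    """ISTA stand-in (assumption) for the convex program of Section III."""
    m, n = len(A), len(A[0])
    # The gradient of ||Ax - y||^2 is 2 A^T (Ax - y); its Lipschitz
    # constant is at most 2 * ||A||_F^2, which gives a safe step size.
    lipschitz = 2.0 * sum(a * a for row in A for a in row) or 1.0
    step = 1.0 / lipschitz
    x = [0.0] * n
    for _ in range(iters):
        r = residual(A, y, x)
        grad = [2.0 * sum(A[i][j] * r[i] for i in range(m)) for j in range(n)]
        x = [soft_threshold(xj - step * g, step) for xj, g in zip(x, grad)]
    return x


# Toy system (hypothetical): two measurements that share node 1.
A_toy = [[1, 1, 0], [0, 1, 1]]
y_toy = [2.0, 2.0]
x_hat = recover(A_toy, y_toy)  # mass concentrates on the shared node
```

The l1 term shrinks the recovered values slightly below the true ones, so the recovered vector is best read as a ranking score: the bottleneck candidates are the k nodes with the largest entries of the recovered vector.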

C. Evaluation Results We evaluated the accuracy of CS-HubDet for identifying bottleneck nodes in the biological networks in three experiments: (1) effect of sparsity k (the number of top-k nodes to be recovered), (2) effect of the number of measurements m, and (3) effect of measurement length l. These experiments are explained below; the main reason for the improvements lies in the measurement matrix construction of our algorithm.

Experiment 1 - Effect of sparsity k: In this experiment, for different sparsity levels, we compared the top-k node lists approximated by our method (CS-HubDet) and by the competing methods (CS-TopCent and RW) against the top-k list of global betweenness centrality defined in Equation (1), in terms of F-measure. Figure 1 shows the results. For readability, the sparsity level on the horizontal axis is normalized as the number of top-k nodes to be recovered divided by the total number of nodes in the network (i.e., k/n). For each network and each sparsity level, we ran 10 sets of measurements, each containing m = 0.2n measurements of length l = 0.4n, and report the mean value over all repetitions with its standard deviation. As Figure 1 shows, CS-HubDet achieves a higher F-measure than the competing methods, even at lower sparsity (higher values of k). Overall, CS-HubDet outperforms CS-TopCent and RW by around 11.8% and 15.2% on average over the three networks, respectively.
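The comparison of top-k lists can be sketched as follows. This is a minimal illustration with hypothetical function names and toy scores, assuming the standard definition of recall (correct detections divided by the number of true bottlenecks) and ties broken by node identifier.

```python
def top_k(scores, k):
    """Set of the k nodes with the largest scores (ties broken by node id)."""
    ranked = sorted(scores, key=lambda v: (-scores[v], v))
    return set(ranked[:k])


def f_measure(detected, truth):
    """Harmonic mean of precision and recall, as in Equation (2)."""
    hits = len(detected & truth)
    if hits == 0:
        return 0.0
    precision = hits / len(detected)
    recall = hits / len(truth)
    return 2 * precision * recall / (precision + recall)


# Toy scores (hypothetical): exact betweenness vs. recovered values.
exact = {0: 0.0, 1: 0.0, 2: 16.0, 3: 18.0, 4: 16.0, 5: 0.0, 6: 0.0}
recovered = {0: 0.1, 1: 0.0, 2: 1.5, 3: 0.9, 4: 0.2, 5: 0.4, 6: 0.0}
k = 3
score = f_measure(top_k(recovered, k), top_k(exact, k))  # 2 of 3 correct
```

When both lists contain exactly k nodes, precision and recall coincide, so the F-measure reduces to the fraction of true top-k nodes that were recovered.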
Experiment 2 - Effect of number of measurements m: Figure 2 shows the performance of the proposed CS-HubDet method compared to CS-TopCent and RW, in terms of F-measure, for different numbers of measurements m used to detect the bottlenecks in the biological networks. The horizontal axis shows the number of measurements m divided by the total number of nodes n in the network (i.e., m/n). For each number of measurements on each network, we ran 10 sets of measurements with a measurement length of l = 0.2n and a sparsity of k = 0.15n, then the mean

[Figure 2 plot: three panels, (a) Human-Brain Co-activations, (b) C. Elegans Neuronal Connectome, (c) Yeast Protein Interactions. Horizontal axis: m/n (0 to 0.4); vertical axis: F-measure (0 to 0.8). Curves: CS-HubDet, CS-TopCent, RW.]

Fig. 2. Effect of number of measurements m on the accuracy of CS-HubDet in comparison with CS-TopCent and RW in terms of F-measure. For all networks, we used measurements of length 0.2n and the sparsity is set to 0.15n.

[Figure 3 plot: three panels, (a) Human-Brain Co-activations, (b) C. Elegans Neuronal Connectome, (c) Yeast Protein Interactions. Horizontal axis: l/n (0 to 0.4); vertical axis: F-measure (0 to 0.8). Curves: CS-HubDet, CS-TopCent, RW.]

Fig. 3. Effect of measurement length l on the accuracy of CS-HubDet compared to CS-TopCent and RW in terms of F-measure. For all networks, we used 0.2n measurements and the sparsity is set to 0.15n.

value over all the repetitions is reported. In all test cases, CS-HubDet achieves a higher F-measure than the two competing methods for most numbers of measurements. Overall, the average improvements of CS-HubDet over CS-TopCent and RW on all networks are around 29.4% and 38.1%, respectively.

Experiment 3 - Effect of measurement length l: This experiment investigates the effect of the measurement length l on CS-HubDet, CS-TopCent, and RW, in terms of F-measure, for the detection of bottlenecks in the biological networks. In Figure 3, the measurement lengths on the horizontal axis are normalized as the measurement length l divided by the total number of nodes n in the network (i.e., l/n). Throughout this experiment, the sparsity level is set to k = 0.15n and the number of measurements to m = 0.2n for each network and each method. Figure 3 shows that the proposed CS-HubDet method achieves a higher F-measure than the existing methods CS-TopCent and RW for most measurement lengths, including the shorter ones, on all investigated networks. Overall, the average improvements of our method over CS-TopCent and RW on all datasets are around 8.7% and 19.8%, respectively.

V. CONCLUSION This paper addressed the problem of efficiently identifying the bottlenecks in biological networks. It has been shown that bottleneck nodes are highly correlated with nodes of high betweenness centrality in these networks. Most previous work requires knowledge of the complete network topology and does not take into account the overhead imposed by directly measuring each individual node. To tackle these drawbacks, we introduced a new approach that efficiently identifies the bottleneck nodes in a biological network using compressive sensing. With extensive experimental evaluations on real-world networks, we demonstrated that the proposed method outperforms the state-of-the-art compressive sensing-based approaches for sparse recovery over biological networks in terms of F-measure. As future work, we plan to propose an efficient approach for the distributed identification of central nodes in other types of networks [22].

REFERENCES
[1] J. Ji, A. Zhang, C. Liu, X. Quan, and Z. Liu, “Survey: Functional module detection from protein-protein interaction networks,” IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 2, 2014.
[2] M. Joy, A. Brock, D. Ingber, and S. Huang, “High-betweenness proteins in the yeast protein interaction network,” BioMed Research International, vol. 2, no. 1, 2005.
[3] E. Towlson, P. Vertes, S. Ahnert, W. Schafer, and E. Bullmore, “The rich club of the C. elegans neuronal connectome,” Journal of Neuroscience, vol. 33, no. 15, 2013.
[4] S. M. Taheri, H. Mahyar, M. Firouzi, E. Ghalebi K., R. Grosu, and A. Movaghar, “Hellrank: A hellinger-based centrality measure for bipartite social networks,” Social Network Analysis and Mining (SNAM), DOI: 10.1007/s13278-017-0440-7, vol. 7, May 2017.
[5] H. Yu, P. Kim, E. Sprecher, V. Trifonov, and M. Gerstein, “The importance of bottlenecks in protein networks: correlation with essentiality and expression dynamics,” PLoS Comput Biol, vol. 3, no. 4, 2007.
[6] D. Koschutzki and F. Schreiber, “Centrality analysis methods for biological networks and their application to gene regulatory networks,” Gene Regulation and Systems Biology, vol. 1, no. 2, 2008.
[7] G. Pavlopoulos, M. Secrier, C. Moschopoulos, T. Soldatos, S. Kossida, J. Aerts, R. Schneider, and P. Bagos, “Using graph theory to analyze biological networks,” BioData Mining, vol. 4, no. 1, 2011.
[8] M. Davenport, M. Duarte, Y. Eldar, and G. Kutyniok, “Introduction to compressed sensing, chapter in compressed sensing: Theory and applications,” Cambridge University Press, pp. 1–64, 2012.
[9] E. J. Candes, J. K. Romberg, and T. Tao, “Stable signal recovery from incomplete and inaccurate measurements,” Comm. Pure Appl. Math., vol. 59, no. 8, pp. 1207–1223, Aug. 2006.
[10] H. Mahyar, H. R. Rabiee, and Z. S. Hashemifar, “UCS-NT: An Unbiased Compressive Sensing Framework for Network Tomography,” in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vancouver, Canada, May 2013, pp. 4534–4538.
[11] H. Mahyar, H. R. Rabiee, Z. S. Hashemifar, and P. Siyari, “UCS-WN: An Unbiased Compressive Sensing Framework for Weighted Networks,” in Conference on Information Sciences and Systems (CISS), Baltimore, USA, Mar. 2013, pp. 1–6.
[12] H. Mahyar, H. R. Rabiee, A. Movaghar, E. Ghalebi, and A. Nazemian, “CS-ComDet: A compressive sensing approach for inter-community detection in social networks,” in IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), Paris, France, Aug. 2015, pp. 89–96.
[13] H. Mahyar, H. R. Rabiee, A. Movaghar, R. Hasheminezhad, E. Ghalebi, and A. Nazemian, “A low-cost sparse recovery framework for weighted networks under compressive sensing,” in IEEE International Conference on Social Computing and Networking (SocialCom), Chengdu, China, Dec. 2015, pp. 183–190.
[14] L. Freeman, “A set of measures of centrality based on betweenness,” Sociometry, vol. 40, pp. 35–41, 1977.
[15] M. Everett and S. P. Borgatti, “Ego network betweenness,” Social Networks, vol. 27, pp. 31–38, 2005.
[16] J. White, E. Southgate, J. Thomson, and S. Brenner, “The structure of the nervous system of the nematode Caenorhabditis elegans: the mind of a worm,” Phil. Trans. R. Soc. Lond., vol. 314, 1986.
[17] D. Bu, Y. Zhao, L. Cai, H. Xue, X. Zhu, H. Lu, J. Zhang, S. Sun, L. Ling, N. Zhang, G. Li, and R. Chen, “Topological structure analysis of the protein-protein interaction network in budding yeast,” Nucleic Acids Research, vol. 31, 2003.
[18] N. Crossley, A. Mechelli, P. Vertes, T. Winton-Brown, A. Patel, C. Ginestet, P. McGuire, and E. Bullmore, “Cognitive relevance of the community structure of the human brain functional coactivation network,” Proceedings of the National Academy of Sciences, vol. 110, no. 28, 2013.
[19] H. Mahyar, “Detection of top-k central nodes in social networks: A compressive sensing approach,” in IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (ASONAM), Paris, France, Aug. 2015, pp. 902–909.
[20] W. Xu, E. Mallada, and A. Tang, “Compressive sensing over graphs,” in IEEE INFOCOM, Apr. 2011, pp. 2087–2095.
[21] CVXPY, “Python-embedded modeling language for convex optimization problems,” May 2017. [Online]. Available: http://www.cvxpy.org/en/latest/
[22] S. M. Taheri, H. Mahyar, M. Firouzi, E. Ghalebi K., R. Grosu, and A. Movaghar, “Extracting implicit social relation for social recommendation techniques in user rating prediction,” in the 26th International World Wide Web Conference (WWW), Social Computing Workshop: Spatial Social Behavior Analytics on the Web, Perth, Australia, Apr. 2017, pp. 1343–1351.