Scalable Betweenness Centrality Maximization Via Sampling
Total Page:16
File Type:pdf, Size:1020Kb
Scalable Betweenness Centrality Maximization via Sampling Ahmad Mahmoody Charalampos E. Tsourakakis Eli Upfal Brown University Harvard University Brown University Providence, RI 02912 Cambridge, MA 02138 Providence, RI 02912 [email protected] [email protected] [email protected] ABSTRACT Categories and Subject Descriptors Betweenness centrality (BWC) is a fundamental centrality mea- H.2.8 [Database Management]: Database Applications—Data sure in social network analysis. Given a large-scale network, mining; G.2.2 [Discrete Mathematics]: Graph Theory—Graph how can we find the most central nodes? This question is algorithms of great importance to many key applications that rely on BWC, including community detection and understanding Keywords graph vulnerability. Despite the large amount of work on scalable approximation algorithm design for BWC, estimat- Social Network, Centrality, Sampling, Optimization ing BWC on large-scale networks remains a computational challenge. 1. INTRODUCTION In this paper, we study the Centrality Maximization prob- Betweenness centrality (BWC) is a fundamental measure lem (CMP): given a graph G = (V, E) and a positive integer in network analysis, measuring the effectiveness of a ver- ∗ k, find a set S ⊆ V that maximizes BWC subject to the car- tex in connecting pairs of vertices via shortest paths [16]. ∗ dinality constraint jS j ≤ k. We present an efficient random- Numerous graph mining applications rely on betweenness ized algorithm that provides a (1 − 1/e − e)-approximation centrality, such as detecting communities in social and bio- with high probability, where e > 0. Our results improve the logical networks [20] and understanding the capabilities of current state-of-the-art result [40]. Furthermore, we provide an adversary with respect to attacking a network’s connec- the first theoretical evidence for the validity of a crucial as- tivity [23]. The betweenness centrality of a node u is defined sumption in betweenness centrality estimation, namely that as in real-world networks O(jVj2) shortest paths pass through ss,t(u) the top-k central nodes, where k is a constant. This also B(u) = , ∑ s explains why our algorithm runs in near linear time on real- s,t s,t world networks. We also show that our algorithm and anal- where ss,t is the number of s-t shortest paths, and ss,t(u) is ysis can be applied to a wider range of centrality measures, the number of s-t shortest paths that have u as their inter- by providing a general analytical framework. nal node. However, in many applications, e.g. [20, 23], we On the experimental side, we perform an extensive ex- are interested in centrality of sets of nodes. For this reason, perimental analysis of our method on real-world networks, the notion of BWC has been extended to sets of nodes [22, demonstrate its accuracy and scalability, and study differ- 40]. For a set of nodes S ⊆ V, we define the betweenness ent properties of central nodes. Then, we compare the sam- centrality of S as pling method used by the state-of-the-art algorithm with our method. Furthermore, we perform a study of BWC in time s (S) evolving networks, and see how the centrality of the central B(S) = ∑ s,t , nodes in the graphs changes over time. Finally, we compare s,t2V ss,t the performance of the stochastic Kronecker model [28] to where s (S) is the number of s-t shortest paths that have an real data, and observe that it generates a similar growth pat- s,t internal node in S. Note that we cannot obtain B(S) from the tern. values fB(v), v 2 Sg. In this work, we study the Centrality Maximization problem (CMP) defined formally as follows: Definition 1 (CMP). Given a network G = (V, E) and a Permission to make digital or hard copies of all or part of this work for personal or positive integer k, find a subset S∗ ⊆ V such that classroom use is granted without fee provided that copies are not made or distributed ∗ for profit or commercial advantage and that copies bear this notice and the full citation S 2 arg max B(S). on the first page. Copyrights for components of this work owned by others than the S⊆V:jSj≤k author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission We also denote the maximum centrality of a set of k nodes by OPTk, and/or a fee. Request permissions from [email protected]. i.e., OPTk = max B(S). KDD ’16, August 13 - 17, 2016, San Francisco, CA, USA S⊆V:jSj≤k c 2016 Copyright held by the owner/author(s). Publication rights licensed to ACM. ISBN 978-1-4503-4232-2/16/08. $15.00 It is known that CMP is APX-complete [15]. The best de- DOI: http://dx.doi.org/10.1145/2939672.2939869 terministic algorithms for CMP rely on the fact that BWC is monotone-submodular and provide a (1 − 1/e)-approximation nodes that can be used as seeds for influence maxi- [15, 12]. However, the running time of these algorithms is mization. We find that betweenness centrality is per- at least quadratic in the input size, and do not scale well to forming relatively well compared to a state-of-the-art large-scale networks. influence maximization algorithm. Finding the most central nodes in a network is a compu- (3) We study four strategies for attacking a network us- tationally challenging problem that we are able to handle ing four centrality measures: betweenness, coverage, accurately and efficiently. In this paper we focus on scal- k-path, and triangle centrality. Interestingly, we find ability of CMP, and graph mining applications. Our main that the k-path and triangle centralities can be more contributions are summarized as follows. effective at destroying the connectivity of a graph. Efficient algorithm. We provide a randomized approxima- 2. RELATED WORK tion algorithm, HEDGE, based on sampling shortest paths, for accurately estimating the BWC and solving CMP. Our algo- Centrality measures. There exists a wide variety of cen- rithm is simple, scales gracefully as the size of the graph trality measures: degree centrality, Pagerank [33], HITS [26], grows, and improves the previous result [40], by (i) pro- Salsa [27], closeness centrality [6], harmonic centrality [7], viding a (1 − 1/e − e)-approximation, and (ii) smaller sized betweenness centrality [16], random walk betweenness cen- samples. Specifically, in Yoshida’s algorithm [40], a sample trality [32], coverage centrality [40], k-path centrality [2], contains all the nodes on “any” shortest path between a pair, Katz centrality [24], rumor centrality [37] are some of the whereas in our algorithm, each sample is just a set of nodes important centrality measures. Boldi and Vigna proposed from a single shortest path between the pair. an axiomatic study of centrality measures [7]. In general, 2 choosing a good centrality measure is application depen- The OPTk = Q(n ) assumption. Prior work on BWC esti- 2 dent [19]. In the following we discuss in further detail the mation strongly relies on the assumption that OPTk = Q(n ) for a constant integer k [40]. As we show in Appendix, this centrality measure of our focus, the betweenness centrality. assumption is not true in general. Only empirical evidence Betweenness centrality (BWC) is a fundamental measure so far supports this strong assumption. in network analysis. The betweenness centrality index is We show that two broad families of networks satisfy this attributed to Freeman [16]. BWC has been used in a wide assumption: bounded treewidth networks and a popular variety of graph mining applications. For instance, Girvan family of stochastic networks that provably generate scale- and Newman use BWC to find communities in social and free, small-world graphs with high probability. Note that biological networks [20]. In a similar spirit, Iyer et al. use the classical Barabási-Albert scale-free random tree model BWC to attack the connectivity of networks by iteratively [5, 30] belongs to the former category. Our results imply removing the most central vertices [23]. 2 BWC that the OPTk = Q(n ) assumption holds even for k = 1, The fastest known exact algorithm for computing for these families of networks. To our knowledge, this is exactly requires O(mn) time in unweighted, and O(nm + the first theoretical evidence for the validity of this crucial n2 log m) for weighted graphs [10, 14, 36]. There exist ran- assumption on real-world networks. domized algorithms [4, 9, 34] which provide either additive General analytical framework. To analyze our algorithm, error or multiplicative error guarantees with high probabil- HEDGE, we provide a general analytical framework based on ity. CMP Chernoff bound and submodular optimization, and show For , the state-of-the-art algorithm [40] (and the only that it can be applied to any other centrality measure if it (i) scalable proposed algorithm based on sampling) provides a is monotone-submodular, and (ii) admits a hyper-edge sam- mixed error guarantee, combining additive and multiplica- pler (defined in Sect. 3). Two examples of such centralities tive error. Specifically, this algorithm provides a solution 1 2 are the coverage [40] and the k-path centralities [2]. whose centrality is at least (1 − e )OPTk − en , by sampling 2 Experimental evaluation. We provide an experimental eval- O(log n/e ) hyper-edges, where each hyper-edge is a set of uation of our algorithm that shows that it scales gracefully all nodes on any shortest path between two random nodes as the graph size increases and that it provides accurate es- with some assigned weights.