Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI 2015)

A Scalable Community Detection Algorithm for Large Graphs Using Stochastic Block Models

Chengbin Peng¹, Zhihua Zhang², Ka-Chun Wong³, Xiangliang Zhang¹, David E. Keyes¹
¹King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
²Shanghai Jiao Tong University, Shanghai, China
³City University of Hong Kong, Hong Kong, China
{chengbin.peng, xiangliang.zhang, david.keyes}@kaust.edu.sa, [email protected], [email protected]

Abstract

Community detection in graphs is widely used in social and biological networks, and the stochastic block model is a powerful probabilistic tool for describing graphs with community structures. However, in the era of "big data," traditional inference algorithms for such a model are increasingly limited by their high time complexity and poor scalability. In this paper, we propose a multi-stage maximum likelihood approach that recovers the latent parameters of the stochastic block model in time linear in the number of edges. We also propose a parallel algorithm based on message passing. Our algorithm overlaps communication and computation, providing speedup without compromising accuracy as the number of processors grows. For example, to process a real-world graph with about 1.3 million nodes and 10 million edges, our algorithm requires about 6 seconds on 64 cores of a contemporary commodity Linux cluster. Experiments demonstrate that the algorithm can produce high-quality results on both benchmark and real-world graphs. An example of finding more meaningful communities than a popular modularity maximization algorithm is also illustrated.

1 Introduction

Community structures, in which nodes belonging to the same community are more densely connected to each other than to the rest of the graph, prevail in real graphs. Finding those structures can be beneficial in many fields, such as finding protein complexes in biological networks and topical or disciplinary groups in collaborative networks [Fortunato and Castellano, 2012; Kemp et al., 2006; Ansótegui et al., 2012].

In this work, we consider non-overlapping community structures. Many community detection algorithms handle such a problem [Newman, 2004; Raghavan et al., 2007; Blondel et al., 2008; Fortunato and Castellano, 2012; Liu et al., 2013; Wickramaarachchi et al., 2014; Staudt and Meyerhenke, 2015]. However, they come with limitations on large graphs, for example in handling community heterogeneity [Nadler and Galun, 2006; Fortunato and Barthelemy, 2007; Xiang and Hu, 2012]. Alternatively, model-based methods can produce more reliable and accurate results when the model assumptions are in accordance with the real graphs. Stochastic block models (SBMs) are among the important probabilistic tools for describing the connectivity relationship between pairs of nodes [Holland et al., 1983], and they have received considerable attention in both theoretical [Celisse et al., 2012] and application [Daudin et al., 2008] domains.

Many algorithms have been proposed to infer the parameters of SBMs, such as Bayesian estimation [Hofman and Wiggins, 2008] and nuclear norm minimization [Chen et al., 2012]. Bayesian estimation defines prior distributions on the latent variables (the community labels of the nodes) and maximizes the posterior distribution given a graph [Hofman and Wiggins, 2008]. Nuclear norm minimization of a matrix minimizes the sum of its singular values; that algorithm is verified on graphs containing one thousand nodes.

In this work, we propose a multi-stage likelihood maximization algorithm based on SBMs, which has three advantages:

• Speed: We devise an algorithm based on coordinate descent and use approximations to simplify computations. Our algorithm runs in linear time with respect to the number of edges.

• Scalability: We propose a parallel algorithm based on message passing, which tends to be the most portable and performance-consistent parallel programming model for a variety of memory structures. We overlap communication and computation in the algorithm, so that for large graphs it can achieve significant speedup proportional to the number of processors. To the best of our knowledge, it is the first parallel algorithm for inferring SBMs.

• Quality: The algorithm can produce high-quality results. In the initialization, it considers each node as one community and then employs a multi-stage iterative strategy to construct larger communities gradually. It empirically outperforms traditional community detection algorithms as measured by normalized mutual information (NMI) [Danon et al., 2005] with respect to the ground truth. Experiments on real-world graphs show that our algorithm can produce more meaningful communities with good quality.

2 Stochastic Block Model

In this work, we develop a community detection algorithm based on a stochastic block model (SBM). We define $N$ as the number of nodes and $M$ as the number of undirected and unweighted edges connecting those nodes. Each node belongs to one of $K$ blocks, and we use $Z \in \{0,1\}^{N \times K}$ to represent the block labels. That is, $Z_{ir} = 1$ means that node $i$ belongs to block $r$, and each row of $Z$ contains only one nonzero entry. We also define a matrix $B \in [0,1]^{K \times K}$, where $B_{rk}$ ($r \neq k$) represents the probability of a connection between a node drawn from block $r$ and one drawn from block $k$. If $r = k$, $B_{rk}$ represents the probability of connections inside the block.

Using the matrices $B$ and $Z$, we define a probability matrix $\Theta = ZBZ^T$. Then the adjacency matrix $W$ of a sample network can be drawn from the following model:

$$\Pr(W_{ij}) = \begin{cases} \Theta_{ij} & \text{if } W_{ij} = 1, \\ 1 - \Theta_{ij} & \text{if } W_{ij} = 0, \end{cases} \qquad (1)$$

for $i, j \in \{1, 2, \cdots, N\}$ and $i \neq j$, indicating that $W_{ij}$ is a sample from the Bernoulli distribution with success rate $\Theta_{ij}$. Typically, the adjacency matrix $W$ is available from the data set. Our primary purpose is to estimate $Z$.
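For concreteness, the following minimal sketch (ours, not part of the paper's implementation) draws a small graph from the model in Eq. (1); the function name and the example block matrix are illustrative assumptions.

```python
import numpy as np

def sample_sbm(block_sizes, B, seed=0):
    """Draw a symmetric adjacency matrix W from the SBM of Eq. (1)."""
    rng = np.random.default_rng(seed)
    N, K = sum(block_sizes), len(block_sizes)
    # Z[i, r] = 1 iff node i belongs to block r (one nonzero per row).
    node_blocks = np.repeat(np.arange(K), block_sizes)
    Z = np.zeros((N, K), dtype=int)
    Z[np.arange(N), node_blocks] = 1
    # Theta = Z B Z^T is the Bernoulli success rate of every node pair.
    Theta = Z @ B @ Z.T
    # Sample the strict upper triangle and mirror it:
    # the graph is undirected and has no self-loops (i != j).
    U = rng.random((N, N)) < Theta
    W = np.triu(U, k=1)
    W = (W | W.T).astype(int)
    return W, Z

# Example: two dense blocks of five nodes with sparse inter-block edges.
B = np.array([[0.8, 0.05],
              [0.05, 0.8]])
W, Z = sample_sbm([5, 5], B)
```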
3 Methodology

For a specific model, the likelihood is a function of the model parameters, describing the probability of obtaining the observed data under those parameters. The maximum likelihood estimator is the setting of the parameters that maximizes the likelihood function.

As defined in Eq. (1), if only $W$ is given, the log-likelihood function is

$$L(B,Z|W) = \sum_{i \neq j} \log \Pr(W_{ij}) = \sum_{i \neq j} \log\big[(1 - W_{ij}) + (2W_{ij} - 1)\Theta_{ij}\big].$$

It is very time consuming to maximize such a likelihood function directly with traditional optimization methods (for example, branch-and-bound) for large graphs, in which there are at least $NK$ unknown variables.

For the sake of speed and scalability, we propose a fast algorithm that updates $B$ and $Z$ in turn to maximize the objective function $L(B,Z|W)$, and we use a multi-stage framework to help the solution stay close to the global optimum. We also develop a parallel implementation to make the method more scalable.

3.1 Estimation Algorithm

Given $W$, the maximum likelihood estimates of $(B,Z)$ are defined as

$$\operatorname*{argmax}_{B,Z}\, L(B,Z|W) = \operatorname*{argmax}_{B,Z} \sum_{i \neq j} \log\big[(1 - W_{ij}) + (2W_{ij} - 1)(ZBZ^T)_{ij}\big], \qquad (2)$$

subject to $0 \le B_{ij} \le 1$, $Z_{ij} \in \{0,1\}$, and $\sum_j Z_{ij} = 1$. Roughly speaking, we solve this optimization problem by alternately updating $B$ and $Z$, and we use a community shrinking and expanding strategy to improve accuracy.

We first describe the alternating update procedure. Suppose $Z$ is fixed and $B$ is unknown; without loss of generality, let $\beta = B_{rk}$. If the other entries are fixed, $L(B,Z|W) = s \log(\beta) + \hat{s} \log(1-\beta) + C$, where $C$ is a constant, $s = \sum_{ij} W_{ij} (Z_{:r} Z_{:k}^T)_{ij}$, and $\hat{s} = \sum_{ij} (1 - W_{ij})(Z_{:r} Z_{:k}^T)_{ij}$. Taking the derivative of $L$ with respect to $\beta$,

$$\frac{\partial L}{\partial \beta} = \frac{s}{\beta} - \frac{\hat{s}}{1-\beta}, \qquad (3)$$

and setting the derivative to zero, we have

$$\beta = \frac{s}{s + \hat{s}}. \qquad (4)$$

As the inter-block connection probabilities are small, we use one representative scalar to replace the off-diagonal entries of $B$; it can be computed by counting all the inter-community edges. Thus, the total time complexity of updating $B$ is $O(N) + O(K) + O(M) = O(M)$.

Theorem 1 For fixed $Z$, the objective function $L(B,Z|W)$ achieves its global maximum if the entries of $B$ are updated according to Eq. (4).

Proof When $Z$ is fixed, each entry of $B$ optimizes the objective function independently, so after all the entries are updated by Eq. (4), the resulting $B$ is a stationary point of the objective function and each entry satisfies the constraints. Taking the second derivative of the objective function, we have

$$\left.\frac{\partial^2 L}{\partial \beta^2}\right|_{\beta = \frac{s}{s+\hat{s}}} = -\frac{s}{\beta^2} - \frac{\hat{s}}{(1-\beta)^2} < 0. \qquad (5)$$

Therefore, when $Z$ is given, the objective function at the $\beta$ determined by Eq. (4) is a global maximum. As the entries of $B$ are independent of one another, we can update $B$ by Eq. (4) sequentially.
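In code, the update of Eq. (4) with the scalar off-diagonal approximation amounts to counting realized edges against possible node pairs. The sketch below is an illustrative reimplementation with hypothetical names, not the authors' code; it assumes a dense 0/1 adjacency matrix and integer labels.

```python
import numpy as np

def update_B(W, labels, K):
    """Eq. (4): every entry of B is s / (s + s_hat); the off-diagonal
    entries are replaced by a single scalar, as described above."""
    labels = np.asarray(labels)
    N = len(labels)
    sizes = np.bincount(labels, minlength=K).astype(float)
    iu, ju = np.triu_indices(N, k=1)          # each node pair once
    edge = W[iu, ju] == 1
    same = labels[iu] == labels[ju]
    # Diagonal s: realized intra-block edges, per block.
    intra = np.bincount(labels[iu][edge & same], minlength=K).astype(float)
    # Diagonal s + s_hat: all node pairs inside each block.
    pairs_in = sizes * (sizes - 1) / 2
    diag = np.divide(intra, pairs_in, out=np.zeros(K), where=pairs_in > 0)
    # Scalar off-diagonal: all inter-block edges over all inter-block pairs.
    inter_edges = edge.sum() - intra.sum()
    inter_pairs = N * (N - 1) / 2 - pairs_in.sum()
    off = inter_edges / inter_pairs if inter_pairs > 0 else 0.0
    B = np.full((K, K), off)
    np.fill_diagonal(B, diag)
    return B
```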

When $B$ is fixed, we use the block coordinate descent method to update $Z$ row by row. When updating the first $Z$ in $ZBZ^T$ in Eq. (2), the algorithm keeps the second $Z$ at its previous estimate $Z^{(t-1)}$. Then the likelihood function can be locally maximized by setting all the elements of row $i$ to 0 except $Z_{i r_{\max}} = 1$, where $r_{\max}$ is chosen by

$$r_{\max} = \operatorname*{argmax}_{r} \sum_{j \neq i} \log\Big[(1 - W_{ij}) + (2W_{ij} - 1)\big(B\,[Z^{(t-1)}]^T\big)_{rj}\Big]. \qquad (6)$$

The time complexity of updating one node by Eq. (6) is $O(NK)$, and for all the nodes it is $O(N^2 K)$.

To reduce the time complexity, we propose a faster method. Let $N^{(c)}$ be a column vector containing the number of nodes in each community, which is updated whenever a row $Z_{i:}$ changes. Let $N^{(d)}$ be a column vector containing the number of nodes connected to node $i$ in each community. We then have a new update rule:

$$r_{\max} = \operatorname*{argmax}_{r} \sum_{k=1}^{K} \Big[ N^{(d)}_k \log(B_{rk}) + \big(N^{(c)}_k - N^{(d)}_k\big) \log(1 - B_{rk}) \Big]. \qquad (7)$$

The time complexity is $O(M_i) + O(K^2)$, where $M_i$ is the degree of node $i$, used in computing $N^{(d)}$. If we use a scalar to represent the inter-block connection probabilities as before, we can replace the summation in Eq. (7) by two multiplications. We also note that a node will only take a block label carried by its neighbors, so enumerating the choices of $r$ in Eq. (7) takes $O(M_i)$ time for node $i$. Thus, the time complexity of computing the labels of all the nodes is reduced to $\sum_{i=1}^{N} O(M_i) = O(M)$.
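The fast rule can be sketched as follows: $N^{(d)}$ is accumulated by one scan of node $i$'s adjacency list, only labels carried by neighbors are scored, and only the terms of Eq. (7) that depend on the candidate $r$ are kept (which is valid under the scalar off-diagonal approximation, since the remaining terms are the same for every candidate). The names are ours, and the sketch assumes $0 < B_{rr} < 1$ and $0 < q < 1$.

```python
import numpy as np

def best_label(i, adj, labels, comm_sizes, B_diag, q):
    """Choose r_max for node i as in Eq. (7), in O(M_i) time.

    adj: adjacency lists; labels: current label of each node;
    comm_sizes: N^(c); B_diag[r]: intra-community density B_rr;
    q: the scalar standing in for all off-diagonal entries of B.
    """
    # N^(d): number of neighbors of i per community, from one scan.
    nbr = {}
    for j in adj[i]:
        nbr[labels[j]] = nbr.get(labels[j], 0) + 1
    best_r, best_val = labels[i], -np.inf
    for r, n_d in nbr.items():
        n_c = comm_sizes[r]
        # Relative log-likelihood of label r; terms independent of r dropped.
        val = (n_d * np.log(B_diag[r] / q)
               + (n_c - n_d) * np.log((1 - B_diag[r]) / (1 - q)))
        if val > best_val:
            best_r, best_val = r, val
    return best_r
```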
For fixed $B$, we define $G(B, Z^{(2)}, Z^{(1)}) = \sum_{i \neq j} \log[(1 - W_{ij}) + (2W_{ij} - 1)(Z^{(2)} B\, Z^{(1)T})_{ij}]$. The objective function can therefore be rewritten as $L(B,Z|W) := G(B,Z,Z)$. If the entries of $Z$ were continuous variables, the above optimization could find a global maximum.

Theorem 2 $\min\, -G(B, Z^{(t)}, Z^{(t-1)})$ under the constraints $\sum_r Z_{ir} = 1$ and $0 \le Z_{ir} \le 1$ is a convex optimization problem when $B$ and $Z^{(t-1)}$ are given.

We omit the proof because it is direct.

However, in our model $Z$ is a Boolean matrix. Thus, the iteration will probably converge to a local optimum, in which nodes from several true communities may end up in one single temporary community. Here we refer to the community that a subset of nodes indeed belongs to as the true community for those nodes, while a community determined by the algorithm at some iteration is called a temporary community.

The local optimum problem can be avoided by limiting each temporary community to contain only one true community in the initialization. The next subsection describes the details.

Community Shrinking and Expanding

The community shrinking and expanding approach works as follows. First, randomly initialize the nodes into $\alpha K$ temporary communities, where $\alpha$ is large enough that the community sizes are tiny. Second, run the inference algorithm while gradually merging communities, reducing their number to $K$. When $K$ is unknown, the algorithm proceeds until no more merging is possible.

This approach significantly reduces the "collision" probability in the initialization, where a "collision" is the situation in which two true communities both have most of their nodes in one temporary community. The ratio of the "non-collision" probability after shrinking to that probability without shrinking is

$$\prod_{k=0}^{K-1} \frac{1 - \frac{k}{\alpha K}}{1 - \frac{k}{K}} \;\ge\; \Big(1 - \frac{1}{\alpha}\Big)^{K} \frac{K^K}{K!}.$$

The lower bound is not sensitive to the choice of $\alpha$ when $\alpha$ is large enough. For example, the lower bounds at $\alpha = 10$ and $\alpha = 100$ are $0.9^K \frac{K^K}{K!}$ and $0.99^K \frac{K^K}{K!}$, respectively, both of which are very significant. In practice, we can choose $\alpha K = N$.

Initializing an extra number of communities may leave some tiny communities in the final result. Community expanding is therefore devised by considering the likelihood of the inter-community edges: when those edges are more likely to lie within one community, the corresponding nodes are merged. In detail, for any two communities $r$ and $k$, the number of edges between them is

$$c_{rk} = \sum_{i \neq j} Z_{ir} W_{ij} Z_{jk},$$

out of $n_{rk} = \sum_i Z_{ir} \times \sum_i Z_{ik}$, the maximum possible number of connections. The log-likelihood that the inter-community edges $c_{rk}$ belong to the true community $r$ can then be written as $L_p$, while the likelihood that they do not is $L_q$:

$$L_p = c_{rk} \log(p) + (n_{rk} - c_{rk}) \log(1 - p), \qquad (8)$$

$$L_q = c_{rk} \log(q) + (n_{rk} - c_{rk}) \log(1 - q), \qquad (9)$$

where $p = (c_{rr} + c_{rk} + c_{kr} + c_{kk})/(n_{rr} + n_{rk} + n_{kr} + n_{kk})$ is the internal density after merging, and $q = B_{rk}$. If $L_p > L_q$, we merge $r$ and $k$ into one community; otherwise, we leave them separate.

The time complexity of each merging pass is $O(K^2)$. However, if we only consider pairs of communities with at least one edge between them, the time complexity becomes $O(M)$. Community merging runs at the end of each stage, after updating $Z$ and $B$. Algorithm 1 gives the pseudocode of the serial algorithm.
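The merge test of Eqs. (8)-(9) can be sketched as below, assuming the counts $c_{rk}$ and $n_{rk}$ are kept as matrices; the clamping of the densities is a sketch-level guard against $\log 0$, not something specified in the paper.

```python
import numpy as np

def should_merge(c, n, r, k, B):
    """Return True when communities r and k should be merged (L_p > L_q).

    c[r, k]: edge count between communities r and k;
    n[r, k]: maximum possible connections n_rk; B: current block matrix.
    """
    c_rk, n_rk = c[r, k], n[r, k]
    # Internal density of the hypothetically merged community (Eq. 8's p).
    p = (c[r, r] + c[r, k] + c[k, r] + c[k, k]) / \
        (n[r, r] + n[r, k] + n[k, r] + n[k, k])
    q = B[r, k]

    def loglik(prob):
        prob = min(max(prob, 1e-12), 1 - 1e-12)   # guard the logarithms
        return c_rk * np.log(prob) + (n_rk - c_rk) * np.log(1 - prob)

    return loglik(p) > loglik(q)
```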

Convergence Rate

We consider an ideal stochastic block model comprising $K$ true communities with identical properties (community size, internal density, etc.). We define $p_R^{(0)} \approx \frac{1}{K^2}$ as the expected ratio of the number of nodes belonging to the same true community to the total number of nodes in a temporary community at iteration zero.

Theorem 3 For an ideal stochastic block model, ML-SBM with random initialization converges quadratically:

$$\lim_{t \to \infty} \frac{\big|p_R^{(t)} - \frac{1}{K}\big|}{\big|p_R^{(t-1)} - \frac{1}{K}\big|^{2}} = K.$$

The result can be proved by utilizing the concept of temporary communities.

Algorithm 1 Serial Algorithm (ML-SBM)

  set community number to αK
  initialize B, Z^(t) into many tiny communities
  Z^(t−1) = Z^(t)
  repeat
    repeat
      for i = 1 : N do
        compute N^(d) for node i
        update Z_i:^(t) according to Eq. (7)
        Z_i:^(t−1) = Z_i:^(t)
        update N^(c)
      end for
      update B according to Eq. (4)
    until Z^(t) does not change anymore
    merge communities according to Eqs. (8) and (9)
  until no more merging is possible

Algorithm 2 Parallel Algorithm (Par-SBM)

  set community number to αK
  initialize B, Z^(t) into many tiny communities
  Z^(t−1) = Z^(t)
  repeat
    repeat
      do in parallel on each processor b
        for i ∈ SI_b do
          compute N^(d) for node i
          update Z_i:^(t) according to Eq. (7)
          Z_i:^(t−1) = Z_i:^(t)
          synchronize changes of Z_i: at frequency f
        end for
        update s_b and ŝ_b for the β_b entries
        synchronize s = Σ_b s_b and ŝ = Σ_b ŝ_b
        update B according to Eq. (4)
        update N^(c)
    until Z^(t) does not change anymore
    merge communities according to Eqs. (8) and (9)
  until no more merging is possible
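To make the control flow of Algorithm 1 concrete, here is a compact, illustrative transcription in Python. It is a sketch rather than the authors' implementation: it uses the scalar off-diagonal approximation throughout, adds Laplace smoothing so that the tiny initial communities have usable densities (an assumption of ours, not from the paper), and omits the merging stage (see the merge sketch above).

```python
import numpy as np

def ml_sbm_sketch(adj, n_init, sweeps=30, seed=0):
    """Simplified ML-SBM loop. adj: dict {int node: list of neighbours}."""
    rng = np.random.default_rng(seed)
    nodes = sorted(adj)
    N = len(nodes)
    labels = {v: int(rng.integers(n_init)) for v in nodes}  # tiny communities

    def model_params():
        """Community sizes N^(c), smoothed densities p_r, scalar q (Eq. 4)."""
        sizes, intra, m_total, m_intra = {}, {}, 0, 0
        for v in nodes:
            sizes[labels[v]] = sizes.get(labels[v], 0) + 1
        for v in nodes:
            for u in adj[v]:
                if u <= v:
                    continue                  # count each undirected edge once
                m_total += 1
                if labels[u] == labels[v]:
                    intra[labels[v]] = intra.get(labels[v], 0) + 1
                    m_intra += 1
        # Laplace smoothing keeps singleton communities from density 0/0.
        p = {r: (intra.get(r, 0) + 1.0) / (s * (s - 1) / 2 + 2.0)
             for r, s in sizes.items()}
        inter_pairs = N * (N - 1) / 2 - sum(s * (s - 1) / 2
                                            for s in sizes.values())
        q = max((m_total - m_intra) / max(inter_pairs, 1.0), 1e-9)
        return sizes, p, min(q, 0.5)

    for _ in range(sweeps):
        sizes, p, q = model_params()          # update B (Eq. 4)
        changed = False
        for v in nodes:                       # update Z row by row (Eq. 7)
            counts = {}
            for u in adj[v]:
                counts[labels[u]] = counts.get(labels[u], 0) + 1
            best, best_val = labels[v], -np.inf
            for r, n_d in counts.items():
                pr = min(max(p[r], 1e-9), 1 - 1e-9)
                val = (n_d * np.log(pr / q)
                       + (sizes[r] - n_d) * np.log((1 - pr) / (1 - q)))
                if val > best_val:
                    best, best_val = r, val
            if best != labels[v]:
                labels[v], changed = best, True
        if not changed:                       # merging stage omitted here
            break
    return labels

# Toy example: two triangles joined by a single edge.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
print(ml_sbm_sketch(adj, n_init=6))
```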
3.2 Parallelism

The parallelism is designed as follows. Let $SI$ be the set of integers from 1 to $N$, partitioned into non-overlapping and approximately equal-size subsets, each of which is denoted $SI_b$ for the $b$th processor. On each processor, we also use local variables $s_b$, $\hat{s}_b$, and $\beta_b$, which have definitions similar to those of $s$, $\hat{s}$, and $\beta$ but take only the rows $i \in SI_b$ of $W$ into account.

We choose a message-passing model, which is applicable to a variety of memory structures, whether shared, distributed, or hybrid. We use non-blocking MPI communication to improve the communication-computation overlap. This has a two-fold benefit: if the hardware allows, it utilizes the network bandwidth during computation, and it maintains a convergence rate for $Z$ similar to that of the serial version by transmitting up-to-date community labels. To reduce the overall communication volume, changes of $Z_{i:}$ are sent to processor $b$ only if $i$ has neighboring nodes in $SI_b$. Because the communication bandwidth is higher when messages are larger, we buffer the change information of $Z$ until the computation of $f$ rows of $Z$ finishes. Here $f$ is chosen to be inversely proportional to the number of MPI ranks and proportional to $N$, which maintains a consistent convergence rate.

The algorithm uses similar partitioning and communication strategies when computing $B$. Algorithm 2 presents the pseudocode of the parallel algorithm.

In parallel computing, the rows of $W$ are distributed over the processors. If there are in total $g$ processors (equal to the number of MPI ranks) and the data are evenly distributed, the computation time on each processor for Eq. (4) and Eq. (7) is $O(\frac{M}{g})$. When the computations for $B$ are partitioned across the processors, reducing the number of communities also requires $O(\frac{M}{g})$ per processor. The total message length for $Z$ and $B$ during the iterations is $O(g(N + K))$. If the bandwidth is constant, the speedup of the parallel algorithm is

$$\frac{T(M)}{T\!\left(\frac{M}{g}\right) + O(N + K)}, \qquad (10)$$

where $T$ is a linear function such that the numerator and the denominator indicate the running times of the serial and parallel versions, respectively. Therefore, the speedup increases toward $g$ as the number of edges increases for fixed $N$ and $K$.
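The two communication patterns described above (a global reduction for the $s$ and $\hat{s}$ of Eq. (4), and non-blocking sends of buffered label changes that overlap the next batch of Eq. (7) updates) can be illustrated with mpi4py. This is a minimal sketch of the pattern, with fabricated local counts; it is not the authors' C implementation.

```python
# Run with, e.g.: mpiexec -n 4 python parsbm_comm_sketch.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, g = comm.Get_rank(), comm.Get_size()

# Each rank b owns the rows SI_b of W; local edge counts are faked here.
s_b, shat_b = 100.0 + rank, 900.0 - rank

# Global s and s_hat give every rank the same beta = s / (s + s_hat), Eq. (4).
s = comm.allreduce(s_b, op=MPI.SUM)
shat = comm.allreduce(shat_b, op=MPI.SUM)
beta = s / (s + shat)

# Buffered label changes (node id, new label) go out with non-blocking sends,
# so the computation of the next f rows of Z can overlap the communication.
changes = np.array([[rank, rank % 3]], dtype=np.int64)
others = [d for d in range(g) if d != rank]
send_reqs = [comm.Isend(changes, dest=d, tag=0) for d in others]
inbox = {d: np.empty_like(changes) for d in others}
recv_reqs = [comm.Irecv(inbox[d], source=d, tag=0) for d in others]
# ... label updates for the next f rows would run here ...
MPI.Request.Waitall(send_reqs + recv_reqs)
if rank == 0:
    print(f"beta = {beta:.4f}, received {len(inbox)} change batches")
```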

4 Experimental Results

4.1 The Serial Algorithm

In this section, we compare our algorithm (ML) with other serial algorithms in MATLAB. The competitors include the Louvain method (LV) [Blondel et al., 2008], a Bayesian inference algorithm (BI) [Hofman and Wiggins, 2008], spectral clustering (SC) [Von Luxburg, 2007], label propagation (LP) [Raghavan et al., 2007], and modularity optimization (MM) [Newman, 2004]. We use the LFR benchmark generator [Lancichinetti and Fortunato, 2009] to create the graphs on which all the algorithms are compared. Each generated graph has size 1000, average degree 30, maximum degree 50, degree distribution exponent −2, community size distribution exponent −2, minimum community size 20, and maximum community size 100. The mixing parameter µ is the proportion of inter-community connections to intra-community ones. Accuracy is evaluated by normalized mutual information (NMI) [Danon et al., 2005], which measures the similarity between the computed community structure and the ground-truth one. The larger the NMI, the more similar the two schemes; if two schemes are identical, the NMI is 1. The results are averaged over five runs.
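For reference, NMI can be computed with scikit-learn; its default arithmetic-mean normalization matches the commonly used definition, though the exact variant of [Danon et al., 2005] used in the paper may differ slightly.

```python
from sklearn.metrics import normalized_mutual_info_score

# Ground-truth labels and two computed partitions for ten nodes.
truth   = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
perfect = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]   # same split, labels renamed
noisy   = [0, 0, 0, 1, 0, 1, 1, 0, 1, 1]

print(normalized_mutual_info_score(truth, perfect))  # 1.0: identical schemes
print(normalized_mutual_info_score(truth, noisy))    # < 1.0
```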

From Figs. 1(a) and 1(b), we can see that our algorithm is relatively fast and has the best accuracy. When µ = 0, most of the algorithms find the correct result, and ours is the second fastest. As µ increases to 0.7, our algorithm needs more time to iterate, but its accuracy is still the best; only SC and LV achieve similar accuracy in a similar amount of time.

[Figure 1: Comparison to popular algorithms in MATLAB. (a) NMI vs. µ and (b) running time vs. µ show the variation of average accuracy and running time over different choices of µ; (c) NMI vs. N and (d) running time vs. N show the variation over different choices of graph size N.]

However, for large graphs, SC is not scalable and LV is not accurate; our algorithm is the method of choice for both scalability and accuracy. Figs. 1(c) and 1(d) show results on graphs similar to those of the previous experiment, but with average degree 10 and µ = 0.1. On large graphs, compared with the other algorithms, ours consistently produces more accurate results with respect to the ground-truth setting, and its running time grows more slowly with graph size than the others'. The slow growth of our running time agrees with the theoretical time complexity, which is linear in the number of edges.

4.2 The Parallel Algorithm

In this section, we exploit distributed-memory parallelism for larger graphs. When the input graph has more than one million nodes, typical algorithms for stochastic block models cannot infer the parameters within a reasonable amount of time (for example, 24 hours). Therefore, we compare our algorithm (PS, followed by the number of processors) with a popular community detection algorithm, the Louvain method (LV) [Blondel et al., 2008], and with a fast graph partitioning algorithm, Metis (MT) [Karypis and Kumar, 1998]. The Louvain method implemented in C can also produce level-one structures as a byproduct, containing the finest communities (LVF); those communities are merged into larger ones in LV. Parallel algorithms are often extended from such serial versions [Wickramaarachchi et al., 2014], but the results are typically invariant. We use several metrics to compare the results: NMI, modularity [Clauset et al., 2004], and conductance (the smaller, the better) [Leskovec et al., 2009]. The results for those quantities are averaged over five runs.
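The paper does not spell out the conductance formula; the sketch below uses the standard definition from [Leskovec et al., 2009]: the number of edges crossing the community boundary divided by the smaller of the two volumes (sums of degrees) on either side.

```python
import numpy as np

def conductance(W, members):
    """Conductance of one community (smaller is better).

    W: dense 0/1 adjacency matrix; members: boolean mask of the community.
    """
    members = np.asarray(members, dtype=bool)
    cut = W[members][:, ~members].sum()   # edges leaving the community
    vol_in = W[members].sum()             # degree mass inside
    vol_out = W[~members].sum()           # degree mass outside
    return cut / max(min(vol_in, vol_out), 1)

# Two triangles joined by a single bridge edge.
W = np.zeros((6, 6), dtype=int)
for a, b in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    W[a, b] = W[b, a] = 1
print(conductance(W, [True, True, True, False, False, False]))  # 1/7 ≈ 0.143
```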

Table 1: Properties of the test graphs

    Graph Name       Node Number    Edge Number
    LFR-1e6            1,000,000     ∼5,000,000
    cit-Patents        3,774,768     16,518,948
    dblp-Coauthor      1,314,050     10,724,828

First, we run our algorithm on benchmark graphs (LFR-1e6 in Table 1) generated by the LFR benchmark [Lancichinetti and Fortunato, 2009]. The parameters are similar to the serial case, except that the graph size is increased to one million and we vary two parameters (the mixing parameter µ and the maximum community size mc) to obtain more variation. Fig. 2 shows the results, in which GT is the ground truth given by the generator.

[Figure 2: Comparison on benchmark data. (a) NMI; (b) running time; (c) number of communities; (d) modularity.]

Generally, increasing µ and mc decreases the community detection accuracy, but the results generated by our algorithm are always closest to the ground truth according to NMI. The results of our algorithm are also very close to the ground truth in the number of communities (except for MT, where the number is predefined) and in modularity, and they are usually better than the others in modularity. In Fig. 2(b), as the problem becomes more difficult from left to right, our algorithm spends more time to maintain the quality, while the other algorithms finish within the same amount of time at the cost of quality. In addition, the performance of our algorithm is stable: the results using 8 processors (PS8) and 64 processors (PS64) are fairly close.
We also compare the algorithms on real-world graphs: cit-Patents [Leskovec et al., 2005] and dblp-Coauthor [Ley, 2002]. Nodes in cit-Patents are patents in the U.S. patent dataset, and edges represent citation relationships between the patents. dblp-Coauthor is a coauthorship graph extracted from publications in the DBLP database.

[Figure 3: Number of communities versus community size. (a) cit-Patents; (b) dblp-Coauthor.]

It is also worth noting that LVF achieves a higher NMI and more communities than LV in the benchmark test. This indicates that LV suffers from the resolution limit problem [Fortunato and Barthelemy, 2007], which prefers unrealistically large communities. A similar problem occurs with LV on the real-world data, as it generates many communities containing more than $10^4$ nodes (Fig. 3). For the dblp-Coauthor network, for example, the biggest community found by LV contains about $10^5$ nodes (10% of the whole graph). This is unrealistic for collaboration networks and is orders of magnitude larger than a reasonable community size for human interaction [Leskovec et al., 2009], so LV is omitted from the remaining comparisons. Alternatively, choosing low-level structures such as LVF can be a remedy [Good et al., 2010], although LVF still contains a few extremely large communities. On the other hand, MT tends to find communities of even size, which is not desirable for community detection either.

Table 2 shows the quality of the results produced by the different algorithms.

Table 2: Comparison on real-world data

    Graph            Algo    Cond     Mod     Time (sec)
    cit-Patents      PS1     0.418    0.631   1514
                     PS8     0.416    0.638    284
                     PS64    0.415    0.639     51
                     LVF     0.594    0.560    103
                     MT      0.488    0.487    243
    dblp-Coauthor    PS1     0.114    0.631    186
                     PS8     0.113    0.632     33
                     PS64    0.112    0.632      6
                     LVF     0.397    0.561     22
                     MT      0.621    0.253     73

4.3 Microscopic Example on the dblp-Coauthor Network

In this section, we analyze the community detection results of one run of our Par-SBM (PS) and of the Louvain method [Blondel et al., 2008] with random initialization. For the Louvain method, we pick the level-one structure with the finest communities (LVF) and the level-two structure, denoted LV2; communities become larger in the higher-level structures of the Louvain method. Taking the Conference Chair as an example, we consider the community containing the author "Michael Wooldridge," and we use S_PS, S_LVF, and S_LV2 to denote the node sets of the corresponding communities found by Par-SBM and by the Louvain method, respectively. The community size found by our algorithm is comparable to that of LVF but much smaller than that of LV2: |S_PS| = 255 and |S_LVF| = 166, while |S_LV2| = 26294.

[Figure 4: Subgraph of the dblp-Coauthor network, where colors represent the community labels determined by different algorithms. (a) by Par-SBM; (b) by LVF; (c) by LV2.]

All the algorithms are able to identify the blue nodes in Fig. 4 as a separate community, while our Par-SBM provides the clearest detail. S_LV2 contains one hundred times more nodes than S_PS. Among all the communities found by Par-SBM, the red, green, and purple communities in Fig. 4(a) are the top three contributors to S_LV2, owning 2.96% of the total nodes in S_LV2. The three communities can be represented by three nodes, respectively: purple by Dr. Michael Wooldridge (multi-agent systems), green by Dr. Bruce W. Porter (knowledge systems), and red by Dr. Christopher W. Geib (Reports of the AAAI Conference Workshops). The nodes in red form a strong community because the authors (although from different areas of AI) are fully connected by the annual workshop reports. However, some nodes in the aforementioned three communities are inappropriately separated by LVF, as shown in Fig. 4(b).

5 Concluding Remarks

In this paper, we have proposed a fast algorithm for clustering nodes in large graphs using stochastic block models. We have adopted alternating updates and coordinate descent methods to infer the latent parameters, and we have used community shrinking and expanding to improve accuracy. Our algorithm has linear time complexity in the number of edges and can scale to multi-processor systems.

Compared with other community detection algorithms, our SBM-based method produces high-quality results on benchmark graphs and finds interesting communities in real-world graphs.

This work can boost the application of SBMs to big data and bring new insights by overcoming the accuracy and running-time shortcomings of traditional algorithms. The code is available at https://github.com/pdacorn/par-sbm.

References

[Ansótegui et al., 2012] Carlos Ansótegui, Jesús Giráldez-Cru, and Jordi Levy. The community structure of SAT formulas. In Theory and Applications of Satisfiability Testing – SAT 2012, pages 410–423. Springer, 2012.

[Blondel et al., 2008] Vincent D Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008, 2008.

[Celisse et al., 2012] Alain Celisse, Jean-Jacques Daudin, and Laurent Pierre. Consistency of maximum-likelihood and variational estimators in the stochastic block model. Electronic Journal of Statistics, 6:1847–1899, 2012.

[Chen et al., 2012] Yudong Chen, Sujay Sanghavi, and Huan Xu. Clustering sparse graphs. In Advances in Neural Information Processing Systems 25, pages 2213–2221, 2012.

[Clauset et al., 2004] Aaron Clauset, Mark EJ Newman, and Cristopher Moore. Finding community structure in very large networks. Physical Review E, 70(6):066111, 2004.

[Danon et al., 2005] Leon Danon, Albert Diaz-Guilera, Jordi Duch, and Alex Arenas. Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment, 2005(09):P09008, 2005.

[Daudin et al., 2008] Jean-Jacques Daudin, Franck Picard, and Stéphane Robin. A mixture model for random graphs. Statistics and Computing, 18(2):173–183, 2008.

[Fortunato and Barthelemy, 2007] Santo Fortunato and Marc Barthelemy. Resolution limit in community detection. Proceedings of the National Academy of Sciences, 104(1):36–41, 2007.

[Fortunato and Castellano, 2012] Santo Fortunato and Claudio Castellano. Community structure in graphs. In Computational Complexity, pages 490–512. Springer, 2012.

[Good et al., 2010] Benjamin H Good, Yves-Alexandre de Montjoye, and Aaron Clauset. Performance of modularity maximization in practical contexts. Physical Review E, 81(4):046106, 2010.

[Hofman and Wiggins, 2008] Jake M Hofman and Chris H Wiggins. Bayesian approach to network modularity. Physical Review Letters, 100(25):258701, 2008.

[Holland et al., 1983] Paul W. Holland, Kathryn B. Laskey, and Samuel Leinhardt. Stochastic blockmodels: First steps. Social Networks, 5(2):109–137, 1983.

[Karypis and Kumar, 1998] George Karypis and Vipin Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing, 20(1):359–392, 1998.

[Kemp et al., 2006] Charles Kemp, Joshua B Tenenbaum, Thomas L Griffiths, Takeshi Yamada, and Naonori Ueda. Learning systems of concepts with an infinite relational model. In AAAI, volume 3, page 5, 2006.

[Lancichinetti and Fortunato, 2009] Andrea Lancichinetti and Santo Fortunato. Benchmarks for testing community detection algorithms on directed and weighted graphs with overlapping communities. Physical Review E, 80(1):016118, 2009.

[Leskovec et al., 2005] Jure Leskovec, Jon Kleinberg, and Christos Faloutsos. Graphs over time: densification laws, shrinking diameters and possible explanations. In Proceedings of the 11th ACM International Conference on Knowledge Discovery in Data Mining, pages 177–187. ACM, 2005.

[Leskovec et al., 2009] Jure Leskovec, Kevin J Lang, Anirban Dasgupta, and Michael W Mahoney. Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters. Internet Mathematics, 6(1):29–123, 2009.

[Ley, 2002] Michael Ley. The DBLP computer science bibliography: Evolution, research issues, perspectives. In Proceedings of the International Symposium on String Processing and Information Retrieval, pages 1–10, 2002.

[Liu et al., 2013] Jialu Liu, Chi Wang, Marina Danilevsky, and Jiawei Han. Large-scale spectral clustering on graphs. In Proceedings of the 23rd International Joint Conference on Artificial Intelligence, pages 1486–1492. AAAI Press, 2013.

[Nadler and Galun, 2006] Boaz Nadler and Meirav Galun. Fundamental limitations of spectral clustering. In Advances in Neural Information Processing Systems, pages 1017–1024, 2006.

[Newman, 2004] Mark EJ Newman. Fast algorithm for detecting community structure in networks. Physical Review E, 69(6):066133, 2004.

[Raghavan et al., 2007] Usha Nandini Raghavan, Réka Albert, and Soundar Kumara. Near linear time algorithm to detect community structures in large-scale networks. Physical Review E, 76(3):036106, 2007.

[Staudt and Meyerhenke, 2015] C. Staudt and H. Meyerhenke. Engineering parallel algorithms for community detection in massive networks. IEEE Transactions on Parallel and Distributed Systems, PP(99):1–1, 2015.

[Von Luxburg, 2007] Ulrike Von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, 2007.

[Wickramaarachchi et al., 2014] Charith Wickramaarachchi, Marc Frincu, Patrick Small, and Viktor Prasanna. Fast parallel algorithm for unfolding of communities in large graphs. In 18th IEEE High Performance Extreme Computing Conference (HPEC '14), 2014.

[Xiang and Hu, 2012] Ju Xiang and Ke Hu. Limitation of multi-resolution methods in community detection. Physica A: Statistical Mechanics and its Applications, 391(20):4995–5003, 2012.
