Proceedings of the Twenty-Fourth International Joint Conference on Artificial Intelligence (IJCAI 2015)

A Scalable Community Detection Algorithm for Large Graphs Using Stochastic Block Models

Chengbin Peng¹, Zhihua Zhang², Ka-Chun Wong³, Xiangliang Zhang¹, David E. Keyes¹
¹King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
²Shanghai Jiao Tong University, Shanghai, China
³City University of Hong Kong, Hong Kong, China
{chengbin.peng, xiangliang.zhang, david.keyes}@kaust.edu.sa, [email protected], [email protected]

Abstract

Community detection in graphs is widely used in social and biological networks, and the stochastic block model is a powerful probabilistic tool for describing graphs with community structures. However, in the era of "big data," traditional inference algorithms for such a model are increasingly limited by their high time complexity and poor scalability. In this paper, we propose a multi-stage maximum likelihood approach that recovers the latent parameters of the stochastic block model in time linear in the number of edges. We also propose a parallel algorithm based on message passing. Our algorithm overlaps communication and computation, providing speedup without compromising accuracy as the number of processors grows. For example, to process a real-world graph with about 1.3 million nodes and 10 million edges, our algorithm requires about 6 seconds on 64 cores of a contemporary commodity Linux cluster. Experiments demonstrate that the algorithm can produce high-quality results on both benchmark and real-world graphs. An example of finding more meaningful communities than a popular modularity maximization algorithm is also illustrated.

1 Introduction

Community structures, in which nodes belonging to the same community are more densely connected to each other than to the rest of the graph, prevail in real graphs. Finding those structures can be beneficial in many fields, such as finding protein complexes in biological networks and topical or disciplinary groups in collaborative networks [Fortunato and Castellano, 2012; Kemp et al., 2006; Ansótegui et al., 2012].

In this work, we consider non-overlapping community structures. Many community detection algorithms handle such a problem [Newman, 2004; Raghavan et al., 2007; Blondel et al., 2008; Fortunato and Castellano, 2012; Liu et al., 2013; Wickramaarachchi et al., 2014; Staudt and Meyerhenke, 2015]. However, they come with limitations on large graphs, for example in handling community heterogeneity [Nadler and Galun, 2006; Fortunato and Barthelemy, 2007; Xiang and Hu, 2012]. Alternatively, model-based methods can produce more reliable and accurate results when the model assumptions are in accordance with the real graphs. Stochastic block models (SBMs) are among the important probabilistic tools for describing the connectivity relationship between pairs of nodes [Holland et al., 1983], and they have received considerable attention in both theoretical [Celisse et al., 2012] and application [Daudin et al., 2008] domains.

Many algorithms have been proposed to infer the parameters of SBMs, such as Bayesian estimation [Hofman and Wiggins, 2008] and nuclear norm minimization [Chen et al., 2012]. Bayesian estimation defines prior distributions on the latent variables (the community labels of the nodes) and maximizes the posterior distribution given a graph [Hofman and Wiggins, 2008]. Nuclear norm minimization of a matrix minimizes the sum of its singular values; that algorithm is verified on graphs containing one thousand nodes.

In this work, we propose a multi-stage likelihood maximization algorithm based on SBMs, which has three advantages:

• Speed: We devise an algorithm based on coordinate descent and use approximations to simplify computations. Our algorithm runs in linear time with respect to the number of edges.

• Scalability: We propose a parallel algorithm based on message passing, which tends to be the most portable and performance-consistent parallel programming model for a variety of memory structures. We overlap communication and computation in the algorithm, so that for large graphs it can achieve significant speedup proportional to the number of processors. To the best of our knowledge, it is the first parallel algorithm for inferring SBMs.

• Quality: The algorithm can produce high-quality results. In the initialization, it considers each node as one community and then employs a multi-stage iterative strategy to construct larger communities gradually. It empirically outperforms traditional community detection algorithms as measured by normalized mutual information (NMI) [Danon et al., 2005] with respect to the ground truth. Experiments on real-world graphs show that our algorithm can produce more meaningful communities with good quality.

2 Stochastic Block Model

In this work, we develop a community detection algorithm based on a stochastic block model (SBM). We define $N$ as the number of nodes and $M$ as the number of undirected and unweighted edges connecting those nodes. Each node belongs to one of $K$ blocks, and we use $Z \in \{0,1\}^{N \times K}$ to represent the block labels. That is, $Z_{ir} = 1$ means that node $i$ belongs to block $r$, and each row of $Z$ contains only one nonzero entry. We also define a matrix $B \in [0,1]^{K \times K}$, where $B_{rk}$ ($r \neq k$) represents the probability of a connection between a node drawn from block $r$ and one drawn from block $k$. If $r = k$, $B_{rk}$ represents the probability of connections inside the block.

Using the matrices $B$ and $Z$, we define a probability matrix $\Theta = ZBZ^T$. Then the adjacency matrix $W$ of a sample network can be drawn from the following model:

$$\Pr(W_{ij}) = \begin{cases} \Theta_{ij} & \text{if } W_{ij} = 1, \\ 1 - \Theta_{ij} & \text{if } W_{ij} = 0, \end{cases} \qquad (1)$$

for $i, j \in \{1, 2, \cdots, N\}$ and $i \neq j$, indicating that $W_{ij}$ is a sample from the Bernoulli distribution with success rate $\Theta_{ij}$. Typically, the adjacency matrix $W$ is available from the data set. Our primary purpose is to estimate $Z$.
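For concreteness, the following minimal sketch (ours, not part of the paper's implementation) draws a small graph from the model in Eq. (1); the function name and the example block matrix are illustrative assumptions.

```python
import numpy as np

def sample_sbm(block_sizes, B, seed=0):
    """Draw a symmetric adjacency matrix W from the SBM of Eq. (1)."""
    rng = np.random.default_rng(seed)
    N, K = sum(block_sizes), len(block_sizes)
    # Z[i, r] = 1 iff node i belongs to block r (one nonzero per row).
    node_blocks = np.repeat(np.arange(K), block_sizes)
    Z = np.zeros((N, K), dtype=int)
    Z[np.arange(N), node_blocks] = 1
    # Theta = Z B Z^T is the Bernoulli success rate of every node pair.
    Theta = Z @ B @ Z.T
    # Sample the strict upper triangle and mirror it:
    # the graph is undirected and has no self-loops (i != j).
    U = rng.random((N, N)) < Theta
    W = np.triu(U, k=1)
    W = (W | W.T).astype(int)
    return W, Z

# Example: two dense blocks of five nodes with sparse inter-block edges.
B = np.array([[0.8, 0.05],
              [0.05, 0.8]])
W, Z = sample_sbm([5, 5], B)
```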
3 Methodology

For a specific model, the likelihood is a function of the model parameters, describing the probability of obtaining the observed data under those parameters. The maximum likelihood estimator is the setting of the parameters that maximizes the likelihood function.

As defined in Eq. (1), if only $W$ is given, the log-likelihood function is

$$L(B,Z|W) = \sum_{i \neq j} \log \Pr(W_{ij}) = \sum_{i \neq j} \log\big[(1 - W_{ij}) + (2W_{ij} - 1)\Theta_{ij}\big].$$

It is very time consuming to maximize such a likelihood function directly with traditional optimization methods (for example, branch-and-bound) for large graphs, in which there are at least $NK$ unknown variables.

For the sake of speed and scalability, we propose a fast algorithm that updates $B$ and $Z$ in turn to maximize the objective function $L(B,Z|W)$, and we use a multi-stage framework to help the solution stay close to the global optimum. We also develop a parallel implementation to make the method more scalable.

3.1 Estimation Algorithm

Given $W$, the maximum likelihood estimates of $(B,Z)$ are defined as

$$\operatorname*{argmax}_{B,Z}\, L(B,Z|W) = \operatorname*{argmax}_{B,Z} \sum_{i \neq j} \log\big[(1 - W_{ij}) + (2W_{ij} - 1)(ZBZ^T)_{ij}\big], \qquad (2)$$

subject to $0 \le B_{ij} \le 1$, $Z_{ij} \in \{0,1\}$, and $\sum_j Z_{ij} = 1$. Roughly speaking, we solve this optimization problem by alternately updating $B$ and $Z$, and we use a community shrinking and expanding strategy to improve accuracy.

We first describe the alternating update procedure. Suppose $Z$ is fixed and $B$ is unknown; without loss of generality, let $\beta = B_{rk}$. If the other entries are fixed, $L(B,Z|W) = s \log(\beta) + \hat{s} \log(1-\beta) + C$, where $C$ is a constant, $s = \sum_{ij} W_{ij} (Z_{:r} Z_{:k}^T)_{ij}$, and $\hat{s} = \sum_{ij} (1 - W_{ij})(Z_{:r} Z_{:k}^T)_{ij}$. Taking the derivative of $L$ with respect to $\beta$,

$$\frac{\partial L}{\partial \beta} = \frac{s}{\beta} - \frac{\hat{s}}{1-\beta}, \qquad (3)$$

and setting the derivative to zero, we have

$$\beta = \frac{s}{s + \hat{s}}. \qquad (4)$$

As the inter-block connection probabilities are small, we use one representative scalar to replace the off-diagonal entries of $B$; it can be computed by counting all the inter-community edges. Thus, the total time complexity of updating $B$ is $O(N) + O(K) + O(M) = O(M)$.

Theorem 1 For fixed $Z$, the objective function $L(B,Z|W)$ achieves its global maximum if the entries of $B$ are updated according to Eq. (4).

Proof When $Z$ is fixed, each entry of $B$ optimizes the objective function independently, so after all the entries are updated by Eq. (4), the resulting $B$ is a stationary point of the objective function and each entry satisfies the constraints. Taking the second derivative of the objective function, we have

$$\left.\frac{\partial^2 L}{\partial \beta^2}\right|_{\beta = \frac{s}{s+\hat{s}}} = -\frac{s}{\beta^2} - \frac{\hat{s}}{(1-\beta)^2} < 0. \qquad (5)$$

Therefore, when $Z$ is given, the objective function at the $\beta$ determined by Eq. (4) is a global maximum. As the entries of $B$ are independent of one another, we can update $B$ by Eq. (4) sequentially.
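In code, the update of Eq. (4) with the scalar off-diagonal approximation amounts to counting realized edges against possible node pairs. The sketch below is an illustrative reimplementation with hypothetical names, not the authors' code; it assumes a dense 0/1 adjacency matrix and integer labels.

```python
import numpy as np

def update_B(W, labels, K):
    """Eq. (4): every entry of B is s / (s + s_hat); the off-diagonal
    entries are replaced by a single scalar, as described above."""
    labels = np.asarray(labels)
    N = len(labels)
    sizes = np.bincount(labels, minlength=K).astype(float)
    iu, ju = np.triu_indices(N, k=1)          # each node pair once
    edge = W[iu, ju] == 1
    same = labels[iu] == labels[ju]
    # Diagonal s: realized intra-block edges, per block.
    intra = np.bincount(labels[iu][edge & same], minlength=K).astype(float)
    # Diagonal s + s_hat: all node pairs inside each block.
    pairs_in = sizes * (sizes - 1) / 2
    diag = np.divide(intra, pairs_in, out=np.zeros(K), where=pairs_in > 0)
    # Scalar off-diagonal: all inter-block edges over all inter-block pairs.
    inter_edges = edge.sum() - intra.sum()
    inter_pairs = N * (N - 1) / 2 - pairs_in.sum()
    off = inter_edges / inter_pairs if inter_pairs > 0 else 0.0
    B = np.full((K, K), off)
    np.fill_diagonal(B, diag)
    return B
```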

When $B$ is fixed, we use the block coordinate descent method to update $Z$ row by row. When updating the first $Z$ in $ZBZ^T$ in Eq. (2), the algorithm keeps the second $Z$ at its previous estimate $Z^{(t-1)}$. Then the likelihood function can be locally maximized by setting all the elements of row $i$ to 0 except $Z_{i r_{\max}} = 1$, where $r_{\max}$ is chosen by

$$r_{\max} = \operatorname*{argmax}_{r} \sum_{j \neq i} \log\Big[(1 - W_{ij}) + (2W_{ij} - 1)\big(B\,[Z^{(t-1)}]^T\big)_{rj}\Big]. \qquad (6)$$

The time complexity of updating one node by Eq. (6) is $O(NK)$, and for all the nodes it is $O(N^2 K)$.

To reduce the time complexity, we propose a faster method. Let $N^{(c)}$ be a column vector containing the number of nodes in each community, which is updated whenever a row $Z_{i:}$ changes. Let $N^{(d)}$ be a column vector containing the number of nodes connected to node $i$ in each community. We then have a new update rule:

$$r_{\max} = \operatorname*{argmax}_{r} \sum_{k=1}^{K} \Big[ N^{(d)}_k \log(B_{rk}) + \big(N^{(c)}_k - N^{(d)}_k\big) \log(1 - B_{rk}) \Big]. \qquad (7)$$

The time complexity is $O(M_i) + O(K^2)$, where $M_i$ is the degree of node $i$, used in computing $N^{(d)}$. If we use a scalar to represent the inter-block connection probabilities as before, we can replace the summation in Eq. (7) by two multiplications. We also note that a node will only take a block label carried by its neighbors, so enumerating the choices of $r$ in Eq. (7) takes $O(M_i)$ time for node $i$. Thus, the time complexity of computing the labels of all the nodes is reduced to $\sum_{i=1}^{N} O(M_i) = O(M)$.
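The fast rule can be sketched as follows: $N^{(d)}$ is accumulated by one scan of node $i$'s adjacency list, only labels carried by neighbors are scored, and only the terms of Eq. (7) that depend on the candidate $r$ are kept (which is valid under the scalar off-diagonal approximation, since the remaining terms are the same for every candidate). The names are ours, and the sketch assumes $0 < B_{rr} < 1$ and $0 < q < 1$.

```python
import numpy as np

def best_label(i, adj, labels, comm_sizes, B_diag, q):
    """Choose r_max for node i as in Eq. (7), in O(M_i) time.

    adj: adjacency lists; labels: current label of each node;
    comm_sizes: N^(c); B_diag[r]: intra-community density B_rr;
    q: the scalar standing in for all off-diagonal entries of B.
    """
    # N^(d): number of neighbors of i per community, from one scan.
    nbr = {}
    for j in adj[i]:
        nbr[labels[j]] = nbr.get(labels[j], 0) + 1
    best_r, best_val = labels[i], -np.inf
    for r, n_d in nbr.items():
        n_c = comm_sizes[r]
        # Relative log-likelihood of label r; terms independent of r dropped.
        val = (n_d * np.log(B_diag[r] / q)
               + (n_c - n_d) * np.log((1 - B_diag[r]) / (1 - q)))
        if val > best_val:
            best_r, best_val = r, val
    return best_r
```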
For fixed $B$, we define $G(B, Z^{(2)}, Z^{(1)}) = \sum_{i \neq j} \log[(1 - W_{ij}) + (2W_{ij} - 1)(Z^{(2)} B\, Z^{(1)T})_{ij}]$. The objective function can therefore be rewritten as $L(B,Z|W) := G(B,Z,Z)$. If the entries of $Z$ were continuous variables, the above optimization could find a global maximum.

Theorem 2 $\min\, -G(B, Z^{(t)}, Z^{(t-1)})$ under the constraints $\sum_r Z_{ir} = 1$ and $0 \le Z_{ir} \le 1$ is a convex optimization problem when $B$ and $Z^{(t-1)}$ are given.

We omit the proof because it is direct.

However, in our model $Z$ is a Boolean matrix. Thus, the iteration will probably converge to a local optimum, in which nodes from several true communities may end up in one single temporary community. Here we refer to the community that a subset of nodes indeed belongs to as the true community for those nodes, while a community determined by the algorithm at some iteration is called a temporary community.

The local optimum problem can be avoided by limiting each temporary community to contain only one true community in the initialization. The next subsection describes the details.

Community Shrinking and Expanding

The community shrinking and expanding approach works as follows. First, randomly initialize the nodes into $\alpha K$ temporary communities, where $\alpha$ is large enough that the community sizes are tiny. Second, run the inference algorithm while gradually merging communities, reducing their number to $K$. When $K$ is unknown, the algorithm proceeds until no more merging is possible.

This approach significantly reduces the "collision" probability in the initialization, where a "collision" is the situation in which two true communities both have most of their nodes in one temporary community. The ratio of the "non-collision" probability after shrinking to that probability without shrinking is

$$\prod_{k=0}^{K-1} \frac{1 - \frac{k}{\alpha K}}{1 - \frac{k}{K}} \;\ge\; \Big(1 - \frac{1}{\alpha}\Big)^{K} \frac{K^K}{K!}.$$

The lower bound is not sensitive to the choice of $\alpha$ when $\alpha$ is large enough. For example, the lower bounds at $\alpha = 10$ and $\alpha = 100$ are $0.9^K \frac{K^K}{K!}$ and $0.99^K \frac{K^K}{K!}$, respectively, both of which are very significant. In practice, we can choose $\alpha K = N$.

Initializing an extra number of communities may leave some tiny communities in the final result. Community expanding is therefore devised by considering the likelihood of the inter-community edges: when those edges are more likely to lie within one community, the corresponding nodes are merged. In detail, for any two communities $r$ and $k$, the number of edges between them is

$$c_{rk} = \sum_{i \neq j} Z_{ir} W_{ij} Z_{jk},$$

out of $n_{rk} = \sum_i Z_{ir} \times \sum_i Z_{ik}$, the maximum possible number of connections. The log-likelihood that the inter-community edges $c_{rk}$ belong to the true community $r$ can then be written as $L_p$, while the likelihood that they do not is $L_q$:

$$L_p = c_{rk} \log(p) + (n_{rk} - c_{rk}) \log(1 - p), \qquad (8)$$

$$L_q = c_{rk} \log(q) + (n_{rk} - c_{rk}) \log(1 - q), \qquad (9)$$

where $p = (c_{rr} + c_{rk} + c_{kr} + c_{kk})/(n_{rr} + n_{rk} + n_{kr} + n_{kk})$ is the internal density after merging, and $q = B_{rk}$. If $L_p > L_q$, we merge $r$ and $k$ into one community; otherwise, we leave them separate.

The time complexity of each merging pass is $O(K^2)$. However, if we only consider pairs of communities with at least one edge between them, the time complexity becomes $O(M)$. Community merging runs at the end of each stage, after updating $Z$ and $B$. Algorithm 1 gives the pseudocode of the serial algorithm.
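The merge test of Eqs. (8)-(9) can be sketched as below, assuming the counts $c_{rk}$ and $n_{rk}$ are kept as matrices; the clamping of the densities is a sketch-level guard against $\log 0$, not something specified in the paper.

```python
import numpy as np

def should_merge(c, n, r, k, B):
    """Return True when communities r and k should be merged (L_p > L_q).

    c[r, k]: edge count between communities r and k;
    n[r, k]: maximum possible connections n_rk; B: current block matrix.
    """
    c_rk, n_rk = c[r, k], n[r, k]
    # Internal density of the hypothetically merged community (Eq. 8's p).
    p = (c[r, r] + c[r, k] + c[k, r] + c[k, k]) / \
        (n[r, r] + n[r, k] + n[k, r] + n[k, k])
    q = B[r, k]

    def loglik(prob):
        prob = min(max(prob, 1e-12), 1 - 1e-12)   # guard the logarithms
        return c_rk * np.log(prob) + (n_rk - c_rk) * np.log(1 - prob)

    return loglik(p) > loglik(q)
```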

Convergence Rate

We consider an ideal stochastic block model comprising $K$ true communities with identical properties (community size, internal density, etc.). We define $p_R^{(0)} \approx \frac{1}{K^2}$ as the expected ratio of the number of nodes belonging to the same true community to the total number of nodes in a temporary community at iteration zero.

Theorem 3 For an ideal stochastic block model, ML-SBM with random initialization converges quadratically:

$$\lim_{t \to \infty} \frac{\big|p_R^{(t)} - \frac{1}{K}\big|}{\big|p_R^{(t-1)} - \frac{1}{K}\big|^{2}} = K.$$

The result can be proved by utilizing the concept of temporary communities.

Algorithm 1 Serial Algorithm (ML-SBM)

  set community number to αK
  initialize B, Z^(t) into many tiny communities
  Z^(t−1) = Z^(t)
  repeat
    repeat
      for i = 1 : N do
        compute N^(d) for node i
        update Z_i:^(t) according to Eq. (7)
        Z_i:^(t−1) = Z_i:^(t)
        update N^(c)
      end for
      update B according to Eq. (4)
    until Z^(t) does not change anymore
    merge communities according to Eqs. (8) and (9)
  until no more merging is possible

Algorithm 2 Parallel Algorithm (Par-SBM)

  set community number to αK
  initialize B, Z^(t) into many tiny communities
  Z^(t−1) = Z^(t)
  repeat
    repeat
      do in parallel on each processor b
        for i ∈ SI_b do
          compute N^(d) for node i
          update Z_i:^(t) according to Eq. (7)
          Z_i:^(t−1) = Z_i:^(t)
          synchronize changes of Z_i: at frequency f
        end for
        update s_b and ŝ_b for the β_b entries
        synchronize s = Σ_b s_b and ŝ = Σ_b ŝ_b
        update B according to Eq. (4)
        update N^(c)
    until Z^(t) does not change anymore
    merge communities according to Eqs. (8) and (9)
  until no more merging is possible
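To make the control flow of Algorithm 1 concrete, here is a compact, illustrative transcription in Python. It is a sketch rather than the authors' implementation: it uses the scalar off-diagonal approximation throughout, adds Laplace smoothing so that the tiny initial communities have usable densities (an assumption of ours, not from the paper), and omits the merging stage (see the merge sketch above).

```python
import numpy as np

def ml_sbm_sketch(adj, n_init, sweeps=30, seed=0):
    """Simplified ML-SBM loop. adj: dict {int node: list of neighbours}."""
    rng = np.random.default_rng(seed)
    nodes = sorted(adj)
    N = len(nodes)
    labels = {v: int(rng.integers(n_init)) for v in nodes}  # tiny communities

    def model_params():
        """Community sizes N^(c), smoothed densities p_r, scalar q (Eq. 4)."""
        sizes, intra, m_total, m_intra = {}, {}, 0, 0
        for v in nodes:
            sizes[labels[v]] = sizes.get(labels[v], 0) + 1
        for v in nodes:
            for u in adj[v]:
                if u <= v:
                    continue                  # count each undirected edge once
                m_total += 1
                if labels[u] == labels[v]:
                    intra[labels[v]] = intra.get(labels[v], 0) + 1
                    m_intra += 1
        # Laplace smoothing keeps singleton communities from density 0/0.
        p = {r: (intra.get(r, 0) + 1.0) / (s * (s - 1) / 2 + 2.0)
             for r, s in sizes.items()}
        inter_pairs = N * (N - 1) / 2 - sum(s * (s - 1) / 2
                                            for s in sizes.values())
        q = max((m_total - m_intra) / max(inter_pairs, 1.0), 1e-9)
        return sizes, p, min(q, 0.5)

    for _ in range(sweeps):
        sizes, p, q = model_params()          # update B (Eq. 4)
        changed = False
        for v in nodes:                       # update Z row by row (Eq. 7)
            counts = {}
            for u in adj[v]:
                counts[labels[u]] = counts.get(labels[u], 0) + 1
            best, best_val = labels[v], -np.inf
            for r, n_d in counts.items():
                pr = min(max(p[r], 1e-9), 1 - 1e-9)
                val = (n_d * np.log(pr / q)
                       + (sizes[r] - n_d) * np.log((1 - pr) / (1 - q)))
                if val > best_val:
                    best, best_val = r, val
            if best != labels[v]:
                labels[v], changed = best, True
        if not changed:                       # merging stage omitted here
            break
    return labels

# Toy example: two triangles joined by a single edge.
adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4, 5], 4: [3, 5], 5: [3, 4]}
print(ml_sbm_sketch(adj, n_init=6))
```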
3.2 Parallelism

The parallelism is designed as follows. Let $SI$ be the set of integers from 1 to $N$, partitioned into non-overlapping and approximately equal-size subsets, each of which is denoted $SI_b$ for the $b$th processor. On each processor, we also use local variables $s_b$, $\hat{s}_b$, and $\beta_b$, which have definitions similar to those of $s$, $\hat{s}$, and $\beta$ but take only the rows $i \in SI_b$ of $W$ into account.

We choose a message-passing model, which is applicable to a variety of memory structures, whether shared, distributed, or hybrid. We use non-blocking MPI communication to improve the communication-computation overlap. This has a two-fold benefit: if the hardware allows, it utilizes the network bandwidth during computation, and it maintains a convergence rate for $Z$ similar to that of the serial version by transmitting up-to-date community labels. To reduce the overall communication volume, changes of $Z_{i:}$ are sent to processor $b$ only if $i$ has neighboring nodes in $SI_b$. Because the communication bandwidth is higher when messages are larger, we buffer the change information of $Z$ until the computation of $f$ rows of $Z$ finishes. Here $f$ is chosen to be inversely proportional to the number of MPI ranks and proportional to $N$, which maintains a consistent convergence rate.

The algorithm uses similar partitioning and communication strategies when computing $B$. Algorithm 2 presents the pseudocode of the parallel algorithm.

In parallel computing, the rows of $W$ are distributed over the processors. If there are in total $g$ processors (equal to the number of MPI ranks) and the data are evenly distributed, the computation time on each processor for Eq. (4) and Eq. (7) is $O(\frac{M}{g})$. When the computations for $B$ are partitioned across the processors, reducing the number of communities also requires $O(\frac{M}{g})$ per processor. The total message length for $Z$ and $B$ during the iterations is $O(g(N + K))$. If the bandwidth is constant, the speedup of the parallel algorithm is

$$\frac{T(M)}{T\!\left(\frac{M}{g}\right) + O(N + K)}, \qquad (10)$$

where $T$ is a linear function such that the numerator and the denominator indicate the running times of the serial and parallel versions, respectively. Therefore, the speedup increases toward $g$ as the number of edges increases for fixed $N$ and $K$.
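The two communication patterns described above (a global reduction for the $s$ and $\hat{s}$ of Eq. (4), and non-blocking sends of buffered label changes that overlap the next batch of Eq. (7) updates) can be illustrated with mpi4py. This is a minimal sketch of the pattern, with fabricated local counts; it is not the authors' C implementation.

```python
# Run with, e.g.: mpiexec -n 4 python parsbm_comm_sketch.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, g = comm.Get_rank(), comm.Get_size()

# Each rank b owns the rows SI_b of W; local edge counts are faked here.
s_b, shat_b = 100.0 + rank, 900.0 - rank

# Global s and s_hat give every rank the same beta = s / (s + s_hat), Eq. (4).
s = comm.allreduce(s_b, op=MPI.SUM)
shat = comm.allreduce(shat_b, op=MPI.SUM)
beta = s / (s + shat)

# Buffered label changes (node id, new label) go out with non-blocking sends,
# so the computation of the next f rows of Z can overlap the communication.
changes = np.array([[rank, rank % 3]], dtype=np.int64)
others = [d for d in range(g) if d != rank]
send_reqs = [comm.Isend(changes, dest=d, tag=0) for d in others]
inbox = {d: np.empty_like(changes) for d in others}
recv_reqs = [comm.Irecv(inbox[d], source=d, tag=0) for d in others]
# ... label updates for the next f rows would run here ...
MPI.Request.Waitall(send_reqs + recv_reqs)
if rank == 0:
    print(f"beta = {beta:.4f}, received {len(inbox)} change batches")
```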

4 Experimental Results

4.1 The Serial Algorithm

In this section, we compare our algorithm (ML) with other serial algorithms in MATLAB. The competitors include the Louvain method (LV) [Blondel et al., 2008], a Bayesian inference algorithm (BI) [Hofman and Wiggins, 2008], spectral clustering (SC) [Von Luxburg, 2007], label propagation (LP) [Raghavan et al., 2007], and modularity optimization (MM) [Newman, 2004]. We use the LFR benchmark generator [Lancichinetti and Fortunato, 2009] to create the graphs on which all the algorithms are compared. Each generated graph has size 1000, average degree 30, maximum degree 50, degree distribution exponent −2, community size distribution exponent −2, minimum community size 20, and maximum community size 100. The mixing parameter µ is the proportion of inter-community connections to intra-community ones. Accuracy is evaluated by normalized mutual information (NMI) [Danon et al., 2005], which measures the similarity between the computed community structure and the ground-truth one. The larger the NMI, the more similar the two schemes; if two schemes are identical, the NMI is 1. The results are averaged over five runs.
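For reference, NMI can be computed with scikit-learn; its default arithmetic-mean normalization matches the commonly used definition, though the exact variant of [Danon et al., 2005] used in the paper may differ slightly.

```python
from sklearn.metrics import normalized_mutual_info_score

# Ground-truth labels and two computed partitions for ten nodes.
truth   = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
perfect = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]   # same split, labels renamed
noisy   = [0, 0, 0, 1, 0, 1, 1, 0, 1, 1]

print(normalized_mutual_info_score(truth, perfect))  # 1.0: identical schemes
print(normalized_mutual_info_score(truth, noisy))    # < 1.0
```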

From Figs. 1(a) and 1(b), we can see that our algorithm is relatively fast and has the best accuracy. When µ = 0, most of the algorithms find the correct result, and ours is the second fastest. As µ increases to 0.7, our algorithm needs more time to iterate, but its accuracy is still the best; only SC and LV achieve similar accuracy in a similar amount of time.

[Figure 1: Comparison to popular algorithms in MATLAB. (a) NMI vs. µ and (b) running time vs. µ show the variation of average accuracy and running time over different choices of µ; (c) NMI vs. N and (d) running time vs. N show the variation over different choices of graph size N.]

However, for large graphs, SC is not scalable and LV is not accurate; our algorithm is the method of choice for both scalability and accuracy. Figs. 1(c) and 1(d) show results on graphs similar to those of the previous experiment, but with average degree 10 and µ = 0.1. On large graphs, compared with the other algorithms, ours consistently produces more accurate results with respect to the ground-truth setting, and its running time grows more slowly with graph size than the others'. The slow growth of our running time agrees with the theoretical time complexity, which is linear in the number of edges.

4.2 The Parallel Algorithm

In this section, we exploit distributed-memory parallelism for larger graphs. When the input graph has more than one million nodes, typical algorithms for stochastic block models cannot infer the parameters within a reasonable amount of time (for example, 24 hours). Therefore, we compare our algorithm (PS, followed by the number of processors) with a popular community detection algorithm, the Louvain method (LV) [Blondel et al., 2008], and with a fast graph partitioning algorithm, Metis (MT) [Karypis and Kumar, 1998]. The Louvain method implemented in C can also produce level-one structures as a byproduct, containing the finest communities (LVF); those communities are merged into larger ones in LV. Parallel algorithms are often extended from such serial versions [Wickramaarachchi et al., 2014], but the results are typically invariant. We use several metrics to compare the results: NMI, modularity [Clauset et al., 2004], and conductance (the smaller, the better) [Leskovec et al., 2009]. The results for those quantities are averaged over five runs.
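The paper does not spell out the conductance formula; the sketch below uses the standard definition from [Leskovec et al., 2009]: the number of edges crossing the community boundary divided by the smaller of the two volumes (sums of degrees) on either side.

```python
import numpy as np

def conductance(W, members):
    """Conductance of one community (smaller is better).

    W: dense 0/1 adjacency matrix; members: boolean mask of the community.
    """
    members = np.asarray(members, dtype=bool)
    cut = W[members][:, ~members].sum()   # edges leaving the community
    vol_in = W[members].sum()             # degree mass inside
    vol_out = W[~members].sum()           # degree mass outside
    return cut / max(min(vol_in, vol_out), 1)

# Two triangles joined by a single bridge edge.
W = np.zeros((6, 6), dtype=int)
for a, b in [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]:
    W[a, b] = W[b, a] = 1
print(conductance(W, [True, True, True, False, False, False]))  # 1/7 ≈ 0.143
```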

Table 1: Properties of the test graphs

    Graph Name       Node Number    Edge Number
    LFR-1e6            1,000,000     ∼5,000,000
    cit-Patents        3,774,768     16,518,948
    dblp-Coauthor      1,314,050     10,724,828

First, we run our algorithm on benchmark graphs (LFR-1e6 in Table 1) generated by the LFR benchmark [Lancichinetti and Fortunato, 2009]. The parameters are similar to the serial case, except that the graph size is increased to one million and we vary two parameters (the mixing parameter µ and the maximum community size mc) to obtain more variation. Fig. 2 shows the results, in which GT is the ground truth given by the generator.

[Figure 2: Comparison on benchmark data. (a) NMI; (b) running time; (c) number of communities; (d) modularity.]

Generally, increasing µ and mc decreases the community detection accuracy, but the results generated by our algorithm are always closest to the ground truth according to NMI. The results of our algorithm are also very close to the ground truth in the number of communities (except for MT, where the number is predefined) and in modularity, and they are usually better than the others in modularity. In Fig. 2(b), as the problem becomes more difficult from left to right, our algorithm spends more time to maintain the quality, while the other algorithms finish within the same amount of time at the cost of quality. In addition, the performance of our algorithm is stable: the results using 8 processors (PS8) and 64 processors (PS64) are fairly close.
We also compare the algorithms on real-world graphs: cit-Patents [Leskovec et al., 2005] and dblp-Coauthor [Ley, 2002]. Nodes in cit-Patents are patents in the U.S. patent dataset, and edges represent citation relationships between the patents. dblp-Coauthor is a coauthorship graph extracted from publications in the DBLP database.

[Figure 3: Number of communities versus community size. (a) cit-Patents; (b) dblp-Coauthor.]

It is also worth noting that LVF achieves a higher NMI and more communities than LV in the benchmark test. This indicates that LV suffers from the resolution limit problem [Fortunato and Barthelemy, 2007], which prefers unrealistically large communities. A similar problem occurs with LV on the real-world data, as it generates many communities containing more than $10^4$ nodes (Fig. 3). For the dblp-Coauthor network, for example, the biggest community found by LV contains about $10^5$ nodes (10% of the whole graph). This is unrealistic for collaboration networks and is orders of magnitude larger than a reasonable community size for human interaction [Leskovec et al., 2009], so LV is omitted from the remaining comparisons. Alternatively, choosing low-level structures such as LVF can be a remedy [Good et al., 2010], although LVF still contains a few extremely large communities. On the other hand, MT tends to find communities of even size, which is not desirable for community detection either.

Table 2 shows the quality of the results produced by the different algorithms.

Table 2: Comparison on real-world data

    Graph            Algo    Cond     Mod     Time (sec)
    cit-Patents      PS1     0.418    0.631   1514
                     PS8     0.416    0.638    284
                     PS64    0.415    0.639     51
                     LVF     0.594    0.560    103
                     MT      0.488    0.487    243
    dblp-Coauthor    PS1     0.114    0.631    186
                     PS8     0.113    0.632     33
                     PS64    0.112    0.632      6
                     LVF     0.397    0.561     22
                     MT      0.621    0.253     73

4.3 Microscopic Example on the dblp-Coauthor Network

In this section, we analyze the community detection results of one run of our Par-SBM (PS) and of the Louvain method [Blondel et al., 2008] with random initialization. For the Louvain method, we pick the level-one structure with the finest communities (LVF) and the level-two structure, denoted LV2; communities become larger in the higher-level structures of the Louvain method. Taking the Conference Chair as an example, we consider the community containing the author "Michael Wooldridge," and we use S_PS, S_LVF, and S_LV2 to denote the node sets of the corresponding communities found by Par-SBM and by the Louvain method, respectively. The community size found by our algorithm is comparable to that of LVF but much smaller than that of LV2: |S_PS| = 255 and |S_LVF| = 166, while |S_LV2| = 26294.

[Figure 4: Subgraph of the dblp-Coauthor network, where colors represent the community labels determined by different algorithms. (a) by Par-SBM; (b) by LVF; (c) by LV2.]

All the algorithms are able to identify the blue nodes in Fig. 4 as a separate community, while our Par-SBM provides the clearest detail. S_LV2 contains one hundred times more nodes than S_PS. Among all the communities found by Par-SBM, the red, green, and purple communities in Fig. 4(a) are the top three contributors to S_LV2, owning 2.96% of the total nodes in S_LV2. The three communities can be represented by three nodes, respectively: purple by Dr. Michael Wooldridge (multi-agent systems), green by Dr. Bruce W. Porter (knowledge systems), and red by Dr. Christopher W. Geib (Reports of the AAAI Conference Workshops). The nodes in red form a strong community because the authors (although from different areas of AI) are fully connected by the annual workshop reports. However, some nodes in the aforementioned three communities are inappropriately separated by LVF, as shown in Fig. 4(b).

5 Concluding Remarks

In this paper, we have proposed a fast algorithm for clustering nodes in large graphs using stochastic block models. We have adopted alternating updates and coordinate descent methods to infer the latent parameters, and we have used community shrinking and expanding to improve accuracy. Our algorithm has linear time complexity in the number of edges and can scale to multi-processor systems.

Compared with other community detection algorithms, our SBM-based method produces high-quality results on benchmark graphs and finds interesting communities in real-world graphs.

This work can boost the application of SBMs to big data and bring new insights by overcoming the accuracy and running-time shortcomings of traditional algorithms. The code is available at https://github.com/pdacorn/par-sbm.

References

[Ansótegui et al., 2012] Carlos Ansótegui, Jesús Giráldez-Cru, and Jordi Levy. The community structure of SAT formulas. In Theory and Applications of Satisfiability Testing – SAT 2012, pages 410–423. Springer, 2012.

[Blondel et al., 2008] Vincent D Blondel, Jean-Loup Guillaume, Renaud Lambiotte, and Etienne Lefebvre. Fast unfolding of communities in large networks. Journal of Statistical Mechanics: Theory and Experiment, 2008(10):P10008, 2008.

[Celisse et al., 2012] Alain Celisse, Jean-Jacques Daudin, and Laurent Pierre. Consistency of maximum-likelihood and variational estimators in the stochastic block model. Electronic Journal of Statistics, 6:1847–1899, 2012.

[Chen et al., 2012] Yudong Chen, Sujay Sanghavi, and Huan Xu. Clustering sparse graphs. In Advances in Neural Information Processing Systems 25, pages 2213–2221, 2012.

[Clauset et al., 2004] Aaron Clauset, Mark EJ Newman, and Cristopher Moore. Finding community structure in very large networks. Physical Review E, 70(6):066111, 2004.

[Danon et al., 2005] Leon Danon, Albert Diaz-Guilera, Jordi Duch, and Alex Arenas. Comparing community structure identification. Journal of Statistical Mechanics: Theory and Experiment, 2005(09):P09008, 2005.

[Daudin et al., 2008] Jean-Jacques Daudin, Franck Picard, and Stéphane Robin. A mixture model for random graphs. Statistics and Computing, 18(2):173–183, 2008.

[Fortunato and Barthelemy, 2007] Santo Fortunato and Marc Barthelemy. Resolution limit in community detection. Proceedings of the National Academy of Sciences, 104(1):36–41, 2007.

[Fortunato and Castellano, 2012] Santo Fortunato and Claudio Castellano. Community structure in graphs. In Computational Complexity, pages 490–512. Springer, 2012.

[Good et al., 2010] Benjamin H Good, Yves-Alexandre de Montjoye, and Aaron Clauset. Performance of modularity maximization in practical contexts. Physical Review E, 81(4):046106, 2010.

[Hofman and Wiggins, 2008] Jake M Hofman and Chris H Wiggins. Bayesian approach to network modularity. Physical Review Letters, 100(25):258701, 2008.

[Holland et al., 1983] Paul W. Holland, Kathryn B. Laskey, and Samuel Leinhardt. Stochastic blockmodels: First steps. Social Networks, 5(2):109–137, 1983.

[Karypis and Kumar, 1998] George Karypis and Vipin Kumar. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing, 20(1):359–392, 1998.

[Kemp et al., 2006] Charles Kemp, Joshua B Tenenbaum, Thomas L Griffiths, Takeshi Yamada, and Naonori Ueda. Learning systems of concepts with an infinite relational model. In AAAI, volume 3, page 5, 2006.

[Lancichinetti and Fortunato, 2009] Andrea Lancichinetti and Santo Fortunato. Benchmarks for testing community detection algorithms on directed and weighted graphs with overlapping communities. Physical Review E, 80(1):016118, 2009.

[Leskovec et al., 2005] Jure Leskovec, Jon Kleinberg, and Christos Faloutsos. Graphs over time: densification laws, shrinking diameters and possible explanations. In Proceedings of the 11th ACM International Conference on Knowledge Discovery in Data Mining, pages 177–187. ACM, 2005.

[Leskovec et al., 2009] Jure Leskovec, Kevin J Lang, Anirban Dasgupta, and Michael W Mahoney. Community structure in large networks: Natural cluster sizes and the absence of large well-defined clusters. Internet Mathematics, 6(1):29–123, 2009.

[Ley, 2002] Michael Ley. The DBLP computer science bibliography: Evolution, research issues, perspectives. In Proceedings of the International Symposium on String Processing and Information Retrieval, pages 1–10, 2002.

[Liu et al., 2013] Jialu Liu, Chi Wang, Marina Danilevsky, and Jiawei Han. Large-scale spectral clustering on graphs. In Proceedings of the 23rd International Joint Conference on Artificial Intelligence, pages 1486–1492. AAAI Press, 2013.

[Nadler and Galun, 2006] Boaz Nadler and Meirav Galun. Fundamental limitations of spectral clustering. In Advances in Neural Information Processing Systems, pages 1017–1024, 2006.

[Newman, 2004] Mark EJ Newman. Fast algorithm for detecting community structure in networks. Physical Review E, 69(6):066133, 2004.

[Raghavan et al., 2007] Usha Nandini Raghavan, Réka Albert, and Soundar Kumara. Near linear time algorithm to detect community structures in large-scale networks. Physical Review E, 76(3):036106, 2007.

[Staudt and Meyerhenke, 2015] C. Staudt and H. Meyerhenke. Engineering parallel algorithms for community detection in massive networks. IEEE Transactions on Parallel and Distributed Systems, PP(99):1–1, 2015.

[Von Luxburg, 2007] Ulrike Von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, 2007.

[Wickramaarachchi et al., 2014] Charith Wickramaarachchi, Marc Frincu, Patrick Small, and Viktor Prasanna. Fast parallel algorithm for unfolding of communities in large graphs. In 18th IEEE High Performance Extreme Computing Conference (HPEC '14), 2014.

[Xiang and Hu, 2012] Ju Xiang and Ke Hu. Limitation of multi-resolution methods in community detection. Physica A: Statistical Mechanics and its Applications, 391(20):4995–5003, 2012.
