The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18)

The Geometric Block Model

Sainyam Galhotra, Arya Mazumdar, Soumyabrata Pal, Barna Saha
College of Information and Computer Sciences, University of Massachusetts Amherst, Amherst, MA 01003
{sainyam,arya,spal,barna}@cs.umass.edu

Abstract

To capture the inherent geometric features of many community detection problems, we propose to use a new random graph model of communities that we call a Geometric Block Model. The geometric block model generalizes the random geometric graphs in the same way that the well-studied stochastic block model generalizes the Erdős-Rényi random graphs. It is also a natural extension of random community models inspired by the recent theoretical and practical advancement in community detection. While being a topic of fundamental theoretical interest, our main contribution is to show that many practical community structures are better explained by the geometric block model. We also show that a simple triangle-counting algorithm to detect communities in the geometric block model is near-optimal. Indeed, even in the regime where the average degree of the graph grows only logarithmically with the number of vertices (sparse graphs), we show that this algorithm performs extremely well, both theoretically and practically. In contrast, the triangle-counting algorithm is far from being optimal for the stochastic block model. We simulate our results on both real and synthetic datasets to show the superior performance of both the new model and our algorithm.

Copyright © 2018, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

1 Introduction

The planted-partition model, or stochastic block model (SBM), is a random graph model for community detection that generalizes the well-known Erdős-Rényi graphs (Holland, Laskey, and Leinhardt 1983; Dyer and Frieze 1989; Decelle et al. 2011; Abbe and Sandon 2015a; Abbe, Bandeira, and Hall 2016; Hajek, Wu, and Xu 2015; Chin, Rao, and Vu 2015; Mossel, Neeman, and Sly 2015). Consider a graph G(V, E), where V = V1 ⊔ V2 ⊔ · · · ⊔ Vk is a disjoint union of k clusters denoted by V1, . . . , Vk. The edges of the graph are drawn randomly: there is an edge between u ∈ Vi and v ∈ Vj with probability qi,j, 1 ≤ i, j ≤ k. Given the adjacency matrix of such a graph, the task is to find exactly (or approximately) the partition V1 ⊔ V2 ⊔ · · · ⊔ Vk of V.

This model has been incredibly popular in both theoretical and practical domains of community detection, and the aforementioned references are just a small sample. Recent theoretical works focus on characterizing the sharp threshold for recovering the partition in the SBM. For example, when there are only two communities of exactly equal sizes, the inter-cluster edge probability is b log n / n and the intra-cluster edge probability is a log n / n, it is known that perfect recovery is possible if and only if √a − √b > √2 (Abbe, Bandeira, and Hall 2016; Mossel, Neeman, and Sly 2015). The regime of the probabilities being Θ(log n / n) has been put forward as one of the most interesting ones, because in an Erdős-Rényi random graph this is the threshold for graph connectivity (Bollobás 1998). This result has subsequently been generalized to k communities (Abbe and Sandon 2015a; 2015b; Hajek, Wu, and Xu 2016) (for constant k or when k = o(log n)), and under the assumption that the communities are generated according to a probabilistic generative model (there is a prior probability pi of an element being in the ith community) (Abbe and Sandon 2015a). Note that these results are not only of theoretical interest: many real-world networks exhibit a "sparsely connected" community feature (Leskovec et al. 2008), and any efficient recovery algorithm for the SBM has many potential applications.

One aspect that the SBM does not account for is a "transitivity rule" ("friends having common friends") inherent to many social and other community structures. To be precise, consider any three vertices x, y and z. If x and y are connected by an edge (or they are in the same community), and y and z are connected by an edge (or they are in the same community), then it is more likely than not that x and z are connected by an edge. This phenomenon can be seen in many network structures, predominantly in social networks, blog networks and advertising. The SBM, being primarily a generalization of the Erdős-Rényi random graph, does not take this characteristic into account; in particular, the probability of an edge between x and z there is independent of the existence of the edges between x and y and between y and z. However, one needs to be careful that, in allowing such "transitivity", the simplicity and elegance of the SBM is not lost.

Inspired by the above question, we propose a random graph community detection model analogous to the stochastic block model, which we call the geometric block model (GBM). The GBM depends on the basic definition of the random geometric graph, which has found a lot of practical use in wireless networking because of its inclusion of the notion of proximity between nodes (Penrose 2003).

Definition. A random geometric graph (RGG) on n vertices has parameters n, an integer t > 1 and a real number β ∈ [−1, 1]. It is defined by assigning a vector Zi ∈ R^t to vertex i, 1 ≤ i ≤ n, where the Zi, 1 ≤ i ≤ n, are independent and identical random vectors uniformly distributed on the Euclidean sphere S^{t−1} ≡ {x ∈ R^t : ‖x‖2 = 1}. There is an edge between vertices i and j if and only if ⟨Zi, Zj⟩ ≥ β.

Note that the definition can be further generalized by considering the Zi to have a sample space other than S^{t−1}, and by using a different notion of distance than the inner product (e.g., the Euclidean distance). We have simply stated one of the many equivalent definitions (Bubeck et al. 2016).

Random geometric graphs are often proposed as an alternative to Erdős-Rényi random graphs. They are quite well studied theoretically (though not nearly as much as the Erdős-Rényi graphs), and very precise results exist regarding their connectivity, clique numbers and other structural properties (Gupta and Kumar 1998; Penrose 1991; Devroye et al. 2011; Avin and Ercal 2007; Goel, Rai, and Krishnamachari 2005). For a survey of early results on geometric graphs and their applications in wireless networks, see (Penrose 2003; Haenggi et al. 2009; Bettstetter 2002).

Given any simple random graph model, it is possible to generalize it to a random block model of communities, much in line with the SBM. We stress, however, that the geometric block model is perhaps the simplest possible model of real-world communities that also captures the transitive/geometric features of communities. Moreover, the GBM explains the behaviors of many real-world networks, as we will exemplify subsequently.

2 The Geometric Block Model and its Validation

Let V ≡ V1 ⊔ V2 ⊔ · · · ⊔ Vk be the set of vertices that is a disjoint union of k clusters, denoted by V1, . . . , Vk. Given an integer t ≥ 2, for each vertex u ∈ V define a random vector Zu ∈ R^t that is uniformly distributed on the (t − 1)-dimensional sphere S^{t−1} ⊂ R^t.

Definition (Geometric Block Model GBM(V, t, βi,j), 1 ≤ i ≤ j ≤ k). Given real numbers βi,j ∈ [−1, 1], 1 ≤ i ≤ j ≤ k, the geometric block model is the random graph in which there is an edge between u ∈ Vi and v ∈ Vj if and only if ⟨Zu, Zv⟩ ≥ βi,j.

When t = 2, the vertices are i.i.d. uniform points on a circle of circumference 1, and thresholding the inner product is equivalent to thresholding the circular distance between the points. For two clusters we then write GBM(V, t, rs, rd, k), where rs is the distance threshold for pairs within the same cluster and rd is the threshold for pairs in different clusters.
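To make the definition concrete, here is a small sampler for RGG(n, t, β); the function name and interface are ours, for illustration only. Points are drawn uniformly on S^{t−1} by normalizing Gaussian vectors, and edges are placed by thresholding inner products.

```python
import math
import random

def random_geometric_graph(n, t, beta, seed=0):
    """Sample an RGG(n, t, beta): i.i.d. uniform points on the sphere
    S^{t-1} (obtained by normalizing Gaussian vectors), with an edge
    between i and j iff <Z_i, Z_j> >= beta.  Illustrative sketch."""
    rng = random.Random(seed)
    pts = []
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in range(t)]
        norm = math.sqrt(sum(c * c for c in z))
        pts.append([c / norm for c in z])
    edges = {(i, j)
             for i in range(n) for j in range(i + 1, n)
             if sum(a * b for a, b in zip(pts[i], pts[j])) >= beta}
    return pts, edges
```

The GBM below modifies this construction in only one way: the threshold applied to ⟨Zu, Zv⟩ depends on the clusters of the two endpoints.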

Area 1   Area 2   same   different        Area   same   different
MOD      AI        10        2            MOD     19       35
ARCH     MOD        6        1            ARCH    13       15
ROB      ARCH       3        0            ROB     24       16
MOD      ROB        4        0            AI      39       32
ML       MOD        7        1            ML      14       42

Table 1: On the left we count the number of inter-cluster edges when the authors shared the same affiliation and when they had different affiliations. On the right, we count the same for intra-cluster edges.

2.1 Motivation of GBM: Academic Collaboration Network

For the authors, we consider five different communities: Data Management (MOD), Machine Learning and Data Mining (ML), Artificial Intelligence (AI), Robotics (ROB), and Architecture (ARCH). If two authors share the same affiliation, or shared an affiliation in the past, we assume that they are geographically close. We hypothesize that two authors in the same community might collaborate even when they are geographically far apart, whereas two authors in different communities are more likely to collaborate only if they share the same affiliation (or are geographically close). Table 1 reports the number of edges across the communities. It is evident that authors from the same community are likely to collaborate irrespective of their affiliations, while authors from different communities collaborate much more frequently when they share an affiliation or are geographically close. This clearly indicates that inter-cluster edges are likely to form only when the distance between the nodes is quite small, which is exactly the behavior the geometric block model postulates.

Figure 1: Histogram: similarity of products bought together (mean ≈ 6)

2.2 Motivation of GBM: Amazon Metadata

The next dataset that we use in our experiments is the Amazon product metadata on SNAP (https://snap.stanford.edu/data/amazon-meta.html), which has 548,552 products, each of one of the following types: {Books, Music CDs, DVDs, Videos}. Moreover, each product has a list of attributes; for example, a book may have attributes like "General", "Sermon", "Preaching". We consider the co-purchase network over these products. We make two observations here: (1) edges form (that is, items are co-purchased) more frequently if the items are similar, where we measure similarity by the number of common attributes between products; and (2) two products that share an edge have more common neighbors (number of items that are bought along with both of those products) than two products with no edge in between.

Figure 2: Histogram: similarity of products not bought together (mean ≈ 2)

Figures 1 and 2 show, respectively, the average similarity of products that were bought together and of products that were not bought together. From the distributions, it is quite evident that edges in a co-purchase network get formed according to distance, a salient feature of random geometric graphs and of the GBM.

We next take an equal number of product pairs inside Book (also inside DVD, and across Book and DVD) that have an edge in between and that do not have an edge, respectively. Figure 3 shows that the number of common neighbors when two products share an edge is much higher than when they do not; in fact, almost all product pairs that do not have an edge in between also do not share any common neighbor. This again strongly suggests the GBM, due to its transitivity property. On the other hand, it also suggests that the SBM is not a good model for this network, since in the SBM two nodes having common neighbors is independent of whether they share an edge or not.

Difference between SBM and GBM. It is important to stress that the network structures generated by the SBM and the GBM are quite different, and it is significantly more difficult to analyze any algorithm, or to prove any lower bound, on the GBM than on the SBM. This difficulty stems from the highly correlated edge generation in the GBM (whereas edges are independent in the SBM). For this reason, analyses of algorithms such as sphere-comparison and of spectral methods for clustering do not carry over to the GBM as straightforward adaptations. Even for simple algorithms, a property that can be immediately seen for the SBM may still require a careful proof for the GBM.
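The correlated edge generation just described is easy to observe in simulation. The following quick check is our own construction (names and parameters are illustrative, not an experiment from the paper): it samples a small two-cluster GBM on the circle and compares common-neighbor counts of adjacent and non-adjacent pairs.

```python
import random

def sample_gbm(n, rs, rd, seed=7):
    """Two-cluster GBM on the unit-circumference circle: vertices
    0..n/2-1 form cluster 1, the rest cluster 2.  Illustrative sketch."""
    rng = random.Random(seed)
    x = [rng.random() for _ in range(n)]
    half = n // 2
    adj = [set() for _ in range(n)]
    for u in range(n):
        for v in range(u + 1, n):
            d = abs(x[u] - x[v])
            d = min(d, 1 - d)                      # wrap-around distance d_L
            if d <= (rs if (u < half) == (v < half) else rd):
                adj[u].add(v)
                adj[v].add(u)
    return adj

n = 400
adj = sample_gbm(n, 0.05, 0.01)
with_edge, without_edge = [], []
for u in range(n):
    for v in range(u + 1, n):
        (with_edge if v in adj[u] else without_edge).append(len(adj[u] & adj[v]))
mean = lambda xs: sum(xs) / len(xs)
# Adjacent pairs share many more neighbors than non-adjacent ones --
# the transitivity that the SBM, with its independent edges, cannot express.
print(mean(with_edge), mean(without_edge))
```

In the SBM, by contrast, the common-neighbor count of a pair is distributed identically whether or not the pair itself is adjacent.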

Figure 3: Histogram of common neighbors of edges and non-edges in the co-purchase network, from left to right: Book-DVD, Book-Book, DVD-DVD

3 The Motif-Counting Algorithm

Suppose we are given a graph G = (V, E) with k disjoint clusters V1, V2, . . . , Vk ⊆ V generated according to GBM(V, t, rs, rd, k). Our clustering algorithm is based on counting motifs, where a motif is simply defined as a configuration of triplets in the graph. Let us explain this principle with one particular motif, the triangle. For any two vertices u and v in V such that (u, v) is an edge, we count the total number of common neighbors of u and v. We show that this count is different when u and v belong to the same cluster than when they belong to different clusters. We assume G is connected, because otherwise it is impossible to recover the clusters. For every pair of vertices in the graph that share an edge, we decide whether they are in the same cluster or not by this count of triangles. In reality, we do not have to check every such pair; instead, we can stop when we have formed a spanning tree, at which point we can transitively deduce the partition of the nodes into clusters.

The main new idea of this algorithm is to use this triangle count (or motif count in general), since it carries significantly more information regarding the connectivity of the graph than an edge count. One can also use statistics of higher order (such as two-hop common neighbors) at the expense of increased complexity. Surprisingly, the simple greedy algorithm that relies on triplets can separate clusters when rd and rs are Ω(log n / n), which is also a minimal requirement for connectivity of random geometric graphs (Penrose 2003). Therefore this algorithm is optimal up to a constant factor. It is interesting to note that this motif-counting algorithm is not optimal for the SBM (as we observe); in particular, it will not detect the clusters in the sparse threshold region of log n / n for the SBM; however, it does so for the GBM.

The pseudocode of the algorithm is described in Algorithm 1. The algorithm looks at individual pairs of vertices to decide whether they belong to the same cluster or not. We go over pairs of vertices and label them same/different until we have enough labels to partition the graph into clusters. At any stage, the algorithm picks an unassigned node v and queries it with a node from each of the already formed clusters. To decide whether the node belongs to a cluster, it calls a subroutine called process. The process function tries to infer whether the node v belongs to the cluster Vi by first identifying a vertex u ∈ Vi that has an edge with v, and then counting the number of common neighbors of u and v to make a decision. This procedure is continued until all nodes in V are processed.

Algorithm 1: Graph recovery from GBM
Require: GBM G = (V, E), rs, rd, k
Ensure: V = V1 ⊔ . . . ⊔ Vk
 1: V1, . . . , Vk ← ∅
 2: for v ∈ V do
 3:   added ← false
 4:   for i ∈ {1, 2, . . . , k − 1} do
 5:     if process(Vi, v, rs, rd) then
 6:       Vi ← Vi ∪ {v}
 7:       added ← true
 8:     end if
 9:   end for
10:   if ¬added then
11:     Vk ← Vk ∪ {v}
12:   end if
13: end for

Algorithm 2: process
Require: C, v, rs, rd
Ensure: true/false
1: choose u ∈ C such that (u, v) ∈ E
2: count ← |{z : (z, u) ∈ E, (z, v) ∈ E}|
3: if |count − n·ES(rd, rs)| < |count − n·ED(rd, rs)| then
4:   return true
5: end if
6: return false

The process function counts the number of common neighbors of the two nodes and then compares this count against two functions of rd and rs, called ED and ES. The formulae for ED and ES are different when rs < 2rd than when rs ≥ 2rd; we have compiled them in Table 2. In this table we have assumed that there are only two clusters, of equal size. The functions change when the cluster sizes are different; our analysis described in later sections can be used to calculate the new function values. In the table, u ∼ v means that u and v are in the same cluster.

Similarly, the process function can be run on other sets of motifs (other patterns of triplets) by fixing two nodes. When considering a larger set of motifs, the process function can take a majority vote over the decisions received from the different motifs. This helps to resolve the clusters even when the gap between rs and rd is smaller (by a constant factor) than what the triangle motif alone requires.

Note that our algorithm counts motifs only for edges, and does not need to count motifs for more than n − 1 edges, as that many edges are sufficient to construct a spanning tree of the graph.
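A minimal executable sketch of the triangle-motif test is below, assuming two equal-sized clusters and the circle model with thresholds rs and rd; the helper names and the simple sampler are ours, not the paper's code. It computes ES and ED as in Table 2 and classifies an adjacent pair the way process does.

```python
import random

def sample_gbm(n, rs, rd, seed=0):
    """Two-cluster GBM on the unit circle (vertices 0..n/2-1 are
    cluster 1).  Illustrative helper, not the paper's implementation."""
    rng = random.Random(seed)
    x = [rng.random() for _ in range(n)]
    half = n // 2
    adj = [set() for _ in range(n)]
    for u in range(n):
        for v in range(u + 1, n):
            d = abs(x[u] - x[v])
            d = min(d, 1 - d)                     # circular distance d_L
            if d <= (rs if (u < half) == (v < half) else rd):
                adj[u].add(v)
                adj[v].add(u)
    return adj

def expected_counts(rs, rd):
    """Per-vertex expectations (E_S, E_D) of the triangle count for two
    equal-sized clusters, following the triangle row of Table 2."""
    if rs > 2 * rd:
        return 3 * rs / 4 + rd ** 2 / rs, 2 * rd
    return rs / 2 + rd, 2 * rs - rs ** 2 / (2 * rd)

def process(adj, n, u, v, rs, rd):
    """Decide 'same cluster' for the adjacent pair (u, v) by checking
    whether their common-neighbor count is closer to n*E_S or n*E_D."""
    es, ed = expected_counts(rs, rd)
    count = len(adj[u] & adj[v])
    return abs(count - n * es) < abs(count - n * ed)
```

On sparse but connected instances (e.g., n = 500, rs = 0.1, rd = 0.02), this test classifies most adjacent pairs correctly, which is exactly what Algorithm 1 needs in order to stitch the clusters together along a spanning tree.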

Motif {z : (z,u) ∈ E, (z,v) ∈ E}:
  rs > 2rd:  u ∼ v: Bin(n/2 − 2, 3rs/2) + Bin(n/2, 2rd²/rs),  ES = 3rs/4 + rd²/rs
             u ≁ v: Bin(n − 2, 2rd),  ED = 2rd
  rs ≤ 2rd:  u ∼ v: Bin(n/2 − 2, 3rs/2) + Bin(n/2, 2rd − rs/2),  ES = rs/2 + rd
             u ≁ v: Bin(n − 2, 2rs − rs²/(2rd)),  ED = 2rs − rs²/(2rd)

Motif {z : (z,u) ∈ E, (z,v) ∉ E}:
  rs > 2rd:  u ∼ v: Bin(n/2 − 2, rs/2) + Bin(n/2, 2rd(rs − rd)/rs),  ES = rs/4 + rd(rs − rd)/rs
             u ≁ v: Bin(n − 2, rs − rd),  ED = rs − rd
  rs ≤ 2rd:  u ∼ v: Bin(n − 2, rs/2),  ES = rs/2
             u ≁ v: Bin(n − 2, (rs² + 2rd² − 2rs·rd)/(2rd)),  ED = (rs² + 2rd² − 2rs·rd)/(2rd)

Motif {z : (z,u) ∉ E, (z,v) ∈ E}:
  identical to the previous motif, by symmetry.

Table 2: ES, ED values for different motifs considering different values of rs and rd, when there are two equal-sized clusters. Here Bin(n, p) denotes a binomial random variable with mean np.
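The ES and ED expressions of Table 2 are straightforward to code. The sketch below (function names are ours) implements both regimes for the triangle motif and for the one-sided motif; a useful sanity check is that the two branches of each function agree at the boundary rs = 2rd.

```python
def triangle_es_ed(rs, rd):
    """E_S, E_D (per-vertex expected count) for the motif
    {z : (z,u) in E, (z,v) in E}, two equal-sized clusters (Table 2)."""
    if rs > 2 * rd:
        return 3 * rs / 4 + rd ** 2 / rs, 2 * rd
    return rs / 2 + rd, 2 * rs - rs ** 2 / (2 * rd)

def one_sided_es_ed(rs, rd):
    """E_S, E_D for the motif {z : (z,u) in E, (z,v) not in E} (and, by
    symmetry, for {z : (z,u) not in E, (z,v) in E})."""
    if rs > 2 * rd:
        return rs / 4 + rd * (rs - rd) / rs, rs - rd
    return rs / 2, (rs ** 2 + 2 * rd ** 2 - 2 * rs * rd) / (2 * rd)
```

Note that the two expectations separate in both regimes whenever rs > rd and rs ≠ 2rd, which is what makes the nearest-expectation test in Algorithm 2 work.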

4 Analysis of the Algorithm

The critical observation that we have to make in order to analyze the motif-counting algorithm is the following: given a GBM graph G(V, E) with two clusters V = V1 ⊔ V2 and a pair of vertices u, v ∈ V with (u, v) ∈ E, the events Ez^{u,v} of any other vertex z being a common neighbor of both u and v are independent (this is obvious in the SBM, but does not lead to the same result there). However, the probabilities of Ez^{u,v} are different when u and v are in the same cluster and when they are in different clusters. Therefore the counts of common neighbors are going to be different, and substantially separated with high probability (being sums of independent random variables), for pairs from the same cluster versus pairs from different clusters. This lets the function process correctly characterize two vertices as being from the same or different clusters with high probability.

Let us now show this more formally. We have the following two lemmas for a GBM graph G(V, E) with two equal-sized clusters V1, V2 and parameters rs, rd.

Lemma 1. For any two vertices u, v ∈ V1 belonging to the same cluster, with (u, v) ∈ E, the count of common neighbors Cu,v ≡ |{z ∈ V : (z, u), (z, v) ∈ E}| is a random variable distributed according to Bin(n/2 − 2, 3rs/2) + Bin(n/2, 2rd²/rs) when rs > 2rd, and according to Bin(n/2 − 2, 3rs/2) + Bin(n/2, 2rd − rs/2) when rs ≤ 2rd, where Bin(n, p) is a binomial random variable with mean np.

Lemma 2. For any two vertices u ∈ V1, v ∈ V2 belonging to different clusters, with (u, v) ∈ E, the count of common neighbors Cu,v ≡ |{z ∈ V : (z, u), (z, v) ∈ E}| is a random variable distributed according to Bin(n − 2, 2rd) when rs > 2rd, and according to Bin(n − 2, 2rs − rs²/(2rd)) when rs ≤ 2rd.

Similar lemmas exist for the other motifs as well (see the full version (Galhotra et al. 2017)). Here let us give the proof of Lemma 1; the proof of Lemma 2 follows similarly. These expressions can also be generalized straightforwardly when the clusters are of unequal sizes, but we omit those for clarity of exposition.

Proof of Lemma 1. Let Xw ∈ [0, 1] be the uniform random variable associated with w ∈ V, and let dL(X, Y) ≡ min{|X − Y|, 1 − |X − Y|}, X, Y ∈ R, denote the circular distance. Without loss of generality, assume u, v ∈ V1. For any vertex z ∈ V, let Ez^{u,v} ≡ {(u, z), (v, z) ∈ E} be the event that z is a common neighbor. For z ∈ V1,

Pr(Ez^{u,v}) = Pr((z, u) ∈ E, (z, v) ∈ E | (u, v) ∈ E)
           = (1/rs) ∫₀^{rs} Pr((z, u) ∈ E, (z, v) ∈ E | dL(Xu, Xv) = x) dx
           = (1/rs) ∫₀^{rs} (2rs − x) dx = 3rs/2.

For z ∈ V2, setting ε = min(rs, 2rd), we have

Pr(Ez^{u,v}) = (1/rs) ∫₀^{rs} Pr((z, u), (z, v) ∈ E | dL(Xu, Xv) = x) dx
           = (1/rs) ∫₀^{ε} (2rd − x) dx,

which equals 2rd²/rs if rs > 2rd, and 2rd − rs/2 if rs ≤ 2rd. Since the events Ez^{u,v}, z ∈ V, are independent, Cu,v is the claimed sum of two binomial random variables.

By leveraging the concentration of binomial random variables, in our algorithm we just check whether the count of common neighbors is closer to the average value of the random variable described in Lemma 1 or to that described in Lemma 2. While more general statements are possible, we give a theorem concentrating on the special case when rs, rd ∼ log n / n.

Theorem 1. If rs = a log n / n and rd = b log n / n with rs > rd, Algorithm 1 can recover the clusters V1, V2 accurately with probability 1 − 3/n if

  (3a − 2b)(a − 2b)/(4a) ≥ √6 (√(3a/4) + √(b²/a) + √(2b))   when a ≥ 2b,
  (a − b)(a − 2b)/(2b) ≥ √6 (√(3a/4) + √(b − a/4) + √(2a − a²/(2b)))   otherwise.

Proof. Let us consider the case rs > 2rd first. Let Z denote the random variable that equals the number of common neighbors of two nodes u, v ∈ V with (u, v) ∈ E, and let μs = E(Z | u ∼ v) and μd = E(Z | u ≁ v), where u ∼ v means u and v are in the same cluster. We can easily find μs and μd from Lemmas 1 and 2:

μs − μd = (n − O(1)) (3rs − 2rd)(rs − 2rd)/(4rs) = (3a − 2b)(a − 2b) log n/(4a) − O(log n/n).

Now, since Z is a sum of independent binary random variables, the Chernoff bound gives Pr(Z > (1 + δ)E(Z)) ≤ e^{−δ² E(Z)/3} = 1/n² when δ = √(6 log n / E(Z)), and similarly for the lower tail. Hence, with probability at least 1 − 3/n² (since there are three binomial terms involved, and they have deviations of more than √(9a/2) log n, √(6b²/a) log n and √(12b) log n with probability at most 1/n² each), the algorithm will label the pair correctly as long as (3a − 2b)(a − 2b)/(4a) ≥ √(9a/2) + √(6b²/a) + √(12b) = √6 (√(3a/4) + √(b²/a) + √(2b)). The case rs ≤ 2rd follows similarly. Finally, we need the labeling to be successful for n pairs of vertices (so that a spanning tree can be formed); applying the union bound over n distinct pairs guarantees a probability of recovery of 1 − 3/n.

Now, instead of relying only on the triangle motif, if we consider all the different motifs (as defined in Table 2) and take the aggregate (majority vote) decision, we can improve the above theorem slightly.

Theorem 2. If rs = a log n / n and rd = b log n / n, the algorithm considering all three motifs of Table 2 for a pair of nodes can recover the clusters V1, V2 correctly with probability 1 − O(1/n) if

  (3a − 2b)(a − 2b)/(4a) ≥ 3 min(√(3a/4) + √(b²/a) + √(2b), √(a/4) + √(b(a − b)/a) + √(a − b))   when a ≥ 2b, and
  (a − b)(a − 2b)/(2b) ≥ 3 min(√(3a/4) + √(b − a/4) + √(2a − a²/(2b)), √(a/2) + √((a² + 2b² − 2ab)/(2b)))   when a < 2b.

The proof of this theorem is present in the full version of this paper (Galhotra et al. 2017).

Remark 1. Instead of the Chernoff bound we could have used a better concentration inequality (such as a Poisson approximation) in the above analysis to get tighter conditions on the constants. We again preferred to keep things simple.

Remark 2 (GBM for t = 3 and above). For the GBM with t = 3, to find the number of common neighbors of two vertices we need to find the area of intersection of two spherical caps on the sphere, which is possible. It can be seen that our algorithm will successfully identify the clusters as long as rs, rd ∼ log n / n, again when the constant terms satisfy certain conditions; however, a tight characterization becomes increasingly difficult. For general t, our algorithm should be successful when rs, rd ∼ (log n / n)^{1/(t−1)}, which is also the regime of the connectivity threshold.

Remark 3 (More than two clusters). When there are more than two clusters, the same analysis technique is applicable, and we can estimate the expected number of common neighbors. This generalization is straightforward but tedious.

Motif-counting algorithm for SBM. While our algorithm is near-optimal for the GBM in the regime rs, rd ∼ log n / n, it is far from optimal for the SBM in the same regime of average degree. Indeed, by using simple Chernoff bounds again, we see that the motif-counting algorithm is successful for the SBM with inter-cluster edge probability q and intra-cluster probability p only when p, q ∼ √(log n / n). The experimental success of our algorithm in real sparse networks therefore somewhat reinforces the fact that the GBM is a better model for those network structures than the SBM.

5 Experimental Results

In addition to the validation experiments of Sections 2.1 and 2.2, we conducted an in-depth experimental evaluation of our proposed model and techniques over a set of synthetic and real-world networks. Additionally, we compared the efficacy and efficiency of our motif-counting algorithm with the popular spectral clustering algorithm using normalized cuts¹ and with the correlation clustering algorithm (Bansal, Blum, and Chawla 2004).

¹http://scikit-learn.org/stable/modules/clustering.html#spectral-clustering

Real Datasets. We use three real datasets, described below.

• Political Blogs (Adamic and Glance 2005). It contains a list of political blogs from the 2004 US election, classified as liberal or conservative, and the links between the blogs. The clusters are of roughly the same size, with a total of 1200 nodes and 20K edges.

• DBLP (Yang and Leskovec 2015). The DBLP dataset is a collaboration network where the ground-truth communities are defined by the research community. The original graph consists of roughly 0.3 million nodes. We process it to extract the top two communities, of sizes ∼4500 and ∼7500 respectively; this is given as input to our algorithm.

• LiveJournal (Leskovec, Adamic, and Huberman 2007). The LiveJournal dataset is a free online blogging social network of around 4 million users. Similarly to DBLP, we extract the top two clusters, of sizes 930 and 1400, which contain around 11.5K edges.

We have not used the academic collaboration dataset (Section 2.1) here because it is quite sparse and below the connectivity-threshold regime of both the GBM and the SBM.

Synthetic Datasets. We generate synthetic datasets of different sizes according to the GBM with t = 2, k = 2 and a wide spectrum of values of rs and rd; specifically, we focus on the sparse region where rs = a log n / n and rd = b log n / n with variable values of a and b.

Experimental Setting. For real networks, it is difficult to calculate an exact threshold, as the exact values of rs and rd are not known. Hence, we follow a three-step approach. Using a somewhat large threshold T1, we sample a subgraph S such that vertices u, v will be in S if there is an edge between u and v, and

Figure 4: Results of the motif-counting algorithm on a synthetic dataset with 5000 nodes. (a) Triangle motif varying b, and the minimum value of a that satisfies the accuracy bound (theory with common neighbors, experiment with common neighbors, theory with all motifs). (b) f-score with varying a, fixing b (b = 1, 2, 5, 10). (c) Fraction of nodes misclassified (node error) with varying a (b = 1, 2, 5, 10).

Dataset          Total no.   T1   T2   T3   Accuracy                          Running Time (sec)
                 of nodes                   Motif-Counting  Spectral clust.   Motif-Counting  Spectral clust.
Political Blogs  1222        20    2    1   0.788           0.53              1.62            0.29
DBLP             12138       10    1    2   0.675           0.63              3.93            18.077
LiveJournal      2366        20    1    1   0.7768          0.64              0.49            1.54

Table 3: Performance on real world networks

they have at least T1 common neighbors. We then attempt to recover the subclusters inside this subgraph by running our algorithm with a small threshold T2. Finally, for each node that is not part of S, say x ∈ V \ S, we select every u ∈ S that x has an edge with, and use a threshold of T3 to decide whether u and x should be in the same cluster; the final decision is made by a majority vote. One could employ more sophisticated methods on top of this algorithm to improve the results further, but that is beyond the scope of this work.

For performance evaluation we use the popular f-score metric, which is the harmonic mean of precision (fraction of the pairs classified into the same cluster that are correctly classified) and recall (fraction of the ground-truth same-cluster pairs that are correctly classified), as well as the node error rate. A node is said to be misclassified if it belongs to a cluster where the majority of nodes come from a different ground-truth cluster (breaking ties arbitrarily). We use these metrics to compare the performance of the different techniques on the various datasets.

Results. We compared our algorithm with the spectral clustering algorithm, where we extracted two eigenvectors in order to extract two communities. Table 3 shows that our algorithm achieves an accuracy as high as 78%. Spectral clustering performed worse than our algorithm on all of the real-world datasets, obtaining its worst accuracy, 53%, on the Political Blogs dataset. The correlation clustering algorithm generates many small clusters, leading to very low recall and much worse performance than the motif-counting algorithm over the whole spectrum of its parameter values; it also takes 8-10 times longer than the motif-counting algorithm for the various ranges of its parameters.

We can also observe in Table 3 that our algorithm is much faster than the spectral clustering algorithm on the larger datasets (LiveJournal and DBLP). This confirms that the motif-counting algorithm is more scalable than spectral clustering. Note that spectral clustering works very well on synthetically generated SBM networks, even in the sparse regime (Lei, Rinaldo, et al. 2015; Rohe et al. 2011); the superior performance of the simple motif-counting algorithm on the real networks therefore provides further validation of the GBM over the SBM. We also compared our algorithm with the Girvan-Newman algorithm (Girvan and Newman 2002), which performs very well on the LiveJournal dataset (98% accuracy) but is extremely slow and performs much worse on the other datasets. This is because the LiveJournal dataset has two well-defined subsets of vertices with very few inter-cluster edges. The reason for the comparatively worse performance of our algorithm there is the sparseness of the graph: if we create a subgraph by removing all nodes of degree 1 and 2, we get 100% accuracy with our algorithm. Finally, our algorithm is easily parallelizable, allowing further improvements. This clearly establishes the efficiency and effectiveness of motif counting.

We observe similar gains on synthetic datasets. Figures 4a, 4b and 4c report results on synthetic datasets with 5000 nodes. Figure 4a plots the minimum gap between a and b that guarantees exact recovery according to Theorem 1 (triangle motif only) and Theorem 2 (all three motifs), against the minimum value of a, for varying b, for which we were experimentally able to recover the clusters exactly (with the triangle motif only, averaged over three runs). Empirically, our results demonstrate the near-optimal performance of the motif-counting algorithm, confirming the theoretical bound. We also see a clear threshold behavior in both the f-score and the node error rate in Figures 4b and 4c. We have also performed spectral clustering on this 5000-node synthetic dataset (Figures 5a and 5b). Compared

to the plots of Figures 4b and 4c, they show suboptimal performance, indicating the relative ineffectiveness of spectral clustering in the GBM compared to the motif-counting algorithm.

Figure 5: Results of spectral clustering on a synthetic dataset with 5000 nodes. (a) f-score with varying a, fixed b (b = 1, 2, 5, 10). (b) Fraction of nodes misclassified with varying a (b = 1, 2, 5, 10).

References

Abbe, E., and Sandon, C. 2015a. Community detection in general stochastic block models: Fundamental limits and efficient algorithms for recovery. In 56th Annual Symposium on Foundations of Computer Science (FOCS), 670-688. IEEE.
Abbe, E., and Sandon, C. 2015b. Recovering communities in the general stochastic block model without knowing the parameters. In Advances in Neural Information Processing Systems, 676-684.
Abbe, E.; Bandeira, A. S.; and Hall, G. 2016. Exact recovery in the stochastic block model. IEEE Transactions on Information Theory 62(1):471-487.
Adamic, L. A., and Glance, N. 2005. The political blogosphere and the 2004 US election: divided they blog. In 3rd International Workshop on Link Discovery, 36-43. ACM.
Avin, C., and Ercal, G. 2007. On the cover time and mixing time of random geometric graphs. Theoretical Computer Science 380(1-2):2-22.
Bansal, N.; Blum, A.; and Chawla, S. 2004. Correlation clustering. Machine Learning 56(1-3):89-113.
Bettstetter, C. 2002. On the minimum node degree and connectivity of a wireless multihop network. In 3rd ACM International Symposium on Mobile Ad Hoc Networking & Computing, 80-91. ACM.
Bollobás, B. 1998. Random graphs. In Modern Graph Theory. Springer. 215-252.
Bubeck, S.; Ding, J.; Eldan, R.; and Rácz, M. Z. 2016. Testing for high-dimensional geometry in random graphs. Random Structures & Algorithms.
Chin, P.; Rao, A.; and Vu, V. 2015. Stochastic block model and community detection in the sparse graphs: A spectral algorithm with optimal rate of recovery. arXiv:1501.05021.
Decelle, A.; Krzakala, F.; Moore, C.; and Zdeborová, L. 2011. Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications. Physical Review E 84(6):066106.
Devroye, L.; György, A.; Lugosi, G.; Udina, F.; et al. 2011. High-dimensional random geometric graphs and their clique number. Electronic Journal of Probability 16:2481-2508.
Dyer, M. E., and Frieze, A. M. 1989. The solution of some random NP-hard problems in polynomial expected time. Journal of Algorithms 10(4):451-489.
Galhotra, S.; Mazumdar, A.; Pal, S.; and Saha, B. 2017. The geometric block model. arXiv preprint arXiv:1709.05510.
Girvan, M., and Newman, M. E. 2002. Community structure in social and biological networks. Proceedings of the National Academy of Sciences 99(12):7821-7826.
Goel, A.; Rai, S.; and Krishnamachari, B. 2005. Monotone properties of random geometric graphs have sharp thresholds. Annals of Applied Probability 2535-2552.
Gupta, P., and Kumar, P. R. 1998. Critical power for asymptotic connectivity. In 37th IEEE Conference on Decision and Control, volume 1, 1106-1110. IEEE.
Haenggi, M.; Andrews, J. G.; Baccelli, F.; Dousse, O.; and Franceschetti, M. 2009. Stochastic geometry and random graphs for the analysis and design of wireless networks. IEEE Journal on Selected Areas in Communications 27(7).
Hajek, B. E.; Wu, Y.; and Xu, J. 2015. Computational lower bounds for community detection on random graphs. In Proceedings of the 28th Conference on Learning Theory, COLT 2015, 899-928.
Hajek, B.; Wu, Y.; and Xu, J. 2016. Achieving exact cluster recovery threshold via semidefinite programming. IEEE Transactions on Information Theory 62(5):2788-2797.
Holland, P. W.; Laskey, K. B.; and Leinhardt, S. 1983. Stochastic blockmodels: First steps. Social Networks 5(2):109-137.
Lei, J.; Rinaldo, A.; et al. 2015. Consistency of spectral clustering in stochastic block models. The Annals of Statistics 43(1):215-237.
Leskovec, J.; Adamic, L. A.; and Huberman, B. A. 2007. The dynamics of viral marketing. ACM Transactions on the Web (TWEB) 1(1):5.
Leskovec, J.; Lang, K. J.; Dasgupta, A.; and Mahoney, M. W. 2008. Statistical properties of community structure in large social and information networks. In 17th International Conference on World Wide Web, 695-704. ACM.
Mossel, E.; Neeman, J.; and Sly, A. 2015. Consistency thresholds for the planted bisection model. In 47th Annual ACM Symposium on Theory of Computing, 69-75. ACM.
Penrose, M. D. 1991. On a continuum percolation model. Advances in Applied Probability 23(03):536-556.
Penrose, M. 2003. Random Geometric Graphs. Number 5. Oxford University Press.
Rohe, K.; Chatterjee, S.; Yu, B.; et al. 2011. Spectral clustering and the high-dimensional stochastic blockmodel. The Annals of Statistics 39(4):1878-1915.
Yang, J., and Leskovec, J. 2015. Defining and evaluating network communities based on ground-truth. Knowledge and Information Systems 42(1):181-213.