The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18)

The Geometric Block Model

Sainyam Galhotra, Arya Mazumdar, Soumyabrata Pal, Barna Saha
College of Information and Computer Sciences, University of Massachusetts Amherst, Amherst, MA 01003
{sainyam,arya,spal,barna}@cs.umass.edu

Abstract

To capture the inherent geometric features of many community detection problems, we propose to use a new random graph model of communities that we call a Geometric Block Model. The geometric block model generalizes the random geometric graphs in the same way that the well-studied stochastic block model generalizes the Erdős-Rényi random graphs. It is also a natural extension of random community models inspired by the recent theoretical and practical advancement in community detection. While being a topic of fundamental theoretical interest, our main contribution is to show that many practical community structures are better explained by the geometric block model. We also show that a simple triangle-counting algorithm to detect communities in the geometric block model is near-optimal. Indeed, even in the regime where the average degree of the graph grows only logarithmically with the number of vertices (sparse graphs), we show that this algorithm performs extremely well, both theoretically and practically. In contrast, the triangle-counting algorithm is far from being optimal for the stochastic block model. We simulate our results on both real and synthetic datasets to show the superior performance of both the new model and our algorithm.

Copyright © 2018, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

1 Introduction

The planted-partition model, or stochastic block model (SBM), is a random graph model for community detection that generalizes the well-known Erdős-Rényi graphs (Holland, Laskey, and Leinhardt 1983; Dyer and Frieze 1989; Decelle et al. 2011; Abbe and Sandon 2015a; Abbe, Bandeira, and Hall 2016; Hajek, Wu, and Xu 2015; Chin, Rao, and Vu 2015; Mossel, Neeman, and Sly 2015). Consider a graph G(V, E), where V = V1 ⊔ V2 ⊔ · · · ⊔ Vk is a disjoint union of k clusters denoted by V1, . . . , Vk. The edges of the graph are drawn randomly: there is an edge between u ∈ Vi and v ∈ Vj with probability qi,j, 1 ≤ i, j ≤ k. Given the adjacency matrix of such a graph, the task is to find exactly (or approximately) the partition V1 ⊔ V2 ⊔ · · · ⊔ Vk of V.

This model has been incredibly popular in both theoretical and practical domains of community detection, and the aforementioned references are just a small sample. Recent theoretical works focus on characterizing the sharp threshold for recovering the partition in the SBM. For example, when there are only two communities of exactly equal sizes, the inter-cluster edge probability is b log n / n and the intra-cluster edge probability is a log n / n, it is known that perfect recovery is possible if and only if √a − √b > √2 (Abbe, Bandeira, and Hall 2016; Mossel, Neeman, and Sly 2015). The regime of the probabilities being Θ(log n / n) has been put forward as one of the most interesting ones, because in an Erdős-Rényi random graph this is the threshold for graph connectivity (Bollobás 1998). This result has subsequently been generalized to k communities (Abbe and Sandon 2015a; 2015b; Hajek, Wu, and Xu 2016) (for constant k or when k = o(log n)), and under the assumption that the communities are generated according to a probabilistic generative model (there is a prior probability pi of an element being in the ith community) (Abbe and Sandon 2015a). Note that these results are not only of theoretical interest: many real-world networks exhibit a "sparsely connected" community feature (Leskovec et al. 2008), and any efficient recovery algorithm for the SBM has many potential applications.

One aspect that the SBM does not account for is a "transitivity rule" ("friends having common friends") inherent to many social and other community structures. To be precise, consider any three vertices x, y and z. If x and y are connected by an edge (or they are in the same community), and y and z are connected by an edge (or they are in the same community), then it is more likely than not that x and z are connected by an edge. This phenomenon can be seen in many network structures, predominantly in social networks, blog networks and advertising. The SBM, being primarily a generalization of the Erdős-Rényi random graph, does not take this characteristic into account; in particular, the probability of an edge between x and z there is independent of the existence of the edges between x and y and between y and z. However, one needs to be careful that, in allowing such "transitivity", the simplicity and elegance of the SBM is not lost.

Inspired by the above question, we propose a random graph community detection model analogous to the stochastic block model, which we call the geometric block model (GBM). The GBM depends on the basic definition of the random geometric graph, which has found a lot of practical use in wireless networking because of its inclusion of the notion of proximity between nodes (Penrose 2003).

Definition. A random geometric graph (RGG) on n vertices has parameters n, an integer t > 1 and a real number β ∈ [−1, 1]. It is defined by assigning a vector Zi ∈ R^t to vertex i, 1 ≤ i ≤ n, where the Zi, 1 ≤ i ≤ n, are independent and identical random vectors uniformly distributed on the Euclidean sphere S^{t−1} ≡ {x ∈ R^t : ‖x‖2 = 1}. There is an edge between vertices i and j if and only if ⟨Zi, Zj⟩ ≥ β.

Note that the definition can be further generalized by considering the Zi to have a sample space other than S^{t−1}, and by using a different notion of distance than the inner product (e.g., the Euclidean distance). We have simply stated one of the many equivalent definitions (Bubeck et al. 2016).

Random geometric graphs are often proposed as an alternative to Erdős-Rényi random graphs. They are quite well studied theoretically (though not nearly as much as the Erdős-Rényi graphs), and very precise results exist regarding their connectivity, clique numbers and other structural properties (Gupta and Kumar 1998; Penrose 1991; Devroye et al. 2011; Avin and Ercal 2007; Goel, Rai, and Krishnamachari 2005). For a survey of early results on geometric graphs and their applications in wireless networks, see (Penrose 2003; Haenggi et al. 2009; Bettstetter 2002).

Given any simple random graph model, it is possible to generalize it to a random block model of communities, much in line with the SBM. We stress, however, that the geometric block model is perhaps the simplest possible model of real-world communities that also captures the transitive/geometric features of communities. Moreover, the GBM explains the behaviors of many real-world networks, as we will exemplify subsequently.

2 The Geometric Block Model and its Validation

Let V ≡ V1 ⊔ V2 ⊔ · · · ⊔ Vk be the set of vertices that is a disjoint union of k clusters, denoted by V1, . . . , Vk. Given an integer t ≥ 2, for each vertex u ∈ V define a random vector Zu ∈ R^t that is uniformly distributed on the (t − 1)-dimensional sphere S^{t−1} ⊂ R^t.

Definition (Geometric Block Model GBM(V, t, βi,j), 1 ≤ i ≤ j ≤ k). Given real numbers βi,j ∈ [−1, 1], 1 ≤ i ≤ j ≤ k, the geometric block model is the random graph in which there is an edge between u ∈ Vi and v ∈ Vj if and only if ⟨Zu, Zv⟩ ≥ βi,j.

When t = 2, the vertices are i.i.d. uniform points on a circle of circumference 1, and thresholding the inner product is equivalent to thresholding the circular distance between the points. For two clusters we then write GBM(V, t, rs, rd, k), where rs is the distance threshold for pairs within the same cluster and rd is the threshold for pairs in different clusters.
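To make the definition concrete, here is a small sampler for RGG(n, t, β); the function name and interface are ours, for illustration only. Points are drawn uniformly on S^{t−1} by normalizing Gaussian vectors, and edges are placed by thresholding inner products.

```python
import math
import random

def random_geometric_graph(n, t, beta, seed=0):
    """Sample an RGG(n, t, beta): i.i.d. uniform points on the sphere
    S^{t-1} (obtained by normalizing Gaussian vectors), with an edge
    between i and j iff <Z_i, Z_j> >= beta.  Illustrative sketch."""
    rng = random.Random(seed)
    pts = []
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in range(t)]
        norm = math.sqrt(sum(c * c for c in z))
        pts.append([c / norm for c in z])
    edges = {(i, j)
             for i in range(n) for j in range(i + 1, n)
             if sum(a * b for a, b in zip(pts[i], pts[j])) >= beta}
    return pts, edges
```

The GBM below modifies this construction in only one way: the threshold applied to ⟨Zu, Zv⟩ depends on the clusters of the two endpoints.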

Area 1   Area 2   same   different        Area   same   different
MOD      AI        10        2            MOD     19       35
ARCH     MOD        6        1            ARCH    13       15
ROB      ARCH       3        0            ROB     24       16
MOD      ROB        4        0            AI      39       32
ML       MOD        7        1            ML      14       42

Table 1: On the left we count the number of inter-cluster edges when the authors shared the same affiliation and when they had different affiliations. On the right, we count the same for intra-cluster edges.

2.1 Motivation of GBM: Academic Collaboration Network

For the authors, we consider five different communities: Data Management (MOD), Machine Learning and Data Mining (ML), Artificial Intelligence (AI), Robotics (ROB), and Architecture (ARCH). If two authors share the same affiliation, or shared an affiliation in the past, we assume that they are geographically close. We hypothesize that two authors in the same community might collaborate even when they are geographically far apart, whereas two authors in different communities are more likely to collaborate only if they share the same affiliation (or are geographically close). Table 1 reports the number of edges across the communities. It is evident that authors from the same community are likely to collaborate irrespective of their affiliations, while authors from different communities collaborate much more frequently when they share an affiliation or are geographically close. This clearly indicates that inter-cluster edges are likely to form only when the distance between the nodes is quite small, which is exactly the behavior the geometric block model postulates.

Figure 1: Histogram: similarity of products bought together (mean ≈ 6)

2.2 Motivation of GBM: Amazon Metadata

The next dataset that we use in our experiments is the Amazon product metadata on SNAP (https://snap.stanford.edu/data/amazon-meta.html), which has 548,552 products, each of one of the following types: {Books, Music CDs, DVDs, Videos}. Moreover, each product has a list of attributes; for example, a book may have attributes like "General", "Sermon", "Preaching". We consider the co-purchase network over these products. We make two observations here: (1) edges form (that is, items are co-purchased) more frequently if the items are similar, where we measure similarity by the number of common attributes between products; and (2) two products that share an edge have more common neighbors (number of items that are bought along with both of those products) than two products with no edge in between.

Figure 2: Histogram: similarity of products not bought together (mean ≈ 2)

Figures 1 and 2 show, respectively, the average similarity of products that were bought together and of products that were not bought together. From the distributions, it is quite evident that edges in a co-purchase network get formed according to distance, a salient feature of random geometric graphs and of the GBM.

We next take an equal number of product pairs inside Book (also inside DVD, and across Book and DVD) that have an edge in between and that do not have an edge, respectively. Figure 3 shows that the number of common neighbors when two products share an edge is much higher than when they do not; in fact, almost all product pairs that do not have an edge in between also do not share any common neighbor. This again strongly suggests the GBM, due to its transitivity property. On the other hand, it also suggests that the SBM is not a good model for this network, since in the SBM two nodes having common neighbors is independent of whether they share an edge or not.

Difference between SBM and GBM. It is important to stress that the network structures generated by the SBM and the GBM are quite different, and it is significantly more difficult to analyze any algorithm, or to prove any lower bound, on the GBM than on the SBM. This difficulty stems from the highly correlated edge generation in the GBM (whereas edges are independent in the SBM). For this reason, analyses of algorithms such as sphere-comparison and of spectral methods for clustering do not carry over to the GBM as straightforward adaptations. Even for simple algorithms, a property that can be immediately seen for the SBM may still require a careful proof for the GBM.
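The correlated edge generation just described is easy to observe in simulation. The following quick check is our own construction (names and parameters are illustrative, not an experiment from the paper): it samples a small two-cluster GBM on the circle and compares common-neighbor counts of adjacent and non-adjacent pairs.

```python
import random

def sample_gbm(n, rs, rd, seed=7):
    """Two-cluster GBM on the unit-circumference circle: vertices
    0..n/2-1 form cluster 1, the rest cluster 2.  Illustrative sketch."""
    rng = random.Random(seed)
    x = [rng.random() for _ in range(n)]
    half = n // 2
    adj = [set() for _ in range(n)]
    for u in range(n):
        for v in range(u + 1, n):
            d = abs(x[u] - x[v])
            d = min(d, 1 - d)                      # wrap-around distance d_L
            if d <= (rs if (u < half) == (v < half) else rd):
                adj[u].add(v)
                adj[v].add(u)
    return adj

n = 400
adj = sample_gbm(n, 0.05, 0.01)
with_edge, without_edge = [], []
for u in range(n):
    for v in range(u + 1, n):
        (with_edge if v in adj[u] else without_edge).append(len(adj[u] & adj[v]))
mean = lambda xs: sum(xs) / len(xs)
# Adjacent pairs share many more neighbors than non-adjacent ones --
# the transitivity that the SBM, with its independent edges, cannot express.
print(mean(with_edge), mean(without_edge))
```

In the SBM, by contrast, the common-neighbor count of a pair is distributed identically whether or not the pair itself is adjacent.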

Figure 3: Histogram of common neighbors of edges and non-edges in the co-purchase network, from left to right: Book-DVD, Book-Book, DVD-DVD

3 The Motif-Counting Algorithm

Suppose we are given a graph G = (V, E) with k disjoint clusters V1, V2, . . . , Vk ⊆ V generated according to GBM(V, t, rs, rd, k). Our clustering algorithm is based on counting motifs, where a motif is simply defined as a configuration of triplets in the graph. Let us explain this principle with one particular motif, the triangle. For any two vertices u and v in V such that (u, v) is an edge, we count the total number of common neighbors of u and v. We show that this count is different when u and v belong to the same cluster than when they belong to different clusters. We assume G is connected, because otherwise it is impossible to recover the clusters. For every pair of vertices in the graph that share an edge, we decide whether they are in the same cluster or not by this count of triangles. In reality, we do not have to check every such pair; instead, we can stop when we have formed a spanning tree, at which point we can transitively deduce the partition of the nodes into clusters.

The main new idea of this algorithm is to use this triangle count (or motif count in general), since it carries significantly more information regarding the connectivity of the graph than an edge count. One can also use statistics of higher order (such as two-hop common neighbors) at the expense of increased complexity. Surprisingly, the simple greedy algorithm that relies on triplets can separate clusters when rd and rs are Ω(log n / n), which is also a minimal requirement for connectivity of random geometric graphs (Penrose 2003). Therefore this algorithm is optimal up to a constant factor. It is interesting to note that this motif-counting algorithm is not optimal for the SBM (as we observe); in particular, it will not detect the clusters in the sparse threshold region of log n / n for the SBM; however, it does so for the GBM.

The pseudocode of the algorithm is described in Algorithm 1. The algorithm looks at individual pairs of vertices to decide whether they belong to the same cluster or not. We go over pairs of vertices and label them same/different until we have enough labels to partition the graph into clusters. At any stage, the algorithm picks an unassigned node v and queries it with a node from each of the already formed clusters. To decide whether the node belongs to a cluster, it calls a subroutine called process. The process function tries to infer whether the node v belongs to the cluster Vi by first identifying a vertex u ∈ Vi that has an edge with v, and then counting the number of common neighbors of u and v to make a decision. This procedure is continued until all nodes in V are processed.

Algorithm 1: Graph recovery from GBM
Require: GBM G = (V, E), rs, rd, k
Ensure: V = V1 ⊔ . . . ⊔ Vk
 1: V1, . . . , Vk ← ∅
 2: for v ∈ V do
 3:   added ← false
 4:   for i ∈ {1, 2, . . . , k − 1} do
 5:     if process(Vi, v, rs, rd) then
 6:       Vi ← Vi ∪ {v}
 7:       added ← true
 8:     end if
 9:   end for
10:   if ¬added then
11:     Vk ← Vk ∪ {v}
12:   end if
13: end for

Algorithm 2: process
Require: C, v, rs, rd
Ensure: true/false
1: choose u ∈ C such that (u, v) ∈ E
2: count ← |{z : (z, u) ∈ E, (z, v) ∈ E}|
3: if |count − n·ES(rd, rs)| < |count − n·ED(rd, rs)| then
4:   return true
5: end if
6: return false

The process function counts the number of common neighbors of the two nodes and then compares this count against two functions of rd and rs, called ED and ES. The formulae for ED and ES are different when rs < 2rd than when rs ≥ 2rd; we have compiled them in Table 2. In this table we have assumed that there are only two clusters, of equal size. The functions change when the cluster sizes are different; our analysis described in later sections can be used to calculate the new function values. In the table, u ∼ v means that u and v are in the same cluster.

Similarly, the process function can be run on other sets of motifs (other patterns of triplets) by fixing two nodes. When considering a larger set of motifs, the process function can take a majority vote over the decisions received from the different motifs. This helps to resolve the clusters even when the gap between rs and rd is smaller (by a constant factor) than what the triangle motif alone requires.

Note that our algorithm counts motifs only for edges, and does not need to count motifs for more than n − 1 edges, as that many edges are sufficient to construct a spanning tree of the graph.
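A minimal executable sketch of the triangle-motif test is below, assuming two equal-sized clusters and the circle model with thresholds rs and rd; the helper names and the simple sampler are ours, not the paper's code. It computes ES and ED as in Table 2 and classifies an adjacent pair the way process does.

```python
import random

def sample_gbm(n, rs, rd, seed=0):
    """Two-cluster GBM on the unit circle (vertices 0..n/2-1 are
    cluster 1).  Illustrative helper, not the paper's implementation."""
    rng = random.Random(seed)
    x = [rng.random() for _ in range(n)]
    half = n // 2
    adj = [set() for _ in range(n)]
    for u in range(n):
        for v in range(u + 1, n):
            d = abs(x[u] - x[v])
            d = min(d, 1 - d)                     # circular distance d_L
            if d <= (rs if (u < half) == (v < half) else rd):
                adj[u].add(v)
                adj[v].add(u)
    return adj

def expected_counts(rs, rd):
    """Per-vertex expectations (E_S, E_D) of the triangle count for two
    equal-sized clusters, following the triangle row of Table 2."""
    if rs > 2 * rd:
        return 3 * rs / 4 + rd ** 2 / rs, 2 * rd
    return rs / 2 + rd, 2 * rs - rs ** 2 / (2 * rd)

def process(adj, n, u, v, rs, rd):
    """Decide 'same cluster' for the adjacent pair (u, v) by checking
    whether their common-neighbor count is closer to n*E_S or n*E_D."""
    es, ed = expected_counts(rs, rd)
    count = len(adj[u] & adj[v])
    return abs(count - n * es) < abs(count - n * ed)
```

On sparse but connected instances (e.g., n = 500, rs = 0.1, rd = 0.02), this test classifies most adjacent pairs correctly, which is exactly what Algorithm 1 needs in order to stitch the clusters together along a spanning tree.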

Motif {z : (z,u) ∈ E, (z,v) ∈ E}:
  rs > 2rd:  u ∼ v: Bin(n/2 − 2, 3rs/2) + Bin(n/2, 2rd²/rs),  ES = 3rs/4 + rd²/rs
             u ≁ v: Bin(n − 2, 2rd),  ED = 2rd
  rs ≤ 2rd:  u ∼ v: Bin(n/2 − 2, 3rs/2) + Bin(n/2, 2rd − rs/2),  ES = rs/2 + rd
             u ≁ v: Bin(n − 2, 2rs − rs²/(2rd)),  ED = 2rs − rs²/(2rd)

Motif {z : (z,u) ∈ E, (z,v) ∉ E}:
  rs > 2rd:  u ∼ v: Bin(n/2 − 2, rs/2) + Bin(n/2, 2rd(rs − rd)/rs),  ES = rs/4 + rd(rs − rd)/rs
             u ≁ v: Bin(n − 2, rs − rd),  ED = rs − rd
  rs ≤ 2rd:  u ∼ v: Bin(n − 2, rs/2),  ES = rs/2
             u ≁ v: Bin(n − 2, (rs² + 2rd² − 2rs·rd)/(2rd)),  ED = (rs² + 2rd² − 2rs·rd)/(2rd)

Motif {z : (z,u) ∉ E, (z,v) ∈ E}:
  identical to the previous motif, by symmetry.

Table 2: ES, ED values for different motifs considering different values of rs and rd, when there are two equal-sized clusters. Here Bin(n, p) denotes a binomial random variable with mean np.
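The ES and ED expressions of Table 2 are straightforward to code. The sketch below (function names are ours) implements both regimes for the triangle motif and for the one-sided motif; a useful sanity check is that the two branches of each function agree at the boundary rs = 2rd.

```python
def triangle_es_ed(rs, rd):
    """E_S, E_D (per-vertex expected count) for the motif
    {z : (z,u) in E, (z,v) in E}, two equal-sized clusters (Table 2)."""
    if rs > 2 * rd:
        return 3 * rs / 4 + rd ** 2 / rs, 2 * rd
    return rs / 2 + rd, 2 * rs - rs ** 2 / (2 * rd)

def one_sided_es_ed(rs, rd):
    """E_S, E_D for the motif {z : (z,u) in E, (z,v) not in E} (and, by
    symmetry, for {z : (z,u) not in E, (z,v) in E})."""
    if rs > 2 * rd:
        return rs / 4 + rd * (rs - rd) / rs, rs - rd
    return rs / 2, (rs ** 2 + 2 * rd ** 2 - 2 * rs * rd) / (2 * rd)
```

Note that the two expectations separate in both regimes whenever rs > rd and rs ≠ 2rd, which is what makes the nearest-expectation test in Algorithm 2 work.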

4 Analysis of the Algorithm

The critical observation that we have to make in order to analyze the motif-counting algorithm is the following: given a GBM graph G(V, E) with two clusters V = V1 ⊔ V2 and a pair of vertices u, v ∈ V with (u, v) ∈ E, the events Ez^{u,v} of any other vertex z being a common neighbor of both u and v are independent (this is obvious in the SBM, but does not lead to the same result there). However, the probabilities of Ez^{u,v} are different when u and v are in the same cluster and when they are in different clusters. Therefore the counts of common neighbors are going to be different, and substantially separated with high probability (being sums of independent random variables), for pairs from the same cluster versus pairs from different clusters. This lets the function process correctly characterize two vertices as being from the same or different clusters with high probability.

Let us now show this more formally. We have the following two lemmas for a GBM graph G(V, E) with two equal-sized clusters V1, V2 and parameters rs, rd.

Lemma 1. For any two vertices u, v ∈ V1 belonging to the same cluster, with (u, v) ∈ E, the count of common neighbors Cu,v ≡ |{z ∈ V : (z, u), (z, v) ∈ E}| is a random variable distributed according to Bin(n/2 − 2, 3rs/2) + Bin(n/2, 2rd²/rs) when rs > 2rd, and according to Bin(n/2 − 2, 3rs/2) + Bin(n/2, 2rd − rs/2) when rs ≤ 2rd, where Bin(n, p) is a binomial random variable with mean np.

Lemma 2. For any two vertices u ∈ V1, v ∈ V2 belonging to different clusters, with (u, v) ∈ E, the count of common neighbors Cu,v ≡ |{z ∈ V : (z, u), (z, v) ∈ E}| is a random variable distributed according to Bin(n − 2, 2rd) when rs > 2rd, and according to Bin(n − 2, 2rs − rs²/(2rd)) when rs ≤ 2rd.

Similar lemmas exist for the other motifs as well (see the full version (Galhotra et al. 2017)). Here let us give the proof of Lemma 1; the proof of Lemma 2 follows similarly. These expressions can also be generalized straightforwardly when the clusters are of unequal sizes, but we omit those for clarity of exposition.

Proof of Lemma 1. Let Xw ∈ [0, 1] be the uniform random variable associated with w ∈ V, and let dL(X, Y) ≡ min{|X − Y|, 1 − |X − Y|}, X, Y ∈ R, denote the circular distance. Without loss of generality, assume u, v ∈ V1. For any vertex z ∈ V, let Ez^{u,v} ≡ {(u, z), (v, z) ∈ E} be the event that z is a common neighbor. For z ∈ V1,

Pr(Ez^{u,v}) = Pr((z, u) ∈ E, (z, v) ∈ E | (u, v) ∈ E)
           = (1/rs) ∫₀^{rs} Pr((z, u) ∈ E, (z, v) ∈ E | dL(Xu, Xv) = x) dx
           = (1/rs) ∫₀^{rs} (2rs − x) dx = 3rs/2.

For z ∈ V2, setting ε = min(rs, 2rd), we have

Pr(Ez^{u,v}) = (1/rs) ∫₀^{rs} Pr((z, u), (z, v) ∈ E | dL(Xu, Xv) = x) dx
           = (1/rs) ∫₀^{ε} (2rd − x) dx,

which equals 2rd²/rs if rs > 2rd, and 2rd − rs/2 if rs ≤ 2rd. Since the events Ez^{u,v}, z ∈ V, are independent, Cu,v is the claimed sum of two binomial random variables.

By leveraging the concentration of binomial random variables, in our algorithm we just check whether the count of common neighbors is closer to the average value of the random variable described in Lemma 1 or to that described in Lemma 2. While more general statements are possible, we give a theorem concentrating on the special case when rs, rd ∼ log n / n.

Theorem 1. If rs = a log n / n and rd = b log n / n with rs > rd, Algorithm 1 can recover the clusters V1, V2 accurately with probability 1 − 3/n if

  (3a − 2b)(a − 2b)/(4a) ≥ √6 (√(3a/4) + √(b²/a) + √(2b))   when a ≥ 2b,
  (a − b)(a − 2b)/(2b) ≥ √6 (√(3a/4) + √(b − a/4) + √(2a − a²/(2b)))   otherwise.

Proof. Let us consider the case rs > 2rd first. Let Z denote the random variable that equals the number of common neighbors of two nodes u, v ∈ V with (u, v) ∈ E, and let μs = E(Z | u ∼ v) and μd = E(Z | u ≁ v), where u ∼ v means u and v are in the same cluster. We can easily find μs and μd from Lemmas 1 and 2:

μs − μd = (n − O(1)) (3rs − 2rd)(rs − 2rd)/(4rs) = (3a − 2b)(a − 2b) log n/(4a) − O(log n/n).

Now, since Z is a sum of independent binary random variables, the Chernoff bound gives Pr(Z > (1 + δ)E(Z)) ≤ e^{−δ² E(Z)/3} = 1/n² when δ = √(6 log n / E(Z)), and similarly for the lower tail. Hence, with probability at least 1 − 3/n² (since there are three binomial terms involved, and they have deviations of more than √(9a/2) log n, √(6b²/a) log n and √(12b) log n with probability at most 1/n² each), the algorithm will label the pair correctly as long as (3a − 2b)(a − 2b)/(4a) ≥ √(9a/2) + √(6b²/a) + √(12b) = √6 (√(3a/4) + √(b²/a) + √(2b)). The case rs ≤ 2rd follows similarly. Finally, we need the labeling to be successful for n pairs of vertices (so that a spanning tree can be formed); applying the union bound over n distinct pairs guarantees a probability of recovery of 1 − 3/n.

Now, instead of relying only on the triangle motif, if we consider all the different motifs (as defined in Table 2) and take the aggregate (majority vote) decision, we can improve the above theorem slightly.

Theorem 2. If rs = a log n / n and rd = b log n / n, the algorithm considering all three motifs of Table 2 for a pair of nodes can recover the clusters V1, V2 correctly with probability 1 − O(1/n) if

  (3a − 2b)(a − 2b)/(4a) ≥ 3 min(√(3a/4) + √(b²/a) + √(2b), √(a/4) + √(b(a − b)/a) + √(a − b))   when a ≥ 2b, and
  (a − b)(a − 2b)/(2b) ≥ 3 min(√(3a/4) + √(b − a/4) + √(2a − a²/(2b)), √(a/2) + √((a² + 2b² − 2ab)/(2b)))   when a < 2b.

The proof of this theorem is present in the full version of this paper (Galhotra et al. 2017).

Remark 1. Instead of the Chernoff bound we could have used a better concentration inequality (such as a Poisson approximation) in the above analysis to get tighter conditions on the constants. We again preferred to keep things simple.

Remark 2 (GBM for t = 3 and above). For the GBM with t = 3, to find the number of common neighbors of two vertices we need to find the area of intersection of two spherical caps on the sphere, which is possible. It can be seen that our algorithm will successfully identify the clusters as long as rs, rd ∼ log n / n, again when the constant terms satisfy certain conditions; however, a tight characterization becomes increasingly difficult. For general t, our algorithm should be successful when rs, rd ∼ (log n / n)^{1/(t−1)}, which is also the regime of the connectivity threshold.

Remark 3 (More than two clusters). When there are more than two clusters, the same analysis technique is applicable, and we can estimate the expected number of common neighbors. This generalization is straightforward but tedious.

Motif-counting algorithm for SBM. While our algorithm is near-optimal for the GBM in the regime rs, rd ∼ log n / n, it is far from optimal for the SBM in the same regime of average degree. Indeed, by using simple Chernoff bounds again, we see that the motif-counting algorithm is successful for the SBM with inter-cluster edge probability q and intra-cluster probability p only when p, q ∼ √(log n / n). The experimental success of our algorithm in real sparse networks therefore somewhat reinforces the fact that the GBM is a better model for those network structures than the SBM.

5 Experimental Results

In addition to the validation experiments of Sections 2.1 and 2.2, we conducted an in-depth experimental evaluation of our proposed model and techniques over a set of synthetic and real-world networks. Additionally, we compared the efficacy and efficiency of our motif-counting algorithm with the popular spectral clustering algorithm using normalized cuts¹ and with the correlation clustering algorithm (Bansal, Blum, and Chawla 2004).

¹http://scikit-learn.org/stable/modules/clustering.html#spectral-clustering

Real Datasets. We use three real datasets, described below.

• Political Blogs (Adamic and Glance 2005). It contains a list of political blogs from the 2004 US election, classified as liberal or conservative, and the links between the blogs. The clusters are of roughly the same size, with a total of 1200 nodes and 20K edges.

• DBLP (Yang and Leskovec 2015). The DBLP dataset is a collaboration network where the ground-truth communities are defined by the research community. The original graph consists of roughly 0.3 million nodes. We process it to extract the top two communities, of sizes ∼4500 and ∼7500 respectively; this is given as input to our algorithm.

• LiveJournal (Leskovec, Adamic, and Huberman 2007). The LiveJournal dataset is a free online blogging social network of around 4 million users. Similarly to DBLP, we extract the top two clusters, of sizes 930 and 1400, which contain around 11.5K edges.

We have not used the academic collaboration dataset (Section 2.1) here because it is quite sparse and below the connectivity-threshold regime of both the GBM and the SBM.

Synthetic Datasets. We generate synthetic datasets of different sizes according to the GBM with t = 2, k = 2 and a wide spectrum of values of rs and rd; specifically, we focus on the sparse region where rs = a log n / n and rd = b log n / n with variable values of a and b.

Experimental Setting. For real networks, it is difficult to calculate an exact threshold, as the exact values of rs and rd are not known. Hence, we follow a three-step approach. Using a somewhat large threshold T1, we sample a subgraph S such that vertices u, v will be in S if there is an edge between u and v, and

Figure 4: Results of the motif-counting algorithm on a synthetic dataset with 5000 nodes. (a) Triangle motif varying b, and the minimum value of a that satisfies the accuracy bound (theory with common neighbors, experiment with common neighbors, theory with all motifs). (b) f-score with varying a, fixing b (b = 1, 2, 5, 10). (c) Fraction of nodes misclassified (node error) with varying a (b = 1, 2, 5, 10).

Dataset          Total no.   T1   T2   T3   Accuracy                          Running Time (sec)
                 of nodes                   Motif-Counting  Spectral clust.   Motif-Counting  Spectral clust.
Political Blogs  1222        20    2    1   0.788           0.53              1.62            0.29
DBLP             12138       10    1    2   0.675           0.63              3.93            18.077
LiveJournal      2366        20    1    1   0.7768          0.64              0.49            1.54

Table 3: Performance on real world networks

they have at least T1 common neighbors. We then attempt to recover the subclusters inside this subgraph by running our algorithm with a small threshold T2. Finally, for each node that is not part of S, say x ∈ V \ S, we select every u ∈ S that x has an edge with, and use a threshold of T3 to decide whether u and x should be in the same cluster; the final decision is made by a majority vote. One could employ more sophisticated methods on top of this algorithm to improve the results further, but that is beyond the scope of this work.

For performance evaluation we use the popular f-score metric, which is the harmonic mean of precision (fraction of the pairs classified into the same cluster that are correctly classified) and recall (fraction of the ground-truth same-cluster pairs that are correctly classified), as well as the node error rate. A node is said to be misclassified if it belongs to a cluster where the majority of nodes come from a different ground-truth cluster (breaking ties arbitrarily). We use these metrics to compare the performance of the different techniques on the various datasets.

Results. We compared our algorithm with the spectral clustering algorithm, where we extracted two eigenvectors in order to extract two communities. Table 3 shows that our algorithm achieves an accuracy as high as 78%. Spectral clustering performed worse than our algorithm on all of the real-world datasets, obtaining its worst accuracy, 53%, on the Political Blogs dataset. The correlation clustering algorithm generates many small clusters, leading to very low recall and much worse performance than the motif-counting algorithm over the whole spectrum of its parameter values; it also takes 8-10 times longer than the motif-counting algorithm for the various ranges of its parameters.

We can also observe in Table 3 that our algorithm is much faster than the spectral clustering algorithm on the larger datasets (LiveJournal and DBLP). This confirms that the motif-counting algorithm is more scalable than spectral clustering. Note that spectral clustering works very well on synthetically generated SBM networks, even in the sparse regime (Lei, Rinaldo, et al. 2015; Rohe et al. 2011); the superior performance of the simple motif-counting algorithm on the real networks therefore provides further validation of the GBM over the SBM. We also compared our algorithm with the Girvan-Newman algorithm (Girvan and Newman 2002), which performs very well on the LiveJournal dataset (98% accuracy) but is extremely slow and performs much worse on the other datasets. This is because the LiveJournal dataset has two well-defined subsets of vertices with very few inter-cluster edges. The reason for the comparatively worse performance of our algorithm there is the sparseness of the graph: if we create a subgraph by removing all nodes of degree 1 and 2, we get 100% accuracy with our algorithm. Finally, our algorithm is easily parallelizable, allowing further improvements. This clearly establishes the efficiency and effectiveness of motif counting.

We observe similar gains on synthetic datasets. Figures 4a, 4b and 4c report results on synthetic datasets with 5000 nodes. Figure 4a plots the minimum gap between a and b that guarantees exact recovery according to Theorem 1 (triangle motif only) and Theorem 2 (all three motifs), against the minimum value of a, for varying b, for which we were experimentally able to recover the clusters exactly (with the triangle motif only, averaged over three runs). Empirically, our results demonstrate the near-optimal performance of the motif-counting algorithm, confirming the theoretical bound. We also see a clear threshold behavior in both the f-score and the node error rate in Figures 4b and 4c. We have also performed spectral clustering on this 5000-node synthetic dataset (Figures 5a and 5b). Compared

to the plots of Figures 4b and 4c, they show suboptimal performance, indicating the relative ineffectiveness of spectral clustering in the GBM compared to the motif-counting algorithm.

Figure 5: Results of spectral clustering on a synthetic dataset with 5000 nodes. (a) f-score with varying a, fixed b (b = 1, 2, 5, 10). (b) Fraction of nodes misclassified with varying a (b = 1, 2, 5, 10).

References

Abbe, E., and Sandon, C. 2015a. Community detection in general stochastic block models: Fundamental limits and efficient algorithms for recovery. In 56th Annual Symposium on Foundations of Computer Science (FOCS), 670-688. IEEE.
Abbe, E., and Sandon, C. 2015b. Recovering communities in the general stochastic block model without knowing the parameters. In Advances in Neural Information Processing Systems, 676-684.
Abbe, E.; Bandeira, A. S.; and Hall, G. 2016. Exact recovery in the stochastic block model. IEEE Transactions on Information Theory 62(1):471-487.
Adamic, L. A., and Glance, N. 2005. The political blogosphere and the 2004 US election: divided they blog. In 3rd International Workshop on Link Discovery, 36-43. ACM.
Avin, C., and Ercal, G. 2007. On the cover time and mixing time of random geometric graphs. Theoretical Computer Science 380(1-2):2-22.
Bansal, N.; Blum, A.; and Chawla, S. 2004. Correlation clustering. Machine Learning 56(1-3):89-113.
Bettstetter, C. 2002. On the minimum node degree and connectivity of a wireless multihop network. In 3rd ACM International Symposium on Mobile Ad Hoc Networking & Computing, 80-91. ACM.
Bollobás, B. 1998. Random graphs. In Modern Graph Theory. Springer. 215-252.
Bubeck, S.; Ding, J.; Eldan, R.; and Rácz, M. Z. 2016. Testing for high-dimensional geometry in random graphs. Random Structures & Algorithms.
Chin, P.; Rao, A.; and Vu, V. 2015. Stochastic block model and community detection in the sparse graphs: A spectral algorithm with optimal rate of recovery. arXiv:1501.05021.
Decelle, A.; Krzakala, F.; Moore, C.; and Zdeborová, L. 2011. Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications. Physical Review E 84(6):066106.
Devroye, L.; György, A.; Lugosi, G.; Udina, F.; et al. 2011. High-dimensional random geometric graphs and their clique number. Electronic Journal of Probability 16:2481-2508.
Dyer, M. E., and Frieze, A. M. 1989. The solution of some random NP-hard problems in polynomial expected time. Journal of Algorithms 10(4):451-489.
Galhotra, S.; Mazumdar, A.; Pal, S.; and Saha, B. 2017. The geometric block model. arXiv preprint arXiv:1709.05510.
Girvan, M., and Newman, M. E. 2002. Community structure in social and biological networks. Proceedings of the National Academy of Sciences 99(12):7821-7826.
Goel, A.; Rai, S.; and Krishnamachari, B. 2005. Monotone properties of random geometric graphs have sharp thresholds. Annals of Applied Probability 2535-2552.
Gupta, P., and Kumar, P. R. 1998. Critical power for asymptotic connectivity. In 37th IEEE Conference on Decision and Control, volume 1, 1106-1110. IEEE.
Haenggi, M.; Andrews, J. G.; Baccelli, F.; Dousse, O.; and Franceschetti, M. 2009. Stochastic geometry and random graphs for the analysis and design of wireless networks. IEEE Journal on Selected Areas in Communications 27(7).
Hajek, B. E.; Wu, Y.; and Xu, J. 2015. Computational lower bounds for community detection on random graphs. In Proceedings of the 28th Conference on Learning Theory, COLT 2015, 899-928.
Hajek, B.; Wu, Y.; and Xu, J. 2016. Achieving exact cluster recovery threshold via semidefinite programming. IEEE Transactions on Information Theory 62(5):2788-2797.
Holland, P. W.; Laskey, K. B.; and Leinhardt, S. 1983. Stochastic blockmodels: First steps. Social Networks 5(2):109-137.
Lei, J.; Rinaldo, A.; et al. 2015. Consistency of spectral clustering in stochastic block models. The Annals of Statistics 43(1):215-237.
Leskovec, J.; Adamic, L. A.; and Huberman, B. A. 2007. The dynamics of viral marketing. ACM Transactions on the Web (TWEB) 1(1):5.
Leskovec, J.; Lang, K. J.; Dasgupta, A.; and Mahoney, M. W. 2008. Statistical properties of community structure in large social and information networks. In 17th International Conference on World Wide Web, 695-704. ACM.
Mossel, E.; Neeman, J.; and Sly, A. 2015. Consistency thresholds for the planted bisection model. In 47th Annual ACM Symposium on Theory of Computing, 69-75. ACM.
Penrose, M. D. 1991. On a continuum percolation model. Advances in Applied Probability 23(03):536-556.
Penrose, M. 2003. Random Geometric Graphs. Number 5. Oxford University Press.
Rohe, K.; Chatterjee, S.; Yu, B.; et al. 2011. Spectral clustering and the high-dimensional stochastic blockmodel. The Annals of Statistics 39(4):1878-1915.
Yang, J., and Leskovec, J. 2015. Defining and evaluating network communities based on ground-truth. Knowledge and Information Systems 42(1):181-213.