Efficient Inference in Stochastic Block Models with Vertex Labels
Total Page:16
File Type:pdf, Size:1020Kb
1 Efficient inference in stochastic block models with vertex labels Clara Stegehuis and Laurent Massoulie´ Abstract—We study the stochastic block model with two connectivity matrix and d the average degree in the network. communities where vertices contain side information in the form Underneath the Kesten-Stigum threshold, no algorithm is able of a vertex label. These vertex labels may have arbitrary label to infer the community memberships better than a random distributions, depending on the community memberships. We analyze a linearized version of the popular belief propagation guess, even though a community structure may be present [14]. algorithm. We show that this algorithm achieves the highest In this setting, it is even impossible to discriminate between accuracy possible whenever a certain function of the network a graph generated by the stochastic block model and an parameters has a unique fixed point. Whenever this function has Erdos˝ Renyi´ random graph with the same average degree, multiple fixed points, the belief propagation algorithm may not even though a community structure is present [13]. Above the perform optimally. We show that increasing the information in the vertex labels may reduce the number of fixed points and Kesten-Stigum threshold, the communities can be efficiently hence lead to optimality of belief propagation. reconstructed [11], [15]. A popular algorithm for community detection is belief propagation [5] (BP). This algorithm starts with initial beliefs I. INTRODUCTION on the community memberships, and iterates until these beliefs converge to a fixed point. Above the Kesten-Stigum threshold, Many real-world networks contain community structures: a fixed point that is correlated with the true community mem- groups of densely connected nodes. Finding these group berships is believed to be the only stable fixed point, so that the structures based on the connectivity matrix of the network algorithm always converges to that fixed point. Underneath the is a problem of interest, and several algorithms have been Kesten-Stigum threshold, the fixed point correlated with the developed to extract these community structures, see [6] true community memberships becomes unstable and the belief for an overview. In many applications however, the network propagation algorithm will in general not result in a partition contains more information than just the connectivity matrix. that is correlated with the true community memberships. For example, the edges can be weighted, or the vertices can However, when the belief propagation algorithm is initialized carry information. This extra network information may help with the real community memberships, there is still a regime of in extracting the community structure of the network. In this the parameters where the fixed point correlated with the true paper, we study the setting where the vertices have labels, community spins can be distinguished from the other fixed which arises in particular when vertices can be distinguished points. In this regime, community detection is believed to be into different types. For example, in social networks vertex possible (for example by exhaustive search of the state space), types may include the interests of a person, the age of a person but not in polynomial time. or the city a person lives in. We investigate how the knowledge When the two communities are equally sized (the symmetric of these vertex types helps us in identifying the community stochastic block model), the phase where community detection structures. may only be possible by non-polynomial time algorithms We focus on the stochastic block model (SBM), a popular is not present [15], [11], [14]. In the case of unbalanced random graph model to analyze community detection prob- communities (the asymmetric stochastic block model) it has arXiv:1806.07562v2 [stat.ML] 17 Aug 2018 lems [8], [4], [18]. In the simplest case, the stochastic block been shown that it is possible to infer the community structure model generates a random graph with 2 communities. First, better than random guessing even below the Kesten-Stigum the vertex set is partitioned into two communities. Then, two threshold [17]. Thus, according to the conjecture of [5], a vertices in communities i and j are connected with probability regime where community detection is possible but not in M for some connection probability matrix M. To include the ij polynomial time may be present in the case of two unbalanced vertex labels, we then attach a label to every vertex, where the communities. label distribution depends on the community membership of In this paper, we investigate the performance of the belief the vertex. propagation algorithm on the asymmetric stochastic block In the stochastic block model with two equally sized com- model when vertices contain side information. We are inter- munities, it is not always possible to infer the community ested in the fraction of correctly inferred community labels structure from the connectivity matrix. A phase transition oc- by the algorithm, and we say that the algorithm performs curs at the so-called Kesten-Stigum threshold λ2d = 1, where 2 optimally if it achieves the highest possible fraction of cor- λ is the second largest eigenvalue of a matrix related to the 2 rectly inferred community labels among all algorithms. Some C. Stegehuis is with Eindhoven University of Technology. special cases of stochastic block models with side information L. Massoulie´ is with Microsoft Research - INRIA joint centre. have already been studied. One such case is the setting 2 where a fraction β of the vertices reveals its true group may significantly improve the performance of polynomial membership [21]. Typically, it is assumed that the fraction of time algorithms for community detection. vertices that reveal their true membership tends to zero when We start by showing with an example that in some cases the graph becomes large [21], [3]. In this setting, a variant vertex labels allow us to efficiently detect communities even of the belief propagation algorithm including the vertex labels below the Kesten-Stigum threshold. seems to perform optimally in a symmetric stochastic block model [21], but may not perform optimally if the communities Example 1. We now present a simple example where it is not are not of equal size [3]. Another special case of the label possible to detect communities using the connectivity matrix distribution is when the observed labels are a noisy version of only, but where it is possible when we also use knowledge of the community memberships, where a fraction of β vertices the vertex labels. Consider an SBM with four communities of receives the label corresponding to their community, and a size n=4, 1,2,3 and 4, where the probability that a vertex in fraction of 1 β vertices receives the label corresponding community i connects to a vertex in community j is given by − to the other community. It was conjectured in [16] that for Mij. Here M is the connection probability matrix defined as this label distribution the belief propagation algorithm always 2 2 2 + + 3 performs optimally in the symmetric stochastic block model. a b a b a b 1 6 2b 2a a + b a + b7 Our contribution: We focus on asymmetric stochastic M = 6 7 : block models with arbitrary label distributions, generalizing n 4a + b a + b 2a 2b 5 the above examples. a + b a + b 2b 2a • We provide an algorithm that uses both the label distribu- The nonzero eigenvalues of this matrix are given by 2(a b)=n tion and the network connectivity matrix to estimate the (appears with multiplicity two) and 4(a + b)=n. Community− group memberships. detection in this example is not able to obtain a partition that is • The algorithm is a local algorithm, which means that it better than a random guess below the Kesten-Stigum threshold, only depends on a finite neighborhood of each vertex. which is In particular, this implies that the running time of the (a b)2 < 4(a + b): algorithm is linear, allowing it to be used on large net- − works. The algorithm is a variant of the belief propagation Now suppose that all vertices in communities 1 and 2 have algorithm, and a generalization of the algorithms provided label ` , and all vertices in communities 3 and 4 have labels in [16], [3] to include arbitrary label distributions and an 1 ` . Then, there are n=2 vertices with label ` and n=2 with asymmetric stochastic block model. 2 1 label `2. Thus, using the labels of the communities alone we • In a regime where the average vertex degrees are large, cannot distinguish between vertices in community 1 and 2 we obtain an expression for the probability that the com- or between vertices in communities 3 and 4. Thus, using the munity of a vertex is identified correctly. Furthermore, we labels only, we can only correctly infer at most half of the show that this algorithm performs optimally if a function community spins. of the network parameters has a unique fixed point. Now suppose we split the network into two smaller net- • Similarly to belief propagation without labels, we show works based on the label of the vertices. Then, we obtain two that when multiple fixed points exist, the belief prop- small networks with connection probability matrices agation algorithm may not converge to the fixed point containing the most information about the community 1 2a 2b : structure. This phenomenon was previously observed in a n 2b 2a setting where the information carried by the vertex labels tends to zero in the large graph limit [3], but we show Thus, community detection can achieve a partition that is that this may also happen if the information carried by better than a random guess in these two networks as long the vertex labels does not tend to zero.