
Efficient inference in stochastic block models with vertex labels

Clara Stegehuis and Laurent Massoulié

Abstract—We study the stochastic block model with two communities where vertices contain side information in the form of a vertex label. These vertex labels may have arbitrary label distributions, depending on the community memberships. We analyze a linearized version of the popular belief propagation algorithm. We show that this algorithm achieves the highest accuracy possible whenever a certain function of the network parameters has a unique fixed point. Whenever this function has multiple fixed points, the belief propagation algorithm may not perform optimally. We show that increasing the information in the vertex labels may reduce the number of fixed points and hence lead to optimality of belief propagation.

C. Stegehuis is with Eindhoven University of Technology. L. Massoulié is with the Microsoft Research - INRIA joint centre.

I. INTRODUCTION

Many real-world networks contain community structures: groups of densely connected nodes. Finding these group structures based on the connectivity matrix of the network is a problem of interest, and several algorithms have been developed to extract these community structures; see [6] for an overview. In many applications however, the network contains more information than just the connectivity matrix. For example, the edges can be weighted, or the vertices can carry information. This extra network information may help in extracting the community structure of the network. In this paper, we study the setting where the vertices have labels, which arises in particular when vertices can be distinguished into different types. For example, in social networks vertex types may include the interests of a person, the age of a person or the city a person lives in. We investigate how the knowledge of these vertex types helps us in identifying the community structures.

We focus on the stochastic block model (SBM), a popular random graph model to analyze community detection problems [8], [4], [18]. In the simplest case, the stochastic block model generates a random graph with two communities. First, the vertex set is partitioned into two communities. Then, two vertices in communities $i$ and $j$ are connected with probability $M_{ij}$ for some connection probability matrix $M$. To include the vertex labels, we then attach a label to every vertex, where the label distribution depends on the community membership of the vertex.

In the stochastic block model with two equally sized communities, it is not always possible to infer the community structure from the connectivity matrix. A phase transition occurs at the so-called Kesten-Stigum threshold $\lambda_2^2 d = 1$, where $\lambda_2$ is the second largest eigenvalue of a matrix related to the connectivity matrix and $d$ the average degree in the network. Underneath the Kesten-Stigum threshold, no algorithm is able to infer the community memberships better than a random guess, even though a community structure may be present [14]. In this setting, it is even impossible to discriminate between a graph generated by the stochastic block model and an Erdős–Rényi random graph with the same average degree, even though a community structure is present [13]. Above the Kesten-Stigum threshold, the communities can be efficiently reconstructed [11], [15].

A popular algorithm for community detection is belief propagation (BP) [5]. This algorithm starts with initial beliefs on the community memberships, and iterates until these beliefs converge to a fixed point. Above the Kesten-Stigum threshold, a fixed point that is correlated with the true community memberships is believed to be the only stable fixed point, so that the algorithm always converges to that fixed point. Underneath the Kesten-Stigum threshold, the fixed point correlated with the true community memberships becomes unstable and the belief propagation algorithm will in general not result in a partition that is correlated with the true community memberships. However, when the belief propagation algorithm is initialized with the real community memberships, there is still a regime of the parameters where the fixed point correlated with the true community spins can be distinguished from the other fixed points. In this regime, community detection is believed to be possible (for example by exhaustive search of the state space), but not in polynomial time.

When the two communities are equally sized (the symmetric stochastic block model), the phase where community detection may only be possible by non-polynomial time algorithms is not present [15], [11], [14]. In the case of unbalanced communities (the asymmetric stochastic block model) it has been shown that it is possible to infer the community structure better than random guessing even below the Kesten-Stigum threshold [17]. Thus, according to the conjecture of [5], a regime where community detection is possible but not in polynomial time may be present in the case of two unbalanced communities.

In this paper, we investigate the performance of the belief propagation algorithm on the asymmetric stochastic block model when vertices contain side information. We are interested in the fraction of community labels that the algorithm infers correctly, and we say that the algorithm performs optimally if it achieves the highest possible fraction of correctly inferred community labels among all algorithms. Some special cases of stochastic block models with side information have already been studied. One such case is the setting where a fraction $\beta$ of the vertices reveals its true group membership [21]. Typically, it is assumed that the fraction of vertices that reveal their true membership tends to zero when the graph becomes large [21], [3]. In this setting, a variant of the belief propagation algorithm including the vertex labels seems to perform optimally in a symmetric stochastic block model [21], but may not perform optimally if the communities are not of equal size [3]. Another special case of the label distribution is when the observed labels are a noisy version of the community memberships, where a fraction $\beta$ of the vertices receives the label corresponding to their community, and a fraction $1-\beta$ receives the label corresponding to the other community. It was conjectured in [16] that for this label distribution the belief propagation algorithm always performs optimally in the symmetric stochastic block model.
Our contribution: We focus on asymmetric stochastic block models with arbitrary label distributions, generalizing the above examples.

• We provide an algorithm that uses both the label distribution and the network connectivity matrix to estimate the group memberships.
• The algorithm is a local algorithm, which means that it only depends on a finite neighborhood of each vertex. In particular, this implies that the running time of the algorithm is linear, allowing it to be used on large networks. The algorithm is a variant of the belief propagation algorithm, and a generalization of the algorithms provided in [16], [3] to include arbitrary label distributions and an asymmetric stochastic block model.
• In a regime where the average vertex degrees are large, we obtain an expression for the probability that the community of a vertex is identified correctly. Furthermore, we show that this algorithm performs optimally if a function of the network parameters has a unique fixed point.
• Similarly to belief propagation without labels, we show that when multiple fixed points exist, the belief propagation algorithm may not converge to the fixed point containing the most information about the community structure. This phenomenon was previously observed in a setting where the information carried by the vertex labels tends to zero in the large graph limit [3], but we show that this may also happen if the information carried by the vertex labels does not tend to zero. The existence of multiple fixed points either indicates that the optimal fixed point can still be found by an exhaustive search of the partition space, or it may indicate that no algorithm is able to detect the community partition.
• We show that increasing the correlation between the vertex covariate and the community structure changes the number of fixed points of the BP algorithm for a specific example of node covariates. In particular, it is possible that the BP algorithm does not converge to the fixed point that is the most informative on the vertex spins if the correlation between the vertex covariates and the vertex spins is small, but that BP does converge to this fixed point if the vertex labels contain more information on the vertex spins. This shows that including node covariates for community detection is helpful, and that it may significantly improve the performance of polynomial time algorithms for community detection.

We start by showing with an example that in some cases vertex labels allow us to efficiently detect communities even below the Kesten-Stigum threshold.

Example 1. We now present a simple example where it is not possible to detect communities using the connectivity matrix only, but where it is possible when we also use knowledge of the vertex labels. Consider an SBM with four communities 1, 2, 3 and 4 of size $n/4$, where the probability that a vertex in community $i$ connects to a vertex in community $j$ is given by $M_{ij}$. Here $M$ is the connection probability matrix defined as
$$M = \frac{1}{n}\begin{pmatrix} 2a & 2b & a+b & a+b\\ 2b & 2a & a+b & a+b\\ a+b & a+b & 2a & 2b\\ a+b & a+b & 2b & 2a \end{pmatrix}.$$
The nonzero eigenvalues of this matrix are given by $2(a-b)/n$ (which appears with multiplicity two) and $4(a+b)/n$. Community detection in this example is not able to obtain a partition that is better than a random guess below the Kesten-Stigum threshold, which is
$$(a-b)^2 < 4(a+b).$$
Now suppose that all vertices in communities 1 and 2 have label $\ell_1$, and all vertices in communities 3 and 4 have label $\ell_2$. Then, there are $n/2$ vertices with label $\ell_1$ and $n/2$ with label $\ell_2$. Thus, using the labels alone we cannot distinguish between vertices in communities 1 and 2, or between vertices in communities 3 and 4. Thus, using the labels only, we can correctly infer at most half of the community spins.

Now suppose we split the network into two smaller networks based on the label of the vertices. Then, we obtain two small networks with connection probability matrices
$$\frac{1}{n}\begin{pmatrix} 2a & 2b\\ 2b & 2a \end{pmatrix}.$$
Thus, community detection can achieve a partition that is better than a random guess in these two networks as long as $(a-b)^2 > 2(a+b)$, i.e., above the corresponding Kesten-Stigum threshold. Thus, in the regime
$$2(a+b) < (a-b)^2 < 4(a+b)$$
it is impossible to infer the community structure better than a random guess without information about the vertex labels, or when using only the vertex labels. However, when using the vertex label information combined with the underlying graph structure, one can infer the community structure of strictly more than half of the vertices correctly.
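The threshold comparison in Example 1 is easy to verify numerically. The following sketch (ours, not part of the original analysis; the values of $a$ and $b$ are purely illustrative and chosen so that the intermediate regime is nonempty) computes the relevant eigenvalues with numpy:

```python
import numpy as np

# Illustrative parameters with 2(a+b) < (a-b)^2 < 4(a+b): here
# (a-b)^2 = 36, 2(a+b) = 20, 4(a+b) = 40.
a, b = 8.0, 2.0
n = 1000  # scale parameter (hypothetical)

inner = np.array([[2*a, 2*b, a+b, a+b],
                  [2*b, 2*a, a+b, a+b],
                  [a+b, a+b, 2*a, 2*b],
                  [a+b, a+b, 2*b, 2*a]])
M = inner / n  # connection probability matrix of Example 1

# Eigenvalues of (n/4) M: largest is a+b, second largest is (a-b)/2.
eig = np.sort(np.linalg.eigvals((n / 4) * M).real)[::-1]
print("full graph:", eig[1]**2, "<", eig[0])
# lambda2^2 < lambda1, i.e. (a-b)^2 < 4(a+b): below the KS threshold

# After splitting on the vertex labels, each half is a 2-community SBM:
M_half = np.array([[2*a, 2*b], [2*b, 2*a]]) / n
eig_half = np.sort(np.linalg.eigvals((n / 4) * M_half).real)[::-1]
print("label-split graph:", eig_half[1]**2, ">", eig_half[0])
# lambda2^2 > lambda1, i.e. (a-b)^2 > 2(a+b): above the KS threshold
```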
Notation: We say that a sequence of events $(\mathcal{E}_n)_{n\geq 1}$ happens with high probability (w.h.p.) if $\lim_{n\to\infty}\mathbb{P}(\mathcal{E}_n) = 1$. Furthermore, we write $f(n) = o(g(n))$ if $\lim_{n\to\infty} f(n)/g(n) = 0$, and $f(n) = O(g(n))$ if $|f(n)|/g(n)$ is uniformly bounded, where $(g(n))_{n\geq 1}$ is nonnegative.

A. Model

Let $G$ be a labeled SBM with two communities. That is, every vertex $i$ has a spin $\sigma_i \in \{+,-\}$, where $\mathbb{P}(\sigma_i = +) = p$ independently for all $i$. Each pair of vertices $(i,j)$ is connected with probability $da/n$ if $\sigma_i = \sigma_j = +$, with probability $dc/n$ if $\sigma_i = \sigma_j = -$, and with probability $db/n$ if $\sigma_i \neq \sigma_j$, so that $d$ controls the average degree in the graph. When the communities do not have equal degrees, partitioning vertices based on their degrees already results in a community detection algorithm that correctly identifies the spin of a vertex with probability strictly larger than $1/2$ [3]. We therefore assume that all vertices have the same average degree, that is,
$$pa + (1-p)b = pb + (1-p)c = 1, \qquad (1)$$
so that the average degree is $d$.

Beside the vertex spins, every vertex has a label attached to it. Let $\mathcal{L}$ be a finite set of labels. Then vertices in community + have label $\ell \in \mathcal{L}$ with probability $\mu(\ell)$, and vertices in community − have label $\ell$ with probability $\nu(\ell)$.

For an estimator $T$ of the community spins, let $T_i(G) \in \{+,-\}$ be the estimated spin of vertex $i$ in graph $G$ under estimator $T$. We then define the success probability of an estimator as
$$P_{\mathrm{succ}}(T) = \frac{1}{n}\sum_{i=1}^n \Big[\mathbb{P}\big(T_i(G) = + \mid \sigma_i = +\big) + \mathbb{P}\big(T_i(G) = - \mid \sigma_i = -\big)\Big] - 1, \qquad (2)$$
where the subtraction of 1 gives zero performance measure to estimators that do not depend on the graph structure $G$. Let $s_0$ be a uniformly chosen vertex. By [3, Proposition 3],
$$P_{\mathrm{succ}}(T^{\mathrm{opt}}) = d_{\mathrm{TV}}(P_+, P_-), \qquad (3)$$
where $P_+$ and $P_-$ are the conditional distributions of $G$ given that $\sigma_{s_0} = +$ and $\sigma_{s_0} = -$ respectively, and $d_{\mathrm{TV}}$ denotes the total variation distance. We say that the community detection problem is solvable if the estimator $T^{\mathrm{opt}}$ maximizing (2) satisfies
$$\liminf_{n\to\infty} P_{\mathrm{succ}}(T^{\mathrm{opt}}) > 0. \qquad (4)$$
Note that the estimator $T_1$ that estimates the spin of vertex $i$ as + if $\mu(L_i) > \nu(L_i)$ and as − otherwise has a success probability of
$$P_{\mathrm{succ}}(T_1) = \sum_{\ell:\,\mu(\ell)>\nu(\ell)} \mu(\ell) + \sum_{\ell:\,\nu(\ell)\geq\mu(\ell)} \nu(\ell) - 1 = \sum_\ell \max(\mu(\ell),\nu(\ell)) - 1 = \sum_\ell \max(\mu(\ell),\nu(\ell)) - \tfrac12\sum_\ell \big(\mu(\ell)+\nu(\ell)\big) = \tfrac12\sum_\ell |\mu(\ell)-\nu(\ell)| = d_{\mathrm{TV}}(\mu,\nu). \qquad (5)$$
Thus, the community detection problem is always solvable when $d_{\mathrm{TV}}(\mu,\nu) > 0$. Furthermore, an estimator $T$ performs better when combining the network data and the vertex labels than when only using the vertex labels if
$$P_{\mathrm{succ}}(T) > d_{\mathrm{TV}}(\mu,\nu).$$
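For concreteness, a minimal sampler for the model above (our own sketch; the $O(n^2)$ edge loop and all parameter choices are purely illustrative, not an implementation from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_labeled_sbm(n, p, a, b, c, d, labels, mu, nu):
    """Sample a labeled two-community SBM as in Section A (sketch).

    Spins are + (True) with probability p; an edge {i, j} is present with
    probability d*a/n, d*b/n or d*c/n depending on the spins of i and j;
    labels are drawn from mu in community + and from nu in community -.
    """
    spins = rng.random(n) < p  # True = spin +
    probs = {(True, True): d * a / n,
             (True, False): d * b / n,
             (False, True): d * b / n,
             (False, False): d * c / n}
    edges = []
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < probs[(bool(spins[i]), bool(spins[j]))]:
                edges.append((i, j))
    vertex_labels = [labels[rng.choice(len(labels), p=(mu if s else nu))]
                     for s in spins]
    return spins, vertex_labels, edges
```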
B. Labeled Galton-Watson trees

A widely used algorithm to detect communities in the stochastic block model is belief propagation [5]. The algorithm computes the belief that a specific vertex $i$ belongs to community +, given the beliefs of the other vertices. Because the stochastic block model is locally tree-like, we study a Galton-Watson tree that behaves similarly to the labeled stochastic block model. We denote this labeled Galton-Watson tree by $(\mathcal{T}, s_0, \sigma, L)$, where $\mathcal{T}$ is a Galton-Watson tree rooted at $s_0$ with a Poisson($d$) offspring distribution. Each vertex $i$ in the tree has two covariates, $\sigma_i \in \{+,-\}$ and $L_i \in \mathcal{L}$. Here $\sigma_i$ denotes the spin of the node, and $L_i$ denotes the vertex label of node $i$. The root $s_0$ has spin $\sigma_{s_0} = +$ with probability $p$ and spin − with probability $1-p$. Given the spin $\sigma_P$ of the parent of node $i$, the probability that $\sigma_i = \sigma_P$ is $pa/d$ if $\sigma_i = +$ and $(1-p)c/d$ if $\sigma_i = -$. Given $\sigma_i = +$, $L_i = \ell$ with probability $\mu(\ell)$, whereas given $\sigma_i = -$, $L_i = \ell$ with probability $\nu(\ell)$.

Let $\mathcal{T}_r^{(s_0,L)}$ denote such a tree of depth $r$ rooted at $s_0$, where the labels $L_i$ are observed, but the spins of the nodes are not. Let $\partial_i$ denote the set of children of vertex $i$. Then Bayes' rule together with (1) yields
$$\frac{\mathbb{P}\big(\sigma_{s_0} = + \mid \mathcal{T}_r^{(s_0,L)}\big)}{\mathbb{P}\big(\sigma_{s_0} = - \mid \mathcal{T}_r^{(s_0,L)}\big)} = \frac{\mu(L_{s_0})\,p}{\nu(L_{s_0})\,(1-p)} \times \prod_{j\in\partial_{s_0}} \frac{a\,\mathbb{P}\big(\sigma_j = + \mid \mathcal{T}_{r-1}^{(j,L)}\big) + b\,\mathbb{P}\big(\sigma_j = - \mid \mathcal{T}_{r-1}^{(j,L)}\big)}{b\,\mathbb{P}\big(\sigma_j = + \mid \mathcal{T}_{r-1}^{(j,L)}\big) + c\,\mathbb{P}\big(\sigma_j = - \mid \mathcal{T}_{r-1}^{(j,L)}\big)}.$$
If we define
$$\xi_r^{(s_0)} = \log\Bigg(\frac{\mathbb{P}\big(\sigma_{s_0} = + \mid \mathcal{T}_r^{(s_0,L)}\big)}{\mathbb{P}\big(\sigma_{s_0} = - \mid \mathcal{T}_r^{(s_0,L)}\big)}\Bigg), \qquad (6)$$
we can write the recursion
$$\xi_r^{(s_0)} = h(L_{s_0}) + w + \sum_{j\in\partial_{s_0}} f\big(\xi_{r-1}^{(j)}\big), \qquad (7)$$
with $w = \log(p/(1-p))$,
$$f(x) = \log\bigg(\frac{a e^x + b}{b e^x + c}\bigg) \qquad (8)$$
and
$$h(\ell) = \log\big(\mu(\ell)/\nu(\ell)\big). \qquad (9)$$

C. Local algorithms

A local algorithm is an algorithm that bases the estimate of the spin of a vertex $i$ only on the neighborhood of vertex $i$ of radius $t$. In general, local algorithms are not able to obtain a success probability (2) larger than zero in the stochastic block model [9], so that an estimator based on a local algorithm does not satisfy (4). However, when a vanishing fraction of vertices reveals their labels, a local algorithm is able to achieve the maximum possible success probability (2) when the parameters of the stochastic block model are above the Kesten-Stigum threshold [9].

Algorithm 1: Local linearized belief propagation with vertex labels.
1. Set $R^0_{i\to j} = \log(\mu(L_i)/\nu(L_i))$ for all $i \in [n]$ and $j \in \mathcal{N}_i$.
2. for $k = 1,\dots,t-1$ do
3. For all $(i,j) \in E$ let
$$R^k_{i\to j} = h(L_i) + w + \sum_{v\in\mathcal{N}_i\setminus\{j\}} f\big(R^{k-1}_{v\to i}\big). \qquad (10)$$
4. end for
5. For all $i\in[n]$ set
$$R^t_i = h(L_i) + w + \sum_{v\in\mathcal{N}_i} f\big(R^{t-1}_{v\to i}\big). \qquad (11)$$
6. For all $i$, set $T^t_{\mathrm{BP}}(i) = +$ if $R^t_i \geq 0$, and set $T^t_{\mathrm{BP}}(i) = -$ if $R^t_i < 0$.

D. Local, linearized belief propagation

The specific local algorithm we consider is a version of the widely used belief propagation [5]. Algorithm 1 uses the observed labels to initialize the belief propagation, and then updates the beliefs as in (7). Here $\mathcal{N}_i$ denotes the neighbors of vertex $i$. Since the algorithm only uses the neighborhood of vertex $i$ up to depth $t$, it is indeed a local algorithm. Note that Algorithm 1 does require knowledge of all parameters of the stochastic block model: $p$, $a$, $b$, $c$, $d$, as well as the label distributions $\mu$ and $\nu$. Furthermore, if the underlying graph $G$ is a tree, then (11) is the same as (7).
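A direct implementation of Algorithm 1 is short; the sketch below is ours (the adjacency-list representation and function names are our own choices, following the notation of the paper):

```python
import math

def linearized_bp(adj, vertex_labels, p, a, b, c, mu, nu, t):
    """Sketch of Algorithm 1: local linearized BP with vertex labels.

    adj[i] is the list of neighbors of vertex i; vertex_labels[i] its label;
    mu and nu map labels to probabilities. Returns estimated spins (+1/-1).
    """
    w = math.log(p / (1 - p))
    h = lambda l: math.log(mu[l] / nu[l])                    # eq. (9)
    f = lambda x: math.log((a * math.exp(x) + b) /
                           (b * math.exp(x) + c))            # eq. (8)
    n = len(adj)
    # Step 1: initialize messages R^0_{i->j} with the label evidence.
    R = {(i, j): h(vertex_labels[i]) for i in range(n) for j in adj[i]}
    # Steps 2-4: t-1 rounds of message passing, eq. (10).
    for _ in range(t - 1):
        R = {(i, j): h(vertex_labels[i]) + w +
                     sum(f(R[(v, i)]) for v in adj[i] if v != j)
             for i in range(n) for j in adj[i]}
    # Step 5: final beliefs, eq. (11).
    belief = [h(vertex_labels[i]) + w + sum(f(R[(v, i)]) for v in adj[i])
              for i in range(n)]
    # Step 6: threshold the log-likelihood ratios at zero.
    return [+1 if x >= 0 else -1 for x in belief]
```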

II. PROPERTIES OF LOCAL, LINEARIZED BELIEF PROPAGATION

We now consider the setting where $a$, $b$ and $c$ grow large. In this regime, we give specific performance guarantees on the success probability of the local belief propagation algorithm. Define
$$R = \begin{pmatrix} pa & (1-p)b\\ pb & (1-p)c \end{pmatrix}.$$
We focus on the regime where $d\lambda_2^2$ is fixed, where $\lambda_2$ is the smallest eigenvalue of $R$, and define $\lambda = d\lambda_2^2$. We let the average degree $d\to\infty$. Then, $\lambda < 1$ corresponds to the Kesten-Stigum bound [3]. Furthermore, we assume that the average degree in each community is equal, so that (1) holds. Under this assumption, $\lambda_2 = 1-b$. Thus, if we let $b = 1-\varepsilon$, then in the regime we are interested in, $d = \lambda/\varepsilon^2$. Then also
$$a = 1 + \frac{1-p}{p}\,\varepsilon, \qquad b = 1-\varepsilon, \qquad c = 1 + \frac{p}{1-p}\,\varepsilon. \qquad (12)$$
Define
$$\alpha_0 = 0, \qquad \alpha_t = G(\alpha_{t-1}), \quad t\geq 1, \qquad (13)$$
where
$$G(\alpha) = \frac{\lambda}{p^2}\,\mathbb{E}\bigg[\frac{1}{1-p+p\,e^{U_-+\sqrt{\alpha}Z-\alpha/2}} - 1\bigg]. \qquad (14)$$
Here $Z$ is a $\mathcal{N}(0,1)$ random variable, and $U_-$ is a random variable independent of $Z$ which takes value $\log(\mu(\ell)/\nu(\ell))$ with probability $\nu(\ell)$. Let
$$Q(x) = \int_x^\infty \frac{1}{\sqrt{2\pi}}\,e^{-y^2/2}\,\mathrm{d}y, \qquad (15)$$
and let $U_+$ denote a random variable which takes value $\log(\mu(\ell)/\nu(\ell))$ with probability $\mu(\ell)$. Then the following theorem gives the success probability of Algorithm 1 in terms of the function $Q$ and a fixed point of $G$, and compares it with the performance of the optimal estimator $T^{\mathrm{opt}}$.

Theorem 1. Let $T^t_{\mathrm{BP}}$ denote the estimator given by Algorithm 1 up to depth $t$. Then,
$$\liminf_{d\to\infty}\liminf_{n\to\infty} P_{\mathrm{succ}}(T^t_{\mathrm{BP}}) = \mathbb{E}\bigg[Q\bigg(\frac{-U_+-\alpha_t/2}{\sqrt{\alpha_t}}\bigg)\bigg] + \mathbb{E}\bigg[Q\bigg(\frac{U_--\alpha_t/2}{\sqrt{\alpha_t}}\bigg)\bigg] - 1. \qquad (16)$$
Furthermore, if $G$ has a unique fixed point, then
$$\liminf_{d\to\infty}\liminf_{n\to\infty} P_{\mathrm{succ}}(T^{\mathrm{opt}}) = \mathbb{E}\bigg[Q\bigg(\frac{-U_+-\alpha_\infty/2}{\sqrt{\alpha_\infty}}\bigg)\bigg] + \mathbb{E}\bigg[Q\bigg(\frac{U_--\alpha_\infty/2}{\sqrt{\alpha_\infty}}\bigg)\bigg] - 1, \qquad (17)$$
and the estimator of Algorithm 1 is asymptotically optimal.

We now comment on the results and their implications.

a) Special cases of $G(\alpha)$: The function $G(\alpha)$ has been investigated for two special cases of the labeled stochastic block model. Analyzing $G(\alpha)$ for these special cases already turned out to be difficult, but some conjectures on its behavior have been made based on simulations. In [16], it was conjectured that for the special case where $p = 1/2$ and the vertex labels are noisy versions of the spins, the function $G(\alpha)$ only has one fixed point for all possible values of $\lambda$. Thus, Algorithm 1 is conjectured to perform optimally for the symmetric stochastic block model with noisy spins as vertex labels.

The asymmetric stochastic block model where the information about the community memberships carried by the vertex labels goes to zero was studied in [3]. Instead of noisy spins as labels, a fraction $\beta$ of the vertices reveals their true spins while the other vertices carry an uninformative label, where $\beta$ tends to zero as the graph grows large. In that setting, it was conjectured that the function $G(\alpha)$ may have 2 or 3 fixed points for small values of $p$ and $\lambda < 1$.

b) Influence of the initial beliefs on the performance of Algorithm 1: When $G(\alpha)$ has more than one fixed point, the success probability of Algorithm 1 corresponds to the smallest fixed point of $G$. If in the belief propagation initialization the true unknown beliefs are used, the success probability of the algorithm corresponds to the largest fixed point of $G$. Since $G$ is increasing, this also implies that the success probability when initializing with the true unknown beliefs is higher than the success probability of Algorithm 1.
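The fixed points of $G$ drive everything that follows, and the iteration (13)-(14) is easy to explore numerically. Below is a small Monte Carlo sketch of our own (sample sizes, the helper names `G` and `iterate_alpha`, and the parameter values are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

def G(alpha, lam, p, mu, nu, n_samples=200_000):
    """Monte Carlo evaluation of G(alpha) from eq. (14).

    U_- takes value log(mu[l]/nu[l]) with probability nu[l]; Z ~ N(0,1).
    """
    mu, nu = np.asarray(mu), np.asarray(nu)
    U = np.log(mu / nu)[rng.choice(len(nu), size=n_samples, p=nu)]
    Z = rng.standard_normal(n_samples)
    inner = 1.0 / (1 - p + p * np.exp(U + np.sqrt(alpha) * Z - alpha / 2)) - 1
    return lam / p**2 * inner.mean()

def iterate_alpha(lam, p, mu, nu, t=50):
    """Iterate alpha_t = G(alpha_{t-1}) from alpha_0 = 0, eq. (13)."""
    alpha = 0.0
    for _ in range(t):
        alpha = G(alpha, lam, p, mu, nu)
    return alpha

# Noisy labels as an example (beta = 0.85), with p = 0.05 and lambda = 0.8:
print(iterate_alpha(0.8, 0.05, mu=[0.85, 0.15], nu=[0.15, 0.85]))
```

Starting the iteration from a large $\alpha$ instead of $\alpha_0 = 0$ approximates the largest fixed point, which is one way to visualize the gap discussed in remark b).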

[Fig. 1: The function $G(\alpha)$ for $\lambda = 0.8$ and noisy labels ($\mu(\ell_1) = \beta$, $\mu(\ell_3) = 1-\beta$, $\nu(\ell_2) = \beta$, $\nu(\ell_3) = 1-\beta$) for $\beta \in \{0.4, 0.42, 0.44, 0.46, 0.48, 0.5\}$. Panels: (a) $p = 0.05$; (b) $p = 0.05$, zoomed in; (c) $p = 0.5$; (d) $p = 0.5$, zoomed in. The black line is the line $y = x$.]

c) Multiple fixed points of $G$: Figures 1a and 1b show that in the setting of Theorem 1, where the information in the labels about the vertex spins does not vanish, the function $G(\alpha)$ may have more than one fixed point, even when the probability of observing the correct label does not go to $1/2$ as $n\to\infty$. This is very different from the special case where $p = 1/2$, where the function $G(\alpha)$ was conjectured to have at most one fixed point [16]. Indeed, Figures 1c and 1d show that for the symmetric stochastic block model $G(\alpha)$ only contains one fixed point. For the asymmetric stochastic block model on the other hand, there is a region of parameters where Algorithm 1 may not achieve the highest possible accuracy among all algorithms. Belief propagation initialized with the true beliefs corresponds to the highest fixed point of $G$, and thus results in a better estimator than belief propagation initialized with beliefs based on the vertex labels. In this case, exhaustive search of the partition space may still find all fixed points of the belief propagation algorithm, of which one corresponds to the fixed point having maximal overlap with the true partition. However, whether this fixed point can be distinguished from the other fixed points without knowledge of the true community spins is unknown. If this is possible, this would indicate a phase where community detection is possible, but computationally hard. If the fixed point is indistinguishable from the other fixed points, even exhaustive search of the partition space will not result in a better partition. In the asymmetric stochastic block model without vertex labels, it was shown that it is sometimes indeed possible to detect communities even underneath the Kesten-Stigum threshold [17] (in non-polynomial time). It would be interesting to see in which cases this also holds for the stochastic block model including vertex labels.

d) Increasing vertex label information: Interestingly, Figure 1b shows an example where the lowest and the highest fixed points of $G$ are stable, but the middle fixed point is unstable. Thus, to converge to a fixed point corresponding to a better correlation with the network partition than initializing at $\alpha = 0$, the initial beliefs should correspond to an $\alpha$-value that is equal to or larger than the second fixed point of $G$ in this example. Note that the case $\beta = 0.5$ is similar to the community detection problem without extra vertex information, because for $\beta = 0.5$ the vertex labels are independent of the community memberships. Thus, in the asymmetric stochastic block model, there is a fixed point of $G$ corresponding to a partition with non-trivial overlap with the true partition, but the BP algorithm does not find this partition when initialized with random beliefs. The same situation occurs when the information about the community membership carried by the vertex labels is small (for example when $\beta = 0.48$). However, when the information carried by the vertex labels is sufficiently large, the BP algorithm starts to converge to the largest fixed point of $G$, and the BP algorithm performs optimally. Thus, including node covariates in the BP algorithm for the asymmetric stochastic block model may change the number of fixed points, and therefore significantly improve the performance of the BP algorithm.

e) Success probability: Figures 2a and 2b plot the success probability of Algorithm 1 given by equation (16) against $p$ for the case of noisy labels and revealed labels respectively. We see that for small and large values of $p$, there is a rapid increase in the success probability. This increase is caused by the shape of $G$, shown in Figures 1a and 1c for the setting with noisy labels. The location of the fixed point of $G$ is much closer to the origin for $p = 0.5$ than for $p = 0.05$. This difference causes the increase in the success probability. Figures 3a and 3b show the success probability given by (16) as a function of $\lambda$ for unbalanced communities. Here we also see that there is a small range of $\lambda < 1$ where the success probability increases rapidly in $\lambda$. Figure 4 shows that the accuracy obtained in Theorem 1 is higher than the accuracy obtained when only using the vertex labels to distinguish the communities, even underneath the Kesten-Stigum threshold $\lambda < 1$.
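Once a fixed point $\alpha$ has been computed, the limit in (16) is straightforward to evaluate. A sketch reusing the Monte Carlo helper above (again our own code, with illustrative inputs):

```python
import numpy as np
from scipy.stats import norm  # norm.sf(x) = 1 - Phi(x), the Q of eq. (15)

def success_probability(alpha, mu, nu):
    """Evaluate the right-hand side of eq. (16) at a fixed point alpha.

    U_+ takes value log(mu[l]/nu[l]) with prob. mu[l], U_- with prob. nu[l].
    """
    mu, nu = np.asarray(mu), np.asarray(nu)
    u = np.log(mu / nu)
    q = norm.sf
    term_plus = np.sum(mu * q((-u - alpha / 2) / np.sqrt(alpha)))
    term_minus = np.sum(nu * q((u - alpha / 2) / np.sqrt(alpha)))
    return term_plus + term_minus - 1

# e.g. with the alpha returned by iterate_alpha(...) above:
# print(success_probability(alpha, mu=[0.85, 0.15], nu=[0.15, 0.85]))
```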
[Fig. 2: $P_{\mathrm{succ}}$ as a function of $p$ for $\lambda = 0.8$. (a) Noisy vertex labels: $\mu(\ell_1) = 0.55 = 1-\mu(\ell_2)$, $\nu(\ell_2) = 0.85 = 1-\nu(\ell_1)$. (b) Revealed vertex labels: $\mu(\ell_1) = 0.1 = 1-\mu(\ell_3)$, $\nu(\ell_2) = 0.05 = 1-\nu(\ell_3)$.]

III. PROOF OF THEOREM 1

Because the SBM is locally tree-like, we first investigate Algorithm 1 on the Galton-Watson tree defined in Section I-B, where we study the recursion (7). Denote by $\xi_r^{(+)}$ the value of $\xi_r$ for a randomly chosen vertex in community + and define $\xi_r^{(-)}$ similarly. Then, $\xi_0^{(+)} \overset{d}{=} U_+ + w$ and $\xi_0^{(-)} \overset{d}{=} U_- + w$. We first investigate the distribution of $\xi_1$.

[Fig. 3: $P_{\mathrm{succ}}$ as a function of $\lambda$ for $p = 0.05$. (a) Noisy vertex labels: $\mu(\ell_1) = 0.55 = 1-\mu(\ell_2)$, $\nu(\ell_2) = 0.85 = 1-\nu(\ell_1)$. (b) Revealed vertex labels: $\mu(\ell_1) = 0.1 = 1-\mu(\ell_3)$, $\nu(\ell_2) = 0.05 = 1-\nu(\ell_3)$.]

[Fig. 4: Success probability of Algorithm 1 and the success probability when only using the vertex labels, for $p = 0.5$, $\lambda = 0.8$, $\mu(\ell_1) = 0.5+\beta = 1-\mu(\ell_2)$ and $\nu(\ell_1) = 0.5-\beta = 1-\nu(\ell_2)$.]

Lemma 2. As $d\to\infty$,
$$\xi_1^{(+)} \overset{W}{\longrightarrow} U_+ + w + \mathcal{N}(\alpha_1/2, \alpha_1), \qquad (18)$$
$$\xi_1^{(-)} \overset{W}{\longrightarrow} U_- + w - \mathcal{N}(\alpha_1/2, \alpha_1), \qquad (19)$$
where $\overset{W}{\longrightarrow}$ denotes convergence in the Wasserstein metric (see for example [7, Section 2]).

Proof. From the recursion (13) we obtain
$$\alpha_1 = G(0) = \frac{\lambda}{p^2}\sum_\ell \nu(\ell)\,\frac{p - p\,\frac{\mu(\ell)}{\nu(\ell)}}{1-p+p\,\frac{\mu(\ell)}{\nu(\ell)}} = \frac{\lambda}{p}\sum_\ell \nu(\ell)\,\frac{\nu(\ell)-\mu(\ell)}{(1-p)\nu(\ell)+p\mu(\ell)}$$
$$= \frac{\lambda}{p}\sum_\ell \bigg(\frac{\nu(\ell)^2-\mu(\ell)\nu(\ell)}{(1-p)\nu(\ell)+p\mu(\ell)} + \mu(\ell)-\nu(\ell)\bigg) = \frac{\lambda}{p}\sum_\ell \frac{p\nu(\ell)^2 + p\mu(\ell)^2 - 2p\nu(\ell)\mu(\ell)}{(1-p)\nu(\ell)+p\mu(\ell)} = \lambda\sum_\ell \frac{(\mu(\ell)-\nu(\ell))^2}{p\mu(\ell)+(1-p)\nu(\ell)},$$
where the third equality uses that $\sum_\ell(\mu(\ell)-\nu(\ell)) = 0$. Furthermore, we can write $\xi_1^{(+)}$ as
$$\xi_1^{(+)} \overset{d}{=} U_+ + w + \sum_{\ell'} N_{\ell'}\, f\big(h(\ell')+w\big), \qquad (20)$$
where $N_{\ell'} \sim \mathrm{Poisson}\big(d(pa\mu(\ell')+(1-p)b\nu(\ell'))\big)$, independent of $U_+$. Subtracting and adding the mean of the Poisson variables yields
$$\xi_1^{(+)} \overset{d}{=} U_+ + w + \sum_{\ell'}\Big(N_{\ell'} - d\big(pa\mu(\ell')+(1-p)b\nu(\ell')\big)\Big) f\big(h(\ell')+w\big) + \sum_{\ell'} d\big(pa\mu(\ell')+(1-p)b\nu(\ell')\big) f\big(h(\ell')+w\big). \qquad (21)$$
In the regime we are interested in, $a = 1+\frac{1-p}{p}\varepsilon$, $b = 1-\varepsilon$, $c = 1+\frac{p}{1-p}\varepsilon$ and $d = \lambda/\varepsilon^2$ for some $\varepsilon > 0$ (see (12)). Then, Taylor expanding $f(h(\ell)+w)$ around $\varepsilon = 0$ results in
$$f\big(h(\ell)+w\big) = \log\Bigg(\frac{\big(1+\varepsilon\frac{1-p}{p}\big)p\mu(\ell) + (1-\varepsilon)(1-p)\nu(\ell)}{(1-\varepsilon)p\mu(\ell) + \big(1+\varepsilon\frac{p}{1-p}\big)(1-p)\nu(\ell)}\Bigg) = \varepsilon\,\frac{\mu(\ell)-\nu(\ell)}{p\mu(\ell)+(1-p)\nu(\ell)} + \frac{\varepsilon^2}{2}\,\frac{(2p-1)(\mu(\ell)-\nu(\ell))^2}{\big(p\mu(\ell)+(1-p)\nu(\ell)\big)^2} + O(\varepsilon^3).$$
This shows that the last term in (21) can be rewritten as
$$\sum_{\ell'} d\big(pa\mu(\ell')+(1-p)b\nu(\ell')\big) f\big(h(\ell')+w\big) = \frac{\lambda}{\varepsilon^2}\sum_{\ell'}\Big(\big(1+\tfrac{1-p}{p}\varepsilon\big)p\mu(\ell') + (1-\varepsilon)(1-p)\nu(\ell')\Big) f\big(h(\ell')+w\big) = \frac{\lambda}{2}\sum_{\ell'}\frac{(\mu(\ell')-\nu(\ell'))^2}{p\mu(\ell')+(1-p)\nu(\ell')} + O(\varepsilon) = \frac{\alpha_1}{2} + O(\varepsilon).$$
By [3, Corollary A3],
$$\frac{N_{\ell'} - d\big(pa\mu(\ell')+(1-p)b\nu(\ell')\big)}{\sqrt{d\big(pa\mu(\ell')+(1-p)b\nu(\ell')\big)}} \overset{W}{\longrightarrow} \mathcal{N}(0,1)$$
as $d\to\infty$. Then,
$$\sum_{\ell'}\Big(N_{\ell'} - d\big(pa\mu(\ell')+(1-p)b\nu(\ell')\big)\Big) f\big(h(\ell')+w\big) \overset{W}{\longrightarrow} \mathcal{N}(0,\alpha_1).$$
Thus, as $d\to\infty$,
$$\xi_1^{(+)} \overset{W}{\longrightarrow} U_+ + w + \mathcal{N}(\alpha_1/2, \alpha_1),$$
and similar arguments prove the lemma for $\xi_1^{(-)}$. ∎
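A quick simulation can make Lemma 2 tangible. The sketch below (ours, with hypothetical parameters) samples the compound-Poisson representation (20) of $\xi_1^{(+)} - U_+ - w$ and compares its first two moments with $\alpha_1/2$ and $\alpha_1$:

```python
import numpy as np

rng = np.random.default_rng(2)

def sample_xi1_plus(eps, lam, p, mu, nu, n_samples=100_000):
    """Simulate xi_1^{(+)} - U_+ - w via eq. (20) (sketch)."""
    a, b, c = 1 + (1 - p) / p * eps, 1 - eps, 1 + p / (1 - p) * eps
    d = lam / eps**2
    w = np.log(p / (1 - p))
    h = np.log(np.asarray(mu) / np.asarray(nu))
    f = np.log((a * np.exp(h + w) + b) / (b * np.exp(h + w) + c))
    means = d * (p * a * np.asarray(mu) + (1 - p) * b * np.asarray(nu))
    N = rng.poisson(means, size=(n_samples, len(mu)))  # N_l per sample
    return N @ f  # sum over labels of N_l * f(h(l) + w)

mu, nu, p, lam = [0.85, 0.15], [0.15, 0.85], 0.3, 0.8
samples = sample_xi1_plus(eps=0.05, lam=lam, p=p, mu=mu, nu=nu)
alpha1 = lam * sum((m - v)**2 / (p * m + (1 - p) * v) for m, v in zip(mu, nu))
print(samples.mean(), "~", alpha1 / 2)  # mean close to alpha_1/2 (up to O(eps))
print(samples.var(), "~", alpha1)       # variance close to alpha_1
```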

We now proceed to the distribution of $\xi_r$ for $r > 1$ by induction.

Lemma 3. Assume that
$$\xi_r^{(+)} \overset{W}{\longrightarrow} U_+ + w + \mathcal{N}(\alpha_r/2, \alpha_r), \qquad (22)$$
$$\xi_r^{(-)} \overset{W}{\longrightarrow} U_- + w - \mathcal{N}(\alpha_r/2, \alpha_r) \qquad (23)$$
for some $r \geq 1$. Then,
$$\xi_{r+1}^{(+)} \overset{W}{\longrightarrow} U_+ + w + \mathcal{N}(\alpha_{r+1}/2, \alpha_{r+1}), \qquad (24)$$
$$\xi_{r+1}^{(-)} \overset{W}{\longrightarrow} U_- + w - \mathcal{N}(\alpha_{r+1}/2, \alpha_{r+1}) \qquad (25)$$
as $d\to\infty$.

Proof. Define
$$\gamma_r^{(s_0)} = \xi_r^{(s_0)} - h(L_{s_0}) - w, \qquad (26)$$
and define $\gamma_r^{(+)}$ as the value of $\gamma_r^{(s_0)}$ for a randomly chosen $s_0$ in community +. We start by investigating the first moment of $\gamma_{r+1}^{(+)}$. Using Wald's equation, we obtain
$$\mathbb{E}\big[\gamma_{r+1}^{(+)}\big] = dap\,\mathbb{E}\big[f(\xi_r^{(+)})\big] + db(1-p)\,\mathbb{E}\big[f(\xi_r^{(-)})\big] = \frac{\lambda}{\varepsilon^2}\big(p+\varepsilon(1-p)\big)\mathbb{E}\big[f(\xi_r^{(+)})\big] + \frac{\lambda}{\varepsilon^2}(1-\varepsilon)(1-p)\,\mathbb{E}\big[f(\xi_r^{(-)})\big], \qquad (27)$$
where the second equality uses (12). We then use that [3, Eq. (A4)]
$$f(x) = \log\bigg(1 + (\varepsilon+\varepsilon^2)\,\frac{e^x}{p(1+e^x)} + O(\varepsilon^3)\bigg) - \log\bigg(1 + (\varepsilon+\varepsilon^2)\,\frac{1}{(1-p)(1+e^x)} + O(\varepsilon^3)\bigg).$$
Taylor expanding $\log(1+x)$ then results in
$$f(x) = \varepsilon(1+\varepsilon)\,\frac{e^x}{p(1+e^x)} - \varepsilon(1+\varepsilon)\,\frac{1}{(1-p)(1+e^x)} - \varepsilon^2\,\frac{e^{2x}}{2p^2(1+e^x)^2} + \varepsilon^2\,\frac{1}{2(1-p)^2(1+e^x)^2} + O(\varepsilon^3). \qquad (28)$$
For all bounded continuous functions $g(x)$, by [3, Lemma A6]
$$(1-p)\,\mathbb{E}\big[g(\xi_r^{(-)})\big] = p\,\mathbb{E}\big[g(\xi_r^{(+)})\,e^{-\xi_r^{(+)}}\big]. \qquad (29)$$
Denote $h_1(x) = (1+e^x)^{-1}$ and $h_2(x) = e^x(1+e^x)^{-1}$. Then,
$$(1-p)\,\mathbb{E}\big[h_2(\xi_r^{(-)})\big] + p\,\mathbb{E}\big[h_2(\xi_r^{(+)})\big] = p,$$
$$(1-p)\,\mathbb{E}\big[h_1(\xi_r^{(-)})\big] + p\,\mathbb{E}\big[h_1(\xi_r^{(+)})\big] = 1-p,$$
$$(1-p)\,\mathbb{E}\big[h_2(\xi_r^{(-)})^2\big] + p\,\mathbb{E}\big[h_2(\xi_r^{(+)})^2\big] = p\,\mathbb{E}\big[h_2(\xi_r^{(+)})\big],$$
$$(1-p)\,\mathbb{E}\big[h_1(\xi_r^{(-)})^2\big] + p\,\mathbb{E}\big[h_1(\xi_r^{(+)})^2\big] = (1-p)\,\mathbb{E}\big[h_1(\xi_r^{(-)})\big].$$
Combining this with (27) and (28), the terms of order $\varepsilon^{-1}$ cancel, and we obtain
$$\mathbb{E}\big[\gamma_{r+1}^{(+)}\big] = \lambda\bigg(\frac{1}{2p} - \frac{1-p}{2p^2}\,\mathbb{E}\big[h_2(\xi_r^{(-)})\big] - \frac{1}{p}\,\mathbb{E}\big[h_2(\xi_r^{(-)})\big] + \frac{1}{2(1-p)}\,\mathbb{E}\big[h_1(\xi_r^{(-)})\big]\bigg) + O(\varepsilon)$$
$$= \lambda\bigg(\frac{1}{2p} - \frac{1+p}{2p^2} + \frac{1+p}{2p^2}\,\mathbb{E}\big[h_1(\xi_r^{(-)})\big] + \frac{1}{2(1-p)}\,\mathbb{E}\big[h_1(\xi_r^{(-)})\big]\bigg) + O(\varepsilon) = \frac{\lambda}{2p^2}\,\mathbb{E}\bigg[\frac{1}{(1-p)\big(1+e^{\xi_r^{(-)}}\big)} - 1\bigg] + O(\varepsilon).$$
Combining this with the induction hypothesis results in
$$\mathbb{E}\big[\gamma_{r+1}^{(+)}\big] = \frac{\lambda}{2p^2}\,\mathbb{E}\bigg[\frac{1}{(1-p)\big(1+e^{U_-+w+\sqrt{\alpha_r}Z-\alpha_r/2}\big)} - 1\bigg] + O(\varepsilon) = \frac{\lambda}{2p^2}\,\mathbb{E}\bigg[\frac{1}{1-p+p\,e^{U_-+\sqrt{\alpha_r}Z-\alpha_r/2}} - 1\bigg] + O(\varepsilon) = \tfrac12 G(\alpha_r) + O(\varepsilon).$$
For the variance, we obtain using Wald's equation
$$\mathrm{Var}\big(\gamma_{r+1}^{(+)}\big) = dap\,\mathbb{E}\big[f(\xi_r^{(+)})^2\big] + db(1-p)\,\mathbb{E}\big[f(\xi_r^{(-)})^2\big] = \lambda p\,\mathbb{E}\bigg[\Big(\frac{h_2(\xi_r^{(+)})}{p} - \frac{h_1(\xi_r^{(+)})}{1-p}\Big)^2\bigg] + \lambda(1-p)\,\mathbb{E}\bigg[\Big(\frac{h_2(\xi_r^{(-)})}{p} - \frac{h_1(\xi_r^{(-)})}{1-p}\Big)^2\bigg] + O(\varepsilon), \qquad (30)$$
where we used (28) again. Similar computations as for the expected value then lead to
$$\mathrm{Var}\big(\gamma_{r+1}^{(+)}\big) = G(\alpha_r) + O(\varepsilon) = \alpha_{r+1} + O(\varepsilon). \qquad (31)$$
Thus, the first and second moments of $\gamma_{r+1}^{(+)}$ are of the correct size. The proof that $\gamma_{r+1}^{(+)}$ converges to a normal distribution then follows the exact same lines as the proof in [3, Proposition 23]. ∎

We now study the total variation distance between a labeled Galton-Watson tree where the root is in community + and one where the root is in community −.

Lemma 4. Let $P_+^{(t)}$ and $P_-^{(t)}$ denote the conditional distributions of $\mathcal{T}_t^{(s_0,L)}$ conditionally on the spin of $s_0$ being + and − respectively. Then,
$$\lim_{d\to\infty} d_{\mathrm{TV}}\big(P_+^{(t)}, P_-^{(t)}\big) = \mathbb{E}\bigg[Q\bigg(\frac{-U_+-\alpha_t/2}{\sqrt{\alpha_t}}\bigg)\bigg] + \mathbb{E}\bigg[Q\bigg(\frac{U_--\alpha_t/2}{\sqrt{\alpha_t}}\bigg)\bigg] - 1. \qquad (32)$$
Proof. By (3), the term on the left-hand side is the same as the success probability of the estimator of Algorithm 1 on a Galton-Watson tree. Using that $\xi_t^{(+)}$ and $\xi_t^{(-)}$ converge to normal distributions in the large graph limit, we then obtain for the total variation distance that
$$d_{\mathrm{TV}}\big(P_+^{(t)}, P_-^{(t)}\big) = P_{\mathrm{succ}}^{(GW)}(T^t_{\mathrm{BP}}) = \mathbb{P}\big(\xi_t^{(+)} \geq 0\big) + \mathbb{P}\big(\xi_t^{(-)} \leq 0\big) - 1 = \mathbb{E}\bigg[Q\bigg(\frac{-U_+-\alpha_t/2}{\sqrt{\alpha_t}}\bigg)\bigg] + \mathbb{E}\bigg[Q\bigg(\frac{U_--\alpha_t/2}{\sqrt{\alpha_t}}\bigg)\bigg] - 1. \;\;∎$$

Finally, we need to relate our results on the labeled Galton-Watson trees to the SBM. Denote by $G_t^{(s_0,L)}$ the subgraph of $G$ induced by all vertices at distance at most $t$ from vertex $s_0$. Let $\sigma_{G_t}$ denote the spins of all vertices in $G_t^{(s_0,L)}$. Similarly, let $\sigma_{\mathcal{T}_t}$ denote the spins of the vertices in $\mathcal{T}_t^{(s_0,L)}$. Then, the following lemma can be proven analogously to [14].

Lemma 5. For $t = t(n)$ such that $a^t = n^{o(1)}$, there exists a coupling between $(G_t^{(s_0,L)}, \sigma_{G_t})$ and $(\mathcal{T}_t^{(s_0,L)}, \sigma_{\mathcal{T}_t})$ such that $(G_t^{(s_0,L)}, \sigma_{G_t}) = (\mathcal{T}_t^{(s_0,L)}, \sigma_{\mathcal{T}_t})$ with high probability.

This lemma allows us to finish the proof of Theorem 1.

Proof of Theorem 1. On the event that $(G_t^{(s_0,L)}, \sigma_{G_t}) = (\mathcal{T}_t^{(s_0,L)}, \sigma_{\mathcal{T}_t})$, the estimator of Algorithm 1 is the same as the estimator based on the sign of $\xi_t$. Therefore,
$$\lim_{n\to\infty} P_{\mathrm{succ}}(T^t_{\mathrm{BP}}) = P_{\mathrm{succ}}^{(GW)}(T^t_{\mathrm{BP}}),$$
so that
$$\lim_{d\to\infty}\lim_{n\to\infty} P_{\mathrm{succ}}(T^t_{\mathrm{BP}}) = \mathbb{E}\bigg[Q\bigg(\frac{-U_+-\alpha_t/2}{\sqrt{\alpha_t}}\bigg)\bigg] + \mathbb{E}\bigg[Q\bigg(\frac{U_--\alpha_t/2}{\sqrt{\alpha_t}}\bigg)\bigg] - 1, \qquad (33)$$
which proves (16).

To prove the second claim, we define an estimator $\tilde{T}^t_{\mathrm{BP}}$ on a tree of depth $t$ that does not only have access to the observed labels of the tree, but also to the vertex spins at depth $t$. Then, similar to [16, Lemma 3.9], we can show that this estimator performs at least as well as the optimal estimator on the Galton-Watson tree without revealed spins as $n\to\infty$. The analysis of this estimator follows the exact same lines as the analysis of Algorithm 1, except for the initial beliefs. Let $\zeta_r^{(+)}$ and $\zeta_r^{(-)}$ be defined similarly to $\xi_r^{(+)}$ and $\xi_r^{(-)}$, but now given the true spins at depth $t$. Define
$$\tilde\alpha_1 = \frac{\lambda}{p(1-p)}, \qquad (34)$$
$$\tilde\alpha_r = G(\tilde\alpha_{r-1}). \qquad (35)$$
We now show that when $d\to\infty$,
$$\zeta_1^{(+)} \overset{W}{\longrightarrow} U_+ + w + \mathcal{N}(\tilde\alpha_1/2, \tilde\alpha_1), \qquad (36)$$
$$\zeta_1^{(-)} \overset{W}{\longrightarrow} U_- + w - \mathcal{N}(\tilde\alpha_1/2, \tilde\alpha_1). \qquad (37)$$
Similar to (20), we can write
$$\zeta_1^{(+)} \overset{d}{=} U_+ + w + N_1\log(a/b) + N_2\log(b/c), \qquad (38)$$
with $N_1$ and $N_2$ Poisson random variables with parameters $dpa$ and $d(1-p)b$ respectively. Using (12), we obtain that $\log(a/b) = \frac{\varepsilon}{p} + \varepsilon^2\frac{2p-1}{2p^2} + O(\varepsilon^3)$ and $\log(c/b) = \frac{\varepsilon}{1-p} - \varepsilon^2\frac{2p-1}{2(1-p)^2} + O(\varepsilon^3)$. Therefore,
$$dpa\log(a/b) + d(1-p)b\log(b/c) = \lambda\bigg(\frac{1}{p} + \frac{2p-1}{2p(1-p)}\bigg) + O(\varepsilon) = \frac{\lambda}{2p(1-p)} + O(\varepsilon) = \frac{\tilde\alpha_1}{2} + O(\varepsilon).$$
We can then use the same arguments as in Lemma 2 to prove (36) and (37). From there on, we can use Lemma 3, with the value of $\alpha_1$ replaced by $\tilde\alpha_1$. Following the same lines as the analysis of $\xi_t$ then leads to
$$\lim_{d\to\infty}\lim_{n\to\infty} P_{\mathrm{succ}}(\tilde T^t_{\mathrm{BP}}) = \mathbb{E}\bigg[Q\bigg(\frac{-U_+-\tilde\alpha_t/2}{\sqrt{\tilde\alpha_t}}\bigg)\bigg] + \mathbb{E}\bigg[Q\bigg(\frac{U_--\tilde\alpha_t/2}{\sqrt{\tilde\alpha_t}}\bigg)\bigg] - 1.$$
Similarly to [16, Lemma 7.4], we can show that $G(\alpha)$ is increasing and continuous. Therefore, if the function $G$ only has one fixed point, the estimators $\tilde T^t_{\mathrm{BP}}$ and $T^t_{\mathrm{BP}}$ attain the same accuracy as $t\to\infty$. ∎
IV. LEARNING THE MODEL PARAMETERS

Note that Algorithm 1 uses knowledge of the parameters of the stochastic block model, $a$, $b$ and $c$, as well as the entire label distribution depending on the community spin, $\mu$ and $\nu$. In practice however, the parameters of the model that underlies some observed network are often unknown. When the parameters $a$, $b$ and $c$ of the stochastic block model are sufficiently large (larger than $\log(n)$), the communities can be recovered above the Kesten-Stigum threshold with a vanishing error fraction, without knowledge of the model parameters, by using a spectral method [10], [20]. Let $\hat\sigma_i$ denote the estimated community spins. We can then estimate $\mu(\ell)$ by
$$\hat\mu(\ell) = \frac{\sum_{i:\hat\sigma_i=+}\mathbb{1}_{\{L_i=\ell\}}}{\sum_{i\in[n]}\mathbb{1}_{\{\hat\sigma_i=+\}}} = \frac{\mu(\ell)\,np\,(1+o(1))}{np\,(1+o(1))} = \mu(\ell) + o(1), \qquad (39)$$
and $\nu(\ell)$ can be estimated with vanishing error as well.

The spectral method described above only depends on the adjacency matrix. It is also possible to apply these spectral methods to a different matrix that includes the adjacency matrix as well as the vertex labels. One example of such a matrix is $A + K$, where $A$ is the adjacency matrix, and $K$ is a kernel matrix based on the vertex labels. When every vertex has a number of vertex labels that grows with the network size $n$, a spectral method on the adjacency matrix with additive kernel is able to correctly identify a larger fraction of the spins than a spectral algorithm based on the adjacency matrix only [2], [19].
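Given any preliminary spin estimate, for instance from a spectral method, the plug-in estimator (39) is a one-liner. A sketch of our own with hypothetical inputs:

```python
import numpy as np

def estimate_label_distribution(est_spins, vertex_labels, labels):
    """Plug-in estimators for mu and nu as in eq. (39).

    est_spins[i] is the estimated spin (+1/-1) of vertex i, e.g. from a
    spectral method; vertex_labels[i] is its observed label.
    """
    est_spins = np.asarray(est_spins)
    vertex_labels = np.asarray(vertex_labels)
    mu_hat = {l: np.mean(vertex_labels[est_spins == +1] == l) for l in labels}
    nu_hat = {l: np.mean(vertex_labels[est_spins == -1] == l) for l in labels}
    return mu_hat, nu_hat
```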
Another option is to study a multiplicative kernel, that is, a matrix of the form $A \circ K$, where $K$ is again some kernel matrix based on the vertex labels, and $\circ$ denotes element-wise multiplication. For example, we can take $K(i,j) = \mathbb{1}_{\{\ell_i=\ell_j\}}$. Then, $A \circ K$ is the adjacency matrix of the graph where all edges between vertices of different labels are removed. The remaining graph consists of several components, where the vertex label within each component is equal. On average, the component corresponding to label $\ell$ contains $np\mu(\ell)$ vertices with spin + and $n(1-p)\nu(\ell)$ of spin −. The average degree of the vertices with spin + in this graph is therefore equal to $dp\mu(\ell)a + d(1-p)\nu(\ell)b$, whereas the average degree of the vertices with spin − is equal to $dp\mu(\ell)b + d(1-p)\nu(\ell)c$. Using that $c = \frac{p}{1-p}(a-b)+b$ results in that the average degrees of vertices with label $\ell$ and spin + and − are unequal if
$$p\mu(\ell)a \neq p\mu(\ell)b + p\nu(\ell)(a-b), \qquad (40)$$
so that they are unequal if $a \neq b$ and $\mu(\ell) \neq \nu(\ell)$. If this condition holds, then it is possible to infer the community spins of this connected component better than a random guess [3, Lemma 4]. Then, in a regime where the average degree is at least logarithmic, we can get correlated reconstruction from the spectral technique of [20] applied to the matrix with kernel multiplication, without knowledge of the model parameters. This method may even work underneath the Kesten-Stigum threshold, as in Example 1. However, if the original SBM contained $K$ communities, this method finds $K|\mathcal{L}|$ communities, one set of communities for each label. Thus, for each community $\sigma$ in the model, this method finds subcommunities $(\sigma,\ell)$ for all labels $\ell$, containing the vertices of community $\sigma$ with label $\ell$. One remaining question then is how to identify the different subcommunities that belong to the same original community. One possibility could be to identify the different parts of the planted communities based on estimates of the connection probabilities within the subcommunities and between the subcommunities.

The kernel matrix $K(i,j) = \mathbb{1}_{\{\ell_i=\ell_j\}}$ is quite restrictive, but the above example shows that using multiplicative kernel matrices for community detection with vertex labels instead of additive kernel matrices seems promising. It would be interesting to investigate whether other kernel matrices result in better performance, for example the kernel matrix $K(i,j) = w_{\ell_i,\ell_j}$ for some weight matrix $w$.
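The multiplicative-kernel construction with the indicator kernel is a single Hadamard product. A minimal numpy sketch (ours, assuming a dense adjacency matrix for simplicity):

```python
import numpy as np

def multiplicative_kernel_split(A, vertex_labels):
    """Form A ∘ K for K(i,j) = 1{l_i = l_j}: keep only same-label edges.

    Returns the masked adjacency matrix; its connected components group
    vertices by label, and each component can then be clustered separately
    as described in Section IV.
    """
    labels = np.asarray(vertex_labels)
    K = (labels[:, None] == labels[None, :]).astype(A.dtype)  # indicator kernel
    return A * K  # element-wise (Hadamard) product A ∘ K
```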
V. CONCLUSION

In this paper, we investigated a variant of the belief propagation (BP) algorithm for asymmetric stochastic block models with vertex labels. We find the probability that the belief propagation algorithm correctly classifies a vertex when the average degree of a vertex grows large. We show that in the asymmetric stochastic block model, the belief propagation algorithm initialized with beliefs based on the vertex labels may not always perform optimally. Belief propagation initialized with the true community memberships then results in a better partition. Whether it is possible to know that this partition is better than the partition obtained by initializing BP based on the vertex labels, without knowing the true community partition, is an interesting direction for future research.

To determine the optimality or sub-optimality of BP in such situations, one possible approach could be to characterize directly the optimal accuracy that can be obtained by any feasible estimation procedure irrespective of its computational cost. Recent works have performed such a characterization in scenarios distinct from ours (e.g., [12], [1], and references therein). It remains to be seen whether the approaches in these papers could be adapted to our present scenario.

Furthermore, the belief propagation algorithm uses knowledge of all parameters of the stochastic block model and the vertex label distribution. In general, such parameters are not known. Another fruitful direction for future research is therefore to investigate algorithms that do not need the model parameters as input, or algorithms that estimate the parameters given a network observation. For example, spectral methods including vertex covariates [2], methods based on maximum likelihood estimation [19], or using belief propagation to find the parameters of the algorithm [5] may be interesting to investigate.

Finally, it would be interesting to investigate the performance of a similar algorithm when more than two communities are present.

REFERENCES

[1] A. E. Alaoui and M. I. Jordan. Detection limits in the high-dimensional spiked rectangular model. arXiv:1802.07309v3, 2018.
[2] N. Binkiewicz, J. T. Vogelstein, and K. Rohe. Covariate-assisted spectral clustering. Biometrika, 104(2):361–377, 2017.
[3] F. Caltagirone, M. Lelarge, and L. Miolane. Recovering asymmetric communities in the stochastic block model. IEEE Transactions on Network Science and Engineering, pages 1–1, 2017.
[4] A. Decelle, F. Krzakala, C. Moore, and L. Zdeborová. Asymptotic analysis of the stochastic block model for modular networks and its algorithmic applications. Physical Review E, 84(6), 2011.
[5] A. Decelle, F. Krzakala, C. Moore, and L. Zdeborová. Inference and phase transitions in the detection of modules in sparse networks. Physical Review Letters, 107(6), 2011.
[6] S. Fortunato. Community detection in graphs. Physics Reports, 486(3-5):75–174, 2010.
[7] A. L. Gibbs and F. E. Su. On choosing and bounding probability metrics. International Statistical Review, 70(3):419–435, 2002.
[8] P. W. Holland, K. B. Laskey, and S. Leinhardt. Stochastic blockmodels: First steps. Social Networks, 5(2):109–137, 1983.
[9] V. Kanade, E. Mossel, and T. Schramm. Global and local information in clustering labeled block models. IEEE Transactions on Information Theory, 62(10):5906–5917, 2016.
[10] J. Lei and A. Rinaldo. Consistency of spectral clustering in stochastic block models. The Annals of Statistics, 43(1):215–237, 2015.
[11] L. Massoulié. Community detection thresholds and the weak Ramanujan property. In Proceedings of the 46th Annual ACM Symposium on Theory of Computing - STOC '14. ACM Press, 2014.
[12] L. Miolane. Fundamental limits of low-rank matrix estimation: the non-symmetric case. arXiv:1702.00473v3, 2017.
[13] E. Mossel, J. Neeman, and A. Sly. Stochastic block models and reconstruction. arXiv:1202.1499v4, 2012.
[14] E. Mossel, J. Neeman, and A. Sly. Reconstruction and estimation in the planted partition model. Probability Theory and Related Fields, 162(3-4):431–461, 2014.
[15] E. Mossel, J. Neeman, and A. Sly. A proof of the block model threshold conjecture. Combinatorica, 2017.
[16] E. Mossel and J. Xu. Local algorithms for block models with side information. In Proceedings of the 2016 ACM Conference on Innovations in Theoretical Computer Science - ITCS '16. ACM Press, 2016.
[17] J. Neeman and P. Netrapalli. Non-reconstructability in the stochastic block model. arXiv:1404.6304v1, 2014.
[18] T. A. Snijders and K. Nowicki. Estimation and prediction for stochastic blockmodels for graphs with latent block structure. Journal of Classification, 14(1):75–100, 1997.
[19] H. Weng and Y. Feng. Community detection with nodal information. arXiv:1610.09735v2, 2016.
[20] J. Xu, L. Massoulié, and M. Lelarge. Edge label inference in generalized stochastic block models: from spectral theory to impossibility results. In M. F. Balcan, V. Feldman, and C. Szepesvári, editors, Proceedings of The 27th Conference on Learning Theory, volume 35 of Proceedings of Machine Learning Research, pages 903–920, Barcelona, Spain, 2014. PMLR.
[21] P. Zhang, C. Moore, and L. Zdeborová. Phase transitions in semisupervised clustering of sparse networks. Physical Review E, 90(5), 2014.