Spectral clustering and the high-dimensional Stochastic Block Model

Karl Rohe, Sourav Chatterjee and Bin Yu
Department of Statistics, University of California, Berkeley, CA 94720, USA
e-mail: [email protected]; [email protected]; [email protected]

Abstract: Networks or graphs can easily represent a diverse set of data sources that are characterized by interacting units or actors. Social networks, representing people who communicate with each other, are one example. Communities or clusters of highly connected actors form an essential feature in the structure of several empirical networks. Spectral clustering is a popular and computationally feasible method to discover these communities. The Stochastic Block Model (Holland et al., 1983) is a model with well defined communities; each node is a member of one community. For a network generated from the Stochastic Block Model, we bound the number of nodes "misclustered" by spectral clustering. The asymptotic results in this paper are the first clustering results that allow the number of clusters in the model to grow with the number of nodes, hence the name high-dimensional. In order to study spectral clustering under the Stochastic Block Model, we first show that under the more general latent space model, the eigenvectors of the normalized graph Laplacian asymptotically converge to the eigenvectors of a "population" normalized graph Laplacian. Aside from the implication for spectral clustering, this provides insight into a graph visualization technique. Our method of studying the eigenvectors of random matrices is original.

AMS 2000 subject classifications: Primary 62H30, 62H25; secondary 60B20.
Keywords and phrases: Spectral clustering, latent space model, Stochastic Block Model, clustering, convergence of eigenvectors, principal components analysis.

∗The authors are grateful to Michael Mahoney and Benjamin Olding for their stimulating discussions. Also, thank you to Jinzhu Jia and Reza Khodabin for your helpful comments and suggestions on this paper. Karl Rohe is partially supported by an NSF VIGRE Graduate Fellowship. Sourav Chatterjee is supported by NSF grant DMS-0707054 and a Sloan Research Fellowship. Bin Yu is partially supported by NSF grant SES-0835531 (CDI), NSF grant DMS-0907632, and a grant from MSRA.


1. Introduction

Researchers in many fields and businesses in several industries have exploited the recent advances in information technology to produce an explosion of data on complex systems. Several of the complex systems have interacting units or actors that networks or graphs can easily represent, providing a range of disciplines with a suite of potential questions on how to produce knowledge from network data. Understanding the system of relationships between people can aid both epidemiologists and sociologists. In biology, the predator-prey pursuits in a natural environment can be represented by a food web, helping researchers better understand an ecosystem. The chemical reactions between metabolites and enzymes in an organism can be portrayed in a metabolic network, providing biochemists with a tool to study metabolism. Networks or graphs conveniently describe these relationships, necessitating the development of statistically sound methodologies for exploring, modeling, and interpreting networks.

Communities or clusters of highly connected actors form an essential feature in the structure of several empirical networks. The identification of these clusters helps answer vital questions in a variety of fields. In the communication network of terrorists, a cluster could be a terrorist cell; web pages that provide hyperlinks to each other form a community that might host discussions of a similar topic; and a community or cluster in a social network likely shares a similar interest.

Searching for clusters is algorithmically difficult because it is computationally intractable to search over all possible clusterings. Even on a relatively small graph, one with 100 nodes, the number of different partitions exceeds some estimates of the number of atoms in the universe by twenty orders of magnitude. For several different applications, physicists, computer scientists, and statisticians have produced numerous algorithms to overcome these computational challenges. Often these algorithms aim to discover clusters which are approximately the "best" clusters as measured by some empirical objective function (see Fortunato (2009) or Fjällström (1998) for comprehensive reviews of these algorithms from the physics or the engineering perspective respectively).

Clustering algorithms generally come from two sources: from fitting procedures for various statistical models that have well defined communities and, more commonly, from heuristics or insights on what network communities should look like. This division is analogous to the difference in multivariate data analysis between parametric clustering algorithms, such as an EM algorithm fitting a mixture of Gaussians model, and nonparametric clustering algorithms such as k-means, which are instead motivated by optimizing an objective function. Snijders and Nowicki (1997); Nowicki and Snijders (2001); Handcock et al. (2007) and Airoldi et al. (2008) all attempt to cluster the nodes of a network by fitting various network models that have well defined communities. In contrast, the Girvan-Newman algorithm (Girvan and Newman, 2002) and spectral clustering are two algorithms in a large class of algorithms motivated by insights and heuristics on communities in networks.

Newman and Girvan (2004) motivate their algorithm by observing, "If two communities are joined by only a few inter-community edges, then all paths through the network from vertices in one community to vertices in the other must pass along one of those few edges." The Girvan-Newman algorithm searches for these few edges and removes them, resulting in a graph with multiple connected components (connected components are clusters of nodes such that there are no connections between the clusters). The Girvan-Newman algorithm then returns these connected components as the clusters. Like the Girvan-Newman algorithm, spectral clustering is a "nonparametric" algorithm motivated by the following insights and heuristics: spectral clustering is a convex relaxation of the Normalized Cut optimization problem (Shi and Malik, 2000), it can identify the connected components in a graph (if there are any) (Donath and Hoffman, 1973; Fiedler, 1973), and it has an intimate connection with electrical networks and random walks on graphs (Klein and Randić, 1993; Meilă and Shi, 2001).

1.1. Spectral clustering

Spectral clustering is both popular and computationally feasible (von Luxburg, 2007). The algorithm has been rediscovered and reapplied in numerous different fields since the initial work of Donath and Hoffman (1973) and Fiedler (1973). Computer scientists have found many applications for variations of spectral clustering, such as load balancing and parallel computations (Van Driessche and Roose, 1995; Hendrickson and Leland, 1995), partitioning circuits for very-large-scale integration design (Hagen and Kahng, 1992) and sparse matrix partitioning (Pothen et al., 1990). Detailed histories of spectral clustering can be found in Spielman and Teng (2007) and Von Luxburg et al. (2008).

The algorithm is defined in terms of a graph $G$, represented by a vertex set and an edge set. The vertex set $\{v_1, \dots, v_n\}$ contains vertices or nodes. These are the actors in the systems discussed above. We will refer to node $v_i$ as node $i$. We will only consider unweighted and undirected edges. So, the edge set contains a pair $(i, j)$ if there is an edge, or relationship, between nodes $i$ and $j$. The edge set can be represented by the adjacency matrix $W \in \{0,1\}^{n \times n}$:

$$W_{ji} = W_{ij} = \begin{cases} 1 & \text{if } (i,j) \text{ is in the edge set} \\ 0 & \text{otherwise.} \end{cases} \tag{1.1}$$

Define $L$ and the diagonal matrix $D$, both elements of $\mathbb{R}^{n \times n}$, in the following way,

$$D_{ii} = \sum_k W_{ik}, \qquad L = D^{-1/2} W D^{-1/2}. \tag{1.2}$$

Some readers may be more familiar with defining $L$ as $I - D^{-1/2} W D^{-1/2}$. For spectral clustering, the difference is immaterial because both definitions have the same eigenvectors. The spectral clustering algorithm addressed in this paper is defined as follows:

Spectral clustering for $k$ many clusters

Input: Adjacency matrix $W \in \{0,1\}^{n \times n}$.

1. Find the eigenvectors $X_1, \dots, X_k \in \mathbb{R}^n$ corresponding to the $k$ eigenvalues of $L$ that are largest in absolute value. $L$ is symmetric, so choose these eigenvectors to be orthogonal. Form the matrix $X = [X_1, \dots, X_k] \in \mathbb{R}^{n \times k}$ by putting the eigenvectors into the columns.
2. Treating each of the $n$ rows in $X$ as a point in $\mathbb{R}^k$, run $k$-means with $k$ clusters. This creates $k$ non-overlapping sets $A_1, \dots, A_k$ whose union is $1, \dots, n$.

Output: $A_1, \dots, A_k$. This means that node $i$ is assigned to cluster $g$ if the $i$th row of $X$ is assigned to $A_g$ in step 2.
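The steps above translate directly into a few lines of code. The following is a minimal sketch, assuming NumPy, SciPy, and scikit-learn are available; the function name and all identifiers are ours, not the paper's.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def spectral_clustering(W, k):
    """Spectral clustering for k many clusters, as defined above.

    W : (n, n) symmetric adjacency matrix with 0/1 entries.
    Returns an array of n labels in {0, ..., k-1}.
    """
    d = W.sum(axis=1)                              # D_ii = sum_k W_ik
    d_inv_sqrt = 1.0 / np.sqrt(d)                  # assumes no isolated nodes
    L = d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]   # L = D^{-1/2} W D^{-1/2}

    # Step 1: eigenvectors for the k eigenvalues largest in absolute value.
    eigvals, eigvecs = eigh(L)                     # L is symmetric
    X = eigvecs[:, np.argsort(np.abs(eigvals))[::-1][:k]]

    # Step 2: run k-means on the n rows of X, viewed as points in R^k.
    return KMeans(n_clusters=k, n_init=10).fit_predict(X)
```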

Traditionally, spectral clustering takes the eigenvectors of $L$ corresponding to the largest $k$ eigenvalues. The algorithm above takes the largest $k$ eigenvalues by absolute value. The reason for this is explained in Section 3.

Recently, spectral clustering has also been applied in cases where the graph $G$ and its adjacency matrix $W$ are not given, but instead inferred from a measure of pairwise similarity $k(\cdot, \cdot)$ between data points $X_1, \dots, X_n$ in a metric space. The similarity matrix $K \in \mathbb{R}^{n \times n}$, whose $i,j$th element is $K_{ij} = k(X_i, X_j)$, takes the place of the adjacency matrix $W$ in the above definition of $L$, $D$, and the spectral clustering algorithm. For image segmentation, Shi and Malik (2000) suggested spectral clustering on an inferred network where the nodes are the pixels and the edges are determined by some measure of pixel similarity. In this way, spectral clustering has many similarities with the nonlinear dimension reduction or manifold learning techniques such as Diffusion maps and Laplacian eigenmaps (Coifman et al., 2005; Belkin and Niyogi, 2003).

The normalized graph Laplacian $L$ is an essential part of spectral clustering, Diffusion maps, and Laplacian eigenmaps. As such, its properties have been well studied under the model that the data points are randomly sampled from a probability distribution, whose support may be a manifold, and the Laplacian is built from the inferred graph based on some measure of similarity between data points. Belkin (2003); Lafon (2004); Bousquet et al. (2004); Hein et al. (2005); Hein (2006); Giné and Koltchinskii (2006); Belkin and Niyogi (2008); Von Luxburg et al. (2008) have all shown various forms of asymptotic convergence for this graph Laplacian. Although all of their results are encouraging, their results do not apply to the random network models we study in this paper.
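Returning to the similarity-matrix construction above: a short sketch of how $K$ might be built from data points, here with a Gaussian kernel. The kernel choice and the bandwidth sigma are illustrative assumptions, not prescriptions from the paper.

```python
import numpy as np

def gaussian_similarity(points, sigma=1.0):
    """Build the similarity matrix K with K_ij = k(X_i, X_j).

    points : (n, d) array of data points X_1, ..., X_n.
    K takes the place of W in the definition of D, L, and the algorithm.
    """
    diffs = points[:, None, :] - points[None, :, :]
    sq_dists = (diffs ** 2).sum(axis=-1)
    K = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(K, 0.0)      # mirror the no-self-loop adjacency case
    return K

# e.g., labels = spectral_clustering(gaussian_similarity(points), k)
```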

1.2. Statistical estimation

Stochastic models are useful because they force us to think clearly about the randomness in the data in a precise and possibly familiar way. Many random network models have been proposed (Erdős and Rényi, 1959; Holland and Leinhardt, 1981; Barabási and Albert, 1999; Watts and Strogatz, 1998; Van Duijn et al., 2004; Frank and Strauss, 1986; Holland et al., 1983; Hoff et al., 2002; Goldenberg et al., 2009). Some of these models, such as the Stochastic Block Model, have well defined communities. The Stochastic Block Model is characterized by the fact that each node belongs to one of multiple blocks and the probability of a relationship between two nodes depends only on the block memberships of the two nodes. If the probability of an edge between two nodes in the same block is larger than the probability of an edge between two nodes in different blocks, then the blocks produce communities in the random networks generated from the model.

Just as statisticians have studied when least-squares regression can estimate the "true" regression model, it is natural and important for us to study the ability of clustering algorithms to estimate the true clusters in a network model. Understanding when and why a clustering algorithm correctly estimates the "true" communities would provide a rigorous understanding of the behavior of these algorithms, suggest which algorithm to choose in practice, and aid the corroboration of algorithmic output.

This paper studies the performance of spectral clustering, a nonparametric method, on a parametric task of estimating the blocks in the Stochastic Block Model. It connects the first strain of clustering research based on stochastic models to the second strain based on heuristics and insights on network clusters. The Stochastic Block Model allows for some first steps in understanding the behavior of spectral clustering and provides a benchmark to measure its performance. However, because this model does not really account for the complexities observed in several empirical networks, good performance on the Stochastic Block Model should only be considered a necessary requirement for a good clustering algorithm.

Researchers have explored the performance of other clustering algorithms under the Stochastic Block Model. Snijders and Nowicki (1997) showed the consistency, under the two block Stochastic Block Model, of a clustering routine that clusters the nodes based on their degree distributions. Although this clustering is very easy to compute, it is not clear that the estimators would behave well for larger graphs given the extensive literature on the long tail of the degree distribution (Albert and Barabási, 2002). Later, Condon and Karp (1999) provided an algorithm and proved that it is consistent under the Stochastic Block Model, or what they call the planted ℓ-partition model. Their algorithm runs in linear time. However, it always estimates clusters that contain an equal number of nodes.

More recently, Bickel and Chen (2009) proved that under the Stochastic Block Model, the maximizers of the Newman-Girvan modularity (Newman and Girvan, 2004) and what they call the likelihood modularity are asymptotically consistent estimators of block partitions. These modularities are objective functions that have no clear relationship to the Girvan-Newman algorithm. Finding the maximum of the modularities is believed to be NP-hard (Newman and Girvan, 2004). It is important to note that all aforementioned clustering results involving the Stochastic Block Model are asymptotic in the number of nodes, with a fixed number of blocks.

The work of Leskovec et al. (2008) shows that in a diverse set of large empirical networks (tens of thousands to millions of nodes), the size of the "best" clusters is not very large, around 100 nodes. Modern applications of clustering require an asymptotic regime that allows these sorts of clusters. Under the asymptotic regime cited in the previous paragraph, the size of the clusters grows linearly with the number of nodes. It would be more appropriate to allow the number of communities to grow with the number of nodes. This restricts the blocks from becoming too large, following the empirical observations of Leskovec et al. (2008).

This paper provides the first asymptotic clustering results that allow the number of blocks in the Stochastic Block Model to grow with the number of nodes. Similar to the asymptotic results on regression techniques that allow the number of predictors to grow with the number of nodes, allowing the number of blocks to grow makes the problem one of high-dimensional learning.

The Stochastic Block Model is an example of the more general latent space model (Hoff et al., 2002) of a random network. Under the latent space model, there are latent i.i.d. vectors $z_1, \dots, z_n$; one for each node. The probability that an edge appears between any two nodes $i$ and $j$ depends only on $z_i$ and $z_j$ and is independent of all other edges and unobserved vectors. The results of Aldous and Hoover show that this model characterizes the distribution of all infinite random graphs with exchangeable nodes (Kallenberg, 2005). The graphs with $n$ nodes generated from a latent space model can be viewed as a subgraph of an infinite graph.

In order to study spectral clustering under the Stochastic Block Model, we first show that under the more general latent space model, as the number of nodes grows, the eigenvectors of $L$, the normalized graph Laplacian, converge to eigenvectors of the "population" normalized graph Laplacian that is constructed with a similarity matrix $E(W \mid z_1, \dots, z_n)$ (whose $i,j$th element is the probability of an edge between nodes $i$ and $j$) taking the place of the adjacency matrix $W$ in Equation (1.2). In many ways, $E(W \mid z_1, \dots, z_n)$ is similar to the similarity matrix $K$ discussed above, only this time the vectors $(z_1, \dots, z_n)$ and their similarity matrix $E(W \mid z_1, \dots, z_n)$ are unobserved.

The convergence of the eigenvectors has implications beyond spectral clustering. Graph visualization is an important tool for social network analysts looking for structure in networks and the eigenvectors of the graph Laplacian are an essential piece of one visualization technique (Koren, 2005). Exploratory graph visualization allows researchers to find

structure in the network; this structure could be communities or something more complicated (Liotta, 2004; Freeman, 2000; Wasserman and Faust, 1994). In terms of the latent space model, if $z_1, \dots, z_n$ form clusters or have some other structure in the latent space, then we might recover this structure from the observed graph using graph visualization. While there are several visualization techniques, there is very little theoretical understanding of how any perform under stochastic models of structured networks. Because the eigenvectors of the normalized graph Laplacian converge to "population" eigenvectors, this provides support for a visualization technique similar to the one proposed in Koren (2005).

Following the preliminary definitions in the next subsection of the introduction, the rest of the paper is organized into two main sections; the first studies the latent space model and the second studies the Stochastic Block Model as a special case. Section 2 covers the eigenvectors of $L$ under the latent space model. The main technical result is Theorem 2.1 in Section 2.2, which shows that, as the number of nodes grows, the normalized graph Laplacian multiplied by itself converges in Frobenius norm to a symmetric version of the population graph Laplacian multiplied by itself. The Davis-Kahan Theorem then implies that the eigenvectors of these matrices are close in an appropriate sense. Lemma 2.1 specifies how the eigenvectors of a matrix multiplied by itself are closely related to the eigenvectors of the original matrix. Theorem 2.2 combines Theorem 2.1 with the Davis-Kahan Theorem and Lemma 2.1 to show that the eigenvectors of the normalized graph Laplacian converge to the population eigenvectors. Section 3 applies these results to the high-dimensional Stochastic Block Model. Theorem 3.1 bounds the number of nodes that spectral clustering "misclusters." The section concludes with an example.

1.3. Preliminaries

The latent space model proposed by Hoff et al. (2002) is a class of probabilistic models for $W$.

Definition 1. For i.i.d. random vectors $z_1, \dots, z_n \in \mathbb{R}^k$ and random adjacency matrix $W \in \{0,1\}^{n \times n}$, let $P(W_{ij} \mid z_i, z_j)$ be the probability mass function of $W_{ij}$ conditioned on $z_i$ and $z_j$. A probability distribution on $W$ has the conditional independence relationships

$$P(W \mid z_1, \dots, z_n) = \prod_{i<j} P(W_{ij} \mid z_i, z_j).$$

This model is often simplified to assume $P(W_{ij} \mid z_i, z_j) = P(W_{ij} \mid \mathrm{dist}(z_i, z_j))$ where $\mathrm{dist}(\cdot, \cdot)$ is some distance function. This allows the "homophily by attributes" interpretation that edges are more likely to appear between nodes whose latent vectors are closer in the latent space.

Define $Z \in \mathbb{R}^{n \times k}$ such that its $i$th row is $z_i$ for all $i \in V$. Throughout this paper we assume $Z$ is fixed and unknown. Because $P(W_{ij} = 1 \mid Z) = E(W_{ij} \mid Z)$, the model is then completely parametrized by the matrix

$$\mathscr{W} = E(W \mid Z) \in \mathbb{R}^{n \times n},$$

where $\mathscr{W}$ depends on $Z$, but this is dropped for notational convenience. The Stochastic Block Model, introduced by Holland et al. (1983), is a specific latent space model with well defined communities. We use the following definition of the undirected Stochastic Block Model:

Definition 2. The Stochastic Block Model is a latent space model with

$$\mathscr{W} = Z B Z^T,$$

where $Z \in \{0,1\}^{n \times k}$ has exactly one 1 in each row and at least one 1 in each column and $B \in [0,1]^{k \times k}$ is symmetric.

We refer to $\mathscr{W}$, the matrix which completely parametrizes the latent space model, as the population version of $W$. Define population versions of $L$ and $D$, both in $\mathbb{R}^{n \times n}$, as

$$\mathscr{D}_{ii} = \sum_k \mathscr{W}_{ik}, \qquad \mathscr{L} = \mathscr{D}^{-1/2} \mathscr{W} \mathscr{D}^{-1/2}, \tag{1.3}$$

where $\mathscr{D}$ is a diagonal matrix, similar to before. The results in this paper are asymptotic in the number of nodes $n$. When it is appropriate, the matrices above are given a superscript of $n$ to emphasize this dependence. Other times, this superscript is discarded for notational convenience.
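To make the notation concrete, the following sketch simulates one draw of $W$ from a Stochastic Block Model and forms both Laplacians. The specific $n$, $k$, and $B$ below are arbitrary illustrative choices of ours, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 300, 3
labels = rng.integers(0, k, size=n)
Z = np.eye(k)[labels]                    # one 1 per row: block memberships
B = 0.05 + 0.20 * np.eye(k)              # within-block 0.25, between 0.05

W_pop = Z @ B @ Z.T                      # population matrix  E(W | Z) = Z B Z^T
upper = np.triu((rng.random((n, n)) < W_pop).astype(int), 1)
W = upper + upper.T                      # symmetric 0/1 draw, zero diagonal

def normalize(M):
    """Return D^{-1/2} M D^{-1/2} with D_ii the row sums of M."""
    s = 1.0 / np.sqrt(M.sum(axis=1))
    return s[:, None] * M * s[None, :]

L = normalize(W)                         # observed Laplacian, Equation (1.2)
L_pop = normalize(W_pop)                 # population Laplacian, Equation (1.3)
```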

2. Consistency under the Latent Space Model

We will show that the empirical eigenvectors of $L^{(n)}$ converge in the appropriate sense to the population eigenvectors of $\mathscr{L}^{(n)}$. If $L^{(n)}$ converged to $\mathscr{L}^{(n)}$ in Frobenius norm, then the Davis-Kahan Theorem would give the desired result. However, these matrices do not converge. This is illustrated in an example below. Instead, we give a novel result showing that under certain conditions $L^{(n)}L^{(n)}$ converges to $\mathscr{L}^{(n)}\mathscr{L}^{(n)}$ in Frobenius norm. This implies that the eigenvectors of $L^{(n)}L^{(n)}$ converge to the eigenvectors of $\mathscr{L}^{(n)}\mathscr{L}^{(n)}$. The following lemma shows that these eigenvectors can be chosen to imply the eigenvectors of $L^{(n)}$ converge to the eigenvectors of $\mathscr{L}^{(n)}$.

Lemma 2.1. When $M \in \mathbb{R}^{n \times n}$ is a symmetric real matrix,

1. $\lambda^2$ is an eigenvalue of $MM$ if and only if $\lambda$ or $-\lambda$ is an eigenvalue of $M$.

2. If $Mv = \lambda v$, then $MMv = \lambda^2 v$.
3. Conversely, if $MMv = \lambda^2 v$, then $v$ can be written as a linear combination of eigenvectors of $M$ whose eigenvalues are $\lambda$ or $-\lambda$.

A proof of Lemma 2.1 can be found in Appendix A.
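Parts one and two of the lemma are easy to check numerically. The snippet below is our own sanity check on a small random symmetric matrix, not part of the proof.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 6))
M = (A + A.T) / 2                        # symmetric real matrix

lam = np.linalg.eigvalsh(M)              # eigenvalues of M
lam_mm = np.linalg.eigvalsh(M @ M)       # eigenvalues of MM

# Part 1: the spectrum of MM is exactly the squared spectrum of M.
assert np.allclose(np.sort(lam ** 2), np.sort(lam_mm))
```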

Example: To see how squaring a matrix helps convergence, let the matrix $W \in \mathbb{R}^{n \times n}$ have i.i.d. Bernoulli(1/2) entries. Because the diagonal elements in $D$ grow like $n$, the matrix $W/n$ behaves similarly to $D^{-1/2} W D^{-1/2}$. Without squaring the matrix, the Frobenius distance from the matrix to its expectation is

1 2 W/n E(W )/n F = (Wij E(Wij)) =1/2. $ − $ n − $%i,j Notice that, for i = j, % [WW] = W W Binomial(n, 1/4) ij ik kj ∼ %k 1/2 and [WW]ii Binomial(n, 1/2). So, for any i, j, [WW]ij E[WW]ij = o(n log(n)). ∼ − Thus, the Frobenius distance from the squared matrix to its expectation is

$$\|WW/n^2 - E(WW)/n^2\|_F = \frac{1}{n^2}\sqrt{\sum_{i,j}([WW]_{ij} - E[WW]_{ij})^2} = o\left(\frac{\log(n)}{n^{1/2}}\right).$$
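A quick simulation (ours, under the same i.i.d. Bernoulli(1/2) setup; all names are hypothetical) makes the contrast concrete: the unsquared distance stays at 1/2 while the squared one shrinks.

```python
import numpy as np

rng = np.random.default_rng(2)
for n in (100, 400, 1600):
    W = rng.integers(0, 2, size=(n, n)).astype(float)   # i.i.d. Bernoulli(1/2)
    EW = np.full((n, n), 0.5)
    raw = np.linalg.norm(W / n - EW / n)                # exactly 1/2 for all n

    # E[WW] is n/4 off the diagonal and n/2 on it (since W_ik^2 = W_ik).
    EWW = np.full((n, n), n / 4.0) + np.diag(np.full(n, n / 4.0))
    sq = np.linalg.norm(W @ W / n**2 - EWW / n**2)      # shrinks as n grows
    print(n, round(raw, 4), round(sq, 4))
```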

When the elements of $W$ are i.i.d. Bernoulli(1/2), $(W/n)^2$ converges in Frobenius norm and $W/n$ does not.

The next theorem addresses the convergence of $L^{(n)}L^{(n)}$. Define

$$\tau_n = \min_{i=1,\dots,n} \mathscr{D}_{ii}^{(n)}/n. \tag{2.1}$$

Recall that $\mathscr{D}_{ii}^{(n)}$ is the expected degree for node $i$. So, $\tau_n$ is the minimum expected degree, divided by the maximum possible degree. It measures how quickly the number of edges accumulates.

Theorem 2.1. Define the sequence of random matrices $W^{(n)} \in \{0,1\}^{n \times n}$ to be from a sequence of latent space models with population matrices $\mathscr{W}^{(n)} \in [0,1]^{n \times n}$. With $W^{(n)}$, define the observed graph Laplacian $L^{(n)}$ as in (1.2). Let $\mathscr{L}^{(n)}$ be the population version of $L^{(n)}$ as defined in Equation (1.3). Define $\tau_n$ as in Equation (2.1).

If there exists $N > 0$, such that $\tau_n^2\log(n) > 8$ for all $n > N$, then

$$\|L^{(n)}L^{(n)} - \mathscr{L}^{(n)}\mathscr{L}^{(n)}\|_F = o\left(\frac{\log(n)}{\tau_n^2\, n^{1/2}}\right) \quad a.s.$$

Appendix A contains a non-asymptotic bound on $\|L^{(n)}L^{(n)} - \mathscr{L}^{(n)}\mathscr{L}^{(n)}\|_F$ as well as the proof of Theorem 2.1. The main condition in this theorem is the lower bound on $\tau_n$. This sufficient condition is used to produce Gaussian tail bounds for each of the $D_{ii}$ and other similar quantities.

For any symmetric matrix $M$, define $\lambda(M)$ to be the eigenvalues of $M$ and for any interval $S \subset \mathbb{R}$, define

$$\lambda_S(M) = \{\lambda(M) \cap S\}.$$

Further, define $\bar\lambda_1^{(n)} \geq \cdots \geq \bar\lambda_n^{(n)}$ to be the elements of $\lambda(\mathscr{L}^{(n)}\mathscr{L}^{(n)})$ and $\lambda_1^{(n)} \geq \cdots \geq \lambda_n^{(n)}$ to be the elements of $\lambda(L^{(n)}L^{(n)})$. The eigenvalues of $L^{(n)}L^{(n)}$ converge in the following sense,

$$\max_i |\lambda_i^{(n)} - \bar\lambda_i^{(n)}| \leq \|L^{(n)}L^{(n)} - \mathscr{L}^{(n)}\mathscr{L}^{(n)}\|_F = o\left(\frac{\log(n)}{\tau_n^2\, n^{1/2}}\right) \quad a.s. \tag{2.2}$$

This follows from Theorem 2.1, Weyl's inequality (Bhatia, 1987), and the fact that the Frobenius norm is an upper bound of the spectral norm.

This shows that under certain conditions on $\tau_n$, the eigenvalues of $L^{(n)}L^{(n)}$ converge to the eigenvalues of $\mathscr{L}^{(n)}\mathscr{L}^{(n)}$. In order to study spectral clustering, it is now necessary to show that the eigenvectors also converge. The Davis-Kahan Theorem provides a bound for this.

Proposition 2.1. (Davis-Kahan) Let $S \subset \mathbb{R}$ be an interval. Denote by $\mathscr{X}$ an orthonormal matrix whose column space is equal to the eigenspace of $\mathscr{L}\mathscr{L}$ corresponding to the eigenvalues in $\lambda_S(\mathscr{L}\mathscr{L})$ (more formally, the column space of $\mathscr{X}$ is the image of the spectral projection of $\mathscr{L}\mathscr{L}$ induced by $\lambda_S(\mathscr{L}\mathscr{L})$). Denote by $X$ the analogous quantity for $LL$. Define the distance between $S$ and the spectrum of $\mathscr{L}\mathscr{L}$ outside of $S$ as

$$\delta = \min\{|\ell - s|;\ \ell \text{ eigenvalue of } \mathscr{L}\mathscr{L},\ \ell \notin S,\ s \in S\}.$$

If $\mathscr{X}$ and $X$ are of the same dimension, then there is an orthonormal matrix $O$, that depends on $\mathscr{X}$ and $X$, such that

$$\frac{1}{2}\|X - \mathscr{X}O\|_F^2 \leq \frac{\|LL - \mathscr{L}\mathscr{L}\|_F^2}{\delta^2}.$$

The original Davis-Kahan Theorem bounds the "canonical angle," also known as the "principal angle," between the column spaces of $\mathscr{X}$ and $X$. Appendix B explains how this can be converted into the bound stated above.

The bound in the Davis-Kahan Theorem is sensitive to the value $\delta$. This reflects that when there are eigenvalues of $\mathscr{L}\mathscr{L}$ close to $S$, but not inside of $S$, then a small perturbation can move these eigenvalues inside of $S$ and drastically alter the eigenvectors. The next theorem combines the previous results to show that the eigenvectors of $L^{(n)}$ converge to the eigenvectors of $\mathscr{L}^{(n)}$. Because it is asymptotic in the number of nodes, it is important to allow $S$ and $\delta$ to depend on $n$. For a sequence of open intervals $S_n \subset \mathbb{R}$, define

$$\delta_n = \inf\{|\ell - s|;\ \ell \in \lambda(\mathscr{L}^{(n)}\mathscr{L}^{(n)}),\ \ell \notin S_n,\ s \in S_n\} \tag{2.3}$$
$$\delta_n' = \inf\{|\ell - s|;\ \ell \in \lambda_{S_n}(\mathscr{L}^{(n)}\mathscr{L}^{(n)}),\ s \notin S_n\} \tag{2.4}$$
$$S_n' = \{\ell;\ \ell^2 \in S_n\}. \tag{2.5}$$

The quantity $\delta_n'$ is added to measure how well $S_n$ insulates the eigenvalues of interest. If $\delta_n'$ is too small, then some important empirical eigenvalues might fall outside of $S_n$. By restricting the rate at which $\delta_n$ and $\delta_n'$ converge to zero, the next theorem ensures the dimensions of $\mathscr{X}$ and $X$ agree for a large enough $n$. This is required in order to use the Davis-Kahan Theorem.

Theorem 2.2. Define $W^{(n)} \in \{0,1\}^{n \times n}$ to be a sequence of growing random adjacency matrices from the latent space model with population matrices $\mathscr{W}^{(n)}$. With $W^{(n)}$, define the observed graph Laplacian $L^{(n)}$ as in (1.2). Let $\mathscr{L}^{(n)}$ be the population version of $L^{(n)}$ as defined in Equation (1.3). Define $\tau_n$ as in Equation (2.1). With a sequence of open intervals $S_n \subset \mathbb{R}$, define $\delta_n$, $\delta_n'$, and $S_n'$ as in Equations (2.3), (2.4), and (2.5).

Let $k_n = |\lambda_{S_n'}(\mathscr{L}^{(n)})|$, the size of the set $\lambda_{S_n'}(\mathscr{L}^{(n)})$. Define the matrix $\mathscr{X}_n \in \mathbb{R}^{n \times k_n}$ such that its orthonormal columns are the eigenvectors of the symmetric matrix $\mathscr{L}^{(n)}$ corresponding to all the eigenvalues contained in $\lambda_{S_n'}(\mathscr{L}^{(n)})$. For $K_n = |\lambda_{S_n'}(L^{(n)})|$, define $X_n \in \mathbb{R}^{n \times K_n}$ to be the analogous matrix for the symmetric matrix $L^{(n)}$ with eigenvalues in $\lambda_{S_n'}(L^{(n)})$.

Assume that $n^{-1/2}\log(n) = O(\min\{\delta_n, \delta_n'\})$. Also assume that there exists a positive integer $N$ such that for all $n > N$, it follows that $\tau_n^2 > 8/\log(n)$. Eventually, $k_n = K_n$. Afterwards, for some sequence of orthonormal rotations $O_n$,

$$\|\mathscr{X}_n - X_n O_n\|_F = o\left(\frac{\log(n)}{\delta_n \tau_n^2\, n^{1/2}}\right) \quad a.s.$$

A proof of this Theorem 2.2 is in Appendix C. There are two key assumptions in Theorem 2.2:

(1) $n^{-1/2}\log(n) = O(\min\{\delta_n, \delta_n'\})$
(2) $\tau_n^2 > 8/\log(n)$.

The first assumption ensures that the gap between the eigenvalues of interest and the rest of the eigenvalues does not converge to zero too quickly. The theorem is most interesting when $S$ includes only the leading eigenvalues. This is because the eigenvectors with the largest eigenvalues have the potential to reveal clusters or other structures in the network. When these leading eigenvalues are well separated from the smaller eigenvalues, the "eigengap" is large. The second assumption ensures that the expected degree of each node grows sufficiently fast. If $\tau_n$ is constant, then the expected degree of each node grows linearly. The assumption $\tau_n^2 > 8/\log(n)$ is almost as restrictive.

The usefulness of Theorem 2.2 depends on how well the eigenvectors of $\mathscr{L}^{(n)}$ represent the characteristics of interest in the network. For example, under the Stochastic Block Model with $B$ full rank, if $S_n$ is chosen so that $S_n'$ contains all nonzero eigenvalues of $\mathscr{L}^{(n)}$, then the block structure can be determined from the columns of $\mathscr{X}_n$. It can be shown that nodes $i$ and $j$ are in the same block if and only if the $i$th row of $\mathscr{X}_n$ equals the $j$th row. The next section examines how spectral clustering exploits this structure, using $X_n$ to estimate the block structure in the Stochastic Block Model.
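The claim about the rows of $\mathscr{X}_n$ is easy to verify numerically. The sketch below is our own check, with an arbitrary full-rank $B$ of our choosing; it builds the population Laplacian and confirms that its leading eigenvectors have identical rows within blocks.

```python
import numpy as np

n, k = 12, 3
labels = np.repeat(np.arange(k), n // k)
Z = np.eye(k)[labels]
B = 0.1 + 0.4 * np.eye(k)                      # full rank, within > between
W_pop = Z @ B @ Z.T

d = W_pop.sum(axis=1)
L_pop = W_pop / np.sqrt(np.outer(d, d))        # population Laplacian (1.3)

vals, vecs = np.linalg.eigh(L_pop)
X_pop = vecs[:, np.argsort(np.abs(vals))[::-1][:k]]   # leading k eigenvectors

# Rows agree within a block and differ between blocks.
assert np.allclose(X_pop[0], X_pop[1])         # nodes 0 and 1 share a block
assert not np.allclose(X_pop[0], X_pop[-1])    # nodes 0 and n-1 do not
```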

3. The Stochastic Block Model

The work of Leskovec et al. (2008) shows that the sizes of the best clusters are not very large in a diverse set of empirical networks, suggesting that the appropriate asymptotic framework should allow for the number of communities to grow with the number of nodes. This section shows that, under suitable conditions, spectral clustering can correctly partition most of the nodes in the Stochastic Block Model, even when the number of blocks grows with the number of nodes.

The Stochastic Block Model, introduced by Holland et al. (1983), is a specific latent space model. Because it has well defined communities in the model, community detection can be framed as a problem of statistical estimation. The important assumption of this model is that of structural equivalence within the blocks; if two nodes $i$ and $j$ are in the same block, rows $i$ and $j$ of $\mathscr{W}$ are equal. Recall in the definition of the undirected Stochastic Block Model,

$$\mathscr{W} = Z B Z^T,$$

where $Z \in \{0,1\}^{n \times k}$ is fixed and has exactly one 1 in each row and at least one 1 in each column and $B \in [0,1]^{k \times k}$ is symmetric. In this definition there are $k$ blocks and $n$ nodes. If the $i,g$th element of $Z$ equals one ($Z_{ig} = 1$) then node $i$ is in block $g$. As before, for $i = 1, \dots, n$, $z_i$ denotes the $i$th row of $Z$. The matrix $B \in [0,1]^{k \times k}$ contains the probability of edges within and between blocks. Some researchers have allowed for $Z$ to be random; we have decided to focus instead on the randomness of $W$ conditioned on $Z$.

The aim of a clustering algorithm is to estimate $Z$ (up to a permutation of the columns) from $W$. This section bounds the number of "misclustered" nodes. Because a permutation of the columns of $Z$ is unidentifiable in the Stochastic Block Model, it is not obvious what a "misclustered" node is. Before giving our definition of "misclustered," some preliminaries are needed to explain why it is a reasonable definition. First, a lemma that explores the eigenstructure of the population Laplacian $\mathscr{L}$.

Lemma 3.1. Under the Stochastic Block Model, there exists a matrix $B_L \in \mathbb{R}^{k \times k}$, such that $\mathscr{L} = Z B_L Z^T$. There exists another matrix $U_L \in \mathbb{R}^{k \times k}$ whose unit length column

vectors are the eigenvectors of $B_L Z^T Z$:

$$B_L Z^T Z U_L = U_L \Lambda$$

for some diagonal matrix $\Lambda$. Further,

$$\mathscr{L}(Z U_L) = (Z U_L)\Lambda.$$

A proof of Lemma 3.1 can be found in Appendix D. For the rest of the paper, eigenvectors in $U_L$ corresponding to the same eigenvalue with multiplicity are assumed to be orthogonal.

Define the diagonal matrix $P \in \mathbb{R}^{k \times k}$ as follows,

$$P_{gg} = \sum_i [U_L]_{ig}^2 (Z^T Z)_{ii}, \tag{3.1}$$

where $[U_L]_{ig}^2$ is the $i,g$th element of $U_L$, squared. Notice that this definition ensures the columns of $Z U_L P^{-1/2}$ have unit length. Define

$$\mathscr{C} = Z U_L P^{-1/2} \in \mathbb{R}^{n \times k} \tag{3.2}$$

and let $\mathscr{C}_i \in \mathbb{R}^k$ be the $i$th row of $\mathscr{C}$.

If nodes $i$ and $j$ are in the same block ($z_i = z_j$), then $\mathscr{C}_i = \mathscr{C}_j$. If they are not in the same block ($z_i \neq z_j$), then

$$\|\mathscr{C}_i - \mathscr{C}_j\|_2 = \|(z_i - z_j) U_L P^{-1/2}\|_2 \geq \|z_i - z_j\|_2\, \|U_L P^{-1/2}\|_m = \sqrt{2}\, \|U_L P^{-1/2}\|_m$$

where

$$\|U_L P^{-1/2}\|_m = \min_{v \in \mathbb{R}^k,\, \|v\|_2 = 1} \|v\, U_L P^{-1/2}\|_2.$$

So $\mathscr{C}_1, \dots, \mathscr{C}_n$ represent the block memberships of nodes $1, \dots, n$.

As before, define $X \in \mathbb{R}^{n \times k}$ such that its columns are unit length eigenvectors of $L$ corresponding to the largest $k$ eigenvalues in absolute value. Let $x_i \in \mathbb{R}^k$ be the $i$th row of $X$. After calculating $X$, the spectral clustering algorithm runs $k$-means on $x_1, \dots, x_n$. Define a partition matrix to be a matrix with all elements zero except for exactly one one in each row. Define

$$R(n, k) = \{ST:\ S \in \mathbb{R}^{n \times k} \text{ is a partition matrix and } T \in \mathbb{R}^{k \times k}\}.$$

Notice that

$$\min_{M \in R(n,k)} \|X - M\|_F^2 = \min_{\{m_1, \dots, m_k\} \subset \mathbb{R}^k} \sum_i \min_g \|x_i - m_g\|_2^2,$$

where the right hand side is the $k$-means objective function applied to the rows of $X$. Define

$$C = \operatorname{argmin}_{M \in R(n,k)} \|X - M\|_F^2. \tag{3.3}$$

And define $C_i$ to be the $i$th row of $C$. After running $k$-means on $x_1, \dots, x_n$, $C_i$ is the $k$-means center that is closest to $x_i$. If $C_i = C_j$ then spectral clustering estimates that $i$ and $j$ are in the same block of the Stochastic Block Model.

With singular value decomposition, define $U$, $\Sigma$, and $V \in \mathbb{R}^{k_n \times k_n}$ such that $\mathscr{C}^T X = U\Sigma V^T$. As can be seen from Appendix B, the orthonormal matrix $O = VU^T$ is the matrix that is used in the Davis-Kahan Theorem and thus the matrix used in Theorem 2.2. Notice

$$\|C_i O - \mathscr{C}_i\|_2 < \frac{\sqrt{2}}{2}\,\|U_L P^{-1/2}\|_m \tag{3.4}$$

implies

$$\|C_i O - \mathscr{C}_j\|_2 \geq \|\mathscr{C}_i - \mathscr{C}_j\|_2 - \|C_i O - \mathscr{C}_i\|_2 > \frac{\sqrt{2}}{2}\,\|U_L P^{-1/2}\|_m \quad \text{for } j \text{ such that } z_i \neq z_j.$$

This further implies that inequality (3.4) is a sufficient condition for $\|C_i O - \mathscr{C}_i\|_2 < \|C_i O - \mathscr{C}_j\|_2$ for any $j$ such that $z_i \neq z_j$. Because $C_i$ reflects the estimated cluster of node $i$ and $\mathscr{C}_i$ reflects the true block membership of node $i$, spectral clustering perfectly identifies the clusters if inequality (3.4) holds for all nodes. Thus, we consider the nodes that do not have this property to be incorrectly clustered. The next theorem bounds the number of nodes which do not satisfy (3.4). The matrices and the eigenvalues in the following theorem depend on $n$. However, to ease notation, there is no $n$ superscript.

Theorem 3.1. Let the random adjacency matrix $W \in \mathbb{R}^{n \times n}$ have a distribution specified by the Stochastic Block Model with $\mathscr{W} = Z B Z^T$ where $Z \in \mathbb{R}^{n \times k_n}$ and $B \in \mathbb{R}^{k_n \times k_n}$. Define $\mathscr{L}$ as in (1.3). Define $|\lambda_1| \geq |\lambda_2| \geq \cdots \geq |\lambda_n|$ to be the eigenvalues of $\mathscr{L}$. Define $\mathscr{C} \in \mathbb{R}^{n \times k_n}$ as in (3.2); its column vectors are unit length eigenvectors of $\mathscr{L}$ corresponding to eigenvalues $\lambda_1, \dots, \lambda_{k_n}$.

Define $L$ as in Equation (1.2). Define $X \in \mathbb{R}^{n \times k_n}$ so that its column vectors are unit length eigenvectors of $L$ corresponding to the $k_n$ eigenvalues of $L$ that are the largest in absolute value. Define $C$ as in (3.3). With singular value decomposition, define $U$, $\Sigma$, and $V \in \mathbb{R}^{k_n \times k_n}$ such that $\mathscr{C}^T X = U\Sigma V^T$. Define $O = VU^T$. The following set contains all of the misclustered nodes (those not satisfying inequality (3.4)),

$$\mathscr{M} = \left\{i:\ \|C_i O - \mathscr{C}_i\|_2 > \frac{\sqrt{2}}{2}\,\|U_L P^{-1/2}\|_m\right\}.$$

Assume that $n^{-1/2}\log(n) = O(\lambda_{k_n}^2)$. For $\tau_n$ as defined in Equation (2.1), assume that there exists a positive integer $N$ such that for all $n > N$, it follows that $\tau_n^2\log(n) > 8$.

Then,

$$|\mathscr{M}| = o\left(\frac{(\log(n))^2}{n\, \lambda_{k_n}^4 \tau_n^4\, \|U_L P^{-1/2}\|_m^2}\right) \quad a.s.$$

for $U_L$ as defined in Lemma 3.1, and $P$ as defined in Equation (3.1).

Proof. For any matrix $M$ and orthonormal $O \in \mathbb{R}^{k_n \times k_n}$, $\|MO\|_F = \|M\|_F$. So, if

$$C = \operatorname{argmin}_{M \in R(n,k_n)} \|X - M\|_F^2$$

then

$$CO = \operatorname{argmin}_{M \in R(n,k_n)} \|XO - M\|_F^2.$$

Notice that

$$\|CO - \mathscr{C}\|_F \leq \|CO - XO\|_F + \|XO - \mathscr{C}\|_F \leq 2\|XO - \mathscr{C}\|_F,$$

where the last inequality follows from the fact that $\mathscr{C} \in R(n, k_n)$ and $CO$ minimizes $\|XO - M\|_F^2$ over $M \in R(n, k_n)$. In the notation of Theorem 2.2, define $S_n = [\lambda_{k_n}^2/2,\, 2]$; then $\delta_n = \delta_n' = \lambda_{k_n}^2/2$. Using these facts, the definition of $\mathscr{M}$, the assumption that $n^{-1/2}\log(n) = O(\lambda_{k_n}^2)$, and Theorem 2.2,

$$\begin{aligned} |\mathscr{M}| \leq \sum_{i \in \mathscr{M}} 1 &\leq \frac{2}{\|U_L P^{-1/2}\|_m^2} \sum_{i \in \mathscr{M}} \|C_i O - \mathscr{C}_i\|_2^2 \\ &\leq \frac{2}{\|U_L P^{-1/2}\|_m^2}\, \|CO - \mathscr{C}\|_F^2 \leq \frac{8}{\|U_L P^{-1/2}\|_m^2}\, \|XO - \mathscr{C}\|_F^2 \\ &= o\left(\frac{(\log(n))^2}{n\, \lambda_{k_n}^4 \tau_n^4\, \|U_L P^{-1/2}\|_m^2}\right) \quad a.s. \end{aligned}$$

The two main assumptions of Theorem 3.1 are

(1) $n^{-1/2}\log(n) = O(\lambda_{k_n}^2)$
(2) eventually, $\tau_n^2\log(n) > 8$.

They imply the conditions needed to apply Theorem 2.2. The first assumption requires that the smallest nonzero eigenvalue of $\mathscr{L}$ is not too small. It implies $n^{-1/2}\log(n) = O(\min\{\delta_n, \delta_n'\})$ for the appropriate choice of $S_n$. In the Stochastic Block Model it is easy to see why this condition is required. As the smallest nonzero eigenvalue gets closer to zero, the clusters become more difficult to identify. In fact, if a nonzero eigenvalue is pushed all the way to zero, it is equivalent to the Stochastic Block Model losing one of its blocks. The second assumption is exactly the same as the second assumption in Theorem 2.2.

In all previous spectral clustering algorithms, it has been suggested that the eigenvectors corresponding to the largest eigenvalues reveal the clusters of interest. The above theorem suggests that before finding the largest eigenvalues, you should first order them by absolute value. This allows for large negative eigenvalues. In fact, eigenvectors of $L$ corresponding to eigenvalues close to negative one (all eigenvalues of $L$ are in $[-1, 1]$) discover "heterophilic" structure in the network that can be useful for clustering. For example, in the network of dating relationships in a high school, two people of opposite sex are more likely to date than people of the same sex. This pattern creates the two male and female "clusters" that have many fewer edges within than between clusters. In this case, $L$ would likely have an eigenvalue close to negative one. The corresponding eigenvector would reveal these "heterophilic" clusters. To see why this might be the case, imagine a Stochastic Block Model with two blocks, two members of each block, and

$$B = \begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}.$$

In this case, there are no connections within blocks and every member is connected to the two members of the opposite block. There is no variability in the matrix $W$. The rows and columns of $L$ can be reordered so that it is a block matrix. The two block matrices down the diagonal are $2 \times 2$ matrices of zeros and all the elements in the off diagonal blocks are equal to $1/2$. There are two nonzero eigenvalues of $L$. Any constant vector is an eigenvector of $L$ with eigenvalue equal to one. The remaining eigenvalue belongs to any eigenvector that is a constant multiple of $(1, 1, -1, -1)$. In this case, with perfect "heterophilic" structure, the eigenvector that is useful for finding the clusters has eigenvalue negative one.

In order to clarify the bound on $|\mathscr{M}|$ in Theorem 3.1, a simple example illustrates how $\lambda_{k_n}$, $\tau_n$, and $\|U_L P^{-1/2}\|_m$ might depend on $n$.

Example: The Stochastic Block Model in this example is parametrized with only four parameters: the number of nodes is $n$, the number of blocks is $k$, the probability of a connection between two nodes in two separate blocks is $r$, and the probability of a connection between two nodes in the same block is $p + r$. It is assumed that all blocks have an equal number of nodes. Define $B \in \mathbb{R}^{k \times k}$ such that

$$B = p I_k + r 1_k 1_k^T$$

where $I_k \in \mathbb{R}^{k \times k}$ is the identity matrix, $1_k \in \mathbb{R}^k$ is a vector of ones, $r \in (0, 1)$ and $p \in (0, 1 - r)$. Assume that $p$ and $r$ are fixed and $k$ can grow with $n$. But, only consider pairs $n, k$ such that $s = n/k$ is integer valued. Let $Z \in \mathbb{R}^{n \times k}$ be a partition matrix such that $Z^T 1_n = s 1_k$. This guarantees that all $k$ groups have equal size $s$. Define a Stochastic

Block Model, $\mathscr{W} = Z B Z^T$. Defining

$$B_L = \frac{1}{nr + sp}\left(p I_k + r 1_k 1_k^T\right)$$

ensures $\mathscr{L} = Z B_L Z^T$.

To apply Theorem 3.1, it is necessary to find the smallest nonzero eigenvalue of $\mathscr{L}$. Because $Z^T Z = s I_k$ and $B_L$ are full rank, $B_L Z^T Z \in \mathbb{R}^{k \times k}$ has $k$ nonzero eigenvalues. Lemma 3.1 shows that these are also nonzero eigenvalues of $\mathscr{L}$. This implies $\operatorname{rank}(\mathscr{L}) \geq k$. Because $\operatorname{rank}(\mathscr{L}) \leq \min\{\operatorname{rank}(Z), \operatorname{rank}(B_L)\} = k$, $\operatorname{rank}(\mathscr{L}) = k$. So, $\mathscr{L}$ does not have any nonzero eigenvalues in addition to the eigenvalues from $B_L Z^T Z$. This implies that the smallest eigenvalue of $B_L Z^T Z$ is also the smallest nonzero eigenvalue of $\mathscr{L}$.

Let $\lambda_1, \dots, \lambda_k$ be the eigenvalues of $B_L(Z^T Z) = B_L(s I_k) = s B_L$. Notice that $1_k$ is an eigenvector:

$$s B_L 1_k = \frac{s}{nr + sp}\left(p I_k + r 1_k 1_k^T\right) 1_k = \frac{s(p + kr)}{nr + sp}\, 1_k$$

Let $\lambda_1 = s(p + kr)/(nr + sp)$. Define

$$\mathscr{U} = \{u:\ \|u\|_2 = 1,\ u^T 1_k = 0\}.$$

Notice that for all $u \in \mathscr{U}$,

$$s B_L u = \frac{s}{nr + sp}\left(p I_k + r 1_k 1_k^T\right) u = \frac{sp}{nr + sp}\, u. \tag{3.5}$$

Equation (3.5) implies that for $i > 1$,

$$\lambda_i = \frac{sp}{nr + sp}.$$

This is also true for $i = k$:

$$\lambda_k = \frac{sp}{nr + sp} = \frac{1}{k(r/p) + 1}.$$

This is the smallest nonzero eigenvalue of $\mathscr{L}$. The first condition of Theorem 3.1, $n^{-1/2}\log(n) = O(\lambda_k)$, is met so long as $k = O(\sqrt{n}/\log(n))$.

Theorem 3.1 also requires the quantities $\tau_n$ and $\|U_L P^{-1/2}\|_m$. It is straightforward to see that $\tau_n = (sp + rn)/n \in [r, r + p]$. Because $\tau_n \geq r$, the second condition of Theorem 3.1 is met: $\tau_n^2\log(n) > 8$ for all $n > \exp(8/r^2)$.

The final quantity involves the matrix $U_L$, whose column vectors are unit length eigenvectors of $B_L Z^T Z = s B_L$. This is a symmetric matrix, so $U_L$ is orthonormal. It follows that

$$\|U_L P^{-1/2}\|_m = \min_{v \in \mathbb{R}^k,\, \|v\|_2 = 1} \|v\, U_L P^{-1/2}\|_2 = \min_{v \in \mathbb{R}^k,\, \|v\|_2 = 1} \|v\, P^{-1/2}\|_2 = \|P^{-1/2}\|_m = s^{-1/2}.$$

Putting the quantities together in Theorem 3.1,

$$|\mathscr{M}| = o\left(k^3 (\log(n))^2\right) \quad a.s.$$

implying that if $k = o(n^{1/3}/(\log(n))^2)$, then

$$\frac{|\mathscr{M}|}{n} = o(1) \quad a.s.$$
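A simulation of this four-parameter model is easy to run. The sketch below uses our own helper names, reuses the spectral_clustering function sketched in Section 1.1, and uses a best-permutation mismatch rate as a crude stand-in for the set $\mathscr{M}$ of Theorem 3.1.

```python
import numpy as np
from itertools import permutations

def sbm_misclustering(n=600, k=4, p=0.3, r=0.05, seed=0):
    rng = np.random.default_rng(seed)
    s = n // k                                  # equal block sizes
    labels = np.repeat(np.arange(k), s)
    P = np.where(labels[:, None] == labels[None, :], p + r, r)
    upper = np.triu((rng.random((n, n)) < P).astype(int), 1)
    W = upper + upper.T

    est = spectral_clustering(W, k)             # sketch from Section 1.1
    best = min(int(np.sum(est != np.asarray(perm)[labels]))
               for perm in permutations(range(k)))
    return best / n                             # fraction of nodes misclustered

print(sbm_misclustering())                      # should be small here
```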

This example is particularly surprising after noticing that if $k = n^{1/4}$, then the vast majority of edges connect nodes in different blocks. To see this, look at a sequence of models such that $k = n^{1/4}$. Note that $s = n^{3/4}$. So, for each node, the expected number of connections to nodes in the same block is $(p + r)n^{3/4}$ and the expected number of connections to nodes in different blocks is $r(n - n^{3/4})$:

$$\frac{\text{Expected number of in block connections}}{\text{Expected number of out of block connections}} = \frac{(p + r)n^{3/4}}{r(n - n^{3/4})} = O(n^{-1/4})$$

These are not the tight communities that many imagine when considering networks. Instead, a dwindling fraction of each node's edges actually connect to nodes in the same block.

A more refined result would allow $r$ to decay with $n$. However, when $r$ decays, so does the minimum expected degree, and the tail bounds used in proving Theorem 2.1 require the minimum expected degree to grow nearly as fast as $n$. Allowing $r$ to decay with $n$ is an area for future research.

4. Discussion

The goal of this paper is to bring statistical rigor to the study of community detection by assessing how well spectral clustering can estimate the clusters in the Stochastic Block Model. The Stochastic Block Model is easily amenable to the analysis of clustering algorithms because of its simplicity and well defined communities. The fact that spectral clustering performs well on the Stochastic Block Model is encouraging. However, because the Stochastic Block Model fails to represent fundamental features that most empirical networks display, this result should only be considered a first step.

This paper has two main results. The first main result, Theorem 2.2, proves that under the latent space model, the eigenvectors of the empirical normalized graph Laplacian converge to the eigenvectors of the population normalized graph Laplacian, so long as (1) the minimum expected degree grows fast enough and (2) the eigengap that separates the leading eigenvalues from the smaller eigenvalues does not shrink too quickly. This theorem has consequences in addition to those related to spectral clustering.

Visualization is an important tool for social network analysts (Liotta, 2004; Freeman, 2000; Wasserman and Faust, 1994). However, there is little statistical understanding of these techniques under stochastic models. Two visualization techniques, factor analysis and multidimensional scaling, have variations that utilize the eigenvectors of the graph Laplacian. Similar approaches were suggested for social networks as far back as the 1950's (Bock and Husain, 1952; Breiger et al., 1975). Koren (2005) suggests visualizing the graph using the eigenvectors of the unnormalized graph Laplacian. The analogous method for the normalized graph Laplacian would use the $i$th row of $X$ as the coordinates for the $i$th node. Theorem 2.2 shows that, under the latent space model, this visualization is not much different than visualizing the graph by instead replacing $X$ with $\mathscr{X}$. If there is structure in the latent space of a latent space model (for example, the $z_1, \dots, z_n$ form clusters) and this structure is represented in the eigenvectors of the population normalized graph Laplacian, then plotting the eigenvectors will potentially reveal this structure.

The Stochastic Block Model is a specific latent space model that satisfies these conditions. It has well defined clusters or blocks and, under weak assumptions, the eigenvectors of the population normalized graph Laplacian perfectly identify the block structure. Theorem 2.2 suggests that you could discover this clustering structure by using the visualization technique proposed by Koren (2005). The second main result, Theorem 3.1, goes further to suggest just how many nodes you might miscluster by running $k$-means on those points (this is spectral clustering). Theorem 3.1 proves that if (1) the minimum expected degree grows fast enough and (2) the smallest nonzero eigenvalue of the population normalized graph Laplacian shrinks slowly enough, then the proportion of nodes that are misclustered by spectral clustering vanishes in the asymptote.

The asymptotic framework applied in Theorem 3.1 allows the number of blocks to grow with the number of nodes; this is the first such high-dimensional clustering result. Allowing the number of clusters to grow is reasonable because, as Leskovec et al. (2008) noted, large networks do not necessarily have large communities.
In fact, in a wide range of empirical networks, the tightest communities have a roughly constant size. Allowing the number of blocks to grow with the number of nodes ensures the clusters do not become too large.

We see two main limitations of our results. We suspect these are due to shortcomings in our proof and not due to the shortcomings of spectral clustering. First, the results do not show that spectral clustering is consistent under the Stochastic Block Model; Theorem 3.1 only gives a bound on the number of misclassified nodes. This might be because we only show that the eigenvectors are close in Frobenius norm and do not study the distribution of the eigenvectors. The second shortcoming is that we require the minimum expected degree to grow at the same rate as $n$ (ignoring $\log(n)$ terms). In many empirical networks, the edges are not dense enough to imply this type of asymptotic framework.

Appendix A: Proof of Theorem 2.1

First, a proof of Lemma 2.1.

Proof. By eigendecomposition, $M = \sum_{i=1}^n \lambda_i u_i u_i^T$ where $u_1, \dots, u_n$ are orthonormal eigenvectors of $M$. So,

$$MM = \left(\sum_{i=1}^n \lambda_i u_i u_i^T\right)\left(\sum_{i=1}^n \lambda_i u_i u_i^T\right) = \sum_{i=1}^n \lambda_i^2 u_i u_i^T.$$

Right multiplying by any $u_i$ yields $MMu_i = \lambda_i^2 u_i$. This proves one direction of part one in the lemma: if $\lambda$ is an eigenvalue of $M$, then $\lambda^2$ is an eigenvalue of $MM$. It also proves part two of the lemma: all eigenvectors of $M$ are also eigenvectors of $MM$.

To see that if $\lambda^2$ is an eigenvalue of $MM$, then $\lambda$ or $-\lambda$ is an eigenvalue of $M$, notice that both $M$ and $MM$ have exactly $n$ eigenvalues (counting multiplicities) because both matrices are real and symmetric. So, the previous paragraph specifies $n$ eigenvalues of $MM$ by squaring the eigenvalues of $M$. Because $MM$ has exactly $n$ eigenvalues, there are no other eigenvalues.

The rest of the proof is devoted to part three of the lemma. Let $MMv = \lambda^2 v$. By eigenvalue decomposition, $M = \sum_i \lambda_i u_i u_i^T$ and because $u_1, \dots, u_n$ are orthonormal ($M$ is real and symmetric) there exist $\alpha_1, \dots, \alpha_n$ such that $v = \sum_i \alpha_i u_i$.

$$\sum_i \lambda^2 \alpha_i u_i = \lambda^2 v = MMv = M\left(\sum_i \lambda_i u_i u_i^T\right)v = M\left(\sum_i \lambda_i \alpha_i u_i\right) = \sum_i \lambda_i \alpha_i M u_i = \sum_i \lambda_i^2 \alpha_i u_i$$

By the orthogonality of the $u_i$'s, it follows that $\lambda^2 \alpha_i = \lambda_i^2 \alpha_i$ for all $i$. So, if $\lambda_i^2 \neq \lambda^2$, then $\alpha_i = 0$.

For $i = 1, \dots, n$, define $c_i = \mathscr{D}_{ii}/n$ and $\tau = \min_{i=1,\dots,n} c_i$.

Lemma A.1. If $n^{1/2}/\log(n) > 2$,

$$P\left(\|LL - \mathscr{L}\mathscr{L}\|_F \geq \frac{32\sqrt{2}\log(n)}{\tau^2\, n^{1/2}}\right) \leq 4n^{2 - (\tau^2/4)\log(n)}.$$

The main complication of the proof of Lemma A.1 is controlling the dependencies between the elements of $LL$. We do this with an intermediate step that uses the matrix

$$\tilde{L} = \mathscr{D}^{-1/2} W \mathscr{D}^{-1/2}$$

and two sets $\Gamma$ and $\Lambda$. $\Gamma$ constrains the matrix $D$, while $\Lambda$ constrains the matrix $W\mathscr{D}^{-1}W$. These sets will be defined in the proof. To ease the notation, define

$$P_{\Gamma\Lambda}(B) = P(B \cap \Gamma \cap \Lambda)$$

where $B$ is some event.

Proof. This proof shows that under the sets $\Gamma$ and $\Lambda$ the probability of the norm exceeding $32\sqrt{2}\log(n)\,\tau^{-2} n^{-1/2}$ is exactly zero for large enough $n$ and that the probability of $\Gamma$ or $\Lambda$ not happening is exponentially small. To ease notation, define $a = 32\sqrt{2}\log(n)\,\tau^{-2} n^{-1/2}$. The diagonal terms behave differently than the off diagonal terms. So, break them apart.

$$\begin{aligned} P(\|LL - \mathscr{L}\mathscr{L}\|_F \geq a) &\leq P_{\Gamma\Lambda}(\|LL - \mathscr{L}\mathscr{L}\|_F \geq a) + P((\Gamma \cap \Lambda)^c) \\ &= P_{\Gamma\Lambda}\left(\sum_{i,j}[LL - \mathscr{L}\mathscr{L}]_{ij}^2 \geq a^2\right) + P((\Gamma \cap \Lambda)^c) \\ &\leq P_{\Gamma\Lambda}\left(\sum_{i \neq j}[LL - \mathscr{L}\mathscr{L}]_{ij}^2 \geq a^2/2\right) + P_{\Gamma\Lambda}\left(\sum_i [LL - \mathscr{L}\mathscr{L}]_{ii}^2 \geq a^2/2\right) + P((\Gamma \cap \Lambda)^c) \end{aligned}$$

First, address the sum over the off diagonal terms.

$$\begin{aligned} P_{\Gamma\Lambda}\left(\sum_{i \neq j}[LL - \mathscr{L}\mathscr{L}]_{ij}^2 \geq a^2/2\right) &\leq P_{\Gamma\Lambda}\left(\bigcup_{i \neq j}\left\{[LL - \mathscr{L}\mathscr{L}]_{ij}^2 \geq \frac{a^2}{2n^2}\right\}\right) \quad (A.1) \\ &\leq \sum_{i \neq j} P_{\Gamma\Lambda}\left(|LL - \mathscr{L}\mathscr{L}|_{ij} \geq \frac{a}{\sqrt{2}\,n}\right) \\ &\leq \sum_{i \neq j} P_{\Gamma\Lambda}\left(|LL - \tilde{L}\tilde{L}|_{ij} + |\tilde{L}\tilde{L} - \mathscr{L}\mathscr{L}|_{ij} \geq \frac{a}{\sqrt{2}\,n}\right) \\ &\leq \sum_{i \neq j}\left[ P_{\Gamma\Lambda}\left(|LL - \tilde{L}\tilde{L}|_{ij} \geq \frac{a}{\sqrt{8}\,n}\right) + P_{\Gamma\Lambda}\left(|\tilde{L}\tilde{L} - \mathscr{L}\mathscr{L}|_{ij} \geq \frac{a}{\sqrt{8}\,n}\right)\right] \quad (A.2) \end{aligned}$$

The sum over the diagonal terms is similar,

$$P_{\Gamma\Lambda}\left(\sum_i [LL - \mathscr{L}\mathscr{L}]_{ii}^2 \geq a^2/2\right) \leq \sum_i \left[P_{\Gamma\Lambda}\left(|LL - \tilde{L}\tilde{L}|_{ii} \geq \frac{a}{\sqrt{8n}}\right) + P_{\Gamma\Lambda}\left(|\tilde{L}\tilde{L} - \mathscr{L}\mathscr{L}|_{ii} \geq \frac{a}{\sqrt{8n}}\right)\right],$$

with one key difference. In equation (A.1), the union bound addresses nearly $n^2$ terms. This yields the $1/n^2$ term in line (A.1). After taking the square root, each term has a lower bound with a factor of $1/n$. However, because there are only $n$ terms on the diagonal, after taking the square root in the last equation above, the lower bound has a factor of $1/\sqrt{n}$.

To constrain the terms $|\tilde{L}\tilde{L} - \mathscr{L}\mathscr{L}|_{ij}$ for $i \neq j$ and $i = j$, define

$$\Lambda = \bigcap_{i,j}\left\{\left|\sum_k (W_{ik}W_{jk} - p_{ijk})/c_k\right| < n^{1/2}\log(n)\right\},$$

where $p_{ijk} = E(W_{ik}W_{jk})$. On the set $\Lambda$, for $i \neq j$,

$$|\tilde{L}\tilde{L} - \mathscr{L}\mathscr{L}|_{ij} < \frac{a}{\sqrt{8}\,n} \tag{A.3}$$

and, for the diagonal terms,

$$|\tilde{L}\tilde{L} - \mathscr{L}\mathscr{L}|_{ii} < \frac{a}{\sqrt{8n}}. \tag{A.4}$$

To see Equation (A.3), expand the left hand side of the inequality for $i \neq j$,

$$|\tilde{L}\tilde{L} - \mathscr{L}\mathscr{L}|_{ij} = \frac{1}{(\mathscr{D}_{ii}\mathscr{D}_{jj})^{1/2}}\left|\sum_k (W_{ik}W_{jk} - p_{ik}p_{jk})/\mathscr{D}_{kk}\right| = \frac{1}{\sqrt{c_i c_j}\, n^2}\left|\sum_k (W_{ik}W_{jk} - p_{ik}p_{jk})/c_k\right|$$

This is bounded on $\Lambda$, yielding

$$|\tilde{L}\tilde{L} - \mathscr{L}\mathscr{L}|_{ij} < \frac{\log(n)}{\tau\, n^{3/2}} \leq \frac{32\sqrt{2}\log(n)}{\sqrt{8}\,\tau^2\, n^{3/2}} = \frac{a}{\sqrt{8}\,n}.$$

So, Equation (A.3) holds for $i \neq j$. Equation (A.4) is different because $W_{ik}^2 = W_{ik}$. As a result, the diagonal of $\tilde{L}\tilde{L}$ is a biased estimator of the diagonal of $\mathscr{L}\mathscr{L}$.

$$\begin{aligned} |\tilde{L}\tilde{L} - \mathscr{L}\mathscr{L}|_{ii} &= \left|\sum_k \frac{W_{ik}^2 - p_{ik}^2}{\mathscr{D}_{ii}\mathscr{D}_{kk}}\right| = \left|\sum_k \frac{W_{ik} - p_{ik}^2}{\mathscr{D}_{ii}\mathscr{D}_{kk}}\right| \\ &\leq \left|\sum_k \frac{W_{ik} - p_{ik}}{\mathscr{D}_{ii}\mathscr{D}_{kk}}\right| + \left|\sum_k \frac{p_{ik} - p_{ik}^2}{\mathscr{D}_{ii}\mathscr{D}_{kk}}\right| \\ &= \frac{1}{c_i n^2}\left(\left|\sum_k (W_{ik} - p_{ik})/c_k\right| + \left|\sum_k (p_{ik} - p_{ik}^2)/c_k\right|\right) \quad (A.5) \end{aligned}$$

Similarly to the $i \neq j$ case, the first term is bounded by $\log(n)\,\tau^{-1} n^{-3/2}$ on $\Lambda$. The second term is bounded by $\tau^{-2} n^{-1}$:

$$\frac{1}{c_i n^2}\left|\sum_k (p_{ik} - p_{ik}^2)/c_k\right| \leq \frac{1}{c_i n^2}\left|\sum_k 1/\tau\right| \leq \frac{1}{\tau^2 n}.$$

Substituting the value of $a$ in reveals that on the set $\Lambda$, both terms in (A.5) are bounded by $\frac{a}{2\sqrt{8n}}$. So, their sum is bounded by $\frac{a}{\sqrt{8n}}$, satisfying Equation (A.4).

This next part addresses the difference between $LL$ and $\tilde{L}\tilde{L}$, showing that for large enough $n$, any $i \neq j$, and some set $\Gamma$,

$$P_{\Gamma\Lambda}\left(|LL - \tilde{L}\tilde{L}|_{ij} \geq \frac{a}{\sqrt{8}\,n}\right) = 0, \qquad P_{\Gamma\Lambda}\left(|LL - \tilde{L}\tilde{L}|_{ii} \geq \frac{a}{\sqrt{8n}}\right) = 0.$$

It is enough to show that for any $i$ and $j$,

$$P_{\Gamma\Lambda}\left(|LL - \tilde{L}\tilde{L}|_{ij} \geq \frac{a}{\sqrt{8}\,n}\right) = 0. \tag{A.6}$$

For $b(n) = \log(n)\, n^{-1/2}$, define $u(n) = 1 + b(n)$ and $l(n) = 1 - b(n)$. With these, define the following sets,

$$\Gamma = \bigcap_i \left\{D_{ii} \in \mathscr{D}_{ii}[l(n), u(n)]\right\}$$
$$\Gamma(1) = \bigcap_i \left\{\frac{1}{D_{ii}} \in \frac{1}{\mathscr{D}_{ii}}\left[u(n)^{-1}, l(n)^{-1}\right]\right\}$$
$$\Gamma(2) = \bigcap_{i,j} \left\{\frac{1}{(D_{ii}D_{jj})^{1/2}} \in \frac{1}{(\mathscr{D}_{ii}\mathscr{D}_{jj})^{1/2}}\left[u(n)^{-1}, l(n)^{-1}\right]\right\}$$
$$\Gamma(3) = \bigcap_{i,j,k} \left\{\frac{1}{D_{kk}(D_{ii}D_{jj})^{1/2}} \in \frac{1}{\mathscr{D}_{kk}(\mathscr{D}_{ii}\mathscr{D}_{jj})^{1/2}}\left[u(n)^{-2}, l(n)^{-2}\right]\right\}$$

Notice that $\Gamma \subseteq \Gamma(1) \subseteq \Gamma(2)$ and $\Gamma \subseteq \Gamma(3)$. Define another set,

$$\Gamma(4) = \bigcap_{i,j,k} \left\{\frac{1}{D_{kk}(D_{ii}D_{jj})^{1/2}} \in \frac{1}{\mathscr{D}_{kk}(\mathscr{D}_{ii}\mathscr{D}_{jj})^{1/2}}\left[1 - 16\,b(n),\ 1 + 16\,b(n)\right]\right\}.$$

The next steps show that this set contains $\Gamma$. It is sufficient to show $\Gamma(3) \subset \Gamma(4)$. This is true because

$$\frac{1}{u(n)^2} = \frac{1}{(1 + b(n))^2} > 1 - 2\,b(n) > 1 - 16\,b(n).$$

The 16 in the last bound is larger than it needs to be, so that the upper and lower bounds in $\Gamma(4)$ are symmetric. For the other direction,

$$\frac{1}{l(n)^2} = \frac{1}{(1 - b(n))^2} = \left(1 + \frac{1}{b(n)^{-1} - 1}\right)^2 = 1 + \frac{2}{b(n)^{-1} - 1} + \frac{1}{(b(n)^{-1} - 1)^2}.$$

We now need to bound the last two elements here. We are assuming $\sqrt{n}/\log(n) > 2$; equivalently, $1 - b(n) > 1/2$. So, we have both of the following:

$$\frac{1}{(b(n)^{-1} - 1)^2} < 2\,b(n) \quad \text{and} \quad \frac{2}{b(n)^{-1} - 1} = \frac{2\,b(n)}{1 - b(n)} < 8\,b(n).$$

Putting these together,

$$\frac{1}{l(n)^2} < 1 + 16\,b(n).$$

This shows that $\Gamma \subset \Gamma(4)$. Now, under the set $\Gamma$, and thus $\Gamma(4)$,

$$\begin{aligned} |\tilde{L}\tilde{L} - LL|_{ij} &= \left|\sum_k \left(\frac{W_{ik}W_{jk}}{\mathscr{D}_{kk}(\mathscr{D}_{ii}\mathscr{D}_{jj})^{1/2}} - \frac{W_{ik}W_{jk}}{D_{kk}(D_{ii}D_{jj})^{1/2}}\right)\right| \\ &\leq \sum_k \left|\frac{1}{\mathscr{D}_{kk}(\mathscr{D}_{ii}\mathscr{D}_{jj})^{1/2}} - \frac{1}{D_{kk}(D_{ii}D_{jj})^{1/2}}\right| \\ &\leq \sum_k \frac{16\,b(n)}{\mathscr{D}_{kk}(\mathscr{D}_{ii}\mathscr{D}_{jj})^{1/2}} \leq \sum_k \frac{16\,b(n)}{\tau^2 n^2} \leq \frac{16\,b(n)}{\tau^2 n}. \end{aligned}$$

This is equal to $a(\sqrt{8}\,n)^{-1}$, showing Equation (A.6) holds for all $i$ and $j$.

The remaining step is to bound $P((\Gamma \cap \Lambda)^c)$. Using the union bound, this is less than or equal to $P(\Gamma^c) + P(\Lambda^c)$.

$$\begin{aligned} P(\Gamma^c) &= P\left(\bigcup_i \{D_{ii} \notin \mathscr{D}_{ii}[1 - b(n),\, 1 + b(n)]\}\right) \\ &\leq \sum_i P\left(D_{ii} \notin \mathscr{D}_{ii}[1 - b(n),\, 1 + b(n)]\right) \\ &< \sum_i 2\exp\left(-b(n)^2 \mathscr{D}_{ii}/4\right) \\ &\leq 2n\exp\left(-(\log(n))^2\tau/4\right) = 2n^{1 - (\tau/4)\log(n)} \end{aligned}$$

where the second to last inequality is by Chernoff's inequality. The next inequality is Hoeffding's.

$$\begin{aligned} P(\Lambda^c) &= P\left(\bigcup_{i,j}\left\{\left|\sum_k (W_{ik}W_{jk} - p_{ijk})/c_k\right| > n^{1/2}\log(n)\right\}\right) \\ &\leq \sum_{i,j} P\left(\left|\sum_k (W_{ik}W_{jk} - p_{ijk})/c_k\right| > n^{1/2}\log(n)\right) \\ &< \sum_{i,j} 2\exp\left(-2n(\log(n))^2 \Big/ \sum_k 1/c_k^2\right) \\ &\leq 2n^2\exp\left(-2(\log(n))^2\tau^2\right) \leq 2n^{2 - (\tau^2/4)\log(n)} \end{aligned}$$

Because $W$ is symmetric, the independence of the $W_{ik}W_{jk}$ across $k$ is not obvious. However, because $W_{ii} = W_{jj} = 0$, they are independent across $k$. Putting the pieces together,

$$P\left(\|LL - \mathscr{L}\mathscr{L}\|_F \geq \frac{32\sqrt{2}\log(n)}{\tau^2\, n^{1/2}}\right)$$

$$\leq P_{\Gamma\Lambda}\left(\|LL - \mathscr{L}\mathscr{L}\|_F \geq \frac{32\sqrt{2}\log(n)}{\tau^2\, n^{1/2}}\right) + P((\Gamma \cap \Lambda)^c) < 0 + 2n^{2 - (\tau^2/4)\log(n)} + 2n^{1 - (\tau/4)\log(n)} \leq 4n^{2 - (\tau^2/4)\log(n)}.$$

The following proves Theorem 2.1.

Proof. Adding the $n$ super- and subscripts to Lemma A.1, it states that if $n^{1/2}/\log(n) > 2$, then

$$P\left(\|L^{(n)}L^{(n)} - \mathscr{L}^{(n)}\mathscr{L}^{(n)}\|_F \geq \frac{c\log(n)}{\tau_n^2\, n^{1/2}}\right) < 4n^{2 - (\tau_n^2/4)\log(n)}$$

for $c = 32\sqrt{2}$. By assumption, for all $n > N$, $\tau_n^2\log(n) > 8$. This implies that $2 - (\tau_n^2/2)\log(n) < -2$ for all $n > N$. Rearranging and summing over $n$, for any fixed $\epsilon > 0$,

$$\sum_{n=1}^{\infty} P\left(\frac{\|L^{(n)}L^{(n)} - \mathscr{L}^{(n)}\mathscr{L}^{(n)}\|_F}{c\,\tau_n^{-2}\log(n)\, n^{-1/2}/\epsilon} \geq \epsilon\right) \leq N + 4\sum_{n=N+1}^{\infty} n^{2 - (\tau_n^2/4)\log(n)} \leq N + 4\sum_{n=N+1}^{\infty} n^{-2},$$

which is a summable sequence. By the Borel-Cantelli Theorem,

$$\|L^{(n)}L^{(n)} - \mathscr{L}^{(n)}\mathscr{L}^{(n)}\|_F = o\left(\tau_n^{-2}\log(n)\, n^{-1/2}\right) \quad a.s.$$

Appendix B: Davis-Kahan Theorem

The statement of the theorem below and the preceding explanation come largely from von Luxburg (2007). For a more detailed account of the Davis-Kahan Theorem see Stewart and Sun (1990).

To avoid the issues associated with multiple eigenvalues, this theorem's original statement is instead about the subspace formed by the eigenvectors. For a distance between subspaces, the theorem uses "canonical angles," which are also known as "principal angles." Given two matrices $M_1$ and $M_2$, both in $\mathbb{R}^{n \times p}$ with orthonormal columns, the singular values $(\sigma_1, \dots, \sigma_p)$ of $M_1^T M_2$ are the cosines of the principal angles $(\cos\Theta_1, \dots, \cos\Theta_p)$ between the column space of $M_1$ and the column space of $M_2$. Define $\sin\Theta(M_1, M_2)$ to be a diagonal matrix containing the sines of the principal angles of $M_1^T M_2$ and define

$$
d(M_1, M_2) = \|\sin\Theta(M_1, M_2)\|_F, \tag{B.1}
$$

which can be expressed as $\big(p - \sum_{j=1}^p \sigma_j^2\big)^{1/2}$ by using the identity $\sin^2\theta = 1 - \cos^2\theta$.

Proposition B.1 (Davis-Kahan). Let $S \subset \mathbb{R}$ be an interval. Denote by $\mathscr{X}$ an orthonormal matrix whose column space is equal to the eigenspace of $\mathscr{L}\mathscr{L}$ corresponding to the eigenvalues in $\lambda_S(\mathscr{L}\mathscr{L})$ (more formally, the column space of $\mathscr{X}$ is the image of the spectral projection of $\mathscr{L}\mathscr{L}$ induced by $\lambda_S(\mathscr{L}\mathscr{L})$). Denote by $X$ the analogous quantity for $LL$. Define the distance between $S$ and the spectrum of $LL$ outside of $S$ as

$$
\delta = \min\big\{\,|\lambda - s| \,:\, \lambda \text{ an eigenvalue of } LL,\ \lambda \notin S,\ s \in S\,\big\}.
$$

Then the distance $d(\mathscr{X}, X) = \|\sin\Theta(\mathscr{X}, X)\|_F$ between the column spaces of $\mathscr{X}$ and $X$ is bounded by

$$
d(X, \mathscr{X}) \le \frac{\|LL - \mathscr{L}\mathscr{L}\|_F}{\delta}.
$$

In the theorem, $LL$ and $\mathscr{L}\mathscr{L}$ can be replaced by any two symmetric matrices.

The rest of this section converts the bound on $d(X, \mathscr{X})$ into a bound on $\|X - \mathscr{X}O\|_F$, where $O$ is some orthonormal rotation. For this, we make the additional assumption that $X$ and $\mathscr{X}$ have the same dimension. Assume there exists $S \subset \mathbb{R}$ containing $k$ eigenvalues of $LL$ and $k$ eigenvalues of $\mathscr{L}\mathscr{L}$, but containing no other eigenvalues of either matrix. Because $LL$ and $\mathscr{L}\mathscr{L}$ are symmetric, their eigenvectors can be chosen to be orthonormal. Let the columns of $X \in \mathbb{R}^{n\times k}$ be $k$ orthonormal eigenvectors of $LL$ corresponding to the $k$ eigenvalues contained in $S$. Let the columns of $\mathscr{X} \in \mathbb{R}^{n\times k}$ be $k$ orthonormal eigenvectors of $\mathscr{L}\mathscr{L}$ corresponding to the $k$ eigenvalues contained in $S$. By the singular value decomposition, there exist orthonormal matrices $U$, $V$ and a diagonal matrix $\Sigma$ such that $\mathscr{X}^T X = U\Sigma V^T$. The singular values $\sigma_1, \dots, \sigma_k$ down the diagonal of $\Sigma$ are the cosines of the principal angles between the column space of $X$ and the column space of $\mathscr{X}$.

Although the Davis-Kahan Theorem is a statement regarding the principal angles, a few lines of algebra show that it can be extended to a bound on the Frobenius norm between the matrix $X$ and the matrix $\mathscr{X}UV^T$, where $UV^T$ is an orthonormal rotation:

$$
\begin{aligned}
\frac{1}{2}\|X - \mathscr{X}UV^T\|_F^2 &= \frac{1}{2}\,\mathrm{trace}\big((X - \mathscr{X}UV^T)^T (X - \mathscr{X}UV^T)\big) \\
&= \frac{1}{2}\,\mathrm{trace}\big(VU^T\mathscr{X}^T\mathscr{X}UV^T + X^TX - 2\,VU^T\mathscr{X}^TX\big) \\
&= \frac{1}{2}\big(k + k - 2\,\mathrm{trace}(VU^T\mathscr{X}^TX)\big) \\
&\le \frac{\|LL - \mathscr{L}\mathscr{L}\|_F^2}{\delta^2},
\end{aligned}
$$

where the last inequality is explained below. It follows from a property of the trace, the fact that the singular values lie in $[0, 1]$, the trigonometric identity $\cos^2\theta = 1 - \sin^2\theta$, and the Davis-Kahan Theorem:

$$
\mathrm{trace}(VU^T\mathscr{X}^TX) = \sum_{i=1}^k \sigma_i \ge \sum_{i=1}^k (\cos\Theta_i)^2 = \sum_{i=1}^k \big(1 - (\sin\Theta_i)^2\big) = k - \big(d(X, \mathscr{X})\big)^2 \ge k - \frac{\|LL - \mathscr{L}\mathscr{L}\|_F^2}{\delta^2}.
$$

This shows that the Davis-Kahan Theorem can instead be thought of as bounding $\|X - \mathscr{X}UV^T\|_F^2$ rather than $d(\mathscr{X}, X)$. The matrix $O$ in Theorem 2.2 is equal to $UV^T$; in this way, it depends on both $X$ and $\mathscr{X}$.
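Because the display above is an exact identity given the singular value decomposition $\mathscr{X}^T X = U\Sigma V^T$, it is easy to verify numerically. The following sketch is illustrative only: a random symmetric matrix stands in for $\mathscr{L}\mathscr{L}$, a small symmetric perturbation of it stands in for $LL$, and the sizes and perturbation scale are assumptions of the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-ins (assumptions of this sketch): a random symmetric
# matrix plays scr-L scr-L, a small symmetric perturbation of it plays LL,
# and the top k eigenvectors of each play scr-X and X.
n, k = 50, 3
M = rng.standard_normal((n, n)); M = (M + M.T) / 2.0
E = 0.01 * rng.standard_normal((n, n)); E = (E + E.T) / 2.0

def top_eigvecs(A, k):
    _, vecs = np.linalg.eigh(A)          # eigenvalues in ascending order
    return vecs[:, -k:]                  # k orthonormal columns

scrX, X = top_eigvecs(M, k), top_eigvecs(M + E, k)

# The singular values of scr-X^T X are the cosines of the principal angles.
U, sigma, Vt = np.linalg.svd(scrX.T @ X)
d = np.sqrt(k - np.sum(sigma**2))        # d(scr-X, X) = ||sin Theta||_F
O = U @ Vt                               # the orthonormal rotation U V^T

# The identity derived above: (1/2)||X - scr-X U V^T||_F^2 = k - trace(Sigma).
lhs = 0.5 * np.linalg.norm(X - scrX @ O, "fro") ** 2
print("d(scr-X, X)                 :", d)
print("(1/2)||X - scr-X UV^T||_F^2 :", lhs)
print("k - trace(Sigma)            :", k - sigma.sum())
```

The last two printed quantities agree to machine precision, and both are bounded by $d(\mathscr{X}, X)^2$, since each $\sigma_i \in [0,1]$.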

Appendix C: Proof of Theorem 2.3

Proof. By Lemma 2.1, the column vectors of $\mathscr{X}_n$ are eigenvectors of $\mathscr{L}^{(n)}\mathscr{L}^{(n)}$ corresponding to all the eigenvalues in $\lambda_{S_n}(\mathscr{L}^{(n)}\mathscr{L}^{(n)})$. For the application of the Davis-Kahan Theorem, this means that the column space of $\mathscr{X}_n$ is the image of the spectral projection of $\mathscr{L}^{(n)}\mathscr{L}^{(n)}$ induced by $\lambda_{S_n}(\mathscr{L}^{(n)}\mathscr{L}^{(n)})$. Similarly for the column vectors of $X_n$, the matrix $L^{(n)}L^{(n)}$, and the set $\lambda_{S_n}(L^{(n)}L^{(n)})$.

Recall that $\bar\lambda_1^{(n)} \ge \cdots \ge \bar\lambda_n^{(n)}$ are defined to be the eigenvalues of the symmetric matrix $\mathscr{L}^{(n)}\mathscr{L}^{(n)}$ and $\lambda_1^{(n)} \ge \cdots \ge \lambda_n^{(n)}$ are defined to be the eigenvalues of the symmetric matrix $L^{(n)}L^{(n)}$. The assumption $n^{-1/2}\log(n) = O(\tau_n^2 \min\{\delta_n, \delta_n'\})$, together with $\tau_n \le 1$, implies $\tau_n^{-2}\, n^{-1/2}\log(n) = O(\min\{\delta_n, \delta_n'\})$. Together with Equation (2.2), this implies

$$
\frac{\max_i |\lambda_i^{(n)} - \bar\lambda_i^{(n)}|}{\min\{\delta_n, \delta_n'\}} = \frac{\max_i |\lambda_i^{(n)} - \bar\lambda_i^{(n)}|}{\tau_n^{-2}\, n^{-1/2}\log(n)} \cdot \frac{\tau_n^{-2}\, n^{-1/2}\log(n)}{\min\{\delta_n, \delta_n'\}} = o(1)\,O(1) = o(1).
$$

So, eventually, $\max_i |\lambda_i^{(n)} - \bar\lambda_i^{(n)}| < \min\{\delta_n, \delta_n'\}$. This means that $\lambda_i^{(n)} \in S_n$ if and only if $\bar\lambda_i^{(n)} \in S_n$. Thus the number of elements in $\lambda_{S_n}(L^{(n)}L^{(n)})$ is eventually equal to the

number of elements in $\lambda_{S_n}(\mathscr{L}^{(n)}\mathscr{L}^{(n)})$, implying that $X_n$ and $\mathscr{X}_n$ will eventually have the same number of columns, $k_n = K_n$.

Once $X_n$ and $\mathscr{X}_n$ have the same number of columns, define the matrices $U_n$ and $V_n$ through the singular value decomposition $\mathscr{X}_n^T X_n = U_n \Sigma_n V_n^T$, and define $O_n = U_n V_n^T$. The result follows from the Davis-Kahan Theorem and Theorem 2.1:

$$
\|X_n - \mathscr{X}_n O_n\|_F \le \frac{\sqrt{2}\,\|L^{(n)}L^{(n)} - \mathscr{L}^{(n)}\mathscr{L}^{(n)}\|_F}{\delta_n} = o\left(\frac{\log(n)}{\delta_n\, \tau_n^2\, n^{1/2}}\right) \quad \text{a.s.}
$$
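The two steps of this proof, matching the number of eigenvalues falling in $S_n$ and then aligning the two eigenvector bases with $O_n = U_n V_n^T$, can be illustrated numerically. In the sketch below, the planted spectrum, the interval $S_n = (3, \infty)$, and the perturbation scale are assumptions of the example, not quantities from the theorem.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative stand-ins (assumptions of this sketch): a matrix with three
# planted eigenvalues inside the interval plays the role of scr-L scr-L,
# and a small symmetric perturbation of it plays the role of L L.
n = 60
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))       # random orthonormal basis
vals = np.concatenate([[5.0, 4.8, 4.6], rng.uniform(0.0, 1.0, n - 3)])
M = Q @ np.diag(vals) @ Q.T
E = 0.05 * rng.standard_normal((n, n)); E = (E + E.T) / 2.0
S = (3.0, np.inf)                                      # the interval S_n

def eigpairs_in_interval(A, S):
    lam, vecs = np.linalg.eigh(A)                      # ascending order
    keep = (lam > S[0]) & (lam < S[1])
    return lam, vecs[:, keep]

lam_bar, scrX = eigpairs_in_interval(M, S)
lam, X = eigpairs_in_interval(M + E, S)

# Step 1: the eigenvalues pair up closely, so both matrices put the same
# number of eigenvalues in S_n (k_n = K_n).
print("max_i |lambda_i - lambda_bar_i| :", np.abs(lam - lam_bar).max())
print("columns of scr-X and X          :", scrX.shape[1], X.shape[1])

# Step 2: align the two orthonormal bases with O_n = U_n V_n^T.
U, _, Vt = np.linalg.svd(scrX.T @ X)
O_n = U @ Vt
print("||X_n - scr-X_n O_n||_F         :", np.linalg.norm(X - scrX @ O_n, "fro"))
```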

Appendix D: Stochastic Block Model

Below is a proof of Lemma 3.1.

Proof. To find the matrix $B_L$, define $D_B = \mathrm{diag}(BZ^T 1_n) \in \mathbb{R}^{k\times k}$, where $1_n$ is the vector of ones in $\mathbb{R}^n$. For any $i, j$,

$$
\mathscr{L}_{ij} = \frac{\mathscr{W}_{ij}}{\sqrt{\mathscr{D}_{ii}\mathscr{D}_{jj}}} = z_i D_B^{-1/2} B D_B^{-1/2} (z_j)^T.
$$

Define $B_L = D_B^{-1/2} B D_B^{-1/2}$. It follows that $\mathscr{L}_{ij} = (ZB_LZ^T)_{ij}$ and thus $\mathscr{L} = ZB_LZ^T$.

To see that the matrix $U_L$ exists, notice that $(Z^TZ)^{1/2} B_L (Z^TZ)^{1/2}$ is a symmetric matrix. So there exists an orthonormal matrix $V$ whose columns are its eigenvectors: $(Z^TZ)^{1/2} B_L (Z^TZ)^{1/2} V = V\Lambda'$, for some diagonal matrix $\Lambda'$. Define $U_L' = (Z^TZ)^{-1/2} V$. Then

$$
(Z^TZ)^{1/2} B_L (Z^TZ)\, U_L' = (Z^TZ)^{1/2} B_L (Z^TZ)^{1/2} V = V\Lambda' = (Z^TZ)^{1/2} U_L' \Lambda'.
$$

Left-multiplying by $(Z^TZ)^{-1/2}$ shows that the columns of $U_L'$ are eigenvectors of $B_L(Z^TZ)$. This also shows that $\Lambda' = \Lambda$. Rescaling the columns of $U_L'$ yields $U_L$. To see the second part of the lemma,

$$
\mathscr{L}(ZU_L) = ZB_LZ^T ZU_L = Z(B_L Z^TZ\, U_L) = Z(U_L\Lambda) = (ZU_L)\Lambda.
$$
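A toy example makes the construction concrete. The sketch below assumes an illustrative membership vector and block matrix $B$ (neither comes from the paper), builds $B_L = D_B^{-1/2} B D_B^{-1/2}$, and checks that the columns of $ZU_L$ are eigenvectors of $\mathscr{L} = ZB_LZ^T$. Note that the check already succeeds with $U_L' = (Z^TZ)^{-1/2}V$, before the rescaling step.

```python
import numpy as np

# Toy check of Lemma 3.1 (sizes, memberships, and B are illustrative
# assumptions, not values from the paper).
n, k = 12, 3
z = np.array([0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2])   # block memberships
Z = np.eye(k)[z]                                      # n x k membership matrix
B = np.array([[0.6, 0.1, 0.1],
              [0.1, 0.5, 0.2],
              [0.1, 0.2, 0.7]])                       # symmetric block matrix

# D_B = diag(B Z^T 1_n) and B_L = D_B^{-1/2} B D_B^{-1/2}.
D_B = np.diag(B @ Z.T @ np.ones(n))
D_B_inv_half = np.diag(1.0 / np.sqrt(np.diag(D_B)))
B_L = D_B_inv_half @ B @ D_B_inv_half
scrL = Z @ B_L @ Z.T                                  # population Laplacian scr-L

# Construct U_L' as in the proof: Z^T Z is diagonal (the block sizes), so
# its square root is immediate; eigendecompose (Z^T Z)^{1/2} B_L (Z^T Z)^{1/2}.
G_half = np.diag(np.sqrt(np.diag(Z.T @ Z)))
Lam, V = np.linalg.eigh(G_half @ B_L @ G_half)
U_L = np.diag(1.0 / np.diag(G_half)) @ V              # (Z^T Z)^{-1/2} V

# Second part of the lemma: scr-L (Z U_L) = (Z U_L) Lambda.
print(np.allclose(scrL @ (Z @ U_L), (Z @ U_L) @ np.diag(Lam)))   # True
```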

References

E.M. Airoldi, D.M. Blei, S.E. Fienberg, and E.P. Xing. Mixed membership stochastic blockmodels. Journal of Machine Learning Research, 9:1981-2014, 2008.
R. Albert and A.-L. Barabási. Statistical mechanics of complex networks. Reviews of Modern Physics, 74(1):47-97, 2002.
A.-L. Barabási and R. Albert. Emergence of scaling in random networks. Science, 286(5439):509-512, 1999.
M. Belkin. Problems of Learning on Manifolds. PhD thesis, The University of Chicago, 2003.
M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data representation. Neural Computation, 15(6):1373-1396, 2003.
M. Belkin and P. Niyogi. Towards a theoretical foundation for Laplacian-based manifold methods. Journal of Computer and System Sciences, 74(8):1289-1308, 2008.
R. Bhatia. Perturbation Bounds for Matrix Eigenvalues. 1987.
P.J. Bickel and A. Chen. A nonparametric view of network models and Newman-Girvan and other modularities. Proceedings of the National Academy of Sciences, 106(50):21068-21073, 2009.
R.D. Bock and S. Husain. Factors of the tele. Sociometry, 15:206-219, 1952.
O. Bousquet, O. Chapelle, and M. Hein. Measure based regularization. Advances in Neural Information Processing Systems, 16, 2004.
R.L. Breiger, S.A. Boorman, and P. Arabie. An algorithm for clustering relational data with applications to social network analysis and comparison with multidimensional scaling. Journal of Mathematical Psychology, 12(3):328-383, 1975.
R.R. Coifman, S. Lafon, A.B. Lee, M. Maggioni, B. Nadler, F. Warner, and S.W. Zucker. Geometric diffusions as a tool for harmonic analysis and structure definition of data: Diffusion maps. Proceedings of the National Academy of Sciences, 102(21):7426-7431, 2005.
A. Condon and R.M. Karp. Algorithms for graph partitioning on the planted partition model. In Randomization, Approximation, and Combinatorial Optimization: Algorithms and Techniques, RANDOM-APPROX'99, Berkeley, CA, page 221. Springer-Verlag, 1999.
W.E. Donath and A.J. Hoffman. Lower bounds for the partitioning of graphs. IBM Journal of Research and Development, 17(5):420-425, 1973.
P. Erdős and A. Rényi. On random graphs. Publicationes Mathematicae Debrecen, 6:290-297, 1959.
M. Fiedler. Algebraic connectivity of graphs. Czechoslovak Mathematical Journal, 23(2):298-305, 1973.
P. Fjällström. Algorithms for graph partitioning: A survey. Linköping Electronic Articles in Computer and Information Science, 3(10), 1998.
S. Fortunato. Community detection in graphs. arXiv:0906.0612, 2009.
O. Frank and D. Strauss. Markov graphs. Journal of the American Statistical Association, 81(395):832-842, 1986.
L.C. Freeman. Visualizing social networks. Journal of Social Structure, 1(1):4, 2000.
E. Giné and V. Koltchinskii. Empirical graph Laplacian approximation of Laplace-Beltrami operators: Large sample results. Lecture Notes-Monograph Series, pages 238-259, 2006.
M. Girvan and M.E.J. Newman. Community structure in social and biological networks. Proceedings of the National Academy of Sciences, 99(12):7821-7826, 2002.
A. Goldenberg, A.X. Zheng, S.E. Fienberg, and E.M. Airoldi. A survey of statistical network models. Foundations and Trends in Machine Learning, to appear, 2009.
L. Hagen and A.B. Kahng. New spectral methods for ratio cut partitioning and clustering. IEEE Transactions on Computer-Aided Design, 11(9):1074-1085, 1992.
M.S. Handcock, A.E. Raftery, and J.M. Tantrum. Model-based clustering for social networks. Journal of the Royal Statistical Society: Series A, 170(2):301-354, 2007.
M. Hein. Uniform convergence of adaptive graph-based regularization. Lecture Notes in Computer Science, 4005:50, 2006.
M. Hein, J.-Y. Audibert, and U. von Luxburg. From graphs to manifolds: weak and strong pointwise consistency of graph Laplacians. In Proceedings of the 18th Conference on Learning Theory (COLT), pages 470-485. Springer, 2005.
B. Hendrickson and R. Leland. An improved spectral graph partitioning algorithm for mapping parallel computations. SIAM Journal on Scientific Computing, 16(2):452-469, 1995.
P.D. Hoff, A.E. Raftery, and M.S. Handcock. Latent space approaches to social network analysis. Journal of the American Statistical Association, 97(460):1090-1098, 2002.
P. Holland, K.B. Laskey, and S. Leinhardt. Stochastic blockmodels: Some first steps. Social Networks, 5:109-137, 1983.
P.W. Holland and S. Leinhardt. An exponential family of probability distributions for directed graphs. Journal of the American Statistical Association, 76(373):33-50, 1981.
O. Kallenberg. Probabilistic Symmetries and Invariance Principles. Springer-Verlag, 2005.
D.J. Klein and M. Randić. Resistance distance. Journal of Mathematical Chemistry, 12(1):81-95, 1993.
Y. Koren. Drawing graphs by eigenvectors: Theory and practice. Computers and Mathematics with Applications, 49(11-12):1867-1888, 2005.
S.S. Lafon. Diffusion Maps and Geometric Harmonics. PhD thesis, Yale University, 2004.
J. Leskovec, K.J. Lang, A. Dasgupta, and M.W. Mahoney. Statistical properties of community structure in large social and information networks. In Proceedings of the 17th International Conference on World Wide Web, pages 695-704. ACM, 2008.
G. Liotta. Graph Drawing. Lecture Notes in Computer Science, 2912, 2004.
M. Meilă and J. Shi. A random walks view of spectral segmentation. AI and Statistics (AISTATS), 2001.
M.E.J. Newman and M. Girvan. Finding and evaluating community structure in networks. Physical Review E, 69(2):026113, 2004.
K. Nowicki and T.A.B. Snijders. Estimation and prediction for stochastic blockstructures. Journal of the American Statistical Association, 96(455):1077-1087, 2001.
A. Pothen, H.D. Simon, and K.P. Liou. Partitioning sparse matrices with eigenvectors of graphs. SIAM Journal on Matrix Analysis and Applications, 11:430-452, 1990.
J. Shi and J. Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888-905, 2000.
T.A.B. Snijders and K. Nowicki. Estimation and prediction for stochastic blockmodels for graphs with latent block structure. Journal of Classification, 14(1):75-100, 1997.
D.A. Spielman and S.-H. Teng. Spectral partitioning works: Planar graphs and finite element meshes. Linear Algebra and its Applications, 421(2-3):284-305, 2007.
G.W. Stewart and J. Sun. Matrix Perturbation Theory. Academic Press, 1990.
R. Van Driessche and D. Roose. An improved spectral bisection algorithm and its application to dynamic load balancing. Parallel Computing, 21(1):29-48, 1995.
M.A.J. Van Duijn, T.A.B. Snijders, and B.J.H. Zijlstra. p2: a random effects model with covariates for directed graphs. Statistica Neerlandica, 58(2):234-254, 2004.
U. von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395-416, 2007.
U. von Luxburg, M. Belkin, and O. Bousquet. Consistency of spectral clustering. Annals of Statistics, 36(2):555-586, 2008.
S. Wasserman and K. Faust. Social Network Analysis: Methods and Applications. Cambridge University Press, 1994.
D.J. Watts and S.H. Strogatz. Collective dynamics of 'small-world' networks. Nature, 393(6684):440-442, 1998.