C.F. van Oosten

[email protected]

Community Detection in a Time Dynamic Stochastic Block Model

Master thesis

Thesis supervisor: Dr. J. Schmidt-Hieber

Date of publication: March 10, 2017

Mathematical Institute of Leiden University

1 Summary

In network data analysis, community detection is a fundamental statistical problem. A popular approach to analyse such data networks is the stochastic block model. In this model connection probabilities between nodes only depend on the communities they belong to. This thesis is concerned with a two-stage method that achieves the optimal misclassification proportion in the stochastic block model. Even though the stochastic block model is simple and captures the group structure, it lacks flexibility and is not very realistic: networks generated by the stochastic block model do not contain hubs, that is, nodes with a very high degree, which are often observed in real network data. Indeed, many real-life networks exhibit a cumulative advantage where "the rich get richer". The preferential attachment model captures this advantage by adding a degree-dependent term to the connection probabilities, but it does not include any group structure. This thesis contributes to a previous result on the misclassification proportion in stochastic block models by adding a degree dependence to the connection probability which changes over time. This allows models different from the stochastic block model to be used. In this thesis we propose a model which combines the stochastic block model with the preferential attachment model to allow the modeling of hubs in networks. For this model we prove that, under some conditions, the average edge degrees are comparable to those in the stochastic block model. We achieve this by bounding the expected degree in the time dynamic model in terms of the stochastic block model, which requires techniques such as martingale concentration inequalities and probabilistic approximations.

First we define the stochastic block model in Section 2. In Section 3 we summarize previous results which use the stochastic block model as a basis. After that we introduce the preferential attachment model in Section 4, which we use to construct our own time dynamic stochastic block model in Section 5.

Contents

1 Summary

2 Networks and the Stochastic Block Model

3 A Two-step Community Detection Algorithm
  3.1 First Step: Spectral Clustering Methods
  3.2 Second Step: Refined Group Labelling
  3.3 Theoretical Properties

4 Preferential Attachment

5 A Time Dynamic Stochastic Block Model
  5.1 Main Result

6 Discussion
  6.1 Assumptions of Theorem 5.1 and further work
  6.2 Generalizing the model

7 Theoretical Properties
  7.1 Proof of Theorem 3.1
  7.2 Proof of Lemma 3.3

8 Code

2 Networks and the Stochastic Block Model

In network data analysis [17, 7] the goal is to infer structures in the generating process of observed networks. One such structure is a community structure: a partition of the graph nodes into groups that belong together in a suitable sense. Several authors have proposed methods [6, 8, 15] and developed theoretical understanding of community recovery ([2, 18, 14, 13, 1] among others). To model network data we focus on the stochastic block model. It models simple, unweighted networks whose vertices are partitioned into distinct communities. The probability of an edge between two vertices does not depend on the vertices themselves but on the communities they belong to. Given the labels and connection probabilities, the stochastic block model generates a network by drawing the edges as independent Bernoulli variables.

We start with the basic definitions of a network. A network is a graph with a set of nodes V = {1, 2, ..., n} and a set of edges E ⊆ V × V, which are connections between nodes. In social networks, for example, nodes often represent people while edges represent friendships between two persons. In general we describe networks by an adjacency matrix A = (Aij)_{i,j∈V} where Aij = 1 if edge (i, j) is in E and Aij = 0 otherwise. This matrix contains all the information about the nodes and edges needed to construct the network.

We recall the definition of the stochastic block model proposed by [10]: Let k be the number of communities in the model. The stochastic block model is characterized by a symmetric connectivity matrix B ∈ [0, 1]^{k×k} and a label vector σ ∈ [k]^n, where [k] = {1, 2, ..., k}. The connectivity matrix B contains the connection probabilities between groups and the label vector σ assigns every node to a specific group. An adjacency matrix A ∈ {0, 1}^{n×n} is generated from the stochastic block model by defining

• Auv = Avu ∼ Bern(Bσ(u)σ(v)) for all u < v ∈ [n],

• Auu = 0 for every u ∈ [n],

which gives an adjacency matrix of a simple graph. To gain some heuristic insight into this model, consider two communities. If the within- and between-community connection probabilities are very different it is easy to separate the two communities, while if they are close to each other the groups are very similar and hard to distinguish.
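The generating mechanism above can be sketched in a few lines of Python. The function name `generate_sbm` and its interface are ours, for illustration only, not taken from any library.

```python
import random

def generate_sbm(sigma, B, rng):
    # A[u][v] ~ Bern(B[sigma[u]][sigma[v]]) for u < v, symmetric, zero diagonal
    n = len(sigma)
    A = [[0] * n for _ in range(n)]
    for u in range(n):
        for v in range(u + 1, n):
            edge = 1 if rng.random() < B[sigma[u]][sigma[v]] else 0
            A[u][v] = A[v][u] = edge
    return A

rng = random.Random(0)
sigma = [0] * 10 + [1] * 10                 # two communities of size 10
B = [[0.66, 0.33], [0.33, 0.66]]            # within / between probabilities
A = generate_sbm(sigma, B, rng)
```

The adjacency matrix returned is symmetric with zero diagonal, matching the two defining bullets.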

Given an adjacency matrix A we can draw the corresponding network. An example of such a network based on the stochastic block model is shown in Figure 1.

Figure 1: Randomly generated graph with n = 20 nodes, k = 2 communities and Bii = 0.66, Bij = 0.33 for all i ≠ j. Nodes 1 to 10 belong to the green group while nodes 11 to 20 belong to the red one.

3 A Two-step Community Detection Algorithm

Before coming to the time dynamic stochastic block model we first review the method for community detection in stochastic block models proposed by Gao et al. [5] and Zhang and Zhou [19]. We first look at the algorithm and then study theoretical properties of the method.

3.1 First Step: Spectral Clustering Methods

In the stochastic block model, connection probabilities only depend on the groups to which the nodes belong. Therefore the rank of the connection probability matrix P = (Puv) is at most the number of communities k. By clustering the leading eigenvectors of P we can therefore reduce the data from an n × n matrix to an n × k matrix. This is referred to as spectral clustering [16]. Since Auv is a Bernoulli variable with E[Auv] = Puv, we apply the clustering to the adjacency matrix A, which can be computed from the data.

We discuss two different spectral clustering methods: unnormalized spectral clustering and normalized spectral clustering. Unnormalized spectral clustering (USC) takes the eigenvectors of the adjacency matrix A without any normalization. Normalized spectral clustering (NSC) uses the eigenvectors of the Laplacian L(A) = ([L(A)]uv) with [L(A)]uv = d_u^{-1/2} d_v^{-1/2} Auv, where d_u denotes the degree of node u. When these methods are applied to sparse graphs they do not perform well, since A and L(A) are poor estimators for P and L(P), cf. [12]. In order to obtain better estimators we regularize both spectral clustering methods. To regularize USC we introduce a trimming operator T_τ which replaces the u-th row and u-th column of A by zero whenever d_u > τ. Nodes of high degree are therefore discarded. We define USC(τ) as the regularized USC with threshold τ ∈ [0, ∞). Note that USC(∞) is the same as unregularized USC. For NSC we define the adjusted adjacency matrix A_τ = A + (τ/n) 11^T, where 1 = (1, 1, ..., 1)^T ∈ R^n. Define NSC(τ) as clustering of the eigenvectors of the Laplacian L(A_τ). Note that NSC(0) is the same as unregularized NSC. The following initial community assignment algorithm applies spectral clustering to community detection:
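The trimming operator T_τ is simple enough to sketch directly; the function name `trim` is ours, and the example graph is a made-up star.

```python
# Sketch of the trimming operator T_tau used to regularize USC: the u-th row
# and column of A are replaced by zero whenever the degree d_u exceeds tau.
def trim(A, tau):
    n = len(A)
    degrees = [sum(row) for row in A]
    T = [row[:] for row in A]          # work on a copy of A
    for u in range(n):
        if degrees[u] > tau:
            for v in range(n):
                T[u][v] = 0
                T[v][u] = 0
    return T

# A star with center 0: node 0 has degree 3, the leaves have degree 1.
star = [[0, 1, 1, 1],
        [1, 0, 0, 0],
        [1, 0, 0, 0],
        [1, 0, 0, 0]]
trimmed = trim(star, 2)                # discards the high-degree node 0
```

With threshold 2 the hub is zeroed out; with threshold 3 (or ∞) nothing is trimmed and T_τ(A) = A.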

Algorithm 1 (initial group labelling) ([5], Algorithm 2)

Input: Data matrix Û ∈ R^{n×k} whose columns are the leading eigenvectors of either T_τ(A) or L(A_τ), number of communities k, critical radius r = µ√(k/n) with some constant µ > 0.
Output: Community assignment σ̂.

Set S = [n].
for i = 1 to k do
    Let t_i = argmax_{u∈S} |{v ∈ S : ||Û_{v*} − Û_{u*}|| < r}|;
    Set Ĉ_i = {v ∈ S : ||Û_{v*} − Û_{t_i*}|| < r};
    Label σ̂(u) = i for all u ∈ Ĉ_i;
    Update S ← S \ Ĉ_i.
end
If S ≠ ∅, then for any u ∈ S, set σ̂(u) = argmin_{i∈[k]} || Û_{u*} − (1/|Ĉ_i|) Σ_{v∈Ĉ_i} Û_{v*} ||.
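A pure-Python sketch of Algorithm 1 on a given embedding (the rows of Û). All names are ours, ties in the argmax are broken naively, and the toy embedding below is fabricated to have two well-separated clusters plus one leftover point.

```python
import math

def euclid(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def initial_labels(U, k, r):
    n = len(U)
    S = set(range(n))
    sigma = [0] * n
    centers = []
    for i in range(1, k + 1):
        # node whose ball of radius r captures the most remaining points
        t = max(S, key=lambda u: sum(1 for v in S if euclid(U[v], U[u]) < r))
        C = {v for v in S if euclid(U[v], U[t]) < r}
        for u in C:
            sigma[u] = i
        centers.append([sum(U[v][j] for v in C) / len(C) for j in range(len(U[0]))])
        S -= C
    for u in S:  # leftover points go to the nearest cluster mean
        sigma[u] = 1 + min(range(k), key=lambda i: euclid(U[u], centers[i]))
    return sigma

U = [[0.0, 0.0], [0.05, 0.0], [0.0, 0.05],
     [1.0, 1.0], [1.05, 1.0], [1.0, 1.05],
     [0.5, 0.5]]
labels = initial_labels(U, 2, 0.3)
```

The first three points land in one community, the next three in the other, and the stray point (0.5, 0.5) is assigned to the nearest cluster mean, mirroring the final step of the algorithm.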

3.2 Second Step: Refined Group Labelling

Based on the initial group labelling we present in this section a refinement method that was also proposed in [5]. Denote by σ⁰ the initial group assignment method from Algorithm 1. For every node u, we apply Algorithm 1 to the adjacency matrix with node u and its edges left out. This gives initial label assignments σ⁰_u(v) for all v ≠ u. We define σ̂_u(v) = σ⁰_u(v) for all v ≠ u and choose σ̂_u(u) as the community that has the most neighbouring nodes in common with u. In this way we obtain a label vector σ̂_u for every node u. To make a unified assignment vector these label vectors have to be combined. However, since community labels are only defined up to permutation, the labels in different assignment vectors need not match. We illustrate this with the following example:

         v = 1   v = 2   v = 3   v = 4
  σ̂_1      1       1       2       2
  σ̂_2      1       1       1       2
  σ̂_3      1       2       2       2
  σ̂_4      2       2       1       1

The u-th row corresponds to the label vector σ̂_u, and within row u the nodes that share the label of node u form the community of u. To obtain a unified community assignment we compare each row with σ̂_1. To decide which permutation is most likely we look at which community in σ̂_1 has the most overlap with the community of u in σ̂_u. For u = 1 this gives σ̂(1) = σ̂_1(1) = 1. For u = 2 the community of u has an overlap of two nodes with community 1 and only one node with community 2, so σ̂(2) = 1. For u = 3 the community of u has an overlap of one node with community 1 and two nodes with community 2, so σ̂(3) = 2. For u = 4 the community of u has no overlap with community 1 and an overlap of two nodes with community 2, so σ̂(4) = 2. Thus the final community assignment becomes σ̂ = (1, 1, 2, 2).
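The alignment step of this example can be verified in code; the function name `align` is ours.

```python
# Each row sigma_u is a label vector; node u's community in sigma_u is
# matched to the sigma_1 community with the largest overlap.
def align(label_vectors):
    sigma1 = label_vectors[0]
    k = max(sigma1)
    final = [sigma1[0]]
    for u in range(1, len(label_vectors)):
        sigma_u = label_vectors[u]
        own = {v for v in range(len(sigma_u)) if sigma_u[v] == sigma_u[u]}
        best = max(range(1, k + 1),
                   key=lambda l: len(own & {v for v in range(len(sigma1))
                                            if sigma1[v] == l}))
        final.append(best)
    return final

rows = [[1, 1, 2, 2],   # sigma_1
        [1, 1, 1, 2],   # sigma_2
        [1, 2, 2, 2],   # sigma_3
        [2, 2, 1, 1]]   # sigma_4
result = align(rows)    # → [1, 1, 2, 2], as derived above
```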

Let us now describe the algorithm in more detail. Define A_{−u} ∈ {0, 1}^{(n−1)×(n−1)} as the adjacency matrix A with row u and column u left out. Applying the initial group labelling method to A_{−u} for every u gives initial assignment vectors σ⁰_u = (σ⁰_u(v))_{v∈[n]\{u}}. We define σ̂_u(v) = σ⁰_u(v) for all v ≠ u and set σ̂_u(u) = 0 to make vectors of length n. For equal community sizes the maximum likelihood estimator for σ̂_u(u) given the labels (σ̂_u(v))_{v≠u} [3] is

    argmax_l { Σ_v A_uv 1(σ̂_u(v) = l) }.

When community sizes are different, nodes belonging to a larger community are more common. Therefore the maximum likelihood estimator has to compensate for larger communities by subtracting a penalty term ρ_u Σ_v 1(σ̂_u(v) = l). Here ρ_u depends on the average connection probabilities within and between communities as labelled by σ̂_u. For the assignment of σ̂_u(u) we then get

    σ̂_u(u) = argmax_l { Σ_v A_uv 1(σ̂_u(v) = l) − ρ_u Σ_v 1(σ̂_u(v) = l) }.

In the next step of the algorithm we account for the different permutations in the label vectors to make one unified community assignment σ̂. Define σ̂(1) = σ̂_1(1), and for u = 2, 3, ..., n we set σ̂(u) as the community in σ̂_1 which has the most overlap with community σ̂_u(u) in σ̂_u. This means for u = 2, 3, ..., n we take

    σ̂(u) = argmax_{l∈[k]} |{v : σ̂_1(v) = l} ∩ {v : σ̂_u(v) = σ̂_u(u)}|.

This gives the community assignment σ̂.

Algorithm 2 ([5], Algorithm 1)

Input: A ∈ {0, 1}^{n×n}, number of communities k, initial community detection method σ⁰.
Output: Community assignment σ̂.

Penalized neighbor voting:
for u = 1 to n do
    Apply σ⁰ to A_{−u} to obtain σ⁰_u(v) for all v ≠ u and let σ⁰_u(u) = 0;
    Define C̃^u_i = {v : σ⁰_u(v) = i} for all i ∈ [k]; let Ẽ^u_i be the set of edges within C̃^u_i, and let Ẽ^u_ij be the set of edges between C̃^u_i and C̃^u_j when i ≠ j;
    Define

        B̂^u_ii = |Ẽ^u_i| / ( (1/2) |C̃^u_i| (|C̃^u_i| − 1) ),   B̂^u_ij = |Ẽ^u_ij| / ( |C̃^u_i| |C̃^u_j| ),   for all i ≠ j ∈ [k],

    and let

        â_u = n min_{i∈[k]} B̂^u_ii   and   b̂_u = n max_{i≠j∈[k]} B̂^u_ij;

    Define σ̂_u : [n] → [k] by σ̂_u(v) = σ⁰_u(v) for all v ≠ u and

        σ̂_u(u) = argmax_{l∈[k]} { Σ_{v∈[n]: σ⁰_u(v)=l} A_uv − ρ_u Σ_{v∈[n]} 1{σ⁰_u(v) = l} },

    where for

        t_u = (1/2) log( â_u (1 − b̂_u/n) / ( b̂_u (1 − â_u/n) ) )

    we define

        ρ_u = − (1/(2t_u)) log( ( (â_u/n) e^{−t_u} + 1 − â_u/n ) / ( (b̂_u/n) e^{t_u} + 1 − b̂_u/n ) ).
end
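The penalty computation at the heart of Algorithm 2 can be sketched directly from the two displays above. The function name `penalty` and the numerical values â_u = 6, b̂_u = 2, n = 100 are ours, chosen only to illustrate that ρ_u lands between the between- and within-community probabilities b̂_u/n and â_u/n.

```python
import math

def penalty(a_hat, b_hat, n):
    # t_u = (1/2) log( a_hat (1 - b_hat/n) / ( b_hat (1 - a_hat/n) ) )
    t = 0.5 * math.log(a_hat * (1 - b_hat / n) / (b_hat * (1 - a_hat / n)))
    # rho_u = -(1/(2 t_u)) log( num / den ) with num, den as in the display
    num = (a_hat / n) * math.exp(-t) + 1 - a_hat / n
    den = (b_hat / n) * math.exp(t) + 1 - b_hat / n
    rho = -(1 / (2 * t)) * math.log(num / den)
    return t, rho

t, rho = penalty(6.0, 2.0, 100)
```

For â_u > b̂_u the threshold t_u is positive, and ρ_u acts as a vote threshold separating within- and between-community edge densities.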

Step 2: Define σ̂(1) = σ̂_1(1). For u = 2, ..., n, define

    σ̂(u) = argmax_{l∈[k]} |{v : σ̂_1(v) = l} ∩ {v : σ̂_u(v) = σ̂_u(u)}|.

3.3 Theoretical Properties

To analyse the stochastic block model, we first recall the parameter spaces used in [5]: The space Θ0(n, k, a, b, β) consists of all (B, σ) such that

• σ :[n] → [k] is a label function

• community sizes are comparable:

    |{u ∈ [n] : σ(u) = i}| ∈ [ n/(βk) − 1, βn/k + 1 ]  for all i ∈ [k],

• B = (Bij) ∈ [0, 1]^{k×k} are connection probabilities such that

    B_ii = a/n for all i  and  B_ij = b/n for all i ≠ j.

Here n ∈ N is the number of nodes, k ∈ N the number of communities, a, b ∈ R_+ are constants which may depend on n, and β ≥ 1 is a constant. This parameter space consists of label functions σ with comparable community sizes and constant connection probabilities B_ii = a/n within communities and B_ij = b/n between communities. Assuming constant connection probabilities for all communities is very restrictive, since communities in real-world networks generally vary in size. Therefore [5] also introduces the parameter space Θ(n, k, a, b, λ, β, α), which consists of all (B, σ) such that

• σ :[n] → [k] is a label function

• community sizes are comparable:

    |{u ∈ [n] : σ(u) = i}| ∈ [ n/(βk) − 1, βn/k + 1 ]  for all i ∈ [k],

• B = B^T = (Bij) ∈ [0, 1]^{k×k} such that

    a/n ≤ min_i B_ii ≤ max_i B_ii ≤ αa/n,

    b/(αn) ≤ (1/(k(k − 1))) Σ_{i≠j} B_ij ≤ max_{i≠j} B_ij ≤ b/n,

• λ_k(P) ≥ λ, with P = (Puv) = (B_{σ(u)σ(v)}) and λ_k(P) the k-th singular value of P.

Here B_ii can vary between a/n and αa/n for all i. The connection probabilities B_ij between communities are bounded from above by b/n for all i ≠ j, and the average over all B_ij is bounded from below by b/(αn). To ensure that B_ij ∈ (0, 1) and that B_ij < B_ii for all i ≠ j, we assume that 0 < b/n < a/n < 1 and that α ≥ 1 is a constant. Note that when λ is small enough we have Θ0(n, k, a, b, β) ⊂ Θ(n, k, a, b, λ, β, α).

To quantify the error of a label vector σ̂ compared to the true label vector σ, we recall the Hamming loss

    l(σ̂, σ) = min_{π∈S_k} (1/n) Σ_{u=1}^n 1(π(σ̂(u)) ≠ σ(u)).

The minimum over all permutations in the symmetric group S_k is taken because community labels are interchangeable. This loss function gives the proportion of label assignments that differ between σ̂ and σ.
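The Hamming loss above can be computed by brute force over the k! relabelings; the function name `hamming_loss` is ours, and labels are taken in {1, ..., k}.

```python
from itertools import permutations

def hamming_loss(sigma_hat, sigma, k):
    # minimize the misclassification count over all relabelings pi in S_k
    n = len(sigma)
    best = n
    for perm in permutations(range(1, k + 1)):
        relabel = {i + 1: perm[i] for i in range(k)}
        errors = sum(1 for u in range(n) if relabel[sigma_hat[u]] != sigma[u])
        best = min(best, errors)
    return best / n
```

For example, a globally swapped labelling incurs zero loss, while a single mislabelled node among four gives loss 1/4.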

To prove consistency of the refined group labelling algorithm, Gao et al. [5] provide three main results. The first theorem shows that the initialization step via spectral clustering satisfies a weak consistency property:

Theorem 3.1. ([5], Theorem 3) Assume e ≤ a ≤ C_1 b for some constant C_1 > 0 and

    k a / λ_k² ≤ c,    (1)

for some sufficiently small c ∈ (0, 1), where λ_k = λ_k(P) is the k-th singular value of P. Consider USC(τ) with a sufficiently small constant µ in Algorithm 1 and τ = C_2 d̄ for some sufficiently large constant C_2 > 0, where d̄ = (1/n) Σ_{u∈[n]} d_u is the average degree. For any constant C′ > 0, there exists some C > 0 only depending on C′, C_1, C_2 and µ such that

    l(σ̂, σ) ≤ C a / λ_k²

with probability at least 1 − n^{−C′}.

A similar consistency property also holds for NSC(τ), cf. [5], Theorem 4. The asymptotic minimax risk under the Hamming loss is derived in the following theorem:

Theorem 3.2. ([19], Theorem 1.1) When (a − b)²/(ak log k) → ∞, we have

    inf_σ̂ sup_{(B,σ)∈Θ} E_{B,σ} l(σ̂, σ) = exp( −(1 + η) nI*/2 )      if k = 2,
    inf_σ̂ sup_{(B,σ)∈Θ} E_{B,σ} l(σ̂, σ) = exp( −(1 + η) nI*/(βk) )   if k ≥ 3,

for both Θ = Θ0(n, k, a, b, β) and Θ = Θ(n, k, a, b, λ, β; α), with any λ < (a − b)/(2βk) and any β ∈ [1, √(5/3)), where η = η_n is some sequence tending to 0 as n → ∞ and

    I* = −2 log( √(a/n) √(b/n) + √(1 − a/n) √(1 − b/n) )    (2)

is the Rényi divergence of order 1/2 between Bern(a/n) and Bern(b/n).
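The quantity I* in (2) is easy to evaluate numerically; the function name `renyi_half` is ours.

```python
import math

def renyi_half(a, b, n):
    # Renyi divergence of order 1/2 between Bern(a/n) and Bern(b/n), eq. (2)
    p, q = a / n, b / n
    return -2 * math.log(math.sqrt(p * q) + math.sqrt((1 - p) * (1 - q)))
```

As a sanity check, I* vanishes when a = b (the two communities are indistinguishable) and is strictly positive when a > b.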

The condition β ∈ [1, √(5/3)) restricts the difference in size of the communities. The next result in [5] guarantees the accuracy of the parameter estimation in Algorithm 2, which is required to derive upper bounds for the performance of the refinement scheme:

Lemma 3.3. ([5], Lemma 1) Let Θ = Θ(n, k, a, b, λ, β; α). Suppose that, as n → ∞, (a − b)²/(ak) → ∞ and there exist constants C_0, δ > 0 and a positive sequence γ = γ_n such that

    inf_{(B,σ)∈Θ} min_{u∈[n]} P_{B,σ}{ l(σ, σ⁰_u) ≤ γ } ≥ 1 − C_0 n^{−(1+δ)}    (3)

with γ = o(1/(k log k)) and γ = o((a − b)/(ak)), and σ⁰_u as in Algorithm 2. Then there is a sequence η = η_n → 0 as n → ∞ and a constant C > 0 such that

    min_{u∈[n]} inf_{(B,σ)∈Θ} P{ min_{π∈S_k} max_{i,j∈[k]} |B̂^u_ij − B_{π(i)π(j)}| ≤ η (a − b)/n } ≥ 1 − C n^{−(1+δ)}.    (4)

This means that we can consistently estimate all connection probabilities with high probability. Note that the spectral clustering algorithm USC(τ) satisfies (3) under the assumptions of Theorem 3.1. With this it is possible to show that the refined group labelling algorithm returns an estimator that is asymptotically minimax:

Theorem 3.4. ([5], Theorem 4) Suppose that, as n → ∞, (a − b)²/(ak log k) → ∞, a ≍ b, and condition (3) is satisfied for γ = o(1/(k log k)) and Θ = Θ0(n, k, a, b, β). Then there is a sequence η → 0 such that

    sup_{(B,σ)∈Θ} P_{B,σ}{ l(σ, σ̂) ≥ exp( −(1 − η) nI*/2 ) } → 0,     if k = 2,
    sup_{(B,σ)∈Θ} P_{B,σ}{ l(σ, σ̂) ≥ exp( −(1 − η) nI*/(βk) ) } → 0,  if k ≥ 3,    (5)

where I* is the Rényi divergence of order 1/2 as defined in (2). If in addition (3) is satisfied for γ = o(1/(k log k)) and γ = o((a − b)/(ak)) and Θ = Θ(n, k, a, b, λ, β; α), then the conclusion (5) also holds for Θ = Θ(n, k, a, b, λ, β; α).

4 Preferential Attachment

The preferential attachment model works differently from the stochastic block model: it generates the graph iteratively, node by node. The model includes a degree-dependent component in the connection probability, which is larger when the degree of a node is higher. The motivation behind this model is to reproduce degree distributions with power-law behaviour, as often observed in large real-world networks.

A popular preferential attachment model is the Barabási-Albert (BA) model ([11, p. 173]). The BA model works as follows: Start with an initial graph G^(0) with N_v^(0) vertices and N_e^(0) edges. Then in every iteration t = 1, 2, ... the current graph G^(t−1) is modified to create a new graph G^(t) by adding a new vertex v. Each new vertex v is connected to m < N_v^(0) existing vertices in G^(t−1). The probability that v connects to a given vertex u in G^(t−1) = (V^(t−1), E^(t−1)) is given by

    d_u / Σ_{u′∈V^(t−1)} d_{u′}.

This probability expresses the preference of the new vertex for connecting to existing vertices of higher degree.

Note that this model adds exactly m edges in every step. We generalize the model to remove this restriction: instead of adding nodes with a fixed degree m, we decide for each potential edge whether to include it by a Bernoulli random variable that depends on the degree of the existing vertex. Let v be the vertex added in the v-th iteration of the model. Then for every existing vertex u < v, the edge between u and v is added with probability

    P(A_uv = 1 | d_u(v − 1)) = f(d_u(v − 1)),   u < v,

where f : N → [0, 1] is an increasing function and d_u(v − 1) is the degree of node u after v − 1 iterations. Every edge is thus added independently of the other edges added in the same timestep. After adding edges based on these connection probabilities the next node v + 1 is added, and the process is repeated until the graph has reached a predetermined size n. This model generates hubs, nodes with high degrees, by giving new nodes a higher probability of connecting to them. Note that the preferential attachment model does not have any group structure.
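The generalized attachment step above can be sketched as follows. The function name `grow_pa` is ours, and the particular increasing function f used in the example is only an illustration, not a choice made in the text.

```python
import math
import random

def grow_pa(n, f, rng):
    # Node v arrives in iteration v; each potential edge (u, v), u < v, is
    # included independently with probability f(d_u(v - 1)).
    A = [[0] * n for _ in range(n)]
    deg = [0] * n
    for v in range(1, n):
        for u in range(v):                 # deg[u] equals d_u(v - 1) here
            if rng.random() < f(deg[u]):
                A[u][v] = A[v][u] = 1
                deg[u] += 1
                deg[v] += 1
    return A

f = lambda d: 1 - 0.5 * math.exp(-d)       # an increasing map N -> [1/2, 1)
A = grow_pa(20, f, random.Random(1))
deg = [sum(row) for row in A]
```

Because f is increasing, vertices that happen to gain edges early are more attractive to later arrivals, which is exactly the hub-forming mechanism described above.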

5 A Time Dynamic Stochastic Block Model

In this section we propose a new model that combines the preferential attachment model with the stochastic block model. Let P = (Puv) = (Bσ(u)σ(v)) and {qu}u∈[n] with 0 < qu < 1 for every u ∈ [n]. A graph is iteratively generated by adding a single node v in each step which has connection probabilities

    P(A_uv = 1 | d_u(v − 1) = d_u) = P_uv f(d_u) q_v,   u < v,    (6)

where f(d_u) = 1 − (1 − q_u) e^{−d_u} and d_u(v − 1) is the degree of node u after v − 1 iterations. Note that f is non-decreasing in d_u and has range [q_u, 1) for d_u ≥ 0. This choice of f stimulates the creation of hubs by increasing the connection probabilities for high-degree nodes.
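The claimed properties of f (value q_u at degree zero, monotonicity, limit 1 from below) can be checked numerically; the function name is ours.

```python
import math

def f(d, q):
    # f(d) = 1 - (1 - q) * exp(-d): equals q at d = 0, increases with d,
    # and approaches (but never reaches) 1 as d grows.
    return 1 - (1 - q) * math.exp(-d)
```

With q = 1/2 this is the boost used in the worked example below.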

As an example, assume k = 2 communities and set q_u = 1/2 for all nodes u. Let B_11 = B_22 = 2/3 and B_12 = B_21 = 1/3. The network is then built up as follows:

• Node 1 of group 1 is added, no other nodes are available to connect with.

• Node 2 of group 2 is added.

• Edge (2, 1) is added with probability B_21 · (1 − (1/2)e^{−0}) · (1/2) = 1/12.

• Node 3 of group 1 is added.

• Edge (3, 1) is added with probability

    B_11 · (1 − (1/2)e^{−1}) · (1/2) = (1/3)(1 − 1/(2e)) ≈ 0.27  if edge (2, 1) was added,
    B_11 · (1 − (1/2)e^{−0}) · (1/2) = 1/6 ≈ 0.17  otherwise.

• Edge (3, 2) is added with probability

    B_12 · (1 − (1/2)e^{−1}) · (1/2) = (1/6)(1 − 1/(2e)) ≈ 0.14  if edge (2, 1) was added,
    B_12 · (1 − (1/2)e^{−0}) · (1/2) = 1/12 ≈ 0.08  otherwise.

This shows that newly added nodes connect with higher probability to nodes of high degree and to nodes of the same group. Note that for lower values of q_u the degree becomes a bigger factor in the connection probability, as can be observed in the skewed degree distributions in Figure 3.
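The iterative construction of the example, connection probability (6) with f(d) = 1 − (1 − q_u)e^{−d}, can be sketched as follows; the function name `grow_tdsbm` is ours, and the 0-based even/odd group split only mimics the color coding of Figure 2.

```python
import math
import random

def grow_tdsbm(sigma, B, q, rng):
    # Node v connects to each earlier node u with probability
    # B[sigma[u]][sigma[v]] * f(d_u(v-1)) * q[v], cf. (6).
    n = len(sigma)
    A = [[0] * n for _ in range(n)]
    deg = [0] * n
    for v in range(1, n):
        for u in range(v):                 # deg[u] equals d_u(v - 1) here
            f_u = 1 - (1 - q[u]) * math.exp(-deg[u])
            if rng.random() < B[sigma[u]][sigma[v]] * f_u * q[v]:
                A[u][v] = A[v][u] = 1
                deg[u] += 1
                deg[v] += 1
    return A

sigma = [v % 2 for v in range(20)]         # alternating groups, as in Figure 2
B = [[0.66, 0.33], [0.33, 0.66]]
q = [0.5] * 20
A = grow_tdsbm(sigma, B, q, random.Random(0))
```

Lowering the entries of q (e.g. to 1/100, as in Figure 3) makes the degree factor f dominate and skews the degree distribution.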

Figure 2: Timeline (after 5, 10, 15 and 20 timesteps, panels (a) to (d)) of the time dynamic stochastic block model with k = 2, q_u = q = 1/2 and B_ii = 0.66, B_ij = 0.33 for all i ≠ j. Communities are color coded. The even numbers are in one group and the odd numbers form the second group.

Figure 3: Timeline (after 5, 10, 15 and 20 timesteps, panels (a) to (d)) of the time dynamic stochastic block model with k = 2, q_u = q = 1/100 and B_ii = 0.66, B_ij = 0.33 for all i ≠ j. Communities are color coded. The even numbers are in one group and the odd numbers form the second group.

5.1 Main Result

To preserve consistency of Algorithm 1 for the time dynamic stochastic block model, the following theorem shows that the edge degrees of this model are comparable with the edge degrees in the stochastic block model:

Theorem 5.1. Assume k = 2 and let q_u = 1/2 for all u ∈ [n]. Then

    (1 − O(1/n)) Σ_{v=1}^n P_uv/2 ≤ E[d̂_u] ≤ Σ_{v=1}^n P_uv/2.

Proof: From the assumptions k = 2 and q_u = 1/2 we find that

    P(A_uv = 1 | d_u = d_u(v − 1)) = (P_uv/2) f(d_u).

Since q_u ≤ f(d_u) ≤ 1 we get P(A_uv = 1 | d_u = d_u(v − 1)) ≤ P_uv/2 and P(A_uv = 1 | d_u = d_u(v − 1)) = (P_uv/2) f(d_u) ≥ P_uv/4. Using the upper bound we find that

    E[d̂_u] ≤ Σ_{v=1}^n P_uv/2,    (7)

which gives an upper bound for the expected degree. To find a lower bound we need the following lemma:

Lemma 5.2. (Bernstein inequality for martingales) [9, p. 144]: Let (M_n)_{n≥0} be an (F_n)-martingale such that E[|M_{i+1} − M_i|^k | F_i] ≤ c^k k! for all k ∈ [2, ∞) and some constant c > 0. Then

    P(|M_n| ≥ t) ≤ 2 exp( − t² / ( ce(2cn + t) ) )  for all t ≥ 0.    (8)

Let

    V_k = Σ_{v=1}^k (A_uv − p_uv)

with p_uv = E[A_uv | d_u(v − 1)] = (P_uv/2)(1 − (1/2) e^{−d_u(v−1)}). Then we find

    E[V_k | V_1, ..., V_{k−1}] = V_{k−1} + E[A_uk − p_uk | V_1, ..., V_{k−1}]
                               = V_{k−1} + E[A_uk − p_uk | V_{k−1}]
                               = V_{k−1} + E[A_uk | V_{k−1}] − E[p_uk | V_{k−1}]
                               = V_{k−1} + E[A_uk | d_u(k − 1)] − p_uk
                               = V_{k−1} + p_uk − p_uk = V_{k−1}.

Note that

    E[|V_k|] = E| Σ_{v=1}^k (A_uv − p_uv) | ≤ Σ_{v=1}^k E|A_uv − p_uv| ≤ Σ_{v=1}^k 1 = k < ∞,

so (V_k)_k is a martingale with respect to the filtration F_k = σ(V_1, ..., V_k). Checking the condition of Lemma 5.2 gives

    E[|V_i − V_{i−1}|^k | V_1, ..., V_{i−1}] = E[|A_ui − p_ui|^k | V_1, ..., V_{i−1}]
        = E[|A_ui − p_ui|^k | d_u(i − 1)]
        = |1 − p_ui|^k P(A_ui = 1 | d_u(i − 1)) + |−p_ui|^k P(A_ui = 0 | d_u(i − 1))
        ≤ |1 − p_ui|^k · 1 + |p_ui|^k · 1 ≤ 2.

Thus for c = 1 the condition of the Bernstein inequality for martingales holds, and Lemma 5.2 gives

    P(|V_k| ≥ t) ≤ 2 exp( − t²/(e(2k + t)) )  for all t ≥ 0.

Choose

    t* := 2√(2e) √(log n) √k,

where k ≥ m = K log(n). Then for K > 2e we find

    t* = 2√(2e) √(log n) √k = 2√(2e log n) √k < 2√(K log n) √k ≤ 2√k √k = 2k.

This means

    −(t*)²/(e(2k + t*)) ≤ −(t*)²/(e · 4k) = −(8e k log n)/(4ek) = −2 log n.

Substituting this in the Bernstein inequality for martingales (8), we find for k = m = K log(n)

    P( | Σ_{v=1}^m (A_uv − p_uv) | ≥ t* ) ≤ 2 exp(−2 log n) = 2/n².

Thus we find that

    | Σ_{v=1}^m (A_uv − p_uv) | ≤ 2√(2e) √(log n) √m = 2√(2e) √(log n) √(K log n) = 2√(2eK) log(n),

with probability at least 1 − 2/n².

Let p_min ∈ (0, 1) be such that 0 < p_min ≤ p_uv < 1 for all u, v ∈ [n], u ≠ v, and let 0 < c* ≤ p_min. Then c* ∈ R is a constant such that Σ_{v=1}^m p_uv ≥ Σ_{v=1}^m p_min ≥ K log(n) c*. Since Σ_{v=1}^m p_uv ≥ K log(n) c*, choosing Kc* ≥ 1 + 2√(2eK) gives

    d̂_u(m) = Σ_{v=1}^m A_uv ≥ log n  with probability at least 1 − 2/n².    (9)

For the general case when a node v > m + 1 gets added, we find using (6) and q_u = 1/2 for all u that

    P(A_uv = 1 | d_u(v − 1)) 1(d_u(v − 1) ≥ log n)    (10)
        = (P_uv/2) (1 − (1/2) e^{−d_u(v−1)}) 1(d_u(v − 1) ≥ log n)
        ≥ (P_uv/2) (1 − (1/2) e^{−log(n)}) 1(d_u(v − 1) ≥ log n)
        = (P_uv/2) (1 − 1/(2n)) 1(d_u(v − 1) ≥ log n).    (11)

For the expected degree of a node u we find

    E[d̂_u] = E[Σ_{j=1}^n A_uj] = Σ_{j=1}^n E[A_uj] = Σ_{j=1}^n E[ E[A_uj | d_u(j − 1)] ]
            = Σ_{j=1}^n E[ P(A_uj = 1 | d_u(j − 1)) ]
            ≥ Σ_{j=m}^n E[ P(A_uj = 1 | d_u(j − 1)) 1(d_u(j − 1) ≥ log n) ]
            ≥ Σ_{j=m}^n P(d_u(j − 1) ≥ log n) (P_uj/2)(1 − 1/(2n))        by (10), (11)
            ≥ Σ_{j=m}^n (1 − 2/n²)(P_uj/2)(1 − 1/(2n))                    by (9)
            ≥ (1 − O(1/n)) Σ_{j=m}^n P_uj/2.

Combining this with the upper bound in equation (7), we obtain

    (1 − O(1/n)) Σ_{j=m}^n P_uj/2 ≤ E[d̂_u] ≤ Σ_{j=1}^n P_uj/2.

6 Discussion

In this section we discuss a few important issues related to the time dynamic stochastic block model that could be studied in future research.

6.1 Assumptions of Theorem 5.1 and further work

For Theorem 5.1 on the edge degrees in the time dynamic stochastic block model, we worked under the assumption of two communities. Moreover, we assumed q_u = 1/2. These assumptions are very restrictive and could be relaxed at the expense of more technical proofs. The assertion of Theorem 5.1 shows that the expected edge degrees in the stochastic block model and in our time dynamic stochastic block model agree up to a term of much smaller order. It is therefore plausible that Algorithm 1 and the theory in Gao et al. [5] can be extended with minor modifications to the time dynamic stochastic block model. A formal proof is beyond the scope of this thesis and postponed to further research.

6.2 Generalizing the model

Theorem 5.1 on the edge degrees and simulations show that in the limit there are no hubs in the time dynamic stochastic block model (see Figure 4). This shows that this model does not include enough preferential attachment to generate hubs in large networks. Therefore a model with more influence of preferential attachment would be needed to model large networks with hubs. This could be attained by choosing qu dependent on the node.

Figure 4: Edge degrees after 1000 nodes are added using the time dynamic stochastic block model, using the same values as in Figure 2 for panel (a), q = 1/2, and as in Figure 3 for panel (b), q = 1/100.

7 Theoretical Properties

Below we replicate slight modifications of the proofs of Theorem 3.1 and Lemma 3.3, which are due to [5].

7.1 Proof of Theorem 3.1

To prove Theorem 3.1 we need the following lemmas, which are proven by Gao et al. [5].

Lemma 7.1. ([5], Lemma 5) Consider a symmetric adjacency matrix A ∈ {0, 1}^{n×n} and a symmetric matrix P ∈ [0, 1]^{n×n} satisfying A_uu = 0 for all u ∈ [n] and A_uv ∼ Bern(P_uv) independently for all u > v. For any C′ > 0, there exists some Ĉ > 0 such that

    ||T_τ(A) − P||_op ≤ Ĉ √(n p_max + 1),

with probability at least 1 − n^{−C′}, uniformly over τ ∈ [C_1(n p_max + 1), C_2(n p_max + 1)] for some sufficiently large constants C_1, C_2, where p_max = max_{u≥v} P_uv.

Lemma 7.2. ([5], Lemma 6) For P = (P_uv) = (B_{σ(u)σ(v)}), we have the singular value decomposition P = UΛU^T, where

    U = Z∆^{−1}W,

with ∆ = diag(√n_1, ..., √n_k), Z ∈ {0, 1}^{n×k} a matrix with exactly one nonzero entry in each row, at (i, σ(i)), taking value 1, and W ∈ O(k, k), where O(k_1, k_2) = {V ∈ R^{k_1×k_2} : V^T V = I_{k_2}} for k_1 ≥ k_2.

Proof of Theorem 3.1: By assumption, τ = C_2 d̄ and thus E[τ] ∈ [C′_1 a, C′_2 a] for sufficiently large constants C′_1 and C′_2. With Bernstein's inequality this gives τ ∈ [C*_1 a, C*_2 a] with probability at least 1 − e^{−C′n} for sufficiently large constants C*_1 and C*_2. When (1) holds we have

    λ_k ≥ √k √a / √c ≥ 2Ĉ √a,

where the last inequality holds for sufficiently small c ∈ (0, 1). Lemma 7.1 gives

    ||T_τ(A) − P||_op ≤ Ĉ √(n p_max + 1) ≤ Ĉ √(n (αa/n) + 1) = Ĉ √(αa + 1) ≤ Ĉ √a,    (12)

where Ĉ > 0 (enlarged if necessary). Then

    λ_k(T_τ(A)) ≥ λ_k(P) − ||T_τ(A) − P||_op ≥ 2Ĉ√a − Ĉ√a = Ĉ√a > 0.

Therefore there exists c_1 ∈ (0, 1) such that the k-th singular value of T_τ(A) is bounded from below by c_1 λ_k(P) with probability at least 1 − n^{−C′}. With the Davis-Kahan sin-theta theorem [4] we then obtain

    ||Û − UW_1||_F ≤ C (√k/λ_k) ||T_τ(A) − P||_op,

where W_1 ∈ O(k, k) and C > 0 is a constant. Using Lemma 7.2 we have U = Z∆^{−1}W_2 for some Z as in Lemma 7.2, ∆ = diag(√n_1, ..., √n_k) with n_i the size of community i, and W_2 ∈ O(k, k). This gives UW_1 = Z∆^{−1}W_2W_1 = Z∆^{−1}W_3 where W_3 = W_2W_1 ∈ O(k, k). Let V = Z∆^{−1}W_3. Then we have

    ||Û − V||_F ≤ C (√k/λ_k) ||T_τ(A) − P||_op.    (13)

Substituting (12) into (13) gives

    ||Û − V||_F ≤ C √k √a / λ_k    (14)

with probability at least 1 − n^{−C′}. Let Q = ∆^{−1}W_3 ∈ R^{k×k}. Then V = ZQ. Note that (V_ui)_{i∈[k]} = (Q_{σ(u)i})_{i∈[k]} for all u ∈ [n] by the definition of Z. Therefore it follows that

    ||(Q_{σ(u)i})_{i∈[k]} − (Q_{σ(v)i})_{i∈[k]}|| = ||(V_ui)_{i∈[k]} − (V_vi)_{i∈[k]}||
        = √( 1/n_{σ(u)} + 1/n_{σ(v)} )  if σ(u) ≠ σ(v),  and = 0  if σ(u) = σ(v),

where the terms of W_3 cancel because W_3^T W_3 = Id. Because of the parameter space Θ we have n_{σ(u)} ∈ [n/(βk) − 1, βn/k + 1] for all σ(u) ∈ [k]. Therefore

    1/n_{σ(u)} ≥ 1/(βn/k + 1) ≈ 1/(βn/k) = k/(βn).

Thus if σ(u) ≠ σ(v) we have

    ||(Q_{σ(u)i})_{i∈[k]} − (Q_{σ(v)i})_{i∈[k]}|| ≥ √( 2k/(βn) ).    (15)

Let

    T_i = { u ∈ σ^{−1}(i) : ||(Û_uj)_{j∈[k]} − (Q_ij)_{j∈[k]}|| < r/2 },  i ∈ [k],

be the sets of nodes within half the critical radius r = µ√(k/n) from (Q_ij)_{j∈[k]}. Note that when µ < √(2/β), the sets {T_i}_{i∈[k]} are disjoint because of (15). Then

    ∪_{i∈[k]} T_i = { u ∈ [n] : ||(Û_uj)_{j∈[k]} − (V_uj)_{j∈[k]}|| < r/2 }.

For the number of nodes not included in ∪_{i∈[k]} T_i we find

    (r/2)² |(∪_{i∈[k]} T_i)^c| ≤ Σ_{u∈[n]} ||(Û_uj)_{j∈[k]} − (V_uj)_{j∈[k]}||² ≤ C²ka/λ_k²,

where the last inequality holds because of (14). Substituting the critical radius r = µ√(k/n) gives the upper bound

    |(∪_{i∈[k]} T_i)^c| ≤ 4C²ka/( µ² λ_k² (k/n) ) = 4C²na/( µ² λ_k² ).    (16)

By means of contradiction we prove that

    |T_i| ≥ |σ^{−1}(i)| − |(∪_{i∈[k]} T_i)^c|.    (17)

Suppose |T_i| < |σ^{−1}(i)| − |(∪_{i∈[k]} T_i)^c| for some i ∈ [k]. Because the sets {T_i}_{i∈[k]} are disjoint, we find

    |∪_{i∈[k]} T_i| = Σ_{i∈[k]} |T_i| < n − |(∪_{i∈[k]} T_i)^c| = |∪_{i∈[k]} T_i|.

This is a contradiction, so (17) holds. This means the size of T_i is bounded from below by

    |T_i| ≥ |σ^{−1}(i)| − |(∪_{i∈[k]} T_i)^c| ≥ n/(βk) − 4C²na/(µ²λ_k²) > n/(2βk)  for all i ∈ [k],    (18)

where we use Assumption (1) for the last inequality, with c chosen small enough such that c < µ²/(8C²β).

Thus most data points in {(Û_uj)_{j∈[k]}}_{u∈[n]} are within the critical radius of one of the points {(Q_ij)_{j∈[k]}}_{i∈[k]}. Note that {T_i}_{i∈[k]} as well as {Ĉ_i}_{i∈[k]} in Algorithm 1 are defined via the critical radius r = µ√(k/n). When µ < (1/4)√(2/β), every set T_i can intersect with at most one set Ĉ_j.

We use induction to prove that there exists a permutation π ∈ S_k, with S_k the symmetric group, such that for Ĉ_i as defined in Algorithm 1 we have

    Ĉ_i ∩ T_{π(i)} ≠ ∅  and  |Ĉ_i| ≥ |T_{π(i)}|  for each i ∈ [k].    (19)

For i = 1 we have |Ĉ_1| ≥ max_{i∈[k]} |T_i| because of the way Ĉ_1 is constructed in Algorithm 1. Suppose now that no set T_i overlaps with Ĉ_1. Then

    |(∪_{i∈[k]} T_i)^c| ≥ |Ĉ_1| ≥ max_{i∈[k]} |T_i| ≥ n/(2βk),    (20)

where the last inequality is given in (18). However, from (16) with Assumption (1) we find for c < µ²/(8C²β) that

    |(∪_{i∈[k]} T_i)^c| ≤ 4C²na/(µ²λ_k²) < n/(2βk),

which contradicts (20). Thus there must exist a set T_j that overlaps with Ĉ_1. Write π(1) = j. We find that

    |Ĉ_1^c ∩ T_{π(1)}| = |T_{π(1)}| − |T_{π(1)} ∩ Ĉ_1|
                       ≤ |Ĉ_1| − |T_{π(1)} ∩ Ĉ_1|
                       = |Ĉ_1 ∩ T_{π(1)}^c|
                       ≤ |(∪_{i∈[k]} T_i)^c|,

where we use that T_{π(i)} ∩ Ĉ_1 = ∅ for all i ≠ 1. With (16) we find

    |Ĉ_i^c ∩ T_{π(i)}| ≤ 4C²na/(µ²λ_k²)    (21)

for i = 1. Thus (19) holds for i = 1.

Suppose that (19) and (21) hold for \(i \in [l-1]\) with \(l \le k\). Since we know that each \(\hat C_i\) overlaps with at most one \(T_j\), we find that

\[
\Bigl(\bigcup_{i\in[l-1]} \hat C_i\Bigr) \cap \Bigl(\bigcup_{j\in[k]\setminus\{\pi(i)\}_{i\in[l-1]}} T_j\Bigr) = \emptyset.
\]
After \(l - 1\) steps in Algorithm 1 we therefore have \(\bigcup_{j\in[k]\setminus\{\pi(i)\}_{i\in[l-1]}} T_j \subseteq S\). Since

\(\hat C_l\) is defined in the algorithm as the set of points that are at most the critical radius \(r\) apart, while the sets \(T_j\) are constructed with critical radius \(\frac{r}{2}\), we find that
\[
|\hat C_l| \ge \max_{j\in[k]\setminus\{\pi(i)\}_{i\in[l-1]}} |T_j| \ge \frac{n}{2\beta k}.
\]

Suppose that \(\hat C_l\) overlaps with \(T_{\pi(i)}\) for some \(i \in [l-1]\). Since (15) holds, the sets \(\{T_i\}_{i\in[k]}\) are separated from each other in such a way that \(\hat C_l\) can only overlap with one \(T_i\). Thus \(T_{\pi(i)}\) is the only set from \(\{T_i\}_{i\in[k]}\) that overlaps \(\hat C_l\). Therefore

\[
|\hat C_l| \le |\hat C_l \cap T_{\pi(i)}| + \bigl|(\cup_{i\in[k]} T_i)^c\bigr|. \tag{22}
\]

Since \(\hat C_l\) gets constructed in a later step of Algorithm 1 than \(\hat C_i\), we have \(|\hat C_l \cap T_{\pi(i)}| \le |\hat C_i^c \cap T_{\pi(i)}|\). Substituting this in (22) together with the upper bound (21) and applying (16), we find that

\[
|\hat C_l| \le \frac{8C^2 n a}{\mu^2\lambda_k^2}. \tag{23}
\]
Note that this contradicts the lower bound \(|\hat C_l| \ge \frac{n}{2\beta k}\) found above. Therefore we must have \(\hat C_l \cap T_{\pi(i)} = \emptyset\) for all \(i \in [l-1]\). Suppose now that \(\hat C_l \cap T_{\pi(i)} = \emptyset\) for all \(i \in [k]\). Then we have
\[
\bigl|(\cup_{i\in[k]} T_i)^c\bigr| \ge |\hat C_l| \ge \frac{n}{2\beta k},
\]

which contradicts (16). Therefore \(\hat C_l \cap T_{\pi(l)} \ne \emptyset\) for some \(\pi(l) \in [k]\setminus\cup_{i=1}^{l-1}\{\pi(i)\}\), and (19) holds for \(i = l\). Using the same argument that is used to prove (21) for

\(i = 1\) we have

\[
\begin{aligned}
|\hat C_l^c \cap T_{\pi(l)}| &= |T_{\pi(l)}| - |T_{\pi(l)} \cap \hat C_l| \\
&\le |\hat C_l| - |T_{\pi(l)} \cap \hat C_l| \\
&= |\hat C_l \cap T_{\pi(l)}^c| \\
&\le \bigl|(\cup_{i\in[k]} T_i)^c\bigr|,
\end{aligned}
\]
which proves (21) for \(i = l\), given that (19) and (21) hold for \(i \in [l-1]\). Therefore, by induction, there exists a permutation \(\pi \in S_k\), with \(S_k\) the symmetric group, such that (19) and (21) hold for \(\hat C_i\) as defined in Algorithm 1.

Since we know that \(\hat C_j \cap T_{\pi(j)} \ne \emptyset\), no other set \(T_{\pi(i)}\) with \(i \ne j\) can overlap with \(\hat C_j\). This means \(\bigl(\bigcup_{j\ne i}\hat C_j\bigr) \cap T_{\pi(i)} = \emptyset\), so \(T_{\pi(i)} \subseteq \bigl(\bigcup_{j\ne i}\hat C_j\bigr)^c\). Since \(T_{\pi(i)} \cap \hat C_i^c \subseteq \hat C_i^c\) by default, we get
\[
T_{\pi(i)} \cap \hat C_i^c \subseteq \Bigl(\bigcup_{j\in[k]}\hat C_j\Bigr)^c.
\]

This holds for all \(i \in [k]\), so we find that
\[
\bigcup_{i\in[k]}\bigl(T_{\pi(i)} \cap \hat C_i^c\bigr) \subseteq \Bigl(\bigcup_{j\in[k]}\hat C_j\Bigr)^c.
\]

Because the sets \(\{T_i\}_{i\in[k]}\) do not overlap, we obtain

\[
\sum_{i\in[k]} |T_{\pi(i)} \cap \hat C_i^c| = \Bigl|\bigcup_{i\in[k]}\bigl(T_{\pi(i)} \cap \hat C_i^c\bigr)\Bigr| \le \bigl|(\cup_{j\in[k]}\hat C_j)^c\bigr|. \tag{24}
\]
Note that the sets \(\{\hat C_i\}_{i\in[k]}\) do not overlap either. Therefore, using \(|\hat C_i| \ge |T_{\pi(i)}|\) for each \(i \in [k]\) from (19), we find that

\[
\bigl|(\cup_{i\in[k]}\hat C_i)^c\bigr| = n - \sum_{i\in[k]} |\hat C_i| \le n - \sum_{i\in[k]} |T_i| = \bigl|(\cup_{i\in[k]} T_i)^c\bigr|. \tag{25}
\]

Because of (16) we then have

\[
\bigl|(\cup_{i\in[k]} T_i)^c\bigr| \le \frac{4C^2 n a}{\mu^2\lambda_k^2},
\]
which means

\[
\sum_{i\in[k]} |T_{\pi(i)} \cap \hat C_i^c| \le \frac{4C^2 n a}{\mu^2\lambda_k^2}. \tag{26}
\]

Note that any \(u \in T_{\pi(i)} \cap \hat C_i\) has not been mislabeled under permutation \(\pi\), so \(\hat\sigma(u) = \pi^{-1}(\sigma(u))\). Therefore we can bound the misclassification proportion from above as follows:
\[
\begin{aligned}
l(\hat\sigma, \pi^{-1}(\sigma)) &\le \frac{1}{n}\Bigl|\bigl(\cup_{i\in[k]}(T_{\pi(i)} \cap \hat C_i)\bigr)^c\Bigr| \\
&\le \frac{1}{n}\Bigl(\Bigl|\bigl(\cup_{i\in[k]}(T_{\pi(i)} \cap \hat C_i)\bigr)^c \cap \bigl(\cup_{i\in[k]} T_i\bigr)\Bigr| + \bigl|(\cup_{i\in[k]} T_i)^c\bigr|\Bigr) \\
&\le \frac{1}{n}\Bigl(\sum_{i\in[k]} |T_{\pi(i)} \cap \hat C_i^c| + \bigl|(\cup_{i\in[k]} T_i)^c\bigr|\Bigr).
\end{aligned}
\]

Applying the upper bounds found in (16) and (26) we find that

\[
l(\hat\sigma, \pi^{-1}(\sigma)) \le \frac{8C^2 n a}{\mu^2\lambda_k^2},
\]
which proves Theorem 3.1.
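The greedy critical-radius clustering performed by Algorithm 1 can be illustrated with a toy sketch. The Python code below is a simplified stand-in under illustrative assumptions (two well-separated Gaussian point clouds, radius \(r = 1\), \(k = 2\)); `greedy_radius_clustering` is our own name for this sketch, not code from the thesis. It repeatedly takes the remaining point with the largest \(r\)-neighbourhood, declares that neighbourhood a cluster, removes it, and finally assigns leftover points to the nearest cluster centre.

```python
import numpy as np

def greedy_radius_clustering(X, k, r):
    """Greedy critical-radius clustering in the spirit of Algorithm 1:
    repeatedly take the remaining point with the largest r-neighbourhood,
    declare that neighbourhood a cluster, and remove it."""
    n = X.shape[0]
    remaining = set(range(n))
    labels = -np.ones(n, dtype=int)
    centers = []
    for c in range(k):
        best_nbrs = []
        for u in sorted(remaining):
            nbrs = [v for v in sorted(remaining)
                    if np.linalg.norm(X[u] - X[v]) <= r]
            if len(nbrs) > len(best_nbrs):
                best_nbrs = nbrs
        labels[best_nbrs] = c                 # the neighbourhood becomes cluster c
        centers.append(X[best_nbrs].mean(axis=0))
        remaining -= set(best_nbrs)
    for u in remaining:                       # leftovers go to the nearest centre
        labels[u] = int(np.argmin([np.linalg.norm(X[u] - m) for m in centers]))
    return labels

# two well-separated point clouds; the greedy pass should recover them
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.05, size=(20, 2)),
               rng.normal(5.0, 0.05, size=(20, 2))])
labels = greedy_radius_clustering(X, k=2, r=1.0)
```

On such well-separated data each cloud ends up in a single cluster, mirroring the situation in the proof where the sets \(T_i\) sit far apart.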

7.2 Proof of Lemma 3.3

We follow closely the proof of Lemma 1 in the paper by Gao et al. [5]. Let

\(\Theta = \Theta(n, k, a, b, \lambda, \beta; \alpha)\). For any community assignments \(\sigma_1\) and \(\sigma_2\), recall the Hamming loss
\[
l(\sigma_1, \sigma_2) = \frac{1}{n}\sum_{u=1}^{n} \mathbf{1}(\sigma_1(u) \ne \sigma_2(u)),
\]
which computes the proportion of misclassified labels between \(\sigma_1\) and \(\sigma_2\). In the first step of the proof we fix \((B, \sigma)\) and \(u \in [n]\). We work under the event \(E_u = \{l(\pi_u(\sigma), \sigma_u') \le \gamma\}\), where \(\pi_u\) is a permutation on \([k]\) and \(\sigma_u'\) is the community assignment obtained by applying Algorithm 1 to \(A_{-u}\) and setting \(\sigma_u'(u) = 0\), as in Algorithm 2. For the rest of the proof we assume that \(\pi_u = \mathrm{Id}\) to simplify the equations. By Assumption 3 the event \(E_u\) occurs with high probability. On the event \(E_u\) the maximum number of misclassified nodes is \(\gamma n\). Choose any \(i \in [k]\) and let \(C_{i,\sigma}\) be the set of nodes with community label \(i\). Using the notation \(n_i := |C_{i,\sigma}|\) we find

\[
n_i \ge |\tilde C_i^u \cap C_{i,\sigma}| \ge n_i - \gamma n \quad\text{and}\quad |\tilde C_i^u \cap C_{i,\sigma}^c| \le \gamma n. \tag{27}
\]

Let \(C_i' \subseteq [n]\) be such that (27) holds with \(C_i'\) in place of \(\tilde C_i^u\). At most \(\gamma n\) nodes with true community \(i\) can be mislabelled as community \(j \ne i\), and at most \(\gamma n\) nodes with true community \(j \ne i\) can be wrongly labelled as community \(i\). Therefore

there are at most \(\sum_{l=0}^{\gamma n}\binom{n_i}{l}\sum_{m=0}^{\gamma n}\binom{n-n_i}{m}\) different subsets \(C_i' \subseteq [n]\) with this property. Using Stirling's formula we find
\[
\begin{aligned}
\sum_{l=0}^{\gamma n}\binom{n_i}{l}\sum_{m=0}^{\gamma n}\binom{n-n_i}{m} &\le (\gamma n + 1)^2\Bigl(\frac{e n_i}{\gamma n}\Bigr)^{\gamma n}\Bigl(\frac{e n}{\gamma n}\Bigr)^{\gamma n} \\
&\overset{\frac{n_i}{n}\le 1}{\le} (\gamma n + 1)^2\Bigl(\frac{e}{\gamma}\Bigr)^{\gamma n}\Bigl(\frac{e}{\gamma}\Bigr)^{\gamma n} \\
&= \exp\Bigl(2\log(\gamma n + 1) + 2\gamma n\log\frac{e}{\gamma}\Bigr) \\
&\le \exp\Bigl(C_1\gamma n\log\frac{1}{\gamma}\Bigr),
\end{aligned}
\]

where \(C_1 > 0\) is a constant. Define \(E_i'\) as the set of edges between nodes in \(C_i'\). The nodes in \(C_i'\) can be divided into two groups: nodes with true community \(i\) and nodes with a different true community \(j \ne i\). By equation (27) there are at least \(n_i - \gamma n\) nodes with true community \(i\). By definition of the parameter space \(\Theta\) we know that \(n_i \ge \frac{n}{\beta k} - 1\). Combining both, the proportion of nodes in \(C_i'\) with true community \(i\) is at least

\[
\frac{n_i - \gamma n}{n_i} = 1 - \frac{\gamma n}{n_i} \ge 1 - \frac{\gamma n}{\frac{n}{\beta k} - 1} = 1 - \frac{\gamma n \beta k}{n - \beta k} \overset{n\to\infty}{\approx} 1 - \beta\gamma k.
\]
This means at least a \((1 - \beta\gamma k)^2\) proportion of \(E_i'\) consists of edges between two nodes with true community \(i\). These follow the \(\mathrm{Bern}(B_{ii})\) distribution. Similarly, the proportion of nodes of \(C_i'\) with true community \(j \ne i\) is at most \(\beta\gamma k\). This means that at most a \((\beta\gamma k)^2\) proportion of \(E_i'\) are edges between two nodes of the same community \(j \ne i\). These have distributions that are stochastically smaller than \(\mathrm{Bern}(\frac{\alpha a}{n})\) and stochastically larger than \(\mathrm{Bern}(\frac{a}{n})\), by definition of the parameter space \(\Theta\). There are also edges between nodes of different communities; their proportion is at most \(2(\beta\gamma k)(1 - \beta\gamma k) = 2\beta\gamma k - 2(\beta\gamma k)^2 \le 2\beta\gamma k\). Edges between different communities follow a distribution stochastically smaller than \(\mathrm{Bern}(\frac{b}{n})\). Thus in the worst case scenario there is a proportion of \((1 - \beta\gamma k)^2\) edges with distribution \(\mathrm{Bern}(B_{ii})\), a proportion of \(2\beta\gamma k - 2(\beta\gamma k)^2\) edges with distribution stochastically smaller than \(\mathrm{Bern}(\frac{b}{n})\), and a proportion of \((\beta\gamma k)^2\) edges with distribution stochastically larger than \(\mathrm{Bern}(\frac{a}{n})\).

Combining these observations we obtain

" 0 # 2 2 a |Ei| (1 − βγk) Bii + (βγk) ≤ E 1 0 n 2 |Ci|(|Ci| − 1)   2 2 αa b ≤ max (1 − t) Bii + t + 2t . (28) t∈[0,βγk] n n

The assumption \(\gamma = o(\frac{1}{k\log k})\) in Lemma 3.3 gives \(\beta\gamma k = o(\frac{\beta}{\log k}) = o(1)\). We can further refine the left hand side of the previous inequality by using that
\[
(\beta\gamma k)^2\frac{a}{n} = \beta\gamma k\, B_{ii}\, o(1),
\]
which gives
\[
(1 - \beta\gamma k)^2 B_{ii} + (\beta\gamma k)^2\frac{a}{n} = \bigl(1 - 2\beta\gamma k + \beta\gamma k\, o(1)\bigr) B_{ii} + \beta\gamma k\, B_{ii}\, o(1)
\]

\[
= \bigl(1 - (2 + o(1))\beta\gamma k\bigr) B_{ii}.
\]

For \(t = \beta\gamma k\) we find that the right hand side of (28) results in
\[
\begin{aligned}
(1 - \beta\gamma k)^2 B_{ii} &+ (\beta\gamma k)^2\frac{\alpha a}{n} + 2\beta\gamma k\,\frac{b}{n} \\
&= B_{ii} + \beta\gamma k \cdot o(1)\Bigl(B_{ii} + \frac{\alpha a}{n}\Bigr) + 2\beta\gamma k\Bigl(\frac{b}{n} - B_{ii}\Bigr) \\
&\overset{n\to\infty}{\approx} B_{ii} + 2\beta\gamma k\Bigl(\frac{b}{n} - B_{ii}\Bigr) \le B_{ii},
\end{aligned}
\]

where the last inequality holds since \(\frac{b}{n} < B_{ii}\). For the right hand side of (28), the derivative of \((1-t)^2 B_{ii} + t^2\frac{\alpha a}{n} + 2t\frac{b}{n}\) is
\[
-2(1-t)B_{ii} + 2t\frac{\alpha a}{n} + 2\frac{b}{n} = 2t\Bigl(B_{ii} + \frac{\alpha a}{n}\Bigr) + 2\Bigl(\frac{b}{n} - B_{ii}\Bigr).
\]

Note that \(\frac{b}{n} < \frac{a}{n} \le B_{ii} \le \frac{\alpha a}{n}\), so for \(t = 0\) the derivative is \(2(\frac{b}{n} - B_{ii}) < 0\). Then \((1-t)^2 B_{ii} + t^2\frac{\alpha a}{n} + 2t\frac{b}{n}\) is a quadratic function with negative derivative at \(t = 0\), and its value at \(t = 0\) is greater than its value at \(t = \beta\gamma k\). Therefore the maximum of \((1-t)^2 B_{ii} + t^2\frac{\alpha a}{n} + 2t\frac{b}{n}\) on \([0, \beta\gamma k]\) is attained at \(t = 0\), with value \(B_{ii}\).
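The endpoint argument for the maximum of the quadratic can be checked numerically. In the Python sketch below the parameter values are illustrative choices satisfying \(\frac{b}{n} < \frac{a}{n} \le B_{ii} \le \frac{\alpha a}{n}\):

```python
import numpy as np

# illustrative parameters with b/n < a/n <= B_ii <= alpha*a/n
n, a, b, alpha = 1000, 20.0, 5.0, 2.0
B_ii = 1.5 * a / n                 # within-community connection probability
bgk = 0.1                          # beta * gamma * k, right end of the interval

def g(t):
    return (1 - t)**2 * B_ii + t**2 * alpha * a / n + 2 * t * b / n

ts = np.linspace(0.0, bgk, 1001)
vals = g(ts)                       # derivative at 0 is 2*(b/n - B_ii) < 0
```

The maximum of `vals` sits at the first grid point \(t = 0\), where the value is exactly `B_ii`, in line with the argument above.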

From (28) it thus holds that
\[
\Biggl|\mathbb{E}\Biggl[\frac{|E_i'|}{\frac12|C_i'|(|C_i'|-1)}\Biggr] - B_{ii}\Biggr| \le C\beta\gamma k\,\frac{\alpha a}{n} \le \eta'\,\frac{a-b}{n}. \tag{29}
\]

Because we assume that \(\gamma = o(\frac{a-b}{ak})\) in Lemma 3.3, the last inequality holds for some \(\eta' \le \hat C\beta\alpha\,\frac{\gamma a k}{a-b} = o(1)\) that depends on \(a\), \(k\), \(\alpha\), \(\beta\) and \(\gamma\).

Since each of the edges between nodes in \(C_i'\) is drawn independently, we find that

\[
|E_i'| - \mathbb{E}\bigl[|E_i'|\bigr] = \sum_{j} X_j - \sum_{j}\mathbb{E}[X_j] = \sum_{j}\bigl(X_j - \mathbb{E}[X_j]\bigr),
\]

where the \(X_j\) are Bernoulli distributed random variables, indexed by the unordered node pairs in \(C_i'\). Note that \(X_j \in \{0,1\}\) and

\(\mathbb{E}[X_j] \in [0,1]\), so \(|X_j - \mathbb{E}[X_j]| \le 1\). Using Bernstein's inequality we find

\[
\begin{aligned}
P\bigl(\bigl| |E_i'| - \mathbb{E}|E_i'| \bigr| \ge t\bigr) &= P\Bigl(\Bigl|\sum_j\bigl(X_j - \mathbb{E}[X_j]\bigr)\Bigr| \ge t\Bigr) \\
&\le \exp\Biggl\{-\frac{t^2/2}{\sum_j \mathrm{Var}(X_j) + \frac{t}{3}}\Biggr\} \\
&= \exp\Biggl\{-\frac{t^2}{2\sum_j \mathrm{Var}(X_j) + \frac{2t}{3}}\Biggr\}.
\end{aligned} \tag{30}
\]

Because the \(X_j\) are Bernoulli distributed with parameter \(p_j \le \frac{\alpha a}{n}\) for all node pairs in \(C_i'\), we have
\[
\mathrm{Var}(X_j) = p_j(1 - p_j) \le p_j \le \frac{\alpha a}{n},
\]
so
\[
\sum_j \mathrm{Var}(X_j) \le \sum_j \frac{\alpha a}{n} = \frac{|C_i'|(|C_i'|-1)}{2}\,\frac{\alpha a}{n} \le \frac12(n_i + \gamma n)^2\,\frac{\alpha a}{n}.
\]
Substituting this in (30) gives
\[
P\bigl(\bigl| |E_i'| - \mathbb{E}|E_i'| \bigr| \ge t\bigr) \le \exp\Biggl\{-\frac{t^2}{(n_i + \gamma n)^2\frac{\alpha a}{n} + \frac{2t}{3}}\Biggr\}. \tag{31}
\]
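As a sanity check, the Bernoulli concentration behind (30) and (31) can be simulated. The Python sketch below uses illustrative parameter values (the quantities `m`, `p` and `t` are not taken from the proof) and compares the empirical two-sided tail of a centred sum of Bernoulli variables with the Bernstein bound:

```python
import numpy as np

rng = np.random.default_rng(1)
m, p = 2000, 0.01                  # number of Bernoulli terms and their parameter
reps = 20_000

# sum of m i.i.d. Bern(p) variables, centred: a miniature |E'_i| - E|E'_i|
sums = rng.binomial(m, p, size=reps) - m * p

t = 15.0
emp_tail = np.mean(np.abs(sums) >= t)                  # empirical tail probability
var_sum = m * p * (1 - p)                              # sum of the variances
bernstein = np.exp(-t**2 / (2 * var_sum + 2 * t / 3))  # right hand side of (30)
```

With these values the empirical tail stays well below the Bernstein bound, as the inequality predicts.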

Choose \((t^*)^2\) as
\[
(t^*)^2 = \Bigl((n_i + \gamma n)^2\frac{\alpha a}{n}\bigl(C_1\gamma n\log\gamma^{-1} + (3+\delta)\log n\bigr)\Bigr) \vee \bigl(2C_1\gamma n\log\gamma^{-1} + 2(3+\delta)\log n\bigr)^2, \tag{32}
\]
where \(\vee\) denotes the maximum of the expressions on the left and right hand side. Define
\[
c := C_1\gamma n\log\gamma^{-1} + (3+\delta)\log n.
\]
Then
\[
(t^*)^2 = \Bigl((n_i + \gamma n)^2\frac{\alpha a}{n}\,c\Bigr) \vee 4c^2.
\]

We can distinguish two situations: either (I) \((n_i + \gamma n)^2\frac{\alpha a}{n} \le 4c\) or (II) \((n_i + \gamma n)^2\frac{\alpha a}{n} > 4c\). In case (I) we find
\[
(t^*)^2 = (n_i + \gamma n)^2\frac{\alpha a}{n}\bigl(C_1\gamma n\log\gamma^{-1} + (3+\delta)\log n\bigr) \lesssim \frac{n^2}{k^2}\,a\gamma\log\gamma^{-1}.
\]

Here we use the fact that \(\frac{\log x}{x}\) is monotone decreasing in \(x\), which means \(\gamma\log\gamma^{-1} \ge \frac{1}{n}\log n\) for any \(\gamma \ge \frac{1}{n}\). Note that in the situation where \(\gamma < \frac{1}{n}\), \(\gamma\) approaches 0 as \(n \to \infty\), which means the initialization already predicted the true community labels. Therefore we only look at the situation where \(\gamma \ge \frac{1}{n}\), which means \(\gamma n\log\gamma^{-1} \gtrsim \log n\). Recall that \(n_i \le \frac{\beta n}{k} + 1\) because of the parameter space \(\Theta\). With the assumption in Lemma 3.3 that \(\gamma = o(\frac{1}{k\log k})\) we find \((n_i + \gamma n)^2 \lesssim \frac{n^2}{k^2}\).
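The monotonicity step can be verified directly: since \(\frac{\log x}{x}\) decreases for \(x \ge e\), substituting \(x = \gamma^{-1} \le n\) gives \(\gamma\log\gamma^{-1} \ge \frac{\log n}{n}\). A quick numeric check in Python (the values of \(n\) and \(\gamma\) are arbitrary illustrations):

```python
import numpy as np

n = 10_000
rhs = np.log(n) / n                        # (1/n) log n
gammas = [2 / n, 1e-3, 1e-2, 0.1]          # any gamma with 1/n <= gamma < 1
lhs = [g * np.log(1 / g) for g in gammas]  # gamma * log(1/gamma)
```

Every value in `lhs` is at least `rhs`, which is exactly the comparison the proof uses to absorb the \(\log n\) term.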

In case (II) we see that

\[
(t^*)^2 = 4\bigl(C_1\gamma n\log\gamma^{-1} + (3+\delta)\log n\bigr)^2 \lesssim \bigl(\gamma n\log\gamma^{-1}\bigr)^2,
\]

where we again use that \(\gamma n\log\gamma^{-1} \gtrsim \log n\) for any \(\gamma \ge \frac{1}{n}\).

Combining (I) and (II) gives
\[
(t^*)^2 = \Bigl((n_i + \gamma n)^2\frac{\alpha a}{n}\,c\Bigr) \vee 4c^2 \lesssim \frac{n^2}{k^2}\,a\gamma\log\gamma^{-1} + \bigl(\gamma n\log\gamma^{-1}\bigr)^2,
\]
so
\[
t^* \lesssim \frac{n}{k}\sqrt{a\gamma\log\gamma^{-1}} + \gamma n\log\gamma^{-1}.
\]

Substituting \(t^* = \sqrt{(t^*)^2}\) from (32) in Bernstein's inequality (31) gives, in case (I) \((n_i + \gamma n)^2\frac{\alpha a}{n} \le 4(C_1\gamma n\log\gamma^{-1} + (3+\delta)\log n)\), that

\[
\begin{aligned}
&P\Bigl(\bigl| |E_i'| - \mathbb{E}|E_i'| \bigr| \ge C_{\alpha,\beta,\delta}\Bigl(\frac{n}{k}\sqrt{a\gamma\log\gamma^{-1}} + \gamma n\log\gamma^{-1}\Bigr)\Bigr) \\
&\le \exp\Biggl\{-\frac{(n_i + \gamma n)^2\frac{\alpha a}{n}\bigl(C_1\gamma n\log\gamma^{-1} + (3+\delta)\log n\bigr)}{(n_i + \gamma n)^2\frac{\alpha a}{n} + \frac{2}{3}\sqrt{(n_i + \gamma n)^2\frac{\alpha a}{n}\bigl(C_1\gamma n\log\gamma^{-1} + (3+\delta)\log n\bigr)}}\Biggr\} \\
&\le \exp\Biggl\{-\frac{(n_i + \gamma n)^2\frac{\alpha a}{n}\bigl(C_1\gamma n\log\gamma^{-1} + (3+\delta)\log n\bigr)}{4\bigl(C_1\gamma n\log\gamma^{-1} + (3+\delta)\log n\bigr) + \frac{2}{3}\sqrt{4\bigl(C_1\gamma n\log\gamma^{-1} + (3+\delta)\log n\bigr)^2}}\Biggr\} \\
&\lesssim \exp\Bigl\{-(n_i + \gamma n)^2\frac{\alpha a}{n}\Bigr\},
\end{aligned}
\]

where \(C_{\alpha,\beta,\delta}\) is a constant depending on \(\alpha\), \(\beta\) and \(\delta\). Substituting \(t^*\) in (31) gives, in case (II) \((n_i + \gamma n)^2\frac{\alpha a}{n} > 4(C_1\gamma n\log\gamma^{-1} + (3+\delta)\log n)\), that

\[
\begin{aligned}
&P\Bigl(\bigl| |E_i'| - \mathbb{E}|E_i'| \bigr| \ge C_{\alpha,\beta,\delta}\Bigl(\frac{n}{k}\sqrt{a\gamma\log\gamma^{-1}} + \gamma n\log\gamma^{-1}\Bigr)\Bigr) \\
&\le \exp\Biggl\{-\frac{\bigl(2C_1\gamma n\log\gamma^{-1} + 2(3+\delta)\log n\bigr)^2}{(n_i + \gamma n)^2\frac{\alpha a}{n} + \frac{2}{3}\bigl(2C_1\gamma n\log\gamma^{-1} + 2(3+\delta)\log n\bigr)}\Biggr\} \\
&\le \exp\Biggl\{-\frac{\bigl(2C_1\gamma n\log\gamma^{-1} + 2(3+\delta)\log n\bigr)^2}{4\bigl(C_1\gamma n\log\gamma^{-1} + (3+\delta)\log n\bigr) + \frac{2}{3}\bigl(2C_1\gamma n\log\gamma^{-1} + 2(3+\delta)\log n\bigr)}\Biggr\} \\
&\lesssim \exp\bigl\{-C_1\gamma n\log\gamma^{-1} - (3+\delta)\log n\bigr\} \\
&= \exp\bigl\{-C_1\gamma n\log\gamma^{-1}\bigr\}\, n^{-(3+\delta)}.
\end{aligned}
\]

Thus we get

\[
P\Bigl(\bigl| |E_i'| - \mathbb{E}|E_i'| \bigr| \ge C_{\alpha,\beta,\delta}\Bigl(\frac{n}{k}\sqrt{a\gamma\log\gamma^{-1}} + \gamma n\log\gamma^{-1}\Bigr)\Bigr) \lesssim \exp\bigl\{-C_1\gamma n\log\gamma^{-1}\bigr\}\, n^{-(3+\delta)}.
\]

This means with probability at least \(1 - \exp\{-C_1\gamma n\log\gamma^{-1}\}\, n^{-(3+\delta)}\) we have
\[
\Biggl|\frac{|E_i'|}{\frac12|C_i'|(|C_i'|-1)} - \frac{\mathbb{E}|E_i'|}{\frac12|C_i'|(|C_i'|-1)}\Biggr| \le \frac{C_{\alpha,\beta,\delta}\bigl(\frac{n}{k}\sqrt{a\gamma\log\gamma^{-1}} + \gamma n\log\gamma^{-1}\bigr)}{\frac12|C_i'|(|C_i'|-1)} \cong \hat C_{\alpha,\beta,\delta}\Bigl(\frac{k}{n}\sqrt{a\gamma\log\gamma^{-1}} + \frac{k^2\gamma\log\gamma^{-1}}{n}\Bigr),
\]
where we use that \(|C_i'| \cong \frac{n}{k}\). Note that \(\sqrt{ak} \ll a - b\) under the assumption that \(\frac{(a-b)^2}{ak} \to \infty\), and that \(k\gamma\log\gamma^{-1} = O(1)\) gives \(k^2\gamma\log\gamma^{-1} \lesssim k \ll \frac{(a-b)^2}{a} \le a - b\). So
\[
\hat C_{\alpha,\beta,\delta}\Bigl(\frac{k}{n}\sqrt{a\gamma\log\gamma^{-1}} + \frac{k^2\gamma\log\gamma^{-1}}{n}\Bigr) \le \eta''\,\frac{a-b}{n},
\]
where \(\eta'' \to 0\) as \(n \to \infty\) is a constant that depends on \(a\), \(k\), \(\alpha\), \(\beta\), \(\gamma\) and \(\delta\). Therefore

\[
\Biggl|\frac{|E_i'|}{\frac12|C_i'|(|C_i'|-1)} - \frac{\mathbb{E}|E_i'|}{\frac12|C_i'|(|C_i'|-1)}\Biggr| \le \eta''\,\frac{a-b}{n}. \tag{33}
\]

Combining (29) and (33) we find with probability at least \(1 - \exp\{-C_1\gamma n\log\gamma^{-1}\}\, n^{-(3+\delta)}\) that

\[
\Biggl|\frac{|\tilde E_i'|}{\frac12|\tilde C_i'|(|\tilde C_i'|-1)} - B_{ii}\Biggr| \le \eta\,\frac{a-b}{n}, \tag{34}
\]

where \(\eta \to 0\) as \(n \to \infty\) depends on \(a\), \(k\), \(\alpha\), \(\beta\), \(\gamma\) and \(\delta\). Note that \(\frac{|\tilde E_i'|}{\frac12|\tilde C_i'|(|\tilde C_i'|-1)} = \hat B_{ii}\) in Algorithm 2. The proof for \(\hat B_{ij}\) in place of \(\hat B_{ii}\) follows the same structure.
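The concentration of the plug-in estimate \(\hat B_{ii}\) can be illustrated by a small Monte Carlo experiment. The Python sketch below (community size and \(B_{ii}\) are illustrative choices; this is not the thesis's Algorithm 2) counts the edges inside one community and forms the edge-density estimate:

```python
import numpy as np

rng = np.random.default_rng(2)
n_i, B_ii = 400, 0.05               # community size and within-community probability

# adjacency among the n_i community members: upper triangle i.i.d. Bern(B_ii)
upper = np.triu(rng.binomial(1, B_ii, size=(n_i, n_i)), 1)
A = upper + upper.T                 # symmetric adjacency with empty diagonal

edges = A.sum() // 2                            # |E'_i|, number of edges
B_hat = edges / (n_i * (n_i - 1) / 2)           # edge density, the estimate of B_ii
```

With \(\binom{400}{2} = 79800\) independent Bernoulli trials, the standard deviation of `B_hat` is roughly \(8 \cdot 10^{-4}\), so the estimate lands very close to the true \(B_{ii} = 0.05\).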

Thus with probability at least \(1 - \exp\{-C_1\gamma n\log\gamma^{-1}\}\, n^{-(3+\delta)}\) we have

\[
\bigl|\hat B_{ij} - B_{ij}\bigr| \le \eta\,\frac{a-b}{n} \quad \text{for all } i, j \in [k].
\]

Recall Boole's inequality (also known as the union bound): for a countable set of events \(A_1, A_2, \ldots\) it holds that
\[
P\Bigl(\bigcup_i A_i\Bigr) \le \sum_i P(A_i).
\]
This gives
\[
\begin{aligned}
P\Bigl(\max_{i,j\in[k]} |\hat B_{ij} - B_{ij}| > \eta\,\frac{a-b}{n}\Bigr) &= P\Bigl(\exists\, i, j \in [k]: |\hat B_{ij} - B_{ij}| > \eta\,\frac{a-b}{n}\Bigr) \\
&\le \sum_{i,j\in[k]} P\Bigl(|\hat B_{ij} - B_{ij}| > \eta\,\frac{a-b}{n}\Bigr) \\
&\le n(n-1)\exp\{-C_1\gamma n\log\gamma^{-1}\}\, n^{-(3+\delta)} \\
&\le \exp\{-C_1\gamma n\log\gamma^{-1}\}\, n^{-(1+\delta)}.
\end{aligned}
\]
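The union bound step can also be mimicked empirically: for any sample, the indicator of the maximum exceeding a threshold is pointwise at most the sum of the individual indicators. A Python illustration with independent standardized deviations (the numbers `m`, `reps` and `thr` are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
m, reps, thr = 9, 50_000, 2.5          # e.g. m = k*k estimators, one threshold
Z = rng.normal(size=(reps, m))         # independent standardized deviations

p_max = np.mean(np.abs(Z).max(axis=1) >= thr)    # P(max_i |Z_i| >= thr)
p_each = np.mean(np.abs(Z) >= thr, axis=0)       # P(|Z_i| >= thr), per estimator
```

By construction `p_max` never exceeds `p_each.sum()`: this is exactly Boole's inequality applied to the empirical measure.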

Therefore
\[
P\Bigl(\max_{i,j\in[k]} |\hat B_{ij} - B_{ij}| \le \eta\,\frac{a-b}{n}\Bigr) \ge 1 - \exp\{-C_1\gamma n\log\gamma^{-1}\}\, n^{-(1+\delta)},
\]
which is independent of the choice of \(u\) and \((B, \sigma)\). This proves Claim (4) in Lemma 3.3.

8 Code

The following R code generates a graph according to the stochastic block model for Figure 1:

library(igraph)
set.seed(2016)
N = 20                                   # number of nodes
k = 2                                    # number of communities
sigma = c(rep(1, N/k), rep(2, N/k))      # set the community of every node
# sigma = c(rep(c(1, 2), N/k))
B = matrix(data = c(0.66, 0.33, 0.33, 0.66), k)  # community connection matrix
A = matrix(0, N, N)                      # initialize adjacency matrix
for (i in 1:(N - 1)) {                   # for every node
  for (j in (i + 1):N) {                 # check all later nodes
    A[i, j] = rbinom(1, 1, B[sigma[i], sigma[j]])  # add an edge with the
    A[j, i] = A[i, j]                    # connection probability between the
  }                                      # groups and mirror the edge
}
G = graph_from_adjacency_matrix(A, mode = "undirected")  # graph from adjacency matrix
V(G)$color <- ifelse(sigma[V(G)] == 1, "green", "red")   # set group color for every node
plot(G)                                  # plot the graph
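For readers who prefer Python, the same stochastic block model generator can be sketched with numpy alone (variable names mirror the R code above; the plotting step is omitted):

```python
import numpy as np

rng = np.random.default_rng(2016)
N, k = 20, 2                              # number of nodes and communities
sigma = np.repeat([0, 1], N // k)         # community label of every node
B = np.array([[0.66, 0.33],
              [0.33, 0.66]])              # community connection matrix

A = np.zeros((N, N), dtype=int)           # adjacency matrix
for i in range(N - 1):
    for j in range(i + 1, N):             # every unordered pair once
        A[i, j] = rng.binomial(1, B[sigma[i], sigma[j]])
        A[j, i] = A[i, j]                 # mirror the edge
```

The result is a symmetric 0/1 adjacency matrix with an empty diagonal, exactly as in the R version.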

The following R code generates a graph or a degree histogram according to the time dynamic stochastic block model:

library(igraph)                          # load graph package
set.seed(20171234)                       # fix seed for the same picture
N = 5                                    # number of nodes
k = 2                                    # number of communities
# Q = c(rep(1, N))
Q = c(rep(1/2, N))                       # set q_u to 1/2 for every node
# sigma = c(rep(1, N/k), rep(2, N/k + N %% 2))
sigma = c(rep(c(1, 2), N/k), rep(1, N %% 2))  # set the community of every node
B = matrix(data = c(0.66, 0.33, 0.33, 0.66), k)  # community connection matrix
A = matrix(0, N, N)                      # initialize adjacency matrix
C_old = integer(N)                       # initialize old degree vector
C_new = integer(N)                       # initialize new degree vector
f <- function(d, q) {
  1 - (1 - q) * exp(-d)                  # define our function f as in (4)
}
for (i in 2:N) {                         # for every node added
  for (j in 1:(i - 1)) {                 # check all previous nodes
    P = 1/2 * B[sigma[i], sigma[j]] * f(C_old[j], Q[j])  # probability of an edge
    A[i, j] = rbinom(1, 1, P)            # with probability P add an edge between i and j
    A[j, i] = A[i, j]                    # mirror the edge
    C_new[i] = C_new[i] + A[i, j]        # increase the edge degree if an edge
    C_new[j] = C_new[j] + A[i, j]        # gets added to node i or node j
  }
  C_old = C_new                          # copy the new edge degrees for the next node
}
G = graph_from_adjacency_matrix(A, mode = "undirected")  # graph from adjacency matrix
V(G)$color <- ifelse(sigma[V(G)] == 1, "green", "red")   # set group color for every node
plot(G)                                  # plot the graph
hist(C_new, freq = FALSE, breaks = seq(-0.5, 8.5, 1),    # histogram of the edge degrees
     main = paste("Edge degrees after", N, "nodes added"),
     xlab = "degree")
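The time dynamic generator translates to Python in the same way (again a numpy sketch mirroring the R code; `f` is the degree-dependent function from (4) and every \(q_u\) is set to \(1/2\)):

```python
import numpy as np

rng = np.random.default_rng(20171234)
N, k = 50, 2                              # number of nodes and communities
Q = np.full(N, 0.5)                       # q_u = 1/2 for every node
sigma = np.tile([0, 1], N // k)           # alternating community labels
B = np.array([[0.66, 0.33],
              [0.33, 0.66]])              # community connection matrix

def f(d, q):
    return 1 - (1 - q) * np.exp(-d)       # degree-dependent factor as in (4)

A = np.zeros((N, N), dtype=int)           # adjacency matrix
deg = np.zeros(N, dtype=int)              # degrees visible to the arriving node
for i in range(1, N):                     # nodes arrive one at a time
    new_deg = deg.copy()
    for j in range(i):                    # connect to every previous node
        p = 0.5 * B[sigma[i], sigma[j]] * f(deg[j], Q[j])
        A[i, j] = rng.binomial(1, p)
        A[j, i] = A[i, j]                 # mirror the edge
        new_deg[i] += A[i, j]
        new_deg[j] += A[i, j]
    deg = new_deg                         # update degrees for the next arrival
```

After the loop the running degree vector `deg` agrees with the row sums of the adjacency matrix, which is a convenient consistency check on the bookkeeping.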

References

[1] E. Abbe, A. S. Bandeira, and G. Hall. Exact Recovery in the Stochastic Block Model. ArXiv e-prints, May 2014.

[2] Peter J. Bickel and Aiyou Chen. A nonparametric view of network models and Newman–Girvan and other modularities. Proceedings of the National Academy of Sciences, 106(50):21068–21073, 2009.

[3] Yudong Chen and Jiaming Xu. Statistical-computational tradeoffs in planted problems and submatrix localization with a growing number of clusters and submatrices. J. Mach. Learn. Res., 17(1):882–938, January 2016.

[4] Chandler Davis and W. M. Kahan. The rotation of eigenvectors by a per- turbation. iii. SIAM Journal on Numerical Analysis, 7(1):1–46, 1970.

[5] C. Gao, Z. Ma, A. Y. Zhang, and H. H. Zhou. Achieving Optimal Misclassification Proportion in Stochastic Block Model. ArXiv e-prints, May 2015.

[6] M. Girvan and M. E. J. Newman. Community structure in social and biological networks. Proceedings of the National Academy of Sciences, 99(12):7821–7826, 2002.

[7] Anna Goldenberg, Alice X. Zheng, Stephen E. Fienberg, and Edoardo M. Airoldi. A survey of statistical network models. Foundations and Trends in Machine Learning, 2(2):129–233, 2010.

[8] Mark S. Handcock, Adrian E. Raftery, and Jeremy M. Tantrum. Model-based clustering for social networks, 2007.

[9] Marc Hoffmann. Adaptive estimation in diffusion processes. Stochastic Processes and their Applications, 79(1):135–163, 1999.

[10] Paul W. Holland, Kathryn Blackmond Laskey, and Samuel Leinhardt. Stochastic blockmodels: First steps. Social Networks, 5(2):109–137, 1983.

[11] Eric D. Kolaczyk. Statistical Analysis of Network Data: Methods and Models. Springer Publishing Company, Incorporated, 1st edition, 2009.

[12] C. M. Le, E. Levina, and R. Vershynin. Sparse random graphs: regularization and concentration of the Laplacian. ArXiv e-prints, February 2015.

[13] Laurent Massoulié. Community detection thresholds and the weak Ramanujan property. In Proceedings of the Forty-sixth Annual ACM Symposium on Theory of Computing, STOC '14, pages 694–703, New York, NY, USA, 2014. ACM.

[14] E. Mossel, J. Neeman, and A. Sly. Stochastic Block Models and Reconstruc- tion. ArXiv e-prints, February 2012.

[15] M. E. J. Newman and E. A. Leicht. Mixture models and exploratory analysis in networks. Proceedings of the National Academy of Sciences, 104:9564–9569, June 2007.

[16] Ulrike von Luxburg. A tutorial on spectral clustering. CoRR, abs/0711.0189, 2007.

[17] Stanley Wasserman and Katherine Faust. analysis: Methods and applications, volume 8. Cambridge university press, 1994.

[18] Yunpeng Zhao, Elizaveta Levina, and Ji Zhu. Consistency of community detection in networks under degree-corrected stochastic block models. The Annals of Statistics, 40(4):2266–2292, 2012.

[19] Harrison H. Zhou and Anderson Y. Zhang. Minimax rates of community detection in stochastic block models. The Annals of Statistics, 44(5):2252– 2280, 2016.
