Hierarchical block structures and high-resolution model selection in large networks

Tiago P. Peixoto∗ Institut für Theoretische Physik, Universität Bremen, Hochschulring 18, D-28359 Bremen, Germany

Discovering and characterizing the large-scale topological features in empirical networks are crucial steps in understanding how complex systems function. However, most existing methods used to obtain the modular structure of networks suffer from serious problems, such as being oblivious to the statistical evidence supporting the discovered patterns, which results in the inability to separate actual structure from noise. In addition to this, one also observes a resolution limit on the size of communities, where smaller but well-defined clusters are not detectable when the network becomes large. This phenomenon occurs not only for the very popular approach of optimization, which lacks built-in statistical validation, but also for more principled methods based on statistical inference and model selection, which do incorporate statistical validation in a formally correct way. Here we construct a nested generative model that, through a complete description of the entire network hierarchy at multiple scales, is capable of avoiding this limitation, and enables the detection of modular structure at levels far beyond those possible with current approaches. Even with this increased resolution, the method is based on the principle of parsimony, and is capable of separating signal from noise, and thus will not lead to the identification of spurious modules even on sparse networks. Furthermore, it fully generalizes other approaches in that it is not restricted to purely assortative mixing patterns, directed or undirected graphs, and ad hoc hierarchical structures such as binary trees. Despite its general character, the approach is tractable, and can be combined with advanced techniques of community detection to yield an efficient algorithm that scales well for very large networks.

I. INTRODUCTION not take into account the statistical evidence associated with this deviation, and as a result it is incapable of The detection of communities and other large-scale separating actual structure from those arising simply of structures in networks has become perhaps one of the statistical fluctuations of the null model, and it even finds largest undertakings in network science [1, 2]. It is high-scoring partitions in fully random graphs [13]. This motivated by the desire to be able to characterize the problem is not specific to modularity and is a character- most salient features in large biological [3–5], technolog- istic shared by the vast majority of methods proposed ical [6, 7] and social systems [3, 8, 9], such that their for solving the same task [2]. In addition to the lack of building blocks become evident, potentially giving valu- statistical validation, modularity maximization fails to able insight into the central aspects governing their func- detect clusters with size below a given threshold [14,√ 15], tion and evolution. At its simplest level, the problem which increases with the size of the system as ∼ E, seems straightforward: Modules are groups of nodes in where E is the number of edges in the entire network. the network that have a similar connectivity pattern, This limitation is independent of how salient these rela- often assumed to be assortative, i.e., connected mostly tively smaller structures are, and makes this potentially among themselves and less so with the rest of the net- very important information completely inaccessible. Fur- work. However, when attempting to formalize this no- thermore, results obtained with this method tend to be tion, and develop methods to detect such structures, the degenerate for large empirical networks [16], for which combined effort of many researchers in recent years has many different partitions can be found with modular- spawned a great variety of competing approaches to the ity values very close to the global maximum. In these problem, with no clear, universally accepted outcome [2]. common situations, the method fails in giving a faithful The method that has perhaps gathered the most representation of the actual large-scale structure present widespread use is called modularity optimization [10] and in the system. consists in maximizing a quality function that favors par- titions of nodes for which the fraction of internal edges More recently, increasing effort has been spent on a inside each cluster is larger than expected given a null different approach based on the statistical inference of model, taken to be a random graph. This method is rel- generative models, which encodea the modular structure atively easy to use and comprehend, works well in many of the network as model parameters [17–28]. This ap- accessible examples, and is capable of being applied in proach offers several advantages over a dominating frac- arXiv:1310.4377v6 [.data-an] 25 Mar 2014 very large systems via efficient heuristics [11, 12]. How- tion of existing methods, since it is more firmly grounded ever it also suffers from serious drawbacks. In particular, on well-known principles and methods of statistical anal- despite measuring a deviation from a null model, it does ysis, which allows the incorporation of the statistical ev- idence present in the data in a formally correct man- ner. Under this general framework, one could hope to overcome some of the limitations existing in more ad hoc ∗ [email protected] methods, or at least make any intrinsic limitations eas- 2 ier to understand in light of more robust concepts [29– we demonstrate the quality of the approach. We then 32]. The generative model most used for this purpose is conclude with an overall discussion. the stochastic block model [17–28, 33–36], which groups nodes in blocks with arbitrary probabilities of connec- tions between them. This very simple definition already II. HIERARCHICAL MODEL does away with the restriction of considering only purely assortative communities, and accommodates many dif- The original stochastic block model ensemble [33–36] ferent patterns, such as core-periphery structures and bi- is composed of N nodes, divided into B blocks, with e partite blocks, as well as straightforward generalizations rs edges between nodes of blocks r and s (or, for conve- to directed graphs. In this context, the detectability of nience of notation, twice that number if r = s). Here, well-defined clusters amounts, in large part, to the issue we differentiate between two very similar model variants: of model selection based on principled criteria such as 1. the edge counts e are themselves the parameters of minimum description length (MDL) [32, 37] or Bayesian rs the model; 2. the parameters are the probabilities p model selection (BMS) [38–42]. These approaches allow rs that an edge exists between two nodes of the respective the selection of the most appropriate number of blocks blocks, such that the edge counts he i = n n p are based on statistical evidence, and thus avoid the detec- rs r s rs constrained on average. Both are equally valid gener- tion of spurious communities. However, frustratingly, at ative models, and as long as the edge counts are suf- least one of the limitations of modularity maximization ficiently large, they are fully equivalent (see Ref. [50] is also present when doing model selection, namely, the and Appendix A). Here, we stick with the first variant, resolution limit mentioned above. As was recently shown since it makes the following formulation more convenient. in Ref. [32], when using MDL, the maximum number √ We also consider a further variation called the degree- N N of detectable blocks scales with , where is the corrected block model [24], which is defined exactly as the number of nodes in the network, which is very similar traditional model(s) above, but one additionally specifies to the modularity optimization limit. However, in this the degree sequence {ki} of the graph as an additional set context, this limitation arises out of the lack of knowl- of parameters (again, these values can be the parameters edge about the type of modular structure one is about themselves, or they can be constrained on average [50]). a priori to infer, and the assumption that all possibil- The degree-corrected version, although it is a relatively ities should occur with the same probability. Here we simple modification, yields much more convincing results develop a more refined method of model selection, which on many empirical networks, since it is capable of incor- consists in a nested hierarchy of stochastic block mod- porating degree variability inside each block [24]. As will els, where an upper level of the hierarchy serves as prior be seen below, it is, in general, also capable of providing information to a lower level. This dramatically changes a more compact description of arbitrary networks than the resolution of the model selection√ procedure, and re- the traditional version. N places the characteristic block size of in the nonhier- The nested version, which we define, here is based on archical model by much a smaller value that scales only the simple fact that the edge counts e themselves form N rs logarithmically with , enabling the detection of much a block multigraph, where the nodes are the blocks, and smaller blocks in very large networks. Furthermore, the the edge counts are the edge multiplicities between each approach provides a description of the network in many node pair (with self-loops allowed). This multigraph may scales, in a complete model encapsulating its entire hier- also be constructed via a generative model of its own. If archical structure at once. It generalizes previous meth- we choose a stochastic block model again as a genera- ods of hierarchical community detection [43–49], in that tive model, we obtain another smaller block multigraph it does not impose specific patterns such as dendograms as parameters at a higher level, and so on recursively, or binary trees, in addition to allowing arbitrary mod- until we finally reach a model with only one block. This ular structures as the usual stochastic block model, in- forms a nested stochastic block model hierarchy, which stead of purely assortative ones. Furthermore, despite describes a given network at several resolution levels (see its increased resolution, the approach attempts to find Fig. 1). the simplest possible model that fits the data, and is not This approach provides an increased resolution when subject to overfitting, and, hence, will not detect spuri- performing model selection, since the generative model ous modules in random networks. Finally, the method is inferred at an upper level serves as prior information to fully nonparametric, and can be implemented efficiently, the one at a lower level. Despite its more elaborate for- with a simple algorithm that scales well for very large mulation, this hierarchical model remains tractable, and networks. it is possible to apply it to very large networks, in a In Sec. II, we start with the definition of the model fully nonparametric manner, as discussed below. Fur- and then we discuss the model-selection procedure based thermore, it generalizes cleanly the flat variants, which on MDL. We then move to the analysis of the resolution correspond simply to a hierarchy with only one level. It limit, and proceed to define an efficient algorithm for also does not impose any preferred mixing pattern (e.g., the inference of the nested model, and we finalize with assortative or dissortative block structures), and is not the analysis of synthetic and empirical networks, where restricted to any specific hierarchical form, such as bi- 3

the edge and node counts associated with the block par- B nodes 2 tition {bi}, and Ω({ers}, {nr}) is the number of different E edges network realizations. Hence, maximizing the likelihood is identical to minimizing the ensemble entropy [50, 52] = 3 l S({ers}, {nr}) = ln Ω({ers}, {nr}). etdmodel Nested For the lowest level of the hierarchy (which models B1 nodes directly the observed network), we have a simple graph, E edges for which the entropies can be computed as [50]

= 2   l 1 X ers S = n n H , (1) t 2 r s b n n rs r s B0 nodes E edges for the traditional block model ensemble and,

= 1   l X 1 X ers

bevdnetwork Observed Sc ' −E − Nk ln k! − ers ln , (2) 2 eres k rs N nodes E P edges for the degree-corrected variant, where E = rs ers/2 is the total number of edges, Nk is the total number of = 0 l P nodes with degree k, er = s ers is the number of half- edges incident on block r, Hb(x) = −x ln x−(1−x) ln(1− x) is the binary entropy function, and it was assumed that nr  1. Note that only the last term of Eq. 2 is, in fact, useful when finding the best block partition, since Figure 1. Example of a nested stochastic block model with the other terms remain constant. However the full ex- three levels, and a generated network at the bottom. The pression is necessary when comparing the models against top-level structure describes a core-periphery network, which each other via model selection, as discussed below. is further subdivided in the lower levels. For the upper-level multigraphs the entropy can also be computed [50], and it takes a different form

1    nr  X nr ns X (( 2 )) nary trees or dendograms . In the following, we describe Sm = ln + ln , (3) ers err /2 the maximum likelihood method to infer the multilevel r>s r partitions, and the model selection process based on the n  n+m 1 minimum description length principle, and compare it where m = m− is the number of m-combinations with Bayesian model selection. with repetitions from a set of size n. Note that we no In the analysis, we focus on undirected networks, but longer assume that nr  1, since at the upper levels the everything is straightforwardly applicable to directed number of nodes becomes arbitrarily small. networks as well. In Appendix C we present a summary At each level l ∈ [0,L] in the hierarchy there are Bl 1 − of the relevant expressions for the directed case. nodes, which are divided into Bl blocks (with Bl ≤ Bl 1), − were we set B 1 ≡ N. The edge counts at level l are de- l − l noted ers, and the block sizes nr. Therefore, we must A. Module inference P l P l have that nr = Bl 1 and ers/2 = E; i.e., the to- r − rs tal number of nodes decreases in the upper levels, but the The inference approach consists in finding the best par- total number of edges remains the same. The combined tition {bi} of the nodes, where bi ∈ [1,B] is the block entropy of all layers is then given by membership of node i, in the observed network G, such P(G|{b }) L that the posterior likelihood i is maximized. 0 0 X l l Sn = S ({e }, {n }) + Sm({e }, {n }). (4) Since each graph with the same edge counts ers occurs t/c rs r rs r with the same probability, the posterior likelihood is sim- l=1 P(G|{b }) = 1/Ω({e }, {n }) e n ply i rs r , where rs and r are The full generative model corresponds to a nested se- quence of network ensembles, where each sample from a given level generates another ensemble at a lower level.

1 The entropy in Eq. 4 represents the amount of informa- This specification generalizes other hierarchical constructions in tion necessary to encode the decision sequence, which, a straightforward manner. For instance, the generative model of Refs. [43, 44] can be recovered as a special case by forcing a starting from the topmost model, selects the observed binary tree hierarchy, terminating at the individual nodes, and a network among all possible branches in the lower levels. strictly assortative modular structure. A similar argument holds Whenever both the number of levels and the number for the variant of Ref. [51] as well. of blocks Bl of each level is known, the best multilevel 4 partition is the one that minimizes Sn. However, such However, the value of Eq. 7 can be much smaller for information regarding the size of the model is most often nonuniform partitions. This choice has important conse- not available, and needs to be inferred from the data as quences for the resolution of relatively small blocks, as well. Using Eq. 4 for this purpose is not appropriate, will be seen below. since minimizing it across all possible hierarchies leads For the degree-corrected version, we still need to in- to a trivial and meaningless result where Bl = N for all clude the information necessary to describe the degrees l. Instead, one must employ some form of Occam’s razor at the lowest level, and select the simplest possible model that best describes X r the observed data without increasing its complexity. We Lc = Lt + nrH({pk}), (9) present such an approach in the next section. r r where {pk} is the of nodes belong- r 2 B. Model selection ing to block . It is worth noting that, if a network is sampled from the traditional block model ensemble, r so that p is a Poisson with average er/nr, Eq. 9 be- A method that directly formalizes Occam’s razor prin- k P P comes Lc = Lt +2E− r er ln er/nr + k Nk ln k!, which ciple is known as minimum description length [53, 54], means that S + L = S + L , i.e. the total description total c c t t where one specifies the amount of information nec- length is identical for both the traditional and degree- essary to described the data, which includes not only the corrected models in this case, and, therefore, both mod- sample but the model parameters as well. The descrip- els describe the same network equally well3. However, if tion length for the model above is r the distributions {pk} deviate from Poissons, the degree- corrected variant will provide, in general, a shorter de- Σ = L + S , (5) t/c t/c scription length. It is easy to see that if one has a flat L = 1 hierar- where Lt/c is the amount of information necessary to fully chy, with {Bl} = {B, 1}, the description length of the describe the model, and St/c corresponds to entropy of the lowest level l = 0 of the hierarchy. In a given level nonhierarchical model is recovered [32]; e.g., for the tra- Σ = L + S , l of the hierarchy, the information required to describe ditional model, we have L=1 L=1 t with l B the model parameters {ers} is given by the entropy Sm (( ))  B  X L = ln 2 + ln + ln N! − ln n !, (Eq. 3) of the model in level l + 1, so that we may write L=1 E N r (10) r L X l l l 1 where the only difference in comparison to Ref. [32] L = S ({e }, {n }) + L − . (6) t m rs r t is that here we are using the improved partition de- l=1 scription length of Eq. 7. Therefore, the nested gener- The only missing information is how to partition the alization fully encapsulates the flat version, such that min Σ ≤ min Σ nodes of the current level into Bl blocks, which corre- L=1; i.e., the nested model can provide l only a shorter or equal description length of the observed sponds to the term Lt in the equation above. The total l network. number of partitions with the same block sizes {nr} is Q l The MDL principle predicates that whenever the hi- given by Bl 1!/ r nr!, and the total number of different −   erarchy itself needs to be inferred, one should minimize block sizes is Bl . Hence, the amount of informa- Bl−1 Eq. 5, instead of Eq. 4 directly. However, MDL is one of tion necessary to describe the block partition of level l the many principled methods one could use to do model is selection, which include, e.g., Bayesian model selection   via integrated likelihood [21, 38, 39, 41, 42, 55], like- l Bl X l Lt = ln + ln Bl 1! − ln nr!. (7) Bl−1 − lihood ratios [56] or more approximative methods such r as Bayesian information criterion (BIC) [57] and Akaike information criterion (AIC) [58]. If any two of these Note that this is different from the choice made in Bl−1 Refs. [32, 37], which considered all possible Bl parti- tions to be equally likely, and, hence, computed the neces- sary amount of information as Bl 1 ln Bl. This choice im- 2 − Note that in Ref. [32] the degree sequence entropy was taken to P r plicitly assumes that all blocks have approximately equal be NH({pk}), with pk = r nrpk/N, which implicitly assumed sizes, and offers a worse description when this is not the that the degrees are uncorrelated with the block partitions, and case. Note that for Bl 1  1, we have hence should be interpreted only as an upper bound to the actual − description length given by Eq. 9. 3 l l Note that MDL can still be used to select the simpler model in Lt ' Bl 1H({nr/Bl 1}), (8) − − this case: Although the complete description length Σ will be P asymptotically the same with both models for networks sampled where H({pi}) = − i pi ln pi is the entropy of the distri- from the traditional block model, we still have that Lt < Lc, l bution {pi}. Therefore, for uniform blocks nr = Bl 1/Bl since the degree-corrected version still needs to include the in- l − formation on the degree sequence, as in Eq. 9. we recover asymptotically the value Lt ' Bl 1 ln Bl. − 5 methods are derived from equivalent assumptions, one N/B  1, it can be shown that the detectable phase ex- should expect them to deliver compatible results. In ists for hki > [(B−1)/(cB−1)]2 [29–31]. Let us make the Appendix A we make a comparison of the MDL ap- situation even simpler and consider the strongest possible proach with Bayesian model selection via integrated like- block structure with c = 1, i.e. B perfectly isolated as- lihood (BMS), since it is nonapproximative and can be sortative communities with N/B nodes. In this case the computed exactly for the stochastic block model, where detectability threshold lies at hki = 1. Therefore, for any we show that under compatible assumptions, these two hki > 1, we should be able to detect all B blocks, with a methods deliver the exact same results. In the follow- precision increasing with hki, if we know we have B blocks ing, we compare the results obtained with nonhierarchical to begin with. If we do not know this, we must apply a MDL/BMS and the nested model presented, and show model-selection criterion as described above to obtain the that it yields a higher quality model selection criterion, best value of B. For simplicity, let us assume that, for the which detects the correct number of blocks for sparse net- correct value of B ≡ Btrue the true partition is perfectly works, without being overconfident. Based on this anal- detected, such that St ' −E ln B, ignoring additive con- ysis we are capable of deriving the optimum number of stants, which are irrelevant at this point. If a value of blocks given a network size, and we show that the nested B > Btrue is used, we assume that the inferred parti- model does not suffer from the resolution limit, which tion corresponds to regular subdivisions of the planted hinders the nonhierarchical approach. one, such that the entropy remains approximately un- changed St ' −E ln Btrue. For B < Btrue, the blocks are uniformly merged together, so that St ' −E ln B. 1. Module detectability and the “resolution limit” Hence, we may write the expected value of the minimum description length in the nonhierarchical model by sum- ming St = −E ln min(B,Btrue) with Eq. 10. For the The general problem of module detectability can be nested version of the model, we assume a regular hierar- formulated as follows: Suppose we generate a network chy tree of depth L and with a fixed branching ratio σ, with a given parameter set. To what extent can we L l i.e. B = σ − , so that Eq. 5 becomes recover the planted parameters by observing this sin- l gle sample from the model? The answer is conditional σ  B σ on the amount of one’s prior knowledge. If the number Σ ' ln E + B ln B + N ln B 2 σ − 1 2 of blocks B is known beforehand, the remaining task is simply to classify the nodes in one of these B classes. − E ln min(B,Btrue), (11) This problem has been shown to exhibit a detectability- where B  σ was assumed, together with L  1, and indetectability phase transition [29, 30, 59, 60]: If the l B ≡ B . One may compare these criteria against each existing block structure is too weak, it becomes impos- 0 other in their capacity of recovering the planted value sible to infer the correct partition with any method, de- of B, by finding the extremum of each function. In spite the fact that the model parameters deviate from Fig. 2, we show the optimum values of B for a model that of a fully random graph. On the other hand, if the with N = 104 and B = 100, as well as the results for block structure is sufficiently strong, it is possible to de- true the direct minimization of the corresponding exact quan- tect the correct partition with a precision that increases tities for actual network realizations, and a comparison of as the block structure becomes stronger. Another situa- the obtained partitions using the normalized mutual in- tion is when one does not know the correct number B, formation (NMI)4. We also include the comparison with which is arguably more relevant in practice. In this case, a dense BMS criterion (see Appendix A) both in its full in addition to the node classification, one needs to per- form (Eqs. A5 and A8), and with the partition likelihood form model selection. Ideally, one would like to find the term omitted, i.e. P({b }|B) = 1 in Eq. A5, as was done correct B value whenever the corresponding partition is i in Refs. [23, 40]. We see that the dense BMS criterion detectable. However, in situations where the correct par- fails to detect the correct model size for sparser networks, tition is only partially detectable, i.e., the inferred par- which is in accordance with its inadequacy in this region. tition is positively but weakly correlated with the true The hierarchical model provides, as expected, the best model, an application of Occam’s razor may actually results, and detects the correct model for the sparsest choose a simpler model, with smaller B, with a compara- networks. The incomplete BMS criterion is clearly over- ble correlation with the true partition. Hence, if we lack confident for sparse networks, and detects B > 1 struc- knowledge of the model size B, there will be situations tures even when the model lies below the detectability where the true partition will be more poorly detected, when compared to the case where we have this informa- tion. This can be clearly illustrated with a very simple example known as the planted partition (PP) model [61]. 4 The normalized mutual information (NMI) is de- It corresponds to an assortative block structure given by fined as 2I({bi}, {ci})/[H({bi}) + H({ci})], where P ers = 2E[δrsc/B +(1−δrs)(1−c)/B(B −1)], nr = N/B, I({bi}, {ci}) = pbc(r, s) ln (pbc(r, s)/pb(r)pc(r)), and P rs and c ∈ [0, 1] is a free parameter that controls the as- H({xi}) = − r px(r) ln px(r), where {bi} and {ci} are two sortativity strength. For this model, if we have that partitions of the network. 6

100 becomes much easier for dense networks. A prominent problem in the detectability of block 80

B structures via other methods, such as modularity opti- 60 mization [10] is when modules are merged together, re- 40 MDL (nested) gardless of how strong the community structure is per- MDL (flat) Inferred 20 BMS (incomp.) ceived to be. For the modularity-based approach, when BMS (dense) 0 considering a maximally modular network, similar to the 1 0 1 2 PP model with c → 1, but with the additional restriction 10− 10 10 10 k that the graph remains connected, it has been shown√ [14] h i that modules are merged together as long as B > E. 1.0 This phenomenon is considered counterintuitive, and has 0.8 been called the “resolution limit” of community detec- 7 0.6 tion via this method . As it happens, this problem does not only occur for modularity-based methods, but also NMI 0.4 MDL (nested) MDL (flat) if one does statistical inference based on MDL. For the 0.2 BMS (incomp.) BMS (dense) nonhierarchical model, it can be shown that, according 0.0 to this criterion,√ the optimal number of blocks scales 1 0 1 2 10− 10 10 10 as B∗ ' µ(hki) N, where µ(x) is an increasing func- k tion [32]. Therefore if the planted number exceeds this h i threshold, blocks will be merged together, despite the Figure 2. Model selection results for a PP model with fact that the block structure is detectable with arbitrary 4 N = 10 , Btrue = 100 and fully isolated blocks (c = 1), precision if one knows the correct value of B, and it suf- using the model selection criteria described in the text. The ficiently exceeds the detectability threshold hki > 1 of top panel shows the inferred value of B versus the average the PP model. This means that the true parameters of degree hki in the network. The solid lines show the theo- the model can no longer be used to compress the gener- retical value according to each criterion, and the data points ated data. This is a direct result of the assumption that are direct optimization of the corresponding quantities for all possible block structures of a given size are equally actual generated network, averaged over 40 independent re- possible, and the number of such models becomes very alizations. The bottom panel shows the normalized mutual large, with a model description length scaling roughly information (NMI) between the inferred and planted parti- 2 tions. The dashed line marks the threshold hki = 1 where with ∼ B ln E + N ln B. In the presence of additional inference becomes impossible for N → ∞. assumptions about the model, such as the fact that one is dealing with the PP model, instead of a more general block structure, this can, in principle, be improved. How- threshold hki = 1; hence, this shows that the partition ever, in most practical situations such assumptions can- likelihood should not simply be discarded5. Both MDL not be made. One main advantage of the nested model and dense BMS fail to detect anything for hki < 2, which is that this limit can be overcome without requiring such corresponds to a strong threshold6, which interestingly prior knowledge. The description length via the nested lies above the strict detectability limit at hki = 1. This model for the maximally modular network above is given corresponds to a region where detectability is possible, by Eq. 11 with Btrue = B. As can be seen, this equation but only if the true value of B is known (or if a more has only log-linear dependencies on the model size B, in- refined model-selection criterion exists). Note that the stead of the quadratic one present in the flat MDL. The incomplete BMS criterion performs better in the region result of this is that, if one finds the value of B∗, which 1 < hki < 2, but this is perhaps better interpreted as a minimizes the nested description length, one obtains the by-product of its overall overconfidence for very sparse scaling networks. Note that all criteria eventually agree on the N correct value if hki is made sufficiently large, which corre- B∗ ∝ , (12) ln N sponds to the intuitive notion that the detection problem for sufficiently large N. This is a significant improve- ment, since the maximum number of detectable blocks grows almost linearly with the number of nodes. Thus, 5 √ The fact that the NMI between the true and inferred partitions a characteristic detectable block size N/B∗ ∼ N is re- remains slightly above zero in Fig. 2 for hki < 1 with the incom- placed by a much smaller value N/B∗ ∼ ln N, which al- plete BMS criterion is a finite size effect, as it tends increasingly to zero as N → ∞. On the other hand, according to this crite- lows for a precise assessment of small communities even rion, the inferred value of B in this region increases as N becomes in very large networks. larger. 6 This threshold corresponds simply the point where it becomes impossible to fully encode the block partition in the network structure, i.e. for uniform blocks −E ln B + N ln B = 0, which 7 This limit cannot be significantly changed even if one introduces leads to E = N and hence hki = 2. scale parameters to the definition of modularity [15, 62]. 7

4 4 10 k = 10 10 3 h i N = 10 k = 20 N = 104 Detectable E N N/B h i Detectable and , but also on the average block size of the (a) k = 100 (b) 5 h i N = 10 N = 106 remaining network, as can be seen in Fig. 3(a) and (b). 103 103 As the number of blocks in the remaining network ap-

c c √ e e proaches the maximum detectable value, B∗ ∼ N, the 102 102 Detectable with nested model Detectable with nested model more difficult it becomes to resolve the smaller blocks. Undetectable Undetectable The detectable region recedes further with increasing hki, 101 101 100 101 102 103 104 100 101 102 103 104 105 and also with the number of nodes in the remaining net- N/B N/B √ work as ec∗ ∼ N. Hence, the denser or larger the re- (c) (d) maining network, the harder it becomes to detect the smaller blocks with the flat variant of the model. In Fig. 3 are also shown the values of er∗ for which modular- ity also fails to separate the blocks (if one considers that Remaining Hierarchy they are connected to themselves and to the rest of the network by single edges8), which are overall compatible with the flat MDL criterion. The situation changes sig- Merge candidates Remaining network Merge candidates B blocks nificantly with the nested model. To consider the merge, we assume an optimal block hierarchy which splits at the Figure 3. Parameter region where two isolated blocks with top into two branches, the left one containing the two ec/2 internal edges and nc = ec/5 internal nodes are de- smaller blocks, and the right one containing the remain- tectable as separate blocks [shown schematically in panel (c)], ing network and its arbitrary hierarchical structure [see as a function of the average block size N/B, and depending Fig. 3(d)]. To consider the merge, we need to compute 5 on (a) the average degree hki with N = 10 and (b) the num- the description length only at the lowest level l = 0, since ber of nodes N with hki = 20. The dashed curves show the the rest remains unchanged after the merge. By comput- boundaries for the nonhierarchical block model, and the solid ing the difference via Eq. 5, after some manipulations we lines for the hierarchical variant. The line segments on the  3  obtain ∆Σnested = ∆St + ln nc − ln + ln(B + 1) − right-hand side of the plots√ show the detectability threshold ec for modularity [14], ec∗ = 2E. The points marked with stars ln(B+N−1)−ln(B1+B+2), with B = B0. Note that this (?) correspond to the maximum value of B that is detectable expression is independent of E, and, hence, the density in the remaining network with the nonhierarchical model, and of the remaining network cannot influence the merging the dotted line shows the same quantity for various hki val- decision. Since B1 ≤ B, and assuming B  1, we obtain ues (i.e. the region on the left of this curve corresponds to ∆Σnested ' ∆St + ln nc − ln[(ec + 2)(ec + 1)] − ln(B + N), an overfitting of the remaining network, according to the non- N B hierarchical criterion). (d) The hierarchical construction used and, hence, the dependence on either or is again to decide if the two isolated blocks are merged together with only logarithmic, ec∗ ≈ [ln(B + N) − ln nc]/ ln 2, as shown the nested model. in Fig. 3(b). With this example one can notice that the nested model is capable of compartmentalizing the network at the upper levels, such that the lower-level It is possible to understand in more detail the origin of branches can become almost independent of each other. the improvement by considering a related problem, which This means that, in many practical situations, one can is the detection of specific blocks that are much smaller sufficiently overcome the resolution limit, without aban- than the remaining network. Another facet of the res- doning a global model that describes the whole network olution limit manifests itself when two such blocks are at once. merged together, despite the fact that if they are con- In the following section we specify an efficient algo- sidered in isolation they would be kept separate. Here, rithm to infer the parameters of the nested block model in we consider this problem by using a slightly modified arbitrary networks, and we test its efficacy in uncovering scenario than the one proposed in Ref. [14], which is a the multilevel structure of synthetic as well as empirical network composed of two fully isolated blocks, each with networks. ec/2 internal edges and nc nodes, and a remaining net- work with N nodes, E edges, average degree hki = 2E/N and an arbitrary topology [see Fig. 3(c)]. We may de- III. INFERENCE ALGORITHM cide if these blocks are merged together by consider- ing the difference in the description length. The en- Individually, any specific level l of the hierarchical tropy difference for the merge is simply ∆St = ec ln 2 structure is a regular block model, and, hence, the clas- 2 (where we assume ec  nc , but the dense case can sification of the Bl 1 nodes of this level into Bl blocks be computed as well, with no significant difference in − the result). For the flat block model we have ∆Lflat = LL=1(E + ec,N + 2nc,B − 1, {nr} ∪ {2nc}) − LL=1(E + 8 Note that in the model-selection context, adding a single edge ec,N + 2nc,B, {nr} ∪ {nc, nc}), computed using Eq. 10. between the blocks is not a necessary condition for the observa- For this case, the point at which the merge happens, tion of the resolution limit, and has a negligible effect, differently from the modularity approach, where it is a deciding factor. ∆Lflat + ∆St = 0, will depend not only on the values of 8 can be done via well-established methods, such as the and keeping track of whether or not each level is “done” Monte Carlo method [32, 40], simulated annealing, or or “not done.” One marks initially all levels as not done belief propagation [29, 30, 56]. Here, we use the method and starts at the top level l = L. For the current level l, described in Ref. [63], which is an agglomerative heuris- if it is marked done it is skipped and one moves to the tic that provides high-quality results, while being unbi- level l − 1. Otherwise, all three moves are attempted. If ased with respect to the types of block structure that are any of the moves succeeds in decreasing the description inferred, and is also very efficient, with an algorithmic length Σ, one marks the levels l − 1 and l + 1 (if they ex- complexity of O(N ln2 N), independent of the number ist) as not done, the level l as done, and one proceeds (if of blocks B. If one knows the depth L of the hierar- possible) to the upper level l + 1, and repeats the proce- chy, and all {Bl} values, the multilevel partitions can be dure. If no improvement is possible, the level l is marked obtained by starting from the lowest level l = 0, and as done and one proceeds to the lower level l − 1. If the progressing upwards to l = L. However, this cannot be lowest level l = 0 is reached and cannot be improved, done when the number and sizes of the hierarchical levels the algorithm ends. Note that, in order to keep the de- are unknown. Although it is relatively simple to heuristi- scription length complete, we must impose that BL = 1 cally impose such patterns as binary trees or dendograms, throughout the above process. The final hierarchy will, these are not satisfactory given the general character of in general, depend on the starting hierarchy, and as was the model, which accommodates arbitrary branching pat- mentioned above one cannot guarantee that the global terns. However, traversing all possible hierarchies is not minimum is always found. However, we find that, in the feasible for moderate or large networks; thus, one must majority of cases, this algorithm succeeds in finding the settle with approximative methods. Here, we propose same or very similar hierarchies, independently of the ini- a very simple greedy heuristic, which, given any start- tial choice, which can simply be {Bl} = {1}. However, ing hierarchy, performs a series of local moves to obtain the actual time it takes to reach the optimum will de- the optimal branching. Although this algorithm is not pend on how close the initial tree was to the final one, guaranteed to find the global optimum, we have found and, hence, it is difficult to give an estimate of the total it to perform very well for many synthetic and empiri- number of moves necessary. However the slowest move is 2 cal networks, and it tends to find consistent hierarchies, the resize operation, which completes in O(Bl 1 ln Bl 1) − − independently of the starting estimate. It is also effi- steps, and, hence, most of the time is spent at the lowest cient enough to allow its application to very large net- level l = 0 with B 1 = N, which scales well for very − works, since it does not significantly change the overall large networks. We have succeeded in obtaining reliable algorithmic complexity of the inference procedure. The results with this algorithm for networks in excess of 107 algorithm is based on the following local moves at a given edges, hence it is suitable for large-scale systems9. hierarchy level l:

1. Resize. A new partition of the Bl 1 nodes into a − IV. SYNTHETIC BENCHMARKS newly chosen number of blocks Bl is obtained. This is done via the agglomerative heuristic mentioned Here we consider the performance of the nested block previously, with the modification that it must not model inference procedure on artificially constructed net- invalidate the partition at the level l + 1; i.e., no works. Here we use a nested version of the usual PP nodes that belong to different blocks at the upper model [61], inspired by similar constructions done in level can be merged together in the current level. Refs [51, 64]. We define a seed structure with B blocks This restriction enables the difference in Σ (Eq. 5) 0 and [m ] = δ c/B + (1 − δ )(1 − c)/B (B − 1), to be computed easily, since it depends only on the 1 rs rs 0 rs 0 0 and construct a nested matrix of depth L − 1 via m = modifications made in the current and upper levels, l ml 1 ⊗ ml 1 where ⊗ denotes the Kronecker product, l and l+1. The actual new value of B is chosen via − − l and l ∈ [1,L − 1]. The parameters of the model at level progressive bisection of the range Bl ∈ [Bl 1,Bl+1], l L 1 − l are e = 2Em , and all B = B − blocks have the so that the minimum of Σ is bracketed, and for each rs rs 0 same number of nodes. Via spectral methods [65] one value of B attempted, the best partition is found l can show that the detectability transition happens at with the algorithm of Ref. [63]. 2 hki = [(B0 − 1)/(cB0 − 1)] , which is the same as the 2. Insert. A new level is inserted at position l. Its regular PP model with B = B0 [29–31, 66]. size and partition are chosen exactly as in the resize In Fig. 4 we show the results of the inference procedure 4 move above. for a generated model with B0 = 2 and L = 5, N = 10 nodes and hki = 20. The correct number of blocks is Delete. l 3. The model in level is removed from the detected up to a given value of c > c∗, where c∗ is the hierarchy; i.e., the nodes of level l − 1 are grouped together directly as described in level l + 1.

Through repeated applications of these moves, it is pos- 9 An efficient and fully documented C++ implementation of the sible to construct any hierarchy. The actual greedy opti- algorithm described here is freely available as part of the graph- mization consists of starting with some initial hierarchy tool Python library at http://graph-tool.skewed.de. 9

1.0 5 the hierarchies becoming shallower as one approaches a 0.8 (a) 4 random graph. The approach is, therefore, conservative, 0.6 which brings confidence to the blocks and hierarchies that (b)

3 L are actually found, since despite the increased resolution NMI 0.4 NMI (B=16) capabilities it does not tend to find spurious hierarchies. (c) 2 0.2 NMI (nested) In Appendix B we also include a comparison of the Inferred L 0.0 (d) 1 method with other algorithms for community detection 0.5 0.6 0.7 0.8 0.9 1.0 that are not based on statistical inference. c

V. EMPIRICAL NETWORKS

(a) (b) Here, we present a detailed analysis of some selected empirical networks, as well as a meta-analysis of several networks, spanning different domains and size scales. In all cases, we use the degree-corrected stochastic block (c) (d) model at the lowest hierarchical level, instead of the tra- ditional model, since it almost always provides better re- Figure 4. Top: Normalized mutual information (NMI) be- sults. tween the inferred and true partitions for network realizations Political blogs of the 2004 US election. This is a of the nested PP model described in the text with B1 = 2, 4 network compiled by Adamic et al [67] of political blogs L = 5, hki = 20 and N = 10 , as a function of the assor- tativity strength c, both via the standard stochastic block during the 2004 presidential election in the USA. The model with B = 16, and the nested variant with unspeci- nodes are N = 1, 222 individual blogs, and E = 19, 027 fied parameters. The star symbols (?) show the value of L directed edges exist between pairs of blogs, if one blog for the inferred hierarchy. All points are averaged over 20 cites the other. This network is often used as an empirical independent realizations. The gray vertical line marks the example of community structure, since it displays a divi- detectability threshold c∗ when B is predetermined, and the sion along political lines, with two clearly distinct groups red line when the nested model fails to detect any structure. representing those aligned with the Republican and the Bottom: Example hierarchies inferred for the values of c in- Democratic parties. Indeed, if one applies the nested dicated in the top panel. The left image shows the network block model to this network, the topmost division in the realization itself, and the right one the hierarchical structure hierarchy corresponds exactly to this bimodal partition, [the planted hierarchy corresponds to the one in (a)]. which closely matches the accepted division (see Fig. 5). This partition is also obtained with the nonhierarchical stochastic block model if one imposes B = 2 [24]. How- detectability threshold. The hierarchy itself matches the ever, the nested version reveals a much more complete nested PP model exactly only for higher values of c, and picture of the network, where these two partitions pos- becomes progressively simplified for lower values. Note sess a detailed internal structure, culminating in B0 = 15 that for a large fraction of c values the correct lower-level subgroups with quite heterogeneous connection patterns. partition is detected with a very high precision, but the For instance, one can see that each of the two higher-level hierarchy that is inferred can be simpler than the planted groups possesses one or more subgroups composed mainly one. In these cases, however, both the inferred hierar- of peripheral nodes, i.e., blogs that cite other blogs, but chy, as well as the planted model are fully equivalent, i.e. are not themselves cited as often. Conversely, both fac- they generate the same networks [this is true for the (a), tions possess subgroups which tend to be cited by most (b) and (c) regions in Fig. 4, which have B0 = 16]. In other groups, and others which are cited predominantly other words, the shallower hierarchies that are inferred by specific groups. It is also interesting to note that a correspond to identical representations of the same ers large fraction of the connections between the two top- matrix at the lowest level, which require less informa- level groups are concentrated between only two specific tion to be described, in comparison to the sequence of subgroups, which, therefore, act as bridges between the Kronecker products used in the model specification, and, larger groups. hence, cannot really be seen as a failure of the inference This example shows that the model is capable of re- method, since it simply manages to compress the origi- vealing the structure of the network at multiple scales, nal model. Before the value of c reaches the detectability which reveal simultaneously the existence of the bimodal threshold, the inference method settles on a fully ran- large-scale division, as well the lower-level subdivisions. dom L = 1,B = 1 structure, corresponding once again The Autonomous Systems (AS) topology of the to a parameter region where the block detection is only Internet. Autonomous Systems (AS) are intermediary possible with limited precision and if one knows the cor- building blocks of the Internet topology. They represent rect model size. As predicated by the MDL criterion, the organizational units that are used to control the rout- inferred models tend to be as simple as possible, with ing of packets in the network. A single AS often corre- 10

Figure 5. The political blog network of Adamic and Glance [67]. Left: Topmost partition of the hierarchy inferred with the nested model. Right: The same network, using a circular layout, with edge bundling following the inferred hierarchy [68] (indicated also by the square nodes, and the node colors). The size of the nodes corresponds to the total degree, and the edge color indicates its direction (from dark to light). Nodes marked with a blue halo were incorrectly classified at the topmost level, according to the accepted partition in Ref. [67]

Figure 6. Large-scale structure of the Internet at the au- tonomous systems level, as obtained by the nested stochastic block model, displaying a prominent core-periphery architec- ture. The magnification shows the nodes that belong to the “core” top-level branch, containing AS nodes spread all over the globe, as shown in the map inset. See the Supplemental Figure 7. Large-scale structure of the IMDB film-actor net- Material for a higher-resolution version of this figure. work. Each node in this graph represents a lowest-level block in the hierarchy, instead of individual nodes in the graph. The size of the nodes indicates the number of nodes in each group. The hierarchy branch at the top are the actors, and at the bottom are the films. The labels classify each branch according to the most prominent geographical and temporal characteristics found in the database. See the Supplemental sponds to a network of its own, which is usually owned Material for a higher-resolution version of this figure. by a private company, or a government body. The net- work analyzed here corresponds to the traffic of informa- tion between the AS nodes, as measured by the CAIDA 11

104 104 30 30 in the periphery seem strongly correlated to geographi- 32 (a) 21 (b) 3 26 31 3 cal location. However, the nodes of the core groups are 10 17 18 24 2928 10 21 18 2520 22 27 17 16 23 26 15 19 31 not confined to a single geographical location, and are B 7 B 16 15

/ / 20 1413 7 19 8 12 N 2 11 N 2 5 24 32 10 6 5 9 10 10 instead spread all over the globe (see inset of Fig. 6, and 8 6 13 25 27 1 3 1412 23 22 2928 0 4 0 9 1 3 11 10 the Supplemental Material). 4 1 2 1 2 10 10 The Film-Actor Network. This network is compiled 102 103 104 105 106 107 102 103 104 105 106 107 E E by extracting information available in the Internet Movie

14 Database (IMDB), which contains each cast member and 30 32 1.0 (d) 7 12 (c) film as distinct nodes, and an undirected edge exists be-

) 25 7 26 0.8 6 e 25 23 31 g 4 8 23

d 29 tween a film and each of its cast members. If nodes with 10 16 20 2 e 18 0.6 8 11 16 / 2117 27 24 32 1 14

s 11 20 Q t a single connection are recursively removed, a network

i 8 0.4 0 30 b 229 27 ( 6 13 28 10 12 17 15 29 1213 21 1014 3 15 N = 372, 447 E = 1, 812, 312 E 6 24 19 0.2 of and remains (as of late / 5 0 3 19 18 Σ 5 9 1 2 22 4 28 0.0 3126 2012). As can be seen in Fig. 7, the nested block model 4 2 fully captures the bipartite of the network, and 102 103 104 105 106 107 2 4 6 8 10 12 14 E Σ/E (bits / edge) separates movies and actors at the topmost hierarchi- cal level, and proceeds to separate them in geographical, No. NE Dir. No. NE Dir. No. NE Dir. 0 62 159 No 11 21, 363 91, 286 No 22 255, 265 2, 234, 572 Yes temporal and topical (genre) lines. The observed parti- 1 105 441 No 12 27, 400 352, 504 Yes 23 317, 080 1, 049, 866 No 2 115 613 No 13 34, 401 421, 441 Yes 24 325, 729 1, 469, 679 Yes tion is similar to the one obtained via the nonhierarchi- 3 297 2, 345 Yes 14 39, 796 301, 498 Yes 25 334, 863 925, 872 No B = 971 4 903 6, 760 No 15 52, 104 399, 625 Yes 26 372, 547 1, 812, 312 No cal model [32], but one finds blocks, instead of 5 1, 222 19, 021 Yes 16 56, 739 212, 945 No 27 449, 087 4, 690, 321 Yes B = 332 6 4, 158 13, 422 No 17 75, 877 508, 836 Yes 28 654, 782 7, 499, 425 Yes with the flat version. 7 4, 941 6, 594 No 18 82, 168 870, 161 Yes 29 855, 802 5, 066, 842 Yes 8 8, 638 24, 806 No 19 105, 628 2, 299, 623 No 30 1, 134, 890 2, 987, 624 No Meta-analysis of several empirical networks. We 9 11, 204 117, 619 No 20 196, 591 950, 327 No 31 1, 637, 868 15, 205, 016 No 10 17, 903 196, 972 No 21 224, 832 394, 400 Yes 32 3, 764, 117 16, 511, 740 Yes perform an analysis of several empirical networks shown

No. Network No. Network No. Network in Fig. 8, which belong to a wide variety of domains, 0 Dolphins [69] 11 arXiv Co-Authors (cond-mat) [70] 22 Web graph of stanford.edu. [71] 1 Political Booksa 12 arXiv Citations (hep-th) [70, 72] 23 DBLP collaboration [73] and are distributed across many size scales. We used 2 American Football [3, 74] 13 arXiv Citations (hep-ph) [70, 72] 24 WWW [6] 3 C. Elegans Neurons [75] 14 PGP [76] 25 Amazon product network [73] 4 Disease Genes [77] 15 Internet AS (Caida)b 26 IMDB film-actorc [32] (bipartite) the nonhierarchical stochastic block model as well as the 5 Political Blogs [67] 16 Brightkite social network [78] 27 APS citationsd 6 arXiv Co-Authors (gr-qc) [70] 17 Epinions.com trust network [79] 28 Berkeley/Stanford web graph [71] nested variant. In Fig. 8(a) and (b), we shown the av- 7 Power Grid [75] 18 Slashdot [80] 29 Google web graph [71] 8 arXiv Co-Authors (hep-th) [70] 19 Flickr [81] 30 Youtube social network [73] N/B e erage block sizes for all networks using both mod- 9 arXiv Co-Authors (hep-ph) [70] 20 Gowalla social network [78] 31 Yahoo groups (bipartite) √ 10 arXiv Co-Authors (astro-ph) [70] 21 EU email [70] 32 US patent citations [82] els. For the nonhierarchical version, a clear N/B ∼ E a V. Krebs, retrieved trend is observed, which corresponds to the resolution from http://www-personal.umich.edu/~mejn/netdata/ limit present with this method, and other approaches as b Retrieved from http://www.caida.org. c Retrieved from http://www.imdb.com/interfaces. well. In Fig. 8(b) are shown the results for the nested d Retrieved from http://publish.aps.org/dataset. model, where such a trend can no longer be observed, e Retrieved from http://webscope.sandbox.yahoo.com. and the smallest average block sizes no longer seem to depend on the size of the network, which serves as an N/B Figure 8. (a) The average block size obtained using the empirical demonstration of the lack of resolution limit nonhierarchical model, as a function of E, for the empirical shown previously. The values of the description lengths networks√ listed in the bottom table. The dashed line shows a E slope. (b) The same as (a) but with the nested model. themselves are also distributed in a seemingly nonorga- (c) The description length Σ/E for the nested model as a nized manner [see Fig. 8(c)], i.e. no general tendency for function of E. (d) The value of modularity Q as function of larger networks can be observed, other than an increased Σ/E, for the nested model. range of possible values for larger E values. Any differ- ence observed seems to be due to the actual topological organization, rather than intrinsic constraints imposed project10. Each node in the network is an AS, and a di- by the method. We also compute the modularity of the P 2 2 rected link exists between two nodes if direct traffic has inferred block structures, Q = r err/2E − er/(2E) , been observed between the two AS. As of September 2013 which measures how assortative the topology is. Higher the network is composed of N = 52, 104 AS nodes and values of Q close to 1 indicate the existence of densely E = 399, 625 direct connections between them. The ap- connected communities. The value of Q is the most com- plication of the nested block model to this network yields mon quantity used to detect blocks in networks, and it the hierarchy seen in Fig. 6, with B = 191 blocks at the presumes that such assortative connections are present. lowest level. The most prominent feature observed is a In contrast, by fitting a general stochastic block model, strong core-periphery structure, where most connections no specific pattern is assumed, and the partition found go through a relatively small group of nodes, which act corresponds to the least random model that matches the as hubs in the network. The groups both in the core and data. In Fig. 8 we show the values of Q obtained for the analyzed networks. Indeed, some networks are mod- ular, with high values of Q. However, one does not ob- serve any strong correlation of the description length and 10 The IPv4 Routed /24 AS Links Dataset, http://www.caida.org/ the modularity values. Hence, the most structured net- data/active/ipv4_routed_topology_aslinks_dataset.xml works do not necessarily possess much larger Q values, 12 which indicate that the building blocks of their topologi- ACKNOWLEDGMENTS cal organization are not predominantly assortative com- munities (this is clear in some of the examples consid- ered previously, such as the Internet AS topology and The author would like to thank Sebastian Krause for a the IMDB network). However, for many of these net- careful reading of the manuscript, as well as Cris Moore works, it is probably possible to find partitions that lead and Lenka Zdeborová for useful comments. This work to much higher Q values. These partitions would, on the was funded by the University of Bremen, under the fund- other hand, correspond to block model ensembles with ing line ZF04. a larger entropy than those inferred via maximum like- lihood. Therefore, the maximization of Q in these cases would invariably discard topological information present in the network, and provide a much simplified and possi- bly misleading picture of the large-scale structure of the network. Hence, it seems more appropriate to confine Appendix A: Bayesian model selection (BMS) modularity maximization only to cases where the assor- tative structure is known to be the dominating pattern. In the following we compare Bayesian model selection However, even in these cases, methods based on statis- (BMS) via integrated likelihood with the MDL approach tical inference possess clear advantages, such as the lack considered in the main text, and we show that they lead of resolution limit, model selection guarantees, and the to the same criterion if the model constraints are equiv- overall more principled nature of the approach. alent. For the purpose of performing BMS, we evoke the most usual definition of the stochastic block model ensemble, VI. DISCUSSION where one defines as parameters the probabilities prs that an edge exists between two nodes belonging to blocks r s In this paper, we present a principled method to detect and . The posterior likelihood of observing a given graph {b } {p } hierarchical structures in networks via a nested stochas- with a block partition i and model parameters rs tic block model. This method fully generalizes previous is approaches for the detection of hierarchical community e rs nr nr −ers structures [43–49], since it makes no assumptions either Y 2 P(G|{b }, {p },B) = prs (1 − p ) 2 . (A1) on the actual types of large-scale structures possible (as- i rs rs rs sortative, dissortative, or any arbitrary mixture), or on the hierarchical form, which is not confined to binary trees or dendograms. We show that a major advantage The inference procedure consists in, as before, maximiz- of this approach is that it breaks the so-called resolu- ing this quantity with respect to the parameters {prs} tion limit of approaches, such as modularity optimiza- and the block partition {bi}. It is easy to see that if tion and nonhierarchical model inference, where modules √ one maximizes Eq. A1 with respect to {prs}, one re- smaller than a characteristic size scaling with N can- covers max prs ln P(G|{bi}, {prs},B) = −St, given in not be resolved. With the nested model presented, this Eq. 1, so indeed{ } these models are equivalent. However characteristic scale is replaced by a much smaller loga- this does not provide a means for model selection, since rithmic dependence, making it, in practice, non-existent models with a larger number of blocks B will invari- for many applications. This increased resolution comes ably posses a larger likelihood. Instead, the Bayesian as a result of robust model selection principles, and is model selection approach is to consider the joint proba- integrated with the desirable capacity of differentiating bility P(G, {bi}, {prs}, {pr}|B) of observing not only the between noise and actual structure, and, therefore, it is graph, but also the partition {bi}, the model parameters not susceptible to the detection of spurious communities. {prs} as well as the parameters {pr} that control the We show that the model is capable of inferring the large- probability of each partition {bi} being observed, which scale features of empirical networks in significant detail, is given by even for very large networks. This type of approach should, in principle, also be ap- Y plicable to other model classes, such as those based on nr P({bi}|{pr},B) = pr . (A2) overlapping [9, 83–85], or link communities [25, 86]. We r also predict that it should serve as a more refined method of detecting missing information in networks [23, 44], as well as for the prediction of the network evolution [87], This invariably leads to the inclusion of prior probabili- determining the more salient topological features [88, 89], ties of observing the model parameters, P({prs}|B) and or large-scale functional summaries of the network topol- P({pr}|B). Now, instead of finding the model parame- ogy [90]. ters that maximize this quantity, we compute the inte- 13 grated likelihood [38, 42, 55], and most observed empirical networks. Specifically, typ- ical parameters with p = 1/2 sampled by this prior Z rs will result in dense networks with average degree hki = P(G, {bi}|B) = dprsdpr P(G, {bi}, {prs}, {pr}|B) P rs prsnrns/N = N/2. However, most large empirical networks tend to be sparse, with an average degree which (A3) is many orders of magnitude smaller than N. Hence, as Z N becomes large, most observed networks will lie in a = dprs P(G|{bi}, {prs},B)P({prs}|B)× vanishingly small portion of the parameter space pro- Z duced by this prior. A better choice would constrain the dprP({bi}|{pr})P({pr}|B) average degree to something closer to what is observed in the data, but at the same time being otherwise noninfor- (A4) mative regarding the block structure. A choice such as P = P(G|{bi},B) × P({bi}|B). (A5) P({prs}|B) ∝ δ( rs prsnrns −2E), where E is the num- ber of edges in the observed network seems appropriate, By maximizing P(G, {bi}|B), instead of Eq. A1, one but the integral in Eq. A4 becomes difficult to solve. In- should avoid overfitting the data, since the larger mod- stead, an easier approach is to modify the model sightly, els with many parameters are dominated by a major- so that the average degree is implicitly constrained. Here, ity of choices that fit the data very badly, and, hence, we consider the model variant where the number of edges have a smaller contribution in the integral of Eq. A3. E is a fixed parameter, and each sampled edge may land Therefore, the maximization of the integrated likelihood between any two nodes belonging to blocks r and s with P also corresponds to an application of Occam’s razor, and probability qrs, and we have, therefore, r s qrs = 1. one should expect it to deliver results compatible with The full posterior likelihood of this model is ≥ MDL [53]. However, in practice things are more nu- Q mrs anced, since the value of Eq. A3 is heavily dependent on E! r s qrs P(G|{bi}, {qrs},E,B) = ≥ , the choice of priors P({prs}|B) and P({pr}|B). For the Q Ω({ers}, {nr}) r s mrs! block partitions themselves, this choice is more straight- ≥ (A9) forward. Since one wants to be agnostic with respect to where Ω({ers}, {nr}) is, as before, the number of what block sizes are possible, one should choose a flat different graphs with the same block partition and P({p }|B) = Dirichlet({p }|{α }) α = 1 prior r r r , with r , so edge counts, and mrs = ers if r 6= s or err/2 that all counts are equally likely. The integral of Eq. A4 otherwise. By maximizing Eq. A9 with respect to is then computed as {qrs}, one obtains max qrs ln P(G|{bi}, {qrs},E,B) ' { } − ln Ω({ers}, {nr}) = −St, as long as mrs  1 or  B  X ln P({bi}|B) = − ln N − ln N! + ln nr!, (A6) mrs = 0, so it also is equivalent to the previous mod- r els in this limit. With this reparametrization, the av- erage degree remains fixed independently of the choice which is identical to the partition description length of 0 of prior. Therefore, we may finally use a flat prior Eq. 7, i.e. ln P({bi}|B) = −Lt . P({qrs}|B) = Dirichlet({qrs}|{αrs = 1}), without the For the block probabilities, on the other hand, the sit- risk of the graphs becoming inadvertently dense, and uation is more subtle. A common choice is the flat prior again the integrated likelihood can be computed exactly, P({prs}|B) = 1 [23, 38, 40–42]. This choice is agnostic with respect to what block structures are expected, and Z it is also practical, since the integral can be evaluated P(G|{bi},B) = dqrs P(G|{bi}, {qrs},E,B)P({qrs}|B) exactly [23, 42], (A10)

  B 1 X nrns h (( ))i− ln P(G|{bi},B) = − ln + ln (nrns + 1) = Ω({e }, {n }) × 2 . (A11) e rs r E r>s rs  2  X nr 2  By inserting Eq. A11 into Eq. A5, and comparing with − ln + ln nr/2 + 1 e /2 equation Eq. 10, we see that ln P(G, {bi}|B) = −ΣL=1, r rr (A7) and we conclude reassuringly that the MDL approach is   fully equivalent to BMS when all model constraints are 1 X ers X ' − n n H − (B + 1) ln n , (A8) compatible. In fact, even in the dense case, although not 2 r s b n n r rs r s r quite the same, the (dense) BMS and MDL penalties are very similar. If one assumes N  B2, E ∝ N 2, and where the approximation in Eq. A8 was made assum- equal block sizes nr = N/B, both penalties become ∼ ing nr  1, and Hb(x) is the binary entropy func- B(B+1) ln N +N ln B. Therefore, it seems that whatever tion. However, there is one important issue with this differences arising from the two approaches stem simply approach. Namely, there is a strong discrepancy between from nuances in the choice of prior probabilities. This the models generated by the flat prior P({prs}|B) = 1 comparison also allows us to interpret the nested block 14

model as a hierarchical Bayesian approach, where the 8 priors P({qrs}|B) are replaced by a nested sequence of SBM (B=100) priors and hyperpriors, so that their integrated likelihood Nested SBM matches the description length defined previously. 6 Louvain Infomap

VI 4

Appendix B: Comparison with other community 2 detection methods 0 In this section we compare results obtained for syn- 0.0 0.2 0.4 0.6 0.8 1.0 thetic networks with popular community detection meth- c ods that are not based on statistical inference. Here, we focus not only on the capacity of the method of find- 3 ing a partition correlated with the planted one, but also 10 on the number of blocks detected. We concentrate on two methods which have been reported to provide good 102 results in synthetic benchmarks [91], namely the Lou- B vain method [12], based on modularity optimization, and 1 the Infomod method [45, 92, 93], based on compres- 10 Nested SBM sion of random walks. We make use of the LFR bench- Louvain Infomap mark [94], which corresponds to a specific parametriza- 100 tion of the degree-corrected stochastic block model [24], where both the degree distribution and the block size dis- 0.0 0.2 0.4 0.6 0.8 1.0 tribution follow truncated power laws. Here, we employ c a parametrization similar to Ref. [91], with a degree dis- tribution following a with exponent −2 and a Figure 9. Top: Variation of information (VI) between the k = 5 planted and obtained partitions as a function of the assorta- minimum degree min , and a community size distri- 4 bution also following a power-law, but with exponent −1, tivity parameter c, for networks with N = 2 × 10 , generated and minimum block size of 50. We also impose the follow- as described in the text. The legend indicates results obtained with different methods: Fitting the degree-corrected stochas- ing additional restrictions: The total number of blocks is = 100 B = 100 i tic block model with a fixed number of blocks B (SBM), always fixed at , and for every√ node belonging performing model selection with the nested stochastic block r k n to block , its degree i cannot exceed r, to avoid in- model (Nested SBM), the Louvain modularity maximization trinsic degree-degree correlations [50]. With this parame- method [12], and the Infomod method [45, 92, 93]. Bottom: 4 ter choice, the networks generated with N = 2×10 pos- The obtained number of blocks B as a function of c, for the sess an average degree hki ' 7.8. The actual block struc- same methods as in the top panel. The gray horizontal line ture is parametrized as ers = (1 − c)eres/2E + δrscer, marks the planted B = 100 value. All results were obtained where c controls the assortativity: For c = 1 all edges by averaging over 20 network realizations. connect nodes of the same block, and for c = 0 we have a fully random configuration model11. Because the different methods result in quite different the NMI values tend to be larger simply if the number numbers of detected blocks, the normalized mutual in- of blocks is increased, even if this larger partition is in formation (NMI) is not the most appropriate measure of no other way more strongly correlated to the true one. the overlap between partitions in this case. This is due Another measure that is less susceptible to this problem to the fact that, if the number of nodes is kept fixed, is the variation of information (VI) [95], defined as

VI({xi}, {yi}) = H({xi}) + H({yi}) − 2I({xi}, {yi}), (B1) where H({xi}) is the entropy of the partition {xi} and 11 Note that this is slightly different than in Ref. [94], which pa- I({xi}, {xi}) is the (non-normalized) mutual information rameterized the fraction of internal and external degrees via a between {xi} and {yi}. A value of VI equal to zero means local mixing parameter µ, which is the same for all communities. That choice corresponds to a different parametrization of the that the partitions are identical, whereas any positive degree-corrected block model than the one used here. However, value indicates a reduced overlap between them. since the blocks have different sizes, and the degrees are approx- The VI values between the planted partitions and those imately the same in all blocks, in general there is no choice of obtained with different methods for several network real- µ which would allow one to recover the fully-random configu- izations of the above model are shown in Fig. 9, together ration model, since the intrinsic mixing would be different for each block in this case. Because of this, we have opted for the with the obtained number of blocks. By observing the parametrization used here, however this should not alter the in- VI values for the inference method with a fixed num- terpretation of the benchmark and the comparison with Ref. [94] ber of blocks B = 100, we conclude that the strict de- in a significant way. tectability transition (when the value of B is known) lies 15 somewhere slightly above c ≈ 0.2. However, the model- entropies, we have for the undirected case [50], selection procedure based on the nested stochastic block   model presented in the main text discards any structure 1 X ers S = n n H , (C1) below the c ≈ 0.4 range, and decides on a fully random t 2 r s b n n rs r s B = 1 structure. Above this value, the inferred value of B increases from B = 1 until agreeing with the planted while for the directed case it reads, value for sufficiently large c values. As can also be seen   in Fig. 9, the Louvain method exhibits the “worst of both d X ers St = nrnsHb , (C2) worlds,” i.e., it fails to find the correct partition for all nrns values except c = 1, finding systematically smaller values rs of B, while at the same time finding spurious partitions where Hb(x) = −x ln x − (1 − x) ln(1 − x) is the binary below the detectability threshold, even when the network entropy function. In both cases, ers is the number of is completely random (c = 0). The Infomod method, on edges from block r to s (or the number of half-edges for the other hand, seems to find partitions that are largely the undirected case when r = s), and nr is the number of compatible with the planted one, at least for the param- nodes in block r. In the sparse limit, ers  nrns, these eter region above c ≈ 0.6. However, for even larger val- expressions may be written approximately as ues of c, this method detects a number of blocks that   is significantly larger than the planted value, which in- ∼ 1 X ers creases steadily as c decreases. Hence, this method is also St = E − ers ln , (C3) 2 nrns incapable of separating structure from noise, and finds rs   X ers spurious partitions far below the detectability thresh- Sd =∼ E − e ln . (C4) old. Thus, from the three methods analyzed, the one t rs n n rs r s described in the main text is the only one that combines the following three desirable properties: 1. Optimal in- For the degree-corrected variant with “hard” degree con- ference in the detectable range; 2. Guarantee against straints, we have overfitting and detection of spurious modules; 3. Fully   nonparametric implementation. ∼ X 1 X ers Sc = −E − Nk ln k! − ers ln , (C5) The suboptimal behavior of the modularity-based 2 eres k rs method is simply a combination of the resolution d ∼ X + X limit [14] and lack of built-in model selection based on Sc = −E − Nk+ ln k ! − Nk− ln k−! statistical evidence [13]. It is not currently known if the k+ k−   (C6) Infomod method suffers from problems similar to the res- X ers − ers ln + , olution limit, but clearly it lacks guarantees against de- er es− tection of spurious modules. Although it is also based on rs the principle of parsimony, it tries to compress random P where er = s ers is the number of half-edges incident on walks taking place on the network, instead of the net- + P P block r, and er = s ers and er− = s esr are the num- work itself. Apparently, the method cannot distinguish ber of out- and in-edges adjacent to block r, respectively. between the actual planted block structure and quenched These expressions are also only valid in the sparse limit, topological fluctuations — both of which will affect ran- which in this case amounts to the following conditions, dom walks — and gradually transitions between the two properties in order to best describe the network dynam- k2 − hki k2 − hki r r s s ics. (As has been shown in Ref. [91], this problem di- ers  nrns, (C7) hki2 hki2 minishes if the average degree of the network is made r s sufficiently large, in which case the method finally set- kl = P kl/n tles in a B = 1 partition for fully random graphs.) On where r i r i r [for the directed case we sim- kl ∈→ (k+)l kl → (k )l the other hand, the method in the main text is based on ply replace r r and s − s in maximizing the likelihood of the exact same generative the equation above]. Unfortunately, there is no closed- process that was used to construct the network, which form expression for the entropy outside the sparse limit, puts it in clear advantage over the other two (and, in unlike the traditional variant [50]. fact, many other methods, including all those analyzed For the upper-level multigraphs the entropies are [50], in Refs. [91, 94]), in addition to including a robust and    nr  X nr ns X (( 2 )) formally motivated model-selection procedure. Sm = ln + ln , (C8) ers err /2 r>s r X   Sd = ln nr ns , (C9) Appendix C: Directed and undirected networks m ers rs

n  n+m 1 As mentioned in the main text, the model described is where, as before, m = m− is the number of m- easily generalized for directed graphs. For the ensemble combinations with repetitions from a set of size n. 16

r For the degree-corrected model, the description length where {pk−,k+ } is the joint (in,out)-degree distribution needs to be augmented with the information necessary of nodes belonging to block r. to describe the degree sequence, analogously to Eq. 9 for the undirected case, Note that other generalizations for the directed case are possible [27], and it should be straightforward to X r Lc = Lt + nrH({pk−,k+ }), (C10) adapt the nested model for them as well. r

[1] M. E. J. Newman, “Communities, modules and large- (2010). scale structure in networks,” Nat Phys 8, 25–31 (2011). [17] M. B. Hastings, “Community detection as an inference [2] Santo Fortunato, “Community detection in graphs,” problem,” Physical Review E 74, 035102 (2006). Physics Reports 486, 75–174 (2010). [18] Diego Garlaschelli and Maria I. Loffredo, “Maximum like- [3] M. Girvan and M. E. J. Newman, “Community structure lihood: Extracting unbiased information from complex in social and biological networks,” Proceedings of the Na- networks,” Physical Review E 78, 015101 (2008). tional Academy of Sciences 99, 7821 –7826 (2002). [19] M. E. J. Newman and E. A. Leicht, “Mixture models [4] E. Ravasz, A. L. Somera, D. A. Mongru, Z. N. Oltvai, and exploratory analysis in networks,” Proceedings of the and A.-L. Barabási, “Hierarchical organization of mod- National Academy of Sciences 104, 9564 –9569 (2007). ularity in metabolic networks,” Science 297, 1551–1555 [20] J. Reichardt and D. R. White, “Role models for complex (2002), PMID: 12202830. networks,” The European Physical Journal B 60, 217– [5] Robert J. Fletcher Jr, Andre Revell, Brian E. Reichert, 224 (2007). Wiley M. Kitchens, Jeremy D. Dixon, and James D. [21] Jake M. Hofman and Chris H. Wiggins, “Bayesian ap- Austin, “Network modularity reveals critical scales for proach to network modularity,” Physical Review Letters connectivity in ecology and evolution,” Nature Commu- 100, 258701 (2008). nications 4 (2013), 10.1038/ncomms3572. [22] Peter J. Bickel and Aiyou Chen, “A nonparametric view [6] Réka Albert, Hawoong Jeong, and Albert-László of network models and Newman–Girvan and other mod- Barabási, “Internet: Diameter of the world-wide web,” ularities,” Proceedings of the National Academy of Sci- Nature 401, 130–131 (1999). ences 106, 21068–21073 (2009). [7] Soon-Hyung Yook, Hawoong Jeong, and Albert-László [23] Roger Guimerà and Marta Sales-Pardo, “Missing and Barabási, “Modeling the internet’s large-scale topology,” spurious interactions and the reconstruction of complex Proceedings of the National Academy of Sciences 99, networks,” Proceedings of the National Academy of Sci- 13382–13386 (2002), PMID: 12368484. ences 106, 22073 –22078 (2009). [8] Yunpeng Zhao, Elizaveta Levina, and Ji Zhu, “Commu- [24] Brian Karrer and M. E. J. Newman, “Stochastic block- nity extraction for social networks,” Proceedings of the models and community structure in networks,” Physical National Academy of Sciences 108, 7321 –7326 (2011). Review E 83, 016107 (2011). [9] Gergely Palla, Imre Derényi, Illés Farkas, and Tamás [25] Brian Ball, Brian Karrer, and M. E. J. Newman, “Ef- Vicsek, “Uncovering the overlapping community struc- ficient and principled method for detecting communities ture of complex networks in nature and society,” Nature in networks,” Physical Review E 84, 036103 (2011). 435, 814–818 (2005), PMID: 15944704. [26] Jörg Reichardt, Roberto Alamino, and David Saad, “The [10] M. E. J. Newman and M. Girvan, “Finding and evaluat- interplay between microscopic and mesoscopic structures ing community structure in networks,” Physical Review in complex networks,” PLoS ONE 6, e21282 (2011). E 69, 026113 (2004). [27] Yaojia Zhu, Xiaoran Yan, and Cristopher Moore, “Ori- [11] , M. E. J. Newman, and Cristopher ented and degree-generated block models: generating and Moore, “Finding community structure in very large net- inferring communities with inhomogeneous degree distri- works,” Physical Review E 70, 066111 (2004). butions,” Journal of Complex Networks 2, 1–18 (2014). [12] Vincent D. Blondel, Jean-Loup Guillaume, Renaud Lam- [28] Edward B. Baskerville, Andy P. Dobson, Trevor Bed- biotte, and Etienne Lefebvre, “Fast unfolding of commu- ford, Stefano Allesina, T. Michael Anderson, and Mer- nities in large networks,” Journal of Statistical Mechan- cedes Pascual, “Spatial guilds in the serengeti food web ics: Theory and Experiment 2008, P10008 (2008). revealed by a bayesian group model,” PLoS Comput Biol [13] Roger Guimerà, Marta Sales-Pardo, and Luís A. Nunes 7, e1002321 (2011). Amaral, “Modularity from fluctuations in random graphs [29] Aurelien Decelle, Florent Krzakala, Cristopher Moore, and complex networks,” Physical Review E 70, 025101 and Lenka Zdeborová, “Inference and phase transitions (2004). in the detection of modules in sparse networks,” Physical [14] Santo Fortunato and Marc Barthélemy, “Resolution limit Review Letters 107, 065701 (2011). in community detection,” Proceedings of the National [30] Aurelien Decelle, Florent Krzakala, Cristopher Moore, Academy of Sciences 104, 36–41 (2007). and Lenka Zdeborová, “Asymptotic analysis of the [15] Andrea Lancichinetti and Santo Fortunato, “Limits stochastic block model for modular networks and its al- of modularity maximization in community detection,” gorithmic applications,” Physical Review E 84, 066106 Physical Review E 84, 066122 (2011). (2011). [16] Benjamin H. Good, Yves-Alexandre de Montjoye, and [31] Elchanan Mossel, Joe Neeman, and Allan Sly, “Stochas- Aaron Clauset, “Performance of modularity maximiza- tic block models and reconstruction,” arXiv:1202.1499 tion in practical contexts,” Physical Review E 81, 046106 (2012). 17

[32] Tiago P. Peixoto, “Parsimonious module inference in E 80, 016109 (2009). large networks,” Physical Review Letters 110, 148701 [48] István A. Kovács, Robin Palotai, Máté S. Szalay, and (2013). Peter Csermely, “Community landscapes: An integrative [33] Paul W. Holland, Kathryn Blackmond Laskey, and approach to determine overlapping network module hier- Samuel Leinhardt, “Stochastic blockmodels: First steps,” archy, identify key nodes and predict network dynamics,” Social Networks 5, 109–137 (1983). PLoS ONE 5, e12528 (2010). [34] Stephen E. Fienberg, Michael M. Meyer, and Stanley S. [49] Yongjin Park, Cristopher Moore, and Joel S. Bader, “Dy- Wasserman, “Statistical analysis of multiple sociometric namic networks from hierarchical bayesian graph cluster- relations,” Journal of the American Statistical Associa- ing,” PLoS ONE 5, e8118 (2010). tion 80, 51–67 (1985). [50] Tiago P. Peixoto, “Entropy of stochastic blockmodel en- [35] Katherine Faust and Stanley Wasserman, “Blockmodels: sembles,” Physical Review E 85, 056122 (2012). Interpretation and evaluation,” Social Networks 14, 5–61 [51] Jure Leskovec, Deepayan Chakrabarti, Jon Kleinberg, (1992). Christos Faloutsos, and Zoubin Ghahramani, “Kronecker [36] Carolyn J. Anderson, Stanley Wasserman, and Kather- graphs: An approach to modeling networks,” J. Mach. ine Faust, “Building stochastic blockmodels,” Social Net- Learn. Res. 11, 985–1042 (2010). works 14, 137–161 (1992). [52] Ginestra Bianconi, “Entropy of network ensembles,” [37] Martin Rosvall and Carl T. Bergstrom, “An information- Physical Review E 79, 036114 (2009). theoretic framework for resolving community structure [53] Peter D. Grünwald, The Minimum Description Length in complex networks,” Proceedings of the National Principle (The MIT Press, 2007). Academy of Sciences 104, 7327–7331 (2007). [54] Jorma Rissanen, Information and Complexity in Statis- [38] J.-J. Daudin, F. Picard, and S. Robin, “A mixture model tical Modeling, 1st ed. (Springer, 2010). for random graphs,” Statistics and Computing 18, 173– [55] C. Biernacki, G. Celeux, and G. Govaert, “Assessing a 183 (2008). mixture model for clustering with the integrated com- [39] Mahendra Mariadassou, Stéphane Robin, and Corinne pleted likelihood,” IEEE Transactions on Pattern Anal- Vacher, “Uncovering latent structure in valued graphs: ysis and Machine Intelligence 22, 719–725 (2000). A variational approach,” The Annals of Applied Statis- [56] Xiaoran Yan, Jacob E. Jensen, Florent Krzakala, Cristo- tics 4, 715–742 (2010), mathematical Reviews number pher Moore, Cosma Rohilla Shalizi, Lenka Zdeborova, (MathSciNet): MR2758646. Pan Zhang, and Yaojia Zhu, “Model selection for degree- [40] Cristopher Moore, Xiaoran Yan, Yaojia Zhu, Jean- corrected block models,” arXiv:1207.3994 (2012). Baptiste Rouquier, and Terran Lane, “Active learning [57] Gideon Schwarz, “Estimating the dimension of a model,” for node classification in assortative and disassortative The Annals of Statistics 6, 461–464 (1978), mathematical networks,” in Proceedings of the 17th ACM SIGKDD in- Reviews number (MathSciNet): MR468014; Zentralblatt ternational conference on Knowledge discovery and data MATH identifier: 0379.62005. mining, KDD ’11 (ACM, New York, NY, USA, 2011) p. [58] H. Akaike, “A new look at the statistical model identi- 841–849. fication,” IEEE Transactions on Automatic Control 19, [41] Pierre Latouche, Etienne Birmele, and Christophe Am- 716–723 (1974). broise, “Variational bayesian inference and complexity [59] Joerg Reichardt and Michele Leone, “(un)detectable clus- control for stochastic block models,” Statistical Modelling ter structure in sparse networks,” 0711.1452 (2007). 12, 93–115 (2012). [60] Dandan Hu, Peter Ronhovde, and Zohar Nussinov, [42] E. Côme and P. Latouche, Model selection and cluster- “Phase transitions in random potts systems and the ing in stochastic block models with the exact integrated community detection problem: spin-glass type and dy- complete data likelihood, arXiv e-print 1303.2962 (2013). namic perspectives,” Philosophical Magazine 92, 406–445 [43] Aaron Clauset, Cristopher Moore, and Mark E. J. New- (2012). man, “Structural inference of hierarchies in networks,” [61] Anne Condon and Richard M. Karp, “Algorithms for in Statistical Network Analysis: Models, Issues, and New graph partitioning on the planted partition model,” Ran- Directions, Lecture Notes in Computer Science No. 4503, dom Structures & Algorithms 18, 116–140 (2001). edited by Edoardo Airoldi, David M. Blei, Stephen E. [62] J. Xiang, X. G. Hu, X. Y. Zhang, J. F. Fan, X. L. Zeng, Fienberg, Anna Goldenberg, Eric P. Xing, and Alice X. G. Y. Fu, K. Deng, and K. Hu, “Multi-resolution modu- Zheng (Springer Berlin Heidelberg, 2007) pp. 1–13. larity methods and their limitations in community detec- [44] Aaron Clauset, Cristopher Moore, and M. E. J. New- tion,” The European Physical Journal B 85, 1–10 (2012). man, “Hierarchical structure and the prediction of miss- [63] Tiago P. Peixoto, “Efficient monte carlo and greedy ing links in networks,” Nature 453, 98–101 (2008). heuristic for the inference of stochastic block models,” [45] Martin Rosvall and Carl T. Bergstrom, “Multilevel com- Physical Review E 89, 012804 (2014). pression of random walks on networks reveals hierarchical [64] Gergely Palla, László Lovász, and Tamás Vicsek, “Mul- organization in large integrated systems,” PLoS ONE 6, tifractal network generator,” Proceedings of the National e18209 (2011). Academy of Sciences 107, 7640 –7645 (2010). [46] Marta Sales-Pardo, Roger Guimerà, André A. Moreira, [65] Tiago P. Peixoto, “Eigenvalue spectra of modular net- and Luís A. Nunes Amaral, “Extracting the hierarchi- works,” Physical Review Letters 111, 098701 (2013). cal organization of complex systems,” Proceedings of the [66] Raj Rao Nadakuditi and M. E. J. Newman, “Graph spec- National Academy of Sciences 104, 15224–15229 (2007), tra and the detectability of community structure in net- PMID: 17881571. works,” Physical Review Letters 108, 188701 (2012). [47] Peter Ronhovde and Zohar Nussinov, “Multiresolu- [67] Lada A. Adamic and Natalie Glance, “The political blo- tion community detection for megascale networks by gosphere and the 2004 U.S. election: divided they blog,” information-based replica correlations,” Physical Review in Proceedings of the 3rd international workshop on Link 18

discovery, LinkKDD ’05 (ACM, New York, NY, USA, the eleventh ACM SIGKDD international conference on 2005) p. 36–43. Knowledge discovery in data mining, KDD ’05 (ACM, [68] D. Holten, “Hierarchical edge bundles: Visualization of New York, NY, USA, 2005) p. 177–187. adjacency relations in hierarchical data,” IEEE Transac- [83] Edoardo M. Airoldi, David M. Blei, Stephen E. Fien- tions on Visualization and Computer Graphics 12, 741– berg, and Eric P. Xing, “Mixed membership stochastic 748 (2006). blockmodels,” J. Mach. Learn. Res. 9, 1981–2014 (2008). [69] David Lusseau, Karsten Schneider, Oliver J. Boisseau, [84] Andrea Lancichinetti, Santo Fortunato, and János Patti Haase, Elisabeth Slooten, and Steve M. Dawson, Kertész, “Detecting the overlapping and hierarchical “The bottlenose dolphin community of doubtful sound community structure in complex networks,” New Jour- features a large proportion of long-lasting associations,” nal of Physics 11, 033015 (2009). Behavioral Ecology and Sociobiology 54, 396–405 (2003). [85] Prem K. Gopalan and David M. Blei, “Efficient discov- [70] Jure Leskovec, Jon Kleinberg, and Christos Falout- ery of overlapping communities in massive networks,” sos, “Graph evolution: Densification and shrinking di- Proceedings of the National Academy of Sciences 110, ameters,” ACM Trans. Knowl. Discov. Data 1 (2007), 14534–14539 (2013), PMID: 23950224. 10.1145/1217299.1217301. [86] Yong-Yeol Ahn, James P. Bagrow, and Sune Lehmann, [71] Jure Leskovec, Kevin J. Lang, Anirban Dasgupta, and “Link communities reveal multiscale complexity in net- Michael W. Mahoney, “Community structure in large works,” Nature 466, 761–764 (2010). networks: Natural cluster sizes and the absence of large [87] David Liben-Nowell and Jon Kleinberg, “The link- well-defined clusters,” arXiv:0810.1355 (2008). prediction problem for social networks,” Journal of the [72] Johannes Gehrke, Paul Ginsparg, and Jon Kleinberg, American Society for Information Science and Technol- “Overview of the 2003 KDD cup,” SIGKDD Explor. ogy 58, 1019–1031 (2007). Newsl. 5, 149–151 (2003). [88] Daniel Grady, Christian Thiemann, and Dirk Brock- [73] Jaewon Yang and Jure Leskovec, “Defining and evalu- mann, “Robust classification of salient links in complex ating network communities based on ground-truth,” in networks,” Nature Communications 3, 864 (2012). Proceedings of the ACM SIGKDD Workshop on Mining [89] Ginestra Bianconi, Paolo Pin, and Matteo Marsili, “As- Data Semantics, MDS ’12 (ACM, New York, NY, USA, sessing the relevance of node features for network struc- 2012) p. 3:1–3:8. ture,” Proceedings of the National Academy of Sciences [74] T S Evans, “American college football network files,” 106, 11433 –11438 (2009). FigShare (2012), 10.6084/m9.figshare.93179. [90] Roger Guimerà and Luís A. Nunes Amaral, “Functional [75] D. J. Watts and S. H. Strogatz, “Collective dynamics of cartography of complex metabolic networks,” Nature ’small-world’ networks,” Nature 393, 409–10 (1998). 433, 895–900 (2005). [76] Oliver Richters and Tiago P. Peixoto, “Trust transitivity [91] Andrea Lancichinetti and Santo Fortunato, “Community in social networks,” PLoS ONE 6, e18384 (2011). detection algorithms: A comparative analysis,” Physical [77] K. I. Goh, M. E. Cusick, D. Valle, B. Childs, M. Vidal, Review E 80, 056117 (2009). and A. L. Barabási, “The human disease network,” Pro- [92] Martin Rosvall and Carl T. Bergstrom, “Maps of random ceedings of the National Academy of Sciences 104, 8685 walks on complex networks reveal community structure,” (2007). Proceedings of the National Academy of Sciences 105, [78] Eunjoon Cho, Seth A. Myers, and Jure Leskovec, 1118–1123 (2008), PMID: 18216267. “Friendship and mobility: user movement in location- [93] M. Rosvall, D. Axelsson, and C. T. Bergstrom, “The based social networks,” in Proceedings of the 17th ACM map equation,” The European Physical Journal Special SIGKDD international conference on Knowledge discov- Topics 178, 13–23 (2009). ery and data mining, KDD ’11 (ACM, New York, NY, [94] Andrea Lancichinetti, Santo Fortunato, and Filippo USA, 2011) p. 1082–1090. Radicchi, “Benchmark graphs for testing community [79] Matthew Richardson, Rakesh Agrawal, and Pedro detection algorithms,” Physical Review E 78, 046110 Domingos, “Trust management for the semantic web,” in (2008). The Semantic Web - ISWC 2003 , Lecture Notes in Com- [95] Marina Meilă, “Comparing clusterings by the variation of puter Science No. 2870, edited by Dieter Fensel, Katia information,” in Learning Theory and Kernel Machines, Sycara, and John Mylopoulos (Springer Berlin Heidel- Lecture Notes in Computer Science No. 2777, edited by berg, 2003) pp. 351–368. Bernhard Schölkopf and Manfred K. Warmuth (Springer [80] Jure Leskovec, Daniel Huttenlocher, and Jon Kleinberg, Berlin Heidelberg, 2003) pp. 173–187. “Signed networks in social media,” in Proceedings of the SIGCHI Conference on Human Factors in Computing Systems, CHI ’10 (ACM, New York, NY, USA, 2010) p. 1361–1370. [81] Julian McAuley and Jure Leskovec, “Image labeling on a network: Using social-network metadata for image clas- sification,” in Computer Vision – ECCV 2012 , Lecture Notes in Computer Science No. 7575, edited by Andrew Fitzgibbon, Svetlana Lazebnik, Pietro Perona, Yoichi Sato, and Cordelia Schmid (Springer Berlin Heidelberg, 2012) pp. 828–841. [82] Jure Leskovec, Jon Kleinberg, and Christos Falout- sos, “Graphs over time: densification laws, shrinking diameters and possible explanations,” in Proceedings of Supplemental Material: Hierarchical block structures and high-resolution model selection in large networks

Tiago P. Peixoto∗ Institut für Theoretische Physik, Universität Bremen, Hochschulring 18, D-28359 Bremen, Germany

A. Empirical networks East Asia; 3. Europe, Africa, South Asia and the Middle East; 4. Russia and Western Europe. These get further subdivided in the lower levels. In Fig. 1 is shown a higher resolution version of Fig. 6 in In Fig. 2 is shown a higher resolution version of Fig. 7 the main text, with additional information on the corre- in the main text, containing the large-scale structure of lation with geographical location. As can be seen, at the the IMDB network. See also Ref. [1] for more details topmost level it is separated into four blocks, comprising: on the type of information which can be associated with 1. The “core” AS nodes; 2. The Americas, Australia and each block.

[1] T. P. Peixoto, Physical Review Letters 110, 148701 (2013).

[email protected] 2

l = 5 l = 4

l = 3 l = 2

l = 1 l = 0

FIG. 1. Large-scale structure of the Internet at the autonomous systems level, as obtained by the nested stochastic block model, displaying a prominent core-periphery architecture. The “blow up” shows the nodes which belong to the “core” top-level branch, containing AS nodes spread all over the globe, as shown in the map inset below it. The maps at the bottom show the network partitions at different hierarchical levels, from top to bottom, showing the strong correlation with geographical divisions. The stars (?) are the “core” nodes, and the circles are regular AS nodes. 3

FIG. 2. Large-scale structure of the IMDB actor-film network. Each node in this graph represents a lowest-level block in the hierarchy, instead of individual nodes in the graph. The size of the nodes indicates the number of nodes in each group. The hierarchy branch at the top are the actors, and at the bottom are the films. The labels classify each branch according to the most prominent geographical, temporal or genre characteristics found in the database.