PHYSICAL REVIEW X 4, 011047 (2014)

Hierarchical Block Structures and High-Resolution Model Selection in Large Networks

Tiago P. Peixoto* Institut für Theoretische Physik, Universität Bremen, Hochschulring 18, D-28359 Bremen, Germany (Received 5 November 2013; published 24 March 2014) Discovering and characterizing the large-scale topological features in empirical networks are crucial steps in understanding how complex systems function. However, most existing methods used to obtain the modular structure of networks suffer from serious problems, such as being oblivious to the statistical evidence supporting the discovered patterns, which results in the inability to separate actual structure from noise. In addition to this, one also observes a resolution limit on the size of communities, where smaller but well-defined clusters are not detectable when the network becomes large. This phenomenon occurs for the very popular approach of optimization, which lacks built-in statistical validation, but also for more principled methods based on statistical inference and model selection, which do incorporate statistical validation in a formally correct way. Here, we construct a nested generative model that, through a complete description of the entire network hierarchy at multiple scales, is capable of avoiding this limitation and enables the detection of modular structure at levels far beyond those possible with current approaches. Even with this increased resolution, the method is based on the principle of parsimony, and is capable of separating signal from noise, and thus will not lead to the identification of spurious modules even on sparse networks. Furthermore, it fully generalizes other approaches in that it is not restricted to purely assortative mixing patterns, directed or undirected graphs, and ad hoc hierarchical structures such as binary trees. Despite its general character, the approach is tractable and can be combined with advanced techniques of community detection to yield an efficient algorithm that scales well for very large networks.

DOI: 10.1103/PhysRevX.4.011047 Subject Areas: Complex Systems, Interdisciplinary Physics, Statistical Physics

I. INTRODUCTION The method that has perhaps gathered the most wide- The detection of communities and other large-scale spread use is called modularity optimization [10] and structures in networks has become perhaps one of the consists in maximizing a quality function that favors largest undertakings in [1,2]. It is moti- partitions of nodes for which the fraction of internal edges vated by the desire to be able to characterize the most inside each cluster is larger than expected given a null salient features in large biological [3–5], technological model, taken to be a . This method is [6,7], and social systems [3,8,9], such that their building relatively easy to use and comprehend, works well in blocks become evident, potentially giving valuable insight many accessible examples, and is capable of being applied into the central aspects governing their function and in very large systems via efficient heuristics [11,12]. evolution. At its simplest level, the problem seems straight- However it also suffers from serious drawbacks. In par- forward: Modules are groups of nodes in the network that ticular, despite measuring a deviation from a null model, it have a similar connectivity pattern, often assumed to be does not take into account the statistical evidence asso- assortative, i.e., connected mostly among themselves and ciated with this deviation, and as a result, it is incapable of less so with the rest of the network. However, when separating actual structure from those arising simply of attempting to formalize this notion and develop methods statistical fluctuations of the null model, and it even finds to detect such structures, the combined effort of many high-scoring partitions in fully random graphs [13]. This researchers in recent years has spawned a great variety of problem is not specific to modularity and is a characteristic competing approaches to the problem, with no clear, shared by the vast majority of methods proposed for universally accepted outcome [2]. solving the same task [2]. In addition to the lack of statistical validation, modularity maximization fails to detect clusters with size below a given thresholdpffiffiffiffi[14,15], *[email protected]‑bremen.de which increases with the size of the system as ∼ E, where E is the number of edges in the entire network. This Published by the American Physical Society under the terms of the Creative Commons Attribution 3.0 License. Further distri- limitation is independent of how salient these relatively bution of this work must maintain attribution to the author(s) and smaller structures are, and makes this potentially very im- the published article’s title, journal citation, and DOI. portant information completely inaccessible. Furthermore,

2160-3308=14=4(1)=011047(18) 011047-1 Published by the American Physical Society TIAGO P. PEIXOTO PHYS. REV. X 4, 011047 (2014) results obtained with this method tend to be degenerate for model encapsulating its entire hierarchical structure at large empirical networks [16], for which many different once. It generalizes previous methods of hierarchical partitions can be found with modularity values very close to community detection [43–49], in that it does not impose the global maximum. In these common situations, the specific patterns such as dendograms or binary trees, in method fails in giving a faithful representation of the actual addition to allowing arbitrary modular structures as the usual stochastic block model, instead of purely assortative large-scale structure present in the system. ones. Furthermore, despite its increased resolution, the More recently, increasing effort has been spent on a approach attempts to find the simplest possible model that different approach based on the statistical inference of fits the data and is not subject to overfitting, and, hence, generative models, which encode the modular structure of will not detect spurious modules in random networks. the network as model parameters [17–28]. This approach Finally, the method is fully nonparametric and can be offers several advantages over a dominating fraction of implemented efficiently with a simple algorithm that scales existing methods, since it is more firmly grounded on well- well for very large networks. known principles and methods of statistical analysis, which In Sec. II, we start with the definition of the model and allows the incorporation of the statistical evidence present then we discuss the model-selection procedure based on in the data in a formally correct manner. Under this general MDL. We then move to the analysis of the resolution limit, framework, one could hope to overcome some of the and proceed to define an efficient algorithm for the limitations existing in more ad hoc methods, or at least inference of the nested model, and we finalize with the make any intrinsic limitations easier to understand in light analysis of synthetic and empirical networks, where we of more robust concepts [29–32]. Perhaps the generative demonstrate the quality of the approach. We then conclude model most used for this purpose is the stochastic block with an overall discussion. model [17–28,33–36], which groups nodes in blocks with arbitrary probabilities of connections between them. This very simple definition already does away with the restric- II. HIERARCHICAL MODEL tion of considering only purely assortative communities The original stochastic block model ensemble [33–36] is and accommodates many different patterns, such as core- composed of N nodes, divided into B blocks, with ers edges periphery structures and bipartite blocks, as well as between nodes of blocks r and s (or, for convenience of straightforward generalizations to directed graphs. In notation, twice that number if r ¼ s). Here, we differentiate this context, the detectability of well-defined clusters between two very similar model variants: (1) the edge amounts, in large part, to the issue of model selection counts ers are themselves the parameters of the model, and based on principled criteria such as minimum description (2) the parameters are the probabilities prs that an edge length (MDL) [32,37] or Bayesian model selection (BMS) exists between two nodes of the respective blocks, such that [38–42]. These approaches allow the selection of the most the edge counts hersi¼nrnsprs are constrained on aver- appropriate number of blocks based on statistical evidence, age. Both are equally valid generative models, and as long and thus avoid the detection of spurious communities. as the edge counts are sufficiently large, they are fully However, frustratingly, at least one of the limitations of equivalent (see Ref. [50] and Appendix A). Here, we stick modularity maximization is also present when doing model with the first variant, since it makes the following formu- selection, namely, the resolution limit mentioned above. As lation more convenient. We also consider a further variation was recently shown in Ref. [32], when using MDL,p theffiffiffiffi called the degree-corrected block model [24], which is maximum number of detectable blocks scales with N, defined exactly as the traditional model(s) above, but one where N is the number of nodes in the network, which is additionally specifies the degree sequence fkig of the graph very similar to the modularity optimization limit. However, as an additional set of parameters (again, these values can in this context, this limitation arises out of the lack of be the parameters themselves, or they can be constrained on knowledge about the type of modular structure one is about average [50]). The degree-corrected version, although it is a to infer and the a priori assumption that all possibilities relatively simple modification, yields much more convinc- should occur with the same probability. Here, we develop a ing results on many empirical networks, since it is capable more refined method of model selection, which consists in of incorporating degree variability inside each block [24]. a nested hierarchy of stochastic block models, where an As will be seen below, it is, in general, also capable of upper level of the hierarchy serves as prior information to a providing a more compact description of arbitrary networks lower level. This dramatically changes the resolution of the than the traditional version. model selectionpffiffiffiffi procedure and replaces the characteristic The nested version, which we define here, is based on the block size of N in the nonhierarchical model by much a simple fact that the edge counts ers themselves form a block smaller value that scales only logarithmically with N, multigraph, where the nodes are the blocks and the edge enabling the detection of much smaller blocks in very counts are the edge multiplicities between each node pair large networks. Furthermore, the approach provides a (with self-loops allowed). This multigraph may also be description of the network in many scales, in a complete constructed via a generative model of its own. If we choose

011047-2 HIERARCHICAL BLOCK STRUCTURES AND HIGH- … PHYS. REV. X 4, 011047 (2014)

A. Module inference The inference approach consists in finding the best partition fbig of the nodes, where bi ∈ ½1;B is the block membership of node i, in the observed network G, such that the posterior likelihood PðGjfbigÞ is maximized. Since each graph with the same edge counts ers occurs with the same probability, the posterior likelihood is simply PðGjfbigÞ ¼ 1=Ωðfersg; fnrgÞ, where ers and nr are the edge and node counts associated with the block partition fbig and Ωðfersg; fnrgÞ is the number of different network realizations. Hence, maximizing the likelihood is identical to minimizing the ensemble entropy [50,52] Sðfersg; fnrgÞ ¼ ln Ωðfersg; fnrgÞ. For the lowest level of the hierarchy (which models directly the observed network), we have a simple graph, for which the entropies can be computed as [50]   1 X e S n n H rs ; t ¼ 2 r s b n n (1) rs r s

for the traditional block model ensemble, and   X 1 X e FIG. 1. Example of a nested stochastic block model with three S ≃ −E − N k! − e rs ; levels, and a generated network at the bottom. The top-level c k ln 2 rs ln e e (2) rs r s structure describes a core-periphery network, which is further k subdivided in the lower levels. P for the degree-corrected variant, where E ¼ rsers=2 is the total number of edges,P Nk is the total number of nodes a stochastic block model again as a generative model, we with degree k, er ¼ sers is the number of half-edges obtain another smaller block multigraph as parameters at a incident on block r, HbðxÞ¼−x ln x − ð1 − xÞ lnð1 − xÞ is higher level, and so on recursively, until we finally reach a the binary entropy function, and it was assumed that model with only one block. This forms a nested stochastic nr ≫ 1. Note that only the last term of Eq. (2) is, in fact, block model hierarchy, which describes a given network at useful when finding the best block partition, since the other several resolution levels (see Fig. 1). terms remain constant. However, the full expression is This approach provides an increased resolution when necessary when comparing the models against each other performing model selection, since the generative model via model selection, as discussed below. inferred at an upper level serves as prior information to the For the upper-level multigraphs, the entropy can also be one at a lower level. Despite its more elaborate formulation, computed [50], and it takes a different form this hierarchical model remains tractable, and it is possible   to apply it to very large networks, in a fully nonparametric X   X n nrns ðð rÞÞ manner, as discussed below. Furthermore, it generalizes S 2 ; (3) m ¼ ln e þ ln e =2 cleanly the flat variants, which correspond simply to a r>s rs r rr hierarchy with only one level. It also does not impose any n nþm−1 preferred mixing pattern (e.g., assortative or dissortative where ðð mÞÞ ¼ ð m Þ is the number of m-combinations block structures), and is not restricted to any specific with repetitions from a set of size n. Note that we no longer hierarchical form, such as binary trees or dendograms assume that nr ≫ 1, since at the upper levels the number of [97]. In the following, we describe the maximum likelihood nodes becomes arbitrarily small. method to infer the multilevel partitions and the model At each level l ∈ ½0;L in the hierarchy there are Bl−1 selection process based on the minimum description nodes, which are divided into Bl blocks (with Bl ≤ Bl−1), length principle and compare it with Bayesian model were we set B−1 ≡ N. The edge counts at level l are l l selection. denotedP ers and the blockP sizes nr. Therefore, we must have l l In the analysis, we focus on undirected networks, but that rnr ¼ Bl−1 and rsers=2 ¼ E; i.e., the total number everything is straightforwardly applicable to directed net- of nodes decreases in the upper levels, but the total number works as well. In Appendix C, we present a summary of the of edges remains the same. The combined entropy of all relevant expressions for the directed case. layers is then given by

011047-3 TIAGO P. PEIXOTO PHYS. REV. X 4, 011047 (2014)

XL Note that this is different from the choice made in 0 0 l l Sn ¼ St=cðfersg; fnr gÞ þ Smðfersg; fnrgÞ: (4) Bl−1 Refs. [32,37], which considered all possible Bl partitions l 1 ¼ to be equally likely, and, hence, the necessary amount of B B The full generative model corresponds to a nested sequence information was computed as l−1 ln l. This choice of network ensembles, where each sample from a given implicitly assumes that all blocks have equal sizes and level generates another ensemble at a lower level. The offers a worse description when this is not the case. Note B ≫ 1 entropy in Eq. (4) represents the amount of information that for l−1 ,wehave necessary to encode the decision sequence, which, starting Ll ≃ B H nl =B ; from the topmost model, selects the observed network t l−1 ðf r l−1gÞ (8) among all possible branches in the lower levels. P where HðfpigÞ ¼ − ipi ln pi is the entropy of the distri- Whenever both the number of levels and the number of l bution pi . Therefore, for uniform blocks nr Bl−1=Bl, blocks Bl of each level is known, the best multilevel f g ¼ Ll ≃ B B partition is the one that minimizes Sn. However, such we recover asymptotically the value t l−1 ln l. information regarding the size of the model is most often However, the value of Eq. (7) can be much smaller for not available and needs to be inferred from the data as well. nonuniform partitions. This choice has important conse- Using Eq. (4) for this purpose is not appropriate, since quences for the resolution of relatively small blocks, as will minimizing it across all possible hierarchies leads to a be seen below. trivial and meaningless result where Bl ¼ N for all l. For the degree-corrected version, we still need to include Instead, one must employ some form of Occam’s razor and the information necessary to describe the degrees at the select the simplest possible model that best describes the lowest level, observed data without increasing its complexity. We X r present such an approach in the next section. Lc ¼ Lt þ nrHðfpkgÞ; (9) r

B. Model selection r where fpkg is the of nodes belonging A method that directly formalizes Occam’s razor prin- to block r [98]. It is worth noting that, if a network ciple is known as minimum description length [53,54], is sampled from the traditional block model ensemble, r where one specifies the total amount of information so that pk is a Poisson withP average er=nPr, Eq. (9) necessary to described the data, which includes not only becomes Lc ¼ Lt þ 2E − rer ln er=nr þ kNk ln k!, the sample but the model parameters as well. The descrip- which means that Sc þ Lc ¼ St þ Lt, i.e., the total tion length for the model above is description length is identical for both the traditional and the degree-corrected models in this case, and, therefore, Σ ¼ Lt=c þ St=c; (5) both models describe the same network equally well [99]. r However, if the distributions fpkg deviate from Poissons, where Lt=c is the amount of information necessary to fully the degree-corrected variant will provide, in general, a describe the model and St=c corresponds to entropy of the shorter description length. lowest level l ¼ 0 of the hierarchy. In a given level l of the It is easy to see that, if one has a flat L ¼ 1 hierarchy, hierarchy, the information required to describe the model with fBlg¼fB; 1g, the description length of the non- l parameters fersg is given by the entropy Sm [Eq. (3)] of the hierarchical model is recovered [32]; e.g., for the traditional l 1 Σ L S model in level þ , so that we may write model, we have L¼1 ¼ L¼1 þ t, with     XL B B X L S el ; nl Ll−1: L ðð 2ÞÞ N!− n !; (10) t ¼ mðf rsg f rgÞ þ t (6) L¼1 ¼ ln E þln N þln ln r l¼1 r

The only missing information is how to partition the nodes where the only difference in comparison to Ref. [32] is of the current level into Bl blocks, which corresponds to the that here we are using the improved partition description l term Lt in the equation above. The total number of length of Eq. (7). Therefore, the nested generalization fully l Σ ≤ Σ partitionsQ with the same block sizes fnrg is given by encapsulates the flat version, such that min min L¼1; l Bl−1!= rnr!, and the total number of different block sizes i.e., the nested model can provide only a shorter or equal Bl description length of the observed network. is ðð B ÞÞ. Hence, the amount of information necessary to l−1 The MDL principle predicates that whenever the hier- describe the block partition of level l is archy size itself needs to be inferred, one should minimize   B X Eq. (5), instead of Eq. (4) directly. However, MDL is one of Ll l B ! − nl !: (7) t ¼ ln B þ ln l−1 ln r the many principled methods one could use to do model l−1 r selection, which include, e.g., Bayesian model selection via

011047-4 HIERARCHICAL BLOCK STRUCTURES AND HIGH- … PHYS. REV. X 4, 011047 (2014) integrated likelihood [21,38,39,41,42,55], likelihood ratios given by ers ¼ 2E½δrsc=B þð1 − δrsÞð1 − cÞ=BðB − 1Þ, [56], or more approximative methods, such as Bayesian nr ¼ N=B, and c ∈ ½0; 1 is a free parameter that controls information criterion [57] and Akaike information criterion the strength. For this model, if we have that [58]. If any two of these methods are derived from N=B ≫ 1, it can be shown that the detectable phase exists equivalent assumptions, one should expect them to deliver for hki > ½ðB − 1Þ=ðcB − 1Þ2 [29–31]. Let us make the compatible results. In Appendix A, we make a comparison situation even simpler and consider the strongest possible of the MDL approach with Bayesian model selection via block structure with c ¼ 1, i.e., B perfectly isolated integrated likelihood (BMS), since it is nonapproximative assortative communities with N=B nodes. In this case, and can be computed exactly for the stochastic block the detectability threshold lies at hki¼1. Therefore, for model, where we show that under compatible assumptions, any hki > 1, we should be able to detect all B blocks, with a these two methods deliver the exact same results. In the precision increasing with hki, if we know we have B blocks following, we compare the results obtained with non- to begin with. If we do not know this, we must apply a hierarchical MDL (or BMS) and the nested model pre- model-selection criterion as described above to obtain the sented, and show that it yields a higher quality model best value of B. For simplicity, let us assume that, for the B ≡ B selection criterion, which detects the correct number of correct value of true, the true partition is perfectly blocks for sparse networks, without being overconfident. detected, such that St ≃ −E ln B, ignoring additive con- Based on this analysis, we are capable of deriving the stants, which are irrelevant at this point. If a value of B>B optimum number of blocks given a network size, and we true is used, we assume that the inferred partition show that the nested model does not suffer from the corresponds to regular subdivisions of the planted one, such resolution limit, which hinders the nonhierarchical that the entropy remains approximately unchanged S ≃ −E B B1 structures even when the model lies simple example known as the planted partition (PP) model below the detectability threshold hki¼1; hence, this [61]. It corresponds to an assortative block structure shows that the partition likelihood should not simply be

011047-5 TIAGO P. PEIXOTO PHYS. REV. X 4, 011047 (2014)

modularity-based methods, but also if one does statistical inference based on MDL. For the nonhierarchical model, it can be shown that, according to this criterion,pffiffiffiffi the optimal number of blocks scales as B ≃ μðhkiÞ N, where μðxÞ is an increasing function [32]. Therefore, if the planted number exceeds this threshold, blocks will be merged together, despite the fact that the block structure is detectable with arbitrary precision if one knows the correct value of B, and it sufficiently exceeds the detectability threshold hki > 1 of the PP model. This means that the true parameters of the model can no longer be used to compress the generated data. This is a direct result of the assumption that all possible block structures of a given size are equally possible, and the number of such models becomes very large, with a model description length scaling roughly with ∼B2 ln E þ N ln B. In the presence of additional assump- tions about the model, such as the fact that one is dealing with the PP model instead of a more general block structure, this can, in principle, be improved. However, 4 FIG. 2. Model selection results for a PP model with N ¼ 10 , in most practical situations, such assumptions cannot be B 100 c 1 true ¼ , and fully isolated blocks ( ¼ ), using the model- made. One main advantage of the nested model is that this selection criteria described in the text. The top panel shows the without B k limit can be overcome requiring such prior knowl- inferred value of versus the average degree h i in the network. edge. The description length via the nested model for the The solid lines show the theoretical value according to each maximally modular network above is given by Eq. (11), criterion, and the data points are direct optimization of the B B corresponding quantities for actual generated network, averaged with true ¼ . As can be seen, this equation has only log- B over 40 independent realizations. The bottom panel shows the linear dependencies on the model size , instead of the normalized mutual information (NMI) between the inferred and quadratic one present in the flat MDL. The result of this is planted partitions. The dashed line marks the threshold hki¼1 that, if one finds the value of B, which minimizes the where inference becomes impossible for N → ∞. nested description length, one obtains the scaling N B ∝ ; (12) discarded [101]. Both MDL and dense BMS fail to detect ln N anything for hki < 2, which corresponds to a strong threshold [102], which interestingly lies above the strict for sufficiently large N. This is a significant improvement, detectability limit at hki¼1. This corresponds to a region since the maximum number of detectable blocks grows where detectability is possible, but only if the true value of almost linearly with the number of nodes.pffiffiffiffi Thus, a char- B is known (or if a more refined model-selection criterion acteristic detectable block size N=B ∼ N is replaced by a exists). Note that the incomplete BMS criterion performs much smaller value N=B ∼ ln N, which allows for a better in the region 1 < hki < 2, but this is perhaps better precise assessment of small communities even in very interpreted as a by-product of its overall overconfidence for large networks. very sparse networks. Note that all criteria eventually agree It is possible to understand in more detail the origin of on the correct value if hki is made sufficiently large, which the improvement by considering a related problem, which corresponds to the intuitive notion that the detection is the detection of specific blocks that are much smaller problem becomes much easier for dense networks. than the remaining network. Another facet of the resolution A prominent problem in the detectability of block limit manifests itself when two such blocks are merged structures via other methods, such as modularity optimi- together, despite the fact that if they are considered in zation [10], is when modules are merged together, regard- isolation they would be kept separate. Here, we consider less of how strong the is perceived to this problem by using a slightly modified scenario than the be. For the modularity-based approach, when considering a one proposed in Ref. [14], which is a network composed of maximally modular network, similar to the PP model with two fully isolated blocks, each with ec=2 internal edges and c → 1, but with the additional restriction that the graph nc nodes, and a remaining network with N nodes, E edges, k 2E=N remains connected, it has been shownpffiffiffiffi [14] that modules are average degree h i¼ , and an arbitrary topology [see merged together as long as B> E. This phenomenon is Fig. 3(c)]. We may decide if these blocks are merged considered counterintuitive, and has been called the together by considering the difference in the description “resolution limit” of community detection via this method length. The entropy difference for the merge is simply 2 [103]. As it happens, this problem does not only occur for ΔSt ¼ ec ln 2 (where we assume ec ≪ nc, but the dense

011047-6 HIERARCHICAL BLOCK STRUCTURES AND HIGH- … PHYS. REV. X 4, 011047 (2014)

(a) (b) optimal block hierarchy that splits at the top into two branches, the left one containing the two smaller blocks and the right one containing the remaining network and its arbitrary hierarchical structure [see Fig. 3(d)]. To consider the merge, we need to compute the description length only at the lowest level, since the rest remains unchanged after the merge. By computing the difference via Eq. (5), after ΔΣ ΔS n − some manipulations we obtain nested ¼ t þ ln c (c) (d) 3 ln ln B 1 − ln B N − 1 − ln B1 B 2 , ðð ecÞÞ þ ð þ Þ ð þ Þ ð þ þ Þ with B ¼ B0. Note that this expression is independent of E, and, hence, the density of the remaining network cannot influence the merging decision. Since B1 ≤ B, and assum- B ≫ 1 ΔΣ ≃ ΔS n − e 2 ing , we obtain nested t þ ln c ln½ð c þ Þ ðec þ 1Þ − lnðB þ NÞ, and, hence, the dependence on either N or B is again only logarithmic, ec ≈ ½lnðB þ NÞ − ln nc= ln 2, as shown in Fig. 3(b). FIG. 3. Parameter region where two isolated blocks with ec=2 With this example, one can notice that the nested model internal edges and nc ¼ ec=5 nodes are detectable as separate is capable of compartmentalizing the network at the upper blocks [shown schematically in panel (c)], as a function of the levels, such that the lower-level branches can become N=B average block size , and depending on (a) the average degree almost independent of each other. This means that, in k N 105 N k 20 h i with ¼ and (b) the number of nodes with h i¼ . many practical situations, one can sufficiently overcome The dashed curves show the boundaries for the nonhierarchical the resolution limit without abandoning a global model that block model, and the solid lines for the hierarchical variant. The describes the whole network at once. line segments on the right-hand side of the plotsp showffiffiffiffiffiffi the detectability threshold for modularity [14], ec ¼ 2E. The In the following section, we specify an efficient algo- points marked with stars (⋆) correspond to the maximum value rithm to infer the parameters of the nested block model in of B that is detectable in the remaining network with the arbitrary networks, and we test its efficacy in uncovering nonhierarchical model, and the dotted line shows the same the multilevel structure of synthetic as well as empirical quantity for various hki values (i.e., the region on the left of networks. this curve corresponds to an overfitting of the remaining network, according to the non-hierarchical criterion). (d) The hierarchical III. INFERENCE ALGORITHM construction used to decide if the two isolated blocks are merged together with the nested model. Individually, any specific level l of the hierarchical structure is a regular block model, and, hence, the classi- fication of the Bl−1 nodes of this level into Bl blocks case can be computed as well, with no significant differ- can be done via well-established methods, such as the ence in the result). For the flat block model, Monte Carlo method [32,40], simulated annealing, or belief ΔL L E e ;N 2n ;B− 1; n ∪ we have flat ¼ L¼1ð þ c þ c f rg propagation [29,30,56]. Here, we use the method described 2n − L E e ;N 2n ;B; n ∪ n ;n f cgÞ L¼1ð þ c þ c f rg f c cgÞ, com- in Ref. [63], which is an agglomerative heuristic that puted using Eq. (10). For this case, the point at which the provides high-quality results, while being unbiased with ΔL ΔS 0 merge happens, flat þ t ¼ , will depend not only on respect to the types of block structure that are inferred, and the values of E and N, but also on the average block size is also very efficient, with an algorithmic complexity of N=B of the remaining network, as can be seen in Figs. 3(a) OðNln2NÞ, independent of the number of blocks B. If one and 3(b). As the number of blocks in the remainingp networkffiffiffiffi knows the depth L of the hierarchy, and all fBlg values, the approaches the maximum detectable value, B ∼ N, the multilevel partitions can be obtained by starting from the more difficult it becomes to resolve the smaller blocks. The lowest level l ¼ 0 and progressing upwards to l ¼ L. detectable region recedes further with increasing hki, and However, this cannot be done when the number and sizes also withpffiffiffiffi the number of nodes in the remaining network as of the hierarchical levels are unknown. Although it is ec ∼ N. Hence, the denser or larger the remaining net- relatively simple to heuristically impose such patterns as work, the harder it becomes to detect the smaller blocks binary trees or dendograms, these are not satisfactory given with the flat variant of the model. In Fig. 3 are also shown the general character of the model, which accommodates the values of er for which modularity also fails to separate arbitrary branching patterns. However, traversing all pos- the blocks (if one considers that they are connected to sible hierarchies is not feasible for moderate or large themselves and to the rest of the network by single edges networks; thus, one must settle with approximative meth- [104]), which are overall compatible with the flat MDL ods. Here, we propose a very simple greedy heuristic, criterion. The situation changes significantly with the which, given any starting hierarchy, performs a series of nested model. To consider the merge, we assume an local moves to obtain the optimal branching. Although this

011047-7 TIAGO P. PEIXOTO PHYS. REV. X 4, 011047 (2014) algorithm is not guaranteed to find the global optimum, we the slowest move is the resize operation, which completes in 2 have found it to perform very well for many synthetic and OðBl−1ln Bl−1Þ steps, and, hence, most of the time is spent empirical networks, and it tends to find consistent hier- at the lowest level l ¼ 0 with B−1 ¼ N, which scales well archies, independently of the starting estimate. It is also for very large networks. We have succeed in obtaining efficient enough not to hinder its application to very large reliable results with this algorithm for networks in excess of networks, since it does not significantly change the overall 107 edges; hence, it is suitable for large-scale systems [105]. algorithmic complexity of the inference procedure. The algorithm is based on the following local moves at a given IV. SYNTHETIC BENCHMARKS hierarchy level l. (1) Resize.—A new partition of the Bl−1 nodes into a Here, we consider the performance of the nested newly chosen number of blocks Bl is obtained. This is done block model inference procedure on artificially constructed via the agglomerative heuristic mentioned previously, with networks. Here, we use a nested version of the usual the modification that it must not invalidate the partition at PP model [61], inspired by similar constructions done in the level l þ 1; i.e., no nodes that belong to different blocks Refs. [51,64]. We define a seed structure with B0 blocks at the upper level can be merged together in the current and ½m1rs ¼ δrsc=B0 þð1 − δrsÞð1 − cÞ=B0ðB0 − 1Þ, and level. This restriction enables the difference in Σ [Eq. (5)]to construct a nested matrix of depth L − 1 via be computed easily, since it depends only on the mod- ml ¼ ml−1 ⊗ ml−1, where ⊗ denotes the Kronecker l l 1 product and l ∈ ½1;L− 1. The parameters of the model ifications made in the current and upper levels, and þ . l L−1 at level l are ers 2Emrs, and all B B blocks The actual new value of Bl is chosen via progressive ¼ ¼ 0 have the same number of nodes. Via spectral methods bisection of the range Bl ∈ ½Bl−1;Bl 1, so that the mini- þ [65], one can show that the detectability transition happens mum of Σ is bracketed, and for each value of Bl attempted, 2 at k B0 − 1 = cB0 − 1 , which is the same as the the best partition is found with the algorithm of Ref. [63]. h i¼½ð Þ ð Þ regular PP model with B B0 [29–31,66]. (2) Insert.—A new level is inserted at position l. Its size ¼ and partition are chosen exactly as in the resize move above. (3) Delete.—The model in level l is removed from the hierarchy; i.e., the nodes of level l − 1 are grouped together directly as described in level l þ 1. Through repeated applications of these moves, it is possible to construct any hierarchy. The actual greedy optimization consists of starting with some initial hierarchy and keeping track of whether or not each level is “done” or “not done.” One initially marks all levels as not done and starts at the top level l ¼ L. For the current level l,ifitis marked done, it is skipped and one moves to the level l − 1. Otherwise, all three moves are attempted. If any of the moves succeeds in decreasing the description length Σ, one (a) (b) marks the levels l − 1 and l þ 1 (if they exist) as not done, the level l as done, and one proceeds (if possible) to the upper level l þ 1, and repeats the procedure. If no improve- ment is possible, the level l is marked as done and one (c) (d) proceeds to the lower level l − 1. If the lowest level l ¼ 0 is reached and cannot be improved, the algorithm ends. Note FIG. 4. Top: NMI between the inferred and true partitions for that, in order to keep the description length complete, we network realizations of the nested PP model described in the text 4 must impose that BL ¼ 1 throughout the above process. with B1 ¼ 2, L ¼ 5, hki¼20, and N ¼ 10 , as a function of the The final hierarchy will, in general, depend on the starting assortativity strength c, both via the standard stochastic block hierarchy, and as was mentioned above, one cannot model with B ¼ 16 and the nested variant with unspecified ⋆ L guarantee that the global minimum is always found. parameters. The star symbols ( ) show the value of for the However, we find that, in the majority of cases, this inferred hierarchy. All points are averaged over 20 independent realizations. The gray vertical line marks the detectability thresh- algorithm succeeds in finding the same or very similar old when B is predetermined, and the red line when the nested hierarchies, independently of the initial choice, which can model fails to detect any structure. Bottom: Example hierarchies B 1 simply be f lg¼f g. However, the actual time it takes to inferred for the values of c indicated in the top panel. The left- reach the optimum will depend on how close the initial tree hand image shows the network realization itself, and the was to the final one, and, hence, it is difficult to give an right-hand one the hierarchical structure [the planted hierarchy estimate of the total number of moves necessary. However, corresponds to the one in (a)].

011047-8 HIERARCHICAL BLOCK STRUCTURES AND HIGH- … PHYS. REV. X 4, 011047 (2014)

In Fig. 4, we show the results of the inference procedure are actually found, since, despite the increased resolution 4 for a generated model with B0 ¼ 2 and L ¼ 5, N ¼ 10 capabilities, it does not tend to find spurious hierarchies. nodes, and hki¼20. The correct number of blocks is In Appendix B, we also include a comparison of the detected up to a given value of c>c, where c is the method with other algorithms for community detection that detectability threshold. The hierarchy itself matches the are not based on statistical inference. nested PP model exactly only for higher values of c, and becomes progressively simplified for lower values. Note V. EMPIRICAL NETWORKS that, for a large fraction of c values, the correct lower-level partition is detected with a very high precision, but the Here, we present a detailed analysis of some selected hierarchy that is inferred is simpler than the planted one. In empirical networks, as well as a meta-analysis of several these cases, however, both the inferred hierarchy as well as networks, spanning different domains and size scales. In all the planted model are fully equivalent; i.e., they generate cases, we use the degree-corrected stochastic block model the same networks. In other words, the shallower hierar- at the lowest hierarchical level, instead of the traditional chies that are inferred correspond to identical representa- model, since it almost always provides better results. tions of the same ers matrix at the lowest level, which Political blogs of the 2004 U.S. election.—This is a require less information to be described, in comparison to network compiled by Adamic and Glance [67] of political the sequence of Kronecker products used in the model blogs during the 2004 presidential election in the U.S. The specification, and, hence, cannot really be seen as a failure nodes are N ¼ 1222 individual blogs, and E ¼ 19027 of the inference method, since it simply manages to directed edges exists between pairs of blogs, if one blog compress the original model. Before the value of c reaches cites the other. This network is often used as an empirical the detectability threshold, the inference method settles on a example of community structure, since it displays a fully random L ¼ 1, B ¼ 1 structure, corresponding once division along political lines, with two clearly distinct again to a parameter region where the block detection is groups representing those aligned with the Republican and only possible with limited precision and if one knows the the Democratic parties. Indeed, if one applies the nested correct model size. As predicated by the MDL criterion, block model to this network, the topmost division in the the inferred models tend to be as simple as possible, with hierarchy corresponds exactly to this bimodal partition, the hierarchies becoming shallower as one approaches a which closely matches the accepted division (see Fig. 5). random graph. The approach is, therefore, conservative, This partition is also obtained with the nonhierarchical which brings confidence to the blocks and hierarchies that stochastic block model if one imposes B ¼ 2 [24].

FIG. 5. The political blog network of Adamic and Glance [67]. Left: Topmost partition of the hierarchy inferred with the nested model. Right: The same network, using a circular layout, with edge bundling following the inferred hierarchy [68] (indicated also by the square nodes and the node colors). The size of the nodes corresponds to the total degree, and the edge color indicates its direction (from dark to light). Nodes marked with a blue halo were incorrectly classified at the topmost level, according to the accepted partition in Ref. [67].

011047-9 TIAGO P. PEIXOTO PHYS. REV. X 4, 011047 (2014)

However, the nested version reveals a much more complete most connections go through a relatively small group of picture of the network, where these two partitions possess a nodes, which act as hubs in the network. The groups both in detailed internal structure, culminating in B0 ¼ 15 sub- the core and in the periphery seem strongly correlated to groups with quite heterogeneous connection patterns. For geographical location. However, the nodes of the core instance, one can see that each of the two higher-level groups groups are not confined to a single geographical location, possesses one or more subgroups composed mainly of and are instead spread all over the globe (see inset of Fig. 6 peripheral nodes, i.e., blogs that cite other blogs but are and the Supplemental Material [96]). not themselves cited as often. Conversely, both factions The film-actor network.—This network is compiled by possess subgroups that tend to be cited by most other groups, extracting information available in the Internet Movie and others which are cited predominantly by specific groups. Database (IMDB), which contains each cast member and It is also interesting to note that a large fraction of the film as distinct nodes, and an undirected edge exists connections between the two top-level groups are concen- between a film and each of its cast members. If nodes trated between only two specific subgroups, which, there- with a single connection are recursively removed, a net- fore, act as bridges between the larger groups. work of N ¼ 372 447 and E ¼ 1 812 312 remains (as of This example shows that the model is capable of late 2012). As can be seen in Fig. 7, the nested block model revealing the structure of the network at multiple scales, fully captures the bipartite nature of the network and which reveal simultaneously the existence of the bimodal separates movies and actors at the topmost hierarchical large-scale division, as well the lower-level subdivisions. level, and proceeds to separate them in geographical, The autonomous systems topology of the Internet.— temporal, and topical (genre) lines. The observed partition Autonomous systems (AS) are intermediary building is similar to the one obtained via the nonhierarchical model blocks of the Internet topology. They represent organiza- [32], but one finds B ¼ 971 blocks, instead of B ¼ 332 tional units that are used to control the routing of packets in with the flat version. the network. A single AS often corresponds to a network of Meta-analysis of several empirical networks.—We per- its own, which is usually owned by a private company or a form an analysis of several empirical networks shown in government body. The network analyzed here corresponds Fig. 8, which belong to a wide variety of domains and are to the traffic of information between the AS nodes, as distributed across many size scales. We use the nonhier- measured by the CAIDA project [106]. Each node in the archical stochastic block model as well as the nested network is an AS, and a directed link exists between two variant. In Figs. 8(a) and 8(b), we show the average block nodes if direct traffic has been observed between the two AS. As of September 2013, the network is composed of N ¼ 52 104 AS nodes and E ¼ 399 625 direct connections between them. The application of the nested block model to this network yields the hierarchy seen in Fig. 6, with B ¼ 191 blocks at the lowest level. The most prominent feature observed is a strong core-periphery structure, where

FIG. 7. Large-scale structure of the IMDB film-actor network. Each node in this graph represents a lowest-level block in the FIG. 6. Large-scale structure of the Internet at the autonomous hierarchy, instead of individual nodes in the graph. The size of the systems level, as obtained by the nested stochastic block model, nodes indicates the number of nodes in each group. The hierarchy displaying a prominent core-periphery architecture. The magni- branch at the top are the actors, and at the bottom are the films. fication shows the nodes that belong to the “core” top-level The labels classify each branch according to the most prominent branch, containing AS nodes spread all over the globe, as shown geographical and temporal characteristics found in the database. in the map inset. See the Supplemental Material [96] for a higher- See the Supplemental Material [96] for a higher-resolution resolution version of this figure. version of this figure.

011047-10 HIERARCHICAL BLOCK STRUCTURES AND HIGH- … PHYS. REV. X 4, 011047 (2014) sizes N=B for all networks using both models.ffiffiffiffi For the theP modularity of the inferred block structures, p 2 2 nonhierarchical version, a clear N=B ∼ E trend is Q ¼ rerr=2E − er=ð2EÞ , which measures how assorta- observed, which corresponds to the resolution limit present tive the topology is. Higher values of Q close to 1 indicate with this method and other approaches as well. In Fig. 8(b), the existence of densely connected communities. The value we show the results for the nested model, where such a of Q is the most common quantity used to detect blocks in trend can no longer be observed, and the smallest average networks, and it presumes that such assortative connections block sizes no longer seem to depend on the size of the are present. In contrast, by fitting a general stochastic block network, which serves as an empirical demonstration of model, no specific pattern is assumed, and the partition the lack of resolution limit shown previously. The values of found corresponds to the least random model that matches the description lengths themselves are also distributed in a the data. In Fig. 8, we show the values of Q obtained for the seemingly nonorganized manner [see Fig. 8(c)], i.e., no analyzed networks. Indeed, some networks are modular, general tendency for larger networks can be observed, other with high values of Q. However, one does not observe any than an increased range of possible values for larger E strong correlation of the description length and the mod- values. Any difference observed seems to be due to the ularity values. Hence, the most structured networks do not actual topological organization, rather than intrinsic necessarily possess much larger Q values, which indicate constraints imposed by the method. We also compute that the building blocks of their topological organization are not predominantly assortative communities (this is clear in some of the examples considered previously, such as the (a) (b) Internet AS topology and the IMDB network). However, for many of these networks, it is probably possible to find partitions that lead to much higher Q values. These partitions would, on the other hand, correspond to block model ensembles with a larger entropy than those inferred via maximum likelihood. Therefore, the maximization of Q in these cases would invariably discard topological (c) (d) information present in the network and provide a much simplified and possibly misleading picture of the large-scale structure of the network. Hence, it seems more appropriate to confine modularity maximization only to cases where the assortative structure is known to be the dominating pattern. However, even in these cases, methods based on statistical inference possess clear advantages, such as the lack of resolution limit, model selection guarantees, and the overall more principled nature of the approach.

VI. DISCUSSION In this paper, we present a principled method to detect hierarchical structures in networks via a nested stochastic block model. This method fully generalizes previous approaches for the detection of hierarchical community structures [43–49], since it makes no assumptions either on the actual types of large-scale structures possible (assorta- tive, dissortative, or any arbitrary mixture) or on the hierarchical form, which is not confined to binary trees or dendograms. We show that a major advantage of this approach is that it breaks the so-called resolution limit of approaches, such as modularity optimization and nonhier- archical model inference, wherepffiffiffiffi modules smaller than a characteristic size scaling with N cannot be resolved. With FIG. 8. (a) The average block size N=B obtained using the nonhierarchicalmodel, asa functionofE, for the empiricalnetworks the nested model presented, this characteristic scale is replaced by a much smaller logarithmic dependence, making listed in the bottom table (the Dir. column specifiespffiffiffiffi whether a given network is directed). The dashed line shows a E slope. (b) The it, in practice, nonexistent for many applications. This sameas(a)butwiththenestedmodel.(c)ThedescriptionlengthΣ=E increased resolution comes as a result of robust model forthenestedmodelasafunction ofE.(d)ThevalueofmodularityQ selection principles, and is integrated with the desirable as function of Σ=E, for the nested model. capacity of differentiating between noise and actual

011047-11 TIAGO P. PEIXOTO PHYS. REV. X 4, 011047 (2014) structure, and, therefore, it is not susceptible to the detection This invariably leads to the inclusion of prior probabilities of spurious communities. We show that the model is capable of observing the model parameters, PðfprsgjBÞ and of inferring the large-scale features of empirical networks in PðfprgjBÞ. Now, instead of finding the model parameters significant detail, even for very large networks. that maximize this quantity, we compute the integrated This type of approach should, in principle, also be likelihood [38,42,55], applicable to other model classes, such as those based on Z overlapping [9,83–85] or link communities [25,86]. We also PðG; fbigjBÞ¼ dprsdprPðG; fbig; fprsg; fprgjBÞ predict that it should serve as a more refined method of detecting missing information in networks [23,44],aswellas (A3) for the prediction of the network evolution [87],determining Z the more salient topological features [88,89] or large-scale dp P G b ; p ;B P p B functional summaries of the network topology [90]. ¼ rs ð jf ig f rsg Þ ðf rsgj Þ Z

ACKNOWLEDGMENTS × dprPðfbigjfprgÞPðfprgjBÞ (A4) The author would like to thank Sebastian Krause for a careful reading of the manuscript, as well as Cris Moore ¼ PðGjfbig;BÞ × PðfbigjBÞ: (A5) and Lenka Zdeborová for useful comments. This work was funded by the University of Bremen, under the funding By maximizing PðG; fbigjBÞ, instead of Eq. (A1),one line ZF04. should avoid overfitting the data, since the larger models with many parameters are dominated by a majority of choices that APPENDIX A: BAYESIAN fit the data very badly, and, hence, have a smaller contribution MODEL SELECTION in the integral of Eq. (A3). Therefore, the maximization of the integrated likelihood also corresponds to an application of In the following, we compare Bayesian model selection Occam’s razor, and one should expect it to deliver results via integrated likelihood with the MDL approach consid- compatible with MDL [53]. However, in practice, things are ered in the main text, and we show that they lead to the more nuanced, since the value of Eq. (A3) is heavily same criterion if the model constraints are equivalent. dependent on the choice of priors P prs B and For the purpose of performing BMS, we evoke the most ðf gj Þ P pr B . For the block partitions themselves, this choice usual definition of the stochastic block model ensemble, ðf gj Þ is more straightforward. Since one wants to be agnostic with where one defines as parameters the probabilities prs that r respect to what block sizes are possible, one should choose a an edge exists between two nodes belonging to blocks P p B p α α 1 and s. The posterior likelihood of observing a given graph flat prior ðf rgj Þ¼Dirichletðf rgjf rgÞ, with r ¼ , so that all counts are equally likely. The integral of Eq. (A4) is with a block partition fbig and model parameters fprsg is then computed as Y e =2   X rs ðnrnr−ersÞ=2 B PðGjfbig; fprsg;BÞ¼ prs ð1 − prsÞ : ln PðfbigjBÞ¼− ln − ln N! þ ln nr!; (A6) rs N r (A1) which is identical to the partition description length of P b B −L0 The inference procedure consists in, as before, maximizing Eq. (7),i.e.,ln ðf igj Þ¼ t . For the block probabilities, on the other hand, the this quantity with respect to the parameters fprsg and situation is more subtle. A common choice is the flat prior the block partition fbig. It is easy to see that if one P p B 1 – maximizes Eq. (A1) with respect to fprsg, one recovers ðf rsgj Þ¼ [23,38,40 42]. This choice is agnostic P G b ; p ;B −S with respect to what block structures are expected, and it is maxfprsg ln ð jf ig f rsg Þ¼ t, given in Eq. (1), so indeed these models are equivalent. However, this also practical, since the integral can be evaluated exactly does not provide a means for model selection, since [23,42], B   models with a larger number of blocks will invariably X n n posses a larger likelihood. Instead, the Bayesian model P G b ;B − r s n n 1 ln ð jf ig Þ¼ ln e þ ln ð r s þ Þ selection approach is to consider the joint probability r>s rs   PðG; fbig; fprsg; fprgjBÞ of observing not only the graph X n2 b p − r n2=2 1 but also the partition f ig, the model parameters f rsg,as ln e =2 þ ln ð r þ Þ r rr well as the parameters fprg that control the probability of each partition fbig being observed, which is given by (A7)   Y 1 X e X P b p ;B pnr : ≃ − n n H rs − B 1 n ; (A8) ðf igjf rg Þ¼ r (A2) 2 r s b n n ð þ Þ ln r r rs r s r

011047-12 HIERARCHICAL BLOCK STRUCTURES AND HIGH- … PHYS. REV. X 4, 011047 (2014) where the approximation in Eq. (A8) was made assuming equivalent to BMS when all model constraints are com- nr ≫ 1, and HbðxÞ is the binary entropy function. patible. In fact, even in the dense case, although not However, there is one important issue with this approach. quite the same, the (dense) BMS and MDL penalties are Namely, there is a strong discrepancy between the models very similar. If one assumes N ≫ B2, E ∝ N2, and equal generated by the flat prior PðfprsgjBÞ¼1 and most block sizes nr ¼ N=B, both penalties become observed empirical networks. Specifically, typical ∼BðB þ 1Þ ln N þ N ln B. Therefore, it seems that what- parameters with prs ¼ 1=2 sampled by this prior will ever differences arising from the two approaches stem Presult in dense networks with average degree hki¼ simply from nuances in the choice of prior probabilities. rsprsnrns=N ¼ N=2. However, most large empirical This comparison also allows us to interpret the nested block networks tend to be sparse, with an average degree that model as a hierarchical Bayesian approach, where the is many orders of magnitude smaller than N. Hence, as N priors PðfqrsgjBÞ are replaced by a nested sequence of becomes large, most observed networks will lie in a priors and hyperpriors, so that their integrated likelihood vanishingly small portion of the parameter space produced matches the description length defined previously. by this prior. A better choice would constrain the average degree to something closer to what is observed in the APPENDIX B: COMPARISON WITH OTHER data, but at the same time being otherwise noninformative COMMUNITY DETECTION METHODS regarding theP block structure. A choice such as In this section, we compare results obtained for synthetic PðfprsgjBÞ ∝ δð rsprsnrns − 2EÞ, where E is the num- ber of edges in the observed network, seems appropriate, networks with popular community detection methods that but the integral in Eq. (A4) becomes difficult to solve. are not based on statistical inference. Here, we focus not Instead, an easier approach is to modify the model sightly, only on the capacity of the method of finding a partition so that the average degree is implicitly constrained. Here, correlated with the planted one, but also on the number of we consider the model variant where the number of edges E blocks detected. We concentrate on two methods that have is a fixed parameter, and each sampled edge may land been reported to provide good results in synthetic bench- between any two nodes belonging to blocks r and s with marks [91], namely, the [12], based on P modularity optimization, and the Infomod method probability qrs, and we have, therefore, r≥sqrs ¼ 1. The [45,92,93], based on compression of random walks. We full posterior likelihood of this model is make use of the LFR benchmark [94], which corresponds Q E! qmrs to a specific parametrization of the degree-corrected sto- P G b ; q ;E;B Qr≥s rs ; ð jf ig f rsg Þ¼Ω e ; n m ! chastic block model [24], where both the degree distribu- ðf rsg f rgÞ r≥s rs tion and the block size distribution follow truncated power (A9) laws. Here, we employ a parametrization similar to Ref. [91], with a degree distribution following a power where Ωðfersg; fnrgÞ is, as before, the number of different −2 k 5 law with exponent and a minimum degree min ¼ , and graphs with the same block partition and edge counts, and a community size distribution also following a power law, m e r ≠ s e =2 rs ¼ rs if or rr otherwise. By maximizing but with exponent −1, and minimum block size of 50. We q Eq. (A9) with respect to f rsg, one obtains also impose the following additional restrictions: The total P G b ; q ;E;B ≃ − Ω e ; n maxfqrsg ln ð jf ig f rsg Þ ln ðf rsg f rgÞ ¼ number of blocks is always fixed at B ¼ 100, and for every −S m ≫ 1 m 0 pffiffiffiffiffi t, as long as rs or rs ¼ , so it also is equivalent node i belonging to block r,itsdegreeki cannot exceed nr, to the previous models in this limit. With this reparamet- to avoid intrinsic degree-degree correlations [50]. With this rization, the average degree remains fixed independently of parameter choice, the networks generated with N ¼ 2 × 104 the choice of prior. Therefore, we may finally use a flat possess an average degree hki ≃ 7.8. The actual block P q B q α 1 prior ðf rsgj Þ¼Dirichletðf rsgjf rs ¼ gÞ, without structure is parametrized as ers ¼ð1−cÞeres=2Eþδrscer, the risk of the graphs becoming inadvertently dense, and where c controls the assortativity: For c ¼ 1, all edges again the integrated likelihood can be computed exactly, connect nodes of the same block, and for c ¼ 0,wehavea Z fully random [107]. PðGjfbig;BÞ¼ dqrsPðGjfbig; fqrsg;E;BÞPðfqrsgjBÞ Because the different methods result in quite different numbers of detected blocks, the normalized mutual infor- (A10) mation is not the most appropriate measure of the overlap    between partitions in this case. This is due to the fact that, if B −1 ðð2ÞÞ the number of nodes is kept fixed, the NMI values tend to ¼ Ωðfersg; fnrgÞ × : (A11) E be larger simply if the number of blocks is increased, even if this larger partition is in no other way more strongly By inserting Eq. (A11) into Eq. (A5) and comparing with correlated to the true one. Another measure that is less P G; b B −Σ Eq. (10), we see that ln ð f igj Þ¼ L¼1, and we susceptible to this problem is the variation of information conclude reassuringly that the MDL approach is fully (VI) [95], defined as

011047-13 TIAGO P. PEIXOTO PHYS. REV. X 4, 011047 (2014)

VIðfxig; fyigÞ ¼ HðfxigÞ þ HðfyigÞ − 2Iðfxig; fyigÞ; c ≈ 0.4 range, and decides on a fully random B ¼ 1 (B1) structure. Above this value, the inferred value of B increases from B ¼ 1 until agreeing with the planted value for sufficiently large c values. As can also be seen in Fig. 9, where HðfxigÞ is the entropy of the partition fxig and the Louvain method exhibits the “worst of both worlds,” Iðfxig; fxigÞ is the (non-normalized) mutual information i.e., it fails to find the correct partition for all values except between fxig and fyig. A value of VI equal to zero means c ¼ 1, finding systematically smaller values of B, while at that the partitions are identical, whereas any positive value the same time finding spurious partitions below the indicates a reduced overlap between them. detectability threshold, even when the network is com- The VI values between the planted partitions and those pletely random (c ¼ 0). The Infomod method, on the other obtained with different methods for several network real- hand, seems to find partitions that are largely compatible izations of the above model are shown in Fig. 9, together with the planted one, at least for the parameter region above with the obtained number of blocks. By observing the VI c ≈ 0.6. However, for even larger values of c, this method values for the inference method with a fixed number of detects a number of blocks that is significantly larger than blocks B ¼ 100, we conclude that the strict detectability the planted value, which increases steadily as c decreases. transition (when the value of B is known) lies somewhere Hence, this method is also incapable of separating structure slightly above c ≈ 0.2. However, the model-selection pro- from noise, and finds spurious partitions far below the cedure based on the nested stochastic block model pre- detectability threshold. Thus, from the three methods sented in the main text discards any structure below the analyzed, the one described in the main text is the only one that combines the following three desirable properties: (1) optimal inference in the detectable range, (2) guarantee against overfitting and detection of spurious modules, and (3) fully nonparametric implementation. The suboptimal behavior of the modularity-based method is simply a combination of the resolution limit [14] and lack of built-in model selection based on statistical evidence [13]. It is not currently known if the Infomod method suffers from problems similar to the resolution limit, but clearly it lacks guarantees against detection of spurious modules. Although it is also based on the principle of parsimony, it tries to compress random walks taking place on the network, instead of the network itself. Apparently, the method cannot distinguish between the actual planted block structure and quenched topological fluctuations—both of which will affect random walks— and gradually transitions between the two properties in order to best describe the network dynamics. (As has been shown in Ref. [91], this problem diminishes if the average degree of the network is made sufficiently large, in which case the method finally settles in a B ¼ 1 partition for fully random graphs.) On the other hand, the method in the main text is based on maximizing the likelihood of the exact same generative process that was used to construct the FIG. 9. Top: Variation of information (VI) between the planted network, which puts it in clear advantage over the other two and obtained partitions as a function of the assortativity parameter (and, in fact, many other methods, including all those c, for networks with N ¼ 2 × 104, generated as described in the analyzed in Refs. [91,94]), in addition to including a robust text. The legend indicates results obtained with different meth- and formally motivated model-selection procedure. ods: Fitting the degree-corrected stochastic block model with a fixed number of blocks B ¼ 100 (SBM), performing model selection with the nested stochastic block model (Nested SBM), the Louvain modularity maximization method [12], and APPENDIX C: DIRECTED AND the Infomod method [45,92,93]. Bottom: The obtained number UNDIRECTED NETWORKS of blocks B as a function of c, for the same methods as in the top panel. The gray horizontal line marks the planted B ¼ 100 As mentioned in the main text, the model described is value. All results were obtained by averaging over 20 network easily generalized for directed graphs. For the ensemble realizations. entropies, we have for the undirected case [50]

011047-14 HIERARCHICAL BLOCK STRUCTURES AND HIGH- … PHYS. REV. X 4, 011047 (2014)   1 X e n nþm−1 S n n H rs ; where, as before, ðð mÞÞ ¼ ð m Þ is the number of t ¼ 2 r s b n n (C1) m n rs r s -combinations with repetitions from a set of size . For the degree-corrected model, the description length while for the directed case it reads needs to be augmented with the information necessary to   describe the degree sequence, analogously to Eq. (9) for the X e undirected case, Sd n n H rs ; t ¼ r s b n n (C2) rs r s X L L n H pr ; c ¼ t þ r ðf k−;kþ gÞ (C10) where HbðxÞ¼−x ln x − ð1 − xÞ lnð1 − xÞ is the binary r entropy function. In both cases, ers is the number of edges r s from block to (or the number of half-edges for the r where fp − þ g is the joint (in, out)-degree distribution of undirected case when r ¼ s), and nr is the number of nodes k ;k nodes belonging to block r. in block r. In the sparse limit, ers ≪ nrns, these expres- sions may be written approximately as Note that other generalizations for the directed case are possible [27], and it should be straightforward to adapt the   1 X e nested model for them as well. S ≅ E − e rs ; (C3) t 2 rs ln n n rs r s   X e Sd ≅ E − e rs : (C4) t rs ln n n rs r s [1] M. E. J. Newman, Communities, Modules, and Large- Scale Structure in Networks, Nat. Phys. 8, 25 (2011). For the degree-corrected variant with “hard” degree con- [2] S. Fortunato, Community Detection in Graphs, Phys. Rep. straints, we have 486, 75 (2010). [3] M. Girvan and M. E. J. Newman, Community Structure in   X 1 X e Social and Biological Networks, Proc. Natl. Acad. Sci. S ≅−E − N k! − e rs ; (C5) U.S.A. 99, 7821 (2002). c k ln 2 rs ln e e k rs r s [4] E. Ravasz, A. L. Somera, D. A. Mongru, Z. N. Oltvai, and X X A.-L. Barabási, Hierarchical Organization of Modularity d þ − in Metabolic Networks, Science 297, 1551 (2002). Sc ≅−E − Nkþ ln k ! − Nk− ln k ! kþ k− [5] R. J. Fletcher, Jr, A. Revell, B. E. Reichert, W. M. X   Kitchens, J. D. Dixon, and J. D. Austin, Network Modu- ers larity Reveals Critical Scales for Connectivity in Ecology − ers ln ; (C6) eþe− rs r s and Evolution, Nat. Commun. 4, 2572 (2013). P [6] R. Albert, H. Jeong, and A.-L. Barabási, Internet: Diam- eter of the World Wide Web, Nature (London) 401, 130 where er ¼ sers Pis the number ofP half-edges incident on þ − (1999). block r, and er ¼ sers and er ¼ sesr are the number r [7] S.-H. Yook, H. Jeong, and A.-L. Barabási, Modeling the of out and in edges adjacent to block , respectively. These Internet’s Large-Scale Topology, Proc. Natl. Acad. Sci. expressions are also only valid in the sparse limit, which in U.S.A. 99, 13 382 (2002). this case amounts to the following conditions, [8] Y. Zhao, E. Levina, and J. Zhu, Community Extraction for Social Networks, Proc. Natl. Acad. Sci. U.S.A. 108, 7321 k2 − k k2 − k e h ir h ir h is h is ≪ n n ; (2011). rs 2 2 r s (C7) Uncovering hkir hkis [9] G. Palla, I. Derényi, I. Farkas, and T. Vicsek, P the Overlapping Community Structure of Complex Net- l l works in Nature and Society, Nature (London) 435, 814 where hk ir ¼ i∈rki=nr [for the directed case, we simply l þ l l − l (2005). replace hk ir → hðk Þ ir and hk is → hðk Þ is in the equa- [10] M. E. J. Newman and M. Girvan, Finding and Evaluating tion above]. Unfortunately, there is no closed-form expres- Community Structure in Networks, Phys. Rev. E 69, sion for the entropy outside the sparse limit, unlike the 026113 (2004). traditional variant [50]. [11] A. Clauset, M. E. J. Newman, and C. Moore, Finding For the upper-level multigraphs, the entropies are [50] Community Structure in Very Large Networks, Phys. Rev.     E 70, 066111 (2004). X n n X nr S r s ðð 2 ÞÞ ; (C8) [12] V. D. Blondel, J.-L. Guillaume, R. Lambiotte, and m ¼ ln e þ ln e =2 E. Lefebvre, Fast Unfolding of Communities in Large r>s rs r rr Networks, J. Stat. Mech. (2008) P10008. X n n  [13] R. Guimerà, M. Sales-Pardo, and L. A. N. Amaral, Sd r s ; (C9) m ¼ ln e Modularity from Fluctuations in Random Graphs and rs rs Complex Networks, Phys. Rev. E 70, 025101 (2004).

011047-15 TIAGO P. PEIXOTO PHYS. REV. X 4, 011047 (2014)

[14] S. Fortunato and M. Barthélemy, Resolution Limit in [34] S. E. Fienberg, M. M. Meyer, and S. S. Wasserman, Stat- Community Detection, Proc. Natl. Acad. Sci. U.S.A. istical Analysis of Multiple Sociometric Relations, J. Am. 104, 36 (2007). Stat. Assoc. 80, 51 (1985). [15] A. Lancichinetti and S. Fortunato, Limits of Modularity [35] K. Faust and S. Wasserman, Block Models: Interpretation Maximization in Community Detection, Phys. Rev. E 84, and Evaluation, Soc. Networks 14, 5 (1992). 066122 (2011). [36] C. J. Anderson, S. Wasserman, and K. Faust, Building [16] B. H. Good, Y.-A. de Montjoye, and A. Clauset, Perfor- Stochastic Block Models, Soc. Networks 14, 137 mance of Modularity Maximization in Practical Contexts, (1992). Phys. Rev. E 81, 046106 (2010). [37] M. Rosvall and C. T. Bergstrom, An Information-Theoretic [17] M. B. Hastings, Community Detection as an Inference Framework for Resolving Community Structure in Com- Problem, Phys. Rev. E 74, 035102 (2006). plex Networks, Proc. Natl. Acad. Sci. U.S.A. 104, 7327 [18] D. Garlaschelli and M. I. Loffredo, Maximum Likelihood: (2007). Extracting Unbiased Information from Complex Networks, [38] J.-J. Daudin, F. Picard, and S. Robin, A Mixture Model for Phys. Rev. E 78, 015101 (2008). Rndom Graphs, Stat. Comput. 18, 173 (2008). [19] M. E. J. Newman and E. A. Leicht, Mixture Models and [39] M. Mariadassou, S. Robin, and C. Vacher, Uncovering Exploratory Analysis in Networks, Proc. Natl. Acad. Sci. Latent Structure in Valued Graphs: A Variational U.S.A. 104, 9564 (2007). Approach, Ann. Appl. Stat. 4, 715 (2010). [20] J. Reichardt and D. R. White, Role Models for Complex [40] C. Moore, X. Yan, Y. Zhu, J.-B. Rouquier, and T. Lane, Networks, Eur. Phys. J. B 60, 217 (2007). Active Learning for Node Classification in Assortative and [21] J. M. Hofman and C. H. Wiggins, Bayesian Approach to Disassortative Networks,inProceedings of the 17th ACM Network Modularity, Phys. Rev. Lett. 100, 258701 (2008). SIGKDD International Conference on Knowledge Discov- [22] P. J. Bickel and A. Chen, A Nonparametric View of ery and Data Mining (KDD ’11) (ACM, New York, 2011), Network Models and Newman-Girvan and Other pp. 841–849. Modularities, Proc. Natl. Acad. Sci. U.S.A. 106, 21068 [41] P. Latouche, E. Birmele, and C. Ambroise, Variational (2009). Bayesian Inference and Complexity Control for Stochastic [23] R. Guimerà and M. Sales-Pardo, Missing and Spurious Block Models, Stat. Model. 12, 93 (2012). Interactions and the Reconstruction of Complex Networks, [42] E. Côme and P. Latouche, Model Selection and Clustering Proc. Natl. Acad. Sci. U.S.A. 106, 22 073 (2009). in Stochastic Bock Models with the Exact Integrated [24] B. Karrer and M. E. J. Newman, Stochastic Block Models Complete Data Likelihood, arXiv:1303.2962. and Community Structure in Networks, Phys. Rev. E 83, [43] A. Clauset, C. Moore, and M. E. J. Newman, Statistical 016107 (2011). Network Analysis: Models, Issues, and New Directions, [25] B. Ball, B. Karrer, and M. E. J. Newman, Efficient Lecture Notes in Computer Science Vol. 4503, edited by E. and Principled Method for Detecting Communities in Airoldi, D. M. Blei, S. E. Fienberg, A. Goldenberg, E. P. Networks, Phys. Rev. E 84, 036103 (2011). Xing, and A. X. Zheng (Springer, Berlin, 2007), pp. 1–13. [26] J. Reichardt, R. Alamino, and D. Saad, The Interplay [44] A. Clauset, C. Moore, and M. E. J. Newman, Hierarchical between Microscopic and Mesoscopic Structures in Structure and the Prediction of Missing Links in Networks, Complex Networks, PLoS One 6, e21282 (2011). Nature (London) 453, 98 (2008). [27] Y. Zhu, X. Yan, and C. Moore, Oriented and Degree- [45] M. Rosvall and C. T. Bergstrom, Multilevel Compression Generated Block Models: Generating and Inferring of Random Walks on Networks Reveals Hierarchical Communities with Inhomogeneous Degree Distributions, Organization in Large Integrated Systems, PLoS One 6, J. Complex Netw. 2, 1 (2014). e18209 (2011). [28] E. B. Baskerville, A. P. Dobson, T. Bedford, S. Allesina, T. [46] M. Sales-Pardo, R. Guimerà, A. A. Moreira, and L. A. N. Michael Anderson, and M. Pascual, Spatial Guilds in the Amaral, Extracting the Hierarchical Organization of Serengeti Food Web Revealed by a Bayesian Group Complex Systems, Proc. Natl. Acad. Sci. U.S.A. 104, Model, PLoS Comput. Biol. 7, e1002321 (2011). 15 224 (2007). [29] A. Decelle, F. Krzakala, C. Moore, and L. Zdeborová, [47] P. Ronhovde and Z. Nussinov, Multiresolution Community Inference and Phase Transitions in the Detection of Detection for Megascale Networks by Information-Based Modules in Sparse Networks, Phys. Rev. Lett. 107, Replica Correlations, Phys. Rev. E 80, 016109 (2009). 065701 (2011). [48] I. A. Kovács, R. Palotai, M. S. Szalay, and P. Csermely, [30] A. Decelle, F. Krzakala, C. Moore, and L. Zdeborová, Community Landscapes: An Integrative Approach to Asymptotic Analysis of the Stochastic Block Model for Determine Overlapping Network Module Hierarchy, Modular Networks and Its Algorithmic Applications, Phys. Identify Key Nodes, and Predict Network Dynamics, PLoS Rev. E 84, 066106 (2011). One 5, e12528 (2010). [31] E. Mossel, J. Neeman, and A. Sly, Stochastic Block Models [49] Y. Park, C. Moore, and J. S. Bader, Dynamic Networks and Reconstruction, arXiv:1202.1499. from Hierarchical Bayesian Graph Clustering, PLoS One [32] T. P. Peixoto, Parsimonious Module Inference in Large 5, e8118 (2010). Networks, Phys. Rev. Lett. 110, 148701 (2013). [50] T. P. Peixoto, Entropy of Stochastic Block Model [33] P. W. Holland, K. B. Laskey, and S. Leinhardt, Stochastic Ensembles, Phys. Rev. E 85, 056122 (2012). Block Models: First Steps, Soc. Networks 5, 109 [51] J. Leskovec, D. Chakrabarti, J. Kleinberg, C. Faloutsos, (1983). and Z. Ghahramani, Kronecker Graphs: An Approach to

011047-16 HIERARCHICAL BLOCK STRUCTURES AND HIGH- … PHYS. REV. X 4, 011047 (2014)

Modeling Networks, J. Mach. Learn. Res. 11, 9851042 [71] J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney, (2010). Community Structure in Large Networks: Natural Cluster [52] G. Bianconi, Entropy of Network Ensembles, Phys. Rev. E Sizes and the Absence of Large Well-Defined Clusters, 79, 036114 (2009). arXiv:0810.1355. [53] P. D. Grànwald, The Minimum Description Length Prin- [72] J. Gehrke, P. Ginsparg, and J. Kleinberg, Overview of the ciple (MIT Press, Cambridge, MA, 2007). 2003 KDD Cup, SIGKDD Explor. Newsletter 5, 149 [54] J. Rissanen, Information and Complexity in Statistical (2003). Modeling, (Springer, New York, 2010), 1st ed. [73] J. Yang and J. Leskovec, Defining and Evaluating Network [55] C. Biernacki, G. Celeux, and G. Govaert, Assessing a Communities Based on Ground Truth,inProceedings of Mixture Model for Clustering with the Integrated the ACM SIGKDD Workshop on Mining Data Semantics Completed Likelihood, IEEE Trans. Pattern Anal. Mach. (MDS ’12) (ACM, New York, 2012) p. 3:1–3:8. Intell. 22, 719 (2000). [74] T. S. Evans, American College Football Network [56] X. Yan, J. E. Jensen, F. Krzakala, C. Moore, C. R. Shalizi, Files, (FigShare), http://figshare.com/articles/American_ L. Zdeborova, P. Zhang, and Y. Zhu, Model Selection for College_Football_Network_Files/93179. Degree-Corrected Block Models, arXiv:1207.3994. [75] D. J. Watts and S. H. Strogatz, Collective dynamics of [57] G. Schwarz, Estimating the Dimension of a Model, Ann. “Small-World” Networks, Nature (London) 393, 440 Stat. 6, 461 (1978). (1998). [58] H. Akaike, A New Look at the Statistical Model Identi- [76] O. Richters and T. P. Peixoto, Trust Transitivity in Social fication, IEEE Trans. Autom. Control 19, 716 (1974). Networks, PLoS One 6, e18384 (2011). [59] J. Reichardt and M. Leone, (Un)detectable Cluster Struc- [77] K. I. Goh, M. E. Cusick, D. Valle, B. Childs, M. Vidal, and ture in Sparse Networks, Phys. Rev. Lett. 101, 078701 A. L. Barabási, The Human Disease Network, Proc. Natl. (2008). Acad. Sci. U.S.A. 104, 8685 (2007). [60] D. Hu, P. Ronhovde, and Z. Nussinov, Phase Transitions [78] E. Cho, S. A. Myers, and J. Leskovec, Friendship and in Random Potts Systems and the Community Detection Mobility: User Movement in Location-Based Social Net- Problem: Spin-Glass Type and Dynamic Perspectives, works,inProceedings of the 17th ACM SIGKDD Philos. Mag. 92, 406 (2012). International Conference on Knowledge Discovery and [61] A. Condon and R. M. Karp, Algorithms for Graph Par- Data Mining, San Diego, California, USA, 2011 (KDD titioning on the Planted Partition Model, Random Struct. ’11) (ACM, New York, NY, 2011), (Ref. [40]), pp. 1082– Algorithms 18, 116 (2001). 1090. [62] J. Xiang, X. G. Hu, X. Y. Zhang, J. F. Fan, X. L. Zeng, G. [79] M. Richardson, R. Agrawal, and P. Domingos, The Y. Fu, K. Deng, and K. Hu, Multiresolution Modularity Semantic Web–ISWC 2003, in Lecture Notes in Computer Methods and Their Limitations in Community Detection, Science Vol. 2870, edited by D. Fensel, K. Sycara, and Eur. Phys. J. B 85, 1 (2012). J. Mylopoulos (Springer, Berlin, 2003), pp. 351–368. [63] T. P. Peixoto, Efficient Monte Carlo and Greedy Heuristic [80] J. Leskovec, D. Huttenlocher, and J. Kleinberg, Signed for the Inference of Stochastic Block Models, Phys. Rev. E Networks in Social Media,inProceedings of the SIGCHI 89, 012804 (2014). Conference on Human Factors in Computing Systems [64] G. Palla, L. Lovász, and T. Vicsek, Multifractal Network (CHI ’10) (ACM, New York, 2010) pp. 1361–1370. Generator, Proc. Natl. Acad. Sci. U.S.A. 107, 7640 [81] J. McAuley and J. Leskovec, Computer Vision ECCV (2010). 2012, in Lecture Notes in Computer Science Vol. 7575, [65] T. P. Peixoto, Eigenvalue Spectra of Modular Networks, edited by A. Fitzgibbon, S. Lazebnik, P. Perona, Y. Sato, Phys. Rev. Lett. 111, 098701 (2013). and C. Schmid (Springer, Berlin, 2012), pp. 828–841. [66] R. R. Nadakuditi and M. E. J. Newman, Graph Spectra [82] J. Leskovec, J. Kleinberg, and C. Faloutsos, Graphs Over and the Detectability of Community Structure in Networks, Time: Densification Laws, Shrinking diameters, and Pos- Phys. Rev. Lett. 108, 188701 (2012). sible Explanations,inProceedings of the 11th ACM [67] L. A. Adamic and N. Glance, The Political Blogosphere SIGKDD International Conference on Knowledge and the 2004 U.S. Election: Divided They Blog,in Discovery in Data Mining, Chicago, Illinois, USA, 2005 Proceedings of the 3rd International Workshop on Link (KDD ’05) (ACM, New York, NY, 2005), (Ref. [67]), Discovery (LinkKDD ’05) (ACM, New York, 2005), pp. 177–187. pp. 36–43. [83] E. M. Airoldi, D. M. Blei, S. E. Fienberg, and E. P. Xing, [68] D. Holten, Hierarchical Edge Bundles: Visualization of Mixed Membership stochastic Block Models, J. Mach. Adjacency Relations in Hierarchical Data, IEEE Trans. Learn. Res. 9, 1981 (2008). Visual. Comput. Graph. 12, 741 (2006). [84] A. Lancichinetti, S. Fortunato, and J. Kertész, Detecting [69] D. Lusseau, K. Schneider, O. J. Boisseau, P. Haase, E. the Overlapping and Hierarchical Community Structure Slooten, and S. M. Dawson, The Bottlenose Dolphin in Complex Networks, New J. Phys. 11, 033015 (2009). Community of Doubtful Sound Features a Large Propor- [85] P. K. Gopalan and D. M. Blei, Efficient discovery of tion of Long-Lasting Associations, Behav. Ecol. Socio- Overlapping Communities in Massive Networks, Proc. biol., 54, 396 (2003). Natl. Acad. Sci. U.S.A. 110, 14 534 (2013). [70] J. Leskovec, J. Kleinberg, and C. Faloutsos, Graph [86] Y.-Y. Ahn, J. P. Bagrow, and S. Lehmann, Link Commun- Evolution: Densification and Shrinking Diameters, ities Reveal Multiscale Complexity in Networks, Nature ACM Trans. Knowl. Discov. Data 1, 2 (2007). (London) 466, 761 (2010).

011047-17 TIAGO P. PEIXOTO PHYS. REV. X 4, 011047 (2014)

[87] D. Liben-Nowell and J. Kleinberg, The Link-Prediction still have that Lt < Lc, since the degree-corrected version Problem for Social Networks, J. Am. Soc. Inf. Sci. still needs to include the information on the degree Technol. 58, 1019 (2007). sequence, as in Eq. (9). [88] D. Grady, C. Thiemann, and D. Brockmann, Robust [100] The normalized mutual information (NMI) is Classification of Salient Links in Complex Networks, defined as 2IðfbPig; fcigÞ=½HðfbigÞ þ HðfcigÞ, where Nat. Commun. 3, 864 (2012). Iðfbig; fcigÞ ¼ Prspbcðr; sÞ ln ½pbcðr; sÞ=pbðrÞpcðrÞ [89] G. Bianconi, P. Pin, and M. Marsili, Assessing the and HðfxigÞ ¼ − rpxðrÞ ln pxðrÞ, where fbig and fcig Relevance of Node Features for Network Structure, Proc. are two partitions of the network. Natl. Acad. Sci. U.S.A. 106, 11 433 (2009). [101] The fact that the NMI between the true and inferred [90] R. Guimerà and L. A. N. Amaral, Functional Cartography partitions remains slightly above zero in Fig. 2 for hki < of Complex Metabolic Networks, Nature (London) 433, 1 with the incomplete BMS criterion is a finite size effect, 895 (2005). as it tends increasingly to zero as N → ∞. On the other [91] A. Lancichinetti and S. Fortunato, Community Detection hand, according to this criterion, the inferred value of B in Algorithms: A Comparative Analysis, Phys. Rev. E 80, this region increases as N becomes larger. 056117 (2009). [102] This threshold corresponds simply to the point where it [92] M. Rosvall and C. T. Bergstrom, Maps of Random Walks becomes impossible to fully encode the block partition on Complex Networks Reveal Community Structure, Proc. in the network structure, i.e., for uniform blocks Natl. Acad. Sci. U.S.A. 105, 1118 (2008). −E ln B þ N ln B ¼ 0, which leads to E ¼ N and, hence, [93] M. Rosvall, D. Axelsson, and C. T. Bergstrom, The Map hki¼2. Equation, Eur. Phys. J. Spec. Top. 178, 13 (2009). [103] This limit cannot be significantly changed even if one [94] A. Lancichinetti, S. Fortunato, and F. Radicchi, Benchmark introduces scale parameters to the definition of modularity Graphs for Testing Community Detection Algorithms, [15,62]. Phys. Rev. E 78, 046110 (2008). [104] In the model selection context, adding a single edge [95] M. Meilă, Learning Theory and Kernel Machines,in between the blocks is not a necessary condition for the Lecture Notes in Computer Science Vol. 2777, edited observation of the resolution limit, and has a negligible by B. Schölkopf and M. K. Warmuth (Springer, Berlin, effect, differently from the modularity approach, where it is 2003), pp. 173–187. a deciding factor. [96] See Supplemental Material at http://link.aps.org/ [105] An efficient and fully documented C++ implementation of supplemental/10.1103/PhysRevX.4.011047 for high reso- the algorithm described here is freely available as part lution versions of Figs. 6 and 7, and additional information of the graph-tool Python library at http://graph‑tool on the internet data. .skewed.de. [97] This specification generalizes other hierarchical construc- [106] IPv4 Routed /24 AS Links Dataset, http://www.caida tions in a straightforward manner. For instance, the .org/data/active/ipv4‑routed‑topology‑aslinks‑dataset.xml. generative model of Refs. [43,44] can be recovered as a [107] Note that this is slightly different than in Ref. [94], which special case by forcing a binary tree hierarchy, terminating parametrized the fraction of internal and external degrees at the individual nodes, and a strictly assortative modular via a local mixing parameter μ, which is the same for all structure. A similar argument holds for the variant of communities. That choice corresponds to a different para- Ref. [51] as well. metrization of the degree-corrected block model than the [98] In Ref. [32] the degree sequenceP entropy was taken to be one used here. However, since the blocks have different r NHðfpkgÞ,withpk ¼ rnrpk=N, which implicitly as- sizes, and the degrees are approximately the same in all sumed that the degrees are uncorrelated with the block blocks, in general, there is no choice of μ that would allow partitions, and, hence, should be interpreted only as an upper one to recover the fully random configuration model, since bound to the actual description length given by Eq. (9). the intrinsic mixing would be different for each block in [99] Note that MDL can still be used to select the simpler model this case. Because of this, we have opted for the para- in this case: Although the complete description length Σ metrization used here; however, this should not alter the will be asymptotically the same with both models for interpretation of the benchmark and the comparison with networks sampled from the traditional block model, we Ref. [94] in a significant way.

011047-18