arXiv:0811.0484v1 [stat.ML] 4 Nov 2008

Hierarchical structure and the prediction of missing links in networks∗

Aaron Clauset,¹,² Cristopher Moore,¹,²,³ and M. E. J. Newman²,⁴

¹ Department of Computer Science, University of New Mexico, Albuquerque, NM 87131, USA
² Santa Fe Institute, 1399 Hyde Park Road, Santa Fe NM, 87501, USA
³ Department of Physics and Astronomy, University of New Mexico, Albuquerque, NM 87131, USA
⁴ Department of Physics and Center for the Study of Complex Systems, University of Michigan, Ann Arbor, MI 48109, USA

∗ This paper was published as Nature 453, 98–101 (2008); doi:10.1038/nature06830.

Networks have in recent years emerged as an invaluable tool for describing and quantifying complex systems in many branches of science [1, 2, 3]. Recent studies suggest that networks often exhibit hierarchical organization, where vertices divide into groups that further subdivide into groups of groups, and so forth over multiple scales. In many cases these groups are found to correspond to known functional units, such as ecological niches in food webs, modules in biochemical networks (protein interaction networks, metabolic networks, or genetic regulatory networks), or communities in social networks [4, 5, 6, 7]. Here we present a general technique for inferring hierarchical structure from network data and demonstrate that the existence of hierarchy can simultaneously explain and quantitatively reproduce many commonly observed topological properties of networks, such as right-skewed degree distributions, high clustering coefficients, and short path lengths. We further show that knowledge of hierarchical structure can be used to predict missing connections in partially known networks with high accuracy, and for more general network structures than competing techniques [8]. Taken together, our results suggest that hierarchy is a central organizing principle of complex networks, capable of offering insight into many network phenomena.

A great deal of recent work has been devoted to the study of clustering and community structure in networks [5, 6, 9, 10, 11]. Hierarchical structure goes beyond simple clustering, however, by explicitly including organization at all scales in a network simultaneously. Conventionally, hierarchical structure is represented by a tree or dendrogram in which closely related pairs of vertices have lowest common ancestors that are lower in the tree than those of more distantly related pairs (see Fig. 1). We expect the probability of a connection between two vertices to depend on their degree of relatedness. Structure of this type can be modelled mathematically using a probabilistic approach in which we endow each internal node $r$ of the dendrogram with a probability $p_r$ and then connect each pair of vertices for whom $r$ is the lowest common ancestor independently with probability $p_r$ (Fig. 1).

This model, which we call a hierarchical random graph, is similar in spirit (although different in realization) to the tree-based models used in some studies of network search and navigation [12, 13]. Like most work on community structure, it assumes that communities at each level of organization are disjoint. Overlapping communities have occasionally been studied (see, for example [14]) and could be represented using a more elaborate probabilistic model, but as we discuss below the present model already captures many of the structural features of interest.

FIG. 1: A hierarchical network with structure on many scales and the corresponding hierarchical random graph. Each internal node $r$ of the dendrogram is associated with a probability $p_r$ that a pair of vertices in the left and right subtrees of that node are connected. (The shades of the internal nodes in the figure represent the probabilities.)

Network       ⟨k⟩_real  ⟨k⟩_samp  C_real  C_samp     d_real  d_samp
T. pallidum   4.8       3.7(1)    0.0625  0.0444(2)  3.690   3.940(6)
Terrorists    4.9       5.1(2)    0.361   0.352(1)   2.575   2.794(7)
Grassland     3.0       2.9(1)    0.174   0.168(1)   3.29    3.69(2)

TABLE I: Comparison of network statistics for the three example networks studied and new networks generated by resampling from our hierarchical model. The generated networks closely match the average degree ⟨k⟩, clustering coefficient C, and average vertex-vertex distance d in each case, suggesting that they capture much of the real networks' structure. Parenthetical values indicate standard errors on the final digits.

Given a dendrogram and a set of probabilities $p_r$, the hierarchical random graph model allows us to generate artificial networks with a specified hierarchical structure, a procedure that might be useful in certain situations. Our goal here, however, is a different one. We would like to detect and analyze the hierarchical structure, if any, of networks in the real world. We accomplish this by fitting the hierarchical model to observed network data using the tools of statistical inference, combining a maximum likelihood approach [15] with a Monte Carlo sampling algorithm [16] on the space of all
possible dendrograms. This technique allows us to sample hierarchical random graphs with probability proportional to the likelihood that they generate the observed network. To obtain the results described below we combine information from a large number of such samples, each of which is a reasonably likely model of the data.

The success of this approach relies on the flexible nature of our hierarchical model, which allows us to fit a wide range of network structures. The traditional picture of communities or modules in a network, for example, corresponds to connections that are dense within groups of vertices and sparse between them, a behaviour called "assortativity" in the literature [17]. The hierarchical random graph can capture behaviour of this kind using probabilities $p_r$ that decrease as we move higher up the tree. Conversely, probabilities that increase as we move up the tree correspond to "disassortative" structures in which vertices are less likely to be connected on small scales than on large ones. By letting the $p_r$ values vary arbitrarily throughout the dendrogram, the hierarchical random graph can capture both assortative and disassortative structure, as well as arbitrary mixtures of the two, at all scales and in all parts of the network.

FIG. 2: Application of the hierarchical decomposition to the network of grassland species interactions. a, Consensus dendrogram reconstructed from the sampled hierarchical models. b, A visualization of the network in which the upper few levels of the consensus dendrogram are shown as boxes around species (plants, herbivores, parasitoids, hyper-parasitoids and hyper-hyper-parasitoids are shown as circles, boxes, down triangles, up triangles and diamonds respectively). Note that in several cases, a set of parasitoids is grouped into a disassortative community by the algorithm, not because they prey on each other, but because they prey on the same herbivore.

To demonstrate our method we have used it to construct hierarchical decompositions of three example networks drawn from disparate fields: the metabolic network of the spirochete Treponema pallidum [18], a network of associations between terrorists [19], and a food web of grassland species [20]. To test whether these decompositions accurately capture the networks' important structural features, we use the sampled dendrograms to generate new networks, different in detail from the originals but, by definition, having similar hierarchical structure (see the Supplementary Information for more details). We find that these "resampled" networks match the statistical properties of the originals closely, including their degree distributions, clustering coefficients, and distributions of shortest path lengths between pairs of vertices, despite the fact that none of these properties is explicitly represented in the hierarchical random graph (Table I, and Fig. S3 in the Supplementary Information). Thus it appears that a network's hierarchical structure is capable of explaining a wide variety of other network features as well.

The dendrograms produced by our method are also of interest in themselves, as a graphical representation and summary of the hierarchical structure of the observed network. As discussed above, our method can generate not just a single dendrogram but a set of dendrograms, each of which is a good fit to the data. From this set we can, using techniques from phylogeny reconstruction [21], create a single consensus dendrogram, which captures the topological features that appear consistently across all or a large fraction of the dendrograms and typically represents a better summary of the network's structure than any individual dendrogram. Figure 2a shows such a consensus dendrogram for the grassland species network, which clearly reveals communities and sub-communities of plants, herbivores, parasitoids, and hyper-parasitoids.

Another application of the hierarchical decomposition is the prediction of missing interactions in networks. In many settings, the discovery of interactions in a network requires significant experimental effort in the laboratory or the field. As a result, our current pictures of many networks are substantially incomplete [22, 23, 24, 25, 26, 27, 28]. An attractive alternative to checking exhaustively for a connection between every pair of vertices in a network is to try to predict, in advance and based on the connections already observed, which vertices are most likely to be connected, so that scarce experimental resources can be focused on testing for those interactions. If our predictions are good, we can in this way reduce substantially the effort required to establish the network's topology.

The hierarchical decomposition can be used as the basis for an effective method of predicting missing interactions as follows. Given an observed but incomplete network, we generate as described above a set of hierarchical random graphs (dendrograms and the associated probabilities $p_r$) that fit that network. Then we look for pairs of vertices that have a high average probability of connection within these hierarchical random graphs but which are unconnected in the observed network. These pairs we consider the most likely candidates for missing connections. (Technical details of the procedure are given in the Supplementary Information.)
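As a rough illustration of the ranking step just described, the following Python sketch averages the connection probability of every unconnected pair over a set of sampled hierarchical random graphs and sorts the pairs by that average. It assumes each sample has already been reduced to a table of pairwise probabilities $p_{ij}$; the function name `rank_missing_links`, the input layout, and the toy numbers are illustrative assumptions, not taken from the authors' code.

```python
from itertools import combinations


def rank_missing_links(vertices, observed_edges, sampled_probs):
    """Rank unconnected vertex pairs by mean connection probability.

    vertices       -- iterable of sortable, hashable vertex labels
    observed_edges -- iterable of 2-tuples (the known connections)
    sampled_probs  -- list of dicts, one per sampled hierarchical random
                      graph, mapping frozenset({i, j}) -> p_ij
    """
    observed = {frozenset(e) for e in observed_edges}
    ranked = []
    for i, j in combinations(sorted(vertices), 2):
        pair = frozenset((i, j))
        if pair in observed:
            continue  # only unconnected pairs are candidates
        mean_p = sum(s[pair] for s in sampled_probs) / len(sampled_probs)
        ranked.append((mean_p, (i, j)))
    # Highest mean probability first; ties are broken by vertex labels.
    ranked.sort(key=lambda item: (-item[0], item[1]))
    return ranked


def table(probabilities):
    """Convert {(i, j): p} into the frozenset-keyed layout used above."""
    return {frozenset(pair): p for pair, p in probabilities.items()}


# Toy input (hypothetical numbers): four vertices, two observed edges,
# and two sampled probability tables for the unconnected pairs.
vertices = ["a", "b", "c", "d"]
observed = [("a", "b"), ("b", "c")]
sample_1 = table({("a", "c"): 0.8, ("a", "d"): 0.1, ("b", "d"): 0.1, ("c", "d"): 0.1})
sample_2 = table({("a", "c"): 0.6, ("a", "d"): 0.2, ("b", "d"): 0.2, ("c", "d"): 0.4})

ranked = rank_missing_links(vertices, observed, [sample_1, sample_2])
for mean_p, pair in ranked:
    print(pair, round(mean_p, 3))
```

Sorting the same list in increasing order, restricted to observed pairs instead of unobserved ones, would give the analogous ranking of suspected false positives.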
We demonstrate the method using our three example networks again. For each network we remove a subset of connections chosen uniformly at random and then attempt to predict, based on the remaining connections, which ones have been removed. A standard metric for quantifying the accuracy of prediction algorithms, commonly used in the medical and machine learning communities, is the AUC statistic, which is equivalent to the area under the receiver-operating characteristic (ROC) curve [29]. In the present context, the AUC statistic can be interpreted as the probability that a randomly chosen missing connection (a true positive) is given a higher score by our method than a randomly chosen pair of unconnected vertices (a true negative). Thus, the degree to which the AUC exceeds 1/2 indicates how much better our predictions are than chance. Figure 3 shows the AUC statistic for the three networks as a function of the fraction of the connections known to the algorithm. For all three networks our algorithm does far better than chance, indicating that hierarchy is a strong general predictor of missing structure. It is also instructive to compare the performance of our method to that of other methods for link prediction [8]. Previously proposed methods include assuming that vertices are likely to be connected if they have many common neighbours, if there are short paths between them, or if the product of their degrees is large. These approaches work well for strongly assortative networks such as the collaboration and citation networks [8] and for the metabolic and terrorist networks studied here (Fig. 3a,b). Indeed, for the metabolic network the shortest-path heuristic performs better than our algorithm.

[Figure 3: panels a, Terrorist association network; b, T. pallidum metabolic network; c, Grassland species network. Each panel plots AUC against the fraction of edges observed for six methods: pure chance, common neighbors, Jaccard coefficient, degree product, shortest paths, and hierarchical structure.]

FIG. 3: Comparison of link prediction methods. Average AUC statistic, i.e., the probability of ranking a true positive over a true negative, as a function of the fraction of connections known to the algorithm, for the link prediction method presented here and a variety of previously published methods.

However, these simple methods can be misleading for networks that exhibit more general types of structure. In food webs, for instance, pairs of predators often share prey species, but rarely prey on each other. In such situations a common-neighbour or shortest-path-based method would predict connections between predators where none exist. The hierarchical model, by contrast, is capable of expressing both assortative and disassortative structure and, as Fig. 3c shows, gives substantially better predictions for the grassland network. (Indeed, in Fig. 2b there are several groups of parasitoids that our algorithm has grouped together in a disassortative community, in which they prey on the same herbivore but not on each other.) The hierarchical method thus makes accurate predictions for a wider range of network structures than the previous methods.

In the applications above, we have assumed for simplicity that there are no false positives in our network data, i.e., that every observed edge corresponds to a real interaction. In networks where false positives may be present, however, they too could be predicted using the same approach: we would simply look for pairs of vertices that have a low average probability of connection within the hierarchical random graph but which are connected in the observed network.

The method described here could also be extended to incorporate domain-specific information, such as species morphological or behavioural traits for food webs [28] or phylogenetic or binding-domain data for biochemical networks [23], by adjusting the probabilities of edges accordingly. As the results above show, however, we can obtain good predictions even in the absence of such information, indicating that topology alone can provide rich insights.

In closing, we note that our approach differs crucially from previous work on hierarchical structure in networks [1, 4, 5, 6, 7, 9, 11, 30] in that it acknowledges explicitly that most real-world networks have many plausible hierarchical representations of roughly equal likelihood. Previous work, by contrast, has typically sought a single hierarchical representation for a given network. By sampling an ensemble of dendrograms, our approach avoids over-fitting the data and allows us to explain many common topological features, generate resampled networks with similar structure to the original, derive a clear and concise summary of a network's structure via its consensus dendrogram, and accurately predict missing connections in a wide variety of situations.

Acknowledgments: The authors thank Jennifer Dunne, Michael Gastner, Petter Holme, Mikael Huss, Mason Porter, Cosma Shalizi and Chris Wiggins for their help, and the [...] for its support. C.M. thanks the Center for the Study of Complex Systems at the University of Michigan for their hospitality while some of this work was conducted. Computer code implementing many of the analysis methods described in this paper can be found online at http://www.santafe.edu/~aaronc/randomgraphs/.

[1] S. Wasserman, K. Faust, Social Network Analysis (Cambridge University Press, Cambridge, 1994).
[2] R. Albert, A.-L. Barabási, Rev. Mod. Phys. 74, 47 (2002).
[3] M. E. J. Newman, SIAM Review 45, 167 (2003).
[4] E. Ravasz, A. L. Somera, D. A. Mongru, Z. N. Oltvai, A.-L. Barabási, Science 297, 1551 (2002).
[5] A. Clauset, M. E. J. Newman, C. Moore, Phys. Rev. E 70, 066111 (2004).
[6] R. Guimerà, L. A. N. Amaral, Nature 433, 895 (2005).
[7] M. C. Lagomarsino, P. Jona, B. Bassetti, H. Isambert, Proc. Natl. Acad. Sci. USA 104, 5516 (2007).
[8] D. Liben-Nowell, J. Kleinberg, International Conference on Information and Knowledge Management (2003). It should be noted that this paper focuses on predicting future connections given current ones, as opposed to predicting connections which exist but have not yet been observed.
[9] M. Girvan, M. E. J. Newman, Proc. Natl. Acad. Sci. USA 99, 7821 (2002).
[10] A. E. Krause, K. A. Frank, D. M. Mason, R. E. Ulanowicz, W. W. Taylor, Nature 426, 282 (2003).
[11] F. Radicchi, C. Castellano, F. Cecconi, V. Loreto, D. Parisi, Proc. Natl. Acad. Sci. USA 101, 2658 (2004).
[12] D. J. Watts, P. S. Dodds, M. E. J. Newman, Science 296, 1302 (2002).
[13] J. Kleinberg, Proceedings of the 2001 Neural Information Processing Systems Conference, T. G. Dietterich, S. Becker, Z. Ghahramani, eds. (MIT Press, Cambridge, MA, 2002).
[14] G. Palla, I. Derényi, I. Farkas, T. Vicsek, Nature 435, 814 (2005).
[15] G. Casella, R. L. Berger, Statistical Inference (Duxbury Press, Belmont, 2001).
[16] M. E. J. Newman, G. T. Barkema, Monte Carlo Methods in Statistical Physics (Clarendon Press, Oxford, 1999).
[17] M. E. J. Newman, Phys. Rev. Lett. 89, 208701 (2002).
[18] M. Huss, P. Holme, IET Systems Biology 1, 280 (2007).
[19] V. Krebs, Connections 24, 43 (2002).
[20] H. A. Dawah, B. A. Hawkins, M. F. Claridge, Journal of Animal Ecology 64, 708 (1995).
[21] D. Bryant, BioConsensus, M. Janowitz, F.-J. Lapointe, F. R. McMorris, B. Mirkin, F. Roberts, eds. (DIMACS, 2003), pp. 163–184.
[22] J. A. Dunne, R. J. Williams, N. D. Martinez, Proc. Natl. Acad. Sci. USA 99, 12917 (2002).
[23] A. Szilágyi, V. Grimm, A. K. Arakaki, J. Skolnick, Physical Biology 2, S1 (2005).
[24] E. Sprinzak, S. Sattath, H. Margalit, J. Mol. Biol. 327, 919 (2003).
[25] T. Ito, et al., Proc. Natl. Acad. Sci. USA 98, 4569 (2001).
[26] A. Lakhina, J. W. Byers, M. Crovella, P. Xie, Proc. INFOCOM (2003).
[27] A. Clauset, C. Moore, Phys. Rev. Lett. 94, 018701 (2005).
[28] N. D. Martinez, B. A. Hawkins, H. A. Dawah, B. P. Feifarek, Ecology 80, 1044 (1999).
[29] J. A. Hanley, B. J. McNeil, Radiology 143, 29 (1982).
[30] M. Sales-Pardo, R. Guimerà, A. A. Moreira, L. A. N. Amaral, Proc. Natl. Acad. Sci. USA 104, 15224 (2007).

SUPPLEMENTARY INFORMATION

APPENDIX A: HIERARCHICAL RANDOM GRAPHS

Our model for the hierarchical organization of a network is as follows.¹ Let $G$ be a graph with $n$ vertices. A dendrogram $D$ is a binary tree with $n$ leaves corresponding to the vertices of $G$. Each of the $n - 1$ internal nodes of $D$ corresponds to the group of vertices that are descended from it. We associate a probability $p_r$ with each internal node $r$. Then, given two vertices $i, j$ of $G$, the probability $p_{ij}$ that they are connected by an edge is $p_{ij} = p_r$ where $r$ is their lowest common ancestor in $D$. The combination $(D, \{p_r\})$ of the dendrogram and the set of probabilities then defines a hierarchical random graph.

Note that if a community has, say, three subcommunities, with an equal probability $p$ of connections between them, we can represent this in our model by first splitting one of these subcommunities off, and then splitting the other two. The two internal nodes corresponding to these splits would be given the same probabilities $p_r = p$. This yields three possible binary dendrograms, which are all considered equally likely.

We can think of the hierarchical random graph as a variation on the classical Erdős–Rényi random graph $G(n,p)$. As in that model, the presence or absence of an edge between any pair of vertices is independent of the presence or absence of any other edge. However, whereas in $G(n,p)$ every pair of vertices has the same probability $p$ of being connected, in the hierarchical random graph the probabilities are inhomogeneous, with the inhomogeneities controlled by the topological structure of the dendrogram $D$ and the parameters $\{p_r\}$. Many other models with inhomogeneous edge probabilities have, of course, been studied in the past. One example is a structured random graph in which there are a finite number of types of vertices with a matrix $p_{kl}$ giving the connection probabilities between them.²

¹ Computer code implementing many of the analysis methods described in this paper can be found online at www.santafe.edu/~aaronc/randomgraphs/.
² F. McSherry, "Spectral Partitioning of Random Graphs." Proc. Foundations of Computer Science (FOCS), pp. 529–537 (2001).

FIG. S1: An example network $G$ consisting of six vertices, and the likelihood of two possible dendrograms. The internal nodes $r$ of each dendrogram are labeled with the maximum-likelihood probability $p_r$, i.e., the fraction of potential edges between their left and right subtrees that exist in $G$. According to Eq. (B3), the likelihoods of the two dendrograms are $\mathcal{L}(D_1) = (1/3)(2/3)^2 \cdot (1/4)^2 (3/4)^6 = 0.00165\ldots$ and $\mathcal{L}(D_2) = (1/9)(8/9)^8 = 0.0433\ldots$ The second dendrogram is far more likely because it correctly divides the network into two highly-connected subgraphs at the first level.

APPENDIX B: FITTING THE HIERARCHICAL RANDOM GRAPH TO DATA

Now we turn to the question of finding the hierarchical random graph or graphs that best fit the observed real-world network $G$. Assuming that all hierarchical random graphs are a priori equally likely, the probability that a given model $(D, \{p_r\})$ is the correct explanation of the data is, by Bayes' theorem, proportional to the posterior probability or likelihood $\mathcal{L}$ with which that model generates the observed network.³ Our goal is to maximize $\mathcal{L}$ or, more generally, to sample the space of all models with probability proportional to $\mathcal{L}$.

³ G. Casella and R. L. Berger, "Statistical Inference." Duxbury Press, Belmont (2001).

Let $E_r$ be the number of edges in $G$ whose endpoints have $r$ as their lowest common ancestor in $D$, and let $L_r$ and $R_r$, respectively, be the numbers of leaves in the left and right subtrees rooted at $r$. Then the likelihood of the hierarchical random graph is

$$\mathcal{L}(D, \{p_r\}) = \prod_{r \in D} p_r^{E_r} (1 - p_r)^{L_r R_r - E_r} \tag{B1}$$

with the convention that $0^0 = 1$.

If we fix the dendrogram $D$, it is easy to find the probabilities $\{p_r\}$ that maximize $\mathcal{L}(D, \{p_r\})$. For each $r$, they are given by

$$p_r = \frac{E_r}{L_r R_r}, \tag{B2}$$

the fraction of potential edges between the two subtrees of $r$ that actually appear in the graph $G$. The likelihood of the dendrogram evaluated at this maximum is then

$$\mathcal{L}(D) = \prod_{r \in D} \left[ p_r^{p_r} (1 - p_r)^{1 - p_r} \right]^{L_r R_r}. \tag{B3}$$

Figure S1 shows an illustrative example, consisting of a network with six vertices.

It is often convenient to work with the logarithm of the likelihood,

$$\log \mathcal{L}(D) = - \sum_{r \in D} L_r R_r \, h(p_r), \tag{B4}$$

where $h(p) = -p \log p - (1 - p) \log(1 - p)$ is the Gibbs-Shannon entropy function. Note that each term $-L_r R_r h(p_r)$
is maximized when $p_r$ is close to 0 or to 1, i.e., when the entropy is minimized. In other words, high-likelihood dendrograms are those that partition the vertices into groups between which connections are either very common or very rare.

We now use a Markov chain Monte Carlo method to sample dendrograms $D$ with probability proportional to their likelihood $\mathcal{L}(D)$. To create the Markov chain we need to pick a set of transitions between possible dendrograms. The transitions we use consist of rearrangements of subtrees of the dendrogram as follows. First, note that each internal node $r$ of a dendrogram $D$ is associated with three subtrees: the subtrees $s, t$ descended from its two daughters, and the subtree $u$ descended from its sibling. As Figure S2 shows, there are two ways we can reorder these subtrees without disturbing any of their internal relationships. Each step of our Markov chain consists first of choosing an internal node $r$ uniformly at random (other than the root) and then choosing uniformly at random between the two alternate configurations of the subtrees associated with that node and adopting that configuration. The result is a new dendrogram $D'$. It is straightforward to show that transitions of this type are ergodic, i.e., that any pair of finite dendrograms can be connected by a finite series of such transitions.

FIG. S2: Each internal node $r$ of the dendrogram has three associated subtrees $s$, $t$, and $u$, which can be placed in any of three configurations. (Note that the topology of the dendrogram depends only on the sibling and parent relationships; the order, left to right, in which they are depicted is irrelevant.)

Once we have generated our new dendrogram $D'$ we accept or reject that dendrogram according to the standard Metropolis–Hastings rule.⁴ Specifically, we accept the transition $D \to D'$ if $\Delta \log \mathcal{L} = \log \mathcal{L}(D') - \log \mathcal{L}(D)$ is nonnegative, so that $D'$ is at least as likely as $D$; otherwise we accept the transition with probability $\exp(\Delta \log \mathcal{L}) = \mathcal{L}(D') / \mathcal{L}(D)$. If the transition is not accepted, the dendrogram remains the same on this step of the chain. The Metropolis–Hastings rule ensures detailed balance and, in combination with the ergodicity of the transitions, guarantees a limiting probability distribution over dendrograms that is proportional to the likelihood, $P(D) \propto \mathcal{L}(D)$. The quantity $\Delta \log \mathcal{L}$ can be calculated easily, since the only terms in Eq. (B4) that change from $D$ to $D'$ are those involving the subtrees $s$, $t$, and $u$ associated with the chosen node.

The Markov chain appears to converge relatively quickly, with the likelihood reaching a plateau after roughly $O(n^2)$ steps. This is not a rigorous performance guarantee, however, and indeed there are mathematical results for similar Markov chains that suggest that equilibration could take exponential time in the worst case.⁵ Still, as our results here show, the method seems to work quite well in practice. The algorithm is able to handle networks with up to a few thousand vertices in a reasonable amount of computer time.

We find that there are typically many dendrograms with roughly equal likelihoods, which reinforces our contention that it is important to sample the distribution of dendrograms rather than merely focusing on the most likely one.

⁴ M. E. J. Newman and G. T. Barkema, "Monte Carlo Methods in Statistical Physics." Clarendon Press, Oxford (1999).
⁵ E. Mossel and E. Vigoda, "Phylogenetic MCMC Are Misleading on Mixtures of Trees." Science 309, 2207 (2005).

APPENDIX C: RESAMPLING FROM THE HIERARCHICAL RANDOM GRAPH

The procedure for resampling from the hierarchical random graph is as follows.

1. Initialize the Markov chain by choosing a random starting dendrogram.

2. Run the Monte Carlo algorithm until equilibrium is reached.

3. Sample dendrograms at regular intervals thereafter from those generated by the Markov chain.

4. For each sampled dendrogram $D$, create a resampled graph $G'$ with $n$ vertices by placing an edge between each of the $n(n - 1)/2$ vertex pairs $(i, j)$ with independent probability $p_{ij} = p_r$, where $r$ is the lowest common ancestor of $i$ and $j$ in $D$ and $p_r$ is given by Eq. (B2). (In principle, there is nothing to prevent us from generating many resampled graphs from a dendrogram, but in the calculations described in this paper we generate only one from each dendrogram.)

After generating many samples in this way, we can compute averages of network statistics such as the degree distribution, the clustering coefficient, the vertex-vertex distance distribution, and so forth. Thus, in a way similar to Bayesian model averaging,⁶ we can estimate the distribution of network statistics defined by the equilibrium ensemble of dendrograms.

For the construction of consensus dendrograms such as the one shown in Fig. 2a, we found it useful to weight the most likely dendrograms more heavily, giving them weight proportional to the square of their likelihood, in order to extract a coherent consensus structure from the equilibrium set of models.

⁶ T. Hastie, R. Tibshirani and J. Friedman, "The Elements of Statistical Learning." Springer, New York (2001).
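To make Eqs. (B2) and (B3) and step 4 of the resampling procedure concrete, here is a minimal Python sketch. It represents a dendrogram as nested 2-tuples, which is an illustrative choice rather than the authors' data structure. The six-vertex edge list and the two tree shapes are assumptions chosen to be consistent with the probabilities quoted in the caption of Fig. S1 (two triangles joined by a single edge, here taken to be c-d); the figure itself is not recoverable from this text.

```python
import random
from itertools import product


def leaves(tree):
    """Return the leaf labels below a nested-tuple dendrogram node."""
    if not isinstance(tree, tuple):
        return [tree]
    left, right = tree
    return leaves(left) + leaves(right)


def internal_nodes(tree):
    """Yield (left_leaves, right_leaves) for every internal node r."""
    if isinstance(tree, tuple):
        left, right = tree
        yield leaves(left), leaves(right)
        yield from internal_nodes(left)
        yield from internal_nodes(right)


def ml_probabilities(tree, edges):
    """Eq. (B2): p_r = E_r / (L_r R_r) for each internal node r."""
    edge_set = {frozenset(e) for e in edges}
    result = []
    for left, right in internal_nodes(tree):
        e_r = sum(frozenset((i, j)) in edge_set for i, j in product(left, right))
        result.append((left, right, e_r / (len(left) * len(right))))
    return result


def likelihood(tree, edges):
    """Eq. (B3); Python evaluates 0.0 ** 0.0 as 1.0, matching 0^0 = 1."""
    value = 1.0
    for left, right, p in ml_probabilities(tree, edges):
        value *= (p ** p * (1 - p) ** (1 - p)) ** (len(left) * len(right))
    return value


def resample(tree, edges, rng):
    """Appendix C, step 4: connect each pair independently with p_ij = p_r.

    Every vertex pair has exactly one lowest common ancestor, so looping
    over internal nodes and their left-right pairs visits each pair once.
    """
    new_edges = []
    for left, right, p in ml_probabilities(tree, edges):
        for i, j in product(left, right):
            if rng.random() < p:
                new_edges.append((i, j))
    return new_edges


# Assumed example network: triangles a-b-c and d-e-f joined by edge c-d.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d"),
         ("d", "e"), ("d", "f"), ("e", "f")]
d1 = (("a", "b"), ("c", ("d", ("e", "f"))))   # p_r values 1/4, 1, 1/3, 1, 1
d2 = (("a", ("b", "c")), ("d", ("e", "f")))   # p_r values 1/9, 1, 1, 1, 1

print(likelihood(d1, edges))  # approximately 0.00165
print(likelihood(d2, edges))  # approximately 0.0433
print(sorted(resample(d2, edges, random.Random(0))))
```

In a full implementation the `tree` arguments would come from the Markov chain of Appendix B rather than being written by hand, and one resampled graph would be drawn per sampled dendrogram.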

[Figure S3: panel a, fraction of vertices with degree k against degree k (logarithmic axes); panel b, fraction of vertex pairs at distance d against distance d (logarithmic vertical axis).]

FIG. S3: Application of our hierarchical decomposition to the network of grassland species interactions. a, Original (blue) and resampled (red) degree distributions. b, Original and resampled distributions of vertex-vertex distances.
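The two distributions compared in Fig. S3 can be computed from an edge list along the following lines. This is a minimal sketch using only the Python standard library; the function names are illustrative, and normalizing the distance distribution over pairs that are joined by a path is an assumption, since the normalization used for the figure is not stated.

```python
from collections import Counter, deque


def adjacency(vertices, edges):
    """Build neighbour sets for an undirected graph."""
    adj = {v: set() for v in vertices}
    for i, j in edges:
        adj[i].add(j)
        adj[j].add(i)
    return adj


def degree_distribution(adj):
    """Fraction of vertices having each degree k."""
    counts = Counter(len(neighbours) for neighbours in adj.values())
    return {k: c / len(adj) for k, c in sorted(counts.items())}


def distance_distribution(adj):
    """Fraction of path-connected vertex pairs at each distance d."""
    counts = Counter()
    for source in adj:
        # Breadth-first search gives shortest-path lengths from source.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        counts.update(d for d in dist.values() if d > 0)
    total = sum(counts.values())  # every unordered pair is counted twice
    if total == 0:
        return {}
    return {d: c / total for d, c in sorted(counts.items())}


# Toy example: the path a-b-c-d.
adj = adjacency("abcd", [("a", "b"), ("b", "c"), ("c", "d")])
degrees = degree_distribution(adj)
distances = distance_distribution(adj)
print(degrees)
print(distances)
```

Applying the same two functions to the original network and to each resampled graph, then averaging the resampled results, gives the kind of comparison the figure describes.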

APPENDIX D: PREDICTING MISSING CONNECTIONS

Our algorithm for using hierarchical random graphs to predict missing connections is as follows.

1. Initialize the Markov chain by choosing a random starting dendrogram.

2. Run the Monte Carlo algorithm until equilibrium is reached.

3. Sample dendrograms at regular intervals thereafter from those generated by the Markov chain.

4. For each pair of vertices $i, j$ for which there is not already a known connection, calculate the mean probability $\langle p_{ij} \rangle$ that they are connected by averaging over the corresponding probabilities $p_{ij}$ in each of the sampled dendrograms $D$.

5. Sort these pairs $i, j$ in decreasing order of $\langle p_{ij} \rangle$ and predict that the highest-ranked ones have missing connections.

In general, we find that the top 1% of such predictions are highly accurate. However, for large networks, even the top 1% can be an unreasonably large number of candidates to check experimentally. In many contexts, researchers may want to consider using the procedure interactively, i.e., predicting a small number of missing connections, checking them experimentally, adding the results to the network, and running the algorithm again to predict additional connections.

The alternative prediction methods we compared against, which were previously investigated by Liben-Nowell and Kleinberg,⁷ consist of giving each pair $i, j$ of vertices a score, sorting pairs in decreasing order of their score, and predicting that those with the highest scores are the most likely to be connected. Several different types of scores were investigated, defined as follows, where $\Gamma(j)$ is the set of vertices connected to $j$.

1. Common neighbors: $\mathrm{score}(i, j) = |\Gamma(i) \cap \Gamma(j)|$, the number of common neighbors of vertices $i$ and $j$.

2. Jaccard coefficient: $\mathrm{score}(i, j) = |\Gamma(i) \cap \Gamma(j)| \, / \, |\Gamma(i) \cup \Gamma(j)|$, the fraction of all neighbors of $i$ and $j$ that are neighbors of both.

3. Degree product: $\mathrm{score}(i, j) = |\Gamma(i)| \, |\Gamma(j)|$, the product of the degrees of $i$ and $j$.

4. Short paths: $\mathrm{score}(i, j)$ is 1 divided by the length of the shortest path through the network from $i$ to $j$ (or zero for vertex pairs that are not connected by any path).

One way to quantify the success of a prediction method, used by previous authors who have studied link prediction problems,⁷ is the ratio between the probability that the top-ranked pair is connected and the probability that a randomly chosen pair of vertices, which do not have an observed connection between them, are connected. Figure S4 shows the average value of this ratio as a function of the percentage of the network shown to the algorithm, for each of our three networks. Even when fully 50% of the network is missing, our method predicts missing connections about ten times better than chance for all three networks. In practical terms, this means that the amount of work required of the experimenter to discover a new connection is reduced by a factor of 10, an enormous improvement by any standard. If a greater fraction of the network is known, the accuracy becomes even greater, rising as high as 200 times better than chance when only a few connections are missing.

⁷ D. Liben-Nowell and J. Kleinberg, "The link prediction problem for social networks." Proc. Internat. Conf. on Info. and Know. Manage. (2003).

We note, however, that using this ratio to judge prediction algorithms has an important disadvantage. Some missing connections are much easier to predict than others: for instance, if a network has a heavy-tailed degree distribution and we remove a randomly chosen subset of the edges, the chances are
excellent that two high-degree vertices will have a missing connection and such a connection can be easily predicted by simple heuristics such as those discussed above. The AUC statistic used in the text, by contrast, looks at an algorithm's overall ability to rank all the missing connections over nonexistent ones, not just those that are easiest to predict.

Finally, we have investigated the performance of each of the prediction algorithms on purely random (i.e., Erdős–Rényi) graphs. As expected, no method performs better than chance in this case, since the connections are completely independent random events and there is no structure to discover. We also tested each algorithm on a graph with a power-law degree distribution generated according to the configuration model.⁸ In this case, guessing that high-degree vertices are likely to be connected performs quite well, whereas the method based on the hierarchical random graph performs poorly since these graphs have no hierarchical structure to discover.

⁸ M. Molloy and B. Reed, "A critical point for random graphs with a given degree sequence." Random Structures and Algorithms 6, 161–179 (1995).

[Figure S4: panels a, Terrorist association network; b, T. pallidum metabolic network; c, Grassland species network. Each panel plots the factor of improvement over chance (logarithmic axis, 1 to 200) against the fraction of edges observed for five methods: common neighbors, Jaccard coefficient, degree product, shortest paths, and hierarchical structure.]

FIG. S4: Further comparison of link prediction algorithms. Data points represent the average ratio between the probability that the top-ranked pair of vertices is in fact connected and the corresponding probability for a randomly-chosen pair, as a function of the fraction of the connections known to the algorithm. For each network, (a, Terrorist associations; b, T. pallidum metabolites; and c, Grassland species interactions), we compare our method with simpler methods such as guessing that two vertices are connected if they share common neighbors, have a high degree product, or have a short path between them.
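The four baseline scores compared in Fig. S4 and the AUC statistic used in the main text can be sketched in Python as follows. The function names and the toy food web are illustrative assumptions, and counting tied scores as half a win is a common convention for the AUC that the text does not specify. The toy web has two predators sharing three prey, with one predator-prey link hidden, which loosely mirrors the disassortative situation discussed in the main text.

```python
from collections import deque
from itertools import combinations


def neighbours(edges):
    """Build the neighbour sets Gamma(v) from an undirected edge list."""
    gamma = {}
    for i, j in edges:
        gamma.setdefault(i, set()).add(j)
        gamma.setdefault(j, set()).add(i)
    return gamma


def common_neighbours(gamma, i, j):
    return len(gamma.get(i, set()) & gamma.get(j, set()))


def jaccard(gamma, i, j):
    union = gamma.get(i, set()) | gamma.get(j, set())
    return common_neighbours(gamma, i, j) / len(union) if union else 0.0


def degree_product(gamma, i, j):
    return len(gamma.get(i, set())) * len(gamma.get(j, set()))


def inverse_shortest_path(gamma, i, j):
    """1 / (shortest-path length) via breadth-first search; 0 if no path."""
    dist = {i: 0}
    queue = deque([i])
    while queue:
        u = queue.popleft()
        if u == j:
            return 1.0 / dist[u] if dist[u] else 0.0
        for v in gamma.get(u, ()):
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return 0.0


def auc(score, gamma, removed, vertices):
    """Probability that a removed edge outscores a true non-edge.

    Assumption: ties contribute 1/2, as in the usual rank-based AUC.
    """
    removed_set = {frozenset(e) for e in removed}
    positives, negatives = [], []
    for i, j in combinations(sorted(vertices), 2):
        if j in gamma.get(i, set()):
            continue  # still observed, so not a candidate
        s = score(gamma, i, j)
        (positives if frozenset((i, j)) in removed_set else negatives).append(s)
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Toy food web (hypothetical): predators p1, p2 each eat prey h1, h2, h3.
# The link p2-h3 is hidden from the algorithm and must be predicted.
vertices = ["h1", "h2", "h3", "p1", "p2"]
observed = [("p1", "h1"), ("p1", "h2"), ("p1", "h3"), ("p2", "h1"), ("p2", "h2")]
removed = [("p2", "h3")]
gamma = neighbours(observed)

for name, score in [("common neighbours", common_neighbours),
                    ("Jaccard", jaccard),
                    ("degree product", degree_product),
                    ("shortest paths", inverse_shortest_path)]:
    print(name, auc(score, gamma, removed, vertices))
```

On this toy web the common-neighbour, Jaccard, and shortest-path scores all rank the hidden predator-prey link below every true non-edge, because they favour predator-predator and prey-prey pairs, which is the failure mode the main text describes for disassortative networks.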