LETTER doi:10.1038/nature11459
Popularity versus similarity in growing networks
Fragkiskos Papadopoulos1, Maksim Kitsak2, M. Ángeles Serrano3, Marián Boguñá3 & Dmitri Krioukov2
1Department of Electrical Engineering, Computer Engineering and Informatics, Cyprus University of Technology, 33 Saripolou Street, 3036 Limassol, Cyprus. 2Cooperative Association for Internet Data Analysis (CAIDA), University of California, San Diego (UCSD), La Jolla, California 92093, USA. 3Departament de Física Fonamental, Universitat de Barcelona, Martí i Franquès 1, 08028 Barcelona, Spain.

27 SEPTEMBER 2012 | VOL 489 | NATURE | 537 ©2012 Macmillan Publishers Limited. All rights reserved

The principle1 that ‘popularity is attractive’ underlies preferential attachment2, which is a common explanation for the emergence of scaling in growing networks. If new connections are made preferentially to more popular nodes, then the resulting distribution of the number of connections possessed by nodes follows power laws3,4, as observed in many real networks5,6. Preferential attachment has been directly validated for some real networks (including the Internet7,8), and can be a consequence of different underlying processes based on node fitness, ranking, optimization, random walks or duplication9–16. Here we show that popularity is just one dimension of attractiveness; another dimension is similarity17–24. We develop a framework in which new connections optimize certain trade-offs between popularity and similarity, instead of simply preferring popular nodes. The framework has a geometric interpretation in which popularity preference emerges from local optimization. As opposed to preferential attachment, our optimization framework accurately describes the large-scale evolution of technological (the Internet), social (trust relationships between people) and biological (Escherichia coli metabolic) networks, predicting the probability of new links with high precision. The framework that we have developed can thus be used for predicting new links in evolving networks, and provides a different perspective on preferential attachment as an emergent phenomenon.

Nodes that are similar have a higher chance of getting connected, even if they are not popular. This effect is known as homophily in social sciences17,18, and it has been observed in many real networks19–24. In the web23,24, for example, an individual creating her new homepage tends to link it not only to popular sites such as Google or Facebook, but also to not-so-popular sites that are close to her special interests, for example sites devoted to the composer Tartini or to free solo climbing. These observations suggest the introduction of a measure of attractiveness that would somehow balance popularity and similarity.

The simplest proxy for popularity is the node birth time. All other things being equal, older nodes have more chances to become popular and attract connections3,4. If nodes join the network one by one, then the node birth time is simply the node number t = 1, 2, …. To model similarity, we randomly place nodes on a circle that represents the simplest similarity space. That is, the angular distances between nodes model their similarity distances, such as the cosine similarity or any other measure22–24. The simplest way to model a balance between popularity and similarity is then to establish new connections that optimize the product between popularity and similarity. In other words, the model is simply as follows: (1) initially the network is empty; (2) at time t ≥ 1, new node t appears at a random angular position θ_t on the circle; and (3) new node t connects to a subset of existing nodes s, s < t, consisting of the m nodes with the m smallest values of product sθ_st, where m is a parameter controlling the average node degree k̄ = 2m, and θ_st is the angular distance between nodes s and t (Fig. 1a, b). At early times t ≤ m, node t connects to all the existing nodes.

This model has an interesting geometric interpretation, shown in Fig. 1c. Specifically, after mapping birth time t of a node to its radial coordinate r_t via r_t = ln t, all nodes lie not on a circle but on a plane: their polar coordinates are (r_t, θ_t). It then turns out that new nodes connect simply to the closest m nodes on the plane, except that distances are not Euclidean but hyperbolic25. The hyperbolic distance between two nodes at polar coordinates (r_s, θ_s) and (r_t, θ_t) is approximately x_st = r_s + r_t + ln(θ_st/2) = ln(s t θ_st/2). Therefore the sets of nodes s minimizing x_st or sθ_st for each t are identical. The hyperbolic distance is then nothing but a convenient single-metric representation of a combination of the two attractiveness attributes, radial popularity and angular similarity. We will use this metric extensively below.

[Figure 1, panels a–c. Panel a: θ_13 = 5π/6, θ_23 = π/3. Panel b: θ_13 = 2π/3, θ_23 = π/2. Panel c labels: nodes in a certain range of age or expected degree; connectivity perimeter of the new node; new node at (r_t, θ_t).]

Figure 1 | Geometric interpretation of popularity × similarity optimization. The nodes (dots) are numbered by their birth times, and located at random angular (similarity) coordinates. On its birth, the new circled node t in the yellow annulus connects to m old nodes s minimizing sθ_st. The new connections are shown by the thicker blue links. In a and b, t = 3 and m = 1. In a, node 3 connects to node 2 because 2θ_23 = 2π/3 < 1θ_13 = 5π/6. In b, node 3 connects to node 1 because 1θ_13 = 2π/3 < 2θ_23 = π. In c, an optimization-driven network with m = 3 is simulated for up to 20 nodes. The radial (popularity) coordinate of new node t = 20 is r_t = ln t, shown by the long thick arrow. This node connects to the three hyperbolically closest nodes. The red shape marks the set of points located at hyperbolic distances less than r_t from the new node. Arrows on dots show all nodes drifting away from the crossed origin, emulating popularity fading as explained in the text. The drift speed in the network shown corresponds to the degree distribution exponent γ = 2.1. The outer green circle shows the current network boundary of radius r_t = ln t expanding with time t as indicated by green arrows.

The networks grown as described may seem to have nothing in common with preferential attachment (PA)2–4. Yet we show in Fig. 2a that the probability Π(k) that an existing node of degree k attracts a connection from a new node is the same linear function of k in the described model and in PA. It is not surprising then that the degree distributions in PA and in our model are the same power laws. In Supplementary Information section IV, we prove that the exponent γ of this power law approaches 2. Preferential attachment thus emerges as a process originating from optimization trade-offs between popularity and similarity.

However, there are crucial differences between such optimization and PA. In the latter, new nodes connect with the same probability Π(k) to any nodes of degree k in the network. In the former, new nodes connect only to specific subsets of such k-degree nodes that are closest to the new node along the similarity dimension θ (Fig. 1c). To quantify, we compare in Fig. 2b the probability of connection between a pair of nodes as a function of their hyperbolic distance in the two cases. We see that close nodes are almost always connected in the optimization model, whereas in PA the probability of their connection is lower by an order of magnitude. On the other hand, nodes that are far apart are never connected in the optimization model, whereas they can be connected in PA. These differences manifest themselves in the strength of clustering, which is the probability that two nodes connected to the same node are also connected to each other. In PA, clustering is asymptotically zero26, whereas it is strong in many real networks5,6. We show in Supplementary Information section IV that the described optimization model leads to clustering that is the strongest possible for networks with a given degree distribution.

[Figure 2, panels a and b. Panel a: attraction probability versus degree, with the guide Π(k) ∝ k; legend: popularity × similarity, PA, theory. Panel b: connection probability versus hyperbolic distance; legend: popularity × similarity, PA.]

The strength of clustering and the power-law exponent can both be adjusted to arbitrary values via the following model modifications. We first consider the effect of popularity fading, observed in many real networks27,28. We note that the closer the node to the centre in Fig. 1c, the more popular it is: the higher its degree, and the more new connections it attracts, which explains why PA emerges in the model. Therefore to model popularity fading, we let all nodes drift away from the centre such that the radial coordinate of node s at time t > s is increasing as r_s(t) = βr_s + (1 − β)r_t, where r_s = ln s and r_t = ln t, and parameter β ∈ [0, 1]. This modification is identical to minimizing s^β θ_st (or s^b θ_st^a with β = b/a) instead of sθ_st. It changes the power-law exponent to γ = 1 + 1/β ≥ 2. If β = 1, the nodes do not move and γ = 2. If β = 0, all nodes move with the maximum speed, always lying on the circle of radius r_t, while the network degenerates to a random geometric graph growing on the circle. PA emerges at any γ = 1 + 1/β since the attraction probability Π(k) is a linear function of degree k, Π(k) ∝ k + m(γ − 2), the same as in PA4. We prove these statements in Supplementary Information sections IV–VII, where we also show that the popular fitness model10 can be mapped to our geometric optimization framework by letting different nodes drift away with different speeds (Supplementary Information section V).

Because the strongest clustering is due to connections to the closest nodes, to weaken clustering we allow connections to more-distant nodes. Connecting to the m closest nodes is approximately the same as connecting to nodes lying within distance R_t ≈ r_t (see Fig. 1c and Supplementary Information section IV, where we derive the exact expression for R_t, which controls the average degree in the network). If new nodes t establish connections to existing nodes s at hyperbolic distance x_st with probability p(x_st) = 1/{1 + exp[(x_st − R_t)/T]}, where parameter T ≥ 0 is the network temperature (see Supplementary Information sections IV and VI), then clustering is a decreasing function of temperature. That is, temperature is the parameter controlling clustering in the network. At zero temperature, the connection probability p(x_st) is either 1 or 0 depending on whether distance x_st is less or greater than R_t, so that we recover the strongest clustering case above, where new nodes connect only to the closest existing nodes. Clustering gradually decreases to zero at T = 1, and remains asymptotically zero for any T ≥ 1 (Supplementary Information sections IV, VI). At high temperatures T → ∞, the model degenerates either to growing random graphs, or, remarkably, to standard PA (Supplementary Information section VII).

To investigate if similarity shapes the structure and dynamics of real networks as our model predicts, we consider a series of historical snapshots of the Internet, the E. coli metabolic network, and the social network of trust relationships between people, also known as the web of trust (WoT). The first two networks are disassortative (nodes of dissimilar degrees are connected with a higher probability), while the third is assortative (nodes of similar degrees are connected with a higher probability), and its degree distribution deviates from a power law. We map these networks to their popularity × similarity spaces (Methods Summary).
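The growth rule, the popularity-fading drift and the temperature-dependent connection probability described in the text can be illustrated with a short sketch. This is not the authors' code: the function names (`angular_distance`, `hyperbolic_distance`, `grow_network`, `connection_probability`) are hypothetical, the distance is the approximation x_st = r_s + r_t + ln(θ_st/2) quoted in the text rather than the exact hyperbolic formula, and the growth loop covers only the zero-temperature case in which each new node links to its m closest predecessors.

```python
import math
import random


def angular_distance(a, b):
    """Angular distance between two angles on the circle, in [0, pi]."""
    d = abs(a - b) % (2 * math.pi)
    return min(d, 2 * math.pi - d)


def hyperbolic_distance(r_s, r_t, theta_st):
    """Approximate distance x_st = r_s + r_t + ln(theta_st / 2)."""
    if theta_st == 0:
        return -math.inf  # coincident angles: treat as infinitely close
    return r_s + r_t + math.log(theta_st / 2)


def grow_network(n, m, beta=1.0, seed=0):
    """Grow n nodes at zero temperature; return (angles, edges).

    beta in (0, 1] sets popularity fading via
    r_s(t) = beta * ln(s) + (1 - beta) * ln(t); beta = 1 means no fading.
    """
    rng = random.Random(seed)
    angles = {}
    edges = []
    for t in range(1, n + 1):
        angles[t] = rng.uniform(0, 2 * math.pi)
        r_t = math.log(t)
        dist = {}
        for s in range(1, t):
            r_s = beta * math.log(s) + (1 - beta) * r_t  # drifted radius
            theta_st = angular_distance(angles[s], angles[t])
            dist[s] = hyperbolic_distance(r_s, r_t, theta_st)
        # Link to the m hyperbolically closest existing nodes
        # (to all of them while fewer than m exist).
        nearest = sorted(dist, key=dist.get)[:m]
        edges.extend((s, t) for s in nearest)
    return angles, edges


def connection_probability(x, R, T):
    """p(x) = 1 / (1 + exp((x - R) / T)); a step function at T = 0."""
    if T == 0:
        return 1.0 if x < R else 0.0
    z = (x - R) / T
    if z > 700:  # avoid overflow in math.exp for very distant pairs
        return 0.0
    return 1.0 / (1.0 + math.exp(z))
```

With `beta = 1`, ranking candidates by `hyperbolic_distance` is the same as ranking them by the product sθ_st, which is how the two configurations of Fig. 1a and 1b can be checked by hand.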
Figure 2 | Emergence of PA from popularity × similarity optimization. Two growing networks have been simulated up to t = 10^5 nodes, one growing according to the described optimization model, and the other according to PA. In both networks, each new node connects to m = 2 existing nodes. The γ → 2 limit is not well-defined in PA, so that γ = 2.1 is used instead as described in the text. a, The probability Π(k) that an existing node of degree k attracts a new link. The solid line is the theoretical prediction, while the dashed line is a linear function, Π(k) ∝ k. b, The probability p(x) that a pair of nodes located at hyperbolic distance x are connected. The average clustering (over all nodes) in the optimization and PA networks is c̄ = 0.83 and c̄ = 0.12, respectively.

The mapping infers the radial (popularity) and angular (similarity) coordinates for all nodes, so that we can compute the hyperbolic distances between all node pairs, and the probabilities of new connections as functions of the hyperbolic distance between corresponding nodes. These probabilities are shown in Fig. 3. In all the three networks, they are close to the theoretical predictions of our model.

[Figure 3, panels a–c, each plotting connection probability against hyperbolic distance, with an inset. Panel a: hyperbolic distance between ASs; legend: Jan–Apr 2007, Apr–Jun 2009, Jan–Apr 2007 (PA), Apr–Jun 2009 (PA), theory. Panel b: hyperbolic distance between metabolites; legend: S0–S1, S1–S2, S0–S1 (PA), S1–S2 (PA), theory. Panel c: hyperbolic distance between PGP certificates; legend: Apr–Oct 2003, Dec 2005–Dec 2006, Apr–Oct 2003 (PA), Dec 2005–Dec 2006 (PA), theory.]

Figure 3 | Popularity × similarity optimization for three different networks. a, The growing Internet; b, E. coli metabolic network; and c, pretty-good-privacy (PGP) web of trust (WoT) between people. Each plot shows the probability of connections between new and old nodes, as a function of the hyperbolic (popularity × similarity) distance x between them in the real networks (circles and squares) and in PA emulations (diamonds and triangles). To emulate PA, new links are disconnected from old nodes to which these links are connected in the real networks, and reconnected to old nodes according to PA. For a pair of historical network snapshots S0 (older) and S1 (newer), new nodes are the nodes present in S1 but not in S0, and old nodes are the nodes present both in S1 and S0. Each plot shows the data for two pairs of such historical snapshots. The solid curve in each plot is the theoretical connection probability in the optimization model with the parameters corresponding to a given real network. Because the probability of new connections in the real networks is close to the theoretical curves, the shown data demonstrate that these networks grow as the popularity × similarity optimization model predicts, whereas PA, accounting only for popularity, is off by orders of magnitude in predicting the connections between similar (small x) or dissimilar (large x) nodes. To quantify this inaccuracy, the insets show the ratio between the connection probabilities in PA emulations and in the real networks, that is, the ratios of the values shown by diamonds and circles, and by triangles and squares in the main plots. The x-axes in the insets are the same as in the main plots.

This finding is important for several reasons. First, it shows that real-world networks evolve as our framework predicts. Specifically, given the popularity and similarity coordinates of two nodes, they link with probability close to the theoretical value predicted by the model. The framework may thus be used for link prediction, a notoriously difficult and important problem in many disciplines29, with applications ranging from predicting protein interactions or terrorist connections to designing recommender and collaborative filtering systems30. Second, Fig. 3 directly validates our framework and its core mechanism. It is not surprising then that, as a consequence, the synthetic graphs that the model generates are remarkably similar to real networks across a range of metrics (Supplementary Information section IX), implying that the framework can be also used for accurate modelling of real network topologies. We review related work in Supplementary Information section X, and to the best of our knowledge, there is no other model that would simultaneously: (1) be simple and universal, that is, applicable to many different networks, (2) have a similarity space as its core component, (3) cast PA as an emergent phenomenon, (4) generate graphs similar to real networks across a wide range of metrics, and (5) validate the proposed growth mechanism directly. Validation is usually limited to comparing certain graph metrics, such as degree distribution, between modelled and real networks; however, this ‘validates’ a consequence of the mechanism, not the mechanism itself. Direct validation is usually difficult, because proposed mechanisms tend to incorporate many unmeasurable factors (economic or political factors in Internet evolution, for example). Our approach is no different in that it cannot measure all the factors or node attributes contributing to node similarity in any of the considered real networks. Yet, the angular distances between nodes in our approach can be considered as projections of properly weighted combinations of all such similarity factors affecting network evolution, and we can infer these distances using statistical inference methods, directly validating the growth mechanism.

To summarize, popularity is attractive, but so is similarity. Neglecting the latter would lead to severe aberrations. Within the Internet, for example, a local network in Nebraska would connect directly to a local network in Tibet, in the same way as on the web, a person not even knowing about Tartini or free solo climbing would suddenly link her page to these subjects. The probability of such dissimilar connections is very low in reality, and the stronger the similarity forces, the smaller this probability is. Neglecting the network similarity structure leads to overestimations or underestimations of the probability of dissimilar or similar connections by orders of magnitude (Fig. 3). However, one cannot tell the difference between our framework and PA by examining node degrees only. The probability that an existing node of degree k attracts a new link optimizing popularity × similarity is exactly the same linear function of k as in PA (Fig. 2a). Supplementary Fig. 1 shows that this function is indeed realized in the considered real networks, re-validating effective PA for these networks. Therefore the popularity × similarity optimization approach provides a natural geometric explanation for the following ‘dilemmas’ characteristic of PA. On the one hand, PA has been validated for many real networks, while on the other hand, it requires exogenous mechanisms to explain not only strong clustering, but also linear popularity preference, and how such preference can emerge in real networks, where nodes do not have any global information about the network structure. As PA appears as an emergent phenomenon in the framework developed here, our framework provides a simple and natural resolution to these dilemmas, and this resolution is directly validated against large-scale evolution of very different real networks.

We conclude with the observation that to know the closest nodes in the hyperbolic popularity × similarity space requires precise global information about all node locations. However, non-zero temperatures smooth out the sharp connectivity perimeter threshold in Fig. 1c, thus modelling reality where this proximity information is not precise and mixed with errors and noise. In that respect, PA is a limiting regime with similarity forces reduced to nothing but noise.

METHODS SUMMARY
To infer the radial r_i and angular θ_i coordinates for each node i in a real network snapshot with adjacency matrix a_ij, i, j = 1, 2, …, t, we use the Markov Chain Monte Carlo (MCMC) method described in detail in Supplementary Information.
Specifically, we derive there the exact relation between the expected current degree k_i of node i and its current radial coordinate r_i, which scales as k_i ∼ e^{r_t − r_i}. To infer the radial coordinates we use the same expression substituting in it the real degrees k_i of nodes instead of their expected degrees. Having inferred the radial coordinates, we then execute the Metropolis-Hastings algorithm to find the node angular coordinates that maximize likelihood L = ∏_{i<j} p(x_ij)^{a_ij} [1 − p(x_ij)]^{1 − a_ij}, where p(x_ij) = 1/[1 + e^{(x_ij − R)/T}] is the connection probability in the model, and parameters R and T are defined by the average node degree and clustering in the network via expressions in Supplementary Information section IV. Likelihood L is the probability that the network snapshot with node coordinates (r_i, θ_i), defining the hyperbolic distances x_ij between all nodes, is produced by the model. The algorithm employs an MCMC process, which finds coordinates θ_i for all i that approximately maximize L. Further details are in Supplementary Information sections II and III, where we also show that the method yields meaningful results for the considered networks, but not for a network (movie actor collaborations) to which popularity × similarity optimization does not apply.

The nodes in Fig. 3a, b and c are respectively autonomous systems (ASs), metabolites and Pretty Good Privacy (PGP) certificates associating users’ email addresses with their cryptographic keys. Parameters (R, T) used to infer the coordinates and to draw the theoretical connection probabilities are (25.2, 0.79), (14.4, 0.77) and (23, 0.59). Each panel of Fig. 3 shows data for two pairs of snapshots: Fig. 3a, January–April 2007 and April–June 2009; Fig. 3b, S0–S1 and S1–S2 defined in Supplementary Information section I; and Fig. 3c, April–October 2003 and December 2005–December 2006. The few missing data points in the empirical curves (circles and squares) indicate that there are no node pairs at the corresponding distances after the mapping, whereas extra missing points in the PA emulation curves (diamonds and triangles) indicate that all node pairs at those distances are not connected after PA emulations, meaning that the PA connection probability is zero there.

Received 25 April; accepted 26 July 2012.
Published online 12 September 2012.

1. Dorogovtsev, S., Mendes, J. & Samukhin, A. WWW and Internet models from 1955 till our days and the ‘‘popularity is attractive’’ principle. Preprint at http://arXiv.org/abs/cond-mat/0009090 (2000).
2. Barabási, A.-L. & Albert, R. Emergence of scaling in random networks. Science 286, 509–512 (1999).
3. Krapivsky, P. L., Redner, S. & Leyvraz, F. Connectivity of growing random networks. Phys. Rev. Lett. 85, 4629–4632 (2000).
4. Dorogovtsev, S. N., Mendes, J. F. F. & Samukhin, A. N. Structure of growing networks with preferential linking. Phys. Rev. Lett. 85, 4633–4636 (2000).
5. Dorogovtsev, S. N. Lectures on Complex Networks (Oxford Univ. Press, 2010).
6. Newman, M. E. J. Networks: An Introduction (Oxford Univ. Press, 2010).
7. Pastor-Satorras, R., Vázquez, A. & Vespignani, A. Dynamical and correlation properties of the internet. Phys. Rev. Lett. 87, 258701 (2001).
8. Jeong, H., Néda, Z. & Barabási, A. L. Measuring preferential attachment in evolving networks. Europhys. Lett. 61, 567–572 (2003).
9. Dorogovtsev, S. N., Mendes, J. & Samukhin, A. Size-dependent degree distribution of a scale-free growing network. Phys. Rev. E 63, 062101 (2001).
10. Bianconi, G. & Barabási, A.-L. Bose-Einstein Condensation in complex networks. Phys. Rev. Lett. 86, 5632–5635 (2001).
11. Caldarelli, G., Capocci, A., Rios, P. D. L. & Muñoz, M. A. Scale-free networks from varying vertex intrinsic fitness. Phys. Rev. Lett. 89, 258702 (2002).
12. Vázquez, A. Growing network with local rules: preferential attachment, clustering hierarchy, and degree correlations. Phys. Rev. E 67, 056104 (2003).
13. Pastor-Satorras, R., Smith, E. & Sole, R. V. Evolving protein interaction networks through gene duplication. J. Theor. Biol. 222, 199–210 (2003).
14. Fortunato, S., Flammini, A. & Menczer, F. Scale-free network growth by ranking. Phys. Rev. Lett. 96, 218701 (2006).
15. D’Souza, R. M., Borgs, C., Chayes, J. T., Berger, N. & Kleinberg, R. D. Emergence of tempered preferential attachment from optimization. Proc. Natl Acad. Sci. USA 104, 6112–6117 (2007).
16. Motter, A. E. & Toroczkai, Z. Introduction: optimization in networks. Chaos 17, 026101 (2007).
17. McPherson, M., Smith-Lovin, L. & Cook, J. M. Birds of a feather: homophily in social networks. Annu. Rev. Sociol. 27, 415–444 (2001).
18. Simşek, O. & Jensen, D. Navigating networks by using homophily and degree. Proc. Natl Acad. Sci. USA 105, 12758–12762 (2008).
19. Redner, S. How popular is your paper? An empirical study of the citation distribution. Eur. Phys. J. B 4, 131–134 (1998).
20. Watts, D. J., Dodds, P. S. & Newman, M. E. J. Identity and search in social networks. Science 296, 1302–1305 (2002).
21. Börner, K., Maru, J. T. & Goldstone, R. L. The simultaneous evolution of author and paper networks. Proc. Natl Acad. Sci. USA 101, 5266–5273 (2004).
22. Crandall, D., Cosley, D., Huttenlocher, D., Kleinberg, J. & Suri, S. in Proc. 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 2008) (eds Li, Y., Liu, B. & Sarawagi, S.) 160–168 (ACM, 2008).
23. Menczer, F. Growing and navigating the small world Web by local content. Proc. Natl Acad. Sci. USA 99, 14014–14019 (2002).
24. Menczer, F. Evolution of document networks. Proc. Natl Acad. Sci. USA 101, 5261–5265 (2004).
25. Bonahon, F. Low-Dimensional Geometry (AMS, 2009).
26. Bollobás, B. & Riordan, O. in Handbook of Graphs and Networks (eds Bornholdt, S. & Schuster, H. G.) Ch. 1, 1–34 (Wiley-VCH, 2003).
27. Adamic, L. A. & Huberman, B. A. Power-law distribution of the World Wide Web. Science 287, 2115 (2000).
28. van Raan, A. F. J. On growth, ageing, and fractal differentiation of science. Scientometrics 47, 347–362 (2000).
29. Clauset, A., Moore, C. & Newman, M. E. J. Hierarchical structure and the prediction of missing links in networks. Nature 453, 98–101 (2008).
30. Menon, A. K. & Elkan, C. in Machine Learning and Knowledge Discovery in Databases (ECML) (eds Gunopulos, D., Hofmann, T., Malerba, D. & Vazirgiannis, M.) 437–452 (Lecture Notes in Computer Science, Vol. 6912, Springer, 2011).

Supplementary Information is available in the online version of the paper.

Acknowledgements We thank C. Elkan, G. Bianconi, P. Krapivsky, S. Redner, S. Havlin, E. Stanley and A.-L. Barabási for discussions and suggestions. This work was supported by a Marie Curie International Reintegration Grant within the 7th European Community Framework Programme; MICINN Projects FIS2010-21781-C02-02 and BFU2010-21847-C02-02; Generalitat de Catalunya grant 2009SGR838; the Ramón y Cajal programme of the Spanish Ministry of Science; ICREA Academia prize 2010, funded by the Generalitat de Catalunya; NSF grants CNS-0964236, CNS-1039646, CNS-0722070; DHS grant N66001-08-C-2029; DARPA grant HR0011-12-1-0012; and Cisco Systems.

Author Contributions F.P. and D.K. planned research, performed research and wrote the paper; M.K., M.A.S. and M.B. planned and performed research. All authors discussed the results and reviewed the manuscript.

Author Information Reprints and permissions information is available at www.nature.com/reprints. The authors declare no competing financial interests. Readers are welcome to comment on the online version of the paper. Correspondence and requests for materials should be addressed to F.P. ([email protected]) or D.K. ([email protected]).
SUPPLEMENTARY INFORMATION doi:10.1038/nature
Popularity versus similarity in growing networks
Fragkiskos Papadopoulos,1 Maksim Kitsak,2 M. Ángeles Serrano,3 Marián Boguñá,3 and Dmitri Krioukov2 1Department of Electrical Engineering, Computer Engineering and Informatics, Cyprus University of Technology, 33 Saripolou Street, 3036 Limassol, Cyprus 2Cooperative Association for Internet Data Analysis (CAIDA), University of California, San Diego (UCSD), La Jolla, CA 92093, USA 3Departament de Física Fonamental, Universitat de Barcelona, Martí i Franquès 1, 08028 Barcelona, Spain
Supplementary Methods
Here we provide details on the real-world network data used in the main text to validate the popularity × similarity optimization approach. We have considered the AS Internet, the E. coli metabolic network, and the web of trust among people extracted from Pretty-Good-Privacy (PGP) data. That is, we have validated our approach against three paradigmatic real networks, from three different domains: technology, biology, and society. We also describe the network mapping method used to infer the popularity and similarity coordinates in the considered real networks, and show that this method yields meaningful results, without overfitting or other artifacts.
I. REAL-WORLD NETWORKS
A. Internet
The Internet data used in Fig. 3(a) of the main text and in Fig. S11 of Section IX is collected and prepared as follows. First, we obtain 11 lists of all the autonomous systems (ASs) observed in a collection of Border Gateway Protocol (BGP) data exactly as described in [31]. These AS lists are linearly spaced in time with the interval of three months: time t = 0 corresponds to January 2007, t = 1 is April 2007, and so on until t = 10, June 2009. We denote the obtained AS lists by L_t. For any pair of t and t′ > t, we call the ASs present both in L_t and L_t′ the old ASs, and the ASs present only in L_t′ but not in L_t are called the new ASs. The number of ASs in L_0 is 17258, while the numbers of new ASs in L_t′ with t′ = 1, 2, ..., 10 compared to t = 0 are 806, 1614, 2389, 3103, 3973, 4794, 5434, 5843, 6207, and 6426. We then take the Archipelago AS topology [32] of June 2009, available at [33], and for each t = 0, 1, ..., 10 we remove from it all ASs and their adjacent links that are not in L_t, thus obtaining a time series of historical AS topology snapshots S_t. We then map each S_t to the hyperbolic space as described in Section II, and for each t = 0, 1, ..., 9 and t′ = t + 1 we compute the empirical probability p(x) of connections between new and old ASs as a function of hyperbolic distance x between the ASs. To compute p(x), we linearly bin distance x, and show in each bin the ratio of the number of connected ASs to the total number of AS pairs located at hyperbolic distances falling within this bin. To avoid clutter, Fig. 3(a) in the main text shows the results for the first and last pairs of consecutive snapshots, i.e., S_0, S_1 and S_9, S_10. The number of new ASs in each pair is respectively 806 and 259. Similar results hold for all intermediate snapshot pairs. The figure also shows the results of PA emulations. To emulate PA in a snapshot pair, the links adjacent to a new AS are first disconnected from the old ASs to which these links are connected in reality, and then reconnected to old ASs chosen randomly with the normalized probability ∼ k + k̄(γ − 2)/2, where k is the number of connections the AS has to other old ASs, and k̄ = 5.3, γ = 2.1, taken from the Internet. The average clustering in the Internet is c̄ = 0.61, and both k̄ and c̄ are stable across the considered period. Fig. S1(a) validates effective preferential attachment for the pairs of Internet snapshots considered in Fig. 3(a) of the main text.
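The binning of the empirical connection probability p(x) and the PA emulation weights described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' pipeline: the function names and the input format (an iterable of `(distance, connected)` pairs for new–old node pairs, and a mapping from old node to its degree among old nodes) are hypothetical, and the bin width is a free parameter that the text does not specify.

```python
from collections import defaultdict


def empirical_connection_probability(pairs, bin_width):
    """Linearly bin distances; return {bin centre: connected / total}.

    pairs is an iterable of (distance, connected) for new-old node pairs.
    Bins that contain no pairs are omitted from the result.
    """
    totals = defaultdict(int)
    linked = defaultdict(int)
    for distance, connected in pairs:
        b = int(distance // bin_width)
        totals[b] += 1
        if connected:
            linked[b] += 1
    return {(b + 0.5) * bin_width: linked[b] / totals[b] for b in sorted(totals)}


def pa_attachment_probabilities(degrees, mean_degree, gamma):
    """Normalized PA weights proportional to k + mean_degree * (gamma - 2) / 2."""
    shift = mean_degree * (gamma - 2) / 2
    weights = {node: k + shift for node, k in degrees.items()}
    norm = sum(weights.values())
    return {node: w / norm for node, w in weights.items()}
```

For the Internet values quoted in the text (k̄ = 5.3, γ = 2.1), the additive shift is 0.265, so low-degree old ASs receive only a slightly larger share of links than under purely linear preference.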
B. E.coli metabolic network
We use the bipartite metabolic network representation of the E. coli metabolism from [34], reconstructed from data in the BiGG database [35, 36], iAF1260 version of the K12 MG1655 [37] strain. The bipartite representation differentiates two subsets of nodes, metabolites and reactions, mutually interconnected through unweighted and undirected links, without self-loops or dead-end reactions. Reactions that do not involve direct chemical transformations, such as diffusion and exchange reactions, are avoided, and isomer metabolites are differentiated. To enhance the resolution of the mapping procedure, currency metabolites are eliminated (h, h2o, atp, pi, adp, ppi, nad, nadh, amo, nadp, nadph), together with a few isolated reaction-metabolite pairs and reaction-metabolite-reaction triplets. This leads to a globally connected set of 1512 reactions and 1010 metabolites. Starting from this bipartite network, we construct its one-mode projection over the space of metabolites, that is, we consider only metabolites and declare two metabolites as connected if they participate in the same reaction in the original bipartite network. The resulting unipartite network of metabolites has a power-law degree distribution with exponent $\gamma = 2.5$, average degree $\bar{k} = 6.5$, and average clustering $\bar{c} = 0.48$. Empirical data for ancestral metabolic networks are not available. However, it has been argued that there exists a direct relation between the evolutionary history of metabolism and the connectivity of metabolites. The hypothesis is that metabolic networks grew by adding new metabolites, such that the most highly connected metabolites should also be the phylogenetically oldest [38–41]. Following this idea, we sorted the network of metabolites by degree to construct an ancestor core metabolic network of 460 metabolites with degrees larger than 4, and two shells including metabolites of degrees 4 and 3, respectively.
Each shell is meant to represent the addition of new metabolites in subsequent evolutionary steps. The first shell consists of 142 new metabolites and the second shell of 171 new metabolites. Time $t = 0$ corresponds to the core network $S_0$. Time $t = 1$ corresponds to the snapshot $S_1$ of the topology, consisting of the metabolites in $S_0$ and the new metabolites in the first shell. Time $t = 2$ corresponds to the snapshot $S_2$, consisting of the metabolites in $S_1$ and the new metabolites in the second shell. We map $S_0, S_1, S_2$ to the hyperbolic space and compute the empirical connection probability, following the same procedure as in the previous subsection for the Internet. As before, we also perform PA emulations. The results are shown in Fig. 3(b) in the main text and in Fig. S1(b), and are very similar to those of Figs. 3(a), S1(a). The data from this section are also used in Fig. S12 of Section IX.
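The construction of the metabolite network and its degree-based snapshots can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the authors' pipeline: the input format (a mapping from reaction to its metabolites), all function names, and the reading of each snapshot as the subgraph induced on the core plus the shells added so far are assumptions made here. Degrees are taken in the full projected network, as in the text.

```python
from itertools import combinations


def project_on_metabolites(reaction_to_metabolites):
    """One-mode projection: link two metabolites sharing a reaction."""
    adj = {}
    for metabolites in reaction_to_metabolites.values():
        unique = sorted(set(metabolites))
        for m in unique:
            adj.setdefault(m, set())  # keep metabolites with no partner
        for u, v in combinations(unique, 2):
            adj[u].add(v)
            adj[v].add(u)
    return adj


def degree_shells(adj, core_min_degree=5, shell_degrees=(4, 3)):
    """Split nodes into a high-degree core and one shell per listed degree.

    Defaults follow the text: core has degree larger than 4, then
    shells of degree 4 and degree 3.
    """
    degree = {n: len(nbrs) for n, nbrs in adj.items()}
    core = {n for n, k in degree.items() if k >= core_min_degree}
    shells = [{n for n, k in degree.items() if k == d} for d in shell_degrees]
    return core, shells


def induced(adj, nodes):
    """Subgraph of adj restricted to the given node set."""
    return {n: adj[n] & nodes for n in nodes}


def snapshots(adj, core, shells):
    """S0 is the core; each later snapshot adds one more shell."""
    nodes = set(core)
    result = [induced(adj, nodes)]
    for shell in shells:
        nodes = nodes | shell
        result.append(induced(adj, nodes))
    return result
```

With the default thresholds, `snapshots(adj, *degree_shells(adj))` yields three graphs playing the roles of $S_0$, $S_1$, and $S_2$.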
C. PGP web of trust
Pretty-Good-Privacy (PGP) is a data encryption and decryption computer program that provides cryptographic privacy and authentication for data communication [42]. The PGP web of trust is a directed network where nodes are certificates consisting of public PGP keys and owner information. A directed link in the web of trust pointing from certificate A to certificate B represents a digital signature by the owner of A endorsing the owner/public key association of B. We use temporal PGP web of trust data collected and maintained by Jörgen Cederlöf [43]. The PGP web of trust (WoT) data are analyzed as follows. We consider two pairs of WoT snapshots, each pair closely spaced in time: April 2003 and October 2003, and December 2005 and December 2006. For each of the directed graphs we form its undirected counterpart by taking into account only bi-directional trust links between the certificates. For each of the undirected counterparts we isolate its largest connected component. Then, for each pair of snapshots we identify the old and new sets of nodes. As before, the set of old nodes contains all nodes present in both snapshots, while the set of new nodes contains nodes that are present in the newer snapshot and not in the older snapshot. We refer to the obtained undirected connected subgraphs of the WoT as snapshots $S_0, S_1, S_2, S_3$, for April 2003, October 2003, December 2005, and December 2006, respectively. The numbers of nodes in $S_0, S_1, S_2, S_3$ are 14367, 17155, 23797, 26701, while the average degree $\bar{k}$ is 5.3, 6.2, 7.9, 8.1 and the average clustering is $\bar{c} = 0.47$-$0.48$. The degree distribution can be roughly approximated by a power law with exponent $\gamma = 2.1$, yet we observe some deviations from this power law at high degrees, see Fig. S13(a). We map $S_0, S_1, S_2, S_3$ to the hyperbolic space and compute the empirical connection probability, following the same procedure as in the previous two subsections. As before, we again perform PA emulations. The results are shown in Fig. 3(c) in the main text and in Fig. S1(c), and are very similar to Figs. 3(a), 3(b) and Figs. S1(a), S1(b). The data from this section are also used in Fig. S13 of Section IX. By using the PGP data as described, we strengthen the social component of the WoT, since we only
consider bi-directional signatures, i.e., pairs of users (owners of PGP keys) who have reciprocally signed each other’s keys. This filtering process increases the probability that the connected users know each other, and makes the extracted network a reliable proxy to the underlying social network. We consider the PGP WoT since it is a massive evolving unipartite graph, which represents real social relationships of trust among individuals, and for which complete historical data is available.
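The preprocessing of the WoT snapshots described in this subsection (reciprocal links only, largest connected component, old and new node sets) can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the authors' code: the edge-list input format and all function names are hypothetical, and dropping self-signatures is an assumption made here because the text does not mention them.

```python
from collections import deque


def reciprocal_graph(directed_edges):
    """Undirected graph keeping only links signed in both directions."""
    edges = set(directed_edges)
    adj = {}
    for a, b in edges:
        # Assumption: self-signatures (a == b) are ignored.
        if a != b and (b, a) in edges:
            adj.setdefault(a, set()).add(b)
            adj.setdefault(b, set()).add(a)
    return adj


def largest_component(adj):
    """Subgraph induced on the largest connected component (BFS)."""
    seen, best = set(), set()
    for start in adj:
        if start in seen:
            continue
        component = {start}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in component:
                    component.add(v)
                    queue.append(v)
        seen |= component
        if len(component) > len(best):
            best = component
    return {n: adj[n] & best for n in best}


def old_and_new(older_nodes, newer_nodes):
    """Old nodes are in both snapshots; new nodes only in the newer one."""
    older, newer = set(older_nodes), set(newer_nodes)
    return older & newer, newer - older
```

A snapshot would then be obtained as `largest_component(reciprocal_graph(edges))`, and `old_and_new` applied to the node sets of two consecutive snapshots.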
II. INFERRING THE POPULARITY AND SIMILARITY COORDINATES
Given a snapshot of a real network consisting of $t$ nodes, we use the Markov Chain Monte Carlo (MCMC) method described in [44] to compute the current radial (popularity) coordinate $r_s(t)$ and the angular (similarity) coordinate $\theta_s$ for each node $s$ in the network. In this section we briefly describe this method. See [44] for further details. Inferring the radial coordinates is relatively easy. In Section IV, we derive the exact relation between the expected current degree $k_s(t)$ of node $s$ and its current radial coordinate $r_s(t)$ in the model, $k_s(t) \sim e^{r_t - r_s(t)}$, where $r_t$ is the current radius of the hyperbolic disc. Therefore, to infer the radial coordinates in a real network, we use the same expression, substituting in it the real degrees $k_s(t)$ of nodes instead of their expected degrees. The inference of the angular coordinates is much more involved. In summary, we first measure the average degree, power-law exponent, and average clustering in the network to determine $m$, $T$, and the model parameter that sets the power-law exponent, and then execute the Metropolis-Hastings algorithm, trying to find the angular coordinates that would maximize the probability (or likelihood)
$$\mathcal{L} = \prod_{i<j} p(x_{ij})^{a_{ij}} \left[1 - p(x_{ij})\right]^{1 - a_{ij}}, \qquad (1)$$