NETWORK SCIENCE

Random Networks

Prof. Marcello Pelillo Ca’ Foscari University of Venice

a.y. 2016/17 Section 3.2

THE RANDOM NETWORK MODEL

Pál Erdős (1913-1996) and Alfréd Rényi (1921-1970)

Erdős-Rényi model (1960)

Connect with probability p

p = 1/6, N = 10, ⟨k⟩ ≈ 1.5

BOX 3.1: DEFINING RANDOM NETWORKS

There are two equivalent definitions of a random network:

G(N, L) Model: N labeled nodes are connected with L randomly placed links. Erdős and Rényi used this definition in their string of papers on random networks [2-9].

G(N, p) Model: Each pair of N labeled nodes is connected with probability p, a model introduced by Gilbert [10].

Hence the G(N, p) model fixes the probability p that two nodes are connected, while the G(N, L) model fixes the total number of links L. While in the G(N, L) model the average degree of a node is simply ⟨k⟩ = 2L/N, other network characteristics are easier to calculate in the G(N, p) model. Throughout this book we will explore the G(N, p) model, not only for the ease with which it allows us to calculate key network characteristics, but also because in real networks the number of links is rarely fixed.

Network science aims to build models that reproduce the properties of real networks. Most networks we encounter do not have the comforting regularity of a crystal lattice or the predictable radial architecture of a spider web. Rather, at first inspection they look as if they were spun randomly (Figure 2.4). Random network theory embraces this apparent randomness by constructing networks that are truly random.

From a modeling perspective a network is a relatively simple object, consisting of only nodes and links. The real challenge, however, is to decide where to place the links between the nodes so that we reproduce the complexity of a real system. In this respect the philosophy behind a random network is simple: we assume that this goal is best achieved by placing the links randomly between the nodes. That takes us to the definition of a random network (BOX 3.1):

A random network consists of N nodes where each node pair is connected with probability p.

To construct a random network G(N, p) we follow these steps:

1) Start with N isolated nodes.

2) Select a node pair and generate a random number between 0 and 1. If the number is smaller than p, connect the selected node pair with a link; otherwise leave them disconnected.

3) Repeat step (2) for each of the N(N-1)/2 node pairs.

The network obtained after this procedure is called a random graph or a random network. Two mathematicians, Pál Erdős and Alfréd Rényi, have played an important role in understanding the properties of these networks. In their honor a random network is called the Erdős-Rényi network (BOX 3.2).
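The construction above is easy to turn into code. Below is a minimal sketch in Python (standard library only); the parameters N = 12 and p = 1/6 are just the illustrative values used in Figure 3.3.

```python
# A minimal sketch of the G(N, p) construction described above.
import itertools
import random

def random_network(N, p, seed=None):
    """Return the edge list of one G(N, p) realization."""
    rng = random.Random(seed)
    edges = []
    # Step 1: start with N isolated nodes 0, ..., N-1 (no edges yet).
    # Steps 2-3: visit each of the N(N-1)/2 node pairs once and
    # connect it with probability p.
    for i, j in itertools.combinations(range(N), 2):
        if rng.random() < p:
            edges.append((i, j))
    return edges

edges = random_network(N=12, p=1/6, seed=1)
print(len(edges), "links; expected", 12 * 11 / 2 * (1 / 6))
```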

Hence ⟨L⟩ is the product of the probability p that two nodes are connected and the number of pairs we attempt to connect, which is Lmax = N(N-1)/2 (CHAPTER 2).

Using (3.2) we obtain the average degree of a random network

⟨k⟩ = \frac{2⟨L⟩}{N} = p(N-1).  (3.3)

Hence ⟨k⟩ is the product of the probability p that two nodes are connected and (N-1), which is the maximum number of links a node can have in a network of size N.

In summary, the number of links in a random network varies between realizations. Its expected value is determined by N and p. If we increase p a random network becomes denser: the average number of links increases linearly from ⟨L⟩ = 0 to Lmax, and the average degree of a node increases from ⟨k⟩ = 0 to ⟨k⟩ = N-1.


Figure 3.3 Random Networks are Truly Random

Top Row: Three realizations of a random network generated with the same parameters p = 1/6 and N = 12. Despite the identical parameters, the networks not only look different, but they have a different number of links as well (L = 8, 10, 7).

Bottom Row Three realizations of a random network with p = 1/6 and N = 100.

THE NUMBER OF LINKS IS VARIABLE

p=0.03 N=100

MATH TUTORIAL Binomial Distribution: The bottom line

Number of links in a random network

P(L): the probability of having exactly L links in a network of N nodes with connection probability p:

The maximum number of links in a network of N nodes = number of pairs of distinct nodes.

Binomial distribution:

P(L) = \binom{\binom{N}{2}}{L} p^L (1-p)^{\binom{N}{2}-L}

Number of different ways we can choose L links among all potential links:

\binom{N}{2} = \frac{N(N-1)}{2}


P(L): the probability to have a network of exactly L links:

P(L) = \binom{\binom{N}{2}}{L} p^L (1-p)^{\binom{N}{2}-L}

• The average number of links in a random graph

⟨L⟩ = \sum_{L=0}^{N(N-1)/2} L\,P(L) = p\,\frac{N(N-1)}{2}, \qquad ⟨k⟩ = \frac{2⟨L⟩}{N} = p(N-1)

• The variance of the number of links:

σ_L^2 = p(1-p)\,\frac{N(N-1)}{2}
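As a quick numerical check of these formulas, we can sample the number of links of many realizations directly from the binomial distribution and compare the sample mean and standard deviation with the predictions; this is only a sketch, and N = 100, p = 0.03 and the number of realizations are arbitrary choices.

```python
# Numerical check of <L> = p N(N-1)/2 and sigma_L^2 = p(1-p) N(N-1)/2.
import numpy as np

N, p, runs = 100, 0.03, 5000
pairs = N * (N - 1) // 2
rng = np.random.default_rng(0)

# The number of links in one realization is the sum of `pairs` independent
# Bernoulli(p) trials, so it can be sampled directly from a binomial.
L = rng.binomial(pairs, p, size=runs)

print("simulated <L> =", L.mean(), "  theory:", p * pairs)
print("simulated std =", L.std(), "  theory:", np.sqrt(p * (1 - p) * pairs))
```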

Section 3.4

Degree distribution OF A RANDOM GRAPH

P(k) = \binom{N-1}{k} p^k (1-p)^{(N-1)-k}

• \binom{N-1}{k}: the number of ways to select k neighbors from the other N-1 nodes
• p^k: the probability of having k edges
• (1-p)^{N-1-k}: the probability of missing the remaining N-1-k edges

⟨k⟩ = p(N-1), \qquad σ_k^2 = p(1-p)(N-1)

\frac{σ_k}{⟨k⟩} = \left[\frac{1-p}{p}\,\frac{1}{N-1}\right]^{1/2} \approx \frac{1}{(N-1)^{1/2}}

As the network size increases, the distribution becomes increasingly narrow: we are increasingly confident that the degree of a node is in the vicinity of ⟨k⟩.

DEGREE DISTRIBUTION OF A RANDOM GRAPH

P(k) = \binom{N-1}{k} p^k (1-p)^{(N-1)-k}, \qquad ⟨k⟩ = p(N-1), \qquad p = \frac{⟨k⟩}{N-1}

For large N and small k, we can use the following approximations:

\binom{N-1}{k} = \frac{(N-1)!}{k!\,(N-1-k)!} = \frac{(N-1)(N-2)\cdots(N-k)}{k!} \approx \frac{(N-1)^k}{k!}

\ln\left[(1-p)^{(N-1)-k}\right] = (N-1-k)\,\ln\!\left(1-\frac{⟨k⟩}{N-1}\right) \approx -(N-1-k)\,\frac{⟨k⟩}{N-1} = -⟨k⟩\left(1-\frac{k}{N-1}\right) \approx -⟨k⟩

(1-p)^{(N-1)-k} \approx e^{-⟨k⟩}, \qquad \ln(1+x) = \sum_{n=1}^{\infty}\frac{(-1)^{n+1}}{n}x^n = x - \frac{x^2}{2} + \frac{x^3}{3} - \dots \quad \text{for } |x| \le 1

P(k) = \binom{N-1}{k} p^k (1-p)^{(N-1)-k} \approx \frac{(N-1)^k}{k!}\,p^k\,e^{-⟨k⟩} = \frac{(N-1)^k}{k!}\left(\frac{⟨k⟩}{N-1}\right)^k e^{-⟨k⟩} = e^{-⟨k⟩}\,\frac{⟨k⟩^k}{k!}


POISSON DEGREE DISTRIBUTION

P(k) = \binom{N-1}{k} p^k (1-p)^{(N-1)-k}, \qquad ⟨k⟩ = p(N-1), \qquad p = \frac{⟨k⟩}{N-1}

For large N and small k, we arrive at the Poisson distribution:

P(k) = e^{-⟨k⟩}\,\frac{⟨k⟩^k}{k!}
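As a sketch of this approximation, the exact binomial P(k) can be compared with its Poisson limit numerically; the values N = 100 and ⟨k⟩ = 10 below are illustrative, and SciPy is assumed to be available.

```python
# Compare the exact binomial degree distribution with the Poisson approximation.
import numpy as np
from scipy.stats import binom, poisson

N, avg_k = 100, 10.0
p = avg_k / (N - 1)
k = np.arange(31)

exact = binom.pmf(k, N - 1, p)    # C(N-1, k) p^k (1-p)^(N-1-k)
approx = poisson.pmf(k, avg_k)    # e^{-<k>} <k>^k / k!

for ki in (5, 10, 15, 20):
    print(ki, round(exact[ki], 4), round(approx[ki], 4))
```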

DEGREE DISTRIBUTION OF A RANDOM GRAPH

P(k) = e^{-⟨k⟩}\,\frac{⟨k⟩^k}{k!}, \qquad ⟨k⟩ = 50

DEGREE DISTRIBUTION OF A RANDOM NETWORK

Exact result: binomial distribution. Large-N limit: Poisson distribution. (PDF: probability distribution function.)

Image 3.4a: Anatomy of a binomial and a Poisson degree distribution. Image 3.4b: The degree distribution is independent of the network size.

The exact form of the degree distribution of a random network is the binomial distribution (left). For N » ⟨k⟩, the binomial can be well approximated by a Poisson distribution (right). As both distributions describe the same quantity, they have the same properties, which are expressed in terms of different parameters: the binomial distribution uses p and N as its fundamental parameters, while the Poisson distribution has only one parameter, ⟨k⟩.

The degree distribution of a random network with average degree ⟨k⟩ = 50 and sizes N = 10^2, 10^3, 10^4. For N = 10^2 the degree distribution deviates significantly from the Poisson prediction (3.8), as the condition for the Poisson approximation, N » ⟨k⟩, is not satisfied. Hence for small networks one needs to use the exact binomial form of Eq. (3.7) (dotted line). For N = 10^3 and larger networks the degree distribution becomes indistinguishable from the Poisson prediction (3.8), shown as a continuous line, illustrating that for large N the degree distribution is independent of the network size. In the figure we averaged over 1,000 independently generated random networks to decrease the noise in the degree distribution.

Section 3.5

Real Networks are not Poisson

NO OUTLIERS IN A RANDOM SOCIETY

Sociologists estimate that a typical person knows about 1,000 individuals on a first name basis, prompting us to assume that ‹k› ≈ 1,000.

P(k) = e^{-⟨k⟩}\,\frac{⟨k⟩^k}{k!}

→ The most connected individual is expected to have degree k_max ≈ 1,185.
→ The least connected individual is expected to have degree k_min ≈ 816.

The probability of finding an individual with degree k > 2,000 is about 10^{-27}. Hence the chance of finding an individual with 2,000 acquaintances is so tiny that such nodes are virtually nonexistent in a random society.

→ A random society would consist mainly of average individuals, everyone having roughly the same number of friends.

→ It would lack outliers: individuals that are either highly popular or reclusive.

This surprising conclusion is a consequence of an important property of random networks: in a large random network the degree of most nodes is in the narrow vicinity of ⟨k⟩.
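A small simulation illustrates the same point: sampling node degrees from the Poisson prediction with ⟨k⟩ = 1,000 produces no outliers. This is only a sketch; the sample size of one million nodes is an arbitrary choice.

```python
# Degrees of a "random society": Poisson samples with <k> = 1,000.
import numpy as np

rng = np.random.default_rng(42)
degrees = rng.poisson(lam=1000, size=1_000_000)

print("min degree:", degrees.min())                        # typically around 850
print("max degree:", degrees.max())                        # typically around 1,150
print("fraction with k > 2000:", np.mean(degrees > 2000))  # 0.0 in practice
```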

P(k) = e^{-⟨k⟩}\,\frac{⟨k⟩^k}{k!}  (3.8)

FACING REALITY: Degree distribution of real networks

P(k) = e^{-⟨k⟩}\,\frac{⟨k⟩^k}{k!}

Section 3.6

The Evolution of a Random Network

disconnected nodes ⇒ NETWORK

How does this transition happen?


⟨k⟩ = 1 (Erdős and Rényi, 1959)

The fact that at least one link per node is necessary to have a giant component is not unexpected. Indeed, for a giant component to exist, each of its nodes must be linked to at least one other node.

It is somewhat unexpected, however, that one link is sufficient for the emergence of a giant component.

It is equally interesting that the emergence of the giant cluster is not gradual, but follows what physicists call a second order phase transition at ⟨k⟩ = 1.

SECTION 3.13 ADVANCED TOPICS 3.C: GIANT COMPONENT

In this section we introduce the argument, proposed independently by Solomonoff and Rapoport [11], and by Erdős and Rényi [2], for the emergence of a giant component at ⟨k⟩ = 1 [31].

Let us denote with u = 1 - N_G/N the fraction of nodes that are not in the giant component (GC), whose size we take to be N_G. If node i is part of the GC, it must link to another node j, which must also be part of the GC. Hence if i is not part of the GC, that could happen for two reasons:

• There is no link between i and j (probability for this is 1 - p).

• There is a link between i and j, but j is not part of the GC (probability for this is pu).

Therefore the total probability that i is not part of the GC via node j is 1 - p + pu. The probability that i is not linked to the GC via any other node is therefore (1 - p + pu)^{N-1}, as there are N - 1 nodes that could serve as potential links to the GC for node i. As u is the fraction of nodes that do not belong to the GC, for any p and N the solution of the equation

u = (1 - p + pu)^{N-1}  (3.30)

provides the size of the giant component via N_G = N(1 - u). Using p = ⟨k⟩/(N - 1) and taking the log of both sides, for ⟨k⟩ ≪ N we obtain

\ln u = (N-1)\,\ln\!\left[1 - \frac{⟨k⟩}{N-1}(1-u)\right].  (3.31)

Taking an exponential of both sides leads to u = exp[-⟨k⟩(1 - u)]. If we denote with S the fraction of nodes in the giant component, S = N_G/N, then S = 1 - u and (3.31) results in

S = 1 - e^{-⟨k⟩S}.  (3.32)

This equation provides the size of the giant component S in function of ⟨k⟩ (Figure 3.17). While (3.32) looks simple, it does not have a closed solution. We can solve it graphically by plotting the right hand side of (3.32) as a function of S for various values of ⟨k⟩. To have a nonzero solution, the obtained curve must intersect with the dotted diagonal, representing the left hand side of (3.32). For small ⟨k⟩ the two curves intersect each other only at S = 0, indicating that for small ⟨k⟩ the size of the giant component is zero. Only when ⟨k⟩ exceeds a threshold value does a non-zero solution emerge.

To determine the value of ⟨k⟩ at which we start having a nonzero solution we take a derivative of (3.32), as the phase transition point is when the r.h.s. of (3.32) has the same derivative as the l.h.s. of (3.32), i.e. when

\frac{d}{dS}\left(1 - e^{-⟨k⟩S}\right) = 1,
⟨k⟩\,e^{-⟨k⟩S} = 1.  (3.33)

Setting S = 0, we obtain that the phase transition point is at ⟨k⟩ = 1 (see also ADVANCED TOPICS 3.F).
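Since (3.32) has no closed-form solution, it can also be solved numerically. The sketch below uses simple fixed-point iteration; the listed ⟨k⟩ values are illustrative.

```python
# Solve S = 1 - exp(-<k> S) by fixed-point iteration.
import numpy as np

def giant_component_fraction(avg_k, iterations=10_000):
    S = 1.0                          # start from the "all connected" guess
    for _ in range(iterations):
        S = 1.0 - np.exp(-avg_k * S)
    return S

for avg_k in (0.5, 1.0, 1.5, 2.0, 3.0):
    print(avg_k, round(giant_component_fraction(avg_k), 3))
# <k> <= 1 drives S toward 0 (no giant component); <k> = 1.5 gives S = 0.583,
# matching Figure 3.17.
```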


Figure 3.17: Graphical Solution

(a) The three purple curves correspond to y = 1 - exp[-⟨k⟩S] for ⟨k⟩ = 0.5, 1, 1.5. The green dashed diagonal corresponds to y = S, and the intersection of the dashed and purple curves provides the solution to (3.32). For ⟨k⟩ = 0.5 there is only one intersection at S = 0, indicating the absence of a giant component. The ⟨k⟩ = 1.5 curve has a solution at S = 0.583 (green vertical line). The ⟨k⟩ = 1 curve is precisely at the critical point, representing the separation between the regime where a nonzero solution for S exists and the regime where there is only the solution at S = 0.

(b) The size of the giant component in function of ⟨k⟩ as predicted by (3.32). After [31].

EVOLUTION OF A RANDOM NETWORK

disconnected nodes ⇒ NETWORK

How does this transition happen?

Phase transitions in complex systems: liquids

Water vs. Ice

I: Subcritical (⟨k⟩ < 1) | II: Critical (⟨k⟩ = 1) | III: Supercritical (⟨k⟩ > 1) | IV: Connected (⟨k⟩ > ln N)

N=100

⟨k⟩ = 0.5, ⟨k⟩ = 1, ⟨k⟩ = 3, ⟨k⟩ = 5

I: Subcritical ⟨k⟩ < 1

p < p_c = 1/N

No giant component.

N - L isolated clusters; the cluster size distribution is exponential:

p(s) ~ s^{-3/2}\,e^{-(⟨k⟩-1)s + (s-1)\ln⟨k⟩}

The largest cluster is a tree, its size ~ ln N.

II: Critical ⟨k⟩ = 1

p = p_c = 1/N

Unique giant component: N_G ~ N^{2/3}
→ contains a vanishing fraction of all nodes, N_G/N ~ N^{-1/3}
→ Small components are trees, the GC has loops.

Cluster size distribution: p(s) ~ s^{-3/2}

A jump in the cluster size:
N = 1,000 → ln N ≈ 6.9; N^{2/3} ≈ 95
N = 7×10^9 → ln N ≈ 22; N^{2/3} ≈ 3,659,250

III: Supercritical ⟨k⟩ > 1

p > p_c = 1/N (⟨k⟩ = 3)

Unique giant component: N_G ~ (p - p_c)N
→ the GC has loops.

Cluster size distribution: exponential, p(s) ~ s^{-3/2}\,e^{-(⟨k⟩-1)s + (s-1)\ln⟨k⟩}

IV: Connected ⟨k⟩ > ln N, p > (ln N)/N

⟨k⟩ = 5

Only one cluster: N_G = N
→ the GC is dense.
Cluster size distribution: none.
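The four regimes can also be seen in a direct simulation: for each ⟨k⟩ we generate a G(N, p) network with p = ⟨k⟩/(N-1) and measure its largest component. This is a sketch using networkx; N = 1,000 and the ⟨k⟩ values are arbitrary choices.

```python
# Size of the largest component across the four regimes.
import networkx as nx

N = 1000
for avg_k in (0.5, 1.0, 3.0, 5.0, 8.0):
    G = nx.gnp_random_graph(N, avg_k / (N - 1), seed=7)
    largest = max(nx.connected_components(G), key=len)
    print(f"<k> = {avg_k:>4}:  N_G/N = {len(largest) / N:.3f},  "
          f"components = {nx.number_connected_components(G)}")
```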

Section 3.7

Real Networks are Supercritical

Two predictions of random network theory are of special importance for real networks:

1. Once the average degree exceeds ⟨k⟩ = 1, a giant component emerges that contains a finite fraction of all nodes. Hence only for ⟨k⟩ > 1 do the nodes organize themselves into a recognizable network.

2. For ⟨k⟩ > ln N all components are absorbed by the giant component, resulting in a single connected network.

But do real networks satisfy the criteria for the existence of a giant component, i.e. ⟨k⟩ > 1? And will this giant component contain all nodes, i.e. is ⟨k⟩ > ln N, or do we expect some nodes and components to remain disconnected? These questions can be answered by comparing the measured ⟨k⟩ with the theoretical thresholds uncovered above.

The measurements indicate that real networks extravagantly exceed the ⟨k⟩ = 1 threshold. Indeed, sociologists estimate that an average person has around 1,000 acquaintances; a typical neuron is connected to dozens of other neurons, some to thousands; in our cells, each molecule takes part in several chemical reactions, some, like water, in hundreds. This conclusion is supported by Table 3.1, listing the average degree of several undirected networks, in each case finding ⟨k⟩ > 1. Hence the average degree of real networks is well beyond the ⟨k⟩ = 1 threshold, implying that they all have a giant component.

Let us now inspect if we have a single component (if ⟨k⟩ > ln N), or if we expect the network to be fragmented into multiple components (if ⟨k⟩ < ln N). For social networks this would mean that ⟨k⟩ ≥ ln(7 × 10^9) ≈ 22.7. That is, if the average individual has more than two dozen acquaintances, then a random society would have a single component, leaving no node disconnected. With ⟨k⟩ ≈ 1,000 this is clearly satisfied. Yet, according to Table 3.1, most real networks do not satisfy this criterion, indicating that they should consist of several disconnected components. This is a disconcerting prediction for the Internet, as it suggests that we should have routers that, being disconnected from the giant component, are unable to communicate with other routers. This prediction is at odds with reality, as these routers would be of little utility.

Table 3.1: Are real networks connected?

Network | N | L | ⟨k⟩ | ln N
Internet | 192,244 | 609,066 | 6.34 | 12.17
Power Grid | 4,941 | 6,594 | 2.67 | 8.51
Science Collaboration | 23,133 | 186,936 | 8.08 | 10.04
Actor Network | 212,250 | 3,054,278 | 28.78 | 12.27
Yeast Protein Interactions | 2,018 | 2,930 | 2.90 | 7.61

The number of nodes N and links L for several undirected networks, together with ⟨k⟩ and ln N. A giant component is expected for ⟨k⟩ > 1 and all nodes should join the giant component for ⟨k⟩ > ln N. While for all networks ⟨k⟩ > 1, for most ⟨k⟩ is under the ln N threshold.

Image 3.8: Most real networks are supercritical. The four regimes predicted by random network theory, marking with a cross the location of several real networks of Table 3.1. The diagram indicates that most networks are in the supercritical regime, hence they are expected to be broken into numerous isolated components. Only the actor network is in the connected regime, meaning that all nodes are expected to be part of a single giant component. Note that while the boundary between the subcritical and the supercritical regime is always at ⟨k⟩ = 1, the boundary between the supercritical and the connected regimes is at ln N, hence it varies from system to system.

Section 3.8

Small Worlds

SIX DEGREES


Frigyes Karinthy, 1929. Stanley Milgram, 1967. (Image by Matthew Hurst: the blogosphere.)

SIX DEGREES, 1929: Frigyes Karinthy

1929: Minden másképpen van (Everything is Different) Láncszemek (Chains)

“Look, Selma Lagerlöf just won the Nobel Prize for Literature, thus she is bound to know King Gustav of Sweden, after all he is the one who handed her the Prize, as required by tradition. King Gustav, to be sure, is a passionate tennis player, who always participates in international tournaments. He is known to have played Mr. Kehrling, whom he must therefore know for sure, and as it happens I myself know Mr. Kehrling quite well.”

"The worker knows the manager in the shop, who knows Ford; Ford is on friendly terms with the general director of Hearst Publications, who last year became good friends with Arpad Pasztor, someone I not only know, but to the best of my knowledge a good friend of mine. So I could easily ask him to send a telegram via the general director telling Ford that he should talk to the manager and have the worker in the shop quickly hammer together a car for me, as I happen to need one."

Frigyes Karinthy (1887-1938), Hungarian writer

SIX DEGREES, 1967: Stanley Milgram

HOW TO TAKE PART IN THIS STUDY

1. ADD YOUR NAME TO THE ROSTER AT THE BOTTOM OF THIS SHEET, so that the next person who receives this letter will know who it came from.

2. DETACH ONE POSTCARD. FILL IT AND RETURN IT TO HARVARD UNIVERSITY. No stamp is needed. The postcard is very important. It allows us to keep track of the progress of the folder as it moves toward the target person.

3. IF YOU KNOW THE TARGET PERSON ON A PERSONAL BASIS, MAIL THIS FOLDER DIRECTLY TO HIM (HER). Do this only if you have previously met the target person and know each other on a first name basis.

4. IF YOU DO NOT KNOW THE TARGET PERSON ON A PERSONAL BASIS, DO NOT TRY TO CONTACT HIM DIRECTLY. INSTEAD, MAIL THIS FOLDER (POST CARDS AND ALL) TO A PERSONAL ACQUAINTANCE WHO IS MORE LIKELY THAN YOU TO KNOW THE TARGET PERSON. You may send the folder to a friend, relative or acquaintance, but it must be someone you know on a first name basis.

BOX 3.6: SIX DEGREES, EXPERIMENTAL CONFIRMATION

The first empirical study of the small world phenomenon took place in 1967, when Stanley Milgram, building on the work of Pool and Kochen [20], designed an experiment to measure the distances in social networks [21, 22]. Milgram chose a stock broker in Boston and a divinity student in Sharon, Massachusetts as targets. He then randomly selected residents of Wichita and Omaha, sending them a letter containing a short summary of the study's purpose, a photograph, and the name, address and information about the target person. They were asked to forward the letter to a friend, relative or acquaintance who is more likely to know the target person.

Within a few days the first letter arrived, passing through only two links. Eventually 64 of the 296 letters made it back, some, however, requiring close to a dozen intermediates [22]. These completed chains allowed Milgram to determine the number of individuals required to get the letter to the target (Figure 3.12a). He found that the median number of intermediates was 5.2, a relatively small number that was remarkably close to Frigyes Karinthy's 1929 insight (BOX 3.7).

Milgram lacked an accurate map of the full acquaintance network, hence his experiment could not detect the true distance between his study's participants. Today Facebook has the most extensive social network map ever assembled. Using Facebook's social graph of May 2011, consisting of 721 million active users and 68 billion symmetric friendship links, researchers found an average distance 4.74 between the users (Figure 3.12b). Therefore, the study detected only 'four degrees of separation' [18], closer to the prediction of (3.20) than to Milgram's six degrees [21, 22].

Figure 3.12: Six Degrees? From Milgram to Facebook

(a) In Milgram's experiment 64 of the 296 letters made it to the recipient. The figure shows the length distribution of the completed chains, indicating that some letters required only one intermediary, while others required as many as ten. The mean of the distribution was 5.2, indicating that on average six 'handshakes' were required to get a letter to its recipient. The playwright John Guare renamed this 'six degrees of separation' two decades later. After [22].

(b) The distance distribution, p_d, for all pairs of Facebook users worldwide and within the US only. Using Facebook's N and L, (3.19) predicts the average distance to be approximately 3.90, not far from the reported four degrees. After [18].

"I asked a person of intelligence how many steps he thought it would take, and he said that it would require 100 intermediate persons, or more, to move from Nebraska to Sharon." (Stanley Milgram, 1969)

SIX DEGREES, 1991: John Guare

"Everybody on this planet is separated by only six other people. Six degrees of separation. Between us and everybody else on this planet. The president of the United States. A gondolier in Venice…. It's not just the big names. It's anyone. A native in a rain forest. A Tierra del Fuegan. An Eskimo. I am bound to everyone on this planet by a trail of six people. It's a profound thought. How every person is a new door, opening up into other worlds."

SECTION 3.8: SMALL WORLDS

The small world phenomenon, also known as six degrees of separation, has long fascinated the general public. It states that if you choose any two individuals anywhere on earth, you will find a path of at most six acquaintances between them (Figure 3.10). The fact that individuals who live in the same city are only a few handshakes from each other is by no means surprising. The small world concept states, however, that even individuals who are on the opposite side of the globe can be connected to us via a few acquaintances.

In the language of network science the small world phenomenon implies that the distance between two randomly chosen nodes in a network is short. This statement raises two questions: What does short (or small) mean, i.e. short compared to what? How do we explain the existence of these short distances?

Figure 3.10: Six Degrees of Separation

According to six degrees of separation two individuals, anywhere in the world, can be connected through a chain of six or fewer acquaintances. This means that while Sarah does not know Peter, she knows Ralph, who knows Jane, who in turn knows Peter. Hence Sarah is three handshakes, or three degrees, from Peter. In the language of network science six degrees, also called the small world property, means that the distance between any two nodes in a network is unexpectedly small.

DISTANCES IN RANDOM GRAPHS

Both questions are answered by a simple calculation. Consider a random network with average degree ⟨k⟩. Random graphs tend to have a tree-like topology with almost constant node degrees, so a node in this network has on average:

⟨k⟩ nodes at distance one (d = 1),
⟨k⟩^2 nodes at distance two (d = 2),
⟨k⟩^3 nodes at distance three (d = 3),
...
⟨k⟩^d nodes at distance d.

N(d_max) = 1 + ⟨k⟩ + ⟨k⟩^2 + \dots + ⟨k⟩^{d_max} = \frac{⟨k⟩^{d_max+1} - 1}{⟨k⟩ - 1} \approx ⟨k⟩^{d_max}, \qquad d_max ≈ \frac{\log N}{\log⟨k⟩}

For example, if ⟨k⟩ ≈ 1,000, which is the estimated number of acquaintances an individual has, we expect 10^6 individuals at distance two and about a billion, i.e. almost the whole earth's population, at distance three from us. To be precise, the expected number of nodes up to distance d from our starting node is

N(d) = 1 + ⟨k⟩ + ⟨k⟩^2 + \dots + ⟨k⟩^d = \frac{⟨k⟩^{d+1} - 1}{⟨k⟩ - 1}.  (3.15)

DISTANCES IN RANDOM GRAPHS

d_max = \frac{\log N}{\log⟨k⟩}

In most networks this offers a better approximation to the average distance between two randomly chosen nodes, ⟨d⟩, than to d_max:

⟨d⟩ ≈ \frac{\log N}{\log⟨k⟩}

We will call the small world phenomenon the property that the average path length or the diameter depends logarithmically on the system size.

Hence, "small" means that ⟨d⟩ is proportional to log N, rather than to N.

The 1/log⟨k⟩ term implies that the denser the network, the smaller the distance between the nodes.
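As a sketch, the prediction ⟨d⟩ ≈ log N / log⟨k⟩ can be checked against a simulated random network; the parameters N = 2,000 and ⟨k⟩ = 10 are arbitrary but large enough to keep the graph essentially connected.

```python
# Compare the small-world prediction with the measured average distance.
import math
import networkx as nx

N, avg_k = 2000, 10.0
G = nx.gnp_random_graph(N, avg_k / (N - 1), seed=3)

predicted = math.log(N) / math.log(avg_k)
# average_shortest_path_length needs a connected graph, so measure it on the
# giant component (which here contains almost every node).
giant = G.subgraph(max(nx.connected_components(G), key=len))
measured = nx.average_shortest_path_length(giant)

print("predicted <d> =", round(predicted, 2), " measured <d> =", round(measured, 2))
```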

DISTANCES IN RANDOM GRAPHS: compare with real data

Given the huge differences in scope, size, and average degree, the agreement is excellent. Why are small worlds surprising? Surprising compared to what?

Three, Four or Six Degrees?

For the globe's social networks:
⟨k⟩ ≃ 10^3
N ≃ 7 × 10^9 (the world's population)

⟨d⟩ = \frac{\ln N}{\ln⟨k⟩} = \frac{\ln(7 × 10^9)}{\ln(10^3)} ≈ 3.28


Section 3.9

Clustering Coefficient

The degree of a node contains no information about the relationship between a node's neighbors. Do they all know each other, by having links between them? Or are they perhaps isolated from each other? The answer is provided by the local clustering coefficient C_i, which measures the density of links in node i's immediate neighborhood: C_i = 0 means that there are no links between i's neighbors; C_i = 1 implies that each of i's neighbors link to each other (SECTION 2.10).

To calculate C_i for a node in a random network we need to estimate the expected number of links L_i between the node's k_i neighbors. In a random network the probability that two of i's neighbors link to each other is p. As there are k_i(k_i - 1)/2 possible links between the k_i neighbors of node i, the expected value of L_i is

⟨L_i⟩ = p\,\frac{k_i(k_i - 1)}{2}.  (3.20)

Since edges are independent and have the same probability p, the local clustering coefficient of a random network is

C_i = \frac{2⟨L_i⟩}{k_i(k_i - 1)} = p = \frac{⟨k⟩}{N}.  (3.21)
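A sketch checking the prediction C = ⟨k⟩/N on a generated random network (N = 3,000 and ⟨k⟩ = 10 are arbitrary choices):

```python
# Average clustering coefficient of a G(N, p) network vs. the prediction <k>/N.
import networkx as nx

N, avg_k = 3000, 10.0
G = nx.gnp_random_graph(N, avg_k / (N - 1), seed=5)

print("predicted C =", round(avg_k / N, 4))
print("measured  C =", round(nx.average_clustering(G), 4))
```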

• The clustering coefficient of random graphs is small.
• For fixed degree, C decreases with the system size N.
• C is independent of a node's degree k.

Equation (3.21) makes two predictions:

(1) For fixed ⟨k⟩, the larger the network, the smaller a node's clustering coefficient. Consequently a node's local clustering coefficient C_i is expected to decrease as 1/N. Note that the network's average clustering coefficient ⟨C⟩ also follows (3.21).

(2) The local clustering coefficient of a node is independent of the node's degree.

To test the validity of (3.21) we plot ⟨C⟩/⟨k⟩ in function of N for several undirected networks (Figure 3.13a). We find that ⟨C⟩/⟨k⟩ does not decrease as N^{-1}, but is largely independent of N, in violation of the prediction (3.21) and point (1) above. In Figure 3.13b-d we also show the dependence of C(k) on the node's degree for three real networks, finding that C(k) systematically decreases with the degree, again in violation of (3.21) and point (2).

In summary, we find that the random network model does not capture the clustering of real networks. Instead real networks have a much higher clustering coefficient than expected for a random network of similar N and L. An extension of the random network model proposed by Watts and Strogatz [26] addresses the coexistence of high C and the small world property (BOX 3.8). It fails to explain, however, why high-degree nodes have a smaller clustering coefficient than low-degree nodes. Models explaining the shape of C(k) are discussed in Chapter 9.

Figure 3.13: Clustering in Real Networks

(a) Comparing the average clustering coefficient of real networks with the prediction (3.21) for random networks. The circles and their colors correspond to the networks of Table 3.2. Directed networks were made undirected to calculate C and ⟨k⟩. The green line corresponds to (3.21), predicting that for random networks the average clustering coefficient decreases as N^{-1}. In contrast, for real networks ⟨C⟩ appears to be independent of N.

(b)-(d) The dependence of the local clustering coefficient, C(k), on the node's degree for (b) the Internet, (c) the science collaboration network and (d) the protein interaction network. C(k) is measured by averaging the local clustering coefficient of all nodes with the same degree k. The green horizontal line corresponds to ⟨C⟩.


RANDOM NETWORKS 26 Section 10

Real Networks are not Random

ARE REAL NETWORKS LIKE RANDOM GRAPHS?

As quantitative data about real networks became available, we can compare their topology with the predictions of random graph theory.

Note that once we have N and ⟨k⟩ for a random network, from them we can derive every measurable property. Indeed, we have:

Average path length: ⟨l_rand⟩ ≈ \frac{\log N}{\log⟨k⟩}

Clustering coefficient: C_rand = p = \frac{⟨k⟩}{N}

Degree distribution: P(k) = e^{-⟨k⟩}\,\frac{⟨k⟩^k}{k!}

PATH LENGTHS IN REAL NETWORKS

Prediction:

⟨d⟩ = \frac{\log N}{\log⟨k⟩}

Real networks have short distances like random graphs.

CLUSTERING COEFFICIENT

Prediction:

C_rand = \frac{⟨k⟩}{N}

C_rand underestimates the clustering coefficient of real networks by orders of magnitude.

THE DEGREE DISTRIBUTION

Prediction:

P(k) = e^{-⟨k⟩}\,\frac{⟨k⟩^k}{k!}

Data:

P(k) ≈ k^{-γ}

The Watts-Strogatz Model

We start from a ring of nodes, each node being connected to its immediate and next neighbors. Hence initially each node has ‹C› = 3/4 (p = 0).

With probability p each link is rewired to a randomly chosen node. For small p the network maintains high clustering, but the random long-range links can drastically decrease the distances between the nodes.

Regular networks (p = 0):
- large distances (bad)
- large clustering coefficients (good)

Random networks (p = 1):
- small distances (good)
- small clustering coefficients (bad)

For p = 1 all links have been rewired, so the network turns into a random network.

Watts-Strogatz Model

All graphs have N=1000 and ‹k›=10.

The dependence of the average path length d(p) and clustering coefficient ‹C(p)› on the rewiring parameter p. Note that d(p) and ‹C(p)› have been normalized by d(0) and ‹C(0)› obtained for a regular lattice (i.e. for p=0 in (a)). The rapid drop in d(p) signals the onset of the small-world phenomenon. During this drop, ‹C(p)› remains high.
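The d(p) and ⟨C(p)⟩ curves can be reproduced with networkx's built-in Watts-Strogatz generator; this is only a sketch, for N = 1,000, ⟨k⟩ = 10 and a few illustrative values of p.

```python
# Normalized path length and clustering of Watts-Strogatz networks.
import networkx as nx

N, k = 1000, 10
G0 = nx.connected_watts_strogatz_graph(N, k, p=0.0, seed=1)  # regular ring lattice
d0 = nx.average_shortest_path_length(G0)
C0 = nx.average_clustering(G0)

for p in (0.001, 0.01, 0.1, 1.0):
    G = nx.connected_watts_strogatz_graph(N, k, p, seed=1)
    d_ratio = nx.average_shortest_path_length(G) / d0
    c_ratio = nx.average_clustering(G) / C0
    print(f"p = {p:<6}  d(p)/d(0) = {d_ratio:.3f}  <C(p)>/<C(0)> = {c_ratio:.3f}")
```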

Hence in the range 0.001 < p < 0.1 short path lengths and high clustering coexist.

IS THE RANDOM GRAPH MODEL RELEVANT TO REAL SYSTEMS?

(B) Most important: we need to ask ourselves, are real networks random?

The answer is simply: NO

There is no network in nature that we know of that would be described by the random network model.

IF IT IS WRONG AND IRRELEVANT, WHY DID WE DEVOTE A FULL CLASS TO IT?

It is the reference model for the rest of the class.

It will help us calculate many quantities that can then be compared to real data, helping us understand to what degree a particular property is the result of some random process.

We look for patterns that are shared by a large number of real networks, yet deviate from the predictions of the random network model.

In order to identify these, we need to understand how a particular property would look if it were driven entirely by random processes.

While WRONG and IRRELEVANT, it will turn out to be extremely USEFUL!

HISTORICAL NOTE

1951, Rapoport and Solomonoff:

→ first systematic study of a random graph
→ demonstrates the phase transition

→ natural systems: neural networks; the social networks of physical contacts (epidemics); genetics

1959: G(N, p), Edgar N. Gilbert (b. 1923). Anatol Rapoport (1911-2007).

Why do we call it the Erdős-Rényi random model?
