CS224W: Social and Information Network Analysis
Jure Leskovec, Stanford University
http://cs224w.stanford.edu

Intro sessions to SNAP and NetworkX:
- SNAP: Today. Thu 9/27, 6-7:30pm in 200-030
- We already posted slides and code samples!
- NetworkX: Fri 9/28, 6-7:30pm in 200-030

About the software libraries:
- TAs support SNAP (Anshul, Wayne), NetworkX (Ashton, Jacob)
- You can use other libraries as well (JUNG, iGraph, Boost, R)
- They will do the job but we don't offer support for them
- Start early on HW0 since these packages are new to you, complex and non-trivial to use!

Review of:
- Probability: Tue 10/2 6:30-8pm in GESB 150
- Linear algebra: Wed 10/3 6:30-8pm in GESB 150

9/27/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu

Observations | Models | Algorithms
Small diameter, Edge clustering | Erdös-Renyi model, Small-world model | Decentralized search
Patterns of signed edge creation | Structural balance, Theory of status | Models for predicting edge signs
Viral Marketing, Blogosphere, Memetracking | Independent cascade model, Game theoretic model | Influence maximization, Outbreak detection, LIM
Preferential attachment, Scale-Free | Copying model | PageRank, Hubs and authorities
Densification power law, Shrinking diameters | Microscopic model of evolving networks | Link prediction, Supervised random walks
Strength of weak ties, Core-periphery | Kronecker Graphs | Community detection: Girvan-Newman,

For example, last time we talked about Observations and Models for the web graph:
- 1) We took a real system: the Web
- 2) We represented it as a directed graph
- 3) We used the language of graph theory: Strongly Connected Components
- 4) We designed a computational experiment: find the In- and Out-components of a given node v
- 5) We learned something about the structure of the Web: BOWTIE!

Undirected vs. Directed

Undirected: links are undirected (symmetrical). Examples:
- Collaborations
- Friendship on Facebook

Directed: links are directed (arcs). Examples:
- Phone calls
- Following on Twitter

Adjacency matrix:

$A_{ij} = 1$ if there is a link from node $i$ to node $j$, $A_{ij} = 0$ otherwise.

Undirected graph (left), nodes 1 to 4:

$$A = \begin{pmatrix} 0 & 1 & 0 & 1 \\ 1 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \\ 1 & 1 & 1 & 0 \end{pmatrix}$$

Directed graph (right), nodes 1 to 4:

$$A = \begin{pmatrix} 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 1 & 1 & 0 \end{pmatrix}$$

Note that for a directed graph (right) the matrix is not symmetric.

Node degree, $k_i$: the number of edges adjacent to node $i$

Undirected:
- Avg. degree: $\bar{k} = \frac{1}{N} \sum_{i=1}^{N} k_i = \frac{2E}{N}$

Directed:
- In directed networks we define an in-degree and out-degree. The (total) degree of a node is the sum of in- and out-degrees.
- Example (node C in the figure): $k_C^{in} = 2$, $k_C^{out} = 1$, $k_C = 3$
- Avg. degree: $\bar{k} = \frac{E}{N}$, with $\bar{k}^{in} = \bar{k}^{out}$
- Source: node with $k^{in} = 0$
- Sink: node with $k^{out} = 0$

The maximum number of edges in an undirected graph on N nodes is

$$E_{max} = \binom{N}{2} = \frac{N(N-1)}{2}$$

A graph with the number of edges E = Emax is a complete graph, and its average degree is N-1
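As a rough illustration of the degree definitions and of $E_{max}$, the sketch below recomputes them for the two 4-node adjacency matrices shown on the adjacency-matrix slide. The function names are illustrative choices, not part of SNAP or NetworkX.

```python
# Adjacency matrices of the 4-node examples (undirected and directed).
A_UNDIRECTED = [
    [0, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
    [1, 1, 1, 0],
]
A_DIRECTED = [
    [0, 0, 0, 1],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
]

def out_degrees(adj):
    """Row sums: number of links leaving each node."""
    return [sum(row) for row in adj]

def in_degrees(adj):
    """Column sums: number of links entering each node."""
    return [sum(col) for col in zip(*adj)]

def max_edges(n):
    """E_max = N(N-1)/2 for an undirected graph on n nodes."""
    return n * (n - 1) // 2

n = len(A_UNDIRECTED)
k = out_degrees(A_UNDIRECTED)      # for a symmetric matrix, row sums are the degrees
E = sum(k) // 2                    # each undirected edge is counted twice
print("degrees:", k, "E:", E, "avg degree:", 2 * E / n, "E_max:", max_edges(n))

k_out = out_degrees(A_DIRECTED)
k_in = in_degrees(A_DIRECTED)
sinks = [i + 1 for i, d in enumerate(k_out) if d == 0]    # 1-based node labels
sources = [i + 1 for i, d in enumerate(k_in) if d == 0]
print("in:", k_in, "out:", k_out, "sinks:", sinks, "sources:", sources)
```

In the directed example the in-degrees and out-degrees both sum to E, which is why the average in-degree equals the average out-degree.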

Most real-world networks are sparse

$E \ll E_{max}$ (or $\bar{k} \ll N-1$)

Network | N | ⟨k⟩
WWW (Stanford-Berkeley) | 319,717 | 9.65
Social networks (LinkedIn) | 6,946,668 | 8.87
Communication (MSN IM) | 242,720,596 | 11.1
Coauthorships (DBLP) | 317,080 | 6.62
Internet (AS-Skitter) | 1,719,037 | 14.91
Roads (California) | 1,957,027 | 2.82
Protein (S. Cerevisiae) | 1,870 | 2.39

(Source: Leskovec et al., Internet Mathematics, 2009)

Consequence: the adjacency matrix is filled with zeros! (Density $E/N^2$: WWW = $1.51 \times 10^{-5}$, MSN IM = $2.27 \times 10^{-8}$)

Unweighted (undirected):

$$A_{ij} = \begin{pmatrix} 0 & 1 & 1 & 0 \\ 1 & 0 & 1 & 1 \\ 1 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{pmatrix}$$

Examples: Friendship, Sex

Weighted (undirected):

$$A_{ij} = \begin{pmatrix} 0 & 2 & 0.5 & 0 \\ 2 & 0 & 1 & 4 \\ 0.5 & 1 & 0 & 0 \\ 0 & 4 & 0 & 0 \end{pmatrix}$$

Examples: Collaboration, Internet, Roads

In both cases: $A_{ii} = 0$, $A_{ij} = A_{ji}$, $E = \frac{1}{2} \sum_{i,j=1}^{N} \mathrm{nonzero}(A_{ij})$, $\bar{k} = \frac{2E}{N}$

Self-edges (self-loops) (undirected):

$$A_{ij} = \begin{pmatrix} 1 & 1 & 1 & 0 \\ 1 & 0 & 1 & 1 \\ 1 & 1 & 0 & 0 \\ 0 & 1 & 0 & 1 \end{pmatrix}$$

$A_{ii} \neq 0$, $A_{ij} = A_{ji}$, $E = \frac{1}{2} \sum_{i,j=1, i \neq j}^{N} A_{ij} + \sum_{i=1}^{N} A_{ii}$

Examples: Proteins

Multigraph (undirected):

$$A_{ij} = \begin{pmatrix} 0 & 2 & 1 & 0 \\ 2 & 0 & 1 & 3 \\ 1 & 1 & 0 & 0 \\ 0 & 3 & 0 & 0 \end{pmatrix}$$

$A_{ii} = 0$, $A_{ij} = A_{ji}$, $E = \frac{1}{2} \sum_{i,j=1}^{N} \mathrm{nonzero}(A_{ij})$, $\bar{k} = \frac{2E}{N}$

Examples: Communication, Collaboration

WWW >> directed multigraph with self-interactions

Facebook friendships >> undirected, unweighted

Citation networks >> unweighted, directed, acyclic

Collaboration networks >> undirected multigraph or weighted graph

Mobile phone calls >> directed, (weighted?) multigraph

Protein Interactions >> undirected, unweighted with self-interactions

A bipartite graph is a graph whose nodes can be divided into two disjoint sets U and V such that every link connects a node in U to one in V; that is, U and V are independent sets.

Examples:
- Authors-to-papers (they authored)
- Actors-to-Movies (they appeared in)
- Users-to-Movies (they rated)

“Folded” networks:
- Author collaboration networks
- Movie co-rating networks

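A minimal sketch of how a bipartite graph can be "folded" onto one side, assuming the usual rule that two nodes of U become linked when they share at least one neighbor in V. The author and paper names are hypothetical.

```python
from itertools import combinations

def fold(bipartite_edges):
    """Fold a bipartite edge list [(u, v), ...] onto the U side.

    Two U-nodes are linked in the folded network if they share
    at least one V-neighbor (e.g. two authors with a common paper).
    """
    members = {}
    for u, v in bipartite_edges:
        members.setdefault(v, set()).add(u)
    folded = set()
    for group in members.values():
        for a, b in combinations(sorted(group), 2):
            folded.add((a, b))
    return folded

# Hypothetical authors-to-papers network.
authorship = [
    ("A", "paper1"), ("B", "paper1"),
    ("B", "paper2"), ("C", "paper2"),
    ("D", "paper3"),
]
collaboration = fold(authorship)
print(sorted(collaboration))
```

Author D has no co-author, so D ends up isolated in the folded collaboration network.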

P(k): Probability that a randomly chosen node has degree k

$N_k$ = number of nodes with degree $k$

Normalized histogram: $P(k) = N_k / N$, plotted against $k$
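The normalized histogram can be computed directly from a list of node degrees. The sketch below uses the degrees of the undirected 4-node example from the adjacency-matrix slide; the function name is an illustrative choice.

```python
from collections import Counter

def degree_distribution(degrees):
    """Return {k: P(k)} with P(k) = N_k / N."""
    n = len(degrees)
    counts = Counter(degrees)          # N_k for every observed k
    return {k: counts[k] / n for k in sorted(counts)}

P = degree_distribution([2, 2, 1, 3])
for k, prob in P.items():
    print(k, prob)
```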

A path is a sequence of nodes in which each node is linked to the next one:

$P_n = \{i_0, i_1, i_2, \ldots, i_n\}$, or equivalently as a sequence of edges $P_n = \{(i_0, i_1), (i_1, i_2), (i_2, i_3), \ldots, (i_{n-1}, i_n)\}$

- A path can intersect itself and pass through the same edge multiple times, e.g. ACBDCDEG
- In a directed graph a path can only follow the direction of the “arrow”

Number of paths between nodes u and v:
- Length h=1: If there is a link between u and v, $A_{uv} = 1$, else $A_{uv} = 0$
- Length h=2: If there is a path of length two between u and v through node k, then $A_{uk} A_{kv} = 1$, else $A_{uk} A_{kv} = 0$. Summing over k:
  $H_{uv}^{(2)} = \sum_{k=1}^{N} A_{uk} A_{kv} = [A^2]_{uv}$
- Length h: If there is a path of length h between u and v, then $A_{uk} \cdots A_{kv} = 1$, else $A_{uk} \cdots A_{kv} = 0$. So, the no. of paths of length h between u and v is
  $H_{uv}^{(h)} = [A^h]_{uv}$
  (holds for both directed and undirected graphs)

Distance (shortest path, geodesic) between a pair of nodes is defined as the number of edges along the shortest path connecting the nodes
- Example: $h_{B,D} = 2$
- If the two nodes are disconnected, the distance is usually defined as infinite
- In directed graphs paths need to follow the direction of the arrows
- Consequence: distance is not symmetric, $h_{A,C} \neq h_{C,A}$. Example: $h_{B,C} = 1$, $h_{C,B} = 2$

Diameter: the maximum (shortest path) distance between any pair of nodes in a graph

Average path length for a connected graph (component) or a strongly connected (component of a) directed graph:

$$\bar{h} = \frac{1}{2 E_{max}} \sum_{i, j \neq i} h_{ij}$$

where $h_{ij}$ is the distance from node i to node j
- Many times we compute the average only over the connected pairs of nodes (we ignore “infinite” length paths)

Breadth-First Search:
- Start with node u, mark it to be at distance $h_u(u) = 0$, add u to the queue
- While the queue is not empty:
  - Take node v off the queue, put its unmarked neighbors w into the queue and mark $h_u(w) = h_u(v) + 1$

(Figure: example graph in which every node is labeled with its BFS distance from u, ranging from 0 to 4.)
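The path-counting identity $H^{(h)} = A^h$ and the BFS procedure can both be checked on a small graph. The sketch below is one possible pure-Python version, run on the undirected 4-node adjacency matrix from the adjacency-matrix slide (nodes are 0-indexed here). It assumes the graph is connected when averaging distances; all function names are illustrative.

```python
from collections import deque

A = [
    [0, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
    [1, 1, 1, 0],
]

def mat_mul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(X, h):
    """A^h by repeated multiplication; entry [u][v] counts paths of length h."""
    n = len(X)
    result = [[int(i == j) for j in range(n)] for i in range(n)]
    for _ in range(h):
        result = mat_mul(result, X)
    return result

def bfs_distances(adj, source):
    """Distances h_u(w) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w, linked in enumerate(adj[v]):
            if linked and w not in dist:      # w is an unmarked neighbor
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist

def diameter_and_avg_path_length(adj):
    """Assumes a connected graph: every ordered pair has a finite distance."""
    n = len(adj)
    all_d = [d for s in range(n)
             for t, d in bfs_distances(adj, s).items() if t != s]
    # Sum over ordered pairs divided by 2 * E_max = n(n-1).
    return max(all_d), sum(all_d) / (n * (n - 1))

A2 = mat_pow(A, 2)
A3 = mat_pow(A, 3)
diameter, avg_h = diameter_and_avg_path_length(A)
print("paths of length 2:", A2)
print("distances from node 2:", bfs_distances(A, 2))
print("diameter:", diameter, "average path length:", avg_h)
```

Note that the diagonal of $A^2$ reproduces the node degrees, since a path of length two from a node back to itself goes out and back along one of its edges.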

Clustering coefficient:
- What portion of i’s neighbors are connected?
- Node i with degree $k_i$
- $C_i = \frac{2 e_i}{k_i (k_i - 1)}$, where $e_i$ is the number of edges between the neighbors of node i
- $C_i \in [0, 1]$

(Figure: three example nodes with $C_i = 0$, $C_i = 1/3$ and $C_i = 1$.)

Average Clustering Coefficient: $C = \frac{1}{N} \sum_{i}^{N} C_i$

Example (nodes B and D in the figure):
- $k_B = 2$, $e_B = 1$, $C_B = 2/2 = 1$
- $k_D = 4$, $e_D = 2$, $C_D = 4/12 = 1/3$
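A small sketch of the clustering coefficient computation, run on the undirected 4-node adjacency matrix from the adjacency-matrix slide. Treating nodes of degree below 2 as having $C_i = 0$ is an assumption made here (the definition leaves that case undefined), and the function names are illustrative.

```python
A = [
    [0, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
    [1, 1, 1, 0],
]

def coefficient(k, e):
    """C_i = 2 e_i / (k_i (k_i - 1)); set to 0 for k < 2 (assumed convention)."""
    return 0.0 if k < 2 else 2 * e / (k * (k - 1))

def clustering_coefficients(adj):
    n = len(adj)
    result = []
    for i in range(n):
        nbrs = [j for j in range(n) if j != i and adj[i][j]]
        # e_i: number of edges among the neighbors of i
        e = sum(1 for a in range(len(nbrs)) for b in range(a + 1, len(nbrs))
                if adj[nbrs[a]][nbrs[b]])
        result.append(coefficient(len(nbrs), e))
    return result

C = clustering_coefficients(A)
avg_C = sum(C) / len(C)
print("C_i:", C, "average:", avg_C)

# The worked example from the slide: k_D = 4, e_D = 2.
print("C_D:", coefficient(4, 2))
```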

Degree distribution: P(k)
Path length: h
Clustering coefficient: C

Example: every node has degree 4
- $P(k) = \delta(k-4)$, i.e. $k_i = 4$ for all nodes
- $C = 1/2$ for all nodes as $N \to \infty$
- Path length: the average shortest path-length grows linearly with N, i.e. $\bar{h} = O(N)$
- So, we have: constant degree, constant avg. clustering coeff., linear avg. path-length

Note about calculations:
- We are interested in quantities as graphs get large ($N \to \infty$)
- We will use big-O: $f(x) = O(g(x))$ as $x \to \infty$ if $f(x) < c \cdot g(x)$ for all $x > x_0$ and some constant c

Example: every inside node has degree 6
- $P(k) = \delta(k-6)$, i.e. $k = 6$ for each inside node
- $C = 6/15$ for inside nodes
- Path length: follows the general lattice scaling given next

In general, for lattices:
- Average path-length is $\bar{h} \approx N^{1/D}$ (D is the lattice dimensionality)
- Constant degree, constant clustering coefficient

Erdös-Renyi Random Graphs [Erdös-Renyi, ‘60]

Two variants:
- $G_{n,p}$: undirected graph on n nodes in which each edge (u,v) appears i.i.d. with probability p
- $G_{n,m}$: undirected graph with n nodes and m edges picked uniformly at random

What kinds of networks does such a model produce?
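Both variants are short to write down. The following is a minimal sketch using only the standard library; the function names `gnp` and `gnm` are illustrative, and the quadratic loop over all node pairs is only practical for small n.

```python
import random
from itertools import combinations

def gnp(n, p, rng):
    """G(n, p): keep each of the n(n-1)/2 possible edges independently with prob. p."""
    return [(u, v) for u, v in combinations(range(n), 2) if rng.random() < p]

def gnm(n, m, rng):
    """G(n, m): pick m distinct edges uniformly at random."""
    return rng.sample(list(combinations(range(n), 2)), m)

rng = random.Random(42)              # fixed seed so that runs are repeatable
print(len(gnp(10, 1 / 6, rng)), "edges in one G(10, 1/6) realization")
print(len(gnp(10, 1 / 6, rng)), "edges in another realization")
print(gnm(10, 7, rng))
```

Two calls with the same n and p generally return different graphs, which is the point made on the next slide: n and p define a distribution over graphs, not a single graph.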

n and p do not uniquely determine the graph!
- The graph is a result of a random process

We can have many different realizations

(Figure: several realizations of $G_{np}$ with n = 10, p = 1/6.)

How likely is a graph on E edges?

P(E): the probability that a given $G_{np}$ generates a graph on exactly E edges:

$$P(E) = \binom{E_{max}}{E} p^{E} (1-p)^{E_{max} - E}$$

where $E_{max} = n(n-1)/2$ is the maximum possible number of edges in an undirected graph of n nodes

This is a binomial distribution.

What is the expected degree of a node?
- Let $X_v$ be a random variable measuring the degree of node v
- We want to know: $E[X_v] = \sum_{j=0}^{n-1} j \, P(X_v = j)$
- For the calculation we will need linearity of expectation: for any random variables $Y_1, Y_2, \ldots, Y_k$, if $Y = Y_1 + Y_2 + \ldots + Y_k$, then $E[Y] = \sum_i E[Y_i]$

Easier way:
- Decompose $X_v$ into $X_v = X_{v,1} + X_{v,2} + \ldots + X_{v,n-1}$, where $X_{v,u}$ is a {0,1}-random variable which tells if edge (v,u) exists or not
- Each $E[X_{v,u}] = p$, so $E[X_v] = \sum_u E[X_{v,u}] = (n-1)p$

How to think about this?
- Prob. of node u linking to node v is p
- u can link (flips a coin) to all other (n-1) nodes
- Thus, the expected degree of node u is: p(n-1)

Degree distribution: P(k)
Path length: h
Clustering coefficient: C

What are the values of these properties for $G_{np}$?

Fact: Degree distribution of $G_{np}$ is Binomial.

Let P(k) denote the fraction of nodes with degree k:

$$P(k) = \binom{n-1}{k} p^{k} (1-p)^{n-1-k}$$

- $\binom{n-1}{k}$: select k nodes out of n-1
- $p^k$: probability of having k edges
- $(1-p)^{n-1-k}$: probability of missing the rest of the n-1-k edges

Mean, variance of a binomial distribution:

$$\bar{k} = p(n-1), \qquad \sigma^2 = p(1-p)(n-1)$$

so the relative width is $\sigma / \bar{k} = \sqrt{\frac{1-p}{p} \cdot \frac{1}{n-1}}$. As the network size increases, the distribution becomes increasingly narrow: we are increasingly confident that the degree of a node is in the vicinity of $\bar{k}$.
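To make the narrowing concrete, the sketch below evaluates the binomial degree distribution for a small n and compares the relative width $\sigma / \bar{k}$ at two network sizes. The variance $p(1-p)(n-1)$ is the standard binomial variance; the function names are illustrative.

```python
from math import comb, sqrt

def degree_pmf(n, p):
    """P(k) for k = 0..n-1 in G(n, p). Intended for small n only."""
    return [comb(n - 1, k) * p**k * (1 - p)**(n - 1 - k) for k in range(n)]

def mean_and_variance(pmf):
    mean = sum(k * q for k, q in enumerate(pmf))
    var = sum((k - mean) ** 2 * q for k, q in enumerate(pmf))
    return mean, var

def relative_width(n, p):
    """sigma / mean = sqrt((1 - p) / (p (n - 1)))."""
    return sqrt((1 - p) / (p * (n - 1)))

pmf = degree_pmf(11, 0.3)
mean, var = mean_and_variance(pmf)
print("mean:", mean, "expected p(n-1):", 0.3 * 10)
print("variance:", var, "expected p(1-p)(n-1):", 0.3 * 0.7 * 10)
print("relative width, n=101:", relative_width(101, 0.3))
print("relative width, n=10001:", relative_width(10001, 0.3))
```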

Remember: $C_i = \frac{2 e_i}{k_i (k_i - 1)}$

Edges in $G_{np}$ appear i.i.d. with prob. p

So, in expectation: $e_i = p \cdot \frac{k_i (k_i - 1)}{2}$
- p: each pair is connected with prob. p
- $\frac{k_i (k_i - 1)}{2}$: no. of distinct pairs of neighbors of node i of degree $k_i$

Then: $C = \frac{p \cdot k_i (k_i - 1)}{k_i (k_i - 1)} = p = \frac{\bar{k}}{n-1} \approx \frac{\bar{k}}{n}$

The clustering coefficient of a random graph is small. For a fixed avg. degree, C decreases with the graph size N.

Goal: Generate a random graph with a given degree sequence $k_1, k_2, \ldots, k_N$

Configuration model:
- Start from nodes with spokes (half-edges), one spoke per unit of degree
- Randomly pair up the “mini”-nodes (the spoke endpoints)
- The pairs are the edges of the resulting graph

(Figure: nodes A, B, C, D with spokes, the random pairing of “mini”-nodes, and the resulting graph.)

Useful as a “null” model of networks
- We can compare the real network G and a “random” G’ which has the same degree sequence as G

To prove the diameter of a $G_{np}$ we define a few concepts

Random k-Regular graph:
- Assume each node has k spokes (half-edges)
- Randomly pair them up!
- k=1: Graph is a set of pairs
- k=2: Graph is a set of cycles
- k=3: Arbitrarily complicated graphs
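The spoke-pairing idea (used both for a given degree sequence and for random k-regular graphs) can be sketched as below. This simple version is an assumption-laden illustration: it does not reject self-loops or repeated edges, so the result is in general a multigraph, and the function name is made up for this example.

```python
import random
from collections import Counter

def pair_spokes(degree_sequence, rng):
    """Give node i exactly degree_sequence[i] spokes, then pair spokes at random.

    Returns a list of edges (u, v). Self-loops and multi-edges may occur.
    """
    spokes = [node for node, k in enumerate(degree_sequence) for _ in range(k)]
    if len(spokes) % 2:
        raise ValueError("the sum of degrees must be even")
    rng.shuffle(spokes)
    return list(zip(spokes[::2], spokes[1::2]))

rng = random.Random(7)
edges = pair_spokes([3, 3, 3, 3, 3, 3], rng)   # a random 3-regular (multi)graph
print(edges)

# Every node keeps its prescribed degree (a self-loop contributes 2).
realized = Counter()
for u, v in edges:
    realized[u] += 1
    realized[v] += 1
print(dict(realized))
```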

10/4/2011 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu

Graph G(V, E) has expansion α: if $\forall S \subseteq V$: # of edges leaving S $\geq \alpha \cdot \min(|S|, |V \setminus S|)$

Or equivalently:

$$\alpha = \min_{S \subseteq V} \frac{\#\text{edges leaving } S}{\min(|S|, |V \setminus S|)}$$

(Figure: a (big) graph with “good” expansion. A set of |S| nodes has at least α·|S| edges leaving it, and a set of |S’| nodes has at least α·|S’| edges leaving it.)
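For very small graphs the expansion can be computed by brute force over all non-empty proper subsets S. The sketch below does exactly that; it is exponential in the number of nodes and meant only to illustrate the definition. The three test graphs are hypothetical examples.

```python
from itertools import combinations

def expansion(nodes, edges):
    """alpha = min over non-empty proper subsets S of
    (#edges leaving S) / min(|S|, |V \\ S|). Brute force, tiny graphs only."""
    nodes = list(nodes)
    n = len(nodes)
    best = float("inf")
    for size in range(1, n):
        for subset in combinations(nodes, size):
            s = set(subset)
            leaving = sum(1 for u, v in edges if (u in s) != (v in s))
            best = min(best, leaving / min(size, n - size))
    return best

complete4 = [(u, v) for u, v in combinations(range(4), 2)]
cycle6 = [(i, (i + 1) % 6) for i in range(6)]
path4 = [(0, 1), (1, 2), (2, 3)]

print("complete graph on 4 nodes:", expansion(range(4), complete4))
print("cycle on 6 nodes:", expansion(range(6), cycle6))
print("path on 4 nodes:", expansion(range(4), path4))
```

The densely connected complete graph has the largest value, while the path, which can be split in half by cutting a single edge, has the smallest.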

Expansion is a measure of robustness:
- To disconnect l nodes, we need to cut ≥ α·l edges

(Figure: examples of a low-expansion graph and a high-expansion graph.)

Social networks:
- “Communities”

k-regular graph (every node has degree k):
- Expansion is at most k (when S is a single node)

Is there a graph on n nodes (n→∞), of fixed max deg. k, so that expansion α remains const?

Examples:
- n×n grid: k=4: $\alpha = 2n / (n^2/4) \to 0$ (S = n/2 × n/2 square in the center)
- Complete binary tree: $\alpha \to 0$ for $|S| = (n/2) - 1$
- Fact: For a random 3-regular graph on n nodes, there is some const α (α > 0, independent of n) such that w.h.p. the expansion of the graph is ≥ α

Fact: In a graph on n nodes with expansion α, for all pairs of nodes s and t there is a path of O((log n)/α) edges connecting them.

Proof strategy:
- We want to show that from any node s there is a path of length O((log n)/α) to any other node t
- Let $S_j$ be the set of all nodes found within j steps of BFS from s
- How does $|S_j|$ increase as a function of j?

Proof (continued):
- By expansion, at least $\alpha |S_j|$ edges leave $S_j$
- Each node has degree k, so at most k of these edges “collide” at a node
- Then: $|S_{j+1}| \geq |S_j| + \frac{\alpha |S_j|}{k} = |S_j| \left(1 + \frac{\alpha}{k}\right)$, and since $|S_0| = 1$: $|S_j| \geq \left(1 + \frac{\alpha}{k}\right)^{j}$

Proof (continued):
- In how many steps of BFS do we reach > n/2 nodes?
- Need j so that: $\left(1 + \frac{\alpha}{k}\right)^{j} \geq \frac{n}{2}$
- Let’s set: $j = \frac{k \log_2 n}{\alpha}$
- Then: $\left(1 + \frac{\alpha}{k}\right)^{\frac{k \log_2 n}{\alpha}} \geq 2^{\log_2 n} = n > \frac{n}{2}$
- Note: this uses $\left(1 + \frac{\alpha}{k}\right)^{k/\alpha} \geq 2$. Remember n > 0, α ≤ k, then:
  - if α = k: $(1+1)^{\log_2 n} = 2^{\log_2 n}$
  - if α → 0 then $k/\alpha = x \to \infty$: $\left(1 + \frac{1}{x}\right)^{x \log_2 n} = e^{\log_2 n} > 2^{\log_2 n}$
- In j steps we reach > n/2 nodes from s, and in j steps we reach > n/2 nodes from t, so the two sets overlap ⇒ Diameter ≤ 2·j
- In $\frac{2k}{\alpha} \log n$ steps $|S_j|$ grows to Θ(n). So, the diameter of G is O(log(n)/α)

Properties of $G_{np}$:
- Degree distribution: $P(k) = \binom{n-1}{k} p^{k} (1-p)^{n-1-k}$
- Path length: O(log n)
- Clustering coefficient: $C = p = \bar{k}/n$
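The growth bound in the proof can be sanity-checked numerically: iterate the recurrence $|S_{j+1}| \geq |S_j| (1 + \alpha/k)$ until it exceeds n/2 and compare the step count with the proof's choice $j = k \log_2 n / \alpha$. This is only a check of the arithmetic under the proof's assumptions (α ≤ k), not a simulation of BFS on an actual graph; the function names and parameter values are illustrative.

```python
from math import ceil, log2

def steps_to_reach_half(n, k, alpha):
    """Smallest j with (1 + alpha/k)^j >= n/2, starting from |S_0| = 1."""
    growth = 1 + alpha / k
    size, j = 1.0, 0
    while size < n / 2:
        size *= growth
        j += 1
    return j

def proof_bound(n, k, alpha):
    """The step count used in the proof: j = k * log2(n) / alpha (rounded up)."""
    return ceil(k * log2(n) / alpha)

for n, k, alpha in [(10**6, 3, 3.0), (10**6, 3, 0.5), (10**4, 10, 0.1)]:
    print(n, k, alpha, steps_to_reach_half(n, k, alpha), proof_bound(n, k, alpha))
```

In each case the number of steps actually needed stays below the proof's bound, and both scale like (log n)/α.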

Next class: What happens to $G_{np}$ when we vary p?
- Remember, expected degree $E[X_v] = (n-1)p$
- We want $E[X_v]$ to be independent of n. So let: $p = c/(n-1)$

Observation: If we build random graph $G_{np}$ with $p = c/(n-1)$ we have many isolated nodes

Why?

$$P[v \text{ has degree } 0] = (1-p)^{n-1} = \left(1 - \frac{c}{n-1}\right)^{n-1} \xrightarrow[n \to \infty]{} e^{-c}$$

Use the substitution $\frac{1}{x} = \frac{c}{n-1}$:

$$\lim_{n \to \infty} \left(1 - \frac{c}{n-1}\right)^{n-1} = \lim_{x \to \infty} \left[\left(1 - \frac{1}{x}\right)^{x}\right]^{c} = e^{-c}$$

By definition: $e = \lim_{x \to \infty} \left(1 + \frac{1}{x}\right)^{x}$, and hence $\frac{1}{e} = \lim_{x \to \infty} \left(1 - \frac{1}{x}\right)^{x}$

How big do we have to make p before we are likely to have no isolated nodes?

We know: $P[v \text{ has degree } 0] = e^{-c}$

Event we are asking about is:
- I = some node is isolated
- $I = \bigcup_{v \in N} I_v$, where $I_v$ is the event that v is isolated

We have, by the union bound:

$$P(I) = P\left(\bigcup_{v \in N} I_v\right) \leq \sum_{v \in N} P(I_v) = n e^{-c}$$

Union bound: $\left|\bigcup_i A_i\right| \leq \sum_i |A_i|$

We just learned: $P(I) \leq n e^{-c}$

Let’s try:
- $c = \ln n$, then: $n e^{-c} = n e^{-\ln n} = n \cdot \frac{1}{n} = 1$
- $c = 2 \ln n$, then: $n e^{-2 \ln n} = n \cdot \frac{1}{n^2} = \frac{1}{n}$

So if:
- $p = \frac{\ln n}{n-1}$, then: $P(I) \leq 1$ (the bound says nothing)
- $p = \frac{2 \ln n}{n-1}$, then: $P(I) \leq 1/n \to 0$ as $n \to \infty$

Graph structure of $G_{np}$ as p changes:

p | Graph structure
0 | Empty graph
1/(n-1) | Giant component appears
c/(n-1) | Avg. deg const. Lots of isolated nodes.
log(n)/(n-1) | Fewer isolated nodes.
2*log(n)/(n-1) | No isolated nodes.
1 | Complete graph

Emergence of a Giant Component: avg. degree k = 2E/n or p = k/(n-1)
- k = 1-ε: all components are of size O(log n)
- k = 1+ε: 1 component of size Ω(n), others have size O(log n)

(Figure: fraction of nodes in the largest component of $G_{np}$, n = 100k, for p(n-1) = 0.5 … 3.)
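The isolated-node estimate $e^{-c}$ and the emergence of the giant component can both be observed in a small simulation. The sketch below samples $G_{np}$ with $p = c/(n-1)$ and tracks components with a simple union-find structure. It is a rough illustration with a much smaller n than the 100k of the figure; the function name and the chosen values of c are assumptions.

```python
import random
from collections import Counter
from math import exp

def simulate_gnp(n, c, rng):
    """Sample G(n, p) with p = c/(n-1).

    Returns (fraction of isolated nodes, fraction of nodes in the largest component).
    """
    p = c / (n - 1)
    parent = list(range(n))        # union-find forest over the nodes
    degree = [0] * n

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for u in range(n):
        for v in range(u + 1, n):
            if rng.random() < p:            # edge (u, v) appears with prob. p
                degree[u] += 1
                degree[v] += 1
                parent[find(u)] = find(v)   # merge the two components

    isolated = sum(1 for k in degree if k == 0) / n
    sizes = Counter(find(x) for x in range(n))
    return isolated, max(sizes.values()) / n

rng = random.Random(0)
for c in (0.5, 1.0, 2.0, 3.0):
    iso, giant = simulate_gnp(500, c, rng)
    print(f"c={c}: isolated={iso:.3f} (e^-c={exp(-c):.3f}), largest component={giant:.3f}")
```

Below average degree 1 the largest component stays a tiny fraction of the graph; above it, a single component quickly absorbs most nodes, while the share of isolated nodes tracks $e^{-c}$.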