Slides and Code Samples!
CS224W: Social and Information Network Analysis
Jure Leskovec, Stanford University
http://cs224w.stanford.edu

Intro sessions to SNAP and NetworkX:
- SNAP: Today, Thu 9/27, 6-7:30pm in 200-030
  - We already posted slides and code samples!
- NetworkX: Fri 9/28, 6-7:30pm in 200-030

About the software libraries:
- TAs support SNAP (Anshul, Wayne) and NetworkX (Ashton, Jacob)
- You can use other libraries as well (JUNG, iGraph, Boost, R)
  - They will do the job, but we don't offer support for them
- Start early on HW0, since these packages are new to you, complex, and non-trivial to use!

Review of:
- Probability: Tue 10/2, 6:30-8pm in GESB 150
- Linear algebra: Wed 10/3, 6:30-8pm in GESB 150

9/27/2012 Jure Leskovec, Stanford CS224W: Social and Information Network Analysis, http://cs224w.stanford.edu

Observations | Models | Algorithms
Small diameter, Edge clustering | Erdős-Rényi model, Small-world model | Decentralized search
Patterns of signed edge creation | Structural balance, Theory of status | Models for predicting edge signs
Viral Marketing, Blogosphere, Memetracking | Independent cascade model, Game theoretic model | Influence maximization, Outbreak detection, LIM
Scale-Free | Preferential attachment, Copying model | PageRank, Hubs and authorities
Densification power law, Shrinking diameters | Microscopic model of evolving networks | Link prediction, Supervised random walks
Strength of weak ties, Core-periphery | Kronecker Graphs | Community detection: Girvan-Newman, Modularity

For example, last time we talked about Observations and Models for the web graph:
1) We took a real system: the Web
2) We represented it as a directed graph
3) We used the language of graph theory
   - Strongly Connected Components
4) We designed a computational experiment:
   - Find In- and Out-components of a given node v
5) We learned something about the structure of the Web: BOWTIE!
[Figure: a node v and its out-component Out(v)]
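The computational experiment in step 4 can be sketched in a few lines of plain Python. This is a minimal sketch, not the course's SNAP or NetworkX code: the graph `toy_web` and the helper names `reachable` and `reverse` are hypothetical and chosen only for illustration. Out(v) is found by a breadth-first traversal along the edges, In(v) by the same traversal on the reversed graph, and their intersection is the strongly connected component containing v.

```python
from collections import deque


def reachable(adj, start):
    """Return the set of nodes reachable from `start` by following edges in `adj`."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def reverse(adj):
    """Return a copy of the directed graph with every edge flipped."""
    rev = {node: set() for node in adj}
    for src, targets in adj.items():
        for dst in targets:
            rev.setdefault(dst, set()).add(src)
    return rev


# Hypothetical toy "web": a -> b, b -> c, c -> b, c -> d
toy_web = {
    "a": {"b"},
    "b": {"c"},
    "c": {"b", "d"},
    "d": set(),
}

v = "b"
out_v = reachable(toy_web, v)            # Out(v): nodes v can reach
in_v = reachable(reverse(toy_web), v)    # In(v): nodes that can reach v
scc_v = out_v & in_v                     # strongly connected component of v

print(sorted(out_v), sorted(in_v), sorted(scc_v))
```

On this toy graph, node `a` lies only in In(v) and node `d` only in Out(v), which mirrors on a tiny scale the IN / SCC / OUT split of the bowtie picture.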
Undirected vs. directed networks:
- Undirected: links are undirected (symmetrical)
  - Examples: collaborations, friendship on Facebook
- Directed: links are directed (arcs)
  - Examples: phone calls, following on Twitter
[Figure: an undirected and a directed example graph]

Adjacency matrix:
$A_{ij} = 1$ if there is a link from node $i$ to node $j$
$A_{ij} = 0$ otherwise

Undirected example (nodes 1-4):
$$A = \begin{pmatrix} 0 & 1 & 0 & 1 \\ 1 & 0 & 0 & 1 \\ 0 & 0 & 0 & 1 \\ 1 & 1 & 1 & 0 \end{pmatrix}$$

Directed example (nodes 1-4):
$$A = \begin{pmatrix} 0 & 0 & 0 & 1 \\ 1 & 0 & 0 & 0 \\ 0 & 0 & 0 & 0 \\ 0 & 1 & 1 & 0 \end{pmatrix}$$

Note that for a directed graph (the second matrix) the matrix is not symmetric.

Node degree, $k_i$: the number of edges adjacent to node $i$

Undirected:
- Avg. degree: $\langle k \rangle = \frac{1}{N} \sum_{i=1}^{N} k_i = \frac{2E}{N}$

Directed:
- In directed networks we define an in-degree and an out-degree. The (total) degree of a node is the sum of its in- and out-degrees.
- Example (node C): $k_C^{in} = 2$, $k_C^{out} = 1$, $k_C = 3$
- Source: node with $k^{in} = 0$
- Sink: node with $k^{out} = 0$
- Avg. degree: $\langle k^{in} \rangle = \langle k^{out} \rangle = \frac{E}{N}$

The maximum number of edges in an undirected graph on $N$ nodes is
$$E_{max} = \binom{N}{2} = \frac{N(N-1)}{2}$$
A graph with $E = E_{max}$ edges is a complete graph, and its average degree is $N-1$.

Most real-world networks are sparse: $E \ll E_{max}$ (or $\langle k \rangle \ll N-1$)

Network | N | 〈k〉
WWW (Stanford-Berkeley) | 319,717 | 9.65
Social networks (LinkedIn) | 6,946,668 | 8.87
Communication (MSN IM) | 242,720,596 | 11.1
Coauthorships (DBLP) | 317,080 | 6.62
Internet (AS-Skitter) | 1,719,037 | 14.91
Roads (California) | 1,957,027 | 2.82
Protein (S. Cerevisiae) | 1,870 | 2.39
(Source: Leskovec et al., Internet Mathematics, 2009)

Consequence: the adjacency matrix is filled with zeros!
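The degree and sparsity definitions above can be checked on the two 4-node example matrices. This is a small sketch in plain Python. The matrices are reconstructed from the flattened slide text, so treat their exact entries as an assumption. The variable names are illustrative and do not come from SNAP or NetworkX.

```python
# Undirected example (assumed reconstruction): A[i][j] = 1 if nodes i+1 and j+1 are linked.
A_undirected = [
    [0, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 1],
    [1, 1, 1, 0],
]
# Directed example (assumed reconstruction): A[i][j] = 1 if there is a link from i+1 to j+1.
A_directed = [
    [0, 0, 0, 1],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
]

N = len(A_undirected)

# Undirected: k_i is the row sum; every edge is counted twice across all rows.
degrees = [sum(row) for row in A_undirected]
E = sum(degrees) // 2
avg_degree = 2 * E / N
E_max = N * (N - 1) // 2
zero_fraction = sum(row.count(0) for row in A_undirected) / N**2

# Directed: out-degree is the row sum, in-degree is the column sum.
out_deg = [sum(row) for row in A_directed]
in_deg = [sum(col) for col in zip(*A_directed)]
sources = [i + 1 for i, k in enumerate(in_deg) if k == 0]   # nodes with k_in = 0
sinks = [i + 1 for i, k in enumerate(out_deg) if k == 0]    # nodes with k_out = 0

print(degrees, E, avg_degree, E_max, zero_fraction)
print(out_deg, in_deg, sources, sinks)
```

Even in this tiny graph half of the matrix entries are zero. For the networks in the table, where 〈k〉 is around 10 and N is in the millions, almost every entry is zero, which is why adjacency lists or sparse matrices are the usual storage choice.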
(Density $E/N^2$: WWW = $1.51 \times 10^{-5}$, MSN IM = $2.27 \times 10^{-8}$)

Unweighted (undirected):
$$A_{ij} = \begin{pmatrix} 0 & 1 & 1 & 0 \\ 1 & 0 & 1 & 1 \\ 1 & 1 & 0 & 0 \\ 0 & 1 & 0 & 0 \end{pmatrix}$$
Examples: Friendship, Sex

Weighted (undirected):
$$A_{ij} = \begin{pmatrix} 0 & 2 & 0.5 & 0 \\ 2 & 0 & 1 & 4 \\ 0.5 & 1 & 0 & 0 \\ 0 & 4 & 0 & 0 \end{pmatrix}$$
Examples: Collaboration, Internet, Roads

$A_{ii} = 0$, $A_{ij} = A_{ji}$
$$E = \frac{1}{2} \sum_{i,j=1}^{N} \mathrm{nonzero}(A_{ij}), \qquad \langle k \rangle = \frac{2E}{N}$$

Self-edges (self-loops) (undirected):
$$A_{ij} = \begin{pmatrix} 1 & 1 & 1 & 0 \\ 1 & 0 & 1 & 1 \\ 1 & 1 & 0 & 0 \\ 0 & 1 & 0 & 1 \end{pmatrix}$$
$A_{ii} \neq 0$, $A_{ij} = A_{ji}$
$$E = \frac{1}{2} \sum_{i,j=1,\, i \neq j}^{N} A_{ij} + \sum_{i=1}^{N} A_{ii}$$
Examples: Proteins

Multigraph (undirected):
$$A_{ij} = \begin{pmatrix} 0 & 2 & 1 & 0 \\ 2 & 0 & 1 & 3 \\ 1 & 1 & 0 & 0 \\ 0 & 3 & 0 & 0 \end{pmatrix}$$
$A_{ii} = 0$, $A_{ij} = A_{ji}$
$$E = \frac{1}{2} \sum_{i,j=1}^{N} \mathrm{nonzero}(A_{ij}), \qquad \langle k \rangle = \frac{2E}{N}$$
Examples: Communication, Collaboration

Network types:
- WWW >> directed multigraph with self-interactions
- Facebook friendships >> undirected, unweighted
- Citation networks >> unweighted, directed, acyclic
- Collaboration networks >> undirected multigraph or weighted graph
- Mobile phone calls >> directed, (weighted?) multigraph
- Protein Interactions >> undirected, unweighted with self-interactions

A bipartite graph is a graph whose nodes can be divided into two disjoint sets U and V such that every link connects a node in U to one in V; that is, U and V are independent sets.
[Figure: a bipartite graph with node sets U and V, and its two folded projections]

Examples:
- Authors-to-papers (they authored)
- Actors-to-Movies (they appeared in)
- Users-to-Movies (they rated)

“Folded” networks:
- Author collaboration networks
- Movie co-rating networks

Degree distribution $P(k)$: probability that a randomly chosen node has degree $k$
- $N_k$ = # nodes with degree $k$
- Normalized histogram: $P(k) = N_k / N$ ➔ plot
[Figure: histograms of $N_k$ and $P(k)$ versus $k$]

A path is a sequence of nodes in which each node is linked to the next one:
$$P_n = \{i_0, i_1, i_2, \ldots, i_n\}$$
$$P_n = \{(i_0, i_1), (i_1, i_2), (i_2, i_3), \ldots, (i_{n-1}, i_n)\}$$
- A path can intersect itself and pass through the same edge multiple times
  - E.g.: ACBDCDEG
- In a directed graph a path can only follow the direction of the “arrow”

Number of paths between nodes u and v:
- Length h=1: If there is a link between u and v, $A_{uv} = 1$, else $A_{uv} = 0$
- Length h=2: If there is a path of length two between u and v then $A_{uk} A_{kv} = 1$, else $A_{uk} A_{kv} = 0$
$$H_{uv}^{(2)} = \sum_{k=1}^{N} A_{uk} A_{kv} = [A^2]_{uv}$$
- Length h: If there is a path of length h between u and v then $A_{uk} \cdots A_{kv} = 1$, else $A_{uk} \cdots A_{kv} = 0$
So, the number of paths of length h between u and v is
$$H_{uv}^{(h)} = [A^h]_{uv}$$
(holds for both directed and undirected graphs)

Distance (shortest path, geodesic) between a pair of nodes is defined as the number of edges along the shortest path connecting the nodes
- Example: $h_{B,D} = 2$
* If the two nodes are disconnected, the distance is usually defined as infinite
In directed graphs, paths need to follow the direction of the arrows
- Consequence: distance is not symmetric: $h_{A,C} \neq h_{C,A}$
  - Example: $h_{B,C} = 1$, $h_{C,B} = 2$

Diameter: the maximum (shortest path) distance between any pair of nodes in a graph

Average path length for a connected graph (component) or a strongly connected (component of a) directed graph:
$$\bar{h} = \frac{1}{2 E_{max}} \sum_{i,\, j \neq i} h_{ij}$$
where $h_{ij}$ is the distance from node $i$ to node $j$
- Many times we compute the average only over the connected pairs of nodes (we ignore “infinite” length paths)

Breadth-First Search:
- Start with node u, mark it to be at distance $h_u(u) = 0$, add u to the queue
- While the queue is not empty:
  - Take node v off the queue, put its unmarked neighbors w into the queue and mark $h_u(w) = h_u(v) + 1$
[Figure: example graph with each node labeled by its BFS distance from u]

Clustering coefficient: