Stochastic Adjacency Matrix M
CS246: Mining Massive Datasets. Jure Leskovec, Stanford University. http://cs246.stanford.edu

Web pages are not equally "important": compare www.joe-schmoe.com with www.stanford.edu. We already know that there is large diversity in the connectivity of the web graph, so we can rank pages by the link structure.

We will cover the following link-analysis approaches to computing the importance of nodes in a graph:
- PageRank
- Hubs and Authorities (HITS)
- Topic-Specific (Personalized) PageRank
- Web spam detection algorithms

Idea: links as votes. A page is more important if it has more links. In-coming links or out-going links? Think of in-links as votes: www.stanford.edu has 23,400 in-links, while www.joe-schmoe.com has 1 in-link. Are all in-links equal? No: links from important pages count more, which makes this a recursive question.

Each link's vote is proportional to the importance of its source page. If page p with importance x has n out-links, each link carries x/n votes. Page p's own importance is the sum of the votes on its in-links.

Example ("the web in 1839"): a three-page graph with pages y, a, m, where y links to y and a, a links to y and m, and m links to a. A "vote" from an important page is worth more; a page is important if it is pointed to by other important pages. Define a rank rj for node j:

  rj = Σi→j ri / di    (di = out-degree of node i)

Flow equations for the example:
  ry = ry/2 + ra/2
  ra = ry/2 + rm
  rm = ra/2

These are 3 equations in 3 unknowns with no constants, so there is no unique solution: all solutions are equivalent up to a scale factor. An additional constraint forces uniqueness: with ry + ra + rm = 1, we get ry = 2/5, ra = 2/5, rm = 1/5. Gaussian elimination works for small examples, but we need a better method for large, web-sized graphs.

Stochastic adjacency matrix M: let page j have dj out-links. If j → i, then Mij = 1/dj, else Mij = 0. M is a column-stochastic matrix: its columns sum to 1. Rank vector r: a vector with one entry per page, where ri is the importance score of page i and Σi ri = 1. The flow equations can then be written r = M r.

For example, suppose page j links to 3 pages, one of which is i. Then column j of M has three entries equal to 1/3, and page j contributes rj/3 to ri.

Since r = M r, the rank vector r is an eigenvector of the stochastic web matrix M; in fact, it is the principal eigenvector, with corresponding eigenvalue 1.

For the y, a, m example (rows and columns ordered y, a, m):

  M = | ½  ½  0 |
      | ½  0  1 |
      | 0  ½  0 |

and r = M r reproduces the flow equations ry = ry/2 + ra/2, ra = ry/2 + rm, rm = ra/2.

Power iteration: a simple iterative scheme for a web graph with N nodes (the pages) and edges (the hyperlinks):
- Initialize: r(0) = [1/N, ..., 1/N]^T
- Iterate: r(t+1) = M r(t), i.e., rj(t+1) = Σi→j ri(t) / di, where di is the out-degree of node i
- Stop when |r(t+1) - r(t)|₁ < ε, where |x|₁ = Σi |xi| is the L1 norm (any other vector norm, e.g., Euclidean, also works)
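To make the scheme concrete, here is a minimal sketch of power iteration (not from the slides; it assumes NumPy, and the function name and tolerance are my own), run on the y, a, m matrix above:

```python
# Minimal sketch (not from the slides): power iteration r <- M r with NumPy.
import numpy as np

def power_iterate(M, eps=1e-8, max_iter=1000):
    """Iterate r = M r from the uniform vector until the L1 change is below eps."""
    n = M.shape[0]
    r = np.full(n, 1.0 / n)                  # r(0) = [1/N, ..., 1/N]
    for _ in range(max_iter):
        r_next = M @ r                       # r(t+1) = M r(t)
        if np.abs(r_next - r).sum() < eps:   # L1 stopping criterion
            return r_next
        r = r_next
    return r

# Column-stochastic matrix of the y, a, m example: y -> {y, a}, a -> {y, m}, m -> {a}
M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])

print(power_iterate(M))   # approaches [2/5, 2/5, 1/5], matching the flow-equation solution
```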
Worked example on the y, a, m graph: set each rank to 1/N = 1/3 and repeatedly apply rj = Σi→j ri/di (equivalently r = M r). Over iterations 0, 1, 2, … the ranks evolve as

  ry: 1/3, 1/3, 5/12, 9/24, ... → 6/15
  ra: 1/3, 3/6, 1/3, 11/24, ... → 6/15
  rm: 1/3, 1/6, 3/12, 1/6, ... → 3/15

Random-surfer interpretation: imagine a random web surfer. At any time t the surfer is on some page u. At time t+1 the surfer follows an out-link from u uniformly at random and ends up on some page v linked from u. The process repeats indefinitely. Let p(t) be the vector whose i-th coordinate is the probability that the surfer is at page i at time t; p(t) is a probability distribution over pages.

Where is the surfer at time t+1? Since the surfer follows a link uniformly at random, p(t+1) = M p(t). Suppose the random walk reaches a state where p(t+1) = M p(t) = p(t); then p(t) is a stationary distribution of the random walk. Our rank vector r satisfies r = M r, so it is a stationary distribution for the random walk.

Given rj(t+1) = Σi→j ri(t)/di, or equivalently r = M r: does this converge? Does it converge to what we want? Are the results reasonable?

Example: two pages a and b that link only to each other. Starting from (ra, rb) = (1, 0), the iterations give

  ra: 1, 0, 1, 0, ...
  rb: 0, 1, 0, 1, ...

so the iteration oscillates and never converges.

Example: page a links to page b, and b has no out-links. Starting from (ra, rb) = (1, 0), the iterations give

  ra: 1, 0, 0, 0, ...
  rb: 0, 1, 0, 0, ...

so all importance leaks away.

Two problems:
- Some pages are "dead ends" (they have no out-links); such pages cause importance to "leak out".
- Spider traps (groups of pages whose out-links all stay within the group); eventually spider traps absorb all importance.

Power iteration with a spider trap: modify the example so that m links only to itself. With rows and columns ordered y, a, m:

  M = | ½  ½  0 |        ry = ry/2 + ra/2
      | ½  0  0 |        ra = ry/2
      | 0  ½  1 |        rm = ra/2 + rm

Starting from r = (1/3, 1/3, 1/3) and iterating:

  ry: 1/3, 2/6, 3/12, 5/24, ... → 0
  ra: 1/3, 1/6, 2/12, 3/24, ... → 0
  rm: 1/3, 3/6, 7/12, 16/24, ... → 1

The spider trap m absorbs all the importance.

The Google solution for spider traps: at each time step, the random surfer has two options. With probability β, follow a link at random; with probability 1-β, jump to some page chosen uniformly at random. Common values for β are in the range 0.8 to 0.9. The surfer will teleport out of a spider trap within a few time steps.

Power iteration with a dead end: now suppose m has no out-links at all:

  M = | ½  ½  0 |        ry = ry/2 + ra/2
      | ½  0  0 |        ra = ry/2
      | 0  ½  0 |        rm = ra/2

Starting from r = (1/3, 1/3, 1/3) and iterating:

  ry: 1/3, 2/6, 3/12, 5/24, ... → 0
  ra: 1/3, 1/6, 2/12, 3/24, ... → 0
  rm: 1/3, 1/6, 1/12, 2/24, ... → 0

The entire rank vector leaks to zero.
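As a sanity check on the teleport idea described above, here is a hedged sketch (not from the slides; it assumes NumPy and picks β = 0.8, one of the "common values" the lecture mentions) of power iteration with random teleports, applied to the spider-trap matrix:

```python
# Hedged sketch (not from the slides): power iteration with random teleports,
# r <- beta * M r + (1 - beta)/N, applied to the spider-trap graph above.
import numpy as np

def pagerank_with_teleports(M, beta=0.8, eps=1e-8, max_iter=1000):
    n = M.shape[0]
    r = np.full(n, 1.0 / n)
    for _ in range(max_iter):
        # follow a link with probability beta, teleport uniformly with probability 1 - beta
        r_next = beta * (M @ r) + (1.0 - beta) / n
        if np.abs(r_next - r).sum() < eps:
            return r_next
        r = r_next
    return r

# Spider trap: m links only to itself (rows and columns ordered y, a, m)
M_trap = np.array([[0.5, 0.5, 0.0],
                   [0.5, 0.0, 0.0],
                   [0.0, 0.5, 1.0]])

print(pagerank_with_teleports(M_trap))
# m no longer absorbs everything; y and a keep nonzero rank
```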
Teleports also handle dead ends: follow a random teleport link with probability 1.0 from a dead end, and adjust the matrix accordingly. For the dead-end example, the all-zero column for m is replaced by a uniform column:

  | ½  ½  0 |   becomes   | ½  ½  ⅓ |
  | ½  0  0 |             | ½  0  ⅓ |
  | 0  ½  0 |             | 0  ½  ⅓ |

Why does r(t+1) = M r(t) work? Markov chains: a Markov chain consists of a set of states X, a transition matrix P with Pij = P(Xt = i | Xt-1 = j), and a distribution π specifying the probability of being at each state x ∈ X. The goal is to find π such that π = P π.

Theory of Markov chains. Fact: for any start vector, the power method applied to a Markov transition matrix P will converge to a unique positive stationary vector, as long as P is stochastic, irreducible, and aperiodic.

Stochastic: every column sums to 1. Dead ends violate this; a possible solution is to add "green" teleport links out of dead ends:

  S = M + (1/n) · 1 · aᵀ

where ai = 1 if node i has out-degree 0 and ai = 0 otherwise, and 1 is the vector of all 1s. For the example:

  S = | ½  ½  ⅓ |        ry = ry/2 + ra/2 + rm/3
      | ½  0  ⅓ |        ra = ry/2 + rm/3
      | 0  ½  ⅓ |        rm = ra/2 + rm/3

A chain is periodic if there exists k > 1 such that the interval between two visits to some state s is always a multiple of k.
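Returning to the stochasticity fix above, here is a hedged NumPy sketch (not from the slides; the function name is my own) that builds S = M + (1/n) · 1 · aᵀ for the dead-end example and checks that every column of S sums to 1:

```python
# Hedged sketch (not from the slides): make a matrix column-stochastic by adding
# "green" teleport links from dead ends, S = M + (1/n) * 1 * a^T.
import numpy as np

def fix_dead_ends(M):
    n = M.shape[0]
    a = (M.sum(axis=0) == 0).astype(float)   # a_i = 1 if node i has out-degree 0
    return M + np.outer(np.ones(n) / n, a)   # dead-end columns become uniform 1/n

# Dead-end example: m has no out-links (rows and columns ordered y, a, m)
M_dead = np.array([[0.5, 0.5, 0.0],
                   [0.5, 0.0, 0.0],
                   [0.0, 0.5, 0.0]])

S = fix_dead_ends(M_dead)
print(S)                 # column m becomes [1/3, 1/3, 1/3]
print(S.sum(axis=0))     # every column now sums to 1, so S is stochastic
```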