Stochastic Adjacency Matrix M

CS246: Mining Massive Datasets
Jure Leskovec, Stanford University
http://cs246.stanford.edu
2/7/2012

Web pages are not equally “important”
  • www.joe-schmoe.com vs. www.stanford.edu
  • We already know: since there is large diversity in the connectivity of the web graph, we can rank the pages by the link structure

We will cover the following Link Analysis approaches to computing importances of nodes in a graph:
  • PageRank
  • Hubs and Authorities (HITS)
  • Topic-Specific (Personalized) PageRank
  • Web Spam Detection Algorithms

Idea: Links as votes
  • A page is more important if it has more links
  • In-coming links? Out-going links?
Think of in-links as votes:
  • www.stanford.edu has 23,400 in-links
  • www.joe-schmoe.com has 1 in-link
Are all in-links equal?
  • Links from important pages count more
  • Recursive question!

Each link’s vote is proportional to the importance of its source page
  • If page p with importance x has n out-links, each link gets x/n votes
  • Page p’s own importance is the sum of the votes on its in-links

A “vote” from an important page is worth more
  • A page is important if it is pointed to by other important pages
  • Define a “rank” $r_j$ for node j:
    $r_j = \sum_{i \to j} \frac{r_i}{d_{out}(i)}$

[Figure: “The web in 1839”, a three-page graph with nodes y, a, m. Page y sends y/2 to itself and y/2 to a; page a sends a/2 to y and a/2 to m; page m sends m to a.]

Flow equations:
    $r_y = r_y/2 + r_a/2$
    $r_a = r_y/2 + r_m$
    $r_m = r_a/2$

Flow equations: 3 equations, 3 unknowns, no constants
  • No unique solution
  • All solutions equivalent modulo scale factor
Additional constraint forces uniqueness:
  • y + a + m = 1
  • y = 2/5, a = 2/5, m = 1/5
Gaussian elimination works for small examples, but we need a better method for large web-size graphs

Stochastic adjacency matrix M
  • Let page j have $d_j$ out-links
  • If j → i, then $M_{ij} = 1/d_j$, else $M_{ij} = 0$
  • M is a column stochastic matrix: columns sum to 1
Rank vector r: vector with an entry per page
  • $r_i$ is the importance score of page i
  • $\sum_i r_i = 1$
The flow equations can be written
    $r = M r$

Suppose page j links to 3 pages, including i
[Figure: in the product r = M r, column j of M has the entry $M_{ij} = 1/3$ in row i.]

The flow equations can be written $r = M \cdot r$
  • So the rank vector is an eigenvector of the stochastic web matrix
  • In fact, it is its first or principal eigenvector, with corresponding eigenvalue 1

Example (graph with pages y, a, m):

         y    a    m
    y    ½    ½    0
    a    ½    0    1
    m    0    ½    0

    r = M r:
    $r_y = r_y/2 + r_a/2$
    $r_a = r_y/2 + r_m$
    $r_m = r_a/2$

Given a web graph with n nodes, where the nodes are pages and edges are hyperlinks
Power iteration: a simple iterative scheme
  • Suppose there are N web pages
  • Initialize: $r^{(0)} = [1/N, \ldots, 1/N]^T$
  • Iterate: $r^{(t+1)} = M \cdot r^{(t)}$, that is, $r_j^{(t+1)} = \sum_{i \to j} \frac{r_i^{(t)}}{d_i}$, where $d_i$ is the out-degree of node i
  • Stop when $|r^{(t+1)} - r^{(t)}|_1 < \varepsilon$
  • $|x|_1 = \sum_{1 \le i \le N} |x_i|$ is the L1 norm
  • Can use any other vector norm, e.g., Euclidean

Power Iteration:
  • Set $r_j = 1/N$
  • Update $r_j = \sum_{i \to j} \frac{r_i}{d_i}$
  • And iterate
  • Equivalently, $r_i = \sum_j M_{ij} \cdot r_j$

Example (same graph and matrix M as above):

    Iteration    0      1      2      3       …    limit
    r_y          1/3    1/3    5/12   9/24    …    6/15
    r_a          1/3    3/6    1/3    11/24   …    6/15
    r_m          1/3    1/6    3/12   1/6     …    3/15

Imagine a random web surfer:
  • At any time t, the surfer is on some page u
  • At time t+1, the surfer follows an out-link from u uniformly at random
  • Ends up on some page v linked from u
  • Process repeats indefinitely
[Figure: pages i1, i2, i3 each link to page j; $r_j = \sum_{i \to j} \frac{r_i}{d_{out}(i)}$.]
Let:
  • p(t) be the vector whose i-th coordinate is the probability that the surfer is at page i at time t
  • p(t) is a probability distribution over pages

Where is the surfer at time t+1?
  • Follows a link uniformly at random:
    $p(t+1) = M \cdot p(t)$
Suppose the random walk reaches a state where
    $p(t+1) = M \cdot p(t) = p(t)$
then p(t) is a stationary distribution of the random walk
Our rank vector r satisfies $r = M \cdot r$
  • So, it is a stationary distribution for the random walk

$r_j^{(t+1)} = \sum_{i \to j} \frac{r_i^{(t)}}{d_i}$, or equivalently $r = M r$
  • Does this converge?
  • Does it converge to what we want?
  • Are results reasonable?

Example (two-node graph with pages a and b), using $r_j^{(t+1)} = \sum_{i \to j} \frac{r_i^{(t)}}{d_i}$:

    Iteration    0    1    2    3    …
    r_a          1    0    1    0
    r_b          0    1    0    1

Example (a second two-node graph with pages a and b), using the same update:

    Iteration    0    1    2    3    …
    r_a          1    0    0    0
    r_b          0    1    0    0

2 problems:
  • Some pages are “dead ends” (have no out-links)
    Such pages cause importance to “leak out”
  • Spider traps (all out-links are within the group)
    Eventually spider traps absorb all importance

Spider trap example (page m links only to itself):

         y    a    m
    y    ½    ½    0
    a    ½    0    0
    m    0    ½    1

    $r_y = r_y/2 + r_a/2$
    $r_a = r_y/2$
    $r_m = r_a/2 + r_m$

Power Iteration:
  • Set $r_j = 1/N$
  • Update $r_j = \sum_{i \to j} \frac{r_i}{d_i}$
  • And iterate

    Iteration    0      1      2      3       …    limit
    r_y          1/3    2/6    3/12   5/24    …    0
    r_a          1/3    1/6    2/12   3/24    …    0
    r_m          1/3    3/6    7/12   16/24   …    1

The Google solution for spider traps:
At each time step, the random surfer has two options:
  • With probability β, follow a link at random
  • With probability 1-β, jump to some page uniformly at random
  • Common values for β are in the range 0.8 to 0.9
Surfer will teleport out of a spider trap within a few time steps
[Figure: the y, a, m graph shown without and with teleport links.]

Dead end example (page m has no out-links):

         y    a    m
    y    ½    ½    0
    a    ½    0    0
    m    0    ½    0

    $r_y = r_y/2 + r_a/2$
    $r_a = r_y/2$
    $r_m = r_a/2$

Power Iteration:
  • Set $r_j = 1/N$
  • Update $r_j = \sum_{i \to j} \frac{r_i}{d_i}$
  • And iterate

    Iteration    0      1      2      3       …    limit
    r_y          1/3    2/6    3/12   5/24    …    0
    r_a          1/3    1/6    2/12   3/24    …    0
    r_m          1/3    1/6    1/12   2/24    …    0

Teleports: Follow random teleport links with probability 1.0 from dead ends
  • Adjust matrix accordingly

    Before:                     After:
         y    a    m                 y    a    m
    y    ½    ½    0            y    ½    ½    ⅓
    a    ½    0    0            a    ½    0    ⅓
    m    0    ½    0            m    0    ½    ⅓

$r^{(t+1)} = M r^{(t)}$
Markov Chains
  • Set of states X
  • Transition matrix P where $P_{ij} = P(X_t = i \mid X_{t-1} = j)$
  • π specifying the probability of being at each state x ∈ X
  • Goal is to find π such that $\pi = P \pi$

Theory of Markov chains
Fact: For any start vector, the power method applied to a Markov transition matrix P will converge to a unique positive stationary vector as long as P is stochastic, irreducible and aperiodic.

Stochastic: Every column sums to 1
A possible solution: Add green links
    $S = M + \frac{1}{n} \mathbf{1} a^T$
  • $a_i = 1$ if node i has out-degree 0, $a_i = 0$ otherwise
  • $\mathbf{1}$ is the column vector of all 1s

         y    a    m
    y    ½    ½    ⅓
    a    ½    0    ⅓
    m    0    ½    ⅓

    $r_y = r_y/2 + r_a/2 + r_m/3$
    $r_a = r_y/2 + r_m/3$
    $r_m = r_a/2 + r_m/3$

A chain is periodic if there exists k > 1 such that the interval between two visits to some state s is always a multiple of k.
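
The power iteration, the dead-end fix, and the random-teleport rule described above can be checked on the three-page y, a, m examples. What follows is a minimal sketch, not the course's implementation: it assumes M is stored as a plain list of rows and uses exact arithmetic from Python's fractions module. The names power_iterate and patch_dead_ends are illustrative, and the update r = β·M·r + (1-β)/n is one common way to write the teleport rule, which the slides above state only in words.

```python
from fractions import Fraction as F

def power_iterate(M, steps, beta=F(1)):
    """Run `steps` rounds of r <- beta * M r + (1 - beta) / n from the uniform vector.

    M[i][j] is the weight of link j -> i (column-stochastic convention).
    beta must be an int or Fraction; beta = 1 gives the plain update r <- M r.
    """
    n = len(M)
    teleport = F(1 - beta, n)      # share every page receives from random jumps
    r = [F(1, n)] * n              # r^(0) = [1/n, ..., 1/n]
    for _ in range(steps):
        r = [beta * sum(M[i][j] * r[j] for j in range(n)) + teleport
             for i in range(n)]
    return r

def patch_dead_ends(M):
    """Replace every all-zero column (a dead end) by the uniform column 1/n."""
    n = len(M)
    dead = [all(M[i][j] == 0 for i in range(n)) for j in range(n)]
    return [[F(1, n) if dead[j] else M[i][j] for j in range(n)]
            for i in range(n)]

h = F(1, 2)
M_web  = [[h, h, 0], [h, 0, 1], [0, h, 0]]   # first example
M_trap = [[h, h, 0], [h, 0, 0], [0, h, 1]]   # m links only to itself
M_dead = [[h, h, 0], [h, 0, 0], [0, h, 0]]   # m has no out-links

r3      = power_iterate(M_web, 3)                   # third iterate, first example
leaked  = power_iterate(M_dead, 3)                  # total importance drops below 1
S       = patch_dead_ends(M_dead)                   # column m becomes 1/3, 1/3, 1/3
fixed   = power_iterate(S, 3)                       # total importance stays 1
escaped = power_iterate(M_trap, 100, beta=F(4, 5))  # trap no longer absorbs everything
```

Exact fractions make the iterates directly comparable with the tables above; for web-scale graphs one would instead use floating point and a sparse representation of M.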
