CS246: Mining Massive Datasets
Jure Leskovec, Stanford University
http://cs246.stanford.edu

Web pages are not equally "important": www.joe-schmoe.com vs. www.stanford.edu
We already know: since there is large diversity in the connectivity of the web graph, we can rank pages by the link structure.
2/7/2012 Jure Leskovec, Stanford C246: Mining Massive Datasets

We will cover the following link-analysis approaches to computing the importance of nodes in a graph:
. PageRank
. Hubs and Authorities (HITS)
. Topic-Specific (Personalized) PageRank
. Web Spam Detection Algorithms
Idea: Links as votes
. A page is more important if it has more links
. In-coming links? Out-going links?
Think of in-links as votes:
. www.stanford.edu has 23,400 in-links
. www.joe-schmoe.com has 1 in-link
Are all in-links equal?
. Links from important pages count more
. Recursive question!
Each link's vote is proportional to the importance of its source page
If page p with importance x has n out-links, each link gets x/n votes
Page p’s own importance is the sum of the votes on its in-links
A "vote" from an important page is worth more: a page is important if it is pointed to by other important pages.

[Figure: "The web in 1839" — three pages y, a, m; each out-link of a page carries half of its rank, e.g., y sends y/2 along each of its two out-links.]

Define a "rank" rj for node j:

rj = Σi→j ri / dout(i)

Flow equations:
ry = ry/2 + ra/2
ra = ry/2 + rm
rm = ra/2
Flow equations:
ry = ry/2 + ra/2
ra = ry/2 + rm
rm = ra/2

3 equations, 3 unknowns, no constants
. No unique solution
. All solutions equivalent modulo a scale factor
An additional constraint forces uniqueness:
. ry + ra + rm = 1
. Solution: ry = 2/5, ra = 2/5, rm = 1/5
Gaussian elimination works for small examples, but we need a better method for large, web-scale graphs.
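As a sanity check on the small example, the three flow equations plus the normalization constraint can be solved directly. A minimal sketch in plain Python: rewrite r = M·r as (M − I)·r = 0, replace the redundant third equation with ry + ra + rm = 1, and run Gaussian elimination.

```python
# Solve the 3x3 flow-equation system by Gaussian elimination (no libraries).
# Augmented matrix [A | b]; third row is the constraint ry + ra + rm = 1.
A = [[-0.5,  0.5,  0.0, 0.0],   # -ry/2 + ra/2        = 0
     [ 0.5, -1.0,  1.0, 0.0],   #  ry/2 - ra   + rm   = 0
     [ 1.0,  1.0,  1.0, 1.0]]   #  ry   + ra   + rm   = 1

n = 3
for col in range(n):                       # forward elimination with pivoting
    pivot = max(range(col, n), key=lambda i: abs(A[i][col]))
    A[col], A[pivot] = A[pivot], A[col]
    for row in range(col + 1, n):
        f = A[row][col] / A[col][col]
        A[row] = [a - f * b for a, b in zip(A[row], A[col])]

r = [0.0] * n
for i in range(n - 1, -1, -1):             # back substitution
    r[i] = (A[i][n] - sum(A[i][j] * r[j] for j in range(i + 1, n))) / A[i][i]

print(r)   # ry = 2/5, ra = 2/5, rm = 1/5
```

This works here, but the O(n^3) cost is exactly why it does not scale to web-sized graphs.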
Stochastic adjacency matrix M
. Let page j have dj out-links
. If j → i, then Mij = 1/dj, else Mij = 0
. M is a column-stochastic matrix: columns sum to 1
Rank vector r: vector with an entry per page
. ri is the importance score of page i . ∑i ri = 1
The flow equations can be written r = M r
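A small sketch of the definition above: build the column-stochastic matrix M from an adjacency list (here the y/a/m example graph), where column j holds 1/dj for each out-neighbor of j.

```python
# Build the column-stochastic matrix M from an adjacency list (y/a/m example).
graph = {"y": ["y", "a"], "a": ["y", "m"], "m": ["a"]}   # j -> out-links of j
nodes = sorted(graph)
idx = {v: i for i, v in enumerate(nodes)}
N = len(nodes)

M = [[0.0] * N for _ in range(N)]
for j, dests in graph.items():
    for i in dests:
        M[idx[i]][idx[j]] = 1.0 / len(dests)   # M[i][j] = 1/d_j if j -> i

for j in range(N):                              # every column sums to 1
    assert abs(sum(M[i][j] for i in range(N)) - 1.0) < 1e-12
print(M)
```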
Suppose page j links to 3 pages, including i. Then column j of M has three entries equal to 1/3, so in the product M ∙ r, page j contributes rj/3 to ri.
The flow equations can be written r = M ∙ r
So the rank vector r is an eigenvector of the stochastic web matrix M
. In fact, r is its first (principal) eigenvector, with corresponding eigenvalue 1
Example:

      y   a   m
 y [  ½   ½   0 ]
 a [  ½   0   1 ]
 m [  0   ½   0 ]

r = M ∙ r is exactly the flow equations:
ry = ry/2 + ra/2
ra = ry/2 + rm
rm = ra/2
Given a web graph with N nodes, where the nodes are pages and the edges are hyperlinks.

Power iteration: a simple iterative scheme
. Initialize: r(0) = [1/N, …, 1/N]T
. Iterate: r(t+1) = M ∙ r(t), i.e., rj(t+1) = Σi→j ri(t)/di, where di is the out-degree of node i
. Stop when |r(t+1) – r(t)|1 < ε
. |x|1 = Σ1≤i≤N |xi| is the L1 norm; any other vector norm (e.g., Euclidean) also works
Power iteration example:

      y   a   m
 y [  ½   ½   0 ]
 a [  ½   0   1 ]
 m [  0   ½   0 ]

. Set rj(0) = 1/N
. Iterate rj(t+1) = Σi→j ri(t)/di, i.e., r = M ∙ r

Flow equations:
ry = ry/2 + ra/2
ra = ry/2 + rm
rm = ra/2

 ry     1/3   1/3   5/12    9/24  …  6/15
 ra  =  1/3   3/6   1/3    11/24  …  6/15
 rm     1/3   1/6   3/12    1/6   …  3/15
Iteration 0, 1, 2, …
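The iteration above can be sketched in a few lines of plain Python; with entry order [ry, ra, rm] it reproduces the fixed point ry = ra = 2/5, rm = 1/5.

```python
# Power iteration on the y/a/m example, entry order [ry, ra, rm].
M = [[0.5, 0.5, 0.0],
     [0.5, 0.0, 1.0],
     [0.0, 0.5, 0.0]]
N = 3
r = [1.0 / N] * N                         # r(0) = [1/N, ..., 1/N]

for _ in range(10_000):
    r_new = [sum(M[i][j] * r[j] for j in range(N)) for i in range(N)]
    done = sum(abs(a - b) for a, b in zip(r_new, r)) < 1e-12   # L1 norm < eps
    r = r_new
    if done:
        break

print(r)   # converges to [2/5, 2/5, 1/5]
```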
Imagine a random web surfer:
. At any time t, the surfer is on some page u
. At time t+1, the surfer follows an out-link from u uniformly at random
. Ends up on some page v linked from u
. Process repeats indefinitely

rj = Σi→j ri / dout(i)

Let p(t) be the vector whose ith coordinate is the probability that the surfer is at page i at time t
. p(t) is a probability distribution over pages
Where is the surfer at time t+1?
. Follows a link uniformly at random: p(t+1) = M ∙ p(t)
Suppose the random walk reaches a state where p(t+1) = M ∙ p(t) = p(t); then p(t) is a stationary distribution of the random walk.
Our rank vector r satisfies r = M ∙ r
. So, it is a stationary distribution for the random walk
rj(t+1) = Σi→j ri(t)/di, or equivalently r = M ∙ r
Does this converge?
Does it converge to what we want?
Are results reasonable?
rj(t+1) = Σi→j ri(t)/di

Example (two pages a and b that link only to each other):
 ra  =  1  0  1  0  …
 rb     0  1  0  1  …
Iteration 0, 1, 2, … — the iteration oscillates and never converges
rj(t+1) = Σi→j ri(t)/di

Example (a → b, where b is a dead end):
 ra  =  1  0  0  0  …
 rb     0  1  0  0  …
Iteration 0, 1, 2, … — all importance leaks out
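Both failure modes are easy to see numerically; a minimal sketch using the two 2-page examples above:

```python
# Two failure modes of plain power iteration on 2-page graphs.
def step(M, r):
    return [sum(M[i][j] * r[j] for j in range(len(r))) for i in range(len(r))]

# (1) a <-> b cycle: the iterate oscillates and never converges.
M_cycle = [[0.0, 1.0], [1.0, 0.0]]
r_cycle = [1.0, 0.0]
for _ in range(3):
    r_cycle = step(M_cycle, r_cycle)
print(r_cycle)   # [1,0] -> [0,1] -> [1,0] -> [0,1]: oscillation

# (2) a -> b with b a dead end: column b is all zeros (M is not stochastic),
# so all importance leaks out to zero.
M_dead = [[0.0, 0.0], [1.0, 0.0]]
r_dead = [1.0, 0.0]
for _ in range(3):
    r_dead = step(M_dead, r_dead)
print(r_dead)   # [0.0, 0.0]: the score has leaked away
```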
Two problems:
Some pages are "dead ends" (have no out-links)
. Such pages cause importance to "leak out"
Spider traps (all out-links are within the group)
. Eventually spider traps absorb all importance
Power iteration example with a spider trap (m now links only to itself):

      y   a   m
 y [  ½   ½   0 ]
 a [  ½   0   0 ]
 m [  0   ½   1 ]

. Set rj(0) = 1/N and iterate rj(t+1) = Σi→j ri(t)/di

Flow equations:
ry = ry/2 + ra/2
ra = ry/2
rm = ra/2 + rm

 ry     1/3   2/6   3/12    5/24  …  0
 ra  =  1/3   1/6   2/12    3/24  …  0
 rm     1/3   3/6   7/12   16/24  …  1
Iteration 0, 1, 2, … — the trap absorbs all importance
The Google solution for spider traps: at each time step, the random surfer has two options:
. With probability β, follow a link at random
. With probability 1–β, jump to some page uniformly at random
. Common values for β are in the range 0.8 to 0.9
The surfer will teleport out of a spider trap within a few time steps.
Power iteration example with a dead end (m has no out-links):

      y   a   m
 y [  ½   ½   0 ]
 a [  ½   0   0 ]
 m [  0   ½   0 ]

. Set rj(0) = 1/N and iterate rj(t+1) = Σi→j ri(t)/di

Flow equations:
ry = ry/2 + ra/2
ra = ry/2
rm = ra/2

 ry     1/3   2/6   3/12   5/24  …  0
 ra  =  1/3   1/6   2/12   3/24  …  0
 rm     1/3   1/6   1/12   2/24  …  0
Iteration 0, 1, 2, … — all importance leaks out
Teleports: follow random teleport links with probability 1.0 from dead ends
. Adjust matrix accordingly
      y   a   m              y   a   m
 y [  ½   ½   0 ]       y [  ½   ½   ⅓ ]
 a [  ½   0   0 ]  →    a [  ½   0   ⅓ ]
 m [  0   ½   0 ]       m [  0   ½   ⅓ ]
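This matrix adjustment can be sketched directly: scan for all-zero (dead-end) columns and replace each with uniform teleport probabilities 1/N, as in the y/a/m example above.

```python
# Make M stochastic by replacing each all-zero (dead-end) column with 1/N.
def fix_dead_ends(M):
    N = len(M)
    for j in range(N):
        if all(M[i][j] == 0.0 for i in range(N)):   # dead-end column
            for i in range(N):
                M[i][j] = 1.0 / N
    return M

M = [[0.5, 0.5, 0.0],    # columns y, a, m; m is a dead end (zero column)
     [0.5, 0.0, 0.0],
     [0.0, 0.5, 0.0]]
fix_dead_ends(M)
print([row[2] for row in M])   # column m becomes [1/3, 1/3, 1/3]
```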
r(t+1) = M ∙ r(t)

Markov chains:
. Set of states X
. Transition matrix P, where Pij = P(Xt = i | Xt−1 = j)
. A probability distribution π specifying the probability of being at each state x ∈ X
. Goal: find π such that π = P ∙ π
Theory of Markov chains
Fact: for any start vector, the power method applied to a Markov transition matrix P will converge to a unique positive stationary vector, as long as P is stochastic, irreducible, and aperiodic.
Stochastic: every column sums to 1.
A possible solution: add green teleport links out of the dead end.

S = M + (1/n) ∙ 1 ∙ aT
. ai = 1 if node i has out-degree 0, else ai = 0
. 1 … vector of all 1s

      y   a   m
 y [  ½   ½  1/3 ]      ry = ry/2 + ra/2 + rm/3
 a [  ½   0  1/3 ]      ra = ry/2 + rm/3
 m [  0   ½  1/3 ]      rm = ra/2 + rm/3
Aperiodic: a chain is periodic if there exists k > 1 such that the interval between two visits to some state s is always a multiple of k.
A possible solution: add green links.
Irreducible: from any state, there is a non-zero probability of going to any other state.
A possible solution: add green links.
Google's solution that does it all:
. Makes M stochastic, aperiodic, and irreducible
At each step, the random surfer has two options:
. With probability β, follow a link at random
. With probability 1–β, jump to some random page

PageRank equation [Brin-Page, 98]:

rj = Σi→j β ∙ ri/di + (1–β) ∙ 1/n        (di … out-degree of node i)

assuming we follow random teleport links with probability 1.0 from dead ends.

The Google matrix A:

A = β ∙ S + (1–β) ∙ (1/n) ∙ 1 ∙ 1T

A is stochastic, aperiodic, and irreducible, so

r(t+1) = A ∙ r(t)

What is β?
. In practice β = 0.85 (make about 5 steps on average, then jump)
Example with β = 0.8 (m links only to itself):

          ½  ½  0          1/3 1/3 1/3      7/15  7/15   1/15
A = 0.8 ∙ ½  0  0  + 0.2 ∙ 1/3 1/3 1/3  =   7/15  1/15   1/15
          0  ½  1          1/3 1/3 1/3      1/15  7/15  13/15

Power iteration:
 ry     1/3   0.33   0.24   0.26  …   7/33
 ra  =  1/3   0.20   0.20   0.18  …   5/33
 rm     1/3   0.46   0.52   0.56  …  21/33
Iteration 0, 1, 2, …
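A minimal sketch reproducing this example: build A = β∙M + (1–β)/N entrywise for the spider-trap graph and power-iterate; the result approaches (7/33, 5/33, 21/33).

```python
# PageRank with teleport on the spider-trap example (m links only to itself).
beta, N = 0.8, 3
M = [[0.5, 0.5, 0.0],    # columns y, a, m
     [0.5, 0.0, 0.0],
     [0.0, 0.5, 1.0]]
A = [[beta * M[i][j] + (1 - beta) / N for j in range(N)] for i in range(N)]

r = [1.0 / N] * N
for _ in range(200):
    r = [sum(A[i][j] * r[j] for j in range(N)) for i in range(N)]
print(r)   # approaches [7/33, 5/33, 21/33]
```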
Suppose there are N pages. Consider a page j with set of out-links O(j). We have Mij = 1/|O(j)| when j→i, and Mij = 0 otherwise.
The random teleport is equivalent to:
. Adding a teleport link from j to every other page with probability (1–β)/N
. Reducing the probability of following each out-link from 1/|O(j)| to β/|O(j)|
. Equivalently: tax each page a fraction (1–β) of its score and redistribute it evenly
Construct the N×N matrix A as follows:
. Aij = β ∙ Mij + (1–β)/N
Verify that A is a stochastic matrix.
The PageRank vector r is the principal eigenvector of this matrix A, satisfying r = A ∙ r.
Equivalently, r is the stationary distribution of the random walk with teleports.
Key step is matrix-vector multiplication: rnew = A ∙ rold
Easy if we have enough main memory to hold A, rold, rnew.
Say N = 1 billion pages, and we need 4 bytes for each entry:
. 2 billion entries for the two vectors, approx 8 GB
. But matrix A has N² = 10^18 entries — 10^18 is a large number!

A = β ∙ M + (1–β) [1/N]N×N:

          ½  ½  0          1/3 1/3 1/3      7/15  7/15   1/15
A = 0.8 ∙ ½  0  0  + 0.2 ∙ 1/3 1/3 1/3  =   7/15  1/15   1/15
          0  ½  1          1/3 1/3 1/3      1/15  7/15  13/15
2/7/2012 Jure Leskovec, Stanford C246: Mining Massive Datasets 34 1 = , where = + −훽 = =1 푖푖 푖푖 푁 푟 퐴 ⋅ 푟 퐴1 훽 푀 = 푁 + 푟푖 ∑�=1 퐴푖푖 ⋅ 푟� −훽1 = 푁 + 푟푖 ∑� =1 훽 푀푖푖 푁 ⋅ 푟� =1 1−훽 = 푁 + 푁 = 1 ∑�=1 훽 푀푖푖 ⋅ 푟� 푁 ,∑ �since푟� 푁 1 −훽 = + So, ∑� 훽 푀푖푖 ⋅ 푟� 푁 ∑푟� −훽 푟 훽 푀 ⋅ 푟 푁 푁 [x]N … a vector of length N with all entries x
We can rearrange the PageRank equation:
. r = β ∙ M ∙ r + [(1–β)/N]N
. [(1–β)/N]N is an N-vector with all entries (1–β)/N
M is a sparse matrix!
. With 10 links per node on average, approx 10N entries
So in each iteration, we need to:
. Compute rnew = β M ∙ rold
. Add a constant value (1–β)/N to each entry in rnew
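This iteration can be sketched directly from an adjacency list, never materializing the dense matrix (the small graph below is a hypothetical example with no dead ends, so the scores keep summing to 1):

```python
# Sparse iteration: r_new = beta * M @ r_old + (1-beta)/N, from an
# adjacency list rather than a dense matrix.
beta = 0.8
graph = {0: [1, 2], 1: [2], 2: [0]}     # hypothetical small graph, no dead ends
N = len(graph)
r = [1.0 / N] * N

for _ in range(100):
    r_new = [(1 - beta) / N] * N        # constant teleport term
    for src, dests in graph.items():
        share = beta * r[src] / len(dests)
        for d in dests:
            r_new[d] += share
    r = r_new
print(r, sum(r))   # scores stay normalized since there are no dead ends
```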
Encode the sparse matrix using only its nonzero entries
. Space proportional roughly to the number of links
. Say 10N, or 4∙10∙1 billion = 40GB
. Still won't fit in memory, but will fit on disk

source node | degree | destination nodes
0 | 3 | 1, 5, 7
1 | 5 | 17, 64, 113, 117, 245
2 | 2 | 13, 23
Assume enough RAM to fit rnew into memory
. Store rold and matrix M on disk
Then 1 step of power iteration is:
. Initialize all entries of rnew to (1–β)/N
. For each page p (of out-degree n):
    Read into memory: p, n, dest1, …, destn, rold(p)
    for j = 1…n: rnew(destj) += β ∙ rold(p) / n

source node | degree | destination nodes
0 | 3 | 1, 5, 6
1 | 4 | 17, 64, 113, 117
2 | 2 | 13, 23

In each iteration, we have to:
. Read rold and M
. Write rnew back to disk
. IO cost = 2|r| + |M|
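The out-of-core step above can be sketched as follows; here the "disk" is just an iterator of (page, degree, destinations) records over a hypothetical 3-node graph, and only rnew is held in memory.

```python
# One out-of-core power-iteration step: M and r_old live "on disk"
# (modeled as a list of records), r_new is the only in-memory vector.
beta, N = 0.8, 3
M_on_disk = [(0, 2, [1, 2]), (1, 1, [2]), (2, 1, [0])]   # hypothetical records
r_old = [1.0 / N] * N                                    # read entry by entry

r_new = [(1 - beta) / N] * N          # initialize all entries to (1-beta)/N
for p, deg, dests in M_on_disk:       # one sequential scan of M
    for d in dests:
        r_new[d] += beta * r_old[p] / deg
print(r_new)
```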
Question:
. What if we cannot even fit rnew in memory?
Example graph stored as a sparse matrix (rnew and rold each have entries 0…5):

src | degree | destinations
0 | 4 | 0, 1, 3, 5
1 | 2 | 0, 5
2 | 2 | 3, 4
Can we do better? . Hint: M is much bigger than r (approx 10-20x), so we must avoid reading it k times per iteration
2/7/2012 Jure Leskovec, Stanford C246: Mining Massive Datasets 41 src degree destination rnew 0 4 0, 1 0 old 1 1 3 0 r 0 2 2 1 1 2 2 0 4 3 3 4 3 2 2 3 5
0 4 5 4 5 1 3 5 2 2 4
Break M into stripes
. Each stripe contains only destination nodes in the corresponding block of rnew
Some additional overhead per stripe
. But it is usually worth it
Cost per iteration: |M|(1+ε) + (k+1)|r|
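A sketch of the block-stripe idea on the 6-node example graph above: destinations are partitioned into k blocks, each stripe of M holds only the destinations in its block, and each stripe is scanned once while filling the matching block of rnew (the teleport term and dead-end handling are omitted here for brevity).

```python
# Block-stripe update sketch: M is read once per iteration, in k stripes.
beta, N, k = 0.8, 6, 3
edges = {0: [0, 1, 3, 5], 1: [0, 5], 2: [3, 4]}   # graph from the example
block = lambda d: d // (N // k)                    # dest -> block id

# Precompute stripes: stripes[b] holds (src, degree, dests-in-block-b).
stripes = [[] for _ in range(k)]
for src, dests in edges.items():
    for b in range(k):
        in_b = [d for d in dests if block(d) == b]
        if in_b:
            stripes[b].append((src, len(dests), in_b))

r_old = [1.0 / N] * N
r_new = [0.0] * N
for b in range(k):                    # fill one block of r_new at a time
    for src, deg, dests in stripes[b]:
        for d in dests:
            r_new[d] += beta * r_old[src] / deg
print(r_new)
```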
Measures generic popularity of a page
. Biased against topic-specific authorities
. Solution: Topic-Specific PageRank (next)
Uses a single measure of importance
. Other models, e.g., hubs-and-authorities
. Solution: Hubs-and-Authorities (next)
Susceptible to link spam
. Artificial link topologies created in order to boost PageRank
. Solution: TrustRank (next)