
CS246: Mining Massive Datasets
Jure Leskovec, Stanford University
http://cs246.stanford.edu

 Web pages are not equally "important"
 . www.joe-schmoe.com vs. www.stanford.edu

 We already know: since there is large diversity in the connectivity of the web graph, we can rank the pages by the link structure

2/7/2012 Jure Leskovec, Stanford C246: Mining Massive Datasets

 We will cover the following Link Analysis approaches to computing the importance of nodes in a graph:
 . PageRank
 . Hubs and Authorities (HITS)
 . Topic-Specific (Personalized) PageRank
 . Web Spam Detection Algorithms


 Idea: Links as votes
 . Page is more important if it has more links
 . In-coming links? Out-going links?
 Think of in-links as votes:
 . www.stanford.edu has 23,400 in-links
 . www.joe-schmoe.com has 1 in-link

 Are all in-links equal?
 . Links from important pages count more
 . Recursive question!

 Each link's vote is proportional to the importance of its source page

 If page p with importance x has n out-links, each link gets x/n votes

 Page p’s own importance is the sum of the votes on its in-links


 A "vote" from an important page is worth more
 A page is important if it is pointed to by other important pages
 Define a "rank" rj for node j:

   rj = ∑i→j ri / dout(i)

 (Example graph, "The web in 1839": y links to y and a; a links to y and m; m links to a)

 Flow equations:
   ry = ry/2 + ra/2
   ra = ry/2 + rm
   rm = ra/2

 Flow equations:
   ry = ry/2 + ra/2
   ra = ry/2 + rm
   rm = ra/2
 3 equations, 3 unknowns, no constants
 . No unique solution
 . All solutions equivalent modulo scale factor
 Additional constraint forces uniqueness:
 . ry + ra + rm = 1
 . Solution: ry = 2/5, ra = 2/5, rm = 1/5

 Gaussian elimination method works for small examples, but we need a better method for large web-size graphs
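As a sanity check on the small example, the flow equations can be solved directly with NumPy (a sketch, not from the slides; for web-scale graphs this direct approach is exactly what we avoid):

```python
import numpy as np

# Flow equations of the y/a/m example as a linear system: (M - I) r = 0,
# plus the constraint ry + ra + rm = 1 to force a unique solution.
M = np.array([[0.5, 0.5, 0.0],   # ry = ry/2 + ra/2
              [0.5, 0.0, 1.0],   # ra = ry/2 + rm
              [0.0, 0.5, 0.0]])  # rm = ra/2

system = np.vstack([M - np.eye(3), np.ones((1, 3))])
rhs = np.array([0.0, 0.0, 0.0, 1.0])
r, *_ = np.linalg.lstsq(system, rhs, rcond=None)
print(r)  # -> approximately [0.4, 0.4, 0.2]
```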

 Stochastic adjacency matrix M

 . Let page j have dj out-links
 . If j → i, then Mij = 1/dj, else Mij = 0
 . M is a column stochastic matrix: columns sum to 1

 Rank vector r: vector with an entry per page

. ri is the importance score of page i . ∑i ri = 1

 The flow equations can be written r = M r
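The matrix formulation is easy to sketch in code. This example (not from the slides) builds the column-stochastic M from an out-link adjacency dict, assuming the 3-node y/a/m graph implied by the flow equations above:

```python
import numpy as np

def build_M(out_links, nodes):
    """Column-stochastic M: M[i, j] = 1/dj if j -> i, else 0."""
    idx = {v: k for k, v in enumerate(nodes)}
    M = np.zeros((len(nodes), len(nodes)))
    for j, dests in out_links.items():
        for i in dests:
            M[idx[i], idx[j]] = 1.0 / len(dests)
    return M

# The running example: y -> {y, a}, a -> {y, m}, m -> {a}
M = build_M({'y': ['y', 'a'], 'a': ['y', 'm'], 'm': ['a']}, ['y', 'a', 'm'])
print(M.sum(axis=0))  # every column sums to 1
```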

2/7/2012 Jure Leskovec, Stanford C246: Mining Massive Datasets 8  Suppose page j links to 3 pages, including i

 (Figure: column j of M has entry 1/3 in row i, since j has 3 out-links; the picture depicts ri = ∑j Mij rj, i.e. r = M ∙ r)

 The flow equations can be written r = M ∙ r

 So the rank vector is an eigenvector of the stochastic web matrix . In fact, its first or principal eigenvector, with corresponding eigenvalue 1

 Example (y, a, m graph):

       y   a   m
   y   ½   ½   0
   a   ½   0   1
   m   0   ½   0

   r = M ∙ r:

   [ry]   [½ ½ 0] [ry]        ry = ry/2 + ra/2
   [ra] = [½ 0 1] [ra]   i.e. ra = ry/2 + rm
   [rm]   [0 ½ 0] [rm]        rm = ra/2

 Given a web graph with n nodes, where the nodes are pages and edges are hyperlinks
 Power iteration: a simple iterative scheme
 . Suppose there are N web pages
 . Initialize: r(0) = [1/N, …, 1/N]T
 . Iterate: r(t+1) = M ∙ r(t), i.e. rj(t+1) = ∑i→j ri(t) / di, where di is the out-degree of node i
 . Stop when |r(t+1) – r(t)|1 < ε

 . |x|1 = ∑1≤i≤N |xi| is the L1 norm
 . Can use any other vector norm, e.g., Euclidean
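The power-iteration scheme above can be sketched in a few lines (a minimal example, not from the slides; it assumes a well-behaved M with no dead ends or spider traps, like the y/a/m example):

```python
import numpy as np

def power_iterate(M, eps=1e-10, max_iter=1000):
    N = M.shape[0]
    r = np.full(N, 1.0 / N)                  # r(0) = [1/N, ..., 1/N]
    for _ in range(max_iter):
        r_next = M @ r                       # r(t+1) = M . r(t)
        if np.abs(r_next - r).sum() < eps:   # L1 stopping criterion
            return r_next
        r = r_next
    return r

M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 1.0],
              [0.0, 0.5, 0.0]])
r = power_iterate(M)
print(r)  # -> approximately [0.4, 0.4, 0.2]
```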

 Power Iteration on the y/a/m example:

       y   a   m
   y   ½   ½   0           ry = ry/2 + ra/2
   a   ½   0   1           ra = ry/2 + rm
   m   0   ½   0           rm = ra/2

 . Set rj = 1/N
 . And iterate rj = ∑i→j ri / di  (i.e. r = M ∙ r)

 Example:
   ry   1/3   1/3   5/12   9/24   …   6/15
   ra = 1/3   3/6   1/3   11/24   …   6/15
   rm   1/3   1/6   3/12   1/6    …   3/15
        Iteration 0, 1, 2, …

 Imagine a random web surfer:
 . At any time t, the surfer is on some page u
 . At time t+1, the surfer follows an out-link from u uniformly at random
 . Ends up on some page v linked from u
 . Process repeats indefinitely

   rj = ∑i→j ri / dout(i)

 Let:
 . p(t) … vector whose ith coordinate is the prob. that the surfer is at page i at time t
 . p(t) is a probability distribution over pages

 Where is the surfer at time t+1?
 . Follows a link uniformly at random: p(t+1) = M ∙ p(t)
 Suppose the random walk reaches a state where p(t+1) = M ∙ p(t) = p(t)
 . Then p(t) is a stationary distribution of the random walk
 Our rank vector r satisfies r = M ∙ r
 . So, it is a stationary distribution for the random walk
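The stationary-distribution claim can be checked empirically with a tiny Monte Carlo sketch (not from the slides; the y/a/m graph, seed, and step count are illustrative choices):

```python
import random

# Out-link lists for the y/a/m example graph
out_links = {'y': ['y', 'a'], 'a': ['y', 'm'], 'm': ['a']}

random.seed(0)                    # deterministic walk for reproducibility
visits = {v: 0 for v in out_links}
page, steps = 'y', 200_000
for _ in range(steps):
    page = random.choice(out_links[page])   # follow an out-link uniformly
    visits[page] += 1

freq = {v: c / steps for v, c in visits.items()}
print(freq)  # close to the stationary distribution: y=0.4, a=0.4, m=0.2
```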

   rj(t+1) = ∑i→j ri(t) / di,  or equivalently r = M ∙ r

 Does this converge?

 Does it converge to what we want?

 Are results reasonable?


   rj(t+1) = ∑i→j ri(t) / di

 Example (a ⇄ b):
   ra = 1  0  1  0  …
   rb = 0  1  0  1  …
        Iteration 0, 1, 2, …  (the iteration oscillates and never converges)


   rj(t+1) = ∑i→j ri(t) / di

 Example (a → b, where b is a dead end):
   ra = 1  0  0  0  …
   rb = 0  1  0  0  …
        Iteration 0, 1, 2, …  (all importance leaks out)

 Two problems:
 Some pages are "dead ends" (have no out-links)
 . Such pages cause importance to "leak out"

 Spider traps (all out-links are within the group)
 . Eventually spider traps absorb all importance

 Power Iteration on a graph where m is a spider trap (m links only to itself):

       y   a   m
   y   ½   ½   0           ry = ry/2 + ra/2
   a   ½   0   0           ra = ry/2
   m   0   ½   1           rm = ra/2 + rm

 . Set rj = 1/N, and iterate rj = ∑i→j ri / di

 Example:
   ry   1/3   2/6   3/12    5/24   …   0
   ra = 1/3   1/6   2/12    3/24   …   0
   rm   1/3   3/6   7/12   16/24   …   1
        Iteration 0, 1, 2, …  (m absorbs all the importance)

 The Google solution for spider traps: At each time step, the random surfer has two options:
 . With probability β, follow a link at random
 . With probability 1-β, jump to some page uniformly at random
 . Common values for β are in the range 0.8 to 0.9
 The surfer will teleport out of a spider trap within a few time steps


 Power Iteration on a graph where m is a dead end (no out-links):

       y   a   m
   y   ½   ½   0           ry = ry/2 + ra/2
   a   ½   0   0           ra = ry/2
   m   0   ½   0           rm = ra/2

 . Set rj = 1/N, and iterate rj = ∑i→j ri / di

 Example:
   ry   1/3   2/6   3/12   5/24   …   0
   ra = 1/3   1/6   2/12   3/24   …   0
   rm   1/3   1/6   1/12   2/24   …   0
        Iteration 0, 1, 2, …  (all importance leaks out)

2/7/2012 Jure Leskovec, Stanford C246: Mining Massive Datasets 22  Teleports: Follow random teleport links with probability 1.0 from dead-ends . Adjust matrix accordingly

 (Figure: the dead end m gets teleport links to y, a, and m)

   Before:                After:
       y   a   m              y   a   m
   y   ½   ½   0          y   ½   ½   ⅓
   a   ½   0   0          a   ½   0   ⅓
   m   0   ½   0          m   0   ½   ⅓

   r(t+1) = M ∙ r(t)   is a Markov chain

 Markov chains:
 . Set of states X
 . Transition matrix P where Pij = P(Xt = i | Xt-1 = j)
 . π specifying the probability of being at each state x ∈ X
 Goal is to find π such that π = P π


 Theory of Markov chains

 Fact: For any start vector, the power method applied to a Markov transition matrix P will converge to a unique positive stationary vector as long as P is stochastic, irreducible and aperiodic.

 Stochastic: Every column sums to 1
 A possible solution: Add green links

   S = M + (1/n) ∙ 1 ∙ aT
 . ai = 1 if node i has out-degree 0, else ai = 0
 . 1 … vector of all 1s

       y   a   m
   y   ½   ½   1/3         ry = ry/2 + ra/2 + rm/3
   a   ½   0   1/3         ra = ry/2 + rm/3
   m   0   ½   1/3         rm = ra/2 + rm/3
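The dead-end fix can be sketched as follows (not from the slides; it uses the y/a/m example with m as a dead end, and detects dead ends as all-zero columns):

```python
import numpy as np

M = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 0.0]])           # column m is all zeros: dead end

n = M.shape[0]
a = (M.sum(axis=0) == 0).astype(float)    # ai = 1 iff column i sums to 0
S = M + np.outer(np.ones(n), a) / n       # dead-end columns become 1/n
print(S.sum(axis=0))  # every column of S sums to 1
```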

 Aperiodic: a chain is periodic if there exists k > 1 such that the interval between two visits to some state s is always a multiple of k
 A possible solution: Add green links


 Irreducible: there is a non-zero probability of going from any one state to any other
 A possible solution: Add green links


 Google's solution that does it all:
 . Makes M stochastic, aperiodic, irreducible
 At each step, the random surfer has two options:
 . With probability β, follow a link at random
 . With probability 1-β, jump to some random page

 PageRank equation [Brin-Page, 98]:

   rj = ∑i→j β ∙ ri/di + (1-β) ∙ 1/n        (di … out-degree of node i)

 (assuming we follow random teleport links with probability 1.0 from dead-ends)

 The Google Matrix A:

   A = β ∙ S + (1-β) ∙ (1/n) ∙ 1 ∙ 1T

 A is stochastic, aperiodic and irreducible, so

   r(t+1) = A ∙ r(t)

 What is β?
 . In practice β is 0.8 to 0.9 (make about 5 steps on average, then jump)

 Example with β = 0.8:

   A = 0.8 ∙ S + 0.2 ∙ (1/n) ∙ 1 ∙ 1T     (e.g. 0.8∙½ + 0.2∙⅓ = 7/15, 0.8∙1 + 0.2∙⅓ = 13/15)

             [½  ½  0]         [⅓  ⅓  ⅓]   [7/15  7/15   1/15]
   A = 0.8 ∙ [½  0  0] + 0.2 ∙ [⅓  ⅓  ⅓] = [7/15  1/15   1/15]
             [0  ½  1]         [⅓  ⅓  ⅓]   [1/15  7/15  13/15]

 Iterating r = A ∙ r:

   ry   1/3   0.33   0.24   0.26   …    7/33
   ra = 1/3   0.20   0.20   0.18   …    5/33
   rm   1/3   0.46   0.52   0.56   …   21/33
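The numeric example above can be reproduced directly (a sketch, not from the slides; β = 0.8 and the spider-trap graph are the slide's example values):

```python
import numpy as np

beta = 0.8
S = np.array([[0.5, 0.5, 0.0],
              [0.5, 0.0, 0.0],
              [0.0, 0.5, 1.0]])           # m is a spider trap
n = S.shape[0]
A = beta * S + (1 - beta) * np.ones((n, n)) / n

r = np.full(n, 1.0 / n)
for _ in range(200):                      # power iteration on A
    r = A @ r
print(r)  # -> approximately [7/33, 5/33, 21/33]
```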

 Suppose there are N pages
 Consider a page j, with set of out-links O(j)
 We have Mij = 1/|O(j)| when j → i, and Mij = 0 otherwise
 The random teleport is equivalent to:
 . Adding a teleport link from j to every other page with probability (1-β)/N
 . Reducing the probability of following each out-link from 1/|O(j)| to β/|O(j)|
 . Equivalent: Tax each page a fraction (1-β) of its score and redistribute evenly

 Construct the N x N matrix A as follows

 . Aij = β ∙ Mij + (1-β)/N
 Verify that A is a stochastic matrix
 The PageRank vector r is the principal eigenvector of this matrix A
 . satisfying r = A ∙ r
 Equivalently, r is the stationary distribution of the random walk with teleports
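The eigenvector claim can be verified numerically (a sketch, not from the slides; the matrix is the slide's 3-node Google matrix):

```python
import numpy as np

A = np.array([[7/15, 7/15,  1/15],
              [7/15, 1/15,  1/15],
              [1/15, 7/15, 13/15]])       # Google matrix of the example

vals, vecs = np.linalg.eig(A)
k = int(np.argmax(vals.real))             # principal eigenvalue (= 1)
r = vecs[:, k].real
r = r / r.sum()                           # scale so the entries sum to 1
print(vals[k].real, r)  # -> 1.0, approximately [7/33, 5/33, 21/33]
```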

 Key step is matrix-vector multiplication
 . rnew = A ∙ rold
 Easy if we have enough main memory to hold A, rold, rnew
 Say N = 1 billion pages
 . We need 4 bytes for each entry (say)
 . 2 billion entries for the two vectors, approx 8GB
 . But matrix A has N² = 10¹⁸ entries: far too large to store!

   A = β ∙ M + (1-β) [1/N]N×N

             [½  ½  0]         [⅓  ⅓  ⅓]   [7/15  7/15   1/15]
   A = 0.8 ∙ [½  0  0] + 0.2 ∙ [⅓  ⅓  ⅓] = [7/15  1/15   1/15]
             [0  ½  1]         [⅓  ⅓  ⅓]   [1/15  7/15  13/15]

2/7/2012 Jure Leskovec, Stanford C246: Mining Massive Datasets 34 1  = , where = + −훽  = =1 푖푖 푖푖 푁 푟 퐴 ⋅ 푟 퐴1 훽 푀  = 푁 + 푟푖 ∑�=1 퐴푖푖 ⋅ 푟� −훽1 = 푁 + 푟푖 ∑� =1 훽 푀푖푖 푁 ⋅ 푟� =1 1−훽 = 푁 + 푁 = 1 ∑�=1 훽 푀푖푖 ⋅ 푟� 푁 ,∑ �since푟� 푁 1 −훽  = + So, ∑� 훽 푀푖푖 ⋅ 푟� 푁 ∑푟� −훽 푟 훽 푀 ⋅ 푟 푁 푁 [x]N … a vector of length N with all entries x

2/7/2012 Jure Leskovec, Stanford C246: Mining Massive Datasets 35  We can rearrange the PageRank equation

 . r = β M ∙ r + [(1-β)/N]N
 . [(1-β)/N]N is an N-vector with all entries (1-β)/N

 M is a sparse matrix!
 . 10 links per node, approx 10N entries
 So in each iteration, we need to:
 . Compute rnew = β M ∙ rold
 . Add a constant value (1-β)/N to each entry in rnew

2/7/2012 Jure Leskovec, Stanford C246: Mining Massive Datasets 36

 Encode sparse matrix using only nonzero entries
 . Space proportional roughly to number of links
 . Say 10N, or 4∙10∙1 billion = 40GB
 . Still won't fit in memory, but will fit on disk

   source node | degree | destination nodes
   0           | 3      | 1, 5, 7
   1           | 5      | 17, 64, 113, 117, 245
   2           | 2      | 13, 23
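One update pass over this encoding can be sketched as follows (not from the slides; N and the three table rows are the illustrative values above, so nodes without listed out-links contribute no votes in this toy run):

```python
beta = 0.8
N = 246                                   # large enough to cover the ids above

# Sparse encoding: one (source, degree, destinations) row per source node
rows = [(0, 3, [1, 5, 7]),
        (1, 5, [17, 64, 113, 117, 245]),
        (2, 2, [13, 23])]

r_old = [1.0 / N] * N
r_new = [(1 - beta) / N] * N              # initialize with the teleport share
for src, deg, dests in rows:
    for d in dests:                       # scatter beta * r_old[src] / deg
        r_new[d] += beta * r_old[src] / deg

print(r_new[1])  # teleport share plus one vote from node 0
```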

 Assume enough RAM to fit rnew into memory
 . Store rold and matrix M on disk
 Then 1 step of power-iteration is:
   Initialize all entries of rnew to (1-β)/N
   For each page p (of out-degree n):
     Read into memory: p, n, dest1, …, destn, rold(p)
     for j = 1…n: rnew(destj) += β ∙ rold(p) / n

   src | degree | destination
   0   | 3      | 1, 5, 6
   1   | 4      | 17, 64, 113, 117
   2   | 2      | 13, 23

 In each iteration, we have to:
 . Read rold and M
 . Write rnew back to disk
 . IO cost = 2|r| + |M|

 Question: . What if we could not even fit rnew in memory?

2/7/2012 Jure Leskovec, Stanford C246: Mining Massive Datasets 39 rnew src degree destination rold 0 0 0 4 0, 1, 3, 5 1 1 1 2 0, 5 2 2 3 2 2 3, 4 4 3 5 4 5


 Can we do better? . Hint: M is much bigger than r (approx 10-20x), so we must avoid reading it k times per iteration

2/7/2012 Jure Leskovec, Stanford C246: Mining Massive Datasets 41 src degree destination rnew 0 4 0, 1 0 old 1 1 3 0 r 0 2 2 1 1 2 2 0 4 3 3 4 3 2 2 3 5

0 4 5 4 5 1 3 5 2 2 4

 Break M into stripes
 . Each stripe contains only destination nodes in the corresponding block of rnew
 Some additional overhead per stripe
 . But it is usually worth it
 Cost per iteration
 . |M|(1+ε) + (k+1)|r|
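The stripe construction can be sketched as a small partitioning routine (not from the slides; the rows and block boundaries are the illustrative example values):

```python
def make_stripes(rows, blocks):
    """Split each (src, degree, dests) row by destination block,
    keeping the full out-degree in every stripe."""
    stripes = []
    for lo, hi in blocks:
        stripe = []
        for src, deg, dests in rows:
            in_block = [d for d in dests if lo <= d < hi]
            if in_block:
                stripe.append((src, deg, in_block))
        stripes.append(stripe)
    return stripes

rows = [(0, 4, [0, 1, 3, 5]),
        (1, 2, [0, 5]),
        (2, 2, [3, 4])]
stripes = make_stripes(rows, [(0, 2), (2, 4), (4, 6)])
print(stripes[0])  # -> [(0, 4, [0, 1]), (1, 2, [0])]
```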

 Measures generic popularity of a page
 . Biased against topic-specific authorities
 . Solution: Topic-Specific PageRank (next)
 Uses a single measure of importance
 . Other models, e.g., hubs-and-authorities
 . Solution: Hubs-and-Authorities (next)
 Susceptible to link spam
 . Artificial link topologies created in order to boost page rank
 . Solution: TrustRank (next)
