COMP4650
Connected Components and PageRank
Lexing Xie Research School of Computer Science, ANU
Lecture slides credit: Lada Adamic, U Michigan; Jure Leskovec, Stanford; Andreas Haeberlen, UPenn
Connected components
[Figure: example directed graph on nodes A to H]
• Strongly connected components
  – Each node within the component can be reached from every other node in the component by following directed links
  – In the example graph: {B, C, D, E}, {A}, {G, H}, {F}
• Weakly connected components: every node can be reached from every other node by following links in either direction
  – In the example graph: {A, B, C, D, E}, {G, H, F}
• In undirected networks one talks simply about 'connected components'
by Lada Adamic, U Michigan

Strongly Connected Component (SCC) Proof (*)

Giant component
• If the largest component encompasses a significant fraction of the graph, it is called the giant component
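The component definitions above can be sketched in a few lines of Python. This is a minimal illustration, not an efficient algorithm (Tarjan's or Kosaraju's algorithm would be the usual choice for large graphs). The edge list `EDGES` is hypothetical: the slide's figure is not available, so the edges were chosen only so that the resulting components match the ones listed above.

```python
# Hypothetical edge list (an assumption, not the slide's actual figure),
# chosen so that the components match the ones listed on the slide.
EDGES = [
    ("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("E", "B"),
    ("G", "H"), ("H", "G"), ("F", "G"),
]


def build_adjacency(edges):
    """Map every node to the list of nodes it points to."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, [])
    return adj


def reachable(adj, start):
    """Return the set of nodes reachable from start (including start)."""
    seen = {start}
    stack = [start]
    while stack:
        node = stack.pop()
        for nxt in adj[node]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def strongly_connected_components(edges):
    """SCC of v = (nodes reachable from v) & (nodes that can reach v)."""
    fwd = build_adjacency(edges)
    bwd = build_adjacency([(v, u) for u, v in edges])
    components, assigned = [], set()
    for node in sorted(fwd):
        if node not in assigned:
            comp = reachable(fwd, node) & reachable(bwd, node)
            components.append(comp)
            assigned |= comp
    return components


def weakly_connected_components(edges):
    """Ignore edge direction, then collect ordinary connected components."""
    undirected = build_adjacency(edges + [(v, u) for u, v in edges])
    components, assigned = [], set()
    for node in sorted(undirected):
        if node not in assigned:
            comp = reachable(undirected, node)
            components.append(comp)
            assigned |= comp
    return components


sccs = strongly_connected_components(EDGES)
wccs = weakly_connected_components(EDGES)

# Fraction of all nodes covered by the largest weakly connected component.
all_nodes = {n for edge in EDGES for n in edge}
largest_fraction = max(len(c) for c in wccs) / len(all_nodes)
```

On this toy graph the largest weakly connected component covers 5 of the 8 nodes. Whether that counts as a 'giant' component is a judgment call on a graph this small.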
Why not? (*)

Bow-Tie Structure of the Web
Andrei Broder, Ravi Kumar, Farzin Maghoul, Prabhakar Raghavan, Sridhar Rajagopalan, Raymie Stata, Andrew Tomkins, and Janet Wiener. 2000. Graph structure in the Web. Comput. Netw. 33, 1-6 (June 2000), 309-320.

Not Everyone Asks/Replies
The Web is a bow tie. The Java Forum network is an uneven bow tie.
• Core: A strongly connected component, in which everyone asks and answers
• IN: Mostly askers
• OUT: Mostly helpers
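The Core / IN / OUT split above can be computed from reachability alone. The sketch below is an illustration under stated assumptions: the function name `bow_tie`, the node names, and the edge list are all hypothetical, and an edge (u, v) is assumed to mean "u asked a question that v answered", so that askers point into the core and the core points to helpers.

```python
def adjacency(edges):
    """Map every node to the set of nodes it points to."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set())
    return adj


def reachable(adj, sources):
    """Return all nodes reachable from any node in sources (sources included)."""
    seen = set(sources)
    stack = list(sources)
    while stack:
        for nxt in adj[stack.pop()]:
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


def bow_tie(edges):
    """Split nodes into core (largest SCC), IN, OUT and everything else."""
    fwd = adjacency(edges)
    bwd = adjacency([(v, u) for u, v in edges])

    # Core: the largest strongly connected component.
    core = set()
    for node in fwd:
        if node not in core:
            scc = reachable(fwd, [node]) & reachable(bwd, [node])
            if len(scc) > len(core):
                core = scc

    reaches_core = reachable(bwd, core)       # core plus IN
    reached_from_core = reachable(fwd, core)  # core plus OUT
    return {
        "core": core,
        "in": reaches_core - core,
        "out": reached_from_core - core,
        "other": set(fwd) - reaches_core - reached_from_core,
    }


# Hypothetical forum: (asker, helper) pairs, invented for illustration.
EDGES = [
    ("c1", "c2"), ("c2", "c3"), ("c3", "c1"),  # core members ask and answer
    ("i1", "c1"), ("i2", "c2"),                # IN: only ask
    ("c3", "o1"), ("c1", "o2"),                # OUT: only answer
    ("x1", "x2"),                              # disconnected from the core
]
parts = bow_tie(EDGES)
```

An 'uneven' bow tie in this picture simply means that the "in" and "out" sets differ a lot in size.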
Jun Zhang, Mark S. Ackerman, and Lada Adamic. 2007. Expertise networks in online communities: structure and algorithms. In Proceedings of the 16th International Conference on World Wide Web (WWW '07), 221-230.
Minkyoung Kim, Lexing Xie, Peter Christen. Event Diffusion Patterns in Social Media. Intl. Conf. on Weblogs and Social Media (ICWSM), 8 pages, Dublin, Ireland, June 2012.

Modern Information Retrieval: The Concepts and Technology behind Search. Ricardo Baeza-Yates, Berthier Ribeiro-Neto. Addison-Wesley Professional; 2nd edition (February 10, 2011).

Google's PageRank (Brin/Page 98)
• A technique for estimating page quality
  – Based on web link graph
• Results are combined with IR score
  – Think of it as: TotalScore = IR score * PageRank
  – In practice, search engines use many other factors (for example, Google says it uses more than 200)
Source: A. Haeberlen, Z. Ives, U Penn

PageRank: Intuition
[Figure: link graph on pages A to J, with the callouts "Shouldn't E's vote be worth more than F's?" and "How many levels should we consider?"]
• Imagine a contest for The Web's Best Page
  – Initially, each page has one vote
  – Each page votes for all the pages it has a link to
  – To ensure fairness, pages voting for more than one page must split their vote equally between them
  – Voting proceeds in rounds; in each round, each page has the number of votes it received in the previous round
  – In practice, it's a little more complicated - but not much!
PageRank
• Each page i is given a rank x_i
• Goal: Assign the x_i such that the rank of each page is governed by the ranks of the pages linking to it:

$$x_i = \sum_{j \in B_i} \frac{1}{N_j} x_j$$

  – x_i: rank of page i
  – x_j: rank of page j
  – B_i: every page j that links to i
  – N_j: number of links out from page j
• How do we compute the rank values?

Iterative PageRank (simplified)
• Initialize all ranks to be equal, e.g.:

$$x_i^{(0)} = \frac{1}{n}$$

• Iterate until convergence:

$$x_i^{(k+1)} = \sum_{j \in B_i} \frac{1}{N_j} x_j^{(k)}$$
Example: Step 0
Initialize all ranks to be equal:

$$x_i^{(0)} = \frac{1}{n}$$

[Figure: three-page graph, each page with rank 0.33]

Example: Step 1
Propagate weights across out-edges:

$$x_i^{(k+1)} = \sum_{j \in B_i} \frac{1}{N_j} x_j^{(k)}$$

[Figure: the same graph with edge weights 0.33, 0.17, 0.33, 0.17]

Example: Step 2
Compute weights based on in-edges:

$$x_i^{(1)} = \sum_{j \in B_i} \frac{1}{N_j} x_j^{(0)}$$

[Figure: the same graph with page ranks 0.50, 0.33, 0.17]

Example: Convergence

$$x_i^{(k+1)} = \sum_{j \in B_i} \frac{1}{N_j} x_j^{(k)}$$

[Figure: the same graph with page ranks 0.40, 0.40, 0.20]
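The iteration in this example can be sketched directly from the update rule. The three-page graph in `OUT_LINKS` is an assumption: the slide's figure is not available, so the links were picked because they reproduce the numbers shown (0.33 everywhere, then 0.33 / 0.50 / 0.17, converging to 0.4 / 0.4 / 0.2). The function name `pagerank_simplified` is also ours, and the code assumes every page has at least one out-link.

```python
def pagerank_simplified(out_links, tol=1e-10, max_iter=1000):
    """Simplified PageRank: x_i <- sum over in-neighbours j of x_j / N_j.

    out_links maps each page to the list of pages it links to.
    Assumes every page has at least one out-link (no dead ends).
    Returns the final ranks and the list of ranks after each round.
    """
    pages = sorted(out_links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # step 0: equal ranks
    history = [rank]
    for _ in range(max_iter):
        new_rank = {p: 0.0 for p in pages}
        for j in pages:
            # Page j splits its rank equally across its N_j out-links.
            share = rank[j] / len(out_links[j])
            for i in out_links[j]:
                new_rank[i] += share
        history.append(new_rank)
        converged = max(abs(new_rank[p] - rank[p]) for p in pages) < tol
        rank = new_rank
        if converged:
            break
    return rank, history


# Assumed graph (not taken from the slide's figure):
# A links to B and C, B links to A, C links to B.
OUT_LINKS = {"A": ["B", "C"], "B": ["A"], "C": ["B"]}

final, history = pagerank_simplified(OUT_LINKS)
# history[1] is roughly {A: 0.33, B: 0.50, C: 0.17}
# final is roughly {A: 0.4, B: 0.4, C: 0.2}
```

Pushing rank along out-edges, as done here, and summing over in-edges, as in the formula, compute the same thing. The push form just avoids building the sets B_i explicitly.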
Naïve PageRank Algorithm Restated
• Let
  – N(p) = number of outgoing links from page p
  – B(p) = set of back-links to page p, i.e. the pages that link to p (written B_p below)

$$\mathrm{PageRank}(p) = \sum_{b \in B_p} \frac{1}{N(b)} \mathrm{PageRank}(b)$$

  – Each page b distributes its importance to all of the pages it points to (so we scale by 1/N(b))
  – Page p's importance is increased by the importance of its back set
In Linear Algebra formulation
• Create an m x m matrix M to capture links:
  – M(i, j) = 1 / n_j if page i is pointed to by page j and page j has n_j outgoing links
  – M(i, j) = 0 otherwise
• Initialize all PageRanks to 1, multiply by M repeatedly until all values converge:

$$\begin{bmatrix} \mathrm{PageRank}(p_1') \\ \mathrm{PageRank}(p_2') \\ \vdots \\ \mathrm{PageRank}(p_m') \end{bmatrix} = M \begin{bmatrix} \mathrm{PageRank}(p_1) \\ \mathrm{PageRank}(p_2) \\ \vdots \\ \mathrm{PageRank}(p_m) \end{bmatrix}$$

  – Computes principal eigenvector via power iteration
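A Brief Example
[Figure: link graph on Google, Amazon, Yahoo]

$$\begin{bmatrix} g' \\ y' \\ a' \end{bmatrix} = \begin{bmatrix} 0 & 0.5 & 0.5 \\ 0 & 0 & 0.5 \\ 1 & 0.5 & 0 \end{bmatrix} \begin{bmatrix} g \\ y \\ a \end{bmatrix}$$

Running for multiple iterations:

$$\begin{bmatrix} g \\ y \\ a \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}, \begin{bmatrix} 1 \\ 0.5 \\ 1.5 \end{bmatrix}, \begin{bmatrix} 1 \\ 0.75 \\ 1.25 \end{bmatrix}, \ldots, \begin{bmatrix} 1 \\ 0.67 \\ 1.33 \end{bmatrix}$$

Total rank sums to number of pages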
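The matrix form of this example can be checked with a few lines of plain Python. This is a sketch of power iteration using the 3 x 3 matrix from the slide, with rows and columns ordered (Google, Yahoo, Amazon); the helper names `mat_vec` and `power_iteration` are ours.

```python
def mat_vec(matrix, vector):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def power_iteration(matrix, start, steps):
    """Return the vectors start, M start, M^2 start, ... after each step."""
    trajectory = [list(start)]
    for _ in range(steps):
        trajectory.append(mat_vec(matrix, trajectory[-1]))
    return trajectory


# M(i, j) = 1 / n_j if page j links to page i, else 0.
# Order of rows and columns: Google, Yahoo, Amazon.
M = [
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 0.5],
    [1.0, 0.5, 0.0],
]

trajectory = power_iteration(M, [1.0, 1.0, 1.0], steps=50)
first, second, last = trajectory[1], trajectory[2], trajectory[-1]
# first  is [1.0, 0.5, 1.5]
# second is [1.0, 0.75, 1.25]
# last is close to [1.0, 0.667, 1.333]
```

Every column of M sums to 1, so each multiplication only redistributes rank. That is why the total stays at 3, the number of pages.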
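Oops #1 – PageRank Sinks
[Figure: link graph on Google, Amazon, Yahoo]

$$\begin{bmatrix} g' \\ y' \\ a' \end{bmatrix} = \begin{bmatrix} 0 & 0 & 0.5 \\ 0.5 & 0 & 0.5 \\ 0.5 & 0 & 0 \end{bmatrix} \begin{bmatrix} g \\ y \\ a \end{bmatrix}$$

'dead end' - PageRank is lost after each round (the Yahoo column is all zeros, so Yahoo has no out-links)

Running for multiple iterations:

$$\begin{bmatrix} g \\ y \\ a \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}, \begin{bmatrix} 0.5 \\ 1 \\ 0.5 \end{bmatrix}, \begin{bmatrix} 0.25 \\ 0.5 \\ 0.25 \end{bmatrix}, \ldots, \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}$$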
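Oops #2 – PageRank hogs
[Figure: link graph on Google, Amazon, Yahoo]

$$\begin{bmatrix} g' \\ y' \\ a' \end{bmatrix} = \begin{bmatrix} 0 & 0 & 0.5 \\ 0.5 & 1 & 0.5 \\ 0.5 & 0 & 0 \end{bmatrix} \begin{bmatrix} g \\ y \\ a \end{bmatrix}$$

PageRank cannot flow out and accumulates (Yahoo links only to itself)

Running for multiple iterations:

$$\begin{bmatrix} g \\ y \\ a \end{bmatrix} = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}, \begin{bmatrix} 0.5 \\ 2 \\ 0.5 \end{bmatrix}, \begin{bmatrix} 0.25 \\ 2.5 \\ 0.25 \end{bmatrix}, \ldots, \begin{bmatrix} 0 \\ 3 \\ 0 \end{bmatrix}$$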
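Both failure modes can be reproduced with the same naive iteration. The sketch below uses the two matrices from the sink and hog slides, ordered (Google, Yahoo, Amazon); the helper names `iterate` and `column_sums` are ours.

```python
def iterate(matrix, vector, steps):
    """Apply vector <- matrix * vector the given number of times."""
    for _ in range(steps):
        vector = [sum(m * x for m, x in zip(row, vector)) for row in matrix]
    return vector


def column_sums(matrix):
    """Sum of each column: the share of its rank a page passes on per round."""
    return [sum(col) for col in zip(*matrix)]


# Sink: Yahoo has no out-links, so its column is all zeros.
SINK = [
    [0.0, 0.0, 0.5],
    [0.5, 0.0, 0.5],
    [0.5, 0.0, 0.0],
]
# Hog: Yahoo links only to itself, so M(y, y) = 1.
HOG = [
    [0.0, 0.0, 0.5],
    [0.5, 1.0, 0.5],
    [0.5, 0.0, 0.0],
]

start = [1.0, 1.0, 1.0]
sink_after_2 = iterate(SINK, start, 2)   # [0.25, 0.5, 0.25]
hog_after_2 = iterate(HOG, start, 2)     # [0.25, 2.5, 0.25]
sink_limit = iterate(SINK, start, 200)   # all entries effectively 0
hog_limit = iterate(HOG, start, 200)     # effectively [0, 3, 0]

sink_columns = column_sums(SINK)  # [1.0, 0.0, 1.0]: Yahoo's rank leaks away
hog_columns = column_sums(HOG)    # [1.0, 1.0, 1.0]: total rank is conserved
```

The column sums show the difference: the sink loses rank from the system each round, while the hog conserves the total but traps all of it on one page.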
Improved PageRank
• Remove out-degree 0 nodes (or consider them to refer back to referrer)
• Add decay factor d to deal with sinks

$$\mathrm{PageRank}(p) = (1 - d) + d \sum_{b \in B_p} \frac{1}{N(b)} \mathrm{PageRank}(b)$$

• Typical value: d=0.85
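As a quick check of the damped formula, the sketch below applies it with d = 0.85 to the self-linking 'hog' matrix from the earlier slide, ordered (Google, Yahoo, Amazon). The function name `damped_step` and the choice of 200 rounds are ours; the code does not implement the removal of out-degree 0 nodes.

```python
def damped_step(matrix, ranks, d=0.85):
    """One round of PageRank(p) = (1 - d) + d * sum_b PageRank(b) / N(b)."""
    return [
        (1 - d) + d * sum(m * x for m, x in zip(row, ranks))
        for row in matrix
    ]


# 'Hog' matrix: Yahoo (middle row and column) links only to itself.
HOG = [
    [0.0, 0.0, 0.5],
    [0.5, 1.0, 0.5],
    [0.5, 0.0, 0.0],
]

ranks = [1.0, 1.0, 1.0]
history = [ranks]
for _ in range(200):
    ranks = damped_step(HOG, ranks)
    history.append(ranks)

# history[1] is roughly [0.575, 1.85, 0.575]
# ranks ends up close to [0.26, 2.48, 0.26]
```

With damping, Google and Amazon keep a nonzero rank instead of draining to 0, and the total still sums to 3. Yahoo nevertheless keeps most of the rank.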
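Stopping the Hog
[Figure: link graph on Google, Amazon, Yahoo]

$$\begin{bmatrix} g' \\ y' \\ a' \end{bmatrix} = 0.85 \begin{bmatrix} 0 & 0 & 0.5 \\ 0.5 & 1 & 0.5 \\ 0.5 & 0 & 0 \end{bmatrix} \begin{bmatrix} g \\ y \\ a \end{bmatrix} + \begin{bmatrix} 0.15 \\ 0.15 \\ 0.15 \end{bmatrix}$$

Running for multiple iterations:

$$\begin{bmatrix} g \\ y \\ a \end{bmatrix} = \begin{bmatrix} 0.57 \\ 1.85 \\ 0.57 \end{bmatrix}, \begin{bmatrix} 0.39 \\ 2.21 \\ 0.39 \end{bmatrix}, \begin{bmatrix} 0.32 \\ 2.36 \\ 0.32 \end{bmatrix}, \ldots, \begin{bmatrix} 0.26 \\ 2.48 \\ 0.26 \end{bmatrix}$$

… though does this seem right?

Random Surfer Model
• PageRank has an intuitive basis in random walks on graphs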
• Imagine a random surfer, who starts on a random page and, in each step,
  – with probability d, clicks on a random link on the page
  – with probability 1-d, jumps to a random page (bored?)
• The PageRank of a page can be interpreted as the fraction of steps the surfer spends on the corresponding page
  – Transition matrix can be interpreted as a Markov Chain
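The random-surfer reading can be tested by simulation. The sketch below walks a surfer over a three-page graph and compares the fraction of steps spent on each page with the result of a damped iteration normalized to sum to 1. The link structure in `LINKS` is read off the columns of the matrix in "A Brief Example"; the function names, the step count, and the fixed seed are our choices, and the code assumes every page has at least one out-link.

```python
import random


def simulate_surfer(out_links, d=0.85, steps=200_000, seed=0):
    """Fraction of steps a random surfer spends on each page."""
    rng = random.Random(seed)
    pages = sorted(out_links)
    visits = {p: 0 for p in pages}
    current = rng.choice(pages)  # start on a random page
    for _ in range(steps):
        visits[current] += 1
        if rng.random() < d:
            current = rng.choice(out_links[current])  # click a random link
        else:
            current = rng.choice(pages)               # jump to a random page
    return {p: visits[p] / steps for p in pages}


def stationary(out_links, d=0.85, iterations=200):
    """Damped PageRank normalized as a probability distribution."""
    pages = sorted(out_links)
    n = len(pages)
    prob = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        nxt = {p: (1 - d) / n for p in pages}
        for j in pages:
            for i in out_links[j]:
                nxt[i] += d * prob[j] / len(out_links[j])
        prob = nxt
    return prob


# Links inferred from the matrix columns of the earlier example (an assumption).
LINKS = {
    "Google": ["Amazon"],
    "Yahoo": ["Google", "Amazon"],
    "Amazon": ["Google", "Yahoo"],
}

freq = simulate_surfer(LINKS)
exact = stationary(LINKS)
```

The simulated frequencies only approximate the iterated values, and the gap shrinks as the number of steps grows.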
PageRank, random walk on graphs
[Figure: link matrix G, combined matrix W, and the resulting PageRank vector]
W = d*G + (1-d)*U, with G the transition matrix of the link graph and U the uniform random-jump matrix
  – with probability d, clicks on a random link on the page
  – with probability 1-d, jumps to a random page (bored?)
  – Transition matrix W can be interpreted as a Markov Chain
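The link between the transition matrix W and the damped update rule can be written out. This is a sketch under our reading of the slide, not something the slide states: d is the probability of following a link (as in the bullets above), G is the column-stochastic link matrix with G_{ij} = 1/N_j when page j links to page i, U is the m x m matrix whose entries are all 1/m, and π is the stationary distribution of the surfer.

```latex
\begin{aligned}
W &= d\,G + (1-d)\,U, \qquad U = \tfrac{1}{m}\,\mathbf{1}\mathbf{1}^{\top} \\
\pi &= W\pi
     = d\,G\pi + \tfrac{1-d}{m}\,\mathbf{1}\,(\mathbf{1}^{\top}\pi)
     = d\,G\pi + \tfrac{1-d}{m}\,\mathbf{1}
     \qquad \text{since } \mathbf{1}^{\top}\pi = 1 \\
x &= m\,\pi \;\Longrightarrow\; x = d\,G\,x + (1-d)\,\mathbf{1}
\end{aligned}
```

Row i of the last line reads x_i = (1 - d) + d Σ_{j ∈ B_i} x_j / N_j, which is the damped PageRank formula with ranks scaled to sum to the number of pages m.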
COMP4650, L Xie, ANU Computer Science

Search Engine Optimization (SEO)
• Has become a big business
• White-hat techniques
  – Google webmaster tools
  – Add meta tags to documents, etc.
• Black-hat techniques
  – Link farms
  – Keyword stuffing, hidden text, meta-tag stuffing, ...
  – Spamdexing
    • Initial solution: ...
    • Some people started to abuse this to improve their own rankings
  – Doorway pages / cloaking
    • Special pages just for search engines
    • BMW Germany and Ricoh Germany banned in February 2006
  – Link buying
Recap: PageRank
• Estimates absolute 'quality' or 'importance' of a given page based on inbound links
  – Query-independent
  – Can be computed via fixpoint iteration
  – Can be interpreted as the fraction of time a 'random surfer' would spend on the page
  – Several refinements, e.g., to deal with sinks
• Considered relatively stable
  – But vulnerable to black-hat SEO
• An important factor, but not the only one
  – Overall ranking is based on many factors (Google: >200)
What could be the other 200 factors?

• On-page, positive
  – Keyword in title? URL?
  – Keyword in domain name?
  – Quality of HTML code
  – Page freshness
  – Rate of change
  – ...
• On-page, negative
  – Links to 'bad neighborhood'
  – Keyword stuffing
  – Over-optimization
  – Hidden content (text has same color as background)
  – Automatic redirect/refresh
  – ...
• Off-page, positive
  – High PageRank
  – Anchor text of inbound links
  – Links from authority sites
  – Links from well-known sites
  – Domain expiration date
  – ...
• Off-page, negative
  – Fast increase in number of inbound links (link buying?)
  – Link farming
  – Different pages for user/spider
  – Content duplication
  – ...
• Note: This is entirely speculative!
Source: Web Information Systems, Prof. Beat Signer, VU Brussels

Summary
• Macro structure of networks
  – Strongly and weakly connected components
  – Applications in web/discussion forums etc.
• PageRank algorithm
  – Computed using power iterations
  – Used for ranking webpages
  – Can be thought of as random walk on graphs