Connected Components, and Pagerank

COMP4650 Connected Components, and Pagerank Lexing Xie Research School of Computer Science, ANU Lecture slides credit: Lada Adamic, U Michigan Jure Leskovec, Stanford, Andreas Haeberlen, UPenn Connected components • Strongly connected components – Each node within the component can be reached from every other node in the component by following directed links B F n Strongly connected components C G n B C D E A n A n G H H n F D E n Weakly connected components: every node can be reached from every other node by following links in either direcRon n Weakly connected components B F n A B C D E C G n G H F A H D n In undirected networks one talks simply E about ‘connected components’ by Lada Adamic, U Michigan Strongly Connected Component (SCC) Proof (*) Giant component • if the largest component encompasses a significant fracRon of the graph, it is called the giant component by Lada Adamic, U Michigan Why not? (*) Bow-Tie Structure of the Web Andrei Broder, Ravi Kumar, Farzin Maghoul, Prabhakar Raghavan, Sridhar Rajagopalan, Raymie Stata, Andrew Tomkins, and Janet Wiener. 2000. Graph structure in the Web. Comput. Netw. 33, 1-6 (June 2000), 309-320. Not Everyone Asks/Replies The Web is a bow Re The Java Forum network is an uneven bow Re • Core: A strongly connected component, in which everyone asks and answers • IN: Mostly askers. • OUT: Mostly Helpers Jun Zhang, Mark S. Ackerman, and Lada Adamic. 2007. ExperRse networks in online communiRes: structure and algorithms. In Proceedings of the 16th internaonal conference on World Wide Web (WWW '07)., 221-230. Minkyoung Kim, Lexing Xie, Peter Christen, Event Diffusion Paerns in Social Media (2012) Intl. Conf. on Weblogs and Social Media (ICWSM), 8 pages, Dublin, Ireland, June 2012 Modern Informaon Retrieval: The Concepts and Technology behind Search Ricardo Baeza-Yates, Berthier Ribeiro-Neto Addison-Wesley Professional; 2 ediRon (February 10, 2011) Google's PageRank (Brin/Page 98) • A technique for esRmang page quality – Based on web link graph • Results are combined with IR score – Think of it as: TotalScore = IR score * PageRank – In pracRce, search engines use many other factors (for example, Google says it uses more than 200) 11 A. Haeberlen , Z. Ives, U Penn Shouldn't E's vote be worth more than F's? PageRank: IntuiRon G A H E B How many levels I C should we consider? F J D • Imagine a contest for The Web's Best Page – IniRally, each page has one vote – Each page votes for all the pages it has a link to – To ensure fairness, pages voRng for more than one page must split their vote equally between them – VoRng proceeds in rounds; in each round, each page has the number of votes it received in the previous round – In pracRce, it's a lihle more complicated - but not much! 12 Source: A. Haeberlen , Z. Ives, U Penn PageRank • Each page i is given a rank xi • Goal: Assign the xi such that the rank of each page is governed by the ranks of the pages linking to it: 1 x x i = ∑ j j∈Bi N j Rank of page i Rank of page j Number of How do we compute links out the rank values? Every page j that links to i from page j 13 Source: A. Haeberlen , Z. Ives, U Penn Iterave PageRank (simplified) Initialize all ranks to (0) 1 be equal, e.g.: x = i n 1 Iterate until x(k+1) x(k) convergence i = ∑ j j∈Bi N j 14 Source: A. Haeberlen , Z. Ives, U Penn Example: Step 0 1 Initialize all ranks x(0) = to be equal i n 0.33 0.33 0.33 15 Source: A. Haeberlen , Z. Ives, U Penn Example: Step 1 Propagate weights (k+1) 1 (k) across out-edges x x i = ∑ j j∈Bi N j 0.33 0.17 0.33 0.17 16 Source: A. Haeberlen , Z. Ives, U Penn Example: Step 2 Compute weights 1 based on in-edges x (1) x (0) i = ∑ j j∈Bi N j 0.50 0.33 0.17 17 Source: A. Haeberlen , Z. Ives, U Penn Example: Convergence 1 x(k+1) x(k) i = ∑ j j∈Bi N j 0.40 0.4 0.2 18 Source: A. Haeberlen , Z. Ives, U Penn Naïve PageRank Algorithm Restated • Let – N(p) = number outgoing links from page p – B(p) = number of back-links to page p 1 PageRank( p) = PageRank(b) ∑ N(b) b∈Bp – Each page b distributes its importance to all of the pages it points to (so we scale by 1/N(b)) – Page p’s importance is increased by the importance of its back set Source: A. Haeberlen , Z. Ives, U Penn 19 In Linear Algebra formulaon • Create an m x m matrix M to capture links: – M(i, j) = 1 / nj if page i is pointed to by page j and page j has nj outgoing links = 0 otherwise – IniRalize all PageRanks to 1, mulRply by M repeatedly unRl all values converge: ⎡ PageRank( p1')⎤ ⎡ PageRank( p1)⎤ ⎢PageRank( p ')⎥ ⎢PageRank( p )⎥ ⎢ 2 ⎥ = M ⎢ 2 ⎥ ⎢ ... ⎥ ⎢ ... ⎥ ⎢PageRank( p ')⎥ ⎢PageRank( p )⎥ ⎣ m ⎦ ⎣ m ⎦ – Computes principal eigenvector via power iteraon Source: A. Haeberlen , Z. Ives, U Penn 20 A Brief Example Google g' 0 0.5 0.5 g y’ = 0 0 0.5 * y Amazon Yahoo a’ 1 0.5 0 a Running for multiple iterations: g 1 1 1 1 y = 1 , 0.5 , 0.75 , … 0.67 a 1 1.5 1.25 1.33 Total rank sums to number of pages Source: A. Haeberlen , Z. Ives, U Penn 21 Oops #1 – PageRank Sinks g' 0 0 0.5 g Google y’ = 0.5 0 0.5 * y a’ 0.5 0 0 a Amazon Yahoo 'dead end' - PageRank is lost aer each round Running for multiple iterations: g 1 0.5 0.25 0 y = 1 , 1 , 0.5 , … , 0 a 1 0.5 0.25 0 22 Source: A. Haeberlen , Z. Ives, U Penn Oops #2 – PageRank hogs Google g' 0 0 0.5 g y’ = 0.5 1 0.5 * y a’ 0.5 0 0 a Amazon Yahoo PageRank cannot flow out and accumulates Running for multiple iterations: g 1 0.5 0.25 0 y = 1 , 2 , 2.5 , … , 3 a 1 0.5 0.25 0 23 Source: A. Haeberlen , Z. Ives, U Penn Improved PageRank • Remove out-degree 0 nodes (or consider them to refer back to referrer) • Add decay factor d to deal with sinks 1 PageRank( p) = (1− d) + d ∑ PageRank(b) b∈Bp N(b) • Typical value: d=0.85 24 Source: A. Haeberlen , Z. Ives, U Penn Stopping the Hog g' Google 0 0 0.5 g 0.15 y’ = 0.85 0.5 1 0.5 * y + 0.15 a’ 0.5 0 0 a 0.15 Amazon Yahoo Running for multiple iterations: g 0.57 0.39 0.32 0.26 , , , , … , y = 1.85 2.21 2.36 2.48 a 0.57 0.39 0.32 0.26 … though does this seem right? 25 Source: A. Haeberlen , Z. Ives, U Penn Random Surfer Model • PageRank has an intuiRve basis in random walks on graphs • Imagine a random surfer, who starts on a random page and, in each step, – with probability d, klicks on a random link on the page – with probability 1-d, jumps to a random page (bored?) • The PageRank of a page can be interpreted as the fracRon of steps the surfer spends on the corresponding page – TransiRon matrix can be interpreted as a Markov Chain 26 Source: A. Haeberlen , Z. Ives, U Penn PageRank, random walk on graphs G W = (1-d)*G + d*U + Pagerank vector – with probability d, klicks on a random link on the page – with probability 1-d, jumps to a random page (bored?) – TransiRon matrix W can be interpreted as a Markov Chain COMP4650 L Xie, ANU Computer Science COMP4650 L Xie, ANU Computer Science 28 Search Engine OpRmizaon (SEO) • Has become a big business • White-hat techniques – Google webmaster tools – Add meta tags to documents, etc. • Black-hat techniques – Link farms – Keyword stuffing, hidden text, meta-tag stuffing, ... – Spamdexing • IniRal soluRon: <a rel="nofollow" href="...">...</a> • Some people started to abuse this to improve their own rankings – Doorway pages / cloaking • Special pages just for search engines • BMW Germany and Ricoh Germany banned in February 2006 – Link buying 29 Source: A. Haeberlen , Z. Ives, U Penn Recap: PageRank • EsRmates absolute 'quality' or 'importance' of a given page based on inbound links – Query-independent – Can be computed via fixpoint iteraon – Can be interpreted as the fracRon of Rme a 'random surfer' would spend on the page – Several refinements, e.g., to deal with sinks • Considered relavely stable – But vulnerable to black-hat SEO • An important factor, but not the only one – Overall ranking is based on many factors (Google: >200) 30 Source: A. Haeberlen , Z. Ives, U Penn What could be the other 200 factors? Posive Negave Keyword in Rtle? URL? Links to 'bad neighborhood' Keyword in domain name? Keyword stuffing Quality of HTML code Over-opRmizaon On-page Page freshness Hidden content (text has Rate of change same color as background) ... Automac redirect/refresh ... High PageRank Fast increase in number of Anchor text of inbound links inbound links (link buying?) Links from authority sites Link farming Off-page Links from well-known sites Different pages user/spider Domain expiraon date Content duplicaon ... ... • Note: This is enRrely speculave! 31 Source: Web Informaon Systems, Prof.

Load more