COMP4650

Connected Components, and Pagerank

Lexing Xie Research School of Computer Science, ANU

Lecture slides credit: Lada Adamic, U Michigan Jure Leskovec, Stanford, Andreas Haeberlen, UPenn

Connected components • Strongly connected components – Each node within the component can be reached from every other node in the component by following directed links B F n Strongly connected components C G n B C D E A n A n G H H n F D E n Weakly connected components: every node can be reached from every other node by following links in either direcon

n Weakly connected components B F n A B C D E C G n G H F A

H D n In undirected networks one talks simply E about ‘connected components’ by Lada Adamic, U Michigan Strongly Connected Component (SCC) Proof (*) Giant component • if the largest component encompasses a significant fracon of the graph, it is called the giant component

by Lada Adamic, U Michigan Why not? (*) Bow-Tie Structure of the Web

Andrei Broder, Ravi Kumar, Farzin Maghoul, Prabhakar Raghavan, Sridhar Rajagopalan, Raymie Stata, Andrew Tomkins, and Janet Wiener. 2000. Graph structure in the Web. Comput. Netw. 33, 1-6 (June 2000), 309-320. Not Everyone Asks/Replies

The Web is a bow e The Java Forum network is an uneven bow e • Core: A strongly connected component, in which everyone asks and answers • IN: Mostly askers. • OUT: Mostly Helpers

Jun Zhang, Mark S. Ackerman, and Lada Adamic. 2007. Experse networks in online communies: structure and . In Proceedings of the 16th internaonal conference on (WWW '07)., 221-230. Minkyoung Kim, Lexing Xie, Peter Christen, Event Diffusion Paerns in Social Media (2012) Intl. Conf. on Weblogs and Social Media (ICWSM), 8 pages, Dublin, Ireland, June 2012 Modern Informaon Retrieval: The Concepts and Technology behind Search Ricardo Baeza-Yates, Berthier Ribeiro-Neto Addison-Wesley Professional; 2 edion (February 10, 2011) 's PageRank (Brin/Page 98)

• A technique for esmang page quality – Based on web link graph

• Results are combined with IR score – Think of it as: TotalScore = IR score * PageRank – In pracce, search engines use many other factors (for example, Google says it uses more than 200)

11 A. Haeberlen , Z. Ives, U Penn Shouldn't E's vote be worth more than F's? PageRank: Intuion

G A

H E B

How many levels I C should we consider? F J D

• Imagine a contest for The Web's Best Page – Inially, each page has one vote – Each page votes for all the pages it has a link to – To ensure fairness, pages vong for more than one page must split their vote equally between them – Vong proceeds in rounds; in each round, each page has the number of votes it received in the previous round – In pracce, it's a lile more complicated - but not much!

12 Source: A. Haeberlen , Z. Ives, U Penn PageRank

• Each page i is given a rank xi

• Goal: Assign the xi such that the rank of each page is governed by the ranks of the pages linking to it: 1 x x i = ∑ j j∈Bi N j

Rank of page i Rank of page j Number of How do we compute links out the rank values? Every page j that links to i from page j 13 Source: A. Haeberlen , Z. Ives, U Penn Iterave PageRank (simplified)

Initialize all ranks to (0) 1 be equal, e.g.: x = i n

1 Iterate until x(k+1) x(k) convergence i = ∑ j j∈Bi N j

14 Source: A. Haeberlen , Z. Ives, U Penn Example: Step 0

1 Initialize all ranks x(0) = to be equal i n

0.33

0.33

0.33

15 Source: A. Haeberlen , Z. Ives, U Penn Example: Step 1

Propagate weights (k+1) 1 (k) across out-edges x x i = ∑ j j∈Bi N j

0.33 0.17 0.33

0.17

16 Source: A. Haeberlen , Z. Ives, U Penn Example: Step 2

Compute weights 1 based on in-edges x (1) x (0) i = ∑ j j∈Bi N j 0.50

0.33 0.17

17 Source: A. Haeberlen , Z. Ives, U Penn Example: Convergence

1 x(k+1) x(k) i = ∑ j j∈Bi N j 0.40

0.4 0.2

18 Source: A. Haeberlen , Z. Ives, U Penn Naïve PageRank Restated

• Let – N(p) = number outgoing links from page p – B(p) = number of back-links to page p 1 PageRank( p) = PageRank(b) ∑ N(b) b∈Bp – Each page b distributes its importance to all of the pages it points to (so we scale by 1/N(b)) – Page p’s importance is increased by the importance of its back set

Source: A. Haeberlen , Z. Ives, U Penn 19 In Linear Algebra formulaon

• Create an m x m M to capture links:

– M(i, j) = 1 / nj if page i is pointed to by page j and page j has nj outgoing links = 0 otherwise

– Inialize all PageRanks to 1, mulply by M repeatedly unl all values converge:

⎡ PageRank( p1')⎤ ⎡ PageRank( p1)⎤ ⎢PageRank( p ')⎥ ⎢PageRank( p )⎥ ⎢ 2 ⎥ = M ⎢ 2 ⎥ ⎢ ... ⎥ ⎢ ... ⎥ ⎢PageRank( p ')⎥ ⎢PageRank( p )⎥ ⎣ m ⎦ ⎣ m ⎦ – Computes principal eigenvector via power iteraon

Source: A. Haeberlen , Z. Ives, U Penn 20 A Brief Example

Google g' 0 0.5 0.5 g y’ = 0 0 0.5 * y Amazon Yahoo a’ 1 0.5 0 a

Running for multiple iterations:

g 1 1 1 1 y = 1 , 0.5 , 0.75 , … 0.67 a 1 1.5 1.25 1.33

Total rank sums to number of pages

Source: A. Haeberlen , Z. Ives, U Penn 21 Oops #1 – PageRank Sinks

g' 0 0 0.5 g Google y’ = 0.5 0 0.5 * y a’ 0.5 0 0 a Amazon Yahoo

'dead end' - PageRank is lost aer each round Running for multiple iterations:

g 1 0.5 0.25 0 y = 1 , 1 , 0.5 , … , 0 a 1 0.5 0.25 0

22 Source: A. Haeberlen , Z. Ives, U Penn Oops #2 – PageRank hogs

Google g' 0 0 0.5 g y’ = 0.5 1 0.5 * y a’ 0.5 0 0 a Amazon Yahoo

PageRank cannot flow out and accumulates Running for multiple iterations:

g 1 0.5 0.25 0 y = 1 , 2 , 2.5 , … , 3 a 1 0.5 0.25 0

23 Source: A. Haeberlen , Z. Ives, U Penn Improved PageRank

• Remove out-degree 0 nodes (or consider them to refer back to referrer) • Add decay factor d to deal with sinks 1 PageRank( p) = (1− d) + d ∑ PageRank(b) b∈Bp N(b)

• Typical value: d=0.85

24 Source: A. Haeberlen , Z. Ives, U Penn Stopping the Hog

g' Google 0 0 0.5 g 0.15 y’ = 0.85 0.5 1 0.5 * y + 0.15 a’ 0.5 0 0 a 0.15 Amazon Yahoo

Running for multiple iterations:

g 0.57 0.39 0.32 0.26 , , , , … , y = 1.85 2.21 2.36 2.48 a 0.57 0.39 0.32 0.26 … though does this seem right? 25 Source: A. Haeberlen , Z. Ives, U Penn Random Surfer Model • PageRank has an intuive basis in random walks on graphs

• Imagine a random surfer, who starts on a random page and, in each step, – with probability d, klicks on a random link on the page – with probability 1-d, jumps to a random page (bored?)

• The PageRank of a page can be interpreted as the fracon of steps the surfer spends on the corresponding page – Transion matrix can be interpreted as a

26 Source: A. Haeberlen , Z. Ives, U Penn PageRank, on graphs

G W = (1-d)*G + d*U

+ Pagerank vector

– with probability d, klicks on a random link on the page – with probability 1-d, jumps to a random page (bored?) – Transion matrix W can be interpreted as a Markov Chain

COMP4650 L Xie, ANU Computer Science COMP4650 L Xie, ANU Computer Science 28 Opmizaon (SEO) • Has become a big business • White-hat techniques – Google webmaster tools – Add meta tags to documents, etc. • Black-hat techniques – Link farms – Keyword stuffing, hidden text, meta-tag stuffing, ... – • Inial soluon: ... • Some people started to abuse this to improve their own – Doorway pages / cloaking • Special pages just for search engines • BMW Germany and Ricoh Germany banned in February 2006 – Link buying

29 Source: A. Haeberlen , Z. Ives, U Penn Recap: PageRank

• Esmates absolute 'quality' or 'importance' of a given page based on inbound links – Query-independent – Can be computed via fixpoint iteraon – Can be interpreted as the fracon of me a 'random surfer' would spend on the page – Several refinements, e.g., to deal with sinks • Considered relavely stable – But vulnerable to black-hat SEO • An important factor, but not the only one – Overall is based on many factors (Google: >200)

30 Source: A. Haeberlen , Z. Ives, U Penn What could be the other 200 factors?

Posive Negave Keyword in tle? URL? Links to 'bad neighborhood' Keyword in domain name? Keyword stuffing Quality of HTML code Over-opmizaon On-page Page freshness Hidden content (text has Rate of change same color as background) ... Automac redirect/refresh ... High PageRank Fast increase in number of Anchor text of inbound links inbound links (link buying?) Links from authority sites Link farming Off-page Links from well-known sites Different pages user/spider Domain expiraon date Content duplicaon ...... • Note: This is enrely speculave!

31 Source: Web Informaon Systems, Prof. Beat Signer, VU Brussels Summary • Macro structure of networks – Strongly and weakly connected components – Applicaons in web/discussion forums etc.

• PageRank algorithm – Computed using power iteraons – Used for ranking webpages – Can be thought of as random walk on graphs