Google Pagerank and Markov Chains
Total Page:16
File Type:pdf, Size:1020Kb
COMP4121 Lecture Notes The PageRank, Markov chains and the Perron Frobenius theory LiC: Aleks Ignjatovic [email protected] THE UNIVERSITY OF NEW SOUTH WALES School of Computer Science and Engineering The University of New South Wales Sydney 2052, Australia Topic One: the PageRank Please read Chapter 3 of the textbook Networked Life, entitled \How does Google rank webpages?"; other references on the material covered are listed at the end of these lecture notes. 1 Problem: ordering webpages according to their importance Setup: Consider all the webpages on the entire WWW (\World Wide Web") as a directed graph whose nodes are the web pages fPi : Pi 2 WWW g, with a directed edge Pi ! Pj just in case page Pi points to page Pj, i.e. page Pi has a link to page Pj. Problem: Rank all the webpages of the WWW according to their \importance".1 Intuitively, one might feel that if many pages point to a page P0, then P0 should have a high rank, because if one includes on their webpage a link to page P0, then this can be seen as their recommendation of P0. However, this is not a good criterion for several reasons. For example, it can be easily manipulated to increase the rating of any webpage P0, simply by creating a lot of silly web pages which just point to P0; also, if a webpage \generously" points to a very large number of webpages, such \easy to get" recommendation is of dubious value. Thus, we need to refine our strategy how to rank webpages according to their \importance", by doing something which cannot be easily manipulated \locally" (i.e., by any group of people who might collude, even if such a group is sizeable). One of the crucial insights of Google founders was that the ranks of a webpage should depend on the entire global structure of the web; in other words, we should make the rank of every webpage dependent on the ranks of all other webpages on the entire WWW! Such a method would be virtually immune to spamming, because any group of people can alter only a negligible part of the entire WWW structure. Notation: Assume that ρ is some ranking of webpages (obtained in whatever way); so with such a ranking function the rank of a page P is ρ(P ). Let also #(P ) be the number of all outgoing links on a webpage P . As we have mentioned, we will write P1 ! P0 to denote that a page P1 has a link pointing to a page P0. Intuitively, we would want a web page P to have a high rank only if it is pointed at by many pages Pi which themselves have a high rank, and which do not point to an excessive number of other web pages. One idea how to obtain ranking with such a desirable property would be to look for a ranking function ρ which satisfies X ρ(Pi) ρ(P ) = ; (1.1) #(Pi) Pi!P for all web pages P on the web. If (1.1) is satisfied, then a page P will have a high rank only if the right- hand side of this equation is large. Clearly, the sum on the right is large only if there are enough pages Pi which point to P and which: (i) themselves have a high rank ρ(Pi); (ii) do not point to too many other pages besides P , so that #(Pi) is not too large; otherwise ρ(Pi)=#(Pi) might be small even if ρ(Pi) is reasonably large. Note that the above formula cannot be directly used to compute ρ(P ) because we do not know what ρ(Pi) are; it is only a condition which ρ should satisfy. But: 1. why should such ρ exist at all? 2. even if such ρ exists, is it the unique solution satisfying (1.1)? Note that question 2 above is as important as question 1: if such ρ existed but were not unique, there would be two different ways to assign page ranks to webpages and an arbitrary choice between the two would make many people very upset, if it changes the relative ranking of their webpage and the webpages of their business competitors. It turns out that, in order to ensure that above two conditions are met, we must slightly alter equation (1.1) and refine our ranking model further. Most interestingly, besides ensuring the existence and uniqueness of the ranking function ρ which satisfies a reasonable modification of equation (1.1), such change comes naturally from another, entirely intuitive model or a \heuristics" for assigning page ranks, based on the concept of a Random Surfer, which we will describe later; at the moment, we will stick with such \imperfect" condition given by equation (1.1) and use it to introduce some necessary concepts and notation. 1The quotation marks indicate that we have not defined what we mean by \importance"; we are using this word in a rather vague manner. 1 As we have mentioned, since page ranks ρ(Ai) appear on both sides of equation (1.1), such equation cannot be used to actually compute page ranks because both ρ(P ) and all of ρ(Pi) are initially unknown and must be assigned values simultaneously, in a way such that (a form of) (1.1) is satisfied. Such condition must be satisfied for all pages P 2 WWW on the web. Thus, we have a system of equations, one for each page P on the web: ( ) X ρ(Pi) ρ(P ) = (1.2) #(Pi) Pi!P P 2WWW The best way to write such a huge system of linear equations (linear in the values ρ(Pi)) is using matrices. Imagine now we write a huge matrix G1 of size M × M where M = jWWW j is the total number of all web th pages on the web at this moment. Let P1;P2;:::;PM be an indexing of all such web pages. The i row of the matrix G1 corresponds to web page Pi and the entry in this row which is in column j is denoted by g(i; j) 1 and is given as g(i; j) = if Pi has a link pointing to Pj, (in our notation, if Pi ! Pj), and is equal to 0 #(Pi) otherwise: 0 g(1; 1) : : : g(1; j) : : : g(1;M) 1 B g(2; 1) : : : g(2; j) : : : g(2;M) C B C B ::: C B C B ::: C B C B ::: C B C G1 = B g(i; 1) : : : g(i; j) : : : g(i; M) C B C B ::: C B C B ::: C B C B ::: C B C @ g(M − 1; 1) : : : g(M − 1; j) : : : g(M − 1;M) A g(M; 1) : : : g(M; j) : : : g(M; M) 8 1 if Pi ! Pj < #(Pi) g(i; j) = : 0 otherwise th Thus, G1 consists mostly of zeros, because in the i row the non zero entries are only in columns j such th that Pi ! Pj, and each page has links to only a few other webpages, so G1 with its i row looks something like this: 0 1 B C B C B ::: C B C B ::: C B C B ::: C B 1 1 1 C G1 = B ::: 0 0 0 0 :::::::::::: 0 0 0 0 ::::::::: 0 0 0 0 ::: C B k k k C B ::: C B C B ::: C B C B ::: C B C @ A where k is equal to #(Pi), the number of pages Pj1 ;:::Pjk which the page Pi has links to, and with non zero entries only in the corresponding columns j1; : : : jk. Recall from Linear Algebra that vectors r are usually represented as column vectors; to make them into row vectors we have to transpose them. Thus, letting rT be the row vector of all page ranks: T r = (ρ(P1); ρ(P2); : : : ; ρ(PM )) ; the set of all equations (1.1) for all webpages fP1;:::;PM g can now be written as a single matrix equation T T r = r G1 (1.3) 2 This is because the jth entry of the result of the multiplication on the right is obtained as the product of rT with th th the j column of G1. Since the entries g(i; j) in the j column are non zero only for ip such that Pip ! Pj, 1 th T in which case g(ip; j) = , we have that the j entry of the result of the multiplication r G1 equals #(Pip ) 0 0 1 B 0 C B C B . C B . C B C B 0 C B C B 0 C B 1 C B #(P ) C B i1 C B C B 0 C B C B 0 C B . C X ρ(Pip ) (ρ(P1); ρ(P2); : : : ; ρ(Pi); : : : ; ρ(PM )) B . C = B C #(Pip ) B C Pip !Pj B 0 C B C B 0 C B 1 C B #(Pi ) C B 2 C B 0 C B C B 0 C B C B . C B . C B C @ 0 A 0 Thus, the result of such multiplication would be equal to the jth entry of rT just in case X ρ(Pip ) ρ(Pj) = #(Pip ) Pip !Pj Notes: • Matrices of size M × M which have much fewer than M 2 non zero entries are called sparse matrices. • We say that λ is a left-hand eigenvalue of a matrix G if there exist a vector x such that xT G = λxT; vectors satisfying this property are called the (left-hand) eigenvectors corresponding to the eigenvalue λ.