The Google Markov Chain: Convergence Speed and Eigenvalues
U.U.D.M. Project Report 2012:14

The Google Markov Chain: convergence speed and eigenvalues

Fredrik Backåker

Degree project in mathematics, 15 credits
Supervisor and examiner: Jakob Björnberg
June 2012
Department of Mathematics, Uppsala University

Acknowledgments

I would like to thank my supervisor Jakob Björnberg for helping me write this thesis.

Contents

1 Introduction
2 Definitions and background
2.1 Markov chains
2.2 The Google PageRank
3 Convergence speed
3.1 General theory of convergence speed
3.2 Convergence speed and eigenvalues of Google's Markov chain
4 Simulations
4.1 Multiplicity of the second eigenvalue
4.2 Quality of the limit distribution
5 Conclusion
6 References
Appendices: Matlab code

1 Introduction

There are many different search engines on the internet that help us find the information we want. These search engines use different methods to rank pages and display them so that the most relevant and important information is shown first. In this thesis we study a mathematical method that is part of PageRank, the ranking method of the search engine Google, which determines the order in which pages are displayed in a search. The method treats pages as the states of a Markov chain, with outgoing links from a page as the possible transitions; the corresponding transition probability is divided equally among the outgoing links of that page. The resulting transition probability matrix is then used to compute a stationary distribution, in which the page with the largest stationary value is ranked first, the page with the second largest is ranked second, and so on. The method comes in two variants, with or without a dampening factor; the variant without a dampening factor is the one just described.
In the other variant, which we study in this thesis, a dampening factor (often set to 0.85) is introduced, mainly to ensure that the stationary distribution is unique. This variant is considered the more useful one, and in this thesis we take a brief look at how the dampening factor affects the computation of PageRank. We begin by going through some basic definitions for Markov chains and explaining the Google PageRank in more detail. In the following section we review some general theory about the rate of convergence of Markov chains, since the eigenvalues of a transition probability matrix turn out to be connected to the speed of convergence to its steady state. Further, we look at the second largest eigenvalue of the Google Markov chain and its algebraic multiplicity, which are the main factors affecting the convergence rate of the chain. Next, we go through some results on how the second eigenvalue of the Google Markov chain is bounded by the dampening factor, which makes the choice of the dampening factor very important. We end with some simulations that check how different properties of PageRank are affected by the choice of the dampening factor and, in particular, which value of the dampening factor is best suited for fast convergence of the Google Markov chain.

2 Definitions and background

2.1 Markov chains

A discrete-time Markov chain is a stochastic process $\{X_n\}$ with finite state space $S$ that satisfies the Markov property:

$$P(X_n = x_n \mid X_0 = x_0, \ldots, X_{n-1} = x_{n-1}) = P(X_n = x_n \mid X_{n-1} = x_{n-1})$$

for all $x_0, \ldots, x_n \in S$ and $n \geq 1$. In other words, the next step of a Markov chain is independent of the past and relies only upon the most recent state. The chain is called time-homogeneous if the transition probabilities do not change over time, i.e. if for each $i, j \in S$, $p_{ij} = P(X_n = j \mid X_{n-1} = i)$ does not depend on $n$.
In this case the probabilities $p_{ij}$ are the Markov chain's transition probabilities when moving from state $i$ to state $j$. Also let $p_{ij}^{(m)} = P(X_{m+n} = j \mid X_n = i)$ denote the transition probabilities in $m$ steps, $m = 0, 1, 2, \ldots$ The probabilities can be collected in a transition probability matrix, here denoted by $P$:

$$P = \begin{pmatrix} p_{00} & p_{01} & \cdots \\ p_{10} & p_{11} & \cdots \\ \vdots & \vdots & \ddots \end{pmatrix}$$

This matrix is called a stochastic matrix if all of its row vectors sum to one: $\sum_j p_{ij} = 1$. The Markov chain is said to be irreducible if it is possible to reach each state $j$ from any other state $i$ in some number of steps. More formally,

$$P(X_n = j \mid X_0 = i) > 0 \text{ for some } n \geq 0, \text{ for all } i, j.$$

A state $i$ has period $k$ if any return to state $i$ occurs in multiples of $k$ steps:

$$k = \gcd\,\{\, n : P(X_n = i \mid X_0 = i) > 0 \,\}$$

If all the states in a Markov chain have period one, the chain is said to be aperiodic, i.e. the greatest common divisor of the return times to any state from itself is one. The following result is standard and we do not prove it.

Proposition 1. A Markov chain that is irreducible and aperiodic with finite state space has a unique stationary distribution $\pi$, which is a probability vector such that $\pi = \pi P$. Additionally, the transition probabilities converge to a steady state as the number of steps goes to infinity, in the sense that

$$\lim_{m \to \infty} p_{ij}^{(m)} = \pi_j \quad \text{for all } i, j \in S.$$

2.2 The Google PageRank

The Google PageRank is one of many methods that the search engine Google uses to determine the importance or relevance of a page. This method uses a special Markov chain to compute the rank of web pages, and this rank determines the order in which the pages are listed in a Google search.

Let all the web pages Google communicates with be denoted by the state space $W$. The size of $W$ is $n$, several billion pages. Let $C = (c_{ij})$ denote the connectivity matrix of $W$: $C$ is an $n \times n$ matrix with $c_{ij} = 1$ if there is a hyperlink from page $i$ to page $j$, and $c_{ij} = 0$ otherwise.
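As a quick numerical illustration of Proposition 1 (our own sketch in Python, not part of the thesis, whose appendix uses Matlab), one can take a small irreducible, aperiodic chain, compute its stationary distribution as the left eigenvector of $P$ for eigenvalue 1, and check that every row of $P^m$ approaches it:

```python
import numpy as np

# A small irreducible, aperiodic chain on 3 states (all entries positive,
# each row sums to one). The matrix itself is an invented example.
P = np.array([
    [0.5, 0.3, 0.2],
    [0.1, 0.6, 0.3],
    [0.4, 0.2, 0.4],
])

# Stationary distribution: left eigenvector of P for eigenvalue 1, normalized.
eigvals, eigvecs = np.linalg.eig(P.T)
k = np.argmin(np.abs(eigvals - 1.0))
pi = np.real(eigvecs[:, k])
pi = pi / pi.sum()

# pi satisfies pi = pi P ...
assert np.allclose(pi @ P, pi)

# ... and every row of P^m approaches pi as m grows (Proposition 1).
Pm = np.linalg.matrix_power(P, 50)
print(np.allclose(Pm, np.tile(pi, (3, 1))))  # True
```

By $m = 50$ each row of $P^m$ agrees with $\pi$ to within numerical precision, regardless of the starting state.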
The number of outgoing links from page $i$ is the row sum

$$s_i = \sum_{j=1}^{n} c_{ij}$$

If $s_i = 0$, page $i$ has no outgoing links and is called a dangling node. Let $T = (t_{ij})$ be given by $t_{ij} = c_{ij}/s_i$ if $s_i \geq 1$, and $t_{ij} = 1/n$ if $i$ is a dangling node. Thus $T$ can be seen as the transition probability matrix of a Markov chain with state space $W$. Furthermore, to define the Google Markov chain we include an additional parameter $d$, a dampening factor that can be set between 0 and 1. The transition probability matrix of the Google Markov chain is defined by

$$P = dT + (1-d)\frac{1}{n}E$$

where $E$ is the $n \times n$ matrix of all ones. This Markov chain can be described by a "random surfer" who, with probability $d$, clicks on an outgoing link on the current web page, each link being equally likely, or, if the page has no outgoing links, moves to a page chosen at random in $W$. With probability $1-d$, the surfer instead jumps to a page chosen at random among all $n$ pages. The Google Markov chain is finite; whether it is irreducible and aperiodic depends on the value of $d$. If $d < 1$, the chain is aperiodic, since every state has a positive probability of returning to itself in one step and therefore has period one. If $d = 1$, we get $P = T$, and the periodicity and irreducibility are completely determined by the outgoing links of the pages. It is then possible, for example, that two pages link only to each other and create a subset with period two. In that case the chain is neither aperiodic nor irreducible, there is no unique stationary distribution, and because the chain stays in a subset, the limit distribution $\pi_j$ depends on the starting state. This is the realistic case considering how the internet is structured, and it is the main reason why the dampening factor $d$ is introduced. As we will see, $d$ also affects the convergence speed of the Google Markov chain. In the computation of PageRank, $d$ is usually set to 0.85 [1], and then the Google Markov chain is finite, irreducible and aperiodic.
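The construction of $T$ and $P$ from the connectivity matrix can be sketched as follows. This is a Python sketch (the thesis's appendix uses Matlab), and the 4-page link structure is invented purely for illustration; note how the dangling node's row of $T$ becomes uniform:

```python
import numpy as np

# Hypothetical 4-page web: page 3 is a dangling node (no outgoing links).
n = 4
C = np.array([
    [0, 1, 1, 0],   # page 0 links to pages 1 and 2
    [0, 0, 1, 1],   # page 1 links to pages 2 and 3
    [1, 0, 0, 0],   # page 2 links to page 0
    [0, 0, 0, 0],   # page 3 has no outgoing links (dangling)
], dtype=float)

s = C.sum(axis=1)                 # row sums s_i = number of outlinks of page i
T = np.empty((n, n))
for i in range(n):
    # t_ij = c_ij / s_i if s_i >= 1, and 1/n for a dangling node
    T[i] = C[i] / s[i] if s[i] >= 1 else np.full(n, 1.0 / n)

d = 0.85                          # dampening factor
P = d * T + (1 - d) * np.ones((n, n)) / n   # P = dT + (1-d)(1/n)E

assert np.allclose(P.sum(axis=1), 1.0)      # P is stochastic
print(np.allclose(P[0], [0.0375, 0.4625, 0.4625, 0.0375]))  # True
```

Row 0 of $P$ shows both ingredients: probability $0.85 \cdot \tfrac{1}{2} + 0.15 \cdot \tfrac{1}{4} = 0.4625$ for each of page 0's two links, and $0.15 \cdot \tfrac{1}{4} = 0.0375$ for the random jump to any other page.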
Hence by Proposition 1 there exists a unique stationary distribution $\pi$. This stationary distribution is used to rank all the pages in $W$: the page with the largest $\pi_i$ is ranked first, the page with the second largest is ranked second, and so on, until we get a Google PageRank for all the pages. One way of computing the Google PageRank is to simulate the transitions until an (approximate) steady state is reached; according to Brin and Page [1], the creators of Google, "a PageRank for 26 million web pages can be computed in a few hours on a medium size workstation".

3 Convergence speed

3.1 General theory of convergence speed

Since the Google PageRank involves many billions of pages, one would like to know how fast it can be computed. This can be determined by finding how fast the transition probability matrix of the Google Markov chain converges to its steady state, as in Proposition 1. To find this rate of convergence, we need to go through some definitions and theorems.
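Before turning to the theory, a small simulation (our own Python illustration, not from the thesis) shows the phenomenon it will explain. For the Google matrix $P = dT + (1-d)\frac{1}{n}E$, the difference of two probability vectors is annihilated by the $E$ term (its components sum to zero), so one power-iteration step shrinks the $\ell_1$ distance to $\pi$ by at least a factor $d$; the link structure below is randomly generated for the demonstration:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 20, 0.85

# Random connectivity matrix for a hypothetical 20-page web (rows may dangle).
C = (rng.random((n, n)) < 0.15).astype(float)
s = C.sum(axis=1)
T = np.where(s[:, None] >= 1, C / np.maximum(s, 1.0)[:, None], 1.0 / n)
P = d * T + (1 - d) / n                     # Google matrix, entrywise

# Stationary distribution from the left eigenvector for eigenvalue 1.
vals, vecs = np.linalg.eig(P.T)
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
pi /= pi.sum()

# Power iteration from the uniform vector; record the l1 error at each step.
x = np.full(n, 1.0 / n)
errs = []
for _ in range(8):
    x = x @ P
    errs.append(np.abs(x - pi).sum())

# Each step contracts the error by at least a factor d = 0.85.
print(all(errs[m + 1] <= d * errs[m] + 1e-12 for m in range(7)))  # True
```

The geometric decay rate observed here is exactly what the coming sections quantify: the convergence speed is governed by the second largest eigenvalue of $P$, which for the Google Markov chain is bounded by the dampening factor $d$.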