Fast Parallel Pagerank: a Linear System Approach
Total Page:16
File Type:pdf, Size:1020Kb
Fast Parallel PageRank: A Linear System Approach ∗ David Gleich Leonid Zhukov Pavel Berkhin Stanford University, ICME Yahoo! Yahoo! Stanford, CA 94305 701 First Ave 701 First Ave [email protected] Sunnyvale, CA 94089 Sunnyvale, CA 94089 [email protected] pberkhin@yahoo- inc.com ABSTRACT PageRank is also becoming a useful tool applied in many In this paper we investigate the convergence of iterative sta- Web search technologies and beyond, for example, spam de- tionary and Krylov subspace methods for the PageRank lin- tection [13], crawler configuration [10], or trust networks ear system, including the convergence dependency on tele- [20]. In this setting many PageRanks corresponding to dif- portation. We demonstrate that linear system iterations ferent modifications – such as graphs with a different level of converge faster than the simple power method and are less granularity (HostRank) or different link weight assignments sensitive to the changes in teleportation. (internal, external, etc.) – have to be computed. For each In order to perform this study we developed a framework technology, the critical computation is the PageRank-like for parallel PageRank computing. We describe the details vector of interest. Thus, methods to accelerate and paral- of the parallel implementation and provide experimental re- lelize these computations are important. Various methods to sults obtained on a 70-node Beowulf cluster. accelerate the simple power iterations process have already been developed, including an extrapolation method [19], a block-structure method [18], and an adaptive method [17]. Categories and Subject Descriptors Traditionally, PageRank has been computed as the prin- G.1.3 [Numerical Analysis]: Numerical Linear Algebra; ciple eigenvector of a Markov chain probability transition D.2 [Software]: Software Engineering; H.3.3 [Information matrix. In this paper we consider the PageRank linear sys- Storage and Retrieval]: Information Search and Retrieval tem formulation and its iterative solution methods. An ef- ficient solution of a linear system strongly depends on the proper choice of iterative methods and, for high performance General Terms solvers, the computation architecture as well. There is no Algorithms, Performance, Experimentation best overall iterative method; one of the goals of this paper is to investigate the best method for the particular class of problems arising from PageRank computations on parallel Keywords architectures. PageRank, Eigenvalues, Linear Systems, Parallel Comput- It is well known that the random teleportation used in ing PageRank strongly affects the convergence of power iter- ations [15]. It has also been shown that high teleportation can help spam pages to accumulate PageRank [13], but a re- 1. INTRODUCTION duction in teleportation typically hampers the convergence The PageRank algorithm, a method for computing the of standard power methods. In this paper, we investigate relative rank of web pages based on the Web link struc- how the convergence of the linear system is affected by a ture, was introduced in [25, 8] and has been widely used reduction in a degree of teleportation. since then. PageRank computations are a key component of To perform multiple numerical experiments on real Web modern Web search ranking systems. For a general review graph data, we developed a system that can compute results of PageRank computing see [22, 7]. within minutes on Web graphs with one billion or more links. Until recently, the PageRank vector was primarily used When constructing our system, we did not attempt to op- to calculate a global importance score for each page on timize each method and instead chose an approach that is the web. These scores were recomputed for each new Web amenable to working with multiple methods in a consistent graph crawl. Recently, significant attention has been given manner while still easily adaptable to new methods. An ex- to topic-specific and personalized PageRanks [14, 16]. In ample of this trade-off is that our system stores edge weights both cases one has to compute multiple PageRanks corre- for each graph, even though many Web graphs do not use sponding to various teleportation vectors for different topics these weights. or user preferences. The remainder of the paper is organized as follows. The next section, PageRank Linear System, contains an analyt- ∗ Work performed while at Yahoo! ical derivation of a linear system version of PageRank, fol- lowed by discussion of numerical methods. We then pro- vide details on our parallel implementation. The Numerical Method IP SAXPY MV Storage PAGERANK 1 1 M + 3v JACOBI 1 1 M + 3v GMRES i + 1 i + 1 1 M + (i + 5)v BiCG 2 5 2 M + 10v BiCGSTAB 4 6 2 M + 10v Table 1: Computational Requirements. Operations per iteration: IP counts inner products, SAXPY counts AXPY operations, MV counts matrix vector multiplications, and Storage counts the number of matrices and vectors required for the method. Figure 1: Chart of computational methods. the second condition, aperiodicity, is routinely fulfilled (after the first condition is satisfied). The easiest way to achieve strong connectivity is to add to every page, dangling or not, Experiments section contains experimental results including a small degree of teleportation timings of the iterative methods on various size graphs and P 00 = cP 0 + (1 − c)evT , e = (1,..., 1). (3) an analysis of error metrics. where c is a teleportation coefficient. In practice 0.85 ≤ c < 2. PAGERANK LINEAR SYSTEM 1. After these modifications, matrix P 00 is row-stochastic and irreducible, and therefore, simple power iterations 2.1 PageRank Formulation (k+1) 00T (k) x = P x (4) Consider a Web graph adjacency matrix A with elements 00T Aij equal to 1 when there is a link i → j and equal to 0 oth- for the eigen-system P x = x converge to its principal erwise. Here i, j = 1 : n and n is a number of Web pages. eigenvector. For pages with a non-zero number of out-links deg(i) > 0, Now, combining Eq.(2 - 4) we get the rows of A can be normalized (made row-stochastic) by [cP T + c(vdT ) + (1 − c)(veT )]x = x. (5) setting Pij = Aij /deg(i). Assuming there are no dangling T T pages (vide infra) the PageRank x can be defined as a lim- Noting that (e x) = (x e) = ||x||1 = ||x|| (henceforth, all iting solution of the iterative process norms will be assumed to be 1-norms) we can derive a con- (k+1) X (k) X (k) venient identity xj = Pij xi = xi /deg(i). (1) T T i i→j (d x) = ||x|| − ||P x||. (6) At each iterative step, the process distributes page author- Then, Eq. (5) can be written as a linear system ity weights equally along the out-links. The PageRank is T defined as a stationary point of the transformation from (I − cP )x = k v (7) (k) (k+1) x to x defined by (1). This transfer of authority where k = k(x) = ||x|| − c||P T x|| = (1 − c)||x|| + (dT x). corresponds to the Markov chain transition associated with A particular value of k only results in a rescaling of x, the random surfer model when a random surfer on a page which has no effect on page ranking. The solution can always arbitrarily follows one of its out-links uniformly. be normalized to be a probability distribution by x/||x||. The Web contains many pages without out-links, called This brings us to the point that equation (7) with a fixed k dangling nodes. Dangling pages present a problem for the constitutes a linear system formulation of PageRank. In the mathematical PageRank formulation. A review of various future we simply set k = 1 − c. We also introduce the nota- approaches dealing with dangling pages can be found in [11]. tions A = I − cP T and b = k v = (1 − c)v, thus the linear One way to overcome this difficulty, is to slightly change system under consideration has the standard form Ax = b. the transition matrix P to a truly row-stochastic matrix The non-normalized solution to the linear system has an P 0 = P + d · vT , (2) advantage of depending linearly on the teleportation distri- bution. This property is important for blending topical or deg(i) where di = δ0 is the dangling page indicator, and v is otherwise personalized PageRanks. For this linear system, some probability distribution over pages. This model means we study the performance of advanced iterative methods in that the random surfer jumps from a dangling page accord- a parallel environment. ing to a distribution v. For this reason vector v is called a Casting PageRank as a linear system was suggested by teleportation distribution. Originally uniform teleportation, Arasu et al. [2] where Jacobi, Gauss-Seidel, and Successive vi = 1/n, was used, but non-uniform teleportation received Over-Relaxation iterative methods were considered. Numer- much attention after the advent of PageRank personaliza- ical solutions for various Markov chain problems are also tion [14, 16]. investigated in [26]. The classic simple power iterations (1) starting at an ar- bitrary initial guess converge to a principal eigenvector of P 2.2 Iterative Methods under two conditions specified in the Perron-Frobenius theo- The PageRank linear system matrix A = I − cP T is very rem. The first condition, matrix irreducibility or strong con- large, sparse and non-symmetric. Solution of the linear sys- nectivity of a graph, is not satisfied for a Web graph, while tem Eq. (7) by a direct method is not feasible due to the Name Nodes Links Storage Size 2.3 Parallel Implementation edu 2M 14M 176 MB We chose a parallel implementation of the PageRank al- yahoo-r2 14M 266M 3.25 GB gorithms to meet the scalability requirement. We wanted uk 18.5M 300M 3.67 GB to be able to compute PageRank vectors for large graphs yahoo-r3 60M 850M 10.4 GB (one billion links or more) quickly to facilitate experiments db 70M 1B 12.3 GB with the PageRank equation.