Personalized PageRank Estimation and Search: A Bidirectional Approach
Peter Lofgren, Department of CS, Stanford University, [email protected]
Siddhartha Banerjee, School of ORIE, Cornell University, [email protected]
Ashish Goel, Department of MS&E, Stanford University, [email protected]

ABSTRACT

We present new algorithms for Personalized PageRank estimation and Personalized PageRank search. First, for the problem of estimating Personalized PageRank (PPR) from a source distribution to a target node, we present a new bidirectional estimator with simple yet strong guarantees on correctness and performance, and 3x to 8x speedup over existing estimators in experiments on a diverse set of networks. Moreover, it has a clean algebraic structure which enables it to be used as a primitive for the Personalized PageRank Search problem: Given a network like Facebook, a query like "people named John," and a searching user, return the top nodes in the network ranked by PPR from the perspective of the searching user. Previous solutions either score all nodes or score candidate nodes one at a time, which is prohibitively slow for large candidate sets. We develop a new algorithm based on our bidirectional PPR estimator which identifies the most relevant results by sampling candidates based on their PPR; this is the first solution to PPR search that can find the best results without iterating through the set of all candidate results. Finally, by combining PPR sampling with sequential PPR estimation and Monte Carlo, we develop practical algorithms for PPR search, and we show via experiments that our algorithms are efficient on networks with billions of edges.

Categories and Subject Descriptors

H.3.3 [Information Search and Retrieval]: Search process; G.2.2 [Graph Theory]: Graph Algorithms

General Terms

Algorithms, Performance, Experimentation, Theory

Keywords

Personalized Search, Personalized PageRank, Social Network Analysis

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
WSDM 2016, February 22 - 25, 2016, San Francisco, CA, USA
Copyright is held by the owner/author(s). Publication rights licensed to ACM.
ACM 978-1-4503-3716-8/16/02/$15.00
DOI: http://dx.doi.org/10.1145/2835776.2835823
arXiv:1507.05999v3 [cs.DS] 15 Dec 2015

1. INTRODUCTION

On social networks, personalization is necessary for returning relevant results for a query. For example, if a user searches for a common name like John on a social network like Facebook, the results should depend on who is doing the search and who their friends are. A good personalized model for measuring the importance of a node $t$ to a searcher $s$ is Personalized PageRank $\pi_s(t)$ [20, 13, 12]; this motivates a natural Personalized PageRank Search Problem: Given

• a network with nodes $V$ (each associated with a set of keywords) and edges $E$ (possibly weighted and directed),
• a keyword inducing a set of targets: $T = \{t \in V : t \text{ is relevant to the keyword}\}$,
• a searching user $s \in V$ (or more generally, a distribution over starting nodes),

return the top-$k$ targets $t_1, \ldots, t_k \in T$ ranked by Personalized PageRank $\pi_s(t_i)$.

The importance of personalized search extends beyond social networks. For example, personalized PageRank can be used to rank items in a bipartite user-item graph, in which there is an edge from a user to an item if the user has liked that item. This has proven useful on YouTube when recommending videos [5] and on Twitter for suggested users [3, 12]. On the web graph there is a large body of work on using Personalized PageRank to rank web pages (e.g. [14, 13]). The most clear-cut motivation for our work is the social network name-search application discussed above, which we use as a running example in this paper.

The personalized search problem is difficult because every searching user has a different ranking on the target nodes. One naive solution would be to precompute the ranking for every searching user, but if our network has $n$ users this requires $\Theta(n^2)$ storage, which is clearly infeasible. Another naive baseline would be to use power iteration [20] at query time, but that would take $\Theta(m)$ computation between the search query and response, where $m$ is the number of edges, which is also clearly infeasible. The challenge we face is to create a data structure much smaller than $O(n^2)$ which allows us to rank $|T|$ targets in response to a query in less than $O(|T|)$ time.

Previous work has considered the problem of personalized search on social networks. For example, Vieira et al. [24] consider this problem and provide excellent motivation for why results to a name-search query should be ranked based on the friendships of the searching user and the candidate results. They and others (e.g. [4]) propose to rank results by shortest path length. However, this metric doesn't take into account the number of paths between two users: if the searcher and two results John A and John B are distance 3 apart, but the searcher and John A are connected by 100 length-3 paths while the searcher and John B are connected by a single length-3 path, then John A should be ranked above John B, yet the shortest distance can't distinguish the two. To the best of our knowledge, no prior work has solved the Personalized PageRank search problem using less than $O(n^2)$ storage and $O(|T|)$ query time. We are able to solve this by exploiting a new bidirectional method of PageRank, introduced in [19] and improved in this work.

Our search algorithm is based on two key ideas. The first is that we can find the top target nodes without having to consider each separately by sampling a target $t_i \in T$ in proportion to its Personalized PageRank $\pi_s(t_i)$. Because the top results typically have a much higher personalized PageRank than an average result, by sampling we can find the top results without iterating over all the results. The second idea is that the probability of a random walk exactly reaching an element in $T$ is often very small, but by pre-computing an expanded set of nodes around each target, we can efficiently sample random walks until they get close to a target node, and then use the pre-computed data to sample targets $t_i$ in proportion to $\pi_s(t_i)$.

There are currently two main limitations to our work. First, because we do pre-computation on the set of nodes relevant to a query, we need the set of queries to be known in advance, although in the case of name search we can simply let the space of queries be the set of all first or last names. Second, the pre-computed storage is significant; for name-search it is $O(n\sqrt{m})$ to achieve query running time $O(\sqrt{m})$, where $n$ is the number of nodes and $m$ is the number of edges. However, large graphs tend to be sparse, so this is still much smaller than $O(n^2)$ and is less storage than any prior solution to the Personalized PageRank Search problem. Also, pre-computation doesn't need to be done for all queries: for queries with small or very large target sets we describe alternative algorithms which do not require pre-computation.

  - BiPPR-Precomp-Grouped precomputes and stores the reverse vectors $y^t$, $t \in T$, after grouping them by their coordinates. This exploits the natural sparsity of these vectors to speed up the computation of the PPR estimates at runtime.
  - BiPPR-Precomp-Sampling samples nodes $t \in T$ proportional to their PPR $\pi_s(t)$. Since PPR values are usually highly skewed, this serves as a good proxy for finding the top $k$ search results.
• Extensive simulations on the Twitter-2010 network to test the scalability of our algorithms for PPR-search. Our experiments demonstrate the trade-off between storage and runtime, and suggest that we should use a combination of methods, depending on the size of the set of targets $T$ induced by the keyword.

2. PRELIMINARIES

We are given a graph $G = (V, E)$ with $n$ nodes and $m$ edges. Define the out-neighbors of a node $u$ by $N^{out}(u) = \{v : (u, v) \in E\}$ and let $d^{out}(u) = |N^{out}(u)|$; define $N^{in}(u)$ and $d^{in}(u)$ similarly. Define the average degree of nodes $\bar{d} = \frac{m}{n}$. If the graph is weighted, for each $(u, v) \in E$ there is some positive weight $w_{u,v}$; otherwise we define $w_{u,v} = \frac{1}{d^{out}(u)}$ for all $(u, v) \in E$. For simplicity we assume the weights are normalized such that for all $u$, $\sum_v w_{u,v} = 1$.

The personalized PageRank from source distribution $\sigma$ to target node $t$ can be defined using linear algebra as the solution to the equation $\pi_\sigma = \pi_\sigma(\alpha\sigma + (1 - \alpha)W)$, or equivalently defined using random walks

$$\pi_\sigma(t) = \Pr[\text{a random walk starting from } s \sim \sigma \text{ of length} \sim \text{geometric}(\alpha) \text{ stops at } t]$$

as shown in [2]. For concreteness, in this paper we often assume $\sigma = e_s$ for some single node $s$ (meaning the random walks always start at a single node $s$), but all results extend