Parallel Personalized PageRank on Dynamic Graphs

Wentian Guo, Yuchen Li, Mo Sha, Kian-Lee Tan
School of Computing, National University of Singapore
{wentian, liyuchen, sham, tankl}@comp.nus.edu.sg

ABSTRACT

Personalized PageRank (PPR) is a well-known proximity measure in graphs. To meet the need for dynamic PPR maintenance, recent works have proposed a local update scheme to support incremental computation. Nevertheless, sequential execution of the scheme is still too slow for high-speed stream processing. Therefore, we are motivated to design a parallel approach for dynamic PPR computation. First, as updates always come in batches, we devise a batch processing method to reduce the synchronization cost across the updates in a batch and to enable more parallelism for iterative parallel execution. Our theoretical analysis shows that the parallel approach has the same asymptotic complexity as the sequential approach. Second, we devise novel optimization techniques to effectively reduce runtime overheads for parallel processing. Experimental evaluation shows that our parallel algorithm can achieve orders of magnitude speedups on GPUs and multi-core CPUs compared with the state-of-the-art sequential algorithm.

PVLDB Reference Format:
Wentian Guo, Yuchen Li, Mo Sha, Kian-Lee Tan. Parallel Personalized PageRank on Dynamic Graphs. PVLDB, 11(1): 93-106, 2017. DOI: https://doi.org/10.14778/3136610.3136618

1. INTRODUCTION

Personalized PageRank is one of the most widely used proximity measures in graphs. While PageRank quantifies the overall importance of vertices, the PPR vector measures the importance of vertices w.r.t. a given source vertex s. For instance, a vertex v with a larger PPR value w.r.t. s implies that a random walk starting from s has a higher probability to reach v. PPR is used in applications of many fields, such as web search for Internet individuals [39, 22], user recommendation [8, 19], community detection [48] and graph partitioning [6, 5].

Most existing works [39, 22, 6, 5, 30, 31, 29, 11] focus on PPR computation on static graphs. However, the representative graphs often evolve rapidly in many applications. For example, network traffic data averages 10^9 packets/hour/router for large ISPs [17] and Twitter has more than 500 million tweets per day [3]. Since computing PPR from scratch is prohibitively slow against a high rate of graph updates, recent works [38, 49] focus on incremental PPR maintenance. The state-of-the-art dynamic PPR approaches are based on a local update scheme [6, 5, 11, 31, 29, 38, 49]. This scheme takes two steps against each single update: (1) restore invariants; (2) perform local pushes. More specifically, each vertex v keeps two values P_s(v) and R_s(v), where P_s(v) is the current estimate of the PPR value of v w.r.t. the source vertex s and R_s(v) is the upper bound on the estimation bias. We call P_s(v) and R_s(v) the estimate and residual of v respectively. An invariant is maintained for all vertices to ensure the residual is valid. To handle any update, e.g., an edge insertion/deletion, the invariant is first restored, which results in a change in the residual. Subsequently, if the residual of a vertex v exceeds a user-specified threshold ε, the residual is pushed to v's neighbors.

Although it has been shown in [49] that only a few operations are needed to handle any single edge update (insertion/deletion) under the common edge arrival models, the scheme is practically inefficient. We have observed latencies of up to 2-3 minutes against an update batch consisting of 5000 edges using the state-of-the-art approach [49] in our experiments. This hardly matches the streaming rates of real-world dynamic graphs (e.g., 7000 tweets are generated per second in Twitter [3]). Meanwhile, emerging hardware architectures provide consumers with massive computation power. Two of the most popular hardware platforms, multi-core CPUs and GPUs, have shown great potential in accelerating graph applications [42, 36, 47, 34, 25, 35]. By utilizing the emerging hardware, we are motivated to devise an efficient parallel approach for dynamic PPR computation under the state-of-the-art local update scheme to achieve high-speed stream processing.

Parallelizing dynamic PPR computation poses two new challenges. First, as updates often come in batches, it remains unclear whether a batch can be processed in parallel under the local update scheme. A naive solution is to synchronize on each single update to repair the invariant and then launch the local push. However, this solution causes significant synchronization overheads against batch updates. Second, parallelizing the local push is non-trivial, as it incurs extra overheads compared with the sequential approach. We call the parallelized local push the parallel push. The extra overheads of the parallel push come from two sources: (a) the parallel push can take more operations, as it reads stale residual values at the beginning of an iteration and then propagates less probability mass from the frontier vertices (the frontier contains the vertices that need to be pushed in the current iteration); (b) the parallel push needs to iteratively maintain a vertex frontier and thus incurs synchronization costs to merge duplicate vertices.

In this paper, we propose a novel parallel approach to tackle the aforementioned challenges. First, we devise a batch processing method to avoid synchronization costs within a batch as well as to generate enough workload for parallelism. Our theoretical analysis proves that the parallel approach has the same asymptotic complexity as its sequential counterpart. Second, we propose novel optimization techniques to practically reduce the number of local push operations (eager propagation) and to efficiently maintain the vertex frontier in parallel by utilizing the monotonicity of the local push process (local duplicate detection).

We hereby summarize our contributions as follows:
• We introduce a parallel approach to support dynamic PPR computation over high-speed graph streams. One salient feature is that updates are processed in batches. Theoretical analysis shows that the parallel approach has the same asymptotic complexity as its sequential counterpart under the two common edge arrival models. To the best of our knowledge, this is the first work on parallelizing dynamic PPR processing.
• We propose several optimization techniques to improve the performance of the parallel push. Eager propagation helps to reduce the number of local push operations. Our novel frontier generation method efficiently maintains the vertex frontier by mitigating the synchronization overheads of merging duplicate vertices.
• We implement our parallel approach on GPUs and multi-core CPUs. The experiments over real-world datasets have verified the effectiveness of our proposal on both architectures. The proposed approach can achieve orders of magnitude speedups on GPUs and multi-core CPUs compared to the state-of-the-art sequential approach.

The remaining part of this paper is organized as follows. Section 2 introduces preliminaries and the background. Section 3 presents the parallel local update approach, together with its theoretical analysis. We propose the optimizations for the parallel push in Section 4. Section 5 reports the experimental results. The related works are discussed in Section 6. Finally, we conclude the paper in Section 7.

2. PRELIMINARY

In this section, we briefly describe the PPR problem and review the state-of-the-art local update scheme for dynamic PPR maintenance.

2.1 Personalized PageRank (PPR)

Let G(V, E) be a directed graph with the vertex set V and the edge set E. Intuitively, PPR can be interpreted as a random walk process on G that starts from a source vertex s and then iteratively jumps to a uniformly chosen out-neighbor or teleports back to s with probability α. The PPR value π_s(v) is the probability that such a random walk starts from s and stops at v. We denote A as the adjacency matrix of G and D as the diagonal matrix whose diagonal entries are the out-degrees of all vertices in G. The PPR vector π_s w.r.t. s is the solution of the following linear equation:

    π_s = α·e_s + (1−α)·W·π_s    (1)

where α is the teleport probability (a constant, typically set to 0.15), e_s is a personalized unit vector with a single non-zero entry of 1 at s, and W is the transition matrix s.t. W = A^T·D^{−1}.

Note that the general PPR formulation does not restrict the personalized vector to be a unit vector. We omit the discussion of the general case as it can be reduced to the unit vector scenario. In fact, many existing works study how to use the unit vector formulation as a building block to efficiently support the general case by maintaining multiple PPR vectors with different personalized unit vectors. Interested readers are referred to [49, 38, 10, 31, 6, 9, 29, 11]. Henceforth, this paper focuses on providing an efficient parallel approach for the unit vector formulation.

2.2 Dynamic Graph Model

We introduce a typical dynamic graph model [10, 49] to articulate the dynamic PPR problem. The dynamic model consists of an unbounded sequence of updates E^t, which indicates the set of edges arriving at time step t. Each element of E^t is (u, v, op), which means an edge u → v with op denoting the type (insertion/deletion). Note that in previous works [38, 49] only one single edge update arrives at each time step t; in the following we discuss them with the same notations, but readers should note that |E^t| = 1 in the single-update case.

Initially, the graph is G^0 = (V^0, E^0) at time step 0. At time step t > 0, E^t is applied to G^{t−1} and produces a new graph G^t. Note that an edge insertion may introduce new vertices to V^t and the deletion of an edge may discard some vertices with zero degree from V^{t−1}. Given the model, the goal of the dynamic PPR problem is to incrementally maintain the PPR vector for each update. (To simplify the presentation, the time step subscript is dropped when the context is clear.)

2.3 Sequential Local Update

The state-of-the-art approaches for dynamic PPR vector maintenance rely on a local update scheme [49, 6, 5]. The PPR value of a vertex returned by the local update scheme is an ε-approximation to the true PPR value, i.e., the absolute error is smaller than the error threshold ε. The scheme keeps two values for each vertex v: P_s(v) and R_s(v), where P_s(v) is the current estimate of the PPR value of v w.r.t. the source vertex s, and R_s(v) is the "residual" which bounds the estimation bias.

Algorithm 1 RestoreInvariant
Input: (G, P_s, R_s, u, v, op)
Require: (u, v, op) is the updating edge with op being the insert/delete operation.
1: procedure Insert(u, v)
2:   R_s(u) += [(1−α)·P_s(v) − P_s(u) − α·R_s(u) + α·1_{u=s}] · 1/(d_out(u)·α)
3: procedure Delete(u, v)
4:   R_s(u) −= [(1−α)·P_s(v) − P_s(u) − α·R_s(u) + α·1_{u=s}] · 1/(d_out(u)·α)
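To make the restore step concrete, the following C++ fragment mirrors Algorithm 1 for a single edge update. It is a minimal sketch under our own assumptions: the container names are illustrative, and d_out(u) is taken as the out-degree of u after the update has already been applied to the graph (the paper's exact convention for deletions may differ).

#include <vector>

struct PPRState {
    double alpha = 0.15;         // teleport probability
    int    s = 0;                // source vertex of the maintained PPR vector
    std::vector<double> P;       // estimates  P_s(v)
    std::vector<double> R;       // residuals  R_s(v)
};

// Restore the invariant after inserting edge u -> v (Algorithm 1, Insert).
// dout_u is the out-degree of u with the new edge already counted.
void restore_invariant_insert(PPRState& st, int u, int v, int dout_u) {
    double a = st.alpha;
    double num = (1.0 - a) * st.P[v] - st.P[u] - a * st.R[u]
               + (u == st.s ? a : 0.0);
    st.R[u] += num / (a * dout_u);   // only u's residual needs to change
}

// Deletion is symmetric: the same quantity is subtracted (Algorithm 1, Delete).
void restore_invariant_delete(PPRState& st, int u, int v, int dout_u) {
    double a = st.alpha;
    double num = (1.0 - a) * st.P[v] - st.P[u] - a * st.R[u]
               + (u == st.s ? a : 0.0);
    st.R[u] -= num / (a * dout_u);
}

After either call, if |R_s(u)| exceeds ε, the local push (Algorithm 2 below, or its parallel variant in Section 3) is launched to drive the residuals back below the threshold.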

[Figure 1 panels showing R_1 and P_1 for vertices v1-v4: a. initial state; b. single update on e1; c. push frontier: v1; d. convergent state.]

Figure 1: An example of the sequential local update, assuming α = 0.5 and ε = 0.1. When a new edge e1 arrives, Algorithm 1 is invoked to restore the invariant for v1, and then Algorithm 2 is launched to proceed with the local push.

Given a vertex v ∈ V with its out-neighbor set N_out(v) and out-degree d_out(v), an invariant [49, 6] is formulated as follows to ensure P_s(v) is R_s(v)-approximate:

    P_s(v) + α·R_s(v) = (1−α) · Σ_{x ∈ N_out(v)} P_s(x)/d_out(v) + α·1_{v=s}    (2)

To handle an update in the dynamic graph model, the scheme takes a two-step procedure: (1) restore the invariant; (2) perform local pushes. When an edge update (u, v, op) arrives, the scheme repairs the invariant specified in Equation 2, which may change the residual values of u and v. In case this activates any vertex, which means its residual value exceeds ε, the local push is invoked. It updates the estimate of this vertex and propagates the residual to its neighbors, which may activate more vertices. The process continues until no residual exceeds ε.

Step (1) and step (2) of the local update scheme are presented in Algorithm 1 (RestoreInvariant) and Algorithm 2 (SequentialLocalPush) respectively. RestoreInvariant describes how to maintain the invariant for all vertices for an update of a directed edge u → v. We present the case for insertion; the case for deletion is similar. Note that the only vertex where the invariant does not hold is u, as d_out(u) becomes larger with the extra out-neighbor v. In this case, it suffices to change only R_s(u) without touching any other vertices. After invoking RestoreInvariant, if the residual value exceeds ε, SequentialLocalPush is launched to preserve the approximation guarantee.

Algorithm 2 SequentialLocalPush
Input: (G, P_s, R_s, ε)
1: while max_u R_s(u) > ε do
2:   SeqPush(u)
3: while min_u R_s(u) < −ε do
4:   SeqPush(u)
5: return (P_s, R_s)
6: procedure SeqPush(u)
7:   P_s(u) += α·R_s(u)
8:   for all v ∈ N_in(u) do
9:     R_s(v) += (1−α)·R_s(u)/d_out(v)
10:  R_s(u) = 0

There are two phases in SequentialLocalPush. The first one handles vertices with positive excessive residuals and the second one deals with negative residuals. Each phase iteratively invokes SeqPush (Lines 6-10) until no residual has an absolute value larger than ε. When this happens, it is guaranteed that |π_s(u) − P_s(u)| ≤ ε for any u ∈ V, and we say the local push converges at this time. Lemma 1 ensures the invariant holds throughout the local push.

Lemma 1 ([5]). For any vector p, the following transformation from P_s and R_s to P_s' and R_s' maintains the invariant shown by Equation 2:

    P_s' = P_s + α·p
    R_s' = R_s − p + (1−α)·W^T·p

Zhang et al. [49] prove the complexity required by the sequential local update to maintain the PPR vector under two common edge arrival models.

Theorem 1 ([49]). Given a random source vertex, the number of operations required by the sequential local update to maintain a valid approximate solution on the dynamic graph is at most O(K + K/(nε) + d̄/ε) for K updates (d̄ is the average degree), under two edge arrival models: random edge permutation of a directed graph and arbitrary edge updates of an undirected graph.

Example 1. We show a running example in Figure 1 to illustrate the sequential local update. As a new edge v1 → v2 arrives in Figure 1(b), RestoreInvariant is invoked to modify R_1(1), which activates v1. SeqPush then performs a local push on v1. It updates R_1(2) and R_1(3) but does not activate v2 and v3. Since there is no active vertex afterwards, convergence is achieved.

Although there are many variants of the local update scheme [49, 6, 5, 38, 31, 11], they share similar behavior (i.e., restore the invariant against an update and then perform local pushes to reduce the residuals). We thus focus on the state-of-the-art approach proposed in [49] and omit the other approaches, as there is no essential difference in parallelizing them compared to parallelizing the approach in [49]. Before moving on to the technical contributions of this paper, we summarize the frequently used notations in Table 1.

Table 1: Frequently used notations in this paper.
Symbol              Description
G^t, V^t, E^t       the graph, the vertex set and the edge set at time step t
N_in(u), N_out(u)   in- and out-neighbor sets of u
d_out(u)            out-degree of u
d̄                   the average degree
K                   total number of edge updates
s                   the source vertex of PPR
α                   teleport probability of PPR
π_s                 true PPR vector w.r.t. s
e_s                 unit vector with 1 at s
ε                   error threshold
P_s                 estimated PPR vector w.r.t. s
R_s                 residual vector of P_s w.r.t. s
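For reference, a compact C++ sketch of the sequential local push is given below. It is a sketch only: the frontier is found by a linear scan for clarity (a worklist would be used in practice), and the graph layout (in_nbrs, dout) is an assumption rather than the paper's implementation.

#include <vector>

void seq_push(int u, double alpha,
              std::vector<double>& P, std::vector<double>& R,
              const std::vector<std::vector<int>>& in_nbrs,
              const std::vector<int>& dout) {
    double r = R[u];
    P[u] += alpha * r;                        // keep the alpha portion as estimate
    for (int v : in_nbrs[u])                  // spread the rest to in-neighbors
        R[v] += (1.0 - alpha) * r / dout[v];
    R[u] = 0.0;
}

void sequential_local_push(double alpha, double eps,
                           std::vector<double>& P, std::vector<double>& R,
                           const std::vector<std::vector<int>>& in_nbrs,
                           const std::vector<int>& dout) {
    auto find_violating = [&](bool positive) {
        for (int u = 0; u < (int)R.size(); ++u)
            if ((positive && R[u] > eps) || (!positive && R[u] < -eps)) return u;
        return -1;
    };
    // Phase 1: positive residuals; Phase 2: negative residuals.
    for (bool positive : {true, false})
        for (int u = find_violating(positive); u != -1; u = find_violating(positive))
            seq_push(u, alpha, P, R, in_nbrs, dout);
}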

3. DYNAMIC PPR IN PARALLEL

As previously mentioned in the Introduction, the sequential approach for dynamic PPR is practically too inefficient to meet the requirements of real-world stream processing. We thus propose a novel parallel approach to boost the performance of dynamic PPR computation.

3.1 Parallel Local Update

A naive parallel approach is to synchronize on each single update to repair the invariant and then push all vertices in the frontier in parallel (the frontier consists of the vertices whose residuals exceed ε). However, a single edge update is not expected to result in a substantial change in the representative graph, and hence the residuals will not vary significantly either. Thus, the single update scenario often yields a frontier too small for any parallel approach to be effective. Moreover, as updates often come in batches, the naive approach causes significant synchronization overheads against batch updates.

In this paper, we propose a simple-to-implement yet effective approach to support efficient batch processing in parallel. Given a batch of k updates, we first restore the invariant by calling RestoreInvariant (Algorithm 1) k times to adapt to the graph changes of the batch and then parallelize the local push. The critical component of our approach is the parallel local push presented in Algorithm 3. Similar to SequentialLocalPush (Algorithm 2), there are two phases to handle positive and negative residuals respectively. In each phase, it first initializes a frontier queue FQ and then iteratively invokes the push procedure (ParallelPush) until no vertex remains in the frontier. ParallelPush consists of two parallel sessions. The first session takes out the residual of each frontier vertex and increments its estimate by an α portion of the residual. The second session updates the neighbors of each frontier vertex with the remaining (1−α) portion of the residual and generates a new frontier queue FQ' if necessary. To simplify the presentation, we refer to the first parallel session as self-update and the second session as neighbor-propagation.

Algorithm 3 ParallelLocalPush
Input: (P_s, R_s, G, ε)
1: FQ = {u ∈ V | pushCond(R_s(u), POS)}
2: while FQ ≠ ∅ do
3:   FQ = ParallelPush(P_s, R_s, FQ, POS)
4: FQ = {u ∈ V | pushCond(R_s(u), NEG)}
5: while FQ ≠ ∅ do
6:   FQ = ParallelPush(P_s, R_s, FQ, NEG)
7: return (P_s, R_s)
8: procedure pushCond(r, phase)
9:   if phase = POS then return r > ε
10:  else return r < −ε
11: procedure ParallelPush(P_s, R_s, FQ, phase)
12:  S = ∅
13:  parallel for u ∈ FQ
14:    S = S ∪ (u, R_s(u))
15:    P_s(u) += α·R_s(u)
16:    R_s(u) = 0
17:  synchronize
18:  FQ' = ∅
19:  parallel for (u, w) ∈ S
20:    parallel for v ∈ N_in(u)
21:      atomicAdd(R_s(v), (1−α)·w/d_out(v))
22:      if pushCond(R_s(v), phase) then
23:        UniqueEnqueue(FQ', v)
24:  synchronize
25:  return FQ'

We note that at Line 21, atomic operations are utilized to ensure the residuals are correctly transferred from the frontier to their neighbors. An alternative that avoids atomic operations is to run a parallel sort on the key-value pairs consisting of the neighbor vertex ids and the corresponding residuals, then run a parallel reduce to aggregate the residuals with the same keys, and finally transfer the aggregated residuals to the designated vertices. However, this sort-and-aggregate method incurs significant overheads for large frontiers. In fact, most graph processing systems [42, 47, 16, 36, 23] and similar applications [13, 33, 44, 43, 35] adopt atomic operations over the sort-and-aggregate method to update the neighbors. Thus, we use the atomic method in our implementation. (We do not show experimental results for the sort-and-aggregate method, as its performance is significantly worse than the atomic method according to our investigations.)

Example 2. We continue the running example in Figure 2 for the parallel local update. Two new insertions v1 → v2 and v4 → v1 arrive in Figure 2(b). RestoreInvariant is invoked for both insertions and changes the residuals of v1 and v4. Then, v1 and v4 become the frontier as their residuals exceed ε, and ParallelLocalPush is launched correspondingly. It achieves convergence in one iteration.

3.2 Theoretical Analysis

Although our parallel approach is simple to implement, its theoretical analysis is intricate. In this section, we verify the soundness and efficiency of the proposed approach. First, we examine whether the parallel approach produces a valid approximation result. Second, we compare the complexity of the parallel approach against the sequential approach in terms of the number of operations performed. In the following, we prove the correctness of the parallel push in Theorem 2.

Theorem 2. For a given ε, the parallel push (Algorithm 3) produces a valid approximate result with the same error threshold as the sequential push (Algorithm 2).

Proof. We borrow Lemma 1 to aid our proof. Lemma 1 suggests that, as long as the vector p used for the self-update and the neighbor-propagation is the same, the invariant always holds. Let Φ be the indicator vector of the frontier, i.e., Φ(u) = 1 if u ∈ FQ and zero otherwise, and let Φ ∘ R_s denote the element-wise product. We rewrite the parallel push as a transformation in vector form:

    P_s' = P_s + α·(Φ ∘ R_s)
    R_s' = R_s − (Φ ∘ R_s) + (1−α)·W^T·(Φ ∘ R_s)

Let p = Φ ∘ R_s; the above transformation maintains the invariant by Lemma 1. The rest of the proof naturally follows: when Algorithm 3 terminates, |R_s(u)| < ε for any u ∈ V, and the invariant still holds.
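The following multi-core C++ sketch illustrates one ParallelPush round as just described. It is only a sketch of the idea: OpenMP and a per-vertex flag array stand in for the paper's (unspecified) UniqueEnqueue, and the graph layout is assumed.

#include <utility>
#include <vector>

std::vector<int> parallel_push(double alpha, double eps, bool positive,
                               std::vector<double>& P, std::vector<double>& R,
                               const std::vector<std::vector<int>>& in_nbrs,
                               const std::vector<int>& dout,
                               const std::vector<int>& FQ) {
    auto push_cond = [&](double r) { return positive ? r > eps : r < -eps; };

    // Session 1 (self-update): take out the residuals, bump the estimates.
    std::vector<std::pair<int, double>> S(FQ.size());
    #pragma omp parallel for
    for (long i = 0; i < (long)FQ.size(); ++i) {
        int u = FQ[i];
        S[i] = {u, R[u]};
        P[u] += alpha * R[u];
        R[u] = 0.0;
    }

    // Session 2 (neighbor-propagation): atomically add the (1-alpha) portion
    // to in-neighbors and collect the next frontier without duplicates.
    std::vector<char> enq(R.size(), 0);   // stands in for UniqueEnqueue
    std::vector<int> next;
    #pragma omp parallel
    {
        std::vector<int> local;
        #pragma omp for nowait
        for (long i = 0; i < (long)S.size(); ++i) {
            int u = S[i].first;
            double w = S[i].second;
            for (int v : in_nbrs[u]) {
                double after;
                #pragma omp atomic capture
                { R[v] += (1.0 - alpha) * w / dout[v]; after = R[v]; }
                char claimed = 0;
                if (push_cond(after) &&
                    __atomic_compare_exchange_n(&enq[v], &claimed, (char)1, false,
                                                __ATOMIC_RELAXED, __ATOMIC_RELAXED))
                    local.push_back(v);
            }
        }
        #pragma omp critical
        next.insert(next.end(), local.begin(), local.end());
    }
    return next;
}

A batch of k updates would then be handled by calling the restore-invariant routine k times, followed by repeated parallel_push rounds per phase until the returned frontier is empty.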

[Figure 2 panels showing R_1 and P_1 for vertices v1-v4: a. initial state; b. batch update on e1, e2; c. push frontier: v1, v4; d. convergent state.]

Figure 2: An example of the parallel local update, assuming α = 0.5 and ε = 0.1. When the new edges e1 and e2 arrive, Algorithm 1 is invoked to restore the invariant for both v1 and v4, and then Algorithm 3 is launched to proceed with the local push.

We are left to examine the complexity of the parallel local update. Previous work on the sequential local update [49] bounds the complexity of dynamic PPR maintenance by tracking the residual change caused by each edge update, under the two common edge arrival models mentioned in Theorem 1. Following this method, we analyze the complexity of the parallel local update. The major difference is that we track the residual change caused by a batch update instead of a single update. Our subsequent analysis is structured as follows: Lemma 2 computes the number of operations required by the parallel local update as a function of the residual change. Lemma 3 then deduces an upper bound on the residual change for an arbitrary batch update. By combining Lemmas 2 and 3, the overall complexity of the parallel local update is presented in Theorem 3.

Lemma 2. Given t batches which collectively contain K updates and a random source vertex, the total number of operations required by the parallel local update to process the t batches is bounded by Δ_d under random edge permutation of a directed graph, where

    Δ_d = d̄/(αε) + K·(α+4)/(nα²) + (1/n)·Σ_{i=1}^{t} Σ_{u∈V} [ d^i_out(u) · Σ_{s∈V} Δ^i_s(u) ] / (αε)

and by Δ_u under arbitrary edge updates of an undirected graph, where

    Δ_u = d̄/(αε) + 2K/α + (1/n)·Σ_{i=1}^{t} Σ_{u∈V} [ d^i_out(u) · Σ_{s∈V} Δ^i_s(u) ] / (αε)

Here Δ^i_s(u) denotes the residual change of a vertex u caused by RestoreInvariant at time step i.

Since the proof process is similar to that of Theorem 1, we refer interested readers to our extended version [1]. Note that both expressions above depend on Σ_{s∈V} Δ^i_s(u), for which we derive an upper bound in the following lemma.

Lemma 3. At time step i, suppose the batch contains k directed edge updates that start from u, i.e., (u → v_j, op_j) for 1 ≤ j ≤ k. Then

    Σ_{s∈V} Δ^i_s(u) ≤ (2nε + 2)/(α·d^i_out(u)) · k

Proof. To facilitate our proof, we define op = 1 for an insertion and op = −1 for a deletion. We further denote the residual and the out-degree of u after applying the j-th update as r_j(u) and d_j(u). So r_0(u) and r_k(u) are the initial and new values of R_s(u) respectively after invoking RestoreInvariant on these k updates. To bound Σ_{s∈V} Δ^i_s(u), we first compute Δ^i_s(u). Define A_j, B_j and γ_j as follows:

    A_j = op_j/(α·d_j(u)) · (1−α)·P_s(v_j)
    B_j = op_j/(α·d_j(u)) · [α·1_{u=s} − P_s(u)]
    γ_j = 1 − op_j/d_j(u)

By Algorithm 1, we have the recursive formula for r_j(u):

    r_j(u) = γ_j·r_{j−1}(u) + A_j + B_j

Starting from r_k(u) and recursively substituting r_j(u) with r_{j−1}(u), we derive the relationship between r_k(u) and r_0(u):

    r_k(u) = (Π_{j=1}^{k} γ_j)·r_0(u) + Σ_{j=1}^{k} (Π_{x=j+1}^{k} γ_x)·(A_j + B_j)    (3)

Note that γ_j is the ratio of u's out-degree before and after the j-th edge update, i.e., γ_j = d_{j−1}(u)/d_j(u), so we have the following proposition:

    Π_{j=a+1}^{b} γ_j = Π_{j=a+1}^{b} d_{j−1}(u)/d_j(u) = d_a(u)/d_b(u)

Replacing the products of γ_j in Equation 3 with the above proposition cancels d_j(u) in A_j and B_j. Using the fact that Σ_{j=1}^{k} op_j = d_k(u) − d_0(u), we can further reorganize Equation 3 as:

    Δ^i_s(u) = r_k(u) − r_0(u) = Σ_{j=1}^{k} op_j·U(u, v_j) / (α·d_k(u))

where U(u, v_j) = (1−α)·P_s(v_j) − P_s(u) − α·r_0(u) + α·1_{u=s}. To bound Δ^i_s(u), we take the absolute value of each term in op_j·U(u, v_j), using the facts that P_s(v_j) ≤ π_s(v_j) + ε and |r_0(u)| ≤ ε:

    (1−α)·P_s(v_j) + P_s(u) + α·|r_0(u)| + α·1_{u=s}
    ≤ (1−α)·(π_s(v_j) + ε) + (π_s(u) + ε) + αε + α·1_{u=s}

The numerator of Δ^i_s(u) is thus at most k times the above expression. Notice that Σ_{s∈V} π_s(v_j) = 1 (and similarly for u), so when we take the summation of Δ^i_s(u) over s ∈ V, the π_s(v_j) and π_s(u) terms are canceled. In this way, we complete the proof of the lemma.

Theorem 3. The number of operations required by the parallel local update to maintain a valid approximate solution is asymptotically the same as that of the sequential local update, under the two edge arrival models (see Theorem 1).

[Figure 3 panels showing R_1 and P_1 for vertices v1-v4. Top row (parallel push): a(1) the initial frontier of both approaches: v1; a(2) push frontier: v2, v3; a(3) push frontier: v3, v4; a(4) convergent state. Bottom row (sequential push): b(1) initial frontier: v1; b(2) push frontier: v2; b(3) push frontier: v3; b(4) push frontier: v4; b(5) convergent state.]

Figure 3: A comparison of the parallel push (Algorithm 3), panels a(1)-a(4), and the sequential push (Algorithm 2), panels b(1)-b(5), assuming α = 0.5 and ε = 0.1. Starting from the same initial state, the parallel push pushes {v1, v2, v3, v3, v4}, while the sequential push pushes {v1, v2, v3, v4}. The parallel push performs one more push operation on v3.

Proof (of Theorem 3). Suppose at time step i there are k^i(u) directed edge updates that start from u. By Lemma 3, we have the following inequality for the third term in Δ_d and Δ_u (Lemma 2):

    (1/n)·Σ_{i=1}^{t} Σ_{u∈V} d^i_out(u)·Σ_{s∈V} Δ^i_s(u) / (αε)
    ≤ (1/n)·Σ_{i=1}^{t} Σ_{u∈V} d^i_out(u)/(αε) · (2nε+2)/(α·d^i_out(u)) · k^i(u)
    = (1/n)·Σ_{i=1}^{t} Σ_{u∈V} k^i(u) · (2nε+2)/(α²ε)

In the following, we consider the different edge arrival models. In a directed graph with random edge permutation, assuming the total number of edge updates over t time steps is K, we have Σ_{i=1}^{t} Σ_{u∈V} k^i(u) = K. Thus, we obtain the upper bound of Δ_d as:

    Δ_d ≤ d̄/(αε) + K·(α+4)/(nα²) + K·(2/α² + 2/(α²nε))    (4)

In an undirected graph with arbitrary edge updates, we need to apply RestoreInvariant twice, as an undirected edge update is treated as two directed updates, and thus the total number of directed edge updates is 2K, which means Σ_{i=1}^{t} Σ_{u∈V} k^i(u) = 2K. Thus, we obtain the upper bound of Δ_u:

    Δ_u ≤ d̄/(αε) + 2K/α + K·(4/α² + 4/(α²nε))    (5)

The asymptotic bound of both Δ_d and Δ_u is O(K + K/(nε) + d̄/ε), which is the same as that of the sequential local update specified in Theorem 1. Thus we conclude the proof.

4. OPTIMIZATION

As repairing the invariant takes only constant time per update, the parallel push dominates the running time of the parallel local update approach. In this section, we describe our novel optimization techniques, i.e., eager propagation and local duplicate detection, to improve the performance of the parallel push.

4.1 Eager Propagation

Parallel execution accelerates the local push, but the improvement does not come for free. Compared with the sequential push, we find the parallel push can take more operations in many cases. This is caused by a phenomenon we call parallel loss. To solve this problem, we propose a technique called eager propagation to close the gap between the parallel and the sequential push.

4.1.1 Parallel Loss

Consider the example shown in Figure 3. To achieve convergence (i.e., all vertices have residuals smaller than ε), both the parallel and the sequential push perform push operations on the same set of vertices {v1, v2, v3, v4}, but the parallel push costs one more operation because v3 is pushed twice. When v3 is pushed for the first time in a(2) of Figure 3, the frontier vertices are v2 and v3. Since v3 is an in-neighbor of v2, v2 propagates some residual to v3. This amount is large enough to activate v3 in the next iteration; thus in a(3), v3 needs to be pushed once more. The sequential push, however, does not need such effort. In b(2), it pushes v2, and v2 propagates its residual to v3. In b(3), v3 has a large amount of residual, 0.375, consisting of the amount it held in b(2), 0.25, and the propagation from v2, 0.125. The sequential push resolves this residual with only one push operation. If both v2 and v3 are pushed in parallel in b(2), we lose the opportunity to process the larger residual, and in the end it may take more operations to converge. This phenomenon, caused by concurrent push operations on multiple frontier vertices in one iteration, is called parallel loss.

Parallel loss cannot be avoided as long as we parallelize the local push. Although we prove that the worst-case complexity of the parallel approach matches that of the sequential approach (Section 3), the former can take more operations at runtime. We introduce Lemma 4 to support this argument. It shows that when ε is sufficiently small, the overall residual of the parallel push is always at least as large as that of the sequential push because of parallel loss. With a larger overall residual, the parallel push takes more operations than the sequential push under the same convergence condition.
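The extra operation in Figure 3 can be checked with a quick calculation (α = 0.5, ε = 0.1, numbers taken from the example above):

    Sequential: v2 is pushed before v3, so v3 accumulates R_1(3) = 0.25 + 0.125 = 0.375, which is cleared by a single push.
    Parallel: v2 and v3 are pushed concurrently, so v3 is pushed with the stale value 0.25; the 0.125 forwarded by v2 arrives afterwards and, since 0.125 > ε = 0.1, v3 must be pushed again in the next iteration.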

Lemma 4. Let R^p_x and R^q_x be the residual vectors of the parallel and the sequential push at iteration x for any fixed source vertex. Given an iteration x and a corresponding frontier, the sequential push handles all vertices in its frontier one by one in serial, whereas the parallel push handles the frontier vertices all at once. Given the same initial residual distribution, i.e., R^p_0 = R^q_0, if ε → 0, the sum of absolute residual values of the parallel push is always at least that of the sequential push, i.e., ‖R^p_x‖_1 ≥ ‖R^q_x‖_1 for any iteration x.

Proof. First of all, when ε → 0, the frontiers of every iteration are the same for the parallel and the sequential push since the initial residual distribution is the same. This can be proved by induction: the base case is trivial; supposing the frontiers of both pushes are the same at iteration x, any in-neighbor v of a frontier vertex u becomes frontier at iteration x+1, because the residual propagated from u always activates v, i.e., (1−α)·R_s(x, u)/d_out(v) ≥ ε as ε → 0. Thus the frontiers of both pushes at iteration x+1 are the same, and we denote FQ^x as the common frontier at iteration x.

Next, we show that if ‖R^p_x‖_1 ≥ ‖R^q_x‖_1, then ‖R^p_{x+1}‖_1 ≥ ‖R^q_{x+1}‖_1. Suppose there are k frontier vertices v_i ∈ FQ^x, 1 ≤ i ≤ k, and the sequential push processes them in the order v_1, v_2, .... Denote by b(v_i) the residual value pushed for v_i at iteration x by the sequential push; then

    b(v_i) = R^q_x(v_i) + Σ_{j=1}^{i−1} (1−α)·b(v_j)·1_{v_i ∈ N_in(v_j)} / d_out(v_i)

Since the residuals of the frontier vertices are either all positive or all negative, |b(v_i)| ≥ |R^q_x(v_i)|. By Equation 2, the change of the residual sum of the sequential push is

    ‖R^q_{x+1}‖_1 − ‖R^q_x‖_1 = −α·Σ_{i=1}^{k} b(v_i) + Σ_{i=1}^{k} b(v_i)·c(v_i), where c(v_i) = (1−α)·( Σ_{x ∈ N_in(v_i)} 1/d_out(x) − 1 )

The parallel push satisfies the same relation with b(v_i) replaced by R^p_x(v_i), up to O(ε) terms. By the friendship paradox [15], the mean number of friends of an individual is bounded by that of its friends, which yields c(v_i) ≤ 0. Combining |b(v_i)| ≥ |R^q_x(v_i)|, c(v_i) ≤ 0 and the induction hypothesis ‖R^p_x‖_1 ≥ ‖R^q_x‖_1, a term-by-term comparison of the two expressions (with ε → 0 so that the O(ε) terms vanish) gives ‖R^q_{x+1}‖_1 ≤ ‖R^p_{x+1}‖_1. The lemma then follows by induction, as the base case always holds.

4.1.2 Mitigate Loss

The major issue behind parallel loss is that the parallel push reads stale residual values and therefore pushes a smaller amount of residual. More specifically, while a vertex v is pushing its residual to its incoming neighbors, an outgoing neighbor of v may be pushing some residual to v, and v is not aware of the additional residual pushed to itself. Motivated by this observation, we propose eager propagation, which reads the most recent residuals of frontier vertices to mitigate parallel loss.

In the parallel push, Algorithm 3 reads the residuals of frontier vertices before the neighbor-propagation, without being aware of the potential increase of the residuals. Instead, eager propagation is eager for more residual to push and proactively reads the recent residuals of the frontier vertices, which may have been increased by their neighbors. To implement eager propagation, we redesign the parallel push: (1) we alter the order of the two parallel sessions, and (2) we ensure the residuals of frontier vertices used in both sessions are consistent.

Algorithm 4 shows the optimized parallel push that implements eager propagation. Compared with Algorithm 3, there are several major differences due to eager propagation. First, the local push executes the two parallel sessions in reverse order, i.e., neighbor-propagation first and self-update afterwards. Second, the accesses to the residual values of frontier vertices are different. Note that at Line 10 we read the up-to-date residual of u as r_u, which is the key step of eager propagation. Since this read happens inside the neighbor-propagation loop, R_s(u) can be increased by concurrent propagation from u's neighbors, and the value r_u is potentially larger than the value of R_s(u) before the neighbor-propagation session starts. To ensure that a consistent residual value is used in both sessions, r_u is recorded in E at Line 11 and passed down to the later session. Then R_s(u) is decreased by the same r_u at Line 21, instead of being zeroed as in Algorithm 3.

There are also other differences in Algorithm 4 compared with Algorithm 3. First, there is an additional round of frontier generation in the self-update at Lines 22-23. The absolute value of R_s(u) can still be larger than ε after the subtraction at Line 21, making u a potential frontier vertex for the next iteration, since R_s(u) can keep being increased by u's neighbors after u pushes out its residual. Compared with the other frontier generation process in the neighbor-propagation, this one takes care of the vertices that are frontier in both the current and the next iteration, i.e., u ∈ FQ ∧ u ∈ FQ', while the other handles the vertices that are frontier in the next iteration but not in the current one, i.e., u ∈ FQ' ∧ u ∉ FQ. Second, the frontier generation in the neighbor-propagation of Algorithm 4 is also different from that in Algorithm 3. Both of these differences are related to our novel optimization technique for frontier generation, which we discuss in the next subsection.
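A minimal multi-core C++ sketch of the two reordered sessions is shown below; it captures the fresh read of r_u and the final subtraction, while the frontier generation of Algorithm 4 is omitted here (it is covered in the next subsection). The names and the OpenMP realization are our own assumptions.

#include <utility>
#include <vector>

void eager_push_round(double alpha, std::vector<double>& P, std::vector<double>& R,
                      const std::vector<std::vector<int>>& in_nbrs,
                      const std::vector<int>& dout,
                      const std::vector<int>& FQ) {
    std::vector<std::pair<int, double>> E(FQ.size());

    // Session 1: neighbor-propagation FIRST, using the freshest residual r_u.
    #pragma omp parallel for
    for (long i = 0; i < (long)FQ.size(); ++i) {
        int u = FQ[i];
        double ru;
        #pragma omp atomic read
        ru = R[u];               // may already include residual pushed to u by
                                 // concurrently processed frontier vertices
        E[i] = {u, ru};          // remember exactly what will be propagated
        for (int v : in_nbrs[u]) {
            #pragma omp atomic
            R[v] += (1.0 - alpha) * ru / dout[v];
        }
    }                            // implicit barrier

    // Session 2: self-update with the SAME r_u; subtract rather than zero,
    // since R[u] may have grown further after r_u was recorded.
    #pragma omp parallel for
    for (long i = 0; i < (long)E.size(); ++i) {
        int u = E[i].first;
        double ru = E[i].second;
        P[u] += alpha * ru;
        #pragma omp atomic
        R[u] -= ru;
    }
}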

4.2 Frontier Generation

Algorithm 4 OptParallelPush
1: procedure PushCondLocal(r_pre, r_cur, phase)
2:   if !pushCond(r_pre, phase) then
3:     if pushCond(r_cur, phase) then
4:       return true
5:   return false
6: procedure OptParallelPush(P_s, R_s, FQ, phase)
7:   FQ' = ∅
8:   E = ∅
9:   parallel for u ∈ FQ
10:    r_u = R_s(u)
11:    E = E ∪ (u, r_u)
12:    parallel for v ∈ N_in(u)
13:      inc = (1−α)·r_u/d_out(v)
14:      r_pre = atomicAdd(R_s(v), inc)
15:      r_cur = r_pre + inc
16:      if PushCondLocal(r_pre, r_cur, phase) then
17:        enqueue(FQ', v)
18:  synchronize
19:  parallel for (u, r_u) ∈ E
20:    P_s(u) += α·r_u
21:    R_s(u) −= r_u
22:    if pushCond(R_s(u), phase) then
23:      enqueue(FQ', u)
24:  synchronize
25:  return FQ'

In general, there are two methods for frontier generation [21, 34, 35, 25, 26, 13], and neither gives satisfactory performance for the parallel push. The first is a topology-driven method, which inspects all vertices at the start of each iteration to construct the frontier. This method is not work-efficient, especially when the frontier is small. The second is a data-driven method that takes in a frontier queue and generates the frontier queue for the next iteration. However, to avoid duplicate items during frontier generation, such a method requires atomic operations, which incur expensive overheads. The unoptimized parallel push, i.e., Algorithm 3, adopts the second method, and the duplicate detection in UniqueEnqueue of Algorithm 3 can cause significant synchronization overheads.

In fact, this overhead can be avoided by exploiting an application property: monotonicity. Note that in the workflow of the parallel push, the positive residuals are pushed out in the first phase and the negative residuals in the second phase. Thus, during neighbor-propagation, the residuals change monotonically: increasing in the first phase and decreasing in the second phase. Take the first phase as an example. For a vertex v, its residual absorbs incoming propagation from v's out-neighbors. There is one and only one out-neighbor u that turns v from the state where R_s(v) < ε to the state where R_s(v) > ε. Therefore, we can designate u as the one responsible for putting v into the new frontier queue, and the others can avoid the attempt to enqueue v. But this raises a question: how can u know the value of R_s(v) during frontier generation?

Note that on most parallel architectures, such as GPUs and multi-core CPUs, we can implement an atomic operation that performs an addition to a 32/64-bit address atomically and returns the before-value of that address. On some architectures such an atomic addition is directly supported by hardware intrinsics; on the remaining architectures, compare-and-swap intrinsics are always available [20, 12], and we can use them to implement the same functionality. atomicAdd at Line 14 is the implementation of such an operation. The before-value r_pre is a by-product of updating R_s(v), and with the before-value, the after-value can be computed locally. To perform duplicate detection on v, u enqueues v into the frontier if the before-value of R_s(v) does not meet the push condition but the after-value of R_s(v) does. In this way, threads can identify duplicates locally without synchronizing globally on shared data structures. We call this technique local duplicate detection.

We implement local duplicate detection at Lines 14-17 of Algorithm 4. At Line 14, the residual of v is atomically updated and the before-value of R_s(v) is returned as r_pre. At Line 15, the after-value is calculated as r_cur. If a vertex v passes the test at Line 16, it means that v passes the push condition and putting v into FQ' will not cause duplicates. At Line 17, we enqueue the qualified vertices with enqueue. The difference between enqueue here and UniqueEnqueue is that enqueue removes the duplicate detection process.

Note that Algorithm 4 has one more frontier generation process at Lines 22-23. This is because the first frontier generation process cannot handle the case when a vertex is in the frontier for both the current and the next iteration. Given a frontier vertex u, i.e., |R_s(u)| > ε, a neighbor of u updates R_s(u) and obtains the before-value of R_s(u) as r_pre (|r_pre| > ε). Under such circumstances, the after-value r_cur = r_pre + inc also satisfies |r_cur| > ε due to monotonicity. Thus, u is never enqueued, even if u meets the push condition after the self-update. To remedy this, we add the second frontier generation to handle the frontier vertices. As the parallel-for loop at Lines 19-23 of Algorithm 4 does not have shared vertices, the frontier can be generated efficiently there without duplicate detection.
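To make the before-value trick concrete, the following C++/OpenMP fragment sketches Lines 14-17 of Algorithm 4; push_cond and the thread-local queue are illustrative names, not the paper's implementation.

#include <vector>

// Propagate `inc` to vertex v and enqueue v only if this particular atomic
// addition moves R[v] across the eps threshold (local duplicate detection).
inline void propagate_to(int v, double inc, double eps, bool positive,
                         std::vector<double>& R, std::vector<int>& next_local) {
    double r_pre;
    #pragma omp atomic capture
    { r_pre = R[v]; R[v] += inc; }   // atomic add that returns the before-value
    double r_cur = r_pre + inc;      // after-value, computed locally

    auto push_cond = [&](double r) { return positive ? r > eps : r < -eps; };
    if (!push_cond(r_pre) && push_cond(r_cur))   // only the crossing thread sees this
        next_local.push_back(v);                 // so no shared de-duplication is needed
}

Because residuals only move in one direction within a phase, the threshold is crossed by exactly one addition, so the thread performing it can enqueue v without any global de-duplication structure.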

5. EXPERIMENTS

In this section, we first introduce the experimental setup and then report the results of the experimental study.

5.1 Experimental Setup

Implementation. We implement the proposed parallel local update approach on both the CPU and the GPU architectures. For comparison, we also implement various baselines. The details are as follows.
• CPU Sequential Push with Single Update (CPU-Base): We implement the state-of-the-art sequential algorithm for dynamic PPR computation [49]. For each edge update, it repairs the invariant and then invokes the sequential push.
• CPU Sequential Push with Batch Update (CPU-Seq): We add the batch update optimization to CPU-Base. To handle a batch of size k, it repairs the invariant for the k updates and then launches the sequential push.
• CPU Parallel Push with Batch Update (CPU-MT): We implement our parallel local update using the multi-thread framework CilkPlus [24] to manage parallel contexts and resources on CPUs.
• GPU Parallel Push with Batch Update (GPU): We implement our parallel local update on GPUs using CUDA [37].
• Incremental Monte-Carlo on CPU (Monte-Carlo): We implement the state-of-the-art incremental Monte-Carlo approach [10]. To improve its performance, we parallelize the maintenance of random walks with the multi-thread framework CilkPlus [24].
• Vertex-centric implementation in Ligra (Ligra): We implement our approach, i.e., parallel push with batch update, on top of Ligra [42], the state-of-the-art in-memory graph processing system on multi-core architectures.

Datasets. Our experiments are conducted on the following real-world graph datasets (from https://snap.stanford.edu/data/).
• Pokec has 1.6M vertices and 30.6M edges, generated by the most popular on-line social network in Slovakia.
• LiveJournal has 4.8M vertices and 68.9M edges, generated by an online community that allows users to declare friendships with others.
• Youtube has 1.1M vertices and 2.9M edges, extracted from the users' friendships of Youtube.
• Orkut has 3.0M vertices and 117.1M edges, extracted from users' friendships of an online social network.
• Twitter has 41.6M vertices and 1.4B edges, generated by the Twitter network of followed-by relationships from a sample in 2010.

Graph Stream. The datasets do not possess a timestamp for each edge. Following previous works [10, 38, 49], we simulate the random edge arrival model by randomly setting the timestamps of all edges. Then, the graph stream of each dataset receives the edges with increasing timestamps. To evaluate both insertions and deletions on the arrival of graph streams, the sliding window model is adopted in the experiments. For initialization, the first 10% of edges in the stream are used to construct the sliding window before the updates start. As the window slides by a batch of size k, k edges are inserted and the same number of edges are deleted according to their timestamps. The experimental results are the average of 10 slides for Twitter and the average of 100 slides for the other datasets.

Parameter setting. We summarize the parameters used in the experiments together with their default values in Table 2. For a source vertex, the number of random walk samples maintained by Monte-Carlo is w ≥ 3·log(2/p_f)/(ε_r²·δ) [46], where δ, p_f and ε_r are the result threshold, failure probability and relative error. Normally, δ, p_f and ε_r are set to 1/|V|, 1/|V| and 0.5 respectively [46, 29, 31]. This makes w huge for large graphs like Twitter and leads to poor performance. In our experiments, we favor Monte-Carlo and set w to a smaller value, i.e., 6|V|, to improve its performance by trading accuracy (with δ = 1/|V|, p_f = 2/e, ε_r = 0.71).

Table 2: All parameters used in the experiments. The default values are highlighted.
Parameter                    Values
α                            0.15
ε                            10^−5, 10^−6, 10^−7, 10^−8, 10^−9, 10^−10
source vertex s              randomly chosen vertices with Top-10, Top-1K and Top-1M out-degrees
batch size                   1%, 0.1% and 0.01% of the sliding window size
No. of random walk samples   6|V| (|V| is the number of vertices)
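As a back-of-the-envelope check of this choice (our own arithmetic, using natural logarithms and ε_r² ≈ 0.5):

    w ≥ 3·log(2/p_f)/(ε_r²·δ) = 3·log(2/(2/e))/(0.71²·(1/|V|)) = (3·1/0.504)·|V| ≈ 6|V|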
compared with the unoptimized version for GPUs and multi- Parameter setting. We summarize the parameters used cores. Each of the techniques provides significant perfor- in the experiments together with their highlighted default mance improvement over the unoptimized version. For eager values in Table 2. For a source vertex, No. of random walk propagation, the speedup comes from the mitigation of par- 3log(2/pf ) allel loss. The number of required operations is reduced. For samples maintained by Monte-Carlo is w 2 [46], ✏r local duplicate detection, it achieves speedups by reducing where , pf , ✏r are the result threshold, failure probability synchronization e↵orts at runtime. Notice that both tech- and relative error. Normally, , pf and ✏r are set to 1/ V , niques show more significant improvements in larger data 1/ V and 0.5 respectively [46, 29, 31]. This makes w huge| | | | sets. This is because large data sets lead to large frontier for large graphs like Twitter and leads to poor performance. sizes, which introduce more severe parallel loss and incur In our experiment, we favor Monte-Carlo and set w to higher frontier generation costs. asmallervalue,i.e.6V to improve the performance by | | trading accuracies (with =1/ V , pf =2/e, ✏r =0.71). | | 5.3 Stream Throughtput Experimental Environment. We conduct all the experi- In this subsection, we compare the stream throughput be- ments on two machines. tween the parallel local update and various baselines. We For CPU-based implementations (i.e. CPU-Base, CPU-Seq, report the number of edges consumed per second after run- • CPU-MT, Monte-Carlo and Ligra), the experiments are ning for 5 minutes. Meanwhile, we vary the batch size to performed on a machine running Ubuntu 16.04 with four evaluate its e↵ect over the stream throughput. 10-core Intel Xeon E7-4820 processors clocked at 1.9 GHz In Figure 5, we can see the parallel local update achieves and 128GB main memory. Each core owns a private 32 significant speedups over the sequential local update. Com- KB L1 cache and a private 256 KB L2 cache. Every 10 pared with CPU-Base, which is the state-of-the-art approach [49], GPU and CPU-MT achieve the speedups of more than 3https://snap.stanford.edu/data/ 100-7000 and 30-4000 times respectively with the batch size

Figure 4: Effect of optimizations for the parallel local update.

Figure 5: The comparison of streaming throughput.

Compared with CPU-Seq, GPU and CPU-MT achieve speedups of more than 12-74 and 6-20 times respectively. The performance improvement of our approach comes from two dimensions: one is the parallel push and the other is the batch update. First, GPU and CPU-MT achieve speedups over CPU-Seq, which shows that the parallel push effectively accelerates the local push. Second, as the batch size increases, the throughputs of GPU and CPU-MT grow, because the batch update raises the degree of parallelism. With a large batch size, there are more workloads in each iteration, so parallel threads can process more frontier vertices at once. As CPU-Base is significantly slower than the other approaches, we omit the results for CPU-Base in the rest of the experiments.

Figure 5 also shows that our parallel local update has superior performance over the incremental Monte-Carlo approach. Compared with Monte-Carlo, GPU and CPU-MT achieve speedups of 18-254 and 9-135 times respectively with a batch size of 10^5. We attribute the unsatisfactory performance of Monte-Carlo to the huge overheads of incrementally maintaining the random walk samples. The overheads come from two sources. First, the number of random walk samples w is huge, especially for large graphs like Twitter (although we set w to a value smaller than in existing works [46, 29, 31]). Second, the incremental maintenance of random walk samples needs to track auxiliary data structures. The Monte-Carlo approach on static graphs only needs to maintain the number of visits made by random walks to each vertex. However, the incremental approach must also keep track of the traces of the random walk samples, i.e., the vertices visited, and an inverted index that records the set of random walks passing through each vertex. When new edges arrive, the traces of the affected samples and the inverted index have to be updated by regenerating random walks on the new graph. This updating process requires frequent atomic accesses to shared memory. Thus, these auxiliary data structures are large and their maintenance incurs a huge cost.

As shown in Figure 5, our specialized implementations (GPU and CPU-MT) are better than the implementation on the general graph processing system with vertex-centric abstraction (Ligra). This is because such systems lack application knowledge to perform specific optimizations such as eager propagation and local duplicate detection.

5.4 Effects of Parameters

In this subsection, we study how the parameter setups affect the performance of the parallel local update.

Varying ε. The choice of ε determines the accuracy of the solution, and a smaller ε requires more computation. We vary ε in the range from 10^−4 to 10^−10 and show the latency of window slides in Figure 6. As ε decreases, the latency of all approaches increases dramatically because of the additional computation. The speedups of the parallel approach over the sequential approach become larger for smaller ε. This is because a smaller ε creates a larger frontier for the parallel approach to work on. With the parallel speedups, our approach can achieve fairly low latency with high accuracy.

Varying source vertex selection. The degree of the source vertex can affect the performance of the local push. Starting from a source vertex s with a low degree, the vertices that have high PageRank values w.r.t. s are mostly in the local cluster of s. Hence, if the edge updates do not affect the structure of the local cluster of s, the local push only needs a few operations for the incremental computation. On the contrary, if s has a high degree, a small number of edge updates can severely change the current estimate and the residual vector, since s has influence over a wide range of vertices. In Figure 7, we choose source vertices with various degrees (top-10, top-1K and top-1M out-degree) and validate that as the degrees increase, the latencies of all approaches become larger. We also find that the improvement of our parallel approach is not significant when the source vertex has a small degree. This phenomenon can be understood as the parallel workload being reduced for small-degree vertices compared with large-degree vertices. In practice, one is often interested in the PPR of large-degree source vertices, and our parallel approach is superior in such cases.

Varying batch sizes. To study the effect of different batch sizes, we vary the batch size as different ratios of the sliding window size, i.e., 1%, 0.1% and 0.01%. We show the latency of all approaches in Figure 8. As the batch size of the window slide decreases, the latencies of all approaches get smaller since fewer updates are required. But the parallel approaches GPU and CPU-MT still keep their speedups over CPU-Seq for smaller batches, which shows that our parallelization is effective and robust to different batch sizes.

Figure 6: Effect of ε for the parallel push.

Figure 7: Effect of the source vertex. It is chosen from the vertices having top-10, top-1K and top-1M out-degrees.

Table 4: Profiling metrics for resource consumption 6. RELATED WORKS Approach Metrics Personalized PageRank. PPR is extensively explored Achieved warp occupancy (WO) GPU in the literature [49, 38, 10, 31, 6, 5, 9, 29, 11]. They can Global load eciency (GLD) be categorized into three lines of schemes. L2 data cache miss rate (L2DCM) The first scheme is the power iteration [39, 48, 32]. It is CPU-MT L3 cache miss rate (L3CM) slow [38] since it requires ⌦(m) for updating the PPR vector ratio of cycles stalled on resource (STL) once, where m is the number of edges in the graph. The second scheme is based on Monte-Carlo [7, 10, 9, 5.5 Resource Consumption 27]. It simulates w random walk samples from each ver- We study the e↵ects of di↵erent optimizations and param- tex, and estimates the PPR value as the number of visits eters on the resource consumption of the hardwares, includ- to a vertex made by these samples. On the arrival of any ing the memory eciency and thread utilization. To profile graph update, say u v, the samples that pass through the execution of our approaches, we use nvprof in CUDA u are found, and then! redo the random walks on the new toolkit [37] for GPU and PAPI [2] for CPU-MT.Table4lists graph. The analysis in [10] shows that the expected number the profiling metrics used in the experiments. Note that for of random walk samples to be updated is only O(w log(k)) GPU, WO is the ratio of the average warps per active cycle, assuming k edges randomly arrive for an arbitrary directed· while GLD is the ratio of the requested global memory load 3log(2/pf ) graph [10]. However, w is required to be at least 2 throughput to the maximum load throughput. ✏r [46], where ,p , ✏ are the result threshold, failure prob- Figure 9 shows the profiling results with varying batch f r ability and relative error ratio respectively such that when size. For GPU, we can observe that the achieved warp oc- ⇡s(x) > , Ps(x) ⇡s(x) ✏r⇡s(x) with probability at least cupancy gets larger as the batch size increases. This is be- | |  1 pf .Since and pf are set to 1/ V normally [46, 31, 29], cause a large batch size provides more workloads for the nlog(n) | | w is O( 2 ), which implies the overheads for storage and parallel push, and more warps can be utilized for the execu- ✏r tion of GPU kernels. Meanwhile, the global store eciency computation can be significant especially for large graphs. decreases with the increase of the batch size. As traversing The third scheme is the local update [6, 5, 11, 38, 49, the graph structure leads to many random memory accesses, 31, 29, 28, 30], which is the one we parallelize in this work. when there are more workloads, the parallel push performs Many previous works [6, 5, 11] propose variants of the lo- more random memory accesses, which degenerate the global cal update scheme for di↵erent applications such as graph memory throughput a bit. For the same reason, L2 data partitioning and web search, but the PPR is computed in cache miss rate and L3 cache miss rate of CPU-MT increase static graphs. Recent works [38, 49] propose incremental ap- a bit as the batch size gets large. For CPU-MT,wecanalso proaches for dynamic PPR computation, which can outper- see more CPU cycles are spent waiting for resource as the form the Monte-Carlo-based approaches. Amortized analy- batch size increases, since there are more memory accesses. sis in [49] shows that dynamic maintenance of PPR vector is We also profile the execution with varying optimization tech- ecient. 
5.6 Scalability

We also evaluate the scalability of our parallel approach on multi-core architectures. We vary the number of cores used for parallelization and report the throughput of CPU-MT. Each experiment is run for 5 minutes with a batch size of 10^5. As shown in Figure 10, the performance scales as the number of cores increases.
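As an illustration of how such a core-count sweep can be driven (a minimal sketch using OpenMP; our CPU-MT harness is not necessarily built this way, and run_batches is a hypothetical stand-in for the real update-processing loop), one can pin the number of worker threads before timing each configuration:

    #include <omp.h>
    #include <chrono>
    #include <cstdio>

    // Hypothetical stand-in: process `batches` update batches of size `batch_size`
    // in parallel and return the number of edge updates handled.
    long long run_batches(int batches, int batch_size) {
        long long processed = 0;
        #pragma omp parallel for reduction(+ : processed)
        for (int b = 0; b < batches; ++b) {
            processed += batch_size;  // placeholder for the real per-batch work
        }
        return processed;
    }

    int main() {
        const int batches = 100, batch_size = 100000;  // 10^5 updates per batch
        for (int cores : {1, 2, 4, 8, 16, 32}) {
            omp_set_num_threads(cores);                // restrict the worker pool
            auto start = std::chrono::steady_clock::now();
            long long processed = run_batches(batches, batch_size);
            std::chrono::duration<double> secs = std::chrono::steady_clock::now() - start;
            std::printf("%2d cores: %.0f updates/s\n", cores, processed / secs.count());
        }
        return 0;
    }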

Figure 8: Effect of batch size.

Figure 9: Resource consumption with varying batch size.

Figure 10: Scalability on multi-cores.

6. RELATED WORKS

Personalized PageRank. PPR is extensively explored in the literature [49, 38, 10, 31, 6, 5, 9, 29, 11]. Existing approaches can be categorized into three lines of schemes.

The first scheme is the power iteration [39, 48, 32]. It is slow [38] since it requires Ω(m) work to update the PPR vector once, where m is the number of edges in the graph.

The second scheme is based on Monte-Carlo simulation [7, 10, 9, 27]. It simulates w random walk samples from each vertex and estimates a PPR value as the number of visits to a vertex made by these samples. On the arrival of any graph update, say u → v, the samples that pass through u are found and the random walks are redone on the new graph. The analysis in [10] shows that the expected number of random walk samples to be updated is only O(w · log(k)), assuming k edges randomly arrive for an arbitrary directed graph [10]. However, w is required to be at least 3 log(2/p_f) / (ε_r² δ) [46], where δ, p_f and ε_r are the result threshold, failure probability and relative error ratio respectively, such that whenever π_s(x) > δ, we have |P_s(x) − π_s(x)| ≤ ε_r · π_s(x) with probability at least 1 − p_f. Since δ and p_f are normally set to 1/|V| [46, 31, 29], w is O(n log(n) / ε_r²), which implies that the storage and computation overheads can be significant, especially for large graphs.

The third scheme is the local update [6, 5, 11, 38, 49, 31, 29, 28, 30], which is the one we parallelize in this work. Many previous works [6, 5, 11] propose variants of the local update scheme for different applications such as graph partitioning and web search, but they compute PPR on static graphs. Recent works [38, 49] propose incremental approaches for dynamic PPR computation, which can outperform the Monte-Carlo-based approaches. Amortized analysis in [49] shows that dynamic maintenance of the PPR vector is efficient. In a directed graph with a random edge permutation, or an undirected graph with arbitrary edge updates, the cost to maintain a reverse PPR vector for a random target vertex is only O(k + d/ε), where k is the number of edge updates and d is the average degree; given an undirected graph with arbitrary edge updates, the cost to maintain a forward PPR vector for a random start vertex is O(k + 1/ε). These complexities suggest that the maintenance cost is only O(1) per edge update plus the cost of computing the PPR vector once for the static graph. The variants of the local update scheme differ slightly in the local push, such as the convergence criteria [6, 49], the direction of graph traversal [6, 38, 49] and the amount of residual used for the local push [6, 5]. The variant of the local update we parallelize in this work is from [5, 49], but our parallel approach can be naturally extended to other variants, since they share many similarities. On the arrival of graph updates, they repair the estimates and residuals of the affected vertices to restore the invariant; the local push then iteratively performs push operations to maintain the estimate and residual vectors until the convergence condition is satisfied.
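For concreteness, the following is a minimal sequential sketch of a forward local push in the style of [5, 49]. The adjacency-list layout, the teleport probability alpha, the push condition r[u] > eps * deg(u), and the handling of dangling vertices are our own simplifying assumptions; the parallel batch algorithm in this paper differs in how the frontier is generated and synchronized.

    #include <cstddef>
    #include <queue>
    #include <vector>

    // Sequential forward push: p approximates the PPR vector of the source s and
    // r holds the residuals. The caller seeds r[s] = 1 and all other entries to 0.
    // A vertex is pushed while r[u] > eps * deg(u); dangling vertices simply
    // accumulate residual in this sketch.
    void forward_push(const std::vector<std::vector<int>>& adj,
                      std::vector<double>& p, std::vector<double>& r,
                      double alpha, double eps) {
        std::queue<int> frontier;
        for (std::size_t u = 0; u < adj.size(); ++u)
            if (!adj[u].empty() && r[u] > eps * adj[u].size())
                frontier.push(static_cast<int>(u));

        while (!frontier.empty()) {
            int u = frontier.front();
            frontier.pop();
            double deg = static_cast<double>(adj[u].size());
            if (r[u] <= eps * deg) continue;            // stale entry, already pushed

            double residual = r[u];
            p[u] += alpha * residual;                   // absorb alpha fraction into the estimate
            r[u] = 0.0;
            double share = (1.0 - alpha) * residual / deg;  // spread the rest to out-neighbors
            for (int v : adj[u]) {
                bool was_below = r[v] <= eps * adj[v].size();
                r[v] += share;
                // enqueue v only when its residual first crosses the threshold
                if (was_below && !adj[v].empty() && r[v] > eps * adj[v].size())
                    frontier.push(v);
            }
        }
    }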
Recent works [27, 18, 46] rely on effective indexing methods to accelerate PPR computation. PowerWalk [27] simulates random walk samples within a memory budget during offline index construction; for online queries, it reuses these samples and invokes further iterative computation to meet the accuracy requirement. However, this Monte-Carlo-based approach can incur large overheads to maintain the stored random walk samples on dynamic graphs. Guo et al. [18] design a distributed local update approach that makes use of pre-computed PPR vectors of selected hub vertices. Wang et al. [46] develop an elastic indexing approach that trades off accuracy, query time and memory budget; it also exploits the pre-computed PPR vectors of hub vertices to aid the local update. Our approach is helpful for both works [18, 46] to maintain the indexed PPR vectors on dynamic graphs.

To the best of our knowledge, there are only two works on parallelizing the local update for PPR computation. Shun et al. [44] implement various local update algorithms for graph clustering on multi-core architectures; Perozzi et al. [40] study the PageRank-Nibble algorithm [6] in the distributed setting. However, these two works only discuss computation on static graphs.

Parallel Graph Processing. Many graph processing systems [36, 23, 47, 42, 16] have been proposed to accelerate graph applications by utilizing GPUs and multi-core CPUs. They provide a general vertex-centric abstraction that can express most graph algorithms, but they lack the ability to support application-specific optimizations. For example, eager propagation requires active vertices to absorb incoming messages so as to use up-to-date residuals, which cannot be supported in the bulk synchronous processing model; local duplicate detection exploits monotonicity to reduce synchronization overheads, but the graph systems cannot leverage this knowledge for frontier generation. While most existing graph systems only support static graphs, recent frameworks [14, 41] can efficiently handle the dynamic scenario with architecture-optimized graph storage schemes. How to integrate our work with these frameworks is an interesting direction for future work.

7. CONCLUSION

We propose a parallel local update approach for dynamic PPR computation over high-speed graph streams. Our approach adopts a batch processing method for incremental updates that reduces synchronization costs within a batch and generates more workload for parallelism. We propose optimization techniques to improve the runtime performance of the parallel push: eager propagation practically reduces the number of push operations, and local duplicate detection avoids the synchronization overheads of merging duplicate vertices during frontier generation. The experimental evaluation shows that our parallel approach on GPU architectures and multi-core CPUs achieves significant speedups over the state-of-the-art sequential approach.

Acknowledgments. The project is supported by a grant, MOE2017-T2-1-141, from the Singapore Ministry of Education.

8. REFERENCES

[1] Extended version. http://www.comp.nus.edu.sg/~wentian/paper/tr2017-ppr.pdf.
[2] PAPI. http://icl.utk.edu/papi/.
[3] Twitter usage statistics. http://www.internetlivestats.com/twitter-statistics/. Accessed 18-April-2017.
[4] V. Agarwal, F. Petrini, D. Pasetto, and D. A. Bader. Scalable graph exploration on multicore processors. In Proceedings of the 2010 ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–11. IEEE Computer Society, 2010.
[5] R. Andersen, C. Borgs, J. Chayes, J. Hopcraft, V. S. Mirrokni, and S.-H. Teng. Local computation of pagerank contributions. In International Workshop on Algorithms and Models for the Web-Graph, pages 150–165. Springer, 2007.
[6] R. Andersen, F. Chung, and K. Lang. Local graph partitioning using pagerank vectors. In 2006 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS'06), pages 475–486. IEEE, 2006.
[7] K. Avrachenkov, N. Litvak, D. Nemirovsky, and N. Osipova. Monte carlo methods in pagerank computation: When one iteration is sufficient. SIAM Journal on Numerical Analysis, 45(2):890–904, 2007.
[8] L. Backstrom and J. Leskovec. Supervised random walks: predicting and recommending links in social networks. In Proceedings of the fourth ACM international conference on Web search and data mining, pages 635–644. ACM, 2011.
[9] B. Bahmani, K. Chakrabarti, and D. Xin. Fast personalized pagerank on mapreduce. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of data, pages 973–984. ACM, 2011.
[10] B. Bahmani, A. Chowdhury, and A. Goel. Fast incremental and personalized pagerank. PVLDB, 4(3):173–184, 2010.
[11] P. Berkhin. Bookmark-coloring algorithm for personalized pagerank computing. Internet Mathematics, 3(1):41–62, 2006.
[12] T. David, R. Guerraoui, and V. Trigonakis. Everything you always wanted to know about synchronization but were afraid to ask. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, pages 33–48. ACM, 2013.
[13] A. Davidson, S. Baxter, M. Garland, and J. D. Owens. Work-efficient parallel gpu methods for single-source shortest paths. In Parallel and Distributed Processing Symposium, 2014 IEEE 28th International, pages 349–359. IEEE, 2014.
[14] D. Ediger, R. McColl, J. Riedy, and D. A. Bader. Stinger: High performance data structure for streaming graphs. In High Performance Extreme Computing (HPEC), 2012 IEEE Conference on, pages 1–5. IEEE, 2012.
[15] S. L. Feld. Why your friends have more friends than you do. American Journal of Sociology, 96(6):1464–1477, 1991.
[16] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin. Powergraph: Distributed graph-parallel computation on natural graphs. In OSDI, volume 12, page 2, 2012.
[17] S. Guha and A. McGregor. Graph synopses, sketches, and streams: A survey. PVLDB, 5(12):2030–2031, 2012.
[18] T. Guo, X. Cao, G. Cong, J. Lu, and X. Lin. Distributed algorithms on exact personalized pagerank. In Proceedings of the 2017 ACM International Conference on Management of Data, pages 479–494. ACM, 2017.
[19] P. Gupta, A. Goel, J. Lin, A. Sharma, D. Wang, and R. Zadeh. Wtf: The who to follow service at twitter. In Proceedings of the 22nd international conference on World Wide Web, pages 505–514. ACM, 2013.
[20] M. Herlihy. Wait-free synchronization. ACM Transactions on Programming Languages and Systems (TOPLAS), 13(1):124–149, 1991.
[21] S. Hong, T. Oguntebi, and K. Olukotun. Efficient parallel graph exploration on multi-core cpu and gpu. In Parallel Architectures and Compilation Techniques (PACT), 2011 International Conference on, pages 78–88. IEEE, 2011.
[22] G. Jeh and J. Widom. Scaling personalized web search. In Proceedings of the 12th international conference on World Wide Web, pages 271–279. ACM, 2003.
[23] F. Khorasani, K. Vora, R. Gupta, and L. N. Bhuyan. Cusha: vertex-centric graph processing on gpus. In Proceedings of the 23rd international symposium on High-performance parallel and distributed computing, pages 239–252. ACM, 2014.
[24] C. E. Leiserson. The cilk++ concurrency platform. In Design Automation Conference, 2009. DAC'09. 46th ACM/IEEE, pages 522–527. IEEE, 2009.
[25] H. Liu and H. H. Huang. Enterprise: Breadth-first graph traversal on gpus. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, page 68. ACM, 2015.
[26] H. Liu, H. H. Huang, and Y. Hu. ibfs: Concurrent breadth-first search on gpus. In Proceedings of the 2016 International Conference on Management of Data, pages 403–416. ACM, 2016.
[27] Q. Liu, Z. Li, J. Lui, and J. Cheng. Powerwalk: Scalable personalized pagerank via random walks with vertex-centric decomposition. In Proceedings of the 25th ACM International on Conference on Information and Knowledge Management, pages 195–204. ACM, 2016.
[28] P. Lofgren. Efficient algorithms for personalized pagerank. arXiv preprint arXiv:1512.04633, 2015.
[29] P. Lofgren, S. Banerjee, and A. Goel. Personalized pagerank estimation and search: A bidirectional approach. In Proceedings of the Ninth ACM International Conference on Web Search and Data Mining, pages 163–172. ACM, 2016.
[30] P. Lofgren and A. Goel. Personalized pagerank to a target node. arXiv preprint arXiv:1304.4658, 2013.
[31] P. A. Lofgren, S. Banerjee, A. Goel, and C. Seshadhri. Fast-ppr: scaling personalized pagerank estimation for large graphs. In Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining, pages 1436–1445. ACM, 2014.
[32] T. Maehara, T. Akiba, Y. Iwata, and K.-i. Kawarabayashi. Computing personalized pagerank quickly by exploiting graph structures. PVLDB, 7(12):1023–1034, 2014.
[33] A. McLaughlin and D. A. Bader. Scalable and high performance betweenness centrality on the gpu. In Proceedings of the International Conference for High performance computing, networking, storage and analysis, pages 572–583. IEEE Press, 2014.
[34] D. Merrill, M. Garland, and A. Grimshaw. High-performance and scalable gpu graph traversal. ACM Transactions on Parallel Computing, 1(2):14, 2015.
[35] R. Nasre, M. Burtscher, and K. Pingali. Data-driven versus topology-driven irregular computations on gpus. In Parallel & Distributed Processing (IPDPS), 2013 IEEE 27th International Symposium on, pages 463–474. IEEE, 2013.
[36] D. Nguyen, A. Lenharth, and K. Pingali. A lightweight infrastructure for graph analytics. In Proceedings of the Twenty-Fourth ACM Symposium on Operating Systems Principles, pages 456–471. ACM, 2013.
[37] J. Nickolls, I. Buck, M. Garland, and K. Skadron. Scalable parallel programming with cuda. Queue, 6(2):40–53, 2008.
[38] N. Ohsaka, T. Maehara, and K.-i. Kawarabayashi. Efficient pagerank tracking in evolving networks. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 875–884. ACM, 2015.
[39] L. Page, S. Brin, R. Motwani, and T. Winograd. The pagerank citation ranking: bringing order to the web. 1999.
[40] B. Perozzi, C. McCubbin, and J. T. Halbert. Scalable graph clustering with parallel approximate pagerank. Social Network Analysis and Mining, 4(1):1–11, 2014.
[41] M. Sha, Y. Li, B. He, and K.-L. Tan. Accelerating dynamic graph analytics on gpus. PVLDB, 11(1), 2017.
[42] J. Shun and G. E. Blelloch. Ligra: a lightweight graph processing framework for shared memory. In ACM Sigplan Notices, volume 48, pages 135–146. ACM, 2013.
[43] J. Shun, L. Dhulipala, and G. Blelloch. A simple and practical linear-work parallel algorithm for connectivity. In Proceedings of the 26th ACM symposium on Parallelism in algorithms and architectures, pages 143–153. ACM, 2014.
[44] J. Shun, F. Roosta-Khorasani, K. Fountoulakis, and M. W. Mahoney. Parallel local graph clustering. PVLDB, 9(12):1041–1052, 2016.
[45] M. Then, M. Kaufmann, F. Chirigati, T.-A. Hoang-Vu, K. Pham, A. Kemper, T. Neumann, and H. T. Vo. The more the merrier: Efficient multi-source graph traversal. PVLDB, 8(4):449–460, 2014.
[46] S. Wang, Y. Tang, X. Xiao, Y. Yang, and Z. Li. Hubppr: effective indexing for approximate personalized pagerank. PVLDB, 10(3):205–216, 2016.
[47] Y. Wang, A. Davidson, Y. Pan, Y. Wu, A. Riffel, and J. D. Owens. Gunrock: A high-performance graph processing library on the gpu. In Proceedings of the 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, page 11. ACM, 2016.
[48] J. Yang and J. Leskovec. Defining and evaluating network communities based on ground-truth. Knowledge and Information Systems, 42(1):181–213, 2015.
[49] H. Zhang, P. Lofgren, and A. Goel. Approximate personalized pagerank on dynamic graphs. arXiv preprint arXiv:1603.07796, 2016.
