
Practical Parallel Hypergraph Algorithms

Julian Shun
[email protected]
MIT CSAIL

Abstract

While there has been significant work on parallel graph processing, there has been surprisingly little work on high-performance hypergraph processing. This paper presents a collection of efficient parallel algorithms for hypergraph processing, including algorithms for computing hypertrees, hyperpaths, betweenness centrality, maximal independent sets, k-core decomposition, connected components, PageRank, and single-source shortest paths. For these problems, we either provide new parallel algorithms or more efficient implementations than prior work. Furthermore, our algorithms are theoretically-efficient in terms of work and depth. To implement our algorithms, we extend the Ligra graph processing framework to support hypergraphs, and our implementations benefit from graph optimizations including switching between sparse and dense traversals based on the frontier size, edge-aware parallelization, using buckets to prioritize the processing of vertices, and compression. Our experiments on a 72-core machine show that our algorithms obtain excellent parallel speedups, and are significantly faster than algorithms in existing hypergraph processing frameworks.

Figure 1. An example hypergraph representing the groups {v0, v1, v2}, {v1, v2, v3}, and {v0, v3}, and its bipartite representation: (a) hypergraph; (b) bipartite representation. [figure omitted]

1 Introduction

A graph contains vertices and edges, where a vertex represents an entity of interest, and an edge between two vertices represents an interaction between the two corresponding entities. There has been significant work on developing algorithms and programming frameworks for efficient graph processing due to their applications in various domains, such as social network and Web analysis, cyber-security, and scientific computations. One limitation of modeling data using graphs is that only binary relationships can be expressed, which can lead to loss of information from the original data. Hypergraphs are a generalization of graphs where the relationships, represented as hyperedges, can contain an arbitrary number of vertices. Hyperedges correspond to group relationships among vertices (e.g., a community in a social network). An example of a hypergraph is shown in Figure 1a. Hypergraphs have been shown to enable richer analysis of structured data in various applications, such as protein network analysis [71], machine learning [94], and image segmentation [12, 23]. For example, Ducournau and Bretto [23] show that by representing pixels in an image as vertices and groups of nearby and similar pixels as hyperedges, the quality of image segmentation can be improved compared to using a graph representation. Unfortunately, there has been little research on parallel hypergraph processing.

The main contribution of this paper is a suite of efficient parallel hypergraph algorithms, including algorithms for hypertrees, hyperpaths, betweenness centrality, maximal independent sets, k-core decomposition, connectivity, PageRank, and single-source shortest paths. For these problems, we provide either new parallel hypergraph algorithms (e.g., betweenness centrality and k-core decomposition) or more efficient implementations than prior work. Additionally, we show that most of our algorithms are theoretically-efficient in terms of their work and depth complexities.

We observe that our parallel hypergraph algorithms can be implemented efficiently by taking advantage of graph processing machinery. To implement our parallel hypergraph algorithms, we made relatively simple extensions to the Ligra graph processing framework [75], and we call the extended framework Ligra-H. As with Ligra, Ligra-H is well-suited for frontier-based algorithms, where small subsets of elements (referred to as frontiers) are processed on each iteration. We use a bipartite graph representation to store hypergraphs, and use Ligra's data structures for representing subsets of vertices and hyperedges, as well as operators for mapping application-specific functions over these elements. The operators for processing subsets of vertices and hyperedges are theoretically-efficient, which enables us to implement parallel hypergraph algorithms with strong theoretical guarantees. Ligra-H inherits from Ligra various optimizations developed for graphs, including switching between different traversal strategies based on the size of the frontier, edge-aware parallelization, bucketing for prioritizing the processing of vertices, and compression.

Our experiments on a variety of real-world and synthetic hypergraphs show that our algorithms implemented in Ligra-H achieve good parallel speedup and scalability with respect to input size. On 72 cores with hyper-threading, we achieve a parallel speedup of between 8.5–76.5x. Compared to HyperX [41] and MESH [36], the only existing frameworks for hypergraph processing, our results are significantly faster. For example, one iteration of PageRank on the Orkut community hypergraph with 2.3 million vertices and 15.3 million hyperedges [54] takes 0.083s on 72 cores and 3.31s on one thread in Ligra-H, while taking 1 minute on eight 12-core machines using MESH [36] and 10s using eight 4-core machines in HyperX [41]. Certain hypergraph algorithms (hypertrees, connected components, and single-source shortest paths) can be implemented correctly by expanding each hyperedge into a clique among its member vertices and running the corresponding graph algorithm on the resulting graph. We also compare with this alternative approach by using the original Ligra framework to process the clique-expanded graphs, and show that the space usage and performance are significantly worse than those of Ligra-H (2.8x–30.6x slower while using 235x more space on the Friendster hypergraph).

Our work shows that high-performance hypergraph processing can be done using just a single multicore machine, on which we can process all existing publicly-available hypergraphs. Prior work has shown that graph processing can be done efficiently on just a single multicore machine (e.g., [21, 22, 62, 64, 75, 81, 90, 95]), and this work extends the observation to hypergraphs. To benefit the community, our source code will be made publicly available.

2 Related Work

Graph Processing. There has been significant work on developing graph libraries and frameworks to reduce programming effort by providing high-level operators that capture common algorithm design patterns (e.g., [16, 18, 19, 24, 28–30, 33–35, 37, 42, 46, 51, 55, 56, 58–60, 63, 64, 66–68, 70, 72, 75, 78–81, 85, 88, 90, 93, 95], among many others; see [61, 73, 87] for surveys). Many of these frameworks can process large graphs efficiently, but none of them directly support hypergraph processing.

Hypergraph Processing. HyperX [41] and MESH [36] are distributed systems for hypergraph processing built on top of Spark [89]. Algorithms are written using hyperedge programs and vertex programs that are iteratively applied on hyperedges and vertices, respectively. HyperX stores hypergraphs using fixed-length tuples containing vertex and hyperedge identifiers, and MESH stores the hypergraph as a bipartite graph. HyperX includes algorithms for random walks, label propagation, and spectral learning. MESH includes PageRank, PageRank-Entropy (a variant of PageRank that also computes the entropy of vertex ranks in each hyperedge), label propagation, and single-source shortest paths. Both HyperX and MESH do work proportional to the entire hypergraph on every iteration, even if few vertices/hyperedges are active, which makes them inefficient for frontier-based hypergraph algorithms. The Chapel HyperGraph Library [39] is a library for generating synthetic hypergraphs. However, it does not implement any hypergraph algorithms, and currently does not have a high-level programming abstraction for doing so.

Hypergraphs are useful in modeling communication costs in parallel machines, and partitioning can be used to minimize communication. There has been significant work on both sequential and parallel hypergraph partitioning algorithms (see, e.g., [14, 15, 20, 44, 47, 49, 83]). While we do not consider the problem of hypergraph partitioning in this paper, these techniques could potentially be used to improve the locality of our algorithms.

Algorithms have been designed for a variety of problems on hypergraphs, including random walks [23], shortest hyperpaths [1, 65], betweenness centrality [69], hypertrees [26], connectivity [26], maximal independent sets [3, 7, 43, 45], and k-core decomposition [40]. However, there have been no efficient parallel implementations of hypergraph algorithms, with the exception of [40], which provides a GPU implementation for a special case of k-core decomposition.
3 Preliminaries

Hypergraph Notation. We denote an unweighted hypergraph by H(V, E), where V is the set of vertices and E is the set of hyperedges. A weighted hypergraph is denoted by H = (V, E, w), where w is a function that maps a hyperedge to a real value (its weight). The number of vertices in a hypergraph is n_v = |V|, and the number of hyperedges is n_e = |E|. Vertices are assumed to be labeled from 0 to n_v − 1, and hyperedges from 0 to n_e − 1. For undirected hypergraphs, we use deg(e) to denote the number of vertices a hyperedge e ∈ E contains (i.e., its cardinality), and deg(v) to denote the number of hyperedges that a vertex v ∈ V belongs to. In directed hypergraphs, hyperedges contain incoming vertices and outgoing vertices. For a hyperedge e ∈ E, we use N⁻(e) and N⁺(e) to denote its incoming vertices and outgoing vertices, respectively, and deg⁻(e) = |N⁻(e)| and deg⁺(e) = |N⁺(e)|. We use N⁻(v) and N⁺(v) to denote the hyperedges that v ∈ V is an incoming vertex and outgoing vertex for, respectively, and deg⁻(v) = |N⁻(v)| and deg⁺(v) = |N⁺(v)|. We define the size |H| of a hypergraph to be n_v + Σ_{e∈E}(deg⁻(e) + deg⁺(e)), i.e., the number of vertices plus the sum of the hyperedge cardinalities.

Computational Model. We use the work-depth model [38] to analyze the theoretical efficiency of algorithms. The work of an algorithm is the number of operations used, and the depth is the length of the longest sequential dependence. We assume that concurrent reads and writes are supported. By Brent's scheduling theorem [11], an algorithm with work W and depth D has overall running time W/P + D, where P is the number of processors available. A parallel algorithm is work-efficient if its work asymptotically matches that of the best sequential algorithm for the problem, which is important since in practice the W/P term in the running time often dominates.

Primitives. Scan takes an array A of length n, an associative binary operator ⊕, and an identity element ⊥ such that ⊥ ⊕ x = x for any x, and returns the array (⊥, ⊥ ⊕ A[0], ⊥ ⊕ A[0] ⊕ A[1], ..., ⊥ ⊕ A[0] ⊕ ... ⊕ A[n−2]) as well as the overall sum ⊥ ⊕ A[0] ⊕ ... ⊕ A[n−1]. Filter takes an array A and a predicate f and returns a new array containing each a ∈ A for which f(a) is true, in the same order as in A. Scan and filter take O(n) work and O(log n) depth (assuming ⊕ and f take O(1) work) [38].

A compare-and-swap CAS(&x, o, n) takes a memory location x and atomically updates the value at location x to n if the value is currently o, returning true if it succeeds and false otherwise. A fetch-and-add FAA(&x, n) takes a memory location x, atomically returns the current value at x, and then increments the value at x by n. A WRITEMIN(&x, n) takes a memory location x and a value n, and atomically updates x to be the minimum of the value at x and n; it returns true if the update was successful and false otherwise. We assume that these operations take O(1) work and depth in our model, but note that they can be simulated work-efficiently in logarithmic depth on weaker models [32].
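To make the atomic primitives concrete, the following is a minimal C++ sketch of WRITEMIN built from a CAS retry loop; std::atomic stands in for the paper's primitives, and the function name write_min is our own.

    #include <atomic>

    // WRITEMIN(&x, n): atomically set x to min(x, n); return true iff x changed.
    template <typename T>
    bool write_min(std::atomic<T>* x, T n) {
      T old = x->load(std::memory_order_relaxed);
      while (n < old) {
        // compare_exchange_weak reloads `old` with the current value on
        // failure, so the loop re-checks whether n is still smaller.
        if (x->compare_exchange_weak(old, n)) return true;
      }
      return false;  // another thread already installed a value <= n
    }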
4 Ligra-H Framework

This section presents the interface and implementation of the Ligra-H framework, which extends Ligra [75] to hypergraphs.

4.1 Interface

Ligra-H contains the basic VertexSet and HyperedgeSet data structures, which are used to represent subsets of vertices and hyperedges, respectively. VertexMap takes as input a boolean function F and a VertexSet U, applies F to all vertices in U, and returns an output VertexSet containing all elements u ∈ U such that F(u) = true. HyperedgeMap is an analogous function for HyperedgeSets.

VertexProp takes as input a hypergraph H, a VertexSet U, and two boolean functions F and C. It applies F to all pairs (v, e) such that v ∈ U, v ∈ N⁻(e), and C(e) = true (call this subset of pairs P), and returns a HyperedgeSet U′ where e ∈ U′ if and only if (v, e) ∈ P and F(v, e) = true. HyperedgeProp takes as input a hypergraph H, a HyperedgeSet U, and two boolean functions F and C. It applies F to all pairs (e, v) such that e ∈ U, v ∈ N⁺(e), and C(v) = true (call this subset of pairs P), and returns a VertexSet U′ where v ∈ U′ if and only if (e, v) ∈ P and F(e, v) = true. For weighted hypergraphs, the F function takes the weight as the third argument.

We provide a function HyperedgeFilterNgh that takes as input a hypergraph H, a HyperedgeSet U, and a boolean function C, and filters out the incident vertices v for each hyperedge e ∈ U such that C(v) = false. This mutates the hypergraph, so that future computations will not inspect these vertices. HyperedgePropCount takes as input a hypergraph H, a HyperedgeSet U, and a function F, and applies F to each neighbor of U in parallel. The function F takes two arguments, the vertex v and the number of neighbors of v in U. The output is an array of pairs (v, cnt), where v is a neighbor of U and cnt is the number of neighbors of v in U. We have analogous functions for VertexSets.
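To make the operator types concrete, here is one hypothetical C++ rendering of the core operators above; these declarations are illustrative only and are not Ligra-H's actual API.

    struct Hypergraph;
    struct VertexSet;     // subset of vertex IDs in [0, nv)
    struct HyperedgeSet;  // subset of hyperedge IDs in [0, ne)

    // Applies F to each v in U; returns {v in U : F(v) == true}.
    template <class F>
    VertexSet VertexMap(VertexSet& U, F f);

    // For each v in U and each hyperedge e with v in N-(e) and C(e) == true,
    // applies F(v, e); returns the hyperedges that received a true application.
    template <class F, class C>
    HyperedgeSet VertexProp(Hypergraph& H, VertexSet& U, F f, C c);

    // Symmetric direction: from a hyperedge frontier to its outgoing vertices.
    template <class F, class C>
    VertexSet HyperedgeProp(Hypergraph& H, HyperedgeSet& U, F f, C c);

    // Destructively removes incident vertices v with C(v) == false from each
    // hyperedge in U.
    template <class C>
    void HyperedgeFilterNgh(Hypergraph& H, HyperedgeSet& U, C c);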
Ligra-H also supports the bucketing interface developed in the Julienne framework [21]. Vertices are stored in buckets associated with bucket IDs, and algorithms can process buckets in increasing or decreasing order. Vertices can be moved to different buckets during the computation. The MakeBuckets function takes a size, an array A, and an ordering, and creates a bucketing structure B that stores each vertex v in bucket A[v]. The NextBucket function takes a bucketing structure B and returns the next non-empty bucket in the specified ordering. The UpdateBuckets function takes a bucketing structure B and an array of pairs (v, bkt), and moves each vertex v from its original bucket to the bucket with ID bkt. Julienne actually groups multiple buckets together into an overflow bucket and thus has a GETBUCKET function that determines the physical bucket from a logical bucket ID, but for simplicity we will not use it in our discussion.

4.2 Implementation

A method for representing hypergraphs is to create a clique among all pairs of vertices in each hyperedge and store the result as a graph [36, 41]. However, this leads to a loss of information compared to the original hypergraph, as the groups of vertices in hyperedges are no longer distinguished. Furthermore, the space required to store the resulting graph can be significantly higher than that of the original hypergraph [36, 41]. Another approach is to use a bipartite graph representation with vertices in one partition and hyperedges in the other, where each hyperedge connects to all vertices belonging to it [36] (an example is shown in Figure 1b). This is the approach that we adopt in this paper.

In the bipartite representation, there is an edge (u, ngh) if u ∈ V and ngh ∈ N⁺(u), or u ∈ E and ngh ∈ N⁺(u). The edges for each element are stored in an adjacency array. We also store the incoming edges for each vertex and hyperedge to enable the direction optimization that we discuss in Section 4.3. For weighted hypergraphs, we store the weights interleaved with the edges in the bipartite representation for cache locality. The hypergraph can be transposed using the TRANSPOSE function, which swaps the roles of the incoming and outgoing edges for all elements.
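The following is one plausible CSR-style layout for this bipartite representation, given as a sketch; the type and field names are ours, not Ligra-H's.

    #include <vector>

    struct BipartiteHypergraph {
      long nv = 0, ne = 0;
      // Vertex side: v's hyperedges are v_edges[v_offsets[v] .. v_offsets[v+1]).
      std::vector<long> v_offsets, v_edges;
      // Hyperedge side: e's vertices are e_edges[e_offsets[e] .. e_offsets[e+1]).
      std::vector<long> e_offsets, e_edges;
      // In-edges would be stored symmetrically (omitted here) so that TRANSPOSE
      // can swap the two directions and dense traversals can pull over them.

      long out_degree_v(long v) const { return v_offsets[v + 1] - v_offsets[v]; }
      long out_degree_e(long e) const { return e_offsets[e + 1] - e_offsets[e]; }
    };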

One implementation choice that we considered was to directly pass bipartite graphs to the Ligra framework. However, this would either require hyperedges to have distinct identifiers from vertices, making it unnatural to index arrays in the application code, or require mapping the hyperedge identifiers to the range [0, ..., n_e − 1] on every array access, leading to additional overhead on hyperedge accesses and more complicated application code. Instead, we modified the Ligra code to distinguish between vertices and hyperedges and represent them using identifiers in the ranges [0, ..., n_v − 1] and [0, ..., n_e − 1], respectively. We borrow existing data structures and functions from Ligra, which we describe here for completeness.

VertexSets (HyperedgeSets) have two underlying implementations: a sparse integer array storing the IDs of the elements in the set, and a dense boolean array of length |V| (|E|) storing 1's in the locations corresponding to the IDs of the elements in the set, and 0's everywhere else.

Implementing VERTEXMAP and HYPEREDGEMAP simply requires mapping the function over the input VertexSet or HyperedgeSet, and applying a parallel filter on the result. Assuming that the function takes O(1) work (which is true in all of our applications), the overall work is O(|U|) and the depth is O(log |U|) for an input set U.

VERTEXPROP and HYPEREDGEPROP map the C function over the outgoing edges of the input set and, for the edges on which C returns true, apply the F function in parallel. A parallel scan is applied over the degrees of the elements in the input to determine offsets into an array storing the neighbors. A parallel filter is then applied over the neighbors on which F returned true to obtain the output set. For an input set U, and functions F and C that take O(1) work (which is true in all of our applications), this takes O(|U| + Σ_{u∈U} deg⁺(u)) work and O(log |H|) depth. We can remove duplicates from the output in the same bounds.

HYPEREDGEFILTERNGH can be implemented by inspecting all neighbors of each hyperedge in the input HyperedgeSet in parallel and using a parallel filter to remove the vertices not satisfying C. This takes the same work and depth as HYPEREDGEPROP. HYPEREDGEPROPCOUNT has the same work and depth bounds as HYPEREDGEPROP, as the counts can be implemented using fetch-and-adds or a semisort [31].
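As a sketch of the sparse VERTEXPROP traversal just described, the code below (building on the BipartiteHypergraph layout sketched above) is written sequentially for clarity; in Ligra-H both loops would be parallel, the offsets would come from a parallel scan, and the push_back stands in for the offset array plus parallel filter described in the text.

    #include <vector>

    template <class F, class C>
    std::vector<long> vertex_prop_sparse(const BipartiteHypergraph& H,
                                         const std::vector<long>& frontier,
                                         F f, C c) {
      std::vector<long> out;
      for (long v : frontier) {                   // parallel_for in practice
        long start = H.v_offsets[v], end = H.v_offsets[v + 1];
        for (long i = start; i < end; i++) {      // parallel_for in practice
          long e = H.v_edges[i];
          if (c(e) && f(v, e)) out.push_back(e);  // keep live target hyperedges
        }
      }
      return out;  // sparse HyperedgeSet; duplicates removable in same bounds
    }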
We refer the reader to [21] for implementation details of the bucketing structure. For the complexity of bucketing, we will use the following lemma from [21]:

Lemma 1 ([21]). For n identifiers, T total buckets, K calls to UPDATEBUCKETS, each of which updates a set S_i of identifiers, and L calls to NEXTBUCKET, bucketing takes O(n + T + Σ_{i=0}^{K} |S_i|) expected work and O((K + L) log n) depth with high probability.

4.3 Optimizations

VERTEXPROP and HYPEREDGEPROP use the direction optimization [4, 75] to switch between a sparse traversal (described in Section 4.2) and a dense traversal based on the size of the input VertexSet or HyperedgeSet and the sum of its out-degrees. For VERTEXPROP, the dense traversal loops over all hyperedges e in parallel, checking if they satisfy the C function, and if so applying F on their incoming edges serially, stopping once C(e) returns false. We use the dense traversal when the size of the input set plus the sum of its out-degrees exceeds a constant fraction (1/20 in our experiments) of the sum of the in-degrees of the hyperedges (which preserves work-efficiency), and the sparse traversal otherwise. The sum of out-degrees is computed using a parallel scan. We have an analogous implementation for HYPEREDGEPROP. The sparse traversals use the sparse set representation, and the dense traversals use the dense set representation. The input set is converted between the representations based on the traversal type.
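The selection test itself is a one-liner; the following minimal C++ sketch uses our own function name and hard-codes the 1/20 default described above.

    // Choose the dense traversal when the frontier plus its out-edges is at
    // least a 1/20 fraction of the total in-degree on the target side.
    inline bool should_use_dense(long frontier_size,
                                 long frontier_out_degrees,
                                 long target_in_degrees) {
      return frontier_size + frontier_out_degrees > target_in_degrees / 20;
    }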
For the dense traversals, instead of simply mapping over the vertices with a parallel-for loop, we added an edge-aware parallelization scheme that creates tasks containing a roughly equal number of edges, which are managed by the work-stealing scheduler [92]. We found this optimization to significantly improve load balancing for hypergraphs with highly-skewed degree distributions.

As in Ligra, we also provide a push-based dense traversal that represents the input set densely but loops over the outgoing edges of its elements, instead of over the incoming edges of all target elements.

For VERTEXPROP and HYPEREDGEPROP, we use optimized versions that do not remove duplicates, which can be used if the program guarantees that no duplicates will be generated in the output. When the output of VERTEXMAP, HYPEREDGEMAP, VERTEXPROP, or HYPEREDGEPROP is not needed, we use optimized implementations that do not call filter.

To reduce memory usage, Ligra-H supports compression of the underlying bipartite graph using the compression code from Ligra [77]. The neighbors of vertices and hyperedges are compressed using variable-length codes, and decoded on-the-fly when accessed in VERTEXPROP and HYPEREDGEPROP.

5 Parallel Hypergraph Algorithms

We have developed a collection of parallel hypergraph algorithms using Ligra-H: hypertrees, hyperpaths, betweenness centrality (BC), maximal independent set (MIS), k-core decomposition, connected components (CC), PageRank, and single-source shortest paths (SSSP). Our algorithms for betweenness centrality and k-core decomposition are new, while the connected components, PageRank, and single-source shortest paths algorithms are more efficient variants of previously described algorithms [36, 41]. The hypertrees and hyperpaths algorithms are similar to parallel breadth-first search. The maximal independent set algorithm is the first practical implementation for finding maximal independent sets in hypergraphs. Due to space constraints, we only present the betweenness centrality, maximal independent set, and k-core decomposition algorithms in this section, and present the remaining algorithms in the supplementary material. Our pseudocode uses partially evaluated functions, i.e., invoking a function with fewer than all of its arguments gives a function that takes the remaining arguments as input.

5.1 Betweenness Centrality

The betweenness centrality (BC) [25] of a vertex v measures the fraction of shortest paths between all pairs of vertices that pass through v. In this paper, we consider BC on unweighted hypergraphs, although the definition extends to weighted hypergraphs. More formally, let σ_{s,t} be the number of shortest paths between vertices s and t, σ_{s,t}(v) be the number of shortest paths between s and t that pass through v, and δ_{s,t}(v) = σ_{s,t}(v)/σ_{s,t}. The betweenness centrality of vertex v is defined to be Σ_{s≠v≠t∈V} δ_{s,t}(v). Brandes [10] presents a sequential algorithm for computing BC on graphs that takes O(|V||E|) work, where each vertex v does a forward traversal to compute the number of shortest paths from v to every other vertex, and a backward traversal to compute the betweenness centrality contributions for all vertices from shortest paths starting at v. Each traversal takes O(|E|) work. Brandes defines the dependency of a vertex s on v as δ_{s•}(v) = Σ_{t∈V} δ_{s,t}(v), and the traversals from s compute δ_{s•} values for all other vertices. The betweenness centrality of a vertex v will then be Σ_{s∈V} δ_{s•}(v). This algorithm has been parallelized in the literature (see, e.g., [64, 75, 82, 84]).

Puzis et al. [69] present a sequential algorithm for computing betweenness centrality in hypergraphs based on Brandes' algorithm. In the forward phase, a breadth-first-search-like procedure is run, generating a predecessor set for each vertex and hyperedge containing all elements in the previous level of the search. Let P_V(v) be the predecessor hyperedges of vertex v and P_E(e) be the predecessor vertices of hyperedge e for the search from source s. σ_{s,v} will be computed as Σ_{u ∈ P_E(e) : e ∈ P_V(v)} σ_{s,u}. This phase takes O(n_v + Σ_{e∈E}(deg⁺(e) · deg⁻(e))) work, as each hyperedge is expanded once per incoming vertex. Note that this work complexity can be significantly larger than the size of the hypergraph. The backward phase computes the dependency scores by iteratively propagating values from vertices to their predecessor hyperedges, and from hyperedges to their predecessor vertices, starting from the furthest elements from the source. The update equation for a hyperedge e is shown in Equation 1, and for a vertex v in Equation 2.

    \hat{\delta}_s(e) = \sum_{v \,:\, e \in P_V(v)} \frac{\delta_{s\bullet}(v)}{\sigma_{s,v}}   (1)

    \delta_{s\bullet}(v) = 1 + \sum_{e \,:\, v \in P_E(e)} \left( \sigma_{s,v} \cdot \hat{\delta}_s(e) \right)   (2)

By separating the vertex and hyperedge updates, each hyperedge and vertex only needs to be processed once, and the total work of the backward phase is O(|H|). Puzis et al. [69] also propose a heuristic for merging vertices that belong to only a single hyperedge, but the theoretical complexity remains the same.
In this section, we present a new parallel BC algorithm on hypergraphs that takes linear work per source vertex. We represent vertices and hyperedges at equal distance from the source as frontiers using VertexSets and HyperedgeSets, and process each frontier in parallel. We split the updates in the forward phase into separate update steps for vertices and hyperedges, so that each hyperedge only needs to be expanded once, giving linear work. The backward phase processes the frontiers in decreasing distance from the source, using fetch-and-adds to update the δ̂_s and δ_{s•} values. Computing exact BC scores would require running the algorithm from all sources, although in practice a subset of sources is used to compute approximate BC scores [2, 27]. As far as we know, this is the first parallel BC algorithm on hypergraphs.

Algorithm 1 Pseudocode for BC in Ligra-H
1: NumPathsV = {0, ..., 0}
2: NumPathsE = {0, ..., 0}
3: VisitedV = {0, ..., 0}
4: VisitedE = {0, ..., 0}
5: DependenciesV = {0, ..., 0}
6: DependenciesE = {0, ..., 0}
7: Levels = []
8: procedure PATHUPDATE(NumPathsSrc, NumPathsDst, s, d)
9:   return (FAA(&NumPathsDst[d], NumPathsSrc[s]) == 0)
10: procedure CHECK(Visited, i)
11:   return (Visited[i] == 0)
12: procedure VISIT(Visited, i)
13:   Visited[i] = 1
14: procedure VISITVERTEXBACK(v)
15:   VisitedV[v] = 1
16:   DependenciesV[v] += 1
17: procedure VTOE(v, e)
18:   FAA(&DependenciesE[e], DependenciesV[v]/NumPathsV[v])
19: procedure ETOV(e, v)
20:   FAA(&DependenciesV[v], DependenciesE[e] × NumPathsV[v])
21: procedure BC(H, src)  ▷ src is the source vertex
22:   NumPathsV[src] = 1, VisitedV[src] = 1
23:   VertexSet FrontierV = {src}
24:   HyperedgeSet FrontierE = {}
25:   currLevel = 0
26:   while (true) do
27:     FrontierE = VERTEXPROP(H, FrontierV, PATHUPDATE(NumPathsV, NumPathsE), CHECK(VisitedE))
28:     if |FrontierE| == 0 then break
29:     HYPEREDGEMAP(FrontierE, VISIT(VisitedE))
30:     Levels[currLevel++] = FrontierE
31:     FrontierV = HYPEREDGEPROP(H, FrontierE, PATHUPDATE(NumPathsE, NumPathsV), CHECK(VisitedV))
32:     if |FrontierV| == 0 then break
33:     VERTEXMAP(FrontierV, VISIT(VisitedV))
34:     Levels[currLevel++] = FrontierV
35:   VisitedV = {0, ..., 0}, VisitedE = {0, ..., 0}
36:   TRANSPOSE(H)
37:   currLevel = currLevel − 1
38:   while currLevel ≥ 0 do
39:     FrontierV = Levels[currLevel--]
40:     VERTEXMAP(FrontierV, VISITVERTEXBACK)
41:     VERTEXPROP(H, FrontierV, VTOE, CHECK(VisitedE))
42:     FrontierE = Levels[currLevel--]
43:     HYPEREDGEMAP(FrontierE, VISIT(VisitedE))
44:     HYPEREDGEPROP(H, FrontierE, ETOV, CHECK(VisitedV))
45:   return DependenciesV

The pseudocode for our BC algorithm from a single source is shown in Algorithm 1. We initialize auxiliary arrays, as well as the DependenciesV array storing the final dependency scores, on Lines 1–6. The forward phase of the algorithm is shown on Lines 22–37. We first set the number of paths for the source vertex to 1, mark it as visited, and place it on the initial frontier, represented as a VertexSet (Lines 22–23). While there are still reachable hyperedges and vertices, we repeatedly propagate the number of paths from vertices to hyperedges via VERTEXPROP on Line 27, and from hyperedges to vertices via HYPEREDGEPROP on Line 31. The function PATHUPDATE (Lines 8–9) passed to VERTEXPROP and HYPEREDGEPROP increments the number of paths of a successor element using a fetch-and-add. In contrast to [69], we first gather the number of paths at a hyperedge from all of its predecessor vertices before passing it to its successor vertices. In this way, a hyperedge only needs to visit each of its successor vertices once, passing the sum of all contributions from predecessor vertices. Duplicates in the output do not need to be removed, as PATHUPDATE returns true only for the first update on the target. The CHECK function (Lines 10–11) passed to VERTEXPROP and HYPEREDGEPROP guarantees that only unexplored vertices and hyperedges are visited. We mark visited hyperedges and vertices on Lines 29 and 33, respectively, to ensure that each hyperedge and vertex is visited at most once. Each frontier that is explored is placed in the Levels array, so that we can explore the frontiers in a backward fashion in the second phase of the algorithm.
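The fetch-and-add trick in PATHUPDATE is worth spelling out; the following minimal C++ sketch (using std::atomic as a stand-in for the FAA primitive of Section 3, with our own function name) shows how one atomic operation both accumulates the path count and deduplicates the output frontier.

    #include <atomic>

    // Exactly one updater observes the previous value 0, so only that updater
    // returns true and places the target on the next frontier.
    bool path_update(std::atomic<long>& num_paths_dst, long num_paths_src) {
      // fetch_add returns the value held *before* the addition.
      return num_paths_dst.fetch_add(num_paths_src) == 0;
    }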
The backward phase of the algorithm is shown on Lines 35–44. We reuse the arrays VisitedV and VisitedE (Line 35). We transpose the hypergraph (Line 36) and explore the frontiers from the first phase in a backward fashion. Line 40 uses a VERTEXMAP with the VISITVERTEXBACK function (Lines 14–16) to mark vertices on the frontier as visited and add 1 to their dependency score, as required in Equation 2. Line 41 uses a VERTEXPROP with the VTOE function (Lines 17–18) on predecessors (obtained by considering unexplored vertices via the CHECK function), which implements Equation 1. Line 43 marks hyperedges on the frontier as visited with a HYPEREDGEMAP. Finally, Line 44 implements the sum in Equation 2 with the ETOV function (Lines 19–20) on predecessors.

Analysis. We analyze the complexity for a single source vertex. In the forward phase of BC, each vertex and hyperedge will appear in at most one frontier, because once a vertex or hyperedge has been visited, its VisitedV or VisitedE entry will be marked, and it will fail the check by the CHECK function in subsequent iterations. Therefore, the sum of the sizes of all frontiers plus their out-degrees will be O(|H|). As VERTEXPROP and HYPEREDGEPROP do work proportional to the size of the input set plus the sum of its out-degrees, the overall work performed by the algorithm is O(|H|), which is work-efficient. Each call to VERTEXPROP and HYPEREDGEPROP takes O(log |H|) depth, and so the overall depth is O(D log |H|), where D is the diameter of the hypergraph. The backward phase processes each frontier exactly once, giving the same work and depth bounds. Thus, the overall work is O(|H|) and the depth is O(D log |H|).

5.2 Maximal Independent Set

Given an undirected, unweighted hypergraph, an independent set is a subset of vertices U ⊆ V such that no hyperedge has all of its incident vertices in U. In a graph, this definition is equivalent to the condition that no two vertices in the independent set are neighbors, although this equivalence does not hold for hypergraphs. A maximal independent set (MIS) is an independent set that is not contained in a larger independent set. Finding maximal independent sets in parallel has been widely studied for graphs, and there exist linear-work parallel algorithms for the problem [8, 57]. However, the problem is much harder to solve on hypergraphs, and the total work of known parallel algorithms is super-linear [3, 7, 43, 45]. As far as we know, there have been no implementations of parallel MIS algorithms on hypergraphs.

This paper implements a variant of the Beame-Luby MIS algorithm [3], which is a core component of a more recent algorithm by Bercea et al. [7]. The algorithm is iterative and performs the following steps in each iteration:
(1) Generate a sample of vertices I, each sampled with probability p = 1/(2^{d+1}∆), where d = max_{e∈E} deg(e) and ∆ is the normalized degree as defined in [3, 7].
(2) For any hyperedge e that has all of its vertices in I, remove all vertices in e from I.
(3) Add the remaining vertices in I to the MIS and delete them from V.
(4) Remove the vertices in I from all remaining hyperedges.
(5) Remove hyperedges whose vertex set is a subset of another hyperedge's vertex set.
(6) Remove hyperedges that contain only one vertex, and remove those vertices from V.
Our implementation picks vertices with a constant probability p = 1/3, as we found that this performs better in practice, and does not perform Step (5), which is not needed for correctness. The pseudocode is shown in Algorithm 2.


Algorithm 2 Pseudocode for MIS in Ligra-H
1: Flags = {0, ..., 0}
2: Counts = {0, ..., 0}
3: procedure SAMPLE(round, v)
4:   With probability p, set Flags[v] = round
5: procedure COUNT(e, v)
6:   FAA(&Counts[e], 1)
7: procedure RESETNGH(e, v)
8:   Flags[v] = 0
9: procedure FILTERV(v)
10:   return (Flags[v] == 0)
11: procedure FILTERE(e)
12:   if (deg(e) == 1 and Flags[ngh0(e)] == 0) then
13:     Flags[ngh0(e)] = 1
14:   return (deg(e) > 1)
15: procedure IND(e)
16:   return (Counts[e] == deg(e))
17: procedure RESET(e)
18:   Counts[e] = 0
19: procedure CHECKF(round, v)
20:   return (Flags[v] == round)
21: procedure MIS(H)
22:   VertexSet FrontierV = {0, ..., nv − 1}  ▷ all vertices
23:   HyperedgeSet FrontierE = {0, ..., ne − 1}  ▷ all hyperedges
24:   round = 1
25:   while (|FrontierV| > 0) do
26:     round++
27:     VERTEXMAP(FrontierV, SAMPLE(round))
28:     HYPEREDGEMAP(FrontierE, RESET)
29:     HYPEREDGEPROP(H, FrontierE, COUNT, CHECKF(round))
30:     HyperedgeSet FullEdges = HYPEREDGEMAP(FrontierE, IND)
31:     HYPEREDGEPROP(H, FullEdges, RESETNGH, CHECKF(round))
32:     HYPEREDGEFILTERNGH(H, FrontierE, FILTERV)
33:     FrontierE = HYPEREDGEMAP(FrontierE, FILTERE)
34:     FrontierV = VERTEXMAP(FrontierV, FILTERV)
35:   return Flags

Algorithm 3 Pseudocode for Coreness in Ligra-H
1: D = {deg(v0), ..., deg(v_{nv−1})}  ▷ initialized to vertex degrees
2: Flags = {0, ..., 0}  ▷ initialized to all 0
3: procedure REMOVEHYPEREDGE(v, e)
4:   return CAS(&Flags[e], 0, 1)
5: procedure CHECKREMOVED(e)
6:   return (Flags[e] == 0)
7: procedure UPDATED(k, v, numNghs)
8:   if D[v] > k then
9:     D[v] = max(D[v] − numNghs, k)
10:     return (v, D[v])
11:   else return (⊥, ⊥)
12: procedure CORENESS(H)
13:   B = MAKEBUCKETS(nv, D, INCREASING), finished = 0
14:   while (finished < nv) do
15:     (k, VertexSet FrontierV) = NEXTBUCKET(B)
16:     finished += |FrontierV|
17:     HyperedgeSet FrontierE = VERTEXPROP(H, FrontierV, REMOVEHYPEREDGE, CHECKREMOVED)
18:     Moved = HYPEREDGEPROPCOUNT(H, FrontierE, UPDATED(k))
19:     UPDATEBUCKETS(B, Moved)
20:   return D

Our implementation uses a Flags array to represent the status of vertices, with a value of Flags[v] = 0 meaning that v is undecided, Flags[v] = 1 indicating that v is not in the MIS, and any other value indicating that v is in the MIS. Flags is initialized to all 0's on Line 1. We also initialize an auxiliary array Counts, which will be used to count the number of incident vertices of hyperedges selected in the random sample (Line 2). We create initial frontiers containing all vertices and hyperedges (FrontierV and FrontierE on Lines 22–23). We also keep track of the round number (Lines 24 and 26). Line 27 uses a VERTEXMAP with the function SAMPLE (Lines 3–4) to sample vertices by marking their Flags value with the round number with probability p. We reset the Counts values for hyperedges on the frontier on Line 28. On Line 29, we count for each hyperedge the number of its vertices that were selected in the sample for this round, using HYPEREDGEPROP with the COUNT (Lines 5–6) and CHECKF (Lines 19–20) functions. We then check which hyperedges had all of their vertices selected in the sample on Line 30 with a HYPEREDGEMAP with the IND function, which checks whether the count is equal to the hyperedge's cardinality (Lines 15–16). The HyperedgeSet FullEdges contains the hyperedges where this is true, and we unmark the Flags values of their vertices on Line 31 using HYPEREDGEPROP with the RESETNGH function (Lines 7–8). On Line 32, we remove vertices that have been selected in the MIS from the hyperedges using HYPEREDGEFILTERNGH with the FILTERV function (Lines 9–10), so that we do not need to process them in future rounds. Line 33 updates the hyperedge frontier by filtering out hyperedges with cardinality 0 and 1 using the FILTERE function (Lines 11–14). For hyperedges with cardinality 1, we mark their only vertex (ngh0) as not being in the MIS. Line 34 updates the vertex frontier with the FILTERV function by filtering out vertices whose status has already been decided. The algorithm terminates when the status of all vertices has been decided, at which point FrontierV will be empty.
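As a sketch of the per-round sampling step (Lines 3–4 of Algorithm 2) with the constant probability p = 1/3 used by our implementation: the hash-based coin flip below makes the decision deterministic per (vertex, round), which is convenient for a parallel VERTEXMAP; the splitmix64-style mixer and function names are our own choices, not necessarily what the implementation uses.

    #include <cstdint>

    // Any good 64-bit mixing hash works here (this is splitmix64).
    inline uint64_t hash64(uint64_t x) {
      x += 0x9E3779B97F4A7C15ULL;
      x = (x ^ (x >> 30)) * 0xBF58476D1CE4E5B9ULL;
      x = (x ^ (x >> 27)) * 0x94D049BB133111EBULL;
      return x ^ (x >> 31);
    }

    inline bool sample(long* Flags, long round, long v) {
      // Tag v with the round number with probability ~1/3.
      if (hash64((uint64_t)v ^ ((uint64_t)round << 32)) % 3 == 0) {
        Flags[v] = round;  // v joins this round's sample I
      }
      return true;         // VERTEXMAP keeps v in the frontier either way
    }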
5.3 k-core Decomposition

For an undirected, unweighted hypergraph, a k-core is a maximal connected sub-hypergraph where every vertex has induced degree at least k. The coreness problem is to compute for each vertex the largest value of k for which it is part of the k-core. A simple parallel algorithm for coreness iteratively removes all vertices with degree at most k along with their incident hyperedges, starting with k = 0, assigning removed vertices a coreness value of k, and incrementing k when all remaining vertices have induced degree greater than k [40]. Since each iteration requires scanning over all remaining vertices, this algorithm requires a total of O(|H| + ρ|V|) work, where ρ is the number of iterations required by the algorithm, also known as the peeling complexity [21].

This section presents a new linear-work algorithm for computing coreness on hypergraphs based on the linear-work algorithm for graphs by Dhulipala et al. [21]. We have also implemented the O(|H| + ρ|V|)-work coreness algorithm in Ligra-H, and compare the performance of the work-inefficient algorithm and the work-efficient algorithm in Section 6. To obtain work-efficiency, our algorithm uses the bucketing data structure described in Section 4. The pseudocode of our algorithm is shown in Algorithm 3.

An array D is initialized with the degrees of the vertices (Line 1). This array will keep track of the induced degrees of the vertices, and also store the final coreness values of the vertices. An array Flags (Line 2) is used to keep track of whether a hyperedge has been deleted (0 means not deleted and 1 means deleted). Line 13 initializes the bucketing structure, specifying that the buckets should be processed in increasing order. Line 15 gets the next non-empty bucket in increasing order, and returns k, which corresponds to the current k-core being processed, as well as the vertices with degree at most k in a VertexSet FrontierV. Line 17 marks the hyperedges incident to FrontierV as deleted using VERTEXPROP with the functions REMOVEHYPEREDGE (Lines 3–4) and CHECKREMOVED (Lines 5–6). Hyperedges that are deleted will be returned in the HyperedgeSet FrontierE, and duplicates do not need to be removed, since the CAS on Line 4 will return true exactly once per hyperedge. On Line 18, we update the induced degrees of the vertices due to the removal of hyperedges using a HYPEREDGEPROPCOUNT. The UPDATED function (Lines 7–11) will decrement the induced degree of each vertex v by its number of neighbors in FrontierE (numNghs), and set it to k if it falls below k, since this means v will have a coreness value of k. UPDATED returns a pair indicating the target bucket of the vertex v, which is its new induced degree D[v] (Line 10). For vertices whose coreness value has already been determined, the pair (⊥, ⊥) is returned (Line 11). These pairs are stored in the Moved array output by HYPEREDGEPROPCOUNT. Line 19 moves the vertices to new buckets using UPDATEBUCKETS with the Moved array as input. UPDATEBUCKETS ignores the (⊥, ⊥) pairs. The algorithm terminates when all vertices have been extracted from the bucket structure and processed.

Analysis. Each hyperedge will place each of its incident vertices in the Moved array only when it is deleted. Therefore, the total size of the sets passed to UPDATEBUCKETS is O(Σ_{e∈E} deg(e)). The number of identifiers in the bucket structure is n_v, and the number of buckets is at most the maximum vertex degree, which is O(n_e). The total number of calls to UPDATEBUCKETS and NEXTBUCKET is the peeling complexity ρ. Using Lemma 1, we obtain an expected work of O(|H|) and depth of O(ρ log |H|) with high probability.
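A minimal C++ sketch of the UPDATED logic (Lines 7–11 of Algorithm 3) follows; kNullPair plays the role of (⊥, ⊥), which UPDATEBUCKETS ignores, and the names are ours.

    #include <algorithm>
    #include <utility>

    static const std::pair<long, long> kNullPair(-1, -1);

    std::pair<long, long> updated(long* D, long k, long v, long numNghs) {
      if (D[v] > k) {
        D[v] = std::max(D[v] - numNghs, k);  // induced degree never drops below k
        return {v, D[v]};                    // move v to bucket D[v]
      }
      return kNullPair;                      // v's coreness already determined
    }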
6 Experiments

Experimental Setup. We run all of our experiments on a 72-core Dell PowerEdge R930 (with two-way hyper-threading) with four 2.4GHz 18-core E7-8867 v4 Xeon processors, each with a 45MB cache, and a total of 1TB of RAM. Our programs use Cilk Plus [53] for parallelism and are compiled with the g++ compiler (version 5.5.0) with the -O3 flag. By using Cilk's work-stealing scheduler, we are able to obtain an expected running time of W/P + O(D) for an algorithm with W work and D depth on P processors [9]. Ligra-H also supports compilation with OpenMP.

For the parallel experiments, we use the command numactl -i all to balance the memory allocations across the sockets. All of the speedup numbers we report are the running times of our parallel implementation on 72 cores with hyper-threading over the running time of the implementation on a single thread.

Data Sets. Our input hypergraphs are shown in Table 1. com-Orkut and Friendster are constructed using the community data from [54], where each community is a hyperedge containing its members as vertices. These are the largest real-world datasets used by prior work on hypergraph processing [36, 41]. We also include three larger real-world datasets, Orkut-group, Web, and LiveJournal, which are constructed from bipartite graphs from the Koblenz Network Collection [50]. To test on larger inputs, we also constructed synthetic random hypergraphs. Rand1 and Rand2 have 10^8 and 10^9 vertices/hyperedges, respectively, where the cardinality of each hyperedge is 10 and its member vertices are chosen uniformly at random. Rand3 has 10^7 vertices/hyperedges, and the cardinality of each hyperedge is 100, with its member vertices chosen uniformly at random. For input-size scalability experiments, we also generated random hypergraphs with varying sizes and hyperedge cardinalities. For SSSP, we use weighted versions of the hypergraphs with random hyperedge weights from 1 to ⌊log2(max(n_v, n_e))⌋. The inputs are all undirected.

Results. Table 2 shows the sequential and parallel running times of our algorithms, as well as their parallel speedup. The BC times are for a single source, and the PageRank times are for 1 iteration. For k-core, we include times for both the work-efficient (WE) version and the work-inefficient (WI) version. We did not include hyperpaths in our experiments because the hyperpaths found in the inputs are too small to give meaningful running times. For the Orkut-group, Web, and LiveJournal inputs, we used the edge-aware parallelization scheme due to their highly skewed degree distributions.

Overall, the algorithms get good parallel speedup, ranging from 8.5–76.5x, and the parallel times on the real-world inputs are usually under 1 second. The random hypergraphs are larger than hypergraphs used in prior work, and we are able to achieve parallel running times on the order of seconds for Rand1 and Rand3 and tens of seconds for Rand2. The lower speedups for k-core on the real-world inputs are due to the large number of peeling rounds (see Table 1), many of which have few active vertices and hyperedges.

We see that our work-efficient k-core algorithm is usually much faster than the work-inefficient version, by a factor of up to 733x in parallel, as it does less work. The benefit is higher for the inputs with more peeling rounds (e.g., the Web hypergraph).

Figure 2 shows the running time vs. number of threads for all of the algorithms on Rand1. We see good parallel scalability for all of the algorithms, with speedups ranging from 31–53x on 72 cores with hyper-threading.

Figure 3 shows the running time vs. hyperedge count for all of the algorithms on random hypergraphs with 10^7 vertices and cardinality-10 hyperedges (we also tried fixing the vertex and hyperedge count and varying the cardinality, and found similar trends). We see a near-linear increase in running time for all of the algorithms except hypertree and k-core, which have a sub-linear increase. For hypertree, the number of edges traversed increases sub-linearly due to the direction optimization, which avoids many edge traversals. For k-core, the peeling complexity, and hence the running time, does not increase linearly with the number of hyperedges.


Hypergraph    |V|        |E|        Σ_{e∈E} deg(e)  max_v deg(v)  max_e deg(e)  Peeling rounds (ρ)  Clique-expanded edges
com-Orkut     2.32×10^6  1.53×10^7  1.07×10^8       2958          9120          1698                3.87×10^10
Friendster    7.94×10^6  1.62×10^6  2.35×10^7       1700          9299          351                 5.53×10^9
Orkut-group   2.78×10^6  8.73×10^6  3.27×10^8       40425         3.18×10^5     2923                2.45×10^12
Web           2.77×10^7  1.28×10^7  1.41×10^8       1.1×10^6      1.16×10^7     3.18×10^5           1.06×10^14
LiveJournal   3.20×10^6  7.49×10^6  1.12×10^8       300           1.05×10^6     820                 2.7×10^12
Rand1         10^8       10^8       10^9            34            10            30                  4.45×10^9
Rand2         10^9       10^9       10^10           35            10            33                  4.5×10^10
Rand3         10^7       10^7       10^9            153           100           109                 4.95×10^10

Table 1. Hypergraph inputs.

              com-Orkut            Friendster           Rand1                Rand2
Algorithm     T1     T72    SU     T1     T72    SU     T1     T72    SU     T1     T72    SU
Hypertree     1.04   0.031  33.5   0.803  0.022  36.5   24.3   0.676  35.9   321    8.97   35.8
BC            5.31   0.12   44.3   2.7    0.07   38.6   131.0  3.72   35.2   1890   39.4   48.0
CC            7.87   0.162  48.6   3.34   0.082  40.5   330.0  9.88   33.4   4190   121    34.6
PageRank      3.31   0.083  39.9   0.941  0.026  36.2   84.1   2.61   32.2   955    28.6   33.4
SSSP          8.81   0.157  56.1   3.54   0.107  76.5   290.0  6.76   43.3   5730   79.8   71.8
MIS           7.73   0.227  34.1   3.26   0.11   29.6   154.0  4.19   36.8   1680   44.5   37.8
WE k-core     7.09   0.738  9.61   2.09   0.081  25.8   116.0  2.17   53.5   2210   31.8   69.5
WI k-core     33.9   1.88   18.0   9.72   0.421  23.1   133.0  3.4    39.1   1150   33.6   34.2

              Rand3                Orkut-group          Web                  LiveJournal
Algorithm     T1     T72    SU     T1     T72    SU     T1     T72    SU     T1     T72    SU
Hypertree     2.18   0.047  46.4   0.551  0.021  26.2   2.71   0.068  39.9   0.754  0.022  34.3
BC            40.8   0.82   49.8   7.58   0.141  53.8   10.7   0.517  20.7   3.66   0.099  37.0
CC            70.5   1.07   65.9   11.6   0.18   64.4   11.0   0.478  23.0   4.01   0.081  49.5
PageRank      57.3   1.6    35.8   6.88   0.119  57.8   5.27   0.27   19.5   2.66   0.062  42.9
SSSP          54.0   1.0    54.0   14.7   0.261  56.3   7.04   0.245  28.7   5.56   0.118  47.1
MIS           68.0   1.09   62.4   9.86   0.411  24.0   17.8   2.09   8.52   4.92   0.434  11.3
WE k-core     41.0   0.866  47.3   13.7   0.831  16.5   12.5   0.965  13.0   6.13   0.325  18.9
WI k-core     87.3   1.18   74.0   96.6   3.34   28.9   23500  707.0  33.2   16.9   0.905  18.7

Table 2. Sequential (T1) and parallel (T72) times (seconds), as well as the parallel speedup (SU), on a 72-core machine with hyper-threading.
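As a sanity check on Table 2, the speedup column is simply the single-thread time divided by the 72-core time; for example, for PageRank on com-Orkut:

    \mathrm{SU} = \frac{T_1}{T_{72}} = \frac{3.31\,\mathrm{s}}{0.083\,\mathrm{s}} \approx 39.9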

Figure 2. Running time (seconds) vs. number of threads on Rand1; "72h" refers to 144 hyper-threads. [plot omitted]

Figure 4. Running times of dense, sparse, and hybrid traversals on com-Orkut using 72 cores with hyper-threading. [plot omitted]

Figure 3. Running time (seconds) vs. number of hyperedges on 72 cores with hyper-threading. [plot omitted]

Figures 4 and 5 show the impact of the direction optimization on com-Orkut and LiveJournal. We plot the running time using all sparse traversals, all dense traversals, and hybrid traversals with the default threshold of a 1/20 fraction of the sum of in-degrees of the hyperedges for VERTEXPROP and of the sum of in-degrees of the vertices for HYPEREDGEPROP. For all of the algorithms, we see that the hybrid traversal is the fastest or tied for the fastest among the three cases. We found the default threshold to work reasonably well across all our applications and inputs, and the performance was similar across a wide range of thresholds (experiments showing the running time with varying thresholds are provided in the supplementary material).

In Table 3, we report the fraction of cycles stalled due to memory accesses, the LLC miss rate, and the memory bandwidth for several algorithms on com-Orkut, Rand1, and LiveJournal. We see that the cache miss rate and memory bandwidth are highest for the random hypergraph, Rand1, as the edges have very little locality. The memory bandwidth is close to the peak bandwidth of the machine, and the algorithms on Rand1 are memory bandwidth-bound. com-Orkut and LiveJournal exhibit locality in their structure, and thus have lower cache miss rates and require fewer requests to DRAM, thereby lowering the memory bandwidth. However, a decent fraction of the cycles are still stalled waiting for memory accesses, making the algorithms memory latency-bound. All of the algorithms benefit from spatial locality when traversing the adjacency lists in the bipartite representation.

              com-Orkut                       Rand1                           LiveJournal
Algorithm     Stalled  LLC miss  BW (GB/s)    Stalled  LLC miss  BW (GB/s)    Stalled  LLC miss  BW (GB/s)
Hypertree     0.364    0.122     144.7        0.726    0.551     161.7        0.28     0.161     135.5
BC            0.462    0.111     131.1        0.804    0.808     161.8        0.35     0.097     120.1
CC            0.444    0.089     134.4        0.837    0.833     147.5        0.40     0.044     126.1
PageRank      0.79     0.24      123.1        0.927    0.933     146.6        0.69     0.163     104.0
SSSP          0.5      0.098     132.1        0.842    0.781     146.2        0.49     0.057     123.2
WE k-core     0.367    0.358     53.86        0.573    0.422     140.1        0.39     0.286     72.5

Table 3. Fraction of cycles stalled on memory requests, LLC miss rate, and memory bandwidth (GB/s). All experiments use 72 cores with hyper-threading.

Figure 5. Running times of dense, sparse, and hybrid traversals on LiveJournal using 72 cores with hyper-threading. [plot omitted]

Comparison with Alternatives. While it is difficult to directly compare with HyperX and MESH, as they are designed for distributed memory, we first perform a rough comparison in terms of the running times reported in their papers [36, 41]. MESH reports a running time of about 1 minute per iteration on com-Orkut using a cluster of eight 12-core machines, and HyperX reports a running time of about 10s using a cluster of eight 4-core machines (HyperX's algorithm is for random walks, which does less work than PageRank per iteration). In contrast, one iteration of Ligra-H's PageRank on com-Orkut takes 0.083s on 72 cores and 3.31s on one thread. Even adjusting for differences in processor specifications, we are significantly faster than their reported parallel numbers using just a single thread, and orders of magnitude faster in parallel. The large difference in performance of MESH and HyperX compared to Ligra-H is due to the higher communication costs of distributed memory and the overheads of Spark.

We also ran MESH on our 72-core machine and did a sweep of the parameter space (partition strategy and number of partitions); the best running time we obtained for one iteration of PageRank on com-Orkut was over 2 minutes, which is much slower than Ligra-H's time. MESH reports competitive performance with HyperX [36], and so we expect the performance of HyperX to be in the same ballpark. For frontier-based algorithms, our speedups would be even higher, as HyperX and MESH require work proportional to the hypergraph size on every iteration, whereas we only do work proportional to the frontier size plus the sum of its out-degrees. We ran the single-source shortest paths algorithm from MESH (which works on unit weights and so is similar to our hypertree algorithm) on our 72-core machine and observed that for com-Orkut just the first iteration takes over 1 minute. This is much slower than the Ligra-H time for running the algorithm to convergence.

As mentioned in Section 4.2, another method for representing a hypergraph is to create a clique among all vertices for each hyperedge, store the result as a graph (known as the clique-expanded graph), and apply graph algorithms on it. This approach would work for algorithms that do not treat hyperedges differently from vertices, such as hypertree, connected components, and single-source shortest paths (for algorithms that treat the hyperedges specially, this approach would generate incorrect results). We show the number of edges in the clique-expanded graph for each of our inputs in Table 1. We see that the sizes are several orders of magnitude greater than those of the corresponding hypergraphs using the bipartite graph representation. As a baseline, we ran Ligra's breadth-first search, connected components, and SSSP implementations on the clique-expanded graph for Friendster (which is 235x larger than the bipartite representation) on 72 cores. Breadth-first search took 0.061s, which is 2.8x slower than Ligra-H's hypertree implementation (see Table 2). Connected components took 2.35s, which is 28.7x slower than Ligra-H. SSSP took 3.27s, which is 30.6x slower than Ligra-H. The overhead is due to additional edge traversals in the clique-expanded graph. However, the running-time overhead is not as high as the space overhead, since the clique-expanded graph is much denser and has better locality. The overhead is only 2.8x for breadth-first search, since the dense traversal optimization allows many edges to be skipped.
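To see why the clique-expanded sizes in Table 1 are so large, note that a hyperedge of cardinality c expands to c(c−1)/2 clique edges; for Rand2's 10^9 cardinality-10 hyperedges, for example:

    \binom{10}{2} \times 10^{9} = 45 \times 10^{9} = 4.5 \times 10^{10} \text{ clique-expanded edges}

which matches the Rand2 entry in Table 1. For the real-world inputs, a few very large hyperedges make the blowup far more extreme (e.g., about 10^14 edges for Web).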
7 Conclusion

We have presented a suite of parallel hypergraph algorithms with strong theoretical guarantees. We implemented the algorithms by extending the Ligra graph processing framework to handle hypergraphs. Our experiments show that the algorithms achieve good parallel scalability and significantly better performance than prior work. Future work includes extending graph optimizations for locality and scalability (e.g., [5, 48, 52, 79, 86, 90, 91, 95]) to hypergraphs.

Acknowledgements. This research was supported by DOE Early Career Award #DE-SC0018947, NSF CAREER Award #CCF-1845763, DARPA SDH Award #HR0011-18-3-0007, and Applications Driving Architectures (ADA) Research Center, a JUMP Center co-sponsored by SRC and DARPA.

References
[1] Giorgio Ausiello, Giuseppe F. Italiano, and Umberto Nanni. 1998. Hypergraph traversal revisited: Cost measures and dynamic algorithms. In Mathematical Foundations of Computer Science. 1–16.
[2] David A. Bader, Shiva Kintali, Kamesh Madduri, and Milena Mihail. 2007. Approximating betweenness centrality. In Workshop on Algorithms and Models for the Web-Graph (WAW). 124–137.
[3] Paul Beame and Michael Luby. 1990. Parallel Search for Maximal Independence Given Minimal Dependence. In ACM-SIAM Symposium on Discrete Algorithms (SODA). 212–218.
[4] Scott Beamer, Krste Asanovic, and David Patterson. 2012. Direction-optimizing breadth-first search. In ACM/IEEE International Conference for High Performance Computing, Networking, Storage and Analysis (SC). Article 12, 12:1–12:10 pages.
[5] S. Beamer, K. Asanovic, and D. Patterson. 2017. Reducing Pagerank Communication via Propagation Blocking. In IEEE International Parallel and Distributed Processing Symposium (IPDPS). 820–831.
[6] Abdelghani Bellaachia and Mohammed Al-Dhelaan. 2013. Random Walks in Hypergraph. In International Conference on Applied Mathematics and Computational Methods.
[7] Ioana O. Bercea, Navin Goyal, David G. Harris, and Aravind Srinivasan. 2017. On Computing Maximal Independent Sets of Hypergraphs in Parallel. ACM Trans. Parallel Comput. 3, 1, Article 5 (Jan. 2017), 5:1–5:13 pages.
[8] Guy E. Blelloch, Jeremy T. Fineman, and Julian Shun. 2012. Greedy sequential maximal independent set and matching are parallel on average. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA). 308–317.
[9] Robert D. Blumofe and Charles E. Leiserson. 1999. Scheduling Multithreaded Computations by Work Stealing. J. ACM 46, 5 (Sept. 1999), 720–748.
[10] Ulrik Brandes. 2001. A Faster Algorithm for Betweenness Centrality. Journal of Mathematical Sociology 25 (2001), 163–177.
[11] Richard P. Brent. 1974. The Parallel Evaluation of General Arithmetic Expressions. J. ACM 21, 2 (April 1974), 201–206.
[12] Alain Bretto, Hocine Cherifi, and Driss Aboutajdine. 2002. Hypergraph imaging: an overview. Pattern Recognition 35, 3 (2002), 651–658.
[13] S. Brin and L. Page. 1998. The Anatomy of a Large-Scale Hypertextual Web Search Engine. In Computer Networks and ISDN Systems. 107–117.
[14] U. V. Catalyurek and C. Aykanat. 1999. Hypergraph-partitioning-based decomposition for parallel sparse-matrix vector multiplication. IEEE Transactions on Parallel and Distributed Systems 10, 7 (Jul 1999), 673–693.
[15] Umit V. Catalyurek, Erik G. Boman, Karen D. Devine, Doruk Bozdağ, Robert T. Heaphy, and Lee Ann Riesen. 2009. A repartitioning hypergraph model for dynamic load balancing. J. Parallel and Distrib. Comput. 69, 8 (2009), 711–724.
[16] L. Chen, X. Huo, B. Ren, S. Jain, and G. Agrawal. 2015. Efficient and Simplified Parallel Graph Processing over CPU and MIC. In IEEE International Parallel and Distributed Processing Symposium (IPDPS). 819–828.
[17] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. 2009. Introduction to Algorithms (3rd ed.). MIT Press.
[18] Roshan Dathathri, Gurbinder Gill, Loc Hoang, Hoang-Vu Dang, Alex Brooks, Nikoli Dryden, Marc Snir, and Keshav Pingali. 2018. Gluon: A Communication-optimizing Substrate for Distributed Heterogeneous Graph Analytics. In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI). 752–768.
[19] Roshan Dathathri, Gurbinder Gill, Loc Hoang, and Keshav Pingali. 2019. Phoenix: A Substrate for Resilient Distributed Graph Analytics. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). 615–630.
[20] Karen D. Devine, Erik G. Boman, Robert T. Heaphy, Rob H. Bisseling, and Umit V. Catalyurek. 2006. Parallel Hypergraph Partitioning for Scientific Computing. In International Conference on Parallel and Distributed Processing (IPDPS). 124–124.
[21] Laxman Dhulipala, Guy Blelloch, and Julian Shun. 2017. Julienne: A Framework for Parallel Graph Algorithms Using Work-efficient Bucketing. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA). 293–304.
[22] Laxman Dhulipala, Guy E. Blelloch, and Julian Shun. 2018. Theoretically Efficient Parallel Graph Algorithms Can Be Fast and Scalable. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA). 393–404.
[23] Aurelien Ducournau and Alain Bretto. 2014. Random walks in directed hypergraphs and application to semi-supervised image segmentation. Computer Vision and Image Understanding 120 (2014), 91–102.
[24] D. Ediger, R. McColl, J. Riedy, and D. A. Bader. 2012. STINGER: High performance data structure for streaming graphs. In IEEE Conference on High Performance Extreme Computing (HPEC). 1–5.
[25] Linton Freeman. 1977. A set of measures of centrality based upon betweenness. Sociometry 40 (1977), 35–41.
[26] Giorgio Gallo, Giustino Longo, Stefano Pallottino, and Sang Nguyen. 1993. Directed hypergraphs and applications. Discrete Applied Mathematics 42, 2 (1993), 177–201.
[27] Robert Geisberger, Peter Sanders, and Dominik Schultes. 2008. Better Approximation of Betweenness Centrality. In Algorithms Engineering and Experiments (ALENEX). 90–100.
[28] Gurbinder Gill, Roshan Dathathri, Loc Hoang, Andrew Lenharth, and Keshav Pingali. 2018. Abelian: A Compiler for Graph Analytics on Distributed, Heterogeneous Platforms. In Euro-Par. 249–264.
[29] Joseph Gonzalez, Yucheng Low, Haijie Gu, Danny Bickson, and Carlos Guestrin. 2012. PowerGraph: Distributed Graph-Parallel Computation on Natural Graphs. In USENIX Symposium on Operating System Design and Implementation (OSDI). 17–30.
[30] Samuel Grossman, Heiner Litz, and Christos Kozyrakis. 2018. Making Pull-based Graph Processing Performant. In ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP). 246–260.
[31] Yan Gu, Julian Shun, Yihan Sun, and Guy E. Blelloch. 2015. A Top-Down Parallel Semisort. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA). 24–34.
[32] Torben Hagerup. 1992. Fast and optimal simulations between CRCW PRAMs. In Annual Symposium on Theoretical Aspects of Computer Science (STACS). 45–56.
[33] Harshvardhan, Adam Fidel, Nancy M. Amato, and Lawrence Rauchwerger. 2012. The STAPL Parallel Graph Library. In Languages and Compilers for Parallel Computing (LCPC). 46–60.
[34] Harshvardhan, Adam Fidel, Nancy M. Amato, and Lawrence Rauchwerger. 2014. KLA: a new algorithmic paradigm for parallel graph computations. In International Conference on Parallel Architectures and Compilation (PACT). 27–38.
[35] Harshvardhan, Adam Fidel, Nancy M. Amato, and Lawrence Rauchwerger. 2015. An Algorithmic Approach to Communication Reduction in Parallel Graph Algorithms. In International Conference on Parallel Architecture and Compilation (PACT). 201–212.
[36] Benjamin Heintz, Rankyung Hong, Shivangi Singh, Gaurav Khandelwal, Corey Tesdahl, and Abhishek Chandra. 2019. MESH: A Flexible Distributed Hypergraph Processing System. In IEEE International Conference on Cloud Engineering (IC2E).


[37] C. Hong, A. Sukumaran-Rajam, J. Kim, and P. Sadayappan. 2017. MultiGraph: Efficient Graph Processing on GPUs. In International Conference on Parallel Architectures and Compilation Techniques (PACT). 27–40.
[38] J. Jaja. 1992. Introduction to Parallel Algorithms. Addison-Wesley Professional.
[39] Louis Jenkins, Tanveer Hossain Bhuiyan, Sarah Harun, Christopher Lightsey, David Mentgen, Sinan G. Aksoy, Timothy Stavenger, Marcin Zalewski, Hugh R. Medal, and Cliff Joslyn. 2018. Chapel HyperGraph Library (CHGL). In IEEE High Performance Extreme Computing Conference (HPEC). 1–6.
[40] Jiayang Jiang, Michael Mitzenmacher, and Justin Thaler. 2017. Parallel Peeling Algorithms. ACM Trans. Parallel Comput. 3, 1, Article 7 (Jan. 2017), 7:1–7:27 pages.
[41] W. Jiang, J. Qi, J. X. Yu, J. Huang, and R. Zhang. 2019. HyperX: A Scalable Hypergraph Framework. IEEE Transactions on Knowledge and Data Engineering 31, 5 (2019), 909–922.
[42] U. Kang, Charalampos E. Tsourakakis, and Christos Faloutsos. 2011. PEGASUS: mining peta-scale graphs. Knowl. Inf. Syst. 27, 2 (2011), 303–325.
[43] Richard M. Karp, Eli Upfal, and Avi Wigderson. 1988. The Complexity of Parallel Search. J. Comput. Syst. Sci. 36, 2 (April 1988), 225–253.
[44] G. Karypis, R. Aggarwal, V. Kumar, and S. Shekhar. 1999. Multilevel hypergraph partitioning: applications in VLSI domain. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 7, 1 (March 1999), 69–79.
[45] Pierre Kelsen. 1992. On the Parallel Complexity of Computing a Maximal Independent Set in a Hypergraph. In ACM Symposium on Theory of Computing (STOC). 339–350.
[46] Jeremy Kepner, Peter Aaltonen, David A. Bader, Aydin Buluç, Franz Franchetti, John R. Gilbert, Dylan Hutchison, Manoj Kumar, Andrew Lumsdaine, Henning Meyerhenke, Scott McMillan, Carl Yang, John D. Owens, Marcin Zalewski, Timothy G. Mattson, and José E. Moreira. 2016. Mathematical foundations of the GraphBLAS. In IEEE High Performance Extreme Computing Conference (HPEC). 1–9.
[47] Gaurav Khanna, Nagavijayalakshmi Vydyanathan, T. Kurc, U. Catalyurek, P. Wyckoff, J. Saltz, and P. Sadayappan. 2005. A hypergraph partitioning based approach for scheduling of tasks with batch-shared I/O. In IEEE International Symposium on Cluster Computing and the Grid (CCGRID). 792–799.
[48] Vladimir Kiriansky, Yunming Zhang, and Saman P. Amarasinghe. 2016. Optimizing Indirect Memory References with milk. In International Conference on Parallel Architectures and Compilation (PACT). 299–312.
[49] Sriram Krishnamoorthy, Umit Catalyurek, Jarek Nieplocha, Atanas Rountev, and P. Sadayappan. 2006. Hypergraph Partitioning for Automatic Memory Hierarchy Management. In ACM/IEEE Conference on Supercomputing (SC).
[50] Jérôme Kunegis. 2013. KONECT: The Koblenz Network Collection. In International Conference on World Wide Web (WWW). 1343–1350.
[51] Aapo Kyrola, Guy Blelloch, and Carlos Guestrin. 2012. GraphChi: Large-Scale Graph Computation on Just a PC. In USENIX Symposium on Operating Systems Design and Implementation (OSDI). 31–46.
[52] Kartik Lakhotia, Rajgopal Kannan, and Viktor Prasanna. 2018. Accelerating PageRank using Partition-Centric Processing. In USENIX Annual Technical Conference (ATC). 427–440.
[53] Charles E. Leiserson. 2010. The Cilk++ concurrency platform. J. Supercomputing 51, 3 (2010).
[54] Jure Leskovec and Andrej Krevl. 2019. SNAP Datasets: Stanford Large Network Dataset Collection. http://snap.stanford.edu/data.
[55] Heng Lin, Xiaowei Zhu, Bowen Yu, Xiongchao Tang, Wei Xue, Wenguang Chen, Lufei Zhang, Torsten Hoefler, Xiaosong Ma, Xin Liu, Weimin Zheng, and Jingfang Xu. 2018. ShenTu: Processing Multi-trillion Edge Graphs on Millions of Cores in Seconds. In Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC). 56:1–56:11.
[56] Yucheng Low, Joseph Gonzalez, Aapo Kyrola, Danny Bickson, Carlos Guestrin, and Joseph M. Hellerstein. 2010. GraphLab: A New Parallel Framework for Machine Learning. In Conference on Uncertainty in Artificial Intelligence (UAI). 340–349.
[57] Michael Luby. 1986. A simple parallel algorithm for the maximal independent set problem. SIAM J. Comput. 15, 4 (November 1986), 1036–1055.
[58] P. Macko, V. J. Marathe, D. W. Margo, and M. I. Seltzer. 2015. LLAMA: Efficient graph analytics using Large Multiversioned Arrays. In IEEE International Conference on Data Engineering (ICDE). 363–374.
[59] Saeed Maleki, G. Carl Evans, and David A. Padua. 2015. Tiled Linear Algebra a System for Parallel Graph Algorithms. In Languages and Compilers for Parallel Computing. 116–130.
[60] Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. 2010. Pregel: a system for large-scale graph processing. In ACM Conference on Management of Data (SIGMOD). 135–146.
[61] Robert Ryan McCune, Tim Weninger, and Greg Madey. 2015. Thinking Like a Vertex: A Survey of Vertex-Centric Frameworks for Large-Scale Distributed Graph Processing. ACM Comput. Surv. 48, 2, Article 25 (Oct. 2015), 39 pages.
[62] Frank McSherry, Michael Isard, and Derek G. Murray. 2015. Scalability! But at What COST? In USENIX Conference on Hot Topics in Operating Systems (HotOS).
[63] Ke Meng, Jiajia Li, Guangming Tan, and Ninghui Sun. 2019. A pattern based algorithmic autotuner for graph processing on GPUs. In ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP). 201–213.
[64] Donald Nguyen, Andrew Lenharth, and Keshav Pingali. 2013. A Lightweight Infrastructure for Graph Analytics. In ACM Symposium on Operating Systems Principles (SOSP). 456–471.
[65] Lars Relund Nielsen, Kim Allan Andersen, and Daniele Pretolani. 2005. Finding the K Shortest Hyperpaths. Comput. Oper. Res. 32, 6 (June 2005), 1477–1497.
[66] Amir Hossein Nodehi Sabet, Junqiao Qiu, and Zhijia Zhao. 2018. Tigr: Transforming Irregular Graphs for GPU-Friendly Graph Processing. In International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS). 622–636.
[67] Sreepathi Pai and Keshav Pingali. 2016. A Compiler for Throughput Optimization of Graph Algorithms on GPUs. In ACM SIGPLAN International Conference on Object-Oriented Programming, Systems, Languages, and Applications (OOPSLA). 1–19.
[68] Zhen Peng, Alexander Powell, Bo Wu, Tekin Bicer, and Bin Ren. 2018. Graphphi: efficient parallel graph processing on emerging throughput-oriented architectures. In International Conference on Parallel Architectures and Compilation Techniques (PACT). 9:1–9:14.
[69] Rami Puzis, Manish Purohit, and V. S. Subrahmanian. 2013. Betweenness computation in the single graph representation of hypergraphs. Social Networks 35, 4 (2013), 561–572.
[70] S. Riazi and B. Norris. 2016. GraphFlow: Workflow-based big graph processing. In IEEE International Conference on Big Data (Big Data). 3336–3343.
[71] A. Ritz, B. Avent, and T. M. Murali. 2017. Pathway Analysis with Signaling Hypergraphs. IEEE/ACM Transactions on Computational Biology and Bioinformatics 14, 5 (Sept 2017), 1042–1055.
[72] D. Sengupta, S. L. Song, K. Agarwal, and K. Schwan. 2015. GraphReduce: processing large-scale graphs on accelerator-based systems. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC). 1–12.
[73] Xuanhua Shi, Zhigao Zheng, Yongluan Zhou, Hai Jin, Ligang He, Bo Liu, and Qiang-Sheng Hua. 2018. Graph Processing on GPUs: A Survey. ACM Comput. Surv. 50, 6, Article 81 (Jan. 2018), 81:1–81:35 pages.


[74] Yossi Shiloach and Uzi Vishkin. 1982. An O(log n) Parallel Connectivity Algorithm. J. Algorithms 3, 1 (1982), 57–67.
[75] Julian Shun and Guy E. Blelloch. 2013. Ligra: A Lightweight Graph Processing Framework for Shared Memory. In ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP). 135–146.
[76] Julian Shun, Laxman Dhulipala, and Guy E. Blelloch. 2014. A Simple and Practical Linear-Work Parallel Algorithm for Connectivity. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA). 143–153.
[77] Julian Shun, Laxman Dhulipala, and Guy E. Blelloch. 2015. Smaller and Faster: Parallel Processing of Compressed Graphs with Ligra+. In IEEE Data Compression Conference (DCC). 403–412.
[78] Shuang Song, Xu Liu, Qinzhe Wu, Andreas Gerstlauer, Tao Li, and Lizy K. John. 2018. Start Late or Finish Early: A Distributed Graph Processing System with Redundancy Reduction. PVLDB 12, 2 (2018), 154–168.
[79] Jiawen Sun, Hans Vandierendonck, and Dimitrios S. Nikolopoulos. 2017. GraphGrind: Addressing Load Imbalance of Graph Partitioning. In International Conference on Supercomputing (ICS). Article 16, 16:1–16:10 pages.
[80] Jiawen Sun, Hans Vandierendonck, and Dimitrios S. Nikolopoulos. 2019. VEBO: a vertex- and edge-balanced ordering heuristic to load balance parallel graph processing. In ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP). 391–392.
[81] Narayanan Sundaram, Nadathur Satish, Md Mostofa Ali Patwary, Subramanya R. Dulloor, Michael J. Anderson, Satya Gautam Vadlamudi, Dipankar Das, and Pradeep Dubey. 2015. GraphMat: High Performance Graph Analytics Made Productive. Proc. VLDB Endow. 8, 11 (July 2015), 1214–1225.
[82] Guangming Tan, Dengbiao Tu, and Ninghui Sun. 2009. A Parallel Algorithm for Computing Betweenness Centrality. In International Conference on Parallel Processing (ICPP). 340–347.
[83] Aleksandar Trifunovic and William J. Knottenbelt. 2008. Parallel multilevel algorithms for hypergraph partitioning. J. Parallel and Distrib. Comput. 68, 5 (2008), 563–581.
[84] Dengbiao Tu and Guangming Tan. 2009. Characterizing Betweenness Centrality Algorithm on Multi-core Architectures. In IEEE International Symposium on Parallel and Distributed Processing with Applications (ISPA). 182–189.
[85] Lei Wang, Liangji Zhuang, Junhang Chen, Huimin Cui, Fang Lv, Ying Liu, and Xiaobing Feng. 2018. Lazygraph: lazy data coherency for replicas in distributed graph-parallel computation. In ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP). 276–289.
[86] Hao Wei, Jeffrey Xu Yu, Can Lu, and Xuemin Lin. 2016. Speedup Graph Processing by Graph Ordering. In ACM International Conference on Management of Data (SIGMOD). 1813–1828.
[87] Da Yan, Yingyi Bu, Yuanyuan Tian, and Amol Deshpande. 2017. Big Graph Analytics Platforms. Foundations and Trends in Databases 7, 1-2 (2017), 1–195.
[88] Jie Yan, Guangming Tan, Zeyao Mo, and Ninghui Sun. 2016. Graphine: Programming Graph-Parallel Computation of Large Natural Graphs for Multicore Clusters. IEEE Trans. Parallel Distrib. Syst. 27, 6 (2016), 1647–1659.
[89] Matei Zaharia, Reynold S. Xin, Patrick Wendell, Tathagata Das, Michael Armbrust, Ankur Dave, Xiangrui Meng, Josh Rosen, Shivaram Venkataraman, Michael J. Franklin, Ali Ghodsi, Joseph Gonzalez, Scott Shenker, and Ion Stoica. 2016. Apache Spark: A Unified Engine for Big Data Processing. Commun. ACM 59, 11 (Oct. 2016), 56–65.
[90] Kaiyuan Zhang, Rong Chen, and Haibo Chen. 2015. NUMA-aware Graph-structured Analytics. In ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP). 183–193.
[91] Yunming Zhang, Vladimir Kiriansky, Charith Mendis, Matei Zaharia, and Saman P. Amarasinghe. 2017. Making Caches Work for Graph Analytics. In IEEE International Conference on Big Data (BigData). 293–302.
[92] Yunming Zhang, Mengjiao Yang, Riyadh Baghdadi, Shoaib Kamil, Julian Shun, and Saman Amarasinghe. 2018. GraphIt: A High-performance Graph DSL. Proc. ACM Program. Lang. 2, OOPSLA, Article 121 (Oct. 2018), 121:1–121:30 pages.
[93] Peng Zhao, Chen Ding, Lei Liu, Jiping Yu, Wentao Han, and Xiao-Bing Feng. 2019. Cacheap: Portable and Collaborative I/O Optimization for Graph Processing. Journal of Computer Science and Technology 34, 3 (May 2019), 690–706.
[94] Dengyong Zhou, Jiayuan Huang, and Bernhard Schölkopf. 2006. Learning with Hypergraphs: Clustering, Classification, and Embedding. In International Conference on Neural Information Processing Systems (NIPS). 1601–1608.
[95] Xiaowei Zhu, Wenguang Chen, Weimin Zheng, and Xiaosong Ma. 2016. Gemini: A Computation-Centric Distributed Graph Processing System. In USENIX Symposium on Operating Systems Design and Implementation (OSDI). 301–316.


A Hypertrees

Given an unweighted hypergraph and a source vertex src, a hypertree contains all vertices and hyperedges reachable from src [26]. An algorithm that computes a hypertree outputs predecessor arrays for vertices and hyperedges, which specify, for each element, one of its predecessors on a shortest path from src. The predecessor of a hyperedge is a vertex, and vice versa. The sequential algorithm for generating hypertrees is similar to breadth-first search, and takes linear work in the size of the hypergraph [26]. Vertices are visited in order of their distance from the source, and each hyperedge is processed only the first time that a vertex visits it.

A parallel algorithm can be obtained by processing all vertices or hyperedges at the same distance from the source in parallel. This algorithm can be naturally implemented in Ligra-H using the VERTEXPROP and HYPEREDGEPROP functions. The frontiers of vertices or hyperedges at the same distance from src are maintained using VertexSets and HyperedgeSets. The algorithm is similar to Ligra's parallel breadth-first search implementation.

The pseudocode for implementing hypertrees in Ligra-H is shown in Algorithm 4. We initialize two arrays ParentsV and ParentsE, which store the ID of the predecessor for each vertex and hyperedge, respectively. A value of −1 indicates that the vertex or hyperedge has not yet been visited in the search. Line 7 sets the ParentsV entry of the source vertex src to ∞ (as it has no predecessor), and Line 8 places src on the initial frontier, which is a VertexSet that we call FrontierV. On Line 11 we apply a VERTEXPROP to the VertexSet FrontierV, which explores the hyperedges that the vertices in FrontierV are incoming vertices for. The function CHECK first checks whether the hyperedge has not been visited yet (its ParentsE value is −1), and if so, the UPDATE function is applied to the target hyperedge, which marks the ParentsE entry of the hyperedge with the ID of the vertex that first visits it. Any newly visited hyperedges are placed in the output HyperedgeSet, which we call FrontierE. Line 13 repeats the same logic using HYPEREDGEPROP over the HyperedgeSet FrontierE to visit unexplored vertices, generating an output frontier FrontierV. Since the CAS in the UPDATE function returns true exactly once for each newly explored element, no duplicates need to be removed. If either FrontierV or FrontierE is empty (Lines 12 and 14), then the algorithm has explored all vertices and hyperedges reachable from src, and terminates. Note that when using the dense traversal modes for VERTEXPROP and HYPEREDGEPROP, once an element finds an incoming neighbor on the frontier, it does not need to scan over its remaining edges, as subsequent applications of CHECK will return false.

Algorithm 4 Pseudocode for Hypertrees in Ligra-H
1: ParentsV = {−1, ..., −1}, ParentsE = {−1, ..., −1}
2: procedure UPDATE(Parents, s, d)
3:   return CAS(&Parents[d], −1, s)
4: procedure CHECK(Parents, i)
5:   return (Parents[i] == −1)
6: procedure HYPERTREE(H, src)  ▷ src is the source vertex
7:   ParentsV[src] = ∞
8:   VertexSet FrontierV = {src}
9:   HyperedgeSet FrontierE = {}
10:  while (true) do
11:    FrontierE = VERTEXPROP(H, FrontierV, UPDATE(ParentsE), CHECK(ParentsE))
12:    if |FrontierE| == 0 then break
13:    FrontierV = HYPEREDGEPROP(H, FrontierE, UPDATE(ParentsV), CHECK(ParentsV))
14:    if |FrontierV| == 0 then break
15:  return (ParentsV, ParentsE)

Analysis. Each vertex and hyperedge will appear in at most one frontier, because once a vertex or hyperedge has been visited, its entry in the ParentsV or ParentsE array will no longer be −1, and it will fail the check by the CHECK function in subsequent iterations. Therefore the sum of the sizes of all frontiers plus their out-degrees is O(|H|). As VERTEXPROP and HYPEREDGEPROP do work proportional to the size of the input set plus the sum of its out-degrees, the overall work performed by the algorithm is O(|H|), which is work-efficient. Each call to VERTEXPROP and HYPEREDGEPROP takes O(log |H|) depth, and so the overall depth is O(D log |H|), where D is the diameter of the hypergraph.
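For concreteness, the UPDATE/CHECK pair of Algorithm 4 can be expressed as a C++ function object in the style that Ligra uses for its traversal operators. The sketch below is purely illustrative: the struct and member names are ours (not Ligra-H's actual source), and it uses std::atomic in place of the framework's CAS primitive.

    #include <atomic>
    #include <cstdint>

    // Illustrative sketch, not Ligra-H's actual code. Parents points to the
    // shared ParentsV or ParentsE array, where -1 means unvisited.
    using ID = int64_t;
    constexpr ID UNVISITED = -1;

    struct HypertreeF {
      std::atomic<ID>* Parents;
      explicit HypertreeF(std::atomic<ID>* P) : Parents(P) {}

      // UPDATE: claim target d for source s. The CAS succeeds for exactly
      // one visiting thread, so d is placed on the output frontier once.
      bool update(ID s, ID d) {
        ID expected = UNVISITED;
        return Parents[d].compare_exchange_strong(expected, s);
      }

      // CHECK: true while d is unvisited; once it returns false, a dense
      // traversal can skip d's remaining incoming neighbors.
      bool cond(ID d) { return Parents[d].load() == UNVISITED; }
    };

The same functor works for both directions of the traversal, instantiated once with ParentsE and once with ParentsV.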
B Hyperpaths

Given an unweighted hypergraph and a source vertex src, a hyperpath is a maximal hypergraph containing all vertices reachable from src via cycle-free paths (i.e., paths in which no vertex appears in more than one hyperedge) [26]. The sequential algorithm for computing hyperpath trees [26] visits a hyperedge only when all incoming vertices of the hyperedge have visited it (instead of the first time an incoming vertex visits it). The algorithm takes linear work in the size of the hypergraph.

We implemented a parallel algorithm for computing a hyperpath tree in Ligra-H, which requires only minor changes to our hypertree algorithm. The pseudocode for our parallel algorithm is shown in Algorithm 5. Compared to Algorithm 4, we change the functions passed to VERTEXPROP to check whether all of the incoming vertices of a hyperedge have been visited. Instead of the ParentsE array, we initialize an array CountsE that stores the negation of the number of incoming vertices of each hyperedge. Whenever we visit a hyperedge via VERTEXPROP, we increment this value by 1 for each incoming vertex that visits it using a fetch-and-add, and place the hyperedge in the output frontier when the count becomes 0 (Lines 3–4). Checking whether a hyperedge has yet to be visited during VERTEXPROP simply requires checking whether its CountsE value is negative (Lines 5–6).

Algorithm 5 Pseudocode for Hyperpath Trees in Ligra-H
1: ParentsV = {−1, ..., −1}  ▷ initialized to all −1
2: CountsE = {−deg⁻(e_0), ..., −deg⁻(e_{ne−1})}  ▷ negated in-degrees
3: procedure INCREMENT(s, d)
4:   return (FAA(&CountsE[d], 1) == −1)
5: procedure CHECKVISITEDE(i)
6:   return (CountsE[i] < 0)
7: procedure VISIT(s, d)
8:   return (CAS(&ParentsV[d], −1, s))
9: procedure CHECKVISITEDV(i)
10:  return (ParentsV[i] == −1)
11: procedure HYPERPATH(H, src)  ▷ src is the source vertex
12:  ParentsV[src] = ∞
13:  VertexSet FrontierV = {src}
14:  HyperedgeSet FrontierE = {}
15:  while (true) do
16:    FrontierE = VERTEXPROP(H, FrontierV, INCREMENT, CHECKVISITEDE)
17:    if |FrontierE| == 0 then break
18:    FrontierV = HYPEREDGEPROP(H, FrontierE, VISIT, CHECKVISITEDV)
19:    if |FrontierV| == 0 then break
20:  return ParentsV

The analysis of the algorithm is similar to the analysis in Section A. The overall work is O(|H|) and the depth is O(L log |H|), where L is the length of the longest simple path in the resulting hyperpath tree.
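The counting trick on Lines 3–6 of Algorithm 5 translates similarly. Again, this is an illustrative sketch with our own names, using std::atomic's fetch_add for the FAA primitive:

    #include <atomic>
    #include <cstdint>

    // Illustrative sketch, not Ligra-H's actual code. CountsE[e] is
    // initialized to -deg-(e), the negated number of incoming vertices.
    using ID = int64_t;

    struct HyperpathF {
      std::atomic<ID>* CountsE;
      explicit HyperpathF(std::atomic<ID>* C) : CountsE(C) {}

      // INCREMENT: fetch_add returns the value before the increment, so the
      // result is -1 exactly for the incoming vertex whose visit brings the
      // count to 0; that single call places e on the output frontier.
      bool update(ID s, ID e) { return CountsE[e].fetch_add(1) == -1; }

      // CHECKVISITEDE: a hyperedge is unvisited while its count is < 0.
      bool cond(ID e) { return CountsE[e].load() < 0; }
    };

Because fetch_add returns the old value atomically, exactly one visitor observes −1, giving the same no-duplicates guarantee that the CAS provides in Algorithm 4.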
C Single-Source Shortest Paths

Given a weighted hypergraph and a source vertex src, the goal of single-source shortest paths (SSSP) is to compute the distance of the shortest path from src to every other reachable vertex in the hypergraph. We implement a parallel SSSP algorithm for hypergraphs based on the Bellman-Ford algorithm for SSSP on graphs [17]. The algorithm initializes the tentative shortest-path distances (SP values) of all vertices and hyperedges to ∞, except for the source vertex, which has a distance of 0. The algorithm repeatedly applies a RELAX procedure, which updates the SP value of each active vertex and hyperedge to the minimum of its original SP value and the SP value of an incoming neighbor plus the corresponding weight (an active element is one whose SP value changed in the previous iteration). If no SP values change in an iteration, then the shortest-path distances have been found, and the algorithm terminates. If the algorithm has not terminated after nv + ne iterations, then there is a negative cycle in the hypergraph, and the algorithm reports this and terminates.

The pseudocode for our SSSP implementation in Ligra-H is shown in Algorithm 6. We initialize the SP value of the source vertex to 0 (Line 12) and of all other vertices and hyperedges to ∞ (Lines 1–2). We also initialize arrays that indicate the visited statuses of vertices and hyperedges (Lines 3–4). The source vertex is placed onto the initial frontier (Line 13). While the iteration count is less than nv + ne, we repeatedly apply the RELAX procedure using VERTEXPROP and HYPEREDGEPROP (Lines 17 and 20) until no more elements have their SP values change. The RELAX procedure (Lines 5–8) uses a WRITEMIN to atomically update the target's SP value with the source's SP value plus the weight of the hyperedge involved (w(s, d)). If the WRITEMIN is successful, we use a CAS to set the visited status of the target; if the CAS is successful, the target is placed in the output frontier. The CAS guarantees that the target is placed on the output frontier exactly once, so removing duplicates can be avoided. The visited statuses of hyperedges and vertices are reset on Lines 18 and 21, respectively. If the number of iterations reaches nv + ne, then a negative cycle is reported (Line 23); otherwise the SP values are returned (Line 24).

Algorithm 6 Pseudocode for SSSP in Ligra-H
1: SPV = {∞, ..., ∞}
2: SPE = {∞, ..., ∞}
3: VisitedV = {0, ..., 0}
4: VisitedE = {0, ..., 0}
5: procedure RELAX(SrcSP, DstSP, Visited, s, d, w)
6:   if WRITEMIN(&DstSP[d], SrcSP[s] + w(s, d)) then
7:     return CAS(&Visited[d], 0, 1)
8:   else return 0
9: procedure RESET(Visited, i)
10:  Visited[i] = 0
11: procedure SSSP(H, src)
12:  SPV[src] = 0
13:  VertexSet FrontierV = {src}
14:  HyperedgeSet FrontierE = {}
15:  iter = 0
16:  while (iter++ < nv + ne) do
17:    FrontierE = VERTEXPROP(H, FrontierV, RELAX(SPV, SPE, VisitedE), TRUE)
18:    HYPEREDGEMAP(FrontierE, RESET(VisitedE))
19:    if |FrontierE| == 0 then break
20:    FrontierV = HYPEREDGEPROP(H, FrontierE, RELAX(SPE, SPV, VisitedV), TRUE)
21:    VERTEXMAP(FrontierV, RESET(VisitedV))
22:    if |FrontierV| == 0 then break
23:  if iter == nv + ne then return "Negative cycle"
24:  else return (SPV, SPE)

The work of this algorithm is O((nv + ne)|H|), as each iteration can process all vertices and hyperedges, and the depth is O((nv + ne) log |H|).

D Connected Components

Given an undirected, unweighted hypergraph, a connected component is a maximal set of vertices that can all reach one another via incident hyperedges. The label propagation technique can be used to compute the connected components of a hypergraph [36, 41]. The idea is to initialize the vertices with unique IDs and iteratively propagate the IDs of vertices to their neighbors, with each vertex storing the minimum ID among the IDs it receives and its own. At convergence, the IDs on the vertices partition them into connected components.

We implement the label propagation algorithm in Ligra-H, but we note that there are more efficient parallel connected components algorithms for graphs (e.g., [74, 76]) that could be applied to hypergraphs. Our implementation iteratively propagates vertex IDs to hyperedges and hyperedge IDs to vertices using VERTEXPROP and HYPEREDGEPROP, respectively, with the WRITEMIN function, until the frontier becomes empty. The output frontiers of VERTEXPROP and HYPEREDGEPROP contain only the elements whose IDs changed.
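The WRITEMIN primitive used here, and by RELAX in Algorithm 6, is a standard priority update rather than anything specific to Ligra-H. A minimal C++ sketch of one common implementation (ours, not necessarily the framework's exact code) is:

    #include <atomic>

    // Atomically lower *loc to val; returns true iff this call strictly
    // decreased the stored value. On CAS failure, cur is reloaded with the
    // current contents, and the loop exits once val >= cur.
    template <typename T>
    bool write_min(std::atomic<T>* loc, T val) {
      T cur = loc->load();
      while (val < cur) {
        if (loc->compare_exchange_weak(cur, val)) return true;
      }
      return false;
    }

Returning whether the write actually lowered the value is what lets the calling functions decide whether the target belongs on the output frontier.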

The pseudocode of our algorithm is shown in Algorithm 7. We initialize the array IDsV with unique IDs (Line 1). We use an array IDsE that stores the minimum ID passed to each hyperedge, and initialize it to all ∞ (Line 2). The auxiliary arrays PrevIDsV and PrevIDsE enable us to keep track of the first time an ID changes in an iteration (Lines 3–4). We initialize the first frontier to contain all of the vertices (Line 13). We iteratively call VERTEXPROP to propagate vertex IDs to hyperedges (Line 17), and HYPEREDGEPROP to propagate hyperedge IDs to vertices (Line 20), until the frontier becomes empty. Since all neighbors need to be explored, we pass the TRUE function, which always returns true, to VERTEXPROP and HYPEREDGEPROP. VERTEXPROP and HYPEREDGEPROP call the UPDATE function (Lines 5–9), which atomically stores the minimum of the source element's ID and the destination element's ID at the target using a WRITEMIN. If the WRITEMIN was successful, UPDATE checks whether the original ID before the WRITEMIN was equal to the target's previous ID, and if so returns true. This returns true at most once per target, and so VERTEXPROP and HYPEREDGEPROP do not need to remove duplicates. Prior to calling VERTEXPROP and HYPEREDGEPROP, the arrays storing the previous IDs are synchronized with the current IDs using HYPEREDGEMAP and VERTEXMAP (Lines 16 and 19) with the COPY function (Lines 10–11). Note that only the elements whose IDs have changed will be placed in the output frontier, and so the next iteration only needs to process those elements. This contrasts with the label propagation algorithms in [36, 41], which touch the entire hypergraph in every iteration.

Algorithm 7 Pseudocode for Label Propagation in Ligra-H
1: IDsV = {0, ..., nv − 1}  ▷ initialized to IDsV[i] = i
2: IDsE = {∞, ..., ∞}  ▷ initialized to all ∞
3: PrevIDsV = {0, ..., nv − 1}  ▷ initialized to PrevIDsV[i] = i
4: PrevIDsE = {∞, ..., ∞}  ▷ initialized to all ∞
5: procedure UPDATE(SrcIDs, DstIDs, DstPrevIDs, s, d)
6:   origID = DstIDs[d]
7:   if WRITEMIN(&DstIDs[d], SrcIDs[s]) then
8:     return (origID == DstPrevIDs[d])
9:   else return 0
10: procedure COPY(IDs, PrevIDs, i)
11:   PrevIDs[i] = IDs[i]
12: procedure CC(H)
13:   VertexSet FrontierV = {0, ..., nv − 1}  ▷ all vertices
14:   HyperedgeSet FrontierE = {}
15:   while (true) do
16:     HYPEREDGEMAP(FrontierE, COPY(IDsE, PrevIDsE))
17:     FrontierE = VERTEXPROP(H, FrontierV, UPDATE(IDsV, IDsE, PrevIDsE), TRUE)
18:     if |FrontierE| == 0 then break
19:     VERTEXMAP(FrontierV, COPY(IDsV, PrevIDsV))
20:     FrontierV = HYPEREDGEPROP(H, FrontierE, UPDATE(IDsE, IDsV, PrevIDsV), TRUE)
21:     if |FrontierV| == 0 then break
22:   return (IDsV, IDsE)

Analysis. Each iteration of the algorithm takes O(|H|) work and O(log |H|) depth, as the calls to VERTEXPROP and HYPEREDGEPROP could potentially process all vertices and hyperedges. For a hypergraph with diameter D, the overall work is O(D|H|) and the overall depth is O(D log |H|).

E PageRank

PageRank is an algorithm for computing the importance of vertices in a graph [13], and can be extended to hypergraphs [6, 36, 41]. We consider PageRank on unweighted, connected hypergraphs. The following update equations define the algorithm for a damping factor 0 ≤ α ≤ 1:

  PR[v] = (1 − α)/nv + α · Σ_{e ∈ N⁻(v)} PR[e]/deg⁺(e)    (3)

  PR[e] = Σ_{v ∈ N⁻(e)} PR[v]/deg⁺(v)    (4)

Vertices spread their ranks equally to the hyperedges that they are incoming vertices for in Equation 4, and hyperedges spread their ranks equally to their outgoing vertices in Equation 3. The update equations are applied iteratively until some convergence criterion is met (e.g., a maximum number of iterations is reached or the error falls below some threshold).

We implement PageRank in Ligra-H by iteratively calling VERTEXPROP to pass PageRank values from vertices to hyperedges and HYPEREDGEPROP to pass PageRank values from hyperedges to vertices. We also use VERTEXMAP to normalize the PageRank scores as required in Equation 3, and VERTEXMAP and HYPEREDGEMAP to reset arrays. The pseudocode for our implementation in Ligra-H is shown in Algorithm 8. The arrays VPR and EPR store the PageRank values for the vertices and hyperedges, respectively, and are initialized on Lines 1–2: the PageRank values for vertices are initialized to 1/nv, and the PageRank values for hyperedges are initialized to 0. Since all vertices and hyperedges need to be processed in each iteration, we create two frontiers containing all of the elements on Lines 10–11. We repeatedly call the VERTEXPROP and HYPEREDGEPROP functions on Lines 15 and 17 to compute the sums in Equation 4 and Equation 3, respectively. Each call to VERTEXPROP is preceded by a call to HYPEREDGEMAP, and vice versa, to reset the values in the appropriate PageRank array (Lines 14 and 16). Finally, on Line 18, a VERTEXMAP with the VERTEXF function (Lines 5–6) is used to multiply the PageRank contributions by α and add (1 − α)/nv to each vertex's PageRank, as needed for Equation 3.

Algorithm 8 Pseudocode for PageRank in Ligra-H
1: VPR = {1/nv, ..., 1/nv}
2: EPR = {0, ..., 0}
3: procedure UPDATE(SrcPR, DstPR, s, d)
4:   FAA(&DstPR[d], SrcPR[s]/deg⁺(s))
5: procedure VERTEXF(i)
6:   VPR[i] = (1 − α)/nv + α · VPR[i]
7: procedure RESET(PR, i)
8:   PR[i] = 0
9: procedure PAGERANK(H, maxIters)
10:  VertexSet FrontierV = {0, ..., nv − 1}  ▷ all vertices
11:  HyperedgeSet FrontierE = {0, ..., ne − 1}  ▷ all hyperedges
12:  iter = 0
13:  while (iter++ < maxIters) do
14:    HYPEREDGEMAP(FrontierE, RESET(EPR))
15:    VERTEXPROP(H, FrontierV, UPDATE(VPR, EPR), TRUE)
16:    VERTEXMAP(FrontierV, RESET(VPR))
17:    HYPEREDGEPROP(H, FrontierE, UPDATE(EPR, VPR), TRUE)
18:    VERTEXMAP(FrontierV, VERTEXF)
19:  return VPR
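To make Equations 3 and 4 concrete, the following sequential C++ sketch performs one synchronous iteration over a toy incidence-list representation. The representation and names are ours for illustration only; Ligra-H instead parallelizes both loops and accumulates with atomic fetch-and-adds, as in Algorithm 8.

    #include <cstddef>
    #include <vector>

    // One PageRank iteration under Equations (3) and (4) on a toy bipartite
    // representation (ours, for illustration): inV[e] lists the incoming
    // vertices of hyperedge e, outV[e] its outgoing vertices, and
    // degV[v] = deg+(v) > 0 (every vertex belongs to some hyperedge).
    void pagerank_iter(const std::vector<std::vector<int>>& inV,
                       const std::vector<std::vector<int>>& outV,
                       const std::vector<int>& degV,
                       std::vector<double>& VPR, std::vector<double>& EPR,
                       double alpha) {
      std::size_t nv = VPR.size(), ne = EPR.size();
      // Equation (4): each vertex splits its rank among its hyperedges.
      for (std::size_t e = 0; e < ne; e++) {
        EPR[e] = 0.0;
        for (int v : inV[e]) EPR[e] += VPR[v] / degV[v];
      }
      // Equation (3): each hyperedge splits its rank among its outgoing
      // vertices; then damp and add the teleportation term (1 - alpha)/nv.
      std::vector<double> next(nv, (1.0 - alpha) / nv);
      for (std::size_t e = 0; e < ne; e++)
        for (int v : outV[e]) next[v] += alpha * EPR[e] / outV[e].size();
      VPR.swap(next);
    }

For an undirected hypergraph, inV and outV are the same lists, and the two loops correspond exactly to the VERTEXPROP and HYPEREDGEPROP calls on Lines 15 and 17 of Algorithm 8.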

We can also implement the PageRank-Entropy algorithm from MESH [36], which computes the entropy of the ranks of the vertices in each hyperedge. This can be done with a VERTEXPROP call that passes the entropy contribution of each vertex's rank to each hyperedge that it is an incoming vertex for. The function passed to VERTEXPROP is provided in Algorithm 9.

Algorithm 9 Pseudocode for PageRank-Entropy in Ligra-H
1: Entropy = {0, ..., 0}  ▷ initialized to 0's
2: procedure UPDATE-ENTROPY(s, d)
3:   FAA(&Entropy[d], (VPR[s]/EPR[d]) × log2(EPR[d]/VPR[s]))
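To see what quantity UPDATE-ENTROPY accumulates, write p_s = VPR[s]/EPR[d] for the fraction of hyperedge d's rank contributed by an incoming vertex s. Each fetch-and-add then contributes −p_s log₂ p_s, so after all incoming vertices of d have been processed:

    \[
      \mathit{Entropy}[d]
        = \sum_{s \in N^-(d)} \frac{\mathit{VPR}[s]}{\mathit{EPR}[d]}
          \log_2\!\frac{\mathit{EPR}[d]}{\mathit{VPR}[s]}
        = -\sum_{s \in N^-(d)} p_s \log_2 p_s .
    \]

This is the Shannon entropy of the rank distribution within d provided that the p_s sum to 1, which holds exactly when EPR[d] equals the plain sum of the incoming VPR[s] values; this reading is our interpretation of Algorithm 9 rather than a statement about MESH's exact normalization [36].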

Each iteration of PageRank (and PageRank-Entropy) processes all vertices and hyperedges using VERTEXPROP and HYPEREDGEPROP. Therefore, the per-iteration work is O(|H|) and the depth is O(log |H|).

F Additional Experiments

We show the running time as a function of threshold for several applications on the com-Orkut and LiveJournal hypergraphs in Figures 6 and 7. We see that the performance is similar across a wide range of thresholds.

Figure 6. Running times as a function of threshold on com-Orkut using 72 cores with hyper-threading. [Plot: running time vs. threshold (fraction of sum of in-degrees, 10⁻⁸ to 1) for Hypertree, BC, CC, and SSSP.]

Figure 7. Running times as a function of threshold on LiveJournal using 72 cores with hyper-threading. [Plot: running time vs. threshold (fraction of sum of in-degrees, 10⁻⁸ to 1) for Hypertree, BC, CC, and SSSP.]