Efficient, Reachability-based, Parallel Algorithms for ∗ Finding Strongly Connected Components [Technical Report] Daniel Tomkins, Timmie Smith, Nancy M. Amato, Lawrence Rauchwerger Department of Computer Science and Engineering Texas A&M University College Station, TX 77845-3112

ABSTRACT ning [13]. We can efficiently analyze graph properties us- Large, complex graphs are increasingly used to represent ing sequential techniques, but many graphs, especially those unstructured data in scientific applications. Applications used in scientific and social network applications, exceed the such as discrete ordinates methods for radiation transport memory available to a single processor. require these graphs to be directed with all cycles eliminated, Unfortunately, directed graphs are generally difficult to where cycles are identified by finding the graph.s strongly analyze in parallel; many sequential techniques are inher- connected components (SCCs). Deterministic parallel algo- ently non-parallelizable [15]. Indeed, parallel graph algo- rithms identify SCCs in polylogarithmic time, but the work rithms have been studied for several decades, but for many of the algorithms on sparse graphs exceeds the linear sequen- problems we are still unable to reach the theoretical bounds tial complexity bound. Randomized algorithms based on provided by sequential algorithms. reachability — the ability to get from one vertex to another This paper considers one such problem: finding strongly along a directed path — are generally favorable for sparse connected components (SCCs) in a . An SCC graphs, but either do not perform well on all graphs or intro- is a maximal subgraph of a directed graph such that there is duce substantial overheads. This paper introduces two new a path from every SCC node to every other SCC node. SCCs algorithms, SCCMulti and SCCMulti-PriorityAware, which are used in compiler analysis [2], data mining, and schedul- perform well on all graphs without significant overhead. We ing [6]. They can be used to locate cycles and collapse cyclic also provide a characterization that compares reachability- graphs to their directed, acyclic graphs [16]. Sig- based SCC algorithms. Experimental results demonstrate nificantly, particle transport applications perform a traversal scalability for both algorithms to 393,216 of cores on several of the spatial domain in each direction of particle travel; in types of graphs. order for the traversals to terminate, cycled in the graph rep- resenting the spatial domain must be identified and removed [1]. Categories and Subject Descriptors: Theory of Com- The optimal sequential algorithm for finding SCCs is Tar- putation [Design and Analysis of Algorithms]: Parallel Al- jan’s linear-time algorithm [19], which depends on depth- gorithms first search (DFS). Unfortunately DFS is P-complete [15]. Keywords: Strongly Connected Components, Random- To overcome this, parallel SCC algorithm research initially ized Algorithms, Parallel Graph Algorithms, Parallel Com- focused on algorithms based on the of the plexity Theory graph [8, 9, 3]. However, computing transitive closure for sparse graphs is costly in the total amount of work and memory. Other deterministic algorithms improved the total 1. INTRODUCTION work, but only for certain subsets of graphs [12, 4]. Graphs are used in many fields, including particle transport Due to the cost of the deterministic algorithms, random- studies [1], social network analysis [7], and motion plan- ized algorithms are of interest [17, 6, 14, 16]. Currently, the best parallel SCC algorithms share a basic technique ∗This research supported in part by NSF awards CNS-0551685, CCF 0702765, CCF-0833199, CCF-1439145, CCF-1423111, CCF- and structure; we call this family of algorithms reachability- 0830753, IIS-0916053, IIS-0917266, EFRI-1240483, RI-1217991, based SCC (RB-SCC) algorithms, because they use directed by NIH NCI R25 CA090301-11, by DOE awards DE-AC02- paths from a given source node to separate the SCCs. How- 06CH11357, DE-NA0002376, B575363, by Samsung, IBM, Intel, ever, each of these RB-SCC algorithms has some drawback and by Award KUS-C1-016-04, made by King Abdullah Univer- that limits its functionality. For instance, the Divide-and- sity of Science and Technology (KAUST). Conquer Strong Components (DCSC) algorithm [6, 14] has a large set of worst case inputs; it performs poorly on graphs that have low average reachability. The MultiPivot algo- rithm [16] probabilistically guarantees its asymptotic com- plexity for all inputs, but uses extra computation to choose source nodes for the reachability queries. A third algo- rithm, NSCC [17], offers DCSC’s average case work and MultiPivot’s input-independence, but contains a substan- tial memory overhead and certain stateful requirements on

1 its reachability queries. We say that the predecessors of t reach t, and the succes- In this work, we present two new randomized RB-SCC al- sors of u are reached by u. gorithms, SCCMulti and SCCMulti-PriorityAware (SCCMulti-PA), that overcome several of the drawbacks Definition 2.6 (SCC). A strongly connected compo- of previous RB-SCC algorithms. Like MultiPivot, our al- nent (SCC) of a directed graph G is a maximal subgraph gorithms select multiple nodes and compute reachability in- S ⊆ G such that formation about them. Like DCSC, they use the information ∃path(u,t) ∀u,t ∈ S. obtained to remove many edges of the graph, fracturing it That is, it is a maximal subgraph such that every node has into many pieces. As we will show, both SCCMulti and a path to every other node. SCCMulti-PA obtain the fast speed of DCSC and NSCC, the input independence of MultiPivot and NSCC, and the The following property of SCCs, presented as Lemma 1 in low memory overhead of DCSC and MultiPivot. [6], is important throughout this work: The main contributions of this paper include • a survey of reachability-based SCC algorithms and a Lemma 2.7. Denote the SCC containing v ∈ V as SCCv. framework to compare them; Then SCCv = SUCC (v) PRED(v). • SCCMulti and SCCMulti-PA, two new randomized, reachability-based SCC algorithms; Definition 2.8 (TransitiveT Closure). For a graph • probabilistic guarantees that SCCMulti and SCC- G, the transitive closure of G =(V, E) is a graph TC(G) = ′ Multi-PA are expected to use the resources of O(log n) (V, E ) such that ∀u,t ∈ G, reachability queries; and t ∈ SUCC (u) ⇔ (u,t) ∈ E′. • experimental results that validate our algorithms’ the- That is, the transitive closure has an edge between any two oretical properties up to 393,216 cores. nodes which can reach each other. 2. PRELIMINARIES Before discussing our new algorithms, we review some graph 2.2 Parallel Graph Algorithms terminology and related algorithms. We will also survey pre- This section reviews useful classes of parallel graph algo- vious randomized RB-SCC algorithms in Section 3, which rithms. We first discuss algorithms for finding the reacha- will set the stage for our own algorithms, SCCMulti and bility of a vertex, which is closely related to the SCC prop- SCCMulti-PA. erty. Then we discuss deterministic, parallel SCC algo- rithms, which perform prohibitive total work on sparse gra- 2.1 Graph Terminology phs and lead to the investigation of randomized algorithms. Definition 2.1 (Digraph). A directed graph (or di- graph) G is a pair (V, E), where V is a non-empty set of 2.2.1 Parallel Reachability Algorithms nodes (equivalently, vertices) and E ⊆ {(u,t): u,t ∈ V } is Any algorithm that can take v ∈ V and return SUCC (v) a set of directed edges, or ordered pairs of elements of V . is called a reachability algorithm, and a single call to one of these algorithms is called a reachability query. Existing In the context of this paper, we frequently call a digraph a parallel algorithms for this problem include Spencer’s par- graph and a directed edge an edge. We also say that vertex allel breadth-first-search (PBFS) [18, 17] and the algorithm v ∈ G if v ∈ V and that edge (u,t) ∈ G if (u,t) ∈ E. As is of Ullman and Yannakakis [21]. common in , we denote |V | by n and |E| by m. In this work, we assume that RB-SCC algorithms are Definition 2.2 (Transpose Graph). For a graph G, provided with a black-box, parallel reachability algorithm, the transpose or reverse graph of G = (V, E) is a graph RQ-fw(G, v, mk), that takes a graph G, a vertex v ∈ G, and G′ = (V, E′) where (u,t) ∈ E ⇔ (t,u) ∈ E′. That is, G′ is a mark mk and modifies G, marking all nodes in SUCC (v) G with all edges reversed. with mk. That is, SUCC (v) is the set of nodes marked with an mk after a call to RQ-fw(G, v, mk). All marks are stored Definition 2.3 (Path). For a graph G and nodes u,t ∈ on the vertices, not in any auxiliary structure. To be a black- G, a directed path (or path) is a sequence of edges from one box algorithm, RQ-fw(·) requires that the vertex provide an vertex to another; i.e., interface for being marked; this typically involves setting a path(u,t)=((v1 = u, v2), (v2, v3),..., (vl−1, vl = t)), flag on the vertex, although the marks could be a more com- where (vi, vi+1) ∈ G for all i ∈ [1,l − 1]. Furthermore, plex data structure. We also assume that RQ-bw(G, v, mk), ∃path(u,u) ∈ G for any u ∈ G; this is a “trivial path” of which marks all nodes in PRED(v), is provided; this is just length 0. RQ-fw(·) on the transpose graph. We assume that the algorithm used for RQ-fw(G, v, mk) Reachability expresses the ability to get from one vertex to is a state-of-the-art, parallel reachability algorithm, such as another vertex through some path. That is, the reachability PBFS or Ullman and Yannakakis’s algorithm. is the transitive closure of the edge set; a node u reaches a node t iff there is a path from u to t. 2.2.2 Deterministic Parallel SCC Algorithms The SCC problem is readily seen to be in the class NC. As- Definition 2.4 (Predecessors). For a graph G = sume that we are given an ordering on the nodes and the (V, E) and vertex t ∈ G, the predecessors of t is the set transitive closure of the graph. This allows the development PRED(t) = {u : ∃path(u,t) ∈ G} . of a na¨ıve O(log n) time, O(n2) processor algorithm for find- ing SCCs: For every u ∈ G, perform a parallel reduction on Definition 2.5 (Successors). For a graph G = the transitive closure to find the lowest t ∈ G which reaches (V, E) and vertex u ∈ G, the successors of u is the set and is reached by u, using this t as the SCC marking for u. SUCC (u) = {t : ∃path(u,t) ∈ G} . This uses O(log n) time and O(n2) processors.

2 Algorithm Year Reach. Queries Memory per Node Note NSCC [18] 1991 O(log n) O(f(G)) Must use PBFS [18]; f(·) is sublinear DCSC [6] 2000 Inputb Dependent O(1) O(n) depth on some inputs MultiPivot [16] 2007 O(log2 n) O(1) – b SCCMulti (This Paper) 2015 O(log n) O(1) – b b SCCMulti-PA (This Paper) 2015 O(log n) O(1) Uses specialized reachability query b Table 1. Summary of Reachability-Based SCC Algorithms; O(·) is defined in Section 3 b

Most deterministic, parallel SCC algorithms use matrix multiplication to find the transitive closure of the graph [8, line2 line3 line8 9, 3]. These algorithms solve the problem in O(logk n) time with O(n2+ǫ) processors (for specific k ≥ 2 and 0 < ǫ < 1 pairs). Other algorithms, such as Ullman and Yannakakis’s reachability algorithm, also calculate the transitive closure, but this takes O(n2) space to store. While this is a rea- sonable strategy for dense graphs (m is Θ(n2)), for sparse v v v graphs this total work far exceeds the linear bound of O(m). There exist efficient algorithms for planar graphs [12], Figure 1. Example execution of an RB-SCC algorithm. SUCC (v) has horizontal hatching, PRED(v) has vertical hatching, and SCCv even attaining O(n) work for large n [4], yet no work optimal has grid hatching. deterministic algorithm is known to find SCCs on arbitrary sparse graphs. Each of the algorithms presented in this section implement Because of the cost associated with these deterministic the general framework differently, primarily in how they se- algorithms, attention turned to randomized algorithms that lect pivots (line 2) and how they use the reachability infor- offer an expectation of solving the problem quickly. These mation to divide G (line 8). These differences lead to differ- algorithms are described in the next section. ences in performance. For instance, poorly selected pivots can cause imbalanced divisions and inefficient processing of 3. REACHABILITY-BASED SCC subpartitions, whereas investing too much computation in choosing a good pivot can lead to wasted work. ALGORITHMS 3.1.1 Parallel Complexity Many of the best-performing parallel SCC algorithms (Table The strength of RB-SCC algorithms derives from the fact 1) are quite similar. This paper introduces the Reachability- that most of their operations — selecting pivots and remov- Based SCC (RB-SCC) Algorithm framework and shows that ing edges — are easy to parallelize. Since these operations these randomized SCC algorithms fit this model. are more efficient than a reachability query, we are able to 3.1 RB-SCC Algorithm Framework bound the resources used by RB-SCC algorithms in terms The RB-SCC algorithms surveyed here share many features: of the number of reachability queries they use. they select a random pivot or set of pivots, test how these All of these algorithms are Las Vegas algorithms. This pivots reach the graph, and then use this information to di- means that, while they always produce the correct result, vide the problem into smaller subproblems by dividing the the resource bounds are not deterministic. For clarity, we ex- graph into subgraphs. The algorithms differ only in the press probabilistic bounds using the O(·) notation. Bound- methods used to select pivots and how the reachability in- ing a resource by O(f(n)) indicates that there exists a posi- formation is used to divide the graph. tive constant α such that the expectedb value of the resource is no more than αf(n). Algorithm 1 RB-SCC Algorithm Framework b Input: a digraph G Serial Reachability Queries. Assume that we are given a Output: G with SCCs marked set of vertices and asked to find the reachability of each one. 1: if G is empty then return There are two main approaches to solving this problem in 2: select a set of pivot nodes S ⊆ G parallel: we can run reachability queries serially (waiting for 3: for all v ∈ S parallel do the results of the first before beginning the second), or we 4: //parallel; modifies w.marks-fw for w ∈ SUCC (v) can start them in parallel. 5: RQ-fw(G,v,mkv) 6: //parallel; modifies w.marks-bw for w ∈ PRED(v) Using either of these schemes, the total work will be the 7: RQ-bw(G,v,mkv) same. However, the latter is a better use of parallelism; if 8: using marks, partition G into SCCs and other subgraphs the reachability of each vertex is small, then much of the 9: in parallel(repeat operations on partitions of G) system will be idle while performing any given reachabil- The general framework of an RB-SCC algorithm can be ity query. This presents the major challenge for RB-SCC seen in Algorithm 1, and an example of its execution is algorithms; waiting for small reachability queries is tempo- shown in Figure 1. In line 2, the algorithm selects a set of rally expensive, whereas starting too many big reachability pivot nodes; for our example, the pivot is labeled v. Then, queries before dividing the graph increases the total work. in the loop beginning in line 3, the algorithm finds the reach- 3.1.2 Trim Subroutine ability of each pivot (where marks-fw is represented by hori- Since every SCC discovered requires a reachability query, zontal hatching and marks-bw by vertical hatching). Finally, processing graphs made up of many small SCCs can be very in line 8 the algorithm divides the graph according to the expensive. McLendon et. al. [14] observed that determining reachability information. that a node is a source (i.e., a node with no predecessors)

3 or sink (i.e., a node with no successors) can be done locally, the graph into small pieces through two additional mech- and that any source or sink node is an SCC of size one. anisms: they remove all SCCs of size 2 using a modified Using this idea, they add a Trim preprocessing subroutine Trim step and then divide the graph based on its weakly which uses a single scan; this scan begins from all source or connected components (WCCs). Since each divided piece sink nodes, removes them, and then checks their neighbors has been identified, the modified DCSC algorithm is able to for the source or sink condition. recurse on each piece independently. Since Trim can be done with one scan of the graph that Finding the WCCs is extremely useful for DCSC, since uses no more time or work than a reachability query on on most inputs this avoids the worst case input. However, the entire graph, it affects neither the time nor the work of graphs can be constructed which have low average reacha- any of these RB-SCC algorithms — they already require at bility and still are not quickly broken into separate WCCs. least one reachability query per step. However, Trim can (For an example, see Appendix.) These graphs provide a provide a significant advantage on graphs containing a large new worst case for the modified DCSC, so it is still not percentage of size one SCCs, as is shown in [14]. probabilistically guaranteed to be efficient. 3.2 DCSC 3.3 MultiPivot The Divide-and-Conquer Strong Components (DCSC) algo- Schudy [16] describes MultiPivot, shown in Algorithm 3, rithm of Fleischer et al. [6, 14] fits very simply within the which chooses pivots more intelligently than DCSC. To se- general RB-SCC framework: DCSC randomly selects one lect a pivot, MultiPivot randomly permutes the graph node as the pivot, groups the nodes into SCCv, SUCC (v), nodes. It then uses a binary-search to find the minimal PRED(v), or OTHER and recurses on all groups but SCCv. k such that the first k vertices reach at least half the graph. Algorithm 2 DCSC[6, 14] The pivot is then set as the kth vertex, and MultiPivot partitions the vertices into five sets based on how they are 1: if G is empty then return 2: v ← random node(G) reached by vk and the first k vertices. 3: //parallel; w.marks-fw ← mk for w ∈ SUCC (v) Algorithm 3 MultiPivot [16] 4: RQ-fw(G,v,mk) 5: //parallel; w.marks-bw ← mk for w ∈ PRED(v) 1: if G is empty then return 6: RQ-bw(G,v,mk) 2: randomly permute G.V |G| 7: for all w ∈ G parallel do 3: find max(k) such that MPCounter(G, k − 1) < 2 8: if mk = w.marks-fw and mk = w.marks-bw then 4: for all w ∈ G parallel do 9: SCCv ← SCCv ∪ {w} //partition SCC 5: if mk = w.marks-fw then 10: else if mk = w.marks-fw then 6: A ← A ∪ {w} //partition subgraph 11: SUC ← SUC ∪ {w} //partition subgraph 7: //parallel; w.marks-fw ← mkk for w ∈ SUCC (vk) 12: else if mk = w.marks-bw then 8: RQ-fw(G,vk,mkk) 13: PRE ← PRE ∪ {w} //partition subgraph 9: for all w ∈ G parallel do 14: else 10: if mkk = w.marks-fw then 15: OTH ← OTH ∪ {w} //partition subgraph 11: B ← B ∪ {w} //partition subgraph 16: in parallel(DCSC(SUC), DCSC(PRE), DCSC(OTH)) 12: //parallel; w.marks-bw ← mkk for w ∈ (PRED(vk) ∩ B) 13: RQ-bw(B,vk,mkk) If the three recursed groups are approximately the same 14: for all w ∈ B parallel do size, the depth of recursion for DCSC is O(log n). How- 15: if mkk = w.marks-bw then ever, when any one of the groups dominates in size, the 16: SCCvk ← SCCvk ∪ {w} //partition SCC 17: in parallel(MultiPivot(G \ A \ B), MultiPivot(A \ B), performance is significantly affected. As mentionedb in [16], MultiPivot(B\A\SCCv ), MultiPivot(A∩B\SCCv )) DCSC has recursion depth Θ(n) in the case that SCCs are k k not connected to each other (or, in fact, whenever the aver- MPCounter(G, k) age reachability of any vertex is small). It is worth noting v G that on meshes, DCSC’s depth is O(log n); this has been its 1: for all ∈ parallel do 2: v.marks-fw ← 0 application in scientific experiments [6]. 3: for all v ∈ G parallel do DCSC’s poor performance on someb graphs is an example 4: //parallel; w.marks-fw ← mk for w ∈ SUCC (vi),i ≤ k of the problem of waiting on small reachability queries. The 5: if v.index ≤ k then RQ-fw(G,v,mk) algorithm always performs one reachability query for each 6: return the number of nodes with mark mk SCC; however, when the reachability queries are performed Schudy [16] demonstrates that, with high probability, serially and on a small portion of the graph, there is little n MultiPivot’s subgraphs are all smaller than 2 nodes; there- opportunity for the algorithm to exploit parallelism. fore, the probabilistically guaranteed depth of recursion, re- Efficient Implementation for Small-World Graphs. gardless of the input graph, is O(log n). Furthermore, be- Small-world graphs tend to have one large SCC containing cause the algorithm recurses on balanced partitions, the size over 90% of the graph. If this large SCC is removed, the of the divided subgraphs is inverselyb proportional to the remaining graph is comprised of small SCCs with low av- number of recursive tasks, so the algorithm avoids waiting erage reachability. DCSC is very likely to find the large on queries that do not reach much of the graph. SCC on the first level of recursion, and the remaining graph The primary work at each level is in finding the mini- presents DCSC’s worst case input, so it is not efficient for mal k that meets the requirements, which takes O(log n) these graphs. steps with binary search. Each step of this binary search Hong et al. [11] proposed a version of DCSC for small- uses one multi-source reachability query; therefore, Multi- world graphs, and their shared-memory implementation sho- Pivot performs O(log n) reachability queries at each level. wed the first consistent speedup over Tarjan’s algorithm for It has O(log n) levels, so it uses O(log2 n) total reachability this type of graph. They attained this speedup by dividing queries. While this is worse than DCSC’s best case, it is b b

4 significantly better than DCSC’s worst case; that is, Mul- 4.1 SCCMulti tiPivot gives up some efficiency for overall consistency of SCCMulti, shown in Algorithm 5, selects multiple unique expected resources used. pivots from which to traverse. It then performs reachability queries for all pivots at one time. After eliminating the SCC 3.4 NSCC of each pivot, it removes edges between nodes with different sets of visiting pivots. Spencer’s NSCC algorithm [17] is very similar to DCSC, To accommodate multiple pivots visiting the same node, but it includes logic to avoid waiting on small reachability marks-fw and marks-bw are unordered sets (hash sets). Each queries. NSCC, shown in Algorithm 4, serially fractures the call to RQ-fw(G, v, mk) operates as before, except that the graph into many pieces using only the work and time of mark is added to the set and does not overwrite previous one reachability query. Instead of choosing only one pivot marks. on each level of recursion, NSCC uses a loop to iteratively Algorithm 5 SCCMulti choose multiple pivots. On each iteration of this loop, NSCC randomly chooses a pivot v and finds SUCC (v), recursing 1: for k from 0 to log n do 2: for all v ∈ G parallel do on SUCC (v) \ SCCv. 3: v.marks-fw ← ∅, v.marks-bw ← ∅, v.min ←∞ Algorithm 4 NSCC [17] 4: for all i ∈ 1, 2k parallel do i 5: vrnd ← random node(G) 1: ρ ← f(|G|) (where f(·) is sublinear) i 2: while |G| >ρ do 6: //parallel; adds i to w.marks-fw for w ∈ SUCC (vrnd) i 3: for all v ∈ {u : |SUCC (u)| <ρ} do 7: RQ-fw(G,vrnd,i) i 4: G ← G \ SUCC (v) 8: //parallel; adds i to w.marks-bw for w ∈ PRED(vrnd) 5: v ← random node(G) i 9: RQ-bw(G,vrnd,i) 6: SCCv ← SUCC (v) ∩ PRED(v) 10: //partition SCCs 7: in parallel(NSCC(SUCC (v) \ SCCv)) 11: for all v ∈ G parallel do 8: G ← G \ SUCC (v) 12: for all mk ∈ v.marks-bw do 9: NSCC(G) 13: if v.marks-fw.contains(mk) and mk < v.min then 14: v.min ← mk Spencer showed that, with constant probability, each re- 15: if v.min 6= ∞ then cursive call operates on no more than a constant fraction 16: SCCv.min ← SCCv.min ∪ {v} of the graph. Therefore, the total depth of recursion is 17: G ← G \ {v} O(log n). While the total number of reachability queries 18: //partition other subgraphs 19: for all (u,t) ∈ E parallel do at each level is not constant, their total time and work is 20: if u.marks-fw 6= t.marks-fw nob more than that of one reachability query on the entire or u.marks-bw 6= t.marks-bw then graph. Therefore, with high probability, the algorithm uses 21: E ← E \ {(u,t)} no more time or work than O(log n) reachability queries. While the total work of NSCC is probabilistically guar- 4.1.1 Correctness anteed to be better than eitherb DCSC or MultiPivot, it If nodes u,t ∈ G are in the same SCC, they will reach trades this performance for a significant memory overhead. and be reached by the same set of nodes. It follows that To avoid waiting on small reachability queries, NSCC re- if u.marks-fw 6= t.marks-fw, the reachability, and thereby quires some assurance that each SUCC (v) is not“too small.” the SCCs, of u and t must be different. Therefore, any To meet this requirement, NSCC starts each iteration by re- edges removed by SCCMulti cannot be internal to an SCC. moving any nodes v for which SUCC (v) is small (line 4). From Lemma 2.7, any SCCs identified as the intersection of However, for this operation to be efficient, NSCC must be SUCC(v) and PRED(v) are correct. Since the algorithm able to tell in constant time whether |SUCC (v)| > ρ for correctly identifies the SCC of each pivot, it is complete. every v ∈ G, where ρ is a function of the number of proces- Furthermore, at least one pivot is selected and removed on sors and size of the graph. Therefore, NSCC cannot use the each iteration; therefore, the algorithm terminates in a fi- black-box RQ-fw(G, v, mk); when NSCC finds reachability nite amount of time. Since SCCMulti correctly identifies or partitions the subgraphs, it must maintain a representa- all SCCs and terminates, it must be correct. tive set of reachable nodes for every v ∈ G. This is a signif- 4.1.2 Analysis icant overhead, but without this step NSCC is functionally First, observe that SCCMulti loops exactly log n times (line equivalent to DCSC and has the same weaknesses. 1); therefore, as long as each iteration does O(1) reachability queries per iteration, SCCMulti will have complexity of 4. SCCMulti ALGORITHMS O(log n) reachability queries. b We analyze the complexity of SCCMulti in two steps. The main contribution of this paper is the pair of SCC- First,b we prove that the total expected time and work of Multi algorithms that use multiple, overlapping reachabil- any iteration’s reachability queries is not more than that ity queries on each step to divide the graph quickly. Both of a single reachability query that reaches the entire graph; algorithms only use the time and work of O(log n) reacha- that is, each iteration marks O(n) nodes. Then, we show bility queries: SCCMulti gambles on the amount of work it that, like other RB-SCC algorithms, the work done in each does per iteration (guaranteeing O(log n) iterations), while b iteration is dominated by the reachabilityb queries. SCCMulti-PA gambles on the number of iterations but guarantees O(m) work per iteration. As a result of their Work on Each Iteration. We wish to prove that the re- structures, SCCMulti does well when reachability queries sources used by each iteration are bounded by those required are inexpensive, whereas SCCMulti-PA does well when the for a single reachability query. Since SCCMulti performs average degree of the graph is small. all reachability queries in parallel (as opposed to serially,

5 like DCSC or NSCC), we are not concerned with waiting on A topological-weak ordering differs from the standard topo- small reachability queries; we only need to show that O(n) logical ordering, which is only valid for an acyclic graph, in nodes will be tested for reachability on each iteration. that it allows arbitrary ordering of nodes internal to an SCC. The proof proceeds as a series of lemmas. First, we showb Now we begin the proof. that the reachability of a graph G and its transitive closure TC(G) are the same and that the transitive closure opera- Lemma 4.5. For any graph G, tion and graph division, an operation we define below, are reach(G) = reach(TC(G)). associative (i.e., the order in which they are applied is not Proof. Edges added by the transitive closure operation important). Next, we show that the reachability of a graph do not change the reach of any vertex. is proportional to the number of edges in the graph’s transi- tive closure. Finally, we show that the expected number of Lemma 4.6. For any graph G and v ∈ G, edges in any graph is inversely proportional to the number TC (G/v) = TC(G)/v. of times the graph has been divided, concluding that the expected reachability of any graph is inversely proportional Proof. In this proof, we denote by G.V the vertex set to the number of pivots that have acted upon it. and by G.E the edge set of graph G. We first present a few definitions useful in this proof. First, observe that G.V = TC(G).V since no vertices are removed when dividing the graph. Hence, we only need to Definition 4.1 (Vertex Reach). The reach of ver- show that TC(G/v).E ⊆ (TC(G)/v).E and that tex v ∈ G is TC(G/v).E ⊇ (TC(G)/v).E. reach(v) = |PRED(v) ∪ SUCC (v)|, ⊆ Assume, for the sake of contradiction, that there exists some edge (u,t) ∈ TC(G/v) that is not in TC(G)/v. i.e., the number of nodes reached by or reaching v. That is, ∃path(u,t) ∈ G/v, but dividing TC(G) by v Definition 4.2 (Graph Reach). The reach of a graph removes the edge (u,t) from TC(G). Since (u,t) was G is removed from TC(G), then either u ∈ PRED(v),t∈ / PRED(v) or u∈ / SUCC(v),t ∈ SUCC(v). WLOG due reach(G) = reach(v); to symmetry, assume u ∈ PRED(v),t∈ / PRED(v). v∈G X We know that ∃path(u,t) ∈ G/v; since u ∈ PRED(v), i.e., reach(G) is the total reach of the graph, and the average t∈ / PRED(v), there must be some edge (w,x) along this path such that w ∈ PRED(v),x∈ / PRED(v) (that reach of each node is reach(G) . n is, the state must change along this path). This edge Definition 4.3 (Divided Graph). should have been removed when v divided G, so this 1. For a graph G and vertex v ∈ G, denote by G/v the is a contradiction. graph resulting when G is divided by v; that is, when ⊇ Assume, for the sake of contradiction, that there ex- the following edges are removed: ists some edge (u,t) ∈ TC(G)/v such that (u,t) ∈/ (a) (u,t) is removed if u ∈ PRED(v) and t∈ / PRED(v). TC(G/v). That is, dividing by v does not remove (b) (u,t) is removed if u∈ / SUCC (v) and t ∈ SUCC (v). (u,t) from TC(G), but it does break any paths from u (c) (u,t) is removed if u ∈ SCC(v) or t ∈ SCC(v). to t in G. To break a path, v must remove some edge This operation is illustrated in Figure 2. (w,x) ∈ path(u,t), either because w ∈ PRED(v),x∈ / q PRED(v) or because w∈ / SUCC (v),x ∈ SUCC (v). 2. For a graph G and q ∈ Z+, denote by [G/vrand] the graph resulting when G is serially divided by q random WLOG due to symmetry, assume w ∈ PRED(v),x∈ / vertices. PRED(v). Because (w,x) ∈ path(u,t), u ∈ PRED(w), and This definition comes from the behavior of a pivot in SCC- since we assume w ∈ PRED(v), then u ∈ PRED(v). Multi. While SCCMulti removes the SCC nodes from the Because (u,t) was not removed from TC(G), u and graph, here we only remove the edges to and from them. t were reached the same way by v, so t ∈ PRED(v). a Since (w,x) ∈ path(u,t) and x ∈ PRED(t), then x ∈ PRED(v). This is a contradiction. b,c a,b b vc v a,c Lemma 4.7. For any graph G such that TC(G)=(V, E′), a,c ′ reach(G) ≤ 2 ·|E | + n. Proof. All of the nodes reached by a vertex v are its neighbors in the transitive closure. Every edge has two end- Figure 2. Example of a graph division. Nodes in SUCC (v) have hor- points, so the total number of neighbors is not more than izontal hatching; nodes in PRED(v) have vertical hatching. Edges ′ are labeled with the rules from Definition 4.3 part 1 that removed 2 ·|E |. Each node can also reach itself, so the total reacha- ′ them. bility is not more than 2 ·|E | + n.

Definition 4.4 (Topological-Weak Ordering). Lemma 4.8. When a graph G is divided by a random ver- tex, the expected number of edges removed from G is at least For any graph G, an ordering of G’s vertices (v1, v2,...,vn) 2 2 m m is called topological-weak if for all 1 ≤ i

6 k n Let (u1,u2,...,uf ) be a topological-weak ordering of v’s divide the graph before iteration k, we will reach O(2 2k ) = forward neighbors. For any ui, v ∈ PRED(ui); if ui is se- O(n) vertices (and O(m) edges) on this iteration. Further- lected to divide the graph, the edge (v,uj ) will be removed more, only O(n) marks will be created, so the memoryb per unless uj ∈ PRED(ui) and uj ∈/ SCC(ui). Since v’s for- nodeb is O(1). b ward neighbors are topological-weakly ordered, only i − 1 b such uj can exist. Dominated Work. In addition to the reachability queries, b Each ui will be selected to divide the graph with proba- SCCMulti performs two operations on the marking sets. 1 bility n . Therefore, we expect at least First, for every vertex v, it must find the smallest mark f which is in both v.marks-fw and v.marks-bw (lines 12 to f − (i − 1) f 2 + f f 2 = ≥ 14). Since the expected number of marks per node is O(1), n 2n 2n i=1 X scanning through v.marks-bw (line 12) takes O(1) work. forward edges of v to be removed by a random vertex. marks-fw is an unordered hash set, so marks-fw.containsb (·) v is only one vertex; over all vertices, we expect at least 2 is an O(1) work operation. Therefore, this operationb takes n fi i=1 2n edges to be removed, where fi is the number of O(1) work per vertex. 2 forward edges on the ith vertex. Since fi = m, fi is SCCMulti must also compare the marking sets across P m minimized when fi = . Therefore, we expect at least everyb edge; this can be done in O(1) work per edge. Con- n P P n m 2 sider the edge (u,t): any mark that reaches u will reach m2 n = t along (u,t). Consequently, u.marks-fw ⊆ t.marks-fw, so 2n 2n2 i=1  u.marks-fw 6= t.marks-fw only if |u.marks-fw| > |t.marks-fw|. X edges to be removed. Therefore, the resources used by this algorithm are dom- inated by the reachability queries. Lemma 4.9. For any graph G and q ∈ Z+, the expected 2 2 4.2 SCCMulti-PriorityAware number of edges in [G/v ]q is no more than 2n = O( n ). rand q+2 q Assume that there exists a blocking, priority-aware reacha- Proof. For the sake of contradiction, assume that there bility query; that is, every pivot is assigned a priority, and b exists some graph G and q ∈ Z+ such that mq, the expected the query marks the set of nodes that are reached by the 2 q 2n current pivot but not by any higher priority pivot. When number of edges in [G/vrand] , exceeds q+2 . Furthermore, assume that q is the smallest such integer. No graph may this algorithm is called with more than one pivot, the high- have more than n2 edges, so q 6= 0. est priority pivot will mark every node it would have marked Since q is the smallest such integer, mq−1 is no more than in the black-box, non priority-aware RQ-fw(G,v,m), while 2 2 2n 2n lower priority pivots will only mark a subset of nodes that . Also, mq ≤ mq−1, so mq−1 > . To restate, q+1 q+2 they would otherwise mark. 2 2 2n 2n If this scheme is implemented such that processors first

7 Algorithm 6 SCCMulti-PA query (RQ), which measures the time required to perform 1: while |G| > 0 do ⌈log n⌉ queries on the input graph, to benchmark the exe- 2: for all v ∈ G parallel do cution times of the algorithms. All implementations use the 3: v.marks-fw ← ∅, v.marks-bw ← ∅ Trim subroutine (Section 3.1.2) as a preprocessing phase at 4: v.priority ← rand(0,n) ∗ n + v.index each level. 5: for all v ∈ G parallel do Each iteration of MultiPivot takes approximately the 6: RQ-fw-PA(G, v, v.priority) //see discussion 7: RQ-bw-PA(G, v, v.priority) //see discussion same time as one complete call to SCCMulti or SCC- 8: //partition SCCs Multi-PA; because of this, we always stopped MultiPivot 9: for all v ∈ G parallel do after four iterations. Our results show that on all graphs the 10: if v.marks-fw = v.marks-bw then SCCMulti algorithms outperform this abbreviated Multi- 11: SCCv.marks-fw ← SCCv.marks-fw ∪ {v} Pivot; running MultiPivot to completion would consume 12: G = G \ {v} our limited computing resources without changing the rela- 13: //partition subgraphs 14: for all (u,t) ∈ E parallel do tive outcome. 15: if u.succ 6= t.succ or u.pred 6= t.pred then Our experiments were run on a Cray XE6m-200 (XE6M) 16: E ← E \ {(u,t)} available in our department. This machine has twenty-four compute nodes, twelve of which have a single AMD Opteron pivot, so the algorithm must terminate. Therefore, the al- 6272 ‘Interlagos’ 16-core processor running at 2.1GHz and gorithm correctly labels each SCC and terminates. a total of 32GB of memory (2GB per core). The remaining 4.2.2 Analysis twelve nodes have two 16-core ‘Interlagos’ processors run- ning at 2.1GHz and a total of 64GB of memory (maintain- First, we show that SCCMulti-PA expects O(log n) iter- ing 2GB per core). Experiments run on 32 to 256 cores were ations. Then, we again show that each iteration’s slowest restricted to run on nodes with 32 cores in order to minimize step is the reachability queries. b variability in the experimental setup. Number of Iterations. By Lemma 4.10, vertices in a graph We evaluated the scalability of all five algorithms on a G that has been divided by q pivots have reach of O( n ); 393,216 core IBM BG/Q system (BG/Q) at Lawrence Liv- q ermore National Laboratory. The BG/Q system has 24,576 therefore, O(n/ n ) = O(q) pivots will reach their own SCCs q nodes, each with 16GB of memory and a single 16 core IBM without being overwritten by a higher priority pivot.b By PowerPC A2 processor running at 1.6 GHz. We ran each ex- this, the number of non-overwritten pivots doubles on each b b periment 10 times and for all results present the mean along iteration, so SCCMulti-PA will take O(log n) iterations. with a 95% confidence interval using the t-distribution. Ad- Since SCCMulti-PA tests reachability on O(n) vertices ditional results can be found in a technical report [20] avail- on each iteration and uses O(log n) iterations,b it will also able on our laboratory website, use the time and work of O(log n) reachability queries on http://parasol.tamu.edu/publications. the entire graph. b 5.2 Input Graphs b Dominated Work. At the end of every iteration of SCC- In order to show that the performance of our algorithms is Multi-PA, every node will have been labeled by some pivot. not limited to a particular class of graphs, the algorithms Since the reachability queries are priority-aware, the time were evaluated using five different types of graphs: required for all of them will be asymptotically equal to that Separable Cycles (SC): A graph with a set of uncon- of a single reachability query over the entire graph. nected cycles. Chained Cycles (CC): A graph consisting of a chain of 5. EXPERIMENTAL RESULTS cycles. Our experimental evaluation of the RB-SCC algorithms com- Permuted Mesh (PM): A graph representing a 3D mesh pares the absolute execution time and scalability of each with a specified percentage of edges randomly reversed. algorithm on different types of graphs. The use of multiple Watts-Strogatz (WS): A graph with small-world proper- graphs from different domains demonstrates that the perfor- ties created by forming a ring of vertices and, for each mance of our algorithms is not restricted to a specific prob- vertex, randomly rewiring each of its edges to one of lem domain. That is, the algorithmic characteristics noted its k neighbors. in the preceding discussion are evident when the algorithms Graph 500 (G500): A graph with scale-free, small-world are used to identify the SCCs of graphs. properties created using the modified Kronecker gen- erator from the Graph 500 benchmark. 5.1 Experimental Setup SC was chosen to represent the class of problems where We implemented DCSC, Hong’s modified DCSC (HONG), the graph has low average reachability; this is the classically MultiPivot, SCCMulti, and SCCMulti-PA using stapl difficult input for DCSC. CC is an extreme case that stresses [5]. RQ-fw(·) and RQ-bw(·), used by all algorithms, were the underlying RQ. PM represents graphs found, for exam- implemented as flood fills using stapl’s KLA paradigm [10], ple, in scientific applications where the spatial domain is which enables simple development of efficient parallel algo- represented by a graph and local mutations cause the con- rithms for which the level of asynchrony can be varied para- nectivity between vertices to change. WS and G500 have metrically. The flood fill algorithm for SCCMulti-PA uses many of the small-world properties found in graphs from a priority scheduler to determine which task from the set of social networks, road maps, and other physical phenomena. ready tasks is run next. Vertex operations are guaranteed to be atomic by stapl’s task scheduler and runtime sys- 5.3 Comparison of RB-SCC Algorithms tem. We did not implement NSCC due to its high memory We first compare the RB-SCC algorithms on the XE6M us- requirement. We include an evaluation of our reachability ing a 32,768 vertex SC graph with cycles of size 2, shown

8 Separable 2- Graph, 32k vertices, XE6M 3000 Separable 2-Cycle Graph, 1B vertices, BG/Q Chain of 128k 2-cycles, 256k vertices, XE6M 8000 1000 1000 SCCMulti SCCMulti SCCMulti SCCMulti-PA SCCMulti-PA SCCMulti-PA MultiPivot MultiPivot MultiPivot log n RQs DCSC 100 DCSC HONG HONG log n RQs log n RQs 100 1000 10 Time(s) Time(s) Time(s) 10 1

100 .1 .05 1 50

2K 4K 8K 16K 32K 64K 128K 384K 1 2 4 8 16 32 64 128 256 512 1 2 4 8 16 32 64 128 256 512 Processors Processors Processors (b) Separable 2-cycles, BG/Q (c) Chain of 2-cycles, XE6M (a) Separable 2-cycles, XE6M

Watts-Strogatz Graph, 64k vertices, 16 neighbors, XE6M Graph500 Kronecker Graph, Scale 18 -- 256k vertices, XE6M Permuted Mesh Graph, 40% permutation, 2M vertices, XE6M 500 400 SCCMulti 10 SCCMulti-PA MultiPivot 5 DCSC 100 HONG log n RQs 100 1

50 Time(s) Time(s) Time(s) 10 .2

SCCMulti SCCMulti SCCMulti-PA SCCMulti-PA 10 MultiPivot MultiPivot DCSC DCSC HONG HONG 1 log n RQs 5 log n RQs 1 2 4 8 16 32 64 128 256 512 1 2 4 8 16 32 64 128 256 512 1 2 4 8 16 32 64 128 256 512 Processors Processors Processors (f) Permuted 3D mesh, XE6M (d) Small-world graph, XE6M (e) Scale free graph, XE6M

Watts-Strogatz Graph, 8M vertices, 16 neighbors, BG/Q Graph500 Kronecker Graph, 1M vertices, 16 neighbors, BG/Q Permuted Mesh Graph, 40% permutation, 16M vertices, BG/Q 1000 SCCMulti MultiPivot 500 SCCMulti-PA DCSC 500 log n RQs HONG

100

50 250 SCCMulti SCCMulti-PA MultiPivot 100 DCSC

Time(s) 10 Time(s) HONG Time(s) 50 log n RQs

100 SCCMulti 1 10 SCCMulti-PA MultiPivot DCSC HONG log n RQs

512 1K 2K 4K 8K 16K 32K 64K 128K 384K 512 1K 2K 4K 8K 16K 32K 512 1K 2K 4K 8K 16K 32K 64K Processors Processors Processors

(g) Small-world graph, BG/Q (h) Scale free graph, BG/Q (i) Permuted 3D mesh, BG/Q Figure 3. Timing results for different graphs on XE6M and BG/Q platforms in Figure 3(a). For this input DCSC does not scale and Pivot. requires two to three orders of magnitude longer than the The comparison of the algorithms on the CC input is other algorithms; as mentioned in Section 3.2, unconnected shown in Figure 3(c). DCSC and SCCMulti-PA have com- SCCs do not allow DCSC to exploit any parallelism. HONG parable execution time and scalability. MultiPivot is still has significantly reduced run time compared to DCSC, but the slowest of the algorithms, despite its execution being does not change improve in scalability. For SC, the four it- halted after four iterations. Since this is a graph with ex- erations of MultiPivot take approximately four times as ceptionally low degree, SCCMulti-PA outperforms all the long as SCCMulti and SCCMulti-PA. This is expected, other algorithms, and SCCMulti scales similarly. The de- as MultiPivot does as much work at each level as SCC- crease in scalability from four to eight cores is due to limited Multi does in total. Both SCCMulti and SCCMulti-PA memory bandwidth as all eight cores on one die of the 16- scale better than any of the other algorithms, in addition to core processor are utilized. The lack of scalability at higher having lower execution times; the scalability of both SCC- core counts is due to the small graph size, used to allow Multi and SCCMulti-PA is comparable to that of the RQ. MultiPivot to complete its abbreviated execution in a rea- (Again, note that RQ shows the time required for ⌈log n⌉ se- sonable amount of time. rial reachability queries.) We expect all algorithms to scale Figure 3(d) presents the comparison of the algorithms on with the RQ, but the communication cost of maintaining a WS input. MultiPivot remains the most computation- subgraph information penalizes DCSC, HONG, and Multi- ally expensive, but after an increase in execution time be-

9 tween 1 and 2 cores it scales similarly to the others. DCSC, 6. CONCLUSION HONG, SCCMulti, and SCCMulti-PA have similar exe- Computing SCCs is an important problem, yet solutions to cution times on lower core counts, but as the number of cores it are difficult to parallelize. Many of the best performing increases SCCMulti and SCCMulti-PA outscale DCSC parallel SCC algorithms for sparse graphs work by choosing and HONG. In all experiments the scalability of SCCMulti random pivots, computing the reachability of these pivots, and SCCMulti-PA is similar to that of the ⌈log n⌉ RQs. and using this information to divide the graph. This paper The execution times of the RB-SCC algorithms on a G500 surveys these reachability-based SCC (RB-SCC) algorithms input are shown in Figure 3(e). Similar to the results ob- and shows that they subscribe to a single framework. tained on the WS graph, MultiPivot is the most computa- The main contribution of this paper is two new parallel tionally expensive algorithm. SCCMulti, SCCMulti-PA, SCC algorithms, SCCMulti and SCCMulti-PA, that use DCSC and HONG scale similarly and have comparable exe- multiple pivots in the same iteration to divide the graph cution times. All algorithms show a decrease in scaling from into many pieces. SCCMulti guarantees O(log n) itera- 32 to 64 cores when two nodes of the system are employed tions and probabilistically guarantees the time and work of instead of a single node. Scalability beyond 32 cores is lower O(1) reachability queries per iteration, while SCCMulti- due to the large number of messages required to complete PA guarantees the time and work of O(1) reachability queries a single RQ and the fact that most of these messages are perb iteration, but only probabilistically guarantees O(log n) between different nodes of the system. iterations. Unlike previous RB-SCC algorithms (DCSC, The performance of all of the RB-SCC algorithms on a PM MultiPivot, and NSCC), these algorithms offer thisb perfor- with 40% edge reversal is shown in Figure 3(f). All of the al- mance on any graph without introducing significant compu- gorithms, with the exception of HONG, scale similarly, with tation or memory overhead. Our experimental results show SCCMulti performing the best on all core counts (this is that SCCMulti and SCCMulti-PA scale up to hundreds of a higher degree graph with low diameter). Hong’s modified thousands of cores; unlike DCSC, they attain consistent per- DCSC does not scale well due to the increased communica- formance regardless of the input graph, and they are much tion required by the computation of the WCCs. faster than MultiPivot. Therefore, they are consistently 5.4 Scalability of RB-SCC Algorithms fast, scalable parallel SCC algorithms. To evaluate the scalability of the algorithms on higher core counts we use the SC, PM, WS, and G500. The SC scal- APPENDIX ability up to 384K cores is shown in Figure 3(b), where This section describes a graph that will take DCSC Θ(n) DCSC and HONG are excluded because of their already recursive levels, even for the modified DCSC algorithm of poor scalability in Figure 3(a). Because of the overhead in- Hong et al. [11]. Consider a 3D Cartesian space. Place volved in finding subgraphs, MultiPivot does not scale on a node at every integral (x,y, 0) and (x,y, 1), creating two these core counts, and it was not run beyond 16,384 cores. horizontal planes of nodes. For every node in the z = 0 SCCMulti and SCCMulti-PA both scale well across all plane, create an edge to the five vertices closest to it in the core counts evaluated. The execution of ⌈log n⌉ reachability z = 1 plane; that is, (x,y, 0) will have edges to (x − 1,y, 1), queries(RQ), on which all of the algorithms are based, is also (x+1,y, 1), (x,y, 1), (x, y −1, 1), and (x, y +1, 1). A portion included in the figure to show that the underlying query is of this graph is shown in Figure 4. The nodes in the z = 0 itself scalable to high core counts. plane lie along dashed lines, while the nodes in the z = 1 Figure 3(g) illustrates the scalability of all five algorithms plane lie along solid lines. The five directed lines show the on a WS graph of 8 million vertices out to 393,216 cores. outgoing edges for an example node in the z = 0 plane. The scalability of the algorithms on PM out to 65,536 cores is shown in Figure 3(i). The trends noted in earlier exper- iments continue to hold: SCCMulti and SCCMulti-PA are consistently the fastest algorithms and demonstrate su- perior scalability comparable to the baseline of ⌈log n⌉ RQs. Again, because of the overhead involved in finding pivots, MultiPivot does not scale on the large number of cores being utilized, and it was not run beyond 64K cores. The bookkeeping of DCSC and HONG prevent them from scal- ing, and the algorithms require more execution time than all other algorithms (except on the highest core counts of each experiment, when MultiPivot’s execution time ex- ceeds them). In Figures 3(b) and (g) the three highest core counts used are 128K, 256K, and 384K. Figure 3(h) shows that none of the algorithms scale well on Figure 4. A bad case graph for the modified DCSC the scale-free graph from the Graph 500 benchmark. This This is a directed graph with low average reachability; is due to the lack of scalability of the reachability query however, no pivot will ever divide it into more than seven on which all of the algorithms are based. The reachability weakly connected components: the SCC of the pivot (which query does not scale due to the large amount of inter-node is the pivot itself), the five vertices to which the pivot has communication required in order to traverse the graph. edges, and the rest of the graph. The results show that, across a variety of graphs, SCC- One could argue that the SCCs of this graph would be Multi and SCCMulti-PA have low execution times and found by the modified Trim subroutine. This is indeed true; exhibit scalable performance out to 384k cores, making them however, replacing every individual vertex with a 3-cycle a preferable alternative to MultiPivot, DCSC, and HONG. would make Trim useless, but would not significantly change

10 the properties of the graph. Therefore, this graph presents a worst case graph for DCSC that is not assisted by the adaptation of Hong et al.[11].

11 References [19] R. E. Tarjan. Depth-first search and linear graph algorithms. [1] M. L. Adams and E. W. Larsen. Fast iterative methods for SIAM J. Comput., 1(2):146–160, 1972. discrete-ordinates particle transport calculations. Progress in [20] D. Tomkins, T. Smith, N. M. Amato, and L. Rauchwerger. Nuclear Energy, 40(1):3–159, 2002. Efficient, reachability-based, parallel algorithms for finding [2] A. V. Aho, R. Sethi, and J. D. Ullman. Compilers: Princi- strongly connected components. Technical Report TR15-002, ples, Techniques, and Tools. Addison-Wesley Longman Pub- Dept. of Computer Science and Engineering, Texas A&M lishing Co., Inc., Boston, MA, USA, 1986. University, January 2015. [3] N. Amato. Improved processor bounds for parallel algorithms [21] J. D. Ullman and M. Yannakakis. High-probability parallel for weighted directed graphs. Information Processing Let- transitive-closure algorithms. SIAM J. Comput., 20(1):100– ters, 45(3):147–152, Mar. 1993. 125, 1991. [4] D. Bader. A practical parallel algorithm for in partitioned digraphs. Technical Report AHPCC-TR-99- 013, Electrical & Computer Eng. Dept., Univ. New Mexico, Albuquerque, NM, 1999. [5] A. Buss, Harshvardhan, I. Papadopoulos, O. Pearce, T. Smith, G. Tanase, N. Thomas, X. Xu, M. Bianco, N. M. Amato, and L. Rauchwerger. STAPL: Standard template adaptive parallel library. In Proc. Annual Haifa Experimen- tal Systems Conference (SYSTOR), pages 1–10, New York, NY, USA, 2010. ACM. [6] L. Fleischer, B. Hendrickson, and A. Pinar. On identifying strongly connected components in parallel. In J. D. P. Rolim, editor, IPDPS Workshops, volume 1800 of Lecture Notes in Computer Science, pages 505–511. Springer, 2000. [7] A. A. Gakh, E. G. Gakh, B. G. Sumpter, and D. W. Noid. Neural network-graph theory approach to the prediction of the physical properties of organic compounds. J. of Chemical Information and Computer Sciences, 34(4):832–839, 1994. [8] H. Gazit and G. L. Miller. An improved parallel algorithm that computes the BFS numbering of a directed graph. In- formation Processing Letters, 28(2):61–65, June 1988. [9] M. Goodrich, Y. Matias, and U. Vishkin. Optimal parallel approximation for prefix sums and integer sorting. In Proc. of the 5th Annual ACM-SIAM Symp. on Discrete Algorithms, pages 241–250, 1994. [10] Harshvardhan, A. Fidel, N. M. Amato, and L. Rauchwerger. KLA: A new algorithmic paradigm for parallel graph com- putations. In Proc. Intern. Conf. Parallel Architecture and Compilation Techniques (PACT), PACT ’14, pages 27–38, New York, NY, USA, 2014. ACM. Conference Best Paper Award. [11] S. Hong, N. C. Rodia, and K. Olukotun. On fast parallel detection of strongly connected components (scc) in small- world graphs. In Proc. of the Intern. Conf. on High Perfor- mance Computing, Networking, Storage and Analysis, SC ’13, pages 92:1–92:11, New York, NY, USA, 2013. ACM. [12] M.-Y. Kao, S.-H. Teng, and K. Toyama. An optimal parallel algorithm for planar cycle separators. Algorithmica, pages 398–408, 1995. [13] L. Kavraki, P. Svestka, J.-C. Latombe, and M. Over- mars. Probabilistic roadmaps for path planning in high- dimensional configuration spaces. Robotics and Automation, IEEE Trans. on, 12(4):566 –580, aug 1996. [14] W. McLendon, III, B. Hendrickson, S. J. Plimpton, and L. Rauchwerger. Finding strongly connected components in distributed graphs. J. Parallel Distrib. Comput., 65:901–910, 2005. [15] J. H. Reif. Depth-first search is inherently sequential. Infor- mation Processing Letters, 20(5):229–234, 1985. [16] W. Schudy. Finding strongly connected components in par- allel using o(log2n) reachability queries. In Proc. of the 20th Annual Symp. on Parallelism in Algorithms and Architec- tures, SPAA ’08, pages 146–151, New York, NY, USA, 2008. ACM. [17] T. H. Spencer. More time-work tradeoffs for parallel graph algorithms. In Proc. of the 3rd Annual ACM Symp. on Par- allel Algorithms and Architectures, SPAA ’91, pages 81–93, New York, NY, USA, 1991. ACM. [18] T. H. Spencer. Time-work tradeoffs for parallel graph algo- rithms. In Proc. of the 2nd Annual ACM-SIAM Symp. on Discrete Algorithms, SODA ’91, pages 425–432, Philadel- phia, PA, USA, 1991. SIAM.

12