High-Performance Parallel Graph Coloring with Strong Guarantees on Work, Depth, and Quality

Maciej Besta¹, Armon Carigiet¹, Zur Vonarburg-Shmaria¹, Kacper Janda², Lukas Gianinazzi¹, Torsten Hoefler¹
¹Department of Computer Science, ETH Zurich; ²Faculty of Computer Science, Electronics and Telecommunications, AGH-UST

Abstract—We develop the first parallel graph coloring heuristics with strong theoretical guarantees on work and depth and coloring quality. The key idea is to design a relaxation of the degeneracy order, a well-known concept, and to color vertices in the order dictated by this relaxation. This introduces a tunable amount of parallelism into the degeneracy ordering that is otherwise hard to parallelize. This simple idea enables significant benefits in several key aspects of graph coloring. For example, one of our algorithms ensures polylogarithmic depth and a bound on the number of used colors that is superior to all other parallelizable schemes, while maintaining work-efficiency. In addition to provable guarantees, the developed algorithms have competitive run-times for several real-world graphs, while almost always providing superior coloring quality. Our degeneracy ordering relaxation is of separate interest for algorithms outside the context of coloring.

This is the full version of a paper published at ACM/IEEE Supercomputing'20 under the same title.
arXiv:2008.11321v3 [cs.DS] 11 Nov 2020

I. INTRODUCTION

Graph coloring, more specifically vertex coloring, is a well studied problem in computer science, with many practical applications in domains such as sparse linear algebra computations [1]–[7], conflicting task scheduling [8]–[11], networking and routing [12]–[20], [21], [22], and many others [23]. A vertex coloring of a graph G is an assignment of colors to vertices, such that no two neighboring vertices share the same color. A k-coloring is a vertex coloring of G which uses k distinct colors. The minimal amount of colors k for which a k-coloring can be found for G is referred to as the chromatic number χ(G). An optimal coloring, also sometimes referred to as the coloring problem or a χ-coloring, is the problem of coloring G with χ(G) colors. Finding such an optimal coloring was shown to be NP-complete [24].

Nonetheless, colorings with a reasonably low number of colors can in practice be computed quite efficiently in the sequential setting using heuristics. One of the most important is the Greedy heuristic [25], which sequentially colors vertices by choosing, for each selected vertex v, the smallest color not already taken by v's neighbors. This gives a guarantee for a coloring of G with at most ∆ + 1 colors, where ∆ is the maximum degree in G. To further improve the coloring quality (i.e., #colors used), Greedy is in practice often used with a vertex ordering heuristic, which decides the order in which Greedy colors the vertices. Example heuristics are first-fit (FF) [25] which uses the default order of the vertices in G, largest-degree-first (LF) [25] which orders vertices according to their degrees, random (R) [26] which chooses vertices uniformly at random, incidence-degree (ID) [1] which picks vertices with the largest number of uncolored neighbors first, saturation-degree (SD) [27], where a vertex whose neighbors use the largest number of distinct colors is chosen first, and smallest-degree-last (SL) [28] that removes lowest degree vertices, recursively colors the resulting graph, and then colors the removed vertices. All these ordering heuristics, combined with Greedy, have the inherent problem of no parallelism.

Jones and Plassmann combined this line of work with earlier parallel schemes for deriving maximum independent sets [29], [30] and obtained a parallel graph coloring algorithm (JP) that colors a vertex v once all of v's neighbors that come later in the provided ordering have been colored. They showed that JP, combined with a random vertex ordering (JP-R), runs in expected depth O(log n / log log n) and O(n + m) work for constant-degree graphs (n and m are #vertices and #edges in G, respectively). Recently, Hasenplaugh et al. [31] extended JP with the largest-log-degree-first (LLF) and smallest-log-degree-last (SLL) orderings with better bounds on depth; these orderings approximate the LF and SL orderings, respectively. There is also another (earlier) work [32] that – similarly to JP-SLL – approximates SL with the "ASL" ordering. The resulting coloring combines JP with ASL; we denote it as JP-ASL [32]. However, it offers no bounds for work or depth.

Overall, there is no parallel algorithm with strong theoretical guarantees on work and depth and quality. Whilst having a reasonable theoretical run-time, JP-R may offer colorings of poor quality [31], [33]. On the other hand, JP-LF and JP-SL, which provide a better coloring quality, run in Ω(n) or Ω(∆²) for some graphs [31]. This was addressed by the recent JP-LLF and JP-SLL algorithms [31] that produce colorings of similarly good quality to their counterparts JP-LF and JP-SL, and run in an expected depth that is within a logarithmic factor of JP-R. However, no guaranteed upper bounds on the coloring quality (#colors), better than the trivial ∆ + 1 bound from Greedy, exist for JP-LLF, JP-SLL, or JP-ASL.

To alleviate these issues, we present the first graph coloring algorithms with provably good bounds on work and depth and quality, simultaneously ensuring high performance and competitive quality in practice. The key idea is to use a certain novel vertex ordering, the provably approximate degeneracy ordering (ADG, contribution #1), when selecting which vertex is the next to be colored. The exact degeneracy ordering is – intuitively – an ordering obtained by iteratively removing vertices of smallest degrees.
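The Greedy heuristic described above can be sketched in a few lines. The following is an illustrative sequential sketch (not code from the paper); the graph representation and names are assumptions for the example:

```python
# Greedy coloring: visit vertices in a given order and assign each the
# smallest color not used by its already-colored neighbors.
# This uses at most Delta + 1 colors, Delta being the maximum degree.

def greedy_coloring(adj, order):
    """adj: dict vertex -> set of neighbors; order: list of vertices."""
    color = {}
    for v in order:
        taken = {color[u] for u in adj[v] if u in color}
        c = 0
        while c in taken:
            c += 1
        color[v] = c
    return color

# A 5-cycle: maximum degree Delta = 2, so Greedy uses at most 3 colors.
adj = {i: {(i - 1) % 5, (i + 1) % 5} for i in range(5)}
colors = greedy_coloring(adj, order=list(range(5)))
assert all(colors[v] != colors[u] for v in adj for u in adj[v])
assert max(colors.values()) + 1 <= 3  # the Delta + 1 bound
```

Swapping the `order` argument corresponds to the ordering heuristics (FF, LF, R, ID, SD, SL) discussed above.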

Using the degeneracy ordering with JP leads to the best possible Greedy coloring quality [28]. Still, computing the exact degeneracy ordering is hard to parallelize: for some graphs, it leads to Ω(n) coloring run-time [31]. To tackle this, we (provably) relax the strict degeneracy order by assigning the same rank (in the ADG ordering) to a batch of vertices that – intuitively – have similarly small degrees. This approach also results in provably higher parallelization because each batch of vertices can be processed in parallel.

This simple idea, when applied to graph coloring, gives a surprisingly rich outcome. We use it to develop three novel graph coloring algorithms that enhance two relevant lines of research. We first combine ADG with JP, obtaining JP-ADG (contribution #2), a coloring algorithm that is parallelizable: vertices with the same ADG rank are colored in parallel. It has the expected worst-case depth of O(log² n + log ∆ (d log n + log d log² n / log log n)). Here, d is the degeneracy of a graph G: an upper bound on the minimal degree of every induced subgraph of G (detailed in § II-B) [28]. JP-ADG is also work-efficient (O(n + m) work) and has good coloring quality: it uses at most 2(1 + ε)d + 1 colors¹, for any ε > 0.

Moreover, we also combine ADG with another important line of graph coloring algorithms that are not based on JP but instead use speculation [32], [34]–[46]. Here, vertices are colored independently ("speculative coloring"). Potential coloring conflicts (adjacent vertices assigned the same colors) are resolved by repeating coloring attempts. Combining ADG with this design gives DEC-ADG (contribution #3), the first scheme based on speculative coloring with provable strong guarantees on all key aspects of parallel graph coloring: work O(n + m), depth O(log d log² n), and quality (2 + ε)d.

Finally, we combine key design ideas in DEC-ADG with an existing recent algorithm [40] (referred to as ITR) also based on speculative coloring. We derive an algorithm called DEC-ADG-ITR that improves the coloring quality of ITR both in theory and practice.

We conduct the most extensive theoretical analysis of graph coloring algorithms so far, considering 20 parallel graph coloring routines with provable guarantees (contribution #5). All our algorithms offer substantially better bounds than past work. Compared to the most recent JP-SLL and JP-LLF colorings [31], JP-ADG gives a strong theoretic coloring guarantee that is much better than ∆ + 1 because d ≪ ∆ for many classes of sparse graphs, such as scale-free networks [47] and planar graphs [48]. It also provides an interesting novel tradeoff in the depth. On one hand, it depends on log² n while JP-SLL and JP-LLF depend on log n. However, while JP-SLL and JP-LLF depend linearly on ∆ or √m (which are usually large in today's graphs), the depth of JP-ADG depends linearly on the degeneracy d, and, as we also show, d ≪ √m and d ≪ ∆.

We also perform a broad empirical evaluation, illustrating that our algorithms (1) are competitive in run-times for several real-world graphs, while (2) offering superior coloring quality. Only JP-SL and JP-SLL use comparably few colors, but they are at least 1.5× slower. Our routines are of interest for coloring both small and large graphs, for example in online execution scheduling and offline data analytics, respectively. We conclude that our algorithms offer the best coloring quality at the smallest required runtime overhead.

In a brief summary, we offer the following:
• The first parallel algorithm for deriving the (approximate) graph degeneracy ordering (ADG).
• The first parallel graph coloring algorithm (JP-ADG), in a line of heuristics based on Jones and Plassmann's scheme, with strong bounds on work, depth, and coloring quality.
• The first parallel graph coloring algorithm (DEC-ADG), in a line of heuristics based on speculative coloring, with strong bounds on work, depth, and coloring quality.
• A use case of how ADG can seamlessly enhance an existing state-of-the-art graph coloring scheme (DEC-ADG-ITR).
• The most extensive (so far) theoretical analysis of parallel graph coloring algorithms, showing advantages of our algorithms over state-of-the-art in several dimensions.
• Superior coloring quality offered by our algorithms over tuned modern schemes for many real-world graphs.

We note that degeneracy ordering is used beyond graph coloring [49]–[52]; thus, our ADG scheme is of separate interest.

II. FUNDAMENTAL CONCEPTS

We start with background; Table I lists key symbols. Vertex coloring was already described in Section I.

A. Graph Model and Representation

We model a graph G as a tuple (V,E); V is a set of vertices and E ⊆ V × V is a set of edges; |V| = n and |E| = m. We focus on graph coloring problems where edge directions are not relevant. Thus, G is undirected. The maximum, minimum, and average degree of a given graph G are ∆, δ, and δ̂, respectively. The neighbors and the degree of a given vertex v are N(v) and deg(v), respectively. G[U] = (U, E[U]) denotes an induced subgraph of G: a graph where U ⊆ V and E[U] = {(v,u) | v ∈ U ∧ u ∈ U}, i.e., E[U] contains edges with both endpoints in U. N_U(v) and deg_U(v) are the neighborhood and the degree of v ∈ V in G[U]. The vertices are identified by integer IDs that define a total order: V = {1, . . . , n}; this order is used to sort the neighborhoods. We store G using CSR, the standard graph representation that consists of n sorted arrays with the neighbors of each vertex (2m words) and offsets to each array (n words).

B. Degeneracy, Coreness, and Related Concepts

A graph G is s-degenerate [53] if, in each of its induced subgraphs, there is a vertex with a degree of at most s. The degeneracy d of G [54]–[57] is the smallest s, such that G is still s-degenerate. The degeneracy ordering of G [28] is an ordering, where each vertex v has at most d neighbors that are ordered higher than v. Then, a k-approximate degeneracy ordering differs from the exact one in that v has at most k · d neighbors ranked higher in this order.

¹The provided bound on the number of colors is exactly ⌈2(1 + ε)d⌉ + 1 in JP-ADG and in DEC-ADG-ITR, and ⌈(2 + ε)d⌉ in DEC-ADG. However, for clarity of notation, we will omit ⌈·⌉, for both numbers of colors and for vertex neighbors, whenever it does not change any conclusions or insights.
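The sequential peeling procedure behind SL and the degeneracy ordering can be sketched as follows. This is an illustrative sketch, not the paper's code; graph representation and names are assumptions:

```python
# Smallest-degree-last peeling (Section II-B): repeatedly remove a
# minimum-degree vertex. The largest degree observed at removal time is
# the degeneracy d, and the removal order is a degeneracy ordering
# (every vertex has at most d neighbors removed later, i.e., ranked higher).

def degeneracy_ordering(adj):
    deg = {v: len(adj[v]) for v in adj}
    removed, order, d = set(), [], 0
    while len(removed) < len(adj):
        v = min((u for u in adj if u not in removed), key=lambda u: deg[u])
        d = max(d, deg[v])
        order.append(v)
        removed.add(v)
        for u in adj[v]:
            if u not in removed:
                deg[u] -= 1
    return d, order

# K4 plus a pendant vertex: the K4 core gives degeneracy 3.
adj = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1, 3}, 3: {0, 1, 2, 4}, 4: {3}}
d, order = degeneracy_ordering(adj)
assert d == 3
pos = {v: i for i, v in enumerate(order)}
# Each vertex has at most d neighbors ranked higher (removed later):
assert all(sum(1 for u in adj[v] if pos[u] > pos[v]) <= d for v in adj)
```

This is exactly the O(n + m)-time sequential computation mentioned below; ADG (Section III) parallelizes a relaxation of it.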

A partial k-approximate degeneracy ordering is a similar ordering, where multiple vertices can be ranked equally, and we have that each vertex has at most k·d neighbors with equal or higher rank. A partial k-approximate degeneracy ordering can be trivially extended into a k-approximate degeneracy ordering by imposing an (arbitrary) order on vertices ranked equally. Both the degeneracy and a degeneracy ordering of G can be computed in linear time by sequentially removing vertices of smallest degree [28]. Closely related notions are a k-core of G [58]: a connected component that is left over after iteratively removing vertices with degree less than k from G. The coreness of a vertex v is defined as the largest possible k, such that v is part of a subgraph S of G with minimum degree k.

G            A graph G = (V,E); V and E are sets of vertices and edges.
G[U]         G[U] = (U, E[U]) is a subgraph of G induced on U ⊆ V.
n, m         Numbers of vertices and edges in G; |V| = n, |E| = m.
∆, δ, δ̂      Maximum degree, minimum degree, and average degree of G.
d            The degeneracy of G.
deg(v), N(v) The degree and the neighborhood of a vertex v ∈ V.
deg_U(v)     The degree of v in a subgraph induced by the vertex set U ⊆ V.
N_U(v)       The neighborhood of v in a subgraph induced by U ⊆ V.
ρ_X(v)       A priority function V → R associated with vertex ordering X.
P            The number of processors (in a given PRAM machine).
TABLE I: Selected symbols used in the paper. When we use a symbol in the context of a specific loop iteration ℓ, we add ℓ in brackets or as a subscript (e.g., δ̂_ℓ is δ̂ in iteration ℓ).

C. Models for Algorithm Analysis

As a compute model, we use the DAG model of dynamic multithreading [59], [60]. In this model, a specific computation (resulting from running some parallel program) is modeled as a directed acyclic graph (DAG). Each node in a DAG corresponds to a constant time operation. In-edges of a node model the data used for the operation. As operations run in constant time, there are O(1) in-edges per node. The out-edges of a node correspond to the computed output. A node can be executed as soon as all predecessors finish executing.

Following related work [31], [61], we assume that a parallel computation (modeled as a DAG) runs on the ideal parallel computer (machine model). Each instruction executes in unit time and there is support for concurrent reads, writes, and read-modify-write atomics (any number of such instructions finish in O(1) time). These are standard assumptions used in all recent parallel graph coloring algorithms [31], [61]. We develop algorithms based on these assumptions but we also provide algorithms that use weaker assumptions (algorithms that only rely on concurrent reads).

We use the work-depth (W–D) analysis for bounding run-times of parallel algorithms in the DAG model. The work of an algorithm is the total number of nodes and the depth is defined as the longest directed path in the DAG [62], [63].

Our Analyses vs. PRAM In our W–D analysis, we use two machine model variants: (1) one that only needs concurrent reads, and (2) one that may also need concurrent writes. These variants are analogous to those of the well-known PRAM model [63]–[66]: CREW and CRCW, respectively. Thus, when describing a W–D algorithm that only relies on concurrent reads, we use the term "the CREW setting". Similarly, for a W–D algorithm that needs concurrent writes, we use the term "the CRCW setting". The machine model used in this work is similar to PRAM. One key difference is that we do not rely on the (unrealistic) PRAM assumption of the synchronicity of all steps of the algorithm (across all processors). The well-known Brent's result states that any deterministic algorithm with work W and depth D can be executed on P processors in time T such that max{W/P, D} ≤ T ≤ W/P + D [67]. Thus, all our results are applicable to a PRAM setting.

D. Compute Primitives

We use a Reduce operation. It takes as input a set S = {s₁, ..., sₙ} implemented as an array (or a bitmap). It uses a function f : S → ℕ called the operator; f(s) must be defined for any s ∈ S. Reduce calculates the sum of the elements in S with respect to f: f(s₁) + ... + f(sₙ). This takes O(log n) depth and O(n) work in the CREW setting [68], [69], where n is the array size. We use Reduce to implement Count(S), which computes the size |S|. For this, the associated operator f is defined as f(s) = 1 if s ∈ S, and f(s) = 0 otherwise. Reduce, as well as the more general PrefixSum operation, are well studied parallel primitives, which are used in numerous parallel computations and algorithms [69]. We also assume a DecrementAndFetch (DAF) to be available; it atomically decrements its operand and returns the new value [31]. We use DAF to implement Join to synchronize processors (Join decrements its operand, returns the new value, and releases a processor under a specified condition). For details, we refer to the paper of Hasenplaugh et al. [31].

E. Randomization

Randomization has been used in numerous algorithms [30], [70]–[75]. We distinguish between Monte Carlo algorithms, which return a correct result w.h.p.², and Las Vegas algorithms, which always return the correct result but have probabilistic run-time bounds. The JP algorithm, with an ordering heuristic that employs randomness like ADG, is a Las Vegas algorithm. For a large part of the analysis we use random variables to describe various random events. To prove that statements hold w.h.p. we mainly use simple Markov and Chernoff bounds. Further, we also employ another technique of coupling random variables in our analysis. A coupling of two random variables X and Y is defined as a new variable (X′, Y′) over the joint probability space, such that the marginal distributions of X′ and Y′ coincide with the distributions of X and Y, respectively. More precisely, we can define this property as follows: Let f(x) be the distribution of X and g(y) be the distribution of Y. Then a random variable (X′, Y′) is a coupling of X and Y if and only if both statements Σ_y Pr[X′ = x, Y′ = y] ≡ f(x) and Σ_x Pr[X′ = x, Y′ = y] ≡ g(y) hold [76].

²A statement holds with high probability (w.h.p.) if it holds with a probability of at least 1 − 1/nᶜ for all c.
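The Reduce/Count primitives of Section II-D on a bitmap representation can be illustrated with a small sequential stand-in (an assumed sketch; a real implementation would be a parallel tree reduction of O(log n) depth):

```python
# Reduce sums f(s) over the elements of a set S (here a bitmap over
# vertex IDs); Count is Reduce with the constant operator f(s) = 1.

def reduce_op(bitmap, f):
    """Sequential stand-in for the O(log n)-depth parallel Reduce."""
    return sum(f(i) for i, present in enumerate(bitmap) if present)

def count(bitmap):
    return reduce_op(bitmap, lambda i: 1)

D = [2, 3, 1, 4]               # vertex degrees
U = [True, False, True, True]  # bitmap of active vertices {0, 2, 3}
assert count(U) == 3
# Sum of degrees in U with operator f(v) = D[v], as used by ADG to
# compute the average degree:
assert reduce_op(U, lambda v: D[v]) == 7
```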

III. PARALLEL APPROXIMATE DEGENERACY ORDERING

We first describe ADG, a parallel algorithm for computing a partial approximate degeneracy ordering. ADG outputs vertex priorities ρADG, which are then used by our coloring algorithms (Section IV). Specifically, these priorities produce an order in which to color the vertices (ties are broken randomly). ADG is shown in Algorithm 1. ADG is similar to SL [28], which iteratively removes vertices of the smallest degree to construct the exact degeneracy ordering. The key difference and our core idea is to repeatedly remove in parallel all vertices with degrees smaller than (1 + ε)δ̂. The parameter ε ≥ 0 controls the approximation accuracy. We multiply 1 + ε by the average degree δ̂ as it enables good bounds on quality and run-time, as we show in Lemmas 1 and 4. Compared to SL (which has depth O(n)), ADG has depth O(log² n) and obtains a partial 2(1 + ε)-approximate degeneracy ordering.

1  /* Input : A graph G(V,E).
2   * Output : A priority (ordering) function ρ : V → R. */
3
4  D = [deg(v1) deg(v2) ... deg(vn)]  // An array with vertex degrees
5  ℓ = 1; U = V  // U is the induced subgraph used in each iteration ℓ
6
7  while U ≠ ∅ do:
8    |U| = Count(U)  // Derive |U| using a primitive Count
9    cnt = Reduce(U)  // Derive the sum of degrees in U: Σ_{v∈U} D[v]
10   δ̂ = cnt / |U|  // Derive the average degree for vertices in U
11
12   // R contains vertices assigned priority in a given iteration:
13   R = {u ∈ U | D[u] ≤ (1 + ε)δ̂}
14   UPDATE(U, R, D)  // Update D to reflect removing R from U
15   U = U \ R  // Remove selected low-degree vertices (that are in R)
16   for all v ∈ R do in parallel:  // Set the priority of vertices
17     ρADG(v) = ℓ  // The priority is the current iteration number ℓ
18   ℓ = ℓ + 1
19
20 // Update D to reflect removing vertices in R from a set U:
21 UPDATE(U, R, D):
22   for all v ∈ R do in parallel:
23     for all u ∈ N_U(v) do in parallel:
24       DecrementAndFetch(D[u])

Algorithm 1: ADG, our algorithm for computing the 2(1 + ε)-approximate degeneracy ordering; it runs in the CRCW setting.

In ADG, we maintain a set U ⊆ V of active vertices that starts as V (Line 5). In each step (Lines 8–18), we use ε and δ̂ to select vertices with small enough degrees (Line 13); the average degree δ̂ is computed in Lines 8–10. The selected vertices form a set R and receive a priority ρADG equal to the step counter ℓ. We then remove them from the set U (Line 15) and update the degrees D accordingly (Line 14). We continue until the set U is empty. When used with JP, ties between vertices with the same ρADG are broken with another priority function ρ′ (e.g., random ordering).

Design Details For the theoretical analysis we use the following design assumptions. We implement D as an array and use n-bit dense bitmaps for U and R. This enables updating vertex degrees in O(1) and resolving v ∈ U and v ∈ R in O(1) time. Constructing R in each step can be implemented in O(1) depth and O(|U|) work. The operation U = U \ R takes O(1) depth and O(|R|) work by overwriting the bitmap U. To calculate the average degree on Line 10, we derive |U| and sum all degrees of vertices in U. The former is done with a Count over U. The latter uses Reduce with the associated operator f(v) = D[v]. As both Reduce operations run in O(log n) depth and O(|U|) work, the same holds for the average degree calculation.

Depth First, note that each line in the while loop runs in O(log n) depth, as discussed above. We will now prove that the while loop iterates O(log n) many times, giving the total ADG depth of O(log² n). The key notion is that, in each iteration, we remove a constant fraction of vertices due to the way that we construct R (based on the average degree δ̂).

Lemma 1. For a constant ε > 0, ADG does O(log n) iterations and has O(log² n) depth in the CRCW setting.

Proof. At each step ℓ of the algorithm we can have at most n/(1+ε) vertices with a degree larger than (1 + ε)δ̂_ℓ. This can be seen from the fact that the sum of degrees in the current subgraph can be at most n times the average degree δ̂_ℓ. For n/(1+ε) vertices with a degree exactly (1 + ε)δ̂_ℓ we get (n/(1+ε)) · (1 + ε)δ̂_ℓ = nδ̂_ℓ, which would result in a contradiction if we had more than n/(1+ε) vertices with larger degree. Thus, if we remove all vertices with degree ≤ (1 + ε)δ̂_ℓ, we remove a constant fraction of vertices in each iteration (at least (ε/(1+ε))n vertices), which implies that ADG performs O(log n) iterations in the worst case, immediately giving the O(log² n) depth. To see this explicitly, one can define a simple recurrence relation for the number of iterations: T(n) ≤ 1 + T(n/(1+ε)), T(1) = 1; solving it gives T(n) ≤ ⌈log n / log(1+ε)⌉ + 1 ∈ O(log n).

Work The proof of work is similar; it also uses the fact that a constant fraction of vertices is removed in each iteration. Intuitively, (1) we show that each while loop iteration performs O(Σ_{v∈R} deg(v) + |U_i|) work (where U_i is the set U in iteration i), and (2) we bound Σ_{i=1}^{k} |U_i| by a geometric series, implying that it is still in O(n).

Lemma 2. For a constant ε > 0, ADG does O(n + m) work in the CRCW setting.

Proof. Let k be the number of iterations we perform and let U_i be the set U in iteration i. To calculate the total work performed by ADG, we first consider the work in one iteration. As explained in "Design Details", deriving the average degree takes O(|U_i|) work in one iteration. Initializing R takes O(|U_i|) and removing R from U_i takes O(|R|). UPDATE takes O(Σ_{v∈R} deg(v)) work. Thus, the total work in one iteration is in O(Σ_{v∈R} deg(v) + |U_i|). As each vertex becomes included in R in a single unique iteration, this gives Σ_{i=1}^{k} Σ_{v∈R_i} deg(v) ∈ O(m). Moreover, since we remove a constant fraction of vertices in each iteration from U (at least ε/(1+ε), as shown above in the proof of the depth of ADG), we can bound Σ_{i=1}^{k} |U_i| by a geometric series, implying that it is still in O(n). This can be seen from the fact that Σ_{i=1}^{k} |U_i| ≤ Σ_{i=0}^{k} (1/(1+ε))ⁱ n ≤ Σ_{i=0}^{∞} (1/(1+ε))ⁱ n = n(1+ε)/ε if 1/(1+ε) < 1 (which holds as ε > 0). Ultimately, we have O(Σ_{i=0}^{k} (Σ_{v∈R_i} deg(v) + |U_i|)) ∈ O(m) + O(n) ∈ O(m + n).

Approximation ratio We now prove that the approximation ratio of ADG on the degeneracy order is 2(1 + ε). First, we give a small lemma used throughout the analysis.

Lemma 3. Every induced subgraph of a graph G with degeneracy d has an average degree of at most 2d.

Proof. By the definition of a d-degenerate graph, in every induced subgraph G[U], there is a vertex v with deg_U(v) ≤ d. If we remove v from G[U], at most d edges are removed. Thus, if we iteratively remove such vertices from G[U], until only one vertex is left, we remove at most d · (|U| − 1) edges. We conclude that δ̂(G[U]) = (1/|U|) Σ_{v∈U} deg_U(v) ≤ 2d.

Lemma 4. ADG computes a partial 2(1 + ε)-approximate degeneracy ordering of G.

Proof. By the definition of R (Line 13), all vertices removed in step ℓ have a degree of at most (1 + ε)δ̂_ℓ, where δ̂_ℓ is the average degree of the vertices in the subgraph U in step ℓ. From Lemma 3, we know that δ̂_ℓ ≤ 2d. Thus, each vertex has a degree of at most 2(1 + ε)d in the subgraph G[U] (in the current step). Hence, each vertex has at most 2(1 + ε)d neighbors that are ranked equal or higher. The result follows by the definition of a partial 2(1 + ε)-approximate degeneracy order.

A. Comparison to Other Vertex Orderings

We analyze orderings in Table II. While SLL and ASL heuristically approximate SL, which computes a degeneracy ordering, they do not offer guaranteed approximation factors. Only ADG comes with provable bounds on the accuracy of the degeneracy order while being (provably) parallelizable.

Ordering heuristic                    Time / Depth      Work      F.?   B., Approx.?
FF (first-fit) [25]                   O(1)              O(1)      n/a   n/a
R (random) [26], [31]                 O(1)              O(n)      ✓     n/a
ID (incidence-degree) [1]             O(n + m)          O(n + m)  n/a   n/a
SD (saturation-degree) [27], [31]     O(n + m)          O(n + m)  n/a   n/a
LF (largest-degree-first) [31]        O(1)              O(n)      ✓     n/a
LLF (largest-log-degree-first) [31]   O(1)              O(n)      ✓     n/a
SLL (smallest-log-degree-last) [31]   O(log ∆ log n)    O(n + m)  ✓     ✗
SL (smallest-degree-last) [28], [31]  O(n)              O(m)      ✓     ✓ exact
ASL (approximate-SL) [32]             O(n)              O(m)      ✓     ✗
ADG [approx. degeneracy]              O(log² n)         O(n + m)  ✓     ✓ 2(1 + ε)

TABLE II: Ordering heuristics related to the degeneracy ordering. "F. (Free)?": Is the scheme free from concurrent writes? "B. (Bounds)?": Are there provable bounds and an approximation ratio on the degeneracy ordering? "✓": support, "✗": no support. Notation is explained in Table I and in Section II.

B. Using Only Concurrent Reads

So far, we have presented ADG in the CRCW setting. We now discuss modifications required to make it work in the CREW setting. The key change is to redesign the UPDATE routine, as illustrated in Algorithm 2. This preserves the O(log² n) depth, but the work increases to O(m + nd).

1 UPDATE(U_i, R_i, D):  // U_i and R_i are sets U and R in iteration i
2   for all v ∈ U_i do in parallel:
3     D[v] = D[v] − Count(N_{U_i}(v) ∩ R_i)

Algorithm 2: The modified version of the UPDATE routine (Algorithm 1) that works in the CREW setting.

Lemma 5. A variant of ADG as specified in Algorithm 2 and in § III-B does O(m + nd) work in the CREW setting.

Proof. The proof is identical to that for ADG in the CRCW setting; the difference is in analyzing the impact of UPDATE on the total work. To compute Count(N_{U_i}(v) ∩ R_i) (cf. Algorithm 2), we use a Reduce over N_{U_i}(v). Since R_i ⊆ U_i, we can define the operator f as f(v) = 1 if v ∈ R_i, and f(v) = 0 otherwise. Thus, when computing the total work in iteration i, instead of Σ_{v∈R_i} deg_{U_i}(v), we must consider Σ_{v∈U_i} deg_{U_i}(v). Consequently, we have Σ_{i=1}^{k} Σ_{v∈U_i} deg_{U_i}(v) ≤ Σ_{i=1}^{k} (2d · |U_i|) ≤ 2d · Σ_{i=1}^{k} |U_i| ∈ O(nd) (the first inequality by Lemma 3, the second one by the observation that we remove a constant fraction of vertices in each iteration, cf. the proof of Lemma 1).

IV. PARALLEL GRAPH COLORING

We now use our approximate degeneracy ordering to develop new parallel graph coloring algorithms. We directly enhance the recent line of works based on scheduling colors, i.e., assigning colors to vertices without generating coloring conflicts (§ IV-A). In two other algorithms, we allow conflicts but we also provably resolve them fast (§ IV-B, § IV-C).

A. Graph Coloring by Color Scheduling (JP-ADG)

We directly enhance recent works of Hasenplaugh et al. [31] by combining their Jones-Plassmann (JP) version of coloring with our ADG, obtaining JP-ADG. For this, we first overview JP and the definitions used in JP. The JP algorithm uses the notion of a computation DAG Gρ(V, Eρ), which is a directed version of the input graph G. Specifically, the DAG Gρ is used by JP to schedule the coloring of the vertices: the position of a vertex v in the DAG Gρ determines the moment the vertex v is colored. The DAG Gρ contains the edges of G directed from the higher priority to the lower priority vertices according to the priority function ρ, i.e., Eρ = {(u, v) ∈ E | ρ(u) > ρ(v)}.

JP is described in Algorithm 3. As input, besides G, it takes a priority function ρ : V → R which defines a total order on the vertices V. First, JP uses ρ to compute a DAG Gρ(V, Eρ), where edges always go from vertices with higher ρ to ones with lower ρ (Alg. 3, Lines 6–9). Vertices can then be safely assigned a color if all neighbors of higher ρ (predecessors in the DAG) have been colored. The algorithm does this by first calling JPColor with the set of vertices that have no predecessors (Alg. 3, Lines 13–15). JPColor then colors v by calling GetColor, which chooses the smallest color not already taken by v's predecessors. Afterwards, JPColor checks if any of v's successors can be colored, and if yes, it calls JPColor on them. As shown by Hasenplaugh et al. [31], this algorithm can run in O(|P| log ∆ + log n) depth and O(n + m) work, where |P| is the size of the longest path in Gρ.

Now, in JP-ADG, we first call ADG to derive ρADG. Then, we run JP using ρADG. More precisely, we use ρ = ⟨ρADG, ρR⟩, where ρR randomly breaks ties of vertices that were removed by ADG in the same iteration in Algorithm 1, and thus have the same rank in ρADG. The obtained JP-ADG algorithm is similar to past work based on JP in that it follows the same "skeleton", in which the coloring of vertices is guided by a precomputed order, in our case ρADG.

past work based on JP in that it follows the same “skeleton” in which the coloring of vertices is guided by the pre-computed order, in our case ρADG. However, as we prove later in this section, using ADG is key to our novel bounds on depth, work, and coloring quality. Intuitively, ADG gives an ordering of vertices in which each vertex has a bounded number of predecessors (by the definition of s-degenerate graphs and graph degeneracy d). We use this to bound the coloring quality and the sizes of subgraphs in Gρ. The latter enables bounding the maximum path in Gρ, which in turn gives depth and work bounds.

As for different combinations of JP and orderings, we define them similarly to past work. JP-R is JP with a random priority function ρR. JP-FF uses the natural vertex order. JP-LF uses ρ(v) = ⟨deg(v), ρR⟩ with a lexicographic order. JP-SL is defined by ρ(v) = ⟨ρSL, ρR⟩, with a degeneracy ordering ρSL. JP-LLF is defined by ρ = ⟨⌈log(deg(v))⌉, ρR⟩ and JP-SLL by ρ = ⟨ρSLL, ρR⟩, where ρSLL is the order computed by the SLL algorithm from Hasenplaugh et al. [31].

We first prove a general property of JP-ADG, which we will use to derive bounds on coloring quality, depth, and work.

Lemma 6. JP, using a priority function ρ that defines a k-approximate degeneracy ordering, colors a graph G with at most kd + 1 colors.

Proof. Since ρ defines a k-approximate degeneracy ordering, any v ∈ V has at most kd neighbors v′ with ρ(v′) ≥ ρ(v), and thus at most kd predecessors in the DAG. Now, we can choose the smallest color available from {1, ..., kd + 1} to color v once all of its predecessors have been colored.

This immediately yields the quality bound of Corollary 1 stated below: since ⟨ρADG, ρR⟩ is a 2(1+ε)-approximate degeneracy ordering, the result follows from Lemmas 4 and 6.

Depth, Work. To bound the depth of JP-ADG, we follow the approach by Hasenplaugh et al. [31]: we analyze the expected length of the longest path in the DAG induced by JP-ADG to bound its expected depth. We first provide some additional definitions. An induced subgraph of Gρ is a (directed) subgraph of Gρ induced by a given vertex set. Note that, as ρ is a total order on V, the DAG Gρ is strongly connected. Finally, we denote ρ̄ = max_{v∈V} {ρADG(v)}.

Lemma 7. For a priority function ρ = ⟨ρADG, ρR⟩, where ρADG is a partial k-approximate degeneracy ordering for a constant k > 1 and ρR is a random priority function, the expected length of the longest path in the DAG Gρ is O(d log n + (log d log² n)/log log n).

Proof. Let Gρ(ℓ) be the subgraph of Gρ induced by the vertex set V(ℓ) = {v ∈ V | ρADG(v) = ℓ}. Let ∆_ℓ be the maximal degree and δ̂_ℓ the average degree of the subgraph Gρ(ℓ). Since, by the definition of Gρ, there can be no edges in Gρ that go from one subgraph Gρ(ℓ) to another Gρ(ℓ′) with ℓ′ > ℓ, we can see that a longest (directed) path P in Gρ will always go through the subgraphs Gρ(ℓ) in a monotonically decreasing order with regard to ℓ. Therefore, we can split P into a sequence of (directed) sub-paths P_1, ..., P_ρ̄, where P_ℓ is a path in Gρ(ℓ). We have |P| = Σ_{i ∈ {ρADG(v) | v∈V}} |P_i| and, by Corollary 6 from past work [31], the expected length of a longest sub-path P_ℓ is in O(∆_ℓ + log ∆_ℓ · log n / log log n), because Gρ(ℓ) is induced by a random priority function. By linearity of expectation, we have for the whole path P:

E[|P|] = O( Σ_{ℓ=1}^{ρ̄} ( ∆_ℓ + log ∆_ℓ · (log n / log log n) ) )   (1)

Next, since ρADG is a partial k-approximate degeneracy ordering, all vertices in Gρ(ℓ) have at most kd neighbors in Gρ(ℓ). Thus, ∆_ℓ ≤ kd holds. This and the fact that ρ̄ ∈ O(log n) give:

Σ_{i=1}^{ρ̄} ∆_i ≤ Σ_{i=1}^{ρ̄} kd ∈ O(d log n)   (2)

Σ_{i=1}^{ρ̄} log ∆_i ∈ O(log d · log n)   (3)

Thus, for the expected length of a longest path in Gρ:

E[|P|] = O( d log n + (log d log² n)/log log n )   (4)

1 /* Input: A graph G(V,E), a priority function ρ.
2  * Output: An array C that assigns a color to each vertex. */
3
4 // Part 1: compute the DAG Gρ based on ρ
5 C = [0 0 ... 0] // Initialize colors
6 for all v ∈ V do in parallel:
7   // Predecessors and successors of each vertex v in Gρ:
8   pred[v] = {u ∈ N(v) | ρ(u) > ρ(v)}
9   succ[v] = {u ∈ N(v) | ρ(u) < ρ(v)}
10  // Number of uncolored predecessors of each vertex v in Gρ:
11  count[v] = |pred[v]|
12
13 // Part 2: color vertices using Gρ
14 for all v ∈ V do in parallel:
15  // Start by coloring all vertices without predecessors:
16  if pred[v] == ∅: JPColor(v)
17
18 JPColor(v) // JPColor, a routine used in JP
19  C[v] = GetColor(v)
20  for all u ∈ succ[v] do in parallel:
21   // Decrement u's counter to reflect that v is now colored:
22   if Join(count[u]) == 0:
23    JPColor(u) // Color u if it has no uncolored predecessors
24
25 GetColor(v) // GetColor, a routine used in JPColor
26  C = {1, 2, ..., |pred[v]| + 1}
27  for all u ∈ pred[v] do in parallel: C = C − {C[u]}
28  return min(C) // Output: the smallest color available.

Algorithm 3: JP, the Jones-Plassmann coloring heuristic. With ρ = ⟨ρADG, ρR⟩, it gives JP-ADG, which provides a (2(1+ε)d + 1)-coloring.
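For reference, the logic of Algorithm 3 can be sketched sequentially in Python. The parallel recursion (JPColor/Join) is replaced by a worklist, and the graph layout is a hypothetical stand-in; this is an illustration of the scheme, not the paper's parallel implementation.

```python
from collections import deque

def jp_color(N, rho):
    """Sequential sketch of the Jones-Plassmann scheme (Algorithm 3).
    N: adjacency sets {v: set of neighbors}; rho: a unique priority per
    vertex. Higher-priority neighbors act as predecessors in the DAG."""
    pred = {v: {u for u in N[v] if rho[u] > rho[v]} for v in N}
    succ = {v: {u for u in N[v] if rho[u] < rho[v]} for v in N}
    count = {v: len(pred[v]) for v in N}
    C = {v: 0 for v in N}
    ready = deque(v for v in N if count[v] == 0)  # no predecessors
    while ready:
        v = ready.popleft()
        used = {C[u] for u in pred[v]}  # all predecessors are colored
        # GetColor: smallest color in {1, ..., |pred(v)| + 1} not used.
        C[v] = min(c for c in range(1, len(pred[v]) + 2) if c not in used)
        for u in succ[v]:               # the "Join" on u's counter
            count[u] -= 1
            if count[u] == 0:
                ready.append(u)
    return C
```

Because a vertex is colored only after all its predecessors, the chosen color never exceeds the vertex's predecessor count plus one, matching Lemma 6.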

Coloring Quality. The coloring quality now follows from the properties of the priority function obtained with ADG.

Corollary 1. With priorities ρ = ⟨ρADG, ρR⟩, JP-ADG colors a graph with at most 2(1+ε)d + 1 colors, for ε > 0.

Our main result follows by combining our bounds on the longest path P in the DAG Gρ with a result by Hasenplaugh et al. [31], which shows that JP has O(log n + log ∆ · |P|) depth.
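Concretely, plugging the bound on E[|P|] from Lemma 7 into this depth bound, and adding the O(log² n) depth of computing ADG itself, gives the depth claimed in Theorem 1 below (a sketch of the arithmetic, not a replacement for the formal proof):

```latex
\underbrace{O(\log^2 n)}_{\text{compute ADG}}
+ \underbrace{O\bigl(\log n + \log\Delta \cdot \mathbb{E}[|P|]\bigr)}_{\text{JP on } G_\rho}
= O\Bigl(\log^2 n + \log\Delta \cdot \Bigl(d\log n + \tfrac{\log d \,\log^2 n}{\log\log n}\Bigr)\Bigr).
```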

Theorem 1. JP-ADG colors a graph G with degeneracy d in expected depth O(log² n + log ∆ · (d log n + (log d log² n)/log log n)) and O(n + m) work in the CRCW setting.

Proof. Since ρADG is a partial 2(1+ε)-approximate degeneracy ordering (Lemma 4) and since ADG performs at most O(log n) iterations (Lemma 1), the depth follows from Lemma 7, Lemma 1, and past work [31], which shows that JP runs in O(log n + log ∆ · |P|) depth. As both JP and ADG perform O(n + m) work, so does JP-ADG.

B. Graph Coloring by Silent Conflict Resolution (DEC-ADG)

Our second coloring algorithm takes a radical step away from the long line of heuristics based on JP. The key idea is to use ADG to decompose the input graph into low-degree partitions (thus “DEC-ADG”), as shown in Algorithm 4. Here, ADG is again crucial to our bounds.

1 /* Input: G(V,E) (input graph).
2  * Output: An array C that assigns a color to each vertex. */
3
4 C = [0 0 ... 0] // Initialize an array of colors
5 // Run ADG* to derive a 2(1 + ε/12)-approximate degeneracy ordering.
6 // G ≡ [G(1) ... G(ρ̄)] contains ρ̄ low-degree partitions.
7 // We have G(i) = G[R(i)], where R(i) = {v ∈ V | ρ(v) = i}.
8 (ρ, G) = ADG*(G)
9
10 // Initialize bitmaps Bv to track colors forbidden for each vertex:
11 ∀v∈V Bv = [00...0] // Each bitmap has ⌈2(1 + ε/12)(1 + µ)d⌉ + 1 bits.
12 SIM-COL(G(ρ̄), {Bv | v ∈ R(ρ̄)}) // First, we color G(ρ̄).
13
14 for ℓ from ρ̄ − 1 down to 1 do: // For all low-degree partitions:
15  Q = R(ρ̄) ∪ · · · ∪ R(ℓ + 1) // A union of already colored partitions.
16  for all v ∈ R(ℓ) do in parallel:
17   for all u ∈ NQ(v) do in parallel: // For v's colored neighbors:
18    Bv = Bv ∪ C[u] // Update colors forbidden for v.
19  SIM-COL(G(ℓ), {Bv | v ∈ R(ℓ)}) // Run Algorithm 5.

Algorithm 4: DEC-ADG, the second proposed parallel coloring heuristic, which provides a (2(1+ε)d)-coloring. Note that we use the factors ε/4 and ε/12 for more straightforward proofs (this is possible as ε can be an arbitrary non-negative value).
Specifically, vertices with the same ADG rank form a partition that is “low-degree”: it has a bounded number of edges to any other such partition (by the definition of ADG). Each such partition is then colored separately, with a simple randomized scheme (Algorithm 5). This may generate coloring conflicts, i.e., neighboring vertices with identical colors. Such conflicts are resolved “silently” by repeating the coloring on the conflicting vertices as many times as needed. As ADG bounds the number of edges between partitions, it also bounds the number of conflicts, improving depth and quality.

We first detail Algorithm 4. A single low-degree partition G(ℓ), produced by iteration ℓ of ADG, is the induced subgraph of G over the vertex set R removed in this iteration (Line 13, Alg. 1). Formally, G(i) = G[R(i)], where R(i) = {v ∈ V | ρ(v) = i} and ρ is the partial k-approximate degeneracy order produced by ADG (cf. § II-B). Thus, in DEC-ADG, we first run ADG to derive the ordering ρ and also the number ρ̄ of low-degree partitions (ρ̄ ∈ O(log n)). Here, we use ADG*, a slightly modified ADG that also records, as an array G ≡ [G(1) ... G(ρ̄)], each low-degree partition. Then, we iterate over these partitions (starting from ρ̄) and color each with SIM-COL (“SIMple COLoring”, Alg. 5). We discuss SIM-COL in more detail later in this section; its semantics are that it colors a given arbitrary graph G (in our context, G is the ℓ-th partition G(ℓ)) using (1+µ)∆ colors, where µ > 0 is an arbitrary value. To keep the coloring consistent with respect to already colored partitions, we maintain bitmaps Bv that indicate the colors already taken by v's neighbors in already colored partitions: if v cannot use a color c, the c-th bit in Bv is set to 1. These bitmaps are updated before each call of SIM-COL (Lines 16–18).

How large should Bv be, to minimize storage overheads but also to ensure that each vertex has enough colors to choose from? We observe that a single bitmap Bv should be able to contain at most as many colors as v has neighbors in the partition currently being colored (G(ℓ)) and in all partitions that have already been colored (G(ℓ′), ℓ′ > ℓ). We denote this neighbor count by deg_ℓ(v). Observe that any deg_ℓ(v) is at most kd = ⌈2(1+ε)d⌉, as partitions are created according to a partial k-approximate degeneracy order, where k = 2(1+ε) (note that, in DEC-ADG, we use the factors ε/4 and ε/12 instead of ε for more straightforward proofs; this is possible as ε can be an arbitrary non-negative value). Now, when coloring a partition G(ℓ), we know that SIM-COL, by its design, chooses colors for v only in the range {1, ..., (1+µ)deg_ℓ(v) + 1} (Algorithm 5, Line 7; as we will show, using such a range will enable the advantageous bounds for DEC-ADG). Thus, it suffices to keep bitmaps of size ⌈(1+µ)kd⌉ + 1 for each vertex, where k = 2(1 + ε/12).

In SIM-COL, we color a single low-degree partition G(ℓ) = (V(ℓ), E(ℓ)). SIM-COL takes two arguments: (1) the partition to be colored (it can be an arbitrary graph G = (V,E), but for clarity we explicitly use G(ℓ) = (V(ℓ), E(ℓ)), which denotes a partition from a given iteration ℓ of DEC-ADG), and (2) the bitmaps associated with the vertices in a given partition R(ℓ). By design, SIM-COL delivers a ((1+µ)∆)-coloring; µ > 0 can be an arbitrary value. To be able to derive the final bounds for DEC-ADG, we set µ = ε/4. U contains the vertices still to be colored, initialized as U = V(ℓ). In each iteration, the vertices in U are first colored randomly. Then, each vertex v compares its color C[v] to the colors of its active (not yet colored) neighbors in NU(v) and checks whether C[v] is already taken by other neighbors inside and outside of V(ℓ) (by checking Bv); see Lines 9–13. The goal is to identify whether at least one such neighbor has the same color as v. For this, we use Reduce over NU(v) with the operator feq defined as feq(u) = (C[v] == C[u]) (the “==” operator works analogously to the same-name operator in C++), together with a simple lookup in Bv. If v and u have the same color, feq(u) equals 1. Thus, if any of v's neighbors in U have the same color as v, Reduce(NU(v), feq) > 0. This enables us to attempt to re-color v by setting C[v] = 0. If a vertex gets colored, we remove it from U (Line 19) and update the bitmaps of its neighbors (Line 18). We iterate until U is empty.

Depth, Work. We now prove the depth and work bounds of DEC-ADG. The key observation is that the probability that a particular vertex becomes inactive (permanently colored) is constant, regardless of the coloring status of its neighbors. The key proof technique is to use Markov and Chernoff bounds.

1 /* Input: G(V(ℓ), E(ℓ)) (input graph partition), Bv (a bitmap with
2  * colors forbidden for each v). Output: color C[v] > 0. */
3 U = V(ℓ)
4 while U ≠ ∅ do:
5  // Part 1: all vertices in U are colored randomly:
6  for all v ∈ U do in parallel:
7   choose C[v] u.a.r. from {1, ..., (1 + µ)deg_ℓ(v)}
8
9  // Part 2: each vertex compares its color to its active neighbors.
10 // If the color is non-unique, the vertex must be re-colored.
11 for all v ∈ U do in parallel:
12  // feq(v, u) = (C[v] == C[u]) is the operator in Reduce.
13  if Reduce(NU(v), feq) > 0 || C[v] ∈ Bv: C[v] = 0
14
15 // Part 3: Update Bv for all neighbors with fixed colors:
16 for all v ∈ U do in parallel:
17  for all u ∈ NU(v) do in parallel:
18   if C[u] > 0: Bv = Bv ∪ C[u]
19 U = U \ {v ∈ U | C[v] > 0} // Update vertices still to be colored.

Algorithm 5: SIM-COL, our simple coloring routine used by DEC-ADG. It delivers a ((1+µ)∆)-coloring, where µ > 0 is an arbitrary value. When using SIM-COL as a subroutine in DEC-ADG, we instantiate µ as µ = ε/4; we use this value in the listing above for concreteness.

Before we proceed with the analysis, we provide some definitions. For each round of SIM-COL (Algorithm 5), we define an indicator random variable Xv to refer to the event in which a vertex v gets removed from U (i.e., becomes colored and thus inactive) in this specific round. The vertex v is removed if and only if the color C[v], which is selected on Line 7, is not used by some neighbor of v (i.e., this color is not in Bv) and no active neighbor chose C[v] in this round. The random variable X̄v indicates the complement of the event Xv (i.e., a vertex v is not removed from U in a given round). Next, let Z be a Bernoulli random variable with probability Pr[Z = 1] ≡ p = 1 − 1/(1+µ), and let Z̄ be the complement of Z. Finally, we use the concept of stochastic dominance: for two random variables A and B that are defined over the same set of possible outcomes, we say that A stochastically dominates B if and only if Pr[A ≥ z] ≥ Pr[B ≥ z], for all z. In the following, we show that the event of an arbitrary vertex v becoming deactivated (Xv = 1) is at least as probable as Z = 1. This will enable us to use these mutually independent variables Z to analyze the time complexity of SIM-COL and DEC-ADG.

Claim 1. In every iteration, for every vertex v, the probability that the vertex v becomes inactive is at least 1 − 1/(1+µ).

Proof. The probability that v becomes inactive in any iteration (Pr[Xv = 1]) is at least 1 − i/((1+µ)deg_ℓ(v)), where i is the number of distinct colors in Bv and received from neighbors in this round. This is because, in each iteration, while v connects to vertices with a total of i distinct colors, the total number of colors to be selected from is (1+µ)deg_ℓ(v). Now, as v can have at most deg_ℓ(v) colored neighbors, we get 1 − i/((1+µ)deg_ℓ(v)) ≥ 1 − deg_ℓ(v)/((1+µ)deg_ℓ(v)) = 1 − 1/(1+µ), which shows that Pr[Xv = 1] ≥ Pr[Z = 1] holds for all active v.

Thus, in expectation, a constant fraction of the vertices becomes inactive in every iteration. In the next step, we will apply Markov and Chernoff bounds to an appropriately chosen binomial random variable, showing that the number of vertices that are removed is concentrated around its expectation. Hence, the algorithm terminates after O(log n) iterations.

First, to prove the intuitive fact that Z can be used to approximate the number of vertices removed in each round, we use the technique of coupling (see § II-E) together with a handy equivalence between stochastic dominance and the coupling of two random variables [76].

Lemma 8. The random variable X = Σ_{v∈U} Xv stochastically dominates Y = Σ_{i=1}^{|U|} Z.

Proof. Without loss of generality, let us assume that all vertices in U are numbered from 1 to |U|, so that we can write X = Σ_{i=1}^{|U|} X_i. Further, note that we have shown in Claim 1 that the event Xv = 1 happens independently with a probability of at least 1 − 1/(1+µ), and therefore Pr[Xv = 1] ≥ Pr[Z = 1] holds. We want to construct a coupling (X̂, Ŷ) with Pr[X̂ ≥ Ŷ] = 1 to show the stochastic dominance of X [76]. First, let Ŷ_i be equal to Y_i for all i = 1..|U| and define Ŷ to be Σ_{i=1}^{|U|} Ŷ_i. Next, define X̂_i to be 0 whenever Ŷ_i = 0. If Ŷ_i is 1, we define the probability of X̂_i to be Pr[X_i = 1] conditioned on X_i ≥ 1. We can see that, by design, X̂_i ≥ Ŷ_i holds for all i = 1..|U|, and therefore, with Theorem 4.23 of other work [76], we conclude that X dominates Y.

Lemma 9. The random variable X̄ = Σ_{v∈U} X̄v is stochastically dominated by the random variable Ȳ = Σ_{i=1}^{|U|} Z̄.

Proof. From Claim 1, we already know that ∀v ∈ U: Pr[Xv = 1] ≥ Pr[Z = 1]. Since Xv and Z are Bernoulli random variables, Pr[Z̄ = 1] ≥ Pr[X̄v = 1] holds for their complements. Now, with a similar argument as shown in Lemma 8, we can again conclude the proof.

Lemma 10. SIM-COL performs O(log n) iterations w.h.p. (for constant µ > 1).

Proof. The number of vertices that are not removed (i.e., not permanently colored) in one round can be expressed as X̄ = Σ_{v∈U} X̄v (X̄v = 1 means that v is not removed from U in a given iteration). Now, we apply the Markov inequality to Ȳ:

Pr[ Ȳ ≥ µ|U|/(1+µ) ] ≤ E[Ȳ] / (µ|U|/(1+µ)) = (|U|/(1+µ)) / (µ|U|/(1+µ)) = 1/µ.

As Ȳ dominates X̄ (Lemma 9), we have

Pr[ Ȳ ≥ µ|U|/(1+µ) ] ≥ Pr[ X̄ ≥ µ|U|/(1+µ) ].

Consequently, Pr[ X̄ ≥ µ|U|/(1+µ) ] ≤ 1/µ. Hence,

Pr[ X̄ < µ|U|/(1+µ) ] > 1 − 1/µ.

Note that 0 < 1 − 1/µ ≤ 1 for µ > 1. Thus, with probability at least 1 − 1/µ, at least µ/(1+µ)·|U| vertices are removed from U (i.e., permanently colored) in each round.
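For reference, the round structure of Algorithm 5 can be sketched sequentially in Python. The set-based “bitmaps”, the fixed seed, and the `max_rounds` guard are illustrative assumptions for this sketch and are not part of the paper's algorithm.

```python
import random

def sim_col(V, N, forbidden, mu=1.5, seed=0, max_rounds=1000):
    """Sequential sketch of SIM-COL (Algorithm 5) on one partition.
    V: vertices of the partition; N: adjacency within the partition;
    forbidden[v]: colors taken by v's neighbors in previously colored
    partitions (the bitmap B_v, modeled as a Python set)."""
    rng = random.Random(seed)
    C = {v: 0 for v in V}
    U = set(V)
    # Each palette has about (1 + mu) * deg_l(v) + 1 colors (cf. Line 7).
    palette = {v: int((1 + mu) * max(1, len(N[v]) + len(forbidden[v]))) + 1
               for v in V}
    for _ in range(max_rounds):
        if not U:
            break
        # Part 1: all active vertices pick a color uniformly at random.
        for v in U:
            C[v] = rng.randint(1, palette[v])
        # Part 2: a forbidden or non-unique color must be re-chosen
        # (the Reduce over N_U(v) plus the lookup in B_v).
        conflicted = {v for v in U
                      if C[v] in forbidden[v]
                      or any(u in U and C[u] == C[v] for u in N[v])}
        for v in conflicted:
            C[v] = 0
        # Part 3: record freshly fixed colors in the neighbors' bitmaps.
        for v in U:
            for u in N[v]:
                if u in U and C[u] > 0:
                    forbidden[v].add(C[u])
        U -= {v for v in U if C[v] > 0}
    return C
```

Each round, a vertex survives uncolored only if its random pick collides with a forbidden or simultaneously chosen color, which mirrors the constant deactivation probability of Claim 1.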

GC Algorithm | Time (in PRAM) or Depth (in W–D) | Work | Model | Quality | F.? G.? R.? W.? S.? Q.? | Performance | Quality | Remarks
(The first six property columns are Theoretical Properties; Performance and Quality are Practical Properties.)

Class 1: Parallel coloring algorithms not based on JP [26]. Many are not used in practice (except for work by Gebremedhin [37] and related ones [38], [40], [43], [45]); we include them for completeness of our analysis.

(MIS) Alon [77] | E O(∆ log n)† | O(m∆ log n)† | CRCW | ∆+1 | ŏ - - ŏ Ŏ ŏ | — | — | †Depth depends on ∆, P = m∆
(MIS) Goldberg [78] | O(∆ log⁴ n) | O((n + m) log⁴ n) | EREW | ∆+1 | - - ŏ ŏ Ŏ† ŏ | — | — | †Depth depends on ∆, P = (m + n)/∆
(MIS) Goldberg [79], [80] | O(log* n)† | O(n log* n) | EREW | ∆+1 | - ŏ† - -‡ - ŏ | — | — | †Graphs with ∆ ∈ O(1), P = n. ‡log* n grows very slowly.
(MIS) Goldberg [80] | O(∆ log ∆ (∆ + log* n)) | O((n + m) ∆ log ∆ (∆ + log* n)) | EREW | ∆+1 | - - ŏ ŏ Ŏ† ŏ | — | — | †Depth depends on ∆. P = n + m.
Luby [81] | O(log³ n log log n) | O((n + m) log³ n log log n) | CREW | ∆+1 | - - ŏ ŏ - ŏ | — | — | P = n + m
(MIS) Luby [30] | E O(∆ log n)† | O(m∆ log n)† | CRCW | ∆+1 | ŏ - - ŏ Ŏ ŏ | | | †Depth depends on ∆, P = m
Gebremedhin [37] | E O(∆n/P) | O(∆n) | CREW | — | - - - ŏ ŏ ŏ | | | Assuming P ≤ n/(2√m).
Gebremedhin [37] | E O(∆P√m/n)† | O(∆P²√m/n)† | CREW | — | - - - Ŏ ŏ ŏ | | | †Assuming P > n/(2√m). Work can be Ω(n + m) (for some P).
ITRB (Boman et al. [38]) | O(∆ · I) % | O(∆ · I · P) % | ä | — | - - ŏ ä† ä† ŏ | | | †No detailed bounds available
ITR (Çatalyürek [40] and others [43], [45]) | O(∆ · I) % | O(∆ · I · P) % | ä | — | - - ŏ ä† ä† ŏ | | | †No detailed bounds available
ITR-ASL (Patwary et al. [32]) | O(n · I) % | O(n · I · P) % | ä | — | - - ŏ ä† ä† ŏ | | | †No detailed bounds available

Class 2: Coloring algorithms that are not parallel and are based on the Greedy coloring scheme [25]. We include them as comparison baselines that deliver the best-known coloring quality in practice.

Greedy-ID [1] | O(n + m) | O(n + m) | Seq. | ∆+1 | - - ŏ ŏ ŏ ŏ | | | —
Greedy-SD [27], [31] | O(n + m) | O(n + m) | Seq. | ∆+1 | - - ŏ ŏ ŏ ŏ | | | —

Class 3: Parallel heuristics that constitute the largest line of work on parallel graph coloring and are fast in theory and practice. Most are based on JP [26].

JP-FF [25], [31] | No general bounds; Ω(n) for some graphs | O(n + m) | W–D | ∆+1 | ŏ - ŏ - ŏ† ŏ | | | †No general bounds
JP-LF [31] | No general bounds; Ω(∆²) for some graphs | O(n + m) | W–D | ∆+1 | ŏ - - - ŏ† ŏ | | | †No general bounds
JP-SL [31] | No general bounds; Ω(n) for some graphs | O(n + m) | W–D | d+1 | ŏ - - - ŏ† -‡ | | | †No general bounds. ‡Often, d ≪ ∆.
JP-R [26] | E O(log n / log log n)† | O(n + m) | W–D | ∆+1 | ŏ ŏ† - - - ŏ | | | †Graphs with ∆ ∈ O(1)
JP-R [31] | E O(log n + log ∆ · min{√m, ∆} + (log ∆ log n)/log log n)† | O(n + m) | W–D | ∆+1 | ŏ - - - Ŏ† ŏ | | | †Depth depends on √m or ∆
JP-LLF [31] | E O(log n + log ∆ · min{∆, √m} + (log²∆ log n)/log log n)† | O(n + m) | W–D | ∆+1 | ŏ - - - Ŏ† ŏ | | | †Depth depends on √m or ∆
JP-SLL [31] | E O(log ∆ log n + log ∆ · min{∆, √m} + (log²∆ log n)/log log n)† | O(n + m) | W–D | ∆+1 | ŏ - - - Ŏ† ŏ | | | †Depth depends on √m or ∆
JP-ASL [32], [82] | O(n · I) % | O(n · I · P) % | W–D | ∆+1 | ŏ - - ä ä ŏ | | | I is the number of iterations

JP-ADG [This Paper] | E O(log² n + log ∆ · (d log n + (log d · log² n)/log log n)) | O(n + m) | W–D | 2(1+ε)d + 1 | ŏ - - - -† -† | | | †Often, d ≪ ∆
JP-ADG-M [This Paper] (a variant described in § V) | E O(log² n + log ∆ · (d log n + (log d · log² n)/log log n)) | O(n + m) | W–D | 4d + 1 | ŏ - - - -† -† | | | †Often, d ≪ ∆
DEC-ADG [This Paper] | E O(log d log² n) w.h.p. | O(n + m) (w.h.p.) | W–D | (2+ε)d | Ŏ* - - -† - - | | | *Using CREW gives O(nd + m) work. †Work-efficient in expectation.
DEC-ADG-M [This Paper] (a variant described in § V) | E O(log d log² n) w.h.p. | O(n + m) (w.h.p.) | W–D | (4+ε)d | Ŏ* - - -† - - | | | *Using CREW gives O(nd + m) work. †Work-efficient in expectation.
DEC-ADG-ITR [This Paper] | O(I · d log n) | [work bound is complex; details in the text (§ IV-C)] | W–D | 2(1+ε)d + 1 | - - ŏ ŏ ŏ - | | | I is the number of iterations in [40]

TABLE III: Comparison of parallel graph coloring algorithms. “Greedy-X” is the Greedy sequential coloring scheme [25] with ordering X. “JP-X” is the Jones and Plassmann scheme [26] with ordering X. “(MIS)” indicates that a given algorithm solves the Maximal Independent Set (MIS) problem, but it can easily be converted into a parallel GC algorithm. “EREW”, “CREW”, and “CRCW” are well-known variants of the PRAM model. “Performance” and “Quality” summarize the run-times and coloring qualities, based on the extensive existing evaluations [31], [33], [37] and our analysis (§ VI), focusing on modern real-world graphs as input sets (we exclude Class 1 as it is not relevant for practical purposes). “Seq.” is a simple sequential (RAM) model. “W–D” indicates work-depth. “F. (Free)?”: Is the heuristic provably free from using concurrent writes? “G. (General)?”: Does the algorithm work for general graphs? “R. (Randomized)?”: Is the algorithm randomized? “W. (Work-efficient)?”: Is the algorithm provably work-efficient (i.e., does it take O(n + m) work)? “S. (Scalable)?”: Is the algorithm provably scalable (i.e., does it take O(a log^b n) depth, where a ≪ n and b ∈ N)? “Q. (Quality)?”: Does the algorithm provably ensure a coloring quality better than the trivial ∆ + 1 bound? “-”: full support, “Ŏ”: partial support, “ŏ”: no support, “ä”: unknown, “%”: bounds based on our best-effort derivation. “w.h.p.”: with high probability. I is the number of iterations. log* n grows very slowly. All symbols are explained in Table I. The proposed schemes are the only ones with provably good work, depth, and quality.

For some constants c1, c2, let now c1 · log n + c2 be the number of iterations of SIM-COL performed. We want to derive the number of iterations in which we deactivate at least µ/(1+µ)·|U| vertices. This can be expressed as a random variable Q = Σ_{i=1}^{c1 log n + c2} Q_i, where Q_i is the Bernoulli random variable that indicates whether in iteration i sufficiently many vertices were deactivated. This happens independently, as shown above, with probability Pr[Q_i = 1] ≥ (µ−1)/µ. Now we apply a Chernoff bound to Q and get:

Pr[ Q ≥ (1−η) · ((µ−1)/µ) · (c1 log n + c2) ] ≥ 1 − n^(−η² c1 (µ−1) / (3µ)) · e^(−η² c2 (µ−1) / (3µ))   (5)

Thus, if we choose 0 < η < 1 and c1 appropriately, it follows that we perform O(log n) iterations w.h.p.

To bound the work of SIM-COL, we can observe that, similarly to the number of vertices, the number of edges incident to at least one active vertex also decreases by a constant factor in each iteration with high probability. The work in an iteration is bounded by this number of edges, and every iteration has depth O(∆) (in the CREW setting) or O(log ∆) (in the CRCW setting). Hence, we conclude:

Lemma 11. SIM-COL takes O(∆ log n) depth (in the CREW setting) or O(log ∆ log n) depth (in the CRCW setting), and it has O(n + m) work w.h.p. in the CREW setting.

Proof. Since the reduction on Line 13 can be implemented in O(log(deg(v))) depth for each v ∈ U, and since updating the bitmaps in Part 3 takes O(∆) (assuming CREW) or O(log ∆) (assuming CRCW), we get, together with Lemma 10, an overall depth of O(∆ log n) (assuming CREW) or O(log ∆ log n) (assuming CRCW) w.h.p.

Second, to analyze the work performed by the algorithm, we first look at the work performed in one outer-loop iteration. Let U_i be the set U in iteration i. Let n_i = |U_i| be the number of active vertices and m_i the number of active edges at the start of iteration i. Choosing colors for all vertices v ∈ U_i (Line 7) takes O(|U_i|) work in each iteration. The reduction for all v ∈ U_i (Line 13) takes O(Σ_{v∈U_i} deg(v)) work. Updating the bitmaps (Line 18) also takes O(Σ_{v∈U_i} deg(v)) work. Removing the vertices that have been colored in the current iteration from U_i can be done in O(|U_i|) work. Therefore, the work for one iteration is bounded by O(Σ_{v∈U_i} deg(v) + |U_i|) = O(m_i + n_i). To argue further, we make a case distinction over the number of active vertices.

Case 1: Iterations where n_i ∈ Ω(log n) but n_i ∉ Θ(log n). We can model the number of vertices that get removed in each iteration as X = Σ_{v∈U_i} Xv. This random variable, as seen from Lemma 8, stochastically dominates Y = Σ_{i=1}^{n_i} Z, i.e., for all valid x we have Pr[X ≥ x] ≥ Pr[Y ≥ x]. Note that E[Y] = (µ/(1+µ))·n_i and that, because of our assumption, we have n_i > c1 log n + c2 for any constants c1 > 0 and c2. As Y is defined over independent random variables Z, we can apply a Chernoff bound and, as X stochastically dominates Y:

Pr[ Y ≤ (1−η)·E[Y] ] ≤ e^(−η² E[Y] / 2)

Pr[ X > (1−η) · (µ/(1+µ)) · n_i ] ≥ 1 − n^(−η² µ c1 / (3(1+µ))) · e^(−η² µ c2 / (3(1+µ)))

This implies, for appropriately chosen constants 0 < η < 1 and c1, that in each iteration in which we have a number of active vertices in the order of log n, at least a constant fraction of them will be deactivated w.h.p. Now, by using a union bound over all the iterations, whose number is in O(log n) w.h.p. (Lemma 10), we can also conclude that w.h.p. we will always deactivate a constant fraction of vertices. Further, we know that an edge is deactivated with at least the same probability as one of its incident vertices. Hence, w.h.p., we will also deactivate a constant fraction of the active edges in each iteration. Thus, the overall work performed in these iterations is in O(n + m) w.h.p.

Case 2: Iterations where n_i ∈ O(log n). As there are at most O(log n) active vertices left, we can bound the work in one iteration by O(log n + log² n). With Lemma 10, we get overall work of O(log³ n) w.h.p. for these iterations.

Therefore, we get O(n + m) work w.h.p. over all iterations.

Now, we turn our attention back to DEC-ADG. As DEC-ADG decomposes the edges of the input graph into O(log n) disjoint subgraphs of maximum degree O(d), we have:

Lemma 12. DEC-ADG takes O(log d log² n) depth and O(n + m) work w.h.p. in the CRCW setting.

Proof. Since the graph partitions G(ℓ) are induced by a partial k-approximate degeneracy order, we know that each G(ℓ) has a maximal degree of at most kd. Therefore, we can conclude, together with Lemma 11, that SIM-COL (Algorithm 5), if called on G(ℓ), induces O(log d log n) depth w.h.p. (assuming CRCW) and performs O(|R(ℓ)| + |E[R(ℓ)]|) work in expectation (i.e., work proportional to the number of vertices and edges in G(ℓ)). Computing ADG takes O(log² n) depth and O(n + m) work. Updating the bitmaps takes O(log d) depth (with a simple Reduce), since each vertex in G(ℓ) has at most kd neighbors in Q, and O(Σ_{v∈R(ℓ)} deg(v)) work.

Now, since DEC-ADG (Algorithm 4) performs O(log n) iterations, we get an overall depth of O(log d log² n) w.h.p. More precisely, this can be seen from Lemma 10: since we use a Chernoff bound to bound the number of iterations performed by SIM-COL, we can also guarantee that each of the O(log n) instances of SIM-COL will perform O(log n) iterations w.h.p.

The work in one loop iteration can be bounded by O(|R(ℓ)| + |E[R(ℓ)]| + Σ_{v∈R(ℓ)} deg(v)) in expectation (as seen above). Thus, since each vertex is in exactly one partition R(ℓ) and each edge is in at most one set E[R(ℓ)] (i.e., in at most one set of edges that belong to a subgraph G(ℓ)), we can conclude that Algorithm 4 performs O(n + m) work in expectation.

Coloring Quality. Finally, we prove the coloring quality.

Claim 2. DEC-ADG produces a (2+ε)d coloring of the graph for 0 < ε ≤ 8.

Proof. Since we use ADG to partition the graph into (ρ, G) (G ≡ [G(1) ... G(ρ̄)]) on Line 8, we know that ρ is a partial 2(1 + ε/12)-approximate degeneracy ordering. Therefore, we also know that each vertex v ∈ G(i) has at most 2(1 + ε/12)d neighbors in partitions G(i′) with i′ ≥ i. This implies that, if we run SIM-COL on each partition G(i), we will color each partition with at most (1 + ε/4)·2(1 + ε/12)d colors. This is by the design of SIM-COL, which delivers a ((1+µ)∆)-coloring for any graph G; in our case, we have G ≡ G(i), ∆ = 2(1 + ε/12)d, and µ ≡ ε/4. Now, (1 + ε/4)·2(1 + ε/12)d is smaller than or equal to (2 + ε)d for ε ≤ 8, as (1 + ε/4)·2(1 + ε/12) = 2 + (16ε + ε²)/24 ≤ 2 + ε, for ε ≤ 8.

Now, we have already seen that the runtime bounds of DEC-ADG hold for µ > 1 (Lemma 11). Since we defined µ to be ε/4, they hold for ε > 4. Thus, together with Claim 2, we can see that the algorithm attains its runtime and color bounds for 4 < ε ≤ 8.

C. Enhancing Existing Coloring Algorithms

Finally, we illustrate that ADG does not only provide new provably efficient algorithms, but can also be used to enhance existing ones. For this, we seamlessly replace our default SIM-COL routine with a recent speculative coloring heuristic, ITR, by Çatalyürek et al. [40]. The result, DEC-ADG-ITR, is similar to DEC-ADG, except that the used SIM-COL differs in Line 7 from the default Algorithm 5: colors are not picked randomly; instead, we choose the smallest color not in Bv.

Using ADG enables deriving similar bounds on coloring quality (2(1+ε)d + 1) as before. Since each vertex in partition G(ℓ) can have at most kd colored neighbors when it gets colored, as shown before, we always choose the smallest color from {1, ..., kd + 1}. However, deriving good bounds on work and depth is hard, because the lack of randomization (when picking colors) prevents us from using techniques such as Chernoff bounds. We were still able to provide new results. Selecting colors can be done in O(∆̃) depth and O(∆̃ · |U_i|) work, where U_i is the set U in iteration i (of the while loop in SIM-COL) and ∆̃ = max_{v∈V(ℓ)} deg_ℓ(v). For the modified SIM-COL we get, since all other operations are equal, O(∆̃ I) depth (I is the number of iterations of ITR); the work is

O( Σ_{i=1}^{I} ( ∆̃ · |U_i| + Σ_{v∈U_i} deg(v) ) ).

Depth and work in DEC-ADG-ITR are, respectively,

O(I · d log n)

and (as a simple sum over all iterations I)

O( Σ_{ℓ=1}^{ρ̄} ( Σ_{i=1}^{I} ( ∆̃ · |U_i| + Σ_{v∈U_i} deg(v) ) + Σ_{v∈R(ℓ)} deg(v) ) ).

Note that these bounds are valid in the CREW setting.

D. Using Concurrent Reads

Similarly to past work [31], our algorithms rely on concurrent writes. However, a small modification to ADG gives variants that only need concurrent reads (a weaker assumption that is more realistic in architectures that use caches). Specifically, one can implement UPDATE from Algorithm 1 by iterating over all v ∈ U in parallel and, for each v, appropriately modifying its degree: D[v] = D[v] − Count(NU(v) ∩ R). This makes both ADG and DEC-ADG rely only on concurrent reads, but it adds a small factor of d to the work (O(nd + m)).

E. Comparison to Other Coloring Algorithms

We exhaustively compare JP-ADG and DEC-ADG to other algorithms in Table III. We consider: non-JP parallel schemes (Class 1), the best sequential greedy schemes (Class 2), and parallel algorithms based on JP (Class 3). We consider depth (time), work, used model, quality, generality, randomized design, work-efficiency, and scalability. We also use past empirical analyses [31], [33], [37] and our results (§ VI) to summarize the run-times and coloring qualities of the algorithms used in practice, focusing on modern real-world graphs as input. Details are in the caption of Table III.

As explained in Section I, only our algorithms work for arbitrary graphs, deliver strong bounds for depth, work, and quality, and are often competitive in practice. Now, JP-SL may deliver colorings of somewhat higher quality, as it uses the exact degeneracy ordering (although without explicitly naming it) and its quality is (provably) d + 1. However, JP-SL comes with much lower performance. On the other hand, the most recent JP-LLF and JP-SLL only provide the straightforward ∆ + 1 bound for coloring quality. These two are, however, inherently parallel, as they depend linearly on log n, while JP-ADG depends on log² n. Yet, JP-ADG has a different advantage in depth: JP-SLL and JP-LLF depend linearly on √m or ∆, while JP-ADG depends on d. In today's graphs, d is usually much (i.e., orders of magnitude) smaller than √m and ∆ [61]. To further investigate this, we now provide a small lemma.

Lemma 13. For any d-degenerate graph, we have √m ≥ d/√2.

Proof. First, we can see from the definition of degeneracy that G is d-degenerate (i.e., every induced subgraph has at least one vertex with degree ≤ d), but not (d−1)-degenerate. Therefore, since G is not (d−1)-degenerate, there is at least one induced subgraph G̃(Ṽ, Ẽ) in which each vertex has a degree larger than d−1. This then also implies that there are at least d vertices in G̃. Now, for the edges of G̃, we get |Ẽ| = (1/2) · Σ_{v∈Ṽ} deg_Ṽ(v) ≥ (1/2) · Σ_{i=1}^{d} d = d²/2. Thus, G has at least d²/2 edges, and therefore we get √m ≥ d/√2.

Lemma 13 and the fact that d ≤ ∆ illustrate that the expected depth of JP-ADG is, up to a logarithmic factor, comparable to that of JP-R, JP-LLF, and JP-SLL. This further illustrates that our bounds on depth in JP-ADG offer an interesting tradeoff compared to JP-LF and JP-LLF.

We finally observe that the design of ADG, combined with the parametrization using ε, enables a tunable parallelism-quality tradeoff. When ε → 0, the coloring quality in JP-ADG approaches 2d + 1, only ≈2× more than JP-SL. On the other hand, for ε → ∞, ρADG becomes irrelevant and the derived final ordering ρ = ⟨ρADG, ρX⟩ converges to ρX. Now, X could be the random order R, but also the low-depth LF and LLF orders based on largest degrees. This enables JP-ADG to increase parallelism tunably, depending on the user's needs.

We also compare JP-ADG to works based on speculation and conflict resolution [38], [40], [43], [45], which build on an early scheme by Gebremedhin [37]. A direct comparison is difficult because these schemes do not offer detailed theoretical investigations. Simple bounds on coloring quality, depth, and work are, respectively, ∆ + 1, O(∆I), and O(∆IP), where I is the number of iterations and P is the number of processors. Here, we illustrate that using ADG in combination with these works simplifies deriving better bounds for such algorithms, as seen in the example of DEC-ADG-ITR (§ IV-C).

V. OPTIMIZATIONS AND IMPLEMENTATION

We explored different design choices and optimizations for more performance. Some relevant design choices were already discussed in Section III; these were key choices for obtaining our bounds without additional asymptotic overheads. Here, the main driving question that we followed was: how do we maximize the practical performance of the proposed coloring algorithms? For brevity, we discuss the optimizations by extending Algorithms 1 and 3. We also conduct their theoretical analysis.

A. Representation of U and R

The first key optimization (ADG, Alg. 1) is to maintain the set U (vertices still to be assigned the rank ρ) and the sets R(·) (vertices removed from U, i.e., R(i) contains the vertices removed in iteration i) together in the same contiguous array, such that all elements in R(·) precede all elements in U. In iteration i,

this gives an array [R(1) ... R(i) index U], where index points to the first element of U. Initially, in iteration 0, index is 0. Then, in iteration i + 1, one extracts R(i + 1) from U and substitutes [R(0) R(1) ... R(i) index U] with [R(0) R(1) ... R(i) index R(i+1) U \ R(i+1)]. Formally, we keep the invariant that, if any vertex v has a degree ≤ (1 + ε)δb (i.e., D[v] ≤ (1 + ε)δb), then v ∈ R (i.e., v ∈ R(0) ∪ ... ∪ R(i + 1)). Prior to removing R(i + 1) from U, we first partition U into [R(i + 1) U \ R(i + 1)], which can be done in time O(|U|). We do this by iterating over U, comparing the degree of each vertex to (1 + ε)δb, and placing this vertex either in R(i + 1) or in U \ R(i + 1), depending on its degree. Then, the actual removal of R(i + 1) from U takes O(1) time by simply moving the index pointer by |R(i + 1)| positions "to the right", giving – at the end of iteration i + 1 – the representation [R(0) R(1) ... R(i) R(i + 1) index U].

B. Explicit Ordering in R(·)

We advocate sorting each R(i) by the increasing count of neighbors within the vertex set U ∪ R(i) (i.e., for any two vertices v, u ∈ R(i), we place v before u in the array representation of R(i) if and only if u has more neighbors in U ∪ R(i) than v). This gives an explicit ordering within the set of vertices that are removed in the same iteration of ADG. Further, this induces a total order on ρADG (i.e., random tie breaking becomes unnecessary). Our evaluation indicates that such sorting often enhances the accuracy of the obtained approximate degeneracy ordering, which in turn consistently improves the ultimate coloring quality. Sorting can be performed with a linear-time integer sort, and the relative neighbor count of each vertex can be obtained using the array D. We also observe that, for some graphs, this additional vertex sorting improves the overall runtime. We additionally explored parallel integer sort schemes used to maintain the above-described representation of U ∪ R. We tried different algorithms (radix sort [83], counting sort [84], and quicksort [85]).

C. Combining JP and ADG

We observe that Part 1 of JP-ADG (Lines 6–9, Algorithm 3), where one derives predecessors and successors in a given ordering to construct the DAG Gρ, can also be implemented as a part of UPDATE in ADG, in Algorithm 1. To maintain the output semantics of ADG, we use an auxiliary priority function rank : V → ℝ that simultaneously specifies the needed DAG structure. For each v ∈ V, rank[v] is defined as the number of neighbors u ∈ N(v) for which ρADG(u) > ρADG(v) holds. Analogously, rank[v] resembles the number of neighbors of v that were removed from U after the removal of v itself. This optimization does not change the theoretical results.

D. Degree Median Instead of Degree Average

We also use the degree median instead of the degree average in ADG, to derive R: R = {u ∈ U | D[u] ≤ (1 + ε)δm}, where δm is the median of the degrees of vertices in U. We developed variants of ADG ("ADG-M"), JP-ADG ("JP-ADG-M"), and DEC-ADG ("DEC-ADG-M") that use δm, and analyzed them extensively; they enable speedups for some graphs. One advantage of ADG-M is that deriving the median takes O(1) time in a sorted array. However, the whole U has to be sorted in each pass. We incorporate linear-time integer sorting, which was shown to be fast in the context of sorting vertex IDs [86]. ADG-M only differs from Algorithm 1 in that (1) instead of δb, we select the median degree δm (of the vertices in U), and (2) R, the set of vertices removed from U in a given iteration, is now defined as the vertices v ∈ U which satisfy degU(v) ≤ δm. Additionally, we limit R to half the size of U (+1 if |U| is odd). We refer to the priority function produced by ADG-M as ρADG-M.

E. Push vs. Pull

In JP-ADG, computing the updates to D can be implemented either in the push or the pull style (pushing updates to a shared state or pulling updates to a private state) [36]. More precisely, one can either iterate through the vertices in R(·) and decrement the counters in D accordingly for each neighbor in U. This is pushing, as the accesses to D modify the shared state. On the other hand, in pulling, one iterates through the remaining vertices in U \ R(·) and counts the number of neighbors in R(·) for each vertex separately. This number is then subtracted from the respective counter in D. Here, there are no concurrent write accesses to D. We analyzed both variants and found that, while pushing needs atomics, pulling incurs more work. Both options ultimately give similar performance.

F. Caching Sums of Degrees

In each iteration, the average degree of vertices in U is computed and stored in δb. Here, instead of computing δb in each iteration by explicitly summing the respective degrees, one can maintain the sum of degrees in U, ΣU (in addition to δb), and update ΣU accordingly by subtracting the number of edges in the cut (R(·), U \ R(·)) in each iteration. We omit this enhancement from the listings for clarity; our evaluation indicates that it slightly improves the runtime (by up to ≈1%).

G. Infrastructure Details

We integrated our algorithms with both the GAP benchmark suite [87] and with GBBS [61], a recent platform for testing parallel graph algorithms. We use OpenMP [88] for parallelization. We settle on GBBS.

H. Detailed Algorithm Specifications

We now provide detailed specifications of our algorithms with the optimizations described earlier in this section. We refer to them as ADG-O and ADG-M-O. The former is our fundamental ADG algorithm (described in Section III) that uses the average degree to select vertices for removal in a given iteration. The latter is the ADG variant that uses the median degree instead of the average degree (described in § V-D). The respective listing (of ADG-O) is in Algorithm 6. We omit the listing of ADG-M-O because it is identical to that of ADG-O. The only difference is in the

way the PARTITION subroutine works. Specifically, it results in U = [R δm u_{|R|+2} ... u_{|U|}] such that ∀u ∈ U : D[u] ≤ D[δm] ⇔ u ∈ R; δm is the median of the degrees of vertices in U based on the increasing degree ordering. Note that in the UPDATEandPRIORITIZE subroutine, when we refer to N(v), this technically is N_{U∪R}(v). However, as enforcing the induced neighbor set entails performance overheads, and as using N(v) is not incorrect, we use N(v).

    /* Input: A graph G(V, E).
     * Output: A priority (ordering) function ρADG : V → ℝ,
     *         an auxiliary priority function rank : V → ℝ. */
    D = [deg(v1) deg(v2) ... deg(vn)]   // An array with vertex degrees
    U = V        // U induces the subgraph used in each iteration
    ℓ = 0        // A counter variable keeping track of removed vertices
    for all v ∈ V do in parallel:
        ρADG(v) = n                     // Initialize the priorities

    while U ≠ ∅ do:
        δb = (1/|U|) · Σ_{v ∈ U} D[v]   // Update the average degree of vertices in U

        // Partition U, resulting in an array U = [R u_{|R|+1} ... u_{|U|}]
        // s.t. ∀u ∈ U : D[u] ≤ (1 + ε)δb ⇔ u ∈ R; R is a subarray of U,
        // it contains vertices assigned a priority in a given iteration.
        // PARTITION runs in O(|U|) time and can be implemented
        // as specified in § V-A.
        PARTITION(U, (1 + ε)δb)

        // Sort R by increasing degrees based on D, see § V-A and § V-B.
        // As D is updated every iteration (see below), v is before u
        // in R if and only if u has more neighbors in U ∪ R than v.
        SORT(R, D)

        // Explicitly impose an ordering for processing the vertices:
        for all ri ∈ [r0 r1 ... r_{|R|−1}] = R do in parallel:
            ρADG(ri) = ℓ + i

        // Remove the selected low-degree vertices (that are in R)
        // by moving the pointer to U, denoted as &U, by |R| cells.
        &U = &U + |R|

        // Update D to reflect removing R from U (definition below):
        UPDATEandPRIORITIZE(D, U, R, rank)

        // Update the counter by adding the count of removed vertices:
        ℓ = ℓ + |R|

    // Update D to reflect removing vertices in R from a set U:
    UPDATEandPRIORITIZE(D, U, R, rank):
        for all v ∈ R do in parallel:
            c = 0    // A counter for neighbors with a higher priority
            for w ∈ N(v) do:
                if ρADG(w) > ρADG(v):
                    // w comes after v in the explicit processing order
                    DecrementAndFetch(D[w])
                    c = c + 1
            rank(v) = c

Algorithm 6: ADG-O, a variant of ADG from Algorithm 1 with optimizations described in § V-A, § V-B, and § V-C.

I. Theoretical Analysis

We also provide a theoretical analysis of the routine for deriving the approximate degeneracy ordering using median degrees (ADG-M). Moreover, we analyze the coloring algorithms based on ADG-M (JP-ADG-M and DEC-ADG-M). Finally, we also analyze the coloring algorithms enhanced with the optimizations described earlier in this section (ADG-O, ADG-M-O, JP-ADG-O, JP-ADG-M-O). The only change in bounds is in the schemes based on median degrees, which increase the ensured coloring counts by 2×. Importantly, all other bounds for the coloring algorithms maintain their advantageous properties described in Sections III–IV. All proofs are very similar to the earlier proofs for the ADG, JP-ADG, and DEC-ADG algorithms. Thus, we only provide sketches, underlining any differences from the previous analyses.

1) Analysis of ADG-M: We first investigate ADG-M.

Lemma 14. For a constant ε > 0, ADG-M can be implemented such that it has O(log n) iterations, O(log² n) depth, and O(n + m) work in the CRCW setting.

Proof. Since the algorithm removes at least half the vertices of U in each iteration (±1 vertex), we can conclude, as for the default version of ADG, that ADG-M performs O(log n) iterations in the worst case. As in each iteration all operations can run in O(log n) depth, ADG-M's depth is O(log² n). Let k be the number of performed iterations, and let Ui be the set U in iteration i. To calculate the work performed by ADG-M, we first consider the work in one iteration. Finding the median takes O(|Ui|) work. Since all other operations are the same as for ADG, we again get O(Σ_{v ∈ Ui} deg(v) + |Ui|) work in one iteration. As we also remove a constant fraction of the vertices of U in each iteration, we conclude, as with ADG, that ADG-M performs O(n + m) work over all iterations.

Lemma 15. ADG-M computes a partial 4-approximate degeneracy ordering of G.

Proof. In each step ℓ of ADG-M, only less than half of the vertices in the induced subgraph G[U] can have a degree that is strictly larger than 2δbℓ, where δbℓ is the average degree of vertices in this subgraph. Now, as we always remove the vertices with the lowest degrees and as we remove half of the vertices (±1) from U, all vertices that get removed have a degree of at most 2δbℓ. From Lemma 3, we know that δb ≤ 2d. Thus, each vertex has a degree of at most 4d in the subgraph G[U] (in the current step). Consequently, each vertex has at most 4d neighbors that are ranked equal or higher.

2) Analysis of JP-ADG-M: We now derive the bounds for JP-ADG-M. We proceed similarly as for JP-ADG. The partial approximate degeneracy ordering delivered by ADG-M is referred to as ρADG-M.

Corollary 2. JP-ADG-M colors a graph with at most 4d + 1 colors using the priority function ρ = ⟨ρADG-M, ρR⟩.

Proof. Since ⟨ρADG-M, ρR⟩ is a 4-approximate degeneracy ordering, this result follows from Lemma 6 and Lemma 15.

Corollary 3. JP-ADG-M colors a graph G with degeneracy d in expected depth O(log² n + log ∆ · (d log n + (log d · log² n) / log log n)) and O(n + m) work in the CRCW setting.

Proof. Since ρADG-M is a partial 4-approximate degeneracy ordering (Lemma 15) and since ADG-M performs at most O(log n) iterations (Lemma 14), the depth follows from Lemma 7, Lemma 14, and past work [31], which shows that JP runs in O(log n + log ∆ · |P|) depth. Since both JP and ADG-M perform O(n + m) work, the same holds for JP-ADG-M.

3) Analysis of DEC-ADG-M: The analysis of DEC-ADG-M is analogous to that of DEC-ADG, with the difference that we use the ADG-M routine instead of ADG. The work and depth bounds remain the same as in DEC-ADG (O(n + m) w.h.p. and O(d log² n)), and the coloring bound – similarly to JP-ADG-M – increases from (2 + ε)d to (4 + ε)d.

4) Impact of Optimizations: We also investigate the impact of the optimizations described in § V-A, § V-B, and § V-C on the bounds of the associated routines (ADG-O, ADG-M-O, JP-ADG-O, JP-ADG-M-O). The key part is to illustrate that (1) moving "Part 1" of JP-ADG into ADG and (2) the modified combined representation for R and U increase neither depth nor work. This can be shown straightforwardly.

VI. EVALUATION

In the evaluation, we found that the empirical results follow the theoretical predictions, already scrutinized in Table III and Section IV. Thus, for brevity, we now summarize the most important observations. A comprehensive comparison of run-times and coloring qualities of different algorithms is in Table III (together with the abbreviations of the used comparison baselines).

A. Methodology, Architectures, Parameters

We first provide the details of the evaluation methodology, the used architectures, and the considered parameters, to facilitate interpretability and reproducibility of the experiments [89].

Used Architectures: In the first place, we use Einstein, an in-house Dell PowerEdge R910 server with Intel Xeon X7550 CPUs @ 2.00GHz with 18MB L3 cache, 1TiB RAM, and 32 cores (grouped in four sockets). We also conducted experiments on Ault (a CSCS server with an Intel Xeon Gold 6140 CPU @ 2.30GHz, 768 GiB RAM, 18 cores, and 24.75MB L3) and Fulen (a CSCS server with an Intel Skylake @ 2GHz, 1.8 TiB RAM, 52 cores, and 16MB L3).

Methodology: We provide absolute runtimes when reporting speedups. In our measurements, we exclude the first 1% of measured performance data as warmup. We derive enough data to obtain the mean and 95% non-parametric confidence intervals. Data is summarized with arithmetic means.

Algorithms & Comparison Baselines: We focus on modern heuristics from Table III. For each scheme, we always pick the most competitive implementation (i.e., fewest colors used and smallest performance overheads), selecting from the existing repositories illustrated in Table IV (ColPack [82], [90], Zoltan [35], [91]–[94], the original code by Hasenplaugh et al. (HP) [31], GBBS with Ligra [61], [95], [96]) and our implementation. Detailed parametrizations are in the reproducibility appendix. For Zoltan, we tested different variants of ITRB and picked the best configuration, consistently with past work [38] (100 steps, synchronous, "I" coloring order). For ColPack, we also picked the best variants of ITR-ASL and JP-ASL ("D1 OMP GMMP SL" and "D1 OMP MTJP SL"). For the other JP baselines, we pick the best performing ones out of GBBS+Ligra and the original Hasenplaugh et al. code.

Datasets: We use real-world graphs from SNAP [97], KONECT [98], DIMACS [99], and WebGraph datasets [100]; see Table V for details. We also analyze synthetic power-law graphs (generated with the Kronecker model [101]). This gives a large evaluation space; we only summarize selected findings.

The results are in Figure 1. Following past analyses [31], we consider separately two distinctive families of algorithms: those based on speculative coloring (SC), and the ones with the Jones and Plassman structure (color scheduling). These two classes of algorithms – especially for larger datasets – are often complementary, i.e., whenever one class achieves lower performance, the other thrives, and vice versa. This is especially visible for larger graphs, such as h-dsk, h-wdb, or s-gmc. The reason is that the structure of some graphs (e.g., with dense clusters) entails many coloring conflicts, which may need many re-coloring attempts, giving long-tail run-times.

B. Summary of Insights

Our algorithms almost always offer superior coloring quality. Only JP-SL, JP-SLL (HP), and sometimes ITRB by Boman et al. [38] (Zoltan) use comparably few colors, but they are at least 1.5× and 2× slower, respectively. Simultaneously, the run-times of our algorithms are comparable or marginally higher than the competition (in the class of algorithms with specula-

GC Baseline                                        | Available Codes
---------------------------------------------------|------------------
(MIS) Luby [30]                                    | ColPack
Gebremedhin [37]                                   | ColPack
Gebremedhin [37]                                   | ColPack
ITRB (Boman et al. [38])                           | Zoltan
ITR (Çatalyürek et al. [40] and others [43], [45]) | ColPack
ITR-ASL (Patwary et al. [32])                      | ColPack
Greedy-ID [1]                                      | ColPack, GBBS
Greedy-SD [27], [31]                               | ColPack, GBBS
JP-FF [25], [31]                                   | ColPack, GBBS
JP-LF [31]                                         | ColPack, GBBS, HP
JP-SL [31]                                         | ColPack, GBBS, HP
JP-R [26]                                          | ColPack, GBBS, HP
JP-R [31]                                          | ColPack, GBBS, HP
JP-LLF [31]                                        | GBBS, HP
JP-SLL [31]                                        | GBBS, HP
JP-ASL [32], [82]                                  | ColPack

TABLE IV: Existing implementations of parallel graph coloring algorithms, considered in this work. The existing repositories or frameworks are as follows: ColPack [82], [90], Zoltan [35], [91]–[94], the original code by Hasenplaugh et al. (HP) [31], and GBBS with Ligra [61], [95], [96]. More details of the associated algorithms are in Table III.

Friendships: Friendster (s-frs, 64M, 2.1B), Orkut (s-ork, 3.1M, 117M), LiveJournal (s-ljn, 5.3M, 49M), Flickr (s-flc, 2.3M, 33M), Pokec (s-pok, 1.6M, 30M), Libimseti.cz (s-lib, 220k, 17M), Catster/Dogster (s-cds, 623k, 15M), Youtube (s-you, 3.2M, 9.3M), Flixster (s-flx, 2.5M, 7.9M)
Hyperlink graphs: GSH domains (h-dgh, 988M, 33.8B), SK domains (h-dsk, 50M, 1.94B), IT domains (h-dit, 41M, 1.15B), Arabic domains (h-dar, 22M, 639M), Wikipedia/DBpedia (en) (h-wdb, 12M, 378M), Indochina domains (h-din, 7.4M, 194M), Wikipedia (en) (h-wen, 18M, 172M), Wikipedia (it) (h-wit, 1.8M, 91.5M), Hudong (h-hud, 2.4M, 18.8M), Baidu (h-bai, 2.1M, 17.7M), DBpedia (h-dbp, 3.9M, 13.8M)
Communication: Twitter follows (m-twt, 52.5M, 1.96B), Stack Overflow interactions (m-stk, 2.6M, 63.4M), Wikipedia talk (en) (m-wta, 2.39M, 5.M)
Collaborations: Actor collaboration (l-act, 2.1M, 228M), DBLP co-authorship (l-dbl, 1.82M, 13.8M), Citation network (patents) (l-cit, 3.7M, 16.5M), Movie industry graph (l-acr, 500k, 1.5M)
Various: UK domains time-aware graph (v-euk, 133M, 5.5B), Webbase crawl (v-wbb, 118M, 1.01B), Wikipedia evolution (de) (v-ewk, 2.1M, 43.2M), USA road network (v-usa, 23.9M, 58.3M), Internet topology (Skitter) (v-skt, 1.69M, 11M)

TABLE V: Considered real graphs from established datasets [97]–[100]. Graphs are sorted by m in each category. For each graph, we show its "(symbol used, n, m)".
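To make the median-based selection of § V-D concrete, the following is a minimal sequential sketch of the ADG-M iteration structure (the paper's implementation is parallel and uses linear-time integer sorting; the function and variable names here are illustrative only, not the actual code):

```python
def adg_m(adj):
    """Simplified sequential sketch of ADG-M: repeatedly remove the
    lowest-degree half of U (vertices with degree <= the median degree
    delta_m of the subgraph induced by U), assigning removed vertices
    the next ranks. Returns a priority (ordering) rho: vertex -> rank."""
    U = set(adj)
    D = {v: len(adj[v]) for v in adj}        # degrees within G[U]
    rho, next_rank = {}, 0
    while U:
        degs = sorted(D[v] for v in U)
        delta_m = degs[len(degs) // 2]       # median degree of U
        limit = (len(U) + 1) // 2            # half of U (+1 if odd)
        R = sorted((v for v in U if D[v] <= delta_m),
                   key=lambda v: D[v])[:limit]
        for v in R:                          # assign the next ranks
            rho[v] = next_rank
            next_rank += 1
        U -= set(R)
        for v in R:                          # update induced degrees
            for w in adj[v]:
                if w in U:
                    D[w] -= 1
    return rho
```

Since at least half of U is removed per iteration, the sketch terminates after O(log n) iterations, mirroring Lemma 14.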

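As a complement to the description in § V-A, here is a minimal sketch (with illustrative names; not the paper's implementation) of one partition-and-advance step on the combined array that stores the removed prefixes R(0) ... R(i) before position index and U from index onward:

```python
def partition_step(arr, index, D, threshold):
    """One step of the combined-array scheme of § V-A: partition the U
    part of arr (positions index..end) so that vertices with degree
    D[v] <= threshold (the new R(i+1)) come first, in O(|U|) time; the
    actual removal is then just advancing 'index' by |R(i+1)| in O(1).
    Returns (new_index, R_new)."""
    low = [v for v in arr[index:] if D[v] <= threshold]   # R(i+1)
    high = [v for v in arr[index:] if D[v] > threshold]   # U \ R(i+1)
    arr[index:] = low + high
    return index + len(low), low
```

For example, with degrees D = {0: 1, 1: 3, 2: 1, 3: 2}, arr = [0, 1, 2, 3], index = 0, and threshold 1, the step reorders arr to [0, 2, 1, 3] and moves index to 2, so R(1) = [0, 2] now precedes the remaining U.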
[Figure 1 here: per-graph bar plots, with pairs of run-time and coloring-quality panels for smaller graphs (used in online execution, scheduling, ...) and larger graphs (used in offline data analytics, ...): h-dsk, h-bai, h-wdb, h-hud, m-wta, h-wit, s-flc, l-act, m-stk, s-flx, s-frs, s-lib, s-pok, s-gmc, s-gmc2, s-you, v-ewk, s-ork, v-skt, v-wbb. Axes: "Time for different graphs [s]" and "Coloring quality [Color count relative to JP-R]"; run-time bars are split into reordering_time and coloring_time. Annotations in the plots indicate memory faults in ITR-ASL, Zoltan, and ColPack for some inputs.]

Fig. 1: Run-times (1st and 3rd columns) and coloring quality (2nd and 4th columns). Two plots next to each other correspond to the same graph. Graphs are representative (other results follow similar patterns). Parametrization: 32 cores (all available), ε = 0.01, sorting: Radix sort, direction-optimization: push, JP-ADG variant based on average degrees δb. SL and SLL are excluded from run-times in the right column (for larger graphs) because they performed consistently worse than others. We exclude DEC-ADG for similar reasons and because it is of mostly theoretical interest; instead, we focus on DEC-ADG-ITR, which is based on core design ideas in DEC-ADG. Numbers in bars for color counts are numbers of used colors. "SC": results for the class of algorithms based on speculative coloring (ITR, DEC-ADG-ITR). "JP": results for the class of algorithms based on the Jones and Plassman approach (color scheduling, JP-*). A vertical line in each plot helps to separate these two classes of algorithms. DEC-ADG-ITR uses dynamic scheduling. JP-ADG uses linear time sorting of R. Any schemes that are always outperformed in a given respect (e.g., Zoltan in runtimes or ColPack in qualities) are excluded from the plots.
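For reference when reading the color counts above: all JP-style schemes parallelize the sequential Greedy below, which scans vertices in a given order and assigns the smallest color unused among already-colored neighbors (a minimal sketch; the different vertex orders correspond to the FF/LF/SL/ADG/... priorities):

```python
def greedy_color(adj, order):
    """Sequential Greedy: scan vertices in 'order'; give each vertex the
    smallest color not used by its already-colored neighbors. With an
    exact degeneracy order, this uses at most d + 1 colors."""
    color = {}
    for v in order:
        taken = {color[w] for w in adj[v] if w in color}
        c = 0
        while c in taken:       # smallest free color
            c += 1
        color[v] = c
    return color
```

For example, on a path 0-1-2-3 in natural order, Greedy uses 2 colors, while on a triangle it necessarily uses 3.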
tive coloring) and within at most 1.3–1.4× of the competition (in the class of JP baselines). Thus, we offer the best coloring quality at the smallest required run-time overhead. Finally, our routines are the only ones with theoretical guarantees on work, depth, and quality.

C. Analysis of Run-Times with Full Parallelism

We analyze run-times using all the available cores. Whenever applicable, we show the fractions due to reordering (preprocessing, e.g., the "ADG" phase in JP-ADG) and the actual coloring (e.g., the "JP" phase in JP-ADG). JP-SL, JP-SLL (HP), and JP-ASL (ColPack) are the slowest, as they offer the least parallelism. JP-LF, JP-LLF, and JP-R (GBBS/Ligra) are very fast, as their depth is in O(log n). We also analyze speculative coloring from ColPack and Zoltan; we summarize the most competitive variants. ITR does not come with clear bounds on depth, but its simple and parallelizable structure makes it very fast. ITRB schemes are >2× slower than the other baselines and are thus excluded from the run-time plots. We also consider an additional variant of ITR based on ASL [82], ITR-ASL. In several cases, it approaches the performance of ITR.

The coloring run-times of JP-ADG are comparable to JP-LF, JP-LLF, and others. This is not surprising, as this phase is dominated by the common JP skeleton (with some minor differences from the varying schedules of vertex coloring). However, the reordering run-time in JP-ADG comes with certain overheads because it depends on log² n. This is expected, as JP-ADG – by its design – performs several sequential iterations, the count of which is determined by ε (i.e., how well the degeneracy order is approximated). Importantly, JP-ADG is consistently faster (by more than 1.5×) than JP-SL and JP-SLL, which also focus on coloring quality.

DEC-ADG-ITR – similarly to JP-ADG – entails ordering overheads because it precomputes the ADG low-degree decomposition. However, its total run-times are only marginally higher, and in several cases lower, than those of ITR. This is because the low-degree decomposition that we employ, despite enforcing some sequential steps in preprocessing, reduces the counts of coloring conflicts, translating into performance gains.

D. Analysis of Coloring Quality

Coloring quality also follows the theoretical predictions: JP-SL outperforms JP-SLL, JP-LF, and JP-LLF (by up to 15%), as it strictly follows the degeneracy order. Overall, all four schemes (GBBS/Ligra, HP) are competitive. As expected, JP-FF and JP-R come with much worse coloring qualities because they do not focus on minimizing color counts. As observed before [31], ITR (ColPack) outperforms JP-FF and JP-R but falls behind JP-LF, JP-LLF, JP-SLL, and JP-SL. JP-ASL and ITR-ASL (ColPack) offer low (often the lowest) quality. ITRB (Zoltan) sometimes approaches the quality of JP-SL, JP-SLL, DEC-ADG-ITR, and JP-ADG.

The coloring quality of our schemes outperforms the others in almost all cases. Only JP-SL, JP-SLL (GBBS/Ligra, HP), and sometimes ITRB (Zoltan) are competitive, but they are always much slower. In some cases, JP-ADG (e.g., on s-ork) and DEC-ADG-ITR (e.g., on s-gmc) are better than JP-SL and JP-SLL (by 3–10%). Hence, while the strict degeneracy order is in general beneficial when scheduling vertex coloring, it does not always give the best qualities. JP-ADG consistently outperforms the others, reducing the used color counts by up to 23% compared to JP-LLF (for m-wta). Finally, DEC-ADG-ITR always ensures much better quality than ITR, by up to 40% (for s-lib). Both DEC-ADG-ITR and JP-ADG offer similarly high coloring qualities across all comparison targets.

[Figure 2 here: weak scaling on Kronecker graphs (run-time [s] vs. #edges/vertex + #threads, from 1+1 to 32+32) and strong scaling on the h-bai and s-pok graphs (run-time [s] vs. number of threads, 1–32). Legend: JP class (JP-LF, JP-R, JP-LLF, JP-FF, JP-SL, JP-SLL, JP-ASL, JP-ADG) and SC class (ITR, ITR-ASL, DEC-ADG-ITR); implementations from ColPack [C], GBBS/Ligra [G], Hasenplaugh et al. [H], and our schemes [O]. Annotations: most schemes in the weak-scaling regime do not hit the memory bottleneck until the overall run-time increases due to the memory wall; both JP-ADG and DEC-ADG-ITR offer competitive or best scaling plus the best coloring quality while minimizing run-time overhead; the good scaling of our baselines is due to using the graph degeneracy d (or log d) instead of the maximum degree ∆ in the bounds, as opposed to the competition – an advantage, as d is usually much smaller than ∆ (details in § IV-E).]

Fig. 2: Weak and strong scaling. Graphs are representative (other results follow similar patterns). Parametrization: ε = 0.01, sorting: Radix sort, direction-optimization: push, JP-ADG variant based on average degrees δb. DEC-ADG-ITR uses dynamic scheduling. JP-ADG uses linear time sorting of R. In weak scaling, we use n = 1M vertices.

E. Analysis of Strong Scaling

We also investigate strong scaling (i.e., run-times for increasing thread counts). The relative performance differences between the baselines do not change significantly, except for SLL, which becomes more competitive when the thread count approaches 1, due to the tuned sequential implementation that we used [31]. Representative results are in Figure 2; all other graphs result in analogous performance patterns. Most variants from ColPack, Zoltan, GBBS/Ligra, and HP scale well (we still exclude Zoltan due to its high runtimes). Importantly, the scaling of our baselines is also advantageous and comparable to the others. This follows the theoretical predictions, as the log² n

factor in our depth bounds is alleviated by the presence of the degeneracy d (or log d) instead of ∆, as opposed to the competition; see § IV-E on page 7 for details.

F. Analysis of Weak Scaling

Weak scaling is also shown in Figure 2. We use Kronecker graphs [101] of increasing sizes by varying the number of edges/vertex; this fixes the used graph model. JP-ADG scales comparably to the other JP baselines; DEC-ADG-ITR scales comparably or better than ITR or ITR-ASL.

G. Analysis of Impact from ε

Representative results of the impact of ε are in Fig. 3. As expected, a larger ε offers more parallelism and thus lower runtimes, but the coloring qualities might decrease. Importantly, the decrease is minor, and the qualities remain the highest or competitive across almost the whole spectrum of ε.

[Figure 3 here: full runtime [s] and coloring quality vs. ε (0.01–1.00) for the h-bai and v-usa graphs; series: DEC-ADG-ITR-d and JP-ADG-s.]

Fig. 3: Impact of ε on run-times and coloring quality. Parametrization: 32 cores, sorting: Radix sort, direction-optimization: push, JP-ADG variant based on average degrees δb. DEC-ADG-ITR uses dynamic scheduling.

H. Memory Pressure and Idle Cycles

We also investigate the pressure on the memory bus; see Figure 4. For this, we use PAPI [102] to gather data about idle CPU cycles and L3 cache misses. Low ratios of L3 misses or idle cycles indicate high locality and low pressure on the memory bus. Overall, our routines have comparable or best ratios of active cycles and L3 hits.

I. Performance Profiles

We also summarize the results from Figure 1 using performance profiles [103]; see Figure 5 for a representative profile for coloring quality. Intuitively, such a profile shows cumulative distributions for a selected performance metric (e.g., a color count). The summary in Figure 5 confirms the previous insights: DEC-ADG-ITR, JP-ADG, and JP-SL offer the best colorings.

J. Additional Analyses of Design Choices

We also analyze variants of JP-ADG and DEC-ADG-ITR, considering the sorting of the set R, using static vs. dynamic schedules, and other aspects from Section V (push vs. pull, median vs. average degree, different sorting algorithms). In Figure 1, we use JP-ADG with counting sort of R and DEC-ADG-ITR with dynamic scheduling. All these design choices have a certain (usually up to 10% of relative difference) impact on performance, but (1) it strongly depends on the input graph, and (2) it does not change the fundamental performance patterns.

VII. RELATED WORK

One of the first methods for computing a (∆ + 1)-coloring in parallel for general graphs arose from the parallel polylogarithmic-depth maximal independent set (MIS) algorithm by Karp and Wigderson [29], further improved to O(log n) depth by a simpler MIS algorithm in the influential paper by Luby [30]. The key idea in this simple coloring strategy is to (1) find a MIS S, (2) apply steps of Greedy in parallel for all vertices in S (which is possible as, by definition, no two vertices in a MIS are adjacent), coloring all vertices in S with a new color, (3) remove the colored vertices from G, and (4) repeat the above steps until all vertices are colored.

Karp and Wigderson's and Luby's algorithms started a large body of parallel graph coloring heuristics [1]–[3], [9]–[11], [25]–[31], [33], [37], [77]–[81], [104]. We already exhaustively analyzed them in Section I and in Table III. Almost all of them have theoretical guarantees based on the work-depth [105] or the PRAM model [106]. We build on and improve on these works in several dimensions, as explained in detail in Section I.

Many works exist in the theory of distributed graph coloring, based on models such as LOCAL or CONGEST [107]–[121].
These algorithms are highly theoretical 25% and do not come with any implementations. Moreover, they come with assumptions that are unrealistic in HPC settings, for 0% example distributed LOCAL and CONGEST algorithms do not Event ratio (the lower the better) ITR ITR initially know the interconnection graph, or the message size JP−R JP−R JP−FFJP−LF JP−SL JP−FFJP−LF JP−SL JP−LLF JP−LLF -ITR JP−ASL JP−SLL -ITR JP−ASL JP−SLL ITR−ASLJP−ADG ITR−ASLJP−ADG in LOCAL algorithms can be unbounded. Finally, they cannot DEC−ADG- DEC−ADG- Our routines (DEC-ADG-ITR & JP-ADG) are indicated with arrows . be directly compared to Work-Depth or PRAM algorithms. In their respective classes of algorithms, have competitive (low) counts Fraction of stalled cycles of idle CPU cycles and L3 misses, indicating low memory pressure Fraction of L3 misses Thus, they are of little relevance to our work. Fig. 4: Fractions of L3 misses (out of all L3 accesses) and idle (stalled) CPU cycles Many practical parallel and distributed approaches were (out of all CPU cycles) in each algorithm execution. Parametrization: graph h-hud, 32 cores, sorting: Radix sort, direction-optimization: push, JP-ADG uses average degrees δb. proposed. They often use different speculative schemes [34]– DEC-ADG-ITR uses dynamic scheduling. JP-ADG uses linear time sorting of R. [46], where vertices are colored speculatively and potential conflicts are resolved in a second pass. Some of these schemes
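This speculative two-pass pattern (color optimistically, then detect and re-color conflicting vertices) can be sketched as follows. This is an illustrative sequential simulation under our own naming and graph representation, not the implementation of any of the cited schemes:

```python
# Illustrative sketch of a speculative (two-pass) coloring scheme: in each
# round, all uncolored vertices independently pick the smallest color not
# used by their neighbors as of the previous round (so adjacent vertices may
# clash); a second pass detects conflicts and re-colors one endpoint.
def speculative_coloring(adj):
    """adj: dict mapping each vertex to the set of its neighbors (undirected)."""
    color = {v: None for v in adj}
    uncolored = set(adj)
    while uncolored:
        snapshot = dict(color)  # colors visible to the "parallel" pass
        for v in uncolored:     # speculative pass (conceptually parallel)
            taken = {snapshot[u] for u in adj[v] if snapshot[u] is not None}
            c = 0
            while c in taken:
                c += 1
            color[v] = c
        conflicts = set()       # conflict-detection pass
        for v in uncolored:
            for u in adj[v]:
                if color[u] == color[v]:
                    conflicts.add(max(u, v))  # tie-break: larger id loses
        for v in conflicts:
            color[v] = None
        uncolored = conflicts   # only conflicting vertices are re-colored
    return color
```

In each round, the smallest-id vertex of every conflicting group is never the losing endpoint, so it keeps its color and the loop terminates; the cited schemes differ mainly in how the speculative pass is parallelized and how conflicts are detected and scheduled.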

Fig. 5: Color qualities from Figure 1 summarized with a performance profile. A given percentage for a given τ indicates how many data samples are within τ of the best obtained coloring ("1.0" indicates the best obtained graph coloring). Schemes: [Z]: Zoltan, [C]: ColPack, [O]: our schemes, [G]: GBBS/Ligra, [H]: Hasenplaugh et al.

Some of these schemes were implemented within frameworks or libraries [82], [122]–[124]. Another line of schemes incorporates GPUs and vectorization [41], [44], [45], [125]–[129]. Other schemes use recoloring [130], [131], in which one improves an already existing coloring. Patidar and Chakrabarti use Hadoop to implement graph coloring [132]. Alabandi et al. [133] illustrate how to increase the parallelism of coloring heuristics. These works are orthogonal to this paper: they do not provide theoretical analyses, but they usually offer numerous architectural and design optimizations that can be combined with our algorithms for further performance benefits. As we focused on theoretical guarantees and their impact on performance, and not on architecture-related optimizations, we leave integrating these optimizations with our algorithms as future work.

There are works on coloring specific graph classes, such as planar graphs [134]–[137]. Some works impose additional restrictions, for example coloring balance, which limits differences between the numbers of vertices with different colors [138]–[140]. Other lines of related work also exist, for example on edge coloring [141], dynamic or streaming coloring [142]–[148], k-distance-coloring and other generalizations [149]–[151], and sequential exact coloring [152]–[154]. There are even works on solving graph coloring with evolutionary and genetic algorithms [155]–[157] and with machine learning methods [158]–[163]. All these works are unrelated, as we focus on unrestricted, parallel, 1-distance vertex coloring with provable guarantees on performance and quality, targeting general, static, and simple graphs.

The general structure of our ADG algorithm, based on iteratively removing vertices with degrees in certain ranges defined by the approximation parameter ε, was also used to solve other problems, for example in the (2 + ε)-approximate maximal densest subgraph algorithm by Dhulipala et al. [61]. Finding more applications of ADG is left for future work.

We note that, while our ADG scheme is the first parallel algorithm for deriving an approximate degeneracy ordering with a provable approximation factor, two algorithms in the streaming setting exist [164], [165].

Graph coloring has been targeted in several recent works related to broad graph processing paradigms, abstractions, and frameworks [36], [166]–[173]. Several HPC works [174]–[176] consider distributed graph coloring in the context of high-performance RDMA networks and RMA programming [177]–[182]. Different coloring properties of graphs were also analyzed in the context of graph compression and summarization [183], [184].

VIII. CONCLUSION

We develop graph coloring algorithms with strong theoretical guarantees on all three key aspects of parallel graph coloring: work, depth, and coloring quality. No other existing algorithm provides such guarantees.

One algorithm, JP-ADG, is often superior in coloring quality to all other baselines, even including the tuned SL and SLL algorithms specifically designed to reduce the counts of used colors [31]. It also offers low run-times for different input graphs. As we focus on algorithm design and analysis, one could combine JP-ADG with many orthogonal optimizations, for example in the GPU landscape, to achieve more performance without sacrificing quality. Another algorithm, DEC-ADG, is of theoretical interest as it is the first routine – in a line of works based on speculative coloring – with strong theoretical bounds. While it is less advantageous in practice, we use its underlying design to enhance a recent coloring heuristic [40], obtaining DEC-ADG-ITR, an algorithm with (1) strong quality bounds and (2) competitive performance, for example using up to 40% fewer colors than the base design [40].

Our algorithms use a very simple (but rich in outcomes) idea of provably relaxing the strict vertex degeneracy order, to maximize parallelism when deriving this order. This idea, and our corresponding parallel ADG algorithm, are of separate interest, and could enhance other algorithms that rely on vertex orderings, for example in mining maximal cliques [49], [50].

We provide the most extensive theoretical study of parallel graph coloring algorithms. This analysis can be used by other researchers as help in identifying future work.

Acknowledgements: We thank Hussein Harake, Colin McMurtrie, Mark Klein, Angelo Mangili, and the whole CSCS team for granting access to the Ault and Daint machines, and for their excellent technical support. We thank Timo Schneider for his immense help with computing infrastructure at SPCL.

REFERENCES

[1] T. F. Coleman and J. J. Moré, "Estimation of sparse jacobian matrices and graph coloring problems," SIAM Journal on Numerical Analysis, vol. 20, no. 1, pp. 187–209, 1983.
[2] M. T. Jones and P. E. Plassmann, "Scalable iterative solution of sparse linear systems," Parallel Computing, vol. 20, no. 5, pp. 753–773, 1994.
[3] A. H. Gebremedhin, F. Manne, and A. Pothen, "What color is your jacobian? Graph coloring for computing derivatives," SIAM Review, vol. 47, no. 4, pp. 629–705, 2005.
[4] M. Besta, R. Kanakagiri, H. Mustafa, M. Karasikov, G. Rätsch, T. Hoefler, and E. Solomonik, "Communication-efficient jaccard similarity for high-performance distributed genome comparisons," IEEE IPDPS, 2020.
[5] M. Besta, F. Marending, E. Solomonik, and T. Hoefler, "SlimSell: A vectorized graph representation for breadth-first search," in Proceedings of the 31st IEEE International Parallel & Distributed Processing Symposium (IPDPS'17). IEEE, May 2017.
[6] G. Kwasniewski, M. Kabić, M. Besta, J. VandeVondele, R. Solcà, and T. Hoefler, "Red-blue pebbling revisited: near optimal parallel matrix-matrix multiplication," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2019, pp. 1–22.
[7] E. Solomonik, M. Besta, F. Vella, and T. Hoefler, "Scaling betweenness centrality using communication-efficient sparse matrix multiplication," in ACM/IEEE Supercomputing, 2017, p. 47.
[8] T. Kaler, W. Hasenplaugh, T. B. Schardl, and C. E. Leiserson, "Executing dynamic data-graph computations deterministically using chromatic scheduling," ACM Transactions on Parallel Computing (TOPC), vol. 3, no. 1, pp. 1–31, 2016.
[9] E. M. Arkin and E. B. Silverberg, "Scheduling jobs with fixed start and end times," Discrete Applied Mathematics, vol. 18, no. 1, pp. 1–8, 1987.
[10] D. Marx, "Graph colouring problems and their applications in scheduling," Periodica Polytechnica Electrical Engineering (Archives), vol. 48, no. 1-2, pp. 11–16, 2004.
[11] R. Ramaswami and K. Parhi, "Distributed scheduling of broadcasts in a radio network," in Proceedings of the Eighth Annual Joint Conference of the IEEE Computer and Communications Societies, ser. IEEE INFOCOM '89. IEEE, 1989.
[12] D. Ghrab, B. Derbel, I. Jemili, A. Dhraief, A. Belghith, and E.-G. Talbi, "Coloring based hierarchical routing approach," 2013.
[13] G. Li and R. Simha, "The partition coloring problem and its application to wavelength routing and assignment," in Proceedings of the First Workshop on Optical Networks. Citeseer, 2000, p. 1.
[14] M. Besta, S. M. Hassan, S. Yalamanchili, R. Ausavarungnirun, O. Mutlu, and T. Hoefler, "Slim NoC: A low-diameter on-chip network topology for high energy efficiency and scalability," in ACM SIGPLAN Notices, 2018.
[15] M. Besta and T. Hoefler, "Slim Fly: A cost effective low-diameter network topology," ACM/IEEE Supercomputing, Nov. 2014.
[16] S. Di Girolamo, K. Taranov, A. Kurth, M. Schaffner, T. Schneider, J. Beránek, M. Besta, L. Benini, D. Roweth, and T. Hoefler, "Network-accelerated non-contiguous memory transfers," arXiv preprint arXiv:1908.08590, 2019.
[17] M. Besta, M. Schneider, K. Cynk, M. Konieczny, E. Henriksson, S. Di Girolamo, A. Singla, and T. Hoefler, "Fatpaths: Routing in supercomputers and data centers when shortest paths fall short," ACM/IEEE Supercomputing, 2020.
[18] M. Besta, J. Domke, M. Schneider, M. Konieczny, S. Di Girolamo, T. Schneider, A. Singla, and T. Hoefler, "High-performance routing with multipathing and path diversity in supercomputers and data centers," arXiv preprint arXiv:2007.03776, 2020.
[19] M. Javedankherad, Z. Zeinalpour-Yazdi, and F. Ashtiani, "Content placement in cache networks using graph coloring," IEEE Systems Journal, 2020.
[20] A. Dey and A. Pal, "Fuzzy graph coloring technique to classify the accidental zone of a traffic control," Annals of Pure and Applied Mathematics, vol. 3, no. 2, pp. 169–178, 2013.
[21] G. J. Chaitin, "Register allocation & spilling via graph coloring," ACM Sigplan Notices, vol. 17, no. 6, pp. 98–101, 1982.
[22] J. de Fine Licht et al., "Transformations of high-level synthesis codes for high-performance computing," arXiv:1805.08288, 2018.
[23] R. Lewis, A Guide to Graph Colouring. Springer, 2015, vol. 7.
[24] M. R. Garey, D. S. Johnson, and L. Stockmeyer, "Some simplified NP-complete problems," in Proceedings of the Sixth Annual ACM Symposium on Theory of Computing, ser. STOC'74. ACM, 1974, pp. 47–63.
[25] D. J. A. Welsh and M. B. Powell, "An upper bound for the chromatic number of a graph and its application to timetabling problems," The Computer Journal, vol. 10, no. 1, pp. 85–86, 1967.
[26] M. T. Jones and P. E. Plassmann, "A parallel graph coloring heuristic," SIAM Journal on Scientific Computing, vol. 14, no. 3, pp. 654–669, 1993.
[27] D. Brélaz, "New methods to color the vertices of a graph," Communications of the ACM, vol. 22, no. 4, 1979.
[28] D. W. Matula and L. L. Beck, "Smallest-last ordering and clustering and graph coloring algorithms," Journal of the ACM, vol. 30, no. 3, pp. 417–427, 1983.
[29] R. M. Karp and A. Wigderson, "A fast parallel algorithm for the maximal independent set problem," JACM, vol. 32, no. 4, pp. 762–773, 1985.
[30] M. Luby, "A simple parallel algorithm for the maximal independent set problem," SIAM Journal on Computing, vol. 15, no. 4, pp. 1036–1053, 1986.
[31] W. Hasenplaugh, T. Kaler, T. B. Schardl, and C. E. Leiserson, "Ordering heuristics for parallel graph coloring," in Proceedings of the 26th ACM Symposium on Parallelism in Algorithms and Architectures, ser. SPAA'14. ACM, 2014, pp. 166–177.
[32] M. M. A. Patwary, A. H. Gebremedhin, and A. Pothen, "New multithreaded ordering and coloring algorithms for multicore architectures," in Euro-Par 2011 Parallel Processing, E. Jeannot, R. Namyst, and J. Roman, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2011, pp. 250–262.
[33] J. Allwright, R. Bordawekar, D. Coddington, K. Dincer, and C. L. Martin, "A comparison of parallel graph coloring algorithms," 1995.
[34] Ü. V. Çatalyürek, F. Dobrian, A. Gebremedhin, M. Halappanavar, and A. Pothen, "Distributed-memory parallel algorithms for matching and coloring," in 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and PhD Forum. IEEE, 2011, pp. 1971–1980.
[35] D. Bozdağ, A. H. Gebremedhin, F. Manne, E. G. Boman, and U. V. Catalyurek, "A framework for scalable greedy coloring on distributed-memory parallel computers," Journal of Parallel and Distributed Computing, vol. 68, no. 4, pp. 515–535, 2008.
[36] M. Besta, M. Podstawski, L. Groner, E. Solomonik, and T. Hoefler, "To push or to pull: On reducing communication and synchronization in graph computations," in Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing, 2017, pp. 93–104.
[37] A. H. Gebremedhin and F. Manne, "Scalable parallel graph coloring algorithms," Concurrency: Practice and Experience, vol. 12, no. 12, pp. 85–120, 2000.
[38] E. G. Boman, D. Bozdağ, U. Catalyurek, A. H. Gebremedhin, and F. Manne, "A scalable parallel graph coloring algorithm for distributed memory computers," in European Conference on Parallel Processing. Springer, 2005, pp. 241–251.
[39] A. H. Gebremedhin, I. G. Lassous, J. Gustedt, and J. A. Telle, "Graph coloring on a coarse grained multiprocessor," in International Workshop on Graph-Theoretic Concepts in Computer Science. Springer, 2000, pp. 184–195.
[40] Ü. V. Çatalyürek, J. Feo, A. H. Gebremedhin, M. Halappanavar, and A. Pothen, "Graph coloring algorithms for multi-core and massively multithreaded architectures," Parallel Computing, vol. 38, no. 10-11, pp. 576–594, 2012.
[41] E. Saule and Ü. V. Çatalyürek, "An early evaluation of the scalability of graph algorithms on the intel mic architecture," in 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum. IEEE, 2012, pp. 1629–1639.
[42] A. E. Sarıyüce, E. Saule, and Ü. V. Çatalyürek, "Scalable hybrid implementation of graph coloring using mpi and openmp," in 2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum. IEEE, 2012, pp. 1744–1753.
[43] G. Rokos, G. Gorman, and P. H. Kelly, "A fast and scalable graph coloring algorithm for multi-core and many-core architectures," in European Conference on Parallel Processing. Springer, 2015, pp. 414–425.
[44] A. V. P. Grosset, P. Zhu, S. Liu, S. Venkatasubramanian, and M. Hall, "Evaluating graph coloring on gpus," in Proceedings of the 16th ACM Symposium on Principles and Practice of Parallel Programming, 2011, pp. 297–298.
[45] M. Deveci, E. G. Boman, K. D. Devine, and S. Rajamanickam, "Parallel graph coloring for manycore architectures," in 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 2016, pp. 892–901.
[46] I. Finocchi, A. Panconesi, and R. Silvestri, "An experimental analysis of simple, distributed vertex coloring algorithms," Algorithmica, vol. 41, no. 1, pp. 1–23, 2005.
[47] A.-L. Barabási and R. Albert, "Emergence of scaling in random networks," Science, vol. 286, no. 5439, pp. 509–512, 1999. [Online]. Available: https://science.sciencemag.org/content/286/5439/509
[48] D. R. Lick and A. T. White, "k-degenerate graphs," Canadian Journal of Mathematics, vol. 22, no. 5, pp. 1082–1096, 1970.
[49] F. Cazals and C. Karande, "A note on the problem of reporting maximal cliques," Theoretical Computer Science, vol. 407, no. 1-3, pp. 564–568, 2008.
[50] D. Eppstein, M. Löffler, and D. Strash, "Listing all maximal cliques in sparse graphs in near-optimal time," in Algorithms and Computation - 21st International Symposium, ISAAC 2010, Jeju Island, Korea, December 15-17, 2010, Proceedings, Part I, 2010, pp. 403–414. [Online]. Available: https://doi.org/10.1007/978-3-642-17517-6_36
[51] E. Tomita, A. Tanaka, and H. Takahashi, "The worst-case time complexity for generating all maximal cliques and computational experiments," Theor. Comput. Sci., vol. 363, no. 1, pp. 28–42, 2006. [Online]. Available: https://doi.org/10.1016/j.tcs.2006.06.015
[52] M. Farach-Colton and M. Tsai, "Computing the degeneracy of large graphs," in LATIN 2014: Theoretical Informatics - 11th Latin American Symposium, Montevideo, Uruguay, March 31 - April 4, 2014, Proceedings, 2014, pp. 250–260. [Online]. Available: https://doi.org/10.1007/978-3-642-54423-1_22
[53] M. Chrobak and D. Eppstein, "Planar orientations with low out-degree and compaction of adjacency matrices," Theoretical Computer Science, vol. 86, no. 2, pp. 243–266, 1991.
[54] P. Erdős and A. Hajnal, "On chromatic number of graphs and set-systems," Acta Mathematica Hungarica, vol. 17, no. 1-2, pp. 61–99, 1966.
[55] L. M. Kirousis and D. M. Thilikos, "The linkage of a graph," SIAM Journal on Computing, vol. 25, no. 3, pp. 626–647, 1996.
[56] E. C. Freuder, "A sufficient condition for backtrack-free search," Journal of the ACM (JACM), vol. 29, no. 1, pp. 24–32, 1982.
[57] G. D. Bader and C. W. Hogue, "An automated method for finding molecular complexes in large protein interaction networks," BMC Bioinformatics, vol. 4, no. 1, p. 2, 2003.
[58] S. B. Seidman, "Network structure and minimum degree," Social Networks, vol. 5, no. 3, pp. 269–287, 1983.
[59] R. D. Blumofe and C. E. Leiserson, "Scheduling multithreaded computations by work stealing," Journal of the ACM (JACM), vol. 46, no. 5, pp. 720–748, 1999.
[60] ——, "Space-efficient scheduling of multithreaded computations," SIAM Journal on Computing, vol. 27, no. 1, pp. 202–229, 1998.
[61] L. Dhulipala, G. E. Blelloch, and J. Shun, "Theoretically efficient parallel graph algorithms can be fast and scalable," arXiv:1805.05208v4, 2018.
[62] G. Bilardi and A. Pietracaprina, Models of Computation, Theoretical. Boston, MA: Springer US, 2011, pp. 1150–1158.
[63] G. E. Blelloch and B. M. Maggs, Parallel Algorithms, 2nd ed. Chapman & Hall/CRC, 2010, p. 25.
[64] J. F. JaJa, PRAM (Parallel Random Access Machines). Boston, MA: Springer US, 2011, pp. 1608–1615.
[65] A. Aggarwal, A. K. Chandra, and M. Snir, "On communication latency in pram computations," in Proceedings of the First Annual ACM Symposium on Parallel Algorithms and Architectures, 1989, pp. 11–21.
[66] F. E. Fich, The Complexity of Computation on the Parallel Random Access Machine. Department of Computer Science, University of Toronto, 1993.
[67] R. P. Brent, "The parallel evaluation of general arithmetic expressions," Journal of the ACM (JACM), vol. 21, no. 2, pp. 201–206, 1974.
[68] R. E. Ladner and M. J. Fischer, "Parallel prefix computation," Journal of the ACM, vol. 27, no. 4, pp. 831–838, 1980.
[69] M. Snir, Reduce and Scan. Boston, MA: Springer US, 2011, pp. 1728–1736. [Online]. Available: https://doi.org/10.1007/978-0-387-09766-4_120
[70] D. R. Karger and C. Stein, "A new approach to the minimum cut problem," J. ACM, vol. 43, no. 4, pp. 601–640, 1996.
[71] C. A. R. Hoare, "Quicksort," The Computer Journal, vol. 5, no. 1, pp. 10–16, 1962.
[72] H. W. Lenstra and C. Pomerance, "A rigorous time bound for factoring integers," J. Amer. Math. Soc., vol. 5, pp. 483–516, 1992.
[73] D. R. Karger, P. N. Klein, and R. E. Tarjan, "A randomized linear-time algorithm to find minimum spanning trees," J. ACM, vol. 42, no. 2, pp. 321–328, 1995.
[74] R. Solovay and V. Strassen, "A fast monte-carlo test for primality," SIAM Journal on Computing, vol. 6, no. 1, pp. 84–85, 1977.
[75] H. Gazit, "An optimal randomized parallel algorithm for finding connected components in a graph," SIAM Journal on Computing, vol. 20, no. 6, pp. 1046–1067, 1991.
[76] S. Roch, "Modern discrete probability: An essential toolkit," University Lecture, 2015.
[77] N. Alon, L. Babai, and A. Itai, "A fast and simple randomized parallel algorithm for the maximal independent set problem," Journal of Algorithms, vol. 7, no. 4, pp. 567–583, 1986.
[78] M. Goldberg and S. Thomas, "A new parallel algorithm for the maximal independent set problem," SIAM Journal on Computing, vol. 18, no. 2, pp. 419–427, 1989.
[79] A. V. Goldberg and S. A. Plotkin, "Parallel (∆ + 1)-coloring of constant-degree graphs," Information Processing Letters, vol. 25, no. 4, pp. 341–345, 1987.
[80] A. Goldberg, S. Plotkin, and G. Shannon, "Parallel symmetry-breaking in sparse graphs," in Proceedings of the Nineteenth Annual ACM Symposium on Theory of Computing, ser. STOC '87. ACM, 1987, pp. 315–324.
[81] M. Luby, "Removing randomness in parallel computation without a processor penalty," Journal of Computer and System Sciences, vol. 47, no. 2, pp. 250–286, 1993.
[82] A. H. Gebremedhin, D. Nguyen, M. M. A. Patwary, and A. Pothen, "Colpack: Software for graph coloring and related problems in scientific computing," ACM Transactions on Mathematical Software (TOMS), vol. 40, no. 1, pp. 1–31, 2013.
[83] P. M. McIlroy, K. Bostic, and M. D. McIlroy, "Engineering radix sort," Computing Systems, vol. 6, no. 1, pp. 5–27, 1993.
[84] H. H. Seward, "Information sorting in the application of electronic digital computers to business operations," Ph.D. dissertation, Massachusetts Institute of Technology, Department of Electrical Engineering, 1954.
[85] C. A. Hoare, "Quicksort," The Computer Journal, vol. 5, no. 1, pp. 10–16, 1962.
[86] J. Malicevic, B. Lepers, and W. Zwaenepoel, "Everything you always wanted to know about multicore graph processing but were afraid to ask," in 2017 USENIX Annual Technical Conference (USENIX ATC 17), 2017, pp. 631–643.
[87] S. Beamer, K. Asanović, and D. Patterson, "The gap benchmark suite," arXiv preprint arXiv:1508.03619, 2015.
[88] R. Chandra, L. Dagum, D. Kohr, R. Menon, D. Maydan, and J. McDonald, Parallel Programming in OpenMP. Morgan Kaufmann, 2001.
[89] T. Hoefler and R. Belli, "Scientific benchmarking of parallel computing systems: twelve ways to tell the masses when reporting performance results," in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2015, pp. 1–12.
[90] A. H. Gebremedhin, D. Nguyen, M. Patwary, and A. Pothen, "Colpack: Graph coloring software for derivative computation and beyond," Submitted to ACM TOMS, 2010.
[91] E. G. Boman, Ü. V. Çatalyürek, C. Chevalier, and K. D. Devine, "The zoltan and isorropia parallel toolkits for combinatorial scientific computing: Partitioning, ordering and coloring," Scientific Programming, vol. 20, no. 2, pp. 129–150, 2012.
[92] K. D. Devine, E. G. Boman, L. A. Riesen, U. V. Catalyurek, and C. Chevalier, "Getting started with zoltan: A short tutorial," in Dagstuhl Seminar Proceedings. Schloss Dagstuhl-Leibniz-Zentrum für Informatik, 2009.
[93] S. Rajamanickam and E. G. Boman, "Parallel partitioning with zoltan: Is hypergraph partitioning worth it?" Graph Partitioning and Graph Clustering, vol. 588, pp. 37–52, 2012.
[94] M. A. Heroux, R. A. Bartlett, V. E. Howle, R. J. Hoekstra, J. J. Hu, T. G. Kolda, R. B. Lehoucq, K. R. Long, R. P. Pawlowski, E. T. Phipps et al., "An overview of the trilinos project," ACM Transactions on Mathematical Software (TOMS), vol. 31, no. 3, pp. 397–423, 2005.
[95] J. Shun and G. E. Blelloch, "Ligra: a lightweight graph processing framework for shared memory," in ACM Sigplan Notices, vol. 48, no. 8. ACM, 2013, pp. 135–146.
[96] L. Dhulipala, J. Shi, T. Tseng, G. E. Blelloch, and J. Shun, "The graph based benchmark suite (gbbs)," in Proceedings of the 3rd Joint International Workshop on Graph Data Management Experiences & Systems (GRADES) and Network Data Analytics (NDA), 2020, pp. 1–8.
[97] J. Leskovec and A. Krevl, "SNAP Datasets: Stanford large network dataset collection," http://snap.stanford.edu/data, Jun. 2014.
[98] J. Kunegis, "Konect: the koblenz network collection," in Proc. of Intl. Conf. on World Wide Web (WWW). ACM, 2013, pp. 1343–1350.
[99] C. Demetrescu, A. V. Goldberg, and D. S. Johnson, The Shortest Path Problem: Ninth DIMACS Implementation Challenge. American Math. Soc., 2009, vol. 74.
[100] P. Boldi and S. Vigna, "The webgraph framework i: compression techniques," in Proceedings of the 13th International Conference on World Wide Web. ACM, 2004, pp. 595–602.
[101] J. Leskovec, D. Chakrabarti, J. Kleinberg, C. Faloutsos, and Z. Ghahramani, "Kronecker graphs: An approach to modeling networks," Journal of Machine Learning Research, vol. 11, no. Feb, pp. 985–1042, 2010.
[102] P. J. Mucci, S. Browne, C. Deane, and G. Ho, "Papi: A portable interface to hardware performance counters," in Proceedings of the Department of Defense HPCMP Users Group Conference, vol. 710, 1999.
[103] E. D. Dolan and J. J. Moré, "Benchmarking optimization software with performance profiles," Mathematical Programming, vol. 91, no. 2, pp. 201–213, 2002.
[104] D. W. Matula, G. Marble, and J. D. Isaacson, "Graph coloring algorithms," in Graph Theory and Computing. Elsevier, 1972, pp. 109–122.
[105] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms. MIT Press, 2009.
[106] G. E. Blelloch, "Programming parallel algorithms," Communications of the ACM, vol. 39, no. 3, pp. 85–97, 1996.
[107] N. Linial, "Distributive graph algorithms global solutions from local data," in 28th Annual Symposium on Foundations of Computer Science (sfcs 1987), 1987, pp. 331–335.
[108] ——, "Locality in distributed graph algorithms," SIAM Journal on Computing, vol. 21, no. 1, pp. 193–201, 1992.
[109] L. Barenboim and M. Elkin, "Distributed graph coloring: Fundamentals and recent developments," 2013.
[110] L. Barenboim, M. Elkin, S. Pettie, and J. Schneider, "The locality of distributed symmetry breaking," Journal of the ACM, vol. 63, no. 3, 2016.
[111] L. Barenboim and M. Elkin, "Deterministic distributed vertex coloring in polylogarithmic time," J. ACM, vol. 58, no. 5, 2011.
[112] J. Schneider and R. Wattenhofer, "A new technique for distributed symmetry breaking," in Proceedings of the 29th ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing, ser. PODC '10. New York, NY, USA: ACM, 2010, pp. 257–266. [Online]. Available: https://doi.org/10.1145/1835698.1835760
[113] L. Barenboim and M. Elkin, "Distributed (∆ + 1)-coloring in linear (in ∆) time," in Proceedings of the Forty-First Annual ACM Symposium on Theory of Computing, ser. STOC '09. New York, NY, USA: ACM, 2009, pp. 111–120.
[114] D. G. Harris, J. Schneider, and H.-H. Su, "Distributed (∆+1)-coloring in sublogarithmic rounds," in Proceedings of the Forty-Eighth Annual ACM Symposium on Theory of Computing, ser. STOC '16. New York, NY, USA: ACM, 2016, pp. 465–478.
[115] Y.-J. Chang, W. Li, and S. Pettie, "An optimal distributed (∆+1)-coloring algorithm?" in Proceedings of the 50th Annual ACM SIGACT Symposium on Theory of Computing, ser. STOC 2018. ACM, 2018, pp. 445–456.
[116] A. Panconesi and A. Srinivasan, "Improved distributed algorithms for coloring and network decomposition problems," in Proceedings of the Twenty-Fourth Annual ACM Symposium on Theory of Computing, ser. STOC '92. New York, NY, USA: ACM, 1992, pp. 581–592.
[117] L. Barenboim, M. Elkin, and U. Goldenberg, "Locally-iterative distributed (∆+1)-coloring below Szegedy-Vishwanathan barrier, and applications to self-stabilization and to restricted-bandwidth models," in Proceedings of the 2018 ACM Symposium on Principles of Distributed Computing, ser. PODC '18. New York, NY, USA: ACM, 2018, pp. 437–446.
[118] Y.-J. Chang, T. Kopelowitz, and S. Pettie, "An exponential separation between randomized and deterministic complexity in the local model," SIAM Journal on Computing, vol. 48, no. 1, pp. 122–143, 2019.
[119] Ö. Johansson, "Simple distributed (∆+1)-coloring of graphs," Information Processing Letters, vol. 70, no. 5, pp. 229–232, 1999.
[120] D. A. Grable and A. Panconesi, "Fast distributed algorithms for Brooks–Vizing colorings," Journal of Algorithms, vol. 37, no. 1, pp. 85–120, 2000.
[121] F. Kuhn and R. Wattenhofer, "On the complexity of distributed graph coloring," in Proceedings of the Twenty-Fifth Annual ACM Symposium on Principles of Distributed Computing, 2006, pp. 7–15.
[122] D. Gregor and A. Lumsdaine, "The parallel bgl: A generic library for distributed graph computations," 2005.
[123] D. Bozdağ, A. H. Gebremedhin, F. Manne, E. G. Boman, and U. V. Catalyurek, "A framework for scalable greedy coloring on distributed-memory parallel computers," Journal of Parallel and Distributed Computing, vol. 68, no. 4, pp. 515–535, 2008.
[124] M. Naumov, M. Arsaev, P. Castonguay, J. Cohen, J. Demouth, J. Eaton, S. Layton, N. Markovskiy, I. Reguly, N. Sakharnykh, V. Sellappan, and R. Strzodka, "Amgx: A library for gpu accelerated algebraic multigrid and preconditioned iterative methods," SIAM Journal on Scientific Computing, vol. 37, no. 5, pp. 602–626, 2015.
[125] M. Naumov, L. Chien, P. Vandermersch, and U. Kapasi, "Cusparse library," in GPU Technology Conference, 2010.
[126] J. Cohen and P. Castonguay, "Efficient graph matching and coloring on the gpu," in GPU Technology Conference, 2012, pp. 1–10.
[127] X. Chen, P. Li, J. Fang, T. Tang, Z. Wang, and C. Yang, "Efficient and high-quality sparse graph coloring on gpus," Concurrency and Computation: Practice and Experience, vol. 29, no. 10, p. e4064, 2017.
[128] S. Che, G. Rodgers, B. Beckmann, and S. Reinhardt, "Graph coloring on the gpu and some techniques to improve load imbalance," in 2015 IEEE International Parallel and Distributed Processing Symposium Workshop. IEEE, 2015, pp. 610–617.
[129] M. Naumov, P. Castonguay, and J. Cohen, "Parallel graph coloring with applications to the incomplete-lu factorization on the gpu," Nvidia White Paper, 2015.
[130] J. Culberson, "Iterated greedy graph coloring and the difficulty landscape," 1992.
[131] A. E. Sarıyüce, E. Saule, and U. V. Çatalyürek, "Improving graph coloring on distributed-memory parallel computers," in 2011 18th International Conference on High Performance Computing. IEEE, 2011.
[132] N. M. Gandhi and R. Misra, "Performance comparison of parallel graph coloring algorithms on bsp model using hadoop," in 2015 International Conference on Computing, Networking and Communications (ICNC). IEEE, 2015, pp. 110–116.
[133] G. Alabandi, E. Powers, and M. Burtscher, "Increasing the parallelism of graph coloring via shortcutting," in Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2020, pp. 262–275.
[134] K. Diks, "A fast parallel algorithm for six-colouring of planar graphs," in International Symposium on Mathematical Foundations of Computer Science. Springer, 1986, pp. 273–282.
[135] D. W. Matula, Y. Shiloach, and R. E. Tarjan, "Two linear-time algorithms for five-coloring a planar graph," Stanford Univ., Dept. of Computer Science, Tech. Rep., 1980.
[136] J. F. Boyar and H. J. Karloff, "Coloring planar graphs in parallel," Journal of Algorithms, vol. 8, no. 4, pp. 470–479, 1987.
[137] T. Hagerup, M. Chrobak, and K. Diks, "Optimal parallel 5-colouring of planar graphs," SIAM Journal on Computing, vol. 18, no. 2, pp. 288–300, 1989.
[138] R. K. Gjertsen Jr., M. T. Jones, and P. E. Plassmann, "Parallel heuristics for improved, balanced graph colorings," Journal of Parallel and Distributed Computing, vol. 37, no. 2, pp. 171–186, 1996.
[139] H. Lu, M. Halappanavar, D. Chavarría-Miranda, A. Gebremedhin, and A. Kalyanaraman, "Balanced coloring for parallel computing applications," in 2015 IEEE International Parallel and Distributed Processing Symposium, 2015, pp. 7–16.
[140] A. H. Gebremedhin, F. Manne, and A. Pothen, "Parallel distance-k coloring algorithms for numerical optimization," in Euro-Par 2002 Parallel Processing Proceedings. Springer, Berlin, Heidelberg, 2002, pp. 912–921.
[141] I. Holyer, "The np-completeness of edge-coloring," SIAM Journal on Computing, vol. 10, no. 4, pp. 718–720, 1981.
[142] S. Sallinen, K. Iwabuchi, S. Poudel, M. Gokhale, M. Ripeanu, and R. Pearce, "Graph colouring as a challenge problem for dynamic graph
[159] H. Lemos, M. Prates, P. Avelar, and L. Lamb, "Graph colouring meets deep learning: Effective graph neural network models for combinatorial problems," arXiv preprint arXiv:1903.04598, 2019.
[160] Y. Zhou, J.-K. Hao, and B. Duval, "Reinforcement learning based local search for grouping problems: A case study on graph coloring," Expert Systems with Applications, vol. 64, pp. 412–422, 2016.
[161] N. Musliu and M. Schwengerer, "Algorithm selection for the graph coloring problem," in International Conference on Learning and Intelligent Optimization. Springer, 2013, pp. 389–403.
[162] T. Ben-Nun, M. Besta, S. Huber, A. N. Ziogas, D. Peter, and T. Hoefler, "A modular benchmarking infrastructure for high-performance and reproducible deep learning," IEEE IPDPS, 2019.
[163] J. Huang, M. Patwary, and G. Diamos, "Coloring big graphs with alphagozero," arXiv preprint arXiv:1902.10162, 2019.
[164] M. Farach-Colton and M.-T.
Tsai, “Tight approximations of degeneracy processing on distributed systems,” in SC’16: Proceedings of the In- in large graphs,” in LATIN 2016: Theoretical Informatics. Springer, ternational Conference for High Performance Computing, Networking, 2016, pp. 429–440. Storage and Analysis. IEEE, 2016, pp. 347–358. [165] ——, “Computing the degeneracy of large graphs,” in Latin American [143] L. Yuan, L. Qin, X. Lin, L. Chang, and W. Zhang, “Effective and effi- Symposium on Theoretical Informatics. Springer, 2014, pp. 250–260. cient dynamic graph coloring,” Proceedings of the VLDB Endowment, [166] J. E. Gonzalez, Y. Low, H. Gu, D. Bickson, and C. Guestrin, “Pow- vol. 11, no. 3, pp. 338–351, 2017. ergraph: Distributed graph-parallel computation on natural graphs,” in [144] J. Bossek, F. Neumann, P. Peng, and D. Sudholt, “Runtime analysis of Presented as part of the 10th {USENIX} Symposium on Operating randomized search heuristics for dynamic graph coloring,” in Proceed- Systems Design and Implementation ({OSDI} 12), 2012, pp. 17–30. ings of the Genetic and Evolutionary Computation Conference, 2019, [167] M. Besta and T. Hoefler, “Accelerating irregular computations with pp. 1443–1451. hardware transactional memory and active messages,” in ACM HPDC, [145] S. Bhattacharya, D. Chakrabarty, M. Henzinger, and D. Nanongkai, 2015. “Dynamic algorithms for graph coloring,” in Proceedings of the [168] M. Besta, D. Stanojevic, T. Zivic, J. Singh, M. Hoerold, and T. Hoefler, Twenty-Ninth Annual ACM-SIAM Symposium on Discrete Algorithms. “Log (graph) a near-optimal high-performance graph representation,” SIAM, 2018, pp. 1–20. in ACM PACT, 2018, pp. 1–13. [146] S. Solomon and N. Wein, “Improved dynamic graph coloring,” arXiv [169] L. Gianinazzi, P. Kalvoda, A. De Palma, M. Besta, and T. Hoefler, preprint arXiv:1904.12427, 2019. “Communication-avoiding parallel minimum cuts and connected com- ponents,” in ACM SIGPLAN Notices, vol. 53, no. 1. ACM, 2018, pp. [147] S. K. Bera and P. 
Ghosh, “Coloring in graph streams,” arXiv preprint 219–232. arXiv:1807.07640, 2018. [170] M. Besta, M. Fischer, V. Kalavri, M. Kapralov, and T. Hoefler, “Practice [148] M. Besta, M. Fischer, T. Ben-Nun, J. De Fine Licht, and T. Hoefler, of streaming and dynamic graphs: Concepts, models, systems, and “Substream-centric maximum matchings on fpga,” in ACM/SIGDA parallelism,” arXiv preprint arXiv:1912.12740, 2019. FPGA, 2019, pp. 152–161. [171] M. Besta, E. Peter, R. Gerstenberger, M. Fischer, M. Podstawski, [149] H. Lu, M. Halappanavar, D. Chavarr´ıa-Miranda, A. H. Gebremedhin, C. Barthels, G. Alonso, and T. Hoefler, “Demystifying graph databases: A. Panyala, and A. Kalyanaraman, “Algorithms for balanced graph Analysis and taxonomy of data organization, system designs, and graph IEEE Transactions colorings with applications in parallel computing,” queries,” arXiv preprint arXiv:1910.09017, 2019. on Parallel and Distributed Systems, vol. 28, no. 5, pp. 1240–1256, [172] M. Besta, D. Stanojevic, J. D. F. Licht, T. Ben-Nun, and T. Hoefler, 2017. “Graph processing on fpgas: Taxonomy, survey, challenges,” arXiv [150] D. Bozdag,˘ U.¨ i. t. V. C¸atalyurek,¨ A. H. Gebremedhin, F. Manne, E. G. ¨ preprint arXiv:1903.06697, 2019. Boman, and F. Ozguner,¨ “Distributed-memory parallel algorithms for [173] M. Besta, M. Fischer, T. Ben-Nun, D. Stanojevic, J. D. F. Licht, and distance-2 coloring and related problems in derivative computation,” T. Hoefler, “Substream-centric maximum matchings on fpga,” ACM SIAM Journal on Scientific Computing, vol. 32, no. 4, pp. 2418–2446, Transactions on Reconfigurable Technology and Systems (TRETS), 2010. vol. 13, no. 2, pp. 1–33, 2020. [151] D. Bozdag,˘ U. Catalyurek, A. H. Gebremedhin, F. Manne, E. G. Bo- [174] L. Thebault, “Scalable and efficient algorithms for unstructured mesh ¨ man, and F. Ozguner,¨ “A parallel distance-2 graph coloring algorithm computations,” Ph.D. dissertation, 2016. 
for distributed memory computers,” in International Conference on [175] J. S. Firoz, M. Zalewski, A. Lumsdaine, and M. Barnas, “Runtime High Performance Computing and Communications. Springer, 2005, scheduling policies for distributed graph algorithms,” in 2018 IEEE pp. 796–806. International Parallel and Distributed Processing Symposium (IPDPS). [152] J. Lin, S. Cai, C. Luo, and K. Su, “A reduction based method for IEEE, 2018, pp. 640–649. coloring very large graphs.” in IJCAI, 2017, pp. 517–523. [176] D. Gregor and A. Lumsdaine, “The parallel bgl: A generic library [153] A. Verma, A. Buchanan, and S. Butenko, “Solving the maximum for distributed graph computations,” Parallel Object-Oriented Scientific clique and vertex coloring problems on very large sparse networks,” Computing (POOSC), vol. 2, pp. 1–18, 2005. INFORMS Journal on computing, vol. 27, no. 1, pp. 164–177, 2015. [177] M. Besta and T. Hoefler, “Active access: A mechanism for high- [154]E.H ebrard´ and G. Katsirelos, “A hybrid approach for exact coloring performance distributed data-centric computations,” in ACM ICS, 2015. of massive graphs,” in International Conference on Integration of Con- [178] ——, “Fault tolerance for remote memory access programming mod- straint Programming, Artificial Intelligence, and Operations Research. els,” in ACM HPDC, 2014, pp. 37–48. Springer, 2019, pp. 374–390. [179] R. Gerstenberger, M. Besta, and T. Hoefler, “Enabling Highly-scalable [155] R. Abbasian and M. Mouhoub, “An efficient hierarchical parallel Remote Memory Access Programming with MPI-3 One Sided,” in genetic algorithm for graph coloring problem,” in Proceedings of the ACM/IEEE Supercomputing, ser. SC ’13, 2013, pp. 53:1–53:12. 13th annual conference on Genetic and evolutionary computation, [180] ——, “Enabling highly scalable remote memory access programming 2011, pp. 521–528. with mpi-3 one sided,” Communications of the ACM, vol. 61, no. 10, [156] A. E. Eiben, J. K. Van Der Hauw, and J. I. 
van Hemert, “Graph coloring pp. 106–113, 2018. with adaptive evolutionary algorithms,” Journal of Heuristics, vol. 4, [181] H. Schweizer, M. Besta, and T. Hoefler, “Evaluating the cost of atomic no. 1, pp. 25–46, 1998. operations on modern architectures,” in IEEE PACT, 2015, pp. 445– [157] C. Fleurent and J. A. Ferland, “Genetic and hybrid algorithms for graph 456. coloring,” Annals of Operations Research, vol. 63, no. 3, pp. 437–461, [182] P. Schmid, M. Besta, and T. Hoefler, “High-performance distributed 1996. RMA locks,” in ACM HPDC, 2016, pp. 19–30. [158] Y. Zhou, B. Duval, and J.-K. Hao, “Improving probability learning [183] M. Besta, S. Weber, L. Gianinazzi, R. Gerstenberger, A. Ivanov, based local search for graph coloring,” Applied Soft Computing, vol. 65, Y. Oltchik, and T. Hoefler, “Slim graph: practical lossy graph com- pp. 542–553, 2018. pression for approximate graph processing, storage, and analytics,”

22 in Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. ACM, 2019, p. 35. [184] M. Besta and T. Hoefler, “Survey and taxonomy of lossless graph compression and space-efficient graph representations,” arXiv preprint arXiv:1806.01799, 2018.

23