arXiv:1808.06705v4 [cs.DS] 3 May 2021

Graph connectivity in log steps using label propagation

Paul Burkhardt∗

May 4, 2021

Abstract

The fastest deterministic algorithms for connected components take logarithmic time and perform superlinear work on a Parallel Random Access Machine (PRAM). These algorithms maintain a spanning forest by merging and compressing trees, which requires pointer-chasing operations that increase memory access latency and are limited to shared-memory systems. Many of these PRAM algorithms are also very complicated to implement. Another popular method is "leader-contraction" where the challenge is to select a constant fraction of leaders that are adjacent to a constant fraction of non-leaders with high probability, but this can require adding more edges than were in the original graph. Instead we investigate label propagation because it is deterministic, easy to implement, and does not rely on pointer-chasing. Label propagation exchanges representative labels within a component using simple graph traversal, but it is inherently difficult to complete in a sublinear number of steps. We are able to overcome the problems with label propagation for graph connectivity.

We introduce a surprisingly simple framework for deterministic, undirected graph connectivity using label propagation that is easily adaptable to many computational models. It achieves logarithmic convergence independently of the number of processors and without increasing the edge count. We employ a novel method of propagating directed edges in alternating direction while performing minimum reduction on vertex labels. We present new algorithms in PRAM, Stream, and MapReduce. Given a simple, undirected graph G = (V,E) with n = |V| vertices, m = |E| edges, our approach takes O(m) work each step, but we can only prove logarithmic convergence on a path graph. It was conjectured by Liu and Tarjan (2019) to take O(log n) steps or possibly O(log^2 n) steps. Our experiments on a range of difficult graphs also suggest logarithmic convergence. We leave the proof of convergence as an open problem.

Keywords: graph connectivity, connected components, parallel algorithm, pram, stream, mapreduce

∗ Research Directorate, National Security Agency, Fort Meade, MD 20755. Email: [email protected]

1 Introduction

Given a simple, undirected graph G = (V,E) with n = |V| vertices and m = |E| edges, the connected components of G are partitions of V such that every pair of vertices are connected by a path, which is a sequence of adjacent edges in E. If two vertices are not connected then they are in different components. The distance d(v,u) is the shortest-path length between vertices v and u. The diameter D is the maximum distance in G. We wish to find the connected components of G in O(log n) steps using simple, deterministic methods that are adaptable to many computational models.

The fastest, deterministic parallel (NC) algorithms for connected components take logarithmic time and perform superlinear work on a Parallel Random Access Machine (PRAM). These algorithms maintain a spanning forest by merging and compressing trees [22, 13, 3, 42], which requires pointer-chasing operations that increase memory access latency. Pointer jumping was a primary source of slowdown in a parallel minimum spanning tree algorithm [12]. The PRAM implementations are also limited to shared-memory systems [19, 34, 4]. Another popular method is "leader-contraction" where the challenge is to select a constant fraction of leaders that are adjacent to a constant fraction of non-leaders with high probability [2, 26, 36, 23]. Not only is this method randomized but it can require adding many more edges to the graph. Instead we investigate label propagation because it is deterministic, easy to implement, and does not rely on pointer-chasing. Label propagation exchanges representative labels within a component using simple graph traversal, but it is inherently difficult to complete in a sublinear number of steps [37, 39, 41]. Adding and removing edges must be carefully managed to keep the edge count and therefore the work linear in each step.

We are able to overcome the problems with label propagation for graph connectivity. We introduce a surprisingly simple framework for deterministic, undirected graph connectivity using label propagation that is easily adaptable to many computational models. It achieves logarithmic convergence independently of the number of processors and without increasing the edge count. We employ a novel method of propagating directed

[Figure 1 graphic: four panels labeled (1) to (4), each drawn on vertices 1, 2, 3, 4.]

Figure 1: A path graph converges in three steps. After each step the output graph becomes the input for the next step. Undirected edges denote a pair of counter-oriented edges.

edges in alternating direction while performing minimum reduction on vertex labels. We believe our solution to obtaining sublinear convergence and near optimal work for connected components is one of the simplest to date. Moreover, our experiments demonstrate fast convergence.

In this paper we say that a (v,u) edge is directed from v to u, and (·, ·) denotes an ordered pair that distinguishes (v,u) from (u,v). We call the counter-oriented (v,u), (u,v) edges the twins of a conjugate pair. An undirected {u,v} edge in G is then composed of these twins. To propagate a label w from edge (v,u) we create just the (u,w) twin. In the next step we reverse the direction to return the opposite twin, (w,u), if w is the minimum label for u. We call these two edge operations label propagation and symmetrization, respectively. Concomitant with label propagation is a min update on u's minimum label, which may or may not change. Contraction of the graph is due to the label propagation operation, and symmetrization ensures that a vertex and its minimum label are able to exchange new minimum labels. Thus every vertex in each step propagates and retains its current minimum label.

Let l(v) be the current minimum label for v. Then for an edge (v,u) we get (u,l(v)) or (u,v) in a single step, due to either label propagation or symmetrization, respectively. Each edge is replaced by a new edge and hence the method maintains a stable edge count. The essential operations are summarized as follows.

• For every (v,u) edge if u is not l(v) then min update and label propagation, else symmetrization, repeating until no label changes.
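The rule in the bullet above can be sketched for a single directed edge. This is only an illustration: the function name `replace_edge` and the dictionary `label` (mapping each vertex to its current minimum label l(v)) are hypothetical and do not come from the paper.

```python
def replace_edge(v, u, label):
    """Apply the per-edge rule to one directed edge (v, u).

    `label` is an assumed dict mapping each vertex to l(vertex).
    Returns (new_edge, label_offered_to_u, operation_name).
    """
    lv = label[v]
    if u != lv:
        # Label propagation: (v, u) becomes (u, l(v)) and u is offered l(v)
        # through a min update, which may or may not change u's label.
        return (u, lv), min(label[u], lv), "label propagation"
    # Symmetrization: (v, u) becomes (u, v), keeping v tied to its minimum label.
    return (u, v), label[u], "symmetrization"


# Starting labels for a 4-vertex path 1-3-4-2 (an assumed example input).
labels = {1: 1, 2: 2, 3: 1, 4: 2}
print(replace_edge(3, 4, labels))  # ((4, 1), 1, 'label propagation')
print(replace_edge(3, 1, labels))  # ((1, 3), 1, 'symmetrization')
```

In both branches exactly one new edge is returned, which is why the edge count stays stable from step to step.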

Figure 1 illustrates our method where the starting minimum label for each vertex is the lowest vertex ID among its neighbors and itself, and eventually the graph is transformed into a star whose root is the component label. This is a practical and very simple technique for large-scale streaming and parallel graph connectivity. We present new algorithms in PRAM, Stream, and MapReduce. Our approach takes O(m) work each step, but we can only prove logarithmic convergence on a path graph.

Despite the simplicity of our algorithm, the proof of logarithmic convergence is elusive and poses a rather interesting challenge. We conjecture that our algorithm takes O(log n) steps to converge. Our algorithm behaves well and empirically takes O(log n) steps on a range of difficult graphs. In 2019 Liu and Tarjan conjectured that an earlier version of our algorithm takes O(log n) steps or possibly O(log^2 n) steps [29]. We leave the proof of convergence as an open problem.

2 Our contribution

We introduce a simple, deterministic label propagation method for undirected graph connectivity. Our approach propagates directed edges in alternating direction to achieve fast convergence independently of the number of processors while also maintaining O(m) work each step. We present new algorithms in PRAM, Stream, and MapReduce. We will silently use the standard notation for asymptotic bounds to provide a familiar basis for comparison, but the reader should keep in mind that our bounds that depend on the convergence are conjecture only.

If our conjecture of O(log n) convergence is true, then our label propagation algorithm on a Concurrent Read Concurrent Write (CRCW) PRAM achieves O(log n) time and O(m log n) work with O(m) processors. On an Exclusive Read Exclusive Write (EREW) PRAM it takes O(log^2 n) time and O(m log^2 n) work. In contrast, the fastest deterministic CRCW graph connectivity algorithms take O(log n) time and O((m + n) · α(m,n)) work using O((m + n) · α(m,n)/log n) processors [22, 13], where α(m,n) is the inverse Ackermann function. The best-known NC EREW algorithm [11] takes O(log n) time and O((m + n) log n) work with O(m + n) processors. Although our results are slower than those for the fastest deterministic CRCW and EREW algorithms, our method is much simpler and easier to implement.

We also give an efficient Stream-Sort algorithm that takes O(log n) passes and O(log n) memory, and a MapReduce algorithm taking O(log n) rounds and O(m log n) communication overall. These would be the first deterministic O(log n)-step graph connectivity algorithms in Stream and MapReduce models. For the

Model         Steps        Cost                   Class          Author
CRCW          O(log n)     O((m + n) · α(m,n))    deterministic  Cole, Vishkin [13]
EREW          O(log n)     O((m + n) log n)       deterministic  Chong, Han, and Lam [11]
Stream-Sort   O(log n)     O(log n)               randomized     Aggarwal et al. [1]
MapReduce     O(log^2 n)   O(m log^2 n)           deterministic  Kiveris et al. [26]
CRCW          O(log n)     O(m log n)             deterministic  This paper
EREW          O(log^2 n)   O(m log^2 n)           deterministic  This paper
Stream-Sort   O(log n)     O(log n)               deterministic  This paper
MapReduce     O(log n)     O(m log n)             deterministic  This paper

Table 1: Comparison to state-of-the-art. Cost is work, memory, or communication for PRAM, Stream-Sort, and MapReduce, respectively. (The O(log n) convergence results for this paper are conjecture.)

purposes of this discussion we will assume O(log n) convergence holds for our algorithm. With that assumption, refer to Table 1 for a summary of these results and comparison to the current state-of-the-art. The computational models we explored are briefly described in Section 3. We refer the reader to [10, 33, 31, 28] for more complete descriptions. A survey of related work is given in Section 4. Then in Section 5 we introduce our principal algorithm which establishes the framework behind our technique. This leads to our PRAM results in Section 6. In Section 7 the framework is extended for Stream-Sort and MapReduce models, which introduces the subject of label duplication in Section 8 where we identify how pathological duplication of labels arises and how to address it. We give our Stream-Sort and MapReduce algorithms in Sections 9 and 10. Finally we briefly describe a parallel implementation of our principal algorithm in Section 11 followed by basic empirical results in Section 12.

3 Computational models

In a PRAM [18] each processor can access any global memory location in unit time. Processors can read from global memory, perform a computation, and write a result to global memory in a single clock cycle. All processors execute these instructions at the same time. A read or write to a memory location is restricted to one processor at a time in an Exclusive Read Exclusive Write (EREW) PRAM. Writes to a memory location are restricted to one processor at a time in a Concurrent Read Exclusive Write (CREW) PRAM. A Concurrent Read Concurrent Write (CRCW) PRAM permits concurrent read and write to a memory location by any number of processors, but values concurrently written by multiple processors to the same memory location are handled by a resolution protocol. A Combining Write Resolution uses an associative operator to combine all values in a single instruction. A Combining CRCW employs this to store a reduction of the values, such as the minimum, in constant time [40].

The Stream model [32, 21] focuses on the trade-off between working memory space s and number of passes p over the input stream, allowing the computational time to be unbounded. In W-Stream [38] an algorithm can write to the stream for subsequent passes and in Stream-Sort [1] the input or intermediate output stream can also be sorted. In both W-Stream and Stream-Sort the output streams become the input stream in the next pass. In Stream-Sort the streaming and sorting passes alternate, so a Stream-Sort algorithm reads an input stream, computing on the items in the stream, while writing to an intermediate output stream that gets reordered for free by a subsequent sorting pass. The streams are bounded by the starting problem size. An algorithm in Stream-Sort is efficient if it takes polylog n passes and memory.

The MapReduce model [24, 20, 35] appeared some years after the programming paradigm was popularized by Google [14]. The model employs the functions map and reduce, executed in sequence. The input is a set of ⟨key, value⟩ pairs that are "mapped" by instances of the map function into a multiset of ⟨key, value⟩ pairs. The map output pairs are "reduced" and also output as a multiset of ⟨key, value⟩ pairs by instances of the reduce function, where a single reduce instance gets all values associated with a key. A round of computation is a single sequence of map and reduce executions where there can be many instances of map and reduce functions. Each map or reduce function can complete in polynomial time for input n. Each map or reduce instance is limited to O(n^{1−ε}) memory for a constant ε > 0, and an algorithm is allowed O(n^{2−2ε}) total memory. The number of machines/processors is bounded to O(n^{1−ε}), but each machine can run more than one instance of a map or reduce function.
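To make the shape of a MapReduce round concrete, the following is a minimal single-machine sketch of one round: map, group by key, reduce. It is not the MapReduce algorithm of this paper. The helper names `run_round`, `emit_both_endpoints`, and `keep_minimum` are hypothetical, and the example only computes each vertex's starting minimum label from an assumed edge list.

```python
from collections import defaultdict


def run_round(pairs, map_fn, reduce_fn):
    """Simulate one MapReduce round over a list of (key, value) pairs."""
    # Map phase: every input pair may emit any number of intermediate pairs.
    intermediate = []
    for key, value in pairs:
        intermediate.extend(map_fn(key, value))
    # Shuffle: group values by key so that a single reduce instance
    # receives all values associated with its key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce phase: each reduce instance emits a multiset of output pairs.
    output = []
    for key in sorted(groups):
        output.extend(reduce_fn(key, groups[key]))
    return output


def emit_both_endpoints(v, u):
    # Hypothetical map function: offer each endpoint to the other.
    return [(v, u), (u, v)]


def keep_minimum(vertex, neighbors):
    # Hypothetical reduce function: minimum over the closed neighborhood.
    return [(vertex, min(neighbors + [vertex]))]


# Assumed input: one orientation of each edge of the path 1-3-4-2.
print(run_round([(1, 3), (3, 4), (4, 2)], emit_both_endpoints, keep_minimum))
# [(1, 1), (2, 2), (3, 1), (4, 2)]
```

A real deployment would spread the map and reduce instances over many machines; the grouping step here stands in for the shuffle between them.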

4 Related work

The famous 1982 algorithm by Shiloach and Vishkin [42] takes O(log n) time using O(m + n) processors on a CRCW PRAM, performing O((m + n) log n) work overall. In 1991 Cole and Vishkin improved the result to O(log n) time and O((m + n) · α(m,n)) work using O((m + n) · α(m,n)/log n) processors [13], but their asymptotic bound hides a large constant. The constant was reduced by Iwama and Kambayashi in 1994 [22]. But these latter CRCW algorithms are very complicated and difficult to translate to other computational models because of the pointer operations. Our algorithm takes O(log n) time using O(m) processors, albeit on a more powerful CRCW. But it is more amenable to other models because it does not rely on pointer-jumping.

The fastest deterministic EREW algorithm takes O(log n) time using O(m + n) processors and is due to the 2001 work by Chong, Han, and Lam [11]. This algorithm relies on carefully merging adjacency lists, and improves the earlier 1995 result by Chong and Lam, which took O(log n log log n) time. These EREW algorithms require parallel sorting and pointer jumping. Our EREW algorithm is slower, taking O(log^2 n) time and O(m log^2 n) work using O(m) processors, but it does not rely on pointer jumping or sorting and is far simpler to implement.

The best known deterministic Stream algorithm for connected components is given by Demetrescu, Finocchi, and Ribichini [15], taking O((n log n)/s) passes and s working memory size in W-Stream. Their algorithm can only achieve O(log n) passes using s = O(n) memory. A randomized s-t-connectivity algorithm by Aggarwal et al. [1] takes O(log n) passes and memory in Stream-Sort. It can be modified to compute connected components with the same bounds, but requires sorting in three of four steps in each pass [33]. In contrast, our Stream-Sort connected components algorithm is deterministic and takes O(log n) passes and O(log n) memory. It requires only one sorting step per pass and is straightforward to implement.

A randomized MapReduce algorithm by Rastogi et al. [36] was one of the first to show promise of fast convergence for connected components. But it was later shown in [2] that their Hash-to-Min algorithm [36] takes Ω(log n) rounds. It uses a single task to send an entire component to another, which for a giant component will effectively serialize the communication. In 2014 Kiveris et al. [26] introduced their Two-Phase algorithm, which takes O(log^2 n) rounds and O(m log^2 n) communication overall. Unlike the Hash-to-Min algorithm it avoids sending the giant component to a single reduce task. We introduce a new MapReduce algorithm that is comparable to Two-Phase while being deterministic and simple to implement. Like Two-Phase our algorithm does not load components into memory or send an entire component through a single communication channel. We go further in memory conservation by maintaining O(1)-space working memory. Our MapReduce algorithm completes in O(log n) rounds using O(m log n) communication, thereby improving the state-of-the-art by an Ω(log n) factor in both convergence and communication.

Although we do not study the MPC model [5, 6] in this paper, we want to highlight recent breakthrough work in this model. The MPC model is a generalization of MapReduce and other Bulk Synchronous Parallel (BSP) style models. The MPC model is more powerful than MapReduce; a MapReduce algorithm can be simulated in MPC with the same runtime. In 2018 Andoni et al. [2] gave a randomized O(log D log log_{m/n} n) round algorithm in MPC for connected components. Their algorithm uses the leader-contraction method that works by selecting a small fraction of leader vertices while maintaining high probability that non-leader vertices are adjacent. To achieve this the authors add edges so the graph has uniformly large degree, but to avoid Ω(n^3) communication cost they carefully manage how edges are added. This result was later improved in 2019 by Behnezhad et al. [7], who gave an O(log D + log log_{m/n} n)-round, randomized algorithm in MPC. More recently, Liu et al. [30] gave a randomized CRCW PRAM algorithm based on the work of Andoni et al. [2] and Behnezhad et al. [7], taking O(log D + log log_{m/n} n) time using O(m) processors. These results use randomization and are not simple to implement, which conflicts with our motivation.

The work most closely related to ours is that of Liu and Tarjan [29] who gave a family of label propagation algorithms taking between O(log n) and O(log^2 n) steps. Similar to our approach they propagate labels by minimum reduction at each step, which creates a directed graph. But in their approach they maintain acyclicity and hence their algorithms produce a forest of trees each step. In contrast, our algorithm does not require maintaining trees, as is evident in Figure 1. They also perform a short-cutting operation in which the parent of every vertex is replaced with its grandparent, whereas we only exchange labels between edge endpoints. They had analyzed an earlier version of our algorithm and conjectured it takes polylog steps to converge. Our current algorithm in this paper [8] is even simpler than our previous version and those in [29]. Moreover, it has the added benefit that the work per step is easily shown to take no more space than the input graph.
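For contrast with label exchange along edges, the short-cutting operation mentioned above (every vertex replaces its parent with its grandparent) can be sketched as follows. This is a generic illustration of pointer jumping on an assumed parent map, not code from [29]; the names `shortcut` and `rounds_to_flatten` are hypothetical.

```python
def shortcut(parent):
    """One round of short-cutting: each vertex adopts its grandparent."""
    return {v: parent[parent[v]] for v in parent}


def rounds_to_flatten(parent):
    """Repeat short-cutting until no parent pointer changes."""
    rounds = 0
    while True:
        nxt = shortcut(parent)
        if nxt == parent:
            return rounds, parent
        parent, rounds = nxt, rounds + 1


# Assumed input: a chain 5 -> 4 -> 3 -> 2 -> 1 where vertex 1 is its own parent.
print(rounds_to_flatten({1: 1, 2: 1, 3: 2, 4: 3, 5: 4}))
# (2, {1: 1, 2: 1, 3: 1, 4: 1, 5: 1})
```

Each round reads `parent[parent[v]]`, a dependent pair of memory accesses. This is the pointer-chasing pattern that the label propagation approach avoids.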

Algorithm 1

L_k ⊲ arrays for l(v) at each step k
Initialize L_1 with all starting l(v)
for k = 1, 2, ... until labels converge do
    set L_{k+1} := L_k
    for all (v,u) ∈ E_k do
        if u ≠ L_k[v] then
            set L_{k+1}[u] := min(L_{k+1}[u], L_k[v])
            add (u, L_k[v]) to E_{k+1}    ⊲ Label Propagation
        else
            add (u, v) to E_{k+1}    ⊲ Symmetrization

5 Principal algorithm

We begin with our principal algorithm to establish the framework and core principles. The essential operations, as succinctly summarized in the introduction, are simply: for every (v,u) edge, if u is not l(v) then min update and label propagation, else symmetrization, repeating until no label changes. Here Algorithm 1 describes our method in full. We don't specify any model now so we can focus on the basic procedures. It should be noted that this principal algorithm achieves linear work and fast convergence in both sequential and parallel settings.

We will use N_k(v) to denote the neighborhood of a vertex v at step k and N_k^+(v) = {v} ∪ N_k(v) as the closed neighborhood. Then let l(v) = min(N_k^+(v)) be the current minimum label for v. We use this l(v) notation without a step subscript for simplicity. In our algorithm listings we use arrays L_k in its place, e.g. L_k[v] holds l(v) for v at step k. Only two such arrays are needed in each step. Before the algorithm starts L_1 is initialized with the l(v). For all algorithms we use E_k to denote the edges that will be processed at step k, but E_k is a multiset because it may contain duplicates. We employ two edge operations, label propagation and symmetrization, defined as follows.

Definition 1. Label propagation replaces a (v,u) edge with (u,l(v)) if u ≠ l(v).

Definition 2. Symmetrization replaces a (v,u) edge with (u,v) if u = l(v).

Since G does not contain loop edges, these operations cannot create loops, otherwise it would contradict the minimum value. Thus u = l(v) implies v ≠ l(v). Label propagation is primarily responsible for path contraction. Minimum label updates are concomitant with label propagation. Symmetrization keeps an edge between a vertex and its minimum label. Thus for an edge (v,u) in one step there will be either (u,v) or (u,l(v)) in the next step. In Algorithm 1 each edge is replaced by one new edge, either by label propagation or symmetrization, so the edge count does not increase.
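A minimal sequential sketch of Algorithm 1 follows, assuming vertices are numbered 1..n and the input lists each undirected edge once with no self-loops. The function name `connected_components` is hypothetical. The stopping test (stop once both endpoints of every current edge carry the same label) is an assumption of this sketch, since the listing only says "until labels converge".

```python
def connected_components(n, undirected_edges):
    """Sketch of Algorithm 1 on vertices 1..n.

    Returns (labels, steps): labels[v - 1] is the label of vertex v and
    steps counts the executed steps until the labels agree on every edge.
    """
    # E_1 holds both twins of every undirected edge.
    edges = []
    for a, b in undirected_edges:
        edges.append((a, b))
        edges.append((b, a))
    # L_1[v] is the minimum over the closed neighborhood of v (index 0 unused).
    label = list(range(n + 1))
    for v, u in edges:
        label[u] = min(label[u], v)
    steps = 0
    # Assumed stopping rule: every edge joins two vertices with equal labels.
    while any(label[v] != label[u] for v, u in edges):
        new_label = label[:]              # set L_{k+1} := L_k
        new_edges = []
        for v, u in edges:
            lv = label[v]                 # always read from the old array L_k
            if u != lv:
                new_label[u] = min(new_label[u], lv)
                new_edges.append((u, lv))     # label propagation
            else:
                new_edges.append((u, v))      # symmetrization
        label, edges = new_label, new_edges
        steps += 1
    return label[1:], steps


# Assumed example: the path 1-3-4-2 plus a separate edge 5-6.
print(connected_components(6, [(1, 3), (3, 4), (4, 2), (5, 6)]))
# ([1, 1, 1, 1, 5, 5], 2)
```

The step count reported here is for label agreement. The edge multiset may need a further step before it forms a star, so it can be smaller than a count based on the final graph shape.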
The algorithm is illustrated in Figure 1. Notice the (2, 1) edge is duplicated in the second step. Also observe that the (4, 2) edge in the second step produces (2, 1) for the third step because L_2[4] = 1.

We remark that Algorithm 1 can be terminated a number of ways without affecting the asymptotic bounds. For example, we can detect when labels no longer change by simply keeping a counter for the label propagation branch. At each step the counter is set to zero. If the minimum label l(v) for a vertex v is not v itself, the counter is incremented. Now observe that in the final star graph, only the root of the star can fall into the label propagation branch but since the root is the minimum label, the counter cannot be updated. This simple check and update takes O(1) time for each label propagation operation and is therefore free. We use this approach in our implementation of Algorithm 1 described in Section 11.

In G an undirected edge {u,v} is composed of counter-oriented twins (v,u), (u,v), thus there are 2m edges in total and G is symmetric. As stated in the introduction, we are careful to create just one twin. Thus at each step the graph may be directed. By propagating single directed edges and alternating the direction of the edge with a minimum label, we are able to limit the work in each step to O(m) edges while also maintaining overall connectivity. Say for a (v,u) edge that u is the minimum for itself and v. This (v,u) becomes (u,v) by symmetrization and then (v,u) again by label propagation, cycling until the algorithm ends or a new minimum is acquired. If at some later step either endpoint gets a new minimum then it can propagate it to the other, thereby replacing their edge, which prohibits retaining an obsolete minimum label. In this case, u cannot be passed to v again because there will be no edge to v.

Claim 1. If (v,u) exists at some step, then v,u are connected for all remaining steps in Algorithm 1.

Proof. We will demonstrate that given (v,u) in a step then v,u remain connected in the next step because either (u,v) is created, or (u,l(v)), (l(v),v) are created, or (u,l(v)), (v,l(v)) are created.

First we establish that (u,v) is created as follows. If v = l(v) then (u,v) is created by label propagation. If u = l(v) then (u,v) is created by symmetrization. Otherwise (u,l(v)) is created by label propagation.

We now establish that if (u,l(v)) was created, then (l(v),v) or (v,l(v)) is created. If (v,l(v)) exists in the current step, then it is replaced with (l(v),v) by symmetrization. If (v,l(v)) does not exist in the current step then it must have existed in the previous step, otherwise v cannot get the minimum label l(v). Again following symmetrization there is an (l(v),v) in this current step, which is then replaced with (v,l(v)) by label propagation. Thus if (u,l(v)) replaces (v,u) then either (l(v),v) or (v,l(v)) is created. Hence v and u remain connected after one step. By inductively applying this to every new edge, it follows that a connection between v and u is preserved in all subsequent steps.

Lemma 1. Algorithm 1 finds the connected components of G.

Proof. Claim 1 establishes that if (v,u) exists at some step, then v,u are connected for all remaining steps. It then follows that a component remains connected at each step. Now let v be the minimum label of a component. Any (v,u) edge becomes (u,v) by label propagation and therefore l(u) gets v. Any (u,v) edge becomes (v,u) by symmetrization, and again l(u) must be v. Therefore every vertex in a component gets the same representative label for that component.

Lemma 2. Algorithm 1 creates 2m edges each step.

Proof. Any new edge is a replacement of an existing (v,u) edge as follows. If u ≠ l(v) then (v,u) is replaced with (u,l(v)) by label propagation. If u = l(v) then (v,u) is replaced with (u,v) by symmetrization. Since there are 2m edges in G then there are 2m edges in the first step and so on.
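Lemma 2 can be checked on a small trace. The sketch below runs three steps on the path 1-3-4-2, an assumed input chosen to resemble Figure 1, and records the edge multiset after each step. The helper name `step` and the variable names are hypothetical.

```python
def step(edges, label):
    """One step of Algorithm 1; returns (E_{k+1}, L_{k+1})."""
    new_label = dict(label)
    new_edges = []
    for v, u in edges:
        lv = label[v]
        if u != lv:
            new_label[u] = min(new_label[u], lv)
            new_edges.append((u, lv))     # label propagation
        else:
            new_edges.append((u, v))      # symmetrization
    return new_edges, new_label


# E_1 and L_1 for the path 1-3-4-2 (m = 3, so 2m = 6 directed edges).
edges = [(1, 3), (3, 1), (3, 4), (4, 3), (4, 2), (2, 4)]
label = {1: 1, 2: 2, 3: 1, 4: 2}
history = [edges]
for _ in range(3):
    edges, label = step(edges, label)
    history.append(edges)

for k, e in enumerate(history, start=1):
    print(k, len(e), e)
# 1 6 [(1, 3), (3, 1), (3, 4), (4, 3), (4, 2), (2, 4)]
# 2 6 [(3, 1), (1, 3), (4, 1), (3, 2), (2, 4), (4, 2)]
# 3 6 [(1, 3), (3, 1), (1, 4), (2, 1), (4, 2), (2, 1)]
# 4 6 [(3, 1), (1, 3), (4, 1), (1, 2), (2, 1), (1, 2)]
```

Every multiset has six edges, the third contains (2, 1) twice, and after three steps every edge touches vertex 1, so the graph has become a star rooted at the component label.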

This last result has significant practical benefit because it ensures that the read/write costs remain linear with respect to the edges at each step, otherwise on very large graphs these costs can be the bottleneck with respect to runtime performance.

We have shown that our algorithm is very simple and takes O(m) work each step, making it appealing for practical applications. This is demonstrated in Section 12 where our algorithm empirically converges in O(log n) steps on a variety of graphs. Despite the simplicity of our algorithm, the convergence in general is difficult to analyze. Hence we can only conjecture the following and leave the proof as a challenging open problem.

Conjecture 1. Algorithm 1 converges in O(log n) steps.

Instead, we can show the convergence on path graphs, which might provide insight to a more general proof. Observe that only minimum labels are propagated and a minimum for a vertex can only be replaced with a lesser label. At each step, minimum labels are exchanged between the endpoints of each (v,u) edge through label propagation, and symmetrization maintains the edge between a vertex and its minimum. Thus each length-two path is shortened by label propagation, and symmetrization ensures that a minimum label vertex and its subordinate can both get a new minimum from one or the other in a later step.

An interesting case is label convergence on a sequentially labeled path. On such a path, label convergence follows a Fibonacci sequence. A naïve label propagation algorithm will duplicate labels leading to a progressive increase in the number of edges. We give the following results for our algorithm. An interested reader can find the proofs in Appendix A.

Lemma 3. Algorithm 1 on a sequentially labeled path propagates labels in Fibonacci sequence, specifically at each step k the label difference Δ_k(v, l_k(v)) = v − l_k(v) follows Δ_k(v, l_k(v)) = F_k where F_k = F_{k−1} + F_{k−2}.

The label updates follow a Fibonacci sequence so the next statement holds.

Proposition 1. Algorithm 1 converges in log_φ n = O(log n) steps on a sequentially labeled path, where φ ≈ 1.618 is the Golden Ratio value.

It is intuitive that the convergence of Algorithm 1 on any path doesn't take asymptotically more steps than on a sequentially labeled path.

Claim 2. Algorithm 1 on any 4-path converges in at most three steps.

Proof. Observe that every vertex in a 4-path is within a distance of three from any other vertex in the path. Now recall that a vertex can only get a new minimum label by label propagation and label propagation contracts a length-two path. Let C(G) denote the minimum labeled vertex of the path, which is therefore the component label. It isn't difficult to see that if C(G) is not an endpoint of the 4-path then the algorithm converges to a star in two steps. This is because all vertices are within a distance of two from C(G) and will be connected to it in the first step. Then it takes one more step to break the non-tree edge from the endpoint that originally was not connected to C(G). If C(G) is an endpoint of the 4-path then the other endpoint does not get an edge to C(G) in the first step because it is a distance of three from C(G). But this other endpoint will be connected to some other vertex that is connected to C(G) and therefore gets C(G) in the next step. Since all other vertices get C(G) in the first step, the total number of steps for convergence is three. Thus the claim holds.

Claim 3. Given two stars rooted by their respective minimum labels and then connected by the roots, it takes Algorithm 1 at most three steps to converge.

Proof. Let L, R be the respective roots of the left and right stars, each being the minimum label for its star. Without loss of generality, let L be less than R. Now suppose the stars are connected by either an (L,R) or (R,L) edge. If it is an (R,L) edge, then symmetrization creates (L,R) and label propagation passes L to each leaf node of R. But also symmetrization can create (R,u) edges from any (u,R) edges. Then a subsequent step replaces these (R,u) with (u,L) edges to complete the final rooted star. This takes two steps in total. If it is an (L,R) edge, then label propagation creates (R,L). It then follows the steps for the previous case and hence takes three steps in total. Thus the claim holds.

Proposition 2. Algorithm 1 converges in O(log n) steps on a path graph.

Proof. Observe that an 8-path is just two 4-paths connected end-to-end. Let L, R be the smallest labels respectively for the two subpaths. One of these labels will be the component label. Suppose the component label is in the middle of the path, and therefore is the endpoint of one 4-path that is adjacent to the endpoint of the other 4-path. The algorithm will converge both 4-paths simultaneously in the same number of steps as a single independent 4-path. This is because minimum labels are stored externally in an array and it is the minimum label from this array that is propagated. Since both 4-paths have an endpoint whose minimum label is the component minimum, both converge in the same number of steps. But we are interested in the worst case. Suppose that L, R are at opposite ends of the path, one of which is the component label. The algorithm simultaneously converts each 4-path to a star rooted respectively at L and R, taking at most three steps by Claim 2. The graph remains connected in concordance with Claim 1, and because of label propagation there will be an edge connecting L and R. It follows from Claim 3 that it takes at most three more steps to get the final star.

Likewise, doubling an 8-path yields a 16-path. Then this takes at most three more steps to converge than the 8-path. A path of size n = 4 · 2^k can be generated by doubling k times. Each doubling takes at most three steps to merge connected stars according to Claim 3. Thus it takes at most 3k extra steps for all the intermediate merges for an n-path. It follows from n = 2^{k+2} that k ≤ log n. Then by Claim 2 it takes at most three steps to merge all concatenated 4-paths and 3 log n steps to merge intermediate stars. This leads to a total of 3 + 3 log n = O(log n) steps to converge.

For the remainder of this paper we will assume Conjecture 1 is true.
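The Fibonacci behavior stated in Lemma 3 can be observed with a small experiment on the sequentially labeled path 1-2-...-n. The sketch below tracks the gap n − L_k[n] at the last vertex for the first few steps. The function name `label_gaps_on_path` is hypothetical, and the indexing F_1 = 1, F_2 = 2 is an assumption about how the lemma numbers the sequence.

```python
def label_gaps_on_path(n, steps):
    """Return [n - L_k[n] for k = 1..steps] on the path 1-2-...-n."""
    # E_1: both twins of every path edge {v, v+1}.
    edges = []
    for v in range(1, n):
        edges += [(v, v + 1), (v + 1, v)]
    # L_1[1] = 1 and L_1[v] = v - 1 for v >= 2 (index 0 is unused).
    label = [0, 1] + list(range(1, n))
    gaps = [n - label[n]]
    for _ in range(steps - 1):
        new_label = label[:]
        new_edges = []
        for v, u in edges:
            lv = label[v]
            if u != lv:
                new_label[u] = min(new_label[u], lv)
                new_edges.append((u, lv))     # label propagation
            else:
                new_edges.append((u, v))      # symmetrization
        label, edges = new_label, new_edges
        gaps.append(n - label[n])
    return gaps


print(label_gaps_on_path(100, 6))  # [1, 2, 3, 5, 8, 13]
```

Because the gap grows like a Fibonacci number, it reaches n − 1 after roughly log_φ n steps, which is the count given in Proposition 1.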

6 PRAM algorithm

Our Algorithm 1 maps naturally to a PRAM. We will use the following semantics for our parallel algorithms. All statements are executed sequentially from top to bottom, but all operations contained within a "for all" construct are performed concurrently. All other statements outside this construct are sequential. Recall that in a synchronous PRAM all processors perform instructions simultaneously and each instruction takes unit time. We use a Combining CRCW PRAM to ensure the correct minimum label is written in O(1) time [40].

Theorem 1. Algorithm 1 finds the connected components of G in O(log n) time and O(m log n) work using O(m) processors on a Combining CRCW PRAM.

Proof. For each edge it performs either label propagation or symmetrization by Definitions 1 and 2. Each operation reads an edge from memory and overwrites it with a new edge. On a CRCW this takes O(1) time for each edge. Computing the minimum on each edge also takes O(1) time using Combining Write Resolution. In each step there are O(m) edges according to Lemma 2, and it takes O(log n) steps to converge due to Conjecture 1. Therefore it takes O(log n) time and O(m log n) work given O(m) processors.

An EREW algorithm follows directly from Algorithm 1 because it is well known that a CRCW algorithm can be simulated on an EREW with a logarithmic-factor slowdown [25]. The only read/write conflict in Algorithm 1 is in the minimum label update. Here the minimum reduction is not required to take constant time. Thus for a p-processor EREW, reading L_k[v] takes O(log p) time by broadcasting the value in binary tree order to each processor attempting to read it. Likewise, a minimum value can be found in O(log p) time using a binary tree layout that halves the number of comparisons each step.¹ This immediately proves Theorem 2.

Theorem 2. Algorithm 1 finds the connected components of G in O(log^2 n) time and O(m log^2 n) work using O(m) processors on an EREW PRAM.
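The binary-tree minimum used in the EREW argument can be illustrated with a small sequential sketch. The function name `tree_min` and the handling of an odd element by carrying it forward are assumptions made for this illustration; each pass of the loop stands for one parallel step in which disjoint pairs are compared, so no cell is read or written by two processors at once.

```python
def tree_min(values):
    """Pairwise (binary tree) minimum; returns (minimum, number of rounds).

    Each round models one parallel step: position i takes the minimum of
    one disjoint pair, so the number of active positions halves per round.
    """
    level = list(values)
    rounds = 0
    while len(level) > 1:
        # Compare disjoint neighbouring pairs (exclusive reads and writes).
        nxt = [min(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            # An unpaired last element is carried to the next round (assumption).
            nxt.append(level[-1])
        level = nxt
        rounds += 1
    return level[0], rounds


print(tree_min([7, 3, 9, 1, 4]))  # (1, 3)
```

Five values need three rounds, which matches ⌈log2 5⌉ = 3.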

7 Extending to other models

The Stream and MapReduce models restrict globally-shared state, so the minimum label for each vertex must be carried with the graph at each step. Recall that in Algorithm 1 there may not be an explicit (v,l(v)) edge in a step, but all minimum labels are kept in the global L_k array. So given a (v,u) edge but no explicit (v,l(v)) edge, we can still apply label propagation and produce (u,l(v)). Otherwise it would create some (u,w) edge where w is the minimum of the current set of neighbors but may not be the true minimum for v. This would cause the algorithm to fail, which is easily demonstrated on the graph in Figure 1. Moreover, in MapReduce the map and reduce functions are sequential, so a giant component that is processed by one task will serialize the entire algorithm. We address these limitations by slightly altering the label propagation and symmetrization operations.

Definition 3. Label propagation replaces a (v,u) edge with (u,l(v)) if u, v ≠ l(v).

Definition 4. Symmetrization replaces a (v,u) edge with (u,v), (v,u) if u = l(v).

Label propagation and symmetrization will now only proceed from vertices v ≠ l(v) to mitigate sequential processing of a giant component, thus skipping over intermediate representative labels. As before, u = l(v) implies v ≠ l(v), otherwise it would contradict the minimum function. Now symmetrization adds both edges, (l(v),v) and (v,l(v)), so that v is always paired with its minimum label in the absence of random access to global memory. We also remark that symmetrization must be this way when ignoring v where v ≠ l(v) because the vertices that would have created the edge (v,l(v)) are now ignored.

These minor changes do not invalidate the correctness or convergence established by Algorithm 1 because the same edges are created but with some added duplicates. The primary difference is that both (v,l(v)) and (l(v),v) edges are created in the same step rather than strided across two consecutive steps. But this incurs label duplication that can lead to a progressive increase in edges if left unchecked.

8 Label duplication

The new label propagation and symmetrization for Stream and MapReduce can lead to O(log n) factor inefficiency as a result of increased label duplication, especially on sequentially labeled path or tree graphs. This leads to the following crucial observation. Observation 1. Adding counter-oriented edges in the symmetrization step of Algorithm 1 will pair each v with every new l(v) it gets for three steps on a sequentially labeled path graph.

Proof. Recall from Lemma 3 that v − l_k(v) = F_k where F_k = F_{k−1} + F_{k−2} is the Fibonacci recurrence. Since symmetrization retains each l(v) for the next step, v gets its kth label l_k(v) = v − F_k for the next three steps because of the recurrence of F_k. Thus any new l(v) that v receives will return to v for a total of three steps unless l(v) is the minimum label for the component of v.

¹ Given an array M of n values and p = ⌈n/2⌉ processors, at each step M[i] ← min(M[2i − 1], M[2i]) for 1 ≤ i ≤ p, where p is halved after each step.

Algorithm 2

initialize E_1 with sorted E and set lastv := ∞
for k = 1, 2, ... until labels converge do
    for (v,u) ∈ E_k do
        if v ≠ lastv then
            set l(v) := min(v,u) and lastv := v
        if u = l(v) then
            add ⟨(v,u),NEW⟩ to E'_{k+1} and add ⟨(u,v),NEW⟩    ⊲ Symmetrization
        else if v ≠ l(v) then
            add ⟨(u,l(v)),NEW⟩    ⊲ Label Propagation
            add ⟨(v,u),OLD⟩ to E'_{k+1}
    sort E'_{k+1}    ⊲ NEW and OLD edges get sorted together for each (v,u)
    if ⟨(v,u),NEW⟩ ∈ E'_{k+1} but ⟨(v,u),OLD⟩ ∉ E'_{k+1} then
        add (v,u) to E_{k+1}

Once an l(v) is replaced with an updated minimum for v, it is no longer needed and only adds to the edge count. Symmetrization by Definition 4 retains (v,l(v)), so when v gets l(l(v)) from its current l(v), then in the next step l(l(v)) will propagate back to l(v). Since each vertex in a chain is a minimum label to vertices up the chain, each vertex will in turn be back-propagated down the chain. This follows a Fibonacci sequence, hence the duplication of labels grows rapidly. For example, we can see from Lemma 3 that vertex 2 in a chain will be the minimum label for the 3, 4, 5, 7, 10, 15, 23, ... vertices in sequence, and each of these vertices will return vertex 2 back to the neighbor from which it was received. Moreover, as seen in Observation 1, each new l(v) is retained by v for three steps. Relabeling the graph can avoid the pathological duplication, but a robust algorithm is more desirable, especially in graphs that may contain a very long chain.

Suppose now a (u,l(v)) edge is added to E_{k+1} only if that edge is not currently in E_k. That is, if l_k(v) ∉ N_k(u) then l_k(v) can be added to N_{k+1}(u). Since Definitions 3 and 4 are applied in models with limited random access and global memory, we leverage sorting to identify the next minimum label for each vertex and also remove labels that would otherwise fail this membership test. Let E'_{k+1} be the intermediate edges that are created during the kth step and from which a subset are retained for the (k+1)th step. Sorting edges in E_k and E'_{k+1} will identify those edges that are duplicated across both steps and therefore should be removed. But an edge that is duplicated only in E'_{k+1} must be retained for proper label propagation. To avoid inadvertently removing such an edge, all duplicates in E'_{k+1} are first removed before merging and sorting with the E_k edges. After removing duplicates, the edges from E_k are also removed because these were only needed for the membership test. We apply this in our next algorithms.
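The sequence of vertices quoted above for vertex 2 can be reproduced from the label differences of Lemma 3. The sketch below assumes the differences start 1, 2, 3, 5, 8, ..., an indexing chosen here only because it matches the listed vertices; the helper names are hypothetical.

```python
def label_offsets(count):
    """First `count` label differences v - l_k(v): 1, 2, 3, 5, 8, ...

    The starting values 1, 2 are an assumption chosen to match the
    sequence quoted in the text.
    """
    offsets = [1, 2]
    while len(offsets) < count:
        offsets.append(offsets[-1] + offsets[-2])
    return offsets[:count]


def vertices_labeled_by(label, count):
    """Vertices of a sequentially labeled chain that adopt `label` in turn."""
    return [label + offset for offset in label_offsets(count)]


print(vertices_labeled_by(2, 7))  # [3, 4, 5, 7, 10, 15, 23]
```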

9 Stream-Sort algorithm

Our Stream-Sort algorithm in Algorithm 2 extends Algorithm 1 as described in Section 7 and removes duplicates by sorting in the manner described at the end of Section 8. It requires two stages per iteration step. The first stage performs label propagation and symmetrization and also returns the input edges. The second stage eliminates duplicates. A one-stage algorithm [9] was described in 2016, which is simpler to implement but does not address label duplication.

Recall that Algorithm 1 does not return the current edges but creates new edges by symmetrization and label propagation. But in Stream-Sort we must return the current edges temporarily in the intermediate sorting stage in order to ignore duplicate edges. Hence we mark the edges to distinguish old from new. Here a NEW edge resulted from either symmetrization or label propagation, and an OLD edge is a current input edge. Label propagation and symmetrization follow Definitions 3 and 4.

In the first stage Algorithm 2 reads sorted edges, hence l(v) = min(v,u) from the first edge of v. If u = l(v) for this first edge then symmetrization adds ⟨(u,v),NEW⟩ and ⟨(v,u),NEW⟩ to an intermediate stream E'_{k+1}, and for each remaining edge a ⟨(u,l(v)),NEW⟩ is added to E'_{k+1} by label propagation, and ⟨(v,u),OLD⟩ is also added. Note that if u = l(v) in the first edge of v, all remaining u cannot be l(v) due to sorting, and thus u, v ≠ l(v) for the remaining edges of v, so label propagation can proceed. The intermediate stream E'_{k+1} is sorted by a sorting pass and input to the second stage.

In the second stage the edges are sorted so all NEW and OLD versions of (v,u) are grouped together. Then any edge without an OLD member is added to a new output stream E_{k+1}, which will be the input stream in the next pass. Both intra- and inter-step duplicates have been removed. The algorithm repeats this procedure until no new minimum can be adopted.

Algorithm 3

procedure Reduce-1(key = v, values = N_k(v))    ⊲ sorted values
    set l(v) := min(N_k^+(v))
    if v ≠ l(v) then
        emit ⟨v,(l(v),NEW)⟩ and ⟨l(v),(v,NEW)⟩    ⊲ Symmetrization
        for u ∈ values : u ≠ l(v) do
            emit ⟨u,(l(v),NEW)⟩    ⊲ Label Propagation
            emit ⟨v,(u,OLD)⟩
procedure Reduce-2(key = v, values = {(u,i) : i ∈ {OLD,NEW}})    ⊲ sorted values
    if ⟨(v,u),NEW⟩ ∈ values but ⟨(v,u),OLD⟩ ∉ values then
        emit ⟨v,u⟩
for k = 1, 2, ... until labels converge do
    MAP ↦ Identity
    Reduce-1
    MAP ↦ Identity
    Reduce-2

Theorem 3. Algorithm 2 finds the connected components of G in O(log n) passes and O(log n) memory in Stream-Sort.

Proof. Algorithm 2 achieves this as follows. The input stream E_k is scanned in one pass and the intermediate output stream E'_{k+1} is subsequently sorted in a single sorting pass. There are then a constant number of sorting passes each step, which are essentially free. Label propagation and symmetrization extend those in Algorithm 1 but create more duplicate edges. These duplicates are removed in the sorting pass. Thus correctness follows from Lemma 1, there are O(m) edges each step due to Lemma 2, and it takes O(log n) steps to converge due to Conjecture 1. The input to each pass is O(m), thus satisfying the constraint of the Stream-Sort model. Sorting requires only O(log n) bits of memory to compare vertices and labels. Overall it takes O(log n) passes and O(log n) memory. If Conjecture 1 holds, this would be the first efficient, deterministic connected components algorithm in Stream-Sort.
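One iteration step of the two-stage procedure can be sketched on an in-memory list standing in for the stream. This is an illustrative reading of Algorithm 2 under stated assumptions, not the author's implementation: the function name `stream_sort_step` is hypothetical, Python's built-in sort stands in for the sorting pass, and the OLD copy is emitted only in the label propagation branch, as the prose above describes.

```python
from itertools import groupby

NEW, OLD = "NEW", "OLD"


def stream_sort_step(sorted_edges):
    """Apply one step (scan stage, sort, duplicate removal) to sorted (v, u) edges."""
    intermediate = []
    last_v, label = None, None
    for v, u in sorted_edges:
        if v != last_v:
            # The first edge of v in sorted order carries its smallest neighbor.
            label, last_v = min(v, u), v
        if u == label:
            # Symmetrization (Definition 4): keep v paired with l(v) both ways.
            intermediate.append((v, u, NEW))
            intermediate.append((u, v, NEW))
        elif v != label:
            # Label propagation (Definition 3), plus the current edge marked OLD.
            intermediate.append((u, label, NEW))
            intermediate.append((v, u, OLD))
    intermediate.sort()  # stands in for the sorting pass
    output = []
    for edge, group in groupby(intermediate, key=lambda t: (t[0], t[1])):
        # Keep an edge only if no OLD copy exists; duplicates collapse to one.
        if all(tag == NEW for _, _, tag in group):
            output.append(edge)
    return output


path = [(1, 2), (2, 1), (2, 3), (3, 2)]   # symmetrized path 1-2-3
step1 = stream_sort_step(path)
step2 = stream_sort_step(step1)
print(step1)  # [(1, 2), (2, 1), (3, 1), (3, 2)]
print(step2)  # [(1, 2), (1, 3), (2, 1), (3, 1)]
```

On this three-vertex path the second step already yields a star rooted at vertex 1, and applying the step again leaves the edge list unchanged.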

10 MapReduce algorithm

Our MapReduce algorithm described in Algorithm 3 is similar to Algorithm 2, managing duplicates as described at the end of Section 8. It takes two rounds per iteration step, the first to perform label propagation and symmetrization and the second to remove duplicates. A single-round algorithm [9] with less efficient communication is also available to the interested reader.

The values for each key are sorted, hence intra-step duplicates are adjacent and easily removed, permitting the algorithm to maintain O(1) working memory. Since the values are sorted, l(v) is simply the lesser of the key and the first value. We omit the specifics on sorting and getting l(v) for brevity. Label propagation and symmetrization follow Definitions 3 and 4. The first round returns label propagation or symmetrization edges as NEW, and current edges as OLD, again to assist in removing duplicates. If v ≠ l(v) then there is a u = l(v), thus symmetrization emits ⟨v,(l(v),NEW)⟩ and ⟨l(v),(v,NEW)⟩, and for each u ≠ l(v) a ⟨u,(l(v),NEW)⟩ is added by label propagation, and ⟨v,(u,OLD)⟩ is also added. This skips any local minimum or the component minimum, which could otherwise result in a large span of sequential processing. The second round accumulates these edges, and every edge without an OLD member is returned with no markings. The two rounds are repeated until labels converge. Since duplicates are removed, the total communication cost is O(m) per round.

Theorem 4. Algorithm 3 finds the connected components of G in O(log n) rounds using O(m log n) communication overall in MapReduce.

Proof. Label propagation and symmetrization extend those in Algorithm 1 but create more duplicate edges. These duplicates are removed in each iteration step, thus correctness follows from Lemma 1. There are O(log n) steps due to Conjecture 1, and two rounds per step for a total of 2 log n = O(log n) rounds. The communication cost is proportional to the number of edges written after all rounds. Both inter- and intra-step duplicates are removed in each iteration step, so the total number of edges after the second round of each step is O(m) following Lemma 2. Thus for O(log n) rounds the overall communication is O(m log n) as claimed.

This is comparable in runtime and communication to [26], but differs in that it is a deterministic MapReduce algorithm for connected components.
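The two reduce procedures of Algorithm 3 can be simulated in a single process to see how the rounds fit together. The sketch below is an assumption-laden illustration rather than a MapReduce program: `shuffle` is a hypothetical stand-in for the framework's group-by-key, the identity map is omitted, and "until labels converge" is approximated by stopping when the edge set no longer changes.

```python
from collections import defaultdict

NEW, OLD = "NEW", "OLD"


def reduce_1(v, neighbors):
    """Round 1 for key v with sorted neighbor values (Definitions 3 and 4)."""
    label = min(v, neighbors[0])  # lesser of the key and the first sorted value
    out = []
    if v != label:
        out.append((v, (label, NEW)))       # symmetrization
        out.append((label, (v, NEW)))
        for u in neighbors:
            if u != label:
                out.append((u, (label, NEW)))   # label propagation
                out.append((v, (u, OLD)))       # current edge, marked OLD
    return out


def reduce_2(v, tagged):
    """Round 2 for key v: emit edges that have a NEW copy but no OLD copy."""
    new = {u for u, tag in tagged if tag == NEW}
    old = {u for u, tag in tagged if tag == OLD}
    return [(v, u) for u in sorted(new - old)]


def shuffle(pairs):
    """Group (key, value) pairs by key, as the framework would between rounds."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def connected_components(edges):
    """Return {vertex: component label} for an undirected edge list."""
    current = set(edges) | {(u, v) for v, u in edges}
    while True:
        round1 = [pair for v, nbrs in shuffle(current).items()
                  for pair in reduce_1(v, sorted(nbrs))]
        nxt = {edge for v, vals in shuffle(round1).items()
               for edge in reduce_2(v, vals)}
        if nxt == current:  # assumed stopping rule: edge set is a fixed point
            break
        current = nxt
    labels = {}
    for v, u in current:
        labels[v] = min(labels.get(v, v), u)
    return labels


print(connected_components([(1, 2), (2, 3), (3, 4), (7, 9)]))
```

For the path 1-2-3-4 plus the separate edge 7-9, every vertex of the path ends with label 1 and vertices 7 and 9 end with label 7.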

11 Implementation

We give some basic empirical results for our principal algorithm described in Algorithm 1. First we briefly describe our parallel implementation. We implemented the algorithm in C++, and for write conflicts we used atomic operations. There is a write conflict in updating the minimum label for each vertex. For example, a vertex u can receive many l(v) labels from its v neighbors, but not necessarily at once. It is possible that l(u) is replaced by some l(v), but an l(v) that is a lesser label could be received later and therefore must overwrite it. We use the “compare and exchange” atomic operation to update the minimum label. But this atomic does not test relational conditions; instead an exchange is made if the values being compared differ. Thus to atomically update a minimum value, the compare and exchange result must be repeatedly tested. Once a thread succeeds with its exchange, it must test if its original data is still less than the updated value, which is accomplished by a simple loop construct. Since the criterion for updating is just a difference in value, eventually the thread with the minimum value will succeed and all other threads will test out. The winning thread itself will also test out because its value will be equal to the updated value.

Our implementation must detect when labels no longer change. We keep a counter for the label propagation branch to identify when label propagation no longer updates minimum labels. At each step the counter is set to zero. If the minimum label l(v) for a vertex v is not v itself, the counter is incremented. In the final star graph only the root of the star can fall into the label propagation branch, but since the root is the minimum label, the counter cannot be updated. Specifically, in the loop over (v,u) edges, we carry out label propagation if u is not l(v), otherwise symmetrization is performed. We update the counter in the label propagation branch if l(v) ≠ v is true. The algorithm halts if the counter value is zero at the end of a step. Effectively this updates the counter until each vertex is a leaf node of a star. This is because any v that is the root of a star has l(v) equal to v, hence it must fall into the label propagation branch and then fail to update the counter.

Finally, at each step k the threads must synchronize to add new edges to the work list. This work list represents E_{k+1}. The work per step is exactly 2m by Lemma 2. We set the data buffer for the work list to be double the size of the edge count, specifically 4m. Then half the buffer is used to read current work and the other half to write new work, where the two halves swap their read/write roles every step (step number modulo 2). Each thread keeps a local buffer (an array) in which it writes new edges produced by label propagation and symmetrization. When a thread is finished processing its fraction of the step k work list, it reserves space for adding its new edges to the k + 1 half of the work list. The thread uses an atomic “fetch and add” to reserve its block of space, since it knows how many edges it must write. Then the thread can concurrently write its data without conflict.
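The atomic minimum update can be illustrated with the usual compare-and-exchange retry loop. This is a hedged sketch, not the author's C++ code: Python has no hardware compare-and-exchange, so the hypothetical `AtomicCell` class emulates one with a lock, and the loop shown is the common formulation (retry only while the candidate is still smaller), which may differ in detail from the loop described above.

```python
import threading


class AtomicCell:
    """Lock-based stand-in for an atomic integer (assumption for illustration)."""

    def __init__(self, value):
        self._value = value
        self._lock = threading.Lock()

    def load(self):
        with self._lock:
            return self._value

    def compare_exchange(self, expected, desired):
        """Store `desired` only if the cell equals `expected`.

        Returns (success, observed value before the call).
        """
        with self._lock:
            observed = self._value
            if observed == expected:
                self._value = desired
                return True, observed
            return False, observed


def atomic_min(cell, candidate):
    """Lower the cell to `candidate` if it is smaller; return True on update."""
    current = cell.load()
    while candidate < current:
        ok, current = cell.compare_exchange(current, candidate)
        if ok:
            return True
        # Another thread changed the cell; `current` now holds the fresh
        # value and the loop condition re-tests whether we are still smaller.
    return False


label = AtomicCell(100)
workers = [threading.Thread(target=atomic_min, args=(label, x))
           for x in (42, 7, 63, 19, 7, 88)]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(label.load())  # 7
```

Whatever the interleaving, a thread stops either because its exchange succeeded or because the stored value is already no larger than its own, so the cell ends at the smallest candidate.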

12 Experiments

We ran our parallel implementation of Algorithm 1 on a workstation with 28 Intel Xeon E5-2680 cores and 256 GB of RAM. Our experimental results are given in Table 2.

Table 2: Experimental results.

Graph                 components  n (vertices)  m (edges)      D (diameter)  steps  seconds
seqpath20             1           1,048,576     1,048,575      1,048,575     31     0.23
seqpath22             1           4,194,304     4,194,303      4,194,303     34     1.0
seqpath24             1           16,777,216    16,777,215     16,777,215    37     4.1
seqpath26             1           67,108,864    67,108,863     67,108,863    40     8.5
roadNet-PA            206         1,088,092     1,541,898      787           17     0.17
roadNet-TX            424         1,379,917     1,921,660      1057          18     0.21
roadNet-CA            2638        1,965,206     2,766,607      854           17     0.29
road-USA              1           23,947,347    28,854,312     6809          22     2.30
grid-0to262144by16    1           4,194,305     8,388,592      32            18     0.61
grid-0to262144by32    1           8,388,609     16,777,184     64            28     1.7
grid-0to262144by64    1           16,777,217    33,554,368     128           28     3.3
grid-0to262144by128   1           33,554,433    67,108,736     256           28     7.7
grid-0to262144by256   1           67,108,865    134,217,472    512           28     13
grid-0to1048576by20   1           20,971,521    41,943,020     40            22     3
grid-0to4194304by22   1           92,274,689    184,549,354    44            24     13
grid-0to16777216by24  1           402,653,185   805,306,344    48            26     58
web-Stanford          365         281,903       1,992,636      674           16     0.13
web-BerkStan          676         685,230       6,649,470      514           16     0.25
com-Youtube           1           1,134,890     2,987,624      20            8      0.14
com-LiveJournal       1           3,997,962     34,681,189     17            8      0.82
com-Orkut             1           3,072,441     117,185,083    9             6      1.9
com-Friendster        1           65,608,366    1,806,067,360  32            9      50

The first column of Table 2 lists the graphs used in our experiments. All graphs were unweighted, without self-loops, and symmetrized with vertices labeled from 0 to n − 1, and are therefore simple, undirected graphs with a total of 2m edges.

The first four graphs are large, sequentially labeled path graphs. Our naming convention for these uses a suffix that denotes the number of vertices by the base-2 exponent. Hence the graph named seqpath20 is a path of n = 2^20 vertices labeled in order from 0..n − 1. It is interesting to note that the convergence on these graphs follows the prediction asserted in Proposition 1. Using log_φ n with φ = 1.618, the predicted number of steps for n = 2^20..2^26 is 29, 32, 35, 37. Our implementation takes one extra step to test for completion and so it closely matches the prediction.

The next four graphs are road networks with relatively large diameters. The first three road networks are from the Stanford Network Analysis Project (SNAP) [27]. The fourth road network is the USA road network from the 9th DIMACS Challenge [16].

The next eight graphs are based on the example in Appendix I of Andoni et al. [2], which was devised to be difficult for fast convergence in graph connectivity. Each of these graphs is an r × c grid labeled sequentially in row-major order from 1..rc but with a bridge node, the zero label vertex, connecting to the first vertex of each row. Hence n = rc + 1, m = (2r − 1)c, and the diameter is D = 2c. We use the naming convention of grid-0to[rows]by[columns]. The number of rows is fixed to r = 262144 for the first five of these graphs. For the last three such graphs the number of rows is r = 2^c and thus grows exponentially with respect to the diameter. Our algorithm converges linearly with respect to the diameter on these last three graphs.

The remaining graphs in the table are from SNAP [27]. These graphs have small diameter, which is expected for real-world networks. Our algorithm converges in O(D) steps on these graphs.

It is evident from the number of steps in the sixth column of Table 2 that our algorithm converges rapidly on these graphs, tending towards O(log n) convergence as we conjectured. Observe that it takes fewer steps to converge than the diameter in each of the graphs we tested, demonstrating a real practical benefit. Moreover, the convergence rate is independent of the number of processors.
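The size formulas quoted for the grid graphs can be checked against the rows of Table 2. The helper below is a hypothetical convenience written for this check; it only evaluates n = rc + 1, m = (2r − 1)c and D = 2c.

```python
def grid_stats(rows, cols):
    """Vertex count, edge count and diameter of the bridged r x c grid."""
    n = rows * cols + 1           # grid vertices plus the zero-label bridge node
    m = (2 * rows - 1) * cols     # grid edges plus one bridge edge per row
    d = 2 * cols
    return n, m, d


print(grid_stats(262144, 16))     # (4194305, 8388592, 32)
print(grid_stats(2 ** 20, 20))    # (20971521, 41943020, 40)
print(grid_stats(2 ** 24, 24))    # (402653185, 805306344, 48)
```

These values agree with the entries for grid-0to262144by16, grid-0to1048576by20 and grid-0to16777216by24.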
Our implementation leaves room for improvement when its runtime is compared to state-of-the-art implementations [17]. Given the simplicity and fast convergence of our algorithm, we believe the runtime performance can be significantly improved. The fast convergence, simplicity, and extensibility to other computational paradigms make our algorithm appealing in practice.

Acknowledgments

The author thanks David G. Harris and Christopher H. Long for their helpful comments. The author also thanks the private reviewers.

References

[1] G. Aggarwal, M. Datar, S. Rajagopalan, and M. Ruhl. On the streaming model augmented with a sorting primitive. In Proceedings of the IEEE Symposium on Foundations of Computer Science, FOCS’04, pages 540–549, 2004.
[2] A. Andoni, C. Stein, Z. Song, Z. Wang, and P. Zhong. Parallel graph connectivity in log diameter rounds. In Proceedings of the 59th Annual IEEE Symposium on Foundations of Computer Science, 2018.
[3] B. Awerbuch and Y. Shiloach. New connectivity and MSF algorithms for shuffle-exchange network and PRAM. IEEE Transactions on Computers, C-36:1258–1263, 1987.
[4] D. A. Bader and G. Cong. A fast, parallel spanning tree algorithm for symmetric multiprocessors. In Proceedings of the International Parallel and Distributed Processing Symposium, IPDPS’04, 2004.
[5] P. Beame, P. Koutris, and D. Suciu. Communication steps for parallel query processing. In Proceedings of the 32nd ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems, PODS’13, pages 1–12, 2013.
[6] P. Beame, P. Koutris, and D. Suciu. Communication steps for parallel query processing. Journal of the ACM, 64(6):1–58, 2017.
[7] S. Behnezhad, L. Dhulipala, H. Esfandiari, J. Lacki, and V. Mirrokni. Near-optimal massively parallel graph connectivity. In 2019 IEEE 60th Annual Symposium on Foundations of Computer Science (FOCS), FOCS’19, pages 1615–1636, 2019. doi: 10.1109/FOCS.2019.00095.
[8] P. Burkhardt. Improved connectivity algorithm. Unpublished, October 19, 2019.
[9] P. Burkhardt. Easy graph algorithms under hard constraints. Technical report, U.S. National Security Agency, May 2016. Keynote at the 2016 Seventh Annual Graph Exploitation Symposium (GraphEx).
[10] D. K. Campbell. A survey of models of parallel computation. Technical report, University of York Department of Computer Science, March 1997.
[11] K. W. Chong, Y. Han, and T. W. Lam. Concurrent threads and optimal parallel minimum spanning trees algorithm. Journal of the ACM, 48(2):297–323, 2001.
[12] S. Chung and A. Condon. Parallel implementation of Boruvka’s minimum spanning tree algorithm. In Proceedings of International Conference on Parallel Processing, ICPP ’96, pages 302–308, 1996.
[13] R. Cole and U. Vishkin. Approximate parallel scheduling. Part II: Applications to logarithmic-time optimal graph algorithms. Information and Computation, 92:1–47, 1991.
[14] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. In Proceedings of the 6th conference on symposium on operating systems design and implementation, OSDI ’04, pages 137–150, 2004.
[15] C. Demetrescu, I. Finocchi, and A. Ribichini. Trading off space for passes in graph streaming problems. ACM Transactions on Algorithms, 6(1):6:1–6:17, 2009. doi: 10.1145/1644015.1644021.
[16] C. Demetrescu, A. Goldberg, and D. Johnson. 9th DIMACS implementation challenge: Shortest paths. http://www.diag.uniroma1.it//challenge9/, 2019.
[17] L. Dhulipala, C. Hong, and J. Shun. ConnectIt: A framework for static and incremental parallel graph connectivity algorithms. Proceedings of the VLDB, 14(4):653–667, 2020.
[18] S. Fortune and J. Wyllie. Parallelism in random access machines. In Proceedings of the ACM Symposium on Theory of Computing, STOC’78, pages 114–118, 1978.
[19] S. Goddard, S. Kumar, and J. F. Prins. Connected components algorithms for mesh-connected parallel computers. In Parallel Algorithms: 3rd DIMACS Implementation Challenge, 1995.
[20] M. T. Goodrich, N. Sitchinava, and Q. Zhang. Sorting, searching, and simulation in the MapReduce framework. In Proceedings of International Symposium on Algorithms and Computation, ISAAC’11, pages 374–383, 2011.
[21] M. R. Henzinger, P. Raghavan, and S. Rajagopalan. Computing on data streams. Technical Report 1998-011, DEC Systems Research Center, 1998.
[22] K. Iwama and Y. Kambayashi. A simpler parallel algorithm for graph connectivity. Journal of Algorithms, 16(2):190–217, 1994.
[23] D. R. Karger, N. Nisan, and M. Parnas. Fast connected components algorithms for the EREW PRAM. SIAM Journal on Computing, 28(3):1021–1034, 1999.
[24] H. Karloff, S. Suri, and S. Vassilvitskii. A model of computation for MapReduce. In Proceedings of the 21st annual ACM-SIAM symposium on discrete algorithms, SODA ’10, pages 938–948, 2010.
[25] R. M. Karp and V. Ramachandran. Parallel algorithms for shared-memory machines. In J. van Leeuwen, editor, Handbook of Theoretical Computer Science, volume A, pages 869–932. MIT Press, Cambridge, MA, USA, 1990.
[26] R. Kiveris, S. Lattanzi, V. Mirrokni, V. Rastogi, and S. Vassilvitskii. Connected components in MapReduce and Beyond. In Proceedings of the ACM Symposium on Cloud Computing, SoCC’14, pages 1–13, 2014.
[27] J. Leskovec and A. Krevl. SNAP Datasets: Stanford large network dataset collection. http://snap.stanford.edu/data, June 2014.
[28] R. Li, H. Hu, H. Li, Y. Wu, and J. Yang. MapReduce parallel programming model: A state-of-the-art survey. International Journal of Parallel Programming, 44(4):832–866, 2016.
[29] S. C. Liu and R. E. Tarjan. Simple concurrent labeling algorithms for connected components. In Symposium on Simplicity of Algorithms, SOSA’19, 2019.
[30] S. C. Liu, R. E. Tarjan, and P. Zhong. Connected components on a PRAM in log diameter time. In Symposium on Parallelism in Algorithms and Architectures, SPAA’20, 2020.
[31] A. McGregor. Graph stream algorithms: a survey. ACM SIGMOD Record, 43(1):9–20, 2014.
[32] J. I. Munro and M. S. Paterson. Selection and sorting with limited storage. Theoretical Computer Science, 12(3):315–323, 1980.
[33] T. C. O’Connell. A survey of graph algorithms under extended streaming models of computation. In Fundamental Problems in Computing, pages 455–476. Springer, 2009.
[34] M. M. A. Patwary, P. Refsnes, and F. Manne. Multi-core spanning forest algorithms using the disjoint-set data structure. In Proceedings of the International Parallel and Distributed Processing Symposium, IPDPS’12, pages 827–835, 2012.
[35] A. Pietracaprina, G. Pucci, M. Riondato, F. Silvestri, and E. Upfal. Space-round tradeoffs for MapReduce computations. In Proceedings of the 2012 ACM international conference on supercomputing, SC ’12, pages 235–244, 2012.
[36] V. Rastogi, A. Machanavajjhala, L. Chitnis, and A. D. Sarma. Finding connected components in map-reduce in logarithmic rounds. In Proceedings of the 2013 IEEE international conference on data engineering, ICDE’13, pages 50–61, 2013.
[37] A. Rosenfeld and J. L. Pfaltz. Sequential operations in digital picture processing. Journal of the ACM, 13(4):471–494, 1966.
[38] M. Ruhl. Efficient Algorithms for New Computational Models. PhD thesis, Massachusetts Institute of Technology, Cambridge, MA, USA, September 2003.
[39] H. Samet and M. Tamminen. Efficient component labeling of images of arbitrary dimension represented by linear bintrees. IEEE Transactions on Pattern Analysis and Machine Intelligence, 10(4):579–586, 1988.
[40] B. Schmidt, J. Gonzalez-Dominguez, C. Hundt, and M. Schlarb. Parallel Programming: Concepts and practice. Morgan Kaufmann, 1st edition, 2017.
[41] L. G. Shapiro. Connected component labeling and adjacency graph construction. Machine Intelligence and Pattern Recognition, 19:1–30, 1996.
[42] Y. Shiloach and U. Vishkin. An O(log n) parallel connectivity algorithm. Journal of Algorithms, 3:57–67, 1982.

A Sequentially labeled path

Lemma 3. Algorithm 1 on a sequentially labeled path propagates labels in Fibonacci sequence; specifically, at each step k the label difference ∆_k(v,l_k(v)) = v − l_k(v) follows ∆_k(v,l_k(v)) = F_k where F_k = F_{k−1} + F_{k−2}.

Proof. We will prove ∆k(v,lk(v)) = Fk by induction but first we begin with some preliminaries. Let Fk =1, 1, 2, 3, 5, 8,... be the Fibonacci sequence over k =0, 1, 2, 3, 4,...,n and thus Fk = Fk−1 + Fk−2. Let step k = 0 be the initial state before the algorithm begins. Observe that a sequentially labeled path is a non-decreasing sequence, thus any subpath has the ordering l(u)

same as ∆_1(v,l_1(v)), leading to ∆_1(v,l_1(v)) = ∆_0(u,l_0(u)) + ∆_0(v,u). We can relabel ∆_0(u,l_0(u)) as ∆_0(v,l_0(v)) since the path is sequentially labeled. From this we make the assumption that ∆_k(v,l_k(v)) = ∆_{k−1}(v,l_{k−1}(v)) + ∆_{k−1}(v,u). But we want label differences between a vertex and its minimum label, so we argue that ∆_{k−1}(v,u) is the same as ∆_{k−2}(v,l_{k−2}(v)) using the following justification. It suffices to show that ∆_k(v,u) is related to an edge that is created by either label propagation or symmetrization. Since F_k is positive then so is ∆_k(v,u), implying v > u. Recall that edges pointing from a lower label to a higher label are due to symmetrization. Hence a (u,v) edge at step k is due to symmetrization from edge (v,u) at step k − 1, where u is l_{k−1}(v) and the label difference is the same. Then ∆_k(v,u) can be relabeled as ∆_{k−1}(v,l_{k−1}(v)). Our inductive assumption is now given by ∆_k(v,l_k(v)) = F_k = ∆_{k−1}(v,l_{k−1}(v)) + ∆_{k−2}(v,l_{k−2}(v)) and we will prove that it holds for all steps.

In the base step, we use k = 1 and k = 2, so we must show ∆_1(v,l_1(v)) = F_1 and ∆_2(v,l_2(v)) = F_2. The first case follows trivially from the sequence of labels at k = 0, thus ∆_1(v,l_1(v)) = 1 + 1 = 2 = F_1 where we use ∆_0(v,u) in place of ∆_{k−2}(v,l_{k−2}(v)). For the second case we have ∆_2(v,l_2(v)) = ∆_1(v,l_1(v)) + ∆_0(v,l_0(v)). It was established in the first case that ∆_1(v,l_1(v)) = 2, so by substitution we get ∆_2(v,l_2(v)) = 2 + 1 = 3 = F_2.

In the inductive step, assume ∆_k(v,l_k(v)) = F_k is valid for all values from one to k; then we must show ∆_{k+1}(v,l_{k+1}(v)) = F_{k+1} is also valid. This is demonstrated by,

∆_{k+1}(v,l_{k+1}(v)) = ∆_k(v,l_k(v)) + ∆_{k−1}(v,l_{k−1}(v))
                      = F_k + F_{k−1}
                      = F_{k+1}.

Proposition 1. Algorithm 1 converges in log_φ n = O(log n) steps on a sequentially labeled path, where φ ≈ 1.618 is the Golden Ratio.

Proof. Let F_k = F_{k−1} + F_{k−2} be the kth number in the Fibonacci sequence. It follows from Lemma 3 that the label updates for each vertex follow a Fibonacci sequence, since expanding ∆_k(v,l_k(v)) = F_k leads to l_k(v) = l_{k−1}(v) + l_{k−2}(v) − v. Each vertex v gets a new minimum label l_k(v) at step k from its previous minimum labels from steps k − 1 and k − 2 until it finally gets the component minimum label. Hence the kth-labeled vertex in the path will get the component label 1 in the same number of steps as it takes to get from 1 to F_k in the Fibonacci sequence. Then it takes log_φ n = O(log n) steps for the last vertex to get label 1, where φ ≈ 1.618 is the well-known Golden Ratio.
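The step counts that Proposition 1 predicts for the sequentially labeled paths used in the experiments can be reproduced numerically. In the sketch below the function name `predicted_steps` is hypothetical, and rounding log_φ n to the nearest integer is an assumption made to match the figures 29, 32, 35, 37 quoted in the experiments.

```python
import math

PHI = (1 + math.sqrt(5)) / 2  # Golden Ratio, approximately 1.618


def predicted_steps(n):
    """Nearest integer to log base phi of n (rounding is an assumption)."""
    return round(math.log(n) / math.log(PHI))


print([predicted_steps(2 ** e) for e in (20, 22, 24, 26)])  # [29, 32, 35, 37]
```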
