Parallel Finger Search Structures

Seth Gilbert, National University of Singapore
Wei Quan Lim, National University of Singapore

Keywords: Parallel data structures, multithreading, dictionaries, comparison-based search, distribution-sensitive

Abstract

In this paper¹ we present two versions of a parallel finger structure FS on p processors that supports searches, insertions and deletions, and has a finger at each end. This is to our knowledge the first implementation of a parallel search structure that is work-optimal with respect to the finger bound and yet has very good parallelism (within a factor of $O((\log p)^2)$ of optimal). We utilize an extended implicit batching framework that transparently facilitates the use of FS by any parallel program P that is modelled by a dynamically generated DAG D where each node is either a unit-time instruction or a call to FS. The total work done by either version of FS is bounded by the finger bound $F_L$ (for some linearization L of D), i.e. each operation on an item with distance r from a finger takes $O(\log r + 1)$ amortized work. Running P using the simpler version takes $O\left(\frac{T_1 + F_L}{p} + T_\infty + d \cdot \left((\log p)^2 + \log n\right)\right)$ time on a greedy scheduler, where $T_1$ and $T_\infty$ are the size and span of D respectively, n is the maximum number of items in FS, and d is the maximum number of calls to FS along any path in D. Using the faster version, this is reduced to $O\left(\frac{T_1 + F_L}{p} + T_\infty + d \cdot (\log p)^2 + s_L\right)$ time, where $s_L$ is the weighted span of D where each call to FS is weighted by its cost according to $F_L$. We also sketch how to extend FS to support a fixed number of movable fingers. The data structures in our paper fit into the dynamic multithreading paradigm, and their performance bounds are directly composable with other data structures given in the same paradigm. Also, the results can be translated to practical implementations using work-stealing schedulers.

Acknowledgements

We would like to express our gratitude to our families and friends for their wholehearted support, to the kind reviewers who provided helpful feedback, and to all others who have given us valuable comments and advice. This research was supported in part by Singapore MOE AcRF Tier 1 grant T1 251RES1719.
arXiv:1908.02741v4 [cs.DS] 10 Oct 2019

1 Introduction

There has been much research on designing parallel programs and parallel data structures. The dynamic multithreading paradigm (see [14] chap. 27) is one common parallel programming model, in which algorithmic parallelism is expressed through parallel programming primitives such as fork/join (also spawn/sync), parallel loops and synchronized methods, but the program cannot stipulate any mapping from subcomputations to processors. This is the case with many parallel languages and libraries, such as Cilk dialects [20, 25], Intel TBB [34], Microsoft Task Parallel Library [37] and subsets of OpenMP [31].

Recently, Agrawal et al. [3] introduced the exciting modular design approach of implicit batching, in which the programmer writes a multithreaded parallel program that uses a black box data structure, treating calls to the data structure as basic operations, and also provides a data structure that supports batched operations. Given these, the runtime system automatically combines these two components together, buffering data structure operations generated by the program, and executing them in batches on the data structure. This idea was extended in [4] to data structures that do not process only one batch at a time (to improve parallelism). In this extended implicit batching framework, the runtime system not only holds the data structure operations in a parallel buffer, to form the next input batch, but also notifies the data structure on receiving the first operation in each batch. Independently, the data structure can at any point flush the parallel buffer to get the next batch. This framework nicely supports pipelined batched data structures, since the data structure can decide when it is ready to get the next input batch from the parallel buffer, which may be even before it has finished processing the previous batch.
Furthermore, this framework makes it easy for us to build composable parallel algorithms and data structures with composable performance bounds. This is demonstrated by both the parallel working-set map in [4] and the parallel finger structure in this paper.

1 This is the full version of a paper published in the 33rd International Symposium on Distributed Computing (DISC 2019). It is posted here for your personal or classroom use. Not for redistribution. © 2019 Copyright is held by the owner/author(s).

Finger Structures

The map (or dictionary) data structure, which supports inserts, deletes and searches/updates, collectively referred to as accesses, comes in many different kinds. A common implementation of a map is a balanced binary search tree such as an AVL tree or a red-black tree, which (in the comparison model) takes O(log n) worst-case cost per access for a tree with n items. There are also maps such as splay trees [36] that have amortized rather than worst-case performance bounds. A finger structure is a special kind of map that comes with a fixed finger at each end and a (fixed) number of movable fingers, each of which has a key (possibly −∞ or ∞ or between adjacent items in the map) that determines its position in the map, such that accessing items nearer the fingers is cheaper. For instance, the finger tree [27] was designed to have the finger property in the worst case; it takes O(log r + 1) steps per operation with finger distance r (Definition 1), so its total cost satisfies the finger bound (Definition 2).

Definition 1 (Finger Distance). Define the finger distance of accessing an item x on a finger structure M to be the number of items from x to the nearest finger in M (including x), and the finger distance of moving a finger to be the distance moved.

Definition 2 (Finger Bound). Given any sequence L of N operations on a finger structure M, let $F_L$ denote the finger bound for L, defined by $F_L = \sum_{i=1}^{N} (\log r_i + 1)$, where $r_i$ is the finger distance of the i-th operation in L when L is performed on M.

Main Results

We present in this paper, to the best of our knowledge, the first parallel finger structure. In particular, we design two parallel maps that are work-optimal with respect to the Finger Bound $F_L$ (i.e. it takes $O(F_L)$ work) for some linearization L of the operations (that is consistent with the results), while having very good parallelism. (We assume that each key comparison takes O(1) steps.)
These parallel finger structures can be used by any parallel program P, whose actual execution is captured by a program DAG D, where each node is an instruction that finishes in O(1) time or a call to the finger structure M, called an M-call, that blocks until the result is returned, and each edge represents a dependency due to the parallel programming primitives. The first design, called FS1, is a simpler data structure that processes operations one batch at a time.

Theorem 3 (FS1 Performance). If P uses FS1 (as M), then its running time on p processes using any greedy scheduler (i.e. at each step, as many tasks are executed as are available, up to p) is
$O\left(\frac{T_1 + F_L}{p} + T_\infty + d \cdot \left((\log p)^2 + \log n\right)\right)$
for some linearization L of M-calls in D, where $T_1$ is the number of nodes in D, and $T_\infty$ is the number of nodes on the longest path in D, and d is the maximum number of M-calls on any path in D, and n is the maximum size of M.²

Notice that if M is an ideal concurrent finger structure (i.e. one that takes $O(F_L)$ work), then running P using M on p processors according to the linearization L takes $\Omega(T_{opt})$ worst-case time where $T_{opt} = \frac{T_1 + F_L}{p} + T_\infty$. Thus FS1 gives an essentially optimal time bound except for the ‘span term’ $d \cdot \left((\log p)^2 + \log n\right)$, which adds $O\left((\log p)^2 + \log n\right)$ time per FS1-call along some path in D. The second design, called FS2, uses a complex internal pipeline to reduce the ‘span term’.

Theorem 4 (FS2 Performance). If P uses FS2, then its running time on p processes using any greedy scheduler is
$O\left(\frac{T_1 + F_L}{p} + T_\infty + d \cdot (\log p)^2 + s_L\right)$
for some linearization L of M-calls in D, where d is the maximum number of FS2-calls on any path in D, and $s_L$ is the weighted span of D where each FS2-call is weighted by its cost according to $F_L$, except that each finger-move operation is weighted by log n.
Specifically, each FS2-call that is an access with finger distance r according to L is given the weight log r + 1, and each FS2-call that is a finger-move is given the weight log n, and $s_L$ is the maximum weight of any path in D. Thus, ignoring finger-move operations, FS2 gives an essentially optimal time bound up to an extra $O((\log p)^2)$ time per FS2-call along some path in D. We shall first focus on basic finger structures with just one fixed finger at each end, since we can implement the general finger structure with f movable fingers by essentially concatenating (f + 1) basic finger structures, as we shall explain later in Section 6. We will also discuss later in Section 7 how to adapt our results for work-stealing schedulers that can actually be provided by a real runtime system.

2 To cater to instructions that may not finish in O(1) time (e.g. due to memory contention), it suffices to define T1 and T∞ to be the (weighted) work and span (Definition 5) respectively of the program DAG where each M-call is assumed to take O(1) time.

Challenges & Key Ideas

The sequential finger structure in [22] (essentially a B-tree with carefully staggered rebalancing) takes O(log r + 1) worst-case time per access with finger distance r, but seems impossible to parallelize efficiently. It turns out that relaxing this bound to O(log r + 1) amortized time admits a simple sequential finger structure FS0 (Section 3) that can be parallelized. In FS0, the items are stored in order in a list of segments S0[0], S0[1], ···, S0[l], S1[l], ···, S1[1], S1[0], where each segment Si[k] is a balanced binary search tree with size at most 3 · c(k) but at least c(k) unless k = l, where $c(k) = 2^{2^{k+1}}$. This ensures that Si[k] has height $O(2^k)$, and that the r least items are in the first $\log O(\log r)$ segments and the r greatest items are in the last $\log O(\log r)$ segments. Thus for each operation with finger distance r, it takes O(log r + 1) time to search through the segments from both ends simultaneously to find the correct segment and perform the operation in it. After that, we rebalance the segments to preserve the size invariant, in such a way that each imbalanced segment Si[k] will have new size 2 · c(k). The double-exponential segment sizes and the reset-to-middle rebalancing are critical in ensuring that all the rebalancing takes O(1) amortized time per operation, even if each rebalancing cascade may take up to Θ(log n) time.
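To make the double-exponential growth concrete, the following minimal sketch counts how many leading segments are needed to cover the r least items. It assumes c(k) = 2^(2^(k+1)) as in the text, and takes the worst case where every segment holds only its minimum of c(k) items. The function names are illustrative and not from the paper.

```python
def c(k: int) -> int:
    """Size parameter c(k) = 2^(2^(k+1)); segment k holds between c(k) and 3*c(k) items."""
    return 2 ** (2 ** (k + 1))


def segments_needed(r: int) -> int:
    """Number of leading segments that cover the r least items.

    Assumption for this sketch: every segment is at its minimum size c(k),
    which is the worst case for how far a search must walk.
    """
    covered, k = 0, 0
    while True:
        covered += c(k)
        if covered >= r:
            return k + 1
        k += 1


# The count grows roughly like log log r: squaring r adds about one segment.
for r in (1, 5, 276, 10**6, 10**12):
    print(r, segments_needed(r))
```

Since the tree in segment k has height O(2^k), the heights of the scanned segments form a geometric series dominated by the last one, which is why the search cost stays within O(log r + 1).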

The challenge is to parallelize FS0 while preserving the total work. Naturally, we want to process operations in batches, and use a batch-parallel search structure in place of each binary search tree. This may seem superficially similar to the parallel working-set map in [4], but the techniques in the earlier paper cannot be applied in the same way, for three main reasons. Firstly, searches and deletions for items not in the map must still be cheap if they have small finger distance, so we have to eliminate these operations in a separate preliminary phase by an unsorted search of the smaller segments, before sorting and executing the other operations. Secondly, insertions and deletions must be cheap if they have small finger distance (e.g. deleting an item from the first segment must have O(1) cost), so we cannot enforce a tight segment size invariant, otherwise rebalancing would be too costly. This is unlike the parallel working-set map, where we not only have a budget of O(log n) for each insertion or deletion or failed search, but also must shift accessed items sufficiently near to the front to achieve the desired span bound. The rebalancing in the parallel finger structures in this paper is hence completely different from that in the parallel working-set map.

Thirdly, for the faster version FS2 where the larger segments are pipelined, in order to keep all segments sufficiently balanced, the pipelined segments must never be too underfull, so we must carefully restrict when a batch is allowed to be processed at a segment. Due to this, we cannot even guarantee that a batch of operations will proceed at a consistent pace through the pipeline, but we can use an accounting argument to bound the ‘excess delay’ by the number of FS2-calls divided by p.

Other Related Work

There are many approaches for designing efficient parallel data structures, so as to make maximal use of parallelism in a multi-processor system, whether with empirical or theoretical efficiency. For example, Ellen et al. [17] show how to design a non-blocking concurrent binary search tree, with later work analyzing the amortized complexity [16] and generalizing this technique [13]. Another notable concurrent search tree is the CBTree [2, 1], which is based on the splay tree. But despite experimental success, the theoretical access cost for these tree structures may increase with the number of concurrent operations due to contention near the root, and some of them do not even maintain balance (i.e., the height may get large). Another method is software combining [19, 23, 32], where each process inserts a request into a shared queue and at any time one process is sequentially executing the outstanding requests. This generalizes to parallel combining [6], where outstanding requests are executed in batches on a suitable batch-parallel data structure (similar to implicit batching). These methods were shown to yield empirically efficient concurrent implementations of various common abstract data structures including stacks, queues and priority queues.

In the PRAM model, Paul et al. [33] devised a parallel 2-3 tree where p synchronous processors can perform a sorted batch of p operations on a parallel 2-3 tree of size n in O(log n + log p) time. Blelloch et al. [10] show how to increase parallelism of tree operations via pipelining. Other similar data structures include parallel treaps [11] and a variety of work-optimal parallel ordered sets [8] supporting unions and intersections with optimal work, but these do not have optimal span. As it turns out, we can in fact have parallel ordered sets with optimal work and span [5, 28]. Nevertheless, the programmer cannot use this kind of parallel data structure as a black box with atomic operations in a high-level parallel program, but must instead carefully coordinate access to it. This difficulty can be eliminated by designing a suitable batch-parallel data structure and using implicit batching [3] or extended implicit batching as presented in [4] and more fully in this paper. Batch-parallel implementations have been designed for various data structures including weight-balanced B-trees [18], priority queues [6], working-set maps [4] and Euler-tour trees [38].

2 Parallel Computation Model

In this section, we describe parallel programming primitives in our model, how a parallel program generates an execution DAG, and how we measure the cost of an execution DAG.

2.1 Parallel Primitives

The parallel finger structures FS1 and FS2 in this paper are described and explained as multithreaded data structures that can be used as composable building blocks in a larger parallel program. In this paper we shall focus on the abstract algorithms behind FS1 and FS2, relying merely on the following parallel programming primitives (rather than model-specific implementation details, but see Appendix Section A.6 for those):

1. Threads: A thread can at any point terminate itself (i.e. finish running). Or it can fork a new thread, obtaining a pointer to that thread, or join to another thread (i.e. wait until that thread terminates). Or it can suspend itself (i.e. temporarily stop running), after which a thread with a pointer to it can resume it (i.e. make it continue running from where it left off). Each of these takes O(1) time.

2. Non-blocking locks: Attempts to acquire a non-blocking lock are serialized but do not block. Acquiring the lock succeeds if the lock is not currently held but fails otherwise, and releasing always succeeds. If k threads concurrently access the lock, then each access finishes within O(k) time.

3. Dedicated lock: A dedicated lock is a blocking lock initialized with a constant number of keys, where concurrent threads must use different keys to acquire it, but releasing does not require a key. Each attempt to acquire the lock takes O(1) time, and the thread will acquire the lock after at most O(1) subsequent acquisitions of that lock.

4. Reactivation calls: A procedure P with no input/output can be encapsulated by a reactivation wrapper, in which it can be run only via reactivations. If there are always at most O(1) concurrent reactivations of P, then whenever a thread reactivates P, if P is not currently running then it will start running (in another thread forked in O(1) time), otherwise it will run within O(1) time after its current run finishes.
We also make use of basic batch operations, namely filtering, sorted partitioning, joining and merging (see Appendix Section A.2), which have easy implementations using arrays in the binary forking model in [9]. So FS1 and FS2 (using a work-stealing scheduler) can be implemented in the Arbitrary CRCW PRAM model with fetch-and-add, achieving the claimed performance bounds. Actually, FS1 and FS2 were also designed to function correctly with the same performance bounds in a much stricter computation model called the QRMW parallel pointer machine model (see Appendix Section A.1 for details).

2.2 Execution DAG

The program DAG D captures the high-level execution of P, but the actual complete execution of P (including interaction between data structure calls) is captured by the execution DAG E (which may be schedule-dependent), in which each node is a basic instruction and the directed edges represent the computation dependencies (such as constrained by forking/joining of threads and acquiring/releasing of blocking locks). At any point during the execution of P, a node in the program/execution DAG is said to be ready if its parent nodes have been executed. At any point in the execution, an active thread is simply a ready node in E, while a terminated/suspended thread is an executed node in E that has no child nodes. The execution DAG E consists of program nodes (specifically P-nodes) and ds (data-structure) nodes, which are dynamically generated as follows. At the start E has a single program node, corresponding to the start of the program P. Each node could be a normal instruction (i.e. basic arithmetic/memory operation) or a parallel primitive (see Section 2.1). Each program node could also be a data structure call. When a (ready) node is executed, it may generate child nodes or terminate. A normal instruction generates one child node and no extra edges. A join generates a child node with an extra edge to it from the terminate node of the joined thread. A resume generates an extra child node (the resumed thread) with an edge to it from the suspend node of the originally suspended thread. Accesses to locks and reactivation calls would each expand to a subDAG comprised of normal instructions and possibly fork/suspend/resume. The program nodes correspond to nodes in the program DAG D, and except for data structure calls they generate only program nodes. A call to a data structure M is called an M-call. 
If M is an ordinary (non-batched) data structure, then an M-call generates an M-node (and every M-node is a ds node), which thereafter generates only M-nodes except for calls to other data structures (external to M) or returning the result of some operation (generating a program node with an edge to it from the original M-call).

However, if M is an (implicitly) batched data structure, then all M-calls are automatically passed to the parallel buffer for M (see Appendix Section A.4). So an M-call generates a buffer node corresponding to passing the call to the parallel buffer, as if the parallel buffer for M is itself another data structure and not part of M. Buffer nodes generate only buffer nodes until it notifies M of the buffered M-calls or passes the input batch to M, which generates an M-node. In short, M-nodes exclude all nodes generated as part of the buffer subcomputations (i.e. buffering the M-calls, and notifying M, and flushing the buffer).

2.3 Data Structure Costs

We shall now define work and span of any (terminating) subcomputation of a multithreaded program, i.e. any subset of the nodes in its execution DAG. This allows us to capture the intrinsic costs incurred by a data structure, separate from the costs of a parallel program using it.

Definition 5 (Subcomputation Work/Span/Cost). Take any execution of a parallel program P (on p processors), and take any subset C of nodes in its execution DAG E. The work taken by C is the total weight w of C where each node is weighted by the time taken to execute it. The span taken by C is the maximum weight s of nodes in C on any (directed) path in E. The cost of C is $\frac{w}{p} + s$.

Definition 6 (Data Structure Work/Span/Cost). Take any parallel program P using a data structure M. The work/span/cost of M (as used by P) is the work/span/cost of the M-nodes in the execution DAG for P.

Note that the cost of the entire execution DAG is in fact an upper bound on the actual time taken to run it on a greedy scheduler, which on each step assigns as many unassigned ready nodes (i.e. nodes that have been generated but have not been assigned) as possible to available processors (i.e. processors that are not executing any nodes) to be executed. Moreover, the subcomputation cost is subadditive across subcomputations.
Thus our results are composable with other algorithms and data structures in this model, since we actually show the following for some linearization L (where $F_L$, d, n, $s_L$ are as defined in Section 1 Main Results, and N is the total number of calls to the parallel finger structure).

Theorem 7 (FS Work/Span Bounds).
• (Theorem 12 and Theorem 14) FS1 takes $O(F_L)$ work and $O\left(\frac{N}{p} + d \cdot \left((\log p)^2 + \log n\right)\right)$ span.
• (Theorem 16 and Theorem 21) FS2 takes $O(F_L)$ work and $O\left(\frac{N}{p} + d \cdot (\log p)^2 + s_L\right)$ span.

Note that the bounds for the work/span of FS1 and FS2 are independent of the scheduler. In addition, using any greedy scheduler, the parallel buffer for either finger structure has cost $O\left(\frac{T_1 + F_L}{p} + d \cdot \log p\right)$ (Appendix Theorem 24). Therefore our main results (Theorem 3 and Theorem 4) follow from these composable bounds (Theorem 7).

In general, if a program uses a fixed number of implicitly batched data structures, then running it using a greedy scheduler takes $O\left(\frac{T_1 + w^*}{p} + T_\infty + s^* + d^* \cdot \log p\right)$ time, where $w^*$ is the total work of all the data structures, and $s^*$ is the total span of all the data structures, and $d^*$ is the maximum number of data structure calls on any path in the program DAG.

3 Amortized Sequential Finger Structure

In this section we explain a sequential finger structure FS0 with a fixed finger at each end, which (unlike finger structures based on balanced binary trees) is amenable to parallelization and pipelining due to its doubly-exponential segmented structure (which was partially inspired by Iacono’s working-set structure [24]).

Front < S0[0] < S0[1] < S0[2] < ··· < S0[l]
Back > S1[0] > S1[1] > S1[2] > ··· > S1[l]

Figure 1: FS0 Outline; each box Si[k] represents a 2-3 tree of size $\Theta(2^{2^k})$ for k < l

FS0 keeps the items in order in two halves, the front half stored in a chain of segments S0[0..l], and the back half stored in reverse order in a chain of segments S1[0..l]. Let $c(k) = 2^{2^{k+1}}$ for each k ∈ Z. Each segment Si[k] has a target size t(k) = 2 · c(k), and a target capacity defined to be [t(k), t(k)] if k < l but [0, t(k)] if k = l. Each segment stores its items in order in a 2-3 tree. We say that a segment Si[k] is balanced iff its size is within c(k) of its target capacity, and overfull iff it has more than c(k) items above target capacity, and underfull iff it has more than c(k) items below target capacity. At any time we associate every item x to a unique segment that it fits in: x fits in S0[k] if k is the minimum such that x ≤ max(S0[k]), and x fits in S1[k] if k is the minimum such that x ≥ min(S1[k]), and x fits in S0[l] if max(S0[l]) < x < min(S1[l]). We shall maintain the invariant that every segment is balanced after each operation is finished.
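The balance conditions above can be read as a simple size test. The following minimal sketch classifies a segment from its size, its index k, and whether it is the last segment of its chain. It assumes c(k) = 2^(2^(k+1)) and t(k) = 2 · c(k) as defined in this section. The function name `classify` and the string labels are illustrative.

```python
def c(k: int) -> int:
    """Size parameter c(k) = 2^(2^(k+1))."""
    return 2 ** (2 ** (k + 1))


def t(k: int) -> int:
    """Target size t(k) = 2 * c(k)."""
    return 2 * c(k)


def classify(size: int, k: int, is_last: bool) -> str:
    """Classify segment S_i[k] as 'balanced', 'overfull' or 'underfull'.

    The target capacity is [t(k), t(k)] for a non-last segment and
    [0, t(k)] for the last segment of a chain.
    """
    low = 0 if is_last else t(k)
    high = t(k)
    if size > high + c(k):
        return "overfull"    # more than c(k) items above target capacity
    if size < low - c(k):
        return "underfull"   # more than c(k) items below target capacity
    return "balanced"


# For k = 0 we have c(0) = 4 and t(0) = 8, so a non-last segment
# is balanced exactly when its size lies in [4, 12].
print([classify(s, 0, False) for s in (3, 4, 12, 13)])
```

Under this reading a non-last balanced segment has between c(k) and 3 · c(k) items, and the last segment of a chain can never be underfull.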

For each operation on an item x, we find the segment Si[k] that x fits in, by checking the range of items in S0[a] and S1[a] for each a from 0 to l and stopping once k is found, and then perform the desired operation on the 2-3 tree in Si[k]. This takes $O(k + \log(t(k) + c(k))) \subseteq O(2^k)$ steps, and $2^k = \log_2 c(k-1) \le \log_2 r + 1$ where r is the finger distance of the operation. After that, if Si[k] becomes imbalanced, we rebalance it by shifting (appropriate) items to or from Si[k+1] (after creating empty segment Si[k+1] if it does not exist) to make Si[k] have target size or as close as possible (via a suitable split then join of the 2-3 trees), and then Si[k+1] is removed if it is the last segment and is now empty. After the rebalancing, Si[k] will not only be balanced but also have size within its target capacity. But now Si[k+1] may become imbalanced, so the rebalancing may cascade.

Finally, if one chain Si[0..l′] is longer than the other chain Sj[0..l], it must be that l′ = l + 1, so we rebalance the chains as follows: If Sj[l] is below target size, shift items from Si[l′] to Sj[l] to fill it up to target size. If Sj[l] is (still) below target size, remove the now empty Si[l′], otherwise add a new empty segment Sj[l+1].

Rebalancing may cascade throughout the whole chain and take Θ(log n) steps. But we shall show below that the rebalancing costs can be amortized away completely, and hence each operation with finger distance r takes O(log r + 1) amortized steps, giving us the finger bound for FS0. We will later use the same technique in analyzing FS1 and FS2 as well.
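The segment search described above can be sketched as follows. This is a toy model under stated assumptions: each segment is a sorted Python list standing in for a 2-3 tree, both chains have the same length, and `find_segment` is a hypothetical helper name. It only locates the segment that x fits in and does not perform the access or any rebalancing.

```python
from typing import List, Tuple


def find_segment(front: List[List[int]], back: List[List[int]], x: int) -> Tuple[int, int]:
    """Return (i, k) such that x fits in S_i[k], scanning both chains outward-in.

    front[k] models S0[k] (least items first); back[k] models S1[k]
    (back[0] holds the greatest items). Each segment is sorted ascending.
    """
    last = len(front) - 1
    for a in range(last + 1):
        # x fits in S0[a] if a is the minimum index with x <= max(S0[a]).
        if front[a] and x <= front[a][-1]:
            return (0, a)
        # x fits in S1[a] if a is the minimum index with x >= min(S1[a]).
        if back[a] and x >= back[a][0]:
            return (1, a)
    # Otherwise max(S0[l]) < x < min(S1[l]), so x fits in S0[l].
    return (0, last)


front = [[1, 2, 3, 4, 5, 6, 7, 8], [10, 11, 12]]
back = [[93, 94, 95, 96, 97, 98, 99, 100], [50, 60]]
print(find_segment(front, back, 3))   # near the front finger
print(find_segment(front, back, 95))  # near the back finger
print(find_segment(front, back, 30))  # falls in the middle gap
```

Because the loop alternates between the two chains, an item close to either end is found after inspecting only a few small segments, which mirrors the simultaneous search from both ends.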

Lemma 8 (FS0 Rebalancing Cost). All the rebalancing takes O(1) amortized steps per operation.

Proof. We shall maintain the invariant that each segment Si[k] with q items beyond (i.e. above or below) its target capacity has at least $q \cdot 2^{-k}$ stored credits. Each operation is given 1 credit, and we use it to pay for any needed extra stored credits at the segment where we perform the operation. Whenever a segment Si[k] is rebalanced, it must have had q items beyond its target capacity for some q > c(k), and so had at least $q \cdot 2^{-k}$ stored credits. Also, the rebalancing itself takes $O(\log(t(k) + q) + \log(t(k+1) + c(k+1) + q)) \subseteq O(\log q) \subseteq O(q \cdot 2^{-k})$ steps, after which Si[k+1] needs at most $q \cdot 2^{-(k+1)}$ extra stored credits. Thus the stored credits at Si[k] can be used to pay for both the rebalancing and any extra stored credits needed by Si[k+1]. Whenever the chains are rebalanced, it can be paid for by the last segment rebalancing (which created or removed a segment), and no extra stored credits are needed. Therefore the total rebalancing cost amounts to O(1) per operation. □

4 Simpler Parallel Finger Structure

We now present our simpler parallel finger structure FS1. The idea is to use the amortized sequential finger structure FS0 (Section 3) and execute operations in batches. We group each pair of segments S0[k] and S1[k] into one section S[k], and we say that an item x fits in the sections S[j..k] iff x fits in some segment in S[j..k]. The items in each segment are stored in a batch-parallel map (Appendix Section A.3), which supports:

• Unsorted batch search: Search for an unsorted batch of b items within O(b · log n) work and O(log b · log n) span, tagging each search with the result, where n is the map size.
• Sorted batch access: Perform an item-sorted batch of b operations on distinct items within O(b · log n) work and O(log b + log n) span, tagging each operation with the result, where n is the map size.
• Split: Split a map of size n around a given pivot rank (into lower+upper parts) within O(log n) work/span.
• Join: Join maps of total size n separated by a pivot (i.e. lower+upper parts) within O(log n) work/span.

For each section S[k], we can perform a batch of b operations on it within O(b · log c(k)) work and O(log b + log c(k)) span if we have the batch sorted. Excluding sorting, the total work would satisfy the finger bound for the same reason as in FS0. However, we cannot afford to sort the input batch right at the start, because if the batch had b searches of distinct items all with finger distance O(1), then it would take Ω(b · log b) work and exceed our finger bound budget of O(b).

We can solve this by splitting the sections into two slabs, where the first slab comprises the first log log(2b) sections, and passing the batch through a preliminary phase in which we merely perform an unsorted search of the relevant items in the first slab, and eliminate operations on items that fit in the first slab but are neither found nor to be inserted. This preliminary phase takes O(log c(k)) work per operation and O(log b · log c(k)) span at each section S[k].
We then sort the uneliminated operations and execute them on the appropriate slab. For this, ordinary sorting still takes too much work as there can be many operations on the same item, but it turns out that the finger bound budget is enough to pay for entropy-sorting (Appendix Definition 31), which takes $O\left(\log\frac{b}{q} + 1\right)$ work for each item that occurs q times in the batch. Rebalancing the segments and chains is a little tricky, but if done correctly it takes O(1) amortized work per operation. Therefore we achieve work-optimality while being able to process each batch within $O\left((\log b)^2 + \log n\right)$ span. The details are below.
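To see why the entropy bound matters for batches with many repeated items, the sketch below compares it with the b · log b cost of ordinary comparison sorting. It is a minimal illustration that assumes the bound charges log(b/q) + 1 to every occurrence of an item appearing q times. This is our reading of the statement above, and the precise definition is the one in the appendix. The function name `entropy_bound` is illustrative.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def entropy_bound(batch: Sequence[Hashable]) -> float:
    """Sum of log2(b / q_x) + 1 over all occurrences x in the batch,
    where b is the batch size and q_x is the multiplicity of x.
    (Assumed per-occurrence reading of the entropy-sorting bound.)"""
    b = len(batch)
    counts = Counter(batch)
    return sum(math.log2(b / counts[x]) + 1 for x in batch)


all_same = ["a"] * 8               # one item repeated 8 times
all_distinct = list(range(8))      # 8 distinct items
half_half = ["a"] * 4 + ["b"] * 4  # two items, 4 times each

for batch in (all_same, half_half, all_distinct):
    b = len(batch)
    print(entropy_bound(batch), b * math.log2(b))
```

With heavy duplication the bound collapses to linear in b, whereas for all-distinct items it matches the usual b · log b up to the additive b term.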

4.1 Description of FS1

Parallel buffer --(size-b input batch)--> S[0] → ··· → S[m−1] --(Sort)--> S[m] → ··· → S[l], where m = ⌈log log(2b)⌉
First slab: S[0..m−1]; Final slab: S[m..l]

Figure 2: FS1 Outline; each batch is sorted only after being filtered through the smaller sections

FS1-calls are put into the parallel buffer (Section 2) for FS1. Whenever the previous batch is done, FS1 flushes the parallel buffer to obtain the next batch B. Let b be the size of B, and we can assume b > 1. Based on b, the sections in FS1 are conceptually divided into two slabs, the first slab comprising sections S[0..m−1] and the final slab comprising sections S[m..l], where m = ⌈log log(2b)⌉ + 1 (where log is the binary logarithm). The items in each segment are stored in a batch-parallel map (Appendix Section A.3).
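As a small worked example of the slab split, the sketch below computes m from the batch size b, following the formula m = ⌈log log(2b)⌉ + 1 stated in the text. It avoids floating point by using the equivalence that ⌈log log(2b)⌉ is the least j with 2^(2^j) ≥ 2b. The function name `slab_boundary` is illustrative.

```python
def slab_boundary(b: int) -> int:
    """Return m = ceil(log2(log2(2b))) + 1 for a batch of size b > 1.

    The first slab is then S[0..m-1] and the final slab is S[m..l].
    Integer arithmetic only: ceil(log2(log2(2b))) is the least j
    such that 2^(2^j) >= 2b.
    """
    if b <= 1:
        raise ValueError("the text assumes b > 1")
    j = 0
    while 2 ** (2 ** j) < 2 * b:
        j += 1
    return j + 1


# m grows doubly-logarithmically in the batch size.
for b in (2, 3, 8, 128, 129, 2**31):
    print(b, slab_boundary(b))
```

Even for a batch of two billion operations the first slab has only six sections, so the preliminary unsorted search touches very few sections.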

FS1 processes the input batch $B$ in four phases:

1. Preliminary phase: For each first slab section $S[k]$ in order (i.e. $k$ from $0$ to $m-1$) do as follows:
   (a) Perform an unsorted search in each segment in $S[k]$ for all the items relevant to the remaining batch $B'$ (of direct pointers into $B$), and tag the operations in the original batch $B$ with the results.
   (b) Remove all operations on items that fit in $S[k]$ from the remaining batch $B'$.
   (c) Skip the rest of the first slab if $B'$ becomes empty.

2. Separation phase: Partition $B$ based on the tags into three parts and handle each part separately as follows:
   (a) Ineffectual operations (on items that fit in the first slab but are neither found nor to be inserted): Return the results.
   (b) Effectual operations (on items found in or to be inserted into the first slab): Entropy-sort (Appendix Definition 31) them in order of access type (search, update, insertion, deletion) with deletions last, followed by item, combining operations of the same access type on the same item into one group-operation that is treated as a single operation whose effect is the last operation in that group. Each group-operation is stored in a leaf-based binary tree with height $O(\log b)$ (but not necessarily balanced), and the combining is done during the entropy-sorting itself.
   (c) Residual operations (on items that do not fit in the first slab): Sort them while combining operations in the same manner as for effectual operations (see footnote 3).

3. Execution phase: Execute the effectual operations as a batch on the first slab, and then execute the residual operations as a batch on the final slab, namely for each slab doing the following at each section $S[k]$ in order (small to big):
   (a) Let $G_{1..4}$ be the partition of the batch of operations into the 4 access types (deletions last), each $G_a$ sorted by item.
   (b) For each segment $S_i[k]$ in $S[k]$, and for each $a$ from 1 to 4, cut out the operations that fit in $S_i[k]$ from $G_a$, and perform those operations (as a sorted batch) on $S_i[k]$, and then return their results.
   (c) Skip the rest of the slab if the batch becomes empty.

4. Rebalancing phase: Rebalance all the segments and chains by doing the following:
   (a) Segment rebalancing: For each chain $S_i$, for each segment $S_i[k]$ in $S_i$ in order (small to big):
      i. If $k > 0$ and $S_i[k-1]$ is overfull, shift items from $S_i[k-1]$ to $S_i[k]$ to make $S_i[k-1]$ have target size.
      ii. If $k > 0$ and $S_i[k-1]$ is underfull and $S_i[k]$ either has at least $c(k)/2$ items or is the last segment in $S_i$, let $S_i[k']$ be the first underfull segment in $S_i$, and fill $S_i[k'..k-1]$ using $S_i[k]$ as follows: for each $j$ from $k-1$ down to $k'$, shift items from $S_i[j+1]$ to $S_i[j]$ to make $S_i[k'..j]$ have total size $\sum_{a=k'}^{j} t(a)$ or as close as possible, and then remove $S_i[j+1]$ if it is emptied.
      iii. If $S_i[k]$ is (still) overfull and is the last segment in $S_i$, create a new (empty) segment $S_i[k+1]$.
      iv. Skip the rest of the current slab if $S_i[k]$ is (now) balanced and the execution phase had skipped $S[k]$.
   (b) Chain rebalancing: After that, if one chain $S_i$ is longer than the other chain $S_j$, repeat the following until the chains are the same length:
      i. Let the current chains be $S_i[0..k]$ and $S_j[0..k']$. Create new (empty) segments $S_j[k'+1..k]$, and shift all items from $S_i[k]$ to $S_j[k]$, and then fill the underfull segments in $S_j[k'..k-1]$ using $S_j[k]$ (as in step 4aii).
      ii. If $S_j[k]$ is (now) empty again, remove $S[k]$.
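The combining step of the separation phase can be sketched sequentially as follows. This is only an illustration under stated assumptions: an ordinary comparison sort stands in for the parallel entropy-sort, a plain list stands in for the leaf-based binary tree that holds each group-operation, and the names `ACCESS_ORDER` and `combine_operations` are hypothetical.

```python
from itertools import groupby

# Access types in the order used by the text, with deletions last.
ACCESS_ORDER = {"search": 0, "update": 1, "insert": 2, "delete": 3}

def combine_operations(ops):
    """Sort (access_type, item, op_id) triples and merge equal (type, item) runs.

    Returns a list of (access_type, item, [op_ids in original batch order]).
    """
    # Sort by access type, then item, then position in the batch.
    keyed = sorted(
        enumerate(ops),
        key=lambda e: (ACCESS_ORDER[e[1][0]], e[1][1], e[0]),
    )
    groups = []
    # Consecutive entries with the same (type, item) form one group-operation.
    for (atype, item), run in groupby(keyed, key=lambda e: (e[1][0], e[1][1])):
        groups.append((atype, item, [op[2] for _, op in run]))
    return groups
```

Within each returned group the operations keep their batch order, so the last entry of a group is the one whose effect the group-operation carries.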

3 This does not require entropy-sorting, but combining merge-sort essentially achieves the entropy bound anyway.

4.2 Analysis of FS1

First we establish that the rebalancing phase works, by proving the following two lemmas.

Lemma 9 (FS1 Segment Rebalancing Invariant). During the segment rebalancing (step 4a), just after the iteration for segment $S_i[k]$, for any imbalanced segment $S_i[k']$ in $S_i[0..k]$, either $k' = k$ or $S_i[k'..k]$ are all underfull.

Proof. The invariant clearly holds for $S_i[0]$. Consider each iteration for segment $S_i[k]$ during the segment rebalancing where $k > 0$. If $S_i[k-1]$ was overfull, then by the invariant it was the only imbalanced segment in $S_i[0..k-1]$, and would be rebalanced in step 4ai, preserving the invariant. If $S_i[k-1]$ was underfull and $S_i[k]$ had at least $c(k)/2$ items or was the last segment in $S_i$, then in step 4aii $S_i[k'..k-1]$ would be filled using $S_i[k]$, which had at least $c(k)/2 \ge \sum_{a=k'}^{k-1} t(a)$ items unless it was the last segment in $S_i$, and hence after that every segment in $S_i[k'..k-1]$ (that is not removed) would be balanced, preserving the invariant. If step 4ai and step 4aii do not apply, then $S_i[k-1]$ is balanced or $S_i[k]$ is underfull, so the invariant is preserved. Finally, if $S_i[k]$ is balanced at the end of that iteration, and had been skipped by the execution phase, then by the invariant all segments in $S_i[0..k]$ are balanced, and all segments skipped by the rebalancing phase are also balanced, so the invariant is preserved. □

Lemma 10 (FS1 Chain Rebalancing Iterations). The chain rebalancing (step 4b) takes at most two iterations, after which both chains $S_0$ and $S_1$ will have equal length and all their segments will be balanced.

Proof. By Lemma 9, all segments in each chain will be balanced after the segment rebalancing (step 4a). After that, if one chain $S_i[0..k]$ is longer than the other chain $S_j[0..k']$, the first chain rebalancing iteration transfers all items in $S_i[k]$ to the other chain (step 4bi), leaving $S_i[k]$ empty. If $S_j[k]$ remains non-empty, then both chains have length $k$ and we are done. Otherwise, $S[k]$ would be removed, and then the second chain rebalancing iteration transfers all items in $S_i[k-1]$ to the other chain, which is at least $c(k-1) \ge \sum_{a=0}^{k-2} t(a)$ items, so every segment in $S_j[k'..k-2]$ would be filled to target size, and hence both chains would have length $(k-1)$. □

Next we bound the work done by FS1. Definition 11 (Inward Order). Take any sequence A of map operations and let I be the set of items accessed by operations in A. Define the inward distance of an operation in A on an item x to be min(size(I≤x),size(I≥x)). We say that A is in inward order iff its operations are in order of (non-strict) increasing inward distance. Naturally, we say that A is in outward order iff its reverse is in inward order.
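To make Definition 11 concrete, here is a minimal sketch that computes inward distances for a sequence of operations and tests whether the sequence is in inward order. Each operation is represented only by the item it accesses, and the function names are hypothetical.

```python
from bisect import bisect_left, bisect_right

def inward_distances(items):
    """Inward distance of each operation, given the item each one accesses."""
    distinct = sorted(set(items))  # the set I of accessed items
    n = len(distinct)
    dists = []
    for x in items:
        at_most = bisect_right(distinct, x)      # size of {y in I : y <= x}
        at_least = n - bisect_left(distinct, x)  # size of {y in I : y >= x}
        dists.append(min(at_most, at_least))
    return dists

def is_inward_order(items):
    """True iff the inward distances are non-strictly increasing."""
    d = inward_distances(items)
    return all(a <= b for a, b in zip(d, d[1:]))
```

A sequence is in outward order exactly when `is_inward_order(items[::-1])` holds, matching the last sentence of the definition.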

Theorem 12 (FS1 Work). FS1 takes $O(F_L)$ work for some linearization $L$ of FS1-calls in $D$.

Proof. Let $L^*$ be a linearization of FS1-calls in $D$ such that:

• Operations on FS1 in earlier input batches are before those in later input batches.
• The operations within each batch are ordered as follows:
  1. Ineffectual operations are before effectual/residual operations.
  2. Effectual/residual operations are in order of access type (deletions last).
  3. Effectual insertions are in inward order, and effectual deletions are in outward order.
  4. Operations in each group-operation are consecutive and in the same order as in that group.

Let $L'$ be the same as $L^*$ except that in point 3 effectual deletions are ordered so that those on items in earlier sections are later (instead of outward order).

Now consider each input batch $B$ of $b$ operations on FS1. In the preliminary and execution phases, each section $S[a]$ takes $O(2^a)$ work per operation. Thus each operation in $B$ with finger distance $r$ according to $L'$ on an item $x$ that was found to fit in section $S[k]$ takes $O\left(\sum_{a=0}^{k} 2^a\right) = O(2^k) \subseteq O(\log r + 1)$ work, because $r \ge \sum_{a=0}^{k-1} c(a) + 1 \ge \frac{1}{2} c(k-1)$ if $S[k]$ is in the first slab (since earlier effectual operations in $B$ did not delete items in $S[0..k-1]$), and $r \ge \sum_{a=0}^{k-1} c(a) - b \ge \frac{1}{2} c(k-1)$ if $S[k]$ is in the final slab (since $b \le \frac{1}{2} c(m-1)$). Therefore these phases take $O(F_{L'})$ work in total.

Let $G$ be the effectual operations in $B$ as a subsequence of $L^*$. Entropy-sorting $G$ takes $O(H + b)$ work (Appendix Theorem 32), where $H$ is the entropy of $G$ (i.e. $H = \sum_{i=1}^{b} \log\frac{b}{q_i}$ where $q_i$ is the number of occurrences of the $i$-th operation in $G$). Partition $G$ into 3 parts: searches/updates $G_1$, insertions $G_2$ and deletions $G_3$. Let $H_j$ be the entropy of $G_j$. Then $H = \sum_{j=1}^{3} H_j + \sum_{i=1}^{b} \log\frac{b}{b_i}$ where $b_i$ is the number of operations in the same part of $G$ as the $i$-th operation in $G$, and $\sum_{i=1}^{b} \log\frac{b}{b_i} \le b \cdot \log\left(\frac{1}{b}\sum_{i=1}^{b}\frac{b}{b_i}\right) = b \cdot \log 3$ by Jensen's inequality. Thus entropy-sorting $G$ takes $O\left(\sum_{j=1}^{3} H_j + b\right)$ work. Let $C_j$ be the cost of $G_j$ according to $F_{L^*}$. Since each operation in $G_j$ has inward distance (with respect to $G_j$) at most its finger distance according to $L^*$, we have $H_j \in O(C_j)$ (Appendix Theorem 28), and hence entropy-sorting takes $O(F_{L^*})$ work in total.
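The entropy decomposition used in this step can be checked term by term. With $q_i$ and $b_i$ as in the proof, and assuming for the final equality that all three parts $G_1, G_2, G_3$ are non-empty (with fewer non-empty parts the final value only decreases):

```latex
\begin{align*}
H &= \sum_{i=1}^{b} \log\frac{b}{q_i}
   = \sum_{i=1}^{b} \log\frac{b_i}{q_i} + \sum_{i=1}^{b} \log\frac{b}{b_i}
   = \sum_{j=1}^{3} H_j + \sum_{i=1}^{b} \log\frac{b}{b_i}, \\
\sum_{i=1}^{b} \log\frac{b}{b_i}
  &\le b \cdot \log\Bigl(\frac{1}{b}\sum_{i=1}^{b}\frac{b}{b_i}\Bigr)
   = b \cdot \log\Bigl(\frac{1}{b}\sum_{j=1}^{3} |G_j| \cdot \frac{b}{|G_j|}\Bigr)
   = b \cdot \log 3 .
\end{align*}
```

The inequality is Jensen's inequality applied to the concave function $\log$, and the middle equality groups the $b$ summands by part, since every operation in $G_j$ has $b_i = |G_j|$.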

Sorting the residual operations in the batch $B$ (that do not fit in the first slab) takes $O(\log b) \subseteq O(\log r)$ work per operation with finger distance $r$ according to $L^*$, since $r \ge c(m-1) \ge 2b$.

Therefore the separation phase takes $O(F_{L^*})$ work in total. Finally, the rebalancing phase takes $O(1)$ amortized work per operation, as we shall prove in the next lemma. Thus FS1 takes $O(\max(F_{L^*}, F_{L'}))$ total work. □

Lemma 13 (FS1 Rebalancing Work). The rebalancing phase of FS1 takes $O(1)$ amortized work per operation.

Proof. We shall maintain the credit invariant that each segment $S_i[k]$ with $q$ items beyond its target capacity has at least $q \cdot 2^{-k}$ stored credits. The execution phase clearly increases the total stored credits needed by at most 1 per operation, which we can pay for. We now show that the invariant can be preserved after the segment rebalancing and the chain rebalancing.

During the segment rebalancing (step 4a), each shift is performed between some neighbouring segments $S_i[k]$ and $S_i[k+1]$, where $S_i[k]$ has $t(k) + q$ items and $S_i[k+1]$ has $t(k+1) + q'$ items just before the shift, and $|q| > c(k)$. The shift clearly takes $O(\log(t(k) + q) + \log(t(k+1) + q'))$ work. If $q' < 2 \cdot t(k+1)$ then this is obviously just $O(\log t(k) + \log |q|)$ work. But if $q' > 2 \cdot t(k+1)$, then $S_i[k+1]$ will also be rebalanced in step 4ai of the next segment balancing iteration, since at most $\sum_{a=0}^{k} t(a) \le t(k+1)$ items will be shifted from $S_i[k+1]$ to $S_i[k]$ in step 4aii, and hence $S_i[k+1]$ will still have at least $q'$ items. In that case, the second term $O(\log(t(k+1) + q'))$ in the work bound for this shift can be bounded by the first term of the work bound for the subsequent shift from $S_i[k+1]$ to $S_i[k+2]$, since $\log(t(k+1) + q') \in O(\log q')$. Therefore in any case we can treat this shift as taking only $O(\log t(k) + \log |q|) \subseteq O(\log |q|) \subseteq O(|q| \cdot 2^{-k})$ work. Now consider the two kinds of segment rebalancing:

• Overflow: step 4ai shifts items from overfull $S_i[k]$ to $S_i[k+1]$. Suppose that $S_i[k]$ has $t(k) + u$ items just before the shift. After the shift, $S_i[k]$ has target size and needs no stored credits, and $S_i[k+1]$ would need at most $u \cdot 2^{-(k+1)}$ extra stored credits. Thus the $u \cdot 2^{-k}$ credits stored at $S_i[k]$ can pay for both the shift and the needed extra stored credits.
• Fill: step 4aii fills some underfull segments $S_i[k'..k]$ using $S_i[k+1]$. Suppose that $S_i[j]$ has $t(j) - u_i(j)$ items just before the fill, for each $j \in [k'..k]$. After the fill, every segment in $S_i[k'..k]$ has size within target capacity and needs no stored credits, and $S_i[k+1]$ needs at most $\left(\sum_{j=k'}^{k} u_i(j)\right) \cdot 2^{-(k+1)} \le \frac{1}{2} \sum_{j=k'}^{k} u_i(j) \cdot 2^{-j}$ extra stored credits, which can be paid for by using half the credits stored at each segment in $S_i[k'..k]$. The other half of the $u_i(j) \cdot 2^{-j}$ credits stored at $S_i[j]$ suffices to pay for the shift from $S_i[j+1]$ to $S_i[j]$, for each $j \in [k'..k]$.

The chain rebalancing (step 4b) is performed only when segment rebalancing creates or removes a segment and makes one chain longer than the other. Consider the biggest segment $S_i[k]$ that was created or removed. If $S_i[k]$ was created, it must be due to overflowing $S_i[k-1]$ to $S_i[k]$ in step 4ai, and hence the shift from $S_i[k-1]$ to $S_i[k]$ already took $\Theta(2^k)$ work. If $S_i[k]$ was removed, it must be due to filling some segments $S_i[k'..k-1]$ using $S_i[k]$ in step 4aii, but $S_i[k-1]$ must have had at least $c(k-1)$ items before the execution phase, and at least half of them were either deleted or shifted to $S_i[k-2]$, and hence either the deletions can pay $\Theta(2^k)$ credits, or the shift to $S_i[k-2]$ already took $\Theta(2^k)$ work. Therefore in any case we can afford to ignore up to $\Theta(2^k)$ work done by chain rebalancing.

Now observe that the chain rebalancing performs at most two transfers (step 4bi) of items from the last segment of the longer chain $S_i[0..k]$ to the shorter chain $S_{i'}[0..k']$, by FS1 Chain Rebalancing Iterations (Lemma 10). Each transfer takes $O(k)$ work to create the new segments and $O(1)$ work to shift $S_i[k]$ over to $S_{i'}[k]$, and then fills underfull segments in $S_{i'}[k'..k-1]$ using $S_{i'}[k]$. The fill takes $O(2^k)$ work for the shift from $S_{i'}[k]$ to $S_{i'}[k-1]$, and takes $O(2^j)$ work for each shift from $S_{i'}[j]$ to $S_{i'}[j-1]$ for each $j \in [k'+1..k-1]$, since $S_{i'}[j]$ has at most $\sum_{a=0}^{j} t(a) \le t(j+1)$ items just before the shift. Therefore each transfer takes $O(2^k)$ work in total, and hence we can ignore all the work done by the chain rebalancing. □

And now we turn to bounding the span of FS1.

Theorem 14 (FS1 Span). FS1 takes $O\left(\frac{N}{p} + d \cdot \left((\log p)^2 + \log n\right)\right)$ span, where $N$ is the number of operations on FS1, and $n$ is the maximum size of FS1, and $d$ is the maximum number of FS1-calls on any path in the program DAG $D$.

Proof. Let $s(b)$ denote the maximum span of processing an input batch of size $b$ (that has been flushed from the parallel buffer). Take any input batch $B$ of size $b$. We shall bound the span taken by $B$ in each phase. The preliminary phase takes $O(\log b \cdot 2^k)$ span in each first slab segment $S[k]$, adding up to $O((\log b)^2)$ span. The separation phase also takes $O((\log b)^2)$ span, by PESort Costs (Theorem 32). The execution phase takes $O(\log b + 2^k)$ span in each segment $S[k]$, adding up to $O(\log b \cdot \log\log b + \log n)$ span. Returning the results for each group-operation takes $O(\log b)$ span.

The rebalancing phase also takes $O(\log b + 2^k)$ span for each segment $S[k]$ processed in step 4a, because each shift between segments with total size $q$ takes $O(\log q)$ span, and filling $S_i[k'..k-1]$ using $S_i[k]$ in step 4aii takes $O\left(\log\left(b + \sum_{a=k'}^{k} t(a)\right)\right) \subseteq O(\log b + 2^k)$ span for the first shift from $S_i[k]$ to $S_i[k-1]$ and then $O\left(\log \sum_{a=k'}^{j} t(a)\right) \subseteq O(2^j)$ span for each subsequent shift from $S_i[j+1]$ to $S_i[j]$. Similarly, the chain rebalancing in step 4b takes $O(\log b + \log n)$ span, because it performs at most two iterations by FS1 Chain Rebalancing Iterations (Lemma 10), each of which takes $O(\log b + \log n)$ span to fill the underfull segments of the shorter chain using its last segment.

Therefore $s(b) \in O((\log b)^2 + \log n) \subseteq O\left(\frac{b}{p} + (\log p)^2 + \log n\right)$, since $(\log b)^2 \in O\left(\frac{b}{p}\right)$ if $b \ge p^2$, and $(\log b)^2 < 4 (\log p)^2$ otherwise. Each batch $B$ of size $b$ waits in the buffer for the preceding batch of size $b'$ to be processed, taking $O(s(b'))$ span, and then $B$ itself is processed, taking $O(s(b))$ span, for $O(s(b) + s(b'))$ span in total. Since over all batches each of $b, b'$ will sum up to at most the total number $N$ of FS1-calls, and there are at most $d$ FS1-calls on any path in the program DAG $D$, the span of FS1 is $O\left(\frac{N}{p} + d \cdot \left((\log p)^2 + \log n\right)\right)$. □

5 Faster Parallel Finger Structure

Although FS1 has optimal work and a small span, it is possible to reduce the span even further, intuitively by pipelining the batches in some fashion so that an expensive access in a batch does not hold up the next batch.

As with FS1, we need to split the sections into two slabs, but this time we fix the first slab at $m$ sections where $m \in \log \Theta(\log p)$ so that we can pipeline just the final slab. We need to allow big enough batches so that operations that are delayed because earlier batches are full can count their delay against the total work divided by $p$. But to keep the span of the sorting phase down to $O((\log p)^2)$, we need to restrict the batch size. It turns out that restricting to batches of size at most $p^2$ works. We cannot pipeline the first slab (particularly the rebalancing), but the preliminary phase and separation phase would only take $O((\log p)^2)$ span. The execution phase and rebalancing phases are still carried out as before on the first slab, taking $O((\log p)^2)$ span, but execution and rebalancing on the final slab are pipelined, by having each final slab section $S[k]$ process the batch passed to it and rebalance the preceding segments $S_0[k-1]$ and $S_1[k-1]$ if necessary. To guarantee that this local rebalancing is possible, we do not allow $S[k]$ to proceed if it is imbalanced or if there are more than $c(k)$ pending operations in the buffer to $S[k+1]$. In such a situation, $S[k]$ must stop and reactivate $S[k+1]$, which would clear its buffer and rebalance $S[k]$ before restarting $S[k]$. It may be that $S[k+1]$ also cannot proceed for the same reason and is stopped in the same manner, and so $S[k]$ may be delayed by such a stop for a long time. But by a suitable accounting argument we can bound the total delay due to all such stops by the total work divided by $p$. Similarly, we do not allow the first slab to run (on a new batch) if $S[m-1]$ is imbalanced or there are more than $c(m-1)$ pending operations in the buffer to $S[m]$. Finally, we use an odd-even locking scheme to ensure that the segments in the final slab do not interfere with each other yet can proceed at a consistent pace. The details are below.

5.1 Description of FS2

FS2: Parallel buffer → (input batch) → Feed buffer → (size-$p^2$ cut batch) → First slab → (sort) → Final slab

First slab: → S[0] → S[1] → ··· → S[m−1] →, where $m = \lceil \log\log(5p^2) \rceil$

Final slab: S[m−1] → S[m] → S[m+1] → S[m+2] → ··· → S[l], with a buffer before each final slab section and a neighbour-lock (arrows numbered 1 and 2) between each pair of adjacent sections

Figure 3: FS2 Sketch; the final slab is pipelined, facilitated by locks between adjacent sections

We shall now give the details (see Figure 3). We will need the bunch structure (Appendix Definition 23) for aggregating batches, which is an unsorted set supporting both addition of a batch of new elements within $O(1)$ work/span and conversion to a batch within $O(b)$ work and $O(\log b)$ span if it has size $b$.

FS2 has the same sections as in FS1, with the first slab comprising the first $m = \lceil \log\log(5p^2) \rceil$ sections, and the final slab comprising the other sections. FS2 uses a feed buffer, which is a queue of bunches of operations each of size exactly $p^2$ except the last (which can be empty). Whenever FS2 is notified of input (by the parallel buffer), it reactivates the first slab.
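The interface of the bunch structure described above can be mirrored by a short sequential sketch. This is not the construction of Appendix Definition 23, which is not reproduced in this section. It only illustrates why adding a whole batch can cost O(1) while conversion costs O(b) work, and the class and method names are hypothetical.

```python
class Bunch:
    """Unsorted collection built from whole batches (sequential stand-in)."""

    def __init__(self):
        self._batches = []  # batches are stored as given, never merged eagerly
        self.size = 0       # total number of elements over all stored batches

    def add_batch(self, batch):
        """Add a batch of new elements with a single append."""
        self._batches.append(batch)
        self.size += len(batch)

    def to_batch(self):
        """Flatten all stored batches into one list, in O(size) work."""
        return [x for batch in self._batches for x in batch]
```

A parallel version would flatten the stored batches with a parallel prefix sum over their lengths to reach the stated O(log b) span; the list comprehension here is purely sequential.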

Each section $S[k]$ in the final slab has a buffer before it (for pending operations from $S[k-1]$), which for each access type uses an optimal batch-parallel map (Appendix Section A.3) to store bunches of group-operations of that type, where operations on the same item are in the same bunch. When a batch of group-operations on an item is inserted into the buffer, it is simply added to the correct bunch. Whenever we count operations in the buffer, we shall count them individually even if they are on the same item.

The first slab and each final slab section also has a deferred flag, which indicates whether its run is deferred until the next section has run. Between every pair of consecutive sections starting from after $S[m-1]$ is a neighbour-lock, which is a dedicated lock (see Section 2.1) with 1 key for each arrow to it in Figure 3.

Whenever the first slab is reactivated, it runs as follows:

1. If the parallel buffer and feed buffer are both empty, terminate.
2. Acquire the neighbour-lock between $S[m-1]$ and $S[m]$. (Skip steps 2 to 4 and steps 8 to 10 if $S[m]$ does not exist.)
3. If $S[m-1]$ has any imbalanced segment or $S[m]$ has more than $c(m-1)$ operations in its buffer, set the first slab's deferred flag and release the neighbour-lock, and then reactivate $S[m]$ and terminate.
4. Release the neighbour-lock.
5. Let $q$ be the size of the last bunch $F$ in the feed buffer. Flush the parallel buffer (if it is non-empty) and cut the input batch of size $b$ into small batches of size $p^2$ except possibly the first and last, where the first has size $\min(b, p^2 - q)$. Add that first small batch to $F$, and append the rest as bunches to the feed buffer.
6. Remove the first bunch from the feed buffer and convert it into a batch $B$, which we call a cut batch.
7. Process $B$ using the same four phases as in FS1 (Section 4.1), but restricted to the first slab (i.e. execute only the effectual operations on the first slab, and do segment rebalancing only on the first slab, and do chain rebalancing only if $S[m]$ had not existed before this processing). Furthermore, do not update the sizes of $S[m-1]$'s segments until after this processing (so that $S[m]$ in step 4 will not find any of $S[m-1]$'s segments imbalanced until the first slab rebalancing phase has finished).
8. Acquire the neighbour-lock between $S[m-1]$ and $S[m]$.
9. Insert the residual group-operations (on items that do not fit in the first slab) into the buffer of $S[m]$, and then reactivate $S[m]$.
10. Release the neighbour-lock.
11. Reactivate itself.

Whenever a final slab section $S[k]$ is reactivated, it runs as follows:

1. Acquire the neighbour-locks (between $S[k]$ and its neighbours) in the order given by the arrow number in Figure 3.
2. If $S[k]$ has any imbalanced segment or $S[k+1]$ (exists and) has more than $c(k)$ operations in its buffer, set $S[k]$'s deferred flag and release the neighbour-locks, and then reactivate $S[k+1]$ and terminate.
3. For each access type, flush and process the (sorted) batch $G$ of bunches of group-operations of that type in its buffer as follows:
   (a) Convert each bunch in $G$ to a batch of group-operations.
   (b) For each segment $S_i[k]$ in $S[k]$, cut out the group-operations on items that fit in $S_i[k]$ from $G$, and perform them (as a sorted batch) on $S_i[k]$, and then fork to return the results of the operations (according to the order within each group-operation).
   (c) If $G$ is non-empty (i.e. has leftover group-operations), insert $G$ into the buffer of $S[k+1]$ and then reactivate $S[k+1]$.
4. Rebalance locally as follows (essentially like in FS1):
   (a) For each segment $S_i[k]$ in $S[k]$:
      i. If $S_i[k-1]$ is overfull, shift items from $S_i[k-1]$ to $S_i[k]$ to make $S_i[k-1]$ have target size.
      ii. If $S_i[k-1]$ is underfull, shift items from $S_i[k]$ to $S_i[k-1]$ to make $S_i[k-1]$ have target size, and then remove $S_i[k]$ if it is emptied.
      iii. If $S_i[k]$ is (still) overfull and is the last segment in $S_i$, create a new segment $S_i[k+1]$ and reactivate it.
   (b) If $S[k]$ is (still) the last section, but chain $S_i$ is longer than chain $S_j$:
      i. Create a new segment $S_j[k]$ and shift all items from $S_i[k]$ to $S_j[k]$.
      ii. If $S_j[k-1]$ is (now) underfull, shift items from $S_j[k]$ to $S_j[k-1]$ to make $S_j[k-1]$ have target size.
      iii. If $S_j[k]$ is (now) empty again, remove $S[k]$.
5. If $k = m$, and the first slab is deferred, clear its deferred flag then reactivate it.
6. If $k > m$, and $S[k-1]$ is deferred, clear its deferred flag then reactivate it.
7. Release the neighbour-locks.
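Step 5 of the first slab's run (cutting the flushed input batch) can be sketched as follows. Plain lists stand in for batches and bunches, the cut is sequential rather than parallel, and the function name `cut_input_batch` is hypothetical.

```python
def cut_input_batch(batch, p, q):
    """Cut `batch` into a top-up piece and further pieces of size at most p*p.

    `q` is the current size of the last bunch F in the feed buffer.
    Returns (first, pieces): `first` has size min(b, p*p - q) and is meant to
    be added to F; every piece in `pieces` has size p*p except possibly the last.
    """
    cap = p * p
    if not 0 <= q <= cap:
        raise ValueError("the last bunch must hold between 0 and p*p operations")
    first_len = min(len(batch), cap - q)
    first, rest = batch[:first_len], batch[first_len:]
    pieces = [rest[i:i + cap] for i in range(0, len(rest), cap)]
    return first, pieces
```

After `first` is added to F, every bunch in the feed buffer except the last has size exactly p², which is the shape the description of the feed buffer requires.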

5.2 Analysis of FS2

For each computation, we shall define its delay to intuitively capture the minimum time it needs, including all potential waiting on locks. Each blocked acquire of a dedicated lock corresponds to an acquire-stall node $\alpha$ in the execution DAG whose child node $\rho$ is created by the release just before the successful acquisition of the lock. Let $\Delta(\alpha)$ be the ancestor nodes of $\rho$ that have not yet executed at the point when $\alpha$ is executed. Then the delay of a computation $\Gamma$ is recursively defined as the weighted span of $\Gamma$, where each acquire-stall node $\alpha$ in $\Gamma$ is weighted by the delay of $\Delta(\alpha)$ (to capture the total waiting at $\alpha$), and every other node is weighted by its cost (see footnote 4).

Whenever the first slab or a final slab section runs, we say that it defers if it terminates with its deferred flag set (i.e. at step 2), otherwise we say that it proceeds (i.e. to step 3) and eventually finishes (i.e. reaches the last step) with its deferred flag cleared.

We now establish some invariants, which guarantee that FS2 is always sufficiently balanced.

Lemma 15 (FS2 Balance Invariants). FS2 satisfies the following invariants:
1. When the first slab is not running, every segment in $S_i[0..m-2]$ is balanced and $S_i[m-1]$ has at most $2 \cdot t(m-1)$ items.
2. When a final slab section $S[k]$ rebalances a segment in $S[k-1]$ (in step 4a), it will make that segment have size $t(k-1)$.
3. Just after the last section finishes without creating new sections, the segments in $S[k]$ are balanced and both chains have the same length.
4. Each final slab section $S[k]$ always has at most $2 \cdot c(k-1)$ operations in its buffer.
5. Each final slab segment $S_i[k]$ always has at most $2 \cdot t(k)$ items, and at least $c(k-1)$ items unless $S[k]$ is the last section.

Proof. Invariant 1 holds as follows: The first slab proceeds only if $S[m-1]$'s segments are balanced, and from that point until after the rebalancing phase, its segments are modified only by itself (since $S[m]$ will not modify $S[m-1]$), and thereafter all its sections except $S[m-1]$ remain unmodified until it processes the next cut batch. Thus the same proof as for FS1 Segment Rebalancing Invariant (Lemma 9) shows that just before the segment rebalancing (step 4a) iteration for $S_i[m-1]$, for any imbalanced first slab segment $S_i[k]$, either $k = m-2$ or $S_i[k..m-2]$ are underfull. But note that the cut batch had at most $p^2 \le \frac{c(m-1)}{2}$ operations, and so after the execution phase, $S_i[m-1]$ had at least $\frac{c(m-1)}{2}$ items unless it was the last segment in its chain. Thus $S_i[0..m-2]$ will be made balanced (by step 4ai or step 4aii in the iteration for $S_i[m-1]$, or by step 4b). Similarly, $S[m-1]$ will have at most $t(m-1) + \sum_{a=0}^{m-1} c(a) + p^2 \le 2 \cdot t(m-1)$ items in each segment, since $\sum_{a=0}^{m-2} c(a) \le \frac{c(m-1)}{2}$.

Invariant 2 holds as follows. Each final slab section $S[k]$ proceeds only if each of its segments has at least $c(k)$ items unless it is the last segment in its chain, and its buffer had at most $2 \cdot c(k-1)$ operations by Invariant 4. Since $c(k) - 2 \cdot c(k-1) \ge t(k-1)$, rebalancing a segment in $S[k-1]$ (step 4a) will make it have size $t(k-1)$.

Invariant 3 holds as follows. The last section $S[l]$ proceeds only if each of its segments has at most $t(k) + c(k)$ items, and its buffer had at most $2 \cdot c(k-1) \le c(k)$ operations by Invariant 4. Thus if any of its segments $S_i[l]$ becomes overfull and it creates a new section $S[l+1]$, it will subsequently be deferred until $S[l+1]$ runs. And during that run of $S[l+1]$, it will proceed and shift at most $2 \cdot c(k)$ items from $S_i[l]$ to $S_i[l+1]$, after which $S_i[l+1]$ will not be overfull, and so $S[l+1]$ will not create another new section $S[l+2]$. Therefore we can assume that the chains' lengths never differ by more than one segment, and so the chain rebalancing (step 4b) will make the chains the same length while ensuring the segments in $S[k-1]$ and $S[k]$ are balanced.

Invariant 4 holds for $S[m]$, because the first slab proceeds only if $S[m]$'s buffer has at most $c(m-1)$ operations, and only processes a cut batch of size at most $p^2$, hence after that $S[m]$'s buffer will have at most $p^2 + c(m-1) < 2 \cdot c(m-1)$ operations. Invariant 4 holds for $S[k]$ for each $k > m$, because $S[k-1]$ proceeds only if $S[k]$'s buffer has at most $c(k-1)$ operations, and only processes a buffered batch of size at most $2 \cdot c(k-2)$ by Invariant 4 for $S[k-1]$, hence after that $S[k]$'s buffer will have at most $c(k-1) + 2 \cdot c(k-2) \le 2 \cdot c(k-1)$ operations.

Invariant 5 holds as follows. Each final slab segment $S_i[k]$ is modified only when either $S[k]$ or $S[k+1]$ runs, and the latter never makes $S_i[k]$ imbalanced. Consider each $S[k]$ run. It proceeds only if $S_i[k]$ has at most $t(k) + c(k)$ items and at least $c(k)$ items unless $S[k]$ is the last section, and its buffer had at most $2 \cdot c(k-1)$ operations by Invariant 4, and $S_i[k-1]$ had at most $2 \cdot t(k-1)$ items by Invariant 5 for $S[k-1]$. So at most $2 \cdot c(k-1)$ items were inserted into $S_i[k]$, and at most $t(k-1)$ items were shifted from $S_i[k-1]$ to $S_i[k]$. Also, at most $2 \cdot c(k-1)$ items were deleted from $S_i[k]$, and at most $t(k-1)$ items were shifted from $S_i[k]$ to $S_i[k-1]$. Thus after that run, $S_i[k]$ has at most $t(k) + c(k) + 4 \cdot c(k-1) \le 2 \cdot t(k)$ items and at least $c(k) - 2 \cdot c(k-1) - t(k-1) \ge c(k-1)$ items unless $S[k]$ was the last section, since $c(k) = c(k-1)^2 \ge c(m-1) \cdot c(k-1) \ge 5 \cdot c(k-1)$. □

With these invariants, we are ready to bound the work done by FS2.

4 The delay of $\Gamma$ depends on the actual execution, due to the definition of $\Delta(\alpha)$ for each acquire-stall node $\alpha$ in $\Gamma$. But it captures the minimum time needed to run $\Gamma$ in the following sense: For any computation $\Gamma$, on any step that executes all ready nodes in the remaining computation $\Gamma'$ (i.e. the unexecuted nodes in $\Gamma$), the delay of $\Gamma'$ is reduced. (So if a greedy scheduler is used, the number of steps in which some processor is idle is bounded by the delay.)

Theorem 16 (FS2 Work). FS2 takes $O(F_L)$ work for some linearization $L$ of FS2-calls in $D$.

Proof. We shall use a similar proof outline as for FS1 Work (Theorem 12). Let $L^*$ be a linearization of FS2-calls in $D$ such that:

• Operations on FS2 that finish during the first slab run or some final slab section run are ordered by when that run finished.
• Operations on FS2 that finish during the same first slab run are ordered as follows:
  1. Ineffectual operations are before effectual operations.
  2. Effectual operations are in order of access type (deletions last).
  3. Effectual insertions are in inward order, and effectual deletions are in outward order (Definition 11).
• Operations on FS2 in each group-operation are in the same order as in that group.

As before, let $L'$ be the same as $L^*$ except that in point 3 effectual deletions are ordered so that those on items in earlier sections are later (instead of outward order).

Consider each cut batch $B$ of operations processed by the first slab. By FS2 Balance Invariants (Lemma 15), just before that processing, every segment in $S_i[0..m-2]$ is balanced, and $S_i[m-1]$ has at most $2 \cdot t(m-1)$ items. Thus in both the preliminary phase and the execution phase, each section $S[k]$ takes $O(2^k)$ work per operation. And this amounts to $O(\log r + 1)$ work per operation in $B$ with finger distance $r$ according to $L'$, because the operation reaches $S[k]$ only if $r \ge \sum_{a=0}^{\min(k-1,\,m-2)} c(a) + 1$. As with FS1, the separation phase takes $O(F_{L^*})$ work in total (see Theorem 12's proof).

Now consider each batch $B$ of operations processed by a final slab section $S[k]$. By FS2 Balance Invariants (Lemma 15), $B$ has at most $2 \cdot c(k-1)$ operations, and each segment in $S[k]$ always has at most $4 \cdot c(k)$ items. So inserting the operations in $B$ into the buffer took $O(2^k)$ work per operation. Converting each bunch in $B$ to a group-operation takes $O(1)$ work per operation. Cutting out and performing and returning the results of the group-operations that fit takes $O(2^k)$ work per group-operation. And the local rebalancing takes $O(2^k)$ work. Therefore each $S[k]$ run that proceeds to process its buffered operations takes $O(2^k)$ work per operation. This again amounts to $O(\log r + 1)$ work per operation $X$ in $B$ with finger distance $r$ according to $L^*$ as follows:

• If $X$ finishes in $S[m]$: At that point the first slab has at least $c(m-1) - p^2 \ge 4p^2$ items in each chain, because $S[m-1]$ was balanced just before processing the last cut batch. Thus $r \ge 4p^2$ and hence $X$ costs $O(2^m) \subseteq O(\log r)$ work.
• If $X$ finishes in $S[k]$ for some $k > m$: At that point $S[k-1]$ has at least $c(k-2)$ items in each segment by FS2 Balance Invariants (Lemma 15). Thus $r \ge c(k-2)$ and hence $X$ costs $O(2^k) \subseteq O(\log r)$ work.

Finally, all the rebalancing takes $O(1)$ amortized work per operation, which we shall leave to the next lemma. □

Lemma 17 (FS2 Rebalancing Work). All the rebalancing steps of FS2 take $O(1)$ amortized work per operation.

Proof. We shall maintain the credit invariant that each segment $S_i[k]$ with $q$ items beyond its target capacity has at least $q \cdot 2^{-k}$ stored credits. Also, each unfinished operation carries 1 credit with it. As with FS1 (see Lemma 13's proof), the invariant can be preserved after rebalancing in the first slab. By the same reasoning, the invariant can also be preserved after segment rebalancing in the final slab (step 4a), because any shift between segments $S_i[k-1]$ and $S_i[k]$ where $k \ge m$ is performed only when $S_i[k-1]$ is imbalanced, and after that $S_i[k-1]$ has size $t(k-1)$ by FS2 Balance Invariants (Lemma 15). Similarly, the invariant can be preserved after chain rebalancing in the final slab (step 4b), because it takes $O(2^k)$ work, which can be ignored since the last segment rebalancing shift already took $O(2^k)$ work. □

To tackle the span of FS2, we need some lemmas concerning the span of cutting the input batch and the delay in each slab.

Lemma 18 (FS2 Input Cutting Span). The first slab cuts an input batch of size $b$ (i.e. cutting it into small batches and storing them in the feed buffer) within $O\left(\frac{b}{p} + \log p\right)$ span.

Proof. Cutting the input batch into small batches takes $O(\log b)$ span. Adding them to the feed buffer takes $O\left(1 + \frac{b}{p^2}\right)$ span. This amounts to $O\left(\frac{b}{p} + \log p\right)$ span because $\log b \in O\left(\frac{b}{p}\right)$ if $b > p^2$. □

Lemma 19 (FS2 Final Slab Delay). Each section $S[k]$ in the final slab runs within $O(2^k)$ delay (whether it defers or finishes).

Proof. Consider any final slab section $S[k]$ that has acquired the second neighbour-lock. Checking whether it has an imbalanced segment and checking $S[k+1]$'s buffer size takes only $O(1)$ delay. By FS2 Balance Invariants (Lemma 15), $S[k]$ has at most $2 \cdot c(k-1)$ operations in its buffer, and $S[k]$ always has at most $2 \cdot t(k)$ items in each segment, and $S[k-1]$ has at most $2 \cdot t(k-1)$ items in each segment. Thus converting each bunch in the buffer takes $O(2^k)$ span, and performing the operations that fit in $S[k]$ takes $O(2^k)$ span, and rebalancing the segments in $S[k-1]$ takes $O(2^k)$ span.

Now consider any final slab section $S[k]$ that has acquired the first neighbour-lock. It waits $O(2^k)$ delay for the current holder (if any) of the second neighbour-lock to release it, and then itself takes $O(2^k)$ more delay to complete its run.

Finally consider any final slab section $S[k]$ that starts running. If $k = m$, it waits $O(2^k)$ delay for the first slab to release the shared neighbour-lock, since the first slab takes only $O(2^m)$ span on each access to $S[m-1]$. If $k > m$, it waits $O(2^k)$ delay for the current holder of the first neighbour-lock to release it, and then itself takes $O(2^k)$ more delay to complete its run. □

Lemma 20 (FS2 First Slab Delay). The first slab takes $O(\log p)$ delay for each acquiring of the neighbour-lock, and it processes each cut batch within $O((\log p)^2)$ delay.

Proof. Each acquiring of the neighbour-lock takes $O(2^m) = O(\log p)$ delay by FS2 Final Slab Delay (Lemma 19). Checking whether $S[m-1]$ has an imbalanced segment and checking $S[m]$’s buffer size takes only $O(1)$ delay. Obtaining the cut batch (whose size is at most $p^2$) from the first bunch from the feed buffer takes $O(\log p)$ delay. The four phases take $O((\log p)^2)$ delay in total, as in FS1 (see Theorem 14). Inserting the residual group-operations into the buffer of $S[m]$ takes $O(\log p + 2^m) = O(\log p)$ delay, since $S[m]$’s buffer had at most $2 \cdot c(m-1)$ items by FS2 Balance Invariants (Lemma 15).

With these lemmas, we can finally bound the span of FS2.

Theorem 21 (FS2 Span). FS2 takes $O\left(\frac{N}{p} + d \cdot (\log p)^2 + s_L\right)$ span for some linearization $L$ of $D$. ($d$ is the maximum number of FS2-calls on any path in $D$, and $s_L$ is the weighted span of $D$ with FS2-calls weighted according to $F_L$.)

Proof. Take any path $C$ through the program DAG $D$. Let $L$ be the linearization in the proof of FS2 Work (Theorem 16). Consider any FS2-call $X$ along $C$ with finger distance $r$ according to $L$. We shall trace the journey of $X$ from the parallel buffer in an input batch to a cut batch and then through the slabs, and bound the delay taken by $X$ relative to FS2, meaning that in the computation of the delay we only count FS2-nodes. Along the way, we shall partition that delay into the normal delay and the deferment delay, where the latter comprises all waiting at the first slab or a section that defers from the point it sets the deferred flag until it is reactivated and clears the deferred flag (and proceeds).

Normal delay

At the start, $X$ waits in the parallel buffer for the first slab to finish running on the previous input batch of size $b'$, taking $O\left(\frac{b'}{p} + (\log p)^2\right)$ delay by FS2 Input Cutting Span (Lemma 18) and FS2 First Slab Delay (Lemma 20). Next $X$ waits for the first slab to process some $i$ cut batches of size $p^2$ in the feed buffer, each taking $O((\log p)^2) \subseteq O(p)$ normal delay. Then $X$ is flushed from the parallel buffer in some input batch of size $b$, which is cut within $O\left(\frac{b}{p} + (\log p)^2\right)$ normal delay, and next waits for another $j$ cut batches of size $p^2$ that come before $X$ in the feed buffer, each taking $O(p)$ normal delay. (Note that we are ignoring all waiting while the first slab is deferred.)

If $X$ finishes in the final slab, it waits a further $O(2^k)$ normal delay at each final slab section $S[k]$ that it passes through by FS2 Final Slab Delay (Lemma 19). And when $X$ finishes in a section $S[k]$, at that point $S[k-1]$ has at least $c(k-2)$ items in each segment by FS2 Balance Invariants (Lemma 15). Thus $r \ge c(k-2)$ and hence $X$ takes $O\left(\sum_{a=m}^{k} 2^a\right) = O(2^k) \subseteq O(\log r)$ normal delay in the final slab. Finally when $X$ is returned, it is in some group-operation with $g$ operations, so returning the results takes $O(\log g) \subseteq O\left(\frac{g}{p} + \log p\right)$ span.

Therefore in total $X$ takes $O\left(\frac{b'}{p} + \frac{b}{p} + \frac{g}{p} + i \cdot p + j \cdot p + (\log p)^2 + \log r\right)$ normal delay.

Deferment delay

To bound the deferment delay, we shall use a similar credit invariant as in FS2 Rebalancing Work (Lemma 17), but instead of paying for rebalancing work we shall use the credits to pay for $p$ times the deferment delay. This would imply that the deferment delay is at most $O\left(\frac{1}{p}\right)$ per operation on FS2. The invariant is that for $k \ge m-1$, each segment $S_i[k]$ with $q$ items beyond its target capacity has at least $q \cdot 2^{-k}$ stored credits, and that each operation in $S[k+1]$’s buffer carries $2^{-k}$ credits with it.

Consider each deferment of a section $S[k]$ for $k \ge m-1$ (where deferment of the first slab is treated as deferment of $S[m-1]$). At that point either one of its segments is imbalanced or $S[k+1]$’s buffer has more than $c(k)$ items, and $S[k]$ reactivates $S[k+1]$, which may either defer or proceed. In any case, from that point until $S[k+1]$ proceeds, $S[k]$ will never proceed (even if reactivated), because its segments and $S[k+1]$’s buffer remain untouched. But once $S[k+1]$ proceeds, it will empty its buffer and make $S[k]$’s segments balanced by FS2 Balance Invariants (Lemma 15), and then reactivate $S[k]$ on finishing, so $S[k]$ will proceed within $O(2^k)$ subsequent delay.

Thus if $X$ is waiting at $S[k]$ due to consecutive sections $S[k..j]$ being deferred, and $S[j+1]$ proceeding, the deferment at $S[k]$ lasts $O\left(\sum_{a=k}^{j+1} 2^a\right) = O(2^j)$ delay (by Lemma 15 again), and $p \cdot 2^j \le \sqrt{c(m)} \cdot 2^j \in O\left(c(j) \cdot 2^{-j}\right)$ since $2^{2j} \in O\left(\sqrt{c(j)}\right)$. If $S[j]$ had an imbalanced segment, it would have at least $c(j) \cdot 2^{-j}$ stored credits, and we can use half of it to pay for any needed extra stored credits at $S[j+1]$ due to the shift. If $S[j+1]$’s buffer had more than $c(j)$ items, then they carry $c(j) \cdot 2^{-j}$ credits, and we can use half to pay for any needed extra stored credits at $S[j+1]$ and for any credits carried by operations that go on to $S[j+2]$. In both cases, we can use the other half of those credits to pay for $p$ times the deferment delay that $X$ takes at $S[k]$.
Total delay

There are at most $d$ FS2-calls along $C$, and over all $X$, each of $b$, $b'$, $g$, $i \cdot p^2$, $j \cdot p^2$ above will sum up to at most the total number $N$ of FS2-calls, and the total deferment delay of all FS2-calls along $C$ is $O\left(\frac{N}{p}\right)$. Therefore the span of FS2 is $O\left(\frac{N}{p} + d \cdot (\log p)^2 + s_L\right)$. □

6 General Parallel Finger Structures

To support an arbitrary but fixed number $f$ of movable fingers (besides the fingers at the ends), while retaining both work-optimality with respect to the finger bound and good parallelism, we essentially use a basic parallel finger structure for each sector between adjacent fingers.

It is easier to do this with FS1, because we are processing the operations in batches. The finger-move operations are all done first in a finger phase before the rest of the batch, and of course we combine finger-move operations on the same finger. Consider any finger that is between two sectors $R_0$ and $R_1$. This finger is sandwiched between the nearest chain $S_i$ of $R_0$ and the nearest chain $S_{1-i}$ of $R_1$. To move this finger into chain $S_i$ of $R_0$ past an item in segment $S_i[k]$, we move all the items $I$ between the old and new finger position from $R_0$ to $R_1$, roughly as follows:

1. Cut out the items in $I$ from sector $R_0$’s segments $S_i[0..k]$ and join them (from small to big) into a single batch $B$.
2. Join the items in sector $R_1$’s segments $S_{1-i}[0..k]$ (from small to big) and shift them into $S_{1-i}[k+1]$ (by a single join).
3. Use $B$ to fill sector $R_1$’s sections $S_{1-i}[0..k]$ to target size except perhaps $S_{1-i}[k]$.
4. Rebalance $R_0$ and $R_1$ as in FS1’s rebalancing phase (Section 4.1).

This essentially contributes $O(2^k)$ work and $O(\log n)$ span, because we can preserve the same credit invariant to bound the rebalancing work and span. It is similar but messier for moving a finger so far that it goes over the nearer chain of $R_0$ and into its further chain.

After that, we can simply partition the map operations around the fingers and perform each part on the correct sector in parallel. This partitioning takes $O(b)$ work and $O(\log b)$ span for each batch of $b$ operations (see Appendix Section A.2), and $O(\log b) \subseteq O\left(\frac{b}{p} + \log p\right)$, and each sector takes $O(\log n)$ span. Thus we will obtain the desired work/span bounds (Theorem 7).

It is much harder for FS2, and considerably complicated, so we shall not attempt to explain it here.

7 Work-Stealing Schedulers

The bounds on the work and span of FS1 and FS2 in Section 4 and Section 5 hold regardless of the scheduler. The performance bounds for FS1 and FS2 in Section 1 require a greedy scheduler, in order to bound the parallel buffer cost. In practice, we do not have such schedulers. But we can design a suitable work-stealing scheduler in the QRMW pointer machine model that yields the desired time bounds (Theorem 3 and Theorem 4) on average, as we shall explain below.

We make the modest assumption that each processor (in the QRMW pointer machine) can generate a uniformly random integer in $[1..p]$ and convert it to a pointer given by a constant lookup-table within $O(1)$ steps. For instance, this can be done if each processor has local RAM of size $p$ (i.e. sole access to its own local memory with $p$ cells and $O(1)$ random access).

The blocking work-stealing scheduler in [12] is for an atomic message passing model, in which multiple concurrent accesses to each deque are arbitrarily queued and serviced one at a time. This can be supported by guarding each deque with a CLH lock [29], and the analysis carries over.

The non-blocking work-stealing scheduler in [7] assumes $O(1)$ memory contention cost, which is contrary to the QRMW contention model. Nevertheless, the combinatorial techniques in that paper can be adapted to prove the desired performance bounds for our implementation (Definition 22).

Definition 22 (Non-Blocking Work-Stealing Scheduler). The non-blocking work-stealing scheduler can be implemented in the QRMW pointer machine model as follows:

• Each processor $i \in [1..p]$ has:
    • A global deque $Q_i$ of DAG nodes, shared between owner and stealer using Dekker’s algorithm.
    • A global non-blocking lock $L_i$ (see Appendix Definition 36).
    • A local array $R_i[1..p]$ where $R_i[j]$ stores a pointer to $Q_j$ and a pointer to $L_j$. // Used implicitly wherever needed.
• Each processor $i$ does the following repeatedly:
    Access $Q_i$ as owner, removing the node $v$ at the bottom if it is non-empty.
    If $v$ exists (i.e. $Q_i$ was non-empty):
        Execute $v$.
        Access $Q_i$ as owner, inserting all the child nodes generated by $v$ at the bottom.
    Otherwise:
        Create Int $k$ uniformly randomly chosen from $[1..p]$.
        If TryLock($L_k$):
            Access $Q_k$ as stealer, removing the node $w$ at the top if it is non-empty.
            Unlock($L_k$).
            If $w$ exists (i.e. $Q_k$ was non-empty):
                Execute $w$.
                Access $Q_i$ as owner, inserting all the child nodes generated by $w$ at the bottom.

8 Conclusions

This paper presents two parallel finger structures that are work-optimal with respect to the finger bound, and the faster version has a lower span by using careful pipelining. Pipelining techniques to reduce the span of data structure operations have been explored before [10, 4]. As indicated by our results, the extended implicit batching framework combines nicely with pipelining and is a promising approach in the design of parallel data structures. Nevertheless, despite the common framework, the parallel finger structures in this paper and the parallel working-set map in [4] rely on different ad-hoc techniques and analysis, and it raises the obvious interesting question of whether there is a way to obtain a batch-parallel splay tree in the same framework, that satisfies both the working-set property and the finger property.

Appendix

Here we spell out the model details, building blocks and supporting theorems used in our paper.

A.1 QRMW Pointer Machine Model

QRMW stands for queued read-modify-write, as described in [15]. In this contention model, asynchronous processors perform memory accesses via read-modify-write (RMW) operations (including read, write, test-and-set, fetch-and-add, compare-and-swap), which are supported by almost all modern architectures. Also, to capture contention costs, multiple memory requests to the same memory cell are FIFO-queued and serviced one at a time, and the processor making each memory request is blocked until the request has been serviced.

In the parallel pointer machine, each processor has a fixed number of local registers and memory accesses are done only via pointers, which can be locally stored or tested for equality (but no pointer arithmetic).

The QRMW pointer machine model, introduced in [4], extends the parallel pointer machine model in [21] to RMW operations. In this model, each memory node has a fixed number of memory cells, and each memory cell can hold a single field, which is either an integer or a pointer. Each processor also has a fixed number of local registers, each of which can hold a single field. The basic operations that a processor can perform include arithmetic operations on integers in its registers, equality-test between pointers in its registers, creating a new memory node and obtaining a pointer to it, and RMW operations. An RMW operation can be performed on any memory cell via a pointer to the memory node that it belongs to.

All operations except for RMW operations take one step each. RMW operations on each memory cell are FIFO-queued to be serviced, and the first RMW operation in the queue (if any) is serviced at each time step. The processor making each memory request is blocked until the request has been serviced.
A.2 Parallel Batch Operations

We rely on the following basic operations on batches:

• Split a given batch of $n$ items into left and right parts around a given position, within $O(\log n)$ work/span.
• Partition a given batch of $n$ items into lower and upper parts around a given pivot, within $O(n)$ work and $O(\log n)$ span.
• Partition a sorted batch of $n$ items around a sorted batch of $k$ pivots, within $O(k \cdot \log n)$ work and $O(\log n + \log k)$ span.
• Join a batch of batches with $n$ total items, within $O(n)$ work and $O(\log n)$ span.
• Merge two sorted batches with $n$ total items, optionally combining duplicates, within $O(n)$ work and $O(\log n)$ span if the combining procedure takes $O(1)$ work/span.

These can be implemented in the QRMW pointer machine model [28] with each batch stored as a BBT (leaf-based height-balanced binary tree with an item at each leaf). They can also be implemented (more easily) in the binary forking model in [9] with each batch stored in an array. For instance, joining a batch of arrays can be done by using the standard prefix-sum technique to compute the total size of the first $k$ arrays, and hence we can copy each array in parallel into the final output array, and merging two sorted arrays can be done by the algorithm given in [26] (section 2.4) and [35].

A related data structure that we also rely on is the bunch data structure, which is defined as follows.

Definition 23 (Bunch Structure). A bunch is an unsorted set supporting addition of any batch of new elements within $O(1)$ work/span and conversion to a batch within $O(b)$ work and $O(\log b)$ span if it has size $b$.

A bunch can be implemented using a complete binary tree with batches at the leaves, with a linked list threaded through each level to support adding a new batch as a leaf in $O(1)$ work/span. To convert a bunch to a batch, we treat the bunch as a batch of batches and parallel join all the batches.
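As an illustration of the array-based join mentioned in A.2, the following Python sketch computes the prefix sums of the array sizes and then copies each array into its own slice of the output. It is sequential and only shows the data flow; the function name `join_arrays` is hypothetical, and a parallel version would compute the prefix sums and perform the copies in parallel, since the target slices are disjoint.

```python
from itertools import accumulate


def join_arrays(batches):
    """Join a list of arrays into one array using prefix sums of their sizes."""
    sizes = [len(batch) for batch in batches]
    # offsets[k] is the total size of the first k arrays (exclusive prefix sum).
    offsets = [0] + list(accumulate(sizes))
    output = [None] * offsets[-1]
    # Each array is written to a disjoint slice of the output, so these
    # copies are independent of one another (run sequentially here).
    for batch, start in zip(batches, offsets):
        output[start:start + len(batch)] = batch
    return output
```

Because the offset of each array depends only on the sizes of the arrays before it, no copy has to wait for another copy to finish.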
A.3 Batch-Parallel Map

In this paper we rely on a parallel map that supports the following operations:

• Unsorted batch search: Search for an unsorted input batch of $b$ items (not necessarily distinct), tagging each search item with the result, all within $O(b \cdot \log n)$ work and $O(\log b \cdot \log n)$ span, where $n$ is the map size.
• Sorted batch access: Perform an item-sorted input batch of $b$ operations on distinct items, tagging each operation with the result, all within $O(b \cdot \log n)$ work and $O(\log b + \log n)$ span, where $n$ is the map size before the batch access.
• Split: Split a map $M$ of size $k$ around a given pivot rank $r$ into two maps $M_1, M_2$, where $M_1$ contains the items with ranks at most $r$ in $M$, and $M_2$ contains the items with ranks more than $r$ in $M$, within $O(\log k)$ work/span.
• Join: Join maps $M_1, M_2$ of total size $k$ where every item in $M_1$ is less than every item in $M_2$, within $O(\log k)$ work/span.

This can be achieved in the QRMW pointer machine model [28], and also (more easily) in the binary forking model [9].

A.4 Parallel Buffer

To facilitate extended implicit batching, we can use any parallel buffer implementation with the following properties: it takes $O(p+b)$ work and $O(\log p + \log b)$ span per batch of size $b$ (on $p$ processors); any operation that arrives is (regardless of the scheduler) within $O(1)$ span included in the batch that is being flushed or in the next batch; and there are always at most $\frac{1}{2}p + q$ ready buffer nodes (active threads of the buffer), where $q$ is the number of operations that are currently buffered or being flushed. This would entail the following parallel buffer overhead [4] (and we reproduce the proof here).

Theorem 24 (Parallel Buffer Cost). Take any program $P$ using an implicitly batched data structure $M$ that is run using any greedy scheduler. Then the cost (Definition 6) of the parallel buffer for $M$ is $O\left(\frac{T_1 + w}{p} + d \cdot \log p\right)$, where $T_1$ is the work of all the $P$-nodes, and $w$ is the work taken by $M$, and $d$ is the maximum number of $M$-calls on any path in the program DAG $D$.

Proof. Let $t_1$ and $t_\infty$ be the total work and span (Definition 6) respectively of the parallel buffer for $M$. Let $N$ be the total number of operations on $M$. Consider each batch $B$ of $b$ operations on $M$. Let $t_B$ be the span taken by the buffer on $B$. If $b \le p^2$, then $t_B \in O(\log p)$. If $b > p^2$, then $t_B \in O(\log b) \subseteq O\left(\frac{b}{p}\right)$. Thus $t_B \in O\left(\frac{b}{p} + \log p\right)$ and hence $t_\infty \in O\left(\frac{N}{p} + d \cdot \log p\right)$.

Now consider the actual execution of the execution DAG $E$ of the program $P$ using $M$. At each time step, the buffer is processing at most two consecutive batches, so we shall analyze the buffer work done during the time interval for each pair of consecutive batches $B$ and $B'$, where $B$ has $b$ operations and $B'$ has $b'$ operations.

If $b + b' \ge \frac{1}{6}p$, then the buffer work done on $B$ and $B'$ is $O(b + b')$.

If $b + b' < \frac{1}{6}p$, then there are at most $\frac{1}{2}p + (b + b') < \frac{2}{3}p$ ready buffer nodes in $E$, so at least one of the following holds at each time step in this interval:

• At least $\frac{1}{6}p$ ready $P$-nodes in $E$ are being executed. These steps take at most $O(T_1)$ work over all intervals.
• At least $\frac{1}{6}p$ ready $M$-nodes in $E$ are being executed. These steps take at most $O(w)$ work over all intervals.
• At most $p$ ready nodes in $E$ are being executed. All ready buffer nodes in $E$ are being executed (by greedy scheduling), so over all intervals there are $O(t_\infty)$ such steps, taking $O(p \cdot t_\infty)$ work.

Therefore $\frac{t_1}{p} \in O\left(\frac{T_1 + w}{p} + t_\infty\right)$, and hence the buffer’s cost is $\frac{t_1}{p} + t_\infty \in O\left(\frac{T_1 + w}{p} + d \cdot \log p\right)$ since $N \le T_1$. □

The parallel buffer for each data structure $M$ can be implemented using a static BBT (leaf-based balanced binary tree), with a sub-buffer at each leaf node, one for each processor, and a flag at each internal node. Each sub-buffer stores its operations as the leaves of a complete binary tree with a linked list threaded through each level.

Whenever a thread $\tau$ makes a call to $M$, the processor running $\tau$ suspends it and inserts the call together with a callback (i.e. a structure with a pointer to $\tau$ and a field for the result) into the sub-buffer for that processor. Then the processor walks up the BBT from leaf to root, test-and-setting each flag along the way, terminating if it was already set. On reaching the root, the processor notifies $M$ (by reactivating it), which can decide when to flush the buffer. $M$ can also query whether the parallel buffer is non-empty, defined as whether the flag at the root is set. $M$ can eventually return the result of the call via the callback (i.e. by updating the result field and then resuming $\tau$).

Whenever the buffer is flushed (by $M$), all sub-buffers are swapped out by a parallel recursion on the BBT, replaced by new sub-buffers in a newly constructed static BBT. We then wait for all pending insertions into the old sub-buffers to be completed, before joining their contents into a single batch to be returned (to $M$). To do so, each processor $i$ has a flag $y_i$ initialized to true, and a thread field $\varphi_i$ initialized to null. Whenever it inserts an $M$-call $X$, it sets $y_i$ := false, then inserts $X$ into the (current) sub-buffer, then resumes $\varphi_i$ if TestAndSet($y_i$) = true. To wait for pending insertions into the old sub-buffer for processor $i$, we store a pointer to the current thread in $\varphi_i$ and then suspend it if TestAndSet($y_i$) = false.

Inserting into each sub-buffer can be done in $O(1)$ time. Test-and-setting each flag in the BBT also takes $O(1)$ time, because at most three processors ever access it. Each static BBT takes $O(p)$ work and $O(\log p)$ span to initialize. Each data structure call takes $O(p)$ work and $O(\log p)$ span for a processor to reach the root, because the flags ensure that only $O(1)$ work is done per node in traversing the BBT. Joining the contents of the sub-buffers takes $O(p+b)$ work and $O(\log p + \log b)$ span if the resulting joined batch is of size $b$. It is also easy to ensure that flushing uses at most $\frac{1}{2}p + b$ threads where $b$ is the size of the flushed batch.
Thus this parallel buffer implementation has the desired properties that support extended implicit batching. It is worth noting that the parallel buffer can be implemented in the dynamic multithreading paradigm, like all other data structures and algorithms in this paper, but it requires the ability for a thread to have O(1)-time access to the sub-buffer for the processor running it, so that it can insert each data structure-call into the sub-buffer in O(1) work/span. This can be done if each processor has a local array of size p (i.e. it is accessible only by that processor but supports O(1) random access) for each implicitly batched data structure, and each thread can retrieve the id of the processor running it. But in the QRMW pointer machine model this is not necessary if the program uses a fixed set of implicitly batched data structures, since each processor can be initialized with a (constant) pointer to a structure that always points to the current sub-buffer for that processor.

A.5 Sorting Theorems

The items in the search problem can come from any arbitrary set $S$ that is linearly ordered by a given comparison function, and we shall assume that $S$ has at least two items. As is standard, let $S^n$ be the set of all length-$n$ sequences from $S$. Search structures can often be adapted to implement sorting algorithms⁵, in which case any lower bound on complexity of sorting typically implies a lower bound on the costs of the search structure. For the proofs of FS1 Work and FS2 Work we need a crucial lemma that the entropy bound is a lower bound for (comparison-based) sorting, as precisely stated below.

Lemma 25 (Sorting Entropy Bound). For any sequence $I$ in $S^n$ with item frequencies $q_{1..u}$ (i.e. $\sum_{i=1}^{u} q_i = n$), any sorting algorithm requires $\Omega(H)$ comparisons on average over all (distinct) rearrangements of $I$, where $H = \sum_{i=1}^{u} q_i \cdot \log\frac{n}{q_i}$ is the entropy of $I$. [30]

From this we immediately get a relation (Theorem 28) between the entropy bound and the maximum finger bound (i.e. the maximum finger bound over all permutations), because we can use a finger-tree to perform sorting.

Definition 26 (Finger-Tree Sort). Let FSort be the sequential algorithm that sorts an input sequence $I$ as follows: Create an empty finger-tree $F$ (with one finger at each end) that stores linked lists of items. For each item $x$ in $I$, if $F$ already has a linked list of copies of $x$, then append $x$ to that linked list, otherwise insert a linked list containing just $x$ into $F$. At the end iterate through $F$ to produce the desired sorted sequence.

Definition 27 (In-order Item Frequencies). A sequence $I$ in $S^n$ is said to have in-order item frequencies $q_{1..u}$ if the $i$-th smallest item in $I$ occurs $q_i$ times in $I$.

Theorem 28 (Maximum Finger Bound). Take any sequence $I$ in $S^n$ with in-order item frequencies $q_{1..u}$. Then the maximum finger bound for $I$, defined as $MF_I = \sum_{i=1}^{u} q_i \cdot (\log \min(i, u+1-i) + 1)$, satisfies $MF_I \in \Omega(H)$ where $H = \sum_{i=1}^{u} q_i \cdot \log\frac{n}{q_i}$.

Proof. By the Sorting Entropy Bound (Lemma 25) let $J$ be a rearrangement of $I$ such that FSort($J$) takes $\Omega(H)$ comparisons. Clearly FSort($J$) also takes $O(MF_J) = O(MF_I)$ comparisons, and hence $MF_I \in \Omega(H)$. □

Finally we give a parallel sorting algorithm PESort that achieves the entropy bound for work but yet takes only $O((\log n)^2)$ span on a list of $n$ items, which we need in our parallel finger structure. For comparison, we also give the simpler parallel merge-sort PMSort. The input and output lists are each stored in a batch (leaf-based balanced binary tree), and these algorithms work in the QRMW pointer machine model. We shall use the following notation for every binary tree $T$: $T$.root is its root, and for each node $v$ of $T$, $v$.left and $v$.right are its child nodes, and $v$.height is the height of the subtree at $v$, and $v$.size is the number of leaves of the subtree at $v$.

Definition 29 (Parallel Merge-Sort). Let PMSort be the procedure that does the following on an input batch $I$ of items: If $I$.size ≤ 1, return $I$. Otherwise, compute in parallel $A$ = PMSort($I$.left) and $B$ = PMSort($I$.right), and then parallel merge (Section A.2) $A$ and $B$ into an item-sorted batch $C$, and then return $C$.

Theorem 30 (PMSort Costs). PMSort sorts every sequence $I$ in $S^n$ within $O(n \cdot \log n)$ work and $O((\log n)^2)$ span.

Proof. The claim follows directly from the work/span bounds for parallel merging (Section A.2) and $I$.height $\in O(\log n)$. □

Definition 31 (Parallel Entropy-Sort). Define a bundle of an item $x$ to be a BT (binary tree) in which every leaf has a tagged copy of $x$. Let PESort be the parallel merge-sort variant that does the following on an input batch $I$ of items: If $I$.size ≤ 1, return $I$. Otherwise, compute in parallel $A$ = PESort($I$.left) and $B$ = PESort($I$.right), and then parallel merge (Section A.2) $A$ and $B$ into an item-sorted batch $C$ of bundles, combining bundles of the same item into one by simply making them the child subtrees of a new bundle, and then return $C$. Then PESort($I$) returns an item-sorted batch of bundles, with one bundle (of all the tagged copies) for each distinct item in $I$, and clearly each bundle has height at most $I$.height.

Theorem 32 (PESort Costs). PESort sorts every sequence $I$ in $S^n$ with item frequencies $q_{1..u}$ within $O(H + n)$ work and $O((\log n)^2)$ span, where $H = \sum_{i=1}^{u} q_i \cdot \ln\frac{n}{q_i}$.

Proof. Consider the merge-tree $T$, in which each node is the result of parallel merging its child nodes. Note that $T$.height = $I$.height $\in O(\log n)$, and that each item in $I$ occurs in at most one bundle in each node of $T$. Clearly the work done is $O(1)$ times the total length of all the parallel merged batches (Section A.2). Thus the work done can be divided per item; work done on item $x$ takes $O(1)$ times the number of nodes of $T$ that contain a bundle of $x$, and there are $O\left(k \cdot \log\frac{n}{k} + k\right)$ such nodes where $k$ is the frequency of $x$ in $I$, by Lemma 33 below. Therefore PESort($I$) takes $O\left(\sum_{i=1}^{u} \left(q_i \cdot \log\frac{n}{q_i} + q_i\right)\right) \subseteq O(H + n)$ work. The span bound on PESort($I$) is immediate from the span bound on parallel merging (Section A.2).
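The bundle-combining merge of Definition 31 can be sketched as follows. This is a sequential Python illustration under stated assumptions, not the batch-parallel procedure: the input is a plain list rather than a BBT, copies are tagged by their input position, and the names `pesort`, `merge_bundles` and `bundle_leaves` as well as the tuple encoding of bundles are hypothetical.

```python
def merge_bundles(a, b):
    """Merge two item-sorted lists of (item, bundle) pairs.

    Bundles of the same item are combined in O(1) by making them the
    two child subtrees of a new bundle node.
    """
    out = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i][0] < b[j][0]:
            out.append(a[i])
            i += 1
        elif b[j][0] < a[i][0]:
            out.append(b[j])
            j += 1
        else:
            out.append((a[i][0], ("node", a[i][1], b[j][1])))
            i += 1
            j += 1
    out.extend(a[i:])
    out.extend(b[j:])
    return out


def pesort(items):
    """Return an item-sorted list of (item, bundle), one bundle per distinct item."""
    def sort_range(lo, hi):
        if hi == lo:
            return []
        if hi - lo == 1:
            # A single tagged copy; the tag is the input position.
            return [(items[lo], ("leaf", lo))]
        mid = (lo + hi) // 2
        # The two recursive calls are independent and would run in parallel.
        return merge_bundles(sort_range(lo, mid), sort_range(mid, hi))
    return sort_range(0, len(items))


def bundle_leaves(bundle):
    """List the tags stored at the leaves of a bundle, left to right."""
    if bundle[0] == "leaf":
        return [bundle[1]]
    return bundle_leaves(bundle[1]) + bundle_leaves(bundle[2])
```

For example, `pesort(list("banana"))` yields three bundles, for `a`, `b` and `n`. Each merged list holds one entry per distinct item instead of one per copy, which is where the saving over PMSort comes from.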

5 A sorting algorithm is a procedure that given any input sequence will output a sequence of pointers to the input items in sorted order.

Lemma 33 (BBT Subtree Size Bound). Given any BBT $T$ with $n$ leaves of which $k$ are marked with $k > 0$, and with each internal node marked iff it is on a path from the root to a marked leaf, the number of marked nodes of $T$ is $O\left(k \cdot \log\frac{n}{k} + k\right)$.

Proof. We shall iteratively change the set of marked leaves of $T$, and accordingly update the internal nodes so that each of them is marked iff it is on a path from the root to a marked leaf. At each step, if there is a marked node $u$ with a marked child $v$ and an unmarked child $w$ such that $v$ has two marked children, then unmark the rightmost marked leaf $x$ in the subtree at $v$ and mark the deepest leaf $y$ in the subtree at $w$. This will not decrease the number of marked nodes, because unmarking $x$ results in unmarking at most $v$.right.height internal nodes, and marking $y$ results in marking at least $w$.height internal nodes, and $v$.right.height ≤ $w$.height since $T$ is a BBT. Note that each step decreases the sum of the lengths of all the paths from the root to the marked nodes with two marked children, so this iterative procedure terminates after finitely many steps. After that, for every node $v$ with only one marked child, there is only one marked leaf in the subtree at $v$.

Let $A$ be the set of marked nodes with two marked children, and $B$ be the set of marked nodes not in $A$ but with a parent in $A$. Then there are exactly $(k-1)$ nodes in $A$, and exactly $k$ nodes in $B$, and the subtrees at nodes in $B$ are disjoint, so $\sum_{v \in B} v.\mathrm{size} \le n$. Since every marked node is either in $A$ or on the downward path of marked nodes from some node in $B$, the number of marked nodes is at most $(k-1) + \sum_{v \in B} (v.\mathrm{height} + 1) \in O\left(k + \sum_{v \in B} \log v.\mathrm{size}\right) \subseteq O\left(k + k \cdot \log\frac{n}{k}\right)$ by Jensen’s inequality. □

Remark. See [28] (Subtree Size Bound) for a generalization of Lemma 33 with a different proof, but if we want a bound with explicit constants then the above proof yields a tighter bound for a BBT.
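To get a feel for the quantity bounded in Lemma 33, the following Python sketch counts the marked nodes in the special case of a perfect binary tree with $2^h$ leaves. The function name `count_marked_nodes` and the indexing of leaves by integers are assumptions of this sketch. In a perfect tree, the level at depth $d$ has at most $\min(2^d, k)$ marked nodes, so the count is at most $k \cdot \log_2\frac{n}{k} + 3k$; the constant 3 comes from this level-by-level count for perfect trees only and is not claimed for general BBTs.

```python
def count_marked_nodes(height, marked_leaves):
    """Count the nodes lying on root-to-leaf paths to the marked leaves.

    The tree is a perfect binary tree with 2**height leaves, indexed
    0 .. 2**height - 1 from left to right.
    """
    level = set(marked_leaves)   # marked nodes on the current level
    total = len(level)
    for _ in range(height):
        # The parent of node `index` on one level is `index // 2` on the next.
        level = {index // 2 for index in level}
        total += len(level)
    return total
```

Spreading the marked leaves evenly keeps the root-to-leaf paths apart for as long as possible, which is the situation the exchange argument in the proof steers towards.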

PESort is all we need for the parallel finger search structures FS1 and FS2, but we can in fact obtain a full parallel entropy-sorting algorithm, namely one that outputs a single item-sorted batch of all the (tagged copies of) items in the input sequence $I$ from $S^n$ and satisfies the entropy bound for work. Specifically, we can convert each bundle in PESort($I$) to a batch (Definition 34), and then parallel join (Section A.2) all those batches to obtain the desired output.

Definition 34 (Bundle Balancing). A bundle $B$ of size $b$ and height $h$ is balanced as follows: Recursively construct a linked list through all the leaves of $B$, and mark the leaves of $B$ with (1-based) rank of the form $(i \cdot h + 1)$, and then extract those marked leaves as a batch $P$ (by parallel filtering as described in [28]). Then at each leaf $v$ in $P$, construct and store at $v$ a batch of the items in $B$ with ranks $i \cdot h + 1$ to $(i+1) \cdot h$, obtained by traversing the linked list forward. Now $P$ is essentially a batch of size-$h$ batches (except perhaps the last smaller batch), which we then recursively join to obtain the batch of all items in $B$ (alternatively, but less efficiently, simply parallel join $P$).

Theorem 35 (Bundle Balancing Costs). Balancing a bundle $B$ of size $b$ and height $h$ takes $O(b)$ work and $O(h)$ span.

Proof. Note that $B$ has fewer internal nodes than leaves, and so constructing the linked list takes $O(b)$ work and $O(h)$ span. Extracting the batch $P$ of items of $B$ with ranks at intervals of $h$ takes $O(b + P.\mathrm{size} \cdot h) = O(b)$ work and $O(h)$ span. Constructing the batches of items in-between those in $P$ takes $O(b)$ work and $O(P.\mathrm{height} + h) \subseteq O(h)$ span, and recursively joining them takes $O(1)$ work and span per node of $P$ (except $O(h)$ span for the first joining involving the last batch). □

A.6 Locking Mechanisms

Here we give pseudo-code implementations of the various locking mechanisms used as primitives in this paper (Section 2.1), which have the claimed properties under the QRMW memory contention model.

The non-blocking lock is trivially implemented using test-and-set as shown in TryLock/Unlock below.

Definition 36 (Non-Blocking Lock).
    TryLock( Bool x ):
        Return ¬TestAndSet(x).
    Unlock( Bool x ):
        Set x := false.

Next is the reactivation wrapper for a procedure P, which can be implemented using fetch-and-add and guarantees the following according to some linearization [28]:

1. Whenever P is reactivated, there will be a complete run of P that starts after that reactivation.
2. If P is run only via reactivations, then no runs of P overlap, and there are at most as many runs of P as reactivations of P.
3. If P is reactivated by only k threads at any time, then each reactivation call C finishes within O(k) span, and some run of P starts within O(k) span after the start of C or the end of the last run of P that overlaps C.

Definition 37 (Reactivation Wrapper). (P is the procedure to be guarded by the wrapper.)
    Private Procedure P.
    Private Int count := 0.
    Public Reactivate():
        If FetchAndAdd(count,1) = 0:
            Fork the following:
                Do:
                    Set count := 1.
                    P().
                While FetchAndAdd(count,−1) > 1.

The dedicated lock with keys [1..k], where threads must use distinct keys to acquire it, can be implemented using fetch-and-add as shown below and guarantees the following according to some linearization [4]:

1. Mutual exclusion: Only one thread can hold the lock at any point in time; a thread becomes the lock holder when it successfully acquires the lock, and must release the lock before the next successful acquisition.
2. Fairness and bounded latency: When any thread attempts to acquire the dedicated lock, it will become a pending holder within O(k) span, and each pending holder will successfully acquire the lock after at most 1 subsequent successful acquisition per key (if every lock holder eventually releases the lock). And whenever the lock is released, if there is at least one pending holder then within O(k) span the lock would be successfully acquired again.

Definition 38 (Dedicated Lock). (k is the number of keys.)
    Private Int count := 0.
    Private Int last := 0.
    Private Array q[1..k] initialized with null.
    Public Acquire( Int i ):
        If FetchAndAdd(count,1) = 0:
            Set last := i.
            Return.
        Otherwise:
            Write pointer to current thread into q[i].
            Suspend current thread.
    Public Release():
        If FetchAndAdd(count,−1) > 1:
            Create Int j := last.
            Create Pointer t := null.
            While t = null:
                Set j := j%k+1.
                If q[j] ≠ null, then swap t,q[j].
            Set last := j.
            Resume t.

It is worth mentioning that we can easily replace the array q[1..k] in the above implementation by a cyclic linked list, and use the linked list nodes instead of integers as the keys.
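The reactivation wrapper of Definition 37 can be mimicked in Python as a rough sketch. Several parts are assumptions made only for illustration: fetch-and-add is emulated by a mutex-guarded integer (the class `AtomicInt` is hypothetical), "fork" is an OS thread, and the helper `wait_idle` exists only so that a demo can wait for the forked runs. None of this reflects the QRMW cost model; it shows only the control flow.

```python
import threading


class AtomicInt:
    """Integer with fetch-and-add, emulated with a mutex (an assumption)."""

    def __init__(self, value=0):
        self._value = value
        self._guard = threading.Lock()

    def fetch_and_add(self, delta):
        with self._guard:
            old = self._value
            self._value += delta
            return old

    def store(self, value):
        with self._guard:
            self._value = value


class ReactivationWrapper:
    def __init__(self, procedure):
        self._procedure = procedure
        self._count = AtomicInt(0)
        self._forked = []                      # runner threads, kept for wait_idle
        self._forked_guard = threading.Lock()

    def reactivate(self):
        # Only the reactivation that moves count from 0 forks a runner.
        if self._count.fetch_and_add(1) == 0:
            runner = threading.Thread(target=self._run_until_quiet)
            with self._forked_guard:
                self._forked.append(runner)
            runner.start()

    def _run_until_quiet(self):
        while True:
            self._count.store(1)               # absorb reactivations seen so far
            self._procedure()
            # Stop unless some reactivation arrived during this run.
            if self._count.fetch_and_add(-1) <= 1:
                return

    def wait_idle(self):
        """Join all runners; call only after every reactivate() has returned."""
        with self._forked_guard:
            runners = list(self._forked)
        for runner in runners:
            runner.join()
```

A reactivation that arrives while the procedure is running raises the count above 1, so the runner loops once more instead of a second runner being forked. This is how runs are kept from overlapping while every reactivation is still followed by a complete run.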

References

[1] Yehuda Afek, Haim Kaplan, Boris Korenfeld, Adam Morrison, and Robert E Tarjan. The cb tree: a practical concurrent self-adjusting search tree. Distributed computing, 27(6):393–417, 2014.
[2] Yehuda Afek, Haim Kaplan, Boris Korenfeld, Adam Morrison, and Robert Endre Tarjan. Cbtree: A practical concurrent self-adjusting search tree. In DISC, volume 7611 of Lecture Notes in Computer Science, pages 1–15. Springer, 2012.
[3] Kunal Agrawal, Jeremy T Fineman, Kefu Lu, Brendan Sheridan, Jim Sukha, and Robert Utterback. Provably good scheduling for parallel programs that use data structures through implicit batching. In Proceedings of the 26th ACM symposium on Parallelism in algorithms and architectures, pages 84–95. ACM, 2014.
[4] Kunal Agrawal, Seth Gilbert, and Wei Quan Lim. Parallel working-set search structures. In Proceedings of the 30th ACM symposium on Parallelism in algorithms and architectures, pages 321–332. ACM, 2018.
[5] Yaroslav Akhremtsev and Peter Sanders. Fast parallel operations on search trees. In 2016 IEEE 23rd International Conference on High Performance Computing (HiPC), pages 291–300. IEEE, 2016.
[6] Vitaly Aksenov, Petr Kuznetsov, and Anatoly Shalyto. Parallel Combining: Benefits of Explicit Synchronization. In Jiannong Cao, Faith Ellen, Luis Rodrigues, and Bernardo Ferreira, editors, 22nd International Conference on Principles of Distributed Systems (OPODIS 2018), volume 125 of Leibniz International Proceedings in Informatics (LIPIcs), pages 11:1–11:16, Dagstuhl, Germany, 2018. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik.
[7] Nimar S Arora, Robert D Blumofe, and C Greg Plaxton. Thread scheduling for multiprogrammed multiprocessors. Theory of computing systems, 34(2):115–144, 2001.
[8] Guy E Blelloch, Daniel Ferizovic, and Yihan Sun. Just join for parallel ordered sets. In Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures, pages 253–264. ACM, 2016.
[9] Guy E Blelloch, Jeremy T Fineman, Yan Gu, and Yihan Sun. Optimal parallel algorithms in the binary-forking model. arXiv preprint arXiv:1903.04650, 2019.
[10] Guy E. Blelloch and Margaret Reid-Miller. Pipelining with futures. In Proceedings of the ninth annual ACM symposium on Parallel algorithms and architectures, SPAA ’97, pages 249–259, New York, NY, USA, 1997. ACM.
[11] Guy E. Blelloch and Margaret Reid-Miller. Fast set operations using treaps. In Proceedings of the tenth annual ACM symposium on Parallel algorithms and architectures, pages 16–26, 1998.
[12] Robert D Blumofe and Charles E Leiserson. Scheduling multithreaded computations by work stealing. Journal of the ACM (JACM), 46(5):720–748, 1999.
[13] Trevor Brown, Faith Ellen, and Eric Ruppert. A general technique for non-blocking trees. In ACM SIGPLAN Notices, volume 49, pages 329–342. ACM, 2014.
[14] Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein. Introduction to Algorithms. The MIT Press, third edition, 2009.
[15] Cynthia Dwork, Maurice Herlihy, and Orli Waarts. Contention in shared memory algorithms. Journal of the ACM (JACM), 44(6):779–805, 1997.
[16] Faith Ellen, Panagiota Fatourou, Joanna Helga, and Eric Ruppert. The amortized complexity of non-blocking binary search trees. In Proceedings of the 2014 ACM symposium on Principles of distributed computing, pages 332–340. ACM, 2014.
[17] Faith Ellen, Panagiota Fatourou, Eric Ruppert, and Franck van Breugel. Non-blocking binary search trees. In Proceedings of the 29th ACM SIGACT-SIGOPS Symposium on Principles of Distributed Computing, PODC ’10, pages 131–140, New York, NY, USA, 2010. ACM.
[18] Stephan Erb, Moritz Kobitzsch, and Peter Sanders. Parallel bi-objective shortest paths using weight-balanced b-trees with bulk updates. In International Symposium on Experimental Algorithms, pages 111–122. Springer, 2014.
[19] Panagiota Fatourou and Nikolaos D. Kallimanis. Revisiting the combining synchronization technique. In PPoPP, pages 257–266, 2012.
[20] Matteo Frigo, Charles E. Leiserson, and Keith H. Randall. The implementation of the Cilk-5 multithreaded language. In Proceedings of the ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 212–223, 1998.
[21] Michael T Goodrich and S Rao Kosaraju. Sorting on a parallel pointer machine with applications to set expression evaluation. Journal of the ACM (JACM), 43(2):331–361, 1996.
[22] Leo J Guibas, Edward M McCreight, Michael F Plass, and Janet R Roberts. A new representation for linear lists. In Proceedings of the ninth annual ACM symposium on Theory of computing, pages 49–60. ACM, 1977.
[23] Danny Hendler, Itai Incze, Nir Shavit, and Moran Tzafrir. Flat combining and the synchronization-parallelism tradeoff. In Proceedings of the ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), pages 355–364, 2010.
[24] John Iacono. Alternatives to splay trees with O(log n) worst-case access times. In Proceedings of the twelfth annual ACM-SIAM symposium on Discrete algorithms, pages 516–522. Society for Industrial and Applied Mathematics, 2001.
[25] Intel Corporation. Intel Cilk Plus Language Extension Specification, Version 1.1, 2013. Document 324396-002US. Available from http://cilkplus.org/sites/default/files/open_specifications/Intel_Cilk_plus_lang_spec_2.htm.
[26] Joseph JáJá. An introduction to parallel algorithms, volume 17. Addison-Wesley Reading, 1992.
[27] S Rao Kosaraju. Localized search in sorted lists. In Proceedings of the thirteenth annual ACM symposium on Theory of computing, pages 62–69. ACM, 1981.
[28] Wei Quan Lim. Optimal multithreaded batch-parallel 2-3 trees. arXiv:1905.05254, 2019.
[29] Peter Magnusson, Anders Landin, and Erik Hagersten. Queue locks on cache coherent multiprocessors. In Parallel Processing Symposium, 1994. Proceedings., Eighth International, pages 165–171. IEEE, 1994.
[30] Ian Munro and Philip M Spira. Sorting and searching in multisets. SIAM journal on Computing, 5(1):1–8, 1976.
[31] OpenMP Architecture Review Board. OpenMP application program interface, version 4.0. Available from http://www.openmp.org/mp-documents/OpenMP4.0.0.pdf, July 2013.
[32] Y. Oyama, K. Taura, and A. Yonezawa. Executing parallel programs with synchronization bottlenecks efficiently. In Proceedings of the International Workshop on Parallel and Distributed Computing for Symbolic and Irregular Applications (PDSIA), pages 182–204, 1999.
[33] Wolfgang Paul, Uzi Vishkin, and Hubert Wagener. Parallel dictionaries on 2–3 trees. Automata, Languages and Programming, pages 597–609, 1983.
[34] James Reinders. Intel Threading Building Blocks: Outfitting C++ for Multi-Core Processor Parallelism. O’Reilly, 2007.
[35] Nodari Sitchinava. Ics 643: Advanced parallel algorithms lecture 10. http://www2.hawaii.edu/~nodari/teaching/f16/notes/notes10.pdf, 2016.
[36] Daniel Dominic Sleator and Robert Endre Tarjan. Self-adjusting binary search trees. Journal of the ACM (JACM), 32(3):652–686, 1985.
[37] The Task Parallel Library. http://msdn.microsoft.com/en-us/magazine/cc163340.aspx, October 2007.
[38] Thomas Tseng, Laxman Dhulipala, and Guy Blelloch. Batch-parallel euler tour trees. In 2019 Proceedings of the Twenty-First Workshop on Algorithm Engineering and Experiments (ALENEX), pages 92–106. SIAM, 2019.
