Parallel Finger Search Structures (arXiv:1908.02741v4 [cs.DS])

Seth Gilbert (National University of Singapore), Wei Quan Lim (National University of Singapore)

Keywords: Parallel data structures, multithreading, dictionaries, comparison-based search, distribution-sensitive algorithms

Abstract

In this paper¹ we present two versions of a parallel finger structure FS on $p$ processors that supports searches, insertions and deletions, and has a finger at each end. This is to our knowledge the first implementation of a parallel search structure that is work-optimal with respect to the finger bound and yet has very good parallelism (within a factor of $O((\log p)^2)$ of optimal). We utilize an extended implicit batching framework that transparently facilitates the use of FS by any parallel program $P$ that is modelled by a dynamically generated DAG $D$ where each node is either a unit-time instruction or a call to FS.

The total work done by either version of FS is bounded by the finger bound $F_L$ (for some linearization $L$ of $D$), i.e. each operation on an item with distance $r$ from a finger takes $O(\log r + 1)$ amortized work. Running $P$ using the simpler version takes $O\left(\frac{T_1 + F_L}{p} + T_\infty + d \cdot \left((\log p)^2 + \log n\right)\right)$ time on a greedy scheduler, where $T_1$ and $T_\infty$ are the size and span of $D$ respectively, $n$ is the maximum number of items in FS, and $d$ is the maximum number of calls to FS along any path in $D$. Using the faster version, this is reduced to $O\left(\frac{T_1 + F_L}{p} + T_\infty + d \cdot (\log p)^2 + s_L\right)$ time, where $s_L$ is the weighted span of $D$ where each call to FS is weighted by its cost according to $F_L$. We also sketch how to extend FS to support a fixed number of movable fingers.

The data structures in our paper fit into the dynamic multithreading paradigm, and their performance bounds are directly composable with other data structures given in the same paradigm. Also, the results can be translated to practical implementations using work-stealing schedulers.
Acknowledgements

We would like to express our gratitude to our families and friends for their wholehearted support, to the kind reviewers who provided helpful feedback, and to all others who have given us valuable comments and advice. This research was supported in part by Singapore MOE AcRF Tier 1 grant T1 251RES1719.

arXiv:1908.02741v4 [cs.DS] 10 Oct 2019

1 Introduction

There has been much research on designing parallel programs and parallel data structures. The dynamic multithreading paradigm (see [14] chap. 27) is one common parallel programming model, in which algorithmic parallelism is expressed through parallel programming primitives such as fork/join (also spawn/sync), parallel loops and synchronized methods, but the program cannot stipulate any mapping from subcomputations to processors. This is the case with many parallel languages and libraries, such as Cilk dialects [20, 25], Intel TBB [34], Microsoft Task Parallel Library [37] and subsets of OpenMP [31].

Recently, Agrawal et al. [3] introduced the exciting modular design approach of implicit batching, in which the programmer writes a multithreaded parallel program that uses a black box data structure, treating calls to the data structure as basic operations, and also provides a data structure that supports batched operations. Given these, the runtime system automatically combines the two components, buffering data structure operations generated by the program, and executing them in batches on the data structure.

This idea was extended in [4] to data structures that do not process only one batch at a time (to improve parallelism). In this extended implicit batching framework, the runtime system not only holds the data structure operations in a parallel buffer, to form the next input batch, but also notifies the data structure on receiving the first operation in each batch. Independently, the data structure can at any point flush the parallel buffer to get the next batch.
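The buffering protocol described above can be illustrated with a minimal sequential sketch. It is not the runtime system of [3] or [4]: the class name `ParallelBufferSketch`, its methods, and the callback `on_first_op` are hypothetical names chosen for illustration, and no real concurrency is modelled. The sketch only shows the two interactions named in the text: the data structure is notified when the first operation of a new batch arrives, and it may flush the buffer whenever it chooses.

```python
from typing import Callable, List, Tuple

Operation = Tuple[str, int]  # (operation name, key); an assumed encoding


class ParallelBufferSketch:
    """Sequential stand-in for the parallel buffer (hypothetical API)."""

    def __init__(self, on_first_op: Callable[[], None]) -> None:
        self._pending: List[Operation] = []
        self._on_first_op = on_first_op

    def submit(self, op: str, key: int) -> None:
        """Called by the program for each data structure call."""
        was_empty = not self._pending
        self._pending.append((op, key))
        if was_empty:
            # Notify the data structure that a new batch has started forming.
            self._on_first_op()

    def flush(self) -> List[Operation]:
        """Called by the data structure whenever it wants the next batch."""
        batch, self._pending = self._pending, []
        return batch


notifications: List[str] = []
buf = ParallelBufferSketch(lambda: notifications.append("batch started"))

buf.submit("insert", 5)
buf.submit("search", 3)
first_batch = buf.flush()   # the data structure takes what has accumulated

buf.submit("delete", 5)     # arrives later and starts a second batch
second_batch = buf.flush()
```

In this sketch the decision of when to call `flush` rests entirely with the data structure, so it could request the next batch before it has finished with the previous one.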
This framework nicely supports pipelined batched data structures, since the data structure can decide when it is ready to get the next input batch from the parallel buffer, which may be even before it has finished processing the previous batch. Furthermore, this framework makes it easy for us to build composable parallel algorithms and data structures with composable performance bounds. This is demonstrated by both the parallel working-set map in [4] and the parallel finger structure in this paper.

¹ This is the full version of a paper published in the 33rd International Symposium on Distributed Computing (DISC 2019). It is posted here for your personal or classroom use. Not for redistribution. © 2019 Copyright is held by the owner/author(s).

Finger Structures

The map (or dictionary) data structure, which supports inserts, deletes and searches/updates, collectively referred to as accesses, comes in many different kinds. A common implementation of a map is a balanced binary search tree such as an AVL tree or a red-black tree, which (in the comparison model) takes $O(\log n)$ worst-case cost per access for a tree with $n$ items. There are also maps such as splay trees [36] that have amortized rather than worst-case performance bounds.

A finger structure is a special kind of map that comes with a fixed finger at each end and a (fixed) number of movable fingers, each of which has a key (possibly $-\infty$ or $\infty$ or between adjacent items in the map) that determines its position in the map, such that accessing items nearer the fingers is cheaper. For instance, the finger tree [27] was designed to have the finger property in the worst case; it takes $O(\log r + 1)$ steps per operation with finger distance $r$ (Definition 1), so its total cost satisfies the finger bound (Definition 2).

Definition 1 (Finger Distance).
Define the finger distance of accessing an item $x$ on a finger structure $M$ to be the number of items from $x$ to the nearest finger in $M$ (including $x$), and the finger distance of moving a finger to be the distance moved.

Definition 2 (Finger Bound). Given any sequence $L$ of $N$ operations on a finger structure $M$, let $F_L$ denote the finger bound for $L$, defined by $F_L = \sum_{i=1}^{N} (\log r_i + 1)$ where $r_i$ is the finger distance of the $i$-th operation in $L$ when $L$ is performed on $M$.

Main Results

We present in this paper, to the best of our knowledge, the first parallel finger structure. In particular, we design two parallel maps that are work-optimal with respect to the finger bound $F_L$ (i.e. each takes $O(F_L)$ work) for some linearization $L$ of the operations (that is consistent with the results), while having very good parallelism. (We assume that each key comparison takes $O(1)$ steps.)

These parallel finger structures can be used by any parallel program $P$, whose actual execution is captured by a program DAG $D$, where each node is an instruction that finishes in $O(1)$ time or a call to the finger structure $M$, called an $M$-call, that blocks until the result is returned, and each edge represents a dependency due to the parallel programming primitives.

The first design, called FS1, is a simpler data structure that processes operations one batch at a time.

Theorem 3 (FS1 Performance). If $P$ uses FS1 (as $M$), then its running time on $p$ processes using any greedy scheduler (i.e. at each step, as many tasks are executed as are available, up to $p$) is $O\left(\frac{T_1 + F_L}{p} + T_\infty + d \cdot \left((\log p)^2 + \log n\right)\right)$ for some linearization $L$ of $M$-calls in $D$, where $T_1$ is the number of nodes in $D$, $T_\infty$ is the number of nodes on the longest path in $D$, $d$ is the maximum number of $M$-calls on any path in $D$, and $n$ is the maximum size of $M$.

Notice that if $M$ is an ideal concurrent finger structure (i.e.
one that takes $O(F_L)$ work), then running $P$ using $M$ on $p$ processors according to the linearization $L$ takes $\Omega(T_{opt})$ worst-case time where $T_{opt} = \frac{T_1 + F_L}{p} + T_\infty$. Thus FS1 gives an essentially optimal time bound except for the 'span term' $d \cdot \left((\log p)^2 + \log n\right)$, which adds $O\left((\log p)^2 + \log n\right)$ time per FS1-call along some path in $D$.

The second design, called FS2, uses a complex internal pipeline to reduce the 'span term'.

Theorem 4 (FS2 Performance). If $P$ uses FS2, then its running time on $p$ processes using any greedy scheduler is $O\left(\frac{T_1 + F_L}{p} + T_\infty + d \cdot (\log p)^2 + s_L\right)$ for some linearization $L$ of $M$-calls in $D$, where $d$ is the maximum number of FS2-calls on any path in $D$, and $s_L$ is the weighted span of $D$ where each FS2-call is weighted by its cost according to $F_L$, except that each finger-move operation is weighted by $\log n$. Specifically, each FS2-call that is an access with finger distance $r$ according to $L$ is given the weight $\log r + 1$, each FS2-call that is a finger-move is given the weight $\log n$, and $s_L$ is the maximum weight of any path in $D$.

Thus, ignoring finger-move operations, FS2 gives an essentially optimal time bound up to an extra $O((\log p)^2)$ time per FS2-call along some path in $D$.

We shall first focus on basic finger structures with just one fixed finger at each end, since we can implement the general finger structure with $f$ movable fingers by essentially concatenating $(f + 1)$ basic finger structures, as we shall explain later in Section 6. We will also discuss later in Section 7 how to adapt our results for work-stealing schedulers that can actually be provided by a real runtime system.
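To make Definitions 1 and 2 concrete, the following minimal sketch computes finger distances and the finger bound $F_L$ for a sequence of searches on a basic finger structure with one fixed finger at each end. Several things here are assumptions for illustration only: the map is modelled as a static sorted Python list, the operations are searches for keys that are present (so positions do not change), logarithms are taken base 2, and the function names `finger_distance` and `finger_bound` are hypothetical. The sketch evaluates the bound itself and does not implement FS1 or FS2.

```python
import math
from bisect import bisect_left
from typing import List, Sequence


def finger_distance(sorted_keys: Sequence[int], key: int) -> int:
    """Items from `key` to the nearest end finger, counting `key` itself."""
    i = bisect_left(sorted_keys, key)
    if i == len(sorted_keys) or sorted_keys[i] != key:
        raise KeyError(key)
    from_front = i + 1                # distance from the finger at the low end
    from_back = len(sorted_keys) - i  # distance from the finger at the high end
    return min(from_front, from_back)


def finger_bound(sorted_keys: Sequence[int], searches: Sequence[int]) -> float:
    """F_L = sum over operations of (log r_i + 1), with base-2 logs assumed."""
    return sum(math.log2(finger_distance(sorted_keys, k)) + 1 for k in searches)


keys: List[int] = list(range(1, 17))   # 16 items with keys 1..16
searches = [1, 16, 2, 8]
distances = [finger_distance(keys, k) for k in searches]
total = finger_bound(keys, searches)

print(distances)  # [1, 1, 2, 8]
print(total)      # 8.0
```

Searching for the two end items costs one unit each, while the search for the middle key 8 contributes $\log 8 + 1 = 4$, which is the sense in which accesses nearer the fingers are cheaper.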
