Communication-Efficient Parallel Sorting (Preliminary Version)

MICHAEL T. GOODRICH*
Johns Hopkins Univ.
goodrich@cs.jhu.edu

*This research supported by the NSF under Grants IRI-9116843 and CCR-9300079, and by ARO under grant DAAH04-96-1-0013.

Abstract

We study the problem of sorting n numbers on a p-processor bulk-synchronous parallel (BSP) computer, which is a parallel multicomputer that allows for general processor-to-processor communication rounds provided each processor sends and receives at most h items in any round. We provide parallel sorting methods that use internal computation time that is O((n log n)/p) and a number of communication rounds that is O(log n / log(h+1)) for h = Θ(n/p). The internal computation bound is optimal for any comparison-based parallel sorting algorithm. Moreover, the number of communication rounds is bounded by a constant for the (practical) situations when p ≤ n^{1-1/c} for a constant c ≥ 1. In fact, we show that our bound on the number of communication rounds is asymptotically optimal for the full range of values for p, for we show that just computing the "or" of n bits distributed evenly to the first O(n/h) of an arbitrary number of processors in a BSP computer requires Ω(log n / log(h+1)) communication rounds.

1 Introduction

Most of the research on parallel algorithm design in the 1970's and 1980's was focused on fine-grain massively-parallel models of computation (e.g., see [4, 7, 22, 24, 28, 37]), where the ratio of memory to processors is fairly small (typically O(1)), and this focus was independent of whether the model of computation was a parallel random-access machine (PRAM) or a network model, such as a mesh-of-processors. But, as more and more parallel computer systems are being built, researchers are realizing that processor-to-processor communication is a prime bottleneck in parallel computing [2, 6, 12, 26, 30, 31, 34, 40, 41]. The real potential of parallel computation, therefore, will most likely only be realized for coarse-to-medium-grain parallel systems, where the ratio of memory to processors is non-constant, for such systems allow an algorithm designer to balance communication latency with internal computation time. Indeed, this realization has given rise to several new computation models for parallel algorithm design, which all use what Valiant [40] calls "bulk synchronous" processing. In such a model an input of size n is distributed evenly across a p-processor parallel computer. In a single computation round (which Valiant calls a superstep) each processor may send and receive h messages and then perform an internal computation on its internal memory cells using the messages it has just received. To avoid any conflicts that might be caused by asynchronies in the network (whose topology is left undefined), the messages sent out in a round t by some processor cannot depend upon any messages that processor receives in round t (but, of course, they may depend upon messages received in round t-1). We refer to this model of computation as the Bulk-Synchronous Parallel (BSP) model.
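To make the superstep discipline concrete, here is a minimal sequential simulation of one BSP round. This is an editorial sketch in Python, not code from the paper; local_compute is a stand-in for whatever internal computation a processor performs.

    # Minimal sequential simulation of one BSP superstep (illustrative only).
    def local_compute(state, inbox):
        # Placeholder internal computation: append received items to memory.
        return state + inbox

    def superstep(states, outboxes, h):
        """One BSP round: realize an h-relation, then compute locally.
        states[i]   -- local memory (a list) of processor i
        outboxes[i] -- (dest, value) pairs prepared from states[i] alone,
                       so sends in round t cannot depend on round-t receipts
        """
        inboxes = [[] for _ in states]
        for i, msgs in enumerate(outboxes):
            assert len(msgs) <= h, "a processor may send at most h messages"
            for dest, value in msgs:
                inboxes[dest].append(value)
        assert all(len(box) <= h for box in inboxes), \
            "a processor may receive at most h messages"
        return [local_compute(s, box) for s, box in zip(states, inboxes)]

    # Four processors each send one value to processor 0 (h = 4 suffices).
    print(superstep([[i] for i in range(4)], [[(0, i)] for i in range(4)], h=4))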

1.1 The BSP Model

As with the PRAM family of computer models¹, the BSP model is distinguished by the broadcast and combining abilities of the network connecting the various processors. In the weakest version, which is the only version Valiant [40] considers, the network may not duplicate nor combine messages, but instead may only realize h-relations between the processors. We call this the EREW BSP model, noting that it is essentially the same as a model Valiant elsewhere [41] calls the XPRAM and one that Gibbons [19] calls the EREW phase-PRAM. It is also the communication structure assumed by the LogP model [12, 25], which is the same as the BSP model except that the LogP model does not explicitly require bulk-synchronous processing.

¹Indeed, a PRAM with as many processors as memory cells is a BSP model with h = 1, as is a module parallel computer (MPC) [31], which is also known as a distributed-memory machine (DMM) [23], for any memory size.

But it is also natural to allow for a slightly more powerful bulk-synchronous model, which we call the weak-CREW BSP model. In this model we assume processors are numbered 1, 2, ..., p, and that messages can be duplicated by the network so long as the destinations for any message are a contiguous set of processors {i, i+1, ..., j}. This is essentially the same as a model Dehne et al. [15, 16] refer to as the coarse-grain multicomputer. In designing an algorithm for this model one must take care to ensure that, even with message duplication, the number of messages received by a processor in a single round is at most h. Nevertheless, as we demonstrate in this paper, this limited broadcast capability can sometimes be employed to yield weak-CREW BSP algorithms that are conceptually simpler than their EREW BSP counterparts.

Finally, one can imagine more powerful instances of the BSP model, such as a CREW BSP model, which would allow for arbitrary broadcasts, or even a CRCW BSP model, which would also allow for messages to the same location to be combined (using some arbitration rule). (See also [19, 32].)

The running time of a BSP algorithm is characterized by two parameters: T_I, the internal computation time, and T_C, the number of communication rounds. A prime goal in designing a BSP algorithm is to minimize both of these parameters. Alternatively, by introducing additional characterizing parameters of the BSP model, we can combine T_I and T_C into a single running time parameter, which we call the combined running time. Specifically, if we let L denote the latency of the network (that is, the worst-case time needed to send one processor-to-processor message) and we let g denote the time "gap" between consecutive messages received by a processor in a communication round, then we can characterize the total running time of a BSP computation as O(T_I + (L + gh)T_C).² Incidentally, this is also the running time of implementing a BSP computation in the analogous LogP model [12, 25].

²There is also an o parameter in the LogP model, but it would be redundant with L and g in this bound.
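Under this accounting the combined running time is a simple formula; the following small helper (ours, not the paper's) evaluates it for the sorting bounds proved later, where c1 and c2 stand in for the constants hidden in the O-notation.

    from math import log

    def combined_time(n, p, L, g, c1=1.0, c2=1.0):
        """T_I + (L + g*h)*T_C for this paper's sorting bounds, with
        h = n/p, T_I = c1*(n log n)/p and T_C = c2*log n / log(n/p)."""
        h = n / p
        T_I = c1 * n * log(n, 2) / p
        T_C = c2 * log(n, 2) / log(n / p, 2)
        return T_I + (L + g * h) * T_C

    # Example: 2**20 keys on 2**10 processors, latency 100, gap 4.
    print(combined_time(n=2**20, p=2**10, L=100.0, g=4.0))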
The goal of this paper is to further the study of bulk-synchronous parallel algorithms by addressing the fundamental problem of sorting n elements distributed evenly across a p-processor BSP computer.

1.2 Previous work on parallel sorting

Let us, then, briefly review a small sample of the work previously done for parallel sorting. Batcher [5] in 1968 gave what is considered to be the first parallel sorting scheme, showing that in a fine-grained parallel model one can sort in O(log^2 n) time using O(n) processors. Since this early work there has been much effort directed at fine-grain parallel sorting algorithms (e.g., see Akl [4], Bitton et al. [7], JáJá [22], Karp and Ramachandran [24], and Reif [37]). Nevertheless, it was not until 1983 that it was shown, by Ajtai, Komlós, and Szemerédi [3], that n elements can be sorted in O(log n) time with an O(n log n)-sized network (see also Paterson [35]). In 1985 Leighton [27] extended this result to show that one can produce an O(n)-node bounded-degree network capable of sorting in O(log n) steps, based upon an algorithm he called "columnsort." In 1988 Cole [10] gave simple methods for optimal sorting in the CREW and EREW PRAM models in O(log n) time using O(n) processors, based upon an elegant "cascade mergesort" paradigm using arrays, and this result was recently extended to the Parallel Pointer Machine by Goodrich and Kosaraju [20]. Thus, one can sort optimally in these fine-grained models.

These previous methods are not optimal, however, when implemented in bulk-synchronous models. Nevertheless, Leighton's columnsort method [27] can be used to design a bulk-synchronous parallel sorting algorithm that uses a constant number of communication rounds, provided p^3 ≤ n. Indeed, there are a host of published algorithms for achieving such a result when the ratio of input size to number of processors is as large as this. For example, a randomized strategy, called sample sort, achieves this result with high probability [8, 17, 18, 21, 29, 38], as do deterministic strategies based upon regular sampling [33, 39]. These methods based upon sampling do not seem to scale nicely for smaller n/p ratios, however. If columnsort is implemented in a recursive fashion, then it yields an EREW BSP algorithm that uses T_C = O([log n / log(n/p)]^δ) communication rounds and internal computation time that is O(T_C (n/p) log(n/p)), where δ = 2/(log 3 - 1), which is approximately 3.419.
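For intuition, here is a back-of-the-envelope reconstruction of where an exponent of this form can come from (an editorial sketch, not an argument from the paper; it assumes that four of columnsort's eight phases recursively sort columns of length roughly m^{2/3}):

    \[ T_C(m) \;=\; 4\,T_C\!\left(m^{2/3}\right) + O(1), \qquad T_C(n/p) = O(1). \]
    % The recursion bottoms out after k levels, where (2/3)^k \log n = \log(n/p),
    % i.e., k = \log_{3/2}(\log n / \log(n/p)), so
    \[ T_C(n) \;=\; \Theta\!\left(4^{k}\right)
             \;=\; \Theta\!\left(\left[\frac{\log n}{\log(n/p)}\right]^{\log_{3/2} 4}\right),
       \qquad \log_{3/2} 4 \;=\; \frac{2}{\log_2 3 - 1} \;\approx\; 3.419. \]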

Using an algorithm they call "cubesort," Cypher and Sanz [13] show how to improve the T_C term in these bounds to O(25^{log* n - log*(n/p)} [log n / log(n/p)]^2), and Plaxton [36] shows how cubesort can be modified to achieve T_C = O([log n / log(n/p)]^2). Indeed, Plaxton³ can modify the "sharesort" method of Cypher and Plaxton [14] to achieve T_C = O((log n / log(n/p)) log^2(log n / log(n/p))). Finally, Chvátal [9] describes an approach of Ajtai, Komlós, Paterson, and Szemerédi for adapting the sorting network of Ajtai, Komlós, and Szemerédi [3] to achieve a depth of O(log n / log(n/p)), where the basic unit in the network is a "black box" that can sort ⌈n/p⌉ elements. An effective method for constructing such a network is not included in Chvátal's report, however, for the method he describes is a non-uniform procedure based upon the probabilistic method.

³Personal communication.

In addition, the constant factor in the running time appears to be fairly large. Incidentally, these latter methods [9, 13, 14, 27, 36] are actually defined for more-restrictive BSP models where the data elements cannot be duplicated and each internal computation must be a sorting of the internal-memory elements.

The only previous sorting algorithms we are aware of that were designed with the BSP model in mind are recent methods of Adler, Byers, and Karp [1] and Gerbessiotis and Valiant [18]. The method of Adler et al. runs in a combined time that is O((n log n)/p + pg + gL), provided p ≤ n^{1-δ} for some constant 0 < δ < 1. They do not define their algorithm for larger values of p, but they do give a slightly better implementation of their method in the LogP model so as to achieve a running time of O((n log n)/p + pg + L) for p similarly bounded. Gerbessiotis and Valiant give several randomized methods, the best⁴ of which runs with a combined time of O((n log n)/p + gp^ε + gn/p + L), with high probability, for any constant 0 < ε < 1, provided p ≤ n^{1-δ}, where δ is a small constant depending upon ε.

⁴They also give a method with a combined running time of O([(n/p) log^{α+1} p + L log^2 p + g log^{α+2} p + g(n/p) log p] / log log p), with high probability, provided p ≤ n / log^{α+1} p.

1.3 Our results

Given a set S of n items distributed evenly across p processors in a weak-CREW BSP computer we show how S can be sorted in O(log n / log(h+1)) communication rounds and O((n log n)/p) internal computation time, for h = Θ(n/p). The method is fairly simple and the constant factors in the running time are fairly small. Moreover, we also show how to extend our result to the EREW BSP model while achieving the same asymptotic bounds on the number of communication rounds and internal computation time. Our bounds on internal computation time are optimal for any comparison-based parallel algorithm. In addition, we achieve a deterministic combined running time that is O((n log n)/p + (L + gn/p)(log n / log(n/p))), which is valid for all values of p and improves the best bounds of Adler et al. [1] and Gerbessiotis and Valiant [18] even when p ≤ n^{1-δ} for some constant 0 < δ < 1, in which case our method sorts in a constant number of communication rounds. In fact, if p^3 ≤ n, then our method essentially amounts to a sample sort (with regular sampling). If p = Θ(n), then our method amounts to a pipelined parallel mergesort, achieving the same asymptotic performance as the fine-grained algorithms of Cole [10] and Goodrich and Kosaraju [20]. Thus, our method provides a sorting method that is fully-scalable over all values of p while achieving an optimal internal computation time over this entire range.

Indeed, we show that our bounds on the number of communication rounds needed to sort n elements on a BSP computer are also worst-case optimal for this entire range of values of p. We establish this by showing that simply computing the "or" of n bits distributed evenly across Θ(n/h) processors requires Ω(log n / log(h+1)) communication rounds, where each processor can send and receive h messages in a CREW BSP computer. This lower bound holds even if the number of additional processors and the number of additional memory cells per processor are unbounded. Since this lower bound is independent of the total number of processors and amount of memory in the multicomputer, it joins lower bounds of Mansour et al. [30] and Adler et al. [1] in giving further evidence that the prime bottleneck in parallel computing is communication, and not the number of processors nor the memory size.

2 A weak-CREW BSP Algorithm

Let S be a set of n items distributed evenly in a p-processor weak-CREW BSP computer. We sort the elements of S using a d-way parallel mergesort, pipelined in a way analogous to the binary parallel mergesort procedures of Cole [10] and Goodrich and Kosaraju [20].

Specifically, we choose d = max{⌈(n/p)^{1/2}⌉, 2}, and let T be a d-way rooted, complete, balanced tree such that each leaf is associated with a subset S_i ⊆ S of size at most ⌈n/p⌉. For each node v in T define U(v) to be the sorted list of elements stored at descendants of v in T, where we define v to be a descendant of itself if it is a leaf. Note that if {w_1, w_2, ..., w_d} denote the children of a node v in T, then U(v) = U(w_1) ∪ U(w_2) ∪ ... ∪ U(w_d). Our goal, then, is to construct U(root(T)). We may assume, without loss of generality, that the elements are distinct, for otherwise we can break ties using the original positions of the elements of S.

We perform this construction in a bottom-up pipelined way. In particular, we perform a series of stages, where in a Stage t we construct a list U_t(v) ⊆ U(v) for each node v that we identify as being active. A node v is full in Stage t if U_t(v) = U(v), and a node is active if U_t(v) ≠ ∅ and v was not full in Stage t-3. Likewise, we say that a list A stored at a node v in T is full if A = U(v). Initially, each leaf of T is full and active, whereas each internal node is initially inactive.
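For concreteness, the parameters of this merge tree can be tabulated as follows (an illustrative sketch of the paper's choices; the function name and the exact stage count are ours):

    from math import ceil

    def tree_parameters(n, p):
        """Branching factor, height, and rough stage count of the merge
        tree, with h = Theta(n/p)."""
        d = max(ceil((n / p) ** 0.5), 2)   # d^2 = Theta(n/p) = Theta(h)
        leaves = ceil(n / ceil(n / p))     # one leaf per block of ceil(n/p) items
        height, size = 0, 1
        while size < leaves:
            size *= d
            height += 1
        stages = 3 * height                # a parent fills 3 stages after its children
        return d, height, stages

    print(tree_parameters(n=2**20, p=2**10))   # (32, 2, 6)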

We say that a list B is a k-sample of a list A if B consists of every k-th element of A. For each active node v in T we define a sample L_t(v) as follows:

● If v is not full, then L_t(v) is a d^2-sample of U_t(v).

● If v first became full in Stage t, then we define L_t(v) to be a d^2-sample of U_t(v) = U(v); we define L_{t+1}(v) to be a d-sample of U_t(v), and we define L_{t+2}(v) = U(v) (i.e., L_{t+2}(v) is full).

We then define

U_t(v) = L_{t-1}(w_1) ∪ L_{t-1}(w_2) ∪ ... ∪ L_{t-1}(w_d),

where, again, {w_1, w_2, ..., w_d} denote the children of node v in T. Note that by our definition of L_t(v), if a node v becomes full in Stage t, then v's parent becomes full in Stage t+3. Thus, assuming we can implement each stage with a constant number of communication rounds using the p processors, then we will be able to sort the elements of S, by constructing U(root(T)), in just O(log_d n) = O(log n / log(h+1)) communication rounds, for h = Θ(n/p). Before we give the details for implementing each stage in our algorithm, however, we establish the following bounds (whose proofs are included in the full version):

Lemma 2.1: If at most k elements of U_t(v) are in an interval [a, b], then at most dk + 2d^2 elements of U_{t+1}(v) are in [a, b].

Intuitively, this lemma says that U_{t+1}(v) will not be wildly different from U_t(v). Similarly, we have the following corollary that relates L_{t+1}(v) and L_t(v):

Corollary 2.2: If at most k elements of L_t(v) are in an interval [a, b], then at most d(k+1) + 2 elements of L_{t+1}(v) are in [a, b].

Having given this important lemma and its corollary, let us now turn to the details of implementing each stage in our pipelined procedure using just a constant number of communication rounds.

2.1 Implementing each stage using a constant number of communication rounds

We say that a list A is ranked [10, 20] into a list B if, for each element a ∈ A, we know the rank of a's predecessor in B (based upon the ordering of elements in A ∪ B). If A is ranked in B and B is ranked in A, then A and B are cross-ranked. The generic situation at the end of any Stage t is that we have the following conditions satisfied at each node v in T.

Induction Invariants:

1. L_t(v) is ranked into L_{t-1}(v).

2. If v is not full, then L_{t-1}(w_i) is ranked in U_t(v), for each child w_i of v in T.

3. L_t(v) is ranked into U_t(v).

We maintain copies of the lists L_{t-1}(v), L_t(v), U_{t-1}(v), and U_t(v) for each active node v in T, and we do not maintain any other lists during the computation. As we shall show, this will allow us to implement the entire computation efficiently using just p processors. In order to implement each stage in our computation using just O(1) communication rounds we also maintain the following important load-balancing invariant at each node v in T.

Load-balancing Invariant:

● If a list A is not full, then A is partitioned into contiguous subarrays of size d each, with each subarray stored on a different processor.

● If a list A is full, then A is partitioned into contiguous subarrays of size d^2 each, with each subarray stored on a different processor.

We assume that the names of the nodes of T and the four lists stored at each node v are defined so that, given an index i into one of these lists, A, one can determine the processor holding A[i] as a local computation (not needing a communication step).⁵ Given that the induction and load-balancing invariants are satisfied for each node v in T, we can construct U_{t+1}(v) at each active node, with the above invariants satisfied for it, as follows.

⁵We maintain this assumption inductively, as we show in the full version.

Computation for Stage t + 1:

1. For each element a in L_t(w_i), let b(a) and c(a) respectively be the predecessor and successor of a in L_{t-1}(w_i). We can determine b(a) and c(a) in O(1) communication rounds, for each such a, since L_t(w_i) is ranked in L_{t-1}(w_i) by Induction Invariant 1. In fact, if L_t(w_i) = U(w_i), then this is essentially a local computation. Moreover, by our load-balancing invariant and Corollary 2.2, even in the general case, each processor (storing a portion of some L_{t-1}(w_i)) will receive (and then send) at most d(d+1) + 2 = Θ(h) messages to implement this step.

2. Determine the location (rank) of b(a) and c(a) in U_t(v). This can also be easily implemented with O(1) communication rounds, as in the previous step.

3. Broadcast a (and its rank in L_t(w_i)) to all processors holding elements of U_t(v) between b(a) and c(a). By our load-balancing invariant and Lemma 2.1 we can guarantee that each processor will receive at most 3d^2 + d = Θ(h) messages to implement this step (each processor receives at least one element from each child of v plus as many elements as fall in its interval of U_t(v)); hence, it can be done in O(1) communication rounds.

4. Each processor assigned to a contiguous portion [e, f) of U_t(v) receives the elements sent in the previous round and merges them via a simple d-way mergesort procedure to form a sublist of U_{t+1}(v) of size O(d^2) = O(h). It is important to observe that the processor for [e, f) receives at least one element from each child of v so as to include all the elements that may intersect the interval [e, f), even if none actually fall inside [e, f). This allows us to accurately compute the rank of each element in U_{t+1}(v) locally; hence, it gives us U_t(v) cross-ranked with U_{t+1}(v). Moreover, this step can be accomplished in O(1) communication rounds and O(d^2 log d) = O((n/p) log(n/p)) internal computation time.

5. For each element a in U_{t+1}(v) send a message to the processor holding a ∈ L_t(w_i) informing that copy of a of its rank in U_{t+1}(v). This step can easily be accomplished in O(1) communication rounds, and gives us Induction Invariant 2.

6. Determine the sample L_{t+1}(v) and rank it into U_{t+1}(v), giving us Induction Invariant 3. Also, use the cross-ranking of U_{t+1}(v) and U_t(v) to rank L_{t+1}(v) into L_t(v), giving us Induction Invariant 1. This step can easily be accomplished in O(1) communication rounds.

7. Finally, partition the four lists stored at each node v so as to satisfy the load-balancing invariant. Assuming the total size of all the non-full lists in T is O(n/d), this can easily be performed in O(1) communication rounds using p = Θ(n/d^2) processors.
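Ignoring the processor partition and all rank bookkeeping, the lists produced at a node in one stage can be pictured with the following toy, sequential stand-in (ours, not the paper's) for Steps 1-7:

    from heapq import merge

    def k_sample(A, k):
        return A[k - 1 :: k]

    def stage_at_node(child_samples, d):
        """Form U_{t+1}(v) and its d^2-sample from the children's samples
        L_t(w_1), ..., L_t(w_d). The real algorithm builds the same lists
        with O(1) communication rounds and all ranks computed locally."""
        U_next = list(merge(*child_samples))   # what Steps 1-4 accomplish
        L_next = k_sample(U_next, d * d)       # what Step 6 accomplishes
        return U_next, L_next

    children = [sorted(range(i, 64, 4)) for i in range(4)]  # toy L_t(w_i) lists
    print(stage_at_node(children, d=2))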

Therefore, given the above assumption regarding the total size of all the lists, in a constant number of communication rounds and an internal computation time that is O((n/p) log(n/p)) we can build the set U_{t+1}(v) and establish the induction and load-balancing invariants so as to repeat this procedure in Stage t+2.

Let us therefore analyze the total size of all the lists stored at nodes in T. Clearly, the size of all the full lists in T is O(n). Moreover, each such list contributes at most 1/d of its elements to the next higher level in T, and from then on up T each list on a level contributes at most 1/d^2 of its elements to lists on the next higher level in T. Thus, the total size of all non-full U_{t-1}(v) or U_t(v) lists forms a geometric series that sums to O(n/d), which is what we require. In addition, any sample L_t(v) or L_{t-1}(v) that is not full can contain at most 1/d of the elements of U(v); hence, the total space needed for all these lists is also O(n/d). This establishes the following:

Theorem 2.3: Given a set S of n items stored O(n/p) per processor on a p-processor weak-CREW BSP computer, one can sort S in O(log n / log(h+1)) communication rounds and O((n log n)/p) internal computation time, where h = Θ(n/p).

In achieving this result we exploited the broadcast capability of the weak-CREW BSP model (in Step 3). In the next section we show how to match the asymptotic performance of Theorem 2.3 without using such a capability.

3 An EREW BSP Algorithm

Suppose we are now given a set S of n items, which are distributed evenly across the p processors of an EREW BSP computer. Our goal is to sort S in O(log n / log(h+1)) communication rounds and O((n log n)/p) internal computation time without using any broadcasts, for h = Θ(n/p). We achieve this result using a cascading method similar to one used by Cole [10].

Let T be a complete rooted d-way tree with each of its leaves associated with a sublist S_i ⊆ S of size at most ⌈n/p⌉, where d = max{⌈(n/p)^{1/7}⌉, 2} (the reason for this choice will become apparent in the analysis). Our method proceeds in a series of stages, as in the weak-CREW BSP algorithm, with us constructing the set

U_t(v) = L_{t-1}(w_1) ∪ L_{t-1}(w_2) ∪ ... ∪ L_{t-1}(w_d)

in each stage, as before, where each L_t(v) list is defined to be a sample of U_t(v) as in our weak-CREW algorithm.
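The reason for this choice of d can be previewed now (an editorial note, anticipating the load-balancing invariant below): d^7 = Θ(n/p) = Θ(h), so even the size-d^7 subarrays that will store full lists can be shipped within a single round's per-processor message budget.

    from math import ceil

    def erew_degree(n, p):
        """d = max(ceil((n/p)**(1/7)), 2), so that d^7 = Theta(n/p)."""
        return max(ceil((n / p) ** (1.0 / 7.0)), 2)

    n, p = 2**30, 2**10
    d = erew_degree(n, p)
    print(d, d**7, n // p)   # d**7 is within a constant factor of n/p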

In order to perform this construction so as to avoid broadcasts, however, we will accomplish it by constructing a larger, augmented list, A_t(v), such that U_t(v) ⊆ A_t(v). We also define a list D_t(v) to be a d^2-sample of A_t(v). For each active node v, with parent u and children w_1, w_2, ..., w_d, we then define

A_t(v) = D_{t-1}(u) ∪ L_{t-1}(w_1) ∪ L_{t-1}(w_2) ∪ ... ∪ L_{t-1}(w_d),

i.e., A_t(v) = D_{t-1}(u) ∪ U_t(v). Intuitively, the D lists communicate information "down" the tree T in a way that allows us to avoid broadcasts. Indeed, once a copy of an element begins to traverse down the tree, then it will never again traverse up (since the D lists are only sent to children).

Still, even though we are assuming, without loss of generality, that the elements of S are distinct, this definition may create duplicate entries of an element in the same list, with some traversing down and at most one traversing up. We resolve any ambiguities this may create by breaking comparison ties based upon an upward-traversing element always being greater than any downward-traversing element, and any comparison between downward-traversing elements being resolved based upon the level in T where the elements first began traversing down (where level numbers increase as one traverses down T).

The goal of each Stage t in the computation, then, is to construct A_t(v) and U_t(v), together with their respective samples D_t(v) and L_t(v).
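The tie-breaking rule can be encoded as an explicit comparator; the sketch below is one consistent reading of it (the paper does not spell out the direction of the ordering among downward starting levels, so that detail is an assumption here):

    from functools import total_ordering

    @total_ordering
    class Copy:
        """A copy of an element moving through T: down_level is the level at
        which a downward copy began descending (levels grow downward), or
        None for the single upward-traversing copy."""
        def __init__(self, key, down_level=None):
            self.key, self.down_level = key, down_level

        def _rank(self):
            if self.down_level is None:            # upward copy: greatest tie
                return (self.key, 1, 0)
            return (self.key, 0, self.down_level)  # downward: break by start level

        def __eq__(self, other):
            return self._rank() == other._rank()

        def __lt__(self, other):
            return self._rank() < other._rank()

    assert Copy(5, down_level=2) < Copy(5)              # downward < upward
    assert Copy(5, down_level=1) < Copy(5, down_level=2)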

In order to prove that each stage of our algorithm can indeed be performed in a constant number of communication rounds on an EREW BSP computer we must establish the following bounds (whose proofs are included in the full version):

Lemma 3.1: If at most k elements of A_t(v) are in an interval [a, b], then at most (d+1)k + 2(d+1)^2 elements of A_{t+1}(v) are in [a, b].

This immediately implies the following:

Corollary 3.2: If at most k elements of D_t(v) are in an interval [a, b], then at most (d+1)(k+1) + 3 elements of D_{t+1}(v) are in [a, b].

In addition, we can also show the following:

Lemma 3.3: For any two consecutive elements b and c in A_t(v), let b′ and c′ respectively be the predecessor of b and the successor of c in A_t(u), where u is the parent of v in T. There are at most (d+1)(d^2+1) + 2(d+1)^2 + 2 elements of A_t(u) in the interval [b′, c′].

Finally, we have the following:

Lemma 3.4: For any two consecutive elements b and c in D_{t-1}(u) there are at most (d+1)^2(d^4+5) elements of A_t(v) in the interval [b, c], where u is the parent of v in T.

As will become apparent in our algorithm description, these bounds are all crucial for establishing that our algorithm runs in the EREW BSP model using a constant number of communication rounds per stage. In order to perform the computation for Stage t+1 using a constant number of communication rounds we assume that we maintain the following induction invariants for each active node v in T:

Induction Invariants:

1. A_t(v) is ranked into U_t(v).

2. A_t(v) and D_{t-1}(u) are cross-ranked, where u is the parent of v.

3. A_{t-1}(v) is ranked into A_t(v).

4. D_t(v) is ranked in D_{t-1}(v).

We also maintain a load-balancing invariant, similar to the one we used in our weak-CREW BSP algorithm, except that we now define a list A stored at a node v to be full if A ⊇ U(v).

Load-balancing Invariant:

● If a list A is not full, then A is partitioned into contiguous subarrays of size d^6 each, with each subarray stored on a different processor.

● If a list A is full, then A is partitioned into contiguous subarrays of size d^7 each, with each subarray stored on a different processor.

Given that the induction and load-balancing invariants hold after the completion of Stage t, our method for performing Stage t+1 is as follows.

Computation for Stage t + 1: For each child w_i of v we perform the following computation.

1. For each element a in A_t(w_i) use the ranking of A_t(w_i) in U_t(w_i) to determine if a is also in L_t(w_i) (together with its rank in L_t(w_i) if so). No communication is necessary for this step, given Induction Invariant 1.

2. For each such element a in L_t(w_i) use the ranking of A_t(w_i) in D_{t-1}(v) to determine the ranks of the predecessor, b(a), of a and the successor, c(a), of a in D_{t-1}(v). No communication is necessary for this step either, given Induction Invariant 2.

3. For each a in L_t(w_i), use the ranks of b(a) and c(a) in D_{t-1}(v) to determine the respective ranks of b(a) and c(a) in A_{t-1}(v). No communication is necessary for this step.

4. For each a in L_t(w_i), request that the processor(s) for b(a) and c(a) in A_{t-1}(v) send the processor for a the name of the predecessor, b′(a), of b(a) and the name of the successor, c′(a), of c(a) in A_t(v), using Invariant 3. By Lemma 3.1 and our load-balancing invariant, each processor will receive and send at most (d+1)d^6 + 2(d+1)^2 = Θ(h) messages to implement this step.

5. Send a (together with its rank in L_t(w_i)) to the processor(s) assigned to elements of A_t(v) between b′(a) and c′(a), to be merged with all other elements of A_{t+1}(v) that fall in this range. As with the previous step, by Lemma 3.1 and our load-balancing invariant, each processor will receive at most (d+1)d^6 + 2(d+1)^2 = Θ(h) messages to implement this step. More importantly, by Lemma 3.3, each processor will send an element a to at most ⌈((d+1)(d^2+1) + 2(d+1)^2 + 2)/d^6⌉ + 1 = O(1) other processors. Thus, no broadcasting is needed in order to implement this step.

At the parent u of v we assume a similar (but simpler) computation is being performed. Finally, at node v we perform the following computation:

1. For each interval [e, f) of elements of A_t(v) assigned to a single processor, merge all the elements coming from the parent u and children w_1, w_2, ..., w_d to form A_{t+1}(v). Such a processor will receive at least one element from each node adjacent to v, plus as many elements of A_{t+1}(v) as fall in [e, f), for a total of at most d + 1 + (d+1)d^6 + 2(d+1)^2 = O(h). This computation amounts to a (d+1)-way mergesort and can easily be implemented in O(d^7 log(d+1)) = O((n/p) log(n/p)) internal steps.

2. Likewise, for each interval [e, f) of elements of A_t(v) assigned to a single processor, merge all the elements coming just from v's children w_1, w_2, ..., w_d to form U_{t+1}(v) (and A_{t+1}(v) ranked in U_{t+1}(v), which gives us Induction Invariant 1).

3. Use the rank information derived from the previous two steps to rank A_{t+1}(v) in D_t(u), giving us half of Induction Invariant 2. Also, rank A_t(v) in A_{t+1}(v), giving us Invariant 3, and, by an additional calculation, a ranking of D_{t+1}(v) in D_t(v), which is Invariant 4. Finally, send a message to each element a in D_t(u) informing it of its rank in A_{t+1}(v) so as to complete the other half of Invariant 2. To implement this step requires that each processor send at most h messages and each processor receive at most d^6·d = O(h) messages.

4. Finally, repartition the lists at each node v so as to satisfy the load-balancing invariant. Assuming that the total size of all non-full lists is O(n/d) and the size of all full lists is O(n), this step can easily be implemented in O(1) communication rounds.

Let us, therefore, analyze the space requirements of this algorithm. The total size of all the U(v) lists on the full level clearly is O(n). Each such list causes at most ⌈|U(v)|/d⌉ elements to be sent to v's parent, u. Now the inclusion of these elements in u causes at most (d+1)⌈|U(v)|/d^3⌉ elements to be sent to nodes at distance 1 from u (including v itself). But once an element starts traversing down the tree T it never is sent up again. We can repeat this argument to establish that the existence of U(v) causes at most (d+1)^2⌈|U(v)|/d^5⌉ elements to be sent to nodes at distance 2 from u, and so on. Thus, all of these elements that originate from u sum in a geometric series that is O(n/d). Therefore, the total size of all the non-full lists is O(n/d). Likewise, the total size of all the lists (and hence the lists on the full level) is O(n). This gives us the following theorem:

Theorem 3.5: Given a set S of n items stored O(n/p) per processor on a p-processor EREW BSP computer, one can sort S in O(log n / log(h+1)) communication rounds and O((n log n)/p) internal computation time, for h = Θ(n/p).

This immediately implies the following:

Corollary 3.6: Given a set S of n items stored O(n/p) per processor, one can sort S on an EREW BSP computer with a combined running time that is O((n log n)/p + (L + gn/p)(log n / log(n/p))).

This bound also applies to the LogP model.

4 A Lower Bound for BSP Computations

In this section we show that our upper bounds on the number of communication rounds needed to sort n numbers on a p-processor BSP computer are optimal. Specifically, we show that Ω(log n / log(h+1)) communication steps are needed to compute the "or" of n bits using an arbitrary number of processors in a CREW BSP computer, where h is the number of messages that can be sent and received by a single processor in a single communication round.

Let us begin by formalizing the framework for proving our lower bound. Assume we have a set S of n Boolean values x_1, x_2, ..., x_n initially placed in memory locations m_1, m_2, ..., m_n, with memory cells m_{(i-1)h+1}, ..., m_{ih} stored in the local memory of processor p_i, for i ∈ {1, 2, ..., ⌈n/h⌉}. This, of course, implies that we have at least ⌈n/h⌉ processors, but for the sake of the lower bound we allow for an arbitrary number of processors. Moreover, we place no upper bound on the number of additional memory cells that each processor may store internally. The goal of the computation is that after some T steps the "or" of the values in S should be stored in memory location m_1.

Our lower bound proof will be an adaptation of a lower bound proof of Cook, Dwork, and Reischuk [11] for computing the "or" of n bits on a CREW PRAM. The main difficulties in adapting this proof come from the fact that each processor in a BSP computer can send h messages in each communication round, rather than just a single value, which complicates arguments that bound the amount of information processors can communicate by not sending messages.

Each processor p_i is assumed initially to be in a starting state, q_i^0, taken from a possibly-unbounded set of states. At the beginning of a round t processor p_i is assumed to be in some state q_i^t. A round begins with each processor sending up to h messages, some of which may be (arbitrary) partial broadcasts, and simultaneously receiving up to h messages from other processors. Without loss of generality, each message may be assumed to be the contents of one of the memory cells associated with the sending processor, since we place no constraints on the amount of information that may be stored in a memory cell nor on the number of memory cells that a processor may contain. A processor then enters a new state q_i^{t+1} that depends upon its previous state q_i^t and the values of the messages it has received. A round completes with a processor possibly writing new values to some of its internal memory cells based upon its new state q_i^{t+1}.

Before analyzing the most general situation, let us first prove a lower bound for the oblivious case, where the determination of whether a processor p_i will send a message to processor p_j in round t depends only upon i, j, and t, and not on the input. Of course, the contents of such a message could depend upon the input. For an input string I = (x_1, x_2, ..., x_n) of Boolean values, let I(k) denote the input string (x_1, x_2, ..., x̄_k, ..., x_n), where x̄_k denotes the complement of the Boolean value x_k. I is a critical input for a function f if f(I) ≠ f(I(k)) for all k ∈ {1, 2, ..., n}. (Note that I = (0, 0, ..., 0) is a critical input for the "or" function.) Say that input index k affects [11] processor p_i in round t with input I if the state of p_i on input I after round t differs from the state of processor p_i on input I(k) after round t. Likewise, say that input index k affects memory cell m_i in round t with input I if the contents of m_i on input I after round t differ from the contents of m_i on input I(k) after round t.

Theorem 4.1: If f : {0, 1}^n → {0, 1} has a critical input, then any oblivious CREW BSP computer that computes f requires Ω(log n / log(h+1)) communication rounds.

Proof: Let K(p_i, t, I) (respectively, L(m_i, t, I)) be the set of input indices that affect processor p_i (resp., memory cell m_i) in round t with input I. Further, let K_t and L_t satisfy the following recurrence equations:

K_0 = 0,                         (1)
L_0 = 1,                         (2)
K_{t+1} = K_t + hL_t,            (3)
L_{t+1} = K_{t+1} + L_t.         (4)

Note that it suffices to prove that |K(p_i, t, I)| ≤ K_t and |L(m_i, t, I)| ≤ L_t, for K_t and L_t are both at most [2(h+1)]^t, and if I is a critical input for f, then every one of the input indices must affect memory cell m_1. That is, if m_1 = f(I), then |L(m_1, T, I)| = n, which implies that T is Ω(log n / log(h+1)). In the full version we show how to establish the above bounds on |K(p_i, t, I)| and |L(m_i, t, I)| by induction on t. ∎
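A quick numeric check of these recurrences (our sketch; it iterates (1)-(4) and reports the first round at which L_t can reach n) illustrates the Ω(log n / log(h+1)) behavior:

    def rounds_needed(n, h):
        """Iterate recurrences (1)-(4): K_{t+1} = K_t + h*L_t and
        L_{t+1} = K_{t+1} + L_t, returning the first t with L_t >= n;
        T must be at least this large. The proof's bound
        L_t <= (2(h+1))**t is checked along the way."""
        K, L, t = 0, 1, 0
        while L < n:
            K = K + h * L
            L = K + L
            t += 1
            assert L <= (2 * (h + 1)) ** t
        return t

    for h in (1, 4, 16):
        print(h, rounds_needed(n=10**6, h=h))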

The main difficulty in generalizing this result to non-oblivious computations is that in the non-oblivious case a processor p_i can receive information from a processor p_j by p_j not sending a message to p_i. Still, as we show in the next theorem, this ability cannot alter the asymptotic performance of a CREW BSP computer by more than a constant factor for computing the value of a function with a critical input.

Theorem 4.2: If f : {0, 1}^n → {0, 1} has a critical input, then any CREW BSP computer that computes f requires Ω(log n / log(h+1)) communication rounds.

Proof: Let K(p_i, t, I) and L(m_i, t, I) be as in the proof of Theorem 4.1. But now let K_t and L_t be defined by the following recurrence relations:

K_0 = 0,                              (5)
L_0 = 1,                              (6)
K_{t+1} = (2h+1)K_t + hL_t,           (7)
L_{t+1} = K_{t+1} + L_t.              (8)

As in the previous proof, it suffices to show that |K(p_i, t, I)| ≤ K_t and |L(m_i, t, I)| ≤ L_t, for K_t and L_t are both at most [3(h+1)]^t.

We establish these bounds on |K(p_i, t, I)| and |L(m_i, t, I)| by induction on t. First, note that K(p_i, t, I) is empty in round t = 0, and L(m_i, 0, I) = {i} if i ∈ {1, 2, ..., n}, and otherwise L(m_i, 0, I) is empty. At the beginning of round t a processor p_i receives the contents of at most h memory locations, and it also receives information by noting that some processors did not send p_i a message. Still, after it incorporates this information into its new state q_i^{t+1}, it optionally writes to its local memory, as in the previous proof. Thus, if we can establish Equation (7), then Equation (8) immediately follows.

Say that input index k possibly-causes a processor p_j to send a message to processor p_i in round t with I if p_j sends a message to processor p_i in round t on input I(k). Using this notion we bound K(p_i, t+1, I) as a subset of

K(p_i, t, I) ∪ (∪_{j∈Z} L(m_j, t, I)) ∪ Y(p_i, t, I),

for some index set Z with |Z| ≤ h, where Y(p_i, t, I) denotes the set of all indices k that possibly-cause some processor p_j to send a message to p_i with I. Thus, we must bound r = |Y(p_i, t, I)|. So, let Y = Y(p_i, t, I) = {k_1, k_2, ..., k_r} be the set of indices k_j that possibly-cause a processor p(k_j) to send a message to p_i with I. Note that if r ≤ hK_t, then we have established Equation (7), so for the remainder of this proof let us assume that r > hK_t (we will show that if this is the case, then r ≤ 2hK_t). Say that a subset Y′ ⊆ Y is processor-disjoint if, for any k_j and k_{j′} in Y′, p(k_j) ≠ p(k_{j′}).

Claim: If Y′ = {k_{j_1}, k_{j_2}, ..., k_{j_{h+1}}} is a processor-disjoint subset of Y of size h+1, then there are distinct indices k_{j_s} and k_{j_r} in Y′ such that k_{j_s} affects processor p(k_{j_r}) in round t with I(k_{j_r}).

Proof (of claim): If this is not the case, then on input I(k_{j_1})(k_{j_2})...(k_{j_{h+1}}) there would be h+1 different messages sent to processor p_i, which would violate the correctness of the BSP computation. ∎ (of claim)

As done by Cook, Dwork, and Reischuk [11], we employ a combinatorial graph argument to derive a bound on |Y|. Consider a bipartite graph G whose two node sets are {k_1, k_2, ..., k_r} and {p(k_1), p(k_2), ..., p(k_r)}. Let there be an edge between k_j and p(k_l) if k_j affects p(k_l) in round t with I(k_l). Let e denote the number of edges in G. The degree of any node p(k_j) is |K(p(k_j), t, I(k_j))|, which, by our induction hypothesis, is bounded by K_t. Thus, e ≤ rK_t. We can also derive a lower bound on e, using our claim above. Let G′ be a subgraph of G defined by a processor-disjoint subset of Y of size h+1 together with the processor nodes associated with this subset. Then, by our claim, G′ must contain at least one edge of G. Thus, letting g denote the number of processor-disjoint subsets (like G′) of size h+1, together with the associated processor nodes, we can write g ≤ e. Since placing a node in such a G′ eliminates at most K_t other candidates,

g ≥ r(r − K_t)(r − 2K_t)...(r − hK_t)/(h+1)! ≥ r(r − hK_t)^h/(h+1)!.

Therefore, r(r − hK_t)^h ≤ (h+1)! rK_t, which implies that r ≤ hK_t + [(h+1)! K_t]^{1/h}. By Stirling's approximation, then, r ≤ 2hK_t, which establishes Equation (7) and completes the proof of the theorem. ∎

Acknowledgements

We would like to thank Bob Cypher, Faith Fich, S. Rao Kosaraju, Greg Plaxton, and Les Valiant for several helpful conversations or e-mail exchanges regarding topics related to this paper.

References

[1] M. Adler, J. W. Byers, and R. M. Karp. Parallel sorting with limited bandwidth. 7th SPAA, 129-136, 1995.
[2] A. Aggarwal, A. K. Chandra, and M. Snir. Communication complexity of PRAMs. T.C.S., 71:3-28, 1990.
[3] M. Ajtai, J. Komlós, and E. Szemerédi. Sorting in c log n parallel steps. Combinatorica, 3:1-19, 1983.
[4] S. G. Akl. Parallel Sorting Algorithms. Academic Press, 1985.
[5] K. E. Batcher. Sorting networks and their applications. Proc. 1968 Spring Joint Computer Conf., 307-314, Reston, VA, 1968.

[6] G. Bilardi and F. P. Preparata. Lower bounds to processor-time tradeoffs under bounded-speed message propagation. 4th WADS, LNCS 955, 1-12. Springer-Verlag, 1995.
[7] D. Bitton, D. J. DeWitt, D. K. Hsiao, and J. Menon. A taxonomy of parallel sorting. ACM Comp. Surveys, 16(3):287-318, 1984.
[8] G. E. Blelloch, C. E. Leiserson, B. M. Maggs, C. G. Plaxton, S. J. Smith, and M. Zagha. A comparison of sorting algorithms for the Connection Machine CM-2. 3rd SPAA, 3-16, 1991.
[9] V. Chvátal. Lecture notes on the new AKS sorting network. Report DCS-TR-294, Rutgers Univ., 1992.
[10] R. Cole. Parallel merge sort. SIAM J. Comput., 17(4):770-785, 1988.
[11] S. A. Cook, C. Dwork, and R. Reischuk. Upper and lower time bounds for parallel random access machines without simultaneous writes. SIAM J. Comput., 15:87-97, 1986.
[12] D. E. Culler, R. M. Karp, D. A. Patterson, A. Sahay, K. E. Schauser, E. Santos, R. Subramonian, and T. von Eicken. LogP: Towards a realistic model of parallel computation. 4th ACM Symp. on Princ. and Pract. of Par. Prog., 1-12, 1993.
[13] R. Cypher and J. L. C. Sanz. Cubesort: A parallel algorithm for sorting N data items with S-sorters. J. of Algorithms, 13:211-234, 1992.
[14] R. E. Cypher and C. G. Plaxton. Deterministic sorting in nearly logarithmic time on the hypercube and related computers. J.C.S.S., 47:501-548, 1993.
[15] F. Dehne, X. Deng, P. Dymond, A. Fabri, and A. A. Khokhar. A randomized parallel 3D convex hull algorithm for coarse grained multicomputers. 7th SPAA, 27-33, 1995.
[16] F. Dehne, A. Fabri, and A. Rau-Chaplin. Scalable parallel geometric algorithms for coarse grained multicomputers. 9th SCG, 298-307, 1993.
[17] W. D. Frazer and A. C. McKellar. Samplesort: A sampling approach to minimal storage tree sorting. J. ACM, 17(3):496-507, 1970.
[18] A. V. Gerbessiotis and L. G. Valiant. Direct bulk-synchronous parallel algorithms. J. of Par. and Dist. Comp., 22:251-267, 1994.
[19] P. B. Gibbons. A more practical PRAM model. 1st SPAA, 158-168, 1989.
[20] M. T. Goodrich and S. R. Kosaraju. Sorting on a parallel pointer machine with applications to set expression evaluation. J. ACM, to appear.
[21] W. L. Hightower, J. F. Prins, and J. H. Reif. Implementations of randomized sorting on large parallel machines. 4th SPAA, 158-167, 1992.
[22] J. JáJá. An Introduction to Parallel Algorithms. Addison-Wesley, 1992.
[23] R. Karp, M. Luby, and F. Meyer auf der Heide. Efficient PRAM simulation on distributed machines. 24th STOC, 318-326, 1992.
[24] R. M. Karp and V. Ramachandran. Parallel algorithms for shared memory machines. In J. van Leeuwen, editor, Handbook of Theoretical Computer Science, 869-941. Elsevier/MIT Press, 1990.
[25] R. M. Karp, A. Sahay, E. Santos, and K. E. Schauser. Optimal broadcast and summation in the LogP model. 5th SPAA, 142-153, 1993.
[26] C. Kruskal, L. Rudolph, and M. Snir. A complexity theory of efficient parallel algorithms. T.C.S., 71:95-132, 1990.
[27] F. T. Leighton. Tight bounds on the complexity of parallel sorting. IEEE Trans. on Computers, C-34(4):344-354, 1985.
[28] F. T. Leighton. Introduction to Parallel Algorithms and Architectures: Arrays, Trees, Hypercubes. Morgan Kaufmann, 1992.
[29] H. Li and K. C. Sevcik. Parallel sorting by overpartitioning. 6th SPAA, 46-56, 1994.
[30] Y. Mansour, N. Nisan, and U. Vishkin. Trade-offs between communication throughput and parallel time. 26th STOC, 372-381, 1994.
[31] K. Mehlhorn and U. Vishkin. Randomized and deterministic simulations of PRAMs by parallel machines with restricted granularity of parallel memories. Acta Informatica, 9(1):29-59, 1984.
[32] J. M. Nash, P. M. Dew, M. E. Dyer, and J. R. Davy. Parallel algorithm design on the WPRAM model. Report 94.24, School of Comp. Sci., Univ. of Leeds, 1994.
[33] D. Nassimi and S. Sahni. Parallel permutation and sorting algorithms and a new generalized connection network. J. ACM, 29(3):642-667, 1982.
[34] C. Papadimitriou and M. Yannakakis. Towards an architecture-independent analysis of parallel algorithms. 20th STOC, 510-513, 1988.
[35] M. Paterson. Improved sorting networks with O(log n) depth. Algorithmica, 5(1):75-92, 1990.
[36] C. G. Plaxton. Efficient Computation on Sparse Interconnection Networks. PhD thesis, Dept. of Comp. Sci., Stanford Univ., 1989.
[37] J. H. Reif. Synthesis of Parallel Algorithms. Morgan Kaufmann, 1993.
[38] J. H. Reif and L. G. Valiant. A logarithmic time sort for linear size networks. J. ACM, 34(1):60-76, 1987.
[39] H. Shi and J. Schaeffer. Parallel sorting by regular sampling. J. Par. and Dist. Comp., 14:362-372, 1992.
[40] L. G. Valiant. A bridging model for parallel computation. Comm. ACM, 33:103-111, 1990.
[41] L. G. Valiant. General purpose parallel architectures. In J. van Leeuwen, ed., Handbook of Theoretical Computer Science, 943-972. Elsevier/MIT Press, 1990.
