A Taxonomy of Parallel Sorting

DINA BITTON Department of Applied Mathematics, Weizmann Institute, Rehovot, Israel

DAVID J. DeWITT Computer Sciences Department, University of Wisconsin, Madison, Wisconsin 53706

DAVID K. HSIAO AND JAISHANKAR MENON Computer and Information Science Department, The Ohio State University, Columbus, Ohio 43210

We propose a taxonomy of parallel sorting that encompasses a broad range of array- and file-sorting algorithms. We analyze how research on parallel sorting has evolved, from the earliest sorting networks to shared memory algorithms and VLSI sorters. In the context of sorting networks, we describe two fundamental parallel merging schemes: the odd-even and the bitonic merge. We discuss sorting algorithms that evolved from these merging schemes for parallel computers, whose processors communicate through interconnection networks such as the perfect shuffle, the mesh, and a number of other sparse networks. Following our discussion of network sorting algorithms, we describe how faster algorithms have been derived from parallel enumeration sorting schemes, where, with a shared memory model of parallel computation, keys are first ranked and then rearranged according to their rank. Parallel sorting algorithms are evaluated according to several criteria related to both the time efficiency of an algorithm and its feasibility from the viewpoint of computer architecture. We show that, in addition to attractive communication schemes, network sorting algorithms have nonadaptive schedules that make them suitable for implementation. In particular, they are easily generalized to block sorting algorithms, which utilize limited parallelism to solve large sorting problems. We also address the problem of sorting large mass-storage files in parallel, using modified disk devices or intelligent bubble memory. We conclude by mentioning VLSI sorting as an active and promising direction for research on parallel sorting.

Categories and Subject Descriptors: B.3 [Hardware]: Memory Structures; B.4 [Hardware]: Input/Output and Data Communications; B.7.1 [Integrated Circuits]: Types and Design Styles; F.2.2 [Analysis of Algorithms and Problem Complexity]: Nonnumerical Algorithms and Problems

General Terms: Algorithms

Additional Key Words and Phrases: Block sorting, bubble memory, external sorting, hardware sorters, internal sorting, limited parallelism, merging, parallel sorting, sorting networks

D. Bitton's present address is Department of Computer Science, Cornell University, Ithaca, New York 14853.

CONTENTS

INTRODUCTION
1. PARALLELIZING SERIAL SORTING ALGORITHMS
   1.1 The Odd-Even Transposition Sort
   1.2 A Parallel Tree-Sort Algorithm
2. NETWORK SORTING ALGORITHMS
   2.1 Sorting Networks
   2.2 Sorting on an SIMD Machine
   2.3 Summary
3. SHARED MEMORY PARALLEL SORTING ALGORITHMS
   3.1 A Modified Sorting Network
   3.2 Faster Parallel Merging Algorithms
   3.3 Bucket Sorting
   3.4 Sorting by Enumeration
   3.5 Summary
4. BLOCK SORTING ALGORITHMS
   4.1 Two-Way Merge-Split
   4.2 Bitonic Merge-Exchange
5. EXTERNAL SORTING ALGORITHMS
   5.1 Parallel Tape Sorting
   5.2 Parallel Disk Sorting
   5.3 Analysis of the Parallel External Sorting Algorithm
6. HARDWARE SORTERS
   6.1 The Rebound Sorter
   6.2 The Up-Down Sorter
   6.3 Sorting within Bubble Memory
   6.4 Summary and Recent Results
7. CONCLUSIONS AND OPEN PROBLEMS
REFERENCES

INTRODUCTION

Sorting in computer terminology is defined as the process of rearranging a sequence of values in ascending or descending order. Computer programs such as compilers or editors often choose to sort tables and lists of symbols stored in memory in order to enhance the speed and simplicity of algorithms used to access them (for the search or insertion of additional elements, for instance). Because of both their practical importance and theoretical interest, algorithms for sorting values stored in random access memory (internal sorting) have been the focus of extensive research on algorithms. First, serial sorting algorithms were investigated. Then, with the advent of parallel processing, parallel sorting algorithms became a very active area of research. Many efficient serial algorithms are known which can sort n values in at most O(n log n) comparisons, the theoretic lower bound for this problem [Knuth 1973]. In addition to their time complexity, various other properties of these serial internal sorting algorithms have also been investigated. In particular, sorting algorithms have been evaluated with respect to time-memory trade-offs (the amount of additional memory required to run the algorithm in addition to the memory storing the initial sequence), stability (the requirement that equal elements retain their original relative order), and sensitivity to the initial distribution of the values (in particular, best case and worst case complexity have been investigated).

In the last decade, parallel processing has added a new dimension to research on internal sorting algorithms. Several models of parallel computation have been considered, each with its own idea of "contiguous" memory locations and definition of the way multiple processors access memory. In order to state the problem of parallel sorting clearly, we must first define what is meant by a sorted sequence in a parallel processor. When processors share a common memory, the idea of contiguous memory locations in a parallel processor is identical to that in a serial processor. Thus, as in the serial case, the time complexity of a sorting algorithm can be expressed in terms of the number of comparisons (performed in parallel by all or some of the processors) and internal memory moves. On the other hand, when processors do not share memory and communicate along the lines of an interconnection network, definition of the sorting problem requires a convention to order the processors and thus the union of their local memory locations. When parallel processors are used, the time complexity of a sorting algorithm is expressed in terms of parallel comparisons and exchanges between processors that are adjacent in the interconnecting network.

Shared memory models of parallel computation have been instrumental in investigating the intrinsic parallelism that exists in the sorting problem.

Whereas the first results on parallel sorting were related to sorting networks [Batcher 1968], faster parallel sorting algorithms have been proposed for theoretical models of parallel processors with shared memory [Hirschberg 1978; Preparata 1978]. A chain of results in shared memory computation has led to a number of parallel sorting schemes that exhibit an O(log n) time complexity. Typically, the parallel sorting problem is expressed as that of sorting n numbers with n or more processors, all sharing a large common memory, which they may access with various degrees of contention (e.g., parallel reads and parallel writes with arbitration). Research on parallel sorting has been largely concerned with purely theoretical issues, and it is only recently that feasibility issues such as limited parallelism or, in the context of very large scale integration (VLSI) sorting, trade-offs between hardware complexity (expressed in terms of chip area) and time complexity are being addressed.

In addition to using sorting algorithms to rearrange numbers in memory, sorting is often advocated in the context of information processing. In this context, sorting is used to order a file of data records, stored on a mass-storage device. The records are ordered with respect to the value of a key, which might be a single field or the concatenation of several fields in the record. Files are sorted either to deliver well-organized output to a user (e.g., a telephone directory), or as an intermediate step in the execution of a complex database operation [Bitton and DeWitt 1983; Selinger et al. 1979]. Because of memory limitations, file sorting cannot be performed in memory, and external sorting algorithms must be used. External sorting schemes are usually based on iterative merging [Knuth 1973, sec. 5.4]. Even when fast disk devices are used as mass-storage devices, input/output accounts for most of the execution time in external sorting.¹

Despite the obvious need for fast sorting of large files, the availability of parallel processing has not generated much interest in research on new external sorting schemes. The reasons for the relatively small amount of research on parallel external sorting [Bitton-Friedland 1982; Even 1974] are most likely related to the necessity of adapting such schemes to mass-storage device characteristics.

It may seem that advances in computer technology, such as the advent of intelligent or associative memories, could eliminate or reduce the use of sorting as a tool for performing other operations. For example, when sorting is used in order to facilitate searching, one may advocate that associative memories will suppress the need for sorting. However, associative stores remain too expensive for widespread use, especially when large volumes of data are involved. In the case where sorting is required for the sole purpose of ordering data, the only way to reduce sorting time is to develop fast parallel sorting schemes, possibly by integrating sorting capability into mass-storage memory [Chen et al. 1978; Chung et al. 1980].

In this paper, we propose a taxonomy of parallel sorting that includes both internal and external parallel sorting algorithms. We analyze how research on parallel sorting has evolved from the earliest sorting networks to shared memory model algorithms and VLSI sorters. We attempt to classify a broad range of parallel sorting algorithms according to various criteria, including time efficiency and the architectural requirements upon which they depend. The goal of this study is to provide a basic understanding and a unified view of the body of research on parallel sorting. It would be beyond the scope of a single paper to survey the proposed models of computation in detail or to analyze in depth the complexity of the various algorithms surveyed. We have kept to a minimum the discussion on algorithm complexity, and we only describe the main upper-bound results for the number of parallel comparison steps required by the algorithms. Rather than theoretical problems related to parallel sorting (which have been treated in depth in a number of studies, e.g., Borodin and Hopcroft [1982], Shiloach and Vishkin [1981], and Valiant [1975]), we emphasize problems related to the feasibility of parallel sorting with present or near-term technology.

¹ It is estimated that the OS/VS Sort/Merge program consumes as much as 25 percent of all input/output time on IBM systems [Bryant 1980].

The remainder of this paper is organized as follows. In Section 1, we show that certain fast serial sorting algorithms can be parallelized, but that this approach leads to simple and relatively slow parallel algorithms. Section 2 is devoted to network sorting algorithms; in particular, we describe in detail several sorting networks that perform Batcher's bitonic sort. In Section 3, we survey a chain of results that led to the development of very fast sorting algorithms: the shared memory model parallel merging [Gavril 1975; Valiant 1975] and the shared memory sorting algorithms [Hirschberg 1978; Preparata 1978]. In Section 4, we address the issue of limited parallelism and, in this context, we define "block sorting" parallel algorithms, which sort M·p elements with p processors. We then identify two methods for deriving a block sorting algorithm. In Section 5, we address the problem of sorting a large file in parallel. We show that previous results on parallel sorting are mostly applicable to internal sorting schemes, where the array to be sorted is entirely stored in memory, and propose external parallel sorting as a new research direction. Section 6 contains an overview of recently proposed designs for dedicated sorting devices. In Section 7, we summarize this survey and indicate possible directions for future research.

1. PARALLELIZING SERIAL SORTING ALGORITHMS

Parallel processing makes it possible to perform more than a single comparison during each time unit. Some models of parallel computation (the sorting networks in particular) assume that a key is compared to only one other key during a time unit, and that parallelism is exploited to compare different pairs of keys simultaneously. Another possibility is to compare a key to many other keys simultaneously. For example, in Muller and Preparata [1975], a key is compared to (n - 1) other keys in a single time unit by using (n - 1) processors. Parallelism can also be exploited to move many keys simultaneously. After a parallel comparison step, processors conditionally exchange data. The concurrency that can be achieved in the exchange steps is limited either by the interconnection scheme between the processors (if one exists) or by memory conflicts (if shared memory is used for communication).

With a parallel processor, the analog to a comparison and move step in a uniprocessor memory is a parallel comparison-exchange of pairs of keys. Thus it is natural to measure the performance of parallel sorting algorithms in terms of the number of comparison-exchanges that they require. The speedup of a parallel sorting algorithm can then be defined as the ratio between the number of comparison-moves required by an optimal serial sorting algorithm and the number of comparison-exchanges required by the parallel algorithm.

Since a serial algorithm that sorts by comparison requires at least O(n log n) comparisons to sort n elements [Knuth 1973, p. 183], the optimal speedup would be achieved when, by using n processors, n elements are sorted in O(log n) parallel comparisons. It does not, however, seem possible to achieve this bound by simply parallelizing one of the well-known O(n log n)-time serial sorting algorithms. These algorithms appear to have serial constraints that cannot be relaxed. Consider, for example, a two-way merge sort [Knuth 1973, p. 160]. The algorithm consists of log n phases. During each phase, pairs of sorted sequences (produced in the previous phase) are merged into a longer sequence. During the first phases, a large number of processors can be used to merge different pairs in parallel. However, there is no obvious way to introduce a high degree of parallelism in later phases. In particular, the last phase, which consists of merging two sequences, each containing n/2 elements, is a serial process that may require as many as n - 1 comparisons.

On the other hand, parallelization of straight sorting methods requiring O(n²) comparisons seems easier. However, this approach can at best produce O(n)-time parallel sorting algorithms when O(n) processors are used, since by performing n comparisons instead of 1 in a single time unit, the execution time can be reduced from O(n²) to O(n). An example of this kind of parallelization is a well-known parallel version of the common bubble sort, called the odd-even transposition sort (Section 1.1).

Partial parallelization of a fast serial algorithm can also lead to a parallel algorithm of order O(n). For example, the serial tree selection sort can be modified so that all comparisons at the same level of the tree are performed in parallel. The result is a parallel tree sort, described in Section 1.2. This parallel algorithm is used in the database Tree Machine [Bentley and Kung 1979].

1.1 The Odd-Even Transposition Sort

The serial "bubble sort" proceeds by comparing and exchanging pairs of adjacent items. In order to sort an array (x1, x2, ..., xn), (n - 1) comparison-exchanges (x1, x2), (x2, x3), ..., (xn-1, xn) are performed. This results in placing the maximum at the right end of the array. After this first step, xn is discarded, and the same "bubble" sequence of comparison-exchanges is applied to the shorter array (x1, x2, ..., xn-1). By iterating (n - 1) times, the entire sequence is sorted.

The serial odd-even transposition sort [Knuth 1973] is a variation of the basic bubble sort, with a total of n phases, each of which requires n/2 comparisons. Odd and even phases alternate. During an odd phase, odd elements are compared with their right adjacent neighbor; thus the pairs (x1, x2), (x3, x4), ... are compared. During an even phase, even elements are compared with their right adjacent neighbor; that is, the pairs (x2, x3), (x4, x5), ... are compared. To completely sort the sequence, a total of n phases (alternately odd and even) is required [Knuth 1973, p. 65].

This algorithm calls for a straightforward parallelization [Baudet and Stevenson 1978]. Consider n linearly connected processors and label them P1, P2, ..., Pn. Assume that the links are bidirectional, so that Pi can communicate with both Pi-1 and Pi+1. Also assume that initially xi resides in Pi for i = 1, 2, ..., n. To sort (x1, x2, ..., xn) in parallel, let P1, P3, P5, ... be active during the odd time steps and execute the odd phases of the serial odd-even transposition sort in parallel. Similarly, let P2, P4, ... be active during the even time steps and perform the even phases in parallel.

Note that a single comparison-exchange requires two transfers. For example, during the first step, x2 is transferred to P1 and compared to x1 by P1. Then, if x1 > x2, x1 is transferred to P2; otherwise x2 is transferred back to P2. Thus the parallel odd-even transposition algorithm sorts n numbers with n processors in n comparisons and 2n transfers.
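A short sketch may make the schedule concrete. The following Python fragment (our own rendering, not code from the paper) simulates the parallel odd-even transposition sort: each pass of the inner loop corresponds to one synchronized time step, and the comparison-exchanges within a pass touch disjoint pairs, so a linear array of n processors could execute them simultaneously.

```python
def odd_even_transposition_sort(x):
    """Simulate the parallel odd-even transposition sort.

    Each phase performs comparison-exchanges on disjoint adjacent
    pairs, so all of them could run in parallel on n linearly
    connected processors; n phases suffice to sort n values.
    """
    n = len(x)
    for phase in range(n):
        start = 0 if phase % 2 == 0 else 1    # odd phases, then even phases
        for i in range(start, n - 1, 2):      # disjoint pairs: parallel in hardware
            if x[i] > x[i + 1]:
                x[i], x[i + 1] = x[i + 1], x[i]
    return x
```

For example, odd_even_transposition_sort([3, 1, 4, 1, 5]) returns [1, 1, 3, 4, 5] after five phases, matching the n-comparison, 2n-transfer cost stated above.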

1.2 A Parallel Tree-Sort Algorithm

In a serial tree selection sort, a binary tree with (2n - 1) nodes is used to sort n numbers. The tree has n leaves, and initially one number is stored in each leaf. Sorting is performed by selecting the minimum of the n numbers, then the minimum of the remaining (n - 1) numbers, etc.

The binary tree structure is used to find the minimum by iteratively comparing the numbers in two sibling nodes and moving the smaller number to the parent node (see Figure 1). By simultaneously performing all the comparisons at the same level of the binary tree, a parallel tree sort is obtained [Bentley and Kung 1979].

Consider a set of (2n - 1) processors interconnected to form a binary tree, with one processor at each of the n leaf nodes and at each interior node of the tree. By starting with one number at each leaf processor, the minimum can be transferred to the root processor in log2(n) parallel comparison and transfer steps. At each step, a parent receives an element from each of its two children, performs a comparison, retains the smaller element, and returns the larger one. After the minimum has reached the root, it is written out. From then on, empty processors are instructed to accept data from nonempty children and select the minimum if they receive two elements. At every other step, the next element in increasing order reaches the root. Thus sorting is completed in time O(n).


Figure 1. Parallel tree selection sort. (a) Step 1. (b) Step 2. (c) Step 3. (d) Step 4.
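The level-by-level flow of Figure 1 can be simulated in a few lines. The sketch below is our own simplified pull-based variant of the tree sort in Section 1.2: instead of parents exchanging records and returning the larger one, every empty parent takes the smaller of its children's values, which preserves the one-level-per-step parallel behavior for n a power of two. It is an illustration, not the Tree Machine implementation.

```python
def parallel_tree_sort(values):
    """Simulate the parallel tree-selection sort on 2n - 1 processors.

    Nodes are stored heap-style: node i has children 2i + 1 and 2i + 2,
    and the n leaves (n a power of two) occupy the last n slots.  In
    each sweep, all empty parents simultaneously pull the smaller value
    held by their children, so the minimum climbs one level per step
    and the root emits the next element in increasing order.
    """
    n = len(values)
    tree = [None] * (n - 1) + list(values)    # interior nodes start empty
    out = []
    while len(out) < n:
        if tree[0] is not None:               # root emits the current minimum
            out.append(tree[0])
            tree[0] = None
        # one parallel step: top-down order ensures each value moves one level
        for i in range(n - 1):
            if tree[i] is None:
                kids = [c for c in (2 * i + 1, 2 * i + 2)
                        if c < len(tree) and tree[c] is not None]
                if kids:
                    c = min(kids, key=lambda k: tree[k])
                    tree[i], tree[c] = tree[c], None
    return out
```

After the log n steps needed to fill the pipeline, an element reaches the root every other step, giving the O(n) total time stated above.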

Both the odd-even transposition sort and the parallel tree sort constitute simple parallel sorting algorithms, derived by performing in parallel the sequences of comparisons required at different stages of the bubble sort and the tree selection sort. Both algorithms use O(n) processors to sort an arbitrary sequence of n elements in O(n) comparisons. We shall show that parallel sorting algorithms developed by exploiting the intrinsic parallelism in sorting are faster than those developed by parallelizing serial sorting algorithms.

2. NETWORK SORTING ALGORITHMS

It is somewhat surprising that initially the simple hardware problem of designing a multiple-input multiple-output switching network was a prime motivation in developing parallel sorting algorithms. The earliest results in parallel sorting are found in the literature on sorting networks [Batcher 1968; Van Voorhis 1971]. Since then, a wide range of network topologies has been proposed, and their ability to support fast sorting algorithms has been extensively investigated. In Section 2.1, we describe in detail the odd-even and the bitonic merging networks. In Section 2.2, we show that parallel sorting algorithms for SIMD (single instruction stream, multiple data stream) machines are derived from the bitonic network sort. In particular, we describe two bitonic sort algorithms for a mesh-connected processor [Nassimi and Sahni 1979; Thompson and Kung 1977].

Several other networks are of major interest, particularly the cube [Pease 1977] and the cube-connected cycles [Preparata and Vuillemin 1979], which are suitable for sorting as well as for a number of numerical problems. It has been shown that sorting based on the bitonic merge can be implemented as a routing strategy on these networks. It is beyond the scope of this paper to investigate these networks in detail; we shall concentrate on explaining the basic merge patterns that determine the routing strategies on all these networks and on deriving the O(log² n) bound for sorting time on Batcher's networks.

2.1 Sorting Networks

Sorting networks originated as fast and economical switching networks. Since a sorting network with n input lines can order any permutation of (1, 2, ..., n), it can be used as a multiple-input multiple-output switching network [Batcher 1968]. Implementing a serial sorting algorithm on a network of comparators [Knuth 1973, p. 220] results in a serialization of the comparators and consequently increases the network delay. To make a sorting network fast, it is necessary to have a number of comparator modules perform comparisons in parallel. Parallel sorting algorithms are therefore necessary to design efficient sorting networks.


One of the first results in parallel sorting is due to Batcher [1968], who presented two methods to sort n keys with O(n log² n) comparators in time O(log² n). As shown in Figure 2, a comparator is a module that receives two numbers on its two input lines A and B, and outputs the minimum on its output line L and the maximum on its output line H. A serial comparator receives elements A and B with their most significant bits first and can be realized with a small number of gates. Parallel comparators compare several bits in parallel at each step; they are faster but obviously more complex.

Figure 2. A comparison-exchange module.

Both of Batcher's algorithms, the "odd-even sort" and the "bitonic sort," are based on the principle of iterated merging. A specific iterative rule is applied to an initial sequence of 2^k numbers in order to create sorted runs of length 2, 4, 8, ..., 2^k during successive stages of the algorithm.

2.1.1 The Odd-Even Merge Rule

The iterative rule for the odd-even merge is illustrated in Figure 3. Given two sorted sequences (a1, a2, ..., an) and (b1, b2, ..., bn), two new sequences are created: the odd sequence consists of the odd-numbered terms and the even sequence consists of the even-numbered terms. The odd sequence (c1, c2, ...) is obtained by merging the odd terms (a1, a3, ...) with the odd terms (b1, b3, ...). Similarly, the even sequence (d1, d2, ...) is obtained by merging (a2, a4, ...) with (b2, b4, ...). Finally, the sequences (c1, c2, ...) and (d1, d2, ...) are merged into (e1, e2, ..., e2n) by performing the following comparison-exchanges:

e1 = c1,
e2i = min(ci+1, di),
e2i+1 = max(ci+1, di) for i = 1, 2, ..., n - 1,
e2n = dn.

The resulting sequence will be sorted (for a proof, see Knuth [1973, pp. 224, 225]).

Figure 3. The iterative rule for the odd-even merge.

To sort 2^k numbers using the odd-even iterative merge rule requires 2^(k-1) (1 by 1) merging networks (i.e., comparison-exchange modules), followed by 2^(k-2) (2 by 2) merging networks, followed by 2^(k-3) (4 by 4) merging networks, and so on. Since a 2^(i+1) by 2^(i+1) merging network requires one more step of comparison-exchanges than a 2^i by 2^i merging network, it follows that an input number goes through at most 1 + 2 + 3 + ... + k = k(k + 1)/2 comparators. This means that 2^k numbers are sorted by performing k(k + 1)/2 parallel comparison-exchanges. However, the number of comparators required by this type of sorting network is (k² - k + 4)2^(k-2) - 1 [Batcher 1968].
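The odd-even merge rule translates almost directly into code. The recursive Python sketch below is our own rendering of Batcher's rule for two sorted sequences of equal length n (a power of two); every min/max pair in the final loop is independent, which is what allows a network to perform them as a single parallel step.

```python
def odd_even_merge(a, b):
    """Batcher's odd-even merge of two sorted sequences of equal
    length n (n a power of two).  Returns the sorted union."""
    n = len(a)
    if n == 1:
        return [min(a[0], b[0]), max(a[0], b[0])]
    # Recursively merge the odd-numbered and even-numbered terms
    # (the 1-based odd terms a1, a3, ... are a[0::2] in 0-based Python).
    c = odd_even_merge(a[0::2], b[0::2])   # odd sequence (c1, c2, ...)
    d = odd_even_merge(a[1::2], b[1::2])   # even sequence (d1, d2, ...)
    # One parallel step of comparison-exchanges combines c and d:
    # e1 = c1, e2i = min(ci+1, di), e2i+1 = max(ci+1, di), e2n = dn.
    e = [c[0]]
    for i in range(n - 1):
        e.append(min(c[i + 1], d[i]))
        e.append(max(c[i + 1], d[i]))
    e.append(d[n - 1])
    return e
```

A complete odd-even sort is then the usual doubling recursion: sort each half of the input and apply odd_even_merge to the results.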

2.1.2 The Bitonic Merge Rule

For the bitonic sort, a different iterative rule is used (Figure 4). A bitonic sequence is obtained by concatenating two monotonic sequences, one ascending and the other descending. A cyclic shift of this concatenated sequence is also a bitonic sequence. The bitonic iterative rule is based on the observation that a bitonic sequence can be split into two bitonic sequences by performing a single step of comparison-exchanges. Let (a1, a2, ..., a2n) be a bitonic sequence such that a1 ≤ a2 ≤ ... ≤ an and an+1 ≥ an+2 ≥ ... ≥ a2n.² Then the sequences

min(a1, an+1), min(a2, an+2), ...

and

max(a1, an+1), max(a2, an+2), ...

are both bitonic. Furthermore, the first sequence contains the n lower elements of the original sequence, whereas the second contains the n higher ones. It follows that a bitonic sequence can be sorted by separately sorting two bitonic sequences, each half as long as the original sequence.

Figure 4. The iterative rule for the bitonic merge.

² As pointed out by one of the referees, the restriction to equal-length ascending and descending parts is not necessary. However, we have made this assumption for the sake of clarity in explaining the bitonic merge rule.
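A minimal sketch of the bitonic rule, assuming the input is bitonic and its length a power of two (this rendering is ours, not the paper's): the two comprehensions are exactly the single parallel comparison-exchange step described above, and the two recursive calls are independent, so a network executes them side by side.

```python
def bitonic_merge(seq):
    """Sort a bitonic sequence whose length is a power of two.

    One step of comparison-exchanges splits it into a 'lower' and a
    'higher' bitonic half; the halves are then sorted independently
    (in a network, in parallel)."""
    m = len(seq)
    if m == 1:
        return seq
    n = m // 2
    lower = [min(seq[i], seq[i + n]) for i in range(n)]   # n lowest elements
    higher = [max(seq[i], seq[i + n]) for i in range(n)]  # n highest elements
    return bitonic_merge(lower) + bitonic_merge(higher)
```

Sorting an arbitrary sequence proceeds by recursively building bitonic sequences: sort one half ascending and the other descending, concatenate, and apply bitonic_merge, mirroring the inverted-comparator construction described next.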

To sort 2^k numbers by using the bitonic iterative rule, we can iteratively sort and merge sequences into larger sequences until a bitonic sequence of length 2^k is obtained. This bitonic sequence can then be split into "lower" and "higher" bitonic subsequences. Note that the iterative building procedure of a bitonic sequence requires that some comparators invert their output lines and output a pair of numbers in decreasing order (to produce the decreasing part of a bitonic sequence). Figure 5 illustrates a bitonic sort network for eight input lines. In general, the bitonic sort of 2^k numbers requires k(k + 1)/2 steps, with each step using 2^(k-1) comparators.

Figure 5. Batcher's bitonic sort for eight numbers. The boxes containing minus signs indicate comparators that invert their output lines.

After the first bitonic sorter was presented, it was shown that the same sorting scheme could be realized with only n/2 comparators, interconnected as a perfect shuffle [Stone 1971]. Stone noticed that if the input values were labeled with a binary index, then the indices of every pair of keys entering a comparator at any step of the bitonic sort would differ by a single bit in their binary representations. Stone also made the following observations: the network has log n stages; the ith stage consists of i steps; and at each step, inputs whose indices differ in their least significant bit are compared. This regularity in the bitonic sorter suggests that a similar interconnection scheme could be used between the comparators of any two adjacent columns of the network. Stone concluded that the perfect shuffle interconnection could be used throughout all the stages of the network. "Shuffling" the input lines (in a manner similar to shuffling a deck of cards) is equivalent to cyclically shifting their binary indices to the left. Shuffling twice shifts the binary representation of each index twice. Thus the input lines can be ordered before each step of comparison-exchanges by shuffling them as many times as required by the bitonic sort algorithm. The network that realizes this idea is shown in Figure 6 for eight input lines. In general, for n = 2^k input lines, this type of bitonic sorter requires a total of (n/2)(log n)² comparators, arranged in (log n)² ranks of (n/2) comparators each. The network has log n stages, with each stage consisting of log n steps. At each step, the output lines are shuffled before they enter the next rank of comparators. The comparators in the first (log n) - i steps of the ith stage do not exchange their inputs; their only use is to shuffle their input lines.

Figure 6. Stone's modified bitonic sort. The boxes containing minus signs indicate comparators that invert their output lines.
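Stone's observation about binary indices is easy to verify mechanically. The short demonstration below (our illustration, with names of our choosing) shuffles a list of line labels and checks that a perfect shuffle rotates each line's k-bit index one position to the left.

```python
def perfect_shuffle(lines):
    """Interleave the two halves of `lines`, like riffling a card deck:
    (x0, ..., x_{n/2-1}, x_{n/2}, ..., x_{n-1}) becomes
    (x0, x_{n/2}, x1, x_{n/2+1}, ...)."""
    half = len(lines) // 2
    out = []
    for a, b in zip(lines[:half], lines[half:]):
        out.extend([a, b])
    return out

# A line at k-bit index i moves to the index obtained by rotating
# i one bit to the left (the cyclic left shift described above).
n, k = 8, 3
shuffled = perfect_shuffle(list(range(n)))
for new_pos, old_pos in enumerate(shuffled):
    rotated = ((old_pos << 1) | (old_pos >> (k - 1))) & (n - 1)
    assert new_pos == rotated
```

After log n shuffles every index returns to its original position, which is why the recirculating design discussed next can reuse a single rank of comparators.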

As an alternative to a multistage network, the bitonic sort can also be implemented as a recirculating network, which requires a much smaller number of comparators. For example, an alternative bitonic sorter can be built with a single rank of comparators connected by a set of shift registers and shuffle links, as shown in Figure 7. Since the ith stage of the bitonic sort algorithm requires i comparison-exchanges, Batcher's sort requires

1 + 2 + 3 + ... + log n = (log n)((log n) + 1)/2

parallel comparison-exchanges. Stone's bitonic sorter, on the other hand, requires a total of (log n)² steps, because additional steps are needed for shuffling the input lines (without performing a comparison). In both cases, the asymptotic complexity is O(log² n) comparison-exchanges. This provides a speedup of O(n/log n) over the O(n log n) complexity of serial sorting. Therefore, this algorithm significantly improves the previously known bound of O(n) for the time required to sort n elements with n processing elements.

Figure 7. Stone's architecture for the bitonic sort.

Siegel has shown that the bitonic sort can also be performed by other types of networks in time O(log² n) [Siegel 1977]. The cube and the plus-minus 2^i networks are among the networks that he considered. Essentially, the data exchanges required by the bitonic sort scheme can be realized on these networks as well (in fact, the perfect shuffle may be seen as an emulator of the cube). Siegel proves that simulating the shuffle on a large class of interconnection networks takes O(log² n) time, and thus sorting can also be performed within this time limit. Finally, we should also mention the versatile cube-connected cycles (CCC), a network that efficiently emulates the cube and the shuffle and yet requires only three communication ports per processor [Preparata and Vuillemin 1979]. A bitonic or an odd-even sort can also be performed on the CCC in time O(log² n).

2.2 Sorting on an SIMD Machine

Sorting networks are characterized by their property of nonadaptivity. They perform the same sequence of comparisons, regardless of the results of intermediate comparisons. In other words, whenever two keys Ri and Rj are compared, the subsequent comparison for Rj is the same in the case where Ri < Rj as it would have been in the case where Rj < Ri. As a result of the nonadaptivity property, a network sorting algorithm is conveniently implemented on an SIMD machine. An SIMD (single instruction stream, multiple data stream) machine is a system consisting of a control unit and a set of processors with local memory, interconnected by an interconnection network. The processors are highly synchronized. The control unit broadcasts instructions that all active processors execute simultaneously (a mask specifies a subset of processors that are idle during an instruction cycle). Since the sequence of comparisons and transfers required in a network sorting algorithm is determined when the sorting operation is initialized, a central controller can supervise the execution of the algorithm by broadcasting at each time step the appropriate compare-exchange instruction to the processors.

2.2.1 Sorting on an Array Processor

The sorting problem can also be defined as the problem of permuting numbers among the local memories associated with the processors of an SIMD machine. In particular, by assuming that one number is stored in the local memory of each processor of a mesh-connected machine, sorting can be seen as the process of permuting the numbers stored in neighboring processors until they conform to some ordering of the mesh. The processors of an n by n mesh-connected parallel processor can be indexed according to a specified rule, such as row-major or column-major indexing, which are commonly accepted ways to order an array. Thompson and Kung [1977] adapted the bitonic sorting scheme to a mesh-connected processor with three alternative indexing rules: the row-major rule, the snakelike row-major rule, and the shuffled row-major rule. These rules are shown in Figure 8, and they can also be stated as short functions, as sketched below.

Figure 8. Array processor indexing schemes. (a) Row-major indexing. (b) Snakelike row-major indexing. (c) Shuffled row-major indexing.
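The following Python definitions (our formulation) map the coordinates (i, j) of a processor in an n-by-n mesh to its rank under each rule. For the shuffled rule, conventions differ on whether row or column bits come first; we interleave row bits first, which reproduces the usual textbook numbering, but this bit order is an assumption rather than something fixed by the text.

```python
def row_major(i, j, n):
    """Left to right within each row, rows taken top to bottom."""
    return i * n + j

def snakelike_row_major(i, j, n):
    """Like row-major, but every other row is traversed right to left."""
    return i * n + (j if i % 2 == 0 else n - 1 - j)

def shuffled_row_major(i, j, n):
    """Interleave the bits of the row and column numbers (row bit
    first), i.e., a perfect shuffle of the row-major index's bits."""
    k = (n - 1).bit_length()           # bits per coordinate
    index = 0
    for b in range(k - 1, -1, -1):     # most significant bit first
        index = (index << 1) | ((i >> b) & 1)
        index = (index << 1) | ((j >> b) & 1)
    return index
```

For a 4-by-4 mesh, shuffled_row_major numbers the first row 0, 1, 4, 5 and the second row 2, 3, 6, 7, grouping each 2-by-2 quadrant into consecutive ranks, which is the property the mesh bitonic sort exploits.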

By assuming that n² keys with arbitrary values are initially distributed so that exactly one number resides in each processor, the sorting problem consists of moving the ith smallest number to the processor indexed by i, for i = 1, ..., n². As with sorting networks, parallelism is used to compare pairs of numbers simultaneously, and a number is compared to only one other number at any given unit of time. Concurrent data movement is allowed, but only in the same direction; that is, all processors can simultaneously transfer the contents of their transfer register to their right, left, above, or below neighbor. This computation model is SIMD, since at each time unit a single instruction (compare or move) can be broadcast for concurrent execution by the set of processors specified in the instruction. The complexity of a method that solves the sorting problem for this model can be measured in terms of the number of comparison and unit-distance routing steps. For the rest of this section we refer to the unit-distance routing step as a move. Any algorithm that is able to perform a permutation for this model will require at least 4(n - 1) moves, since it may have to interchange the elements from two opposite corners of the array processor (this is true for any indexing scheme). In this sense, a sorting algorithm that requires O(n) moves is optimal.

The odd-even and the bitonic network sorting algorithms were adapted to this parallel computation model, leading to two algorithms that perform the mesh sort in O(n) comparisons and moves [Thompson and Kung 1977]. The first algorithm uses an odd-even merge of two-dimensional arrays and orders the keys with snakelike row-major indexing. The second uses a bitonic sort and orders the keys with shuffled row-major indexing. A third algorithm that sorts in row-major order with similar performance was later obtained [Nassimi and Sahni 1979]. This algorithm is also an adaptation of the bitonic sort, in which the iterative rule is a merge of two-dimensional arrays. Finally, an improved version of the two-dimensional odd-even merge was recently proposed [Kumar and Hirschberg 1983]. On the basis of this merge pattern, a two-dimensional array can be sorted in row-major order, in time O(n), and with a smaller proportionality constant than the previous algorithms.

2.3 Summary

In this section we have examined two well-known sorting networks, the odd-even and bitonic networks, and shown that the concept of a sorting network has been extended to various schemes of synchronous parallel sorting. Although some consideration was given to hardware complexity, the complexity of sorting on these networks has mainly been characterized in terms of execution time and number of processing elements utilized. Thus, our baseline for evaluating the various sorting schemes employed by these networks was the number of comparison-exchanges required, and we did not systematically account for the degree of network interconnection as a complexity measure of the network sorting algorithms. It is beyond the scope of this study to provide a comprehensive analysis of interconnection networks. Extensive literature exists on this topic, and we have listed some references for the interested reader [Feng 1981; Nassimi and Sahni 1982; Pease 1977; Preparata and Vuillemin 1979; Siegel 1977, 1979; Thompson 1980].

Until very recently, the best-known performance for sorting networks was an O(log² n) sorting time with O(n log² n) comparators. We have shown how the bitonic network sort can be interpreted as a sorting algorithm that sorts n numbers in time O(log² n) with n/2 processors. In Section 3, we show that, in an attempt to develop faster parallel sorting algorithms, a more flexible parallel computation model than the network comparators (the shared memory model) has been successfully investigated. However, a recent theoretical result may renew the interest in network sorting algorithms [Ajtai et al. 1983], showing a network of O(n log n) comparators that can sort n numbers in O(log n) comparisons. Unfortunately, unlike the odd-even or the bitonic sort, this algorithm is not suitable for implementation. It is based on a complex graph construction that may make the proportionality constant (in the bound on the number of comparisons) unacceptably high.

3. SHARED MEMORY PARALLEL SORTING ALGORITHMS

After the time bound of O(log² n) was achieved with the network sorting algorithms, researchers attempted to improve this to the theoretical lower bound of O(log n). In this section, we describe several parallel algorithms that sort n elements with O(log n) comparisons. These algorithms assume a shared memory model of parallel computation.

Although the sorting network algorithms are based on comparison-exchanges of pairs, shared memory algorithms generally use enumeration to compute the rank of each element. Sorting is performed by computing in parallel the rank of each element and routing the elements to the location specified by their rank. Thus, in the network sorting algorithms, individual processors decide locally about their next routing step by communicating with their nearest neighbors, whereas in the shared memory algorithms, any processor may access any location of the global memory at every step of the computation. As shown in Section 2, the network algorithms assume a sparse interconnection scheme and differ only by the network interconnection topology. The shared memory sorting algorithms rely on parallel computation models that differ in whether or not they allow read and write conflicts and in how they resolve these conflicts [Borodin and Hopcroft 1982]. Clearly, the shared memory models are more powerful. However, at present, they are largely of theoretical interest, whereas the network models are suitable for implementation with current or near-term technology.

In the remainder of this section, we first describe a modified sorting network scheme that sorts by enumeration, using O(n²) processing elements [Muller and Preparata 1975]. We then survey two parallel merge algorithms that are faster than the nonadaptive network merge algorithms (the odd-even and the bitonic merge described in Section 2) and sorting algorithms that combine enumeration with parallel merge procedures [Preparata 1978]. In addition to these enumeration sorting algorithms, we also describe a parallel bucket sorting algorithm [Hirschberg 1978].

3.1 A Modified Sorting Network

In an attempt to reduce the number of comparisons required for sorting by increasing the degree of parallelism beyond O(n), Muller and Preparata [1975] first proposed a modified sorting network based on a different type of comparator (Figure 9). These comparators have two input lines and one output line. The two numbers to compare are received on the A and B lines. The output bit x is 0 if A < B and 1 if A > B. To sort a sequence of n elements, each element is simultaneously compared to all the others in one unit of time by using a total of n(n - 1) comparators. The output bits from the comparators are then fed into a parallel counter that computes, in log n steps, the rank of an element by counting the number of bits set to 1 as a result of comparing this element with all the other (n - 1). Finally, a switching network, consisting of a binary tree with (log n) + 1 levels of single-pole double-throw switches, routes the element of rank i to the ith terminal of the tree. There is one such tree for each element, and each tree uses (2n - 1) switches. Routing an element through this tree requires log n time units, which determines the algorithm's complexity. At the cost of additional hardware complexity, this algorithm sorts n elements in O(log n) time with O(n²) processing elements. Muller and Preparata's algorithm was the first to use an enumeration scheme for parallel sorting.

The idea of sorting by enumeration was exploited to develop other very fast parallel sorting algorithms [Hirschberg 1978; Preparata 1978], which improve Muller and Preparata's result by reducing the number of processing elements. Even from a theoretical point of view, the requirement of n² processors for achieving a speed of O(log n) is not satisfactory. A parallel sorting algorithm could theoretically achieve the same speed with only O(n) processors if it had a parallel speedup of order n.

Figure 9. Muller and Preparata's [1975] sorting network.

3.2 Faster Parallel Merging Algorithms

Optimal parallel sorting algorithms may use fast merging procedures in addition to enumeration. In a study of parallelism in comparison problems, Valiant [1975] presents a recursive algorithm that merges two sorted sequences of n and m elements (n ≤ m) with mn processors in 2 log(log n) + O(1) comparison steps (compared to log n for the bitonic merge). On the other hand, Gavril [1975] proposes a fast merging algorithm that merges two sorted sequences of length n and m with a smaller number of processors p ≤ n ≤ m. This algorithm is based on binary insertion and requires only 2 log(n + 1) + 4(n/p) comparisons when n = m.

Both Valiant's and Gavril's merging algorithms assume a shared memory model of computation. All the processors utilized can simultaneously access elements of the initial data or intermediate computation results.

3.3 Bucket Sorting

Hirschberg's [1978] "bucket sort" algorithm sorts n numbers with n processors in time O(log n), provided that the numbers to be sorted are in the range {0, 1, ..., m - 1}. A side effect of this algorithm is that duplicate numbers that may appear in the initial sequence are eliminated in the sorting process. If memory conflicts could be ignored, there would be a straightforward way to parallelize a bucket sort: It would be sufficient to have m buckets and to assign one number to each processor. The processor that gets the ith number is labeled Pi, and it is responsible for placing the value i in the appropriate bucket. For example, if P3 had the number 5, it would place the value 3 in bucket 5. The problem with this simplistic solution is that a memory conflict may result when several processors simultaneously attempt to store different values of i in the same bucket.

The memory contention problems can be solved by substantially increasing the memory requirements. Suppose that there is enough memory available for m arrays, each of size n. Each processor then can write in a bucket without any fear of memory conflict. To complete the bucket sort, the m arrays must be merged. The processors perform this merge operation by searching, in a binary tree search method, for the marks of "buddy" active processors. If Pi and Pj discover each other's marks and i < j, then Pi continues and Pj deactivates (hence dropping a duplicate value).
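A serial sketch of this conflict-free scheme may clarify the data layout. In our rendering below, each of the n processors owns a private column of an m-by-n mark array, so the writes never collide; the buddy search, which the parallel algorithm performs as an O(log n) tree search, is compressed here into a simple scan, so the sketch shows the memory organization rather than the parallel time bound.

```python
def parallel_bucket_sort(keys, m):
    """Hirschberg-style bucket sort of keys drawn from 0..m-1.

    An m-by-n array of marks lets processor i write only column i,
    eliminating memory conflicts.  As in the basic algorithm, only the
    lowest-numbered processor marking a bucket survives, so duplicate
    key values are dropped from the output."""
    n = len(keys)
    marks = [[False] * n for _ in range(m)]   # m buckets x n processors
    for i, key in enumerate(keys):            # conceptually one parallel step
        marks[key][i] = True                  # P_i marks bucket `key`
    out = []
    for bucket in range(m):
        if any(marks[bucket]):                # a surviving "buddy" exists
            out.append(bucket)
    return out
```

The O(mn) array of marks is exactly the space cost criticized below.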

Hirschberg also generalizes this algorithm so that duplicate numbers remain in the sorted sequence, but this generalization degrades the performance of the sorting algorithm. The result is a method that sorts n numbers with n^(1+1/k) processors in time O(k log n) (where k is an arbitrary integer).

In addition to the lack of feasibility of the shared memory model, another major drawback of the parallel bucket sort is its O(mn) space requirement. Even when the range of values is not very large, it would be desirable to reduce the space requirement; in the case of a wide range of values (e.g., when the sort keys are arbitrary character strings rather than integer numbers), the algorithm cannot be utilized.

3.4 Sorting by Enumeration

Parallel enumeration sorting algorithms, which do not restrict the range of the sort values and yet run in time O(log n), are described by Preparata [1978]. The keys are partitioned into subsets, and a partial count is computed for each key in its respective subset. Then, for each key, the sum of these partial counts is computed in parallel, giving the rank of that key in the sorted sequence. Preparata's first algorithm uses Valiant's [1975] merging procedure and sorts n numbers with n log n processors in time O(log n). The second algorithm uses Batcher's odd-even merge and sorts n numbers with n^(1+1/k) processors in time O(k log n). The performance of the latter algorithm is similar to Hirschberg's (Section 3.3), but it has the advantage of being free of memory contention. Recall that Hirschberg's model required simultaneous fetches from the shared memory, whereas Preparata's method does not (since each key participates in only one comparison at any given unit of time).

3.5 Summary

Despite the improvement achieved by eliminating memory conflicts, the more recent shared memory algorithms are still far from being suitable for implementation. Any model requiring at least as many processors as the number of keys to be sorted, all sharing a very large common memory, is not feasible with present or near-term technology. These models also ignore significant computation overheads such as, for instance, the time associated with the reallocation of processors during various stages of the sort algorithm (although a first attempt at introducing this factor in a computation model is made by Vishkin [1981]).

However, the results achieved are of major theoretical importance, and the methods used demonstrate the intrinsic parallel nature of certain sorting procedures. It may also happen that future research will succeed in refining the shared memory model for parallel computation and make it more reasonable from a computer architecture point of view. An attempt to classify the various types of assumptions underlying recent research on shared memory models of parallel computation is made by Borodin and Hopcroft [1982]. Of particular interest is the class of algorithms that allow simultaneous reads, but allow simultaneous writes only if all processors try to write the same value [Shiloach and Vishkin 1981].

4. BLOCK SORTING ALGORITHMS

For all the parallel sorting algorithms described in previous sections, the number of records or keys to be sorted is limited by the number of processors available. Typically, O(n) or more processors are utilized to sort n records. Thus these algorithms implicitly assume that the number of processors is very large.

This type of assumption was initially justified when parallel sorting algorithms were developed for implementing fast switching networks. In this context, there are two reasons that justify the requirement of n (or n/2) processors to sort n numbers. First, in a switching network, the processors are simple hardware boxes that compare and exchange their two inputs. Second, since the number of processors is proportional to the number of input lines to the network, it can never be prohibitively high.

However, for a general-purpose sorting algorithm, it is desirable to set a limit on the number of processors available, so that the number of records that can be sorted will not be bounded by the number of processors. Furthermore, it must be possible to sort a large array with a relatively small number of processors. In general, research on parallel algorithms (for sorting, searching, and various numerical problems) is based on the assumption of unlimited parallelism. It is only recently that technology constraints, on one hand, and a better understanding of parallel algorithms, on the other, are motivating the development of algorithms for computers with a relatively small number of processors. An excellent illustration of this trend is a systematic study of quotient networks by Fishburn and Finkel [1982] for networks such as the perfect shuffle and the hypercube. Quotient networks are architectures that exploit limited parallelism in a very efficient way. The idea is that, given a network of p processing units, a problem of size n (for arbitrarily large n) can be solved by having each processing unit emulate a network of size O(n/p) with the same topology. Together, the p processing units emulate a network of size O(n).

In the area of parallel sorting, the problem of limited parallelism has not been systematically addressed until recently. We propose some basic ideas for further research in this direction in the following paragraphs.

When p processors are available and n records are to be sorted, one possibility is to distribute the n records among the p processors so that a block of M = ⌈n/p⌉ records is stored in each processor's local memory (a few dummy records may have to be added to constitute the last block). The processors are labeled P1, P2, ..., Pp, according to an indexing rule that is usually dictated by the topology of the interconnecting network. Then, the processors cooperate to redistribute the records so that

(1) the block residing in each processor's memory constitutes a sorted sequence Si of length M, and
(2) the concatenation of these local sequences, S1, S2, ..., Sp, constitutes a sorted sequence of length n.

For example, for three processors, the distribution of the sort keys before and after sorting could be the following:

      Before     After
P1    2, 7, 3    1, 2, 3
P2    4, 9, 1    4, 5, 6
P3    6, 5, 8    7, 8, 9

Thus we now have a convention for ordering the total address space of a multiprocessor, and we have defined parallel sorting of an array of size n, where n may be much larger than p.

Algorithms to sort large arrays or files that are initially distributed across the processors' local memories can be constructed as a sequence of block merge-split steps. During a merge-split step, a processor merges two sorted blocks of equal length (which were produced by a previous step) and splits the resulting block into a "higher" and a "lower" block, which are sent to two destination processors (like the high and low outputs in a comparison-exchange step). A block sorting algorithm is obtained by replacing every comparison-exchange step (in a sorting algorithm that consists of comparison-exchange steps) by a merge-split step. It is easy to verify that this procedure produces a sequence that is sorted according to the above definition.

There are two ways to perform a merge-split step. One is based on a two-way merge [Baudet and Stevenson 1978]; the other is based on a bitonic merge [Hsiao and Menon 1980]. In Sections 4.1 and 4.2, we describe both methods and illustrate them by investigating the block sorting algorithms that they generate on the basis of the odd-even transposition sort (Section 1.1) and the bitonic sort (Section 2.1.2). An important property of the parallel block sorting algorithms generated by both methods is that, like the network sorting algorithms, they can be executed in SIMD mode (see Section 2.2).

Figure 10. Merge-split based on two-way merge.

4.1 Two-Way Merge-Split back the higher M records. The algorithm can be executed synchronously by p pro- A two-way merge-split step is defined as a cessors, odd and even processors being al- two-way merge of two sorted blocks of size ternately idle. M, followed by splitting the result block of size 2M into two halves. Both operations 4.1.2 Block Bitonic Sort Based are executed within a processor's local on Two-Way Merge-Split memory. The contents of processor's mem- ory before and after a two-way merge-split By using Batcher's bitonic, p records can are shown in Figure 10. After two sorted be sorted with p/2 processors in log 2 p sequences of length M have been stored in shuffle steps and 1/2((log p) + 1)(log p) each processor's local memory, the proces- comparison-exchange steps. Suppose that sors execute in parallel a merge procedure each processor has a local memory large and fill up an output buffer 0 [1..2M] (thus enough to store 4M records. In this case, a a two-way merge-split step uses a local processor can perform a two-way merge memory of size at least 4M). After all pro- split on two blocks of size M. By replacing cessors have completed the parallel execu- each comparison-exchange step by a two- tion of the merge procedure, they split their way merge-split step, we obtain a block output buffer and send each half to a des- bitonic sort algorithm that can sort M.p tination processor. The destination pro- records with p/2 processors in log2p shuffle cessors' addresses are determined by the steps, and 1/2((log p) + 1)(log p) merge- comparison-exchange algorithm on which split steps. During a shuffle step, each pro- the block sorting algorithm is based. cessor sends to each of its neighbors a sorted sequence of length M. During a 4.1.1 Block Odd-Even Sort Based merge-split step, each processor performs on Two-Way Merge-Split a two-way merge of the two sequences of length M (which it has received during the Initially, each of the p processors' local previous shuffle step) and splits the result- memory contains a sequence of length M. ing sequence into two sequences of length The algorithm consists of a preprocessing M. The algorithm is illustrated in Figure step (Step 0), during which each processor 11, for two processors, where M = 2. independently sorts the sequence residing In the general case, the algorithm re- in its local memory, and p additional steps quires p/2 processors, where p is a power (Steps 1 to p), during which the processors of 2, each with a local memory of size cooperate to merge the p sequences gener- 4(M.p), to sort M.p records. ated by Step 0. During Step 0, the proces- sors perform a local sort by using any fast 4.1.3 Processor Synchronization serial sorting algorithm. For example, a local two-way merge or a quick sort can be When M is large, or when the individual used. Steps 1 to p are similar to Steps 1 to records are long, transferring blocks of p of the odd-even transposition sort (see M.p records between the processors intro- Section 1.1). During the odd (even) steps, duces time delays that are higher by several the odd- (even-) numbered processors re- orders of magnitude than the instruction ceive from their right neighbor a sorted rate of the individual processors. In addi- block, perform a two-way merge, and send tion, depending on the data distribution,



4.1.3 Processor Synchronization

When M is large, or when the individual records are long, transferring blocks of M records between the processors introduces time delays that are several orders of magnitude longer than the instruction cycle time of the individual processors. In addition, depending on the data distribution, the number of comparisons required to merge two blocks of M records may vary. Thus, for the execution of block sorting algorithms based on two-way merge-split, a coarser granularity of processor synchronization may be more adequate than the SIMD mode, where processors are synchronized at the machine instruction level. More appropriate for these algorithms is a multiprocessor model in which the processors operate independently of each other, but can be synchronized by exchanging messages among themselves or with a controlling processor at intervals of several thousand instructions. At the initiation of a block sorting algorithm, the controller assigns a number of processors to its execution. Because other operations may already be executing, the controller maintains a free list and assigns processors from this list. In addition to the availability of processors, the size of the sorting problem is also considered by the controller in determining the optimal processor allocation.

4.2 Bitonic Merge-Exchange

Consider the situation in which two processors Pi and Pj each contain a sorted block of length M, and we want to compare and exchange records between the processors so that the lower M records reside in Pi and the higher M in Pj. One way to obtain this result is to execute the following three steps:

(1) Pj sends its block to Pi.
(2) Pi performs a two-way merge-split.
(3) Pi sends the high half-block back to Pj.

However, as indicated in the previous section, the two-way merge-split requires a processor's local memory size to be at least 4M. An alternative is for Pj to send one of its records at a time and wait for a return record from Pi before sending the next record. Suppose that M records (x1, x2, ..., xM) are stored in increasing order in Pi's memory and M records (y1, y2, ..., yM) are stored in decreasing order in Pj's memory. Let Pj send y1 to Pi. Pi then compares x1 and y1, keeps the lower of the two, and sends the higher record back to Pj. This procedure is then repeated for (x2, y2), ..., (xM, yM). This sequence of comparison-exchanges constitutes the "bitonic merge" and results in having the highest M records in Pj and the lowest M in Pi [Alekseyev 1969; Knuth 1973]. The merge-split operation can then be completed by having Pi and Pj each perform a local sort of their M records. Figure 12 illustrates the bitonic merge-exchange operation for M = 5.

Figure 12. Bitonic merge-exchange step.
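The record-at-a-time protocol just described is easy to express in code. In the sketch below (our own illustration; the two processors' memories are modeled as Python lists, and each message exchange becomes an in-place swap), x is stored in increasing order in Pi and y in decreasing order in Pj:

    def bitonic_merge_exchange(x, y):
        """Comparison-exchange loop of the bitonic merge-exchange.
        After M synchronous exchanges, x holds the lowest M records and
        y the highest M; a local sort of each block then completes the
        merge-split."""
        for k in range(len(x)):
            # Pj sends y[k]; Pi keeps the lower record and returns the higher.
            if y[k] < x[k]:
                x[k], y[k] = y[k], x[k]
        x.sort()  # the two halves are bitonic, so cheap local sorts finish the job
        y.sort()
        return x, y

Because the records travel one at a time, each processor needs buffer space for only M + 1 records, a point taken up below.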

It is important to notice that the data exchanges are synchronous (unlike in the two-way merge-split operation). Thus the block sorting algorithms based on the bitonic merge-exchange are more suitable for implementation on parallel computers that require a high degree of synchronization between their processors.

The bitonic merge-exchange also requires substantially less buffer space than the two-way merge-split. Because the two-way merge-split merges two blocks of size M within a processor's local memory, it uses 4M space; the bitonic merge-exchange requires space for only M + 1 records. Finally, the comparisons (of pairs of records) and the transfers are interleaved in every bitonic merge-exchange step. The two-way merge-split requires that an entire block of data be transferred to a processor's memory before the merge operation is initiated, whereas the bitonic merge-exchange can overlap each record's transfer time with processing time.

However, a major disadvantage of the bitonic merge-exchange is the necessity of performing a local sort of M records in each processor after the exchange step is completed. To perform the local sort, a serial sorting algorithm that permutes the records in place should be used; otherwise the local sort might require more memory than the exchange. Note that the sequences generated by the bitonic exchange are bitonic, so sorting them requires at most (M/2) log M comparisons and local moves.

4.2.1 Block Odd-Even Sort Based on Bitonic Merge-Exchange

As with the block odd-even sort based on two-way merge-split (Section 4.1.1), we start with M records in each processor's memory and perform an initial phase in which each processor independently sorts the sequence in its memory. However, Steps 1 to p are different: during odd (even) steps, odd- (even-) numbered processors perform a bitonic merge-exchange with their right neighbor. Figure 13 illustrates this algorithm for p = 4, where M = 5.

4.2.2 Block Bitonic Sort Based on Bitonic Merge-Exchange

A fast and space-efficient block sorting algorithm can be derived from Stone's version of the bitonic sort, which was described in Section 2.1.2. Consider a network of p identical processors, where p is a power of 2, interconnected by two types of links (Figure 14):

(1) two-way links between pairs of adjacent processors: P0P1, P2P3, ...;
(2) one-way shuffle links, connecting each Pi to its shuffle processor.

If each processor has a local memory of size M + 1, then M·p records can be sorted by alternating local sorts, block-bitonic exchanges between neighbor processors, and shuffle procedures. During a shuffle procedure, each processor sends the records that were in its memory, in order, to the corresponding locations of the shuffle processor's memory and receives the records that were in the memory of the reverse shuffle processor. Figure 15 illustrates this algorithm for p = 4, where M = 5.
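Putting the pieces together, the block odd-even sort of Section 4.2.1 can be simulated sequentially as follows (a sketch only, reusing the distribute and bitonic_merge_exchange helpers assumed above; a real implementation would run the loop bodies on p communicating processors):

    def block_odd_even_sort(blocks):
        """Block odd-even transposition sort (Section 4.2.1).
        blocks[i] is processor P(i+1)'s local block of M records."""
        p = len(blocks)
        for b in blocks:                  # Step 0: independent local sorts
            b.sort()
        for step in range(1, p + 1):      # Steps 1 to p
            first = 0 if step % 2 == 1 else 1  # odd steps pair (P1,P2), (P3,P4), ...
            for i in range(first, p - 1, 2):
                # The right neighbor presents its block in decreasing order.
                lo, hi = bitonic_merge_exchange(blocks[i], blocks[i + 1][::-1])
                blocks[i], blocks[i + 1] = lo, hi
        return blocks

Called as block_odd_even_sort(distribute(data, p)), this mirrors the exchange pattern traced in Figure 13.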
5. EXTERNAL PARALLEL SORTING

In this section, we address the problem of sorting a large file in parallel. Serial file sorting algorithms are often referred to as "external sorting algorithms," as opposed to array sorting algorithms, which are "internal." For a conventional computer system, the need for an external sorting algorithm arises when the file to be sorted is too large to fit in main memory.

Thus, for a single processor, the distinction between internal and external sorting methods is well known, and there are accepted criteria for measuring their respective performances. The topic of parallel external sorting, however, has not yet received adequate consideration.

In Section 4, we presented a number of parallel algorithms that can sort an array that is initially distributed across the processors' memories. The size of the array was limited only by the total memory of the system (considered as the concatenation of the processors' local memories). By analogy to the definition of serial internal sorting, these algorithms may be called "parallel internal sorting algorithms."





Figure 13. Block odd-even sort (20 records).



Figure 14. Processors' interconnection for the block-bitonic sort. (a) p = 4. (b) p = 16. (The figure shows the secondary memory, the controller and its control lines, the two-way links, and the one-way processor-to-processor links.)


A parallel sorting algorithm is defined as a parallel external sorting algorithm if it can sort a collection of elements that is too large to fit in the total memory available in the multiprocessor. This definition is general enough to apply to both categories of parallel architectures: the shared memory multiprocessors and the loosely coupled multiprocessors (also called "multicomputers").

For shared memory multiprocessors, an external sorting algorithm is required when the shared memory is not large enough to hold all the elements (plus some work space for executing the sort program). For loosely coupled multiprocessors, on the other hand, the assumption is that the source records cannot be distributed across the processors' local memories; that is, the multicomputer has p identical processors and each processor's memory is large enough to hold k records, but the source file has more than p·k records. In both cases, the processors can access a mass-storage device on which the file resides. At termination of the algorithm, the file must be written back to the mass-storage device in sorted order.³

³ Physical order on the mass-storage device must be defined according to the physical characteristics of the storage device. For example, for a magnetic disk, a track-numbering convention must be agreed upon.

An early result on parallel tape sorting appeared in Even [1974]. More recently, several parallel sorting algorithms have been proposed in Bitton-Friedland [1982] for files residing on a modified moving-head disk.





Figure 15. Block-bitonic sort. (a) Stage 1. (b) Stage 2. (c) Stage 3.


5.1 Parallel Tape Sorting

The sorting problem addressed by Even [1974] is to sort a file of n records with p processors (where n is much larger than p) and 4p magnetic tapes. The only internal memory requirement is that three records fit simultaneously in each processor's local memory. Under these assumptions, Even proposes two methods for parallelizing the serial two-way external merge sort. In the first method, all the processors start together and work independently of each other on separate partitions of the file. In the second, processors are added one at a time to perform the sort in a pipelinelike fashion. Both methods can be described briefly:

Method 1. Each processor is assigned n/p records and four tapes, and performs a (serial) external merge sort on this subset. After p sorted runs have been produced by this parallel phase, a single processor merges the runs serially during a second phase.

Method 2. The basic idea is that each processor performs a different phase of the serial merge procedure. The ith processor merges pairs of runs of size 2^(i-1) into runs of size 2^i. Ideally, n is a power of 2 and log n processors are available. A high degree of parallelism is achieved by using the output tapes of a processor as the input tapes of the next processor, so that, as soon as a processor has written two runs, these runs can be read and merged by another processor. In order to overlap the output time of a processor with the input time of its successor, each processor writes alternately on four tapes (one output run per tape).

These methods show that, from the algorithmic point of view, it is possible to introduce a high degree of parallelism into the conventional two-way external merge sort. However, the assumptions about the mass-storage device do not reflect constraints imposed by technology. Like the shared memory model for array sorting, a parallel file sorting model that assumes a shared mass-storage device with unlimited I/O bandwidth (e.g., a model with p processors and 4p magnetic tape drives) provides very limited insight into the practical aspects of implementation.
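Method 2's pipeline has a natural software analogue in lazy streams. In the sketch below (our own rendering; Python generators stand in for the processors, and pulling from a run stream stands in for reading a predecessor's output tapes), stage i doubles the run length and begins work as soon as its predecessor yields runs:

    from heapq import merge

    def initial_runs(records):
        """Present a nonempty record list as a stream of one-record runs."""
        for rec in records:
            yield [rec]

    def merge_stage(run_stream):
        """One pipeline processor: merge consecutive pairs of runs into
        runs twice as long, emitting each merged run as soon as it is ready."""
        it = iter(run_stream)
        for left in it:
            right = next(it, [])
            yield list(merge(left, right))

    def pipelined_merge_sort(records):
        stream = initial_runs(records)
        length = 1
        while length < len(records):   # ideally log n stages, one per processor
            stream = merge_stage(stream)
            length *= 2
        return next(iter(stream))      # the single remaining run is the sorted file

The demand-driven generators capture the overlap that Even obtains with the four output tapes: a stage starts merging as soon as the previous stage has produced its first two runs.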
5.2 Parallel Disk Sorting

The notion of a sorted file stored on a magnetic disk requires that physical order be defined, since disks are not sequential storage media. Within a disk track, records are stored sequentially, but a convention is then needed for numbering tracks. For example, adjacent tracks could be defined as consecutive tracks on one disk surface. This convention is adequate if a separate processor is associated with each disk surface. Another way to model the mass-storage device is to consider a modified moving-head disk that provides for parallel read/write of the tracks on the same cylinder (Figure 17). Disks that provide this capability have been proposed [Banerjee et al. 1978] and, in some cases, already built. The idea was pioneered by database machine designers, and prototypes have been built in the framework of database research projects (see, e.g., Leilich et al. [1978]). Commercial parallel readout disks have recently been made available for high-performance computers (e.g., a 600-Mbyte drive with a four-track parallel readout capability and a data transfer rate of 4.84 Mbytes/second is now available for the Cray-1 computer). Thus parallel readout disks appear to constitute a viable compromise between the cost-effective, conventional moving-head disk and the obsolete fixed-head disk.

In order to minimize seek time, two disk drives can be used concurrently: during the execution of a single phase of a sorting algorithm, one drive can be utilized for reading and the other for writing.

In Bitton-Friedland [1982], a number of parallel external sorting algorithms and architectures are examined and analyzed. The mass-storage device is modeled as a parallel read/write disk. The algorithm with the best performance is a parallel two-way external merge sort, termed the parallel binary merge sort. It is an improved variation of Method 1 in Section 5.1, achieved by parallelizing the second phase of that method.





Figure 16. Parallel binary merge sort (suboptimal, optimal, and postoptimal stages).

When the number of output runs is 2^k, with k > 1, 2^(k-1) processors can be used to perform the next step of the merge sort concurrently. Execution of the parallel binary merge algorithm can thus be divided into three stages, as shown in Figure 16. The algorithm begins execution in a suboptimal stage (similar to Phase 1 of Method 1), in which sorting is done by successively merging pairs of runs into longer runs until the number of runs equals twice the number of processors. During the suboptimal stage, the processors operate in parallel, but on separate data. Parallel I/O is made possible by associating each processor with a surface of the read disk and a surface of the write disk. When the number of runs equals 2p, each processor merges exactly two runs of length N/2p. We term this stage the optimal stage.



Figure 17. Architecture for the parallel binary merge sort (a binary tree of processors between the input and output drives).



During the postoptimal stage, parallelism is employed in two ways. First, 2^(k-1) processors are utilized to concurrently merge 2^(k-1) pairs of runs (this occurs after log(m/2^k) merge steps, where m denotes the number of initial runs). Second, pipelining is used between merge steps. That is, the ith merge step starts as soon as Step (i - 1) has produced one unit of each of its two output runs (where a unit can be a single record or an entire disk page).

The ideal architecture for the execution of this algorithm is a binary tree of processors, as shown in Figure 17. The mass-storage device consists of two drives, and each leaf processor is associated with a surface on both drives. In addition to the leaf processors, the disk is also accessed by the root processor, which writes the output file. This organization permits the leaf processors to do input/output in parallel, while reducing almost by half the number of processors that must actually do input/output.

5.3 Analysis of the Parallel External Sorting Algorithm

For serial external sorting, numerous empirical studies have been done on real computers and real data in order to evaluate the performance of external sorting algorithms. The results of these studies have complemented analytical results in cases where modeling the effect of access time and the impact of data distribution analytically was too complex. In a parallel environment, the analytical performance evaluation of an external sorting scheme is made even more difficult by the complexity of the input/output device.

We can get some indication of the parallel speedup achievable by performing an external sort in parallel by assuming that the available input/output bandwidth is limited only by the number of processors. However, a satisfactory analysis of parallel external sorting algorithms must also consider the constraints imposed by mass-storage technology. For example, if the modified disk described in Section 5.2 is used for storage, the suboptimal stage of the parallel binary merge algorithm can be either highly parallel or almost sequential, depending on whether or not the processors request data from several tracks on the same cylinder.
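Under the idealized assumption that input/output bandwidth is limited only by the number of processors, the shape of the three stages is easy to quantify (a sketch with our own variable names, not taken from Bitton-Friedland [1982]):

    import math

    def merge_steps_per_stage(n_runs, p):
        """Merge steps in each stage of the parallel binary merge sort,
        for a file that starts as n_runs sorted runs and is sorted by
        p processors (both assumed powers of 2, with n_runs >= 2 * p)."""
        suboptimal = int(math.log2(n_runs // (2 * p)))  # until 2p runs remain
        optimal = 1                       # p processors each merge one pair of runs
        postoptimal = int(math.log2(p))   # p runs shrink to one, with pipelining
        return suboptimal, optimal, postoptimal

For example, with n_runs = 1024 and p = 8, the stages take 6, 1, and 3 merge steps, respectively; since every step halves the number of runs, the three counts always sum to log₂ of the number of initial runs.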

6. HARDWARE SORTERS

The high cost and frequent need of sorting are motivating the design of "sort engines," which could eventually off-load the sorting function from general-purpose central processing units (CPUs). By implementing in hardware the sequence of comparison and move steps required by an efficient sorting algorithm, one could realize a low-cost, fast device that would significantly lighten the burden on the CPU. Several alternative designs of hardware sorters have recently been proposed [Chen et al. 1978; Chung et al. 1980; Dohi et al. 1982; Lee et al. 1981; Thompson 1983; Yasuura et al. 1982], and preliminary evaluations seem to indicate that a VLSI implementation of sorting devices could soon become feasible. The relatively simple logic required for sorting constitutes a strong argument in favor of this approach. In addition, the advent of new and inexpensive shift-register technologies, such as charge-coupled devices and bubble memories, is stimulating new designs of hardware sorters based on these technologies [Chung et al. 1980; Lee et al. 1981].

A future outcome of improvements in technology might be that bubble chips could provide storage for large files with on-chip sorting capabilities. In this case, the sorting function could be provided by the mass-storage devices themselves, without requiring the transfer of files to a dedicated sorting machine or to the main memory of a general-purpose computer. However, it is premature at this point to determine whether or not advances in technology will be able to provide intelligent mass-storage devices with sorting capabilities.
Hardware sorters, in particular VLSI sorting circuits, are at present the focus of active research. Theoretical problems related to area-time complexity are also drawing considerable attention to VLSI sorting [Leiserson 1981; Thompson 1980, 1983]. It is beyond the scope of this paper to present a proper survey of the theoretical bounds obtained for the chip area and time complexity of VLSI sorters. These results pertain to an area of research in complexity theory that is still being defined, and they cannot be presented without properly defining VLSI circuit areas and introducing the theoretical area · time² trade-off.⁴

⁴ In a recent work (published after this paper was written), Thompson has surveyed 13 different VLSI circuits that implement a range of sorting schemes: heap sort, pipelined merge sort, bitonic sort, bubble sort, and sort by enumeration. For each of these schemes, one or several circuit topologies (linear array, mesh, binary tree, shuffle-exchange and cube-connected cycles, mesh of trees) are considered, and the resulting sorter is evaluated with respect to its area · time² complexity.

In the remainder of this section, we present an overview of designs that have been proposed for hardware sorters. We have chosen to concentrate on sorters that were originally conceived for magnetic bubble technology, because they illustrate well how technology constraints define trade-offs between sorting speed and parameters related to input/output bandwidth, memory, and the number of control lines required by on-chip sorters. In particular, we describe the rebound sorter and the up-down sorter (which are clever pipeline versions of the odd-even transposition sort of Section 1.1) and a number of magnetic bubble sorters that integrate a sorting capability in bubble memory.

6.1 The Rebound Sorter

The uniform ladder [Chen 1976] is an N-loop shift-register structure capable of storing N records, one record to a loop. Since records stored in adjacent loops can be exchanged, this storage structure is very suitable for a hardware implementation of the odd-even transposition sort (see Section 1.1). If the time for a bit to circulate within a loop is called a period, then N records (which have previously been stored in the ladder) can be sorted in (N + 1)/2 periods using (N - 1) comparators.

Further investigation of the ladder structure led to the design of a new sorting scheme, in which input/output of the records can be completely overlapped with the sorting time. This scheme is the rebound sort [Chen et al. 1978].



Figure 18. The rebound sorter. (a) The steering unit. (b) Stacking steering units. [From Chen et al. 1978. © IEEE 1978.]


The basic building block of the rebound sorter is the steering unit (Figure 18a), which has an upper-left cell L and a lower-right cell R. A sorter for N records is assembled by stacking (N - 1) steering units, plus a bottom cell and a top cell (Figure 18b). Associated with the two cells of a steering unit is a comparator K, which can compare the two values stored in the upper-left and lower-right cells. A record may be stored across adjacent cells in two steering units, but the sorting key (assumed to be at the head of the record) must fit entirely in one cell, so that two keys can be compared within a single steering unit. Sorting is performed by pipelining the input records through the stack of steering units. Records enter the sorter through the upper-left cell of the top steering unit and emerge in sorted order from the top cell (at the upper-right corner of the stack) after 2N steps. The sorting scheme is illustrated in Figure 19 for N = 4. The sorter alternates between a decision state and a continuing state. In the decision state, each steering unit compares the keys stored in its upper-left and lower-right cells and emits the keys either horizontally (upper-left key to the right, lower-right key to the left) or vertically (upper-left key to the unit above, lower-right key to the unit below), depending on the outcome of the comparison. In the continuing state, each steering unit continues to emit its contents in the direction determined in the previous decision step. The continuing steps are required to append the body of each record to its key. It is readily seen that the first key emerges from the sorter after N steps (t0 + 8 in Figure 19) and that the complete sorted sequence is produced in the next N steps.

Figure 19. The rebound sort for four records. [From Chen et al. 1978. © IEEE 1978.]

6.2 The Up-Down Sorter

A significant improvement over the ladder sorter can be achieved by incorporating the comparison function in the basic steering unit and using an "up-down" sorting scheme instead of the rebound sort. To sort 2N keys, the "compare/steer bubble sorter" [Lee et al. 1981] requires N compare/steer units stacked on top of each other. It is assumed that the entire record fits in a cell (thus two records are stored in each compare/steer unit). The up-down sort is illustrated in Figure 20 for N = 3 (three compare/steer units sorting six records). The up-down sorter operates in two phases. During the downward input phase, 2N keys are loaded in 2N periods: during each period, a key enters the sorter and every unit pushes down the larger of its two keys to the unit beneath it. During the upward output phase, each unit pops up the smaller of its two keys to the unit above it, and a key is output in every period.
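The two phases of the up-down sorter are easy to simulate in software (a sketch under our own conventions: each compare/steer unit is modeled as a pair of cells, and an infinite key stands for an empty cell):

    INF = float("inf")

    class UpDownSorter:
        """Behavioral model of the compare/steer up-down sorter: N units,
        each holding two keys.  Load exactly 2N keys with push(), then
        read them back in ascending order with pop()."""

        def __init__(self, n_units):
            self.units = [[INF, INF] for _ in range(n_units)]

        def push(self, key):
            # Downward input phase, one period: the entering key ripples
            # down; each unit keeps its two smaller keys and pushes the
            # largest to the unit beneath it.
            for unit in self.units:
                unit.append(key)
                unit.sort()
                key = unit.pop()  # the largest of the three moves down

        def pop(self):
            # Upward output phase, one period: each unit pops up the
            # smaller of its two keys to the unit above it.
            out = min(self.units[0])
            self.units[0].remove(out)
            for upper, lower in zip(self.units, self.units[1:]):
                smallest = min(lower)
                lower.remove(smallest)
                upper.append(smallest)
            self.units[-1].append(INF)  # the freed bottom cell becomes empty
            return out

For the six keys of Figure 20, an UpDownSorter(3) loaded with 5, 2, 6, 1, 4, 3 pops 1 through 6 in ascending order.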



The up-down sorter eliminates the large number of control lines required by the rebound sorter. Whereas the rebound sorter needs multiple control lines to individually activate the switches of the bubble ladder, the up-down sorter can be implemented with a single control line that resets all the compare/steer units at the beginning of each phase. Thus on-chip compare/steer units have a better chance of providing a single-chip implementation of large file sorters.

Both the rebound sorter and the up-down sorter have the very desirable property of completely overlapping input/output time with sorting time. Thus, assuming that only serial input/output of data records is available, they provide an optimal hardware implementation of file sorting.

6.3 Sorting within Bubble Memory

The sorter described in the previous section incorporates the comparison function in the design of the bubble chip. Thus this type of chip constitutes an intelligent memory, capable of performing the logical operations required by sorting. The sorters proposed by Chung et al. [1980] also attach comparators to the bubble memory, but in addition they eliminate the input/output function that is an intrinsic part of the previous sorting algorithms. The motivation for designing bubble elements that sort in situ stems from the assumption that magnetic bubble memory may soon provide cost-effective mass-storage systems. If advances in technology make this assumption realistic, then sorting a file will only require rearranging records in mass storage according to the results of comparisons performed within the memory, without input/output operations or CPU intervention.

Four models of intelligent bubble memory are considered by Chung et al. [1980], and for each model an alternative sorting scheme is proposed. The models differ in the size of the bubble loops and in the number of switches required between the loops (to perform comparisons).







Figure 20. Operation of the up-down sorter for six records (only the keys are shown). [From Lee et al. 1981. © IEEE 1981.]

The first two models implement a bubble sort and an odd-even transposition sort, respectively, whereas the other two implement a bitonic sort. The first model has two loops, one of size (n - 1) and the other of size 1, with a single switch between them (Figure 21a) used to perform the bubble sort. The second model is a linear array of loops, all of size 1, with a switch between every pair of adjacent loops (Figure 21b); the (n - 1) switches perform comparisons in parallel, according to the odd-even transposition scheme (Section 1.1). For the other two models (Figure 21c, d), the basic idea is to have the option of opening a switch between adjacent loops (of the same size in Model 3, but of different sizes in Model 4) so that the two loops are collapsed into a larger loop. At every step of the bitonic sort, larger loops are formed that contain bitonic sequences. Because they implement a faster algorithm, these sorters are faster than the first sorter. However, the trade-off is a higher hardware complexity (more switches and more control states per switch), which may be beyond the present limits of chip density. For example, the bubble-sort sorter sorts in O(n²) comparison steps, but it requires only one simple switch (with three control states). On the other hand, a bitonic sorter sorts in time O(n log² n), but requires log n complex switches (each with 3 log n control states). Thus these detailed designs of bubble sorters provide an excellent illustration of cost-performance trade-offs in sorting.
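Functionally, the second model is just the odd-even transposition sort of Section 1.1, with the (n - 1) switches acting as comparators between unit-size loops. A software rendering of that behavior (ours) is:

    def odd_even_transposition(keys):
        """Functional view of Model 2: n unit loops hold one key each;
        in each of n steps, alternating halves of the (n - 1) switches
        compare-exchange adjacent loops in parallel."""
        n = len(keys)
        for step in range(n):
            for i in range(step % 2, n - 1, 2):
                if keys[i] > keys[i + 1]:
                    keys[i], keys[i + 1] = keys[i + 1], keys[i]
        return keys

Model 2 thus trades (n - 1) switches for n parallel steps, against Model 1's single switch and O(n²) steps.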




Figure 21. Bubble loop structures for sorting. (a) Model 1. (b) Model 2. (c) Model 3. (d) Model 4. [From Chung et al. 1980. © IEEE 1980.]



6.4 Summary and Recent Results

Several designs of hardware sorters have recently been proposed, and preliminary evaluations of their feasibility are being performed. One of the most promising approaches appears to be the implementation of a simple pipelined sorting scheme on bubble chips.

However, a number of alternative designs are being investigated. More recently, two detailed layouts of VLSI sorters have been proposed. A high-capacity cellular array that sorts by enumeration is investigated by Yasuura et al. [1982]. In Dohi et al. [1982], cells that are able to sort-merge data in compressed form are connected in a binary tree topology to constitute a powerful sorter. In both cases, the design of the basic cell is simple enough to allow very high density packaging with current or near-future technology.

The parallel sorting algorithms used by currently proposed hardware sorters are simple and slow compared with either the sorting networks or the shared memory model algorithms. Although theoretical complexity bounds are being investigated for faster VLSI sorters and important results have been achieved [Bilardi and Preparata 1983; Thompson 1983], the feasibility of fast, high-capacity VLSI sorters is still an open question. However, this direction of research on parallel sorting appears to be very promising. A well-defined VLSI complexity model that combines some measure of hardware complexity with time efficiency provides a systematic approach to the analysis of parallel sorting algorithms. For bubble memory devices, under the assumption that records are read and written serially, it has been shown that input/output time can be effectively overlapped with sorting time. Thus advances in technology may soon make a well-designed, dedicated sorting device a cost-effective addition to many computer systems.

7. CONCLUSIONS AND OPEN PROBLEMS

Over the last decade, parallel sorting has been the focus of active research. Many parallel sorting algorithms are currently known, and new algorithms are being developed, ranging from network sorting algorithms to algorithms for hypothetical shared memory parallel computers and VLSI chips. Research on parallel sorting has offered many challenges to both theoreticians and systems designers.


From a theoretical point of view, the main research problem has been to design algorithms that, by systematically exploiting the intrinsic parallelism in sorting and merging, would reach the theoretical lower bound on time; that is, algorithms that would sort n numbers in time O(log n) on a hypothetical O(n)-processor parallel machine. From a practical point of view, systems designers have investigated feasibility with current or near-term technology, and the integration of input/output time into the cost of parallel sorting.

Despite the apparent disparity among the numerous parallel sorting algorithms that have been proposed, we have shown how these algorithms may be broadly classified into three categories: network sorting algorithms, shared memory sorting algorithms, and parallel file sorting algorithms.

The first category includes algorithms that are based on nonadaptive, iterative merging rules. Although first proposed in the context of sorting networks, the two fundamental parallel merging algorithms (the odd-even merge and the bitonic merge described in Section 2.1) were subsequently embedded in a more general model of parallel computation, where processors exchange data synchronously along the lines of a sparse interconnection network. In particular, the bitonic sort has been adapted for mesh-connected processors (Section 2.2.1) and for a number of networks such as the shuffle, the cube, and the cube-connected cycles.

Algorithms in the second category require a more flexible pattern of memory accesses than the network sorting algorithms. They assume shared memory models of computation, where processors share read and write access to a very large memory pool, with various degrees of contention and different policies of conflict resolution. For the most part, shared memory parallel sorting algorithms are faster than the network sorting algorithms, but they are far less feasible from a hardware point of view. In Table 1, we briefly summarize the asymptotic bounds of the main algorithms in both the network and the shared memory categories, in terms of the number of processors utilized and the execution time (the latter being estimated as the number of parallel comparison steps required by the algorithm).

Table 1. Number of Processors and Execution Time Required by Parallel Sorting Algorithms

    Algorithm                 Processors      Time
    Odd-even transposition    n               O(n)
    Batcher's bitonic         O(n log² n)     O(log² n)
    Stone's bitonic           n/2             O(log² n)
    Mesh-bitonic              n               O(√n)
    Muller-Preparata          n²              O(log n)
    Hirschberg (1)            n               O(log n)
    Hirschberg (2)            n^(1+1/k)       O(k log n)
    Preparata (1)             n log n         O(log n)
    Preparata (2)             n^(1+1/k)       O(k log n)
    Ajtai et al.              n log n         O(log n)

In the third category of parallel sorting algorithms, we include both internal and external parallel sorting algorithms that utilize limited parallelism to solve a large-sized problem. First, we dealt with block sorting algorithms, which can sort a number of elements proportional to the number of available processors (the proportionality constant being dependent on the size of the processors' memories). Then we introduced parallel external sorting algorithms, which address the problem of sorting an arbitrarily large mass-storage file in parallel.

In addition to these three classes of parallel sorting algorithms, we described (in Section 6) a number of hardware sorter designs. The hardware sorters that have been proposed assume a fixed, sparse interconnection scheme between the processing elements. The parallel sorting algorithms utilized by these sorters are highly synchronous and, for the most part, are derived from algorithms that we have classified as network sorting algorithms (in particular, the bitonic sort). Although the hardware sorters did not introduce innovative parallel sorting algorithms, new and important directions of research on parallel sorting are being explored in their design process.
One such direction is the exploration of algorithms that exploit the characteristics of new storage technologies, such as magnetic bubbles. Another is the integration of VLSI hardware complexity into the cost models by which parallel sorting algorithms are evaluated.

One conclusion emerges clearly from this survey. Most research in the area of parallel sorting has concentrated on finding new ways to speed up the theoretical computation time of sorting algorithms, whereas other aspects (such as technology constraints or data dependency) have received less consideration.

Typically, algorithms have been developed for hypothetical computers that utilize unlimited parallelism and space to solve the sorting problem in asymptotically minimal time. It seems that today the complexity of sorting is fairly well understood, whether on networks or on shared memory parallel processors. The open questions that remain on the complexity of parallel sorting are mostly related to newer models of VLSI complexity that combine chip area with time [Thompson 1983].

It might be the case that, after a decade of research devoted mainly to the theoretical complexity of parallel sorting, aspects related to the feasibility of parallel sorting in the context of current or near-term technology will now be more systematically explored. To appreciate the practical importance of parallel sorting, one should remember that the first parallel sorting algorithms were intended to solve a hardware problem: building a switching network that could provide all permutations of n input lines, with a delay shorter than the time required by serial sorting. It would be interesting, now that many fast parallel sorting algorithms are known, to investigate whether these algorithms can be adapted to realistic models of parallel computation. In particular, further research is needed to address issues related to limited parallelism (to remove the constraint relating the number of processors to the problem size), partial broadcast (to replace simultaneous reads of the same memory location), and the resolution of memory contention.

Another important problem is related to the validity of the performance criteria by which parallel sorting algorithms have previously been evaluated. It is clear that communication, input/output costs, and hardware complexity must be integrated in a comprehensive cost model that is general enough to include a wide range of parallel processor architectures. In particular, the initial cost of reading the source data into the processors' memories has been largely ignored by previous research on parallel sorting. Although one is justified in ignoring this issue when considering a serial, internal sorting algorithm, the situation is quite different with parallel processing. On a single processor, the source data are read sequentially into memory. For a parallel processor, there is the possibility that several processors can read or write simultaneously. On the ILLIAC-IV computer, for example, a fixed-head disk was used for concurrent input/output by all 64 processors. However, when a significantly larger number of processors is involved, only part of them may be able to perform input/output operations concurrently. Thus, for parallel internal sorting, the cost of reading and writing the data should be incorporated when an algorithm is evaluated. In particular, there may be no point in using a parallel sorting algorithm that requires only O(log n) time if the start-up cost of getting the data into memory is O(n). Modeling the cost of input/output is even more crucial when the problem of sorting a large data file in parallel is addressed. The importance of file sorting in database systems will undoubtedly motivate further research in this direction.

REFERENCES

AJTAI, M., KOMLÓS, J., AND SZEMERÉDI, E. 1983. An O(n log n) sorting network. In Proceedings of the 15th Annual ACM Symposium on Theory of Computing (Boston, Apr. 25-27). ACM, New York, pp. 1-9.
ALEKSEYEV, V. E. 1969. On certain algorithms for sorting with minimal memory. Kibernetika 5, 5.
BANERJEE, J., BAUM, R. I., AND HSIAO, D. K. 1978. Concepts and capabilities of a database computer. ACM Trans. Database Syst. 3, 4 (Dec.), 347-384.
BATCHER, K. E. 1968. Sorting networks and their applications.
In Proceedings of the 1968 Spring Joint Computer Conference (Atlantic City, N.J., Apr. 30-May 2), vol. 32. AFIPS Press, Reston, Va., pp. 307-314.
BAUDET, G., AND STEVENSON, D. 1978. Optimal sorting algorithms for parallel computers. IEEE Trans. Comput. C-27, 1 (Jan.).
BENTLEY, J. L., AND KUNG, H. T. 1979. A tree machine for searching problems. In Proceedings of the 1979 International Conference on Parallel Processing (Aug.).
BILARDI, G., AND PREPARATA, F. P. 1983. A minimum area VLSI architecture for O(log n) time sorting. TR-1006, Computer Science Department, Univ. of Illinois at Urbana-Champaign (Nov.).


BITTON, D., AND DEWITT, D. J. 1983. Duplicate record elimination in large data files. ACM Trans. Database Syst. 8, 2 (June), 255-265.
BITTON-FRIEDLAND, D. 1982. Design, analysis and implementation of parallel external sorting algorithms. Ph.D. dissertation, TR464, Computer Science Department, Univ. of Wisconsin, Madison (Jan.).
BORODIN, A., AND HOPCROFT, J. E. 1982. Routing, merging and sorting on parallel models of computation. In Proceedings of the 14th Annual ACM Symposium on Theory of Computing (San Francisco, Calif., May 5-7). ACM, New York, pp. 338-344.
BRYANT, R. 1980. External sorting in a layered storage architecture. Lecture. IBM Research Center, Yorktown Heights, N.Y.
CHEN, T. C., LUM, V. Y., AND TUNG, C. 1978. The rebound sorter: An efficient sort engine for large files. In Proceedings of the 4th International Conference on Very Large Data Bases (West Berlin, FRG, Sept. 13-15). IEEE, New York, pp. 312-318.
CHUNG, K., LUCCIO, F., AND WONG, C. K. 1980. On the complexity of sorting in magnetic bubble memory systems. IEEE Trans. Comput. C-29 (July).
DOHI, Y., SUZUKI, A., AND MATSUI, N. 1982. Hardware sorter and its application to data base machine. In Proceedings of the 9th Conference on Computer Architecture (Austin, Tex., Apr. 26-29). IEEE, New York, pp. 218-225.
EVEN, S. 1974. Parallelism in tape-sorting. Commun. ACM 17, 4 (Apr.), 202-204.
FENG, T.-Y. 1981. A survey of interconnection networks. Computer 14, 12 (Dec.).
FISHBURN, J. P., AND FINKEL, R. A. 1982. Quotient networks. IEEE Trans. Comput. C-31, 4 (Apr.).
GAVRIL, F. 1975. Merging with parallel processors. Commun. ACM 18, 10 (Oct.), 588-591.
HIRSCHBERG, D. S. 1978. Fast parallel sorting algorithms. Commun. ACM 21, 8 (Aug.), 657-666.
HSIAO, D. C., AND MENON, M. J. 1980. Parallel record-sorting methods for hardware realization. Tech. Rep. OSU-CISRC-TR-80-7, Computer and Information Science Dept., Ohio State Univ., Columbus, Ohio (July).
KNUTH, D. E. 1973. Sorting and searching. In The Art of Computer Programming, vol. 3. Addison-Wesley, Reading, Mass.
KUMAR, M., AND HIRSCHBERG, D. S. 1983. An efficient implementation of Batcher's odd-even merge algorithm and its application in parallel sorting schemes. IEEE Trans. Comput. C-32 (Mar.).
LEE, D. T., CHANG, H., AND WONG, C. K. 1981. An on-chip compare/steer bubble sorter. IEEE Trans. Comput. C-30 (June).
LEILICH, H. O., STIEGE, G., AND ZEIDLER, H. C. 1978. A search processor for database management systems. In Proceedings of the 4th Conference on Very Large Data Bases (West Berlin, FRG, Sept. 13-15). IEEE, New York, pp. 280-287.
LEISERSON, C. E. 1981. Area-efficient VLSI computation. Ph.D. dissertation, Tech. Rep. CMU-CS-82-108, Computer Science Dept., Carnegie-Mellon Univ., Pittsburgh, Pa. (Oct.).
MULLER, D. E., AND PREPARATA, F. P. 1975. Bounds to complexities of networks for sorting and for switching. J. ACM 22, 2 (Apr.), 195-201.
NASSIMI, D., AND SAHNI, S. 1979. Bitonic sort on a mesh-connected parallel computer. IEEE Trans. Comput. C-27, 1 (Jan.).
NASSIMI, D., AND SAHNI, S. 1982. Parallel algorithms to set up the Benes permutation network. IEEE Trans. Comput. C-31, 2 (Feb.).
PEASE, M. C. 1977. The indirect binary n-cube microprocessor array. IEEE Trans. Comput. C-26, 5 (May).
PREPARATA, F. P. 1978. New parallel sorting schemes. IEEE Trans. Comput. C-27, 7 (July).
PREPARATA, F. P., AND VUILLEMIN, J. 1979. The cube-connected-cycles. In Proceedings of the 20th Symposium on Foundations of Computer Science.
SELINGER, P. G., ASTRAHAN, M. M., CHAMBERLIN, D. D., LORIE, R. A., AND PRICE, T. G. 1979. Access path selection in a relational database system. In Proceedings of the ACM SIGMOD International Conference on Management of Data (Boston, Mass., May 30-June 1). ACM, New York, pp. 23-34.
SHILOACH, Y., AND VISHKIN, U. 1981. Finding the maximum, merging and sorting in a parallel computation model. J. Algorithms 2, 1 (Mar.).
SIEGEL, H. J. 1977. The universality of various types of SIMD machine interconnection networks. In Proceedings of the 4th Annual Symposium on Computer Architecture (Silver Spring, Md., Mar. 23-25). ACM SIGARCH/IEEE-CS, New York.
SIEGEL, H. J. 1979. Interconnection networks for SIMD machines. IEEE Comput. 12, 6 (June).
STONE, H. S. 1971. Parallel processing with the perfect shuffle. IEEE Trans. Comput. C-20, 2 (Feb.).
THOMPSON, C. D. 1980. A complexity theory for VLSI. Ph.D. dissertation, Tech. Rep. CMU-CS-80-140, Computer Science Dept., Carnegie-Mellon Univ. (Aug.).
THOMPSON, C. D. 1983. The VLSI complexity of sorting. IEEE Trans. Comput. C-32, 12 (Dec.).
THOMPSON, C. D., AND KUNG, H. T. 1977. Sorting on a mesh-connected parallel computer. Commun. ACM 20, 4 (Apr.), 263-271.
VALIANT, L. G. 1975. Parallelism in comparison problems. SIAM J. Comput. 3, 4 (Sept.).
VAN VOORHIS, D. C. 1971. On sorting networks. Ph.D. dissertation, Computer Science Dept., Stanford Univ., Stanford, Calif.
VISHKIN, U. 1981. Synchronized parallel computation. Ph.D. dissertation, Computer Science Dept., Technion-Israel Institute of Technology, Haifa, Israel.
YASUURA, H., TAKAGI, N., AND YAJIMA, S. 1982. The parallel enumeration sorting scheme for VLSI. IEEE Trans. Comput. C-31, 12 (Dec.).

Received August 1982; final revision accepted August 1984.
A complexity theory for record-sorting methods for hardware realization. VLSI. Ph.D. dissertation, Tech. Rep. CMU-CS- Tech. Rep. OSU-CISRC-TR-80-7, Computer and 80-140, Computer Science Dept, Carnegie-Mel- Science Information Dept., Ohio State Univ., Co- lon Univ, (Aug.). lumbus, Ohio (July). THOMPSON, C. D. 1983. The VLSI complexity of KNUTH, D. E. 1973. Sorting and searching. In The sorting. IEEE Trans Comput C-32, 12 (Dec.). Art of Computer Programming, vol. 3 Addison- THOMPSON, C. D., AND KUNG, H T 1977 Sorting Wesley, Reading, Mass on a mesh-connected parallel computer. Com- KUMAR, M., AND HIRSCHBERG,D. S. 1983. An effi- mun. ACM 20, 4 (Apr.), 263-271. cient implementation of Batcher's odd-even VALIANT, L G. 1975 Parallelism in comparison merge algorithm and its application in parallel problems. SIAM J Comput 3, 4 (Sept.). sorting schemes. IEEE Trans Comput C-32 VAN VOORHIS, D. C. 1971. On sorting networks. (Mar.). Ph.D. dissertation, Computer Science Dept., LEE, D. T., CHANG, H., AND WONG, K. 1981. An on- Stanford Umv, Stanford, Calif. chip compare/steer bubble sorter. IEEE Trans VISHKIN U. 1981 Synchronized parallel computa- Comput. C-30 (June). tion. Ph.D. dissertation, Computer Science Dept., LEILICH, H. O., STIEGE, G, AND ZEIDLER, H. C. Technion--Israel Institute of Technology, Haifa, 1978. A search processor for database manage- Israel. ment systems. In Proceedings of the 4th Confer- YASUURA, H., TAKAGI, N., AND YAJIMA, S ence on Very Large Databases (West Berlin, FRG, 1982. The parallel enumeration sorting scheme Sept. 13-15) IEEE, New York, pp. 280-287. for VLSI. IEEE Trans Comput C-31, 12 (Dec.). Received August 1982, final revision accepted August 1984 ComputingSurveys, Vol 16, No. 3, September 1984