useful. Samplesort was shown to be faster than Bitonic and Parallel Radix sort for large values of N/P, although its general utility was limited because of large memory requirements and poor performance at lower values of N/P.

Samplesort as described in [HC83], [BLM+91] is a randomized algorithm that works as follows. A sample of the complete set of input keys, distributed uniformly across processors, is extracted and sorted using some efficient deterministic algorithm. From the sorted sample a subset consisting of P−1 values is extracted that partitions the input into P bins that are similarly sized with high likelihood. These P−1 splitters are then broadcast to all processors, after which binary search can be used in parallel at all processors to locate the destination processor for each of their values. After sending all values to their destinations using general routing, a local sort of the values within each processor completes the algorithm.

Although Samplesort is very fast for large N/P, it does have some important limitations. The requirement that the complete set of splitters be available at every processor can be problematic for parallel machines in which the number of processors is large but the amount of local memory is modest. This class includes the recently completed Mosaic machine as well as commercial machines like the MasPar MP-1 and the DAP. A full-size MasPar MP-1, for example, has 16384 processors, but each processor has only 64KB of local storage. To Samplesort a data set consisting of 64-bit values would require each processor to have 64P bits, or 128KB, available for the splitters alone.

In this paper we propose an alternate randomized sorting algorithm related to the algorithm described in [RV83,87] that has become known as Flashsort (see also Ullman [U83], and Appendix B for a simplified description of the algorithm). In particular, in [RV83,87] two techniques were used that were not incorporated in the Samplesort of [HC83], [BLM+91]:

(1) splitter-directed routing
(2) recursive application of the sampling, partitioning and routing step

We investigate both of these techniques and give a quantitative analysis of the circumstances under which the first is advantageous. Our algorithm, which we call Batched-routing Flashsort, or B-Flashsort, can be implemented on a wide variety of architectures, including any mesh, CCC, or hypercube network. In this paper we concentrate on its implementation on a low-dimensional toroidal mesh, which admits a simple and efficient approach to splitter-directed routing. The algorithm takes advantage of the toroidal interconnection topology to reduce communication cost by a factor of 2 over the non-toroidal mesh; this is useful in practice since the toroidal topology is actually implemented on many current and proposed machines.

The batched routing technique works as follows. For each dimension j of a d-dimensional mesh, values are moved along dimension j in successively smaller power-of-2 strides. Each processor partitions its values into a set of values to keep and a set of values to send on, based on two splitters it holds. All values to be sent from a processor can be moved contention-free, since all movement is the same distance in the same direction. Although we expect that the general routing used in Samplesort, when implemented in hardware, should outperform batched routing in B-Flashsort, the magnitude of this difference determines the minimum size problem on which Samplesort can be competitive.
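To make the scheme concrete, the Samplesort just described can be sketched sequentially in Python. This is an illustration only, not the [HC83] or [BLM+91] implementation: the oversampling ratio k, the evenly spaced choice of the P−1 splitters from the sorted sample, and the use of the standard bisect module for the binary search are all assumptions of this sketch.

```python
import random
from bisect import bisect_left

def samplesort(values, P, k):
    """Sequential sketch of Samplesort: sample, extract P-1 splitters,
    route each value to its bin by binary search, then sort each bin."""
    # Extract and sort a random sample (k keys per processor, kP total).
    sample = sorted(random.sample(values, k * P))
    # Every k-th sample key yields the P-1 splitters defining P bins.
    splitters = [sample[(i + 1) * k - 1] for i in range(P - 1)]
    bins = [[] for _ in range(P)]
    for v in values:
        # Binary search locates the destination processor for v.
        bins[bisect_left(splitters, v)].append(v)
    # A local sort within each bin completes the algorithm.
    return [sorted(b) for b in bins]
```

Concatenating the returned bins in processor order yields the sorted input; with high likelihood no bin is much larger than N/P.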

[Plot: time per element, ms/(N/P), versus elements per processor (N/P), for Bitonic Sort, Samplesort, B-Flashsort, and B-Flashsort with Recursive Subsampling.]

Figure 1. Performance of implemented algorithms on 4096 processor MP-1

For machines with large numbers of processors, limited per-processor memory, or software routing, B-Flashsort may always be faster. This may be particularly true if batched routing is implemented at a very low level.

The virtue of B-Flashsort is that it is simple, has per-processor storage requirements that do not scale with P, and does not require general routing. It is completely analyzable, up to tight bounds on the multiplicative constants. In comparison, in the original Flashsort of [RV83,87] it was difficult to determine exact analytical constant factors, due to the use of pipelining in the splitter-directed routing (with a goal of asymptotic bounds, only upper bounds were determined in [RV83,87]).

With the addition of a recursive sampling strategy we can extend the performance advantage of B-Flashsort over deterministic sorts to much smaller values of N/P. This may improve the utility of B-Flashsort as a general-purpose sorting routine, since it need not suffer poor behavior at low N/P.

A great deal of previous work has been concerned with sorting on the mesh with P processors in time O(√P). The best bounds (smallest constant factor) currently known to us are given in [KKNT91]. Their algorithm is derived from Flashsort (with pipelined splitter-directed routing) and in addition uses many sophisticated and ingenious techniques, but it is far too complex to yield an efficient practical implementation. In particular, the time bound has low-order terms with large multiplicative constants which actually dominate the performance on existing large parallel machines. This indicates that there is a large gap between a theoretical result on parallel sorting and a parallel algorithm validated by an efficient implementation on an actual parallel machine.

We develop an analytical model for the running time of B-Flashsort, following the approach taken in [BLM+91], and compare it with the model for Samplesort. We derive an expression in terms of machine-dependent parameters characterizing the range of N/P for which B-Flashsort outperforms Samplesort. To validate the analytic model, we implemented four sorting algorithms on a 4096-processor MasPar MP-1. Figure 1 summarizes the performance of the four sorting algorithms: (1) Bitonic sort, the fastest deterministic sort for small N/P on the MP-1 [Prin90], (2) Samplesort, (3) B-Flashsort, and (4) B-Flashsort utilizing the recursive sampling strategy.

2. One-Dimensional Algorithm

Each processor 0 ≤ i < P starts with a list L_i of N/P values to be sorted. We assume here the values are distinct; at the end of this section we show that this requirement can be relaxed.

B-FLASHSORT-1D(L)
SUBSAMPLE
1.  foreach i ∈ 0 .. P−1
2.      G_i := select k random elements from L_i
3.  sort(G) using a deterministic P-processor sort
4.  foreach i ∈ 0 .. P−1
5.      if i = 0 then S−_i := −∞ else S−_i := G_{i−1}[k−1]
6.      if i = P−1 then S+_i := +∞ else S+_i := G_i[k−1]
BATCH-SDR
7.  h := P
8.  while h > 1 do
9.      h := h/2
10.     foreach i ∈ 0 .. P−1
11.         L_i, M_i := [ v ∈ L_i | τ(h,i,v) ], [ v ∈ L_i | ¬τ(h,i,v) ]
12.         L_i := L_i ++ M_{i−h}
13. end
LOCAL-SORT
14. foreach i ∈ 0 .. P−1
15.     sort(L_i)

The algorithm proceeds in three phases, SUBSAMPLE, BATCH-SDR, and LOCAL-SORT. In the first phase, we choose k local samples without replacement at each processor into list G. Next a deterministic sort is used to sort all kP values across P processors. Each processor i defines its portion of the data set as being all values v such that S−_i < v ≤ S+_i. Since all values are distinct, every value belongs to exactly one partition. S+_i is the largest value of the sorted sample at processor i, and S−_i is the largest value of the sorted sample in processor i−1. Note that S−_0 = −∞ and S+_{P−1} = +∞.

In the BATCH-SDR phase, values are moved toward their destination in log P steps, each using two splitters. Step 11 splits the local list L_i into a new list L_i of values to keep and a list M_i of values to move h steps. The predicate τ(h,i,v) is true exactly when v belongs on a processor less than h steps away from processor i. Because of the ring topology, the predicate has two cases:

    τ(h,i,v) = [(i+h ≤ P) ∧ (S−_i < v ≤ S+_{i+h−1})]                  (2.1)
             ∨ [(i+h > P) ∧ ((S−_i < v) ∨ (v ≤ S+_{i+h−1−P}))]
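The three phases, including the two-case ring predicate τ, can be simulated sequentially in Python. This is an illustrative sketch rather than the SIMD implementation: the per-processor lists are ordinary Python lists, the deterministic sample sort is a single call to sorted(), values are assumed distinct, and the function names are inventions of the sketch.

```python
import random

def b_flashsort_1d(values, P, k):
    """Sequential simulation of B-FLASHSORT-1D on a ring of P processors.
    Assumes distinct values and k <= N/P."""
    NEG, POS = float('-inf'), float('inf')
    n = len(values)
    # Distribute roughly N/P values to each processor.
    L = [values[i * n // P:(i + 1) * n // P] for i in range(P)]

    # SUBSAMPLE: k random samples per processor, sorted globally.
    G = sorted(x for Li in L for x in random.sample(Li, k))
    # S_plus[i] is the largest sample in processor i's block; S_minus[i]
    # is the largest sample in processor i-1's block.
    S_plus = [G[(i + 1) * k - 1] for i in range(P - 1)] + [POS]
    S_minus = [NEG] + S_plus[:-1]

    def tau(h, i, v):
        # True iff v belongs on a processor less than h steps ahead of i.
        if i + h <= P:
            return S_minus[i] < v <= S_plus[i + h - 1]
        return (S_minus[i] < v) or (v <= S_plus[i + h - 1 - P])

    # BATCH-SDR: log2(P) batched steps of halving stride h.
    h = P
    while h > 1:
        h //= 2
        keep = [[v for v in L[i] if tau(h, i, v)] for i in range(P)]
        move = [[v for v in L[i] if not tau(h, i, v)] for i in range(P)]
        L = [keep[i] + move[(i - h) % P] for i in range(P)]

    # LOCAL-SORT.
    return [sorted(Li) for Li in L]
```

Because the routing is driven entirely by the splitters, the simulation sorts correctly regardless of skew; skew only affects the queue lengths.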

All lists M_i are sent the same distance in the same direction, so that synchronous nearest-neighbor communication can be employed.

LEMMA 1. The predicate

    ∀( i, v : 0 ≤ i < P and v ∈ L_i : τ(h,i,v) )        (2.2)

is an invariant of the iteration at line 8.

PROOF. The assignment h := P establishes (2.2) trivially. In each iteration, as h is halved, L_i is reduced on line 11 to just those values for which (2.2) holds. In line 12, (2.2) holds for all values of M_{i−h}. Thus (2.2) remains invariant.

THEOREM 1. B-FLASHSORT-1D sorts its input values.

PROOF. When the loop terminates after line 13, we have (2.2) and h = 1. Taken together these yield (S−_i < v ≤ S+_i) for each value v in each processor i. Thus each processor holds only values destined for it. Since no values are created or destroyed, every input value has been correctly routed. Consequently L is partitioned appropriately and the sort in step 15 completes the proof.

The evaluation of τ(h,i,v) may be simplified by treating the boolean values as integers (true = 1, false = 0), with splitter indices taken mod P:

    τ(h,i,v)  ⇔  (S−_i < v) + (v ≤ S+_{i+h−1}) = 1 + ((i + h) ≤ P)

The term on the right-hand side is constant within each processor on each iteration, so that two comparisons and two arithmetic operations suffice. If the comparison is carried out using unsigned comparisons, the predicate can be evaluated with a single comparison and a single arithmetic operation.

For simplicity, in the presentation of the algorithm we assumed the values being sorted were distinct. If values may occur several times in the input, we must modify the predicate τ(h,i,v) to retain its interpretation in BATCH-SDR:

    τ(h,i,v) = [(i+h ≤ P) ∧ (S−_i ≤ v ≤ S+_{i+h−1})]                  (2.1′)
             ∨ [(i+h > P) ∧ (S−_i ≠ S+_{i+h−1−P}) ∧ ((S−_i ≤ v) ∨ (v ≤ S+_{i+h−1−P}))]
             ∨ [(i+h > P) ∧ (S−_i = S+_{i+h−1−P}) ∧ ((S−_i ≤ v) ∨ (v < S+_{i+h−1−P}))]

3. Multidimensional Algorithm

The multidimensional version, B-FLASHSORT-MD, operating on P = p^d processors arranged in a d-dimensional torus, can now be understood as follows.

The SUBSAMPLE step is identical to the 1-D version except that the deterministic sort must operate on a d-dimensional mesh. This step creates a splitter S_I at each processor indexed by a d-vector I, with S_{(p−1),...,(p−1)} = +∞.

The 1-D BATCH-SDR is applied in parallel to each column (hyperplane) for each dimension 1 ≤ j ≤ d. The splitters used in the application to dimension j are as follows:

    S+_I = S_{I[1],...,I[j],(p−1),...,(p−1)}

    S−_I = S_{I[1],...,I[j−1],I[j]−1,(p−1),...,(p−1)}    if I[j] > 0
         = −∞                                            otherwise

The splitters for all of the steps can be created using d−1 spread operations from the highest-indexed elements in each dimension (in reverse order), starting with the original splitter S_I. BATCH-SDR step j leaves the values totally ordered within dimension j; hence when all steps complete, the input data set is completely partitioned.

For example, on a 2-D mesh, all columns are routed using the splitters in the last column of processors. Then each row is routed using the splitters within each processor, leaving the result partitioned in row-major order.

The LOCAL-SORT step is identical to the 1-D version and establishes the complete ordering.

4. Analysis

Define the skew W of the list L over all processors i to be

    W = max_i |L_i| / (N/P)

[Plot: routing skew WR versus elements per processor (N/P); curves: average WR over 100 trials, maximum WR over 100 trials, calculated WR for r = 0.01, and the minimum N/P for which the WR calculation applies.]

Figure 2. Predicted and observed routing-induced skew WR for P = 4096.

THEOREM 2. For any α > 0 and N/P > (α+1) ln P, the routing-induced skew WR satisfies

    WR ≤ 1 + √( 3(P/N)(α+1) ln P )

with probability 1 − (1/P^α).

PROOF. (Appendix)

The routing-induced skew grows slowly with increasing P, but decreases much more rapidly with increasing N/P. Given P > 1 and arbitrary r, we may choose α to obtain a bound on WR with probability 1−r, provided N/P is sufficiently large. Figure 2 illustrates predicted and measured routing skew over 100 trials on a 4096-processor 2-D mesh. Figure 3 illustrates the dependence of routing skew on P and r.

Bounds for the expected relation between the per-processor sample size and the maximum size of the sorted lists at completion are given in [BLM+91] and [DNS91]. Given P > 1 and arbitrary r, the sample size k required to limit splitter-directed skew to WS with probability 1−r is given by:

    k = 2 ln(P/r) / ( (1 − 1/WS)^2 · WS )        (4.1)

This bound is conservative since it is independent of the number of elements per processor.

THEOREM 3. Let k be the number of samples chosen per processor. Then for any α > 0, the splitter-induced skew WS satisfies

    WS + 1/WS ≤ 2 + (2(α+1)/k) · ln P

with probability 1 − (1/P^α).

PROOF. Solve (4.1) for WS with r = 1/P^α.

Figure 4 illustrates the dependence of splitter skew on P and the oversampling ratio k. Since the SUBSAMPLE step performs a sort of k elements per processor, we are interested in keeping k small. On the other hand, a large WS that results from a small k causes slowdown in the BATCH-SDR and LOCAL-SORT stages. Space considerations also encourage us to bound WS < 2 with high probability at large N/P.

A simple compromise is to let k = 4(α+1) ln P, with α > 0 chosen to keep the probability of exceeding the skew bounds acceptably low. For this choice, WS ≤ 2 for all N and P. Since N/P ≥ k (else we degenerate to the deterministic sort), we are guaranteed that WR ≤ 1.85 with the same probability. Routing-induced skew disappears in the last stages of SDR, while splitter-induced skew is introduced in these stages. With the choice of k above we can simplify the analysis by eliminating WR and bounding skew in all stages of SDR by WS.
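Equation (4.1) and the bound of Theorem 3 can be checked numerically. The sketch below is illustrative, not from the paper: it evaluates (4.1), inverts the Theorem 3 inequality for WS (the larger root of WS² − c·WS + 1 = 0), and confirms that the compromise choice k = 4(α+1) ln P gives WS ≤ 2.

```python
from math import log, sqrt

def oversampling(P, r, WS):
    """Sample size k from (4.1) bounding splitter skew by WS w.p. 1-r."""
    return 2 * log(P / r) / ((1 - 1 / WS) ** 2 * WS)

def skew_bound(P, k, alpha):
    """Invert WS + 1/WS <= 2 + (2(alpha+1)/k) ln P for the largest WS."""
    c = 2 + 2 * (alpha + 1) * log(P) / k
    # Larger root of WS^2 - c*WS + 1 = 0.
    return (c + sqrt(c * c - 4)) / 2

alpha, P = 1.0, 4096
k = 4 * (alpha + 1) * log(P)   # the compromise choice: k = 4(alpha+1) ln P
```

For this k the right-hand side of Theorem 3 is exactly 2.5, whose larger root is WS = 2 independent of P; plugging WS = 2 and r = 1/P^α back into (4.1) recovers the same k.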

If the skew bounds are exceeded, the simplest recourse is to randomize the input and rerun the algorithm. Randomization is required because resampling may not solve a routing congestion problem.

Performance Analysis

We analyze the algorithm in terms of the primitive operations listed below. Each operation is assumed to have a relatively constant per-element time. The time for an arithmetic operation subsumes an indirect reference of its source and destination, as well as the time to update a counter on each processor.

opn   description
A     average time for an indirect memory-to-memory arithmetic operation
X     distance-one nearest-neighbor send time
Z     time to spread values along a dimension
R     average random destination send time

The total running time of B-FLASHSORT-MD is given by the sum of the times of its three phases and is a function of N and P:

    T_B-FLASH-MD(N,P) = T_SUBSAMPLE(N,P)
                      + Z·(d−1)
                      + WS·(N/P)·T_BATCHSDR(P)
                      + T_LOCSORT(WS·N/P)

The second term is the cost of the d−1 spread operations used to create the splitters for the SDR in each dimension. Of particular interest here is the T_BATCHSDR portion of this expression. On each step it performs work proportional to the longest list (WS·N/P). By examination of the B-FLASHSORT-1D algorithm we can express the time per element as follows:

    T_BATCHSDR(P) = 4A·log P + (1/2)·[ X·d·(P^{1/d} − 1) ]

On each SDR step we must compare a value with the bounding splitters and place it on one of two queues, L or M (cost 3A). We must also move queue M, which involves an indirect access of the source and destination queue (cost 2A), but we only charge half the per-element cost because on average only half the elements are placed in M. The total distance traveled by elements in queue M is (P^{1/d} − 1) in each of d dimensions. The cost X per step of this distance is scaled by 1/2 because, again, only half the elements move on a given step.

5. Recursive Subsampling

Since each dimension in SDR is treated in turn, we can consider an alternative approach to subsampling. At each SDR step we can sample a sufficient portion of the input to generate splitters for that step only.

For example, in a 2-D mesh we need √P splitters in the first step. Since we can sort using P processors, we can divide the oversampling k obtained from theorem 3 by √P

[Plot: routing skew WR versus elements per processor; calculated curves for P = 64 and P = 4096, each at r = 10^-3 and r = 10^-6.]

Figure 3. Dependence of WR on P.

to determine the oversampling ratio using P processors. This permits us to apply the algorithm to much smaller values of N/P without degenerating to the deterministic sort time.

In the second stage we need √P splitters in every row, or one splitter per processor. Note from Figure 4 that the required oversampling ratio to generate √P splitters for √P processors for a skew WS with probability r is lower than that required to generate P splitters for P processors with the same WS and r. Furthermore, the sorting of the samples taken in each row using √P processors will be faster than a sort of a sample of the same size per processor using P processors.

Thus recursive subsampling improves the performance of B-Flashsort for small values of N/P relative to P. As a result, the B-Flashsort algorithm using recursive subsampling in Figure 1 gives better performance at smaller input sizes, but does not alter the performance for large N/P.

The optimum choice for the number of recursive stages and the sampling size at each stage is highly dependent on the dimensionality of the mesh and the performance of the deterministic sort at small N/P. The final paper will give details of the analysis.

Comparison of Algorithms

Here we compare the performance of Samplesort and multidimensional non-recursive B-Flashsort. In the non-recursive case the first and last steps of Samplesort and B-Flashsort, subsampling and local-sort respectively, are identical. That is,

    T_SAMPLESORT(N,P) = T_SUBSAMPLE(N,P) + T_SPLIT(N,P)
                      + T_LOCSORT(WS·N/P)

    T_B-FLASH-MD(N,P) = T_SUBSAMPLE(N,P) + Z·(d−1)
                      + WS·(N/P)·T_BATCHSDR(P)
                      + T_LOCSORT(WS·N/P)

For Samplesort we have

    T_SPLIT(N,P) = 2A·P + 2A·(N/P)·log P + R·(N/P)

because each of P splitters must be broadcast with cost approximately 2A, each of N/P elements must be found by binary search in the splitter table (2A per probe), and each element must be sent to the destination queue. For B-Flashsort we have from section 4:

    T_BATCHSDR(P) = 4A·log P + (1/2)·[ X·d·(P^{1/d} − 1) ]
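The comparison can be carried out numerically. The following sketch encodes T_SPLIT and the per-element T_BATCHSDR as given above and searches for the largest N/P at which the B-Flashsort side of the comparison Z·(d−1) + WS·(N/P)·T_BATCHSDR(P) < T_SPLIT(N,P) still wins. The constants are the MP-1 microsecond timings the paper reports (A = 20, X = 3, Z = 8, R = 700), and WS ≈ 1.5 is the average maximum skew used in the paper's own comparison; both are assumptions carried over from those measurements.

```python
from math import log2

# MP-1 primitive-operation timings in microseconds (from the paper).
A, X, Z, R = 20.0, 3.0, 8.0, 700.0

def t_split(n_over_p, P):
    """Samplesort: broadcast P splitters, binary-search each element,
    then send each element through the general router."""
    return 2 * A * P + 2 * A * n_over_p * log2(P) + R * n_over_p

def t_batchsdr(P, d):
    """B-Flashsort batched-SDR time per element."""
    return 4 * A * log2(P) + 0.5 * X * d * (P ** (1 / d) - 1)

def crossover(P, d, WS=1.5):
    """Largest N/P for which batched SDR still beats Samplesort's split."""
    n = 1
    while Z * (d - 1) + WS * n * t_batchsdr(P, d) < t_split(n, P):
        n += 1
    return n - 1
```

For P = 4096 and d = 2 this yields a crossover of roughly 300 elements per processor.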

[Plot: oversampling ratio (k) versus number of processors (P); curves for WS < 1.5 and WS < 2, each at r = 10^-3 and r = 10^-6.]

Figure 4. Dependence of oversampling on number of processors.

6. Analytical and Empirical Results

To assess the predictive utility of the model (at least in one setting), we performed timings of the implementations of SPLIT and BATCHSDR on a 4096-processor MasPar MP-1 and compared the timings with those predicted by the model when times are provided for the various primitive operations of the model. The timings we used for the primitive operations of the model are shown below for 32-bit values on the MP-1. For the MP-1, d = 2 and the maximum value of P is 16384.

MP-1
opn   (µS)   description
A     20     average time for an indirect memory-to-memory arithmetic operation
X     3      distance-one nearest-neighbor send time
Z     8      time to spread values along a dimension
R     700    average random destination send time

The measured T_BATCHSDR(P) on the MP-1 was found to be 75 log P + 3(√P − 1) µS, which agrees very well with the predicted time. The measured T_SPLIT(N,P) was found to be 29P + 36(N/P) log P + 740(N/P), which agrees reasonably well with the predicted time, particularly for large N/P.

To find the regime in which B-Flashsort gives superior performance, we examined the inequality

    Z·(d−1) + WS·(N/P)·T_BATCHSDR(P) < T_SPLIT(N,P)

Combining N/P terms this gives

    N/P · [ (WS·4A − 2A)·log P + WS·X·(d/2)·(P^{1/d} − 1) − R ] < 2A·P − Z·(d−1)

which, taking the average maximum skew WS ≈ 1.5 and solving for N/P in terms of P, gives

    N/P < (2A·P − Z·(d−1)) / (4A·log P + 0.75·X·d·(P^{1/d} − 1) − R)

When we examine this relation for P = 4096 and d = 2, we predict that for N/P ≤ 300 the B-Flashsort implementation should be faster than Samplesort, which is indeed borne out experimentally, as demonstrated in Figure 1.

Using the constants for the CM-2 reported with the analytic model of [BLM+91], we obtain values of N/P in the range 320 to 560 as crossover points below which B-Flashsort would be competitive with Samplesort on a full-size CM-2 (P = 2048, d = 11). The exact value depends on whether the formula for T_SPLIT is the one used here or the one defined in [BLM+91]. However, we have not verified this prediction experimentally.

7. Conclusions

Randomization has been demonstrably useful both in simplifying and in improving the efficiency of sorting algorithms on actual parallel machines. The B-Flashsort algorithm developed in this paper combines some of these concepts with an efficient implementation of splitter-directed routing that achieves the SIMD-model communication distance bound of d·(P^{1/d} − 1) for sorting on the d-dimensional torus.

The algorithm presented scales well to large machines, since its memory requirements are independent of the machine size. The analytic model developed suggests that the algorithm will perform well on machines with high-speed local connections on the mesh. Experimentally, B-Flashsort is competitive with Samplesort on the MP-1, a machine with good hardware routing capabilities and fast broadcast (the characteristics that favor Samplesort). A full-size MasPar MP-1 (16,384 processors) sorts approximately 4 million 32-bit integers per second using our implementation.

Appendix

THEOREM 2. For any α > 0 and N/P > (α+1) ln P, the routing-induced skew WR satisfies

    WR ≤ 1 + √( 3(P/N)(α+1) ln P )

with probability 1 − (1/P^α).

PROOF. We use some basic probability theory in the proof of theorem 2. Consider n independently distributed Poisson variables x_1,...,x_n, with Prob(x_i = 1) = ρ and Prob(x_i = 0) = 1 − ρ. The random variable

    X = Σ_{i=1..n} x_i

has binomial distribution described by

    Prob(X = k) = C(n,k) · ρ^k · (1−ρ)^{n−k}

The distribution has size n and mean µ = n·ρ. We will be interested in the probability that X has a value larger than some given value k > µ:

    Prob(X ≥ k) = Σ_{i=k..n} C(n,i) · ρ^i · (1−ρ)^{n−i}

which satisfies the following inequality for any 0 ≤ ε ≤ 1.8µ [KKNT91]:

    Prob(X ≥ µ+ε) ≤ e^{−ε²/3µ}        (A.1)

We can now proceed with the theorem. Recall we are sorting N values using P processors. Initially each processor holds N/P values. Consider the values held by an arbitrary processor H at the completion of SDR stage 1 ≤ i ≤ log P. There are a total of 2^i·N/P values that may reach processor H, each with independent probability 2^{−i}.

We model the length of the queue at H at stage i as a binomially distributed variable X with size 2^i·N/P and mean µ = N/P. Let q = Prob(X ≥ µ+ε) with 0 ≤ ε ≤ 1.8µ. Then by (A.1) the probability that X < µ+ε at H is

    1 − q ≥ 1 − e^{−ε²/3µ}        (A.2)

Since there are P processors, all of which must satisfy (A.2), the probability r that all queues are shorter than µ+ε is

    r = (1−q)^P ≥ (1 − e^{−ε²/3µ})^P        (A.3)

Since WR = (µ+ε)/µ, we have ε = µ(WR−1). If we rewrite (A.3) in terms of WR we get

    (−3/µ)·ln(1 − r^{1/P}) ≥ (WR − 1)²        (A.4)

To bound the values of r for which (A.4) holds, note that we must have 0 ≤ ε/µ ≤ 1.8 to apply (A.1). Since ε/µ = (WR−1), we must have

    0 ≤ (−3/µ)·ln(1 − r^{1/P}) = (WR − 1)² = (ε/µ)² ≤ 3 < 1.8²

which constrains r to satisfy 0 ≤ r ≤ (1 − e^{−µ})^P. By assumption µ ≥ (α+1) ln P, hence

    (1 − e^{−µ})^P ≥ (1 − e^{−(α+1) ln P})^P
                   = (1 − 1/P^{α+1})^{(P^{α+1})·(1/P^α)}
                   ≈ e^{−1/P^α}        (since for x → ∞, (1−1/x)^x → e^{−1})
                   ≈ 1 − (1/P^α)       (since for x → ∞, e^{−1/x} → 1 − (1/x))

Therefore the conditions obtain to apply the theorem for all 0 ≤ r ≤ 1 − (1/P^α). In particular, with r = 1 − (1/P^α), it may be easily verified using the identities above that

    WR ≤ 1 + √( 3(P/N)(α+1) ln P )

(End of proof.)

Bibliography

[AH88] A. Aggarwal and M.-D. A. Huang, Network Complexity of Sorting and Graph Problems and Simulating CRCW PRAMs by Interconnection Networks, in VLSI Algorithms and Architectures (AWOC 88), J. Reif (ed.), vol. 319 of Lecture Notes in Computer Science, pp. 339-350, Springer-Verlag, 1988.

[AKS83] M. Ajtai, J. Komlos, and E. Szemeredi, Sorting in c log n Parallel Steps, Combinatorica, 3:1-19, 1983.

[AV79] D. Angluin and L.G. Valiant, Fast Probabilistic Algorithms for Hamiltonian Circuits and Matchings, Journal of Computer and System Sciences, 18(2):155-193, April 1979.

[AKL85] S.G. Akl, Parallel Sorting Algorithms, Academic Press, Toronto, 1985.

[BAT68] K. Batcher, Sorting Networks and Their Applications, Proceedings of the AFIPS Spring Joint Computing Conference, vol. 32, pp. 307-314, 1968.

[BS78] G. Baudet and D. Stevenson, Optimal Sorting Algorithms for Parallel Computers, IEEE Transactions on Computers, C-27:84-87, 1978.

[BLE90] G.E. Blelloch, Vector Models for Data-Parallel Computing, The MIT Press, 1990.

[BLM+91] G.E. Blelloch, C.E. Leiserson, B.M. Maggs, C.G. Plaxton, S.J. Smith, and M. Zagha, A Comparison of Sorting Algorithms for the Connection Machine CM-2, 3rd Annual ACM Symposium on Parallel Algorithms and Architectures, July 21-24, 1991, Hilton Head, SC, pp. 3-16.

[CHL89] B. Chlebus, Sorting Within Distance Bound on a Mesh-Connected Array, International Symposium on Optimal Algorithms, vol. 401 of Lecture Notes in Computer Science, pp. 232-238, Springer-Verlag, NY, 1989.

[COLE88] R. Cole, Parallel Merge Sort, SIAM Journal on Computing, pp. 770-785, 1988.

[CV86] R. Cole and U. Vishkin, Deterministic Coin Tossing and Accelerating Cascades: Micro and Macro Techniques for Designing Parallel Algorithms, Proceedings of the 18th Annual ACM Symposium on Theory of Computing, pp. 206-219, 1986.

[CLR90] T.H. Cormen, C.E. Leiserson, and R.L. Rivest, Introduction to Algorithms, The MIT Press and McGraw-Hill, 1990.

[CP90] R.E. Cypher and C.G. Plaxton, Deterministic Sorting in Nearly Logarithmic Time on the Hypercube and Related Computers, Proceedings of the 22nd Annual ACM Symposium on Theory of Computing, pp. 193-203, May 1990.

[DNS91] D.J. DeWitt, J.F. Naughton, and D.F. Schneider, Parallel Sorting on a Shared-Nothing Architecture using Probabilistic Splitting, Computer Sciences TR #1043, University of Wisconsin - Madison, 1991.

[FKO86] E. Felten, S. Karlin, and S. Otto, Sorting on a Hypercube, Hm 244, Caltech/JPL, 1986.

[FM70] W.D. Frazer and A.C. McKellar, Samplesort: A Sampling Approach to Minimal Storage Tree Sorting, Journal of the ACM, 17(3):496-507, 1970.

[HOE56] W. Hoeffding, On the Distribution of the Number of Successes in Independent Trials, Annals of Mathematical Statistics, 27:713-721, 1956.

[HC83] J.S. Huang and Y.C. Chow, Parallel Sorting and Data Partitioning by Sampling, Proceedings of the IEEE Computer Society's Seventh International Computer Software and Applications Conference, pp. 627-631, November 1983.

[KKNT91] C. Kaklamanis, D. Krizanc, L. Narayanan, and T. Tsantilas, Randomized Sorting and Selection on Mesh-Connected Processor Arrays, 3rd Annual ACM Symposium on Parallel Algorithms and Architectures, July 21-24, 1991, Hilton Head, SC, pp. 17-28.

[KUN88a] M. Kunde, Routing and Sorting on Mesh-Connected Arrays, Aegean Workshop on Computing: VLSI Algorithms and Architectures, vol. 319 of Lecture Notes in Computer Science, pp. 423-433, Springer-Verlag, NY, 1988.

[KUN88b] M. Kunde, l-selection and Related Problems on Grids of Processors, Aegean Workshop on Computing: VLSI Algorithms and Architectures, vol. 319 of Lecture Notes in Computer Science, pp. 423-433, Springer-Verlag, NY, 1988.

[LEI85] F.T. Leighton, Tight Bounds on the Complexity of Parallel Sorting, IEEE Transactions on Computers, C-34(4):344-354, April 1985.

[LP90] T. Leighton and G. Plaxton, A (Fairly) Simple Circuit That (Usually) Sorts, Proceedings of the 31st Annual Symposium on Foundations of Computer Science, pp. 264-274, October 1990.

[NS82] D. Nassimi and S. Sahni, Parallel Permutation and Sorting Algorithms and a New Generalized Connection Network, Journal of the ACM, 29(3):642-667, July 1982.

[PAT90] M.S. Paterson, Improved Sorting Networks with O(log n) Depth, Algorithmica, 5:75-92, 1990.

[PLAX89] C.G. Plaxton, Efficient Computation on Sparse Interconnection Networks, Technical Report STAN-CS-89-1283, Stanford University, Department of Computer Science, September 1989.

[PRIN90] J.F. Prins, Efficient Bitonic Sorting of Large Arrays on the MasPar MP-1, 3rd Symposium on Frontiers of Massively Parallel Processing, 1990; expanded version Technical Report 91-041, Univ. of North Carolina, 1991.

[QUI89] M.J. Quinn, Analysis and Benchmarking of Two Parallel Sorting Algorithms: Hypersort and Quickmerge, BIT, 29(2):239-250, 1989.

[RR89] S. Rajasekaran and J.H. Reif, Optimal and Sublogarithmic Time Randomized Parallel Sorting Algorithms, SIAM Journal on Computing, 18(3):594-607, June 1989.

[RV83,87] J.H. Reif and L.G. Valiant, A Logarithmic Time Sort for Linear Size Networks, 15th Annual ACM Symposium on Theory of Computing, Boston, MA, pp. 10-16, 1983; also in Journal of the ACM, 34(1):60-76, January 1987.

[REI85] R. Reischuk, Probabilistic Parallel Algorithms for Sorting and Selection, SIAM Journal of Computing, 14(2):396-411, May 1985.

[SS86] C. Schnorr and A. Shamir, An Optimal Sorting Algorithm for Mesh Connected Computers, Symposium on the Theory of Computation, pp. 255-263, 1986.

[SG88] S.R. Seidel and W.L. George, Binsorting on Hypercubes with d-port Communication, Proceedings of the Third Conference on Hypercube Concurrent Computers, pp. 1455-1461, January 1988.

[TK77] C.D. Thompson and H.T. Kung, Sorting on a Mesh-Connected Parallel Computer, Communications of the ACM, 20(4):263-271, 1977.

[U83] J. Ullman, Computational Aspects of VLSI, Computer Science Press, 1983.

[VD88] P. Varman and K. Doshi, Sorting with Linear Speedup on a Pipelined Hypercube, Technical Report TR-9902, Rice University, Department of Electrical and Computer Engineering, February 1988.

[WAG89] B.A. Wagar, Hyperquicksort: A Fast Sorting Algorithm for Hypercubes, in Hypercube Multiprocessors 1987 (Proceedings of the Second Conference on Hypercube Multiprocessors), M.T. Heath (ed.), pp. 292-299, SIAM, Philadelphia, PA, 1987.

[WAG90] B.A. Wagar, Practical Sorting Algorithms for Hypercube Computers, Ph.D. Thesis, Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI, July 1990.

[WS88] Y. Won and S. Sahni, A Balanced Bin Sort for Hypercube Multicomputers, Journal of Supercomputing, 2:435-448, 1988.
