
useful. Samplesort was shown to be faster than Bitonic and Parallel Radix sort for large values of N/P, although its general utility was limited because of large memory requirements and poor performance at lower values of N/P.

Samplesort as described in [HC83], [BLM+91] is a randomized algorithm that works as follows. A sample of the complete set of input keys distributed uniformly across processors is extracted and sorted using some efficient deterministic algorithm. From the sorted sample a subset consisting of P-1 values is extracted that partitions the input into P bins that are similarly sized with high likelihood. These P-1 splitters are then broadcast to all processors, after which binary search can be used in parallel at all processors to locate the destination processor for each of their values. After sending all values to their destinations using general routing, a local sort of the values within each processor completes the algorithm.

Although Samplesort is very fast for large N/P, it does have some important limitations. The requirement that the complete set of splitters be available at every processor can be problematic for parallel machines in which the number of processors is large but the amount of local memory is modest. This class includes the recently completed Mosaic machine as well as commercial machines like the MasPar MP-1 and the DAP. A full-size MasPar MP-1, for example, has 16384 processors, but each processor has only 64KB of local storage. To Samplesort a data set consisting of 64-bit values would require each processor to have 64P bits, or 128KB, available for the splitters alone.
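The sampling, splitter-extraction, and binary-search binning steps can be sketched serially. The following Python is our own illustration, not the code of [HC83] or [BLM+91]; the oversampling ratio k and the serial list-of-bins representation are assumptions of the sketch:

```python
import random
from bisect import bisect_left

def samplesort(keys, P, k=8):
    """Serial sketch of Samplesort: oversample, extract P-1 splitters,
    route every key to its bin by binary search, then sort each bin."""
    sample = sorted(random.sample(keys, k * P))        # sample of the input keys
    splitters = [sample[(i + 1) * k - 1] for i in range(P - 1)]
    bins = [[] for _ in range(P)]                      # bin i stands in for processor i
    for v in keys:
        bins[bisect_left(splitters, v)].append(v)      # binary search for destination
    return [sorted(b) for b in bins]                   # local sort completes the algorithm

keys = random.sample(range(100000), 4096)              # distinct keys
bins = samplesort(keys, P=16)
assert sum(bins, []) == sorted(keys)
```

With high likelihood the bins are similarly sized; the cost of holding all P-1 splitters at every processor is what the 128KB example above quantifies.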
In this paper we propose an alternate randomized sorting algorithm related to the algorithm described in [RV83,87] that has become known as Flashsort (see also Ullman [U83], and Appendix B for a simplified description of the algorithm). In particular, in [RV83,87] two techniques were used that were not incorporated in the Samplesort of [HC83], [BLM+91]:

(1) splitter-directed routing
(2) recursive application of the sampling, partitioning and routing step

We investigate both of these techniques and give a quantitative analysis of the circumstances under which the first is advantageous. Our algorithm, which we call Batched-routing Flashsort, or B-Flashsort, can be implemented on a wide variety of architectures, including any mesh, CCC, or hypercube network. In this paper we concentrate on its implementation on a low-dimensional toroidal mesh, which admits a simple and efficient approach to splitter-directed routing. The algorithm takes advantage of the toroidal interconnection topology to reduce communication cost by a factor of 2 over the non-toroidal mesh; this is useful in practice since the toroidal topology is actually implemented on many current and proposed machines.

The batched routing technique works as follows. For each dimension j of a d-dimensional mesh, values are moved along dimension j in successively smaller power-of-2 strides. Each processor partitions its values into a set of values to keep and a set of values to send on, based on two splitters it holds. All values to be sent from a processor can be moved contention-free, since all movement is the same distance in the same direction. Although we expect that the general routing used in Samplesort, when implemented in hardware, should outperform batched routing in B-Flashsort, the magnitude of this difference determines the minimum size problem on which Samplesort can be competitive. For machines with large numbers of processors, limited per-processor memory, or software routing, B-Flashsort may always be faster.

Figure 1. Performance of implemented algorithms on a 4096-processor MP-1 (Time in ms/(N/P) vs. elements per processor N/P, for Bitonic Sort, Samplesort, B-Flashsort, and B-Flashsort with Recursive Subsampling).
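The stride schedule can be made concrete with a small serial sketch, ours rather than the paper's: destinations are taken as known (standing in for the splitter tests), and at each power-of-2 stride h every processor forwards, as one batch, exactly those values still at least h steps from home, so all traffic in a step moves the same distance in the same direction.

```python
P = 16
# processor i starts holding one value destined for every processor j,
# tagged with its remaining forward ring distance (j - i) mod P
procs = [[(j, (j - i) % P) for j in range(P)] for i in range(P)]

h = P
while h > 1:
    h //= 2
    arriving = [[] for _ in range(P)]
    for i in range(P):
        keep = [(j, r) for (j, r) in procs[i] if r < h]
        send = [(j, r - h) for (j, r) in procs[i] if r >= h]  # one batch, distance h
        procs[i] = keep
        arriving[(i + h) % P].extend(send)                    # contention-free hop
    for i in range(P):
        procs[i] += arriving[i]

# after strides P/2, ..., 2, 1 every value has reached its destination
assert all(j == i and r == 0 for i in range(P) for (j, r) in procs[i])
```

Because each remaining distance is less than 2h at the start of a stride, one hop of length h suffices to bring it below h, which is why log P strides deliver every value.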
This may be particularly true if batched routing is implemented at a very low level.

The virtue of B-Flashsort is that it is simple, has per-processor storage requirements that do not scale with P, and does not require general routing. It is completely analyzable, up to tight bounds on the multiplicative constants. In comparison, in the original Flashsort of [RV83,87] it was difficult to determine exact analytical constant factors due to the use of pipelining in the splitter-directed routing (with a goal of asymptotic bounds, only upper bounds were determined in [RV83,87]).

With the addition of a recursive sampling strategy we can extend the performance advantage of B-Flashsort over deterministic sorts to much smaller values of N/P. This may improve the utility of B-Flashsort as a general-purpose sorting routine, since it need not suffer poor behavior at low N/P.

A great deal of previous work has been concerned with sorting on the mesh with P processors in time O(P^1/2). The best bounds (smallest constant factors) currently known to us are given in [KKNT91]. Their algorithm is derived from Flashsort (with pipelined splitter-directed routing) and in addition uses many sophisticated and ingenious techniques, but it is far too complex to yield an efficient practical implementation. In particular, the time bound has low-order terms with large multiplicative constants which actually dominate the performance on existing large parallel machines. This indicates that there is a large gap between a theoretical result on parallel sorting and a parallel algorithm validated by an efficient implementation on an actual parallel machine.

We develop an analytical model for the running time of B-Flashsort, following the approach taken in [BLM+91], and compare it with the model for Samplesort. We derive an expression in terms of machine-dependent parameters characterizing the range of N/P for which B-Flashsort outperforms Samplesort. To validate the analytic model, we implemented four sorting algorithms on a 4096-processor MasPar MP-1. Figure 1 summarizes the performance of the four sorting algorithms: (1) Bitonic sort, the fastest deterministic sort for small N/P on the MP-1 [Prin90], (2) Samplesort, (3) B-Flashsort, and (4) B-Flashsort utilizing the recursive sampling strategy.
2. 1-D Algorithm

We will first explain the algorithm in the 1-D case and then extend the description to multi-dimensional meshes. In the 1-D case, P processors are connected in a ring and each processor 0 ≤ i < P starts with a list Li of N/P values to be sorted. We assume here that the values are distinct; at the end of this section we show that this requirement can be relaxed.

B-FLASHSORT-1D(L)
SUBSAMPLE
 1. foreach i ∈ 0 .. P-1
 2.    Gi := select k random elements from Li
 3. sort (G) using a deterministic P-processor sort
 4. foreach i ∈ 0 .. P-1
 5.    if i = 0 then S–i := –∞ else S–i := G(i–1)[k-1]
 6.    if i = P-1 then S+i := +∞ else S+i := Gi[k-1]
BATCH-SDR
 7. h := P
 8. while h > 1 do
 9.    h := h/2
10.    foreach i ∈ 0 .. P-1
11.       Li, Mi := [ v ∈ Li | τ(h,i,v) ], [ v ∈ Li | ¬τ(h,i,v) ]
12.       Li := Li ++ M(i–h)
13. end
LOCAL-SORT
14. foreach i ∈ 0 .. P-1
15.    sort (Li)

The algorithm proceeds in three phases: SUBSAMPLE, BATCH-SDR, and LOCAL-SORT. In the first phase, we choose k local samples without replacement at each processor into list G. Next, a deterministic sort is used to sort all kP values across the P processors. Each processor i defines its portion of the data set as being all values v such that S–i < v ≤ S+i. Since all values are distinct, every value belongs to exactly one partition. S+i is the largest value of the sorted sample at processor i, and S–i is the largest value of the sorted sample at processor i-1. Note that S–0 = –∞ and S+(P-1) = +∞.
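The SUBSAMPLE phase and its partition property can be checked with a short serial sketch (ours, not the paper's code; P, k, and the value range are arbitrary choices). After the kP samples are sorted, processor i holds G[i*k : (i+1)*k] of the flat sorted sample, so S+i is G[(i+1)*k - 1], and every input value falls in exactly one interval (S–i, S+i]:

```python
import random

P, k, n = 8, 16, 64
vals = random.sample(range(10**6), P * n)           # distinct values
L = [vals[i * n:(i + 1) * n] for i in range(P)]     # N/P = n values per processor

# SUBSAMPLE: k samples per processor, sorted globally across processors
G = sorted(x for Li in L for x in random.sample(Li, k))
S_plus  = [G[(i + 1) * k - 1] for i in range(P - 1)] + [float('inf')]
S_minus = [float('-inf')] + S_plus[:-1]             # S-i is processor i-1's largest sample

# every value belongs to exactly one partition (S-i, S+i]
for v in vals:
    owners = [i for i in range(P) if S_minus[i] < v <= S_plus[i]]
    assert len(owners) == 1
```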
In the BATCH-SDR phase, values are moved toward their destination in log P steps, each using two splitters. Step 11 splits the local list Li into a new list Li of values to keep and a list Mi of values to move h steps. The predicate τ(h,i,v) is true exactly when v belongs on a processor less than h steps away from processor i. Because of the ring topology, the predicate has two cases:

τ(h,i,v) = [(i+h ≤ P) ∧ (S–i < v ≤ S+(i+h–1))]
         ∨ [(i+h > P) ∧ ((S–i < v) ∨ (v ≤ S+(i+h–1)))]     (2.1)

where the subscript i+h–1 is taken mod P in the second case. The value of S+(i+h–1) used in τ(h,i,v) needs to be obtained once for each iteration of the while loop. Step 12 corresponds to a batch routing of the values to be moved. All values move the same distance h in the same direction, so that synchronous nearest-neighbor communication can be employed.

LEMMA 1. The predicate

∀( i, v : 0 ≤ i < P and v ∈ Li : τ(h,i,v) )     (2.2)

is an invariant of the iteration at line 8.

PROOF. The assignment h := P establishes (2.2) trivially. In each iteration, as h is halved, Li is reduced on line 11 to just those values for which (2.2) holds. In line 14, (2.2)

3. Multidimensional Algorithm

The multidimensional version of B-FLASHSORT-MD operating on P = p^d processors arranged in a d-dimensional torus can now be understood as follows.

The SUBSAMPLE step is identical to the 1-D version except that the deterministic sort must operate on a d-dimensional mesh. This step creates a splitter SI at each processor just = +∞.
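Returning to the 1-D algorithm, the three phases and the two-case predicate (2.1) can be exercised end to end in a serial simulation. This Python sketch is our illustration, not the paper's implementation; the parameter choices are arbitrary and the values are distinct, as the text assumes:

```python
import random

def b_flashsort_1d(L, P, k):
    """Serial simulation of B-FLASHSORT-1D on a P-processor ring."""
    # SUBSAMPLE: k samples per processor, sorted globally; splitters as in lines 5-6
    G = sorted(x for Li in L for x in random.sample(Li, k))
    S_plus  = [G[(i + 1) * k - 1] for i in range(P - 1)] + [float('inf')]
    S_minus = [float('-inf')] + S_plus[:-1]

    def tau(h, i, v):   # predicate (2.1): v belongs < h steps forward of processor i
        if i + h <= P:
            return S_minus[i] < v <= S_plus[i + h - 1]
        return S_minus[i] < v or v <= S_plus[(i + h - 1) % P]

    # BATCH-SDR: log P rounds of same-distance, same-direction batch moves
    h = P
    while h > 1:
        h //= 2
        keep = [[v for v in L[i] if tau(h, i, v)] for i in range(P)]
        move = [[v for v in L[i] if not tau(h, i, v)] for i in range(P)]
        L = [keep[i] + move[(i - h) % P] for i in range(P)]   # Li := Li ++ M(i-h)

    # LOCAL-SORT
    return [sorted(Li) for Li in L]

P, n, k = 16, 128, 32
vals = random.sample(range(10**6), P * n)
L = [vals[i * n:(i + 1) * n] for i in range(P)]
out = b_flashsort_1d(L, P, k)
assert sum(out, []) == sorted(vals)
```

Since processor i ends up holding exactly the values in (S–i, S+i], concatenating the per-processor lists in ring order yields the globally sorted sequence.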