Highly Scalable Parallel Sorting
Edgar Solomonik and Laxmikant V. Kalé
Department of Computer Science
University of Illinois at Urbana-Champaign, Urbana, IL 61801, USA
E-mail: [email protected], [email protected]

Abstract— Sorting is a commonly used process with a wide breadth of applications in the high performance computing field. Early research in parallel processing has provided us with comprehensive analysis and theory for parallel sorting algorithms. However, modern supercomputers have advanced rapidly in size and changed significantly in architecture, forcing new adaptations to these algorithms. To fully utilize the potential of highly parallel machines, tens of thousands of processors are used. Efficiently scaling parallel sorting on machines of this magnitude is inhibited by the communication-intensive problem of migrating large amounts of data between processors. The challenge is to design a highly scalable sorting algorithm that uses minimal communication, maximizes overlap between computation and communication, and uses memory efficiently. This paper presents a scalable extension of the Histogram Sorting method, making fundamental modifications to the original algorithm in order to minimize message contention and exploit overlap. We implement Histogram Sort, Sample Sort, and Radix Sort in CHARM++ and compare their performance. The choice of algorithm as well as the importance of the optimizations is validated by performance tests on two predominant modern supercomputer architectures: XT4 at ORNL (Jaguar) and Blue Gene/P at ANL (Intrepid).

I. INTRODUCTION

Algorithms for sorting are some of the most well-studied and optimized algorithms in computer science. Parallel sorting techniques, despite being well analyzed in a vast amount of literature, have not been sufficiently tested or optimized on modern high performance supercomputers. Good scaling and performance of parallel sorting is necessary for any parallel application that depends on sorting. ChaNGa [1], an n-body Barnes-Hut tree-based gravity code [2], is an example of a highly parallel scientific code that uses sorting every iteration.

The problem posed by parallel sorting is significantly different and more challenging than the corresponding sequential one. We begin with an unknown, unsorted distribution of n keys over p processors. The algorithm has to sort and move the keys to the appropriate processors so that they are in globally sorted order. A globally sorted order implies that every key on processor k is larger than every key on processor k − 1. Further, at the end of execution, the number of keys stored on any processor should not be larger than some threshold value n/p + t_thresh (dictated by the memory availability or desired final distribution).
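As a minimal illustration of this output condition, the following standalone C++ sketch (not part of the implementation described later) checks a candidate result held as ordinary in-memory vectors, one per processor: each local array must be sorted, keys must not decrease across processor boundaries, and no processor may hold more than n/p + t_thresh keys.

```cpp
#include <cstddef>
#include <iostream>
#include <vector>

// keys[k] plays the role of processor k's keys, stored in local sorted order.
// Returns true if the result satisfies the global sortedness and threshold conditions.
bool globallySorted(const std::vector<std::vector<long>>& keys, std::size_t tThresh) {
  std::size_t n = 0;
  for (const auto& local : keys) n += local.size();
  const std::size_t p = keys.size();
  const std::size_t limit = n / p + tThresh;  // per-processor key-count threshold

  for (std::size_t k = 0; k < p; ++k) {
    if (keys[k].size() > limit) return false;        // memory/balance condition
    for (std::size_t i = 1; i < keys[k].size(); ++i)  // locally sorted
      if (keys[k][i - 1] > keys[k][i]) return false;
    if (k > 0 && !keys[k - 1].empty() && !keys[k].empty() &&
        keys[k - 1].back() > keys[k].front())         // keys must not decrease
      return false;                                   // across the processor boundary
  }
  return true;
}

int main() {
  std::vector<std::vector<long>> keys = {{1, 3, 4}, {5, 8}, {9, 12, 20}};
  std::cout << (globallySorted(keys, /*tThresh=*/2) ? "sorted" : "not sorted") << "\n";
}
```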
We will describe a new algorithm grounded on the basic principles of Histogram Sort [3] that can perform and scale well on modern supercomputer architectures. Using our optimized implementation, we are able to achieve an estimated 46.2% efficiency on 32,768 cores of BG/P (which corresponds to a speedup of roughly 15,000). For a communication-intensive algorithm such as sorting, this is a fairly impressive efficiency. Additionally, the algorithm handles uniform as well as highly non-uniform distributions of keys effectively.

Section 2 describes various parallel sorting algorithms as well as the original Histogram Sort. Section 3 documents the optimizations we applied to the algorithm in order to achieve better scaling. These include our probe determination algorithm, all-to-all optimizations, and exploitation of the overlap of communication with local sorting and merging. Section 4 introduces the specifics of our implementation, describes the machines we use for testing, and analyzes the scaling and comparative speed of our parallel sorting algorithm. Section 5 suggests conclusions that can be drawn from our work as well as potential areas for improvement.

II. PREVIOUS WORK

Previous work has produced many alternative solutions to address the problem of parallel sorting. The majority of parallel sorting algorithms can be classified as either merge-based or splitter-based. Merge-based parallel sorting algorithms rely on merging data between pairs of processors. Such algorithms generally achieve a globally sorted order by constructing a sorting network between processors, which facilitates the necessary sequence of merges. Splitter-based parallel sorting algorithms aim to define a vector of splitters that subdivides the data into p approximately equal-sized sections. Each of the p − 1 splitters is simply a key value within the dataset. Thus for splitter s_i, where i ∈ {0, 1, 2, ..., p − 2}, loc(s_i) ∈ {0, 1, 2, ..., n} is the total number of keys smaller than s_i. Splitter s_i is valid if (i + 1)n/p − t_thresh/2 < loc(s_i) < (i + 1)n/p + t_thresh/2. Once these splitters are determined, each processor can send the appropriate portions of its data directly to the destination processors, which results in one round of all-to-all communication. Notably, such splitter-based algorithms utilize minimal data movement, since the data associated with each key only moves once.

Merge-based parallel sorting algorithms have been extensively studied, though largely in the context of sorting networks, which consider n/p ≈ 1. For n/p ≫ 1, these algorithms begin to suffer from heavy use of communication and difficulty of load balancing. Therefore, splitter-based algorithms have been the primary choice on most modern machines due to their minimal-communication attribute and scalability. However, newer and much larger architectures have changed the problem statement further. Therefore, traditional approaches, including splitter-based sorting algorithms, require re-evaluation and improvement. We will briefly detail some of these methods below.

A. Bitonic Sort

Bitonic Sort, a merge-based algorithm, was one of the earliest procedures for parallel sorting. It was introduced in 1968 by Batcher [4]. The basic idea behind Bitonic Sort is to sort the data by merging bitonic sequences. A bitonic sequence increases monotonically then decreases monotonically. For n/p = 1, Θ(lg n) merges are required, with each merge having a cost of Θ(lg n). The composed running time of Bitonic Sort for n/p = 1 is Θ(lg² n). Bitonic Sort can be generalized for n/p > 1, with a complexity of Θ(n lg² n). Adaptive Bitonic Sorting, a modification of Bitonic Sort, avoids unnecessary comparisons, which results in an improved, optimal complexity of Θ(n lg n) [5].

Unfortunately, each step of Bitonic Sort requires movement of data between pairs of processors. Like most merge-based algorithms, Bitonic Sort can perform very well when n/p (where n is the total number of keys and p is the number of processors) is small, since it operates in place and effectively combines messages. On the other hand, its performance quickly degrades as n/p becomes large, which is the much more realistic scenario for typical scientific applications. The major drawback of Bitonic Sort on modern architectures is that it moves the data Θ(lg p) times, which turns into a costly bottleneck if we scale to higher problem sizes or a larger number of processors. Since this algorithm is old and very well studied, we will not go into any deep analysis or testing of it. Nevertheless, Bitonic Sort has laid the groundwork for much of parallel sorting research and continues to influence modern algorithms. One good comparative study of this algorithm has been documented by Blelloch et al. [6].

B. Sample Sort

Sample Sort is a splitter-based method in which each processor selects a sample of size s from its local data and sends it to a single designated processor. That processor then produces p − 1 splitters from the sp-sized combined sample and broadcasts them to all other processors. The splitters allow each processor to send each key to the correct final destination immediately. Some implementations of Sample Sort also perform localized load balancing between neighboring processors after the all-to-all.

1) Sorting by Regular Sampling: The Sorting by Regular Sampling technique is a reliable and practical variation of Sample Sort that uses a sample size of s = p − 1 [9]. The algorithm is simple and executes as follows (a sequential sketch of the splitter-selection steps appears at the end of this subsection).
1) Each processor sorts its local data.
2) Each processor selects a sample vector of size p − 1 from its local data. The kth element of the sample vector is element (n/p) × (k + 1)/p of the local data.
3) The samples are sent to and merged on processor 0, producing a combined sorted sample of size p(p − 1).
4) Processor 0 defines and broadcasts a vector of p − 1 splitters, with the kth splitter being element p(k + 1/2) of the combined sorted sample.
5) Each processor sends its local data to the appropriate destination processors, as defined by the splitters, in one round of all-to-all communication.
6) Each processor merges the data chunks it receives.
It has been shown that if n is sufficiently large, no processor will end up with more than 2n/p keys [9]. In practice, the algorithm often achieves almost perfect load balance.

Though attractive for its simplicity, this approach is problematic for scaling beyond a few thousand processors on modern machines. The algorithm requires that a combined sample of p(p − 1) keys be merged on one processor, which becomes an unachievable task since it demands Θ(p²) memory and work on a single processor. For example, on 16,384 processors, the combined sample of 64-bit keys would require 16 GB of memory. Despite this limitation, due to the popularity of Sorting by Regular Sampling, we provide performance results of our implementation of the algorithm (see Figure 7). For a more thorough description and theoretical analysis of Sample Sort, refer to Li et al. [10].
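As a minimal illustration of steps 2–4, the following sketch simulates the splitter selection sequentially in standard C++. The per-processor inputs are plain in-memory vectors and the communication steps (gathering the samples on processor 0 and broadcasting the splitters) are omitted, so this is only the index arithmetic described above, not the CHARM++ implementation discussed later; it also assumes every processor holds exactly n/p keys.

```cpp
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <vector>

// Sequential simulation of Regular Sampling splitter selection.
// local[r] plays the role of processor r's keys; assumes each holds exactly n/p keys.
std::vector<long> regularSamplingSplitters(std::vector<std::vector<long>>& local) {
  const std::size_t p = local.size();
  std::size_t n = 0;
  for (const auto& v : local) n += v.size();

  // Step 1: each processor sorts its local data.
  for (auto& v : local) std::sort(v.begin(), v.end());

  // Step 2: each processor picks p - 1 regularly spaced samples;
  // the kth sample is element (n/p) * (k+1)/p of its sorted local data.
  std::vector<long> sample;
  for (const auto& v : local)
    for (std::size_t k = 0; k + 1 < p; ++k)
      sample.push_back(v[(n / p) * (k + 1) / p]);

  // Step 3: the samples are combined and sorted (done by merging on
  // processor 0 in the parallel algorithm), giving p(p - 1) keys.
  std::sort(sample.begin(), sample.end());

  // Step 4: the kth splitter is element p(k + 1/2) of the combined sample.
  std::vector<long> splitters;
  for (std::size_t k = 0; k + 1 < p; ++k)
    splitters.push_back(sample[p * k + p / 2]);
  return splitters;
}

int main() {
  std::vector<std::vector<long>> local = {
      {9, 2, 17, 5, 11, 3}, {8, 1, 14, 6, 10, 4}, {13, 7, 16, 12, 15, 18}};
  for (long s : regularSamplingSplitters(local)) std::cout << s << " ";
  std::cout << "\n";  // the p - 1 splitters partition the key range into p pieces
}
```

In the parallel algorithm, step 5 then uses these splitters to route each key in a single all-to-all round, and step 6 merges the received chunks on each processor.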
2) Sorting by Random Sampling: Sorting by Random Sampling is an interesting alternative to Regular Sampling [8], [11], [6]. The main difference between the two sampling techniques is that a random sample is flexible in size and collected randomly from each processor.
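As a purely illustrative sketch of drawing such a flexible-size sample, the following C++ fragment includes each local key independently with a fixed probability, so the sample has expected size s but varies in size from draw to draw; this particular Bernoulli scheme is an assumption made here for illustration and is not necessarily the sampling variant used by the cited algorithms.

```cpp
#include <algorithm>
#include <iostream>
#include <random>
#include <vector>

// Draws a random sample from one processor's (non-empty) local keys by keeping each
// key independently with probability s / localKeys.size(), so the sample size is
// random with expected value s.
std::vector<long> randomSample(const std::vector<long>& localKeys, double s,
                               std::mt19937& gen) {
  std::bernoulli_distribution keep(std::min(1.0, s / localKeys.size()));
  std::vector<long> sample;
  for (long key : localKeys)
    if (keep(gen)) sample.push_back(key);
  return sample;
}

int main() {
  std::mt19937 gen(42);
  std::vector<long> localKeys = {9, 2, 17, 5, 11, 3, 8, 1, 14, 6};
  std::vector<long> sample = randomSample(localKeys, /*s=*/4.0, gen);
  for (long k : sample) std::cout << k << " ";  // size is random, expected value 4
  std::cout << "\n";
}
```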