<<

Theoretically-Efficient and Practical Parallel In-Place Radix Sorting
by Omar Obeya
B.S., Massachusetts Institute of Technology (2018)
Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Master of Engineering in Electrical Engineering and Computer Science at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY
June 2019
© Massachusetts Institute of Technology 2019. All rights reserved.

Author: Department of Electrical Engineering and Computer Science, May 24, 2019

Certified by: Julian Shun, Assistant Professor, Thesis Supervisor

Accepted by: Katrina LaCurts, Chair, Master of Engineering Thesis Committee

Submitted to the Department of Electrical Engineering and Computer Science on May 24, 2019, in partial fulfillment of the requirements for the degree of Master of Engineering in Electrical Engineering and Computer Science

Abstract

Radix sort stands out for having a better worst-case theoretical bound than any comparison-based sort for fixed-length integers. Despite the fact that radix sort can be implemented either in-place or in parallel, there exists no parallel in-place implementation of radix sort that guarantees a sub-linear worst-case span. The challenge arises due to read-write races when reading from and writing to the same array. In this thesis, I introduce Regions sort and use it to implement a parallel work-efficient in-place radix sort. To sort 푛 integers from a range 푟, with a parameter 퐾, the algorithm requires only 푂(퐾 log 푟 log 푛) auxiliary memory, 푂(푛 log 푟) work, and 푂((푛/퐾 + log 퐾) log 푟) span. Regions sort uses a divide-and-conquer paradigm. It first divides the array into sub-arrays, each of which is sorted independently in parallel. This decreases the irregularity in the input and reorders it into a set of regions, each of which has similar properties. Next, it builds a data structure, called the regions graph, to represent the data movements needed to completely sort the array. Finally, Regions sort iteratively plans swaps that satisfy the required data movements. The algorithm then recurses on records with so-far equal keys to break ties. I compare two variants of Regions sort with state-of-the-art integer sorting and comparison sorting algorithms. Experiments show that the best variant of Regions sort usually outperforms the other sorting algorithms. In addition, I perform experiments showing that Regions sort maintains its superiority regardless of input distribution and range. More importantly, the single-threaded implementation of Regions sort outperforms the highly optimized Ska Sort implementation of serial in-place radix sort.

Thesis Supervisor: Julian Shun Title: Assistant Professor

Acknowledgments

This thesis is based on the paper titled "Theoretically-Efficient and Practical Parallel In-Place Radix Sorting" (DOI: https://doi.org/10.1145/3323165.3323198) [26]. This paper is to be published in ACM SPAA 2019 and is joint work with Endrias Kahssay, Edward Fan, and Prof. Julian Shun, without whom it would not have been possible to complete this research. This research was supported by DARPA SDH Award #HR0011-18-3-0007, the Applications Driving Architectures (ADA) Research Center, a JUMP Center co-sponsored by SRC and DARPA, DOE Early Career Award #DE-SC0018947, and NSF CAREER Award #CCF-1845763. Finally, I would like to express my sincere gratitude to my advisor Prof. Julian Shun for his support, patience, guidance, and immense knowledge. I could not have imagined having a better advisor and mentor.

Contents

1 Introduction

2 Related Work
  2.1 Preliminaries
  2.2 Previous Work
    2.2.1 Serial In-place Radix Sort
    2.2.2 Parallel Out-of-place Radix Sort
    2.2.3 PARADIS Algorithm
    2.2.4 IPS4o Algorithm

3 Algorithm
  3.1 Algorithm Overview
  3.2 Local Sorting
  3.3 Regions Graph Construction
  3.4 Global Sorting
    3.4.1 Cycle Finding
    3.4.2 2-path Finding
  3.5 Recursion

4 Analysis
  4.1 Local Sorting
  4.2 Graph Construction
  4.3 Global Sorting
    4.3.1 Cycle Finding
    4.3.2 2-path finding
  4.4 Overall algorithm

5 Optimizations
  5.1 Batching Cycles
  5.2 Prioritizing 2-cycles
  5.3 Early Recursion
  5.4 Country Sorting
  5.5 Coarsening

6 Evaluation
  6.1 Experiment Setup
    6.1.1 Experiment Environment
    6.1.2 Implementation Details
    6.1.3 Test Inputs
  6.2 Tested algorithms
  6.3 Results
    6.3.1 Results Overview
    6.3.2 Parallel Scalability
    6.3.3 Input Size Scalability
    6.3.4 Effect of Key Range
    6.3.5 Effect of Skewness in Zipfian Distributions
    6.3.6 Breakdown of Running Time
    6.3.7 Effect of Radix Size
    6.3.8 Effect of 퐾
  6.4 Conclusion

List of Figures

1-1 Represents the data movements required to sort records in one iteration of in-place radix sort, thus introducing potential read-write races. The colors and numbers in the records represent the bucket each record belongs to. The colored line under the array illustrates where records belonging to each bucket will end up after the array gets sorted.

3-1 (a) shows the value of the current radix of keys in the input for a 2-bit radix (퐵 = 4), and the lines below correspond to the countries (blue is country 0, orange is country 1, green is country 2, and purple is country 3). (b) shows the corresponding regions graph.

3-2 (a) shows the value of the current radix of keys in the input for a 2-bit radix (퐵 = 4). At each step, the corresponding swaps for each cycle are shown in the array above the regions graph. (b) shows the corresponding regions graph of (a), and the cycle to process is highlighted in red. (c) and (d) show the array and regions graph after processing the highlighted cycle in (b). In particular, the edges (0, 2, 1) and (1, 0, 1) are deleted, and (2, 1, 2) is modified to (2, 1, 1). The next cycle to process is highlighted in (d). (e) and (f) show the input and regions graph after processing the highlighted cycle in (d). After processing the remaining cycle in (f), the array is completely sorted for the current recursion level, as shown in (g), with the regions graph having no edges, as shown in (h).

3-3 (a) and (b) show an example of processing a 2-path formed by the edges (1, 0, 1) and (0, 2, 1). Here vertex 1 is the source, vertex 0 is the broker, and vertex 2 is the sink. The result of processing this 2-path is shown in (c) and (d). The edges (1, 0, 1) and (0, 2, 1) are deleted, and a new edge (1, 2, 1) is formed (the dashed edge in (b)).

3-4 An illustration of how 2-paths incident on a broker are processed in parallel. The current broker being processed is circled in bold. At each step, the corresponding swap for each 2-path is shown in the array above the regions graph using the same color as the 2-path. In (b), the algorithm processes the single 2-path with vertex 0 as broker, which is (1, 0, 1), (0, 2, 1), and creates the dashed edge (1, 2, 1), resulting in (d). In (d), Regions sort processes all three 2-paths with vertex 1 as broker in parallel. The three 2-paths are (1, 2, 1), (2, 1, 2), which does not add any new edges; (3, 1, 1), (1, 2, 1), which creates the dashed edge (3, 2, 1); and (2, 1, 1), (1, 3, 1), which creates the dashed edge (2, 3, 1). This results in (f). Finally, in (f), the algorithm processes the single 2-path with vertex 2 as broker, which is (2, 3, 1), (3, 2, 1), and is left with an empty graph in (h), at which point the array is sorted.

5-1 An illustration of how 2-paths incident on a broker are processed in parallel. The current broker being processed is circled in bold. At each step, the corresponding swap for each 2-path is shown in the array above the regions graph using the same color as the 2-path. In (b), the algorithm processes the single 2-path with vertex 0 as broker, which is (1, 0, 1), (0, 2, 1), and creates the dashed edge (1, 2, 1), resulting in (d). In (d), Regions sort processes all three 2-paths with vertex 1 as broker in parallel. The three 2-paths are (1, 2, 1), (2, 1, 2); (3, 1, 1), (1, 3, 1); and (2, 1, 1), (1, 2, 1), none of which adds any new edges. This results in (e), at which point the array is sorted, and the algorithm is left with an empty graph in (f).

6-1 Speedup over single-thread performance of 2-path vs. thread count for unif(10⁹, 10⁹).
6-2 Speedup over single-thread performance of 2-path vs. thread count for zipf₀.₇₅(10⁹, 10⁹).
6-3 36-core running time vs. input size 푛 for unif(10⁸, 푛).
6-4 36-core running time vs. range 푟 for unif(푟, 10⁹).
6-5 36-core running time vs. range 푟 for zipf₀.₇₅(푟, 10⁹).
6-6 36-core running time vs. 휃 for zipf휃(10⁹, 10⁹).
6-7 Breakdown of 36-core running time for 2-path on unif(10⁹, 10⁹) and zipf₀.₇₅(10⁹, 10⁹).
6-8 36-core running time vs. radix size for unif(10⁹, 10⁹).
6-9 36-core running time vs. radix size for zipf₀.₇₅(10⁹, 10⁹).
6-10 36-core running time vs. 퐾 for unif(10⁹, 10⁹).
6-11 36-core running time vs. 퐾 for zipf₀.₇₅(10⁹, 10⁹).

List of Tables

3.1 Terminology.

6.1 Serial times (푇1) and parallel (푇36ℎ) times (seconds), as well as the parallel speedup (SU) on a 36-core machine with two-way hyper-threading. The fastest parallel time for each input is bolded.

6.2 Serial times (푇1) and parallel (푇36ℎ) times (seconds), as well as the parallel speedup (SU) on a 36-core machine with two-way hyper-threading. This table is a continuation of Table 6.1, and results in both tables are comparable.

Chapter 1

Introduction

Radix sort stands out as a popular algorithm for integer sorting that has better theoretical guarantees than comparison-based sorting. To sort 푛 integers from the range [0, . . . , 푟 − 1], radix sort requires 푂(푛 log 푟) work (number of operations) in the worst case. This can be significantly smaller than the Ω(푛 log 푛) work bound required by comparison-based sorting when the integer range is small.

Most Significant Digit (MSD) radix sort sorts records consisting of key-value pairs according to their keys. A radix is a digit in a certain integer representation. The algorithm starts with the most significant radix, sorts the records according to that radix, and then recurses on keys with equal radices to break ties using less significant radices. The algorithm proceeds until all ties are broken or all the bits of the key are exhausted. Thus, the main component of radix sort is sorting a set of records based on a single digit that can take a small number of possible values, called buckets.

Serial radix sort can be implemented with or without linear auxiliary space. When using an auxiliary array, the algorithm reads one record at a time from the input array and writes the record to its correct place in the auxiliary space. In contrast, the in-place algorithm reads record-by-record, and if a record is not in its correct place, it is swapped with the record occupying that place.

Parallel radix sorting can be implemented in 푂(푛 log 푟) work and 푂(log 푟 log 푛) span [7, 38, 31], and there are a variety of practical parallel implementations that have been developed [7, 38, 36, 29, 25, 37, 31, 18, 21, 27, 30, 10, 19, 35, 34].

Input: 0 0 2 0 3 2 1 1 1 3 1

Figure 1-1: Represents the data movements required to sort records in one iteration of in-place radix sort, thus introducing potential read-write races. The colors and numbers in the records represent the bucket each record belongs to. The colored line under the array illustrates where records belonging to each bucket will end up after the array gets sorted.

However, no existing algorithm is able to perform parallel radix sort with sub-linear auxiliary space and sub-linear span (longest sequential dependence). This is because doing in-place swaps to sort creates many read-write races [27]. Figure 1-1 illustrates where each record should be moved in order for the array to be sorted according to one radix. If these records were moved concurrently according to the arrows shown, read-write races would arise. On the other hand, using auxiliary memory leads to incurring more cache misses due to a larger memory footprint. In this thesis, I show that it is possible to achieve both low space usage and high parallelism.

The only existing approach for a parallel in-place radix sort is PARADIS.

PARADIS is a speculative parallel MSD radix sort algorithm. To avoid races, PARADIS divides the array evenly among processors. Each processor is given a set of sub-arrays and is required to perform swaps in a manner that places as many records in the correct position as possible. In the best case, this leads to sorting the whole array in 푂(푛) work. However, this is not necessarily the case, since one processor might need to move records to a sub-array owned by another processor. Consequently, PARADIS has to perform more such permutation iterations until all the records are sorted. Since each iteration sorts some records, the algorithm terminates. However, the worst-case span of PARADIS is Ω(푛 log 푟), which is no better than the span of serial MSD radix sort. To my knowledge, there exists no prior work that guarantees a sub-linear span for a parallel in-place radix sort.

In this thesis, I introduce Regions sort and use it to implement a work-efficient parallel in-place radix sorting algorithm that requires sub-linear span and has competitive performance compared to other parallel sorting algorithms. In Chapter 2, I define the problem and my assumptions more formally. In addition, I explain and analyze the most relevant previous work using the same computational model. Chapter 3 describes Regions sort and two versions of it: 2-path finding and cycle finding. It dives into major implementation details that can affect the performance and asymptotic complexity of these versions. In Chapter 4, I analyze the algorithm by proving a number of lemmas about the auxiliary space, work, and span of Regions sort. This analysis concludes that Regions sort is a work-efficient algorithm with sub-linear span that requires sub-linear auxiliary space. Chapter 5 describes optimizations that improve the practical performance of Regions sort. Finally, in Chapter 6, I present a comprehensive empirical study comparing the performance of the two versions of Regions sort with state-of-the-art parallel sorting algorithms on various inputs and distributions. The code for Regions sort is publicly available at https://github.com/omarobeya/parallel-inplace-radixsort.

Chapter 2

Related Work

2.1 Preliminaries

Notation. I refer to the number of distinct possible values a radix can take as 퐵, which is also the number of buckets. I refer to the number of bits needed to represent the radix, also known as the radix size, as 푑, such that 퐵 = 2^푑. Unlike the rest of the thesis, in figure captions I refer to directed edges using source, destination, and edge weight. This description is not fully precise, since multiple edges with the same source, destination, and weight can co-exist. However, it keeps the graphs simple while staying clear enough for the small examples presented in the figures.

In-place Algorithms. An algorithm is considered in-place if the total memory it uses is 표(푛), i.e., sub-linear in the size of the input. In this thesis, I assume each pointer into an array requires space logarithmic in the size of that array.

Work-Span Model. In this thesis, I use the work-span model to analyze the complexity of algorithms.

1. Work: total number of instructions.

2. Span: length of the longest chain of dependency across instructions.

Under this model, the worst-case complexity of an algorithm running on 푃 processors is upper bounded by 푊표푟푘/푃 + 푂(푆푝푎푛) using a work-stealing scheduler [8].

Primitives. In this thesis, I refer to three important primitives: prefix sum, merge, and histogram building.

1. Prefix Sum: Takes an array of 푠 integers 퐴[0], 퐴[1], . . . , 퐴[푠 − 1], a binary operator ⊕, and an identity 퐼 such that 퐼 ⊕ 푥 = 푥 for all 푥; and returns an array of 푠 + 1 integers: 퐼, 퐼 ⊕ 퐴[0], 퐼 ⊕ 퐴[0] ⊕ 퐴[1], . . . , 퐼 ⊕ 퐴[0] ⊕ · · · ⊕ 퐴[푠 − 1]. The prefix sum can be implemented efficiently in parallel, achieving 푂(푠) work and 푂(log 푠) span, and requiring 푂(푠) auxiliary space [17].

2. Merge: Takes two sorted arrays of integers 퐴1 and 퐴2, and returns a sorted array 퐴3 that contains all the elements of 퐴1 and 퐴2. A parallel merge needs 푂(|퐴1| + |퐴2|) work, 푂(log |퐴1| + log |퐴2|) span, and 푂(|퐴1| + |퐴2|) auxiliary space [17].

3. Histogram Building: Takes an array 퐴 of 푛 records and a function 푓 that maps each record to an integer from 0 to 퐵 − 1, and returns an array 퐶표푢푛푡푠, where 퐶표푢푛푡푠[푗] equals the number of indices 푖 for which 푓(퐴[푖]) = 푗. The histogram can be built in parallel by dividing the array into 푆 equal sub-arrays, calculating the histogram of each sub-array independently in serial, and finally aggregating these counts using a parallel sum over each bucket independently, as in the sketch below. This requires 푂(푛 + 푆퐵) work, 푂(푛/푆 + log 푆) span, and 푂(푆퐵) auxiliary space. Parallel histogram building was used in previous parallel radix sort algorithms, like PARADIS [10]. Serial histogram building refers to the case where 푆 = 1.
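The following is a minimal C++ sketch of this scheme, assuming OpenMP for the parallel loops; the function name parallel_histogram and the choice of 푆 are illustrative, not part of the original implementation.

#include <algorithm>
#include <cstdint>
#include <vector>

// Sketch: S private tables filled serially, then a per-bucket aggregation.
// OpenMP pragmas stand in for a generic parallel-for construct.
std::vector<size_t> parallel_histogram(const std::vector<uint64_t>& A,
                                       size_t B, size_t S,
                                       size_t (*f)(uint64_t)) {
  std::vector<std::vector<size_t>> local(S, std::vector<size_t>(B, 0));
  size_t chunk = (A.size() + S - 1) / S;
  #pragma omp parallel for
  for (size_t s = 0; s < S; s++) {                 // serial count per chunk
    size_t lo = s * chunk, hi = std::min(A.size(), lo + chunk);
    for (size_t i = lo; i < hi; i++) local[s][f(A[i])]++;
  }
  std::vector<size_t> counts(B, 0);
  #pragma omp parallel for
  for (size_t b = 0; b < B; b++)                   // aggregate each bucket
    for (size_t s = 0; s < S; s++) counts[b] += local[s][b];
  return counts;
}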

Countries. When sorting integers according to a single radix that can take 퐵 different values, each record can be classified into one of 퐵 buckets. Records of the same bucket have to be placed contiguously. Most algorithms have to pre-calculate the sub-array that will store the records belonging to each bucket at the end of sorting. In this thesis, I refer to such a sub-array as a country. More specifically, a country is a sub-array that, when sorting finishes, will hold all of, and only, the records belonging to a specific bucket. Hence, there is a one-to-one correspondence between buckets and countries. Many algorithms set a pointer to the start of each country and update it as more elements are added to the country. I refer to such a pointer as a country pointer. To calculate the size of each country, histogram building is used. Calculating country pointers is done by performing a prefix sum on the country sizes.
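As a small illustration, the sketch below derives country pointers from a counts array using a serial exclusive prefix sum (a parallel prefix sum [17] applies for large 퐵); the names H and T are illustrative, mirroring start and one-past-the-end pointers.

#include <cstddef>
#include <vector>

// Country pointers from country sizes via an exclusive prefix sum.
// H[i] is the start of country i; T[i] is one past its end.
void country_pointers(const std::vector<size_t>& counts,
                      std::vector<size_t>& H, std::vector<size_t>& T) {
  size_t B = counts.size();
  H.assign(B, 0);
  T.assign(B, 0);
  size_t sum = 0;                       // running exclusive prefix sum
  for (size_t i = 0; i < B; i++) {
    H[i] = sum;
    sum += counts[i];
    T[i] = sum;
  }
}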

2.2 Previous Work

2.2.1 Serial In-place Radix Sort

Algorithm 1 shows the pseudocode for serial in-place radix sort, also known as American Flag Sort [24]. Lines 2–4 perform serial histogram building, and line 5 uses the histogram to calculate the country pointers 퐻[푖]. Country pointers are updated continuously to point to the first unsorted record in the corresponding country. The values 푇[푖] are analogous to 퐻[푖], but they point to the first record of the next country. 퐻[푖] is updated dynamically to keep pointing at the first unsorted record in the country, while 푇[푖] stays static through the rest of the algorithm to bound 퐻[푖]. Lines 6–10 sort all records by swapping each of them once into its correct country. In particular, the algorithm sorts one country 푖 at a time, starting from its country pointer 퐻[푖], which points to the first record that does not belong to the country. The algorithm repeatedly swaps that record with a record from the record's correct country and increments the pointer of that country, since it now contains one more record that belongs to it. The process is repeated until the record at position 퐻[푖] actually belongs to country 푖, in which case 퐻[푖] is incremented. The algorithm takes linear work since each swap in the loop on lines 8–9 sorts at least one record. Then, it recurses on records within the same country to break ties.

2.2.2 Parallel Out-of-place Radix Sort

The parallel out-of-place MSD radix sort algorithm starts by performing a parallel histogram to calculate country pointers. Then it divides the array into sub-arrays, each of which is given to a processor.

Algorithm 1 Pseudocode for Serial In-Place Radix Sort
◁ extract(푎, level) extracts the appropriate bits from element 푎 based on the radix size and current level.

1: procedure RadixSort(퐴, 푛, level)
2:   퐶 = {0, . . . , 0}  ◁ length-퐵 array storing per-bucket counts
3:   for 푎 in 퐴 do
4:     퐶[extract(푎, level)]++
5:   Prefix sum on 퐶 to get per-bucket start and end indices stored in 퐻 and 푇, respectively
6:   for 푖 = 0 to 퐵 − 1 do
7:     while 퐻[푖] < 푇[푖] do
8:       while extract(퐴[퐻[푖]], level) ̸= 푖 do
9:         swap(퐴[퐻[푖]], 퐴[퐻[extract(퐴[퐻[푖]], level)]++])
10:      퐻[푖]++
11:  if level < maxLevel then
12:    for 푖 = 0 to 퐵 − 1 do
13:      RadixSort(퐴푖, 퐶[푖], level + 1)  ◁ 퐴푖 is the sub-array containing elements in bucket 푖
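For concreteness, a runnable C++ rendering of Algorithm 1 is sketched below; the 8-bit radix and the 64-bit key type are illustrative choices, not dictated by the pseudocode.

#include <cstdint>
#include <utility>

constexpr int kRadixBits = 8, kBuckets = 1 << kRadixBits;
constexpr int kMaxLevel = 64 / kRadixBits - 1;   // 7 for 64-bit keys

// MSD first: level 0 extracts the most significant byte.
inline size_t extract(uint64_t a, int level) {
  return (a >> ((kMaxLevel - level) * kRadixBits)) & (kBuckets - 1);
}

void radix_sort(uint64_t* A, size_t n, int level) {
  if (n <= 1) return;
  size_t C[kBuckets] = {0};
  for (size_t i = 0; i < n; i++) C[extract(A[i], level)]++;   // lines 2-4
  size_t H[kBuckets], T[kBuckets], sum = 0;                   // line 5
  for (int i = 0; i < kBuckets; i++) { H[i] = sum; sum += C[i]; T[i] = sum; }
  for (int i = 0; i < kBuckets; i++) {                        // lines 6-10
    while (H[i] < T[i]) {
      while (extract(A[H[i]], level) != (size_t)i)            // swap record home
        std::swap(A[H[i]], A[H[extract(A[H[i]], level)]++]);
      H[i]++;
    }
  }
  if (level < kMaxLevel) {                                    // lines 11-13
    size_t start = 0;
    for (int i = 0; i < kBuckets; i++) {
      radix_sort(A + start, C[i], level + 1);
      start += C[i];
    }
  }
}
// e.g., radix_sort(data, n, 0) sorts from the most significant byte down.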

Each processor inserts its records into the correct country in the auxiliary space. Write races can be avoided by giving each processor exclusive access to a fraction of each country 푖 equal to the number of records read by that processor that belong to country 푖, for all 0 ≤ 푖 < 퐵. This can be done by performing a prefix sum on the number of records belonging to bucket 푖 read by each processor. More space is also needed to keep track of the country pointers at each level of recursion for each processor, amounting to 푂(푃퐵 log 푛 log 푟) space. The algorithm performs 푂(푛 log 푟) work, has 푂(log 푛 log 푟) span, and uses Ω(푛 + 푃퐵 log 푟 log 푛) auxiliary space.
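A minimal sketch of this exclusive-offset computation is given below, under the stated assumptions (a serial prefix sum for brevity, and counts[p][b] holding the number of records in processor 푝's sub-array that belong to bucket 푏); the function name is illustrative.

#include <cstddef>
#include <vector>

// offsets[p][b]: position in the auxiliary array where processor p writes
// its first record of bucket b. Iterating buckets in the outer loop keeps
// each bucket's country contiguous across processors.
std::vector<std::vector<size_t>> exclusive_offsets(
    const std::vector<std::vector<size_t>>& counts,   // counts[p][b]
    size_t P, size_t B) {
  std::vector<std::vector<size_t>> offsets(P, std::vector<size_t>(B, 0));
  size_t sum = 0;                // serial prefix sum over (bucket, processor)
  for (size_t b = 0; b < B; b++)
    for (size_t p = 0; p < P; p++) {
      offsets[p][b] = sum;
      sum += counts[p][b];
    }
  return offsets;
}
// Processor p then writes each of its records x to aux[offsets[p][f(x)]++];
// the ranges are disjoint, so no write races occur.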

2.2.3 PARADIS Algorithm

Cho et al. [10] introduce PARADIS, the only existing algorithm for parallel in-place MSD radix sort. PARADIS sorts records into their respective countries and then recurses independently in parallel on each country. In order to sort a set of records according to a certain radix, PARADIS first pre-calculates country sizes and locations in parallel, through parallel histogram building, once at the beginning of the algorithm. Next, the algorithm repeatedly performs two phases until the whole array is sorted: the permutation phase and the repair phase.

In the permutation phase, each country is divided evenly among the processors. This prevents the read-write races mentioned before, since each record can be read and written by exactly one processor. Next, each processor permutes its records such that it puts as many records as possible in the correct country. This part is very similar to the serial in-place sort, since each element is read only once, and if it is not in the correct country, it is swapped with a record in its correct country. The only difference is that a processor might read more records belonging to a specific country 푥 than it has room for in country 푥, since a processor can read all the elements that belong to 푥 while it has access to only a 1/푃 fraction of the unsorted records in country 푥. This leads to some records not being put into their correct countries after the permutation phase. However, the authors show that the algorithm always makes progress by putting at least a constant fraction of the records in their correct countries.

The repair phase follows. It is responsible for separating the sorted records from the unsorted records, so that the algorithm can reattempt to permute them. For each country, records that do not belong to the country and could not be permuted in the previous phase are put at the end of the country. The new country pointers point to the first misplaced record in each original country, and country sizes are updated to reflect the number of misplaced records, effectively reducing the problem size. Repairing each country is independent and happens in parallel. Therefore, the span of the algorithm is constrained by the maximum fraction of misplaced elements shifted by one processor, which the authors call 푤.

The algorithm is in-place since it uses sub-linear space. Each of the 푃 processors needs to keep 푂(퐵) country pointers at each recursion level, resulting in 푂(푃퐵 log퐵 푟 log 푛) space.¹ At the first iteration of the permutation phase the total work is Θ(푛), but since the number of records to be sorted shrinks by a factor of at least (퐵 − 1)/퐵 per iteration, the total work done to sort records according to one radix is 푂(퐵푛), amounting to 푂(퐵푛 log 푟) work to sort the integers completely. Furthermore, the authors show an adversarial input to their algorithm that leads to a Θ(푛) span per radix. Therefore, the span per radix is Ω(푛), amounting to Ω(푛 log 푟) in total.

PARADIS is inefficient for both small and large values of 퐵. For small values of 퐵, the algorithm does not scale, since the repair phase assigns each country to be repaired by one processor. For large values of 퐵, the work of PARADIS blows up even more, since it takes 푂(퐵푛) work per radix, which is a factor of 퐵 more than the serial in-place radix sort, which takes only Θ(푛). This makes the running time of PARADIS on 푃 processors Ω(max((퐵/푃)푛, 푛)).

¹The bounds here are different from [10], as the authors assume that 푃, 퐵, and 푟 are constant, and that pointers take 푂(1) space.

2.2.4 IPS4o Algorithm

There has also been significant prior work on parallel algorithms for comparison sorting (e.g., [16, 9, 1, 28, 15, 23, 5, 12, 35, 11, 6, 4, 31, 13, 39]). In terms of implementations, IPS4o is one of the most recent and has been shown to achieve state-of-the-art performance [2]. IPS4o [2] is an in-place comparison-based sorting algorithm that depends on sampling. IPS4o recursively divides the array into 푘 buckets by randomly picking 푘 − 1 pivots, and then sorts records according to those pivots. Next, the array is divided among processors, each of which processes a sub-array serially. Each processor has 푘 buffers (each corresponding to a bucket), each of size 푏. When a processor reads a record, it puts it in the corresponding buffer. When the buffer is full, the processor dumps all of the buffer's records into its sub-array, replacing 푏 already-read records. Country borders are calculated using parallel histogram building. Some buffers will be located in the correct country; others will not. IPS4o swaps incorrectly located buffers to sort them. IPS4o has to perform corrections to account for the fact that not all countries are aligned with buffer boundaries or have a size divisible by 푏. Unlike Regions sort, which uses atomics only implicitly to join threads, IPS4o uses atomics to implement locks for swapping buffers.

Chapter 3

Algorithm

3.1 Algorithm Overview

In this chapter, I present Regions sort, a novel parallel work-efficient in-place radix sorting algorithm. Algorithm 2 describes Regions sort at a high level. Table 3.1 defines the terminology used to describe the algorithm. At a high level, the algorithm is an MSD radix sort: it classifies records based on a certain radix into 퐵 countries, and then recurses on each country independently, in parallel, to break ties using less significant radices. In order to pre-calculate how many levels of recursion are needed, the maximum key is computed once at the beginning of the algorithm. The maximum key can be computed using the prefix sum algorithm by setting the binary operator ⊕ to 푚푎푥(., .) and the identity 퐼 to −∞. Once the maximum is computed, the algorithm proceeds by recursively sorting records according to one radix at a time. I first explain how Regions sort sorts records according to a single radix in a single round of radix sort. In a single round, Regions sort operates in three phases: local sorting, regions graph construction, and global sorting. After sorting, if the algorithm is not already at the last recursion level, it recurses on each country concurrently.
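As a small illustration of this pre-calculation, the following sketch computes the number of radix rounds from the maximum key; it uses a serial loop where the implementation would use a parallel reduction (⊕ = 푚푎푥), and the radix size 푑 is an illustrative parameter.

#include <algorithm>
#include <cstdint>
#include <vector>

// Number of recursion levels needed: ceil((bits of max key) / d).
int num_levels(const std::vector<uint64_t>& A, int d) {
  uint64_t mx = 0;                       // identity for max over unsigned keys
  for (uint64_t a : A) mx = std::max(mx, a);
  int bits = 0;
  while (bits < 64 && (mx >> bits) != 0) bits++;   // bits of the max key
  return (bits + d - 1) / d;                       // radix rounds, d bits each
}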

Table 3.1: Terminology.

Country: A contiguous sub-array of 퐴 that will contain all keys belonging to a certain bucket after 퐴 is sorted by the current radix. Country 푖 (also referred to as 퐴푖) is the sub-array containing all keys belonging to bucket 푖 after sorting.

Block: A contiguous sub-array of 퐴. The 푘'th block is the sub-array 퐴[푘푛/퐾, . . . , (푘 + 1)푛/퐾 − 1].

Homogeneous Sub-array: A contiguous sequence of keys belonging to the same bucket.

Region: A homogeneous sub-array of maximum size within the same country.

3.2 Local Sorting

During the local sorting phase (Lines 5–6 of Algorithm 2), Regions sort partitions the array into 퐾 equal-sized sub-arrays; I refer to each as a block. Each block is sorted serially but independently of other blocks, allowing different blocks to be sorted concurrently. This phase is meant to decrease the irregularity in the input. After local sorting, the array has no more than 퐾퐵 homogeneous sub-arrays in a row, where a homogeneous sub-array is a contiguous group of records that belong to the same bucket. This serves two purposes. First, it makes it possible to represent the array using sub-linear auxiliary space. Second, it allows performing swaps between these sub-arrays in batches, since all records within a homogeneous sub-array need to be moved to the same country, which leaves more room for parallelism and data locality.

3.3 Regions Graph Construction

In order to complete sorting the array, homogeneous sub-arrays need to be reordered such that each homogeneous sub-array falls in the right country. Hence, Regions sort has to maintain a data structure that keeps track of all homogeneous sub-arrays belonging to a specific country 푖 and all homogeneous sub-arrays that need to be moved out of country 푖, for all 0 ≤ 푖 < 퐵.

Algorithm 2 Pseudocode for Regions Sort

1: 푛 = initial input size
2: 퐾 = initial number of blocks
3: procedure RadixSort(퐴, 푁, level)
4:   퐾′ = ⌈퐾 · 푁/푛⌉
   ◁ Local Sorting Phase
5:   parfor 푘 = 0 to 퐾′ − 1 do
6:     Do serial in-place distribution on 퐴[푘푁/퐾′, . . . , (푘 + 1)푁/퐾′ − 1]
   ◁ Graph Construction Phase
7:   Compute global counts array 퐶 from local counts, and per-bucket start and end indices
8:   Generate regions graph 퐺
   ◁ Global Sorting Phase
9:   while 퐺 is not empty do
10:    Process a subset of non-conflicting edges in 퐺 in parallel
11:    Remove processed edges and add new edges to 퐺 if needed
   ◁ Recursion
12:  if level < maxLevel then
13:    parfor 푖 = 0 to 퐵 − 1 do
14:      if 퐶[푖] > 0 then
15:        RadixSort(퐴푖, 퐶[푖], level + 1)

Furthermore, such a data structure has to overcome two challenges. First, it needs to be efficient to update, since data movements should be reflected in the data structure. Second, it needs to use sub-linear memory after any number of data movements.

A region is a homogeneous sub-array that resides in a single country and cannot be extended to the left or to the right to take in more records that are homogeneous and within the same country borders. By definition, all records within a homogeneous sub-array need to be moved to the same country. Thus, it is possible to represent all required data movements by listing all regions.

A regions graph is a data structure that represents the data movements required to put records into their respective countries, while being flexible for dynamic updates in the global sorting phase. In a regions graph, each country is represented by a node, and each region is represented by a directed weighted edge that goes from the node corresponding to the country it resides in to the node corresponding to the country it belongs to.


Figure 3-1: (a) shows the value of the current radix of keys in the input for a 2-bit radix (퐵 = 4), and the lines below correspond to the countries (blue is country 0, orange is country 1, green is country 2, and purple is country 3). (b) shows the corresponding regions graph.

The weight of each edge represents the size of the corresponding region. In order to plan the swaps, edges also need to store the start of the corresponding region. A regions graph does not include zero-weight edges or self-edges, since they do not correspond to any data movement. Consequently, a sorted array corresponds to a regions graph with no edges. Figure 3-1 shows an example array and the corresponding regions graph. After the local sorting phase, building the regions graph requires sub-linear work and space. Lemma 1 proves that the number of edges in any regions graph after the local sorting phase is sub-linear in 푛.

Lemma 1. After the local sorting phase, the total number of edges in any regions graph is 푂(퐾퐵).

Proof. The total number of maximal homogeneous sub-arrays is 푂(퐾퐵), since each of the 퐾 blocks contains at most 퐵 such homogeneous sub-arrays. Each such homogeneous sub-array that does not cross a country boundary is a region. Since the total number of countries is 퐵, there can exist no more than 퐵 − 1 country boundaries. Therefore, the total number of regions is 푂(퐾퐵). The number of edges is equal to the number of regions minus the number of regions that do not need to be relocated.

The Regions sort implementation uses an edge list, rather than an adjacency matrix. Therefore, the work and space needed to build the graph is linear in the number of edges and nodes, which is 푂(퐵 + 퐾퐵). In contrast, an adjacency matrix requires 푂(퐾퐵 + 퐵²) space and work to build.
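A minimal sketch of such an edge-list representation is shown below; the struct layout and the country_of helper are illustrative assumptions, not the exact layout used in the implementation.

#include <cstddef>
#include <vector>

// One vector of edges per country (node). The country_of helper (mapping
// an array index to the country containing it, e.g., by binary search over
// the country boundaries) is assumed, not shown.
struct Edge {
  size_t dest;    // country the region's records belong to
  size_t weight;  // number of records in the region
  size_t start;   // index of the region's first record in A
};

struct RegionsGraph {
  std::vector<std::vector<Edge>> out;   // edge list per country
  explicit RegionsGraph(size_t B) : out(B) {}

  void add_region(size_t src, size_t dest, size_t start, size_t weight) {
    if (weight == 0 || src == dest) return;  // no movement: omit the edge
    out[src].push_back({dest, weight, start});
  }
};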

3.4 Global Sorting

The aim of the global sorting phase is to use the regions graph to move each homogeneous sub-array to its respective country while using sub-linear memory. Global sorting operates in steps, where in each step the algorithm picks one country to sort completely, after which that country will include all the records that belong to it. This also implies that it will not contain records that do not belong to it, as shown by the following lemma.

Lemma 2. For any vertex in a regions graph, the sum of weights of incoming edges equals the sum of weights of outgoing edges.

Proof. By definition, the size of a country equals the total number of records that belong to it. Therefore, the total number of records that belong to the country but are outside it equals the total number of records that do not belong to the country but are inside it. By definition, the former is the sum of the weights of the incoming edges, while the latter is the sum of the weights of the outgoing edges.

Using lemma 2, it follows that once a country is sorted, the corresponding node will be isolated.

The global sorting phase iterates repeatedly through three important stages:

29 1. Planning: choosing sub-arrays to swap and storing these choices.

2. Executing: swapping sub-arrays in parallel.

3. Graph Update: updating the regions graph to reflect the changes in the array.

The algorithm iterates over these three stages until the regions graph has no edges, meaning that the array is completely sorted. In this work, I present two methods for global sorting: cycle finding and 2-path finding.

3.4.1 Cycle Finding

A cycle in a regions graph indicates a group of homogeneous sub-arrays that can be reordered such that each of them ends up in its correct country. By construction of the regions graph, each edge points from the country its region resides in to the country it should move to. Thus, the sub-array corresponding to each edge in the cycle can be shifted to replace the sub-array corresponding to the following edge in the cycle. In order for sub-arrays to switch places, they need to have the same size. Therefore, only the first 푤푚푖푛 records of each sub-array are moved, where 푤푚푖푛 is the minimum edge weight on the cycle. Note that this means that not all the records represented by the edges of the cycle will be sorted by the described shift. The cycle finding algorithm uses an edge list implementation to iteratively plan a cycle, execute it, and update the graph to reflect the change, until the graph has no edges. It is possible to prove that the regions graph of an unsorted array always contains a cycle. Furthermore, due to the property in Lemma 2, finding a cycle in a regions graph is easier than in general graphs.

Lemma 3. A cycle can always be found in a non-empty regions graph in 푂(퐵) graph traversals.

Proof. Consider a non-empty regions graph. Start at a node with an outgoing edge and repeatedly traverse the graph through outgoing edges. There are two possibilities. The first possibility is reaching a node with an incoming edge but no outgoing edges; however, this contradicts Lemma 2. The second possibility is to keep traversing nodes; after visiting 퐵 + 1 nodes, some node must have been visited twice by the pigeonhole principle. Thus, the graph contains a cycle.

Cycle finding using depth-first search is efficient with a straightforward edge list implementation. The regions graph contains 퐵 edge lists, one for each node. Each list contains the set of edges going out of the corresponding node. Finding an arbitrary neighbor takes only 푂(1) work, by using the edge at the beginning of the edge list. Updating the graph also requires 푂(1) work to change the weight of an edge, and 푂(1) work to delete an edge entirely by advancing the pointer to the start of the edge list. The algorithm proceeds as follows: each time a node is chosen to be sorted, the algorithm iteratively finds a cycle connected to that node, executes it, and updates the graph. This continues until the node has no more edges. Then, the algorithm picks another node as mentioned before.

Consider a cycle 푐 of length 푙 defined by the edges (푣0, 푣1, 푤0, 푠0), . . . , (푣푙−1, 푣0, 푤푙−1, 푠푙−1), with 푤푚푖푛 denoting the minimum weight on the cycle.

1. Planning: finding a cycle using depth-first search starting from the node of interest.

2. Executing: moving the record at position 푠푖 + 푗 to position 푠(푖+1) mod 푙 + 푗, for all 0 ≤ 푖 ≤ 푙 − 1 and 0 ≤ 푗 < 푤푚푖푛, in parallel.

3. Graph Update: updating each edge on the cycle by subtracting 푤푚푖푛 from its weight and adding 푤푚푖푛 to its start. Edges whose weights drop to zero are deleted.

Figure 3-2 shows an example of the cycle finding algorithm. The algorithm eventually terminates since the total sum of weights in the graph always decreases. In addition, every time 푤푚푖푛 is subtracted from each edge on a cycle, at least one edge gets deleted, because by definition at least one edge's weight equals the minimum weight on the cycle. A simplified sketch of one planning/execution/update iteration follows.
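Below is a simplified serial C++ sketch of one such iteration, assuming the Edge layout from the graph-construction sketch (redeclared here for completeness); the real implementation parallelizes the record moves across the orders 푗, as discussed after Figure 3-2.

#include <algorithm>
#include <cstdint>
#include <vector>

struct Edge { size_t dest, weight, start; };   // as in the earlier sketch

// head[v] indexes the first live edge of node v, so deleting the front
// edge of a list is O(1) (advance head[v]).
void process_one_cycle(std::vector<std::vector<Edge>>& out,
                       std::vector<size_t>& head,
                       size_t v0, std::vector<uint64_t>& A) {
  // Planning: follow first live out-edges until a node repeats (Lemma 3).
  std::vector<size_t> nodes{v0};
  std::vector<int> pos(out.size(), -1);
  pos[v0] = 0;
  for (size_t v = v0;;) {
    v = out[v][head[v]].dest;
    if (pos[v] >= 0) {                          // cycle is nodes[pos[v]..]
      nodes.erase(nodes.begin(), nodes.begin() + pos[v]);
      break;
    }
    pos[v] = (int)nodes.size();
    nodes.push_back(v);
  }
  size_t l = nodes.size(), wmin = SIZE_MAX;
  for (size_t v : nodes) wmin = std::min(wmin, out[v][head[v]].weight);
  // Executing: rotate the first wmin records of each region along the
  // cycle (serial here; independent orders j can be moved in parallel).
  for (size_t j = 0; j < wmin; j++) {
    uint64_t tmp = A[out[nodes[l - 1]][head[nodes[l - 1]]].start + j];
    for (size_t i = l - 1; i > 0; i--)
      A[out[nodes[i]][head[nodes[i]]].start + j] =
          A[out[nodes[i - 1]][head[nodes[i - 1]]].start + j];
    A[out[nodes[0]][head[nodes[0]]].start + j] = tmp;
  }
  // Graph update: shrink every cycle edge; a weight-zero edge is deleted
  // in O(1) by advancing the list head.
  for (size_t v : nodes) {
    Edge& e = out[v][head[v]];
    e.weight -= wmin;
    e.start += wmin;
    if (e.weight == 0) head[v]++;
  }
}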


Figure 3-2: (a) shows the value of the current radix of keys in the input for a 2-bit radix (퐵 = 4). At each step, the corresponding swaps for each cycle are shown in the array above the regions graph. (b) shows the corresponding regions graph of (a), and the cycle to process is highlighted in red. (c) and (d) show the array and regions graph after processing the highlighted cycle in (b). In particular, the edges (0, 2, 1) and (1, 0, 1) are deleted, and (2, 1, 2) is modified to (2, 1, 1). The next cycle to process is highlighted in (d). (e) and (f) show the input and regions graph after processing the highlighted cycle in (d). After processing the remaining cycle in (f), the array is completely sorted for the current recursion level, as shown in (g), with the regions graph having no edges, as shown in (h).

Independent records can be swapped in parallel. Let us define a record with order 푗 with respect to a cycle 푐 as a record whose index equals 푠푎 + 푗 for some edge 푎 on cycle 푐, where 0 ≤ 푗 < 푤푚푖푛. Using the execution definition, it is easy to see that all records with different orders with respect to a cycle are independent and can be moved in parallel. In subsection 4.3.1, I prove the span bound of the cycle finding algorithm. In section 5.1, I discuss an optimization that, on one hand, increases parallelism by executing multiple independent cycles concurrently, and on the other hand, increases the auxiliary space required by the algorithm.

32 3.4.2 2-path Finding

The second approach depends, in essence, on swapping a pair of equal-sized homogeneous sub-arrays, one of which belongs to country 푏 but resides in country 푎, and the other of which resides in country 푏 but belongs to country 푐. Swapping the two homogeneous sub-arrays puts the latter in its correct country and moves the former to a new place. I refer to such a pair of sub-arrays as a 2-path. In addition, I refer to country 푎 as the source, country 푏 as the broker, and country 푐 as the target. The naming comes from the way a 2-path appears in the regions graph. One of the homogeneous sub-arrays must be a subset of the region represented by the edge going from the source to the broker. The other must be a subset of the region represented by the edge going from the broker to the target. A 2-path 푝 = (푠1, 푠2, 푤, 푎, 푏, 푐) can be defined using a pointer to the start of the first sub-array 푠1, a pointer to the start of the second sub-array 푠2, the length of both sub-arrays 푤, the source 푎, the broker 푏, and the target 푐. However, in figures I refer to 2-paths using the edges that form them. This does not completely define a 2-path; however, it is sufficient for the examples presented in the figures and is convenient for clarity. Figure 3-3 illustrates how a 2-path looks in a regions graph and how the array and the corresponding regions graph change when that 2-path is executed.

Planning all 2-paths with the same broker 푏 can be done by matching all incoming records with all outgoing records. By Lemma 2, the number of incoming records equals the number of outgoing records. Let the number of incoming records of node 푏 (equivalently, the sum of the weights of its incoming edges) be 퐻푏. By indexing incoming records and outgoing records independently, each from 0 to 퐻푏 − 1, each incoming record can be mapped to the outgoing record that shares the same index. Since indexing individual records is inefficient, it is possible to simulate the same operation by indexing edges using a prefix sum on the weights. In other words, the algorithm performs two prefix sums, one on the weights of the incoming edges and another on the weights of the outgoing edges; I refer to the two sets of prefix sums as the incoming prefix sums and the outgoing prefix sums, respectively.


Figure 3-3: (a) and (b) show an example of processing a 2-path formed by the edges (1, 0, 1) and (0, 2, 1). Here vertex 1 is the source, vertex 0 is the broker, and vertex 2 is the sink. The result of processing this 2-path is shown in (c) and (d). The edges (1, 0, 1) and (0, 2, 1) are deleted, and a new edge (1, 2, 1) is formed (the dashed edge in (b)).

The index of each record equals the prefix sum of the edge corresponding to its region plus its index within the region itself. To save space and work, given a certain ordering of the incoming edges and outgoing edges, the number of 2-paths needs to be minimized by maximizing the weight of each 2-path. Therefore, each 2-path has one of its sub-arrays bordering a region on the left (start) and one of its sub-arrays bordering a region on the right (end); otherwise, both sub-arrays could be extended to include more homogeneous records. Consequently, each 2-path starts at one of the prefix sums and ends at the following prefix sum. By merging both sets of prefix sums, it is possible to calculate all the 2-paths. To update the graph, the algorithm deletes all the edges connected to the broker and adds new edges corresponding to homogeneous sub-arrays that moved but are still in the wrong country and are yet to be sorted.

Lemma 4. Executing a single 2-path, leads to adding at most one edge.

Proof. By construction, a 2-path 푝 = (푠1, 푠2, 푤, 푎, 푏, 푐) is a match between an incoming homogeneous sub-array and an outgoing homogeneous sub-array of equal length.

Therefore, the former sub-array will be placed correctly in its country, so no edge needs to be added to the graph to represent a correctly placed group of records. If 푎 = 푐, then the second sub-array was also placed in its correct country. If 푎 ̸= 푐, then a single edge from 푎 to 푐 suffices to represent the records, since they form a homogeneous sub-array residing in a single country, i.e., a region.

The 2-path finding algorithm can be defined using the same framework described earlier.

1. Planning: finding all 2-paths with a specific broker 푏.

2. Executing: swapping the two homogeneous sub-arrays of every 2-path in parallel.

3. Graph Update:

(a) Deleting used edges: delete all edges connected to node 푏.

(b) Adding new edges: for each 2-path, add a new edge, if necessary, to represent the homogeneous sub-array that was swapped but was not put into its correct country.

Figure 3-4 shows a demonstration of the 2-path finding algorithm on an example array.

The regions graph data structure for 2-path finding is slightly different from the one used for cycle finding. The 2-path algorithm is always interested in all edges connected to a given node. Once the node is processed and the corresponding country gets sorted, all edges connected to it are destroyed. This also implies that no node processed later will contain an edge connected to that node. Let 푟푎푛푘푖 be the order in which node 푖 is processed. For each node 푖, the data structure maintains two vectors of edges, one for outgoing edges and one for incoming edges. However, each node stores only the edges that connect it to nodes with a higher rank than itself. Thus, gathering all edges connected to a given node is very efficient, since such edges are stored in two contiguous arrays, which also increases data locality. A minimal sketch of this layout is given after Figure 3-4.


Figure 3-4: An illustration of how 2-paths incident on a broker are processed in parallel. The current broker being processed is circled in bold. At each step, the corresponding swap for each 2-path is shown in the array above the regions graph using the same color as the 2-path. In (b), the algorithm processes the single 2-path with vertex 0 as broker, which is (1, 0, 1), (0, 2, 1), and creates the dashed edge (1, 2, 1), resulting in (d). In (d), Regions sort processes all three 2-paths with vertex 1 as broker in parallel. The three 2-paths are (1, 2, 1), (2, 1, 2), which does not add any new edges; (3, 1, 1), (1, 2, 1), which creates the dashed edge (3, 2, 1); and (2, 1, 1), (1, 3, 1), which creates the dashed edge (2, 3, 1). This results in (f). Finally, in (f), the algorithm processes the single 2-path with vertex 2 as broker, which is (2, 3, 1), (3, 2, 1), and is left with an empty graph in (h), at which point the array is sorted.

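A minimal sketch of this rank-based layout, under the same illustrative Edge type as the earlier sketches:

#include <cstddef>
#include <vector>

// An edge is stored only at its lower-ranked endpoint, in the in- or
// out-vector matching its direction; rank holds the assumed processing
// order of the nodes.
struct TwoPathGraph {
  std::vector<std::vector<Edge>> in, out;  // Edge as in the earlier sketch
  std::vector<size_t> rank;                // processing order of each node

  explicit TwoPathGraph(size_t B) : in(B), out(B), rank(B, 0) {}

  void add_edge(size_t src, size_t dst, const Edge& e) {
    if (rank[src] < rank[dst]) out[src].push_back(e);  // src processed first
    else                       in[dst].push_back(e);   // dst processed first
  }

  // After country v is sorted, both of its contiguous lists are destroyed;
  // no later node stores an edge to v, so nothing else needs scanning.
  void destroy_node(size_t v) { in[v].clear(); out[v].clear(); }
};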

Planning and graph updating can be implemented either serially or in parallel. The parallel implementation gives better theoretical guarantees, while the serial implementation is more efficient in practice for smaller graphs. The serial implementation of planning is an efficient two-finger merge, with the prefix sums calculated implicitly in an on-line fashion. The algorithm starts with a counter 푐, representing the number of matched records, set to 0, and two pointers, one on an incoming edge and one on an outgoing edge. Iteratively, the prefix sums of the two edges are calculated. The maximum number of records that can be matched equals the smaller of the two edges' remaining weights, 푤푚푖푛. Both edges are updated to reflect that they lost 푤푚푖푛 records, which means at least one of them has weight 0 and can be skipped. Therefore, this algorithm takes time linear in the number of edges connected to the node, as in the sketch below.
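The following is a minimal C++ rendering of this two-finger pass (the matched-records counter is implicit in the consumed weights); the REdge and TwoPath layouts are illustrative, with each edge's other field holding its non-broker endpoint.

#include <algorithm>
#include <cstddef>
#include <vector>

struct REdge { size_t other, weight, start; };        // illustrative layout
struct TwoPath { size_t s1, s2, w, src, broker, dst; };

std::vector<TwoPath> plan_2paths(std::vector<REdge>& in,   // edges into b
                                 std::vector<REdge>& out,  // edges out of b
                                 size_t b) {
  std::vector<TwoPath> paths;
  size_t i = 0, o = 0;
  // By Lemma 2 the two weight sums are equal, so both lists are exhausted
  // together; each step consumes at least one edge, giving linear time.
  while (i < in.size() && o < out.size()) {
    size_t w = std::min(in[i].weight, out[o].weight);  // maximal match
    paths.push_back({in[i].start, out[o].start, w,
                     in[i].other, b, out[o].other});
    in[i].weight -= w;  in[i].start += w;              // consume w records
    out[o].weight -= w; out[o].start += w;
    if (in[i].weight == 0) i++;
    if (out[o].weight == 0) o++;
  }
  return paths;
}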

The parallel implementation of planning is work-efficient and gives a span poly-logarithmic in the number of edges connected to the node; it is obtained by parallelizing each step independently. A parallel prefix sum stores its result in an auxiliary array of size linear in the number of edges. Then, a parallel merge is sufficient to simulate the two-finger merge in a divide-and-conquer manner.

Parallelism is essential in the execution stage to guarantee a sub-linear span, and it can happen in two different ways. The first way is within the swap of a single pair of homogeneous sub-arrays. The second way is across different 2-paths, since all 2-paths are independent. This is because, by construction, all records in the incoming edges are matched only once and all records in the outgoing edges are matched only once. In addition, the graph does not allow self-edges, and a record can belong to either an incoming edge or an outgoing edge, but not both.

In the graph update stage, edges corresponding to swapped regions are deleted, and edges corresponding to newly created regions are added. In the implementation presented earlier, deleting old edges requires only the destruction of the two edge lists corresponding to the processed node, which takes constant time. The serial implementation of adding edges is efficient, since appending a new edge to an edge list takes constant time. Care is needed in the parallel implementation to avoid write races on the edge lists. Therefore, all edges to be added are first stored in an auxiliary array. Next, these edges are sorted according to the edge list they will be stored in, which is their lower-ranked endpoint. Since ranks range from 0 to 퐵 − 1 and the total number of edges is sub-linear, a parallel out-of-place radix sort is efficient here. For each edge list, space is allocated at its end equal to the number of edges to be added to it, and each new edge is given an index within the added edges. Therefore, each edge can be added to the correct edge list at the correct index in parallel without any races.

3.5 Recursion

After sorting records according to one radix, Regions sort recurses on each country independently. Regions sort has to decide the number of blocks to use when recursing on the smaller sub-problems at the next recursion level. Intuitively, it is beneficial to have blocks of equal sizes, as this gives better load balance in parallel. So Regions sort preserves the approximate ratio between the number of blocks in a sub-array and the number of records inside that sub-array. That is, for a country of size 푁, Regions sort assigns 퐾′ blocks to it such that 퐾′ = ⌈푁 · 퐾/푛⌉. For example, with 푛 = 10⁹, 퐾 = 1000, and a country of size 푁 = 10⁷, the recursive call uses 퐾′ = ⌈10⁷ · 1000/10⁹⌉ = 10 blocks. In section 4.4, I prove that this assignment does not blow up the work or space of the algorithm.

Chapter 4

Analysis

In this chapter, I analyze the algorithm described in the previous chapter by proving theoretical bounds on work, span, and space. I assume a work-stealing scheduler like the Cilk scheduler [8]. Furthermore, I assume 푛 = Ω(퐾퐵); otherwise, the input is padded to have size Ω(퐾퐵). In sections 4.1–4.3, I start by analyzing these bounds for one subproblem of size 푁 using 퐾′ = ⌈푁 · 퐾/푛⌉ blocks. In section 4.4, I proceed to analyze these bounds over all recursion levels.

4.1 Local Sorting

In the local sorting phase, the serial radix sort is applied to 퐾′ independent blocks, each of size 푂(푁/퐾′). Note that 퐾′ is at least 푁 · 퐾/푛; therefore, the size of each block is 푂(푛/퐾). Since serial radix sort does work linear in the size of its input, the total work done is 푂(푁). Since all of these blocks are independent, the span of the local sorting phase is 푂(푛/퐾). Each processor needs to store at least one pointer to keep track of the start of its block. In order to build the regions graph in the following phase, each block needs to keep track of the homogeneous sub-arrays inside it by storing a pointer to the start of each sub-array along with its size. Thus, 푂(log 푛) space is needed to represent each homogeneous sub-array. By construction of the serial radix sort algorithm, each block contains at most 퐵 homogeneous sub-arrays. Therefore, the total number of sub-arrays is at most 퐾′퐵, and the total space needed in the local sorting phase is 푂(퐾′퐵 log 푛).

4.2 Graph Construction

All regions graph implementations require linear work in the number of edges and nodes. There exist exactly 퐵 nodes, and by Lemma 1, there exist 푂(퐾′퐵) edges for a subproblem using 퐾′ blocks. Thus, a regions graph requires 푂(퐾′퐵) space. In order to calculate the total space needed for regions graphs across one recursion level, it is important to bound the number of blocks per recursion level (this is done in section 4.4). Since each edge stores at least one pointer, the total space needed for the graph construction phase is 푂(퐾′퐵 log 푛). It is possible to build the graph in parallel by adding edges concurrently, which leads to a span of 푂(log(퐾′퐵)).

4.3 Global Sorting

In this section, I analyze the two global sorting approaches separately.

4.3.1 Cycle Finding

The cycle finding algorithm repeatedly finds a cycle, executes it, and updates the graph. The following lemma proves that the total number of cycles is sub-linear.

Lemma 5. The number of cycles is 푂(퐾′퐵).

Proof. After detecting a cycle with minimum edge weight 푤푚푖푛, 푤푚푖푛 is subtracted from each edge on the cycle. By definition, at least one edge in the cycle has weight equal to 푤푚푖푛. Thus, this edge's weight falls to zero, and the edge is deleted from the graph. Therefore, the number of cycles cannot exceed the number of edges. By Lemma 1, the total number of cycles is 푂(퐾′퐵).

In both the planning and graph update stages, the algorithm performs constant work per edge in the cycle. Therefore, the total work done in the two stages is proportional to the sum of the lengths of all cycles. Lemmas 3 and 5 show that the total work done in finding cycles is 푂(퐾′퐵²). However, this bound is not tight and can be misleading. Indeed, a tighter bound is 푂(min(퐾′퐵², 푁)), since each edge in each planned cycle must have a non-zero weight, so the total number of edges over all cycles cannot be larger than the total number of misplaced records. Consequently, planning and graph updates never blow up the work. However, the drawback of the cycle finding algorithm is that its span is the same as its work, since cycle finding is done serially. In the execution stage, each cycle needs 푂(퐵) span, since each edge can be perfectly parallelized, giving 푂(min(퐾′퐵², 푁)) span in total. The execution stage needs 푂(푁) work in the case where a linear number of records remain unsorted. Since one cycle is processed at a time, only one cycle needs to be stored, and storing a cycle requires 푂(log 푛) space per edge. In sum, the cycle finding algorithm needs 푂(퐵 log 푛) space, 푂(min(퐾′퐵², 푁)) span, and 푂(푁) work.

4.3.2 2-path finding

The 2-path finding algorithm repeatedly picks a country to sort, finds all 2-paths needed to sort the country completely, updates the graph, and then moves to another node. I first analyze the complexity of sorting one country by extracting all 2-paths passing through the corresponding node, and then I sum over all nodes. By definition of the 2-path finding algorithm, the number of edges connected to a single node changes over time, since some edges are added and others are removed. In this analysis, I use 퐸푖 to refer to the number of edges connected to node 푖 at the time node 푖 is processed by the global sorting algorithm, and 퐻푖 to refer to the sum of their weights.

The planning phase requires a prefix sum and a merge. Both the parallel and serial implementations require 푂(퐸푖) work. The space required to store the 2-paths is 푂(퐸푖 log 푛), since each path needs to store at least one pointer into the array. The span of the parallel version is 푂(log 퐸푖).

The work of the execution stage is 푂(퐸푖 + 퐻푖), where 퐻푖 is the total number of records that belong to country 푖 but are misplaced. The span needed is 푂(1), since all 퐻푖 misplaced records can be swapped independently and all 퐸푖 2-paths are executed independently. The space needed is just one pointer per processor, which is 푂(푃 log 푛).

The work done by updating the graph after processing node 푖 is 푂(퐸푖), since each 2-path can add at most one edge to the graph by Lemma 4. Since the edge lists corresponding to node 푖 are destroyed after processing node 푖, no additional space is needed, as the lemma also shows that the number of edges deleted is at least the number of edges added to the graph. The parallel algorithm for adding edges requires only a parallel out-of-place radix sort before adding the edges. Therefore, it needs 푂(퐸푖) additional space.

Therefore, processing a single node requires 푂(퐸푖 + 퐻푖) work, 푂(log 퐸푖) span, and 푂(퐸푖) additional space. Summing over all nodes gives the total work and span. For space, however, taking the maximum is sufficient, since no memory is carried over from one global sorting iteration to another.

Lemma 6. After processing a node, the number of edges in the graph does not increase.

Proof. By construction, a 2-path corresponds to a difference between two consecutive values in the array resulting from merging the incoming prefix sums and the outgoing prefix sums. The number of prefix sums cannot exceed 퐸푖 + 2, and the number of distinct prefix sums cannot be more than 퐸푖, since the values 0 and 퐻푖 appear in both the incoming prefix sums and the outgoing prefix sums. Therefore, the number of 2-paths, and hence the number of added edges (Lemma 4), cannot exceed 퐸푖. On the other hand, after processing a node, all 퐸푖 edges originally connected to it are deleted from the graph. Therefore, the number of edges cannot increase after processing a node.

It is important to note that Lemma 6 implies that the total number of edges in the graph cannot increase. Therefore, 퐸푖 is 푂(퐾′퐵). Note that, since the graph is dynamic, Σ_{푖=0}^{퐵−1} 퐸푖 is Ω(퐾′퐵), not 푂(퐾′퐵). A naive summation shows that Σ_{푖=0}^{퐵−1} 퐸푖 is 푂(퐾′퐵^2). Nevertheless, 퐸푖 ≤ 퐻푖, since each edge in 퐸푖 has non-zero weight. Therefore, Σ_{푖=0}^{퐵−1} 퐸푖 is 푂(min(퐾′퐵^2, 푁)). This result is important since it means that the work done in global sorting is never more than the work done by serial radix sort. In conclusion, the 2-path finding algorithm requires 푂(푁) work, 푂(퐵 log(퐾′퐵)) span, and 푂(퐾′퐵) auxiliary space.
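Stated as one chain, using Σ_{푖} 퐻푖 ≤ 푁 (each misplaced record is counted once) and 퐸푖 = 푂(퐾′퐵) over the 퐵 processed nodes:

    \[
      \sum_{i=0}^{B-1} E_i \;\le\; \min\Bigl(B \cdot O(K'B),\; \sum_{i=0}^{B-1} H_i\Bigr) \;=\; O(\min(K'B^2,\, N)).
    \]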

4.4 Overall algorithm

It is important to note that the total number of blocks per recursion level may be more than 퐾, even though the algorithm operates on all 푛 records at each recursion level. This is a by-product of the ceiling function used to calculate 퐾′. However, in this section I prove that this increase does not negatively impact the space and work of the algorithm.
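As a small illustration (a sketch, not the thesis code), 퐾′ for a subproblem of size 푁 could be computed as follows; the ceiling is what guarantees every nonempty subproblem at least one block, and hence what lets the per-level block count exceed 퐾:

    #include <cstddef>

    // K' = ceil(N * K / n), where K and n are the top-level parameters.
    std::size_t num_blocks(std::size_t N, std::size_t K, std::size_t n) {
      return (N * K + n - 1) / n;
    }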

Lemma 7. The auxiliary space required by local sorting, graph construction, and global sorting across all recursive calls is 푂((퐾 + 푃)퐵 log 푛).

Proof. Using a work-stealing scheduler, such as Cilk's [8], at most 푃 tasks can be executed in parallel using 푃 processors. Let processor 푖 be working on a subproblem of size 푁푖 with 퐾′_푖 blocks. The space needed by processor 푖 is therefore 푂(퐾′_푖 퐵 log 푛) = 푂(⌈푁푖 · 퐾/푛⌉퐵 log 푛). Since recursion happens only after completing the sort and freeing the auxiliary space needed, the processors must be operating on non-intersecting sub-arrays. Therefore, Σ_{푖=1}^{푃} 푁푖 = 푂(푛), and the total space needed by the 푃 processors is 푂(Σ_{푖=1}^{푃} ⌈푁푖 · 퐾/푛⌉퐵 log 푛) = 푂(퐵 log 푛 · Σ_{푖=1}^{푃} (푁푖 · 퐾/푛 + 1)) = 푂((퐾 + 푃)퐵 log 푛).

Lemma 8. The total work done by local sorting, regions graph construction, and global sorting is 푂((푛 + 퐾퐵)(log_퐵 푟 + log 퐵)).

Proof. Each subproblem has 퐾′ = ⌈푁 · 퐾/푛⌉ for a problem of size 푁. There are two possibilities: 퐾′ = 1 and 퐾′ ≠ 1. If 퐾′ = 1, then the total work done to sort the records is 푂(푁 + 퐵) for radix sort or 푂(푁 log 퐵) for comparison sort. It is beneficial to choose the former if 푁 = Ω(퐵), and the latter otherwise. After sorting the records, there is no need to build a regions graph or to globally sort, since the array has a single block. If 퐾′ ≠ 1, then ⌈푁 · 퐾/푛⌉ = 푂(푁 · 퐾/푛). Summing over all subproblems with 퐾′ > 1 amounts to 푂((퐾퐵 + 푛) log_퐵 푟). Summing both results gives 푂((푛 + 퐾퐵)(log_퐵 푟 + log 퐵)).

Theorem 9. Using the cycle finding algorithm for global sorting, the algorithm requires 푂((푛 + 퐾퐵)(log_퐵 푟 + log 퐵)) work, 푂((푛/퐾 + 퐾퐵^2) log_퐵 푟) span, and 푂((푃퐵 log_퐵 푟 + 퐾퐵) log 푛) auxiliary space.

Theorem 10. Using the 2-path finding algorithm for global sorting, with parallel planning and graph updates, the algorithm requires 푂((푛 + 퐾퐵)(log_퐵 푟 + log 퐵)) work, 푂((푛/퐾 + 퐵(log(퐾퐵) + 퐵)) log_퐵 푟) span, and 푂((푃퐵 log_퐵 푟 + 퐾퐵) log 푛) auxiliary space.

The value of 퐾 trades off span and space. When setting 퐾 to Θ(푃) and 퐵 to a constant, the 2-path algorithm requires 푂(푛 log 푟) work, 푂((푛/푃 + log 푃) log 푟) span, and 푂(푃 log 푟 log 푛) auxiliary space. The algorithm scales linearly with the number of processors up to 푃 = 푂(푛/ log 푛). However, it is necessary that 푃 = 표(푛/(log 푛 log 푟)) for the algorithm to be in-place. The following theorem summarizes these bounds.

Theorem 11. Using the 2-path finding algorithm for global sorting with 퐾 = Θ(푃) and 퐵 = Θ(1), Regions sort takes 푂(푛 log 푟) work, 푂((푛/푃 + log 푃) log 푟) span, and 푂(푃 log 푟 log 푛) auxiliary space.

Chapter 5

Optimizations

The following optimizations are included in the implementations of the two variants of Regions sort. All of the listed optimizations improve performance in practice; nevertheless, some of them have a negative effect on the theoretical guarantees. I describe each optimization and discuss its effect on performance and on the theoretical guarantees of work, span, and space, if any.

5.1 Batching Cycles

Instead of planning, executing, and updating one cycle at a time, multiple cycles can be batched together. Specifically, when sorting a node, all cycles connected to it are processed, executed, and updated at the same time. This increases parallelism in the execution of cycles. On the other hand, it requires storing a pointer corresponding to each edge in each cycle, which totals 푂(퐾퐵^2 log 푛) additional space.

5.2 Prioritizing 2-cycles

A 2-cycle is a 2-path whose edges form a cycle of length 2. When executing a 2-path that is not a 2-cycle, only one sub-array will be sorted and the other will not; therefore, an edge is added to represent the homogeneous sub-array that still needs to be sorted. When executing a 2-cycle, both sub-arrays involved in the swap will be put in the correct country. Therefore, the 2-path finding algorithm prioritizes 2-cycles. Finding 2-cycles can be done by sorting the incoming and outgoing edge arrays according to the node at the other end of the edges. Figure 5-1 shows how prioritizing 2-cycles works; comparing Figure 5-1 with Figure 3-4 illustrates that prioritizing 2-cycles helps work and span in practice.

Figure 5-1: An illustration of how 2-paths incident on a broker are processed in parallel. The current broker being processed is circled in bold. At each step, the corresponding swap for each 2-path is shown in the array above the regions graph using the same color as the 2-path. In (b), the algorithm processes the single 2-path with vertex 0 as broker, which is (1, 0, 1), (0, 2, 1), and creates the dashed edge (1, 2, 1), resulting in (d). In (d), Regions sort processes all three 2-paths with vertex 1 as broker in parallel. The three 2-paths are (1, 2, 1), (2, 1, 2); (3, 1, 1), (1, 3, 1); and (2, 1, 1), (1, 2, 1), each of which does not add any new edges. This results in (e), at which point the array is sorted and the algorithm is left with an empty graph in (f).
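A minimal sketch of the matching step described above is shown below; the Edge record is the same hypothetical one as in the planning sketch, and plan_two_cycle is a hypothetical helper.

    #include <algorithm>
    #include <vector>

    struct Edge { int other; long weight; };  // as in the planning sketch

    void plan_two_cycle(int other, long w);   // hypothetical: schedule the swap

    // Sort both edge lists of the broker by the node at the far end, then walk
    // them together; an incoming and an outgoing edge meeting at the same far
    // node form a 2-cycle, and min(w_in, w_out) records can be swapped so that
    // both sides land directly in their correct countries.
    void find_two_cycles(std::vector<Edge>& in, std::vector<Edge>& out) {
      auto by_other = [](const Edge& a, const Edge& b) { return a.other < b.other; };
      std::sort(in.begin(), in.end(), by_other);
      std::sort(out.begin(), out.end(), by_other);
      size_t i = 0, j = 0;
      while (i < in.size() && j < out.size()) {
        if (in[i].other < out[j].other) { i++; continue; }
        if (out[j].other < in[i].other) { j++; continue; }
        long w = std::min(in[i].weight, out[j].weight);
        plan_two_cycle(in[i].other, w);
        if ((in[i].weight -= w) == 0) i++;
        if ((out[j].weight -= w) == 0) j++;
      }
    }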

5.3 Early Recursion

By definition, once a country is sorted, it does not contain any elements that do not belong to it. Thus, it is possible to recurse on it independently, before sorting the rest of the countries. This optimization increases parallelism in the algorithm, especially when coupled with the country sorting optimization. It multiplies the space bound proved in Lemma 7 by 푂(log_퐵 푟), because each of the 푃 active tasks can have up to log_퐵 푟 ancestors. Therefore, the total auxiliary space usage after this optimization is 푂((푃 + 퐾)퐵 log_퐵 푟 log 푛).

5.4 Country Sorting

The global sorting algorithm is correct regardless of the order in which nodes are processed. Processing the nodes corresponding to the countries with the largest numbers of elements first is beneficial when the early recursion optimization is activated, since some threads can recurse on those bigger countries while other threads continue processing smaller countries. This optimization is particularly useful with biased distributions, since the largest of the 퐵 countries is then a big portion of the array. Sorting the 퐵 countries requires Θ(퐵 log 퐵) work using a comparison sort. To maintain the same theoretical guarantees, country sorting could be applied only when the problem size is Ω(퐵).
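A minimal sketch of this ordering step, assuming a count array holding the number of records per country:

    #include <algorithm>
    #include <numeric>
    #include <vector>

    // Order the B countries by decreasing size so that the biggest countries
    // are processed (and can recurse) first; Theta(B log B) comparison-sort
    // work, as noted above.
    std::vector<int> country_order(const std::vector<long>& count) {
      std::vector<int> order(count.size());
      std::iota(order.begin(), order.end(), 0);
      std::sort(order.begin(), order.end(),
                [&](int a, int b) { return count[a] > count[b]; });
      return order;
    }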

5.5 Coarsening

It is beneficial to switch to serial implementations when problem sizes are small, to avoid the overhead of parallelism and false sharing, and for better cache locality. When 퐾′ = 1 or 푁 ≤ 20000, serial Ska Sort, a highly optimized in-place radix sort, is used [33]. Furthermore, when 푁 is less than 32, insertion sort is used, to avoid recursing on small subproblems and to avoid doing work proportional to 퐵 when the number of records is significantly less than 퐵. In addition, when a parallel for loop makes fewer than 4000 iterations, a serial for loop is used instead. Furthermore, the implementation contains many for loops that operate on consecutive memory; one example is the for loop that swaps two sub-arrays. These for loops are coarsened such that every few records are processed by the same thread. This can, of course, decrease parallelism; however, it proved to be advantageous due to increased data locality and reduced false sharing.
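The base-case dispatch implied by these thresholds looks roughly as follows (a sketch; the helper names are hypothetical stand-ins for the actual routines):

    #include <cstddef>

    struct Record;                                  // record type as elsewhere
    void insertion_sort(Record*, std::size_t);      // hypothetical helpers
    void ska_sort(Record*, std::size_t);            // serial Ska Sort [33]
    void regions_sort_level(Record*, std::size_t);  // parallel path

    void sort_dispatch(Record* A, std::size_t N, std::size_t K_prime) {
      if (N < 32)
        insertion_sort(A, N);             // avoid work proportional to B
      else if (K_prime == 1 || N <= 20000)
        ska_sort(A, N);                   // serial in-place base case
      else
        regions_sort_level(A, N);
    }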

Chapter 6

Evaluation

6.1 Experiment Setup

6.1.1 Experiment Environment

All test cases are run on an AWS EC2 c5.18xlarge instance with two 18-core processors with hyper-threading (Intel Xeon Platinum 8124M at 3.00 GHz with a 24.75 MB L3 cache), for a total of 36 cores and 72 hyper-threads. The system has a total of 144 GB of RAM. The code is compiled using g++ version 7.3.1 (which has support for Cilk Plus [22]); Cilk Plus has a provably-efficient work-stealing scheduler [8]. In addition, the -O3 flag was used.

6.1.2 Implementation Details

The code considers 푑 bits at each level of recursion, such that 퐵 = 2^푑. 푑 and 퐵 are known at compile time, which allows a number of arrays to be allocated statically. 퐵 is set to 2^8 = 256 in all of the tests, except the test where 퐵 is the independent variable. Regions sort sets 퐾 = 5000 when using the 2-path finding algorithm and 퐾 = 푃 when using the cycle finding algorithm, except in the test where 퐾 is the independent variable. This distinction is made to optimize each algorithm. For the 2-path finding algorithm, increasing 퐾 improves locality, since each sub-array can fit in cache. For the cycle finding algorithm, 퐾 is set to the minimum value that allows for linear speedup; this is because as 퐾 increases, the number of edges in the graph increases and the weight on each of them decreases, and parallelism in the cycle finding algorithm is bottlenecked by the minimum edge weight on each cycle.

It was found that the serial implementation of planning and graph updates for the 2-path finding algorithm is more efficient than the parallel implementation. This can be attributed to the number of edges being small: the overhead of parallelism is more expensive than the efficient serial implementation. Therefore, the results in this section reflect only this implementation.
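For illustration, a sketch of how fixing 푑 at compile time allows static allocation (the type names are hypothetical):

    #include <cstddef>

    template <int d>
    struct LevelState {
      static constexpr std::size_t B = std::size_t{1} << d;  // number of countries
      std::size_t counts[B];    // per-country counts, allocated statically
      std::size_t offsets[B];   // per-country offsets, allocated statically
    };

    using DefaultLevel = LevelState<8>;  // B = 2^8 = 256, as in the experiments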

6.1.3 Test Inputs

The datasets used include inputs of only integer keys as well as inputs of integer key-value pairs. Tests include both 4-byte and 8-byte keys/values (inputs with larger values can be sorted by storing a pointer to the value). The inputs also cover different distributions. The test files contain inputs with a uniform distribution, where all inputs are equally likely. To test skewed inputs, the test files include inputs with a Zipfian distribution, where the skew is controlled by the 휃 parameter [14]; a generation sketch for these inputs follows the list below. The Zipfian distribution was previously used to test the efficiency of parallel algorithms like PARADIS [10] and RADULS [19, 20]. The inputs encompass multiple values of 휃: 0.25, 0.5, 0.75, and 1.00. The test files include some degenerate inputs as well. The types of inputs used in the datasets are as follows:

1. unif(푥, 푦): represents an input drawn uniformly from range 푥, with length 푦.

2. zipf_휃(푥, 푦): represents an input of length 푦 drawn from a Zipfian distribution of range 푥 with parameter 휃.

3. allEqual(10^9): represents an input of length 10^9 with all records equal.

4. √푛Equal(10^9): represents an input with 10^(9/2) copies of 10^(9/2) distinct values placed equally apart.

5. sorted(10^9): represents a sorted input of the integers from 1 to 10^9.

6. almostSorted(10^9): represents an almost-sorted input of length 10^9 with 10^(9/2) random elements out of place.
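For illustration, here is a minimal sketch of generating zipf_휃(푥, 푦) via an inverse-CDF table, where key 푘 (1-based) has probability proportional to 1/푘^휃; this table-based version is only practical for modest ranges 푥, whereas the actual inputs follow the generator of Gray et al. [14]:

    #include <algorithm>
    #include <cmath>
    #include <cstdint>
    #include <random>
    #include <vector>

    std::vector<uint64_t> zipf_input(uint64_t x, uint64_t y, double theta) {
      std::vector<double> cdf(x);   // cumulative (unnormalized) probabilities
      double sum = 0.0;
      for (uint64_t k = 0; k < x; k++)
        cdf[k] = (sum += 1.0 / std::pow(static_cast<double>(k + 1), theta));
      std::mt19937_64 rng(42);
      std::uniform_real_distribution<double> U(0.0, sum);
      std::vector<uint64_t> a(y);
      for (auto& v : a)  // invert the CDF with a binary search per draw
        v = std::lower_bound(cdf.begin(), cdf.end(), U(rng)) - cdf.begin();
      return a;          // values are 0-based key indices in [0, x)
    }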

6.2 Tested Algorithms

The only existing parallel in-place radix sorting algorithm is PARADIS [10]. However, its code is not publicly available online, and it was not possible to obtain it from the authors. Therefore, I present a rough comparison of the two algorithms' performances while accounting for the difference in testing environments. In addition, I compare with RADULS [19, 20], a parallel out-of-place radix sort, whose authors also present a rough comparison with PARADIS [10] showing RADULS's superiority. Therefore, performing better than RADULS indirectly implies superiority over PARADIS as well.

Aside from RADULS [19, 20], the implementations tested include: the state-of-the-art serial in-place radix sorting algorithm Ska Sort [33], the recent parallel in-place sample sort IPS4o by Axtmann et al. [2], a parallel out-of-place radix sort from the Problem Based Benchmark Suite (PBBS) [31], a parallel out-of-place sample sort from PBBS [5], and a parallel out-of-place comparison sorting algorithm from MCSTL [32].

6.3 Results

6.3.1 Results Overview

Tables 6.1 and 6.2 show the single-thread times, 72-thread times, and parallel speedups of all of the algorithms on a collection of inputs of length 10^9. Looking at the tables, the 2-path finding variant always outperforms the cycle finding variant, except on the allEqual(10^9) input. However, this case is misleading: for this input all records are equal, so all regions graphs are empty, and both implementations therefore perform exactly the same steps, except that the value of 퐾 is bigger for the 2-path finding algorithm.

Input                        | Ska Sort |  2-path radix sort   |  Cycle radix sort    |        IPS4o          |       RADULS
                             |   푇1    |  푇1    푇36ℎ    SU   |  푇1    푇36ℎ    SU   |  푇1     푇36ℎ    SU   |  푇1    푇36ℎ    SU
unif(10^9, 10^9)             |  23.02  | 14.19  0.49*  28.96  | 19.48   0.72  27.06  | 33.52    0.93  36.04  | 27.79   1.79  15.53
unif(10^9, 10^9)-pairs       |  34.59  | 20.05  0.84*  23.87  | 27.66   1.31  21.11  | 139.44   3.01  46.33  | 41.11   3.49  11.78
unif(2^63, 10^9)             |  38.12  | 32.27   1.00  32.27  | 92.12   1.42  64.87  | 35.99    1.15  31.30  | 15.15  0.78*  19.42
unif(2^63, 10^9)-pairs       |  54.91  | 40.05  1.78*  22.50  | 119.80  2.37  50.55  | 144.22   3.32  43.44  | 23.12   2.05  11.28
zipf_0.75(10^9, 10^9)        |  22.02  | 14.81  0.55*  26.93  | 20.05   0.70  28.64  | 29.66    0.82  36.17  | 28.86   1.93  14.95
zipf_0.75(10^9, 10^9)-pairs  |  33.04  | 20.61  0.98*  21.03  | 29.32   1.32  22.21  | 139.30   3.03  45.97  | 43.81   3.64  12.04
allEqual(10^9)               |   6.60  |  6.77  0.20*  33.85  |  6.63   0.22  30.14  |  2.32    0.29   8.00  | 30.29   1.87  16.20
√푛Equal(10^9)                |  15.07  | 11.43   0.43  26.58  | 12.56   0.66  19.03  | 13.71   0.42*  32.64  | 28.47   2.28  12.49
sorted(10^9)                 |  14.54  | 14.51  0.31*  46.81  | 16.36   0.38  43.05  | 24.60    0.62  39.68  | 30.25   1.67  18.11
almostSorted(10^9)           |  18.47  | 16.41  0.36*  45.58  | 17.30   0.39  44.36  | 25.16    0.62  40.58  | 28.55   2.27  12.58

Table 6.1: Serial times (푇1) and parallel (푇36ℎ) times (seconds), as well as the parallel speedup (SU), on a 36-core machine with two-way hyper-threading. The fastest parallel time for each input is marked with an asterisk.

Input                        |   PBBS radix sort    |   PBBS sample sort   |      MCSTL sort
                             |  푇1    푇36ℎ    SU   |  푇1    푇36ℎ    SU   |  푇1     푇36ℎ    SU
unif(10^9, 10^9)             | 14.40   0.62  23.23  | 81.41   1.81  44.98  | 100.36   2.54  39.51
unif(10^9, 10^9)-pairs       | 28.00   1.23  22.76  | 92.16   2.34  39.38  | 116.94   4.27  27.39
unif(2^63, 10^9)             | 42.60   1.50  28.40  | 136.00  2.93  46.41  | 103.12   4.62  22.32
unif(2^63, 10^9)-pairs       | 59.60   2.96  20.14  | 148.00  3.31  44.71  | 119.50   7.71  15.50
zipf_0.75(10^9, 10^9)        | 14.70   0.68  21.62  | 77.57   1.74  44.58  | 94.52    2.54  37.21
zipf_0.75(10^9, 10^9)-pairs  | 36.50   1.33  27.44  | 92.71   2.57  36.07  | 120.09   4.58  26.22
allEqual(10^9)               | 17.70   0.88  20.11  | 22.84   0.77  29.66  | 13.65    0.84  16.25
√푛Equal(10^9)                | 16.70   0.60  27.83  | 39.22   1.36  28.84  | 30.85    1.23  25.08
sorted(10^9)                 | 14.54   0.60  24.50  | 42.20   1.25  33.76  | 19.65    0.94  20.90
almostSorted(10^9)           | 16.60   0.60  27.67  | 43.70   1.25  34.96  | 22.62    2.62   8.63

Table 6.2: Serial times (푇1) and parallel (푇36ℎ) times (seconds), as well as the parallel speedup (SU), on a 36-core machine with two-way hyper-threading. This table is a continuation of Table 6.1, and results in both tables are comparable.

Therefore, the result on this input does not reflect a superiority of cycle finding over 2-path finding, but rather the convenience of having a small 퐾 when there is no sorting work to be done. In such a case, a smaller 퐾 leads to less work in calculating the country sizes.

The two variants of Regions sort achieve parallel speedups of between 19x and 65x on 36 cores with hyper-threading. The scalability bottleneck is data movement, which has less than perfect scalability when run in parallel due to memory system bottlenecks.

Ska Sort is used as a base case for sub-arrays that are not big enough for parallelism but not small enough to be sorted with insertion sort. However, a single-threaded execution of the 2-path finding algorithm is usually faster than the highly-optimized serial Ska Sort. Although the algorithm incurs more work to build the regions graph, it outperforms the original Ska Sort due to better cache locality. Cache locality is well exploited in the local sorting phase, since the array is divided into sub-arrays of smaller sizes that fit in cache. The 2-path finding algorithm also has good data locality from swapping pairs of contiguous sub-arrays at a time.

IPS4o is the fastest publicly-available parallel in-place sorting algorithm, and Regions sort achieves a 1.1–3.6x speedup over it in parallel across all inputs, except for √푛Equal(10^9), on which Regions sort is about 1.02x slower. IPS4o, described in more detail in Chapter 2, is a comparison sorting algorithm and is therefore more general than radix sort, but it has different theoretical bounds because it incurs more work to decide which bucket each key belongs to. The two variants of Regions sort, as well as IPS4o, use a negligible amount of auxiliary memory compared to the original input size (less than 5%). PBBS sample sort and MCSTL sort are consistently slower than IPS4o and Regions sort, and require auxiliary space proportional to the input size. Again, PBBS sample sort and MCSTL sort are comparison sorting algorithms, which are more general. Compared to PBBS radix sort, which is a parallel out-of-place radix sorting algorithm, the 2-path variant of Regions sort is 1.22–4.3x faster in parallel due to having a lower memory footprint and less data movement. Compared to RADULS, another out-of-place radix sorting algorithm, the presented algorithm is between 1.3x slower and 9.4x faster. 2-path finding is only slightly slower than RADULS on the inputs with 8-byte key ranges, which might be due to RADULS being optimized for very large key ranges.

Since the PARADIS code is not publicly available, I present a rough comparison based on the numbers reported in their paper [10]. On an input of 8-byte key-value pairs of size 10^9, PARADIS reports a running time of approximately 6 seconds using 16 cores of two Intel Xeon E7-8837 processors running at 2.67 GHz. On 16 cores without hyper-threading, the 2-path variant of Regions sort takes 3 seconds on unif(2^63, 10^9). I do not know the range of the keys PARADIS used, so these experiments used a very large key range. I believe that even accounting for differences in machine specifications, Regions sort would still be faster. In addition, Regions sort outperforms RADULS, whose authors also claim to beat PARADIS according to similar calculations. Furthermore, Regions sort has much stronger guarantees on work and span than PARADIS.

6.3.2 Parallel Scalability

Figures 6-1 and 6-2 show the relative performance of the algorithms as a function of thread count for unif(10^9, 10^9) and zipf_0.75(10^9, 10^9), respectively. The performance baseline is the single-threaded 2-path finding algorithm, since it is the fastest integer sorting algorithm among those tested. All algorithms achieve a reasonable speedup as the thread count increases up to 36 cores, except RADULS, whose performance increases up to 16 cores and degrades afterwards. The 2-path finding algorithm clearly outperforms all of the other algorithms at all thread counts.

6.3.3 Input Size Scalability

Figure 6-3 shows the performance of all of the algorithms on random inputs of varying size, with a range of 10^8. As expected, the running times of all of the algorithms increase with the input size. The 2-path finding algorithm is always the fastest on all of the different input sizes.

Figure 6-1: Speedup over single-thread performance of 2-path vs. thread count for unif(10^9, 10^9).

6.3.4 Effect of Key Range

Figures 6-4 and 6-5 illustrate the running time of the algorithms as a function of the range of the input keys. As expected, the running time of each algorithm increases with the key range. However, the increase for both variants of the presented algorithm, as well as for PBBS radix sort, is sub-linear in log 푟, the worst-case number of levels of recursion, because they switch to comparison sort when the input size is small enough. The sample sorting algorithms (IPS4o and PBBS sample sort) also get slower with larger key ranges because the number of recursive calls increases due to a larger number of unique keys (buckets with equal keys do not need to be recursed on).

Figure 6-2: Speedup over single-thread performance of 2-path vs. thread count for zipf_0.75(10^9, 10^9).

6.3.5 Effect of Skewness in Zipfian Distributions

Figure 6-6 shows the effect of varying the 휃 parameter of the input zipf_휃(10^9, 10^9) on performance. A larger 휃 parameter increases the skew of the distribution. The figure shows that the performance of both the 2-path finding and cycle finding algorithms is robust across different 휃 values, as the number of levels of recursion depends mostly on the range, which is fixed in this experiment. The comparison sorting algorithms perform better when the skew is higher, because having fewer unique keys reduces the number of recursive calls needed. The 2-path variant of Regions sort still performs the best across different 휃 values.

Figure 6-3: 36-core running time vs. input size 푛 for unif(10^8, 푛).

6.3.6 Breakdown of Running Time

Figure 6-7 shows the breakdown of the running time of the 2-path finding algorithm on the unif(10^9, 10^9) and zipf_0.75(10^9, 10^9) inputs. The running time is broken down into the various phases of the first level of recursion, with the time for the remaining levels grouped together. Since the code runs in parallel and the early recursion optimization is used, some threads can be in the first level while others are in deeper levels. To calculate the time spent in the first level, the breakdown considers all the time during which at least one thread was doing work in the first level of the recursion. In effect, the breakdown illustrates the span of the first level compared to the span of the whole algorithm.

For these two inputs, there are four levels of recursion, since the radix size is 8 bits. Figure 6-7 shows that the first level of radix sort takes a significant fraction of the total time (over half for unif(10^9, 10^9) and almost half for zipf_0.75(10^9, 10^9)). The remaining levels of recursion are relatively faster, since the problem size decreases and is more likely to fit in cache.

Figure 6-4: 36-core running time vs. range 푟 for unif(푟, 10^9).

Within the first level, local sorting and performing the swaps associated with 2-paths dominate the running time. Finding the 2-paths (labeled "First level planning") and updating the graph (labeled "First level update") take very little time (under 10% of the first-level time). This is why using the parallel versions of planning and graph updating does not improve the running time even in the best case, assuming linear speedup and the same number of cache misses.

Figure 6-5: 36-core running time vs. range 푟 for zipf_0.75(푟, 10^9).

6.3.7 Effect of Radix Size

Figures 6-8 and 6-9 show the impact of varying the number of bits of the radix (the radix size) on the performance of the two variants of Regions sort on unif(10^9, 10^9) and zipf_0.75(10^9, 10^9), respectively. Using 8-bit radices yields the best performance, but the performance with 6-bit and 7-bit radices is almost as good. For smaller radices, more passes are needed before the problem size fits into cache, which leads to worse performance. Increasing the number of bits beyond 8 makes the buckets during local sorting less likely to fit in the L1 cache (32 KB on the testing machine), which causes performance degradation.

Figure 6-6: 36-core running time vs. 휃 for zipf_휃(10^9, 10^9).

Figure 6-7: Breakdown of 36-core running time for 2-path on unif(10^9, 10^9) and zipf_0.75(10^9, 10^9), into the first-level local sort, planning, swaps, and graph updates, plus the remaining levels.

6.3.8 Effect of 퐾

Figures 6-10 and 6-11 show the running time for different values of 퐾 for the 2-path finding algorithm on unif(10^9, 10^9) and zipf_0.75(10^9, 10^9), respectively. The plots show that the running time for large values of 퐾 is much lower than for small values of 퐾. For small values of 퐾, the local sorts in the first level are less likely to fit in cache, which increases the overall running time.

Figure 6-8: 36-core running time vs. radix size for unif(10^9, 10^9).

6.4 Conclusion

In this thesis, I have introduced Regions sort and used it to implement a novel parallel in-place radix sorting algorithm that is work-efficient and has sub-linear auxiliary space and span. Furthermore, the experiments show that Regions sort achieves good scalability and almost always outperforms existing parallel in-place sorting algorithms.

Figure 6-9: 36-core running time vs. radix size for zipf_0.75(10^9, 10^9).

Berney et al. present parallel in-place algorithms for constructing implicit search tree layouts from sorted sequences [3]. This research can be used to efficiently generate sorted sequences of integers in-place, which can be used as inputs to their algorithms. One of their approaches uses cycles to represent a series of swaps. The cycles in their approach are determined statically, while in our cycle processing algorithm the cycles are determined dynamically.

It is important to note that Regions sort is not limited to radix sort. Indeed, it can be generalized to any algorithm that iteratively maps records to buckets. It would be interesting to use Regions sort to implement a sample sort similar to IPS4o [2], thus giving a parallel in-place comparison-based sorting algorithm. Another interesting research direction is to implement an efficient stable radix sort using Regions sort, for example using the 2-path finding algorithm. With a stable sort as the base case, records within a region would be stable at graph construction time. It should be noted that self-edges could correspond to data movements in this case and would need to be represented in the graph. The challenges then are matching incoming edges with outgoing edges

in a manner that maintains stability, and handling self-edges without introducing read-write races.

Figure 6-10: 36-core running time vs. 퐾 for unif(10^9, 10^9).

Figure 6-11: 36-core running time vs. 퐾 for zipf_0.75(10^9, 10^9).

Bibliography

[1] Michael Axtmann, Timo Bingmann, Peter Sanders, and Christian Schulz. Practical massively parallel sorting. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), 2015.

[2] Michael Axtmann, Sascha Witt, Daniel Ferizovic, and Peter Sanders. In-place parallel super scalar samplesort IPS4o. In European Symposium on Algorithms (ESA), 2017.

[3] K. Berney, H. Casanova, A. Higuchi, B. Karsin, and N. Sitchinava. Beyond binary search: Parallel in-place construction of implicit search tree layouts. In IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2018.

[4] Guy E. Blelloch, Jeremy T. Fineman, Phillip B. Gibbons, and Julian Shun. Internally deterministic parallel algorithms can be fast. In ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), 2012.

[5] Guy E. Blelloch, Phillip B. Gibbons, and Harsha Vardhan Simhadri. Low depth cache-oblivious algorithms. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), 2010.

[6] Guy E. Blelloch, Yan Gu, Julian Shun, and Yihan Sun. Parallelism in randomized incremental algorithms. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), 2016.

[7] Guy E. Blelloch, Charles E. Leiserson, Bruce M. Maggs, C. Greg Plaxton, Stephen J. Smith, and Marco Zagha. A comparison of sorting algorithms for the Connection Machine CM-2. In ACM Symposium on Parallel Algorithms and Architectures (SPAA), 1991.

[8] Robert D. Blumofe and Charles E. Leiserson. Scheduling multithreaded computations by work stealing. Journal of the ACM, 46(5), September 1999.

[9] Jatin Chhugani, Anthony D. Nguyen, Victor W. Lee, William Macy, Mostafa Hagog, Yen-Kuang Chen, Akram Baransi, Sanjeev Kumar, and Pradeep Dubey. Efficient implementation of sorting on multi-core SIMD CPU architecture. In International Conference on Very Large Data Bases (VLDB), 2008.

[10] Minsik Cho, Daniel Brand, Rajesh Bordawekar, Ulrich Finkler, Vincent Kulandaisamy, and Ruchir Puri. PARADIS: An efficient parallel algorithm for in-place radix sort. Proc. VLDB Endow., 8(12), August 2015.

[11] Richard Cole. Parallel merge sort. SIAM J. Comput., 17(4), August 1988.

[12] Richard Cole and Vijaya Ramachandran. Resource oblivious sorting on multicores. ACM Trans. Parallel Comput., 3(4), March 2017.

[13] Naga Govindaraju, Jim Gray, Ritesh Kumar, and Dinesh Manocha. GPUTeraSort: High performance graphics co-processor sorting for large database management. In ACM SIGMOD International Conference on Management of Data, 2006.

[14] Jim Gray, Prakash Sundaresan, Susanne Englert, Ken Baclawski, and Peter J. Weinberger. Quickly generating billion-record synthetic databases. In ACM SIGMOD International Conference on Management of Data, 1994.

[15] David R. Helman, David A. Bader, and Joseph JaJa. A randomized parallel sorting algorithm with an experimental study. Journal of Parallel and Distributed Computing, 52(1), 1998.

[16] H. Inoue, T. Moriyama, H. Komatsu, and T. Nakatani. AA-Sort: A new parallel sorting algorithm for multi-core SIMD processors. In International Conference on Parallel Architecture and Compilation Techniques (PACT), 2007.

[17] J. Jaja. Introduction to Parallel Algorithms. Addison-Wesley Professional, 1992.

[18] Daniel Jiménez-González, Juan J. Navarro, and Josep-L. Larriba-Pey. Fast parallel in-memory 64-bit sorting. In International Conference on Supercomputing (ICS), 2001.

[19] Marek Kokot, Sebastian Deorowicz, and Agnieszka Debudaj-Grabysz. Sorting data on ultra-large scale with RADULS. In Beyond Databases, Architectures and Structures, 2017.

[20] Marek Kokot, Sebastian Deorowicz, and Maciej Długosz. Even faster sorting of (not only) integers. In Man-Machine Interactions 5, 2018.

[21] Shin-Jae Lee, Minsoo Jeon, Dongseung Kim, and Andrew Sohn. Partitioned parallel radix sort. J. Parallel Distrib. Comput., 62(4), April 2002.

[22] Charles E. Leiserson. The Cilk++ concurrency platform. The Journal of Supercomputing, 51(3), 2010.

[23] Hui Li and Kenneth C. Sevcik. Parallel sorting by over partitioning. In ACM Symposium on Parallel Algorithms and Architectures (SPAA), 1994.

[24] Peter M. McIlroy, Keith Bostic, and M. Douglas McIlroy. Engineering radix sort. Computing Systems, 6(1), 1993.

[25] Duane Merrill and Andrew Grimshaw. High performance and scalable radix sorting: A case study of implementing dynamic parallelism for GPU computing. Parallel Processing Letters, 21(02), 2011.

[26] Omar Obeya, Endrias Kahssay, Edward Fan, and Julian Shun. Theoretically-efficient and practical parallel in-place radix sorting. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), 2019.

[27] Orestis Polychroniou and Kenneth A. Ross. A comprehensive study of main-memory partitioning and its application to large-scale comparison- and radix-sort. In ACM SIGMOD International Conference on Management of Data, 2014.

[28] Alexander Rasmussen, George Porter, Michael Conley, Harsha V. Madhyastha, Radhika Niranjan Mysore, Alexander Pucher, and Amin Vahdat. TritonSort: A balanced large-scale sorting system. In USENIX Conference on Networked Systems Design and Implementation (NSDI), 2011.

[29] N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In IEEE International Parallel Distributed Processing Symposium (IPDPS), 2009.

[30] Nadathur Satish, Changkyu Kim, Jatin Chhugani, Anthony D. Nguyen, Victor W. Lee, Daehyun Kim, and Pradeep Dubey. Fast sort on CPUs and GPUs: A case for bandwidth oblivious SIMD sort. In ACM SIGMOD International Conference on Management of Data, 2010.

[31] Julian Shun, Guy E. Blelloch, Jeremy T. Fineman, Phillip B. Gibbons, Aapo Kyrola, Harsha Vardhan Simhadri, and Kanat Tangwongsan. Brief announcement: the Problem Based Benchmark Suite. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), 2012.

[32] Johannes Singler, Peter Sanders, and Felix Putze. MCSTL: The multi-core standard template library. In Euro-Par, 2007.

[33] Malte Skarupke. I wrote a faster sorting algorithm. https://probablydance.com/2016/12/27/i-wrote-a-faster-sorting-algorithm/, 2016.

[34] Andrew Sohn and Yuetsu Kodama. Load balanced parallel radix sort. In International Conference on Supercomputing (ICS), 1998.

[35] E. Solomonik and L. V. Kalé. Highly scalable parallel sorting. In IEEE International Symposium on Parallel Distributed Processing (IPDPS), April 2010.

[36] Elias Stehle and Hans-Arno Jacobsen. A memory bandwidth-efficient hybrid radix sort on GPUs. In ACM International Conference on Management of Data, 2017.

[37] Jan Wassenberg and Peter Sanders. Engineering a multi-core radix sort. In Euro-Par, 2011.

[38] Marco Zagha and Guy E. Blelloch. Radix sort for vector multiprocessors. In ACM/IEEE Conference on Supercomputing (SC), 1991.

[39] K. Zhang and B. Wu. A novel parallel approach of radix sort with bucket partition preprocess. In IEEE International Conference on Embedded Software and Systems, 2012.
