<<

Theoretically-Efficient and Practical Parallel In-Place Radix Sorting
by Omar Obeya
B.S., Massachusetts Institute of Technology (2018)
Submitted to the Department of Electrical Engineering and Computer Science in partial fulfillment of the requirements for the degree of Master of Engineering in Electrical Engineering and Computer Science at the MASSACHUSETTS INSTITUTE OF TECHNOLOGY
June 2019
© Massachusetts Institute of Technology 2019. All rights reserved.

Author: Department of Electrical Engineering and Computer Science, May 24, 2019

Certified by: Julian Shun, Assistant Professor, Thesis Supervisor

Accepted by: Katrina LaCurts, Chair, Master of Engineering Thesis Committee

Submitted to the Department of Electrical Engineering and Computer Science on May 24, 2019, in partial fulfillment of the requirements for the degree of Master of Engineering in Electrical Engineering and Computer Science

Abstract

Radix sort stands out for having a better worst-case theoretical bound than any comparison-based sort for fixed-length integers. Despite the fact that radix sort can be implemented either in-place or in parallel, there exists no parallel in-place implementation of radix sort that guarantees a sub-linear worst-case span. The challenge arises due to read-write races when reading from and writing to the same array. In this thesis, I introduce Regions sort and use it to implement a parallel work-efficient in-place radix sort. To sort 푛 integers from a range 푟, with a parameter 퐾, the algorithm requires only 푂(퐾 log 푟 log 푛) auxiliary memory, 푂(푛 log 푟) work, and 푂((푛/퐾 + log 퐾) log 푟) span. Regions sort uses a divide-and-conquer paradigm. It first divides the array into sub-arrays, each of which is sorted independently in parallel. This decreases the irregularity in the input and reorders it into a set of regions, each of which has similar properties. Next, it builds a data structure, called the regions graph, to represent the data movements needed to completely sort the array. Finally, Regions sort iteratively plans swaps that satisfy the required data movements. The algorithm then recurses on records with so-far equal keys to break ties. I compare two variants of Regions sort with state-of-the-art integer sorting and comparison sorting algorithms. Experiments show that the best variant of Regions sort usually outperforms the other sorting algorithms. In addition, I perform experiments showing that Regions sort maintains its superiority regardless of input distribution and range. More importantly, the single-threaded implementation of Regions sort outperforms the highly optimized Ska Sort implementation of serial in-place radix sort.

Thesis Supervisor: Julian Shun Title: Assistant Professor

Acknowledgments

This thesis is based on the paper titled "Theoretically-Efficient and Practical Parallel In-Place Radix Sorting" (DOI: https://doi.org/10.1145/3323165.3323198) [26]. This paper is to be published in ACM SPAA 2019 and is joint work with Endrias Kahssay, Edward Fan, and Prof. Julian Shun, without whom it would not have been possible to complete this research. This research was supported by DARPA SDH Award #HR0011-18-3-0007, the Applications Driving Architectures (ADA) Research Center, a JUMP Center co-sponsored by SRC and DARPA, DOE Early Career Award #DE-SC0018947, and NSF CAREER Award #CCF-1845763. Finally, I would like to express my sincere gratitude to my advisor Prof. Julian Shun for his support, patience, guidance, and immense knowledge. I could not have imagined having a better advisor and mentor.

Contents

1 Introduction

2 Related Work
  2.1 Preliminaries
  2.2 Previous Work
    2.2.1 Serial In-place Radix Sort
    2.2.2 Parallel Out-of-place Radix Sort
    2.2.3 PARADIS Algorithm
    2.2.4 IPS4o Algorithm

3 Algorithm
  3.1 Algorithm Overview
  3.2 Local Sorting
  3.3 Regions Graph Construction
  3.4 Global Sorting
    3.4.1 Cycle Finding
    3.4.2 2-path Finding
  3.5 Recursion

4 Analysis
  4.1 Local Sorting
  4.2 Graph Construction
  4.3 Global Sorting
    4.3.1 Cycle Finding
    4.3.2 2-path finding
  4.4 Overall algorithm

5 Optimizations
  5.1 Batching Cycles
  5.2 Prioritizing 2-cycles
  5.3 Early Recursion
  5.4 Country Sorting
  5.5 Coarsening

6 Evaluation
  6.1 Experiment Setup
    6.1.1 Experiment Environment
    6.1.2 Implementation Details
    6.1.3 Test Inputs
  6.2 Tested algorithms
  6.3 Results
    6.3.1 Results Overview
    6.3.2 Parallel Scalability
    6.3.3 Input Size Scalability
    6.3.4 Effect of Key Range
    6.3.5 Effect of Skewness in Zipfian Distributions
    6.3.6 Breakdown of Running Time
    6.3.7 Effect of Radix Size
    6.3.8 Effect of 퐾
  6.4 Conclusion

List of Figures

1-1 Represents the data movements required to sort records in one iteration of in-place radix sort, thus introducing potential read-write races. The colors and numbers in the records represent the bucket each record belongs to. The colored line under the array illustrates where records belonging to each bucket will end up after the array gets sorted.

3-1 (a) shows the value of the current radix of keys in the input for a 2-bit radix (퐵 = 4), and the lines below correspond to the countries (blue is country 0, orange is country 1, green is country 2, and purple is country 3). (b) shows the corresponding regions graph.

3-2 (a) shows the value of the current radix of keys in the input for a 2-bit radix (퐵 = 4). At each step, the corresponding swaps for each cycle are shown in the array above the regions graph. (b) shows the corresponding regions graph of (a), and the cycle to process is highlighted in red. (c) and (d) show the array and regions graph after processing the highlighted cycle in (b). In particular, the edges (0, 2, 1) and (1, 0, 1) are deleted, and (2, 1, 2) is modified to (2, 1, 1). The next cycle to process is highlighted in (d). (e) and (f) show the input and regions graph after processing the highlighted cycle in (d). After processing the remaining cycle in (f), the array is completely sorted for the current recursion level, as shown in (g), with the regions graph having no edges, as shown in (h).

3-3 (a) and (b) show an example of processing a 2-path formed by the edges (1, 0, 1) and (0, 2, 1). Here vertex 1 is the source, vertex 0 is the broker, and vertex 2 is the sink. The result of processing this 2-path is shown in (c) and (d). The edges (1, 0, 1) and (0, 2, 1) are deleted, and a new edge (1, 2, 1) is formed (the dashed edge in (b)).

3-4 An illustration of how 2-paths incident on a broker are processed in parallel. The current broker being processed is circled in bold. At each step, the corresponding swap for each 2-path is shown in the array above the regions graph using the same color as the 2-path. In (b), the algorithm processes the single 2-path with vertex 0 as broker, which is (1, 0, 1), (0, 2, 1), and creates the dashed edge (1, 2, 1), resulting in (d). In (d), Regions sort processes all three 2-paths with vertex 1 as broker in parallel. The three 2-paths are (1, 2, 1), (2, 1, 2), which does not add any new edges; (3, 1, 1), (1, 2, 1), which creates the dashed edge (3, 2, 1); and (2, 1, 1), (1, 3, 1), which creates the dashed edge (2, 3, 1). This results in (f). Finally, in (f), the algorithm processes the single 2-path with vertex 2 as broker, which is (2, 3, 1), (3, 2, 1), and is left with an empty graph in (h), at which point the array is sorted.

5-1 An illustration of how 2-paths incident on a broker are processed in parallel. The current broker being processed is circled in bold. At each step, the corresponding swap for each 2-path is shown in the array above the regions graph using the same color as the 2-path. In (b), the algorithm processes the single 2-path with vertex 0 as broker, which is (1, 0, 1), (0, 2, 1), and creates the dashed edge (1, 2, 1), resulting in (d). In (d), Regions sort processes all three 2-paths with vertex 1 as broker in parallel. The three 2-paths are (1, 2, 1), (2, 1, 2); (3, 1, 1), (1, 3, 1); and (2, 1, 1), (1, 2, 1), none of which adds any new edges. This results in (e), at which point the array is sorted, and the algorithm is left with an empty graph in (f).

6-1 Speedup over single-thread performance of 2-path vs. thread count for unif(10⁹, 10⁹).
6-2 Speedup over single-thread performance of 2-path vs. thread count for zipf₀.₇₅(10⁹, 10⁹).
6-3 36-core running time vs. input size 푛 for unif(10⁸, 푛).
6-4 36-core running time vs. range 푟 for unif(푟, 10⁹).
6-5 36-core running time vs. range 푟 for zipf₀.₇₅(푟, 10⁹).
6-6 36-core running time vs. 휃 for zipf휃(10⁹, 10⁹).
6-7 Breakdown of 36-core running time for 2-path on unif(10⁹, 10⁹) and zipf₀.₇₅(10⁹, 10⁹).
6-8 36-core running time vs. radix size for unif(10⁹, 10⁹).
6-9 36-core running time vs. radix size for zipf₀.₇₅(10⁹, 10⁹).
6-10 36-core running time vs. 퐾 for unif(10⁹, 10⁹).
6-11 36-core running time vs. 퐾 for zipf₀.₇₅(10⁹, 10⁹).

List of Tables

3.1 Terminology.

6.1 Serial times (푇1) and parallel (푇36ℎ) times (seconds), as well as the parallel speedup (SU) on a 36-core machine with two-way hyper-threading. The fastest parallel time for each input is bolded.

6.2 Serial times (푇1) and parallel (푇36ℎ) times (seconds), as well as the parallel speedup (SU) on a 36-core machine with two-way hyper-threading. This table is a continuation of Table 6.1, and results in both tables are comparable.

Chapter 1

Introduction

Radix sort stands out as a popular algorithm for integer sorting that has better theoretical guarantees than comparison-based sorting. To sort 푛 integers from the range [0, . . . , 푟 − 1], radix sort requires 푂(푛 log 푟) work (number of operations) in the worst case. This can be significantly smaller than the Ω(푛 log 푛) work bound required by comparison-based sorting when the integer range is small.

Most Significant Digit (MSD) radix sort sorts records consisting of key-value pairs according to their keys. A radix is a digit in a certain integer representation. The algorithm starts with the most significant radix, sorts the records according to that radix, and then recurses on keys with equal radices to break ties using less significant radices. The algorithm proceeds until all ties are broken or all the bits of the key are exhausted. Thus, the main component of radix sort is sorting a set of records based on a single digit that can take a small number of possible values, called buckets.

Serial radix sort can be implemented with or without linear auxiliary space. When using an auxiliary array, the algorithm reads one record at a time from the input array and writes the record to its correct place in the auxiliary space. In contrast, the in-place algorithm reads record-by-record, and if a record is not in its correct place, it is swapped with the record occupying that place.

Parallel radix sorting can be implemented in 푂(푛 log 푟) work and 푂(log 푟 log 푛) span [7, 38, 31], and there are a variety of practical parallel implementations that have been developed [7, 38, 36, 29, 25, 37, 31, 18, 21, 27, 30, 10, 19, 35, 34].

Input: 0 0 2 0 3 2 1 1 1 3 1

Figure 1-1: Represents the data movements required to sort records in one iteration of in-place radix sort, thus introducing potential read-write races. The colors and numbers in the records represent the bucket each record belongs to. The colored line under the array illustrates where records belonging to each bucket will end up after the array gets sorted.

However, no existing algorithm is able to perform parallel radix sort with sub-linear auxiliary space and sub-linear span (longest sequential dependence). This is because doing in-place swaps to sort creates many read-write races [27]. Figure 1-1 illustrates where each record should be moved in order for the array to be sorted according to one radix. If these records were moved concurrently according to the arrows shown, read-write races would arise. On the other hand, using auxiliary memory leads to incurring more cache misses due to a larger memory footprint. In this thesis, I show that it is possible to achieve both low space usage and high parallelism.

The only existing approach for a parallel in-place radix sort is PARADIS.

PARADIS is a speculative parallel MSD radix sort algorithm. To avoid races, PARADIS divides the array evenly among processors. Each processor is given a set of sub-arrays and is required to perform swaps in a manner that places as many records in the correct position as possible. In the best case, this leads to sorting the whole array in 푂(푛) work. However, this is not necessarily the case, since one processor might need to move records to a sub-array owned by another processor. Consequently, PARADIS has to perform more such permutation iterations until all the records are sorted. Since each iteration sorts some records, the algorithm terminates. However, the worst-case span of PARADIS is Ω(푛 log 푟), which is no better than the span of serial MSD radix sort. To my knowledge, there exists no prior work that guarantees a sub-linear span for a parallel in-place radix sort.

In this thesis, I introduce Regions sort and use it to implement a work-efficient parallel in-place radix sorting algorithm that requires sub-linear span and has competitive performance compared to other parallel sorting algorithms. In Chapter 2, I define the problem and my assumptions more formally. In addition, I explain and analyze the most relevant previous work using the same computational model. Chapter 3 describes Regions sort and two versions of it: 2-path finding and cycle finding. It dives into major implementation details that can affect the performance and asymptotic complexity of these versions. In Chapter 4, I analyze the algorithm by proving a number of lemmas about the auxiliary space, work, and span of Regions sort. This analysis concludes that Regions sort is a work-efficient algorithm with sub-linear span that requires sub-linear auxiliary space. Chapter 5 describes optimizations that improve the practical performance of Regions sort. Finally, in Chapter 6, I present a comprehensive empirical study comparing the performance of the two versions of Regions sort with state-of-the-art parallel sorting algorithms on various inputs and distributions. The code for Regions sort is publicly available at https://github.com/omarobeya/parallel-inplace-radixsort.

Chapter 2

Related Work

2.1 Preliminaries

Notation. I refer to the number of distinct possible values a radix can take as 퐵, which is also the number of buckets. I refer to the number of bits needed to represent the radix, also known as the radix size, as 푑, such that 퐵 = 2^푑. Unlike the rest of the thesis, in figure captions I refer to directed edges using source, destination, and edge weight. This description is not fully precise, since multiple edges with the same source, destination, and weight can co-exist. However, it keeps the graphs simple while staying clear enough for the small examples presented in the figures.

In-place Algorithms. An algorithm is considered in-place if the total memory it uses is 표(푛), i.e., sub-linear in the size of the input. In this thesis, I assume each pointer into an array requires space logarithmic in the size of that array.

Work-Span Model. In this thesis, I use the work-span model to analyze the complexity of algorithms.

1. Work: total number of instructions.

2. Span: length of the longest chain of dependency across instructions.

Under this model, the worst-case complexity of an algorithm running on 푃 processors is upper bounded by 푊표푟푘/푃 + 푂(푆푝푎푛) using a work-stealing scheduler [8].

Primitives. In this thesis, I refer to three important primitives: prefix sum, merge, and histogram building.

1. Prefix Sum: Takes an array of 푠 integers 퐴[0], 퐴[1], . . . , 퐴[푠 − 1], a binary operator ⊕, and an identity 퐼 such that 퐼 ⊕ 푥 = 푥 for all 푥; and returns an array of 푠 + 1 integers: 퐼, 퐼 ⊕ 퐴[0], 퐼 ⊕ 퐴[0] ⊕ 퐴[1], . . . , 퐼 ⊕ 퐴[0] ⊕ · · · ⊕ 퐴[푠 − 1]. The prefix sum can be implemented efficiently in parallel, achieving 푂(푠) work and 푂(log 푠) span, and requiring 푂(푠) auxiliary space [17].

2. Merge: Takes two sorted arrays of integers 퐴1 and 퐴2, and returns a sorted array 퐴3 that contains all the elements of 퐴1 and 퐴2. A parallel merge needs 푂(|퐴1| + |퐴2|) work, 푂(log |퐴1| + log |퐴2|) span, and 푂(|퐴1| + |퐴2|) auxiliary space [17].

3. Histogram Building: Takes an array 퐴 of 푛 records and a function 푓 that maps each record to an integer from 0 to 퐵 − 1, and returns an array 퐶표푢푛푡푠, where 퐶표푢푛푡푠[푗] equals the number of indices 푖 for which 푓(퐴[푖]) = 푗. The histogram can be built in parallel by dividing the array into 푆 equal sub-arrays, calculating the histogram of each sub-array independently in serial, and finally aggregating these counts using a parallel sum over each bucket independently, as in the sketch below. This requires 푂(푛 + 푆퐵) work, 푂(푛/푆 + log 푆) span, and 푂(푆퐵) auxiliary space. Parallel histogram building was used in previous parallel radix sort algorithms, like PARADIS [10]. Serial histogram building refers to the case where 푆 = 1.
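The following is a minimal C++ sketch of this scheme, assuming OpenMP for the parallel loops; the function name parallel_histogram and the choice of 푆 are illustrative, not part of the original implementation.

#include <algorithm>
#include <cstdint>
#include <vector>

// Sketch: S private tables filled serially, then a per-bucket aggregation.
// OpenMP pragmas stand in for a generic parallel-for construct.
std::vector<size_t> parallel_histogram(const std::vector<uint64_t>& A,
                                       size_t B, size_t S,
                                       size_t (*f)(uint64_t)) {
  std::vector<std::vector<size_t>> local(S, std::vector<size_t>(B, 0));
  size_t chunk = (A.size() + S - 1) / S;
  #pragma omp parallel for
  for (size_t s = 0; s < S; s++) {                 // serial count per chunk
    size_t lo = s * chunk, hi = std::min(A.size(), lo + chunk);
    for (size_t i = lo; i < hi; i++) local[s][f(A[i])]++;
  }
  std::vector<size_t> counts(B, 0);
  #pragma omp parallel for
  for (size_t b = 0; b < B; b++)                   // aggregate each bucket
    for (size_t s = 0; s < S; s++) counts[b] += local[s][b];
  return counts;
}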

Countries. When sorting integers according to a single radix that can take 퐵 different values, each record can be classified into one of 퐵 buckets. Records of the same bucket have to be placed contiguously. Most algorithms have to pre-calculate the sub-array that will store the records belonging to each bucket at the end of sorting. In this thesis, I refer to such a sub-array as a country. More specifically, a country is a sub-array that, when sorting finishes, will hold all of, and only, the records belonging to a specific bucket. Hence, there is a one-to-one correspondence between buckets and countries. Many algorithms set a pointer to the start of each country and update it as more elements are added to the country. I refer to such a pointer as a country pointer. To calculate the size of each country, histogram building is used. Calculating country pointers is done by performing a prefix sum on the country sizes.
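As a small illustration, the sketch below derives country pointers from a counts array using a serial exclusive prefix sum (a parallel prefix sum [17] applies for large 퐵); the names H and T are illustrative, mirroring start and one-past-the-end pointers.

#include <cstddef>
#include <vector>

// Country pointers from country sizes via an exclusive prefix sum.
// H[i] is the start of country i; T[i] is one past its end.
void country_pointers(const std::vector<size_t>& counts,
                      std::vector<size_t>& H, std::vector<size_t>& T) {
  size_t B = counts.size();
  H.assign(B, 0);
  T.assign(B, 0);
  size_t sum = 0;                       // running exclusive prefix sum
  for (size_t i = 0; i < B; i++) {
    H[i] = sum;
    sum += counts[i];
    T[i] = sum;
  }
}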

2.2 Previous Work

2.2.1 Serial In-place Radix Sort

Algorithm 1 shows the pseudocode for serial in-place radix sort, also known as American Flag Sort [24]. Lines 2–4 perform serial histogram building, and line 5 uses the histogram to calculate the country pointers 퐻[푖]. Country pointers are updated continuously to point to the first unsorted record in the corresponding country. The values 푇[푖] are analogous to 퐻[푖], but they point to the first record of the next country. 퐻[푖] is updated dynamically to keep pointing at the first unsorted record in the country, while 푇[푖] stays static through the rest of the algorithm to bound 퐻[푖]. Lines 6–10 sort all records by swapping each of them once into its correct country. In particular, the algorithm sorts one country 푖 at a time, starting from its country pointer 퐻[푖], which points to the first record that does not belong to the country. The algorithm repeatedly swaps that record with a record from the record's correct country and increments the pointer of that country, since it now contains one more record that belongs to it. The process is repeated until the record at position 퐻[푖] actually belongs to country 푖, in which case 퐻[푖] is incremented. The algorithm takes linear work since each swap in the loop on lines 8–9 sorts at least one record. Then, it recurses on records within the same country to break ties.

2.2.2 Parallel Out-of-place Radix Sort

The parallel out-of-place MSD radix sort algorithm starts by performing a parallel histogram to calculate country pointers. Then it divides the array into sub-arrays, each of which is given to a processor.

Algorithm 1 Pseudocode for Serial In-Place Radix Sort
◁ extract(푎, level) extracts the appropriate bits from element 푎 based on the radix size and current level.

1: procedure RadixSort(퐴, 푛, level)
2:   퐶 = {0, . . . , 0}  ◁ length-퐵 array storing per-bucket counts
3:   for 푎 in 퐴 do
4:     퐶[extract(푎, level)]++
5:   Prefix sum on 퐶 to get per-bucket start and end indices stored in 퐻 and 푇, respectively
6:   for 푖 = 0 to 퐵 − 1 do
7:     while 퐻[푖] < 푇[푖] do
8:       while extract(퐴[퐻[푖]], level) ̸= 푖 do
9:         swap(퐴[퐻[푖]], 퐴[퐻[extract(퐴[퐻[푖]], level)]++])
10:      퐻[푖]++
11:  if level < maxLevel then
12:    for 푖 = 0 to 퐵 − 1 do
13:      RadixSort(퐴푖, 퐶[푖], level + 1)  ◁ 퐴푖 is the sub-array containing elements in bucket 푖
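For concreteness, a runnable C++ rendering of Algorithm 1 is sketched below; the 8-bit radix and the 64-bit key type are illustrative choices, not dictated by the pseudocode.

#include <cstdint>
#include <utility>

constexpr int kRadixBits = 8, kBuckets = 1 << kRadixBits;
constexpr int kMaxLevel = 64 / kRadixBits - 1;   // 7 for 64-bit keys

// MSD first: level 0 extracts the most significant byte.
inline size_t extract(uint64_t a, int level) {
  return (a >> ((kMaxLevel - level) * kRadixBits)) & (kBuckets - 1);
}

void radix_sort(uint64_t* A, size_t n, int level) {
  if (n <= 1) return;
  size_t C[kBuckets] = {0};
  for (size_t i = 0; i < n; i++) C[extract(A[i], level)]++;   // lines 2-4
  size_t H[kBuckets], T[kBuckets], sum = 0;                   // line 5
  for (int i = 0; i < kBuckets; i++) { H[i] = sum; sum += C[i]; T[i] = sum; }
  for (int i = 0; i < kBuckets; i++) {                        // lines 6-10
    while (H[i] < T[i]) {
      while (extract(A[H[i]], level) != (size_t)i)            // swap record home
        std::swap(A[H[i]], A[H[extract(A[H[i]], level)]++]);
      H[i]++;
    }
  }
  if (level < kMaxLevel) {                                    // lines 11-13
    size_t start = 0;
    for (int i = 0; i < kBuckets; i++) {
      radix_sort(A + start, C[i], level + 1);
      start += C[i];
    }
  }
}
// e.g., radix_sort(data, n, 0) sorts from the most significant byte down.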

Each processor inserts its records into the correct country in the auxiliary space. Write races can be avoided by giving each processor exclusive access to a fraction of each country 푖 equal to the number of records read by that processor that belong to country 푖, for all 0 ≤ 푖 < 퐵. This can be done by performing a prefix sum on the number of records belonging to bucket 푖 read by each processor. More space is also needed to keep track of the country pointers at each level of recursion for each processor, amounting to 푂(푃퐵 log 푛 log 푟) space. The algorithm performs 푂(푛 log 푟) work, has 푂(log 푛 log 푟) span, and uses Ω(푛 + 푃퐵 log 푟 log 푛) auxiliary space.
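A minimal sketch of this exclusive-offset computation is given below, under the stated assumptions (a serial prefix sum for brevity, and counts[p][b] holding the number of records in processor 푝's sub-array that belong to bucket 푏); the function name is illustrative.

#include <cstddef>
#include <vector>

// offsets[p][b]: position in the auxiliary array where processor p writes
// its first record of bucket b. Iterating buckets in the outer loop keeps
// each bucket's country contiguous across processors.
std::vector<std::vector<size_t>> exclusive_offsets(
    const std::vector<std::vector<size_t>>& counts,   // counts[p][b]
    size_t P, size_t B) {
  std::vector<std::vector<size_t>> offsets(P, std::vector<size_t>(B, 0));
  size_t sum = 0;                // serial prefix sum over (bucket, processor)
  for (size_t b = 0; b < B; b++)
    for (size_t p = 0; p < P; p++) {
      offsets[p][b] = sum;
      sum += counts[p][b];
    }
  return offsets;
}
// Processor p then writes each of its records x to aux[offsets[p][f(x)]++];
// the ranges are disjoint, so no write races occur.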

2.2.3 PARADIS Algorithm

Cho et al. [10] introduce PARADIS, the only existing algorithm for parallel in-place MSD radix sort. PARADIS sorts records into their respective countries and then recurses independently in parallel on each country. In order to sort a set of records according to a certain radix, PARADIS first pre-calculates country sizes and locations in parallel, through parallel histogram building, once at the beginning of the algorithm. Next, the algorithm repeatedly performs two phases until the whole array is sorted: the permutation phase and the repair phase.

In the permutation phase, each country is divided evenly among the processors. This prevents the read-write races mentioned before, since each record can be read and written by exactly one processor. Next, each processor permutes its records such that it puts as many records as possible in the correct country. This part is very similar to the serial in-place sort, since each element is read only once, and if it is not in the correct country, it is swapped with a record in its correct country. The only difference is that a processor might read more records belonging to a specific country 푥 than it has room for in country 푥, since a processor can read all the elements that belong to 푥 while it has access to only a 1/푃 fraction of the unsorted records in country 푥. This leads to some records not being put into their correct countries after the permutation phase. However, the authors show that the algorithm always makes progress by putting at least a constant fraction of the records in their correct countries.

The repair phase follows. It is responsible for separating the sorted records from the unsorted records, so that the algorithm can reattempt to permute them. For each country, records that do not belong to the country and could not be permuted in the previous phase are put at the end of the country. The new country pointers point to the first misplaced record in each original country, and country sizes are updated to reflect the number of misplaced records, effectively reducing the problem size. Repairing each country is independent and happens in parallel. Therefore, the span of the algorithm is constrained by the maximum fraction of misplaced elements shifted by one processor, which the authors call 푤.

The algorithm is in-place since it uses sub-linear space. Each of the 푃 processors needs to keep 푂(퐵) country pointers at each recursion level, resulting in 푂(푃퐵 log퐵 푟 log 푛) space.¹ At the first iteration of the permutation phase the total work is Θ(푛), but since the number of records to be sorted shrinks by a factor of at least (퐵 − 1)/퐵 per iteration, the total work done to sort records according to one radix is 푂(퐵푛), amounting to 푂(퐵푛 log 푟) work to sort the integers completely. Furthermore, the authors show an adversarial input to their algorithm that leads to a Θ(푛) span per radix. Therefore, the span per radix is Ω(푛), amounting to Ω(푛 log 푟) in total.

PARADIS is inefficient for both small and large values of 퐵. For small values of 퐵, the algorithm does not scale, since the repair phase assigns each country to be repaired by one processor. For large values of 퐵, the work of PARADIS blows up even more, since it takes 푂(퐵푛) work per radix, which is a factor of 퐵 more than the serial in-place radix sort, which takes only Θ(푛). This makes the running time of PARADIS on 푃 processors Ω(max((퐵/푃)푛, 푛)).

¹The bounds here are different from [10], as the authors assume that 푃, 퐵, and 푟 are constant, and that pointers take 푂(1) space.

2.2.4 IPS4o Algorithm

There has also been significant prior work on parallel algorithms for comparison sorting (e.g., [16, 9, 1, 28, 15, 23, 5, 12, 35, 11, 6, 4, 31, 13, 39]). In terms of implementations, IPS4o is one of the most recent and has been shown to achieve state-of-the-art performance [2]. IPS4o [2] is an in-place comparison-based sorting algorithm that depends on sampling. IPS4o recursively divides the array into 푘 buckets by randomly picking 푘 − 1 pivots, and then sorts records according to those pivots. Next, the array is divided among processors, each of which processes a sub-array serially. Each processor has 푘 buffers (each corresponding to a bucket), each of size 푏. When a processor reads a record, it puts it in the corresponding buffer. When the buffer is full, the processor dumps all of the buffer's records into its sub-array, replacing 푏 already-read records. Country borders are calculated using parallel histogram building. Some buffers will be located in the correct country; others will not. IPS4o swaps incorrectly located buffers to sort them. IPS4o has to perform corrections to account for the fact that not all countries are aligned with buffer boundaries or have a size divisible by 푏. Unlike Regions sort, which uses atomics only implicitly to join threads, IPS4o uses atomics to implement locks for swapping buffers.

Chapter 3

Algorithm

3.1 Algorithm Overview

In this chapter, I present Regions sort, a novel parallel work-efficient in-place radix sorting algorithm. Algorithm 2 describes Regions sort at a high level. Table 3.1 defines the terminology used to describe the algorithm. At a high level, the algorithm is an MSD radix sort: it classifies records based on a certain radix into 퐵 countries, and then recurses on each country independently, in parallel, to break ties using less significant radices. In order to pre-calculate how many levels of recursion are needed, the maximum key is computed once at the beginning of the algorithm. The maximum key can be computed using the prefix sum algorithm by setting the binary operator ⊕ to 푚푎푥(., .) and the identity 퐼 to −∞. Once the maximum is computed, the algorithm proceeds by recursively sorting records according to one radix at a time. I first explain how Regions sort sorts records according to a single radix in a single round of radix sort. In a single round, Regions sort operates in three phases: local sorting, regions graph construction, and global sorting. After sorting, if the algorithm is not already at the last recursion level, it recurses on each country concurrently.
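As a small illustration of this pre-calculation, the following sketch computes the number of radix rounds from the maximum key; it uses a serial loop where the implementation would use a parallel reduction (⊕ = 푚푎푥), and the radix size 푑 is an illustrative parameter.

#include <algorithm>
#include <cstdint>
#include <vector>

// Number of recursion levels needed: ceil((bits of max key) / d).
int num_levels(const std::vector<uint64_t>& A, int d) {
  uint64_t mx = 0;                       // identity for max over unsigned keys
  for (uint64_t a : A) mx = std::max(mx, a);
  int bits = 0;
  while (bits < 64 && (mx >> bits) != 0) bits++;   // bits of the max key
  return (bits + d - 1) / d;                       // radix rounds, d bits each
}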

Table 3.1: Terminology.

Country: A contiguous sub-array of 퐴 that will contain all keys belonging to a certain bucket after 퐴 is sorted by the current radix. Country 푖 (also referred to as 퐴푖) is the sub-array containing all keys belonging to bucket 푖 after sorting.

Block: A contiguous sub-array of 퐴. The 푘'th block is the sub-array 퐴[푘푛/퐾, . . . , (푘 + 1)푛/퐾 − 1].

Homogeneous Sub-array: A contiguous sequence of keys belonging to the same bucket.

Region: A homogeneous sub-array of maximum size within the same country.

3.2 Local Sorting

During the local sorting phase (Lines 5–6 of Algorithm 2), Regions sort partitions the array into 퐾 equal-sized sub-arrays; I refer to each as a block. Each block is sorted serially but independently of other blocks, allowing different blocks to be sorted concurrently. This phase is meant to decrease the irregularity in the input. After local sorting, the array has no more than 퐾퐵 homogeneous sub-arrays in a row, where a homogeneous sub-array is a contiguous group of records that belong to the same bucket. This serves two purposes. First, it makes it possible to represent the array using sub-linear auxiliary space. Second, it allows performing swaps between these sub-arrays in batches, since all records within a homogeneous sub-array need to be moved to the same country, which leaves more room for parallelism and data locality.

3.3 Regions Graph Construction

In order to complete sorting the array, homogeneous sub-arrays need to be reordered such that each homogeneous sub-array falls in the right country. Hence, Regions sort has to maintain a data structure that keeps track of all homogeneous sub-arrays belonging to a specific country 푖 and all homogeneous sub-arrays that need to be moved out of country 푖, for all 0 ≤ 푖 < 퐵.

Algorithm 2 Pseudocode for Regions Sort

1: 푛 = initial input size
2: 퐾 = initial number of blocks
3: procedure RadixSort(퐴, 푁, level)
4:   퐾′ = ⌈퐾 · 푁/푛⌉
   ◁ Local Sorting Phase
5:   parfor 푘 = 0 to 퐾′ − 1 do
6:     Do serial in-place distribution on 퐴[푘푁/퐾′, . . . , (푘 + 1)푁/퐾′ − 1]
   ◁ Graph Construction Phase
7:   Compute global counts array 퐶 from local counts, and per-bucket start and end indices
8:   Generate regions graph 퐺
   ◁ Global Sorting Phase
9:   while 퐺 is not empty do
10:    Process a subset of non-conflicting edges in 퐺 in parallel
11:    Remove processed edges and add new edges to 퐺 if needed
   ◁ Recursion
12:  if level < maxLevel then
13:    parfor 푖 = 0 to 퐵 − 1 do
14:      if 퐶[푖] > 0 then
15:        RadixSort(퐴푖, 퐶[푖], level + 1)

Furthermore, such a data structure has to overcome two challenges. First, it needs to be efficient to update, since data movements should be reflected in the data structure. Second, it needs to use sub-linear memory after any number of data movements.

A region is a homogeneous sub-array that resides in a single country and cannot be extended to the left or to the right to take in more records that are homogeneous and within the same country borders. By definition, all records within a homogeneous sub-array need to be moved to the same country. Thus, it is possible to represent all required data movements by listing all regions.

A regions graph is a data structure that represents the data movements required to put records into their respective countries, while being flexible for dynamic updates in the global sorting phase. In a regions graph, each country is represented by a node, and each region is represented by a directed weighted edge that goes from the node corresponding to the country it resides in to the node corresponding to the country it belongs to.


Figure 3-1: (a) shows the value of the current radix of keys in the input for a 2-bit radix (퐵 = 4), and the lines below correspond to the countries (blue is country 0, orange is country 1, green is country 2, and purple is country 3). (b) shows the corresponding regions graph.

The weight of each edge represents the size of the corresponding region. In order to plan the swaps, edges also need to store the start of the corresponding region. A regions graph does not include zero-weight edges or self-edges, since they do not correspond to any data movement. Consequently, a sorted array corresponds to a regions graph with no edges. Figure 3-1 shows an example array and the corresponding regions graph. After the local sorting phase, building the regions graph requires sub-linear work and space. Lemma 1 proves that the number of edges in any regions graph after the local sorting phase is sub-linear in 푛.

Lemma 1. After the local sorting phase, the total number of edges in any regions graph is 푂(퐾퐵).

Proof. The total number of maximal homogeneous sub-arrays is 푂(퐾퐵), since each of the 퐾 blocks contains at most 퐵 such homogeneous sub-arrays. Each such homogeneous sub-array that does not cross a country boundary is a region. Since the total number of countries is 퐵, there can exist no more than 퐵 − 1 country boundaries. Therefore, the total number of regions is 푂(퐾퐵). The number of edges is equal to the number of regions minus the number of regions that do not need to be relocated.

The Regions sort implementation uses an edge list, rather than an adjacency matrix. Therefore, the work and space needed to build the graph is linear in the number of edges and nodes, which is 푂(퐵 + 퐾퐵). In contrast, an adjacency matrix requires 푂(퐾퐵 + 퐵²) space and work to build.
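A minimal sketch of such an edge-list representation is shown below; the struct layout and the country_of helper are illustrative assumptions, not the exact layout used in the implementation.

#include <cstddef>
#include <vector>

// One vector of edges per country (node). The country_of helper (mapping
// an array index to the country containing it, e.g., by binary search over
// the country boundaries) is assumed, not shown.
struct Edge {
  size_t dest;    // country the region's records belong to
  size_t weight;  // number of records in the region
  size_t start;   // index of the region's first record in A
};

struct RegionsGraph {
  std::vector<std::vector<Edge>> out;   // edge list per country
  explicit RegionsGraph(size_t B) : out(B) {}

  void add_region(size_t src, size_t dest, size_t start, size_t weight) {
    if (weight == 0 || src == dest) return;  // no movement: omit the edge
    out[src].push_back({dest, weight, start});
  }
};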

3.4 Global Sorting

The aim of the global sorting phase is to use the regions graph to move each homogeneous sub-array to its respective country while using sub-linear memory. Global sorting operates in steps, where in each step the algorithm picks one country to sort completely, after which that country will include all the records that belong to it. This also implies that it will not contain records that do not belong to it, as shown by the following lemma.

Lemma 2. For any vertex in a regions graph, the sum of weights of incoming edges equals the sum of weights of outgoing edges.

Proof. By definition, the size of a country equals the total number of records that belong to it. Therefore, the total number of records that belong to the country but are outside it equals the total number of records that do not belong to the country but are inside it. By definition, the former is the sum of the weights of the incoming edges, while the latter is the sum of the weights of the outgoing edges.

Using lemma 2, it follows that once a country is sorted, the corresponding node will be isolated.

The global sorting phase iterates repeatedly through three important stages:

29 1. Planning: choosing sub-arrays to swap and storing these choices.

2. Executing: swapping sub-arrays in parallel.

3. Graph Update: updating the regions graph to reflect the changes in the array.

The algorithm iterates over these three stages until the regions graph has no edges, meaning that the array is completely sorted. In this work, I present two methods for global sorting: cycle finding and 2-path finding.

3.4.1 Cycle Finding

A cycle in a regions graph indicates a group of homogeneous sub-arrays that can be reordered such that each of them ends up in its correct country. By construction of the regions graph, each edge points from the country its region resides in to the country it should move to. Thus, the sub-array corresponding to each edge in the cycle can be shifted to replace the sub-array corresponding to the following edge in the cycle. In order for sub-arrays to switch places, they need to have the same size. Therefore, only the first 푤푚푖푛 records of each sub-array are moved, where 푤푚푖푛 is the minimum edge weight on the cycle. Note that this means that not all the records represented by the edges of the cycle will be sorted by the described shift. The cycle finding algorithm uses an edge list implementation to iteratively plan a cycle, execute it, and update the graph to reflect the change, until the graph has no edges. It is possible to prove that the regions graph of an unsorted array always contains a cycle. Furthermore, due to the property in Lemma 2, finding a cycle in a regions graph is easier than in general graphs.

Lemma 3. A cycle can always be found in a non-empty regions graph in 푂(퐵) graph traversals.

Proof. Consider a non-empty regions graph. Start at a node with an outgoing edge and repeatedly traverse the graph through outgoing edges. There are two possibilities. The first possibility is reaching a node with an incoming edge but no outgoing edges; however, this contradicts Lemma 2. The second possibility is to keep traversing nodes; after visiting 퐵 + 1 nodes, some node must have been visited twice by the pigeonhole principle. Thus, the graph contains a cycle.

Cycle finding using depth-first search is efficient with a straightforward edge list implementation. The regions graph contains 퐵 edge lists, one for each node. Each list contains the set of edges going out of the corresponding node. Finding an arbitrary neighbor takes only 푂(1) work, by using the edge at the beginning of the edge list. Updating the graph also requires 푂(1) work to change the weight of an edge, and 푂(1) work to delete an edge entirely by advancing the pointer to the start of the edge list. The algorithm proceeds as follows: each time a node is chosen to be sorted, the algorithm iteratively finds a cycle connected to that node, executes it, and updates the graph. This continues until the node has no more edges. Then, the algorithm picks another node as mentioned before.

Consider a cycle 푐 of length 푙 defined by the edges (푣0, 푣1, 푤0, 푠0), . . . , (푣푙−1, 푣0, 푤푙−1, 푠푙−1), with 푤푚푖푛 denoting the minimum weight on the cycle.

1. Planning: finding a cycle using depth-first search starting from the node of interest.

2. Executing: moving the record at position 푠푖 + 푗 to position 푠(푖+1) mod 푙 + 푗, for all 0 ≤ 푖 ≤ 푙 − 1 and 0 ≤ 푗 < 푤푚푖푛, in parallel.

3. Graph Update: updating each edge on the cycle by subtracting 푤푚푖푛 from its weight and adding 푤푚푖푛 to its start. Edges whose weights drop to zero are deleted.

Figure 3-2 shows an example of the cycle finding algorithm. The algorithm eventually terminates since the total sum of weights in the graph always decreases. In addition, every time 푤푚푖푛 is subtracted from each edge on a cycle, at least one edge gets deleted, because by definition at least one edge's weight equals the minimum weight on the cycle. A simplified sketch of one planning/execution/update iteration follows.
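Below is a simplified serial C++ sketch of one such iteration, assuming the Edge layout from the graph-construction sketch (redeclared here for completeness); the real implementation parallelizes the record moves across the orders 푗, as discussed after Figure 3-2.

#include <algorithm>
#include <cstdint>
#include <vector>

struct Edge { size_t dest, weight, start; };   // as in the earlier sketch

// head[v] indexes the first live edge of node v, so deleting the front
// edge of a list is O(1) (advance head[v]).
void process_one_cycle(std::vector<std::vector<Edge>>& out,
                       std::vector<size_t>& head,
                       size_t v0, std::vector<uint64_t>& A) {
  // Planning: follow first live out-edges until a node repeats (Lemma 3).
  std::vector<size_t> nodes{v0};
  std::vector<int> pos(out.size(), -1);
  pos[v0] = 0;
  for (size_t v = v0;;) {
    v = out[v][head[v]].dest;
    if (pos[v] >= 0) {                          // cycle is nodes[pos[v]..]
      nodes.erase(nodes.begin(), nodes.begin() + pos[v]);
      break;
    }
    pos[v] = (int)nodes.size();
    nodes.push_back(v);
  }
  size_t l = nodes.size(), wmin = SIZE_MAX;
  for (size_t v : nodes) wmin = std::min(wmin, out[v][head[v]].weight);
  // Executing: rotate the first wmin records of each region along the
  // cycle (serial here; independent orders j can be moved in parallel).
  for (size_t j = 0; j < wmin; j++) {
    uint64_t tmp = A[out[nodes[l - 1]][head[nodes[l - 1]]].start + j];
    for (size_t i = l - 1; i > 0; i--)
      A[out[nodes[i]][head[nodes[i]]].start + j] =
          A[out[nodes[i - 1]][head[nodes[i - 1]]].start + j];
    A[out[nodes[0]][head[nodes[0]]].start + j] = tmp;
  }
  // Graph update: shrink every cycle edge; a weight-zero edge is deleted
  // in O(1) by advancing the list head.
  for (size_t v : nodes) {
    Edge& e = out[v][head[v]];
    e.weight -= wmin;
    e.start += wmin;
    if (e.weight == 0) head[v]++;
  }
}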


Figure 3-2: (a) shows the value of the current radix of keys in the input for a 2-bit radix (퐵 = 4). At each step, the corresponding swaps for each cycle are shown in the array above the regions graph. (b) shows the corresponding regions graph of (a), and the cycle to process is highlighted in red. (c) and (d) show the array and regions graph after processing the highlighted cycle in (b). In particular, the edges (0, 2, 1) and (1, 0, 1) are deleted, and (2, 1, 2) is modified to (2, 1, 1). The next cycle to process is highlighted in (d). (e) and (f) show the input and regions graph after processing the highlighted cycle in (d). After processing the remaining cycle in (f), the array is completely sorted for the current recursion level, as shown in (g), with the regions graph having no edges, as shown in (h).

Independent records can be swapped in parallel. Let us define a record with order 푗 with respect to a cycle 푐 as a record whose index equals 푠푎 + 푗 for some edge 푎 on cycle 푐, where 0 ≤ 푗 < 푤푚푖푛. Using the execution definition, it is easy to see that all records with different orders with respect to a cycle are independent and can be moved in parallel. In subsection 4.3.1, I prove the span bound of the cycle finding algorithm. In section 5.1, I discuss an optimization that, on one hand, increases parallelism by executing multiple independent cycles concurrently, and on the other hand, increases the auxiliary space required by the algorithm.

32 3.4.2 2-path Finding

The second approach depends, in essence, on swapping a pair of equal-sized homogeneous sub-arrays, one of which belongs to country 푏 but resides in country 푎, and the other of which resides in country 푏 but belongs to country 푐. Swapping the two homogeneous sub-arrays puts the latter in its correct country and moves the former to a new place. I refer to such a pair of sub-arrays as a 2-path. In addition, I refer to country 푎 as the source, country 푏 as the broker, and country 푐 as the target. The naming comes from the way a 2-path appears in the regions graph. One of the homogeneous sub-arrays must be a subset of the region represented by the edge going from the source to the broker. The other must be a subset of the region represented by the edge going from the broker to the target. A 2-path 푝 = (푠1, 푠2, 푤, 푎, 푏, 푐) can be defined using a pointer to the start of the first sub-array 푠1, a pointer to the start of the second sub-array 푠2, the length of both sub-arrays 푤, the source 푎, the broker 푏, and the target 푐. However, in figures I refer to 2-paths using the edges that form them. This does not completely define a 2-path; however, it is sufficient for the examples presented in the figures and is convenient for clarity. Figure 3-3 illustrates how a 2-path looks in a regions graph and how the array and the corresponding regions graph change when that 2-path is executed.

Planning all 2-paths with the same broker 푏 can be done by matching all incoming records with all outgoing records. By Lemma 2, the number of incoming records equals the number of outgoing records. Let the number of incoming records of node 푏 (equivalently, the sum of the weights of its incoming edges) be 퐻푏. By indexing incoming records and outgoing records independently, each from 0 to 퐻푏 − 1, each incoming record can be mapped to the outgoing record that shares the same index. Since indexing individual records is inefficient, it is possible to simulate the same operation by indexing edges using a prefix sum on the weights. In other words, the algorithm performs two prefix sums, one on the weights of the incoming edges and another on the weights of the outgoing edges; I refer to the two sets of prefix sums as the incoming prefix sums and the outgoing prefix sums, respectively.


Figure 3-3: (a) and (b) show an example of processing a 2-path formed by the edges (1, 0, 1) and (0, 2, 1). Here vertex 1 is the source, vertex 0 is the broker, and vertex 2 is the sink. The result of processing this 2-path is shown in (c) and (d). The edges (1, 0, 1) and (0, 2, 1) are deleted, and a new edge (1, 2, 1) is formed (the dashed edge in (b)).

The index of each record equals the prefix sum of the edge corresponding to its region plus its index within the region itself. To save space and work, given a certain ordering of the incoming edges and outgoing edges, the number of 2-paths needs to be minimized by maximizing the weight of each 2-path. Therefore, each 2-path has one of its sub-arrays bordering a region on the left (start) and one of its sub-arrays bordering a region on the right (end); otherwise, both sub-arrays could be extended to include more homogeneous records. Consequently, each 2-path starts at one of the prefix sums and ends at the following prefix sum. By merging both sets of prefix sums, it is possible to calculate all the 2-paths. To update the graph, the algorithm deletes all the edges connected to the broker and adds new edges corresponding to homogeneous sub-arrays that moved but are still in the wrong country and are yet to be sorted.

Lemma 4. Executing a single 2-path, leads to adding at most one edge.

Proof. By construction, a 2-path 푝 = (푠1, 푠2, 푤, 푎, 푏, 푐) is a match between an incoming homogeneous sub-array and an outgoing homogeneous sub-array of equal length.

Therefore, the former sub-array will be placed correctly in its country, so no edge needs to be added to the graph to represent a correctly placed group of records. If 푎 = 푐, then the second sub-array was also placed in its correct country. If 푎 ̸= 푐, then a single edge from 푎 to 푐 suffices to represent the records, since they form a homogeneous sub-array residing in a single country, i.e., a region.

The 2-path finding algorithm can be defined using the same framework described earlier.

1. Planning: finding all 2-paths with a specific broker 푏.

2. Executing: swapping the two homogeneous sub-arrays of every 2-path in parallel.

3. Graph Update:

(a) Deleting used edges: delete all edges connected to node 푏.

(b) Adding new edges: for each 2-path, add a new edge, if necessary, to represent the homogeneous sub-array that was swapped but was not put into its correct country.

Figure 3-4 shows a demonstration of the 2-path finding algorithm on an example array.

The regions graph data structure for 2-path finding is slightly different from the one used for cycle finding. The 2-path algorithm is always interested in all edges connected to a given node. Once the node is processed and the corresponding country gets sorted, all edges connected to it are destroyed. This also implies that no node processed later will contain an edge connected to that node. Let 푟푎푛푘푖 be the order in which node 푖 is processed. For each node 푖, the data structure maintains two vectors of edges, one for outgoing edges and one for incoming edges. However, each node stores only the edges that connect it to nodes with a higher rank than itself. Thus, gathering all edges connected to a given node is very efficient, since such edges are stored in two contiguous arrays, which also increases data locality. A minimal sketch of this layout is given after Figure 3-4.


Figure 3-4: An illustration of how 2-paths incident on a broker are processed in parallel. The current broker being processed is circled in bold. At each step, the corresponding swap for each 2-path is shown in the array above the regions graph using the same color as the 2-path. In (b), the algorithm processes the single 2-path with vertex 0 as broker, which is (1, 0, 1), (0, 2, 1), and creates the dashed edge (1, 2, 1), resulting in (d). In (d), Regions sort processes all three 2-paths with vertex 1 as broker in parallel. The three 2-paths are (1, 2, 1), (2, 1, 2), which does not add any new edges; (3, 1, 1), (1, 2, 1), which creates the dashed edge (3, 2, 1); and (2, 1, 1), (1, 3, 1), which creates the dashed edge (2, 3, 1). This results in (f). Finally, in (f), the algorithm processes the single 2-path with vertex 2 as broker, which is (2, 3, 1), (3, 2, 1), and is left with an empty graph in (h), at which point the array is sorted.

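A minimal sketch of this rank-based layout, under the same illustrative Edge type as the earlier sketches:

#include <cstddef>
#include <vector>

// An edge is stored only at its lower-ranked endpoint, in the in- or
// out-vector matching its direction; rank holds the assumed processing
// order of the nodes.
struct TwoPathGraph {
  std::vector<std::vector<Edge>> in, out;  // Edge as in the earlier sketch
  std::vector<size_t> rank;                // processing order of each node

  explicit TwoPathGraph(size_t B) : in(B), out(B), rank(B, 0) {}

  void add_edge(size_t src, size_t dst, const Edge& e) {
    if (rank[src] < rank[dst]) out[src].push_back(e);  // src processed first
    else                       in[dst].push_back(e);   // dst processed first
  }

  // After country v is sorted, both of its contiguous lists are destroyed;
  // no later node stores an edge to v, so nothing else needs scanning.
  void destroy_node(size_t v) { in[v].clear(); out[v].clear(); }
};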

Planning and graph updating can be implemented either serially or in parallel. The parallel implementation gives better theoretical guarantees, while the serial implementation is more efficient in practice for smaller graphs. The serial implementation of planning is an efficient two-finger merge, with the prefix sums calculated implicitly in an on-line fashion. The algorithm starts with a counter 푐, representing the number of matched records, set to 0, and two pointers, one on an incoming edge and one on an outgoing edge. Iteratively, the prefix sums of the two edges are calculated. The maximum number of records that can be matched equals the smaller of the two edges' remaining weights, 푤푚푖푛. Both edges are updated to reflect that they lost 푤푚푖푛 records, which means at least one of them has weight 0 and can be skipped. Therefore, this algorithm takes time linear in the number of edges connected to the node, as in the sketch below.
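The following is a minimal C++ rendering of this two-finger pass (the matched-records counter is implicit in the consumed weights); the REdge and TwoPath layouts are illustrative, with each edge's other field holding its non-broker endpoint.

#include <algorithm>
#include <cstddef>
#include <vector>

struct REdge { size_t other, weight, start; };        // illustrative layout
struct TwoPath { size_t s1, s2, w, src, broker, dst; };

std::vector<TwoPath> plan_2paths(std::vector<REdge>& in,   // edges into b
                                 std::vector<REdge>& out,  // edges out of b
                                 size_t b) {
  std::vector<TwoPath> paths;
  size_t i = 0, o = 0;
  // By Lemma 2 the two weight sums are equal, so both lists are exhausted
  // together; each step consumes at least one edge, giving linear time.
  while (i < in.size() && o < out.size()) {
    size_t w = std::min(in[i].weight, out[o].weight);  // maximal match
    paths.push_back({in[i].start, out[o].start, w,
                     in[i].other, b, out[o].other});
    in[i].weight -= w;  in[i].start += w;              // consume w records
    out[o].weight -= w; out[o].start += w;
    if (in[i].weight == 0) i++;
    if (out[o].weight == 0) o++;
  }
  return paths;
}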

The parallel implementation of planning is work-efficient and gives a span poly-logarithmic in the number of edges connected to the node; it is obtained by parallelizing each step independently. A parallel prefix sum stores its result in an auxiliary array of size linear in the number of edges. Then, a parallel merge is sufficient to simulate the two-finger merge in a divide-and-conquer manner.

Parallelism is essential in the execution stage to guarantee a sub-linear span, and it can happen in two different ways. The first way is within the swap of a single pair of homogeneous sub-arrays. The second way is across different 2-paths, since all 2-paths are independent. This is because, by construction, all records in the incoming edges are matched only once and all records in the outgoing edges are matched only once. In addition, the graph does not allow self-edges, and a record can belong to either an incoming edge or an outgoing edge, but not both.

In the graph update stage, edges corresponding to swapped regions are deleted, and edges corresponding to newly created regions are added. In the implementation presented earlier, deleting old edges requires only the destruction of the two edge lists corresponding to the processed node, which takes constant time. The serial implementation of adding edges is efficient, since appending a new edge to an edge list takes constant time. Care is needed in the parallel implementation to avoid write races on the edge lists. Therefore, all edges to be added are first stored in an auxiliary array. Next, these edges are sorted according to the edge list they will be stored in, which is their lower-ranked endpoint. Since ranks range from 0 to 퐵 − 1 and the total number of edges is sub-linear, a parallel out-of-place radix sort is efficient here. For each edge list, space is allocated at its end equal to the number of edges to be added to it, and each new edge is given an index within the added edges. Therefore, each edge can be added to the correct edge list at the correct index in parallel without any races.

3.5 Recursion

After sorting records according to one radix, Regions sort recurses on each country independently. Regions sort has to decide the number of blocks to use when recursing on the smaller sub-problems at the next recursion level. Intuitively, it is beneficial to have blocks of equal sizes, as this gives better load balance in parallel. So Regions sort preserves the approximate ratio between the number of blocks in a sub-array and the number of records inside that sub-array. That is, for a country of size 푁, Regions sort assigns 퐾′ blocks to it such that 퐾′ = ⌈푁 · 퐾/푛⌉. For example, with 푛 = 10⁹, 퐾 = 1000, and a country of size 푁 = 10⁷, the recursive call uses 퐾′ = ⌈10⁷ · 1000/10⁹⌉ = 10 blocks. In section 4.4, I prove that this assignment does not blow up the work or space of the algorithm.

Chapter 4

Analysis

In this chapter, I analyze the algorithm described in the previous chapter by proving theoretical bounds on work, span, and space. I assume a work-stealing scheduler like the Cilk scheduler [8]. Furthermore, I assume 푛 = Ω(퐾퐵); otherwise, the input is padded to have size Ω(퐾퐵). In sections 4.1–4.3, I start by analyzing these bounds for one subproblem of size 푁 using 퐾′ = ⌈푁 · 퐾/푛⌉ blocks. In section 4.4, I proceed to analyze these bounds over all recursion levels.

4.1 Local Sorting

In the local sorting phase, the serial radix sort is applied to 퐾′ independent blocks, each of size 푂(푁/퐾′). Note that 퐾′ is at least 푁 · 퐾/푛; therefore, the size of each block is 푂(푛/퐾). Since serial radix sort does work linear in the size of its input, the total work done is 푂(푁). Since all of these blocks are independent, the span of the local sorting phase is 푂(푛/퐾). Each processor needs to store at least one pointer to keep track of the start of its block. In order to build the regions graph in the following phase, each block needs to keep track of the homogeneous sub-arrays inside it by storing a pointer to the start of each sub-array along with its size. Thus, 푂(log 푛) space is needed to represent each homogeneous sub-array. By construction of the serial radix sort algorithm, each block contains at most 퐵 homogeneous sub-arrays. Therefore, the total number of sub-arrays is at most 퐾′퐵, and the total space needed in the local sorting phase is 푂(퐾′퐵 log 푛).

4.2 Graph Construction

All regions graph implementations require linear work in the number of edges and nodes. There exist exactly 퐵 nodes, and by Lemma 1, there exist 푂(퐾′퐵) edges for a subproblem using 퐾′ blocks. Thus, a regions graph requires 푂(퐾′퐵) space. In order to calculate the total space needed for regions graphs across one recursion level, it is important to bound the number of blocks per recursion level (this is done in section 4.4). Since each edge stores at least one pointer, the total space needed for the graph construction phase is 푂(퐾′퐵 log 푛). It is possible to build the graph in parallel by adding edges concurrently, which leads to a span of 푂(log(퐾′퐵)).

4.3 Global Sorting

In this section, I analyze the two global sorting approaches separately.

4.3.1 Cycle Finding

The cycle finding algorithm repeatedly finds a cycle, executes it, and updates the graph. The following lemma proves that the total number of cycles is sub-linear.

Lemma 5. The number of cycles is 푂(퐾′퐵).

Proof. After detecting a cycle with minimum edge weight 푤푚푖푛, 푤푚푖푛 is subtracted from each edge on the cycle. By definition, at least one edge in the cycle has weight equal to 푤푚푖푛. Thus, this edge's weight falls to zero, and the edge is deleted from the graph. Therefore, the number of cycles cannot exceed the number of edges. By Lemma 1, the total number of cycles is 푂(퐾′퐵).

In both the planning and graph update stages, the algorithm performs constant work per edge in the cycle. Therefore, the total work done in the two stages is proportional to the sum of the lengths of all cycles. Lemmas 3 and 5 show that the total work done in finding cycles is 푂(퐾′퐵²). However, this bound is not tight and can be misleading. Indeed, a tighter bound is 푂(min(퐾′퐵², 푁)), since each edge in each planned cycle must have a non-zero weight, so the total number of edges over all cycles cannot be larger than the total number of misplaced records. Consequently, planning and graph updates never blow up the work. However, the drawback of the cycle finding algorithm is that its span is the same as its work, since cycle finding is done serially. In the execution stage, each cycle needs 푂(퐵) span, since each edge can be perfectly parallelized, giving 푂(min(퐾′퐵², 푁)) span in total. The execution stage needs 푂(푁) work in the case where a linear number of records remain unsorted. Since one cycle is processed at a time, only one cycle needs to be stored, and storing a cycle requires 푂(log 푛) space per edge. In sum, the cycle finding algorithm needs 푂(퐵 log 푛) space, 푂(min(퐾′퐵², 푁)) span, and 푂(푁) work.

4.3.2 2-path finding

The 2-path finding algorithm repeatedly picks a country to sort, finds all 2-paths needed to sort the country completely, updates the graph, and then moves to another node. I first analyze the complexity of sorting one country by extracting all 2-paths passing through the corresponding node, and then I sum over all nodes. By definition of the 2-path finding algorithm, the number of edges connected to a single node changes over time, since some edges are added and others are removed. In this analysis, I use 퐸푖 to refer to the number of edges connected to node 푖 at the time node 푖 is processed by the global sorting algorithm, and 퐻푖 to refer to the sum of their weights.

The planning phase requires a prefix sum and a merge. Both the parallel and serial implementations require 푂(퐸푖) work. The space required to store the 2-paths is 푂(퐸푖 log 푛), since each path needs to store at least one pointer into the array. The span of the parallel version is 푂(log 퐸푖).

The work of the execution stage is 푂(퐸푖 + 퐻푖), where 퐻푖 is the total number of records that belong to country 푖 but are misplaced. The span needed is 푂(1), since all 퐻푖 misplaced records can be swapped independently and all 퐸푖 2-paths are executed independently. The space needed is just one pointer per processor, which is 푂(푃 log 푛).

The work done by updating the graph after processing node 푖 is 푂(퐸푖), since each 2-path can add at most one edge to the graph by Lemma 4. Since the edge lists corresponding to node 푖 are destroyed after processing node 푖, no additional space is needed, as the lemma also shows that the number of edges deleted is at least the number of edges added to the graph. The parallel algorithm for adding edges requires only a parallel out-of-place radix sort before adding the edges. Therefore, it needs 푂(퐸푖) additional space.

Therefore, processing a single node requires 푂(퐸푖 + 퐻푖) work, 푂(log 퐸푖) span, and 푂(퐸푖) additional space. Summing over all nodes gives the total work and span. For space, however, taking the maximum is sufficient, since no memory is carried over from one global sorting iteration to another.

Lemma 6. After processing a node, the number of edges in the graph does not increase.

Proof. By construction, a 2-path corresponds to a difference between two consecutive values in the array resulting from merging the incoming prefix sums and the outgoing prefix sums. The number of prefix sums cannot exceed 퐸푖 + 2, and the number of distinct prefix sums cannot be more than 퐸푖, since the values 0 and 퐻푖 appear in both the incoming prefix sums and the outgoing prefix sums. Therefore, the number of 2-paths, and hence the number of added edges (Lemma 4), cannot exceed 퐸푖. On the other hand, after processing a node, all 퐸푖 edges originally connected to it are deleted from the graph. Therefore, the number of edges cannot increase after processing a node.

It is important to note that Lemma 6 implies that the total number of edges in the graph cannot increase. Therefore, 퐸푖 is 푂(퐾′퐵). Note that, since the graph is dynamic, Σ_{푖=0}^{퐵−1} 퐸푖 is Ω(퐾′퐵), not 푂(퐾′퐵). A naive summation shows that Σ_{푖=0}^{퐵−1} 퐸푖 is 푂(퐾′퐵^2). Nevertheless, 퐸푖 ≤ 퐻푖, since each edge in 퐸푖 has non-zero weight. Therefore, Σ_{푖=0}^{퐵−1} 퐸푖 is 푂(min(퐾′퐵^2, 푁)). This result is important since it means that the work done in global sorting is never more than the work done by serial radix sort. In conclusion, the 2-path finding algorithm requires 푂(푁) work, 푂(퐵 log(퐾′퐵)) span, and 푂(퐾′퐵) auxiliary space.
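Stated as one chain, using Σ_{푖} 퐻푖 ≤ 푁 (each misplaced record is counted once) and 퐸푖 = 푂(퐾′퐵) over the 퐵 processed nodes:

    \[
      \sum_{i=0}^{B-1} E_i \;\le\; \min\Bigl(B \cdot O(K'B),\; \sum_{i=0}^{B-1} H_i\Bigr) \;=\; O(\min(K'B^2,\, N)).
    \]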

4.4 Overall algorithm

It is important to note that the total number of blocks per recursion level may be more than 퐾, even though the algorithm operates on all 푛 records at each recursion level. This is a by-product of the ceiling function used to calculate 퐾′. However, in this section I prove that this increase does not negatively impact the space and work of the algorithm.
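As a small illustration (a sketch, not the thesis code), 퐾′ for a subproblem of size 푁 could be computed as follows; the ceiling is what guarantees every nonempty subproblem at least one block, and hence what lets the per-level block count exceed 퐾:

    #include <cstddef>

    // K' = ceil(N * K / n), where K and n are the top-level parameters.
    std::size_t num_blocks(std::size_t N, std::size_t K, std::size_t n) {
      return (N * K + n - 1) / n;
    }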

Lemma 7. The auxiliary space required by local sorting, graph construction, and global sorting across all recursive calls is 푂((퐾 + 푃)퐵 log 푛).

Proof. Using a work-stealing scheduler, such as Cilk's [8], at most 푃 tasks can be executed in parallel using 푃 processors. Let processor 푖 be working on a subproblem of size 푁푖 with 퐾′_푖 blocks. The space needed by processor 푖 is therefore 푂(퐾′_푖 퐵 log 푛) = 푂(⌈푁푖 · 퐾/푛⌉퐵 log 푛). Since recursion happens only after completing the sort and freeing the auxiliary space needed, the processors must be operating on non-intersecting sub-arrays. Therefore, Σ_{푖=1}^{푃} 푁푖 = 푂(푛), and the total space needed by the 푃 processors is 푂(Σ_{푖=1}^{푃} ⌈푁푖 · 퐾/푛⌉퐵 log 푛) = 푂(퐵 log 푛 · Σ_{푖=1}^{푃} (푁푖 · 퐾/푛 + 1)) = 푂((퐾 + 푃)퐵 log 푛).

Lemma 8. The total work done by local sorting, regions graph construction, and global sorting is 푂((푛 + 퐾퐵)(log_퐵 푟 + log 퐵)).

Proof. Each subproblem has 퐾′ = ⌈푁 · 퐾/푛⌉ for a problem of size 푁. There are two possibilities: 퐾′ = 1 and 퐾′ ≠ 1. If 퐾′ = 1, then the total work done to sort the records is 푂(푁 + 퐵) for radix sort or 푂(푁 log 퐵) for comparison sort. It is beneficial to choose the former if 푁 = Ω(퐵), and the latter otherwise. After sorting the records, there is no need to build a regions graph or to globally sort, since the array has a single block. If 퐾′ ≠ 1, then ⌈푁 · 퐾/푛⌉ = 푂(푁 · 퐾/푛). Summing over all subproblems with 퐾′ > 1 amounts to 푂((퐾퐵 + 푛) log_퐵 푟). Summing both results gives 푂((푛 + 퐾퐵)(log_퐵 푟 + log 퐵)).

Theorem 9. Using the cycle finding algorithm for global sorting, the algorithm requires 푂((푛 + 퐾퐵)(log_퐵 푟 + log 퐵)) work, 푂((푛/퐾 + 퐾퐵^2) log_퐵 푟) span, and 푂((푃퐵 log_퐵 푟 + 퐾퐵) log 푛) auxiliary space.

Theorem 10. Using the 2-path finding algorithm for global sorting, with parallel planning and graph updates, the algorithm requires 푂((푛 + 퐾퐵)(log_퐵 푟 + log 퐵)) work, 푂((푛/퐾 + 퐵(log(퐾퐵) + 퐵)) log_퐵 푟) span, and 푂((푃퐵 log_퐵 푟 + 퐾퐵) log 푛) auxiliary space.

The value of 퐾 trades off span and space. When setting 퐾 to Θ(푃) and 퐵 to a constant, the 2-path algorithm requires 푂(푛 log 푟) work, 푂((푛/푃 + log 푃) log 푟) span, and 푂(푃 log 푟 log 푛) auxiliary space. The algorithm scales linearly with the number of processors up to 푃 = 푂(푛/ log 푛). However, it is necessary that 푃 = 표(푛/(log 푛 log 푟)) for the algorithm to be in-place. The following theorem summarizes these bounds.

Theorem 11. Using the 2-path finding algorithm for global sorting with 퐾 = Θ(푃) and 퐵 = Θ(1), Regions sort takes 푂(푛 log 푟) work, 푂((푛/푃 + log 푃) log 푟) span, and 푂(푃 log 푟 log 푛) auxiliary space.

Chapter 5

Optimizations

The following optimizations are included in the implementations of the two variants of Regions sort. All of the listed optimizations improve performance in practice; nevertheless, some of them have a negative effect on the theoretical guarantees. I describe each optimization and discuss its effect on performance and on the theoretical guarantees of work, span, and space, if any.

5.1 Batching Cycles

Instead of planning, executing, and updating one cycle at a time, multiple cycles can be batched together. Specifically, when sorting a node, all cycles connected to it are processed, executed, and updated at the same time. This increases parallelism in the execution of cycles. On the other hand, it requires storing a pointer corresponding to each edge in each cycle, which totals 푂(퐾퐵^2 log 푛) additional space.

5.2 Prioritizing 2-cycles

A 2-cycle is a 2-path whose edges form a cycle of length 2. When executing a 2-path that is not a 2-cycle, only one sub-array will be sorted and the other will not; therefore, an edge is added to represent the homogeneous sub-array that still needs to be sorted. When executing a 2-cycle, both sub-arrays involved in the swap will be put in the correct country. Therefore, the 2-path finding algorithm prioritizes 2-cycles. Finding 2-cycles can be done by sorting the incoming and outgoing edge arrays according to the node at the other end of the edges. Figure 5-1 shows how prioritizing 2-cycles works; comparing Figure 5-1 with Figure 3-4 illustrates that prioritizing 2-cycles helps work and span in practice.

Figure 5-1: An illustration of how 2-paths incident on a broker are processed in parallel. The current broker being processed is circled in bold. At each step, the corresponding swap for each 2-path is shown in the array above the regions graph using the same color as the 2-path. In (b), the algorithm processes the single 2-path with vertex 0 as broker, which is (1, 0, 1), (0, 2, 1), and creates the dashed edge (1, 2, 1), resulting in (d). In (d), Regions sort processes all three 2-paths with vertex 1 as broker in parallel. The three 2-paths are (1, 2, 1), (2, 1, 2); (3, 1, 1), (1, 3, 1); and (2, 1, 1), (1, 2, 1), each of which does not add any new edges. This results in (e), at which point the array is sorted and the algorithm is left with an empty graph in (f).
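A minimal sketch of the matching step described above is shown below; the Edge record is the same hypothetical one as in the planning sketch, and plan_two_cycle is a hypothetical helper.

    #include <algorithm>
    #include <vector>

    struct Edge { int other; long weight; };  // as in the planning sketch

    void plan_two_cycle(int other, long w);   // hypothetical: schedule the swap

    // Sort both edge lists of the broker by the node at the far end, then walk
    // them together; an incoming and an outgoing edge meeting at the same far
    // node form a 2-cycle, and min(w_in, w_out) records can be swapped so that
    // both sides land directly in their correct countries.
    void find_two_cycles(std::vector<Edge>& in, std::vector<Edge>& out) {
      auto by_other = [](const Edge& a, const Edge& b) { return a.other < b.other; };
      std::sort(in.begin(), in.end(), by_other);
      std::sort(out.begin(), out.end(), by_other);
      size_t i = 0, j = 0;
      while (i < in.size() && j < out.size()) {
        if (in[i].other < out[j].other) { i++; continue; }
        if (out[j].other < in[i].other) { j++; continue; }
        long w = std::min(in[i].weight, out[j].weight);
        plan_two_cycle(in[i].other, w);
        if ((in[i].weight -= w) == 0) i++;
        if ((out[j].weight -= w) == 0) j++;
      }
    }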

5.3 Early Recursion

By definition, once a country is sorted, it does not contain any elements that do not belong to it. Thus, it is possible to recurse on it independently, before sorting the rest of the countries. This optimization increases parallelism in the algorithm, especially when coupled with the country sorting optimization. It multiplies the space bound proved in Lemma 7 by 푂(log_퐵 푟), because each of the 푃 active tasks can have up to log_퐵 푟 ancestors. Therefore, the total auxiliary space usage after this optimization is 푂((푃 + 퐾)퐵 log_퐵 푟 log 푛).

5.4 Country Sorting

The global sorting algorithm is correct regardless of the order in which nodes are processed. Processing the nodes corresponding to the countries with the largest numbers of elements first is beneficial when the early recursion optimization is activated, since some threads can recurse on those bigger countries while other threads continue processing smaller countries. This optimization is particularly useful with biased distributions, since the largest of the 퐵 countries is then a big portion of the array. Sorting the 퐵 countries requires Θ(퐵 log 퐵) work using a comparison sort. To maintain the same theoretical guarantees, country sorting could be applied only when the problem size is Ω(퐵).
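A minimal sketch of this ordering step, assuming a count array holding the number of records per country:

    #include <algorithm>
    #include <numeric>
    #include <vector>

    // Order the B countries by decreasing size so that the biggest countries
    // are processed (and can recurse) first; Theta(B log B) comparison-sort
    // work, as noted above.
    std::vector<int> country_order(const std::vector<long>& count) {
      std::vector<int> order(count.size());
      std::iota(order.begin(), order.end(), 0);
      std::sort(order.begin(), order.end(),
                [&](int a, int b) { return count[a] > count[b]; });
      return order;
    }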

5.5 Coarsening

It is beneficial to switch to serial implementations when problem sizes are small, to avoid the overhead of parallelism and false sharing, and for better cache locality. When 퐾′ = 1 or 푁 ≤ 20000, serial Ska Sort, a highly optimized in-place radix sort, is used [33]. Furthermore, when 푁 is less than 32, insertion sort is used, to avoid recursing on small subproblems and to avoid doing work proportional to 퐵 when the number of records is significantly less than 퐵. In addition, when a parallel for loop makes fewer than 4000 iterations, a serial for loop is used instead. Furthermore, the implementation contains many for loops that operate on consecutive memory; one example is the for loop that swaps two sub-arrays. These for loops are coarsened such that every few records are processed by the same thread. This can, of course, decrease parallelism; however, it proved to be advantageous due to increased data locality and reduced false sharing.
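The base-case dispatch implied by these thresholds looks roughly as follows (a sketch; the helper names are hypothetical stand-ins for the actual routines):

    #include <cstddef>

    struct Record;                                  // record type as elsewhere
    void insertion_sort(Record*, std::size_t);      // hypothetical helpers
    void ska_sort(Record*, std::size_t);            // serial Ska Sort [33]
    void regions_sort_level(Record*, std::size_t);  // parallel path

    void sort_dispatch(Record* A, std::size_t N, std::size_t K_prime) {
      if (N < 32)
        insertion_sort(A, N);             // avoid work proportional to B
      else if (K_prime == 1 || N <= 20000)
        ska_sort(A, N);                   // serial in-place base case
      else
        regions_sort_level(A, N);
    }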

Chapter 6

Evaluation

6.1 Experiment Setup

6.1.1 Experiment Environment

All test cases are run on an AWS EC2 c5.18xlarge instance with two 18-core processors with hyper-threading (Intel Xeon Platinum 8124M at 3.00 GHz with a 24.75 MB L3 cache), for a total of 36 cores and 72 hyper-threads. The system has a total of 144 GB of RAM. The code is compiled using g++ version 7.3.1 (which has support for Cilk Plus [22]); Cilk Plus has a provably-efficient work-stealing scheduler [8]. In addition, the -O3 flag was used.

6.1.2 Implementation Details

The code considers 푑 bits at each level of recursion, such that 퐵 = 2^푑. 푑 and 퐵 are known at compile time, which allows a number of arrays to be allocated statically. 퐵 is set to 2^8 = 256 in all of the tests, except the test where 퐵 is the independent variable. Regions sort sets 퐾 = 5000 when using the 2-path finding algorithm and 퐾 = 푃 when using the cycle finding algorithm, except in the test where 퐾 is the independent variable. This distinction is made to optimize each algorithm. For the 2-path finding algorithm, increasing 퐾 improves locality, since each sub-array can fit in cache. For the cycle finding algorithm, 퐾 is set to the minimum value that allows for linear speedup; this is because as 퐾 increases, the number of edges in the graph increases and the weight on each of them decreases, and parallelism in the cycle finding algorithm is bottlenecked by the minimum edge weight on each cycle.

It was found that the serial implementation of planning and graph updates for the 2-path finding algorithm is more efficient than the parallel implementation. This can be attributed to the number of edges being small: the overhead of parallelism is more expensive than the efficient serial implementation. Therefore, the results in this section reflect only this implementation.
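For illustration, a sketch of how fixing 푑 at compile time allows static allocation (the type names are hypothetical):

    #include <cstddef>

    template <int d>
    struct LevelState {
      static constexpr std::size_t B = std::size_t{1} << d;  // number of countries
      std::size_t counts[B];    // per-country counts, allocated statically
      std::size_t offsets[B];   // per-country offsets, allocated statically
    };

    using DefaultLevel = LevelState<8>;  // B = 2^8 = 256, as in the experiments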

6.1.3 Test Inputs

The datasets used include inputs of only integer keys as well as inputs of integer key-value pairs. Tests include both 4-byte and 8-byte keys/values (inputs with larger values can be sorted by storing a pointer to the value). The inputs also cover different distributions. The test files contain inputs with a uniform distribution, where all inputs are equally likely. To test skewed inputs, the test files include inputs with a Zipfian distribution, where the skew is controlled by the 휃 parameter [14]; a generation sketch for these inputs follows the list below. The Zipfian distribution was previously used to test the efficiency of parallel algorithms like PARADIS [10] and RADULS [19, 20]. The inputs encompass multiple values of 휃: 0.25, 0.5, 0.75, and 1.00. The test files include some degenerate inputs as well. The types of inputs used in the datasets are as follows:

1. unif(푥, 푦): represents an input drawn uniformly from range 푥, with length 푦.

2. zipf_휃(푥, 푦): represents an input of length 푦 drawn from a Zipfian distribution of range 푥 with parameter 휃.

3. allEqual(10^9): represents an input of length 10^9 with all records equal.

4. √푛Equal(10^9): represents an input with 10^(9/2) copies of 10^(9/2) distinct values placed equally apart.

5. sorted(10^9): represents a sorted input of the integers from 1 to 10^9.

6. almostSorted(10^9): represents an almost-sorted input of length 10^9 with 10^(9/2) random elements out of place.
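For illustration, here is a minimal sketch of generating zipf_휃(푥, 푦) via an inverse-CDF table, where key 푘 (1-based) has probability proportional to 1/푘^휃; this table-based version is only practical for modest ranges 푥, whereas the actual inputs follow the generator of Gray et al. [14]:

    #include <algorithm>
    #include <cmath>
    #include <cstdint>
    #include <random>
    #include <vector>

    std::vector<uint64_t> zipf_input(uint64_t x, uint64_t y, double theta) {
      std::vector<double> cdf(x);   // cumulative (unnormalized) probabilities
      double sum = 0.0;
      for (uint64_t k = 0; k < x; k++)
        cdf[k] = (sum += 1.0 / std::pow(static_cast<double>(k + 1), theta));
      std::mt19937_64 rng(42);
      std::uniform_real_distribution<double> U(0.0, sum);
      std::vector<uint64_t> a(y);
      for (auto& v : a)  // invert the CDF with a binary search per draw
        v = std::lower_bound(cdf.begin(), cdf.end(), U(rng)) - cdf.begin();
      return a;          // values are 0-based key indices in [0, x)
    }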

6.2 Tested Algorithms

The only existing parallel in-place radix sorting algorithm is PARADIS [10]. However, its code is not publicly available online, and it was not possible to obtain it from the authors. Therefore, I present a rough comparison of the two algorithms' performances while accounting for the difference in testing environments. In addition, I compare with RADULS [19, 20], a parallel out-of-place radix sort, whose authors also present a rough comparison with PARADIS [10] showing RADULS's superiority. Therefore, performing better than RADULS indirectly implies superiority over PARADIS as well.

Aside from RADULS [19, 20], the implementations tested include: the state-of-the-art serial in-place radix sorting algorithm Ska Sort [33], the recent parallel in-place sample sort IPS4o by Axtmann et al. [2], a parallel out-of-place radix sort from the Problem Based Benchmark Suite (PBBS) [31], a parallel out-of-place sample sort from PBBS [5], and a parallel out-of-place comparison sorting algorithm from MCSTL [32].

6.3 Results

6.3.1 Results Overview

Tables 6.1 and 6.2 show the single-thread times, 72-thread times, and parallel speedups of all of the algorithms on a collection of inputs of length 10^9. Looking at the tables, the 2-path finding variant always outperforms the cycle finding variant, except on the allEqual(10^9) input. However, this case is misleading: for this input all records are equal, so all regions graphs are empty, and both implementations therefore perform exactly the same steps, except that the value of 퐾 is bigger for the 2-path finding algorithm.

Input                        | Ska Sort |  2-path radix sort   |  Cycle radix sort    |        IPS4o          |       RADULS
                             |   푇1    |  푇1    푇36ℎ    SU   |  푇1    푇36ℎ    SU   |  푇1     푇36ℎ    SU   |  푇1    푇36ℎ    SU
unif(10^9, 10^9)             |  23.02  | 14.19  0.49*  28.96  | 19.48   0.72  27.06  | 33.52    0.93  36.04  | 27.79   1.79  15.53
unif(10^9, 10^9)-pairs       |  34.59  | 20.05  0.84*  23.87  | 27.66   1.31  21.11  | 139.44   3.01  46.33  | 41.11   3.49  11.78
unif(2^63, 10^9)             |  38.12  | 32.27   1.00  32.27  | 92.12   1.42  64.87  | 35.99    1.15  31.30  | 15.15  0.78*  19.42
unif(2^63, 10^9)-pairs       |  54.91  | 40.05  1.78*  22.50  | 119.80  2.37  50.55  | 144.22   3.32  43.44  | 23.12   2.05  11.28
zipf_0.75(10^9, 10^9)        |  22.02  | 14.81  0.55*  26.93  | 20.05   0.70  28.64  | 29.66    0.82  36.17  | 28.86   1.93  14.95
zipf_0.75(10^9, 10^9)-pairs  |  33.04  | 20.61  0.98*  21.03  | 29.32   1.32  22.21  | 139.30   3.03  45.97  | 43.81   3.64  12.04
allEqual(10^9)               |   6.60  |  6.77  0.20*  33.85  |  6.63   0.22  30.14  |  2.32    0.29   8.00  | 30.29   1.87  16.20
√푛Equal(10^9)                |  15.07  | 11.43   0.43  26.58  | 12.56   0.66  19.03  | 13.71   0.42*  32.64  | 28.47   2.28  12.49
sorted(10^9)                 |  14.54  | 14.51  0.31*  46.81  | 16.36   0.38  43.05  | 24.60    0.62  39.68  | 30.25   1.67  18.11
almostSorted(10^9)           |  18.47  | 16.41  0.36*  45.58  | 17.30   0.39  44.36  | 25.16    0.62  40.58  | 28.55   2.27  12.58

Table 6.1: Serial times (푇1) and parallel (푇36ℎ) times (seconds), as well as the parallel speedup (SU), on a 36-core machine with two-way hyper-threading. The fastest parallel time for each input is marked with an asterisk.

Input                        |   PBBS radix sort    |   PBBS sample sort   |      MCSTL sort
                             |  푇1    푇36ℎ    SU   |  푇1    푇36ℎ    SU   |  푇1     푇36ℎ    SU
unif(10^9, 10^9)             | 14.40   0.62  23.23  | 81.41   1.81  44.98  | 100.36   2.54  39.51
unif(10^9, 10^9)-pairs       | 28.00   1.23  22.76  | 92.16   2.34  39.38  | 116.94   4.27  27.39
unif(2^63, 10^9)             | 42.60   1.50  28.40  | 136.00  2.93  46.41  | 103.12   4.62  22.32
unif(2^63, 10^9)-pairs       | 59.60   2.96  20.14  | 148.00  3.31  44.71  | 119.50   7.71  15.50
zipf_0.75(10^9, 10^9)        | 14.70   0.68  21.62  | 77.57   1.74  44.58  | 94.52    2.54  37.21
zipf_0.75(10^9, 10^9)-pairs  | 36.50   1.33  27.44  | 92.71   2.57  36.07  | 120.09   4.58  26.22
allEqual(10^9)               | 17.70   0.88  20.11  | 22.84   0.77  29.66  | 13.65    0.84  16.25
√푛Equal(10^9)                | 16.70   0.60  27.83  | 39.22   1.36  28.84  | 30.85    1.23  25.08
sorted(10^9)                 | 14.54   0.60  24.50  | 42.20   1.25  33.76  | 19.65    0.94  20.90
almostSorted(10^9)           | 16.60   0.60  27.67  | 43.70   1.25  34.96  | 22.62    2.62   8.63

Table 6.2: Serial times (푇1) and parallel (푇36ℎ) times (seconds), as well as the parallel speedup (SU), on a 36-core machine with two-way hyper-threading. This table is a continuation of Table 6.1, and results in both tables are comparable.

Therefore, the result on this input does not reflect a superiority of cycle finding over 2-path finding, but rather the convenience of having a small 퐾 when there is no sorting work to be done. In such a case, a smaller 퐾 leads to less work in calculating the country sizes.

The two variants of Regions sort achieve parallel speedups of between 19x and 65x on 36 cores with hyper-threading. The scalability bottleneck is data movement, which has less than perfect scalability when run in parallel due to memory system bottlenecks.

Ska Sort is used as a base case for sub-arrays that are not big enough for parallelism but not small enough to be sorted with insertion sort. However, a single-threaded execution of the 2-path finding algorithm is usually faster than the highly-optimized serial Ska Sort. Although the algorithm incurs more work to build the regions graph, it outperforms the original Ska Sort due to better cache locality. Cache locality is well exploited in the local sorting phase, since the array is divided into sub-arrays of smaller sizes that fit in cache. The 2-path finding algorithm also has good data locality from swapping pairs of contiguous sub-arrays at a time.

IPS4o is the fastest publicly-available parallel in-place sorting algorithm, and Regions sort achieves a 1.1–3.6x speedup over it in parallel across all inputs, except for √푛Equal(10^9), on which Regions sort is about 1.02x slower. IPS4o, described in more detail in Chapter 2, is a comparison sorting algorithm and is therefore more general than radix sort, but it has different theoretical bounds because it incurs more work to decide which bucket each key belongs to. The two variants of Regions sort, as well as IPS4o, use a negligible amount of auxiliary memory compared to the original input size (less than 5%). PBBS sample sort and MCSTL sort are consistently slower than IPS4o and Regions sort, and require auxiliary space proportional to the input size. Again, PBBS sample sort and MCSTL sort are comparison sorting algorithms, which are more general. Compared to PBBS radix sort, which is a parallel out-of-place radix sorting algorithm, the 2-path variant of Regions sort is 1.22–4.3x faster in parallel due to having a lower memory footprint and less data movement. Compared to RADULS, another out-of-place radix sorting algorithm, the presented algorithm is between 1.3x slower and 9.4x faster. 2-path finding is only slightly slower than RADULS on the inputs with 8-byte key ranges, which might be due to RADULS being optimized for very large key ranges.

Since the PARADIS code is not publicly available, I present a rough comparison based on the numbers reported in their paper [10]. On an input of 8-byte key-value pairs of size 10^9, PARADIS reports a running time of approximately 6 seconds using 16 cores of two Intel Xeon E7-8837 processors running at 2.67 GHz. On 16 cores without hyper-threading, the 2-path variant of Regions sort takes 3 seconds on unif(2^63, 10^9). I do not know the range of the keys PARADIS used, so these experiments used a very large key range. I believe that even accounting for differences in machine specifications, Regions sort would still be faster. In addition, Regions sort outperforms RADULS, whose authors also claim to beat PARADIS according to similar calculations. Furthermore, Regions sort has much stronger guarantees on work and span than PARADIS.

6.3.2 Parallel Scalability

Figures 6-1 and 6-2 show the relative performance of the algorithms as a function of thread count for unif(10^9, 10^9) and zipf_0.75(10^9, 10^9), respectively. The performance baseline is the single-threaded 2-path finding algorithm, since it is the fastest integer sorting algorithm among those tested. All algorithms achieve a reasonable speedup as the thread count increases up to 36 cores, except RADULS, whose performance increases up to 16 cores and degrades afterwards. The 2-path finding algorithm clearly outperforms all of the other algorithms at all thread counts.

6.3.3 Input Size Scalability

Figure 6-3 shows the performance of all of the algorithms on random inputs of varying size, with a range of 10^8. As expected, the running times of all of the algorithms increase with the input size. The 2-path finding algorithm is always the fastest on all of the different input sizes.

Figure 6-1: Speedup over single-thread performance of 2-path vs. thread count for unif(10^9, 10^9).

6.3.4 Effect of Key Range

Figures 6-4 and 6-5 illustrate the running time of the algorithms as a function of the range of the input keys. As expected, the running time of each algorithm increases with the key range. However, the increase for both variants of the presented algorithm, as well as for PBBS radix sort, is sub-linear in log 푟, the worst-case number of levels of recursion, because they switch to comparison sort when the input size is small enough. The sample sorting algorithms (IPS4o and PBBS sample sort) also get slower with larger key ranges because the number of recursive calls increases due to a larger number of unique keys (buckets with equal keys do not need to be recursed on).

Figure 6-2: Speedup over single-thread performance of 2-path vs. thread count for zipf_0.75(10^9, 10^9).

6.3.5 Effect of Skewness in Zipfian Distributions

Figure 6-6 shows the effect of varying the 휃 parameter of the input zipf_휃(10^9, 10^9) on performance. A larger 휃 parameter increases the skew of the distribution. The figure shows that the performance of both the 2-path finding and cycle finding algorithms is robust across different 휃 values, as the number of levels of recursion depends mostly on the range, which is fixed in this experiment. The comparison sorting algorithms perform better when the skew is higher, because having fewer unique keys reduces the number of recursive calls needed. The 2-path variant of Regions sort still performs the best across different 휃 values.

Figure 6-3: 36-core running time vs. input size 푛 for unif(10^8, 푛).

6.3.6 Breakdown of Running Time

Figure 6-7 shows the breakdown of the running time of the 2-path finding algorithm on the unif(10^9, 10^9) and zipf_0.75(10^9, 10^9) inputs. The running time is broken down into the various phases of the first level of recursion, with the time for the remaining levels grouped together. Since the code runs in parallel and the early recursion optimization is used, some threads can be in the first level while others are in deeper levels. To calculate the time spent in the first level, the breakdown considers all the time during which at least one thread was doing work in the first level of the recursion. In effect, the breakdown illustrates the span of the first level compared to the span of the whole algorithm.

For these two inputs, there are four levels of recursion, since the radix size is 8 bits. Figure 6-7 shows that the first level of radix sort takes a significant fraction of the total time (over half for unif(10^9, 10^9) and almost half for zipf_0.75(10^9, 10^9)). The remaining levels of recursion are relatively faster, since the problem size decreases and is more likely to fit in cache.

Figure 6-4: 36-core running time vs. range 푟 for unif(푟, 10^9).

Within the first level, local sorting and performing the swaps associated with 2-paths dominate the running time. Finding the 2-paths (labeled "First level planning") and updating the graph (labeled "First level update") take very little time (under 10% of the first-level time). This is why using the parallel versions of planning and graph updating does not improve the running time even in the best case, assuming linear speedup and the same number of cache misses.

Figure 6-5: 36-core running time vs. range 푟 for zipf_0.75(푟, 10^9).

6.3.7 Effect of Radix Size

Figures 6-8 and 6-9 show the impact of varying the number of bits of the radix (the radix size) on the performance of the two variants of Regions sort on unif(10^9, 10^9) and zipf_0.75(10^9, 10^9), respectively. Using 8-bit radices yields the best performance, but the performance with 6-bit and 7-bit radices is almost as good. For smaller radices, more passes are needed before the problem size fits into cache, which leads to worse performance. Increasing the number of bits beyond 8 makes the buckets during local sorting less likely to fit in the L1 cache (32 KB on the testing machine), which causes performance degradation.

Figure 6-6: 36-core running time vs. 휃 for zipf_휃(10^9, 10^9).

Figure 6-7: Breakdown of 36-core running time for 2-path on unif(10^9, 10^9) and zipf_0.75(10^9, 10^9), into the first-level local sort, planning, swaps, and graph updates, plus the remaining levels.

6.3.8 Effect of 퐾

Figures 6-10 and 6-11 show the running time for different values of 퐾 for the 2-path finding algorithm on unif(10^9, 10^9) and zipf_0.75(10^9, 10^9), respectively. The plots show that the running time for large values of 퐾 is much lower than for small values of 퐾. For small values of 퐾, the local sorts in the first level are less likely to fit in cache, which increases the overall running time.

Figure 6-8: 36-core running time vs. radix size for unif(10^9, 10^9).

6.4 Conclusion

In this thesis, I have introduced Regions sort and used it to implement a novel parallel in-place radix sorting algorithm that is work-efficient and has sub-linear auxiliary space and span. Furthermore, the experiments show that Regions sort achieves good scalability and almost always outperforms existing parallel in-place sorting algorithms.

Figure 6-9: 36-core running time vs. radix size for zipf_0.75(10^9, 10^9).

Berney et al. present parallel in-place algorithms for constructing implicit search tree layouts from sorted sequences [3]. This research can be used to efficiently generate sorted sequences of integers in-place, which can be used as inputs to their algorithms. One of their approaches uses cycles to represent a series of swaps. The cycles in their approach are determined statically, while in our cycle processing algorithm the cycles are determined dynamically.

It is important to note that Regions sort is not limited to radix sort. Indeed, it can be generalized to any algorithm that iteratively maps records to buckets. It would be interesting to use Regions sort to implement a sample sort similar to IPS4o [2], thus giving a parallel in-place comparison-based sorting algorithm. Another interesting research direction is to implement an efficient stable radix sort using Regions sort, for example using the 2-path finding algorithm. With a stable sort as the base case, records within a region would be stable at graph construction time. It should be noted that self-edges could correspond to data movements in this case and would need to be represented in the graph. The challenges then are matching incoming edges with outgoing edges

in a manner that maintains stability, and handling self-edges without introducing read-write races.

Figure 6-10: 36-core running time vs. 퐾 for unif(10^9, 10^9).

Figure 6-11: 36-core running time vs. 퐾 for zipf_0.75(10^9, 10^9).

Bibliography

[1] Michael Axtmann, Timo Bingmann, Peter Sanders, and Christian Schulz. Practical massively parallel sorting. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), 2015.

[2] Michael Axtmann, Sascha Witt, Daniel Ferizovic, and Peter Sanders. In-place parallel super scalar samplesort IPS4o. In European Symposium on Algorithms (ESA), 2017.

[3] K. Berney, H. Casanova, A. Higuchi, B. Karsin, and N. Sitchinava. Beyond binary search: Parallel in-place construction of implicit search tree layouts. In IEEE International Parallel and Distributed Processing Symposium (IPDPS), 2018.

[4] Guy E. Blelloch, Jeremy T. Fineman, Phillip B. Gibbons, and Julian Shun. Internally deterministic parallel algorithms can be fast. In ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), 2012.

[5] Guy E. Blelloch, Phillip B. Gibbons, and Harsha Vardhan Simhadri. Low depth cache-oblivious algorithms. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), 2010.

[6] Guy E. Blelloch, Yan Gu, Julian Shun, and Yihan Sun. Parallelism in randomized incremental algorithms. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), 2016.

[7] Guy E. Blelloch, Charles E. Leiserson, Bruce M. Maggs, C. Greg Plaxton, Stephen J. Smith, and Marco Zagha. A comparison of sorting algorithms for the Connection Machine CM-2. In ACM Symposium on Parallel Algorithms and Architectures (SPAA), 1991.

[8] Robert D. Blumofe and Charles E. Leiserson. Scheduling multithreaded computations by work stealing. Journal of the ACM, 46(5), September 1999.

[9] Jatin Chhugani, Anthony D. Nguyen, Victor W. Lee, William Macy, Mostafa Hagog, Yen-Kuang Chen, Akram Baransi, Sanjeev Kumar, and Pradeep Dubey. Efficient implementation of sorting on multi-core SIMD CPU architecture. In International Conference on Very Large Data Bases (VLDB), 2008.

[10] Minsik Cho, Daniel Brand, Rajesh Bordawekar, Ulrich Finkler, Vincent Kulandaisamy, and Ruchir Puri. PARADIS: An efficient parallel algorithm for in-place radix sort. Proc. VLDB Endow., 8(12), August 2015.

[11] Richard Cole. Parallel merge sort. SIAM J. Comput., 17(4), August 1988.

[12] Richard Cole and Vijaya Ramachandran. Resource oblivious sorting on multicores. ACM Trans. Parallel Comput., 3(4), March 2017.

[13] Naga Govindaraju, Jim Gray, Ritesh Kumar, and Dinesh Manocha. GPUTeraSort: High performance graphics co-processor sorting for large database management. In ACM SIGMOD International Conference on Management of Data, 2006.

[14] Jim Gray, Prakash Sundaresan, Susanne Englert, Ken Baclawski, and Peter J. Weinberger. Quickly generating billion-record synthetic databases. In ACM SIGMOD International Conference on Management of Data, 1994.

[15] David R. Helman, David A. Bader, and Joseph JaJa. A randomized parallel sorting algorithm with an experimental study. Journal of Parallel and Distributed Computing, 52(1), 1998.

[16] H. Inoue, T. Moriyama, H. Komatsu, and T. Nakatani. AA-Sort: A new parallel sorting algorithm for multi-core SIMD processors. In International Conference on Parallel Architecture and Compilation Techniques (PACT), 2007.

[17] J. Jaja. Introduction to Parallel Algorithms. Addison-Wesley Professional, 1992.

[18] Daniel Jiménez-González, Juan J. Navarro, and Josep-L. Larriba-Pey. Fast parallel in-memory 64-bit sorting. In International Conference on Supercomputing (ICS), 2001.

[19] Marek Kokot, Sebastian Deorowicz, and Agnieszka Debudaj-Grabysz. Sorting data on ultra-large scale with RADULS. In Beyond Databases, Architectures and Structures, 2017.

[20] Marek Kokot, Sebastian Deorowicz, and Maciej Długosz. Even faster sorting of (not only) integers. In Man-Machine Interactions 5, 2018.

[21] Shin-Jae Lee, Minsoo Jeon, Dongseung Kim, and Andrew Sohn. Partitioned parallel radix sort. J. Parallel Distrib. Comput., 62(4), April 2002.

[22] Charles E. Leiserson. The Cilk++ concurrency platform. The Journal of Supercomputing, 51(3), 2010.

[23] Hui Li and Kenneth C. Sevcik. Parallel sorting by over partitioning. In ACM Symposium on Parallel Algorithms and Architectures (SPAA), 1994.

[24] Peter M. McIlroy, Keith Bostic, and M. Douglas McIlroy. Engineering radix sort. Computing Systems, 6(1), 1993.

[25] Duane Merrill and Andrew Grimshaw. High performance and scalable radix sorting: A case study of implementing dynamic parallelism for GPU computing. Parallel Processing Letters, 21(02), 2011.

[26] Omar Obeya, Endrias Kahssay, Edward Fan, and Julian Shun. Theoretically-efficient and practical parallel in-place radix sorting. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), 2019.

[27] Orestis Polychroniou and Kenneth A. Ross. A comprehensive study of main-memory partitioning and its application to large-scale comparison- and radix-sort. In ACM SIGMOD International Conference on Management of Data, 2014.

[28] Alexander Rasmussen, George Porter, Michael Conley, Harsha V. Madhyastha, Radhika Niranjan Mysore, Alexander Pucher, and Amin Vahdat. TritonSort: A balanced large-scale sorting system. In USENIX Conference on Networked Systems Design and Implementation (NSDI), 2011.

[29] N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In IEEE International Parallel Distributed Processing Symposium (IPDPS), 2009.

[30] Nadathur Satish, Changkyu Kim, Jatin Chhugani, Anthony D. Nguyen, Victor W. Lee, Daehyun Kim, and Pradeep Dubey. Fast sort on CPUs and GPUs: A case for bandwidth oblivious SIMD sort. In ACM SIGMOD International Conference on Management of Data, 2010.

[31] Julian Shun, Guy E. Blelloch, Jeremy T. Fineman, Phillip B. Gibbons, Aapo Kyrola, Harsha Vardhan Simhadri, and Kanat Tangwongsan. Brief announcement: the Problem Based Benchmark Suite. In ACM Symposium on Parallelism in Algorithms and Architectures (SPAA), 2012.

[32] Johannes Singler, Peter Sanders, and Felix Putze. MCSTL: The multi-core standard template library. In Euro-Par, 2007.

[33] Malte Skarupke. I wrote a faster sorting algorithm. https://probablydance.com/2016/12/27/i-wrote-a-faster-sorting-algorithm/, 2016.

[34] Andrew Sohn and Yuetsu Kodama. Load balanced parallel radix sort. In International Conference on Supercomputing (ICS), 1998.

[35] E. Solomonik and L. V. Kalé. Highly scalable parallel sorting. In IEEE International Symposium on Parallel Distributed Processing (IPDPS), April 2010.

[36] Elias Stehle and Hans-Arno Jacobsen. A memory bandwidth-efficient hybrid radix sort on GPUs. In ACM International Conference on Management of Data, 2017.

[37] Jan Wassenberg and Peter Sanders. Engineering a multi-core radix sort. In Euro-Par, 2011.

[38] Marco Zagha and Guy E. Blelloch. Radix sort for vector multiprocessors. In ACM/IEEE Conference on Supercomputing (SC), 1991.

[39] K. Zhang and B. Wu. A novel parallel approach of radix sort with bucket partition preprocess. In IEEE International Conference on Embedded Software and Systems, 2012.
