International Conference on Supercomputing (ICS), Chicago, IL, June, 2017
Fast Segmented Sort on GPUs
Kaixi Hou†, Weifeng Liu‡§, Hao Wang†, Wu-chun Feng†
†Department of Computer Science, Virginia Tech, Blacksburg, VA, USA, {kaixihou, hwang121, wfeng}@vt.edu
‡Niels Bohr Institute, University of Copenhagen, Copenhagen, Denmark, [email protected]
§Department of Computer Science, Norwegian University of Science and Technology, Trondheim, Norway

ABSTRACT
Segmented sort, as a generalization of classical sort, orders a batch of independent segments in a whole array. Along with the wider adoption of manycore processors for HPC and big data applications, segmented sort plays an increasingly important role compared to sort. In this paper, we present an adaptive segmented sort mechanism on GPUs. Our mechanism includes two core techniques: (1) a differentiated method for different segment lengths to eliminate the irregularity caused by various workloads and thread divergence; and (2) a register-based sort method to support N-to-M data-thread binding and in-register data communication. We also implement a shared memory-based merge method to support non-uniform length chunk merge via multiple warps. Our segmented sort mechanism shows great improvements over the methods from CUB, CUSP and ModernGPU on NVIDIA K80-Kepler and TitanX-Pascal GPUs. Furthermore, we apply our mechanism to two applications, i.e., suffix array construction and sparse matrix-matrix multiplication, and obtain obvious gains over state-of-the-art implementations.

CCS CONCEPTS
•Theory of computation → Parallel algorithms; •Computing methodologies → Massively parallel algorithms;

KEYWORDS
segmented sort, SIMD, vector, registers, shuffle, NVIDIA, GPU, Pascal, skewed data

ACM Reference format:
Kaixi Hou, Weifeng Liu, Hao Wang, Wu-chun Feng. 2017. Fast Segmented Sort on GPUs. In Proceedings of ICS '17, Chicago, IL, USA, June 14-16, 2017, 10 pages.
© 2017 ACM. 978-1-4503-5020-4/17/06...$15.00. DOI: http://dx.doi.org/10.1145/3079079.3079105

1 INTRODUCTION
Sort is one of the most fundamental operations in computer science. A sorting algorithm orders entries of an array by their ranks. Even though sorting algorithms have been extensively studied on various parallel platforms [26, 33, 38, 40], two recent trends necessitate revisiting them on throughput-oriented processors.

The first trend is that manycore processors such as GPUs are more and more used both for traditional HPC applications and for big data processing. In these cases, a large number of independent arrays often need to be sorted as a whole, either because of algorithm characteristics (e.g., suffix array construction in prefix doubling algorithms from bioinformatics [15, 44]), or dataset properties (e.g., sparse matrices in linear algebra [4, 28–31, 42]), or real-time requests from web users (e.g., queries in data warehouses [45, 49, 51]). The second trend is that with the rapidly increasing computational power of new processors, sorting a single array at a time usually cannot fully utilize the device, so grouping multiple independent arrays and sorting them simultaneously is crucial for high utilization.

As a result, segmented sort, which sorts a batch of segments of non-uniform length concatenated in a single array, becomes an important computational kernel. Although directly sorting each segment in parallel could work well on multicore CPUs with dynamic scheduling [39], applying similar methods such as "dynamic parallelism" on manycore GPUs may cause degraded performance due to the high overhead of context switching [14, 43, 46, 48]. On the other hand, the distribution of segment lengths often exhibits skewed characteristics, where a dominant number of segments are relatively short but the rest can be much longer. In this context, existing approaches, such as the "one-size-fits-all" philosophy [37] (i.e., treating different segments equally) and some variants of global sort [3, 10] (i.e., traditional sort methods plus segment boundary checks at runtime), may not give the best performance due to load imbalance and low on-chip resource utilization.

In this work, we propose a fast segmented sort mechanism on GPUs. To improve load balance and increase resource utilization, our method first constructs basic work units composed of adaptively defined elements from multiple short segments of various sizes or parts of long segments, and then uses appropriate parallel strategies for different work units. We further propose a register-based sort method to support N-to-M data-thread binding and in-register data communication. We also design a shared memory-based merge method to support variable-length chunk merge via multiple warps. For the grouped short and medium work units, our mechanism does the segmented sort in registers and shared memory; and for long segments, our mechanism also exploits on-chip memories as much as possible.

Using segments of uniform and synthetic power-law length on NVIDIA K80-Kepler and TitanX-Pascal GPUs, our segmented sort can exceed state-of-the-art methods in three vendor libraries CUB [37], CUSP [10] and ModernGPU [3] by up to 86.1x, 16.5x, and 3.8x, respectively. Furthermore, we integrate our mechanism with two real-world applications to confirm their efficiency. For the suffix array construction (SAC) in bioinformatics, our mechanism results in a factor of 2.3–2.6 speedup over the latest skew-SA
method [44] with CUDPP [19]. For the sparse matrix-matrix multiplication (SpGEMM) in linear algebra, our method delivers a factor of 1.4–86.5, 1.5–2.3, and 1.4–2.3 speedups over approaches from cuSPARSE [35], CUSP [10], and bhSPARSE [30], respectively.

The contributions of this paper are listed as follows:
• We identify the importance of segmented sort in various applications by exploring segment length distributions in real-world datasets and uncovering performance issues of existing tools.
• We propose an adaptive segmented sort mechanism for GPUs, whose key techniques contain: (1) a differentiated method for different segment lengths to eliminate load imbalance, thread divergence, and irregular memory access; and (2) an algorithm that extends sorting networks to support N-to-M data-thread binding and thread communication at the GPU register level.
• We carry out a comprehensive evaluation on both the kernel level and the application level to demonstrate the efficacy and generality of our mechanism on two NVIDIA GPU platforms.

2 BACKGROUND AND MOTIVATION
2.1 Segmented Sort
Segmented sort (SegSort) performs a segment-by-segment sort on a given array composed of multiple segments. If there is only one segment, the operation converts into the classical sort problem that has gained much attention in the past decades. Thus sort can be seen as a special case of segmented sort. The complexity of segmented sort can be ∑_{i=1}^{p} n_i log n_i (for generality, we only focus on comparison-based sort in this work), where p is the number of segments in the problem and n_i is the length of each segment. Fig. 1 shows an example of segmented sort, where an array stores a list of keys (integers in this case) plus an additional array seg_ptr used for storing the head pointers of each segment.

Figure 1: An example of segmented sort. It sorts four integer segments of various lengths pointed to by seg_ptr: with seg_ptr = [0, 3, 5, 7], the input [4, 1, 2, 11, 8, 1, 6, 5] becomes the output [1, 2, 4, 8, 11, 1, 6, 5].

2.2 Skewed Segment Length Distribution
We use real-world datasets from two applications to analyze the characteristics of data distribution in segmented sort. The first application is the suffix array construction (SAC) from bioinformatics, where the prefix doubling algorithm [15, 44] is used. This algorithm calls SegSort to sort each segment in one iteration step, and the duplicated elements will form another set of segments for the next iteration. This procedure continues until no duplicated element exists. The second one is the sparse matrix-matrix multiplication (SpGEMM). In this algorithm, SegSort is used for reordering entries in each row by their column indices.

As shown in Fig. 2, the statistics of segments derived from these two algorithms share one feature: the small/medium segments dominate the distribution, where around 96% of segments in SpGEMM and 99% of segments in SAC have fewer than 2000 elements, and the rest can be much longer but contribute less to the number of entries in the whole problem. Such highly skewed data stems from either the input data or the intermediate data generated by the algorithm at runtime. As a result, the segments of various lengths require differentiated processing methods for high efficiency. Later on, we will present an adaptive mechanism that constructs basic work units composed of multiple short segments of various sizes or parts of long segments, and processes them in a very fine-grained way to achieve load balance on manycore GPUs.

Figure 2: Histogram of segment length changes in SpGEMM and SAC. (a) Segments from squaring three different matrices in SpGEMM. (b) Segments from the first three iterations in SAC.

2.3 Sorting Networks and Their Limitations
A sorting network, consisting of a sequence of independent comparisons, is usually used as a basic primitive to build the corresponding sorting algorithm. Fig. 3a provides an example of a bitonic sorting network that accepts 8 random integers as input. Each vertical line in the sorting network represents a comparison of input elements. Through these comparisons, the 8 integers are sorted.

Although a sorting network provides a view of how to compare input data, it does not give a parallel solution of how the data are distributed to multiple threads. A straightforward solution is directly mapping one element of input data to one GPU thread. However, this method will waste the computational power of the GPU, because every comparison represented by a vertical line is conducted by two threads. This method also leads to poor instruction-level parallelism (ILP), since the insufficient operations per thread cannot fully take advantage of instruction pipelining. Therefore, it is important to investigate how many elements of a sorting network processed by one GPU thread can lead to the best performance on the GPU memory hierarchy. In this paper, we call it the data-thread binding on GPUs.

Figure 3: One sorting network and existing strategies using registers on GPUs. (a) An 8-way bitonic sorting network. (b) Data-thread bindings: 1. all-to-one binding (tid=0 holds rg0–rg3); 2. one-to-one binding (tid=0..3 each hold rg0).
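The SegSort semantics of Sec. 2.1 and Fig. 1 can be sketched in a few lines of Python (a functional reference model of the operation, not the GPU implementation; the seg_ptr convention follows Fig. 1):

```python
def seg_sort(keys, seg_ptr):
    """Sort each segment [seg_ptr[i], seg_ptr[i+1]) independently;
    the last segment extends to the end of the array."""
    out = list(keys)
    bounds = list(seg_ptr) + [len(keys)]
    for lo, hi in zip(bounds, bounds[1:]):
        out[lo:hi] = sorted(out[lo:hi])
    return out

# The example of Fig. 1:
print(seg_sort([4, 1, 2, 11, 8, 1, 6, 5], [0, 3, 5, 7]))
# → [1, 2, 4, 8, 11, 1, 6, 5]
```

With one segment (seg_ptr = [0]) this degenerates to a classical sort, matching the observation that sort is a special case of segmented sort.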
Fig. 3b presents two examples of using data-thread binding to realize a part of the sorting network shown in Fig. 3a. Fig. 3b-1 shows the most straightforward method of simply conducting all the computation within a single thread [3]. This option can exhibit better ILP but at the expense of requesting too many register resources. On the contrary, Fig. 3b-2 shows the example of binding one element to one thread [13]. Unfortunately, this method wastes computing resources, i.e., the first two threads perform the comparison on the same operands as the last two threads do. Therefore, we will investigate a more sophisticated solution that allows N-to-M data-thread binding (N elements bound to M threads) and evaluate the performance after applying this solution to different lengths of segments on different GPU architectures.

Another related but distinct issue is that even if we know the best N-to-M data-thread binding for a given segment on a GPU architecture, how to efficiently exchange data within/between threads is still challenging, especially at the GPU register level. Different from the communications via the global and shared memory of the GPU, which have been studied extensively, data sharing through registers may require more research.

3 METHODOLOGY
3.1 Adaptive GPU SegSort Mechanism
The key idea of our SegSort mechanism is to construct relatively balanced work units to be consumed by a large number of warps (i.e., groups of 32 threads in NVIDIA CUDA semantics) running on GPUs. Such work units can be a combination of multiple segments of small sizes, or part of a long segment split by a certain interval. To prepare the construction, we first group different lengths of segments into different bins, then combine or split segments to make the balanced work units, and finally apply differentiated sort approaches at appropriate memory levels to those units.

Figure 4: Overview of our GPU SegSort design (t = thread, w = warp, b = block). The segments (seg_ptr & input) are grouped into a unit bin, warp bins, block bins and a grid bin; they are processed by reg-sort, striped-write and smem-merge, and written to global memory as sorted segments (output).

Specifically, we categorize segments into four types of bins as shown in Fig. 4: (1) A unit bin, whose segments contain only 1 or 0 elements. For these segments, we simply copy them to the output in the global memory. (2) Several warp bins, whose segments are short enough. In some warp bins, a segment will be processed by a single thread, while in others, a segment will be handled by several threads, but at most by a warp of threads. That way, we only use GPU registers to sort these segments. This register-based sort is called reg-sort (Sec. 3.2), which allows the N-to-M data-thread binding and data communication between threads. Once these segments are sorted in registers, we write data from the registers to the global memory, bypassing the shared memory. To achieve coalesced memory access, the sorted results may be written to the global memory in a striped manner after an in-register transpose stage (Sec. 3.4). (3) Several block bins, which consist of medium-sized segments. In these bins, multiple warps in a thread block cooperate to sort a segment. Besides using the reg-sort method in GPU registers, a shared memory-based merge method, called smem-merge, is designed to merge multiple sorted chunks from reg-sort in a segment (Sec. 3.3). After that, the merged results will be written to the output array from the shared memory to the global memory. As shown in the figure, the number of warps in a thread block is configurable. (4) A grid bin, designed for the sufficiently long segments. For these segments, multiple blocks work together to sort and merge data. Different from the block bins, multiple rounds of smem-merge have to move data back and forth between the shared memory and the global memory to process these extremely long segments. In each round of calling smem-merge, synchronization between multiple thread blocks is necessary: after the execution of reg-sort and smem-merge in each block, intermediate results need to be synchronized across all cooperative blocks via the global memory [47]. After that, the partially sorted data will be repartitioned and assigned to each block by using a partitioning method similar to that in smem-merge, and utilizing inter-block locality will in general further improve overall performance [27].

As shown in Fig. 4, binning is the first step of our mechanism and a specific requirement of segmented sort compared to the standard sort. It is crucial to design an efficient binning approach on GPUs. Such an approach usually needs carefully designed histogram and scan kernels, such as those proposed in previous research [25, 47]. We adopt a simple and efficient "histogram-scan-bin" strategy generating the bins across the GPU memory hierarchy. We launch a GPU kernel whose number of threads is equal to the number of segments; each thread will process one element in the seg_ptr array. Histogram: each thread has a boolean array as the predicates, whose length is equal to the total number of bins. Then, each thread calculates its segment length from seg_ptr and sets the corresponding predicate to true if the segment length falls in the current bin. After that, we use the warp vote function ballot() and the bit counting function popc() to accumulate the predicates of all threads in a warp for the warp-level histogram. Finally, the first thread of each warp atomically accumulates the bin sizes in the shared memory to produce the block-level histogram, and after that the first thread in each block does the same accumulation in the global memory to generate the final histogram. Scan: we use an exclusive scan on the histogram bin sizes to get the starting positions of each bin. Binning: all threads put their corresponding segment IDs to positions atomically obtained from the scanned histogram. With such highly efficient designs, the overhead of grouping is limited; it will be evaluated in Sec. 4.2.

3.2 Reg-sort: Register-based Sort
Our reg-sort algorithm is designed to sort data in GPU registers for all bins. As a result, our method needs to support N-to-M data-thread binding, meaning M threads cooperate on sorting N elements. In order to leverage GPU registers to implement fast data exchange for sorting networks, M is set up to 32, which is the warp size.
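The "histogram-scan-bin" strategy above can be modeled sequentially in host-side Python. This is our own sketch: the bin boundaries below are illustrative assumptions (not the paper's exact configuration), and the warp/block-level ballot(), popc() and atomic accumulations are collapsed into plain loops:

```python
import bisect

# Illustrative bin boundaries (assumption): lengths <= 1 fall into the unit
# bin; the remaining powers of two stand in for the warp/block/grid bins.
BIN_LIMITS = [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048]

def bin_of(seg_len):
    # index of the first limit >= seg_len; lengths beyond the last limit
    # land in the final (grid) bin
    return bisect.bisect_left(BIN_LIMITS, seg_len)

def group_segments(seg_ptr, total_len):
    bounds = list(seg_ptr) + [total_len]
    lens = [b - a for a, b in zip(bounds, bounds[1:])]
    # Histogram: count segments per bin (the GPU version reduces per-thread
    # predicates with warp votes and atomics instead of this loop).
    hist = [0] * (len(BIN_LIMITS) + 1)
    for n in lens:
        hist[bin_of(n)] += 1
    # Scan: exclusive prefix sum gives the starting position of each bin.
    starts, acc = [], 0
    for h in hist:
        starts.append(acc)
        acc += h
    # Binning: scatter segment IDs to their (atomically obtained) positions.
    out = [0] * len(lens)
    cursor = list(starts)
    for seg_id, n in enumerate(lens):
        b = bin_of(n)
        out[cursor[b]] = seg_id
        cursor[b] += 1
    return out, starts
```

For the Fig. 1 input (seg_ptr = [0, 3, 5, 7], 8 keys), the segment lengths are [3, 2, 2, 1], so segment 3 lands in the unit bin and the others in the first warp bins.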
For the medium and long segments where multiple warps are involved, we still use reg-sort to sort each chunk of a segment and use smem-merge to merge these sorted chunks. On the other hand, although our method theoretically supports any value of N, N is bound by the number of registers: if N is too large, the occupancy is degraded significantly because too many registers are used.

Fig. 5 shows the data exchange primitive of the bitonic sorting network in Fig. 3a, with the details of its implementation on GPU registers using shuffle instructions. In this example, each thread holds two elements as input: thread 0 has 4 and 3 in its registers rg0 and rg1, and thread 1 has 2 and 1 accordingly. This situation corresponds to a 4-to-2 data-thread binding. The primitive is implemented in four steps. First, each thread needs to know its communicating thread from the parameter tmask. Line 12 of Alg. 1 shows how to calculate its value, which is equal to the current cooperative thread group size minus 1. In this example, M is 2 because there are two threads in a cooperative group, and tmask is 1 (represented as 0x1 in the figure). This step uses the shuffle instruction shfl_xor(rg1, 0x1) to shuffle data in rg1 (the shuffle operation shfl_xor(v, mask) lets the calling thread x obtain register data v from thread x ^ mask, while shfl(v, y) lets the calling thread x fetch data v from thread y): thread 0 gets data from the rg1 of thread 1, and similarly for thread 1. This step changes the data from "4, 3, 2, 1" to "4, 1, 2, 3". Second, each thread compares and swaps the data in rg0 and rg1 locally, changing the data to "1, 4, 2, 3". After the second step, the rg0 in each thread holds the smaller datum and the rg1 the larger one. Thus, the third step is necessary to exchange the data in thread 1, putting the smaller datum in its rg1 and the larger one in its rg0 for the following shuffle, because the shuffle instruction can only exchange data in registers having the same variable name, i.e., rg1. In a more general case where more threads are involved, we use the parameter swbit to control which threads need to execute this local swap operation. Line 13 of Alg. 1 shows how to calculate the value of swbit, which is equal to log(coop_thrd_size) - 1. In this case, the current cooperative thread group size is 2, so swbit is 0, and thread 1 will do the swap since bfe(1, 0) returns 1. After the third step, we get "1, 4, 3, 2". Fourth, we shuffle again on rg1 to move the larger datum of thread 0 to thread 1 and the smaller datum of thread 1 to thread 0, and then get the output "1, 2, 3, 4".

Figure 5: Primitive communication pattern and its implementation. The generic form is _exch_primitive(rg0, rg1, tmask, swbit), instantiated here as _exch_primitive(rg0, rg1, 0x1, 0):
  ① _shuf_xor(rg1, 0x1);              // Shuffle data in rg1
  ② cmp_swp(rg0, rg1);                // Compare data of rg0 & rg1
  ③ if ( bfe(tid, 0) ) swp(rg0, rg1); // Swap data of rg0 & rg1 if bit 0 of tid is set
  ④ _shuf_xor(rg1, 0x1);              // Shuffle data in rg1

Based on the primitive, we extend two patterns of the N-to-M binding to implement the bitonic sorting network: (1) exch_intxn, where the differences of the corresponding data indices for comparisons decrease as the register indices increase in the first thread; and (2) exch_paral, where the differences of the corresponding data indices for comparisons stay constant as the register indices increase in the first thread. The left-hand sides of Fig. 6a and Fig. 6b show these two patterns. In these figures, each thread handles k elements, where k = N/M for the N-to-M data-thread binding. These two patterns can be easily constructed from the exch primitive by simply swapping the corresponding registers in each thread locally, as shown on the right-hand sides of Fig. 6a and Fig. 6b. Different from these two patterns, which involve inter-thread communication, a third pattern in the N-to-M data-thread binding has only intra-thread communication. As shown in Fig. 6d, we can directly use the compare-and-swap operation for its implementation, without any shuffle-based inter-thread communication. We call this pattern exch_local. This pattern can be an extreme case of N-to-M binding, i.e., N-to-1, where the whole sorting network is processed locally by one thread. All three patterns are used in Alg. 1 to implement a general N-to-M binding and the corresponding data communication for the bitonic sorting network.

Algorithm 1: Reg-sort: N-to-M data-thread binding and communication for the bitonic sort network
   /* segment size N, thread number M, workloads per thread wpt = N/M,
      regList is a group of wpt registers. */
 1  int p = (int)log(N);
 2  int pt = (int)log(M);
 3  for l = p; l >= 1; l-- do
 4      int coop_elem_num = (int)pow(2, l-1);
 5      int coop_thrd_num = (int)pow(2, min(pt, l-1));
 6      int coop_elem_size = (int)pow(2, p-l+1);
 7      int coop_thrd_size = (int)pow(2, pt-min(pt, l-1));
 8      if coop_thrd_size == 1 then
 9          int rmask = coop_elem_size - 1;
10          exch_local(regList, rmask);
11      else
12          int tmask = coop_thrd_size - 1;
13          int swbit = (int)log(coop_thrd_size) - 1;
14          exch_intxn(regList, tmask, swbit);
15      for k = l+1; k <= p; k++ do
16          int coop_elem_num = (int)pow(2, k-1);
17          int coop_thrd_num = (int)pow(2, min(pt, k-1));
18          int coop_elem_size = (int)pow(2, p-k+1);
19          int coop_thrd_size = (int)pow(2, pt-min(pt, k-1));
20          if coop_thrd_size == 1 then
21              int rmask = coop_elem_num - 1;
22              rmask = rmask ^ (rmask >> 1);
23              exch_local(regList, rmask);
24          else
25              int tmask = coop_thrd_num - 1;
26              tmask = tmask ^ (tmask >> 1);
27              int swbit = (int)log(coop_thrd_size) - 1;
28              exch_paral(regList, tmask, swbit);

The pseudo-code is shown in Alg. 1, which essentially combines the exch_intxn and exch_paral patterns (ln. 14 and ln. 28) in each iteration, and uses exch_local when no inter-thread communication is needed. In each step where all comparisons can be performed in parallel, we group data elements by coop_elem_num, representing the maximum number of groups in which the comparisons can be conducted without any interaction with elements in other groups (ln. 4 and ln. 16); and coop_elem_size represents how many elements are in each group. For example, in step 1 of Fig. 6d, the data are put into 4 groups, each of which has 2 elements. Thus, coop_elem_num is 4 and coop_elem_size is 2. In step 4 of this figure, there is only 1 group with a size of 8 elements. Thus, coop_elem_num is 1 and coop_elem_size is 8. Similarly, the algorithm groups threads into coop_thrd_num and coop_thrd_size for each step to represent the maximum number of cooperative thread groups and the number of threads in each group. For example, step 1 of Fig. 6d has 4 cooperative thread groups and each group has 1 thread.
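The four-step walkthrough of Fig. 5 and the six-step sequence of Fig. 6d can be checked with a small Python model. This is our own functional sketch, not the CUDA code: the shuffle is modeled as an array permutation across simulated lanes, and exch_intxn/exch_paral/exch_local are modeled as direct compare-exchanges (folding the swbit swap choreography into plain swaps):

```python
def shfl_xor(vals, mask):
    """Model of the shuffle: lane t receives the value held by lane t ^ mask."""
    return [vals[t ^ mask] for t in range(len(vals))]

def exch_primitive(rg0, rg1, tmask, swbit):
    """The four-step _exch_primitive of Fig. 5; rg0[t]/rg1[t] model thread t's registers."""
    rg1 = shfl_xor(rg1, tmask)              # step 1: shuffle rg1
    for t in range(len(rg0)):               # step 2: local compare-swap
        if rg0[t] > rg1[t]:
            rg0[t], rg1[t] = rg1[t], rg0[t]
    for t in range(len(rg0)):               # step 3: conditional swap, bfe(tid, swbit)
        if (t >> swbit) & 1:
            rg0[t], rg1[t] = rg1[t], rg0[t]
    rg1 = shfl_xor(rg1, tmask)              # step 4: shuffle rg1 back
    return rg0, rg1

# Functional models of the three patterns, for k = 2 registers per thread;
# regs[t] = [rg0, rg1] of thread t, thread t holding elements 2t and 2t+1.
def exch_local(regs):
    for r in regs:                          # intra-thread compare-swap (Fig. 6c)
        if r[0] > r[1]:
            r[0], r[1] = r[1], r[0]

def exch_intxn(regs, tmask):
    for t in range(len(regs)):              # decreasing-difference pattern (Fig. 6a)
        p = t ^ tmask
        if p > t:                           # rg_j of the lower thread meets rg_{k-1-j}
            for j in range(2):
                if regs[t][j] > regs[p][1 - j]:
                    regs[t][j], regs[p][1 - j] = regs[p][1 - j], regs[t][j]

def exch_paral(regs, tmask):
    for t in range(len(regs)):              # constant-difference pattern (Fig. 6b)
        p = t ^ tmask
        if p > t:                           # rg_j meets the partner thread's rg_j
            for j in range(2):
                if regs[t][j] > regs[p][j]:
                    regs[t][j], regs[p][j] = regs[p][j], regs[t][j]

def reg_sort_8_4(regs):
    """The six steps of Fig. 6d: reg_sort(N=8, M=4)."""
    exch_local(regs)                        # ①
    exch_intxn(regs, 0x1)                   # ②
    exch_local(regs)                        # ③
    exch_intxn(regs, 0x3)                   # ④
    exch_paral(regs, 0x1)                   # ⑤
    exch_local(regs)                        # ⑥
    return regs
```

Running exch_primitive([4, 2], [3, 1], 0x1, 0) reproduces the "4, 3, 2, 1" → "1, 2, 3, 4" trace of Fig. 5, and reg_sort_8_4 leaves the flattened registers of the 4 simulated threads fully sorted.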
Figure 6: Generic exch_intxn, exch_paral and exch_local patterns, shown in (a, b, c). The transformed patterns can be easily implemented using the primitive. (d) A code example of Alg. 1 for reg_sort(N=8, M=4), where threads t0–t3 each hold rg0 and rg1:
  ① _exch_local(rg0, rg1);
  ② _exch_intxn(rg0, rg1, 0x1, 0);
  ③ _exch_local(rg0, rg1);
  ④ _exch_intxn(rg0, rg1, 0x3, 1);
  ⑤ _exch_paral(rg0, rg1, 0x1, 0);
  ⑥ _exch_local(rg0, rg1);
Thus, coop_thrd_num is 4 and coop_thrd_size is 1. In contrast, in step 4, coop_thrd_num is 1 and coop_thrd_size is 4, which means the 4 threads need to communicate with each other to get the required data for the comparisons in this step.

If there is only one thread in a cooperative thread group (ln. 8 and ln. 20), the algorithm switches to the local mode exch_local, because the thread already has all comparison operands. Once there is more than one thread in a cooperative thread group, the algorithm uses the exch_intxn and exch_paral patterns, calculating the corresponding tmask and swbit to determine the thread of the communication pair and the thread that needs the local data rearrangement mentioned in the previous paragraph. In step 1 of this figure, where coop_thrd_size is 1 and exch_local is executed, the algorithm calculates rmask, which controls the local comparison on registers (ln. 9). In step 4, where coop_elem_size is 8 and coop_thrd_size is 4, all 8 elements will be compared across all 4 threads. In this case, tmask is 0x3 (ln. 12) and swbit is 1 (ln. 13). In step 5, where the exch_paral pattern is used, the algorithm calculates that coop_elem_size is 4 and coop_thrd_size is 2; thus, tmask is 0x1 and swbit is 0. Note that although our design in Alg. 1 is for the bitonic sorter, our ideas are also applicable to other sorting networks by swapping and padding registers based on the primitive pattern.

3.3 Smem-merge: Shared Memory-based Merge
As shown in Fig. 4, for the medium and large sized segments in the block bins and the grid bin, multiple warps are launched to handle one segment, and the sorted intermediate data from each warp need to be merged. The smem-merge algorithm is designed to merge such data in the shared memory.

Our smem-merge method enables multiple warps to merge chunks having different numbers of elements. We assign the first m warps with x-warp size and the last m′ warps with y-warp size to keep the load balance between warps as much as we can. Inside each warp, we also try to keep balance among the cooperative threads in the merge. We design a searching algorithm based on the MergePath algorithm [17] to search for the splitting points that divide the target segments into balanced partitions for each thread.

Fig. 7 shows an example of smem-merge. In this case, there are 4 sorted chunks in the shared memory belonging to one segment; they have 4, 4, 4, and 5 elements, respectively. We assume each warp has 2 threads. As a result, the first 3 warps will each merge 4 elements, with each thread in these warps merging two elements, while the last warp will work on 5 elements, with its first thread merging 2 elements and its second thread merging 3 elements. In the figure, the splitting points spA and spB for the second thread of warp 3 (in blue color) are computed by the MergePath algorithm. The right part of the figure shows the merge code executed by this thread, which processes 3 elements. Our merge method first loads data from spA and spB into two temporary registers t0 and t1. By checking whether spA and spB are out of bounds and comparing t0 and t1, the merge algorithm selects the smaller element to fill the first result register rg0. The algorithm continues loading the next element into t0 or t1 from the corresponding chunk pointed to by spA or spB, until the assigned number of elements is reached. After that, the merged data in registers, e.g., rg0, rg1 for the first thread, and rg0, rg1, rg2 for the second thread, will be stored back to shared memory for another iteration of merge. As shown in the figure, two iterations of smem-merge are used to merge the four chunks belonging to one segment.

Figure 7: An example of warp-based merge using shared memory. Four sorted chunks (2 4 6 8 | 1 12 14 17 | 3 5 9 15 | 7 10 11 13 16) are merged pairwise in stage 0 and stage 1; the code executed by the second thread, which merges 3 elements, is:
  smem_merge(5, spA, spB):
    t0 = smem[spA];
    t1 = smem[spB];
    // repeat:
    p = (spB >= lenB) || ((spA < lenA) && (t0 <= t1));
    rg0 = p ? t0 : t1;
    if (p) t0 = smem[++spA];
    else   t1 = smem[++spB];
    // ... likewise for registers rg1, rg2

3.4 Other Optimizations
Load data from global memory to registers: we decouple the data load from global memory to registers from the actual segmented sort in registers. Our data load kernel uses successive threads to load contiguous data for coalesced memory access. Although we don't keep the original global indices of the input data in registers, that doesn't matter, because the input data is unsorted and the global indices are not critical for the following sort routine.

Figure 8: An example of in-register transpose. Four threads transpose the data in their two registers through swap, exchange and swap steps, using shuf_xor(rg1, 0x1) at stride 1 and shuf_xor(rg1, 0x2) at stride 2.

Store data from registers to global memory: when the segments in the warp bins are sorted, we directly store them back to global memory without the merge stage.
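The splitting-point search that smem-merge relies on can be sketched on the host as a MergePath-style binary search over a cross diagonal (the function name and signature are ours; the GPU version runs an equivalent search over the chunks in shared memory):

```python
def merge_path_partition(a, b, diag):
    """Find (i, j) with i + j == diag such that a[:i] and b[:j] together form
    the first `diag` elements of merge(a, b); ties are taken from a first."""
    lo, hi = max(0, diag - len(b)), min(diag, len(a))
    while lo < hi:
        mid = (lo + hi) // 2
        if a[mid] <= b[diag - 1 - mid]:   # a[mid] still belongs to the prefix
            lo = mid + 1
        else:
            hi = mid
    return lo, diag - lo

# Two threads sharing an 8-element merge: thread 0 takes elements 0..3,
# so its partition is the splitting point at diagonal 4.
i, j = merge_path_partition([2, 4, 6, 8], [1, 12, 14, 17], 4)
# thread 0 merges a[:i] with b[:j]; thread 1 merges the remainders
```

Each thread obtains its own (spA, spB) this way, so every thread merges the same number of elements (plus or minus one) regardless of how unevenly the two chunks interleave.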
Directly storing them back will lead to uncoalesced memory access. When the number of elements per thread grows, the situation becomes even worse. Therefore, besides keeping the direct store-back method, labeled as orig in our evaluation, we design the striped write method, labeled as strd. We implement an in-register transpose method to scatter the sorted data across the registers of the threads. The transpose method starts by shuffling registers at a stride of 1, then doubles the stride to 2, and finishes the shuffles after log2(32) = 5 iterations, where 32 is the warp size. After that, successive threads can write data to global memory in a cyclic manner for coalesced memory access. Fig. 8 shows an example of using 4 threads to transpose the data in their two registers. This example shows that the number of iterations for the in-register transpose depends on the number of threads but not on the number of elements each thread has³. As a result, after log2(4) = 2 iterations, the successive elements are scattered across these threads. In the evaluation, we will investigate the best scenarios for orig and strd, considering that the orig method has the uncoalesced memory access problem, while the strd method has the in-register transpose overhead.

4 PERFORMANCE RESULTS
We conduct the experiments on two generations of NVIDIA GPUs: K80-Kepler and TitanX-Pascal. Tab. 1 lists the specifications of the two platforms. The input dataset is two arrays holding key and value pairs separately. The total dataset size is fixed at 2^28 pairs, and the segment numbers are varied according to the target segment sizes. We report the throughput in the experiments as 2^28/t pairs/s, where t is the execution time.

4.1 Kernel Performance
For the reg-sort kernels, we alternate the thread group size M in {2, 4, 8, 16, 32}. At the same time, each thread varies its bound data N/M in {1, 2, 4, 8, 16} (N/M is labeled as pairs per thread, ppt, in the figures). We use the maximum bound data of 16 because we observe that the performance deteriorates noticeably for numbers higher than 16. Thus, the target segment size N is a variable ranging from 2 (i.e., 2 threads each bind 1 pair) to 512 (i.e., 32 threads each bind 16 pairs). Since reg-sort might exhaust register resources, we also vary the block sizes as 64, 128, 256, and 512 to maximize occupancy. Fig. 9 shows the diversified performance numbers of the reg-sort kernels, which demonstrates that choosing a single solution for all segments would lead to suboptimal performance even among GPUs from the same vendor.

[Figure 9: Performance of reg-sort routines with different combinations of data-thread binding policies, write methods, and block sizes. (a) K80-Kepler; (b) TitanX-Pascal. Each panel plots throughput (10^9 pairs/s) against pairs per thread (1 to 16) for thread group sizes of 2, 4, 8, 16, and 32; the curves are blk_{064,128,256,512}_{orig,strd}.]

In Fig. 9, we notice that the impact of the data-thread binding policies varies widely depending on the GPU device. For example, when the segment size N equals 16, the possible candidates are 2(threads):8(ppt), 4:4, 8:2, and 16:1. On the K80-Kepler GPU, the highest performance is given by 8:2, achieving a 30% speedup over the slowest policy of 2:8. In contrast, the TitanX-Pascal GPU shows very similar performance for these policies. This insensitivity to register resources exhibited on the Pascal architecture can be attributed to its larger available register file per thread. Note that the policies where each thread binds only 1 pair are equivalent to the method proposed in [13]. This method actually wastes computing resources, because each comparison over two operands is conducted twice by two threads, resulting in suboptimal performance numbers.

The striped write method (strd) in the reg-sort kernels is particularly effective when each thread binds more than 4 pairs.
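To make the strd path concrete, below is a small Python model of the in-register transpose (our own sketch, not the paper's CUDA code): `shfl_xor` mimics the behavior of CUDA's `__shfl_xor_sync`, and each stride performs the conditional-swap / exchange / conditional-swap pattern of the Fig. 8 example for two registers per thread. After log2(T) iterations the blocked layout becomes striped, so register r of thread t holds element r*T + t and a cyclic write is coalesced:

```python
def shfl_xor(vals, mask):
    """Simulate a warp-wide xor shuffle: thread t receives vals[t ^ mask]."""
    return [vals[t ^ mask] for t in range(len(vals))]

def in_register_transpose(rg0, rg1):
    """Blocked -> striped transpose of two registers across T threads.
    Per stride: threads whose lane bit is set swap their two registers,
    all threads exchange rg1 with lane ^ stride, and the same threads
    swap back. Runs for log2(T) iterations, independent of the number
    of elements per thread."""
    T = len(rg0)
    stride = 1
    while stride < T:
        for t in range(T):
            if t & stride:
                rg0[t], rg1[t] = rg1[t], rg0[t]
        rg1 = shfl_xor(rg1, stride)
        for t in range(T):
            if t & stride:
                rg0[t], rg1[t] = rg1[t], rg0[t]
        stride *= 2
    return rg0, rg1

def striped_write(rg0, rg1):
    """Cyclic store: thread t writes register r to out[r*T + t]."""
    T = len(rg0)
    out = [0] * (2 * T)
    for t in range(T):
        out[t] = rg0[t]
        out[T + t] = rg1[t]
    return out
```

With the Fig. 8 input (t0 holds 1,2; t1 holds 3,4; ...), `in_register_transpose([1,3,5,7], [2,4,6,8])` yields rg0 = [1,2,3,4] and rg1 = [5,6,7,8], after which the cyclic write emits the fully sorted sequence.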
The reason is that when the ppt is small (<4), consecutive threads can still access almost consecutive memory locations, yielding coalesced memory transactions. However, a larger ppt means the thread access locations are highly scattered, causing inefficient memory transactions instead. In this case, the striped write method is of great necessity. On the other hand, the block size also has an effect on the performance, and the optimum is usually achieved by using 128 or 256 threads.

Table 1: Experiment Testbeds

                            Tesla K80 (Kepler-GK210)    TitanX (Pascal-GP102)
Cores                       2496 @ 824 MHz              3584 @ 1531 MHz
Register/L1/LDS per core    256/16/48 KB                256/16/48 KB
Global memory               12 GB @ 240 GB/s            12 GB @ 480 GB/s
Software                    CUDA 7.5                    CUDA 8.0

³The number of elements each thread has will determine how many shuffle instructions are needed in each iteration.

For the smem-merge kernels, Fig. 10 shows the performance numbers when changing the number of cooperative warps and the ppt.
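As a rough illustration of the MergePath-based balanced partitioning that smem-merge relies on (cf. the Fig. 7 example), here is a Python sketch for the two-list case; the function names and the sequential loop over "threads" are our simplification of the parallel scheme:

```python
def merge_path_split(a, b, diag):
    """Binary search on the merge path of sorted lists a and b:
    return (i, j) with i + j == diag such that a[:i] and b[:j]
    together form the first `diag` elements of merge(a, b)."""
    lo, hi = max(0, diag - len(b)), min(diag, len(a))
    while lo < hi:
        mid = (lo + hi) // 2            # candidate split position in a
        if a[mid] < b[diag - 1 - mid]:  # a[mid] still belongs to the prefix
            lo = mid + 1
        else:
            hi = mid
    return lo, diag - lo

def balanced_merge(a, b, num_threads):
    """Merge a and b by giving every 'thread' a near-equal share of the
    output, regardless of how the input is skewed between a and b."""
    total = len(a) + len(b)
    out = []
    for t in range(num_threads):
        d0 = t * total // num_threads          # this thread's output range
        d1 = (t + 1) * total // num_threads
        i0, j0 = merge_path_split(a, b, d0)
        i1, j1 = merge_path_split(a, b, d1)
        # Each thread sequentially merges its private sub-ranges;
        # both slices are sorted, so sorted() here acts as a 2-way merge.
        out += sorted(a[i0:i1] + b[j0:j1])
    return out
```

For a 5-element workload split over 2 threads, the diagonals fall at 0, 2, and 5, giving shares of 2 and 3 elements, matching the 2-and-3 split of the last warp in the Fig. 7 example.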
The warp numbers are in {2, 4, 8, 16}, while the ppt varies among {2, 4, 8, 16}. In this scenario, the target segment size N ranges from 128 (i.e., 2 warps each merge 32x2 pairs) to 4096 (i.e., 16 warps each merge 32x8 pairs). Different from the reg-sort kernels, the best data-thread binding policies are more consistent for smem-merge on the two devices. For example, to merge a 1024-length segment, both devices prefer to use 8 warps with 4 ppt. Considering memory access, the striped method provides similar performance to the original method, which directly exploits the random access of shared memory in order to achieve coalesced transactions on global memory. Note that, in our implementation, we carefully select the shared memory dimensions and sizes to avoid bank conflicts.

Now, we can select the best kernels for different segments whose lengths are powers of two. Other segments can be handled by directly padding them to the nearest upper power of two in the reg-sort kernels. For the smem-merge kernels, we use different ppt for different warps to minimize the padding values. Since our method is based on the pre-defined sorting networks, the characteristics of the input datasets have a negligible influence on the selection of the best kernels. Moreover, for each GPU, we only need to conduct the offline selection once. The results show that we only need 13 bins to handle all the segments. The first 8 and 9 bins use reg-sort to handle up to 256 pairs on the Kepler GPU and 512 on Pascal, respectively. Then, other segments less than […] global merge sort with runtime segment delimiter checking, thus can ensure that the sort only occurs within segments.

Datasets of uniform distribution: We test a batch of uniform segments to evaluate the performance of our segsort kernels, which are plotted in Fig. 11. Since cusp-segsort and mgpu-segsort are designed from the global sort, their performance is determined by the total input size. This “one-size-fits-all” philosophy ignores the characteristics of the segments and thus shows plateau performance. For the short segments (<256 on Kepler and <512 on Pascal), our segsort can achieve an average of 13.4x and 3.1x speedups over cusp-segsort and mgpu-segsort, respectively, on Kepler. These speedups rise to 15.5x and 3.2x on Pascal. For the other segments, our segsort provides an average of 5.6x and 1.2x improvements on Kepler (10.0x and 2.1x on Pascal). The performance improvements are mainly from our reg-sort and smem-merge, which minimize shared/global memory transactions.
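The bin-based kernel selection described above can be sketched as a simple dispatch. This is an illustrative model, not the paper's code: the default thresholds are the Kepler numbers from this section (reg-sort up to 256 pairs, smem-merge configurations up to 4096 pairs), the Pascal reg-sort limit would be 512, and the returned kernel labels are hypothetical names of ours:

```python
def select_kernel(seg_len, reg_sort_limit=256, smem_merge_limit=4096):
    """Map a segment length to its bin (the nearest upper power of two)
    and pick a kernel class: reg-sort for short segments, smem-merge for
    medium ones, and a global merge sort with runtime segment delimiter
    checking for the rest."""
    # Pad the segment to the nearest upper power of two.
    padded = 1 if seg_len <= 1 else 1 << (seg_len - 1).bit_length()
    if padded <= reg_sort_limit:
        return f"reg-sort<{padded}>"
    if padded <= smem_merge_limit:
        return f"smem-merge<{padded}>"
    return "global-merge"
```

For example, a 300-element segment is padded to 512 and, under the Kepler thresholds, dispatched to a smem-merge kernel, while the same segment on Pascal (reg_sort_limit=512) would stay in reg-sort.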