International Conference on Supercomputing (ICS), Chicago, IL, June 2017

Fast Segmented Sort on GPUs

Kaixi Hou†, Weifeng Liu‡§, Hao Wang†, Wu-chun Feng†

†Department of Computer Science, Virginia Tech, Blacksburg, VA, USA, {kaixihou, hwang121, wfeng}@vt.edu
‡Niels Bohr Institute, University of Copenhagen, Copenhagen, Denmark, [email protected]
§Department of Computer Science, Norwegian University of Science and Technology, Trondheim, Norway

ABSTRACT

Segmented sort, as a generalization of classical sort, orders a batch of independent segments in a whole array. Along with the wider adoption of manycore processors for HPC and big data applications, segmented sort plays an increasingly important role relative to sort. In this paper, we present an adaptive segmented sort mechanism on GPUs. Our mechanism includes two core techniques: (1) a differentiated method for different segment lengths to eliminate the irregularity caused by various workloads and thread divergence; and (2) a register-based sort method to support N-to-M data-thread binding and in-register data communication.

both for traditional HPC applications and for big data processing. In these cases, a large number of independent arrays often need to be sorted as a whole, either because of algorithm characteristics (e.g., suffix array construction in prefix doubling algorithms from bioinformatics [15, 44]), or dataset properties (e.g., sparse matrices in linear algebra [4, 28–31, 42]), or real-time requests from web users (e.g., queries in data warehouses [45, 49, 51]). The second trend is that with the rapidly increased computational power of new processors, sorting a single array at a time usually cannot fully utilize the devices, thus grouping multiple independent arrays and