Approximate and Exact Selection on GPUs

Tobias Ribizel∗, Hartwig Anzt∗†
∗Steinbuch Centre for Computing, Karlsruhe Institute of Technology, Germany
†Innovative Computing Lab (ICL), University of Tennessee, Knoxville, USA
[email protected], [email protected]

Abstract—We present a novel algorithm for parallel selection on GPUs. The algorithm requires no assumptions on the input data distribution, and has a much lower recursion depth compared to many state-of-the-art algorithms. We implement the algorithm for different GPU generations, always using the respectively-available low-level communication features, and assess the performance on server-line hardware. The computational complexity of our SampleSelect algorithm is comparable to specialized algorithms designed for – and exploiting the characteristics of – "pleasant" data distributions. At the same time, as SampleSelect does not work on the actual values but only on the ranks of the elements, it is robust to the input data and can complete significantly faster for adversarial data distributions. In addition to the exact SampleSelect, we address the use case of approximate selection by designing a variant that radically reduces the computational cost while preserving high approximation accuracy.

Index Terms—parallel selection algorithm, GPU, kth order statistics, approximate threshold selection

I. INTRODUCTION

Sequence selection is a ubiquitous challenge that appears in many problem settings, from quantile selection in order statistics over determining thresholds in approximate algorithms to top-k selection in information retrieval. One of the most popular solutions to this problem is the widely-used Quickselect algorithm [1], a partial-sorting variant of Quicksort [2]. The close relationship between these two algorithms is not a singularity, but characteristic of the connection between selection and sorting algorithms. In fact, many improvements of Quicksort and similar partitioning-based sorting algorithms can be directly transferred to the corresponding selection algorithms, e.g., the deterministic pivot choice implemented using the Median of medians algorithm [3], multiple splitter elements in the Sample sort algorithm [4], and an implementation variant optimized for modern hardware architectures called Super-scalar sample sort [5].

With the rise of parallel architectures, the development of effective selection and sorting algorithms is heavily guided by parallelization aspects. The traditional concepts employed for the parallelization are based on work decomposition, and have proven to be efficient for multi-core and multi-node architectures embracing the multiple-instruction-multiple-data (MIMD) programming paradigm. Unfortunately, the same strategies largely fail to work efficiently on modern manycore architectures like GPUs. The primary reason is that these devices are designed to operate in streaming mode, and that their performance heavily suffers from instruction branching, non-coalesced memory access, and global communication or synchronization. As streaming processors like GPUs are becoming increasingly popular, and are nowadays adopted by a large fraction of the supercomputing facilities, there exists a heavy demand for selection algorithms that are designed to leverage the highly parallel execution model of GPUs and avoid global synchronization and communication in favor of localized communication. In response to this demand, we propose a new parallel selection algorithm for GPUs. Aiming at a sample selection algorithm featuring fine-grained parallelism, we follow a bottom-up approach by starting with the GPU hardware characteristics, and selecting algorithmic building blocks that map well to the architecture-specific operating mode. Acknowledging CUDA's asynchronous execution model, and using low-level communication features inherently supported by the hardware, the new selection algorithm proves to be competitive with other GPU-optimized selection algorithms that impose strong assumptions on the input data distribution, and superior to input-data-independent state-of-the-art algorithms available in the literature, open-source software, or vendor libraries.

The rest of the paper is organized as follows. In Section II we recall some basic concepts of selection algorithms and their parallelization potential. Section III lists efforts that also aim at parallelizing selection algorithms for GPUs. In Section IV we present the novel SAMPLESELECT selection algorithm. We also provide details about how the SAMPLESELECT algorithm is realized in the CUDA programming model, and how the low-level communication and synchronization features available in the distinct GPU generations are incorporated. Section V presents a comprehensive analysis of the effectiveness, efficiency, and performance of the novel SAMPLESELECT selection algorithm. We conclude in Section VI with a summary of the findings, and an outlook on future research.

II. SELECTION

For an input sequence $(x_0, \dots, x_{n-1})$, the selection problem is given by finding the element at position $k$ in the sorted sequence $x_{i_0} \le \dots \le x_{i_{n-1}}$, i.e., finding the $k$th-smallest element $x_{i_k}$ of the sequence. In this setting, we also say that $x_{i_k}$ has rank $k$. If the rank of an element is not unique, i.e., because the element occurs multiple times in the sequence, we assign it the smallest rank.
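For reference only (this snippet is not part of the paper), the rank-$k$ definition above corresponds to what the standard C++ partial-sorting routine std::nth_element computes on a single thread; the sketch below also illustrates the smallest-rank convention for duplicate elements.

    #include <algorithm>
    #include <cstddef>
    #include <iostream>
    #include <vector>

    // Returns the element of rank k (0-based), i.e., the kth-smallest element.
    double select_baseline(std::vector<double> data, std::size_t k) {
        // After std::nth_element, data[k] holds the element that would be
        // at position k in a fully sorted sequence.
        std::nth_element(data.begin(), data.begin() + k, data.end());
        return data[k];
    }

    int main() {
        std::vector<double> xs{5.0, 1.0, 4.0, 4.0, 2.0, 3.0};
        // Sorted sequence: 1, 2, 3, 4, 4, 5. Rank 2 is 3.0.
        std::cout << select_baseline(xs, 2) << '\n';
        // The duplicate 4.0 occupies positions 3 and 4, so by the convention
        // above its rank is 3; both of these queries return 4.0.
        std::cout << select_baseline(xs, 3) << '\n';
        std::cout << select_baseline(xs, 4) << '\n';
    }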
A. General framework

The most popular algorithms for the selection problem are all based on partial sorting: If we choose $b+1$ so-called splitter elements $s_i$ ($-\infty = s_0 \le \dots \le s_b = \infty$), we can partition the input dataset into $b$ buckets containing the element intervals $[s_i, s_{i+1})$. An important consequence of this partitioning is that, aside from the element values, we also partition their ranks in the sorted sequence: Let $n_i$ be the number of elements in the $i$th bucket, i.e., the number of elements from the input sequence contained in $[s_i, s_{i+1})$. Then these elements have ranks in the interval $[r_i, r_{i+1})$, where $r_i = \sum_{j=0}^{i-1} n_j$ is the combined number of elements in all previous buckets.

Based on this observation, we can formulate a general framework for exact selection: After determining the element count for each bucket, it suffices to recursively proceed only with the bucket whose rank interval contains $k$, searching within it for the element of rank $k - r_i$ (cf. Fig. 1).

    double select(data, rank) {
      if (size(data) <= base_case_size) {
        sort(data);
        return data[rank];
      }
      // select splitters
      splitters = pick_splitters(data);
      // compute bucket sizes n_i
      counts = count_buckets(data, splitters);
      // compute bucket ranks r_i
      offsets = prefix_sum(counts);
      // determine bucket containing rank
      bucket = lower_bound(offsets, rank);
      // recursive subcall
      data = extract_bucket(data, bucket);
      rank -= offsets[bucket];
      return select(data, rank);
    }

Fig. 1. High-level overview of a bucket-based selection algorithm

The choice of the splitters strongly influences the total runtime of the resulting algorithm. In the general case, the optimal splitters separate the input elements into $b$ buckets of equal size $n/b$. This results in an algorithm that needs at most $\log_b(n/B) + 1$ recursive steps, where $B$ is the base case size (lowest recursion level), below which we sort the elements to return the $k$th-smallest element directly. Without considering the computational overhead of choosing the splitters, their optimal values are the $p_i = i/b$ percentiles of the input dataset. In practice, these can be approximated by the corresponding percentiles of a sufficiently large random sample. In terms of the relative element ranks, the average error introduced by considering only a small sample of size $s$ of the complete dataset can be estimated as follows: The relative ranks of the sampled elements, i.e., the ranks normalized to $[0, 1]$, are approximately uniformly distributed: $X_1, \dots, X_s \sim U(0, 1)$ and, assuming sampling with replacement, independent. Thus the sample percentiles are asymptotically normally distributed with mean $p_i$ and standard deviation $\sqrt{p_i(1 - p_i)/s}$ [6]. For instance, with $s = 1024$ samples, the relative rank of the median splitter ($p_i = 0.5$) has a standard deviation of $\sqrt{0.25/1024} \approx 1.6\%$. We can thus use the sample size $s$ to control the imbalance between different bucket sizes.
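To make the recursion of Fig. 1 concrete, the following is a minimal sequential C++ sketch of the same framework, with splitters approximated by sample percentiles as described above. The helper name pick_splitters follows the figure, but the parameter values ($b$, $s$, base case size) and the fallback for degenerate splitters are illustrative assumptions, not the paper's tuned GPU implementation.

    #include <algorithm>
    #include <cstddef>
    #include <random>
    #include <vector>

    static std::mt19937 rng{42};

    // Approximate the i/b percentiles (i = 1, ..., b-1) of `data` by the
    // corresponding percentiles of a random sample of size s (with replacement).
    std::vector<double> pick_splitters(const std::vector<double>& data,
                                       std::size_t b, std::size_t s) {
        std::uniform_int_distribution<std::size_t> idx(0, data.size() - 1);
        std::vector<double> sample(s);
        for (auto& x : sample) x = data[idx(rng)];
        std::sort(sample.begin(), sample.end());
        std::vector<double> splitters(b - 1);
        for (std::size_t i = 1; i < b; ++i)
            splitters[i - 1] = sample[i * s / b];
        return splitters;
    }

    // Index of the bucket [s_{i-1}, s_i) that element x falls into.
    std::size_t bucket_of(const std::vector<double>& splitters, double x) {
        return std::upper_bound(splitters.begin(), splitters.end(), x) -
               splitters.begin();
    }

    double select(std::vector<double> data, std::size_t rank) {
        const std::size_t base_case_size = 32, b = 8, s = 64;
        if (data.size() <= base_case_size) {
            std::sort(data.begin(), data.end());
            return data[rank];  // base case: sort and read off the rank
        }
        auto splitters = pick_splitters(data, b, s);
        // Counting pass: bucket sizes n_i.
        std::vector<std::size_t> counts(b, 0);
        for (double x : data) ++counts[bucket_of(splitters, x)];
        // Exclusive prefix sum: bucket ranks r_i.
        std::vector<std::size_t> offsets(b, 0);
        for (std::size_t i = 1; i < b; ++i)
            offsets[i] = offsets[i - 1] + counts[i - 1];
        // The bucket containing `rank` is the last one with offset <= rank.
        std::size_t bucket = std::upper_bound(offsets.begin(), offsets.end(),
                                              rank) - offsets.begin() - 1;
        // Grouping pass: extract only the chosen bucket.
        std::vector<double> next;
        for (double x : data)
            if (bucket_of(splitters, x) == bucket) next.push_back(x);
        // Guard against degenerate splitters (e.g., all remaining elements
        // equal), which would leave the problem size unchanged.
        if (next.size() == data.size()) {
            std::sort(data.begin(), data.end());
            return data[rank];
        }
        return select(std::move(next), rank - offsets[bucket]);
    }

    int main() {
        std::vector<double> v(1000);
        for (std::size_t i = 0; i < v.size(); ++i)
            v[i] = double(i * 37 % 1000);  // a permutation of 0, ..., 999
        return select(v, 500) == 500.0 ? 0 : 1;  // rank-500 element is 500
    }

The separation into a counting pass and a grouping pass, with each element's bucket computed twice, mirrors the per-level stages visualized in Fig. 2 below.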
Fig. 2. Visualization of bucket-based partial sorting (the stages repeated per recursion level: pick splitters, sort splitters, group by bucket, select bucket)

C. Approximating the kth-smallest element

An important observation in the context of bucket-based selection algorithms is that after all elements have been grouped into their buckets, the ranks of the splitter elements are already available: Their ranks equal the aforementioned partial sums $r_i$. If the application does not require our algorithm to compute the $k$th-smallest element exactly, but can work with an approximation like the $(k \pm \varepsilon)$th smallest element, the selection algorithm can be modified to terminate before the lowest recursion level is reached. In this case, we can compute the approximate $k$th order statistic as the splitter $s_i$ whose rank $r_i$ is closest to $k$. In terms of the element ranks, the error is at worst half the maximum bucket size, and can thus be controlled by the number of buckets and the sample size. If the distribution of the input data is smooth, the small error in the element rank translates into a small error of the element value. However, this is not true in the general case, as for noisy input data the induced error can grow arbitrarily large.

III. RELATED WORK

In the past, different strategies aiming at efficient selection on GPUs were explored. The first implementation of a selection algorithm on GPUs was presented by Govindaraju et al. [7] for the problem of database operations. The proposed algorithm recursively bisects the