Comparison Based Sorting for Systems with Multiple GPUs

Ivan Tanasic†, Lluís Vilanova†, Marc Jordà†, Javier Cabezas†, Isaac Gelado†, Nacho Navarro†‡, Wen-mei Hwu∗
†Barcelona Supercomputing Center  ‡Universitat Politecnica de Catalunya  ∗University of Illinois
{first.last}@bsc.es  [email protected]  [email protected]

ABSTRACT

As a basic building block of many applications, sorting algorithms that run efficiently on modern machines are key for the performance of these applications. With the recent shift to using GPUs for general purpose computing, researchers have proposed several sorting algorithms for single-GPU systems. However, some workstations and HPC systems have multiple GPUs, and applications running on them are designed to use all available GPUs in the system.

In this paper we present a high performance multi-GPU merge sort algorithm that solves the problem of sorting data distributed across several GPUs. Our merge sort algorithm first sorts the data on each GPU using an existing single-GPU sorting algorithm. Then, a series of merge steps produce a globally sorted array distributed across all the GPUs in the system. This merge phase is enabled by a novel pivot selection algorithm that ensures that merge steps always distribute data evenly among all GPUs. We also present the implementation of our sorting algorithm in CUDA, as well as a novel inter-GPU communication technique that enables this pivot selection algorithm. Experimental results show that an efficient implementation of our algorithm achieves a speedup of 1.9x when running on two GPUs and 3.3x when running on four GPUs, compared to sorting on a single GPU. At the same time, it is able to sort two and four times more records, compared to sorting on one GPU.

Categories and Subject Descriptors

C.1.3 [Processor Architectures]: Other Architecture Styles

General Terms

Algorithms, Design, Performance

1. INTRODUCTION

Sorting is a key building block in many High Performance Computing (HPC) applications. Examples of these are N-body simulations [1], some high performance sparse matrix-vector multiplication implementations [2], graphics algorithms like Bounding Volume Hierarchy (BVH) construction [3], database operations [4], machine learning algorithms [5] and MapReduce framework implementations [6, 7].

A common trend in computing today is the utilization of Graphical Processing Units (GPUs), which efficiently execute code rich in data parallelism, to form high performance heterogeneous systems [8]. In these CPU/GPU systems, application phases rich in data parallelism are executed on the GPU, while the remaining portions of the application are executed on the CPU. However, depending on the phase of the application, this pattern requires moving the data back and forth between the CPU and the GPU, introducing overheads (in both execution time and power consumption) that can void the benefit of using GPUs. To avoid these overheads in GPU accelerated applications that use sorting, several sorting algorithms that execute on the GPU have already been proposed, including [9, 10, 11, 12, 13]. Most of the existing GPU sorting algorithms only support single-GPU systems. This limitation might be unacceptable in applications that make use of several GPUs, for a number of reasons: it leads to underutilization of the available resources, it increases the overhead of moving the data to one GPU to be sorted and, since data structures might be distributed across the memory of all GPUs in the node, their total size can exceed the memory capacity of a single GPU. To the best of our knowledge, only the algorithms presented in [9, 14, 15] allow using more than one GPU, but these algorithms are either designed for external (out-of-core) sorting or have other limitations.
Keywords

Parallel, Sorting, GPU, CUDA

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. GPGPU-6, March 16 2013, Houston, TX, USA. Copyright 2013 ACM 978-1-4503-2017-7/13/03 ...$15.00.

In this paper we present an efficient multi-GPU internal (in-core) merge based sorting algorithm that allows sorting data structures already distributed across the memories of several GPUs, including very large structures that could not fit in a single GPU, and that runs on any number of GPUs connected to the same compute node. We do not target cluster level sorting, although our sorting algorithm could be used as a node level sort operation when building a cluster level sort. The algorithm consists of two phases: a first phase that performs a local sort in each GPU, resulting in "per GPU" sorted data, and a second phase that performs a series of merges over ordered data from different GPUs. We designed the merge step to cope with the distributed nature of a multi-GPU system, utilize the memory perfectly and keep the system in complete load balance. This is achieved by splitting the arrays to be merged in two parts in such a way that exchanging exactly the same number of elements between them, as two contiguous chunks of memory, results in data that can be merged by each GPU to obtain the globally sorted array. Here we introduce a novel pivot selection algorithm that guarantees these properties.

This paper also presents the CUDA implementation of our proposed multi-GPU merge sort algorithm. Our implementation uses peer-to-peer (P2P) GPU access, introduced in CUDA 4.0, to enable efficient inter-GPU communication, which is key to accomplishing a high performance merge step. However, P2P GPU access is only available for GPUs connected to the same PCI Express (PCIe) controller. To enable low-overhead inter-GPU communication across GPUs connected to different PCIe controllers, we introduce a novel inter-GPU communication technique based on the host mapped memory mechanism [16].

The combination of the algorithm and the communication mechanisms (P2P GPU accesses and our inter-GPU communication technique) allows us to accomplish a scalable CUDA implementation of multi-GPU sorting that achieves speedups of up to 1.9x when sorting on two GPUs, and up to 3.3x when sorting on four GPUs, compared to single-GPU sorting. Furthermore, implementing a fundamental algorithm such as sorting for a multi-GPU system allows us to evaluate this emerging platform and the mechanisms it offers.

The paper is organized as follows: Section 2 presents the necessary background material on sorting algorithms and the base system we assume in this paper; Section 3 provides an overview of the algorithm, while Section 4 goes through the insights that enable an efficient implementation; Section 5 discusses the experimental evaluation of the proposed solution; Section 6 presents the relevant work related to sorting on GPUs; finally, Section 7 draws the conclusions of this work.

2. BACKGROUND AND MOTIVATION

This section presents the background information required to understand the following sections. We first describe the families of sorting algorithms and then we describe our target system.

2.1 Sorting Algorithms

Most parallel versions of comparison-based sorting algorithms are based on two approaches. The merge-based approach splits data into chunks, sorts these chunks independently (using some "small sort" algorithm) and then performs a number of merge steps to obtain globally sorted data. The other approach, distribution-based, sets up a number of buckets, computes the destination bucket of each element, scatters all elements to their buckets, sorts these buckets independently, and finally concatenates all buckets to obtain globally sorted data. Buckets are usually set up through sampling (deterministic or random, depending on the approach) in an effort to balance their occupancy. The picked samples delimit the bucket ranges, while some form of histogramming is usually performed to determine the size of each bucket.

2.2 Algorithmic Approach

In this paper we target a merge-based algorithm which, paired with our pivot selection algorithm, allows us to build a sorting mechanism suitable for the distributed structure of a multi-GPU system. This is achieved by moving the data in chunks to their destination GPUs in a series of merge steps and finally performing the merge locally, in each GPU.

Compared to typical distribution-based sorting mechanisms, our approach to merge sort provides several benefits. It is able to efficiently utilize the available memory since there are no data marshalling stages, and all the data exchanged by a pair of GPUs is of the same size, so it can fit in the place of the data it is being exchanged with. As a result, our algorithm ensures that each data partition fits in the GPU, and it keeps the system in complete load balance. Pivot selection is the only part of the algorithm that could benefit from a shared memory architecture and, as we show later, modern GPUs are able to cope with this.

Distribution-based sorting algorithms are, on the other hand, optimal regarding communication (the necessary number of transfers of a single element until it reaches its destination processor) and hence are the preferred approach for sorting on large clusters and supercomputers. Because our targeted systems, described below, are small scale systems, communication overhead is less of a concern. This is also shown through the experimental evaluation in Section 5.
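To make the equal-size exchange concrete, the following is a minimal CPU sketch (in plain C++, not the paper's CUDA implementation) of one merge step between two "GPUs" holding sorted arrays of equal length n. A binary search finds a pivot i such that a[0..i) and b[0..n-i) together contain the n smallest keys; exchanging a[i..n) with b[0..n-i) then swaps two contiguous chunks of identical size, after which each side finishes with a purely local merge. The function names are ours, chosen for illustration only.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Find the pivot i in [0, n] such that a[0..i) and b[0..n-i) hold the
// n smallest elements of the union. Assumes a.size() == b.size() and
// both arrays sorted ascending.
std::size_t find_pivot(const std::vector<int>& a, const std::vector<int>& b) {
    std::size_t lo = 0, hi = a.size();
    while (lo < hi) {
        std::size_t m = (lo + hi) / 2;
        // If a[m] is still below its counterpart in b, take more from a.
        if (a[m] < b[a.size() - m - 1]) lo = m + 1;
        else                            hi = m;
    }
    return lo;
}

// One merge step between two nodes: equal-size in-place exchange, then
// a local merge on each side. Neither array ever grows, so each
// partition keeps fitting in the memory it started with.
void merge_step(std::vector<int>& a, std::vector<int>& b) {
    std::size_t n = a.size(), i = find_pivot(a, b);
    // Exchange exactly n - i elements in each direction.
    std::swap_ranges(a.begin() + i, a.end(), b.begin());
    // Each side now holds two sorted runs; merge them locally.
    std::inplace_merge(a.begin(), a.begin() + i, a.end());
    std::inplace_merge(b.begin(), b.begin() + (n - i), b.end());
}
```

After `merge_step`, `a` holds the n smallest keys and `b` the n largest, both sorted, so their concatenation is globally sorted; this is the load-balance and memory-footprint invariant the pivot selection guarantees.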
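For contrast, the distribution-based family described in Section 2.1 can be sketched as follows. This is a generic sequential illustration of the sample/histogram/scatter pattern, not code from the paper or from any particular GPU implementation; the splitters stand in for the samples that delimit the bucket ranges.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Distribution-based sort sketch: splitters (sorted samples) delimit the
// bucket ranges; a histogram pass sizes each bucket, a scatter pass moves
// every element to its bucket, each bucket is sorted independently, and
// the buckets are concatenated to produce globally sorted output.
std::vector<int> distribution_sort(const std::vector<int>& in,
                                   const std::vector<int>& splitters) {
    std::size_t nb = splitters.size() + 1;   // number of buckets
    auto bucket_of = [&](int x) -> std::size_t {
        // Index of the first splitter greater than x is the bucket id.
        return static_cast<std::size_t>(
            std::upper_bound(splitters.begin(), splitters.end(), x)
            - splitters.begin());
    };
    // Histogram pass: compute the occupancy of each bucket.
    std::vector<std::size_t> count(nb, 0);
    for (int x : in) ++count[bucket_of(x)];
    std::vector<std::vector<int>> buckets(nb);
    for (std::size_t b = 0; b < nb; ++b) buckets[b].reserve(count[b]);
    // Scatter pass, then independent bucket sorts and concatenation.
    for (int x : in) buckets[bucket_of(x)].push_back(x);
    std::vector<int> out;
    out.reserve(in.size());
    for (auto& b : buckets) {
        std::sort(b.begin(), b.end());
        out.insert(out.end(), b.begin(), b.end());
    }
    return out;
}
```

Note how bucket occupancy depends entirely on how well the splitters match the key distribution, which is why sampling quality matters for load balance, whereas the merge-based exchange above balances the partitions by construction.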