CrystalGPU: Transparent and Efficient Utilization of GPU Power

Abdullah Gharaibeh, Samer Al-Kiswany, Matei Ripeanu
Electrical and Computer Engineering Department, The University of British Columbia
{abdullah, samera, matei}@ece.ubc.ca

Abstract

General-purpose computing on graphics processing units (GPGPU) has recently gained considerable attention in various domains such as bioinformatics, databases, and distributed computing. GPGPU is based on using the GPU as a co-processor accelerator to offload computationally-intensive tasks from the CPU. This study starts from the observation that a number of GPU features (such as overlapping communication and computation, short-lived buffer reuse, and harnessing multi-GPU systems) can be abstracted and reused across different GPGPU applications. This paper describes CrystalGPU, a modular framework that transparently enables applications to exploit a number of GPU optimizations. Our evaluation shows that CrystalGPU enables up to 16x speedup gains on synthetic benchmarks, while introducing negligible latency overhead.

1 Introduction

Today's GPUs offer a drastic reduction in computational costs compared with traditional CPUs. These savings encourage using GPUs as hardware accelerators to support computationally-intensive applications like, for example, scientific applications [1] or distributed computing middleware [2]. In particular, GPUs are well suited to support stream processing applications, that is, applications that process a continuous stream of data units, where the processing of a single data unit may exhibit intrinsic parallelism that can be exploited by offloading the computation to the GPU. Examples of such applications include video encoding/decoding [3] and deep packet inspection [4, 7].

Recent GPU models include a number of new capabilities that not only enable even lower computational costs, but also make GPUs an interesting computational platform for fine-granularity, latency-sensitive applications. For example, new GPU models have the ability to overlap computation and communication with the host, thus offering the opportunity to hide communication overheads, a significant source of overhead for data-intensive applications. Additionally, the newly introduced dual-GPU devices (e.g., NVIDIA's GeForce 9800 GX2) offer an additional level of parallelism by placing two GPUs on the same graphics card, and thus an opportunity to harness additional computing capacity.

However, the efficient utilization of these capabilities is a challenging task that requires careful management of GPU resources by the application developer, which leads to extra development effort and additional application complexity [5]. This is mainly due to the peculiar characteristics of current GPU architectures and development environments. For example, to effectively overlap computation and communication, the application needs to asynchronously launch a number of independent computation and data transfer tasks, which requires the application to maintain additional state information to keep track of these asynchronous operations. Also, to employ dual-GPU devices, the application has to manually detect the available devices and spread the work among them.

This project proposes CrystalGPU, an execution framework that aims to simplify the task of GPU application developers by providing a higher level of abstraction, while, concurrently, maximizing GPU utilization. In particular, the framework runs entirely on the host machine and provides a layer between the application and the GPU, managing the execution of GPU operations (e.g., transferring data from/to the GPU and starting computations on the GPU). To this end, the framework seeks to transparently enable three optimizations that can speed up GPU applications:
- First, reusing GPU memory buffers to amortize the cost of allocation and deallocation of short-lived buffers. Al-Kiswany et al. [2] demonstrate that the time spent on memory buffer allocations can account for as much as 80% of the total execution time.

- Second, new GPU architectures and programming environments allow applications to overlap the communication and computation phases. Al-Kiswany et al. [2] show that the communication overhead can be as high as 40% for data-intensive applications. Other studies have also demonstrated the significance of communication overhead when using the GPU [6-8]. In these cases, overlapping computation and communication creates the opportunity to hide the communication cost, maximizing the utilization of the GPU's computational power as a result.

- Third, multi-GPU systems have become common: dual-GPU boards are commercially available, and research groups have started to prototype GPU clusters that assemble even more GPUs in the same workstation, creating, as a result, an opportunity to aggregate additional low-cost computing capacity.

The contributions of this work are summarized as follows:

- First, we present a high-level abstraction that aims to transparently maximize the utilization of a number of GPU features, while simplifying the application developers' task. We contend that, in addition to the processor/co-processor scenario we analyze in detail, the abstraction we propose can be easily applied in other situations that involve massively multicore hardware deployments, for example asymmetric multicores like IBM's Cell Broadband Engine Architecture.

- Second, our prototype of this abstraction brings valuable performance gains in a realistic usage scenario where the communication overhead is significant. Our prototype demonstrates up to 16x performance improvement when integrated with StoreGPU, a GPU-based library that accelerates a number of hashing-based primitives commonly used in distributed storage systems.

- Third, we make CrystalGPU available to the community.¹ This framework can be used by a wide range of applications, including stream-processing applications and embarrassingly-parallel data processing.

¹ CrystalGPU is an open source project; the code can be found at http://netsyslab.ece.ubc.ca

The rest of this paper is organized as follows. Section 2 presents background related to GPU programming and the type of applications we target. Section 3 discusses CrystalGPU's API. Section 4 presents the framework's design. Section 5 presents our experimental results. Section 6 discusses related work. We conclude in Section 7.

2 Background

In this section, we present background related to the GPU programming model (Section 2.1), and describe the type of applications that CrystalGPU targets (Section 2.2).

2.1 GPU Programming

Recently, GPUs underwent a radical evolution from a fixed-function, special-purpose pipeline dedicated to graphics applications to a fully programmable SIMD processor [9]. These changes transformed GPUs into powerful engines that support offloading computations beyond graphics operations.

In general, offloading computation to the GPU is a three-stage process. First, transferring input data to the GPU's internal memory. GPUs do not have direct access to the host's main memory; rather, they can only access their onboard memory. Consequently, the application has to explicitly allocate buffers in the GPU's local memory and transfer the data to them through an I/O interface, such as the host's PCI-Express bus. Further, data transfer operations are performed using a direct memory access (DMA) engine, which requires the data to be placed in the host's pinned (i.e., non-pageable) memory. Therefore, it is important for the application to use pinned buffers for the data from the beginning; otherwise, the device driver will perform an additional copy to an internal pinned buffer before transferring the data to the GPU.

Second, processing. Once the data is transferred to the GPU's internal memory, the application starts the computation by invoking the corresponding 'kernel', a function that, when called, is executed on the GPU. A number of optimizations can be applied at the kernel level (e.g., to enable efficient utilization of the GPU's internal memory architecture, to minimize thread divergence, etc.); however, these application-level optimizations are beyond the scope of our work, as we focus on higher-level optimizations agnostic to the details of the kernel implementation.

Third, transferring the results from the GPU to the host's main memory. After the kernel finishes execution, the application explicitly transfers the results through the I/O channel back to the host's main memory.
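To make these three stages concrete, the sketch below expresses a single, fully synchronous job directly against the CUDA runtime API (the toolkit our prototype builds on, see Section 4). It is a minimal illustration only: the kernel name, launch configuration, and buffer sizes are hypothetical placeholders, not part of CrystalGPU.

    /* A minimal sketch of the three-stage offload process described above.
       "process" and the launch configuration are illustrative placeholders. */
    #include <cuda_runtime.h>
    #include <string.h>

    __global__ void process(const char *in, char *out, int n);  /* application kernel */

    void run_job(const char *data, int in_size, char *result, int out_size) {
        char *h_in, *h_out, *d_in, *d_out;
        cudaHostAlloc((void **)&h_in,  in_size,  cudaHostAllocDefault); /* pinned host buffers: */
        cudaHostAlloc((void **)&h_out, out_size, cudaHostAllocDefault); /* avoid the driver's extra copy */
        cudaMalloc((void **)&d_in,  in_size);                           /* onboard (device) buffers */
        cudaMalloc((void **)&d_out, out_size);
        memcpy(h_in, data, in_size);

        /* Stage 1: transfer input data to the GPU's internal memory */
        cudaMemcpy(d_in, h_in, in_size, cudaMemcpyHostToDevice);
        /* Stage 2: processing - invoke the kernel */
        process<<<128, 256>>>(d_in, d_out, in_size);
        /* Stage 3: transfer the results back to the host's main memory */
        cudaMemcpy(h_out, d_out, out_size, cudaMemcpyDeviceToHost);
        memcpy(result, h_out, out_size);

        cudaFree(d_in);  cudaFree(d_out);
        cudaFreeHost(h_in);  cudaFreeHost(h_out);
    }

Note that allocating and freeing these buffers on every job is exactly the per-job cost that CrystalGPU's buffer pool amortizes (Section 3).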

This three-stage process, collectively named herein a 'job', is repeated by a stream processing application for each new data block to be processed. Note that, due to limitations in current GPGPU programming environments, a job can only be executed on a single GPU, even on dual-GPU cards. Therefore, using multi-GPU architectures requires the application to explicitly divide the data into two data sets and perform two transfers and two kernel invocations, one for each GPU.

2.2 CrystalGPU Applications

In this section, we discuss the type of applications that CrystalGPU targets. We classify the candidate applications as stream- and batch-processing applications depending on their real-time constraints.

A. Stream processing applications. Streaming applications sequentially process a steady stream of small-sized data units. Such applications pose real-time constraints that entail low latency overheads. In this category, parallelism is based on two opportunities. First, the processing of each data unit represents a single job that may have some intrinsic parallelism. Second, accumulating a number of data units and processing them in parallel as a single job offers a tradeoff: it increases parallelism, and consequently the processing throughput, while also increasing the processing latency for individual jobs.

One example in this category is video encoding/decoding. Recently, high-definition television (HDTV) broadcast has become popular. This technology enables higher resolution than traditional broadcast formats; however, it requires efficient video compression/decompression mechanisms to reduce transmission costs over the network. Van der Laan [3], for example, presents a GPU-accelerated video codec library that implements a number of common video compression/decompression techniques such as block motion compensation and frame arithmetic.
Another example is using the GPU to accelerate distributed systems middleware-level techniques. For instance, a number of distributed storage systems (e.g., Venti [10], OceanStore [11]) employ a technique known as content-addressable storage: files are divided into chunks, which, in turn, are identified based on their content by using the hash of the chunk as its identifier. Content-addressable storage simplifies the separation of file and chunk metadata, and facilitates the identification and elimination of duplicate chunks, hence minimizing the amount of data that storage systems need to manage and transfer. However, the overhead of computing chunk hashes may limit performance. Al-Kiswany et al. [2] present StoreGPU, a library that accelerates a number of hashing-based primitives that support content-addressable storage.

Deep packet inspection (DPI) is yet another example of a streaming application. A practical DPI solution must support high-speed packet inspection and impose low latency overheads. For example, the GPU can be used to accelerate a number of computationally expensive DPI operations such as header classification and signature matching algorithms [4, 7].

In the above use cases, the GPU is used to sequentially carry out computations on a stream of data blocks (frames in the first, data chunks in the second, and packets in the third use case). CrystalGPU reduces the latency of such stream-processing applications by hiding the time spent on buffer allocations and data transfers to/from the GPU. Further, it enables increased throughput by efficiently harnessing multi-GPU architectures.

B. Batch processing applications. We include in this category applications for which a large number of independent jobs are available for processing at any point in time and for which individual jobs do not have latency constraints. In this case, parallelism can be easily extracted by bundling individual jobs into batches. Folding@Home [12] is an example of this category: it aims to analyze a large number of chunks of biological data; the chunks are independent and are all available at the beginning of the analysis.

For this category, CrystalGPU can enhance the utilization of the GPU by overlapping the computation for one batch with the data transfer for the next one; further, it enables transparent utilization of multi-GPU systems, where multiple batches can be processed in parallel.

Figure 1: CrystalGPU is a layer between the application and the GPU runtime.

3 CrystalGPU API

The CrystalGPU framework is a management layer between the application and the GPU runtime (Figure 1). The framework's API aims for generality to facilitate support for a wide range of applications. We achieve this goal via an interface that is agnostic to the upper-level application.

The framework's API is influenced by the GPU's programming model described in Section 2.1. In general, job execution on the GPU is enabled by providing the application mechanisms to (i) define the input data, (ii) define the execution kernel, and (iii) claim the results. To facilitate these mechanisms, CrystalGPU defines the job abstract data type (Figure 2), which specifies the main parameters needed to describe a job.

    typedef struct job_s {
        void *h_input;
        void *h_output;
        void *d_input;
        void *d_output;
        int input_size;
        int output_size;
        ...
        void (*kernel_func)(struct job_s *);
        void (*callback_func)(struct job_s *);
    } job_t;

Figure 2: Job data type (C-style).

Interface | Description
init(max_input_size, max_output_size) | Initializes the framework; the parameters define the maximum size of the input/output pre-allocated buffers.
finalize() | Cleans up the framework's state.
job_t *job_get() | Gets a free job from the pre-allocated free pool.
job_put(job_t *job) | Returns a job back to the free pool.
job_submit(job_t *job) | Submits a job to be dispatched to the GPU.
job_synch(job_t *job) | Blocks until the designated job finishes execution.
job_query(job_t *job) | Returns true if the job has finished execution, false otherwise.

Figure 3: CrystalGPU public API.

The h_input member of the job data type specifies the host buffer to be transferred to the destination GPU buffer d_input; likewise, d_output specifies the GPU results buffer to be transferred back to the host destination buffer h_output. Additionally, kernel_func specifies the application-specific kernel to be executed on the GPU, while callback_func specifies the callback function to be invoked when the job is done.

Figure 3 illustrates the framework's public interface. When initializing the framework, the application declares the maximum input/output buffer sizes it will use. The framework, accordingly, creates a buffer pool by pre-allocating a number of host and device buffers. The rationale behind the buffer pool is twofold. First, experience shows that GPU buffer allocation is expensive, especially for small-scale computations [2]. By pre-allocating GPU buffers once at the beginning, the allocation time overhead gets amortized. In Section 5, we show that this optimization provides a significant speedup. Second, the GPU driver always needs buffers in pinned memory for DMA transfers between the host and the GPU. As a result, pre-allocating buffers directly in pinned memory, and making them available to the application in advance, saves the GPU driver an extra memory copy to an internal pinned memory buffer, thus reducing data transfer overhead.

The API enables the application to acquire and release pre-allocated buffers (encapsulated within a job instance) from the pool (job_get, job_put), to submit a job to the framework to relay it to the GPU (job_submit), and to synchronously query or block on the status of a job (job_query, job_synch). Note that the latter is in addition to the asynchronous notification enabled by the callback function described before, hence allowing further flexibility for the application.
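The following sketch illustrates how an application might drive this API for a single data block. It assumes the declarations of Figures 2 and 3 are exposed through a header (named crystalgpu.h here only for illustration); the digest-size constant and the kernel and callback bodies are hypothetical.

    /* A minimal usage sketch of the CrystalGPU API (Figures 2 and 3). */
    #include <string.h>
    #include "crystalgpu.h"           /* hypothetical header exposing the API */

    #define DIGEST_SIZE 20            /* e.g., a SHA1 digest; illustrative only */

    void my_kernel(struct job_s *job);    /* wraps the application's GPU kernel */
    void my_callback(struct job_s *job);  /* invoked asynchronously on completion */

    void hash_block(const char *data, int size, char *digest) {
        init(size, DIGEST_SIZE);            /* pre-allocate host/device buffer pools */

        job_t *job = job_get();             /* acquire a job with pinned buffers */
        memcpy(job->h_input, data, size);   /* stage input directly in pinned memory */
        job->input_size  = size;
        job->output_size = DIGEST_SIZE;
        job->kernel_func   = my_kernel;
        job->callback_func = my_callback;

        job_submit(job);                    /* dispatched to one of the GPUs */
        job_synch(job);                     /* block; alternatively poll with job_query() */
        memcpy(digest, job->h_output, DIGEST_SIZE);

        job_put(job);                       /* return the buffers to the pool */
        finalize();
    }

In a streaming scenario, the application would keep the framework initialized, acquire a new job per data block, and rely on callback_func instead of job_synch so that several jobs are in flight at once.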
4 CrystalGPU Design

The CrystalGPU design aims for efficiency to maximize the utilization of the GPU. The design makes minimal assumptions about the internal GPU architecture, while focusing on avoiding spurious data copies and concurrency bottlenecks.

We have implemented a prototype of the CrystalGPU framework. The prototype is built atop NVIDIA's CUDA toolkit [13]; however, other low-level GPU development toolkits (e.g., RapidMind [14] and OpenCL [15]) could also be used.

The rest of this section discusses the challenges that influenced CrystalGPU's internal design (Section 4.1) and the framework's internal details (Section 4.2).

4.1 Design Challenges

CUDA supports asynchronous data transfers and kernel launches, which opens the opportunity to overlap data transfers and kernel execution across jobs. However, realizing the asynchronous operations' potential entails a number of challenges:

- Lack of an asynchronous notification mechanism. Although CUDA supports asynchronous operations, it does not provide a notification mechanism to asynchronously interrupt the application when a GPU operation completes. Consequently, the application is required to periodically poll the CUDA runtime library to query the status of a job.

- Interleaved job execution. CUDA enables an asynchronous mode for data transfer operations and kernel executions. This asynchronous mode can be used to overlap the data transfer of one job with the kernel execution of another, increasing the system's overall throughput accordingly. However, to effectively achieve this overlap, the application needs to explicitly interleave the execution of a number of jobs (at least three). Interleaving the execution of jobs is done by asynchronously issuing the input data transfer operations for all available jobs, invoking the kernel asynchronously for all jobs, and finally issuing the asynchronous result transfers back from the GPU for all the jobs (see the sketch after this list). This interleaved job execution dramatically complicates application design and development.

- Primitive support for multi-GPU systems. While recent GPU cards are fabricated with multiple GPU devices deployed on the same card (similarly, multiple GPU cards can be deployed in one workstation), CUDA provides only primitive support for such multi-GPU systems. In order to use multiple GPUs, the application should decide, for each job, which GPU device it will be executed on. To this end, the application must track GPU device load and implement a load balancing mechanism to achieve maximum hardware utilization.

- Additional state to maintain. Finally, given the asynchronous nature of GPU operations, the application needs to keep track of each job's execution status throughout the job's lifetime.
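To illustrate the interleaving burden referenced in the second challenge above, the sketch below shows what an application would have to do by hand with CUDA streams. It assumes the buffer arrays are already allocated (host buffers pinned); the kernel and launch configuration are placeholders.

    /* Manual interleaving of jobs with CUDA streams (what CrystalGPU hides). */
    #include <cuda_runtime.h>
    #define JOBS 3                          /* at least three jobs to fill the pipeline */

    __global__ void process(const char *in, char *out, int n);

    void run_interleaved(char *h_in[JOBS], char *h_out[JOBS],
                         char *d_in[JOBS], char *d_out[JOBS], int size) {
        cudaStream_t s[JOBS];
        for (int i = 0; i < JOBS; i++) cudaStreamCreate(&s[i]);

        for (int i = 0; i < JOBS; i++)      /* issue all input transfers asynchronously */
            cudaMemcpyAsync(d_in[i], h_in[i], size, cudaMemcpyHostToDevice, s[i]);
        for (int i = 0; i < JOBS; i++)      /* then all kernel launches */
            process<<<128, 256, 0, s[i]>>>(d_in[i], d_out[i], size);
        for (int i = 0; i < JOBS; i++)      /* then all result transfers */
            cudaMemcpyAsync(h_out[i], d_out[i], size, cudaMemcpyDeviceToHost, s[i]);

        for (int i = 0; i < JOBS; i++) {    /* no asynchronous notification: the caller */
            cudaStreamSynchronize(s[i]);    /* must block or poll each stream itself    */
            cudaStreamDestroy(s[i]);
        }
    }

The application must also keep the per-job state (streams, buffer ownership, completion status) itself, and multiply this bookkeeping across GPUs; this is the complexity CrystalGPU encapsulates.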

The next section describes the CrystalGPU design that transparently addresses these challenges.

4.2 CrystalGPU Internal Design

The CrystalGPU design comprises a single driving module named the master. The master module employs a number of host-side master threads, each assigned to manage one GPU device.

The rationale behind this assignment is twofold. First, each master thread is responsible for querying its device for job completion status and asynchronously notifying the application, using the callback function, once the job is done. This allows the client to make progress on the CPU in parallel; further, it relieves the application from job execution state bookkeeping. Second, having a dedicated control thread for each GPU facilitates transparent support for multi-GPU systems. As detailed in the next paragraphs, the application submits a job to a shared outstanding queue; later, the job is transparently handled by one of the master threads in a way that balances the load across GPUs. We note that this multithreaded design requires a multi-core CPU to enable maximum parallelism across the master threads as well as the application's host-side threads.

The application requests services from the framework by submitting jobs and waiting for callbacks. The status of a job is maintained by the master using three queues. First, the idle queue maintains empty job instances (with pre-allocated buffers). Second, the outstanding queue contains ready-to-run jobs submitted by the application but not yet dispatched to the GPUs. Third, the running queue contains the jobs that are currently being executed on the GPUs.

Figure 4: CrystalGPU execution pipeline.

A master thread is chosen in a round-robin scheme to execute the next job in the outstanding queue. The execution of a job forms a three-stage pipeline (Figure 4) that corresponds to the stages described in Section 2. Therefore, always having three or more jobs available for execution is necessary to fill the pipeline and maximize the utilization of the GPU by overlapping the communication of one job with the computation of another. Note that the framework does not control the time spent in each stage of the pipeline, as this is related to the characteristics of the application. As a result, having a balanced pipeline (i.e., one with equal communication and computation overheads) is necessary to efficiently exploit the opportunity of overlapping communication and computation.
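The skeleton below shows one plausible shape for a master thread consistent with the description above. It is an illustration only, not the actual CrystalGPU implementation: the queue type and all helper routines (queue_pop, queue_push, queue_remove, issue_async_*, job_done, framework_running) are hypothetical names, and the real framework pipelines several jobs per device and selects masters round-robin rather than running one job at a time.

    /* Illustrative master-thread loop: one host thread per GPU device.
       All framework internals below are hypothetical, for illustration only. */
    #include <cuda_runtime.h>
    #include <sched.h>
    #include "crystalgpu.h"                    /* hypothetical header: job_t, queue_t */

    extern volatile int framework_running;
    extern queue_t outstanding_queue, running_queue;
    job_t *queue_pop(queue_t *q);
    void   queue_push(queue_t *q, job_t *j);
    void   queue_remove(queue_t *q, job_t *j);
    void   issue_async_copy_in(job_t *j), issue_async_kernel(job_t *j), issue_async_copy_out(job_t *j);
    int    job_done(job_t *j);

    void *master_thread(void *arg) {
        int device = (int)(long)arg;           /* the GPU this master manages */
        cudaSetDevice(device);                 /* bind the thread to its device */
        while (framework_running) {
            job_t *job = queue_pop(&outstanding_queue);  /* shared across masters */
            queue_push(&running_queue, job);
            issue_async_copy_in(job);          /* pipeline stage 1 */
            issue_async_kernel(job);           /* pipeline stage 2 */
            issue_async_copy_out(job);         /* pipeline stage 3 */
            while (!job_done(job))             /* CUDA offers no asynchronous         */
                sched_yield();                 /* notification, so the master polls   */
            queue_remove(&running_queue, job);
            job->callback_func(job);           /* asynchronous notification to the app */
        }
        return NULL;
    }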

5 Experimental Evaluation

We evaluate our prototype using an NVIDIA GeForce 9800 GX2 GPU (released March 2008). The card is a dual-GPU device composed of two internal GeForce 8800 GTS GPUs. The host is an Intel quad-core 2.66 GHz machine with a PCI Express 2.0 x16 bus. Note that the two internal GPUs share the same PCI bus.

The experiments aim to evaluate the overhead introduced by the framework (Section 5.1) as well as the performance gains enabled by the framework in one of the usage scenarios described earlier, namely when integrated with the StoreGPU library (Section 5.2). For all experiments, we report averages over at least 30 runs with 95% confidence intervals.

5.1 Overhead

The first experiment aims to evaluate the overhead introduced by CrystalGPU. To this end, we run an experiment that measures the time spent in the CrystalGPU layer while executing a single job with and without the framework. The job involves: (i) transferring a specific data size to the GPU, (ii) executing a dummy kernel with insignificant execution time, and (iii) transferring the same amount of data back to the CPU.

Figure 5: CrystalGPU overhead while executing a single job with very low execution time. Top plot: percentage of total execution time. Bottom plot: absolute microseconds spent in the CrystalGPU layer.

Figure 5 shows the overhead while varying the data size. The results show that the framework introduces a constant latency overhead irrespective of the data size (slightly less than 70 µs), the result of threading and managing jobs in queues. This overhead corresponds to a significant share of the total execution time for small data sizes (up to 35%).

Nevertheless, in a realistic scenario, like the one we describe in the next section, a single job's execution time is on the order of milliseconds, even for small data sizes, due to the larger kernel execution time in real applications. This renders the framework's latency overhead insignificant. Further, for the type of applications we target, in which more than one job is submitted sequentially back to back or as a group, the overhead introduced by the framework on one job is completely hidden by the kernel execution time of a former one, as only one job can be executed on the GPU processing units at a time.

5.2 Application-level Performance

This section evaluates the performance gains achieved by the StoreGPU library when integrated with the CrystalGPU framework. The integration required modifying the StoreGPU library from directly calling CUDA functions to calling CrystalGPU's interface. These changes were fairly localized and minimal.

We note that the workload StoreGPU serves (e.g., computing the hash value for a stream of large data blocks) is similar to other stream processing applications like DPI and video encoding/decoding. StoreGPU accelerates a number of hashing-based primitives popular in storage systems that use content-based addressing and is optimized to run on the GPU. Here, however, we do not explore these optimizations; rather, we use a fully optimized StoreGPU version to explore the performance gains enabled by three techniques at the CrystalGPU framework level: (i) reusing memory buffers, (ii) overlapping computation and communication, and (iii) using multi-GPU architectures.

Figure 6: Fraction of execution time spent on each stage for the SHA1 direct hashing algorithm.

Figure 6 shows the percentage of execution time spent at each execution stage using the original implementation provided by the StoreGPU library (direct hashing based on SHA1). The hash computation is carried out in five main stages: GPU and CPU pinned memory allocation, data copy into the GPU, kernel execution, results copy out to the host machine, and post-processing on the CPU. Note that the original StoreGPU version employs blocking data transfer operations, which hinder the ability to overlap data transfer operations and kernel executions.

For all data sizes, the ratio of time spent on memory allocations on both the GPU and the CPU is high (up to 85% for small data blocks and at least 38% for the larger data sizes). Further, the figure demonstrates that the corresponding execution pipeline (i.e., copy in, kernel, and copy out) is not balanced, as the kernel execution time dominates the pipeline. Still, as the data size increases, the proportion of time spent on the copy-in stage becomes closer to that spent on kernel execution, making the pipeline more balanced.

Figure 7 shows the speedup obtained when using the CrystalGPU framework for a stream of 10 jobs, compared to the performance of the original StoreGPU implementation as a baseline. The figure demonstrates that by exploiting buffer reuse, CrystalGPU is able to amortize memory management costs, enabling, as a result, more than 6x speedup for small data blocks (corresponding to 85% savings) and more than 1.6x for larger data sizes (corresponding to 38% savings).
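These two sets of numbers are consistent with each other: removing a fraction f of the original execution time yields a speedup of roughly 1 / (1 - f), ignoring second-order effects. Eliminating the 85% spent on allocations for small blocks gives 1 / (1 - 0.85), or about 6.7x, and eliminating 38% for large blocks gives 1 / (1 - 0.38), or about 1.6x, matching the measured gains.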
Enabling the overlap feature allows for an additional speedup increase that corresponds to the portion of time spent on data transfer operations (up to 8x for small data blocks and at least 3x for larger data blocks).

Figure 7: Achieved speedup for SHA1 for a stream of 10 jobs. The baseline is the original StoreGPU implementation (values below 1 are slowdowns). The figure also presents the performance when run on one CPU core.

Also, using both GPUs available on the card increases the speedup further, this time by exploiting additional resources. For small data blocks, the speedup almost doubles; however, for larger blocks the speedup gains are limited by two factors. First, the ratio of time spent on data transfers increases as we increase the data size, thus placing more contention on the shared I/O channel. The second factor relates to the increased ratio of time spent on the post-processing stage, which is executed on the CPU.

We note that the absolute kernel execution time of a single SHA1 hashing job ranges from 0.5 ms for the smallest data size to 10 ms for the largest. In all cases, the kernel execution time is much larger than the 70 µs overhead introduced by the framework, hence rendering this overhead insignificant.

It is interesting to note that while the original StoreGPU performance lags behind the CPU performance for small data sizes (less than 1 MB), enabling all CrystalGPU optimizations achieves significant speedup for all data sizes.

Figure 8 explores the achieved speedup for a stream of jobs that each hash a block of 1 MB while varying the stream length. For a stream of one job, the only feature that brings speedup is buffer reuse, which is intuitive. However, increasing the stream size enables the other two features to contribute additional speedup, while buffer reuse has a constant effect. On the one hand, the overlap feature is able to hide the relatively small percentage of time spent on data transfers (Figure 6), increasing the speedup by up to 1x (from 7x to 8x). On the other hand, enabling the dual-GPU feature almost doubles the speedup immediately after increasing the stream size above 1.

The above experiment demonstrates that long streams are not required to benefit from most of the speedup enabled by the framework, hence making the framework beneficial even for bursty workloads.

Figure 8: Achieved speedup for SHA1 hashing for a block of 1 MB while varying the stream length (values below 1 are slowdowns). Note that for this data size, the CPU and the original StoreGPU implementation have the exact same performance.

6 Related Work

Using GPUs for general-purpose programming has recently gained considerable attention. This has led to the emergence of a number of frameworks and development environments that simplify software developers' task of programming algorithms for GPUs. Examples of such development frameworks include NVIDIA's CUDA [13], Apple's OpenCL [15], and the RapidMind Development Platform [14], in addition to environments developed in academia such as BrookGPU [16] and Sh [17].

Our work is different in that it operates at a higher level than the above-mentioned development frameworks, as it uses the services provided by such tools. Specifically, CrystalGPU provides a higher-level abstraction that aims to transparently exploit a number of opportunities while viewing the GPU as an abstract computing device.

Building frameworks and higher-level abstractions is an approach often adopted to extend lower-level mechanisms with semantics required by the higher-level layers. For instance, frameworks could be built to provide priority-based scheduling for computing, storage, or networking resources [18], or to present a scalar computing device as a vector processor. Among the frameworks related to our work, Thompson et al. [19] develop a framework to abstract graphics-only GPUs (GPUs without general-purpose computing support) as vector processing accelerators, effectively hiding the graphical processing pipeline details and enabling general-purpose computation on GPUs that do not support general-purpose computing.
7 Summary and Future Work

We have described CrystalGPU, an execution framework that enables GPGPU applications to transparently maximize the utilization of the GPU. The framework is based on three opportunities: first, reusing GPU memory buffers to reduce the cost of allocation/deallocation of short-lived buffers; second, overlapping communication and computation to hide the communication overhead; and third, harnessing the computational power of multi-GPU systems.

Through a use-case application, namely StoreGPU, we demonstrate that CrystalGPU is able to efficiently exploit the above-mentioned opportunities, enabling up to 16x speedup.

The results CrystalGPU achieved encourage us to extend the CrystalGPU framework with a more flexible scheduling mechanism. The current CrystalGPU version implements a simple FIFO scheduling policy between jobs. However, some applications may require prioritizing the processing of specific jobs or enforcing processing deadlines. For instance, in MPEG video encoding, the application may prefer to process the "I" frames (the encoding base frames) at higher priority. Similarly, a deep packet inspection application may require prioritized processing of packets that belong to certain classes of service. To this end, we aim to extend CrystalGPU's interface to enable passing application-specific information that allows the application to customize job scheduling, hence enhancing the application's performance.
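As a purely hypothetical illustration of this direction, the interface could be extended with a scheduling hint attached to each job; neither the type nor the call below exists in the current CrystalGPU version.

    /* Hypothetical scheduling-hint extension (not part of the current API). */
    typedef struct sched_hint_s {
        int  priority;        /* e.g., raised for MPEG "I" frames or premium packet classes */
        long deadline_us;     /* optional processing deadline in microseconds */
    } sched_hint_t;

    /* Jobs would carry a hint, and the outstanding queue would be ordered by it. */
    void job_submit_with_hint(job_t *job, sched_hint_t hint);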
8 References

[1] J. D. Owens, et al., "A survey of general-purpose computation on graphics hardware," in Computer Graphics Forum, 2007, pp. 80-113.
[2] S. Al-Kiswany, et al., "StoreGPU: Exploiting graphics processing units to accelerate distributed storage systems," in Proceedings of the 17th International Symposium on High Performance Distributed Computing, 2008, pp. 165-174.
[3] W. J. van der Laan, http://www.cs.rug.nl/~wladimir/sc-cuda/.
[4] G. Vasiliadis, et al., "Gnort: High performance network intrusion detection using graphics processors," in Proceedings of the 11th International Symposium on Recent Advances in Intrusion Detection, 2008, pp. 116-134.
[5] S. Ryoo, et al., "Optimization principles and application performance evaluation of a multithreaded GPU using CUDA," in Proceedings of the 13th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, 2008, pp. 73-82.
[6] I. Buck, et al., "GPUBench: Evaluating GPU performance for numerical and scientific applications," in Proceedings of the 2004 ACM Workshop on General-Purpose Computing on Graphics Processors, 2004.
[7] N. Goyal, et al., "Signature matching in network processing using SIMD/GPU architectures."
[8] D. Göddeke, et al., "Exploring weak scalability for FEM calculations on a GPU-enhanced cluster," 2007.
[9] J. D. Owens, et al., "GPU computing," Proceedings of the IEEE, vol. 96, pp. 879-899, 2008.
[10] S. Quinlan and S. Dorward, "Venti: A new approach to archival storage," in Proceedings of the FAST 2002 Conference on File and Storage Technologies, 2002.
[11] J. Kubiatowicz, et al., "OceanStore: An architecture for global-scale persistent storage," ACM SIGARCH Computer Architecture News, vol. 28, pp. 190-201, 2000.
[12] S. M. Larson, et al., "Folding@Home and Genome@Home: Using distributed computing to tackle previously intractable problems in computational biology," arXiv:0901.0866, 2009.
[13] NVIDIA, "Compute Unified Device Architecture Programming Guide," NVIDIA, June 2007.
[14] M. Monteyne, "RapidMind Multi-Core Development Platform," RapidMind Inc., 2007.
[15] A. Munshi, "OpenCL: Parallel computing on the GPU and CPU," SIGGRAPH'08: ACM SIGGRAPH 2008 Classes, 2008.
[16] I. Buck, et al., BrookGPU, 2002.
[17] M. McCool and S. Du Toit, Metaprogramming GPUs with Sh. AK Peters, Ltd., 2004.
[18] L. Huang, et al., "Multi-dimensional storage virtualization," in Proceedings of the Joint International Conference on Measurement and Modeling of Computer Systems, 2004, pp. 14-24.
[19] C. J. Thompson, et al., "Using modern graphics architectures for general-purpose computing: A framework and analysis," in Proceedings of the 35th Annual ACM/IEEE International Symposium on Microarchitecture, 2002, pp. 306-317.