VersaPipe: A Versatile Programming Framework for Pipelined Computing on GPU

Zhen Zheng (Tsinghua University), Chanyoung Oh (University of Seoul), Jidong Zhai (Tsinghua University), Xipeng Shen (North Carolina State University), Youngmin Yi (University of Seoul), Wenguang Chen (Tsinghua University)

ABSTRACT

Pipeline is an important programming pattern, while GPU, designed mostly for data-level parallel executions, lacks an efficient mechanism to support pipeline programming and executions. This paper provides a systematic examination of various existing pipeline execution models on GPU, and analyzes their strengths and weaknesses. To address their shortcomings, this paper then proposes three new execution models equipped with much improved controllability, including a hybrid model that is capable of getting the strengths of all. These insights ultimately lead to the development of a software programming framework named VersaPipe. With VersaPipe, users only need to write the operations for each pipeline stage. VersaPipe will then automatically assemble the stages into a hybrid execution model and configure it to achieve the best performance. Experiments on a set of pipeline benchmarks and a real-world face detection application show that VersaPipe produces up to 6.90× (2.88× on average) speedups over the original manual implementations.

CCS CONCEPTS

• General and reference → Performance; • Computing methodologies → Parallel computing methodologies; • Computer systems organization → Heterogeneous (hybrid) systems;

KEYWORDS

GPU, Pipelined Computing

ACM Reference Format:
Zhen Zheng, Chanyoung Oh, Jidong Zhai, Xipeng Shen, Youngmin Yi, and Wenguang Chen. 2017. VersaPipe: A Versatile Programming Framework for Pipelined Computing on GPU. In Proceedings of MICRO-50, Cambridge, MA, USA, October 14–18, 2017, 13 pages. https://doi.org/10.1145/3123939.3123978

1 INTRODUCTION

Pipeline parallel programming is commonly used to exert the full parallelism of some important applications, ranging from face detection to network packet processing, graph rendering, and video encoding and decoding [11, 31, 34, 42, 46]. A pipeline program consists of a chain of processing stages that have a producer-consumer relationship: the output of a stage is the input of the next stage. Figure 1 shows the Reyes rendering pipeline [12], a popular image rendering algorithm in computer graphics. It uses one recursive stage and three other stages to render an image. In general, any program that has multiple stages with the input data going through those stages in succession can be written in a pipeline fashion. Efficient support for pipeline programming and execution is hence important for a broad range of workloads [5, 22, 26, 51].

[Figure 1: Reyes rendering: an example pipeline workload. Primitives flow through the Bound, Split, Dice, and Shade stages to produce the image, with a "diceable?" test routing not-yet-diceable primitives back through the recursive Split stage.]

On the other hand, the modern graphics processing unit (GPU) shows great power in general-purpose computing. Thousands of cores give data-parallel programs a significant performance boost on GPU. While GPU presents powerful performance potential for a wide range of applications, there are some limitations that keep pipeline programs from making full use of GPU's power. With its basic programming model, each major stage of a pipeline is put into a separate GPU kernel, invoked one after another. This creates an ill-fit design for the pipeline nature of the problem. In the Reyes rendering pipeline, for instance, the recursive process would need a large number of kernel calls, which could introduce substantial overhead. And as stages are separated by kernel boundaries, no parallelism can be exploited across stages, which further hurts the performance.
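The kernel-per-stage pattern just described can be sketched with a CPU-side analogue. The stage names follow the Reyes example; the function names and the toy per-item arithmetic are purely illustrative (they are not part of VersaPipe's API), and each function stands in for one GPU kernel launch that must finish over the whole batch before the next stage can begin.

```cpp
#include <vector>
#include <cassert>

// Each "kernel" processes the entire batch before the next stage starts,
// mirroring a pipeline built from one kernel launch per stage:
// no work from stage N+1 can begin until every item finishes stage N.
static std::vector<int> boundStage(const std::vector<int>& in) {
    std::vector<int> out;
    for (int v : in) out.push_back(v + 1);   // stand-in for real bounding work
    return out;
}
static std::vector<int> diceStage(const std::vector<int>& in) {
    std::vector<int> out;
    for (int v : in) out.push_back(v * 2);   // stand-in for dicing
    return out;
}
static std::vector<int> shadeStage(const std::vector<int>& in) {
    std::vector<int> out;
    for (int v : in) out.push_back(v - 3);   // stand-in for shading
    return out;
}

// The host drives the pipeline with one call ("kernel launch") per stage.
std::vector<int> runPipeline(std::vector<int> items) {
    items = boundStage(items);   // launch 1
    items = diceStage(items);    // launch 2
    items = shadeStage(items);   // launch 3
    return items;
}
```

A recursive stage like Split would have to re-launch its kernel once per recursion level until no item remains undiceable, which is where the large number of kernel calls, and the stalls caused by the few deeply recursive items, come from.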
In the Reyes example, as the recursive depth varies for different data items in the pipeline, a few items that have deep recursive depth will cause the whole pipeline to stall.

How to make GPU best accelerate pipelined computations remains an open question. Some work has devoted manual efforts to developing specific pipeline applications, such as graphics processing [34, 35, 37, 45] and packet processing [46]. They provide some useful insights, but do not offer programming support for general pipeline problems. Some programming models have been proposed recently that try to simplify the development of efficient pipeline applications on GPU. Most of them employ a Megakernel-based approach [3, 16], in which the code for all stages of a pipeline is put into one large GPU kernel that branches to different sub-functions based on which stage the current data item needs to go through. At launch time, the method tries to launch as many blocks as can be concurrently scheduled on GPU and schedules the tasks through software-managed task queues [10, 43, 44, 47].

Although these efforts have made good contributions to advance the state of the art, they have some serious limitations. By putting all stages into one kernel, Megakernel-based methods cause large register and shared memory pressure on GPU. As a result, many programs can have only a small number of threads running concurrently on GPU, leaving the massive parallelism of GPU largely untapped (detailed in Section 8).

An important observation made in this work is that, fundamentally, the prior efforts have been limited by the lack of software controllability of executions on GPU. GPU uses hardware schedulers to decide when and where (on which computing unit of a GPU) a GPU thread runs. All prior pipeline model designs have taken this limitation for granted, which has largely limited the design flexibilities.

In this paper, by combining two recent software techniques (persistent threads [3] and the SM-centric method [50], explained later), we show that the limitation can be largely circumvented without hardware extensions. The software controllability released by that method opens up a much greater flexibility for the designs of pipeline execution models.

Leveraging the flexibility, we propose three new execution models with much improved configurability. They significantly expand the selection space of pipeline models. By examining the various models, we reveal their respective strengths and weaknesses. We then develop a software framework named VersaPipe to translate these insights and models into an enhanced pipeline programming system. VersaPipe consists of an API, a library that incorporates the various execution models and the scheduling mechanisms, and an auto-tuner that automatically finds the best configurations of a pipeline. With VersaPipe, users only need to write the operations for each pipeline stage. VersaPipe will then automatically assemble the stages into a hybrid execution model and configure it to achieve the best performance. For its much improved configurability and much enriched execution model space, VersaPipe significantly advances the state of the art in supporting pipeline executions on GPU.

• It proposes VersaPipe, a versatile programming framework that provides automatic support to create the combination of execution models that best suits a given pipeline problem.
• It evaluates VersaPipe on a set of pipeline workloads and compares its performance with several alternative methods. Results show that VersaPipe outperforms the state-of-the-art solutions significantly (by up to 1.66×) and accelerates the original manual implementations by 2.88× on average (up to 6.9×).

Our approach targets NVIDIA GPUs, but the general principles behind the approach could be applied to any model based on an equivalent of NVIDIA's streaming multiprocessors, so long as the model provides a way to gain access to the id of the SM on which a thread is executing.

2 BACKGROUND

This section provides background on GPU that is relevant to this work. We use NVIDIA CUDA [30] terminology for the discussion.

The graphics processing unit (GPU) is now widely used in a variety of general-purpose computing domains. It adopts a very wide Single Instruction Multiple Data (SIMD) architecture, in which hundreds of units can process instructions simultaneously. Each unit in the SIMD array of a GPU is called a Streaming Processor (SP). Each SP has its own PC and registers, and can access and process a different instruction in the kernel code. A GPU device has multiple clusters of SPs, called Streaming Multiprocessors (SMs). Each SM has independent hardware resources of shared memory, L1 cache, and register files, while all SMs share the L2 cache. For more efficient processing and data mapping, threads on GPU are grouped into CTAs (Cooperative Thread Arrays, also called thread blocks). Threads in the same CTA can communicate with each other via global memory, shared memory, and barrier synchronization. The number of threads in a CTA is set by the program, but is also limited by the architecture. All CTAs run independently, and different CTAs can run concurrently.

GPU is a complex architecture in which resources are dynamically partitioned depending on the usage. Since an SM has a limited number of registers and a limited size of shared memory, any of them can be a limiting factor that results in under-utilization of GPU. When the hardware resource usage of a thread increases, the total number of threads that can run concurrently on GPU decreases. Due to this nature of GPU architecture, capturing the workload concurrency into a kernel should be done carefully, considering the resource usage.
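The resource trade-off just described can be made concrete with a back-of-the-envelope occupancy calculation. The per-SM limits below are illustrative, assumed values (roughly in line with recent NVIDIA GPUs), not the numbers for any specific device:

```cpp
#include <algorithm>
#include <cassert>

// Assumed per-SM hardware limits (illustrative values only).
struct SmLimits {
    int maxThreads  = 2048;     // hardware cap on resident threads per SM
    int registers   = 65536;    // 32-bit registers per SM
    int sharedBytes = 98304;    // shared memory per SM (96 KB)
};

// Upper bound on concurrently resident threads per SM, given how many
// registers and how much shared memory each thread consumes.
// Whichever resource runs out first becomes the limiting factor.
int maxResidentThreads(const SmLimits& sm, int regsPerThread, int sharedBytesPerThread) {
    int byRegs   = regsPerThread > 0 ? sm.registers / regsPerThread
                                     : sm.maxThreads;
    int byShared = sharedBytesPerThread > 0 ? sm.sharedBytes / sharedBytesPerThread
                                            : sm.maxThreads;
    return std::min({sm.maxThreads, byRegs, byShared});
}
```

Under these assumed limits, a kernel using 32 registers per thread can keep all 2048 threads resident, while one using 128 registers per thread is capped at 512, a 4× loss of concurrency. This is exactly the effect a megakernel suffers: its register and shared memory usage is the union of all stages' demands, so the heaviest stage throttles the whole pipeline.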