EffiSha: A Software Framework for Enabling Efficient Preemptive Scheduling of GPU

Guoyang Chen, Yue Zhao, Xipeng Shen (Computer Science, North Carolina State University) {gchen11, yzhao30, xshen5}@ncsu.edu
Huiyang Zhou (Electrical and Computer Engineering, North Carolina State University) [email protected]

Abstract

Modern GPUs are broadly adopted in many multitasking environments, including data centers and smartphones. However, the current support for the scheduling of multiple GPU kernels (from different applications) is limited, forming a major barrier for GPUs to meet many practical needs. This work for the first time demonstrates that, on existing GPUs, efficient preemptive scheduling of GPU kernels is possible even without special hardware support. Specifically, it presents EffiSha, a pure software framework that enables preemptive scheduling of GPU kernels with very low overhead. The enabled preemptive scheduler offers flexible support of kernels of different priorities, and demonstrates significant potential for reducing the average turnaround time and improving the overall system throughput of programs that time-share a modern GPU.

1. Introduction

As a kind of massively parallel architecture, GPUs have attained broad adoption in modern computing systems. Most of these systems are multitasking, with more than one application running and requesting the usage of the GPU simultaneously. In data centers, for instance, many customers may concurrently submit their requests, multiple of which often need to be serviced simultaneously by a single node in the data center. How to manage GPU usage effectively in such environments is important for the responsiveness of the applications, the utilization of the GPU, and the quality of service of the computing system.

The default management of the GPU is through the undisclosed GPU drivers and follows a first-come-first-serve policy. Under this policy, the system-level shared resource, the GPU, may get unfairly used: Consider two requests from applications A and B; even though A may have already used the GPU for many of its requests recently and B has only issued its first request, the default GPU management—giving no consideration to the historical usage of the applications—may still assign the GPU to A and keep B waiting if A's request comes just slightly earlier than B's request. Moreover, the default scheduling is oblivious to the priorities of kernels. Numerous studies [1–4] have shown that this problematic way of managing the GPU causes serious unfairness, response delays, and low GPU utilization.

The latest GPUs (e.g., Pascal from NVIDIA) are equipped with the capability to evict a GPU kernel at an arbitrary instruction. However, the kernel eviction incurs substantial overhead in state saving and restoring. Some recent work [2, 4, 5] proposes hardware extensions to help alleviate the issue, but they add hardware complexity.

Software solutions may benefit existing systems immediately. Prior efforts towards software solutions fall into two classes. The first is for trackability. They propose some APIs and OS intercepting techniques [1, 3] to allow the OS or hypervisors to track the usage of the GPU by each application. The improved trackability may help select GPU kernels to launch based on their past usage and priorities. The second class of work is about granularity. They use kernel slicing [6, 7] to break one GPU kernel into many smaller ones. The reduced granularity increases the flexibility in kernel scheduling, and may help shorten the time that a kernel has to wait before it can get launched.

Although these software solutions may enhance GPU management, they are all subject to one important shortcoming: None of them allows the eviction of a running GPU kernel before its finish—that is, none of them allows preemptive GPU scheduling. The request for the GPU from an application, regardless of its priority, cannot get served before the finish of the GPU kernel that another application has already launched. The length of the delay depends on the length of the running kernel. To reduce the delay, some prior proposals (e.g., kernel slicing [6, 7]) attempt to split a kernel into many smaller kernels such that the wait for a kernel to finish gets shortened. They however face a dilemma: The resulting increased number of kernel launches and the reduced parallelism in the smaller kernels often cause substantial performance loss (as much as 58%, shown in our experiments in Section 8).

In this work, we propose a simple yet effective way to solve the dilemma. The key is a novel software approach that, for the first time, enables efficient preemptions of kernels on existing GPUs. Kernels need not be sliced anymore; they voluntarily suspend and exit when it is time to switch kernels on the GPU. How to enable efficient kernel preemption is challenging due to the large overhead of context saving and restoration for the massive number of concurrent GPU threads. Before this work, all previously proposed solutions have relied on special hardware extensions [2, 4].

Our solution is purely software-based, consisting of some novel program transformations and runtime machinery. We call our compiler-runtime synergistic framework EffiSha (for efficient sharing). As a pure software framework, it is immediately deployable in today's systems. Instead of slicing a kernel into many smaller kernels, EffiSha transforms the kernel into a form that is amenable to efficient voluntary preemptions. Little if any data need to be saved or restored upon an eviction. Compared to prior kernel slicing-based solutions, EffiSha reduces the runtime eviction overhead from 58% to 4% on average, and removes the need for selecting an appropriate kernel size in kernel slicing.

EffiSha opens the opportunities for preemptive kernel scheduling on GPU. We implement two priority-based preemptive schedulers upon EffiSha. They offer flexible support for kernels of different priorities, and demonstrate significant potential for reducing the average turnaround time (18-65% on average) and improving the overall system throughput (1.35X-1.8X on average) of program executions that time-share a GPU.

This work makes the following major contributions:

1) It presents EffiSha, a compiler-runtime framework that, for the first time, makes beneficial preemptive GPU scheduling possible without hardware extensions.

2) It proposes the first software approach to enabling efficient preemptions of GPU kernels. The approach consists of multi-fold innovations, including the preemption-enabling program transformation, the creation of GPU proxies, and a three-way synergy among applications, a CPU daemon, and GPU proxies.

3) It demonstrates the potential of EffiSha-based schedulers for supporting kernel priorities and improving kernel responsiveness and system throughput.

2. Terminology and Granularity

We use CUDA terminology for our discussion, but note that the problem and solution discussed in this paper also apply to other GPU programming models (e.g., OpenCL).

A GPU kernel launch usually creates a large number of GPU threads. They all run the same kernel program, which usually includes some references to thread IDs to differentiate the behaviors of the threads and the data they work on. These threads are organized into groups called thread blocks.

Kernel of matrix addition:
  %threadIdx.x: the index of a thread in its block;
  %blockIdx.x: the global index of the thread block;
  %blockDim.x: the number of threads per thread block
  idx = threadIdx.x + blockIdx.x * blockDim.x;
  C[idx] = A[idx] + B[idx];
Figure 1. A kernel for matrix addition.

1. kernel (traditional GPU)
2. predefined # of block-tasks (kernel slicing based work [5])
3. single block-task (EffiSha)
4. arbitrary segment of a block-task (impractical)
Figure 2. Four levels of GPU scheduling granularity.

Note that in a typical GPU program¹, no synchronizations across different thread blocks are supported. Different thread blocks can communicate by operating on the same data locations in the global memory (e.g., reduction through atomic operations), but the communications must not cause dependence hazards (i.e., execution order constraints) among the thread blocks. Otherwise, the kernel could suffer deadlocks due to the hardware-based thread scheduling.

We call the set of work done by a thread block a block-task. For the aforementioned GPU property, the block-tasks of a kernel can run in an arbitrary order (even if they operate on some common locations in memory). Different block-tasks do not communicate through registers or shared memory. The execution of each block-task must first set up the states of its registers and shared memory. Typically, the ID of a thread block is taken as the ID of its block-task. Figure 1 shows the kernel for matrix addition.

Existing GPUs do not allow the eviction of running computing kernels. A newly arrived request for the GPU by a different application must wait for the currently running kernel to finish before it can use the GPU².

Scheduling Granularity. Scheduling granularity determines the time when a kernel switch can happen on GPU. This section explains the choice we make.

¹ Exceptions are GPU kernels written with persistent threads [8], which are discussed later in this paper.
² On some recent GPUs (e.g., the K40m used in our experiments), the hardware sometimes evicts a kernel, but the undisclosed control is entirely oblivious to the priorities of the kernels, OS control, and QoS requirements.
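For concreteness, the Figure 1 kernel of Section 2 corresponds to an ordinary CUDA kernel such as the following sketch; the bounds guard and the launch configuration are illustrative additions, not part of the figure.

```cuda
// Matrix (element-wise) addition: one thread per element.
// Block-task i is the work done by thread block i, i.e., elements
// [i * blockDim.x, (i + 1) * blockDim.x).
__global__ void matAdd(const float* A, const float* B, float* C, int n) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;   // as in Figure 1
    if (idx < n)                                        // bounds guard (illustrative)
        C[idx] = A[idx] + B[idx];
}

// Launch example: 256 threads per block (any reasonable block size works).
// matAdd<<<(n + 255) / 256, 256>>>(dA, dB, dC, n);
```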

[Figure 3 plots the maximum block-task length (Max Length, in ms) of each SHOC benchmark. Figure 4 diagrams the EffiSha workflow: at compilation time, a preemption-enabling compilation turns a GPU program into transformed host code with EffiSha API calls and an evictable kernel; at execution time, the code is managed across the CPU and the GPU by the EffiSha runtime.]

launching Figure 3. The maximum length of a task in SHOC bench- requests/ transformed notifications GPU program1 kernel evictable marks (the largest input in the benchmark suite are used). EffiSha transformed (re)launch proxy kernel1 runtime GPU program1 daemon . . . evictable transformed proxy kernel2 Sched. read Figure 2 lists four granularities for GPU scheduling. At GPU program1 policies write . . . the top is an entire kernel execution; kernel switches can evictable eviction slate happen only at the end of the entire kernel. This is what proxy kernelk traditional GPUs support. All previously proposed software solutions have tried to support a level-2 granularity. The preemptive kernel model (PKM) [6], for instance, breaks Figure 4. Overview of EffiSha. the original kernel into many smaller kernels, with each processing a pre-defined number (K) of the block-tasks in cases where a block-task is too large, it is possible to de- the original kernel. Even though with this approach the GPU velop some compiler techniques to reduce the granularity by still switches kernels at the boundary of a kernel, the slicing reforming a block-task; the study is left to future. reduces the size of the kernel and hence the granularity. Terminology clarification: In some (real-time systems) A challenge at this level is to determine the appropriate literatures, “preemptive scheduling” exclusively refers to value of K. Previous work has relied on profiling to do so, Level-4 scheduling (eviction at an arbitrary instruction). In feasible to restricted real-time environment, but not for data this article, we adopt a broader meaning of the term, calling center-like general sharing environments. It further suffers a a scheduling scheme preemptive if it makes a running kernel dilemma between responsiveness and overhead as Section 8 evict before its finish. will show. The level most flexible for scheduling is the lowest level in Figure 2, where, GPU may switch kernels at an arbi- 3. Overview of EffiSha trary point in the execution. The latest generation of GPU, EffiSha is the first software framework that offers the level 3 Pascal, supports this level of preemption through hardware. scheduling granularity efficiently. As Figure 4 shows, Eff- However, the feature is rarely usable in practice due to the iSha consists of four major components, working at the tremendous overhead in saving and restoring the state of times of both compilation and execution. The compilation massive GPU threads [2, 4]. Moreover, the decisions for the step is through a source-to-source compiler that transforms a preemptions are not software-controllable. given GPU program into a form amenable for runtime man- In this work, we design EffiSha to offer the level-3 agement and scheduling. It replaces some GPU-related func- scheduling granularity. GPU kernel switch can happen at tion calls in the host code with some APIs we have intro- the end of an arbitrary block-task. This choice has several duced, such that at the execution time, those API calls will appealing properties. It offers more scheduling flexibility pass the GPU requests and related information to the EffiSha than level 1 or level 2 does. It requires no kernel slicing, and runtime. It reforms the GPU kernels such that they can vol- hence, circumvents the difficulty of level 2 for selecting the untarily stop and evict during their executions. 
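To make the "voluntarily stop and evict" idea concrete before the detailed transformation of Section 4, the following is a minimal sketch of what an evictable version of the Figure 1 kernel amounts to; the counter and flag names are illustrative, not EffiSha's actual API.

```cuda
// Persistent-threads, evictable form of the Figure 1 kernel (illustrative names).
// nextTask and evictFlag live in GPU-visible memory; their state survives an
// eviction, so a relaunch resumes from the first unclaimed block-task.
__global__ void matAddEvictable(const float* A, const float* B, float* C,
                                unsigned int n, unsigned int numTasks,
                                unsigned int* nextTask, volatile int* evictFlag) {
    __shared__ unsigned int taskIdx;   // block-task ID, shared by the whole block
    __shared__ int evict;
    while (true) {
        if (threadIdx.x == 0) {
            evict = *evictFlag;                             // eviction check, once per block-task
            if (!evict) taskIdx = atomicAdd(nextTask, 1u);  // claim the next block-task
        }
        __syncthreads();
        if (evict) return;                  // voluntary eviction at a block-task boundary
        if (taskIdx >= numTasks) return;    // all block-tasks claimed: kernel is done
        unsigned int idx = threadIdx.x + taskIdx * blockDim.x;  // blockIdx replaced by taskIdx
        if (idx < n) C[idx] = A[idx] + B[idx];                   // original kernel body
        __syncthreads();                    // finish this block-task before claiming another
    }
}
```

Because only the task counter and the output arrays carry state across block-tasks, nothing needs to be explicitly saved or restored when the kernel exits and is later relaunched.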
The eviction appropriate K value and the overhead associated with ker- points in the kernels are identified by the compiler such that nel slicing. And finally, unlike level-4, no thread states are little if any data would need to be saved and restored upon needed to save or restore at this level, since preemptions do an eviction. not happen in the middle of a thread execution. The EffiSha APIs are mostly intended to be used by The size of a block-task determines the worst kernel evic- the compiler, but could be also used by programmers in a tion delay (i.e., the time between the eviction flag is set and GPU kernel for offering optional hints to the compiler and the time when the kernel gets actually evicted). Figure 3 runtime. Some high-level APIs are designed to allow users shows the maximum length of a block-task in the level-1 to specify the intended priority of kernels. SHOC benchmark suite [9]. All block-tasks are shorter than The EffiSha runtime consists of a daemon on the host 0.8 milliseconds. The actual eviction delay is even shorter, side, and a “proxy” of the daemon on the GPU side. The only 0.08ms on average (detailed in Section 8), much shorter latter is in form of a special thread block of the executing than a typical period in Linux. For the rare GPU kernel; its creation is facilitated by the preemption- (a) Original code (b) Transforming result tant observation underlying the transformation is that even

host code: host code: though in most GPU kernels the ith block-task is processed … … // a synchronous kernel launch _EffiSha_H_pre // a macro by the ith thread block, it does not have to be like that. The AGpuKernel<<< … >>> (arg1, arg2); // a kernel launch … AGpuKernel<<< … >>> (arg1, arg2); block-task keeps its integrity as long as it is indexed with i … // other statements device code: (its task ID), regardless of which thread block processes it. _EffiSha_H_post // a macro AGpuKernel (arg1, arg2) { a point waiting for the kernel to finish kernel_body; … Based on the observation, the preemption-enabling trans- } formation replaces the variable of thread-block ID in a ker- device code: AGpuKernel (arg1, arg2) { nel with a variable of block-task ID as illustrated by the re- while (1){ (c) Changes to an example kernel body taskIdx= NextToProcessTaskID(); placement of “blockIdx” with “taskIdx” in Figure 5 (c). By returnIfAllWorkDone idx = threadIdx.x + blockIdx.x * blockDim.x; kernel_body; // blockIdx replaced eliminating the tie between a block-task and a thread block C[idx] = A[idx] + B[idx]; // with taskIdx evictIfNeeded ID, it offers the freedom for an arbitrary thread block to pro- } idx = threadIdx.x + taskIdx.x * blockDim.x; } cess an arbitrary block-task. C[idx] = A[idx] + B[idx]; With that freedom, in an execution of our transformed Figure 5. Illustration of the effects of preemption-enabling GPU kernel, each thread block typically processes not one code transformation. Graphs (a) and (b) show the code of but a number of block-tasks. That is materialized by the a GPU program before and after the transformation; graph second part of the transformation. As illustrated in the de- (c) uses matrix addition as an example to show the changes vice code in Figure 5 (b), the transformation adds a while to the body of the kernel: from the default thread-block ID loop around the body of the kernel such that when a thread indexed form to a block-task ID indexed form. block finishes processing a block-task, it tries to process more yet-to-process block-tasks, until all block-tasks are fin- enabling code transformation done by the compiler. This ished or it is time for the kernel to get evicted. Because the novel design is key to the low runtime overhead of EffiSha. threads stay alive across block-tasks, they are called persis- The EffiSha daemon receives all the GPU requests from tent threads [8]. the applications. These requests could be of different prior- An important benefit from using persistent threads is that ities. The EffiSha daemon organizes the requests in queues. the transformed program can set the number of thread blocks Based on the scheduling policy that we have designed, it de- of a GPU kernel to be just enough to keep the GPU fully cides when to evict the current running kernel and which utilized. All thread blocks can then be active throughout the kernel to launch next. The interplay between the daemon and kernel execution; global synchronizations among them (e.g., the “proxy” of the daemon efficiently notifies the executing through global memory) hence become possible. It offers GPU kernel when it is time for it to get evicted. Those ker- some important support to EffiSha as shown later. nels, thanks to the reformation of their code by the compiler, Our description has been for the typical GPU kernels then voluntarily exit from the GPU. They are recorded in that are data-parallel kernels. In some rare cases where the the scheduling queues of EffiSha. 
When the EffiSha daemon original kernel is already in the form of persistent threads, decides that it is time for a kernels to start or resume their the compiler can recognize that based on the outmost loop executions on GPU, it notifies the hosting process of the ker- structure, skip this step of code transformation, and move on nel, which launches or relaunches the kernels accordingly. to the next step. The design of EffiSha ensures a very low overhead of Low-Overhead Kernel Preemption With the kernel in the a GPU kernel preemption. We next use an example to first form of persistent threads, preemption-enabling transforma- explain how the essential machinery of EffiSha works, and tion puts the eviction point at the end of the block-task loop then describe the implementations in the compiler module (i.e., the while loop in Figure 5 (b).) It gives an appealing and the runtime. property: Upon preemptions, no extra work is needed for 4. Preemption-Enabling Code saving or restoring threads states. The reason is that since the current block-tasks have finished, there is no need for Transformation restoring the threads states. For the block-tasks yet to pro- Figure 5 illustrates the effects of the preemption-enabling cess, later relaunches of the kernel can process them based transformation to a GPU program. It converts the GPU ker- on the memory state left by the previous launch of the ker- nel into a form that uses persistent threads [8] (explained nel. (The memory data of the evicted kernel are not copied later), and inserts some assistant code into the host side to out or removed from the GPU memory.) The memory state help with scheduling. contains the effects left by the earlier block-tasks, includ- The statements in bold font in Figure 5 are inserted by ing the counter indicating the next block-task to process. If the EffiSha compiler. A transformation critical for enabling later block-tasks depend on some values produced by earlier the low-overhead preemption is the changes applied to the block-tasks (e.g., for a reduction), they can find the values device code, shown by Figure 5 (c), which replaces the ref- on the memory. No extra data saving or restoring is needed erence to thread block ID with a block-task ID. An impor- upon evictions. It is worth noting that EffiSha does not cause extra over- GPU host GPU host . . . GPU host process process process PREP head in setting up the states of GPU shared memory. On GPU, by default, the data in shared memory loaded by a READY thread block cannot be accessed by another thread block. KState iPriority KState iPriority KState iPriority TORUN TOEVICT Therefore, in a typical GPU program, each block-task must first load their needed data into shared memory in order to RUNNING use such memory. It is the same in our transformed code; DONE such load instructions are enclosed inside the “while” loop EffiSha Runtime Deamon (a) GPU-stubs that reside on pinned (b) Kernel states (i.e. values of KState) and in Figure 5 (b) as part of the kernel body. Because evictions memory facilitate the coordinations. possible state transitions of a kernel happen only at the end of a block-task, the shared memory Figure 6. GPU-stubs and the possible state transitions of a for that block-task needs to be set up only once just as in the GPU kernel. default case, regardless of kernel evictions. The placement of eviction points offer the level-3 schedul- ing granularity as Section 2 has discussed. 
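As a rough illustration of the GPU-stub record and the kernel states that Section 5.1 describes (Figure 6), the data structure amounts to something like the sketch below; the field and type names are assumptions, since the paper does not give literal definitions.

```cpp
// Illustrative sketch of a GPU-stub (Section 5.1, Figure 6); names are assumptions.
enum KState {
    PREP,      // stub allocated, kernel not yet ready to launch
    READY,     // kernel ready (or evicted) and waiting to be (re)launched
    TORUN,     // the daemon granted the GPU; the host thread may launch now
    RUNNING,   // kernel currently executing on the GPU
    TOEVICT,   // the daemon asked the running kernel to evict voluntarily
    DONE       // kernel finished; the stub can be reclaimed
};

struct GPUStub {
    volatile KState kState;  // read and written by the daemon, the host thread, and the GPU
    int iPriority;           // initial (static) priority of the kernel
};
// The stub is a small piece of CPU-side shared memory created by the daemon,
// mapped into the host thread's address space, and pinned so GPU code can poll it.
```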
This level of Figure 6 (a) shows, a GPU-stub contains two fields: One is scheduling granularity offers more flexibilities than previ- KState, which indicates the current state of the kernel, the ous software methods do. other is iPriority, which indicates the initial priority of the As typical as in time-sharing systems, the evicted unfin- kernel. ished kernels and the next-to-run kernel may contend for the Each GPU-stub is a small piece of (CPU-side) shared space of device memory. Section 6.4 gives further discus- memory created by the EffiSha daemon. When an applica- sions. tion requests a GPU-stub for a GPU kernel call (through an In addition to the described code changes, the Preemption- EffiSha API call), the daemon allocates a stub for it, maps Enabling transformation adds two predefined macros re- the stub to the address space of the host thread, and returns spectively at the points before and after the kernel invocation the address to the host thread. The usage of shared mem- in the host code. It also injects code to create a proxy of the ory allows the stub to be both read and written by the dae- EffiSha runtime daemon on GPU. For their close connec- mon and the host thread directly. We have also considered tions with the EffiSha runtime, we postpone their explana- an alternative design which leaves the stub in the daemon tions to the next section where the runtime is explained. space; the host thread then must access the stub through sys- tem calls. It incurs much more overhead because the stubs are frequently read by the host thread as we shall see in the 5. EffiSha API and Runtime: Basic Design next section. The main functionality of the EffiSha runtime is to manage The field KState in a GPU-stub can have one of six val- the usage of GPU, making a kernel get launched, evicted, or ues, corresponding to all the six states that a GPU kernel relaunched at appropriate times. With GPU drivers remain- can possibly have. Figure 6 (b) shows the set of states and ing undisclosed, it is hard for the EffiSha daemon to directly the possible transitions among them. A GPU-stub is cre- manipulate a GPU kernel inside the context of another pro- ated before the host thread reaches kernel invocations; it is cess. The runtime management hence involves the coopera- in a preparation state (PREP). Before a kernel invocation, tions among the EffiSha daemon, the GPU threads, and the it transits to the READY state, indicating that a kernel is CPU processes that host the GPU kernels. A careful design ready to be launched. When the EffiSha daemon changes its is necessary to ensure that they work together smoothly and state to TORUN, the host thread may launch the kernel and efficiently. the state of the kernel becomes RUNNING. When the Eff- To ease the understanding, our description starts with a iSha daemon changes its state to TOEVICT, the GPU ker- basic design. It conveys the basic idea, but lacks the support nel evicts from GPU voluntarily and the kernel goes back to to multiple GPU streams, and suffers large overhead. Sec- the READY state. When the GPU kernel finishes, the kernel tion 6 describes the enhanced design and optimizations that state turns to DONE and the GPU-stub gets reclaimed for address those limitations. For now, the description assumes other usage. that each application has only one GPU stream (i.e., one se- quence of GPU kernels to run on the GPU). 
5.2 Basic Implementation of the Runtime We now explain the basic implementation of the EffiSha 5.1 GPU-Stubs and State Transitions runtime, and how it supports the state transitions for kernel We first introduce a runtime data structure named GPU- scheduling. Our description follows the computation stages stubs and the states that a GPU kernel could have and their and state transitions shown in Figure 7. On the right side of transitions. They are essential for understanding the runtime Figure 7, there is the transformed GPU program code, which cooperations. reveals the details of some macros showed in Figure 5; on the The EffiSha daemon hosts a number of GPU-stubs. Each left side of Figure 7 are the states of the kernel corresponding stub holds a record for a kernel that is yet to finish. As to each of the computation stages. host code: KState to “TORUN”4, and as soon as the host thread sees it, … PREP GStub= GetAKernelStub(); it relaunches the kernel. InitStub(GStub,initPriority); READY scheduler while (GStub.KState != TORUN); TORUN 6. Enhanced Design and Optimized

GStub.KState=RUNNING; RUNNING // a kernel launch Implementation scheduler AGpuKernel<<< … >>> (arg1, arg2); … // some other statements The basic design and implementation enables evictions and TOEVICT TORUN while (GStub.KState != DONE) { if (GStub.KState == TORUN) { GStub.KState=RUNNING; relaunches of GPU kernels, but is subject to two-fold limita- RUNNING AGpuKernel<<< … >>> (arg1, arg2); } tions: applicability and efficiency. a point waiting for the kernel to finish FreeAKernelStub(GStub); … 6.1 Enhancing Applicability device code: AGpuKernel (arg1, arg2) { The most important limitation in applicability is that the ba- while (1){ RUNNING taskIdx= NextToProcessTaskID(); sic design cannot work with GPU streams. Modern GPU returnIfAllWorkDone kernel_body; // blockIdx replaced introduces streams for allowing concurrent executions of DONE // with taskIdx if (GStub.KState == TOEVICT){ if (lastThreadToExit) multiple kernels (of the same application) on a GPU. Each TOEVICT GStub.KState = READY; return; stream may contain a sequence of kernel launch events. Ker- READY } } nels in different streams can run concurrently on a GPU } while kernels in the same stream get launched in serial. The Figure 7. Changes of the KState of a kernel in each stage of basic design of EffiSha tracks the state of an individual ker- the execution of a GPU program (bold font shows the code nel but is oblivious to streams. It cannot maintain the execu- inserted by the EffiSha compiler). tion properties of multiple streams (e.g., letting multiple ker- nels from different streams run concurrently). Our enhanced design extends the basic design. The host thread calls the API “GetAKernelStub” to obtain In our design, all streams of an application share a single a GPU-stub for the next kernel call. At the creation time, the priority. The choice comes from the property of current KState in the stub equals “PREP”. The API “initStub” sets GPUs. On current GPUs, kernels from two different GPU the “iPriority” field of the GPU-stub with the initial priority applications cannot run concurrently on a GPU, and hence, (also called static priority) of the kernel; if it is “null”, the at an eviction time, all streams of an application have to get kernel uses the default priority. The call also changes the evicted before another application can use that GPU. The state of the kernel to “READY”. After that, the host thread single-priority restriction could get relaxed for future GPUs starts to poll the KState in the stub until the EffiSha runtime when more advanced co-run support is introduced. daemon changes its value to “TORUN”. The host thread then In the enhanced design, a GPU-stub is for the entire GPU changes the KState to “RUNNING” and launches its kernel. application rather than for each individual kernel. Its KState The bottom of Figure 7 shows the device code and state field now indicates the status of the GPU request from that transitions. During the execution of a GPU kernel, when a whole application. Moreover, GPU-stub contains an array of GPU thread finds no more block-tasks left, it changes the queues besides KState and the initial priority. We call the field KState of the stub to “DONE”. Otherwise, it processes queues kernelQs. The host code is transformed such that a its work, and then before it tries to process a new task, it kernelQ is created in the GPU-stub right before the creation checks whether the KState is now “TOEVICT”. If so, the of a GPU stream. 
And right before the code for launching a thread exits; the last thread to exit, changes the KState to GPU kernel, the ID of the kernel is added into the kernelQ “READY” such that the kernel can be relaunched later. The of that kernel’s stream5. As kernel launch commands are voluntary evictions require no saving of the GPU thread asynchronous, the host thread may execute multiple such states as they happen only at the finish of some block-task. commands in a sequence, leaving multiple kernel IDs in a In the implementation, GPU-stubs are mapped to pinned kernelQ. memory such that GPU threads can also access it. There are three other changes: (1) At a kernel launch After launching the GPU kernel, the host thread contin- point, the compiler does not make the code transformations ues its execution3. Right before it reaches the next waiting as in the basic design. Instead, it replaces CUDA kernel point, it gets into a while loop, in which, it repeatedly checks launch calls with calls to a customized launch API, the im- the KState until it becomes “DONE”. During the check, the plementation of which puts the ID of the to-be-launched ker- kernel may get evicted and later get scheduled by the Eff- 4 iSha runtime to resume, in which case, the runtime changes Through atomic operations, it first checks to ensure that KState is “READY” state. 5 In CUDA, stream ID is part of the launching parameters. When it is 3 Here, we assume asynchronous kernel launches. If the original kernel omitted, the default stream (stream 0) is used. For CUDA 7, each host thread launch is synchronous, the compiler changes it to asynchronous and adds has its own stream 0; on earlier CUDA, one application has only one stream a waiting point right after it. 0. while (GStub.kernelQ[s]!=empty){ kernel serves as the proxy of the EffiSha runtime, running if (GStub.KState == TORUN){ on the GPU along with the other streams. In this imple- GStub.KState = RUNNING; launch the kernels residing at the heads of all kernelQs. mentation, there is a copy of KState on the GPU memory. } The proxy keeps reading the KState on the host memory } and updates the copy on the GPU accordingly. With that, Figure 8. Implementation of the customized API for a syn- the threads in other GPU streams just need to read the local chronization on stream s. copy to find out the time to get evicted. The other job of the proxy is to monitor the status of the kernel (e.g., by reading the block-task counter) and to update the KState on the host nel into the corresponding kernelQ and records the kernel memory when necessary. The proxy only needs one GPU parameters for later launches or relaunches to use. That API thread, consuming the minimum resource. It is launched does not actually launch the kernel as launch time is deter- whenever a kernel of the application gets launched. mined by the runtime scheduler. (2) When the current kernel is done, the device code dequeues that kernel from the ker- Adaptive Checking The use of the proxy frees the other nelQ of its stream. It does not set KState to DONE unless thread blocks from accessing the remote host memory re- all kernelQs of the GPU-stub become empty. (3) At syn- peatedly, which reduces the overhead substantially. But we chronization points, the compiler does not insert the code still observe 15% overhead for kernels with fine-grained as those inserted in the basic design. Instead, it replaces all block-tasks. 
We add the feature of adaptive checking to fur- GPU device synchronization calls with our customized syn- ther reduce the overhead. The idea is to let a GPU thread chronization API. CUDA has two kinds of stream synchro- check the local copy of KState in every k iterations of the nizations: one is to wait for all the kernels in a particular device-side while loop in Figure 7, where, k is set to r ∗ L/l, stream to finish, the other for all streams of the program. For L is the length of the scheduling time slice, l is the estimated the former, the implementation of our API is as shown in length of each iteration of the device while loop, and the fac- Figure 8. As long as the runtime scheduler permits (indicated tor r is a positive integer, which helps amortize the influence by GStub.KState), it lets all the streams of the program run of the errors in the estimation of l by making several checks on GPU until the kernelQ of the to-be-synchronized stream per time slice (we use 10 in our experiments). There could be becomes empty (i.e., all kernels in that stream have finished). various ways to make the estimation, through performance For the other kind of synchronizations, the implementation models, offline or online profiling, or program analysis. Our is similar except that the exit condition of the “while” loop implementation uses simple online sampling, which, at run- is that GStub.KState==DONE. It is worth mentioning that time, measures the length of the first iteration of the while this enhanced design also avoids some code analysis com- loop executed by the first thread of each thread block and plexities facing the basic design: The code transformations uses the average as the estimated value of l. in the basic design at synchronization points need to know Together, these optimizations reduce the overhead by which kernel invocation the points correspond to. It could be 11.06% as detailed in the evaluation section. complex to figure out if the synchronization points are in a Asynchronous Checking In the enhanced design, at a de- different device function or compilation unit. vice/stream synchronization point, there is a while loop 6.2 Improving Efficiency which continuously checks KState or kernelQ to relaunch the kernels if they are evicted. One potential issue is that The second main issue of the basic implementation is that it if there is a lot of CPU work existing between the original adds some substantial runtime overhead, causing significant kernel launch point and the synchronization point, the kernel slowdowns (up to 97X) to a program execution (shown in could have got evicted long before the host thread reaches Section 8). To make it work efficiently, we have developed the while loop, and hence suffers a long delay in the re- three optimizations as follows. launch. The situation is rare according to our observations. Proxy on GPU In the basic implementation, KState is Still, to alleviate the potential consequence, the EffiSha com- made visible to all three parties: the daemon, the host thread, piler inserts a custom signal handler into the GPU program. and the GPU threads. That simplifies the coordinations When the runtime daemon sees a long delay (about 100 mi- among them, but is not quite efficient. The main issue is crosec) for KState to turn into RUNNING from TORUN, it on the repeated accesses to KState by each GPU thread. sends a signal to the host thread. 
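The adaptive-checking parameter of Section 6.2 follows directly from the sampled iteration length; a small hedged sketch (units and names are assumptions):

```cuda
// Adaptive checking (Section 6.2): poll the GPU-local copy of KState only once
// every k iterations of the block-task loop, with k = r * L / l.
// L: scheduling time slice; l: sampled length of one loop iteration; r: a small
// integer (the paper uses 10). Units must match; names are illustrative.
__host__ __device__ unsigned int chooseK(float L, float l, unsigned int r) {
    float k = r * L / l;
    return k < 1.0f ? 1u : (unsigned int)k;   // check at least once per block-task otherwise
}

// Inside the persistent-threads loop (sketch):
//   if (++iter % k == 0 && localKState == TOEVICT) { /* voluntary eviction */ }
```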
The signal handling func- Because the accesses have to go through the connections be- tion changes KState to RUNNING and launches the kernel. tween CPU and GPU (e.g., through a PCIe link), they cause significant slowdowns to some kernels, especially those that 6.3 Setting Priority have small block-tasks (e.g., 97x slowdown on a Triad ker- The iPriority of a GPU-stub has a default value. A program- nel). mer may use an EffiSha API call (setIPriority) to alter the We optimize the implementation by creating a special priority. The call is mainly to pass the info to the compiler, kernel on a special stream for each GPU application. The which uses it as the second parameter of the inserted call of initStub to set the actual iPriority field of the GPU-stub at kernel launched. Otherwise, the currently running kernel run time. runs to completion. Similar to the priority control in Linux for CPU threads, 2) Dynamic priority-based preemptive schedule. The only the superuser (root) may set a higher priority than schedule policy is similar to the CPU scheduling policy used the default. A caveat is that even though EffiSha runtime by recent Linux systems. It maintains two waiting queues, daemon respects the priority when scheduling kernels, the one active queue and one inactive queue. This schedule uses actual effect is also related with the priority of the host dynamic priority. The dynamic priority of the GPU request thread because (re)launches are through them. For instance, from an application (including all of its streams) equals to its if when the EffiSha daemon changes the KState to TORUN, initial (static) priority when its kernels just arrive, but gets the host thread is in CPU waiting queue, the kernel cannot increased by one for every millisecond the request waits in get relaunched until the host thread resumes running. A low- the active queue. The increase is capped at 20. In each of priority host thread could hence result in much (re)launch the two queues, requests are ordered by their dynamic pri- delay. To avoid it, a suggestion to superusers is to ensure orities. A new coming request is put into the active queue that the host thread has a priority no lower than the kernel. (positioned based on its priority). Every time a request is scheduled to be met, it gets a 6.4 Memory Size and Dynamic Parallelism time slice (or called quota). The length of the time slice is As in many time-sharing systems, the evicted unfinished ker- weighted by the priority; it is set to (p + 1)/2 ms, where p is nels and the next-to-run kernel may contend for the space of the priority of the kernel. At the expiration of its time quota, device memory. It is an issue only if the remaining memory the GPU request goes into the inactive waiting queue and its cannot meet the requests of a GPU program. The solution is dynamic priority is reset to its initial (static) priority. When to let the compiler replace GPU memory allocation calls in a GPU becomes available, the scheduler chooses to meet the the host code with a customized API. Through it, upon the request at the head of the active waiting queue. When the failure of a memory allocation, the runtime scheduler puts active queue becomes empty, the two queues switch the role: that application into a waiting queue, and resumes it when an The inactive queue becomes active, and the old active queue unfinished application has finished and hence released some becomes the inactive queue. The dynamic priority can help space. 
If there are no unfinished applications, upon an al- prevent starvations facing the static priority-based scheduler. location error, the API returns the error immediately as the shortage is due to the device limit. 8. Evaluation Modern GPUs support dynamic parallelism (i.e., kernel We implement the EffiSha system based on Clang [12], a launch inside a kernel). EffiSha gives no explicit treatment to source-code analysis and transformation tool. We conduct dynamic parallelism, for two reasons. First, dynamic paral- a set of experiments, including comparisons to PKM [6], lelism on GPU is rarely used in practice due to its large over- a prior kernel slicing-based software solution. We try to head (as much as 60X) [10, 11]. Second, a recent work [10] answer the following main questions: shows that dynamic parallelism in a kernel can be automat- 1) Preemption. Can EffiSha indeed enable preemptions? ically replaced with thread reuse through free launch trans- How much delay does the restriction on preemption granu- formations, and yields large speedups. Therefore, for code larity cause for an eviction to take place? with dynamic parallelism, one may apply free launch first to 2) Priority. Can an preemptive scheduler based on Eff- remove the dynamic parallelism and then apply Effisha. iSha indeed support the different priorities of different ker- nels? Can it help improve the overall responsiveness and 7. Scheduling Policies throughput? The support provided by EffiSha opens up the opportunities 3) Overhead. How much time overhead does EffiSha add for the development of various preemptive schedulers for to GPU executions? What is the breakdown? GPU. We demonstrate it by implementing the following two 4) Comparison. How doe EffiSha compare with the state- scheduling policies. of-art software approach? 1) Static priority-based preemptive schedule. In this schedule, the EffiSha runtime daemon schedules kernels 8.1 Methodology purely based on their initial priorities. The scheduler main- Our experiments use all the programs in the level-1 folder tains a waiting queue for applications that are waiting to of the SHOC CUDA benchmark suite [9]. Table 1 lists the use GPU. The applications in the queue are ordered in their benchmarks. priorities. When the GPU becomes available, the scheduler We design an experimental setting to emulate the scenar- always chooses the head of the queue (i.e., the one with the ios where the requests for GPU from different applications highest priority) to launch. If a new coming kernel has a may have different priorities, and some requests may arrive higher priority than a currently running kernel, the scheduler while the GPU is serving for other applications. In our ex- immediately has the current kernel evicted and has the new perimental setting, the 11 benchmarks issue their requests 3000 Table 1. Benchmarks 1ms wait_time = 1ms wait_time = 2ms Benchmark Description 2500 2ms BFS Breadth-first Search in a graph 2000 3ms wait_time = 3ms wait_time = 4ms FFT 1D Fast Fourier Transform 4ms 1500 8ms wait_time = 8ms wait_time = 16ms MD Molecular dynamics 16ms wait_time = 32ms wait_time = 64ms MD5Hash MD5 Hash 1000 32ms NeuralNet A Neural Network 500 64ms Reduction Summation of numbers Scan An exclusive parallel prefix sum of floating data Evictions of Number 0 Sort A radix sort on unsigned integer key-value pairs 0 1 2 3 4 5 6 7 8 9 10 Spmv Sparse matrix-dense vector multiplication Priority Stencil2D 2D 9-point stencil computation Triad A version of the stream triad Figure 9. 
Total Number of Evictions for Different Priorities
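For the dynamic priority-based policy described in Section 7, the aging and quota rules reduce to a couple of one-liners; this sketch uses illustrative names and is not EffiSha's actual implementation.

```cpp
// Dynamic priority and time quota of the Section 7 scheduler (illustrative names).
struct Request {
    int    staticPriority;  // initial priority of the application's GPU request
    double waitedMs;        // time spent waiting in the active queue
};

int dynamicPriority(const Request& r) {
    int boost = (int)r.waitedMs;       // +1 for every millisecond of waiting
    if (boost > 20) boost = 20;        // the increase is capped at 20
    return r.staticPriority + boost;
}

double timeSliceMs(int p) {            // quota granted when a request is scheduled
    return (p + 1) / 2.0;              // (p + 1)/2 ms, weighted by priority p
}
// When a request's quota expires, it moves to the inactive queue and its dynamic
// priority is reset to the static priority; the two queues swap roles once the
// active queue drains.
```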

Table 2. Kernel lengths and priorities. Figure 9 shows the number of evictions that happen on Standalone Priorities Benchmark the kernels at each level of priority. Because every launch time (ms) random group SJF NeuralNet 14.25 2 2 1 gets a random priority, the total lengths of the kernels at the Sort 5.46 17 2 4 various priorities are similar. However, the kernels at lower Reduction 2.06 17 2 7 MD5Hash 3.29 22 5 6 priorities gets clearly more evictions than those at higher pri- MD 13.8 22 5 2 orities. When “wait time” is 1ms, the launches of kernels Scan 1.41 7 5 8 are dense, and there are a large number of evictions for ker- Triad 1.22 3 8 9 Stencil2D 28.4 20 8 0 nels at the low priorities. As “wait time” increases, fewer FFT 1.17 24 11 10 conflicts happen among the requests for GPU, and the num- Spmv 4.57 7 11 5 BFS 5.99 1 11 3 bers of evictions drop. For a given “wait time”, the higher the priority of a kernel is, the less frequently the kernel gets evicted. These observations are consistent with the objective for GPU (i.e., launching a GPU kernel) in turns. Another ap- of the priority-based scheduling policy, confirming the fea- plication issues the request 3ms after the previous applica- sibility enabled by EffiSha for scheduling GPU kernels to tion issues its request. The launch order is chosen arbitrarily, accommodate kernels of different priorities. Extra evidences as shown as the top-down order in Table 2. (The priority set- are shown in the experiments on the dynamic priority-based tings are explained later.) scheduling (which will be presented later in Section 8.4). We run the experiments on a machine equipped with an NVIDIA K40m GPU running CUDA 7 and an Intel Xeon 8.3 Eviction Delay E5-2697 processor. Without noting otherwise, in the experi- Since the preemption happens at the end of a block-task, ments, the baseline of the comparisons is always the execu- there could be some delay between the setting of the evic- 6 tions of the original benchmarks without any modifications . tion flag and the exit of the kernel from GPU. We call it eviction delay. The delay indicates how quickly the running 8.2 Support for Preemptions and Priorities kernel can act on an eviction request. This part reports the We run the 11 benchmarks in all the settings under the eviction delay of each kernel observed in our experiments. preemptive schedule. The programs all run correctly, and To measure it, we keep each kernel doing the same work on kernels get evicted as expected by the scheduling policies. GPU as it does in its normal runs, but repeatedly. At some We further conduct the following experiment to validate moment, the eviction flag is set to true. An inserted cud- the appropriate support of priorities by EffiSha-based pre- aStreamSynchronize() call in the host code makes the host emptive scheduling. We launch the 11 benchmarks for 1000 thread wait for the kernel to exit to measure the latency. times and measure the total number of evictions for each ker- Figure 10 shows the observed eviction delay of each ker- nel. Every launch is assigned with a random priority from nel. MD has the longest eviction delay (0.4ms) because of 1 to 9 (the higher the more important the kernel is). Ini- its relatively large block-task, while BFS, FFT, NeuralNet tially, all the kernels are launched at the same time. A ker- and Triad have delays less than 0.015ms. 
The average delay nel gets invoked again a certain time period (which is called is about 0.08ms, much smaller than the typical length of a “wait time”) after the finish of its previous invocation. We context switch period in Linux. change the wait time from 1ms to 64ms. We use the static- Besides eviction delay, there are some other overhead priority–based preemptive scheduling policy for this experi- in the preemptive executions, including relaunch overhead, ment; the fixed priority makes it easy to analyze the results. state checking overhead, and so on. We next report the exper- imental results of EffiSha-enabled priority-based preemptive 6 We observed that when multiple kernels request a GPU, the K40m GPU, scheduling, which shows that these delays do not prevent the by default, periodically switch its tenants. The details are not disclosed officially; our observation is that it uses an SM-draining mechanism. It scheduler from working profitably. And Section 8.5 analyzes offers no software controllability. the delays in detail. 0.4 default dyn. preemptive 0.25 15 0.2 14 12 0.15 10

0.1 Latency(ms)

- 8 0.05 6 0 4 2

Avg Preemp Avg 0

Figure 10. Eviction delay of each kernel. Time Turnaround Normalized

Figure 11. Normalized turnaround time; SJF priorities are 8.4 Benefits from Preemptive Scheduling used. The benchmarks are put in an increasing order of their The efficient preemption enabled by EffiSha opens the op- priorities (from left to right). portunities for the design of various preemptive schedules for GPU kernels. In this part, we use the aforementioned dy- namic priority-based preemptive schedule as an example to the “preemptive” bars in Figure 11 show, the NTTs are low- illustrate the substantial potential benefits of such schedules. ered to below two for most of the kernels. The reduction The reported timing results include all runtime overhead. is especially significant for the high-priority kernels which Metrics Following common practice [13], we use two met- receive a high priority treatment in the preemptive sched- rics to characterize co-run performance: uler. For instance, when FFT requests GPU, the preemptive 1) Average normalized turnaround times (ANTT). NTT scheduler immediately raises up the preemption flag and the is defined as follows [13]: currently running kernel gets evicted quickly; the FFT ker- nel gets to run and finish on GPU soon after its request is MP SP issued. Its NTT is lowered from 8.1 to 1.15. The NTTs of NTTi = Ti /Ti , some kernels on the left end in Figure 11 increase slightly, SP MP where, Ti and Ti are the execution time of a kernel in for the low priority treatment they get in the scheduling. its standalone run and its co-run. NTT is usually greater than 1, the smaller the more responsive the kernel is. ANTT is the 7 average of NTTs of a set of programs. 6 default dyn. preemptive TP

2) System overall throughput (STP). STP is defined as S 5 4 follows. STP is from 0 to n (the number of programs); the an d

T 3 larger, the better. 2 AN T 1 n 0 X SP MP STP = Ti /Ti . i=1 Average Normalized Turnaround Time (ANTT) System Overall Throughput (STP) Results In our experiments with the scheduling, we launch Figure 12. Average normalized turnaround time (ANTT) the benchmarks one by one by following the scheme de- and overall system throughput (STP) under three priority scribed in Section 8.1. We use three settings of priority as- settings. signments. The first is to assign priorities to the kernels in a shortest job first (SJF) manner, in which, the priority of a kernel is based on its length (i.e., “standalone time” in Ta- On average, the preemptive scheduler cuts the ANTT by ble 2); the longer, the lower, as shown in the “SJF” column 65% (from 6 to 2), and improves the overall throughput by in Table 2. a factor of 1.8 (from 3.5 to 6.3), as reported in Figure 12. It is well known that an SJF schedule minimizes the to- Figure 12 also reports the overall results under two other tal waiting time of a set of jobs, giving superior responsive- settings of priorities. The “random” scheme assigns an ap- ness [14]. However, without priority-based preemption sup- plication with a random integer between 0 and 30 as the pri- port, the default scheduler on GPU fails in harvesting the ority of its kernel. The “group” scheme assigns adjacent two benefits. As the “default” bars in Figure 11 show, the NTT or three applications with the same priority. The priorities of of the kernels range from 2 to 15, and the high-priority ker- the kernels in both settings are shown in Table 2. Although nels (towards the right end in the graph) tend to suffer the the benefits of the preemptive scheduler in these settings are largest NTTs. It is because they have the shortest lengths not as large as in the “SJF” setting, the improvements are (i.e., smallest T SP ) and hence their NTTs are more sensitive still substantial: 18% ANTT reduction and 1.35X through- to waiting times. The EffiSha-enabled preemptive scheduler put enhancement in the “random” case, and 24% and 1.42X substantially reduces the NTTs for most of the kernels. As in the “group” case. 8.5 Overhead and Comparison A challenge to PKM is to decide the appropriate value EffiSha may introduce several sources of overhead. We cat- of K. A large K leads to slow response to other kernel egorize the overhead into two classes: launch requests, while a small K reduces the thread-level 1) Transformation overhead: Overhead incurred by code parallelism and incurs additional kernel launching overhead. changes, including the usage of persistent kernels, the cre- We measure the overhead of PKM under a spectrum of re- ation of the GPU proxy, the checks on the eviction flag, sponsiveness requirements (the objective kernel launch de- and other changes and their effects that the EffiSha compiler lay ranges from 1ms to 0.0625ms) by tuning K till meet- makes to the host code and kernels. ing each level of responsiveness requirement. Figure 14 re- 2) Scheduling overhead: Overhead incurred by preemp- ports the observed PKM execution overhead. As the objec- tive scheduling. It mainly includes the time taken by kernel tive delay decreases, the overhead of PKM becomes increas- evictions and resumptions, and their side effects on the run- ingly significant (average 45%, up to 98%, slowdown for time executions. 0.0625ms delay requirement). We measure the overhead of each class respectively. 
To compare the overhead of PKM with EffiSha, we tune the K value of PKM such that the launch delay for each kernel is the same as that in EffiSha. The launch delay of Transformation Overhead Transformation overhead refers EffiSha for each kernel is showed in Figure 10. The last bar to the overhead brought by the transformed code and the in- in Figure 13 shows the overhead of PKM. As we can see, the troduced runtime daemon and proxy. For this measurement, average overhead in EffiSha with all optimizations is 4% (up we run each of the transformed benchmarks alone; during to 8% for T riad). For PKM, the average overhead is 58% the execution, the daemon (and proxy) works as normal, ex- (up to 140%). Compared to PKM, EffiSha has a much lower cept that there is no kernel evictions since no other kernels overhead. Moreover, EffiSha does not require fine-tuning to request the GPU. We compare the execution time with that determine the appropriate K value as PKM does. of the original unmodified kernels to compute the overhead. Figure 13 reports the transformation overhead in three EffiSha-basic EffiSha-w/ proxy cases: the basic EffiSha design, the design with the use of EffiSha-all opt PKM proxy, the design with the use of proxy and all other opti- 72% 140%1096 % 118% 91% 94 % 689% 75% 107% 9670% 1069% 60.00% mizations that have been mentioned in Section 6. The first 121% 233% group of bars in Figure 13 show the overhead of the basic 50.00% EffiSha implementation. The overhead differs for different 40.00% 30.00% applications, due to the differences in their memory access 20.00% intensities and the length of each block-task. But overall, the Overhead 10.00% overhead is large (up to 97X). The overhead is primarily due 0.00% to the fact that every GPU thread has to frequently poll the -10.00% state variable on the host memory through the PCIe bus. The second group of bars in Figure 13 show that the usage of proxy cuts the overhead dramatically. The average overhead reduces from 10X to 15%. However, for stencil2D, Figure 13. Transformation overhead. the overhead is still over 100%. The reason is that every block-task in that kernel is short. Since the polling of the flag 1ms 0.5ms 0.25ms 0.125ms 0.0625ms is done in every block-task, the overhead is large. The third 100%

90% group of bars indicate that the adaptive polling method helps 80% 70% address the problem. The overhead of stencil2D is reduced 60%

to 8%, and the average overhead becomes 4%. We note that 50% Overhead

40% the host-side overhead due to the code transformations is 30% 20% negligible. PKM 10% 0% Compare with Kernel Slicing-based Methods Before Eff- iSha, kernel slicing is the main approach designed for in- creasing GPU kernel schedule flexibility. PKM [6] is a rep- Figure 14. PKM overhead. resentative of the approach. PKM partitions an original ker- nel into many smaller kernels, with each processing a pre- defined number (K) of the block-tasks in the original kernel. Scheduling Overhead We measure scheduling overhead Preemptions are not supported in PKM, but the reduced size as follows. We launch the original unmodified benchmarks of a kernel could help reduce the time that a kernel needs to one by one in the scheme mentioned in Section 8 without wait before getting access to GPU. using EffiSha, and record the time spanning (denoted as T ) from the start of the first kernel to the end of the last kernel. host and device memory could facilitate the process. It is left (Notice that because the time between the launches of two to study in the future. adjacent kernels is short, at any moment during the period of T , the GPU is working on some kernel.) We then redo the 10. Related Work measurement but with each of the two preemptive schedulers deployed. Let T 0 be the measured time got in one of the With GPUs becoming an important computing resource, experiments. The scheduling overhead in that experiment is there is a strong interest in making GPUs first-class resource computed as (T 0 − T )/T . to be managed by operating systems (OS). Gdev [19] inte- Figure 15 reports the overhead of the two preemptive grates the GPU runtime support into the OS. It allows GPU schedulers under each of the three priority settings. The memory to be shared among multiple GPU contexts and “dynamic priority” scheduler has overhead larger than the virtualizes the GPU into multiple logic ones. GPUfs [20] “static priority” scheduler has for its extra complexities. But enables GPUs with the file I/O support. It allows the GPU all the overhead is less than 5%. code to access file systems directly through high-level APIs. GPUvm [1] investigates full- and para-virtualization for

Scheduling Overhead We measure scheduling overhead as follows. We launch the original unmodified benchmarks one by one in the scheme mentioned in Section 8 without using EffiSha, and record the time span (denoted as T) from the start of the first kernel to the end of the last kernel. (Notice that because the time between the launches of two adjacent kernels is short, at any moment during the period of T, the GPU is working on some kernel.) We then redo the measurement with each of the two preemptive schedulers deployed. Let T′ be the time measured in one of those experiments. The scheduling overhead in that experiment is computed as (T′ − T)/T.

Figure 15 reports the overhead of the two preemptive schedulers under each of the three priority settings. The "dynamic priority" scheduler has a larger overhead than the "static priority" scheduler due to its extra complexities, but all the overhead is less than 5%.

Figure 15. Scheduling overhead.
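A minimal sketch of such a makespan-based measurement is shown below; the placeholder kernel, sizes, and launch count are our own assumptions rather than the paper's harness. The only point is the arithmetic: time the same back-to-back kernel sequence with and without the scheduler deployed and report (T′ − T)/T.

```cuda
#include <cstdio>
#include <chrono>
#include <cuda_runtime.h>

__global__ void dummy_kernel(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;   // placeholder work standing in for a benchmark
}

// Time one back-to-back sequence of kernel launches (the makespan T or T').
static double time_sequence(int launches, float *d, int n)
{
    auto start = std::chrono::steady_clock::now();
    for (int k = 0; k < launches; ++k)
        dummy_kernel<<<(n + 255) / 256, 256>>>(d, n);
    cudaDeviceSynchronize();   // end of the last kernel
    std::chrono::duration<double> s = std::chrono::steady_clock::now() - start;
    return s.count();
}

int main()
{
    const int n = 1 << 20, launches = 100;
    float *d;
    cudaMalloc(&d, n * sizeof(float));

    double T = time_sequence(launches, d, n);        // baseline, no scheduler
    // ... deploy the preemptive scheduler, then repeat the same sequence ...
    double T_prime = time_sequence(launches, d, n);  // with the scheduler

    printf("scheduling overhead: %.2f%%\n", 100.0 * (T_prime - T) / T);
    cudaFree(d);
    return 0;
}
```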
9. Discussions

To use EffiSha or a scheduler based on it, there is no need to change either the hardware or the OS. One just needs to create an alias in the system to redirect the invocations of the default GPU compiler to a customized script, which first invokes the EffiSha source-to-source compiler and then calls the default GPU compiler on the transformed code. To add preemption support to a GPU library, the library developers would need to do the recompilation. An alternative is to implement the EffiSha code transformations at the assembly (PTX) level, which could avoid the need for the source code of a GPU program or library.

Upon a preemption, the memory data of the preempted kernel remain in GPU memory. There is a possibility that the next kernel could observe some leftover data on GPUs that have no virtual memory support. But it is important to note that, as observed in prior work [15], such a security issue exists even in default GPU executions: upon the finish of a default GPU kernel execution, the data typically remain in GPU memory until the parent CPU process terminates, and are hence subject to the same potential issues. The virtual memory support on the latest GPUs could help alleviate the security concern.

The sharing of the memory space among multiple GPU applications could cause space contention, making it difficult for some large-footprint kernels to run. Swapping some data of an evicted kernel to the host memory could be an option to alleviate the problem. Extensions of portable data placement optimizers (e.g., PORPLE [16-18]) to both host and device memory could facilitate the process. It is left for future study.

10. Related Work

With GPUs becoming an important computing resource, there is a strong interest in making GPUs a first-class resource managed by operating systems (OS). Gdev [19] integrates the GPU runtime support into the OS. It allows GPU memory to be shared among multiple GPU contexts and virtualizes the GPU into multiple logical ones. GPUfs [20] equips GPUs with file I/O support, allowing GPU code to access file systems directly through high-level APIs. GPUvm [1] investigates full- and para-virtualization for GPU virtualization. Wang and others [21] proposed OS-level management for GPU memory rather than letting it be managed by applications. Chen and others have recently studied runtime QoS control in data centers equipped with accelerators [22]. None of these studies have considered kernel preemptions.

Some hardware schemes have been proposed to reduce preemption overhead on GPUs. Tanasic and others [2] have proposed a hardware-based SM-draining scheme. Park and others [4] have proposed to select among different hardware-based preemption techniques based on the progress of a thread block. Lin and others have proposed the use of compression to reduce kernel switch overhead [5]. Wang and others [23] have designed simultaneous multikernel (SMK), a hardware method to support incremental kernel evictions. Although these schemes have shown high promise, they require new hardware support. In a certain sense, our solution could be regarded as a software-based approach to materializing SM-draining efficiently. Without requiring hardware extensions, it enables preemption on existing GPU devices and, meanwhile, offers software rich control over preemptive kernel scheduling.

Existing software proposals for enabling preemption use kernel slicing, as mentioned in the introduction. The scheme has been studied in the real-time community, represented by the recent proposals of GPES [7] and PKM [6]. A challenge in this scheme is to select the appropriate slice size. As Section 8 has shown, there is an inherent overhead-responsiveness dilemma. For real-time systems where the set of workloads is predefined, one could use profiling to find slice sizes suitable to these applications. But for general-purpose systems (e.g., data centers and clouds) with a large, dynamically changing set of workloads, the approach is difficult to use. In comparison, our solution requires no kernel slicing and hence is not subject to such a dilemma. Moreover, with kernel slicing, the preemption boundaries are still at kernel boundaries; for EffiSha, the preemption boundaries are at block-task boundaries. It is designed to fit the needs of general-purpose systems.

The problem EffiSha addresses is to enable flexible time sharing of a GPU among different applications. Orthogonal to it is the support of spatial sharing of a GPU. The problem there is how to maximize GPU efficiency by better controlling the resource usage of each kernel when multiple GPU kernels run at the same time on one GPU. Techniques such as Elastic Kernel [24] and Kernelet-based co-scheduling [25] have been proposed for this purpose. A recent work proposes to partition SMs among kernels for efficient resource usage through SM-centric program transformations [26]. There are also some hardware proposals for a similar purpose [27]. Spatial sharing and time sharing are orthogonal in the sense that both are needed to make GPUs better serve a multi-user shared environment.

Another layer of scheduling is about the tasks within a GPU kernel or between CPU and GPU [8, 28-32]. They are complementary to the scheduling of different kernels on the GPU.

Persistent threads have been used in numerous previous studies. For instance, Chen and Shen used them in Free Launch [10] to mitigate the large overhead of dynamic kernel launches. EffiSha leverages persistent threads to gain controllability of GPU task scheduling.
11. Conclusion

In this work, through EffiSha, we demonstrate a software solution that enables preemptive scheduling for GPUs without the need for hardware extensions. Experiments show that EffiSha is able to support kernel preemptions with less than 4% overhead on average. It opens up opportunities for priority-based preemptive management of kernels on existing GPUs. Enabled by EffiSha, a dynamic priority-based preemptive scheduler demonstrates significant potential for reducing the average turnaround time (18-65% on average) and improving the overall system throughput (1.35X-1.8X on average) of program executions that time-share a GPU.

We thank the anonymous reviewers for the helpful comments. This material is based upon work supported by DOE Early Career Award (DE-SC0013700), the National Science Foundation (NSF) under Grant No. 1455404, 1455733 (CAREER), 1525609, 1216569, and 1618509, and a Google Faculty Award. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of DOE, NSF, or Google.

References

[1] Y. Suzuki, S. Kato, H. Yamada, and K. Kono, "Gpuvm: Why not virtualizing gpus at the hypervisor?," in Proceedings of the 2014 USENIX Conference on USENIX Annual Technical Conference, USENIX ATC'14, (Berkeley, CA, USA), pp. 109-120, USENIX Association, 2014.

[2] I. Tanasic, I. Gelado, J. Cabezas, A. Ramirez, N. Navarro, and M. Valero, "Enabling preemptive multiprogramming on gpus," in Proceedings of the 41st Annual International Symposium on Computer Architecture, pp. 193-204, IEEE Press, 2014.

[3] K. Menychtas, K. Shen, and M. L. Scott, "Disengaged scheduling for fair, protected access to computational accelerators," in Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems, 2014.

[4] J. J. K. Park, Y. Park, and S. Mahlke, "Chimera: Collaborative preemption for multitasking on a shared gpu," in Proceedings of the Twentieth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 593-606, ACM, 2015.

[5] Z. Lin, L. Nyland, and H. Zhou, "Enabling efficient preemption for simt architectures with lightweight context switching," in the International Conference on High Performance Computing, Networking, Storage, and Analysis (SC16), 2016.

[6] C. Basaran and K. Kang, "Supporting preemptive task executions and memory copies in gpgpus," in Proceedings of the 24th Euromicro Conference on Real-Time Systems, 2012.

[7] H. Zhou, G. Tong, and C. Liu, "Gpes: A preemptive execution system for gpgpu computing," in Real-Time and Embedded Technology and Applications Symposium (RTAS), 2015 IEEE, pp. 87-97, IEEE, 2015.

[8] K. Gupta, J. A. Stuart, and J. D. Owens, "A study of persistent threads style gpu programming for gpgpu workloads," in Innovative Parallel Computing, 2012.

[9] A. Danalis, G. Marin, C. McCurdy, J. S. Meredith, P. C. Roth, K. Spafford, V. Tipparaju, and J. S. Vetter, "The scalable heterogeneous computing (shoc) benchmark suite," in GPGPU, 2010.

[10] G. Chen and X. Shen, "Free launch: Optimizing gpu dynamic kernel launches through thread reuse," in Proceedings of the Annual IEEE/ACM International Symposium on Microarchitecture, 2015.

[11] Y. Yang and H. Zhou, "Cuda-np: Realizing nested thread-level parallelism in gpgpu applications," SIGPLAN Not., vol. 49, pp. 93-106, Feb. 2014.

[12] "http://clang.llvm.org."

[13] S. Eyerman and L. Eeckhout, "System-level performance metrics for multiprogram workloads," IEEE Micro, no. 3, pp. 42-53, 2008.

[14] A. S. Tanenbaum, Modern Operating Systems. Pearson, 2007.

[15] R. D. Pietro, F. Lombardi, and A. Villani, "Cuda leaks: A detailed hack for cuda and a (partial) fix," ACM Trans. Embed. Comput. Syst., vol. 15, pp. 15:1-15:25, Jan. 2016.

[16] G. Chen, B. Wu, D. Li, and X. Shen, "Porple: An extensible optimizer for portable data placement on gpu," in Proceedings of the 47th International Conference on Microarchitecture, 2014.

[17] G. Chen and X. Shen, "Coherence-free multiview: Enabling reference-discerning data placement on gpu," in Proceedings of the 2016 International Conference on Supercomputing, ICS '16, (New York, NY, USA), pp. 14:1-14:13, ACM, 2016.

[18] G. Chen, X. Shen, B. Wu, and D. Li, "Optimizing data placement on gpu memory: A portable approach," IEEE Transactions on Computers, vol. PP, no. 99, 2016.

[19] S. Kato, M. McThrow, C. Maltzahn, and S. Brandt, "Gdev: First-class gpu resource management in the operating system," in Proceedings of the 2012 USENIX Conference on Annual Technical Conference, USENIX ATC'12, (Berkeley, CA, USA), pp. 37-37, USENIX Association, 2012.

[20] M. Silberstein, B. Ford, I. Keidar, and E. Witchel, "Gpufs: Integrating a file system with gpus," ACM Trans. Comput. Syst., vol. 32, pp. 1:1-1:31, Feb. 2014.
[21] K. Wang, X. Ding, R. Lee, S. Kato, and X. Zhang, "Gdm: Device memory management for gpgpu computing," in The 2014 ACM International Conference on Measurement and Modeling of Computer Systems, SIGMETRICS '14, (New York, NY, USA), pp. 533-545, ACM, 2014.

[22] Q. Chen, H. Yang, J. Mars, and L. Tang, "Baymax: Qos awareness and increased utilization for non-preemptive accelerators in warehouse scale computers," in International Conference on Architectural Support for Programming Languages and Operating Systems, 2016.

[23] Z. Wang, J. Yang, R. Melhem, B. Childers, Y. Zhang, and M. Guo, "Simultaneous multikernel gpu: Multi-tasking throughput processors via fine-grained sharing," in Proceedings of the 2016 IEEE International Symposium on High Performance Computer Architecture, HPCA'16, 2016.

[24] S. Pai, M. J. Thazhuthaveetil, and R. Govindarajan, "Improving GPGPU Concurrency with Elastic Kernels," in International Conference on Architectural Support for Programming Languages and Operating Systems, 2013.

[25] J. Zhong and B. He, "Kernelet: High-throughput gpu kernel executions with dynamic slicing and scheduling," CoRR, vol. abs/1303.5164, 2013.

[26] B. Wu, G. Chen, D. Li, X. Shen, and J. Vetter, "Enabling and exploiting flexible task assignment on gpu through sm-centric program transformations," in Proceedings of the International Conference on Supercomputing, ICS '15, 2015.

[27] J. T. Adriaens, K. Compton, N. S. Kim, and M. J. Schulte, "The Case for GPGPU Spatial Multitasking," in International Symposium on High Performance Computer Architecture, 2012.

[28] L. Chen, O. Villa, S. Krishnamoorthy, and G. Gao, "Dynamic load balancing on single- and multi-gpu systems," in IPDPS, 2010.

[29] S. Xiao and W. chun Feng, "Inter-block gpu communication via fast barrier synchronization," in IPDPS, 2010.

[30] T. Aila and S. Laine, "Understanding the efficiency of ray traversal on gpus," in Proceedings of the Conference on High Performance Graphics 2009, HPG '09, 2009.

[31] S. Tzeng, A. Patney, and J. D. Owens, "Task management for irregular-parallel workloads on the gpu," in Proceedings of the Conference on High Performance Graphics, 2010.

[32] M. Steinberger, M. Kenzel, P. Boechat, B. Kerbl, M. Dokter, and D. Schmalstieg, "Whippletree: Task-based scheduling of dynamic workloads on the gpu," ACM Transactions on Computer Systems, vol. 33, no. 6, 2014.