The Case for GPGPU Spatial Multitasking

Jacob T. Adriaens, Katherine Compton, Nam Sung Kim
Department of Electrical and Computer Engineering
University of Wisconsin - Madison

Michael J. Schulte
AMD Research

[email protected] [email protected], [email protected], [email protected]

978-1-4673-0826-7/12/$26.00 ©2011 IEEE

Abstract

The set-top and portable device market continues to grow, as does the demand for more performance under increasing cost, power, and thermal constraints. The integration of Graphics Processing Units (GPUs) into these devices and the emergence of general-purpose computations on graphics hardware enable a new set of highly parallel applications. In this paper, we propose and make the case for a GPU multitasking technique called spatial multitasking. Traditional GPU multitasking techniques, such as cooperative and preemptive multitasking, partition GPU time among applications, while spatial multitasking allows GPU resources to be partitioned among multiple applications simultaneously. We demonstrate the potential benefits of spatial multitasking with an analysis and characterization of General-Purpose GPU (GPGPU) applications. We find that many GPGPU applications fail to utilize available GPU resources fully, which suggests the potential for significant performance benefits using spatial multitasking instead of, or in combination with, preemptive or cooperative multitasking. We then implement spatial multitasking and compare it to cooperative multitasking using simulation. We evaluate several heuristics for partitioning GPU stream multiprocessors (SMs) among applications and find spatial multitasking shows an average speedup of up to 1.19 over cooperative multitasking when two applications are sharing the GPU. Speedups are even higher when more than two applications are sharing the GPU.

1. Introduction

Set-top and portable devices are becoming increasingly popular and powerful. Due to the cost, power, and thermal constraints placed on these devices, often they are designed with a low-power general-purpose CPU and several heterogeneous processors, each specialized for a subset of the device’s tasks. These heterogeneous systems increasingly include programmable Graphics Processing Units (GPUs). The iPhone 4, for example, contains a programmable GPU in addition to a general-purpose CPU and several Application-Specific Instruction Processors (ASIPs). Transforming the GPU from a graphics and compute offload device to a general-purpose data-parallel processor has the potential to enable entirely new classes of applications that were previously unavailable on mobile devices due to performance and power constraints.

GPGPU computations are motivated by GPUs’ tremendous computational capabilities and high memory bandwidth for data-parallel workloads [22]. A range of applications, from scientific computing to multimedia, are well-suited to this form of parallelism and achieve large speedups on a GPU. For example, Yang et al. achieve up to a 38x speedup compared to a high-performance CPU when using a GPU for real-time motion estimation [31].

Unfortunately, GPUs have very primitive support for multitasking, a key feature of modern computing systems. Multitasking provides concurrent execution of multiple applications on a single device. Advanced multitasking is critical for preserving user responsiveness and satisfying quality-of-service (QoS) requirements. NVIDIA’s Fermi supports co-executing multiple tasks from the same application on a single GPU [19]. However, even Fermi does not allow multiple different GPGPU applications to access GPU resources simultaneously. Other applications needing the GPU must wait until the application occupying the GPU voluntarily yields control. Having the application voluntarily yield control of the GPU is a form of cooperative multitasking. In contrast, on the CPU, the operating system (OS) typically uses preemptive multitasking, suspending and later resuming applications to time-share the CPU without the applications’ intervention or control. Both cooperative and preemptive multitasking are forms of temporal multitasking. Finally, multi-core CPUs support spatial multitasking, which allows multiple applications to execute simultaneously on different cores.

Until GPUs better support multitasking, they will continue to remain second-class computational citizens. As future technologies move the GPU onto the same chip as the CPU [30], the importance of advancing the GPU from a graphics-only co-processor to a multitasking parallel accelerator will grow. This will require development of new GPU multitasking techniques, both temporal and spatial.

In this paper, we present a characterization of GPGPU applications for the portable and set-top markets. With this characterization, we observe GPGPU applications exhibit unbalanced GPU resource utilization. Using simulation, we then demonstrate significant performance improvements when using spatial multitasking instead of cooperative multitasking due to more efficient use of GPU resources. We also evaluate several heuristics for partitioning GPU SMs among applications sharing the GPU. The key contributions of this work are:

(1) Our proposal for GPGPU spatial multitasking, which allows applications to execute simultaneously with GPU resources partitioned among them, rather than executing serially on all GPU resources.
(2) A detailed characterization of GPGPU applications demonstrating many GPGPU workloads show unbalanced usage of GPU resources.
(3) An evaluation of GPGPU spatial multitasking versus cooperative multitasking through cycle-accurate simulation.
(4) A comparison of heuristics for partitioning SMs among applications sharing a GPU via spatial multitasking.

This paper is organized as follows. Section 2 discusses temporal multitasking, and the details of and motivation for spatial multitasking. Section 3 presents our GPGPU workload analysis, which focuses on the inefficient use of resources by applications executing in isolation on the GPU, followed by the evaluation of spatial multitasking compared to cooperative multitasking and a comparison of several SM partitioning heuristics. Section 4 presents potential hardware and software challenges faced when implementing spatial multitasking. Section 5 discusses related work, and Section 6 provides our conclusions and a discussion of our planned future work.

2. GPU Multitasking

Initially, multiple graphics applications could only share a GPU via cooperative multitasking, requiring applications executing on the GPU to yield GPU control voluntarily. If a malicious or malfunctioning application never yielded, other applications were unable to use the GPU. Windows Vista, together with DirectX 10, introduced GPU preemptive multitasking for graphics applications, but not for GPGPU applications [17,24]. GPGPU applications con- [...] -cious GPGPU applications, Windows Vista and Windows 7 impose time limits on GPU computations, after which the OS requests that applications yield the GPU. If an application fails to yield, the GPU is reset, killing GPU computation [16]. Thus, GPGPU applications must be coded explicitly to yield the GPU during long computations so they will not be terminated. This means breaking up long GPGPU computations into a sequence of shorter computations. Further complicating the issue, different GPUs have varying performance characteristics, so computations that complete in the allotted time on one GPU may not on another. Even within the time limits, GPGPU calculations may be quite long, sacrificing interactive response time. Preempting applications from the GPU and/or allowing applications to run simultaneously could help solve these issues.

Although preemption addresses some GPU multitasking issues, there is a large overhead associated with context switches: saving the current GPGPU state of one application and restoring another’s. This state includes the register file and the GPU cores’ local memory data. For example, in the NVIDIA GT200 architecture, each GPU core, or SM, has a 64KB register file, 8KB constant cache, and a 16KB shared memory. A kernel using all 30 SMs of this architecture has a state size greater than 2.5MB [11].

In contrast, an AMD64 CPU core has 128 bytes of general-purpose registers, 256 bytes of media registers, and 80 bytes of floating-point registers [2]; this and other state together represent approximately 0.5KB that must be saved and restored for an AMD64 CPU context switch. The larger GPU kernel context size results in significantly more overhead for a GPU context switch than a CPU context switch.

To address the problems and challenges associated with temporal multitasking on the GPU, we propose spatial multitasking: allowing multiple GPGPU kernels to execute simultaneously, each using a subset of the GPU resources. Spatial multitasking differs from preemptive multitasking in that it divides GPU resources, rather than GPU time, among competing applications. For example, instead of giving two applications 100% of the GPU resources 50% of the time, spatial multitasking could grant each application 50% of the GPU resources 100% of the time. If one application completes, the other could then use 100% of the GPU resources. Figure 1 illustrates the differences among cooperative, preemptive, and spatial multitasking.

We have observed that many GPGPU workloads are tuned for a particular GPU generation and subsequent, more aggressive GPUs frequently show unbalanced resource

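
The 50%/100% example in Section 2 can be illustrated with a toy model. The sketch below is not the paper's simulator or one of its partitioning heuristics. It is a hypothetical model in which each application has a fixed amount of work and a maximum number of SMs it can use productively, and extra SMs beyond that limit give no benefit. All function names and workload numbers are assumptions chosen for illustration.

```python
def run_time(work, useful_sms, granted_sms):
    """Time to finish `work` units when only min(useful, granted) SMs contribute."""
    return work / min(useful_sms, granted_sms)


def cooperative_makespan(apps, total_sms):
    """Each app runs alone on the whole GPU, one after the other."""
    return sum(run_time(work, useful, total_sms) for work, useful in apps)


def spatial_makespan(apps, total_sms):
    """Two apps split the SMs evenly; the survivor takes the whole GPU afterwards."""
    (w1, u1), (w2, u2) = apps
    half = total_sms // 2
    rate1, rate2 = min(u1, half), min(u2, half)
    t_first = min(w1 / rate1, w2 / rate2)  # time until the first app completes
    rem1 = max(w1 - rate1 * t_first, 0.0)  # work left for each app at that point
    rem2 = max(w2 - rate2 * t_first, 0.0)
    if rem1 > 0:
        return t_first + run_time(rem1, u1, total_sms)
    if rem2 > 0:
        return t_first + run_time(rem2, u2, total_sms)
    return t_first


# Hypothetical workloads as (work units, SMs the app can actually use).
apps = [(300, 10), (300, 30)]
coop = cooperative_makespan(apps, total_sms=30)
spatial = spatial_makespan(apps, total_sms=30)
print(f"cooperative: {coop}, spatial: {spatial}, speedup: {coop / spatial:.2f}")
```

In this model, spatial multitasking only helps when at least one application cannot use the whole GPU: with two applications that both scale to all 30 SMs, the two schemes finish at the same time.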