iGPU: Exception Support and Speculative Execution on GPUs

Appears in the 39th International Symposium on Computer Architecture, 2012

Jaikrishnan Menon, Marc de Kruijf, Karthikeyan Sankaralingam
Department of Computer Sciences, University of Wisconsin-Madison
{menon, dekruijf, karu}@cs.wisc.edu

Abstract

Since the introduction of fully programmable vertex shader hardware, GPU computing has made tremendous advances. Exception support and speculative execution are the next steps to expand the scope and improve the usability of GPUs. However, traditional mechanisms to support exceptions and speculative execution are highly intrusive to GPU hardware design. This paper builds on two related insights to provide a unified lightweight mechanism for supporting exceptions and speculation on GPUs.

First, we observe that GPU programs can be broken into code regions that contain little or no live register state at their entry point. We then also recognize that it is simple to generate these regions in such a way that they are idempotent, allowing their entry points to function as program recovery points and enabling support for exception handling, fast context switches, and speculation, all with very low overhead. We call the architecture of GPUs executing these idempotent regions the iGPU architecture. The hardware extensions required are minimal and the construction of idempotent code regions is fully transparent under the typical dynamic compilation framework of GPUs. We demonstrate how iGPU exception support enables virtual memory paging with very low overhead (1% to 4%), and how speculation support enables circuit-speculation techniques that can provide over 25% reduction in energy.

1 Introduction

Since the introduction of fully programmable vertex shader hardware [23], GPU computing has made enormous strides. Modern GPUs incorporate sophisticated architecture and microarchitecture techniques such as predication, caching, and prefetching, while abstracting the details away from programmers through their software stack and dynamic compilation approach. To improve the effectiveness of GPUs as general-purpose computing devices, GPU programming models and architectures continue to evolve, and we foresee exception support and speculative execution as the next key steps in their evolution. Below, we reflect on the evolution of traditional CPUs to illuminate why this progression appears natural and imminent.

Just as CPU programmers were forced to explicitly manage CPU memories in the days before virtual memory, for almost a decade, GPU programmers directly and explicitly managed the GPU memory hierarchy. The recent release of NVIDIA’s Fermi architecture and AMD’s Fusion architecture, however, has brought GPUs to an inflection point: both architectures implement a unified address space that eliminates the need for explicit memory movement to and from GPU memory structures. Yet, without demand paging, something taken for granted in the CPU space, programmers must still explicitly reason about available memory. The drawbacks of exposing physical memory size to programmers are well known. Other issues like debugging and supporting arithmetic exceptions are likely to emerge as problems for future GPUs as well. Exception support is a fundamental pillar of modern CPUs and is used to provide all of the above features. To make the leap to becoming a truly general-purpose programming platform, we believe future GPUs will require robust exception support to enable virtual memory, and will significantly benefit from this support in other areas as well.

Modern GPUs are also positioned to benefit from speculation support in the near future. Shortly after the development of exception support in CPUs, speculation was developed as a mechanism to transparently handle “difficult” code, and a recent study claims that GPUs must similarly begin incorporating techniques like speculation to expand the domains they can target [6]. Speculation support has also increasingly been proposed for handling recovery from hardware reliability problems in CPUs [4, 17, 30]. Such problems, which include variability, noise, and excessive guard-banding, are also emerging problems for GPUs [7]. However, recent work on GPU solutions to overcome these problems still has at least 40% overheads [35]. As with CPUs, efficient speculation support in GPUs can serve as a fundamental primitive that enables support for a more diverse range of application programs and handling of hardware reliability issues.

1.1 Key Challenges

Exception support and speculative execution can expand the scope and improve the usability of GPUs. However, implementing them efficiently on GPUs presents key challenges. Below, we discuss exception and speculation support in CPUs and why CPU mechanisms are problematic to apply directly to GPUs. The three key challenges we identify are: consistent exception state, efficient context switching, and speculative writes.

For CPUs, the problem of exception support was solved at a relatively early stage [36, 38]. This support was a key enabler to their success, and instrumental in this success was the definition of precise exception handling, where an exception is handled precisely if, with respect to the excepting instruction, the exception is handled and the process resumed at a point consistent with the sequential architectural model [36]. With support for precise exceptions, all types of exceptions could be handled using a universal mechanism such as the re-order buffer. However, precise exception support has historically been difficult to implement for architectures that execute parallel SIMD or vector instructions, where precise state with respect to an individual instruction is not natural to the hardware. High fan-out control signals to maintain sequential ordering in a vector pipeline are challenging to implement, and while buffering and register renaming approaches have been proposed [14, 36], they are costly in terms of power, area, and/or performance. Hence, a key challenge is supporting consistent exception state: exposing sequentially-ordered program state to an exception handler and also enabling program restart from a self-consistent point in the program.

A second reason for the widespread adoption of precise exception support in CPUs was that it enabled support for demand paging in virtual memory systems: to overlap processor execution with the long latency of paging I/O, the state of a faulting process could be cleanly saved away and another process restored in its place. Simply borrowing techniques from the CPU space to implement context switching on GPUs, however, is difficult. In particular, saving GPU state and then context switching to another process while a page fault is handled imposes a monumental undertaking: while on a conventional CPU core a context switch requires little more than saving and restoring a few tens of registers, for a GPU it can require saving and restoring hundreds of thousands of registers. Thus, a second key challenge [...] respect to the program’s execution. On CPUs, this problem is handled simply by incorporating large hardware checkpointing or buffering structures to manage speculative state. The MIPS R10K for example, implements checkpointing by maintaining four copies of the register rename table [41]. However, the amount of register state on GPUs is simply too vast to consider this option. Hence, a third key challenge is supporting speculative writes: finding a way to manage large amounts of speculative program state.

1.2 Paper Overview

In this paper, we develop a low-overhead technique to support exceptions and speculative execution on GPUs. Fundamentally, we observe that the three key challenges of enabling precise exception and speculation recovery on GPUs ultimately distill down to just two core problems: (i) minimizing the amount of program state that needs to be preserved and (ii) enabling restart from a consistent program state. While previous work has explored optimizations to each of these pieces individually, the iGPU architecture developed in this work synergistically enables both.

In terms of preserving minimal program state, we observe that preserving live state alone is sufficient. Others have made this observation as well [28, 29, 34]. However, they have assumed either a checkpoint was available, or restarting from the same program state as at the site of a mis-speculation or exception was necessary. The architectural state on CPUs is also typically small (tens of registers) and hence the optimization of furthermore minimizing this live state has historically been relatively insignificant. For GPUs, however, minimizing the amount of state that must be managed to handle context switching is valuable.

Second, in terms of restarting the program from a consistent state, we observe that it is not always necessary to restart the program from the site of an exception or mis-speculation, even without checkpoints, and that restarting from consistent live state, as opposed to architectural state, is in most cases sufficient. Again, others have made this observation as well [9, 18, 22]. However, they largely ignore live-state minimization and/or do not provide general exception and speculation support. Other shortcomings that preclude their use for GPUs are discussed in Section 6.

This paper builds upon previous work and delivers a simple, elegant, and efficient solution to the problems of exception and speculation on GPUs. The iGPU architecture
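
As an illustration of the idempotence property that the abstract and Section 1.2 rely on, the following is a minimal sketch rather than the paper's mechanism. It assumes a common characterization of an idempotent region, namely one that never overwrites a value it reads on entry, and it models memory as a Python dict and a page fault as an exception raised partway through a region. All names (`idempotent_region`, `non_idempotent_region`, `run_with_restart`, `run_fault_free`) are hypothetical and chosen only for this example.

```python
# Illustrative model only: "memory" is a dict, a "region" is a Python function,
# and a fault is simulated by raising partway through the region and then
# re-running the region from its entry point. No name here comes from the paper.

def non_idempotent_region(mem, fault=False):
    # Overwrites its own input (x = x + 1), so a restart sees a changed input.
    mem["x"] = mem["x"] + 1
    if fault:
        raise RuntimeError("simulated page fault")
    mem["y"] = mem["x"] * 2


def idempotent_region(mem, fault=False):
    # Never overwrites a value it reads on entry, so re-execution from the
    # entry point sees the same inputs and recomputes the same outputs.
    mem["t"] = mem["x"] + 1
    if fault:
        raise RuntimeError("simulated page fault")
    mem["y"] = mem["t"] * 2


def run_fault_free(region, mem):
    """Run a region once with no fault and return its output."""
    region(mem, fault=False)
    return mem["y"]


def run_with_restart(region, mem):
    """Run a region; on a simulated fault, re-execute it from its entry point."""
    try:
        region(mem, fault=True)
    except RuntimeError:
        region(mem, fault=False)  # recovery is plain re-execution, no checkpoint
    return mem["y"]


if __name__ == "__main__":
    for region in (idempotent_region, non_idempotent_region):
        expected = run_fault_free(region, {"x": 1})
        recovered = run_with_restart(region, {"x": 1})
        print(region.__name__, "fault-free:", expected, "after restart:", recovered)
```

In this toy model the idempotent region yields the same result with or without the simulated fault, while the in-place variant does not, which is why the entry point of an idempotent region can serve as a recovery point without saved state beyond its live inputs.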
