The CRISP Performance Model for Dynamic Voltage and Frequency Scaling in a GPGPU

Rajib Nath
University of California, San Diego
9500 Gilman Drive, La Jolla, CA

Dean Tullsen
University of California, San Diego
9500 Gilman Drive, La Jolla, CA

ABSTRACT

This paper presents CRISP, the first runtime analytical model of performance in the face of changing frequency in a GPGPU. It shows that prior models not targeted at a GPGPU fail to account for important characteristics of GPGPU execution, including the high degree of overlap between memory access and computation and the frequency of store-related stalls. CRISP provides significantly greater accuracy than prior runtime performance models, being within 4% on average when scaling frequency by up to 7X. Using CRISP to drive a runtime energy efficiency controller yields a 10.7% improvement in energy-delay product, vs 6.2% attainable via the best prior performance model.

Categories and Subject Descriptors

C.1.4 [Processor Architectures]: Parallel Architectures; C.4 [Performance of Systems]: Modeling techniques

Keywords

Critical Path, GPGPU, DVFS

1. INTRODUCTION

This paper describes the CRISP (CRItical Stalled Path) performance predictor for GPGPUs under varying core frequency. Existing analytical models that account for frequency change only target CPU performance, and do not effectively capture the execution characteristics of a GPU.

Dynamic voltage and frequency scaling (DVFS) [1] has shown the potential for significant power and energy savings in many system components including processor cores [2, 3], the memory system [4, 5, 6], the last level cache [7], and the interconnect [8, 9, 10]. DVFS scales voltage and frequency for energy and power savings, but can also address other problems such as temperature [11], reliability [12], and variability [13]. However, in all cases it trades off performance for other gains, so properly setting DVFS to maximize particular goals can only be done with the assistance of an accurate estimate of the performance of the system at alternate frequency settings.

While extensive research has been done on DVFS performance modeling for CPU cores, there is little guidance in the literature for GPU and GPGPU settings, despite the fact that modern GPUs have extensive support for DVFS [14, 15, 16, 17]. However, the potential for DVFS on GPUs is high for at least two reasons. First, the maximum power consumption in a GPGPU is often higher than CPUs, e.g., 250 Watts for the NVIDIA GTX480. Second, while the range of DVFS settings in modern CPUs is shrinking, that is not yet the case for GPUs, which have ranges such as 0.5V-1.0V (NVIDIA Fermi [18]).

This work presents a performance model that closely tracks GPGPU performance on a wide variety of workloads, and significantly outperforms existing models which are tuned to the CPU. We show that the GPU presents key differences that are not handled by those models.

Most DVFS performance models are empirical and use statistical [19, 20, 21, 22, 23] or machine learning methods [24, 25], or assume a simple linear relationship [26, 27, 28, 29, 30, 31, 32] between performance and frequency. Abe et al. [14] target GPUs, but with a regression-based offline statistical model that is neither targeted for, nor conducive to, runtime analysis.

Recent research presents new performance counter architectures [33, 34, 35] to model the impact of frequency scaling on CPU workloads. These analytical models (e.g., leading load [33] and critical path [34]) divide a CPU program into two disjoint segments representing the non-pipelined (T_Memory) and pipelined (T_Compute) portions of the computation. They make two assumptions: (a) the memory portion of the computation never scales with frequency, and (b) cores never stall for store operations. Though these assumptions are reasonably valid in CPUs, they start to fail as we move to massively parallel architectures like GPGPUs. No existing runtime analytical models for GPGPUs handle the effect of voltage and frequency changes.

A program running on a GPGPU, typically referred to as a kernel, will have a high degree of thread level parallelism. The single instruction multiple thread (SIMT) parallelism in GPGPUs allows a significant portion of the computation to be overlapped with memory latency, which can make the memory portion of the computation elastic to core clock frequency and thus breaks the first assumption of the prior models. Though memory/computation overlap has been used for program optimization, modeling frequency scaling in the presence of abundant parallelism and high memory/computation overlap is a new and complex problem. Moreover, due to the wide single instruction multiple data (SIMD) vector units in GPGPU streaming multiprocessors (SMs) and the homogeneity of computation in SIMT scheduling, GPGPU SMs often stall due to stores.

The CRISP predictor accounts for these differences, plus others. It provides accuracy within 4% when scaling from 700 MHz to 400 MHz, and 8% when scaling all the way to 100 MHz. It reduces the maximum error by a factor of 3.6 compared to prior models. Additionally, when used to direct GPU DVFS settings, it enables nearly double the gains of prior models (10.7% EDP gain vs 5.7% with critical path and 6.2% with leading load). We also show CRISP effectively used to reduce ED2P, or to reduce energy while maintaining a performance constraint.
2. MOTIVATION AND RELATED WORK

Our model of GPGPU performance in the presence of DVFS builds on prior work on modeling CPU DVFS. Since there are no analytical DVFS models for GPGPUs, we will describe some of the prior work in the CPU domain and the shortcomings of those models when applied to GPUs. We also describe recent GPU-specific performance models at the end of this section.

2.1 CPU-based Performance Models for DVFS

Effective use of DVFS relies on some kind of prediction of the performance and power of the system at different voltage and frequency settings. An inaccurate model results in the selection of non-optimal settings and lost performance and/or energy.

Existing DVFS performance models are primarily aimed at the CPU. They each exploit some kind of linear model, based on the fact that CPU computation scales with frequency while memory latency does not. These models fall into four classes: (a) proportional, (b) sampling, (c) empirical, and (d) analytical. The proportional scaling model assumes a linear relation between the performance and the core clock frequency. The sampling model better accounts for memory effects by identifying a scaling factor (i.e., slope) of each application from two different voltage-frequency operating points. The empirical model approximates that slope by counting aggregate performance counter events, such as the number of memory accesses. This model cannot distinguish between applications with low or high memory level parallelism (MLP) [33].

An analytical model views a program as an alternating sequence of compute and memory phases. In a compute phase, the core executes useful instructions without generating any memory requests (e.g., instruction fetches, data loads) that reach main memory. A memory request which misses the last level cache might trigger a memory phase, but only when a resource is exhausted and the pipeline stalls. Once the instruction or data access returns, the core typically begins a new compute phase. In all the current analytical models, the execution time (T) of a program at a voltage-frequency setting with cycle time t is expressed as

T(t) = T_Compute(t) + T_Memory    (1)

They strictly assume that the pipeline portion of the execution time (T_Compute) scales with the frequency, while the non-pipeline portion (T_Memory) does not. The execution time at an unexplored voltage-frequency setting (v2, f2) can be estimated from the measurement in the current setting (v1, f1) by

T(v2, f2) = T_Compute(v1, f1) x (f1/f2) + T_Memory    (2)

The accuracy of these models relies heavily on accurate classification of cycles into compute or memory phase cycles, and this is the primary way in which these models differ. We describe them next.

Stall Time: Stall time [35] estimates the non-pipeline phase (T_Memory) as the number of cycles the core is unable to retire any instructions due to outstanding last level cache misses. It ignores computation performed during the miss before the stall. Since the scaling of this portion of computation can be hidden under memory latency, stall time overpredicts execution time.

Miss Model: The Miss model [35] includes ROB fill time inside the memory phase. It identifies all the stand alone misses, but only the first miss in a group of overlapped misses, as contributors to T_Memory. In particular, it ignores misses that occur within the memory latency of an initial miss. The resulting miss count is then multiplied by a fixed memory latency to compute T_Memory. This approach loses accuracy due to its fixed latency assumption and also ignores stall cycles due to instruction cache misses.

Leading Load: Leading load [33, 36] recognizes both front-end stalls due to instruction cache misses and back-end stalls due to load misses. Their model counts the cycles when an instruction cache miss occurs in the last level cache. It adopts a similar approach to the miss model to address memory level parallelism for data loads. However, to account for variable memory latency, it counts the actual number of cycles a contributor load is outstanding.

Critical Path: Critical path [34] computes T_Memory as the longest serial chain of memory requests that stalls the processor. The proposed counter architecture maintains a global critical path counter (CRIT) and a critical path time stamp register (tau_i) for each outstanding load i. When a miss occurs in the last level cache (instruction or data), the snapshot of the global critical path counter is copied into the load's time stamp register. Once the load i returns from memory, the global critical path counter is updated with the maximum of CRIT and tau_i + L_i, where L_i is the memory latency for load i.

[Figure 1: T_Memory computation for a CPU program running at frequency f with different linear performance models. The timelines at frequency f and f/2 show one stand-alone load miss (A) and two overlapped load misses (B, C) over a compute/memory breakdown; the embedded table reports each model's T_Memory and its predicted time at f/2: STALL 18/48, MISS 16/50, LEAD 15/51, CRIT 20/46.]

Example: Figure 1 provides an illustrating example of a CPU program that has one stand alone load miss (A at cycle 0) and two overlapped load misses (B at cycle 12 and C at cycle 14). At clock frequency f, the core stalls between cycles 1-7 and 15-25 for the loads to complete. The total execution time of the program at clock frequency f is 33 units (18 units stalls + 15 units computation). As we scale down the frequency to f/2, the compute cycles (0, 8-14, 26-32) scale by a factor of 2. However, the expansion of the compute cycles at 0 and 14 is hidden by loads A and C respectively. Thus the execution time at frequency f/2 becomes 46 units.

For each performance model, we identify the cycles classified as the non-pipelined (does not scale with the frequency of the pipeline) portion of computation. Leading load (LEAD) computes T_Memory as the sum of A and B's latencies (7+8=15). The Miss model (MISS) identifies two contributor misses (A and B) and reports T_Memory to be 16 (8 x 2). Among the two dependent load paths (A+B with length 8+7=15, A+C with length 8+12=20), critical path (CRIT) selects A+C as the longest chain of dependent memory requests and computes the critical path as 20. Stall time (STALL) counts the 18 cycles of load related stalls.

We plug the value of T_Memory into equation 2 and predict the execution time at clock frequency f/2. The predicted execution time for the different models is shown in Figure 1. Note that the models also differ in their counting of T_Compute. Critical path (CRIT) outperforms the other models in this example. Although these models show reasonable accuracy on CPUs, optimization techniques like software and hardware prefetching may introduce significant overlap between computation and memory load latency, and thus invalidate the non-elasticity assumption of the memory component. In that case, the CRISP model should also provide significant improvement for CPU performance prediction.
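The arithmetic behind these predictions can be made concrete with a short script. The sketch below simply applies Equation 2 to the Figure 1 example, taking each model's T_Memory from the text and treating the remaining measured cycles as that model's T_Compute; the variable names are ours, not part of any prior model's implementation.

```python
# Equation 2 applied to the Figure 1 example: T(f/2) = T_Compute * (f1/f2) + T_Memory.
# T_Memory values per model come from the text; T_Compute is the remaining cycles.
TOTAL_AT_F = 33          # 18 stall units + 15 compute units at frequency f
SCALE = 2                # predicting for f/2, so f1/f2 = 2

t_memory = {"STALL": 18, "MISS": 16, "LEAD": 15, "CRIT": 20}

for model, mem in t_memory.items():
    compute = TOTAL_AT_F - mem           # cycles the model treats as frequency-elastic
    predicted = compute * SCALE + mem    # Equation 2
    print(f"{model}: T_Memory={mem}, predicted T(f/2)={predicted}")

# Prints 48, 50, 51 and 46 respectively; the actual runtime at f/2 is 46 units,
# so CRIT is exact for this example while the other models over-predict.
```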
2.2 DVFS in GPGPUs

To explore the applicability of DVFS in GPGPUs, we simulate an NVIDIA Fermi architecture using GPGPUSim [37] and run nine kernels from the Rodinia [38] benchmark suite at different core clock frequencies, from 700MHz (baseline) to 100MHz (minimum). The performance curve in Figure 2 shows the increase in runtime as we scale down the core frequency, i.e., increase the clock cycle time (x-axis).

[Figure 2: Performance impact of core clock frequency scaling in a subset of GPGPU kernels from the Rodinia benchmark suite. The x-axis spans clock cycle times from 1.43ns (700MHz) to 10.00ns (100MHz); the y-axis shows the increase in execution time (%).]

We make two major observations from this graph. First, like CPUs, there is significant opportunity for DVFS in GPUs. We see this because two of the kernels (cfd and km) are memory bound and insensitive to frequency changes (they have a flat slope in this graph), while three others (ge, lm, and hw) are only partially sensitive. For the memory bound kernels, it is possible to reduce clock frequency and thus save energy without any performance overhead.

Our second observation is that the linear assumption of all prior models does not strictly hold, as the partially sensitive kernels (e.g., lm and hw) seem to follow two different slopes, depending on the frequency range. At high frequency, they follow a flat slope, as if they were memory bound (e.g., between 700MHz and 300MHz in lm), but at lower frequencies, they scale like compute bound applications (e.g., below 300MHz in lm).

So while several of the principles that govern DVFS in CPUs also apply to GPUs, we identify several differences which make the existing models inadequate for GPUs. All of these differences stem from the high thread level parallelism in GPU cores, resulting from the SIMT parallelism enabled through warp scheduling. The first difference is the much higher incidence of overlapped computation and memory access. The second is the homogeneity of computation in SIMT scheduling, resulting in the easy saturation of resources like the store queue. The third is the difficulty of classifying stalls (e.g., an idle cycle may occur when 5 warps are stalled for memory and three are stalled for floating point RAW hazards). We will discuss each of these differences, as well as a fourth that is more of an architectural detail and easily handled.

Memory/Computation Overlap: Existing models treat computation that happens in series with memory access the same as computation that happens in parallel. When the latter is infrequent, those models are fairly accurate, but in a GPGPU with high thread level parallelism, it is common to find significant computation that overlaps with memory.

This overlapped computation has more complex behavior than the non-overlapped. In the top timeline in Figure 1 the computation at cycle 0 is overlapped with memory, so even though it is elongated it still does not impact execution time; thus, that computation looks more like memory than computation. However, if we extend the clock enough (i.e., by more than a factor of 8 in this example), that computation completely covers the memory latency and then begins to impact total execution time, now looking like computation again. This phenomenon exactly explains the two-slope behavior exhibited by hw and lm in Figure 2.

Store Stalls: A core (CPU or GPU) would stall for a store only when the store queue fills, forcing it to stall waiting for one of the stores to complete. While this is a rare occurrence in CPUs, in a GPU with SIMT parallelism it is common for all of the threads to execute stores at once, easily flooding the store queue and resulting in memory stalls that do not scale with frequency.

Complex Stall Classification: Identifying the cause of a stall cycle is difficult in the presence of high thread level parallelism. For example, if several threads are stalled waiting for memory, but a few others are stalled waiting for floating point operations to complete, is this a computation or memory phase? Prior works (which start counting memory cycles only after the pipeline stalls) would categorize this as computation because the pipeline is not yet fully stalled for memory, and the ongoing floating point operations will scale with frequency.

Consider what happens when, as described earlier, the overlapped computation is scaled to the point where it completely covers the memory latency and now starts to interfere with other computation. The originally idle cycles (e.g., between dependent floating point ops) will not contribute to extending execution time, but rather will now be interleaved with the other computation. Only the cycles where the pipeline was actually issuing instructions will impact execution time. Thus, unlike other models, we generally treat mixed-cause idle cycles (both memory and CPU RAW hazards) as memory cycles rather than computation cycles.

L1 Cache Miss: A substantial portion of memory latency in a typical GPGPU is due to the interconnect that bridges the streaming cores to the L2 cache and GPU global memory. The leading load and critical path models start counting cycles as the load request misses the last level cache, which works for CPUs where all on-chip caches are clocked together; however, in our GPU the interconnect and last level cache are in an independent clock domain. Since this is not a shortcoming in the models, just a change in the assumed architecture, we adapt these models (as well as our own) to begin counting memory stalls as soon as they miss the L1 cache. This results in significantly improved accuracy for those models on GPUs. Thus, all results shown in this paper for leading load and critical path incorporate that improvement.

2.3 Models for GPGPUs

A handful of power and energy models have recently been proposed for GPGPUs [18]. However, very few of these models involve performance predictions. Dao et al. [39] propose linear and machine learning based models to estimate the runtime of kernels based on workload distribution. The integrated GPU power and performance (IPP) [40] model predicts the optimum number of cores for a GPGPU kernel to reduce power. However, they do not address the effect of clock frequency on GPU performance. Moreover, they ignore cache misses [41], which is critical for a DVFS model. Song et al. [42] contribute a similar model for a distributed system with GPGPU accelerators. Chen et al. [43] track NBTI-induced threshold voltage shift in GPGPU cores and determine the optimum number of cores by extending IPP. Equalizer [44] maintains performance counters to classify kernels as compute-intensive, memory-intensive, or cache sensitive, using those classifications to change thread counts or P-states. However, because they have no actual model to predict performance, it still amounts to a sampling-based search for the right DVFS state.

None of these models address the effect of core clock frequency on kernel performance. CRISP is unique in that it is the first runtime analytical model, either for CPUs or GPGPUs, to model the non-linear effect of frequency on performance; this effect can be seen in either system, but is just more prevalent on GPGPUs. The only GPU analytical model that accounts for frequency is an empirical model [45] that builds a unique non-linear performance model for each workload; this is intended to be used offline and is not practical for an online implementation. Abe et al. [14, 15] also present an offline solution. They use a multiple linear regression based approach with higher average prediction error (40%). Their maximum error is more than 100%.

3. GPGPU DVFS PERFORMANCE MODEL

This section describes the critical stalled path (CRISP) performance predictor, a novel analytical model for predicting the performance impact of DVFS in modern many-core architectures like GPGPUs.

3.1 Critical Stalled Path

[Figure 3: Critical Stalled Path - Abstract View. Execution is divided into a load critical path (LCP), containing overlapped computation and load stalls, and a compute/store path (CSP), containing pure computation and store stalls.]

Our model interprets computation in a GPGPU as a collection of three distinct phases: load outstanding, pure compute, and store stall (Figure 3). The core remains in a pure compute phase in the absence of an outstanding load or any store related stall. Once a fetch or load miss occurs in one of the level 1 caches (constant, data, instruction, texture), the core enters a load outstanding phase and stays there as long as a miss is outstanding. During this phase, the core may schedule instructions or stall due to control, data, or structural hazards. The store stall phase represents the idle cycles due to store operations, in the explicit case where the pipeline is stalled due to store queue saturation.

We split the total computation time into two disjoint segments: a load critical path (LCP) and a compute/store path (CSP).

T(t) = T_LCP(t) + T_CSP(t)    (3)

This division allows us to separately analyze (1) loads and the computation that competes with loads to become performance-critical, versus (2) computation that never overlaps loads and is exposed regardless of the behavior of the loads. Store stalls fall into the latter category because any cycle that stalls in the presence of both loads and stores is counted as a load stall. The load critical path is the length of the longest sequence of dependent memory loads, possibly overlapped with computation from parallel warps. Thus, it does not necessarily include all loads, enabling it to account for memory level parallelism by not counting loads that occur in parallel, similar to [34, 33]. Meanwhile, we form the compute/store path as a summation of non-overlapped compute phases, store stall phases, and computation overlapped only by loads not deemed part of the critical path. Since the two segments in Equation 3 scale differently with core clock frequency, we develop separate models for each.

3.2 Load Critical Path Portion

Streaming multiprocessors (SMs) have high thread level parallelism. The maximum number of concurrently running warps per SM varies from 24 to 64 depending upon the compute capability of the device (e.g., 48 in the NVIDIA Fermi architecture). This enables the SM to hide significant amounts of memory latency, resulting in heavy overlap between load latency and useful computation. However, it often does not hide all of it, due to lack of parallelism (either caused by low inherent parallelism, or restricted parallelism due to resource constraints) or poor locality.

We represent the load critical path as a combination of memory related overlapped stalls (T^Stall_LCP) and overlapped computation (T^Comp_LCP). Prior CPU-based models' assumption of T_Memory's inelasticity creates a larger inaccuracy in GPGPUs than it does for CPUs because of the magnitude of T^Comp_LCP, which scales linearly with frequency. The expansion of overlapped computation stays hidden under T_Memory as long as the frequency scaling factor is less than the ratio between T_Memory and the overlapped computation. As the frequency is scaled further, the load critical path can be dominated by the length of the scaled overlapped computation. Thus, we define the LCP portion of the model as

T_LCP(t) = max(T_Memory, (f1/f2) x T^Comp_LCP)    (4)

We describe details on parameterizing this equation from hardware events in Section 3.5 and Section 3.6. Figure 4 shows the normalized execution time of the lm kernel for 7 different frequencies. As expected, we see the load stalls decreasing with lower frequency, but without increasing total execution time for small changes in frequency. As the frequency is decreased further, the overlapped computation increases, with much of it turning into non-overlapped computation (as some of the original overlapped computation now extends past the duration of the load), resulting in an increase in total execution time. Prior models assume that overlapped computation always remains overlapped, and miss this effect.

[Figure 4: Runtime of the lm kernel at different frequencies (700 MHz down to 100 MHz), broken down into T^Comp_LCP, T^Stall_LCP, and T_CSP, normalized to the 700 MHz runtime.]

3.3 Compute/Store Path Portion

The compute/store critical path includes computation (T^Comp_CSP) that is not overlapped with memory operations and simply scales with frequency, plus store stalls (T^Stall_CSP). Streaming multiprocessors (SMs) have a wide SIMD unit (e.g., 32 in Fermi). A SIMD store instruction may result in 32 individual scalar memory store operations if uncoalesced (e.g., A[tid * 64] = 0). As a result, the load-store queue (LSQ) may overflow very quickly during a store dominant phase in the running kernel. Eventually, the level 1 cache runs out of miss status holding register (MSHR) entries and the instruction scheduler stalls due to the scarcity of free entries in the LSQ or MSHR. We observe this phenomenon in several of our application kernels, as well as in microbenchmarks we used to fine-tune our models. Again, this is a rare case for CPUs, but easily generated on a SIMD core.

Although the store stalls are memory delays, they do not remain static as frequency changes. As frequency slows, the rate (as seen by the memory system) of store generation slows, making it easier for the LSQ and the memory hierarchy to keep up. A simple model (meaning one that requires minimal hardware to track) that works extremely well empirically assumes that the frequency of stores decreases in response to the slowdown of non-overlapped computation (recall, the overlapped computation may also generate stores, but those stores do not contribute to the stall being considered here). Thus we express the compute/store path by the following equation in the presence of DVFS.

T_CSP(t) = max(T^Comp_CSP + T^Stall_CSP, (f1/f2) x T^Comp_CSP)    (5)

We see this phenomenon, but only slightly, in Figure 4, and much more clearly in Figure 5, which shows one particular store-bound kernel in cfd. As frequency decreases, computation expands but does not actually increase total execution time; it only decreases the contribution of store stalls.

[Figure 5: Runtime of the cfd kernel at different frequencies (700 MHz down to 100 MHz), broken down into T^Comp_LCP, T^Stall_LCP, T^Comp_CSP, and T^Stall_CSP, normalized to the 700 MHz runtime.]

We assume that the stretched overlapped computation and the pure computation will serialize at a lower frequency. This is a simplification, and a portion of CRISP's prediction error stems from this assumption. This was a conscious choice to trade off accuracy for complexity, as tracking those dependencies between instructions is another level of complexity and tracking.

In summary, then, CRISP divides all cycles into four distinct categories: (1) pure computation, which is completely elastic with frequency; (2) load stall, which as part of the load critical path is inelastic with frequency; (3) overlapped computation, which is elastic but potentially hidden by the load critical path; and (4) store stall, which tends to disappear as the pure computation stretches.
3.4 Example

[Figure 6: T_Memory computation for a GPU program running at frequency f with different linear performance models. The timelines at frequency f and f/2 show loads A, B, C, D and store activity from all concurrent warps in one SM; the embedded table reports each model's T_Memory and its predicted time at f/2: STALL 4/58, MISS 24/38, LEAD 18/44, CRIT 20/42, CRISP 20/54.]

We illustrate the mechanism of our model in Figure 6, where the loads and computations are from all the concurrently running warps in a single SM. The total execution time of the program at clock frequency f is 31 units (5 units stall + 26 units computation). As we scale down the frequency to f/2, all the compute cycles scale by a factor of two. Due to the scaling of overlapped computation under load A (cycles 0-6) at clock frequency f/2, the compute cycle 8 cannot begin before time unit 14. Similarly, the instruction at compute cycle 28 cannot be issued until time unit 50 because the expansion of overlapped computation under load C (cycles 14-23) pushes it further. Meanwhile, as the compute cycle 28 expands due to frequency scaling, it overlaps with the subsequent store stalls.

For each of the prior performance models, we identify the cycles classified as the non-pipelined portion of computation (T_Memory). Leading load (LEAD) computes T_Memory as the sum of A, B, and D's latencies (8+5+5=18). The Miss model (MISS) identifies three contributor misses (A, B, and D) and reports T_Memory to be 24 (8 x 3). Among the two dependent load paths (A+B+D with length 8+5+5=18, A+C with length 8+12=20), CRISP computes the load critical path (LCP) as 20 (for the longest path A+C). The stall cycles counted by STALL total 4. We plug the value of T_Memory into equation 2 and predict the execution time at clock frequency f/2. As expected, the presence of overlapped computation under T_Memory and store related stalls introduces error in the prediction of the prior models. The STALL model overpredicts while the rest of the models underpredict.

Our model observes two load outstanding phases (cycles 0-7 and 12-27), three pure compute phases (cycles 8-11, 28, 30), and one store stall phase (cycle 29). CRISP computes the load critical path (LCP) as 20 (8+12 for loads A-C). The rest of the cycles (11 = 31-20) are assigned to the compute/store path (CSP). The overlapped computation under the LCP is 17 (cycles 0-6, 14-23). The non-overlapped computation in the CSP is 10 (cycles 8-13, 26-28, 30). Note that cycle 29 is the only store stall cycle. CRISP computes the scaled LCP as max(17 x 2, 20) = 34 and the CSP as max(10 x 2, 11) = 20. Finally, the predicted execution time by CRISP at clock frequency f/2 is 54 (34 + 20), which matches the actual execution time (54).
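The full CRISP prediction for this example can be reproduced with a few lines of code. The sketch below is a minimal implementation of Equations 3-5 applied to the counter values quoted above; function and parameter names are ours, not part of the hardware design.

```python
# A minimal sketch of the CRISP prediction (Equations 3-5), using the counter
# values from the Figure 6 example; names such as lcp_comp are illustrative only.
def crisp_predict(t_memory, lcp_comp, csp_comp, csp_stall, scale):
    """scale = f1/f2, the ratio of the current frequency to the target frequency."""
    t_lcp = max(t_memory, scale * lcp_comp)               # Equation 4
    t_csp = max(csp_comp + csp_stall, scale * csp_comp)   # Equation 5
    return t_lcp + t_csp                                   # Equation 3

# Figure 6 at frequency f: adjusted LCP = 20, overlapped computation = 17,
# non-overlapped computation = 10, store stall = 1 cycle.
print(crisp_predict(t_memory=20, lcp_comp=17, csp_comp=10, csp_stall=1, scale=2))
# -> 54, matching the actual runtime at f/2 reported in the text.
```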
3.5 Parameterizing the Model

To utilize our model, the core needs to partition execution cycles along some fairly subtle distinctions, separating two types of computation cycles and two types of memory stalls, while accounting for the fact that in a heavily multithreaded core there are often a plethora of competing causes for each idle cycle. Despite these issues, we are able to measure all of these parameters with some simple rules and a very small amount of hardware. First, we categorize each cycle as either one of two types of stalls (load stalls within the load critical path vs store stalls within the compute/store path) or computation. Next we split computation into overlapped computation (within the load critical path) or non-overlapped.

Identifying Stalls: Memory hierarchy stalls can originate from instruction cache fetch misses, load misses, or store misses. The first two we categorize as load stalls, and the last only manifests in pipeline stalls when we run out of LSQ entries or MSHRs. We initially separate these two types of stalls from computation cycles, then break down the computation cycles in the next section.

Identifying the cause of stall cycles in a heavily multithreaded core is less than straightforward. We assume the following information is available in the pipeline or L1 cache interface: the existence of outstanding load misses, the existence of outstanding store misses, the existence of a fetch stall in the front end pipeline, an MSHR or LSQ full condition, and the existence of a RAW hazard on a memory operation, a RAW hazard on an arithmetic operation causing an instruction to stall, or an arithmetic structural hazard. Most of those are readily available in any pipeline, but the last three might each require an extra bit and some minimal logic added to the GPU scoreboard.

Because our GPU can issue two instructions (two arithmetic or one arithmetic and one load/store), characterizing cycles as idle or busy is already a bit fuzzy. Two-issue cycles are busy (computation), zero-issue cycles are considered idle, while one-issue cycles will be counted as busy if there is an arithmetic RAW or structural hazard; otherwise they will be characterized as idle since there is a lack of sufficient arithmetic instructions to fully utilize the issue bandwidth.

Idle cycles will be characterized as load stalls, store stalls, or computation, according to the following algorithm, the order of which determines the characterization of stalls in the presence of multiple potential causes (a sketch of this classification appears at the end of this subsection):

(A) If there is a control hazard (branch mispredict), a cache/shared memory bank conflict, or no outstanding load or store misses or fetch stall, this cycle is counted as computation. Branch mispredicts are resolved at the speed of the pipeline, as are L1 cache and shared memory bank conflicts (the L1 cache and shared memory run on the same clock as the pipeline).

(B) Else, if there is an outstanding load miss, then if there is a RAW hazard on the load/store unit, this cycle is a load stall. Otherwise, if there is an arithmetic RAW or structural hazard, then this cycle is computation. If neither is the case and the MSHR or LSQ is full, then this is a load stall. Finally, if there is a fetch stall in the front end, then it is a load stall.

(C) Else (no outstanding load miss), if the MSHR or LSQ is full, this indicates a store stall.

(D) All other stalls must be solely due to arithmetic dependencies, and are classified as computation.

Classifying computation: After identifying the total cycles of computation, we divide those into two groups: the pure computation that is part of the compute/store path and the overlapped computation that is overlapped by the load-critical path. The former includes both computation that does not overlap loads and computation that overlaps loads that are not part of the critical path. The reason for dividing computation into these sets is that the compute/store computation is always exposed (execution time is sensitive to its latency) while the latter only becomes exposed when it exceeds the load stalls. We describe the mechanism to compute the computation under the load critical path here, but revisit the need to precisely track the critical path in Section 3.4.

In order to track the computation that is overlapped by the load-critical path, we need to track the critical path using a mechanism similar to the critical path model [34]. This is done using a counter to track the global longest critical path (the current T_Memory up to this point), as well as counters for each outstanding load that record the critical path length when they started, so that we can decide when they finish whether they extend the global longest critical path. However, we instead measure something we call the adjusted load critical path, which adds load stalls as 1-cycle additions to the running global longest critical path. In this way, when the longest critical path is calculated, its length will include both the critical-path load latencies and the stalls that lie between the critical loads: the non-critical load stalls. That adjusted load critical path is the T_LCP(t) from Equation 3, which is also the "load critical path" from Figure 3 and includes all load stalls (either as part of the critical load latencies or as non-critical stalls) as well as all overlapped computation. As a result, we can solve for the overlapped computation by simply subtracting the total load stalls from the adjusted load critical path; this is possible because we also record the total load stalls in a counter.

In the top timeline in Figure 6, the original critical path computation would count 8 (latency of A) + 12 (latency of C) = 20 cycles as the load critical path, but our adjusted load critical path will also pick up the stall at cycle 27, so would calculate 21 as our adjusted load critical path. At the same time, we would count 4 total load stall cycles. Subtracting 21 - 4 gives us 17 cycles of load-critical overlapped computation, which is exactly the number of computation cycles under the two critical path loads (A and C).
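The priority order of these rules is easy to misread in prose, so the sketch below restates rules (A)-(D) as straight-line code. The boolean flags mirror the pipeline signals listed above; their names are ours, and the function only illustrates the classification order, not the 12-gate hardware implementation.

```python
# Classify one idle cycle following rules (A)-(D); the flag names are illustrative only.
def classify_idle_cycle(control_hazard, bank_conflict, load_miss_outstanding,
                        store_miss_outstanding, fetch_stall, lsu_raw_hazard,
                        arith_raw_or_struct_hazard, mshr_or_lsq_full):
    # (A) Hazards resolved at pipeline speed, or no memory activity at all.
    if control_hazard or bank_conflict or not (
            load_miss_outstanding or store_miss_outstanding or fetch_stall):
        return "computation"
    # (B) An outstanding load miss takes priority.
    if load_miss_outstanding:
        if lsu_raw_hazard:
            return "load_stall"
        if arith_raw_or_struct_hazard:
            return "computation"
        if mshr_or_lsq_full or fetch_stall:
            return "load_stall"
    # (C) No load miss outstanding: a full MSHR/LSQ implicates stores.
    elif mshr_or_lsq_full:
        return "store_stall"
    # (D) Anything left is an arithmetic dependency.
    return "computation"
```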
3.6 Hardware Mechanism and Overhead

We now describe the hardware mechanisms that enable CRISP. It maintains three counters: adjusted load critical path (T_Memory), load stall (T^Stall_LCP), and store stall (T^Stall_CSP). The rest of the terms in Equations 3 and 4 can be computed from these three counters and total time by simple subtraction. Like the original critical path (CRIT) model, it also maintains a critical path timestamp register (ts_i) for each outstanding miss (i) in the L1 caches (one per MSHR). Table 1 shows the total hardware overhead for these counters, 99% of which is inherited from CRIT. We discuss the software overhead in Section 4.

Table 1: Hardware storage overhead per SM of CRISP

  CRISP Component         Count   Per Element (Bytes)   Total (Bytes)
  Adjusted LCP            1       4                     4
  Time stamp registers    164     4                     656
  Overlapped stall        1       4                     4
  Non overlapped stall    1       4                     4

  Total storage overhead (Bytes): CRISP 668, CRIT 660, LEAD 18, STALL 4

At the beginning of each kernel execution, the counters are set to zero. Once a load request i (instruction, data, constant, or texture) misses in the level 1 cache, CRISP copies T_Memory into the pending load's timestamp register (ts_i). When the load i's data is serviced from the L2 cache or memory after L_i cycles, T_Memory is updated with the maximum of T_Memory and ts_i + L_i. On the occurrence of a load stall, CRISP increments T_Memory and T^Stall_LCP. For each non-overlapped (store) stall, CRISP increments T^Stall_CSP.

An unoptimized implementation of the logic that detects load and store stalls needs just 12 logic gates and is never on the critical path. Most of the signals used in the cycle classification logic are already available in current pipelines. To detect the different kinds of hazards, we alter the GPU scoreboard, which requires 1 bit per warp in the GPU hardware scheduler and minimal logic.
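To make the counter protocol concrete, the following sketch emulates the three counters and the per-MSHR timestamp registers described above in software, under the assumption that the cycle classifier from Section 3.5 supplies the per-cycle events; the class and field names are ours.

```python
# Software emulation of CRISP's per-SM counters (Section 3.6); names are illustrative.
class CrispCounters:
    def __init__(self):
        self.t_memory = 0         # adjusted load critical path
        self.lcp_load_stall = 0   # T^Stall_LCP
        self.csp_store_stall = 0  # T^Stall_CSP
        self.timestamp = {}       # ts_i, one entry per outstanding L1 miss (per MSHR)

    def on_l1_miss(self, load_id):
        # Snapshot the running adjusted critical path when the miss leaves the L1.
        self.timestamp[load_id] = self.t_memory

    def on_miss_return(self, load_id, latency):
        # Extend the global critical path if this load ends a longer dependent chain.
        self.t_memory = max(self.t_memory, self.timestamp.pop(load_id) + latency)

    def on_load_stall_cycle(self):
        # Load stalls are added to both the adjusted LCP and the load-stall counter.
        self.t_memory += 1
        self.lcp_load_stall += 1

    def on_store_stall_cycle(self):
        self.csp_store_stall += 1

    def overlapped_computation(self):
        # T^Comp_LCP falls out by subtraction, as described in Section 3.5.
        return self.t_memory - self.lcp_load_stall
```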
4. METHODOLOGY

CRISP, like other existing online analytical models, requires specific changes in hardware. As a result, we rely on simulation (specifically, the GPGPUSim [37] simulator) to evaluate the benefit of our novel performance model, as in prior work (CRIT [34], LEAD [33, 36]). The GPGPU configuration used in our experiments is very similar to the NVIDIA GTX 480 Fermi architecture (in Section 5.4, we also evaluate a Tesla C2050). To evaluate the energy savings realized by CRISP, we update the original GPGPUSim with online DVFS capability. All 15 SMs in our configured architecture share a single voltage domain with a nominal clock frequency of 700 MHz. We define six other performance states (P-States) for the SMs, ranging from 100 MHz to 600 MHz with a step size of 100 MHz [18]. Similar to GPUWattch [18], we select the voltage for each P-State from 0.55V to 1.0V. We rely on a fast-responding on-chip global DVFS regulator that can switch between P-States in 100ns [46, 18, 47, 48]. We account for this switching overhead for each P-State transition in our energy savings results.

We extend the GPUWattch [18] power simulator in GPGPUSim to estimate accurate power consumption in the presence of online DVFS. For each power calculation interval of GPUWattch, we collect the dynamic power at the nominal voltage-frequency (700MHz, 1V) setting and scale it for the current voltage-frequency setting (P_Dynamic is proportional to v^2 f). We use GPUSimPow [49] to estimate the static power of individual clock domains at nominal frequency, which is between 23%-63% for the Rodinia benchmark suite. The static power at other P-States is estimated using voltage scaling (P_Static is proportional to v) [50].

We use the Rodinia [38] benchmark suite in our experiments. To initially assess the accuracy of the different models, we run the first important kernel of each Rodinia benchmark for 4.8 million instructions; we exclude nw because the runtime of each of its kernels is smaller than a single reasonable DVFS interval. The 4.8 million instruction execution of each kernel at nominal frequency translates into at least 10us execution time (ranging from 10us to 600us). We find that 10us is the minimum DVFS interval for which the switching overhead (100ns) of the on-chip regulator is negligible (at most 1%). Meanwhile, to evaluate the energy savings, we run the benchmarks for at least 1 billion instructions (except for myc, which we ran long enough to get 4286 full kernel executions). For some benchmarks, we ran for up to 9 billion instructions if that allowed us to run to completion. Eight of the benchmarks finish execution within that allotment. For presentation purposes, we divide the benchmarks into two groups. The benchmarks (bp, hw, hs, and patf) with at least 75% computation cycles are classified as compute bound. The rest of the benchmarks are memory bound.

In Section 5.2, we demonstrate an application of our performance model in an online energy saving framework. In order to make decisions about DVFS settings to optimize EDP or similar metrics, we need both a performance model and an energy model. The energy model we use assumes that the phase of the next interval will be computationally similar to the current interval [34], which is typically a safe assumption for GPUs.

At each decision interval (10us), the DVFS controller uses the power model to find the static and dynamic power of the SMs at the current settings (v_c, f_c). It also collects the static and dynamic power consumed by the rest of the components, including memory, interconnect, and the level 2 cache. For each of the possible voltage-frequency settings (v, f), it predicts (a) the static power of the SMs using voltage scaling, (b) the dynamic power of the SMs using voltage and frequency scaling, and (c) the execution time T(f) using the underlying DVFS performance model (leading load, stall time, critical path, or critical stalled path). The static and dynamic power of the rest of the system are assumed to remain unchanged since we do not apply DVFS to those domains.

The controller computes the total static energy consumption at each frequency f as the product of the total static power P_Static(v, f) and T(f). Meanwhile, the total dynamic energy consumption at frequency f is computed as P_Dynamic(f) x T(f_c). The static and dynamic energy components are added together to estimate the predicted energy E(f) at P-State (v, f). Finally, the controller computes ED2P(v, f) as E(f) x T(f)^2 and makes a transition to the P-State with the predicted minimum ED2P. For the EDP operating setting, the controller orchestrates DVFS in similar fashion, with the minimum EDP as the goal.

We measure the online software overhead at 109 cycles per decision interval by running the actual performance model code on an SM. Note that the same software could also run on a CPU. This software overhead, in addition to the P-State switching overhead, is included in our modeled results. We do not need any additional overhead for reading the performance counters since we implement them in the hardware.

In our results, we average performance counter statistics across all SMs in the power domain to drive our performance model. However, our experiments (not shown here) demonstrate that we could instead sample a single core with no significant loss in effectiveness.

5. RESULTS

This section examines the quality of the CRISP predictor in two ways. First, it evaluates the accuracy of the model in predicting kernel runtimes at different frequencies, compared to the prior state of the art. Second, it uses the predictor in a runtime algorithm to predict optimal DVFS settings and measures the potential gains in energy efficiency. In each case we compare against leading load (LEAD), critical path (CRIT), and stall time (STALL). We also explore the performance of a computationally simpler version of CRISP. Last, we examine the generality of CRISP.

5.1 Execution Time Prediction

We simulate the first instance of a key kernel of each Rodinia [38] benchmark at all seven P-States and save the execution time information. We also collect the performance counter information at the nominal frequency (700 MHz). Based on the measured runtime and performance counter data at the nominal frequency, we drive the various performance models to predict the execution time at each of the other six frequencies (100 MHz-600 MHz).

We show the breakdown of the raw counter data classifications computed by CRISP for each benchmark in Figure 7; some of this data is useful in understanding the sources of prediction error. This breakdown shows several interesting facts: (1) the benchmarks are diverse, (2) the load critical path often covers the majority of execution time, while the overlapped portion of the critical path is often quite large, and (3) three of the kernels have noticeable non-overlapped store stalls (T^Stall_CSP).

[Figure 7: Breakdown of execution time into T^Comp_LCP, T^Stall_LCP, T^Comp_CSP, and T^Stall_CSP for each benchmark, normalized to total execution time.]

Figure 8 shows the prediction error of each of the models for three target frequencies: 100 MHz, 400 MHz, and 600 MHz. We also summarize the absolute percentage error at these target frequencies in the rightmost graphs of Figure 8. The mean absolute percentage error (MAPE) averaged across all six target frequencies is reported for each benchmark in Figure 9.

[Figure 8: Execution time prediction error for different target frequencies with baseline frequency of 700 MHz.]
[Figure 9: Mean absolute percentage error averaged across all target frequencies with baseline frequency of 700 MHz.]

STALL always overpredicts the execution time, conservatively treating all stalls as completely inelastic. The inaccuracy of STALL can be as high as 288%. Both LEAD and CRIT suffer from underprediction (65% or more) and overprediction (109% or more for lm). Both are more prone to underpredicting the memory bound kernels (those on the right side of the graph); this comes from their inability to detect when the kernels transition to compute bound and become more sensitive to frequency. The major exception is ge, where they both fail to account for the store related stalls and heavily over-predict execution time.

We also note from these results that as we try to predict the performance with a high scaling factor (from 700 MHz to 100 MHz), the error of the prior techniques becomes superlinear; that is, the further you depart from the base frequency, the more proportionally inaccurate they become. The leading load paper acknowledges the superlinear error. A big factor in this superlinear error is that, with the greater change in frequency, the load stalls are more completely covered and the kernel transitions to a compute-bound kernel; this likely explains why the CRISP results do not appear to share the superlinear error in general.

The accuracy of CRISP also varies, but its maximum prediction error (30%) is reduced by a factor of 3.6 compared to the maximum error (109%) of the best of the existing models.

To illustrate this, we show the predicted execution time for the lm kernel as we scale the frequency from 700 MHz to 100 MHz (Figure 10). CRIT measures T_Memory as 0.88 of execution time, while LEAD estimates it as 0.80. We measure the load critical path to be 0.93, of which 34% are overlapped computations; thus we expect the load critical path to be completely covered at about 233 MHz. We see that above this frequency CRIT is fairly accurate, but it diverges widely after this point. Beyond this point, CRISP treats the kernel as completely compute bound, while at 100 MHz CRIT is still assuming 52% of the execution time is spent waiting for memory. In contrast, CRISP closely follows the actual runtimes (ORACLE) both at memory-bound frequencies and at compute-bound frequencies.

[Figure 10: Execution time prediction for lm. Predicted execution time (normalized) vs. clock cycle time from 1.43ns (700MHz) to 10.00ns (100MHz) for STALL, LEAD, CRIT, CRISP, and ORACLE.]

We observe a similar phenomenon in the kernels hw, hs, and bfs. In each of these cases, CRISP successfully captures the piecewise linear behavior of performance under frequency scaling. As summarized in Figure 9, the average prediction error of CRISP across all the benchmarks is 4% vs. 11% with LEAD, 11% with CRIT, and 35% with STALL.

5.2 Energy Savings

An online performance model, and the underlying architecture to compute it, is a tool. An expected application of such a tool would be to evaluate multiple DVFS settings to optimize for particular performance and energy goals. In this section, we will evaluate the utility of CRISP in minimizing two widely used metrics: energy delay product (EDP) and energy delay squared product (ED2P).

We compute the savings relative to the default mode which always runs at the maximum SM clock frequency (700 MHz, 1V). We run two similar experiments with our simple DVFS controller (Section 4) targeting (a) the minimum EDP and (b) the minimum ED2P operating point. During both experiments, the controller uses the program behavior during the current interval to predict the next interval. We do the same for each of the other studied predictors; all use the same power predictor.

Using CRISP for this purpose does raise one concern. All of the models are asymmetric (i.e., will predict differently for a lower frequency A->B than for a higher frequency B->A) due to different numbers of stall cycles, different loads on the critical path, etc. However, the asymmetry is perhaps more clear with CRISP. While predicting for lower frequencies we quite accurately predict when overlapped computation will be converted into pure computation, but it is more difficult to predict which pure computation cycles will become overlapped computation at a higher frequency. We thus modify our predictor slightly when predicting for a higher frequency. We calculate the new execution time as T_Memory + T^Stall_CSP + T^Comp_CSP x (f_c/f). This ignores the possible expansion of store stalls, and does not account for conversion of pure computation to overlapped. While this model will be less accurate, in a running system with DVFS (as in the next section), if it mistakenly forces a change to a higher frequency, the next interval will correct it with the accurate model. While this could cause ping-ponging, which we could solve with some hysteresis, we found this unnecessary.

Optimizing for EDP: Figure 11 shows the EDP savings achieved by the evaluated performance models when the controller is seeking the minimum EDP point for each kernel. We included all the runtime overhead, both software model computation and voltage switching overhead, in this result. For the 14 memory bound benchmarks, CRISP reduces EDP by 13.9% compared to 8.9% with CRIT, 8.7% with LEAD, and 0% with STALL. CRISP achieves high savings in both lm (15.3%) and ge (21.8%). The other three models fail to realize any energy savings opportunity in lm and achieve very little savings (5% with CRIT and 1% with LEAD) in ge. As we saw previously in Figure 10, CRISP accurately models lm at all frequency points. Thus, it typically chooses to run this kernel at 300 MHz, resulting in 1.25% performance degradation but 16.4% reduced energy. CRIT and LEAD both over-estimate the performance cost at this frequency and instead choose to run at 700 MHz.

[Figure 11: EDP reduction while optimized for EDP.]

For the compute bound benchmarks, CRISP marginally reduces EDP by 0.5%, while LEAD and CRIT increase EDP by 2.7% and 5.7% respectively. Due to the underprediction of execution time (e.g., patf), CRIT aggressively reduces the clock frequency and experiences 13.3% performance loss on average (33% in patf) for the compute bound benchmarks.

In summary, the average EDP savings for the Rodinia benchmark suite with CRISP is 10.7% vs. 5.7% with CRIT, 6.2% with LEAD, and 0% with STALL. The selected EDP operating points of CRISP translate into 12.9% energy savings and a 15.5% reduction in average power with an average performance loss of 3.4%. CRIT saves 10.6% energy and LEAD saves 8.1%. The performance loss is 6.3% with CRIT and 2.9% with LEAD.

Optimizing for ED2P: When attempting to optimize for ED2P, CRISP again outperforms the other models by reducing ED2P by 9.0% on average. The corresponding average savings realized by CRIT, LEAD, and STALL are only 3.5%, 4.9%, and 0%. Figure 12 shows that the savings with CRISP are much higher (11.7%) for memory bound benchmarks compared to CRIT (5.5%), LEAD (6.4%), and STALL (0%).

[Figure 12: ED2P reduction while optimized for ED2P.]

CRISP achieves its highest ED2P savings of 42% for lcy, by running the primary kernel at 500 MHz most of the time. CRIT and LEAD underpredict the energy saving opportunity and keep the frequency at 700 MHz for 50% and 41% of the duration of this kernel, respectively. As we analyze the execution trace, we find that 67% of the cycles inside the load critical path for that kernel are computation, and are thus not handled accurately by the other models.
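As a summary of how these EDP and ED2P numbers are produced, the sketch below outlines the per-interval P-State selection described in Section 4. The function, its parameters, and the counters dictionary are our own constructions; only the scaling rules (dynamic power proportional to v^2 f, static power proportional to v) and Equations 3-5 come from the text, and the unscaled power of the memory, interconnect, and L2 domains is omitted for brevity.

```python
# Sketch of one DVFS decision interval (Section 4); all names are illustrative.
def pick_p_state(p_states, nominal, counters, t_current, p_static0, p_dyn0, metric="EDP"):
    """nominal = (v0, f0); counters holds the CRISP values measured this interval;
    t_current is the measured interval time at the current setting."""
    v0, f0 = nominal
    best, best_score = nominal, float("inf")
    for (v, f) in p_states:
        s = f0 / f                                         # frequency scaling factor
        t = (max(counters["t_memory"], s * counters["lcp_comp"]) +          # Eq. 4
             max(counters["csp_comp"] + counters["csp_stall"],
                 s * counters["csp_comp"]))                                  # Eq. 5
        p_static = p_static0 * (v / v0)                    # static power scales with v
        p_dynamic = p_dyn0 * (v / v0) ** 2 * (f / f0)      # dynamic power scales with v^2*f
        energy = p_static * t + p_dynamic * t_current      # energy estimate from Section 4
        score = energy * t if metric == "EDP" else energy * t ** 2
        if score < best_score:
            best, best_score = (v, f), score
    return best
```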
5.3 Simplified CRISP Model

CRISP can be simplified to estimate the load critical path (LCP) length as the total number of load-outstanding cycles, without regard for which stall cycles (and, in particular, which computation) are under the critical path. We simply treat all computation cycles that are overlapped by a load as if they were on the critical path. This approximated version of CRISP (CRISP-L) requires three counters, and thus removes 98% of the storage overhead of CRISP (all of the timestamp registers).

[Figure 13: Accuracy, benefit, and drawback of CRISP-L, showing prediction error (MAPE), EDP savings, and performance loss at the minimum EDP point for STALL, LEAD, CRIT, CRISP-L, and CRISP.]

CRISP-L demonstrates prediction accuracy (4.66% on average in Figure 13) competitive with that of CRISP (4.48% on average) for the experiment in Section 5.1. However, when used in an EDP controller, CRISP-L suffers relative to CRISP, but still realizes far better EDP savings (8.38%) than CRIT (5.65%) and LEAD (6.16%). The inclusion of more compute cycles inside the LCP tends to create underpredictions of execution time, resulting in more aggressive frequency scaling. CRISP-L experiences more performance loss (6.28%) than CRISP (3.44%).
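The bookkeeping CRISP-L needs could look roughly like the sketch below. The text states only that three counters suffice and that every compute cycle overlapped by a load is charged to the LCP; the specific counter choice here (total cycles, load-outstanding cycles, and stall cycles while a load is outstanding) is our assumption about one way to realize that:

class CrispLCounters:
    """Assumed three-counter bookkeeping for the CRISP-L approximation."""

    def __init__(self):
        self.total_cycles = 0       # all elapsed core cycles
        self.load_out_cycles = 0    # cycles with at least one load outstanding (approximated LCP)
        self.load_stall_cycles = 0  # stalled cycles while a load is outstanding

    def tick(self, load_outstanding, stalled):
        """Update the counters once per (sampled) core cycle."""
        self.total_cycles += 1
        if load_outstanding:
            self.load_out_cycles += 1
            if stalled:
                self.load_stall_cycles += 1

    def decompose(self):
        """Split execution time the way CRISP-L approximates it."""
        lcp_comp = self.load_out_cycles - self.load_stall_cycles  # overlapped compute, all charged to the LCP
        pure_comp = self.total_cycles - self.load_out_cycles      # compute cycles with no load outstanding
        return pure_comp, self.load_stall_cycles, lcp_comp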

5.4 Generality of CRISP

All of the results so far were for an NVIDIA GTX 480 GPU. To demonstrate that the CRISP model is more general, we replicated the same experiments for a modeled Tesla C2050 GPU. The Tesla differs in the available clock frequency settings, the number of cores, the DRAM queue size, and the topology of the interconnect. Results for predictor accuracy and effectiveness were consistent with the prior results. In this section, we show in particular the results for the EDP-optimized controller.

[Figure 14: Effectiveness of CRISP on NVIDIA Tesla C2050 at the minimum EDP point, showing prediction error (MAPE), EDP savings, and performance loss for STALL, LEAD, CRIT, and CRISP.]

Figure 14 shows that the mean absolute percentage prediction error of CRISP (6.2%) is much lower than that of LEAD (16.9%) and CRIT (15.4%). Meanwhile, CRISP realizes 10.3% EDP savings, which is 5% more than the best of the prior models.

We also experimented with another DVFS controller goal – this one seeks to minimize energy while keeping the performance overhead under 1%. Figure 15 shows the energy, EDP, ED2P, and average power savings, and the performance loss, for the GTX 480 and the Tesla C2050. On the Tesla C2050, CRISP achieves 9.3% energy savings, which is 4% more than CRIT.

[Figure 15: Effectiveness of CRISP (vs. CRIT) while optimizing for energy: energy, EDP, ED2P, and average power savings, and performance loss, on the GTX 480 and Tesla C2050.]

6. CONCLUSION

We introduce the CRISP (CRItical Stalled Path) performance model to enable effective DVFS control in GPGPUs. Existing DVFS models targeting CPUs do not capture key GPGPU execution characteristics, causing them to miss energy savings opportunities or to cause unintended performance loss. CRISP effectively handles the impact of highly overlapped memory and computation in a system with high thread level parallelism, and also captures the effect of frequent memory store based stalls.

7. ACKNOWLEDGEMENTS

We thank the anonymous reviewers for their constructive comments and suggestions that helped us improve the paper. This work was supported by NSF grants CCF-1219059 and CCF-1302682.