The CRISP Performance Model for Dynamic Voltage and Frequency Scaling in a GPGPU

Rajib Nath
University of California, San Diego
9500 Gilman Drive, La Jolla, CA

Dean Tullsen
University of California, San Diego
9500 Gilman Drive, La Jolla, CA

ABSTRACT

This paper presents CRISP, the first runtime analytical model of performance in the face of changing frequency in a GPGPU. It shows that prior models not targeted at a GPGPU fail to account for important characteristics of GPGPU execution, including the high degree of overlap between memory access and computation and the frequency of store-related stalls. CRISP provides significantly greater accuracy than prior runtime performance models, being within 4% on average when scaling frequency by up to 7X. Using CRISP to drive a runtime energy efficiency controller yields a 10.7% improvement in energy-delay product, vs 6.2% attainable via the best prior performance model.

Categories and Subject Descriptors

C.1.4 [Processor Architectures]: Parallel Architectures; C.4 [Performance of Systems]: Modeling techniques

Keywords

Critical Path, GPGPU, DVFS

1. INTRODUCTION

This paper describes the CRISP (CRItical Stalled Path) performance predictor for GPGPUs under varying core frequency. Existing analytical models that account for frequency change only target CPU performance, and do not effectively capture the execution characteristics of a GPU.

Dynamic voltage and frequency scaling (DVFS) [1] has shown the potential for significant power and energy savings in many system components including processor cores [2, 3], the memory system [4, 5, 6], the last level cache [7], and the interconnect [8, 9, 10]. DVFS scales voltage and frequency for energy and power savings, but can also address other problems such as temperature [11], reliability [12], and variability [13]. However, in all cases it trades off performance for other gains, so properly setting DVFS to maximize particular goals can only be done with the assistance of an accurate estimate of the performance of the system at alternate frequency settings.

While extensive research has been done on DVFS performance modeling for CPU cores, there is little guidance in the literature for GPU and GPGPU settings, despite the fact that modern GPUs have extensive support for DVFS [14, 15, 16, 17]. However, the potential for DVFS on GPUs is high for at least two reasons. First, the maximum power consumption in a GPGPU is often higher than CPUs, e.g., 250 Watts for the NVIDIA GTX480. Second, while the range of DVFS settings in modern CPUs is shrinking, that is not yet the case for GPUs, which have ranges such as 0.5V-1.0V (NVIDIA Fermi [18]).

This work presents a performance model that closely tracks GPGPU performance on a wide variety of workloads, and significantly outperforms existing models which are tuned to the CPU. We show that the GPU presents key differences that are not handled by those models.

Most DVFS performance models are empirical and use statistical [19, 20, 21, 22, 23] or machine learning methods [24, 25], or assume a simple linear relationship [26, 27, 28, 29, 30, 31, 32] between performance and frequency. Abe et al. [14] target GPUs, but with a regression-based offline statistical model that is neither targeted for, nor conducive to, runtime analysis.

Recent research presents new performance counter architectures [33, 34, 35] to model the impact of frequency scaling on CPU workloads. These analytical models (e.g., leading load [33] and critical path [34]) divide a CPU program into two disjoint segments representing the non-pipelined (T_Memory) and pipelined (T_Compute) portions of the computation. They make two assumptions: (a) the memory portion of the computation never scales with frequency, and (b) cores never stall for store operations. Though these assumptions are reasonably valid in CPUs, they start to fail as we move to massively parallel architectures like GPGPUs. No existing runtime analytical models for GPGPUs handle the effect of voltage and frequency changes.

A program running on a GPGPU, typically referred to as a kernel, will have a high degree of thread level parallelism. The single instruction multiple thread (SIMT) parallelism in GPGPUs allows a significant portion of the computation to be overlapped with memory latency, which can make the memory portion of the computation elastic to core clock frequency and thus breaks the first assumption of the prior models. Though memory/computation overlap has been used for program optimization, modeling frequency scaling in the presence of abundant parallelism and high memory/computation overlap is a new and complex problem. Moreover, due to the wide single instruction multiple data (SIMD) vector units in GPGPU streaming multiprocessors (SMs) and the homogeneity of computation in SIMT scheduling, GPGPU SMs often stall due to stores.

The CRISP predictor accounts for these differences, plus others. It provides accuracy within 4% when scaling from 700 MHz to 400 MHz, and 8% when scaling all the way to 100 MHz. It reduces the maximum error by a factor of 3.6 compared to prior models. Additionally, when used to direct GPU DVFS settings, it enables nearly double the gains of prior models (10.7% EDP gain vs 5.7% with critical path and 6.2% with leading load). We also show CRISP effectively used to reduce ED2P, or to reduce energy while maintaining a performance constraint.
2. MOTIVATION AND RELATED WORK

Our model of GPGPU performance in the presence of DVFS builds on prior work on modeling CPU DVFS. Since there are no analytical DVFS models for GPGPUs, we will describe some of the prior work in the CPU domain and the shortcomings of those models when applied to GPUs. We also describe recent GPU-specific performance models at the end of this section.

2.1 CPU-based Performance Models for DVFS

Effective use of DVFS relies on some kind of prediction of the performance and power of the system at different voltage and frequency settings. An inaccurate model results in the selection of non-optimal settings and lost performance and/or energy.

Existing DVFS performance models are primarily aimed at the CPU. They each exploit some kind of linear model, based on the fact that CPU computation scales with frequency while memory latency does not. These models fall into four classes: (a) proportional, (b) sampling, (c) empirical, and (d) analytical. The proportional scaling model assumes a linear relation between the performance and the core clock frequency. The sampling model better accounts for memory effects by identifying a scaling factor (i.e., slope) of each application from two different voltage-frequency operating points. The empirical model approximates that slope by counting aggregate performance counter events, such as the number of memory accesses. This model cannot distinguish between applications with low or high memory level parallelism (MLP) [33].

An analytical model views a program as an alternating sequence of compute and memory phases. In a compute phase, the core executes useful instructions without generating any memory requests (e.g., instruction fetches, data loads) that reach main memory. A memory request which misses the last level cache might trigger a memory phase, but only when a resource is exhausted and the pipeline stalls. Once the instruction or data access returns, the core typically begins a new compute phase. In all the current analytical models, the execution time (T) of a program at a voltage-frequency setting with cycle time t is expressed as

T(t) = T_Compute(t) + T_Memory    (1)

They strictly assume that the pipeline portion of the execution time (T_Compute) scales with the frequency, while the non-pipeline portion (T_Memory) does not. The execution time at an unexplored voltage-frequency setting (v2, f2) can be estimated from the measurement in the current setting (v1, f1) by

T(v2, f2) = T_Compute(v1, f1) x (f1/f2) + T_Memory    (2)

The accuracy of these models relies heavily on accurate classification of cycles into compute or memory phase cycles, and this is the primary way in which these models differ. We describe them next.

Stall Time: Stall time [35] estimates the non-pipeline phase (T_Memory) as the number of cycles the core is unable to retire any instructions due to outstanding last level cache misses. It ignores computation performed during the miss before the stall. Since the scaling of this portion of computation can be hidden under memory latency, stall time overpredicts execution time.

Miss Model: The Miss model [35] includes ROB fill time inside the memory phase. It identifies all the stand alone misses, but only the first miss in a group of overlapped misses, as contributors to T_Memory. In particular, it ignores misses that occur within the memory latency of an initial miss. The resulting miss count is then multiplied by a fixed memory latency to compute T_Memory. This approach loses accuracy due to its fixed latency assumption and also ignores stall cycles due to instruction cache misses.

Leading Load: Leading load [33, 36] recognizes both front-end stalls due to instruction cache misses and back-end stalls due to load misses. Their model counts the cycles when an instruction cache miss occurs in the last level cache. It adopts a similar approach to the miss model to address memory level parallelism for data loads. However, to account for variable memory latency, it counts the actual number of cycles a contributor load is outstanding.

Critical Path: Critical path [34] computes T_Memory as the longest serial chain of memory requests that stalls the processor. The proposed counter architecture maintains a global critical path counter (CRIT) and a critical path time stamp register (tau_i) for each outstanding load i. When a miss occurs in the last level cache (instruction or data), the snapshot of the global critical path counter is copied into the load's time stamp register. Once the load i returns from memory, the global critical path counter is updated with the maximum of CRIT and tau_i + L_i, where L_i is the memory latency for load i.

[Figure 1: T_Memory computation for a CPU program running at frequency f with different linear performance models. The timelines at frequency f and f/2 show one stand-alone load miss (A) and two overlapped load misses (B, C) over a compute/memory breakdown; the embedded table reports each model's T_Memory and its predicted time at f/2: STALL 18/48, MISS 16/50, LEAD 15/51, CRIT 20/46.]

Example: Figure 1 provides an illustrating example of a CPU program that has one stand alone load miss (A at cycle 0) and two overlapped load misses (B at cycle 12 and C at cycle 14). At clock frequency f, the core stalls between cycles 1-7 and 15-25 for the loads to complete. The total execution time of the program at clock frequency f is 33 units (18 units stalls + 15 units computation). As we scale down the frequency to f/2, the compute cycles (0, 8-14, 26-32) scale by a factor of 2. However, the expansion of the compute cycles at 0 and 14 is hidden by loads A and C respectively. Thus the execution time at frequency f/2 becomes 46 units.

For each performance model, we identify the cycles classified as the non-pipelined (does not scale with the frequency of the pipeline) portion of computation. Leading load (LEAD) computes T_Memory as the sum of A and B's latencies (7+8=15). The Miss model (MISS) identifies two contributor misses (A and B) and reports T_Memory to be 16 (8 x 2). Among the two dependent load paths (A+B with length 8+7=15, A+C with length 8+12=20), critical path (CRIT) selects A+C as the longest chain of dependent memory requests and computes the critical path as 20. Stall time (STALL) counts the 18 cycles of load related stalls.

We plug the value of T_Memory into equation 2 and predict the execution time at clock frequency f/2. The predicted execution time for the different models is shown in Figure 1. Note that the models also differ in their counting of T_Compute. Critical path (CRIT) outperforms the other models in this example. Although these models show reasonable accuracy on CPUs, optimization techniques like software and hardware prefetching may introduce significant overlap between computation and memory load latency, and thus invalidate the non-elasticity assumption of the memory component. In that case, the CRISP model should also provide significant improvement for CPU performance prediction.
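The arithmetic behind these predictions can be made concrete with a short script. The sketch below simply applies Equation 2 to the Figure 1 example, taking each model's T_Memory from the text and treating the remaining measured cycles as that model's T_Compute; the variable names are ours, not part of any prior model's implementation.

```python
# Equation 2 applied to the Figure 1 example: T(f/2) = T_Compute * (f1/f2) + T_Memory.
# T_Memory values per model come from the text; T_Compute is the remaining cycles.
TOTAL_AT_F = 33          # 18 stall units + 15 compute units at frequency f
SCALE = 2                # predicting for f/2, so f1/f2 = 2

t_memory = {"STALL": 18, "MISS": 16, "LEAD": 15, "CRIT": 20}

for model, mem in t_memory.items():
    compute = TOTAL_AT_F - mem           # cycles the model treats as frequency-elastic
    predicted = compute * SCALE + mem    # Equation 2
    print(f"{model}: T_Memory={mem}, predicted T(f/2)={predicted}")

# Prints 48, 50, 51 and 46 respectively; the actual runtime at f/2 is 46 units,
# so CRIT is exact for this example while the other models over-predict.
```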
2.2 DVFS in GPGPUs

To explore the applicability of DVFS in GPGPUs, we simulate an NVIDIA Fermi architecture using GPGPUSim [37] and run nine kernels from the Rodinia [38] benchmark suite at different core clock frequencies, from 700MHz (baseline) to 100MHz (minimum). The performance curve in Figure 2 shows the increase in runtime as we scale down the core frequency, i.e., increase the clock cycle time (x-axis).

[Figure 2: Performance impact of core clock frequency scaling in a subset of GPGPU kernels from the Rodinia benchmark suite. The x-axis spans clock cycle times from 1.43ns (700MHz) to 10.00ns (100MHz); the y-axis shows the increase in execution time (%).]

We make two major observations from this graph. First, like CPUs, there is significant opportunity for DVFS in GPUs. We see this because two of the kernels (cfd and km) are memory bound and insensitive to frequency changes (they have a flat slope in this graph), while three others (ge, lm, and hw) are only partially sensitive. For the memory bound kernels, it is possible to reduce clock frequency and thus save energy without any performance overhead.

Our second observation is that the linear assumption of all prior models does not strictly hold, as the partially sensitive kernels (e.g., lm and hw) seem to follow two different slopes, depending on the frequency range. At high frequency, they follow a flat slope, as if they were memory bound (e.g., between 700MHz and 300MHz in lm), but at lower frequencies, they scale like compute bound applications (e.g., below 300MHz in lm).

So while several of the principles that govern DVFS in CPUs also apply to GPUs, we identify several differences which make the existing models inadequate for GPUs. All of these differences stem from the high thread level parallelism in GPU cores, resulting from the SIMT parallelism enabled through warp scheduling. The first difference is the much higher incidence of overlapped computation and memory access. The second is the homogeneity of computation in SIMT scheduling, resulting in the easy saturation of resources like the store queue. The third is the difficulty of classifying stalls (e.g., an idle cycle may occur when 5 warps are stalled for memory and three are stalled for floating point RAW hazards). We will discuss each of these differences, as well as a fourth that is more of an architectural detail and easily handled.

Memory/Computation Overlap: Existing models treat computation that happens in series with memory access the same as computation that happens in parallel. When the latter is infrequent, those models are fairly accurate, but in a GPGPU with high thread level parallelism, it is common to find significant computation that overlaps with memory.

This overlapped computation has more complex behavior than the non-overlapped. In the top timeline in Figure 1 the computation at cycle 0 is overlapped with memory, so even though it is elongated it still does not impact execution time; thus, that computation looks more like memory than computation. However, if we extend the clock enough (i.e., by more than a factor of 8 in this example), that computation completely covers the memory latency and then begins to impact total execution time, now looking like computation again. This phenomenon exactly explains the two-slope behavior exhibited by hw and lm in Figure 2.

Store Stalls: A core (CPU or GPU) would stall for a store only when the store queue fills, forcing it to stall waiting for one of the stores to complete. While this is a rare occurrence in CPUs, in a GPU with SIMT parallelism it is common for all of the threads to execute stores at once, easily flooding the store queue and resulting in memory stalls that do not scale with frequency.

Complex Stall Classification: Identifying the cause of a stall cycle is difficult in the presence of high thread level parallelism. For example, if several threads are stalled waiting for memory, but a few others are stalled waiting for floating point operations to complete, is this a computation or memory phase? Prior works (which start counting memory cycles only after the pipeline stalls) would categorize this as computation because the pipeline is not yet fully stalled for memory, and the ongoing floating point operations will scale with frequency.

Consider what happens when, as described earlier, the overlapped computation is scaled to the point where it completely covers the memory latency and now starts to interfere with other computation. The originally idle cycles (e.g., between dependent floating point ops) will not contribute to extending execution time, but rather will now be interleaved with the other computation. Only the cycles where the pipeline was actually issuing instructions will impact execution time. Thus, unlike other models, we generally treat mixed-cause idle cycles (both memory and CPU RAW hazards) as memory cycles rather than computation cycles.

L1 Cache Miss: A substantial portion of memory latency in a typical GPGPU is due to the interconnect that bridges the streaming cores to the L2 cache and GPU global memory. The leading load and critical path models start counting cycles as the load request misses the last level cache, which works for CPUs where all on-chip caches are clocked together; however, in our GPU the interconnect and last level cache are in an independent clock domain. Since this is not a shortcoming in the models, just a change in the assumed architecture, we adapt these models (as well as our own) to begin counting memory stalls as soon as they miss the L1 cache. This results in significantly improved accuracy for those models on GPUs. Thus, all results shown in this paper for leading load and critical path incorporate that improvement.

2.3 Models for GPGPUs

A handful of power and energy models have recently been proposed for GPGPUs [18]. However, very few of these models involve performance predictions. Dao et al. [39] propose linear and machine learning based models to estimate the runtime of kernels based on workload distribution. The integrated GPU power and performance (IPP) [40] model predicts the optimum number of cores for a GPGPU kernel to reduce power. However, they do not address the effect of clock frequency on GPU performance. Moreover, they ignore cache misses [41], which is critical for a DVFS model. Song et al. [42] contribute a similar model for a distributed system with GPGPU accelerators. Chen et al. [43] track NBTI-induced threshold voltage shift in GPGPU cores and determine the optimum number of cores by extending IPP. Equalizer [44] maintains performance counters to classify kernels as compute-intensive, memory-intensive, or cache sensitive, using those classifications to change thread counts or P-states. However, because they have no actual model to predict performance, it still amounts to a sampling-based search for the right DVFS state.

None of these models address the effect of core clock frequency on kernel performance. CRISP is unique in that it is the first runtime analytical model, either for CPUs or GPGPUs, to model the non-linear effect of frequency on performance; this effect can be seen in either system, but is just more prevalent on GPGPUs. The only GPU analytical model that accounts for frequency is an empirical model [45] that builds a unique non-linear performance model for each workload; this is intended to be used offline and is not practical for an online implementation. Abe et al. [14, 15] also present an offline solution. They use a multiple linear regression based approach with higher average prediction error (40%). Their maximum error is more than 100%.

3. GPGPU DVFS PERFORMANCE MODEL

This section describes the critical stalled path (CRISP) performance predictor, a novel analytical model for predicting the performance impact of DVFS in modern many-core architectures like GPGPUs.

3.1 Critical Stalled Path

[Figure 3: Critical Stalled Path - Abstract View. Execution is divided into a load critical path (LCP), containing overlapped computation and load stalls, and a compute/store path (CSP), containing pure computation and store stalls.]

Our model interprets computation in a GPGPU as a collection of three distinct phases: load outstanding, pure compute, and store stall (Figure 3). The core remains in a pure compute phase in the absence of an outstanding load or any store related stall. Once a fetch or load miss occurs in one of the level 1 caches (constant, data, instruction, texture), the core enters a load outstanding phase and stays there as long as a miss is outstanding. During this phase, the core may schedule instructions or stall due to control, data, or structural hazards. The store stall phase represents the idle cycles due to store operations, in the explicit case where the pipeline is stalled due to store queue saturation.

We split the total computation time into two disjoint segments: a load critical path (LCP) and a compute/store path (CSP).

T(t) = T_LCP(t) + T_CSP(t)    (3)

This division allows us to separately analyze (1) loads and the computation that competes with loads to become performance-critical, versus (2) computation that never overlaps loads and is exposed regardless of the behavior of the loads. Store stalls fall into the latter category because any cycle that stalls in the presence of both loads and stores is counted as a load stall. The load critical path is the length of the longest sequence of dependent memory loads, possibly overlapped with computation from parallel warps. Thus, it does not necessarily include all loads, enabling it to account for memory level parallelism by not counting loads that occur in parallel, similar to [34, 33]. Meanwhile, we form the compute/store path as a summation of non-overlapped compute phases, store stall phases, and computation overlapped only by loads not deemed part of the critical path. Since the two segments in Equation 3 scale differently with core clock frequency, we develop separate models for each.

3.2 Load Critical Path Portion

Streaming multiprocessors (SMs) have high thread level parallelism. The maximum number of concurrently running warps per SM varies from 24 to 64 depending upon the compute capability of the device (e.g., 48 in the NVIDIA Fermi architecture). This enables the SM to hide significant amounts of memory latency, resulting in heavy overlap between load latency and useful computation. However, it often does not hide all of it, due to lack of parallelism (either caused by low inherent parallelism, or restricted parallelism due to resource constraints) or poor locality.

We represent the load critical path as a combination of memory related overlapped stalls (T^Stall_LCP) and overlapped computation (T^Comp_LCP). Prior CPU-based models' assumption of T_Memory's inelasticity creates a larger inaccuracy in GPGPUs than it does for CPUs because of the magnitude of T^Comp_LCP, which scales linearly with frequency. The expansion of overlapped computation stays hidden under T_Memory as long as the frequency scaling factor is less than the ratio between T_Memory and the overlapped computation. As the frequency is scaled further, the load critical path can be dominated by the length of the scaled overlapped computation. Thus, we define the LCP portion of the model as

T_LCP(t) = max(T_Memory, (f1/f2) x T^Comp_LCP)    (4)

We describe details on parameterizing this equation from hardware events in Section 3.5 and Section 3.6. Figure 4 shows the normalized execution time of the lm kernel for 7 different frequencies. As expected, we see the load stalls decreasing with lower frequency, but without increasing total execution time for small changes in frequency. As the frequency is decreased further, the overlapped computation increases, with much of it turning into non-overlapped computation (as some of the original overlapped computation now extends past the duration of the load), resulting in an increase in total execution time. Prior models assume that overlapped computation always remains overlapped, and miss this effect.

[Figure 4: Runtime of the lm kernel at different frequencies (700 MHz down to 100 MHz), broken down into T^Comp_LCP, T^Stall_LCP, and T_CSP, normalized to the 700 MHz runtime.]

3.3 Compute/Store Path Portion

The compute/store critical path includes computation (T^Comp_CSP) that is not overlapped with memory operations and simply scales with frequency, plus store stalls (T^Stall_CSP). Streaming multiprocessors (SMs) have a wide SIMD unit (e.g., 32 in Fermi). A SIMD store instruction may result in 32 individual scalar memory store operations if uncoalesced (e.g., A[tid * 64] = 0). As a result, the load-store queue (LSQ) may overflow very quickly during a store dominant phase in the running kernel. Eventually, the level 1 cache runs out of miss status holding register (MSHR) entries and the instruction scheduler stalls due to the scarcity of free entries in the LSQ or MSHR. We observe this phenomenon in several of our application kernels, as well as in microbenchmarks we used to fine-tune our models. Again, this is a rare case for CPUs, but easily generated on a SIMD core.

Although the store stalls are memory delays, they do not remain static as frequency changes. As frequency slows, the rate (as seen by the memory system) of store generation slows, making it easier for the LSQ and the memory hierarchy to keep up. A simple model (meaning one that requires minimal hardware to track) that works extremely well empirically assumes that the frequency of stores decreases in response to the slowdown of non-overlapped computation (recall, the overlapped computation may also generate stores, but those stores do not contribute to the stall being considered here). Thus we express the compute/store path by the following equation in the presence of DVFS.

T_CSP(t) = max(T^Comp_CSP + T^Stall_CSP, (f1/f2) x T^Comp_CSP)    (5)

We see this phenomenon, but only slightly, in Figure 4, and much more clearly in Figure 5, which shows one particular store-bound kernel in cfd. As frequency decreases, computation expands but does not actually increase total execution time; it only decreases the contribution of store stalls.

[Figure 5: Runtime of the cfd kernel at different frequencies (700 MHz down to 100 MHz), broken down into T^Comp_LCP, T^Stall_LCP, T^Comp_CSP, and T^Stall_CSP, normalized to the 700 MHz runtime.]

We assume that the stretched overlapped computation and the pure computation will serialize at a lower frequency. This is a simplification, and a portion of CRISP's prediction error stems from this assumption. This was a conscious choice to trade off accuracy for complexity, as tracking those dependencies between instructions is another level of complexity and tracking.

In summary, then, CRISP divides all cycles into four distinct categories: (1) pure computation, which is completely elastic with frequency; (2) load stall, which as part of the load critical path is inelastic with frequency; (3) overlapped computation, which is elastic but potentially hidden by the load critical path; and (4) store stall, which tends to disappear as the pure computation stretches.
3.4 Example

[Figure 6: T_Memory computation for a GPU program running at frequency f with different linear performance models. The timelines at frequency f and f/2 show loads A, B, C, D and store activity from all concurrent warps in one SM; the embedded table reports each model's T_Memory and its predicted time at f/2: STALL 4/58, MISS 24/38, LEAD 18/44, CRIT 20/42, CRISP 20/54.]

We illustrate the mechanism of our model in Figure 6, where the loads and computations are from all the concurrently running warps in a single SM. The total execution time of the program at clock frequency f is 31 units (5 units stall + 26 units computation). As we scale down the frequency to f/2, all the compute cycles scale by a factor of two. Due to the scaling of overlapped computation under load A (cycles 0-6) at clock frequency f/2, the compute cycle 8 cannot begin before time unit 14. Similarly, the instruction at compute cycle 28 cannot be issued until time unit 50 because the expansion of overlapped computation under load C (cycles 14-23) pushes it further. Meanwhile, as the compute cycle 28 expands due to frequency scaling, it overlaps with the subsequent store stalls.

For each of the prior performance models, we identify the cycles classified as the non-pipelined portion of computation (T_Memory). Leading load (LEAD) computes T_Memory as the sum of A, B, and D's latencies (8+5+5=18). The Miss model (MISS) identifies three contributor misses (A, B, and D) and reports T_Memory to be 24 (8 x 3). Among the two dependent load paths (A+B+D with length 8+5+5=18, A+C with length 8+12=20), CRISP computes the load critical path (LCP) as 20 (for the longest path A+C). The stall cycles counted by STALL total 4. We plug the value of T_Memory into equation 2 and predict the execution time at clock frequency f/2. As expected, the presence of overlapped computation under T_Memory and store related stalls introduces error in the prediction of the prior models. The STALL model overpredicts while the rest of the models underpredict.

Our model observes two load outstanding phases (cycles 0-7 and 12-27), three pure compute phases (cycles 8-11, 28, 30), and one store stall phase (cycle 29). CRISP computes the load critical path (LCP) as 20 (8+12 for loads A-C). The rest of the cycles (11 = 31-20) are assigned to the compute/store path (CSP). The overlapped computation under the LCP is 17 (cycles 0-6, 14-23). The non-overlapped computation in the CSP is 10 (cycles 8-13, 26-28, 30). Note that cycle 29 is the only store stall cycle. CRISP computes the scaled LCP as max(17 x 2, 20) = 34 and the CSP as max(10 x 2, 11) = 20. Finally, the predicted execution time by CRISP at clock frequency f/2 is 54 (34 + 20), which matches the actual execution time (54).
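The full CRISP prediction for this example can be reproduced with a few lines of code. The sketch below is a minimal implementation of Equations 3-5 applied to the counter values quoted above; function and parameter names are ours, not part of the hardware design.

```python
# A minimal sketch of the CRISP prediction (Equations 3-5), using the counter
# values from the Figure 6 example; names such as lcp_comp are illustrative only.
def crisp_predict(t_memory, lcp_comp, csp_comp, csp_stall, scale):
    """scale = f1/f2, the ratio of the current frequency to the target frequency."""
    t_lcp = max(t_memory, scale * lcp_comp)               # Equation 4
    t_csp = max(csp_comp + csp_stall, scale * csp_comp)   # Equation 5
    return t_lcp + t_csp                                   # Equation 3

# Figure 6 at frequency f: adjusted LCP = 20, overlapped computation = 17,
# non-overlapped computation = 10, store stall = 1 cycle.
print(crisp_predict(t_memory=20, lcp_comp=17, csp_comp=10, csp_stall=1, scale=2))
# -> 54, matching the actual runtime at f/2 reported in the text.
```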
3.5 Parameterizing the Model

To utilize our model, the core needs to partition execution cycles along some fairly subtle distinctions, separating two types of computation cycles and two types of memory stalls, while accounting for the fact that in a heavily multithreaded core there are often a plethora of competing causes for each idle cycle. Despite these issues, we are able to measure all of these parameters with some simple rules and a very small amount of hardware. First, we categorize each cycle as either one of two types of stalls (load stalls within the load critical path vs store stalls within the compute/store path) or computation. Next we split computation into overlapped computation (within the load critical path) or non-overlapped.

Identifying Stalls: Memory hierarchy stalls can originate from instruction cache fetch misses, load misses, or store misses. The first two we categorize as load stalls, and the last only manifests in pipeline stalls when we run out of LSQ entries or MSHRs. We initially separate these two types of stalls from computation cycles, then break down the computation cycles in the next section.

Identifying the cause of stall cycles in a heavily multithreaded core is less than straightforward. We assume the following information is available in the pipeline or L1 cache interface: the existence of outstanding load misses, the existence of outstanding store misses, the existence of a fetch stall in the front end pipeline, an MSHR or LSQ full condition, and the existence of a RAW hazard on a memory operation, a RAW hazard on an arithmetic operation causing an instruction to stall, or an arithmetic structural hazard. Most of those are readily available in any pipeline, but the last three might each require an extra bit and some minimal logic added to the GPU scoreboard.

Because our GPU can issue two instructions (two arithmetic or one arithmetic and one load/store), characterizing cycles as idle or busy is already a bit fuzzy. Two-issue cycles are busy (computation), zero-issue cycles are considered idle, while one-issue cycles will be counted as busy if there is an arithmetic RAW or structural hazard; otherwise they will be characterized as idle since there is a lack of sufficient arithmetic instructions to fully utilize the issue bandwidth.

Idle cycles will be characterized as load stalls, store stalls, or computation, according to the following algorithm, the order of which determines the characterization of stalls in the presence of multiple potential causes (a sketch of this classification appears at the end of this subsection):

(A) If there is a control hazard (branch mispredict), a cache/shared memory bank conflict, or no outstanding load or store misses or fetch stall, this cycle is counted as computation. Branch mispredicts are resolved at the speed of the pipeline, as are L1 cache and shared memory bank conflicts (the L1 cache and shared memory run on the same clock as the pipeline).

(B) Else, if there is an outstanding load miss, then if there is a RAW hazard on the load/store unit, this cycle is a load stall. Otherwise, if there is an arithmetic RAW or structural hazard, then this cycle is computation. If neither is the case and the MSHR or LSQ is full, then this is a load stall. Finally, if there is a fetch stall in the front end, then it is a load stall.

(C) Else (no outstanding load miss), if the MSHR or LSQ is full, this indicates a store stall.

(D) All other stalls must be solely due to arithmetic dependencies, and are classified as computation.

Classifying computation: After identifying the total cycles of computation, we divide those into two groups: the pure computation that is part of the compute/store path and the overlapped computation that is overlapped by the load-critical path. The former includes both computation that does not overlap loads and computation that overlaps loads that are not part of the critical path. The reason for dividing computation into these sets is that the compute/store computation is always exposed (execution time is sensitive to its latency) while the latter only becomes exposed when it exceeds the load stalls. We describe the mechanism to compute the computation under the load critical path here, but revisit the need to precisely track the critical path in Section 3.4.

In order to track the computation that is overlapped by the load-critical path, we need to track the critical path using a mechanism similar to the critical path model [34]. This is done using a counter to track the global longest critical path (the current T_Memory up to this point), as well as counters for each outstanding load that record the critical path length when they started, so that we can decide when they finish whether they extend the global longest critical path. However, we instead measure something we call the adjusted load critical path, which adds load stalls as 1-cycle additions to the running global longest critical path. In this way, when the longest critical path is calculated, its length will include both the critical-path load latencies and the stalls that lie between the critical loads: the non-critical load stalls. That adjusted load critical path is the T_LCP(t) from Equation 3, which is also the "load critical path" from Figure 3 and includes all load stalls (either as part of the critical load latencies or as non-critical stalls) as well as all overlapped computation. As a result, we can solve for the overlapped computation by simply subtracting the total load stalls from the adjusted load critical path; this is possible because we also record the total load stalls in a counter.

In the top timeline in Figure 6, the original critical path computation would count 8 (latency of A) + 12 (latency of C) = 20 cycles as the load critical path, but our adjusted load critical path will also pick up the stall at cycle 27, so would calculate 21 as our adjusted load critical path. At the same time, we would count 4 total load stall cycles. Subtracting 21 - 4 gives us 17 cycles of load-critical overlapped computation, which is exactly the number of computation cycles under the two critical path loads (A and C).
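The priority order of these rules is easy to misread in prose, so the sketch below restates rules (A)-(D) as straight-line code. The boolean flags mirror the pipeline signals listed above; their names are ours, and the function only illustrates the classification order, not the 12-gate hardware implementation.

```python
# Classify one idle cycle following rules (A)-(D); the flag names are illustrative only.
def classify_idle_cycle(control_hazard, bank_conflict, load_miss_outstanding,
                        store_miss_outstanding, fetch_stall, lsu_raw_hazard,
                        arith_raw_or_struct_hazard, mshr_or_lsq_full):
    # (A) Hazards resolved at pipeline speed, or no memory activity at all.
    if control_hazard or bank_conflict or not (
            load_miss_outstanding or store_miss_outstanding or fetch_stall):
        return "computation"
    # (B) An outstanding load miss takes priority.
    if load_miss_outstanding:
        if lsu_raw_hazard:
            return "load_stall"
        if arith_raw_or_struct_hazard:
            return "computation"
        if mshr_or_lsq_full or fetch_stall:
            return "load_stall"
    # (C) No load miss outstanding: a full MSHR/LSQ implicates stores.
    elif mshr_or_lsq_full:
        return "store_stall"
    # (D) Anything left is an arithmetic dependency.
    return "computation"
```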
3.6 Hardware Mechanism and Overhead

We now describe the hardware mechanisms that enable CRISP. It maintains three counters: adjusted load critical path (T_Memory), load stall (T^Stall_LCP), and store stall (T^Stall_CSP). The rest of the terms in Equations 3 and 4 can be computed from these three counters and total time by simple subtraction. Like the original critical path (CRIT) model, it also maintains a critical path timestamp register (ts_i) for each outstanding miss (i) in the L1 caches (one per MSHR). Table 1 shows the total hardware overhead for these counters, 99% of which is inherited from CRIT. We discuss the software overhead in Section 4.

Table 1: Hardware storage overhead per SM of CRISP

  CRISP Component         Count   Per Element (Bytes)   Total (Bytes)
  Adjusted LCP            1       4                     4
  Time stamp registers    164     4                     656
  Overlapped stall        1       4                     4
  Non overlapped stall    1       4                     4

  Total storage overhead (Bytes): CRISP 668, CRIT 660, LEAD 18, STALL 4

At the beginning of each kernel execution, the counters are set to zero. Once a load request i (instruction, data, constant, or texture) misses in the level 1 cache, CRISP copies T_Memory into the pending load's timestamp register (ts_i). When the load i's data is serviced from the L2 cache or memory after L_i cycles, T_Memory is updated with the maximum of T_Memory and ts_i + L_i. On the occurrence of a load stall, CRISP increments T_Memory and T^Stall_LCP. For each non-overlapped (store) stall, CRISP increments T^Stall_CSP.

An unoptimized implementation of the logic that detects load and store stalls needs just 12 logic gates and is never on the critical path. Most of the signals used in the cycle classification logic are already available in current pipelines. To detect the different kinds of hazards, we alter the GPU scoreboard, which requires 1 bit per warp in the GPU hardware scheduler and minimal logic.
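To make the counter protocol concrete, the following sketch emulates the three counters and the per-MSHR timestamp registers described above in software, under the assumption that the cycle classifier from Section 3.5 supplies the per-cycle events; the class and field names are ours.

```python
# Software emulation of CRISP's per-SM counters (Section 3.6); names are illustrative.
class CrispCounters:
    def __init__(self):
        self.t_memory = 0         # adjusted load critical path
        self.lcp_load_stall = 0   # T^Stall_LCP
        self.csp_store_stall = 0  # T^Stall_CSP
        self.timestamp = {}       # ts_i, one entry per outstanding L1 miss (per MSHR)

    def on_l1_miss(self, load_id):
        # Snapshot the running adjusted critical path when the miss leaves the L1.
        self.timestamp[load_id] = self.t_memory

    def on_miss_return(self, load_id, latency):
        # Extend the global critical path if this load ends a longer dependent chain.
        self.t_memory = max(self.t_memory, self.timestamp.pop(load_id) + latency)

    def on_load_stall_cycle(self):
        # Load stalls are added to both the adjusted LCP and the load-stall counter.
        self.t_memory += 1
        self.lcp_load_stall += 1

    def on_store_stall_cycle(self):
        self.csp_store_stall += 1

    def overlapped_computation(self):
        # T^Comp_LCP falls out by subtraction, as described in Section 3.5.
        return self.t_memory - self.lcp_load_stall
```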
4. METHODOLOGY

CRISP, like other existing online analytical models, requires specific changes in hardware. As a result, we rely on simulation (specifically, the GPGPUSim [37] simulator) to evaluate the benefit of our novel performance model, as in prior work (CRIT [34], LEAD [33, 36]). The GPGPU configuration used in our experiments is very similar to the NVIDIA GTX 480 Fermi architecture (in Section 5.4, we also evaluate a Tesla C2050). To evaluate the energy savings realized by CRISP, we update the original GPGPUSim with online DVFS capability. All 15 SMs in our configured architecture share a single voltage domain with a nominal clock frequency of 700 MHz. We define six other performance states (P-States) for the SMs, ranging from 100 MHz to 600 MHz with a step size of 100 MHz [18]. Similar to GPUWattch [18], we select the voltage for each P-State from 0.55V to 1.0V. We rely on a fast-responding on-chip global DVFS regulator that can switch between P-States in 100ns [46, 18, 47, 48]. We account for this switching overhead for each P-State transition in our energy savings results.

We extend the GPUWattch [18] power simulator in GPGPUSim to estimate accurate power consumption in the presence of online DVFS. For each power calculation interval of GPUWattch, we collect the dynamic power at the nominal voltage-frequency (700MHz, 1V) setting and scale it for the current voltage-frequency setting (P_Dynamic is proportional to v^2 f). We use GPUSimPow [49] to estimate the static power of individual clock domains at nominal frequency, which is between 23%-63% for the Rodinia benchmark suite. The static power at other P-States is estimated using voltage scaling (P_Static is proportional to v) [50].

We use the Rodinia [38] benchmark suite in our experiments. To initially assess the accuracy of the different models, we run the first important kernel of each Rodinia benchmark for 4.8 million instructions; we exclude nw because the runtime of each of its kernels is smaller than a single reasonable DVFS interval. The 4.8 million instruction execution of each kernel at nominal frequency translates into at least 10us execution time (ranging from 10us to 600us). We find that 10us is the minimum DVFS interval for which the switching overhead (100ns) of the on-chip regulator is negligible (at most 1%). Meanwhile, to evaluate the energy savings, we run the benchmarks for at least 1 billion instructions (except for myc, which we ran long enough to get 4286 full kernel executions). For some benchmarks, we ran for up to 9 billion instructions if that allowed us to run to completion. Eight of the benchmarks finish execution within that allotment. For presentation purposes, we divide the benchmarks into two groups. The benchmarks (bp, hw, hs, and patf) with at least 75% computation cycles are classified as compute bound. The rest of the benchmarks are memory bound.

In Section 5.2, we demonstrate an application of our performance model in an online energy saving framework. In order to make decisions about DVFS settings to optimize EDP or similar metrics, we need both a performance model and an energy model. The energy model we use assumes that the phase of the next interval will be computationally similar to the current interval [34], which is typically a safe assumption for GPUs.

At each decision interval (10us), the DVFS controller uses the power model to find the static and dynamic power of the SMs at the current settings (v_c, f_c). It also collects the static and dynamic power consumed by the rest of the components, including memory, interconnect, and the level 2 cache. For each of the possible voltage-frequency settings (v, f), it predicts (a) the static power of the SMs using voltage scaling, (b) the dynamic power of the SMs using voltage and frequency scaling, and (c) the execution time T(f) using the underlying DVFS performance model (leading load, stall time, critical path, or critical stalled path). The static and dynamic power of the rest of the system are assumed to remain unchanged since we do not apply DVFS to those domains.

The controller computes the total static energy consumption at each frequency f as the product of the total static power P_Static(v, f) and T(f). Meanwhile, the total dynamic energy consumption at frequency f is computed as P_Dynamic(f) x T(f_c). The static and dynamic energy components are added together to estimate the predicted energy E(f) at P-State (v, f). Finally, the controller computes ED2P(v, f) as E(f) x T(f)^2 and makes a transition to the P-State with the predicted minimum ED2P. For the EDP operating setting, the controller orchestrates DVFS in similar fashion, with the minimum EDP as the goal.

We measure the online software overhead at 109 cycles per decision interval by running the actual performance model code on an SM. Note that the same software could also run on a CPU. This software overhead, in addition to the P-State switching overhead, is included in our modeled results. We do not need any additional overhead for reading the performance counters since we implement them in the hardware.

In our results, we average performance counter statistics across all SMs in the power domain to drive our performance model. However, our experiments (not shown here) demonstrate that we could instead sample a single core with no significant loss in effectiveness.

5. RESULTS

This section examines the quality of the CRISP predictor in two ways. First, it evaluates the accuracy of the model in predicting kernel runtimes at different frequencies, compared to the prior state of the art. Second, it uses the predictor in a runtime algorithm to predict optimal DVFS settings and measures the potential gains in energy efficiency. In each case we compare against leading load (LEAD), critical path (CRIT), and stall time (STALL). We also explore the performance of a computationally simpler version of CRISP. Last, we examine the generality of CRISP.

5.1 Execution Time Prediction

We simulate the first instance of a key kernel of each Rodinia [38] benchmark at all seven P-States and save the execution time information. We also collect the performance counter information at the nominal frequency (700 MHz). Based on the measured runtime and performance counter data at the nominal frequency, we drive the various performance models to predict the execution time at each of the other six frequencies (100 MHz-600 MHz).

We show the breakdown of the raw counter data classifications computed by CRISP for each benchmark in Figure 7; some of this data is useful in understanding the sources of prediction error. This breakdown shows several interesting facts: (1) the benchmarks are diverse, (2) the load critical path often covers the majority of execution time, while the overlapped portion of the critical path is often quite large, and (3) three of the kernels have noticeable non-overlapped store stalls (T^Stall_CSP).

[Figure 7: Breakdown of execution time into T^Comp_LCP, T^Stall_LCP, T^Comp_CSP, and T^Stall_CSP for each benchmark, normalized to total execution time.]

Figure 8 shows the prediction error of each of the models for three target frequencies: 100 MHz, 400 MHz, and 600 MHz. We also summarize the absolute percentage error at these target frequencies in the rightmost graphs of Figure 8. The mean absolute percentage error (MAPE) averaged across all six target frequencies is reported for each benchmark in Figure 9.

[Figure 8: Execution time prediction error for different target frequencies with baseline frequency of 700 MHz.]
[Figure 9: Mean absolute percentage error averaged across all target frequencies with baseline frequency of 700 MHz.]

STALL always overpredicts the execution time, conservatively treating all stalls as completely inelastic. The inaccuracy of STALL can be as high as 288%. Both LEAD and CRIT suffer from underprediction (65% or more) and overprediction (109% or more for lm). Both are more prone to underpredicting the memory bound kernels (those on the right side of the graph); this comes from their inability to detect when the kernels transition to compute bound and become more sensitive to frequency. The major exception is ge, where they both fail to account for the store related stalls and heavily over-predict execution time.

We also note from these results that as we try to predict the performance with a high scaling factor (from 700 MHz to 100 MHz), the error of the prior techniques becomes superlinear; that is, the further you depart from the base frequency, the more proportionally inaccurate they become. The leading load paper acknowledges the superlinear error. A big factor in this superlinear error is that, with the greater change in frequency, the load stalls are more completely covered and the kernel transitions to a compute-bound kernel; this likely explains why the CRISP results do not appear to share the superlinear error in general.

The accuracy of CRISP also varies, but its maximum prediction error (30%) is reduced by a factor of 3.6 compared to the maximum error (109%) of the best of the existing models.

To illustrate this, we show the predicted execution time for the lm kernel as we scale the frequency from 700 MHz to 100 MHz (Figure 10). CRIT measures T_Memory as 0.88 of execution time, while LEAD estimates it as 0.80. We measure the load critical path to be 0.93, of which 34% are overlapped computations; thus we expect the load critical path to be completely covered at about 233 MHz. We see that above this frequency CRIT is fairly accurate, but it diverges widely after this point. Beyond this point, CRISP treats the kernel as completely compute bound, while at 100 MHz CRIT is still assuming 52% of the execution time is spent waiting for memory. In contrast, CRISP closely follows the actual runtimes (ORACLE) both at memory-bound frequencies and at compute-bound frequencies.

[Figure 10: Execution time prediction for lm. Predicted execution time (normalized) vs. clock cycle time from 1.43ns (700MHz) to 10.00ns (100MHz) for STALL, LEAD, CRIT, CRISP, and ORACLE.]

We observe a similar phenomenon in the kernels hw, hs, and bfs. In each of these cases, CRISP successfully captures the piecewise linear behavior of performance under frequency scaling. As summarized in Figure 9, the average prediction error of CRISP across all the benchmarks is 4% vs. 11% with LEAD, 11% with CRIT, and 35% with STALL.

5.2 Energy Savings

An online performance model, and the underlying architecture to compute it, is a tool. An expected application of such a tool would be to evaluate multiple DVFS settings to optimize for particular performance and energy goals. In this section, we will evaluate the utility of CRISP in minimizing two widely used metrics: energy delay product (EDP) and energy delay squared product (ED2P).

We compute the savings relative to the default mode which always runs at the maximum SM clock frequency (700 MHz, 1V). We run two similar experiments with our simple DVFS controller (Section 4) targeting (a) the minimum EDP and (b) the minimum ED2P operating point. During both experiments, the controller uses the program behavior during the current interval to predict the next interval. We do the same for each of the other studied predictors; all use the same power predictor.

Using CRISP for this purpose does raise one concern. All of the models are asymmetric (i.e., will predict differently for a lower frequency A->B than for a higher frequency B->A) due to different numbers of stall cycles, different loads on the critical path, etc. However, the asymmetry is perhaps more clear with CRISP. While predicting for lower frequencies we quite accurately predict when overlapped computation will be converted into pure computation, but it is more difficult to predict which pure computation cycles will become overlapped computation at a higher frequency. We thus modify our predictor slightly when predicting for a higher frequency. We calculate the new execution time as T_Memory + T^Stall_CSP + T^Comp_CSP x (f_c/f). This ignores the possible expansion of store stalls, and does not account for conversion of pure computation to overlapped. While this model will be less accurate, in a running system with DVFS (as in the next section), if it mistakenly forces a change to a higher frequency, the next interval will correct it with the accurate model. While this could cause ping-ponging, which we could solve with some hysteresis, we found this unnecessary.

Optimizing for EDP: Figure 11 shows the EDP savings achieved by the evaluated performance models when the controller is seeking the minimum EDP point for each kernel. We included all the runtime overhead, both software model computation and voltage switching overhead, in this result. For the 14 memory bound benchmarks, CRISP reduces EDP by 13.9% compared to 8.9% with CRIT, 8.7% with LEAD, and 0% with STALL. CRISP achieves high savings in both lm (15.3%) and ge (21.8%). The other three models fail to realize any energy savings opportunity in lm and achieve very little savings (5% with CRIT and 1% with LEAD) in ge. As we saw previously in Figure 10, CRISP accurately models lm at all frequency points. Thus, it typically chooses to run this kernel at 300 MHz, resulting in 1.25% performance degradation but 16.4% reduced energy. CRIT and LEAD both over-estimate the performance cost at this frequency and instead choose to run at 700 MHz.

[Figure 11: EDP reduction while optimized for EDP.]

For the compute bound benchmarks, CRISP marginally reduces EDP by 0.5%, while LEAD and CRIT increase EDP by 2.7% and 5.7% respectively. Due to the underprediction of execution time (e.g., patf), CRIT aggressively reduces the clock frequency and experiences 13.3% performance loss on average (33% in patf) for the compute bound benchmarks.

In summary, the average EDP savings for the Rodinia benchmark suite with CRISP is 10.7% vs. 5.7% with CRIT, 6.2% with LEAD, and 0% with STALL. The selected EDP operating points of CRISP translate into 12.9% energy savings and a 15.5% reduction in average power with an average performance loss of 3.4%. CRIT saves 10.6% energy and LEAD saves 8.1%. The performance loss is 6.3% with CRIT and 2.9% with LEAD.

Optimizing for ED2P: When attempting to optimize for ED2P, CRISP again outperforms the other models by reducing ED2P by 9.0% on average. The corresponding average savings realized by CRIT, LEAD, and STALL are only 3.5%, 4.9%, and 0%. Figure 12 shows that the savings with CRISP are much higher (11.7%) for memory bound benchmarks compared to CRIT (5.5%), LEAD (6.4%), and STALL (0%).

[Figure 12: ED2P reduction while optimized for ED2P.]

CRISP achieves its highest ED2P savings of 42% for lcy, by running the primary kernel at 500 MHz most of the time. CRIT and LEAD underpredict the energy saving opportunity and keep the frequency at 700 MHz for 50% and 41% of the duration of this kernel, respectively. As we analyze the execution trace, we find that 67% of the cycles inside the load critical path for that kernel are computation, and are thus not handled accurately by the other models.
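As a summary of how these EDP and ED2P numbers are produced, the sketch below outlines the per-interval P-State selection described in Section 4. The function, its parameters, and the counters dictionary are our own constructions; only the scaling rules (dynamic power proportional to v^2 f, static power proportional to v) and Equations 3-5 come from the text, and the unscaled power of the memory, interconnect, and L2 domains is omitted for brevity.

```python
# Sketch of one DVFS decision interval (Section 4); all names are illustrative.
def pick_p_state(p_states, nominal, counters, t_current, p_static0, p_dyn0, metric="EDP"):
    """nominal = (v0, f0); counters holds the CRISP values measured this interval;
    t_current is the measured interval time at the current setting."""
    v0, f0 = nominal
    best, best_score = nominal, float("inf")
    for (v, f) in p_states:
        s = f0 / f                                         # frequency scaling factor
        t = (max(counters["t_memory"], s * counters["lcp_comp"]) +          # Eq. 4
             max(counters["csp_comp"] + counters["csp_stall"],
                 s * counters["csp_comp"]))                                  # Eq. 5
        p_static = p_static0 * (v / v0)                    # static power scales with v
        p_dynamic = p_dyn0 * (v / v0) ** 2 * (f / f0)      # dynamic power scales with v^2*f
        energy = p_static * t + p_dynamic * t_current      # energy estimate from Section 4
        score = energy * t if metric == "EDP" else energy * t ** 2
        if score < best_score:
            best, best_score = (v, f), score
    return best
```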
5.3 Simplified CRISP Model

CRISP can be simplified to estimate the load critical path (LCP) length as the total number of load-outstanding cycles, without regard for which stall cycles (and, in particular, which computation) are under the critical path. We simply treat all computation cycles that are overlapped by a load as if they were on the critical path. This approximated version of CRISP (CRISP-L) requires three counters, and thus removes 98% of the storage overhead of CRISP (all of the timestamp registers).

[Figure 13: Accuracy, benefit, and drawback of CRISP-L, showing prediction error (MAPE), EDP savings, and performance loss at the minimum EDP point for STALL, LEAD, CRIT, CRISP-L, and CRISP.]

CRISP-L demonstrates prediction accuracy (4.66% on average in Figure 13) competitive with that of CRISP (4.48% on average) for the experiment in Section 5.1. However, when used in an EDP controller, CRISP-L suffers relative to CRISP, but still realizes far better EDP savings (8.38%) than CRIT (5.65%) and LEAD (6.16%). The inclusion of more compute cycles inside the LCP tends to create underpredictions of execution time, resulting in more aggressive frequency scaling. CRISP-L experiences more performance loss (6.28%) than CRISP (3.44%).
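The bookkeeping CRISP-L needs could look roughly like the sketch below. The text states only that three counters suffice and that every compute cycle overlapped by a load is charged to the LCP; the specific counter choice here (total cycles, load-outstanding cycles, and stall cycles while a load is outstanding) is our assumption about one way to realize that:

class CrispLCounters:
    """Assumed three-counter bookkeeping for the CRISP-L approximation."""

    def __init__(self):
        self.total_cycles = 0       # all elapsed core cycles
        self.load_out_cycles = 0    # cycles with at least one load outstanding (approximated LCP)
        self.load_stall_cycles = 0  # stalled cycles while a load is outstanding

    def tick(self, load_outstanding, stalled):
        """Update the counters once per (sampled) core cycle."""
        self.total_cycles += 1
        if load_outstanding:
            self.load_out_cycles += 1
            if stalled:
                self.load_stall_cycles += 1

    def decompose(self):
        """Split execution time the way CRISP-L approximates it."""
        lcp_comp = self.load_out_cycles - self.load_stall_cycles  # overlapped compute, all charged to the LCP
        pure_comp = self.total_cycles - self.load_out_cycles      # compute cycles with no load outstanding
        return pure_comp, self.load_stall_cycles, lcp_comp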

5.4 Generality of CRISP

All of the results so far were for an NVIDIA GTX 480 GPU. To demonstrate that the CRISP model is more general, we replicated the same experiments for a modeled Tesla C2050 GPU. The Tesla differs in the available clock frequency settings, the number of cores, the DRAM queue size, and the topology of the interconnect. Results for predictor accuracy and effectiveness were consistent with the prior results. In this section, we show in particular the results for the EDP-optimized controller.

[Figure 14: Effectiveness of CRISP on NVIDIA Tesla C2050 at the minimum EDP point, showing prediction error (MAPE), EDP savings, and performance loss for STALL, LEAD, CRIT, and CRISP.]

Figure 14 shows that the mean absolute percentage prediction error of CRISP (6.2%) is much lower than that of LEAD (16.9%) and CRIT (15.4%). Meanwhile, CRISP realizes 10.3% EDP savings, which is 5% more than the best of the prior models.

We also experimented with another DVFS controller goal – this one seeks to minimize energy while keeping the performance overhead under 1%. Figure 15 shows the energy, EDP, ED2P, and average power savings, and the performance loss, for the GTX 480 and the Tesla C2050. On the Tesla C2050, CRISP achieves 9.3% energy savings, which is 4% more than CRIT.

[Figure 15: Effectiveness of CRISP (vs. CRIT) while optimizing for energy: energy, EDP, ED2P, and average power savings, and performance loss, on the GTX 480 and Tesla C2050.]

6. CONCLUSION

We introduce the CRISP (CRItical Stalled Path) performance model to enable effective DVFS control in GPGPUs. Existing DVFS models targeting CPUs do not capture key GPGPU execution characteristics, causing them to miss energy savings opportunities or to cause unintended performance loss. CRISP effectively handles the impact of highly overlapped memory and computation in a system with high thread level parallelism, and also captures the effect of frequent memory store based stalls.

7. ACKNOWLEDGEMENTS

We thank the anonymous reviewers for their constructive comments and suggestions that helped us improve the paper. This work was supported by NSF grants CCF-1219059 and CCF-1302682.