
GPGPU Power Modeling for Multi-Domain Voltage-Frequency Scaling

João Guerreiro, Aleksandar Ilic, Nuno Roma, Pedro Tomás
INESC-ID, Instituto Superior Técnico, Universidade de Lisboa, Portugal
{Joao.Guerreiro, Aleksandar.Ilic, Nuno.Roma, Pedro.Tomas}@inesc-id.pt

Abstract—Dynamic Voltage and Frequency Scaling (DVFS) on Graphics Processing Units (GPUs) components is one of the most promising power management strategies, due to its potential for significant power and energy savings. However, there is still a lack of simple and reliable models for the estimation of the GPU power consumption under a set of different voltage and frequency levels. Accordingly, a novel GPU power estimation model with both core and memory frequency scaling is herein proposed. This model combines information from both the GPU architecture and the executing GPU application, and also takes into account the non-linear changes in the GPU voltage when the core and memory frequencies are scaled. The model parameters are estimated using a collection of 83 microbenchmarks carefully crafted to stress the main GPU components. Based on the hardware performance events gathered during the execution of GPU applications on a single frequency configuration, the proposed model allows predicting the power consumption of the application over a wide range of frequency configurations, as well as decomposing the contribution of different parts of the GPU pipeline to the overall power consumption. Validated on 3 GPU devices from the most recent microarchitectures (Pascal, Maxwell and Kepler), by using a collection of 26 standard benchmarks, the proposed model is able to achieve accurate results (7%, 6% and 12% mean absolute error) for the target GPUs (Titan Xp, GTX Titan X and Tesla K40c).

I. INTRODUCTION

During the past decade, Graphics Processing Units (GPUs) have undergone many evolutions, transitioning from real-time graphics processors to high-performance accelerators for general-purpose applications, with particular application in Deep Learning [1].
Because GPU architectures are able to achieve both high arithmetic throughput and high memory bandwidth [2], they are ideal to accelerate data-parallel applications. GPUs are nowadays a staple in many high-performance computing (HPC) systems, confirmed by their usage in 72 systems of the most recent TOP500 HPC list.

One common challenge of GPU-accelerated systems regards the power and energy constraints. Despite their potential for high-performance computing, GPU devices consume considerable amounts of power, even on underused components. This issue is often addressed by Dynamic Voltage and Frequency Scaling (DVFS), which consists of scaling the voltage and frequency of the GPU components according to the requirements of the executing applications, leading to significant power and energy savings [3], [4], [5], [6].

However, to efficiently apply these power management techniques, an accurate model is required to predict how the power consumption scales when different GPU frequency/voltage configurations are applied. Previous works have shown that applications that utilize the GPU resources differently have their performance and power consumption scale in distinct ways when DVFS is applied [7], [8], [9], [10]. Hence, to accurately characterize the GPU power consumption when executing any given application, it is necessary to analyze the usage pattern of the multiple GPU components. Other research works have proposed GPU power models [11], [12], [13] that predict the power consumption only at a fixed GPU configuration, i.e. not predicting the consequent changes in power consumption caused by DVFS. More recent works partially tackle this problem [14], [15], by focusing on power prediction at different frequency configurations, although achieving non-negligible accuracy errors (ranging from 10% up to 24%). Nevertheless, none of these approaches considers the non-linear scaling of the GPU voltage with the operating frequency.

In accordance, this paper's main contribution is a new approach to estimate the GPU power consumption across an ample range of frequency and voltage configurations for the multiple GPU domains (core and memory). This is done by carefully crafting a set of 83 CUDA microbenchmarks, exercising the different components of real GPUs. The average GPU power consumption during the execution of each microbenchmark is measured for all frequency/voltage levels, while a collection of performance events is measured only at a reference configuration, allowing a clear understanding of how the power consumption changes with DVFS and of how each microbenchmark exploits the underlying GPU.

With this information, a power model for the considered GPU device is estimated using an iterative heuristic algorithm that relies on statistical regression. Based on the observed GPU components utilization rates, the model allows the prediction of the power consumption of each component, as well as estimating how the voltage scales with the operating frequency. Once the model is created, it is possible to characterize the power consumption of any GPU application for all frequency and voltage configurations, by measuring the performance events during its execution at a single configuration.

Beyond DVFS prediction, the proposed model can also be used in other scenarios, such as providing an estimate of the total and/or per-component power consumption for short-lived kernels or for devices without an embedded power sensor, or providing insights on the most influential factors of GPU power consumption, useful during application optimization.

The proposed GPU power model was extensively validated with a collection of applications from standard benchmark suites (Parboil [16], Rodinia [17], Polybench [18] and CUDA SDK [19]), by using real GPU devices from the three most recent NVIDIA microarchitectures (Pascal, Maxwell and Kepler). The proposed model achieves accurate results, on a frequency range of up to 2× change in core frequency and 4× change in memory frequency, with average errors of about 7%, 6% and 12% for the Pascal, Maxwell and Kepler devices, respectively. This is a significant improvement over the accuracy offered by previous state-of-the-art models and low-level simulators, with the benefit of also running faster than the latter. Accordingly, the most significant contributions of this paper are the following:
• a microbenchmark suite that stresses the GPU components that are the most relevant to the GPU power consumption, as well as the full disclosure of the performance events that characterize the utilization of the GPU components;
• a novel DVFS-aware GPU power model, able to predict the GPU power consumption (decoupling it at the level of each GPU component) at different frequency and voltage configurations by using performance events gathered at a single configuration — to the best of our knowledge, this is the first truly DVFS-aware power model, being able to estimate how the GPU voltage scales with the operating frequency on modern GPU devices¹;
• validation of the proposed GPU power model with standard benchmarks on commercially available GPU devices from multiple architectures, including the most recent Pascal microarchitecture.

The rest of this paper is organized as follows. Section II motivates the presented work. Section III details the proposed DVFS-aware power model and Section IV presents the proposed microbenchmark suite. Section V presents the experimental results obtained to validate the proposed model. Section VI overviews the related work and Section VII concludes the manuscript.

¹ The complete source code (microbenchmark suite and a tool to construct the DVFS-aware GPU power consumption model) is publicly available at: https://github.com/hpc-ulisboa/gpupowermodel

II. BACKGROUND AND MOTIVATION

Since GPU devices started being used as massively parallel general-purpose accelerators, their microarchitecture has observed several incremental changes. Nonetheless, common design principles are usually observed, such as their modular design and structure (giving rise to both mobile and high-performance devices).

Another common characteristic across most GPU generations (illustrated in Figure 1 for a Titan Xp GPU) is the existence of independent frequency domains, such as the core (or graphics) domain, clocked at f_core, and the memory domain, clocked at f_mem, which affects only the device memory (DRAM) bandwidth.

Figure 1: Block diagram of NVIDIA's Titan Xp GPU.

By applying DVFS to exploit the independent frequency domains, it is possible to adapt the performance of the GPU components to the requirements of the application under execution and attain energy savings [3], [4], [5], [6]. However, optimizing the GPU configuration (i.e. the frequency and voltage levels of both core and memory domains) is a non-trivial problem [20], [9], [10], as it requires an accurate estimation of both the execution time and the average power consumption, and of how they change when the GPU configuration is modified.

Generally, the power consumption of a GPU device can be decomposed into the sum of the power consumptions of the multiple architectural components [21], with the power of each component (C_k) being associated with its peak power consumption and with how an application stresses such component during its execution (Power(C_k) ∝ Utilization(C_k)).

A. Power consumption and DVFS

Gonzalez et al. [22] and Butts et al. [23] proposed the power models presented in Equations 1 and 2:

    Power_Dynamic = a · C · V^2 · f,                    (1)
    Power_Static  = V · N · K_design · Î_leak,          (2)

where a denotes the average utilization ratio, C the total capacitance, V the supply voltage, f the operating frequency and N the number of transistors in the chip design. K_design is a constant factor associated with the technology characteristics and Î_leak is a normalized leakage current for a single transistor, which depends on the threshold voltage. These two power models can be used to describe how the dynamic and static components scale with the frequency and voltage of their respective hardware elements.
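To make the scaling behaviour of Equations 1 and 2 concrete, the following minimal Python sketch evaluates both terms for two hypothetical V-F pairs (all constants are arbitrary illustrative values, not measurements from this work):

    def dynamic_power(a, C, V, f):
        """Eq. 1: switching power, quadratic in voltage, linear in frequency."""
        return a * C * V**2 * f

    def static_power(V, N, K_design, I_leak):
        """Eq. 2: leakage power, linear in the supply voltage."""
        return V * N * K_design * I_leak

    # Hypothetical V-F pairs: halving f often also permits a lower voltage.
    for V, f in [(1.05, 1.0e9), (0.85, 0.5e9)]:
        p_dyn = dynamic_power(a=0.5, C=2e-8, V=V, f=f)
        p_sta = static_power(V=V, N=1e9, K_design=2e-8, I_leak=1.0)
        print(f"V={V:.2f} V, f={f/1e9:.1f} GHz: dyn={p_dyn:.1f} W, sta={p_sta:.1f} W")

The superlinear drop of the dynamic term whenever a frequency reduction also allows a voltage reduction is precisely the effect that a DVFS-aware model must capture.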

However, although they give valuable insight on the impact of DVFS, it is usually impossible to accurately measure these two components separately, let alone determine the individual model parameters. Consequently, other approaches to model the GPU power consumption are required.

Additionally, while it is common for GPU manufacturers to provide tools to dynamically scale the frequency of each GPU domain, there is usually no way of knowing how the voltage is scaling. In fact, while in some previous NVIDIA generations (like Fermi) the GPU voltage scaled linearly with the frequency of the cores [4], in Maxwell GPUs it is possible to scale the frequency of the cores without changing the voltage [10].

This is a gap that this research aims to close, by proposing an iterative heuristic algorithm based on statistical regression to model both the uncertainties of the GPU components and how their voltage scales with the frequency of each domain, creating an accurate power consumption model of the GPU. Through extensive microbenchmarking of the several GPU components, it is possible to estimate the parameters of the power consumption model. Once the model is constructed, one can predict the total and/or per-component power consumption of a new (unseen) application on any frequency/voltage configuration, by measuring its performance events at a single configuration.

B. DVFS impact on GPU power consumption

Each GPU application has its unique characteristics, such as the algorithm, the data types, the used operations, as well as the size of the input data, the dimensions of the grid of threads, etc. These characteristics determine how the different GPU components are used during the application execution. Furthermore, depending on how the applications exercise the GPU components, the effects of DVFS on the total GPU power consumption can vary between applications — the effects also depend on the architectural characteristics of each utilized component (see Equations 1 and 2).

Figure 2 presents an example of such a scenario, where the BlackScholes and CUTCP benchmarks were executed on an NVIDIA GTX Titan X GPU across multiple frequency and voltage configurations.

Figure 2 also presents the utilization of the main GPU components, represented as the ratio of the achieved and peak theoretical throughputs of the component. As it can be seen, the two applications present very different utilization rates of the GPU components, which results in the distinct power consumption levels of 181W and 135W at the default frequency configuration of the GPU (f_core = 975 MHz and f_mem = 3505 MHz).

Figure 2: DVFS impact on the power consumption of two applications (BlackScholes and CUTCP) on the GTX Titan X GPU. On the right side is presented the utilization of the GPU components during the application execution (GPU frequencies set to f_core = 975 MHz and f_mem = 3505 MHz).

Additionally, it can also be seen that the variation of the power consumption when the memory frequency is decreased is much higher for the BlackScholes benchmark, mainly because of its greater DRAM utilization: when the memory frequency decreases from 3505 MHz to 810 MHz, the power consumption decreases by 52% (from 181W to 87W). On the other hand, the power consumption of the CUTCP benchmark decreases only by 24% (from 135W to 102W). Regarding the core frequency scaling, it can be seen that the GPU power cannot be represented as a simple linear function of the core frequency, as suggested in the power models proposed in previous works [12], [14], due to the implicit voltage scaling (see also Equations 1 and 2 and observe the non-linear behaviour of the power consumption in Figure 2).

From these observations, it is clear that an accurate DVFS-aware power model is needed to characterize the relationship between the utilization of the GPU components, their runtime power consumption, and how they change when the frequency/voltage of the GPU domains are scaled.
III. DVFS-AWARE POWER MODEL

A. Power consumption model

The proposed DVFS-aware power model assumes the decomposition of the GPU power consumption across its internal components. This is done by considering that the components may operate under different frequency and voltage domains, such that:

    P_GPU = Σ_{k=1}^{N_V-F} P(D_k),                     (3)

where N_V-F represents the number of independent voltage/frequency (V-F) domains and P(D_k) represents the power consumption of each domain (D_k). The power of each domain (D_k) is defined as follows:

    P(D_k) = α_0 V_k + V_k^2 f_k (α_1 + Σ_{i=1}^{N_C} γ_i · U_i),     (4)

where V_k and f_k represent the specific voltage and frequency of the D_k domain, N_C is the number of GPU components operating under domain D_k, and U_i ∈ [0, 1] is their respective average utilization rate. The coefficients α_0, α_1, γ_1, ..., γ_{N_C} represent a set of hardware-specific parameters, associated with the characteristics of the underlying architecture, such as the component total capacitance and leakage resistance. Hence, the proposed power model comprises 3 different terms: 1) α_0 V_k, corresponding to the static power of the domain (see Equation 2); 2) V_k^2 f_k · α_1, corresponding to the power consumption associated with that specific frequency and voltage level, independent of the component utilizations (e.g., idle power of that V-F level); and 3) V_k^2 f_k · γ_i U_i, corresponding to the dynamic power of component i (see Equation 1).
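Read operationally, Equation 4 is a small closed-form function of the domain's voltage, frequency and component utilizations; a direct transcription in Python (where the parameter values would come from the fitting procedure of Section III-D) looks like:

    def domain_power(V, f, U, alpha0, alpha1, gamma):
        """Eq. 4: static term + V-F-dependent idle term + per-component dynamic terms."""
        static = alpha0 * V
        idle = V**2 * f * alpha1
        dynamic = V**2 * f * sum(g * u for g, u in zip(gamma, U))
        return static + idle + dynamic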

To reduce the number of unknown parameters, Equation 4 can be normalized to a reference voltage (V_R):

    P(D_k) = α_0 V_R (V_k/V_R) + (V_k/V_R)^2 f_k (α_1 V_R^2 + Σ_{i=1}^{N_C} γ_i V_R^2 · U_i)
           = β_0 V̄_k + V̄_k^2 f_k (β_1 + Σ_{i=1}^{N_C} ω_i · U_i),     (5)

Vk of cycles when there is at least one active warp on the SMs, where β0 = α0VR, β1 = α1VR, ωi = γiVR and V¯ k = . VR This formulation is particularly useful during the model UnitsPerSMx is the number of units of type x on each SM and WarpSize is the number of threads in a warp (a estimation, since the normalized V¯ k is 1 at the reference configuration, making it simpler for this particular setup characteristic of the GPU device). and providing the grounds for the initial estimation of the On the other hand, the utilization rate of the different parameters (see Section III-D). memory hierarchy levels can be computed by looking at the achieved bandwidth at each level (ABand) and by compar- Although one can generally consider multiple V-F do- ing it with the corresponding peak bandwidth (PeakBand), mains, in most modern GPU devices N = 2, correspond- V-F such that: ing to the core domain (Pcore), which includes the L2 cache, ABandy and the memory domain (Pmem), i.e. PGPU = Pcore+Pmem. Uy = , y ∈ {L2, Shared, DRAM} . (9) By replacing Equation 5 for these two domains and by PeakBandy denoting with Ncore the number of components from the C. Architecture-specific events core domain whose power consumption is considered in the Some of the metrics used to compute the utilization model, the following is obtained: rate in the proposed model can be directly gathered from

Although one can generally consider multiple V-F domains, in most modern GPU devices N_V-F = 2, corresponding to the core domain (P_core), which includes the L2 cache, and the memory domain (P_mem), i.e. P_GPU = P_core + P_mem. By replacing Equation 5 for these two domains, and by denoting with N_core the number of components from the core domain whose power consumption is considered in the model, the following is obtained:

    P_core = β_0 V̄_core + V̄_core^2 f_core (β_1 + Σ_{i=1}^{N_core} ω_i U_i),     (6)
    P_mem  = β_2 V̄_mem  + V̄_mem^2  f_mem  (β_3 + ω_mem U_mem).                  (7)

Equations 6 and 7 show the distinctive approach of the proposed model when compared with the state-of-the-art [12], [14], since it considers multiple V-F domains and relies on a more accurate relationship between frequency scaling, voltage levels and power consumption.

B. Hardware utilization metrics

To accurately determine the utilization parameters (U_i) in Equations 6 and 7, a set of metrics is defined, which considers the GPU hardware components with the greatest contribution to the power consumption variations, namely: integer (Int), single- and double-precision floating-point (SP/DP) and special-function (SF) units, shared memory, L2 cache and DRAM. Although it would be potentially beneficial to consider more components of the GPU architecture (e.g. L1 instruction and data caches, texture units, etc.), it is not easy to assess their real-time utilization, since NVIDIA does not disclose events or metrics to accurately describe their average utilization. Nonetheless, should such information be disclosed, one may easily consider other hardware units, in order to further improve the model accuracy. Moreover, as will be described in Section III-C, even for the selected GPU components it was deemed necessary to rely on undisclosed events, by performing extensive experimental testing to uncover their potential meaning.

The utilization level of the considered GPU compute units, measured during the application execution, can be obtained by observing the number of executing warps and by comparing it to the number of warps that would execute if the units were always filled:

    U_x = (AWarps_x · WarpSize) / (ACycles · UnitsPerSM_x),  x ∈ {Int, SP, DP, SF},     (8)

where AWarps_x is the number of warps executing on unit x during the application execution, ACycles is the number of cycles when there is at least one active warp on the SMs, UnitsPerSM_x is the number of units of type x on each SM, and WarpSize is the number of threads in a warp (a characteristic of the GPU device).

On the other hand, the utilization rate of the different memory hierarchy levels can be computed by looking at the achieved bandwidth at each level (ABand) and by comparing it with the corresponding peak bandwidth (PeakBand), such that:

    U_y = ABand_y / PeakBand_y,  y ∈ {L2, Shared, DRAM}.     (9)

C. Architecture-specific events

Some of the metrics used to compute the utilization rates in the proposed model can be directly gathered from the publicly available device characteristics, such as the UnitsPerSM for the Int, SP, DP and SF units, and the WarpSize. The DRAM and shared memory peak bandwidths can be calculated using the known device characteristics (PeakBand = f · Bytes/Cycle, where f is the operating frequency of that memory level). The L2 cache peak bandwidth cannot be computed as trivially, as was shown by numerous works [24], [25], [26]. Hence, it was experimentally determined with a set of specific L2 microbenchmarks (detailed in Section IV), specifically developed for this purpose.

However, other required metrics depend on the average utilization of the GPU components, which also depends on the application characteristics (and therefore needs to be measured during their execution). Table I presents the set of performance events that were used to obtain the remaining parameters shown in Equations 8 and 9, collected using the NVIDIA CUPTI library. The events identified with a label correspond to the events disclosed by NVIDIA. The remaining events, identified with a numeric ID, were selected through extensive experimental testing in order to assess their meaning. Furthermore, since some of the considered metrics (e.g. ABand_DRAM) depend on the values of multiple performance events (4 for this specific metric), an aggregation step needs to be conducted.

Moreover, since in the considered GPU devices the events related with the SP and Int units are combined into the same set of events (making them indistinguishable), the utilization of each of those components is determined by the ratio of instructions executed for each instruction type:

    AWarps_z = AWarps_Int/SP · Inst_z / (Inst_Int + Inst_SP),  z ∈ {Int, SP}.     (10)

Table I: Performance events required to compute the metrics used in the proposed power consumption model.

    Metric          | Titan Xp                  | GTX Titan X               | Tesla K40c
    ACycles         | active cycles             | active cycles             | active cycles
    ABand_L2        | l2_subp{0,1}_total_rd_sq* | l2_subp{0,1}_total_rd_sq* | l2_subp{0,1,2,3}_t_rd_sq*
                    | l2_subp{0,1}_total_wr_sq* | l2_subp{0,1}_total_wr_sq* | l2_subp{0,1,2,3}_t_wr_sq*
    ABand_Shared    | shared_ld_trans           | shared_ld_trans           | l1_sh_ld_trans
                    | shared_st_trans           | shared_st_trans           | l1_sh_st_trans
    ABand_DRAM      | fb_subp{0,1}_rd_sectors   | fb_subp{0,1}_rd_sectors   | fb_subp{0,1}_rd_sectors
                    | fb_subp{0,1}_wr_sectors   | fb_subp{0,1}_wr_sectors   | fb_subp{0,1}_wr_sectors
    AWarps_SP/INT†  | W580, W581                | W361, W362                | W131, W134, W136, W137
    AWarps_DP†      | W584                      | W364                      | W141
    AWarps_SF†      | W560                      | W359                      | W133
    Inst_INT†       | W831                      | W504                      | W205
    Inst_SP†        | W829                      | W502                      | W203
    * sq – sector queries.
    † The prefix W stands for: 352321 for Titan Xp, 335544 for GTX Titan X and 318767 for Tesla K40c.
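For illustration, the metrics of Equations 8–10 reduce to a few lines of Python once the raw event counts are available (the counts fed in would come from CUPTI; the constants below correspond to the GTX Titan X entries of Tables I and II):

    WARP_SIZE = 32
    UNITS_PER_SM = {"SP": 128, "DP": 4, "SF": 32}     # GTX Titan X (Table II)

    def unit_utilization(awarps, acycles, unit):
        """Eq. 8: used issue bandwidth of a compute unit over its peak."""
        return (awarps * WARP_SIZE) / (acycles * UNITS_PER_SM[unit])

    def memory_utilization(achieved_band, peak_band):
        """Eq. 9: achieved bandwidth over peak bandwidth."""
        return achieved_band / peak_band

    def split_int_sp(awarps_int_sp, inst_int, inst_sp):
        """Eq. 10: disaggregate the combined Int/SP warp count."""
        total = inst_int + inst_sp
        return (awarps_int_sp * inst_int / total,    # AWarps_Int
                awarps_int_sp * inst_sp / total)     # AWarps_SP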
D. Model parameter estimation

The final step towards the definition of the proposed DVFS-aware power model corresponds to the determination of the unknown parameters X = [β_0, β_1, β_2, β_3, ω_mem, ω_1, ..., ω_N] and of the set of voltages V̄ = (V̄_core, V̄_mem) associated with each frequency configuration (which are also considered as unknowns, because general GPU drivers do not directly provide these values). Hence, a set of specifically developed microbenchmarks (described in Section IV) is used to stress the considered GPU components, to better characterize their uncertainties. The set of measurements gathered during the execution of these microbenchmarks, i.e. the utilization rates and the power consumption at each V-F configuration, can then be used to estimate the parameters used in the proposed power consumption model. Moreover, since there is a relation between the unknowns V̄_k and β_i/ω_i in the proposed model (Equations 6 and 7), a simple least squares regression cannot be used, as it leads to a non-full-rank optimization problem. Hence, an iterative optimization algorithm was devised to estimate such parameters, which works as follows:

1) Determine the initial value of the unknowns X by considering the reference frequency F_1 = (f_core1, f_mem1), where V̄_core1 = V̄_mem1 = 1. Consider also two additional configurations F_2 = (f_core2, f_mem1) and F_3 = (f_core1, f_mem2), and assume that their corresponding normalized voltage levels are also 1 (V̄_core2 = V̄_mem2 = 1). By using the measurements obtained at those three configurations, solve the following linear system using a least squares estimation:

    X = arg min_X Σ_{Microbench. ∈ B_A} (P_meas. − P̂)^2,     (11)
    s.t. V̄_core = V̄_mem = 1,

where P_meas. is the measured power consumption, P̂ = P_core + P_mem, as given by Equations 6 and 7, and B_A is the set of microbenchmarks executed at the frequency configurations F_1, F_2 and F_3.

2) By using the previously determined vector of parameters X — for each frequency configuration (f_core, f_mem) — use the measurements from the set of microbenchmarks to estimate the values of V̄_core and V̄_mem, by solving the following problem:

    For each F = (f_core, f_mem), solve:
    V̄ = arg min_V̄ Σ_{Microbench. ∈ B_B} (P_meas. − P̂)^2,     (12)
    s.t. ∀_{f_x1 > f_x2} V̄_x1 ≥ V̄_x2,  x ∈ {core, mem},

where V̄_xi is the voltage level associated with frequency f_xi and B_B is the set of microbenchmarks executed at frequency configuration (f_core, f_mem).

3) Considering the newly determined values for V̄_core and V̄_mem, repeat step 1 to estimate the new values of X, but using the measurements from all frequency levels, i.e. by extending B_A to include the set of measurements taken for all microbenchmarks at all frequency configurations.

4) Iterate between steps 2 and 3 until convergence is achieved, or the maximum number of iterations is reached.

One benefit of the proposed methodology over previous studies is the ability to dynamically determine how the GPU voltage is scaling for each frequency configuration. Considering that part of the power consumption scales with the square of the voltage, it is important to have an informed knowledge of these values, in order to achieve an accurate model of the architecture. Hence, despite it being impossible to measure the real-time voltage of each GPU domain in many computing systems, the proposed methodology still allows achieving accurate power predictions, since no assumption is made on how the voltage scales with frequency. However, if there is previous information regarding the voltage levels of each domain at any given frequency configuration, the proposed methodology can be simplified into a single execution of step 3, by utilizing the real voltage values.

(a) Int, SP, DP code (DATA_TYPE in {int, float, double}):
    DATA_TYPE r0, r1, r2, r3;
    r0 = A[threadId]; r1 = r0 + 1; r2 = r0 + 2; r3 = r0 + 3;  // initialization reconstructed
    for (int i = 0; i < N; i++) {        // N sets the arithmetic intensity
        r0 = r1 * r2 + r3;  r1 = r2 * r3 + r0;
        r2 = r3 * r0 + r1;  r3 = r0 * r1 + r2;
    }
    A[threadId] = r3;

(b) SF code: same structure as (a), with the multiply-adds replaced by transcendental operations.

(c) Shared memory code:
    __shared__ DATA_TYPE shared[THREADS];
    DATA_TYPE r0, r1 = 0;
    r0 = A[threadId];
    for (int i = 0; i < N; i++) {        // conflict-free load/store pairs; addressing reconstructed
        r1 = shared[(threadId + i) % THREADS];
        shared[threadId] = r0 + r1;
    }
    A[threadId] = r0 + r1;

(e) DRAM code: same structure as (a), with few (or no) arithmetic operations between the global-memory load of A[threadId] and the corresponding store.

Figure 3: Example CUDA source code of some of the used microbenchmarks.
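To make the iterative procedure of Section III-D concrete, the following Python sketch implements its alternation; the data layout, the least-squares call, and the grid search standing in for the constrained solver of Equation 12 (whose monotonicity constraint, pinning of the reference voltages at 1, and per-configuration grouping of runs are omitted here) are all illustrative assumptions:

    import numpy as np

    # Each `run` is one microbenchmark executed at one (f_core, f_mem) point:
    # a dict with keys fc, fm, util (1-D array of core-component U_i),
    # u_mem and p_meas (measured average power).

    def model_row(vc, vm, fc, fm, util, u_mem):
        """One row of the linear system in the unknowns
        X = [b0, b1, w_1..w_Ncore, b2, b3, w_mem] (Eqs. 6-7),
        valid once the normalized voltages vc, vm are fixed."""
        return np.concatenate(([vc, vc**2 * fc], vc**2 * fc * util,
                               [vm, vm**2 * fm, vm**2 * fm * u_mem]))

    def fit_power_model(runs, n_iter=50):
        for r in runs:                       # step 1 assumption: all voltages = 1
            r["vc"] = r["vm"] = 1.0
        for _ in range(n_iter):
            # Steps 1/3: least-squares fit of X with the voltages held fixed.
            A = np.array([model_row(r["vc"], r["vm"], r["fc"], r["fm"],
                                    r["util"], r["u_mem"]) for r in runs])
            p = np.array([r["p_meas"] for r in runs])
            X, *_ = np.linalg.lstsq(A, p, rcond=None)
            # Step 2: re-estimate the voltage levels (Eq. 12);
            # a coarse grid search stands in for the constrained solver.
            grid = np.linspace(0.8, 1.3, 26)
            for r in runs:
                err = lambda vc, vm: (model_row(vc, vm, r["fc"], r["fm"],
                                      r["util"], r["u_mem"]) @ X - r["p_meas"])**2
                r["vc"], r["vm"] = min(((vc, vm) for vc in grid for vm in grid),
                                       key=lambda v: err(*v))
        return X

Once X and the per-configuration voltages converge, predicting an application's power at any V-F configuration reduces to a single call to model_row with that application's measured utilizations — the prediction step described next in Section III-E.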

E. Power consumption prediction

With the model parameters determined, it is possible to predict how the voltage scales with the frequency of each GPU domain and to obtain the total power consumption of any executed application for the whole range of the device V-F configurations, by simply measuring its performance events on a single configuration. This allows a considerable decrease of the design search space, which is a highly valuable advantage when applying DVFS in real-time. Finally, the obtained power model also allows the decomposition of the power consumption into the partial consumptions of the several GPU components.

IV. MICROBENCHMARKING THE GPU

To model the unknown characteristics of the underlying architecture, the proposed modelling methodology relies on in-depth microbenchmarking of specific GPU components. By creating a wide set of microbenchmarks, covering the several components of the GPU, it is possible to isolate their power consumption, enabling an accurate prediction of their contribution to the total GPU power consumption.

Figure 3 presents a subset of CUDA code examples from the developed microbenchmarks. To stress the main arithmetic units (Int, SP and DP), the microbenchmark presented in Figure 3a was developed, where the DATA_TYPE can be switched between int, float and double. Figure 4 presents the PTX code corresponding to the SP variation of the microbenchmark, where it can be seen that the FP operations make use of architectural registers. When executing this microbenchmark, each thread starts by initializing the values of 4 registers with data from the global memory. Afterwards, each thread executes a series of multiply and addition operations (using the PTX fused multiply-add instruction), until a number of N iterations is reached (N=512 in the example shown in Figure 4). The threads finalize by storing the computed value back to the global memory. By running the same code with different values of N, it is possible to characterize the impact of different instruction mixes on the GPU power consumption, by assigning different amounts of arithmetic operations per memory access (arithmetic intensity). As the value of N increases, more arithmetic instructions are executed for each pair of load/store instructions, resulting in increasingly higher levels of utilization of the corresponding arithmetic units and lower utilization levels of the memory hierarchy (DRAM and L2 cache).

Figure 4: PTX source code of the microbenchmark stressing the single-precision floating-point units.

Figure 3b presents the developed microbenchmark to stress the special-function units. The code is very similar to the previous arithmetic microbenchmarks, with the difference relying on using transcendental operations instead of a simple multiply and addition.

The microbenchmark presented in Figure 3c was developed to stress the memory subsystem, where each thread consecutively performs one load and one store to the shared memory. The load and store addresses are chosen in a way that minimizes the shared-memory bank conflicts for both loads and stores.
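Generating the intensity sweep over N is mechanical; a hypothetical helper like the following (the names, template and value list are illustrative, not the authors' tooling) could emit the Figure 3a kernel for each data type and each value of N:

    # Hypothetical generator for the N-controlled arithmetic-intensity sweep
    # of Figure 3a; the emitted kernel mirrors the structure described above.
    KERNEL_TEMPLATE = """
    __global__ void bench_{dtype}_{n}({dtype} *A) {{
        int threadId = blockIdx.x * blockDim.x + threadIdx.x;
        {dtype} r0 = A[threadId], r1 = r0 + 1, r2 = r0 + 2, r3 = r0 + 3;
        for (int i = 0; i < {n}; i++) {{
            r0 = r1 * r2 + r3; r1 = r2 * r3 + r0;
            r2 = r3 * r0 + r1; r3 = r0 * r1 + r2;
        }}
        A[threadId] = r3;   // one load + one store per {n} FMA-loop iterations
    }}"""

    def emit_suite(dtypes=("int", "float", "double"),
                   n_values=(0, 1, 2, 4, 8, 16, 32, 64, 128, 256, 512)):
        return [KERNEL_TEMPLATE.format(dtype=d, n=n)
                for d in dtypes for n in n_values]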
Due to the absence of publicly available information regarding the operation and structure of the L2 cache on NVIDIA GPUs, developing a microbenchmark to stress this component is not a trivial task. Hence, the considered microbenchmark (Figure 3d) is based on [26], where different access patterns are explored to characterize the L2 cache.

Figure 3e illustrates the microbenchmarks used to stress the GPU DRAM. While the structure is rather similar to the arithmetic microbenchmarks (Figure 3a), lower arithmetic intensities can be obtained by assigning a lower number of arithmetic instructions per loop iteration, i.e. by choosing smaller values of N (since threads spend less time inside the SMs, which results in a higher utilization of the DRAM).

Finally, a set of microbenchmarks corresponding to a mix of the several components was also considered, as well as a microbenchmark where no GPU kernel is awakened (Idle), resulting in the proposed suite of 83 microbenchmarks.

Figure 5A presents the utilization rate of the seven considered GPU components, obtained by executing the developed microbenchmarks on the GTX Titan X (Maxwell) GPU at the default frequency configuration. Looking at the first microbenchmarks of the Integer collection, it is possible to see the effects of the gradual decrease of the arithmetic intensity (obtained by varying N, in Figure 3a). This can be observed by the increase in the utilization rate of both elements of the memory hierarchy (DRAM and L2 cache), corresponding to the decrease of the utilization of the Int units. The same behaviour can be observed for the SP, DP and SF microbenchmarks. Regarding the memory microbenchmarks, these successfully stress the memory element, with varying degrees of corresponding utilizations obtained again by varying the loop iterations in their source code. Overall, the obtained results show that the proposed microbenchmark suite successfully accomplishes its design goal, in stressing the considered components.

Upon model construction, i.e. after estimating the unknown parameters in Equations 6 and 7, it is possible to predict the power consumption at the component-level. Figure 5B illustrates the per-component power breakdown of each microbenchmark, together with the measured total power consumption at the default frequency configuration. From these results, it can be observed that the proposed model is particularly accurate in predicting the power consumption of this set of applications (see Section V-B for a robust assessment on an independent set of applications). Additionally, it can also be observed that, for this V-F configuration, the constant portion of the power consumption, i.e. the terms from Equations 6 and 7 that do not depend on the components utilization, contributes with 84W to the total power consumption, and that the maximum dynamic power is about 49% of the total (achieved in one of the Mix microbenchmarks).

Figure 5: Per-component utilization rates (A) and power breakdown (B) of the microbenchmark suite on the GTX Titan X, at the default frequency configuration (f_core = 975 MHz and f_mem = 3505 MHz). Microbenchmark groups: INT (×12), SP (×11), DP (×12), SF (×8), L2 (×10), MIX (×7), DRAM (×12), Shared (×10).

V. EXPERIMENTAL RESULTS

A. Experimental setup

To validate the proposed model, three GPUs from the most recent NVIDIA microarchitectures (see Table II) were used as testing platforms on a CentOS 7 environment. While the Tesla K40c GPU has a single non-idle memory frequency level, the other two GPUs allow multiple memory frequency configurations. For this reason, the presented results will mostly focus on the GTX Titan X and the Titan Xp GPUs, since they are the more recent.
The NVML library was used for monitoring and changing the frequencies of the GPU domains (while the operating voltage is automatically set). The real power measurements are also obtained using NVML, whose values are refreshed at an estimated period of 35ms for the GTX Titan X, 15ms for the Titan Xp and 100ms for the Tesla K40c. Since very short execution times may result in misleading power measurements, whenever necessary, the benchmark kernels were repeatedly executed to always reach at least 1 second of execution time at the fastest GPU configuration (highest core and memory frequencies). The power consumption of each kernel was computed as the average of all gathered samples. For benchmarks with multiple kernels, the total power consumption was obtained by weighting the consumption of each kernel with its relative execution time. To guarantee the accuracy of the presented results, all benchmarks were repeated 10 times, with the presented values corresponding to the median value.
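A minimal sketch of this measurement loop, using the pynvml bindings (the device index, clock values and sampling period are placeholders; setting application clocks requires administrative privileges and a GPU that supports it):

    import time
    import pynvml

    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)     # placeholder device index

    # Select one V-F configuration (MHz); the driver picks the voltage itself.
    pynvml.nvmlDeviceSetApplicationsClocks(handle, 3505, 975)   # mem, core

    samples = []
    t_end = time.time() + 1.0          # sample over >= 1 s of kernel execution
    while time.time() < t_end:
        samples.append(pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0)  # mW -> W
        time.sleep(0.035)              # ~35 ms NVML refresh on the GTX Titan X

    print(f"average power: {sum(samples) / len(samples):.1f} W")
    pynvml.nvmlShutdown()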

Table II: Summarized description of the used GPUs.

                                  | Titan Xp      | GTX Titan X             | Tesla K40c
    Base architecture             | Pascal        | Maxwell                 | Kepler
    Compute capability            | 6.1           | 5.2                     | 3.5
    Memory frequencies (MHz)      | {5705, 4705}* | {4005, 3505, 3300, 810} | 3004
    Core freq. range (MHz)        | [1911:582]    | [1164:595]              | [875:666]
    Number of core freq. levels   | 22            | 16                      | 4
    Default mem. frequency (MHz)  | 5705          | 3505                    | 3004
    Default core frequency (MHz)  | 1404          | 975                     | 875
    Threads per warp              | 32            | 32                      | 32
    Number of SMs                 | 30            | 24                      | 15
    Memory bus width              | 48B           | 48B                     | 48B
    Shared mem. banks             | 32            | 32                      | 32
    SP/INT units/SM               | 128           | 128                     | 192
    DP units/SM                   | 4             | 4                       | 64
    SF units/SM                   | 32            | 32                      | 32
    TDP (W)                       | 250           | 250                     | 235
    * NVIDIA driver does not allow setting the memory frequency to lower levels.

Figure 6: Measured vs. predicted core voltage (normalized to the reference level): (a) GTX Titan X; (b) Titan Xp.

Table III: Standard benchmarks used to validate the proposed power model.

    Suite          | Application Name
    Rodinia [17]   | Streamcluster, Backprop, LUD, Gaussian, Hotspot, K-Means, ParticleFilter naive, ParticleFilter float, SRAD v1, SRAD v2
    Parboil [16]   | CUTCP, LBM
    Polybench [18] | 2MM, 3MM, FDTD-2D, SYRK, CORR, GEMM, GESUMMV, GRAMSCHM, SYRK DOUBLE, 3DCONV, COVAR
    CUDA SDK [19]  | Blackscholes, ConjugateGradientUM, matrixMulCUBLAS

Figure 7: Power prediction for all V-F configurations, for the validation set of standard benchmarks (not used in the model construction).

The results will be analysed by considering the whole frequency range in Section V-B. To create the model, the microbenchmark suite described in Section IV was executed on the different frequency levels, with the performance events shown in Table I being measured only at the reference frequency levels. The algorithm proposed to estimate the power model converged in less than 50 iterations, corresponding to about 30 seconds on an Intel i7 4500U processor. To obtain a bias-free validation of the model, an independent collection of 26 applications from 4 benchmark suites was used (see Table III), with each application being executed only at the reference frequency configuration to measure the required hardware events. Since these applications were not used to estimate the model parameters, they allow showing the model robustness for new (unseen) applications. The accuracy of the predictions is validated using the power consumption measurements taken at all considered frequency levels (see Table II).

B. DVFS-aware power model validation

Voltage levels prediction: Even though the devised model assumes that both V_core and V_mem can scale with the changes in frequency of the two GPU domains (see Section II-B), from extensive experimental testing, no voltage differences were observed across the different memory frequency levels for the considered GPUs.

On the other hand, the results denote clear differences in the core voltage levels for the GTX Titan X and Titan Xp GPUs. Figure 6 presents the comparison between the predicted core voltage (obtained during the construction of the model) and the measured voltage for the Titan Xp and GTX Titan X. The real measured voltages were obtained using the NVIDIA Inspector and MSI Afterburner (third-party Windows tools). However, it was not possible to sweep through all core and memory frequency ranges, due to limitations of these tools, nor was it possible to verify the voltage levels on the Tesla K40c GPU.
The results clearly show that there are two distinct regions for the core voltage when scaling the core frequency: i) a constant voltage region, for the lower frequencies; and ii) after a specific frequency, the voltage starts increasing linearly with the frequency — i.e., the measured curve is well approximated by a piecewise-linear function, constant up to a break frequency and rising linearly beyond it. By comparing the predicted and measured values, it can be observed that the devised model is accurate in predicting the core voltage, and in identifying the breaking point between the two distinct regions. The existence of these two distinct behaviours for the core voltage scaling may significantly affect the power consumption of an application over the different core frequencies. Moreover, it should be highlighted that significant core voltage differences are predicted on the GTX Titan X across the different memory frequencies, although these are not shown in the figure because such values could not be validated.

Power consumption prediction with DVFS: Figure 7 presents the accuracy of the proposed power model for the validation benchmarks, for multiple core and memory V-F configurations. Since multiple frequency settings were taken into account, the range of obtained power values is large, going from 40W up to 248W on the GTX Titan X. On the validation benchmarks, the model achieves mean absolute errors of 6.9%, 6.0% and 12.4% on the Titan Xp, GTX Titan X and Tesla K40c devices, respectively.² The higher error observed on the Tesla K40c can be explained by a reduced accuracy when characterizing the utilization of the GPU components with the hardware events available on this device (see Table I). Nonetheless, the achieved prediction error is still considerably lower than the one achieved by previous existing works on the same microarchitecture, where the prediction error was 23.5% [14]. The performance events and the architecture of the other two GPUs (GTX Titan X and Titan Xp) are very similar, resulting in very similar accuracies of the power model.

² The validation benchmarks were not used in the construction of the model, to allow a robust assessment of the results.

It is also worth noting that the power model achieves accurate results even when predicting the power consumption for a wide range of configurations by using only the performance events measured at the reference configuration. In particular, the results show accurate predictions for the considered validation benchmarks for the different memory frequencies of the GTX Titan X (3505 MHz down to 810 MHz) and core frequencies (975 MHz down to 595 MHz). When covering a frequency range of up to 4.3× for the memory frequencies and 1.6× for the core frequencies of the GTX Titan X, a mean prediction error of 6.0% over all the V-F configurations is still achieved. As one might expect, when predicting the power at the operating frequencies furthest away from the reference configuration (where the performance events used to build the model were measured), the accuracy error increases slightly: while the error is 4.9% at f_mem = 3505 MHz, for the memory frequency f_mem = 810 MHz the accuracy error increases to 8.7%.

Figure 8: Prediction error of the benchmarks on the GTX Titan X. Each plot presents the prediction error obtained with the validation set of benchmarks, for all core frequencies with a fixed memory frequency.

Input data size: For a given kernel, the characteristics of the input data determine how much the different GPU components are stressed. For example, it is expected that one GPU kernel with input data small enough to fit in the L2 cache will have a different resource utilization than the same kernel with a much larger input data, as the larger input will lead to an increase in the utilization of the DRAM accesses. Naturally, the distinct utilization patterns are taken into account in the model, resulting in the prediction of distinct power consumptions. Figure 9 presents the effects of varying the size of the (square) input matrices of the matrixMulCUBLAS kernel on the GPU power consumption and on the utilization of each GPU component. As it can be seen, with the increase of the input data sizes, the utilization of the SP unit, L2 cache and DRAM increases, resulting in the presented rise of the power consumption, which is predicted by the proposed model with a 6.8% average error.

Figure 9: Effects of varying the input matrices size for the matrixMulCUBLAS kernel, on the GTX Titan X.

Decoupling the GPU power: Once the model is fully determined, it is possible to estimate the power consumption breakdown of each GPU component for any application. This can be particularly interesting for power-aware optimization, since it provides the application developers with information about which components represent the main power consumption bottlenecks. Figure 10 presents the utilization and power breakdown of the set of standard benchmarks for two V-F configurations on the GTX Titan X GPU. From the presented results, it can be observed that the group of validation benchmarks is rather representative, presenting large differences in the utilization levels of the different GPU components. Nonetheless, in both V-F configurations, it can be seen that a non-negligible portion of the power consumption is accounted for in the constant part (80W for the reference configuration and 50W for the low memory configuration), which aggregates the static power, the idle power of that frequency configuration, and the power consumptions of other non-modelled GPU components (due to the lack of informative counters). Naturally, between the two V-F configurations, the large variation in the DRAM operating frequency leads to the observed large variation in the DRAM power consumption, while the power of the remaining components stays almost constant.

Figure 10: Power consumption breakdown on the real benchmarks, on the GTX Titan X GPU.

Use cases: The proposed power consumption model can be applied in the following scenarios: 1) GPUs without sensor, by using a previously built model (e.g. constructed using external sensors) to provide an estimate of the total and/or per-component GPU power consumption (similarly to [27] from Intel). 2) Application analysis, by using the per-component breakdown to assess the power bottlenecks of developing applications (an alternative to the usual performance optimization); or even in a virtualization scenario (e.g. an NVIDIA GRID system using Hyper-V execution [28]),
where the model — constructed in the Hypervisor — could be provided to the guest VMs, allowing them to estimate their corresponding total and/or per-component power consumption (which they currently have no way of measuring). 3) DVFS management, by facilitating the search for the optimal frequency state, as it allows estimating the power consumption at different frequency configurations without requiring exhaustive execution on all possible configurations as in [29]. 4) GPU hardware integration, by implementing the proposed model in hardware (similarly to Intel RAPL [30]), where it would be able to take into account fine-grained V-F perturbations and potentially even non-SMU (System Management Unit) V-F adjustments.

VI. RELATED WORK

Initial attempts to model the GPU power consumption were focused on modelling the power at a fixed frequency/voltage configuration, neglecting the effect of DVFS [31], [32], [33], [34], [35], [36]. In particular, Nagasaka et al. [37] proposed a power consumption model for a Tesla GPU (GTX285) based on hardware performance events and on a statistical approach to find the correlation between the performance profiles and the GPU power consumption. They achieved 4.7% average prediction error, although they also stated that the approach was ineffective on more recent GPUs, namely those from the Fermi generation.

Hong et al. also proposed a power model for a Tesla GPU (GTX280) [11], based on an analysis of both the binary PTX and of the pipeline, at runtime. The offline PTX analysis allows this model to attain highly accurate GPU power predictions, at the cost of being very GPU-specific. Hence, such an approach lacks the ability to make accurate predictions for different GPU architectures, or even for the same GPU at different core and memory configurations.

Song et al. used an artificial neural network to train the GPU power consumption model [13], achieving better prediction accuracy than other traditional regression-based models. However, neural network approaches usually create output models of high complexity, where it is often hard to extract their physical meaning.

Leng et al. integrated Hong's power model inside the GPGPU-Sim [38] simulator to form GPUWattch [12]. Supporting both NVIDIA's Tesla and Fermi GPU architectures, GPUWattch can estimate the cycle-level GPU power consumption during application execution. The considered model assumes that the power consumption of a GPU domain always scales linearly with its frequency [12, eq. 6]. However, as it was previously shown (see Figure 2), this is not always the case, because of the non-linear behaviour of the voltage (see also Figure 6). Nath et al. also used GPGPU-Sim to create a performance model for DVFS, which could potentially be expanded to include a power model [39]. However, such an approach requires adding logic to the GPU scoreboard, making it impossible to replicate on real hardware. This type of approach has been deemed product-specific and difficult to apply on modern GPUs [14].

Abe et al. proposed DVFS-aware power regression models for GPUs from the NVIDIA Tesla, Fermi and Kepler generations [14], which separate the GPU power consumption into core and memory domains, each proportional to their corresponding frequency and associated performance events.
The models are estimated with linear regression by using measurements taken at 3 different core and 3 different memory frequencies. The proposed models achieved average prediction errors of 15% for the Tesla GPU, 14% for the Fermi GPU and 23.5% for the most recent Kepler GPU. However, the authors do not disclose which set of performance events is used in the model. Additionally, despite performing the power consumption decomposition into the core and memory domains, similar to the one herein proposed, the authors do not consider the non-linear scaling effects of the voltage.

Wu et al. studied how the performance and power consumption of an AMD GPU scale with core and memory frequency variations, as well as with different numbers of cores [15]. These authors group GPU applications into distinct clusters based on their characteristics, each representing a different performance/power scaling behaviour. Neural-network classifiers are used to characterize new applications, by predicting which scaling factor better represents an application. They achieve an average deviation of about 10% on the tested GPU device. However, the model accuracy is highly dependent on a set of fine-tuned parameters, such as the number of clusters, which makes it difficult to replicate on different architectures. More recently, in a follow-up work [40], a technique was proposed to optimize the GPU energy efficiency by predicting the characteristics of upcoming kernels, based on recent execution history.

VII. CONCLUSIONS AND FUTURE WORK

This manuscript presented a new DVFS-aware GPU power model that is able to predict the power consumption of the several GPU components for any frequency/voltage configuration, by using the performance events gathered at a single configuration. The proposed approach makes use of an especially devised iterative algorithm that relies on statistical regression and is able not only to model the unknown characteristics of the underlying architecture, but also to accurately predict how the GPU voltage scales with the core/memory domain frequencies.

When used with a set of three different GPUs, representing the three most recent NVIDIA microarchitectures, the proposed DVFS-aware power model accurately estimates the power consumption with an average error of 7%, 6% and 12% on the Pascal, Maxwell and Kepler GPUs, respectively. Particularly, in the Maxwell (or Pascal) GPU, the model provides accurate results across a frequency range of 1.6× (2.4×) for the core and 4.3× (1.2×) for the memory frequencies.

The work herein presented opens the possibility for many different future directions, one of which is the implementation of the proposed DVFS-aware power model in real-time. This can be done by taking advantage of the iterative nature of many of the most common GPU applications, by measuring the performance events during the first call to a GPU kernel and then using the power prediction to determine the frequency/voltage configuration that best suits that kernel. Additionally, the proposed model can be used for the development of novel energy-aware GPU simulators and for the energy-optimization of GPU applications.

ACKNOWLEDGMENT

This work was partially supported by Fundação para a Ciência e a Tecnologia (FCT), under grants SFRH/BD/101457/2014 and UID/CEC/50021/2013.

REFERENCES

[1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, and others, "TensorFlow: a system for large-scale machine learning," in Proc. Conf. Oper. Syst. Des. Implement. (OSDI), 2016.
[2] NVIDIA, "CUDA C Programming Guide v9.0," 2017. [Online]. Available: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html
[3] S. Herbert and D. Marculescu, "Analysis of dynamic voltage/frequency scaling in chip-multiprocessors," in Proc. Int. Symp. Low Power Electron. Des. (ISLPED), 2007.
[4] X. Mei, L. S. Yung, K. Zhao, and X. Chu, "A measurement study of GPU DVFS on energy conservation," in Proc. Workshop Power-Aware Comput. Syst. (HotPower), 2013.
[5] K. Ma, X. Li, W. Chen, C. Zhang, and X. Wang, "GreenGPU: A Holistic Approach to Energy Efficiency in GPU-CPU Heterogeneous Architectures," in Proc. Int. Conf. Parallel Process. (ICPP), 2012.
[6] R. Ge, R. Vogt, J. Majumder, A. Alam, M. Burtscher, and Z. Zong, "Effects of Dynamic Voltage and Frequency Scaling on a K20 GPU," in Proc. Int. Conf. Parallel Process. (ICPP), 2013.
[7] S. Zhuravlev, S. Blagodurov, and A. Fedorova, "Addressing shared resource contention in multicore processors via scheduling," in Proc. Int. Conf. Archit. Support Program. Lang. Oper. Syst. (ASPLOS), 2010.
[8] S. Che, J. W. Sheaffer, M. Boyer, L. G. Szafaryn, L. Wang, and K. Skadron, "A characterization of the Rodinia benchmark suite with comparison to contemporary CMP workloads," in Proc. Int. Symp. Workload Characterization (IISWC), 2010.
[9] J. Guerreiro, A. Ilic, N. Roma, and P. Tomás, "Performance and Power-Aware Classification for Frequency Scaling of GPGPU Applications," in Proc. Workshop Algorithms, Model. Tools Parallel Comput. Heterog. Platforms (HeteroPar), 2016.
[10] X. Mei, Q. Wang, and X. Chu, "A survey and measurement study of GPU DVFS on energy conservation," Digital Communications and Networks, vol. 3, no. 2, pp. 89–100, 2017.
[11] S. Hong and H. Kim, "An integrated GPU power and performance model," in Proc. Int. Symp. Comput. Archit. (ISCA), 2010.
[12] J. Leng, T. Hetherington, A. ElTantawy, S. Gilani, N. S. Kim, T. M. Aamodt, and V. J. Reddi, "GPUWattch: enabling energy optimizations in GPGPUs," in Proc. Int. Symp. Comput. Archit. (ISCA), 2013.
[13] S. Song, C. Su, B. Rountree, and K. W. Cameron, "A Simplified and Accurate Model of Power-Performance Efficiency on Emergent GPU Architectures," in Proc. Int. Symp. Parallel Distrib. Process. (IPDPS), 2013.
[14] Y. Abe, H. Sasaki, S. Kato, K. Inoue, M. Edahiro, and M. Peres, "Power and performance characterization and modeling of GPU-accelerated systems," in Proc. Int. Parallel Distrib. Process. Symp. (IPDPS), 2014.
[15] G. Wu, J. L. Greathouse, A. Lyashevsky, N. Jayasena, and D. Chiou, "GPGPU performance and power estimation using machine learning," in Proc. Int. Symp. High Perform. Comput. Archit. (HPCA), 2015.
[16] J. A. Stratton, C. Rodrigues, I.-J. Sung, N. Obeid, L.-W. Chang, N. Anssari, D. Geng, W.-M. Liu, and W. Hwu, "Parboil: A Revised Benchmark Suite for Scientific and Commercial Throughput Computing," Center for Reliable and High-Performance Computing, Tech. Rep., 2012.
[17] S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer, S.-H. Lee, and K. Skadron, "Rodinia: A benchmark suite for heterogeneous computing," in Proc. Int. Symp. Workload Characterization (IISWC), 2009.
[18] L.-N. Pouchet, "Polybench: The polyhedral benchmark suite." [Online]. Available: http://www.cs.ucla.edu/~pouchet/software/polybench/
[19] NVIDIA, GPU Computing SDK, 2017. [Online]. Available: https://developer.nvidia.com/cuda-code-samples
[20] R. A. Bridges, N. Imam, and T. M. Mintz, "Understanding GPU Power: A Survey of Profiling, Modeling, and Simulation Methods," ACM Comput. Surv. (CSUR), 2016.
[21] C. Isci and M. Martonosi, "Runtime power monitoring in high-end processors: methodology and empirical data," in Proc. Int. Symp. Microarchitecture (MICRO), 2003.
[22] R. Gonzalez, B. Gordon, and M. Horowitz, "Supply and threshold voltage scaling for low power CMOS," IEEE J. Solid-State Circuits (SSC), vol. 32, no. 8, pp. 1210–1216, 1997.
[23] J. Butts and G. Sohi, "A static power model for architects," in Proc. Int. Symp. Microarchitecture (MICRO), 2000.
[24] H. Wong, M.-M. Papadopoulou, M. Sadooghi-Alvandi, and A. Moshovos, "Demystifying GPU microarchitecture through microbenchmarking," in Proc. Int. Symp. Perform. Anal. Syst. Softw. (ISPASS), 2010.
[25] X. Mei and X. Chu, "Dissecting GPU Memory Hierarchy Through Microbenchmarking," IEEE Trans. Parallel Distrib. Syst. (TPDS), vol. 28, no. 1, pp. 72–86, Jan. 2017.
[26] A. Lopes, F. Pratas, L. Sousa, and A. Ilic, "Exploring GPU performance, power and energy-efficiency bounds with Cache-aware Roofline Modeling," in Proc. Int. Symp. Perform. Anal. Syst. Softw. (ISPASS), 2017.
[27] J. Haj-Yihia, A. Yasin, Y. B. Asher, and A. Mendelson, "Fine-Grain Power Breakdown of Modern Out-of-Order Cores and Its Implications on Skylake-Based Systems," ACM Trans. Archit. Code Optim. (TACO), vol. 13, no. 4, pp. 1–25, Dec. 2016.
[28] NVIDIA and Microsoft, "Graphics Accelerated Productivity For Every User, Any Application," 2016. [Online]. Available: http://images.nvidia.com/content/grid/pdf/microsoft-server-solution.pdf
[29] J. Guerreiro, A. Ilic, N. Roma, and P. Tomás, "Multi-kernel Auto-Tuning on GPUs: Performance and Energy-Aware Optimization," in Proc. Euromicro Int. Conf. Parallel, Distrib. Network-Based Process. (PDP), 2015.
[30] D. Hackenberg, R. Schöne, T. Ilsche, D. Molka, J. Schuchart, and R. Geyer, "An Energy Efficiency Feature Survey of the Intel Haswell Processor," in Proc. Int. Parallel Distrib. Process. Symp. Workshop (IPDPSW), 2015.
[31] X. Ma, M. Dong, L. Zhong, and Z. Deng, "Statistical power consumption analysis and modeling for GPU-based computing," in Proc. Workshop Power-Aware Comput. Syst. (HotPower), 2009.
[32] J. Chen, B. Li, Y. Zhang, L. Peng, and J.-K. Peir, "Statistical GPU power analysis using tree-based methods," in Proc. Int. Green Comput. Conf. and Workshops (IGCC), 2011.
[33] Y. Zhang, Y. Hu, B. Li, and L. Peng, "Performance and Power Analysis of ATI GPU: A Statistical Approach," in Proc. Int. Conf. Networking, Archit. Storage (NAS), 2011.
[34] S. Ghosh, S. Chandrasekaran, and B. Chapman, "Statistical modeling of power/energy of scientific kernels on a multi-GPU system," in Proc. Int. Green Comput. Conf. and Workshops (IGCC), 2013.
[35] J. Lim, N. B. Lakshminarayana, H. Kim, W. Song, S. Yalamanchili, and W. Sung, "Power Modeling for GPU Architectures Using McPAT," ACM Trans. Des. Autom. Electron. Syst. (TODAES), vol. 19, no. 3, pp. 1–24, Jun. 2014.
[36] V. Adhinarayanan, B. Subramaniam, and W.-C. Feng, "Online Power Estimation of Graphics Processing Units," in Proc. Int. Symp. Clust. Cloud Grid Comput. (CCGrid), 2016.
[37] H. Nagasaka, N. Maruyama, A. Nukada, T. Endo, and S. Matsuoka, "Statistical power modeling of GPU kernels using performance counters," in Proc. Int. Green Comput. Conf. and Workshops (IGCC), 2010.
[38] A. Bakhoda, G. L. Yuan, W. W. L. Fung, H. Wong, and T. M. Aamodt, "Analyzing CUDA workloads using a detailed GPU simulator," in Proc. Int. Symp. Perform. Anal. Syst. Softw. (ISPASS), 2009.
[39] R. Nath and D. Tullsen, "The CRISP performance model for dynamic voltage and frequency scaling in a GPGPU," in Proc. Int. Symp. Microarchitecture (MICRO), 2015.
[40] A. Majumdar, L. Piga, I. Paul, J. L. Greathouse, W. Huang, and D. H. Albonesi, "Dynamic GPGPU Power Management Using Adaptive Model Predictive Control," in Proc. Int. Symp. High Perform. Comput. Archit. (HPCA), 2017.