An Integrated GPU Power and Performance Model

Sunpyo Hong ([email protected]), Electrical and Computer Engineering, Georgia Institute of Technology
Hyesoon Kim ([email protected]), School of Computer Science, Georgia Institute of Technology

ABSTRACT

GPU architectures are increasingly important in the multi-core era due to their high number of parallel processors. Performance optimization for multi-core processors has been a challenge for programmers. Furthermore, optimizing for power consumption is even more difficult. Unfortunately, as a result of the high number of processors, the power consumption of many-core processors such as GPUs has increased significantly.

Hence, in this paper, we propose an integrated power and performance (IPP) prediction model for a GPU architecture to predict the optimal number of active processors for a given application. The basic intuition is that when an application reaches the peak memory bandwidth, using more cores does not result in performance improvement.

We develop an empirical power model for the GPU. Unlike most previous models, which require measured execution times, hardware performance counters, or architectural simulations, IPP predicts execution times to calculate dynamic power events. We then use the outcome of IPP to control the number of running cores. We also model the increases in power consumption that result from the increases in temperature.

With the predicted optimal number of active cores, we show that we can save up to 22.09% of runtime GPU energy consumption, and on average 10.99%, for the five memory bandwidth-limited benchmarks.

Categories and Subject Descriptors

C.1.4 [Processor Architectures]: Parallel Architectures; C.4 [Performance of Systems]: Modeling techniques; C.5.3 [Computer System Implementation]: Microcomputers

General Terms

Measurement, Performance

Keywords

Analytical model, CUDA, GPU architecture, Performance, Power estimation, Energy

1. INTRODUCTION

The increasing power of GPUs gives them a considerably higher throughput than that of CPUs. As a result, many programmers try to use GPUs for more than just graphics applications. However, optimizing GPU kernels to achieve high performance is still a challenge. Furthermore, optimizing an application to achieve better power efficiency is even more difficult.

The number of cores inside a chip, especially in GPUs, is increasing dramatically. For example, NVIDIA's GTX280 [2] has 30 streaming multiprocessors (SMs) with 240 CUDA cores, and the next-generation GPU will have 512 CUDA cores [3]. Even though GPU applications are highly throughput-oriented, not all applications require all available cores to achieve the best performance. In this study, we aim to answer the following important questions: Do we need all cores to achieve the highest performance? Can we save power and energy by using fewer cores?

Figure 1 shows performance, power consumption, and efficiency (performance per watt) as we vary the number of active cores.¹ The power consumption increases as we increase the number of cores. Depending on the circuit design (power gating, clock gating, etc.), the gradient of the increase in power consumption also varies. Figure 1 (left) shows the performance of two different types of applications. In Type 1, the performance increases linearly, because such applications can utilize the computing power in all the cores. However, in Type 2, the performance saturates after a certain number of cores due to bandwidth limitations [22, 23]. Once the number of memory requests from the cores exceeds the peak memory bandwidth, increasing the number of active cores does not lead to better performance. Figure 1 (right) shows performance per watt. In this paper, the number of cores that shows the highest performance per watt is called the optimal number of cores.

[Figure 1: Performance, power, and efficiency (left: performance, middle: power, right: performance per watt) vs. # of active cores]

In Type 2, since the performance does not increase linearly, using all the cores consumes more energy than using the optimal number of cores. However, for application Type 1, utilizing all the cores would consume the least amount of energy because of the reduction in execution time. The optimal number of cores for Type 1 is the maximum number of available cores, but that of Type 2 is less than the maximum value. Hence, if we can predict the optimal number of cores at static time, either the compiler (or the programmer) can configure the number of threads/blocks² to utilize fewer cores, or the hardware or a dynamic manager can use fewer cores.

¹ Active cores mean the cores that are executing a program.
² The term block is defined in the CUDA programming model.

To achieve this goal, we propose an integrated power and performance prediction system, which we call IPP. Figure 2 shows an overview of IPP. It takes a GPU kernel as an input and predicts both power consumption and performance together, whereas previous analytical models only predict execution time or power. More importantly, unlike previous power models, IPP does not require hardware performance counters or architectural timing simulations; instead, it uses the outcomes of a timing model.

[Figure 2: Overview of the IPP System. A GPU kernel feeds both a performance prediction and an empirical power/temperature prediction; the combined performance-per-watt prediction yields an optimal thread/block configuration for the programmer, the compiler, or a hardware dynamic manager.]

Using the power and performance outcomes, IPP predicts the optimal number of cores that results in the highest performance per watt. We evaluate the IPP system on a real GPU system and demonstrate energy savings based on the IPP prediction. The results show that we can save up to 22.09% and on average 10.99% of runtime energy consumption by using fewer cores for the five memory bandwidth-limited benchmarks. We also estimate the amount of energy savings for GPUs that employ a power gating mechanism. Our evaluation shows that with power gating, IPP can save on average 25.85% of the total GPU power for the five bandwidth-limited benchmarks.

In summary, our work makes the following contributions:

1. We propose what, to the best of our knowledge, is the first analytical model to predict performance, power, and efficiency (performance/watt) of GPGPU applications on a GPU architecture.

2. We develop an empirical runtime power prediction model for a GPU. In addition, we also model the increases in power consumption that result from the increases in runtime temperature.

3. We propose the IPP system, which predicts the optimal number of active cores to save energy.

4. We successfully demonstrate energy savings in a real system by activating fewer cores based on the outcome of IPP.

2. BACKGROUND ON POWER

Power consumption can be divided into two parts, dynamic power and static power, as shown in Equation (1):

    Power = Dynamic_power + Static_power    (1)

Dynamic power is the switching overhead in transistors, so it is determined by runtime events. Static power is mainly determined by circuit technology, chip layout, and operating temperature.

2.1 Building a Power Model Using an Empirical Method

Isci and Martonosi [12] proposed an empirical method to build a power model. They measured and modeled the Intel Pentium 4 processor. Equation (2) shows the basic power model discussed in [12]. It consists of the idle power plus the dynamic power of each hardware component:

    Power = Σ_{i=0}^{n} [AccessRate(C_i) × ArchitecturalScaling(C_i) × MaxPower(C_i) + NonGatedClockPower(C_i)] + IdlePower    (2)

The MaxPower and other heuristic terms are determined empirically by running several training benchmarks that stress a few architectural components at a time. Access rates are obtained from performance counters; they indicate how often an architectural unit is accessed per unit of time, where one is the maximum value. The NonGatedClockPower term is not used in our model, because the evaluated GPUs do not employ clock gating. IdlePower is the power consumption when a GPU is on but no application is running.

2.2 Static Power

As technology is scaled, static power consumption has increased [4]. To understand temperature effects on static power consumption, we briefly describe static power models. Butts and Sohi [6] presented the following simplified leakage power model for an architecture-level study:

    P_static = V_cc × N × K_design × Î_leak    (3)

where V_cc is the supply voltage, N is the number of transistors in the design, K_design is a constant factor that represents the technology characteristics, and Î_leak is a normalized leakage current for a single transistor that depends on the threshold voltage V_th. Later, Zhang et al. [24] improved this static power model to consider temperature effects and operating voltages in HotLeakage, a software tool. In their model, K_design is no longer constant, and the leakage current is a function of temperature, as shown in Equation (4):

    Î_leak = µ_0 × C_OX × (W/L) × e^(b(V_dd − V_dd0)) × v_t² × (1 − e^(−V_dd/v_t)) × e^((−|V_th| − V_off)/(n × v_t))    (4)

where v_t is the thermal voltage, which is represented by kT/q and thus depends on temperature; the threshold voltage V_th is also a function of temperature. Since v_t is the dominant temperature-dependent factor in Equation (4), the leakage power increases quadratically as temperature rises. However, in a normal operating temperature range, the leakage power can be simplified as a linear model of temperature [21].

3. POWER AND TEMPERATURE MODELS

3.1 Overall Model

The GPU power consumption (GPU_power) consists of the idle power plus the runtime power, as shown in Equation (5). Runtime_power is the additional power consumption required to execute programs on a GPU. It is the sum of the runtime powers from all SMs (RP_SMs) and the GDDR memory (RP_Memory), as shown in Equation (6):

    GPU_power = Runtime_power + IdlePower    (5)

    Runtime_power = Σ_{i=0}^{n} RP_Component_i = RP_SMs + RP_Memory    (6)
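To make the model structure concrete, here is a minimal C++ sketch of how Equations (2), (5), and (6) compose. It is an illustration only, not the authors' implementation: the struct layout and function names are our own, and the scaling and clock-gating terms are dropped because the evaluated GPUs do not employ clock gating.

    #include <vector>

    // Illustrative sketch of the empirical model structure: each hardware
    // component contributes AccessRate x MaxPower (Equation (2)), the sum of
    // the component powers forms Runtime_power (Equation (6)), and the idle
    // power is added on top (Equation (5)).
    struct Component {
        double access_rate;  // 0.0 .. 1.0, how often the unit is accessed
        double max_power;    // empirically determined per-unit maximum (W)
    };

    double runtime_power(const std::vector<Component>& components) {
        double sum = 0.0;
        for (const Component& c : components)
            sum += c.access_rate * c.max_power;  // RP_SMs + RP_Memory terms
        return sum;
    }

    double gpu_power(const std::vector<Component>& components, double idle_power) {
        return runtime_power(components) + idle_power;  // Equation (5)
    }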
3.2 Modeling Power Consumption from Streaming Multiprocessors

In order to model the runtime power of SMs, we decompose the SM into several physical components, as shown in Equation (7) and Table 1. The texture and constant caches are included in the SM_Component term because they are shared between multiple SMs in the evaluated GPU system: one texture cache is shared by three SMs, and each SM has its own constant cache. RP_Const_SM is a constant runtime power component for each active SM. It models the power consumption of several units, including the I-cache and the frame buffer, which always consume a relatively constant amount of power when a core is active.

    Σ_{i=0}^{n} SM_Component_i = RP_Int + RP_Fp + RP_Sfu + RP_Alu + RP_Texture_Cache + RP_Const_Cache + RP_Shared + RP_Reg + RP_FDS + RP_Const_SM    (7)

    RP_SMs = Num_SMs × Σ_{i=0}^{n} SM_Component_i    (8)

    Num_SMs: Total number of SMs in a GPU

Table 1 summarizes the modeled architectural components used by each instruction type and the corresponding variable names in Equation (7). All instructions access the FDS (Fetch/Dec/Sch) unit. For the register unit, we assume that all instructions accessing the register file have the same number of register operands per instruction, to simplify the model. The exact number of register accesses per instruction depends on the instruction type and the number of operands, but we found that the power consumption difference due to the number of register operands is negligible.

Table 1: List of instructions that access each architectural unit

PTX Instruction | Architectural Unit | Variable Name
add_int sub_int addc_int subc_int sad_int div_int rem_int abs_int mul_int mad_int mul24_int mad24_int min_int neg_int | Int. arithmetic unit | RP_Int
add_fp sub_fp mul_fp fma_fp neg_fp min_fp lg2_fp ex2_fp mad_fp div_fp abs_fp | Floating point unit | RP_Fp
sin_fp cos_fp rcp_fp sqrt_fp rsqrt_fp | SFU | RP_Sfu
xor cnot shl shr mov cvt set setp selp slct and or | ALU | RP_Alu
st_global ld_global | Global memory | RP_GlobalMem
st_local ld_local | Local memory | RP_LocalMem
tex | Texture cache | RP_Texture_Cache
ld_const | Constant cache | RP_Const_Cache
ld_shared st_shared | Shared memory | RP_Shared
setp selp slct and or xor shr mov cvt st_global ld_global ld_const add mad24 sad div rem abs neg shl min sin cos rcp sqrt rsqrt set mul24 sub addc subc mul mad cnot ld_shared st_local ld_local tex | Register file | RP_Reg
All instructions | FDS (Fetch/Dec/Sch) | RP_FDS

Access Rate: As Equation (2) shows, dynamic power consumption depends on the access rate of each hardware component. Isci and Martonosi used a combination of hardware performance counters to measure access rates [12]. Since GPUs do not have any hardware performance counters, we instead estimate hardware access rates based on the dynamic number of instructions and the execution times. Equation (9) shows how to calculate the runtime power of each component (RP_comp), such as RP_Reg: RP_comp is the multiplication of AccessRate_comp and MaxPower_comp. MaxPower_comp is described in Table 2 and will be discussed in Section 3.4. Note that RP_Const_SM is not dependent on AccessRate_comp.

    RP_comp = MaxPower_comp × AccessRate_comp    (9)

Equation (10) shows how to calculate the access rate of each component, AccessRate_comp. The dynamic number of accesses per thread for a component (DAC_per_th_comp) is the sum of the instructions that access that architectural component, as shown in Equation (11). Warps_per_SM, defined in Equation (12), indicates how many warps³ are executed in one SM. We divide the execution cycles by four because one instruction is fetched, scheduled, and executed every four cycles. This normalization also makes the maximum value of the AccessRate_comp term one.

    AccessRate_comp = (DAC_per_th_comp × Warps_per_SM) / (Exec_cycles / 4)    (10)

    DAC_per_th_comp = Σ_{i=0}^{n} Number_Inst_per_warp_i(comp)    (11)

    Warps_per_SM = (#Threads_per_block / #Threads_per_warp) × (#Blocks / #Active_SMs)    (12)

³ A warp is a group of threads that are fetched/executed together inside the GPU architecture.

3.3 Modeling Memory Power

The evaluated GPU system has five different memory spaces: global, shared, local, texture, and constant. The shared memory space uses a software-managed cache that is inside an SM. The texture and constant memories are located in the GDDR memory, but they mainly use caches inside an SM. The global memory and the local memory share the same physical GDDR memory, hence RP_Memory considers both. The shared, constant, and texture memory spaces are modeled separately as SM components.

    RP_Memory = Σ_{i=0}^{n} Memory_component_i = RP_GlobalMem + RP_LocalMem    (13)
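As a worked illustration of Equations (10)-(12), the access-rate computation can be written in a few lines of C++. The function names and argument order are ours, but the arithmetic follows the equations directly.

    // Equation (12): how many warps are executed in one SM.
    double warps_per_sm(double threads_per_block, double blocks,
                        double threads_per_warp, double active_sms) {
        return (threads_per_block / threads_per_warp) * (blocks / active_sms);
    }

    // Equation (10): access rate of one component. dac_per_thread is the
    // per-warp count of instructions that touch the unit (Equation (11)).
    // Dividing by exec_cycles/4 normalizes the maximum rate to one, since
    // one instruction is fetched/scheduled/executed every four cycles.
    double access_rate(double dac_per_thread, double warps_per_sm_,
                       double exec_cycles) {
        return (dac_per_thread * warps_per_sm_) / (exec_cycles / 4.0);
    }

For example, a kernel with 256 threads per block and 120 blocks on 30 active SMs, with a warp size of 32, yields Warps_per_SM = (256/32) × (120/30) = 32.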
3.4 Power Model Parameters

To obtain the power model parameters, we design a set of synthetic microbenchmarks that stress different architectural components in the GPU. Each microbenchmark has a loop that repeats a certain set of instructions; for example, the microbenchmark that stresses the FP units contains a high ratio of FP instructions. We search for the set of MaxPower_comp values in Equation (9) that minimizes the error between the measured power and the outcome of the equation. To avoid searching through a large space of values, the initial seed value for each architectural unit is estimated based on the relative physical die size of the unit [12].

Table 2 shows the parameters used for MaxPower_comp. Eight power components require a special piecewise linear approach [12]: an initial increase from idle to a relatively low access rate causes a large increase in power consumption, while further increases cause smaller ones. The Spec.Linear column indicates whether the AccessRate_comp term in Equation (9) needs to be replaced with the special piecewise linear access rate based on the following simple conversion: 0.1365 × ln(AccessRate_comp) + 1.001375. The parameters in this conversion are determined empirically to produce a piecewise linear function.

Table 2: Empirical power parameters

Units | MaxPower | OnChip | Spec.Linear
FP | 0.2 | Yes | Yes
REG | 0.3 | Yes | Yes
ALU | 0.2 | Yes | No
SFU | 0.5 | Yes | No
INT | 0.25 | Yes | Yes
FDS (Fetch/Dec/Sch) | 0.5 | Yes | Yes
Shared memory | 1 | Yes | No
Texture cache | 0.9 | Yes | Yes
Constant cache | 0.4 | Yes | Yes
Const_SM | 0.813 | Yes | No
Global memory | 52 | No | Yes
Local memory | 52 | No | Yes

Figure 3 shows how the overall power is distributed among the individual architectural components for all the evaluated benchmarks (Section 5 presents the detailed benchmark descriptions and the evaluation methodology). On average, the memory, the idle power, and RP_Const_SM consume more than 60% of the total GPU power. REG and FDS also consume relatively more power than the other components because almost all instructions access these units.

3.5 Active SMs vs. Power Consumption

To measure the power consumption of each SM, we design another set of microbenchmarks that control the number of active SMs. These microbenchmarks are designed such that only one block can be executed in each SM; thus, as we vary the number of blocks, the number of active SMs varies as well. Even though the evaluated GPU does not employ power gating, idle SMs do not consume as much power as active SMs do because of low activity factors [18] (i.e., idle SMs do not change values in circuits as often as active SMs do). Hence, there are still significant differences in the total power consumption depending on the number of active SMs in a GPU.

[Figure 4: Power consumption vs. active SMs (measured and estimated)]

Figure 4 shows the increase in power consumption as we increase the number of active SMs. The maximum power delta between using only one SM and using all SMs is 37W. Since there is no power gating, the power consumption does not increase linearly with the number of SMs, so we use a log-based model instead of a linear curve, as shown in Equation (14). We model the memory power consumption with the same log-based trend, although it is not directly dependent on the number of active SMs. Finally, the runtime power is modeled as a function of the number of active SMs, as shown in Equation (16):

    RP_SMs = Max_SM × log10(α × Active_SMs + β)    (14)

    Max_SM = Num_SMs × Σ_{i=0}^{n} SM_Component_i    (15)

    α = (10 − β) / Num_SMs, β = 1.1

    Runtime_power = (Max_SM + RP_Memory) × log10(α × Active_SMs + β)    (16)

    Active_SMs: Number of active SMs in the GPU
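The log-based scaling of Equation (16) is compact enough to state directly in code. The following C++ sketch is an illustration under the paper's constants (β = 1.1), not the authors' implementation.

    #include <cmath>

    // Equations (14)-(16): runtime power as a log10 function of the number
    // of active SMs. With active_sms == num_sms the argument of the log
    // becomes exactly 10, so the full (Max_SM + RP_Memory) power is returned.
    double runtime_power_log(double max_sm, double rp_memory,
                             int active_sms, int num_sms) {
        const double beta  = 1.1;
        const double alpha = (10.0 - beta) / num_sms;
        return (max_sm + rp_memory) * std::log10(alpha * active_sms + beta);
    }

The log form captures the measured behavior that idle-but-ungated SMs still draw power: going from 29 to 30 active SMs adds far less power than going from 1 to 2.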

3.6 Temperature Model

Chip temperature models are typically represented by an RC model [20]. We determine the model parameters empirically by using a step-function experiment. Equation (17) models the rising temperature, and Equation (18) models the decaying temperature:

    Temperature_rise(t) = Idle_temp + δ (1 − e^(−t/RC_Rise))    (17)

    Temperature_decay(t) = Idle_temp + γ e^(−t/RC_Decay)    (18)

    δ = Max_temp − Idle_temp, γ = Decay_temp − Idle_temp

    Idle_temp: Idle operating chip temperature
    Max_temp: Maximum temperature, which depends on runtime power
    Decay_temp: Chip temperature right before decay

[Figure 5: Temperature effects from power; (top) measured and estimated temperature, (bottom) measured power]

Figure 5 shows the estimated and measured temperature variations. Both the chip temperature and the board temperature are measured with the built-in sensors in the GPU. Max_temp is a function of the runtime power, which depends on application characteristics. We discovered that the chip temperature is strongly affected by the rate of GDDR memory accesses, not only by the runtime power consumption. Hence, the maximum temperature is modeled as a combination of the two, as shown in Equation (19); the model parameters are described in Table 3. Note that Memory_Insts includes global and local memory instructions.

    Max_temp(Runtime_Power) = (µ × Runtime_Power) + λ + ρ × MemAccess_intensity    (19)

    MemAccess_intensity = Memory_Insts / NonMemory_Insts    (20)


Figure 3: Power breakdown graph for all the evaluated benchmarks

Table 3: Parameters for GTX280

Parameter | Value
µ | 0.120
λ | 5.5
ρ | 21.505
RC_Rise | 35
RC_Decay | 60

3.7 Modeling Increases in Static Power Consumption

Section 2.2 discussed the impact of temperature on static power consumption. Because of the high number of processors in the GPU chip, we observe an increase in runtime power consumption as the chip temperature increases, as shown in Figure 6. To consider the increase in static power consumption, we include the temperature model (Equations (17) and (18)) in the runtime power consumption model. We use a linear model to represent the increase in static power, as discussed in Section 2.2. Since we cannot control the operating voltage of the evaluated GPUs at runtime, we only consider operating temperature effects.

[Figure 6: Static power effects (measured GPU power and the power/temperature deltas over time after the program starts)]

Figure 6 shows that the power consumption increases gradually over time after an application starts,⁴ and the delta is 14 watts. This delta could be caused by an increase in static power consumption or by additional fan power. By manually controlling the fan speed from lowest to highest, we measured that the additional fan power consumption is only 4W. Hence, the remaining 10 watts of the power consumption increase is modeled as the additional static power increase that results from the increase in temperature. Equation (21) shows the comprehensive power equation over time that includes the increased static power consumption, which depends on σ, the ratio of the power delta over the temperature delta (σ = 10/22). Note that Runtime_power_0 is the initial power consumption obtained from Equation (16), and the model assumes a cold start (i.e., the system is initially in the idle state). Temperature(t) in Equation (23) is obtained from Equation (17) or (18).

⁴ An initial jump in power consumption exists when an application starts.

    GPU_power(t) = Runtime_power(t) + IdlePower    (21)

    Runtime_power(t) = Runtime_power_0 + σ × Delta_temp(t)    (22)

    Delta_temp(t) = Temperature(t) − Idle_temp    (23)
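Putting Equations (17) and (21)-(23) together, the temperature-corrected power at time t can be sketched as below. σ = 10/22 W per degree follows the text; the remaining inputs (RC_Rise, Idle_temp, Max_temp) are per-application measurements, and the cold-start assumption matches the model's.

    #include <cmath>

    // Equation (17): rising chip temperature after a cold start.
    double temperature_rise(double t, double idle_temp, double max_temp,
                            double rc_rise) {
        double delta = max_temp - idle_temp;
        return idle_temp + delta * (1.0 - std::exp(-t / rc_rise));
    }

    // Equations (21)-(23): runtime power grows with the temperature delta.
    double gpu_power_at(double t, double runtime_power0, double idle_power,
                        double idle_temp, double max_temp, double rc_rise) {
        const double sigma = 10.0 / 22.0;  // watts per degree, from the text
        double delta_temp =
            temperature_rise(t, idle_temp, max_temp, rc_rise) - idle_temp;
        return (runtime_power0 + sigma * delta_temp) + idle_power;
    }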

4. IPP: INTEGRATED POWER AND PERFORMANCE MODEL

In this section, we describe the integrated power and performance model, which predicts performance per watt and the optimal number of active cores. The integrated power and performance model (IPP) uses predicted execution times, instead of measured execution times, to predict power consumption.

4.1 Execution Time and Access Rate Prediction

In Section 3, we developed a power model that computes access rates by using measured execution time information. Predicting power at static time requires the access rates in advance. In other words, we also need to predict the execution time of an application in order to predict its power. We use a recently developed GPU analytical timing model [9] to predict the execution time. The model is briefly explained in this section; please refer to the analytical timing model paper [9] for the detailed descriptions.

In the timing model, the total execution time of a GPGPU application is calculated with one of Equations (24), (25), and (26), based on the number of running threads, MWP, and CWP in the application. MWP represents the number of memory requests that can be serviced concurrently, and CWP represents the number of warps that can finish one computational period during one memory access period. N is the number of running warps. Mem_L is the average memory latency (430 cycles for the evaluated GPU architecture). Mem_cycles is the number of processor waiting cycles for memory operations. Comp_cycles is the execution time of all instructions. #Repw is the number of times that each SM needs to repeat the same set of computation.

    Case 1: If (MWP is N warps per SM) and (CWP is N warps per SM)
    Exec_cycles = (Mem_cycles + Comp_cycles + (Comp_cycles / #Mem_insts) × (MWP − 1)) × #Repw    (24)

    Case 2: If (CWP >= MWP) or (Comp_cycles > Mem_cycles)
    Exec_cycles = (Mem_cycles × (N / MWP) + (Comp_cycles / #Mem_insts) × (MWP − 1)) × #Repw    (25)

    Case 3: If (MWP > CWP)
    Exec_cycles = (Mem_L + Comp_cycles × N) × #Repw    (26)

IPP calculates AccessRate_comp using Equation (27), where the predicted execution cycles (Predicted_Exec_Cycles) are calculated with one of Equations (24), (25), and (26):

    AccessRate_comp = (DAC_per_th_comp × Warps_per_SM) / (Predicted_Exec_Cycles / 4)    (27)
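The three cases translate directly into a small selection function. This C++ sketch is illustrative; the paper's model derives MWP, CWP, and the cycle counts analytically, whereas the sketch takes them as inputs.

    // Equations (24)-(26): predicted execution cycles for the three cases.
    double predicted_exec_cycles(double mem_cycles, double comp_cycles,
                                 double mem_l, double n_warps, double mwp,
                                 double cwp, double mem_insts, double repw) {
        if (mwp == n_warps && cwp == n_warps)        // Case 1: too few warps
            return (mem_cycles + comp_cycles +
                    comp_cycles / mem_insts * (mwp - 1.0)) * repw;
        if (cwp >= mwp || comp_cycles > mem_cycles)  // Case 2: incl. BW-limited
            return (mem_cycles * n_warps / mwp +
                    comp_cycles / mem_insts * (mwp - 1.0)) * repw;
        return (mem_l + comp_cycles * n_warps) * repw;  // Case 3: compute-bound
    }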

4.2 Optimal Number of Cores for the Highest Performance/Watt

IPP predicts the optimal number of SMs that would achieve the highest performance/watt. As we showed in Figure 1, the performance of an application either increases linearly (in this case, the optimal number of SMs is always the maximum number of cores) or non-linearly (the optimal number of SMs is less than the maximum number of cores). Performance per watt can be calculated using Equation (28):

    perf. per watt(# of cores) = (work / execution_time(# of cores)) / power(# of cores)    (28)

Equations (24), (25), and (26) calculate execution times. Among the three cases, only Case 2 includes a memory bandwidth-limited case. Case 1 is used when there are not enough running threads in the system, and Case 3 models an application that is computationally intensive, so neither Case 1 nor Case 3 can ever reach the peak memory bandwidth. To understand the memory bandwidth-limited case, let's look at MWP more carefully. The following equations show the steps in calculating MWP, the number of memory requests that can be serviced concurrently. As shown in Equation (29), MWP is the minimum of MWP_Without_BW, MWP_peak_BW, and N, where N is the number of running warps. If there are not enough warps, MWP is limited by the number of running warps. If an application is limited by memory bandwidth, MWP is determined by MWP_peak_BW, which is a function of the memory bandwidth and the number of active SMs. Note that Departure_delay represents the pipeline delay between two consecutive memory accesses; it depends on both the memory system and the memory access types (coalesced or uncoalesced) in the application.

    MWP = MIN(MWP_Without_BW, MWP_peak_BW, N)    (29)

    MWP_peak_BW = Mem_Bandwidth / (BW_per_warp × #ActiveSM)    (30)

    BW_per_warp = (Freq × Load_bytes_per_warp) / Mem_L    (31)

    MWP_Without_BW_full = Mem_L / Departure_delay    (32)

    MWP_Without_BW = MIN(MWP_Without_BW_full, N)    (33)

We could calculate d(perf. per watt(# of active cores)) / d(# of active cores) = 0 to find the optimal number of cores. However, we observed that once MWP_peak_BW reaches N, the application usually reaches the peak bandwidth. Hence, based on Equation (30), we conclude that the optimal number of cores can be calculated with the following simplified rule:

    if (1) (MWP == N) or (CWP == N), or (2) MWP > CWP, or (3) MWP < MWP_peak_BW:
        Optimal # of cores = maximum available # of cores    (34)
    else:
        Optimal # of cores = Mem_Bandwidth / (BW_per_warp × N)
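The decision rule of Equation (34) amounts to a few comparisons. The following sketch shows one way to implement it; the helper name and the integer rounding are our choices, not the paper's.

    #include <algorithm>

    // Equation (34): if the kernel cannot saturate the memory bandwidth,
    // all cores are worth enabling; otherwise enough cores to reach the
    // peak bandwidth suffice.
    int optimal_num_cores(double mwp, double cwp, double n_warps,
                          double mwp_peak_bw, double mem_bandwidth,
                          double bw_per_warp, int max_cores) {
        bool cannot_saturate = (mwp == n_warps) || (cwp == n_warps) ||
                               (mwp > cwp) || (mwp < mwp_peak_bw);
        if (cannot_saturate)
            return max_cores;
        int cores = static_cast<int>(mem_bandwidth / (bw_per_warp * n_warps));
        return std::min(std::max(cores, 1), max_cores);
    }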
4.3 Limitations of IPP

IPP requires both the power and the timing models, thereby inheriting their limitations. Examples of such limitations include control-flow-intensive applications, asymmetric applications, and texture-cache-intensive applications. IPP also requires instruction information. However, IPP does not require the actual total number of instructions; it calculates only access rates, which can easily be normalized with an input data size. Nonetheless, if an application shows significantly different behavior depending on the input size, IPP needs to consider input size effects, which will be addressed in our future work.

4.4 Using Results of IPP

In this paper, we constrain the number of active cores based on the output of IPP by limiting only the number of blocks inside an application, since we cannot change the hardware or the thread scheduler. If the number of active cores could be directly controlled by the hardware or by a runtime thread scheduler, compilers or programmers would not have to change their applications to utilize fewer cores. Instead, IPP would only pass the optimal number of cores to the runtime system, and either the hardware or the runtime thread scheduler would enable only the required number of cores to save energy.

5. METHODOLOGY

5.1 Power and Temperature Measurement

The NVIDIA GTX280 GPU, which has 30 SMs and uses a 65nm technology, is used in this work. We use the Extech 380801 AC/DC Power Analyzer [1] to measure the overall system power consumption. The raw power data is sent to a data-log machine every 0.5 seconds. Each microbenchmark executes for an average of 10 seconds.

Since we measure the input power to the entire system, we have to subtract IdlePower_System (159W) from the total system input power to obtain GPU_Power.⁵ The IdlePower value for the evaluated GPU is 83W. The GPU temperature is measured with the nvclock utility [16]; the command "nvclock -i" outputs the board and chip temperatures. Temperature is measured every second.

⁵ IdlePower_System is obtained by measuring the system power with another GPU card whose idle power is known.
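For flavor, a microbenchmark of the kind described in Sections 3.4 and 5.1 might look like the following CUDA sketch. The kernel body, iteration count, and launch configuration are illustrative assumptions, not the authors' actual code.

    // Hypothetical FP-stress microbenchmark: a loop that repeats a set of
    // floating-point instructions while generating almost no memory traffic.
    __global__ void fp_stress(float* out, int iters) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        float a = 1.0f + tid, b = 0.5f * tid, c = 0.0f;
        for (int i = 0; i < iters; ++i) {
            c = a * b + c;              // high ratio of FP multiply-adds
            a = c * 0.999f + 1.0f;
        }
        out[tid] = c;  // one store so the compiler cannot drop the loop
    }

Launching exactly one block per SM, as in Section 3.5, makes the number of blocks directly control the number of active SMs.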

5.2 Benchmarks

To test the accuracy of our IPP system, we use the Merge benchmarks [15, 9], five additional memory bandwidth-limited benchmarks (Nmat, Dotp, Madd, Dmadd, and Mmul), and one computationally intensive (i.e., non-memory bandwidth-limited) benchmark (Cmem). Table 4 describes each benchmark and summarizes its characteristics. To calculate the number of dynamic instructions, we use a GPU PTX emulator, Ocelot [13], which also classifies instruction types.

6. RESULTS

6.1 Evaluation of the Runtime Power Model

Figure 7 compares the predicted power consumption with the measured power values for the microbenchmarks. According to Figure 3, the global memory consumes the largest amount of power. The MB4, MB8, and MEM benchmarks consume much greater power than the FP benchmark, which consists mainly of floating-point instructions. Surprisingly, the benchmarks that use the texture cache or the constant cache also consume high power. This is because both the texture cache and the constant cache have a higher MaxPower than that of the FP unit. The geometric mean of the power prediction error for the microbenchmarks is 2.5%. Figure 8 shows the access rates for each microbenchmark. When an application does not have many memory operations, such as the FP benchmark, the dynamic access rates for FP or REG can be very close to one. FDS is one when an application reaches the peak performance of the machine.

[Figure 7: Comparison of measured and predicted GPU power consumption for the microbenchmarks]

[Figure 8: Dynamic access rates of the microbenchmarks]

Figure 9 compares the predicted and the measured power consumption for the evaluated GPGPU kernels. The geometric mean of the power prediction error is 9.18% for the GPGPU kernels. Figure 10 shows the dynamic access rates; the complete breakdown of the GPU power consumption is shown in Figure 3. Bino and Conv have lower global memory access rates than the others, which results in less power consumption. Sepia and Bs are high-performance applications, which explains why they have high REG and FDS values. All the memory bandwidth-limited benchmarks have higher power consumption even though they have relatively lower FP/REG/FDS access rates.

[Figure 9: Comparison of measured and predicted GPU power consumption for the GPGPU kernels]

[Figure 10: Dynamic access rates of the GPGPU kernels]

6.2 Temperature Model

Figure 11 displays the predicted chip temperature over time for all the evaluated benchmarks. The initial temperature is 57°C, the typical GPU cold-state temperature in our evaluated system. The temperature saturates after around 600 seconds. The peak temperature depends on the peak runtime power consumption, and it varies from 68°C (the INT benchmark) to 78°C (SVM). Based on Equation (22), we can predict that the runtime power of SVM would increase by 10W after 600 seconds, whereas for the INT benchmark it would increase by only 5W.

[Figure 11: Peak temperature prediction for the benchmarks. Initial temperature: 57°C]


6.3 Power Prediction Using IPP

Figure 12 shows the power prediction of IPP for both the microbenchmarks and the GPGPU kernels. These are equivalent to the experiments in Section 6.1; the main difference is that Section 6.1 requires measured execution times, while IPP uses times predicted with the equations in Section 4. Using predicted times could have increased the error of the power predictions, but since the error of the timing model is not high, the overall error of the IPP system is not significantly increased. The geometric mean of the power prediction error of IPP is 8.94% for the GPGPU kernels and 2.7% for the microbenchmarks, which is similar to using real execution time measurements.

216 Measured 225 Measured 192 IPP 200 IPP 168 175 144 150 120 125 96 100 72 75 GPU Power (W) GPU Power (W) 48 50 24 25 0 0 FP MB4 MB8 MEM INT CONST TEX SHARED SVM Bino Sepia Conv Bs Nmat Dotp Madd Dmadd Mmul Cmem

Figure 12: Comparison of measured and IPP predicted GPU power comparison (Left:Microbenchmarks, Right:GPGPU kernels)

12 Dotp 6.4 Performance and Power Efficiency Madd 10 Dmadd Prediction Using IPP Mmul Nmat Based on the conditions in Equation (34), we identify the bench- 8 Cmem Dotp (IPP) marks that reach the peak memory bandwidth. The five merge Madd (IPP) 6 Dmadd (IPP) benchmarks do not reach the peak memory bandwidth as shown in GIPS Mmul (IPP) 4 Nmat (IPP) Table 4. CWP values in Bino, Sepia and Conv are equal to or less Cmem (IPP) than the MWP values of them, so these benchmarks cannot reach 2 the peak memory bandwidth. Both SVM’s MWP (5.878) and Bs’s 0 MWP (3) are less than MWP_peak_BW (10.8). Thus they cannot 5 10 15 20 25 30 reach the peak memory bandwidth also. Number of Active Cores To further evaluate our IPP system, we use the benchmarks that reach the peak memory bandwidth (the 3rd column in Table 4 shows Figure 13: GIPS vs. Active Cores 160 the average memory bandwidth of each application). We also in- Dotp clude one non-bandwidth limited benchmark (Cmem) for a com- 140 Madd Dmadd 120 parison. In this experiment, we vary the number of active cores Mmul 100 Nmat by varying the number of blocks in the CUDA applications. We Cmem design the applications such that one SM executes only one block. 80 Note that, all different configurations (in this section) of one appli- 60

cation have the exact same amount work. So, as we use fewer cores Bandwidth (GB/s) 40 (i.e., fewer blocks), each core (or block) executes more number of 20 6 instructions. We use Giga Instructions Per Sec (GIPS) instead of 0 Gflops/s for a metric. 5 10 15 20 25 30 Number of Active Cores Figure 13 shows how GIPS varies with the number of active cores for both the actual measured data and the predictions of IPP. Figure 14: Average measured bandwidth consumption vs. # of Only Cmem has a linear performance improvement in both the active cores measured data and the predicted values. The rest of the benchmarks show a nearly saturated performance as we increase the number of active cores. IPP still predicts GIPS values accurately except for its bandwidth consumption and the number of active cores, but it Cmem. Although the predicted performance of Cmem does not still cannot reach the peak memory bandwidth. The memory band- exactly match the actual performance, IPP still correctly predicts widths of the remaining benchmarks are saturated when the number the trend. Nmat shows higher performance than other bandwidth of active cores is around 19. This explains why the performance of limited benchmarks, because it has a higher arithmetic intensity. these benchmarks is not improved significantly after approximately Figure 14 shows the actual bandwidth consumption of the ex- 19 active cores. periment in Figure 13. Cmem shows a linear correlation between Figure 15 shows GIPS/W for the same experiment. The results 6We decide to use GIPS instead of Gflop/s because the performance show both the actual GIPS/W and the predicted GIPS/W using IPP. efficiency should include non-floating point instructions. Nmat shows a salient peak point, but for the rest of benchmarks, 0.06 Dotp (IPP) ergating is the predicted energy savings if power gating is applied. Madd (IPP) 0.05 Nmat (IPP) The average energy savings for Runtime cases is 10.99%. Dotp

0.04 Madd

0 %

Nmat 3

r e

0.03 s

e

o

r

v

2 0 %

o

s

R u n t i m e + I d l e

g c

GIPS / W 0.02

n

i

R u n t i m e

a x

m

1 0 %

S a v

P o w e r g a t i n g

0.01 g

g y

n

i

r

s

e

u

0 %

0.00 E n

5 10 15 20 25 30

D o t M a d d D m a d M m u l N m a t Number of Active Cores 0.12 Cmem (IPP) Dmadd (IPP) 0.10 Mmul (IPP) Figure 17: Energy savings using the optimal number of cores Cmem based on the IPP system (NVIDIA GTX 280 and power gating 0.08 Dmadd Mmul GPUs) 0.06

GIPS / W 0.04 6.4.2 Energy Savings in Power Gating GPUs 0.02 The current NVIDIA GPUs do not employ any per-core power 0.00 5 10 15 20 25 30 gating mechanism. However, future GPU architectures could em- Number of Active Cores ploy power gating mechanisms as a result of the growth in the num- ber of cores. As a concrete example, CPUs have already made use Figure 15: Performance per watt variation vs. # of active cores of per-core power gating [11]. for measured and the predicted values To evaluate the energy savings in power gating processors, we predict the GPU power consumption as a linear function of the number of active cores. For example, if 30 SMs consume total the efficiency (GIPS/W) has a very smooth curve. As we have ex- 120W for an application, we assume that each core consumes 4W pected, only GIPS/W of Cmem increases linearly in both the mea- when per-core power gating is used. There is no reason to differ- sured data and the predicted data. entiate between Runtime+Idle and Runtime power since the power 0.20 gating mechanism eliminates idle power consumption from in-active GIPS/W_Measured GIPS/W_IPP cores. Figure 17 shows the predicted amount of energy savings for 0.16 the GPU cores that employ power gating. Since power consump- tion of each individual core is much smaller in a power-gating sys- 0.12 tem, the amount of energy savings is much higher than in the cur-

0.08 rent NVIDIA GTX280 processors. When power gating is applied, GIPS / W the average energy savings is 25.85%. Hence, utilizing only fewer 0.04 cores based on the outcomes of IPP will be more beneficial in fu- ture per-core power-gating processors. 0.00 SVM Bino Sepia Conv Bs Nmat Dotp Madd Dmadd Mmul Cmem

Figure 16: GIPS/W for the GPGPU kernels 7. RELATED WORK

Figure 16 shows GIPS/W for all the GPGPU kernels running on 7.1 Power Modeling 30 active cores. The GIPS/W values of the non-bandwidth limited Isci and Martonosi proposed power modeling using empirical benchmarks are much higher than those of the bandwidth limited data [12]. There have been follow-up studies that use similar tech- benchmarks. GIPS/W values can vary significantly from applica- niques for other architectures [7]. Wattch [5] has been widely tion to application depending on their performance. The results used to model dynamic power consumption using event counters also include the predicted GIPS/W using IPP. Except for Bino and from architectural simulations. HotLeakage models leakage cur- Bs, IPP predicts GIPS/W values fairly accurately. The errors in the rent and power based on circuit modeling and dynamic events [24]. predicted GIPS/W values of Bino and Bs are attributed to the differ- Skadron et al. proposed temperature aware mod- ences between their predicted and measured runtime performance. eling [20] and also released a software, HotSpot. Both HotLeak- age and HotSpot require architectural simulators to model dynamic 6.4.1 Energy Savings by Using the Optimal Number power consumption. All these studies were done only for CPUs. of Cores Based on IPP Sheaffer et al. studied a thermal management for GPUs [19]. Based on Equation (34), IPP calculates the optimal number of In their work, the GPU was a fixed graphics hardware. Fu et al. cores for a given application. This is a simple way of choosing presented experimental data of a GPU system and evaluated the the highest GIPS/W point among different number of cores. IPP efficiency of energy and power [8]. returns 20 for all the evaluated memory bandwidth limited bench- Our work is also based on empirical CPU power modeling. The marks and 30 for Cmem. biggest contribution of our GPU model over the previous CPU Figure 17 shows the difference in energy savings between the use models is that we propose a GPU power model that does not re- of the optimal number of cores and the maximum number (30) of quire performance measurements. By integrating an analytical tim- cores. Runtime+Idle shows the energy savings when the total GPU ing model and an empirical power model, we are able to predict the power is used in the calculation. Runtime shows the energy savings power consumption of GPGPU workloads with only the instruc- when only the runtime power from the equation (5) is used. Pow- tion mixture information. We also extend the GPU power model to model increases in the leakage power consumption over time, Georgia Tech Innovation Grant, Intel Corporation, Microsoft Re- which is becoming a critical component in many-core processors. search, and the equipment donations from NVIDIA.

7.2 Using Fewer Number of Cores 9. REFERENCES Huang et al. evaluated the energy efficiency of GPUs for sci- [1] Extech 380801. http://www.extech.com/instrument/ products/310_399/380801.html entific computing [10]. Their work demonstrated the efficiency for . [2] NVIDIA GeForce series GTX280, 8800GTX, 8800GT. only one benchmark and concluded that using all the cores provides http://www.nvidia.com/geforce. the best efficiency. They did not consider any bandwidth limitation [3] Nvidia’s next generation compute architecture. effects. http://www.nvidia.com/fermi. Li and Martinez studied power and performance considerations [4] S. Borkar. Design challenges of technology scaling. IEEE Micro, for CMPs [14]. They also analytically evaluated the optimal num- 19(4):23–29, 1999. ber of processors for best power/energy/EDP. However, their work [5] D. Brooks, V. Tiwari, and M. Martonosi. Wattch: a framework for was focused on CMP and presented heuristics to reduce design architectural-level power analysis and optimizations. In ISCA-27, space search using power and performance models. 2000. Suleman et al. proposed a feedback driven threading mecha- [6] J. A. Butts and G. S. Sohi. A static power model for architects. Microarchitecture, 0:191–201, 2000. nism [22]. By monitoring the bandwidth consumption using a hard- [7] G. Contreras and M. Martonosi. Power prediction for intel xscale ware , their feedback system decides how many threads processors using performance monitoring unit events. In ISLPED, (cores) can be run without degrading performance. Unlike our 2005. work, it requires runtime profiling to know the minimum number [8] R. Fu, A. Zhai, P.-C. Yew, W.-C. Hsu, and J. Lu. Reducing queuing of threads to reach the peak bandwidth. Furthermore, they demon- stalls caused by data prefetching. In INTERACT-11, 2007. strate power savings through simulation without a detailed power [9] S. Hong and H. Kim. An analytical model for a gpu architecture with model. The IPP system predicts the number of cores that reaches memory-level and thread-level parallelism awareness. In ISCA, 2009. the peak bandwidth at static time, thereby allowing the compiler or [10] S. Huang, S. Xiao, and W. Feng. On the energy efficiency of graphics thread scheduler to use that information without any runtime profil- processing units for scientific computing. In IPDPS, 2009. [11] Intel. Intel R Nehalem Microarchitecture. ing. Furthermore, we demonstrate the power savings by using both http://www.intel.com/technology/architecture-silicon/next-gen/. the detailed power model and the real system. [12] C. Isci and M. Martonosi. Runtime power monitoring in high-end processors: Methodology and empirical data. In MICRO, 2003. [13] A. Kerr, G. Diamos, and S. Yalamanchili. A characterization and 8. CONCLUSIONS analysis of ptx kernels. In IISWC, 2009. In this paper, we proposed an integrated power and performance [14] J. Li and J. F. Martínez. Power-performance considerations of modeling system (IPP) for the GPU architecture and the GPGPU on chip multiprocessors. ACM Trans. Archit. kernels. IPP extends the empirical CPU modeling mechanism to Code Optim., 2(4):397–422, 2005. model the GPU power and also considers the increases in leakage [15] M. D. Linderman, J. D. Collins, H. Wang, and T. H. Meng. Merge: a programming model for heterogeneous multi-core systems. In power consumption that resulted from the increases in tempera- ASPLOS XIII, 2008. ture. Using the proposed power model and the newly-developed [16] NVClock. Nvidia overclocking on Linux. 
timing model, IPP predicts performance per watt and also the opti- http://www.linuxhardware.org/nvclock/. mal number of cores to achieve energy savings. [17] NVIDIA Corporation. CUDA Programming Guide, V3.0. The power model using IPP predicts the power consumption and [18] K. Roy, S. Mukhopadhyay, and H. Mahmoodi-Meimand. Leakage the execution time with an average of 8.94% error for the evalu- current mechanisms and leakage reduction techniques in ated GPGPU kernels. IPP predicts the performance per watt and deep-submicrometer circuits. Proceedings of the IEEE, the optimal number of cores for the five evaluated bandwidth lim- 91(2):305–327, Feb 2003. [19] J. W. Sheaffer, D. Luebke, and K. Skadron. A flexible simulation ited GPGPU kernels. Based on IPP, the system can save on average framework for graphics architectures. In HWWS, 2004. 10.99% of runtime energy consumption for the bandwidth limited [20] K. Skadron, M. R. Stan, K. Sankaranarayanan, W. Huang, applications by using fewer cores. We demonstrated the power sav- S. Velusamy, and D. Tarjan. Temperature-aware microarchitecture: ings in the real machine. We also calculated the power savings Modeling and implementation. ACM Trans. Archit. Code Optim., if a per-core power gating mechanism is employed, and the result 1(1):94–125, 2004. shows an average of 25.85% in energy reduction. [21] H. Su, F. Liu, A. Devgan, E. Acar, and S. Nassif. Full chip leakage The proposed IPP system can be used by a thread scheduler estimation considering power supply and temperature variations. In ( system) as we have discussed in the paper. It ISLPED, 2003. [22] M. A. Suleman, M. K. Qureshi, and Y. N. Patt. Feedback driven can be also used by compilers or programmers to optimize program threading: Power-efficient and high-performance execution of configurations as we have demonstrated in the paper. In our future multithreaded workloads on cmps. In ASPLOS-XIII, 2008. work, we will incorporate dynamic voltage and frequency control [23] S. Williams, A. Waterman, and D. Patterson. Roofline: an insightful systems in the power and performance model. visual performance model for multicore architectures. Commun. ACM, 52(4):65–76, 2009. [24] Y. Zhang, D. Parikh, K. Sankaranarayanan, K. Skadron, and M. Stan. Acknowledgments Hotleakage: A temperature-aware model of subthreshold and gate leakage for architects. Technical report, University of Virginia, 2003. Special thanks to Gregory Diamos, Andrew Kerr, and Sudhakar Yalamanchili for Ocelot support. We thank the anonymous review- ers for their comments. We also thank Mike O’Connor, Alex Mer- ritt, Tom Dewey, David Tarjan, Dilan Manatunga, Nagesh Laksh- minarayana, Richard Vuduc, Chi-keung Luk, and HParch members for their feedback on improving the paper. We gratefully acknowl- edge the support of NSF CCF0903447, NSF/SRC task 1981, 2009