An Integrated GPU Power and Performance Model

Sunpyo Hong ([email protected]), Electrical and Computer Engineering, Georgia Institute of Technology
Hyesoon Kim ([email protected]), School of Computer Science, Georgia Institute of Technology

ABSTRACT

GPU architectures are increasingly important in the multi-core era due to their high number of parallel processors. Performance optimization for multi-core processors has been a challenge for programmers. Furthermore, optimizing for power consumption is even more difficult. Unfortunately, as a result of the high number of processors, the power consumption of many-core processors such as GPUs has increased significantly.

Hence, in this paper, we propose an integrated power and performance (IPP) prediction model for a GPU architecture to predict the optimal number of active processors for a given application. The basic intuition is that when an application reaches the peak memory bandwidth, using more cores does not result in performance improvement.

We develop an empirical power model for the GPU. Unlike most previous models, which require measured execution times, hardware performance counters, or architectural simulations, IPP predicts execution times to calculate dynamic power events. We then use the outcome of IPP to control the number of running cores. We also model the increases in power consumption that result from the increases in temperature.

With the predicted optimal number of active cores, we show that we can save up to 22.09% of runtime GPU energy consumption, and on average 10.99%, for the five memory bandwidth-limited benchmarks.

Categories and Subject Descriptors

C.1.4 [Processor Architectures]: Parallel Architectures; C.4 [Performance of Systems]: Modeling techniques; C.5.3 [Computer System Implementation]: Microcomputers

General Terms

Measurement, Performance

Keywords

Analytical model, CUDA, GPU architecture, Performance, Power estimation, Energy

1. INTRODUCTION

The increasing power of GPUs gives them a considerably higher throughput than that of CPUs. As a result, many programmers try to use GPUs for more than just graphics applications. However, optimizing GPU kernels to achieve high performance is still a challenge. Furthermore, optimizing an application to achieve better power efficiency is even more difficult.

The number of cores inside a chip, especially in GPUs, is increasing dramatically. For example, NVIDIA's GTX280 [2] has 30 streaming multiprocessors (SMs) with 240 CUDA cores, and the next-generation GPU will have 512 CUDA cores [3]. Even though GPU applications are highly throughput-oriented, not all applications require all available cores to achieve the best performance. In this study, we aim to answer the following important questions: Do we need all cores to achieve the highest performance? Can we save power and energy by using fewer cores?

Figure 1 shows performance, power consumption, and efficiency (performance per watt) as we vary the number of active cores.¹ The power consumption increases as we increase the number of cores. Depending on the circuit design (power gating, clock gating, etc.), the gradient of the increase in power consumption also varies. Figure 1 (left) shows the performance of two different types of applications. In Type 1, the performance increases linearly, because such applications can utilize the computing power in all the cores. However, in Type 2, the performance saturates after a certain number of cores due to bandwidth limitations [22, 23]. Once the number of memory requests from the cores exceeds the peak memory bandwidth, increasing the number of active cores does not lead to better performance. Figure 1 (right) shows performance per watt. In this paper, the number of cores that shows the highest performance per watt is called the optimal number of cores.

[Figure 1: Performance, power, and efficiency (left: performance, middle: power, right: performance per watt) vs. # of active cores]

In Type 2, since the performance does not increase linearly, using all the cores consumes more energy than using the optimal number of cores. However, for application Type 1, utilizing all the cores would consume the least amount of energy because of the reduction in execution time. The optimal number of cores for Type 1 is the maximum number of available cores, but that of Type 2 is less than the maximum value. Hence, if we can predict the optimal number of cores at static time, either the compiler (or the programmer) can configure the number of threads/blocks² to utilize fewer cores, or the hardware or a dynamic manager can use fewer cores.

¹ Active cores mean the cores that are executing a program.
² The term block is defined in the CUDA programming model.

To achieve this goal, we propose an integrated power and performance prediction system, which we call IPP. Figure 2 shows an overview of IPP. It takes a GPU kernel as an input and predicts both power consumption and performance together, whereas previous analytical models only predict execution time or power. More importantly, unlike previous power models, IPP does not require hardware performance counters or architectural timing simulations; instead, it uses the outcomes of a timing model.

[Figure 2: Overview of the IPP System. A GPU kernel feeds both a performance prediction and an empirical power/temperature prediction; the combined performance-per-watt prediction yields an optimal thread/block configuration for the programmer, the compiler, or a hardware dynamic manager.]

Using the power and performance outcomes, IPP predicts the optimal number of cores that results in the highest performance per watt. We evaluate the IPP system on a real GPU system and demonstrate energy savings based on the IPP prediction. The results show that we can save up to 22.09% and on average 10.99% of runtime energy consumption by using fewer cores for the five memory bandwidth-limited benchmarks. We also estimate the amount of energy savings for GPUs that employ a power gating mechanism. Our evaluation shows that with power gating, IPP can save on average 25.85% of the total GPU power for the five bandwidth-limited benchmarks.

In summary, our work makes the following contributions:

1. We propose what, to the best of our knowledge, is the first analytical model to predict performance, power, and efficiency (performance/watt) of GPGPU applications on a GPU architecture.

2. We develop an empirical runtime power prediction model for a GPU. In addition, we also model the increases in power consumption that result from the increases in runtime temperature.

3. We propose the IPP system, which predicts the optimal number of active cores to save energy.

4. We successfully demonstrate energy savings in a real system by activating fewer cores based on the outcome of IPP.

2. BACKGROUND ON POWER

Power consumption can be divided into two parts, dynamic power and static power, as shown in Equation (1):

    Power = Dynamic_power + Static_power    (1)

Dynamic power is the switching overhead in transistors, so it is determined by runtime events. Static power is mainly determined by circuit technology, chip layout, and operating temperature.

2.1 Building a Power Model Using an Empirical Method

Isci and Martonosi [12] proposed an empirical method to build a power model. They measured and modeled the Intel Pentium 4 processor. Equation (2) shows the basic power model discussed in [12]. It consists of the idle power plus the dynamic power of each hardware component:

    Power = Σ_{i=0}^{n} [AccessRate(C_i) × ArchitecturalScaling(C_i) × MaxPower(C_i) + NonGatedClockPower(C_i)] + IdlePower    (2)

The MaxPower and other heuristic terms are determined empirically by running several training benchmarks that stress a few architectural components at a time. Access rates are obtained from performance counters; they indicate how often an architectural unit is accessed per unit of time, where one is the maximum value. The NonGatedClockPower term is not used in our model, because the evaluated GPUs do not employ clock gating. IdlePower is the power consumption when a GPU is on but no application is running.

2.2 Static Power

As technology is scaled, static power consumption has increased [4]. To understand temperature effects on static power consumption, we briefly describe static power models. Butts and Sohi [6] presented the following simplified leakage power model for an architecture-level study:

    P_static = V_cc × N × K_design × Î_leak    (3)

where V_cc is the supply voltage, N is the number of transistors in the design, K_design is a constant factor that represents the technology characteristics, and Î_leak is a normalized leakage current for a single transistor that depends on the threshold voltage V_th. Later, Zhang et al. [24] improved this static power model to consider temperature effects and operating voltages in HotLeakage, a software tool. In their model, K_design is no longer constant, and the leakage current is a function of temperature, as shown in Equation (4):

    Î_leak = µ_0 × C_OX × (W/L) × e^(b(V_dd − V_dd0)) × v_t² × (1 − e^(−V_dd/v_t)) × e^((−|V_th| − V_off)/(n × v_t))    (4)

where v_t is the thermal voltage, which is represented by kT/q and thus depends on temperature; the threshold voltage V_th is also a function of temperature. Since v_t is the dominant temperature-dependent factor in Equation (4), the leakage power increases quadratically as temperature rises. However, in a normal operating temperature range, the leakage power can be simplified as a linear model of temperature [21].

3. POWER AND TEMPERATURE MODELS

3.1 Overall Model

The GPU power consumption (GPU_power) consists of the idle power plus the runtime power, as shown in Equation (5). Runtime_power is the additional power consumption required to execute programs on a GPU. It is the sum of the runtime powers from all SMs (RP_SMs) and the GDDR memory (RP_Memory), as shown in Equation (6):

    GPU_power = Runtime_power + IdlePower    (5)

    Runtime_power = Σ_{i=0}^{n} RP_Component_i = RP_SMs + RP_Memory    (6)
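To make the model structure concrete, here is a minimal C++ sketch of how Equations (2), (5), and (6) compose. It is an illustration only, not the authors' implementation: the struct layout and function names are our own, and the scaling and clock-gating terms are dropped because the evaluated GPUs do not employ clock gating.

    #include <vector>

    // Illustrative sketch of the empirical model structure: each hardware
    // component contributes AccessRate x MaxPower (Equation (2)), the sum of
    // the component powers forms Runtime_power (Equation (6)), and the idle
    // power is added on top (Equation (5)).
    struct Component {
        double access_rate;  // 0.0 .. 1.0, how often the unit is accessed
        double max_power;    // empirically determined per-unit maximum (W)
    };

    double runtime_power(const std::vector<Component>& components) {
        double sum = 0.0;
        for (const Component& c : components)
            sum += c.access_rate * c.max_power;  // RP_SMs + RP_Memory terms
        return sum;
    }

    double gpu_power(const std::vector<Component>& components, double idle_power) {
        return runtime_power(components) + idle_power;  // Equation (5)
    }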
3.2 Modeling Power Consumption from Streaming Multiprocessors

In order to model the runtime power of SMs, we decompose the SM into several physical components, as shown in Equation (7) and Table 1. The texture and constant caches are included in the SM_Component term because they are shared between multiple SMs in the evaluated GPU system: one texture cache is shared by three SMs, and each SM has its own constant cache. RP_Const_SM is a constant runtime power component for each active SM. It models the power consumption of several units, including the I-cache and the frame buffer, which always consume a relatively constant amount of power when a core is active.

    Σ_{i=0}^{n} SM_Component_i = RP_Int + RP_Fp + RP_Sfu + RP_Alu + RP_Texture_Cache + RP_Const_Cache + RP_Shared + RP_Reg + RP_FDS + RP_Const_SM    (7)

    RP_SMs = Num_SMs × Σ_{i=0}^{n} SM_Component_i    (8)

    Num_SMs: Total number of SMs in a GPU

Table 1 summarizes the modeled architectural components used by each instruction type and the corresponding variable names in Equation (7). All instructions access the FDS (Fetch/Dec/Sch) unit. For the register unit, we assume that all instructions accessing the register file have the same number of register operands per instruction, to simplify the model. The exact number of register accesses per instruction depends on the instruction type and the number of operands, but we found that the power consumption difference due to the number of register operands is negligible.

Table 1: List of instructions that access each architectural unit

PTX Instruction | Architectural Unit | Variable Name
add_int sub_int addc_int subc_int sad_int div_int rem_int abs_int mul_int mad_int mul24_int mad24_int min_int neg_int | Int. arithmetic unit | RP_Int
add_fp sub_fp mul_fp fma_fp neg_fp min_fp lg2_fp ex2_fp mad_fp div_fp abs_fp | Floating point unit | RP_Fp
sin_fp cos_fp rcp_fp sqrt_fp rsqrt_fp | SFU | RP_Sfu
xor cnot shl shr mov cvt set setp selp slct and or | ALU | RP_Alu
st_global ld_global | Global memory | RP_GlobalMem
st_local ld_local | Local memory | RP_LocalMem
tex | Texture cache | RP_Texture_Cache
ld_const | Constant cache | RP_Const_Cache
ld_shared st_shared | Shared memory | RP_Shared
setp selp slct and or xor shr mov cvt st_global ld_global ld_const add mad24 sad div rem abs neg shl min sin cos rcp sqrt rsqrt set mul24 sub addc subc mul mad cnot ld_shared st_local ld_local tex | Register file | RP_Reg
All instructions | FDS (Fetch/Dec/Sch) | RP_FDS

Access Rate: As Equation (2) shows, dynamic power consumption depends on the access rate of each hardware component. Isci and Martonosi used a combination of hardware performance counters to measure access rates [12]. Since GPUs do not have any hardware performance counters, we instead estimate hardware access rates based on the dynamic number of instructions and the execution times. Equation (9) shows how to calculate the runtime power of each component (RP_comp), such as RP_Reg: RP_comp is the multiplication of AccessRate_comp and MaxPower_comp. MaxPower_comp is described in Table 2 and will be discussed in Section 3.4. Note that RP_Const_SM is not dependent on AccessRate_comp.

    RP_comp = MaxPower_comp × AccessRate_comp    (9)

Equation (10) shows how to calculate the access rate of each component, AccessRate_comp. The dynamic number of accesses per thread for a component (DAC_per_th_comp) is the sum of the instructions that access that architectural component, as shown in Equation (11). Warps_per_SM, defined in Equation (12), indicates how many warps³ are executed in one SM. We divide the execution cycles by four because one instruction is fetched, scheduled, and executed every four cycles. This normalization also makes the maximum value of the AccessRate_comp term one.

    AccessRate_comp = (DAC_per_th_comp × Warps_per_SM) / (Exec_cycles / 4)    (10)

    DAC_per_th_comp = Σ_{i=0}^{n} Number_Inst_per_warp_i(comp)    (11)

    Warps_per_SM = (#Threads_per_block / #Threads_per_warp) × (#Blocks / #Active_SMs)    (12)

³ A warp is a group of threads that are fetched/executed together inside the GPU architecture.

3.3 Modeling Memory Power

The evaluated GPU system has five different memory spaces: global, shared, local, texture, and constant. The shared memory space uses a software-managed cache that is inside an SM. The texture and constant memories are located in the GDDR memory, but they mainly use caches inside an SM. The global memory and the local memory share the same physical GDDR memory, hence RP_Memory considers both. The shared, constant, and texture memory spaces are modeled separately as SM components.

    RP_Memory = Σ_{i=0}^{n} Memory_component_i = RP_GlobalMem + RP_LocalMem    (13)
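As a worked illustration of Equations (10)-(12), the access-rate computation can be written in a few lines of C++. The function names and argument order are ours, but the arithmetic follows the equations directly.

    // Equation (12): how many warps are executed in one SM.
    double warps_per_sm(double threads_per_block, double blocks,
                        double threads_per_warp, double active_sms) {
        return (threads_per_block / threads_per_warp) * (blocks / active_sms);
    }

    // Equation (10): access rate of one component. dac_per_thread is the
    // per-warp count of instructions that touch the unit (Equation (11)).
    // Dividing by exec_cycles/4 normalizes the maximum rate to one, since
    // one instruction is fetched/scheduled/executed every four cycles.
    double access_rate(double dac_per_thread, double warps_per_sm_,
                       double exec_cycles) {
        return (dac_per_thread * warps_per_sm_) / (exec_cycles / 4.0);
    }

For example, a kernel with 256 threads per block and 120 blocks on 30 active SMs, with a warp size of 32, yields Warps_per_SM = (256/32) × (120/30) = 32.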
3.4 Power Model Parameters

To obtain the power model parameters, we design a set of synthetic microbenchmarks that stress different architectural components in the GPU. Each microbenchmark has a loop that repeats a certain set of instructions; for example, the microbenchmark that stresses the FP units contains a high ratio of FP instructions. We search for the set of MaxPower_comp values in Equation (9) that minimizes the error between the measured power and the outcome of the equation. To avoid searching through a large space of values, the initial seed value for each architectural unit is estimated based on the relative physical die size of the unit [12].

Table 2 shows the parameters used for MaxPower_comp. Eight power components require a special piecewise linear approach [12]: an initial increase from idle to a relatively low access rate causes a large increase in power consumption, while further increases cause smaller ones. The Spec.Linear column indicates whether the AccessRate_comp term in Equation (9) needs to be replaced with the special piecewise linear access rate based on the following simple conversion: 0.1365 × ln(AccessRate_comp) + 1.001375. The parameters in this conversion are determined empirically to produce a piecewise linear function.

Table 2: Empirical power parameters

Units | MaxPower | OnChip | Spec.Linear
FP | 0.2 | Yes | Yes
REG | 0.3 | Yes | Yes
ALU | 0.2 | Yes | No
SFU | 0.5 | Yes | No
INT | 0.25 | Yes | Yes
FDS (Fetch/Dec/Sch) | 0.5 | Yes | Yes
Shared memory | 1 | Yes | No
Texture cache | 0.9 | Yes | Yes
Constant cache | 0.4 | Yes | Yes
Const_SM | 0.813 | Yes | No
Global memory | 52 | No | Yes
Local memory | 52 | No | Yes

Figure 3 shows how the overall power is distributed among the individual architectural components for all the evaluated benchmarks (Section 5 presents the detailed benchmark descriptions and the evaluation methodology). On average, the memory, the idle power, and RP_Const_SM consume more than 60% of the total GPU power. REG and FDS also consume relatively more power than the other components because almost all instructions access these units.

3.5 Active SMs vs. Power Consumption

To measure the power consumption of each SM, we design another set of microbenchmarks that control the number of active SMs. These microbenchmarks are designed such that only one block can be executed in each SM; thus, as we vary the number of blocks, the number of active SMs varies as well. Even though the evaluated GPU does not employ power gating, idle SMs do not consume as much power as active SMs do because of low activity factors [18] (i.e., idle SMs do not change values in circuits as often as active SMs do). Hence, there are still significant differences in the total power consumption depending on the number of active SMs in a GPU.

[Figure 4: Power consumption vs. active SMs (measured and estimated)]

Figure 4 shows the increase in power consumption as we increase the number of active SMs. The maximum power delta between using only one SM and using all SMs is 37W. Since there is no power gating, the power consumption does not increase linearly with the number of SMs, so we use a log-based model instead of a linear curve, as shown in Equation (14). We model the memory power consumption with the same log-based trend, although it is not directly dependent on the number of active SMs. Finally, the runtime power is modeled as a function of the number of active SMs, as shown in Equation (16):

    RP_SMs = Max_SM × log10(α × Active_SMs + β)    (14)

    Max_SM = Num_SMs × Σ_{i=0}^{n} SM_Component_i    (15)

    α = (10 − β) / Num_SMs, β = 1.1

    Runtime_power = (Max_SM + RP_Memory) × log10(α × Active_SMs + β)    (16)

    Active_SMs: Number of active SMs in the GPU
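The log-based scaling of Equation (16) is compact enough to state directly in code. The following C++ sketch is an illustration under the paper's constants (β = 1.1), not the authors' implementation.

    #include <cmath>

    // Equations (14)-(16): runtime power as a log10 function of the number
    // of active SMs. With active_sms == num_sms the argument of the log
    // becomes exactly 10, so the full (Max_SM + RP_Memory) power is returned.
    double runtime_power_log(double max_sm, double rp_memory,
                             int active_sms, int num_sms) {
        const double beta  = 1.1;
        const double alpha = (10.0 - beta) / num_sms;
        return (max_sm + rp_memory) * std::log10(alpha * active_sms + beta);
    }

The log form captures the measured behavior that idle-but-ungated SMs still draw power: going from 29 to 30 active SMs adds far less power than going from 1 to 2.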

3.6 Temperature Model

Chip temperature models are typically represented by an RC model [20]. We determine the model parameters empirically by using a step-function experiment. Equation (17) models the rising temperature, and Equation (18) models the decaying temperature:

    Temperature_rise(t) = Idle_temp + δ (1 − e^(−t/RC_Rise))    (17)

    Temperature_decay(t) = Idle_temp + γ e^(−t/RC_Decay)    (18)

    δ = Max_temp − Idle_temp, γ = Decay_temp − Idle_temp

    Idle_temp: Idle operating chip temperature
    Max_temp: Maximum temperature, which depends on runtime power
    Decay_temp: Chip temperature right before decay

[Figure 5: Temperature effects from power; (top) measured and estimated temperature, (bottom) measured power]

Figure 5 shows the estimated and measured temperature variations. Both the chip temperature and the board temperature are measured with the built-in sensors in the GPU. Max_temp is a function of the runtime power, which depends on application characteristics. We discovered that the chip temperature is strongly affected by the rate of GDDR memory accesses, not only by the runtime power consumption. Hence, the maximum temperature is modeled as a combination of the two, as shown in Equation (19); the model parameters are described in Table 3. Note that Memory_Insts includes global and local memory instructions.

    Max_temp(Runtime_Power) = (µ × Runtime_Power) + λ + ρ × MemAccess_intensity    (19)

    MemAccess_intensity = Memory_Insts / NonMemory_Insts    (20)


Figure 3: Power breakdown graph for all the evaluated benchmarks

Table 3: Parameters for GTX280

Parameter | Value
µ | 0.120
λ | 5.5
ρ | 21.505
RC_Rise | 35
RC_Decay | 60

3.7 Modeling Increases in Static Power Consumption

Section 2.2 discussed the impact of temperature on static power consumption. Because of the high number of processors in the GPU chip, we observe an increase in runtime power consumption as the chip temperature increases, as shown in Figure 6. To consider the increase in static power consumption, we include the temperature model (Equations (17) and (18)) in the runtime power consumption model. We use a linear model to represent the increase in static power, as discussed in Section 2.2. Since we cannot control the operating voltage of the evaluated GPUs at runtime, we only consider operating temperature effects.

[Figure 6: Static power effects (measured GPU power and the power/temperature deltas over time after the program starts)]

Figure 6 shows that the power consumption increases gradually over time after an application starts,⁴ and the delta is 14 watts. This delta could be caused by an increase in static power consumption or by additional fan power. By manually controlling the fan speed from lowest to highest, we measured that the additional fan power consumption is only 4W. Hence, the remaining 10 watts of the power consumption increase is modeled as the additional static power increase that results from the increase in temperature. Equation (21) shows the comprehensive power equation over time that includes the increased static power consumption, which depends on σ, the ratio of the power delta over the temperature delta (σ = 10/22). Note that Runtime_power_0 is the initial power consumption obtained from Equation (16), and the model assumes a cold start (i.e., the system is initially in the idle state). Temperature(t) in Equation (23) is obtained from Equation (17) or (18).

⁴ An initial jump in power consumption exists when an application starts.

    GPU_power(t) = Runtime_power(t) + IdlePower    (21)

    Runtime_power(t) = Runtime_power_0 + σ × Delta_temp(t)    (22)

    Delta_temp(t) = Temperature(t) − Idle_temp    (23)
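Putting Equations (17) and (21)-(23) together, the temperature-corrected power at time t can be sketched as below. σ = 10/22 W per degree follows the text; the remaining inputs (RC_Rise, Idle_temp, Max_temp) are per-application measurements, and the cold-start assumption matches the model's.

    #include <cmath>

    // Equation (17): rising chip temperature after a cold start.
    double temperature_rise(double t, double idle_temp, double max_temp,
                            double rc_rise) {
        double delta = max_temp - idle_temp;
        return idle_temp + delta * (1.0 - std::exp(-t / rc_rise));
    }

    // Equations (21)-(23): runtime power grows with the temperature delta.
    double gpu_power_at(double t, double runtime_power0, double idle_power,
                        double idle_temp, double max_temp, double rc_rise) {
        const double sigma = 10.0 / 22.0;  // watts per degree, from the text
        double delta_temp =
            temperature_rise(t, idle_temp, max_temp, rc_rise) - idle_temp;
        return (runtime_power0 + sigma * delta_temp) + idle_power;
    }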

4. IPP: INTEGRATED POWER AND PERFORMANCE MODEL

In this section, we describe the integrated power and performance model, which predicts performance per watt and the optimal number of active cores. The integrated power and performance model (IPP) uses predicted execution times, instead of measured execution times, to predict power consumption.

4.1 Execution Time and Access Rate Prediction

In Section 3, we developed a power model that computes access rates by using measured execution time information. Predicting power at static time requires the access rates in advance. In other words, we also need to predict the execution time of an application in order to predict its power. We use a recently developed GPU analytical timing model [9] to predict the execution time. The model is briefly explained in this section; please refer to the analytical timing model paper [9] for the detailed descriptions.

In the timing model, the total execution time of a GPGPU application is calculated with one of Equations (24), (25), and (26), based on the number of running threads, MWP, and CWP in the application. MWP represents the number of memory requests that can be serviced concurrently, and CWP represents the number of warps that can finish one computational period during one memory access period. N is the number of running warps. Mem_L is the average memory latency (430 cycles for the evaluated GPU architecture). Mem_cycles is the number of processor waiting cycles for memory operations. Comp_cycles is the execution time of all instructions. #Repw is the number of times that each SM needs to repeat the same set of computation.

    Case 1: If (MWP is N warps per SM) and (CWP is N warps per SM)
    Exec_cycles = (Mem_cycles + Comp_cycles + (Comp_cycles / #Mem_insts) × (MWP − 1)) × #Repw    (24)

    Case 2: If (CWP >= MWP) or (Comp_cycles > Mem_cycles)
    Exec_cycles = (Mem_cycles × (N / MWP) + (Comp_cycles / #Mem_insts) × (MWP − 1)) × #Repw    (25)

    Case 3: If (MWP > CWP)
    Exec_cycles = (Mem_L + Comp_cycles × N) × #Repw    (26)

IPP calculates AccessRate_comp using Equation (27), where the predicted execution cycles (Predicted_Exec_Cycles) are calculated with one of Equations (24), (25), and (26):

    AccessRate_comp = (DAC_per_th_comp × Warps_per_SM) / (Predicted_Exec_Cycles / 4)    (27)
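The three cases translate directly into a small selection function. This C++ sketch is illustrative; the paper's model derives MWP, CWP, and the cycle counts analytically, whereas the sketch takes them as inputs.

    // Equations (24)-(26): predicted execution cycles for the three cases.
    double predicted_exec_cycles(double mem_cycles, double comp_cycles,
                                 double mem_l, double n_warps, double mwp,
                                 double cwp, double mem_insts, double repw) {
        if (mwp == n_warps && cwp == n_warps)        // Case 1: too few warps
            return (mem_cycles + comp_cycles +
                    comp_cycles / mem_insts * (mwp - 1.0)) * repw;
        if (cwp >= mwp || comp_cycles > mem_cycles)  // Case 2: incl. BW-limited
            return (mem_cycles * n_warps / mwp +
                    comp_cycles / mem_insts * (mwp - 1.0)) * repw;
        return (mem_l + comp_cycles * n_warps) * repw;  // Case 3: compute-bound
    }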

4.2 Optimal Number of Cores for the Highest Performance/Watt

IPP predicts the optimal number of SMs that would achieve the highest performance/watt. As we showed in Figure 1, the performance of an application either increases linearly (in this case, the optimal number of SMs is always the maximum number of cores) or non-linearly (the optimal number of SMs is less than the maximum number of cores). Performance per watt can be calculated using Equation (28):

    perf. per watt(# of cores) = (work / execution_time(# of cores)) / power(# of cores)    (28)

Equations (24), (25), and (26) calculate execution times. Among the three cases, only Case 2 includes a memory bandwidth-limited case. Case 1 is used when there are not enough running threads in the system, and Case 3 models an application that is computationally intensive, so neither Case 1 nor Case 3 can ever reach the peak memory bandwidth. To understand the memory bandwidth-limited case, let's look at MWP more carefully. The following equations show the steps in calculating MWP, the number of memory requests that can be serviced concurrently. As shown in Equation (29), MWP is the minimum of MWP_Without_BW, MWP_peak_BW, and N, where N is the number of running warps. If there are not enough warps, MWP is limited by the number of running warps. If an application is limited by memory bandwidth, MWP is determined by MWP_peak_BW, which is a function of the memory bandwidth and the number of active SMs. Note that Departure_delay represents the pipeline delay between two consecutive memory accesses; it depends on both the memory system and the memory access types (coalesced or uncoalesced) in the application.

    MWP = MIN(MWP_Without_BW, MWP_peak_BW, N)    (29)

    MWP_peak_BW = Mem_Bandwidth / (BW_per_warp × #ActiveSM)    (30)

    BW_per_warp = (Freq × Load_bytes_per_warp) / Mem_L    (31)

    MWP_Without_BW_full = Mem_L / Departure_delay    (32)

    MWP_Without_BW = MIN(MWP_Without_BW_full, N)    (33)

We could calculate d(perf. per watt(# of active cores)) / d(# of active cores) = 0 to find the optimal number of cores. However, we observed that once MWP_peak_BW reaches N, the application usually reaches the peak bandwidth. Hence, based on Equation (30), we conclude that the optimal number of cores can be calculated with the following simplified rule:

    if (1) (MWP == N) or (CWP == N), or (2) MWP > CWP, or (3) MWP < MWP_peak_BW:
        Optimal # of cores = maximum available # of cores    (34)
    else:
        Optimal # of cores = Mem_Bandwidth / (BW_per_warp × N)
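The decision rule of Equation (34) amounts to a few comparisons. The following sketch shows one way to implement it; the helper name and the integer rounding are our choices, not the paper's.

    #include <algorithm>

    // Equation (34): if the kernel cannot saturate the memory bandwidth,
    // all cores are worth enabling; otherwise enough cores to reach the
    // peak bandwidth suffice.
    int optimal_num_cores(double mwp, double cwp, double n_warps,
                          double mwp_peak_bw, double mem_bandwidth,
                          double bw_per_warp, int max_cores) {
        bool cannot_saturate = (mwp == n_warps) || (cwp == n_warps) ||
                               (mwp > cwp) || (mwp < mwp_peak_bw);
        if (cannot_saturate)
            return max_cores;
        int cores = static_cast<int>(mem_bandwidth / (bw_per_warp * n_warps));
        return std::min(std::max(cores, 1), max_cores);
    }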
4.3 Limitations of IPP

IPP requires both the power and the timing models, thereby inheriting their limitations. Examples of such limitations include control-flow-intensive applications, asymmetric applications, and texture-cache-intensive applications. IPP also requires instruction information. However, IPP does not require the actual total number of instructions; it calculates only access rates, which can easily be normalized with an input data size. Nonetheless, if an application shows significantly different behavior depending on the input size, IPP needs to consider input size effects, which will be addressed in our future work.

4.4 Using Results of IPP

In this paper, we constrain the number of active cores based on the output of IPP by limiting only the number of blocks inside an application, since we cannot change the hardware or the thread scheduler. If the number of active cores could be directly controlled by the hardware or by a runtime thread scheduler, compilers or programmers would not have to change their applications to utilize fewer cores. Instead, IPP would only pass the optimal number of cores to the runtime system, and either the hardware or the runtime thread scheduler would enable only the required number of cores to save energy.

5. METHODOLOGY

5.1 Power and Temperature Measurement

The NVIDIA GTX280 GPU, which has 30 SMs and uses a 65nm technology, is used in this work. We use the Extech 380801 AC/DC Power Analyzer [1] to measure the overall system power consumption. The raw power data is sent to a data-log machine every 0.5 seconds. Each microbenchmark executes for an average of 10 seconds.

Since we measure the input power to the entire system, we have to subtract IdlePower_System (159W) from the total system input power to obtain GPU_Power.⁵ The IdlePower value for the evaluated GPU is 83W. The GPU temperature is measured with the nvclock utility [16]; the command "nvclock -i" outputs the board and chip temperatures. Temperature is measured every second.

⁵ IdlePower_System is obtained by measuring the system power with another GPU card whose idle power is known.
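For flavor, a microbenchmark of the kind described in Sections 3.4 and 5.1 might look like the following CUDA sketch. The kernel body, iteration count, and launch configuration are illustrative assumptions, not the authors' actual code.

    // Hypothetical FP-stress microbenchmark: a loop that repeats a set of
    // floating-point instructions while generating almost no memory traffic.
    __global__ void fp_stress(float* out, int iters) {
        int tid = blockIdx.x * blockDim.x + threadIdx.x;
        float a = 1.0f + tid, b = 0.5f * tid, c = 0.0f;
        for (int i = 0; i < iters; ++i) {
            c = a * b + c;              // high ratio of FP multiply-adds
            a = c * 0.999f + 1.0f;
        }
        out[tid] = c;  // one store so the compiler cannot drop the loop
    }

Launching exactly one block per SM, as in Section 3.5, makes the number of blocks directly control the number of active SMs.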

5.2 Benchmarks

To test the accuracy of our IPP system, we use the Merge benchmarks [15, 9], five additional memory bandwidth-limited benchmarks (Nmat, Dotp, Madd, Dmadd, and Mmul), and one computationally intensive (i.e., non-memory bandwidth-limited) benchmark (Cmem). Table 4 describes each benchmark and summarizes its characteristics. To calculate the number of dynamic instructions, we use a GPU PTX emulator, Ocelot [13], which also classifies instruction types.

6. RESULTS

6.1 Evaluation of the Runtime Power Model

Figure 7 compares the predicted power consumption with the measured power values for the microbenchmarks. According to Figure 3, the global memory consumes the largest amount of power. The MB4, MB8, and MEM benchmarks consume much greater power than the FP benchmark, which consists mainly of floating-point instructions. Surprisingly, the benchmarks that use the texture cache or the constant cache also consume high power. This is because both the texture cache and the constant cache have a higher MaxPower than that of the FP unit. The geometric mean of the power prediction error for the microbenchmarks is 2.5%. Figure 8 shows the access rates for each microbenchmark. When an application does not have many memory operations, such as the FP benchmark, the dynamic access rates for FP or REG can be very close to one. FDS is one when an application reaches the peak performance of the machine.

[Figure 7: Comparison of measured and predicted GPU power consumption for the microbenchmarks]

[Figure 8: Dynamic access rates of the microbenchmarks]

Figure 9 compares the predicted and the measured power consumption for the evaluated GPGPU kernels. The geometric mean of the power prediction error is 9.18% for the GPGPU kernels. Figure 10 shows the dynamic access rates; the complete breakdown of the GPU power consumption is shown in Figure 3. Bino and Conv have lower global memory access rates than the others, which results in less power consumption. Sepia and Bs are high-performance applications, which explains why they have high REG and FDS values. All the memory bandwidth-limited benchmarks have higher power consumption even though they have relatively lower FP/REG/FDS access rates.

[Figure 9: Comparison of measured and predicted GPU power consumption for the GPGPU kernels]

[Figure 10: Dynamic access rates of the GPGPU kernels]

6.2 Temperature Model

Figure 11 displays the predicted chip temperature over time for all the evaluated benchmarks. The initial temperature is 57°C, the typical GPU cold-state temperature in our evaluated system. The temperature saturates after around 600 seconds. The peak temperature depends on the peak runtime power consumption, and it varies from 68°C (the INT benchmark) to 78°C (SVM). Based on Equation (22), we can predict that the runtime power of SVM would increase by 10W after 600 seconds, whereas for the INT benchmark it would increase by only 5W.

[Figure 11: Peak temperature prediction for the benchmarks. Initial temperature: 57°C]


6.3 Power Prediction Using IPP

Figure 12 shows the power prediction of IPP for both the microbenchmarks and the GPGPU kernels. These are equivalent to the experiments in Section 6.1; the main difference is that Section 6.1 requires measured execution times, while IPP uses times predicted with the equations in Section 4. Using predicted times could have increased the error of the power predictions, but since the error of the timing model is not high, the overall error of the IPP system is not significantly increased. The geometric mean of the power prediction error of IPP is 8.94% for the GPGPU kernels and 2.7% for the microbenchmarks, which is similar to using real execution time measurements.

216 Measured 225 Measured 192 IPP 200 IPP 168 175 144 150 120 125 96 100 72 75 GPU Power (W) GPU Power (W) 48 50 24 25 0 0 FP MB4 MB8 MEM INT CONST TEX SHARED SVM Bino Sepia Conv Bs Nmat Dotp Madd Dmadd Mmul Cmem

Figure 12: Comparison of measured and IPP predicted GPU power comparison (Left:Microbenchmarks, Right:GPGPU kernels)

12 Dotp 6.4 Performance and Power Efficiency Madd 10 Dmadd Prediction Using IPP Mmul Nmat Based on the conditions in Equation (34), we identify the bench- 8 Cmem Dotp (IPP) marks that reach the peak memory bandwidth. The five merge Madd (IPP) 6 Dmadd (IPP) benchmarks do not reach the peak memory bandwidth as shown in GIPS Mmul (IPP) 4 Nmat (IPP) Table 4. CWP values in Bino, Sepia and Conv are equal to or less Cmem (IPP) than the MWP values of them, so these benchmarks cannot reach 2 the peak memory bandwidth. Both SVM’s MWP (5.878) and Bs’s 0 MWP (3) are less than MWP_peak_BW (10.8). Thus they cannot 5 10 15 20 25 30 reach the peak memory bandwidth also. Number of Active Cores To further evaluate our IPP system, we use the benchmarks that reach the peak memory bandwidth (the 3rd column in Table 4 shows Figure 13: GIPS vs. Active Cores 160 the average memory bandwidth of each application). We also in- Dotp clude one non-bandwidth limited benchmark (Cmem) for a com- 140 Madd Dmadd 120 parison. In this experiment, we vary the number of active cores Mmul 100 Nmat by varying the number of blocks in the CUDA applications. We Cmem design the applications such that one SM executes only one block. 80 Note that, all different configurations (in this section) of one appli- 60

cation have the exact same amount work. So, as we use fewer cores Bandwidth (GB/s) 40 (i.e., fewer blocks), each core (or block) executes more number of 20 6 instructions. We use Giga Instructions Per Sec (GIPS) instead of 0 Gflops/s for a metric. 5 10 15 20 25 30 Number of Active Cores Figure 13 shows how GIPS varies with the number of active cores for both the actual measured data and the predictions of IPP. Figure 14: Average measured bandwidth consumption vs. # of Only Cmem has a linear performance improvement in both the active cores measured data and the predicted values. The rest of the benchmarks show a nearly saturated performance as we increase the number of active cores. IPP still predicts GIPS values accurately except for its bandwidth consumption and the number of active cores, but it Cmem. Although the predicted performance of Cmem does not still cannot reach the peak memory bandwidth. The memory band- exactly match the actual performance, IPP still correctly predicts widths of the remaining benchmarks are saturated when the number the trend. Nmat shows higher performance than other bandwidth of active cores is around 19. This explains why the performance of limited benchmarks, because it has a higher arithmetic intensity. these benchmarks is not improved significantly after approximately Figure 14 shows the actual bandwidth consumption of the ex- 19 active cores. periment in Figure 13. Cmem shows a linear correlation between Figure 15 shows GIPS/W for the same experiment. The results 6We decide to use GIPS instead of Gflop/s because the performance show both the actual GIPS/W and the predicted GIPS/W using IPP. efficiency should include non-floating point instructions. Nmat shows a salient peak point, but for the rest of benchmarks, 0.06 Dotp (IPP) ergating is the predicted energy savings if power gating is applied. Madd (IPP) 0.05 Nmat (IPP) The average energy savings for Runtime cases is 10.99%. Dotp

0.04 Madd

0 %

Nmat 3

r e

0.03 s

e

o

r

v

2 0 %

o

s

R u n t i m e + I d l e

g c

GIPS / W 0.02

n

i

R u n t i m e

a x

m

1 0 %

S a v

P o w e r g a t i n g

0.01 g

g y

n

i

r

s

e

u

0 %

0.00 E n

5 10 15 20 25 30

D o t M a d d D m a d M m u l N m a t Number of Active Cores 0.12 Cmem (IPP) Dmadd (IPP) 0.10 Mmul (IPP) Figure 17: Energy savings using the optimal number of cores Cmem based on the IPP system (NVIDIA GTX 280 and power gating 0.08 Dmadd Mmul GPUs) 0.06

GIPS / W 0.04 6.4.2 Energy Savings in Power Gating GPUs 0.02 The current NVIDIA GPUs do not employ any per-core power 0.00 5 10 15 20 25 30 gating mechanism. However, future GPU architectures could em- Number of Active Cores ploy power gating mechanisms as a result of the growth in the num- ber of cores. As a concrete example, CPUs have already made use Figure 15: Performance per watt variation vs. # of active cores of per-core power gating [11]. for measured and the predicted values To evaluate the energy savings in power gating processors, we predict the GPU power consumption as a linear function of the number of active cores. For example, if 30 SMs consume total the efficiency (GIPS/W) has a very smooth curve. As we have ex- 120W for an application, we assume that each core consumes 4W pected, only GIPS/W of Cmem increases linearly in both the mea- when per-core power gating is used. There is no reason to differ- sured data and the predicted data. entiate between Runtime+Idle and Runtime power since the power 0.20 gating mechanism eliminates idle power consumption from in-active GIPS/W_Measured GIPS/W_IPP cores. Figure 17 shows the predicted amount of energy savings for 0.16 the GPU cores that employ power gating. Since power consump- tion of each individual core is much smaller in a power-gating sys- 0.12 tem, the amount of energy savings is much higher than in the cur-

0.08 rent NVIDIA GTX280 processors. When power gating is applied, GIPS / W the average energy savings is 25.85%. Hence, utilizing only fewer 0.04 cores based on the outcomes of IPP will be more beneficial in fu- ture per-core power-gating processors. 0.00 SVM Bino Sepia Conv Bs Nmat Dotp Madd Dmadd Mmul Cmem

Figure 16: GIPS/W for the GPGPU kernels 7. RELATED WORK

Figure 16 shows GIPS/W for all the GPGPU kernels running on 7.1 Power Modeling 30 active cores. The GIPS/W values of the non-bandwidth limited Isci and Martonosi proposed power modeling using empirical benchmarks are much higher than those of the bandwidth limited data [12]. There have been follow-up studies that use similar tech- benchmarks. GIPS/W values can vary significantly from applica- niques for other architectures [7]. Wattch [5] has been widely tion to application depending on their performance. The results used to model dynamic power consumption using event counters also include the predicted GIPS/W using IPP. Except for Bino and from architectural simulations. HotLeakage models leakage cur- Bs, IPP predicts GIPS/W values fairly accurately. The errors in the rent and power based on circuit modeling and dynamic events [24]. predicted GIPS/W values of Bino and Bs are attributed to the differ- Skadron et al. proposed temperature aware mod- ences between their predicted and measured runtime performance. eling [20] and also released a software, HotSpot. Both HotLeak- age and HotSpot require architectural simulators to model dynamic 6.4.1 Energy Savings by Using the Optimal Number power consumption. All these studies were done only for CPUs. of Cores Based on IPP Sheaffer et al. studied a thermal management for GPUs [19]. Based on Equation (34), IPP calculates the optimal number of In their work, the GPU was a fixed graphics hardware. Fu et al. cores for a given application. This is a simple way of choosing presented experimental data of a GPU system and evaluated the the highest GIPS/W point among different number of cores. IPP efficiency of energy and power [8]. returns 20 for all the evaluated memory bandwidth limited bench- Our work is also based on empirical CPU power modeling. The marks and 30 for Cmem. biggest contribution of our GPU model over the previous CPU Figure 17 shows the difference in energy savings between the use models is that we propose a GPU power model that does not re- of the optimal number of cores and the maximum number (30) of quire performance measurements. By integrating an analytical tim- cores. Runtime+Idle shows the energy savings when the total GPU ing model and an empirical power model, we are able to predict the power is used in the calculation. Runtime shows the energy savings power consumption of GPGPU workloads with only the instruc- when only the runtime power from the equation (5) is used. Pow- tion mixture information. We also extend the GPU power model to model increases in the leakage power consumption over time, Georgia Tech Innovation Grant, Intel Corporation, Microsoft Re- which is becoming a critical component in many-core processors. search, and the equipment donations from NVIDIA.

7.2 Using Fewer Number of Cores 9. REFERENCES Huang et al. evaluated the energy efficiency of GPUs for sci- [1] Extech 380801. http://www.extech.com/instrument/ products/310_399/380801.html entific computing [10]. Their work demonstrated the efficiency for . [2] NVIDIA GeForce series GTX280, 8800GTX, 8800GT. only one benchmark and concluded that using all the cores provides http://www.nvidia.com/geforce. the best efficiency. They did not consider any bandwidth limitation [3] Nvidia’s next generation compute architecture. effects. http://www.nvidia.com/fermi. Li and Martinez studied power and performance considerations [4] S. Borkar. Design challenges of technology scaling. IEEE Micro, for CMPs [14]. They also analytically evaluated the optimal num- 19(4):23–29, 1999. ber of processors for best power/energy/EDP. However, their work [5] D. Brooks, V. Tiwari, and M. Martonosi. Wattch: a framework for was focused on CMP and presented heuristics to reduce design architectural-level power analysis and optimizations. In ISCA-27, space search using power and performance models. 2000. Suleman et al. proposed a feedback driven threading mecha- [6] J. A. Butts and G. S. Sohi. A static power model for architects. Microarchitecture, 0:191–201, 2000. nism [22]. By monitoring the bandwidth consumption using a hard- [7] G. Contreras and M. Martonosi. Power prediction for intel xscale ware , their feedback system decides how many threads processors using performance monitoring unit events. In ISLPED, (cores) can be run without degrading performance. Unlike our 2005. work, it requires runtime profiling to know the minimum number [8] R. Fu, A. Zhai, P.-C. Yew, W.-C. Hsu, and J. Lu. Reducing queuing of threads to reach the peak bandwidth. Furthermore, they demon- stalls caused by data prefetching. In INTERACT-11, 2007. strate power savings through simulation without a detailed power [9] S. Hong and H. Kim. An analytical model for a gpu architecture with model. The IPP system predicts the number of cores that reaches memory-level and thread-level parallelism awareness. In ISCA, 2009. the peak bandwidth at static time, thereby allowing the compiler or [10] S. Huang, S. Xiao, and W. Feng. On the energy efficiency of graphics thread scheduler to use that information without any runtime profil- processing units for scientific computing. In IPDPS, 2009. [11] Intel. Intel R Nehalem Microarchitecture. ing. Furthermore, we demonstrate the power savings by using both http://www.intel.com/technology/architecture-silicon/next-gen/. the detailed power model and the real system. [12] C. Isci and M. Martonosi. Runtime power monitoring in high-end processors: Methodology and empirical data. In MICRO, 2003. [13] A. Kerr, G. Diamos, and S. Yalamanchili. A characterization and 8. CONCLUSIONS analysis of ptx kernels. In IISWC, 2009. In this paper, we proposed an integrated power and performance [14] J. Li and J. F. Martínez. Power-performance considerations of modeling system (IPP) for the GPU architecture and the GPGPU on chip multiprocessors. ACM Trans. Archit. kernels. IPP extends the empirical CPU modeling mechanism to Code Optim., 2(4):397–422, 2005. model the GPU power and also considers the increases in leakage [15] M. D. Linderman, J. D. Collins, H. Wang, and T. H. Meng. Merge: a programming model for heterogeneous multi-core systems. In power consumption that resulted from the increases in tempera- ASPLOS XIII, 2008. ture. Using the proposed power model and the newly-developed [16] NVClock. Nvidia overclocking on Linux. 
timing model, IPP predicts performance per watt and also the opti- http://www.linuxhardware.org/nvclock/. mal number of cores to achieve energy savings. [17] NVIDIA Corporation. CUDA Programming Guide, V3.0. The power model using IPP predicts the power consumption and [18] K. Roy, S. Mukhopadhyay, and H. Mahmoodi-Meimand. Leakage the execution time with an average of 8.94% error for the evalu- current mechanisms and leakage reduction techniques in ated GPGPU kernels. IPP predicts the performance per watt and deep-submicrometer circuits. Proceedings of the IEEE, the optimal number of cores for the five evaluated bandwidth lim- 91(2):305–327, Feb 2003. [19] J. W. Sheaffer, D. Luebke, and K. Skadron. A flexible simulation ited GPGPU kernels. Based on IPP, the system can save on average framework for graphics architectures. In HWWS, 2004. 10.99% of runtime energy consumption for the bandwidth limited [20] K. Skadron, M. R. Stan, K. Sankaranarayanan, W. Huang, applications by using fewer cores. We demonstrated the power sav- S. Velusamy, and D. Tarjan. Temperature-aware microarchitecture: ings in the real machine. We also calculated the power savings Modeling and implementation. ACM Trans. Archit. Code Optim., if a per-core power gating mechanism is employed, and the result 1(1):94–125, 2004. shows an average of 25.85% in energy reduction. [21] H. Su, F. Liu, A. Devgan, E. Acar, and S. Nassif. Full chip leakage The proposed IPP system can be used by a thread scheduler estimation considering power supply and temperature variations. In ( system) as we have discussed in the paper. It ISLPED, 2003. [22] M. A. Suleman, M. K. Qureshi, and Y. N. Patt. Feedback driven can be also used by compilers or programmers to optimize program threading: Power-efficient and high-performance execution of configurations as we have demonstrated in the paper. In our future multithreaded workloads on cmps. In ASPLOS-XIII, 2008. work, we will incorporate dynamic voltage and frequency control [23] S. Williams, A. Waterman, and D. Patterson. Roofline: an insightful systems in the power and performance model. visual performance model for multicore architectures. Commun. ACM, 52(4):65–76, 2009. [24] Y. Zhang, D. Parikh, K. Sankaranarayanan, K. Skadron, and M. Stan. Acknowledgments Hotleakage: A temperature-aware model of subthreshold and gate leakage for architects. Technical report, University of Virginia, 2003. Special thanks to Gregory Diamos, Andrew Kerr, and Sudhakar Yalamanchili for Ocelot support. We thank the anonymous review- ers for their comments. We also thank Mike O’Connor, Alex Mer- ritt, Tom Dewey, David Tarjan, Dilan Manatunga, Nagesh Laksh- minarayana, Richard Vuduc, Chi-keung Luk, and HParch members for their feedback on improving the paper. We gratefully acknowl- edge the support of NSF CCF0903447, NSF/SRC task 1981, 2009