Power and Performance Analysis of GPU-Accelerated Systems

Yuki Abe (Kyushu University), Hiroshi Sasaki (Kyushu University), Martin Peres (Laboratoire Bordelais de Recherche en Informatique), Koji Inoue (Kyushu University), Kazuaki Murakami (Kyushu University), Shinpei Kato (Nagoya University)

Abstract

Graphics processing units (GPUs) provide significant improvements in performance and performance-per-watt compared to traditional multicore CPUs. This energy efficiency has facilitated the use of GPUs in many application domains. Albeit energy efficient, GPUs consume non-trivial power independently of CPUs. Therefore, we need to analyze the power and performance characteristics of GPUs and their causal relation with CPUs in order to reduce the total energy consumption of the system while sustaining high performance. In this paper, we provide a power and performance analysis of GPU-accelerated systems for a better understanding of these implications. Our analysis on a real system discloses that system energy can be reduced by 28% while keeping the decrease in performance within 1% by controlling the voltage and frequency levels of GPUs. We show that energy savings can be achieved when GPU core and memory clock frequencies are appropriately scaled considering the workload characteristics. Another interesting finding is that voltage and frequency scaling of CPUs contributes little to total system energy reduction, and should not even be applied in state-of-the-art GPU-accelerated systems. We believe that these findings are useful for developing dynamic voltage and frequency scaling (DVFS) algorithms for GPU-accelerated systems.

[Figure 1: Performance-per-watt trends (GFLOPS/Watt, 2006-2012) on representative NVIDIA GPUs [11] and Intel CPUs [6].]

1 Introduction

Graphics processing units (GPUs) have been increasingly deployed in general-purpose application domains due to their significant improvements in performance and performance-per-watt. As depicted in Figure 1, the performance-per-watt of GPUs greatly outperforms that of traditional multicore CPUs. Albeit energy efficient, GPUs consume non-trivial power during operation. However, commodity system software for GPUs is not well designed to control their power consumption, as it is primarily tailored to accelerate computations. To the best of our knowledge, commodity system software does not employ any voltage and frequency scaling for GPUs, even though most computational pieces of GPU-accelerated systems are offloaded onto GPUs. In order to develop energy-efficient GPU-accelerated systems, it is essential to identify the trade-off in power and performance of GPUs and its causal relation with CPUs.

Despite the rapid growth of GPU technology, there has not been much understanding of the power and performance implications of GPU-accelerated systems. According to vendors' specifications, the thermal design power (TDP) of state-of-the-art GPUs is around 200W, while that of today's multicore CPUs is typically below 100W. Because the TDP of GPUs is comparable to or even higher than that of CPUs, it is hard for system designers to optimize energy savings by predicting the power and performance of GPU-accelerated systems without an understanding of GPU power-performance characteristics. However, previous work [4, 3, 7, 9, 10] on the power and performance analysis of GPU-accelerated systems is based on either simulation studies or limited hardware functionality. None of the previous work has disclosed a fundamental approach to the power and performance optimization of GPU-accelerated systems on real hardware.

The contribution of this paper is to provide a power and performance analysis of GPU-accelerated systems using NVIDIA's Fermi architecture (GeForce GTX 480). Specifically, we identify when to scale the frequency and voltage of GPUs and CPUs in order to minimize overall system energy. Our analysis opens up important problems of dynamic voltage and frequency scaling (DVFS) algorithms for growing GPU-accelerated systems. We also provide an open method and tool to scale the voltage and frequency of GPUs. The black-box nature of current GPU drivers and runtimes prevents researchers from tackling correlative power and performance optimization problems. We believe that sharing such a common method and tool with researchers would further facilitate the use of GPUs.

The rest of this paper is organized as follows. Section 2 presents our system platform and Section 3 provides evaluation results and analyses. Section 4 discusses related work, and the paper is concluded in Section 5.

Table 1: Performance levels of GTX 480 (GPU)

Clock Domain    Min [MHz]   Low [MHz]   High [MHz]
Core            50          405         700
Memory          135         324         1848

Table 2: Performance levels of Core i5 2400 (CPU)

Platform        Min [MHz]   Low [MHz]   High [MHz]
Core i5-2400    1600        2700        3301

[Figure 2: Power consumption and execution time of the 512 × 512 matrix addition program. Power [W] over time [s] for three successive configurations: c-700/m-1848 (E0 = 1.7 kJ), c-700/m-135 (E1 = 1.4 kJ), and c-405/m-135 (E2 = 1.8 kJ).]

2 System Platform

We use an NVIDIA GeForce GTX 480 graphics card and an Intel Core i5 2400 with the Linux kernel 3.3.0. Tables 1 and 2 present their available performance levels, respectively. To perform the experiments, we use NVIDIA's proprietary software [13] and Gdev [8] on a case-by-case basis. NVIDIA's proprietary software does not provide a system interface to scale the performance level of the GPU. We hence provide modified BIOS files for the GPU that force the binary driver to configure the GPU with the specified performance level when loaded. This method enables us to choose any set of GPU core and memory clocks, but requires the driver to reload, and the configuration is static while the driver is running. On the other hand, Gdev allows the system to change the performance level of the GPU dynamically at runtime through the Linux "/proc" file system interface. However, it is available only for the GPU core clock at the moment, and the GPU memory clock is fixed at 135MHz. This limitation is due to the current open-source implementation for Linux, and is expected to be removed in a future release.

The experimental workload executes the Rodinia benchmark suite 2.0.1 [2] and our original microbenchmark programs of matrix computation. All input data for the Rodinia programs use the maximum feasible size, while the microbenchmark programs vary data size to conduct fine-grained measurements; all programs are written in CUDA. We use the NVIDIA CUDA Compiler (NVCC) 4.0 [12] to compile the programs. Note that both NVIDIA's proprietary software and Gdev receive the same program binary.

The power and energy consumption of the system is measured by the Yokogawa Electric Corporation's WT1600 digital power meter [1]. This instrument obtains the voltage and electric current every 50ms from the power plug of the machine. The power consumption is calculated by multiplying the voltage and current, while the energy consumption is derived by accumulating the power consumption over time.
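For reference, the derivation from meter samples to energy can be written out as a short host-side routine. The following is a minimal sketch, assuming the meter log has already been parsed into voltage/current pairs; the Sample structure and its origin are our own illustration, not the WT1600's export format.

    // Deriving power and energy from voltage/current samples taken every 50 ms.
    // The samples vector is assumed to be filled from the meter log; its format
    // is illustrative and not the WT1600's actual output.
    #include <cstdio>
    #include <vector>

    struct Sample { double volts; double amps; };

    int main(void) {
        const double interval_s = 0.05;          // 50 ms sampling period
        std::vector<Sample> samples;             // filled from the meter log
        double energy_j = 0.0;                   // cumulative energy [J]
        for (const Sample &s : samples) {
            double power_w = s.volts * s.amps;   // P = V * I
            energy_j += power_w * interval_s;    // E = sum of P * dt
        }
        std::printf("energy = %.2f kJ\n", energy_j / 1000.0);
        return 0;
    }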

3 Evaluation and Analysis

We first investigate the impact of the GPU core and memory clocks on a GPU-intensive workload executing 20,000 loops of 512 × 512 matrix addition. The voltage and frequency of the GPU are changed three times during the operation, while those of the CPU are fixed at the minimum level to focus on the behavior of the GPU. Figure 2 shows the power consumption of the system in this setup, where "c-*" and "m-*" stand for the GPU core and memory clocks, respectively, while "E*" represents the cumulative energy consumption of the corresponding duration. What is learned from this experiment is that it is important to cooperatively scale the GPU core and memory clocks to effectively reduce energy consumption. Lowering the memory clock to 135MHz successfully reduces energy consumption, but downscaling the core clock to 405MHz instead increases energy consumption. This implies that DVFS algorithms dominate the power and performance of GPU-accelerated systems.
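For reference, the following is a minimal sketch of the kind of matrix-addition microbenchmark used in this experiment; the kernel body and launch configuration are illustrative assumptions, since the actual benchmark source is not listed here.

    // Sketch of the 512 x 512 matrix addition microbenchmark, looped 20,000
    // times. Grid/block dimensions and buffer handling are illustrative.
    #include <cuda_runtime.h>

    #define N     512
    #define LOOPS 20000

    __global__ void matrix_add(const float *a, const float *b, float *c) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < N && y < N)
            c[y * N + x] = a[y * N + x] + b[y * N + x];
    }

    int main(void) {
        size_t bytes = N * N * sizeof(float);
        float *a, *b, *c;
        cudaMalloc((void **)&a, bytes);
        cudaMalloc((void **)&b, bytes);
        cudaMalloc((void **)&c, bytes);
        cudaMemset(a, 0, bytes);
        cudaMemset(b, 0, bytes);

        dim3 block(16, 16);
        dim3 grid((N + block.x - 1) / block.x, (N + block.y - 1) / block.y);
        for (int i = 0; i < LOOPS; i++)
            matrix_add<<<grid, block>>>(a, b, c);
        cudaDeviceSynchronize();                 // wait for all launches to finish

        cudaFree(a); cudaFree(b); cudaFree(c);
        return 0;
    }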

We next coordinate the GPU and the CPU using a more realistic workload from the Rodinia benchmark suite. To simplify the setup, we consider only high and low core clocks, meaning that we evaluate the four configurations (GPU-L, CPU-L), (GPU-H, CPU-L), (GPU-L, CPU-H), and (GPU-H, CPU-H), where "*-L" and "*-H" represent the low and high core clocks. The GPU memory clock is fixed at 135MHz. In an idle state, however, the clocks are always down-scaled to the minimum level. We also add another configuration (all-H) that keeps the maximum clocks even when the GPU is idle, in order to see the impact of elementary coordinated DVFS on GPU-accelerated systems. Figures 3 and 4 respectively show the execution time and energy consumption of four representative programs of the Rodinia benchmark suite.

[Figure 3: Execution time of the Rodinia programs (backprop, heartwall, hotspot, srad). Normalized execution time, split into GPU and CPU portions, for each configuration.]

[Figure 4: Energy consumption of the Rodinia programs. Normalized energy for each configuration.]

Regarding the execution time, "all-H" always takes the shortest execution time, as it consistently keeps the maximum performance level. The other configurations, however, depend on the workload. For example, the execution time of heartwall, a GPU-intensive workload, can be decreased by setting the high GPU clock, whereas that of hotspot is rather affected by the CPU clock. The characteristics of energy consumption are more complicated. For some workloads, lowering the clock causes an increase in energy consumption, as the duration of execution is increased, consuming more cumulative power. In other words, GPU-intensive workloads should generally use the high GPU clock so that they complete operation as soon as possible to minimize energy. Apparently, "all-H" is not a good idea in terms of energy; the clock should be minimized when the device is not used.

In the above experiments, we have never observed that energy consumption is reduced by lowering the CPU clock. This is because lowering the CPU clock causes the GPU to increase the duration of an idle state. Hence, energy is always wasted when the GPU is idle. We demonstrate how energy is wasted in an idle state, when (i) the GPU is not present and (ii) it is present with three levels of a set of voltage and frequency. Figure 5 shows the average power consumption of those four cases, obtained by running the system for 60 seconds without performing any computations (idle state). The CPU consumes no more than 38W on average, whereas the GPU-installed systems consume a different scale of power depending on the configured set of voltage and frequency. This observation encourages the system not to downscale the voltage and frequency of the CPU, unless the idle power consumption of future GPUs becomes small enough through hardware optimization techniques. The lesson learned from this experiment is that the power consumption of the GPU is significant even in an idle state, meaning that DVFS is strongly desired for the GPU, whatever overhead it has to pay for changing the performance level.

[Figure 5: Power consumption in an idle state: 38W with the CPU only, and 84W, 106W, and 128W with the GPU configured by Gdev at the Min, Low, and High levels, respectively.]

The preceding evaluation indicates that the CPU is a weak factor for energy savings of GPU-accelerated systems. Henceforth, we restrict our attention to the GPU. According to traditional power modeling [5], lowering the core clock is often effective for memory-intensive workloads. Our next evaluation verifies whether the same is true for the GPU. We use matrix addition and multiplication programs with varied sizes of data. A small size of data reduces memory accesses, while a large size of data makes the workload memory-intensive. Figures 6 and 7 show the execution time and energy consumption of those matrix computations, where "s-*" represents the number of matrix rows/columns. The difference between "s-64" and "s-8192" in Figure 7 shows that memory-clock scaling is more effective for computations that use a smaller size of data. This is because the execution time of such computations is not affected by lowering the memory clock. Another observation is that energy cannot be saved by lowering the core clock, because these matrix computations are consistently compute-intensive. If the core clock is downscaled, their execution time increases substantially, which results in an increase in cumulative power consumption.

Seen from the above experiments, system energy could be reduced by about 28% while keeping the decrease in performance within 1%. These experimental results suggest that DVFS algorithms for GPU-accelerated systems should be weighted toward the GPU rather than the CPU, though their energy optimization is very challenging, given the many design knobs including CPU/GPU, core/memory, and workload characteristics.
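Taken together, these observations suggest a simple policy skeleton, sketched below. The clock-setting hooks and the workload classification are our own illustration of the idea, not an existing API: compute-bound work keeps the high core clock and may drop the memory clock, memory-bound work may drop the core clock, and an idle GPU falls back to the minimum levels.

    // Policy skeleton suggested by the measurements above. The clock-setting
    // functions are hypothetical placeholders, not an existing driver API.
    #include <cstdio>

    enum WorkloadKind { GPU_IDLE, COMPUTE_BOUND, MEMORY_BOUND };

    // Placeholder implementations; a real system would call into the driver
    // (e.g., an interface such as Gdev's) instead of printing.
    static void set_gpu_core_clock(int mhz) { std::printf("core -> %d MHz\n", mhz); }
    static void set_gpu_mem_clock(int mhz)  { std::printf("mem  -> %d MHz\n", mhz); }

    void apply_gpu_dvfs(WorkloadKind kind) {
        switch (kind) {
        case GPU_IDLE:        // an idle GPU still draws tens of watts: go to minimum
            set_gpu_core_clock(50);
            set_gpu_mem_clock(135);
            break;
        case COMPUTE_BOUND:   // keep the core fast; the memory clock barely matters
            set_gpu_core_clock(700);
            set_gpu_mem_clock(135);
            break;
        case MEMORY_BOUND:    // keep memory fast; the core clock can drop
            set_gpu_core_clock(405);
            set_gpu_mem_clock(1848);
            break;
        }
    }

    int main(void) {
        apply_gpu_dvfs(COMPUTE_BOUND);   // example: a compute-bound kernel is about to run
        return 0;
    }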

[Figure 6: Execution time of the matrix addition (a) and matrix multiplication (b) programs. Execution time [s] for core clocks c-405 and c-700 combined with memory clocks m-135, m-324, and m-1848, at matrix sizes s-64, s-512, and s-8192.]

[Figure 7: Energy consumption of the matrix addition (a) and matrix multiplication (b) programs. Normalized energy for the same clock configurations and matrix sizes as in Figure 6.]

Finally, we compare NVIDIA's proprietary software and Gdev. This is an important and practical investigation because NVIDIA's proprietary software does not expose a system interface to change the voltage and frequency of the GPU dynamically at runtime, and hence the development of DVFS algorithms in future work will inevitably depend on Gdev. The basic performance of Gdev is competitive with NVIDIA's proprietary software [8], but we have to evince that Gdev is also reliable for this purpose. The test program exploits matrix addition with varied sizes of data. Figure 8 shows the execution time and energy consumption of the matrix addition programs using different GPU core clocks, where the GPU memory clock is fixed at 135MHz. In this experiment, "s-8192" benefits from lowering the core clock: the workload is memory-bound due to the large matrix size, so the execution time is not much affected by the core clock, while energy is effectively saved. The most remarkable observation is that NVIDIA's proprietary software and Gdev exhibit almost the same results on execution time and energy consumption. This implies that the results of our on-going research using Gdev could be easily propagated to the real product, once energy management interfaces are employed by the vendor's software. Note that tools and documentation for the power and performance management of the GPU may be downloaded from our website.¹

[Figure 8: Comparison of the NVIDIA proprietary and the Gdev open-source runtimes and drivers: (a) execution time and (b) energy consumption, normalized, for core clocks c-405 and c-700 and matrix sizes s-64, s-512, and s-8192.]

¹ http://sys.ertl.jp/gdev/
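As an illustration of how such a runtime interface can be driven from user space, the following sketch writes a core-clock value to a procfs-style file. The path and value format are assumptions made for this example; the actual interface is documented with the Gdev tools at the URL above.

    // Sketch: requesting a GPU core clock change through a procfs-style
    // interface. The path below is an assumption for illustration only.
    #include <cstdio>

    static int set_gpu_core_clock(int mhz) {
        std::FILE *fp = std::fopen("/proc/gdev/vd0/core_clock", "w");  // assumed path
        if (!fp)
            return -1;                      // interface not available
        std::fprintf(fp, "%d\n", mhz);      // requested core clock in MHz
        std::fclose(fp);
        return 0;
    }

    int main(void) {
        // Example: drop the GTX 480 core clock to its low level (405 MHz).
        return set_gpu_core_clock(405) == 0 ? 0 : 1;
    }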

4 Related Work

Nagasaka et al. estimated the energy consumption of GPUs based on hardware performance counters [10]. Their model is not adequate in that, as we have seen, power consumption rises even in an idle state when voltage and frequency are scaled, which they do not consider. Hence, this approach would require an additional model to precisely analyze the power consumption of GPUs.

Hong et al. studied energy savings of GPUs, assuming power gating is available to limit the number of active cores [3, 4]. In particular, they analyze PTX code to model the power and performance of GPUs based on the number of instructions and memory accesses. We consider such an offline PTX analysis for power and performance prediction to be a useful approach to the design of DVFS algorithms. What this approach lacks, however, is a runtime analysis of the input data. In this paper, we have analyzed the power and performance characteristics depending on the size of input data.

Lee et al. presented a method to apply DVFS algorithms to the GPU, aiming in particular at maximizing performance under a given power constraint [9]. A strong limitation of their work, however, is that the evaluation of power consumption is based on a conceptual model rather than real-world hardware. They also do not discuss how to determine the voltage and frequency. In this paper, we have instead explored how to minimize the energy consumption of GPU-accelerated systems using cutting-edge real-world hardware.

Jiao et al. evaluated the power and performance of an older NVIDIA GTX 280 GPU [7]. They examined compute-intensive and memory-intensive programs. According to their analysis, energy consumption can often be reduced by lowering the core clock when the workload is memory-intensive. This is exactly what we have identified for the NVIDIA GTX 480 GPU. Therefore, we conjecture that this observation and knowledge could be applied to future GPU architectures as well. In addition, we have disclosed that energy consumption can also be reduced by scaling the memory clock. This opens up a new insight into DVFS algorithms for GPU-accelerated systems.

Zhang et al. analyzed the power and performance of an AMD HD 5870 GPU using a random forest method over profile counters [14]. They revealed that activating a smaller number of ALUs reduces power consumption. However, this approach incurs an increase in execution time, and does not successfully reduce energy consumption. This is attributed to the fact that they use only software management. Meanwhile, we have demonstrated that energy can be reduced by scaling the voltage and frequency of the GPU.

5 Conclusion and Future Work

We have presented a power and performance analysis of GPU-accelerated systems based on the NVIDIA Fermi architecture. Our findings include that the CPU is a weak factor for energy savings of GPU-accelerated systems unless power gating is supported by the GPU. In contrast, voltage and frequency scaling of the GPU is significant for reducing energy consumption. Our experimental results demonstrated that system energy could be reduced by about 28% while keeping the decrease in performance within 1%, if the performance level of the GPU is scaled effectively.

In future work, we plan to develop DVFS algorithms for GPU-accelerated systems using the characteristics identified in this paper. We basically consider an approach that controls the GPU core clock for memory-intensive workloads while controlling the GPU memory clock for compute-intensive workloads. To this end, we will integrate PTX code analysis [3, 4] into DVFS algorithms so that energy optimization can be provided at runtime. We also consider a further dynamic scheme that scales the performance level of the GPU during the execution of GPU code, whereas we restricted the scaling point to the boundary of GPU code in this paper.

References

[1] WT1600 digital power meter. http://tmi.yokogawa.com/discontinued-products/digital-power-analyzers/digital-power-analyzers/wt1600-digital-power-meter/.
[2] Che, S., Sheaffer, J. W., Boyer, M., Szafaryn, L. G., Wang, L., and Skadron, K. A characterization of the Rodinia benchmark suite with comparison to contemporary CMP workloads. In IISWC '10 (2010), pp. 1-11.
[3] Hong, S., and Kim, H. An analytical model for a GPU architecture with memory-level and thread-level parallelism awareness. In ISCA '09 (2009).
[4] Hong, S., and Kim, H. An integrated GPU power and performance model. In ISCA '10 (2010).
[5] Hsu, C.-H., Kremer, U., and Hsiao, M. Compiler-directed dynamic voltage/frequency scheduling for energy reduction in microprocessors. In ISLPED '01 (2001), pp. 275-278.
[6] Intel Corporation. http://www.intel.com.
[7] Jiao, Y., Lin, H., Balaji, P., and Feng, W. Power and performance characterization of computational kernels on the GPU. In GREENCOM-CPSCOM '10 (2010), pp. 221-228.
[8] Kato, S., McThrow, M., Maltzahn, C., and Brandt, S. Gdev: first-class GPU resource management in the operating system. In USENIX ATC '12 (2012).
[9] Lee, J., Sathisha, V., Schulte, M., Compton, K., and Kim, N. S. Improving throughput of power-constrained GPUs using dynamic voltage/frequency and core scaling. In PACT '11 (2011), pp. 111-120.
[10] Nagasaka, H., Maruyama, N., Nukada, A., Endo, T., and Matsuoka, S. Statistical power modeling of GPU kernels using performance counters. In IGCC '10 (2010), pp. 115-122.
[11] NVIDIA. http://www.nvidia.com.
[12] NVIDIA. CUDA Toolkit 4.0, 2011. http://developer.nvidia.com/cuda-toolkit-40.
[13] NVIDIA. Linux X64 Display Driver, 2012. http://www.nvidia.com/object/linux-display-amd64-295.59-driver.html.
[14] Zhang, Y., Hu, Y., Li, B., and Peng, L. Performance and power analysis of ATI GPU: a statistical approach. In NAS '11 (2011), pp. 149-158.
Compiler-directed dynamic cording to their analysis, energy consumption could of- voltage/frequency scheduling for energy reduction in . In ten be reduced by lowering the core clock when work- ISLPED ’01 (2001), pp. 275–278. [6] INTEL CORPORATION. http://www.intel.com. load is memory-intensive. This is exactly the same [7] JIAO, Y., LIN,H.,BALAJI, P., AND FENG, W. Power and performance as what we have identified for an NVIDIA’s GTX 480 characterization of computational kernels on the GPU. In GREENCOM- CPSCOM ’10 (2010), pp. 221–228. GPU. Therefore, we conjecture that this observation and [8] KATO,S.,MCTHROW,M.,MALTZAHN,C., AND BRANDT, S. Gdev: knowledge could be applied to future GPU architectures first-class GPU resource management in the operating system. In USENIX ATC ’12 (2012). as well. In addition, we have disclosed that energy con- [9] LEE,J.,SATHISHA, V., SCHULTE,M.,COMPTON,K., AND KIM,N.S. sumption could also be reduced by scaling the memory Improving throughput of power-constrained GPUs using dynamic volt- age/frequency and core scaling. In PACT ’11 (2011), pp. 111–120. clock. This opens up a new insight into DVFS algorithms [10] NAGASAKA,H.,MARUYAMA,N.,NUKADA,A.,ENDO, T., AND MAT- for GPU-accelerated systems. SUOKA, S. Statistical power modeling of GPU kernels using performance counters. In IGCC ’10 (2010), pp. 115–122. Ying et al. analyzed the power and performance of an [11] NVIDIA. http://www.nvidia.com. [12] NVIDIA. CUDA 4.0, 2011. http://developer.nvidia.com/ AMD HD 5870 GPU using a random forest method with -toolkit-40. the profile counter [14]. They revealed that activating [13] NVIDIA. Linux X64 Display Driver, 2012. http://www.nvidia. com/object/linux-display-amd64-295.59-driver. a fewer number of ALUs reduces power consumption. html. However, this approach incurs an increase in execution [14] ZHANG, Y., HU, Y., LI,B., AND PENG, L. Performance and power analysis of ATI GPU: a statistical approach. In NAS ’11 (2011), pp. 149– time, and does not successfully reduce energy consump- 158. tion. This is attributed to the fact that they use only soft- ware management. Meanwhile, we have demonstrated that energy can be reduced by scaling the voltage and frequency of the GPU.