Run-Time Power Modelling in Embedded GPUs with Dynamic Voltage and Frequency Scaling


Jose Nunez-Yanez, Kris Nikov, Kerstin Eder, Mohammad Hosseinabady
{j.l.nunez-yanez,kris.nikov,kerstin.eder}@bristol.ac.uk, [email protected]
Department of Electrical and Electronic Engineering, University of Bristol

ABSTRACT

This paper investigates the application to embedded GPUs of a robust CPU-based power modelling methodology that performs an automatic search of explanatory events derived from performance counters. A 64-bit Tegra TX1 SoC is configured with DVFS enabled, and multiple CUDA benchmarks are used to train and test models optimized for each frequency and voltage point. These optimized models are then compared with a simpler unified model that uses a single set of model coefficients for all frequency and voltage points of interest. To obtain this unified model, a number of experiments are conducted to extract information on idle, clock and static power, so that power usage can be derived from a single reference equation. The results show that the unified model offers competitive accuracy, with an average 5% error using four explanatory variables on the test data set, and that it correctly predicts the impact of voltage, frequency and temperature on power consumption. This model could be used to replace direct power measurements when these are not available due to hardware limitations, or for worst-case analysis in emulation platforms.

CCS CONCEPTS

• Hardware → Power estimation and optimization.

KEYWORDS

Heterogeneous architecture, GPU power modelling, DVFS, multiple linear regression, embedded GPU

ACM Reference Format:
Jose Nunez-Yanez, Kris Nikov, Kerstin Eder, Mohammad Hosseinabady. 2020. Run-Time Power Modelling in Embedded GPUs with Dynamic Voltage and Frequency Scaling. In 11th Workshop on Parallel Programming and Run-Time Management Techniques for Many-core Architectures / 9th Workshop on Design Tools and Architectures for Multicore Embedded Computing Platforms (PARMA-DITAM'20), January 21, 2020, Bologna, Italy. ACM, New York, NY, USA, 6 pages. https://doi.org/10.1145/3381427.3381429

1 INTRODUCTION

Embedded GPUs (Graphical Processing Units), which are physically present in the same chip as the central processing unit (CPU), are popular as general-purpose accelerators in power-constrained applications such as unmanned aerial vehicles (UAVs), self-driving cars and robotics.

The power profiles of these GPUs are in the order of Watts, compared with the hundreds of Watts needed by their desktop counterparts, which invariably use the PCIe bus to communicate with the host CPU and memory. Power optimization in these embedded devices is of critical importance since, in these applications, the power sources tend to be batteries and the systems must operate untethered for as long as possible. In this paper we investigate the application of the power modelling framework created in [6] for heterogeneous embedded CPUs to embedded GPUs, which are capable of general-purpose computing thanks to their support for languages such as CUDA and OpenCL. Our framework, called ROSE (RObust Statistical search of explanatory activity Events), can be used to automatically collect activity and power data and then perform a complete search for optimal events across a large range of frequency and voltage pairs, as defined in the device DVFS (Dynamic Voltage and Frequency Scaling) tables. The multiple linear regression optimization in ROSE uses OLS (Ordinary Least Squares), which is well understood for both desktop and integrated CPUs but has been less studied for GPUs, which are characterized by a proprietary black-box architecture with restricted access to internal microarchitecture details.
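As a rough illustration of this kind of search, the sketch below fits an OLS model for every fixed-size subset of candidate counters and keeps the subset with the lowest average percentage error. It is a minimal sketch, not ROSE itself: the counter names, the synthetic data and the four-event budget are assumptions made for illustration, and a real search would also validate candidate models on held-out benchmarks rather than selecting on training error alone.

```python
# Minimal sketch of an OLS-based search for explanatory events,
# in the spirit of ROSE. Counter names and data are illustrative.
from itertools import combinations
import numpy as np

def fit_ols(X, y):
    """Least-squares coefficients for y ~ X (with intercept)."""
    A = np.column_stack([np.ones(len(X)), X])
    coeffs, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coeffs

def avg_pct_error(coeffs, X, y):
    """Mean absolute percentage error of the fitted model."""
    A = np.column_stack([np.ones(len(X)), X])
    return np.mean(np.abs(A @ coeffs - y) / y) * 100.0

def search_events(samples, power, counter_names, n_events=4):
    """Exhaustive search for the n_events counters that minimize
    the average percentage power error under an OLS fit."""
    best = (np.inf, None, None)
    for subset in combinations(range(len(counter_names)), n_events):
        X = samples[:, subset]
        coeffs = fit_ols(X, power)
        err = avg_pct_error(coeffs, X, power)
        if err < best[0]:
            best = (err, [counter_names[i] for i in subset], coeffs)
    return best

# Example with synthetic data: 200 samples x 13 candidate counters.
rng = np.random.default_rng(0)
names = [f"counter_{i}" for i in range(13)]
samples = rng.random((200, 13))
power = 2.0 + samples[:, [1, 4, 7, 9]] @ np.array([1.5, 0.8, 0.3, 0.6])
err, events, coeffs = search_events(samples, power, names)
print(f"best events: {events}, avg error: {err:.2f}%")
```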
Taking these points into account, the novelty of this work can be summarized as follows:

1. We perform power modelling on an embedded GPU device with integrated power measurement and voltage/frequency scaling, whereas previous work has largely focused on desktop GPUs.
2. We limit the number of explanatory variables used in the model to enable the run-time collection of this information using a limited number of hardware registers.
3. We propose a novel unified model that includes temperature, frequency and voltage as global states, and compare it against models with coefficients optimized for single temperatures and frequency/voltage pairs.

This paper is organized as follows. Section 2 introduces related work in the area of power modelling with GPUs. Section 3 presents the methodology, based on our previous work in this area, the set of CUDA benchmarks used for model training/verification, and the techniques used to obtain run-time measures of power and event information. Section 4 develops models based on this methodology with coefficients optimized for individual voltage and frequency pairs. Section 5 proposes a unified model so that the power of a single per-frequency model can be scaled to an extended range of voltage and frequency points. Section 6 investigates temperature effects on power consumption and model accuracy. Finally, Section 7 concludes the paper.

2 BACKGROUND

As previously indicated, there is a significant amount of work on power modelling of CPU cores and CPU-based systems, and the interested reader is referred to [7] for a review of results and techniques. In the field of general-purpose GPU-based computing, the amount of power modelling research is more limited. The authors of [4] investigate how performance counters can be used to model power on a desktop NVIDIA GPU connected to a host computer via a PCIe interface. The PCIe interface is instrumented with current clamp sensors, and the host computer samples these sensors while collecting performance counter information. The authors identify a total of 13 CUDA performance counters, but since only four hardware registers are available, multiple runs are needed to access all of them. The authors also identify that certain kernels that perform texture reads, such as Rodinia Leukocyte, show significant power error, up to 50%, due to the lack of relevant counter information. The impact of DVFS on power modelling is not considered. Also targeting desktop GPUs, the authors of [8] introduce a support vector regression model instead of the more commonly used least squares linear regression. A total of five variables, such as vertex shader busy and texture busy, are used to build the model. Instead of predicting the power of full kernels as done in [4], they predict the power of the different execution phases, as is similarly done in our work. The authors show a slight advantage of SVR in accuracy, although GPU power in some phases of execution cannot be modelled correctly. The performance and power characterization done in [1] considers different desktop NVIDIA GPU families (i.e. Tesla, Fermi and Kepler). The external power measurements apply to the entire system, which includes the GPU and the CPU, rather than the individual components. The proposed power model uses performance counters and linear regression, and introduces a frequency scaling parameter in the power equation to account for the different performance levels possible in the GPU. It does not consider the operating voltage and, with multiple voltage levels possible for a single frequency, this could explain the errors in the prediction accuracy, which are measured at around 20 to 30%. The work of [3] also focuses on desktop GPUs, with a review showing that the number of explanatory variables used varies between 8 and 23.

Training and testing benchmarks are independent and have been obtained from the Rodinia and CUDA SDK benchmark sets.

Table 1: Deployed benchmarks

Train Set (Rodinia): stream_cluster, particle_filter, mummergpu, leukocyte, lavaMD, backprop, b+tree, heartwall, hotspot, srad_v1, srad_v2, pathfinder, myocyte, kmeans, bfs, cfd, hotspot3d, hybridsort

Test Set (CUDA SDK): binomialOptions, blackscholes, SobolQRNG, Transpose, Texture3D, Montecarlo, particles, Radixsort, FDTD3d, nbody

We have modified the collection and processing stages to account for the differences in counter availability, power and current sensors, and DVFS implementations. In the CPU-based power model of [6], the DVFS table contains fixed pairs of voltage and frequency, and the preferred way to build the power model is a per-frequency model in which a distinct set of coefficient values is calculated for each pair.
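A per-frequency model of this kind can be sketched as one OLS fit per DVFS operating point, indexed by its voltage/frequency pair. The helper below is a minimal sketch under stated assumptions: it presumes the samples have already been grouped per operating point and reuses the hypothetical fit_ols() from the earlier sketch.

```python
# Sketch of a per-frequency model: one OLS coefficient set per DVFS
# operating point. Assumes samples are already grouped per point and
# reuses fit_ols() from the previous sketch (an assumption).
import numpy as np

def fit_per_frequency_models(grouped_data):
    """grouped_data maps (voltage_mV, freq_MHz) -> (X, y), where X holds
    the selected event counts and y the measured power samples."""
    return {pair: fit_ols(X, y) for pair, (X, y) in grouped_data.items()}

def predict_per_frequency(models, voltage_mv, freq_mhz, events):
    """Only valid at operating points seen during training: the lookup
    raises KeyError for any other (voltage, frequency) pair."""
    coeffs = models[(voltage_mv, freq_mhz)]
    return coeffs[0] + np.dot(coeffs[1:], events)
```

The limitation that motivates the unified model is visible in the lookup: nothing can be predicted for a voltage/frequency pair that was not characterized during training.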
Using this approach directly on the TX1 is problematic due to temperature dependencies and the practical difficulty of adjusting the temperature of the device to each possible value during a data collection run that typically executes over several days. To manage this complexity, in this work we distinguish between local events, such as the number of instructions executed or the number of memory accesses, which affect power in certain regions of the device, and global states, such as the operating frequency, voltage and temperature, which affect power globally. This approach enables us to propose a unified model with a single set of coefficients that can be used for multiple combinations of voltage, frequency and temperature. The development of this unified model and its comparison with the per-frequency models is the focus of the following sections.
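This excerpt does not give the unified reference equation itself, but the decomposition it describes (idle, clock and static power plus activity terms, modulated by the global states) is consistent with the standard CMOS approximation in which dynamic power scales with V²·f. The sketch below is therefore an assumption-laden illustration of that idea, not the paper's actual equation: a single coefficient set fitted at one reference operating point is rescaled to other voltage/frequency points. The reference point, the second DVFS point and all numeric values are hypothetical.

```python
# Hypothetical unified model: one coefficient set, fitted at a reference
# (V_ref, f_ref), scaled to other operating points with the standard
# CMOS approximation P_dyn ~ V^2 * f. The static term is a placeholder;
# the paper derives idle, clock and static power from experiments not
# reproduced here.
import numpy as np

V_REF_MV, F_REF_MHZ = 1000.0, 998.0   # assumed reference DVFS point

def unified_power(coeffs, events, v_mv, f_mhz, p_static_w=0.5):
    """Predict power at (v_mv, f_mhz) from one reference-point model.
    coeffs[0] is the dynamic intercept (clock power at the reference
    point); coeffs[1:] weight the local activity events."""
    scale = (v_mv / V_REF_MV) ** 2 * (f_mhz / F_REF_MHZ)
    p_dynamic = coeffs[0] + np.dot(coeffs[1:], events)
    return p_static_w + scale * p_dynamic

# Example: the same activity predicted at two DVFS points.
coeffs = np.array([1.2, 0.9, 0.4, 0.2, 0.3])  # illustrative values
events = np.array([0.6, 0.3, 0.8, 0.1])       # normalized event rates
for v, f in [(1000.0, 998.0), (820.0, 614.4)]:
    p = unified_power(coeffs, events, v, f)
    print(f"{v:.0f} mV @ {f:.1f} MHz -> {p:.2f} W")
```

In a full treatment, the static term would itself depend on voltage and temperature (leakage), which is how temperature enters the unified model as a global state.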