Dynamic Voltage and Frequency Scaling for 3D Graphics Applications on the State-Of-The-Art Mobile GPUs

A Dissertation Presented by

Navid Farazmand

to

The Department of Electrical and Computer Engineering

in partial fulfillment of the requirements for the degree of

Doctor of Philosophy

in

Computer Engineering

Northeastern University
Boston, Massachusetts

February 2018

To my beautiful wife, Sanam, and our little angel, Eliana Nour...

Contents

List of Figures v

List of Tables vii

List of Acronyms ix

Acknowledgments xii

Abstract of the Dissertation xiii

1 Introduction 1
  1.1 Low-power design techniques ...... 3
    1.1.1 Dynamic voltage and frequency scaling ...... 6
  1.2 Contributions of this thesis ...... 12
  1.3 Organization of this thesis ...... 13

2 Background and Related Work 15
  2.1 Power consumption in CMOS circuits ...... 15
    2.1.1 Dynamic power ...... 16
    2.1.2 Static power ...... 19
  2.2 GPU programming ...... 20
    2.2.1 3D graphics programming model ...... 23
  2.3 Dynamic voltage and frequency scaling ...... 26
    2.3.1 Desktop CPU ...... 27
    2.3.2 Desktop GPUs ...... 29
    2.3.3 Mobile SoC ...... 30

3 Fine-Grained DVFS Analysis Framework 35
  3.1 Performance measurements ...... 36
    3.1.1 Pipeline idle insertion ...... 37
  3.2 Power measurement setup ...... 39
  3.3 Performance-power data alignment ...... 40
    3.3.1 Idle energy calculation ...... 42
  3.4 Putting it all together ...... 43

    3.4.1 API log capture and playback ...... 44

4 Workload Analysis and Design Space Exploration 47
  4.1 Graphics workload performance and energy consumption analysis ...... 48
  4.2 Design space exploration ...... 66
    4.2.1 The ideal DVFS ...... 67
    4.2.2 The effect of HW and SW configuration parameters on the DVFS ...... 69
  4.3 Proposed OPP selection algorithms ...... 80
  4.4 Conclusions ...... 85

5 QoS-Aware DVFS 89
  5.1 Algorithm core logic ...... 89
  5.2 Proactively avoiding deadline misses ...... 92
  5.3 Single step, multilevel frequency change ...... 93
  5.4 QoS awareness ...... 94
  5.5 Evaluation ...... 99
    5.5.1 Implementation ...... 99
    5.5.2 Applications ...... 100
    5.5.3 Results ...... 100
  5.6 Summary ...... 105

6 Energy-Aware DVFS 106
  6.1 Performance model ...... 107
  6.2 Energy model ...... 109
  6.3 Experimental setup and evaluation results ...... 112
    6.3.1 Correlation analysis ...... 114
    6.3.2 Models ...... 119
    6.3.3 Evaluation results ...... 122

7 Conclusion and future work 134
  7.1 Conclusion ...... 134
  7.2 Future work ...... 136

Bibliography 137

List of Figures

2.1 Energy consumption in CMOS circuits during 0 → 1 and 1 → 0 transition on the output node [1] ...... 16
2.2 a) GPU architecture, b) CPU-GPU communication diagram, and c) the command-buffer (ring-buffer) ...... 21
2.3 OpenGL ES 2.0 graphics programming pipeline [2] ...... 24
2.4 eglSwapBuffers is used to identify frame boundaries: (A) the same pace between rendering and display (B) rendering faster than display (C) rendering slower than display: a jank (frame miss) occurs ...... 25

3.1 Instrumenting command stream by inserting performance counter profiling instructions into the command stream ...... 36
3.2 The steps involved in the UMD to instrument command stream for performance profiling ...... 38
3.3 Power measurement lab setup ...... 39
3.4 Assertion of the GPIO pin, acquired with other voltage rails ...... 41
3.5 Idle power calculation based on the notion of slack time ...... 42
3.6 Total frame time (a) and graphics rail energy consumption (b) measured iteratively for all DDR frequencies ...... 44

4.1 Frame-to-frame workload variation ...... 50
4.2 Relationship between OPP voltage and energy consumption ...... 52
4.3 Total energy consumption for graphics and DDR power rails for EgyptHD, across DDR frequencies ...... 55
4.4 Total energy consumption for graphics and DDR power rails for EgyptHD across GPU frequencies ...... 56
4.5 Energy consumption at the battery rail across GPU frequencies ...... 58
4.6 Aggregated energy consumption across all frames for each OPP ...... 59
4.7 Normalized change in the energy consumption with the GPU and/or Double Data Rate memory (DDR) frequency transitions ...... 63
4.8 Performance-frequency relationship ...... 65
4.9 Selecting optimal OPP for frame 200, EgyptHD in ideal DVFS algorithm ...... 69
4.10 Ideal DVFS output: OPP transitions and performance/energy profile ...... 70

4.11 Ideal Dynamic Voltage and Frequency Scaling (DVFS): performance/energy profile across all configurations ...... 73
4.12 Summary of the effect of hardware and software configuration parameters on the energy/performance ...... 74
4.13 Optimizing for minimum energy vs. minimum performance ...... 74
4.14 Design space exploration: OPP configurations and OPP residency ...... 78
4.15 Effect of OPP configurations on the performance and energy consumption ...... 79

5.1 (a) Utilization-based algorithm: utilization calculation @DVFS execution time (b) QoS-aware algorithm: separation between statistics tracking (frame rendering time) and DVFS execution ...... 90
5.2 Multicontext multi-FPS rendering ...... 95
5.3 (a) Normalized and (b) percent of frame deadline violations ...... 101
5.4 Normalized (a) energy and (b) EV2P ...... 103

6.1 Components of (a) existing utilization-based DVFS algorithms, and (b) the proposed DVFS solution ...... 107
6.2 (a, b) Relationship between software pipeline stage operations and performance/energy consumption. (c) Consolidated hardware pipeline of a mobile GPU with unified shaders, centered around major processing blocks performing computation and data transfer ...... 108
6.3 Correlation coefficients (a) and scatter plots between energy (b, c) and frame draw time (d), and workload event performance counters. Left column shows data only from one representative OPP (GPU Freq. = 401.8 MHz, DDR Freq. = 1296 MHz) ...... 115
6.4 Correlation coefficients (a) and scatter plots between energy (b, c), and frame draw time (d, e), and GPU/DDR voltage and frequency. Left column shows data from only a single randomly selected frame of a randomly selected test ...... 117
6.5 Performance model deadline-guarantee prediction accuracy ...... 123
6.6 Energy model-based DVFS accuracy ...... 125
6.7 Scatter plot for all combinations of the Perf. models, threshold values, and energy models: (a) Energy-Draw Time, (b) Energy-Deadline Guarantee ...... 126
6.8 DVFS frequency prediction accuracy: Percentage of frames for which (a) GPU frequency, (b) DDR frequency, (c) both GPU & DDR frequencies are correctly predicted ...... 129
6.9 Mean Absolute Percentage Error (MAPE) for AP SoppSpr and EnSoppSpr energy models, and DTSopp performance model (model, outcome variable) across OPPs (GPUFreqMHz DDRFreqMHz categories on the x-axis) ...... 130
6.10 10-fold cross-validation scatter plot; energy models: (a) GFXDDRSepPerOPP, (b) AvgPowerSepPerOPP. In both charts, the performance model is MeetsLogitPerOPP. A data point from the training/test data split (Figure 6.7) for the same model configuration, for each test (and the average across tests), is highlighted with the solid fill color ...... 132

List of Tables

1.1 Main reasons for low-power design across different consumer electronic device categories ...... 2
1.2 Representative examples of various low-power design techniques ...... 4

3.1 Command Stream instrumentation. Inserting profiling instructions at the eof (End of Frame) and sof (Start of Frame) ...... 38

4.1 Summary of the hardware and software configuration parameters affecting the quality of a DVFS algorithm ...... 71
4.2 An example of OPP energy consumption leading to suboptimal decisions ...... 82

5.1 Applications used for evaluation ...... 99

6.1 Design space of our energy models ...... 110
6.2 Applications used for evaluation ...... 112
6.3 List of the performance counters used to profile workload parameters ...... 113
6.4 Predictors and outcome variable for the energy and performance models we have explored. Some models use transformed predictors and/or outcome variable that require preprocessing for both offline training and runtime prediction ...... 121
6.5 Breakdown of the top configurations in Figure 6.7 ...... 126
6.6 Goodness of fit parameters for our energy and performance models ...... 133

Listings

3.1 OpenGL ES API instrumentation for API log capture ...... 45
4.1 OPP selection algorithm, criteria: Energy efficiency (simple) ...... 80
4.2 OPP selection algorithm, criteria: Energy efficiency ...... 82
4.3 OPP selection algorithm, criteria: number of OPPs (simple) ...... 83
4.4 OPP selection algorithm, criteria: number of OPPs ...... 83
4.5 OPP selection algorithm, criteria: full GPU frequencies coverage ...... 84
5.1 QoS-aware DVFS: core algorithm ...... 91
5.2 Rendering time tracking ...... 92
5.3 Pro-active detection of rendering time increase ...... 92
5.4 Multilevel frequency change ...... 93

List of Acronyms

AHB Advanced High-performance Bus. System bus definition within the AMBA 2.0 specification. Defines a high-performance bus including pipelined access, bursts, split, and retry operations.
DVFS Dynamic Voltage and Frequency Scaling. Group of algorithms that change voltage and frequency at runtime according to the workload to make a trade-off between performance and energy consumption.
DDR Double Data Rate memory. DDR refers to a specific type of dynamic memory technology, which is synonymous with the dynamic memory used as the global (volatile) memory in computing elements.
DOU Days Of Use. For a mobile device, the length of time from a full battery (re)charge to when the battery dies and requires another recharge. It is often used by different vendors with a predefined set of dashboard tests they use to define their target energy consumption requirements. These requirements are used to drive the design, measure the success of the design, and compare competing designs and platforms.
CPU Central Processing Unit.
GPU Graphics Processing Unit.
GPGPU General-Purpose computing on GPU. There are a number of general-purpose (nongraphics) applications that can benefit significantly from the parallelism offered by graphics processors in terms of performance. GPGPU refers to the concept of running general-purpose (nongraphics) applications on GPUs.
OPP Operating Performance Point. A unique permutation of all the frequency domains controlled by DVFS. In this thesis, it refers to the (GPU, DDR) frequency pair.
IC Integrated Circuit.
PMIC Power Management Interface Controller. The component within a digital electronic chipset that controls the power distribution and protects the circuit from permanent damage, or data corruption due to voltage droop. It is also in charge of optimizing power distribution to increase battery efficiency.
IoE Internet of Everything.
IoT Internet of Things.
SoC System on a Chip. Refers to the integration of multiple different electronic elements, including processors and memories, on a single chip in a single fabrication process.
API Application Programming Interface. A set of well-defined functions that specify the interface of a library with other software components. External components interact with and use the library by calling its API functions. An API only specifies the format of the functions and their behavior as seen by an outside entity; it does not expose the internal implementation of those behaviors.

UMD User Mode Driver. A device driver that runs in user mode in the operating system and acts as the interface between an application and a device driver.
IO Input/Output.
ADB Android Debug Bridge. Set of command line tools available for the Android operating system, which are used for debugging the Android operating system and applications. ADB is also used for automation tasks such as starting and stopping applications.
USB Universal Serial Bus. One of the most common data communication protocols capable of fast data rate transfers. It is capable of acting as the power source for the connected devices. USB also supports hot plug capabilities (there is no need to reboot the system when a device is connected/disconnected).
GPIO General-Purpose Input/Output. Input/Output pins in a chipset that are not special purpose and are exposed to software to be used programmatically to interact with the outside world (i.e., other chipsets on the board or any other device physically connected to the IO pin).
sysfs Linux Kernel Virtual File System. It is used to expose various configuration information about the Linux kernel and kernel drivers to user space in a virtual file system environment. sysfs nodes in the Android operating system can be accessed through the ADB shell.
fps Frames Per Second.
APK Android Package file format.
UI User Interface.
ALU Arithmetic Logic Unit.
ALUs Arithmetic Logic Units.
EFU Elementary Function Unit. EFU is a special function unit implementing some complex, infrequently used functions efficiently in hardware. Given the complexity and low utilization of EFUs, there is one EFU per multiple ALUs in a GPU compute unit to keep the hardware footprint small.
SIMD Single Instruction Multiple Data.
texel Texture Element.
texels Texture Elements.
TP Texture Processor.
SP Shader Processor.
MSAA Multiple Sample Anti Aliasing.
bpp Bit Per Pixel.
ppc Pixel Per Clock.
MLR Multiple Linear Regression. A statistical analysis technique used to create predictive equations for one outcome variable based on multiple predictors. Both the outcome variable and the predictors are continuous.
VR Virtual Reality.
AR Augmented Reality.
CMOS Complementary Metal-Oxide Semiconductor. A design logic based on MOSFET transistors (NMOS and PMOS) prevalently used for digital VLSI circuits.
VLSI Very Large Scale Integrated circuit.
SoI Silicon on Insulator. A variation of the CMOS manufacturing process that uses a silicon-insulator-silicon layer for the substrate (bulk) of the transistor rather than a single silicon layer.
AI Artificial Intelligence.

OS Operating System.
DPM Dynamic Power Management. A category of power management techniques that work by dynamically adjusting the configuration of the system under control.
RTL Register Transfer Level. One of the levels of hierarchy in the design and development of electronic circuit hardware.
MCD Multiple Clock Domain.
DAG Directed Acyclic Graph.
EDP Energy-Delay Product.
ED2P Energy-Delay-Square Product.
EVP Energy-DeadlineViolation Product.
EV2P Energy-DeadlineViolationSquare Product.
IPC Instructions Per Cycle. The number of instructions a CPU completes in one clock cycle. IPC is widely used as a measure of performance.
DPC Decoded instructions Per Cycle. The number of instructions a CPU decodes in one clock cycle.
LOD Level Of Detail. A measure of the detail used for rendering the edges of the objects in 3D graphics rendering. The same term can also apply to a specific method of texturing.
FLOPS Floating Point Operations per Second. A metric commonly used to quantify the peak performance of a processor with support for floating point operations.
CU Compute Unit. A single processing core of a many-core GPU.
PE Processing Element. An execution unit within a SIMD Compute Unit.
DSP Digital Signal Processor.
IP Intellectual Property.
FPS Frames Per Second.
CP Command Processor. The front-end command processor of a GPU that is used to program GPU state registers and fetch data and vertex/fragment shader code from memory.
VS Vertex Shader. A program that runs on every vertex of a 3D graphics workload, performing various computations on the vertex attributes.
FS Fragment Shader. A program that runs on every (visible) fragment (pixel) of a 3D graphics workload, calculating the final color of the fragment.
OpenCL Open Computing Language. An open standard programming API for heterogeneous computing.
FPGA Field Programmable Gate Array. A reconfigurable electronic circuit that can implement different hardware functionalities by being re-programmed.
QoS Quality of Service.
EGL Embedded Graphics Layer. An API interface between Khronos rendering APIs such as OpenGL ES and the native platform windowing system.
VSYNC Vertical SYNC event. The VSYNC event refers to the event at which the display goes from the very last display line to the very first to start the next refresh cycle. The time between two VSYNC events is the time that is available to render a new frame.
EOF End Of Frame.
KMD Kernel Mode Driver.

Acknowledgments

Getting a Ph.D. degree requires dedication and sacrifice. It turns out I underestimated the turns and bumps one may need to go through in this process, especially when immigration is thrown into the mix. The good news is, I was spot on in believing that I have the grit and what it takes; in believing in the love and support of the most important people in my life; in believing that the universe rewards hard work, honesty, and unfeigned respect for others; and, last but not least, in relying on the many great people who have made America the land of opportunity that it has been for so many immigrants, and I am hopeful that this will not change.

My Ph.D. advisor, Prof. David Kaeli, is a true example of such honorable, kind-hearted people, striving to positively impact the lives of many students each year. He treats his students humanely and with respect rather than as subordinates, and brings out the best in them. I owe much of my achievements over the last few years, this thesis as well as my professional life, to Prof. Kaeli. More importantly, I learned from Prof. Kaeli what I think every advisor should teach their students first: how to be a good person, respect and help others, cooperate and collaborate, and still be competitive and excel. Thank you!

I am and will always be indebted and thankful to my hero in this journey, my beautiful wife, Sanam, for her support, presence, and love. She has been with me at every turn and has stood beside me through every obstacle. We came out of this journey strong and proud together, and we created our little happy family with Eliana, the cutest baby girl we could have imagined. We are blessed to have her in our lives, and I am excited that she made it in time to witness our efforts come to fruition in this thesis. I love you!

I thank my parents for supporting my goals and ambitions with everything they had, and more. I would not have been able to get where I am without their unrelenting love and support. They made it possible for me to plan for the long term and have made my little family's happiness and success their goal, while patiently counting the days we have been apart over the last several years.

I thank my mother-in-law and father-in-law for their unconditional love and support, and for selflessly helping us during this journey, specifically over the last several months as I was making my final efforts to finish this thesis while we were expecting the new member of our family and keeping up with my full-time job. They had my back when I might have dropped the ball.

I thank Chris Kolb, my manager at Qualcomm, for helping me get through the logistics of the NDA and approval processes, and anything else relevant to Qualcomm for the completion of this thesis.

Abstract of the Dissertation

Dynamic Voltage and Frequency Scaling for 3D Graphics Applications on the State-Of-The-Art Mobile GPUs

by Navid Farazmand
Doctor of Philosophy in Computer Engineering
Northeastern University, February 2018
Dr. Kaeli, Adviser

Today, there are more mobile computing devices running on batteries than computers connected to a power source, and this trend is bound to continue. Power consumption has always been one of the design considerations in digital electronic devices. However, low-power design in battery-powered mobile devices has become a first-order design objective due to the unique characteristics of mobile applications.

Dynamic Voltage and Frequency Scaling (DVFS) is a runtime power management technique commonly utilized to reduce power consumption while achieving a predetermined level of performance. There have been numerous DVFS studies for general-purpose computing on CPUs and desktop Graphics Processing Units (GPUs) over the past two decades. Nevertheless, DVFS for graphics workloads running on mobile GPUs has gained more attention only in recent years, and there remains a need for more research in this area. Mobile graphics workloads are an integral part of the user experience on smartphones. Typical workloads range from User Interfaces (UIs) to mapping applications, browsers, and mobile games. Thus, optimizing the energy consumption of mobile graphics workloads is essential for satisfactory operation of these devices. Early research into the potential benefits of DVFS for mobile graphics applications utilized emulated GPUs for evaluation. Follow-up research focused on porting existing DVFS algorithms to mobile GPUs or enabling workload prediction for DVFS by taking advantage of the asynchronous nature of CPU and GPU operation.

The goal of this thesis is to propose novel DVFS solutions for state-of-the-art mobile GPUs that improve energy consumption without negatively impacting performance guarantees. To achieve this goal, in this thesis we first introduce a fine-grained workload performance and energy consumption analysis framework for real-world use cases. Using this framework, we present a thorough

analysis of the dynamic behavior of real-world applications to evaluate common assumptions that constitute the foundation of many DVFS solutions. We show that history-based prediction works well for graphics workloads and that costly workload prediction techniques are not justified. We also show that the complex architecture of mobile Systems on Chip (SoCs) impacts the benefits of any solution. Leveraging our framework, we also perform design space exploration on mobile graphics use cases. We explore the impact of various configuration parameters on the effectiveness of DVFS solutions and identify level-switching latency and safety margins as having the most impact on energy consumption. We also show that incorporating energy awareness into the DVFS algorithm can further improve energy consumption.

We propose two DVFS solutions that address limitations we identified during our workload analysis and design space exploration. First, we propose Quality of Service (QoS)-aware DVFS. By improving the level-switching latency in response to workload changes, this solution allows safety margins to be reduced. These two improvements can provide better energy savings without sacrificing frame deadline guarantees. QoS-aware DVFS achieves frame deadline guarantees within 2% of running at the highest frequency (624 MHz), while achieving a 17% and 27% improvement in frame deadline guarantees and Energy-DeadlineViolationSquare-Product (EV2P), respectively. This evaluation is performed at a 410 MHz operating frequency. QoS-aware DVFS consistently outperforms Qualcomm's msm-adreno-tz DVFS solution.

Next, we propose energy-aware DVFS. Energy-aware DVFS uses detailed performance and energy consumption models to address the limitations of QoS-aware DVFS, mainly the lack of explicit support for DDR frequency scaling. Using statistical models, energy-aware DVFS identifies an optimized Operating Performance Point (OPP) for a given workload. The optimized OPP is one that meets the Frames Per Second (FPS) requirement of the application with the minimum energy consumption. Energy-aware DVFS achieves energy consumption within 3% of the ideal profile for our DVFS policy, while increasing frame deadline violations by less than 0.2% compared to the ideal profile.

Chapter 1

Introduction

Today, there are more mobile computing devices running on batteries than computers connected to a power source, and this trend is poised to continue. The mobile phone industry is considered to be the largest consumer electronics market ever, with more than 1.2 billion smartphones in service and well over five billion mobile phones. These numbers far exceed the global install base of personal computers. In addition, the next wave of double-digit growth in mobile connected devices has already started in the Internet of Things (IoT) (a.k.a. Internet of Everything (IoE)) industry. IoT is forecast to outnumber the mobile phone industry, with multiple billions of devices within the next five to ten years.

Lowering power consumption has always been one of the design considerations in digital electronic devices. However, each class of device has unique processing characteristics and power requirements. For each class of device, we need to consider using different design parameters to arrive at the most efficient power/performance design point. Table 1.1 shows the main motivations for power optimization across different classes of consumer electronic devices. With the advancement of technology and the growing demand for ubiquitous computing, the computing community has moved from being dominated by desktop personal computers to laptops (mobile personal computers), mobile smartphones, and the latest consumer electronics category, IoT. The effect of this transition has been to increase the importance of power optimization as a first-order design metric [3]. As shown in Table 1.1, Die (or junction) Temperature 1 is the limiting factor across computing classes that necessitates power management. Nevertheless, this requirement has unique implications for low-power design in each category. High die temperature can damage the IC.

1 The temperature measured at the transistors of an Integrated Circuit (IC)


Table 1.1: Main reasons for low-power design across different consumer electronic device categories

                                                 Desktop/Server   Laptop        Mobile (Smart)phone   IoT/IoE
  Die Temperature Limitations                    Main Reason      Main Reason   Main Reason           Main Reason
  Cost of Energy Consumption                     Main Reason
  Cost, Difficulty, or Lack of Active Cooling    Main Reason      X             X                     X
  Battery Life                                   N/A              Main Reason   Main Reason           Main Reason
  Skin Temperature Limitations                                    X             Main Reason
  Reliability                                                                   X                     Main Reason

Desktop and server computers have effectively employed a combination of active and passive cooling to keep the temperature in a safe range [4]. In the absence of effective power optimization techniques, the cost of electricity, as well as the cost of cooling, can present further obstacles [5, 6]. Since laptop computers are battery powered, the most important motivation behind low-power design in this class of system is battery life. As we reduce the form factor for laptops, passive cooling becomes less effective, so active cooling is still employed. The mechanical moving parts of an active cooling system running on the same battery generally require more power than electronic parts. This further limits the effectiveness of active cooling, reinforcing the need for even more aggressive power-saving techniques. Given their smaller form factor, the conditions become more complicated for smaller, hand-held smartphones and IoT devices. These devices are more cost sensitive and have very strict form factor requirements. Heat transfer and passive cooling are less effective in these form factors. The area and cost of the Power Management Interface Controller (PMIC) become a limiting factor. Consequently, the maximum power consumption becomes an issue, adding to the challenges of managing total energy (affecting battery life) and average power consumption (affecting temperature). Breaching the maximum power consumption limit could result in run-time glitches or even permanent damage to the PMIC, affecting the reliability of the device (as specified in Table 1.1). Finally, since heat is dissipated through the mobile device body, which is in contact with the user's skin, the skin temperature

limit (40-45 degrees centigrade [7], which is much lower than the die temperature limit) poses new design challenges [8].

All the issues laid out above pose additional challenges for embedded (mobile) devices, where active cooling is almost impossible. These devices run on much smaller batteries than laptops, yet they are expected to run on a single charge for days or weeks (as opposed to hours for laptops). Most troubling of all is the orders-of-magnitude increase in power density with shrinking feature sizes as technology nodes advance [9]. Technology scaling has been able to increase performance and reduce power consumption, but power density has not been addressed. The issue of exceedingly high power densities applies to all device categories. At some point, the ever-increasing power density resulted in a total paradigm shift in high-performance computing design. Power (density) has become a performance-limiting factor; higher frequencies are achievable, but infeasible in practice because of high power densities [10]. Increasing clock frequency as a means to boost performance has become less popular. Instead, parallel processing and multicore/manycore architectures have now become the path to better performance. Still, energy efficiency 2 is higher in parallel architectures [10, 1]: performance improvements can be achieved through parallel processing while also keeping power consumption in check. Yet, in the cutting-edge deep-submicron technologies utilized today, parallel architectures face a phenomenon called dark silicon [11]. Due to extremely high power densities, only a portion of an IC can be turned on and operational at any moment. Dark silicon refers to the portion that must be kept unpowered to prevent damage to the IC. It is projected that at 8 nm only 50% of the silicon die can be active [11]. All of these issues are a testament to the importance of power management and low-power design as a first-order design parameter in today's digital electronic devices.

1.1 Low-power design techniques

In the past, a large number of power management techniques have been proposed in the literature. These techniques differ in the following aspects: a) the level of the design hierarchy at which power is addressed (e.g., manufacturing techniques at the lowest level, and system-level techniques, i.e., those spanning multiple components in a system, at the highest level of the design hierarchy), b) the development phase at which they are incorporated into the system (design, fabrication, or run time), c) the power equation parameter(s) used to optimize power/performance, and d) the power

2 The amount of energy required to perform a specific task [1]


Table 1.2: Representative examples of various low-power design techniques. These techniques span different design hierarchies and development phases, and they optimize different components of the power consumption. Terminology: DYNA: Dynamic, STAT: Static - power; ARCH: Architecture, ALGO: Algorithm, SYS: System, CIRC: Circuit, DEV: Device - level; RUN: run time, DESN: Design time, FAB: Fabrication process; LEAK: Leakage power.

  Technique                                             Power Component   Hierarchy Level    Development Phase   Power Eq. Parameter
  DVFS [12, 13, 14, 15, 16]                             DYNA              ARCH, ALGO, SYS    RUN, DESN           VDD, f
  clock gating [17, 18, 19, 20, 21]                     DYNA              CIRC, SYS          RUN                 α
  power-efficient cache / scratchpad design
    [22, 23, 24, 25, 26]                                DYNA, STAT        ARCH               DESN                α
  3DIC [27, 28, 29]                                     DYNA, STAT        CIRC               DESN/FAB            Ceff
  transistor sizing [30, 31]                            DYNA              CIRC               DESN                Ceff
  near-threshold voltage operation [3, 32]              DYNA              CIRC               RUN                 VDD, f
  specialized accelerator [33]                          DYNA              ARCH               DESN                Ceff, Nclk, f
  multicore / many-core [11, 34, 35, 36, 37, 38]        DYNA              ARCH               DESN                Ceff, f
  heterogeneous components [11, 34, 35, 36, 37, 38]     DYNA              ARCH               DESN                Ceff, Nclk, f
  Multi-VDD [39, 40, 41, 42]                            DYNA, STAT        CIRC               DESN/FAB            VDD, LEAK
  Multi-Vth [40, 31, 43, 44, 45]                        STAT              CIRC               DESN/FAB            LEAK
  body biasing [46, 47, 48, 49]                         STAT              CIRC               DESN                LEAK
  SoI [1, 50, 51]                                       STAT              DEV/FAB            DESN/FAB            LEAK
  power gating [52, 53]                                 STAT              ARCH, SYS          RUN                 LEAK

4 CHAPTER 1. INTRODUCTION consumption component that is being controlled. Table 1.2 presents a list of representative power management techniques that span across these categories. Power consumption on a CPU or GPU has two components: 1) the dynamic power: the power consumed when the circuit is operational (if the circuit is idle, dynamic power is not consumed), 2) the static power: consumed even when the circuit is idle and not performing any operation.

Power equation parameters include the operating voltage (VDD), the operating frequency

(f), the activity factor (α), the load capacitance (CL), and leakage. In current technology, a Very Large Scale Integrated (VLSI) circuit has billions of transistors (and hence billions of nodes connecting those transistors). Dynamic power is consumed when the load capacitances in the transistor output nodes are charged/discharged. CL is the total capacitance of the transistor output nodes in the circuit. The activity factor, α, is the ratio of the total load capacitance that is toggled (hence consuming energy) within a single clock cycle. A more detailed discussion of Complementary Metal-Oxide Semiconductor (CMOS) power consumption is presented in Chapter 2, Section 2.1. The choice of which level of the design hierarchy (refer to Gajski-Kuhn’s Y-chart [54]) to deploy a particular power management technique impacts how much power savings are achievable. If we implement management at higher levels of the design hierarchy (e.g., at a whole system or architecture level), we will have more general knowledge of the whole system state. Consequently, power management techniques implemented at higher levels provide greater opportunities for pow- er/energy savings [1]. Finally, low-power design techniques can be categorized as techniques that are incorporated during early design stages, during the manufacturing process, or applied at runtime. Among the low-power design techniques targeting dynamic power consumption are clock gating [17, 18, 19, 20, 21], near-threshold voltage operation [3, 32], transistor sizing [30, 31], and 3D-IC [27, 28, 29]. Some of these techniques can be applied at runtime, while others are design time techniques. Power and energy-efficient cache and scratchpad design [22, 23, 24, 25, 26], specialized accelerators [33], multicore (many-core), and heterogeneous computing frameworks [11, 34, 35, 36, 37, 38] are among the architectural design techniques for reducing dynamic power consumption. Lastly, DVFS [12, 13, 14, 15, 16] is an algorithm/system level technique that aims to reduce dynamic power consumption at runtime by changing voltage and/or operating frequency.
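These parameters combine into the familiar first-order expressions for dynamic power and energy. They are the standard CMOS relations, stated here only for reference; Chapter 2, Section 2.1 treats them in more detail:

\[ P_{dyn} = \alpha \, C_L \, V_{DD}^2 \, f \]
\[ E_{dyn} = P_{dyn} \cdot t \;\approx\; \alpha \, C_L \, V_{DD}^2 \, N_{clk} \quad \text{(for a task taking } N_{clk} \text{ clock cycles)} \]

The energy expression also makes explicit why lowering frequency alone does not reduce the dynamic energy of a fixed task: the cycle count N_clk is unchanged, so the savings come from the accompanying voltage reduction.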

Multi-VDD [39, 40, 41, 42] and Multi-Vth [40, 31, 43, 44, 45] circuits, as well as body biasing [46, 47, 48, 49], are among the circuit-level techniques for leakage power reduction, while Silicon on Insulator (SoI) [1, 50, 51] and power gating [52, 53] are device/fabrication-level and system/algorithm-level techniques that reduce static power consumption at design time and runtime,

respectively. Note that we often need to combine the techniques discussed above in order to achieve the best power budget. While some of these techniques are orthogonal and complement each other, others affect each other. For example, the main goal of DVFS is to reduce dynamic power consumption. However, when DVFS and power gating are used together in a chipset, they are interdependent and influence each other. DVFS might try to run at the lowest possible frequency and leave little idle time for power gating to be effective. Alternatively, DVFS can set the performance slightly higher and consume more dynamic power in order to increase power savings by enabling more effective power gating. In such scenarios, when searching for the optimal operating point, the interdependency between the two power management techniques should be taken into account.

The focus of this thesis is on dynamic power management of mobile GPUs through DVFS. Mobile GPUs are primarily used for 3D graphics acceleration, by the operating system for user interface (UI) elements, and by applications that include 3D games, browsers, and others. Today, mobile GPUs provide a variety of Application Programming Interfaces (APIs) to support a range of workloads. Examples of these workloads include 2D graphics and general-purpose compute workloads. Mobile GPUs also support a number of emerging mobile applications, including Artificial Intelligence (AI) and machine learning, that have gained popularity recently.

1.1.1 Dynamic voltage and frequency scaling

DVFS is one of the Dynamic Power Management (DPM) techniques that is very effective at managing the trade-off between power/energy consumption and performance. It dynamically changes voltage and operating frequency based on workload changes. Energy consumption has a quadratic relationship with voltage; reducing voltage reduces energy consumption. Signal propagation delay in the circuit is inversely proportional to the operating voltage, so a voltage reduction must be accompanied by a reduction of the operating frequency. Lower frequency also translates to an additional reduction in dynamic power consumption (dynamic energy for a specific operation, however, depends on the number of clock cycles and does not change with frequency). A DVFS solution can be broken down into two components: 1) an observer and 2) a controller [18]. The role of the observer is to monitor the dynamic behavior of the workload and/or system. Based on the observed behavior, the controller adjusts the frequencies to reach a more optimized configuration. Various DVFS algorithms use different observer algorithms and observation metrics.

The idea of voltage and frequency scaling was first studied for high-performance

desktop/server Central Processing Units (CPUs). Govil et al. [55] and Pering et al. [56] were among the first wave of researchers to explore CPU DVFS. They used simulation frameworks to study the effectiveness of DVFS before hardware support for DVFS existed. In their study, they covered a number of DVFS policies. All the policies studied fall into the history-based (also referred to as interval-based) DVFS category. In interval-based algorithms, the observer monitors system statistics within (often fixed) intervals and maintains a history of those statistics. The controller predicts the upcoming workload based on evaluating the past behavior. The algorithms studied in these two papers used CPU utilization 3 as the workload metric to track. PAST, FLAT, LONG SHORT, and AGED AVERAGES are examples of the policies that have been proposed in the past [55]. They predict the utilization for the next interval to be the same as the last interval, a fixed value, a weighted average using short (3-interval) and long (9-interval) histories, or an age-based average (lower weights are assigned to older intervals), respectively. The controller in all of these algorithms selects the frequency that can finish the predicted work, assuming a linear relationship between frequency and utilization (a minimal sketch of such an interval-based governor is given below). Additional contributions by Pering et al. [56] include: a) comparison with an optimal algorithm based on trace-based post-processing, and b) a study of real-time applications such as MPEG and audio decoding. They used Energy-Delay Product (EDP) 4 to quantify the efficiency of regular workloads and Energy Clipped Delay Product for workloads with real-time requirements. The latter incorporates the fact that running faster than required to meet deadlines does not provide any benefit. Some of the conclusions from these research studies include: a) CPU DVFS is an effective DPM technique; b) domain-specific applications such as MPEG/audio decoding need more tailored DVFS algorithms; c) while PAST was the least efficient policy, simpler history-based predictors such as PEAK are more efficient than complex history-based predictors; and d) prediction improvements can come from finer-grained, workload-specific observers.

In order to better organize DVFS techniques and be able to compare them to the above techniques, we can categorize them based on the following criteria (influencing the observer component, the controller component, or both):

• Algorithm input; Examples include:

– Run-time performance: e.g. utilization, uop/byte 5, or Instructions Per Cycle (IPC).

3 Utilization is defined as the ratio of the busy cycles over the total cycles.
4 EDP is a metric used to compare effectiveness of different low-power design techniques. Neither energy nor delay (performance) alone is sufficient for this purpose. EDP combines the two into a single metric to make sure a balance between energy saving and performance impact is maintained. In some cases, performance is even more important, so ED2P is used instead.
5 Micro-operations per byte of memory transfer


– Run-time energy consumption

• Algorithm policy; one of the following:

– Run at lowest frequency, keep performance (IPC) impact within a limit or utilization within a range.
– Minimize (variations of) EDP
– Minimize energy for a given required performance
– Maximize performance for a given power budget (power capping)
– Maximize performance for a given thermal limit (thermal capping)

• Type of performance and/or power input:

– Measured
– Predicted
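To make the observer/controller split and the interval-based policies described above concrete, the sketch below (in C) tracks measured utilization, predicts the next interval with either the PAST or an aged-average policy, and selects the lowest frequency expected to keep utilization under a target. The frequency table, history length, aging weights, and the 90% utilization target are illustrative assumptions, not values taken from [55, 56] or from any shipping governor.

/*
 * Minimal sketch of an interval-based (history-based) DVFS governor.
 * All names, thresholds, and the frequency table are illustrative.
 */
#include <stddef.h>

#define HIST_LEN 8
static const unsigned freq_mhz[] = { 200, 300, 410, 510, 624 };
#define NUM_FREQS (sizeof(freq_mhz) / sizeof(freq_mhz[0]))

static double util_hist[HIST_LEN];        /* index 0 = most recent interval */

/* Observer: record measured utilization (busy cycles / total cycles). */
static void observe(double util)
{
    for (size_t i = HIST_LEN - 1; i > 0; i--)
        util_hist[i] = util_hist[i - 1];
    util_hist[0] = util;
}

/* PAST: the next interval is predicted to look like the last one. */
static double predict_past(void) { return util_hist[0]; }

/* Aged average: weighted average with lower weights for older intervals. */
static double predict_aged(void)
{
    double num = 0.0, den = 0.0, w = 1.0;
    for (size_t i = 0; i < HIST_LEN; i++, w *= 0.5) {
        num += w * util_hist[i];
        den += w;
    }
    return num / den;
}

/*
 * Controller: pick the lowest frequency expected to keep utilization below
 * the target, assuming busy cycles are independent of frequency, so that
 * utilization scales with (current frequency / new frequency).
 */
static unsigned select_freq(unsigned cur_mhz, double predicted_util,
                            double target_util)
{
    double required = cur_mhz * predicted_util / target_util;
    for (size_t i = 0; i < NUM_FREQS; i++)
        if (freq_mhz[i] >= required)
            return freq_mhz[i];
    return freq_mhz[NUM_FREQS - 1];
}

/* One governor invocation per interval. */
static unsigned dvfs_tick(unsigned cur_mhz, double measured_util)
{
    observe(measured_util);
    double predicted = predict_aged();    /* or predict_past() */
    return select_freq(cur_mhz, predicted, 0.90);
}

The observer is the history update, the predictors are the two small functions, and the controller is the frequency selection; more elaborate policies differ mainly in how the prediction is formed.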

The algorithms studied in Pering et al. [56] used measured utilization as the algorithm input. Accordingly, the algorithm tried to keep utilization within a range by increasing/decreasing the frequency when it fell outside that range. To minimize frequency while maintaining the desired performance, measured utilization alone is not enough; prior knowledge or prediction of the workload is required, as reported by previous studies [55, 57, 58]. Prior workload knowledge, however, is not practical in many general-purpose use cases. Isci et al. [59] proposed an accurate predictor based on phase prediction, applying branch prediction techniques to accurately detect program phases. To satisfy performance requirements, they used a static performance prediction model. Phases were categorized based on their compute-to-memory data transfer ratio (uop/bytes). Each uop/bytes range was then mapped to a specific CPU frequency, along with a matching core CPU-to-memory frequency ratio. The uop/bytes ranges were selected empirically to ensure energy was saved without impacting performance in memory-bound phases. By adding run-time power measurement to the algorithm inputs, DVFS solutions for power capping become possible [60, 61]. Further improvement to DVFS can be achieved by adding performance and/or energy prediction models to the system. The goal of performance prediction is to predict power/performance at frequencies other than the current frequency at runtime. These models are generated by offline profiling and analysis. It is important that these run-time models have relatively low overhead. Performance counter based models are of special interest due to their low overhead and reasonable accuracy. Rajamani et al. proposed performance counter based power and performance models [62]. The models are created using linear regression. Power is modeled based on Decoded instructions Per


Cycle (DPC). Performance is modeled based on L2 cache latency and IPC. They proposed PerformanceMaximizer and PowerSaver policies: the former maximizes performance while keeping power within a limit, and the latter minimizes energy while maintaining a minimum performance. Using the same power model, in addition to a new temperature-power model, the authors propose a thermal-capping DVFS [63]. Spiliopoulos et al. propose similar models based on active and stall cycles. In order to predict performance and energy consumption at other frequencies, they scale stall cycles with frequency (since memory frequency is fixed, increasing the CPU frequency increases stall cycles linearly). The policies they propose minimize EDP, ED2P, and EDP given a maximum allowable performance impact. Other notable run-time power models designed for other purposes, but also useful for DVFS, are proposed by Goel and Huang [64, 65].

Desktop GPUs have been in use for 3D gaming applications for a long time. Within the last decade, high-performance general-purpose and scientific applications for desktop GPUs (General-Purpose computing on GPU (GPGPU)) have also become popular. DVFS algorithms for desktop GPGPU have been proposed in the literature [66, 67, 68, 69]. Nugteren et al. propose a DVFS algorithm to reduce energy consumption without affecting performance [66]. To achieve this goal, the algorithm identifies the bottleneck between the GPU core and memory, and reduces the operating frequency of the non-bottleneck module. This way performance remains intact while energy consumption is reduced. Paul et al. proposed a similar, but improved, solution [68]. Besides a compute-to-memory intensity factor similar to [66], they used seven other performance counters to characterize workload phases. Using regression analysis, they derived sensitivity models for core frequency and the number of cores, as well as memory bandwidth. These models are used at runtime to determine how to change frequency to reduce energy consumption with minimal performance impact. Che and Skadron proposed a prediction framework for GPGPU performance [69]. A set of GPGPU benchmarks is executed offline, selected workload features are extracted, and they are paired with the measured performance. In order to predict performance for a new workload, the values of the selected features are mapped to the closest profiled benchmark.

GPGPU applications do not utilize the parts of the GPU pipeline that are uniquely designed for 3D graphics. Predicting performance (and power consumption) for 3D graphics workloads is more complicated than for GPGPU and general-purpose CPU workloads. The GPU's deep pipeline consists of a large number of sub-blocks (fixed-function or programmable). Workloads vary considerably and can cause each of these sub-blocks to become the bottleneck at different times.
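The bottleneck-driven approach of [66] (refined in [68]) can be illustrated with a small sketch: compare the measured compute-to-memory intensity of an interval against the ratio of the hardware's peak compute and memory rates, then lower the frequency of whichever domain has slack. The counter names, thresholds, and decision policy below are assumptions made for illustration, not the exact heuristics of the cited work.

/*
 * Illustrative classification of an interval as compute- or memory-bound,
 * used to decide which frequency domain (GPU core or DRAM) can be lowered
 * without hurting performance. Names and thresholds are illustrative.
 */
struct gpu_counters {
    unsigned long long alu_ops;      /* compute operations in this interval */
    unsigned long long dram_bytes;   /* memory traffic in this interval     */
};

enum scaling_hint { LOWER_MEM_FREQ, LOWER_CORE_FREQ, KEEP_FREQS };

static enum scaling_hint classify(const struct gpu_counters *c,
                                  double peak_ops_per_s,
                                  double peak_bytes_per_s)
{
    /* Machine balance: ops per byte the hardware can sustain at peak. */
    double balance = peak_ops_per_s / peak_bytes_per_s;
    double intensity = (double)c->alu_ops / (double)c->dram_bytes;

    if (intensity > 1.2 * balance)
        return LOWER_MEM_FREQ;    /* compute-bound: memory has slack    */
    if (intensity < 0.8 * balance)
        return LOWER_CORE_FREQ;   /* memory-bound: GPU core has slack   */
    return KEEP_FREQS;            /* balanced: leave both domains as is */
}

The same intensity ratio is what [68] augments with additional performance counters and regression-based sensitivity models, so that the magnitude of the frequency change, not just its direction, can be chosen.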


Performance modeling for 3D graphics workloads on desktop GPUs is addressed in [70]. The authors assert that a GPU running a graphics workload does not have a single bottleneck. To identify rendering time, they combine scene complexity feature extraction (e.g., the number of polygons, number of vertices, number of pixels) with run-time performance sampling of primitives. They estimate the total rendering time as the maximum of the processing times estimated for the different stages, effectively assuming no serialization between the stages. For power consumption, Ma and Zhong proposed a power consumption prediction model for GPUs for both GPGPU and graphics workloads. They compare two statistical models based on a Simple Linear Regression (SLR) and a Support Vector Regression (SVR). They used SVR instead of SLR given the unique characteristics of their work. They showed that for a set of validation data, their SVR model is more accurate than SLR for both graphics and GPGPU workloads. Lastly, Sheaffer et al. proposed a detailed trace-driven power estimation framework as part of a 3D graphics simulator [71]. It is difficult to use a simulation model for run-time power estimation for DVFS. They used the model to study the effectiveness of a DVFS solution for a GPU with two clock domains for the vertex and shader processing subsystems. Their solution relies on their observations about bottleneck shifting between these two subsystems at different times (the same behavior observed by Wimmer et al. [70]). Their algorithm is a history-based DVFS that changes the frequency of either of these clock domains based on the occupancy of the buffer between the two subsystems. The idea is to reduce the clock frequency of the non-bottlenecked subsystem.

Mobile devices using mobile Systems on a Chip (SoCs) are battery powered and lack active cooling. Minimizing energy consumption is one of the top design priorities alongside performance. As a result, DPM, and more specifically DVFS, has been an active area of research in mobile SoCs. Back in 2000, Grunwald et al. used the Itsy Pocket Computer, tailor-made for research purposes [72]. Their goal was to study the effectiveness, on a pocket computer, of the CPU DVFS algorithms previously proposed in [55, 56]. Raghunathan et al. later studied DVFS for run-time energy-fidelity trade-offs for wireless mobile embedded CPU applications [73]. Accurate power modeling on mobile CPUs was further studied by Contreras et al. and Holmbacka et al. [74, 75]. Skin temperature-aware CPU DVFS was proposed by Egilmez et al. [8]. The Linux On-Demand governor [12] was used as the default DVFS algorithm in this work. In the case of a thermal emergency, temperature-specific DVFS overrides the frequency setting to reduce temperature. Contrary to desktop GPUs, GPGPU applications on mobile devices are not as common yet. There are specific use cases for mobile GPGPU such as camera image processing. Kim and Kim recognized the unique challenges facing GPGPU applications on mobile devices [15].


Commonly, GPGPU and graphics workloads share the same GPU. The authors argue that existing DVFS algorithms are optimized for graphics workloads, and that modifying the algorithms to improve the GPGPU performance/energy consumption balance can reduce the energy efficiency of the graphics workload. The authors propose a weighted threshold mechanism to combine interval-based GPGPU and graphics DVFS. Each type of workload has its own optimized thresholds. An aggregated threshold is calculated based on the relative intensity of workloads of each type. A weighted threshold, a variable interval, multilevel frequency change, and an API to set the desired frequency constitute the proposed DVFS improvement for mobile GPGPU in this work.

Mobile 3D graphics power management has been the target of research within the past decade. Early research into power and performance analysis of 3D games started on mobile CPUs before mobile GPUs were widely available [76, 77]. Mochocki et al. used frame signatures to improve upon history-based algorithms [78]. Improvements are achieved by using history as a prediction of the future only when frame signatures match; otherwise, the frequency is set to maximum. Motivated by their findings [76], Gu and Chakraborty focused on improving run-time prediction for 3D games using control-theory approaches [79], detailed frame-structure analysis [80], and a hybrid approach combining the two [14]. A predictive power gating technique for mobile GPU shaders has been proposed by Wang et al. [81]. It uses simple history-based prediction to predict the FPS within the next cycle, and shuts down shader processors if the predicted FPS is above the expected rate. A linear relationship between the number of shaders and FPS was assumed. More recently, researchers have leveraged simulated mobile GPUs [82] and real mobile SoC hardware [83, 84, 16]. The DVFS algorithm proposed by Silpa et al. [82] provided the following advantages over previous work: a) it improves run-time prediction by incorporating pixel and texture workload into the prediction; b) rather than predicting the exact runtime, they use the prediction to proactively detect workload changes; and c) by taking advantage of one of the unique characteristics of mobile GPUs, tiled rendering, they introduce the opportunity to correct mispredictions and reduce the performance impact. Park et al. published a detailed performance and energy analysis using the Qualcomm Snapdragon S4 SoC (equipped with an Adreno A320 GPU) [83]. The thoroughness of this study on a recent mobile SoC makes it one of the most relevant and useful to our own. Finally, Pathania et al. conducted research on the advantages of an integrated CPU-GPU approach to DVFS [84, 16]. They showed DVFS improvements using their techniques compared to the Linux On-Demand governor [12]. Their comparison, however, pits two very different DVFS solutions against each other, with the baseline being the performance-oriented On-Demand governor. Therefore, it is not evident that the improvements are actually due to the integration between CPU

and GPU DVFS. In fact, some of the reasons why such an integration might not be necessary or beneficial are enumerated in Section 2.3.
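As a concrete illustration of one of the mechanisms above, the weighted-threshold combination used by Kim and Kim [15] can be written as a weighted average of the per-workload-type thresholds. The formulation below is one plausible reading of that description, stated only to clarify the idea; it is not necessarily the authors' exact equation.

\[ T_{agg} = w_{gfx} \, T_{gfx} + w_{gpgpu} \, T_{gpgpu}, \qquad w_{gfx} = \frac{B_{gfx}}{B_{gfx} + B_{gpgpu}}, \quad w_{gpgpu} = 1 - w_{gfx} \]

Here T_gfx and T_gpgpu are the thresholds tuned separately for graphics and GPGPU work, and B denotes a measure of the relative intensity (e.g., busy cycles) of each workload type in the current interval.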

1.2 Contributions of this thesis

The major goal of this thesis is to propose novel DVFS solutions that satisfy the strict energy-saving requirements of mobile GPUs, while guaranteeing the soft real-time requirements of these workloads. In doing so, we have to identify the shortcomings of existing work and try to fill the void. Our target framework is a state-of-the-art Snapdragon mobile SoC platform. The primary objective of our proposed DVFS solutions is to guarantee individual-frame deadlines (< 17 ms, based on a 60 FPS display). Reducing energy consumption should not come at the cost of frame deadline violations. Additionally, our DVFS solutions should perform well across different use cases and not be optimized for a select group of applications. Any solution is required to operate at frame-level latencies. Based on these requirements and our review of the existing work on mobile GPUs, we have identified a number of open problems we plan to address in this thesis. Specifically, we focus on the following limitations of existing solutions:

(1) DDR frequency scaling is virtually non-existent. Existing DVFS solutions assume the DDR frequency is fixed, either implicitly or explicitly. There are two previous studies that consider DDR frequency scaling for GPGPU applications [66, 68]. Still, their solutions do not fully consider the interrelationship between the GPU and DDR frequencies for maximum efficiency. The importance of DDR frequency scaling has been touched upon in recent work on mobile SoCs [84, 83] without any detailed analysis or proposed solutions. We demonstrate in Chapter 4 that this factor alone has a significant impact on workload performance and energy, motivating our work in this area.

(2) Considerably high latency in reacting to workload changes: A low-latency frequency adjustment is critical for mobile graphics workloads with soft real-time requirements. Based on our latency profiling results, the iPhone 6+ has a 100 ms latency for one frequency level change, and takes about 350 ms to go from the lowest to the highest frequency when the workload changes. Similarly, the same transitions take 120 ms and 600 ms for the Galaxy S5, respectively. Such high latencies (on the order of tens of frames) result in frame deadline violations at a level that is not acceptable for graphics workloads, especially for Virtual Reality (VR) and Augmented Reality (AR) use cases.

(3) Existing mobile GPU DVFS solutions lack the required runtime visibility into the workload and QoS requirements of individual applications concurrently

running on the GPU. Consequently, they cannot make decisions tailored to individual applications' QoS requirements.

We can summarize the contributions of this thesis, addressing the limitations we have identified, as follows:

1. We propose two innovative mobile GPU DVFS solutions for graphics workloads with the primary goal of minimizing frame deadline violations, while improving energy efficiency.

• QoS-aware DVFS: A DVFS governor algorithm with low-latency response to workload changes. This algorithm is suitable as a drop-in replacement for existing GPU DVFS algorithms such as the On-Demand, the Simple, or the msm-adreno-tz governors. This solution enables new opportunities for improved QoS-aware DVFS solutions by incorporating context-awareness into the DVFS algorithm.

• Energy-aware DVFS: An integrated GPU-DDR DVFS solution that improves upon QoS-aware DVFS. By enabling integrated GPU-DDR DVFS, this algorithm leverages the full GPU-DDR frequency space to arrive at a more energy-efficient selection compared to existing solutions that ignore DDR frequency scaling. Energy-aware DVFS achieves this goal through fine-grained statistical performance and energy model analysis.

2. We propose heuristic algorithms for OPP-space configuration selection to address the challenges imposed by the large configuration space of a combined GPU-DDR frequency space for those DVFS solutions that employ incremental frequency changes.

3. We develop an analysis framework that enables fine-grained performance and energy consumption analysis of mobile graphics workloads. We use this framework to perform a thorough workload energy and performance analysis, as well as design space exploration; based on this analysis we identify the areas with the most potential for positive impact to focus on in our proposed solutions.

1.3 Organization of this thesis

The rest of this thesis is organized as follows: Chapter 2 provides relevant background, including power consumption components in modern CMOS technology, an overview of the GPU programming model, and a review of the DVFS techniques proposed in the literature.


In Chapter 3 we present the fine-grained DVFS analysis framework used throughout this thesis. The framework enables fine-grained (frame-level) performance and energy consumption analysis for mobile 3D graphics workloads. It also enables our DVFS algorithm design space exploration and opportunity analysis.

We start Chapter 4 by conducting a performance and energy consumption analysis of 3D graphics workloads. In this process, we evaluate some of the widely relied upon assumptions in existing DVFS techniques and gain more insight into the dynamic behavior of 3D graphics workloads relevant to DVFS. In Section 4.2 we perform a DVFS design space exploration that can be used to quantify the impact of different configuration parameters on the effectiveness of the algorithms. We identify the top contributing factors to the quality of a DVFS algorithm in this section. Later in this chapter, in Section 4.3, we propose heuristic algorithms to select an optimized OPP configuration for a DVFS algorithm that satisfies specific criteria.

We present the first of our two DVFS solutions, QoS-aware DVFS, in Chapter 5. By making DVFS QoS-aware, and through accurate tracking of the frame draw time at runtime, QoS-aware DVFS significantly improves frame deadline guarantees over existing governor algorithms. In this chapter, we present the evaluation results of an implementation of QoS-aware DVFS on a Snapdragon SoC and compare this solution against Qualcomm's msm-adreno-tz DVFS governor.

Chapter 6 includes the details of our proposed energy-aware DVFS based on statistical performance and energy models. We describe, in detail, the process for selecting the runtime workload event performance counters for our models, as well as the process we used to design accurate performance and energy models for our DVFS algorithm. We present a detailed analysis of the effectiveness of energy-aware DVFS in this chapter. Finally, Chapter 7 presents our concluding remarks and future work.

Chapter 2

Background and Related Work

Next, we present background information and prior work on DVFS. We start with an overview of the power consumption model for CMOS VLSI circuits. We then follow with a review of GPU architecture and GPU programming models. Finally, we relate previous work on workload characterization, performance and power/energy consumption analysis, performance/energy models, and existing DVFS techniques for CPUs, GPUs, and mobile SoCs.

2.1 Power consumption in CMOS circuits

Among the different types of transistors and logic families, CMOS technology is predominantly used for VLSI circuit fabrication. CMOS circuits utilize PMOS and NMOS transistors in two complementary networks [85]. The pull-up network is composed of PMOS transistors and is turned on when the output is a 1; a path from VDD to the output is created through the pull-up network. To set the output to 0, the pull-down network is turned on. Figure 2.1.a shows the high-level organization of CMOS circuits. Energy in an electronic circuit is consumed when electric current flows through resistors; the energy is dissipated in the form of heat. Pull-up and pull-down transistor networks in a CMOS circuit can be modeled by equivalent resistors and capacitors. In CMOS circuits, most of the energy is consumed when there is a path between the voltage source (VDD in Figure 2.1.a) and ground. CMOS circuits are designed to disconnect the path between the voltage source and ground when no operation is being performed. Paths from the voltage source to ground are connected (through

pull-up/pull-down networks) when the circuit becomes active, performs a computation, and the voltage on different nodes changes as a result. Ideally, the idle energy consumption of a CMOS circuit should be zero, because there is no path between the voltage source and the ground. In reality, however, a very small amount of current flows between the voltage source and ground even when the circuit is idle and the pull-up/pull-down networks are turned off. Energy consumption in CMOS circuits thus is composed of two components:

Etotal = Estatic + Edynamic

Static energy (Estatic) is the energy consumed whether the circuit is operational or idle.

Dynamic energy (Edynamic) is the energy consumed only during circuit operation. These two components are described in more detail in the following sections.


Figure 2.1: Energy consumption in CMOS circuits during 0 → 1 and 1 → 0 transition on the output node [1]

2.1.1 Dynamic power

An NMOS/PMOS transistor has multiple capacitance components. These components can be modeled as a lump-sum capacitance CL at the output node of a CMOS circuit. CL, among other

parameters, affects the performance of a CMOS circuit¹, as well as the energy required for a 0 → 1 or 1 → 0 transition of the output node.

Whenever a 0 → 1 transition occurs on the output node, CL is charged through the pull-up network. When the load capacitance is charged, some energy is stored in the capacitance. The amount of energy stored in CL, as well as the energy drawn from the voltage source VDD, are described by the following equations [85]:

E_CL = E_0→1 = 1/2 · C_L · V_DD^2        (2.1)
E_VDD = C_L · V_DD^2                      (2.2)

The amount of energy drawn from the voltage source is twice the energy stored in the capacitance CL. The extra energy is dissipated in the effective resistor of the pull-up network in the CMOS circuit (Figure 2.1.b). During a 1 → 0 transition, the pull-up network is turned off (there is no path between the

VDD and the CL). The pull-down network is turned on during this transition, and the energy stored in CL is dissipated through the effective resistance of the pull-down network (Figure 2.1.b). In this case, no energy is drawn from the voltage source, VDD. Therefore, the total energy consumed in a 0 → 1 transition followed by a 1 → 0 transition is C_L · V_DD^2. The CMOS circuit shown in Figure 2.1.a is used to implement the logic terms of a binary logic equation. There are limits on the number of transistors in either a pull-up or pull-down network to ensure reliable and fast operation. These smaller blocks are combined to create larger circuits. Considering a CMOS circuit composed of a large number of nodes, Equation 2.3 is used to quantify the energy consumed for a specific logical operation in such a circuit.

E_op = 1/2 · C_L · V_DD^2 · α_trans = C_L · V_DD^2 · α_0→1        (2.3)

CL in Equation 2.3 is the accumulated sum of the output capacitances of all the nodes in the circuit. αtrans is called the activity factor. Not all the nodes in the circuit change state when inputs are applied. αtrans is the ratio of the nodes that make a transition. α0→1 is the ratio of the nodes in the circuit that have a 0 → 1 transition. Given that energy is drawn from VDD when there is a 0 → 1 transition, the final equation can be derived from Equation 2.2. This is the most commonly used form of this equation.

1 i.e., the amount of time it takes for the output node to transition from 0 to 1 or vice versa.


CMOS circuits can be categorized as synchronous and asynchronous, with synchronous logic being the most widely used of the two. In synchronous circuits, all operations are synchronized with a dedicated clock signal. No transition occurs in the circuit nodes unless there is a transition on the clock signal. The energy consumption for a given task that takes Nclk clock cycles to finish is specified by Equation 2.4 as follows:

E_task = C_L · V_DD^2 · α · N_clk        (2.4)
E_task = C_L · V_DD^2 · α · f · t        (2.5)

Here, the activity factor α is the average activity factor of the circuit across the N_clk clock cycles. N_clk can be expanded to f · t, where f is the frequency² of the clock signal and t is the time it takes for the task to finish (Equation 2.5). In addition to the energy consumption of a digital IC, power consumption³ is also an important factor. Power consumption, P, is the energy consumed per unit time. It is calculated by dividing the energy consumption by the total time. Equation 2.6 can be used to calculate the power consumption of a synchronous CMOS circuit.

P_task = C_L · V_DD^2 · α · f        (2.6)

The energy and power consumption equations presented above can be used to calculate the dynamic components of the energy/power consumption. Note, however, that there is another component of the dynamic energy/power consumption called short-circuit power. Similar to the dynamic power related to the charge and discharge of the load capacitance, short-circuit power is only consumed if there is a transition on the circuit nodes. Figure 2.1.a shows the short-circuit power path in a CMOS circuit. Assume that the inputs to the circuit change and cause a transition on the output node from zero to one (or one to zero). Before the input changes, the pull-down network is on. After the input changes, the pull-up network should be on and the pull-down network should be off. This transition does not happen instantly; there is a short time period during which both the pull-up and the pull-down networks are turned on. During this short period, there is a path from the voltage source to ground and energy is consumed. The short-circuit component of the dynamic power can be optimized at the transistor level. Today's CMOS transistors

2 The number of 0 → 1 transitions of the clock signal in a unit of time.
3 The amount of energy consumed in a unit of time.

have negligible short-circuit power. Therefore, power optimization techniques generally focus on the dynamic power consumption due to the load capacitance charge and discharge. The following equations, derived from Equations 2.4-2.6, are usually referenced in the literature on low-power design techniques. They focus on relationships rather than exact equations, and may ignore fixed or less important parameters in specific use cases (e.g., Equation 2.9 that drops

Nclk from the energy equation). These equations are intended to highlight the variable parameters that affect energy/power consumption. Various low-power design techniques try to minimize one or more of these parameters at design time, through the manufacturing process, or using run-time power management techniques. CL and α are sometimes combined into one parameter, called the effective capacitance (Ceff ).

C_eff ∼ α · C_L                  (2.7)
E ∼ C_eff · V^2 · N_clk          (2.8)
E ∼ C_eff · V^2                  (2.9)
P ∼ C_eff · V^2 · f              (2.10)

Most power and thermal management techniques target α, CL, VDD, and/or f to reduce power/energy consumption. Nclk is assumed to be dependent on workload. Thus, it is considered a fixed parameter in a number of low-power design techniques that optimize the hardware power configuration. Consequently, in those scenarios, Equation 2.9 is used to describe the parameters affecting energy consumption.
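To make these proportionalities concrete, the short Python sketch below evaluates Equations 2.8-2.10 for a fixed task at two hypothetical operating points. All constants are illustrative placeholders rather than measured values; the point is only to show that lowering the frequency alone does not reduce the energy of a fixed task (N_clk is workload-determined), whereas lowering the voltage together with the frequency reduces both power and task energy quadratically in V.

```python
# Illustrative evaluation of Equations 2.8-2.10; all numbers are made up.
C_EFF = 1e-9      # effective capacitance (F), assumed constant for the task
N_CLK = 50e6      # clock cycles needed by the task (workload-dependent, fixed)

def task_energy(v):
    """E ~ C_eff * V^2 * N_clk (Equation 2.8); frequency does not appear."""
    return C_EFF * v**2 * N_CLK

def power(v, f):
    """P ~ C_eff * V^2 * f (Equation 2.10)."""
    return C_EFF * v**2 * f

def task_time(f):
    """t = N_clk / f: a lower frequency stretches the task over more time."""
    return N_CLK / f

opps = {
    "high OPP (1.0 V, 600 MHz)": (1.0, 600e6),
    "low OPP  (0.8 V, 300 MHz)": (0.8, 300e6),
}

for name, (v, f) in opps.items():
    print(f"{name}: E = {task_energy(v)*1e3:.1f} mJ, "
          f"P = {power(v, f)*1e3:.0f} mW, t = {task_time(f)*1e3:.1f} ms")
```

With these made-up numbers, the lower OPP consumes 32 mJ instead of 50 mJ for the same task, at the cost of roughly doubling its execution time; this V^2 trade-off is exactly what voltage and frequency scaling exploits.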

2.1.2 Static power

Static power is dissipated through transistors in the pull-up and pull-down network of a CMOS circuit, even when the inputs do not change and the circuit is idle. The main components of the static power include [86]:

• Subthreshold leakage
• Gate-oxide tunneling current
• Reverse-biased junction leakage current

Subthreshold leakage is the dominant source of static power consumption. A majority of power optimization techniques that target static power focus on reducing subthreshold leakage. An


NMOS transistor is supposed to be turned off when VGS < Vth (VGS is the voltage difference between the gate and the source of the transistor; Vth is the threshold voltage). However, there is a small amount of current flow even when VGS < Vth. The amount of subthreshold leakage current increases as Vth decreases. Thus, leakage current increases as the technology node shrinks, because node shrinkage is usually accompanied by a reduction in Vth. Leakage current also has an exponential relationship with temperature. When the leakage current is too high, there is the possibility of a phenomenon called thermal runaway: higher power consumption increases temperature, which in turn increases leakage current (and consequently power consumption). A positive feedback loop is formed, resulting in a continuous increase in power and temperature that can ultimately cause physical damage [87]. Lastly, leakage current has a cubic relationship with voltage. A detailed explanation of these components and their equations is beyond the scope of this thesis and not necessary for our discussion. Note that static power consumption has historically been negligible compared to dynamic power. While its contribution to overall power usage has been increasing with shrinking technology sizes, innovative fabrication techniques have still managed to keep static power consumption relatively low (for example, 14 nm technology has lower leakage current than 20 nm). Consequently, a majority of the prior work in this area has focused on dynamic power consumption. The following equations summarize the important factors impacting static power consumption:

P_static ∼ leakage ∼ V_th        (2.11)
P_static ∼ Temp                  (2.12)
P_static ∼ V_DD^3                (2.13)

2.2 GPU programming

We provide an overview of the GPU programming model, as well as the GPU pipeline from the perspective of the 3D graphics programming API, in this section. GPUs are highly parallel devices. They are generally composed of many uniform computing elements working in parallel, with control logic that distributes parallel work between those computing elements. GPUs were initially designed to process 3D graphics workloads, supporting numerous rendering mechanisms, configurations, and data formats. As a result, GPUs require a rather complex and lengthy initialization process. Given this degree of overhead, GPUs


Figure 2.2: a) GPU architecture, b) CPU-GPU communication diagram, and c) the command-buffer (ring-buffer)

can be less efficient than CPUs when processing small workloads, or workloads with low latency requirements. To streamline GPU programming and make it possible for more programmers to leverage these devices, high-level programming APIs have been developed. Hardware vendors implement device drivers that support these programming APIs, abstracting away the complexity of the initialization process and supporting other common tasks with hardware-dependent implementations. In this context, a GPU works as a co-processor for a CPU. Figure 2.2.a shows a high-level view of the architecture of a GPU. What should be noted from this figure is the GPU pipeline, which has separate front-end and back-end programmable components. Correspondingly, a GPU has two different instruction sets, one for the front-end Command Processor (CP) and the other for the programmable stages, including the vertex/fragment processor. The CP (front-end) is programmed by the GPU device driver. It supports instructions that enable data transfers between the CPU and the GPU, as well as programming GPU configuration registers. Developers write programs for the vertex/fragment processor using the shader language supported by the programming API (e.g., the GLSL language for the OpenGL API [88]). Figure 2.2.b shows the interaction between a CPU and a GPU. Main memory is used as the main communication interface between the CPU and the GPU. The CPU writes commands for the GPU into a specific memory location. Concurrently, the GPU fetches those commands from main memory and executes them. As demonstrated in this figure, the CPU does not have to wait for the GPU to finish processing a set of commands. Optionally, however, the CPU can wait until a specific command batch previously written into memory has finished executing on the GPU. Such a synchronization mechanism is required when the results of the previous commands are to be used in producing new work for the GPU. In general, a well-written application allows CPU and GPU processing to proceed in parallel. Triple buffering, a feature introduced in Android 4.1, Jelly Bean, relies on the asynchronous nature of CPU-GPU work to allow applications to remain two frames ahead in order to guarantee a smooth graphical experience⁴. Figure 2.2.c shows in more detail how the CPU and GPU communicate through memory. The idea is that a fixed-size memory region called the command buffer is allocated to hold commands added by the CPU. Since the command buffer is a circular buffer, it is commonly referred to as the ring-buffer. When the CPU has work for the GPU, it adds commands to the end of the ring-buffer and updates the write-pointer. The write-pointer always points to the next empty buffer index. The GPU then reads commands starting at the read-pointer. The read-pointer always points to the oldest command

4 Note that some specific use cases might avoid triple buffering. For example, it is common practice in VR applications to use single buffering to reduce motion-to-photon latency [89].

in the command buffer. Following this protocol, commands in a given ring-buffer are consumed by the GPU in a first-come-first-served fashion. Clearly, the CPU can add commands to the ring-buffer while the GPU is consuming previously written commands (thus resulting in asynchronous operation of the CPU and the GPU). Note that if the GPU becomes a bottleneck and the buffer becomes full, the CPU has to wait until the GPU executes commands to free up space in the buffer. Conversely, if the CPU is the bottleneck, the GPU will process commands until the ring-buffer is empty. At that point, the GPU blocks until there are more commands to process.
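The ring-buffer protocol described above can be summarized with the following minimal Python sketch. The class and method names are our own simplification for illustration; the real driver manages this structure in shared memory with hardware-visible read/write pointers.

```python
class RingBuffer:
    """Simplified model of the CPU-to-GPU command ring-buffer described above."""

    def __init__(self, size):
        self.slots = [None] * size
        self.read_ptr = 0    # oldest command batch not yet consumed by the GPU
        self.write_ptr = 0   # next empty slot the CPU will fill
        self.count = 0       # number of pending command batches

    def submit(self, command_batch):
        """CPU side: returns False when the ring is full (the CPU must wait)."""
        if self.count == len(self.slots):
            return False
        self.slots[self.write_ptr] = command_batch
        self.write_ptr = (self.write_ptr + 1) % len(self.slots)
        self.count += 1
        return True

    def fetch(self):
        """GPU side: consumes batches first-come-first-served; None means empty."""
        if self.count == 0:
            return None          # GPU idles until the CPU submits more work
        batch = self.slots[self.read_ptr]
        self.read_ptr = (self.read_ptr + 1) % len(self.slots)
        self.count -= 1
        return batch

# Example: the CPU stays ahead of the GPU until the ring fills up.
rb = RingBuffer(size=4)
for i in range(6):
    if not rb.submit(f"cmd-batch-{i}"):
        print(f"ring full, CPU blocks before batch {i}")
        break
print(rb.fetch())  # 'cmd-batch-0': the oldest batch is consumed first
```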

2.2.1 3D graphics programming model

There are a number of popular 3D graphics programming APIs available to programmers. Examples include DirectX, OpenGL, Mantle, Vulkan, and Metal [90]. Some of these APIs, such as DirectX from Microsoft and Metal from Apple, are only available on a specific platform. On the other hand, OpenGL is an open standard API driven by the Khronos Group. OpenGL ES is a variation of OpenGL tailored for mobile GPU targets. OpenGL ES is supported on many different platforms from different vendors. Currently, the latest version of this API is OpenGL ES 3.2. The first revision of this API, OpenGL ES 1.x, supported a fixed-function GPU; i.e., different stages of the graphics pipeline performed predefined graphics operations. On the front end, the CP was programmable in this version (refer to Section 2.2). OpenGL ES 2.0, the most widely used version of the OpenGL ES API, added support for programmable shading. Figure 2.3 shows the programmable pipeline of OpenGL ES 2.0. As can be seen in this figure, there are two programmable stages, the vertex and fragment shaders, within the pipeline, distinguished with a lighter (yellow) color. Other blocks are fixed-function, and perform predefined processing on primitives, vertices, or fragments. Vertex and fragment shader code is written in the OpenGL Shading Language (GLSL) [88]. The graphics pipeline shown in Figure 2.3 highlights the various computation and transformation steps required for 3D graphics, from a software standpoint. Hardware vendors that support graphics APIs design their GPU hardware pipeline to be very similar to this software pipeline. But there can be differences between the two. Notably, in embedded mobile GPUs, the vertex and fragment shader stages are mapped to a single unified shader [91, 92, 81]. The motivation for using a unified shader is to make the shader stage (and the GPU, as a result) more area- and energy-efficient. By properly scheduling vertex and fragment shading work on the unified shader, we can increase utilization; better utilization translates to higher performance per unit area and per unit energy. Another unique characteristic of mobile GPUs is binned rendering (also referred to as tiled


rendering [82]). Mobile GPUs employ relatively small, fast on-chip memory to meet the strict area requirements. Restricting memory access to the on-chip memory improves performance and energy efficiency, versus performing main memory accesses. The frame-buffer area is divided into smaller rectangles (tile/bin). The optimal size and number of tiles is identified so that all tile data (such as color, depth, and stencil, among others) fits into the on-chip graphics memory. A 3D graphics workload flows through the following pipeline stages in a tile-based GPU: (1) Tiling, (2) Unresolve, (3) Rendering, (4) Resolve. The tiling process is performed once. In this stage, individual tiles, and the subset of geometries that are visible in each tile, are identified. Stages (2)-(4) are repeated for each and every tile. These stages use the information from the tiling phase to process only the geometries that fall within a tile. If the contents of a previous rendering step are used as the basis for the next rendering step, the content of the next tile is copied from external memory to the on-chip memory during unresolve, stage (2), before the rendering of the tile starts. After the rendering of the tile in stage (3), the frame buffer contents for the tile are copied from on-chip memory to the frame buffer in external memory during the resolve operation in stage (4).

Figure 2.3: OpenGL ES 2.0 graphics programming pipeline [2].

As shown in Figure 2.3, the rendering phase involves a number of different processing steps, most of which perform some operations on vertices and fragments. The bulk of the computation, however, is spent in the programmable vertex and fragment shading stages (as well as the texture processing operations, which are part of fragment shading). There are three vertices for each triangle (adjacent triangles can share vertices, reducing the total number of unique vertices to less than three times the number of triangles). The number of fragments per triangle, however, depends on the size of the triangle and the frame-buffer resolution.


Generally, the total number of fragments is considerably larger than the number of vertices, except for some special use cases. Consequently, a significant portion of the processing time and energy consumption is spent in the fragment shading phase. The vertex shading process involves Arithmetic Logic Unit (ALU) operations to transform vertex attributes such as location, color, and light, and memory operations to fetch vertex attributes and save transformed attributes [93]. Triangles assembled from vertices are then rasterized into individual fragments (i.e., pixels), and pass through the fragment shading stage to form the final pixel attributes. ALU operations performed during fragment shading take as input texture data and other fragment attributes to form the final color and light intensity of the fragment. Texture Processor (TP) hardware fetches multiple Texture Elements (texels) from memory, and combines them into one data point for the color blending of an individual fragment using texture bilinear operations. The last concept we want to review here, which is relevant to our proposed DVFS solution in Chapter 5, is the eglSwapBuffers API. When all the rendering commands for a frame have been submitted by an OpenGL application, the end of the frame is signaled by calling the eglSwapBuffers API. The eglSwapBuffers call communicates the next frame buffer to be displayed to the display subsystem. In return, the application receives a new frame buffer to render to. eglSwapBuffers is an asynchronous API: it returns an empty frame buffer right away, if one is available, without waiting for the rendering of the previous frame to complete. The display can consume the frame buffer it receives with the eglSwapBuffers call only after all the rendering commands that target that buffer finish executing and the frame is ready for display. Memory synchronization primitives are used to coordinate the completion of the frame, signaling when the display can consume the frame buffer. Using triple buffering (Section 2.2), after enqueuing three frames, the eglSwapBuffers call blocks until one frame is transferred to the display and a free frame buffer becomes available.

Figure 2.4: eglSwapBuffers is used to identify frame boundaries: (A) the same pace between rendering and display, (B) rendering faster than display, (C) rendering slower than display: a jank (frame miss) occurs.

Figure 2.4 illustrates different scenarios for frame rendering, highlighting the End Of Frames (EOF), where a frame is ready for display. The start of each display refresh cycle, the VSYNC, is also identified. The VSYNC is considered the frame rendering deadline. Figure 2.4.b.

and 2.4.c show the cases where frame rendering is faster and slower than the frame display rate, respectively. If the frame rendering finishes after the VSYNC (Figure 2.4.c), a frame deadline violation occurs. Figure 2.4.a shows when a balance is achieved between frame production by the application and consumption by the display.
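As a concrete illustration of the deadline semantics in Figure 2.4, the following sketch (our own simplification, assuming a fixed 60 Hz display and ignoring the extra slack that triple buffering provides) counts frames whose end-of-frame time slips past the VSYNC they were expected to meet:

```python
VSYNC_PERIOD_MS = 1000.0 / 60.0   # ~16.7 ms per display refresh at 60 FPS

def count_janks(frame_end_times_ms):
    """Count frames whose EOF lands after the VSYNC they were expected to meet.

    frame_end_times_ms: cumulative end-of-frame timestamps, one per frame.
    Frame i is expected to be ready by VSYNC number i+1 (simplified model).
    """
    janks = 0
    for i, eof in enumerate(frame_end_times_ms):
        deadline = (i + 1) * VSYNC_PERIOD_MS
        if eof > deadline:
            janks += 1
    return janks

# Frames 1 and 2 finish in time; frame 3 misses its ~50 ms deadline.
print(count_janks([15.0, 31.0, 52.0]))  # -> 1
```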

2.3 Dynamic voltage and frequency scaling

In this section, we review some of the previous work on DVFS. DVFS techniques belong to a category of power management techniques called DPM. DPM techniques monitor the system behavior at runtime and adjust the system configuration to reduce power/energy consumption [18]. DPM techniques can reduce dynamic energy consumption, static energy consumption, or both. Clock gating techniques [20, 21] are among the DPM techniques that optimize dynamic energy consumption. Power gating techniques (a.k.a. power collapse) [92, 53, 52] are among the DPM techniques that reduce both dynamic and static energy consumption. Both of these techniques aim to reduce the energy consumption of idle hardware blocks. These two techniques are interrelated with DVFS, since DVFS decisions essentially impact the active vs. idle time period (higher frequency results in increased idle time) [81]. Therefore, the overall energy efficiency of a specific system configuration depends on the combined impact of these power management techniques. Our focus in this thesis is on DVFS solutions. The runtime operation of a DVFS technique can be broken down into two components: 1) the Observer and 2) the Controller. The Observer monitors the system and tries to predict the future behavior of the system. The Controller identifies and applies an optimized system configuration based on the predicted behavior. It is notable that, in general, predicting the future workload is not practical except for some special use-cases in which the workload is known prior to execution. To address this limitation, DVFS techniques utilize history-based workload prediction. By relying on temporal locality, they predict the workload to be similar to, or a function of, the past workload. History-based DVFS techniques track workload behavior in specific (usually fixed) time intervals. These profiled statistics are then used as a proxy for the upcoming time interval. Depending on the granularity and accuracy of the prediction, the controller can adjust the configuration either by: a) incrementally changing the voltage/frequency, or b) identifying the proper configuration that satisfies performance requirements while reducing energy; in the latter case the voltage/frequency is set in one step, as opposed to multiple incremental changes. The majority of the DVFS techniques we review in this chapter utilize history-based prediction. It is worth noting that in the context of this thesis the following terms refer to the same concept and are interchangeable: a) voltage/frequency level,

b) frequency level, c) (DVFS) performance level, and d) OPP. Specifically, when we talk about a frequency change, we are referring to both voltage and frequency changes.
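The Observer/Controller split can be illustrated with a minimal history-based interval governor, sketched below in Python. It assumes a PAST-style predictor (the next interval is predicted to look like the last one) and a simple threshold controller that steps through a fixed OPP table one level at a time; the thresholds and frequency values are arbitrary placeholders, not the parameters of any production governor.

```python
OPP_TABLE_MHZ = [180, 300, 450, 600]   # hypothetical OPPs, lowest to highest
LOW_THRESHOLD = 0.40                   # below this utilization, step down
HIGH_THRESHOLD = 0.85                  # above this utilization, step up

class HistoryBasedGovernor:
    def __init__(self):
        self.level = len(OPP_TABLE_MHZ) - 1   # start at the highest OPP

    def observe(self, busy_time, interval_time):
        """Observer: the utilization of the past interval is used as the
        prediction for the next interval (PAST policy)."""
        return busy_time / interval_time

    def control(self, predicted_util):
        """Controller: incremental frequency change bounded by the OPP table."""
        if predicted_util > HIGH_THRESHOLD and self.level < len(OPP_TABLE_MHZ) - 1:
            self.level += 1
        elif predicted_util < LOW_THRESHOLD and self.level > 0:
            self.level -= 1
        return OPP_TABLE_MHZ[self.level]

    def tick(self, busy_time, interval_time):
        return self.control(self.observe(busy_time, interval_time))

gov = HistoryBasedGovernor()
for busy in [9.0, 3.0, 2.0, 15.0]:             # GPU busy time per 16.7 ms interval
    print(gov.tick(busy, interval_time=16.7))  # OPP chosen for the next interval
```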

2.3.1 Desktop CPU

Studies published by Govil et al. [55] and Pering et al. [56] are among the earliest research studies into the advantages of DVFS for power reduction in high-performance desktop CPUs. Their studies were conducted using trace-driven simulation, before hardware support for DVFS existed. The workload prediction phase of the DVFS techniques explored by Govil et al. [55] relied on CPU utilization to characterize the workload. Execution traces were divided into fixed-length intervals. The utilization of the next interval was calculated as a function of the past intervals. The PAST prediction policy simply assumed that the next interval has the same utilization as the preceding interval; LONG-SHORT and AGED-AVERAGE used weighted average functions of the past intervals for prediction; and CYCLE and PATTERN matched the utilization patterns over multiple intervals with past patterns to predict the next interval's utilization. Govil et al. [55] paired one of the following controller algorithms with these predictors:

1. Incremental: this controller incrementally changes the voltage/frequency until it reaches the desired voltage/frequency level. The desired voltage/frequency level must maintain the utilization within a low and a high threshold.

2. Direct jump to target: this controller selects the desired target voltage/frequency level in one step. The desired voltage/frequency level must be sufficient to complete the predicted workload for the interval.

The first controller, incremental, is the most practical solution because of its simplicity. For the second controller, however, predicting the workload of the next interval at the current voltage/frequency level is not sufficient; it requires a workload prediction of the next interval for all the available voltage/frequency levels in order to select the proper frequency level. Govil et al. showed in this research that the choice of the DVFS evaluation interval length impacts the energy-delay trade-off. Shorter intervals track the utilization better and provide better performance, but lower energy efficiency. They also concluded that simpler predictors provided better results. Pering et al. [56] expanded upon the study conducted by Govil et al. They used a detailed ARM8 processor simulator with detailed performance models for all the blocks, including communication channels, display, and user input. Their framework included energy models for the core, cache,

external bus, and main memory. Pering et al. studied a more diverse set of applications, including an MPEG decoder and realtime audio. They introduced the Clipped-Delay metric for studying the energy-performance tradeoff of realtime applications. The Clipped-Delay metric takes into account that realtime applications require a certain performance level and not more. Finally, Pering et al. compared the efficiency of the history-based algorithms described by Govil et al. [55] to an optimal solution. They showed there is significant room for improvement upon those history-based techniques. The authors showed that the effectiveness of the algorithms varied across sampling intervals and across different applications. They concluded that adaptive strategies to select the best parameters for each process are required. Semeraro et al. [94] proposed a microprocessor design with support for Multiple Clock Domains (MCD). They demonstrated a 20% improvement in EDP⁵ with multiple clock domains, as compared to a less than 3% improvement for a single clock domain. Isci et al. [59] proposed a workload phase monitoring and prediction framework. They used software-based phase prediction tables similar to branch history tables. Predicted workload phases are mapped to one of six categories, from the most CPU bound to the most memory bound. Each category is associated with one of the six available CPU frequencies. The idea is that memory-bound phases can tolerate a lower CPU frequency without affecting performance considerably. Other DVFS research studies have focused on more adaptive solutions utilizing performance and/or energy models [62, 64, 13, 58]. Rajamani et al. developed performance counter-based power and performance models for a Pentium M processor using linear regression [62]. Measured IPC is scaled with frequency to predict performance at a different frequency. Measured DPC is fed to individual models for each frequency level in order to predict power consumption. Spiliopoulous et al. [13] propose a single model for predicting power and performance at multiple frequencies. For performance, they identify the total number of memory stall cycles at the current DVFS setting. They argue that the number of active cycles is fixed regardless of the frequency. By scaling only the stall cycles with frequency, performance at other frequency levels can be predicted. Similarly, they develop a statistical model to predict the effective capacitance (Equation 2.7) based on IPC. By multiplying the estimated effective capacitance by V^2·f, power at a target DVFS setting can be predicted. Zhang and Hoffmann proposed a hybrid hardware/software capping technique for Xeon processors to improve upon DVFS-based power capping [61]. They combine a low-latency DVFS solution with

5 EDP (energy-delay product) is a metric used to quantify the quality of a power management technique; lower EDP values are desirable. A reduction in energy consumption through DVFS is generally accompanied by an increase in execution time. The EDP only improves if the energy reduction outweighs the increased execution time. When performance has higher importance, ED²P can be used as the metric.

software-based power capping techniques to achieve better power savings. Goel et al. proposed a piecewise-linear, lightweight, performance counter-based power model for scheduling and resource management in multicore CPUs [64]. Similarly, Huang et al. proposed a fine-grained power model for IBM Power CPUs for data center billing based on power consumption [65]. Gu and Qu [57] propose a model for the temperature-leakage interdependency for realtime applications. They introduce a temperature-aware DVFS solution based on this model to improve energy efficiency. Finally, Hanson et al. propose a DVFS-based thermal management technique [63]. Similarly, Koren and Krishna [58] propose a DVFS solution for realtime applications aimed at increasing the processor lifespan by reducing the average temperature.
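To make the stall-cycle scaling idea of Spiliopoulous et al. [13], summarized above, more concrete, the sketch below predicts execution time at a candidate frequency from counters measured at the current frequency. It assumes, as that model does, that active cycles are frequency-independent while memory stall time is fixed in wall-clock terms; the counter values are placeholders.

```python
def predict_time_s(active_cycles, stall_cycles, f_current_hz, f_target_hz):
    """Predict execution time at f_target, given counters measured at f_current.

    Assumption (as in the model described above): the number of active cycles
    stays constant across frequencies, while stall time is bound by memory and
    does not shrink when the core clock is raised.
    """
    stall_time_s = stall_cycles / f_current_hz   # frequency-independent
    active_time_s = active_cycles / f_target_hz  # scales with the core clock
    return active_time_s + stall_time_s

# Measured at 1.0 GHz: 40 M active cycles, 10 M stall cycles (~50 ms total).
for f_target in (0.5e9, 1.0e9, 2.0e9):
    t = predict_time_s(40e6, 10e6, 1.0e9, f_target)
    print(f"{f_target/1e9:.1f} GHz -> {t*1e3:.1f} ms")
```

With these placeholder counters, halving the frequency increases execution time from 50 ms to 90 ms rather than 100 ms, because the memory stall portion does not stretch with the core clock.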

2.3.2 Desktop GPUs

The addition of GPUs to large data centers has recently become a trend, as high-performance, scientific GPGPU applications have become more widespread. Much of the work on DVFS for desktop-class GPUs has focused on DVFS for GPGPU applications. Nugteren et al. proposed a GPGPU DVFS solution with two frequency domains: i) the GPU and ii) main memory [66]. They identify whether a kernel is compute- or memory-bound using a simple two-part piecewise-linear model based on ops/byte. The ops/byte ratio at which the workload goes from memory bound to compute bound is identified for each DVFS level by offline profiling of micro-benchmarks. The clock frequency of the non-bottleneck domain is reduced at runtime to save energy with minimal performance impact. In a similar study, Paul et al. utilize linear regression models based on seven different performance counters to identify a sensitivity factor (a number in the range [0-100]) for the impact of GPU and memory frequency changes on performance [68]. Sensitivity factors are mapped to DVFS levels at design time. Their coarse-grained tuning applies the predetermined DVFS level based on sensitivity factors at runtime. A fine-grained tuning algorithm then adjusts the control knobs (GPU/memory frequency and the number of Compute Units (CUs)) in small steps. If the performance impact is significant, the fine-tuning algorithm restores the original frequency. Jiao et al. [67] use offline training to develop a neural network that identifies an optimized DVFS level based on the workload characteristics. Their solution requires the kernel execution time as an input. Therefore, each kernel has to be executed once before the DVFS scheme can optimize consecutive runs (assuming kernel execution time does not change between runs). The authors do not discuss the runtime overhead of their solution. Che and Skadron proposed a unique methodology to predict the performance of GPGPU

applications [69]. Using offline profiling of benchmarks from the Rodinia suite, they identify and map workload features to performance ranges. At runtime, they match workload features with the offline-profiled data points to predict the performance. A nearest-neighbor algorithm is used in cases where an exact match is not observed. Predicting the processing time of 3D graphics workloads is more complicated. GPUs have a large number of pipeline stages working in parallel and processing asynchronously (refer to Figure 2.2 and Figure 2.3). In addition to main memory, there are many buffers, caches, and local memory elements between pipeline stages. At any point in time, any of these pipeline stages can become a performance bottleneck. Performance modeling is not as simple as characterizing a workload's phases as memory bound or compute bound. Wimmer and Wonka highlighted these challenges and proposed heuristics for rendering time estimation of 3D graphics workloads [70]. Their heuristics utilize runtime partial performance sampling and extrapolative prediction for performance modeling. To improve the prediction accuracy and alleviate frame-deadline misses due to misprediction, they propose two hardware extensions: 1) an accurate timestamp sampling mechanism on the GPU, and 2) a conditional branch mechanism on the GPU. A conditional branch is inserted at designated locations at compile time. This is used to jump to a different execution path with a lower Level Of Details (LOD), if the deadline is likely to be missed. Sheaffer et al. explore a DVFS solution to demonstrate potential use-cases for their proposed GPU simulation framework [71]. Their DVFS solution controls separate clock domains for the vertex and fragment shader stages of a GPU pipeline. Either the vertex or fragment shader processing stage is identified as the bottleneck by monitoring the occupancy of the buffer between the two stages. The clock frequency of the non-bottlenecked block is reduced to save energy, without affecting performance. This technique is not applicable to modern mobile GPUs with unified shaders [91, 92, 81].

2.3.3 Mobile SoC

During the last decade, mobile SoC popularity and usage have increased by multiple orders of magnitude. Almost all of the use-cases for mobile SoCs involve battery-powered operation. At the same time, continuously shrinking form factors impose stricter power consumption and thermal limits on these devices. Despite these limitations, consumers expect longer battery life and sustained performance⁶. Consequently, evaluating the effectiveness of the existing DVFS techniques, and designing new ones specifically tailored for mobile SoCs, have been an active area of research.

6 Sustained performance is the performance level the device can operate at without being throttled due to thermal limit violations.


CPUs and GPUs in mobile SoCs have distinct architectural differences from their desktop counterparts. For example, main memory is a shared resource in mobile SoCs, not only between the CPU and the GPU, but also among all other computing elements (e.g., the Digital Signal Processor (DSP) and the camera processor). Sharing main memory increases the level of interdependency between different Intellectual Property (IP) blocks within the SoC. Another example is the big-LITTLE multi-core CPU architecture, which was proposed to address the strict power and thermal constraints of mobile SoCs. Specific examples of GPU architectural features unique to mobile SoCs include unified shaders [81] and tiled rendering [82] (refer to Section 2.2.1). There have been many studies that characterize and model performance and energy consumption for mobile use-cases. The use cases in these research studies cover a relatively broad range of applications: from general-purpose benchmarks included in the SPEC2000, Java, and Mibench suites [74], to image, audio, and video processing [72, 73, 8, 75], to chess and a talking editor [72]. 2D and 3D graphics processing has been one of the main use-cases in portable devices. Even before mobile SoCs included dedicated GPUs, graphical user interfaces and mobile games used the CPU for their rendering tasks. There are articles exploring DVFS solutions for 3D graphics (primarily gaming) on mobile CPUs [76, 77, 79, 80, 14]. Gu et al. [76] published one of the first studies on the effectiveness of DVFS for mobile 3D games. The authors demonstrated that 3D gaming workloads exhibit sufficient workload variability to benefit from DVFS. They also showed that, unlike video decoding, it is difficult to categorize 3D graphics frames for workload prediction. Such an approach has been used to predict the video decoding workload for the next frame, based on the frame type, to achieve improved DVFS. Their conclusion was that new studies are required to perform a more in-depth analysis of 3D graphics workloads. Subsequently, Gu et al. attempted to improve upon history-based predictors. They proposed a control theory-based [79], a frame-structure based [80], and a hybrid predictive DVFS scheme for 3D interactive games [14]. Their control theory-based solution incorporates a performance feedback loop, based on control theory principles, to improve prediction accuracy. The frame structure-based solution employs detailed frame structure analysis to predict the processing time of a 3D graphics frame. The hybrid solution combines these two approaches to achieve even more accurate predictions. Some of the limitations of these solutions are as follows: 1) the overhead of the frame structure-based prediction was reported to be 1.7 million CPU cycles, on average (they did not report the maximum overhead). Such high runtime overhead (i.e., high execution latencies) can be prohibitive for mobile DVFS. 2) they only used one 3D graphics game for their analysis, which is insufficient in terms of

coming to any generalizable conclusion. 3) the percentage of frames missing the deadline in their hybrid technique was as high as 10% or more, a significant frame drop rate. Mochocki et al. made a similar attempt at improving workload prediction over history-based techniques [78]. They proposed a signature-based prediction mechanism. This prediction technique maps workload signatures to rendering times. By matching the signatures of future frames with the signature history, the performance of a future workload can be predicted more accurately than with simple history-based prediction. The advantage of this technique over frame-structure based prediction is lower overhead, since the features selected for the signatures are simple graphics API-level information. All four of these studies used a CPU for the execution of their 3D graphics workloads. Considerable advancement in the programmable mobile GPU pipeline, and evolution in the characteristics of 3D workloads, have occurred since these studies were published. The architecture of a modern mobile GPU provides a high degree of parallelism, especially when compared to current CPU architectures. It remains to be seen if these results apply to modern mobile GPUs. Silpa et al. set out to further improve upon history-based and signature-based techniques [82]. They defined a new metric, frame/tile rank, for characterizing the workload of a frame. There are a number of factors contributing to their improved solution, including: a) the rank parameter incorporates prediction for the pixel processing, texture processing, and rasterization phases, as opposed to frame signatures, which only include geometry processing information [78]. In modern 3D graphics workloads, geometries with a similar number of primitives and areas can exhibit vastly different processing times. b) This work specifically targets and takes advantage of tile-based GPUs. By enabling DVFS at a tile granularity, they achieve improved prediction accuracy. They also have the added ability to take corrective actions in the case of a misprediction, before a frame deadline is missed. c) Rank is used to identify potential changes in the processing time as compared to the previous frame, rather than predicting the actual processing time. Note that one additional advantage of this research is that it was conducted on a cycle-accurate GPU simulation framework, Attila [95], rather than on a mobile CPU.
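As a concrete (and deliberately simplified) illustration of the signature-based prediction idea [78], the sketch below matches a frame signature built from a few API-level features against a history table, falling back to the nearest previously seen signature when there is no exact match. The feature choice and distance metric are our own illustrative assumptions, not the mechanism used in the original work.

```python
def predict_frame_time(signature, history):
    """Predict rendering time from a signature-to-time history table.

    signature: tuple of simple API-level features, e.g. (draw_calls, triangles, textures).
    history:   dict mapping previously seen signatures to measured frame times (ms).
    """
    if signature in history:
        return history[signature]
    if not history:
        return None  # no history yet; a real governor would fall back to a default OPP
    # Nearest-neighbor fallback: closest previously seen signature.
    def distance(sig):
        return sum((a - b) ** 2 for a, b in zip(sig, signature))
    nearest = min(history, key=distance)
    return history[nearest]

history = {(12, 40_000, 8): 9.5, (30, 120_000, 20): 21.0}
print(predict_frame_time((13, 42_000, 8), history))   # -> 9.5 (closest match)
```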

settings are not well-suited for GPGPU applications. At the same time, changing DVFS parameters to accommodate both types of workloads can impact the power efficiency of graphics workloads. They propose an interval-based DVFS solution for concurrent GPGPU and graphics applications. In this solution, GPGPU and graphics use-cases maintain their own optimized thresholds. These thresholds are combined at runtime using a weighted threshold mechanism, producing a unified threshold. They also propose three additional DVFS enhancements for GPGPU use cases: 1) new APIs to request a specific frequency, 2) an adaptive interval-based algorithm that changes the DVFS interval at runtime based on the workload, and 3) a multilevel frequency adjustment. The authors highlight the benefits of performing multilevel frequency changes as compared to the incremental frequency level changes commonly found in DVFS techniques [12]. Researchers have recently considered the effectiveness of CPU-GPU integrated DVFS for graphics workloads [83, 84, 16]. Park et al. performed detailed workload characterization to explore the DVFS potential in [83]. The authors used synthetic benchmarks to control the workload for various stages of the pipeline, as well as the CPU workload. The overall workload intensity was controlled by the resolution. The advantage of this work over previous studies is that their analysis is conducted on a mobile SoC platform released in recent years. One of the main conclusions of their work was that either the CPU or the GPU can become a bottleneck at different application phases. Therefore, an integrated CPU-GPU DVFS, which lowers the frequency of the non-bottlenecked component, might be more effective. Pathania et al. explored an integrated DVFS solution in [84, 16]. In [84], Pathania et al. made their first attempt at proposing an integrated CPU-GPU DVFS solution for graphics workloads. The result, however, is a solution that still makes DVFS level change decisions for the CPU and the GPU independently. There are some flaws in this work, reflected in their experimental results, which reduce its effectiveness in practical use-cases. In a follow-up study, the authors acknowledge some of these shortcomings [16]. They propose a new integrated DVFS solution to overcome those limitations. Their new DVFS solution utilizes performance and power models for more accurate decision making. For their power model, they rely on the naive assumption that power consumption is linearly correlated with utilization. Frames with the same utilization can contain vastly different workloads, producing quite different power consumption profiles. The key missing part of their study is that their evaluation does not specifically demonstrate the benefits of GPU-CPU DVFS integration. They do demonstrate their solution can achieve 20% better energy efficiency as compared to an On-Demand governor⁷ [12].

7 Note that the On-Demand governor, which was initially designed for interactive use-cases such as User Interface (UI) on CPUs, favors performance over energy efficiency.


Currently, we believe that there is little advantage gained by considering an integrated solution, for the following reasons: (a) Generally, applications overlap their CPU work with GPU rendering, and try to leverage triple buffering to stay a few frames ahead (refer to Section 2.2); therefore, the chances of the CPU becoming a bottleneck are low. (b) The CPU is concurrently used by many different processes. It is not practical to adjust the CPU frequency based on the graphics workload without imposing any impact on other applications. (c) In a typical 3D graphics use-case, the associated CPU computation is less intense than the GPU workload, so it is less likely to become the bottleneck. Even in the experiments designed to show the benefits of integrated CPU-GPU DVFS in [84], the authors reported low FPS sensitivity to CPU frequency. More recently, You and Chung [96], and Peters et al. [97], have focused on improving the user experience and DVFS efficiency by taking into account unique characteristics of graphics workloads; notably, they incorporate the notion of FPS into GPU and CPU DVFS, respectively. Yet, their solutions still do not address some of the important shortcomings of existing DVFS solutions: high latency, and a lack of context and application QoS-awareness. Both studies tried to enforce a configurable FPS rate lower than that of the display, but without regard to application-specific FPS requirements.


Chapter 3

Fine-Grained DVFS Analysis Framework

The DVFS analysis framework introduced in this chapter will be utilized in the following chapters. We will use this framework to conduct workload performance and energy consumption analysis of 3D graphics workloads and validate our proposed DVFS solutions. In creating the framework, we considered the following objectives:

• The capability to quantify the outcome of a DVFS algorithm with respect to overall performance and power/energy consumption.

• The capability to compare different algorithms in a consistent setup. One of the important factors we considered in designing our analysis framework was run-to-run consistency, to enable meaningful algorithm comparisons.

• The capability to perform design space explorations by changing the parameters of the existing algorithms, as well as by devising and prototyping new algorithms.

• And lastly, the ability to perform opportunity analysis, i.e., having a measure of the goodness of an algorithm, and identifying the potential room for improvement with respect to either performance and/or energy consumption.

The next section provides the details of how our proposed analysis framework captures and combines workload performance and energy consumption characteristics to satisfy the aforementioned objectives.


3.1 Performance measurements

We have developed a framework that reads performance counters at desired stages of the workload execution to profile performance metrics. Consider an OpenGL ES [98] graphics workload composed of many frames. We would like to be able to track workload characteristics, including rendering time, at the granularity of a single frame. Figure 3.1 shows the ring buffer of a GPU with command streams for multiple processes, specified with different colors. As mentioned before, command streams are handed down to the User Mode Driver (UMD) by the applications. There is an EGL API [99], eglSwapBuffers, which is called by an application every time the instructions required to draw a frame finish and the instructions for the next frame start. This API call is eventually used to move the content of the rendered frame buffer to the display buffer. As shown in Figure 3.1, the UMD can insert profiling instructions into the command stream for the processes that request performance profiling. These instructions are inserted at frame boundaries, i.e., whenever an eglSwapBuffers is observed in the command stream. Our framework is capable of reading any performance counter available to the software for performance profiling. Additionally, it provides finer-grained (versus at the level of an entire frame) profiling capabilities, although frame-level profiling is sufficient for our DVFS analysis purposes. At a minimum, we use one of the always-on counters in the GPU to extract the total draw time per frame, a performance metric that proves useful for simulating and comparing various algorithms, as well as for analyzing workload behavior.
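As a sketch of how per-frame draw times could be recovered from such counter samples, assume (for illustration only) an always-on counter that advances at a fixed, known rate and is sampled at every frame boundary; the counter rate and sample values below are invented.

```python
GPU_COUNTER_HZ = 19.2e6   # assumed rate of the always-on counter (illustrative)

def frame_draw_times_ms(boundary_samples):
    """Convert counter samples taken at frame boundaries into per-frame draw times.

    boundary_samples: raw counter values, one per frame boundary
    (sample i marks the end of frame i and the start of frame i+1).
    """
    times = []
    for prev, curr in zip(boundary_samples, boundary_samples[1:]):
        delta_cycles = curr - prev            # counter ticks spent on this frame
        times.append(delta_cycles / GPU_COUNTER_HZ * 1e3)
    return times

samples = [0, 192_000, 512_000, 704_000]      # counter values at frame boundaries
print(frame_draw_times_ms(samples))           # approx. [10.0, 16.7, 10.0] ms
```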


Figure 3.1: Instrumenting command stream by inserting performance counter profiling instructions into the command stream.


3.1.1 Pipeline idle insertion

Unlike CPUs, which are optimized for low latency, i.e., to execute a relatively small number of instructions as fast as possible, GPUs are optimized for throughput. GPUs are highly parallel processing elements suitable for relatively large workloads with high levels of data parallelism. One of the implications of such an architecture is a very deep pipeline. It can take a relatively long time from when the instructions are fetched from memory and enter the pipeline until their results are computed and retired. As a result, profiling the timing of individual frames is not as simple as inserting instructions into the command stream that read counter values. At the beginning of a frame, sampling counters right away would be accurate enough most of the time. However, when the profiling instruction at the end of the frame is executed, the processing of the commands for that frame, which are still in the pipeline, might still be in progress. In order to accurately measure the rendering time for a frame, we have utilized pipeline idling instructions. Such instructions, when encountered at the front end of the pipeline, stall the processing of incoming instructions until the pipeline executes all in-flight instructions from the previous frame and becomes idle. Table 3.1 provides pseudo-code for a command stream before and after instrumentation for performance profiling. Note that, in a single-context execution model, only one wait_for_idle is needed before reading the counter at the end of the frame (eof_cntr_read) (instructions in blue). However, when multiple contexts are allowed to execute on the GPU, it is possible that the start of a frame in a context with performance profiling enabled follows another context with profiling disabled. To ensure profiling is done accurately, an additional wait_for_idle is inserted before reading the counter at the beginning of the frame (sof_cntr_read, instructions in red). The flowchart in Figure 3.2 shows the steps taken in the UMD to instrument command streams for performance profiling. As shown in the flowchart, there are two cases in which the profiling data is dumped into a file on the file system. First, when the UMD identifies that the additional instrumentation requires buffer space beyond the fixed amount available to the UMD to hold the profiling data; in this case, the UMD dumps some of the data in the buffer into the designated file for each process and frees up space in the buffer. Second, when a process closes its graphics rendering context by calling eglDestroyContext, all data in the buffer dedicated to that process is dumped into the corresponding file. The profiling data for each application is downloaded from the mobile device, which runs the Android operating system, at the end of each run. The data is saved on the host PC to be parsed later. This task is carried out through the Android Debug Bridge (ADB) over a Universal Serial Bus (USB) connection.


Table 3.1: Command stream instrumentation: inserting profiling instructions at the eof (End of Frame) and sof (Start of Frame).

    before instrumentation    after instrumentation
    instr 1                   instr 1
    instr 2                   instr 2
                              wait for idle
                              eof cntr read
    eglSwapBuffers            eglSwapBuffers
                              wait for idle
                              sof cntr read
    instr 3                   instr 3
    instr 4                   instr 4
                              wait for idle
                              eof cntr read
    eglSwapBuffers            eglSwapBuffers
                              wait for idle
                              sof cntr read
    instr 5                   instr 5
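To make the insertion flow concrete, the following self-contained C sketch mimics the instrumentation of Table 3.1. The command stream is modeled as a simple list of command names, and the emit() helper stands in for the vendor-specific UMD packet builders; none of the names below are the actual driver API.

#include <stdio.h>

/* Hypothetical, self-contained sketch of the frame-boundary instrumentation
 * performed by the UMD. The "command stream" is modeled as an array of
 * strings; emit() stands in for the vendor-specific packet builders. */

#define MAX_CMDS 64

typedef struct {
    const char *cmds[MAX_CMDS];
    int count;
} cmd_stream;

static void emit(cmd_stream *cs, const char *cmd)
{
    if (cs->count < MAX_CMDS)
        cs->cmds[cs->count++] = cmd;
}

/* Insert profiling commands around an eglSwapBuffers boundary. The extra
 * wait_for_idle before the start-of-frame read guards against in-flight
 * work from other (non-profiled) contexts. */
static void instrument_swap(cmd_stream *cs, int profiling_enabled)
{
    if (profiling_enabled) {
        emit(cs, "wait_for_idle");
        emit(cs, "eof_cntr_read");
    }
    emit(cs, "eglSwapBuffers");
    if (profiling_enabled) {
        emit(cs, "wait_for_idle");
        emit(cs, "sof_cntr_read");
    }
}

int main(void)
{
    cmd_stream cs = { {0}, 0 };

    emit(&cs, "instr_1");
    emit(&cs, "instr_2");
    instrument_swap(&cs, 1);   /* frame boundary, profiling enabled */
    emit(&cs, "instr_3");
    emit(&cs, "instr_4");
    instrument_swap(&cs, 1);   /* next frame boundary */

    for (int i = 0; i < cs.count; i++)
        printf("%s\n", cs.cmds[i]);
    return 0;
}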


Figure 3.2: The steps taken in the UMD to instrument the command stream for performance profiling.


3.2 Power measurement setup

Figure 3.3: Power measurement lab setup. An extension board is used to expose various power rails and IO blocks of the GPU. Flat wires are used to connect the board to the Kratos power measurement equipment.

In Figure 3.3 we can see our power measurement setup. The setup includes an extension board which exposes the pins on the SoC, as well as the power distribution rails. This board allows us to easily and reliably connect different Input/Output (IO) pins and power distribution rails to the power measurement equipment. The power measurement equipment provides a graphical interface for manual data acquisition; the acquisition software also has a TCP/IP interface, which we use to automate our data acquisition. The acquired power data is saved on the host computer, along with the performance data discussed in the previous section, to be parsed in a later step. As shown in Figure 3.3, a temperature control unit is used to ensure repeatable experiments. Without the temperature control unit, when we run long automated measurements, the SoC, including the GPU, can heat up. Higher temperatures result in higher leakage power consumption. Measuring power with the same configuration, for the same application, at two different temperatures can result in different data. Our goal is to perform a number of iterations for each test and configuration, calculating the average across iterations in order to average out variabilities that are not part of

our analysis. Another potential source of variability during measurement is the active USB connection. As mentioned in Section 3.1, some of the automation tasks in the framework rely on the ADB toolchain over a USB connection. Having a USB connection active during power measurements could affect the measurements at the battery level. It could also interfere with low-power state operation of the CPU, as the CPU may not be able to go to sleep, or could jump to higher-power active states more frequently due to the USB interrupt processing. The power measurement software API supports enabling/disabling the USB. We utilize this feature to disable the USB right before data acquisition starts to prevent such interference. The USB is activated again when the data acquisition finishes and before the log files are downloaded from the device.
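As a rough illustration of how the automation fits together, the C sketch below shows one possible measurement iteration. The pm_*() helpers stand in for the acquisition software's TCP/IP control interface, and the package name and device paths are hypothetical placeholders rather than the actual values used in our setup.

#include <stdio.h>
#include <stdlib.h>

/* Hypothetical sketch of one automated measurement iteration. */
static void pm_usb(int enable)           { printf("[pm] usb %d\n", enable); }
static void pm_acquire(const char *file) { printf("[pm] capture -> %s\n", file); }

static void run_iteration(int opp_index, int iteration)
{
    char capture_file[128];

    /* Start the workload (API-playback app) on the device over ADB.
     * The package/activity name is an assumption for illustration. */
    system("adb shell am start -n com.example.glplayback/.PlaybackActivity");

    /* Disable USB so it cannot perturb battery-rail and CPU idle behavior,
     * capture all configured rails, then re-enable USB for the log pull. */
    pm_usb(0);
    snprintf(capture_file, sizeof(capture_file),
             "power_opp%d_iter%d.csv", opp_index, iteration);
    pm_acquire(capture_file);
    pm_usb(1);

    /* Pull the per-frame performance profile dumped by the UMD
     * (illustrative path). */
    system("adb pull /data/local/tmp/gfx_profile.bin .");
}

int main(void)
{
    for (int iter = 0; iter < 10; iter++)   /* e.g., 10 iterations per OPP */
        run_iteration(0, iter);
    return 0;
}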

3.3 Performance-power data alignment

The setup we have described so far enables us to measure performance at a frame granularity. However, power measurement is done for the entire duration of the workload execution, without any finer-grained insight into individual frames or draw calls. From the performance profiling data, we have the sampled values of the GPU performance counters used for performance profiling at frame boundaries (draw call boundaries, etc.), i.e., the start and end of the frame. The power measurements also have time-stamps associated with the samples, although in a different time domain. If we could align a sampled value of the GPU performance counter with a power data time-stamp, the frame boundary time-stamps from the performance measurements could be used to identify frame boundaries for the power data. Then, we can extract per-frame (draw call, etc.) power and energy consumption. We use one of the General-Purpose Input/Output (GPIO) pins on the mobile SoC to align the performance and power data. The GPIO pin is physically connected to a dedicated channel of the power measurement equipment. During the application execution we toggle the GPIO pin and sample the GPU performance counter used for performance profiling at the same time; we call this a “syncpoint” hereafter. Toggling the GPIO and sampling the counter are executed back to back on the CPU as serial operations. We implemented the code that toggles the GPIO and samples the performance counter in the Android kernel to make sure they are sampled virtually at the same time. This code is exposed to our automated framework through a Linux Kernel Virtual File System (sysfs) node. The voltage of the GPIO channel is acquired, along with the rest of the power rails.


Figure 3.4: Assertion of the GPIO pin, acquired with other voltage rails. White and green lines are the battery and graphics rail voltages, respectively. The red line is the GPIO pin voltage, which has been asserted and then de-asserted 57 seconds later.

Figure 3.4 shows a screen shot of the acquired voltage on three different channels, including battery (white), graphics (green), and the GPIO pin (red). As we can see in this figure, the GPIO pin is asserted (set to logical one/high voltage value) and later de-asserted. During the data parsing process, the frame-boundary time-stamps from the performance profiling are converted to the corresponding time-stamps in the acquired power data time domain. The required calculation for the start time-stamp of a frame is as follows:

TimeStamp_sof = CounterValue_sof / Frequency    (3.1)

TimeStamp_sp = CounterValue_sp / Frequency    (3.2)

TimeDiff_frame = TimeStamp_sof − TimeStamp_sp    (3.3)

PowerDomainTimeStamp_sof = PowerDomainTimeStamp_gpio + TimeDiff_frame    (3.4)

This calculation is repeated for the start and end of every frame (draw call, etc.). Having frame boundaries in the power time domain, average power and total energy consumption for each frame can be calculated. Note that we only need one syncpoint to make this calculation work.
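A minimal C sketch of the conversion in Equations 3.1-3.4 follows; the 19.2 MHz counter frequency and the counter values are purely illustrative, not taken from our measurements.

#include <stdio.h>
#include <stdint.h>

/* Counter values are samples of the free-running GPU counter; counter_hz
 * is the frequency at which the always-on counter ticks. */
static double counter_to_seconds(uint64_t counter_value, double counter_hz)
{
    return (double)counter_value / counter_hz;             /* Eq. 3.1 / 3.2 */
}

/* Convert a frame-boundary counter sample into the power-data time domain,
 * given one syncpoint: the counter value sampled when the GPIO toggled
 * (cntr_syncpoint) and the GPIO edge timestamp seen by the power
 * acquisition equipment (t_gpio_power_domain, in seconds). */
static double align_to_power_domain(uint64_t cntr_sof,
                                    uint64_t cntr_syncpoint,
                                    double counter_hz,
                                    double t_gpio_power_domain)
{
    double t_sof  = counter_to_seconds(cntr_sof, counter_hz);
    double t_sync = counter_to_seconds(cntr_syncpoint, counter_hz);
    double time_diff = t_sof - t_sync;                      /* Eq. 3.3 */
    return t_gpio_power_domain + time_diff;                 /* Eq. 3.4 */
}

int main(void)
{
    /* Illustrative numbers: 19.2 MHz counter, GPIO edge observed at
     * t = 2.500 s in the power-data time domain. */
    double t = align_to_power_domain(96000000ULL, 48000000ULL, 19.2e6, 2.500);
    printf("start-of-frame in power time domain: %.6f s\n", t);
    return 0;
}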



Figure 3.5: Idle power calculation based on the notion of slack time. The second frame does not have any slack time with respect to the vsync. Therefore, there will be no idle power/energy associated with that frame.

3.3.1 Idle energy calculation

So far, we have shown how we calculate the frame energy and average power consumption from the start of a frame to the end of a frame. In a mobile SoC, there is also power consumption during GPU idle times between two consecutive frames. The GPU consumes static energy, as well as some dynamic energy, due to the active clock tree and background computations. The amount of energy consumed during the idle period depends on multiple factors. Idle energy consumption can be reduced if GPU sub-blocks, or the whole GPU, can be shut down. Idle energy consumption also depends on the voltage, frequency, and total idle time. Figure 3.5 demonstrates an example of two frames with different frame rendering times. Here, we have assumed that the display refresh rate is 60 Frames Per Second (fps), which is commonly the case in today’s mobile devices (therefore the available time to draw each frame is 16.7 ms). The first frame finishes before the screen refresh deadline (vsync). However, the second frame takes longer than the deadline. For the second frame in Figure 3.5, the total idle energy would be zero because, for that frame, there is no slack time. All of the energy consumption would be the active energy consumption from the start to the end of the frame. The energy consumption for the first frame, on the other hand, depends on the length of the slack time and the average power consumption during the GPU idle time. In order to be able to incorporate the impact of the idle energy consumption, we calculate the average power consumption from the end of one frame to the start of the next frame. Note that, in our profiling phase, we disable the low-power modes that enable more aggressive power collapse of the GPU and its sub-blocks. During our DVFS algorithm evaluation and design space

exploration, we calculate the actual idle energy consumption by multiplying the average power consumption by the total slack time. We also adjust the energy consumption based on the desired hardware configuration (for example, if 70% of the GPU can be shut down during GPU idle time, we add 30% of the estimated idle energy to the total energy consumption). This way, we can use the same profiling data to capture the impact of different idle power-saving scenarios, as well as different FPS requirements (a lower FPS target leaves more idle time after a frame's rendering is done).
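The per-frame idle-energy adjustment can be summarized with the following C sketch; all names and numbers are illustrative, and gated_fraction models the portion of the GPU that can be power-collapsed during the slack period.

#include <stdio.h>

static double frame_total_energy(double active_energy_j,   /* sof..eof      */
                                 double frame_time_s,      /* sof..eof      */
                                 double idle_power_w,      /* eof..next sof */
                                 double vsync_period_s,    /* e.g., 1/60 s  */
                                 double gated_fraction)    /* e.g., 0.70    */
{
    double slack_s = vsync_period_s - frame_time_s;
    if (slack_s <= 0.0)
        return active_energy_j;        /* deadline missed: no idle energy   */

    double idle_energy_j = idle_power_w * slack_s;
    /* Only the non-gated part of the GPU keeps consuming during idle.      */
    return active_energy_j + (1.0 - gated_fraction) * idle_energy_j;
}

int main(void)
{
    /* A frame that renders in 10 ms against a 16.7 ms (60 fps) deadline. */
    double e = frame_total_energy(0.012, 0.010, 0.300, 1.0 / 60.0, 0.70);
    printf("frame energy incl. idle: %.4f J\n", e);
    return 0;
}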

3.4 Putting it all together

So far we have discussed how power and performance profiling for a single iteration of an application at one DVFS OPP is done. The last thing we need to describe about the framework is what the data required for DVFS analysis in this thesis looks like and how this data is gathered. As stated before, one of the objectives of this framework is to allow us to quantitatively compare different DVFS algorithms. Various DVFS algorithms essentially end up making different decisions as to when to change the DVFS level (OPP) and which level to change it to. The consequence of these different decisions is that, over the execution of an application with X frames, different frames will be rendered at different DVFS levels. In order to be able to quantitatively compare these algorithms, we need to know the performance (frame time) and energy consumption of each frame at different OPPs. Figure 3.6.a and 3.6.b demonstrate how we do this. In these two figures, multiple lines with different colors belong to different OPPs. The x-axis is the frame number and the y-axes are the total frame time and total energy consumption, respectively. Performance and power measurement experiments are performed in multiple iterations, to cover all of the OPPs. The charts in these two figures only show the data across DDR frequencies for a single GPU frequency, but this is done across GPU frequencies as well. In order to ensure more accurate results, data for a single operating point is captured multiple times (e.g., 10 times) and the average value for each frame is calculated, resulting in the data in Figure 3.6.a and 3.6.b. This is required since, in a typical mobile SoC, there are various shared resources that can cause variations in the results. We try to reduce system-level effects by running only the GPU-specific content on the device, but there can still be operations required by the Android operating system that interfere. Running multiple times and using averages helps alleviate this issue.
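The sketch below illustrates, under simplifying assumptions, how the per-frame profiles collected at every OPP can be replayed offline to compare DVFS algorithms: a candidate policy picks an OPP for each frame, and the measured frame time and energy at that OPP are accumulated. The policy, the table sizes, and the numbers are hypothetical placeholders; the point is only the lookup-and-accumulate structure.

#include <stdio.h>

#define NUM_OPPS   3
#define NUM_FRAMES 4

typedef struct {
    double time_ms;   /* measured frame time at this OPP   */
    double energy_j;  /* measured frame energy at this OPP */
} frame_sample;

/* profile[opp][frame]: averaged measurements from the iterative runs. */
static const frame_sample profile[NUM_OPPS][NUM_FRAMES] = {
    { {22.0, 0.010}, {23.5, 0.011}, {21.0, 0.010}, {24.0, 0.011} }, /* low  */
    { {15.0, 0.013}, {16.0, 0.014}, {14.5, 0.013}, {16.5, 0.014} }, /* mid  */
    { {11.0, 0.017}, {11.5, 0.018}, {10.5, 0.017}, {12.0, 0.018} }, /* high */
};

/* Placeholder policy: a simple history-based rule, for illustration only. */
static int select_opp(int prev_opp, double prev_time_ms)
{
    if (prev_time_ms > 16.7 && prev_opp < NUM_OPPS - 1) return prev_opp + 1;
    if (prev_time_ms < 10.0 && prev_opp > 0)            return prev_opp - 1;
    return prev_opp;
}

int main(void)
{
    int opp = 0, misses = 0;
    double total_energy = 0.0, prev_time = 16.7;

    for (int f = 0; f < NUM_FRAMES; f++) {
        opp = select_opp(opp, prev_time);
        frame_sample s = profile[opp][f];   /* look up measured data */
        total_energy += s.energy_j;
        misses += (s.time_ms > 16.7);
        prev_time = s.time_ms;
    }
    printf("energy = %.3f J, deadline misses = %d\n", total_energy, misses);
    return 0;
}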



Figure 3.6: Total frame time (a) and graphics rail energy consumption (b), measured iteratively for all DDR frequencies. Data for only one GPU frequency is shown here, but the iterative measurement is performed across GPU frequencies.

3.4.1 API log capture and playback

In order for our profiling data to be accurate across iterations of the same OPP, as well as different OPP configurations, we need to ensure that we can consistently render the exact same frames in each and every run of the application. One of the shortcomings of the previous work on DVFS for graphics workloads is the inability to run the exact same workload multiple times. If you run a 3D graphics application, the number of frames and their respective draw


commands could change from run to run. The variation in the frame workloads can be due to different user inputs for interactive applications such as games or UI. Another source of variation for 3D games is the time-dependent physics engine that determines the location and direction of the objects in the scene. In a mobile SoC, system-level effects due to shared resources cause variations in the completion time of the frames. Consequently, the content (and number) of the rendered frames, impacted by the physics engine’s computations, varies between experiments. In our framework we have implemented an OpenGL API capture and playback methodology to run an application consistently. To achieve the required consistency in our experiments, we run each application once and capture all of the OpenGL ES API calls it makes. The input data for each API call is also captured and saved into a binary file along with the API calls. Thereafter, we always use the API playback tool we have developed to run the application consistently, as many times as needed. The OpenGL ES UMD driver that implements the functionality of the standard OpenGL ES APIs requires some modifications. The functions that implement the APIs in the UMD are changed: a wrapper is inserted around the implementation of each API. If API capture is enabled, the wrapper code saves the information about the API call and its input parameter(s) in the API capture file in binary format. The wrapper functions can be enabled/disabled at compile time using pre-processor directives, or by using function tables, in order to prevent performance overhead when API logging is not required. The file is properly closed whenever a process destroys its context, exits, or is killed. The code snippet in Listing 3.1 presents a simplified example of how API capture is implemented for the glLinkProgram API.

GL_APICALL void GL_APIENTRY glLinkProgram(GLuint program)
{
#ifdef API_CAPTURE
    if (api_capture_enabled) {
        api_capture.add("glLinkProgram", program);
    }
#endif
    // glLinkProgram functionality implementation goes here
}

Listing 3.1: OpenGL ES API instrumentation for API log capture.

In order to run the captured API logs, we have developed an Android application in Android Package (APK) file format that loads the binary data, creates the required OpenGL/EGL

context(s), and submits the API calls to the ring buffer. The application is capable of identifying frame boundaries based on the eglSwapBuffers API calls, similar to our performance profiling tool. It enables operations such as single-frame playback (waiting for user input before going to the next frame), frame range playback, and frame repetition.
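The core of the playback loop can be sketched as follows in C; the record layout and the miniature capture contents are illustrative rather than the tool's real binary format, and dispatch_api_call() is a stub standing in for re-issuing the decoded GL/EGL call.

#include <stdio.h>
#include <string.h>

typedef struct {
    const char *api_name;   /* e.g., "glDrawArrays", "eglSwapBuffers" */
} api_record;

/* Stub: the real player decodes the argument blob and re-issues the call. */
static void dispatch_api_call(const api_record *rec)
{
    printf("replaying %s\n", rec->api_name);
}

/* Replay frames [start_frame, end_frame), using eglSwapBuffers to count
 * frame boundaries exactly as the profiling tool does. */
static void playback(const api_record *log, int num_records,
                     int start_frame, int end_frame)
{
    int frame = 0;
    for (int i = 0; i < num_records && frame < end_frame; i++) {
        if (frame >= start_frame)
            dispatch_api_call(&log[i]);
        if (strcmp(log[i].api_name, "eglSwapBuffers") == 0)
            frame++;                      /* frame boundary reached */
    }
}

int main(void)
{
    const api_record log[] = {
        {"glClear"}, {"glDrawArrays"}, {"eglSwapBuffers"},
        {"glClear"}, {"glDrawArrays"}, {"eglSwapBuffers"},
    };
    playback(log, 6, 1, 2);   /* replay only the second frame */
    return 0;
}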

Chapter 4

Workload Analysis and Design Space Exploration

In this chapter, we present our results on workload performance and energy consumption analysis for DVFS. We carry out our analysis in this chapter on a state-of-the-art mobile SoC from QUALCOMM Technologies Incorporated, the Snapdragon 810 [100]. The Snapdragon 810 is manufactured in a 20 nm fabrication technology. The following is a summary of the workload performance and energy consumption assumptions we explore in detail in the next section:

1. Effectiveness of history-based workload prediction for DVFS algorithms: The quality of a history-based DVFS solution greatly depends on the frame-to-frame workload variations. We study the workload variation of 3D graphics applications to figure out if history-based prediction can still be effectively utilized for mobile GPU DVFS.

2. The quadratic relationship between operating voltage and dynamic energy consumption: According to Equation 2.8, dynamic energy consumption for a given task is proportional to the voltage squared. Power consumption is additionally proportional to the operating frequency (Equation 2.10). Some DVFS techniques estimate the energy consumption change when going from one DVFS OPP to another DVFS OPP based on this equation ([13]).

3. Unidirectional relationship between voltage/frequency and dynamic energy consumption: Based on this assumption, DVFS algorithms always select the lowest voltage/frequency that meets the deadlines as the most energy-efficient option.

4. Relative impact of the GPU vs. DDR voltage/frequency on the performance and energy consumption: We investigate if the performance impact of the GPU frequency is similar to


or higher/lower than that of the DDR frequency. We would additionally like to know if the energy consumption impact of the GPU frequency level, for the same performance improvement, is comparable to or higher/lower than that of the DDR frequency. This characterization can be useful for a DVFS algorithm to identify the most energy-efficient option to satisfy a change in the performance requirement.

5. Linear relationship between the performance and the GPU or DDR operating frequency: Here we evaluate whether performance has a linear relationship with the operating frequency.

The first item above helps us verify that history-based predictors satisfy our needs. Items 2-4 help us evaluate the need for, and potential usefulness of, an energy model for mobile GPU DVFS. We will additionally quantify the potential impact of energy-awareness in DVFS later in the design space exploration section. The last item helps us understand the role of a performance model in our solutions. We review these items in more detail in the next section.

4.1 Graphics workload performance and energy consumption analysis

Effectiveness of history (interval)-based workload prediction for DVFS (Page 47, Item 1) As stated in Section 2.3, history-based prediction of the workload of the next application phase (in terms of utilization, IPC, etc.) has been the stepping stone of many DVFS solutions for general-purpose workloads [55]. The effectiveness of history-based prediction depends on the temporal locality of the workloads. A number of research articles have looked into workload prediction for 3D graphics applications using detailed API-level information [78, 82]. However, due to the shortcomings of these studies, more research is required to provide convincing evidence of the feasibility and superiority of API-level workload prediction. Here, we analyze the frame-to-frame (f2f) performance variability of modern 3D graphics workloads. Our goal is to assess the effectiveness of history-based prediction for state-of-the-art workloads. History-based predictors are much simpler, impose less CPU overhead, and are preferred as long as they offer sufficient accuracy. In general-purpose workloads, if the time period is selected properly, history-based prediction can be effective given a high degree of spatial and temporal locality. For graphics workloads, history usually does repeat. Notably, objects moving in a scene on the screen should not move too

fast; otherwise the human eye will not be able to recognize the scene changes. Therefore, consecutive frames tend to have similar workloads, and the changes are generally gradual, except for the occasional cases where the scene completely changes. Figure 4.1 demonstrates the frame-to-frame rendering time variation for four applications. In this section, we frequently use the term workload change/variation to refer to the change or variation in the frame-to-frame rendering time. As can be seen in Figure 4.1.a, three of the use cases are dominated by gradual changes, with only a low percentage of the frames showing rapid changes in the workload. More specifically, the 3D game applications exhibit a high degree of workload stability and gradual changes (we have made similar observations on other gaming applications not presented here). Note, however, that the UI use case contains considerably more abrupt workload changes. The UI workload demonstrated in this figure is composed of a live wallpaper combined with multiple transitions between widget pages and App Drawer pages. To better understand how DVFS can leverage workload stability, consider the workload variations for EgyptHD, shown in Figure 4.1.a. Virtually all of the frame-to-frame (f2f) workload changes fall below 10%. In theory, a history-based DVFS algorithm can guarantee all deadlines are met by maintaining the GPU utilization below 90%. In general, for a workload with x% of f2f variations above 10%, such an algorithm ensures the chances of a jank (a deadline miss) are less than x% (Android triple buffering can further reduce the chances of janks). This would clearly work well for the other two gaming benchmarks, TempleRun2 and Hover, with most changes below the 10% threshold. Such high levels of workload uniformity have been reported in early work on mobile GPU DVFS as well [76, 16]. As such, in recent years, DVFS solutions have leveraged history-based (utilization) predictors for GPU DVFS [8, 16, 97, 96]. To gain additional insight into workload variations, we can use Figure 4.1.b. The x-axis in this figure is the frame index. This figure shows the absolute workload (frame time) for the Hover game (left y-axis) and f2f workload variation in percentage (right y-axis). Despite the dynamic behavior of the workload, and significant workload changes at the large scale, the majority of the f2f variations are small (less than 10%). Note that the magnitudes of the variations in frame time are dependent on the GPU and DDR frequencies, to some degree, as demonstrated in Figure 4.1.c. For example, if the DDR frequency is the bottleneck at OPP1, but not at OPP2, an increase in the memory-to-computation ratio from one frame to the next can have more impact on the rendering time running at OPP1 as compared to running at OPP2. On the x-axis, the OPPs are ordered first based on the GPU and second based on the DDR frequency. Therefore, local trends are related to the DDR frequency changes.



Figure 4.1: Frame-to-frame workload variation. (a) Histogram of the workload changes (%) for four different applications. (b) Total frame time and workload variation. (c) Frame-to-frame workload changes above threshold across different OPPs for the Hover application.

Each line in Figure 4.1.c shows the number of workload changes above the threshold, for three different threshold values: 2%, 5%, and 10%. The number of variations above threshold is significantly higher for the 2% threshold. There is a repetitive pattern for the change in the workload variations with DDR frequencies at all GPU frequencies. The differences between the workload variation profiles at different OPPs are not significant, especially for the 10% threshold. For the charts in Figure 4.1.a and 4.1.b we decided to use the data from the OPP with the highest GPU and DDR frequencies. At the highest OPP, we can ensure neither the GPU nor the DDR becomes the bottleneck, so the data is more representative of the actual workload variations. Based on our observations so far, we conclude that mobile 3D games demonstrate a level of workload stability that makes history-based DVFS algorithms an effective solution. While our UI workload shows more variation, which is inherent in the type of user interactivity involved in UI use cases, we note that with triple buffering the chances of missing frame deadlines due to misprediction decrease.

Quadratic relationship between operating voltage and dynamic energy consumption. (Page 47, Item 2) Recall from Equation 2.8 that energy consumption for a given computation is proportional to the load capacitance (C_L), the activity factor (α), the voltage squared (V^2), and the number of processing clock cycles (N_clk). For the same hardware and the same workload, all of these factors are expected to be constant. The only parameter that changes between two OPPs is the voltage. Therefore, energy consumption for the workload is expected to have a quadratic relationship with voltage. DVFS algorithms balance energy consumption vs. performance, prioritizing one or the other based on the use case. Some DVFS algorithms minimize energy consumption while maintaining minimum performance, while others minimize an integrated cost function such as the Energy-Delay-Product (EDP). In the absence of an energy model, DVFS algorithms approximate the difference in energy consumption between two OPPs by the change in voltage squared between these levels. Our goal is to identify whether this rule is valid for state-of-the-art mobile GPU graphics workloads, or if a detailed energy estimation model is required for more accurate and fine-grained DVFS. Figure 4.2 demonstrates the relationship between energy/power consumption and voltage squared for the EgyptHD benchmark. In Figure 4.2.a, the relationship between the GPU energy consumption and GPU voltage is shown. There is one scatter plot per DDR frequency. The x-axis is the normalized voltage squared for different GPU frequency levels. Each data point is the normalized energy consumption for a specific frame at the OPP corresponding to the GPU and DDR frequency level. The data points for all frames at a fixed OPP virtually create a vertical line in these charts.
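For reference, the dynamic energy and power expressions underlying this analysis (Equations 2.8 and 2.10 in Chapter 2) have the standard CMOS form, restated here with α the activity factor, C_L the load capacitance, N_clk the number of processing clock cycles, and f the operating frequency:

E_dyn ∝ α · C_L · V^2 · N_clk        P_dyn ∝ α · C_L · V^2 · f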


(a) E_GPU ∼ V_GPU    (b) P_GPU ∼ V_GPU · F_GPU    (c) E_GPU+DDR+idle ∼ V_GPU

Figure 4.2: Relationship between OPP voltage and energy consumption.

Both energy and voltage are normalized to the lowest GPU frequency level in each chart. Thus, the chart in each subplot can be used to evaluate the correlation between the energy consumption and the voltage squared across GPU frequencies for the same DDR frequency. At a quick glance, these charts might seem relatively linear (if, for the same frame, the relationship between energy consumption and voltage is quadratic, then the energy vs. voltage-squared trend should be linear). However, we make the following two observations:

• In the subplots with lower DDR frequencies, the length of the vertical lines increases with the normalized voltage. That is, the range between the minimum and maximum normalized energy within the OPP is larger for higher GPU voltage/frequency levels. Considering the scale of the y-axis, the difference between the ranges across GPU frequencies is significant. While the midpoints of the vertical lines in these subplots show some linearity, significant nonlinearities exist at the high end of the normalized energy. This behavior is likely because, at lower DDR frequencies, DDR becomes the bottleneck. Therefore, the increase in the active-wait cycles at higher GPU frequencies becomes more dependent on the memory behavior and instruction mix of the workload. The result is a variation in the normalized energy consumption across different frames. Workloads that are less memory bound exhibit smaller increases in wait cycles as we increase the GPU frequency.

• While the data points at the higher DDR frequencies are more concentrated (i.e., exhibit less variation across frames), the non-linearity across different GPU voltages increases considerably.

The profiling data does not support the quadratic relationship between energy and voltage. However, if we look at Figure 4.2.b, a more linear relationship is observed between the power consumption and the voltage squared multiplied by the frequency (remember, power consumption is also proportional to frequency, in addition to voltage). We can gather from the difference between these two charts that the number of active cycles consuming energy changes as the GPU frequency changes, for a fixed DDR frequency. The assumption that the number of processing cycles for a given task is fixed, regardless of the frequency, only holds within a single frequency domain. While Figure 4.2.a and 4.2.b provided insight into the GPU energy-voltage (power-voltage-frequency) relationship, Figure 4.2.c represents the total energy consumption. Total energy includes the GPU and DDR energy consumption for the active and idle periods. The nonlinearity of the total energy with GPU voltage is even more pronounced in this chart. In addition to the non-linear impact of the GPU DVFS level change on the busy-idle cycles (depending on how memory bound the workload of each frame is), this non-linearity can also be a result of the non-linear impact of the idle

energy. Higher GPU frequencies increase the idle time, and the energy contribution of the idle period depends on the leakage characteristics of the device and the effectiveness of the power gating of the GPU (sub-blocks). Note that Hover and TempleRun2 demonstrate similar nonlinear behavior as well.

Unidirectional relationship between voltage/frequency and dynamic energy consumption. (Page 47, Item 3) So far, we have found that when the OPP changes, the change in energy consumption is not proportional to the voltage squared. Another widespread assumption utilized in DVFS algorithms is that higher voltage/frequency levels always have higher energy consumption. This assumption is implicit in the fact that DVFS algorithms (in the absence of an energy model) select the lowest voltage/frequency level that meets performance requirements. Figure 4.3.a shows the total graphics (GFX) and DDR rail energy consumption per frame for the EgyptHD benchmark at the 390 MHz GPU frequency. The x-axis in this chart is the frame index. Each line in this chart corresponds to one DDR frequency. Figure 4.3.b focuses on a subset of the frames for better clarity. Total energy consumption in this figure includes the idle energy consumption, with the assumption of no power gating during idle times (for more information on how the idle energy is calculated, as well as how power gating is factored into the calculations, refer to Section 3.3.1). Contrary to the assumption, we can see that for a majority of the frames the lowest energy consumption is not achieved by any of the three lowest DDR frequencies. Clearly, the most energy-efficient frequencies are the mid-range frequencies (547 MHz-778 MHz). More surprisingly, the two lowest DDR frequencies are the least energy-efficient frequencies in this figure. The 390 MHz GPU frequency is not the only GPU frequency that exhibits better energy efficiency at higher DDR frequencies. Such behavior can be seen for a significant number of frames at most GPU frequencies. It is notable, however, that the relative energy efficiency of the various DDR frequencies is not the same at all GPU frequencies. In fact, which DDR frequency is the most energy-efficient for a given frame depends on the frame workload mix and the GPU frequency. Design-time decisions about the optimal configuration are difficult to make in light of these observations. Fine-grained energy models that take into account all of the key factors are required. While these charts demonstrate detailed frame-by-frame information for representative GPU and DDR frequencies, Figure 4.6 presents the data for all OPPs in a more compact form. We will use Figure 4.6 for additional insight into the energy consumption profile across OPPs.



Figure 4.3: Total energy consumption for graphics and DDR power rails for EgyptHD, across DDR frequencies. The GPU frequency is 390 MHz. Each line corresponds to one DDR frequency: a) the whole execution, and b) a zoomed area starting at frame 1150.



Figure 4.4: Total energy consumption for graphics and DDR power rails for EgyptHD across GPU frequencies. DDR frequency: a) 300 MHz, b) 1555 MHz, and c) the same as (b), zoomed into an area of interest on the x-axis.


It is equally important to perform the same analysis on the effect of the GPU frequency on energy consumption. Figure 4.4.a and 4.4.b show the combined energy consumption of the GPU and DDR rails for the EgyptHD benchmark for two different DDR frequencies, 300 MHz and 1555 MHz. Figure 4.4.c shows an area of interest for the 1555 MHz DDR frequency. The data in Figure 4.4.a completely matches the expectation of lower energy consumption at lower GPU voltage/frequency levels. However, for the 1555 MHz DDR frequency, the behavior changes drastically. The energy consumption for the 190 MHz GPU frequency goes up: it is less energy-efficient than the 305 MHz and 390 MHz frequencies, and on par with or worse than the 450 MHz GPU frequency for a significant number of frames. Higher DDR frequencies consume more energy. When the GPU frequency is too low, the total frame rendering time increases. DDR energy consumption, in this case, can significantly increase and result in higher total energy consumption at lower GPU frequencies. We have performed our analysis on the GPU and DDR energy consumption so far. Ultimately, however, battery-level energy consumption more directly impacts how long a mobile device can function on a single charge. Battery-level energy consumption is impacted by other hardware blocks within the SoC, such as the CPU. In our experiments, we have fixed the number of CPU cores to 4 and the CPU frequency to 1.4 GHz to be able to obtain consistent profiling results. When reviewing battery-level energy consumption trends, we should be mindful that the performance and energy consumption profile of the fixed CPU core and frequency setting could be different from the expected behavior in a real-world configuration. We have taken some measures to mitigate other system-level impacts on the battery measurements. In our experiments, only the graphics application being profiled has been running. No other workload has been using the other IP blocks on the SoC, except for the operating system's background tasks, which are supposed to be lightweight. Similar to the combined GPU and DDR energy, in Figure 4.5 we observe workload phases with more energy-efficient execution at higher voltage/frequency levels for the battery energy as well. In fact, these trends are more pronounced and have a higher magnitude. The two lowest GPU frequencies become the least energy-efficient for a majority of the frames. Lower frequencies increase the workload runtime, decreasing the chance of the SoC going into low-power states. As a result, battery energy consumption increases. Figure 4.6 demonstrates the aggregated energy consumption across all frames for each OPP. All values are normalized to the bottom-right corner, the OPP with the lowest frequencies. The data in Figures 4.3-4.5 present a subset of the OPPs for our analysis. Figure 4.6 complements the previous charts by presenting the data for all OPPs. On the other hand, this figure only shows general



Figure 4.5: Energy consumption at the battery rail across GPU frequencies. DDR frequency: a) 300 MHz, and b) 1555 MHz.



Figure 4.6: Aggregated energy consumption across all frames for each OPP. Panels (a-e), (g), and (h) show energy consumption, while (f) shows GPU busy-idle cycles for EgyptHD. All values are normalized to the OPP with the lowest GPU and DDR frequency (bottom right).

trends for the complete execution of the application. It can potentially mask local trends through various phases of the execution, as demonstrated in Figures 4.3-4.5. In Figure 4.6.a and 4.6.b we can see the effect of the voltage/frequency on the energy consumption of the GPU and DDR, respectively. As expected, the total GPU-only energy decreases with decreasing GPU frequency for the same DDR frequency (moving from the top to the bottom entry in each column). However, the energy consumption increases with decreasing DDR frequency (moving left to right in each row). As we noted before, when DDR becomes the bottleneck at lower DDR frequencies, the busy-idle cycles of the GPU increase. As a result, the total energy consumption goes up. In order to confirm that busy-idle cycles are responsible for this trend, we inspect Figure 4.6.f. The busy-idle count in this figure is calculated as the difference between the total processing cycles and the number of cycles during which the ALU has been active. Without any exception, the busy-idle cycles decrease for higher DDR frequencies at the same GPU frequency. Also, for the same DDR frequency, higher GPU frequencies result in a higher number of busy-idle cycles. Figure 4.6.b shows that DDR energy efficiency exhibits strongly nonlinear behavior. We can see that the DDR energy decreases with increasing GPU frequency, as expected (similar to the GPU energy with increasing DDR frequency). However, the DDR energy consumption does not always increase with increasing DDR frequency. The mid-range DDR frequencies are more energy-efficient than the low or high end of the DDR frequency range. Note that the operating voltage for all DDR frequencies is almost the same; this is a common practice in state-of-the-art mobile SoCs. Therefore, voltage does not contribute to the change in energy consumption. The nonlinear behavior is a result of the complex DDR memory subsystem and optimizations that might only apply to certain levels. The PMIC design also contributes to this nonlinear characteristic. Figures 4.6.c, 4.6.d, and 4.6.e show the combined GPU and DDR energy, the battery energy excluding idle energy, and the battery energy including idle energy, respectively. In Figure 4.6.c and 4.6.e we can see the combined effect of the variation in busy-idle cycles and the nonlinear DDR energy, as described above. According to these two tables, the two lowest DDR frequencies are among the least efficient levels, while the mid-range frequencies are among the most energy-efficient levels. Additionally, note how the lowest frequencies, i.e., the bottom row(s) and rightmost column(s), are among the least energy-efficient, especially in terms of battery energy. This trend is most notable in Figure 4.6.d. The underlying reason for the low efficiency at the lowest DDR frequencies is the increase in busy-idle cycles. On the other hand, lower GPU frequencies specifically demonstrate lower efficiency in terms of

1. Busy-idle cycles: the number of cycles the GPU is waiting for data from DDR, when DDR is the bottleneck.

battery energy. The increase in the total processing time, and therefore the total active cycles at the SoC level, for lower GPU frequencies can help explain this trend. Incorporating the effect of the idle energy in Figure 4.6.d results in Figure 4.6.e. Idle energy is higher when the processing time goes down, i.e., when running at higher frequencies. Therefore, the difference between the energy consumption of the highest and lowest GPU frequencies becomes smaller when idle energy is included. Finally, Figure 4.6.g and 4.6.h show the battery energy, including idle energy, for TempleRun2 and Hover, respectively. These two workloads require less processing performance compared to EgyptHD. As a result, they conform closer to the established assumptions and are more energy-efficient at lower levels. They do not exhibit the type of non-linearity observed in EgyptHD. For these two benchmarks, the DDR is not a bottleneck. Therefore, busy-idle cycles do not vary significantly across DDR frequencies, and thus lower levels are more energy-efficient. That said, TempleRun2 is still the most energy-efficient at the lowest GPU and the third lowest DDR frequency, instead of the lowest DDR frequency. Even though all frames meet the deadline at the lowest DDR frequency, it is best to run it at the third lowest DDR frequency to minimize the busy-idle cycles, and consequently the energy consumption.

The relative impact of the GPU DVFS level on the performance and energy consumption compared to the impact of the DDR DVFS level. (Page 47, Item 4) So far, we have been working at a frame granularity. Thus, we could only analyze the trends of a change in the energy consumption with the GPU or DDR frequency when the other frequency was fixed. Here we want to explore some scenarios in which both of these frequencies change, potentially in opposite directions. We are going to investigate whether GPU or DDR DVFS level adjustments have more impact on energy consumption. Knowing that one of the two domains has more impact on the energy consumption, we would be able to choose between two OPPs that both meet the deadline, one with a higher GPU but lower DDR frequency relative to the other OPP. Figure 4.7.a and 4.7.b show the transition from OPP0 (<510 MHz, 778 MHz>) to two other OPPs: OPP1 (<450 MHz, 778 MHz>) and OPP2 (<510 MHz, 691 MHz>). The ratio between the GPU frequency of OPP1 and OPP0 is almost the same as the ratio between the DDR frequency of OPP2 and OPP0. According to Figure 4.7.a and 4.7.b, the reduction in the combined energy consumption of the graphics and DDR power rails is higher when we reduce the GPU frequency (OPP1) compared to reducing the DDR frequency (OPP2). Similar behavior, with a slightly lower magnitude, is observed in terms of the battery-level energy consumption. This is not surprising given that the voltage of the DDR rail is almost the same across DDR frequencies.


If we can assume that the GPU frequency has more impact on the total energy, a DVFS algorithm could leverage this knowledge to preserve energy by reducing the GPU frequency while increasing the DDR frequency to prevent deadline misses. Figure 4.7.c and 4.7.d demonstrate three such transitions for two benchmarks, EgyptHD and TempleRun2, respectively. The energy is normalized and color coded. For each transition, the energy is normalized to the starting OPP. Thus, the red horizontal line with the y-axis value of one corresponds to the energy for the starting OPP. We can see that the GPU frequency has a higher impact on the energy consumption for some of the OPP transitions (e.g., the top OPP transition in Figure 4.7.c and 4.7.d) but not all of them (e.g., the bottom two OPP transitions in Figure 4.7.c and 4.7.d). Additionally, this rule does not consistently apply across frames and changes throughout the workload phases. Clearly, we cannot generalize the assumption that the GPU frequency impacts energy consumption more than the DDR frequency in a way that could be incorporated into a DVFS algorithm. We can summarize the conclusions from our energy consumption analysis as follows:

1. In modern mobile GPUs running 3D graphics applications, there are many workload phases for which running at higher DVFS levels can be more energy-efficient.

2. There is considerable variation in the relative energy efficiency of DVFS levels within the workload. Even within the same application, the most efficient DVFS operating level can change across the application phases.

3. The impact of DDR frequency scaling on the energy profile, especially in the context of combined GPU-DDR DVFS, is significant. The established assumptions about the relative energy efficiency of DVFS levels cannot be easily extended to characterize the energy profile when the two frequencies change (in opposite directions).

4. Given the above observations, maximizing energy savings, which is a top priority in mobile SoCs, requires a fine-grained energy model. Such models need to take into account the hardware energy consumption characteristics when running at different DVFS levels for different types of workloads at frame or sub-frame granularities. Generalized design-time decisions about the energy-efficient configuration are no longer sufficient.

Linear relationship between performance and frequency (Page 48, Item 5) There are different scenarios in which a DVFS algorithm needs to estimate the level of performance change when changing frequency: a) when there is going to be slack time because the workload is decreasing, the algorithm needs to estimate whether any of the lower DVFS levels still meets the frame deadline before reducing the frequency; and b) when the workload is increasing, the DVFS



Figure 4.7: Normalized change in the energy consumption with the GPU and/or DDR frequency transitions: a) combined graphics and DDR energy consumption change when transitioning from OPP0 (<510 MHz, 778 MHz>) to OPP1 (<450 MHz, 778 MHz>) (red line) or to OPP2 (<510 MHz, 691 MHz>) (blue line); b) battery-level energy consumption change with the same transition as in (a); c) transitions involving changes to both the GPU and the DDR frequencies in opposite directions for EgyptHD; and d) the same as (c) but for the TempleRun2 game application.

algorithm needs to be able to estimate how much frequency increase is required to meet an increased workload requirement. A common, simple assumption is that performance increases at the same rate as the frequency. That is:

Perf_NewLevel / Perf_PrevLevel = Freq_NewLevel / Freq_PrevLevel

This follows from the assumption that the number of clock cycles for a given task is fixed; therefore, the processing time for the task decreases at the same rate that frequency increases. We already demonstrated that increasing the core frequency without changing the DDR frequency can increase the busy-idle cycles, and consequently the total processing cycles (Figure 4.6.f). Yet, existing DVFS algorithms have largely ignored DDR frequency scaling by implicitly or explicitly assuming DDR is fixed at the highest frequency [84, 83, 101]. Besides, GPUs have a deep and complex pipeline. Different pipeline stages can become the bottleneck at different times. The effect of the GPU frequency on those bottlenecks and the overall performance is not necessarily linear. The scatter plots in Figure 4.8.a and 4.8.b show the rate of performance vs. frequency change for DDR and GPU frequencies, respectively. The values on the y-axis are normalized frequencies (9 steps in the inclusive range [1, 7.78] for DDR and 6 steps in the inclusive range [1, 3.15] for GPU, respectively). The x-axis is the normalized performance. In Figure 4.8.a (4.8.b), the y-axis is the normalized DDR (GPU) frequency. Performance at each DDR (GPU) frequency in Figure 4.8.a (4.8.b) is normalized to the lowest DDR (GPU) frequency. Data points at different GPU (DDR) frequencies in Figure 4.8.a (4.8.b) are displayed in different colors. Each data point in these figures corresponds to a single frame at a unique OPP. The scatter plots help us see the big picture, but they hide key information about the trends across frames. Figures 4.8.c-4.8.f demonstrate the normalized change in the performance for different GPU/DDR frequencies for the EgyptHD benchmark across frames. In Figures 4.8.c and 4.8.e, each line corresponds to a GPU frequency; the DDR frequency in these two figures is fixed at 1037 MHz and 461 MHz, respectively. Similarly, Figures 4.8.d and 4.8.f represent two different (high and low) GPU frequencies, with each line corresponding to a DDR frequency. From these charts we can make the following observations (we will use the following notation for brevity: RP, the rate of the performance change; RF, the rate of the frequency

change; RR, the ratio of the performance change rate over the frequency change rate, i.e., RR = RP / RF; RR is the rate of performance change per unit of frequency change):

1. We observe some linearity between RP and RF_DDR in Figure 4.8.a for the higher GPU



Figure 4.8: Performance-frequency relationship. Performance change due to a change in: a) DDR frequency, b) GPU frequency. Relatively consistent change in performance across frames with a change in: c) GPU frequency, and d) DDR frequency. Highly variable change in performance due to a change in: e) GPU frequency, f) DDR frequency.

Notations used in these charts ⇒ RF : Rate of the frequency change (frequency normalized to the

reference level); RP : Rate of the performance change (performance normalized to the reference level). 65 CHAPTER 4. WORKLOAD ANALYSIS AND DESIGN SPACE EXPLORATION

frequencies. For the lower GPU frequencies, the nonlinearity in RR DDR increases (because GPU is more likely to become the bottleneck for some frames and DDR frequency increase

becomes less effective). For 190 MHz GPU frequency RR DDR goes as low as 0.2 (the red data points in Figure 4.8.a corresponding to this GPU frequency virtually form a vertical line). This observation is also clearly visible in Figure 4.8.d and 4.8.f. At the maximum DDR

frequency, for the 190 MHz GPU frequency, RP ' 1.8 while for the 600 MHz GPU frequency

RP goes as high as ∼ 4.5.

2. For higher DDR frequencies (at a fixed GPU frequency), RR_DDR is lower. That is, the performance gain per unit of frequency increase diminishes at higher DDR frequencies.

3. We observe significant variation in RR_DDR across frames in Figures 4.8.f and 4.8.d. That is, the rate of performance change with DDR frequency varies significantly across frame workloads: more compute-bound frames see less impact from an increase in DDR frequency. The magnitude of the variation in RR_DDR grows as the GPU frequency increases, since the memory-bound frames experience a larger performance improvement.

4. We observe similar behavior for the effect of the GPU frequency on performance. We see some linearity between RP and RF_GPU, and RR_GPU is higher on average for higher DDR frequencies (when DDR is not the bottleneck). There is one difference between RR_GPU and RR_DDR: RR_GPU varies less across the frames at the higher DDR frequency, 1037 MHz. When the DDR frequency is not the bottleneck, most frames see a similar benefit from an increased GPU frequency. This is in contrast with the higher variation in RR_DDR at higher GPU frequencies.

Two important conclusions follow from these observations: 1) the performance gain per unit increase in frequency varies significantly with the workload, and it also varies across GPU and DDR frequency levels; 2) in most cases, performance does not change at the same rate as the frequency, and the performance gain per unit of frequency increase can be as low as 0.2. A fine-grained performance model that takes all of these factors into account is needed. This is especially true for graphics workloads, given the soft-realtime deadlines that have to be met; wrong decisions due to inaccurate assumptions come at the cost of frame deadline misses.

4.2 Design space exploration

In the previous section, we conducted workload performance and energy consumption analysis on mobile 3D applications. The common theme among our observations was that there

are a significant number of workload phases and DVFS level configurations for which some of the widely accepted assumptions utilized in DVFS techniques are rendered ineffective and unreliable. The complexity of state-of-the-art mobile SoCs, and a workload's impact on the performance and energy profile, confirm that fine-grained performance and energy consumption models are required to find the more efficient configurations. In this section, we quantify the impact of some of the hardware and software configuration parameters on the energy efficiency and performance of a DVFS solution. We achieve this goal through a design space exploration of DVFS configuration parameters using our proposed framework. We quantify the potential improvements that can be gained by incorporating energy awareness into the DVFS algorithm, and we show that safety margins and level switching latency are among the parameters with the most influence on DVFS effectiveness for mobile graphics workloads. We also explore the impact of the OPP configuration on DVFS and propose heuristics to select an optimized OPP configuration for a given DVFS algorithm.

4.2.1 The ideal DVFS

From a high-level perspective, the goal of all DVFS solutions is to reduce power and/or energy consumption. Reducing power consumption does not necessarily always result in lower energy consumption; it is sometimes desirable to reduce peak power consumption due to thermal limitations and costs, even if the total energy consumption for the given task increases. Furthermore, DVFS balances a trade-off between performance and power/energy consumption. Hence, there can be a variety of DVFS objectives (e.g., targeting maximum performance for a given power budget, versus achieving the maximum reduction in power/energy without impacting performance beyond a certain threshold). The soft-realtime requirements of mobile GPU workloads translate into unique challenges and opportunities. The challenge is that, in most cases, frame deadline violations significantly impact the user experience; specifically, in VR and AR use cases there is virtually zero tolerance for deadline violations. Frame deadline guarantees are a unique challenge, even for some of the DVFS solutions proposed in recent years that specifically target graphics workloads [96, 84]. While these solutions consider the required FPS of the applications for their frequency selection, their policy is to maintain the average FPS within a certain range; as a result, they cannot guarantee meeting individual frame deadlines. The opportunity resides in the fact that there is a well-defined target performance for each frame, which allows for fine-tuning the algorithm by eliminating guesswork and vagueness about the goal. Our policy for the DVFS solutions we explore in this thesis is to guarantee all frame deadlines whenever feasible. We cannot avoid deadline violations if even the highest frequency does not meet the frame deadline; but if there is a frequency that meets the frame deadline, our policy is to never select a frequency that fails to guarantee the deadline. Reducing energy consumption is secondary to guaranteeing the QoS of the graphics applications (i.e., guaranteeing the frame deadline). Our DVFS policy regarding power/energy consumption is to minimize the energy consumption in order to increase battery life: if running at a higher voltage/frequency level results in lower energy consumption, even if the average or peak power consumption is higher, we select the higher voltage/frequency level. With offline post-processing of the performance and energy data harvested in our proposed framework, we are able to build oracle performance and energy models. Given a frame from one of our applications, and a GPU and DDR frequency as input, the oracle models return the exact draw time and energy consumption. Using these oracle models, we can create the ideal solution that implements our DVFS policy described above. Listing 1 shows the ideal DVFS implementation of our desired DVFS policy for a single frame. There are two main functions in this algorithm, MeetsDeadline(...) and GetEnergy(...): MeetsDeadline(...) identifies whether an OPP meets the frame deadline using our oracle performance model, and GetEnergy(...) predicts the energy consumption of the OPP using our oracle energy model. The algorithm has two main steps. Step one identifies all the OPPs that meet the deadline for the given frame (the viable OPPs); an example of this step is shown in Figure 4.9.a, where all the OPPs with a GPU frequency above 190 MHz and a DDR frequency above 300 MHz meet the frame deadline. Step two selects the OPP with the minimum energy consumption from the set of viable OPPs (Figure 4.9.b). These two steps are repeated for every frame of an application in order to obtain the performance and energy profile of the whole application, computed at an individual-frame granularity. We use the power and energy profile of this ideal solution to perform our design space exploration in the rest of this chapter. We will also utilize this ideal DVFS to evaluate the effectiveness of our proposed solution in Chapter 6.


Listing 1 The ideal DVFS
Input: FrameNum
Input: OraclePerfModel
Input: OracleEnergyModel
 1: ViableOpps ⇐ []
 2: for opp in AllOpps do
 3:     if MeetsDeadline(FrameNum, opp) then
 4:         ViableOpps.add(opp)
 5:     end if
 6: end for
 7: Energies ⇐ []
 8: for opp in ViableOpps do
 9:     Energy ⇐ GetEnergy(FrameNum, opp)
10:     Energies.add(Energy)
11: end for
12: OppIndex ⇐ LowestEnergy(Energies)
13: SelectedOpp ⇐ ViableOpps[OppIndex]
Output: SelectedOpp
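For readers who prefer executable code, the following minimal Python sketch expresses the same per-frame selection, assuming the oracle models are available as dictionaries keyed by (frame number, OPP); the names and data structures here are illustrative rather than the implementation used in our framework:

def ideal_dvfs_for_frame(frame_num, all_opps, oracle_perf_ms, oracle_energy, deadline_ms=16.7):
    # Step 1: keep only the OPPs whose oracle-predicted frame time meets the deadline.
    viable = [opp for opp in all_opps if oracle_perf_ms[(frame_num, opp)] <= deadline_ms]
    if not viable:
        # No OPP can meet the deadline; one reasonable fallback is the fastest OPP.
        return min(all_opps, key=lambda opp: oracle_perf_ms[(frame_num, opp)])
    # Step 2: among the viable OPPs, pick the one with the minimum oracle energy.
    return min(viable, key=lambda opp: oracle_energy[(frame_num, opp)])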

Figure 4.10 demonstrates the OPP transitions and the performance/energy profile for the EgyptHD, Hover, and TempleRun2 applications under the ideal DVFS. We can see that the benchmarks represent a range of use cases: high-frequency OPP changes for Hover, a moderate number of OPP changes for EgyptHD, and very few OPP changes for TempleRun2 (the GPU frequency settles at the lowest level, and the DDR frequency exhibits minimal variation for the majority of the frames). For EgyptHD and Hover, the rate of level changes shows some correlation with the variations in the frame time. While TempleRun2 has more frame time variation than EgyptHD, its OPP change profile is fairly stable because it is a lightweight workload that meets the deadlines at the lowest frequencies for most of the frames.

Figure 4.9: Selecting the optimal OPP for frame 200 of EgyptHD in the ideal DVFS algorithm. a) The total frame time (ms) per OPP; candidate OPPs, i.e., those that meet the 60 fps (16.7 ms per frame) deadline, are highlighted in green and the rest of the OPPs in red. b) The OPP with the minimum energy consumption among the candidate OPPs is selected as the optimal OPP.

Figure 4.10: Ideal DVFS output: OPP transitions and performance/energy profile.

4.2.2 The effect of HW and SW configuration parameters on the DVFS

Now that we have implemented the ideal DVFS algorithm, we start incorporating practical hardware and software configuration parameters into the ideal DVFS. These configuration parameters can be thought of as limitations: they force the ideal algorithm to behave more like a regular governor algorithm. By doing so, we can quantify the negative impact these parameters have on a DVFS algorithm. Note that some of these configuration parameters, such as the DVFS level switching latency, still allow the algorithm to use perfect knowledge of the upcoming frame workload, which is not available in a practical DVFS solution. Therefore, the calculated performance and power quantities for the different configurations are a lower bound on the negative impact of those parameters on DVFS algorithm efficiency; the actual impact for a governor algorithm, without knowledge of the future workload, can potentially be larger than these estimated values.
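As one concrete illustration of how such a limitation constrains the ideal algorithm, the sketch below applies the ld_min_stable parameter (described with the other parameters in Table 4.1) as a post-filter on a sequence of per-frame ideal OPP choices. It assumes OPPs are represented by indices in an ordered OPP list, and it is only a simplified model, not the exact mechanism used in our simulations:

def apply_ld_min_stable(ideal_opp_per_frame, ld_min_stable):
    # Replay the per-frame ideal choices, but disallow moving to a lower OPP index
    # sooner than ld_min_stable frames after the previous level change.
    chosen = []
    frames_since_change = ld_min_stable  # allow a change on the very first frame
    for ideal_opp in ideal_opp_per_frame:
        if chosen and ideal_opp < chosen[-1] and frames_since_change < ld_min_stable:
            chosen.append(chosen[-1])      # hold the current (higher) level
            frames_since_change += 1
        else:
            if chosen and ideal_opp != chosen[-1]:
                frames_since_change = 0    # a level change happens on this frame
            else:
                frames_since_change += 1
            chosen.append(ideal_opp)
    return chosen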


Table 4.1 lists the configuration parameters we model here, with a short description and the sweep values for each parameter. These parameters are categorized into hardware and algorithm parameters. Hardware parameters reflect hardware limitations that DVFS algorithms must work with; algorithm parameters are some of the common limitations of the existing DVFS algorithms. By analyzing the effect of both categories on the quality of the ideal algorithm, we can gain insight into which parameters should be prioritized in designing future DVFS-friendly hardware or new DVFS algorithms. These parameters are described below; a short sketch of enumerating the sweep over these values follows the list.

Table 4.1: Summary of the hardware and software configuration parameters affecting the quality of a DVFS algorithm.

config. param         | description                                                                                    | values
HARDWARE PARAMETERS
switching latency     | Average latency for switching DVFS levels                                                      | 0, 1, 5 (ms)
ld_min_stable         | Minimum number of frames after a level change before a change to lower frequencies is allowed  | 0, 5, 25 (frames)
lu_min_stable         | The same as ld_min_stable, but in the opposite direction                                        | 0 (frames)
ALGORITHM PARAMETERS
frame history window  | The size of the frame history window used to estimate the workload of the next frame           | 0, 1, 5 (frames)
max switching stride  | The maximum number of DVFS levels allowed between the originating and destination levels       | 1, 3, unbounded
optimization criteria | Optimize for minimum energy or minimum frequency                                                | energy, frequency
safety margin         | Safety margin deducted from the deadline                                                        | 0, 10, 15 (%)

• Switching latency: the average time it takes for the hardware to change voltage/frequency levels. Faster level switching capabilities enable finer-grained and more effective DVFS.

• Level Down (Level Up) Min Stable: in some circuits, changing frequencies too frequently can cause temporary or permanent hardware failure. A value of zero for these two parameters allows a level change every frame; a positive value defines the number of frames after a frequency change during which no further change is allowed. We might define two different values for increasing vs. decreasing frequency if meeting deadlines has higher importance.

• Frame History Window: in order to analyze the effectiveness of history-based prediction, we extend the ideal algorithm to use a history of past frames' rendering time and energy consumption to estimate the upcoming frame. This parameter defines the history window's size in frames. The algorithm still has the exact time and energy across all OPPs for decision making, whereas practical implementations only have access to actual measurements at the current OPP.

• Max switching stride: the maximum number of frequency levels the algorithm can change in one step. This can be a hardware or an algorithm limitation: the hardware PMIC might limit the level-change stride to prevent failure, and while some existing algorithms jump to a predefined high frequency in one step when the workload arrives and then decrease gradually, algorithms that balance performance and energy (especially for 3D graphics, where the performance goal is to just meet the deadline) change levels gradually, one step up or down at a time.

• Optimization criteria: the two options are frequency and energy. This parameter describes how the second step of the ideal algorithm (Section 4.2.1) selects the more energy-efficient OPP. If the optimization criterion is frequency, the OPP with the lower voltage/frequency is considered the more energy-efficient OPP (this model has been utilized in existing DVFS solutions, as well as in part of our proposed solution in Chapter 5); between two OPPs where one has the lower GPU frequency and the other has the lower DDR frequency, the one with the lower GPU frequency is selected. This analysis allows us to quantify the usefulness of an accurate energy model for DVFS.

• Safety margin: due to the potential for frequent mispredictions, history-based algorithms maintain a minimum idle time in each period. The goal is to filter out a majority of small workload changes without missing the deadline. As the imperfection of the algorithm, or the importance of meeting deadlines, increases, the margin needs to be increased. Higher margins sacrifice energy consumption for better performance.
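To make the sweep concrete, the following minimal Python sketch enumerates combinations of the values listed in Table 4.1. The dictionary keys are our own names for the parameters, and the full Cartesian product shown here is only illustrative; the 378 configurations evaluated in this section are a subset of such a product:

import itertools

# Sweep values taken from Table 4.1; None stands for "unbounded".
sweep = {
    "switching_latency_ms":  [0, 1, 5],
    "ld_min_stable_frames":  [0, 5, 25],
    "lu_min_stable_frames":  [0],
    "frame_history_window":  [0, 1, 5],
    "max_switching_stride":  [1, 3, None],
    "optimization_criteria": ["energy", "frequency"],
    "safety_margin_pct":     [0, 10, 15],
}

# One dict per configuration; the constrained ideal DVFS is then replayed once per
# configuration to obtain its energy and average-frame-time profile.
configurations = [dict(zip(sweep, combo)) for combo in itertools.product(*sweep.values())]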

We have applied the above hardware and software configuration parameters to the ideal algorithm and quantified the results. For each test, there are 378 different configurations. Figures 4.11.a and 4.12 present the simulation results across these configurations. In Figure 4.11.a, the x-axis denotes the 378 distinct configurations, and the y-axes show the energy consumption (left axis) and the average frame time (right axis) for the corresponding configuration. The configurations on the x-axis are generated by nested loops over the values of each parameter: the innermost parameter loop accounts for the local variations, while the outermost parameter loop accounts for the long-range phases in the chart. The outermost parameter in Figure 4.11 is the DVFS level switching latency, for which we explored three values: 0, 1, and 5 ms. Accordingly, we can see three distinct phases in both energy and average frame time in this figure (configurations 0-125, 126-251, and 252-377).

Figure 4.11: Ideal DVFS: performance/energy profile across all configurations for EgyptHD, Hover, and TempleRun2. a) Energy consumption and average frame time per configuration. b) Correlation coefficients between the configuration parameters and the energy, frame misses, and average frame time.

Figure 4.12: Summary of the effect of hardware and software configuration parameters on the energy/performance. a) Difference between the worst configuration and the ideal (graphics + DDR rails) in energy and average frame time. b) Worst-case number of janks (missed frame deadlines).

Figure 4.13: Optimizing for minimum energy vs. minimum frequency: the impact of the optimization criterion on energy consumption across DVFS configurations for EgyptHD, Hover, and TempleRun2.

Some of the observations we can make from these charts are as follows:

• We see a considerable difference between the least energy-efficient and most energy-efficient configurations. The same observation holds between the configurations with the lowest and highest average frame time; i.e., common hardware and software limitations significantly impact the efficiency of the DVFS algorithm. Figure 4.12 summarizes the results across these configurations for the three applications we have evaluated. Figure 4.12.a demonstrates the energy and average frame time difference between the least energy-efficient configuration and the original (unrestricted) ideal algorithm configuration; it also shows the difference between the configuration with the highest average frame time (lower performance) and the unrestricted ideal algorithm. In soft-realtime 3D graphics applications, meeting deadlines is the more meaningful performance metric. Figure 4.12.b shows the number of missed frame deadlines for the worst configurations; the number of missed frame deadlines for the unrestricted ideal algorithm is zero. We can see that the worst configurations increase the number of frame deadline misses significantly.

• Not surprisingly, the energy consumption and the average frame time trend in opposite directions: the more energy-efficient configurations increase the average frame time, and an improved performance profile comes at the cost of higher energy consumption.

• The impact of these various configurations on DVFS effectiveness varies considerably across benchmarks. For example, TempleRun2's energy consumption is less sensitive to configuration changes, because TempleRun2 is a lightweight benchmark running, most of the time, at the lowest frequencies with considerable slack time. This lightweight but consistent workload is less dependent on the DVFS response latency. TempleRun2 is also interesting in that its energy and average frame time trends are aligned.

We can leverage our design space exploration results to identify the parameters that have the most impact on energy and/or performance. Figure 4.11.b presents the correlation coefficients between the energy/performance (average frame time) and the configuration parameters. Using the first row and last column in each table, we can conclude that the switching latency and the safety margin are the two parameters with the highest impact on DVFS efficiency. One of the more interesting observations from these results is that making decisions for the next frame using the history of past frames has limited impact on the results. We can use the workload variation results in Figure 4.1 to explain this observation: the three benchmarks we studied here have very consistent workloads, and a majority of the frame-to-frame variations in these applications produce less than a 10% change in workload. Expanding these preliminary results to a more diverse set of applications is part of the future work in this thesis. We can expect to see a greater impact from the frame history window size on the effectiveness of DVFS for use cases such as UI, which exhibits a significantly higher rate of workload variation (Figure 4.1.a).
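Figure 4.11.b is essentially a correlation table. A minimal sketch of how such a table can be computed from the sweep results, assuming the per-configuration results are collected in a pandas DataFrame with one row per configuration and numeric columns for the parameters and the measured outcomes (the column names below are ours):

import pandas as pd

def parameter_correlations(results: pd.DataFrame) -> pd.DataFrame:
    # results: one row per DVFS configuration, e.g. columns such as
    # ['switch_latency_ms', 'safety_margin_pct', 'hist_window', 'ld_min_stable',
    #  'switch_stride', 'energy_J', 'frame_miss', 'avg_frame_time_ms'].
    # Pairwise Pearson correlations; the 'energy_J' and 'avg_frame_time_ms'
    # rows/columns indicate which parameters matter most.
    return results.corr(numeric_only=True)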

The correlation coefficients presented in Figure 4.11.b do not capture the impact of optimizing for the minimum energy vs. the minimum frequency. Figure 4.13 shows the difference in energy consumption (plotted as a percentage) when optimizing for the lowest energy versus the lowest frequency. The x-axis is the DVFS configuration; since we have separated the results based on the optimization criterion, which has two possible values, the number of configurations shown is 190, about half of the total. A negative value in this chart means that optimizing for the minimum energy resulted in higher energy consumption. For the unconstrained ideal algorithm we will not have negative values, but if we use history to make a decision for the upcoming frame, optimizing for the lowest frequency might coincidentally result in lower energy consumption due to prediction inaccuracy. We can see that optimizing for energy consumption can result in up to 10% lower energy consumption. This impact is particularly evident in TempleRun2. As mentioned earlier, TempleRun2 can run even at the lowest available frequency; however, this application becomes memory bound at the lowest DDR frequency and the average frame time increases significantly. Consequently, the total energy consumption (not the average power consumption) goes up. Therefore, an energy-aware DVFS algorithm that recognizes this behavior can save a considerable amount of energy by picking the higher DDR frequency.
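The TempleRun2 case illustrates why minimizing frequency is not the same as minimizing energy: energy is the product of average power and time, so an OPP that lowers power but stretches the frame can cost more total energy. A tiny illustrative calculation (the power and time values below are made up for illustration, not measurements from our platform):

# Hypothetical values, for illustration only.
slow_opp = {"avg_power_w": 0.9, "frame_time_ms": 20.0}   # lowest DDR frequency, memory bound
fast_opp = {"avg_power_w": 1.2, "frame_time_ms": 12.0}   # higher DDR frequency

def frame_energy_mj(opp):
    # power (W) x time (ms) = energy (mJ)
    return opp["avg_power_w"] * opp["frame_time_ms"]

print(frame_energy_mj(slow_opp))   # 18.0 mJ
print(frame_energy_mj(fast_opp))   # 14.4 mJ: higher frequency, yet lower total energy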

OPP configuration: The last configuration parameter we explore in this section is the OPP configuration. A larger number of DVFS levels theoretically increases the opportunity for finer-grained adjustments and more energy-efficient operation; therefore, the number of GPU and DDR voltage/frequency levels in mobile SoCs has been increasing. Our target platform, the Qualcomm Snapdragon 810, provides six GPU and nine DDR levels, for a total of 54 possible combinations, and the next-generation chipset from Qualcomm, the Snapdragon 820, has even more frequencies. This has become a new challenge for many of the existing DVFS algorithms (including the Linux kernel governor algorithms). These algorithms change DVFS levels gradually when the workload changes, often one step at a time (algorithms that favor performance might jump multiple levels up to a predefined level in one step when the workload increases, but they elect to decrease the frequency gradually when the workload goes down). Increasing the number of GPU and DDR frequency combinations may therefore cause significant latency, since it takes more steps to adjust to the right frequency. Additionally, these algorithms organize the DVFS levels in an ordered list of increasing frequency (as a proxy for performance); for two OPPs where one has the higher GPU frequency and the other has the higher DDR frequency, organizing the levels as an ordered, frequency-based list becomes another issue. One way to address the above concerns is to select an OPP configuration that includes only a subset of the OPP space. We explore the effect of seven different OPP configurations in this section. Figure 4.14.a demonstrates three of these configurations: diagonal, column, and semi-column. The diagonal configuration is identified by a green background color, while the column and semi-column configurations are identified by red and blue line colors, respectively.

The idea behind the diagonal OPP configuration is to use a balanced configuration in which neither the GPU nor DDR is the bottleneck; therefore, higher GPU frequencies are paired with higher DDR frequencies. The column and semi-column configurations aim to cover a broader range of DDR frequencies. The rationale is that, in order to achieve high efficiency for a broad range of applications, we might need to consider lower DDR frequencies paired with higher GPU frequencies for compute-intensive workloads, and the opposite for memory-intensive workloads. The latter two configurations require a DVFS algorithm that does not rely on an ordered (performance-ranked) set of OPPs. We call the other four configurations gpu_perf (first row in Figure 4.14.a), gpu_energy (last row), ddr_perf (left-most column), and ddr_energy (right-most column). These configurations emulate the Linux DVFS performance and energy governors; there are only two such governors in the Linux kernel, since the DDR frequency there is fixed. The performance governor is optimized for performance and always runs at the highest frequency, while the energy governor is optimized for energy saving and always runs at the lowest frequency. Here, we create four governors based on these two: two with the lowest and highest GPU frequency, and two with the lowest and highest DDR frequency.

Figure 4.14.b shows the results for the unconstrained ideal algorithm. The number in each cell is the number of frames for which the ideal DVFS selected the corresponding OPP as the most energy-efficient level that meets the deadline. We normalize the total energy and the average frame time of each OPP configuration to the results of this unconstrained ideal configuration in order to obtain the results presented in Figure 4.15. Each chart in this figure shows the energy or performance of one of the benchmarks relative to the baseline (unconstrained ideal), and each chart contains results for two different algorithm configurations: the ideal bars correspond to the ideal algorithm with no limitation other than the OPP configuration, while nonIdeal applies the highest value of every constraint in Table 4.1 in addition to the OPP configuration. We use the latter to measure how much the impact of the OPP configuration depends on the underlying algorithm. According to the results, the diagonal OPP configuration has the most balanced performance and energy profile, very close to the baseline. The results for the gpu_energy and ddr_energy configurations are counterintuitive: these two configurations reduce performance as expected, but they also increase the energy consumption (especially ddr_energy) rather than decreasing it. This is because, in these particular configurations, the fixed minimum frequency becomes the bottleneck and increases the average frame time significantly; the result is lower energy efficiency.

Figure 4.14: Design space exploration: OPP configurations and OPP residency. a) OPP configuration map. b) OPP residency under the ideal DVFS for EgyptHD, Hover, and TempleRun2.

This is another clear example that the addition of the DDR frequency to DVFS changes some of the assumptions that current state-of-the-art DVFS algorithms rely on. The ddr_perf and gpu_perf configurations generate the expected results: they increase performance at the cost of increased energy consumption. For the ideal configuration, the performance increase in these configurations is higher than the energy increase, resulting in a better energy-delay product; for the nonIdeal configuration, which is a proxy for a practical algorithm, the performance increase in TempleRun2 and Hover is smaller than the increase in energy consumption. Finally, the column and semi-column configurations generate very similar results. They are both better than the diagonal configuration for EgyptHD, but worse for the other two applications. Given that these two configurations provide a wider range of DDR frequencies for every GPU frequency, they might have an advantage over the diagonal configuration in some real use cases; we can expect that, for workloads that exhibit phases with high degrees of compute or memory intensity, these two configurations will outperform the diagonal. Our planned work as part of this thesis will shed more light on the factors that govern this behavior. One last important observation is that, for some OPP configurations, the results of the ideal algorithm are considerably different from those of the nonIdeal algorithm.

Figure 4.15: Effect of the OPP configurations on the performance and energy consumption: energy and average frame time difference relative to the all_opp configuration (%) for EgyptHD, TempleRun2, and Hover, under the ideal and nonIdeal algorithm configurations.

The differences

observed for these two configurations suggest that the impact of the OPP configuration can very well depend on the DVFS algorithm. Our conclusion is that no single OPP configuration can provide the best power-performance profile as the hardware characteristics, DVFS algorithms, and design objectives (e.g., designing for better performance vs. a balanced performance-energy profile) vary. It is important to be able to select an optimized OPP configuration suited to a given DVFS algorithm and hardware characteristics. In the next section, we propose heuristic algorithms that can be used to select an optimized OPP configuration for a given DVFS algorithm.

4.3 Proposed OPP selection algorithms

Criteria: energy efficiency. The goal of this heuristic algorithm is to find the minimum number of OPPs that, if used, result in an energy consumption within error% of that of the ideal (all-OPP) configuration. Listing 4.1 shows the pseudocode for a simple algorithm that satisfies this goal. The OPPs are first sorted, best to worst, according to the chosen criterion, and are then added to the configuration one at a time until the resulting configuration meets the goal. The main parameter we can use for sorting the OPPs is the total energy consumption across all frames at each OPP. The other criterion, shown in the sort_opps function, is the frame coverage: the number of frames for which an OPP is the optimal choice. The potential advantage of considering frame coverage over total energy is to prioritize coverage of a variety of frame workloads over individual frames with a large contribution to the total energy consumption. When we use total energy consumption as the sorting criterion, the demote_low_coverage_opps function can achieve the same goal by de-prioritizing the OPPs whose frame coverage is lower than a certain threshold.

def get_opp_config_criteria_efficiency(all_opps, all_frames, error, criteria):
    frame_coverage = get_opp_frame_coverage(all_frames, all_opps)
    sorted_opps = sort_opps(all_frames, all_opps, criteria)
    # Remove the OPPs that are optimal for very few frames. This is just
    # for corner cases. Most likely, those that are optimal for a low number
    # of frames have high total energy consumption and will not be selected.
    if criteria != 'frame_coverage':
        sorted_opps = demote_low_coverage_opps(sorted_opps)
    opp_config = []
    for i in range(len(sorted_opps)):
        opp = sorted_opps[i]
        opp_config.append(opp)
        if acceptable_config(error, opp_config):
            break
    return opp_config

def get_opp_frame_coverage(frames, opps):
    # For each OPPi, return the number of frames for which OPPi
    # is the most energy-efficient OPP.
    ...

def demote_low_coverage_opps(opps):
    # For an OPPi that is the optimal OPP for a very low number of frames,
    # move OPPi to the end of the sorted list "opps".
    ...

def acceptable_config(error, opp_config):
    dcvs_result = dcvs(opp_config)
    ideal_results = dcvs(all_opps)
    if dcvs_result['total_energy'] <= ideal_results['total_energy'] * (1 + error):
        return True
    else:
        return False

def sort_opps(frames, opps, criteria):
    if criteria == 'total_energy':
        # Sort OPPs based on the aggregated total energy consumption
        # of all frames at each OPP.
        ...
    elif criteria == 'frame_coverage':
        # Sort OPPs based on the number of frames for which each
        # OPP is the optimal OPP.
        ...

Listing 4.1: OPP selection algorithm, criteria: Energy efficiency (simple)

The algorithm presented in Listing 4.1 has two limitations:

1. Selecting OPPs based on the total energy consumption across all frames can result in suboptimal choices. This issue is demonstrated by the simple example in Table 4.2. In this example, the OPP selection algorithm of Listing 4.1 selects OPP3 and OPP1, given their lower total energy consumption. Assuming an ideal DVFS algorithm, when these frames execute, the algorithm selects OPP1 for frames 1-4 and OPP3 for frames 5-6. However, in the ideal configuration (all of the OPPs enabled), OPP2 would have provided lower energy consumption for frames 5-6. Interestingly, in this specific example, sorting based on frame coverage would end up with the optimal result, but that is not necessarily the case in general (a short sketch reproducing this arithmetic follows the list below).

        frame1   frame2   frame3   frame4   frame5   frame6   sum (all frames)
OPP1    10       8        2        15       11       3        49
OPP2    15       10       8        16       5        2        56
OPP3    11       9        3        15.5     6        2.5      47

Table 4.2: An example of OPP energy consumption leading to suboptimal decisions

2. The OPP selection algorithm 4.1 does not take the deadlines into account. During the OPP sort and selection process, it is better if the algorithm gives higher weight to whether or not deadlines are met at a specific OPP.
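A minimal sketch reproducing the Table 4.2 arithmetic referenced in the first limitation above (the per-frame energies are copied from the table):

# Per-frame energy of each OPP, copied from Table 4.2.
energy = {
    'OPP1': [10, 8, 2, 15, 11, 3],
    'OPP2': [15, 10, 8, 16, 5, 2],
    'OPP3': [11, 9, 3, 15.5, 6, 2.5],
}

totals = {opp: sum(vals) for opp, vals in energy.items()}
print(totals)            # OPP3 (47) and OPP1 (49) have the lowest totals, so the
                         # simple algorithm of Listing 4.1 selects them first.

subset = ['OPP1', 'OPP3']
per_frame_subset = [min(subset, key=lambda o: energy[o][f]) for f in range(6)]
per_frame_all    = [min(energy, key=lambda o: energy[o][f]) for f in range(6)]
print(per_frame_subset)  # OPP1 for frames 1-4, OPP3 for frames 5-6
print(per_frame_all)     # with all OPPs enabled, OPP2 is the cheaper choice for frames 5-6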

Listing 4.2 shows the modified algorithm that addresses the two problems above; only the modified functions are shown. To address the first problem, every time an OPPj is added to the selected configuration, the list of frames is updated by removing the frames for which OPPj is the optimal choice; the OPPs are then ranked again based on their total energy or frame coverage, considering only the frames not yet covered by the selected OPPs. This way, we minimize the frame coverage overlap between the selected OPPs and, at each step, select the best OPP for the uncovered frames (a local optimum). To address the second problem, the function that identifies the frame coverage of each OPP is also updated, counting only the frames whose deadline is met at that OPP.

def get_opp_config_criteria_efficiency(all_opps, all_frames, error, criteria):
    frame_coverage = get_opp_frame_coverage_meeting_deadline(all_frames, all_opps)
    opp_config = []
    opps = all_opps
    frames = all_frames
    num_iterations = len(all_opps)
    for i in range(num_iterations):
        sorted_opps = sort_opps(frames, opps, criteria)
        if criteria == 'total_energy':
            sorted_opps = demote_low_coverage_opps(sorted_opps, frame_coverage)
        opp = sorted_opps[0]
        opp_config.append(opp)
        if acceptable_config(error, opp_config):
            break
        # Remove the selected OPP and the frames it covers, then re-rank the rest.
        opps = sorted_opps[1:]
        frames = frames - frame_coverage[opp]
    return opp_config

def get_opp_frame_coverage_meeting_deadline(frames, opps):
    # For each OPPi, return the set of frames for which OPPi is the most
    # energy-efficient OPP and the frame deadline is met at OPPi.
    ...

Listing 4.2: OPP selection algorithm, criteria: Energy efficiency

Criteria: number of OPPs. While the objective of the DVFS algorithm is to minimize energy consumption, it might be desirable to keep the number of OPPs under a limit. One such scenario arises with (history-based) algorithms that change frequency gradually, one step at a time, to reach the desired level: if the number of OPPs is large, the latency to settle at the desired level increases. This algorithm aims to select the most energy-efficient OPP configuration composed of exactly N OPPs. Listing 4.3 demonstrates a simple algorithm for finding an optimized OPP configuration of length N (only functions that are new or have changed are shown; other functions, such as sort_opps, are the same as in the previous section and are not repeated here). The advantage of this algorithm is its simplicity; however, it has the same problems as algorithm 4.1. When we favor the quality of the result over simplicity, algorithm 4.4 can be used.

def get_opp_config_criteria_num_opps(all_opps, all_frames, N, criteria):
    sorted_opps = sort_opps(all_frames, all_opps, criteria)
    return sorted_opps[:N]

Listing 4.3: OPP selection algorithm, criteria: number of OPPs (simple)

def get_opp_config_criteria_num_opps(all_opps, all_frames, N, criteria):
    frame_coverage = get_opp_frame_coverage_meeting_deadline(all_frames, all_opps)
    opp_config = []
    opps = all_opps
    frames = all_frames
    num_iterations = len(all_opps)
    for i in range(num_iterations):
        sorted_opps = sort_opps(frames, opps, criteria)
        if criteria == 'total_energy':
            sorted_opps = demote_low_coverage_opps(sorted_opps, frame_coverage)
        opp = sorted_opps[0]
        opp_config.append(opp)
        if acceptable_config(N, opp_config):
            break
        opps = sorted_opps[1:]
        frames = frames - frame_coverage[opp]
    return opp_config

def acceptable_config(N, opp_config):
    return len(opp_config) == N

Listing 4.4: OPP selection algorithm, criteria: number of OPPs

The proposed algorithms in this subsection achieve the goal of managing algorithm latency by keeping the number of OPPs within a limit. They also provide additional benefits compared to the algorithms that optimize for a set energy efficiency: they are simpler, and the function that decides whether the algorithm's goal has been met does not need to run the DVFS algorithm on each iteration. As a result, these algorithms are faster and independent of the actual DVFS algorithm.

Criteria: number of DDR frequencies. Sometimes it may be desirable to make sure there is at least one OPP for each GPU frequency. As our preliminary results in this chapter have shown, the difference between the performance and energy profiles of adjacent GPU frequencies is more significant than that of adjacent DDR frequencies; also, the number of GPU frequencies is smaller than the number of DDR frequencies. If one or more of these GPU frequencies are eliminated by the previous algorithms based on the profiling data, future workloads could be impacted negatively in terms of either performance or energy consumption, especially due to the large performance/energy gap between the selected GPU frequencies. This algorithm is a variation of the algorithm presented in Listing 4.4: it selects a fixed number of DDR frequencies per GPU frequency.

# Select N (N < num_ddr_freqs) OPPs for every GPU frequency.
def select_opp_all_gpufreqs(all_opps, all_frames, N, frame_criteria, sort_criteria):
    opp_config = list()
    gpu_freqs = get_gpu_freq_list()
    ddr_freqs = get_ddr_freq_list()
    for gpu_freq in gpu_freqs:
        # All OPPs pairing this GPU frequency with each available DDR frequency.
        opps = [(gpu_freq, ddr_freq) for ddr_freq in ddr_freqs]
        if frame_criteria == 'all_frames':
            frames = all_frames
        elif frame_criteria == 'frame_coverage':
            frames = get_frame_coverage(all_frames, all_opps, opps)
        # Pick the N best OPPs for this GPU frequency and add them to the configuration.
        opp_config.extend(get_opp_config_criteria_num_opps(opps, frames, N, sort_criteria))
    return opp_config

def get_frame_coverage(all_frames, all_opps, opps):
    # Return the list of frames for which one of the items in "opps" is the
    # optimal OPP. If the resulting list is too small, return the full frame list.
    frame_coverage = get_opp_frame_coverage_meeting_deadline(all_frames, all_opps)
    frames = list()
    for opp in opps:
        frames.append(frame_coverage[opp])
    if len(frames) / len(all_frames) < 0.10:
        return all_frames
    else:
        return frames

Listing 4.5: OPP selection algorithm, criteria: full GPU frequencies coverage

4.4 Conclusions

In this chapter, we performed an extensive performance and energy consumption analysis for mobile graphics workloads and identified some of the areas in which existing work can benefit from improvements. One of the main conclusions derived from our design space exploration (Section 4.2) was that switching latency and safety margins are two of the factors with the most impact on the efficiency of a mobile GPU DVFS solution. Based on our latency profiling results, the iPhone 6+ has a 100 ms latency for one frequency level change and takes about 350 ms to go from the lowest to the highest frequency when the workload changes; the same transitions take 120 ms and 600 ms, respectively, on the Galaxy S5. Given the criticality of meeting individual frame deadlines, it is because of such high latencies that DVFS algorithms need to add large safety margins, at the cost of additional energy consumption. In Chapter 5 and Chapter 6, we present two novel DVFS solutions that address these two problems. Through fine-grained runtime profiling of the frame workloads and fast decision making, these two proposed solutions provide excellent individual frame deadline guarantees, within a few percent of the ideal profile, a significant improvement over existing solutions. Both of our proposed solutions utilize history-based prediction of the workload of the upcoming frame; their focus is on improving the accuracy and speed of the frequency selection. Our analysis in Section 4.1 demonstrated that history-based predictors are

effective for mobile GPU 3D graphics workloads. The advantage of history-based predictors is their simplicity and low runtime overhead.

Quality of Service-aware DVFS. For the same reasons mentioned above, a set of history-based DVFS algorithms, called governor algorithms, are implemented and readily available in the Linux kernel. Linux governors run at regular intervals; they predict the utilization of the next interval based on the history of the past interval(s) [97]. The algorithm's objective is to keep the utilization of each interval within certain thresholds, changing the frequency when the utilization is outside the desired range. Commonly, frequency changes are applied one level at a time, except for specific events that can trigger large frequency changes [96]. Proprietary mobile GPU DVFS algorithms from SoC vendors, such as Samsung and Qualcomm, have similarly employed incremental frequency changes. As a result, frame misses were common in early versions of the Android Operating System (OS) before the introduction of triple buffering in Android 4.1. Still, some of the newer use cases, such as VR and AR, use single buffering for lower latency [89]; these use cases are highly sensitive to frame deadline misses, since misses can cause dizziness. QoS-aware DVFS, presented in Chapter 5, eliminates the main shortcomings of the existing utilization-based DVFS algorithms, summarized as follows: (1) High latency in responding to workload changes: due to sampling inaccuracies, these solutions use the average utilization across multiple samples; when the utilization changes, it can take a few samples before the average value reflects the new utilization. Even when the average utilization reflects the current utilization, the incremental frequency change means it takes multiple executions of the algorithm before it reaches the desired frequency.2 (2) Lack of context and QoS awareness: existing governors use a one-size-fits-all approach; they cannot optimize according to individual application QoS requirements. It is notable that, recently, You and Chung [96] and Peters et al. [97] attempted to incorporate the notion of FPS into GPU and CPU DVFS, respectively. Yet, their solutions still suffer from the aforementioned issues: both studies tried to enforce an FPS rate lower than that of the display, but without regard to the application-specific FPS requirements.

Energy-aware DVFS. QoS-aware DVFS achieves considerable improvements over the existing solutions, especially towards our main goal of eliminating frame deadline violations. Yet, it does not address some of the important limitations of the existing work we identified in this chapter. Notably: (a) it does not address the issue of DDR frequency scaling, i.e., it does not utilize the full two-dimensional GPU-DDR frequency space; (b) it relies on a naïve performance model assuming

2 This problem applies to DVFS solutions that use history-based prediction of other parameters, such as IPC, compute-to-memory ratio, etc., and perform incremental frequency changes.

that performance has a simple linear relationship with frequency; and (c) it relies on a naïve energy model assuming that the lowest voltage/frequency level always has the lowest energy consumption. The first item is the most important factor; many of our observations throughout this chapter, including Item (b) and Item (c), are directly related to the impact of DDR frequency scaling in our environment. A majority of DVFS solutions (CPU and GPU) assume the DDR frequency is fixed, either implicitly or explicitly. The commonly used approach is to either fix the DDR frequency at the highest level, or to map unique pairs of CPU/GPU and DDR frequencies, selected statically (i.e., at design time), to a DVFS level (this is the approach QoS-aware DVFS utilizes). To the best of our knowledge, there are only two previous studies that considered DDR frequency scaling for GPGPU applications [66, 68]. Their solutions do not fully consider the interrelationship between the GPU and DDR frequencies for maximum efficiency, and their work is not directly applicable to 3D graphics workloads. The importance of DDR frequency scaling has been touched upon in recent work on mobile SoCs [84, 83], without any detailed analysis or a proposed solution. Our analysis in this chapter showed that DDR frequency scaling has a significant impact on workload performance and energy consumption, and that some assumptions commonly used in previous DVFS work, including Item (b) and Item (c), become void in this context. Furthermore, in Section 4.2 we quantified the impact of the OPP configuration and of energy awareness on the DVFS performance and power profile. The performance/energy profile of a DVFS solution that does not support the full GPU-DDR OPP space can be suboptimal, especially for corner cases that are significantly compute or memory bound; and we showed that energy awareness alone can contribute up to 10% energy savings. Our energy-aware DVFS solution, presented in Chapter 6, addresses these problems by incorporating statistical performance and energy models into the DVFS algorithm. These statistical models allow us to find the set of OPPs that meet the frame deadlines, from all the permutations of GPU and DDR frequencies. Subsequently, the energy model is used to identify the OPP that consumes the least amount of energy within the set that meets the frame deadline. At runtime, we are able to find an optimized OPP that meets both performance and energy consumption requirements, based on the actual workload of the frame. The combination of the performance and energy models allows the algorithm to correctly identify the workload phases in which the intuitive assumptions do not apply; for example, it can identify when running at a higher GPU or DDR frequency is more energy-efficient. It is notable that variations of performance and/or power models have been used in CPU


DVFS techniques before [62, 64, 13]. Performance [66, 68] and power [67] models have also been developed for GPGPU power management. Recently, Pathania et al. described statistical models for mobile GPU gaming workloads [16]. These existing solutions do not sufficiently satisfy one or more of the following requirements: (1) low-latency, fine-grained, individual frame-level models for both performance and energy prediction, which are required for graphics rendering workloads with soft-realtime deadlines; (2) support for DDR frequency scaling and the challenges of a two-dimensional GPU-DDR frequency space (Nugteren et al. [66] and Paul et al. [68] include some partial support for the GPU-DDR frequency space in their solutions); (3) critically accurate prediction of the processing time (and energy consumption) of 3D graphics workloads. Predicting the processing time of 3D graphics workloads is more complicated [70], and the existing solutions for general-purpose CPU and GPGPU applications are typically not relevant for graphics workloads. For more details on the existing solutions cited here, and their limitations, refer to Section 2.3.

Chapter 5

QoS-Aware DVFS

As mentioned in Section 4.4, our QoS-aware DVFS approach falls under the category of DVFS solutions that utilize history-based prediction of the workload of the next frame. Our proposed solution aims to be interchangeable with the utilization-based solutions, specifically the Linux governor algorithms. The main objective of QoS-aware DVFS is to improve responsiveness and provide frame deadline guarantees at an individual-frame granularity. Our proposed solution eliminates a majority of the frame deadline violations that are common in existing governor algorithms, and it aims to achieve such QoS guarantees while maintaining an aggressive low-energy profile. Our QoS-aware DVFS solution has core logic and additional components which, together, address the shortcomings of the existing utilization-based DVFS solutions summarized in Section 4.4. We start by describing the core algorithm logic.

5.1 Algorithm core logic

QoS-aware DVFS differs from utilization-based solutions in terms of both the workload variation tracking and prediction phase and the decision-making phase.

Workload variation tracking phase: Instead of periodically sampling GPU utilization as a measure of the workload variation across all applications concurrently running on the GPU (Figure 5.1.a), QoS-aware DVFS accurately tracks frame rendering times separately for each (active) Embedded Graphics Layer (EGL) context (Figure 5.1.b). Furthermore, QoS-aware DVFS leverages the eglSwapBuffers API calls (Section 2.2.1) for fine-grained, frame-granularity tracking of individual frames within each EGL context. Figure 5.1.b highlights the separation between the frequency selection and data profiling components of QoS-aware DVFS.

Figure 5.1: (a) Utilization-based algorithm: utilization calculation at DVFS execution time. (b) QoS-aware algorithm: separation between statistics tracking (frame rendering time) and DVFS execution.

Our new workload performance tracking mechanism has the following advantages: (a) There is no sampling of the performance counters, reducing inaccuracies in the profiled inputs to the algorithm. We do not need many samples to accurately detect a change in the workload, as is needed with utilization-based sampling; fewer samples translate to lower latency, which in turn can potentially improve both frame deadline guarantees and energy consumption. (b) Accurate frame rendering time tracking, as opposed to using a proxy such as utilization to detect workload changes, alleviates the need for safety margins. Reduced safety margins can increase energy savings, with little or no negative impact on frame deadline guarantees. (c) Workload performance tracking is disentangled from the periodic decision-making component. The decision-making task can run at short time intervals to increase the algorithm's responsiveness to workload changes, without negatively impacting the consistency and quality of the profiled frame rendering times. In DVFS solutions based on periodic utilization sampling, shorter sampling intervals result in fluctuating utilization, even for relatively consistent workloads. (d) Our workload tracking provides fine-grained visibility into individual EGL contexts, which enables the selection of an optimized configuration tailored to an individual context's QoS requirements.

Optimization criteria: The optimization criterion in QoS-aware DVFS is to make sure the combined frame rendering time across active EGL contexts is shorter than the VSYNC-period (the start of a screen refresh cycle is often identified by a Vertical SYNC event (VSYNC); we refer to the time between two VSYNC events, i.e., the frame deadline, as the VSYNC-period from now on). Rendering frames within a VSYNC-period is a key performance requirement for 3D graphics

workloads. Rendering frames faster than that provides no additional benefit. By contrast, utilization-based techniques rely on utilization, which is only a loosely correlated, and often inaccurate, proxy for workload variation, and they have no notion of the required performance. QoS-aware DVFS incorporates knowledge of the required QoS (i.e., FPS = 1/VSYNC-period), as well as the frame rendering time per active EGL context, into the algorithm.

Code 5.1 shows the QoS-aware DVFS core algorithm. During the first step (lines 4-8), frame rendering times across all active contexts are aggregated. Next, the aggregated rendering time is compared against the available deadline. If the aggregated rendering time exceeds the deadline threshold (up-threshold x VSYNC-period), the frequency is increased (line 9). If it is considerably shorter than the deadline, the frequency is decreased (line 11). The configuration parameter down-threshold acts as a hysteresis to avoid constant transitions between two frequencies; we can, however, set it to the same value as the up-threshold if the switching overhead is low. One more condition controls transitions to a lower frequency: after adjusting the aggregated rendering time for the lower frequency, the frame deadlines must still be met. The actual rendering time depends on both the frequency scaling and the workload instruction mix; we assume rendering time scales linearly with the inverse of frequency, which is the worst case for how much performance is lost when the frequency is reduced, so if the deadline is met under this assumption it is guaranteed to be met in practice.

Every time the core of the QoS-aware DVFS algorithm executes, it queries the most up-to-date rendering time information for each EGL context using the avg-frame-time function (Code 5.1, line 5). As shown in Code 5.2 line 14, this function simply returns the average frame time across the historical samples for each context. The actual time-tracking work in QoS-aware DVFS occurs independently and asynchronously from the DVFS algorithm execution. Updates to the frame rendering time are aligned with workload execution phases. The Kernel Mode Driver (KMD) accurately tracks the execution time of the command batches submitted by the UMD, inserting profiling instructions at the start and end of each command batch. Upon completion of a command batch, the KMD updates the accumulated rendering time of the corresponding EGL context (Code 5.2 line 9). When the last command batch in the frame completes, the oldest frame time sample is replaced with the new sample, and we reset the accumulated frame time to start tracking the rendering time of the next frame (Code 5.2 line 12).
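Code 5.1 below only shows the one-level up/down decision. A minimal sketch of the additional down-transition guard just described, under the linear-scaling assumption, could look like the following; freq[], up_thresh, and vsync_period mirror the names used in Listings 5.1 and 5.4 and are assumptions for illustration, not the thesis implementation:

/* Sketch (not one of the thesis listings): check whether dropping one
 * frequency level still meets the deadline, assuming rendering time
 * scales linearly with 1/frequency (the stated worst case). */
extern float freq[];        /* frequency table, indexed by level            */
extern float up_thresh;     /* same threshold used in Listing 5.1           */
extern float vsync_period;  /* frame deadline, e.g., 16.7 ms at 60 FPS      */

int can_step_down(float aggreg_ft, int curr_lvl)
{
    if (curr_lvl == 0)
        return 0;  /* already at the lowest level */
    float scaled_ft = aggreg_ft * freq[curr_lvl] / freq[curr_lvl - 1];
    return scaled_ft < up_thresh * vsync_period;  /* deadline still met */
}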

1   // Goal: fit frames within vsync
2   int dcvs() {
3     // calculate aggregate rendering time
4     for (idx = 0; idx < num_ctxts; idx++) {
5       ftime = avg_frame_time(idx);
6       aggreg_ft += ftime; }
7     // Select the frequency (level index)
8     if (aggreg_ft > up_thresh * vsync_period)
9       level_change = 1;      // go one level up
10    else if (aggreg_ft <
11             down_thresh * vsync_period)
12      level_change = -1;     // go one level down
13    else level_change = 0;
14    return level_change; }

Listing 5.1: QoS-aware DVFS: core algorithm

1   void accum_ft(ctxt, render_time) {
2     ctxt.accum_ft += render_time; }
3   void update_frame_time(ctxt) {
4     tail_idx++;                         // modulo increment
5     ctxt.ft_array[head_idx++] = ctxt.accum_ft;
6     ctxt.accum_ft = 0; }
7   void cmdbatch_completion(ctxt, cmd) {
8     accum_ft(ctxt, cmd.render_time);
9     if (cmd.eof)
10      update_frame_time(ctxt); }
11  float avg_frame_time(ctxt) {
12    return average(ctxt.ft_array); }

Listing 5.2: Rendering time tracking

5.2 Proactively avoiding deadline misses

One of the main problems with history-based DVFS techniques is the potential for mispredictions [78, 82]. Specifically, for mobile 3D graphics workloads with soft real-time deadlines, underestimating the rendering time may result in janks and is not acceptable. To alleviate this problem, we augment our rendering time tracking to detect mispredictions as soon as possible and prevent deadline misses.

1   float avg_frame_time_proactive(ctxt) {
2     return max(average(ctxt.ft_array),
3                ctxt.accum_ft); }

Listing 5.3: Proactive detection of rendering time increase

Code 5.3 shows the modified version of avg-frame-time. When calculating the frame rendering time for an EGL context, we use the maximum of the average frame rendering time of the previously completed frames and the accumulated rendering time of the current, unfinished, frame. By detecting a rendering time increase before the frame rendering completes, we reduce the algorithm's latency in reacting to increased rendering times. Figure 5.1.b provides an example scenario: when DVFS runs the second time, the new frame for context 2 has not finished yet, but its accumulated rendering time has already surpassed that of the previous frame. In this case, the accumulated rendering times of the unfinished frames are used when calculating the aggregated rendering time.

5.3 Single step, multilevel frequency change

1   int lvl_increase_stride(aggreg_ft, curr_lvl) {
2     for (i = curr_lvl + 1; i < num_levels; ++i) {
3       if ((aggreg_ft * freq[curr_lvl] /
4            freq[i]) < up_thresh * vsync_period)
5         break;
6     } return min(i, num_levels - 1) - curr_lvl; }
7   int lvl_decrease_stride(aggreg_ft, curr_lvl) {
8     for (i = curr_lvl - 1; i >= 0; --i) {
9       if ((aggreg_ft * freq[curr_lvl] /
10           freq[i]) > up_thresh * vsync_period)
11        break;
12    } return max(i + 1, 0) - curr_lvl; }

Listing 5.4: Multilevel frequency change

Step-by-step ramp up (and ramp down) of the frequency is one of the factors responsible for the high latency of utilization-based DVFS algorithms [15]. By changing only one frequency level in each interval, these algorithms intend to achieve a balance between performance and energy consumption. Hence, it takes a long time for the algorithm to settle at the appropriate frequency in reaction to a workload change. To accommodate more interactive workloads such as UI, some

algorithms jump to a predefined high frequency in some special cases [12, 8]. For mobile devices with strict power budgets, such solutions come at an unacceptable energy cost. Kim and Kim propose a multilevel frequency change based on the ratio of the change in utilization [15]. This approach has one major shortcoming that limits its applicability: deadline misses due to significant increases in rendering time naturally occur when the GPU is fully utilized and cannot accommodate the increased workload. Ideally, a DVFS algorithm keeps utilization high for more efficient operation; therefore, such sudden increases in the required performance are not detectable by monitoring utilization, which is already capped at close to 100%. In QoS-aware DVFS we accurately track frame rendering times. When the frame rendering time increases significantly, we can calculate the ratio of the rendering time to the VSYNC-period and adjust the frequency by as many levels as required to accommodate the increased workload intensity and meet deadlines. Code 5.4 shows two new functions, lvl-increase-stride and lvl-decrease-stride, which replace lines 9 and 11 in Code 5.1. When the rendering time increases, lvl-increase-stride raises the frequency by the minimum number of levels needed to satisfy the required performance; when it decreases, lvl-decrease-stride lowers the frequency to the lowest level that still satisfies it. Multilevel frequency changes, combined with proactive detection of rendering-time increases and accurate frame-time tracking, considerably reduce the latency of our DVFS solution.

5.4 QoS awareness

To maximize energy savings without negatively impacting the user experience for mobile graphics workloads, it is important to consider the QoS requirement of each application. Specifically, 3D graphics applications have different FPS requirements. Applications might opt to use a lower FPS than the default (display) FPS for various reasons: to reduce energy consumption, or to maintain a smooth user experience when the underlying hardware cannot sustain a higher FPS given the application's computing workload. By accurately tracking rendering times per EGL context, QoS-aware DVFS enables efficient operation tailored to the individual needs of each application. Consider the example frame rendering scenario in Figure 2.4, and assume EGL context 1, which submits the frame rendering work, has FPS_ctxt = FPS_display / 2. Normally, the algorithm in Code 5.4 selects the operating frequency Freq1 that meets the display FPS requirement (i.e., one VSYNC-period). This decision results in unnecessarily high energy consumption given that the

frame deadline is two VSYNC-periods. To rectify this problem and allow longer deadlines for the frames of this context, we multiply the VSYNC-period in the calculations in Code 5.4 by FPS_display / FPS_ctxt, a factor of two in this example. The same algorithm, equipped with the adjusted deadline, now selects Freq3, which is lower than Freq1, and saves some energy without impacting the user experience. In a scenario with multiple concurrent EGL contexts where all contexts share the same FPS (which can be different from the display FPS), we can use the same approach. When concurrent EGL contexts have different FPS requirements, however, there is no single reference FPS to use for the deadline adjustment. Consider the example in Figure 5.2.a: there are two EGL contexts with different FPS requirements, Ctxt1 with 60 FPS and Ctxt2 with 20 FPS. Let P_i = FPS_display / FPS_ctxt_i for EGL context i (P_i equals the frame deadline for context i, expressed as a multiple of the VSYNC-period). We define the scaling factor S_i = 1 / P_i. By multiplying the frame rendering time of context i by the scaling factor S_i, we obtain the average frame rendering time per VSYNC-period for context i. We can then use the same algorithm in Code 5.1 to support multiple concurrent EGL contexts with different FPS; the only difference is that on line 5 of the algorithm, we use the scaled average frame rendering time when calculating the aggregated rendering time across contexts. Figure 5.2.b demonstrates this approach. We can use Figure 5.2.a to understand the idea intuitively. Context 1 renders one frame every VSYNC-period. In each VSYNC-period, context 2 completes a portion of the rendering task for its one frame, which can span three VSYNC-periods (dashed lines denoted by F1Pi, i in {1, 2, 3}). Ideally, F1P1 = F1P2 = F1P3 = F1 / 3; i.e., context 2 spreads its frame rendering evenly across the three VSYNC-periods. Therefore, each frame from context 1 can meet its deadline every VSYNC-period, and each frame from context 2 meets its deadline of three VSYNC-periods.

Figure 5.2: Multicontext multi-FPS rendering.
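As a concrete sketch of how this scaling folds into the aggregation on line 5 of Code 5.1, the fragment below computes the scaled aggregate; the struct layout and helper names are illustrative assumptions, not the thesis implementation:

/* Sketch: per-context deadline scaling, S_i = 1 / (fps_display / fps_ctxt_i). */
struct ctxt { float fps; /* context's required FPS; frame-time history omitted */ };
extern float avg_frame_time(const struct ctxt *c);  /* as in Listing 5.2 */

float scaled_aggregate_ft(const struct ctxt *ctxts, int num_ctxts, float fps_display)
{
    float aggreg_ft = 0.0f;
    for (int i = 0; i < num_ctxts; i++) {
        float p_i = fps_display / ctxts[i].fps;  /* deadline in VSYNC-periods */
        float s_i = 1.0f / p_i;                  /* scaling factor S_i        */
        aggreg_ft += s_i * avg_frame_time(&ctxts[i]);
    }
    return aggreg_ft;  /* compared against up_thresh * vsync_period, as before */
}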

Lemma 1. The operating frequency that satisfies the following inequality ensures that there is sufficient performance to meet the rendering deadlines for all active contexts (not accounting for

potential rendering time prediction errors due to workload changes).

\[ AggregatedFT < VsyncPeriod \tag{5.1} \]
\[ AggregatedFT = \sum_{i=1}^{n} ScaledFT_i \tag{5.2} \]
\[ ScaledFT_i = S_i \times AvgFT_i, \quad \text{for } context_i \tag{5.3} \]

Proof. Consider the example in Figure 5.2.a. To meet all the required deadlines, we must satisfy the following equations:

\[ F1_{C1} + F2_{C1} + F3_{C1} + F1_{C2} < 3 \times VsyncPeriod \tag{5.4} \]
\[ F1_{C1} + F1P1_{C2} < VsyncPeriod \tag{5.5} \]
\[ F2_{C1} + F1P2_{C2} < VsyncPeriod \]
\[ F3_{C1} + F1P3_{C2} < VsyncPeriod \]

We start with Equation (5.4). Our QoS-aware DVFS algorithm utilizes history-based prediction: it estimates the rendering time of the following frame(s) based on the average of the last few frames (denoted by AvgFT_i for context i). By replacing Fj_{C1}, j in {1, 2, 3}, with AvgFT_1 in Equation (5.4), we arrive at the following:

\[ F1_{C1} + F2_{C1} + F3_{C1} + F1_{C2} = 3 \times AvgFT_1 + F1_{C2} < 3 \times VsyncPeriod \]
\[ \Rightarrow AvgFT_1 + \frac{F1_{C2}}{3} < VsyncPeriod \tag{5.6} \]

If we extend Equation (5.6) to more than two contexts, we arrive at Equation (5.1). To validate Equation (5.5), we substitute Fj_{C1} with AvgFT_1. Additionally, we divide the rendering time of a single frame of context 2 into three equal parts, each with rendering time F1_{C2}/3. Substituting F1Pi in each inequality in Equation (5.5) with this value, we arrive at the same inequality as in Equation (5.6). Therefore, Equation (5.1) satisfies all of the requirements above.

Corollary 1.1. In a system with fine-grained (cycle-level) preemption capabilities, we can guarantee all frame deadlines by (a) satisfying the inequality in Equation (5.1), and (b) assigning higher priority to higher-FPS EGL contexts.


We already showed that satisfying item (a) above guarantees there is enough performance to meet the deadlines. Now consider the previous example of two contexts with 60 FPS (context 1) and 20 FPS (context 2), and assume the rendering time of frame 1 of context 2 is longer than a VSYNC-period and is composed of a single, long command batch. When the rendering of this frame starts, it can starve the next frame of context 1, which needs to complete within the next VSYNC-period. That is, while the performance is sufficient to finish all the rendering work within three VSYNC-periods, we might still miss the deadline for one of the frames of context 1. Assuming a fine-grained preemption capability, item (b) above guarantees that frames from context 1 are not starved and receive the rendering cycles needed to meet their respective deadlines. Fine-grained (cycle-level) preemption on a GPU is typically costly due to the large amount of state that must be saved and restored, so it should be reserved for critical cases. In QoS-aware DVFS we track the rendering time at command-batch granularity: we know the total number of command batches per frame, and we can track the maximum and average rendering times per command batch, as well as the size of the rendering command batches that are submitted and executed concurrently. To account for the lack of (fine-grained) preemption, we can update Equation (5.3) as follows:

\[ ScaledFT_{Ci} = \max\big(S_{Ci} \times AvgFT_{Ci},\; MaxAtomicRender_{Ci}\big) \tag{5.7} \]

MaxAtomicRender in this equation is the maximum rendering time of a group of command batches that is submitted and executed together; we assume no other EGL context can perform rendering while this atomic unit is executing. The equation accounts for the possibility that the largest atomic execution unit in context 2 is larger than the average rendering time per VSYNC-period for the frame in context 2. Without fine-grained preemption, it is not always possible to guarantee deadlines for all workload submission patterns, and there are different approaches for tracking the largest atomic execution unit. Hence, when fine-grained preemption is not available, there can be corner cases where our approach underestimates the rendering work in a VSYNC-period. As with the prediction error, we can hide some of these miscalculations and uncertainties with higher-level (OS) approaches, such as triple buffering.

Identifying FPS for an EGL context: So far we have described how QoS-aware DVFS utilizes knowledge of an application's FPS to make more informed decisions. Currently, there is no OpenGL ES or EGL API in the Android OS for applications to specify their desired FPS rates. There are different approaches that applications use to control their FPS, including the following: (a) Running as fast as possible.


(b) Using Android Choreographer callback events. (c) Using a sleep call in the rendering loop. In case (a), the application submits as many frame rendering workloads as it can before the eglSwapBuffers API call blocks execution; in this case, the display FPS dictates the average FPS rate of the application. The second approach is a more deterministic way of submitting work that aligns frame workload submissions with the display VSYNC; an application can control its FPS by selectively skipping work submission for a subset of the callbacks. In none of these approaches is the UMD or KMD aware of the application's FPS. We propose two solutions to this issue:

1. Utilizing the GL_EXT_debug_marker group of OpenGL ES APIs.
2. Estimating the FPS rate using End Of Frame (EOF) event markers.

Utilizing the existing OpenGL ES extensions: The first solution requires an application to use one of the existing OpenGL ES extension APIs, GL_EXT_debug_marker, to explicitly communicate its FPS to the UMD. This API is intended for debugging and allows an application to insert debug marker strings into the command stream. This solution, therefore, requires modification of the application. Estimating FPS using EOF events: In QoS-aware DVFS we track EOF events, which are controlled by the application. When a new EGL context is created, we start by assuming the maximum FPS, i.e., the display FPS. If the actual FPS for Ctxt_i is less than the display FPS, we can estimate it using N EOF marker samples as follows:
\[ FPS = \frac{N}{TimeDistance(EOF_N, EOF_1)} \tag{5.8} \]
The time between two consecutive frame submissions is subject to variations in the rendering control-loop mechanisms described above, but on average it is expected to equal the inverse of the FPS. Therefore, in Equation (5.8) we use N samples to calculate the average time between frame submissions. There are two potential scenarios in which this equation results in an estimated FPS lower than the maximum (display) FPS: (a) the GPU performance is the bottleneck, or (b) the application limits submissions to control the FPS rate. Only in the latter case should we assume a lower FPS for the corresponding context in our DVFS algorithm, so we need to distinguish whether the application requires a lower FPS rate or is limited by the GPU performance. We can accept the estimated FPS only if, when running at the highest GPU frequency, the aggregated rendering time is less than a VSYNC-period. Practically speaking, a mobile GPU is designed not to run at the maximum performance for a long time since it will encounter thermal

limitations. Therefore, this solution works for most cases. For those cases where the highest frequency does not meet deadlines when running at 60 FPS, we can look at the GPU utilization: if GPU utilization is close to 100%, we estimate the FPS to be 60; on the other hand, if there are long idle periods on the GPU, we accept the estimated FPS based on the EOF markers.

Table 5.1: Applications used for evaluation
App            # frames     App         # frames
EgyptHD        6651         Chrome      5900
Hover          2839         TheChase    1018
AsianFusion    3050         AirDerby    1660
Launcher(UI)   1771         Trex        3298
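As a small illustration of Equation (5.8), the estimate can be computed directly from the buffered EOF timestamps; the function and array names below are hypothetical, not prescribed by the thesis:

/* Sketch of Equation (5.8): FPS = N / TimeDistance(EOF_N, EOF_1).
 * eof_ts[] holds the last n end-of-frame timestamps, in seconds. */
float estimate_fps(const double *eof_ts, int n)
{
    if (n < 2)
        return -1.0f;                          /* not enough samples yet */
    double span = eof_ts[n - 1] - eof_ts[0];   /* TimeDistance(EOF_N, EOF_1) */
    return (span > 0.0) ? (float)(n / span) : -1.0f;
}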

5.5 Evaluation

We evaluate the effectiveness of our QoS-aware DVFS solution, comparing it against two fixed-frequency configurations (624 MHz, level 7 out of 7, and 401 MHz, level 4), as well as against Qualcomm's proprietary DVFS algorithm (msm-adreno-tz).

5.5.1 Implementation

We have implemented our QoS-aware DVFS solution on a Qualcomm Snapdragon 820 platform. Our setup runs on Android M (6.0), and utilizes 7 different DVFS frequency levels. In our implementation, QoS-aware DVFS runs in the KMD. To track rendering time for command batches at runtime, we used command stream instrumentation similar to the approach we used for our profiling framework demonstrated in Figure 3.1 and Table 3.1. The difference between the two approaches is that in QoS-aware DVFS the runtime instrumentation is done in KMD. We do not insert wait-for-idles into the command stream; doing so can negatively impact performance. The KMD accumulates the rendering time of individual command batches per context until the EOF marker is observed. At that time, we update the frame rendering time history. To ensure consistent rendering of the exact same frames every time we run our applications, we use the API capture and playback tool described in Section 3.4.1. For our energy profiling, we utilize the power measurement equipment setup shown in Figure 3.3. We start power profiling before the application starts and stop the profiling after the application finishes. To accurately mark the start and end of the application execution on the power waveform, we insert GPIO markers at the beginning and end of our application execution, similar

to the approach described in Section 3.3 (refer to Figure 3.4). Note that we do not need to correlate performance and power data, since we do not need to extract per-frame energy consumption.

5.5.2 Applications

Previous work on DVFS for (mobile) graphics has most commonly focused on 3D games. We have utilized eight different OpenGL ES applications in our evaluation, and run them on a real mobile SoC device (see Table 5.1). These applications include a mobile game (Hover), a graphical user interface (Android Launcher), a web browser use case (Chrome-webGL), and EgyptHD and Trex from the GLBenchmark suite. The difference between widely accepted benchmarks such as GLBenchmark and regular games is that the benchmarks incorporate a predefined set of inputs that result in predetermined scene transitions, for consistency of the profiling. We have also included three demo applications that exercise some of the advanced features of the OpenGL ES 3.2 APIs (TheChase, AirDerby, AsianFusionRestaurant).

5.5.3 Results

Frame-deadline Violations: We start our analysis by considering frame-deadline violations. In a mobile device with a 60 FPS display, the GPU has 16.7 ms to render a new frame (i.e., the frame deadline). Here, we calculate the number of frames for which the rendering time exceeds the frame deadline1. Figure 5.3.b shows the percentage of frames that miss the 16.7 ms deadline (V/FrameCount, where V stands for the number of deadline violations). Figure 5.3.a shows the normalized frame deadline violations (we use the notation <AppName>_<FPS> in the charts; Egypt and Trex have 20 and 40 FPS configurations in addition to the display's default 60 FPS). These values are normalized to the fixed frequency (401

MHz) configuration using the following equation: V_i^{Norm} = (V_i - V_{401MHz}) / FrameCount + 1. From Figure 5.3.b, our first observation is that some applications have a high percentage of deadline violations even when running at the highest frequency (624 MHz). There is less opportunity for energy savings through frequency scaling for these applications; instead, our goal is to ensure the DVFS algorithm keeps violations as close to the minimum (i.e., to running at the highest frequency) as possible. On the other end of the spectrum, some applications, such as Hover, meet all frame deadlines at the lower frequencies.

1 Our definition of frame deadline violations assumes triple buffering is not utilized. VR and AR workloads have virtually zero tolerance for a frame drop, and they cannot use triple buffering due to low motion-to-photon latency requirements.

Figure 5.3: (a) Normalized and (b) percent of frame deadline violations.

(Normalized violations at the 624 MHz frequency for the applications in this category are close to one in Figure 5.3.a.) It is important that a DVFS algorithm does not run at an unnecessarily high frequency in such cases, since doing so simply wastes energy. From Figure 5.3.a, it is clear that our QoS-aware DVFS algorithm maintains deadline violations within 2% of ideal for most of the use cases. The only exceptions are Chrome-webGL and TheChase. Inspecting the characteristics of these two applications, we find that: (i) the percentage of frames that exhibit large increases (i.e., more than 5%) in frame-to-frame rendering time is high, and (ii) the number of frames with rendering times within 5% of the deadline is high (i.e., they will likely miss the deadline if rendering time increases). More than 17% and 5% of the frames exhibit large frame-to-frame rendering time increases in Chrome-webGL and TheChase, respectively. At the same time, more than 5% and 27% of the frames have rendering times within 5% of the deadline in Chrome-webGL and TheChase, respectively. The combination of these two characteristics increases the number of deadline violations as compared to the ideal scenario (highest frequency). For all other applications, one or both of these parameters are lower. Our QoS-aware DVFS solution significantly outperforms Qualcomm's msm-adreno-tz: on average, it decreases deadline violations by 28% compared to msm-adreno-tz (the average is calculated as the geometric mean of the normalized values and the arithmetic mean of the absolute values). Moreover, unlike msm-adreno-tz, it performs consistently well across all the applications; for TheChase, it reduces violations compared to msm-adreno-tz by more than 72%.

Energy Consumption: Notice that running at 624 MHz is superior to 401 MHz in terms of deadline guarantees. Inspecting the normalized energy values in Figure 5.4.a, we can see the other side of the equation: running at 624 MHz consistently consumes more energy, 24% and 20% more in the worst and average cases, respectively. The two DVFS-based algorithms closely track or improve upon the 401 MHz results. For AirDerby, AsianFusion, TheChase, and Trex, the DVFS algorithms consume more energy than the baseline; all of these workloads encounter a much higher rate of deadline violations at 401 MHz, and the DVFS algorithms eliminate many of those violations at the cost of higher energy consumption. On average, msm-adreno-tz and QoS-aware DVFS consume only 2% and 5% more energy than the 401 MHz baseline, respectively, whereas running at a fixed 624 MHz increases energy by more than 20%. Except for EgyptHD and TheChase, QoS-aware DVFS matches or reduces the energy consumption as compared to msm-adreno-tz. For EgyptHD and TheChase, our QoS-aware DVFS selects higher frequencies to achieve a 72% and 99% reduction in deadline violations over msm-adreno-tz (our scheme achieves virtually zero violations for EgyptHD).

Figure 5.4: Normalized (a) energy and (b) EV2P.

Energy-DeadlineViolations-Product: For battery-powered mobile GPUs that run graphics workloads, both performance (frame deadline guarantees) and low energy consumption are key metrics. The literature on low-power design techniques often uses the Energy-Delay Product (EDP) to combine these two factors into one metric in order to compare optimization techniques. For use cases where performance is more important than energy consumption, the Energy-Delay-Squared Product (ED2P) is commonly used. However, these metrics are not suitable for real-time workloads such as graphics rendering. Consider a baseline configuration that meets all the frame deadlines. If an optimization technique increases the energy consumption by 1.1x and decreases the frame rendering time by 20% compared to the baseline, we obtain EDP = 1.1 * 0.8 = 0.88, which appears to be an improvement; in reality, the new technique consumes more energy without any improvement in deadline guarantees. To incorporate this fact, we replace the delay component in these metrics with our normalized deadline violations, and use the notations EVP and EV2P for the new metrics. Figure 5.4.b shows that QoS-aware DVFS improves EV2P versus msm-adreno-tz by about 19% on average, and by up to 84% for TheChase. QoS-aware DVFS provides up to an 83% improvement in EV2P over the baseline for TheChase, and an overall improvement of 27%.

QoS Awareness: One premise of our QoS-aware DVFS algorithm is to provide more efficient operation for applications with varying QoS requirements, specifically varying FPS rates. EgyptHD and Trex have an FPS configuration parameter. At 60 FPS, Trex does not offer many opportunities for frequency scaling; running at 624 MHz, the workload experiences 80% frame deadline violations. But when we run at 40 FPS, the deadline violations drop to zero at 624 MHz. At 40 FPS, QoS-aware DVFS achieves 1.29% deadline violations, compared to 2.27% with msm-adreno-tz; we also reduce energy consumption by about 4%, resulting in an 8% improvement in EV2P over msm-adreno-tz. We observe similar behavior for EgyptHD: while there is significant opportunity for frequency scaling at 60 FPS, at 40 and 20 FPS we can consistently run at lower frequencies and meet all frame deadlines. While the efficiency values of QoS-aware DVFS and msm-adreno-tz are close at 20 and 40 FPS, it is QoS-aware DVFS that takes full advantage of the frequency scaling opportunities at 60 FPS and vastly outperforms msm-adreno-tz. Overall, QoS-aware DVFS demonstrates high consistency across various FPS settings, whereas the effectiveness of msm-adreno-tz varies considerably with FPS. Further investigation of the QoS handling of our proposed algorithm, including multi-FPS concurrent use cases, will be pursued in future work.
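Written out explicitly (a restatement of the definitions above, by analogy with EDP and ED2P, where E denotes the normalized energy and V the normalized deadline violations):

\[ EVP = E \times V, \qquad EV^2P = E \times V^2 \]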


5.6 Summary

Utilization-based DVFS schemes suffer from high latency when reacting to workload changes. As a result, they can incur an increased number of frame deadline violations, which is unacceptable for many graphics workloads such as VR and AR. We have shown that by accurately tracking frame rendering times, our QoS-aware DVFS can resolve many of these problems. QoS-aware DVFS reduces deadline violations to within 2% of the ideal configuration (running at the highest frequency), on average. Combined with greater energy savings, our QoS-aware scheme considerably improves EV2P compared to both fixed-frequency configurations and Qualcomm's msm-adreno-tz DVFS solution. QoS-aware DVFS is also better equipped to handle varying QoS requirements across multiple applications: unlike msm-adreno-tz, our scheme performs consistently well across multiple FPS settings.

Chapter 6

Energy-Aware DVFS

In this chapter, we present the details of our proposed integrated GPU-DDR solution, energy-aware DVFS. The DVFS policy for energy-aware DVFS is the same policy we adopted throughout this thesis, described in Section 4.2.1. Consequently, energy-aware DVFS maintains the low-latency, individual-frame-granularity operation of QoS-aware DVFS (Chapter 5). The primary advantages of energy-aware DVFS over QoS-aware DVFS are realized through the use of fine-grained statistical performance and energy models. These models enable us to utilize the full two-dimensional GPU-DDR frequency space and, from the whole OPP space, select an OPP that meets the frame deadline and is among the most energy-efficient OPPs. Therefore, one significant contribution of our work centers around the performance and energy models described throughout the rest of this chapter. To evaluate the accuracy and effectiveness of our statistical models and our DVFS solution, in Section 6.3 we compare the performance and energy profile of our solution against the ideal implementation of our DVFS policy using oracle performance and energy models (Section 4.2.1), as well as against a naïve energy model. To implement energy-aware DVFS, we use a modified version of the ideal DVFS algorithm shown in Listing 2, plugging our statistical performance and/or energy models into the algorithm to implement the MeetsDeadline(...) and GetEnergy(...) functions. Instead of a frame number, the inputs to our statistical models are the GPU frequency, the DDR frequency, and workload performance counter data. Figure 6.1 contrasts our proposed energy-aware DVFS solution (b) with the existing approach of utilization-based DVFS techniques (e.g., DVFS governor algorithms [12]). The input to these existing solutions is a measure of the intensity of the workload (in this case utilization; other variations use IPC or the memory-to-compute ratio, to name a few) as a proxy for the required performance, i.e., the draw time, and the output is an incremental change in the frequency.


Our proposed solution identifies a more efficient GPU-DDR frequency combination.

Listing 2 The energy-aware DVFS
Input: PerfCntrs
Output: SelectedOpp
 1: ViableOpps <= [ ]
 2: for opp in AllOpps do
 3:   if MeetsDeadline(PerfCntrs, opp) then
 4:     ViableOpps.add(opp)
 5:   end if
 6: end for
 7: Energies <= [ ]
 8: for opp in ViableOpps do
 9:   Energy <= GetEnergy(PerfCntrs, opp)
10:   Energies.add(Energy)
11: end for
12: OppIndex <= LowestEnergy(Energies)
13: SelectedOpp <= ViableOpps[OppIndex]

Figure 6.1: Components of (a) existing utilization-based DVFS algorithms, and (b) the proposed DVFS solution.

As mentioned in Section 4.4, our proposed solution in this chapter relies on history-based predictors to predict the upcoming frame workload. Using our performance and energy models, our focus is to predict the best OPP for the given workload.

6.1 Performance model

The main task of our performance model is to identify all the OPPs that provide the required performance to meet the frame deadline. To satisfy this requirement, we explore two approaches for our statistical performance models:


1. A model to predict if an OPP meets the frame deadline, based on workload performance counter data.
2. A model to predict the rendering time of a frame at a given OPP. We can then use the predicted rendering time to identify if the frame deadline is met at that OPP.

Predicting the actual rendering time of the GPU has some advantages over directly predicting whether an OPP meets the frame deadline. For example, we can build a model that predicts the average power consumption and obtain the energy consumption by multiplying the predicted average power by the predicted rendering time. Accurately predicting the rendering time for the complex computing pipeline of a GPU is challenging. In Section 2.2, we provided an overview of the GPU hardware pipeline, the programming model, and the software pipeline based on the OpenGL ES 3D graphics programming API. As described in that section, in a tile-based rendering mobile GPU, which is the target of our work, a frame draw involves tiling, followed by unresolve (optional), rendering, and resolve operations repeated for each tile. Furthermore, the rendering stage is dominated by vertex and fragment processing, as well as texture operations. Ideally, these operations work completely in parallel on subsets of the vertices and fragments. We can calculate the processing time of the rendering phase by identifying the bottleneck among these stages; the processing bottleneck dominates the total processing time of the rendering stage. Figure 6.2.a demonstrates how we can combine the various processing stages to estimate the total frame draw time.
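One illustrative way to write down this combination (a sketch of the idea only; the per-stage times T are assumed symbols, and this is not the model ultimately built in Section 6.3):

\[ T_{frame} \approx T_{tiling} + \sum_{t \in tiles} \Big[ T_{unresolve}(t) + \max\big(T_{vertex}(t),\, T_{fragment}(t),\, T_{texture}(t)\big) + T_{resolve}(t) \Big] \]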

Figure 6.2: (a, b) Relationship between software pipeline stage operations and performance/energy consumption. (c) Consolidated hardware pipeline of a mobile GPU with unified shaders, centered around the major processing blocks performing computation and data transfer.

To estimate the frame draw time with the approach described in Figure 6.2.a, we need to profile the processing time of the individual stages in this figure. In our evaluation setup, we only have profiling data for the whole frame draw time; we do not have fine-grained, per-stage processing data. Alternatively, we can categorize these operations based on the hardware blocks that perform

the operations, as demonstrated in Figure 6.2.c, to create our performance models. We build our performance models by characterizing the amount of work each hardware block in Figure 6.2.c performs for a given frame draw. The color of each software pipeline stage in Figure 6.2.a matches the color of the hardware block in Figure 6.2.c that carries out its operations. For example, the green color of the tiling, vertex shading, and fragment shading stages indicates that all of these stages use the unified shader processor shown in Figure 6.2.c. The darker color of the fragment shading stage reflects the fact that fragment shading operations typically dominate the computation time when rendering a frame. The purple color in all the software pipeline stages shows the data transfer to/from DDR, which is required for all of the operations. Finally, the blue color corresponds to the TP, which executes texture sampling operations.
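To make the role of such a model concrete, the sketch below shows how a draw-time predictor over these hardware-block counters could plug into the MeetsDeadline() check of Listing 2; the struct layout and function names are assumptions for illustration, not the models developed in Section 6.3:

/* Sketch: a statistical draw-time predictor feeding the deadline check. */
struct frame_cntrs { double alu_cycles, tp_ops16, tp_ops32, mem_read, mem_write; };
struct opp { double gpu_freq, ddr_freq; };   /* one GPU-DDR operating point */

/* Stand-in for whichever statistical draw-time model is eventually chosen. */
extern double predict_draw_time(const struct frame_cntrs *f, const struct opp *o);
extern double vsync_period;                  /* frame deadline, e.g., 16.7 ms */

int meets_deadline(const struct frame_cntrs *f, const struct opp *o)
{
    /* An OPP is viable if the predicted draw time fits within the deadline. */
    return predict_draw_time(f, o) < vsync_period;
}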

6.2 Energy model

To estimate the energy consumption, the software pipeline model shown in Figure 6.2.b has the same components as the frame draw time model. The only difference is that for energy consumption, all of the components are additive. Similar to the performance model, we use the hardware pipeline model in Figure 6.2.c to build predictive models for the energy consumption. The hardware blocks in this figure, and their operations, encompass the majority of the activity in the GPU for a 3D graphics frame rendering workload, which in turn corresponds to the majority of the dynamic energy consumption. These hardware blocks also constitute the majority of the GPU die area, which correlates with the static energy consumption. Based on Figure 6.2, we can expect to build more accurate energy models than performance models, since the components of the energy consumption are additive.

Problem formulation and design space: The main requirement for our energy model is to identify the best OPP from the set of OPPs that meet the frame deadline. There are different approaches to achieve this goal. For example, we can use classification techniques to map frame workloads to an optimized OPP at runtime. We can also use logistic regression to identify whether OPP1 is more energy-efficient than OPP2; then, by iterating over the viable OPPs, we could identify the most energy-efficient OPP. Using linear regression models to predict the energy consumption at each OPP is another approach: we use the model at runtime to find the most energy-efficient OPP (the OPP sequencer shown in Figure 6.1.b uses such a model, iterating over the OPPs to find the most efficient one). Linear regression has practical advantages during the model training phase, and also enables additional use cases. For

example, it allows us to create a new DVFS policy that provides maximum performance within a given power/energy envelope. In this work, we only explore linear regression for our energy models. In designing our energy models, we use the following criteria to assess the quality of the different models we explore: (a) Fidelity: relative prediction accuracy across OPPs; high-fidelity models can reliably identify the more energy-efficient OPP, even if their absolute prediction accuracy at individual OPPs is lower. (b) Simplicity: simple computations with low latency, runtime, and energy consumption overhead; simplicity also makes it easier to automate the model training process. (c) Scalability: the ease of adding (removing) OPPs to (from) the model. (d) Flexibility: the ease of updating the models for new or future-generation chipsets. Our DVFS solution controls the two-dimensional OPP space defined by all combinations of the GPU and DDR frequencies, and our energy model predicts the total energy consumption of the graphics (GFX) and DDR rails. We explore the four configurations of our design space outlined in Table 6.1.a. For UoppUpr and UoppSpr, the configurations that have a single model across OPPs, the GPU and DDR frequencies/voltages must be factored into the energy model as predictor variables. Conversely, for SoppUpr and SoppSpr, only workload event performance counters are used as predictors; in these two configurations, we generate different models for each OPP (or the same model with different coefficients).

Table 6.1: Design space of our energy models. U: Unified, S: Separate, PR: Power Rail. Si: Simplicity, Fi: Fidelity, Sc: Scalability, Fl: Flexibility. ✓: positive, ✗: negative, -: neutral.

(a)
OPPs \ PRs             GFX+DDR combined    GFX, DDR separate
Unified across OPPs    UoppUpr             UoppSpr
One model per OPP      SoppUpr             SoppSpr

(b)
           Si    Fi    Sc    Fl
UoppUpr    ✓     ✗✗    -     -
UoppSpr    ✓     ✗     -     -
SoppUpr    ✓     ✓     ✓     ✓
SoppSpr    ✓     ✓✓    ✓     ✓

Table 6.1.b compares these four options in our design space based on the aforementioned criteria. While we provide a high-level discussion of our design space configurations here, Section 6.3 provides more details of the models and quantitative results that support the assessment shown in Table 6.1.b. Simplicity: we consider all four configurations straightforward. The models

are linear regression models with few predictor variables, and the computations are multiplies and adds. Fidelity: based on our evaluations, separate models for graphics (GFX) and DDR energy (right column) provide more accurate predictions. Additionally, having a dedicated model per OPP provides even greater improvement over a unified model across OPPs. The latter case can be intuitively explained by the nonlinear influence voltage and frequency have on the energy consumption: there is a quadratic relationship between the dynamic power and the operating voltage, and a cubic relationship between the static power consumption and the operating voltage; there is also a nonlinear relationship between energy consumption and the operating frequency. In a single-frequency domain, dynamic energy consumption is assumed to be independent of the frequency (since the number of clock cycles for the computation is constant; see Equations (2.8) and (2.9)). In a two-dimensional GPU-DDR frequency space, this assumption is no longer valid. All of these nonlinear factors introduce inaccuracy into a unified energy model across OPPs (UoppUpr and UoppSpr). Scalability: using the definition of scalability discussed earlier, it is clear that the per-OPP models (SoppUpr and SoppSpr) are more scalable. Adding a new OPP is as simple as training a model for the additional OPP, without any impact on the models for the other OPPs. Note that for the unified models across OPPs (UoppUpr and UoppSpr), in theory we could add an OPP to the system and use the already-trained models without retraining them, i.e., extrapolate to OPPs the model was not trained for; but that would potentially further reduce the accuracy of the unified models. That is why we have marked UoppUpr and UoppSpr as neutral in terms of scalability.

Flexibility: SoppUpr and SoppSpr are more flexible because the simpler nature of per-OPP models (i.e., their linear performance relationship) makes it easier to automate the model generation process. It is more likely that the models can be retrained automatically, calculating new coefficients for a new generation of hardware while maintaining consistently satisfactory prediction accuracy. Unified models across OPPs are more likely to need human intervention to update the models for new-generation hardware. They are also more susceptible to increased prediction error due to new optimization techniques that apply to some but not all OPPs.

Energy vs. average power: Another design choice is whether our power/energy models should predict energy consumption or average power consumption. Our DVFS solution requires the predicted energy consumption to make OPP selection decisions. If our model predicts average power consumption, we calculate energy consumption by multiplying the predicted average power by the total draw time. Thus, a power model cannot be combined with a logistic regression-based

performance model that only predicts if an OPP meets the frame deadline. Additionally, the accuracy of the energy prediction then depends on both the power model and the performance model predicting the total draw time.
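To illustrate where such an energy model sits in Listing 2, the sketch below implements GetEnergy() in the SoppSpr style, with one linear model per OPP and per power rail over the workload counters of Table 6.3; the coefficient layout and names are assumptions for illustration, not the predictors derived in Section 6.3:

/* Sketch: per-OPP, per-rail linear energy model (SoppSpr style). */
struct frame_cntrs { double alu_cycles, tp_ops16, tp_ops32, mem_read, mem_write; };
struct opp_energy_model { double gfx[6]; double ddr[3]; };  /* fit offline, one per OPP */

double get_energy(const struct frame_cntrs *f, const struct opp_energy_model *m)
{
    double gfx_e = m->gfx[0]
                 + m->gfx[1] * f->alu_cycles   /* shader (ALU) activity */
                 + m->gfx[2] * f->tp_ops16     /* 16-bit texture ops    */
                 + m->gfx[3] * f->tp_ops32     /* 32-bit texture ops    */
                 + m->gfx[4] * f->mem_read     /* L2$-to-DDR reads      */
                 + m->gfx[5] * f->mem_write;   /* L2$-to-DDR writes     */
    double ddr_e = m->ddr[0]
                 + m->ddr[1] * f->mem_read
                 + m->ddr[2] * f->mem_write;
    return gfx_e + ddr_e;   /* predicted GFX + DDR energy for the frame */
}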

6.3 Experimental setup and evaluation results

Predictor variables: The set of workload performance counters we use as predictor variables for our models is another element of our design space. Generally, a combination of high-level knowledge about the behavior of the system being modeled and statistical analysis techniques is used to come up with the right set of predictors. If important variables are missing from the model, the model can produce inaccurate predictions for a range of inputs. If too many variables are used, there is a risk of overfitting the model to the training data, and runtime predictions can then suffer from high prediction error. We also want to minimize the number of predictors to keep the runtime overhead of the models low. As stated earlier, most of the processing involved in drawing a 3D graphics frame is performed in the Shader Processors (SPs) (specifically the ALUs in the SPs) and TPs, and involves data transfers to/from system memory (DDR) (refer to Section 6.1 and Figure 6.2.c). We have identified a number of hardware workload performance counters to quantify the workload of these three components. The event counters we have selected track workload-dependent events that are independent of the operating frequency. These counters are listed in Table 6.3.

Table 6.2: Applications used for evaluation. #OPPs = 63 (9 DDR x 7 GPU frequencies); #data points = 1,040,697 (#all frames x #OPPs)
Application    ChromeWebGL   Hover   EgyptHD   Trex   LauncherAndroidO   AirDerby
Frame count    5900          2839    2494      1855   1771               1660

L2ReadSP, L2ReadTP, and L2ReadVFD quantify the read requests issued to the L2$ by the three major pipeline stages involved in vertex, fragment, and texture processing operations. These parameters are a superset of MemRead, MemWrite, and DataTotal, which quantify the memory interactions from the L2$ to DDR. To model the performance and energy consumption of the data transfer, we use MemRead and MemWrite; the L2$ counters help describe some of the energy consumption on the GPU power rail. Requests to the L2$ are 128 bits wide and requests from the L2$ to DDR are 256 bits wide. By counting ALUCycles, we quantify ALU operations. Finally, the combination of the TPOps16Bit and TPOps32Bit counters quantifies the texture processing operations.
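As a small illustration, the request widths stated above translate the counters into approximate bytes moved; the helper functions below are hypothetical, not part of the thesis tooling:

/* Sketch: approximate data volumes from the request counters of Table 6.3. */
double l2_bytes_read(double l2_sp, double l2_tp, double l2_vfd)
{
    return (l2_sp + l2_tp + l2_vfd) * 16.0;   /* 128-bit requests = 16 bytes each */
}
double ddr_bytes_moved(double mem_read, double mem_write)
{
    return (mem_read + mem_write) * 32.0;     /* 256-bit requests = 32 bytes each */
}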


Table 6.3: List of the performance counters used to profile workload parameters.

          Counter      Description
Compute   ALUCycles    cycles ALUs in SPs have been active
          TPOps16Bit   16-bit bilinear texture operations
          TPOps32Bit   32-bit bilinear texture operations
Data      L2ReadSP     read requests to L2$ from Shader Processors
          L2ReadTP     read requests to L2$ from Texture Processors
          L2ReadVFD    read requests to L2$ from the Vertex Fetch and Decode unit
          MemRead      read requests from L2$ to DDR
          MemWrite     write requests from L2$ to DDR
          DataTotal    total read and write requests from L2$ to DDR

Platform: For our evaluation we have used the Qualcomm Snapdragon 820 SoC. Our setup runs on Android M (6.0), and uses 7 GPU frequencies and 9 DDR frequencies, for a total of 63 OPPs. For our frame-level (fine-grained) performance profiling, we have developed a tool that instruments the command stream to sample performance counters at specific points within the GPU pipeline. We perform our power profiling using power acquisition hardware from National Instruments. Applications: To obtain our profiling data, we have run six different OpenGL ES applications on this platform (see Table 6.2). As opposed to previous work in this area that focused on mobile 3D games, we consider a wider range of applications, including a mobile game (Hover), a graphical user interface (Android Launcher), a web browser use case (Chrome-webGL), and EgyptHD and Trex from the GLBenchmark suite. The difference between widely accepted benchmarks such as GLBenchmark and regular games is that the benchmarks incorporate a predefined set of inputs that result in predetermined scene transitions, for consistency of the profiling. We have also included a demo application, AirDerby, that specifically utilizes advanced features of the OpenGL ES 3.2 APIs. Profiling data: We run these applications at each and every OPP. Our profiling data is composed of the frame draw time, the energy consumption of the GFX and DDR power rails, and the values of the performance counters listed in Table 6.3 at the start and end of each frame (we only need to profile these counters at one OPP, as they count workload-dependent parameters that do not change with frequency). All this data is captured for every frame, across all applications. To extract per-frame energy consumption, we map the start and end of the frame from our profiled performance time

domain to the power acquisition time domain. We do the mapping by creating a sync point between the draw time and energy consumption: at the sync point, we toggle a GPIO pin and sample the performance counter used for profiling frame rendering time. By finding the GPIO marker in the power profiling time domain, we can align the two time domains and extract the energy consumption from the start to the end of a frame, as well as from the end of one frame to the start of the next (for idle time and energy estimations). Training and validation: We have employed two common practices for validating statistical models: (1) splitting the data into training (70%) and test (30%) data sets through random selection, and (2) 10-fold cross-validation. First, we train models across our design space configurations using the training data set and present results for those configurations on the test data set. After identifying the most accurate configurations, we perform 10-fold cross-validation on those selected configurations. For most of our analysis, which involves running DVFS algorithms on our data, we need the data points for every OPP for a given frame. Therefore, rather than randomly selecting samples from our 1,040,697 data points, we randomly select frame indices for each application. If a frame belongs to the training data set, all data points for that frame across all OPPs (a total of 63) are added to the training data set.
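A minimal sketch of this frame-grouped split (the helpers, set handles, and data layout are hypothetical; only the frame-level grouping follows the text):

#include <stdlib.h>

#define NUM_OPPS 63   /* 9 DDR x 7 GPU frequencies */

extern void add_sample(int to_train, int frame_idx, int opp_idx);  /* hypothetical */

void split_by_frame(int num_frames)
{
    for (int f = 0; f < num_frames; f++) {
        int to_train = (rand() % 100) < 70;    /* roughly 70% of frames train */
        for (int opp = 0; opp < NUM_OPPS; opp++)
            add_sample(to_train, f, opp);      /* all 63 OPP samples follow the frame */
    }
}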

6.3.1 Correlation analysis

In this section, we assess the strength and linearity of the relationship between the workload event performance counters and the draw time and energy consumption. We also look at the relationship between the DDR and GPU frequencies and the draw time and energy consumption. Energy consumption: Figure 6.3.a shows the correlation coefficients between the workload performance counters and energy consumption. The scatter plots in Figures 6.3.b and 6.3.c complement the correlation matrices by showing the linearity of the relationship between the highly correlated elements. The charts in the left column are created after filtering the data for a single OPP. Similarly, Figure 6.4 represents the relationship between the energy consumption and draw time and the GPU/DDR voltage and frequency. The charts in the left column of Figure 6.4 represent the data for a single randomly selected frame from a randomly selected test. The charts in the right column, which represent the data across all frames, are presented for the insight they provide into the variation of the rendering workload across frames and its impact on the draw time and energy consumption; at runtime, however, our predictive models only make predictions for a given frame workload. Note that the operating voltage for the DDR subsystem is not included in Figure 6.4

because most DDR subsystems typically use similar voltages for all DDR frequency levels (including our platform).

Figure 6.3: Correlation coefficients (a) and scatter plots between energy (b, c) and frame draw time (d), and workload event performance counters. Left column shows data only from one representative OPP (GPU Freq. = 401.8 MHz, DDR Freq. = 1296 MHz).

We can make the following observations about energy consumption, based on the correlation coefficients and scatter plots in Figure 6.3 and Figure 6.4:

1. There is an almost perfect correlation between the TotalData and DDR energy at a single OPP. The same applies to ALUCycles and TPOpsXXbit, and the GFX energy consumption. Combined with the scatter plots, we can confirm that this strong relationship is linear. Hence, these workload performance counters are good inputs for our predictive energy models. 2. The negative impact of combining GFX and DDR energy into a single unified GFX+DDR model (the left column in Table 6.1.a) on the correlation coefficients is negligible. 3. The correlation coefficients between the workload performance counters and energy consump- tion, for the data across all OPPs, is strong, but noticeably lower than the single OPP case (Figure 6.3.a), especially for the GFX energy. The scatter plots across all OPPs in Figure 6.3.b- right and 6.3.c-right do not show a single linear trend as in Figure 6.3.b-left and 6.3.c-left; but, they show a set of linear trends with different slopes, which correspond to the individual OPPs (plotting individual OPPs in separate charts makes this clearer). Unified energy model across OPPs (top row in Table 6.1.a) can account for variable slopes across by incorporating voltage/frequency into the model, but they are bound to have lower accuracy compared to configurations with a separate model for each OPP (bottom row in Table 6.1.a). 4. High correlation and linear trends are present in Figure 6.4.a-left and 6.4.b-left, confirming the quadratic relationship between the GFX energy and GFX voltage we expected (linear relationship with GFX voltage-squared), based on Equation (2.8). There are multiple lines with different slopes, which correspond to different DDR frequencies. 5. Figure 6.4.c-left can help us explain the counterintuitive negative correlation between DDR frequency and DDR energy consumption in Figure 6.4.a. We can see different trend lines corresponding to different GPU frequencies. There is an inflection point in DDR frequency: frequencies lower than the inflection point are negatively correlated with energy consumption. We do not expect to see significant increase in energy consumption at higher frequency levels because the DDR subsystem uses simiilar operating voltage across all frequencies. Consequently, one possible explanation for this negative correlation is that if DDR frequency is the bottleneck (i.e., lower DDR frequencies, depending on the GPU-to-DDR frequency ratio), increasing the DDR frequency can eliminate some of the GPU busy-wait cycles and


[Figure 6.4 panels: correlation heatmaps and scatter plots of GPU2DDRRatio, VoltageSquare, GPUFreq, and DDRFreq against GFX+DDR energy, frame time, DDR energy, and GFX energy, for a single frame and across all frames.]

Figure 6.4: Correlation coefficients (a) and scatter plots between energy (b, c), and frame draw time (d, e), and GPU/DDR voltage and frequency. Left column shows data from only a single randomly selected frame of a randomly selected test.


6. The GPU-to-DDR frequency ratio has a strong correlation with both the frame time and the GFX energy consumption, even stronger than the correlation between the GFX voltage-squared and the GFX energy consumption (Figure 6.4.a). The higher the GPU-to-DDR frequency ratio, the more cycles the GPU spends busy-waiting for data from memory, and therefore the more energy it consumes. This effect is what causes the different slopes in the linear trends observed in Figure 6.4.b-left. Note that DDR energy consumption tracks closely with the amount of data transferred to/from system memory (rather than with the DDR frequency alone, or with the GPU-to-DDR frequency ratio).

Note that there is also a strong correlation between the workload compute counters (ALUCycles and TPOpsXXBit) and the DDR energy consumption (and similarly between DataTotal and the GFX energy; refer to Figure 6.3). Correlation does not necessarily imply causality: the computations do not increase energy consumption on the DDR rail, but computation is correlated with the amount of data transferred to/from memory, and thus with the DDR energy consumption.

Frame draw time  From the charts in Figure 6.3 and Figure 6.4 we can make the following observations about the frame draw time:

1. Workload event counter values show a strong correlation with the frame draw time for data points across OPPs (Figure 6.3.a). But the scatter plot in Figure 6.3.d shows that the relationship is strongly nonlinear, even for the single-OPP case with no impact from the GPU and DDR frequencies. This observation matches our understanding of the GPU pipeline as described in Section 2.2.

2. The GPU-to-DDR frequency ratio has a strong positive correlation with frame draw time (Figure 6.4.a). If memory data transfers are the bottleneck (often the case when the GPU-to-DDR frequency ratio is high), increasing the DDR frequency (thus decreasing the GPU-to-DDR ratio) results in a considerable reduction in the frame draw time, which explains the strong positive correlation. There is a point, depending on the compute-to-memory ratio of the frame rendering workload, beyond which increasing the DDR frequency does not impact the frame draw time.

Note that the scatter plots between the frame draw time and TPOpsXXBit or DataTotal (not presented in Figure 6.3) are similar to the ALUCycles plot shown in Figure 6.3.d.
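As a minimal sketch of how such a correlation matrix could be produced from the per-frame profiling data (assuming a hypothetical pandas table with one row per frame-OPP sample and columns named after the counters and measurements discussed above; this is illustrative, not the tooling of our framework):

import pandas as pd

# Hypothetical per-frame profiling table: one row per frame per OPP.
frames = pd.read_csv("frame_profile.csv")

counters = ["DataTotal", "ALUCycles", "TPOps16Bit", "TPOps32Bit"]
targets = ["GFXenergy", "DDRenergy", "GFX_DDR_energy", "FrameTime"]

# Single-OPP view: filter to one GPU/DDR frequency pair before correlating.
single_opp = frames[(frames["GPUFreq"] == 401800000) &
                    (frames["DDRFreq"] == 1296000000)]

# Pearson correlation between each counter and each outcome.
corr_single = single_opp[counters + targets].corr().loc[targets, counters]
corr_all = frames[counters + targets].corr().loc[targets, counters]
print(corr_single)
print(corr_all)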


Conclusions Our next steps are to build and evaluate our performance and energy models. Here we summarize some conclusions from our preceding analysis that can guide these steps. Specifically, these conclusions can help us decide what predictor variables to use for each model, including potential transformation of the predictors.

1. Taking the nonlinear impact of the operating voltage and frequency on the energy consumption and frame draw time out of the model, by creating separate models for each OPP, will likely yield the best results.

2. For the unified energy models across OPPs (top row in Table 6.1), we can transform the workload predictors (ALUCycles and TPOpsXXBit) by multiplying them by the GPU-to-DDR frequency ratio, to account for the nonlinear impact of varying frame draw times across the OPPs on the energy consumption. We do the same for the unified performance models.

3. For the unified energy models across OPPs (top row in Table 6.1), we can transform the outcome variable, i.e., the energy consumption, by dividing it by Vgpu^2, to account for the nonlinear effect of the GPU voltage on the energy consumption. Predictions then require an additional postprocessing step that multiplies the predicted value of the transformed outcome variable by Vgpu^2 to calculate the energy consumption.

4. We would need separate, fine-grained processing time data for the different stages of the GPU pipeline during the rendering phase, and for the memory data transfers, to train accurate linear performance models for each stage; we would then identify the bottleneck to calculate the total processing time. In this thesis, we only have the final frame draw time available, and would expect our linear performance models to underperform. But our goal is to use the performance model to identify the OPPs that meet the frame deadline; we do not need high prediction accuracy of the exact rendering time for each OPP. For that reason, we also explore the effectiveness of using logistic regression to produce a binary outcome: whether an OPP meets the frame deadline or not.

6.3.2 Models

By combining our knowledge of the GPU pipeline for 3D graphics workloads, described in Sections 2.2, 6.1, and 6.2, with our correlation analysis presented in Section 6.3.1, we have devised 5 statistical models for predicting energy consumption and 4 models for predicting performance. They are described in Table 6.4. Note that for those configurations in Table 6.1 that have separate models for the GFX and DDR rails or separate models for each OPP, e.g., EnSoppUpr, we use the term model to refer to the whole collection of linear models in the configuration.


From the workload performance counters in Table 6.3, we use a combination of ALUCycles, TPOpsXXBit, MemRead, and MemWrite as predictor variables. Through our analysis, we observed that we can use the MemRead and MemWrite counters effectively for the GFX energy model, instead of the L2$ counters. Using fewer counters results in lower runtime overhead.

EnxxxSoppSpr, EnSoppUpr, APxxxSoppSpr, DGSopp, DTSopp: The impact of GPU and DDR voltage and frequency on the outcome variable is isolated in these configurations since there is a separate linear model for each OPP. Therefore, predictors are used without transformation.
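As a minimal sketch of what such a per-OPP collection of linear models could look like (hypothetical helper code using scikit-learn; not the implementation used in this thesis):

from sklearn.linear_model import LinearRegression

# Train one linear model per OPP; `samples` is assumed to be an iterable of
# (opp, counter_vector, energy) tuples gathered by the profiling framework.
def train_per_opp_models(samples):
    grouped = {}
    for opp, counters, energy in samples:
        grouped.setdefault(opp, ([], []))
        grouped[opp][0].append(counters)
        grouped[opp][1].append(energy)
    models = {}
    for opp, (X, y) in grouped.items():
        models[opp] = LinearRegression().fit(X, y)   # untransformed predictors
    return models

def predict_energy(models, opp, counters):
    # Runtime prediction uses the model that matches the candidate OPP.
    return models[opp].predict([counters])[0]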

Note that APxxxSoppSpr predicts the average power consumption and calculates the energy by multiplying the predicted average power by the frame draw time. In effect, this transforms both the predictors and the outcome variable by dividing them by the frame draw time.

EnxxxUoppSpr, EnUoppUpr: The transformations in these two models are a direct application of our observations from correlation analysis in Section 6.3.1. We noted that higher GPU to DDR frequency ratios resulted in higher energy consumption on the GFX rail. This was because of the increase in the number of cycles the GPU spends busy-waiting for data from memory.

In EngfxUoppSpr, we incorporate this relationship into the compute predictors, ALUCycles and TPOpsXXBit, by multiplying them by Fgpu/Fddr. We do not transform MemRead and MemWrite. Additionally, we account for the nonlinear relationship between the voltage-squared and the GFX energy consumption across OPPs: we separate the contribution of the voltage-squared to the total energy consumption by dividing the GFX energy by Vgpu^2 before model training and runtime prediction.

The model essentially predicts Ceff · Nclk in Equation (2.8); the energy consumption is then calculated by multiplying the prediction by Vgpu^2. In EnddrUoppSpr, we apply two transformations to the predictors: (a) we multiply MemRead and MemWrite by Fddr to account for the higher energy consumption at higher DDR frequencies, and (b) we multiply MemRead and MemWrite by Fddr/Fgpu to reflect the fact that a higher GPU-to-DDR frequency ratio gives the DDR subsystem more opportunity to optimize operations through data coalescing, as well as to use low-power states more efficiently by reducing the frame draw time. For EnUoppUpr, we combine the above transformations for the GFX and DDR rail energy consumption into a single model. The main caveat in this case is that we have to scale the combined GFX+DDR energy by the voltage-squared, even though the GPU operating voltage does not have any impact on the DDR energy consumption.
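As a minimal sketch of the EngfxUoppSpr preprocessing and postprocessing just described (hypothetical helper code; per Equation (2.8), Engfx = Ceff · Nclk · Vgpu^2):

from sklearn.linear_model import LinearRegression

def gfx_features(alu, tp16, tp32, mem_read, mem_write, f_gpu, f_ddr):
    # Compute counters are scaled by the GPU-to-DDR frequency ratio;
    # MemRead/MemWrite are left untransformed.
    ratio = f_gpu / f_ddr
    return [alu * ratio, tp16 * ratio, tp32 * ratio, mem_read, mem_write]

def fit_engfx_uopp_spr(rows):
    # rows: iterables of (alu, tp16, tp32, mem_read, mem_write,
    #                     f_gpu, f_ddr, v_gpu, gfx_energy)
    X = [gfx_features(*r[:7]) for r in rows]
    y = [r[8] / (r[7] ** 2) for r in rows]     # divide energy by Vgpu^2
    return LinearRegression().fit(X, y)        # model learns Ceff * Nclk

def predict_gfx_energy(model, counters, f_gpu, f_ddr, v_gpu):
    x = gfx_features(*counters, f_gpu, f_ddr)
    return model.predict([x])[0] * v_gpu ** 2  # multiply back by Vgpu^2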

(1) The voltage scaling transformation is applied to the DDR energy as well; otherwise, we could not reconstruct the total energy consumption from the predicted value.
(2) There is only one component for the outcome variable in the performance models; we can only separate models across OPPs.


Table 6.4: Predictors and outcome variable for the energy and performance models we have explored. Some models use transformed predictors and/or outcome variable that require preprocessing for both offline training and runtime prediction.

En: Energy   AP: Average Power   Fgpu: GPU frequency   Fddr: DDR frequency   DT: Draw Time   DG: Deadline Guaranteed

Energy Model     Predictors                                                               Outcome Variable
EngfxSoppSpr     ALU, TP16, TP32, MemRead, MemWrite                                       Engfx
EnddrSoppSpr     MemRead, MemWrite                                                        Enddr
EnddrUoppSpr     Fddr*(MemRead, MemWrite), (Fddr/Fgpu)*(MemRead, MemWrite)                Enddr
EngfxUoppSpr     (Fgpu/Fddr)*(ALU, TP16, TP32), MemRead, MemWrite                         (1/Vgpu^2)*Engfx
EnSoppUpr        ALU, TP16, TP32, MemRead, MemWrite                                       Engfx+ddr
EnUoppUpr        (Fgpu/Fddr)*(ALU, TP16, TP32), Fddr*(MemRead, MemWrite),                 (1/Vgpu^2)*Engfx+ddr (1)
                 (Fddr/Fgpu)*(MemRead, MemWrite)
APgfxSoppSpr     (1/DT)*(ALU, TP16, TP32, MemRead, MemWrite)                              (1/DT)*Engfx
APddrSoppSpr     (1/DT)*(MemRead, MemWrite)                                               (1/DT)*Enddr

Perf. Model (2)  Predictors                                                               Outcome Variable
DGSopp           ALU, TP16, TP32, MemRead, MemWrite                                       DG (logistic: yes/no)
DGUopp           (1/Fgpu)*(Fddr/Fgpu)*(ALU, TP16, TP32), (1/Fddr)*(MemRead, MemWrite)     DG (logistic: yes/no)
DTSopp           ALU, TP16, TP32, MemRead, MemWrite                                       Frame Draw Time
DTUopp           (1/Fgpu)*(Fddr/Fgpu)*(ALU, TP16, TP32), (1/Fddr)*(MemRead, MemWrite)     Frame Draw Time


DGUopp, DTUopp: For these two models we perform the following transformations on the compute predictors: (1) we divide ALUCycles and TPOpsXXBit by the GPU frequency to reflect the shorter processing time of computation running at higher GPU frequencies, and (2) we divide ALUCycles and TPOpsXXBit by the GPU-to-DDR frequency ratio, since a higher GPU-to-DDR frequency ratio can have a positive impact on the efficiency of the memory subsystem and reduce the frame draw time. We also divide MemRead and MemWrite by the DDR frequency, since the memory data transfer time is shorter at higher DDR frequencies. A sketch of this feature construction follows below.
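As a minimal sketch, under one reading in which transformations (1) and (2) are applied multiplicatively to the same counters (hypothetical helper code; names are illustrative):

def perf_features(alu, tp16, tp32, mem_read, mem_write, f_gpu, f_ddr):
    # (1) compute counters scaled by 1/Fgpu, (2) further scaled by Fddr/Fgpu;
    # memory counters scaled by 1/Fddr, as described above.
    compute_scale = (1.0 / f_gpu) * (f_ddr / f_gpu)
    return [alu * compute_scale,
            tp16 * compute_scale,
            tp32 * compute_scale,
            mem_read / f_ddr,
            mem_write / f_ddr]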

6.3.3 Evaluation results

Evaluation criteria To evaluate our energy and performance models, we use a combination of the following analyses:

(1) We report on the efficiency of our DVFS algorithm, shown in Listing 2, when using our predictive models, as compared to the performance and energy profile of the oracle models. In this analysis, we care about the fidelity of data point predictions across OPPs; i.e., if OPP1 has higher energy consumption compared to OPP2, we need the predicted energy consumption to reflect that, regardless of the absolute accuracy of the two predicted values.

(2) Mean Absolute Percentage Error (MAPE) [102] for data point prediction accuracy. Here we look at the prediction errors of individual data points (a short sketch of this metric follows the list).

(3) Goodness of fit analysis. We use R-squared/adjusted R-squared and P-value analysis to assess the quality of our fitted models.
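As a brief sketch of the MAPE criterion in item (2) above (standard definition; the outlier filtering we apply later in this section is reflected in the comments, and the helper names are illustrative):

import numpy as np

def mape(observed, predicted, outlier_sigma=3.0):
    # Absolute percentage error per data point.
    observed = np.asarray(observed, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    ape = np.abs((observed - predicted) / observed) * 100.0
    # Very small observed values can blow up the percentage error; as described
    # later in this section, points more than three standard deviations above
    # the mean are treated as outliers and removed before averaging.
    keep = ape <= ape.mean() + outlier_sigma * ape.std()
    return ape[keep].mean()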

We start this section by evaluating the performance and energy models separately. Then we analyze the efficiency of our DVFS solution considering combinations of our performance and energy models.

Performance models  Using our performance models, the OPP sequencer in our energy-aware DVFS solution, shown in Figure 6.1.b, identifies whether the frame deadline for a given frame is met at each OPP and creates a list of viable OPPs to feed to the energy model. Figure 6.5 shows the accuracy of our models when predicting whether an OPP meets the frame deadline for our 1,040,697 data points. There are two types of mispredictions: (1) False positive: the OPP fails to meet the frame deadline, but the predicted outcome is that the deadline is met. The potential impact of false positive predictions on the DVFS algorithm is an increase in the number of frame deadline violations, but only if the OPP with the incorrectly predicted outcome also has the lowest energy consumption among the viable set of OPPs in Listing 2. (2) False negative: the OPP meets the deadline, but the prediction is that the deadline is violated. The potential impact of a false negative on the DVFS algorithm is lower energy efficiency, if the OPP that is excluded from the viable OPP set has the lowest energy consumption compared to the other viable OPPs.


Figure 6.5: Performance model deadline-guarantee prediction accuracy: (a) false positive rate, threshold 0.7; (b) false positive rate, threshold 0.5; (c) false negative rate, threshold 0.7; (d) correct prediction rate, threshold 0.7; per application and averaged across applications, for DGSopp, DGUopp, DTSopp, and DTUopp.
The bars for the performance models in the charts in Figure 6.5 are grouped together on the X-axis for each application. There is one additional group, mean, which represents the average percentage of correct predictions, false positives, and false negatives across all applications.3 The bars across the models in each category are sorted in ascending order, from left to right, based on the values in the mean group.

Our linear regression-based performance models output a predicted frame draw time. We convert the predicted draw time to a binary outcome, meets (1)/does not meet (0), for the frame deadline of 1/60 s = 16.7 ms. The output of our logistic regression models is a value in the range [0, 1]. We convert the output to a binary outcome by mapping all values above a threshold to 1, and values below or equal to the threshold to 0. We assess the impact of using two different thresholds, 0.5 and 0.7, on the breakdown between correct predictions, false positives, and false negatives.

Figure 6.5.d shows the percentage of data points for which the deadline-guarantee prediction is correct. All four models have high prediction accuracy, up to 95.9% for DGSopp on average across all applications. Except for Trex and EgyptHD, DGSopp is consistently the most accurate model. As expected, DGSopp and DTSopp are the two most accurate models, more than 7% more accurate than the unified models across OPPs for Hover and EgyptHD.

In our DVFS policy (refer to Section 4.2.1), the first goal is to minimize frame deadline violations; reducing energy consumption must not come at the cost of increased deadline violations. Hence, minimizing the percentage of false positives is a higher priority. Using 0.7 as the threshold, the two logistic models, DGSopp and DGUopp, have a clear advantage. Specifically, DGSopp has a worst-case and average false positive rate of 3.9% and 1.1%, respectively, compared to 9.1% and 3.7% for DTUopp.
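A minimal sketch of these two conversions (hypothetical helper code; the 16.7 ms deadline and the thresholds follow the text above):

FRAME_DEADLINE_MS = 1000.0 / 60.0   # 16.7 ms at 60 FPS

def meets_deadline_from_draw_time(predicted_draw_time_ms):
    # Linear regression models (DTSopp/DTUopp) predict a draw time.
    return 1 if predicted_draw_time_ms <= FRAME_DEADLINE_MS else 0

def meets_deadline_from_logit(predicted_probability, threshold=0.7):
    # Logistic models (DGSopp/DGUopp) output a value in [0, 1];
    # values above the threshold are mapped to "meets the deadline".
    return 1 if predicted_probability > threshold else 0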

Using a lower threshold of 0.5 for the logistic models increases false positives and reduces false negatives; DGSopp remains the best model. Considering all four charts in Figure 6.5, DGSopp is the best-performing model overall, along with DTSopp. What they have in common is a dedicated regression model for each OPP.

3 This is different from calculating the percentage of, e.g., correct predictions across all data points. We are treating these applications as representatives of different types of workloads, and giving equal weight to each application, to assess the effectiveness of our models across different types of workloads. The values across all data points can be calculated from the correct predictions per application, the number of frames for each application (listed in Figure 6.2), and the number of OPPs. The same approach is used for calculating the average across applications in the rest of this section.


Energy models  To evaluate our energy models, we plug them into our DVFS algorithm shown in Listing 2. Since we want to evaluate the energy models in isolation from our performance models, we plug the oracle performance model into Listing 2; hence, there are no false positives or false negatives. Figure 6.6 shows the percent difference in energy consumption between the following two scenarios: (a) our DVFS algorithm in Listing 2 using each of our energy models, and (b) the same algorithm using the oracle energy model. There is one more configuration in this figure, LowestFreq, besides the energy models listed in Table 6.4. LowestFreq represents the naïve energy model commonly used in DVFS algorithms; from the list of viable OPPs, it always selects the one with the lowest voltage/frequency level. In our two-dimensional frequency space, it sorts the OPPs based on GPU voltage/frequency first, and then DDR frequency. Comparing LowestFreq to the oracle energy model quantifies the potential energy savings from using an accurate energy model; comparing our energy models with the oracle model quantifies how effectively our models capture that saving opportunity.

Additionally, note that APSoppSpr predicts average power consumption; energy consumption is calculated by multiplying the predicted value by the frame draw time. Compared to the energy models (e.g., EnxxxSoppSpr), this model allows us to isolate the nonlinear impact of the frame draw time on the energy consumption. To achieve this, we use the frame draw time from the oracle performance model to convert the predicted average power consumption to energy consumption.
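A minimal sketch of the two selection policies being compared (hypothetical helper code; predict_energy stands in for whichever energy model is plugged into Listing 2, and each OPP is assumed to expose gpu_freq and ddr_freq fields):

def select_opp_energy_aware(viable_opps, predict_energy, frame_counters):
    # Pick the viable OPP with the lowest predicted energy for this frame.
    return min(viable_opps,
               key=lambda opp: predict_energy(opp, frame_counters))

def select_opp_lowest_freq(viable_opps):
    # Naive LowestFreq baseline: sort by GPU frequency first, then DDR
    # frequency, and take the lowest viable level.
    return min(viable_opps, key=lambda opp: (opp.gpu_freq, opp.ddr_freq))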

Figure 6.6: Energy model-based DVFS accuracy: energy delta (%) relative to the oracle energy model, per application and on average, for APSoppSpr, EnSoppUpr, EnSoppSpr, EnUoppUpr, EnUoppSpr, and LowestFreq (performance model: oracle).

On average, there is about a 15% potential energy saving from using a detailed energy model (the difference between LowestFreq and the oracle model); more interestingly, the savings opportunity is more than 20% (up to 38%) for three of the applications. Our DVFS algorithm, based on our prediction models, is able to take advantage of these opportunities. Specifically,

EnxxxSoppSpr, EnSoppUpr, and APSoppSpr achieve energy improvements within 3% of the oracle

model in the worst case, and within 1%, 1.4%, and 1.7% on average, respectively. The gap between the configurations with a unified model across all OPPs and those with a dedicated model per OPP is more significant for the energy models (EnUoppUpr and EnxxxUoppSpr vs. EnSoppUpr and

EnxxxSoppSpr), as opposed to the performance models (i.e., DGUopp and DTUopp vs. DGSopp and DTSopp). The unified models are able to capture less than 34% of the total potential saving opportunity on average, and are more than 20% less energy efficient than the oracle model for the three applications that show a significant savings opportunity.

Combined performance and energy models  We looked at the accuracy and effectiveness of the performance and energy models when used separately in our DVFS algorithm. Here we assess how efficient our DVFS algorithm is when combining our performance and energy models, as compared to the ideal configuration, which employs the oracle performance and energy models.

Figure 6.7: Scatter plots for all combinations of the performance models, threshold values, and energy models: (a) energy delta vs. draw time delta, (b) energy delta vs. deadline guarantee delta (mean across tests, relative to the oracle configuration).

Table 6.5: Breakdown of the top configurations in Figure 6.7. ED: Energy Delta (%), DTD: Draw Time Delta (%), DGD: Deadline Guarantee Delta (%).

Perf. model   Count      Energy model    Count      Threshold        0.5     0.7
DGSopp        6          APSoppSpr       6          Count            6       6
DGUopp        6          EnSoppUpr       6          Average (ED)     1.0     1.9
DTSopp        3          EnxxxSoppSpr    6          Average (DTD)    2.8     1.2
DTUopp        3                                     Average (DGD)    -0.06   -0.02

We have evaluated all permutations of performance and energy models in our design space, a total of 49. There are 7 options for the performance model: the DGSopp and DGUopp logistic models with 0.5 and 0.7 thresholds, DTSopp and DTUopp, and the oracle model. There are also 5 energy models, plus the oracle and the naïve (LowestFreq) models. Figure 6.7 shows the scatter plots of the average energy-vs-draw time delta (a) and the energy-vs-deadline guarantee delta (b), compared to the configuration with the oracle models.

The following observations from this chart are worth highlighting:

1. The data points can be divided into two separate clusters with respect to both the energy and the draw time delta.

2. Within the cluster with an energy delta above 9.2%, there is a positive correlation between the energy delta and the frame draw time delta. The subset of data points in this cluster highlighted with the yellow background (energy delta above 13%) belongs to the 7 configurations that use the naïve (LowestFreq) energy model. This positive correlation is a perfect example of the problem with existing solutions, which we address with our energy-aware solution: the two-dimensional GPU-DDR frequency space (which is virtually ignored in all existing DVFS work), combined with the nonlinear impact of the complex GPU pipeline on the frame draw time, and consequently on energy consumption, results in a non-negligible range of frame workload scenarios that run at lower voltage/frequency levels yet consume more energy. Hence, the naïve energy model suffers from both performance and energy inefficiencies.

3. Not surprisingly, the configurations with the naïve energy model also have some of the highest increases in frame deadline violations, as compared to the oracle configuration: running at lower frequencies increases the possibility of deadline violations. Given the high accuracy of our performance models, the worst-case increase in percent deadline violations is less than 0.5% across all frames. This increase is still important, since VR and AR use cases are sensitive to even a slight increase in frame drops.

4. Within the first cluster, all configurations experience an energy delta within 2.2% of ideal. Note that this cluster also includes the configurations with the oracle model for either performance, energy, or both. Excluding those configurations, Table 6.5 shows the breakdown of the remaining 18 configurations. In line with our prior observations from Figure 6.5, all four performance models participate in the top configurations, with 6 configurations each for DGSopp and DGUopp, compared to 3 configurations each for DTSopp and DTUopp. As for the energy models, EnxxxSoppSpr, EnSoppUpr, and APSoppSpr each contribute to 6 configurations.

5. Each of the two thresholds, 0.5 and 0.7, appears in 6 of the most efficient configurations. There is no significant difference between the energy and performance profiles of the two thresholds, but it is worth pointing out that all configurations with the 0.5 threshold consistently have a better energy profile, at the cost of a higher frame draw time and a slightly higher increase in the deadline violation rate. The average energy, draw time, and deadline-guarantee deltas across the 6 configurations for each threshold are shown in Table 6.5.

OPP prediction accuracy  Overall, the combination of our best performance and energy models provides significant improvements over the existing naïve approaches, and achieves energy efficiency within a few percent of the ideal configuration using oracle models. It is reasonable to assess the possibility of overfitting in our models. Figure 6.8 shows the accuracy of our DVFS algorithm (Listing 2) in predicting the optimal GPU and/or DDR frequency; i.e., for each test, it shows the percentage of frames for which the selected GPU/DDR frequency, based on the predicted performance and energy, matches the frequencies that the ideal configuration using oracle models selects (if both frequencies match, the OPP is a match). The configurations in this chart include our most accurate performance model, DGSopp, and our two most accurate energy models, EnxxxSoppSpr and APSoppSpr.4 This figure also shows the results for the naïve (LowestFreq) energy model.

The first thing we notice is the disparity between the accuracy of the frequency prediction and the efficiency of our DVFS solution using these prediction models, as compared to the oracle models. While the energy efficiency of our models was within a few percent of the oracle models, we see a higher rate of GPU/DDR frequency mispredictions. The explanation for this difference is that even when the predicted OPP does not exactly match the most efficient OPP, it is among the most efficient choices, and does not result in a significant increase in energy consumption or frame draw time. Another noticeable trend in Figure 6.8 is that the prediction accuracy of the DDR frequency is lower than that of the GPU frequency. These results are in line with our observations from the correlation analysis: the DDR subsystem has special optimizations that apply to a range of the DDR frequencies, but not all of them, and linear regression models are not able to completely capture the impact of those optimizations, and thus have lower accuracy.

Data point prediction error  The next level of finer-grained analysis is to evaluate the data point prediction error of our models using the Mean Absolute Percentage Error (MAPE) [102], shown in Figure 6.9. The categories on the x-axis in this figure are the OPPs (<GPUFreq>_<DDRFreq>). One of the concerns with MAPE is that it can produce very high error values for data points with very small observed values of the outcome variable (infinity, if the value is 0). To alleviate this problem, after calculating the absolute percentage error for each data point, we remove the outliers: values that are more than three standard deviations larger than the average.

4 Note that the output of the performance model only shows whether the OPP meets the frame deadline. In the case of using APSoppSpr as the energy model, we use the actual frame draw time from the oracle model to convert the predicted average power to energy consumption.



Figure 6.8: DVFS frequency prediction accuracy: Percentage of frames for which (a) GPU frequency, (b) DDR frequency, (c) Both GPU & DDR frequencies are correctly predicted.
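A minimal sketch of how the per-test match percentages in Figure 6.8 could be computed (hypothetical helper code; each OPP is represented as a (GPU frequency, DDR frequency) pair):

def frequency_match_rates(oracle_opps, predicted_opps):
    # Each argument: list of (gpu_freq, ddr_freq) tuples, one per frame.
    n = len(oracle_opps)
    gpu_hits = sum(o[0] == p[0] for o, p in zip(oracle_opps, predicted_opps))
    ddr_hits = sum(o[1] == p[1] for o, p in zip(oracle_opps, predicted_opps))
    opp_hits = sum(o == p for o, p in zip(oracle_opps, predicted_opps))
    return {"gpu_freq_match_pct": 100.0 * gpu_hits / n,
            "ddr_freq_match_pct": 100.0 * ddr_hits / n,
            "opp_match_pct": 100.0 * opp_hits / n}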



Figure 6.9: Mean Absolute Percentage Error (MAPE) for the APSoppSpr and EnSoppSpr energy models and the DTSopp performance model, across OPPs (<GPUFreqMHz>_<DDRFreqMHz> categories on the x-axis); the series are the (model, outcome variable) pairs (APSoppSpr, DDR average power), (APSoppSpr, GFX average power), (DTSopp, frame time), (EnSoppSpr, DDR energy), and (EnSoppSpr, GFX energy).

The average and worst-case percentages of outlier data points in each OPP, across all configurations in Figure 6.9, are 0.66% (77 outliers) and 1.57%, respectively. One potential reason for outliers in our data set is that some of our applications process simple frames at the beginning of execution, during initial loading. These frames have very short draw times and low energy consumption; as described, this behavior can result in very high MAPE even for small absolute error values.

Note the following highlights from Figure 6.9: (a) For both energy models, the prediction errors are noticeably higher than the delta between the energy profile of the models and that of the oracle models shown in Figure 6.6 and Figure 6.7. (b) As expected, the MAPE for the draw time model DTSopp is high, up to 34%, and considerably higher than the MAPE for the energy models. This observation verifies our expectation of lower accuracy for the performance models, due to the lack of visibility in our profiling data into the individual stages of the pipeline that can become the bottleneck and dominate processing time. Yet, our performance model is able to identify whether an OPP meets the frame deadline with a high degree of accuracy (Figure 6.5). For more information on why our performance models are inevitably less accurate, refer to Sections 6.1 and 6.3.1. (c) There is a trend that the highest DDR frequencies for each GPU frequency have a higher draw time MAPE (and, consequently, a similar trend in energy consumption). At lower DDR frequencies, DDR becomes the bottleneck and dictates the performance; the relationship between draw time and the frame's MemRead and MemWrite requests is relatively simple and linear.

In contrast, running at the highest DDR frequencies, the bottleneck shifts to the GPU and can be any of the many GPU pipeline stages; the relationship between performance and the workload parameters becomes less linear, and therefore the prediction error, i.e., MAPE, increases. It is also noticeable that the MAPE at the highest DDR frequencies increases with the GPU frequency. At lower GPU frequencies, there is a certain DDR frequency beyond which MAPE does not change, because past that point the DDR frequency is no longer the bottleneck.

Cross-validation  K-fold cross-validation is considered a more robust approach for analyzing the prediction error of statistical models. Training/test prediction error analysis tends to produce an optimistic error estimate, while k-fold cross-validation tends to estimate the error on the high side [103]; a combination of the two provides a better understanding of the accuracy of the models. The analysis presented so far in this section was based on a 70%-30% split between the training and test data sets. Here, we present two scatter plots similar to Figure 6.7 as the basis for our 10-fold cross-validation analysis. Figures 6.10.a and 6.10.b show the energy-vs-draw time delta between the oracle configuration and the combination of our best performance model, DGSopp, with our two best energy models, EnSoppSpr and APSoppSpr. We present the data for each application with a different color and symbol (there is also a data series for the average across the tests, similar to Figure 6.7). In our 10-fold cross-validation, we divide our data into 10 equal-sized subsets and iterate through them; in each iteration i, i = 1..10, the ith subset is used as validation data and the remaining 9 subsets as training data (a sketch of this procedure follows below). The data points for the 10 iterations are shown with the same symbol and color. For comparison, for each application we also include the data point based on our training/test split (presented in Figure 6.7 only for the average across tests); this data point is shown with a solid fill color in Figure 6.10.

For most of the applications, and on average, the data points are consistent, and one or more data points closely match the training/test data point. There is a general trend towards a slight increase in energy consumption compared to the training/test data point, with the advantage of a significant reduction in the frame draw time. Ideally, for graphics workloads, higher performance is not beneficial as long as all frame deadlines are met; but practical DVFS solutions are imperfect, and better overall performance reduces the chances of missing frame deadlines. For APSoppSpr, the energy profiles of all the data points are still within 3% of the oracle models. For EnSoppSpr, the worst-case energy profiles for ChromeWebGL and LauncherAndroidO are within 10%; on average, the energy profile is within 4% of the oracle models.
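A minimal sketch of the 10-fold procedure mentioned above (hypothetical helper code using scikit-learn's KFold; build_model stands in for fitting any one of the model configurations):

import numpy as np
from sklearn.model_selection import KFold

def ten_fold_errors(X, y, build_model):
    # X, y: arrays of predictors and outcome values; build_model()
    # returns an unfitted regression object with fit()/predict().
    X, y = np.asarray(X), np.asarray(y)
    errors = []
    for train_idx, val_idx in KFold(n_splits=10, shuffle=True).split(X):
        model = build_model().fit(X[train_idx], y[train_idx])
        pred = model.predict(X[val_idx])
        errors.append(np.mean(np.abs((y[val_idx] - pred) / y[val_idx])) * 100)
    return errors   # one MAPE-style error per fold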



Figure 6.10: 10-fold cross-validation scatter plots of energy delta vs. draw time delta relative to the oracle configuration; energy models: (a) EnSoppSpr, (b) APSoppSpr. In both charts, the performance model is DGSopp. A data point from the training/test data split (Figure 6.7) for the same model configuration, for each test (and for the average across tests), is highlighted with a solid fill color.

Goodness of fit  In a linear regression model, the P-value is a measure of the statistical significance of the model. It is a test of the null hypothesis: for each predictor, it tests the hypothesis that the predictor does not describe any variation in the outcome variable and can be removed from the model. A value below 0.05 confirms the statistical significance of the corresponding predictor. R-squared and adjusted R-squared measure how much of the variation in the outcome variable is explained by the fitted model; the valid range is [0, 1], and the higher the number, the better. Adjusted R-squared addresses some of the limitations of R-squared, including alleviating the problem of overfitting.

Table 6.6 presents the goodness of fit parameters for our top performance and energy models. Since each of our top models comprises one linear/logistic regression model per OPP, we report the average and standard deviation of R-squared across these OPP-specific models. The number of P-values in each configuration equals #OPPs × #predictors (63 × 5 for EngfxSoppSpr); the number in the table is the number of OPP-predictor combinations for which the P-value is not below 0.05. As expected, the R-squared numbers for our energy models are close to 1.0. We highlighted throughout this chapter that accurate prediction of the frame draw time using linear models is more challenging; the R-squared values for our performance models, especially DGSopp, reflect this reality. These values are still high, and we know that our models have high fidelity for the purposes of our DVFS solution. Finally, the lower R-squared for the APddrSoppSpr energy model is due to the fact that we transform its predictors and outcome variable by dividing them by the frame draw time before training. As we have pointed out, the DDR energy consumption is mostly a function of the amount of data transferred, as well as the DDR frequency; neither of these two factors has a strong correlation with the frame draw time. Therefore, the trained model for predicting the average DDR power consumption cannot explain some of the changes in the outcome variable.
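As a minimal sketch of how these statistics can be obtained for a single OPP-specific linear model (hypothetical helper code using statsmodels; column ordering is illustrative):

import statsmodels.api as sm

def fit_report(X, y):
    # X: 2-D array of predictor values for one OPP (e.g., ALU, TP16, TP32,
    # MemRead, MemWrite); y: the outcome (e.g., GFX energy per frame).
    model = sm.OLS(y, sm.add_constant(X)).fit()
    return {"r_squared": model.rsquared,
            "adj_r_squared": model.rsquared_adj,
            # One P-value per predictor (plus the intercept); values above
            # 0.05 correspond to the counts reported in Table 6.6.
            "p_values": model.pvalues}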


Table 6.6: Goodness of fit parameters for our energy and performance models.

                         EnxxxSoppSpr       EnSoppUpr   APxxxSoppSpr       DGSopp (5)   DTSopp
                         gfx      ddr                   gfx      ddr
R2 (mean)                1.00     0.99      1.00        0.97     0.67      0.79         0.93
R2 (std)                 0.00     0.00      0.00        0.01     0.17      0.07         0.06
count(P-value > 0.05)    5        0         0           10       0         40 (29)      4

With a few exceptions, the P-value numbers show that the variables selected for our models have a strong correlation with the outcome variable. Out of the 315 combinations of models and predictor variables, there are only 5, 10, and 4 cases in which the P-value is larger than 0.05, for EngfxSoppSpr,

APgfxSoppSpr, and DTSopp, respectively. The only exception, with a higher count of P-values > 0.05, is the DGSopp logistic regression model: 40 combinations, across 29 OPPs. In most of these cases, the high P-value is for TPOps32Bit. Note that TPOps32Bit operations are uncommon: only 17% of our data points have nonzero values for TPOps32Bit, and within that 17% the average value of TPOps16Bit is 3 times the average value of TPOps32Bit. That is a contributing factor in the lower correlation (higher P-value) between TPOps32Bit and our outcome variables. But we know that each of those operations consumes power and impacts the processing time, and hence we keep the parameter in the models.

(5) R2 is not defined for logistic regression; this value is a pseudo-R2 computed using McFadden's method [104].

Chapter 7

Conclusion and future work

7.1 Conclusion

The contributions of this thesis are centered around improving mobile GPU DVFS for 3D graphics workloads. DVFS algorithms dynamically reduce the operating voltage/frequency at runtime to reduce energy consumption, which is a first-order design objective for mobile SoCs. 3D graphics workloads are highly sensitive to the negative performance impact of DVFS solutions. Existing DVFS solutions for general-purpose CPU and GPU-compute applications cannot satisfy the strict soft-realtime requirements of graphics workloads. Additionally, the existing solutions proposed for (mobile) graphics workloads in recent years leave room for considerable improvement. In this thesis, we achieved three main goals towards increasing energy savings for mobile GPUs through DVFS, with minimal impact on performance and on the user experience due to frame deadline violations.

First, in Chapter 3, we presented a detailed performance and energy profiling framework for graphics workloads based on a state-of-the-art Snapdragon SoC platform. In Chapter 4 we performed an extensive performance and energy consumption analysis of graphics workloads, as it relates to DVFS, using this framework. We showed that DDR frequency scaling, which has not been studied extensively in previous work, combined with the complexity of state-of-the-art mobile GPU pipelines and 3D graphics workloads, has a significant impact on the performance and energy profile. Many of our results invalidate widely held assumptions or show that they lead to considerably inefficient decisions. Specifically, we showed that lower DVFS voltage/frequency levels do not always consume less energy. Further, we found that assuming a linear relationship between frequency and performance can lead to suboptimal decisions that result in a high rate of frame deadline violations. Through our design space exploration in this chapter, we also showed that level switching latency and safety margin are two of the most influential design parameters in our environment.

We showed that incorporating energy awareness into DVFS has the potential for up to 10% reduction in energy consumption. Finally, we proposed heuristics for OPP configuration selection for DVFS solutions which do not support dual-domain GPU-DDR frequency scaling.

Second, we proposed QoS-aware DVFS in Chapter 5 to provide improved energy efficiency and soft-realtime deadline guarantees, as compared to the existing history-based governor algorithms in the Linux/Android kernel, such as the Simple Governor and Qualcomm's msm-adreno-tz. By improving the level switching latency in response to workload changes, this solution allows safety margins to be reduced. These two improvements can provide better energy savings without sacrificing frame deadline guarantees. This solution also incorporates context and QoS (FPS) awareness into the DVFS algorithm. Incorporating QoS awareness for individual applications across the set of concurrently running applications enables more optimized decision making. Specifically, QoS-aware DVFS tracks the frame rendering time of individual applications, as well as their FPS requirements, separately. More efficient execution is achieved by aligning the GPU's performance with the dynamic behavior and QoS (FPS) requirements of the running applications. The QoS awareness additionally enables proactive decisions for DVFS level changes when a new application starts using the GPU, or when the frame rendering time of an application increases. These improvements reduce the number of frame deadline misses and, consequently, allow shrinking safety margins further for improved energy efficiency. With QoS-aware DVFS, we achieved frame deadline guarantees within 2% of the ideal profile (highest frequency, 624 MHz), and 17% and 27% improvements in frame deadline guarantees and EV^2P, respectively, over the 410 MHz operating frequency. QoS-aware DVFS consistently outperforms Qualcomm's msm-adreno-tz DVFS solution.

Finally, we proposed energy-aware DVFS in Chapter 6. Energy-aware DVFS uses detailed performance and energy consumption models to address the limitations of QoS-aware DVFS, mainly the lack of explicit support for DDR frequency scaling. Using statistical models, energy-aware DVFS identifies an optimized OPP for a given workload: one that meets the FPS requirement of the application with the minimum energy consumption. The performance model is used to identify a subset of the OPPs that meet the performance (FPS) requirements; the energy model is then used to identify the most energy-efficient OPP among those. As a result, the performance model opens up the whole OPP space to the DVFS algorithm; there is no need for a static, design-time OPP space configuration selection. The combination of the performance and energy models allows the algorithm to identify those workload phases where the intuitive assumptions do not apply.

For example, it can identify when running at a higher GPU or DDR frequency is more energy efficient. Throughout this thesis, our DVFS optimization policy was to consider energy consumption only if the frame deadline guarantees are satisfied. With energy-aware DVFS, we achieved energy consumption within 3% of the ideal profile for our DVFS policy, while increasing frame deadline violations by less than 0.2%.

7.2 Future work

We can take the ideas presented in this thesis in a number of new directions. Here we list some paths for future work:

1. More extensive analysis of QoS-aware DVFS. QoS-aware DVFS has a core logic and additional components that aim to minimize frame deadline violations. Quantifying the specific contribution of each element, and providing a comparison between these elements, is an interesting future step. Evaluating QoS-aware DVFS with a complex set of multiple concurrent applications with different QoS requirements is of special interest.

2. Quantification of the improvements energy-aware DVFS offers on top of QoS-aware DVFS. By implementing QoS-aware DVFS in the same environment as the energy-aware DVFS, we can make an apples-to-apples comparison between the two solutions.

3. Evaluation of the energy-aware DVFS in the full software stack, on device.

4. Comparison of modern graphics APIs as they relate to DVFS. All of our work in this thesis was focused on OpenGL ES-based applications. Modern graphics APIs, such as Vulkan, change the programming model significantly. The main goal of the Vulkan API is to minimize driver overhead by eliminating runtime checks and state tracking. As a result, Vulkan gives more control to the applications, in a way that has implications for power management. As a specific example, Vulkan gives applications more flexibility to reuse resources and previously rendered surfaces across many frames. Depending on how heavily applications utilize this flexibility, and the way it is used, it could have an impact on the effectiveness of history-based prediction of the workload of the next phase. We demonstrated in our analysis that history-based predictors are effective for existing graphics workloads; both of our proposed DVFS solutions utilize history-based predictors.

Bibliography

[1] J. Rabaey, Low Power Design Essentials. Springer Science & Business Media, 2009.

[2] Khronos Group, “OpenGL ES 3,” 2008. [Online]. Available: https://www.khronos.org/opengles/

[3] R. Dreslinski, M. Wieckowski, D. Blaauw, D. Sylvester, and T. Mudge, “Near-Threshold Computing: Reclaiming Moore’s Law Through Energy Efficient Integrated Circuits,” Proceedings of the IEEE, vol. 98, no. 2, pp. 253–266, feb 2010. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=5395763

[4] R. Mahajan and G. Chrysler, “Cooling a Microprocessor Chip,” Proceedings of the IEEE, vol. 94, no. 8, pp. 1476–1486, aug 2006. [Online]. Available: http: //ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=1705137

[5] D. Dunn, “The best and worst cities for data centers,” InformationWeek, 2006.

[6] J. Markoff and S. Hansell, “Hiding in plain sight, Google seeks more power,” New York Times, 2006.

[7] G. L. Wasner and J. A. Brock, “Determinants of thermal pain thresholds in normal subjects.” Clinical neurophysiology : official journal of the International Federation of Clinical Neurophysiology, vol. 119, no. 10, pp. 2389–95, oct 2008. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1388245708008511


[8] B. Egilmez, G. Memik, S. Ogrenci-Memik, and O. Ergin, “User-specific skin temperature- aware DVFS for smartphones,” pp. 1217–1220, 2015.

[9] S. Borkar, “Design challenges of technology scaling,” IEEE Micro, vol. 19, no. 4, pp. 23–29, 1999.

[10] S. Kaxiras and M. Martonosi, Computer Architecture Techniques for Power-Efficiency, 2008, vol. 3, no. 1.

[11] H. Esmaeilzadeh, E. Blem, R. St. Amant, K. Sankaralingam, and D. Burger, “Dark silicon and the end of multicore scaling,” IEEE Micro, vol. 32, no. 3, pp. 122–134, 2012.

[12] V. Pallipadi and A. Starikovskiy, “The ondemand governor,” Proceedings of the Linux Symposium, 2006. [Online]. Available: http://img.pusuo.net/2009-11-05/110969207.pdf#page=223

[13] V. Spiliopoulos, S. Kaxiras, and G. Keramidas, “Green governors: A framework for Continuously Adaptive DVFS,” in 2011 International Green Computing Conference and Workshops. IEEE, jul 2011, pp. 1–8. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/ epic03/wrapper.htm?arnumber=6008552

[14] Y. Gu and S. Chakraborty, “A Hybrid DVS Scheme for Interactive 3D Games,” in 2008 IEEE Real-Time and Embedded Technology and Applications Symposium. IEEE, apr 2008, pp. 3–12. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=4550775

[15] S. K. Kim and Y. J. Kim, “GPGPU-Perf: efficient, interval-based DVFS algorithm for mobile GPGPU applications,” Visual Computer, vol. 31, no. 6-8, pp. 1045–1054, 2015.

[16] A. Pathania, A. E. Irimiea, A. Prakash, and T. Mitra, “Power-Performance Modelling of Mobile Gaming Workloads on Heterogeneous MPSoCs,” p. 201, jun 2015. [Online]. Available: http://dl.acm.org/citation.cfm?id=2744769.2744894

[17] L. Benini and G. De Micheli, “Automatic synthesis of low-power gated-clock finite-state machines,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 15, no. 6, pp. 630–643, jun 1996. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=503933


[18] L. Benini, A. Bogliolo, and G. De Micheli, “A survey of design techniques for system-level dynamic power management,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 8, no. 3, pp. 299–316, jun 2000. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=845896

[19] P. Babighian, L. Benini, and E. Macii, “A scalable algorithm for RTL insertion of gated clocks based on ODCs computation,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 24, no. 1, pp. 29–42, jan 2005. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=1372659

[20] H. Mahmoodi, V. Tirumalashetty, M. Cooke, and K. Roy, “Ultra low-power clocking scheme using energy recovery and clock gating,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 17, no. 1, pp. 33–44, 2009.

[21] R. Bhutada and Y. Manoli, “Complex clock gating with integrated clock gating logic cell,” in Proceedings - 2007 International Conference on Design and Technology of Integrated Systems in Nanoscale Era, DTIS 2007, 2007, pp. 164–169.

[22] C.-L. Su and A. M. Despain, “Cache design trade-offs for power and performance optimization,” in Proceedings of the 1995 international symposium on Low power design - ISLPED ’95. New York, New York, USA: ACM Press, apr 1995, pp. 63–68. [Online]. Available: http://dl.acm.org/citation.cfm?id=224081.224093

[23] H. Zhou, M. C. Toburen, E. Rotenberg, and T. M. Conte, “Adaptive mode control: A static-power-efficient cache design,” ACM Transactions on Embedded Computing Systems, vol. 2, no. 3, pp. 347–372, aug 2003. [Online]. Available: http://dl.acm.org/citation.cfm?id=860176.860181

[24] M. Rasquinha, D. Choudhary, S. Chatterjee, S. Mukhopadhyay, and S. Yalamanchili, “An energy efficient cache design using spin torque transfer (STT) RAM,” in Proceedings of the 16th ACM/IEEE international symposium on Low power electronics and design - ISLPED ’10. New York, New York, USA: ACM Press, aug 2010, p. 389. [Online]. Available: http://dl.acm.org/citation.cfm?id=1840845.1840931

[25] A. R. Alameldeen, I. Wagner, Z. Chishti, W. Wu, C. Wilkerson, and S.-L. Lu, “Energy-efficient cache design using variable-strength error-correcting codes,” in Proceeding


of the 38th annual international symposium on Computer architecture - ISCA ’11, vol. 39, no. 3. New York, New York, USA: ACM Press, jun 2011, p. 461. [Online]. Available: http://dl.acm.org/citation.cfm?id=2000064.2000118

[26] R. Banakar, S. Steinke, B.-S. Lee, M. Balakrishnan, and P. Marwedel, “Scratchpad memory,” in Proceedings of the tenth international symposium on Hardware/software codesign - CODES ’02. New York, New York, USA: ACM Press, may 2002, p. 73. [Online]. Available: http://dl.acm.org/citation.cfm?id=774789.774805

[27] W. Davis, J. Wilson, S. Mick, C. Mineo, A. Sule, M. Steer, and P. Franzon, “Demystifying 3D ICs: The Pros and Cons of Going Vertical,” IEEE Design and Test of Computers, vol. 22, no. 6, pp. 498–510, jun 2005. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=1541910

[28] Q. Gu, Z. Xu, J. Ko, and M.-C. F. Chang, “Two 10Gb/s/pin Low-Power Interconnect Methods for 3D ICs,” in 2007 IEEE International Solid-State Circuits Conference. Digest of Technical Papers. IEEE, feb 2007, pp. 448–614. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=4242458

[29] M.-F. Chang, P.-F. Chiu, and S.-S. Sheu, “Challenges and trends in low-power 3D die-stacked IC designs using RAM, memristor logic, and resistive memory (ReRAM),” in 2011 9th IEEE International Conference on ASIC. IEEE, oct 2011, pp. 299–302. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=6157181

[30] M. Borah, R. Owens, and M. Irwin, “Transistor sizing for low power CMOS circuits,” IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 15, no. 6, pp. 665–671, jun 1996. [Online]. Available: http: //ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=503935

[31] P. Pant, R. Roy, and A. Chattejee, “Dual-threshold voltage assignment with transistor sizing for low power CMOS circuits,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 9, no. 2, pp. 390–394, apr 2001. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=924061

[32] H. Kaul, M. Anders, S. Hsu, A. Agarwal, R. Krishnamurthy, and S. Borkar, “Near-threshold voltage (NTV) design,” in Proceedings of the 49th Annual Design Automation Conference on -


DAC ’12. New York, New York, USA: ACM Press, jun 2012, p. 1153. [Online]. Available: http://dl.acm.org/citation.cfm?id=2228360.2228572

[33] T. O. Pitkänen and J. Takala, “Low-Power Application-Specific Processor for FFT Computations,” Journal of Signal Processing Systems, vol. 63, no. 1, pp. 165–176, sep 2010. [Online]. Available: http://link.springer.com/10.1007/s11265-010-0528-z

[34] C. H. K. van Berkel, “Multi-core for mobile phones,” pp. 1260–1265, apr 2009. [Online]. Available: http://dl.acm.org/citation.cfm?id=1874620.1874924

[35] Y. Abe, H. Sasaki, M. Peres, K. Inoue, K. Murakami, and S. Kato, “Power and performance analysis of GPU-accelerated systems,” p. 10, oct 2012. [Online]. Available: http://dl.acm.org/citation.cfm?id=2387869.2387879

[36] H. Wang, V. Sathish, R. Singh, M. J. Schulte, and N. S. Kim, “Workload and power budget partitioning for single-chip heterogeneous processors,” in Proceedings of the 21st international conference on Parallel architectures and compilation techniques - PACT ’12. New York, New York, USA: ACM Press, sep 2012, p. 401. [Online]. Available: http://dl.acm.org/citation.cfm?id=2370816.2370873

[37] R. Nathuji, C. Isci, and E. Gorbatov, “Exploiting Platform Heterogeneity for Power Efficient Data Centers,” in Fourth International Conference on Autonomic Computing (ICAC’07). IEEE, jun 2007, pp. 5–5. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/ wrapper.htm?arnumber=4273099

[38] R. Kumar, K. Farkas, N. Jouppi, P. Ranganathan, and D. Tullsen, “Single-ISA heterogeneous multi-core architectures: the potential for processor power reduction,” in 22nd Digital Avionics Systems Conference. Proceedings (Cat. No.03CH37449). IEEE Comput. Soc, 2003, pp. 81–92. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=1253185

[39] R. Gonzalez, B. Gordon, and M. Horowitz, “Supply and threshold voltage scaling for low power CMOS,” IEEE Journal of Solid-State Circuits, vol. 32, no. 8, pp. 1210–1216, 1997. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=604077

[40] K. Roy and V. De, “Low voltage low power CMOS design techniques for deep submicron ICs,” in VLSI Design 2000. Wireless and Digital Imaging in the Millennium. Proceedings


of 13th International Conference on VLSI Design. IEEE Comput. Soc, 2000, pp. 24–29. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=812579

[41] K. Usami, M. Igarashi, F. Minami, T. Ishikawa, M. Kanzawa, M. Ichida, and K. Nogami, “Automated low-power technique exploiting multiple supply voltages applied to a media processor,” IEEE Journal of Solid-State Circuits, vol. 33, no. 3, pp. 463–472, mar 1998. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=661212

[42] K. Usami and M. Igarashi, “Low-power design methodology and applications utilizing dual supply voltages,” in Proceedings of the 2000 conference on Asia South Pacific design automation - ASP-DAC ’00. New York, New York, USA: ACM Press, jan 2000, pp. 123–128. [Online]. Available: http://dl.acm.org/citation.cfm?id=368434.368590

[43] S. Mutoh, S. Shigematsu, Y. Matsuya, H. Fukuda, and J. Yamada, “A 1 V multi-threshold voltage CMOS DSP with an efficient power management technique for mobile phone application,” in 1996 IEEE International Solid-State Circuits Conference, Digest of Technical Papers (ISSCC). IEEE, 1996, pp. 168–169. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=488556

[44] J. Kao and A. Chandrakasan, “Dual-threshold voltage techniques for low-power digital circuits,” IEEE Journal of Solid-State Circuits, vol. 35, no. 7, pp. 1009–1018, jul 2000. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=848210

[45] L. Wei, Z. Chen, K. Roy, M. Johnson, Y. Ye, and V. De, “Design and optimization of dual-threshold circuits for low-voltage low-power applications,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 7, no. 1, pp. 16–24, mar 1999. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=748196

[46] K. von Arnim, E. Borinski, P. Seegebrecht, H. Fiedler, R. Brederlow, R. Thewes, J. Berthold, and C. Pacha, “Efficiency of body biasing in 90-nm CMOS for low-power digital circuits,” IEEE Journal of Solid-State Circuits, vol. 40, no. 7, pp. 1549–1556, jul 2005. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=1459000

[47] S. M. Martin, K. Flautner, T. Mudge, and D. Blaauw, “Combined dynamic voltage scaling and adaptive body biasing for lower power microprocessors under dynamic workloads,” in Proceedings of the 2002 IEEE/ACM international conference on Computer-aided design - ICCAD ’02. New York, New York, USA: ACM Press, nov 2002, pp. 721–725. [Online]. Available: http://dl.acm.org/citation.cfm?id=774572.774678

[48] J. Tschanz, S. Narendra, B. Bloechel, and S. Borkar, “Dynamic sleep transistor and body bias for active leakage power control of microprocessors,” IEEE Journal of Solid-State Circuits, vol. 38, no. 11, pp. 1838–1845, nov 2003. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=1240963

[49] G. Gammie, A. Wang, M. Chau, S. Gururajarao, R. Pitts, F. Jumel, S. Engel, P. Royannez, R. Lagerquist, H. Mair, J. Vaccani, G. Baldwin, K. Heragu, R. Mandal, M. Clinton, D. Arden, and U. Ko, “A 45nm 3.5G Baseband-and-Multimedia Application Processor using Adaptive Body-Bias and Ultra-Low-Power Techniques,” in 2008 IEEE International Solid-State Circuits Conference - Digest of Technical Papers. IEEE, feb 2008, pp. 258–611. [Online]. Available: https://www.infona.pl//resource/bwmeta1.element.ieee-art-000004523155

[50] A. Auberton-Hervé, “SOI: materials to systems,” in International Electron Devices Meeting, Technical Digest (IEDM ’96). IEEE, 1996. [Online]. Available: http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=553028

[51] T. Oashi, T. Eimori, F. Morishita, T. Iwamatsu, Y. Yamaguchi, F. Okuda, K. Shimomura, H. Shimano, N. Sakashita, K. Arimoto, Y. Inoue, S. Komori, M. Inuishi, T. Nishimura, and H. Miyoshi, “16 Mb DRAM/SOI technologies for sub-1 V operation,” in International Electron Devices Meeting. Technical Digest. IEEE, 1996, pp. 609–612. [Online]. Available: https://www.infona.pl//resource/bwmeta1.element.ieee-art-000000554057

[52] S. Kim, S. V. Kosonocky, D. R. Knebel, K. Stawiasz, and M. C. Papaefthymiou, “A Multi-Mode Power Gating Structure for Low-Voltage Deep-Submicron CMOS ICs,” IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 54, no. 7, pp. 586–590, jul 2007. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=4273638

[53] Z. Hu, A. Buyuktosunoglu, V. Srinivasan, V. Zyuban, H. Jacobson, and P. Bose, “Microarchitectural techniques for power gating of execution units,” in Proceedings of the 2004 international symposium on Low power electronics and design - ISLPED ’04. New York, New York, USA: ACM Press, aug 2004, p. 32. [Online]. Available: http://dl.acm.org/citation.cfm?id=1013235.1013249

[54] R. Camposano, “Embedded System Design,” Progress White Papers, pp. 7–27, 2006. [Online]. Available: http://www2.fhi.nl/progress/images/tuecorporaal10mei.pdf

[55] K. Govil, E. Chan, and H. Wasserman, “Comparing algorithms for dynamic speed-setting of a low-power CPU,” in Proceedings of the 1st annual international conference on Mobile computing and networking - MobiCom ’95. New York, New York, USA: ACM Press, dec 1995, pp. 13–25. [Online]. Available: http://dl.acm.org/citation.cfm?id=215530.215546

[56] T. Pering, T. Burd, and R. Brodersen, “The simulation and evaluation of dynamic voltage scaling algorithms,” in Proceedings of the 1998 international symposium on Low power electronics and design - ISLPED ’98. New York, New York, USA: ACM Press, 1998, pp. 76–81. [Online]. Available: http://portal.acm.org/citation.cfm?doid=280756.280790

[57] J. Gu and G. Qu, “Incorporating temperature-leakage interdependency into dynamic voltage scaling for real-time systems,” in 2013 IEEE 24th International Conference on Application-Specific Systems, Architectures and Processors. IEEE, jun 2013, pp. 289–296. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=6567592

[58] I. Koren and C. M. Krishna, “Improving processor lifespan and energy consumption using DVFS based on ILP monitoring,” in 2015 Sixth International Green and Sustainable Computing Conference (IGSC). IEEE, dec 2015, pp. 1–6. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=7393725

[59] C. Isci, G. Contreras, and M. Martonosi, “Live, Runtime Phase Monitoring and Prediction on Real Systems with Application to Dynamic Power Management,” in 2006 39th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO’06). IEEE, dec 2006, pp. 359–370. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=4041860

[60] C. Lefurgy, X. Wang, and M. Ware, “Power capping: A prelude to power shifting,” Cluster Computing, vol. 11, no. 2, pp. 183–195, 2008.

[61] H. Zhang and H. Hoffmann, “Maximizing Performance Under a Power Cap: A Comparison of Hardware, Software, and Hybrid Techniques,” in ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS), 2016.

[62] K. Rajamani, H. Hanson, J. Rubio, S. Ghiasi, and F. Rawson, “Application-aware power management,” in Proceedings of the 2006 IEEE International Symposium on Workload Characterization, IISWC - 2006, 2006, pp. 39–48.

[63] H. Hanson, S. W. Keckler, S. Ghiasi, K. Rajamani, F. Rawson, and J. Rubio, “Thermal response to DVFS: analysis with an Intel Pentium M,” in Low Power Electronics and Design (ISLPED), 2007 ACM/IEEE International Symposium on, 2007, pp. 219–224.

[64] B. Goel, S. A. McKee, R. Gioiosa, K. Singh, M. Bhadauria, and M. Cesati, “Portable, scalable, per-core power estimation for intelligent resource management,” in International Conference on Green Computing. IEEE, aug 2010, pp. 135–146. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=5598313

[65] W. Huang, C. Lefurgy, W. Kuk, A. Buyuktosunoglu, M. Floyd, K. Rajamani, M. Allen-Ware, and B. Brock, “Accurate Fine-Grained Processor Power Proxies,” in 2012 45th Annual IEEE/ACM International Symposium on Microarchitecture. IEEE, dec 2012, pp. 224–234. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=6493622

[66] C. Nugteren, G.-J. van den Braak, and H. Corporaal, “Roofline-aware DVFS for GPUs,” in Proceedings of International Workshop on Adaptive Self-tuning Computing Systems - ADAPT ’14, 2014, pp. 8–10. [Online]. Available: http://dl.acm.org/citation.cfm?doid=2553062.2553067

[67] Q. Jiao, M. Lu, H. P. Huynh, and T. Mitra, “Improving GPGPU Energy-Efficiency Through Concurrent Kernel Execution and DVFS,” Proceedings of the 13th Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pp. 1–11, 2015. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=7054182

[68] I. Paul, W. Huang, M. Arora, and S. Yalamanchili, “Harmonia: Balancing compute and memory power in high-performance GPUs,” in Proceedings of the 42nd Annual International Symposium on Computer Architecture - ISCA ’15. New York, New York, USA: ACM Press, 2015, pp. 54–65. [Online]. Available: http://dl.acm.org/citation.cfm?doid=2749469.2750404

[69] S. Che and K. Skadron, “BenchFriend: Correlating the performance of GPU benchmarks,” International Journal of High Performance Computing Applications, vol. 28, no. 2, pp. 238–250, oct 2013. [Online]. Available: http://hpc.sagepub.com/content/28/2/238.short

[70] M. Wimmer and P. Wonka, “Rendering time estimation for real-time rendering,” Proceedings of the 14th Eurographics workshop on Rendering, pp. 118–129, 2003. [Online]. Available: http://dl.acm.org/citation.cfm?id=882404.882422

[71] J. W. Sheaffer, D. Luebke, and K. Skadron, “A flexible simulation framework for graphics architectures,” in Proceedings of the ACM SIGGRAPH/EUROGRAPHICS conference on Graphics hardware - HWWS ’04. New York, New York, USA: ACM Press, aug 2004, p. 85. [Online]. Available: http://dl.acm.org/citation.cfm?id=1058129.1058142

[72] D. Grunwald, C. B. Morrey III, P. Levis, M. Neufeld, and K. I. Farkas, “Policies for dynamic clock scheduling,” p. 6, oct 2000. [Online]. Available: http://dl.acm.org/citation.cfm?id=1251229.1251235

[73] V. Raghunathan, C. Pereira, M. Srivastava, and R. Gupta, “Energy-aware wireless systems with adaptive power-fidelity tradeoffs,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 13, no. 2, pp. 211–225, feb 2005. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=1393022

[74] G. Contreras and M. Martonosi, “Power prediction for Intel XScale processors using performance monitoring unit events,” in ISLPED ’05. Proceedings of the 2005 International Symposium on Low Power Electronics and Design, 2005. IEEE, 2005, pp. 221–226. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=1522767

[75] S. Holmbacka, J. Keller, P. Eitschberger, and J. Lilius, “Accurate energy modeling for many- core static schedules with streaming applications,” Microprocessors and Microsystems, pp. 1– 12, 2016. [Online]. Available: http://linkinghub.elsevier.com/retrieve/pii/S0141933116000119

[76] Y. Gu, S. Chakraborty, and W. T. Ooi, “Games are up for DVFS,” in 2006 43rd ACM/IEEE Design Automation Conference. IEEE, 2006, pp. 598–603. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=1688867

[77] B. Mochocki, K. Lahiri, and S. Cadambi, “Power analysis of mobile 3D graphics,” in Proceedings of the conference on Design, Automation and Test in Europe - DATE ’06, 2006, pp. 502–507. [Online]. Available: http://dl.acm.org/citation.cfm?id=1131617

[78] B. Mochocki, K. Lahiri, S. Cadambi, and X. Hu, “Signature-based workload estimation for mobile 3D graphics,” in 2006 43rd ACM/IEEE Design Automation Conference. IEEE, 2006, pp. 592–597. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=1688866

[79] Y. Gu and S. Chakraborty, “Control theory-based DVS for interactive 3D games,” pp. 740–745, 2008.

[80] Y. Gu and S. Chakraborty, “Power Management of Interactive 3D Games Using Frame Structures,” in 21st International Conference on VLSI Design (VLSID 2008). IEEE, 2008, pp. 679–684. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=4450576

[81] P.-H. Wang, Y.-M. Chen, C.-L. Yang, and Y.-J. Cheng, “A Predictive Shutdown Technique for GPU Shader Processors,” IEEE Computer Architecture Letters, vol. 8, no. 1, pp. 9–12, jan 2009. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=4758617

[82] B. Silpa, G. Krishnaiah, and P. R. Panda, “Rank based dynamic voltage and frequency scaling for tiled graphics processors,” in Proceedings of the eighth IEEE/ACM/IFIP international conference on Hardware/software codesign and system synthesis - CODES/ISSS ’10. New York, New York, USA: ACM Press, oct 2010, p. 3. [Online]. Available: http://dl.acm.org/citation.cfm?id=1878961.1878965

[83] J. G. Park, C. Y. Hsieh, N. Dutt, and S. S. Lim, “Quality-aware mobile graphics workload char- acterization for energy-efficient DVFS design,” in 2014 IEEE 12th Symposium on Embedded Systems for Real-Time Multimedia, ESTIMedia 2014. Institute of Electrical and Electronics Engineers Inc., 2014, pp. 70–79.

[84] A. Pathania, Q. Jiao, A. Prakash, and T. Mitra, “Integrated CPU-GPU Power Management for 3D Mobile Games,” Proceedings of the The 51st Annual Design Automation Conference on Design Automation Conference - DAC ’14, pp. 1–6, 2014. [Online]. Available: http://dl.acm.org/citation.cfm?doid=2593069.2593151

[85] N. H. E. Weste and D. M. Harris, CMOS VLSI Design: A Circuits and Systems Perspective, 3rd ed. Addison-Wesley, 2006.

[86] R. Chadha and J. Bhasker, An ASIC Low Power Primer: Analysis, Techniques and Specification. Springer Science & Business Media, 2012. [Online]. Available: https://books.google.com/books?hl=en&lr=&id=0dNRw6JkPQMC&pgis=1

[87] Q. Wang, P. Ping, X. Zhao, G. Chu, J. Sun, and C. Chen, “Thermal runaway caused fire and explosion of lithium ion battery,” pp. 210–224, 2012.

[88] R. J. Rost, B. Licea-Kane, D. Ginsburg, J. M. Kessenich, B. Lichtenbelt, H. Malan, and M. Weiblen, OpenGL shading language. Pearson Education, 2009.

[89] C. Pötzsch, “Reducing latency in mobile VR by using single buffered strip rendering,” 2016. [Online]. Available: https://www.imgtec.com/blog/reducing-latency-in-vr-by-using-single-buffered-strip-rendering/

[90] A. Schiewe, M. Anstoots, and J. Krüger, “State of the Art in Mobile Volume Rendering on iOS Devices,” in Eurographics Conference on Visualization (EuroVis) - Short Papers, 2015, pp. 139–143. [Online]. Available: http://hpc.uni-due.de/publications/2015/Schiewe_2015_SOTA.pdf

[91] J. Bergman, “Energy efficient graphics: Making the rendering process power aware,” 2010. [Online]. Available: http://www.diva-portal.org/smash/record.jsf?pid=diva2:343747

[92] P.-H. Wang, C.-L. Yang, Y.-M. Chen, and Y.-J. Cheng, “Power gating strategies on GPUs,” ACM Transactions on Architecture and Code Optimization, vol. 8, no. 3, pp. 1–25, oct 2011. [Online]. Available: http://dl.acm.org/citation.cfm?id=2019608.2019612

[93] J. H. Woo, J. H. Sohn, B. G. Nam, and H. J. Yoo, Mobile 3D Graphics SoC: From Algorithm to Chip. Wiley Publishing, 2010.

[94] G. Semeraro, G. Magklis, R. Balasubramonian, D. Albonesi, S. Dwarkadas, and M. Scott, “Energy-efficient processor design using multiple clock domains with dynamic voltage and frequency scaling,” in Proceedings Eighth International Symposium on High Performance Computer Architecture. IEEE Comput. Soc, 2002, pp. 29–40. [Online]. Available: http://ieeexplore.ieee.org/lpdocs/epic03/wrapper.htm?arnumber=995696

[95] V. del Barrio, C. Gonzalez, J. Roca, and A. Fernandez, “ATTILA: a cycle-level execution-driven simulator for modern GPU architectures,” in 2006 IEEE International Symposium on Performance Analysis of Systems and Software. IEEE, 2006, pp. 231–241. [Online]. Available: http://ieeexplore.ieee.org/document/1620807/

[96] D. You and K. S. Chung, “Quality of service-aware dynamic voltage and frequency scaling for embedded GPUs,” IEEE Computer Architecture Letters, vol. 14, no. 1, pp. 66–69, jan 2015. [Online]. Available: http://ieeexplore.ieee.org/document/6808482/

[97] N. Peters, D. Füß, S. Park, and S. Chakraborty, “Frame-based and thread-based power management for mobile games on HMP platforms,” in Proceedings of the 34th IEEE International Conference on Computer Design, ICCD 2016. IEEE, oct 2016, pp. 169–176. [Online]. Available: http://ieeexplore.ieee.org/document/7753277/

[98] A. Munshi, D. Ginsburg, and D. Shreiner, OpenGL ES 2.0 Programming Guide, 2008. [Online]. Available: http://www.amazon.de/dp/0321502795

[99] K. Pulli, “New APIs for Mobile Graphics,” in Electronic Imaging 2006, R. Creutzburg, J. H. Takala, and C. W. Chen, Eds. International Society for Optics and Photonics, feb 2006, pp. 607 401–607 401–13. [Online]. Available: http://proceedings.spiedigitallibrary.org/proceeding.aspx?articleid=1273505

[100] QUALCOMM, “Snapdragon 810 system on chip.” [Online]. Available: https://www.qualcomm.com/products/snapdragon/processors/810

[101] T. Jin, S. He, and Y. Liu, “Towards Accurate GPU Power Modeling for Smartphones,” in Proceedings of the 2nd Workshop on Mobile Gaming - MobiGames ’15. New York, New York, USA: ACM Press, may 2015, pp. 7–11. [Online]. Available: http://dl.acm.org/citation.cfm?id=2751496.2751502

[102] S. Kim and H. Kim, “A new metric of absolute percentage error for intermittent demand forecasts,” International Journal of Forecasting, vol. 32, no. 3, pp. 669–679, jul 2016. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0169207016000121

[103] T. Fushiki, “Estimation of prediction error by using K-fold cross-validation,” Statistics and Computing, vol. 21, no. 2, pp. 137–146, apr 2011. [Online]. Available: http://link.springer.com/10.1007/s11222-009-9153-8

[104] D. McFadden, “Conditional logit analysis of qualitative choice behavior,” Frontiers in Econometrics, pp. 105–142, 1973. [Online]. Available: https://elsa.berkeley.edu/reprints/mcfadden/zarembka.pdf
