POWER CONSTRAINED PERFORMANCE OPTIMIZATION IN CHIP MULTI-PROCESSORS

DISSERTATION

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of

Philosophy in the Graduate School of the Ohio State University

By

Kai Ma, B.S., M.S.

Graduate Program in Electrical and Computer Engineering

The Ohio State University

2013

Dissertation Committee:

Prof. Xiaorui Wang, Advisor

Prof. Füsun Özgüner

Prof. Kevin M. Passino

Prof. Ümit V. Çatalyürek

© Copyright by

Kai Ma

2013

ABSTRACT

With continued technology scaling in the semiconductor industry, both the power density and the power consumption of processors keep increasing. Compared with the traditional approach of raising the clock frequency, integrating more cores on a processor chip offers the opportunity to exploit inter-thread parallelism with better energy efficiency. Processor design has therefore entered the chip multi-processor era. However, the peak power consumption (i.e., power budget or power cap) of a processor is still constrained by the cooling capacity, by power delivery limitations, or by limits specified by users for different management purposes. Accordingly, it is important to optimize performance under power constraints (i.e., power capping). Important as it is, power capping is also challenging. Fundamentally, the performance/power relationship of an application is unknown a priori due to runtime variations, so it is difficult to choose the optimal adjustment from a large space of possible adjustments.

In this document, we investigate several aspects of power capping: considering more components (e.g., the cache) in addition to the traditional core domain, using new knobs (e.g., power gating), managing emerging platforms (e.g., CPU-GPU hybrid systems), and exploiting new cooling technology (e.g., thermoelectric cooling).

First, we explore the opportunity to coordinate the cache and the cores in a CMP (i.e., chip multi-processor). Second, we investigate a scalable power capping algorithm that can leverage the inter-thread dependency of multi-threaded applications for optimized performance. Third, we integrate dynamic voltage and frequency scaling with power gating for power capping while also balancing core-level service lifetime. Fourth, we develop an energy conservation algorithm for CPU-GPU hybrid systems. Fifth, we examine the co-optimization between computational power and the cooling power of new cooling devices. Throughout this document, we focus on the power capping problem but also discuss related energy conservation and thermal issues.

This document is dedicated to my wonderful family.

ACKNOWLEDGMENTS

Without the help of the following people, I would not have been able to complete my dissertation. My heartfelt thanks to:

Dr. Xiaorui Wang, for his guidance. I could not have asked for a better mentor. Without his help, I would not have had the opportunity to change my specialization to

Computer Architecture, nor would I have enjoyed the level of success I have achieved in this area of research.

Dr. Yefu Wang, for his help with the feedback-control-based power control concept that ultimately developed into our Temperature-Constrained Power Control paper.

Dr. Ming Chen, for his help with the writing advice that ultimately developed into our Scalable Power Control paper.

Xue Li, Wei Chen, and Chi Zhang, for their contributions to the GreenGPU project.

VITA

1981 .............. Born in Changchun, Jilin, China

2004 .............. B.S. Information Engineering, Zhejiang University, Hangzhou, Zhejiang, China

2007 .............. M.S. Electrical Engineering, Tongji University, Shanghai, China

2008-2011 ........ Graduate Research Associate, The University of Tennessee, Knoxville, Knoxville, TN, USA

2011-Present ..... Graduate Research Associate, The Ohio State University, Columbus, OH, USA

PUBLICATIONS

Yefu Wang, Kai Ma, and Xiaorui Wang, Temperature-Constrained Power Control for Chip Multiprocessors with Online Model Estimation, the 36th International Symposium on Computer Architecture, June 2009, Austin, Texas, USA

Xiaorui Wang, Kai Ma, and Yefu Wang, Achieving Fair or Differentiated Cache Sharing in Power-Constrained Chip Multiprocessors, the 39th International Conference on Parallel Processing, September 2010, San Diego, California, USA

Kai Ma, Xue Li, Ming Chen, and Xiaorui Wang, Scalable Power Control for Many-Core Architectures Running Multi-threaded Applications, the 38th International Symposium on Computer Architecture, June 2011, San Jose, California, USA

Kai Ma, Xiaorui Wang, and Yefu Wang, DPPC: Dynamic Power Partitioning and Capping in Chip Multiprocessors, the 29th International Conference on Computer Design, October 2011, Amherst, Massachusetts, USA

Kai Ma, Xue Li, Wei Chen, Chi Zhang, and Xiaorui Wang, GreenGPU: A Holistic Approach to Energy Efficiency in GPU-CPU Heterogeneous Architectures, the 41st International Conference on Parallel Processing, September 10-13, 2012, Pittsburgh, PA, USA

Kai Ma and Xiaorui Wang, PGCapping: Exploiting Power Gating for Power Capping and Core Lifetime Balancing in CMPs, the 21st International Conference on Parallel Architectures and Compilation Techniques, September 19-23, 2012, Minneapolis, MN, USA

Xiaorui Wang, Kai Ma, and Yefu Wang, Cache Latency Control for Application Fairness or Differentiation in Power-Constrained Chip Multiprocessors, IEEE Transactions on Computers, 61(12): 1-15, December 2012

Xiaorui Wang, Kai Ma, and Yefu Wang, Adaptive Power Control with Online Model Estimation for Chip Multiprocessors, IEEE Transactions on Parallel and Distributed Systems, 22(10): 1681-1696, October 2011

Kai Ma, Xiaorui Wang, and Yefu Wang, DPPC: Dynamic Power Partitioning and Control for Improved Chip Multiprocessor Performance, IEEE Transactions on Computers, 2013 (accepted)

FIELDS OF STUDY

Major Field: Electrical and Computer Engineering

Specialization: Computer Systems and Architecture

TABLE OF CONTENTS

Abstract

Dedication

Acknowledgments

Vita

List of Figures

1 Introduction and Background
  1.1 Power Wall
  1.2 Chip Multi-processors
  1.3 Power Capping
  1.4 Contributions

2 Scalable Many-Core Power Control for Multi-threaded Applications
  2.1 Introduction
  2.2 Background
  2.3 System Architecture
  2.4 Chip-level Power Control
  2.5 Dynamic Aggregated Frequency Partitioning
    2.5.1 Chip-level Partitioning
    2.5.2 Group-level Partitioning
  2.6 Core-level Power Estimation on Physical Testbed
  2.7 Implementation
    2.7.1 Testbed
    2.7.2 Simulation Environment
    2.7.3 Discussion on Hardware Implementation
  2.8 Evaluation
    2.8.1 Baselines
    2.8.2 Estimation Accuracy
    2.8.3 Testbed Results
    2.8.4 Simulation Results
    2.8.5 Discussion on Algorithm Complexity and Scalability
  2.9 Conclusion

3 Power Gating for Power Capping and Core Lifetime Balancing
  3.1 Introduction
  3.2 Background
  3.3 System Design
    3.3.1 Design of PCPG Management Module
    3.3.2 Design of DVFS Management Module
    3.3.3 Lifetime Balancing
  3.4 Implementation
    3.4.1 Power Capping Evaluation Testbed
    3.4.2 Lifetime Balancing Evaluation Simulator
  3.5 Evaluation
    3.5.1 Baselines
    3.5.2 Power Control Accuracy
    3.5.3 Application Performance
    3.5.4 Lifetime Balancing
  3.6 Conclusion

4 Energy Efficiency in GPU-CPU Heterogeneous Architectures
  4.1 Introduction
  4.2 Background
  4.3 Motivation
    4.3.1 A Case Study on Frequency Scaling for GPU Cores and Memory
    4.3.2 A Case Study on Workload Division between GPU and CPU
  4.4 System Design of GreenGPU
  4.5 GreenGPU Algorithms
    4.5.1 Dynamic Frequency Scaling for GPU Cores and Memory
    4.5.2 Workload Division
  4.6 Implementation
  4.7 Experiments
    4.7.1 Frequency Scaling for GPU Cores and Memory
    4.7.2 Workload Division between GPU and CPU
    4.7.3 GreenGPU as a Holistic Solution
  4.8 Conclusion

5 Integrating Thermoelectric Coolers and Fans for Energy Efficiency
  5.1 Introduction
  5.2 Background
  5.3 System Design
    5.3.1 Thermal Model
    5.3.2 Power and Performance Model
    5.3.3 Problem Formulation
    5.3.4 Heuristic Solution
    5.3.5 Hardware Cost
    5.3.6 Per-core DVFS Assumption
  5.4 Simulation Setup
  5.5 Experiments
    5.5.1 Baseline Results
    5.5.2 Studied Policies
    5.5.3 Integrating TEC with Fan
    5.5.4 Cooling Performance
    5.5.5 System Performance
  5.6 Conclusions

6 Conclusions

Bibliography

LIST OF FIGURES


2.1 Three-layer power control architecture for a 16-core chip multiprocessor. Cores running the same multi-threaded applications are grouped together. Idle cores (e.g., C9) are transitioned into a low power mode.

2.2 Power estimation accuracy experiments on a 12-core hardware testbed.

2.3 Power control accuracy comparison. In (a)-(c), the frequencies are relative to the peak of a selected core. In (d), the power values are relative to the peak power in each test case.

2.4 Group-level (thread criticality-aware) frequency quota (i.e., sum of normalized DVFS levels) allocation traces of FreqPar and the baselines.

2.5 Chip-level (power efficiency-aware) frequency quota (i.e., sum of normalized DVFS levels) allocation traces and power efficiency of FreqPar and the baselines.

2.6 Overall performance comparison between FreqPar and the baselines on a 12-core hardware testbed.

2.7 Power and performance comparison in simulations under different numbers of cores.

2.8 Execution time experiments show that FreqPar is more scalable than Steepest Drop.

3.1 Decoupled design uses the power budget, chip power measurement, per-core utilization, temperature, and lifetime as inputs. It computes the next-step power mode (e.g., on/off, DVFS levels, overclocking state) of each core to cap the entire chip power, boost performance, and balance the lifetime.

3.2 Quicksearch algorithm flowchart. Only the power-higher-than-budget case is presented for concision.

3.3 Decoupled solution PGCapping can precisely enforce the power budget by using PCPG, DVFS, and overclocking in both the high and low power budget cases. We calculate the average power with a 2s window as Pavg to clearly present the general trend. Hardware testbed results.

3.4 Decoupled solution PGCapping can reserve power headroom by using PCPG and accelerate cores running useful workloads by using overclocking. Hardware testbed results. The frequencies are normalized to the peak frequency of one core (i.e., relative freq). The Freq/# is calculated by dividing the total aggregated relative frequency of the entire chip by the number of turned-on cores, which can be interpreted as the high-level computing capability that each turned-on core can offer (higher is preferable). We also calculate the average Freq/# with a 2s window as Fa/#.

3.5 PGCapping achieves very close results to Balanced (a best-effort per-core-DVFS-based lifetime balancing). PGCapping outperforms the Random and Round-robin baselines. Simulation results.

4.1 Normalized execution time is the execution time of a workload normalized to its execution time at the peak frequency. Relative energy is the energy normalized to the energy consumed at the peak frequency. There are opportunities to save energy with negligible performance loss by throttling under-utilized components.

4.2 Energy consumption for different workload division ratios. The cooperation of the CPU and GPU parts can be more energy efficient than the GPU part taking all the work exclusively.

4.3 GreenGPU features a two-tier design to reduce the energy consumption of CPU-GPU heterogeneous platforms. The higher tier (i.e., the workload division tier) dynamically partitions the incoming workloads to the CPU and GPU parts. The dashed lines connect the components of the workload division part. The lower tier (i.e., the frequency scaling tier) takes the utilization of processing elements (GPU cores, GPU memory, and CPUs) to decide their proper frequency levels to reduce energy consumption. The dotted lines connect the components of the frequency scaling part.

4.4 Hardware testbed used in our experiments, which includes a Dell Optiplex 580 desktop with an Nvidia GeForce GPU and an AMD Phenom II CPU, two power meters, and one separate ATX power supply to power the GPU card. Meter1 measures the power of the CPU side, while Meter2 measures the power of the GPU side.

4.5 Frequency scaling algorithm adjusts the frequencies of GPU cores and memory based on their respective utilizations to save energy without increasing execution time.

4.6 Energy saving compared with best-performance for different workloads.

4.7 Workload division algorithm adjusts the workload allocation between the CPU and GPU parts to minimize idling energy on either side caused by waiting for the other (slower) side.

4.8 Energy and workload division ratio trace with respect to the iterations. GreenGPU outperforms workload division only and frequency scaling only on energy savings.

5.1 The side view of the target chip packaging and the TEC cooling effect. TECs are embedded between the heat spreader and the processor chip in the thermal interface material (TIM) layer. By applying current to the TEC, heat can be pumped from one side of this film device to the other. iTECool coordinates the fan and multiple TECs to improve the overall cooling efficiency. In addition, iTECool also coordinates the DVFS level of each core and the cooling subsystem (TECs and fan) to reduce the energy consumption of the entire system (i.e., processor, fan, and TECs).

5.2 Multi-step down-hill greedy algorithm (Co-op) flow chart. Based on the thermal, power, and performance models (Equations (10)-(17)), Co-op estimates the next-step energy if a certain adjustment is used; it then selects the adjustment that has the smallest energy consumption within the temperature constraint. If the current temperature is lower than the threshold, Co-op compares the energy of turning off a TEC and raising the DVFS level of one core; if the current temperature is higher than the threshold, Co-op compares the energy of turning on a TEC and lowering the DVFS level of one core. Co-op moves forward multiple steps along the smallest-energy adjustment direction until the temperature constraint is achieved.

5.3 Simulated processor floor plan. The chip floor plan is scaled based on the SCC 48-core chip [49]. A core tile is half the size of the dual-core tile on SCC. The component placement and relative size are scaled with the Alpha 21264. The router size is the same as the SCC on-chip router. We estimate the voltage regulator size based on a 0.5W delivered power/mm2 measurement on a prototype on-chip regulator [62]. Chip floor plan: 10.4mm×14.4mm, a 4×4 core tile array. Core tile floor plan: 2.6mm×3.6mm.

5.4 TEC+Fan vs. Dynamic-fan: temperature and cooling power comparison between Dynamic-fan and TEC+fan. Using the 1st (highest) fan speed level can achieve much better cooling than using the 2nd fan speed level. However, using TEC and the 2nd fan speed can achieve a cooling effect very close to that of the 1st fan speed level. In addition, the combined cooling power consumption of using TEC and the 2nd fan speed is much lower than running the fan at the 1st speed level. That is due to the cubic dependence of fan power consumption on fan speed [4].

5.5 Cooling performance comparison. We set the highest temperature of the baseline cases as Tth in each experiment. Co-op consistently offers the lowest maximum temperature in the studied cases. Co-op also has the smallest Tth violation.

5.6 Execution performance comparison. Due to the cubic dynamic power reduction of DVFS at a linear performance degradation cost, DVFS+fan has the lowest energy usage. However, it has the longest delay. Since Co-op gives priority to performance, it reduces power on the condition of not sacrificing too much performance. Therefore, Co-op achieves the lowest EDP.

5.7 Total relative cycles comparison. This metric shows the slow-down introduced by applying DVFS. DVFS introduces 26% and 27% slow-down on DVFS+fan and DVFS+TEC, respectively. However, from Figure 5.6, the delays of DVFS+fan and DVFS+TEC are more than 50%. The performance gap can be introduced by the inter-thread correlation. Throttling one core without considering the other cores will make one thread become the slowest thread, increasing the total execution time.

5.8 Number of active cores sensitivity. We deploy 4 threads running on our simulator.

5.9 Temperature threshold sensitivity. We set Tth as running 16 threads at the second fan level.

CHAPTER 1

INTRODUCTION AND BACKGROUND

The broad focus of this document is the performance optimization of computer systems with the power consumption as a primary constraint. This chapter provides an overview of the problem that we target.

1.1 Power Wall

For half a century, Moore's Law [106] has driven the technology scaling in the semiconductor industry. The number of components in integrated circuits has doubled every eighteen months. In theory, if the supply voltage of CMOS scaled with lithographic dimensions, the process scaling would have introduced faster and lower energy gates.

The switching energy reduction can match the increased energy from having more gates and having them switch faster, so the power density (i.e., power per unit area) stays constant. This analysis has been summarized as classic Dennard Scaling [28]. However, in reality, the supply voltage has practically stopped declining, mainly for two reasons: 1) the gate switching delay does not decrease at the same rate as the geometric feature size decreases, which means we cannot lower the voltage at the same rate as the feature size shrinks; 2) lowering the supply voltage, combined with lowering the feature size, reduces the circuit's robustness to process variations (i.e.,

parameter deviation from the designed nominal value). Therefore, both the absolute power consumption and the power density of processors have kept increasing.

In embedded computing and other battery-powered devices, battery capacity advancement lags far behind the exponential scaling pace of the semiconductor industry. Such a lag makes battery lifetime an even more important design constraint; the battery lifetime limits the power consumption of embedded systems. In desktop environments, the cooling capacity determines the power dissipation of the system. In data center servers, the huge electricity bill required by the servers and the related cooling devices is one of the key concerns for data center service providers. Power and related issues have become the key limiter for computer system advancement across the entire spectrum, a situation summarized as the Power Wall [83]. Therefore, our study takes power and related issues as the primary constraint in system performance optimization.

1.2 Chip Multi-processors

The ever-growing demand for higher computing throughput requires processors to increase their operating frequency and the number of working units. If we keep increasing the operating frequency of processors, we need to maintain a high supply voltage to ensure reliable transistor switching, which increases the power density. Since the cooling capacity of computer systems has already been limited by cost, processor vendors have universally shifted technology advancement from increasing the operating frequency to integrating more cores on one chip, because CMPs (i.e., chip multi-processors) offer the possibility to exploit inter-thread parallelism and allow increasing throughput without increasing the power density. Therefore, processor builders started to pack more and more cores on one processor chip. Computer

systems entered the CMP era [34]. To push the CMP idea even further, some hardware (e.g., Nvidia GPUs) integrates hundreds of simple cores on one chip to maximize throughput.

1.3 Power Capping

However, increasing the throughput of CMPs still requires consuming more power.

The peak power consumption is still constrained by the cooling capacity, by power delivery limitations, or by limits specified by users for different management purposes. A key concept related to the cooling limit is the thermal design power (TDP) [82]. TDP is a key parameter in packaging design: as long as the power dissipation of the entire chip is under the TDP, the packaging design can guarantee that the chip will not overheat in most cases. Therefore, we normally assume that the TDP is the power upper bound in terms of thermal issues. The power delivery limit is constrained primarily by the power pins on the processor [113]. Due to the fixed area of a processor, off-chip communication and power delivery compete for pin resources. Therefore, the limited number of power pins is expected to form an even tighter power budget constraint than the cooling capacity in the near future [113]. In addition to cooling and power delivery limitations, users might want to assign a power budget to a computer system at runtime to enable server oversubscription (i.e., safely deploying more servers within a fixed power/cooling infrastructure in a data center environment) or to enforce a power budget cut (i.e., assigning a very tight power budget to the system to address a temporary/partial cooling failure). In summary, the power budget of a computer system comes from the cooling limit, the power delivery limit, or user specification.

Due to these limitations, it is important to study performance optimization under power constraints (i.e., power capping).

1.4 Contributions

This document presents several novel solutions to power/thermal-constrained performance optimization problems.

1. A scalable power control method that dynamically tunes the frequency allocation among cores in a many-core processor to maintain a fixed power budget as well as to improve performance.

2. A fast power control technique that integrates per-core DVFS and power gating for improved performance and service lifetime.

3. A practical power management system that coordinates the CPU and GPU in a high-performance server to improve the energy efficiency of the entire system.

4. An intelligent online management system that adjusts the thermoelectric coolers, the cooling fan(s), and the DVFS level of each core on a CMP to keep the temperature under an assigned threshold as well as to conserve energy.

CHAPTER 2

SCALABLE MANY-CORE POWER CONTROL FOR

MULTI-THREADED APPLICATIONS

2.1 Introduction

Power dissipation has become a first-class constraint in current microprocessor design. As the gap between peak and average power widens with the rapidly increasing level of core integration, it is important to control the peak power of a many-core microprocessor to allow improved reliability and reduced costs in chip cooling and packaging. Therefore, compared with the extensively studied power minimization problem, an equally, if not more, important problem is to precisely control the peak power consumption of a many-core microprocessor to stay below a desired budget level while optimizing its performance.

Scalability is the first key challenge in controlling the power consumption of a many-core microprocessor. While various power control solutions have been proposed for multi-core microprocessors (e.g., [54, 84, 124]), the majority of current solutions relies on centralized decision making and thus cannot be applied directly to many-core systems. For example, the MaxBIPS policy [54] uses an exhaustive search to find a combination of DVFS (Dynamic Voltage and Frequency Scaling) levels for all the cores of a microprocessor. The search is predicted to result in the best application performance while maintaining the power of the chip below the budget. While this

solution works effectively for microprocessors with only a few cores, MaxBIPS does not scale well because the number of possible combinations increases exponentially with the number of cores. Therefore, highly scalable approaches need to be developed for many-core architectures.

The requirement to host multi-threaded applications is the second challenge for many-core power control. Although a few recent studies [127, 105, 88] present scalable control algorithms for many-core architectures based on per-core DVFS, they do not consider multi-threaded parallel applications and assume that the workload of every core is independent. As a result, these solutions may unnecessarily decrease the DVFS levels of the CPU cores running the critical threads in barrier-based multi-threaded applications. The lack of knowledge of thread criticality can exacerbate the load imbalance in multi-core microprocessors and thus lead to unnecessarily long application execution times and undesired barrier stalls. This issue is particularly important for many-core architectures whose primary workloads are expected to be multi-threaded applications. Furthermore, many-core systems are likely to simultaneously host a mixed group of single-threaded and multi-threaded applications, due to the increasing trend of server consolidation, to fully utilize the core resources [80, 12].

Therefore, a power control algorithm must be able to handle such realistic workload combinations and utilize thread criticality to efficiently allocate power among the cores that are running different applications.

Another major challenge in multi-core or many-core power control is accurate power monitoring [99]. Although the power consumption of a microprocessor can be measured by sensing the current fed into the chip [125], direct power measurement of a single core on a multi-core or many-core die is not yet available. On-die current sensors have been proposed, but have rarely been used in production due to problems such as area and performance overhead and calibration drift introduced by process

variations [18]. It is possible to estimate the core power at runtime by counting the component utilizations (e.g., cache accesses) and computing power based on a per-component power model. However, such direct computation of core and structure power at runtime is complex due to the large number of performance statistics required [125]. Since many-core systems are expected to have many simple cores [12], it may not be desirable to adopt an approach that requires much extra hardware and statistics collection. Recently, Kansal et al. [58] have shown that the CPU power consumption of each Virtual Machine (VM) on a server can be estimated by adaptively weighting only one metric (CPU utilization) of each VM. However, they did not explicitly consider the impact of DVFS on their model despite the fact that the power consumption is different under different DVFS levels even for the same application.

We propose to extend their work to estimate the power consumption of each core in a DVFS environment by taking both DVFS level and utilization into consideration. As a result, many-core power control can be evaluated on a real hardware platform instead of just by simulations as in previous work [127, 105].

In this chapter, we propose a novel and highly scalable power control solution for many-core microprocessors that is specifically designed to handle realistic workload combinations. Our control solution features a three-layer design. First, we adopt control theory to precisely control the power of the whole chip to its chip-level budget, with theoretically guaranteed accuracy and stability, by adjusting the aggregated frequency quota of all the cores on the chip. In a DVFS-enabled system, aggregated frequency is defined as the summation of the DVFS levels of all the cores normalized to the peak DVFS level of one core. Second, we dynamically group cores running the same applications and then partition the aggregated chip-level frequency quota derived from the chip-level power controller among different groups for optimized overall microprocessor performance. Finally, we partition the group-level aggregated

frequency quota among the cores in each group based on measured thread criticality for a shorter application completion time. As a result, our solution can optimize the processor performance while precisely limiting the chip-level power consumption below the desired budget. Specifically, this chapter makes the following major contributions:

• We propose a highly scalable power control solution for many-core architectures running multi-threaded applications. Our solution partitions the limited chip-level power budget among different applications and cores based on measured application performance and thread criticality.

• We adopt feedback control theory as a theoretical foundation to control the power consumption of a many-core chip to its desired power budget. This rigorous design methodology is in sharp contrast to heuristic-based solutions that rely on extensive manual tuning.

• Since the power consumption of a core cannot be directly measured in real multi-core microprocessors, we extend the technique of estimating the power consumption of a VM on a physical server to estimate the power consumption of a CPU core and validate the estimation model on a hardware testbed.

• We implement our control solution on a 12-core AMD processor and present empirical results to demonstrate that our solution achieves better application performance within a given power budget than two state-of-the-art solutions. Our extensive simulation results with 32, 64, and 128 cores, as well as overhead analysis for up to 4,096 cores, demonstrate the scalability of our solution in many-core architectures.

The rest of this chapter is organized as follows. Section 2.3 discusses the system architecture of our control solution. Section 2.4 presents the chip-level power controller design. Section 2.5 describes dynamic aggregated frequency quota partitioning at the chip and group levels. Section 2.6 presents the per-core power estimation technique.

Section 2.7 introduces our hardware testbed, simulation setups, and the implementation details of our solution. Section 2.8 presents our evaluation results. Section 2.2 discusses the related work and Section 2.9 concludes this chapter.

2.2 Background

Power dissipation has been one of the major design concerns for computing systems. Much prior work has focused on minimizing the power consumption within a specified performance guarantee. For example, Li et al. [70] propose a solution called thrifty barrier that places the faster cores into a lower power mode at the barriers (i.e., join points) while waiting for the slower cores so that power can be saved. Liu et al. [74] use per-core DVFS to slow down the faster cores, such that both the idle time due to waiting and the power consumption are reduced. Cai et al. [17] extend [74] by adding meeting points within the execution of the parallel loops and solve the same problem at a finer granularity. However, none of these solutions can provide any explicit guarantee for the power consumption to stay below a desired budget, though the performance is guaranteed to some extent. Our work is different in that we focus on a different, but equally important, problem, i.e., power capping to avoid power overload or thermal violations and prevent over-provisioning of cooling, packaging, and power supply capacities at processor design time. Some work has been performed to manage peak power or temperature for CMPs.

Intel Foxton technology [82] has successfully controlled the power and temperature of a microprocessor using chip-wide DVFS. Isci et al. [54] propose a closed-loop algorithm called Priority and a prediction-based algorithm called MaxBIPS to limit the power of a CMP. Wang et al. [124] also apply advanced control theory to develop a

power control algorithm for improved CMP performance. However, the application of these solutions on many-core systems is prohibited either by the exponential explosion of the number of possible global power management states in many-core architectures, e.g., [82, 54], or by the high control delay and computation overhead due to centralized decision making, e.g., [124]. As a result, none of them are scalable to the large number of cores in many-core architectures.

A recent study by Winter et al. [127] presents a global power management algorithm called Steepest Drop for many-core systems with a light overhead. Sartori et al. [105] discuss using a hierarchical structure to cap the power of many-core systems. Another related piece of work by Mishra et al. [87, 88] uses absolute BIPS to allocate the chip power budget to each power island and performs per-island power control. However, these solutions assume that the workloads on all the cores are independent. Therefore, they may overlook the coupling of workloads among the cores and result in degraded system performance. In contrast, our highly scalable solution can dynamically shift the power budget among the groups of cores that host different applications based on power efficiency, and then further among all the cores in the same group that host the coupled threads from the same application based on thread criticality.

2.3 System Architecture

In this section, we present a high-level description of our three-layer power control solution. As shown in Figure 2.1, in the first layer, the chip-level power controller controls the power consumption of the whole chip to the chip power budget by adjusting the aggregated frequency quota (i.e., summation of normalized DVFS levels) of all the cores. The second layer, i.e., the chip-level frequency quota partitioning layer, partitions the chip-level aggregated frequency quota among the groups of cores, which

host different applications proportionally to a metric called power efficiency (defined in Section 2.5.1). The third layer, i.e., the group-level frequency partitioning layer, further partitions the group aggregated frequency quota among all the cores in each group, which host coupled threads of the same application, based on thread criticality (defined in Section 2.5.2). The aggregated frequency quota is first partitioned among different applications (i.e., groups of cores) and then partitioned among coupled threads (i.e., individual cores) to achieve optimized performance. As a result, if the aggregated frequency quota of every core is enforced, the power of the entire chip can be controlled to stay within the desired power budget. In this chapter, we adopt

DVFS to enforce the frequency quota of each core, but our solution can also work with other frequency scaling techniques such as clock modulation. We assume that the frequency of each core can be adjusted individually in future many-core systems based on various industry practices and research studies [127, 105]. For example, IBM and AMD have implemented per-core DVFS on commercial massive multi-core microprocessors (POWER7 8-core and Opteron 12-core systems). Moreover, Intel has implemented per-tile DVFS on its 24-tile many-core experimental chips [48]. In addition, a 167-core computational platform with per-core DVFS support has been implemented recently [119]. Even in the systems without physically implemented per-core DVFS (e.g., multi-power-island chips), Rangan et al. [102] have shown that thread migration on systems with only two power states can be used to approximate the functionality of continuous, per-core DVFS.

As shown in Figure 2.1, the key components in the chip-level power control layer include a power controller and a power monitor. The following steps are invoked at the end of every control period: 1) the power monitor (e.g., an on-board power measurement circuit [125]) measures the power consumption of the chip in the last control period and sends the value to the power controller and 2) the power controller

Figure 2.1: Three-layer power control architecture for a 16-core chip multiprocessor. Cores running the same multi-threaded applications are grouped together. Idle cores (e.g., C9) are transitioned into a low power mode.

computes the new aggregated frequency quota for all the cores of the chip based on the desired power budget and measured power consumption. The aggregated frequency quota is then partitioned to optimize the system performance in the partitioning layers. The key components in the chip-level frequency quota partitioning layer include a single chip-level partitioner and an IPS (Instructions Per Second) counter on each core. In order to effectively partition the power budget, we need to be able to calculate the power efficiency of each core. We adopt IPS/Watt as our power efficiency metric, which has been used by Intel for this purpose [42]. The chip-level frequency quota is partitioned among multiple groups of cores periodically. At the end of each control period, the partitioner collects the grouping information of all the cores based on the

OS scheduler (details are described in Section 2.5.1). Each group of cores hosts all the threads of the same application. If a group consists of only one core, we refer to it as a single-threaded group; otherwise, we refer to it as a multi-threaded group. If a core is idle, we transition it to a low-power mode. For example, Cores 1, 2, 5, and 6 are grouped together since they run four threads of a parallel application based on

the scheduling information from the OS. Core 9 is transitioned into the low-power mode since it is idle. The chip-level partitioner computes the power efficiency based on the

IPS and the estimated power of each core, then calculates the overall power efficiency of each group by summing up the efficiency of each core in the group. The chip-level partitioner partitions the aggregated frequency quota of the entire chip among the groups proportionally to the overall power efficiency of each group. Note that since the power control period at the chip level can be configured shorter than the OS scheduling period, we assume the mapping between the threads and cores does not change within each control period. The same assumption has been made in [127].

The group-level frequency quota partitioning layer includes a group-level partitioner in each group, a criticality counter, and a DVFS modulator in each core. At the end of each control period, the criticality counter on each core monitors the criticality metric (defined in Section 2.5.2) and forwards it to the partitioner. The partitioner receives the allocated group frequency quota from the chip-level partitioner and partitions the frequency quota among all the cores in the group based on the thread criticality of each core. Then, the DVFS modulator of each core changes the DVFS level of the core accordingly.

Because the computation of the controller may change the overall aggregated frequency quota and the recalculation of the chip-level partitioner may change the group aggregated frequency quota, the three layers run sequentially at the end of every control period. Figure 2.1 shows a possible implementation in which the three layers are integrated as firmware on the service processor, similar to IBM POWER7's power control module [125]. We also discuss other implementation possibilities in Section

2.7.3.

2.4 Chip-level Power Control

In this section, we introduce the chip-level power controller that controls the power consumption of the entire chip to the desired power budget by adjusting the aggregated frequency quota (i.e., summation of normalized DVFS levels). A key advantage of the control-theoretic design approach is that it can tolerate a certain degree of modeling errors and adapt to online model variations based on dynamic feedback [36].

Therefore, our solution does not rely on power models that are perfectly accurate, which is in sharp contrast to open-loop solutions that would fail without an accurate model.

We first introduce some notation. Tc is the control period. M is the number of cores on this chip. cp(k) is the power consumption of the entire chip in the kth control period. f(k) is the total aggregated frequency of all the cores on the chip in the kth control period. The dynamic range of f(k) is L × M ≤ f(k) ≤ M, relative to the peak of one core, where L is the lowest available DVFS level normalized to the peak level. We assume that our target system is a homogeneous-core system, which is the dominant configuration of current multi-core and many-core systems [48, 120, 5].

However, extending to heterogeneous-core systems is straightforward by scaling f(k). For example, if we have a more powerful core in the system along with the normal cores, instead of taking the dynamic range of the more powerful core as L to 1 like a normal core, we count it as L to H. Both L and H are derived by scaling the available DVFS levels of the powerful core to the peak DVFS level of a normal core.

Δf(k) = f(k+1) − f(k). Pt is the power budget of the whole chip, which can be determined by the thermal and power supply constraints of the processor or specified by the user at runtime. e(k) is the control error; specifically, e(k) = Pt − cp(k).

The control goal is to direct cp(k) to converge to Pt within a certain number of control periods by adjusting f(k).

System Modeling. We now model the dynamics of the controlled system, namely the relationship between the controlled variable, i.e., cp(k), and the manipulated variable, i.e., f(k). Existing studies by both Raghavendra et al. [101] and Wang et al. [123] have shown that the processor power can be modeled as an approximately linear function of the DVFS level within the limited DVFS adaptation range available in real multi-core processors. In this chapter, the power consumption of a processor is modeled similarly as:

cp(k) = a Δf(k − 1) + cp(k − 1),    (2.4.1)

where a is a generalized parameter that may vary for different chips and applications. a is also the scaling factor that characterizes the impact of a DVFS change on the chip power. In our design, we derive a by dividing the data sheet full power range (from the idle power to the maximum power of the chip [3]) by the dynamic range of f(k). We conducted a stability analysis [36] on our controlled system. The results show that the system remains stable as long as the actual value of a lies between 0 and twice the value used at design time. Since we use the maximum possible a at design time, the variation of a can never exceed this range. The control loop is theoretically guaranteed to converge to the set point for all possible workloads.

Controller Design. Proportional-Integral (PI) control can provide robust control performance despite considerable modeling errors. Based on the system model (2.4.1), we design a PI controller as follows:

f(k) = f(k − 1) + K1 e(k) − K1 K2 e(k − 1).    (2.4.2)

Following the standard pole placement method [36], we can choose our control parameters as K1 = 1/a and K2 = 0, such that the controlled system is stable and has a zero steady-state error. The detailed steps can be found in a standard

control textbook and are skipped due to space limitations. The desired aggregated frequency quota of all the cores on the chip in the kth control period can be computed accordingly as:

f(k) = f(k − 1) + (Pt − cp(k − 1)) / a.    (2.4.3)
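To make the control law concrete, the following C sketch shows one way the per-period update of Equation (2.4.3) could be coded (for example, inside an OS daemon such as the one described in Section 2.7.1). The structure, function names, gain value, and power trace below are hypothetical placeholders for illustration, not the daemon actually used in our experiments.

    #include <stdio.h>

    /* Chip-level power controller state (illustrative sketch).
     * All names and constants here are hypothetical placeholders. */
    typedef struct {
        double budget;   /* Pt: chip power budget (W)                        */
        double a;        /* model gain: watts per unit of aggregated freq    */
        double f;        /* current aggregated frequency quota (sum of
                            normalized DVFS levels, L*M <= f <= M)           */
        double f_min;    /* L * M */
        double f_max;    /* M     */
    } chip_ctrl_t;

    /* One control-period update, following Equation (2.4.3):
     * f(k) = f(k-1) + (Pt - cp(k-1)) / a, clamped to the feasible range. */
    static double ctrl_update(chip_ctrl_t *c, double measured_power)
    {
        double e = c->budget - measured_power;   /* control error e(k) */
        c->f += e / c->a;
        if (c->f < c->f_min) c->f = c->f_min;
        if (c->f > c->f_max) c->f = c->f_max;
        return c->f;                             /* quota passed to the partitioners */
    }

    int main(void)
    {
        /* 12 cores, lowest DVFS level 0.4 of peak, assumed gain a = 5 W per
         * unit of aggregated frequency, 80 W budget; power trace fabricated. */
        chip_ctrl_t c = { .budget = 80.0, .a = 5.0, .f = 12.0,
                          .f_min = 0.4 * 12, .f_max = 12.0 };
        double measured[] = { 95.0, 88.0, 83.0, 80.5 };
        for (int k = 0; k < 4; k++)
            printf("period %d: quota = %.2f\n", k, ctrl_update(&c, measured[k]));
        return 0;
    }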

2.5 Dynamic Aggregated Frequency Partitioning

We first introduce the details of the aggregated frequency (i.e., summation of normalized DVFS levels) partitioning schemes at the chip and group levels.

2.5.1 Chip-level Partitioning

A many-core microprocessor may host multiple applications simultaneously. For example, a virtualized many-core system may host multiple VMs and each VM may host a different application. If the power budget is limited, different power allocations among the applications may lead to different system performance. Achieving high overall performance is one of the most fundamental goals for many-core systems [12]. The goal of chip-level partitioning is to dynamically partition the chip-level aggregated frequency quota computed in the chip power controller (Equation

(2.4.3)) among different applications, such that we can achieve optimized system performance. In this chapter, we use Fair Speedup (FS) as the performance indicator. The FS of a partitioning scheme is defined as the harmonic mean of the per-application speedups with respect to the equal resource share case (i.e., peak frequency for all applications) [24, 10]. The FS achieved by a scheme can be expressed as FS(scheme) = Na / Σ_{i=1}^{Na} (ETappi(scheme) / ETappi(base)), where ETappi(scheme) is the execution time of the ith application under a certain power management scheme, and ETappi(base) is the execution time of running the ith application at the peak frequency level all the

time. Na is the number of applications in the system, i.e., the number of applications that execute together. FS is an indicator of the overall improvement in execution efficiency gained across the applications. It is also a metric of fairness. In the following sections, we first introduce how to group the cores that run the threads of the same application based on the scheduling information in the OS. We then present the aggregated frequency quota partitioning among the groups.
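As a concrete illustration of the FS metric, the short C sketch below computes FS from per-application execution times; the values and the function name are fabricated for the example.

    #include <stdio.h>

    /* Fair Speedup sketch: FS = Na / sum_i (ETappi(scheme) / ETappi(base)).
     * The execution times below are fabricated for illustration. */
    static double fair_speedup(const double *et_scheme, const double *et_base, int na)
    {
        double sum = 0.0;
        for (int i = 0; i < na; i++)
            sum += et_scheme[i] / et_base[i];
        return (double)na / sum;
    }

    int main(void)
    {
        double et_scheme[] = { 12.0, 20.0, 9.0 };   /* seconds under a power cap */
        double et_base[]   = { 10.0, 15.0, 8.0 };   /* seconds at peak frequency */
        printf("FS = %.3f\n", fair_speedup(et_scheme, et_base, 3));
        return 0;
    }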

Core Grouping

In many-core microprocessors, different threads run simultaneously on different cores.

We place the cores that host the threads of the same application into a group. Therefore, the number of groups is equal to the number of applications running on all the cores. The benefit of core grouping is to reduce the coupling of the power demand among different applications.

In this project, we assume that the mapping between the threads and cores does not change within a certain period (i.e., the scheduling period). Since the scheduling interval in operating systems is in tens of milliseconds, if we conduct the power control in a shorter period, this assumption is valid. The same assumption is made in [127, 6, 10, 17, 84]. At the end of each scheduling period, the chip-level frequency partitioner may collect the grouping information from the OS. If we implement the algorithm as a loadable kernel module of the OS, the grouping information can be derived from a system function. If we implement the controller as a piece of hardware on the chip, this information exchange between hardware and software can be achieved by adding special-purpose registers on the chip. If the proposed solution is implemented as firmware running on the service processor on the motherboard, the information exchange between the main processor and the service processor can be achieved via external ports [125].

Aggregated Frequency Partitioning

Before we discuss the chip-level power partitioning, we first introduce some notation.

A many-core microprocessor has N groups of cores, and group i runs application i, where 1 ≤ i ≤ N. IPSi is the average IPS of group i when running the ith application on the many-core microprocessor without any power constraint. IPSi can be derived by conducting application profiling on the desired number of cores at the peak DVFS level and then calculating the average IPS of each core. Note that the profiling is only performed once for each application on the desired number of cores. The OS can send IPSi to the controller via on-chip registers. ipsi(k) is the measured IPS of the ith group. WTi(k) is the estimated power of the ith group. Since each group may consist of multiple cores, ipsi(k) and WTi(k) are the accumulated IPS and power of all the cores in the ith group.

To achieve optimized overall performance, the aggregated frequency quota partitioned among different groups should be proportional to the ratio between the performance and the power consumption (i.e., ipsi(k)/WTi(k)). However, this may lead to the following problem. Some applications intrinsically have a low IPS even without any power constraint. Partitioning power based on IPS is unfair to those applications if they run simultaneously with other applications that have intrinsically high IPSs.

To address this problem, we use the relative IPS, ripsi(k), as the performance metric in this chapter, which is the measured IPS ipsi(k) normalized to IPSi. Specifically, ripsi(k) = ipsi(k)/IPSi, similar to the fairness definition used in [61]. We define

the power efficiency of the ith group, ei(k), as the ratio between ripsi(k) and WTi(k). Specifically, ei(k) = ripsi(k)/WTi(k).

In this chapter, we partition the chip-level aggregated frequency quota among groups proportionally based on the power efficiency of each group to achieve the

Table 2.1: Workload mixes used in testbed and simulation experiments.

1. Physical testbed workload mixes

   Mixes | PARSEC 2.1, SPEC2006                                       | Aggregate Effect
   mix1  | 12-perlbench                                               | all separated applications
   mix2  | 12-Streamcluster                                           | high-barrier parallel workload
   mix3  | 8-swaptions, 4-omnetpp                                     | no-barrier parallel workload
   mix4  | 4-x264, 8-fluidanimate                                     | no-barrier, high-lock workload mix
   mix5  | 4-(blackscholes, bodytrack), 2-(xalancbmk, povray)         | low-barrier and high-barrier mix
   mix6  | 4-(vips, facesim), 1-(libquantum, astar, soplex, dealII)   | random mix

2. Simulation workload mixes

   Mixes | SPLASH-2, SPEC2006                                         | Aggregate Effect
   mix1  | water (nsquared)                                           | all parallel application
   mix2  | dealII                                                     | all separated applications
   mix3  | FFT, Ocean non, LU con, LU non (each occupies 1/4 of cores)| random mix

optimized performance

fgi(k) = (ei(k − 1) / Σ_{j=1}^{N} ej(k − 1)) · f(k),    (2.5.1)

where f(k) is the aggregated frequency quota of the entire chip and fgi(k) is the aggregated frequency allocation for the ith group in the kth control period. In systems that need to support application priority, we can assign different weights to the co-scheduled applications when we calculate fgi(k).
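A minimal C sketch of this chip-level partitioning step (Equation (2.5.1), using the power efficiency definition from above) is shown below; the array names and the sample values in main() are illustrative placeholders only.

    #include <stdio.h>

    /* Chip-level partitioning sketch (Equation (2.5.1)).
     * rips[i] and wt[i] are the relative IPS and estimated power of group i
     * from the previous control period; f is the chip-level quota from the
     * power controller; fg[i] receives the quota allocated to group i. */
    static void chip_level_partition(const double *rips, const double *wt,
                                     int n_groups, double f, double *fg)
    {
        double sum_eff = 0.0;

        /* Power efficiency of each group: e_i = rips_i / WT_i. */
        for (int i = 0; i < n_groups; i++)
            sum_eff += rips[i] / wt[i];

        /* Allocate the chip-level quota proportionally to group efficiency. */
        for (int i = 0; i < n_groups; i++)
            fg[i] = (rips[i] / wt[i]) / sum_eff * f;
    }

    int main(void)
    {
        double rips[] = { 0.85, 0.60, 0.95 };   /* fabricated relative IPS    */
        double wt[]   = { 30.0, 18.0, 25.0 };   /* fabricated group power (W) */
        double fg[3];
        chip_level_partition(rips, wt, 3, 9.0, fg);
        for (int i = 0; i < 3; i++)
            printf("group %d: quota %.2f\n", i, fg[i]);
        return 0;
    }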

2.5.2 Group-level Partitioning

The goal of group-level aggregated frequency quota partitioning is to further partition the group frequency quota among all the cores running the threads of the same application, such that each thread makes balanced progress toward the common barriers. For single-threaded groups, the core quota is the same as the group quota. For multi-threaded groups, the problem of achieving optimized performance in a group

19 is translated into discerning which running threads are more critical (i.e., slower) and then allocating more aggregated frequency to the critical threads to expedite the progress of the entire application.

In this chapter, we adopt a thread criticality prediction approach proposed by

Bhattacharjee and Martonosi [6], which considers both L1 and L2 cache misses. Compared with other approaches [17, 70, 74], the advantage of this predictor is that it can handle both barrier and non-barrier parallel workloads. The criticality of core j in the ith group in the kth period is

crij(k) = N(L1miss) + (L1L2penalty × N(L1L2miss)) / L1penalty,    (2.5.2)

where N(L1miss) is the number of L1 misses that hit in the L2 cache, N(L1L2miss) is the number of L1 misses that also miss in the L2 cache, and L1L2penalty and L1penalty are the L2 and L1 cache miss penalties, respectively. The cache miss penalties are measured in CPU cycles. Within a parallel working group, a higher criticality value implies a more poorly-cached, slower thread [6], which means that additional power needs to be shifted to that thread from the non-critical threads (with smaller criticality values) to reduce the runtime imbalance. In our design, we proportionally sub-partition the frequency quota of a multi-threaded group among its cores based on criticality as follows

fcij(k) = (crij(k − 1) / Σ_{m=1}^{Mi} crim(k − 1)) · fgi(k),    (2.5.3)

where fcij(k) is the target frequency of Core j in Group i in the kth control period.

Mi is the number of cores in Group i.
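The following C sketch illustrates the group-level step: it computes the criticality of Equation (2.5.2) from cache-miss counts and splits the group quota according to Equation (2.5.3). The miss-penalty constants and the per-core counts are assumed values chosen for the example, not measurements from our testbed.

    #include <stdio.h>

    /* Group-level partitioning sketch (Equations (2.5.2)-(2.5.3)).
     * Criticality: cr = N(L1miss) + L1L2penalty * N(L1L2miss) / L1penalty.
     * The group quota fg is then split among the cores of the group in
     * proportion to their criticality. Penalty values are assumptions. */

    #define L1_PENALTY    12.0   /* assumed L1 miss (L2 hit) penalty, cycles */
    #define L1L2_PENALTY 400.0   /* assumed L1+L2 miss penalty, cycles       */

    static double criticality(double n_l1_miss, double n_l1l2_miss)
    {
        return n_l1_miss + (L1L2_PENALTY * n_l1l2_miss) / L1_PENALTY;
    }

    /* Split the group quota fg among n cores proportionally to criticality. */
    static void group_level_partition(const double *cr, int n, double fg, double *fc)
    {
        double sum = 0.0;
        for (int j = 0; j < n; j++)
            sum += cr[j];
        for (int j = 0; j < n; j++)
            fc[j] = cr[j] / sum * fg;
    }

    int main(void)
    {
        /* Fabricated per-core miss counts for a 4-core group with quota 3.2. */
        double l1[]   = { 1.0e6, 2.5e6, 1.2e6, 4.0e6 };
        double l1l2[] = { 1.0e5, 4.0e5, 0.8e5, 6.0e5 };
        double cr[4], fc[4];
        for (int j = 0; j < 4; j++)
            cr[j] = criticality(l1[j], l1l2[j]);
        group_level_partition(cr, 4, 3.2, fc);
        for (int j = 0; j < 4; j++)
            printf("core %d: quota %.3f\n", j, fc[j]);
        return 0;
    }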

20 2.6 Core-level Power Estimation on Physical Testbed

In our power management solution, the chip-level partitioning is conducted according to the relative power efficiency of each group. Therefore, we need a reasonable estimation of the power consumption of each core. In addition, one of our baselines,

Steepest Drop [127], also assumes the knowledge of the power consumption of each core, even though real microprocessors available in today's market cannot yet provide such information. In this section, we introduce our per-core power estimation method.

Although the power consumption of each individual core cannot be directly measured in today's microprocessors, previous work by Kansal et al. [58] has shown that the CPU power consumption of each VM on a server can be estimated by adaptively weighting the CPU utilization of the VM. However, they did not explicitly consider the impact of DVFS in their model despite the fact that power consumption scales with different DVFS levels. We extend their work to estimate the power consumption of each core in a DVFS environment by taking both the DVFS level and the utilization into consideration. The utilization metric represents the high-level workload characteristics, while the DVFS level represents the hardware working condition of the core. Power consumption is the interactive result of both the hardware and software parts. We adopt the commonly used multiplication operation to model the interaction among different parts [91]. Therefore, the total power consumption of the chip is modeled as:

CP = Σ_{i=1}^{M} Ui · fi · W + C,    (2.6.1)

where CP is the total power consumption of the entire chip. Ui is the utilization of

the ith core, 1 ≤ i ≤ M. M is the number of cores. fi is the DVFS level of Core i. W

is the weight and C is the idle power of the chip. In this model, the static power of the chip is captured by C. We do not explicitly consider the dynamic power of the uncore part because the uncore power is actually driven by the core part. For example, the last-level cache accesses are introduced by the cache misses or writebacks from the cache of each individual core. Instead of modeling the dynamic power of the uncore part explicitly, we attribute it to the corresponding core part, since the purpose of our per-core power estimation is to support power allocation among the cores. Therefore, we simply sum the power of all the cores as the power of the entire chip. CP can be measured with a multimeter (as described in Section 2.7.1). Ui and fi can be measured with the performance monitoring registers. W and C are updated by linear regression at runtime. Note that this simple, yet effective, estimation method has only two unknown variables in the regression. Moreover, the number of unknown variables does not scale with the number of cores. Since our testbed is a homogeneous-core system, the estimated power of each core CPi is

CPi = W · Ui · fi + C/M.    (2.6.2)

For heterogeneous-core systems, we could extend the model by scaling fi (as described in Section 2.4). Intuitively, the workload characteristics could be more accurately captured by using different weights (Wi instead of W ) for different cores.

However, this approach would make the complexity of the on-line regression problem increase linearly with the number of cores, which might not be favorable for many-core architectures.
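As an illustration of how W and C could be fitted and then used per core, the C sketch below performs an ordinary least-squares fit of Equation (2.6.1) over a few fabricated samples and then applies Equation (2.6.2). A production implementation would instead update the fit online (e.g., over a sliding window of recent measurements), and all numbers shown here are invented for the example.

    #include <stdio.h>

    #define NCORES 12   /* number of cores on the testbed chip */

    /* Per-core power estimation sketch (Equations (2.6.1)-(2.6.2)).
     * Each sample pairs the measured chip power CP with x = sum_i(U_i * f_i).
     * W and C are fitted by ordinary least squares over the samples. */
    static void fit_power_model(const double *x, const double *cp, int n,
                                double *w, double *c)
    {
        double sx = 0, sy = 0, sxx = 0, sxy = 0;
        for (int k = 0; k < n; k++) {
            sx += x[k]; sy += cp[k];
            sxx += x[k] * x[k]; sxy += x[k] * cp[k];
        }
        *w = (n * sxy - sx * sy) / (n * sxx - sx * sx);   /* slope  W */
        *c = (sy - *w * sx) / n;                          /* offset C (idle power) */
    }

    /* Estimated power of one core: CP_i = W * U_i * f_i + C / M. */
    static double core_power(double w, double c, double util, double freq)
    {
        return w * util * freq + c / NCORES;
    }

    int main(void)
    {
        /* Fabricated samples of aggregated utilization*frequency vs. chip power. */
        double x[]  = { 2.0, 4.5, 7.0, 9.5, 11.0 };
        double cp[] = { 55.0, 68.0, 80.0, 93.0, 101.0 };
        double w, c;
        fit_power_model(x, cp, 5, &w, &c);
        printf("W = %.2f W per (util*freq), C = %.2f W\n", w, c);
        printf("core estimate at U=0.8, f=0.9: %.2f W\n", core_power(w, c, 0.8, 0.9));
        return 0;
    }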

22 Table 2.2: Simulator configuration parameters for SESC.

CMP                        32, 64, 128 cores
Core                       Alpha 21264-like
Peak frequency/Vdd         1GHz / 1.0V
Technology/Temperature     65nm / 80°C
Fetch/issue/commit width   4/4/4
Private L1 d/i-cache       2-way, 64-byte block, 32KB, 2-cycle hit latency
Private L2 cache           8-way, 64-byte block, 256KB, 12-cycle hit latency
Main memory latency        400 cycles

2.7 Implementation

In this section, we first introduce the physical testbed and the simulation environment used in our experiments. Next, we discuss the possible implementation of our solution on a physical chip.

2.7.1 Testbed

Our testbed is a 12-core AMD Opteron 6168 processor, which supports per-core frequency scaling with five different levels [3]. The operating system is OpenSUSE 11.2 with Linux kernel 2.6.31. To evaluate our power control policy, we simultaneously run different combinations of selected benchmarks from the PARSEC 2.1 [7] and SPEC

CPU2006 suites on our physical testbed. We use the SPEC suite subset identified in [98] to represent the major characteristics of SPEC CPU2006. The constructed workload combinations of PARSEC and SPEC cover a variety of different aggregate effects and are listed in Table 2.1. In Table 2.1, the number-appname notation is the number of threads of the application with the name of appname for the PARSEC and SPLASH-2 workloads; for the SPEC2006 workload, it is the number of copies of the application with the name of appname.

To measure the power consumption of the processor, we use the approach proposed

in [32, 55]. An Agilent 34410A digital multimeter is used together with a Fluke i410 current probe to measure the current running through the 12V power lines that power the processor. The probe is clamped to the 12V lines and produces a voltage signal proportional to the current running through the lines with a coefficient of 1mV/A.

The resultant voltage signal is then measured with the multimeter. The measured value is read by the server through a USB cable using a USBTMC device driver. The accuracy of the current probe is (3.5% of reading + 0.5A). The power consumption of each core is estimated periodically (described in Section 2.6).

On our prototype testbed, we implement the control algorithm as an OS daemon process to control the target processor, though the controller can be implemented in the service processor firmware in a real system. The controller periodically reads the power consumption from the power monitor and the performance statistics from the

OS. It then executes the control algorithm presented in Section 2.4. As the outputs of the control algorithm, new frequency levels are calculated and enforced in the next control period. The control period Tc for the controller is set to 1 second because the timer resolution in Linux is 10ms. Note that much shorter control periods could be used in a real firmware or on-chip implementation.

Since the new aggregated frequency level periodically received from the power controller can be any value, it may not be exactly one of the frequency levels supported by the processor. Therefore, a frequency modulator is needed to approximate the desired level with a series of supported frequency levels. For example, to approximate

0.9 during a control period, the modulator would output the sequence 0.8, 1.0, 0.8, 1.0, etc., on a smaller timescale. To implement the approximation, we use a first-order delta-sigma modulator [67] to generate the sequence in each control period. This type of modulator is commonly used in analog-to-digital signal conversion. Clearly, when the sequence has more numbers during a control period, the approximation will be

better, but the actuation overhead may become higher. On our prototype testbed, we use 10 subintervals to approximate the desired DVFS level and each subinterval is 100ms. As a result, the effect of actuation overhead on system performance is no more than 0.02% (20μs of DVFS overhead [111] divided by 100ms), even in the worst scenario when the DVFS level needs to be changed in each subinterval.
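The following is a minimal sketch of such a first-order delta-sigma modulator, written in its error-feedback form; the helper name and the normalized level sets in the example are illustrative.

def delta_sigma_sequence(desired, levels, subintervals=10):
    """Quantize 'desired' to the nearest supported level in each subinterval while
    feeding the accumulated quantization error back, so the average approaches 'desired'."""
    seq, error = [], 0.0
    for _ in range(subintervals):
        target = desired + error                              # desired value plus accumulated error
        level = min(levels, key=lambda l: abs(l - target))    # nearest supported level
        error += desired - level                              # integrate the quantization error
        seq.append(level)
    return seq

print(delta_sigma_sequence(0.9, [0.7, 0.8, 0.9, 1.0]))
# With only two supported levels, 0.9 yields the alternating 0.8/1.0 pattern described above.
print(delta_sigma_sequence(0.9, [0.8, 1.0]))

Because the quantization error is carried over from one subinterval to the next, the running average of the emitted levels converges to the desired level, which is the property the power controller relies on.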

2.7.2 Simulation Environment

To simulate different applications running in the many-core system, we simultaneously run different combinations of selected benchmarks from the SPLASH-2 [128] and

SPEC CPU2006 suites in our simulator. We use SPLASH-2 in our simulation because there is no known report about cross-compiling PARSEC for SESC. We are working with the PARSEC authors on it.

To stress test our power management solution in a many-core microprocessor with a large-scale configuration (i.e., a large number of cores), we conduct simulations using

the SESC [104] simulator with modifications for per-core DVFS support. The cores are configured based on an Alpha 21264 (EV6) scaled to current process technology. Its main parameters are listed in Table 2.2. Each core has 4 DVFS levels normalized to the maximum frequency (i.e., 1, 0.9, 0.8 and 0.7). We integrate the MESI cache coherence protocol on top of the basic Network on Chip (NoC) module of SESC to construct a fully functional NoC so that our simulated many-core microprocessor can provide a coherence guarantee among all the cores. The power of the core part is derived as the combination of the dynamic power reported by Wattch [15] and the leakage power estimated by HotLeakage [131]. We use Orion 2.0 [57] to estimate the power consumption of the NoC. To simulate the DVFS actuation overhead, we assume that no instruction is executed during synchronization [54]. Based on recent studies [63], the DVFS actuation overhead can be decreased to tens of ns in the near future.

Therefore, we choose to use a 250ns transition overhead [6] and 10 subintervals in a control period of 250μs, which leads to a DVFS overhead of up to 1% in each control period. Our scheme allows a higher DVFS overhead by having a longer control period.

For example, if we change the control period from 250μs to 5ms, the DVFS overhead is still up to 1% with a 5μs transition time as in Nehalem processors [52].

2.7.3 Discussion on Hardware Implementation

Since the power management algorithm might need to be implemented on-chip [82] due to the future system integration trend, we discuss the possible on-chip implementation of our solution in this section. The dominant circuit in our solution is the fixed-point division in the partitioner and the fixed-point multiplication in the controller. The division could be implemented as a multiplier with 10% extra circuits

[96]. Since the controller and two partitioners are sequential in our algorithm, they could share the same piece of hardware that could be deployed on an on-chip PMU

[82]. For a total power budget on the order of hundreds of Watts (a 10-bit representation is sufficiently accurate for power management purposes), we conservatively use a 16-bit fixed-point multiplier. Bitirgen et al. [10] have estimated that a 16-bit fixed-point multiplier yields an area of 0.0057mm² with 65nm process technology.

For a typical 200mm² die, the hardware overhead is only 0.003%. To approximate the power consumption of the fixed-point multiplier circuits in the hardware, we assume the power density of IBM POWER6's FPU [25], which is 0.56W/mm² at 100% utilization with nominal voltage and frequency values (1.1V and 4GHz). The extra power consumed by the fixed-point multiplier is only 0.003W. Additionally, we need to add two programmable registers on each core: one to record the base IPS (IPS_i defined in Section 2.5.1) of the thread scheduled on this core; the other to record the grouping information. The performance counters recording the instruction count

Figure 2.2: Power estimation accuracy experiments on a 12-core hardware testbed. (a) At the peak DVFS level. (b) At randomized DVFS levels. (c) Average power. (d) Standard deviation.

and cache miss count are already available on most modern microprocessors. The delta-sigma modulator consists of one accumulator and one comparator, which is a very lightweight piece of circuitry. Furthermore, if the DVFS levels are fine-grained enough (e.g., [125]), the modulator could be eliminated. Therefore, our solution is hardware-efficient even for on-chip implementation.

2.8 Evaluation

First, we introduce two state-of-the-art baselines. Next, we present our physical testbed and simulation results for many-core architectures. Finally, we provide both the theoretical and experimental analysis of the scalability of our solution.

2.8.1 Baselines

Our first baseline, referred to as Priority, is a heuristic-based power controller for

CMPs proposed by Isci et al. [54]. The control scheme of Priority is briefly summarized as follows. 1) Every core is assigned a priority. 2) In each control period, if the total power consumption of the chip is lower than the set point, Priority chooses the core with the highest priority to increase its DVFS level by one. If the core is already running at its highest DVFS level, the core with the next highest priority will be tried. Alternatively, if the power of the chip is above the set point, Priority chooses a core (starting from the lowest priority core) to decrease its DVFS level by one. 3) Priority repeats step 2 until the system stops. Priority represents a typical centralized solution that adopts a lightweight trial-and-error approach to exploit the combinations of the power states of all the cores. We compare our power management solution against Priority to show that heuristic-based solutions, though lightweight, may be insufficient for many-core architectures due to the exponential explosion of the number of global power states. Our second baseline, referred to as Steepest Drop, is a heuristic-based optimization algorithm for many-core architectures proposed by Winter et al. [127]. When the chip power is over the budget, the algorithm selects the application/core pair that would provide the biggest ratio of power reduction to performance loss if the DVFS level were dropped by one step. This process is repeated until the total power is smaller than the budget. Although Steepest Drop has been demonstrated to outperform a variety of state-of-the-art algorithms and to have a small computation overhead even for a 256-core system, it is based on the assumption that all the cores run independent workloads. This assumption is invalid for parallel workloads because it oversimplifies the coupling among the cores hosting the same parallel application. We compare it with our solution to show

that it is necessary to consider the couplings among all the cores in power management for improved application performance.

In the following subsections, we refer to our solution as FreqPar.

2.8.2 Estimation Accuracy

In this experiment, we test the accuracy of our online power estimator presented in

Section 2.6. In Figure 2.2(a), we set all the cores to the peak frequency and run mix1 on the 12 cores. Figure 2.2(a) shows that the estimated total chip power can track the measured actual power during the execution time with only small errors. To stress test our estimation scheme, in Figure 2.2(b), we randomly modulate the DVFS levels of all the cores. Figure 2.2(b) shows our scheme can track the total power with a reasonable accuracy. In Figures 2.2(c) and (d), we plot the average and standard deviation of the measured and estimated power respectively, under different frequency levels and benchmark mixes. On our physical testbed, there are five DVFS levels:

0.8GHz, 1.0GHz, 1.3GHz, 1.5GHz, and 1.9GHz. The difference between the estimated and measured average power is within 1W in most cases, with the worst case being mix5 at 1.9GHz with a 3.6W difference. For the standard deviation, the difference between the estimated and measured values is within 2W in most cases, with the worst case being mix4 at 0.8GHz with a difference of 3.2W.

2.8.3 Testbed Results

In this subsection, we compare our solution with the two baselines on our physical testbed, in terms of power control accuracy and application performance.

Figure 2.3: Power control accuracy comparison. (a) Priority. (b) Steepest Drop. (c) FreqPar. (d) Different benchmarks and budgets. In (a)-(c), the frequencies are relative to the peak of a selected core. In (d), the power values are relative to the peak power in each test case.

Power Control Accuracy

In this subsection, we first perform a case study to investigate different power management algorithms to show that algorithms based on feedback control theory can achieve better power control accuracy. Then, we present the average power control results of FreqPar and the baselines under different power budgets.

In Figures 2.3(a)-(c), we schedule 12 copies of lbm from the SPEC CPU2006 suite on the 12 cores, and all algorithms start with all cores set to the lowest frequency level, which is the default policy of Linux when the cores are idle. Since the power is lower than the set point (80W for all the policies in this test) at the beginning, all the policies try to raise the DVFS levels of the cores to improve performance. In Figure

2.3(a), Priority responds by increasing the DVFS level of one core at a time, until the power rises above the set point in the 20th control period. Then it steps the DVFS level of a core up/down around the set point. There are two parts of this simple

algorithm that can be improved. The first is that a large number of cores will lead to a very long settling time, because Priority only steps the DVFS level of one core up or down in each control period. The second is that Priority always oscillates between two adjacent DVFS levels of a core, even in the steady state. As a result, it never settles to the set point. Since Priority has steady-state errors, it may be undesirable to use Priority in a real system, because a positive steady-state error (i.e., average power above the set point) may violate the power budget. Lefurgy et al. [67] have identified this issue and addressed it by adding a safety margin when the power budget is assigned. To ensure that the safety margin is safe for all the benchmarks, we run the most power-hungry benchmark (mix1) on every set point from 60W to 85W with a 1W step to find the maximum positive steady-state error, which is 2.42W. By applying such a safety margin when assigning the power budget, we ensure that the average power is always within the power budget. We refer to the Priority policy with a safety margin as Improved Priority. Note that Improved Priority is actually not feasible in practice, because it is difficult to have such a priori knowledge of the safety margin before actually running the workload at all the possible power set points. In the following experiments, we use Improved Priority as a baseline that can achieve the best possible performance in an ad hoc way and yet does not violate the power constraint.

In order to address the long settling time, Steepest Drop [127] has been proposed to explore multiple steps in each control period, based on an analytical model of the power consumption and performance contribution of a core. Figure 2.3(b) plots a typical run of Steepest Drop. Steepest Drop takes advantage of the direction of its model, going directly to the estimated optimal DVFS level each time. This policy successfully addresses the long settling time issue in Priority. However, even in the ideal case, the power managed under Steepest Drop is always lower than the set point

because the algorithm only terminates the search once the estimated power falls just below the budget, resulting in an unnecessarily low total running frequency.

Figure 2.3(c) shows that FreqPar can precisely control the power of the chip by receiving a desired DVFS level from the controller, and then using the DVFS modulator to generate a series of supported DVFS levels on a finer timescale to approximate the desired level. One may think that Priority could be improved by also using a series of DVFS levels for each core. However, Priority would still have the same steady-state error because, without a desired DVFS level precisely determined based on control theory, Priority can only oscillate between two DVFS levels of a core. Compared with Steepest Drop in Figure 2.3(b), FreqPar on average runs at a higher frequency (9.9 relative to the peak of one core, compared to 9.4 for Steepest

Drop), because the precise control policy of FreqPar can use all the available power budget. Figure 2.3(d) plots the average power with standard deviations (as error bars) for

FreqPar and the baselines. In all the tests, we run the application mixes to the end.

The power readings are relative to the peak power in each test case. As discussed previously, FreqPar can precisely achieve the desired power budgets while the other two methods waste the budgets.

Application Performance

In this subsection, we first provide two case studies to discuss the underlying reasons that FreqPar has better performance at both the group and chip levels. Then, we present the average performance of FreqPar and the baselines under different power budgets.

Figures 2.4(a)-(c) show the reason that FreqPar can outperform the baselines for

Figure 2.4: Group-level (thread criticality-aware) frequency quota (i.e., sum of normalized DVFS levels) allocation traces of FreqPar and the baselines. (a) Improved Priority. (b) Steepest Drop. (c) FreqPar.

parallel workloads. In this experiment, we run two threads of streamcluster (PAR-

SEC) on Core0 and Core1 and leave the other cores unused, with a 45W total chip power budget. Under the management of Steepest Drop, the algorithm always favors the cores hosting a high raw IPC, which is not necessarily the critical thread that needs more frequency quota within the parallel workload at runtime. Consider a spinning lock, during which a thread has a very high IPC without making real progress; a policy that simply favors high IPC will make the imbalance worse [1]. Bhattacharjee et al. [6] have identified that weighted cache misses are positively correlated with criticality and proposed to take weighted cache misses as the imbalance indicator. In

Figures 2.4(a) and (b), we observe that the imbalanced frequency allocation policies like Priority and Steepest Drop increase the weighted cache miss difference between the two cores within a short period of time (60s in this test). The cross-marked curves in Figure 2.4 are the aggregated weighted cache misses. In contrast, with FreqPar,

Figure 2.4(c) shows that the frequency quota is fairly shared by the two cores and always shifted to the core with a higher criticality within the working group, which dramatically reduces the imbalance between the two threads. As a comparison, we also test Even, the policy from our initial design that evenly divides the frequency quota between the two cores. Our results show that by dynamically allocating the frequency

Figure 2.5: Chip-level (power efficiency-aware) frequency quota (i.e., sum of normalized DVFS levels) allocation traces and power efficiency of FreqPar and the baselines. (a) Improved Priority. (b) Steepest Drop. (c) FreqPar frequency traces. (d) FreqPar power efficiency.

to the critical thread, FreqPar is better than Even. The average absolute weighted cache miss differences between the cores in this test are 7843, 4437, 1703, and 739 for Priority, Steepest Drop, Even, and FreqPar, respectively.

Figure 2.5 illustrates the reason why FreqPar can outperform the baselines in terms of FS (Fair Speedup) at the chip level. In this experiment, we run one copy of milc (SPEC CPU2006) on Core0 and one copy of zeusmp (SPEC CPU2006) on

Core1. We leave the other cores unused. Figures 2.5(a)-(c) show the frequencies of the two cores under different policies with a 45W chip power budget. Steepest

Drop always favors the core hosting a high raw IPC because that core generally has a higher IPC/Watt gradient, hence sacrificing fairness. In Figure 2.5(b), Core1 has an advantage because it has a higher IPC by nature than Core0. Therefore, Core0 is always kept at a low frequency level except when Core1 has reached its peak frequency and the power consumption is still lower than the budget. Improved Priority

Figure 2.6: Overall performance comparison between FreqPar and the baselines on a 12-core hardware testbed.

in Figure 2.5(a) behaves in a similar way because it has a preset static priority. In contrast, with FreqPar, Figures 2.5(c) and (d) show that the frequency quota is always shifted to the core with a higher relative power efficiency. Both cores have the opportunity to get their fair share of the frequency quota under the power budget.

FreqPar outperforms the baselines because FreqPar uses relative power efficiency as the frequency quota partitioning criterion to achieve optimized performance with fairness consideration. In FreqPar, rips_i(k) (defined in Section 2.5.1) captures the relative application progress and eliminates the instruction count inflation caused by the natural characteristics of the applications. In contrast, the fairness-blind baselines favor some applications too much over other co-scheduled applications, resulting in degraded overall performance. Based on rips_i(k), FreqPar allocates more frequency quota to high-efficiency applications to optimize the overall performance with fairness consideration.

Figure 2.6 plots the overall performance comparison among different power budgets and benchmarks in terms of FS. Due to the reasons discussed in the previous case studies, we observe that FreqPar outperforms Improved Priority and Steepest

Drop by 17% and 11% on average, respectively.

2.8.4 Simulation Results

In this subsection, we compare FreqPar with the two baselines in our simulation environment. In our simulator, for each core-count configuration, we run the constructed benchmarks (Table 2.1) with a power budget of 75% of the peak power of each benchmark mix and report the average. We fast-forward 1 billion instructions and simulate 3, 5, and 8 billion instructions for the 32-, 64-, and 128-core configurations, respectively.

Figure 2.7 shows the power control accuracy and performance of each policy. Power readings are relative to the power budget. The FS numbers are relative to the FS of FreqPar in each test case. We observe that FreqPar can precisely control the power of the chip to the desired set point, while both baselines waste a certain amount of the available power budget. Note that when we calculate the average power, we skip the initial phase and compute over the steady phase for all the policies. Figure

2.7 also shows that FreqPar has better performance than the baselines for each core configuration. Furthermore, FreqPar's power and performance improvements over the baselines increase with the number of cores. This is because Improved Priority spends a long time under the budget because of its long settling time. For Steepest Drop, with more cores, the raw-IPC-directed optimization without fairness and criticality considerations worsens the imbalance among the cores.

2.8.5 Discussion on Algorithm Complexity and Scalability

The computational complexity, in terms of the algorithm execution time, determines the algorithm scalability. We first analyze the computational complexity of the studied algorithms and then provide experimental results. Suppose there is an M-core system that supports per-core DVFS and the number of available DVFS levels is L. Priority adjusts the DVFS level of a core one step at a time. It is an O(1) algorithm. However,

Figure 2.7: Power and performance comparison in simulations under different numbers of cores.

Figure 2.8: Execution time experiments show that FreqPar is more scalable than Steepest Drop.

its long convergence time makes it unlikely to be implemented in many-core systems.

In the worst case, if the current power state is that every core is running at the peak state, Priority takes L ∗ M control periods to reach the lowest power set point. During that time, undesirable overheating may occur. Steepest Drop checks all the cores to determine the optimal one-step action based on an analytical model in one iteration, and it iterates multiple times until the estimated power of the analytical model is less than the budget. In the worst case, the search goes up/down L ∗ M times.

Therefore, it is an O(L ∗ M²) algorithm. Winter et al. [127] propose a special data structure to reduce the implementation complexity to O(L ∗ M lg(M)). Since IBM has implemented a digital phase-locked loop [117], which achieves a near-continuous set of output frequencies without skipping processing cycles, an algorithm whose execution time increases linearly with the number of available DVFS levels might be unfavorable. In contrast, even in the worst case, FreqPar proceeds through the cores twice (once for the chip level and once for the group level). Therefore, it is an O(M) algorithm independent of L. Furthermore, if the available DVFS levels are fine-grained enough, the delta-sigma modulator could be eliminated, resulting in an even simpler design.

We now conduct experiments to measure the execution times of the three algorithms on our testbed. We examine the algorithm execution time as the number of cores increases in Figure 2.8. In the experiment, we invoke the investigated algorithms 500 times on our physical testbed with randomly generated power and DVFS levels as the inputs and present the average execution time.

In Figure 2.8, the execution time scaling trends of the algorithms confirm our theoretical analysis. As an O(L ∗ M lg(M)) algorithm, Steepest Drop achieves a modest execution time scaling with the number of cores when the number of available DVFS levels is small (e.g., 2 levels). However, when the number of DVFS levels is large, the scalability is questionable. Suppose that the dynamic power adaptation range of one core is from 0.5 to 1 relative to the peak, with a 1% step size (achievable for

POWER7 [117]); in that case, the average execution time of Steepest Drop exceeds 20ms on a 4096-core system. In contrast, FreqPar is a scalable O(M) algorithm independent of the number of DVFS levels. Even for managing a 4096-core system, the average execution time of FreqPar is only 650μs. The key reason that FreqPar outperforms

Steepest Drop (a classic downhill search algorithm with an optimized implementation) in scalability is that, with the delta-sigma modulator, FreqPar uses a numerical computation to replace the discrete search used by Steepest Drop in the decision-making part.

2.9 Conclusion

The majority of existing solutions on power control of multi-core architectures do not scale well for many-core architectures. More importantly, those solutions cannot effectively allocate power based on thread criticality to accelerate multi-threaded parallel applications, which are expected to be the primary workloads of many-core

architectures. In this chapter, we have presented a highly scalable power control solution that is specifically designed to handle realistic workloads, i.e., a mixed group of single-threaded and multi-threaded applications. Our solution features a three-layer design. First, we adopt control theory to precisely control the power of the entire chip to its chip-level budget by adjusting the aggregated frequency of all the cores on the chip. Second, we dynamically group cores running the same applications and then partition the chip-level aggregated frequency quota among different groups for optimized overall processor performance. Finally, we partition the group-level aggregated frequency quota among the cores in each group based on the measured thread criticality for a shorter application completion time. As a result, our solution can optimize the processor performance while precisely limiting the chip-level power consumption below the desired budget. Empirical results on a physical testbed show that our control solution can provide precise power control, as well as 17% and 11% better application performance than two state-of-the-art solutions, on average, for mixed PARSEC and SPEC benchmarks. Furthermore, our extensive simulation results with 32, 64, and 128 cores, as well as overhead analysis for up to 4096 cores, demonstrate that our solution is highly scalable to many-core architectures.

CHAPTER 3

POWER GATING FOR POWER CAPPING AND CORE

LIFETIME BALANCING

3.1 Introduction

Power has become a first-class constraint in current microprocessor design due to packaging, cooling, and power delivery circuit limits. An important research challenge is to optimize the performance of a Chip Multiprocessor (CMP) within a given power constraint (i.e., power capping). Recently, many research studies have been conducted to utilize Dynamic Voltage and Frequency Scaling (DVFS) to provide a way to cap power. Unfortunately, in recent generations of technology scaling, to keep leakage current under control, the decrease in the threshold voltage (Vth) of transistors has stopped [47]. This, in turn, has prevented the supply voltage (Vdd) from decreasing further. As a result, DVFS alone may no longer be able to fully address the power capping issue. Power gating is a technique that cuts off the power supply of a logic block by inserting a gate (or sleep transistor) in series with the power supply

[77]. Gating the power supply results in almost no power consumption in the gated block [53]. Power gating complements DVFS by providing an effective mechanism to reduce leakage power. Therefore, it is preferable to integrate power gating and DVFS in power capping for further improved CMP performance. In this chapter, we consider the case of per-core power gating (PCPG) because it has

been implemented in mainstream processors [53]. PCPG and DVFS have different characteristics in terms of their transition overheads and interactions with the OS, which requires a decoupled design to address their differences. PCPG has a larger transition time and energy overhead [77, 68]. Furthermore, since PCPG changes the number of turned-on cores, the OS scheduler may reallocate the thread-core mapping in each power gating interval. Therefore, the algorithm designed for PCPG should track long-term trends and avoid actuation oscillations. In contrast, DVFS has a smaller transition overhead (e.g., 10μs in Nehalem processors [53]). Moreover, it does not change the on/off states of the cores. Therefore, DVFS is preferable for exploring short-term workload variations (e.g., multiple DVFS adjustment intervals within one scheduling interval) [116, 78]. However, existing efforts [64, 8] on integrating power gating and DVFS for power capping simply treat the power gating state as an extra-low power state below the existing DVFS levels and consider the power gating and DVFS states in a coupled fashion at a coarse time scale. These coupled designs can neither take advantage of fine-time-granularity DVFS nor avoid unnecessary actuations for power gating. Furthermore, these coupled designs usually require manually disabling the OS scheduler to have a fixed thread-core mapping, because if the OS changes the thread-core mapping, the current statistics of one core cannot be used to decide the power state for the next interval. Therefore, a decoupled design that can meet the different requirements of power gating and DVFS, and that can be deployed with the native OS, needs to be developed.

Through PCPG, the wasted leakage power of certain under-utilized cores (in

DVFS-only systems) can be proactively transformed into dynamic power headroom for accelerating useful applications. Hardware overclocking provides CMPs with the capability of fully utilizing the dynamic power headroom for optimized performance. However, many existing CMPs have only homogeneous cores with chip-level

overclocking capability (e.g., Intel TurboBoost [53]), which cannot fully explore the variations of different applications among cores at runtime. Since the benefit of per-core DVFS has been discussed in detail [63], we consider the case that, with per-core overclocking enabled, CMPs with homogeneous cores can mimic the functionality of heterogeneous cores to dynamically provide more powerful cores to meet the runtime requirements of applications. Even in systems without physically implemented per-core DVFS (e.g., multi-power-island chips), Rangan et al. [102] have shown that thread migration on systems with only two power states can be used to approximate the functionality of per-core DVFS. Compared with TurboBoost, which only adjusts the chip-level DVFS level, this chapter addresses a different issue of coordinating the DVFS/overclocking states of multiple on-chip cores to improve the

CMP performance within a chip-level power cap.

While per-core DVFS and overclocking offer new opportunities to explore the power-performance trade-off, they also pose serious challenges to CMP reliability.

Overclocking directly leads to a higher wear-out rate on the overclocked cores [59].

The practice of employing per-core DVFS/overclocking aggravates the situation in which some cores age much faster than others and become the reliability bottleneck for the whole system, significantly reducing the system service life [51]. Previous studies [23] have developed effective algorithms that use per-core DVFS to balance the service lifetime of on-chip cores. However, the solution of lifetime-balancing algorithms may conflict with the solution of power-capping algorithms. For example, lifetime-balancing per-core DVFS algorithms may throttle the cores running high-activity applications (e.g., high IPC) to reduce the wear-out. However, those cores are exactly the cores that performance-power optimization algorithms usually seek to boost. It is unlikely that these conflicting goals can be coordinated using just one knob. Fortunately, power gating offers new opportunities to manage the core

lifetime balancing issue, because the power-gated cores do not wear out measurably [59]. Therefore, it is possible to achieve lifetime balancing through modulating the power gating state of each core.

Based on the above observations, this chapter proposes PGCapping (Power Gating for Capping), a decoupled design to integrate power gating with per-core DVFS/overclocking for CMP power capping, and also discusses how to use power gating to balance the core-level lifetime. Specifically, PGCapping consists of a Proportional-Integral (PI) controller based on feedback control theory to manage power gating and a Quicksearch algorithm for DVFS/overclocking management. Both the PI controller and

Quicksearch algorithm are invoked periodically. We select different intervals for the PI controller and Quicksearch to decouple them. The PI controller adjusts the number of turned-on cores to control the chip power at a coarse time scale with a theoretically provable stability guarantee. At a fine-grained time scale, Quicksearch employs per-core DVFS to fully handle the short-term workload variations. Core-level lifetime balancing is achieved by selecting which cores to turn on/off after the power controller decides the number of turned-on cores.

Specifically, this chapter makes the following major contributions:

• We propose a novel algorithm, PGCapping, which integrates power gating, DVFS, and core overclocking to optimize the CMP performance within a power cap. PGCapping explores a novel decoupled design direction: it conducts power gating at a coarse time scale for reduced runtime overhead and DVFS at a fine time scale to handle short-term workload variations.

• Since overclocking may have negative impacts on the core aging rates and lead to an unnecessarily shortened CMP lifetime, PGCapping integrates core lifetime balancing as an integral part of the proposed power capping framework and uses a power-gating-based balancing algorithm to maximize the CMP lifetime.

• While most existing work has only simulation results, we implement the proposed power capping solution on a 12-core AMD Opteron processor and present empirical results. Our results show that our decoupled design achieves up to 42.0% better average application performance than five state-of-the-art baselines for mixed PARSEC and SPEC CPU 2006 benchmarks.

The rest of this chapter is organized as follows. Section 3.2 highlights the differences between this chapter and related work. Section 3.3 describes the decoupled design. Section 3.4 introduces our hardware testbed, simulation setups, and the implementation details of our solutions. Section 3.5 presents our baselines and evaluation results. Section 3.6 concludes this chapter.

3.2 Background

A well-known industry practice to integrate DVFS, overclocking, and power gating is the Intel TurboBoost technology [53]. However, TurboBoost, as well as the studies by Li [71] and Lee [64], discusses chip-wide DVFS rather than per-core DVFS, which limits the optimization space. Chip-wide DVFS cannot fully explore the inter-core variations. Since per-core frequency scaling has been available in generally available processors (e.g., AMD [2]) and on-chip regulators have been proposed [63] for per-core DVFS, this chapter focuses on integrating power gating and per-core

DVFS. Compared with power gating and chip-level DVFS, this chapter addresses a different and more challenging problem because we need to coordinate the power state (i.e., the power gating state and the DVFS level) of each core for optimized performance. Many power-capping solutions [116, 78] based on per-core DVFS have been proposed. However, they do not discuss coordination with power gating. Bircher et al. [8] use both per-core DVFS and power gating.

However, they focus on proposing and validating a workload phase predictor for power management. The impact of real-world OS scheduling, which is crucial to workload-based power management, needs more detailed consideration.

Karpuzcu et al. [59] have proposed using per-core overclocking to boost sequential application execution. However, they only overclock one core at a time until it wears out, and then they cut that core off from the on-chip logic and power network. In contrast, our scheme allows multiple cores to simultaneously stay in the overclocking state within a desired power budget, and we balance the service lifetime of each core.

The core-level lifetime balancing issue of multi-core processors has been extensively studied. However, previous explorations mainly focus on per-core DVFS and scheduling [51, 23]. Using per-core power gating as a reliability management knob and the impact of overclocking are rarely considered. Compared with previous art, this chapter discusses using power gating as a reliability management knob and specifically considers overclocking scenarios.

3.3 System Design

In this section, we introduce PGCapping, which integrates power gating, DVFS, and overclocking to optimize the CMP performance within a power cap. PGCapping is a novel decoupled design that conducts power gating at a coarse time scale for reduced runtime overhead and DVFS at a fine time scale to handle short-term workload variations.

As shown in Figure 3.1, we design one power gating management module (i.e., the PI controller) and one DVFS/overclocking management module (i.e., the Quicksearch algorithm) to conduct power capping at different time scales. The PI controller adjusts the number of turned-on cores to control the chip power based on the power measurement and budget at a coarse time interval. Core-level lifetime balancing is achieved through assigning the on/off state of each core based on its utilization. At a fine-grained time

Figure 3.1: The decoupled design uses the power budget, chip power measurement, per-core utilization, temperature, and lifetime as inputs. It computes the next-step power mode (e.g., on/off, DVFS levels, overclocking state) of each core to cap the entire chip power, boost performance, and balance the lifetime. (In the figure, the voltage gate (V-gate) is a power transistor used to enforce power gating; the voltage regulator module (VRM) and phase-locked loop (PLL) are used to enforce DVFS.)

interval, the power control and performance optimization are realized by adjusting the DVFS/overclocking level of each core. We treat the overclocking states as extra

DVFS levels above the labeled peak DVFS level. Therefore, the assignment of the overclocking state is also conducted in Quicksearch. Both the PI controller and Quicksearch are invoked periodically. We select different intervals for the PI controller and Quicksearch to decouple them. The interval of the PI controller is set to be much longer than the OS scheduling interval to give the OS enough time for spreading and balancing the workload among the turned-on cores. The interval of

Quicksearch is selected to be much shorter than that of OS scheduling. Therefore, before the next PI controller interval starts, Quicksearch has already settled at the optimal point with the current power gating setting. As a result, the PI controller does not introduce oscillations to the DVFS management. On the other hand, since

Quicksearch converges quickly, the impacts of DVFS on PCPG are observed to be negligible. Therefore, the two loops will not interfere with each other and can be designed independently. Moreover, both loops are decoupled from OS scheduling, so native OS scheduling without modification can be used in this decoupled design.

3.3.1 Design of PCPG Management Module

In this section, we introduce the design of the PI controller, which controls the power consumption of the entire chip to a desired power budget by adjusting the number of turned-on cores at a coarse time interval.

The power of the chip generally has a monotonic relationship with the number of turned-on cores. However, the run-time variations (e.g., different applications or

DVFS) make it unlikely to develop an accurate model for all possible cases. Therefore, we adopt feedback control theory to design a PI controller to decide the number of turned-on cores. A key advantage of the control-theoretic design approach is that it can tolerate a certain degree of modeling errors and adapt to online model variations based on dynamic feedback [36]. Therefore, our solution does not rely on power models that are perfectly accurate, which is in sharp contrast to open-loop solutions that would fail without an accurate model.

Controller Design. Following standard PI control-theoretic design procedures

[36], our controller is designed as:

N(k) = N(k − 1) + (Pt − cp(k − 1)) / a,    (3.3.1)

where N(k) is the number of turned-on cores on the chip in the kth control period.

Pt is the power budget of the entire chip, which can be determined by the thermal and power supply constraints of the processor or specified by the user during runtime. cp(k) is the power consumption of the entire chip in the kth control period. a characterizes the power consumption of one core, which may vary for different chips and applications. In our design, we derive a by using the datasheet full power range

(from the idle power to the maximum power of the chip [3]) divided by the dynamic range of N(k) (e.g., a = 5.9).

Figure 3.2: Quicksearch algorithm flowchart. Only the power-higher-than-budget case is presented for concision. (In the flowchart, the next power state is evaluated as Power_next = Power_current ∗ (Freq_next/Freq_current) ∗ (Vdd_next/Vdd_current)² and Perf_next = Perf_current ∗ (Freq_next/Freq_current).)

Controller Analysis. A fundamental benefit of the control-theoretic approach is that it offers a mathematical framework to analyze the system stability and performance, even when the system power model may change at runtime due to variations.

By applying pole analysis [36] to our controller, we prove that the closed-loop system is stable as long as 0 ≤ a ≤ 11.8. In our experiments, the variation never exceeded this range.
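To sketch where such a bound comes from, assume (as a simplification not stated in this exact form above) that the chip power responds approximately linearly to the number of turned-on cores, cp(k) ≈ g ∗ N(k) + c, where g is the actual per-core power. Substituting into Eq. (3.3.1) gives

N(k) = N(k − 1) + (Pt − g ∗ N(k − 1) − c)/a = (1 − g/a) ∗ N(k − 1) + (Pt − c)/a.

The closed-loop pole is 1 − g/a, so the loop is stable when |1 − g/a| < 1, i.e., 0 < g < 2a. With the design value a = 5.9, this corresponds to the stated range of roughly 0 to 11.8 for the actual per-core power.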

Actuation Refinement. Through the controller calculation, we have derived the number of cores Na that should be in the turned-on state in order to optimize the CMP performance within the desired power cap. Next, we get the number of unfinished tasks Nb from the OS. We then enforce min(Na, Nb) cores in the turned-on state for the next interval to transform wasted leakage power into the dynamic power headroom that can be used to improve the CMP performance.
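A minimal sketch of the controller in Eq. (3.3.1) together with this actuation refinement is given below; the rounding, the bounds on the core count, and the example numbers are illustrative assumptions rather than the exact implementation.

def pcpg_controller(n_prev, chip_power, power_budget, a=5.9, n_cores=12):
    """Eq. (3.3.1): N(k) = N(k-1) + (Pt - cp(k-1)) / a, rounded and bounded to [1, n_cores]."""
    n = n_prev + (power_budget - chip_power) / a
    return max(1, min(n_cores, round(n)))

def cores_to_turn_on(n_controller, unfinished_tasks):
    """Actuation refinement: enforce min(Na, Nb) turned-on cores in the next interval."""
    return min(n_controller, unfinished_tasks)

# Example: 8 cores on, 70 W measured against a 58 W budget, 10 runnable tasks
na = pcpg_controller(n_prev=8, chip_power=70.0, power_budget=58.0)
print(cores_to_turn_on(na, unfinished_tasks=10))   # fewer cores stay on in the next interval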

3.3.2 Design of DVFS Management Module

Since the power gating state of each core has been decided by the PI controller, we now introduce the Quicksearch algorithm to explore short-term workload variations with per-core DVFS/overclocking.

Quicksearch uses the performance/power ratio to decide the power state (i.e., the

DVFS/overclocking level) of each core. Specifically, we define Perf = Util ∗ Freq as our performance metric. As discussed in [59], the aggregated frequency Freq is a reasonable high-level computational capacity metric. We extend this metric by weighting the frequency with the utilization Util to discount idle cycles. As shown in

Figure 3.2, Quicksearch starts with the current power states of all the cores. If the current power is higher than the chip-wide budget, the algorithm selects the application/core pair that would provide the highest D_power−perf (i.e., power reduction to performance loss ratio). The chip-level power consumption with this new configuration is estimated by adding Power_next − Power_current to the current chip power. If the new estimated chip power is still higher than the budget, Quicksearch is again invoked from the new configuration. This process is repeated until the power budget is met. If the current power is lower than the chip-wide budget, the algorithm selects the application/core pair that provides the highest ratio of performance gain to power increase, D_perf−power. The chip power consumption of this new configuration is estimated by adding Power_next − Power_current to the current chip power. If the new estimated chip power is still lower than the budget, Quicksearch is again invoked from the new configuration. This process is repeated until the power budget is met. Quicksearch extends the SteepestDrop algorithm [127] to consider overclocking states. In an overclocking-enabled system, the final peak DVFS level might not be fixed [2]. Starting from the peak configuration (as in the original SteepestDrop) is not always possible. Therefore, Quicksearch uses the current configuration as the search starting point. Moreover, there are relatively few power mode transitions in stable execution phases. Therefore, Quicksearch reduces the search time. As a greedy algorithm, Quicksearch cannot completely avoid ending at a local optimum.

However, DVFS impacts performance approximately linearly unless the application is extremely memory-bound, which is not a common case.
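The following sketch illustrates one direction of the Quicksearch loop (the power-higher-than-budget case). The level table reuses the testbed p-states listed in Section 3.4.1, but the per-core power and performance models and all other constants are illustrative assumptions, not calibrated values.

def perf(util, freq):
    return util * freq                                   # Perf = Util * Freq

def core_power(util, freq, vdd, beta=30.0, idle=4.0):
    # Illustrative per-core power that scales with utilization, frequency, and Vdd^2
    return beta * util * freq * vdd * vdd + idle

def quicksearch_down(cores, levels, budget):
    """cores: list of {'util', 'level'}; levels: (freq, vdd) pairs from lowest to highest."""
    chip_power = lambda: sum(core_power(c['util'], *levels[c['level']]) for c in cores)
    while chip_power() > budget:
        best, best_ratio = None, -1.0
        # Pick the one-step drop with the largest power reduction per unit of performance loss
        for c in cores:
            if c['level'] == 0:                          # already at the lowest DVFS level
                continue
            f_cur, v_cur = levels[c['level']]
            f_new, v_new = levels[c['level'] - 1]
            d_power = core_power(c['util'], f_cur, v_cur) - core_power(c['util'], f_new, v_new)
            d_perf = perf(c['util'], f_cur) - perf(c['util'], f_new)
            ratio = d_power / max(d_perf, 1e-9)
            if ratio > best_ratio:
                best, best_ratio = c, ratio
        if best is None:                                 # nothing left to throttle
            break
        best['level'] -= 1
    return cores

levels = [(0.8, 0.975), (1.0, 1.0), (1.3, 1.025), (1.5, 1.0625), (1.9, 1.1125)]
cores = [{'util': 0.9, 'level': 4}, {'util': 0.3, 'level': 4}, {'util': 0.6, 'level': 4}]
print(quicksearch_down(cores, levels, budget=120.0))

The power-lower-than-budget direction is symmetric: it raises the level of the core with the highest performance gain per unit of power increase until the budget is reached.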

49 Overclocking and parallel applications. In Quicksearch, we do not explicitly discuss the core acceleration because we consider the overclocking states as extra

DVFS levels above the labeled peak DVFS level. For parallel applications, hardware or software spin loop detectors [72] can be used to identify the active waiting loops within applications to factor out spinning cycles from working cycles. In that case, the utilization-weighted frequency represents the amount of real work that a core is doing. Selecting the steps to maximize the performance/power ratio in Quicksearch is still valid. We can even further improve parallel application performance by detecting the critical thread online and overclocking it [78].

3.3.3 Lifetime Balancing

In this section, we introduce the core lifetime balancing algorithm in PGCapping.

We propose to organize all the cores into two lists. One list contains all the turned-on cores. The other list contains all the turned-off cores. When we need to turn on one core, we select the core with the longest estimated lifetime from the turned-off core list. When we turn off one core, we select the core with the shortest estimated lifetime from the turned-on core list. Wear-out information can be estimated by using aging sensors that dynamically measure the increase in critical path delays due to aging [11]. Note that lifetime balancing only takes place when some cores are turned on/off. Therefore, this balancing algorithm does not interfere with the power-capping algorithm introduced earlier. Our experiments have shown that this balancing strategy can achieve decent core-level lifetime balance results

(in Section 3.5.4). Compared with previous sophisticated designs (e.g., [23]), which usually require on-line workload characteristic evaluation and then use scheduling or per-core DVFS to balance the lifetime of each core, the algorithm we present does not require any modification in OS scheduling and DVFS adjustment. There are mainly

two reasons why such a lightweight algorithm can work well: 1) in a real production environment, there are always variations (e.g., processor utilization and power budget variations) for us to turn on/off cores; 2) the normal wear-out process takes place gradually since we eliminate catastrophic overheating through power capping.
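A minimal sketch of the two-list selection policy is given below; the remaining-lifetime readings are hypothetical values standing in for the aging-sensor estimates.

def pick_core_to_turn_on(off_list, remaining_life):
    # Turn on the least-worn turned-off core (longest estimated remaining lifetime)
    return max(off_list, key=lambda core: remaining_life[core])

def pick_core_to_turn_off(on_list, remaining_life):
    # Turn off the most-worn turned-on core (shortest estimated remaining lifetime)
    return min(on_list, key=lambda core: remaining_life[core])

remaining_life = {0: 0.92, 1: 0.97, 2: 0.88, 3: 0.95}   # hypothetical normalized estimates
on_list, off_list = [0, 2], [1, 3]
print(pick_core_to_turn_on(off_list, remaining_life))    # core 1: least worn of the off cores
print(pick_core_to_turn_off(on_list, remaining_life))    # core 2: most worn of the on cores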

3.4 Implementation

In this section, we introduce the physical testbed for the power capping evaluation and the simulation environment for the lifetime balancing evaluation.

3.4.1 Power Capping Evaluation Testbed

Our testbed is a 12-core AMD Opteron 6168 processor running OpenSUSE 11.3.

The 6168 does not support per-core DVFS or PCPG [3]. We use per-core p-state assignment to emulate per-core DVFS. The 6168 has 5 p-states (0.8GHz/0.975V, 1.0GHz/1.0V,

1.3GHz/1.025V, 1.5GHz/1.0625V, and 1.9GHz/1.1125V). When we assign a p-state to one core, the frequency of that core is enforced independently as the defined value.

However, the core voltage is decided by the highest frequency among the cores because all the cores share the same voltage plane [3]. We use CPU Hotplug [90] to emulate PCPG as in [68]. CPU Hotplug was originally intended for systems with hardware support to install and remove CPU modules without interruption, which mimics precisely the behavior necessary to model per-core power gating. The per-core p-state assignment and CPU Hotplug can emulate the performance impact of per-core DVFS and PCPG. However, we need to estimate the power consumption because the directly measured CPU power does not factor in the power impact of core-grained voltage scaling and PCPG. Our hybrid power estimation (based on real-time measurement), which considers the power impact of per-core DVFS and PCPG, works as follows.

First, we measure the power consumption of the 6168 as in [55]. An Agilent 34410A digital multimeter is used together with a Fluke i410 current probe to measure the current running through the 12V power lines that power the processor. The accuracy of this measurement is ±3.5% of reading. We first set all the cores to the same p-state and measure the idle power Pidle. Our measurements are: 44.0W (0.8GHz), 46.4W (1.0GHz), 51.3W (1.3GHz), 55.1W (1.5GHz), and 61.3W (1.9GHz). Then we calculate the activity factor β in model (3.4.1) [58, 78].

mPow = β ∗ \sum_{j=0}^{N−1} (Freq(j)/1.9) ∗ (VDD/1.1125)² ∗ Util(j) + Pidle,    (3.4.1)

where mPow is the measured power of the entire chip. Freq(j) is the frequency of

the jth core. VDD is the Vdd corresponding to the highest frequency among all the cores because all the cores share one voltage plane. Util(j) is the utilization of the jth core. mPow is measured. Pidle at different chip-level p-states (decided by the peak frequency among the cores) has been measured. Freq(j) and Util(j) can be derived from the OS. By using those inputs, we can compute the chip-level activity factor

β. Using β is a trade-off between estimation accuracy and estimation complexity

[58, 78]. Once we derive β, we estimate the power of each core as:

Pow(j) = β ∗ (Freq(j)/1.9) ∗ (VDD(j)/1.1125)² ∗ Util(j) + Pidle/12,    (3.4.2)

where Pow(j) is the estimated power of the jth core. Please note that VDD(j) is the voltage of the jth core at the current DVFS level. The chip power is estimated as Pow = \sum_{j=0}^{N−1} Pow(j). We assume Pow(j) = 0 if the core is power-gated. We report Pow in our experiments because it accounts for the power impact of per-core DVFS and

PCPG. We do not explicitly consider the dynamic power of the uncore part because the uncore power is actually driven by the core part. We attribute the uncore power to the corresponding core part, which has been shown to be sufficiently accurate for power management purposes [58, 78].

Table 3.1: Workload mixes used in the physical testbed (PARSEC 2.1 and SPEC2006 workloads1; aggregate effect in parentheses).

mix1: 12-mcf (memory intensive)
mix2: 12-perlbench (CPU intensive)
mix3: 8-swaptions, 4-omnetpp (no-barrier parallel)
mix4: 4-(blackscholes, bodytrack), 2-(xalancbmk, povray) (low-barrier and high-barrier parallel)
mix5: 4-x264, 8-fluidanimate (high-lock parallel)
mix6: 4-(vips, facesim), 1-(libq, astar, soplex, dealII) (random mix)
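For concreteness, the following sketch shows how the activity factor β in Eq. (3.4.1) could be solved from one measured chip-power sample and then used in Eq. (3.4.2); the readings in the example are made up, and only the constants given in the text (1.9GHz, 1.1125V, 12 cores) are reused.

PEAK_FREQ, PEAK_VDD, N_CORES = 1.9, 1.1125, 12

def solve_beta(measured_power, idle_power, freq, util, chip_vdd):
    """Eq. (3.4.1): mPow = beta * sum_j (Freq(j)/1.9) * (VDD/1.1125)^2 * Util(j) + Pidle."""
    activity = sum((f / PEAK_FREQ) * (chip_vdd / PEAK_VDD) ** 2 * u
                   for f, u in zip(freq, util))
    return (measured_power - idle_power) / activity

def per_core_power(beta, idle_power, freq, util, vdd, gated):
    """Eq. (3.4.2) per core; a power-gated core is assumed to consume zero."""
    return [0.0 if g else
            beta * (f / PEAK_FREQ) * (v / PEAK_VDD) ** 2 * u + idle_power / N_CORES
            for f, u, v, g in zip(freq, util, vdd, gated)]

# Example with made-up readings: all cores at 1.9 GHz, shared chip Vdd of 1.1125 V
freq, util = [1.9] * 12, [0.7] * 12
beta = solve_beta(measured_power=95.0, idle_power=61.3, freq=freq, util=util,
                  chip_vdd=1.1125)
print(per_core_power(beta, 61.3, freq, util, vdd=[1.1125] * 12, gated=[False] * 12))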

We evaluate the proposed solution with the PARSEC and SPEC workload mixes

(Table 3.1), which cover a variety of different aggregate effects [78].

On our prototype testbed, we implement the control algorithm as an OS daemon process to control the target processor. The intervals of our designs are decided as follows. First, our physical testbed measurements show that the overhead of DVFS is 35μs (close to the 10μs in Nehalem [53]) and the overhead of PCPG averages 127ms, with a worst case of 220ms (close to the 100ms measurement result in [68]). Second, we study the key intervals in the OS. Although the minimum Linux time quantum can reach

10ms, the default time slice is 100ms [13]. The default load balancing interval is

200ms. Third, we decide our design intervals. We select 2s as the PCPG interval to give the kernel sufficient time to balance the workloads in the decoupled design. We select the DVFS interval to be 50ms, which is smaller than the default time slice, to test fine-grained DVFS.

1 The number-appname notation is the number of threads of the application with the name of appname for PARSEC; for the SPEC2006 workloads, it is the number of copies of the application with the name of appname.

3.4.2 Lifetime Balancing Evaluation Simulator

The 6168 does not have aging sensors. Therefore, we evaluate the lifetime balancing part in a simulator [79]. We model the core of Opteron 6168 (e.g., voltage, DVFS levels, frequency). In each time interval (e.g., DVFS interval), per-core power is modeled as in [64] with two parts: the dynamic power is based on utilization and

DVFS level, and the leakage power is based on temperature. The temperature is modeled as in [59] based on the total power. Since the total power impacts the temperature and the temperature impacts the leakage power, these two parts are entangled. We run multiple iterations of the calculation until both the power and temperature readings converge (e.g., no more than 5% change between iterations). We use RAMP2.0 to model the service lifetime. The different failure models and key parameters are the same as those used in [112].
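The following sketch illustrates this fixed-point iteration; the leakage and thermal models and all constants are illustrative assumptions rather than the simulator's calibrated models (the RAMP2.0 parameters are not reproduced here).

def converge_power_temp(dyn_power, ambient=45.0, theta=0.35, leak0=8.0, k_leak=0.03,
                        tol=0.05, max_iter=50):
    """dyn_power: utilization/DVFS-dependent dynamic power of the chip (W)."""
    temp, total = ambient, dyn_power
    for _ in range(max_iter):
        leakage = leak0 * (1.0 + k_leak * (temp - ambient))  # leakage grows with temperature
        new_total = dyn_power + leakage
        new_temp = ambient + theta * new_total               # simple thermal-resistance model
        if abs(new_total - total) / total < tol and abs(new_temp - temp) / temp < tol:
            return new_total, new_temp                       # both readings changed by < 5%
        total, temp = new_total, new_temp
    return total, temp

print(converge_power_temp(dyn_power=60.0))   # converged (total power, temperature)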

3.5 Evaluation

In this section, we compare the power control accuracy, application performance, and core-level lifetime balancing results of PGCapping and the baselines in both the physical testbed and the simulator.

3.5.1 Baselines

In this section, we introduce the five studied baseline policies. Our first baseline is per-core-DVFS-only. It adopts the original SteepestDrop algorithm [127], without PCPG or overclocking states. We select SteepestDrop as an example of state-of-the-art per-core DVFS based solutions. However, the dynamic range of per-core-DVFS-only is limited due to the lack of overclocking states and

PCPG.

Figure 3.3: Power traces of (a) per-core-DVFS-only, (b) PCPG-only, (c) ETurboBoost, (d) per-core-DVFS+PCPG, (e) EQuicksearch, and (f) PGCapping, and (g) power comparison with different budgets and different benchmarks. Decoupled solution PGCapping can precisely enforce the power budget by using PCPG, DVFS, and overclocking in both the high and low power budget cases. We calculate the average power with a 2s window as P_avg to clearly present the general trend. Hardware testbed results.

Our second baseline is PCPG-only. We disable DVFS and use PCPG as the only knob to control power. When the current power cp(k) is lower than the budget Pt, we need to turn on cores. In order to achieve a fast convergence, we turn on multiple cores each time. We randomly pick a core in the turned-off core list, then we use the power when it was turned off as an estimate and add that power to cp(k). If cp(k) is still lower than Pt, we pick another core and so on, until cp(k) is higher than Pt.

When the current power cp(k) is higher than the budget Pt, we do just the opposite.

We allow turning on/off multiple cores each time to achieve fast convergence because PCPG takes place at a coarse time interval. We design PCPG-only as an example

of using PCPG as the only knob to control the chip-level power consumption in a heuristic way. However, this simple heuristic cannot avoid oscillations.
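The PCPG-only heuristic can be sketched as follows; the data structures and function names are illustrative, and only the turn-on/turn-off logic described above is taken from the text.

# Sketch of one PCPG-only control step. core_power_est[c] holds the estimated
# power of core c (its power when it was last on); names are illustrative.
import random

def pcpg_only_step(cp_k, Pt, on_cores, off_cores, core_power_est):
    """Return a list of (action, core) pairs for this PCPG interval."""
    actions = []
    if cp_k < Pt:
        # Below budget: turn on randomly picked off-cores, adding each core's
        # estimated power, until the estimate rises above the budget.
        for core in random.sample(off_cores, len(off_cores)):
            cp_k += core_power_est[core]
            actions.append(("on", core))
            if cp_k >= Pt:
                break
    else:
        # Above budget: do the opposite, turning off cores until the estimate
        # drops below the budget.
        for core in random.sample(on_cores, len(on_cores)):
            cp_k -= core_power_est[core]
            actions.append(("off", core))
            if cp_k <= Pt:
                break
    return actions

Because each step adds or removes whole cores until the estimate crosses the budget, the result repeatedly overshoots the set point, which is the source of the oscillations noted above.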

Our third baseline ETurboBoost is an extended version of Intel TurboBoost [53].

It uses chip-level DVFS and PCPG. The original TurboBoost does not turn off cores when all the cores are running applications and adjusts the DVFS level of the entire chip to fit the power budget by one level at each DVFS interval. ETurboBoost extends original TurboBoost by using the PI controller to control the number of cores when the assigned power budget is lower than the power consumption when all the cores are running at the lowest frequency levels. We implement ETurboBoost as an example of state-of-the-art power capping solution through power gating and chip-level DVFS.

Our fourth baseline Per-core-DVFS+PCPG is using the PI controller for PCPG and per-core DVFS with Quicksearch but without overclocking states. We select this baseline to show the necessity of including overclocking states to fully explore the power headroom.

We design our fifth baseline EQuicksearch (i.e., Enhanced Quicksearch) to be a very competitive coupled solution that integrates per-core DVFS/overclocking with power gating. EQuicksearch adds the power gating state to Quicksearch. We take the power gating state as the extra-low power state below the labeled lowest DVFS level, as in the previous coupled design [64]. We assume the power and performance of a turned-off core to be zero. We estimate Pow_next and Perf_next of a currently turned-off core as the average power and performance of all the turned-on cores at the last interval. We also add the transition power to Pow_next and deduct the transition time from Perf_next to minimize undesirable oscillations. We take EQuicksearch as an example of coupled solutions because it demonstrates two key features of such a solution: 1) the coupled solution has the opportunity to select from the entire available adjustment space; 2) due to the huge search space, pruning strategies (e.g., a greedy

algorithm like SteepestDrop) usually need to be employed. We consider EQuicksearch as a state-of-the-art coupled design because its predecessor SteepestDrop has been shown to outperform many other recently published solutions [84, 116] and intuitive baselines [127]. Since EQuicksearch may change the power gating states of multiple cores, we set the interval to be sufficiently long to conduct the required actuations.

In our experiment, we invoke EQuicksearch every 1s to get the trade-off between actuation overhead and system responsiveness. Note that coupled solutions (either EQuicksearch or previous studies like [64]) may change the on/off states of cores at each control interval. As a result, the OS might reschedule the thread-core mapping, which makes the last-interval-statistic-based decision-making invalid. Therefore, we have to manually assign thread-core affinities when we use coupled solutions. In contrast, our proposed decoupled design can work with the native OS without any manual intervention because the PCPG and DVFS parts have been decoupled from the OS through interval selection.
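For the coupled baselines, the manual thread-core pinning mentioned above can be done through the standard Linux affinity interface; the sketch below uses Python's os.sched_setaffinity wrapper, and the thread-to-core mapping shown is hypothetical.

# Sketch of manually pinning benchmark threads to cores for the coupled baselines
# (the decoupled PGCapping design does not need this). Linux-only.
import os

def pin_threads(thread_to_core):
    """thread_to_core: {tid: core_id} mapping decided by the baseline policy."""
    for tid, core in thread_to_core.items():
        os.sched_setaffinity(tid, {core})   # restrict the task to a single core

# Hypothetical usage: pin 7 worker threads to cores 0-6, leaving cores 7-11
# available for power gating.
# pin_threads({tid: core for core, tid in enumerate(worker_tids)})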

3.5.2 Power Control Accuracy

In this section, we present the physical testbed results. We set 1.3GHz as the normal peak working state. We have three DVFS states (0.8GHz, 1.0GHz, 1.3GHz) and two overclocking states (1.5GHz, 1.9GHz). The overclocking setting agrees with the overclocking potential analysis in [41] (e.g., normally 10-40%).

In Figure 3.3, we run 7 copies of a randomly selected benchmark gobmk from

SPEC CPU2006 on Opteron 6168. We reduce the power budget at 40s from 55W to

35W to emulate a power budget cut scenario due to various reasons (e.g., thermal emergency). The power budget is then raised back to 55W at 80s after the emergency is resolved. Initially, we turn on all cores at the lowest DVFS level, as in the Linux default case.

Figure 3.4: Frequency and turned-on core count traces of (a) per-core-DVFS-only, (b) PCPG-only, (c) ETurboBoost, (d) per-core-DVFS+PCPG, (e) EQuicksearch, and (f) PGCapping, and (g) performance comparison with different budgets and different benchmarks. Decoupled solution PGCapping can reserve power headroom by using PCPG and accelerate cores running useful workloads by using overclocking. Hardware testbed results. The frequencies are normalized to the peak frequency of one core (i.e., Relative freq). The Freq/# is calculated by dividing the total aggregated relative frequency of the entire chip by the number of turned-on cores, which can be interpreted as a high-level computing capability that each turned-on core can offer (higher is preferable). We also calculate the average Freq/# with a 2s window as F_a/#.

Figure 3.3a shows the results of per-core-DVFS-only. Because the frequencies of all the cores have been set to the lowest level, per-core-DVFS-only cannot further reduce power to enforce the power budget cut at 40-80s. This might cause a system crash or other undesirable failures. In Figure 3.3b, with PCPG, the system can successfully reach the low power budget. However, the frequent turning on/off of cores introduces large power oscillations and potentially large thermal cycling. Figure

3.3c shows the results of ETurboBoost. ETurboBoost can enforce the desired power budget. However, ETurboBoost uses chip-level DVFS and introduces large power oscillations. More importantly, ETurboBoost always oscillates between two adjacent DVFS levels of the entire chip, even in the steady state. As a result, even on average, it never settles to the set point. Note that there is on average a 2.5W power gap between the 2s-window average power and the budget curve during 0-40s and 40-80s.

Figure 3.5: (a) Lifetime-balancing results with real-world data center utilization traces; (b) lifetime-balancing results across locations and months; (c) combined test. PGCapping achieves results very close to Balanced (a best-effort per-core-DVFS-based lifetime balancing) and outperforms the Random and Round-robin baselines. Simulation results.

Potential performance benefits could be gained by fully utilizing the power budget.

Moreover, the processor is running on average 0.7W higher than the budget during

40-80s, which may possibly introduce undesirable failures. In Figure 3.3d, we show the case of per-core DVFS with PCPG but no overclocking. Our PI controller with refinement (Section 3.3.1) can identify that only 7 cores are used in our experiment and turns off the remaining 5 cores. However, without overclocking, even with all the cores set to the peak DVFS level, the power budget still cannot be fully utilized. Figure 3.3e shows that EQuicksearch can enforce the power budget.

However, EQuicksearch has larger oscillations due to the coarser time interval. Figure

3.3f shows that PGCapping can precisely enforce the power budget. In the steady state, PGCapping oscillates between two adjacent DVFS levels of one core, which is much smaller than oscillating between two adjacent DVFS levels of the entire chip in ETurboBoost. As more and more cores are expected to be integrated on-chip with future process technology scaling, the benefit of one-core adaptation instead of whole-chip adaptation will become more significant. Figure 3.3g shows that per-core-

DVFS-only, PCPG-only, ETurboBoost, and Per-core-DVFS+PCPG cannot adjust the processor to the desired power budget in certain cases for the reasons analyzed above. In contrast, PGCapping and EQuicksearch can always control the power to the assigned set point for different benchmark mixes with different power budgets.

3.5.3 Application Performance

In Figure 3.4, we present the frequency and turned-on core count traces corresponding to Figure 3.3 to show the performance impact of the different policies. The frequencies of the turned-off cores are counted as 0.

In Figure 3.4a, per-core-DVFS-only policy has no PCPG. Therefore, during 40-

80s, the frequencies of all the cores have been set to the lowest level continuously and could not go lower to enforce the desired power budget. Figure 3.4b shows the results of

PCPG-only. Because PCPG-only does not change the DVFS levels of the cores, it only adjusts the number of turned-on cores to fit the power budget. We can observe that the cores are always turned on/off frequently. This is not desirable because

PCPG has a high overhead compared with DVFS. Frequently turning on/off cores negatively impacts performance. In Figure 3.4c, ETurboBoost adjusts the DVFS level of the entire chip up/down around the power budget, which cannot exploit the inter-core variations. Moreover, due to the large granularity of chip-level DVFS,

DVFS and PCPG have large interference at transition time, leading to the large

overshoot at 80s. Figure 3.4d shows the results of per-core DVFS with PCPG but no overclocking. We can see that during the periods when the power budget is high (0-40s and after 80s), due to the lack of overclocking states, the DVFS levels of the cores could not be raised further, which fails to achieve optimal performance. In Figure 3.4e, EQuicksearch purely relies on a heuristic algorithm to decide the power state of each core in a coupled fashion (e.g., using DVFS or on/off state to adjust power consumption).

Therefore, the adaptation is limited to a coarse level. Furthermore, EQuicksearch cannot avoid unnecessary oscillation of power states due to system noise and workload variations. Note that we already minimize the power gating oscillation by considering the transition power and time overhead in the EQuicksearch algorithm. In

Figure 3.4f, PGCapping can identify the idling cores and turn them off to reserve the power budget headroom. It then uses overclocking to fully utilize the headroom. Note that a core is running at a higher DVFS level on average when managed by PGCapping than ETurboBoost and EQuicksearch because PGCapping can fully utilize the power budget. In addition, PGCapping can fully explore the runtime variations among different cores by using per-core DVFS and per-core overclocking. Figure 3.4g shows the performance comparison for different policies on different benchmark mixes. In this chapter, we use Fair Speedup (FS) as our performance indicator. The FS of a partitioning scheme is defined as the harmonic mean of per-application speedup with respect to the equal resource share case (i.e., peak frequency for all applications) [10].

The FS achieved by a scheme can be expressed as

FS(scheme) = \frac{N_{app}}{\sum_{i=1}^{N_{app}} \frac{ET_{app_i}(scheme)}{ET_{app_i}(base)}} \qquad (3.5.1)

where ET_{app_i}(scheme) is the execution time of the i-th application under a certain power management scheme, ET_{app_i}(base) is the execution time of running the i-th application at the peak frequency level all the time, and N_{app} is the number of applications in the system, i.e., the set of applications that execute together. FS is an indicator of the overall improvement in execution efficiency gained across the applications.
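The metric is straightforward to compute; the following is a minimal sketch of Eq. (3.5.1) with illustrative names.

# Minimal sketch of the Fair Speedup metric in Eq. (3.5.1).
def fair_speedup(et_scheme, et_base):
    """et_scheme[i] and et_base[i] are the execution times of application i under
    the evaluated scheme and at the peak-frequency baseline, respectively.
    FS is the harmonic mean of the per-application speedups."""
    n_app = len(et_scheme)
    return n_app / sum(s / b for s, b in zip(et_scheme, et_base))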

Figure 3.4g shows that PGCapping outperforms per-core-DVFS-only, PCPG-only,

ETurboBoost, and Per-core-DVFS+PCPG by 16.9%, 42.0%, 13.9%, and 21.4% on average, respectively. In general, the reason that PGCapping can outperform the baselines is that it fully utilizes the power budget. Therefore, the cores are generally running at a higher frequency level. Specifically, PGCapping reserves power headroom for cores running useful workloads by turning off unused cores. Therefore, it outperforms per-core-DVFS-only. Then, PGCapping uses per-core DVFS/overclocking to fully utilize the power headroom and explore the runtime inter-core variations for optimized performance. Therefore, PGCapping can outperform ETurboBoost and Per-core-DVFS+PCPG. Although EQuicksearch has the full set of possible configuration combinations to explore for optimized performance at a coarse time interval, it cannot eliminate unnecessary oscillation of power gating due to system noise. In contrast,

PGCapping benefits from fine-time-grain DVFS and the PI controller in the power gating part. Therefore, PGCapping achieves 11.5% better performance than EQuicksearch on average.

3.5.4 Lifetime Balancing

In this section, we first introduce three baselines for our lifetime balancing part. Next, we present our simulation results. We use three baselines to compare with our power-gating-based lifetime balancing algorithm in PGCapping (Section 3.3.3). The first baseline is Random: when we need to turn on cores, we randomly select the turned-off cores to turn on; when we turn

off cores, we randomly select the turned-on cores to turn off. The second baseline is Round-robin [59]: when we turn off cores, we start from the first core, then the second, and so on. When we turn on cores, we also start with the first. If it is already turned on, we turn to the next. The third baseline Balanced is an extension of the lifetime balancing algorithm discussed in [23]. Balanced sorts the cores in descending order of lifetime and the next-step available per-core DVFS levels or on/off states in descending order. Next, it maps the two lists. Balanced strictly enforces lifetime balancing among cores at each DVFS and PCPG interval without considering the actuation overhead.

Balanced negatively impacts performance. Therefore, it is not suitable for real-world implementation. However, it has been suggested as a reasonable approximation for the best-effort per-core DVFS based lifetime balancing [23].

In Figure 3.5, we conduct three test cases to evaluate the lifetime balancing with real-world workload and power budget variations. In Figure 3.5a, we evaluate our proposed life-balancing solution in PGCapping with real-world workload variations. We assign a fixed power cap (e.g., 62W) to the processor while using a continuous 30-day real-world server utilization trace [132].

We derive the 62W budget by calculating the overall 30-day average utilization and test-running the simulator with that average utilization at the peak frequency. PGCapping outperforms Random and Round-robin by 9.2% and 8.1%. PGCapping achieves 92.3% of Balanced in MTTF. In Figure 3.5b, we evaluate our proposed life-balancing solution with real-world power budget variations. We assume the processor is powered by a solar panel (e.g., BP3180N [14]) and hosts HPC applications, which is a newly proposed green-energy solution [69]. Therefore, the power budget changes with the solar radiation but the utilization is fixed to 100%. We use meteorological data from the Measurement and Instrumentation Data Center (MIDC) [92] at locations that have different solar energy resource potentials (e.g., AZ, CO, NY, TN) across

different seasons (e.g., the middle of Jan., Apr., Jul. of 2011 and Oct. of 2010) to calculate the maximum available solar power envelopes. Then, we run our simulator with the power budget envelopes. On average, PGCapping outperforms Random and

Round-robin by 11.0% and 7.1%. PGCapping achieves 90.7% of Balanced in MTTF.

In Figure 3.5c, we assume the processor is powered by a solar panel with the real-world server utilization trace. We exhaustively test the 30-day trace with 4 different seasons and 4 different locations and present the result in Figure 3.5c (i.e., a total of 4x4 30-day tests). PGCapping outperforms Random and Round-robin by 9.2% and 7.2%. It achieves 90.4% of Balanced in terms of MTTF in this test case.

When PGCapping uses PCPG to enforce the power budget, it always turns on the longer-lifetime cores and turns off the shorter-lifetime cores. Therefore, none of the on-chip cores becomes significantly worn out. That is the reason why PGCapping can outperform Random and Round-robin. Due to real-world power budget and utilization variations, there are always certain chances to turn on/off some cores.

Therefore, although PGCapping does not strictly balance lifetime among cores at each DVFS and PCPG interval, it still achieves 90.4% lifetime of Balanced.

3.6 Conclusion

Both DVFS and power gating are anticipated to be important power adaptation knobs. However, current studies on integrating power gating and DVFS focus on deciding the power gating and DVFS levels in a coupled fashion, leaving the whole direction of a decoupled design undiscussed. In this chapter, we have explored PGCapping, a decoupled design to integrate per-core power gating with DVFS/overclocking for

CMP power capping. However, both per-core DVFS and overclocking may make some cores age much faster than others and thus become the reliability bottleneck in the whole system. Therefore, PGCapping also balances the lifetimes of the CMP

cores by using power gating. Our empirical results show that PGCapping achieves up to 42.0% better average application performance than five state-of-the-art baselines.

Furthermore, our extensive simulation results with real-world traces demonstrate that

PGCapping can increase CMP lifetime by 9.2% on average.

CHAPTER 4

ENERGY EFFICIENCY IN GPU-CPU

HETEROGENEOUS ARCHITECTURES

4.1 Introduction

In recent years, GPU-CPU heterogeneous architectures have been increasingly adopted in high performance computing, because of their capabilities of providing high com- putational throughput. For example, the recently built supercomputer Tianhe-1A, which has won the second spot on the TOP500 list [118], is equipped with Intel

5670 processors and Nvidia’s CUDA-enabled Tesla M2050 general purpose GPUs.

GPUs excel at data parallel operations due to the optimized SIMD (Single Instruction Multiple Data) structure. Given the same amount of data, using one instruction to process multiple pieces of data can be more efficient than one instruction for one piece of data in both performance and energy. This advantage has helped Tianhe-1A to achieve more than twice the energy efficiency of the third-place CPU-based

Jaguar on the TOP500 list. However, Tianhe-1A still has an estimated annual elec- tricity bill of $2.7 million [40]. Therefore, it is important to further improve the energy efficiency of GPU-CPU heterogeneous architectures.

GPU-CPU heterogeneous architectures offer some unique opportunities for energy conservation. Since GPUs have energy efficiency advantage over CPUs for parallel and computation-intensive applications, an intuitive solution seems to be assigning

all those workloads to the GPU for energy efficiency. However, our experiments (in Section 4.3.2) show that the GPU taking all the workloads is not necessarily the most energy-efficient workload division. The main reason is that if the GPU takes all the workloads while the CPU is totally idling, the execution time of the entire system can be longer than that in the case when the CPU does a fair portion of work. Since energy is the product of time and power, a more energy-efficient solution is to split and distribute the workload to the GPU and CPU, such that both sides can finish approximately at the same time. However, because CPUs and GPUs differ considerably in their processing and memory capabilities, it is challenging to design an algorithm that can achieve an energy-efficient workload division for all different workloads. Furthermore, different power adaptation knobs, such as frequency (and voltage) scaling, are commonly enabled in both CPUs and GPUs. Since frequency scaling may impact the hardware capabilities, workload division policies assuming fixed underlying hardware working status might lead to inferior workload allocation.

Therefore, a dynamic workload division algorithm aware of the hardware status needs to be designed.

After the workload is split and allocated to the GPU and CPU, another research challenge is to manage hardware resources according to the runtime needs of work- loads for energy savings without compromising performance. In GPUs, real-world ap- plications rarely fully stress the GPU cores and memory simultaneously [46]. Hence, there are opportunities to save energy by throttling the components with low utiliza- tions. For example, for GPU core-bounded workloads, we can throttle the memory frequency for energy savings. However, a naive solution may over-throttle the memory and thus make the memory part become the system bottleneck, resulting in unneces- sary performance degradation. A similar argument also applies to memory-bounded

workloads. Therefore, the GPU cores and memory must be managed in a coordinated manner, based on the workload characteristics, for energy savings with minimal performance degradation. However, existing research on GPU energy management focuses on either GPU cores or memory. The direction of throttling the frequency levels of both the GPU cores and memory in a coordinated fashion for improved GPU energy efficiency needs to be investigated.

In this paper, we propose GreenGPU, a holistic energy management framework for GPU-CPU heterogeneous architectures. Our solution features a two-tier design.

In the first tier, GreenGPU dynamically splits and distributes workloads to GPU and CPU based on the workload characteristics, such that both sides can finish ap- proximately at the same time. As a result, the energy wasted on idling and waiting for the slower side is minimized. In the second tier, GreenGPU dynamically throt- tles the frequencies of the GPU cores and memory in a coordinated manner, based on their utilizations, for maximized energy savings with only minimal performance degradation. Likewise, the frequency and voltage of the CPU are scaled similarly.

Specifically, this paper makes the following contributions:

• We propose to improve the energy efficiency of GPU-CPU heterogeneous architectures in a holistic way to utilize both workload division and frequency scaling for maximized energy savings.

• We design a two-tier solution that dynamically splits and distributes workloads to GPU and CPU in the first tier and throttles the frequencies of GPU cores, GPU memory, and CPU cores in the second tier.

• We develop a light-weight machine learning algorithm to adjust the frequency levels for the GPU cores and memory in a coordinated manner, based on the workload characteristics, for energy conservation.

Figure 4.1: (a) Memory frequency vs. energy, (b) memory frequency vs. performance, (c) core frequency vs. energy, and (d) core frequency vs. performance for nbody and SC. Normalized execution time is the execution time of a workload normalized to its execution time at the peak frequency. Relative energy is the energy normalized to the energy consumed at the peak frequency. There are opportunities to save energy with negligible performance loss by throttling under-utilized components.

• We implement GreenGPU using the CUDA framework on a real physical testbed with Nvidia GeForce GPUs and AMD Phenom II CPUs. Experiment results with standard Rodinia benchmarks show that the proposed GreenGPU framework achieves 21.04% average energy savings compared with several well-designed baselines.

The rest of the paper is organized as follows. Section 4.2 highlights the difference between this paper and related work. Section 4.3 motivates our work with two case studies. Section 4.4 presents our overall system design at a high level. Section 4.5 introduces in detail the algorithms proposed for dynamic frequency scaling and workload division. Section 4.6 provides the implementation details of our testbed, while Section 4.7 discusses our hardware experimental results. Finally, Section 4.8 concludes this paper.

4.2 Background

Recently, workload division between CPU and GPU has drawn the attention of re- searchers. Luk et al. [76] propose Qilin framework, an adaptive mapping scheme to

map computation tasks to processing elements on the CPU and GPU in one machine. While their target is solely to minimize the execution time, our scheme integrates workload division with GPU core and memory throttling to improve the energy efficiency of the entire system. Wang et al. [122] propose to coordinate inter-processor work distribution to minimize energy consumption under a given scheduling length constraint. However, their work does not throttle the GPU memory and cores in a coordinated manner based on workload characteristics for maximized energy savings. In addition, their approach requires offline profiling, which may be undesirable because it can be expensive to do profiling for applications with a large amount of data every time for different input variables. Some GPU-CPU workload division studies are conducted based on the MapReduce [27] framework. For example, Ravi [103] proposes dynamic input data partitioning among CPU cores and GPU cores based on two applications, K-means clustering and Principal Component Analysis. Hong et al. [45] discuss a uniform memory management framework among CPU and GPU as well as a uniform programming API. While both studies aim at improving the code portability between CPU and GPU, GreenGPU addresses a different problem, i.e., improving the energy efficiency of GPU-CPU heterogeneous architectures.

There are some existing research efforts to improve the energy efficiency of GPUs but those studies address GPU cores and memory in an isolated manner. Hong et al. [46] have shown that throttling the number of GPU cores based on their novel power model can save energy. Based on their power measurement, Collange et al. [22] conclude that memory access pattern and bandwidth play a major role in achieving both good performance and low power consumption. Jang et al. [56] formulate the memory access pattern of the threads inside a computation kernel in a matrix to select runtime parameters. Compared with those previous studies that address either GPU cores or memory, GreenGPU coordinates both GPU cores and memory for

maximized energy savings. Recently, Lee et al. [65] have studied coordinating the number of active cores, the frequency of the core part, and the cache part in GPUs. However, the coordination between CPU and GPU (one of the key components in GreenGPU) is not examined in their work.

The general energy saving techniques of CPUs have been researched extensively

[50, 129, 44, 70, 17]. Particularly, some projects have tried to improve the energy efficiency of chip multiprocessors running parallel applications. For example, Li et al. [70] propose a solution called thrifty barrier that places the faster cores into a lower power mode at the barriers (i.e., join points) while waiting for the slower cores so that energy can be saved. Liu et al. [74] use per-core DVFS to slow down the faster cores, such that both the idle time due to waiting and the energy consumption are reduced. Cai et al. [17] extend [74] by adding meeting points within the execution of the parallel loops and solve the same problem at a finer granularity. However, all these studies focus on CPU-only architectures without considering the GPU as part of the system, so their methods cannot be directly applied to GPU-CPU heterogeneous architectures.

4.3 Motivation

In this section, we conduct two case studies: 1) frequency scaling on GPU cores and memory, and 2) workload division between CPU and GPU, to motivate our work.

4.3.1 A Case Study on Frequency Scaling for GPU Cores and Memory

In GPUs, real-world applications rarely stress both the core and memory parts at the same time [46]. The component (e.g., cores and memory) utilization measures how intensely a workload is exercising one part of the hardware resource. Nvidia defines the core part utilization as "GPU busy cycles/total cycles" and the memory utilization as "actual bandwidth/rated peak bandwidth" [93]. We get the utilizations by using Nvidia's nvidia-smi toolset. Based on the utilization trace analysis, we categorize nbody as core-bounded and streamcluster (SC) as memory-bounded. nbody and SC are workloads in the CUDA SDK. We conduct the following experiments on an Nvidia GeForce 8800 GTX GPU to explore energy saving opportunities.

Figures 4.1a and 4.1b illustrate the performance and energy when we throttle the memory frequency. The frequency of the cores is set to the peak value. The opposite case is presented in Figures 4.1c and 4.1d. In Figures 4.1a and 4.1b, reducing the frequency of the memory part saves energy with minor performance loss for core- bounded nbody; but it impacts both performance and energy for memory-bounded

SC. Figures 4.1c and 4.1d show that reducing the frequency of the core part negatively impacts both performance and energy for core-bounded nbody. For SC, reducing the frequency of the core part to a certain point (e.g., 410MHz) saves energy with negligible performance loss; further reducing the frequency of the core part negatively impacts both performance and energy. We make the following two observations based on the experiments:

1. For either memory-bounded or core-bounded workloads, properly scaling down the under-utilized component can save energy with negligible performance impact.

2. For a given utilization, there can be a frequency level of that component that is most suitable, i.e., a higher frequency may lead to higher energy consumption while a lower frequency may result in performance loss.

In this paper, we aim to design frequency scaling algorithms that dynamically adjust the frequency levels according to the measured utilizations of the cores and memory.

Figure 4.2: Energy consumption for different workload division ratios (percentage of the workload the CPU takes). The cooperation of the CPU and GPU parts can be more energy efficient than the GPU part taking all the work exclusively.

4.3.2 A Case Study on Workload Division between GPU and CPU

Although GPUs have a theoretical energy-efficiency advantage in parallel computing, CPUs may participate in the computation to provide even higher throughput for the whole system. Luk et al. [76] give a workload division case study to show that different workload divisions between the CPU and GPU parts yield different performance. In

Figure 4.2, we conduct similar experiments to investigate the relation between energy consumption and workload division. Implementation details are introduced in Section 4.6. We measure the energy consumption of the entire GPU-CPU system when we vary the percentage of work allocated to the CPU from 0% to 90%. The example case is based on the kmeans workload from the Rodinia benchmark set [20]. We observe that the energy consumption goes down as the CPU work percentage goes from 0% to

10%, and then goes up from 10% to 90%. The optimal point takes place when CPU side takes 10% of the total work. Figure 4.2 shows that the cooperation of the CPU and GPU parts can be more energy efficient than the GPU part taking all the work exclusively. On a GPU-CPU heterogeneous platform, the average energy/data effi- ciencies (i.e., the joules consumed per certain amount of data [42]) on the GPU and

CPU sides are different. Given a fixed amount of workload (fixed amount of data), the energy optimization problem can be abstracted as: Given a GPU-CPU system with a GPU and a CPU, a certain amount of workload as x, the energy coefficient of

the CPU and GPU as a1 and a2, respectively, we need to find a workload division x1 for the CPU and x2 for the GPU with x1 + x2 = x, such that a1 ∗ x1 + a2 ∗ x2 is minimized.

This optimization problem is not trivial because a1 and a2 are different for different workloads and may change over time. So the minimum point may also change over time. In this paper, we aim to design a workload division algorithm that dynamically adjusts the workload division to find the energy minimized point at runtime.

4.4 System Design of GreenGPU

In this section, we first introduce a typical hardware configuration of GPU-CPU heterogeneous architectures. We then present the two-tier design of GreenGPU, tar- geting at reducing the system energy consumption. Finally, we discuss the interaction between the two tiers in GreenGPU.

The lower part of Figure 4.3 shows the logic view of a typical GPU-CPU heterogeneous platform. The CPU, main memory, and the GPU are connected by the system bus. The CPU works as the master and the GPU works as the slave in this configuration. Although the GPU is a slave device, it has DMA (Direct Memory Access) to improve its memory capability. The CPU and GPU parts have very different architectures. Compared with the CPU, the GPU part is equipped with enhanced stream processors (SP) organized as stream multiprocessors (SM) to perform high throughput computation.

The SIMD architecture of GPUs is highly optimized for data parallelism and has higher energy efficiency than CPUs for certain parallel workloads. As shown in the upper part of Figure 4.3, GreenGPU is a two-tier solution, running on the CPU. GreenGPU includes a workload division unit and two frequency scaling units (one for CPU and one for GPU, respectively). Both the workload division and the frequency scaling are invoked periodically, however, with different invocation periods.


Figure 4.3: GreenGPU features a two-tier design to reduce the energy consumption of CPU-GPU heterogeneous platforms. The higher tier (i.e., the workload division tier) dynamically partitions the incoming workloads to the CPU and GPU parts. The dash lines connect the components of the workload division part. The lower tier (i.e., the frequency scaling tier) takes the utilization of processing elements (GPU cores, GPU memory, and CPUs) to decide the proper frequency levels of them to reduce energy consumption. The dotted lines connect the components of the frequency scaling part.

In the first tier, the workload division unit dynamically divides the workloads between CPU and GPU based on their execution time to reduce the idling energy.

The dynamic workload division takes place at every iteration. An iteration is defined as the execution of a fixed amount of work. The iteration selection is workload de- pendent. An iteration could be the execution to the reduction point in the algorithm of a workload (e.g., the iteration in kmeans). Or it could be the execution to the common barrier point in a workload with barriers (e.g., the step in hotspot). Since the operations inside each iteration are similar [86], the statistics collected during the previous iteration can serve as a prediction for the execution of the next iteration.

Our target is to dynamically adjust the workload division in every iteration to min- imize the execution time difference between the GPU and CPU parts for minimized

idle energy. If, in the previous iteration, the CPU ran slower than the GPU, we assign less work to the CPU and more work to the GPU in the next iteration; likewise, if the CPU ran faster than the GPU, we assign more work to the CPU and less to the GPU. This light-weight heuristic reduces the idling time through workload division between the CPU and GPU parts to reduce idle energy. Please note that although we give two examples for iterations, the iteration selection is not limited to these two types. As long as the next-iteration execution time on the CPU and GPU parts can be reasonably predicted based on the current iteration, those iteration selections are valid. For example, for workloads in which all the working threads are totally independent, the whole data set could be divided into a series of chunks. We refer to each run on a chunk as an iteration.

In the second tier, once the workload is divided and assigned to the CPU and

GPU, the two frequency scaling units in GreenGPU monitor the utilization of each component, and dynamically adjust the frequency levels of each component to achieve an improved energy efficiency. As we analyzed in detail in Section 4.3, component utilization metric can capture how intense the application is exercising the hardware.

Highly utilized resource needs to be running at a high frequency level; low utilized resource can be throttled to save energy without significantly impacting performance.

Since the GPU cores and memory interact with each other, we propose a coordinated algorithm to assign frequencies to the GPU cores and memory. The inputs of the algorithm are the utilizations of the GPU cores and memory in the current interval.

The outputs of the algorithm are the target frequency levels of the GPU cores and memory for the next interval. The detailed algorithm is presented in Section 4.5. For the CPU part, because there already exists rich research on dynamic voltage and frequency scaling (DVFS), we simply adopt the default Linux power saving strategy by setting the CPU frequency policy mode to ondemand [97], instead of designing a new algorithm. Ondemand is a dynamic in-kernel cpufreq governor that can change the CPU frequency depending on the CPU utilization. The ondemand governor works as follows. If the CPU utilization rises above an upper utilization threshold value, the ondemand governor increases the CPU frequency to the highest available frequency.

When CPU utilization falls below a low utilization threshold, the governor sets the

CPU to run at the next lowest frequency. Ondemand was first introduced in the linux-

2.6.9 kernel. It has been commonly used in a variety of systems with proven success. Please note that other more sophisticated DVFS-based processor power management strategies, such as [50, 129, 110], can also be integrated into GreenGPU for even more energy savings.
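Selecting the stock governor requires no new algorithm on the CPU side; one way to do it programmatically is to write the governor name through the standard cpufreq sysfs interface, as in the sketch below (root privileges assumed; the function name is illustrative).

# Sketch of selecting the stock Linux "ondemand" cpufreq governor for all CPU
# cores via the standard sysfs interface; requires root privileges.
import glob

def set_cpu_governor(governor="ondemand"):
    for path in glob.glob("/sys/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_governor"):
        with open(path, "w") as f:
            f.write(governor)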

The workload division tier is invoked periodically to change the workload division between the CPU and GPU parts, which impacts the utilizations of the GPU cores,

GPU memory and CPU. Therefore, to minimize the impact on the stability of the frequency scaling loops in the second tier, the period (i.e., iteration length) of the workload division tier is configured to be much longer than the period of frequency scaling to decouple the two loops. This design choice provides more flexibility for the design of each individual part. Alternatively, we could explore coupled algorithms.

However, workload division commonly has higher overheads than frequency scaling and thus cannot be performed too frequently to deal with short-term workload vari- ations. On the other hand, frequency scaling can be conducted more frequently due to its lower overheads. As a result, the two actuations are designed to run in dif- ferent intervals for the trade-off between overhead and system responsiveness. For the GPU in our testbed, both the core and memory parts have 6 frequency levels.

Therefore, the GPU frequency scaling loop may need 36 periods to test all the 36

(6x6) combinations of core and memory frequencies in the worst case. We select the workload division interval long enough (e.g., no less than 40 times longer than that of the GPU frequency scaling interval) for the DVFS algorithm to converge to the optimal setting within one workload division interval. Therefore, before the next division starts, the DVFS setting has already settled at the optimal point with the current division setting. Since the convergence time is normally short, the impacts of DVFS on workload division are observed to be negligible. Therefore, the two tiers will not interfere in a destructive way.

Algorithm 1 Pseudo code for the online learning frequency scaling scheme
  Initialize weight[N][M]; weight[i][j] ∈ [0, 1], i ∈ N, j ∈ M, with N core frequency levels and M memory frequency levels
  Set up ucmean[N], ummean[M];
  while 1 do
    Read GPU core and memory utilizations uc and um;
    Calculate the core loss function, the memory loss function, and the total loss;
    Update weight[N][M];
    Select the frequency pair corresponding to the highest weight[i][j] for the GPU cores and memory;
    Assign the frequency levels to the cores and memory;
  end while

4.5 GreenGPU Algorithms

In this section, we first present our frequency scaling algorithm for GPU cores and memory. We then introduce our workload division algorithm.

4.5.1 Dynamic Frequency Scaling for GPU Cores and Memory

Our dynamic frequency scaling algorithm aims at assigning frequency levels to GPU cores and memory to save energy with minimal performance loss. Since the GPU cores and memory parts interact with each other, we develop a coordinated algo- rithm to address this issue. We adopt the design framework of WMA (Weighted

Majority Algorithm) [73] to develop our algorithm. In machine learning, WMA is a meta-learning algorithm used to construct a compound algorithm from a pool of

78 configurations. A weighted voting method could be used to find the optimal one(s) [37]. Specifically, we maintain a core-memory frequency pair weight table. Each field records the weight of one core-memory frequency pair. Those weights are updated according to the utilizations of the GPU cores and memory in the previous interval.

The algorithm selects the core-memory pair with the highest weight to enforce in the next interval. Algorithm 1 explains the flow of our algorithm. We first initialize all fields to an equal value (e.g., 0) since we do not have a preference for any specific frequency level in the initial state. After the initialization, we periodically read the utilizations of the GPU cores and memory, update the weight in each field based on its corresponding utilizations, then select the highest weight in the table and enforce the corresponding core-memory frequency pair. We can see that the key part is how to update the weights, which we introduce in detail in the following paragraphs.

As discussed in Section 4.3.1, highly utilized resource needs to run at a high fre- quency level while low utilized resource can be throttled to save energy without sig- nificantly impacting system performance. Therefore, for each component utilization, there may be a corresponding optimal frequency level. However, the available fre- quency levels in our system are discrete. Therefore, the purpose of our core-memory pair table is to evaluate how close the current utilization value is to the most suit- able utilization of each core-memory frequency pair. The suitability for the current

workload is represented by a loss factor (0 ≤ l_i^t ≤ 1, 1 ≤ i ≤ N, where N is the number of available core frequency levels and t is the time interval index), which can be evaluated by comparing the current utilization u of the workload to the most suitable utilization umean of each available frequency level. We define umean in the same way as in [29], which has been studied and validated on CPUs. We assume the peak frequency is suitable for utilization 100%, the lowest frequency is suitable for utilization 0%, and the other utilization/frequency pairs are linearly mapped.

Table 4.1: Loss function used in the GPU frequency scaling algorithm

Value of u        Energy Loss (l_ie^t)    Performance Loss (l_ip^t)
u > umean[i]      0                       u - umean[i]
u < umean[i]      umean[i] - u            0

Table 4.1 shows how to calculate the overall loss (l_i^t). If the current u is smaller than the umean of a utilization level, then the workload stresses the resource less than the current frequency level can deliver. Hence, the resource can afford to run slower. This configuration

has no performance loss (l_ip^t), but since it could have saved energy by running slower,

it has an energy loss (l_ie^t). Similarly, there is a performance loss, but no energy loss, when u > umean. Our metric includes both performance and energy to be general, because it offers a trade-off between performance and energy. Although energy is the product of execution time and power, sometimes a DVFS setting with very low power consumption but a long execution time can be selected if its energy is the lowest. We include the trade-off to prevent this situation, because performance is the primary concern for many HPC applications. Our target is to save energy with only negligible performance degradation. The parameter α used to combine the two losses is user-defined and determines the relative importance of performance vs. energy savings. A smaller

α directs the algorithm to favor performance while a larger α directs the algorithm to favor energy saving. In our system, since energy increases when performance degrades (i.e., a longer execution time), we give a higher weight to performance by setting αc = 0.15 for cores and αm = 0.02 for memory (αc and αm are derived from experiments). Specifically, the loss factors for the GPU cores and memory are calculated as:

l_{ci}^t = \alpha_c \times l_{cie}^t + (1 - \alpha_c) \times l_{cip}^t \qquad (4.5.1)

l_{mj}^t = \alpha_m \times l_{mje}^t + (1 - \alpha_m) \times l_{mjp}^t \qquad (4.5.2)

Then we combine the core and memory losses by a factor φ to get the total loss in Equation 4.5.3. φ balances the impact of the cores and memory on the system performance and energy saving. In our hardware testbed, φ = 0.3 is the value that reflects the system characteristics; it is derived from experiments.

TotalLoss_{ij}^t = \phi \times l_{ci}^t + (1 - \phi) \times l_{mj}^t \qquad (4.5.3)

Based on the total loss, the weights used in the frequency scaling algorithm are updated as follows.

weight_{ij}^{(t+1)} = weight_{ij}^{(t)} \times (1 - (1 - \beta) \times TotalLoss_{ij}^t) \qquad (4.5.4)

In Equation 4.5.4, β (0 < β < 1) is introduced to trade off the current loss factor against the previous history weight. We select β = 0.2 from experiments to filter out limited system noise while still responding quickly to workload changes.

Among the N × M weights (suppose we have N core frequency levels and M memory frequency levels), the highest one is selected and its corresponding core and memory frequencies are enforced in the next period. Please note that we currently derive α, β, and φ from manual tuning due to the lack of an accurate, general, and scalable performance/power model for GPUs, which could be a future direction.
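Putting the pieces together, the per-interval update and selection step can be sketched as follows. The parameter values (αc = 0.15, αm = 0.02, φ = 0.3, β = 0.2) come from the text; the helper names are illustrative, and the weights are assumed to start from a uniform positive value so that the multiplicative update in Eq. (4.5.4) remains meaningful.

# Sketch of the WMA-style loss/weight update of Section 4.5.1.
ALPHA_C, ALPHA_M, PHI, BETA = 0.15, 0.02, 0.3, 0.2

def suitable_utils(n_levels):
    """Linear map of frequency levels to suitable utilizations:
    lowest level <-> 0%, peak level <-> 100%."""
    return [i / (n_levels - 1) for i in range(n_levels)]

def loss(u, u_mean, alpha):
    """Combine energy and performance loss for one frequency level (Table 4.1)."""
    energy_loss = max(u_mean - u, 0.0)   # level delivers more than the workload needs
    perf_loss = max(u - u_mean, 0.0)     # level delivers less than the workload needs
    return alpha * energy_loss + (1 - alpha) * perf_loss

def select_frequencies(weight, uc, um, ucmean, ummean):
    """Update the N x M weight table in place and return the chosen (i, j) pair.
    weight is assumed to be initialized to a uniform positive value (e.g., 1.0)."""
    for i, ucm in enumerate(ucmean):
        for j, umm in enumerate(ummean):
            total = PHI * loss(uc, ucm, ALPHA_C) + (1 - PHI) * loss(um, umm, ALPHA_M)
            weight[i][j] *= 1 - (1 - BETA) * total
    return max(((i, j) for i in range(len(ucmean)) for j in range(len(ummean))),
               key=lambda ij: weight[ij[0]][ij[1]])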

4.5.2 Workload Division

We now introduce how we use execution time as an indicator to divide workloads between the CPU and GPU parts.

We define the percentage of work that the CPU takes in an iteration as r; then the GPU takes the remaining 1 − r percent of the work. The time the CPU uses to finish its work in an iteration is defined as tc, while the GPU's execution time is defined as tg.

When the system finishes the computation of the current iteration, the workload division unit will compare tc and tg. If tc is longer than tg, r will be reduced by one step (e.g., one fixed amount, 5%). If tc is shorter than tg, r will be increased by one step. The 5% division step is hardware platform dependent and decided by experiments. The system takes a long time to converge to the optimal division point if we use a small step. There will be large oscillation if we use a large step.

Since the workload division is not continuous, there may be oscillation between two ratios. For example, if the optimal division is 12.5/87.5 (CPU/GPU), the system will oscillate between 10/90 (CPU/GPU) and 15/85 (CPU/GPU). In our experiments, this oscillation significantly degrades system performance due to the overheads of frequent workload division. Therefore, we introduce a safeguard scheme to avoid this situation. Specifically, we linearly scale the execution times of the GPU and CPU in the previous iteration on both sides based on the possible workload allocation to predict the execution times in the next iteration. If the predicted execution times show that there can be oscillation, we keep using the current division for the next interval. For example, if we have tc < tg for a division of 10/90 (CPU/GPU) in one iteration, we should take 5% of the workload away from the GPU and give it to the CPU based on the algorithm. We now predict the execution times of the CPU and GPU in the next iteration as tc' = (15/10) ∗ tc and tg' = (85/90) ∗ tg, respectively. If tc' > tg', oscillation may happen and so we keep the current division for the next interval.
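A minimal sketch of this division step, including the oscillation safeguard, is given below; the 5% step follows the text, while the function and variable names are illustrative.

# Sketch of the workload division step with the oscillation safeguard.
STEP = 5  # percent of work moved per iteration (from the text)

def next_division(r, tc, tg):
    """r: current CPU share in percent; tc, tg: CPU/GPU execution times of the
    iteration that just finished. Returns the CPU share for the next iteration."""
    if tc > tg:
        r_new = max(r - STEP, 0)      # CPU is the straggler: shift work to the GPU
    elif tc < tg:
        r_new = min(r + STEP, 100)    # GPU is the straggler: shift work to the CPU
    else:
        return r
    if r_new == r or r == 0 or r == 100:
        return r_new                  # no change, or nothing to scale from
    # Safeguard: linearly scale last iteration's times to the candidate division
    # and keep the current split if the slower side would flip (oscillation).
    tc_pred = (r_new / r) * tc
    tg_pred = ((100 - r_new) / (100 - r)) * tg
    if (tc > tg) != (tc_pred > tg_pred):
        return r
    return r_new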

Clearly, our light-weight heuristic cannot completely guarantee reaching the global optimum since we do not exhaust the search space. But our experiments (Section

4.7.2) show that the result is close to the global optimum. We choose to use this light-weight algorithm as a trade-off between solution performance and runtime overheads.

Table 4.2: Summary of workloads used in our hardware experiments.

Workloads       Enlargement                               Description
bfs             65536 iterations                          High core and memory utilization
lud             10 iterations; 8192 by 8192 matrix        Medium core utilization, low memory utilization
nbody           50 iterations                             High core and memory utilization
PF              2048 by 2048 dimensions                   Low core and memory utilization
QG              600 iterations; 16777216 points           Utilizations highly fluctuate
srad v2         2048 columns by 2048 rows                 High core utilization, medium memory utilization
hotspot         2048 by 2048 grids of 600 iterations      Medium core utilization, low memory utilization
kmeans          988040 data points                        Medium core utilization, low memory utilization
streamcluster   65536 points with 512 dimensions          Utilizations highly fluctuate

Please note the focus of this paper is on a holistic energy management framework that integrates higher-level workload division and lower-level hardware resource manage- ment (i.e., frequency scaling) to improve the system energy efficiency. GreenGPU can be integrated with other sophisticated global optimal algorithms (e.g., [46]) for better performance or more energy saving at the cost of more complicated implementation and higher runtime overheads.

4.6 Implementation

In this section, we introduce our hardware testbed and the implementation details of

GreenGPU.

We use GeForce8800 GTX GPU [93] in our testbed. GTX8800 has 16 Stream

MultiProcessors (SM) with the 90nm technology. By using the utility bandwidthTest from the Nvidia SDK, we derive the CPU to GPU bandwidth is 656.3 MB/s and the GPU to CPU bandwidth is 803.3 MB/s. We use an off-the-shelf GPU card but not the latest card (e.g., Tesla series) because it is fully compatible with the management tools such nvidia-smi and nvidia-settings, which are required in our experiments.

Figure 4.4: Hardware testbed used in our experiments, which includes a Dell Optiplex 580 desktop with an Nvidia GeForce GPU and an AMD Phenom II CPU, two power meters, and one separate ATX power supply to power the GPU card. Meter1 measures the power of the CPU side, while Meter2 measures the power of the GPU side.

We select six frequency levels with equal distance in the dynamic range of the core part and the memory part, respectively (e.g., 900MHz, 820MHz, 740MHz, 660MHz, 580MHz, and 500MHz for the GPU memory). Finer frequency levels may introduce a large convergence time; coarser frequency levels may introduce large oscillation. Our selection is a trade-off between the convergence time and oscillation. For example, we use 576MHz, 513MHz, 466MHz, 411MHz, 356MHz, and 297MHz for the GPU core frequency levels, and 900MHz, 820MHz, 740MHz, 660MHz, 580MHz, and 500MHz for the GPU memory frequency levels. We enable the Coolbits attribute of the NVIDIA graphic card and use nvidia-settings shipped with the NVIDIA GPU driver to adjust the core and memory frequencies of the GPU. We use the 4.0 CUDA driver and the 3.2 runtime. nvidia-smi

[94] in Nvidia’s toolkit is used to read the current GPU core and memory utiliza- tions. The CPU in our physical testbed is a dual core AMD Phenom II X2 processor with four available frequency levels as 2.8GHz, 2.1GHz, 1.3GHz, and 800MHz. The operating system is Ubuntu 10.04 with a Linux kernel 2.6.32. We use 2 Wattsup Pro power meters [126] to get the power readings. As shown in Figure 4.4, to measure the power consumption of the CPU and other parts of the system, we put the first power meter between the box and the 110V AC wall outlet. Therefore, the first power

84 meter measures the total power of the CPU side, including the motherboard, disk, and main memory. The GPU card is powered by an independent ATX power supply and its power consumption is measured with the second power meter placed between this ATX power supply and the wall outlet. The second power meter measures the total power of the GPU card.

We use Rodinia [20] and the Nvidia SDK [95] as our workloads. The Rodinia suite provides both CUDA and OpenMP implementations for its applications. We enlarge the data size and/or the number of iterations of the GPU computation kernels in order to get stable power readings. Table 4.2 shows the key parameters. Our workload selection covers all core and memory utilization characteristics, and we also include workloads with dramatic fluctuation in terms of utilization (i.e., QG and SC) to test our GreenGPU framework. We identify QG and SC as high-fluctuation workloads by studying the utilization traces of our workloads. We implement our frequency scaling part as a Python script and run the script as a background daemon process to adjust the GPU core and memory frequency levels.
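The daemon loop itself is small; the sketch below shows one way to poll the GPU utilizations and apply the selected frequency pair. The nvidia-smi --query-gpu flags used here exist in recent driver releases but may not match the testbed-era tool (which may instead require parsing the plain nvidia-smi output), apply_freqs stands in for the nvidia-settings invocation whose attribute names are driver-specific, and the polling period is illustrative.

# Sketch of the frequency-scaling daemon loop. select_pair can be the
# select_frequencies() helper sketched for Section 4.5.1; apply_freqs wraps the
# driver-specific nvidia-settings call.
import subprocess, time

def read_gpu_utilization():
    out = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=utilization.gpu,utilization.memory",
         "--format=csv,noheader,nounits"]).decode()
    core_util, mem_util = (float(x) / 100.0 for x in out.split(","))
    return core_util, mem_util

def daemon_loop(select_pair, apply_freqs, period_s=6):
    while True:
        uc, um = read_gpu_utilization()
        i, j = select_pair(uc, um)     # indices into the core/memory level lists
        apply_freqs(i, j)              # enforce the chosen core-memory pair
        time.sleep(period_s)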

The workloads for two-tier design experiments are from Rodinia. Please note programming model that supports GPU-CPU heterogeneous architectures is still in early experimental stages (e.g., OpenCL). For instance, Open Computing Language

(OpenCL) [60] is a C-based programming framework with promising heterogeneous processing support, but it is still in early developmental phases. Therefore, the work- load distribution between the CPU and GPU still requires low-level programming and memory management (e.g., programming a combination of OpenMP or pthreads with CUDA [95]). We adopt a preliminary implementation structure as introduced in [103, 76]: multiple pthreads are launched in the main function. Some pthreads are in charge of CUDA execution (one pthread for one GPU), the other pthreads are deployed on the cores of the main CPU (one pthread for one core). We wrap

85 the CPU and GPU implementations into different kernel functions and launch those kernels in different pthreads. We also implement parameters in those kernels such that the data size mapped to each kernel can be changed when it is invoked. We repeatedly call kernel functions with different data sizes to implement the workload division. We implement our workload division algorithm within the application code.

Currently, the main program spawns pthreads for both CPU and GPU versions of implementation for the benchmarks in our evaluation. However, new programming interface like OpenCL offers opportunities to have just one copy of implementation that can be deployed on both CPU and GPU (or even DSP) platforms, which will significantly reduce the programming effort.

Due to the future system integration trend, the energy and power management algorithm might need to be implemented on-chip [52]. However, since workload di- vision within an application is a software problem by nature, it does not suit for on-chip hardware implementation. Therefore, we only sketch the possible hardware implementation for our frequency scaling tier. For monitoring, all the statistics we need in our algorithm can be derived from performance counters, which are already available on most modern CPUs and GPUs. There is no extra monitoring hardware required. The key part in the frequency scaling tier is the core-memory pair weight table. We need a N × M table to record the core-memory frequency pairs. Because the loss factor value is between 0 and 1. 8-bit precision is accurate enough for the purpose of picking up the largest weight. For our testbed with 6 core frequency levels and 6 memory levels, we only need a 36 bytes table (6x6x8). The weights are up- dated based on Section 4.5.1. Please note the multipliers in Equations 4.5.1 to 4.5.3 are one coefficient fixed, they can be highly optimized to a simple shift-add logic.

Scaled to 8-bit and current 65nm technology, the adder presented in [81] only con- sumes 0.001mm2 and 12.5×10−9J each invocation. The leakage power of the 36-byte

86 95 Core Frequency Memory Frequency GPU scaling 1200 120 1200 120 Best performance Core Utilization Memory Utilization 75 800 80 800 80

55

400 40 400 40 Power (w)

0 0 (%) Core utilization 0 0 35 Core frequency Core frequency (MHz)

0 6 12 18 24 30 36 0 6 12 18 24 30 36 utilization(%) Memory 0 6 12 18 24 30 36

Time (s) frequency(MHz) Memory Time (s) Time (s) (a) Core part trace (b) Memory part trace (c) Power trace Figure 4.5: Frequency scaling algorithm adjusts the frequencies of GPU cores and memory based on their utilizations respectively to save energy without increasing execution time. storage and the adder is negligible compared to that of billions of transistors in a modern GPU. Therefore, the hardware structure of our frequency scaling algorithm is area and energy efficient to be implemented on-chip.

4.7 Experiments

In this section, we first evaluate the GPU frequency scaling. We then test the work- load division tier. Finally, we present the results of GreenGPU as a holistic energy management solution.

4.7.1 Frequency Scaling for GPU Cores and Memory

In this experiment, we enable the frequency scaling tier but disable the workload division tier (i.e., all the workloads are put to the GPU) to test the performance and energy savings of the frequency scaling algorithm. We use the best-performance policy as our baseline. Best-performance sets both core and memory frequencies always at the highest level (i.e., 576MHz for cores and 900MHz for memory). We compare our frequency scaling algorithm with best-performance to show that our algorithm can achieve considerable energy savings with only negligible performance loss. Figure 4.5 shows the trace file of a typical run of our frequency scaling with the

87 16 80 20 GPU Scaling Dynamic Energy Saving CPU/GPU Scaling 12 60 15

8 40 10

4 20 5 Energy saving (%) saving Energy Energy saving (%) saving Energy 0 (%) saving Energy 0 0

(a) GPU energy (b) GPU dynamic energy (c) System energy Figure 4.6: Energy saving compared with best-performance for different workloads. streamcluster workload. Our experiment starts with the frequencies of cores and memory running at the lowest levels, which is the default case for a GPU. Figures

4.5a and 4.5b show that the core and memory frequencies are generally directed by their utilizations. In Figure 4.5a, the utilization of cores starts to ramp up from the

6th second. Since our frequency scaling interval is 3s in this test, at the 9th second

(i.e., the immediate next period after the utilization increase), the frequency of cores is adjusted to be higher. Since our algorithm evaluates the loss value of all possible frequency levels, it can adjust the GPU core and memory frequencies directly to the best levels according to the utilizations. In Figure 4.5b, the memory frequency converges to 820MHz, which is lower than the peak frequency (i.e., 900Mhz) and so results in energy savings. As shown in Figure 4.5c, the average power consumption of our algorithm is lower than that of best-performance throughout the experiment, but the execution time (i.e., performance) is similar. As a result, the energy efficiency is improved.

Figure 4.6 presents the energy saving percentage of our scheme compared with best-performance for different workloads. In Figure 4.6a, GPU scaling is the measured result of our core-memory coordinated frequency scaling algorithm. Our algorithm saves 5.97% on average and up to 14.53% of GPU energy. In Figure 4.6b, we present the energy savings in termsofdynamicGPUenergy. Dynamic Energy Saving num- bers are calcuated by subtracting the idle energy from the runtime energy. Figure

88 4.6b shows that our approach saves 29.2% of dynamic energy on average with only 2.95% longer execution time than best-performance. In Figure 4.6c, CPU/GPU scal- ing is the result when we throttle both the CPU and GPU for maximized energy savings. The key idea is that the CPU frequency can be throttled down for energy savings with asynchronized GPU-CPU communications, when the GPU part is do- ing all the computation. However, due to the limitations in the implementation of synchronized GPU-CPU communications used in our benchmarks, the CPU has a utilization of 100% even when it is idling and the GPU is doing all the work. As a result, the on-demand CPU frequency governor of Linux used in GreenGPU fails to throttle the CPU frequency for energy savings. We therefore emulate this case to highlight the energy saving potential of dynamically throttling both the CPU and

GPU. In our emulation, we conservatively assume that the CPU frequency cannot be throttled if the CPU needs to communicate with the GPU at any time, such as the workload launching and ending times. When the CPU is idling and its frequency can be throttled without impacting the system performance, we replace the CPU energy with the average CPU energy at the lowest frequency level to emulate that CPU frequency is throttled to the lowest level. Figure 4.6c shows that the average energy saving is 12.48% if both CPU and GPU are throttled. Note that we do emulation only in Figure 4.6c. Based on Figure 4.6 and the corresponding workloads in Table 4.2, we can make the following observations. First, for workloads with phase fluctuation, such as QG and streamcluster, our scheme can achieve energy savings because we dynamically detect the on-line utilization information of the cores and memory and dynamically adjust frequencies accordingly. Second, for applications with a lower average utiliza- tion (either core part or memory part, such as PF and lud), our scheme yields good energy savings. However, for the applications with high utilization rates, such as

89 100 200 100 200 80 160 80 160 60 120 60 120 40 80 40 80 Time (s) 20 40 Time (s) 20 40 0 0 0 0 Division ratio (%) ratio Division Division ratio (%) ratio Division 12345678910 12345678910 Computation iterations Computation iterations

GPU CPU GPU Execution Time CPU Execution Time GPU CPU GPU Execution Time CPU Execution Time (a) Workload division and execution time of(b) Workload division and execution time of kmeans. hotspot. Figure 4.7: Workload division algorithm adjusts the workload allocation between the CPU and GPU parts to minimize idling energy on either side caused by waiting for the other (slower) side. bfs, the energy savings are smaller. This is because if all the resources are occupied, throttling either core or memory frequency will significantly increase execution time, resulting in increased energy consumption.

To summarize this part, our scheme is effective for both phase-stable and phase-

fluctuating workloads, and it performs better for the workloads with low utilizations of either GPU cores or memory than the workloads with high utilizations of both

GPU cores and memory.

4.7.2 Workload Division between GPU and CPU

In this section, we enable the workload division tier but disable the frequency scaling tier to investigate the effectiveness of our workload division algorithm.

Figure 4.7 presents the traces of the workload division for kmeans and hotspot.

In Figure 4.7, the X-axis is the iteration sequence number; the left Y-axis is the workload division percentage; the right Y-axis is the execution time. The triangle dot is the execution time of the GPU part in the corresponding iteration. The round dot is the execution time of the CPU part in the corresponding iteration. In Figure

4.7a, the initial division ratio is set to be 30% workloads on the CPU part. We pick up 30% here in order to have a faster convergence. In real usage, this value

90 can be set to an arbitrary ratio (e.g., 50%). Since we use 5% as workload division step, in the worst case, we need 10 iterations if we start from the 50% division point.

In our experiments, we find that setting initial ratio to 50% to 30% can help to converge the balanced workload division in a shorter time. However, we will show that our algorithm converges to the balanced workload division regardless of this initial division ratio. In the 1st iteration, the CPU execution time is much longer than the GPU execution time. Our division algorithm takes one piece of workload from the CPU and assigns it to the GPU part. The execution time difference between the CPU and GPU becomes smaller in the 2nd iteration. But the CPU execution time is still much longer than the GPU execution time. Our division algorithm takes one more piece of workload from the CPU and assigns it to the GPU part. In the

3rd iteration, the execution times of the two parts become even closer. The process repeats until the execution times on both sides are roughly the same after 4 iterations. The rationale of this adjustment is to minimize the idling energy caused by waiting for the slower side, as discussed in Section 4.1. Figure 4.7b shows a similar case for another workload, hotspot. As demonstrated in figures 4.7a and 4.7b, our algorithm can dynamically adjust the workload division based on the runtime execution time difference between the CPU and GPU parts in a GPU-CPU system regardless of its initial division ratio. To examine how close the result of our workload division algorithm is to the optimal division point with the minimum energy consumption, we have also conducted a series of experiments to test static workload division from 0/100 to 100/0 (CPU/GPU) with a step size of 5. For kmeans, we find that the energy- minimum division is 15/85 (CPU/GPU). In comparison, our algorithm converges to 20/80. For hotspot, the energy minimum division is 50/50 (CPU/GPU), while our algorithm converges exactly to 50/50 and obtains 99% of the maximum saving. The 1% difference is mainly due to 1) the higher energy consumption before the

91 100 60 100 35 80 50 80 32 60 60 29 40 40 40 26 20 30 20 23 Energy (KJ) Energy Energy (KJ) Energy 0 20 0 20 12345678910 12345678910 Division Ratio (%) Ratio Division Division Ratio (%) Ratio Division Computation Iterations Computation Iterations

GPU CPU GreenGPU Division Frequency Scaling GPU CPU GreenGPU Division Frequency Scaling (a) hotspot. (b) kmeans. Figure 4.8: Energy and workload division ratio trace in respect of the iterations. GreenGPU outperforms workload division only and frequency scaling only on energy savings. convergence, and 2) the overheads of dynamic workload division. For the workload division part only, our solution only has 5.45% longer execution time than the optimal division.

4.7.3 GreenGPU as a Holistic Solution

We now enable both the workload division and the frequency scaling tiers to test

GreenGPU as a holistic solution. We show that such a holistic solution leads to more energy savings than each individual scheme.

Figure 4.8 shows the runtime traces of hotspot and kmeans. In Figure 4.8, the X- axis is the iteration sequence number; the left Y-axis is the workload division percent- age; the right Y-axis is the energy consumption in each iteration. The triangle dots, hollow round dots, squares are the energy consumption of GreenGPU, Division (with frequency scaling disabled), and Frequency-scaling (with workload division disabled) in the corresponding iteration. In Figure 4.8a, GreenGPU consumes less energy than

Division and Frequency-scaling. Compared with Division, GreenGPU’s frequency scaling units observe the energy saving opportunities when the GPU cores, GPU memory, or CPU has a low utilization. GreenGPU then throttles the frequencies of them based on the proposed frequency scaling algorithm. By doing that, GreenGPU achieves lower energy consumption than Division. Compared with Frequency-scaling,

92 GreenGPU has more energy savings by dynamically dividing workloads between the GPU and CPU. By balancing the workload between GPU and CPU, GreenGPU can minimize the time idle energy consumed by GPU or CPU to wait for the other side to

finish. For hotspot, GreenGPU achieves 7.88% more energy saving than Division and

28.76% more than Frequency-scaling. Figure 4.8b shows a similar case for a different workload, kmeans. In our testbed, Division contribute more to energy saving than

Frequency-scaling in holistic solution because nvidia-settings on GeForce8800 only conducts frequency scaling [94]. If DVFS is enabled, we expect more energy saving can be achieved from frequency scaling. GreenGPU saves 1.6% more than Division and 12.05% more than frequency scaling for kmeans. The default runtime config- uration of Rodinia is that all the workloads are allocated to the GPU and all the frequencies are at their peak levels [20]. Compared with that, GreenGPU can achieve on average 21.04% energy saving for kmeans and hotspot. For the holistic solution as a whole, GreenGPU has 1.7% longer execution time than workload-division-only.

4.8 Conclusion

Current research on GPU-CPU architectures focuses mainly on the performance as- pects, while the energy efficiency of such systems receives much less attention. There are few existing studies that start to lower the energy consumption of GPU-CPU ar- chitectures, but they address either GPU or CPU in an isolated manner and thus can- not achieve maximized energy savings. In this paper, we have presented GreenGPU, a holistic energy management framework for GPU-CPU heterogeneous architectures.

Our solution features a two-tier design. In the first tier, GreenGPU dynamically splits and distributes workloads to GPU and CPU based on the workload characteristics, such that both sides can finish approximately at the same time. As a result, the en- ergy wasted on staying idle and waiting for the slower side to finish is minimized. In

93 the second tier, GreenGPU dynamically throttles the frequencies of GPU cores and memory in a coordinated manner, based on their utilizations, for maximized energy savings with only marginal performance degradation. Likewise, the frequency and voltage of the CPU are scaled similarly. We implement GreenGPU using the CUDA framework on a real physical testbed with Nvidia GeForce GPUs and AMD Phenom

II CPUs. Experiment results show that GreenGPU achieves 21.04% average energy savings and outperform several well-designed baselines.

94 CHAPTER 5

INTEGRATING THERMOELECTRIC COOLERS AND

FANS FOR ENERGY EFFICIENCY

5.1 Introduction

Semiconductor industry has observed the end of classic Dennard scaling [33] in recent generations of process technology scaling. As the transistor integration outpaces the supply voltage scaling-down, the power density of microprocessor chips increases rapidly [107]. Traditionally, computer systems rely on forced convection effect to cool

Fan speed Fan Heat i dissipated

TEC Cu iTECool N P on/off Heat Spreader Thermal interface Heat absorbed DVFS levels Chip material Substrate (TIM) TEC TEC Peltier cooling Figure 5.1: The side view of the target chip packaging and TEC cooling effect. TECs are embedded between the heat spreader and the processor chip in the thermal in- terface material (TIM) layer. By applying current to the TEC, heat can be pumped from one side to the other side of this film device. iTEcool coordinates the fan and multiple TECs to improve the overall cooling efficiency. In addition, iTECool also coordinates DVFS level of each core and the cooling subsystem (TECs and fan) to reduce the energy consumption of the entire system (i.e., processor, fan and TECs).

95 the processors. The cooling fans are driven by motors with feedback controllers, such that the speed of the fan is adjusted by on board firmware or management algorithms running on CPU [107]. By adjusting the fan speed, we can provide flexible cooling to the processor. Unfortunately, this solution has following limitations. First, the fan speed increases at the cost of cubic increase in its power consumption [4]. The fan system power in high-end servers has reached up to 51% of the total server power consumption [66]. Furthermore, the fan systems can only provide global cooling, which cannot efficiently address the multi-core hot spot issues [19]. For example, among all the cores, only several cores (specifically, only certain components) appear to be hot, the fan(s) have to work at a very high speed, resulting in low cooling efficiency. Alternative more efficient cooling solutions need to be investigated.

Thermoelectric cooler (TEC) is a new kind of film material that can actively pump heat from one side to the other side when current is applied. Compared with conventional fan-based cooling system, TEC-based cooling has promising potential to manage the local hot spots to improve overall cooling efficiency [21]. However,

TEC is an active device that itself also generates heat within processor package.

The overall thermal effect of TEC is the competing effect of Peltier cooling (surface effect) and Joule heating (volume effect) [43]. Without careful management, TEC might heat the chip instead of cooling it down. Furthermore, in a TEC-based cooling design, previous studies usually assume a fixed fan speed [9] without exploring the coordination between TEC and fan. Since adjustable fan can provide flexible global cooling ability and TECs excel in local hot spots removal, intelligently coordinating the two can improve the overall cooling efficiency in different scenarios, which needs to be explored.

Other than improving the cooling subsystem efficiency, dynamic thermal man- agement (DTM) with throttling processor execution to lower temperature has been

96 widely studied [16, 109, 30]. Among those studies, dynamic voltage and frequency scaling (DVFS) has been shown to be one of the most successful knobs due to its cubic dynamic power reduction at only linear performance degradation. The key idea of

DTM can be summarized as: when the temperature is higher than a certain thresh- old, we lower the DVFS level of the processor to reduce heat dissipation; when the temperature is lower than the threshold, we increase the DVFS level of the processor to improve performance. However, high temperature on the processor is usually cor- related to high activity, which exactly the time that a processor needs high frequency the most for better performance. Throttling processors at high temperature may significantly degrade the system performance. On the other side, lower temperature usually means less activity or more stalls (e.g., many memory accesses), increasing the

DVFS level at those times may not improve performance a lot. Compared with throt- tling at high temperature and boosting at low temperature, it seems more intuitive to boosting at high temperature and throttling at low temperature from performance perspective. However, in order to avoid overheating, cooling subsystem needs to be coordinated with computational part, which needs detailed analysis.

Based on the above observations, this paper presents iTECool, a highly con-

figurable TEC-based cooling framework for energy efficiency. iTECool uses well- established power, performance, and thermal models to formulate the global energy management problem as a multi-variable energy efficiency optimization problem with temperature constraint. Since the search space of the optimization problem is pro- hibitively large, iTECool uses an efficient multi-step down-hill algorithm to solve this problem. We evaluate iTECool with extensive experiments on our simulation plat- form. The results show that iTECool can save the system energy by 10% compared with state-of-the-art baselines. In summary, this paper makes the following major contributions.

97 • The paper proposed a highly configurable optimization framework based on well-established power, performance, and thermal models.

• The paper applies the optimization framework on the overall energy conserva-

tion problem with TEC device as part of the cooling solution.

• To reduce the algorithm running time, this paper proposes simplified algorithm to solve the optimization problem.

The rest of this paper is organized as follows. Section 5.2 highlights the differences between this paper and related work. Section 5.3 sketches the system design. Section

5.4 introduces our simulation setups and the implementation details of our solutions.

Section 5.5 presents the evaluation results. Section 5.6 concludes this paper.

5.2 Background

Chowdhury et al. [21] have shown a prototype chip of thin-film TEC using nanotech- nology and presented key parameters. Gupta et al. [43] have examined the transient temperature behaviors of TEC devices. Those studies focus on physical parame- ter characterization. At the architecture level, Long [75] have studied the optimal amount of TECs and their locations on chips. Chaparro et al. [19] have conducted several case studies on how to manage the on/off state of TEC devices. Biswas et al

[9] have shown that using TEC to improve cooling efficiency can save cooling cost in data centers. Although those studies all explicitly or implicitly assume the existence of fans in the cooling subsystem, they do not discuss the coordination between TEC and fans. In contrast, iTECool focuses on the coordination between fan and TEC to improve overall cooling efficiency.

There have been several early-stage studies on joint energy co-optimization of the cooling and computational power. Shin et al. [107] have discussed the co-optimization 98 between the CPU and fan. Ayoub [4] have introduced a cooling-aware task scheduling. However, those studies do not consider the local cooling capacity that TEC can offer.

Compared with discussing the trade-off between the fan and CPU, iTECool addresses a more challenging multi-variable optimization problem.

5.3 System Design

In this section, we present the system design of iTECool. The target system is illustrated in Figure 5.1. TECs are embedded between the heat spreader and the processor chip in the thermal interface material layer. We assume using through silicon via (TSVs), a widely-used technique in 3-D stacking, to connect the TEC device and on-chip power delivery network (PDN) and use the power transistors to control the on/off state of the TECs. By applying current to TEC, the heat can be pumped from the chip side to heat sink side. In this way, we enable fine-grain local cooling. By adjusting the fan speed, we change the global cooling. We discuss the coordination between the fan and multiple TECs to improve the cooling efficiency, we also include DVFS as an extra knob to coordinate the computational power and the cooling power.

By adjusting the TEC on/off state, the DVFS level of each core, and fan speed, we minimize the energy of the entire system, including the cooling (TEC and fan) and computational (processor) parts, with the constraint that the peak temperature is always below a safe threshold.

99 5.3.1 Thermal Model

A widely-used steady state thermal model (e.g., in [75]) for multiple components on achipis

G(k)Ts(k)=P(k) (5.3.1)

T where Ts(k)=[Ts1(k),Ts2(k), ··· ,TsN (k)] is the steady state temperature vector th T of all the components at the k interval. P (k)=[P1(k),P1(k), ··· ,PN (k)] is the power vector of all the components at the kth interval. G(k) is the thermal conduc- tance matrix among the components. ⎡ ⎤ ⎢ g11(k) ··· g1N (k) ⎥ ⎢ ⎥ ⎢ . . . ⎥ G(k)=⎢ . .. . ⎥ (5.3.2) ⎣ ⎦ gN1(k) ··· gNN(k)

gij(k)=gfan(k)+gTECij (k)+gothersij (k) (5.3.3)

th where gij(k) is the thermal conductance between node i and j at the k interval (gij(k) is 0 if node k and j are not adjacent). Therefore, G(k) is a band matrix with only main diagonal and two minor diagonals adjacent to the main diagonal (totally only three diagonals) have nonzero elements. gfan(k) is the thermal conductance of

the fan. gTECij (k) is the thermal conductance of the TEC device (gTECij (k)is0if i = j). In Equation (5.3.3), by changing the fan speed, we can adjust gfan(k); by

changing the on/off state of TECs, we can adjust gTECij (k). gothersij (k)isthethermal conductance of the rest of the package, which we assume as a constant. If DVFS is enabled in the system, we can also change P(k) through DVFS.

100 Please note Equation (5.3.1) only presents the steady state temperature. In the following process, we derive the transient temperature estimation in iTECool. We use scalar form for concision. The matrix induction process is very similar. We start with the well-known RC thermal model [109],

dT (t) 1 1 = ∗ P (t) − (T (t) − Tα) (5.3.4) dt Cth RthCth

where T (t) is the processor temperature. Tα is the ambient temperature. Rth is the thermal resistance, Cth is the thermal capacitance. P (t) is the power feeding into the

RC model. By solving Equation (5.3.4), we have

T (t)=(1− κ) ∗ Ts+ κ ∗ T (t0) (5.3.5)

− t−t0 κ = e Rth∗Cth (5.3.6)

where Ts = Tα + Rth ∗ P (k) is the steady state temperature. T (t0) is the initial tem- perature. Equation (5.3.5) shows that the transient temperature can be interpolated by using the initial temperature and the steady state temperature. By discretizing

Equation (5.3.5), we derive

T (k)=(1− κ) ∗ Ts+ κ ∗ T (k − 1) (5.3.7)

− Δt κ = e Rth∗Cth (5.3.8) where Δt is the time interval. By minus T (k −1) from both sides of Equation (5.3.7),

101 we get

ΔT (k)=(1− κ) ∗ (Ts− T (k − 1)) (5.3.9)

The matrix format is

T(k)=(1− κ) ∗ (Ts(k) − T(k − 1)) + T(k − 1) (5.3.10)

By solving Equation (5.3.1), we can get Ts(k). Then we plug Ts(k) into Equation

(5.3.10), we can predict the transient temperature results of our actuators (e.g., TEC on/off, fan speed, and core level DVFS). In matrix form, it is very complicated to compute κ, we conservatively use 1 − κ =Δt as an approximation when Δt is small

(e.g., at the order of microseconds, the same approximation is used in [121]). When

Δt is large (e.g., at the order of seconds), we directly use Ts(k)asT(k).

5.3.2 Power and Performance Model

Fan and TEC power model. The power and air flow rate (e.g., used to calculate gfan(k)) can be derived from a fan datasheet [31]. TEC power and conductance

gTECij (k) can also be derived from product datasheet. In this paper, we use the parameters in [75].

Processor power model.

Pi(k)=Pleaki (k)+Pdyni (k) (5.3.11)

∗ − − ∗ Ai Pleaki (k)=(PTDPleak + α (Ti(k 1) TTDP)) (5.3.12) Achip

102 − ∗ Fi(k) ∗ Vddi(k) 2 Pdyni (k)=Pdyni (k 1) ( ) (5.3.13) Fi(k − 1) Vddi(k − 1)

where Pi(k) is the total power of component i, Pleaki (k) is the leakage part, Pdyni (k)

is the dynamic part. PTDPleak is leakage power under TDP status, which could be estimated by using datasheet and used as a constant in our on-line estimation. Ai is the area of component i; Achip is the chip area. Using linear temperature model to estimate leakage power within a limited temperature range has been shown rea- sonably accurate in [107, 114]. Equation (5.3.13) has been widely used in dynamic power estimation. In real processors, the estimation of per-structure dynamic power − Pdyni (k 1) by monitoring only six performance metrics has been proposed in [100]. Processor power model. We estimate the execution speed of the entire chip as:

N−1 N−1 − ∗ Fi(k) IPSchip(k)= IPSi(k)= IPSi(k 1) − (5.3.14) i=0 0 Fi(k 1)

th where IPSchip(k) is the number of instructions per second at k interval, which is the summation of the number of instructions of each core IPSi(k). IPSi(k)isestimated by last interval IPS of each core IPSi(k − 1) scaled by the frequency change ratio Fi(k) . Fi(k − 1)

Enchip = Pchip ∗ T imechip (5.3.15)

N−1 M−1

Pchip = Pcorei + PTECj + Pfan (5.3.16) i=0 j=0

103 2 PTEC = r ∗ i + αiΔθ (5.3.17)

where PTEC is TEC power. i is current applied. r and α are material paramters.

Δθ is the temperature difference between the two sides of the TEC device. Since applying more than 8A has been identified as dangerous to introduce overheating

[75]. We conservatively assume we apply 6A current. Therefore, the PTEC has a linear relationship with the Δθ.

1 T imechip ∝ (5.3.18) IPSchip

where Enchip is the energy of the entire chip. Pchip is the power of the CPU and cooling components. We have N cores and M pieces of TEC devices in our packaging.

T imechip is the execution time of the entire chip, which is inverse to the execution speed.

5.3.3 Problem Formulation

Our goal is to minimize the system energy, including both the cooling and compu- tational part, with maximum temperature constraint by adjusting TEC on/off state, core-level DVFS, and fan speed. In formal formulation:

N−1 M−1 ( i=0 Pcorei + j=0 PTECj + Pfan) min{Enchip = } (5.3.19) N−1 Fi(k) i − ∗ 0 (IPS (k 1) Fi(k−1) )

104 subject to

max{T (k)}≤Tth (5.3.20)

by adjusting TEC on/off state (impact G(k)andPTECj ), core-level DVFS (impact P (k), Pcorei ,andFi), and fan speed (impact G(k)andPfan).

5.3.4 Heuristic Solution

To solve the problem formulated in Section 5.3.3 in general, exhaustive search needs

Lcore to be used to derive the optimal result. The entire search space is (Ncore) ∗

LTEC (NTEC) ∗ Lfan (Ncore is the number of cores; Lcore is the DVFS levels of each core; NTEC is the number of TEC devices; LTEC is the number of state levels of each TEC device; Lfan is the levels of the fan). However, this number of different combined configurations is prohibitive for on-line search. We develop a multi-step down-hill heuristic to simplify the solver based on the following observations: 1) The

TEC modulation impacts local thermal characteristics; per-core DVFS impacts the core-level power dissipation; the fan speed adjustment impacts the global thermal characteristics. 2) The effective time of different knobs is different. The fan cooling effect takes place through heat sink and heat spreader, which the thermal capacitance is normally hundreds of Joule per Kelvin (several seconds to stable). In contrast, the

TEC and per-core DVFS modulation’s effect engages much faster. The cooling effect of TEC device (i.e., Peltier effect) takes up to 20μs [43] to engage. The overhead of per-core DVFS is decided by the on-chip voltage regulator, which is reported to be at the order of tens of nanoseconds [62]. Those two key observations inspire us to develop the following two-level hierarchical solution Co-op. At the low level (the fine time scale, e.g., 2ms, TEC engage overhead is 20μs, DVFS overhead is 100ns [130]),

105 Estimate the En chip if turn off the TEC on top of coolest component, or raise the DVFS level of one core by one level, select the adjustment that saves the most energy cool Y iteration Iterations end until the condition does not max{Tˆ(k)} T th hold, then apply all the hot adjustments iteration N

Estimate the En chip if turn on the TEC on top of hottest component, or lower the DVFS level of the hottest core by one level, select the adjustment that saves the most energy

Figure 5.2: Multi-step down-hill greedy algorithm (Co-op) flow chart. Based on the thermal, power, performance models (Equation (10)-(17)), Co-op estimates the next step possible energy if certain adjustment is used; it then selects the adjustment that has the smallest energy consumption within the temperature constraint. If the current temperature is lower than the threshold, Co-op compares the energy of turning off TEC and raising the DVFS level of one core; if the current temperature is higher than the threshold, Co-op compares the energy of turning on TEC and lowering the DVFS level of one core. Co-op takes the steps forward multiple steps along the small energy adjustment direction until the temperature constraint is achieved.

we use per-core DVFS and TEC device on/off state modulation as knobs to save energy as well as control temperature. The key idea is to use the thermal, power, and performance models presented in previous sections to estimate the next step status if certain adjustment is used. We then select the adjustment that has the smallest energy consumption within the temperature constraint. At the high level (the coarse time scale, e.g., seconds), we use the fan speed modulation as a knob to control temperature and save energy. Our down-hill algorithm works as follows.

In each low level interval, we assume the fan speed is fixed and use Equation

(5.3.1, 5.3.10) to estimate the next interval temperature of adjusting the power of each core and turn on/off the TEC device.

As shown in Figure 5.2, if some components are hotter than the threshold (i.e., “hot iteration”, connected by the red line). We estimate and compare the next step

106 energy of lowering the hot core or turning on the TEC on top of the hot area. We then select the one that offers the smaller energy consumption to estimate the temperature.

We define this energy estimation, comparison, and temperature estimation process as one iteration. If the temperature is still higher than the threshold, we repeat the iterations until the estimated temperature is lower than the threshold. If there still exist hot spots somewhere on the chip, we repeat the iteration process until all the hot spots are resolved. As shown in Figure 5.2, if there exist no hot spots (i.e., “cool iteration”, connected bytheblueline).WeevaluatetheopportunitytochangetheDVFSlevelsorturnoff the TEC device to save energy. We first calculate the energy of increasing the DVFS level of a core by one step, then we calculate the energy of turning off the TEC on top of the coolest components. We adopt the configuration that can save the most energy. We then estimate the temperature. If there still exist no hot spots, we use the new configuration as the starting point of our next iteration search until there is no opportunity to conserve energy consumption by increasing the DVFS level of each core or turning off the TEC without violating the temperature constraint.

In each fan interval, we assume the average power and DVFS levels of the last fan interval as the power reading and DVFS level configuration. We also assume the average TEC state in the last fan interval as TEC state, which means we have middle state besides the on/off state. Then we predict the next step temperature by using Equation (5.3.1, 5.3.10). If there exists hot spots, we increase the speed of the fan speed level until hot spots disappear. If there is no hot spots. We evaluate the opportunity to slow down the fan to save energy. We lower the fan speed until the hot spots appear.

107 5.3.5 Hardware Cost

We estimate the hardware cost of our proposed solution in this section. The key hardware overhead of iTECool is the temperature estimation (calculating Equation (5.3.1, 5.3.10). Please note we use thermal conductance matrix G in consistence with our evaluation framework HotSpot simulator. Because HotSpot uses thermal conduc- tance matrix. In real hardware implementation, instead of using thermal conductance matrix, we can use thermal resistance matrix. In that case, the inverse calculation could be bypassed and only the matrix-vector-multiplication is left. Since the thermal impact only takes place on adjecent or nearby components, G is by-nature a band matrix, which means the matrix only has nonzero elements along the main diagonal and some additional minor diagonals (typically only one) on either side of the main diagonal. Band matrix and vector multiplication can be implemented in systolic array

[85], a very speed and space efficient hardware. Since the inter-core thermal impact is limited in tile-structured many-core architecture. We only evaluate the temperature of one core each time. We analysis the power and area cost of a very aggressive design, in which the systolic array gives the temperature results of one core in one cycle, we need K × M fixed-point multiplication (M is the number of components in one core,

K is the number of how many component we assume they have thermal impact).

For power and thermal comparison, 8-bit encoding is sufficient. Bitirgen et al. [10] have estimated that a 16-bit fixed-point multiplier yields an area of 0.057mm2 with

65nm process technology. For a typical 200mm2 die, the hardware overhead is only 0.03%. To approximate the power consumption by the fixed-point multiplier circuits in the hardware, we assume the power density of IBM POWER6’s FPU [26], which is

0.56W/mm2 at 100% utilization with nominal voltage and frequency values (1.1V and

4 GHz). The extra power consumed by the fix-point multiplier is only 0.03W. Since in our design, we evaluate 18 components and assume only adjacent components have

108 Table 5.1: Simulation Configuration. Memory 200 cycles Core Alpha 21264 like I- and D-cache private, 16KB, 4-way, 64B, 1 cycle L2 cache private, 256KB, 16-way, 64B, 12 cycles DVFS levels 1 1.14V/1.0GHz, 1.05V/750MHz, 0.9V/500MHz, 0.7V/125MHz

FPMap IntMap Int_Q IntReg Core1 Core2 Core3 Core4 FPMul LdSt_Q IntExec FPReg FP_Q FPAdd ITB Voltage Core5 Core6 Core7 Core8 Bpred DTB Regulator i-cache d-cache

Core9 Core10 Core11 Core12

L2

Core13 Core14 Core15 Core16 Router

(a) Chip floor plan. (b) Core tile floor plan.

Figure 5.3: Simulated processor floor plan. The chip floor plan is scaled based on the Intel SCC 48-core chip [49]. A core tile is half the size of the daul-core tile on SCC. The component placement and relative size is scaled with Alpha 21264. The router size is the same with SCC on-chip router. We estimate the voltage regulator size based on 0.5W delivered power/mm2 measurement on a prototype on-chip regulator [62]. Chip floor plan: 10.4mm×14.4mm, a 4×4 core tile array. Core tile floor plan: 2.6mm×3.6mm.

thermal interaction. We use 54 = M ∗ K =18∗ 3 8-bit fixed-point multiplications in order to evaluate the temperature of one core in one cycle, only adding less than

1.7% extra area and power to the target chip. The other computation of iTECool can time-share the calculation unit of the temperature estimation part. Therefore, iTECool is an affordable solution in terms of power and area cost.

1Our DVFS level setting is consistent with [49].

109 5.3.6 Per-core DVFS assumption

Please note iTECool does not rely on per-core DVFS, we only per-core DVFS to show that even with per-core DVFS enabled, iTECool is still able to handle the large searching space that is introduced by this extra knob. iTECool can be integrated with chip-level DVFS seamlessly. Even without DVFS, iTECool can stand alone as a solution to improve cooling subsystem efficiency. Due to its capability to explore inter-core runtime variations of different workloads, per-core DVFS has been shown to be effective to address the thermal issues on multi-core processors. With the popularity of digital phase-lock loop (DPLL), per-core frequency scaling has been available on main-stream processors (e.g., IBM POWER7 [125]). However, to enable per-core DVFS, we still need to deploy the voltage regulator (VR) on-chip to provide fast voltage adjustment. Novel voltage regulator (VR) (e.g., on-chip VR) has drawn a lot of research attention due to its potential to enable fine-grain DVFS and reduce off-chip power pins requirement [38]. Kim et al. [63] have analyzed some key design trade-offs in deploying on-chip VR on processors. Eyerman et al. [35] have shown the power-performance benefits of fine-grain DVFS enabled by on-chip VR. Recently,

IBM Watson lab has demonstrated a 2.5D on-chip VR, showing the great promise of having per-core DVFS. However, the area overhead of on-chip VR is still the major concern for this technology. Fortunately, Yan et al. [130] have presented a hybrid scheme to mimic per-core VR with per-cluster VR and scheduling to reduce the area overhead of on-chip VR. Therefore, we use per-core DVFS as one of knobs.

5.4 Simulation Setup

In this section, we introduce our simulation environment. We use SESC [104] to evaluation the processor performance. SESC has been integrated with Wattch [15]

110 Table 5.2: Baseline results.

Workload Inputfile FF Inst Threads Inst Time (ms) Power (W) T(◦C) 16 1 billion 48.0 125.9 90.07 cholesky tk29.O 200M 4 250M 57.2 42.0 74.8 16 1 billion 59.68 74.9 69.69 fmm fmm.in 300M 4 250M 72.66 32.5 62.15 volrend head 300M 16 800M 41.42 85.4 71.79 water-nsquared water.in 300M 4 250M 38.1 43.7 68.7 16 400M 20.34 109.9 84.49 lu on input 300M 4 100M 19.6 42.1 70.75

and CACTI [89] to estimate the power of each on-chip component. The key simulation parameters are summarized in Table 5.1. Then we use HotSpot 5.02 [121] to estimate the temperature. We use SPLASH-2 benchmark with 16 threads launched.

We use Intel single-chip cloud (SCC) processor [49] as our floorplan base (shown as Figure 5.3a) to recalibrate the power reading because Intel releases detailed mea- surement results which are not available for other processors. From the die micro- graph, we estimate the size of the most identifiable components. A dual-core tile is

3.6×5.2mm. An IA-32 core in SCC is 4mm2 (L2 cache not included). A on-chip router is 1.5mm2. A router is shared by two cores on a dual-core tile. Therefore, we assume the router for each core is 0.8mm2. L2 cache for each core is 2.4mm2.

Since the detailed functional unit size information is not available to our study. We scale an Alpha 21264 core to a SCC core size and use scaled the function unit size in temperature estimation (shown as Figure 5.3b). We also scale the simulator reported power numbers to the SCC measurement results. Specifically, we calibrate the peak power estimated by Wattch to the peak reported core part power. We monitor the cache coherence activities and scale the reported mesh power to estimate the run-time router power. We use a second-order polynomial model [114] to estimate the leak- age power. We calibrate the chip leakage model [114] to the reported leakage power presented in [49]. We assign the chip leakage power to each component proportional

111 to the area of each component. We use this value as the initial state of temperature- leakage iteration. Please note HotSpot 5.02 only integrates temperature-leakage loop in the steady temperature routine. We modify the transient temperature calculation routine to consider temperature-leakage loop at run-time.

We use 0.5W delivered/mm2 to estimated the on-chip VR area [62]. The peak power of one core is 1.8W, resulting in a 3.6mm2 VR.Consideringthecoresizeis only 4mm2, the VR is of too much area overhead. Therefore, we use a quasi-parallel VR [115]. The key idea is that we parallel connect the off-chip regulator and on-chip regulator. We use rely on off-chip converter to deliver most of the output power; while we use low power on-chip buck type converter to achieve voltage regulation. In the proposed off-/on-chip hybrid VR design, for 1.8W peak power, if we use one off- chip regulator connected to one on-chip regulator, only 0.9W needs to be provided by on-chip VR on steady state time. However, to be conservative, we budget the on-chipVRwith2.2mm2, resulting in 1.1W peak power delivery capability. The size of VR has been reduce by 40% in our proposed scheme. If we can parallel more off-chip VR, the area overhead of on-chip VR can be further reduced. Although the on-chip VR still occupies a large on-chip area, this 24% area overhead (i.e., 2.2mm2 on a 3.6×2.6 core tile) could be justified by the dark silicon effect [130] (e.g, 20% area are dark). The relative size of each component is shown in Figure 5.3b. The power delivery efficiency of this on-chip VR is based on reported measured values

[62]. The efficiency peak is 77%. The efficiency decreases as the load power decrease.

We assume the power generation of on-chip VR is its power delivery efficiency loss.

We assume using the same TEC device in [75]. Each TEC device is a 0.5mm×0.5mm

film-form material. We embed a 3×3 TEC array on top of each core tile between the heat sink and processor die. The TEC array covers most core area of each core tile. We assume the each TEC device is controlled independently by a power transistor,

112 which is coupled to the on-chip power delivery circuit by through silicon vias (TSV) [108]. TSV has been commonly used in multi-layer stacking technology. The TEC power is calculated with the model presented in [75], which considers the temperature difference between the cool side and the hot side and the TEC input current. The cooling effect of TEC device (i.e., Peltier effect) takes (e.g., up to be to 20μs [43]) to engage. We calculate the heat resistance change 20mus after we turn on a TEC.

Please note this setting is conservative, if we use shorter delay, the cooling effect will be better.

We assume using a speed adjustable fan in our packaging. The available speed levels and the fan power and air flow rate at different speed levels are from a com- mercial fan datasheet [31]. Since the impact of fan is through heat sink and the thermal constant of heat sink is in the range of 15-30s [4]. If we simulate the effect of dynamic fan, the simulated CPU time should be at least several minutes. For a 16- core processor in SESC (integrated with HotSpot), the simulation time is prohibitive.

Therefore, for each benchmark, we run all the studied policies with all possible fan speed levels in multiple tests. In this way, we get the results for all possible fan speed levels. When we report our results, on the condition of not violating the temperature threshold, we choose the lowest fan speed results as the dynamic fan results.

5.5 Experiments

In this section, we present our evaluation results.

5.5.1 Baseline Results

In our baseline case, we set all the cores running at the peak DVFS levels; we use the highest fan speed; all the TEC devices are powered off. We run 16 and 4 threads on the

16-core system each time. Our command line parameters and results are presented 113 in Table 5.2. When we run 16-thread water and 4-thread volrend,thesimulation suspends and cannot reach the required number of instructions. Therefore, we only report the 4-thread water and 16-thread volrend cases.

In Table 5.2 shows that the maximum temperature correlates to the total power consumption. Within each workload, from 16 to 4 threads, the total power dissipation decreases, so as the maximum temperature. Across different workloads, this general rule still holds in most cases. Only when we compare lu and water at 4-thread cases, lu consumes less power than water but the maximum temperature of lu is higher than water. Through trace file analysis, we find that in both lu and water,IntQ0 (the integer queue of the 1st core) is the hottest unit. However, the component power in water is higher that that in lu. This shows that besides the global heat dissipation, the power density also decides the temperature. That confirms our motivation to use

TEC, which is to use local cooling to address local hot spot issues.

5.5.2 Studied Policies

We study five different management policies in this paper. They are Dynamic-fan,

TEC+fan, DVFS-only, DVFS+TEC,andCo-op. The current industry practice on dynamic cooling system is predominantly fan- based. The fan is dynamically adjusted by monitoring the temperature of the proces- sor. If the temperature is higher than the assigned threshold, we raise the fan speed until the temperature goes lower than the threshold; if the temperature is lower than the threshold, we lower the fan speed to save cooling power. We take this widely- adopted industry practice as our fan part management policy. Since the thermal capacity of head sink is very large (e.g., several hundred ◦C/J), it takes a long time to see the effect of changing the fan speed. Therefore, we run all the studied policies with all possible fan speed levels. We choose the one at the lowest fan speed level

114 without violating the temperature threshold as the Dynamic-fan result. In all the other policies expect Co-op, we assume dynamic fan is used.

TEC+fan is a simple combination of Dynamic-fan and a simple heuristic TEC+fan to manage the on/off state of the TECs. Specifically, the adjustment of fan and TECs are independent. We use Dynamic-fan to manage the fan. For TECs, when the tem- perature of a on-chip component is higher than the threshold, we turn on the TEC on top of the hot area; when the temperature is lower than the threshold, we turn off the TEC. We assume temperature sensing is available at all components (also assumed in [75, 19]).

DVFS-only is a classic DVFS-based DTM algorithm to control the peak temper- ature of a processor under the threshold. The idea is to raise the core DVFS level when the temperature of the core is lower than the threshold to improve performance and reduce the DVFS level when the temperature is higher than the threshold to protect the processor. Please note that we run DVFS-only on different fan speed levels and report the test results that provide the best energy as DVFS-only to derive an aggressive comparison.

DVFS+TEC is a simple combination of TEC+fan and DVFS-only.Theadjust- ment of per-core DVFS levels and TECs are independent. We also DVFS+TEC on different fan speed levels and report the test results that provide the best energy as final result to derive an aggressive comparison.

Co-op is our proposed solution. Co-op coordinates all the available knobs for op- timized energy based on well-established models. Also a multi-step down-hill greedy algorithm is developed to reduce the policy execution time. The algorithm is dis- cussed in detail at Section 5.3.4.

115 5.5.3 Integrating TEC with Fan

We provide the following case study to show the effectiveness of using TEC with fan in Figure 5.4. Figure 5.4 compares the cooling effect (maximum chip temperature) and cooling cost (cooling power) of Dynamic-fan and TEC+fan. In this experiment, we run 16- thread of cholesky on 16-cores. Figure 5.4(a) and (b) monitors the temperature of the hottest component under different cooling management policies. Figure 5.4(c) shows the corresponding power usage trace. We set the baseline case maximum temperature as the temperature threshold (e.g., 91◦C). That is the best that dynamic fan can do on this 16-core system. In Figure 5.4(a), simply setting fan on the 2nd speed level (i.e.,

“Fan level 2”) will introduce multiple temperature violations. Figure 5.4(b) shows that TEC+fan significantly reduce the temperature. Please note the fan is running at the 2nd speed level in Figure 5.4(b). The temperature is always under the threshold expect two data points (1283 and 1817). Figure 5.4(b) shows the effectiveness of using TEC to reduce hot spot temperature. Figure 5.4(c) shows the corresponding cooling power in this test case. The left y-axis is the power scale of the fan part; the right y-axis is the power scale of the TEC part. Since the power consumption of a fan has a cubic relationship with the speed of the fan [4], running fan at the 1st level consumes much more power than using the 2nd level. At 1st speed level, this fan consumes 14.4W cooling power. At the 2nd level, it only consumes 3.8W. Our fan power is estimated based on Dynatron R16 fan [31], which is designed for Intel Core i5 packaging. In Figure 5.4(c), the combined cooling power of fan running at

2nd level and TEC is still much smaller than running the fan at the 1st speed level.

However, using TEC+fan can achieve very close cooling effect of using fan level 1.

116 110 Fan level 1 Fan level 2 100 Threshold 90 80 70

Temperature (ºC) Temperature 60 0 500 1000 1500 Time (200us) (a) Temperature traces of dynamic fan. 110 TEC+fan Threshold 100 90 80 70

Temperature (ºC) Temperature 60 0 500 1000 1500 Time (200us) (b) Temperature trace of TEC+fan. 15 0.5 12 Fan level 1 power 0.4 9 Fan level 2 power 0.3 6 TEC power 0.2 3 0.1 Fan power(W) TEC power(W) 0 0 0 500 1000 1500 Time (200us) (c) Power traces. Figure 5.4: TEC+Fan vs. Dynamic-fan: temperature and cooling power comparison between Dynamic-fan and TEC+fan. Using the 1st (highest) fan speed level can achieve much better cooling than using the 2nd fan speed level. However, using TEC and the 2nd fan speed can achieve very close cooling effect to that of the 1st fan speed level. In addition, the combined cooling power consumption of using TEC and 2nd fan speed is much lower than running fan at 1st speed level. That is due to the cubic dependence of fan power consumption on fan speed [4].

117 105 10 Dynamic-fan TEC+fan Dynamic-fan TEC+fan 95 DVFS+fan DVFS+TEC 7.5 DVFS+fan DVFS+TEC Co-op Co-op 85 5 75 2.5 65 0 T_th violation violation (%) T_th Max temperature (ºC)

(a) Maximum temperature. (b) T th violation points/total points. Figure 5.5: Cooling performance comparison. We set the highest temperature of base- line cases as T th in each experiment. Co-op consistently offers the lowest maximum temperature in studied cases. Co-op also has the smallest T th violation.

4 2 Dynamic-fan TEC-fan Dynamic-fan TEC-fan 3 DVFS-fan DVFS+TEC 1.5 DVFS-fan DVFS+TEC Co-op Co-op 2 1 1 0.5 0 0 baseline case) to baseline case) Power (normalized Delay (normalized to to (normalized Delay

(a) Execution delay comparison. (b) Average power comparison. 2 4 Dynamic-fan TEC-fan Dynamic-fan TEC+fan 1.5 DVFS-fan DVFS+TEC 3 DVFS+fan DVFS+TEC Co-op Co-op 1 2 0.5 1 0 0 baseline case) baseline to baseline case) EDP (normalized EDP to Energy (normalized Energy

(c) Total energy comparison. (d) Energy delay product comparison. Figure 5.6: Execution performance comparison. Due to the cubic dynamic reduction of DVFS at linear performance degradation cost, DVFS+fan has the lowest energy usage. However, it has the longest delay. Since Co-op gives priority to performance, it reduces the power on the condition of not sacrificing too much performance. There- fore, Co-op achieves the lowest EDP.

118 5.5.4 Cooling Performance

Figure 5.5 compares the cooling performance of the studied policies under different workloads. Figure 5.5(a) compares the maximum temperature of the different policies. In the baselines, since we use the highest fan speed (the most cooling), we can cool the processor to the lowest temperature. For the general high-power workload cholesky and the workload with a local hot spot, lu, TEC+fan has a lower maximum temperature than DVFS+fan, because TEC+fan uses the TEC to effectively address the local hot spot. The low-power workload fmm also enjoys the benefit of local cooling. However, for water, whose power dissipation is high and evenly distributed, the DVFS+fan policy has a lower maximum temperature than TEC+fan. Interestingly, we find that in all the tested cases, the simple combination of DVFS and TEC, DVFS+TEC, has a higher maximum temperature than either TEC+fan or DVFS+fan, because DVFS+TEC does not consider the interference between the two knobs. For example, when the current temperature is lower than the threshold, the TEC management algorithm turns off the TEC while the DVFS management algorithm increases the DVFS level; in the next step, the temperature may overshoot. In contrast, Co-op consistently has the lowest temperature in the studied cases because it coordinates all the available knobs to fully take advantage of each. In Figure 5.5(b), we set the maximum temperature of each workload in the baseline cases as the temperature threshold (T_th). T_th represents the best cooling performance that fan-based cooling can offer (at the cost of large cooling power). Due to dynamic management, the studied policies have occasional temperature violations. Since we set T_th to the maximum temperature in the baseline cases, Dynamic-fan would incur many temperature violations if it used any fan speed other than the highest one, as we can see in Figure 5.4. Therefore, Dynamic-fan behaves the same as the baseline case. DVFS+fan and DVFS+TEC have more violations than TEC+fan due to DVFS level oscillations. Co-op coordinates all the available knobs to consistently achieve a small violation rate (<0.5%).

5.5.5 System Performance

Figure 5.6 compares the execution delay, average power, energy, and energy-delay product [39] of the different policies. Figure 5.6(a) compares the delay of each case. We define the delay as the execution time of each case normalized to the baseline case. The shorter the delay, the better the performance. Since TEC+fan and Dynamic-fan do not adjust the DVFS level of the processor, they have the same execution time as the baseline case. By coordinating all the available knobs and choosing the best actuation to enforce, Co-op has only a 4% longer delay than the baseline. Relying on throttling the processor to cut power dissipation and reduce temperature, DVFS+fan has a 60% longer delay. DVFS+TEC leverages TEC's local cooling capability to reduce the need for throttling the processor. However, DVFS+TEC suffers from the lack of coordination between DVFS and TEC. Therefore, DVFS+TEC has a longer delay than Co-op but a shorter delay than DVFS+fan. Figure 5.6(b) compares the power in each case. By using TEC for local cooling, TEC+fan can afford to run the fan at the 2nd speed level with a cooling effect similar to running the fan at the 1st speed level (the baseline case). By putting the fan at the 2nd level, TEC+fan reduces the total power by 9%. DVFS+fan reduces the DVFS level to reduce heat dissipation. Since DVFS cuts dynamic power in a cubic way, DVFS+fan significantly reduces the power, by 57%. By using TEC to cool down local hot spots, DVFS+TEC reduces the use of DVFS at the cost of increasing the TEC power. Therefore, the power of DVFS+TEC is slightly higher than that of DVFS+fan. Co-op coordinates all the available knobs for optimized energy and gives priority to performance. Therefore, Co-op does not lower the DVFS levels very often.

Figure 5.7: Total relative cycles comparison. This metric shows the slow-down introduced by applying DVFS. DVFS introduces 26% and 27% slow-down in DVFS+fan and DVFS+TEC, respectively. However, from Figure 5.6, the delays of DVFS+fan and DVFS+TEC are more than 50%. The performance gap can be introduced by inter-thread correlation: throttling one core without considering the other cores makes that core's thread become the slowest thread, increasing the total execution time.

The power of Co-op is higher than that of DVFS+TEC. Figure 5.6(c) presents the energy of each case. We sum each power value multiplied by its time interval over the trace file of one execution to obtain the total energy. The reported numbers in Figure 5.6(c) are normalized to the baseline case. Having the same execution time but a lower cooling requirement, TEC+fan saves 9% energy. Due to the cubic dynamic power reduction effect, although DVFS+fan sacrifices significant performance, it buys more power savings. Therefore, DVFS+fan achieves 34% energy savings. DVFS+TEC also adopts aggressive DVFS throttling. Therefore, it achieves 32% energy savings. Although Co-op gives priority to performance, it still saves 27% energy. To correct the energy bias of DVFS, we also present the energy-delay product [39] results in Figure 5.6(d). Co-op shows a clear advantage since it still gives performance higher priority. In contrast, the policies that rely heavily on DVFS lose their advantage. DVFS+fan has no EDP saving (4% higher than the baseline). DVFS+TEC has an even worse EDP than the baseline cases. Figure 5.7 compares the relative total cycles to further investigate the impact of DVFS.
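As a concrete illustration of how the energy and EDP numbers above can be derived from a power trace, the following sketch sums power multiplied by interval length to obtain energy and multiplies by the total execution time to obtain EDP. The trace format, the helper names, and the 200 µs interval are illustrative assumptions, not the exact instrumentation of our simulator.

```python
# Illustrative sketch: total energy and energy-delay product (EDP) from a
# power trace given as (power_in_watts, interval_in_seconds) samples.

def total_energy(trace):
    """Sum power * interval over the whole trace (Joules)."""
    return sum(p * dt for p, dt in trace)

def energy_delay_product(trace):
    """EDP = total energy * total execution time."""
    delay = sum(dt for _, dt in trace)
    return total_energy(trace) * delay

if __name__ == "__main__":
    dt = 200e-6  # hypothetical 200 us sampling interval, as in the traces above
    trace = [(84.0, dt), (79.5, dt), (81.2, dt)]
    print(total_energy(trace), energy_delay_product(trace))
```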

Figure 5.8: Number of active cores sensitivity. (a) Max temperature comparison. (b) Delay comparison. (c) Total energy comparison. (d) EDP comparison. We deploy 4 threads running on our simulator.

Specifically, we sum the cycles executed by each core to obtain the processor cycles and then sum all the processor cycles over the entire trace file. We also compute the total cycles of the entire execution time assuming the processor cores run at the peak frequency all the time. The relative total cycles is computed by dividing the total aggregated cycles by the total number of possible cycles. For example, suppose we have a 2-core system and a 3-interval frequency trace (normalized to the peak frequency): [(0.6, 0.9), (1.0, 0.8), (0.7, 0.9)]. The total aggregated cycles are 0.6+0.9+1.0+0.8+0.7+0.9 = 4.9. The total possible cycles are 2 × 3 × 1.0 = 6. The relative total cycles is therefore 4.9/6 ≈ 0.82. This metric presents the slow-down introduced by throttling the processor. We observe that DVFS introduces 4% slow-down in Co-op, and 26% and 27% slow-down in DVFS+fan and DVFS+TEC, respectively. However, from Figure 5.6, the delays of DVFS+fan and DVFS+TEC are more than 50%. The performance gap can be introduced by inter-thread correlation: throttling one core without considering the other cores makes that core's thread become the slowest thread, increasing the total execution time.
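The worked example above translates directly into a few lines of code. In the sketch below the function and variable names are ours, and the trace is assumed to list, for every interval, each core's frequency normalized to the peak frequency.

```python
# Illustrative sketch of the relative-total-cycles metric described above.

def relative_total_cycles(freq_trace, num_cores):
    """freq_trace: list of per-interval tuples of normalized core frequencies."""
    aggregated = sum(sum(interval) for interval in freq_trace)
    possible = num_cores * len(freq_trace) * 1.0  # all cores at peak frequency
    return aggregated / possible

# The worked example from the text: 2 cores, 3 intervals.
trace = [(0.6, 0.9), (1.0, 0.8), (0.7, 0.9)]
print(relative_total_cycles(trace, 2))  # -> 0.8166..., i.e., about 0.82
```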

Figure 5.9: Temperature threshold sensitivity. (a) Max temperature comparison. (b) The number of T_th violation points. (c) Delay comparison. (d) Total energy comparison. We set T_th as the peak temperature of running 16 threads at the 2nd fan speed level.

The comparison between Figure 5.7 and Figure 5.6 shows the necessity of coordinating the adjustments among all the on-chip cores.

Figure 5.8 compares 4-thread cases to study the impact of the number of active cores on the studied policies. In this test, we set T_th as the peak temperature of running 4 threads in the baseline cases. Since we use the peak temperature of the 4-thread baseline as T_th, the baseline case has the lowest temperature of all the cases. Due to dynamic management, all the studied policies have occasional temperature violations. Figures 5.8(a) and (b) show that, since the total power of the 4-thread case is low, neither TEC nor DVFS needs to be engaged frequently. That is why in Figure 5.8(b) the delay differences among the policies are small. For the same reason, the energy (Figure 5.8(c)) and EDP (Figure 5.8(d)) differences among the policies are small. However, Co-op still offers the lowest energy and EDP because it coordinates all the available knobs.

Figure 5.9 shows the impact of the temperature threshold. We set the temperature threshold T_th as the highest temperature when running the processor at the 2nd fan speed. Compared with the test cases in Figures 5.5 and 5.6, the threshold temperature in this test case is higher. We assume Dynamic-fan selects the 2nd fan speed level, which is the best-effort selection that does not violate the temperature threshold. Due to their reactive adjustment, the other policies all have transient temperature violations during runtime. Therefore, in Figure 5.9(a), Dynamic-fan has the lowest maximum temperature during runtime. However, Figure 5.9(b) shows that Co-op has the smallest threshold violation among all the dynamic policies because Co-op coordinates all the available knobs. Figure 5.9(c) shows that the execution time differences among the studied policies are small. Since we use a higher temperature threshold, even the DVFS-throttling-based policies do not have to resort to DVFS throttling as often. Therefore, the performance degradation is reduced. For a similar reason, the energy differences among the studied policies are small. The only exception is Dynamic-fan, because we assume that Dynamic-fan selects the best-effort fan adjustment at the beginning of the execution and then consumes a fixed cooling power. The other policies are similar, each saving about 20% energy compared with Dynamic-fan.

5.6 Conclusions

Both global and local thermal issues in processors are exacerbated by technology scaling. As an emerging technology, TEC offers effective local cooling, which complements the global cooling of fans to improve the overall cooling efficiency. However, TEC is an active device that itself also generates heat within the package. Therefore, careful coordination between the fan and the TEC, and between the cooling part and the computation part, is needed to optimize the energy of the entire system. In this chapter, we have presented iTECool, a highly configurable TEC-based cooling framework for energy efficiency. Specifically, we first formulate the energy optimization problem with a temperature constraint as an assignment problem. Since such a problem requires a prohibitively long time to solve online, we design a quick greedy algorithm to solve it. iTECool is designed to be highly configurable so that it can easily integrate other components. We provide two case studies. One is to adopt DVFS as an extra knob to co-optimize cooling and computational power. The other is to consider the voltage regulator (VR) efficiency change at different load conditions and voltage levels in iTECool. Our extensive experimental results show that iTECool can reduce the system energy by 27% compared with state-of-the-art baselines.
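For readers who want a concrete picture of what such a greedy heuristic can look like, the sketch below repeatedly selects the knob setting (e.g., a TEC on/off state, a fan level, or a DVFS level) that yields the largest predicted energy saving without violating the temperature threshold. This is only an illustrative reconstruction under our own assumptions; the candidate set, the prediction functions, and all names are hypothetical and do not reproduce the actual iTECool algorithm.

```python
# Hedged, assumption-based sketch of a greedy knob-selection loop in the
# spirit of the summary above; not the dissertation's actual implementation.

def greedy_select(candidates, predict_temp, predict_energy, t_threshold):
    """Greedily add the feasible knob setting with the largest predicted
    energy saving until no candidate improves energy without violating T_th."""
    chosen = []
    improved = True
    while improved:
        improved = False
        best, best_saving = None, 0.0
        base_energy = predict_energy(chosen)
        for c in candidates:
            if c in chosen:
                continue
            trial = chosen + [c]
            if predict_temp(trial) > t_threshold:
                continue  # would violate the temperature constraint
            saving = base_energy - predict_energy(trial)
            if saving > best_saving:
                best, best_saving = c, saving
        if best is not None:
            chosen.append(best)
            improved = True
    return chosen
```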

CHAPTER 6

CONCLUSIONS

In this work, we have studied several techniques to improve the power/energy efficiency of a CMP. A scalable power control solution has been presented that adjusts the frequency of each core in a CMP to keep the power consumption from exceeding an assigned budget while optimizing the performance of multi-threaded workloads. PGCapping presents a solution that integrates core-level power gating and per-core frequency scaling to control the power consumption of a CMP and improve system performance. GreenGPU coordinates the CPU and GPU in a high-performance computing server to improve the power efficiency of the entire system by dynamically allocating the workload between the GPU and CPU at the software level and dynamically scaling the frequencies at the hardware level. Finally, iTECool integrates per-core DVFS, the on/off state of each TEC device, and the speed of the cooling fan to improve the energy efficiency of the entire system, including the cooling power.

BIBLIOGRAPHY

[1] Alameldeen, A. R., and Wood, D. A. IPC considered harmful for multiprocessor workloads. IEEE Micro 26, 4 (2006).

[2] AMD. AMD dragon platform technology performance tuning guide, 2009.

[3] AMD. AMD family 10h server and workstation processor power and thermal data sheet, 2010.

[4] Ayoub, R., and Rosing, T. Cool and save: Cooling aware dynamic workload scheduling in multi-socket cpu systems. In ASP-DAC (2010).

[5] Bell, S., et al. Tile64 processor: A 64-core SoC with mesh interconnect. In ISSCC (2008).

[6] Bhattacharjee, A., and Martonosi, M. Thread criticality predictors for dynamic performance, power, and resource management in chip multiprocessors. In ISCA (2009).

[7] Bienia, C., et al. The PARSEC benchmark suite: Characterization and architectural implications. In PACT (2008).

[8] Bircher, W. L., and John, L. Predictive power management for multi-core processors. In WEED (2010).

[9] Biswas, S., et al. Fighting fire with fire: Modeling the data center scale effects of targeted superlattice thermal management. In ISCA (2011).

[10] Bitirgen, R., Ipek, E., and Martinez, J. Coordinated management of multiple interacting resources in chip multiprocessors: A machine learning approach. In MICRO (2008).

[11] Blome, J., Feng, S., Gupta, S., and Mahlke, S. Self-calibrating online wearout detection. In MICRO (2007).

[12] Borkar, S. Thousand core chips: a technology perspective. In DAC (2007).

[13] Bovet, D. P., and Cesati, M. Understanding the Linux Kernel, Third Edition. O'Reilly, 1997.

[14] BP. BP3180N datasheet. http://www.solardepot.com/pdf/BP3180N.pdf, 2009.

[15] Brooks, D., et al. Wattch: A framework for architectural-level power analysis and optimizations. In ISCA (2000).

[16] Brooks, D., and Martonosi, M. Dynamic thermal management for high-performance microprocessors. In HPCA (2001).

[17] Cai, Q., et al. Meeting points: using thread criticality to adapt multicore hardware to parallel regions. In PACT (2008).

[18] Calin, T., et al. Built-in current sensor for IDDQ testing in deep submicron CMOS. In VLSITS (1999).

[19] Chaparro, P., et al. Dynamic thermal management using thin-film thermoelectric cooling. In ISLPED (2009).

[20] Che, S., Boyer, M., Meng, J., Tarjan, D., Sheaffer, J., Lee, S.-H., and Skadron, K. Rodinia: A benchmark suite for heterogeneous computing. In IISWC (2009).

[21] Chowdhury, I., Prasher, R., Lofgreen, K., Chrysler, G., Narasimhan, S., Mahajan, R., Koester, D., Alley, R., and Venkatasubramanian, R. On-chip cooling by superlattice-based thin-film thermoelectrics. Nature NanoTech. 4 (2009).

[22] Collange, S., Defour, D., and Tisserand, A. Power consumption of gpus from a software perspective. Computational Science (2009), 914–923.

[23] Coskun, A., Strong, R., Tullsen, D., and Rosing, T. S. Evaluating the impact of job scheduling and power management on processor lifetime for chip multiprocessors. In MMCS (2009), SIGMETRICS.

[24] Coskun, A. K., et al. Evaluating the impact of job scheduling and power management on processor lifetime for chip multiprocessors. In SIGMETRICS (2009).

[25] Curran, B., et al. 4GHz+ low-latency fixed-point and binary floating-point execution units for the POWER6 processor. In ISSCC (2006).

[26] Curran, B., et al. 4GHz+ low-latency fixed-point and binary floating-point execution units for the power6 processor. In ISSCC (2006).

[27] Dean, J., and Ghemawat, S. Mapreduce: simplified data processing on large clusters, 2004.

[28] Dennard, R., Gaensslen, F., Yu, H.-N., Rideout, L., Bassous, E., and LeBlanc, A. Design of ion-implanted MOSFET's with very small physical dimensions. Journal of Solid-State Circuits, IEEE (1974).

[29] Dhiman, G., and Rosing, T. S. Dynamic voltage frequency scaling for multi-tasking systems using online learning. In ISLPED (2007).

[30] Donald, J., and Martonosi, M. Techniques for multicore thermal management: Classification and new exploration. In ISCA (2006).

[31] Dynatron. Dynatron new product SPEC sheet R16. http://www.dynatron-corp.com.

[32] Economou, D., et al. Full-system power analysis and modeling for server environments. In MOBS (2006).

[33] Esmaeilzadeh, H., et al. Dark silicon and the end of multicore scaling. In ISCA (2011).

[34] Esmaeilzadeh, H., Blem, E., Amant, R., Sankaralingam, K., and Burger, D. Dark silicon and the end of multicore scaling. In ISCA (2011).

[35] Eyerman, S., and Eeckhout, L. Fine-grained DVFS using on-chip regulators. TACO 8, 1 (2011).

[36] Franklin, G. F., Powell, D. J., and Workman, M. Digital Control of Dynamic Systems, 3rd edition. Addison-Wesley, 1997.

[37] Freund, V., and Schapire, R. A decision-theoretic generalization of on-line learning and an application to boosting. JCSS 55 (1997).

[38] Gjanci, J., and Chowdhury, M. Investigating issues of on-chip voltage regulator in nanoscale integrated circuits. In ICM (2008).

[39] Gonzalez, R., and Horowitz, M. Energy dissipation in general purpose microprocessors. JSSC 31 (1996).

[40] GREEN500.org. The Green500 list. http://www.green500.org/lists/2010/11/top/list.php, 2010.

[41] Greskamp, B., and Torrellas, J. Paceline: Improving single-thread performance in nanoscale CMPs through core overclocking. In PACT (2007).

[42] Grochowski, E., et al. Energy per instruction trends in Intel microprocessors. Tech. rep., Intel Microarchitecture Research Lab, 2006.

[43] Gupta, M. P., Sayer, M.-H., Mukhopadhyay, S., and Kumar, S. Ultra-thin thermoelectric devices for on-chip Peltier cooling. Components, Packaging and Manufacturing Technology, IEEE Transactions on 1, 9 (2011).

[44] Herbert, S., and Marculescu, D. Variation-aware dynamic voltage/frequency scaling. In HPCA (2009).

[45] Hong, C., Chen, D., Chen, W., Zheng, W., and Lin, H. MapCG: writing parallel program portable between CPU and GPU. In PACT (2010).

[46] Hong, S., and Kim, H. An integrated GPU power and performance model. In ISCA (2010).

[47] Horowitz, M. Scaling, power, and the future of CMOS. In VLSID (2007).

[48] Howard, J., et al. A 48-core IA-32 message-passing processor with DVFS in 45nm CMOS. In ISSCC (2010).

[49] Howard, J., et al. A 48-core IA-32 processor in 45 nm CMOS using on-die message-passing and DVFS for performance and power scaling. JSSC 46, 1 (2011).

[50] Hsu, C.-H., and Kremer, U. The design, implementation, and evaluation of a compiler algorithm for CPU energy reduction. In PLDI (2003).

[51] Huang, L., Yuan, F., and Xu, Q. Lifetime reliability-aware task allocation and scheduling for MPSoC platforms. In DATE (2009).

[52] Intel. Intel Turbo Boost technology. http://www.intel.com/technology/turboboost/, 2007.

[53] Intel. Intel Core i7 processor extreme edition and Intel Core i7 processor datasheet. Tech. rep., Intel Corporation, 2008.

[54] Isci, C., Buyuktosunoglu, A., Cher, C.-Y., Bose, P., and Martonosi, M. An analysis of efficient multi-core global power management policies: Maximizing performance for a given power budget. In MICRO (2006).

[55] Isci, C., and Martonosi, M. Runtime power monitoring in high-end processors: Methodology and empirical data. In MICRO (2003).

[56] Jang, B., et al. Exploiting memory access patterns to improve memory performance in data-parallel architectures. Parallel and Distributed Systems, IEEE Transactions on 22, 1 (2011), 105–118.

[57] Kahng, A., et al. Orion 2.0: A fast and accurate NoC power and area model for early-stage design space exploration. In DATE (2009).

[58] Kansal, A., et al. Virtual machine power metering and provisioning. In SoCC (2010).

[59] Karpuzcu, U., Greskamp, B., and Torrellas, J. The bubblewrap many-core: popping cores for sequential acceleration. In MICRO (2009).

[60] Khronos. OpenCL - the open standard for parallel programming of heterogeneous systems. http://www.khronos.org/opencl, 2010.

[61] Kim, S., Chandra, D., and Solihin, Y. Fair cache sharing and partitioning in a chip multiprocessor architecture. In PACT (2004).

[62] Kim, W., et al. A fully-integrated 3-level DC/DC converter for nanosecond-scale DVFS. JSSC (2012).

[63] Kim, W., Gupta, M. S., Wei, G.-Y., and Brooks, D. System level analysis of fast, per-core DVFS using on-chip switching regulators. In HPCA (2008).

[64] Lee, J., and Kim, N. S. Optimizing throughput of power- and thermal-constrained multicore processors using DVFS and per-core power-gating. In DAC (2009).

[65] Lee, J., Sathisha, V., Schulte, M., Compton, K., and Kim, N. S. Improving throughput of power-constrained GPUs using dynamic voltage/frequency and core scaling. In PACT (2011).

[66] Lefurgy, C., et al. Energy management for commercial servers. Computer 36, 12 (2003).

[67] Lefurgy, C., et al. Server-level power control. In ICAC (2007).

[68] Leverich, J., et al. Power management of datacenter workloads using per-core power gating. IEEE Comput. Archit. Lett. 8 (2009).

[69] Li, C., Zhang, W., Cho, C.-B., and Li, T. Solarcore: Solar energy driven multi-core architecture power management. In HPCA (2011).

[70] Li, J., et al. The thrifty barrier: Energy-aware synchronization in shared-memory multiprocessors. In HPCA (2004).

[71] Li, J., and Martinez, J. F. Dynamic power-performance adaptation of parallel computation on chip multiprocessors. In HPCA (2006).

[72] Li, T., Lebeck, A., and Sorin, D. Spin detection hardware for improved management of multithreaded systems. IEEE Trans. Parallel Distrib. Syst. 17, 6 (2006).

[73] Littlestone, N., and Warmuth, M. The weighted majority algorithm. Inf. Comput. 108 (1994), 212–261.

[74] Liu, C., et al. Exploiting barriers to optimize power consumption of CMPs. In IPDPS (2005).

[75] Long, J., and Memik, S. O. A framework for optimizing thermoelectric active cooling systems. In DAC (2010).

[76] Luk, C.-K., Hong, S., and Kim, H. Qilin: exploiting parallelism on heterogeneous multiprocessors with adaptive mapping. In MICRO (2009).

[77] Lungu, A., et al. Dynamic power gating with quality guarantees. In ISLPED (2009).

[78] Ma, K., Li, X., Chen, M., and Wang, X. Scalable power control for many-core architectures running multi-threaded applications. In ISCA (2011).

[79] Madan, N., Buyuktosunoglu, A., Bose, P., and Annavaram, M. A case for guarded power gating in multi-core processors. In HPCA (2011).

[80] Marty, M. R., and Hill, M. D. Virtual hierarchies to support server consolidation. In ISCA (2007).

[81] Mathew, S., Anders, M., Krishnamurthy, R., and Borkar, S. A 4-GHz 130-nm address generation unit with 32-bit sparse-tree adder core. In ISSCC (2003).

[82] McGowen, R., Poirier, C., Bostak, C., Ignowski, J., Millican, M., Parks, W., and Naffziger, S. Power and temperature control on a 90-nm Itanium family processor. IEEE Journal of Solid-State Circuits 41, 1 (2006).

[83] Meenderinck, C., and Juurlink, B. (When) will CMPs hit the power wall? In Euro-Par 2008 Workshops - Parallel Processing, E. César, M. Alexander, A. Streit, J. L. Träff, C. Cérin, A. Knüpfer, D. Kranzlmüller, and S. Jha, Eds. Springer-Verlag, 2009, pp. 184–193.

[84] Meng, K., Joseph, R., Dick, R., and Shang, L. Multi-optimization power management for chip multiprocessors. In PACT (2008).

[85] Milovanovic, E., et al. Synthesis of space optimal systolic arrays for band matrix-vector multiplication. JC (2008).

[86] Minh, T. N., and Wolters, L. Modeling parallel system workloads with temporal locality. In Job Scheduling Strategies for Parallel Processing. Springer-Verlag, 2009.

[87] Mishra, A. K., et al. Poster: Coordinated power management of voltage islands in CMPs. In SIGMETRICS (2010).

[88] Mishra, A. K., Srikantaiah, S., Kandemir, M., and Das, C. R. CPM in CMPs: Coordinated power management in chip-multiprocessors. In SC (2010).

[89] Muralimanohar, N., et al. CACTI 6.0: A tool to model large caches. Tech. rep., HP Laboratories, 2009.

[90] Mwaikambo, Z., and Raj, A. Linux kernel hotplug cpu support. Linux Symposium 2 (2004).

[91] Noonburg, D. B., and Shen, J. P. Theoretical modeling of superscalar processor performance. In MICRO (1994).

[92] NREL. Measurement and instrumentation data center. http://www.nrel.gov/midc, 2011.

[93] NVIDIA. Geforce 8800. http://www.nvidia.com/page/geforce_8800.html, 2010.

[94] NVIDIA. nvidia-smi. http://www.nvidia.com/, 2010.

[95] NVIDIA. CUDA toolkit 3.2 downloads. http://developer.nvidia.com/cuda-toolkit-32-downloads, 2011.

[96] Oberman, S. Floating point division and square root algorithms and implementation in the AMD-K7 microprocessor. In CA (1999).

[97] Pallipadi, V., and Starikovskiy, A. The ondemand governor. http://ftp.kernel.org/pub/linux/kernel/, 2006.

[98] Phansalkar, A., et al. Analysis of redundancy and application balance in the SPEC CPU2006 benchmark suite. SIGARCH Comput. Archit. News 35, 2 (2007).

[99] Powell, M., et al. CAMP: A technique to estimate per-structure power at run-time using a few simple parameters. In HPCA (2009).

[100] Powell, M. D., et al. Camp: A technique to estimate per-structure power at run-time using a few simple parameters. In HPCA (2009).

[101] Raghavendra, R., et al. No power struggle: Coordinated multi-level power management for the data center. In ASPLOS (2008).

[102] Rangan, K. R., Wei, G.-Y., and Brooks, D. Thread motion: Fine-grained power management for multi-core systems. In ISCA (2009).

[103] Ravi, V. T., Ma, W., Chiu, D., and Agrawal, G. Compiler and runtime support for enabling generalized reduction computations on heterogeneous parallel configurations. In ICS (2010).

[104] Renau, J., et al. SESC simulator, January 2005. http://sesc.sourceforge.net.

[105] Sartori, J., and Kumar, R. Distributed peak power management for many-core architectures. In DATE (2009).

[106] Schaller, R. R. Moore's law: past, present, and future. IEEE Spectr. 34, 6 (1997), 52–59.

[107] Shin, D., et al. Energy-optimal dynamic thermal management: Computation and cooling power co-optimization. Industrial Informatics, IEEE Transactions on 6, 3 (2010).

[108] Singh, P., et al. Power delivery network design and optimization for 3d stacked die designs. In 3DIC (2010).

[109] Skadron, K., Abdelzaher, T., and Stan, M. R. Control-theoretic techniques and thermal-RC modeling for accurate and localized dynamic thermal management. In HPCA (2002).

[110] Skadron, K., Stan, M., Sankaranarayanan, K., Huang, W., Velusamy, S., and Tarjan, D. Temperature-aware microarchitecture: Modeling and implementation. ACM Trans. Archit. Code Optim. 1 (2004), 94–125.

[111] Skadron, K., Stan, M. R., Sankaranarayanan, K., Huang, W., Velusamy, S., and Tarjan, D. Temperature-aware microarchitecture: Modeling and implementation. ACM Trans. Archit. Code Optim. 1, 1 (2004).

[112] Srinivasan, J., Adve, S., Bose, P., and Rivers, J. Lifetime reliability: toward an architectural solution. Micro, IEEE (2005).

[113] Stanley-Marbell, P., Cabezas, V. C., and Luijten, R. Pinned to the walls: impact of packaging and application properties on the memory and power walls. In ISLPED (2011).

[114] Su, H., et al. Full chip leakage-estimation considering power supply and temperature variations. In ISLPED (2003).

[115] Sun, J., Xu, M., Reusch, D., and Lee, F. High efficiency quasi-parallel voltage regulators. In APEC (2008).

[116] Teodorescu, R., et al. Variation-aware application scheduling and power management for chip multiprocessors. In ISCA (2008).

[117] Tierno, J., et al. A DPLL-based per core variable frequency clock generator for an eight-core POWER7 microprocessor. In VLSIC (2010).

[118] TOP500.org. National Supercomputing Center in Tianjin. http://top500.org/site/3154, 2010.

[119] Truong, D. N., et al. A 167-processor computational platform in 65 nm CMOS. IEEE Journal of Solid-State Circuits (JSSC) 44, 4 (2009).

[120] Vangal, S. R., et al. An 80-tile sub-100-w teraflops processor in 65-nm CMOS. IEEE Solid-state circuits 43, 1 (2008).

[121] Huang, W., et al. HotSpot: Thermal modeling for CMOS VLSI systems. TCPM (2005).

[122] Wang, G., and Ren, X. Power-efficient work distribution method for CPU- GPU heterogeneous system. In ISPA (2010).

[123] Wang, X., and Chen, M. Cluster-level feedback power control for perfor- mance optimization. In HPCA (2008).

[124] Wang, Y., Ma, K., and Wang, X. Temperature-constrained power control for chip multiprocessors with online model estimation. In ISCA (2009).

[125] Ware, M., Rajamani, K., Floyd, M., Brock, B., Rubio, J. C., Rawson, F., and Carter, J. B. Architecting for power management: The IBM POWER7 approach. In HPCA (2010).

[126] Wattsupmeters. Watts up pro power meter. http://www.wattsupmeters.com, 2010.

[127] Winter, J. A., Albonesi, D. H., and Shoemaker, C. A. Scalable thread scheduling and global power management for heterogeneous many-core architectures. In PACT (2010).

[128] Woo, S. C., et al. The SPLASH-2 programs: characterization and methodological considerations. In ISCA (1995).

[129] Wu, Q., Juang, P., Martonosi, M., and Clark, D. W. Formal online methods for voltage/frequency control in multiple clock domain microprocessors. In ASPLOS (2004).

[130] Yan, G., et al. Agileregulator: A hybrid voltage regulator scheme redeeming dark silicon for power efficiency in a multicore architecture. In HPCA (2012).

[131] Zhang, Y., et al. HotLeakage: A temperature-aware model of subthreshold and gate leakage for architects. Tech. rep., University of Virginia, 2003.

[132] Zhou, C., Sylvester, D., and Blaauw, D. Process variation and temperature-aware reliability management. In DATE (2010).
