POWER CONSTRAINED PERFORMANCE OPTIMIZATION IN CHIP MULTI-PROCESSORS
DISSERTATION
Presented in Partial Fulfillment of the Requirements for the Degree Doctor of
Philosophy in the Graduate School of The Ohio State University
By
Kai Ma, B.S., M.S.
Graduate Program in Electrical and Computer Engineering
The Ohio State University
2013
Dissertation Committee:
Prof. Xiaorui Wang, Advisor
Prof. Füsun Özgüner
Prof. Kevin M. Passino
Prof. Ümit V. Çatalyürek

© Copyright by
Kai Ma
2013

ABSTRACT
With technology scaling in the semiconductor industry, both the power density and the power consumption of processors keep increasing. Compared with the traditional approach of raising frequency, integrating more cores on the processor chip offers the opportunity to exploit inter-thread parallelism and better energy efficiency. Therefore, processor design has officially entered the chip multi-processor era. However, the peak power consumption (i.e., power budget or power cap) of a processor is still constrained by the cooling capacity, by power delivery limitations, or by limits specified by users for different management purposes. Accordingly, it is important to address performance optimization under power constraints (i.e., power capping). Important as it is, power capping is also challenging. Fundamentally, the performance/power relationship of applications is unknown a priori due to runtime variations. Therefore, it is difficult to choose the optimal adjustment in a large space of possible adjustments.
In this document, we investigate different aspects of power capping, such as considering more components (e.g., the cache part in addition to the traditional core part), using new knobs (e.g., power gating), managing newly emerging platforms (e.g., CPU-GPU hybrid systems), and using new cooling technologies (e.g., thermoelectric cooling).
First, we explore the opportunity to coordinate the cache part and the core part in a CMP (i.e., chip multi-processor). Second, we investigate a scalable power capping algorithm that can leverage the inter-thread dependency of multi-threaded applications for optimized performance. Third, we integrate dynamic voltage and frequency scaling with power gating for power capping, while also considering core-level service lifetime balancing. Fourth, we develop an energy conservation algorithm for CPU-GPU hybrid systems. Fifth, we examine the co-optimization between computational power and cooling power offered by new cooling devices. Throughout, we focus on the power capping issue but also discuss the related energy conservation and thermal issues.
This document is dedicated to my wonderful family.
ACKNOWLEDGMENTS
Without the help of the following people, I would not have been able to complete my dissertation. My heartfelt thanks to:
Dr. Xiaorui Wang, for his guidance. I could not have asked for a better mentor. Without his help, I would not have had the opportunity to change my specialization to
Computer Architecture, nor would I have enjoyed the level of success I have achieved in this area of research.
Dr. Yefu Wang, for his help with the feedback-control-based power control concept that ultimately developed into our Temperature-Constrained Power Control paper.
Dr. Ming Chen, for his writing advice that ultimately helped develop our Scalable Power Control paper.
Xue Li, Wei Chen, and Chi Zhang, for their contributions to the GreenGPU project.
VITA
1981 ............... Born in Changchun, Jilin, China
2004 ............... B.S. Information Engineering, Zhejiang University, Hangzhou, Zhejiang, China
2007 ............... M.S. Electrical Engineering, Tongji University, Shanghai, China
2008-2011 .......... Graduate Research Associate, The University of Tennessee, Knoxville, TN, USA
2011-Present ....... Graduate Research Associate, The Ohio State University, Columbus, OH, USA
PUBLICATIONS
Yefu Wang, Kai Ma, and Xiaorui Wang, Temperature-Constrained Power Control for Chip Multiprocessors with Online Model Estimation, the 36th International Symposium on Computer Architecture, June 2009, Austin, Texas, USA
Xiaorui Wang, Kai Ma, and Yefu Wang, Achieving Fair or Differentiated Cache Sharing in Power-Constrained Chip Multiprocessors, the 39th International Conference on Parallel Processing, September 2010, San Diego, California, USA
Kai Ma, Xue Li, Ming Chen, and Xiaorui Wang, Scalable Power Control for Many-Core Architectures Running Multi-threaded Applications, the 38th International Symposium on Computer Architecture, June 2011, San Jose, California, USA
Kai Ma, Xiaorui Wang, and Yefu Wang, DPPC: Dynamic Power Partitioning and Capping in Chip Multiprocessors, the 29th International Conference on Computer Design, October 2011, Amherst, Massachusetts, USA
Kai Ma, Xue Li, Wei Chen, Chi Zhang, and Xiaorui Wang, GreenGPU: A Holistic Approach to Energy Efficiency in GPU-CPU Heterogeneous Architectures, the 41st International Conference on Parallel Processing, September 10-13, 2012, Pittsburgh, PA, USA
Kai Ma and Xiaorui Wang, PGCapping: Exploiting Power Gating for Power Capping and Core Lifetime Balancing in CMPs, the 21st International Conference on Parallel Architectures and Compilation Techniques, September 19-23, 2012, Minneapolis, MN, USA
Xiaorui Wang, Kai Ma, and Yefu Wang, Cache Latency Control for Application Fairness or Differentiation in Power-Constrained Chip Multiprocessors, IEEE Transactions on Computers, 61(12): 1-15, December 2012
Xiaorui Wang, Kai Ma, and Yefu Wang, Adaptive Power Control with Online Model Estimation for Chip Multiprocessors, IEEE Transactions on Parallel and Distributed Systems, 22(10): 1681-1696, October 2011
Kai Ma, Xiaorui Wang, and Yefu Wang, DPPC: Dynamic Power Partitioning and Control for Improved Chip Multiprocessor Performance, IEEE Transactions on Computers, 2013 (accepted)
FIELDS OF STUDY
Major Field: Electrical and Computer Engineering
Specialization: Computer Systems and Architecture
TABLE OF CONTENTS

Abstract
Dedication
Acknowledgments
Vita
List of Figures

CHAPTER

1 Introduction and Background
   1.1 Power Wall
   1.2 Chip Multi-processors
   1.3 Power Capping
   1.4 Contributions

2 Scalable Many-Core Power Control for Multi-threaded Applications
   2.1 Introduction
   2.2 Background
   2.3 System Architecture
   2.4 Chip-level Power Control
   2.5 Dynamic Aggregated Frequency Partitioning
      2.5.1 Chip-level Partitioning
      2.5.2 Group-level Partitioning
   2.6 Core-level Power Estimation on Physical Testbed
   2.7 Implementation
      2.7.1 Testbed
      2.7.2 Simulation Environment
      2.7.3 Discussion on Hardware Implementation
   2.8 Evaluation
      2.8.1 Baselines
      2.8.2 Estimation Accuracy
      2.8.3 Testbed Results
      2.8.4 Simulation Results
      2.8.5 Discussion on Algorithm Complexity and Scalability
   2.9 Conclusion

3 Power Gating for Power Capping and Core Lifetime Balancing
   3.1 Introduction
   3.2 Background
   3.3 System Design
      3.3.1 Design of PCPG Management Module
      3.3.2 Design of DVFS Management Module
      3.3.3 Lifetime Balancing
   3.4 Implementation
      3.4.1 Power Capping Evaluation Testbed
      3.4.2 Lifetime Balancing Evaluation Simulator
   3.5 Evaluation
      3.5.1 Baselines
      3.5.2 Power Control Accuracy
      3.5.3 Application Performance
      3.5.4 Lifetime Balancing
   3.6 Conclusion

4 Energy Efficiency in GPU-CPU Heterogeneous Architectures
   4.1 Introduction
   4.2 Background
   4.3 Motivation
      4.3.1 A Case Study on Frequency Scaling for GPU Cores and Memory
      4.3.2 A Case Study on Workload Division between GPU and CPU
   4.4 System Design of GreenGPU
   4.5 GreenGPU Algorithms
      4.5.1 Dynamic Frequency Scaling for GPU Cores and Memory
      4.5.2 Workload Division
   4.6 Implementation
   4.7 Experiments
      4.7.1 Frequency Scaling for GPU Cores and Memory
      4.7.2 Workload Division between GPU and CPU
      4.7.3 GreenGPU as a Holistic Solution
   4.8 Conclusion

5 Integrating Thermoelectric Coolers and Fans for Energy Efficiency
   5.1 Introduction
   5.2 Background
   5.3 System Design
      5.3.1 Thermal Model
      5.3.2 Power and Performance Model
      5.3.3 Problem Formulation
      5.3.4 Heuristic Solution
      5.3.5 Hardware Cost
      5.3.6 Per-core DVFS Assumption
   5.4 Simulation Setup
   5.5 Experiments
      5.5.1 Baseline Results
      5.5.2 Studied Policies
      5.5.3 Integrating TEC with Fan
      5.5.4 Cooling Performance
      5.5.5 System Performance
   5.6 Conclusions

6 Conclusions

Bibliography
LIST OF FIGURES

2.1 Three-layer power control architecture for a 16-core chip multiprocessor. Cores running the same multi-threaded applications are grouped together. Idle cores (e.g., C9) are transitioned into a low power mode.

2.2 Power estimation accuracy experiments on a 12-core hardware testbed.

2.3 Power control accuracy comparison. In (a)-(c), the frequencies are relative to the peak of a selected core. In (d), the power values are relative to the peak power in each test case.

2.4 Group-level (thread criticality-aware) frequency quota (i.e., sum of normalized DVFS levels) allocation traces of FreqPar and the baselines.

2.5 Chip-level (power efficiency-aware) frequency quota (i.e., sum of normalized DVFS levels) allocation traces and power efficiency of FreqPar and the baselines.

2.6 Overall performance comparison between FreqPar and the baselines on a 12-core hardware testbed.

2.7 Power and performance comparison in simulations under different numbers of cores.

2.8 Execution time experiments show that FreqPar is more scalable than Steepest Drop.

3.1 The decoupled design uses the power budget, chip power measurement, per-core utilization, temperature, and lifetime as inputs. It computes the next-step power mode (e.g., on/off, DVFS level, overclocking state) of each core to cap the entire chip power, boost performance, and balance lifetime.

3.2 Quick search algorithm flowchart. Only the power-higher-than-budget case is presented for concision.

3.3 The decoupled solution PGCapping can precisely enforce the power budget by using PCPG, DVFS, and overclocking in both the high and low power budget cases. We calculate the average power over a 2s window as Pavg to clearly present the general trend. Hardware testbed results.

3.4 The decoupled solution PGCapping can reserve power headroom by using PCPG and accelerate cores running useful workloads by using overclocking. Hardware testbed results. The frequencies are normalized to the peak frequency of one core (i.e., relative frequency). Freq/# is calculated by dividing the total aggregated relative frequency of the entire chip by the number of turned-on cores, which can be interpreted as the high-level computing capability that each turned-on core can offer (higher is preferable). We also calculate the average Freq/# over a 2s window as Fa/#.

3.5 PGCapping achieves results very close to those of Balanced (a best-effort per-core-DVFS-based lifetime balancing scheme) and outperforms the Random and Round-robin baselines. Simulation results.

4.1 Normalized execution time is the execution time of a workload normalized to its execution time at the peak frequency. Relative energy is the energy normalized to the energy consumed at the peak frequency. There are opportunities to save energy with negligible performance loss by throttling under-utilized components.

4.2 Energy consumption for different workload division ratios. The cooperation of the CPU and GPU parts can be more energy efficient than the GPU part taking all the work exclusively.

4.3 GreenGPU features a two-tier design to reduce the energy consumption of CPU-GPU heterogeneous platforms. The higher tier (i.e., the workload division tier) dynamically partitions the incoming workloads between the CPU and GPU parts. The dashed lines connect the components of the workload division part. The lower tier (i.e., the frequency scaling tier) takes the utilization of the processing elements (GPU cores, GPU memory, and CPUs) to decide their proper frequency levels to reduce energy consumption. The dotted lines connect the components of the frequency scaling part.

4.4 Hardware testbed used in our experiments, which includes a Dell Optiplex 580 desktop with an Nvidia GeForce GPU and an AMD Phenom II CPU, two power meters, and a separate ATX power supply to power the GPU card. Meter 1 measures the power of the CPU side, while Meter 2 measures the power of the GPU side.

4.5 The frequency scaling algorithm adjusts the frequencies of the GPU cores and memory based on their respective utilizations to save energy without increasing execution time.

4.6 Energy savings compared with best-performance for different workloads.

4.7 The workload division algorithm adjusts the workload allocation between the CPU and GPU parts to minimize the idling energy on either side caused by waiting for the other (slower) side.

4.8 Energy and workload division ratio traces with respect to the iterations. GreenGPU outperforms workload-division-only and frequency-scaling-only on energy savings.

5.1 Side view of the target chip packaging and the TEC cooling effect. TECs are embedded between the heat spreader and the processor chip in the thermal interface material (TIM) layer. By applying current to a TEC, heat can be pumped from one side of this film device to the other. iTECool coordinates the fan and multiple TECs to improve the overall cooling efficiency. In addition, iTECool also coordinates the DVFS level of each core and the cooling subsystem (TECs and fan) to reduce the energy consumption of the entire system (i.e., processor, fan, and TECs).

5.2 Multi-step down-hill greedy algorithm (Co-op) flow chart. Based on the thermal, power, and performance models (Equations (10)-(17)), Co-op estimates the next-step energy if a certain adjustment is applied; it then selects the adjustment that yields the smallest energy consumption within the temperature constraint. If the current temperature is lower than the threshold, Co-op compares the energy of turning off a TEC with that of raising the DVFS level of one core; if the current temperature is higher than the threshold, Co-op compares the energy of turning on a TEC with that of lowering the DVFS level of one core. Co-op moves forward multiple steps along the smaller-energy adjustment direction until the temperature constraint is satisfied.

5.3 Simulated processor floor plan. The chip floor plan is scaled based on the Intel SCC 48-core chip [49]. A core tile is half the size of the dual-core tile on the SCC. The component placement and relative sizes are scaled from the Alpha 21264. The router size is the same as the SCC on-chip router. We estimate the voltage regulator size based on a 0.5W delivered power/mm2 measurement on a prototype on-chip regulator [62]. Chip floor plan: 10.4mm×14.4mm, a 4×4 core tile array. Core tile floor plan: 2.6mm×3.6mm.

5.4 TEC+Fan vs. Dynamic-fan: temperature and cooling power comparison. Using the 1st (highest) fan speed level achieves much better cooling than using the 2nd fan speed level. However, using TEC with the 2nd fan speed achieves a cooling effect very close to that of the 1st fan speed level, and the combined cooling power of TEC plus the 2nd fan speed is much lower than running the fan at the 1st speed level, due to the cubic dependence of fan power consumption on fan speed [4].

5.5 Cooling performance comparison. We set the highest temperature of the baseline cases as Tth in each experiment. Co-op consistently offers the lowest maximum temperature in the studied cases. Co-op also has the smallest Tth violation.

5.6 Execution performance comparison. Due to the cubic dynamic power reduction of DVFS at a linear performance degradation cost, DVFS+fan has the lowest energy usage. However, it has the longest delay. Since Co-op gives priority to performance, it reduces power on the condition of not sacrificing too much performance. Therefore, Co-op achieves the lowest EDP.

5.7 Total relative cycles comparison. This metric shows the slow-down introduced by applying DVFS. DVFS introduces 26% and 27% slow-down for DVFS+fan and DVFS+TEC, respectively. However, from Figure 5.6, the delays of DVFS+fan and DVFS+TEC are more than 50%. The performance gap can be introduced by inter-thread correlation: throttling one core without considering the other cores can make one thread the slowest thread, increasing the total execution time.

5.8 Number of active cores sensitivity. We deploy 4 threads running on our simulator.

5.9 Temperature threshold sensitivity. We set Tth as running 16 threads at the second fan level.
CHAPTER 1
INTRODUCTION AND BACKGROUND
The broad focus of this document is the performance optimization of computer systems with power consumption as a primary constraint. This chapter provides an overview of the problem that we target.
1.1 Power Wall
For half a century, Moore's Law [106] has driven technology scaling in the semiconductor industry: the number of components in integrated circuits has doubled every eighteen months. In theory, if the supply voltage of CMOS had scaled with lithographic dimensions, process scaling would have yielded faster and lower-energy gates.
The switching energy reduction would match the energy increase from having more gates that switch faster, so the power density (i.e., power per unit area) would stay constant. This analysis is summarized as classic Dennard Scaling [28]. However, in reality, the supply voltage has practically stopped declining, mainly for two reasons: 1) the gate switching delay does not decrease at the same rate as the geometric feature size, which means we cannot lower the voltage at the same rate as the feature size shrinks; and 2) lowering the supply voltage, combined with shrinking the feature size, makes the circuit more vulnerable to process variations (i.e., parameter deviations from the designed nominal values). Therefore, both the absolute power consumption and the power density of processors have kept increasing.
In embedded computing and other battery-powered devices, battery capacity advancement lags far behind the exponential scaling of the semiconductor industry. Such a lag makes battery lifetime an even more important design constraint: the battery lifetime limits the power consumption of embedded systems. In desktop environments, the cooling capacity determines the power dissipation of the system. In data center servers, the huge electricity bill required by the servers and the related cooling devices is one of the key concerns for data center service providers. Power and related issues have become the key limiter for computer system advancement across the entire spectrum, which is summarized as the Power Wall [83]. Therefore, our study takes power and related issues as the primary constraint in system performance optimization.
1.2 Chip Multi-processors
The ever-growing demand for higher computing throughput requires processors to increase their operating frequency and the number of working units. Keeping the operating frequency increasing requires maintaining a high supply voltage to ensure reliable transistor switching, which increases the power density. Since the cooling capacity of computer systems has already been limited by cost, processor providers universally switched their technology advancement from increasing the operating frequency to integrating more cores on one chip, because CMPs (i.e., chip multi-processors) offer the possibility to exploit inter-thread parallelism and allow increasing throughput without increasing the power density. Therefore, processor builders started to pack more and more cores on one processor chip, and computer systems entered the CMP era [34]. To push the CMP idea even further, some hardware (e.g., NVIDIA GPUs) integrates hundreds of simple cores on one chip to maximize throughput.
1.3 Power Capping
However, increasing the throughput of CMPs still requires consuming more power. The peak power consumption is still constrained by the cooling capacity or the power delivery limitation, or is specified by the users for different management purposes. A key concept related to the cooling limit is the thermal design power (TDP) [82]. TDP is a key parameter in packaging design: as long as the power dissipation of the entire chip is under the TDP, the packaging design can guarantee that the chip will not overheat in most cases. Therefore, we normally assume that the TDP is the power upper bound in terms of thermal issues. The power delivery limit is constrained primarily by the power pins on the processors [113]. Due to the fixed area of a processor, off-chip communication and power delivery compete for pin resources. Therefore, the limited number of power pins is expected to form an even tighter power budget constraint than the cooling capacity in the near future [113]. In addition to the cooling and power delivery limitations, the users might want to assign a power budget to a computer system at runtime to enable server oversubscription (i.e., safely deploying more servers within a fixed power/cooling infrastructure in a data center environment) or to enforce a power budget cut (i.e., assigning a very tight power budget to the system to address a temporary/partial cooling failure). In summary, the power budget of a computer system comes from the cooling limit, the power delivery limit, or user specification. Due to these limitations, it is important to discuss performance optimization under power constraints (i.e., power capping).

1.4 Contributions
This document presents several novel solutions to power/thermal-constrained performance optimization problems.
1. A scalable power control method that dynamically tunes the frequency allocation among the cores of a many-core processor to maintain a fixed power budget as well as to improve performance.
2. A fast power control technique that integrates per-core DVFS and power gating for improved performance and service lifetime.
3. A practical power management system that coordinates the CPU and GPU in a high-performance server to improve the energy efficiency of the entire system.
4. An intelligent online management system that adjusts the thermoelectric coolers, the cooling fan(s), and the DVFS level of each core of a CMP to keep the temperature under an assigned threshold as well as to conserve energy.
CHAPTER 2
SCALABLE MANY-CORE POWER CONTROL FOR
MULTI-THREADED APPLICATIONS
2.1 Introduction
Power dissipation has become a first-class constraint in current microprocessor design. As the gap between peak and average power widens with the rapidly increasing level of core integration, it is important to control the peak power of a many-core microprocessor to allow improved reliability and reduced costs in chip cooling and packaging. Therefore, compared with the extensively studied power minimization problem, an equally, if not more, important problem is to precisely control the peak power consumption of a many-core microprocessor to stay below a desired budget level while optimizing its performance.
Scalability is the first key challenge in controlling the power consumption of a many-core microprocessor. While various power control solutions have been proposed for multi-core microprocessors (e.g., [54, 84, 124]), the majority of current solutions rely on centralized decision making and thus cannot be applied directly to many-core systems. For example, the MaxBIPS policy [54] uses an exhaustive search to find a combination of DVFS (Dynamic Voltage and Frequency Scaling) levels for all the cores of a microprocessor; the chosen combination is predicted to deliver the best application performance while maintaining the power of the chip below the budget. While this solution works effectively for microprocessors with only a few cores, MaxBIPS does not scale well because the number of possible combinations increases exponentially with the number of cores. Therefore, highly scalable approaches need to be developed for many-core architectures.
The requirement to host multi-threaded applications is the second challenge for many-core power control. Although a few recent studies [127, 105, 88] present scalable control algorithms for many-core architectures based on per-core DVFS, they do not consider multi-threaded parallel applications and assume that the workload of every core is independent. As a result, these solutions may unnecessarily decrease the DVFS levels of the CPU cores running the critical threads in barrier-based multi-threaded applications. The lack of knowledge of thread criticality can exacerbate the load imbalance in multi-core microprocessors and thus lead to unnecessarily long application execution times and undesired barrier stalls. This issue is particularly important for many-core architectures whose primary workloads are expected to be multi-threaded applications. Furthermore, many-core systems are likely to simultaneously host a mixed group of single-threaded and multi-threaded applications, due to the increasing trend of server consolidation to fully utilize the core resources [80, 12]. Therefore, a power control algorithm must be able to handle such realistic workload combinations and utilize thread criticality to efficiently allocate power among the cores that are running different applications.
Another major challenge in multi-core or many-core power control is accurate power monitoring [99]. Although the power consumption of a microprocessor can be measured by sensing the current fed into the chip [125], direct power measurement of a single core on a multi-core or many-core die is not yet available. On-die current sensors have been proposed, but have rarely been used in production due to problems such as area and performance overhead and calibration drift introduced by process variations [18]. It is possible to estimate the core power at runtime by counting the component utilizations (e.g., cache accesses) and computing power based on a per-component power model. However, such direct computation of core and structure power at runtime is complex due to the large number of performance statistics required [125]. Since many-core systems are expected to have many simple cores [12], it may not be desirable to adopt an approach that requires much extra hardware and statistics collection. Recently, Kansal et al. [58] have shown that the CPU power consumption of each Virtual Machine (VM) on a server can be estimated by adaptively weighting only one metric (CPU utilization) of each VM. However, they did not explicitly consider the impact of DVFS on their model, despite the fact that the power consumption differs under different DVFS levels even for the same application. We propose to extend their work to estimate the power consumption of each core in a DVFS environment by taking both the DVFS level and utilization into consideration. As a result, many-core power control can be evaluated on a real hardware platform instead of just by simulations as in previous work [127, 105].
In this chapter, we propose a novel and highly scalable power control solution for many-core microprocessors that is specifically designed to handle realistic workload combinations. Our control solution features a three-layer design. First, we adopt control theory to precisely control the power of the whole chip to its chip-level budget, with theoretically guaranteed accuracy and stability, by adjusting the aggregated frequency quota of all the cores on the chip. In a DVFS-enabled system, the aggregated frequency is defined as the summation of the DVFS levels of all the cores normalized to the peak DVFS level of one core. Second, we dynamically group cores running the same applications and then partition the aggregated chip-level frequency quota derived from the chip-level power controller among the different groups for optimized overall microprocessor performance. Finally, we partition the group-level aggregated frequency quota among the cores in each group based on measured thread criticality for a shorter application completion time. As a result, our solution can optimize the processor performance while precisely limiting the chip-level power consumption below the desired budget. Specifically, this chapter makes the following major contributions:
• We propose a highly scalable power control solution for many-core architectures running multi-threaded applications. Our solution partitions the limited chip-level power budget among different applications and cores based on measured application performance and thread criticality.

• We adopt feedback control theory as a theoretical foundation to control the power consumption of a many-core chip to its desired power budget. This rigorous design methodology is in sharp contrast to heuristic-based solutions that rely on extensive manual tuning.

• Since the power consumption of a core cannot be directly measured in real multi-core microprocessors, we extend the technique of estimating the power consumption of a VM on a physical server to estimate the power consumption of a CPU core, and validate the estimation model on a hardware testbed.

• We implement our control solution on a 12-core AMD Opteron processor and present empirical results to demonstrate that our solution achieves better application performance within a given power budget than two state-of-the-art solutions. Our extensive simulation results with 32, 64, and 128 cores, as well as overhead analysis for up to 4,096 cores, demonstrate the scalability of our solution in many-core architectures.
The rest of this chapter is organized as follows. Section 2.3 discusses the system architecture of our control solution. Section 2.4 presents the chip-level power controller design. Section 2.5 describes dynamic aggregated frequency quota partitioning at the chip and group levels. Section 2.6 presents the per-core power estimation technique. Section 2.7 introduces our hardware testbed, simulation setups, and the implementation details of our solution. Section 2.8 presents our evaluation results. Section 2.2 discusses the related work, and Section 2.9 concludes this chapter.
2.2 Background
Power dissipation has been one of the major design concerns for computing systems. Much prior work has focused on minimizing the power consumption within a specified performance guarantee. For example, Li et al. [70] propose a solution called thrifty barrier that places the faster cores into a lower power mode at the barriers (i.e., join points) while waiting for the slower cores, so that power can be saved. Liu et al. [74] use per-core DVFS to slow down the faster cores, such that both the idle time due to waiting and the power consumption are reduced. Cai et al. [17] extend [74] by adding meeting points within the execution of the parallel loops and solve the same problem at a finer granularity. However, none of these solutions can provide explicit guarantees for the power consumption to stay below a desired budget, though the performance is guaranteed to some extent. Our work is different in that we focus on a different, but equally important, problem, i.e., power capping, to avoid power overload or thermal violations and to prevent over-provisioning of cooling, packaging, and power supply capacities at processor design time.

Some work has been performed to manage peak power or temperature for CMPs. Intel Foxton technology [82] has successfully controlled the power and temperature of a microprocessor using chip-wide DVFS. Isci et al. [54] propose a closed-loop algorithm called Priority and a prediction-based algorithm called MaxBIPS to limit the power of a CMP. Wang et al. [124] also apply advanced control theory to develop a power control algorithm for improved CMP performance. However, the application of these solutions to many-core systems is prohibited either by the exponential explosion of the number of possible global power management states in many-core architectures, e.g., [82, 54], or by the high control delay and computation overhead due to centralized decision making, e.g., [124]. As a result, none of them are scalable to the large number of cores in many-core architectures.
A recent study by Winter et al. [127] presents a global power management algorithm called Steepest Drop for many-core systems with a light overhead. Sartori et al. [105] discuss using a hierarchical structure to cap the power of many-core systems. Another related piece of work by Mishra et al. [87, 88] uses absolute BIPS to allocate the chip power budget to each power island and performs per-island power control. However, these solutions assume the independence of workloads among all the cores; they may therefore overlook the coupling of workloads among the cores and result in degraded system performance. In contrast, our highly scalable solution can dynamically shift the power budget among the groups of cores that host different applications based on power efficiency, and then further among all the cores in the same group that host the coupled threads from the same application based on thread criticality.
2.3 System Architecture
In this section, we present a high-level description of our three-layer power control solution. As shown in Figure 2.1, in the first layer, the chip-level power controller controls the power consumption of the whole chip to the chip power budget by adjusting the aggregated frequency quota (i.e., the summation of normalized DVFS levels) of all the cores. The second layer, i.e., the chip-level frequency quota partitioning layer, partitions the chip-level aggregated frequency quota among the groups of cores, which host different applications, proportionally to a metric called power efficiency (defined in Section 2.5.1). The third layer, i.e., the group-level frequency partitioning layer, further partitions the group aggregated frequency quota among all the cores in each group, which host coupled threads of the same application, based on thread criticality (defined in Section 2.5.2). The aggregated frequency quota is first partitioned among different applications (i.e., groups of cores) and then among coupled threads (i.e., individual cores) to achieve optimized performance. As a result, if the aggregated frequency quota of every core is enforced, the power of the entire chip can be controlled to stay within the desired power budget. In this chapter, we adopt DVFS to enforce the frequency quota of each core, but our solution can also work with other frequency scaling techniques such as clock modulation. We assume that the frequency of each core can be adjusted individually in future many-core systems based on various industry practices and research studies [127, 105]. For example, IBM and AMD have implemented per-core DVFS on commercial massive multi-core microprocessors (POWER7 8-core and Opteron 12-core systems). Moreover, Intel has implemented per-tile DVFS on its 24-tile many-core experimental chips [48]. In addition, a 167-core computational platform with per-core DVFS support has been implemented recently [119]. Even in systems without physically implemented per-core DVFS (e.g., multi-power-island chips), Rangan et al. [102] have shown that thread migration on systems with only two power states can be used to approximate the functionality of continuous per-core DVFS.
As shown in Figure 2.1, the key components in the chip-level power control layer include a power controller and a power monitor. The following steps are invoked at the end of every control period: 1) the power monitor (e.g., an on-board power measurement circuit [125]) measures the power consumption of the chip in the last control period and sends the value to the power controller; and 2) the power controller

[Figure 2.1: Three-layer power control architecture for a 16-core chip multiprocessor. Cores running the same multi-threaded applications are grouped together. Idle cores (e.g., C9) are transitioned into a low power mode.]
computes the new aggregated frequency quota for all the cores of the chip based on the desired power budget and measured power consumption. The aggregated frequency quota is then partitioned to optimize the system performance in the partitioning layers. The key components in the chip-level frequency quota partitioning layer include a single chip-level partitioner and an IPS (Instructions Per Second) counter on each core. In order to effectively partition the power budget, we need to be able to calculate the power efficiency of each core. We adopt IPS/Watt as our power efficiency metric, which has been used by Intel for this purpose [42]. The chip-level frequency quota is partitioned among multiple groups of cores periodically. At the end of each control period, the partitioner collects the grouping information of all the cores based on the
OS scheduler (details are described in Section 2.5.1). Each group of cores hosts all the threads of the same application. If a group consists of only one core, we refer to it as a single-threaded group; otherwise, we refer to it as a multi-threaded group. If a core is idle, we transition it to a low-power mode. For example, Cores 1, 2, 5, and 6 are grouped together since they run four threads of a parallel application based on the scheduling information from the OS. Core 9 is transitioned into the low-power mode since it is idle. The chip-level partitioner computes the power efficiency based on the
IPS and the estimated power of each core, and then calculates the overall power efficiency of each group by summing up the efficiencies of the cores in the group. The chip-level partitioner then partitions the aggregated frequency quota of the entire chip among the groups proportionally to the overall power efficiency of each group. Note that since the power control period at the chip level can be configured to be shorter than the OS scheduling period, we assume the mapping between the threads and cores does not change within each control period. The same assumption has been made in [127].
The group-level frequency quota partitioning layer includes a group-level partitioner in each group, and a criticality counter and a DVFS modulator in each core. At the end of each control period, the criticality counter on each core monitors the criticality metric (defined in Section 2.5.2) and forwards it to the partitioner. The partitioner receives the allocated group frequency quota from the chip-level partitioner and partitions the frequency quota among all the cores in the group based on the thread criticality of each core. Then, the DVFS modulator of each core changes the DVFS level of the core accordingly.
Because the computation of the controller may change the overall aggregated frequency quota, and the recalculation of the chip-level partitioner may change the group aggregated frequency quota, the three layers run sequentially at the end of every control period. Figure 2.1 shows a possible implementation in which the three layers are integrated as firmware on the service processor, similar to IBM POWER7's power control module [125]. We also discuss other implementation possibilities in Section 2.7.3.

2.4 Chip-level Power Control
In this section, we introduce the chip-level power controller that controls the power consumption of the entire chip to the desired power budget by adjusting the aggregated frequency quota (i.e., the summation of normalized DVFS levels). A key advantage of the control-theoretic design approach is that it can tolerate a certain degree of modeling errors and adapt to online model variations based on dynamic feedback [36].
Therefore, our solution does not rely on power models that are perfectly accurate, which is in sharp contrast to open-loop solutions that would fail without an accurate model.
We first introduce some notation. Tc is the control period. M is the number of cores on the chip. cp(k) is the power consumption of the entire chip in the kth control period. f(k) is the total aggregated frequency of all the cores on the chip in the kth control period. The dynamic range of f(k) is L × M ≤ f(k) ≤ M, relative to the peak of one core, where L is the lowest available DVFS level normalized to the peak level. We assume that our target system is a homogeneous-core system, which is the dominant configuration of current multi-core and many-core systems [48, 120, 5]. However, extending to heterogeneous-core systems is straightforward by scaling f(k). For example, if we have a more powerful core in the system along with the normal cores, instead of taking the dynamic range of the more powerful core as L to 1 like a normal core, we count it as L to H. Both L and H are derived by scaling the available DVFS levels of the powerful core to the peak DVFS level of a normal core. Δf(k) = f(k+1) − f(k). Pt is the power budget of the whole chip, which can be determined by the thermal and power supply constraints of the processor or specified by the user at runtime. e(k) is the control error; specifically, e(k) = Pt − cp(k). The control goal is to drive cp(k) to converge to Pt within a certain number of control periods by adjusting f(k).

System Modeling. We now model the dynamics of the controlled system, namely the relationship between the controlled variable, i.e., cp(k), and the manipulated variable, i.e., f(k). Existing studies by both Raghavendra et al. [101] and Wang et al. [123] have shown that processor power can be modeled as an approximately linear function of the DVFS level within the limited DVFS adaptation range available in real multi-core processors. In this chapter, the power consumption of a processor is modeled similarly as:

   cp(k) = a Δf(k−1) + cp(k−1),   (2.4.1)

where a is a generalized parameter that may vary for different chips and applications. a is also the scaling factor that characterizes the impact of a DVFS change on chip power. In our design, we derive a by dividing the data-sheet full power range (from the idle power to the maximum power of the chip [3]) by the dynamic range of f(k). We conducted a stability analysis [36] of our controlled system. The results show that the stability range is from 0 to 2a (i.e., twice the value used at design time). Since we used the maximum possible a at design time, the variation of a can never exceed this range. The control loop is theoretically guaranteed to converge to the set point for all possible workloads.
Controller Design. Proportional-Integral (PI) control can provide robust control performance despite considerable modeling errors. Based on the system model (2.4.1), we design a PI controller as follows:

   f(k) = f(k−1) + K1 e(k) − K1 K2 e(k−1).   (2.4.2)
Following the standard pole placement method [36], we can choose our control parameters as K1 = 1/a and K2 = 0, such that the controlled system is stable and has a zero steady-state error. The detailed steps can be found in a standard control textbook and are skipped due to space limitations. The desired aggregated frequency quota of all the cores on the chip in the kth control period can be computed accordingly as:

   f(k) = f(k−1) + (Pt − cp(k−1)) / a.   (2.4.3)
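To make the control law concrete, the sketch below shows one control-period update implementing Equation (2.4.3). This is a minimal illustration in Python, not the firmware implementation described later; the core count, DVFS floor, and data-sheet power numbers are assumptions chosen only for the example.

```python
# Minimal sketch of the chip-level power controller in Equation (2.4.3).
# All numeric values are illustrative assumptions.

M = 16          # number of cores on the chip
L = 0.4         # lowest DVFS level, normalized to the peak level of one core
P_IDLE = 40.0   # assumed data-sheet idle power of the chip (W)
P_MAX = 120.0   # assumed data-sheet maximum power of the chip (W)

# Derive a from the data-sheet full power range divided by the dynamic
# range of the aggregated frequency f(k), i.e., L*M <= f(k) <= M.
a = (P_MAX - P_IDLE) / (M - L * M)

def control_step(f_prev: float, cp_prev: float, p_budget: float) -> float:
    """One PI control period: f(k) = f(k-1) + (Pt - cp(k-1)) / a."""
    e = p_budget - cp_prev            # control error e(k)
    f_new = f_prev + e / a            # K1 = 1/a, K2 = 0 (pole placement)
    # Clamp to the feasible range of the aggregated frequency quota.
    return max(L * M, min(M, f_new))

# Example: measuring 110 W against a 100 W budget lowers the quota.
f = control_step(f_prev=14.0, cp_prev=110.0, p_budget=100.0)  # ~12.8
```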
2.5 Dynamic Aggregated Frequency Partitioning
We first introduce the details of the aggregated frequency (i.e., summation of normalized DVFS levels) partitioning schemes at the chip and group levels.
2.5.1 Chip-level Partitioning
A many-core microprocessor may host multiple applications simultaneously. For example, a virtualized many-core system may host multiple VMs, and each VM may host a different application. If the power budget is limited, different power allocations among the applications may lead to different system performance. Achieving high overall performance is one of the most fundamental goals of many-core systems [12]. The goal of chip-level partitioning is to dynamically partition the chip-level aggregated frequency quota computed by the chip power controller (Equation (2.4.3)) among different applications, such that we can achieve optimized system performance. In this chapter, we use Fair Speedup (FS) as the performance indicator. The FS of a partitioning scheme is defined as the harmonic mean of per-application speedup with respect to the equal resource share case (i.e., peak frequency for all applications) [24, 10]. The FS achieved by a scheme can be expressed as

   FS(scheme) = Na / Σ_{i=1}^{Na} ( ETappi(scheme) / ETappi(base) ),

where ETappi(scheme) is the execution time of the ith application under a certain power management scheme, and ETappi(base) is the execution time of running the ith application at the peak frequency level all the time. Na is the number of applications in the system, i.e., the set of applications that execute together. FS is an indicator of the overall improvement in execution efficiency gained across the applications. It is also a metric of fairness. In the following sections, we first introduce how to group the cores that run the threads of the same application based on the scheduling information in the OS. We then present the aggregated frequency quota partitioning among the groups.
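As a quick illustration of the metric, the short sketch below (in Python, with made-up execution times) computes FS exactly as defined above:

```python
# Fair Speedup (FS): the harmonic mean of per-application speedups
# relative to the peak-frequency baseline. Times below are made up.

def fair_speedup(et_scheme, et_base):
    """FS(scheme) = Na / sum_i(ETappi(scheme) / ETappi(base))."""
    return len(et_scheme) / sum(s / b for s, b in zip(et_scheme, et_base))

# Two applications slowed to 1.25x and 2.0x of their baseline times:
print(fair_speedup([125.0, 200.0], [100.0, 100.0]))  # ~0.615
```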
Core Grouping
In many-core microprocessors, different threads run simultaneously on different cores.
We place the cores that host the threads of the same application into a group. Therefore, the number of groups is equal to the number of applications running on all the cores. The benefit of core grouping is to reduce the coupling of the power demand among different applications.
In this project, we assume that the mapping between the threads and cores does not change within a certain period (i.e., the scheduling period). Since the scheduling interval in operating systems is in tens of milliseconds, this assumption is valid as long as we conduct the power control with a shorter period. The same assumption is made in [127, 6, 10, 17, 84]. At the end of each scheduling period, the chip-level frequency partitioner may collect the grouping information from the OS. If we implement the algorithm as a loadable kernel module of the OS, the grouping information can be derived from a system function. If we implement the controller as a piece of hardware on the chip, this information exchange between hardware and software can be achieved by adding special-purpose registers on the chip. If the proposed solution is implemented as firmware running on the service processor on the motherboard, the information exchange between the main processor and the service processor can be achieved via external ports [125].

Aggregated Frequency Partitioning
Before we discuss the chip-level power partitioning, we first introduce some notation.
A many-core microprocessor has N groups of cores, and group i runs application i, where 1 ≤ i ≤ N. IPSi is the average IPS of group i when running the ith application on the many-core microprocessor without any power constraint. IPSi can be derived by conducting application profiling on the desired number of cores at the peak DVFS level and then calculating the average IPS of each core. Note that the profiling is only performed once for each application on the desired number of cores. The OS can send IPSi to the controller via on-chip registers. ipsi(k) is the measured IPS of the ith group. WTi(k) is the estimated power of the ith group. Since each group may consist of multiple cores, ipsi(k) and WTi(k) are the accumulated IPS and power of all the cores in the ith group.
To achieve optimized overall performance, the aggregated frequency quota partitioned among different groups should be proportional to the ratio between the performance and the power consumption (i.e., ipsi(k)/WTi(k)). However, this may lead to the following problem. Some applications intrinsically have a low IPS even without any power constraint. Partitioning power based on IPS is unfair to those applications if they run simultaneously with other applications that have intrinsically high IPSs. To address this problem, we use the relative IPS, ripsi(k), as the performance metric in this chapter, which is the measured IPS ipsi(k) normalized to IPSi. Specifically, ripsi(k) = ipsi(k)/IPSi, similar to the fairness definition used in [61]. We define the power efficiency of the ith group, ei(k), as the ratio between ripsi(k) and WTi(k). Specifically, ei(k) = ripsi(k)/WTi(k).
Table 2.1: Workload mixes used in testbed and simulation experiments.

1. Physical testbed workload mixes

   Mix    PARSEC 2.1 / SPEC2006 applications                          Aggregate effect
   mix1   12-perlbench                                                all separate applications
   mix2   12-streamcluster                                            high-barrier parallel workload
   mix3   8-swaptions, 4-omnetpp                                      no-barrier parallel workload
   mix4   4-x264, 8-fluidanimate                                      no-barrier, high-lock workload
   mix5   4-(blackscholes, bodytrack), 2-(xalancbmk, povray)          low-barrier and high-barrier mix
   mix6   4-(vips, facesim), 1-(libquantum, astar, soplex, dealII)    random mix

2. Simulation workload mixes

   Mix    SPLASH-2 / SPEC2006 applications                            Aggregate effect
   mix1   water (nsquared)                                            all parallel application
   mix2   dealII                                                      all separate applications
   mix3   FFT, Ocean non, LU con, LU non (each on 1/4 of the cores)   random mix

In this chapter, we partition the chip-level aggregated frequency quota among the groups proportionally to the power efficiency of each group to achieve optimized performance:

   fgi(k) = ( ei(k−1) / Σ_{j=1}^{N} ej(k−1) ) f(k),   (2.5.1)

where f(k) is the aggregated frequency quota of the entire chip and fgi(k) is the aggregated frequency allocation for the ith group in the kth control period. In systems that need to support application priority, we can assign different weights to the co-scheduled applications when we calculate fgi(k).
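A minimal sketch of the chip-level step in Equation (2.5.1), assuming the per-group relative IPS and estimated power are already available from the monitors (function and variable names here are ours, for illustration only):

```python
# Chip-level partitioning (Equation (2.5.1)): split the chip-wide
# aggregated frequency quota f(k) among groups in proportion to their
# power efficiency e_i(k-1) = rips_i(k-1) / WT_i(k-1).

def partition_chip_quota(f_total, rips, wt):
    """rips[i]: relative IPS of group i; wt[i]: estimated group power (W)."""
    eff = [r / w for r, w in zip(rips, wt)]    # power efficiency per group
    total = sum(eff)
    return [f_total * e / total for e in eff]  # fg_i(k) for each group

# Three groups: the most power-efficient group receives the largest quota.
fg = partition_chip_quota(f_total=12.0,
                          rips=[0.9, 0.6, 0.8],
                          wt=[30.0, 25.0, 40.0])
```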
2.5.2 Group-level Partitioning
The goal of group-level aggregated frequency quota partitioning is to further partition the group frequency quota among all the cores running the threads of the same application, such that each thread makes balanced progress toward the common barriers. For single-threaded groups, the core quota is the same as the group quota. For multi-threaded groups, the problem of achieving optimized performance in a group is translated into discerning which running threads are more critical (i.e., slower) and then allocating more aggregated frequency to the critical threads to expedite the progress of the entire application.
In this chapter, we adopt a thread criticality prediction approach proposed by Bhattacharjee and Martonosi [6], which considers both L1 and L2 cache misses. Compared with other approaches [17, 70, 74], the advantage of this predictor is that it can handle both barrier and non-barrier parallel workloads. The criticality of core j in the ith group in the kth period is

   crij(k) = N(L1miss) + ( L1L2penalty × N(L1L2miss) ) / L1penalty,   (2.5.2)

where N(L1miss) is the number of L1 misses that hit in the L2 cache, N(L1L2miss) is the number of L1 misses that also miss in the L2 cache, and L1L2penalty and L1penalty are the L2 and L1 cache miss penalties, respectively. The cache miss penalties are measured in CPU cycles. Within a parallel working group, a higher criticality value implies a more poorly cached, slower thread [6], which means that additional power needs to be shifted to that thread from the non-critical threads (with smaller criticality values) to reduce the runtime imbalance. In our design, we proportionally sub-partition the frequency quota of a multi-threaded group to its cores based on criticality as follows:

   fcij(k) = ( crij(k−1) / Σ_{m=1}^{Mi} crim(k−1) ) fgi(k),   (2.5.3)

where fcij(k) is the target frequency of core j in group i in the kth control period, and Mi is the number of cores in group i.
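Putting Equations (2.5.2) and (2.5.3) together, the sketch below shows the group-level step. The miss counts and penalty values are illustrative assumptions; in the real design, the inputs come from the per-core criticality counters:

```python
# Group-level partitioning: compute per-core thread criticality from
# cache-miss counts (Equation (2.5.2)) and split the group quota fg_i(k)
# proportionally to criticality (Equation (2.5.3)).

L1_PENALTY = 10      # assumed L1 miss penalty (cycles)
L1L2_PENALTY = 100   # assumed combined L1+L2 miss penalty (cycles)

def criticality(n_l1_miss, n_l1l2_miss):
    """cr = N(L1miss) + L1L2penalty * N(L1L2miss) / L1penalty."""
    return n_l1_miss + L1L2_PENALTY * n_l1l2_miss / L1_PENALTY

def partition_group_quota(fg, miss_stats):
    """miss_stats: per-core (N(L1miss), N(L1L2miss)) tuples for one group."""
    cr = [criticality(l1, l1l2) for l1, l1l2 in miss_stats]
    total = sum(cr)
    return [fg * c / total for c in cr]  # fc_ij(k): more quota to slower threads

# The poorly cached core (more L2 misses) is deemed critical and sped up.
fc = partition_group_quota(fg=4.0, miss_stats=[(1000, 50), (1200, 400)])
```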
2.6 Core-level Power Estimation on Physical Testbed
In our power management solution, the chip-level partitioning is conducted accord- ing to the relative power efficiency of each group. Therefore, we need a reasonable estimation of the power consumption of each core. Besides, one of our baselines,
Steepest Drop [127], also assumes the knowledge of the power consumption of each core, even though real microprocessors available in today’s market cannot yet pro- vide such information. In this section, we introduce our per-core power estimation method.
Although the power consumption of each individual core cannot be directly mea- sured in today’s microprocessors, previous work by Kansal et al. [58] has shown the CPU power consumption of each VM on a server can be estimated by adap- tively weighting the CPU utilization of the VM. However, they did not explicitly consider the impact of DVFS in their model despite the fact that power consump- tion scales with different DVFS levels. We extend their work to estimate the power consumption of each core under DVFS environment by taking both DVFS level and utilization into consideration. The utilization metric represents the high-level work- load characteristics, while the DVFS level represents the hardware working condition of the core. Power consumption is the interactive result of both the hardware and software parts. We adopt the commonly used multiplication operation to model the interaction among different parts [91]. Therefore, the total power consumption of the chip is modeled as: