
CHAPTER TWO

Techniques to Measure, Model, and Manage Power

Bhavishya Goel, Sally A. McKee, and Magnus Själander
Department of Computer Science and Engineering, Chalmers University of Technology, 412 96 Gothenburg, Sweden

Contents
1. Introduction
2. Problem Statement
3. Empirical Power Measurement
3.1 Measurement Techniques
3.1.1 At the Wall Outlet
3.1.2 At the ATX Power Rails
3.1.3 At the Processor Voltage Regulator
3.2 Experimental Results
3.3 Further Reading
4. Power Estimation
4.1 Power Modeling Techniques
4.1.1 Performance Monitoring Counters
4.1.2 PMC Access
4.1.3 Counter Selection
4.1.4 Model Formation
4.2 Secondary Aspects of Power Modeling
4.2.1 Temperature Effects
4.2.2 Effects of Dynamic Voltage and Frequency Scaling
4.2.3 Effects of Simultaneous Multithreading
4.3 Validation
5. Power-Aware Resource Management
5.1 Sample Policies
5.2 Experimental Setup
5.3 Results
5.4 Further Reading
6. Discussion
References

Advances in Computers, Volume 87 © 2012 Elsevier Inc. ISSN 0065-2458, http://dx.doi.org/10.1016/B978-0-12-396528-8.00002-X. All rights reserved.

Abstract
Society’s increasing dependence on computing has resulted in the deployment of vast compute resources. The energy costs of operating these resources, coupled with environmental concerns, have made energy-aware computing one of the primary challenges for the IT sector. Making energy-efficient computing the rule rather than the exception requires that researchers and system designers use the right set of techniques and tools. These involve measuring, analyzing, and controlling the energy expenditure of computers at varying degrees of granularity. In this chapter, we present techniques to measure the power consumption of computer systems at various levels and compare their effectiveness. We discuss methodologies for estimating processor power consumption using performance-counter-based power modeling and show how the power models can be used for power-aware scheduling. Armed with such techniques and methodologies, we as a research and development community can better address challenges in power-aware management.

1. INTRODUCTION
Green Computing has become much more than a buzz phrase. The greening of the Information and Communication Technology (ICT) sector has grown into a significant movement among manufacturers and service providers. Even end users are rising to the challenge of creating a green society and sustainable environment in which our development and use of information technology can still flourish. Environmental legislation and rising operational and waste disposal costs obviously lend force to this movement, but so do public perceptions and corporate images. For instance, environmental concerns have a growing impact on the ICT industry’s products and services, and they increasingly influence the choices that ICT organizations make (environmental criteria are now among the top buying criteria for ICT-related goods and services). Most ICT providers now prioritize choices that reduce long-term, negative environmental impact instead of just reducing operational costs. Over a computing system’s lifetime, the array of costs includes design, verification, manufacturing, deployment, operation, maintenance, retirement, and disposal. All of these themselves include an ICT component. Green ICT thus spans:
• environmental risk mitigation;
• green metrics, assessment tools, and methodologies;
• energy-efficient computing;
• data center design and location;
• environmentally responsible disposal and recycling; and
• legislative compliance.

Murugesan notes that each personal computer in use in 2008 was responsible for generating about a ton of carbon dioxide per year [33]. In 2007–2008, multiple independent studies calculated the global ICT footprint to be 2% [50] of the total emissions from all human activity. While the growing ICT sector’s global emissions will continue to rise (by a projected 6% per annum through the year 2020 [50]), increases in products and services and advances in technology will potentially bring about greater reductions in other sectors. The implications of Green Computing thus reach far beyond the ICT sector itself. One factor in this growing carbon footprint is the steadily increasing amount of total electrical energy expended by ICT. As computer system architects, we see reducing operational power consumption as the obvious first step that system designers can take toward addressing the larger problem of the total emissions footprint. Although power efficiency is but one aspect of this multifaceted environmental problem, the design of more power-efficient systems will help inform solutions that impact other aspects. The most robust solutions are likely to come from hardware/software codesign, creating hardware that provides more real-time power consumption information to software that can leverage that information to save power throughout the system. Until such combined solutions exist, though, we still need to reduce the power consumption of existing platforms. This chapter discusses an approach to achieving this reduction for current systems. Power-aware resource management requires introspection into the dynamic behavior of the system. In Section 2, we first discuss some of the challenges to obtaining this information. Our solution is to use performance monitoring counters (PMCs). Such counters are nearly ubiquitous in current platforms, and they provide the best available introspection into computational and system activity.
We use PMC values to build per-core power consumption models that can then be used to generate power estimates to drive resource management decisions. For such models to be useful, we must verify their accuracy, which requires a means to measure dynamic power consumption. In Section 3, we thus describe a set of power-measurement techniques and discuss their pros and cons with respect to their use in better resource management. In Section 4, we set the context by surveying previous power modeling work before explaining our methodology in detail. In Section 5, we present a case study of power management techniques that leverage this methodology.

2. PROBLEM STATEMENT
Power consumption has joined performance as a first-class metric dictating system design and performance specifications [32]. Efficient use of available system resources requires balancing power consumption and performance requirements. To make power-aware decisions, system resource managers require real-time information about power consumption and temperature, preferably at the granularity of individual resources. In a chip multiprocessor (CMP), power consumption for different cores may vary widely, depending on the properties of the code they execute. Armed with information about power usage, task schedulers, hypervisors, and operating systems can make better decisions about how to execute a given workload efficiently. Unfortunately, most available hardware lacks the on-die infrastructure for sensing current consumption, largely due to the hardware costs and the intrusive nature of the sensing techniques. Even when such sensing capabilities exist, the information they provide is rarely made available to software. For example, the Intel® Core™ i7 [14] processor employs on-chip power monitoring hardware to enable its Turbo Boost technology. But this interface is available to and used only by the hardware for selectively and temporarily increasing chip performance. External power meters can be used to measure total system power. Digital multimeters can be used to further isolate CPU power from system power, but their use requires access to the power rails coming out of the power supply unit (PSU). Intel’s Node Manager [19] can be used in combination with certain Intel® Xeon® processors to measure power and to control power dissipation. This technique can report system-wide as well as processor and memory system power consumption. However, the above techniques lack the functionality to provide power consumption at the granularity of individual devices, such as cores, integer units, floating-point units, or caches.
Intel’s Sandy Bridge microarchitecture can measure power at the core level [38]. Their power-measurement techniques are based on the same methodology as presented in this chapter: microarchitectural events are multiplied by energy weights and then summed to form the power of a core or of the complete CPU. This technique does not provide insights into power dissipation, as it is proprietary, and only the end result can be read from software. Furthermore, it is limited to a specific processor model. System simulators [10] are used at design time to obtain detailed and decomposable information about component power consumption. Most of the architectural power models used in such simulators are prone to error [22], and thus obtaining accurate power models for off-the-shelf commercial processors can be difficult, even impossible. Furthermore, simulators suffer very long running times. Finally, these tools must be used offline, and they provide little useful information for online power estimation of arbitrary applications. A viable alternative is to create power models that can be computed in real time and whose results can be made available to the appropriate software layers. While it is true that most platforms lack infrastructure specifically designed to measure power, almost all modern processors include an array of PMCs that can track dynamic activity within a core. Models based on observable core activity should be more accurate than simplistic approaches that assume all instructions require the same amount of power. For instance, Fig. 1 shows that when we connect an external power meter to an Intel® Core™ i7 running applications from three benchmark suites (comprising a mix of integer, floating-point, single-threaded, and multithreaded applications) in series, we see large variations in measured power consumption.
These results suggest that these applications exercise different portions of the microarchitecture (even within a single execution) and that the various microarchitectural components draw different amounts of power. Figure 2 reinforces this conclusion by showing the variations in power consumption when we use microbenchmarks to exercise some of these microarchitectural components by running different mixes of instructions. Here, MOV refers to instructions that move data between registers. SIMD refers to instructions that perform single-instruction multiple-data (SIMD) operations, including data transfer operations and packed arithmetic operations. BRANCH refers to conditional and unconditional branch instructions. INT refers to integer arithmetic instructions. L2 refers to microbenchmarks with heavy L2 accesses but no off-chip memory accesses.

Fig. 1. Intel® Core™ i7 system power consumption for NAS, SPEC2006, and SPEC-OMP suites.

Fig. 2. Variations in Intel® Core™ i7 processor power consumption when different processor components are exercised.

MEM refers to microbenchmarks with heavy off-chip memory accesses. x87 refers to instructions that are executed by the processor’s x87 floating-point unit, including data transfer operations and floating-point arithmetic operations. These data demonstrate the variation in processor power consumption across applications and across different phases of the same application, and hence motivate the power estimation approach we describe below.

3. EMPIRICAL POWER MEASUREMENT
Designing intelligent power-aware resource managers requires an infrastructure that can accurately measure and log system power consumption (and preferably that of individual resources). Resource managers can use this information to identify power consumption problems in both hardware (e.g., hotspots) and software (e.g., power-hungry tasks) and then to address those problems (e.g., by scheduling tasks to even out power or temperature across the chip) [5, 21, 49]. A measurement infrastructure is also needed for power benchmarking [44, 48] and power modeling [5, 7, 17, 21]. Unfortunately, support from system and chip manufacturers for communicating accurate power information to the system software remains weak. Most available hardware lacks infrastructure for sensing current, and even when such infrastructure is present, the information is not readily available to software. In this section, we compare three approaches to measuring power consumption on an Intel® Core™ i7 machine. The demonstrated techniques can be applied to other systems with or without adaptation.

3.1 Measurement Techniques
Power can be measured at various points in a system; we sample power consumption at the three points shown in Fig. 3:
1. The first and least intrusive method for measuring the power of an entire system is to use a power meter like the Watts up? Pro [18] plugged directly into the wall outlet;
2. The second method uses custom sense hardware to measure the current on individual ATX power rails; and
3. The third and most intrusive method measures the CPU voltage and CPU current directly at the CPU voltage regulator.
In the rest of this section, we describe the methodology of all three approaches and discuss their advantages and disadvantages in terms of accuracy, sensitivity, measurement granularity, and ease of setting up the infrastructure.

3.1.1 At the Wall Outlet
The first method uses an off-the-shelf (Watts up? Pro) power meter that sits between the machine under test and the power outlet. Measurements from the meter are logged on a separate machine through a USB interface, as shown in Fig. 3. To prevent data-logging activity from disturbing the system under test, we use a separate machine with all three infrastructures. Although easy to deploy and unintrusive, this meter delivers only a single system measurement, making it difficult to separate the power consumption of different system components. Moreover, the measured power values are inflated compared to actual power consumption due to inefficiencies in the system PSU and on-board voltage regulators. The acuity of the measurements is also limited by the (low) sampling frequency of the power meter (one sample per second for the Watts up? Pro). The accuracy of the system power readings depends on the accuracy specifications provided by the manufacturer (±1.5% in our case). The overall accuracy of measurements at the wall outlet is affected by the mechanism converting alternating current (AC) to direct current (DC) in the PSU. For instance, when we discuss measurement results below, we will examine the accuracy effects of the large electrolytic smoothing capacitor used in the PSU.

Fig. 3. Power measurement setup.

This approach is suitable for studies of total system power consumption rather than of individual components like the CPU, memory, graphics cards, etc. [52]. It is also useful in power modeling research, where the absolute value of the CPU and/or memory power consumption is less essential than the trends [21]. This approach is ill-suited for isolating the power for the CPU, main memory, or other system components.
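Because the meter delivers one system-power sample per second over USB, the logging side reduces to a simple timestamped recorder on the separate machine. The sketch below illustrates the idea; read_power_watts is a hypothetical stand-in for the meter’s actual USB/serial protocol, which varies by device and is not described here.

```python
# Sketch of a 1 Hz wall-meter logger. The meter read is a hypothetical
# placeholder; only the timestamped CSV-logging structure is the point.
import csv
import time

def read_power_watts():
    """Hypothetical meter read; replace with the device's real protocol."""
    return 95.2  # placeholder value in watts

def log_samples(path, n_samples, interval_s=1.0, sleep=time.sleep):
    """Record n_samples readings, one per interval, to a CSV file."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["timestamp_s", "power_w"])
        for i in range(n_samples):
            writer.writerow([round(i * interval_s, 3), read_power_watts()])
            if i != n_samples - 1:
                sleep(interval_s)

# Injectable `sleep` lets the logger run instantly in tests.
log_samples("wall_power.csv", n_samples=3, sleep=lambda s: None)
```
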

3.1.2 At the ATX Power Rails
The second methodology measures current on the supply rails of the ATX (Advanced Technology eXtended) motherboard’s power supply connectors. As per the ATX power supply design specifications [40], the PSU delivers power to the motherboard through two connectors: a 24-pin connector that delivers +3.3V, +5V, and +12V, and an 8-pin connector that delivers +12V used exclusively by the CPU. Table I shows the pinout of these connectors. Depending on the system under test, the pins belonging to the same power region may be connected together on the motherboard. In our case, all +3.3 VDC pins are connected together, as are all +5 VDC pins and all +12V3 pins. Apart from that, the +12V1 and +12V2 pins are connected together to supply current to the CPU. Hence, to measure the total power consumption of the motherboard, we can treat these connections as four logically distinct power rails (+3.3V, +5V, +12V3, and +12V1/2) on which to measure current.

Table I ATX connector pinout.
(a) 24-pin ATX connector pinout
Pin  Signal    Pin  Signal
1    +3.3 VDC  13   +3.3 VDC
2    +3.3 VDC  14   −12 VDC
3    COM       15   COM
4    +5 VDC    16   PS_ON
5    COM       17   COM
6    +5 VDC    18   COM
7    COM       19   COM
8    PWR OK    20   Reserved
9    +5 VSB    21   +5 VDC
10   +12 V3    22   +5 VDC
11   +12 V3    23   +5 VDC
12   +3.3 VDC  24   COM
(b) 8-pin ATX connector pinout
Pin  Signal  Pin  Signal
1    COM     5    +12 V1
2    COM     6    +12 V1
3    COM     7    +12 V2
4    COM     8    +12 V2

For our experiments, we developed custom measurement hardware using current transducers from LEM [15]. These transducers use the Hall effect to generate an output voltage in accordance with the changing current flow. The top-level schematic of the hardware is shown in Fig. 4, and Fig. 5 shows the manufactured board. Note that when designing such a printed circuit board (PCB), care must be taken to ensure that the current capacity of the PCB traces carrying the combined current for the ATX power rails is sufficiently high and that the on-board resistance is as low as possible. We used a PCB with 105 μm copper instead of the more widely used thickness of 35 μm. Traces carrying high current are at least 1 cm wide and are backed by thick-stranded wire connections where required.

Fig. 4. Measurement setup on the ATX power rails.

The current transducers need a +5V supply voltage, which is provided by the +5 VSB (standby) rail from the ATX connector. Using +5 VSB for the transducer’s supply serves two purposes. First, because the +5 VSB voltage is available even when the machine is powered off, we can measure the base output voltage from the current transducers for calibration purposes. Second, because the current consumed by the transducers themselves (∼28 mA) is drawn from +5 VSB, it does not interfere with our power measurements.

Fig. 5. Our custom measurement board.

We sample and log the analog voltage output from the current transducers using a data acquisition (DAQ) unit from National Instruments (NI USB-6210 [34]). As per the LEM datasheet, the base voltage of the current transducer is 2.5V. Our experiments indicate that the current transducer produces an output voltage of 2.494V when zero current is passed through its primary turns. The sensitivity of the current transducer is 25 mV/A; hence the current can be calculated as in Eqn (1):

I_out = (V_out − BASE_VOLTAGE) / 0.025.  (1)

We verified our current measurements by comparing against the output from a digital multimeter. The power consumption can then be calculated by simply multiplying the current with the respective voltage. Apart from the ATX power rails, the PSU also provides separate power connections to the hard drive, CD-ROM, and cabinet fan. To calculate the total PSU load without adding extra hardware, we disconnect the I/O devices and fan, and we boot our system from a USB memory powered by the motherboard.

The total power consumption of the motherboard can then be calculated as in Eqn (2):

P = I_3.3V × V_3.3V + I_12V3 × V_12V3 + I_5V × V_5V + I_12V1/2 × V_12V1/2.  (2)

The theoretical current sensitivity of this measurement infrastructure can be calculated by dividing the voltage sensitivity of the DAQ unit (47 μV) by the current sensitivity of the LTS 25-NP current transducers from LEM (25 mV/A). This yields a current sensitivity of about 2 mA. This approach improves accuracy by eliminating the complexity of measuring power on AC. Furthermore, the approach enjoys greater sensitivity to current changes (2 mA) and higher acquisition-unit sampling frequencies (up to 250 K samples/s). Since most modern motherboards have separate supply connectors for the CPU(s), this approach facilitates distinguishing CPU power consumption from that of other motherboard components. This improvement comes with increased cost and complexity, however: the sophisticated DAQ unit is priced an order of magnitude higher than the power meter, and we had to build a custom board to house the current transducer infrastructure.
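Eqns (1) and (2) combine into a short post-processing routine. The sketch below uses the base voltage (2.494 V) and 25 mV/A sensitivity given above; the helper names, nominal rail voltages, and sample transducer readings are illustrative assumptions, not measured data.

```python
# Sketch of the rail-power computation from Eqns (1) and (2).
BASE_VOLTAGE = 2.494   # transducer output at zero current (V)
SENSITIVITY = 0.025    # LTS 25-NP sensitivity (V per A)

def transducer_current(v_out):
    """Eqn (1): recover rail current from transducer output voltage."""
    return (v_out - BASE_VOLTAGE) / SENSITIVITY

def motherboard_power(v_out_by_rail, rail_voltage):
    """Eqn (2): sum I * V over the four logical ATX rails."""
    return sum(transducer_current(v) * rail_voltage[rail]
               for rail, v in v_out_by_rail.items())

rail_voltage = {"3.3V": 3.3, "5V": 5.0, "12V3": 12.0, "12V1/2": 12.0}

# Hypothetical DAQ readings (V) for the four transducer outputs.
samples = {"3.3V": 2.544, "5V": 2.594, "12V3": 2.519, "12V1/2": 2.594}

print(round(motherboard_power(samples, rail_voltage), 2))  # -> 86.6
```

With these sample readings the rails carry 2 A, 4 A, 1 A, and 4 A respectively, giving 6.6 + 20 + 12 + 48 = 86.6 W.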

3.1.3 At the Processor Voltage Regulator
Although measurements taken at the motherboard supply rails factor out the PSU’s efficiency curve, they are still affected by the efficiency curve of the on-board voltage regulators. To eliminate this source of inaccuracy, we investigate a third approach. Motherboards that follow Intel’s processor power delivery guidelines (Voltage Regulator-Down (VRD) 11.1 [13]) provide a load indicator output (IMON) from the processor voltage regulator. This load indicator is connected to the processor for use by the processor’s power management features. The signal provides an analog voltage linearly proportional to the total load current of the processor. We make use of this current-sensing pin on the processor’s voltage regulator chip (a CHL8316, in our case) to acquire real-time information about the total current delivered to the processor. We also use the voltage output at the V_CPU pin of the voltage regulator, which is directly connected to the core voltage supply input of the processor. We locate these two signals on the motherboard and solder wires at the respective connection points (the resistor/capacitor pads connected to these signals). We connect these two signals and the ground point to our DAQ unit, logging the values read on the separate machine. This current measurement setup is shown in Fig. 6.

Fig. 6. Measurement setup on CPU voltage regulator.

The full voltage swing of the IMON output is 900 mV for a full-scale current of 140 A (for the motherboard under test). Hence, the current sensitivity of the IMON output comes to about 6.43 mV/A. The theoretical sensitivity of this infrastructure depends on the voltage sensitivity of the DAQ unit (47 μV), and its overall sensitivity to current changes comes to about 7 mA. This sensitivity is lower than that for measuring current at the ATX power rails, but it may vary for the different voltage regulators employed on different motherboards. This method provides the most accurate measurement of the absolute current feeding the processor. But it is also the most intrusive, as it requires soldering wires on the motherboard, an invasive instrumentation procedure that should only be performed by skilled technicians. Moreover, these power measurements are limited to processor power consumption (we get no information about other system components). For example, for memory-intensive applications, we can account for power consumption effects of the external bus transactions triggered by off-chip memory accesses, but this method provides no means of measuring power consumed in the DRAMs. The accuracy of the IMON output is specified by the CHL8316 datasheet to be within ±7%. This is far worse than the ±0.7% accuracy of the current transducers at the ATX power rails (note that the accuracy specifications of the processor’s voltage regulator may differ for different manufacturers).
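The sensitivity figures above follow from simple arithmetic on the numbers in the text (900 mV full swing, 140 A full scale, 47 μV DAQ resolution), sketched below.

```python
# Reproduces the IMON sensitivity arithmetic: the full-scale swing over
# the full-scale current gives the regulator's output sensitivity, and
# dividing the DAQ's voltage resolution by it gives the smallest
# detectable current step.
FULL_SWING_V = 0.900      # IMON full-scale output (V)
FULL_SCALE_A = 140.0      # corresponding load current (A)
DAQ_RESOLUTION_V = 47e-6  # NI USB-6210 voltage sensitivity (V)

imon_sensitivity = FULL_SWING_V / FULL_SCALE_A            # V per A
current_resolution = DAQ_RESOLUTION_V / imon_sensitivity  # A

print(round(imon_sensitivity * 1000, 2))    # -> 6.43 (mV/A)
print(round(current_resolution * 1000, 1))  # -> 7.3 (mA, i.e. "about 7 mA")
```
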

3.2 Experimental Results
We compared power measurement results from our three approaches to further evaluate their advantages and disadvantages. The Watts up? Pro measures power consumption of the entire system at the rate of one sample per second, whereas the DAQ unit is configured to capture samples at the rate of 40,000 samples per second from the four effective ATX voltage rails (+12V1/2, +12V3, +5V, and +3.3V) and the CPU voltage regulator V_CPU and IMON outputs. We choose this rate because the combined sampling rate of the six channels adds up to 240 K samples per second, and the maximum sampling rate supported by the DAQ is 250 K samples per second. To remove background noise, we average the DAQ samples over a period of 40 samples, which effectively gives 1000 samples per second. We use a CPU-bound test workload consisting of a 32 × 32 matrix multiplication in an infinite loop.

Figure 7 shows power measurement results across the three different points as we vary the number of active cores. Steps in the power consumption are captured by all measurement setups. The low sampling frequency of the wall-socket power meter prevents it from capturing short and sharp peaks in power when the CPU is idle. The power consumption changes we observe at the wall outlet are at least 13 W from one activity level to another, diminishing the smoothing effect of the PSU’s smoothing capacitor.

Figure 8 depicts measurement results when the CPU frequency is varied every 5 s from 2.93 to 1.33 GHz in steps of 0.133 GHz. The power measurement setups at the ATX power rails and the CPU voltage regulator capture the changes in power consumption accurately, and apart from the differences in absolute values and the effects of the CPU voltage regulator

efficiency curve, there is not much to differentiate measurements at the two points. However, the power measurements taken by the power meter at the wall outlet fail to capture the changes faithfully, even though its 1-s sampling rate is enough to capture steps that last 5 s. This effect is even more visible when we introduce throttling (at eight different levels for each CPU frequency), as shown in Fig. 9. Here, each combination of CPU frequency and throttling level lasts for 2 s, which should be long enough for the power meter to capture steps in the power consumption. But the power meter performs worse as power consumption decreases. This can be attributed to the smoothing effect of the capacitor in the PSU. These effects are not visible between the measurement points at the ATX power rails and the CPU voltage regulator because the motherboard’s decoupling and storage capacitors hold much less charge than those housed in the PSU.

Fig. 7. Power measurement comparison when varying the number of active cores.

Fig. 8. Power measurement comparison when varying core frequency.

Fig. 9. Power measurement comparison when varying core frequency together with throttling level.

Figure 10 shows the efficiency curve of the CPU voltage regulator at various load levels. The voltage regulator on the test system employs dynamic phase control, adjusting the number of active phases with the load current to optimize efficiency over a wide range of loads. The regulator switches to one-phase or two-phase operation to increase efficiency at light loads. As the load increases, the regulator switches to four-phase operation at medium loads and six-phase operation at high loads. The sharp changes in efficiency visible in Fig. 10 are presumably due to this phase-control adaptation.

Fig. 10. Efficiency curve of the CPU voltage regulator.

Figure 11 shows the efficiency graph of the PSU against the total power consumption calculated at the ATX power rails. The total system power never drops below 30 W, and the efficiency of the PSU varies from 60% to around 80% over the output power range from 30 to 100 W.

Fig. 11. Efficiency curve of the PSU.

Figure 12 shows the changes in CPU and main memory power consumption while running gcc from SPEC CPU2006 [47]. Power consumption of the main memory varies from around 7.5 to 22 W across the various phases of the gcc run. Researchers and practitioners who wish to assess main memory power consumption will thus at least want to measure power at the ATX power rails.

Fig. 12. Power measurement comparison for the CPU and DIMM (running GCC).
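The DAQ post-processing described at the beginning of this section (averaging non-overlapping 40-sample windows to reduce 40,000 samples/s to an effective 1000 samples/s) can be sketched as follows; the input below is synthetic, not captured data.

```python
# Block-averaging noise filter: each non-overlapping window of `window`
# raw DAQ samples is replaced by its mean, reducing the sample rate by
# a factor of `window` (40,000/s -> 1,000/s for window=40).
def block_average(samples, window=40):
    """Average consecutive non-overlapping windows of `window` samples."""
    return [sum(samples[i:i + window]) / window
            for i in range(0, len(samples) - window + 1, window)]

# 80 synthetic samples: 40 around 2.5 V, then 40 around 2.75 V.
raw = [2.5] * 40 + [2.75] * 40
print(block_average(raw))  # -> [2.5, 2.75]
```
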

3.3 Further Reading
There have been many interesting studies on power modeling and power-aware resource management. These employ various means to measure empirical power. Rajamani et al. [37] use on-board sense resistors located between the processor and voltage regulators to measure power consumed by the processor. They use a National Instruments isolation amplifier and data acquisition unit to filter, amplify, and digitize their measurements. Isci and Martonosi [25] measure current on the 12V ATX power lines using clamp ammeters, which are hooked to a digital multimeter (DMM) for data collection. The DMM is connected to a data logging machine via an RS232 serial port. Contreras and Martonosi [11] use on-board jumpers on their Intel® XScale™ development board to measure the power consumption of the CPU and memory separately. They feed the measurements to a LeCroy oscilloscope for sampling. Cui et al. [16] also measure the power consumption at the ATX power rails. They use current-sense resistors and amplifiers to generate sense voltages (instead of using current transducers), and they log their measurements using a digital multimeter. Bedard et al. [4] build their own hardware combining the voltage and current measurements and host interface into one solution. They use an Analog Devices ADM1191 digital power monitor to sense voltage and current values and an Atmel® microcontroller to send the measured values to a host USB port.

4. POWER ESTIMATION
Power-measurement techniques like those from the previous section are essential for analyzing the power consumption of systems under test. However, these measurement techniques do not provide detailed information on the power consumption of individual processor cores or smaller modules (e.g., caches, floating-point units, integer execution units). To develop resource-management techniques for an individual processor, system designers need to analyze power consumption at the granularity of processor cores or even components within a processor core. This information can be provided by placing on-die digital power meters, but that increases the chip’s hardware cost. Hence, support for such power meters from chip manufacturers has been limited. Another alternative is to estimate the power consumption at the desired granularity using software power models. Such models identify various power-relevant events in the targeted microarchitecture and track those events to generate a representative power-consumption value. We can characterize the desirable aspects of a software power-estimation model by the following attributes:
Portability. The model should be easy to port from one platform to another;
Scalability. The model should be easy to scale across a varying number of active cores and across different CPU voltage-frequency points;
CPU usage. The model’s CPU footprint should be negligible, so as not to pollute the power consumption values of the system under test;
Accuracy. The model’s estimated values should closely follow the empirically measured power of the device being modeled;
Granularity. The model should provide power consumption estimates at the granularity desired for the problem description (per core, per microarchitectural module, etc.); and
Speed. The model should supply power estimation values to the software at minimal latency (preferably within microseconds).

In the next section, we survey power-modeling techniques used in prior work and discuss various aspects of power modeling in general.

4.1 Power Modeling Techniques

The past decade has seen considerable research in the field of power modeling. Depending on the problem description and research goals, power modeling can be performed both on simulators [26, 27, 35] and on hardware platforms [5, 7, 11, 17, 21, 26, 37, 42]. Power estimation based on simulation allows researchers greater freedom to select which power-estimation techniques to employ. Previous studies use instruction-level power estimation [28, 39, 51] that assigns representative power values to individual instructions or to a cluster of instructions within an instruction-set simulator. Popular power-estimation tools like Wattch [10] and SimplePower [56] work with the SimpleScalar [1] simulator to provide cycle-level power estimates. These tools monitor the activity of microarchitectural components to form a decomposed power model for individual sub-units of the architecture. Although these power models provide insight into the power-consumption behavior of new designs developed by computer architects and form an essential part of research on power management, their use in implementing power-management algorithms for actual hardware platforms is limited. The most popular mechanism adopted by researchers to develop power models for hardware platforms is the use of event-driven PMCs.

4.1.1 Performance Monitoring Counters

Most modern processors are equipped with a Performance Monitoring Unit (PMU) providing the ability to count microarchitectural events that expose the inner workings of the processor. This allows programmers to analyze processor performance, including the interaction between the program and the microarchitecture, in real time on real hardware, rather than relying on simplified performance results from simulations. PMUs provide a wide variety of performance events. These events can be counted by mapping them to a limited set of PMC registers. For example, on Intel and AMD platforms, these performance-counter registers are accessible as Model Specific Registers (MSRs). Also called Machine Specific Registers, these are not compatible across processor families. Software can configure the performance counters to select which events to count. The PMCs can be used to count events like cache misses, micro-operations retired, stalls at various stages of an out-of-order pipeline, floating point/memory/branch operations executed, and many more. Although the counter values are not error-free [54, 57] or even deterministic [53], if used correctly, the errors are small enough to make PMCs suitable candidates for estimating power consumption. PMCs are available individually for each core and hence can be used to create core-specific models.

The number and variety of PMCs available for modern processors is increasing with each new architecture. For example, the number of PMCs available in the Intel® Core™ i7 processor is about 10 times the number available in the Intel® Core Duo processor [23]. This comprehensive coverage of event information increases the chances that the available PMCs will be good representatives of overall microarchitectural activity for the purposes of performance and power analysis.
For further reading about PMCs, please refer to Intel’s System Programming Guide for Intel® 64 and IA-32 architectures [23].

4.1.2 PMC Access

Ever since researchers and programmers started profiling their software using performance counters, various kernel interfaces, helper libraries, and monitoring tools have been developed to access PMU hardware. A kernel interface is required because special instructions are needed to write to the performance-counter control registers, which can only be done at the highest privilege level, although instructions to read the registers may be executed at the user level on some platforms. Intel provides a commercial profiling tool named VTune™ Amplifier XE for IA-32 and Itanium® processor-based machines. Many users use Oprofile as an open-source monitoring alternative to Intel® VTune™. Oprofile uses its own kernel interface to access the performance-counter registers, and hence needs to be part of the Linux kernel tree. Most major Linux distributions have the Oprofile kernel interface as part of their source tree. Oprofile currently supports both system-wide and per-thread monitoring and can profile applications on a wide variety of hardware, including Intel, ARM, AMD, and MIPS platforms. Apart from the Oprofile kernel interface, a different kernel interface called Perfctr was developed as a generic interface that can be used by monitoring tools. Perfctr is not part of the Linux kernel and hence requires a separate kernel patch. Like Oprofile, Perfctr supports both system-wide and per-thread monitoring. The Perfctr interface is used by an application programming interface named PAPI (Performance Application Programming Interface). Another open-source kernel interface called Perfmon2 was developed with the aim of standardizing the performance-monitoring kernel interface for Linux and providing a generic interface that can be ported across all PMU models and architectures. Perfmon2 also supports system-wide and per-thread monitoring.
A helper library such as libpfm or PAPI can be used to access the Perfmon2 kernel interface. The Perfmon2 interface is used by open-source monitoring tools like Caliper and pfmon. Like Perfctr, Perfmon2 is not part of the Linux kernel and requires a separate patch. In the quest to standardize the performance-monitoring interface in the Linux kernel tree, the Linux community finally agreed in 2008 to add a generic API to the kernel tree called the Linux Performance Event Subsystem, also known as perf_events. perf_events has been included in the Linux kernel since version 2.6.31 and is gaining acceptance from the performance-profiling community. PAPI now supports the perf_events interface and has deprecated support for both Perfctr and Perfmon2. Development of Perfmon2 stopped after the release of perf_events, and a completely revised version of libpfm, called libpfm4, uses the perf_events interface. libpfm4 comes with its own example programs, which can be used to create specialized monitoring applications. There have been discussions in the Oprofile community about porting Oprofile to perf_events. perf_events, unlike Oprofile, does not require root access for profiling user-level threads.
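For quick scripting, the simplest practical route to perf_events is the perf command-line front end, whose `perf stat -x,` mode emits machine-readable CSV. The sketch below is hedged: the column layout varies across perf versions, and the field order assumed here (value, unit, event name) is an assumption to be checked against your installation; the function name is ours.

```python
# Hypothetical helper: parse CSV output of `perf stat -x,`.
# Assumed field layout: value,unit,event_name,... (verify against
# the perf version installed on your system).

def parse_perf_stat_csv(text):
    """Return a dict mapping event name -> raw count."""
    counts = {}
    for line in text.strip().splitlines():
        fields = line.split(",")
        if len(fields) < 3:
            continue
        value, _unit, event = fields[0], fields[1], fields[2]
        # perf reports unavailable events with these placeholders
        if value in ("<not supported>", "<not counted>"):
            continue
        counts[event] = int(value)
    return counts

sample = "12345678,,instructions\n23456,,cache-misses\n<not counted>,,branches"
print(parse_perf_stat_csv(sample))
```

A real monitoring tool would instead call perf_event_open() directly (as PAPI and libpfm4 do), but shelling out to perf and parsing its output is often sufficient for model-training scripts.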

4.1.3 Counter Selection

Selecting appropriate PMCs is extremely important with respect to the accuracy of the power model. Our methodology chooses counters that are most highly correlated with measured power consumption. The chosen counters must also cover a sufficiently large set of events to ensure that they capture general application activity. If the chosen counters do not meet these criteria, the model will be prone to error. The problem of choosing appropriate counters for power modeling has been handled in different ways by previous researchers. Research studies that estimate power for an entire core or processor [11, 17, 37] use a small number of PMCs. Research studies that aim to construct decomposed power models to estimate the power consumption of sub-units of a core [5, 7, 26] tend to monitor a greater number of PMCs. The number of counters needed depends on the model granularity and the acceptable level of complexity. Also, most modern processors allow simultaneous counting of only two or four microarchitectural events. Hence, using more counters in the model requires interleaving the counting of events and extrapolating the counter values over the total sampling period. This reduces the accuracy of absolute counter values but allows researchers to track more counters.
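The interleaving-and-extrapolation step can be sketched in a few lines. perf_events reports, alongside each raw count, both the time an event was enabled and the time it was actually scheduled on a hardware counter; scaling by their ratio is the usual linear extrapolation. Note the simplifying assumption, which is the source of the accuracy loss mentioned above: the event rate is taken to be uniform across the whole sampling period.

```python
def extrapolate_multiplexed(raw_count, time_enabled, time_running):
    """Scale a time-multiplexed counter value to the full sampling period.

    time_enabled: how long the event was requested (e.g., nanoseconds).
    time_running: how long it actually occupied a hardware counter.
    Assumes a uniform event rate over the period.
    """
    if time_running == 0:
        return 0
    return round(raw_count * time_enabled / time_running)

# An event that held a hardware counter for only 1/4 of the period:
print(extrapolate_multiplexed(1000, time_enabled=4_000_000,
                              time_running=1_000_000))
```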

The event counters for the power model can be chosen based on analytical inspection of the microarchitecture, statistical correlation with the measured power-consumption values, or a combination of both. Rajamani et al. [37] design their power model using only a single PMC (Decoded Instructions per Cycle, or DPC) based on the argument that power consumption correlates strongly with DPC. They try to capture event activity that includes instructions executed speculatively but not committed. Joseph and Martonosi [26] analyze the microarchitecture of the Alpha 21264 processor and identify the power-relevant events based on their analysis and previous research. After identifying the events, they select PMCs representing those events. When the relevant counters are not available, they employ heuristics to calculate the utilization factors for chosen events using the available counters. Bellosa et al. [5] use statistical correlation of PMCs with power consumption to select the most promising counters for their model. Contreras and Martonosi [11] choose five PMCs to estimate power through a combination of analytical and statistical approaches. Pusukuri et al. [36] start by using all the counters that they argue are relevant for power-consumption calculation. Once they create a model that uses all the selected counters, they rank the selected counters using relative importance measures and choose only those counters that show high effectiveness in terms of R2 value. Like Singh et al. [42] and Goel et al. [21], we divide the available counters into four categories and then choose one counter from each category based upon statistical correlation. This ensures that the chosen counters are comprehensive representations of the entire microarchitecture and are not biased toward any particular section. Consider the microarchitecture of a given processor.
Caches and floating-point units form a large part of the chip real estate, and thus PMCs that track their activity factors are useful contributors to the total power-consumption information. Depending on the platform, multiple counters will be available in both these categories. For example, we can count the total number of cache references as well as the number of cache misses for various cache levels. For floating-point operations, depending upon the processor model, we can count (separately or in combination) the number of multiply, addition, or division operations. Because of the deep pipelining of modern processors, we can also expect out-of-order logic to account for a significant amount of power consumption. Stalls due to branch mispredictions or an empty instruction decoder may reduce average power consumption over a fixed period of time. On the other hand, pipeline stalls caused by full reservation stations and reorder buffers will be positively correlated with power because these indicate that the processor has extracted enough instruction-level parallelism to keep the execution units busy. Hence, pipeline stalls indicate not just the power usage of out-of-order logic but of the execution units, as well. In addition, we would like to use a counter that can cover all the microarchitectural components not covered by the above three categories. This includes, for example, integer execution units, branch-prediction units, and single-instruction multiple-data (SIMD) units. These events can be monitored using the specific PMCs tied to them or by a generalized counter like total instructions/micro-operations (UOPS) retired/executed/issued/decoded. To construct a power model for individual sub-units, we need to identify the respective PMCs that represent each sub-unit's utilization factors.
To choose counters by using statistical correlation, we run a training application while sampling the performance counters and collecting empirical power-measurement values. As an example, Fig. 13 shows simplified pseudo-code for the microbenchmark developed by Singh et al. [42]. Here, different phases of the microbenchmark exercise different parts of the microarchitecture to establish the correlation between the PMCs and power consumption. Since the number of relevant PMCs will most likely exceed the limit on simultaneously monitored counters, multiple training-set runs will be required to gather data for all the desired counters. Researchers and practitioners can either develop their own custom

Fig. 13. Microbenchmark pseudo-code.

microbenchmarks [7, 17, 42] or use a subset of available benchmarks [37] as the training set. Previous studies have come to different conclusions regarding the benefits of tracking more events to increase the accuracy of the model. Goel et al. [20, 21] use four counters for their composite power model. They find that a model using eight counters exhibits a median error of 1.92%, whereas the one using four counters shows a median error of 2.06%. They conclude that the improvement in accuracy is not sufficiently significant to justify increasing the complexity of the model and requiring multiplexing of the counters. Pusukuri et al. [36] compare their power-model accuracy using eight-predictor and two-predictor models: they begin with an eight-predictor model and then choose two of the most statistically significant predictors to create a two-predictor model. Their results show that the latter model performs as well as or better than the former. They conclude that the two-predictor model is more robust than the eight-predictor model, as it does not overfit their training data. Bertran et al. [7] also test model accuracy for one, two, and eight predictors. They show that although the mean error for all their models is in the same range (2.15%, 2.77%, and 2.02%, respectively), there is a marked decrease in the standard deviation of estimation errors as the number of tracked events increases (4.11%, 2.57%, and 1.48%, respectively). They conclude that tracking more events is beneficial for the robustness of the model. Once the performance-counter values and the respective empirical power-consumption values are collected, one can use a statistical correlation method to establish the correlation between performance events (counter values normalized to the number of instructions executed) and power to select the most suitable events for the power model. The type of correlation method used can affect the model accuracy. Singh et al.
[42] use Spearman's rank correlation [43] to measure the relationship between each counter and power. Using this rank correlation, in comparison to correlation methods like Pearson's, ensures that any nonlinear relationship between the counter and power values does not affect the correlation coefficient. As an example, we elaborate on the counter-selection methodology of Singh et al. [42] and Goel et al. [20, 21] (which we also adopt here). Table II shows the most power-relevant counters divided categorically according to the correlation coefficients obtained from running their microbenchmarks on the Intel® Core™ i7 platform. Table IIa shows that only FP_COMP_OPS_EXE:X87 is a suitable candidate from the floating point (FP) category. Ideally, to get total FP operations executed on the processor, we

Table II Intel® Core™ i7 counter correlation.

(a) FP operations                     ρ
FP_COMP_OPS_EXE:X87                   0.65
FP_COMP_OPS_EXE:SSE_FP                0.04

(b) Total instructions                ρ
UOPS_EXECUTED:PORT1                   0.84
UOPS_ISSUED:ANY                       0.81
UOPS_EXECUTED:PORT015                 0.81
INSTRUCTIONS_RETIRED                  0.81
UOPS_EXECUTED:PORT0                   0.81
UOPS_RETIRED:ANY                      0.78

(c) Memory operations                 ρ
MEM_INST_RETIRED:LOADS                0.81
UOPS_EXECUTED:PORT2_CORE              0.81
UOPS_EXECUTED:PORT234_CORE            0.74
MEM_INST_RETIRED:STORES               0.74
LAST_LEVEL_CACHE_MISSES               0.41
LAST_LEVEL_CACHE_REFERENCES           0.36

(d) Stalls                            ρ
ILD_STALL:ANY                         0.45
RESOURCE_STALLS:ANY                   0.44
RAT_STALLS:ANY                        0.40
UOPS_DECODED:STALL_CYCLES             0.25

should count both x87 FP operations (FP_COMP_OPS_EXE:X87) and SIMD (FP_COMP_OPS_EXE:SSE_FP) operations. The microbenchmarks do not use SIMD floating-point operations, and, hence, we see high correlation for the x87 counter but not for the SSE (Streaming SIMD Extensions) counter. Because of the limit on the number of counters that can be sampled simultaneously, we have to choose between the two counters. Ideally, chip manufacturers would provide a counter reflecting both x87 and SSE FP instructions, obviating the need to choose. In Table IIb, the correlation values in the total instructions category are almost equal, and thus these counters need further analysis. The same is true for the top three counters in the stalls category, shown in Table IId. Since we are looking for counters providing insight into out-of-order logic usage, the RESOURCE_STALLS:ANY counter is our best option. As for memory operations, choosing either

MEM_INST_RETIRED:LOADS or MEM_INST_RETIRED:STORES will bias the model toward load- or store-intensive applications. Similarly, choosing UOPS_EXECUTED:PORT1 or UOPS_EXECUTED:PORT0 in the total instructions category will bias the model toward addition- or multiplication-intensive applications. We therefore omit these counters from further consideration. Table III shows that correlation analysis may find counters from the same category with very similar correlation numbers. Our aim is to make a comprehensive power model using only four counters, and thus we must make sure that the counters chosen convey as little redundant information as possible. We therefore analyze the correlation among all the counters. To select a counter from the memory operations category, we analyze the correlation of UOPS_EXECUTED:PORT234_CORE and LAST_LEVEL_CACHE_MISSES with the counters from the total instructions category, as shown in Table IIIa. From this table, it is evident that UOPS_EXECUTED:PORT234_CORE is highly correlated

Table III Counter–counter correlation.

(a) MEM vs. INSTR correlation
                          UOPS_EXECUTED:PORT234    LAST_LEVEL_CACHE_MISSES
UOPS_ISSUED:ANY           0.97                     0.14
UOPS_EXECUTED:PORT015     0.88                     0.2
INSTRUCTIONS_RETIRED      0.91                     0.12
UOPS_RETIRED:ANY          0.98                     0.08

(b) FP vs. INSTR correlation
                          FP_COMP_OPS_EXE:X87
UOPS_ISSUED:ANY           0.44
UOPS_EXECUTED:PORT015     0.41
INSTRUCTIONS_RETIRED      0.49
UOPS_RETIRED:ANY          0.43

(c) STALL vs. INSTR correlation
                          RESOURCE_STALLS:ANY
UOPS_ISSUED:ANY           0.25
UOPS_EXECUTED:PORT015     0.30
INSTRUCTIONS_RETIRED      0.23
UOPS_RETIRED:ANY          0.21

Table IV PMCs selected for the Intel® Core™ i7.

Category                  Intel® Core™ i7
Memory                    LAST_LEVEL_CACHE_MISSES
Instructions executed     UOPS_ISSUED
Floating point            FP_COMP_OPS_EXE:X87
Stalls                    RESOURCE_STALLS:ANY

with the instructions counters, and hence LAST_LEVEL_CACHE_MISSES is the better choice. To choose a counter from the total instructions category, we analyze the correlation of these counters with the FP and stalls counters (in Table IIIb and c, respectively). These correlations do not clearly recommend any particular choice. In such cases, we can either choose one counter at random or choose a counter intuitively. UOPS_EXECUTED:PORT015 is not preferable since it does not cover memory operations that are satisfied by cache accesses instead of main memory. The UOPS_RETIRED:ANY and INSTRUCTIONS_RETIRED counters cover only retired instructions and not those that are executed but not retired, e.g., due to branch misprediction. A UOPS_EXECUTED:ANY counter would have been appropriate, but since such a counter does not exist, the next best option is UOPS_ISSUED:ANY. This counter covers all instructions issued, so it also covers the instructions issued but not executed (and thus not retired). Table IV shows the counters we selected for the Intel® Core™ i7 (these are the same as used by Goel et al. [21]).
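The rank-correlation computation that underlies this selection step can be sketched in a few lines. The implementation below is a generic Spearman ρ (Pearson correlation of average ranks), not the authors' code; the final example illustrates why rank correlation is preferred here: a monotonic but strongly nonlinear counter–power relationship still yields ρ = 1.

```python
def rank(xs):
    """Assign 1-based average ranks, handling ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank for a tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the ranks."""
    rx, ry = rank(x), rank(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

# Monotonic but nonlinear relationship -> rho = 1.0:
counters = [1, 2, 3, 4, 5]
power = [10, 11, 20, 100, 1000]
print(spearman(counters, power))
```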

4.1.4 Model Formation

The type of power model targeted and the method of calculating counter coefficients also affect counter selection. Rajamani et al. [37] develop a power model using a single PMC (Decoded Instructions per Cycle). They develop their power model as a linear fit of measured counter values to empirical power values while minimizing the absolute value of error. They derive a different model (different values of counter coefficients and constants) for each p-state (power state) of their processor. Bellosa et al. [5] use the dqed subroutine from netlib FORTRAN to calculate the weights of their performance events from the set of linear equations relating those events to the power consumption. Bertran et al. [7] construct a decomposable power model in which power consumption per component is estimated using 13 PMCs. They calculate the power weight of each performance event by running a microbenchmark that exercises the related component in isolation. When it is not possible to isolate the activity of a component, they calculate the respective weight incrementally. For example, to derive power for the L2 cache, they first derive power for the L1 cache and then use that figure to derive power consumption for the L2. They represent the total CPU dynamic power as a sum of products of counters and their respective weights. After calculating the power weights of all CPU components, they use a microbenchmark that exercises all components to calculate the power for CPU front-end logic and the static power. The dynamic power is summed with the static power to get total CPU power. Contreras and Martonosi [11] use five PMCs to implement their power model. Like Bellosa et al. [5], they construct the power model as a linear equation of counters. They calculate counter weights using multidimensional parameter estimation in an effort to minimize power-estimation errors.
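Most of the models surveyed above reduce to fitting weights in a linear equation of counter values against measured power. As a hedged sketch (not any particular study's method), ordinary least squares via the normal equations can be written in a few lines:

```python
def fit_linear(X, y):
    """Ordinary least squares via the normal equations A^T A w = A^T y.

    X: list of predictor rows (e.g., normalized event rates);
    y: measured power samples. Returns [intercept, w1, ..., wn].
    Adequate for the handful of predictors used in PMC power models;
    no regularization or robustness measures are attempted.
    """
    A = [[1.0] + list(row) for row in X]  # prepend intercept column
    n = len(A[0])
    AtA = [[sum(a[i] * a[j] for a in A) for j in range(n)] for i in range(n)]
    Aty = [sum(a[i] * yi for a, yi in zip(A, y)) for i in range(n)]
    # Gaussian elimination with partial pivoting on [AtA | Aty]
    M = [row[:] + [b] for row, b in zip(AtA, Aty)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        w[r] = (M[r][n] - sum(M[r][c] * w[c] for c in range(r + 1, n))) / M[r][r]
    return w

# Synthetic, noise-free data generated by power = 5 + 2*r1 + 3*r2:
X = [(0.1, 0.2), (0.4, 0.1), (0.3, 0.5), (0.9, 0.7), (0.6, 0.3)]
y = [5 + 2 * r1 + 3 * r2 for r1, r2 in X]
print([round(v, 6) for v in fit_linear(X, y)])
```

In practice one would use a library solver (e.g., numpy.linalg.lstsq); the point here is only that "calculating counter weights" is a small regression problem once training samples are collected.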
All of these studies use linear equations to construct their power models (either assuming a linear relationship between power and performance events or choosing to ignore the nonlinearity that exists in these relationships). In contrast, we adopt the approach of Goel et al. [21] and Singh et al. [42], who apply nonlinear transformations to normalized counter values to account for nonlinearity. They use multiple regression analysis to form a linear regression model that predicts power consumption from sampled counter values and temperature readings. Sampled PMC values, ei, are normalized to the elapsed cycle count to generate event rates, ri, which are used in an equation incorporating the rise in core temperature, T, and the rise in power consumption, Pcore, which are instantaneous values. The normalization ensures that changing the sampling period of the readings does not affect the weights of the respective predictors. As in Singh et al. [42] and Goel et al. [21], we develop a piecewise power model that achieves better fit by separating the collected samples into two bins based on the values of either the memory counter or the FP counter. Breaking the data using the memory counter value helps in separating memory-bound phases from CPU-bound phases. Using the FP counter instead of the memory counter to divide the data helps in separating FP-intensive phases. The selection of a candidate for breaking the model is machine specific and depends on what gives a better fit. Regardless, we believe that piecewise linear models better capture processor behavior. Our piecewise model is shown in Eqns (3) and (4):

P̂core = F1(g1(r1), ..., gn(rn), T), if condition,
        F2(g1(r1), ..., gn(rn), T), else,                  (3)

where ri = ei/(cycle count) and T = Tcurrent − Tidle.

Fn = p0 + p1 ∗ g1(r1) + ··· + pn ∗ gn(rn) + pn+1 ∗ T.     (4)

The piecewise linear regression model for our Intel® Core™ i7 is shown in Eqn (5). Here, rMEM refers to the counter LAST_LEVEL_CACHE_MISSES, rINSTR refers to the counter UOPS_ISSUED, rFP refers to the counter FP_COMP_OPS_EXE:X87, and rSTALL refers to the counter RESOURCE_STALLS:ANY. The piecewise model is broken based on the value of the memory counter. For the first part of the piecewise model, the coefficient for the memory counter is zero (due to the very low number of memory operations we sampled):

Pcore = 10.9246 + 0 ∗ rMEM + 5.8097 ∗ rINSTR + 0.0529 ∗ rFP
        + 6.6041 ∗ rSTALL + 0.1580 ∗ T,          if rMEM < 1e−6,
Pcore = 19.9097 + 556.6985 ∗ rMEM + 1.5040 ∗ rINSTR + 0.1089 ∗ rFP
        − 2.9897 ∗ rSTALL + 0.2802 ∗ T,          if rMEM ≥ 1e−6.    (5)
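Eqn (5) can be evaluated directly at runtime; a minimal sketch follows (the function name is ours, and the coefficients are the published ones, valid only for the authors' Core i7 at its measured operating point):

```python
def estimate_core_power(r_mem, r_instr, r_fp, r_stall, delta_t):
    """Piecewise linear power model of Eqn (5) for the Intel Core i7.

    Inputs are PMC counts normalized to elapsed cycles plus the rise
    in core temperature over idle (delta_t, degrees C). Coefficients
    are machine-specific and would be refit for any other platform.
    """
    if r_mem < 1e-6:   # CPU-bound bin: memory term drops out
        return (10.9246 + 0.0 * r_mem + 5.8097 * r_instr
                + 0.0529 * r_fp + 6.6041 * r_stall + 0.1580 * delta_t)
    else:              # memory-bound bin
        return (19.9097 + 556.6985 * r_mem + 1.5040 * r_instr
                + 0.1089 * r_fp - 2.9897 * r_stall + 0.2802 * delta_t)

# A CPU-bound sample: almost no last-level cache misses per cycle
print(round(estimate_core_power(1e-8, 0.8, 0.1, 0.05, 10.0), 3))  # ~17.488 W
```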

4.2 Secondary Aspects of Power Modeling

While constructing the power model, researchers need to consider various aspects of the system and architecture, and they must tune their methodology accordingly. Some of these aspects include chip temperature, dynamic voltage and frequency scaling, simultaneous multithreading, and custom performance-boosting techniques. Next, we discuss each of these aspects.

4.2.1 Temperature Effects

Processor power consumption consists of both dynamic and static elements. Among these, the static power consumption is dependent on the core temperature. Equation (6) shows that the static power consumption of a processor is a function of both leakage current and supply voltage. The processor leakage current is, in turn, affected by process technology, supply voltage, and temperature. As processor power consumption increases, processor temperature increases. This increase in temperature increases leakage current, which, in turn, increases static processor power consumption. To study the

[Figure 14 appears here: (a) temperature vs. power on the Intel® Core™ i7; (b) static power curve, empirical data with exponential curve fit; (c) temperature vs. static power on the Intel® Core™ i7.]
Fig. 14. Temperature effects on power consumption.

effects of temperature on power consumption, we ran a multithreaded program executing MOV operations in an infinite loop on our Intel® Core™ i7 machine. The behavior of the program over its entire run remains very consistent. This indicates that the dynamic power consumption of the processor changes little over the run of the program. Figure 14a shows that the total power consumption of the machine increases during the program's runtime, coinciding with the increase in chip temperature, while the CPU load remains constant. Thus, the gradual increase in power consumption during the run of this program can be attributed to the coincidental, gradual increase in temperature. The total power consumption increases by almost 10% due to the change in temperature. To account for this increase in static power, it is necessary to include temperature.

Pstatic = Ileakage ∗ Vcore = Is(e^(qVd/kT) − 1) ∗ Vcore,    (6)

where Is = reverse saturation current; Vd = diode voltage; k = Boltzmann's constant; q = electronic charge; T = temperature; Vcore = core supply voltage.

As per Eqn (6), the static power consumption increases exponentially with temperature. We confirm this empirically by plotting the net increase in power consumption once the program starts execution at the higher temperature, as shown in Fig. 14b. Nonlinear regression analysis gives us Eqn (7) and the curve fit shown in Fig. 14b, which closely follows the empirical data points with determination coefficient R2 = 0.995.

PstaticInc = 1.4356 × 1.034^T, when Vcore = 1.09 V.    (7)

Plotting this estimate of the increment in static power consumption, as in Fig. 14c, explains the gradual rise in total power consumption when the dynamic behavior of a program remains constant. Goel et al. [21] include the temperature effects in their power model to account for the increase in static power consumption as chip temperature increases. Instead of using a nonlinear function, they approximate the static power increase as a linear function of temperature. This is a fair approximation considering that the nonlinear equation given in Eqn (7) can be closely approximated by the linear equation given in Eqn (8), with determination coefficient R2 = 0.989, for the range in which die-temperature changes occur. This linear approximation is a trade-off for avoiding the added cost of introducing an additional exponential term in the model.

PstaticInc = 0.359 × T − 16.566, when Vcore = 1.09 V.    (8)

Modern processors allow programmers to read temperature information for each core from on-die thermal diodes. For example, Intel platforms report relative core temperatures on-die via Digital Thermal Sensors (DTS), which can be read by software through Model Specific Registers (MSRs) or the Platform Environment Control Interface (PECI) [6]. This data is used by the system to regulate CPU fan speed or to throttle the processor in case of overheating. Third-party tools like RealTemp and CoreTemp on Windows and open-source software like lm-sensors on Linux can be used to read data from thermal sensors. As Intel documents indicate [6], the accuracy of temperature readings provided by thermal sensors varies, and the values reported are not exactly equal to the actual core temperatures. Because of factory variation and individual DTS calibration, accuracy of readings varies from chip to chip. The DTS equipment also suffers from slope errors, which means that temperature readings are more accurate near the T-junction max (the maximum temperature that cores can reach before thermal throttling is activated) than at lower temperatures. DTS circuits are designed to be read over reasonable operating temperature ranges, and the readings may not show values lower than 20 °C even if the actual core temperature is lower.

Since the DTS is primarily intended as a thermal-protection mechanism, reasonable accuracy at high temperatures is acceptable. But this affects the accuracy of power models using core temperature. Researchers and practitioners should read the processor model datasheet, design guidelines, and errata to understand the limitations of their respective thermal-monitoring circuits and take corrective measures for their power models, if required.
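To make Eqns (7) and (8) concrete, the helper below (function names ours) evaluates both published fits so the reader can tabulate them over a temperature range of interest; the coefficients are reproduced as printed above and hold only at Vcore = 1.09 V on the authors' machine.

```python
def static_inc_exp(t_celsius):
    # Eqn (7): exponential fit to the increase in static power (W),
    # valid at Vcore = 1.09 V on the measured Core i7
    return 1.4356 * 1.034 ** t_celsius

def static_inc_lin(t_celsius):
    # Eqn (8): linear approximation used in the model, same conditions
    return 0.359 * t_celsius - 16.566

# Tabulate both fits across the die-temperature range shown in Fig. 14
for t in (50, 55, 60, 65, 70):
    print(t, round(static_inc_exp(t), 2), round(static_inc_lin(t), 2))
```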

4.2.2 Effects of Dynamic Voltage and Frequency Scaling

Modern microarchitectures use various performance- and power-management techniques to optimize trade-offs between processor performance and power consumption. These techniques commonly scale voltage and/or frequency dynamically (commonly known as Dynamic Voltage and Frequency Scaling, or DVFS) to vary the amount of energy available to a processor based on OS demand. Modern implementations of such techniques include Intel's SpeedStep® technology and AMD's PowerNow!™ technology. For example, the Intel® Core™ i7-870 processor can operate at 14 different P (performance) states (as opposed to power or global states). The range of frequencies across the P states varies from a maximum of 2.93 GHz to a minimum of 1.197 GHz, in steps of 0.13 GHz. Although this wide range and fine control of frequency scaling proves to be an excellent knob for performance and power-consumption control, enabling DVFS technology tremendously increases the complexity of the power-modeling methodology. This is because, as per Eqns (9) and (10), both dynamic and static power consumption change significantly with changes in voltage and frequency. This changes the relationship between the activity ratio of performance events and power consumption. To estimate power consumption using counter events at different performance points, we can run the chosen curve-fitting mechanism for each performance point and calculate the parameter weights separately for each point [7, 11, 21, 37, 42]. Pusukuri et al. [36] report success in constructing a single power model that can estimate power consumption across varying core frequencies by incorporating the PMC that counts the number of cycles during which the CPU is not in a halted state. They argue that since the number of unhalted CPU clock cycles correlates with the CPU frequency, they are able to estimate core power consumption for different frequencies.
An alternative method for creating a single power model that can correctly estimate power consumption across all the P-states is to separate the estimation of static power from dynamic power and then scale the two power components separately for different frequency and core supply voltage points.

Pdynamic = NSW ∗ Cpd ∗ VCC² ∗ fI, (9)

where NSW = number of bits switching; Cpd = dynamic power dissipation capacitance; VCC = supply voltage; fI = CPU frequency.

Pstatic = Ileakage ∗ Vcore = Is (e^(qVd/kT) − 1) ∗ Vcore, (10)

where Is = reverse saturation current; Vd = diode voltage; k = Boltzmann's constant; q = electronic charge; T = temperature; Vcore = core supply voltage.

Newer Intel microarchitectures like Nehalem and Sandy Bridge employ Turbo Boost [12], a technique that allows active processor cores to run at higher than the base operating frequency when there is headroom within the temperature, power, and current-specification limits. The Turbo Boost upper limit is defined by the number of active cores. When not all four cores are active and the OS demands the highest performance state (P0), the core frequency can be increased. This change in frequency has effects on power-model accuracy similar to those discussed above for DVFS. Researchers and system administrators have the following options to tackle the problem of dynamic frequency and voltage scaling:
• Construct and employ a single power model that can handle dynamic changes in the CPU frequency and voltage.
• Construct multiple power models for multiple performance states and use the respective power model for the detected performance state.
• Disable techniques like DVFS and Turbo Boost to fix the CPU performance point.
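The alternative method described above can be sketched in a few lines: estimate power at a reference P-state, split it into dynamic and static components, and rescale each component for the target P-state per Eqns (9) and (10). The voltage/frequency reference point and the 70/30 dynamic/static split below are illustrative assumptions, not measured values for any real processor.

```python
# Sketch: single power model across P-states by scaling dynamic and
# static power separately. REF and dynamic_fraction are illustrative
# assumptions, not measured characteristics of a real processor.
REF = {"freq_ghz": 2.926, "vcc": 1.20}   # hypothetical reference P-state

def rescale_power(p_ref_watts, target_freq_ghz, target_vcc,
                  dynamic_fraction=0.70):
    p_dyn = p_ref_watts * dynamic_fraction
    p_static = p_ref_watts - p_dyn
    # Dynamic power scales with V^2 * f (Eqn 9)
    dyn_scale = (target_vcc / REF["vcc"]) ** 2 * (target_freq_ghz / REF["freq_ghz"])
    # Static power scales roughly with V (Eqn 10, ignoring the
    # temperature dependence of the leakage current Is)
    static_scale = target_vcc / REF["vcc"]
    return p_dyn * dyn_scale + p_static * static_scale
```

At the reference point itself both scale factors are 1, so the model reproduces the reference estimate; at lower voltage-frequency points the dynamic component shrinks much faster than the static one.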

4.2.3 Effects of Simultaneous Multithreading

Some modern microarchitectures like Intel Nehalem and Intel Sandy Bridge support Simultaneous Multithreading (SMT). SMT allows the processor to divide a physical core into two or more logical cores, making the operating system see more than the actual number of physical cores. Table V shows an example of how the physical core resources are partitioned among two logical cores in the Nehalem microarchitecture. Enabling SMT and the details of the partitioning scheme affect the power-estimation model. Researchers and practitioners must understand the exact partitioning of the processor's resources and identify PMCs accordingly. Depending on the microarchitecture, the PMCs may be available separately for each logical core, shared among the logical cores, or some combination. For example, Nehalem has most of the core PMCs

Table V Hyper-threading partitioning on the Intel® Core™ i7.

Policy                 Description                         Affected microarchitecture structures
Replicated             Duplicate logic per thread          Register state; Renamed RSB; Large page ITLB
Partitioned            Statically allocated to threads     Load buffer; Store buffer; Reorder buffer; Small page ITLB
Competitively shared   Dynamically allocated to threads    Reservation station; Caches; Data TLB; 2nd level TLB
Unaware                No impact                           Execution units

separately available for each logical core. Power models that aim to provide power-estimation values for individual sub-units of a microarchitecture or for physical cores need to combine the counter values from each logical core to obtain utilization factors for the physical units.

4.3 Validation

To ensure the correctness and robustness of the constructed power model, it should be validated under comprehensive test conditions. The power model should be used to estimate power for both single-threaded benchmark suites (like SPEC CPU2006 [47] and SPEC CPU2000 [45]) and parallel benchmark suites (like NAS [2], PARSEC [8], and SPEC OMP2001 [46]). The estimated power-consumption values should be compared against simultaneously measured empirical power values to calculate the estimation error. The test applications used to validate the model's accuracy should be different from the applications used as the training set to select relevant counters and to calculate counter weights. As discussed in Section 4.1.3, a good practice is to develop custom microbenchmarks [7, 21, 42] as the training set and test the accuracy of model estimates with real applications or standard benchmark suites. In addition to checking the estimates for absolute error, it is important to check the standard deviation of those errors. A higher standard deviation means that the estimation errors are spread over a large range instead of being concentrated around the mean error. Figures 15 and 16 depict the median error and standard

Fig. 15. Median estimation error for the Intel® Core™ i7: (a) NAS, (b) SPEC-OMP, (c) SPEC 2006.

Fig. 16. Standard deviation of error for the Intel® Core™ i7: (a) NAS, (b) SPEC-OMP, (c) SPEC 2006.

deviation for test benchmarks on the Intel® Core™ i7 processor, as per the estimation results published by Goel et al. [20]. Their results show that the median error for the SPEC OMP2001 art benchmark is less than 0.2%, but the standard deviation of error for the same benchmark is around 6%. A few other applications also show high standard deviations. Goel et al. attribute these high standard deviations to a limitation of their setup: the low sampling rate (one sample per second) of their power meter cannot faithfully capture sharp changes in performance-counter activity. The power model of Bertran et al. [7] shows a much lower standard deviation in its estimates. They observe that the standard deviation of estimation errors goes down when the number of tracked events is increased. The cumulative distribution function (CDF) of the estimation error can be plotted to analyze the error distribution of the power model. CDF plots help researchers analyze the extent to which their power model is accurate across all collected samples. The CDF plots in Fig. 17 show that on the Core™ i7, 82% of estimates have less than 5% error and 96% of estimates have less than 10% error. In contrast, on the Core™ Duo, only 62% of estimates have less than 5% error. A number of other factors should be taken into consideration when forming and using power models. For instance, if a model is not intended to be specific to a given machine, it should be ported and validated on multiple platforms. To demonstrate model portability, Goel et al. [21]


Fig. 17. Cumulative distribution function (CDF) plots showing the fraction of the sample space covered (y-axis) under a given error (x-axis): (a) Intel® Core™ i7, (b) Intel® Core™ Duo.

validate their power model on six different platforms, generating consistently good estimation results. Joseph and Martonosi [26] construct their model for a 600 MHz Alpha 21264 model in a simulator (SimpleScalar [1]) and for a 200 MHz Pentium Pro machine. Their model shows good results in simulation but not on the Pentium hardware. They attribute this to not being able to isolate power information for smaller microarchitectural structures like the branch target buffer and address generation units that, in total, constitute 24% of power consumption.

Model accuracy also depends on the particular PMCs available on a given platform. If the available PMCs do not sufficiently represent the microarchitecture, model accuracy will suffer. For example, the AMD Opteron™ 8212 supports no single counter giving total floating-point operations. Instead, separate PMCs track different types of floating-point operations. We therefore choose the one most highly correlated with power. Model accuracy would likely improve if a single PMC reflecting all floating-point operations were available. The same is true for the stall counters available on the Intel® Core™ Duo. For processor models supporting only two Model Specific Registers for reading PMC values, capturing the activity of four counters requires multiplexing the counting of PMC-related events. This means that events are counted for only half a second (or half the total sampling period), and the counts are doubled to estimate the value over the entire period. This approximation can introduce inaccuracies when program behavior changes rapidly.

Similarly, even though the microbenchmarks try to cover all scenarios of power consumption, the resulting regression model will represent a generalized case.
This is especially true for a model that tries to estimate power for a complex microarchitecture using a limited number of counters. For example, floating-point operations can consist of add, multiply, or divide operations, which use different execution units and hence consume different amounts of power. If the test benchmark's instruction mix is close to that used in the microbenchmarks, the estimation error will be low, and vice versa. For power models that address machines supporting DVFS, the model should be validated at different frequencies. On multi-core platforms, the power model should be validated for both multithreaded and single-threaded applications. Finally, the estimation errors should be compared across specific types of applications, such as floating-point versus integer applications or CPU-bound versus memory-bound applications, to ensure that the model is not biased.

The validation results can suffer high error peaks due to limitations in the sampling rate of the power meter used during validation. For example, a maximum sampling rate of one sample per second means that we must accumulate PMC values at that rate and normalize them using the cycles elapsed during 1 s. This, in effect, averages the counter activity over the 1 s accumulation period, and the estimated power value (rightly) is the average power consumed over that duration. But during validation, the estimated power value is compared against the value from the power meter, which is read at the 1 s boundary. As a result, whenever there is a rapid change in counter activity, the power estimated for that sample is significantly lower (for a positive surge) or higher (for a negative surge) than the power meter value. Finally, a model can be no more accurate than the information used to build it. For instance, all power measurement devices suffer from some inherent (hopefully small) error.
Performance counter implementations also display non-determinism [53] and error [55]. As discussed in Section 4.2.1, temperature plays a large part in model formation, and the lm-sensors driver reads the temperature from on-die thermal diodes that are not very accurate for some processor models. All of these factors impact model accuracy. Given all these sources of inaccuracy, there seems little need for more complex, sophisticated mathematics when building a model.
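The validation statistics discussed above (median error, its standard deviation, and the CDF points quoted for the Core™ i7 and Core™ Duo) can be computed from paired per-sample estimates and measurements. A minimal sketch, assuming the inputs are lists of estimated and measured watts for the same sampling intervals:

```python
# Sketch: per-benchmark validation statistics for a power model --
# median absolute percentage error, its standard deviation, and the
# fraction of samples under 5% and 10% error (two points on the CDF).
import statistics

def validation_stats(estimated, measured):
    errors = [abs(e - m) * 100.0 / m for e, m in zip(estimated, measured)]
    return {
        "median_err_pct": statistics.median(errors),
        "stdev_err_pct": statistics.stdev(errors) if len(errors) > 1 else 0.0,
        "frac_under_5pct": sum(e < 5.0 for e in errors) / len(errors),
        "frac_under_10pct": sum(e < 10.0 for e in errors) / len(errors),
    }
```

Plotting the full sorted error list against sample rank gives the CDF curves of Fig. 17.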

5. POWER-AWARE RESOURCE MANAGEMENT

In the previous section, we discussed techniques to estimate the power consumption of processor resources using power modeling. In this section, we discuss the applicability of these power models to resource managers that perform task scheduling. Using the power model, the scheduler can quantitatively assess the impact of scheduling decisions on power consumption in real time. The power models are essential when the resource scheduler has to guide scheduling decisions under the constraint of a strict power budget (rather than aiming for efficient scheduling in general). A power model also simplifies the scheduler code, since the scheduler deals with a single point of reference instead of keeping track of multiple performance counters. Power models can be incorporated in both kernel-level schedulers [21, 30] and user-level (meta-) schedulers [3, 37, 42]. Next we look at an example of how a task scheduler can control the power consumption of the processor in real time.

To demonstrate one use of online power models, we experiment with the user-level meta-scheduler of Singh et al. [41, 42]. This live power manager maintains a user-defined system power budget by scheduling tasks appropriately and/or by using DVFS. We use the power model to compute core power consumption dynamically. The application spawns one process per core. The scheduler reads PMC values via pfmon and feeds the sampled PMC values to the power model to estimate core power consumption. The meta-scheduler binds the affinity of each process to a particular core to simplify task management and power estimation. It calculates power values at a set interval (1 s in our case) and compares the system power envelope with the sum of power for all cores together with the uncore power. We calculate uncore power by subtracting the idle CPU power from the idle system power.
Idle CPU power can be measured using the techniques described in Section 3.1.2 or 3.1.3. When such an infrastructure is not available, idle CPU power values can be taken from technology sites such as AnandTech and Tom's Hardware. When the values are not available even from those websites, we can approximate idle CPU power by first observing the idle system power and then removing hardware components (e.g., the Ethernet card, hard disk, and RAM DIMMs) to derive an approximate value of the power consumed outside the core.

When the scheduler detects a breach of the power envelope, it takes steps to force down the power consumption. The scheduler employs two knobs to control system power consumption: dynamic voltage and frequency scaling as a fine knob, and process suspension as a coarse knob. When the envelope is breached, the scheduler first tries to lower the power consumption by scaling down the frequency. If power consumption

Fig. 18. Flow diagram for the meta-scheduler.

remains too high, the scheduler starts suspending processes to meet the envelope's demands. When the estimated power consumption is less than the target power envelope, the scheduler checks whether any suspended processes can be resumed. If the gap between the current and target power budgets is not large enough to resume a suspended process, and if the processor is operating at a frequency lower than the maximum, the scheduler scales up the voltage and frequency. Figure 18 shows the flow diagram of the meta-scheduler.
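The decision step of this control loop can be sketched as follows. The helper callables (set_freq, suspend_one, resume_one, can_resume) stand in for the PMC-driven power model and OS mechanisms and are assumptions of this example, as is the frequency table.

```python
# Sketch of the meta-scheduler's per-interval decision: DVFS as the fine
# knob, process suspension as the coarse knob. FREQS_GHZ and all callbacks
# are illustrative stand-ins, not the actual implementation.
FREQS_GHZ = [1.33, 1.596, 1.862, 2.128, 2.394, 2.66, 2.926]

def schedule_step(core_powers, uncore_power, envelope,
                  freq_index, can_resume, set_freq, suspend_one, resume_one):
    total = sum(core_powers) + uncore_power       # model estimate + uncore
    if total > envelope:
        if freq_index > 0:                        # fine knob: step down DVFS
            freq_index -= 1
            set_freq(FREQS_GHZ[freq_index])
        else:                                     # coarse knob: suspend
            suspend_one()
    else:
        headroom = envelope - total
        if can_resume(headroom):                  # slack for a whole process?
            resume_one()
        elif freq_index < len(FREQS_GHZ) - 1:     # otherwise scale back up
            freq_index += 1
            set_freq(FREQS_GHZ[freq_index])
    return freq_index
```

Calling this once per sampling interval (1 s in the case study) reproduces the priority order of Fig. 18: frequency scaling is exhausted before any process is suspended.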

5.1 Sample Policies

When the scheduler suspends a process, it needs to choose the process whose suspension will have the least impact on the overall completion time. We explore the use of our power model in a scheduler via two sample policies for process suspension.

The Throughput policy targets maximum power efficiency (maximum instructions/watt) under a given power envelope. When the envelope is breached, the scheduler calculates the ratio of instructions/UOPS retired to the power consumed for each core and suspends the process that has committed the fewest instructions per watt. When resuming a process, it selects the process (if there is more than one suspended process) that had committed the most instructions/watt at the time of suspension. This policy favors processes that are less often stalled while waiting for load operations to complete, and thus favors CPU-bound applications.

The Fairness policy divides the available power budget equally among all processes. When applying this policy, the scheduler maintains a running average of the power consumed by each core. When the scheduler must choose a process for suspension, it chooses the process that has consumed the most average power. For resumption, the scheduler chooses the process that had consumed the least average power at the time of suspension. This policy strives to regulate core temperature, since it throttles the cores consuming the most average power. Because core power consumption correlates strongly with core temperature, this ensures that the core with the highest temperature receives time to cool down, while cores with lower temperatures continue working. Since memory-bound applications stall more often, they consume less average power, and so this policy favors memory-bound applications.
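The victim-selection rules of the two policies reduce to a single comparison each. In the sketch below, each process record carries the per-interval statistics described above; the field names are hypothetical.

```python
# Sketch of the two suspension policies. Each process record carries
# hypothetical per-interval statistics: instructions retired, estimated
# watts for the interval, and a running average of watts consumed.
def pick_victim_throughput(procs):
    # Throughput: suspend the process with the fewest instructions per watt.
    return min(procs, key=lambda p: p["instructions"] / p["watts"])

def pick_victim_fairness(procs):
    # Fairness: suspend the process with the highest running-average power.
    return max(procs, key=lambda p: p["avg_watts"])
```

Resumption inverts each comparison: Throughput resumes the suspended process with the most instructions/watt, Fairness the one with the least average power.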

5.2 Experimental Setup

We conduct our scheduler experiments on an Intel® Core™ i7 processor. We divide our workloads into three sets based on CPU intensity, which we define as the ratio of instructions retired to last-level cache misses. The three sets are designated CPU-Bound, Moderate, and Memory-Bound (in decreasing order of CPU intensity). Apart from these three sets, we also experiment with a mixed set of workloads chosen for their similar execution times. The workloads in these sets are listed in Table VI, and their unconstrained execution times are shown in Fig. 19. We conduct the experiments by setting the power envelope to 90%, 80%, and 70% of the peak power usage and measuring the total execution time of all applications in the workload under a given policy.

Table VI Workloads for scheduler evaluation.

Benchmark category   Benchmark applications        Peak system power (W)
CPU-bound            ep, gamess, namd, povray      130
Moderate             art, lu, wupwise, xalancbmk   135
Memory-bound         astar, mcf, milc, soplex      130
Mixed                ua, sp, soplex, povray        145
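The CPU-intensity metric used to build these sets is a simple ratio of two PMC values. A minimal sketch of the classification, with illustrative thresholds (the chapter does not state the cutoffs actually used):

```python
# Sketch: classify a workload by CPU intensity, defined in the text as
# instructions retired per last-level-cache miss. The threshold values
# are illustrative assumptions, not the cutoffs used in the chapter.
def classify(instructions_retired, llc_misses,
             cpu_bound_min=5000, memory_bound_max=500):
    intensity = instructions_retired / max(llc_misses, 1)
    if intensity >= cpu_bound_min:
        return "CPU-bound"
    if intensity <= memory_bound_max:
        return "Memory-bound"
    return "Moderate"
```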


Fig. 19. Absolute runtimes for unconstrained workloads on the Intel® Core™ i7: (a) CPU-bound, (b) Memory-bound, (c) Moderate, (d) Mixed.

5.3 Results

Figure 20 shows the normalized runtimes for all the workloads when only process suspension is used to maintain the power envelope. As per the results obtained by Singh et al. [41], the Throughput policy should favor CPU-bound workloads, while the Fairness policy should favor memory-bound workloads, but this distinction is not clearly visible here because of the differences in the runtimes of the various workloads. To achieve the best possible total runtime, the scheduler should always select shorter workloads for suspension. This ensures that the longest workload, which is critical to the total runtime, is never throttled, and hence the impact on runtime is minimal. This is clearly visible in the execution times of the CPU-bound benchmarks under the Throughput policy. The CPU-bound applications ep and gamess have the lowest computational intensities and execution times. As a result, these two applications are suspended most frequently, which does not affect the total execution time, even when the power envelope is set to 80% of peak usage.

Figure 21 shows the results when the scheduler uses both DVFS and process suspension to maintain the given power envelope. As noted, the scheduler uses DVFS as a fine knob and process suspension as a coarse

Fig. 20. Runtimes for workloads on the Intel® Core™ i7 (without DVFS): (a) CPU-bound, (b) Memory-bound, (c) Moderate, (d) Mixed.


Fig. 21. Runtimes for workloads on the Intel® Core™ i7 (with DVFS): (a) CPU-bound, (b) Memory-bound, (c) Moderate, (d) Mixed.

knob in maintaining the envelope. The Intel® Core™ i7-870 processor that we use for our experiments supports 14 different voltage-frequency points, ranging from 2.926 to 1.197 GHz. For our experiments, we have built models for seven frequency points (2.926, 2.66, 2.394, 2.128, 1.862, 1.596, and 1.33 GHz), and we adjust the processor frequency across these points. The experimental results show that for the CPU-bound and moderate benchmarks, there is hardly any difference in execution time under the different suspension policies. This suggests that for these applications, the scheduler hardly needs to suspend processes: regulating the DVFS points proves sufficient to maintain the power envelope. Performance with DVFS degrades compared to the cases where no DVFS is used, except for the mixed workload set. The explanation lies in the difference between the runtimes of the applications within each workload set. When no DVFS is used, all processes run at full speed; even when one of the processes is suspended, if that process is not critical, it still runs at full speed later, in parallel with the critical process. But when DVFS is given higher priority over process suspension and the envelope is breached, all processes are slowed, which affects the total execution time. This is further supported by the results of the mixed workload set: since the differences in runtimes among its individual applications are small, it shows a performance improvement over the non-DVFS case.

5.4 Further Reading

Apart from the case study above, many innovative and interesting research papers have been published in the area of power-aware scheduling. Rajamani et al. [37] use their power-estimation model to drive a power management policy called Performance Maximizer. For a given power budget, they exploit the DVFS levels of the processor to try to maximize processor performance. They apply their power model to estimate power consumption at the current performance state (P-state), then estimate the power consumption at the other P-states by linearly scaling the current power value with frequency. The scheduler raises the performance state to the highest level whose estimate is safely below the power budget. Banikazemi et al. [3] use a power-aware meta-scheduler. Their meta-scheduler monitors the performance, power, and energy of the system using performance counters and built-in power-monitoring hardware. It uses this information to dynamically remap software threads on multi-core servers for higher performance and lower energy usage. Their framework is flexible enough to substitute a performance-counter-based model for the hardware power monitor. Isci et al. [24] analyze global power-management policies that enforce a given power budget and minimize power consumption for a given performance target. They conduct their experiments on the Turandot [31] simulator. They assume the presence of on-core current sensors to acquire core power information, while they use performance counters to gather core performance information. They develop a global power manager that periodically monitors the power and performance of each core and sets the operating mode (akin to DVFS performance states) of the core for the next monitoring interval.
They assume that the power mode can be set independently for each core. They experiment with three policies to evaluate their global power manager. The Priority policy assigns different priorities to the cores of a multi-core processor and tries to speed up the core with the highest priority while slowing down the lowest-priority core when the power consumption overshoots the assigned budget. The pullHipushLo policy is similar to our Fairness policy from the case study above; it tries to balance the power consumption of the cores by slowing down the core with the highest power consumption when the power budget is exceeded and speeding up the core with the lowest power consumption when there is a power surplus. MaxBIPS tries to maximize system throughput by choosing the combination of power modes across cores that is predicted to provide the maximum overall BIPS (billion instructions per second). Meng et al. [29] apply a multi-optimization power-saving strategy to meet the constraints of a chip-wide power budget on reconfigurable processors. They run a global power manager that configures the CPU frequency and/or cache size of individual cores. They use risk analysis to evaluate the trade-offs between power-saving optimizations and potential performance loss. They select the power-saving strategies at design time to create a static pool of candidate optimizations. They build an analytic power and performance model using performance counters and sensors that allows them to quickly evaluate many power modes, enabling their power manager to choose, at periodic intervals, a global power mode that obeys the processor-wide power budget while maximizing throughput.

6. DISCUSSION

To help researchers and practitioners design energy-efficient systems, we first need robust techniques to generate energy metrics. This requires efficient, adaptable, and accurate methods of measuring or estimating system power consumption. Making this information accessible to resource managers in either software or hardware (or a combination of the two) enables much more efficient system operation. Here we have explained one approach to providing the infrastructure for implementing smarter, power-aware resource managers.

In this chapter, we discussed methodologies that enable the collection and flow of information from hardware to software. We discussed techniques to measure system power consumption at various temporal granularities, comparing the information available at three different points in a system. We then presented a methodology for estimating power consumption using event-based power models instead of measuring power empirically. This approach is inexpensive and straightforward to implement, is inherently flexible, and is portable: it allows easy comparison among different platforms (as opposed to embedding such a model in hardware, as in Intel's Sandy Bridge microarchitecture). Models such as ours have been implemented within kernel schedulers [9, 21] and within virtual machines (part of our ongoing work); they are efficient to compute, and add virtually no overhead to the schedulers employing them.

Information and Communication Technology is a significant contributor to the global carbon footprint, and it continues to grow. One of the major components of this growing emissions footprint is the electricity consumed by ICT. As computing technology becomes increasingly ubiquitous in our everyday lives, the importance of improving the energy efficiency of computing systems grows in accordance.
Green Computing entails much more than just energy-efficient system operation, but reducing power consumption in platforms ranging from embedded systems and handheld devices to data centers and up to the coming exascale systems is an important component in managing the ICT footprint. Furthermore, creating more energy-efficient systems is well within reach of today's technology.

REFERENCES
[1] T. Austin, SimpleScalar 4.0 Release Note.
[2] D.H. Bailey, T. Harris, W.C. Saphir, R.F. Van der Wijngaart, A.C. Woo, M. Yarrow, The NAS Parallel Benchmarks 2.0, Report NAS-95-020, NASA Ames Research Center, December 1995.
[3] M. Banikazemi, D. Poff, B. Abali, PAM: a novel performance/power aware meta-scheduler for multi-core systems, Proceedings of the IEEE/ACM Supercomputing International Conference on High Performance Computing, Networking, Storage and Analysis, No. 39, November 2008.
[4] D. Bedard, M.Y. Lim, R. Fowler, A. Porterfield, Powermon: fine-grained and integrated power monitoring for commodity computer systems, Proceedings of the IEEE SoutheastCon 2010, March 2010, pp. 479–484.

[5] F. Bellosa, S. Kellner, M. Waitz, A. Weissel, Event-driven energy accounting for dynamic thermal management, Proceedings of the Workshop on Compilers and Operating Systems for Low Power, September 2003.
[6] M. Berktold, T. Tian, CPU Monitoring With DTS/PECI, White Paper, Intel Corporation, September 2010.
[7] R. Bertran, M. Gonzalez, X. Martorell, N. Navarro, E. Ayguade, Decomposable and responsive power models for multicore processors using performance counters, Proceedings of the 24th ACM International Conference on Supercomputing, June 2010, pp. 147–158.
[8] C. Bienia, S. Kumar, J.P. Singh, K. Li, The PARSEC benchmark suite: characterization and architectural implications, Proceedings of the IEEE/ACM International Conference on Parallel Architectures and Compilation Techniques, October 2008, pp. 72–81.
[9] C. Boneti, R. Gioiosa, F.J. Cazorla, M. Valero, A dynamic scheduler for balancing HPC applications, Proceedings of the IEEE/ACM Supercomputing International Conference on High Performance Computing, Networking, Storage and Analysis, No. 41, November 2008.
[10] D. Brooks, V. Tiwari, M. Martonosi, Wattch: a framework for architectural-level power analysis and optimizations, Proceedings of the 27th IEEE/ACM International Symposium on Computer Architecture, June 2000, pp. 83–94.
[11] G. Contreras, M. Martonosi, Power prediction for Intel XScale processors using performance monitoring unit events, Proceedings of the IEEE/ACM International Symposium on Low Power Electronics and Design, August 2005, pp. 221–226.
[12] Intel Corporation, Intel Turbo Boost Technology in Intel Core™ Microarchitecture (Nehalem) Based Processors, White Paper, Intel Corporation, November 2008.
[13] Intel Corporation, Voltage Regulator-Down VRD 11.1, Design Guidelines, Intel Corporation, September 2009.
[14] Intel Corporation, Intel Core i7-800 and i5-700 Desktop Processor Series, Datasheet, Intel Corporation, July 2010.
[15] LEM Corporation, Current Transducer LTS 25-NP, Datasheet, LEM, November 2009.
[16] Z. Cui, Y. Zhu, Y. Bao, M. Chen, A fine-grained component-level power measurement method, Proceedings of the 2nd International Green Computing Conference, July 2011, pp. 1–6.
[17] D. Economou, S. Rivoire, C. Kozyrakis, P. Ranganathan, Full-system power analysis and modeling for server environments, Proceedings of the Workshop on Modeling, Benchmarking, and Simulation, June 2006.
[18] Electronic Educational Devices, Watts Up PRO, May 2009.
[19] R.A. Giri, A. Vanchi, Increasing data center efficiency with server power measurements, White Paper, Intel Corporation, January 2010.
[20] B. Goel, Per-core power estimation and power aware scheduling strategies for CMPs, Master's Thesis, Chalmers University of Technology, January 2011.
[21] B. Goel, S.A. McKee, R. Gioiosa, K. Singh, M. Bhadauria, M. Cesati, Portable, scalable, per-core power estimation for intelligent resource management, Proceedings of the 1st International Green Computing Conference, August 2010, pp. 135–146.
[22] M. Govindan, S. Keckler, D. Burger, End-to-end validation of architectural power models, Proceedings of the 14th ACM/IEEE International Symposium on Low Power Electronics and Design, July 2009, pp. 383–388.
[23] Intel, Intel 64 and IA-32 Architectures Software Developer's Manual, May 2012.

[24] C. Isci, A. Buyuktosunoglu, C.Y. Cher, P. Bose, M. Martonosi, An analysis of efficient multi-core global power management policies: maximizing performance for a given power budget, Proceedings of the IEEE/ACM 39th Annual International Symposium on Microarchitecture, December 2006, pp. 347–358.
[25] C. Isci, M. Martonosi, Runtime power monitoring in high-end processors: methodology and empirical data, Proceedings of the IEEE/ACM 36th Annual International Symposium on Microarchitecture, December 2003, pp. 93–104.
[26] R. Joseph, M. Martonosi, Run-time power estimation in high-performance microprocessors, Proceedings of the IEEE/ACM International Symposium on Low Power Electronics and Design, August 2001, pp. 135–140.
[27] B.C. Lee, D.M. Brooks, Accurate and efficient regression modeling for microarchitectural performance and power prediction, Proceedings of the 12th ACM Symposium on Architectural Support for Programming Languages and Operating Systems, October 2006, pp. 185–194.
[28] A. Mathur, S. Roy, R. Bhatia, A. Chakraborty, V. Bhargava, J. Bhartia, Joulequest: an accurate power model for the StarCore DSP platform, Proceedings of the 20th International Conference on VLSI Design, January 2007, pp. 521–526.
[29] K. Meng, R. Joseph, R.P. Dick, L. Shang, Multi-optimization power management for chip multiprocessors, Proceedings of the 17th International Conference on Parallel Architectures and Compilation Techniques, 2008, pp. 177–186.
[30] A. Merkel, F. Bellosa, Balancing power consumption in multicore processors, Proceedings of the ACM SIGOPS/EuroSys European Conference on Computer Systems, April 2006, pp. 403–414.
[31] M. Moudgill, P. Bose, J. Moreno, Validation of Turandot, a fast processor model for microarchitecture exploration, Proceedings of the International Performance, Computing, and Communications Conference, February 1999, pp. 452–457.
[32] T. Mudge, Power: a first-class architectural design constraint, IEEE Comput. 34 (2001) 52–57.
[33] S. Murugesan, Harnessing green IT: principles and practices, IEEE IT Prof. 10 (1) (2008) 24–33.
[34] National Instruments Corporation, NI Bus-Powered M Series Multifunction DAQ for USB, April 2009.
[35] Y.-H. Park, S. Pasricha, F.J. Kurdahi, N. Dutt, A multi-granularity power modeling methodology for embedded processors, IEEE Trans. VLSI 19 (4) (2011) 668–681.
[36] K.K. Pusukuri, D. Vengerov, A. Fedorova, A methodology for developing simple and robust power models using performance monitoring events, Proceedings of the 6th Annual Workshop on the Interaction between Operating Systems and Computer Architecture, June 2009.
[37] K. Rajamani, H. Hanson, J. Rubio, S. Ghiasi, F. Rawson, Application-aware power management, Proceedings of the IEEE International Symposium on Performance Analysis of Systems and Software, October 2006, pp. 39–48.
[38] E. Rotem, A. Naveh, D. Rajwan, A. Ananthakrishnan, E. Weissmann, Power-management architecture of the Intel microarchitecture code-named Sandy Bridge, IEEE Micro 32 (2) (2012) 20–27.
[39] M. Sami, D. Sciuto, C. Silvano, V. Zaccaria, An instruction-level energy model for embedded VLIW architectures, IEEE Trans. Comput. Aided Des. Integr. Circuits Syst. 21 (9) (2002) 998–1010.
[40] Server System Infrastructure Forum, EPS12V Power Supply Design Guide, 2.92 ed., HP, SGI, and IBM, 2006.
[41] K. Singh, Prediction strategies for power-aware computing on multicore processors, PhD Thesis, Cornell University, 2009.

[42] K. Singh, M. Bhadauria, S.A. McKee, Real time power estimation and thread scheduling via performance counters, Proceedings of the Workshop on Design, Architecture and Simulation of Chip Multi-Processors, November 2008.
[43] C. Spearman, The proof and measurement of association between two things, Am. J. Psychol. 15 (1) (1904) 72–101.
[44] E. Stahl, Power Benchmarking: A New Methodology for Analyzing Performance by Applying Energy Efficiency Metrics, White Paper, IBM, 2006.
[45] Standard Performance Evaluation Corporation, SPEC CPU Benchmark Suite, 2000.
[46] Standard Performance Evaluation Corporation, SPEC OMP Benchmark Suite, 2001.
[47] Standard Performance Evaluation Corporation, SPEC CPU Benchmark Suite, 2006.
[48] Standard Performance Evaluation Corporation, SPECpower_ssj2008 Benchmark Suite, 2008.
[49] C. Sun, L. Shang, R.P. Dick, Three-dimensional multiprocessor system-on-chip thermal optimization, Proceedings of the 5th IEEE/ACM International Conference on Hardware/Software Codesign and System Synthesis, 2007, pp. 117–122.
[50] The Climate Group, SMART 2020: Enabling the Low Carbon Economy in the Information Age, GeSI's Activity Report, The Climate Group on behalf of the Global eSustainability Initiative (GeSI), June 2008.
[51] V. Tiwari, S. Malik, A. Wolfe, M.T.-C. Lee, Instruction level power analysis and optimization of software, Proceedings of the 9th International Conference on VLSI Design, January 1996, pp. 326–328.
[52] X. Wang, M. Chen, Cluster-level feedback power control for performance optimization, Proceedings of the 14th IEEE International Symposium on High Performance Computer Architecture, February 2008, pp. 101–110.
[53] V.M. Weaver, J. Dongarra, Can hardware performance counters produce expected, deterministic results?, Proceedings of the Third Workshop on Functionality of Hardware Performance Monitoring, December 2010.
[54] V.M. Weaver, S.A. McKee, Can hardware performance counters be trusted? Technical Report CSL-TR-2008-1051, Cornell University, August 2008.
[55] V.M. Weaver, S.A. McKee, Can hardware performance counters be trusted? Proceedings of the IEEE International Symposium on Workload Characterization, September 2008, pp. 141–150.
[56] W. Ye, N. Vijaykrishnan, M. Kandemir, M.J. Irwin, The design and use of SimplePower: a cycle-accurate energy estimation tool, Proceedings of the 37th ACM/IEEE Design Automation Conference, June 2000, pp. 340–345.
[57] D. Zaparanuks, M. Jovic, M. Hauswirth, Accuracy of performance counter measurements, Technical Report USI-TR-2008-05, Università della Svizzera italiana, September 2008.

ABOUT THE AUTHORS

Bhavishya Goel received his bachelor's degree in Electronics and Communication Engineering from DDIT, India, and his master's degree in Integrated Electronic System Design from Chalmers University of Technology, Sweden. He is currently pursuing his doctoral studies in the area of Computer Architecture at Chalmers University of Technology. His research areas include power modeling, power-aware scheduling, and reconfigurable memory systems. He has interned at RUAG Aerospace and has worked at eInfochips Ltd.

Sally A. McKee received her bachelor's degree in Computer Science from Yale University, her master's from Princeton University, and her doctorate from the University of Virginia. She held positions at Digital Equipment Corporation, Bell Labs, and Intel before, during, and after graduate school. McKee worked as a Post-Doctoral Research Associate in the University of Virginia Computer Science Department for a year after her Ph.D. (waiting for the chip to come back from fabrication). She became a Research Assistant Professor at the University of Utah's School of Computing in July 1998, where she worked on the Impulse Adaptable Memory Controller project. She moved to Cornell University's Computer Systems Lab within the School of Electrical and Computer Engineering in July 2002. Since November 2008, she has been an Associate Professor in Computer Science and Engineering at Chalmers University of Technology. Her research has focused largely on analyzing application memory behavior and designing more efficient memory systems, together with the software to exploit them. Other projects range from the development of efficient, validated modeling tools to power-aware resource management.

Magnus Själander received his M.S. degree (2003) in Computer Science and Engineering from Luleå University of Technology, Sweden, and both the Lic.Eng. degree (2006) and the Ph.D. degree (2008) in Computer Engineering from Chalmers University of Technology, Sweden. He is currently working as a post-doctoral researcher at Florida State University. His research interests include energy-efficient computing, high-performance and low-power digital circuits, micro-architecture and memory-system design, and hardware-software interaction. He has also interned at NXP Semiconductors, worked at Aeroflex Gaisler, and been a post-doctoral researcher at Chalmers University of Technology.