Power Measurement Using Performance Counters

October 2016

1 Introduction

CPUs are built on complementary metal-oxide-semiconductor (CMOS) technology. In theory, CMOS circuits dissipate power only when they switch state, which accounts for the dynamic power of the CPU. In practice there is also leakage, which is called static power.

CPU Total Power Dissipation = Dynamic Power + Static Power

Static power dissipation has two components:

• Subthreshold conduction

• Tunnelling current through gate oxide layer.

Tunnelling is becoming a larger contributor as process sizes shrink. The gate oxide layer becomes thinner, which makes it easier for electrons to tunnel through the insulator. Because the insulating layers are getting thinner while the supply voltage stays the same, tunnelling is the largest component of leakage and the main source of static power dissipation.

$P_{\text{static}} = m \times V$, where m is a constant and V is the CPU core voltage.

Dynamic power dissipation has two components:

• Transition power dissipation

• Short-circuit power dissipation

Transition power arises from the voltage source charging the gates as if they were a capacitor, which then discharges to ground. This process yields the following equation, where f is the operating frequency, V is the core voltage and C is the capacitance of the circuit.

$P_{\text{transition}} = \frac{1}{2} C V^{2} f$

The objective of this study is to understand the variations in the power and energy requirements of core and uncore system components while executing different benchmarks.

Objective 1: Identification of the different types of power measurement counters (core and uncore) available on smartphone, desktop, and server grade hardware.

Objective 2:

1. Measurement of CPU and system power (ideally broken down by different components) in various workload scenarios.

2. Measurement of total energy consumption of CPU and system (ideally broken down by different components).

2 Case Study

The experiments were carried out on a desktop machine and a laptop machine. The configuration of the two machines is detailed below:

2.1 System Configuration

Desktop

| S.N. | Parameter | System Specification |
|---|---|---|
| 1. | Model Name | Dell Optiplex 9020 |
| 2. | Processor Type | i7 |
| 3. | CPU Max Frequency | 4000 MHz |
| 4. | CPU Min Frequency | 800 MHz |
| 5. | Number of Sockets | 1 |
| 6. | Number of Physical Cores | 4 |
| 7. | Number of Virtual Cores | 8 |
| 8. | L3 Cache | 8 MB |
| 9. | Main Memory | 8 GB |

Laptop

| S.N. | Parameter | System Specification |
|---|---|---|
| 1. | Model Name | HP Pavilion Dv6 |
| 2. | Processor Type | Intel i5 |
| 3. | CPU Max Frequency | 2700 MHz |
| 4. | CPU Min Frequency | 800 MHz |
| 5. | Number of Sockets | 1 |
| 6. | Number of Physical Cores | 2 |
| 7. | Number of Virtual Cores | 4 |
| 8. | L3 Cache | 3 MB |
| 9. | Main Memory | 8 GB |

2.2 Performance Counters

Performance counters provide information about how well the operating system or an application, service, or driver is performing. They are hardware registers in the processor that count various programmable events occurring while the processor executes instructions. Performance monitoring counters reveal a considerable amount of information about power consumption.

The performance counters available on the desktop and laptop machines are described below:

Instructions per Cycle (IPC): Power consumption of a processor depends on its activity. If the IPC is high, the processor will very likely consume more power.

Fetch counters: IPC considers only the retired instructions, but processors execute many instructions speculatively. These are flushed on branch mispredictions but still consume power. Hence, we keep track of the number of fetched instructions, branch correct predictions (BCP) and branch mispredictions (BMP).

Miss/Hit counters: Upon cache misses, the processor stalls. Thus, the events L1 hit, L1 miss, L2 hit, L2 miss, page hit and TLB miss may impact the power consumed.

Retired instructions counters: Depending on the type of the retired instructions (integer (INT), floating-point (FP), memory, branch), different functional units are exercised. If some of these are power-hungry (say FP), then monitoring the type of retired instructions allows power to be estimated more accurately.

Stalls: Processors stall due to dependencies (data or resource conflicts).
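As an illustration of how the counters above are typically combined, the following minimal Python sketch derives IPC, the branch misprediction rate and the L1 miss rate from raw event counts. The field names and the sample values are hypothetical and are not taken from the measurements in this report.

```python
from dataclasses import dataclass


@dataclass
class CounterSample:
    """Raw event counts for one sampling interval (hypothetical field names)."""
    cycles: int
    retired_instructions: int
    branch_correct: int        # BCP
    branch_mispredicted: int   # BMP
    l1_hits: int
    l1_misses: int


def ipc(s: CounterSample) -> float:
    """Retired instructions per clock cycle."""
    return s.retired_instructions / s.cycles if s.cycles else 0.0


def branch_mispredict_rate(s: CounterSample) -> float:
    """Fraction of predicted branches that were mispredicted."""
    total = s.branch_correct + s.branch_mispredicted
    return s.branch_mispredicted / total if total else 0.0


def l1_miss_rate(s: CounterSample) -> float:
    """Fraction of L1 accesses that missed."""
    total = s.l1_hits + s.l1_misses
    return s.l1_misses / total if total else 0.0


# Made-up counts, for illustration only.
sample = CounterSample(
    cycles=2_000_000,
    retired_instructions=3_000_000,
    branch_correct=450_000,
    branch_mispredicted=50_000,
    l1_hits=900_000,
    l1_misses=100_000,
)

print(f"IPC: {ipc(sample):.2f}")
print(f"Branch misprediction rate: {branch_mispredict_rate(sample):.2%}")
print(f"L1 miss rate: {l1_miss_rate(sample):.2%}")
```

Ratios such as these, rather than raw counts, are usually what a counter-based power model takes as input, since they are independent of the length of the sampling interval.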

2.3 Benchmarks Used

The following benchmarks were used in the experiments:

While(1): A simple while loop that keeps the CPU busy, set to iterate a large number of times.

Linpack: This benchmark measures a system’s floating-point computing power. HPLinpack was used in the experiment. It measures how fast a computer solves a dense n by n system of linear equations Ax = b.

181.mcf: A benchmark derived from a program used for single-depot vehicle scheduling in public mass transportation. The program is written in C, and the benchmark version uses almost exclusively integer arithmetic. The benchmark requires about 100 and 190 megabytes on a 32-bit and a 64-bit architecture, respectively. It is a cache-memory intensive benchmark.

dd utility: dd is a command-line utility for UNIX-like operating systems. It can back up and restore IMG files to memory cards and disks. In this experiment it was used to copy a 10 GB file from one location on the disk to another. The syntax is ‘dd if=<source file name> of=<target file name> [Options]’.

3 Methodology Used

In this experiment, we used Intel's Performance Counter Monitor (PCM) tool to measure system power and energy behavior under different workload scenarios. PCM provides sample C++ routines and utilities to estimate internal resource utilization. The tool reports the energy consumed by the socket and by DRAM during the last one second. Because the reporting interval is one second, the energy consumed in that interval (in joules) is numerically equal to the average power (in watts). PCM can also print its output as comma-separated values (CSV format).
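The relation between per-interval energy readings, average power and total energy can be sketched as follows. This is a minimal Python example that assumes the readings have already been exported to CSV with one row per sampling interval. The column names `socket_energy_j` and `dram_energy_j` are hypothetical placeholders, not the headers PCM actually writes, and the sample values are made up.

```python
import csv
import io


def summarize_energy(csv_text: str, interval_s: float = 1.0) -> dict:
    """Total energy (J) and average power (W) from per-interval energy readings.

    Assumes one row per sampling interval and hypothetical column names.
    """
    socket_j = []
    dram_j = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        socket_j.append(float(row["socket_energy_j"]))
        dram_j.append(float(row["dram_energy_j"]))
    if not socket_j:
        raise ValueError("no samples found")

    duration_s = len(socket_j) * interval_s
    total_socket = sum(socket_j)
    total_dram = sum(dram_j)
    return {
        "duration_s": duration_s,
        "socket_energy_j": total_socket,
        "dram_energy_j": total_dram,
        # Average power is total energy divided by elapsed time.
        "socket_avg_power_w": total_socket / duration_s,
        "dram_avg_power_w": total_dram / duration_s,
    }


# Made-up readings for three one-second intervals.
SAMPLE = """socket_energy_j,dram_energy_j
20.0,2.0
30.0,3.0
40.0,4.0
"""

summary = summarize_energy(SAMPLE)
for key, value in summary.items():
    print(f"{key}: {value}")
```

Summing the per-interval energies gives the total energy of a run, while dividing by the run duration gives the average power, so both quantities can be derived from the same log.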

The experiments were run using two CPU governor modes: performance and powersave. A CPU governor controls how the CPU raises and lowers its frequency in response to the demands the user places on the device. CPU frequency scaling enables the operating system to scale the CPU frequency up or down in order to save power.

The performance governor runs the CPU at the maximum frequency, whereas the powersave governor runs the CPU at the minimum frequency.
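The report does not say how the governor was switched. As a minimal sketch, assuming a Linux system that exposes the standard cpufreq sysfs interface, the active governor of each CPU can be read from and written to `scaling_governor` files as shown below. The helper names are illustrative, and writing to these files normally requires root privileges.

```python
from pathlib import Path

# Standard location of per-CPU cpufreq settings on Linux (assumed platform).
CPUFREQ_ROOT = Path("/sys/devices/system/cpu")
GOVERNOR_GLOB = "cpu[0-9]*/cpufreq/scaling_governor"


def read_governors(root: Path = CPUFREQ_ROOT) -> dict:
    """Return a mapping such as {'cpu0': 'powersave'} for every CPU found."""
    governors = {}
    for gov_file in sorted(root.glob(GOVERNOR_GLOB)):
        cpu_name = gov_file.parent.parent.name
        governors[cpu_name] = gov_file.read_text().strip()
    return governors


def set_governor(governor: str, root: Path = CPUFREQ_ROOT) -> int:
    """Write the governor name for every CPU and return how many were updated."""
    updated = 0
    for gov_file in root.glob(GOVERNOR_GLOB):
        gov_file.write_text(governor)
        updated += 1
    return updated


if __name__ == "__main__":
    # Reading is harmless; on systems without cpufreq this prints an empty dict.
    print(read_governors())
```

Checking the governor with `read_governors()` before and after each run is a cheap way to confirm that a measurement was really taken in the intended mode.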

For the desktop machine:

• Powersave mode frequency = 800 MHz

• Performance mode frequency = 4 GHz

For the laptop machine:

• Powersave mode frequency = 800 MHz

• Performance mode frequency = 2.7 GHz
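To give a feel for what these frequency settings imply, the sketch below applies the transition-power relation from the introduction, P = ½ · C · V² · f, to the two governor frequencies of each machine. It assumes, purely for illustration, that capacitance and core voltage are the same in both modes. Real processors usually lower the voltage along with the frequency, so the ratios computed here are a simplified estimate and not measured values.

```python
def transition_power(c_farads: float, v_volts: float, f_hz: float) -> float:
    """Transition (switching) power: P = 1/2 * C * V^2 * f."""
    return 0.5 * c_farads * v_volts ** 2 * f_hz


def power_ratio(f_high_hz: float, f_low_hz: float,
                v_high: float = 1.0, v_low: float = 1.0) -> float:
    """Ratio of transition power at two operating points.

    The capacitance cancels out, so a unit value is used. The default of
    equal voltages is an illustrative assumption.
    """
    return (transition_power(1.0, v_high, f_high_hz)
            / transition_power(1.0, v_low, f_low_hz))


# Governor frequencies listed above.
desktop_ratio = power_ratio(4000e6, 800e6)
laptop_ratio = power_ratio(2700e6, 800e6)

print(f"Desktop performance/powersave transition power ratio: {desktop_ratio:.3f}")
print(f"Laptop performance/powersave transition power ratio: {laptop_ratio:.3f}")
```

Passing different `v_high` and `v_low` values shows how strongly the quadratic voltage term affects the ratio compared with the linear frequency term.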

4 Experiment Results

In this experiment, we executed multiple instances/threads of every benchmark and compared the power and energy consumption (core and DRAM) of each benchmark in the two governor modes, performance and powersave.

We plotted four characteristics for each benchmark, namely runtime, core power, DRAM power and core energy, for the two target machines.

The different graphs for desktop machine are shown below:

1. Runtime Comparison

Observation

1. For the linpack benchmark in performance mode, the runtime for 2 instances is nearly 2.5 times that for a single instance.

- Every core has its own FPU, so there should be no contention for the FPU and the runtimes should have been similar.

- The likely causes are memory bus contention and cache contention.

- On reducing the problem size (number of equations to solve) so that the data stays in the cache, the difference in runtime between the two cases is insignificant.

- The runtime shown in the graph is for a problem size of 20000.

2. For CPU-intensive benchmarks (while loop, linpack, 181.mcf):

- Runtime depends on CPU speed and increases significantly on a slower CPU.

- Hence, there is a large increase in the runtime of these benchmarks in powersave mode (slow CPU at 800 MHz).

3. For a non-CPU-intensive workload such as dd:

- The hard drive cannot keep up with the processor anyway.

- There is no benefit from a faster processor.

2. Core Power Comparison

Observation

1. The trends are similar to those seen for the runtime of the benchmarks.

2. When all cores are idle, changing the frequency does not affect the core power consumption.

3. For the while loop and 181.mcf benchmarks, core power consumption with the performance governor enabled is nearly 5.5 times that with the powersave governor enabled.

4. For the linpack benchmark, core power consumption with the performance governor enabled is nearly 2.5 times that with the powersave governor enabled.

5. For dd, core power consumption is almost the same with both governors, as it stresses the disk subsystem, which is relatively slow.

3. DRAM Power Comparison

Observation

1. For the while loop, there is no significant difference in DRAM power between the two modes, as it is CPU-intensive and has no memory-access instructions.

– DRAM is in sleep mode most of the time.

2. For dd, DRAM power is similar in the two modes.

3. DRAM power for linpack and 181.mcf differs between the two modes.

– The number of memory-access instructions per unit time falls as the processor frequency is reduced.

The different graphs for the laptop machine are shown below:

1. Runtime Comparison

Observation

1. For 181.mcf, the runtime for two instances is significantly greater than that for a single instance.

– The cache size is 3 MB in the laptop, whereas it is 8 MB in the desktop.

2. For the linpack benchmark in performance mode, the runtime for 2 instances is nearly 2.5 times that for a single instance.

– Every core has its own FPU, so there should be no contention for the FPU and the runtimes should have been similar.

– The likely causes are memory bus contention and cache contention.

3. For dd, the runtime in the two modes is similar, as the disk is the bottleneck.

2. Core Power Comparison

3. DRAM Power Comparison

5 References

[1] Intel PCM Tool. https://software.intel.com/en-us/articles/intel-performance-counter-monitor

[2] SPEC 2000 MCF Benchmark. http://poorvi.cse.iitd.ac.in/~dahiya/CSL862.specbuild

[3] Linpack Benchmark. http://www.netlib.org/benchmark/hpl/

[4] http://async.org.uk/tech-reports/NCL-EEE-MICRO-TR-2015-197.pdf
