
Empirical Study of Power Consumption of x86-64 Instruction Decoder

Mikael Hirki1,2, Zhonghong Ou3, Kashif N. Khan1,2, Jukka K. Nurminen1,2,4, and Tapio Niemi1

1Helsinki Institute of Physics 2Aalto University 3Beijing University of Posts and Telecommunications 4VTT Research

Abstract

It has been a common myth that x86-64 processors suffer in terms of energy efficiency because of their complex instruction set. In this paper, we aim to investigate whether this myth holds true, and determine the power consumption of the instruction decoders of an x86-64 processor. To that end, we design a set of microbenchmarks that specifically trigger the instruction decoders by exceeding the capacity of the decoded instruction cache. We measure the power consumption of the processor package using a hardware-level energy metering model called the Running Average Power Limit (RAPL), which is supported in the latest Intel architectures. We leverage linear regression modeling to break down the power consumption of each processor component, including the instruction decoders. Through a comprehensive set of experiments, we demonstrate that the instruction decoders can consume between 3% and 10% of the package power when the capacity of the decoded instruction cache is exceeded. Overall, this is a somewhat limited amount of power compared with the other components in the processor core, e.g., the L2 cache. We hope our finding can shed light on the future optimization of processor architectures.

1 Introduction

Making data centers more energy efficient requires finding and eliminating new sources of inefficiency. With the recent introduction of ARM-based servers, we will soon have more CPU architectures to choose from. These servers are based on the new ARMv8 architecture that adds support for 64-bit computing. The first prototypes started shipping in 2014 [1]. Energy efficiency of the first models has been somewhat poor, but efficiency is expected to improve in later models [2].

The most commonly cited difference between ARM and Intel processors is the instruction set [5]. Intel processors use the x86-64 instruction set, which is a Complex Instruction Set Computer (CISC) architecture, while ARM designs are based on the Reduced Instruction Set Computer (RISC) architecture. CISC instructions are more complex: a single instruction can load a value from memory and add it to a register. Today these CISC instructions are decoded into smaller micro-operations, i.e. one instruction is translated into one or more micro-operations. A common myth is that decoding the x86-64 instructions is somehow expensive.

In this paper, we aim to investigate the myth that decoding x86-64 instructions is expensive. We leverage the new features that are present in Intel processor architectures from Sandy Bridge onwards. The two features are the Running Average Power Limit (RAPL) and a micro-op (short for micro-operation) cache. RAPL [6, 8] allows measuring the processor energy consumption at high accuracy. The micro-op cache allows the instruction decoders to be shut down while decoded instructions are served from the cache, thus improving energy efficiency. Our core idea is to design microbenchmarks that exhaust the micro-op cache, thus triggering the instruction decoders. This allows us to specifically measure the power consumed by the instruction decoders. The code for our microbenchmarks is available at https://github.com/mhirki/idq-bench2.

Our paper makes the following major contributions:

• We develop a set of microbenchmarks to accurately measure the power consumption of the instruction decoders in an x86-64 processor.

• We show that the percentage of power consumed by the instruction decoders is between 3% and 10% of the total package power.

• We conclude that the x86-64 instruction set is not a major hindrance in producing an energy-efficient processor architecture.

The rest of the paper is structured as follows. Section 2 gives a more detailed explanation of processor architecture and the Intel RAPL feature. Section 3 presents the hardware and software configuration as well as the implementation of our microbenchmarks. Section 4 presents the results obtained using our microbenchmarks. Section 5 reviews the related work. Finally, Section 6 concludes our paper and presents an idea for future work.

2 Background

2.1 Processor Architecture

[Figure 1: Simplified version of the Intel Haswell pipeline. Instructions flow from the L1 instruction cache through pre-decoding (16 bytes, up to 6 instructions) and the instruction queue into the decoders: one complex decoder (up to 4 µops), three simple decoders (1 µop each), and the microcode sequencer (4 µops). Decoded micro-ops enter the decode queue (56 micro-ops), which can also be fed from the micro-op cache, and then proceed to renaming and scheduling, the execution ports, and retirement.]

Figure 1 depicts a simplified version of the Intel Haswell pipeline. This simplified version highlights the micro-op cache, a new feature first introduced in the Sandy Bridge architecture. Before the new cache, the Intel architecture relied on a very small decoded instruction buffer and four instruction decoders. Three of the decoders are simple decoders that can only decode a single instruction into a single micro-op. The fourth decoder can receive any instruction and produce up to four micro-ops per cycle. Thus, instructions that are translated into more than one micro-op can only be decoded by the complex decoder. This creates a potential bottleneck if the frequency of complex instructions is high.

The micro-op cache solves these performance problems in some cases. In addition, it allows shutting down the instruction decoders when they are not used, thus saving power. The capacity of the micro-op cache is 1536 micro-ops, which is equivalent to eight kilobytes of x86 code at maximum.

2.2 Running Average Power Limit (RAPL)

Intel introduced the RAPL feature in their Sandy Bridge architecture. It produces an estimate of the energy consumed by the processor using a model based on microarchitectural counters. RAPL is always enabled in supported models and the energy estimates are updated every millisecond. RAPL supports up to four different power domains depending on the processor model. The Package domain estimates the energy consumed by the entire processor package. Power plane 0 is a subdomain which estimates the energy consumed by the processor cores. Power plane 1 is another subdomain which estimates the energy consumed by the integrated graphics in desktop models. The DRAM domain is a separate domain which estimates the energy used by the dual in-line memory modules (DIMMs) installed in the system. In this paper, we present measurements of the RAPL Package domain exclusively.

3 Methodology

3.1 Hardware and Software

Table 1: Hardware used in the experiments.
Processor: Intel Core i7-4770 @ 3.40 GHz
Architecture: Haswell
Cores: 4
L3 cache: 8 MB
Turbo Boost: Supported but disabled
Hyperthreading: Supported but disabled
RAM: 2x 8 GB Kingston DDR3, 1600 MHz (dual-channel)
Motherboard: Intel DH87RL
Power supply: Corsair TX750

The hardware used in the experiments is listed in Table 1. An Intel Haswell desktop processor, released in 2013, is used for the experiments. The Turbo Boost feature is disabled to make the experiments repeatable without relying on thermal conditions. Disabling Turbo Boost also ensures that the processor does not exceed its base frequency of 3.4 GHz. In addition, the hyperthreading feature is disabled in the BIOS of the machine on purpose. This gives access to eight programmable performance counters, compared with only four when hyperthreading is enabled.

Table 2: Software and kernel parameters used in the experiments.
Operating system: Scientific Linux 6.6
Kernel: Linux 4.1.6
Cpufreq governor: Performance
NMI watchdog: Disabled
Transparent hugepages: Disabled
Compiler: GCC 4.4.7
Performance measurement: PAPI 5.4.1 and Perf 4.1.6

Table 2 states the software components used in the experiments. Certain kernel settings are adjusted to ensure that the experiments are repeatable. The Cpufreq governor is set to performance, which runs the processor at the highest available frequency when it is not idle. This prevents the processor from running at frequencies lower than 3.4 GHz. The Non-maskable interrupt (NMI) watchdog is disabled, which frees up one programmable performance counter. In addition, support for transparent hugepages is disabled since it can change performance characteristics while programs are running. The performance counters are read using both the PAPI (Performance Application Programming Interface) library [14] and the Perf tool. Perf is a low-level performance analysis tool that is shipped with the Linux kernel. The RAPL counters are read using the MSR driver interface.

GCC 4.4.7 is used as the compiler in the experiments since it is the default for the operating system. The same compiler is likely still used by many applications running on Scientific Linux 6. The compiler flag -O2 is used to produce optimized binaries. Model-specific optimizations, such as Advanced Vector eXtensions (AVX), are not used.

3.2 Microbenchmark Design

We design a series of microbenchmarks to measure the power consumption of the instruction decoders of the x86-64 processor. The benchmarks execute a single line of code in a loop. We manually unroll the loop up to 2048 times. This allows us to easily increase the size of the code to exceed the capacity of Intel's micro-op cache. In turn, it forces the processor to use the instruction decoders at least periodically. We can then compare the power consumption against a loop with a smaller unroll count. If the loop is still executed in the same amount of time, the observed difference in power consumption can be attributed to the instruction decoding pipeline.

Loop unrolling is a well-known optimization technique used by compilers. In simple cases, where the number of iterations n is divisible by some integer k, the loop body can simply be repeated k times. The number of iterations for the unrolled loop is n/k. Unrolling increases performance by minimizing the number of branch instructions that need to be executed. Branch instructions are used for jumping from the end of the loop to its beginning. A correctly-predicted branch instruction consumes only one clock cycle in a pipelined processor. Thus, loop unrolling is most beneficial for very small loops.

The microbenchmarks use simple arithmetic operations such as addition and multiplication. Different benchmarks use either floating-point (double- or single-precision) or integer data types. They operate either on large arrays or simply on values stored in registers. The following code listing written in the C language shows the operations used in microbenchmark #1:

    D += A[j] + B[j] * C[j];

Here A, B and C are floating-point arrays stored in memory. The variable D contains a floating-point value stored in a register. The benchmarks do not write to memory because that would require an additional parameter in the power model. The following C code shows microbenchmark #2:

    D += (A[j] << 3) * (A[j] << 4) * ((B[j] << 2) * 5 + 1);

In this case, A and B are integer arrays. Since instructions that access memory generate at least two micro-ops and hence are forced through the complex decoder, we need to add several filler operations that generate only one micro-op. Here we use bitwise shifts, multiplications and additions as filler operations. This approach eliminates the complex decoder as a performance bottleneck, allowing us to stress all four decoders.

We write many variations of each benchmark. In addition to the previous two benchmarks, we have dozens of different variations of them. We vary the type of operations (addition, multiplication, bitwise shift) used in the benchmarks. Furthermore, we change the size of the arrays so that they fit different caches such as the L1 or L2 caches. We also vary the data types (double- or single-precision). The filler operations introduce additional opportunities for instruction-level parallelism, thus putting additional load on the instruction decoders.

4 Experimental Evaluation

4.1 Power Consumption versus Code Size

Our first step is to try a large number of different unroll counts using only a single benchmark. This gives us a better understanding of how energy consumption develops as the code size increases due to a higher unroll count. We can then identify which components are responsible for the changes in energy consumption.

[Figure 2: Package energy consumption of microbenchmark #1 with different loop unroll counts. Package energy consumption (J, 0-800) is plotted against unroll counts from 1 to 4096.]

Figure 2 shows the processor energy consumption for different unroll counts. The energy is measured in joules as reported by the Intel RAPL feature.
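As a concrete illustration of what is being measured here, the unrolled loop of a microbenchmark in the spirit of #1 can be sketched as follows. This is a reconstruction, not the authors' exact code: the array length, the unroll factor, and the function name are our assumptions (the paper varies the unroll factor up to 2048).

```c
#include <stddef.h>

#define N 4096     /* array length; chosen for illustration only */
#define UNROLL 8   /* unroll factor; the paper varies this up to 2048 */

/* Loop body of microbenchmark #1 (D += A[j] + B[j] * C[j]) repeated
 * UNROLL times per iteration, so the code footprint grows with the
 * unroll factor while the amount of arithmetic stays the same. */
double run_kernel(const double *A, const double *B, const double *C,
                  size_t iters)
{
    double D = 0.0;  /* accumulator kept in a register */
    for (size_t i = 0; i < iters; i++) {
        for (size_t j = 0; j + UNROLL <= N; j += UNROLL) {
            D += A[j]     + B[j]     * C[j];
            D += A[j + 1] + B[j + 1] * C[j + 1];
            D += A[j + 2] + B[j + 2] * C[j + 2];
            D += A[j + 3] + B[j + 3] * C[j + 3];
            D += A[j + 4] + B[j + 4] * C[j + 4];
            D += A[j + 5] + B[j + 5] * C[j + 5];
            D += A[j + 6] + B[j + 6] * C[j + 6];
            D += A[j + 7] + B[j + 7] * C[j + 7];
        }
    }
    return D;
}
```

At high unroll factors the loop body alone exceeds the 1536-micro-op capacity of the micro-op cache, forcing the decoders to run; returning D keeps the compiler from optimizing the loop away.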

From the figure, we can see that energy consumption initially decreases until it reaches its minimum at an unroll factor of 64. We do not observe any performance gains in the running time of the loop. Therefore, the reduction in energy consumption can be attributed to the reduced branch prediction activity. At an unroll factor of 128, we can see an increase in energy consumption that is caused by the instruction decoders. After this, the energy consumption grows as the hit ratio for the micro-op cache drops. Between unroll factors 1024 and 2048, we see a big jump in energy consumption that is due to the L1 instruction cache capacity being exceeded. We see that the energy consumption continues growing as the L2 cache activity increases. Based on Figure 2, we can extrapolate that energy consumption will keep growing for an even larger loop as the L2 cache will be exhausted. Eventually, the translation lookaside buffer (TLB) will also be exhausted, at which point there will be a big performance penalty and, correspondingly, a big increase in energy consumption.

4.2 Regression Modeling

We employ linear regression modeling to further isolate the power consumption of the instruction decoders. We implement two cases with different unroll counts in each of our microbenchmarks. The extreme case is the point where the capacity of the micro-op cache is exceeded. From Figure 2, we can see that this point is at an unroll count of 128 for benchmark #1. We also define a normal case for comparison, which is the extreme case's unroll count divided by two. In this example, the normal case is at an unroll count of 64, which is the energy minimum in Figure 2. Having dozens of benchmarks that implement these two cases allows us to produce enough data to accurately model the power consumption of different processor components such as the execution units, the instruction decoders and the different caches.

Table 3 lists the performance events selected. They reflect the components stressed by our benchmarks. The total number of benchmarks we use is 49. This number includes the different variants of each benchmark. In addition, each benchmark has two cases: the normal one and the extreme one. We run each case for 11 seconds on our test machine. During the execution, we measure power consumption at a rate of 50 samples per second using RAPL. We also use the Perf tool to record performance events at the same rate. We need to adjust the Perf timestamps to match our power consumption data because the Linux kernel does not use real time for performance events. In addition, we have to filter out anomalous values in the performance data. We use the ordinary least squares method to construct a linear model for the package power consumption.

Equation 1 shows the resulting power model obtained using linear regression. The result is the predicted package power consumption measured in watts. The input parameters are measured as the number of events per second. For example, the value of cycles/second would be 3.4 billion (or 3.4 × 10^9) for a single core of our test machine.

    P_package = 6.05
              + (cycles/second) × 1.63 × 10^-9
              + (µops issued/second) × 2.15 × 10^-10
              + (µops decoded/second) × 1.40 × 10^-10
              + (L1 hits/second) × 4.35 × 10^-10
              + (L2 references/second) × 4.05 × 10^-9          (1)

The coefficient of determination (R²) for our model is 0.989. Therefore, we believe the model accurately represents the different CPU components when running our benchmarks.

4.3 Power Breakdowns

In Table 4, we select two microbenchmarks for more detailed examination. We select benchmark #1 because of its high L2 & L3 cache power consumption. The L2 and L3 caches are major components and they allow putting the instruction decoders into perspective. Benchmark #2 is selected because of the high power consumption in the instruction decoders. At the same time, #2 does not use the L2 cache. Therefore, #2 demonstrates the maximum power consumption of the instruction decoders.

We discovered that our benchmarks appear to trigger the L2 prefetchers even though the data fits into the L2 cache. This problem appears to be limited to synthetic benchmarks running on the Haswell platform. The prefetchers appear to be fetching data from the L3 cache, which causes additional power consumption. Therefore, we have labeled the component as "L2 & L3 cache" in our power breakdowns.

[Figure 3: Package power consumption breakdown of microbenchmark #1, which stresses the L2 cache using floating-point operations. Breakdown: cores static 44%, L2 & L3 cache dynamic 22%, uncore static 12%, execution units dynamic 10%, L1 cache dynamic 9%, instruction decoders 3%.]

Figure 3 shows the power consumption breakdown for our microbenchmark #1. It demonstrates the power consumption of different CPU components as predicted by our power model.
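The RAPL samples behind these breakdowns are read through the Linux MSR driver interface mentioned in Section 3.1. The sketch below is our own minimal illustration, not the authors' measurement code: the MSR addresses are the ones documented by Intel, while the helper names and error handling are assumptions. The package energy counter is 32 bits wide and wraps around, so deltas must be taken in unsigned 32-bit arithmetic.

```c
#include <stdint.h>
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>

#define MSR_RAPL_POWER_UNIT   0x606  /* energy unit, bits 12:8 */
#define MSR_PKG_ENERGY_STATUS 0x611  /* package energy counter */

/* One counter tick equals 0.5^ESU joules, where ESU is stored in
 * bits 12:8 of MSR_RAPL_POWER_UNIT (typically 16, i.e. ~15.3 uJ). */
double rapl_energy_unit(uint64_t unit_msr)
{
    unsigned esu = (unsigned)((unit_msr >> 8) & 0x1f);
    return 1.0 / (double)(1ULL << esu);
}

/* Energy consumed between two samples of the 32-bit counter,
 * handling a single wrap-around. */
double rapl_energy_delta(uint32_t before, uint32_t after, double unit)
{
    return (double)(uint32_t)(after - before) * unit;
}

/* Read one MSR for a given CPU; requires root and the msr module. */
int read_msr(int cpu, uint32_t reg, uint64_t *value)
{
    char path[64];
    snprintf(path, sizeof path, "/dev/cpu/%d/msr", cpu);
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;
    int ok = pread(fd, value, sizeof *value, reg) == (ssize_t)sizeof *value;
    close(fd);
    return ok ? 0 : -1;
}
```

Sampling MSR_PKG_ENERGY_STATUS at 50 Hz and multiplying the counter deltas by the energy unit yields a power trace of the kind used for the regression.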

Table 3: Performance events used in the linear regression model.
CPU_CLK_UNHALTED.THREAD_P: The number of clock cycles for each core.
UOPS_ISSUED.ANY: The number of micro-ops issued to the execution units.
IDQ.MITE_UOPS: The number of micro-ops produced by the instruction decoders.
MEM_LOAD_UOPS_RETIRED.L1_HIT: The number of hits in the L1 data cache.
L2_RQSTS.REFERENCES: The number of requests to the L2 cache (including L2 prefetchers).
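Given per-second rates for the events in Table 3, the fitted model of Equation 1 can be evaluated directly. The sketch below simply transcribes the published coefficients into a C function; the function and parameter names are ours:

```c
/* Package power (watts) predicted by the linear model of Equation 1.
 * All arguments are event rates in events per second. */
double predicted_package_power(double cycles, double uops_issued,
                               double uops_decoded, double l1_hits,
                               double l2_references)
{
    return 6.05                      /* constant term: uncore static power */
         + cycles        * 1.63e-9
         + uops_issued   * 2.15e-10
         + uops_decoded  * 1.40e-10
         + l1_hits       * 4.35e-10
         + l2_references * 4.05e-9;
}
```

For instance, one core running at 3.4 GHz with all other rates at zero already contributes about 5.5 W on top of the 6.05 W constant term, for a predicted total of roughly 11.6 W.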

Table 4: Power consumption breakdown by processor components for two different microbenchmarks as predicted by our model.
Microbenchmark: #1 / #2
Type: Floating-point / Integer
Instructions per cycle (IPC): 1.67 / 3.86
Uncore (W): 6.0 / 6.0
Cores (W): 22.1 / 22.1
Execution units (W): 4.9 / 10.4
Instruction decoders (W): 1.8 / 4.8
L1 cache (W): 4.8 / 3.8
L2 & L3 cache (W): 11.2 / 0.1
Micro-op cache hit ratio: 44.8% / 29.6%

This figure corresponds to benchmark #1 in Table 4. We have labeled the constant term as uncore static in this figure since it roughly corresponds to the static power consumption of the uncore component in the processor. This term includes the static power consumption of components such as the L3 cache and the memory controller.

Based on Figure 3, the power consumption of the instruction decoders is very small compared with the other components. Only 3% of the total package power is consumed by the instruction decoding pipeline in this case. It should be noted that the hit ratio for the micro-op cache is 45%, which means that only 55% of micro-ops come from the decoders. Therefore, the instruction decoders can in theory consume twice as much power. This can happen with older generation architectures like Nehalem, which is the predecessor of Sandy Bridge. The Nehalem architecture lacks the micro-op cache, so it has to decode every single instruction in our benchmark.

[Figure 4: Power consumption breakdown of microbenchmark #2, which stresses the L1 cache using integer operations. Breakdown: cores static 47%, execution units dynamic 22%, uncore static 13%, instruction decoders 10%, L1 cache dynamic 8%.]

Figure 4 illustrates a power breakdown for our benchmark #2, which uses integer operations (as opposed to the floating-point operations used in benchmark #1). Because of this, benchmark #2 reaches a much higher instructions per cycle (IPC) count. The higher IPC count causes the execution units and instruction decoders to consume more power. The micro-op cache hit ratio for #2 is also lower, which causes even more work for the instruction decoders. As a result, the instruction decoders end up consuming 10% of the total package power in benchmark #2. Nevertheless, we would like to point out that this benchmark is completely synthetic. Real applications typically do not reach IPC counts as high as this. Thus, the power consumption of the instruction decoders is likely less than 10% for real applications.

5 Related Work

Many power models have been proposed for modeling the power consumption of entire computer systems and individual components [4, 7, 13, 15]. McCullough et al. [13] evaluated earlier approaches to modeling power consumption. Their conclusion is that quadratic and other non-linear models are often required for accurate power modeling. This is due to the fact that many utilization metrics or performance counters do not scale linearly with low-level hardware activity. In this paper, we use a linear model because we choose components that can be modeled linearly using the parameters available. In addition, we use the Haswell architecture, which is significantly more energy efficient than older architectures. Our selection of performance events is also different from existing work. For example, we use the IDQ.MITE_UOPS performance event introduced in Sandy Bridge. To our best knowledge, Oboril et al. [15] are the only ones who have used this performance event besides us. They present power consumption breakdowns similar to ours. However, they do not use microbenchmarks specifically targeting the instruction decoders like we do.

The energy efficiency of different Intel and ARM platforms has been compared using numerous workloads [2, 3, 11, 16]. The results suggest that ARM gives good energy efficiency in specific workloads using specific parameters, while Intel has more stable performance across a wider range of parameters. In addition, a few studies have attempted to investigate the significance of the instruction set [5, 10]. Blem et al. [5] concluded that the instruction set by itself is a fairly insignificant part of the whole processor design. A limitation of their work is that they did not measure the instruction decoders directly. Our paper addresses this limitation. Isen et al. [10] stated that Intel benefits from advanced microarchitectural features such as micro-op fusion that have helped close the power gap between CISC and RISC instruction sets.

Microbenchmarks have been used for deriving power models before. Isci and Martonosi [9] use microbenchmarks to derive the power consumption of 22 different components of the Pentium 4 processor. Similar to our work, they use performance counters to estimate the activity of different components. However, our microbenchmarks are more focused on the instruction decoders. Leng et al. [12] model the power consumption of General Purpose GPUs (GPGPUs). Their work shows that a microbenchmark-based approach works even for a different class of hardware.

6 Conclusion and Future Work

We designed a series of microbenchmarks to determine the power consumption of the instruction decoders in our x86-64 processor. We model the power consumption using linear regression analysis. Our linear model predicts the power consumption of different components including the execution units, instruction decoders, and L1 and L2 caches. The result demonstrates that the decoders consume between 3% and 10% of the total processor package power in our benchmarks. The power consumed by the decoders is small compared with other components such as the L2 cache, which consumed 22% of package power in benchmark #1. We conclude that switching to a different instruction set would save only a small amount of power since the instruction decoder cannot be eliminated completely in modern processors.

In the future, we plan to port our microbenchmarks to an ARM platform. This allows us to directly compare the energy consumption of different components between two different architectures. Another interesting platform to benchmark would be the Intel Atom, which is a low-power x86 design. The Atom does not support AVX instructions, which should simplify the decoders and thus reduce their power consumption.

7 Acknowledgments

This study was supported by the Green Big Data project of the Helsinki Institute of Physics. We would like to thank David Abdurachmanov, Peter Elmer and Giulio Eulisse for their ideas and feedback.

References

[1] AppliedMicro announces the availability of X-C1 development kits featuring X-Gene, Nov. 2014. (press release).

[2] Abdurachmanov, D., Bockelman, B., Elmer, P., Eulisse, G., Knight, R., and Muzaffar, S. Heterogeneous high throughput scientific computing with APM X-Gene and Intel Xeon Phi. In Journal of Physics: Conference Series (2015), vol. 608, IOP Publishing.

[3] Aroca, R. V., and Gonçalves, L. M. G. Towards green data centers: A comparison of x86 and ARM architectures power efficiency. Journal of Parallel and Distributed Computing 72, 12 (2012), 1770–1780.

[4] Bircher, W. L., and John, L. K. Complete system power estimation using processor performance events. IEEE Transactions on Computers 61, 4 (2012), 563–577.

[5] Blem, E., Menon, J., Vijayaraghavan, T., and Sankaralingam, K. ISA wars: Understanding the relevance of ISA being RISC or CISC to performance, power, and energy on modern architectures. ACM Trans. Comput. Syst. 33, 1 (Mar. 2015), 3:1–3:34.

[6] David, H., Gorbatov, E., Hanebutte, U. R., Khanna, R., and Le, C. RAPL: Memory power estimation and capping. In Proceedings of the 16th ACM/IEEE International Symposium on Low Power Electronics and Design (ISLPED) (2010).

[7] Economou, D., Rivoire, S., Kozyrakis, C., and Ranganathan, P. Full-system power analysis and modeling for server environments. In Proceedings of the Workshop on Modeling, Benchmarking, and Simulation (MoBS) (June 2006).

[8] Hähnel, M., Döbel, B., Völp, M., and Härtig, H. Measuring energy consumption for short code paths using RAPL. ACM SIGMETRICS Performance Evaluation Review 40, 3 (2012), 13–17.

[9] Isci, C., and Martonosi, M. Runtime power monitoring in high-end processors: Methodology and empirical data. In Proceedings of the 36th annual IEEE/ACM International Symposium on Microarchitecture (2003).

[10] Isen, C., John, L. K., and John, E. A tale of two processors: Revisiting the RISC-CISC debate. In Proceedings of the 2009 SPEC Benchmark Workshop on Computer Performance Evaluation and Benchmarking (2009), Springer, pp. 57–76.

[11] Jarus, M., Varrette, S., Oleksiak, A., and Bouvry, P. Performance evaluation and energy efficiency of high-density HPC platforms based on Intel, AMD and ARM processors. In Energy Efficiency in Large Scale Distributed Systems. Springer, 2013, pp. 182–200.

[12] Leng, J., Hetherington, T., ElTantawy, A., Gilani, S., Kim, N. S., Aamodt, T. M., and Reddi, V. J. GPUWattch: Enabling energy optimizations in GPGPUs. ACM SIGARCH Computer Architecture News 41, 3 (2013), 487–498.

[13] McCullough, J. C., Agarwal, Y., Chandrashekar, J., Kuppuswamy, S., Snoeren, A. C., and Gupta, R. K. Evaluating the effectiveness of model-based power characterization. In Proceedings of the USENIX Annual Technical Conference (ATC) (2011).

[14] Mucci, P. J., Browne, S., Deane, C., and Ho, G. PAPI: A portable interface to hardware performance counters. In Proceedings of the Department of Defense HPCMP Users Group Conference (1999), pp. 7–10.

[15] Oboril, F., Ewert, J., and Tahoori, M. B. High-resolution online power monitoring for modern microprocessors. In Proceedings of the 2015 Design, Automation & Test in Europe Conference & Exhibition (DATE) (2015), EDA Consortium.

[16] Ou, Z., Pang, B., Deng, Y., Nurminen, J. K., Ylä-Jääski, A., and Hui, P. Energy- and cost-efficiency analysis of ARM-based clusters. In Proceedings of the 12th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid) (2012).
