Simultaneous Multithreading on X86 64 Systems: an Energy Efficiency Evaluation
Total Page:16
File Type:pdf, Size:1020Kb
Simultaneous Multithreading on x86_64 Systems: An Energy Efficiency Evaluation Robert Schöne Daniel Hackenberg Daniel Molka Center for Information Services and High Performance Computing (ZIH) Technische Universität Dresden, 01062 Dresden, Germany {robert.schoene, daniel.hackenberg, daniel.molka}@tu-dresden.de ABSTRACT Intel added simultaneous multithreading to the pop- In recent years, power consumption has become one ular x86 family with the Netburst microarchitecture of the most important design criteria for micropro- (e.g. Pentium 4) [6]. The brand name for this tech- cessors. CPUs are therefore no longer developed nology is Hyper-Threading (HT) [8]. It provides with a narrow focus on raw compute performance. two threads (logical processors) per core and still is This means that well-established processor features available in many Intel processors. that have proven to increase compute performance The primary objective of multithreading is to im- now need to be re-evaluated with a new focus on prove the throughput of pipelined and superscalar energy efficiency. This paper presents an energy ef- architectures. Working on multiple threads pro- ficiency evaluation of the symmetric multithreading vides more independent instructions and thereby (SMT) feature on state-of-the-art x86 64 proces- increases the utilization of the hardware as long as sors. We use a mature power measurement method- the threads do not compete for the same crucial re- ology to analyze highly sophisticated low-level mi- source (e.g., bandwidth). However, MT may also crobenchmarks as well as a diverse set of applica- decrease the performance of a system due to con- tion benchmarks. Our results show that{depending tention for shared hardware resources like caches, on the workload{SMT can be at the same time ad- TLBs, and re-order buffer entries. Harmful effects vantageous in terms of performance and disadvan- such as cache and TLB thrashing can be the result. tageous in terms of energy efficiency. Moreover, we Many scientific studies have investigated the effec- demonstrate how the SMT efficiency has advanced tiveness of SMT in general and HT in particular. In between two processor generations. this paper we re-evaluate this technique with an new focus: Does HT increase the power consumption of the processor, and if so, does this outweigh the 1. INTRODUCTION performance advantage so that the overall energy Hardware multithreading is an established tech- requirement to carry out a given task increases? nique to increase processor throughput by handling The question is even more difficult to answer when multiple threads in parallel. It can be distinguished considering the fact that the answer strongly de- into temporal multithreading (TMT) and simulta- pends on the type of workload that is executed. neous multithreading (SMT). While TMT proces- This paper contributes an energy efficiency study sors can only issue instructions from a single thread of Hyper-Threading using a mature power measure- in one cycle, SMT processors are able to issue in- ment methodology and two state-of-the-art x86 64 structions from multiple threads concurrently. The Intel microarchitectures (Westmere-EP and Sandy first SMT-capable system [2] was installed in 1988. Bridge). We analyze the impact of HT on the power consumption when running synthetic benchmarks as well as a rich set of application benchmarks from the SPEC CPU and OMP benchmark suites. 2. RELATED WORK Ungerer et al. describe the different types of hard- ware multithreading that are implemented in pro- cessors along with their advantages and shortcom- 1 ings [13]. Tullsen et al. focus on simultaneous mul- Table 1: Hardware Configuration tithreading (SMT) and illustrate the substantially Fujitsu ESPRIMO increased complexity of the processor design [12]. System Dell PowerEdge R510 P700 E85+ Hyper-Threading is the most common type of SMT Processor 2x Xeon X5670 1x Core i7 2600 and has been described in [8, 6]. A recent perfor- 2.93 GHz 3.4 GHz Core clock mance evaluation of Nehalem cluster has demon- (Turbo 3.33 GHz) (Turbo 3.8 GHz) strated that the performance improvement of HT Uncore clock 2.66 GHz 3.4 GHz (Turbo 3.8) is highly application dependent [11]. Li et al. [7] TDP 95 W 95 W model the effects of SMT on power consumption for Codename Westmere-EP Sandy Bridge POWER4-like architectures. They conclude that FPU 2x 128 Bit (SSE) 2x 256 Bit (AVX) the IBM implementation of SMT can significantly L1 cache 2x 32 KiB per core 2x 32 KiB per core improve energy efficiency. L2 cache 256 KiB per core 256 KiB per core A set of synthetic low-level benchmarks that al- L3 cache 12 MiB per chip 8 MiB per chip lows to determine the performance of data transfers IMC channels 3x RDDR3 2x DDR3 within shared memory x86 64 systems has been pre- Memory type PC3L-10600R PC3-10600 sented in[9, 4]. They have been further extended Memory size 12 GiB (6x 2 GiB) 8 GiB (2x 4 GiB) to determine the energy consumption of accesses to Chipset Intel 5520 Intel Q65 Dell 480W Fujitsu different levels in the memory hierarchy and per- Power supply PN H410J DPS-250AB-62 A forming arithmetic operations on two-socket AMD OS Linux 2.6.38 Linux 2.6.38 and Intel systems [10]. This study has shown first signs that the use of Hyper-Threading introduces a significant power consumption overhead while not of dual socket Sandy Bridge processors, we compare providing any performance advantage. In this paper a single socket Sandy Bridge system with a dual we build upon the results of [10] in order to investi- socket Westmere-EP machine. In order to ensure a gate how the more recent Sandy Bridge microarchi- fair comparison, we do not look at absolute power tecture performs and how this effect translates to consumption numbers, but only relative changes, real applications as represented by the benchmarks i.e. the percentage difference of the power and en- SPEC OMP [1] and SPEC CPU2006 [5]. ergy consumption when running a workload with SMT enabled or disabled. Moreover, we compare 3. EXPERIMENTAL SETUP workloads that{while using all available cores of the system{perform almost no communication be- 3.1 Test System Hardware tween the individual threads. Finally, we ensure that memory allocation of each thread occurs lo- We use two state-of-the-art x86 64 test systems to cally, so that no significant inter-socket communi- evaluate the energy efficiency of the multithreading cation occurs on the dual socket machine. implementation (HT) of two different generations of Intel processors. One test system is based on 3.2 Software Workloads the most current dual socket Intel Xeon processors (Westmere-EP). This processor features six cores The highly optimized microbenchmarks [9, 4] al- with SSE units, private L1 and L2 caches and a low us to determine the data throughput of different shared L3 cache. The second test system is based instruction types that access well-defined locations on the most recent Intel processor family (Sandy within the memory hierarchy. Either n (on an n- Bridge)1. This processor features four cores with core system with HT disabled) or 2n (HT enabled) AVX units and a similar memory subsystem. Ta- threads perform a uniform workload consisting of ble 1 lists the test system configuration in detail. independent operations that on disjoint memory re- Both of the test systems are highly power opti- gions. The innermost benchmark routines are writ- mized with only indispensable components installed. ten in highly optimized assembly code that achieves The power supplies of both systems are very effi- near peak bandwidth for every cache level as well as cient and the Dell system additionally features low- main memory. One thread per core is generally suf- power DRAM modules. Due to the unavailability ficient to fully utilize the data paths, leaving little to no potential for performance increases through 1This is in fact a desktop class CPU (Core i7). How- the use of HT. With HT enabled, the execution ever, almost identical results were obtained on a pre- units and caches are concurrently stressed by two production Intel Xeon 1280 test system. The Xeon system showed some stability issues and we therefore threads per core that both achieve roughly half of present the Core i7 results. the peak per-core throughput. The existing bench- 2 marks have been extended to use AVX instructions Table 2: Compiler Configuration on the Sandy Bridge platform in order to fully uti- Westmere-EP Sandy Bridge lize the 256 Bit wide execution units and data paths. Compiler Suite Intel Composer XE 12.0 We also run a subset of the SPEC CPU2006 ap- Compiler Flags -m32 -xSSE4.2 -ipo -m32 -xAVX -ipo plication benchmarks on our test systems. Officially SPEC CPU -O3 -no-prec-div -static -opt-prefetch submitted SPEC CPU2006 results from HT-capable Compiler Flags -xSSE4.2 systems are often generated with HT enabled to SPEC OMP -O3 -ipo1 -openmp improve the overall benchmark performance. In our study we focus on the integer part of SPEC 3.3 Power Measurement CPU2006 as we found these benchmarks to be more For both the synthetic and the application bench- sensitive to the use of HT. We focus on the through- marks we measure the power consumption using a put mode of the benchmark (rate instead of speed) ZES Zimmer LMG450 power analyzer attached to that runs multiple copies of the same serial work- the power supply of each test system. This device load on all available cores. To avoid unnecessary provides highly accurate power consumption data complexity, we use the base configuration that al- with a sampling frequency of 20 Hz. The power con- lows for just one set of compiler flags for all bench- sumption data is collected by dedicated hardware marks instead of individually tuned configurations and software components [10].