Simultaneous Multithreading on x86_64 Systems: An Energy Efficiency Evaluation

Robert Schöne, Daniel Hackenberg, Daniel Molka
Center for Information Services and High Performance Computing (ZIH)
Technische Universität Dresden, 01062 Dresden, Germany
{robert.schoene, daniel.hackenberg, daniel.molka}@tu-dresden.de

ABSTRACT
In recent years, power consumption has become one of the most important design criteria for microprocessors. CPUs are therefore no longer developed with a narrow focus on raw compute performance. This means that well-established processor features that have proven to increase compute performance now need to be re-evaluated with a new focus on energy efficiency. This paper presents an energy efficiency evaluation of the simultaneous multithreading (SMT) feature on state-of-the-art x86_64 processors. We use a mature power measurement methodology to analyze highly sophisticated low-level microbenchmarks as well as a diverse set of application benchmarks. Our results show that, depending on the workload, SMT can be at the same time advantageous in terms of performance and disadvantageous in terms of energy efficiency. Moreover, we demonstrate how the SMT efficiency has advanced between two processor generations.

1. INTRODUCTION
Hardware multithreading is an established technique to increase processor throughput by handling multiple threads in parallel. It can be divided into temporal multithreading (TMT) and simultaneous multithreading (SMT). While TMT processors can only issue instructions from a single thread in one cycle, SMT processors are able to issue instructions from multiple threads concurrently. The first SMT-capable system [2] was installed in 1988. Intel added simultaneous multithreading to the popular Pentium family with the Netburst microarchitecture (e.g. Pentium 4) [6]. The brand name for this technology is Hyper-Threading (HT) [8]. It provides two threads (logical processors) per core and is still available in many Intel processors.

The primary objective of multithreading is to improve the throughput of pipelined and superscalar architectures. Working on multiple threads provides more independent instructions and thereby increases the utilization of the hardware as long as the threads do not compete for the same crucial resource (e.g., bandwidth). However, multithreading may also decrease the performance of a system due to contention for shared hardware resources like caches, TLBs, and re-order buffer entries. Harmful effects such as cache and TLB thrashing can be the result.

Many scientific studies have investigated the effectiveness of SMT in general and HT in particular. In this paper we re-evaluate this technique with a new focus: Does HT increase the power consumption of the processor, and if so, does this outweigh the performance advantage so that the overall energy requirement to carry out a given task increases? The question is even more difficult to answer because the answer strongly depends on the type of workload that is executed.

This paper contributes an energy efficiency study of Hyper-Threading using a mature power measurement methodology and two state-of-the-art x86_64 Intel microarchitectures (Westmere-EP and Sandy Bridge). We analyze the impact of HT on the power consumption when running synthetic benchmarks as well as a rich set of application benchmarks from the SPEC CPU and OMP suites.

2. RELATED WORK
Ungerer et al. describe the different types of hardware multithreading that are implemented in processors along with their advantages and shortcomings [13]. Tullsen et al. focus on simultaneous multithreading (SMT) and illustrate its substantially increased complexity [12]. Hyper-Threading is the most common type of SMT and has been described in [8, 6]. A recent performance evaluation of a Nehalem cluster has demonstrated that the performance improvement of HT is highly application dependent [11]. Li et al. [7] model the effects of SMT on power consumption for POWER4-like architectures. They conclude that the IBM implementation of SMT can significantly improve energy efficiency.

A set of synthetic low-level benchmarks for determining the performance of data transfers within x86_64 systems has been presented in [9, 4]. They have been further extended to determine the energy consumption of accesses to different levels in the memory hierarchy and of arithmetic operations on two-socket AMD and Intel systems [10]. This study has shown first signs that the use of Hyper-Threading introduces a significant power consumption overhead while not providing any performance advantage. In this paper we build upon the results of [10] in order to investigate how the more recent Sandy Bridge microarchitecture performs and how this effect translates to real applications as represented by the benchmarks SPEC OMP [1] and SPEC CPU2006 [5].

3. EXPERIMENTAL SETUP

3.1 Test System Hardware
We use two state-of-the-art x86_64 test systems to evaluate the energy efficiency of the multithreading implementation (HT) of two different generations of Intel processors. One test system is based on the most current dual socket Intel Xeon processors (Westmere-EP). This processor features six cores with SSE units, private L1 and L2 caches and a shared L3 cache. The second test system is based on the most recent Intel processor family (Sandy Bridge)^1. This processor features four cores with AVX units and a similar memory subsystem. Table 1 lists the test system configuration in detail.

Table 1: Hardware Configuration

System          Dell PowerEdge R510            Fujitsu ESPRIMO P700 E85+
Processor       2x Xeon X5670                  1x Core i7 2600
Core clock      2.93 GHz (Turbo 3.33 GHz)      3.4 GHz (Turbo 3.8 GHz)
Uncore clock    2.66 GHz                       3.4 GHz (Turbo 3.8)
TDP             95 W                           95 W
Codename        Westmere-EP                    Sandy Bridge
FPU             2x 128 Bit (SSE)               2x 256 Bit (AVX)
L1 cache        2x 32 KiB per core             2x 32 KiB per core
L2 cache        256 KiB per core               256 KiB per core
L3 cache        12 MiB per chip                8 MiB per chip
IMC channels    3x RDDR3                       2x DDR3
Memory type     PC3L-10600R                    PC3-10600
Memory size     12 GiB (6x 2 GiB)              8 GiB (2x 4 GiB)
Chipset         Intel 5520                     Intel Q65
Power supply    Dell 480W PN H410J             Fujitsu DPS-250AB-62 A
OS              Linux 2.6.38                   Linux 2.6.38

Both of the test systems are highly power optimized with only indispensable components installed. The power supplies of both systems are very efficient and the Dell system additionally features low-power DRAM modules. Due to the unavailability of dual socket Sandy Bridge processors, we compare a single socket Sandy Bridge system with a dual socket Westmere-EP machine. In order to ensure a fair comparison, we do not look at absolute power consumption numbers, but only at relative changes, i.e. the percentage difference of the power and energy consumption when running a workload with SMT enabled or disabled. Moreover, we compare workloads that, while using all available cores of the system, perform almost no communication between the individual threads. Finally, we ensure that memory allocation of each thread occurs locally, so that no significant inter-socket communication occurs on the dual socket machine.

^1 This is in fact a desktop class CPU (Core i7). However, almost identical results were obtained on a pre-production Intel Xeon E3-1280 test system. The Xeon system showed some stability issues and we therefore present the Core i7 results.

3.2 Workloads
The highly optimized microbenchmarks [9, 4] allow us to determine the data throughput of different instruction types that access well-defined locations within the memory hierarchy. Either n (on an n-core system with HT disabled) or 2n (HT enabled) threads perform a uniform workload consisting of independent operations that operate on disjoint memory regions. The innermost benchmark routines are written in highly optimized assembly code that achieves near peak bandwidth for every cache level as well as main memory. One thread per core is generally sufficient to fully utilize the data paths, leaving little to no potential for performance increases through the use of HT. With HT enabled, the execution units and caches are concurrently stressed by two threads per core that both achieve roughly half of the peak per-core throughput. The existing benchmarks have been extended to use AVX instructions on the Sandy Bridge platform in order to fully utilize the 256 Bit wide execution units and data paths.

We also run a subset of the SPEC CPU2006 application benchmarks on our test systems. Officially submitted SPEC CPU2006 results from HT-capable systems are often generated with HT enabled to improve the overall benchmark performance. In our study we focus on the integer part of SPEC CPU2006 as we found these benchmarks to be more sensitive to the use of HT. We focus on the throughput mode of the benchmark (rate instead of speed) that runs multiple copies of the same serial workload on all available cores. To avoid unnecessary complexity, we use the base configuration that allows for just one set of compiler flags for all benchmarks instead of individually tuned configurations (peak). We omit benchmarks that do not run on our Westmere-EP test system due to an excessive per-core memory footprint (400, 401, 429). Our application study is complemented with measurements of the SPEC OMP benchmark suite that includes floating-point intense, OpenMP parallelized workloads from different fields of High Performance Computing. However, these benchmarks are not suitable for our single-socket four-core test system, as no official submission is available to serve as a performance reference. Our compiler setup for all SPEC benchmarks is listed in Table 2. We run three iterations of each benchmark and present the average of all three runs in Section 4.2. Similar to previously submitted SPEC results we enable Turbo for the SPEC CPU2006 benchmarks, but not for SPEC OMP.

Table 2: Compiler Configuration

                            Westmere-EP                 Sandy Bridge
Compiler suite              Intel Composer XE 12.0 (both systems)
Compiler flags, SPEC CPU    -m32 -xSSE4.2 -ipo          -m32 -xAVX -ipo
                            -O3 -no-prec-div -static -opt-prefetch (both systems)
Compiler flags, SPEC OMP    -xSSE4.2 -O3 -ipo1 -

3.3 Power Measurement
For both the synthetic and the application benchmarks we measure the power consumption using a ZES Zimmer LMG450 power analyzer attached to the power supply of each test system. This device provides highly accurate power consumption data with a sampling frequency of 20 Hz. The power consumption data is collected by dedicated hardware and software components [10]. Using the benchmark runtime and the power consumption data (in Watts) we then calculate the overall energy consumption of the workload. The power and energy data is merged with the standard benchmark results in a post mortem step, thereby effectively eliminating any measurement overhead.

4. RESULTS

4.1 Synthetic Benchmarks
Figure 1 illustrates the power and performance impact of Hyper-Threading on low-level benchmark routines that execute either load, add, or mul instructions to access different levels of the memory hierarchy. While the accumulated throughput of all benchmark threads is only impacted slightly,

[Figure 1 plots: (a) Dual Socket Intel Westmere-EP, (b) Single Socket Intel Sandy Bridge. Left axis: bandwidth in GB/s (logarithmic, 1 to 1000) with HT off and HT on. Right axis: power ratio (-4% to 12%). Benchmarks: L1, L2, L3, and RAM, each with load, add, and mul.]

Figure 1: Absolute data throughput and relative power consumption of different SSE/AVX packed double assembly instructions with HT disabled (1 thread per core) and enabled (2 threads per core). While the accumulated data throughput is mostly independent of the number of threads per core, the power consumption can vary significantly. On the Westmere-EP system, the power consumption of a workload that stresses exclusively the L1 cache can increase by up to 10% when using two threads per core without providing any performance advantage. This effect has been effectively eliminated in the subsequent Sandy Bridge design.
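The relative power values in Figure 1 and the energy values used later both follow from the sampled power data and the benchmark runtime described in Section 3.3. The following is a minimal sketch of such a calculation, not the tooling actually used for the measurements. The function names and the two power traces are hypothetical and chosen for illustration only; only the 20 Hz sampling frequency is taken from the text.

```python
from statistics import mean

SAMPLE_RATE_HZ = 20.0  # sampling frequency of the power analyzer, as stated in Section 3.3


def energy_joules(power_samples_w, sample_rate_hz=SAMPLE_RATE_HZ):
    """Approximate energy (J): each power sample is held for one sample period."""
    return sum(power_samples_w) / sample_rate_hz


def relative_change_percent(ht_on, ht_off):
    """Percentage difference of a metric with HT enabled, relative to HT disabled."""
    return 100.0 * (ht_on - ht_off) / ht_off


# Hypothetical 10 s traces in Watts (200 samples at 20 Hz); NOT measured data.
trace_ht_off = [200.0] * 200
trace_ht_on = [220.0] * 200

power_delta = relative_change_percent(mean(trace_ht_on), mean(trace_ht_off))
energy_delta = relative_change_percent(
    energy_joules(trace_ht_on), energy_joules(trace_ht_off)
)
print(f"power: {power_delta:+.1f} %, energy: {energy_delta:+.1f} %")
```

Because both hypothetical traces cover the same runtime, the relative energy change equals the relative power change. With differing runtimes the two values diverge, which is the case of interest for the application benchmarks.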

[Figure 2 plots: (a) Dual Socket Intel Westmere-EP, (b) Single Socket Intel Sandy Bridge. Bars: runtime, energy, and power (relative change, -30% to 30%). Benchmarks: 403.gcc, 445.gobmk, 456.hmmer, 458.sjeng, 462.libquantum, 464.h264ref, 471.omnetpp, 473.astar, 483.xalancbmk, and average.]

Figure 2: Runtime, power and energy consumption of selected SPECint_rate_base2006 benchmarks for the two test systems. Running two threads per core (HT enabled) consistently increases the power consumption. Three benchmarks run less energy-efficiently with HT enabled on Westmere-EP. For 464.h264ref the increased power consumption outweighs the runtime advantage, thus decreasing energy efficiency. While two benchmarks show increased runtimes with HT enabled on Westmere-EP, all benchmark runtimes decrease on the Sandy Bridge test system. 456.hmmer is the one exception that still shows decreased efficiency with HT enabled.

the additional power demand in the HT enabled case on the Westmere-EP system is evident, reaching up to 10%. In previous work on a different dual socket test system we have measured a 40 W premium for some SSE2 packed integer operations without any notable performance advantage. The extent to which the power consumption increases is largely proportional to the overall data throughput. In contrast to Westmere or Nehalem processors, the new Sandy Bridge architecture does not show this behavior at all. This demonstrates a considerable improvement of the power efficiency of Intel's SMT implementation. The interesting question now is whether and how this advancement translates from synthetic benchmarks to an improved energy efficiency of real applications.

4.2 Application Benchmarks
We use published SPEC results in order to find a competitive set of compiler flags.^2 Due to the missing smartheap library that is known to strongly influence C++ results, our C++ benchmark runs show noticeably lower performance. Most of our other CPU2006 benchmark runtimes are in line with published results on both test platforms.

^2 http://www.spec.org/cpu2006/results/res2011q2/cpu2006-20110329-15360.html
http://www.spec.org/cpu2006/results/res2010q1/cpu2006-20100315-09799.html

The SPECint_rate_base2006 results depicted in Figure 2 show a consistent increase in power consumption with HT enabled. A performance penalty with HT enabled only occurs on the Westmere-EP platform. For minor runtime benefits the increased power consumption can worsen the overall energy efficiency in terms of Joule per workload. This is the case for 464.h264ref on Westmere-EP and for 456.hmmer on Sandy Bridge. A comparison of the runtime and energy bars between both platforms clearly shows that the HT implementation in the Sandy Bridge microarchitecture has been strongly improved in terms of both performance and energy efficiency.

The SPEC OMP benchmark characteristics differ significantly from CPU2006, as the participating tasks actually synchronize with each other and therefore scalability is an issue. The scalability of 314.mgrid, 318.galgel, and 320.equake is known to be poor [3], which explains the HT performance penalty shown in Figure 3. In addition to the runtime increase, HT causes a power penalty that makes the HT disadvantage even worse in terms of energy efficiency. Benchmarks that do not scale poorly and are less memory bound than 312.swim typically benefit from Hyper-Threading. However, 330.art as well as the overall average show again that a minor performance advantage can come with a bigger energy efficiency disadvantage.
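The observation that a runtime advantage can coexist with an energy disadvantage follows directly from energy being the product of average power and runtime. The relation below is a short worked sketch. The symbols are introduced here for illustration only, and the numbers in the example are hypothetical, not measured values.

```latex
% \bar{P}: average power, t: runtime, E: energy per workload
% subscripts "on"/"off": HT enabled / disabled
\[
  E = \bar{P} \cdot t,
  \qquad
  \frac{E_{\mathrm{on}}}{E_{\mathrm{off}}}
    = \frac{\bar{P}_{\mathrm{on}}}{\bar{P}_{\mathrm{off}}}
      \cdot \frac{t_{\mathrm{on}}}{t_{\mathrm{off}}}
\]
% Hypothetical example: +15% power, -10% runtime
\[
  \frac{E_{\mathrm{on}}}{E_{\mathrm{off}}} = 1.15 \cdot 0.90 = 1.035
\]
```

In this hypothetical case the workload finishes 10% faster but still requires 3.5% more energy. HT reduces the energy per workload only if the runtime ratio is smaller than the inverse of the power ratio.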

[Figure 3 plot: bars for runtime, energy, and power (relative change, -30% to 30%). Benchmarks: 310.wupwise, 312.swim, 314.mgrid, 316.applu, 318.galgel, 320.equake, 324.apsi, 326.gafort, 328.fma3d, 330.art, 332.ammp, and average.]

Figure 3: Runtime, power consumption and energy consumption of SPEC OMPM2001 benchmarks for the Westmere-EP test system. Running two threads per core consistently increases the power consumption. The energy efficiency typically decreases.

5. CONCLUSION
This paper presents an in-depth study of SMT on current x86_64 Intel processors. Using a sophisticated power measurement methodology we extend the traditional performance analysis by including important energy efficiency aspects. A set of synthetic low-level microbenchmarks is used to demonstrate how the use of SMT increases the power consumption of a Westmere-EP compute node by up to 10% even though the data throughput remains unchanged. We also show that this deficiency of Intel's SMT implementation has been effectively removed in the latest Sandy Bridge microarchitecture.

Compared to earlier processors, the newer SMT implementation also provides more significant performance gains when running application benchmarks. Although the use of SMT still increases the power consumption consistently for all workloads, the performance gains typically outweigh the increased power consumption. For parallel programs that do not scale with the number of parallel tasks (e.g. some of the SPEC OMP workloads), using SMT may increase both runtime and power consumption at the same time. Although our study shows significant advancements of Intel's SMT implementation, the use of SMT still needs to be carefully considered, preferably not only with the runtime but also the power consumption in mind.

Acknowledgment
This work has been funded by the Bundesministerium für Bildung und Forschung via the Spitzencluster CoolSilicon (BMBF 13N10186) and the eeClust research project (01IH08008C). We thank Intel Germany for providing us with an Intel Xeon E3-1280 evaluation system.

6. REFERENCES
[1] Vishal Aslot, Max J. Domeika, Rudolf Eigenmann, Greg Gaertner, Wesley B. Jones, and Bodo Parady. SPEComp: A new benchmark suite for measuring parallel performance. In Proceedings of the International Workshop on OpenMP Applications and Tools: OpenMP Shared Memory Parallel Programming, WOMPAT '01, pages 1–10, London, UK, 2001. Springer-Verlag.
[2] Mikhail N. Dorozhevets and Peter Wolcott. The El'brus-3 and MARS-M: Recent advances in Russian high-performance computing. The Journal of Supercomputing, 6:5–48, 1992. 10.1007/BF00128641.
[3] Karl Fürlinger, Michael Gerndt, and Jack Dongarra. Scalability analysis of the SPEC OpenMP benchmarks on large-scale shared memory multiprocessors. In Yong Shi, Geert van Albada, Jack Dongarra, and Peter Sloot, editors, Computational Science – ICCS 2007, volume 4488 of Lecture Notes in Computer Science, pages 815–822. Springer Berlin / Heidelberg, 2007.
[4] Daniel Hackenberg, Daniel Molka, and Wolfgang E. Nagel. Comparing cache architectures and coherency protocols on x86-64 multicore SMP systems. In MICRO 42: Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, pages 413–422, New York, NY, USA, 2009. ACM.
[5] John L. Henning. SPEC CPU2006 benchmark descriptions. SIGARCH Comput. Archit. News, 34:1–17, September 2006.
[6] David Koufaty and Deborah T. Marr. Hyperthreading technology in the Netburst microarchitecture. IEEE Micro, 23:56–65, March 2003.
[7] Yingmin Li, David Brooks, Zhigang Hu, Kevin Skadron, and Pradip Bose. Understanding the energy efficiency of simultaneous multithreading. In Proceedings of the 2004 International Symposium on Low Power Electronics and Design, ISLPED '04, pages 44–49, New York, NY, USA, 2004. ACM.
[8] Deborah T. Marr, Frank Binns, David L. Hill, Glenn Hinton, David A. Koufaty, Alan J. Miller, and Michael Upton. Hyper-Threading technology architecture and microarchitecture. Intel Technology Journal, 6(1), February 2002.
[9] Daniel Molka, Daniel Hackenberg, Robert Schöne, and Matthias S. Müller. Memory performance and cache coherency effects on an Intel Nehalem multiprocessor system. In PACT '09: Proceedings of the 2009 18th International Conference on Parallel Architectures and Compilation Techniques, pages 261–270, Washington, DC, USA, 2009. IEEE Computer Society.
[10] Daniel Molka, Daniel Hackenberg, Robert Schöne, and Matthias S. Müller. Characterizing the energy consumption of data transfers and arithmetic operations on x86-64 processors. In Proceedings of the 1st International Green Computing Conference, pages 123–133. IEEE, 2010.
[11] Subhash Saini, Andrey Naraikin, Rupak Biswas, David Barkai, and Timothy Sandstrom. Early performance evaluation of a "Nehalem" cluster using scientific and engineering applications. SC Conference, 0:1–12, 2009.
[12] D.M. Tullsen, S.J. Eggers, and H.M. Levy. Simultaneous multithreading: Maximizing on-chip parallelism. In Computer Architecture, 1995. Proceedings. 22nd Annual International Symposium on, pages 392–403, June 1995.
[13] Theo Ungerer, Borut Robič, and Jurij Šilc. A survey of processors with explicit multithreading. ACM Comput. Surv., 35:29–63, March 2003.
