[3B2-9] mmi2012030110.3d 17/5/012 12:23 Page 110

...... WHAT IS HAPPENING TO POWER, PERFORMANCE, AND SOFTWARE?

...... SYSTEMATICALLY EXPLORING POWER, PERFORMANCE, AND ENERGY SHEDS NEW LIGHT

ON THE CLASH OF TWO TRENDS THAT UNFOLDED OVER THE PAST DECADE: THE RISE

OF PARALLEL PROCESSORS IN RESPONSE TO TECHNOLOGY CONSTRAINTS ON POWER,

CLOCK SPEED, AND WIRE DELAY; AND THE RISE OF MANAGED HIGH-LEVEL, PORTABLE

PROGRAMMING LANGUAGES.

...... Quantitative performance analy- seeking performance improvements and en- sis is the foundation for computer system de- ergy efficiency in smaller technologies to sign and innovation. In their classic paper, build parallel heterogeneous architectures. EmerandClarknotedthat‘‘alackof This hardware requires parallel software detailed timing information impairs efforts and exposes software to ongoing hardware to improve performance.’’1 They pioneered upheaval. Unfortunately, most software Hadi Esmaeilzadeh the quantitative approach by characterizing today is not parallel, nor is it designed to instruction mix and cycles per instruction modularly decompose onto a heterogeneous University of Washington on timesharing workloads. Emer and Clark substrate. surprised expert readers by demonstrating a Over this same decade, Moore’s transistor gap between the theoretical 1 million bounty drove orthogonal and disruptive soft- Ting Cao instructions per second (MIPS) peak of the ware changes with respect to how software is VAX-11/780 and the 0.5 MIPS it delivered deployed, sold, and built. Software demands Xi Yang on real workloads. Hardware and software for correctness, complexity management, researchers in industry and academia now programmer productivity, time-to-market, Stephen M. Blackburn use and have extended this principled perfor- reliability, security, and portability pushed Australian National mance analysis methodology. Our research developers away from low-level compiled applies this quantitative approach to mea- ahead-of-time (native) programming lan- University sured power and energy. guages. Developers increasingly choose This work is timely because the past de- high-level managed programming languages cade heralded the era of power-constrained with a selection of safe pointer disciplines, Kathryn S. McKinley hardware design. Hardware demands for en- garbage collection (automatic memory man- ergy efficiency intensified in large-scale sys- agement), extensive standard libraries, and Microsoft Research tems, in which power began to dominate portability through dynamic just-in-time costs, and in mobile systems, which are con- compilation. For example, modern web ser- strained by battery life. Unfortunately, tech- vices combine managed languages, such as nology limits on power retard Dennard PHP on the server side and JavaScript on scaling2 and prevent systems from using all theclientside.Inmarketsasdiverseasfinan- transistors simultaneously (dark silicon).3 cial software and cell phone applications, These constraints are forcing architects Java and .NET are the dominant choice......

110 Published by the IEEE Computer Society 0272-1732/12/$31.00 c 2012 IEEE [3B2-9] mmi2012030110.3d 17/5/012 12:23 Page 111

...... Related Work in Power Measurement, Power Modeling, and Methodology Processor design literature is full of performance measurement and analysis. Despite power’s growing importance, power measurements References 1-3 are still relatively rare. Isci and Martonosi combine a clamp ammeter 1. X. Fan, W.D. Weber, and L.A. Barroso, ‘‘Power Provisioning for with performance counters for per-unit power estimation of the Pen- a Warehouse-sized Computer,‘‘ Proc. 34th Ann. Int’l Symp. 2 tium 4 on SPEC CPU2000. Fan et al. estimate whole-system power for Computer Architecture (ISCA 07), ACM, 2007, pp. 13-23. 1 large-scale data centers. They find that even the most power-consuming 2. C. Isci and M. Martonosi, ‘‘Runtime Power Monitoring in workloads draw less than 60 percent of peak possible power consump- High-End Processors: Methodology and Empirical Data‘‘, tion. We measure chip power and support their results by showing that Proc.36thAnn.IEEE/ACMInt’lSymp.Microarchitecture thermal design power (TDP) does not predict measured chip power. Our (Micro 36), IEEE CS, 2003, pp. 93-104. work is the first to compare microarchitectures, technology generations, 3. E. Le Sueur and G. Heiser, ‘‘Dynamic Voltage and Frequency individual benchmarks, and workloads in the context of power and Scaling: The Laws of Diminishing Returns,‘‘ Proc. Int’l Conf. performance. Power-Aware Computing and Systems (HotPower 10), USENIX Power modeling is necessary to thoroughly explore architecture Assoc., 2010, pp. 1-8. 4-6 design. Measurement complements simulation by providing valida- 4. O. Azizi et al., ‘‘Energy-Performance Tradeoffs in Processor tion. For example, some prior simulators used TDP, but our measure- Architecture and Circuit Design: A Marginal Cost Analysis,‘‘ ments show that this estimate is not accurate. As we look to the Proc. 37th Ann Int’l Symp. Computer Architecture (ISCA 10), future, programmers will need to tune their applications for power 2010, pp. 26-36. and energy, and not just performance. Just as architectural event per- 5. Y. Li et al., ‘‘CMP Design Space Exploration Subject to Phys- formance counters provide insight to applications, so will power and ical Constraints,‘‘ Proc. 12th Int’l Symp. High-Performance energy measurements. Computer Architecture, IEEE CS, 2006, pp. 17-28. Although the results show conclusively that managed and native 6. J.B. Brockman et al., ‘‘McPAT: An Integrated Power, Area, workloads respond differently to architectural variations, perhaps this re- and Timing Modeling Framework for Multicore and Manycore 7 sult should not be surprising. Unfortunately, few architecture or operat- Architectures,‘‘ Proc. 42nd Ann. IEEE/ACM Int’l Symp. Micro- ing systems publications with processor measurements or simulated architecture (MICRO 42), IEEE CS, 2009, pp. 469-480. designs use Java or any other managed workloads, even though the 7. S.M. Blackburn et al., ‘‘Wake Up and Smell the Coffee: Eval- evaluation methodologies we use here for real processors and those uation Methodologies for the 21st Century,‘‘ Comm. ACM, 7 for simulators are well developed. vol. 51, no. 8, 2008, pp.83-89.

Exponential performance improvements in is the first to systematically measure the hardware hid many of the costs of high- power, performance, and energy characteris- level languages and helped create a virtuous tics of software and hardware across a range cycle with ever more capable, reliable, and of processors, technologies, and workloads. well performing software. This ecosystem is We execute 61 diverse sequential and paral- resulting in an explosion of developers, soft- lel benchmarks written in three native lan- ware, and devices that continue to change guages and one managed language, all of how we live and learn. which are widely used: C, C++, Fortran, Unfortunately, a lack of detailed power and Java. We choose Java because it has ma- measurements is impairing efforts to reduce ture virtual machine technology and substan- energy consumption on modern software. tial open-source benchmarks. We choose eight representative Intel IA32 processors Examining power, performance, and energy from five technology generations (130 nm This work quantitatively examines power, to 32 nm). Each processor has an isolated performance, and energy during this period processor power supply on the motherboard. of disruptive software and hardware changes Each processor has a power supply with sta- (2003 to 2011). Voluminous research explores ble voltage, to which we attach a Hall effect performance and a growing body of work sensor that measures current and, hence, pro- explores power (see the ‘‘Related Work in cessor power. We calibrate and validate our Power Measurement, Power Modeling, sensor data. We find that power consump- and Methodology’’ sidebar), but our work tion varies widely among benchmarks......

MAY/JUNE 2012 111 [3B2-9] mmi2012030110.3d 17/5/012 12:23 Page 112

...... TOP PICKS

Findings Our results recommend that systems research- Power consumption is highly application dependent and is poorly correlated to TDP. ers include managed, native, sequential and Energy-efficient architecture design is very sensitive to workload. Configurations in parallel workloads when designing and eval- the native nonscalable Pareto frontier differ substantially from all the other workloads. uating energy-efficient systems. Comparing one core to two, enabling a core is not consistently energy efficient. The Java Virtual Machine induces parallelism into the execution of single-threaded Architecture Java benchmarks. Hardware features such as clock scaling, Simultaneous multithreading delivers substantial energy savings for recent hardware gross microarchitecture, simultaneous multi- and in-order processors. threading (SMT), and chip multiprocessors The most recent processor in our study does not consistently increase energy (CMPs) each elicit a huge variety of power, consumption as its clock increases. performance, and energy responses. This va- The power/performance response to clock scaling of the native nonscalable workload riety and the difficulty of obtaining power differs from the other workloads. measurements recommends exposing on-chip power meters and, when possible, structure- Figure 1. We organize our discussion around these seven findings from specific power meters for cores, caches, and an analysis of measured chip power, performance, and energy on other structures. 61 workloads and eight processors. The ASPLOS paper includes more Modern processors include power- findings and analysis.4 management techniques that monitor power sensors to minimize power usage and boost performance, but these sensors are not visi- ble. Only in 2011, after our original paper, Furthermore, relative power, performance, did Intel first expose energy counters in a and energy are not well predicted by core production processor ().5 Just count, clock speed, or reported thermal de- as hardware event counters provide a quanti- sign power (TDP). tative grounding for performance innova- We use controlled hardware configura- tions, future architectures should include tions to explore the energy impact of hard- power meters to drive innovation in the ware features, software parallelism, and power-constrained computer systems era. workload. We perform historical and Pareto Measurement is the key to understanding analyses that identify the most energy- and optimization. efficient designs in our architecture configu- ration space. We made all of our data pub- Methodology licly available in the ACM Digital Library This section overviews essential elements as a companion to our original ASPLOS of our experimental methodology. For a 2011 paper.4 Our data quantifies the extent, more detailed treatment, see our original with precision and depth, of some known paper.4 workload and hardware trends and some pre- viously unobserved trends. This article is Software organized around just seven findings, listed We systematically explored workload se- in Figure 1, from our more comprehensive lection because it is a critical component ASPLOS analysis. Two themes emerge for analyzing power and performance. Native from our analysis with respect to workload and managed applications embody different and architecture. tradeoffs between performance, reliability, portability, and deployment. It is impossible Workload to meaningfully separate language from The power, performance, and energy workload. We offer no commentary on the trends of nonscalable native workloads differ virtue of language choice. We created the fol- substantially from native parallel and managed lowing four workloads from 61 benchmarks. workloads. For example, the SPEC CPU2006 native benchmarks draw significantly less Native nonscalable benchmarks: C, C++, power than parallel benchmarks; and and Fortran single-threaded compute- managed runtimes exploit parallelism even intensive benchmarks from SPEC when executing single-threaded applications. CPU2006...... 112 IEEE MICRO [3B2-9] mmi2012030110.3d 17/5/012 12:23 Page 113

Native scalable benchmarks: Multi- cores on the chip multiprocessors (CMP), threaded C and C++ benchmarks disabled simultaneous multithreading (SMT), from PARSEC. and disabled Turbo Boost using operating Java nonscalable benchmarks: Single and system boot time BIOS configuration. multithreaded benchmarks that do not scale well from SPECjvm, DaCapo Power, performance, and energy measurements 06-10-MR2, DaCapo 9.12, and We measured on-chip power by isolating pjbb2005. the direct current (DC) power supply to the Java scalable benchmarks: Multithreaded processor on the motherboard. Prior work Java benchmarks from DaCapo 9.12 used a clamp ammeter, which can only mea- that scale in performance similarly sure the whole system alternating current to native scalable benchmarks on the (AC) supply.7-9 We used Pololu’s ACS714 i7 (45). current sensor board. The board is a carrier for Allegro’s 5 A ACS714 Hall effect- We execute the Java benchmarks on the based linear current sensor. The sensor Oracle HotSpot 1.6.0 Virtual Machine be- accepts a bidirectional current input with a cause it is a mature high-performance virtual magnitude up to 5 A. The output is an ana- machine. The virtual machine dynamically log voltage (185 mV/A) centered at 2.5 V optimizes each benchmark on each architec- with a typical error of less than 1.5 percent. ture. We used best practices for virtual We place the sensors on the 12 V power machine measurement of steady-state perfor- line that supplies only the processor. We mance.6 We compiled native nonscalable measured voltage and found it to be stable, with icc at -o3.Weusedgccat-o3 for na- varying less than 1 percent. We sent the val- tive scalable benchmarks because icc did not ues from the current sensor to the machine’s correctly compile all benchmarks. The icc USB port using a Sparkfun’s Atmel AVR compiler usually produces better performing Stick, which is a simple data-logging device code than the gcc compiler. We executed the with a data-sampling rate of 50 Hz. We same native binaries on all machines. All of used a similar arrangement with a 30A Hall the parallel native benchmarks scale up to effect sensor for the high power i7 (45). We eighthardwarecontexts.TheJavascalable executed each benchmark, logged its power benchmarks are the subset of Java bench- values, and then computed average power marks that scale similarly to the native scal- consumption. able benchmarks. After publishing the original paper, Intel made chip-level and core-level energy meas- Hardware urements available on Sandy Bridge process- Table 1 lists the eight Intel IA32 process- ors.5 Our methodology should slightly ors that cover four process technologies overstate chip power because it includes (130 nm, 65 nm, 45 nm, and 32 nm) and losses due to the motherboard’s voltage regu- four microarchitectures (NetBurst, Core, lator. Validating against the Sandy Bridge Bonnell, and Nehalem). The release price energy counter shows that our power measure- and date give context regarding Intel’s market ments consistently measure about 5 percent placement. The Atoms and the Core 2Q (65) more current. Kentsfield are extreme market points. These We executed each benchmark multiple processors are only examples of many pro- times. The aggregate 95 percent confidence cessors in each family. For example, Intel intervals of execution time and power range sells more than 60 Nehalems at 45 nm, rang- from 0.7 to 4 percent. The measurement ing in price from around US$190 to more error in time and power for all processors than US$3,700. We used these samples be- and applications is low. cause they were sold at similar price points. We compute arithmetic means over the To explore the influence of architectural four workloads, weighting each workload features, we selectively down-clocked equally. To avoid biasing performance meas- (reduced the frequency of the CPU from urements to any one architecture, we com- its default) the processors, disabled the pute a reference performance. We normalize ......

MAY/JUNE 2012 113 [3B2-9] mmi2012030110.3d 17/5/012 12:23 Page 114

...... TOP PICKS

Table 1. Specifications for the eight experimental processors.

Processor mArch Processor sSpec Release date Price (US$) CMP SMT LLC B Pentium 4 NetBurst Northwood SL6WF May 2003 — 1C2T 512 K Core 2 Duo E6600 Core SL9S8 July 2006 $316 2C1T 4 M Core 2 Quad Q6600 Core Kentsfield SL9UM Jan. 2007 $851 4C1T 8 M Core i7 920 Nehalem Bloomfield SLBCH Nov. 2008 $284 4C2T 8 M Atom 230 Bonnell Diamondville SLB6Z June 2008 $29 1C2T 512 K Core 2 Duo E7600 Core Wolfdale SLGTD May 2009 $133 2C1T 3 M Atom D510 Bonnell Pineview SLBLA Dec. 2009 $63 2C2T 1 M Core...... i5 670 Nehalem Clarkdale SLBLT Jan. 2010 ...... $284 2C2T 4 M . * mArch: microarchitecture; CMP: chip multiprocessor; SMT: simultaneous multithreading; LLC: last level cache; VID: processor’s stock voltage; TDP: thermal design power; FSB: front-side bus; B/W: bandwidth.

We measured 45 processor configurations 100 (8 stock and 37 BIOS configured) and pro- Native nonscale duced power and performance data for 90 Native scale each benchmark and processor, as illustrated Java nonscale 80 Java scale in Figure 2.

70 Perspective The rest of this article organizes our anal- 60 ysis by the seven findings listed in Figure 1.

Power (W) 50 (The ASPLOS paper contains additional analysis, results, and findings.) We begin 40 with broad trends. We show that applica- tions exhibit a large range of power and per- 30 formance characteristics that are not well 20 summarized by a single number. This section 2.00 3.00 4.00 5.00 6.00 7.00 8.00 conducts a Pareto energy efficiency analysis Performance/reference for all the 45 nm processor configurations. Even with this modest exploration of archi- Figure 2. Power/performance distribution on the i7 (45). Each point tectural features, the results indicate that represents one of the 61 benchmarks. Power consumption is highly each workload prefers a different processor variable among the benchmarks, spanning from 23 W to 89 W. The wide configuration for energy efficiency. spectrum of power responses from different benchmarks points to power saving opportunities in software. Power is application dependent The nominal thermal design power (TDP) for a processor is the amount of individual benchmark execution times to the power the chip may dissipate without average execution time when executed on exceeding the maximum transistor junction four architectures: Pentium 4 (130), Core temperature. Table 1 lists the TDP for 2D (65), Atom (45), and i5 (32). These each processor. Because measuring real pro- choices capture four microarchitectures and cessor power is difficult and TDP is readily four technology generations. We also normal- available, researchers often substitute TDP ized energy to a reference, since energy ¼ for real measured power. Figure 3 shows power time. The reference energy is the av- that this substitution is problematic. It erage power on these four processors times plots on a logarithmic scale measured the average execution time. Given a power power for each benchmark on each stock and time measurement, we compute energy processor as a function of TDP. TDP is and then normalize it to the reference energy. marked with an 7. Note that TDP is strictly ...... 114 IEEE MICRO [3B2-9] mmi2012030110.3d 17/5/012 12:23 Page 115

Clock (GHz) nm Trans (M) Die (mm2) VID range (V) TDP (W) FSB (MHz) B/W (Gbytes/s) DRAM model

2.4 130 55 131 — 66 800 — DDR-400 2.4 65 291 143 0.85—1.50 65 1,066 — DDR2-800 2.4 65 582 286 0.85—1.50 105 1,066 — DDR2-800 2.7 45 731 263 0.80—1.38 130 — 25.6 DDR3-1066 1.7 45 47 26 0.90—1.16 4 533 — DDR2-800 3.1 45 228 82 0.85—1.36 65 1,066 — DDR2-800 1.7 45 176 87 0.80—1.17 13 665 — DDR2-800 ...... 3.4 32 382 81 0.65—1.40...... 73 — 21.0 DDR3-1333

higher than actual power, that the gap be- tween peak measured power and TDP varies P4 (130) from processor to processor, and that TDP is 100 C2D (65) up to a factor of four higher than measured C2Q (65) i7 (45) power. The variation among benchmarks is Atom (45) highest on the i7 (45) and i5 (32), which C2D (45) likely reflects their advanced power manage- AtomD (45) i5 (32) ment. For example on the i7 (45), measured power varies between 23 W for 471.omnetpp and 89 W for fluidanimate. The smallest 10 variation between maximum and minimum is on the Atom (45) at 30 percent. This Measured power (W) (log) Measured trend is not new. All the processors exhibit a range of application specific power varia- tion. TDP loosely correlates with power con- 1 sumption, but it does not provide a good 1 10 100 estimate for maximum power consumption TDP (W) (log) of individual processors, comparing among processors, or approximating benchmark- Figure 3. Measured power for each processor running 61 benchmarks. specific power consumption. Each point represents measured power for one benchmark. An 7 shows Finding: Power consumption is highly ap- the reported TDP for each processor. Power is application dependent and plication dependent and is poorly correlated does not strongly correlate with TDP. to TDP. Figure 2 plots power versus relative per- formance for each benchmark on the i7 (45) Power is not strongly correlated with perfor- with eight hardware contexts, which is the mance across benchmarks. If the correlation most recent of the 45 nm processors. We mea- were strong, the points would form a straight sure this data for every processor configuration. line. For example, the point on the bottom Native and managed benchmarks are differen- right of the figure achieves almost the best tiated by color, whereas scalable and nonscal- relative performance and lowest power. able benchmarks are differentiated by shape. Unsurprisingly, the scalable benchmarks (tri- Pareto analysis at 45 nm angles) tend to perform the best and con- The Pareto optimal frontier defines a set sume the most power. More unexpected is of choices that are most efficient in a tradeoff the range of performance and power charac- space. Prior research uses the Pareto frontier teristics of the nonscalable benchmarks. to explore power versus performance using ......

MAY/JUNE 2012 115 [3B2-9] mmi2012030110.3d 17/5/012 12:23 Page 116

...... TOP PICKS

from 4 to 29 by configuring the number of 0.60 hardware contexts (CMP and SMT), clock Average scaling, and disabling or enabling Turbo 0.55 Native nonscale Native scale Boost. The 25 nonstock configurations repre- 0.50 Java nonscale sent alternative design points. We then com- Java scale pute an energy and performance scatter plot 0.45 (not shown here) for each hardware configu- 0.40 ration, workload, and workload average. We next pick off the frontier—the points that 0.35 are not dominated in performance or energy efficiency by any other point—and fit them 0.30 with a polynomial curve. Figure 4 plots these polynomial curves for each workload Normalized workload energy 0.25 and the average. The rightmost curve delivers 0.20 the best performance for the least energy. Each row of Figure 5 corresponds to one of 0.15 thefivecurvesinFigure4.Thecheckmarks 0.00 2.00 4.00 6.00 Workload performance/workload reference performance identify the Pareto-efficient configurations that define the bounding curve and include Figure 4. Energy/performance Pareto frontiers for 45 nm processors. 15 of 29 processor configurations. Somewhat The energy/performance Pareto frontiers are workload-dependent and surprising is that none of the Atom D (45) significantly deviate from the average. configurations are Pareto-efficient. Notice:

Native nonscalable shares only one choice with any other workload. models to derive potential architectural Java scalable and the average share all designs on the frontier.10 We present a Pareto the same choices. frontier derived from measured performance Only two of 11 choices for Java non- and power. We hold the process technology scalable and scalable workloads are constant by using the four 45 nm processors: common to both. Atom, Atom D, Core 2D, and i7. We ex- Native nonscalable does not include the pand the number of processor configurations Atom (45) in its frontier.

5) 1C2T@ 2.4GHz 4

Atom (45)Core [email protected] 2DCore (45) [email protected] (45) (45) [email protected] 1C1T@i7 (45) 2.7GHz1C1T@i7(45) 1C2T@ 2.7GHzNoi7 ( TB 1.6GHzi7 (45) 2C1T@i7 (45) [email protected] (45) 1.6GHz4C1T@i7 (45) [email protected] (45) 2.7GHz4C2T@Noi7 (45) TB [email protected] (45) 2.1GHz4C2T@i7 (45) 2.7GHz4C2T@ No2.7GHz TB Native nonscalable Native scalable Java nonscalable Java scalable Average

Figure 5. Pareto-efficient processor configurations for each workload. Stock configurations are bold. Each 3 indicates that the configuration is on the energy/performance Pareto- optimal curve. Native nonscalable has almost no overlap with any other workload.

...... 116 IEEE MICRO [3B2-9] mmi2012030110.3d 17/5/012 12:23 Page 117

1.60 1.20 1.50 i7 (45) i5 (32) i7 (45) i5 (32) 1.40 1.10 1.30 1.00 1.20 1.10 0.90 1.00 0.90 0.80 2 cores/1 core 2 cores/1 0.80 core 2 cores/1 0.70 0.70 0.60 0.60 Performance Power Energy Native Native Java Java (a) (b) nonscale scale nonscale scale

Figure 6. CMP: Comparing two cores to one core. Impact of doubling the number of cores on performance, power, and energy, averaged over all four workloads (a). Energy impact of doubling the number of cores for each workload (b). Doubling the cores is not consistently energy efficient among processors or workloads.

This last finding contradicts prior simula- hardware parallelism, and gross microarchi- tion work, which concluded that dual-issue tecture. This analysis yielded a large number in-order cores and dual-issue out-of-order of findings and insights. Reader and reviewer cores are Pareto-optimal designs for native feedback yielded diverse opinions as to which nonscalable workloads.10 Instead, we find findings were most surprising and interest- that all of the Pareto efficient points for the ing. To give a flavor of our analysis, this sec- native nonscalable workload map to a tion presents our CMP, SMT, and clock quad-issue out-of-order i7 (45). scaling analysis. Figure 4 shows that each workload devi- ates substantially from the average. Even Chip multiprocessors when the workloads share points, the points Figure 6 shows the average power, perfor- fall in different places on the curves because mance, and energy effects of chip multiproces- each workload exhibits a different energy sors (CMPs) by comparing one core to two and performance tradeoff. Compare the scal- cores from the two most recent processors in able and nonscalable benchmarks at 0.40 our study. We disable Turbo Boost in all normalized energy on the y-axis. It is impres- these analyses because it adjusts power dynam- sive how well these architectures effectively ically based on the number of idle cores. We exploit software parallelism, pushing the disable SMT here to maximally expose curves to the right and increasing normalized thread-level parallelism to the CMP hardware performance from about 3 to 7 while holding feature. Figure 6a compares relative power, energy constant. This measured behavior con- performance, and energy as an average of the firms prior model-based observations about workloads. Figure 6b breaks down the energy software parallelism’s role in extending the en- efficiency as a function of the workload. ergy and performance curve to the right.10 While average energy is reduced by 9 percent whenaddingacoretothei5(32),itis Finding: Energy-efficient architecture design increased by 12 percent when adding a core is very sensitive to workload. Configurations to the i7 (45). Figure 6a shows that the source in the native nonscalable Pareto frontier dif- of this difference is that the i7 (45) experien- fer substantially from all other workloads. ces twice the power overhead for enabling a In summary, architects should use a variety core as the i5 (32) while producing roughly of workloads and, in particular, should avoid the same performance improvement. only using native nonscalable workloads. Finding: Comparing one core to two, enabling Feature analysis a core is not consistently energy efficient. Our original paper evaluates the energy Figure 6b shows that native nonscalable effect of a range of hardware features: clock and Java nonscalable benchmarks suffer the frequency, die shrink, memory hierarchy, most energy overhead with the addition of ......

MAY/JUNE 2012 117 [3B2-9] mmi2012030110.3d 17/5/012 12:23 Page 118

...... TOP PICKS

1.60 that the garbage collector induced memory system improvements with more cores by 1.50 reducing the collector’s displacement effect 1.40 on the application thread. 1.30 Simultaneous multithreading 1.20 Figure 8 shows the effect of disabling si-

2 cores / 1 core 1.10 multaneous multithreading (SMT) on the 1.00 Pentium 4 (130), Atom (45), i5 (32), and i7 (45).11 Each processor supports two-way 0.90 SMT. SMT provides fine-grained parallelism db fop jack jess dio antlr bloat javac to distinct threads in the processors’ issue luindex compress mpegau logic and threads share all processor compo- nents, such as execution units and caches. Figure 7. Scalability of single-threaded Java benchmarks. Counterintuitively, Singhal states that the small amount of some single-threaded Java benchmarks scale well. They scale because logic exclusive to SMT consumes very little 12 the underlying JVM exploits parallelism during compilation, profiling, and power. Nonetheless, this logic is integrated, garbage collection. so SMT contributes a small amount to total power even when disabled. Therefore, our results slightly underestimate SMT’s power another core on the i7 (45). As expected, cost. We use only one core to ensure that performance for native nonscalable bench- SMT is the sole opportunity for thread-level marks is unaffected. However, turning on parallelism. Figure 8a shows that SMT’s per- an additional core for native nonscalable formance advantage is significant. On the i5 benchmarks leads to a power increase of (32) and Atom (45), SMT improves average 4 percent and 14 percent, respectively, for performance significantly, without much cost the i5 (32) and i7 (45), translating to energy in power, leading to net energy savings. overheads. Finding: SMT delivers substantial energy More interesting is that Java nonscalable savings for recent hardware and for in- benchmarks do not incur energy overhead order processors. when enabling another core on the i5 (32). In fact, we were surprised to find that the Given that SMT was motivated by the single-threaded Java nonscalable benchmarks challenge of filling issue slots and hiding la- run faster with two processors! Figure 7 tency in wide-issue superscalars, it may ap- shows the scalability of the single-threaded pear counter intuitive that performance on subset of Java nonscalable on the i7 (45), the dual-issue in-order Atom (45) should comparing one and two cores. Although benefit so much more from SMT than the the Java benchmarks are single threaded, quad-issue i7 (45) and i5 (32) benefit. One the Java virtual machines (JVMs) on which explanation is that the in-order pipelined they execute are not. Atom (45) is more restricted in its capacity to fill issue slots. Compared to other process- Finding: The JVM induces parallelism into the ors in this study, the Atom (45) has much execution of single-threaded Java benchmarks. smaller caches. These features accentuate Because virtual machine runtime services theneedtohidelatencyandthereforethe for managed languages—such as just-in-time value of SMT. The performance improve- (JIT) compilation, profiling, and garbage col- ments on the Pentium 4 (130) due to lection—are often concurrent and parallel, SMT are half to one-third of more-recent they provide substantial scope for paralleliza- processors. Consequently, there is no net en- tion, even within ostensibly sequential appli- ergy advantage. This result is not so surpris- cations. We instrumented the HotSpot JVM ing given that the Pentium 4 (130) was the and found that HotSpot JIT compilation first commercial SMT implementation. and garbage collection are parallel. Detailed Figure 8b shows that, as expected, native performance-counter measurements revealed nonscalable benchmarks experience little ...... 118 IEEE MICRO [3B2-9] mmi2012030110.3d 17/5/012 12:23 Page 119

1.60 1.20 1.50 Pentium 4 (130) Pentium 4 (130) Atom (45) 1.40 1.10 Atom (45) i7 (45) i7 (45) 1.30 i5 (32) 1.00 i5 (32) 1.20 1.10 0.90 1.00 0.90 0.80 2 threads/1 thread 0.80 2 threads/1 thread 0.70 0.70 0.60 0.60 Performance Power Energy Native Native Java Java (a) (b) nonscale scale nonscale scale

Figure 8. SMT: one core with and without SMT. Impact of enabling two-way SMT on a single core with respect to perfor- mance, power, and energy averaged over all four workloads (a). Energy impact of enabling two-way SMT on a single core for each workload (b). Enabling SMT delivers significant energy savings on the recent i5 (32) and the in-order Atom (45). energy overhead due to enabling SMT. the clock is doubled. In stark contrast, dou- Whereas Figure 6b shows that enabling a bling the clock on the i5 (32) leads to a core incurs a significant power and thus en- slight energy reduction. ergy penalty. The scalable benchmarks of Finding: The most recent processor in our course benefit the most from SMT. study does not consistently increase energy Compare Figures 6 and 8 to see SMT’s consumption as its clock increases. effectiveness, which is impressive on recent processors as compared to CMP, particularly A number of factors may explain why the given its low die footprint. SMT provides i5 (32) performs so much better at its highest about half the performance improvement clock rate. First, the i5 is a 32 nm processor, compared to CMP, but incurs much lower while the others are 45 nm and the voltage power costs. These results on the modern setting for each frequency setting is probably processors show SMT in a much more favor- different. Second, the power-performance able light than in a study of the energy effi- curve is nonlinear and these experiments ciency of SMT and CMP using models.13 may observe only the upper (steeper) portion of the curves for i7 (45) and Core 2D (45). Clock scaling Third, although the i5 (32) and i7 (45) share We vary the processor clock on the i7 (45), the same microarchitecture, the second gen- Core 2D (45), and i5 (32) between their min- eration i5 (32) likely incorporates energy imum and maximum settings. The range of improvements. Fourth, the i7 (45) is sub- clock speeds are 1.6 to 2.7 GHz for i7 (45), stantially larger than the other processors, 1.6 to 3.1 GHz for Core 2D (45), and 1.2 with four cores and a larger cache. to 3.5 GHz for i5 (32). Figures 9a and 9b Finding: The power/performance response express changes in power, performance, and to clock scaling of the native nonscalable energy with respect to doubling in clock fre- workload differs from the other workloads. quency over the range of clock speeds to nor- malize and compare across architectures. Figure 9b shows that doubling the clock The three processors experience broadly on the i5 (32) roughly maintains or improves similar performance increases of about 80 energy consumption of all workloads, with percent, but power differences vary substan- the native nonscalable workload improving tially, from 70 percent to 180 percent. On the most. For the i7 (45) and Core 2D the i7 (45) and Core 2D (45), the perfor- (45), doubling the clock raises energy con- mance increases require disproportional sumption. Figure 9d shows that the native power increases. Consequently, energy con- nonscalable workload has different power sumption increases by about 60 percent as and performance behavior compared to the ......

MAY/JUNE 2012 119 [3B2-9] mmi2012030110.3d 17/5/012 12:23 Page 120

...... TOP PICKS

90 170 i7 (45) 80 i7 (45) 150 C2D (45) 70 C2D (45) 130 i5 (32) 60 i5 (32) 110 50 90 40 70 30 50 20 Change (%) 30 Change (%) 10 10 0 –10 –10 Performance Power Energy Native Native Java Java (a) (b) nonscale scale nonscale scale

30.00 1.60 Native nonscale i7 (45) 1.50 Java nonscale C2D (45) 25.00 Native scale 1.40 i5 (32) Java scale 1.30 20.00 1.20 15.00 1.10 Power (W) 1.00 10.00 0.90 0.80 5.00 Energy/energy at base frequency 1.00 1.50 2.00 2.50 1.00 1.50 2.00 2.50 3.00 3.50 (c) Performance/performance at base frequency (d) Performance/reference performance

Figure 9. The impact of clock scaling in stock configurations. Power, performance, and energy impact of doubling clock averaged over all four workloads (a). Energy impact of doubling clock for each workload (b). Average energy performance curve; points are clock speeds (c). Absolute power (watts) and performance on the i7 (45) (dashed lines) and i5 (32) (solid lines) by workload; points are clock speeds (d). The i5 (32) does not consistently increase energy consumption as its clock increases. The power and performance response of the native nonscalable workloads to clock scaling differs from the other workloads.

other workloads, and this difference is largely quantitatively examine their innovations, independent of clock rate. Overall, bench- and researchers should measure power and marks in the native nonscalable workload performance to understand and optimize draw less power and power increases less power, performance, and energy. MICRO steeply as a function of performance increases. The native nonscalable workload Acknowledgments (SPEC CPU2006) is the most widely studied We thank Bob Edwards, Daniel Framp- workload in the architecture literature, but it ton, Katherine Coons, Pradeep Dubey, Jung- is an outlier. These results reinforce the im- woo Ha, Laurence Hellyer, Daniel Jime´nez, portance of including scalable and managed Bert Maher, Norm Jouppi, David Patterson, workloads in energy evaluations. and John Hennessy. This work is supported by Australian Research Council grant ur experimentation and analysis yield DP0666059 and National Science Founda- O a wide-ranging set of findings in this tion grant CSR-0917191. critical time period of hardware and soft- ware upheaval. These experiments and ...... analyses recommend that manufacturers References should expose on-chip power meters to 1. J.S. Emer and D.W. Clark, ‘‘A Characteriza- the community, compiler and operating tion of Processor Performance in the VAX- system developers should understand and 11/780,‘‘ Proc. 11th Ann. Int’l Symp. Com- optimize energy, researchers should use puter Architecture (ISCA 84), ACM, 1984, both managed and native workloads to pp. 301-310...... 120 IEEE MICRO [3B2-9] mmi2012030110.3d 17/5/012 12:23 Page 121

2. M. Bohr, ‘‘A 30-Year Retrospective on Den- Hadi Esmaeilzadeh is a PhD student in the nard’s MOSFET Scaling Paper,‘‘ IEEE SSCS Department of Computer Science and Newsletter, 2007, pp. 11-13. Engineering at the University of Washing- 3. H. Esmaeilzadeh et al., ‘‘Dark Silicon and ton. His research interests include power- the End of Multicore Scaling,‘‘ Proc. 38th efficient architectures, approximate general- Ann. Int’l Symp. Computer Architecture purpose computing, mixed-signal architec- (ISCA 11), ACM, 2011, pp. 365-376. tures, machine learning, and compilers. 4. H. Esmaeilzadeh et al., ‘‘Looking Back on Esmaeilzadeh has an MS in computer science the Language and Hardware Revolutions: from the University of Texas at Austin and Measured Power, Performance, and Scal- an MS in electrical and computer engineer- ing,‘‘ Proc. 16th Int’l Conf. Architectural ing from the University of Tehran. Support for Programming Languages and Operating Systems (Asplos 11), ACM, Ting Cao is a PhD student in the Depart- 2011, pp. 319-332. ment of Computer Science at the Australian 5. S.M. Blackburn et al., ‘‘Wake Up and Smell National University. Her research interests the Coffee: Evaluation Methodologies for include architecture and programming lan- the 21st Century,‘‘ Comm. ACM, vol. 51, guage implementation. Cao has an MS in no. 8, 2008, pp. 83-89. computer science from National University 6. E. Le Sueur and G. Heiser, ‘‘Dynamic Volt- of Defense Technology, China. age and Frequency Scaling: The Laws of Diminishing Returns,‘‘ Proc. Int’l Conf. Xi Yang is a PhD student in the Department Power-Aware Computing and Systems (Hot- of Computer Science at the Australian Power 10), USENIX Assoc., 2010, pp. 1-8. National University. His research interests 7. C. Isci and M. Martonosi, ‘‘Runtime Power include architecture, operating systems, and Monitoring in High-End Processors: Method- runtime systems. Yang has an MPhil in ology and Empirical Data,‘‘ Proc. 36th Ann. computer science from the Australian IEEE/ACM Int’l Symp. Microarchitecture National University. (Micro 36), IEEE CS, 2003, pp. 93-104. 8. W.L. Bircher and L.K. John, ‘‘Analysis of Dy- Stephen M. Blackburn is a professor in the namic Power Management on Multi-Core Research School of Computer Science at the Processors,‘‘ Proc. 22nd Ann. Int’l Conf. Australian National University. His research Supercomputing (ICS 08), ACM, 2008, interests include programming language pp. 327-338. implementation, architecture, and perfor- 9. H. David et al., ‘‘RAPL: Memory Power Es- mance analysis. Blackburn has a PhD in timation and Capping,‘‘ Proc. ACM/IEEE Int’l computer science from the Australian Symp. Low-Power Electronics and Design, National University. He is a Distinguished IEEE CS, 2010, pp. 189-194. Scientist of the ACM. 10. O. Azizi et al., ‘‘Energy-Performance Trade- offs in Processor Architecture and Circuit Kathryn S. McKinley is a principal re- Design: A Marginal Cost Analysis,‘‘ Proc. searcher at Microsoft Research and an 37th Ann Int’l Symp. Computer Architecture endowed professor of computer science at (ISCA 10), 2010, pp. 26-36. the University of Texas at Austin. Her 11. D.M. Tullsen, S.J. Eggers, and H.M. Levy, ‘‘Si- research interests span architecture, program- multaneous Multithreading: Maximizing On- ming language implementation, reliability, chip Parallelism,‘‘ Proc. 22nd Ann. Int’l Symp. and security. McKinley has a PhD in Computer Architecture, 1995, pp. 392-403. computer science from Rice University. She 12. R. Singhal, ‘‘Inside Intel Next Generation is a fellow of IEEE and the ACM. Nehalem Microarchitecture,‘‘ Intel Developer Forum (IDF) presentation, Aug. 2008. Direct questions and comments about this 13. S.V. Adve et al., ‘‘The Energy Efficiency of article to Stephen M. Blackburn, School of CMP vs. SMT for Multimedia Workloads,‘‘ Computer Science, Bldg. 108, Australian Proc. 18th Ann. Int’l Conf. Supercomputing National University, Acton, ACT, 0200, (ICS 04), ACM, 2004, pp. 196-206. Australia; [email protected]......

MAY/JUNE 2012 121