ON POWER-PROPORTIONAL PROCESSORS
by
Yasuko Watanabe
A dissertation submitted in partial fulfillment of
the requirements for the degree of
Doctor of Philosophy
(Computer Sciences)
at the
UNIVERSITY OF WISCONSIN - MADISON
2011

© Copyright by Yasuko Watanabe 2011
All Rights Reserved

Abstract
Although advancements in technology continue to physically shrink transistors, power reduction has fallen behind transistor scaling. This imbalance results in chips with increasing numbers of transistors that consume more power than the chips of preceding generations. The effort to meet an affordable power budget and still maintain continuous performance improvements creates a need to tailor power consumption for the delivered performance—a concept called power proportionality.
Dynamic voltage and frequency scaling (DVFS) is the predominant method of controlling power and performance. However, DVFS has reached a point of decreasing benefits because technology scaling reduces the effective voltage range. Previous attempts to overcome the limitations of DVFS often have undesirable consequences, including increased leakage power and susceptibility to process variations.
Therefore, this thesis investigates microarchitectural mechanisms to achieve power proportionality. From a nominal design point, the core scales up by aggregating resources for higher performance at higher power. Conversely, disabling resources scales down the design for lower power and performance.
We first propose a core design called WiDGET that scales from low-power in-order execution to high-performance out-of-order execution. We achieve this scalability by varying active in-order buffer and functional unit count and by organizing them in a distributed manner for higher latency tolerance. Using low-power in-order buffers makes WiDGET a more power-efficient design than a traditional out-of-order processor.

To explore further scaling opportunities, we also examine trade-offs in achieving energy-efficient scalability using generalized scalable core designs. Due to wire delay, maintaining pipeline balance results in energy inefficiency when scaled up. On the other hand, it is more energy efficient to uniformly scale down the pipeline.
Finally, we explore techniques for scaling power and performance down. We propose a concept called Power Gliding that, guided by the traditional 3:1 power-to-performance rule of DVFS, selectively disables microarchitectural optimizations for efficient power scale-down.
Through two case studies, we empirically show that power gliding frequently does as well as DVFS and performs better in some cases.
With the mechanisms proposed above, this thesis demonstrates processor designs that provide power proportionality beyond DVFS.

Acknowledgments
It has been a long journey to get here: a journey I could not possibly have completed alone. Only
with family, friends, mentors, and teachers could I have completed it. I want to dedicate this dissertation to my fiancé, Joseph Eckert, and my family. Joe has always been there for me. He was very
supportive and believed in me even when I myself could not. He put my education first and never
once complained about me always working on the next deadline. Thank you for bringing laughter
and comfort into my life.
I cannot imagine the bravery and faith my parents had when they allowed their 18-year-old
daughter to leave a small town in Japan to pursue a bachelor’s degree in the U.S. all by herself. I
only found out a couple years ago that my father has been donating to UNICEF under my name
for good luck all these years. My mother actively reached out to exchange students in Japan, hoping that people in the U.S. would do the same for me. My sister and brother always supported my
decisions and encouraged me. I never felt alone even across the Pacific, thanks to my family.
I feel fortunate to have Professor David Wood as my advisor. He is highly intelligent, has a wide scope of knowledge, and is able both to discuss the details and to see the big picture on any topic. He taught me the joy and the depth of research. I will remember all the lessons he gave me to become a great researcher like him.
John Davis was a vital part of my graduate school career. He was always available, even for brainstorming, and provided industrial perspectives and insights. He was also a great mentor for me. It is not an exaggeration to say that I would not have been able to complete my dissertation without his guidance and encouragement.

I also want to thank my committee for their constructive criticism.
My fellow students in the architecture group enriched my graduate school years. In particular, I had the pleasure of working closely with Dan Gibson, who has a quick wit and is caring. He is also a
good teacher and taught me the mechanisms of low-level circuits as well as how to play chess. I
miss our trips to "The Library." Derek Hower has also been supportive. He provided valuable feedback on my dissertation. His ability to think outside the box is admirable.
I also want to thank the students who came before me for their guidance, especially Phillip
Wells, Michael Marty, Alaa Alameldeen, Matthew Allen, Brad Beckmann, Jayaram Bobba,
Koushik Chakraborty, Jichuan Chang, Natalie Enright Jerger, Kevin Moore, Dana Vantrease, Min
Xu, and Luke Yen. In addition, I want to thank my fellow students: Shoaib Altaf, Akanksha Baid,
Arkaprava Basu, Spyridon Blanas, Emily Blem, Marc de Kruijf, Polina Dudnik, Hamid Reza
Ghasemi, Venkatraman Govindaraju, Gagan Gupta, Andrew Nere, Lena Olson, Marc Orr, Jason
Power, Cindy Rubio Gonzalez, Somayeh Sardashti, Rathijit Sen, Srinath Sridharan, Nelay Vaish,
Haris Volos, and Cong Wang.
Doug Burger sparked my interest in computer architecture when I first took an undergraduate
course with him at the University of Texas at Austin. I am thankful that I had an opportunity to
work as an undergraduate research assistant with him, and it was he who first encouraged me to
pursue a Ph.D.
Lastly, I thank the Wisconsin Computer Architecture Affiliates for their time and feedback, the
Computer Systems Laboratory for machine and software support, and the Wisconsin Condor project.
Table of Contents

Abstract
Acknowledgments
Table of Contents
List of Figures
List of Tables
Chapter 1 Introduction
1.1 Technology Trends
1.2 Power Proportionality
1.2.1 Wire Delay
1.3 Desirable Hardware Features for Power Proportionality
1.3.1 WiDGET: Decoupled, In-Order Scalable Cores
1.3.2 Scalable Core Substrate
1.3.3 Power Gliding: Extending the Power-Performance Curve
1.4 Contributions
1.5 Dissertation Structure
Chapter 2 Related Work
2.1 Power-Proportional Computing
2.1.1 Circuit-Level Techniques
2.1.2 System-Level Techniques
2.1.3 Dynamically Adaptive Cores
2.1.4 Heterogeneous Chip Multi-Processors
2.2 Low-Complexity Microarchitectures
2.2.1 Clustered Architectures
2.2.2 Thread-Level Speculation
2.2.3 Approximating OoO Performance with In-Order Execution
2.3 Instruction Steering Cost Model
2.4 Prior Scalable Core Designs
2.5 Designing Power-Proportional Processors
Chapter 3 Evaluation Methodology
3.1 Simulation Tools
3.1.1 Simulation Assumptions
3.2 Workloads
3.3 Common Design Configuration
Chapter 4 WiDGET: Wisconsin Decoupled Grid Execution Tiles
4.1 High-Level Overview
4.2 Toward Practical Instruction Steering
4.3 Microarchitecture
4.3.1 Pipeline Stages
4.3.2 Frontend
4.3.3 Execution Unit
4.3.4 Backend
4.4 Evaluation
4.4.1 Simulation Methodology
4.4.2 Performance Range
4.4.3 Improving Performance
4.4.4 Impacts of a Cluster Size
4.4.5 Power Range
4.5 Summary
Chapter 5 Deconstructing Scalable Cores
5.1 Core Scaling Taxonomy
5.2 Two Abstract Cores: Borrowing vs. Overprovisioning
5.2.1 Trade-offs of Resource Borrowing and Overprovisioning
5.3 Methodology
5.4 Initial Evaluation
5.4.1 Performance Comparison
5.4.2 Performance Sensitivity to Communication Overheads
5.4.3 Chip Power Comparison
5.4.4 Energy Efficiency
5.5 Deconstructing Power-Hungry Components
5.5.1 Scaling the Frontend and Backend Width
5.5.2 Cache Aggregation
5.6 Improving the Energy Efficiency of Scalable Cores
5.6.1 Evaluation
5.7 Summary
Chapter 6 Power Gliding: Extending the Power-Performance Curve
6.1 Limitation of Frequency Scaling and Power Gliding Opportunities
6.1.1 Analysis of Frequency Scaling
6.1.2 Power-Performance Scaling Opportunities
6.2 Methodology
6.3 Case Study 1: Frontend Power Gliding
6.3.1 Implementation
6.3.2 Evaluation
6.4 Case Study 2: L2 Power Gliding
6.4.1 Implementation
6.4.2 Evaluation
6.4.3 Application of Power Gliding to COBRi
6.5 Summary
Chapter 7 Conclusions
7.1 Summary
7.2 Reflections
References
Appendix A Supplements for Instruction Steering Cost Model (Chapter 2)
Appendix B Supplements for Simulation Tools (Chapter 3)
Appendix C Supplements for WiDGET’s Instruction Steering Heuristic (Chapter 4)
Appendix D Supplements for Per-EU Instruction Buffer Limit Study (Chapter 5)
Appendix E Tables of Baseline Values

List of Figures
1-1 Intel thermal design power trend
1-2 Impact of Intel technology scaling
1-3 Supply and threshold voltage trends
1-4 Power proportionality
1-5 Maximum signalling distance vs. clock frequency, M3
1-6 On-chip communication distances in the context of out-of-order core sizes
1-7 Conceptual power proportionality goal by this thesis
1-8 High-level WiDGET design
1-9 Conceptual diagrams of (a) resource borrowing and (b) resource overprovisioning philosophies. Shaded components are shared between cores.
2-1 Salverda and Zilles cost model
3-1 Target CMP
4-1 Conceptual block diagram of WiDGET
4-2 Limitations of the Salverda and Zilles cost model
4-3 WiDGET microarchitecture
4-4 Pipeline Stages
4-5 Frontend
4-6 Pseudo-code for instruction steering
4-7 Execution Unit
4-8 8-EU performance relative to the Neon
4-9 Average cycles spent on each EU state with 8 EUs
4-10 Harmonic mean IPCs relative to the Neon
4-11 Harmonic mean system power relative to the Neon
4-12 Power breakdown relative to the Neon
4-13 Power Proportionality of WiDGET compared to Neon and Mite
4-14 Geometric mean power efficiency (BIPS3/W)
5-1 Conceptual block diagrams of (a) Borrowing All Resources (BAR) and (b) Cheap Overprovisioned Resources (COR) models
5-2 IPC normalized to the baseline OoO
5-3 Percentages of in-flight instructions spent in each state
5-4 Instructions affected by remote operand transfers in BAR
5-5 Misprediction rate of the cache-bank predictor in the BAR
5-6 Memory-level parallelism
5-7 Performance sensitivity to communication overheads
5-8 Chip power normalized to the baseline OoO
5-9 Categorized chip-wide power consumption
5-10 Per-core power breakdown normalized to the baseline OoO
5-11 L1-I access count normalized to BAR1
5-12 Geometric mean energy efficiency
5-13 Conceptual diagrams of BAR4 and COR4 with the default configurations in Table 5-3
5-14 Effect of 2-wide (Narrow) and 4-wide (Wide) frontend/backend on COR
5-15 L1-I aggregation mechanisms across BAR and COR
5-16 L1-I miss rate of default BAR with L1-I aggregation
5-17 L1-D aggregation mechanisms across BAR and COR
5-18 L1-D miss rate of default BAR with L1-D aggregation
5-19 Conceptual block diagram of the COBRA hybrid design
5-20 Power-performance of all designs with the default configurations normalized to the baseline
5-21 IPC of COBRi8 normalized to COBRo8
5-22 MLP of COBRo (left bars) and COBRi (right bars)
5-23 Per-benchmark IPC of COBRo
5-24 Categorized chip-wide power consumption
5-25 Geometric mean energy efficiency
6-1 Conceptual power proportionality goal
6-2 Chip power reduction by frequency scaling
6-3 Run-time slowdown by frequency scaling
6-4 Chip power breakdown at the nominal frequency
6-5 Useful checkpoint rate of the baseline
6-6 Power-performance normalized to the baseline
6-7 Ratio of committed / dispatched instructions
6-8 IPC impacts of the applied techniques with Stall-1
6-9 Power breakdown normalized to the baseline
6-10 Power-performance normalized to the baseline
6-11 Power breakdown normalized to the baseline
6-12 Normalized total L2 power
6-13 Harmonic mean power and performance
A-1 Performance sensitivity under realistic communication delays
C-1 Instruction steering example
D-1 IPC sensitivity
D-2 Chip power sensitivity
D-3 ED sensitivity
D-4 ED2 sensitivity

List of Tables
2-1 Comparison of prior related work with regard to desirable power proportional core attributes
3-1 SPEC CPU 2006 characterization
3-2 Wisconsin commercial workload characterization
3-3 Common configuration parameters
4-1 Machine configurations
5-1 Core scaling taxonomy
5-2 Scaling mechanisms of WiDGET
5-3 Design-Specific Default Configuration Parameters
5-4 Power Categories and Descriptions
5-3 COBRA Configuration Parameters
6-1 Workload characteristics
6-2 Baseline configuration parameters
6-3 Simulated frequency scaling points
6-4 Configuration space for Case Study 1
6-5 Configuration space for Case Study 2
6-6 COBRi configuration parameters
B-1 Simulation parameter space
E-1 Baseline values for Figure 4-8
E-2 Baseline values for Figure 5-2
E-3 Baseline values for Figure 5-7
E-4 Baseline values for Figure 5-11
E-5 Baseline values for Figure 5-21
E-6 Baseline values for Figure 6-2
E-7 Baseline values for Figure 6-8
E-8 Baseline values for Figure 6-13
E-9 TAGE branch misprediction rate
Chapter 1
Introduction
Microprocessor performance has increased dramatically over the past few decades. This rapid
increase in performance was driven, in part, by technological developments that doubled the
number of transistors on chip every two years, a trend described by Moore’s law. Unfortunately, the
power supply voltage of transistors did not improve at the same rate. These mismatched transistor
scaling trends created chip designs with growing numbers of power-inefficient transistors, and, as
a direct result, chip power usage continued to increase exponentially alongside any performance gains until the early 2000s. This unsustainable increase in power usage, known as the Power Wall, creates packaging and cooling problems for the correct operation of processors, and has placed a tight
limit on total chip power.
In an effort to utilize the increasing transistor budget within the constraining framework of the
strict power budget, chip vendors have moved away from traditional uniprocessors to multi-core
chips [43,65]. Multi-core chips better address the Power Wall for two reasons. First, with dynamic voltage and frequency scaling (DVFS) [49], cores can run at lower voltage to stay within an affordable chip power budget while still yielding higher system throughput. DVFS is an effective power management technique due to the cubic relationship between (dynamic) power and performance: for each 3% reduction in power, DVFS reduces performance by only about 1%. Second, the high-throughput nature of multi-cores allows the complexity of each core to be reduced without significantly sacrificing overall performance. Guided by the 3:1 power-to-performance ratio of DVFS, cores can eliminate performance optimizations that exceed the 3:1 ratio in order to save more in power than they lose in performance [32].
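The 3:1 rule of thumb follows from the textbook dynamic-power model. A minimal sketch, assuming P = C·V²·f, that frequency scales roughly linearly with supply voltage (so P ∝ f³), and that performance is proportional to frequency; the numbers below are illustrative, not measurements from this thesis:

```python
def relative_power(rel_perf: float) -> float:
    """Dynamic power relative to nominal for a given relative performance.

    With f proportional to V, P = C * V^2 * f implies P proportional to f^3,
    and performance is assumed proportional to f.
    """
    return rel_perf ** 3

# A 1% performance reduction yields roughly a 3% power reduction:
perf = 0.99
saving = 1.0 - relative_power(perf)
print(f"perf loss: {1 - perf:.2%}, power saving: {saving:.2%}")
```

This cubic sensitivity is exactly why DVFS was so attractive while the voltage range lasted: power falls three times faster than performance.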
Despite increasing performance demands, the utility of DVFS is diminishing due to the narrowing gap between maximum and minimum supply voltages [77]. Consequently, either fewer cores can run simultaneously, or each core must be simplified further to prevent an increase in chip power. The former comes at the cost of reduced system throughput, while the latter is susceptible to sequential bottlenecks. Neither is a desirable solution, especially for future applications with versatile resource requirements [26]. Amdahl’s Law, a key tenet of microprocessor design philosophy, states that the overall run-time enhancement achieved when only a part of the system is improved depends on the time spent executing the non-optimized part. In other words, it is the weakest link in the chain that determines the bottleneck on performance. Therefore, it is critical to balance system throughput and single-thread performance to avoid those performance bottlenecks [5,35].
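Amdahl’s Law as stated above can be made concrete with a short sketch; the fraction and speedup factor below are illustrative examples, not figures from this dissertation:

```python
def amdahl_speedup(parallel_fraction: float, speedup_factor: float) -> float:
    """Overall speedup when only parallel_fraction of the run time
    is accelerated by speedup_factor; the rest runs unimproved."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / speedup_factor)

# Even a 16x speedup on 90% of the work is capped by the 10% serial part:
print(amdahl_speedup(0.90, 16))  # 6.4, far below 16
```

The serial fraction is the "weakest link": as the parallel part is accelerated further, the serial 10% increasingly dominates, which is why single-thread performance cannot simply be traded away for throughput.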
This thesis describes an alternative to DVFS that uses microarchitectural mechanisms for flexible power and performance management. Rather than statically selecting a design point optimal for a small set of workloads, we propose the use of power-proportional cores—cores that dissipate power in proportion to work done—to provide many different operating points appropriate for a broad range of workloads. Chips composed of power-proportional cores can speed up some of the cores for sequential threads at higher power while running as many parallel threads as possible at lower speed so as not to exceed a given power budget. We evaluate power-proportional cores in a single-thread context to limit the scope of this dissertation.
FIGURE 1-1. Intel thermal design power trend [74]
We first review technology trends (Section 1.1) that have led to the need for power-proportional computing (Section 1.2) in more detail. To achieve power proportionality using microarchitectural mechanisms, we analyze what underlying hardware components and mechanisms are desirable, and how to harness them. Section 1.3 provides a brief overview of our findings, and we conclude this chapter by presenting key contributions (Section 1.4) and the structure of this dissertation (Section 1.5).
1.1 Technology Trends
For decades, every performance increase has come at the price of an increase in power usage.
Figure 1-1 demonstrates that the thermal design power (TDP) of Intel chips increased exponentially over the last forty years until power dissipation became too large for affordable cooling systems. Power usage then plateaued with the Pentium 4 processor, the last uniprocessor Intel has produced to date. Until that point, chip designers exploited Moore’s Law by devoting larger numbers of smaller transistors to a single core in pursuit of higher single-thread performance. Those
FIGURE 1-2. Impact of Intel technology scaling [74]
transistors were used to make pipelines both deeper and wider and to increase the degree and aggressiveness of speculation, eventually running into diminishing returns and wasted power.
However, a more significant reason behind the escalation in power usage is the fact that scaling
of supply voltage (Vdd) has lagged in comparison with feature size. Figure 1-2 plots the impact of
Intel technology scaling, normalized to the 4004 processor released in 1971. The figure demonstrates that feature scaling substantially outpaced Vdd scaling by up to two orders of magnitude. That is, a given die area can hold more transistors; however, power per area continues to rise.
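The divergence between feature scaling and Vdd scaling can be illustrated with a simplified first-order model contrasting classical Dennard scaling, where Vdd shrinks with feature size, against the regime described above, where Vdd stays roughly fixed. The model and numbers are an illustrative sketch, not measured data:

```python
def power_density(scale: float, vdd_scales: bool) -> float:
    """Relative power density after shrinking dimensions by 1/scale.

    Per-transistor dynamic power ~ C * V^2 * f, with capacitance C ~ 1/scale
    and frequency f ~ scale; transistor density grows as scale^2.
    """
    v = 1.0 / scale if vdd_scales else 1.0
    per_transistor = (1.0 / scale) * v ** 2 * scale  # C * V^2 * f
    return per_transistor * scale ** 2               # times transistor density

s = 2.0  # halving the feature size
print(power_density(s, vdd_scales=True))   # 1.0: power density stays constant
print(power_density(s, vdd_scales=False))  # 4.0: power density quadruples
```

Under ideal Dennard scaling the V² term cancels the density growth; once Vdd stalls, power per area climbs with each generation, which is the trend Figure 1-2 captures.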
These transistors with increasing power per area and rising chip power led designers to start integrating multiple cores on the same die and use DVFS to manage chip power. Although this paradigm shift enabled performance improvement without exponential power increase, it is not a reliable solution going forward due to an inherent limitation of DVFS. Figure 1-3 plots the past
trend in supply and threshold voltages [71,21] as well as both conservative [12] and optimistic
[25] projections of future supply voltage reductions.

FIGURE 1-3. Supply and threshold voltage trends

These voltage-scaling trends show that the
gap between supply and threshold voltages is closing, and thus the range of voltage scaling is diminishing. In fact, for high-performance transistors, future technology nodes have no room left for voltage scaling, based on the gate drive rule of maintaining at least a 4:1 supply-to-threshold ratio [71]. While various circuit-level techniques have been proposed to address this problem, such as ultra-low voltage operation [19,21], many incur performance, leakage power, and/or reliability issues.
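The 4:1 gate drive rule cited above implies a hard floor on Vdd set by the threshold voltage. A small sketch with example voltage values (illustrative, not taken from the ITRS data plotted in Figure 1-3):

```python
GATE_DRIVE_RATIO = 4.0  # minimum supply-to-threshold ratio per the rule

def min_vdd(vt: float) -> float:
    """Lowest supply voltage permitted by the gate drive rule."""
    return GATE_DRIVE_RATIO * vt

nominal_vdd = 1.0  # example nominal supply
for vt in (0.20, 0.25, 0.30):
    floor = min_vdd(vt)
    headroom = max(0.0, nominal_vdd - floor)
    print(f"Vt={vt:.2f}V -> Vdd floor={floor:.2f}V, DVFS headroom={headroom:.2f}V")
```

Because threshold voltage cannot drop much further without runaway leakage, the Vdd floor stays nearly fixed while nominal Vdd keeps falling, squeezing out the voltage range DVFS relies on.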
The limited leverage of DVFS makes design point selection challenging for multi-cores. Regardless of whether the core composition is homogeneous or heterogeneous, modern statically configured cores have a chosen design point that is optimal only for a few target workloads. To efficiently run a variety of workloads on a single chip, we need to redesign processors to accommodate today’s technology scaling.