ON POWER-PROPORTIONAL PROCESSORS

by

Yasuko Watanabe

A dissertation submitted in partial fulfillment of

the requirements for the degree of

Doctor of Philosophy

(Computer Sciences)

at the

UNIVERSITY OF WISCONSIN - MADISON

2011 © Copyright by Yasuko Watanabe 2011

All Rights Reserved Abstract i

Although advancements in technology continue to physically shrink transistors, power reduc- tion has fallen behind transistor scaling. This imbalance results in chips with increasing numbers of transistors that consume more power than the chips of preceding generations. The effort to meet an affordable power budget and still maintain continuous performance improvements cre- ates a need to tailor power consumption for the delivered performance—a concept called power proportionality.

Dynamic voltage and frequency scaling (DVFS) is the predominant method of controlling power and performance. However, DVFS has reached a point of decreasing benefits because tech- nology scaling reduces the effective voltage range. Previous attempts to overcome the limitations of DVFS often have undesirable consequences, including increased leakage power and process variations.

Therefore, this thesis investigates microarchitectural mechanisms to achieve power propor- tionality. From a nominal design point, the core scales up by aggregating resources for higher per- formance at higher power. Conversely, disabling resources scales down the design for lower power and performance.

We first propose a core design called WiDGET that scales from low-power in-order execution to high-performance out-of-order execution. We achieve this scalability by varying active in-order buffer and functional unit count and by organizing them in a distributed manner for higher latency tolerance. Using low-power in-order buffers makes WiDGET a more power-efficient design than a traditional out-of-order processor. ii To explore further scaling opportunities, we also examine trade-offs in achieving energy-effi- cient scalability using generalized scalable core designs. Due to wire delay, maintaining pipeline balance results in energy inefficiency when scaled up. On the other hand, it is more energy effi- cient to uniformly scale down the pipeline.

Finally, we explore techniques for scaling power and performance down. We propose a con- cept called Power Gliding that selectively disables microarchitectural optimizations that follow the traditional DVFS 3:1 power-to-performance optimization rule for efficient power scale-down.

Through two case studies, we empirically show that power gliding frequently does as well as DVFS and performs better in some cases.

With the mechanisms proposed above, this thesis demonstrates processor designs that provide power proportionality beyond DVFS. Acknowledgments iii

It has been a long journey to get here: a journey I could not have possibly done alone. Only

with family, friends, mentors, and teachers, could I have completed it. I want to dedicate this dis-

sertation to my fiancé, Joseph Eckert and my family. Joe has always been there for me. He was very

supportive and believed in me even when I myself could not. He put my education first and never

once complained about me always working on the next deadline. Thank you for bringing laughter

and comfort into my life.

I cannot imagine the bravery and faith my parents had when they allowed their 18-year-old

daughter to leave a small town in Japan to pursue a bachelor’s degree in the U.S. all by herself. I

only found out a couple years ago that my father has been donating to UNICEF under my name

for good luck all these years. My mother actively reached out to exchange students in Japan, hop-

ing that people in the U.S. would do the same for me. My sister and brother always supported my

decisions and encouraged me. I never felt alone even across the Pacific, thanks to my family.

I feel fortunate to have Professor David Wood as my advisor. He is highly intelligent, has a wide

scope of knowledge, and is able to provide both detailed discussion and see big picture on any

topic. He taught me the joy and the depth of research. I will remember all the lessons he gave me to become a great researcher like him.

John Davis was a vital part of my graduate school career. He was always available even for brainstorming and provided industrial perspectives and insights. He was also a great mentor for me. It is not an exaggeration to say that I would not have been able to complete my dissertation without his guidance and encouragement. iv I also want to thank my committee for their constructive criticism.

The fellow students in the architecture group enriched my graduate school years. In particular,

I had a pleasure to work closely with Dan Gibson, who has a quick wit and is caring. He is also a

good teacher and taught me the mechanisms of low-level circuits as well as how to play chess. I

miss our trips to "The Library." Derek Hower has also been supportive. He provided valuable feed-

back to my dissertation. His ability to think out of the box is admirable.

I also want to thank the students who came before me for their guidance, especially Phillip

Wells, Michael Marty, Alaa Alameldeen, Matthew Allen, Brad Beckmann, Jayaram Bobba,

Koushik Chakraborty, Jichuan Chang, Natalie Enright Jerger, Kevin Moore, Dana Vantrease, Min

Xu, and Luke Yen. In addition, I want to thank my fellow students: Shoaib Altaf, Akanksha Baid,

Arkaprava Basu, Spyridon Blanas Emily Blem, Marc de Kruijf, Polina Dudnik, Hamid Reza

Ghasemi, Venkatraman Govindaraju, Gagan Gupta, Andrew Nere, Lena Olson, Marc Orr, Jason

Power, Cindy Rubio Gonzalez, Somayeh Sardashti, Rathijit Sen, Srinath Sridharan, Nelay Vaish,

Haris Volos, and Cong Wang.

Doug Burger sparked my interest in computer architecture when I first took an undergraduate

course with him at the University of Texas at Austin. I am thankful that I had an opportunity to

work as an undergraduate research assistant with him, and it was he who first encouraged me to

pursue a Ph.D.

Lastly, I thank the Wisconsin Computer Architecture Affiliates for their time and feedback, the

Computer Systems Laboratory for machine and software support, and the Wisconsin Condor project. v

Table of Contents

Abstract...... i

Acknowledgments...... iii

Table of Contents ...... v

List of Figures...... ix

List of Tables ...... xii

Chapter 1 Introduction ...... 1 1.1 Technology Trends ...... 3 1.2 Power Proportionality ...... 6 1.2.1 Wire Delay ...... 7

1.3 Desirable Hardware Features for Power Proportionality ...... 9 1.3.1 WiDGET: Decoupled, In-Order Scalable Cores ...... 11

1.3.2 Scalable Core Substrate ...... 12

1.3.3 Power Gliding: Extending the Power-Performance Curve ...... 14

1.4 Contributions ...... 14 1.5 Dissertation Structure ...... 15

Chapter 2 Related Work...... 17 2.1 Power-Proportional Computing ...... 17 2.1.1 Circuit-Level Techniques ...... 19

2.1.2 System-Level Techniques ...... 20

2.1.3 Dynamically Adaptive Cores ...... 21

2.1.4 Heterogeneous Chip Multi-Processors ...... 21

2.2 Low-Complexity Microarchitectures ...... 22 2.2.1 Clustered Architectures ...... 22 vi

2.2.2 Thread-Level Speculation ...... 23

2.2.3 Approximating OoO Performance with In-Order Execution ...... 23

2.3 Instruction Steering Cost Model ...... 24 2.4 Prior Scalable Core Designs ...... 26 2.5 Designing Power-Proportional Processors ...... 27

Chapter 3 Evaluation Methodology...... 29 3.1 Simulation Tools ...... 29 3.1.1 Simulation Assumptions ...... 30

3.2 Workloads ...... 31 3.3 Common Design Configuration ...... 33

Chapter 4 WiDGET: Wisconsin Decoupled Grid Execution Tiles ...... 34 4.1 High-Level Overview ...... 34 4.2 Toward Practical Instruction Steering ...... 36 4.3 Microarchitecture ...... 38 4.3.1 Pipeline Stages ...... 39

4.3.2 Frontend ...... 40

4.3.3 Execution Unit ...... 42

4.3.4 Backend ...... 44

4.4 Evaluation ...... 44 4.4.1 Simulation Methodology ...... 44

4.4.2 Performance Range ...... 45

4.4.3 Improving Performance ...... 47

4.4.4 Impacts of a Cluster Size ...... 49

4.4.5 Power Range ...... 50

4.5 Summary ...... 55 vii

Chapter 5 Deconstructing Scalable Cores...... 57 5.1 Core Scaling Taxonomy ...... 57 5.2 Two Abstract Cores: Borrowing vs. Overprovisioning ...... 59 5.2.1 Trade-offs of Resource Borrowing and Overprovisioning ...... 61

5.3 Methodology ...... 62 5.4 Initial Evaluation ...... 64 5.4.1 Performance Comparison ...... 64

5.4.2 Performance Sensitivity to Communication Overheads ...... 68

5.4.3 Chip Power Comparison ...... 71

5.4.4 Energy Efficiency ...... 73

5.5 Deconstructing Power-Hungry Components ...... 74 5.5.1 Scaling the Frontend and Backend Width ...... 75

5.5.2 Cache Aggregation ...... 76

5.6 Improving the Energy Efficiency of Scalable Cores ...... 82 5.6.1 Evaluation ...... 84

5.7 Summary ...... 90

Chapter 6 Power Gliding: Extending the Power-Performance Curve ...... 92 6.1 Limitation of Frequency Scaling and Power Gliding Opportunities ...... 94 6.1.1 Analysis of Frequency Scaling ...... 94

6.1.2 Power-Performance Scaling Opportunities ...... 98

6.2 Methodology ...... 99 6.3 Case Study 1: Frontend Power Gliding ...... 100 6.3.1 Implementation ...... 100

6.3.2 Evaluation ...... 103

6.4 Case Study 2: L2 Power Gliding ...... 109 6.4.1 Implementation ...... 110 viii

6.4.2 Evaluation ...... 111

6.4.3 Application of Power Gliding to COBRi ...... 115

6.5 Summary ...... 118

Chapter 7 Conclusions...... 119 7.1 Summary ...... 119 7.2 Reflections ...... 121

References...... 124

Appendix A Supplements for Instruction Steering Cost Model (Chapter 2) ...... 130

Appendix B Supplements for Simulation Tools (Chapter 3)...... 132

Appendix C Supplements for WiDGET’s Instruction Steering Heuristic (Chapter 4) ...... 135

Appendix D Supplements for Per-EU Instruction Buffer Limit Study (Chapter 5)...... 137

Appendix E Tables of Baseline Values ...... 141 ix List of Figures 1-1 thermal design power trend ...... 3 1-2 Impact of Intel technology scaling ...... 4 1-3 Supply and threshold voltage trends ...... 5 1-4 Power proportionality ...... 6 1-5 Maximum signalling distance vs. clock frequency, M3 ...... 8 1-6 On-chip communication distances in the context of out-of-order core sizes ...... 9 1-7 Conceptual power proportionality goal by this thesis ...... 10 1-8 High-level WiDGET design ...... 11 1-9 Conceptual diagrams of (a) resource borrowing and (b) resource overprovisioning philosophies. Shaded components are shared between cores...... 13 2-1 Salverda and Zilles cost model ...... 25 3-1 Target CMP ...... 33 4-1 Conceptual block diagram of WiDGET ...... 35 4-2 Limitations of the Salverda and Zilles cost model ...... 36 4-3 WiDGET microarchitecture ...... 38 4-4 Pipeline Stages ...... 40 4-5 Frontend ...... 40 4-6 Pseudo-code for instruction steering ...... 42 4-7 Execution Unit ...... 43 4-8 8-EU performance relative to the Neon ...... 46 4-9 Average cycles spent on each EU state with 8 EUs ...... 47 4-10 Harmonic mean IPCs relative to the Neon ...... 49 4-11 Harmonic mean system power relative to the Neon ...... 51 4-12 Power breakdown relative to the Neon ...... 52 4-13 Power Proportionality of WiDGET compared to Neon and Mite ...... 53

4-14 Geometric mean power efficiency (BIPS3/W) ...... 54 5-1 Conceptual block diagrams of (a) Borrowing All Resources (BAR) and (b) Cheap Overprovisioned Resources (COR) models ...... 60 5-2 IPC normalized to the baseline OoO ...... 65 5-3 Percentages of in-flight instructions spent in each state ...... 66 x 5-5 Misprediction rate of the cache-bank predictor in the BAR ...... 67 5-6 Memory-level parallelism ...... 67 5-4 Instruct-ions affected by remote operand transfers in BAR ...... 67 5-7 Performance sensitivity to communication overheads ...... 69 5-8 Chip power normalized to the baseline OoO ...... 70 5-9 Categorized chip-wide power consumption ...... 70 5-11 L1-I access count normalized to BAR1 ...... 72 5-10 Per-core power breakdown normalized to the baseline OoO ...... 72 5-12 Geometric mean energy efficiency ...... 74 5-13 Conceptual diagrams of BAR4 and COR4 with the default configurations in Table 5-3 ...... 75 5-14 Effect of 2-wide (Narrow) and 4-wide (Wide) frontend/backend on COR ...... 76 5-15 L1-I aggregation mechanisms across BAR and COR ...... 77 5-16 L1-I miss rate of default BAR with L1-I aggregation ...... 78 5-17 L1-D aggregation mechanisms across BAR and COR ...... 79 5-18 L1-D miss rate of default BAR with L1-D aggregation ...... 81 5-19 Conceptual block diagram of the COBRA hybrid design ...... 83 5-20 Power-performance of all designs with the default configurations normalized to the baseline 85 5-21 IPC of COBRi8 normalized to COBRo8 ...... 86 5-22 MLP of COBRo (left bars) and COBRi (right bars) ...... 87 5-23 Per-benchmark IPC of COBRo ...... 88 5-24 Categorized chip-wide power consumption ...... 89 5-25 Geometric mean energy efficiency ...... 90 6-1 Conceptual power proportionality goal ...... 92 6-2 Chip power reduction by frequency scaling ...... 97 6-3 Run-time slowdown by frequency scaling ...... 97 6-4 Chip power breakdown at the nominal frequency ...... 98 6-5 Useful checkpoint rate of the baseline ...... 101 6-6 Power-performance normalized to the baseline ...... 105 6-7 Ratio of committed / dispatched instructions ...... 107 6-8 IPC impacts of the applied techniques with Stall-1 ...... 107 6-9 Power breakdown normalized to the baseline ...... 108 6-10 Power-performance normalized to the baseline ...... 112 xi 6-11 Power breakdown normalized to the baseline ...... 113 6-12 Normalized total L2 power ...... 114 6-13 Harmonic mean power and performance ...... 116 A-1 Performance sensitivity under realistic communication delays ...... 130 C-1 Instruction steering example ...... 136 D-1 IPC sensitivity ...... 138 D-2 Chip power sensitivity ...... 138 D-3 ED sensitivity ...... 139 D-4 ED2 sensitivity ...... 139 xii List of Tables 2-1 Comparison of prior related work with regard to desirable power proportional core attributes . 18 3-1 SPEC CPU 2006 characterization...... 31 3-2 Wisconsin commercial workload characterization ...... 32 3-3 Common configuration parameters...... 33 4-1 Machine configurations ...... 45 5-1 Core scaling taxonomy...... 58 5-2 Scaling mechanisms of WiDGET ...... 59 5-3 Design-Specific Default Configuration Parameters...... 63 5-4 Power Categories and Descriptions ...... 71 5-3 COBRA Configuration Parameters ...... 84 6-1 Workload characteristics ...... 95 6-2 Baseline configuration parameters ...... 99 6-3 Simulated frequency scaling points ...... 99 6-4 Configuration space for Case Study 1 ...... 104 6-5 Configuration space for Case Study 2 ...... 111 6-6 COBRi configuration parameters ...... 115 B-1 Simulation parameter space...... 
132 E-1 Baseline values for Figure 4-8 ...... 141 E-2 Baseline values for Figure 5-2 ...... 142 E-3 Baseline values for Figure 5-7 ...... 142 E-4 Baseline values for Figure 5-11 ...... 143 E-5 Baseline values for Figure 5-21 ...... 144 E-6 Baseline values for Figure 6-2 ...... 144 E-7 Baseline values for Figure 6-8 ...... 145 E-8 Baseline values for Figure 6-13 ...... 146 E-9 TAGE branch misprediction rate ...... 146 1 Chapter 1

Introduction

Microprocessor performance has increased dramatically over the past few decades. This rapid

increase in performance was driven, in part, by technological developments that doubled the

number of transistors on chip every two years, a trend described by Moore’s law. Unfortunately, the

power supply voltage of transistors did not improve at the same rate. These mismatched transistor

scaling trends created chip designs with growing numbers of power-inefficient transistors, and, as

a direct result, chip power usage continued to increase exponentially alongside any performance gains until early 2000s. This unsustainable increase in power usage, known as the Power Wall, cre-

ates packaging and cooling problems for the correct operation of processors, and has placed a tight

limit on total chip power.

In an effort to utilize the increasing transistor budget within the constraining framework of the

strict power budget, chip vendors have moved away from traditional uniprocessors to multi-core

chips [43,65]. Multi-cores better addresses the Power Wall for two reasons. First, with dynamic

voltage and frequency scaling (DVFS) [49], cores can run at lower voltage to stay within affordable

chip power while still yielding higher system throughput. DVFS is an effective power management

technique due to the cubic relationship between (dynamic) power and performance: for each 3%

reduction in power, DVFS reduces performance by only about 1%. Second, the high throughput

nature of multi-cores allows the complexity of each core to be reduced without significantly sacri- ficing overall performance. Guided by the 3:1 power-to-performance ratio of DVFS, cores can 2 eliminate performance optimizations that exceed the 3:1 ratio in order to save more in power than

they lose in performance [32].

Despite increasing performance demands, the utility of DVFS is diminishing due to the nar-

rowing gap between maximum and minimum supply voltages [77]. Consequently, either fewer

cores can run simultaneously, or each core must be simplified further to prevent an increase in

chip power. The former comes with the cost of reduced system throughput, while the latter is sus-

ceptible to sequential bottlenecks. Neither is a desirable solution, especially for future applications

with versatile resource requirements [26]. Amdahl’s Law, a key tenant of microprocessor design

philosophy, states that the overall run-time enhancement achieved when only a part of the system

is improved depends on the time spent executing the non-optimized part. In other words, it is the

weakest link in the chain that determines the bottleneck on performance. Therefore, it is critical to

balance system throughput and single-thread performance to avoid those performance bottle-

necks [5,35].

This thesis describes an alternative to DVFS that uses microarchitectural mechanisms for flex-

ible power and performance management. Rather than statically selecting a design point optimal

for a small set of workloads, we propose the use of power-proportional cores—cores that dissipate

power in proportion to work done—to provide many different operating points appropriate for a broad range of workloads. Chips composed of power-proportional cores can speed up some of the cores for sequential threads at higher power while running as many parallel threads as possible at lower speed so as not to exceed a given power budget. We evaluate power-proportional cores in a single-thread context to limit the scope of this dissertation. 3

FIGURE 1-1. Intel thermal design power trend [74]

We first review technology trends (Section 1.1) that have led to the need for power-propor- tional computing (Section 1.2) in more detail. To achieve power proportionality using microarchi- tectural mechanisms, we analyze what underlying hardware components and mechanisms are desirable, and how to harness them. Section 1.3 provides a brief overview of our findings, and we conclude this chapter by presenting key contributions (Section 1.4) and the structure of this dis- sertation (Section 1.5).

1.1 Technology Trends

For decades, every performance increase has come at the price of an increase in power usage.

Figure 1-1 demonstrates that the thermal design power (TDP) of Intel chips increased exponen- tially over the last forty years until power dissipation became too large for affordable cooling sys- tems. Power usage then plateaued with the 4 processor, the last uniprocessor Intel has produced to date. Until that point, chip designers exploited Moore’s Law by devoting larger num- ber of smaller transistors to a single core in pursuit of higher single-thread performance. Those 4

FIGURE 1-2. Impact of Intel technology scaling [74]

transistors were used to make pipelines both deeper and wider and increased the degree and

aggressiveness of speculation, eventually running into diminishing returns and wasted power.

However, a more significant reason behind the escalation in power usage is the fact that scaling

of supply voltage (Vdd) has lagged in comparison with feature size. Figure 1-2 plots the impact of

Intel technology scaling, normalized to the 4004 processor released in 1971. The figure demon-

strates that feature scaling substantially outpaced Vdd scaling by up to two orders of magnitude.

That is, a given die area can have more transistors, however, power per area continues to rise.

These transistors with increasing power per area and rising chip power led designers to start integrating multiple cores on the same die and use DVFS to manage chip power. Although this paradigm shift enabled performance improvement without exponential power increase, it is not a reliable solution going forward due to an inherent limitation of DVFS. Figure 1-3 plots the past

trend in supply and threshold voltages [71,21] as well as both conservative [12] and optimistic

[25] projections of future supply voltage reductions. These voltage-scaling trends show that the 5

Vdd Trend 2.5 Conservative Vdd Projection Optimistic Vdd Projection Vt Trend and Optimistic Projection 2 Gate Drive Rule

1.5

Voltage (V) 1

0.5

0 250 180 130 90 65 45 32 22 15 11 8 Technology Node (nm) FIGURE 1-3. Supply and threshold voltage trends

gap between supply and threshold voltages is closing, and thus the range of voltage scaling is diminishing. In fact, for high-performance transistors, future technology nodes have no room left for voltage scaling, based on the gate drive rule of maintaining at least a 4:1 supply to threshold ratio [71]. While various circuit-level techniques have been proposed to address this problem, such as ultra-low voltage operation [19,21], many incur performance, leakage power, and/or reli- ability issues.

The limited leverage of DVFS makes design point selection challenging for multi-cores.

Regardless of a homogeneous or heterogeneous core composition, modern statically configured cores have a chosen design point that is optimal only for a few target workloads. To efficiently run a variety of workloads on a single chip, we need to redesign processors in order to accommodate today’s technology scaling. 6

(a) Current Servers [10] (b) Ideal FIGURE 1-4. Power proportionality

1.2 Power Proportionality

We propose using power-proportional cores that provide many different power-performance points, without DVFS. Our aim of power proportionality—dissipating power in proportion to the amount of work performed—is adapted from Barroso and Hölzle’s definition of energy propor- tionality [10], with a focus on single-thread power and performance. Barroso and Hölzle defined

“work” loosely to encompass all performance metrics, and we use instruction-level parallelism

(ILP) as a measure of work throughout the dissertation.

Achieving power proportionality within the framework of modern processor designs, without relying on DVFS, is a challenging proposition. Processors are typically optimized for a narrow power-performance range. Furthermore, leakage power will consume a larger fraction of the total power in future technology nodes [36], and, as a result, will become a hurdle in scaling down power when there is little work to do.

Figure 1-4a plots current server power usage as a function of work [10]. The performance and availability requirements of servers prohibit the use of DVFS or other conventional low-power 7 techniques (Chapter 2) regardless of work load. As a result, even the idle state burns almost half of

the maximum power, representing a non-negligible amount of power simply wasted due to the ris-

ing power consumption of computing systems. An ideal power proportional design, on the other

hand, would scale power more gracefully with work, as illustrated in Figure 1-4b. Running a sys-

tem at close to full speed corresponds to the upper-right region, the area in which current systems

are designed to operate. The remaining region is much more difficult to reach because the system

must use just enough power to meet the performance goal. Therefore, an important characteris- tic—and challenge—of power proportionality is yielding a wide dynamic power-performance range, covering both full-utilization and idle states.

Achieving power-proportional computing requires a holistic approach, incorporating the entire system, from the memory and disk subsystems to networks, not just processors. In addition, software-hardware collaboration may enable more efficient management of hardware resources as well as determine the mix of concurrent threads to better control the system-wide power-perfor- mance. However, we limit the scope of this thesis to core microarchitectures, leaving the rest as future work.

1.2.1 Wire Delay

The modern trend toward increasing wire delay makes designing power-proportional cores more difficult. Figure 1-5 plots the maximum single-cycle signalling distance as a function of the

clock frequency, using a 100-stage Pi wire model. Data is derived from CACTI [64], and assumes

optimal repeater count and placement (50 ps fixed delay per repeater, and 25 ps setup time at the

wire’s endpoint). We further assume level 3 metal layer (M3) for short-distance, component-to-

component routing. For a signal to traverse about two millimeters of linear distance in a clock 8

FIGURE 1-5. Maximum signalling distance vs. clock frequency, M3 [64]

cycle using a 45nm technology node, the maximum possible frequency is 3 GHz. In contrast, the

frequency drops to 2 GHz in a smaller 32nm technology.

Tight bounds on communication distance restrict the size of a core as well as resource place-

ment in order to meet a target clock speed. In 45 nm, a small two-issue out-of-order core occupies

about 22.5 mm2 [40]. Assuming an optimistic 2:1 core aspect ratio, linear distance to cross a core is

3.3 mm or 2 cycles, depending on the particulars of the core’s internal floorplan. Communication

latency starts to dominate for more aggressive and larger superscalar core designs.

Wire delay is a challenge for multi-cores. Without attempting to address all possible internal

core designs, Figure 1-6 plots the hypothetical best-case single-cycle (dark grey) and two-cycle

(light grey) communication distances with the above design assumptions. In the two-core domain

(i.e., Figure 1-6a), a two-millimeter horizontal distance between cores provides enough coverage.

However, four-way communication among horizontally- and vertically-mirrored cores with a 1:1 core aspect ratio (i.e., Figure 1-6b) leaves very little communication area (~2x2 mm) after consid- 9

Core0 Core1

Core0 Core1 2-Cycle Distance 1-Cycle Distance

Core2 Core3 (a) Two cores (b) Four cores (c) L1-D cache area

FIGURE 1-6. On-chip communication distances in the context of out-of-order core sizes. Figure to scale, 1mm = 0.1in

ering one- or two-cycle Manhattan signalling distances. This area, for reference, is about two thirds of that occupied by a 32KB L1-D cache (Figure 1-6c). On the other hand, if four cores are organized into rows in the on-chip floorplan, signalling distance is 6.6-9.9 mm (3-5 cycles)—much higher than that of the “four corners” floorplan—but the constraint of corner placement effectively disappears.

Regardless of the organization, wire delay leads to a simple implication for multi-cores. In line with intuition, latency-sensitive core resources that need to communicate must reside near one another to minimize wire delay between these resources. The more cores participate in communi- cation, the tighter constraints on the size and placement are imposed on designing multi-cores.

1.3 Desirable Hardware Features for Power Proportionality

Rather than scaling voltage and/or frequency, we scale architectures themselves to deliver power proportionality (as selected by system software, compiler, hardware predictor, or some combination thereof). Our approach seeks to achieve more than a linear power-performance scal- ability by scaling frequency alone. We aim to continue the cubic voltage function of the DVFS or the 3:1 power-to-performance ratio, and leverage it at the resource allocation level. We aggregate 10

฀ ฀ FIGURE 1-7. Conceptual power proportionality goal by this thesis

resources to seek a 1% performance increase with at most 3% power increase, and selectively dis-

able resources or performance optimizations for a 3% power savings at a 1% performance loss.

For higher performance cores, a key challenge lies in providing flexibility in performance with proportional power consumption. We achieve this with a scalable core substrate. Scalable cores

scale up the amount of resources dedicated to a single core to meet the performance demands of

single threads. They can also scale down resources to match the available ILP or to run many

threads in parallel and still stay within the given power budget. As the granularity and selection of

resource scaling have large impacts on the resulting power proportionality, this thesis explores the

organization of microarchitecture, proposing an example architectural design called WiDGET, and

what and how resources should be scaled to best facilitate graceful transition across the power-

performance curve.

For lower performance cores, we identify resources or performance optimizations to disable in

order to “glide” down the power-performance curve. We call this concept power gliding. 11

FIGURE 1-8. High-level WiDGET design

Figure 1-7 depicts our approach for power proportional computing. We seek to provide a foundation for future computing systems and diverse workload types by provisioning resources.

The remainder of this section provides a high-level overview of our findings, the first two pieces targeting scalable core foundations and the last piece targeting gliding down the power-perfor- mance curve.

1.3.1 WiDGET: Decoupled, In-Order Scalable Cores

We first propose a power-proportional computing infrastructure, called WiDGET (Wisconsin

Decoupled Grid Execution Tiles), that decouples thread context management (i.e., instruction engines) from a sea of simple in-order execution units (EUs), loosely defining core boundaries

(Figure 1-8). WiDGET’s decoupled design provides the flexibility to scale cores up and down through global resource allocation, varying the number of enabled instruction engines and the number of EUs assigned to each instruction engine. Because WiDGET activates only the computa- tion resources needed for a particular power-performance target and turns off the rest to save 12 power, it dynamically enables many different combinations of small and/or powerful cores on a

single chip.

Low-complexity building blocks like in-order EUs are desirable for power savings, but make

delivering high single-thread performance a key challenge. To overcome in-order issue con-

straints, we leverage 1) distributed instruction buffers spread across the EUs for latency tolerance

and 2) instruction steering logic that accounts for communication overheads and data locality. By

distributing scheduled instructions to the buffers, later, ready instructions can execute ahead of earlier, stalled instructions in other buffers, essentially approximating OoO execution with much simpler in-order building blocks. This feature allows each core on WiDGET to scale from in-order to coarse-grain OoO execution just by varying the number EUs and/or instruction buffers. When scaled up, we show that per-thread performance of WiDGET exceeds a -like high-perfor- mance processor while consuming more power (upper right region in Figure 1-7). WiDGET can also scale down to a level comparable to an -like low-power processor, turning off resources for less aggressive execution and power (toward the middle region in Figure 1-7). Importantly,

WiDGET delivers power-performance points anywhere in between these two extremes for fine- grained scaling.

1.3.2 Scalable Core Substrate

WiDGET is one design point, and many other scalable core designs are also possible. To better understand scaling trade-offs in this architecture space, we step back and reconsider energy-effi- cient design options for scalable cores in general, not necessarily related to WiDGET. We examine three previously proposed scalable cores—Core Fusion, Composable Lightweight Processors, and

Forwardflow [40,28,46]—to identify seven principal areas in which scaling mechanisms and poli- 13 ฀

฀ (a) Resource borrowing (b) Resource overprovisioning

FIGURE 1-9. Conceptual diagrams of (a) resource borrowing and (b) resource overprovisioning philosophies. Shaded components are shared between cores.

cies differ. We argue that these differences stem from disparate fundamental resource acquisition

philosophies: whether to borrow resources from neighboring cores (Figure 1-9a), or to overprovi-

sion core-private resources (Figure 1-9b). We analyze the impact of these differing design philos-

ophies using a common framework to abstract away artifactual differences in the original

proposals and focus on the most important constraints underlying modern core design: wire

delay, energy efficiency, and area. We study two abstract cores, BAR (Borrowing All Resources)

and COR (Cheap Overprovisioned Resources), which represent two extremes in the design space

for cores that scale up to 4x their nominal size.

We find that when scaling up, overprovisioning a few, cheap resources (i.e., COR) is more

energy efficient than borrowing large resources from neighboring cores (i.e., BAR). When bor- rowing resources, wire delays add multi-cycle penalties to access distant resources, largely negat- ing the benefits. For caches, borrowing has the potential to do more harm than good. L1-I aggregation increases L1-I power more than 2.5x, and L1-D aggregation actually degrades perfor- mance by up to 16% on average. On the other hand, when fully scaled down, smaller components in borrowing-based cores better facilitate energy efficiency than statically overprovisioned larger components. 14 Derived from these insights, we propose a hybrid design called COBRA. Although COBRA

builds on the overprovisioned model, it also integrates the small modular feature of the resource-

borrowing model. Moreover, COBRA can borrow small, latency-effective resources for further scalability. We investigate both OoO and in-order versions of COBRA, and show that both achieve significant energy efficiency improvements, but the in-order version yields better effi- ciency and power-performance scalability by harnessing low-power components.

1.3.3 Power Gliding: Extending the Power-Performance Curve

The last piece of this thesis targets the lower end of the power-performance curve. Because of the goal to bring power consumption toward zero as work or performance decreases, we only focus on scaling down architectures in this work. We propose power gliding which dynamically dis-

ables performance optimizations that meet the 3:1 power-to-performance ratio. While some opti-

mizations may have much less than a 3:1 ratio, and thus should be left on, others may exceed the

3:1 ratio for a given workload, allowing power gliding to do better than DVFS.

Power gliding can leverage many previously proposed low-power techniques that result in a

performance loss. Although those techniques might not have been considered appropriate for

high-performance processors, the techniques become viable options under the 3:1 ratio. We select

two sets of techniques—targeting the core frontend and the L2 cache, respectively—and evaluate

them in conventional static cores to demonstrate the broad applicability.

1.4 Contributions

The following summarizes this dissertation’s most important contributions. 15 • Proposes and evaluates a power-proportional architecture, WiDGET, enabled by a sea of

computation resources, scaling across the power-performance spectrum. By harnessing in-

order buffers with an intelligent instruction steering heuristic, WiDGET scales anywhere from

an -like low-power processor to a chip that exceeds an Intel Xeon-like high-perfor-

mance processor while consuming less power on a single chip (Chapter 4).

• Demonstrates that resource acquisition philosophy is central to the efficiency of core scal-

ing. Borrowing resources from neighboring cores adds communication overheads, resulting in

performance degradation and higher power. Consequently, overprovisioning—provisioning

core-private resources for the most aggressive configuration—provides better energy effi-

ciency and scalability (Chapter 5).

• Makes a case for power gliding—turning off power-dominant performance optimizations

to lower the power-performance curve toward zero. Power gliding allows previously pro-

posed low-power techniques to be re-examined in a new context and apply them without com-

plex policies or logic to obtain the best possible power savings. Two case studies demonstrate

the potential of power gliding, even providing better power scaling than DVFS in some cases

(Chapter 6).

1.5 Dissertation Structure

The rest of this dissertation first reviews prior work on power-proportional computing and low-power/complexity designs (Chapter 2). We then present the common evaluation methodol- ogy used throughout the dissertation (Chapter 3). 16 The next three chapters discuss the dissertation contributions in detail: a power-proportional

computing infrastructure, WiDGET (Chapter 4); deconstruction of scalable cores to identify energy efficient scaling (Chapter 5); and power gliding to glide down the power-performance curve (Chapter 6). This dissertation ends with a summary of power-proportional scalable microarchitecture, reflects on the limitations of the current state, and discusses future work to achieve system-wide power-proportional computing (Chapter 7).

The dissertation not only includes all the content from our previously submitted work, but also provides materials supplemental to that work. It adds simulated data and discussion on design assumptions and configurations. Furthermore, the dissertation includes a section in Chapter 6 that demonstrates how our three proposals work together achieve the power proportionality goal. 17 Chapter 2

Related Work

Although the use of dynamic microarchitectural modifications to achieve power-proportional computing is relatively recent, the fundamental concept has been around for the last few decades.

This chapter discusses the prior work that formulated the general concept of dynamic scalability, comparing each work to the seven desirable attributes of power-proportional cores in Table 2-1.

We first review designs and techniques that dynamically adjust power and performance without

the use of scalable cores (Section 2.1). We next consider a class of work that simplifies microarchi- tectures to address the Power Wall (Section 2.2). Previous work has challenged the instruction steering aspect of distributed architectures, and because WiDGET builds on that architecture we review the observations made by the previous work (Section 2.3). After discussing prior scalable

core designs (Section 2.4), we conclude this chapter by explaining how the previous work has shaped our approach in realizing power-proportional processors (Section 2.5).

2.1 Power-Proportional Computing

Chandrakasan et al. are among the first to introduce the concept of power-proportional com-

puting [20]. They pointed out that once computational capability of a design meets service-level

agreements, the remaining transistor budget should be devoted to power saving techniques. Bar-

roso and Hölzle made a case for energy proportionality, especially for servers that rarely reach

complete idle or near-peak utilization [10]. They call for novel energy efficient mechanisms that 1) 18 TABLE 2-1. Comparison of prior related work with regard to desirable power proportional core attributes. The designs above the bold line represent orthogonal work to this thesis.

Category Design Row number Scale & Down? Up Symmetric? Exec? Decoupled In-Order? Wire Delays? Driven? Data ISA Compatibility? DVFS [49] 2 Y - - - - - Y Circuit-Level Techniques Power gating [37] 3 N - - - - - Y PowerNap [53] 4 N - - - - - Y System-Level Techniques Thread Motion [58] 5 N N N - - - Y Adaptive Cores [3,29,24] 6 N - Y/N Y/N - - Y Heterogeneous CMPs [48] 7 N N N Y/N - - Y Quad-Cluster [9] 8 N Y Y N Y Y/N Y Cost-Effective [18] 9 N N Y N Y Y Y Clustered Architectures Multiscalar [68] 10 N Y N N - N N Complexity-Effective [56] 11 N Y Y Y N Y Y Access/Execute [66] 12 N N N Y - Y N TLS [69,68] 13 N Y N Y/N - N Y/N TLS Hydra [31] 14 N Y N N - N Y OoO Approximation ILDP [47] & Braid [73] 15 N Y Y Y - Y N Steering Cost Model Salverda & Zilles [61] 16 Y Y N Y N Y Y Core Fusion [40] 17 Y Y N N N - N CLP [46] 18 Y Y N Y N Y N Scalable Cores Forwardflow [28] 19 Y Y N N Y Y Y WiDGET 20YYYYYYY Column number 3 4 5 6 7 8 9

lessen wake-up penalties from power-saving modes and 2) consume energy in proportion to the amount of work performed. Due to the single-thread performance focus of this thesis, we use power proportionality as a metric instead. 19 2.1.1 Circuit-Level Techniques

Circuit-level techniques (rows 2-3 in Table 2-1) have been a popular method to aid power pro- portionality, although some of them only address the lower end of the curve. We consider these techniques orthogonal to the work done in this dissertation, which focuses on mechanisms at the microarchitectural level.

Clock gating is a widely used method for dynamic power reduction by turning off the clock to idle structures [11]. Dynamic voltage-frequency scaling (DVFS) also aims at dynamic power; however, a main difference is its ability to vary transistor speed by tuning supply voltage and PLL frequency (col. 3) [49]. The shrinking operating voltage range of DVFS has spawned various research efforts to either widen the voltage scaling range or uncover alternative methods for power savings. The former aims at overcoming the challenges of reducing the supply voltage to sub- threshold [19] or near threshold [21]. These challenges include leakage power, transistor perfor- mance, and reliability, which worsen as the supply voltage is lowered further. Dreslinski et al. argue that near-threshold operation is a more attractive design point than subthreshold operation because both have comparable energy savings but near-threshold operation has improved perfor- mance and variability [21]. On the other hand, Chandrakasan et al. propose optimizing the tradi- tional 6T bit-cell SRAM design for subthreshold operation to mitigate the issues of ultra-low voltage operation [19].

Azizi et al., in contrast, view the minimizing voltage scaling range as an optimization problem and provide a framework to evaluate numerous circuit styles and gate sizes as well as architectural models [6]. 20 Frequency scaling is gaining its importance because of the limited utility of DVFS in future

technology nodes and the need for fine-grained power management within shared voltage planes.

For instance, IBM POWER7 [43] implements per-core frequency scaling to regulate the power of cores on the same voltage plane. However, such globally-asynchronous locally-synchronous design requires an entirely different verification process, covering numerous non-deterministic asynchronous interactions and detecting and correcting any race conditions. This process is known to be non-trivial, even with formal analysis tools [1].

Other circuit-level techniques to control both dynamic and leakage power include multi-Vdd

[54] and power gating [37]. The former utilizes high Vdd for transistors on critical paths identified at design time and low Vdd on other logic, while the latter dynamically cuts off supply voltage to selected logic. Lastly, multi-threshold CMOS [67], variable threshold CMOS [72], and sleep tran- sistors [41] are examples of leakage power reduction techniques.

2.1.2 System-Level Techniques

System-level techniques (rows 4-5) are also orthogonal to our use of microarchitectural mech- anisms to achieve power proportionality. Two notable system-level techniques are PowerNap and

Thread Motion. PowerNap aims to reduce idle power on a machine with frequent yet short idle periods, such as servers [53], The entire system quickly transitions to a near-zero-power idle state when server utilization goes down. In contrast, Thread Motion proposes fine-grained manage- ment of non-idle power [58]. Instead of a regulator-based DVFS approach, they migrate a thread to a different, statically set voltage/frequency domain (col. 4) for power savings. Because the effec- tiveness of Thread Motion still depends on a large operating voltage range, it is susceptible to the shrinking voltage scaling range in future technology nodes. 21 2.1.3 Dynamically Adaptive Cores

Previous work addresses the rapid rise of chip power by making limited, localized changes to

the microarchitecture. The primary focus was reducing wasteful power in a statically fixed resource by exploiting application phase behaviors. Thus, maintaining performance is crucial, unlike the aim of power proportional cores to scale up and down for a wide range of operating points.

Adaptive cores (row 6 in Table 2-1), for instance, vary the sizes of power-hungry hardware structures, including instruction queues [16] and caches [4,7,22]. Albonesi et al. give an overview of dynamic adaptation techniques for microprocessors [3]. As previously proposed adaptive cores focus on power savings [3,29,24], they are limited to scaling cores down (column 3).

Other microarchitectural adaptive techniques include exploiting narrow-width operands either by disabling unused width of the hardware or by packing multiple values [13], and com- pressing strings of zeros or ones anywhere they appear in the full width of an operand [17].

Although these techniques reduce component-specific local power, they do not take a global approach (e.g., cols. 4, 7, and 8) to address inherent inefficiencies that exist in the microarchitec- ture.

2.1.4 Heterogeneous Chip Multi-Processors

A heterogeneous chip multi-processor (CMP) (row 7) has an asymmetric design by combining a small number of aggressive superscalar cores for ILP with many lightweight cores for thread- level parallelism (TLP) (col. 4) [48]. Rather than adjusting the capability of cores (e.g., power pro- portional cores), a heterogeneous CMP dynamically migrates threads to best-fit cores for given 22 workload conditions (col. 2), thereby controlling the obtainable power and performance. Due to

the statically set core designs for the target class of applications, a heterogeneous CMP has limited

effectiveness for applications outside of the target class [61]. Furthermore, it requires a chip vendor

to design and verify at least two different cores for a single chip release or integrate pre-existing

cores with different interface requirements. Finally, as fixed resources, resource scheduling and

real-time constraints are more difficult because of the performance differential on the heteroge- neous cores.

2.2 Low-Complexity Microarchitectures

Over the past two decades, many researchers attempted to simplify microarchitectural designs with little impact on the performance. Although their aims may not necessarily coincide with power efficiency, our work builds on many of their insights and also avoids shortcomings of their mechanisms in pursuit of developing a power-proportional scalable design.

2.2.1 Clustered Architectures

The goal of early clustered architectures (rows 8-12) [56,9,66,18,68] was to further improve superscalar performance while reducing the complexity. Hence, each cluster may still utilize com- plex OoO execution (col. 6). To eliminate monolithic structures on critical paths, some designs decouple the execution core and/or the backend from the frontend (col. 5). The decentralized exe- cution inevitably gave rise to a plethora of instruction steering policies. Despite many subtle differ-

ences, they commonly trade off inter-cluster communication latencies for load balancing so long

as the latencies can be hidden (cols. 7-8). These performance-centric policies ignore the energy 23 aspect of communication, which worsens with more clusters and longer wires. We therefore take data locality into consideration when developing a steering heuristic in Section 2.3.

2.2.2 Thread-Level Speculation

Another example of achieving high single-thread performance without using complex cores is thread-level speculation (TLS). TLS (rows 13-14) leverages multiple thread contexts in multi-cores with minimum changes to the core microarchitecture, using software [69,68] or speculative mem- ory support [31]. In the case of the former, a TLS compiler divides a dynamic instruction stream into contiguous segments at control-flow boundaries (col. 8). The hardware then speculatively executes the resulting chain of control dependent threads, using buffered state to recover from misspeculation. Instead of relying on conservative synchronization points inserted by a compiler, the latter exploits speculative memory support in hardware to enable more aggressive speculation.

However, because both the software and hardware approaches take a control-driven execution style, they are susceptible to load imbalance and thread squash propagation [59]. Another short- coming is increased inter-thread traffic and the resulting energy, as data dependent instructions are spread across threads.

2.2.3 Approximating OoO Performance with In-Order Execution

Palacharla et al. observed that the wake-up and selection logic of an OoO issue queue is one of the most complex structures in a traditional superscalar, and employed multiple in-order buffers instead (col. 6) [56]. We exploit data locality to avoid prohibitive wire delays (col. 7).

Two microarchitectures (row 15) that leverage clusters of in-order buffers (col. 6) are the Braid architecture [73] and Instruction Level Distributed Processing (ILDP) [47], both of which have 24 heavy software reliance. The Braid architecture expands a conventional ISA (col. 9) and uses a

compiler to re-order instructions based on the data dependencies at basic-block boundaries (col.

8). ILDP similarly requires either a new ISA or binary translation (col. 9), utilizing a profiler to

identify groups of dependent instructions spanning control-flow boundaries (col. 8). Although the

software-based dependence extraction simplifies the hardware, exploiting dynamic data depen-

dency becomes a challenge not to mention that loosing binary compatibility creates a major hur-

dle, especially for legacy applications. Hence, the following section discusses reducing the

complexity of on-line dependency analysis to make hardware steering feasible, achieving binary

compatibility.

2.3 Instruction Steering Cost Model

Instruction steering policy is an integral part of the performance equation for any distributed

architecture, especially those that build on in-order instruction issue. The policy essentially deter- mines issue time, which is governed by data dependencies, and structural hazards. Salverda and

Zilles (row 16) therefore proposed a cost model for instruction steering to understand this com- plex interaction, and argued that hardware steering logic for in-order execution units (EUs) is not practical [61]. This section reviews their model, and, in Chapter 4, we extend it, making a case for implementable steering design.

Salverda and Zilles evaluated steering cost of an instruction i as a function of the dataflow (i.e., horizon) and in-order issue constraints (i.e., frontier). The horizon marks the time when an

instruction becomes ready to issue, which is imposed by the dispatch time, disp(i), and computa- tion of the source operands, data(i). Hence, the horizon of i is h(i) = max{disp(i), data(i)}. 25

฀฀฀฀ ฀ ฀ ฀ ฀฀฀฀ ฀฀ ฀฀฀฀ ฀฀฀฀ ฀ ฀

฀ ฀฀ ฀ ฀

(a) (b) (c) FIGURE 2-1. Salverda and Zilles cost model. (a) An example instruction sequence and the dataflow graph. (b) Steering cost of i3. (c) Steering under idealized communication assumption.

We use Figure 2-1 to help explain their cost model. Figure 2-1(a) shows an example sequence of instructions, each of which takes one cycle to execute on one of the two available in-order EUs.

Both i2 and i3 depend on i1 and must execute before i4. Assuming all three instructions dis-

patch in cycle 0, then the horizon of i3 is 2 because it must wait for the result of i1 to become

available, as the arrow and shaded region in Figure 2-1(b) show. On the other hand, the frontier of

an in-order EU e, f(e), denotes the earliest time an instruction becomes the head of the FIFO queue. In Figure 2-1(b), the frontier of EU 1 is 3, whereas that of EU 2 is 1 due to the unutilized

resource.

The cost of steering an instruction to an EU becomes: Cost(i, e) = h(i) - f(e). A negative cost

indicates a true cost of the instruction because earlier instructions in the steered EU delay the issue time. This is the case of steering i3 to EU 1. Although i3 becomes ready at cycle 2, it cannot issue until i2 finishes execution at cycle 3. A positive cost, on the other hand, reflects an opportunity 26 cost. The instruction becomes the earliest instruction in the EU while still waiting for the oper- ands, potentially deferring execution of later instructions. Steering i3 to EU 2 incurs an opportu- nity cost by leaving EU 2 idle at cycle 1. Thus, an ideal steering occurs when Cost(i, e) is zero, issuing the instruction as soon as it becomes ready without lowering EU utilization. However, this example has no zero-cost steering. As the Salverda and Zilles cost model prefers minimal true cost to opportunity cost in order to increase parallelism, it steers i3 to EU 2. i4 can be steered to

either EU 1 or EU 2, but in either case completes in cycle three as illustrated in Figure 2-1(c).

In Chapter 4, we extend their model by accounting for realistic communication latencies

(Appendix A) between EUs and discuss the implications of the new model.

2.4 Prior Scalable Core Designs

There have been many proposals for dynamically scalable chips [3,4,7,16,22,28,40,46]. In gen-

eral, these designs are capable of multiple discrete configurations, each of which provides a differ-

ent power-performance trade-off. When scaled up, additional resources (e.g., caches, functional

units, instruction window space, etc.) are allocated to a single thread, improving performance

through better exploitation of instruction-level parallelism (ILP). Typically, cores consume signifi-

cantly more power when scaled up, and it may be necessary to slow down (via DVFS) or disable

other cores to accommodate additional power demands of a scaled-up core [39].

When scaled down, cores de-activate resources to reduce power consumption, which commen-

surately reduces the core’s ability to exploit ILP. Though individually less capable than scaled-up

cores, scaling down affords the power needed to operate many scaled-down cores in parallel,

exploiting available thread-level parallelism (TLP) in multithreaded or multi-programmed work-

loads. 27 In this section, we focus on three previous scalable cores (rows 17-19): Core Fusion [40], Com- posable Lightweight Processors (CLP) [46], and Forwardflow [28]. Though the main visions of these

proposals are similar (i.e., scalable cores), each of these proposed designs approaches the problem

of implementing a scalable core very differently. In particular, the first two approaches [40,46]

consider whole-core aggregation as a means by which to implement scale-up—when aggressively

pursuing single-thread performance, entire cores are merged into a single processing entity, effec- tively sharing all resources in the processor pipeline. These individual proposals differ on the granularity of individual scaled components; Core Fusion considers aggregation of full out-of- order pipelines, whereas CLP considers aggregation of simpler processing elements in greater number.

This general approach contrasts that taken by Forwardflow [28], which dynamically scales only the instruction window and execution units. Specifically, Forwardflow [28] seeks to build a large instruction scheduler from an explicit dataflow representation. This design gives no consideration to scaling other aspects of the pipeline or dynamically sharing resources between cores when scaled up.

Chapter 5 develops core scaling taxonomy based on these three scalable core designs.

2.5 Designing Power-Proportional Processors

Given the current technology constraints and the insights from the relevant prior work, achieving power proportionality (cols. 4-8) requires a core design that scales both up and down

(col. 3) while retaining ISA compatibility for widespread adoption (col. 9). We present an example power-proportional design called WiDGET (row 20) in Chapter 4. As ever-increasing wire delays favor many small hardware structures over a few large monolithic ones (col. 7), WiDGET forms clusters of execution resources, each of which executes instructions in order for low complexity and power (col. 6). By decoupling the execution resources from the rest of the pipeline structures (col. 5), WiDGET has the flexibility to vary active execution resource count, managing the attainable power and performance. To harness the distributed in-order resources, its hardware steering logic sends groups of data-dependent instructions (col. 8) to the same execution resources for data locality while distributing independent instructions to different execution resources for parallelism.

WiDGET takes a unique stance with regard to core scale-up. It scales only the instruction window and execution resources, like Forwardflow. However, WiDGET borrows some execution resources from neighboring cores, an implementation which resembles the resource acquisition style of Core Fusion and CLP.

Although scalable cores, including WiDGET, yield many different operating points, fully scaled-down configurations still consume moderate power. To bring the power-performance point closer to zero, power gliding selectively disables resources and performance optimizations. Using

WiDGET when performance is more important and power gliding when power must be conserved helps achieve the goal of power proportionality.

Chapter 3

Evaluation Methodology

This chapter presents the common evaluation methodology used for the dissertation.

3.1 Simulation Tools

We evaluate core designs in this dissertation using full-system cycle-accurate execution-driven simulation. A full-system simulator effectively provides virtual hardware that is independent of the nature of the host computer, running real device drivers and operating systems, not just applica-

tion programs. Unlike a trace-driven simulator, an execution-driven simulator allows the executed path to change dynamically (e.g., down a mispredicted branch).

We use Simics [50], GEMS’s Ruby [52], and in-house timing-first processor models.

Simics provides full-system functional simulation of multiprocessor systems and verifies correct-

ness of the other two simulators. Ruby models detailed memory system timing, while our proces-

sor models offer a rich environment for microarchitectural exploration ranging from a simple in-

order model to an out-of-order (OoO) superscalar and a multithreaded multi-core model. Fur-

thermore, we have augmented our simulators with Wattch [9] and CACTI 5 [36], which provide

architectural-level approximations of power consumed by logic and memory structures. Detailed

discussion of our simulation models and assumptions follows next.

3.1.1 Simulation Assumptions

In line with other microarchitecture simulators, our simulators do not attempt to model every computer component in detail, but instead focus on specific components of interest while making simplifying assumptions about the rest in order to maintain usable simulation speed. Given the nature of this dissertation, our in-house timing-first processor models aim to capture many of the subtle interactions within and between microarchitectural states. All pipeline stages, which are configurable, and core structures faithfully model the allocated width, size, and/or ports, thereby simulating structural hazards. The execution-driven simulation naturally handles both data and control hazards as instructions execute. All processor models share various flavors of memory disambiguation mechanisms, which can be as aggressive or conservative as the configuration allows. Because we simulate unmodified SPARCv9 operating systems, all microarchitectures model hardware-assisted translation lookaside buffer (TLB) fill and register window exceptions. Meanwhile, GEMS's Ruby manages the memory systems, including coherence protocols, on-chip interconnects, and memory controllers.

Although some ALU instructions (e.g., multiplication and division) have variable execution latencies, we simplify the ALU logic by fixing the latency for a given opcode. We select the laten- cies based on numerous published SPARCv9 data. Another simplification is infinite bandwidth of operand networks, though we faithfully model network routing and latency. Finally, the power models assume aggressive clock gating of logic structures not in use, with no reactivation delay.

Appendix B details the parameter space of our simulators.

3.2 Workloads

TABLE 3-1. SPEC CPU 2006 characterization

Workload: Description (characterization flags; the original columns are ILP, branch misprediction, L1-I/L1-D/L2/L3 miss rates, and MLP)
perlbench: Email spam checking with a perl script (L)
sphinx3: Speech recognition (L)
gromacs: Molecular dynamics: lysozyme in water and ion solution (L L)
calculix: Finite-element solver (LL)
dealII: C++ program library for finite element models and error estimation (LL)
soplex: Sparse matrix solver (LL)
leslie3D: Computational fluid dynamics (LL)
omnetpp: Discrete event simulation of a large Ethernet (LL)
tonto: Quantum chemistry (LL)
gcc: Based on gcc Version 3.2 (LL)
h264ref: H.264/AVC audio/video codec (LLL)
astar: Path-finding AI (HLL)
gobmk: AI: The Game of Go (HLL)
wrf: Weather research forecasting (LLH)
povray: Raytracing (LLH)
xalancbmk: XML to HTML converter (LH)
cactusADM: Numerical relativity calculation (LH)
GemsFDTD: Finite difference time domain solver of Maxwell's equations (LH)
namd: Biomolecular systems simulation (HLLH)
sjeng: Chess AI (HLLH)
bzip2: Multiple compression/decompression steps of input JPEGs, binaries, source, and HTML (LHLH)
gamess: Quantum chemical computations (HHLLH)
hmmer: Search of a gene sequence database (HLL)
zeusmp: Computational fluid dynamics (LHH)
lbm: Lattice Boltzmann computational fluid dynamics (LHH)
bwaves: Simulation of blast waves in 3D viscous flow (LLHH)
mcf: Combinatorial optimization: vehicle scheduling (LLHH)
milc: Quantum chromodynamics lattice computation (LLLHH)
libquantum: Quantum computer simulation (LLLHHH)

We simulate SPEC CPU 2006 [34] and Wisconsin commercial workloads [2]. All programs were compiled for the 64-bit SPARC ISA using the Sun Studio 11 compiler with base tuning.

As performance scaling is best understood in the context of single-threaded benchmarks,

Chapters 4, 5, and 6 assume that each benchmark runs on a single core with no other concurrent threads. We simulate each benchmark for one hundred million instructions. We fast-forward benchmarks past their initialization phases, which warms up page tables, TLBs, caches, and numerous predictors.

We provide brief workload characterization of SPEC CPU 2006 in Table 3-1 and Wisconsin commercial workloads in Table 3-2, each grouped by Hamming distance of the listed characteristics. L means low, while H is for high. Blank cells indicate medium.

TABLE 3-2. Wisconsin commercial workload characterization

Workload: Description (characterization flags; the original columns are ILP, branch misprediction, L1-I/L1-D/L2/L3 miss rates, and MLP)
Apache: Static web content serving (LL)
SPECjbb: Java program emulating a 3-tier e-business system with emphasis on the middle tier business logic (LL)
Zeus: Web server (LL)
OLTP: TPC-C v3.0, IBM's DB2 V7.2 EEE database management system (HLL)

3.3 Common Design Configuration

FIGURE 3-1. Target CMP.

Our target machine is an 8-core chip multiprocessor (CMP), as Figure 3-1 depicts. Each node consists of a core, a private L1-L2 cache hierarchy, and one bank of a large shared L3. We keep some configuration parameters unchanged throughout (or most of) the dissertation, which Table 3-3 lists. We discuss other design-specific parameters in each chapter.

TABLE 3-3. Common configuration parameters

Component: Chapter 4 | Chapter 5 | Chapter 6
Branch Prediction: TAGE [62], 1K-entry tagged tables, 5 history bits; 16-entry RAS; 64-entry 4-way BTB (Chapter 4) / 256-entry 4-way BTB (Chapters 5 and 6)
Disambiguation: NoSQ [63]; 1024-entry predictor; 1024-entry double-buffered SSBF
Cache-Core Predictor: N/A (Chapters 4 and 6); 2048 entries per core [40,8] (Chapter 5)
Fetch-Dispatch: 7 cycles, unless specified otherwise
L1-I Cache: 32KB, 4-way, 64B line, next-line prefetching; 1-cycle latency (Chapter 4) / 2-cycle latency (Chapters 5 and 6)
L1-D Cache: 32KB, 4-way, 64B line, write-through, write-invalidate, 2 ports; 1-cycle latency (Chapter 4) / 2-cycle latency (Chapters 5 and 6)
L2 Cache: 1 MB, 8-way, 4 banks, 64B line, 11-cycle latency, write-back, private, inclusive
L3 Cache: 16-way, 8 banks, 64B line, 24-cycle latency, shared, inclusive; 4 MB (Chapter 4) / 8 MB (Chapters 5 and 6)
Main Memory: 2 QPI-like links (up to 64 GB/s), ~300-cycle latency
Coherence: MOSI-based protocol / MOESI-based directory protocol
Technology: 3 GHz clock, 0.9 Vdd, 45nm process

Chapter 4

WiDGET: Wisconsin Decoupled Grid Execution Tiles

Power-proportional computing requires the ability to operate, on a single platform, across a

wide range of operating points, from high performance, high power operation to low power, low

performance operation. We achieve this wide operating range by aggregating or disabling

resources to scale up or down, respectively, for each individual core. In this chapter, we propose

the underlying microarchitecture, called WiDGET (Wisconsin Decoupled Grid Execution Tiles), and evaluate the single thread capability with emphasis on power proportionality.

This chapter begins with a high-level overview of WiDGET (Section 4.1). We then present an

instruction cost model that accounts for communication overheads, making a case for an imple-

mentable hardware instruction steering design (Section 4.2). We discuss details of the WiDGET

microarchitecture (Section 4.3), and examine the power proportionality (Section 4.4), followed by

a summary of our findings (Section 4.5).

4.1 High-Level Overview

WiDGET aims to gracefully scale cores across the power-performance spectrum. To address

this goal, we harness multiple in-order issue resources, instead of relying on power-dominant out-

of-order (OoO) logic. This design delivers power efficiency while preserving three OoO-like per-

formance constraints. First, ready instructions must be exposed to the heads of the in-order issue 35

FIGURE 4-1. Conceptual block diagram of WiDGET. The shaded components are the primary means to accelerate the single thread performance.

buffers. Second, stalled instructions must not block the execution of later ready instructions.

Third, the design should provide enough buffering capacity to prevent instruction buffer clog—a pathological stall condition in which dispatch is halted while all scheduling resources are occupied by earlier waiting instructions. Satisfying all of these constraints requires intelligent management of in-order issue resources.

Figure 4-1 illustrates WiDGET’s sea of resources design. An instruction engine (IE) resembles a conventional OoO core’s frontend and backend pipeline functions with the addition of instruc- tion steering logic for the distributed execution units (EUs). Each EU is capable of buffering and executing instructions in order. A hierarchical operand network connects a cluster of four adjacent

EUs via full bypass, while a 1-cycle link bridges two adjacent clusters. An IE has an associated EU cluster, which is enough to deliver the performance of a comparable OoO machine (Section 4.4.2).

Yet the decoupled design provides the flexibility to further scale up the core by borrowing up to four EUs from the neighboring IE. The hardware has control paths to distribute instructions and commands to any assigned EU in one cycle. 36


FIGURE 4-2. Limitations of the Salverda and Zilles cost model. (a) An example instruction sequence and the dataflow graph. (b) Steering cost of i3. (c) Steering under idealized communication assumption. (d) Impacts of adding a 1-cycle latency between EUs using the ideal cost model. (e) Communication-latency-aware cost model under a 1-cycle latency between EUs.

By varying the number of in-order EUs, which include some amount of instruction buffering,

we can select a point on the power-performance spectrum best suited to the current situation.

When the workload calls for aggressive exploitation of instruction-level parallelism (ILP), addi-

tional EUs can be allocated to service demands. On the other hand, the number of EUs can be

reduced to conserve power, e.g., when running many threads, or if available ILP is limited.

4.2 Toward Practical Instruction Steering

Chapter 2 reviewed the instruction steering cost model Salverda and Zilles proposed [61]

(depicted again in Figure 4-2(a)-(c)). They argued that steering for in-order EUs is too complex to implement, assuming idealized inter-EU communication (Figure 4-2(c)). However, we cannot

ignore the communication cost as wire delay increasingly dominates with shrinking CMOS feature

sizes (Appendix A). If just a single-cycle inter-EU delay is added, Figure 4-2(d) shows that the same instruction sequence in Figure 4-2(a) now takes five cycles. The clouds depict the

incurred operand transfer delays that did not exist under the idealized communication assump-

tion in Figure 4-2(c). A cost model that is sensitive to communication overheads will instead keep

all four instructions in the same EU, completing the sequence in four cycles as depicted in

Figure 4-2(e).

Our extension of the Salverda and Zilles cost model incorporates the above observation. Spe-

cifically, operand availability is now governed by two variables: operand computation time by the

producer and the operand transfer time to reach the consumer EU. We denote the latter as comm(i, e). The horizon is therefore a function of an EU as well: h(i, e) = max{disp(i), data(i) + comm(i, e)}.

In the example of Figure 4-2, the horizon of i3 is calculated as the following, provided it is dis-

patched at time 0 and i1 is steered to EU 1:

h(i3, EU 1) = max{0, 2 + 0} = 2

h(i3, EU 2) = max{0, 2 + 1} = 3
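A minimal sketch of the extended cost computation (again with our own illustrative naming; the comm and frontier values would come from the operand-network topology and per-EU state) is:

    #define NUM_EUS 8

    typedef struct {
        int disp;             /* dispatch time of instruction i                  */
        int data;             /* time at which the producer computes the operand */
        int comm[NUM_EUS];    /* operand transfer latency from producer to EU e  */
    } SteerInfo;

    /* Cost(i, e) = max{disp(i), data(i) + comm(i, e)} - f(e) */
    static int cost(const SteerInfo *i, int e, const int frontier[NUM_EUS]) {
        int arrival = i->data + i->comm[e];
        int horizon = (i->disp > arrival) ? i->disp : arrival;
        return horizon - frontier[e];
    }

With disp = 0, data = 2, comm = 0 for EU 1, and comm = 1 for EU 2, this reproduces the horizons of 2 and 3 computed above.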

We call the extended model the Communication-Latency-Aware Cost Model and measure the

steering cost: Cost(i, e) = h(i, e) - f(e). An ideal steering decision, therefore, sends an instruction to

a different EU from the producer’s EU only when the operand transfer latency can be hidden. In

contrast, the Salverda and Zilles cost model, assuming no communication penalties, spreads com-

putation across the EUs to minimize true cost, benefiting from higher EU count. Salverda and

Zilles therefore conclude that dataflow properties constrain the performance improvement from

fusing in-order cores. It requires either a very convoluted steering mechanism that keeps track of

each EU’s frontier in relation to an instruction’s horizon or fusing so many cores that fusion over-

heads become impractical. Under realistic communication delays, however, our model tends to

FIGURE 4-3. WiDGET microarchitecture.

mitigate the pressure for more EUs and obviate the need for considering distant EUs. This reduces

the number of available instruction steering slots, leading to an implementable steering policy.

We approximate the communication-latency-aware cost model by controlling often known

variables, disp(i) and comm(i, e), and simplifying hard-to-predict variables, data(i) and f(e). We try

to steer a consumer directly behind the producer, similar to the dependence-based steering pro- posed by Palacharla et al. [56]. The important difference is accounting for communication, thereby keeping dependent instructions nearby to reduce the latency and power from operand transfers. Hence, WiDGET only considers a subset of the available EUs for a given instruction, making the steering complexity tractable.

4.3 Microarchitecture

The current technological trends favor a hardware design based on a sea of resources. This

design naturally maps well to TLP, but makes achieving high ILP very challenging. WiDGET addresses this issue by aggregating in-order-issue EUs to approximate OoO-issue capability. We therefore employ steering to distribute instructions, localizing dependent instructions into the same cluster whenever possible. The routing network forwards operands to intra-cluster EUs in time for back-to-back execution, but incurs an additional cycle for each inter-cluster transfer.

Conversely, independent instructions are steered to any empty EUs and execute in parallel. When there is a long-latency instruction, the EU that is executing the instruction acts as a buffer for the chain of dependent instructions; other EUs remain in an unblocked state and can continue execu- tion. Thus, our sea of resources design enables independent instructions to run ahead of the ear- lier stalled instructions in available EUs, extracting ILP and memory-level parallelism from a program. Figure 4-3 illustrates an example WiDGET chip with eight IEs, each of which consists of frontend and backend pipeline functions comparable to a conventional OoO core. An IE therefore manages thread specific information, including the register file and the re-order buffer (ROB), for a thread fetched and dispatched from the IE. The following sections provide more details about the

IE and EU functionality.

4.3.1 Pipeline Stages

Figure 4-4 shows WiDGET’s pipeline stages, highlighting those that are unique to WiDGET.

The non-shaded stages resemble a conventional OoO design except for the additional NoSQ

(short for No Store Queue) support to eliminate a centralized memory disambiguation mechanism during execution [63].

WiDGET makes steering decisions at the Steer Stage so that instructions are dispatched to the appropriate EUs the following cycle. Section 4.3.2 provides a detailed description of our steering heuristic.

FIGURE 4-4. Pipeline Stages.

FIGURE 4-5. Frontend.

The Execute stage can take multiple cycles depending on the operation and the utilization of the selected EU. Each EU independently manages instruction execution and no more than one operation issues at a time per EU. The total issue width is a function of the aggregate EU count, as each EU provides an additional execution engine. Executed instructions are removed from their

EUs and forward the results directly to the consumer EUs, if any, and to the register file in the dis- patching IE. Section 4.3.3 describes the detailed implementation.

4.3.2 Frontend

Figure 4-5 illustrates the detailed frontend of our architecture, which resembles a conventional

OoO core’s, including the centralized instruction fetch. Our frontend also has the NoSQ mecha- nism (Bypassing Predictor) for memory disambiguation and instruction steering. We derive the steering heuristic from the observation made in Chapter 2 that dependent instructions must be 41 kept nearby, obviating the need for considering every EU each time. Specifically, we send consum-

ers directly behind the producer or to an empty EU in the same EU cluster. If no such EU is found,

we simply stall steering until either a desirable EU becomes available or the producer finishes exe-

cution. It is through stalling that we ensure steering complexity is manageable and communication

overheads do not diminish the benefits from parallelism.

The heuristic requires three pieces of information: a producer’s steered EU, whether a pro-

ducer has another consumer steered to the same instruction buffer, and a list of empty EUs. We

employ a Last Producer Table (LPT) and an empty bit vector to keep track of the first two and the

last information, respectively. The LPT is indexed by a register and contains two fields. The first

field indicates the instruction buffer ID to which the producing instruction of the given register is

steered. The second field consists of a single bit; when set, this bit indicates at least one instruction

has been steered as a result of the producer-consumer relationship. An LPT entry is updated when

an instruction is steered and is invalidated when the register value is written back to the register

file. An invalid entry, therefore, indicates that the value has been computed and is available in the

register file. The empty bit vector is sized to the total number of instruction buffers, marking the

corresponding buffer’s occupancy status. We similarly use a full bit vector to ensure a producer

instruction buffer still has room for the consumer instruction. Feedback from the EUs updates both of the bit vectors every cycle.
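The steering state just described might be organized as in the following sketch; sizes and field names (NUM_REGS, NUM_IBS, and so on) are our own illustrative choices, not the actual implementation.

    #define NUM_REGS 64      /* architectural registers indexing the LPT            */
    #define NUM_IBS  32      /* total instruction buffers (illustrative count)      */

    typedef struct {
        int buffer_id;       /* IB to which the producer was steered; invalidated   */
                             /* (e.g., set to -1) once the value reaches the        */
                             /* register file, meaning the operand is available     */
        int has_consumer;    /* set when a consumer has been steered behind it      */
    } LPTEntry;

    typedef struct {
        LPTEntry      lpt[NUM_REGS];    /* Last Producer Table, indexed by register */
        unsigned char empty[NUM_IBS];   /* 1 if the corresponding IB is empty       */
        unsigned char full[NUM_IBS];    /* 1 if the IB has no room for a consumer   */
    } SteerState;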

/* Let I be an instruction under consideration. Also, let s and EUp,s be I's
   source operand and the producer EU of operand s, respectively. s is omitted
   when only a single operand is outstanding. */
switch (numOutstandingOps(I)) {
  case 0:
    return getEmptyEU();
  case 1:
    if (!hasInstrBehind(s))
      return EUp;
    else
      return getEmptyEUInCluster(EUp);
  case 2:
    if (!hasInstrBehind(s1))
      return EUp1;
    else if (!hasInstrBehind(s2))
      return EUp2;
    else
      return getEmptyEUInCluster(EUp1, EUp2);
}

FIGURE 4-6. Pseudo-code for instruction steering.

Figure 4-6 provides pseudo-code for the steering heuristic. The location of a producer (EUp) is tracked by accessing the LPT with the consumer's source operand. If the indexed entry's buffer ID is null, the producer has already computed the operand. getEmptyEU() accesses the empty bit vector and returns an ID whose corresponding entry is set to 1. If more than one entry is set, it randomly chooses an ID from an EU cluster with more empty buffers for load balancing. If all of the entries

are set to 0, the function returns -1, indicating a steering stall. hasInstrBehind(s) returns true if an

LPT entry indexed by s has the consumer field set to 1; otherwise, it returns false. Given one or

more EU IDs, getEmptyEUInCluster() searches the empty bit vector only within the corresponding

EU cluster(s). It returns either an available ID similarly to getEmptyEU() or -1 if no empty buffers

are found in the cluster(s), stalling the steering. Appendix C provides a working example to help

explain this heuristic.
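Under the same illustrative assumptions, the helpers named in Figure 4-6 could be realized roughly as follows; this is a sketch of our reading of the heuristic, with buffers grouped into clusters of four.

    #define CLUSTER_SIZE 4

    /* True if a consumer was already steered behind the producer of operand s
     * (the consumer bit of the LPT entry sketched above). */
    static int hasInstrBehind(const int lpt_has_consumer[], int reg_s) {
        return lpt_has_consumer[reg_s];
    }

    /* Return an empty IB within the producer's cluster of four, or -1 to stall.
     * (The two-producer case in Figure 4-6 would search both clusters.) */
    static int getEmptyEUInCluster(const unsigned char empty[], int producer_ib) {
        int base = (producer_ib / CLUSTER_SIZE) * CLUSTER_SIZE;
        for (int ib = base; ib < base + CLUSTER_SIZE; ib++)
            if (empty[ib])
                return ib;
        return -1;   /* no desirable buffer: stall until one frees up */
    }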

4.3.3 Execution Unit

Figure 4-7 shows an EU, consisting of a small instruction buffer (IB) FIFO, execution engine

(an integer ALU, floating-point unit, and address generation unit), operand buffer, and router con-

nections at the input and output. The IB is configured to be four times larger than the fetch width

FIGURE 4-7. Execution Unit.

to minimize frontend stalls. Each entry has four fields: instruction, operand 1, operand 2, and a bit vector of consumer EU IDs. The consumer EU ID field indicates EUs to forward the result to.

Although our dynamic instruction steering provides the ability to adjust to dynamic events, the caveat is that consumer EUs are not known until the dependent instructions are steered. There- fore, an instruction has to send its EU ID to the producer EU via control paths after steering is per- formed. However, a race can occur if a producer has finished execution and has been removed from the IB by the time the consumer reaches the producer EU. To prevent this situation, an oper- and buffer holds the result of an instruction for a cycle, which is the latency of register file write backs.
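Putting the entry fields and the one-cycle operand buffer together, an EU's buffering state might look like the sketch below; the types and sizes are ours and purely illustrative.

    #define MAX_EUS 8

    typedef struct {
        unsigned long instr;               /* buffered instruction                     */
        unsigned long operand1, operand2;  /* source operands, once available          */
        unsigned char consumers[MAX_EUS];  /* bit vector of consumer EU IDs to which   */
                                           /* the result is forwarded                  */
    } IBEntry;

    typedef struct {
        unsigned long value;  /* most recent result                                    */
        int           valid;  /* held for one cycle (the register-file write-back      */
                              /* latency) so a late-arriving consumer EU ID still      */
                              /* finds the value after the producer leaves the IB      */
    } OperandBuffer;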

The units in the execution engine are pipelined, though an EU can only issue one instruction per cycle. Note that each additional EU increases both issue bandwidth and buffer space for scheduled instructions. This primarily contributes to WiDGET's high single thread performance with simple in-order EUs.

4.3.4 Backend

The backend resembles a conventional OoO core’s. The ROB ensures in-order commit, and

stores write their values to the data cache at commit as in a traditional pipeline.

4.4 Evaluation

This section evaluates the microarchitecture described in Section 4.3 and the potential for power-proportional computing. Given this chapter’s focus on single thread capability, our evalua-

tion examines whether WiDGET meets two crucial properties of power proportionality: wide per-

formance and power ranges.

4.4.1 Simulation Methodology

We use two baselines designed after commercial processors on the opposite ends of a design

spectrum: a low power Intel Atom Silverthorne [27] and a high performance Intel Xeon Tulsa [70].

We call the former Mite and the latter Neon. The memory hierarchy of the baselines and WiDGET

is configured to emulate Neon’s in order to isolate the performance of the three different core

designs. We evaluate WiDGET configurations with one through eight EUs allocated to an instruc-

tion engine. These initial experiments assume a priori static allocation of EUs, as might be done by low-level system software.

Table 4-1 lists the key configuration parameters. The area estimate only accounts for a single-threaded core with the listed memory hierarchy. We derived the area of the Neon and Mite from published die area and attributed core component or unit area. The Neon's area was then halved because of the process technology change from 65 to 45 nm. Note that our more aggressive memory hierarchy increases Mite's memory die area. WiDGET's area estimate includes an instruction

TABLE 4-1. Machine configurations

Component: Mite | Neon | WiDGET
L1-I / L1-D (all): 32 KB, 4-way, 1 cycle; next-line prefetching for L1-I
Instruction Engine: 2-wide FE and BE (Mite) | 4-wide FE and BE, 128-entry ROB (Neon, WiDGET)
Execution Core: 16-entry unified in-order instruction buffer, 2 INT, 2 FP, 2 Addr Gen (Mite) | 32-entry unified OoO instruction queue, 3 INT, 3 FP, and 2 Addr Gen, 0-cycle operand bypass to anywhere in core (Neon) | 16-entry in-order IB per EU, 1 INT, 1 FP, and 1 Addr Gen per EU, 0-cycle operand bypass within a cluster of four EUs, 1-cycle inter-cluster link (WiDGET)
Disambiguation (all): NoSQ; 256-entry, 4-way store-load bypassing predictor; 1K-entry T-SSBF
L2 / L3 / DRAM (all): 1 MB, 8-way, 12 cycles / 4 MB, 16-way, 24 cycles / ~300 cycles, 16-entry MSHR
Area Estimate (45nm): ~30 mm2 (Mite) | ~41 mm2 (Neon) | ~33 mm2 with 8 EUs (WiDGET)

engine, 8 EUs, and the memory hierarchy. We estimate that the Atom's core area is roughly equivalent to WiDGET with 2 EUs due to the similar core structure sizes. The area of each additional

EU is based on a TRIPS processor’s Execution Tile [44], which resembles the EU composition.

Despite WiDGET’s greater ALU resources, our area model concludes that WiDGET is smaller than the Neon, mainly due to WiDGET’s simpler structures.

In this evaluation, we fully provision ports on the register file and L1-D cache. When modeling power consumption, we outfit the register file with read ports numbering twice the dispatch width

(8), and write ports numbering the commit width (4). A similar simplifying assumption is applied to the L1-D power model.

4.4.2 Performance Range

A wide performance range is vital for power proportionality, yet WiDGET's in-order issue constraint makes it challenging to match OoO execution performance. Therefore, we first evaluate


FIGURE 4-8. 8-EU performance relative to the Neon.

single thread performance of WiDGET when configured with the maximum number of EUs: eight

EUs with one instruction buffer per EU. Figure 4-8 presents IPCs relative to the high-performance

Neon baseline, with integer benchmarks on the left and floating-point benchmarks on the right.

Even with more than double the ALU resources, one third of the benchmarks fail to match the

Neon’s performance. In particular, WiDGET is only able to produce 35% of the Neon performance

for the outlier libquantum, drastically impacting the integer harmonic mean.

Figure 4-9 demonstrates the average EU utilization, revealing the sources of the performance

degradation. Empty is when an EU has no instructions in the instruction buffer (IB). Waiting for

Producer and Waiting for Op Transfer are when EU utilization is wasted because the head instruc- tion in the IB has at least one outstanding operand. The former is waiting for the operand to be computed, whereas the latter indicates that the operand has been computed but has not yet reached the EU due to the inter-cluster communication delay. Finally, Accessing Memory and Exe- cuting ALU are when an EU is executing memory and non-memory instructions, respectively.

Since instructions reside in IBs until execution is complete, memory-intensive workloads cause

EUs to spend much of the time waiting for load data (Accessing Memory).

The under-performing benchmarks demonstrate common characteristics: frequent stalls due to memory access and outstanding producers. This increases the pressure on the EUs to buffer

FIGURE 4-9. Average cycles spent on each EU state with 8 EUs.

more dependent instructions for a longer period of time. As a result, the steering logic becomes

more prone to stalls due to the lack of desirable EUs. libquantum is the most prominent example.

It spends 62% and 37% of the time on memory accesses and waiting for producers, respectively,

leaving non-memory execution to a mere 1%. In contrast, benchmarks that have comparable per-

formance to the Neon have the opposite trends. They have a larger portion of time spent on exe-

cuting non-memory instructions and are less likely to waste EU utilization by waiting for

operands. Hence, fewer EUs are necessary to buffer stalled chains of instructions, leveraging more

EUs to execute independent chains. Note that WiDGET’s hierarchical operand network in lieu of

the Neon’s full operand bypass has little effect on the EUs. EUs spend less than 1% of the time on

waiting for operands to be transferred, which is accomplished by enforcing cluster affinity at the

steering logic.

4.4.3 Improving Performance

As Salverda and Zilles observed [61], the limiting factor of WiDGET's performance is the number of independent instruction chains the system can expose, not the issue bandwidth. To overcome this, we expand the buffering capability by allocating multiple IBs to each EU. Despite the same issue bandwidth, an EU can now buffer more than one stalled chain while permitting an independent chain in another IB to utilize the otherwise idle execution engine. This change, however, requires each EU to have simple instruction issue selection logic; we use an oldest-instruction-first policy. Nevertheless, the logic is much less complex than that of a monolithic OoO as long as the number of IBs per EU is kept small. WiDGET's selection logic, with 8 EUs and 4 IBs each, only consumes 3% of the Neon's centralized instruction selection logic power. We also enlarge the size of the empty and full bit vectors in the steering logic to the total number of IBs.

The steering complexity is managed by keeping the cluster locality invariant.
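A minimal sketch of the oldest-instruction-first selection within one EU, assuming each IB head carries a dispatch sequence number and a ready flag (our assumption for illustration only):

    #define IBS_PER_EU 4

    typedef struct {
        int      valid;   /* an instruction sits at this IB head          */
        int      ready;   /* all of its source operands are available     */
        unsigned seq;     /* dispatch order; smaller means older          */
    } IBHead;

    /* Pick the oldest ready IB head this cycle, or -1 if none can issue. */
    static int select_ib(const IBHead head[IBS_PER_EU]) {
        int pick = -1;
        for (int ib = 0; ib < IBS_PER_EU; ib++)
            if (head[ib].valid && head[ib].ready &&
                (pick < 0 || head[ib].seq < head[pick].seq))
                pick = ib;
        return pick;
    }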

Figure 4-10(a) summarizes the performance benefits of increased buffering. We present the harmonic mean IPCs of the entire SPEC CPU2006 benchmark suite, normalized to the Neon.

WiDGET performs comparably to the Neon with at least 12 IBs, which is explained by the benchmarks' dataflow characteristics. With a 128-entry ROB, which is our configuration, the integer and floating-point benchmarks have 8 and 12 extractable independent chains, respectively.

With 4 IBs per EU, as few as 3 EUs are sufficient, while 8 EUs outperform the Neon by 26%—a sharp contrast to 18% degradation by the 1 IB counterpart. Therefore, mapping chains of dependent instructions to a sufficient number of buffers achieves extraction of ILP and memory-level parallelism despite the in-order issue constraints. WiDGET can also scale down with a single EU and an IB with performance slightly less than the Mite, offering a wide performance range of 3.8x.

It is clear that the performance sees diminishing returns after 7 EUs with 4 IBs each, obviating the need for more than 2 clusters. Rather, an interesting comparison is to 8 EUs with 3 IBs, which yield similar performance. 7 EUs with 4 IBs deploy more buffers than 8 EUs with 3 IBs, while the latter uses more ALUs. We evaluate the trade-off from the power dissipation perspective in

Section 4.4.5.

(c) IPC degradation (%) of the 2-EU cluster size relative to the 4-EU cluster size (rows: EUs; columns: IBs per EU):

         1 IB    2 IBs   3 IBs   4 IBs
1 EU      0.0     0.0     0.0     0.0
2 EUs     0.0     0.0     0.0     0.0
3 EUs    -1.9    -8.1    -22     -24
4 EUs    -1.8    -3.1    -4.3    -2.7
5 EUs    -3.4    -7.0    -2.3    -3.4
6 EUs    -1.5    -3.0    -2.4    -2.0
7 EUs    -3.2    -6.6    -7.3    -8.8
8 EUs    -1.1    -4.1    -3.3    -2.3

FIGURE 4-10. Harmonic mean IPCs relative to the Neon. (a) 4-EU cluster size. (b) 2-EU cluster size. (c) IPC degradation (%) of the 2-EU cluster size compared to the 4-EU cluster size.

4.4.4 Impacts of a Cluster Size

Cluster sizes impact WiDGET performance, making it non-monotonic with increasing EU count. Figure 4-10(a) illustrates this anomaly for the 5-EU case with three or four IBs, which is caused by the hierarchical operand network that employs a single-cycle link to bridge two adjacent

Reducing the cluster size to two has more dramatic performance impacts. Even though it can simplify steering logic and intra-cluster full-bypass network, Figure 4-10(b) demonstrates that the performance becomes highly sensitive to the cluster formation. With three or more IBs per EU, the odd EU configurations degrade the performance of the even EU configurations once 2 EUs are assigned. Therefore, under this cluster size, WiDGET must allocate a pair of EUs to realize perfor- mance improvements, resulting in coarser-grained power proportionality. We thus use a cluster size of four for the rest of the chapter.

4.4.5 Power Range

Figure 4-11 presents the harmonic mean system power of the SPEC CPU2006 benchmark suite, normalized to the Neon. We did a best-effort validation of the Neon and Mite power consumption against Xeon [70] and Atom [27] processors, respectively, by first configuring them using the published data. We power down non-provisioned EUs [37].

The shape of WiDGET’s curve resembles that of the performance curve in Figure 4-10, dem- onstrating power proportionality. WiDGET, composed of simple building blocks, achieves 8-58% power savings compared to the Neon. Furthermore, WiDGET’s EU modularity enables scaling down the power by up to 2.2 to approximate the Mite’s low power. Note that the 5-EU case slightly 51 ฀ ฀ ฀ ฀ ฀ ฀

FIGURE 4-11. Harmonic mean system power relative to the Neon.

less power than the smaller 4-EU configuration. This behavior arises because of the non-monotonic performance increase in the 5-EU case that the previous section observed.

Figure 4-11 resolves the previous section's performance and EU provisioning trade-off. Since the power consumption of 7 EUs with 4 IBs and 8 EUs with 3 IBs is almost identical, one can resort to the former, the slightly higher performing configuration. This has the additional benefit of stealing fewer EUs from a neighbor, allowing more threads to run in parallel as the power budget permits.

WiDGET’s power increase from dedicating more EUs is not solely due to the additional resources, but is also the result of higher utilization in the existing resources. As Figure 4-12 shows, the breakdown for harmonic mean system power of the SPEC CPU2006 benchmark suite is divided into two broad categories, each with four subcategories: caches and core logic. Fetch/

Decode/Rename, which includes a branch predictor and an instruction translation buffer, accounts for most of the frontend logic. Although WiDGET's instruction steering logic resides in the

FIGURE 4-12. Power breakdown relative to the Neon.

frontend, it is included in the Execution component along with the execution core to make a fair com-

parison with Neon and Mite for power stemming from scheduling and execution. Hence, this

category encompasses in-order instruction buffers for the Mite and WiDGET, an OoO instruction

queue for the Neon, operand network, and a data translation lookaside buffer. Backend and ALU

include the commit logic and the ALU resources, respectively.

Enlarging resource allocation has a first-order impact on the ALU and execution power. The larger effective window size also increases activity in the system, resulting in proportional power growth in Fetch/Decode/Rename, L1D, and L2. Yet, WiDGET’s considerable power savings com- pared to Neon comes from the difference in the execution models. WiDGET effectively replaces the Neon’s associative search in the OoO issue queue and the full bypass with simple in-order EUs and the hierarchical operand network, resulting in 24-29% reduction in the execution power. This is sufficient to mask out WiDGET’s additional power resulting from the higher ALU count, even in the 8-EU case.

The breakdown is also useful to understand the power gap between Mite and WiDGET's 1 EU with 1 IB configuration. WiDGET's OoO support in the instruction engine, namely register renaming and ROB, is primarily responsible for the extra power.


FIGURE 4-13. Power Proportionality of WiDGET compared to Neon and Mite.

Figure 4-13 puts together the power-performance relationship of the three designs. WiDGET

consumes 21% less power than the Neon for the same performance (i.e., 3 EUs with 4 IBs), and

yields 8% power savings for 26% better performance (i.e., 8 EUs with 4 IBs). WiDGET dissipates

power in proportion to the performance, covering both the high-performance Neon and the low-

power Mite on a single chip.

For a more detailed look at Figure 4-13, we provide a blown-up cutout focusing on the 4- and

5-EU data points that do not follow the rest of the power-performance trend due to the unbal- anced clusters. The data points of each EU count correspond to 1 through 4 IBs from left to right.

The distance between the 4- and 5-EU points shrinks as the number of IBs per EU increases from

1 to 2. With 3 or 4 IBs, the positions of the 4- and 5-EU points are swapped; the 5-EU points are to the left and lower than their 4-EU counterparts. Hence, it is not beneficial to increase buffers after allocating 2 IBs to the 5-EU configuration.

Finally, Figure 4-14 compares power efficiency of the three processor designs. We use

BIPS³/W as the metric, which is appropriate for evaluating efficiency differentials caused by microarchi-


FIGURE 4-14. Geometric mean power efficiency (BIPS³/W).

tectural designs [33]. The data points within the rectangle in the figure represent higher power efficiency than the Neon design. Notably, seven of WiDGET’s configurations below the diagonal line in the rectangle deliver higher power efficiency despite the lower performance than the Neon:

6-8 EUs with 1 IB, 3-5 EUs with 2 IBs, and 3 EUs with 3 IBs. Furthermore, WiDGET is 48% more power efficient than the Neon when achieving the same performance and exceeds the Neon and

Mite by up to 2x and 21x, respectively.
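For reference, with a fixed clock frequency f and average power P, the metric reduces to

BIPS³/W = (IPC × f / 10⁹)³ / P,

so the cubic weighting on performance makes it roughly equivalent, up to a constant, to the inverse of an energy-delay-squared product, which is why it is suited to comparing microarchitectures at a common voltage and frequency.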

It is possible for Neon to improve the power efficiency by scaling down the instruction window size (i.e., ROB). However, the ROB is not the main source of power difference when comparing

Neon to the same performance point of WiDGET (3 EUs with 4 instruction buffers). Rather, it is

Neon’s OoO issue queue that is the primary source of power disparity between the two designs. 55 Because reducing a window size only has an indirect impact on the issue queue power (unless the

window becomes smaller than the issue queue), improvement in power efficiency will be limited.

4.5 Summary

We proposed a power proportional design called WiDGET. The decoupling of the computa-

tion resources enables flexibility to provision EUs to meet different power-performance goals.

WiDGET can be optimized for high throughput and low power by provisioning a small number of

EUs to each instruction engine. When a powerful core is needed to avoid a sequential bottleneck,

for instance, system software can dedicate more EUs to accelerate single thread performance. By

using only as many resources as necessary to deliver target performance, WiDGET achieves

power-proportional computing.

Despite the use of in-order EUs to save power, WiDGET yields even higher performance than

the aggressive Neon by deploying sufficient buffering for scheduled instructions and steering

based on dependencies. This distributed instruction buffering was the key to single thread perfor-

mance and we believe global distributed buffering, if managed well, can yield performance for

other forms of parallelism. The distributed instruction buffers improve latency tolerance, allowing

independent chains to execute ahead of earlier, stalled chains, which is our mechanism to extract

ILP and memory level parallelism. Furthermore, the removal of OoO execution logic contributes

to the primary power savings of WiDGET, resulting in 24-29% reduction in the execution power

compared to the Neon. We experimentally showed that WiDGET consumes 21% less power than

the Neon for the same performance and achieves 8% power savings for 26% better performance

than the Neon. WiDGET’s additional capability to scale down to a level comparable to the Mite makes WiDGET a desirable framework for achieving power proportionality. 56 However, some questions still remain. We did not consider scaling other hardware structures,

including the frontend, backend, and caches. As WiDGET enables more aggressive execution with more EUs, scaling up other structures as well might facilitate even better performance scalability.

On the other hand, we have identified that fixed OoO logic in the frontend and backend (i.e.,

renaming and ROB) hinders scaling down WiDGET’s power, particularly compared to the Mite’s

in-order pipeline. Hence, the next chapter investigates the power-performance impacts of scaling other hardware structures.

Chapter 5

Deconstructing Scalable Cores

"Neither a borrower nor a lender be; For loan oft loses both itself and friend, and borrowing dulls the edge of husbandry" —Hamlet Act I, Scene III by William Shakespeare

In the previous chapter, WiDGET made implicit design choices to only scale the execution resources. Even though it achieved substantial power-performance scalability, questions still remain whether further scalability is feasible by also tuning other structures to the aggressiveness of computation.

In this chapter, we deconstruct prior scalable core designs—Core Fusion, Composable Light- weight Processors, and Forwardflow [40,46,28]—to identify which mechanisms best help, or hurt, energy-efficient scalability. We first develop core scaling taxonomy (Section 5.1). We then intro- duce (Section 5.2) and study (Section 5.4) two abstract cores that capture the fundamental differ- ence in scaling without design-specific details of the prior work. After considering various scaling mechanisms for power-inefficient components (Section 5.5), we propose a more energy-efficient hybrid design, COBRA (Section 5.6). Finally, the summary of our findings (Section 5.7) concludes this chapter.

5.1 Core Scaling Taxonomy

Table 5-1 summarizes a taxonomy for core scaling, including the mechanisms and philosophies used by each exemplar design. These mechanisms capture the high-level implications of

TABLE 5-1. Core scaling taxonomy

Scaled Component: Core Fusion | Composable Lightweight Processors | Forwardflow
L1-I (mechanism to aggregate L1-I caches): Sub-banked L1-Is, operate cooperatively | Distributed blocks, central fetch | No aggregation
Frontend (mechanism to scale fetch and decode width): Collective (centralized) rename, frontend x-bars | EDGE ISA avoids centralized rename | Overprovision
Scheduling (mechanism to scale the instruction scheduler): Aggregated OoO schedulers, instructions steered based on dependency | Distributed OoO buffers, compiler-assisted steering | Monolithic, banked scheduler, static instruction steering
Execution Resources (mechanism to scale the number of functional pipelines): FUs associated with each scheduler | FUs associated with OoO schedulers | FUs associated with groups of scheduler banks
Instruction Window (mechanism to scale the size of the instruction window): Interleaved ROB | Distributed block-level commit | Distributed ROB-like structure
L1-D (mechanism to aggregate L1-D caches): Multiple core-interleaved L1-Ds | Multiple core-interleaved L1-Ds | No aggregation
Resource Acquisition Philosophy (means by which cores are provided with additional resources when scaled up): Aggregate neighboring cores, no per-core overprovisioning | Aggregate many cores, no per-core overprovisioning | Overprovision per-core instruction window and execution resources, no resource borrowing from other cores

these design decisions, without necessarily focusing on the low-level details and design-specific

artifacts. We identify seven principal areas in which these designs differ: instruction cache scaling,

frontend width scaling, instruction scheduling, scaling of execution resources, instruction window

scaling, data cache scaling, and most importantly, resource acquisition philosophy. Note that not

all designs scale all components—i.e., Forwardflow only scales the execution stages and uses fixed-

size caches and a fixed-width frontend pipeline.

The summary in Table 5-1 demonstrates that prior work adopts very different resource acquisition philosophies. Two of the designs seek to scale all aspects of a core through resource borrowing, i.e., through dynamic sharing of microarchitectural resources between cores. The remaining design scales only a portion of the core, and implements an individual core's scale-up by activating addi-

TABLE 5-2. Scaling mechanisms of WiDGET. Shaded cells indicate mechanisms unique to WiDGET.

Scaled Component: WiDGET
L1-I: No aggregation
Frontend: Overprovision
Scheduling: Distributed in-order FIFO buffers; instructions steered based on dependency
Execution Resources: FUs associated with each execution resource
Instruction Window: Monolithic ROB
L1-D: No aggregation
Resource Acquisition Philosophy: Overprovision per-core instruction window resources; borrow execution resources from neighboring cores when necessary

tional core-private resources, a philosophy based on resource overprovisioning. Interestingly, WiD-

GET has both aspects of resource acquisition philosophies, as Table 5-2 explains. Despite the overprovisioned frontend and the mostly core-private resources, each core can borrow execution resources from neighboring cores to enhance the scalability.

In this work, we focus on two areas of evaluation: what elements should scale and which of

these scaled elements should be borrowed or overprovisioned.

5.2 Two Abstract Cores: Borrowing vs. Overprovisioning

Concerns over wire delay have significant impacts on the core-scaling trade-offs, intricately

affecting the benefits of more aggressive resource scaling. We therefore consider two abstract

designs that approach the problem of building a scalable core from either end of the spectrum of

resource acquisition philosophies. We will use these abstractions as vehicles in our evaluation.

Borrowing All Resources (BAR). Resource borrowing in general attempts to maintain pipeline balance of scaled-up cores by utilizing resources from nearby cores. Our borrowing-based scalable core, BAR, seeks to leverage all the resources of neighboring cores, from L1-I to commit logic,

FIGURE 5-1. Conceptual block diagrams of (a) Borrowing All Resources (BAR) and (b) Cheap Overprovisioned Resources (COR) models. Shaded components are shared between cores.

when scaling up, to make best use of on-chip area (Figure 5-1a). Although constraints on area and wire delay suggest that aggregating more than two cores might be prohibitive, we optimistically evaluate the cost of borrowing resources as a flat two-cycle overhead, regardless of the number of cores BAR can effectively aggregate. As a result, this work overestimates the performance of 4-way borrowing. Section 5.4.2 examines performance sensitivity to varying borrowing overheads.

Cheap Overprovisioned Resources (COR). Our overprovisioning-based scalable core, COR, takes a complementary approach and sacrifices pipeline balance in an effort to minimize wire delays (Figure 5-1b). Intuitively, some critical core resources are small and occupy relatively little area, e.g., functional units. COR provisions these resources for the largest scaling point with relatively little area overhead. On the other hand, larger structures like the L1-I and L1-D caches are too large to overprovision—COR simply does not scale these entities. Essentially, this means that the COR core concentrates on scaling the effective instruction window size (scheduler, datapaths, and re-order buffer), but not the frontend resources or the caches. COR simply provisions the latter resources to some fixed configuration, and exercises them differently under different configu-

rations of the instruction window. Because COR’s scalable resources are small and per-core private,

wire delay is not a first-order concern to access these scaled resources.

5.2.1 Trade-offs of Resource Borrowing and Overprovisioning

Baseline Scaled Resources. BAR and COR differ in what resources are scaled, and from where those resources are acquired. Overall, COR scales fewer microarchitectural structures than does

BAR, leaving open the potential for imbalance if the statically provisioned elements of COR are improperly sized with respect to the scaled elements.

BAR scales many more processor resources than does COR, but pays latency penalties for scal- ing these resources. From these differing baselines, we can derive insight into what resources are most profitably placed in the near-neighbor signalling domain, for use in more energy-efficient core designs (Section 5.6).

Coordinating Scaling Operations. Borrowing in general assumes cores from which resources are borrowed are either themselves scaled-down, or entirely powered off. To safely transfer ownership of shared resources, some coordination is required between the participating cores, either by sys- tem software or under hardware control or both—the nature of this coordination is outside the scope of this chapter. On the other hand, one advantage of resource overprovisioning (and COR in particular) is simplicity: cores designed with overprovisioned resources never incur the complex- ity of borrowing resources from other cores. That is, core-private resources need no coordination with other cores before they can be used. 62 Area. Overprovisioned designs have obvious area overheads. When cores scale down, they leave

their core-private resources unused or underutilized (i.e., dark silicon). In the case of COR, given fixed L1-I/D cache sizes, Burns and Gaudiot [15] estimate the area increase from 2-wide to 4-wide pipeline to be 62%, including support for increased register file pressure. Nonetheless, intuition suggests that, compared to overprovisioning, resource borrowing in general (and BAR in particu- lar) makes more efficient use of chip area, as scaled cores leave fewer areas of the chip unutilized.

As a result of implementing dynamic sharing, however, resources borrowed from neighboring cores have more complex routing and floorplanning constraints than core-private resources. Core

Fusion estimated the area overhead of implementing borrowing (among four cores) as equivalent to half of the area of a single core [40]. This additional hardware constitutes multiplexors at the inputs and outputs of shared resources, and the necessary wiring to physically route signals between shared resources in other cores.

5.3 Methodology

Table 5-3 details the configuration of the two abstract cores with three scaling points (scale-1, scale-2, and scale-4, e.g., BAR1/BAR2/BAR4), ranging from a 64-entry instruction window to a 256-entry window, and includes a typical out-of-order superscalar core, similar to the Alpha 21364

[42], as a baseline for comparison. This baseline is a four-wide superscalar core, roughly equivalent to a scale-2 core. The unscaled resources (i.e., frontend, physical register file, etc.) in COR are tuned to match this baseline—these resources do not scale physically (though demand on these resources does change in response to scaling of other components). Other common configuration parameters are summarized in Chapter 3.

TABLE 5-3. Design-Specific Default Configuration Parameters. Shaded cells indicate optimistic assumptions and advantages of one design over another.

Component: Borrowing All Resources (BAR) | Cheap Overprovisioned Resources (COR) | OoO (Baseline)

L1-I:
  BAR1: no aggregation (total: 32KB); BAR2: 2 aggregated, sub-banked L1-Is [40] (total: 64KB); BAR4: 4 aggregated, sub-banked L1-Is (total: 128KB)
  COR1/2/4: no aggregation (total: 32KB)
  OoO: no aggregation (total: 32KB)

Frontend/Backend Width:
  BAR1: 2 wide; BAR2: 4 wide; BAR4: 8 wide
  COR1/2/4: 4 wide
  OoO: 4 wide

Frontend Depth:
  BAR1: 7 cycles; BAR2/BAR4: 11 cycles (7 cycles + 2 cycles for communal rename xbar + 2 cycles for inter-core dispatch xbar)
  COR1/2/4: 7 cycles
  OoO: 7 cycles

Scheduling:
  BAR1: 16-entry unified OoO instruction queue (IQ), 2-wide issue per IQ
  BAR2: 2 aggregated 16-entry OoO IQs, 2-cycle xbar interconnect, back-to-back bypass within a core, cache-bank-predictor steering [40,8], 2-wide issue per IQ
  BAR4: 4 aggregated 16-entry OoO IQs, 2-cycle xbar interconnect, back-to-back bypass within a core, cache-bank-predictor steering, 2-wide issue per IQ
  COR1: 16-entry unified OoO IQ, 2-wide issue per IQ
  COR2: 2 aggregated 16-entry OoO IQs, back-to-back bypass between the IQs, base steering [40], 2-wide issue per IQ
  COR4: 4 aggregated 16-entry OoO IQs, back-to-back bypass between the IQs, base steering, 2-wide issue per IQ
  OoO: 32-entry unified OoO IQ, 4-wide issue per IQ

Execution Resources:
  BAR1/COR1: 1 IALU, 1 FPALU, 1 AGEN
  BAR2/COR2: 2 IALU, 2 FPALU, 2 AGEN
  BAR4/COR4: 4 IALU, 4 FPALU, 4 AGEN
  OoO: 2 IALU, 2 FPALU, 2 AGEN

Instruction Window:
  BAR1: 64 entries, unified; BAR2: 2 banks, 64 entries each; BAR4: 4 banks, 64 entries each
  COR1: 64 entries, unified; COR2: 128 entries, unified; COR4: 256 entries, unified
  OoO: 128 entries, unified

L1-D:
  BAR1: no aggregation (total: 32KB); BAR2: 2 aggregated, core-interleaved L1-Ds (total: 64KB); BAR4: 4 aggregated, core-interleaved L1-Ds (total: 128KB)
  COR1/2/4: no aggregation (total: 32KB)
  OoO: no aggregation (total: 32KB)

We derived 2-wide and 4-wide frontend delays from two sources. First, the access time of a two-port L1-I cache (i.e., up to two lines per cycle for unaligned fetch) is three cycles according to

CACTI 5 [64]. Second, Burns and Gaudiot estimate the critical path delay of both decode and rename stages to be 1.15 ns and 1.3 ns for 2-wide and 4-wide, respectively [15]. Despite the small

tive latency of four cycles, resulting in a total of seven-cycle frontend depth. Although a different

frequency assumption produces dissimilar frontend depth for 2-wide and 4-wide, Section 5.4

shows that the frontend pipeline depth has little impact on the overall core efficiency.
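To spell out the arithmetic: at the assumed 3 GHz clock (a ~0.33 ns cycle),

ceil(1.15 ns × 3 GHz) = ceil(3.45) = 4 cycles and ceil(1.3 ns × 3 GHz) = ceil(3.9) = 4 cycles,

so both widths need four decode and rename stages; adding the three-cycle L1-I access gives the seven-cycle frontend depth used above.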

BAR aggregates nearby ROBs when scaled up, using them as banks. A banked ROB demands

special logic to commit instructions from multiple banks in the same cycle [40,28], but in this

chapter we assume no overhead for retiring any number of instructions, from any bank, up to the

commit width of the pipeline. This assumption favors BAR over COR.

5.4 Initial Evaluation

To begin our evaluation, we consider the raw power and performance characteristics of our

abstract core models introduced in Section 5.2. Section 5.5 will examine the roots underlying

these trends.

5.4.1 Performance Comparison

Figure 5-2 plots IPC for all scaling points of BAR and COR, normalized to the baseline OoO.

The fully scaled down BAR1 and COR1 designs achieve comparable performance, which is not

surprising, considering their only difference is the width of the frontend and backend (two for

BAR1, four for COR1). At higher scaling points, COR outperforms BAR by 9% on average.

Philosophically, the BAR designs seek to uniformly increase pipeline and cache resources, with the expectation that maintaining pipeline balance will outweigh the additional communication latencies. Conversely, the COR designs vary only the size of the instruction window and the number of execution resources (those elements of the core cheap enough to simply overprovision), sacrificing pipeline balance. But because these overprovisioned resources are core-private and small (~0.3 mm2), they incur no significant additional wire delay.

FIGURE 5-2. IPC normalized to the baseline OoO. (a) Integer (left) and commercial (right) benchmarks; (b) floating-point benchmarks.

To quantify the impacts of wire delays, Figure 5-3 plots a breakdown of in-flight instructions into four different states for fully scaled-up BAR and COR (i.e., BAR4 and COR4). Executed is for instructions that have been executed but not yet committed, Executing ALU is for ALU instructions that are executing, Accessing Memory is for outstanding memory accesses, and Waiting is for dispatched but not yet issued instructions, including those marked as ready. The breakdown shows that COR4 has 18% fewer waiting instructions and 15% more executing instructions than BAR4 on average. Hence, the additional inter-core communication of BAR4 causes more instructions to be idle compared to COR4.

FIGURE 5-3. Percentages of in-flight instructions spent in each state. (a) Integer (left) and commercial (right) benchmarks; (b) floating-point benchmarks.

BAR's whole-core scaling impacts pipeline latency in several ways. The most detrimental, direct impact is longer effective latency of the operand crossbar—on the critical path of instruction

execution. Although the steering heuristic aims to dispatch dependent instructions to the same

core, there are three cases where inter-core communication occurs. First, when an instruction has

two outstanding operands and the producer instructions are in different cores, the instruction is

randomly steered to one of the cores, requiring inter-core communication for the other operand.

Second, an instruction is steered away from the producer if the producer’s instruction queue (IQ)

is full. Because these IQs enable out-of-order instruction wake-up, respecting data dependencies at

the expense of frontend stalls is unnecessary, unlike in-order IQs [21]. Third, BAR uses a cache-bank predictor [21,6] to steer memory instructions to a core-interleaved L1-D cache bank. Loads incur additional latency whenever the predicted address maps to a different core than that which produces the address operand(s). All of these cases require remote operand transfers, which amount to 23% and 32% of all instructions in BAR2 and BAR4, respectively (Figure 5-4). Furthermore, bank mispredictions result in additional cycles to reissue the memory instruction to the correct core.

FIGURE 5-4. Instructions affected by remote operand transfers in BAR.

FIGURE 5-5. Misprediction rate of the cache-bank predictor in BAR.

FIGURE 5-6. Memory-level parallelism.
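To make the three communication cases concrete, the sketch below shows how a BAR-style steering heuristic might fall back to inter-core operand transfers. It is illustrative only; the object names and interfaces are our own assumptions, not structures from the simulator.

    import random

    def steer(instr, cores, bank_predictor):
        """Illustrative BAR-style steering; returns the core chosen for dispatch.

        Assumed interfaces (not from the dissertation's simulator): each core
        exposes iq_full(); instructions expose producer_cores() and
        is_memory_op; bank_predictor.predict() guesses the L1-D bank (core).
        """
        producers = instr.producer_cores()

        # Case 3: memory instructions follow the cache-bank predictor so the
        # access lands in the core-interleaved L1-D slice that (hopefully)
        # holds the line; a misprediction later forces a re-route.
        if instr.is_memory_op:
            target = bank_predictor.predict(instr)
        # Case 1: two outstanding operands produced in different cores, so
        # pick one at random; the other operand must cross the crossbar.
        elif len(producers) == 2 and producers[0] is not producers[1]:
            target = random.choice(producers)
        # Common case: co-locate with the single producer to avoid transfers.
        elif len(producers) == 1:
            target = producers[0]
        else:
            target = cores[0]  # no outstanding operands; any core works

        # Case 2: if the preferred core's IQ is full, steer away rather than
        # stall the frontend, accepting a remote operand transfer instead.
        if target.iq_full():
            target = next((c for c in cores if not c.iq_full()), target)

        return target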

This general trend is not without exception. BAR4 outperforms COR4 for two workloads, cactus and gems. Both benchmarks incur relatively few remote operand transfers, in part because the cache-bank predictor is highly accurate for these workloads (Figure 5-5). Furthermore, both benchmarks are memory-bound, and the memory latency helps overlap the remote operand transfers that do occur. Figure 5-6 plots memory-level parallelism (MLP) and shows that both cactus and gems have many more simultaneously outstanding misses than the average benchmark. Additional results (not shown) indicate that cactus and gems have L1-D miss rates of 28% and 48%, respectively, on BAR4, which are about 10% lower than on COR4. The combination of accurate prediction, larger (aggregated) caches, and (memory) latency tolerance makes BAR outperform COR for these benchmarks.

Figure 5-5 also illustrates the limitations of cache-bank prediction. The benchmarks leslie3d and sphinx3 have such high misprediction rates that the performance of BAR2 is actually worse than BAR1, despite the greater cache and pipeline resources.

5.4.2 Performance Sensitivity to Communication Overheads

The performance results of Section 5.4.1 intimately depend upon the wire delay assumptions that we use for our models: 0-cycle overhead for COR and 2-cycle overhead for BAR. To under- stand how different design and technology assumptions may impact these results, we vary the wire delays from zero to two cycles for both designs. For COR, increasing the wire delay from 0 to

n cycles increases the frontend depth by n cycles and also adds n cycles to transfer operands

between function units. For BAR, reducing the wire delay from 2 to n cycles reduces the latency of the operand transfers and of the two frontend crossbars to/from the centralized renamer (see Table 5-3), but has no effect on cache-bank misprediction, which also degrades BAR's performance.

Figure 5-7a plots normalized runtime as the wire delay varies from zero to two cycles for COR

(denoted COR-0C, COR-1C, and COR-2C), as well as the default BAR configurations (asterisks indicate default configurations). The graph breaks runtime into three categories: Cache, Comm, and Ideal. The topmost stack, Cache, represents cache-bank misprediction penalties in BAR. The middle stack, Comm, shows communication overheads that are affected by the different wire delay assumptions. Finally, Ideal represents the runtime in the absence of Comm and Cache overheads.

FIGURE 5-7. Performance sensitivity to communication overheads. (a) COR; (b) BAR.

Regardless of the wire delay, all COR configurations outperform the default BAR configura-

tion, on average. At scaling point 1, neither design incurs any Cache or Comm overheads and

COR’s Ideal runtime is less than BAR’s because of COR’s wider frontend (4-wide vs. 2-wide). As

the designs scale up, BAR’s Ideal time improves more than COR’s, as it applies more (borrowed)

resources to the problem. As we increase the wire delay in COR, the Comm overhead increases to

be nearly as large as the default (2-cycle) BAR configuration. The small difference reflects BAR’s

round-trip to the centralized renamer, while COR only has a single wire delay in its frontend. This

small difference indicates that performance is not particularly sensitive to frontend depth. How-

ever, even assuming equal wire delay, COR outperforms BAR because COR never incurs a cache-

bank misprediction. Despite using a sophisticated bank predictor, BAR’s Cache overhead domi-

nates the Comm overhead.

Figure 5-7b elaborates on the relative impact of Comm and Cache overheads by plotting BAR’s overheads with different wire delays, compared to the default COR configuration. At scaling point

2 (4), BAR’s Comm overhead increases by 4% (6%) for each cycle of wire delay. Note that the 70

1.5 1.0 Scaling 4 Scaling 2 0.5 Scaling 1 0.0 Normalized Power BAR COR BAR COR BAR COR CINT CFP Commercial FIGURE 5-8. Chip power normalized to the baseline OoO.

5 Scaling 4 4 Scaling 2 3 Scaling 1 2 1 0 Normalized Power BAR COR BAR COR BAR COR BAR COR BAR COR BAR COR BAR COR BAR COR L1I F/D/R Sched/Steer Exec Backend L1D L2/L3 Chip Power FIGURE 5-9. Categorized chip-wide power consumption.

Cache component also increases with wire delay, as mispredicted instructions take correspond-

ingly longer to re-route. In general, we find that the overheads of cache aggregation, largely due to

cache-bank misprediction, outweigh benefits of larger cache capacity.

These graphs show average behavior, which does not hold for memory-bound workloads.

These workloads are largely insensitive to wire delay and their performance largely depends upon

the memory system, not inter-core communications.

In summary, BAR’s average performance only beats COR’s for scaling point 4, when both

designs assume 0-cycle delays. Such a design point seems very optimistic, especially for BAR, which must aggregate four full cores and access a centralized rename table.

5.4.3 Chip Power Comparison

TABLE 5-4. Power Categories and Descriptions
L1I: L1-I, including sub-banking where applicable (i.e., BAR).
F/D/R: Fetch, Decode, and Rename logic. Includes frontend crossbars in BAR and I-TLB for both.
Sched/Steer: Scheduling and steering logic. Includes scheduling crossbar and cache-bank predictor in BAR.
Exec: Execution pipelines, IALUs, FPALUs, data address generation, and D-TLB.
Backend: Commit sequencing logic (i.e., ROB).
L1D: L1-D, including core-interleaved caches.
L2/L3: Power consumed in L2 and L3 caches.

Figure 5-8 plots mean total chip power of BAR and COR with the default configurations listed in Table 5-3. At scaling point 1, COR consumes 9% more power than BAR, across all benchmarks, as a result of COR1's fixed, 4-wide frontend.

Both models consume more power as they scale up, as they activate additional resources. For scaling points 2 and 4, BAR consumes more power than COR (16% at scaling point 2, 36% at scaling point 4).

Overall, BAR’s power scales more rapidly than does COR’s, as BAR scales up all pipeline and L1 cache resources, whereas COR only scales the instruction window and function units.

Figure 5-9 illustrates the differences in the two designs, with a detailed breakdown of mean chip power. Table 5-4 summarizes the different categories, with each including both static and dynamic power. Figure 5-10 compares BAR4 and COR4 to the OoO baseline. The largest contributors to the power difference between BAR and COR are L1I, L1D, and the frontend/backend (F/D/R and Backend), which we will investigate further in the next section.

When scaled up, much of BAR’s added power arises from its implementation of a distributed instruction fetch engine. BAR4’s L1-I cache power exceeds BAR1’s by more than 5x despite the 4x increase in cache size. Although leakage power grows in proportion to the cache size, the aggregate 72

L2/L3 8 1.5 L1D 6 1.0 Backend BAR 1 Exec 4 BAR 2 0.5 Sched/Steer F/D/R 2 BAR 4

0.0 Normalized Power L1I

0 Normalized Accesses BAR COR OoO Scaling 4 Avg

FIGURE 5-10. Per-core power breakdown FIGURE 5-11. L1-I access count normalized to the baseline OoO. normalized to BAR1.

L1-I access count and therefore dynamic power increase at a faster pace for two reasons. First, BAR

requires additional logic to implement sub-banking, which makes each L1-I’s line size smaller than

that of the L2. Not only do tag comparisons increase proportionally, but a cache fill by prefetching

also needs to access all the sub-banks in order to correctly split the line. Second, BAR also aggre-

gates fetch units along with the caches, increasing both fetch bandwidth and fetch buffer count.

Deploying more fetch buffers increases the likelihood of fetching wrong-path instructions, and

consequently increases total L1-I accesses. Figure 5-11 shows that BAR2 has 6x more accesses than

BAR1; however, the increase from BAR2 to BAR4 is much more modest. We speculate that this dif-

ference is because of control-flow convergence. BAR4 creates large enough buffers to find already

fetched instructions in the buffers after pipeline flushes.

These aggregation overheads are not present in COR. Unlike BAR's whole-core scaling, COR does not scale frontend resources. Instead, COR scales frontend utilization, yielding a much smaller power difference between scaling points for the F/D/R category. However, COR is unable to scale down its frontend for scaling point 1, effectively wasting energy by overprovisioning some of the structures' ports to facilitate the four-wide frontend.

Interestingly, BAR's L1-D power consumption is not substantially higher than COR's, despite integration of additional caches. Unlike the L1-Is, which are each accessed every cycle, the activity factors in the individual interleaved L1-Ds are substantially lower. As the L1-Ds are interleaved on address bits, roughly the same number of total L1-D accesses occur across the BAR configura-

tions—in BAR1, all accesses are routed to a single cache, whereas in BAR4 accesses span multiple

caches, but the number of accesses is, to first order, constant. Hence, higher static power from

BAR’s two to four times more active L1-D caches contributes to much of the power difference.

COR’s power consumption dominates the Backend category, as COR overprovisions a central- ized ROB to sequence instructions. Though BAR also relies on an ROB, BAR2 and BAR4 interleave

several smaller structures to scale to maximum width, whereas COR statically scales a single struc-

ture to maximum size, resulting in a much higher per-access energy (3x for COR2 and 4x for

COR4).

5.4.4 Energy Efficiency

We use two metrics to evaluate the energy efficiency of our two models: energy-delay (ED)

product (Figure 5-12a) and ED2 (Figure 5-12b). At scaling point 1, BAR’s ability to scale down the

frontend and backend makes it more energy efficient than COR by 5% under ED and 3% under

ED2. However, COR achieves better energy efficiency as it scales up in both metrics (24% at scaling

2 and 22% at scaling 4 under ED, and 42% at scaling 2 and 48% at scaling 4 under ED2). Con- versely, BAR performs worse as it scales under the ED metric, which emphasizes energy dissipa- tion more than ED2. Under ED2, BAR’s energy efficiency is roughly constant across the scaling points due to the higher emphasis on performance. These results show that for larger configura- tions, overprovisioned cores deliver better energy efficiency for two reasons. First, the energy cost 74 2 BAR 2 1.5 COR BAR COR 1.0 1 0.5 Normalized E*D 0.0

0 Normalized E*D^ Scaling 1 Scaling 2 Scaling 4 Scaling 1 Scaling 2 Scaling 4 (a) Geometric mean ED (b) Geometric mean ED2

FIGURE 5-12. Geometric mean energy efficiency.

of borrowing outweighs its performance benefit. Second, COR consumes less static power by avoiding cache aggregation. However, when scaled down, the smaller components employed in resource-borrowing cores enable better energy efficiency.
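For readers who want to reproduce these metrics, a minimal sketch of computing normalized ED and ED2 from per-benchmark energy and delay is shown below; the variable names, placeholder numbers, and normalization are our own illustration, not the exact scripts used in this study.

    from statistics import geometric_mean

    def efficiency_metrics(energy, delay, base_energy, base_delay):
        """Return (normalized ED, normalized ED^2) for one benchmark run.

        Lower is better for both; ED^2 weighs performance (delay) more
        heavily than energy, matching the discussion in Section 5.4.4.
        """
        ed = (energy * delay) / (base_energy * base_delay)
        ed2 = (energy * delay ** 2) / (base_energy * base_delay ** 2)
        return ed, ed2

    # Hypothetical per-benchmark results for one design point (illustrative only):
    # (energy, delay, baseline energy, baseline delay)
    runs = [(1.2, 0.9, 1.0, 1.0), (0.8, 1.1, 1.0, 1.0)]
    eds, ed2s = zip(*(efficiency_metrics(*r) for r in runs))
    print("Geometric mean ED: ", geometric_mean(eds))
    print("Geometric mean ED^2:", geometric_mean(ed2s))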

5.5 Deconstructing Power-Hungry Components

Section 5.4 showed that, despite BAR’s attempts to maintain pipeline balance, COR’s cheap

overprovisioning philosophy yielded better results most of the time. In this section, we seek to

identify the root causes of our findings. In particular, the COR and BAR designs differ not only in

resource acquisition schemes, but also in selection of which areas of the microarchitecture to scale

at all. We focus on three areas with large power differentials between the designs, namely frontend/

backend width (Section 5.5.1) and L1-I and L1-D caches (Section 5.5.2), and evaluate if each indi-

vidual area should scale, and which strategy best addresses scaling for the area. We take the default configurations in Table 5-3, depicted again in Figure 5-13, and vary one component from each

core design at a time to isolate the effectiveness of each scaling mechanism.

FIGURE 5-13. Conceptual diagrams of BAR4 and COR4 with the default configurations in Table 5-3. (a) Default BAR4; (b) default COR4.

5.5.1 Scaling the Frontend and Backend Width

Our default models take dissimilar approaches for frontend/backend scaling. COR fixes the frontend width at four, regardless of the instruction window size. This imbalanced scaling poten- tially wastes power when the datapaths are narrower than the frontend width. In contrast, BAR aggregates multiple frontends to dynamically scale the frontend width in proportion to the datap- ath width, but effectively increases the depth of the frontend pipe in doing so—a result of the com- munication between fused cores (renaming is a centralized operation).

To evaluate the benefits of varying the frontend/backend width, we consider a COR variant with a differently sized (static) frontend. Instead of fixing the width at four regardless of the scaling point, we scale the fetch/decode/rename/dispatch/commit pipeline of these designs to correspond to the widths delivered through the BAR approach. In particular, we evaluate COR1 with a narrow, 2-wide (static) pipeline, and COR4 with a wide, 8-wide (static) pipeline, while holding other parameters unchanged. Figure 5-14a plots the mean IPC of each variant; Figure 5-14b plots power. Importantly, the performance loss of the narrow-frontend COR1 is negligible with respect to the default COR1 design (astar has the greatest impact, a 6% performance reduction). This finding shows that COR1's static 4-wide frontend unnecessarily wastes energy when scaled down: using a 2-wide frontend instead saves about 8% of chip-wide power.

FIGURE 5-14. Effect of narrow (2-wide) and wide (8-wide) frontends/backends on COR. (a) IPC; (b) chip power.

Interestingly, widening the pipeline when scaled up also has little effect on performance.

Among the workloads used in this study, all exhibit IPCs substantially lower than the peak theo- retical throughput of the scaled-up designs. In essence, this behavior means the frontend through-

put is not on the critical path most of the time, and therefore there is little benefit to scaling the

frontend beyond width 4. I.e., BAR4’s aggressive borrowing to implement an 8-wide frontend was

not necessary.

In summary, scaling up the frontend and backend seems to have little benefit. On the other

hand, it is worthwhile to scale frontend/backend width down when operating less aggressively for

substantial power savings, as scaling down usually comes at little or no performance cost.

5.5.2 Cache Aggregation

The effect of increasing cache capacity has been widely studied—statically in working set anal-

ysis as well as dynamically [5]. Several approaches to cache scaling present themselves for scalable cores, yet the fundamental question we investigate in this section is whether caches should be aggregated at all as cores scale up.

FIGURE 5-15. L1-I aggregation mechanisms across BAR and COR. (a) BAR4' with a single active L1-I; (b) COR4' with four active, sub-banked L1-Is; (c) IPC.

L1-I. Default BAR sub-banks L1-I caches to allow concurrent instruction fetch from all participat- ing cores. This mechanism increases the effective cache size without incurring a latency penalty, as each core accesses the nearest cache. Section 5.4.3, however, showed that sub-banking L1-Is imposes a substantial power cost. To find a less power-hungry alternative, we consider a BAR vari- ant, BAR’, in which only one L1-I is active, and instructions from this single cache feed all partici- pating cores with a 2-cycle penalty for remote accesses (Figure 5-15a).

On the other hand, default COR (Figure 5-13b) always uses a single L1-I, regardless of scaling point. In this section, we consider the effect of L1-I aggregation on COR-like designs with the COR' variant, in which sub-banked L1-Is operate collectively to increase capacity, at a latency cost to access an aggregated sub-bank (Figure 5-15b).

FIGURE 5-16. L1-I miss rate of default BAR with L1-I aggregation.

Figure 5-15c plots the resulting mean IPCs of having a single L1-I (1I$) and sub-banking mul-

tiple L1-Is ([2,4]SubBankI$). Surprisingly, the variant designs perform very similarly to the default designs; L1-I aggregation seems to have little impact on overall performance, at least for the cache sizes and benchmarks used in this study. To investigate why this might be the case, Figure 5-16

plots the L1-I miss rate sensitivity to aggregated L1-Is in default BAR. As expected, the miss rate decreases as the aggregated cache size increases. However, there are two reasons why the improved miss rate does not result in an overall performance increase. First, most benchmarks miss in the

L1-I very infrequently with our nominal cache size of 32KB at scaling point 1—even the large-instruction-footprint commercial workloads have miss rates of 10% or less. Second, the average IQ clog rate is 60% at scaling point 1, creating enough slack to hide L1-I miss latencies. We therefore infer from these results that L1-I cache aggregation is not an integral component for scalable cores, especially given its prohibitive power consumption (Section 5.4.3).

(a) BAR4’ with a single active L1-D (b) COR4’ with four active L1-Ds

1.0

0.5

0.0 Normalized IPC 1D$ 1D$ 1D$ 1D$ 2BankD$ 2BankD$ 4BankD$ 4BankD$ 2AdhocD$ 2AdhocD$ 4AdhocD$ 4AdhocD$

BAR2-4w COR2-4w BAR4-8w COR4-4w (c) IPC

FIGURE 5-17. L1-D aggregation mechanisms across BAR and COR.

L1-D. Default BAR bank-interleaves aggregated L1-D caches from other cores, to maximize total cache size1 (Figure 5-13a). BAR employs a predictor at steering-time to steer memory instructions,

to maximize locality. In case of mispredictions (20% on average in BAR4 (Figure 5-5)), accesses

must be re-routed to the correct caches via the inter-core crossbar, effectively lengthening load-to-

use latency.

As with our approach to L1-I aggregation, we evaluate L1-D aggregation with variants on our

default designs. BAR4’ accesses a single L1-D when scaled up, incurring a latency penalty if that cache is not local (Figure 5-17a). COR4’ (Figure 5-17b) interleaves four caches, accessing remote

1. We also considered a scheme in which cores always access the nearest cache (i.e., ad-hoc L1-D aggregation), and rely on normal inter-core coherence to forward values appropriately. However, we found this aggregation approach to perform worse than interleaving in nearly all cases, due to lack of spatial prefetching. 80 cache slices, incurring a latency penalty when necessary, but after effective address calculation

(eliminating the effect of cache-bank prediction).

Figure 5-17c plots the mean IPCs of each design and variant. Overall, L1-D aggregation

([2,4]BankD$) seems to do more harm than good, an effect more pronounced for the fully scaled- up cores (2% for BAR4 and 4% for COR4). In the case of COR and COR’, aggregating additional caches improves capacity but harms latency, and in the case of BAR and BAR’, removing cache aggregation (even at constant latency cost for some cores) eliminates costly mispredictions.

Together, these results are counterintuitive, as they suggest that smaller caches yield better IPCs.

However, consider the classic equation for a three-level cache hierarchy:

    L_{MEM} = p_{HIT-l1} \cdot L_{l1} + p_{HIT-l2} \cdot L_{l2} + p_{HIT-l3} \cdot L_{l3} + p_{MISS} \cdot L_{DRAM}    (5.1)

where L_{MEM} is the average latency of accessing the memory hierarchy, p_{HIT-li} is the hit probability in the i-th-level cache, L_{li} is the access latency of the i-th-level cache, p_{MISS} is the probability of missing in all the caches, and L_{DRAM} is the DRAM access latency.

To first order, L1-D cache aggregation does not alter the probability of a hit in the L2 or L3 caches. Similarly, the latency of these caches does not change with the L1-D configuration. However, cache aggregation does affect p_{HIT-l1} and L_{l1}:

• Interleaving effectively doubles cache capacity with two caches (or quadruples with four-way aggregation), suggesting intuitively that p_{HIT-l1} should increase. However, many of our workloads have extremely large working sets that easily exceed even the sizes of our L3 caches. In fact, the smallest inner working set among these workloads is 0.3MB, by gamess [30], which is three times larger than our largest aggregated L1-D of 128KB. Therefore, p_{HIT-l1} does not substantially change. Figure 5-18 confirms this by demonstrating that enlarging the L1-D capacity by 4x (scaling point 4) has a negligible impact on the L1-D miss rate for most of the benchmarks.

• Core-interleaving can lengthen L_{l1} because of the likelihood of cache-bank misprediction. In our evaluation, a misprediction doubles the effective L1-D access latency from 2 to 4 cycles. The average latency therefore becomes 2.4 cycles at scaling point 4 when factoring in the 20% misprediction rate.

FIGURE 5-18. L1-D miss rate of default BAR with L1-D aggregation.
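To see how these two effects interact in Equation 5.1, the small sketch below evaluates the average memory latency for an interleaved versus a non-aggregated L1-D. The hit probabilities and the DRAM latency are illustrative placeholders chosen in the spirit of the discussion above, not measured values from our simulations; the L2 and L3 latencies follow the 12- and 24-cycle figures used later in Table 6-2.

    def avg_mem_latency(p_hit_l1, l1_lat, p_hit_l2, l2_lat, p_hit_l3, l3_lat, dram_lat):
        """Average memory-hierarchy latency per Equation 5.1."""
        p_miss = 1.0 - (p_hit_l1 + p_hit_l2 + p_hit_l3)
        return (p_hit_l1 * l1_lat + p_hit_l2 * l2_lat +
                p_hit_l3 * l3_lat + p_miss * dram_lat)

    # Illustrative placeholders (not simulation results): a 20% bank-misprediction
    # rate stretches the 2-cycle L1-D to an effective 0.8*2 + 0.2*4 = 2.4 cycles,
    # while the larger aggregated capacity barely moves the L1-D hit probability.
    no_aggregation = avg_mem_latency(0.90, 2.0, 0.06, 12, 0.03, 24, dram_lat=300)
    interleaved    = avg_mem_latency(0.91, 2.4, 0.05, 12, 0.03, 24, dram_lat=300)
    print(no_aggregation, interleaved)  # the longer L1 latency outweighs the small hit-rate gain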

We also evaluated the effect of working set size by simulating with an L1-D cache 4x larger than our default (but no more latent), which yielded performance very close to our baseline size.

The most sensitive benchmark was povray, which saw a 6% performance improvement. Contrary to intuition, we found no evidence of inner working sets that fit in 128KB but not in 32KB. Therefore, among the designs we have selected, L1-D aggregation increases cache latency for a negligible reduction in miss rate, degrading overall performance. We cannot claim this is a general phenomenon—it may not hold for all cache sizes, latencies, and caching techniques—but we find it holds across most benchmarks (especially those with high cache hit rates), and across all microarchitectures examined in this study.

5.6 Improving the Energy Efficiency of Scalable Cores

The previous two sections support our general conclusion that borrowing resources from

other cores, as in BAR, is an inefficient approach to scaling up due to the latency and energy costs

of communication. Conversely, the COR design shows that significant benefit can be obtained just from scaling those resources that are easily and cheaply attained through overprovisioning (i.e., small area), without communication penalty.

However, BAR scales down more gracefully than COR, with BAR1 dissipating 8% less power than COR1. This power advantage arises because BAR scales down its frontend/backend resources,

whereas COR only scales down frontend utilization. This distinction is important, as BAR1’s two- wide frontend consumes 16% less power on average than COR1’s four-wide frontend, even though they yield similar performance (Figure 5-14a).

The performance scaling of COR and power scaling of BAR guide our design philosophy for

COBRA (Cheaply Overprovisioned and Borrowed Resources for All execution types), a hybrid design point that provides the foundation for two execution styles: an out-of-order version

(COBRo) and an in-order version (COBRi). (We use the term COBRA when discussing character-

istics applicable to both COBRo and COBRi.) COBRA seeks the best of both worlds. First, COBRA

leverages the high-performance features of COR (overprovisioned window and execution

resources) with the lower-power features of BAR (an interleaved ROB and a dynamically scalable

frontend/backend). Specifically, COBRA uses a core-private interleaved ROB and a core-private

pipeline of variable width (2-wide or 4-wide) to achieve better power scaling. Second, COBRA can

borrow an adjacent core’s execution resources (about ~0.3 mm2 each, including datapath) to scale

beyond its own overprovisioned resources (COBRA8), but because these resources are very small, and because they are the only borrowed resource between two cores, they can easily fit within the single-cycle communication domain (i.e., Chapter 2). Hence, COBRA keeps the cost of borrowing small enough not to diminish the benefits, whereas wire delays and power increases limit BAR from scaling up beyond four.

FIGURE 5-19. Conceptual block diagram of the COBRA hybrid design. Shaded components are shared between two adjacent cores, and dotted components are turned off.

Figure 5-19 depicts the resulting designs of COBRo and COBRi. The former makes use of

aggressive OoO IQs like BAR and COR, coupled with COR’s base steering, while the latter takes

WiDGET’s simple in-order execution unit (EU) approach. To mitigate in-order issuing constraints

(Chapter 4), COBRi employs WiDGET’s steering heuristic instead, and each of the EUs is

equipped with multiple in-order instruction buffers. Table 5-5 provides configuration details of

both COBRo and COBRi and also marks which design influenced the design of a given compo-

nent. Appendix D discusses the reasons for our choice of instruction buffer count per EU (i.e.,

eight) on COBRi.

TABLE 5-5. COBRA Configuration Parameters

L1-I
  Scaling 1, 2, 4, 8: No aggregation (Total: 32KB)
Frontend/Backend Width
  Scaling 1: 2 wide
  Scaling 2, 4, 8: 4 wide
Frontend Depth
  Scaling 1, 2, 4, 8: 7 cycles
Scheduling
  Scaling 1: COBRo: 1 unified 16-entry OoO IQ, no steering necessary, 2-wide issue per IQ. COBRi: 1 EU (eight 16-entry instruction buffers per EU), WiDGET steering, single issue per EU.
  Scaling 2: COBRo: 2 aggregated OoO IQs, back-to-back bypass between the IQs, base steering, 2-wide issue per IQ. COBRi: 2 aggregated EUs, back-to-back bypass between the EUs, WiDGET steering, single issue per EU.
  Scaling 4: COBRo: 4 aggregated OoO IQs, back-to-back bypass between the IQs, base steering, 2-wide issue per IQ. COBRi: 4 aggregated EUs, back-to-back bypass between the EUs, WiDGET steering, single issue per EU.
  Scaling 8: COBRo: 2 clusters of 4 IQs, back-to-back bypass within a cluster, 1-cycle link between clusters, base steering, 2-wide issue per IQ. COBRi: 2 clusters of 4 EUs, back-to-back bypass within a cluster, 1-cycle link between clusters, WiDGET steering, single issue per EU.
Execution Resources
  Scaling 1: 1 IALU, 1 FPALU, 1 AGEN
  Scaling 2: 2 IALU, 2 FPALU, 2 AGEN
  Scaling 4: 4 IALU, 4 FPALU, 4 AGEN
  Scaling 8: 8 IALU, 8 FPALU, 8 AGEN
Instruction Window
  Scaling 1: 64 entries, unified
  Scaling 2: 2 banks, 64 entries each
  Scaling 4, 8: 4 banks, 64 entries each
L1-D
  Scaling 1, 2, 4, 8: No aggregation (Total: 32KB)

5.6.1 Evaluation

Figure 5-20 shows a power-performance graph comparing the two versions of COBRA (i.e.,

COBRo and COBRi) to BAR and COR (mean of all the benchmarks from Figures 5-2 and 5-8). All four designs have scaling points one, two, and four, but COBRA also includes an additional scaling point (COBRo8 and COBRi8), achieved through overprovisioning and borrowing (i.e., beyond the capabilities of BAR or COR by themselves).

FIGURE 5-20. Power-performance of all designs with the default configurations normalized to the baseline.

As COBRo more closely resembles the microarchitectures of BAR and COR than COBRi does,

COBRo’s power-performance curve has a similar shape with BAR’s and COR’s. At scaling point 1,

COBRo leverages a scaled-down (though fully private) frontend like BAR1 for power savings. As

COBRo scales up, its performance follows that of COR2 and COR4 at those respective scaling points, by using only core-private (cheaply overprovisioned) resources. Despite the identical per- formance, COBRo consumes 5% less chip power than COR by replacing COR’s monolithic ROB with smaller, interleaved ROBs (Figure 5-24).

Although the scale-2 cores with OoO execution (i.e., BAR, COR, and COBRo) are most similar to the baseline OoO, they have slightly worse performance for the following three reasons.

FIGURE 5-21. IPC of COBRi8 normalized to COBRo8.

First, the scalable cores partition execution bandwidth such that each IQ only has one mix of each

ALU type. Despite the same total count of execution resources for all designs including OoO, the

hardwired partitioning can leave otherwise-usable execution bandwidth underutilized. Sec-

ond, the simple base steering used by COR and COBRo does not take load balancing into consider-

ation in order to optimize for data locality. Third, scaled-up BAR incurs additional latency

overheads as discussed before. These three factors decrease the aggregated issue rates of BAR2 and

COR2 by 10% and 12%, respectively, compared to the baseline OoO, resulting in performance deg-

radation.

COBRi, on the other hand, has a distinctively different power-performance curve due to its in-

order execution style. Because each EU can only issue one instruction per cycle, COBRi's issue bandwidth is half that of the designs with OoO IQs, significantly handicapping COBRi1 and

COBRi2. Nonetheless, at scaling point 4, which matches the issue bandwidth with the frontend/ backend width, COBRi outperforms all other designs by 13-23%. Furthermore, COBRi outper-

forms its OoO counterpart, COBRo, by 21% at scaling point 8. These counter-intuitive results are

due to two properties unique to COBRi. First, COBRi employs more intelligent instruction steer-

ing—based on WiDGET's steering logic—to expose independent instructions to the in-order buffers. Second, COBRi creates a larger instruction window by provisioning eight times more of the equally sized buffers than COBRo, although each of COBRi's buffers is a much less capable, in-order buffer. These two properties together give COBRi better latency tolerance than COBRo, essentially overcoming its in-order issuing constraints.

FIGURE 5-22. MLP of COBRo (left bars) and COBRi (right bars). (a) Integer (left) and commercial (right) benchmarks; (b) floating-point benchmarks.

To understand each benchmark’s contribution to the harmonic mean IPC at scaling point 8,

Figure 5-21 plots per-benchmark IPCs of COBRi8 normalized to COBRo8 in ascending order. COBRi8 outperforms COBRo8 on all benchmarks; however, five benchmarks (bwaves through libquantum) see more than 25% higher IPCs. These benchmarks benefit the most from the larger window and the improved latency tolerance of COBRi8, resulting in higher MLP, as Figure 5-22 presents. These performance results also confirm the observation made in Chapter 4 that harnessing many in-order buffers can match—or even exceed—the performance of traditional OoO issuing.

FIGURE 5-23. Per-benchmark IPC of COBRo.

Borrowing execution resources from within the single-cycle communication domain (i.e., adjacent neighbor) enables both COBRo and COBRi to have more power-performance scalability than BAR and COR, achieving a wide performance range of 1.6 for COBRo and 2.6 for COBRi. As

Figure 5-23 demonstrates for the COBRo case (COBRi shares the same characteristics), this greater

scalability is especially useful for memory-intensive benchmarks, including mcf and libquantum

(6% and 23% higher IPC, respectively) because the additional IQs (or instruction buffers) effec-

tively provide more buffering for memory-dependent instructions while increasing MLP. Bench-

marks that do not stress latency tolerance (e.g., perlbench) scale more modestly from scaling points

4 to 8.

As for the power scaling, Figure 5-24 demonstrates that it is negligible or modest for most of the components. The two exceptions are L1D and COBRo’s Sched/Steer. The power increase of the

former is solely from higher utilization of the statically provisioned resource. The latter, on the other hand, deploys twice as many power-demanding OoO IQs, resulting in 2.4 times more Sched/Steer power. In contrast, COBRi provides power savings of up to 99% by using in-order issuing. Nevertheless, even with a greater-than-linear increase in Sched/Steer power by COBRo, distributed OoO IQs scale power much better than a larger, monolithic OoO IQ of the same effective size [56,28].

FIGURE 5-24. Categorized chip-wide power consumption.

Overall, COBRA’s design philosophy provides an energy efficient foundation for both of the

execution styles, as Figure 5-25 presents. Both always achieve better energy efficiency than BAR and COR, except for COBRi1, COBRo exceeding them by up to 48% under ED2 and COBRi exceeding by up to 68% under ED2. COBRi’s inefficiency at scaling point 1 is due to the much

lower performance than the other designs while not conserving the power proportionally, as

Figure 5-20 demonstrates. In contrast, at scaling point 2, COBRi balances power and performance,

yielding the best energy efficiency in both ED (Figure 5-25a) and ED2 (Figure 5-25b) despite the

lowest performance (Figure 5-20). The energy efficiency gap between COBRo and COBRi becomes

larger (by up to 50%) at the higher scaling points of 4 and 8. COBRi outperforms its OoO-issuing counterpart by effectively harnessing the lower-power in-order building blocks, providing the most energy-efficient scalable core design.

FIGURE 5-25. Geometric mean energy efficiency. (a) Geometric mean ED; (b) geometric mean ED2.

5.7 Summary

This chapter deconstructed prior scalable core designs from the literature [40,46,75,28], and made an observation that the fundamental difference between the designs lies in where additional resources come from. These additional resources come from either across cores or within a core itself, and result in trade-offs between latency and the amount of scalable resources. Resource bor- rowing allows maintaining pipeline balance throughout core scaling without significant area over- heads. However, borrowing incurs latency penalties. In contrast, overprovisioning sacrifices balance in an effort to reduce wire delays. As wire delays worsen in smaller technology nodes, this chapter evaluated if/what elements are worthwhile to scale even with latency overheads, and what scaling mechanisms should be employed from an energy efficiency standpoint.

We showed that only a small portion of a core resides within single-cycle communication dis- tance of other cores, and this area becomes smaller with higher clock frequency and in smaller technology nodes. As an example of this phenomenon, we find that coarse-grained aggregation of

L1-D caches can do more harm than good (2-4% performance degradation on average), by not improving the hit rate enough to compensate for the increased load-to-use latency. L1-I aggregation, on the other hand, comes with a prohibitive power cost and little performance benefit. The opposite design extreme, overprovisioning core-private resources, avoids the energy inefficiency of cache aggregation, but the unscaled, wide frontend wastes energy when the core is scaled down. We find that 16% of frontend power can be saved by scaling down the frontend, at little performance cost when paired with a scaled-down pipeline.

Our COBRA hybrid design integrates the performance scalability of an overprovisioned COR design with the lower-power features of a borrowing design, BAR. COBRA also enables further performance scaling by only borrowing small, latency-effective execution resources. We explored two execution styles based on the COBRA design philosophy: out-of-order COBRo and in-order

COBRi. With better performance and lower power, COBRo improves the energy efficiency of COR by up to 6% and BAR by up to 48% on average. In contrast, COBRi leverages multiple in-order instruction buffers with WiDGET's steering heuristic, scaling performance from in-order to coarse-grain OoO. Once the issue bandwidth matches the frontend/backend width, we showed that in-order COBRi even outperforms the other OoO-based designs while still enabling power savings of up to 39%. At scaling points of 2 or higher, COBRi always delivers the best energy efficiency, improving on COR by up to 43% and on BAR by up to 68% on average.

However, even with in-order issuing and a scaled-down pipeline, the lowest power-performance point of COBRi is still far from zero, the ideal lowest end of the power proportionality curve. The next chapter therefore investigates methods to reach the lower region of the curve.

Chapter 6

Power Gliding: Extending the Power-Performance Curve

FIGURE 6-1. Conceptual power proportionality goal.

The final proposal of this thesis focuses on scaling down a processor along the left side of the power proportionality curve as Figure 6-1 illustrates. In this low-power low-performance region, commercial processors increasingly rely on frequency scaling to lower power, as the utility of

DVFS becomes limited in future technology nodes. However, the linear power reduction of fre- quency scaling calls for an alternative, more efficient method.

This chapter explores power gliding, a microarchitectural alternative to fine-grain power management, seeking a power-performance curve closer to DVFS than to frequency scaling. Power gliding selectively disables or constrains microarchitectural performance optimizations, on a per-core basis, trading performance for power reductions. Our approach draws insight from the performance optimization rule established by Intel: a 1% performance improvement

should come with no more than a 3% power increase; otherwise, DVFS could do better [32].

Power gliding uses this rule in reverse: by disabling optimizations that meet this 3:1 power-to-per-

formance ratio, power gliding can effectively extend the cubic DVFS power-performance curve.

While some optimizations may have much less than a 3:1 ratio, and thus should be left on, others

may exceed the 3:1 ratio for a given workload, allowing power gliding to do better than DVFS.
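As a back-of-the-envelope illustration of this reversed rule, the sketch below flags an optimization as a power-gliding candidate when its measured power cost per unit of performance meets or exceeds the 3:1 ratio. The helper function and the example numbers are hypothetical, not part of our simulation infrastructure.

    def is_gliding_candidate(perf_gain_pct, power_cost_pct, ratio=3.0):
        """Return True if disabling the optimization should scale down at
        least as efficiently as DVFS: the optimization buys perf_gain_pct
        percent performance at power_cost_pct percent power."""
        return power_cost_pct >= ratio * perf_gain_pct

    # Hypothetical optimizations (illustrative numbers only).
    optimizations = {
        "optimization A": (0.5, 2.0),  # 0.5% perf for 2.0% power -> disable
        "optimization B": (2.0, 1.0),  # 2.0% perf for 1.0% power -> keep
    }
    for name, (gain, cost) in optimizations.items():
        action = "glide (disable)" if is_gliding_candidate(gain, cost) else "keep"
        print(f"{name}: {action}")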

We decided to aim for the 3:1 power to performance ratio, rather than simply attempting to

extend the 1:1 ratio from frequency scaling, because traditional performance optimizations have

been built into chips based on the DVFS 3:1 rule. This means there are many existing optimiza-

tions that approach this 3:1 rule, and are therefore provide the potential for scaling down at this more aggressive level. Essentially, because we are looking to selectively disable existing optimiza-

tions in order to extend the low power scaling ability of DVFS, it makes intuitive sense to aim for

the ratio at which those optimizations were designed.

Potentially a large number of performance optimizations exist that meet the power gliding cri-

terion and, as a result, can be disabled for power savings. In fact, many previously proposed low-

power techniques that result in a performance loss become viable options, even though they might

not have been previously considered appropriate for high-performance processors. Under the 3:1

ratio power gliding leverages, no complex policies are necessary to use those techniques as long as

the power savings exceed the performance loss. We select two sets of intuitive techniques that

affect a fairly broad range of workloads in a pair of case studies: frontend power gliding and L2

power gliding. We apply power gliding to the baseline out-of-order design in Chapter 5 due to the 94 generic nature of our power gliding mechanisms. Nonetheless, we also evaluate L2 power gliding on a scalable core to examine the extent of power proportionality this thesis accomplishes.

The rest of the chapter first discusses the limitations of frequency scaling further and identifies opportunities for more efficient power-performance scaling (Section 6.1). After explaining our simulation environment (Section 6.2), we provide two case studies (Section 6.3 and Section 6.4).

Finally, we summarize our findings (Section 6.5).

6.1 Limitation of Frequency Scaling and Power Gliding Opportunities

The operating frequency range of commercial chips is typically much larger than the operating voltage range [43]. However, frequency scaling is less effective for power savings than voltage scal- ing for two main reasons. First, the former only enables a linear dynamic power reduction, signifi- cantly less than the cubic reduction achieved by voltage (and frequency) scaling. Second, unlike voltage scaling, frequency scaling does not directly impact static power. Given that static power is a significant contributor to the total power in smaller technology nodes [36], this limitation con- strains the amount of power savings achievable via frequency.
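To make the linear-versus-cubic contrast concrete, the sketch below uses the textbook first-order CMOS power model (dynamic power proportional to C·V²·f plus a static term). The capacitance, activity factor, and static-power values are illustrative placeholders, not parameters of our simulated core.

    def chip_power(freq_ghz, vdd, alpha=0.2, cap_nf=30.0, p_static_w=8.0):
        """First-order CMOS power model: alpha * C * V^2 * f plus static power.
        With C in nF and f in GHz, the dynamic term comes out in watts."""
        p_dynamic = alpha * cap_nf * vdd ** 2 * freq_ghz
        return p_dynamic + p_static_w

    nominal = chip_power(freq_ghz=3.0, vdd=1.0)

    # Frequency-only scaling: dynamic power shrinks linearly, static power stays.
    f_scaled = chip_power(freq_ghz=1.5, vdd=1.0)

    # DVFS: lowering voltage along with frequency shrinks dynamic power roughly
    # cubically (and in reality reduces static power too, which this simple
    # model ignores).
    dvfs = chip_power(freq_ghz=1.5, vdd=0.8)

    print(nominal, f_scaled, dvfs)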

With these factors in mind, we first model and understand the power-performance impacts of frequency scaling on different workload types (Section 6.1.1) and then identify opportunities to achieve more graceful power-performance scaling using architectural techniques and other exist- ing circuit techniques (Section 6.1.2).

6.1.1 Analysis of Frequency Scaling

A trend in the industry has been to achieve more fine-grained power management by increasing the number of frequency domains on chip [43]. We therefore model independent frequency

TABLE 6-1. Workload characteristics. L3 MPKI = L3 misses per thousand instructions. BPKI = branches per thousand instructions. SPKI = branch-misprediction-caused squashes per thousand instructions. L2 sensitivity = increase in L2 miss rate from an 8-way L2 to a direct-mapped L2. Shaded cells indicate representative workloads.

Workload     L3 MPKI  BPKI  SPKI   L2 Sens.    Workload    L3 MPKI  BPKI  SPKI  L2 Sens.
libquantum   36.1     251   .003   1.0         gromacs     1.1      12    0.9   3.7
mcf          36.0     205   6.7    1.3         hmmer       0.9      6     0.2   1.4
bwaves       28.1     3     0.2    1.2         omnetpp     0.9      231   7.9   12.4
milc         23.8     14    .009   1.0         gcc         0.8      185   13.0  9.3
lbm          22.9     12    0.1    1.5         leslie3d    0.7      116   13.2  19.6
gems         15.1     6     0.1    2.9         calculix    0.6      45    1.4   5.5
zeus         8.8      153   15.6   4.2         dealII      0.5      43    5.7   5.5
apache       8.1      154   20.7   4.0         h264ref     0.5      67    2.8   12.0
sphinx3      7.9      56    10.3   1.7         bzip2       0.4      166   2.4   7.9
perlbench    5.8      176   5.2    2.3         gobmk       0.4      140   19.0  40.0
cactus       4.2      1     0.1    2.2         sjeng       0.3      137   18.4  14.9
zeusmp       3.4      16    0.1    5.9         tonto       0.3      74    8.2   13.4
astar        2.9      129   23.8   2.1         wrf         0.2      120   2.3   14.8
jbb          2.0      172   11.7   3.8         namd        0.2      53    2.4   22.0
soplex       1.6      200   6.0    8.0         povray      0.2      101   3.7   57.0
xalancbmk    1.4      212   4.5    1.7         gamess      0.1      53    2.7   89.2
oltp         1.2      155   20.1   8.1         HMean       1.7      17    0.1   13.8

domains for each core and an on-chip L3 cache. Because an L3 is usually shared in a chip multiprocessor (CMP) for servers while an L2 is more often core-private [43], we assume that core frequency scaling only affects the associated L1 and L2; the frequencies of the L3 and the memory remain the same. In addition, we idealize asynchronous domain crossing as having no overhead, resulting in an optimistic frequency scaling model.

late five of the scaled-down frequencies, 1.0 (nominal frequency), 0.88, 0.76, 0.64, and 0.50 (mini-

mum frequency), on a 4-way out-of-order (OoO) superscalar core (Section 6.2) running the SPEC

CPU 2006 benchmark suite [34] and Wisconsin Commercial Workloads [2]. Table 6-1 lists the

characteristics of each workload. We show the results of five representative workloads throughout

this chapter: two memory-intensive (libquantum and bwaves), two CPU-intensive (gobmk and

namd), and one commercial (oltp) workloads. These workloads also represent a diverse mix of characteristics that the following two case studies exploit. The first case study, frontend power gliding exploits branches per thousand instructions (BPKI) and branch-misprediction-caused squashes per thousand instructions (SPKI). libquantum has high BPKI but low SPKI. bwaves has low BPKI and SPKI. Both gobmk and oltp have high BPKI and SPKI. Finally, namd has medium

BPKI and low SPKI. The performance sensitivity of the second case study, L2 power gliding depends on workloads’ L3 misses per thousand instructions (MPKI) and L2 miss-rate sensitivity to different associativity (L2 Sensitivity). Memory-intensive libquantum and bwaves have high L3

MPKI and low L2 sensitivity, while CPU-intensive gobmk and namd exhibit the opposite charac- teristics of low L3 MPKI and high L2 sensitivity. oltp, on the other hand, sits in the middle: medium L3 MPKI and L2 sensitivity.

Figure 6-2 and Figure 6-3 plot the five workloads’ resulting chip power and run-time slow- down from frequency scaling, respectively. In general, frequency scaling degrades performance

(up to 44% on average) more than it saves power (up to 36% on average), especially at lower fre- quencies. This is because frequency scaling reduces dynamic power linearly, but leaves static power largely unchanged. In fact, our simulation results show that static power accounts for 89-

99% of the total L2 power because the L2 is idle for a significant portion of the time.

FIGURE 6-2. Chip power reduction by frequency scaling.

FIGURE 6-3. Run-time slowdown by frequency scaling.

Upon closer inspection, these two figures demonstrate that core frequency scaling does not apply uniformly across the workloads. As expected, memory-intensive workloads are power- and performance-tolerant to reducing frequencies because much of the time is spent idle waiting for memory data. The higher nominal frequency of the L3 and memory makes the perceived access latencies smaller for the core. This behavior in turn implies that these workloads consume more L2 and L3 dynamic power than the other workloads, which coincides with the chip power breakdown shown in Figure 6-4. The L2 and L3 power of libquantum, for instance, occupies 57% of the total power, but the fraction is only 13% for the CPU-intensive benchmark, namd. Despite the much higher cache power consumption by libquantum, its total chip power is significantly less than namd’s, making the absolute L2 and L3 power of these two benchmarks roughly the same.

Figure 6-4 also shows that our 8MB L3 consumes less power than the 1MB L2, as the L3 is opti- mized for static power, while the L2 is tuned for performance.

To first order, lowering the core frequency has no impact on the L3 power. Coupled with the dominance of static power in the L2 and L3, these large caches are almost insensitive to frequency scaling and, consequently, the fraction of the total power consumed by the caches only rises as the frequency decreases. Therefore, the large L2 and L3 limit the overall power savings of the memory-intensive workloads, making frequency scaling less effective in reducing power.

FIGURE 6-4. Chip power breakdown at the nominal frequency.

6.1.2 Power-Performance Scaling Opportunities

Architecture scale-down is effective when the power benefits at least outweigh the perfor- mance loss. We also want to deliver a power-performance curve that is lower than the frequency scaling curve. Although many circuit-level techniques have been investigated to enlarge the volt- age scaling range [21,19], we argue that there is abundant unexplored territory in which architec-

tural techniques can flourish, side by side with re-purposed circuit techniques. In particular, chips

should trade off performance for power during power-saving modes and provide knobs for the

low-level system software to exploit for graceful power scaling.

We present two case studies to make our argument more concrete, though many other tech-

niques are also possible. One example is the approach used by scalable core designs that scale

down the pipeline resources, width, and depth. Beyond these core resource scalings, hardware

speculation mechanisms (e.g., the memory disambiguation predictor and prefetching) and the interconnect network

TABLE 6-2. Baseline configuration parameters
Fetch buffer: 16 entries
Renamer checkpoints: 16
Physical registers: 128
L2 cache: 1MB, 8-way, 2 banks, 64B line, 12 cycles, write back, private, high-performance device type
L3 cache: 8MB, 16-way, 8 banks, 64B line, 24 cycles, shared, low-standby-power device type

TABLE 6-3. Simulated frequency scaling points
1.0 (Nominal), 0.88, 0.76, 0.64, 0.50 (Minimum)

should also be re-examined to see if they meet power gliding’s 3:1 ratio. One can also implement

more selective drowsy-mode policies that discriminate instruction and data streams in a shared

cache, for instance. Ultimately, the goal of this exploration lies in identifying extra resources or

activities that only produce marginal performance benefits and scaling them down during power-

saving modes.

6.2 Methodology

We use the 4-way OoO superscalar described in Chapter 3 as our baseline core design. Table 6-

2 details the configuration parameters specific to this chapter. Our frequency scaling model is

based on IBM’s most recent POWER7 [43], which implements per-core frequency scaling.

Although POWER7 allows a fine-grained frequency step of 25MHz, we only simulate the nominal and minimum frequencies and three intermediate frequencies while fixing the voltage, and use a linear interpolation to estimate the remaining ones. Table 6-3 notes the simulated frequency points.

We derive DVFS curves by assuming a 22% operating voltage range based on the Intel Pentium M.

We factor in supply voltage and temperature fluctuations when estimating the corresponding static power.

Each case study discussed in the following two sections provides additional configurations specific to the study.

6.3 Case Study 1: Frontend Power Gliding

Our first case study explores opportunities within the frontend of a core for trading off ILP and/or MLP for power during power-saving modes. The frontend is an attractive target—it is organized for quick retrieval of instructions after a pipeline flush, and it is generally underutilized once the scheduler is full. In addition, the rate of instruction flow to the backend sets an upper bound on achievable ILP and MLP. In this case study we apply previously proposed mechanisms to reduce wasteful power from wrong-path instructions and lessen or turn off the capabilities of power-hungry frontend structures—even at the cost of performance (Section 6.3.1). We then compare the resulting power and performance to those of frequency scaling and DVFS (Section 6.3.2).

6.3.1 Implementation

Frontend power gliding makes collective use of three techniques: checkpoint removal, speculation control, and power-gating portions of the fetch buffer and registers. These techniques work together to provide more efficient power-performance scaling than frequency scaling.


FIGURE 6-5. Useful checkpoint rate of the baseline

Checkpoint removal. Many high-performance OoO processors checkpoint architectural state (e.g., a register renamer table) to recover from branch mispredictions [45,76]. Although checkpoints facilitate fast misprediction recovery [38,55], they in fact offer only small opportunities to improve overall performance because modern branch predictors are highly accurate [62]. Figure 6-5 reports the useful checkpoint rate as the fraction of committed mispredicted branches over checkpoint-allocated branches. When allocating checkpoints to all branches, only 0.002-8% of all checkpoints are useful for misprediction recovery.
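Restating the metric concretely, the small sketch below computes the useful checkpoint rate from two counters, assuming the all-branches policy in which every branch allocates a checkpoint; the example counts are invented for illustration.

def useful_checkpoint_rate(committed_mispredicted_branches, checkpointed_branches):
    """Fraction of allocated checkpoints that were actually used for recovery."""
    if checkpointed_branches == 0:
        return 0.0
    return committed_mispredicted_branches / checkpointed_branches

# A hypothetical run with 180M checkpointed branches and 1.2M committed
# mispredictions yields a useful checkpoint rate below 1%.
print(f"{useful_checkpoint_rate(1_200_000, 180_000_000):.2%}")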

Moshovos proposed reducing checkpoints either by predicting which branches are likely to require checkpoints, and/or by releasing checkpoints out of order [55]. While both of these techniques still recover from mispredictions as soon as they are detected, we propose a novel technique that alternates between recovery at detection and recovery at commit depending on the operating mode. During normal, performance-driven mode, we use conventional checkpoint-based recovery at detection. When power efficiency is required (i.e., power gliding mode), we disable all checkpointing hardware and recover at commit by flushing any structures with corrupted state. Compared to checkpoint-based recovery at detection, recovery at commit lengthens misprediction penalties by the cycles between detection and commit (unless it takes longer to re-fill the window), which can be significant. We mitigate the penalties by flushing wrong-path instructions in the OoO execution engine as soon as we detect a misprediction, and refetching the correct-path instructions. We also stall the rename stage, which contains corrupted state, until the recovery action is taken. Even though flushing the OoO execution engine is non-trivial, the mechanism already exists for checkpointing in the baseline design and eliminates unnecessary interference with correct-path execution.
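The sketch below summarizes the two recovery policies, assuming a simple mode flag and a dictionary of pipeline control signals; the signal names are illustrative and are not taken from our simulator infrastructure.

from enum import Enum

class Mode(Enum):
    PERFORMANCE = 0   # normal mode: checkpoints allocated, recover at detection
    POWER_GLIDE = 1   # gliding mode: checkpointing hardware disabled

def on_misprediction_detected(mode, pipeline):
    """Sketch of the mode-dependent misprediction recovery described above."""
    # In both modes the wrong-path work in the OoO engine is flushed immediately
    # and the correct path is refetched.
    pipeline["flush_wrong_path"] = True
    pipeline["refetch_correct_path"] = True
    if mode is Mode.PERFORMANCE:
        # Checkpoint-based recovery: restore the rename map right away.
        pipeline["restore_rename_map_from_checkpoint"] = True
        pipeline["stall_rename"] = False
    else:
        # No checkpoints: corrupted rename state is repaired only when the
        # mispredicted branch commits, so rename stalls until then.
        pipeline["restore_rename_map_from_checkpoint"] = False
        pipeline["stall_rename"] = True
    return pipeline

print(on_misprediction_detected(Mode.POWER_GLIDE, {}))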

Speculation control. To reduce wasteful power from wrong-path instructions, we also apply a simplified version of speculation control that gates the frontend when the amount of speculation in the window exceeds a certain threshold [14,51]. Although the prior proposals measure speculation by the number of unresolved low-confidence branches, we instead use the number of all unresolved branches, regardless of confidence, as a proxy. This coarse approximation penalizes workloads and phases with very low SPKI (squashes per thousand instructions); however, it obviates the need for a confidence estimator, which would increase complexity and power consumption.

The rest of the speculation-control mechanism remains the same. When the threshold is reached, we stall all instructions at rename; once the number of unresolved branches drops below the threshold, we un-stall renaming. The smaller the threshold, the more aggressively we suppress wrong-path instructions, with a greater risk of performance loss.
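A minimal sketch of this throttle appears below: rename is gated whenever the count of unresolved branches in the window reaches the threshold. The per-cycle trace is invented for illustration.

def rename_allowed(unresolved_branches, threshold):
    """Gate the rename stage when too many unresolved branches are in flight.
    Unlike the original proposals [14,51], no confidence estimate is used."""
    return unresolved_branches < threshold

# Toy trace of unresolved-branch counts, evaluated with the most aggressive
# Stall-1 setting (threshold = 1): rename stalls whenever any branch is unresolved.
trace = [0, 1, 2, 1, 0, 3, 1, 0]
stalled = [not rename_allowed(n, threshold=1) for n in trace]
print(stalled)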

Fetch buffer resizing. A complementary method for regulating speculation is to lessen fetch aggressiveness by shrinking the fetch buffer. The fetch buffer is usually sized larger than the fetch width to mask I-cache miss latencies. Because the fetch stage continues fetching instructions until the buffer becomes full, a larger buffer size generally increases the probability of fetching wrong-path instructions. Furthermore, even for low-SPKI workloads, the full capability of the fetch stage is unnecessary most of the time because it is designed to reduce the latency of infrequent window re-fills. Thus, we power-gate a portion of the fetch buffer, making the buffer size match the fetch width, which in turn results in less frequent accesses to the I-cache and instruction translation look-aside buffer.

Register file resizing. Because speculation control and fetch buffer resizing reduce the number of in-flight instructions, these two techniques create opportunities to proportionally resize other structures as well. While a more balanced approach is desirable, this initial work focuses on a major source of frontend power: the physical registers. Nevertheless, the next section shows that the less aggressive frontend also affects the rest of the pipeline. Although power-gating a portion of the physical registers shrinks both the physical register file and the free list, dynamic physical register file resizing is not an easy task because valid register mappings may be scattered across the file. We simply stop allocating registers that will be power-gated and enter the power gliding mode once those registers no longer contain valid mappings. This expensive operation only needs to occur once, and the power-gated portion remains fixed throughout the power-gliding mode.
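The sketch below outlines this drain-then-gate step under simplifying assumptions: physical registers are plain integers, and the routine merely removes the to-be-gated registers from the free list and reports whether they have drained; the data-structure names are illustrative.

def enter_register_gliding(free_list, mapped_regs, gate_set):
    """Stop handing out registers in gate_set and report whether the gated
    portion is free of valid mappings (the condition for entering the mode)."""
    usable_free_list = [r for r in free_list if r not in gate_set]
    drained = gate_set.isdisjoint(mapped_regs)
    return usable_free_list, drained

# Hypothetical 8-register file in which registers 4..7 will be power-gated.
gate = {4, 5, 6, 7}
free, ready = enter_register_gliding(free_list=[2, 5, 7],
                                     mapped_regs={0, 1, 3, 6},
                                     gate_set=gate)
print(free, ready)   # [2] False: register 6 still holds a valid mapping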

In summary, the above techniques work in concert to reduce wasteful work and lessen the aggressiveness of the frontend for power savings.

6.3.2 Evaluation

Table 6-4 explains the configuration space of this first case study. We experiment with five configurations that selectively stall renaming with varying degrees of aggressiveness. All the configurations power-gate the entire checkpointing hardware, three quarters of the fetch buffer, and half of the physical register file and the free list in the baseline.

TABLE 6-4. Configuration space for Case Study 1

Configuration   Max in-flight unresolved BRs   Checkpoint count   Fetch buffer size   Physical registers
Base            Unconstrained                  16                 16                  128
Stall-8         8                              0                  4                   64
Stall-4         4                              0                  4                   64
Stall-3         3                              0                  4                   64
Stall-2         2                              0                  4                   64
Stall-1         1                              0                  4                   64

In Section 6.1, we selected five representative workloads. These five workloads have different branch characteristics, as the third and fourth columns in Table 6-1 show. libquantum has high BPKI (branches per thousand instructions) but low SPKI. bwaves has low BPKI and low SPKI. Both oltp and gobmk have high BPKI and high SPKI. Finally, namd has medium BPKI and low SPKI. These workloads also represent other characteristics. libquantum and bwaves are memory-intensive, while gobmk and namd are CPU-intensive. Additionally, libquantum and gobmk represent integer benchmarks, and bwaves and namd represent floating-point benchmarks. oltp, on the other hand, is an on-line transaction processing workload.

Figure 6-6 presents power and performance normalized to the baseline for the five representative workloads as well as the harmonic mean of all SPEC and the commercial workloads. The data points labeled Freq Scaling are the same results as the ones in Figure 6-2 and Figure 6-3 in Section 6.1, and Stall-* represents the Stall-8 through Stall-1 configurations in Table 6-4. Analytical DVFS represents the derived DVFS curve for each workload. Although the degree and the scalability vary, all our configurations yield lower power-performance curves than frequency scaling, indicating a more efficient power-performance trade-off. Furthermore, some of the Stall-* data points even lie on top of the DVFS curves.

FIGURE 6-6. Power-performance normalized to the baseline, one panel per workload (libquantum, bwaves, oltp, gobmk, namd, and HMean, the harmonic mean of all 33 workloads), plotting normalized chip power against normalized performance for Freq Scaling, Analytical DVFS, and Stall-*. (Lower right is better.)

We verified that the effectiveness of our configurations holds true for the rest of the workloads, and summarize the results using the harmonic mean. These favorable results demonstrate that the previously proposed techniques we employed are viable options in the context of frequency scaling, even though the techniques may not have worked well under peak-performance restrictions. Coupling the techniques together reduces wasteful energy and trims the capability of structures designed for worst-case performance, rather than uniformly slowing down execution by frequency scaling regardless of whether instructions are on the wrong path or the correct path.

The effectiveness of frontend power gliding, however, depends largely on the workloads' BPKI

and SPKI because of the inherent dependence on branch characteristics. The higher the BPKI, the more unresolved branches an instruction window has, creating more opportunities for stalling the rename stage. Similarly, the higher the SPKI, the more reduction in wrong-path instructions we can achieve. Hence, the high-BPKI, high-SPKI workloads, oltp and gobmk, exhibit the most power savings for the amount of performance loss and also scale well with different stalling aggressiveness. The converse—that low BPKI and low SPKI exhibit less significant power savings—unfortunately also applies, a fact exemplified by the low-BPKI, low-SPKI workload bwaves. The low BPKI characteristic in particular makes this workload category insensitive to frontend stalling, resulting in one power-performance point for all levels of aggressiveness. The reduction in physical registers constrains the ILP and MLP as well as the power. However, only six out of 33 workloads fit in this workload category.

The utility of stalling for wasteful-power reduction is similarly dependent on BPKI and SPKI. Figure 6-7 plots the ratio of committed instructions over dispatched instructions. Because workloads with medium to high SPKI (e.g., oltp and gobmk) have more wrong-path instructions than those with low SPKI (e.g., libquantum, bwaves, and namd), the former improve the ratio by up to 54% as we increase the stalling aggressiveness. The latter workload type is insensitive to stalling because the ratio of the baseline is already quite high.

FIGURE 6-7. Ratio of committed / dispatched instructions for Stall-8 through Stall-1. (Higher is better.)

FIGURE 6-8. IPC impacts of the applied techniques ("& Chkpt Removal", "& Spec Cntrl", "& Fetch Buff", "& Regs") with Stall-1, normalized to the baseline.

To understand the performance impact of each frontend power gliding technique, we added one technique at a time to the baseline and measured the resulting IPCs. Figure 6-8 presents the IPC degradations compared to the baseline for each of the five representative workloads. The "& Chkpt Removal" stack shows the IPC loss just from our commit-time misprediction recovery. The next stack, "& Spec Cntrl", represents the IPC loss when coupling checkpoint removal with our most aggressive speculation control of allowing only one unresolved branch in flight. Similarly, "& Fetch Buff" adds fetch buffer resizing, and "& Regs" adds register file resizing to the techniques listed above. The height of the bottom "Stall-1" stack represents the IPC of all techniques combined (i.e., Stall-1) normalized to the baseline.

FIGURE 6-9. Power breakdown normalized to the baseline (L1-I, frontend, execution, backend, L1-D, L2, and L3) for Base and Stall-8 through Stall-1 on each workload.

As expected, the lack of checkpoints has very small performance implications because of the low useful checkpoint rate in the baseline (Figure 6-5). Even the high-SPKI workloads, oltp and gobmk, have only 5% and 7% performance degradation, respectively, from the longer misprediction penalties, and the fraction becomes negligible for workloads with low SPKI (e.g., libquantum, bwaves, and namd). Adding the most aggressive speculation control affects the workloads differently. libquantum suffers from this technique the most, degrading the IPC by 58%, because our approximated speculation control stalls rename often due to the high BPKI, but the stalling only slows down correct-path execution due to the low SPKI. The performance loss is more modest for the high-SPKI workloads (oltp and gobmk) and the medium-BPKI workload (namd). In contrast, the low-BPKI bwaves allows few opportunities for the speculation control, thereby showing insensitivity to the technique.

Combining the above two techniques with fetch buffer resizing has negligible performance impact, for different reasons. The speculation-control-sensitive workloads permit far fewer instructions to enter the window than the baseline; hence, the reduced fetching capability does not affect the overall performance much. The reason for the outlier bwaves is the mostly idle backend due to the memory intensity as well as the infrequent pipeline flushes (Figure 6-5). As bwaves is the only workload that is largely unaffected by the speculation control, it instead shows the most sensitivity (28%) to the reduced physical register count. The large IPC loss is also an indication that the baseline design is not skewed during the normal, unconstrained operation mode.

Figure 6-9 presents a normalized power breakdown. As expected, the fraction of the frontend

power becomes smaller with more aggressive stalling levels, with the exception of the stalling-insensitive bwaves. Much of bwaves' power reduction comes from the smaller physical register file, which slows down fetching and execution. Even though our approach only targets the frontend of the core, regulating speculation and scaling down the fetch buffer and physical registers reduce power consumption throughout the pipeline and the cache hierarchy. gobmk, for example, lowers the power of the execution logic by 47%, the backend by 50%, and the L1-D by 50% with Stall-1. Most importantly, frontend power gliding is also effective for memory-intensive workloads, which are largely tolerant of frequency scaling, yielding up to 15% and 18% chip power reduction for bwaves and libquantum, respectively. As a result, we achieve 2.5x more power-performance scalability than frequency scaling for bwaves and 3.8x for libquantum.

6.4 Case Study 2: L2 Power Gliding

The second case study addresses the increasingly problematic issue of static power in large caches. Static power essentially imposes an upper limit on the power savings achievable by frequency scaling, and large caches exacerbate the issue. As transistors become leakier in future technology nodes [23], it becomes ever more important to scale down static power along with dynamic power in order to provide a wide power range.

In this section, we turn our attention to existing circuit techniques for static power management and employ those techniques during power-saving modes rather than within a nominal operating mode. Hence, we do not investigate mechanisms to mitigate the performance degradations these techniques impose; rather, we use them without any complex policies to fully enjoy the benefits of the techniques. We first explain the five levels of power-saving techniques we apply to the L2 cache (Section 6.4.1), then evaluate the power-performance impacts both within the L2 and on the chip as a whole (Section 6.4.2). We conclude this section by evaluating L2 power gliding on the COBRi scalable core design proposed in the previous chapter for completeness (Section 6.4.3).

As we assume a private L2 but a shared L3, we leave the L3 configuration (Table 6-2) unchanged, similar to our frequency scaling assumption.

6.4.1 Implementation

For gradual power-performance scaling, we enable five levels of power-saving modes, with each additional level increasing in aggressiveness. The first level puts the entire L2 data arrays into a drowsy mode to cut down the L2 static power. Drowsy mode is an effective technique that supplies just enough voltage (the drowsy voltage) to preserve the state of the memory cells and switches to a higher active voltage to safely read out the data [23]. Because each L2 cache line is maintained at the low drowsy voltage during inactivity and only accessed lines temporarily boost the supply voltage, static power is significantly reduced. Accessing a drowsy line, however, takes an extra cycle to wake up, increasing the L2 access latency to 13 cycles in our implementation, and many policies for selective drowsy mode have been proposed to mitigate the performance impact [23,57]. Given the goal of this work—providing a lower power-performance curve than frequency scaling—we are more interested in aggressive power reduction, and hence we simply apply drowsy mode uniformly to the cache.
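The sketch below captures how a drowsy L2 can be accounted for: one extra cycle per drowsy array, and reduced leakage for the fraction of the arrays held at the drowsy voltage. The 4x leakage-reduction ratio in the sketch is an illustrative assumption, not the value used by our power model.

def l2_access_latency(drowsy_data, drowsy_tag, base_cycles=12):
    """Each drowsy array adds one wake-up cycle, matching the 12/13/14-cycle
    latencies used in this chapter."""
    return base_cycles + int(drowsy_data) + int(drowsy_tag)

def l2_static_power(nominal_leakage, drowsy_fraction, drowsy_leakage_ratio=0.25):
    """Rough leakage estimate when a fraction of the arrays sits at the drowsy
    voltage. drowsy_leakage_ratio is an assumed reduction, for illustration only."""
    awake = (1.0 - drowsy_fraction) * nominal_leakage
    drowsy = drowsy_fraction * nominal_leakage * drowsy_leakage_ratio
    return awake + drowsy

print(l2_access_latency(True, False))                              # 13 cycles (Level-1)
print(l2_static_power(nominal_leakage=1.0, drowsy_fraction=1.0))   # 0.25 under the assumed ratio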

The second level of power-saving modes goes one step further by putting the L2 tag arrays into drowsy mode as well, lengthening the access latency by another cycle (i.e., 14 cycles).

TABLE 6-5. Configuration space for Case Study 2

Configuration   Drowsy L2 Data   Drowsy L2 Tag   L2 Associativity   L2 Access Cycles
Base            N                N               8                  12
Level-1         Y                N               8                  13
Level-2         Y                Y               8                  14
Level-3         Y                Y               4                  14
Level-4         Y                Y               2                  14
Level-5         Y                Y               1                  14

The remaining three levels gradually trade off the L2 cache capacity for power by power-gating some of the associative ways; the gating remains in effect for the duration of each level. Again, we apply this technique uniformly to all the sets in the L2 for simplicity and maximum power savings, rather than optimizing for each workload's cache-line reuse patterns or idle periods [36]. When enacting each of the three levels, any dirty lines in the ways that will be power-gated must be written back to the L3. Without attempts to minimize the associated power and latency cost [36], we simply write back dirty lines before gating in this case study. We expect the performance impact of the reduced associativity to be workload dependent—depending on the workload's working set size, data locality, and reuse distance. However, compared to the first two levels, power-gating a portion of the L2 completely eliminates the static power of the disabled portion and reduces the dynamic energy of L2 tag accesses.
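A minimal sketch of this transition appears below, assuming each cache set is represented as a dictionary from way number to line; the data layout and the tiny example are illustrative only.

def power_gate_ways(l2_sets, ways_to_gate, write_back_to_l3):
    """Write back any dirty line in a way that is about to be gated, then drop
    the line; clean lines are simply discarded."""
    for set_lines in l2_sets:                 # one dict of way -> line per set
        for way in ways_to_gate:
            line = set_lines.pop(way, None)   # the line leaves the L2 either way
            if line is not None and line.get("dirty"):
                write_back_to_l3(line)

# Tiny example: a 2-set cache moving from 4 ways to 2 (a Level-4-style step).
l3_writes = []
sets = [
    {0: {"tag": 0x10, "dirty": True}, 2: {"tag": 0x20, "dirty": True}},
    {1: {"tag": 0x30, "dirty": True}},
]
power_gate_ways(sets, ways_to_gate=[2, 3], write_back_to_l3=l3_writes.append)
print(sets)        # only ways 0 and 1 remain populated
print(l3_writes)   # the dirty line evicted from way 2 was written back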

6.4.2 Evaluation

We evaluate the effectiveness of our five levels of power-saving modes against frequency scaling and DVFS. Table 6-5 summarizes the configuration parameters. As these configurations trade off the L2 cache performance for power, the effectiveness depends on a workload's memory intensity and the L2 miss rate sensitivity to cache sizes (the last column of Table 6-1).

FIGURE 6-10. Power-performance normalized to the baseline, one panel per workload (libquantum, bwaves, oltp, gobmk, namd, and HMean, the harmonic mean of all 33 workloads), plotting normalized chip power against normalized performance for Freq Scaling, Analytical DVFS, and Level-*. (Lower right is better.)

Both of the memory-bound workloads, libquantum and bwaves, are insensitive to L2 size because of their already high miss rates (0.31 and 0.40, respectively); however, the L2 miss rates of the CPU-intensive workloads (gobmk and namd) and the commercial workload (oltp) worsen as the L2 shrinks. We model the drowsy cache based on the work by Flautner et al. [23].

FIGURE 6-11. Power breakdown normalized to the baseline (L1-I, frontend, execution, backend, L1-D, L2, and L3) for Base and Level-1 through Level-5 on each workload.

Figure 6-10 plots the power-performance scaling of our five levels of power-saving modes, the frequency scaling model, and the derived DVFS. All five workloads enable much more power savings for the same performance than frequency scaling—and even than DVFS in many cases—by

addressing both the dynamic and static power of the L2. Again, we verified that the rest of the workloads also exhibit the effectiveness of L2 power gliding over frequency scaling, summarizing the results using the harmonic mean. We achieve these better power-performance trade-offs despite our simple, non-optimized use of drowsy mode and L2 way shrinking. The memory-intensive libquantum and bwaves see the largest chip power reduction (43% and 33%, respectively) from placing the L2 cache into drowsy mode. Because these workloads spend more than a third of the chip power in the static-power-dominated L2 in the baseline, as Figure 6-11 shows, they provide ample opportunity for static power reduction by implementing the drowsy mode. Although the rest of the workloads benefit from drowsy mode as well (Figure 6-12), the L2 contributes much less to their total power in the baseline, limiting chip power reduction to 15-24%. As mentioned in Section 6.1, the L3 consumes less power than the L2 in the baseline because the L3 is optimized for leakage and is less frequently accessed, while the L2 is optimized for performance.

The Level-[345] configurations, on the other hand, target both the static and dynamic power of the L2 by power-gating some of the associativity. However, performance responds differently across workloads due to differences in L2 utilization and miss rate sensitivity to reduced associativity.

FIGURE 6-12. Normalized total L2 power for Base and Level-1 through Level-5 on each workload.

The memory-intensive workload libquantum (bwaves) has a negligible performance loss of 0.1% (2%) while yielding up to 5% (5%) additional power savings, resulting in almost ideal power scaling. The other three workloads, which are more sensitive to the reduced L2 associativity, have more modest power-performance scaling curves, though they still outperform frequency scaling. Even namd, which has the smallest L2 power fraction, achieves 18% more power reduction for the same performance compared to frequency scaling.

The worsened L2 miss rate under Level-[345] power gliding inevitably increases the L3 utilization and consequently the L3 power, as Figure 6-11 shows. Despite essentially shifting some of the power from the L2 to the L3, the result is a profound impact on the core power. The longer memory request latencies leave the core idle more often, most notably resulting in 19% and 20% average power reduction in the execution and backend logic, respectively, for namd. The increased idleness in the core in turn provides an opportunity to scale down the aggressiveness of the core, similar to Case Study 1 in the last section; however, we leave further design explorations as future work.

TABLE 6-6. COBRi configuration parameters

Component                Scale     Configuration
L1-I                     1,2,4,8   No aggregation (Total: 32KB)
Frontend/Backend Width   1         2 wide
                         2,4,8     4 wide
Frontend Depth           1,2,4,8   7 cycles
Scheduling               1         1 EU (eight 16-entry instruction buffers per EU), WiDGET steering, single issue per EU
                         2         2 aggregated EUs, back-to-back bypass between the EUs, WiDGET steering, single issue per EU
                         4         4 aggregated EUs, back-to-back bypass between the EUs, WiDGET steering, single issue per EU
                         8         2 clusters of 4 EUs, back-to-back bypass within a cluster, 1-cycle link between clusters, WiDGET steering, single issue per EU
Execution Resources      1         1 IALU, 1 FPALU, 1 AGEN
                         2         2 IALU, 2 FPALU, 2 AGEN
                         4         4 IALU, 4 FPALU, 4 AGEN
                         8         8 IALU, 8 FPALU, 8 AGEN
Instruction Window       1         64 entries, unified
                         2         2 banks, 64 entries each
                         4,8       4 banks, 64 entries each
L1-D                     1,2,4,8   No aggregation (Total: 32KB)

6.4.3 Application of Power Gliding to COBRi

For completeness, this section evaluates L2 power gliding on a scalable core, which is our foundation for delivering power proportionality during more performance-centric modes. We use the COBRi design proposed in the previous chapter as an example scalable core. As described earlier, COBRi combines energy-efficient aspects of the borrowing-based and overprovisioning-based scalable core models to enhance the power-performance scalability in both directions (upward and downward). COBRi also addresses the power inefficiency of out-of-order (OoO) schedulers by integrating WiDGET's in-order Execution Units and instruction steering.

FIGURE 6-13. Harmonic mean power and performance: normalized chip power versus normalized performance for COBRi (COBRi1 through COBRi8) and COBRi + L2 Power Gliding (Level-0 through Level-5), with the ideal power proportionality curve for reference. (a) Normalized to the baseline; (b) normalized to COBRi8.

Given that the aim of power gliding is to scale down power even at the cost of performance, we apply power gliding to the fully scaled-down configuration of COBRi (i.e., COBRi1). Table 6-6 repeats the configuration parameters of COBRi1 used in Chapter 5 as a reference. Despite the scaled-down pipeline, COBRi1 still employs eight in-order instruction buffers for performance, enabling a limited form of OoO execution. Hence, we add a new power gliding level called Level-0 that is tailored to the COBRi design and reduces the instruction buffer count to one. Level-0 therefore converts COBRi1 into a single-issue in-order design. Building up from this, we apply the remaining L2 power gliding levels, Level-1 through Level-5, to the in-order design formed by Level-0.

Figure 6-13a plots the harmonic means of the resulting chip power and performance normalized to the baseline OoO machine. The figure also includes the power-performance points of COBRi from the previous chapter to show the scaling trend. From the COBRi1 point, Level-0 enables 10% more power savings just by reducing the instruction buffer count from eight (COBRi1) to one. However, Level-0 sacrifices more performance (18%) for that amount of power savings, indicating that simply converting from an OoO pipeline to an in-order pipeline does not produce cost-effective scaling, at least on the COBRi infrastructure. In contrast, the remaining five levels of L2 power gliding attack power-dominant performance optimizations and enable 25% power reduction with only 4% performance degradation compared to the Level-0 point. Similar to the L2 power gliding results when applied directly to the OoO baseline design (Section 6.4.2), placing the L2 into drowsy mode (Level-[12]) results in the largest power savings because of the significance of the L2 static power. A difference, however, is that making the L2 smaller (Level-[345]) has little effect on the in-order core on average. This behavior arises because the in-order core's lack of latency tolerance already incurs frequent memory stalls and long idle core periods in the baseline, and, therefore, many more workloads become insensitive to the poor L2 performance. The resulting power-performance characteristics in Figure 6-13a resemble those of the memory-intensive workloads—libquantum and bwaves—from the previous section.

the harmonic mean chip power and performance points of COBRi and power gliding in Figure 6-

13a using a different normalization point: the fully scaled-up configuration COBRi8. By using the

energy-efficient scalable core when performance is more critical and by switching to power gliding

when scaling down power toward zero, we collectively yield a power-performance curve that

closely follows the ideal power proportionality curve. In fact, only two of the points—by COBRi1

and Level-0—result in slightly less efficient scaling, while the rest of the power-performance

points lie in the lower right region of the ideal curve, indicating more efficient scaling. Although

our design scalability does not quite reach the ideal zero point, we nonetheless achieve a design 118 that scales from an aggressive OoO processor to a single-issue in-order processor that consumes

85% less power.

6.5 Summary

Frequency scaling is becoming increasingly important due to the shrinking voltage scaling range and the need for fine-grained power management within shared voltage planes. However, the limited linear dynamic power reduction of frequency scaling is likely to constrain flexible power budgeting.

We proposed a new concept called power gliding, which disables or regulates performance optimizations that meet the 3:1 power-to-performance ratio to yield a power-performance curve closer to DVFS than frequency scaling. Because power reduction is not constrained by maintaining peak performance in the context of frequency scaling, power gliding enables a new way to look at previously proposed low-power techniques that result in performance loss. Our two case studies examined the core frontend and the L2 cache, leveraging simplified speculation control, power-gating of structures designed for aggressive performance, and drowsy mode. We showed that power gliding enables more efficient power-performance scaling than frequency scaling across all workloads, even those that exhibit tolerance to frequency scaling. In particular, L2 power gliding resulted in up to 48% more power savings for the same performance compared to frequency scaling, and many of the power gliding data points even exceeded the power savings of DVFS.

Although we statically selected resources to disable across all workloads, dynamic detection that exploits workload and phase behaviors will certainly lead to better power-performance scaling. The power gliding concept opens opportunities to explore many different mechanisms for this power-constrained era.

Chapter 7

Conclusions

The increasing levels of power consumption by computer systems have necessitated innovations in every aspect of computer science, not just once but multiple times, in recent years. From the computer architecture perspective, the goal of building chips with ever faster clock speeds was replaced by a focus on more power- or energy-efficient designs, with less emphasis on performance. We argue that performance improvements are still possible as long as new designs maintain tight control over where and when to spend power. Power proportionality is one of the properties needed for those designs. Because technology scaling in nano-scale nodes limits the effectiveness of commonly used low-power circuit techniques, especially voltage scaling, this thesis examined microarchitectural mechanisms to realize power-proportional computing.

In this final chapter, we first present a summary of our work (Section 7.1). We then reflect on further work necessary to achieve system-wide power proportionality and discuss opportunities that lie ahead (Section 7.2).

7.1 Summary

We approached the goal of power-proportional processors through dynamic resource allocation. We aggregate resources to achieve higher performance at higher power, and selectively disable resources and/or performance optimizations to reduce performance with lower power. To deliver modest to high performance, we proposed a scalable core design called WiDGET. The novelty of WiDGET lies in harnessing low-power building blocks, namely distributed in-order buffers, and varying the number of active in-order buffers and functional units for power-efficient scalability. On a single chip, WiDGET scales from a low-power in-order core to a high-performance out-of-order (OoO) core and anywhere in between. Our mechanism to approximate OoO execution with the distributed in-order buffers resulted in a core that consumes 8% less power than a high-end Intel Xeon-like core while exceeding its performance by 26%, using the most aggressive configuration. Overall, WiDGET delivered a power range of 2.2 and a performance range of 3.8 by only scaling the execution resources.

We also conducted a study to identify the power-performance impacts of scaling other core resources beyond execution resources. Based on our core scaling taxonomy, we produced the insight that the fundamental difference among the prior scalable core proposals lies in where resources come from when scaling up. We developed two abstract scalable core models to understand the impacts on power and performance: 1) borrow entire core resources from neighboring cores; or 2) overprovision a few core-private resources. We found that increased latencies from inter-core communication outweigh the benefits of maintaining pipeline balance when scaled up.

Aggregating L1 caches resulted in poor power-performance scaling because of already low miss rates in the L1-Is and the increased load-to-use latency of the L1-Ds. On the other hand, when fully scaled down, smaller components in the borrowing-based cores facilitated better power-performance trade-offs than statically overprovisioning resources. We concluded this study by proposing a hybrid design that combines the desirable features of the two models, improving the scalability and energy efficiency of both.

Finally, we explored avenues for scaling down power and performance toward zero. We proposed dynamically disabling performance optimizations, trading performance for power reductions, and called this concept power gliding. The important implication of power gliding is that it allows performance degradation as long as the degradation is exceeded by the power savings. Hence, we leveraged previously proposed low-power techniques without complex policies or logic to obtain the best possible power savings. Our two case studies focused on the core frontend and the L2 cache using frontend stalling for speculation control, power-gating of structures designed for aggressive performance, and drowsy mode. Despite targeting only a portion of the core, each case study showed power reduction throughout the core, and resulted in even better power-performance scaling than DVFS in some cases.

With dynamic resource allocation, we achieved a processor design that dissipates power in proportion to work done across the entire performance spectrum. The lowest performance configuration consumes only 15% of the peak power, approaching the ideal power proportional curve.

The importance of approaching, or matching, the ideal power proportional curve has

increased in modern chip designs. In fact, concerns for chip power are becoming more urgent as

the utility of traditional power-management techniques becomes limited in future technology

nodes. We must design chips that flexibly adjust the capability and the number of active cores for

various working conditions, rather than only targeting peak performance at a maximum allowable

power.

7.2 Reflections

Within the framework of this thesis we have only taken the first steps toward achieving system-wide power proportionality. Due to the broad scope of this problem, we narrowed our focus to achieving power proportionality on a single thread, which is itself a challenging goal on many-core systems that are better tuned for throughput rather than single-thread performance. Using

our work as a foundation, one of the key directions for future study lies in dynamically balancing

TLP and ILP for a given power-performance target. An optimal balance would require identifying

and solving thread interference, adapting to different workload demands, desirable configurations

of each active core, and thermal hot spots. All of these elements would then have to be handled

dynamically to provide an adaptable system for emerging versatile workloads.

Although processors are one of the major contributors to total system power, system-wide

power proportionality cannot be achieved without addressing every non-negligible power source

in the system, including the memory system, interconnect network, and disks. Many researchers

have investigated techniques to make these components more power proportional. A coherent and

coordinated approach across these disparate research efforts will be the only means of providing a

smooth transition across the power-performance curve.

On a more personal note, I feel I devoted my graduate school years to a very important and challenging problem. The Power Wall makes it difficult to keep up with previous years' substantial performance improvements, whether measured by single-thread performance, system throughput, or performance per watt. Large data centers and providers of cloud systems are continuously searching for methods to reduce power cost. My proposals in this thesis will not, in isolation, be solutions for these challenges. Instead, they provide a solid foundation for significant additional research and development that may, in combination with the ideas and research of many others, help the computing field overcome the Power Wall limitations. The knowledge and experience I have gained as a graduate student, formulating these proposals, have also made me better prepared to make a meaningful contribution to the computing industry in coming years.

References

[1] A. Al-Nayeem, M. Sun, X. Qiu, L. Sha, S. P. Miller, and D. D. Cofer. A Formal Architecture Pattern for Real-Time Distributed Systems. In 2009 Real-Time Systems Symposium, pages 161–170.

[2] A. R. Alameldeen, C. J. Mauer, M. Xu, P. J. Harper, M. M. K. Martin, D. J. Sorin, M. D. Hill, and D. A. Wood. Evaluating Non-deterministic Multi-threaded Commercial Workloads. In Proc. of the 5th Workshop on Computer Architecture Evaluation Using Commercial Workloads, pages 30–38, Feb. 2002.

[3] D. Albonesi, R. Balasubramonian, S. Dropsho, S. Dwarkadas, E. Friedman, M. Huang, V. Kursun, G. Magklis, M. Scott, G. Semeraro, P. Bose, A. Buyuktosunoglu, P. Cook, and S. Schuster. Dynamically tuning processor resources with adaptive processing. IEEE Computer, 36(12):49–58, Dec. 2003.

[4] D. H. Albonesi. Selective cache ways: on-demand cache resource allocation. In Proc. of the 32nd Annual IEEE/ACM International Symp. on Microarchitecture, pages 248–259, Nov. 1999.

[5] G. M. Amdahl. Validity of the Single-Processor Approach to Achieving Large Scale Computing Capabilities. In AFIPS Conference Proceedings, pages 483–485, Apr. 1967.

[6] O. Azizi, A. Mahesri, B. C. Lee, S. J. Patel, and M. Horowitz. Energy-performance tradeoffs in pro- cessor architecture and circuit design: a marginal cost analysis. In Proc. of the 37th Annual Intnl. Symp. on Computer Architecture, June 2010.

[7] R. Balasubramonian, D. Albonesi, A. Buyuktosunoglu, and S. Dwarkadas. Memory hierarchy reconfiguration for energy and performance in general-purpose processor architectures. In Proc. of the 33rd Annual IEEE/ACM International Symp. on Microarchitecture, pages 245–257, Dec. 2000.

[8] R. Balasubramonian, S. Dwarkadas, and D. H. Albonesi. Dynamically managing the communica- tion-parallelism trade-off in future clustered processors. In Proc. of the 30th Annual Intnl. Symp. on Computer Architecture, June 2003.

[9] A. Baniasadi and A. Moshovos. Instruction distribution heuristics for quad-cluster, dynamically- scheduled, superscalar processors. In Proc. of the 27th Annual Intnl. Symp. on Computer Architec- ture, June 2000.

[10] L. A. Barroso and U. Hölzle. The Case for Energy-Proportional Computing. IEEE Computer, 40(12), 2007.

[11] L. Benini, P. Siegel, and G. D. Micheli. Automatic Synthesis of Gated Clocks for Power Reduction in Sequential Circuits. IEEE Design and Test of Computers, 1994.

[12] S. Borkar. The Exascale Challenge. In International Symposium on VLSI Design Automation and Test, pages 2–3, 2010.

[13] D. Brooks and M. Martonosi. Dynamically exploiting narrow width operands to improve processor power and performance. In Proc. of the 5th IEEE Symp. on High-Performance Computer Architecture, pages 13–22, Jan. 1999.

[14] D. Brooks and M. Martonosi. Dynamic Thermal Management for High-Performance Microproces- sors. In Proceedings of the 7th IEEE Symposium on High-Performance Computer Architecture, Jan. 2001.

[15] J. Burns and J.-L. Gaudiot. Area and System Clock Effects on SMT/CMP Processors. In Proc. of the Intnl. Conf. on Parallel Architectures and Compilation Techniques, Sept. 2001.

[16] A. Buyuktosunoglu, D. Albonesi, S. Schuster, D. Brooks, P. Bose, and P. Cook. A circuit level imple- mentation of an adaptive issue queue for power-aware microprocessors. In Great Lakes Symposium on VLSI Design, pages 73–78, 2001.

[17] R. Canal, A. Gonzalez, and J. E. Smith. Very Low Power Pipelines Using Significance Compression. In Proc. of the 33rd Annual IEEE/ACM International Symp. on Microarchitecture, pages 181–190, Dec. 2000.

[18] R. Canal, J.-M. Parcerisa, and A. Gonzalez. A Cost-Effective Clustered Architecture. In Proc. of the Intnl. Conf. on Parallel Architectures and Compilation Techniques, Oct. 1999.

[19] A. P. Chandrakasan, D. C. Daly, D. F. Finchelstein, J. Kwong, Y. K. Ramadass, M. E. Sinangil, V. Sze, and N. Verma. Technologies for Ultradynamic Voltage Scaling. Proceedings of the IEEE, 98(2):191– 214, Feb. 2010.

[20] A. P. Chandrakasan, S. Sheng, and R. W. Brodersen. Low-Power CMOS Digital Design. IEEE Jour- nal of Solid-State Circuits, 27(4):473–484, April 1992.

[21] R. G. Dreslinski, M. Wieckowski, D. Blaauw, D. Sylvester, and T. Mudge. Near-Threshold Comput- ing: Reclaiming Moore’s Law Through Energy Efficient Integrated Circuits. Proceedings of the IEEE, 98(2):253–266, Feb. 2010.

[22] S. Dropsho, A. Buyuktosunoglu, R. Balasubramonian, D. H. Albonesi, S. Dwarkadas, G. Semeraro, G. Magklis, and M. L. Scott. Integrating Adaptive On-Chip Storage Structures for Reduced Dynamic Power. In Proc. of the Intnl. Conf. on Parallel Architectures and Compilation Techniques, pages 141–152, Sept. 2002.

[23] K. Flautner, N. S. Kim, S. Martin, D. Blaauw, and T. Mudge. Drowsy caches: simple techniques for reducing leakage power. In Proc. of the 29th Annual Intnl. Symp. on Computer Architecture, May 2002.

[24] M. S. Floyd, S. Ghiasi, T. W. Keller, K. Rajamani, F. L. Rawson, J. C. Rubio, and M. S. Ware. System power management support in the IBM POWER6 microprocessor. IBM Journal of Research and Development, 51(6), 2007.

[25] International Technology Roadmap for Semiconductors. ITRS 2010 Update. Semiconductor Industry Association, 2010. www.itrs.net/links/2010itrs/home2010.htm.

[26] A. S. Ganapathi, Y. Chen, A. Fox, R. H. Katz, and D. A. Patterson. Statistics-driven workload modeling for the Cloud. In 2010 IEEE 26th International Conference on Data Engineering Workshops, 2010.

[27] G. Gerosa, S. Curtis, M. D'Addeo, B. Jiang, B. Kuttanna, F. Merchant, B. Patel, M. Taufique, and H. Samarchi. A Sub-2 W Low Power IA Processor for Mobile Internet Devices in 45 nm High-k Metal Gate CMOS. IEEE Journal of Solid-State Circuits, 44(1):73–82, 2009.

[28] D. Gibson and D. A. Wood. Forwardflow: A Scalable Core for Power-Constrained CMPs. In Proc. of the 37th Annual Intnl. Symp. on Computer Architecture, June 2010.

[29] J. González and A. González. Dynamic Cluster Resizing. In Proceedings of the 21st International Conference on Computer Design, 2003.

[30] D. Gove. CPU2006 Working Set Size. Computer Architecture News, 35(1):90–96, 2007.

[31] L. Hammond, B. Hubbert, M. Siu, M. Prabhu, M. Chen, and K. Olukotun. The Stanford Hydra CMP. IEEE Micro, 20(2):71–84, March-April 2000.

[32] H. Hanson, S. W. Keckler, S. Ghiasi, K. Rajamani, F. Rawson, and J. Rubio. Thermal response to DVFS: analysis with an Intel Pentium M. In Proceedings of the 2007 international symposium on Low power electronics and design, pages 219–224, New York, NY, USA, 2007. ACM.

[33] A. Hartstein and T. R. Puzak. Optimum Power/Performance Pipeline Depth. In Proc. of the 36th Annual IEEE/ACM International Symp. on Microarchitecture, Dec. 2003.

[34] J. L. Henning. SPEC CPU2006 Benchmark Descriptions. Computer Architecture News, 34(4):1–17, 2006.

[35] M. D. Hill and M. R. Marty. Amdahl’s Law in the Multicore Era. IEEE Computer, pages 33–38, July 2008.

[36] M. Horowitz, E. Alon, S. Naffziger, R. Kumar, and K. Bernstein. Scaling, power, and the future of CMOS. In IEEE International Electron Devices Meeting, 2005., Dec. 2005.

[37] Z. Hu, A. Buyuktosunoglu, V. Srinivasan, V. Zyuban, H. Jacobson, and P. Bose. Microarchitectural techniques for power gating of execution units. In International Symposium on Low Power Electron- ics and Design, pages 32–37, Aug. 2004.

[38] W.-M. W. Hwu and Y. N. Patt. Checkpoint Repair for High-Performance Out-of-Order Execution Machines. IEEE Transactions on Computers, 36(12), Dec. 1987.

[39] Intel. Intel and Core i7 (Nehalem) Dynamic Power Management, 2008.

[40] E. Ipek, M. Kirman, N. Kirman, and J. F. Martinez. Core Fusion: Accomodating Software Diversity in Chip Multiprocessors. In Proc. of the 34th Annual Intnl. Symp. on Computer Architecture, June 2007.

[41] J. Tschanz, S. Narendra, Y. Ye, B. Bloechel, S. Borkar, and V. De. Dynamic-sleep transistor and body bias for active leakage power control of microprocessors. In Proceedings of the IEEE 2003 International Solid-State Circuits Conference, February 2003.

[42] A. Jain and et al. A 1.2GHz Alpha Microprocessor with 44.8GB/s Chip Pin Bandwidth. In Proceed- ings of the IEEE 2001 International Solid-State Circuits Conference, pages 240–241, 2001.

[43] R. Kalla, B. Sinharoy, W. J. Starke, and M. Floyd. Power7: IBM’s Next-Generation Server Processor. IEEE Micro, 30:7–15, 2010.

[44] S. Keckler, D. Burger, K. Sankaralingam, R. Nagarajan, R. McDonald, R. Desikan, S. Drolia, M. Govindan, P. Gratz, D. Gulati, H. H. amd C. Kim, H. Liu, N. Ranganathan, S. Sethumadhavan, S. Sharif, and P. Shivakumar. Architecture and Implementation of the TRIPS Processor. CRC Press, 2007.

[45] R. E. Kessler. The Alpha 21264 Microprocessor. IEEE Micro, 19(2):24–36, March/April 1999.

[46] C. Kim, S. Sethumadhavan, M. S. Govindan, N. Ranganathan, D. Gulati, D. Burger, and S. W. Keckler. Composable Lightweight Processors. In Proc. of the 40th Annual IEEE/ACM International Symp. on Microarchitecture, Dec. 2007.

[47] H. S. Kim and J. E. Smith. An instruction set and microarchitecture for instruction level distributed processing. In Proc. of the 29th Annual Intnl. Symp. on Computer Architecture, May 2002.

[48] R. Kumar, D. Tullsen, P. Ranganathan, N. Jouppi, and K. Farkas. Single-ISA Heterogeneous Multi- core Architectures for Multithreaded Workload Performance. In Proc. of the 31st Annual Intnl. Symp. on Computer Architecture, pages 64–75, June 2004.

[49] G. Magklis, G. Semeraro, D. H. Albonesi, S. G. Dropsho, S. Dwarkadas, and M. L. Scott. Dynamic Frequency and Voltage Scaling for a Multiple-Clock-Domain Microprocessor. IEEE Micro, 23(6):62–68, Nov/Dec 2003.

[50] P. S. Magnusson et al. Simics: A Full System Simulation Platform. IEEE Computer, 35(2):50–58, Feb. 2002.

[51] S. Manne, A. Klauser, and D. Grunwald. Pipeline Gating: Speculation Control for Energy Reduc- tion. In Proc. of the 25th Annual Intnl. Symp. on Computer Architecture, pages 132–141, June 1998.

[52] M. M. K. Martin, D. J. Sorin, B. M. Beckmann, M. R. Marty, M. Xu, A. R. Alameldeen, K. E. Moore, M. D. Hill, and D. A. Wood. Multifacet’s General Execution-driven Multiprocessor Simulator (GEMS) Toolset. Computer Architecture News, pages 92–99, Sept. 2005.

[53] D. Meisner, B. T. Gold, and T. F. Wenisch. PowerNap: Eliminating Server Idle Power. In Proc. of the 14th Intnl. Conf. on Architectural Support for Programming Languages and Operating Systems, Mar. 2009.

[54] M. Miyazaki, J. Kao, and A. Chandrakasan. A 175mW Multiply-Accumulate Unit Using Adaptive Supply Voltage and Body Bias (ASB) Architecture. In Proceedings of the IEEE 2002 International Solid-State Circuits Conference, pages 58–59, February 2002.

[55] A. Moshovos. Checkpointing alternatives for high performance, power-aware processors. 2003.

[56] S. Palacharla and J. E. Smith. Complexity-Effective Superscalar Processors. In Proc. of the 24th Annual Intnl. Symp. on Computer Architecture, pages 206–218, June 1997.

[57] S. Petit, J. Sahuquillo, J. M. Such, and D. Kaeli. Exploiting temporal locality in drowsy cache policies. In Proceedings of the 2nd conference on Computing Frontiers, 2005.

[58] K. K. Rangan, G.-Y. Wei, and D. Brooks. Thread Motion: Fine-Grained Power Management for Multi-Core Systems. In Proc. of the 36th Annual Intnl. Symp. on Computer Architecture, June 2009.

[59] J. Renau, K. Strauss, L. Ceze, W. Liu, S. Sarangi, J. Tuck, and J. Torrellas. Energy-Efficient Thread- Level Speculation on a CMP. IEEE Micro, 26(1), Jan/Feb 2006.

[60] A. Roth and G. S. Sohi. Register Integration: A Simple and Efficient Implementation of Squash Reuse. In Proc. of the 33rd Annual IEEE/ACM International Symp. on Microarchitecture, pages 223– 234, Dec. 2000.

[61] P. Salverda and C. Zilles. Fundamental performance constraints in horizontal fusion of in-order cores. In Proc. of the 14th IEEE Symp. on High-Performance Computer Architecture, pages 252–263, Feb. 2008.

[62] A. Seznec and P. Michaud. A case for (partially) TAgged GEometric history length branch predic- tion. Journal of Instruction Level Parallelism, Feb. 2006.

[63] T. Sha, M. M. K. Martin, and A. Roth. NoSQ: Store-Load Communication without a Store Queue. In Proc. of the 39th Annual IEEE/ACM International Symp. on Microarchitecture, pages 285–296, Dec. 2006.

[64] T. Shyamkumar, N. Muralimanohar, J. H. Ahn, and N. P. Jouppi. CACTI 5.1. Technical Report HPL-2008-20, Hewlett Packard Labs, 2008.

[65] R. Singhal. Inside Intel Next Generation Nehalem Microarchitecture. 2008.

[66] J. E. Smith. Decoupled Access/Execute Computer Architecture. In Proc. of the 9th Annual Symp. on Computer Architecture, pages 112–119, Apr. 1982.

[67] S. Mutoh, T. Douseki, Y. Matsuya, T. Aoki, S. Shigematsu, and J. Yamada. 1-V Power Supply High-Speed Digital Circuit Technology with Multithreshold-Voltage CMOS. IEEE Journal of Solid-State Circuits, 30(8):847–854, 1995.

[68] G. Sohi, S. Breach, and T. Vijaykumar. Multiscalar Processors. In Proc. of the 22nd Annual Intnl. Symp. on Computer Architecture, pages 414–425, June 1995.

[69] J. G. Steffan, C. B. Colohan, A. Zhai, and T. C. Mowry. A Scalable Approach to Thread-Level Speculation. In Proc. of the 27th Annual Intnl. Symp. on Computer Architecture, June 2000.

[70] S. Tam, S. Rusu, J. Chang, S. Vora, B. Cherkauer, and D. Ayers. A 65nm 95W Dual-Core Multi- Threaded Xeon Processor with L3 Cache. In Proc. of the 2006 IEEE Asian Solid-State Circuits Con- ference, Nov. 2006.

[71] S. Thompson, P. Packan, and M. Bohr. MOS Scaling: Transistor Challenges for the 21st Century, 1998.

[72] T. Kuroda, T. Fujita, S. Mita, T. Nagamatsu, S. Yoshioka, K. Suzuki, F. Sano, M. Norishima, M. Murota, M. Kako, M. Kinugawa, M. Kakumu, and T. Sakurai. A 0.9-V, 150-MHz, 10-mW, 4mm2, 2-D Discrete Cosine Transform Core Processor with Variable Threshold-Voltage (VT) Scheme. IEEE Journal of Solid-State Circuits, 31(11):1770–1779, November 1996.

[73] F. Tseng and Y. N. Patt. Achieving Out-of-Order Performance with Almost In-Order Complexity. In Proc. of the 35th Annual Intnl. Symp. on Computer Architecture, June 2008.

[74] Stanford University. CPU DB, 2011. http://cpudb.stanford.edu/.

[75] Y. Watanabe, J. D. Davis, and D. A. Wood. WiDGET: Wisconsin Decoupled Grid Execution Tiles. In Proc. of the 37th Annual Intnl. Symp. on Computer Architecture, June 2010.

[76] K. C. Yeager. The MIPS R10000 Superscalar Microprocessor. IEEE Micro, 16(2):28–40, Apr. 1996.

[77] B. Zhai, D. Blaauw, D. Sylvester, and K. Flautner. Theoretical and Practical Limits of Dynamic Voltage Scaling. In Proc. of the 41st Annual Design Automation Conference, pages 868–873, June 2004.

Appendix A

Supplements for Instruction Steering Cost Model (Chapter 2)

Figure A-1 demonstrates the importance of accounting for communication delays in instruction steering cost models. Figure A-1(a) plots the harmonic means of IPC speedup for the SPEC CPU2006 benchmark suite while varying the communication delay from zero to four cycles. The x-axis is the number of employed in-order execution units (EUs), and each speedup is based on the four-EU configuration with the same delay. The idealized communication (Perfect) enables 26% speedup when the EU count increases from four to eight, whereas the speedup drops to 19% under four-cycle delays. Thus, as one would expect, performance gains from more EUs degrade as communication becomes more expensive.

FIGURE A-1. Performance sensitivity under realistic communication delays. (a) Unclustered EUs: EU count impact on performance; (b) clustered EUs: cluster size impact on performance.

However, assuming delays between every EU is rather pessimistic. A more realistic design will

cluster a few EUs with no intra-cluster delay, while imposing inter-cluster delays. Figure A-1(b) plots the performance implications of cluster size. It fixes the total EU count to eight and assumes a 1-cycle delay per inter-cluster hop. The speedups are normalized to an unclustered design, in which inter-EU communication takes one additional cycle. By assigning two EUs per cluster, performance increases 64%. A cluster size of four further improves the speedup by another 4%, but the speedup gain becomes negligible beyond that point. Despite the similar performance of the 2- and 4-EU clusters, the WiDGET design presented in Chapter 4 employs the latter for more scalable power proportionality.

Appendix B

Supplements for Simulation Tools (Chapter 3)

Table B-1 summarizes the parameter space of the simulators we used. As Chapter 3 discussed, we parametrize most of the microarchitectural details for high fidelity. Though GEMS's Ruby provides many configuration options, we opt to make a few idealized assumptions for simulation speed.

TABLE B-1. Simulation parameter space

Type              Parameter                                                   Parameterized?   Realistic Parameter Value?
System scope      Chip count                                                  N                -
                  Core count per chip                                         Y                Y
Core scope        Processor model                                             Y                Y
                  Pipeline depth                                              Y                Y
                  Pipeline width                                              Y                Y
Fetch             Fetch width                                                 Y                Y
                  Fetch buffer size                                           Y                Y
                  Instruction prefetch                                        Y                Y
                  Branch predictor models                                     Y                Y
                  Branch predictor table sizes                                Y                Y
                  Branch target buffer sizes/associativity                    Y                Y
                  Return address stack size                                   Y                Y
Decode            Architectural registers                                     N                -
                  Physical registers                                          Y                Y
                  Register file ports                                         Y                Y
                  Rename maps                                                 Y                Y
Execution         Functional unit count                                       Y                Y
                  Specialized ALUs (e.g., shifter, floating-point divider)    N                -
                  Variable functional unit latency                            N                -
                  Instruction queue/buffer organization                       Y                Y
                  Instruction queue/buffer count                              Y                Y
                  Instruction queue/buffer size                               Y                Y
                  Instruction queue/buffer ports                              Y                Y
                  Instruction wakeup/select latency                           Y                Y
                  Instruction steering heuristics                             Y                Y
                  Memory disambiguation models                                Y                Y
                  Load-store queue size/ports                                 Y                Y
                  Disambiguation predictor table sizes                        Y                Y
                  MSHRs                                                       Y                Y
                  Hardware data prefetch                                      N                -
                  TLB latency                                                 N                Y
                  TLB size/associativity                                      N                -
Operand network   Operand network models                                      Y                Y
                  Network latency                                             Y                Y
                  Network bandwidth                                           Y                N
Commit            Commit logic size                                           Y                Y
                  Commit logic ports                                          Y                Y
Multithreading    Thread count per core                                       Y                Y
                  Thread scheduling policy                                    Y                Y
                  Resource sharing discipline                                 Y/N^a            Y
Memory system     Cache size, associativity, ports, latency                   Y                Y
                  Cache line size                                             Y                Y
                  Cache line eviction policies                                N                -
                  L1-I/D organization                                         Y                Y
                  L2/L3 organization (e.g., shared/private)                   N                -
                  Cache coherence protocols                                   Y                Y
                  On-chip interconnect models                                 Y                N
                  Memory controller latency                                   Y                Y
                  Memory controller contention                                Y                N
                  DRAM latency                                                Y                Y
Technology        Technology node                                             Y                Y
                  Die area                                                    Y                Y
                  Clock frequency                                             Y                Y
Power             Clock gating option                                         N                -
                  Clock gating reactivation delay                             N                -
                  Power gating option                                         N                -
                  Power gating reactivation delay                             N                -
                  Hardware structure size/ports                               Y                Y
                  Logic                                                       Y                Y
                  Datapath width                                              Y                Y
                  Wire delay                                                  Y                Y

a. Not all hardware resources (e.g., L1 caches) allow configurable sharing disciplines.

Appendix C

Supplements for WiDGET's Instruction Steering Heuristic (Chapter 4)

The steering heuristic provided in Figure 4-6 is best explained with an example, illustrated in Figure C-1. Suppose eight EUs spanning two clusters are dedicated to this instruction engine. Further assume all operands are initially available in the register file and all EUs are empty. Figure C-1(a) shows a dataflow graph of instructions, with each node denoting an instruction sequence number and the destination register in parentheses.

In the first cycle, instructions i1 through i4 are steered. Since i1 has no data dependencies, it is steered to the empty EU 0 (line 2 in Figure 4-6). It marks the steered EU ID in the LPT entry for the destination register r1, leaving the consumer field unchanged. It also resets the empty bit vector for EU 0. Conversely, i2 depends on i1. An access to the LPT entry for r1 reveals that the producer of r1, namely i1, was steered to EU 0 and that no other instructions have followed i1 yet. Hence, i2 is steered to the producer EU 0 (line 5). It updates the corresponding LPT entry as well as i1's to prevent other consumers of i1 from steering to EU 0. i3 begins a new independent chain. It selects the empty EU 4 in Cluster 1 to balance the load (line 2). Both the LPT entry for r3 and the empty bit vector are updated accordingly. i4 is analogous to the case of i2, following the producer EU 4 (line 5). As both of the head instructions in EUs 0 and 4 are ready, they execute in their EUs. Figure C-1(b) shows the steering result at the end of cycle 1.

FIGURE C-1. Instruction steering example: (a) dataflow graph, (b) state at the end of cycle 1, (c) state at the end of cycle 2

In the second cycle, i5 through i8 are steered. i5 is sent to the producer EU 0 (line 5), setting the consumer field in r2's LPT entry. i6 also depends on i2, yet an LPT lookup indicates that the slot immediately succeeding r2's producer has already been claimed. i6 therefore finds the empty EU 2 in the same Cluster 0 by accessing the empty bit vector (line 7). Note that i2 will forward its result to EU 2 at the end of execution, enabling both i5 and i6 to execute in parallel. i7 depends on both i5 and i4, which are in EUs 0 and 4, respectively. Although both producers have empty slots behind them, i7 selects the producer EU 0 of its first source operand r5 (line 10). Finally, i8 is sent to the producer EU 2 (line 5). Figure C-1(c) displays the final state at the end of cycle 2.
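For readers who prefer code, the following is a minimal sketch of the steering decision walked through above. It paraphrases the heuristic of Figure 4-6 under simplifying assumptions (one LPT entry per architectural register, an empty bit vector over eight EUs in two clusters, and first-empty selection within a cluster); the structure and names are illustrative, not the simulator's actual implementation, and the line numbers in the comments refer to Figure 4-6.

#include <array>
#include <bitset>
#include <optional>
#include <vector>

constexpr int kNumEUs = 8;
constexpr int kEUsPerCluster = 4;       // two clusters of four EUs, as in this example

// One Last Producer Table (LPT) entry per architectural register: which EU holds
// the most recent producer, and whether a consumer already claimed the slot behind it.
struct LPTEntry {
    std::optional<int> producerEU;      // empty if the value is only in the register file
    bool consumerClaimed = false;
};

struct SteeringState {
    std::array<LPTEntry, 64> lpt{};     // register count chosen arbitrarily for the sketch
    std::bitset<kNumEUs> emptyEU{0xFF}; // bit set => EU currently empty
};

// Pick an empty EU. If a preferred cluster is given, look there first; otherwise
// choose from the cluster with the most empty EUs to balance load.
int pickEmptyEU(const SteeringState& s, int preferredCluster = -1) {
    auto firstEmptyIn = [&](int cluster) -> int {
        for (int eu = cluster * kEUsPerCluster; eu < (cluster + 1) * kEUsPerCluster; ++eu)
            if (s.emptyEU[eu]) return eu;
        return -1;
    };
    if (preferredCluster >= 0) {
        int eu = firstEmptyIn(preferredCluster);
        if (eu >= 0) return eu;
    }
    int best = -1, bestFree = -1;
    for (int c = 0; c < kNumEUs / kEUsPerCluster; ++c) {
        int freeCount = 0;
        for (int eu = c * kEUsPerCluster; eu < (c + 1) * kEUsPerCluster; ++eu)
            if (s.emptyEU[eu]) ++freeCount;
        if (freeCount > bestFree) { bestFree = freeCount; best = firstEmptyIn(c); }
    }
    return best;                        // -1 means no empty EU (a real design would stall)
}

// Steer one instruction given its source and destination registers.
// The "line" comments refer to the heuristic in Figure 4-6.
int steer(SteeringState& s, const std::vector<int>& srcRegs, int dstReg) {
    int target = -1;

    // Lines 5/10: follow the first source operand whose producer still has a free slot.
    for (int r : srcRegs) {
        LPTEntry& e = s.lpt[r];
        if (e.producerEU && !e.consumerClaimed) {
            target = *e.producerEU;
            e.consumerClaimed = true;
            break;
        }
    }
    // Line 7: a producer exists but its slot is taken; pick an empty EU in its cluster.
    if (target < 0) {
        for (int r : srcRegs) {
            if (s.lpt[r].producerEU) {
                target = pickEmptyEU(s, *s.lpt[r].producerEU / kEUsPerCluster);
                break;
            }
        }
    }
    // Line 2: no pending producer at all; pick any empty EU, balancing the clusters.
    if (target < 0)
        target = pickEmptyEU(s);
    if (target < 0)
        return -1;                      // all EUs full: the front end would stall here

    s.emptyEU[target] = false;
    s.lpt[dstReg] = LPTEntry{target, false};   // this instruction becomes the new producer
    return target;
}

Note that within a cluster this sketch simply takes the first empty EU it finds, so the specific EU indices it produces need not match Figure C-1 exactly; the steering decisions (follow a free producer, stay in the producer's cluster, or start a fresh chain on an empty EU) are the point of the example.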

A naive implementation of steering is serial, since it is constrained by the serial dependencies among the group of instructions being steered. However, we utilize a parallel prefix computation. The parallel dependence check performed in renaming is employed to detect dependencies within the same steering group [60]. Concurrently, each instruction accesses the LPT and the bit vectors to choose a candidate EU. The candidate EUs are then compared and, if necessary, modified to reflect the intra-group dependencies, after which the LPT and the bit vectors are updated accordingly.

Appendix D

Supplements for Per-EU Instruction Buffer Limit Study (Chapter 5)

Chapters 4 and 5 observed that distributed buffering is key to enabling a large instruction window without prohibitive power impacts. The performance benefit is amplified for in-order buffers, which have less latency tolerance than out-of-order (OoO) buffers. Although a higher instruction buffer count generally allows more look-ahead and higher performance, it also demands more power and area and increases the design complexity. This appendix conducts a limit study of this trade-off in the context of COBRi from Chapter 5. We vary the number of in-order instruction buffers in each execution unit (EU) as well as the total EU count dedicated to a single thread. All results are normalized to the 4-way OoO core described in Table 5-3, which also lists the other configuration parameters of COBRi.

As Figure D-1 plots, a higher per-EU buffer count (shown in the legend) leads, as expected, to higher IPC. With just one EU and four or more instruction buffers, the performance improvement is negligible because the bottleneck becomes the issue bandwidth, which is one. On the other hand, configurations with more than one EU are more sensitive to the buffer count. The eight-EU configurations, which yield the highest IPCs for a given per-EU buffer count, boost performance by 20% from two to four buffers, 8% from four to six buffers, and 4% from six to eight buffers, but the IPC gains become negligible (less than 1%) beyond that point.

FIGURE D-1. IPC sensitivity

FIGURE D-2. Chip power sensitivity

Figure D-2 shows the impact on chip power for the same configuration points. Given a fixed EU count, the power increase from additional buffers is more moderate than the performance gains, indicating the power efficiency of the in-order buffers. Although additional buffers in an EU increase the number of inputs to the issue selection logic, each buffer contains only 16 entries (roughly 0.1×10⁻² mm² in 45 nm). Rather, the chip power is more sensitive to the number of allocated EUs, which provide not only instruction buffers but also functional units and thus have first-order impacts on the issue rate and operand network traffic. With eight EUs and eight or more buffers per EU, the power increase becomes negligible (less than 0.7%), mirroring the performance trend, because most benchmarks do not have enough independent instructions to fully utilize the buffers.

FIGURE D-3. ED sensitivity

FIGURE D-4. ED² sensitivity

We use two metrics to evaluate the impact on energy efficiency: the energy-delay (ED) product (Figure D-3) and ED² (Figure D-4). Again, the x-axes show the active EU count, while the number of buffers per EU is shown in the legends.
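For reference, a minimal sketch of how these two metrics could be computed from a run's energy and delay follows; the function and variable names are illustrative, and, as noted in Appendix E, our absolute energy numbers are meaningful only for relative comparisons, so the metrics are normalized against a baseline run.

// Energy-efficiency metrics used in Figures D-3 and D-4 (lower is better).
struct EfficiencyMetrics {
    double ed;    // energy-delay product: E * D
    double ed2;   // energy-delay-squared product: E * D^2
};

EfficiencyMetrics computeMetrics(double energy, double delay) {
    return { energy * delay, energy * delay * delay };
}

// Normalization against a baseline run, as done for the figures in this appendix.
double normalizedED2(double energy, double delay,
                     double baseEnergy, double baseDelay) {
    return computeMetrics(energy, delay).ed2 /
           computeMetrics(baseEnergy, baseDelay).ed2;
}

ED weights energy and delay equally, whereas ED² emphasizes delay, so a configuration that trades a small performance loss for a large power saving looks relatively better under ED than under ED².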

Both ED and ED² display little sensitivity to the instruction buffer count beyond four. Nonetheless, given the power-performance saturation points of the eight-EU configurations, eight instruction buffers per EU are optimal for the design space we evaluated.

Appendix E

Tables of Baseline Values

This appendix provides baseline values used to produce the normalized figures in the dissertation. We omit power and energy values because our power estimation is meaningful only for relative, not absolute, comparisons.

The following table lists the IPCs of Neon used in Figure 4-8.

TABLE E-1. Baseline values for Figure 4-8

Workload       IPC     Workload       IPC
perlbench      1.02    zeusmp         1.35
bzip2          1.06    gromacs        1.49
gcc            0.98    cactusADM      1.83
mcf            0.17    leslie3D       0.90
gobmk          1.30    namd           2.12
hmmer          2.21    dealII         1.58
sjeng          1.43    soplex         1.59
libquantum     0.26    povray         1.83
h264           1.59    calculix       2.32
omnetpp        0.88    GemsFDTD       0.67
astar          0.95    tonto          1.66
xalancbmk      0.44    lbm            0.39
INT HMean      0.61    wrf            1.83
bwaves         0.65    sphinx3        0.39
gamess         2.18    FP HMean       0.97
milc           0.48
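The summary rows in Tables E-1 through E-9 (HMean, GMean, Avg) are, as the names indicate, harmonic, geometric, and arithmetic means over the workload group named in the row label (INT, FP, or the commercial workloads). A minimal sketch of the three aggregations, with illustrative names:

#include <cmath>
#include <vector>

// Aggregations behind the HMean, GMean, and Avg rows in Tables E-1 through E-9.
double harmonicMean(const std::vector<double>& v) {
    double sumInv = 0.0;
    for (double x : v) sumInv += 1.0 / x;         // assumes all values are non-zero
    return static_cast<double>(v.size()) / sumInv;
}

double geometricMean(const std::vector<double>& v) {
    double sumLog = 0.0;
    for (double x : v) sumLog += std::log(x);     // assumes all values are positive
    return std::exp(sumLog / static_cast<double>(v.size()));
}

double arithmeticMean(const std::vector<double>& v) {
    double sum = 0.0;
    for (double x : v) sum += x;
    return sum / static_cast<double>(v.size());
}

The harmonic mean is the conventional aggregate for rates such as IPC, which is presumably why the IPC and misprediction tables report HMean rows while the cycle-count tables report GMean.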

The following table lists the IPCs of the baseline out-of-order (OoO) model used in Figure 5-2.

TABLE E-2. Baseline values for Figure 5-2

Workload       IPC     Workload       IPC
perlbench      0.61    bwaves         0.39
bzip2          1.73    gamess         2.04
gcc            1.04    milc           0.17
mcf            0.18    zeusmp         0.93
gobmk          1.22    gromacs        1.12
hmmer          2.16    cactusADM      1.27
sjeng          1.31    leslie3D       0.97
libquantum     0.19    namd           2.01
h264           0.98    dealII         1.32
omnetpp        1.18    soplex         1.00
astar          1.06    povray         1.53
xalancbmk      0.67    calculix       1.85
INT HMean      0.59    GemsFDTD       0.51
apache         0.38    tonto          1.52
jbb            0.90    lbm            0.33
oltp           0.73    wrf            1.64
zeus           0.38    sphinx3        0.37
Com HMean      0.52    FP HMean       0.68

The following table lists the run-time in cycles of the baseline OoO model used in Figure 5-7.

TABLE E-3. Baseline values for Figure 5-7

Workload       Cycles (Billion)    Workload       Cycles (Billion)
perlbench      0.16                bwaves         0.25
bzip2          0.06                gamess         0.05
gcc            0.09                milc           0.56
mcf            0.55                zeusmp         0.11
gobmk          0.08                gromacs        0.09
hmmer          0.05                cactusADM      0.08
sjeng          0.07                leslie3D       0.10
libquantum     0.52                namd           0.05
h264           0.10                dealII         0.07
omnetpp        0.08                soplex         0.10
astar          0.09                povray         0.06
xalancbmk      0.15                calculix       0.05
INT GMean      0.12                GemsFDTD       0.19
apache         0.25                tonto          0.06
jbb            0.11                lbm            0.30
oltp           0.13                wrf            0.06
zeus           0.26                sphinx3        0.26
Com GMean      0.18                FP GMean       0.11

The following table lists the L1-I access count of BAR1 used in Figure 5-11.

TABLE E-4. Baseline values for Figure 5-11

Workload       Accesses (Million)    Workload       Accesses (Million)
perlbench      30.38                 bwaves         6.46
bzip2          2.47                  gamess         11.30
gcc            32.07                 milc           7.43
mcf            21.54                 zeusmp         7.31
gobmk          45.01                 gromacs        15.16
hmmer          6.70                  cactusADM      6.34
sjeng          27.19                 leslie3D       53.03
libquantum     0.01                  namd           8.81
h264           9.09                  dealII         35.33
omnetpp        35.68                 soplex         29.72
astar          24.66                 povray         31.82
xalancbmk      20.09                 calculix       8.54
INT Avg        21.24                 GemsFDTD       6.77
apache         65.93                 tonto          23.77
jbb            23.07                 lbm            6.75
oltp           55.51                 wrf            9.44
zeus           57.85                 sphinx3        54.88
Com Avg        50.59                 FP Avg         18.99

The following table lists the IPCs of COBRo8 used in Figure 5-21.

TABLE E-5. Baseline values for Figure 5-21

Workload       IPC     Workload       IPC
perlbench      0.63    bwaves         0.49
bzip2          1.42    gamess         2.29
gcc            1.03    milc           0.22
mcf            0.19    zeusmp         1.11
gobmk          1.31    gromacs        1.23
hmmer          2.54    cactusADM      1.68
sjeng          1.38    leslie3D       0.94
libquantum     0.21    namd           2.03
h264           1.05    dealII         1.46
omnetpp        1.28    soplex         1.02
astar          1.07    povray         1.57
xalancbmk      0.71    calculix       2.41
INT HMean      0.62    GemsFDTD       0.65
apache         0.40    tonto          1.58
jbb            0.95    lbm            0.44
oltp           0.76    wrf            1.91
zeus           0.40    sphinx3        0.40
Com HMean      0.54    FP HMean       0.80

The following table lists the run-time in cycles of the baseline OoO model used in Figure 6-2.

TABLE E-6. Baseline values for Figure 6-2

Workload       Cycles (Billion)    Workload       Cycles (Billion)
perlbench      0.15                bwaves         0.23
bzip2          0.05                gamess         0.05
gcc            0.09                milc           0.50
mcf            0.53                zeusmp         0.10
gobmk          0.08                gromacs        0.08
hmmer          0.04                cactusADM      0.07
sjeng          0.08                leslie3D       0.10
libquantum     0.46                namd           0.05
h264           0.10                dealII         0.07
omnetpp        0.08                soplex         0.09
astar          0.10                povray         0.06
xalancbmk      0.13                calculix       0.05
INT GMean      0.11                GemsFDTD       0.17
apache         0.24                tonto          0.07
jbb            0.10                lbm            0.27
oltp           0.13                wrf            0.06
zeus           0.24                sphinx3        0.25
Com GMean      0.17                FP GMean       0.10

The following table lists the IPCs of the baseline OoO model used in Figure 6-8.

TABLE E-7. Baseline values for Figure 6-8

Workload       IPC     Workload       IPC
perlbench      0.65    bwaves         0.43
bzip2          1.81    gamess         2.09
gcc            1.04    milc           0.20
mcf            0.19    zeusmp         1.02
gobmk          1.25    gromacs        1.19
hmmer          2.24    cactusADM      1.31
sjeng          1.28    leslie3D       1.00
libquantum     0.21    namd           2.09
h264           0.99    dealII         1.40
omnetpp        1.23    soplex         1.07
astar          1.03    povray         1.61
xalancbmk      0.74    calculix       1.87
INT HMean      0.62    GemsFDTD       0.57
apache         0.41    tonto          1.49
jbb            0.94    lbm            0.36
oltp           0.77    wrf            1.67
zeus           0.40    sphinx3        0.40
Com HMean      0.55    FP HMean       0.74

The following table lists the IPCs of COBRi8 used in Figure 6-13.

TABLE E-8. Baseline values for Figure 6-13

Workload       IPC     Workload       IPC
perlbench      0.68    bwaves         0.61
bzip2          1.93    gamess         2.53
gcc            1.21    milc           0.35
mcf            0.21    zeusmp         1.24
gobmk          1.42    gromacs        1.34
hmmer          2.81    cactusADM      1.71
sjeng          1.54    leslie3D       1.04
libquantum     0.41    namd           2.30
h264           1.17    dealII         1.61
omnetpp        1.41    soplex         1.17
astar          1.21    povray         1.74
xalancbmk      0.98    calculix       2.55
INT HMean      0.80    GemsFDTD       0.77
apache         0.42    tonto          1.74
jbb            1.03    lbm            0.48
oltp           0.80    wrf            2.21
zeus           0.41    sphinx3        0.46
Com HMean      0.57    FP HMean       0.98

Finally, the following table lists the TAGE branch misprediction rates of the baseline OoO model.

TABLE E-9. TAGE branch misprediction rate

Workload       Misprediction (%)    Workload       Misprediction (%)
perlbench      1.43                 bwaves         7.26
bzip2          1.16                 gamess         3.38
gcc            2.67                 milc           0.04
mcf            2.13                 zeusmp         0.72
gobmk          5.31                 gromacs        3.27
hmmer          1.95                 cactusADM      2.37
sjeng          4.46                 leslie3D       3.38
libquantum     0.001                namd           3.50
h264           1.40                 dealII         2.53
omnetpp        1.82                 soplex         1.53
astar          6.31                 povray         1.55
xalancbmk      1.44                 calculix       2.16
INT HMean      0.01                 GemsFDTD       1.20
apache         2.74                 tonto          3.74
jbb            3.06                 lbm            0.73
oltp           4.15                 wrf            1.38
zeus           2.04                 sphinx3        5.16
Com HMean      2.81                 FP HMean       0.54