Timing speculation and adaptive reliable overclocking techniques for aggressive computer systems

by

Viswanathan Subramanian

A dissertation submitted to the graduate faculty

in partial fulfillment of the requirements for the degree of

DOCTOR OF PHILOSOPHY

Major: Computer Engineering

Program of Study Committee: Arun K. Somani, Major Professor; Akhilesh Tyagi; Randall L. Geiger; Joseph A. Zambreno; David Fernández-Baca

Iowa State University

Ames, Iowa

2009

Copyright © Viswanathan Subramanian, 2009. All rights reserved.

To my dear parents

To my enlightening teachers

To my loving wife

To my caring family

To my beloved friends

TABLE OF CONTENTS

LIST OF TABLES

LIST OF FIGURES

ACKNOWLEDGEMENTS

ABSTRACT

CHAPTER 1. INTRODUCTION
1.1 High Performance Computing
1.1.1 Device Scaling
1.1.2 Microprocessor Architectures
1.1.3 Better-Than-Worst-Case Designs
1.1.4 Adaptive Systems
1.2 Fault Tolerant Computing
1.2.1 Transient Faults
1.2.2 Redundancy Techniques
1.2.3 Fault Mitigation Techniques
1.2.4 Exploiting Fault Tolerance to Improve Performance
1.3 Power/Thermal Aware Computing
1.4 Contributions of this Thesis

CHAPTER 2. BACKGROUND
2.1 Parameter Variations
2.2 Reliable Overclocking
2.2.1 Timing Error Detection and Recovery
2.2.2 Timing Error Rate Based Feedback Control System
2.2.3 Timing Speculation
2.3 Razor Architecture
2.4 SPRIT3E Framework

CHAPTER 3. MANIPULATING SHORT-PATHS FOR PERFORMANCE
3.1 Impact of Short-paths
3.1.1 Timing Constraints
3.1.2 Variable or Fixed Phase Shift
3.1.3 Manipulating Contamination Delay
3.2 Increasing Contamination Delay of a CLA Adder Circuit - A Case Study
3.2.1 Analysis of Reliable Overclocking Performance

CHAPTER 4. CHARACTERIZING ADAPTIVE RELIABLE OVERCLOCKING
4.1 Evaluating Speculative Reliable Overclocking
4.1.1 Performance Metrics
4.2 Analysis Framework
4.2.1 Modeling a Reliably Overclocked Processor (ROP)
4.2.2 Power and Thermal Modeling
4.3 Adaptive Clocking
4.3.1 Clock Tuning Schemes
4.3.2 Comparing Adaptive Clocking Techniques
4.4 Reliable Overclocking Analysis

CHAPTER 5. THERMAL IMPACT OF RELIABLE OVERCLOCKING
5.1 Thermal and Reliability Management
5.2 Analysis Framework for Estimating On-chip Temperature
5.2.1 Thermal Throttling
5.2.2 Simulation Parameters
5.3 On-chip Temperature Trends in Reliably Overclocked Processors

CHAPTER 6. RELIABLE OVERCLOCKING AND TECHNOLOGY SCALING
6.1 Technology Scaling
6.2 A Reliable Overclocking Approach
6.3 Analysis Framework
6.4 Performance at Different Technology Nodes
6.5 Comparing Technology Scaling with Reliable Overclocking

CHAPTER 7. FAULT TOLERANT AGGRESSIVE SYSTEMS
7.1 Conjoined Pipeline Architecture
7.1.1 Conjoined Pipeline Datapath Description
7.1.2 Error Detection and Recovery
7.2 Timing Requirements
7.3 Implementation Considerations
7.3.1 Two Clock Approach
7.4 Experiments and Results

CHAPTER 8. CONCLUSIONS AND FUTURE WORK

LIST OF TABLES

Table 3.1 Implementation details of CLA adder circuits
Table 4.1 Processor specifications
Table 4.2 Synthesis report of major pipeline stages
Table 4.3 Simulator parameters
Table 4.4 Comparing various performance metrics between a base non-overclocked processor, a reliably overclocked processor tuned using a single clock generator, and a reliably overclocked processor tuned using dual clock generators. All the systems execute SPEC2000 integer benchmarks
Table 4.5 Comparing various performance metrics between a base non-overclocked processor, a reliably overclocked processor tuned using a single clock generator, and a reliably overclocked processor tuned using dual clock generators. All the systems execute SPEC2000 floating point benchmarks
Table 4.6 Comparing various performance metrics for non-overclocked and reliably overclocked processors executing SPEC2000 integer benchmarks
Table 4.7 Comparing various performance metrics for non-overclocked and reliably overclocked processors executing SPEC2000 floating point benchmarks
Table 4.8 Effect of memory overclocking on the performance benefits of a ROP executing SPEC2000 integer benchmarks
Table 4.9 Effect of memory overclocking on the performance benefits of a ROP executing SPEC2000 floating point benchmarks
Table 5.1 Mean Time To Failure (MTTF) for critical wear out models
Table 5.2 Simulator parameters
Table 6.1 Technology scaling parameters
Table 6.2 Comparing various performance metrics across different technology nodes for a non-overclocked processor executing SPEC2000 integer benchmarks
Table 6.3 Comparing various performance metrics across different technology nodes for a non-overclocked processor executing SPEC2000 floating point benchmarks
Table 7.1 Possible error scenarios
Table 7.2 Fault injection results
Table 7.3 Timing errors

LIST OF FIGURES

Figure 2.1 Cross section of an n-channel MOSFET in the ON state showing channel formation. The channel exhibits pinch-off near the drain, indicating operation in the saturation (active) region.
Figure 2.2 Typical pipeline stage in a ROP. The local timing error detection and recovery scheme for critical registers is shown in detail.
Figure 2.3 Timing diagram showing the overclocking advantage per cycle, as compared to the worst-case clock
Figure 2.4 Timing diagram showing pipeline stage level timing speculation
Figure 2.5 Reduced overhead Razor flip-flop and metastability detection circuits (figure reproduced from [27])
Figure 2.6 SPRIT3E framework
Figure 3.1 Clock timing waveforms showing governing requirements for MAINCLK and PSCLK over the full range of overclocked aggressive frequencies (FMIN to FMAX)
Figure 3.2 Examples of Main and PS clocks with variable and fixed phase shifts
Figure 3.3 Timing waveforms after increasing contamination delay to half the propagation delay, for the full range of overclocked aggressive frequencies (FMIN to FMAX)
Figure 3.4 8-bit CLA adder
Figure 3.5 Delay distribution for an 8-bit CLA adder
Figure 3.6 8-bit CLA adder with additional delay blocks to increase contamination delay
Figure 3.7 Delay distribution for an 8-bit CLA adder after increasing contamination delay
Figure 3.8 Experimental setup to estimate performance improvement of CLA adder circuits
Figure 3.9 Percent of error cycles versus clock period for an 8-bit delay added CLA adder circuit
Figure 3.10 Percent of error cycles versus clock period for a 32-bit delay added CLA adder circuit (contamination delay 1.21 ns)
Figure 3.11 Percent of error cycles versus clock period for a 32-bit delay added CLA adder circuit (contamination delay 1.38 ns)
Figure 3.12 Percent of error cycles versus clock period for a 64-bit delay added CLA adder circuit
Figure 4.1 Alpha 21264 integer and floating point pipeline showing the timing error detection and recovery circuit for critical registers
Figure 4.2 Simulation framework
Figure 4.3 Cumulative error profile for all pipeline stages at overclocked operating frequencies for SPEC2000 integer benchmarks. Also shown separately are error profiles for the issue stage and the execute stage.
Figure 4.4 Error profile for three SPEC2000 integer benchmarks executing five different instruction and data sets
Figure 4.5 Feedback control system to dynamically tune clock frequency: single clock generator with variable phase shift
Figure 4.6 Feedback control system to dynamically tune clock frequency: dual clock generators with fixed phase shift
Figure 4.7 Run time, energy and energy-delay product trends for SPEC2000 integer benchmarks as the target error rate varies from 0% to 20%. All values are normalized to the 0% target error rate (no overclocking mode).
Figure 4.8 Run time, energy and energy-delay product trends for SPEC2000 floating point benchmarks as the target error rate varies from 0% to 20%. All values are normalized to the 0% target error rate (no overclocking mode).
Figure 5.1 Simulation framework depicting thermal throttling, alongside timing error based feedback control, for a reliably overclocked system
Figure 5.2 On-chip temperature trends and MTTF results for the bzip2 benchmark. The plots show how on-chip temperature and MTTF vary for a non-overclocked processor, a reliably overclocked processor, and a reliably overclocked processor with thermal throttling.
Figure 5.3 On-chip temperature trends and MTTF results for the crafty benchmark. The plots show how on-chip temperature and MTTF vary for a non-overclocked processor, a reliably overclocked processor, and a reliably overclocked processor with thermal throttling.
Figure 5.4 On-chip temperature trends and MTTF results for the gzip benchmark. The plots show how on-chip temperature and MTTF vary for a non-overclocked processor, a reliably overclocked processor, and a reliably overclocked processor with thermal throttling.
Figure 5.5 On-chip temperature trends and MTTF results for the mcf benchmark. The plots show how on-chip temperature and MTTF vary for a non-overclocked processor, a reliably overclocked processor, and a reliably overclocked processor with thermal throttling.
Figure 5.6 Relative performance for SPEC2000 integer benchmarks
Figure 6.1 Technology scaling vs. speculative reliable overclocking: power consumption trends for SPEC2000 integer benchmarks
Figure 6.2 Technology scaling vs. speculative reliable overclocking: power consumption trends for SPEC2000 floating point benchmarks
Figure 6.3 Technology scaling vs. speculative reliable overclocking: temperature trends for SPEC2000 integer benchmarks
Figure 6.4 Technology scaling vs. speculative reliable overclocking: temperature trends for SPEC2000 floating point benchmarks
Figure 6.5 Technology scaling vs. speculative reliable overclocking: run time for SPEC2000 integer benchmarks. All values are normalized to the 90nm run time.
Figure 6.6 Technology scaling vs. speculative reliable overclocking: run time for SPEC2000 floating point benchmarks. All values are normalized to the 90nm run time.
Figure 6.7 Technology scaling vs. speculative reliable overclocking: energy consumption for SPEC2000 integer benchmarks. All values are normalized to the 90nm energy values.
Figure 6.8 Technology scaling vs. speculative reliable overclocking: energy consumption for SPEC2000 floating point benchmarks. All values are normalized to the 90nm energy values.
Figure 6.9 Technology scaling vs. speculative reliable overclocking: EDP for SPEC2000 integer benchmarks. All values are normalized to the 90nm EDP.
Figure 6.10 Technology scaling vs. speculative reliable overclocking: EDP for SPEC2000 floating point benchmarks. All values are normalized to the 90nm EDP.
Figure 7.1 Conjoined Pipeline Architecture: the shaded region represents the L-PIPELINE; the dotted line encompasses the Local Fault Detection and Recovery (LFDR) circuit
Figure 7.2 Waveforms highlighting error detection and recovery in a Conjoined Pipeline system
Figure 7.3 Dynamic frequency scaling
Figure 7.4 Modular implementation
Figure 7.5 Clock generation circuitry
Figure 7.6 Execution time for three different applications running on the Conjoined Processor in various modes

ACKNOWLEDGEMENTS

I would like to take this opportunity to express my thanks to those who helped me with various aspects of conducting research and the writing of this thesis.

First and foremost, I thank Dr. Arun K. Somani for his guidance, patience and support throughout this research and the writing of this thesis. His insights and words of encouragement have often inspired me and renewed my hopes for completing my graduate education. I am very thankful to him for giving me the liberty to pursue a research direction that I liked, and the guidance required to be successful at it. I would also like to thank my committee members for their efforts and contributions to this work:

Dr. Akhilesh Tyagi, Dr. Randall L. Geiger, Dr. Joseph Zambreno and Dr. David Fernández-Baca. I am also thankful to Dr. Chris Chu and Dr. Shashi K. Gadia for being part of my program committee at various times. I am thankful to the many professors at Iowa State who taught me and provided me with sufficient knowledge to conduct this research. I am thankful to the ECPE department and Iowa State University for providing me with such a wonderful atmosphere for carrying out quality research.

I would like to acknowledge the contributions of my colleagues Premkumar Ramesh, Prasad Avirneni, Mikel Bezdek, Roy Lycke, and Adam Jackson in developing various aspects of this thesis. I am very much thankful to Prem and Prasad for making the research atmosphere lively; without them, I would have found it extremely difficult to wade through the highs and lows of conducting research. I would like to thank Ganesh for his guidance throughout the initial stages of my graduate career. I am very much indebted to Kavitha Balasubramanian for being the painstaking proofreader of each one of my papers. I am thankful to my wife, Kamala, for helping me plot graphs and arrange the humongous data I collected from my simulation runs. I am thankful to my other research group colleagues Srivatsan, Mike, Kamna, Nathan, Koray, Jinxu, Ramon, Kritanjali, David, Nishanth, Pavan, and Zachary for the many wonderful discussions we had during the weekly seminars.

If not for my friends who made my stay at ISU so totally memorable and enjoyable, I would have found living so far away from everything I cared about extremely unbearable. I am forever grateful to these amazing folks - Vasanth, Puvi, Atul, Hari, Kavitha, Kamna, Prem, Prasad, KK, Muthu, Vatsan, Satya, Abhijit, Nishanth, Vichu, Shibhi, Nikhil, Niranjan, Rakesh, Rokkam, Satyadev, Richard, Srikanth, the Sankalp volunteers, the coffee room chitchatters and the Friday evening volleyball gang. I am very grateful to Vasanth Balaramudu, who was my roommate during three years of my life at Iowa State, for his exceptional friendship, and I can say without any doubt that, if not for his amazing culinary skills, I would have lost twenty pounds for sure. I am not really sure whether that is a boon or a bane, considering the rigor I am going through to lose those twenty pounds.

Finally, I would like to thank my parents, my wife, and family for their loving guidance and motivation during the writing of this work. I am forever indebted to my parents, without whose support I would never have come this far.

ABSTRACT

Computers have changed our lives beyond our own imagination in the past several decades. The continued and progressive advancements in VLSI technology and numerous micro-architectural innovations have played a key role in the design of spectacular low-cost, high performance computing systems that have become omnipresent in today's technology driven world. Performance and dependability have become key concerns as these ubiquitous computing machines continue to drive our everyday life. Applications have unique demands, as they run in diverse operating environments. Dependable, aggressive and adaptive systems improve efficiency in terms of speed, reliability and energy consumption.

Traditional computing systems run at a fixed clock frequency, which is determined by taking into account the worst-case timing paths, operating conditions, and process variations. Timing speculation based reliable overclocking advocates going beyond worst-case limits to achieve the best performance, while not avoiding, but detecting and correcting, a modest number of timing errors. The success of this design methodology relies on the fact that timing critical paths are rarely exercised in a design, and typical execution happens much faster than the timing requirements dictated by worst-case design methodology. Better-than-worst-case design methodology is advocated by several recent research pursuits, which exploit dependability techniques to enhance computer system performance.

In this dissertation, we address different aspects of timing speculation based adaptive reliable overclocking schemes, and evaluate their role in the design of low-cost, high performance, energy efficient and dependable systems. We visualize various control knobs in the design that can be favorably controlled to ensure different design targets.

As part of this research, we extend the SPRIT3E, or Superscalar PeRformance Improvement Through Tolerating Timing Errors, framework, and characterize the extent of application dependent performance acceleration achievable in superscalar processors by scrutinizing the various parameters that impact operation beyond worst-case limits. We study the limitations imposed by short-path constraints on our technique, and present ways to exploit them to maximize performance gains. We analyze the sensitivity of our technique's adaptiveness by exploring the necessary hardware requirements for dynamic overclocking schemes. Experimental analysis based on SPEC2000 benchmarks running on a SimpleScalar Alpha processor simulator, augmented with error rate data obtained from hardware simulations of a superscalar processor, is presented.

Even though reliable overclocking guarantees functional correctness, it leads to higher power consumption. As a consequence, reliable overclocking without considering on-chip temperatures will bring down the lifetime reliability of the chip. In this thesis, we analyze how reliable overclocking impacts the on-chip temperature of a microprocessor, and evaluate the effects of overheating, due to such reliable dynamic frequency tuning mechanisms, on the lifetime reliability of these systems. We then evaluate the effect of performing thermal throttling, a technique that clamps the on-chip temperature below a predefined value, on system performance and reliability. Our study shows that a reliably overclocked system with dynamic thermal management achieves a 25% performance improvement, while lasting for 14 years when operated within 353 K.

Over the past five decades, technology scaling, as predicted by Moore's law, has been the bedrock of semiconductor technology evolution. The continued downscaling of CMOS technology to deep sub-micron gate lengths has been the primary reason for its dominance in today's omnipresent microchips. Even as the transition to the next technology node is indispensable, the initial cost and time associated in doing so presents a non-level playing field for the competitors in the semiconductor business. As part of this thesis, we evaluate the capability of speculative reliable overclocking mechanisms to maximize performance at a given technology level. We evaluate its competitiveness when compared to technology scaling, in terms of performance, power consumption, energy and energy delay product. We present a comprehensive comparison for integer and floating point SPEC2000 benchmarks running on a simulated Alpha processor at three different technology nodes in normal and enhanced modes. Our results suggest that adopting reliable overclocking strategies will help skip a technology node altogether, or be competitive in the market while porting to the next technology node.

Reliability has become a serious concern as systems embrace nanometer technologies. In this dissertation, we propose a novel fault tolerant aggressive system that combines soft error protection and timing error tolerance. We replicate both the pipeline registers and the pipeline stage combinational logic. The replicated logic receives its inputs from the primary pipeline registers while writing its output to the replicated pipeline registers. The organization of redundancy in the proposed Conjoined Pipeline system supports overclocking, provides concurrent error detection and recovery capability for soft errors, intermittent faults and timing errors, and flags permanent silicon defects. The fast recovery process requires no checkpointing and takes three cycles. Back annotated post-layout gate-level timing simulations, using 45nm technology, of a conjoined two-stage arithmetic pipeline and a conjoined five-stage DLX pipeline processor, with forwarding logic, show that our approach, even under a severe fault injection campaign, achieves near 100% fault coverage and an average performance improvement of about 20% when dynamically overclocked.

CHAPTER 1. INTRODUCTION

Microprocessors and application specific integrated circuits (ASICs) have evolved spectacularly since the early 1970s, paving the way for the digital revolution we continue to witness in our everyday life. The wide range of applicability of digital systems has subjected them to diverse demands, in terms of performance, power consumption and dependability, as they perform a plurality of tasks and run in a multitude of operating environments. These demands are interrelated and need to be addressed cohesively, as improving one metric alone is counterproductive for another. For instance, choosing operating voltage and frequency for performance results in increased power consumption.

Technology scaling, in line with "Moore's Law", has long sustained the unparalleled growth of the semiconductor industry. Technology scaling leads to smaller transistors, higher packing densities, decreased supply voltages and increased clock frequencies, thereby contributing to the goals of higher performance and lower power consumption. However, with ultra deep sub-micron technologies, integrated circuit reliability is impacted, as thinner circuits and wires are exercised more aggressively, leading to premature device failures. In addition, some operating environments are hazardous to the working of the integrated electronics that form the fabric of these computer systems.

Reliability issues in combinational logic have become more pronounced, and their manifestations result in frequent error occurrence, as we rapidly adopt technological advancements [85]. A major reason for this is the increasing probability of longer single event transient (SET) pulses in newer technologies. Radiation-induced SET pulses have widths in the range of 500ps to 900ps in the 90nm process, as compared to 400ps to 700ps in the 130nm process [66]. Control logic and random logic protection continues to be a major issue in the quest for highly dependable systems [77].

Traditional computing systems are designed to run reliably at discrete voltage and clock frequency settings, which are determined by taking into account the implementation technology, power budget, worst-case timing paths, operating conditions and process variations. Dynamic voltage and frequency scaling (DVFS) techniques that choose one of the discrete voltage-frequency settings during run-time are commonly employed in today's microprocessors, to reduce power consumption by enabling high performance only when demanded by the currently executing application. Adaptive techniques such as DVFS are beneficial since every application has unique demands.

For a given supply voltage, the clock frequency of a processor is selected such that there is sufficient time for the longest delay path to stabilize under adverse operating conditions. However, this worst-case propagation delay estimate is too conservative, as the process, voltage and temperature variations that are introduced during circuit fabrication and operation have a wide range of possible values. Processor manufacturers assume the worst while determining the critical operating voltage and frequency values.

These operating levels are defined as critical because going beyond them may result in system crashes due to erroneous computation, and device malfunction due to overheating. However, the rarity of occurrence of worst-case scenarios, combined with input data dependent circuit delays, led to techniques such as overclocking that exploit the latent best-case performance in computing systems [3, 54, 103].

Overclocking is a procedure in which the operating frequency is increased beyond manufacturer-specified frequency limits for reliable operation, without changing the system supply voltage [19].

The interest in designs that operate beyond worst-case design settings started when PC tweakers in the early 1990s modified their existing computers to run at higher speeds, enabled by exotic cooling solutions. Over the last decade, gaming enthusiasts have embraced overclocking in their pursuit of ever faster execution times and fantastic gaming experiences. During recent years, overclocking has become mainstream, with manufacturers introducing technologies that support overclocking: AMD's OverDrive and Advanced Clock Calibration technologies are cases in point [1].

The gains from overclocking are possible because of the worst-case assumptions used by traditional design methodologies. However, systems running at overclocked speeds cannot be relied upon, as the possibility of a system failure always exists. As a result, to account for the timing errors that occur at overclocked speeds, it is important to overclock the system reliably, to make the common case faster [27, 92]. When reliably overclocked, performance benefits can be seen only if computed data is used at overclocked speeds. The concept of using data speculatively, assuming no timing errors, is called timing speculation. Timing speculation based reliable overclocking mechanisms employ proven fault tolerance techniques to detect and recover from timing errors [5, 8, 27, 33, 32, 92].
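To make the per-cycle mechanics concrete, the following minimal sketch (Python, with made-up delay values and a one-cycle recovery penalty) models a single reliably overclocked stage: every result is consumed speculatively, and only the rare cycles whose actual path delay exceeds the aggressive clock period pay a recovery cost.

```python
import random

def run_stage(delays_ns, t_worst=10.0, t_aggressive=7.0, recovery_cycles=1):
    """Model timing speculation for one pipeline stage.

    delays_ns: the (data-dependent) path delay exercised each cycle.
    Cycles whose delay exceeds the aggressive period are caught by a
    delayed backup register and repaired at `recovery_cycles` cost.
    """
    aggressive_time = 0.0
    for d in delays_ns:
        aggressive_time += t_aggressive
        if d > t_aggressive:                    # timing error detected
            aggressive_time += recovery_cycles * t_aggressive
    return aggressive_time, len(delays_ns) * t_worst

# Critical paths are rarely exercised: 95% of cycles finish well early.
random.seed(1)
delays = [random.choice([6.0] * 95 + [9.5] * 5) for _ in range(10_000)]
fast, slow = run_stage(delays)
print(f"speedup over worst-case clocking: {slow / fast:.2f}x")
```

Speculation pays off exactly when the error rate multiplied by the recovery penalty stays small relative to the per-cycle savings; this trade-off is what the rest of the thesis quantifies.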

Future systems need crucial reliability enhancements to ensure high trust in, and dependability of, their computational correctness. Fault tolerance approaches strive to achieve a high degree of fault coverage, while being as conservative as possible in terms of area and power overhead. They attempt to minimize performance degradation compared to a non fault-tolerant system. Commercial systems opt for low overhead approaches that provide limited fault coverage and tolerate a subset of hardware fault classes, while incurring a modest performance penalty. On the other hand, servers designed for continued operation, such as the IBM zSeries, have robust reliability, availability and serviceability features [56]. With the advent of Chip Multiprocessors (CMP), fault tolerance techniques that also improve performance have been developed [88, 95]. These approaches utilize two cores to run an application, with the goal of executing the application faster than on a single core, while leveraging the redundancy to tolerate faults.

In this dissertation, we address various aspects of timing speculation based adaptive reliable overclocking schemes, and evaluate their role in the design of aggressive computer systems. The goal of this thesis is to develop low-cost, high performance, energy efficient and dependable systems. We characterize the various factors that influence the performance gains achievable through adaptive reliable overclocking. We evaluate the effectiveness of reliable overclocking, as compared to technology scaling. We also develop fault tolerant aggressive systems that have the twin goals of guaranteeing high performance and fault tolerance.

1.1 High Performance Computing

In conformance with Moore's law [63], the semiconductor industry has witnessed its effectiveness doubling roughly every two years for the past several decades. This seemingly everlasting improvement, in the performance of microprocessors and other digital systems, is being sustained by a host of innovations at the manufacturing level, circuit level and micro-architectural level.

1.1.1 Device Scaling

Device scaling has enabled the doubling of transistor density and clock frequency with each technology generation [9, 24]. However, recent scaling trends indicate the need for new materials and manufacturing methodologies to circumvent the predicted scaling limits at ultra deep sub-micron technology nodes [96]. The International Technology Roadmap for Semiconductors (ITRS) uses the term "technology node" to indicate overall industry progress in IC feature scaling [29]. Technology nodes used to be defined based on the smallest half-pitch of contacted metal lines on DRAM. However, for microprocessors and ASICs, technology feature size is indicated by the gate-length isolated feature size.

Current microprocessors are manufactured using the 45nm process, while 32nm technology is targeted for Q4 2009 [45]. ITRS predicts that technology scaling will continue through the next decade, and nano-CMOS technology will remain the dominant implementation technology for silicon VLSI chips. Technology scaling enables the following: for every 30% downscaling of the technology node, transistor density doubles, gate delay reduces by 30%, operating frequency improves by 43%, active power consumption halves, and energy savings of 65% are observed. However, power and process variations impose a limit on frequency scaling that is achievable by device scaling alone [23, 30].
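The figures quoted above follow directly from classical constant-field (Dennard) scaling; the short check below reproduces them for a single 30% linear shrink (scale factor 0.7):

```python
# Idealized constant-field scaling for one 30% linear shrink (s = 0.7).
s = 0.7
density = 1 / s**2     # transistors per unit area: ~2.04x, i.e. doubles
gate_delay = s         # delay scales with s: 30% faster gates
frequency = 1 / s      # ~1.43x: 43% higher operating frequency
power = s**2           # P = C*V^2*f with C->sC, V->sV, f->f/s: ~0.49x
energy = s**3          # energy per operation, C*V^2: ~0.34x (65% savings)
print(density, gate_delay, frequency, power, energy)
```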

Device scaling also complicates on-chip communication, as interconnects do not scale as well as logic gates [9]. Wire delays have started to dominate the overall delay, and in modern microprocessors, pipeline stages are dedicated solely to moving signals across the chip.

1.1.2 Microprocessor Architectures

Micro-architectural innovations complement improvements in process technology, and contribute to the immense advancements in computer technology. To continue the pace of this growth, advances in microprocessor architectures are critical. An important factor that has contributed significantly to the improvement in microprocessor performance is instruction level parallelism (ILP). Superscalar processors issue more than one instruction in a given cycle, thereby increasing the number of instructions issued and executed per cycle. These processors have hardware support to perform dynamic scheduling. Very Long Instruction Word (VLIW) machines are compiler assisted and execute more than one instruction in a cycle.

Along with device scaling that enables faster logic circuits, pipelining has played a major role in increasing clock frequency. Processor pipeline depths have increased from five stages to thirty stages. Deep pipelining, or superpipelining [61], allows a faster clock frequency by dividing the clock limiting stages into multiple sub-stages. However, the benefits of superpipelining are limited by the penalties imposed by branch and other hazards. The Intel Pentium 4 has a twenty stage mis-prediction pipeline [41]. Recent research suggests that clock frequency improvements from increasing pipeline depths are reaching a point of diminishing returns. It has been shown in [37] that while the optimal number of pipeline stages is application dependent, on average, performance is maximized with around twenty stages.
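The diminishing returns have a simple first-order explanation: each extra stage shaves logic delay but adds fixed latch overhead, while flush penalties grow with depth. The toy model below (all parameters illustrative, not taken from [37]) makes the trade-off visible:

```python
# Time per instruction for an N-stage pipeline: useful logic per stage
# plus latch overhead, plus hazard flushes whose cost grows with depth.
def time_per_instruction(n, t_logic=20.0, t_latch=0.5, hazard_rate=0.05):
    cycle = t_logic / n + t_latch            # ns per cycle
    return cycle + hazard_rate * n * cycle   # expected flush cost per instr

best = min(range(2, 60), key=time_per_instruction)
print("depth minimizing time per instruction:", best)
```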

Asynchronous design techniques present an alternative to synchronous design approaches [26]. For a long time, these methodologies suffered from a lack of computer-aided design tool support. In recent years, there have been efforts to automate the asynchronous design process. Counterflow pipeline processors present an interesting approach towards asynchronous microprocessor implementations [89]. However, asynchronous designs are not easy to comprehend and present several implementation bottlenecks [38]. In addition, aggressive time-to-market schedules and significantly shorter product lifetimes make asynchronous design methodologies unattractive.

1.1.3 Better-Than-Worst-Case Designs

Reliable overclocking allows embedded systems and processors to run at higher frequencies than the manufacturer specified worst-case limits. The vendor-specified frequency includes a safety margin to provide tolerance for process variations, voltage fluctuations and extreme temperatures. For systems operating in typical environments, significant benefits can be achieved through overclocking, if reliable execution can be guaranteed. Also, frequency binning is used by manufacturers to sort fabricated devices based on their speed. The discrete speed grades and inter-die process variations can be exploited by overclocking to improve performance [32].

The most significant aspect that is exploited by reliable overclocking is the input data dependency of the worst-case delays. The worst-case delay paths are sensitized only for specific input combinations and data sequences [3]. Typically, the propagation delay of the digital system is much less than the worst-case delay, and this can be exploited by overclocking. The benefits of overclocking can be furthered by allowing a tolerable number of errors to occur, and having an efficient mechanism to detect and recover from those errors. This technique, along with dynamic voltage scaling, has been used to improve energy efficiency [27]. Along with adaptive clocking mechanisms, reliable overclocking improves performance drastically [92]. In [58], the trade-off between reliability and performance is studied, and overclocking is used to improve the performance of register files.

Timing error avoidance techniques that overclock, but within safe limits also have been proposed [99].

1.1.4 Adaptive Systems

Increasingly miniaturized systems and higher frequencies of operation have resulted in increased overall power dissipation for the same chip size. Power and energy aware computing approaches scale supply voltage to reduce dynamic power dissipation. This approach also lowers the frequency, so that errors are avoided. Dynamic frequency scaling, along with dynamic voltage scaling, allows energy-performance trade-offs during run time.

Over the last decade, overclocking as a means to improve processor performance has been gaining popularity [19]. Overclocking does not guarantee computational correctness. Hence, it is necessary to develop solutions that reliably and dynamically adjust frequency to the optimal value. Having a clock signal whose frequency adapts well to the environment, as well as to the application characteristics, empowers digital systems to capitalize on significant performance benefits in terms of execution time. Such dynamic clock frequency schemes have been proposed earlier for ASICs [34] and Field Programmable Gate Arrays (FPGAs) [12].

Adaptive overclocking adjusts frequency for variations in process and environmental conditions during run-time. Also, worst-case conditions occur rarely, leaving room for significant performance improvements that can be achieved through dynamically adjusting clock frequency at run time beyond worst-case limits. In [59], a theoretical control technique for a variable speed processor is presented. Triple-latch monitor based designs that perform hardware self-tuning based on circuit performance monitoring have the capability to adapt to process and temperature variations [48].
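In the same spirit, a frequency tuner driven by an observed timing error rate can be sketched in a few lines. The controller below is hypothetical (the target, step size, and error-rate curve are invented, and it is not the scheme of any cited work): it simply pushes the clock up while errors stay below a target and backs off otherwise.

```python
def tune_frequency(f, error_rate, target=0.01, step=0.05,
                   f_min=1.0, f_max=2.0):
    """One sampling interval of an error-rate-driven clock tuner.
    f is expressed as a multiple of the worst-case-safe frequency."""
    if error_rate < target:
        return min(f + step, f_max)    # headroom left: speed up
    return max(f - step, f_min)        # too many errors: back off

f = 1.0
for _ in range(20):                    # errors rise sharply past ~1.3x
    observed = 0.0 if f < 1.3 else (f - 1.3) * 0.5
    f = tune_frequency(f, observed)
print(f"settled near {f:.2f}x the worst-case frequency")
```

The controller hovers around the frequency where the error rate crosses the target, which is the steady-state behavior an adaptive reliably overclocked system aims for.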

1.2 Fault Tolerant Computing

Along with performance, reliability is becoming one of the preeminent concerns for computer architects. As scaling in VLSI technology continues into the nanometer regime, both memory elements and combinational gates become susceptible to soft errors, due to radiation-induced transient pulses that often have durations higher than the gate propagation delays [7]. Such an electric pulse can propagate without masking, get latched in a memory element, and may result in an error (soft error) at the application level, hence resulting in unfavorable system behavior. Further, higher clock speeds decrease the cycle time, increasing the probability that a soft error is latched. These trends imply that future digital systems need to be protected against both single event transients (SETs) and single event upsets (SEUs) [47].

Laprie [52] defines a fault as an erroneous state of hardware or software, and an error as the manifestation of a fault. A failure occurs when the actual operation deviates from the desired operation. Not all faults become errors, and not all errors lead to failure [102]. Lately, reliability issues have become more pronounced and their manifestations result in frequent error occurrence, as we rapidly adopt technological advancements [85]. Future systems need crucial reliability enhancements to ensure high trust in, and dependability of, their computational correctness. Fault tolerance architectures have become more attractive than fault avoidance architectures as performance takes center stage.

The impact of soft errors and silicon defects on system reliability has been steadily rising as we progress towards 32nm technologies and beyond. Soft errors, which are transient in nature, and silicon defects, which lead to permanent failures, have prompted researchers to formulate fault tolerance techniques with varied capabilities to improve system reliability. Soft errors, induced by high energy radiation and external noise, have become more frequent and may result in incorrect computation and silent data corruption. Intermittent faults are also a cause for concern [20]. Silicon defects resulting from silicon failure mechanisms such as transistor wear out, gate breakdown, hot carrier degradation, and manufacturing limitations reduce the lifetime and reliability of fabricated devices [57].

1.2.1 Transient Faults

A host of research work exists that characterizes the effects of transient faults on a high-performance processor pipeline. An analysis of soft error sensitivity for picoJava-II, a microprocessor core developed by Sun Microsystems, has been performed in [51]. This work asserts that a reasonable prediction of soft error sensitivity can be made by deduction from the processor's microarchitecture, leading to efficient implementations of redundancy for various logic blocks. Similar characterizations have been performed on other commercial microprocessors [81, 101].

The fact that not all transient faults translate to system failure has led to concepts such as the architectural vulnerability factor (AVF) [65]. AVF is calculated for different hardware structures based on their usage and the presence of valid data in them, and it is defined as the probability that a fault occurring in a particular structure will result in an error. Calculating AVF for various structures enables selective switch-on of fault tolerance mechanisms, based on the AVF at that point in time. The reason for AVF to be different at various times is that a soft error happening, for instance, on dead code or wrong-path instructions may be inconsequential, or may propagate without affecting the end result of a computation. Natural fault masking, resulting from logical masking, electrical masking, and latching window masking [85], also prevents a fault from manifesting as an error.
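Numerically, AVF reduces to a ratio of bit-cycles; the minimal sketch below shows the bookkeeping, with an invented occupancy figure standing in for a real trace of architecturally required (ACE) bits.

```python
# AVF = ACE bit-cycles / total bit-cycles, where ACE bits are those
# required for Architecturally Correct Execution.
def avf(ace_bit_cycles, total_bits, total_cycles):
    return ace_bit_cycles / (total_bits * total_cycles)

# A 64-entry, 32-bit queue observed for 1000 cycles; suppose analysis
# shows 480,000 bit-cycles held live, architecturally required data.
print(f"AVF = {avf(480_000, 64 * 32, 1000):.1%}")   # 23.4%
```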

Prior research indicates that memories are more susceptible to faults than the processor pipeline itself [39, 98]. Microprocessor datapath protection has been well researched. However, control logic and random logic protection continues to be a major issue for high performance systems [77, 49]. Currently, a given bit in a RAM is expected to flip only once every many billions of hours of operation. However, with the growing sizes of RAMs and other hardware components, one can expect the error frequency to become much more noticeable in the near future. Also, processors deployed in hazardous environments will have a higher probability of being affected by bit-flips.

1.2.2 Redundancy Techniques

Various components in a processor are protected using different types of redundancy techniques. Redundancy can be broadly classified as information redundancy, spatial redundancy and temporal redundancy.

Information redundancy involves generating extra code bits from data and appending them to the data before storing it. During data retrieval, the code is regenerated and cross-checked to detect errors. Error correcting codes that are capable of single error correction and double error detection are used to ensure high availability and dependability of memories [18].
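As a concrete, scaled-down illustration of such a code, here is a Hamming(7,4) encoder with an added overall parity bit, giving single-error correction and double-error detection on 4-bit words; production memory ECC applies the same idea with wider codes such as (72,64).

```python
def encode(d):                       # d: four data bits
    c = [0] * 8                      # c[1..7] Hamming codeword, c[0] parity
    c[3], c[5], c[6], c[7] = d
    c[1] = c[3] ^ c[5] ^ c[7]        # parity over positions 1,3,5,7
    c[2] = c[3] ^ c[6] ^ c[7]        # parity over positions 2,3,6,7
    c[4] = c[5] ^ c[6] ^ c[7]        # parity over positions 4,5,6,7
    c[0] = c[1] ^ c[2] ^ c[3] ^ c[4] ^ c[5] ^ c[6] ^ c[7]
    return c

def decode(c):
    s = (c[1] ^ c[3] ^ c[5] ^ c[7]) \
        + 2 * (c[2] ^ c[3] ^ c[6] ^ c[7]) \
        + 4 * (c[4] ^ c[5] ^ c[6] ^ c[7])    # syndrome = error position
    parity_ok = c[0] == c[1] ^ c[2] ^ c[3] ^ c[4] ^ c[5] ^ c[6] ^ c[7]
    if s and parity_ok:
        return None                  # double error: detected, uncorrectable
    if s:
        c[s] ^= 1                    # single error: flip the faulty bit
    return [c[3], c[5], c[6], c[7]]

word = encode([1, 0, 1, 1])
word[6] ^= 1                         # inject a single bit flip
assert decode(word) == [1, 0, 1, 1]  # corrected on retrieval
```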

Spatial redundancy is achieved by performing the same computation on multiple independent functional units at the same time. Concurrent error detection is achieved by duplication and comparison, and recovery is triggered whenever a mismatch happens [62].

Temporal redundancy works by repeating the computation on the same hardware multiple times and comparing the execution results to flag errors. While spatial methods can tolerate both transient and permanent errors, temporal techniques only work for transient errors [72]. All such techniques offer fault tolerance either by performing fault detection coupled with recovery, or by providing fault masking. The choice of redundancy is based on several parameters, including cost, performance, and fault coverage.
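A few lines suffice to show both the mechanism and its blind spot. In this illustrative sketch (the fault model is invented), repeated execution with a majority vote masks a transient upset; a permanent fault would corrupt every run identically and slip through, which is exactly the limitation noted above.

```python
import random
from collections import Counter

def temporal_vote(compute, x, runs=3):
    """Repeat a computation on the same hardware and majority-vote."""
    results = [compute(x) for _ in range(runs)]
    value, votes = Counter(results).most_common(1)[0]
    if votes <= runs // 2:
        raise RuntimeError("no majority: unrecoverable")
    return value

random.seed(0)
flaky_add = lambda x: x + 1 if random.random() > 0.1 else x - 99  # transient upset
print(temporal_vote(flaky_add, 41))   # 42: any lone upset is outvoted
```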

1.2.3 Fault Mitigation Techniques

Soft errors are induced by alpha particle emissions from chip packaging, high energy particle strikes, which include neutrons from cosmic rays, and electrical noise. Several factors, such as the energy of the incoming particle, the geometry of the impact, the location of the strike, and the design of the logic circuit, determine the occurrence of a soft error. The combination of capacitance and voltage at a transistor node, described by the critical charge parameter Qcrit, determines the minimum electron charge disturbance needed to change the logic level. With device scaling, Qcrit decreases, but the charge collection efficiency of transistors also goes down [82]. Radiation hardening techniques are used to minimize the impact of soft errors on digital circuits [43].

Soft error detection and recovery in the IBM Z990 processors is discussed in [56]. AR-SMT [76] proposes using the multithreading capability of modern processors to execute the program and a duplicate of the program in parallel as two threads. The two threads execute simultaneously and the results are compared after a fixed period of time. If the results mismatch, the system goes back to a checkpointed state and starts re-execution. Many such proposed techniques protect the core datapath against

Protection of FSM based control logic by error masking has been discussed in [21]. A signature caching scheme was proposed in [49] to detect SEUs in the control logic of complex microproces- sors. The ReStore architecture [100] uses checkpointing and rollback to recover from soft errors. The rollback is done based on certain abnormal events such as exceptions and incorrect control flow. A scheme to protect the static output bits of the instruction decode logic in a RISC architecture during loop execution is proposed in [77].

Temporal schemes developed for processors are designed to provide tolerance to transient sources of errors, such as single event upsets caused by high energy particle strikes. A good example of tempo- ral redundancy is REdundant Execution using Spare Elements (REESE) [68]. In this architecture, all instructions reaching the commit stage in a superscalar processor are re-executed by the same hardware in a redundant stream.

1.2.4 Exploiting Fault Tolerance to Improve Performance

Fault tolerance approaches strive to achieve a high degree of fault coverage, while being as conservative as possible in terms of area and power overhead. They attempt to minimize performance degradation compared to a non fault-tolerant system. Commercial systems opt for low overhead approaches that provide limited fault coverage, while incurring a modest performance penalty. Mission critical systems with hard real-time constraints require extensive fault coverage and no compromise in performance. However, the performance degradation associated with high reliability solutions forces systems requiring high performance to, sometimes, sacrifice reliability.

With the advent of Chip Multiprocessors (CMP), fault tolerance techniques that also improve performance have been developed [88, 95, 108]. These approaches utilize two cores to run an application, with the goal of executing the application faster than on a single core, while leveraging the redundancy to tolerate faults. The speedup is achieved by exchanging control and data flow information between the two cores. Here, execution is rolled back to a checkpointed state and instructions are re-executed to recover from a fault.

Several of these proposed architectures apply fault tolerance with the goal of improving performance past worst-case limits. The Selective Series Duplex architecture [50] consists of an integrity checking architecture for superscalar processors that can achieve the fault tolerance capability of a duplex system at much less cost than the traditional duplication approach. DIVA [5] uses spatial redundancy by providing a separate, slower pipeline processor alongside the fast processor.

A body of work exists that utilizes two cores to speed up performance and/or improve fault tolerance. Architectures such as Slipstream [95], Dual-core Execution [107], Future Execution [31], and Reunion [88] exchange control and data flow information between the cores to speed up execution, while leveraging the redundancy to provide partial fault coverage. Paceline [32] performs overclocking of the cores to speed up execution.

1.3 Power/Thermal Aware Computing

The need for low power architectures that deliver high performance, while consuming as little power as possible, is increasingly being felt by embedded system designers as they try to pack more and more power intensive computational tasks while curtailing their power budgets. One of the most effective and widely used techniques for power aware computing is dynamic voltage scaling (DVS). Supply voltage can be lowered during processor idle times. To reduce the supply voltage, the clock frequency needs to be reduced first. Dynamic voltage and frequency scaling together narrows the gap between high performance and low power requirements [69]. As dynamic energy scales quadratically with supply voltage, significant energy reduction is possible by lowering the supply voltage [64].
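The quadratic dependence is worth seeing in numbers. In the sketch below (the activity factor and capacitance are arbitrary placeholders), the frequency term cancels for a fixed amount of work, leaving energy proportional to the square of the voltage:

```python
# Dynamic power is P = a * C * V**2 * f. For a fixed task of N cycles,
# runtime is N / f, so energy E = P * N / f = a * C * V**2 * N: the
# frequency cancels and energy tracks the square of the supply voltage.
def dynamic_energy(v, cycles, a=0.2, c_eff=1e-9):
    return a * c_eff * v**2 * cycles           # joules

full = dynamic_energy(1.2, 1e9)
half = dynamic_energy(0.6, 1e9)                # half the voltage...
print(f"energy ratio: {half / full:.2f}")      # ...quarter the energy
```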

Industry technologies such as Intel SpeedStep [75], AMD PowerNow [2], and LongRun [28] alternate between a set of predefined voltage and frequency pairs, and choose the best pair based on the worst-case voltage, temperature and process conditions. Several monitoring techniques are employed in modern microprocessors to achieve dynamic voltage and frequency scaling based on the current workload of the processor. These run time schemes use the outputs of performance counters, thermal sensors, and ring oscillators that are embedded at strategic locations on the chip.

Correlating voltage controlled oscillator approaches have been proposed, wherein the oscillator speed automatically adapts based on the supply voltage and generates the fastest safe clock speed [16, 35]. More aggressive power reduction can be achieved by tuning the supply voltage of each individual processor chip using embedded inverter delay chains [25].

The RAZOR architecture [22, 27] uses temporal fault tolerance by replicating critical pipeline registers in order to dynamically scale voltage past its worst-case limits. Razor achieves lower energy consumption by reducing the supply voltage in each pipeline stage. In [83], a multiple clock domain approach to processor design is presented for improving energy efficiency. This work supports a globally asynchronous, locally synchronous design technique. The goal is to run different parts of the processor with different clocks, and use existing microprocessor queue structures for inter domain communication.

In recent years, thermal aware computing is becoming as important as power aware computing. The initial attempts to bring down on-chip temperature sought to minimize power. However, meeting power budget requirements is not sufficient, and cooling mechanisms are not cost effective. This created a necessity for a control mechanism built within the processor chip that is effective and economically viable. Designs began to include thermal sensors in various locations on a processor chip [74]. These dynamic control mechanisms effectively manage temperature, but suffer considerable overhead. Follow-on research started to focus on the design of thermally aware high performance processors, aiming to minimize the performance impact for specific applications [14]. Dynamic Thermal Management (DTM) schemes are presented in [87] to make run-time decisions at different levels.
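The control loop behind such DTM schemes can be caricatured in a few lines. In this sketch, the first-order thermal model, the two frequencies, and the constants are illustrative rather than calibrated; only the 353 K trip point echoes the clamp used in the thermal study later in this thesis.

```python
T_TRIP, T_AMBIENT = 353.0, 318.0      # kelvin

def step_temperature(t, f, alpha=0.05, heat_per_ghz=12.0):
    """First-order thermal response toward a frequency-dependent
    steady-state temperature."""
    t_steady = T_AMBIENT + heat_per_ghz * f
    return t + alpha * (t_steady - t)

temp, freq = T_AMBIENT, 3.6           # start overclocked (GHz)
for _ in range(200):
    temp = step_temperature(temp, freq)
    freq = 2.4 if temp >= T_TRIP else 3.6   # throttle / resume
print(f"clamped at about {temp:.1f} K while running at {freq} GHz")
```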

1.4 Contributions of this Thesis

The ideas in this thesis stem from the following observations:

• To provide reliable execution, traditional design methodologies ensure timing error avoidance by designing to accommodate the worst-case parameters. However, in practice the worst cases are rare, leaving a large amount of exploitable performance improvement if timing errors can be detected and recovered from. Several architectural design concepts have been recently proposed.

• With techniques to speed up circuit operation, the overall performance of a digital system is going up, but on-chip thermal management is becoming an issue.

• Semiconductor technology evolution enables high speed, low power systems, but the design costs and time-to-market are becoming worse with deep sub-micron technology generations. A supplementary approach that can provide stopgap relief is beneficial.

• As the advances in semiconductor technology allow continued scaling of VLSI implementations, the possibility of encountering a soft error also increases. As a result, most digital systems need to incorporate soft error tolerance mechanisms.

This thesis discusses the potential and limitations of reliable overclocking as a viable technique to enhance mainstream system performance. Potential solutions that can overcome the limitations are explored. This thesis brings forth the thermal issues related to overclocking, and projects speculative reliable overclocking as a cost-efficient, competitive stopgap alternative to technology scaling.

The design of adaptive clocking mechanisms that aid in dynamic frequency switching without incurring any penalty cycles is also explored. This thesis also explores the possibility of coupling the timing error tolerance mechanism with transient, intermittent and permanent fault tolerance mechanisms to design highly reliable, high performance systems.

In this chapter, we presented an introduction to various innovative techniques that have improved computer system performance over the years. We also presented an introduction to fault tolerant computing, and introduced the reliability issues that affect VLSI circuits. We surveyed the various techniques that have been proposed in the past to improve system performance and reliability. We also looked at the on-going research work in power and thermal aware computing. In this chapter, we also introduced the better-than-worst-case design methodology and reliable overclocking that form the basis of the research work presented in this thesis.

In Chapter 2, a brief description of relevant work pertaining to the research reported in this thesis is presented. We analyze the issues presented by parameter variations in transistors, and how technology scaling exacerbates this problem. We then look at reliable overclocking in detail, explaining the timing error detection and recovery methodology, the timing error rate based feedback control system, and timing speculation. We briefly present the Razor [27] framework, which first introduced circuit-level timing speculation, and look at how Razor enabled energy savings by exploiting the data dependence of circuit delay. We also briefly present the metastability problem, and look at the circuit-level metastability mitigation technique presented as part of the Razor circuit. Finally, we present the SPRIT3E framework for reliably overclocking a superscalar processor. The SPRIT3E framework, developed by Mikel Bezdek as part of his Master's thesis [8], forms the primary basis on which the ideas for this dissertation work materialized.

The contributions of this dissertation are presented in the following chapters:

In Chapter 3, we present the limitation imposed by short paths in the circuit on reliable overclocking. We explore possibilities to manipulate the contamination delay of the circuit to maximize the performance gains achievable through reliable overclocking. To start with, we present a description of the clocking system used to generate the main clock and the backup clock that support speculative overclocking. We then present how contamination delay limits the extent of overclocking. To analyze the benefits of manipulating contamination delay in digital logic circuits to maximize the benefits of reliable overclocking, we did a case study on Carry Look Ahead (CLA) adders. CLA adders were chosen as they have very low contamination delay. We first increase the contamination delay, and then perform frequency scaling to operate the CLA adder at higher than worst-case speeds. Increasing contamination delay involves adding delays to the circuit, which increases its area and power overhead. To minimize this overhead, we look at the optimal value for contamination delay, since an increased contamination delay results in a higher number of timing errors for the same amount of overclocking than when the contamination delay is not increased.
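The arithmetic behind this limit is easy to state. The sketch below is a simplified reading of the Chapter 3 constraints, with invented delay values: the backup register samples a phase-shifted clock, the shift cannot exceed the contamination delay, and raising that delay toward half the propagation delay extends the reliable overclocking range.

```python
# The backup register samples on a clock phase-shifted by phi. Fast paths
# must not overwrite its input early, so phi <= t_cd (contamination
# delay); the backup stays timing-safe only if T + phi >= t_pd.
def min_reliable_period(t_pd, t_cd):
    phi = t_cd                       # largest safe phase shift
    return max(t_pd - phi, t_cd)     # aggressive period lower bound

t_pd = 10.0                          # worst-case propagation delay, ns
for t_cd in (1.0, 3.0, 5.0):         # raising t_cd toward t_pd / 2
    t_min = min_reliable_period(t_pd, t_cd)
    print(f"t_cd = {t_cd} ns -> min period {t_min} ns "
          f"({t_pd / t_min:.1f}x overclocking headroom)")
```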

In Chapter 4, we characterize the extent of performance enhancement achievable in computer systems by dynamically varying the operating frequency past worst-case limits. We present an analysis framework and discuss in detail the nuances of designing a reliably overclocked system. One of our key objectives is to see the effect of overclocking on superscalar processors for various benchmark applications, and to analyze the associated overhead in terms of extra hardware and error recovery penalty. We analyze the sensitivity of our technique's adaptiveness by exploring the necessary hardware requirements for dynamic overclocking and dynamic frequency scaling schemes. We exploit the data-dependent variance in circuit delay to achieve better-than-worst-case performance using an adaptive reliable overclocking methodology. Experimental analysis based on integer and floating point SPEC2000 benchmarks running on a SimpleScalar Alpha processor simulator, augmented with error rate data obtained from hardware simulations of a superscalar processor, is presented.

In Chapter 5, we analyze the temperature pattern of reliably overclocked systems, and evaluate the lifetime reliability of such reliable aggressive clocking mechanisms. We monitor the on-chip temperature of aggressively overclocked systems that dynamically enhance single threaded application performance. We couple thermal monitoring techniques with reliable overclocking to alleviate lateral issues relating to system power and reliability.

In Chapter 6, we discuss the results we obtained for speculative reliable overclocking and technology scaling. We compare reliable overclocking with technology scaling, and evaluate its competitiveness vis-à-vis technology scaling. We start with an overview of technology scaling, and then present speculative reliable overclocking as a bridge between two successive technology generations. This alternative has the potential to reduce time-to-market, and act as a stopgap technique for performance enhancement in between two technology generations, or help skip a technology generation altogether.

In Chapter 7, we present a conjoined duplex system approach to provide tolerance for the myriad hardware faults that plague modern computing systems. Our approach is capable of protecting both the datapath and control logic. Our conjoined pipeline system is capable of recovering from timing errors as well, thereby allowing a significant degree of overclocking. When coupled with a dynamic clock tuning mechanism based on a set target error rate, the system frequency adapts to application characteristics during run time. The concept of increasing the frequency and phase shifting the clocks makes sure that both the primary and redundant pipelines can run faster and that the second pipeline is timing safe. The CPipe architecture, pipeline datapath, and the error detection and recovery methodology are described in detail. We derive the relevant parameters that affect dynamic frequency scaling and the possible range of operating frequencies. Finally, we present the system implementation issues, and our implementation of a conjoined two-stage arithmetic pipeline and a conjoined five-stage DLX pipelined processor.

In Chapter 8, we conclude the research conducted in this dissertation and present the possible avenues this research work opens for the future.

CHAPTER 2. BACKGROUND

In this chapter, the background material required for understanding the rest of this thesis is presented. This chapter covers parameter variations and the need for self-tuning systems. Timing error avoidance mechanisms tune frequencies until timing errors start to happen. Reliable overclocking techniques allow timing errors to happen, and have inbuilt error tolerance mechanisms to guarantee functional correctness. Timing speculation aids aggressive systems by allowing data to be used speculatively. Ernst et al., in their work "Razor: A low-power pipeline based on circuit-level timing speculation" [27], introduced timing speculation and combined it with a timing error tolerance methodology to achieve energy efficiency. Our work, SPRIT3E [8, 92], extends reliable overclocking as a viable technique for performance enhancement in superscalar processors.

2.1 Parameter Variations

Ultra sub-micron process technologies force designers to adopt worst-case design methodologies that require safety margins to be added to individual system components to address parameter variations. Parameter variations include systematic and random variations in process, voltage, and temperature (PVT) [11]. These variations that affect propagation delay can be classified as physical variations, which are introduced during fabrication, and environmental variations, which are introduced during run-time [67]. Physical variations lead to both inter-die and intra-die variations. Environmental variations such as variations in temperature and power supply voltage also have an effect on the delay through any path.

During fabrication, the difficulties in minutely controlling the various processing steps contribute to process variations. Some of them, such as lithographic irregularities, are systematic in nature, while others, such as dopant fluctuations, lead to random effects [80]. Voltage fluctuations arise because of variations in power supply, and also because of change in the capacitive load being driven during run-time [10]. Temperature variations are a direct effect of the chip heating up and cooling down during operation.

Inter-die or die-to-die variations, which affect two different chips containing the same circuit, have led to the now very common semiconductor industry marketing technique called speed binning or frequency binning. Processor and memory manufacturers employ speed binning to test their products for specific timing capabilities and bin them according to their tested frequencies. Process variations contribute to 30% variation in chip frequency [11, 13]. Intra-die or within-die variations, which are dependent on design implementation, are mostly caused by variations in the effective gate length of transistors present in the same die. Die-to-die voltage fluctuations and within-die thermal variations also exacerbate this problem.

Parameter variations are becoming a key concern for circuit designers because they affect two key transistor parameters that dictate circuit performance: the threshold voltage, V_T, and the effective gate length, L_eff. Gate length (L) is the physical distance between the source (S) and drain (D) regions of a MOS transistor, shown in Figure 2.1, and when determined from actual transistor characteristics, it is referred to as "effective". Threshold voltage is dependent on temperature, and it affects both the frequency and leakage power.

Leakage power, also referred to as static power, in MOS transistors arises from gate leakage current and sub-threshold leakage current. As the gate oxide thickness, t_ox, scales with newer process technologies, the magnitude of the gate leakage current increases, contributing to increased leakage power. Process level techniques such as the use of high-k dielectrics have alleviated the gate leakage problem. The other dominant leakage mechanism is due to the drain-source sub-threshold current.

Sub-threshold current increases as threshold voltage decreases. As temperature varies, threshold voltage varies, leading to an exponential dependence of leakage power on operating temperature.

To account for these parameter variations, designers often assume delays 3σ above the typical delay.

The deviation of process, voltage and temperature parameters from nominal specifications can only be statistically estimated when fixing the frequency of a circuit.

Figure 2.1 Cross section of an n-channel MOSFET in the ON state showing channel formation. The channel exhibits pinch-off near the drain, indicating operation in the saturation (active) region.

2.2 Reliable Overclocking

In a reliably overclocked processor (ROP), to tolerate timing errors, registers in the critical paths of every pipeline stage are augmented with a second time-delayed register. A typical pipeline stage in such a processor, along with local timing error detection and recovery circuit augmentation for critical path registers, is shown in Figure 2.2. Each combinational logic stage is a dense logic combination with multiple inputs and outputs, and possibly with more than one path from each input to output. The short paths in the logic can operate correctly even during extreme voltage and/or frequency scaling.

The paths that are not likely to meet their timing requirements are categorized as critical paths and only their corresponding stage output registers are replaced with timing error detection and recovery circuits.

2.2.1 Timing Error Detection and Recovery

A brief description of how reliable overclocking is achieved is adapted from [8]: The main register is clocked ambitiously by the MAINCLK at a frequency higher than that required for error-free operation. The backup register is clocked in such a way that it is prevented from being affected by timing errors, and its output is considered "gold." The clock for this register is phase shifted, shown as PSCLK, such that the combinational logic is effectively given its full, worst-case propagation delay time to execute. In case of a mismatch between the primary and backup registers, a recovery measure is taken by correcting the current stage data and stalling the pipeline for one cycle. In addition to local recovery, action is also taken on a global scale to maintain correct execution of the pipeline in the event of a timing error. The extent to which systems can be overclocked is limited by the penalty cycles needed to recover from timing errors. A balance must be maintained between the number of cycles lost to error recovery and the gains of overclocking. The achievable performance enhancement per cycle, compared to the worst-case clock, WCCLK, is shown in Figure 2.3 as Φ2. WCCLK is shown only for comparison purposes, and is not required during operation. One important factor that needs to be addressed while phase shifting the PSCLK is to limit the amount of phase shift within the fastest delay path of the circuit. Chapter 3 deals in detail with how short paths limit frequency scaling and how this limitation can be overcome.

Figure 2.2 Typical pipeline stage in a ROP. Local timing error detection and recovery scheme for critical registers is shown in detail.
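Behaviorally, the per-stage mechanism reduces to comparing the speculatively latched result against the delayed "gold" copy. The following minimal Python sketch illustrates that compare-and-recover step; the function and value names are our own illustrative assumptions, not the RTL of the actual design:

# Behavioral sketch of local timing error detection and recovery
# (compare of main and backup registers); names are illustrative.

def stage_cycle(main_reg_value, backup_reg_value):
    """One cycle of a critical-path register pair in a ROP stage."""
    stage_error = (main_reg_value != backup_reg_value)
    if stage_error:
        # Local recovery: the "gold" backup value overwrites the
        # corrupt speculative value; the pipeline stalls one cycle.
        main_reg_value = backup_reg_value
    return main_reg_value, stage_error

value, err = stage_cycle(0x1F, 0x3F)   # mismatch: timing error detected
print(f"corrected value = {value:#04x}, stage error = {err}")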

Figure 2.3 Timing diagram showing overclocking advantage per cycle, as compared to the worst-case clock. In the diagram, Φ1 is the worst-case clock (WCCLK) period, Φ3 is the aggressive clock period, and Φ4 is the phase shift; Φ1 = Φ3 + Φ4 equals the circuit propagation delay, Φ2 = Φ4 is the overclocking advantage per cycle, and Φ4 must be less than the circuit contamination delay.

2.2.2 Timing Error Rate Based Feedback Control System

Reliably overclocking a processor may not yield an increase in performance at all times. The amount of aggressive overclocking is strongly influenced by the number of input combinations responsible for the longer timing-paths. As frequency is scaled higher dynamically, more input combinations result in errors. The percentage of clock cycles affected by errors impacts the performance: each time an error occurs, additional time is required to recover from that error. In addition, the occurrence of a timing error is highly dependent on the workload and the current operating conditions. It is therefore beneficial to have an adaptive clock tuning system, which increases or decreases the clock frequency based on a set target error rate.

In a ROP, dynamic clock frequency tuning is controlled by a global feedback system based on the total number of timing errors that occur in a specified time interval. The number of errors occurring in each timing error counter sampling interval is continuously monitored. As long as the number of errors is within the target limit, the frequency is scaled up; otherwise it is scaled down. The error rate can reasonably be assumed to be a monotonically increasing function of frequency. Adaptive clocking techniques for reliably overclocked systems are discussed in detail in Chapter 4.
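As a concrete illustration of this feedback policy, the following minimal Python sketch counts errors over a sampling interval and nudges the frequency against a target error rate. The controller structure, step size and constants are our own assumptions for illustration, not the actual hardware tuning logic:

# Sketch of timing-error-rate based feedback control; all names and
# constants are illustrative assumptions.

def tune_frequency(f_current, errors, cycles_sampled,
                   target_error_rate, f_step, f_base, f_max):
    """Return the clock frequency for the next sampling interval."""
    error_rate = errors / cycles_sampled
    if error_rate < target_error_rate and f_current + f_step <= f_max:
        return f_current + f_step   # errors within budget: scale up
    if error_rate > target_error_rate and f_current - f_step >= f_base:
        return f_current - f_step   # too many recoveries: back off
    return f_current                # otherwise hold the frequency

# Example: start at a 1 GHz worst-case clock, 1% target error rate.
f = 1.0e9
for errors in (0, 20, 5000, 800):            # per 100,000-cycle interval
    f = tune_frequency(f, errors, 100_000, 0.01, 20e6, 1.0e9, 1.5e9)
    print(f"next frequency: {f / 1e6:.0f} MHz")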

2.2.3 Timing Speculation

The most significant aspect that is exploited by reliable overclocking is the input data dependency of the worst-case delays. The worst-case delay paths are sensitized only for a specific few input combinations and data sequences [3]. Allowing a tolerable number of errors to occur, and incorporating an efficient mechanism to detect and recover from those errors, are the key elements [27, 92]. Based on this observation, numerous architectures have been proposed over the years. One of the earliest works on aggressive clocking, TEATIME [99], scales the frequency of a pipeline using dynamic timing error avoidance. This technique attempts to achieve better-than-worst-case performance by realizing typical delay operation rather than assuming worst-case delays and operating conditions.

When reliably overclocked, performance benefits can be seen only if computed data is used at overclocked speeds. Timing speculation is a technique wherein data generated at aggressive speeds is sent forward speculatively assuming error-free operation; when an error is detected, the forwarded data is voided and the computation is redone. Circuit-level speculation was proposed initially in Razor, and has been applied to superscalar processors in SPRIT3E. Figure 2.4 shows timing waveforms that depict pipeline stage level timing speculation. In the figure, inst0 moves forward speculatively without any timing errors. However, inst1 encounters a timing error in Stage i, indicated by the corrupted data "terr". This error is detected by the error detection mechanism, and the stage error signal is asserted. This stage error signal triggers both a local and a global recovery. Timing error recovery flushes the data sent forward speculatively, indicated in the figure as "xxx", and voids the computation performed by Stage i+1. Once the timing error is fixed, pipeline execution continues normally. The values Φ1, Φ2, Φ3 and Φ4 are explained in Figure 2.3.

Figure 2.4 Timing diagram showing pipeline stage level timing speculation

Apart from the run-time schemes, there are static methods that are specifically developed for better-than-worst-case architectures. BlueShift [33] proposes a ground-up design methodology for timing speculation. The main idea is to identify and optimize the frequently exercised critical paths, called the 'overshooters', at the expense of the less frequent ones. Timing speculation has been well studied in chip multiprocessors as well. Generally, these techniques couple two cores, such that one of them is sped up with the help of the other [32, 95, 88]. Other works in the domain seek to improve reliability and common case performance through functionally incorrect design [5, 54, 60].

The desire for better-than-worst-case designs is much more serious in nanoscale technology. Process, voltage and temperature variations within and across the die create a bottleneck when selecting the worst-case frequency. ReCycle uses additional registers and clock buffers to apply cycle time stealing in the pipeline, from faster stages to the slower ones [97]. Another technique, EVAL, has been proposed to maximize performance with low power overhead in the presence of timing induced errors [79].

2.3 Razor Architecture

Razor employs a dynamic voltage scaling (DVS) technique along with timing speculation and an error detection and correction mechanism to recover from timing errors, thereby eliminating the need for voltage margins and exploiting the data dependence of circuit delay. Razor permits operation at sub-critical voltages by tolerating circuit timing errors and guaranteeing correct operation of the processor pipeline.

As voltage is scaled lower and lower, the number of errors increases, resulting in increased power consumption and decreased instruction throughput because of the associated error recovery penalty. To counter this, Razor tunes the supply voltage based on error monitoring and feedback control, achieving significant power savings. Also, an error recovery technique based on a counterflow pipeline methodology is proposed.

In Razor, pipeline registers are augmented with a shadow latch, instead of a register. The working of a Razor flip-flop, consisting of a main register and a shadow latch, is similar to the description provided in Section 2.2.1. The shadow latch is clocked by a delayed version of the main register clock. The clock timing requirements guarantee that the shadow latch is not corrupted with an incorrect value when operating at better-than-worst-case conditions. This scheme is most suitable to deal with multiple bidirectional (0 to 1 and 1 to 0) errors [73]. To eliminate clock distribution overhead for the delayed clock, the shadow latch is clocked by the negative edge of the main register clock. Also, the minimum path constraints need to be taken care of; to this end, buffers are inserted during the synthesis stage with a power overhead of around 3%.

It is important for clocked flip-flops to respect setup and hold time criteria. Otherwise, metastability might occur, during which the flip-flop's output may take an indefinite period of time to settle down to its correct state, possibly oscillating multiple times between the stable states 0 and 1. To reduce power overhead and handle metastability issues in the main register, the Razor flip-flop has special circuit-level implementations, as shown in Figure 2.5.

Figure 2.5 Reduced overhead Razor flip-flop and metastability detection circuits (Figure reproduced from [27])

In a 64-bit Alpha processor, only 192 flip-flops out of 2048 required Razor augmentation. This amounted to a power overhead of 1%. The Razor architecture was analyzed at various levels: first, an 18x18 bit multiplier was implemented and analyzed on an FPGA. Then a C-level timing model of a Kogge-Stone adder was implemented with Razor timing details from SPICE analysis. This C-model was then integrated into the execution stage of the SimpleScalar simulator [4]. Overall, substantial energy savings of up to 62% were observed, with less than 3% impact on performance due to error recovery.

In [22], by generating an asymmetric clock, the duration of the positive clock phase is varied. An internal core frequency generator is also available, capable of generating clocks at frequencies from 60 MHz to 400 MHz in steps of 20 MHz; the duration of the positive clock phase is configurable from 0ns to 3.5ns in steps of 500ps.

Though a prototype circuit was fabricated, the hardware was only verified with simple programs.

Architectural simulations reflect only the energy savings for a Razor-augmented adder. The clock frequency is fixed before processor operation and is not changed at run time. The minimum path constraints problem is not fully addressed.

2.4 SPRIT3E Framework

The Superscalar PeRformance Improvement Through Tolerating Timing Errors (SPRIT3E) framework allows the clock frequency of a superscalar processor to be dynamically tuned to its optimal value, beyond the worst-case limit. The SPRIT3E framework mainly exploits the dependence of the critical path on input data, and makes the common case faster. Because the frequency is dynamically modified as the processor is running, variations in the environmental conditions, such as temperature and voltage, as well as variations present from fabrication, are automatically adjusted for. As frequency scales to higher values, timing errors will begin to occur. To prevent these errors from corrupting the execution of the processor, fault tolerance in the form of temporal redundancy is used. Specifically, pipeline stages are augmented with a local fault detection and recovery (LFDR) circuit. As frequency is scaled higher dynamically, more input combinations result in errors. Each time an error occurs, additional time is required to recover from that error. The error rate is monitored during run time, and based on a set tolerable error rate that does not affect the performance, the clock frequency is adjusted dynamically. The timing error mitigation scheme is similar to the one used in Razor [27].

The SPRIT3E technique applied to a superscalar processor is shown in Figure 2.6. To mitigate the timing errors, every pipeline stage is augmented with a second, time-delayed register. The LFDR circuit is highlighted in the figure. The main register is clocked ambitiously by the MAINCLK at a frequency higher than that required for error-free operation. The backup register is clocked in such a way that it is prevented from being affected by timing errors, and its output is considered "gold." The clock for this register is phase shifted, shown as PSCLK, such that the combinational logic is effectively given its full, worst-case propagation delay time to execute.

Figure 2.6 SPRIT3E framework

In addition to local recovery, action must be taken on a global scale to maintain correct execution of the pipeline in the event of a timing error. Mechanisms are identified to recover from errors in all pipeline stages [92]. Because errors are detected quickly and the recovery technique utilizes many existing paths through the processor, the area and performance overhead incurred from allowing timing errors to occur is kept to a minimum. The current approach keeps the area overhead for timing error detection low by reusing the combinational logic and by duplicating only critical pipeline registers. The circuitry added to perform global error recovery is modest, since the logic involved is not complex and reuses already existing signals in the pipeline. Overall, SPRIT3E provides a viable means of tolerating timing errors.

To gauge the performance improvements provided by the SPRIT3E framework, an initial sequence of experiments was performed. The first experiment was done on a simple multiplier circuit, and it established that significant room for improvement does indeed exist. Using a method such as dual latching to tolerate a small number of timing errors allowed the multiplier circuit to run at almost half the period, a speedup of 44%. The SPRIT3E framework was also evaluated in a DLX superscalar processor. The evaluation was done for three different benchmarks. Experimental results show that, on average, a performance gain of up to 57% across all benchmark applications is achievable.

As demonstrated by successful overclocking, the current practice for setting the frequency of synchronous circuits is far too conservative. At the same time, fault tolerance is necessary to ensure reliability if timing errors are not avoided with worst-case margins. The SPRIT3E framework addresses these problems with relatively simple additions to the superscalar pipeline. Only the pipeline registers are duplicated, and the large combinational logic blocks making up the stages are reused by utilizing temporal redundancy. Additional overhead comes from the error recovery logic, but this too may be kept modest by reusing existing pipeline signals whenever possible. All in all, the performance gained by operating at the optimal, sub-worst-case period more than justifies the overhead of the detection and recovery logic.

CHAPTER 3. MANIPULATING SHORT-PATHS FOR PERFORMANCE

The cardinal factor that limits frequency scaling, as described in the previous chapter for the Razor and SPRIT3E frameworks, is the contamination delay of the circuit. Contamination delay is the minimum amount of time from when the input to a logic block becomes stable and valid to the time that the output of that logic begins to change. The major hurdle imposed by the short-paths in the circuit is that the phase shift of the backup clock is limited by the contamination delay of the circuit. The phase shift of the delayed clock is restricted below the contamination delay to prevent an incorrect result from being latched in the backup register. Reliable execution can be guaranteed only if the contents of the redundant register are considered "golden". To overcome this limitation, it is important to increase the contamination delay of the circuit.

In this chapter, we evaluate the impact of short paths on reliable overclocking, and explore ways to manipulate the contamination delay of the circuit to maximize the performance gains achievable through reliable overclocking. To serve our purpose, we did a case study to evaluate the performance improvement in Carry Look Ahead (CLA) adders using our technique. CLA adders have very low contamination delay; we need to first increase their contamination delay to get any meaningful improvement in performance by operating them at higher than worst-case speeds. To minimize the area and power overhead associated with adding buffers to increase the contamination delay of the circuit, we look for the optimal value of contamination delay, since increasing contamination delay results in a higher number of timing errors for the same degree of overclocking than when the contamination delay is not increased. We further built an experimental setup to estimate the performance improvement of the new CLA adders with increased contamination delay. We observed that for circuits with higher propagation delay, there was a significant performance gain using our technique.

3.1 Impact of Short-paths

In order to support reliable overclocking, we aggressively decrease the clock period, while ensuring that the backup register is timing error free. This is achieved by time delaying the backup register clock signal, PSCLK, by the same amount as the difference between the worst-case clock (WCCLK) period and the aggressive clock (MAINCLK) period. To guarantee the integrity of data latched by the backup register, it is important to ensure that the input of the backup register changes only because of valid, timing error free data. To accomplish this, we need to increase the delay of all the paths in the circuit to at least the desired maximum phase shift of the PSCLK. To better understand this, let us look in detail at the clock timing requirements for adaptive reliable overclocking. The following discussion assumes dynamic overclocking based on a timing error based feedback mechanism. It is also possible to consider aggressive operation at one particular frequency; the timing requirements for such an operation can be obtained directly from the equations derived for adaptive reliable overclocking.

3.1.1 Timing Constraints

Figure 3.1 Clock timing waveforms showing the governing requirements for MAINCLK and PSCLK over the full range of overclocked aggressive frequencies (FMIN to FMAX)

To be able to reliably overclock a system dynamically using the SPRIT3E framework, the foremost requirement is to generate the MAINCLK and PSCLK. The two clocks are governed by certain timing requirements that are to be met at all times. Figure 3.1 depicts the two clocks, with respect to the WCCLK, for the full range of frequencies, FMIN to FMAX, that are possible when a system is dynamically overclocked beyond the worst-case operating frequency, FMIN. The two clock signals, MAINCLK and PSCLK, have the same frequency at all times, but they are out of phase by an amount determined by the extent of overclocking.

We define the following parameters to analyze the clock timing constraints for adaptive reliable overclocking:

• Let TPD denote the worst-case propagation delay of the circuit.

• Let TCD denote the contamination delay of the circuit.

• Let TWCCLK, TMAINCLK and TPSCLK represent the clock periods of WCCLK, MAINCLK and PSCLK respectively.

• Let TPS represent the amount of phase-shift between MAINCLK and PSCLK.

• Let TOV denote the overclocked aggressive time period.

At all times, the following equations hold.

T_{WCCLK} = T_{PD} = \frac{1}{F_{MIN}} \quad (3.1)

T_{MAINCLK} = T_{PSCLK} = T_{OV} \quad (3.2)

T_{PD} = T_{OV} + T_{PS} \quad (3.3)

From Figure 3.1, we observe and understand the following:

In the FMIN setting, there is no overclocking: TOV = TPD. In this case, there is no need to phase shift the PSCLK, so TPS = 0. The two clock signals are identical to the WCCLK.

The maximum possible frequency, FMAX, permitted by reliable overclocking is governed by TCD. This is because the short paths in the circuit, whose delay determines TCD, can corrupt the data latched in the backup register. From the FMAX setting shown in Figure 3.1, we observe the following: data launched by the MAINCLK at clock edge A is destined to be captured aggressively by the MAINCLK edge C and to be captured timing error free by the PSCLK edge D. However, if the phase shift TPS is greater than TCD, then the data launched at A can corrupt the backup register at PSCLK edge B. If such a corruption happens, then the backup register may latch an incorrect result and cannot be considered "golden". Hence, it is not possible to overclock further than FMAX. The following equations should hold at all times to guarantee reliable overclocking.

T_{PS} \le T_{CD} \quad (3.4)

F_{MAX} \le \frac{1}{T_{PD} - T_{CD}} \quad (3.5)

For any intermediate overclocked frequency FINT between FMIN and FMAX, TPS ≤ TCD. During operation, FINT is determined dynamically based on the number of timing errors observed during a specific duration of time.

The dependence of phase shift on contamination delay leads directly to the limitation on aggressive frequency scaling. A simplistic notion of the maximum speedup that is achievable through reliable overclocking is given by Equation 3.6.

\text{Maximum Speedup} = \frac{T_{PD}}{T_{PD} - T_{CD}} \quad (3.6)
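As a numerical illustration of Equations 3.5 and 3.6 (a sketch under the stated timing model only), the frequency and speedup bounds can be evaluated directly from TPD and TCD; the values below are those of the delay-added 32-bit CLA adder from Section 3.2:

# Sketch: frequency and speedup bounds from Equations 3.5 and 3.6.
# Delay values are from the 32-bit CLA adder case study (propagation
# delay 3.99ns; contamination delay 1.21ns after delay insertion).

def overclocking_bounds(t_pd_ns, t_cd_ns):
    f_min_ghz = 1.0 / t_pd_ns                     # worst-case frequency
    f_max_ghz = 1.0 / (t_pd_ns - t_cd_ns)         # Equation 3.5
    max_speedup = t_pd_ns / (t_pd_ns - t_cd_ns)   # Equation 3.6
    return f_min_ghz, f_max_ghz, max_speedup

f_min, f_max, s = overclocking_bounds(3.99, 1.21)
print(f"FMIN = {f_min:.3f} GHz, FMAX = {f_max:.3f} GHz, "
      f"maximum speedup = {s:.2f}x")              # about 1.44x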

3.1.2 Variable or Fixed Phase Shift

Until now, we have been considering a variable phase shift to generate the PSCLK. In other words, to generate a new frequency of operation, we change both the frequency and the phase shift. However, it is possible to maintain a fixed phase shift between the MAINCLK and the PSCLK while generating the aggressive frequencies of operation.

Under the fixed phase shift technique, the following changes are required:

• Equation 3.3 changes to TPD ≤ TOV + TPS.

• For FMIN setting, TOV = TPD and TPS = TCD.

• For FINT setting, TPD − TCD ≤ TOV ≤ TPD and TPS = TCD.

• For FMAX setting, TOV = TPD − TCD and TPS = TCD.

Figure 3.2 Examples of Main and PS clocks with variable and fixed phase shifts: (a) Variable Phase Shift; (b) Fixed Phase Shift

Figure 3.2 shows three possible ways of generating the two clocks, either with variable phase shift or fixed phase shift, when the worst-case propagation delay is 10ns and the contamination delay is 3ns. In Case I, there is no frequency scaling, and the clock period of the MAINCLK is equal to the propagation delay. In Case II, the clock period of the MAINCLK is scaled to 9ns. Case III shows the maximum possible overclocking; in this case, the clock period of the MAINCLK is 7ns. When the variable phase shift technique is used, the phase shift of the PSCLK varies from 0 in Case I, to Φ1 = 1ns in Case II, and finally to Φ2 = 3ns in Case III. In the fixed phase shift approach, the phase shift of the PSCLK is fixed at Φ = 3ns for all three cases.
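The two schemes can be checked against the numbers in Figure 3.2 with a few lines of Python; this is only a sketch of the arithmetic, assuming TPD = 10ns and TCD = 3ns as in the figure:

# Sketch: PSCLK phase shift for the three cases of Figure 3.2 under
# the variable and fixed phase shift schemes (TPD = 10ns, TCD = 3ns).

T_PD, T_CD = 10.0, 3.0   # ns

def phase_shift(t_ov_ns, scheme):
    if scheme == "variable":
        return max(0.0, T_PD - t_ov_ns)   # T_PS = T_PD - T_OV
    return T_CD                           # fixed at the contamination delay

for case, t_ov in (("I", 10.0), ("II", 9.0), ("III", 7.0)):
    print(f"Case {case}: T_OV = {t_ov:4.1f}ns, "
          f"variable T_PS = {phase_shift(t_ov, 'variable'):.1f}ns, "
          f"fixed T_PS = {phase_shift(t_ov, 'fixed'):.1f}ns")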

3.1.3 Manipulating Contamination Delay

From the above discussion, we understand that short paths severely limit the extent of reliable overclocking in circuits. This problem is compounded, as most circuits have a significantly lower contamination delay. For instance, a 32-bit CLA adder circuit, implemented in the 0.18um Cadence Generic Standard Cell Library (GSCLib), has a propagation delay of 3.99ns, but an insignificant contamination delay of 0.06ns, thus allowing almost no performance improvement through reliable overclocking.

Since contamination delay limits performance improvement, it might be worthwhile to redesign the logic and increase the contamination delay. Redesigning circuits from the ground up for reliable overclocking is beyond the scope of this thesis. In this thesis, we look at manipulating short paths by increasing their delay to a threshold value determined by the performance requirements.

Increasing the delay of all the paths in the circuit above a desired lower bound, while not affecting the critical path of the circuit, is one of the steps performed during synthesis of sequential circuits to fix hold time violations. For a signal to be latched correctly in a flip-flop on the clock edge, the signal should be stable for at least a specific amount of time before and after the clock edge, called the setup time and the hold time respectively. Clock skew, which is the difference in clock arrival times at the source and destination flip-flops, also exacerbates hold time requirements in sequential circuits. Hold time violations occur when the previous data at the input of the destination flip-flop is not held long enough to be latched properly. The data can change during the hold time window if the contamination delay of the circuit is less than the hold time requirement at the destination flip-flop. The hold time requirement for a sequential circuit is normally a very small fraction of the propagation delay of the circuit. Hence, adding buffers to short-paths that violate hold time criteria is a step that is done without much concern regarding area and power overheads.

However, increasing the contamination delay of a logic circuit significantly, sometimes to as much as half the propagation delay, without affecting its propagation delay is not a trivial issue [84]. At first glance, it might appear that adding delay by inserting buffers on the shortest paths will solve the problem. However, the delay of a circuit is strongly input dependent, and several inputs play a role in deciding the value of an output in a particular cycle. Current synthesis tools support increasing the delay of short paths through their hold violation fixing option; in a broader sense, what we essentially want to do is to extend the hold time of the backup register.

To guarantee the correct working of the SPRIT3E framework, the phase shift, TPS, cannot be more than half the propagation delay of the circuit. This restriction comes from Equations 3.2 and 3.3. The phase shift, TPS, is at most equal to the overclocked time period, TOV. If TPS is greater than TOV, and TOV is less than half the propagation delay of the circuit, then Equations 3.2 and 3.3 cannot hold together. The following equation should hold at all times to guarantee reliable overclocking.

T_{PS} \le \frac{T_{PD}}{2} \le T_{OV} \le T_{PD} \quad (3.7)

Figure 3.3 Timing waveforms after increasing contamination delay to half the propagation delay, for the full range of overclocked aggressive frequencies (FMIN to FMAX)

Overclocking to half the original clock period is possible only if the contamination delay is at least half the propagation delay of the circuit. From Equations 3.4 and 3.7, we see that to support reliable overclocking, the contamination delay of the circuit can be increased up to half the propagation delay of the circuit. Figure 3.3 shows the clock waveforms for MAINCLK and PSCLK after increasing the contamination delay to its highest beneficial value.

3.2 Increasing Contamination Delay of a CLA Adder Circuit - A Case Study

To show that it is possible to increase contamination delay without affecting the propagation delay, we experimented on a CLA adder circuit. Our experiments indicate that by carefully studying the input-output relationship of a given circuit, it is possible to overcome the limitation imposed by contamination delay on our technique. The following case study presents our experiments and the results we achieved for a CLA adder circuit.

Figure 3.4 8-bit CLA adder

Let us first consider the 8-bit CLA adder shown in Figure 3.4. The propagation delay of the circuit is estimated to be 1.06ns, and the contamination delay 0.06ns. We synthesized the circuit using the Cadence BuildGates Synthesis tool in Physically Knowledgeable Synthesis (PKS) mode. We used the 0.18um Cadence Generic Standard Cell Library (GSCLib) for timing estimation.

Figure 3.5 Delay distribution for an 8-bit CLA adder

Figure 3.5 shows the delay distribution of all possible timing paths from any input to any output in an 8-bit CLA adder circuit, considering both rising and falling transitions. From Figure 3.5 it can be seen that just about 20% of the paths have a delay of more than 0.75ns. Though this is highly motivating and provides a strong reason to apply our technique, the 0.06ns contamination delay acts as a dampener, and we risk incorrect operation if the clock period is reduced below 1ns. This is because, once the input to the adder stabilizes, the output starts changing after 0.06ns, the contamination delay of the circuit. This will cause an incorrect sum to be latched by the redundant register at the first rising edge of the PSCLK following the rising edge of the MAINCLK. Reliable execution is guaranteed only if the data latched in the redundant register is considered "golden".

Figure 3.6 8-bit CLA adder with additional delay blocks to increase contamination delay

Figure 3.6 shows the new CLA adder circuit. A chain of inverters forms each delay block. As seen in the figure, there are now three different circuits that compute the Sum, Propagate, and Generate bits, called L-Type, I-Type, and M-Type. The main difference is where the additional delay is placed; the amount of delay can also be different.

By varying the number of I-Types, and increasing or decreasing the number of L-Types and M-Types, the increase in contamination delay can be controlled. Each inverter has a delay in the range of 0.06-0.08ns. For an 8-bit adder, ten inverter delay blocks are sufficient to increase the contamination delay to a significantly higher value. Increasing contamination delay beyond a certain point is not beneficial, however, as it will push most of the timing paths to higher delay values, and more timing errors will occur when the frequency is scaled higher.

Figure 3.7 Delay distribution for an 8-bit CLA adder after increasing contamination delay

Figure 3.7 shows the new delay distribution. The contamination delay of the circuit is now 0.37ns, while the propagation delay remains unchanged at 1.06ns. Now 31% of the timing paths have a delay value greater than 0.75ns. Though we can possibly phase shift our PSCLK by 0.35ns, reducing the clock period by that amount results in a higher number of errors. Having control over the increase in contamination delay gives us an advantage: we can tune the circuit's frequency to the optimal value depending on the application and the frequency of occurrence of certain input combinations. Introducing delay to increase contamination delay increases the area of the circuit. Therefore, judiciously increasing contamination delay makes sure that the increase in area is kept minimal.

Using the same technique to add additional delays, we increased the contamination delay of 32-bit and 64-bit CLA adders.

Table 3.1 provides the relevant details about the implementation of our technique in 8-bit, 32-bit and 64-bit CLA adder circuits. Since enhancing contamination delay without affecting propagation delay was our main goal, our initial implementation did not take into consideration any technique for minimizing the area.

Table 3.1 Implementation details of CLA adder circuits

Adder Width |           Original            |          Delay Added
            | TCD (ns)  TPD (ns)  Area (µm²) | TCD (ns)  TPD (ns)  Area (µm²)
8           | 0.06      1.06      304        | 0.35      1.06      928
32          | 0.06      3.99      1216       | 1.21      3.99      14528
64          | 0.06      7.89      2432       | 1.82      7.89      47752

3.2.1 Analysis of Reliable Overclocking Performance

To estimate the performance improvement achievable using our technique, we performed a series of experiments on ALU circuits. An initial study of a multiplier circuit [8] provided compelling reasons to further explore the possibilities of implementing the technique in circuits with low contamination delay. The study also reinforced our conviction that significant performance improvement is possible using overclocking.

Figure 3.8 Experimental setup to estimate performance improvement of CLA adder circuits

Not all circuits are as amenable to overclocking as the multiplier circuit. The CLA adder circuit discussed in Section 3.2 provides a good platform to study the various factors that influence overclocking. A different experimental setup, as seen in Figure 3.8, was built to estimate the performance improvement of the adder. This time FPGAs were not used, because of the routing delays and the difficulty involved in adding delays to the circuit. As before, two linear feedback shift registers provide random inputs to the adder. The MAINCLK and PSCLK have the same frequency. As the frequency of the clocks is scaled higher and higher, the phase shift also increases to provide the full propagation delay before the result gets latched in the redundant register. The outputs of the primary and redundant registers are compared every cycle, and the error counter is incremented for each error observed. It is also necessary to verify the correctness of the result stored in the redundant register. The test bench internally computes the sum, and that value is compared with that of the redundant register. Another counter is used to log the number of incorrect results computed. Only the adder circuit is simulated, with timing information obtained from the Cadence GSCLib. The experiment is then run for 10,000 cycles at different frequencies, and the error rate is observed after every run.
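The measurement loop of Figure 3.8 can be summarized in a short behavioral sketch. The delay model below is a crude stand-in (a random delay between the contamination and propagation delays); the actual experiments use gate-level, SDF-annotated simulation, so the numbers this sketch prints are illustrative only:

# Behavioral sketch of the error-rate measurement loop in Figure 3.8.
import random

def path_delay_ns(a, b):
    # Stand-in for the input-dependent delay of the 8-bit adder
    # (assumption): somewhere between TCD (0.35ns) and TPD (1.06ns).
    return 0.35 + 0.71 * random.random()

def percent_error_cycles(clock_period_ns, cycles=10_000, seed=1):
    random.seed(seed)
    errors = 0
    for _ in range(cycles):
        a, b = random.getrandbits(8), random.getrandbits(8)  # LFSR stand-in
        if path_delay_ns(a, b) > clock_period_ns:
            errors += 1   # main register latched before the output settled
    return 100.0 * errors / cycles

for period in (1.06, 0.9, 0.8):
    print(f"{period}ns clock: {percent_error_cycles(period):.2f}% error cycles")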

Figure 3.9 Percent of error cycles versus clock period for an 8-bit delay added CLA adder circuit

Figure 3.9 shows the percentage of cycles affected by errors as frequency is scaled higher for an 8-bit CLA adder circuit. The figure also highlights the worst-case clock period, and the clock period at which incorrect results are detected for the first time. The worst-case delay of the circuit is 1.06ns and the contamination delay is 0.35ns. When the clock period is 0.8ns, 33.26% of clock cycles are affected by errors. Until this point there is no incorrect result stored in the redundant register. As we scale further, the phase shift exceeds the contamination delay. The number of incorrect results latched in the redundant register is almost 98% when the clock period is 0.7ns. Frequency scaling beyond this point will result in unreliable execution.

Figure 3.10 Percent of error cycles versus clock period for a 32-bit delay added CLA adder circuit (Contamination delay 1.21ns)

Figure 3.11 Percent of error cycles versus clock period for a 32-bit delay added CLA adder circuit (Contamination delay 1.38ns)

Figure 3.10 and Figure 3.11 present the error rate at various frequencies for 32-bit CLA adder circuits with different contamination delay values. From the figures, we observe that increasing contamination delay beyond a point results in a higher error rate. As contamination delay is increased from 1.21ns to 1.38ns, the number of cycles affected by errors at a 3ns clock period goes from 10.07% to 55.19%. This illustrates the need to increase contamination delay optimally.

Figure 3.12 Percent of error cycles versus clock period for a 64-bit delay added CLA adder circuit

Finally, Figure 3.12 shows the percentage of errors detected in a 64-bit CLA adder operating at higher than worst-case speeds.

From the different adder experiments, we observe that as the propagation delay of a circuit goes up, the clock period can be reduced further and further. In the case of the 64-bit CLA adder, even after the phase shift is increased beyond the contamination delay (1.82ns), there is no incorrect sum detected. Only when the clock period is reduced to 5ns are incorrect results observed. The reason for this is that the shorter paths responsible for the contamination delay are not being exercised by the inputs to the adder. However, increasing the phase shift beyond the contamination delay does not guarantee reliable execution for all possible inputs.

CHAPTER 4. CHARACTERIZING ADAPTIVE RELIABLE OVERCLOCKING

In this chapter, we explore timing speculation based reliable overclocking and evaluate the various factors that impact reliable overclocking. We present a design methodology that provides dynamically controllable knobs to designers to balance the parameters of interest.

Figure 4.1 Alpha 21264 integer and floating point pipeline showing the timing error detection and recovery circuit for critical registers

In a reliably overclocked processor (ROP), to tolerate timing errors, registers in the critical paths of every pipeline stage are augmented with a second time-delayed register. Figure 4.1 shows an enhanced Alpha 21264 pipeline that is capable of supporting speculative reliable overclocking. A typical pipeline stage in such a processor, along with local timing error detection and recovery circuit augmentation for critical path registers, is shown in the figure. Reliably overclocking a processor may not yield an increase in performance at all times; the reason being that the occurrence of a timing error is highly dependent on the workload and the current operating conditions. The amount of frequency scaling is strongly influenced by the number of input combinations responsible for the longer timing-paths. As frequency is scaled higher dynamically, more input combinations result in errors. Each time an error occurs, additional time is required to recover from that error.

In this chapter, we analyze target error rate values and evaluate adaptive clock tuning systems, which increase or decrease the clock frequency based on a set target error rate. We monitor the error rate during run time using various error sampling techniques, and based on a set tolerable error rate that does not affect the performance, we adjust the clock frequency dynamically. The fact that dynamic clock tuning comes at a significant runtime cost has led to the use of two clock generators, enabling clock tuning in near zero time. In this chapter, we analyze the benefits of using single and dual clock generators, and various clock tuning techniques. We further evaluate the importance of a faster memory alongside a faster processor. We present our results for SPEC2000 integer and floating point benchmarks executing on a SimpleScalar Alpha processor, augmented with error rate profiles obtained from an Alpha processor hardware model.

4.1 Evaluating Speculative Reliable Overclocking

In Chapter 3, we determined the limits of frequency scaling, and the importance of manipulating short paths to maximize gains. Let us now look at the other important factor that affects reliable overclocking. The number of errors that occur at overclocked frequencies plays a significant role in determining the extent of overclocking. As frequency is scaled higher, the number of input combinations that result in delays greater than the new clock period also increases. Each error takes additional cycles to recover from, resulting in diminishing returns at higher operating frequencies. Hence, deciding on the target error rate is the foremost step in speculative reliable overclocking.

Let us first analyze the impact of error rate:

Let TWC denote the original worst-case clock period.

Let TOV denote the clock period after aggressive frequency scaling.

Let TGAIN be the time difference between the original clock period and the aggressive clock period.

T_{GAIN} = T_{WC} - T_{OV} \quad (4.1)

Let δ denote the frequency scaling factor.

\delta = \frac{T_{WC}}{T_{OV}} \quad (4.2)

If a particular application takes CTOT clock cycles to execute, then the total execution time is reduced by TGAIN × CTOT if there are no errors.

Let CREC be the number of cycles needed to recover from an error.

Let ε denote the fraction of clock cycles, out of a total CTOT cycles, affected by errors due to overclocking.

To achieve any performance improvement, Equation 4.3 must be satisfied. Equation 4.3 states that as long as the error recovery overhead is less than the reduction in execution time, a reliably overclocked system performs better than a non-overclocked system.

\varepsilon \times C_{TOT} \times C_{REC} \times T_{OV} < C_{TOT} \times T_{GAIN} \quad (4.3)

Equation 4.4 provides an upper bound for the number of errors that can be tolerated. This upper bound is inversely proportional to the error recovery penalty. Equation 4.4 is for a system that is always overclocked at a constant frequency.

\varepsilon < \frac{T_{GAIN}}{C_{REC} \times T_{OV}} \quad (4.4)

For adaptive systems, in addition to the timing error recovery overhead that scales along with frequency scaling, there is also a frequency switching time, TTUNE , consumed by the clock generator’s frequency synthesizer to generate the new frequency. Therefore, it is necessary to make sure that the time gained through overclocking is more than the total losses incurred. Considering the frequency switching penalty, Equation 4.4 becomes Equation 4.5. Here, CSAMP refers to the number of cycles in an error sampling interval before frequency switching is triggered.

\varepsilon < \frac{T_{GAIN}}{C_{REC} \times T_{OV}} - \frac{T_{TUNE}}{C_{SAMP} \times C_{REC} \times T_{OV}} \quad (4.5)
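A minimal sketch of these two bounds, with invented parameter values purely for illustration:

# Sketch: tolerable error-rate bounds from Equations 4.4 and 4.5.
# All parameter values below are illustrative assumptions.

def eps_bound_constant(t_wc, t_ov, c_rec):
    t_gain = t_wc - t_ov                                # Equation 4.1
    return t_gain / (c_rec * t_ov)                      # Equation 4.4

def eps_bound_adaptive(t_wc, t_ov, c_rec, t_tune, c_samp):
    return (eps_bound_constant(t_wc, t_ov, c_rec)
            - t_tune / (c_samp * c_rec * t_ov))         # Equation 4.5

# 1ns worst-case clock, 0.8ns aggressive clock, 1-cycle recovery,
# 10us frequency switching time, 100,000-cycle sampling interval.
print(eps_bound_constant(1.0e-9, 0.8e-9, 1))                 # 0.25
print(eps_bound_adaptive(1.0e-9, 0.8e-9, 1, 10e-6, 100_000)) # 0.125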

An application can now be partitioned into smaller time intervals, and the frequency for every subsequent interval can be determined by the number of errors being observed at the current overclocked frequency. In order to maximize gains, TTUNE should be minimized or the error sampling interval should be made much longer. If we consider an application that takes CTOT clock cycles to execute, and partition it into n sampling intervals, each consisting of CSAMP cycles, then Equation 4.6 presents the overall aggressive execution time, EXOV. Here, each sampling interval has a different clock period and error rate. If we assume that the system starts running initially at the worst-case clock period, then TOV1 = TWC. Each sampling interval includes a frequency switching time.

EX_{OV} = (C_{SAMP} + \varepsilon_1 \times C_{SAMP} \times C_{REC}) \times T_{OV_1} + \ldots + (C_{SAMP} + \varepsilon_n \times C_{SAMP} \times C_{REC}) \times T_{OV_n} + n \times T_{TUNE} \quad (4.6)

The total execution time for a non-overclocked system that executes an application from start to finish at the worst-case clock period is given by Equation 4.7. The term TTUNE is included to account for the one-time clock generation time.

EX_{WC} = C_{TOT} \times T_{WC} + T_{TUNE} \quad (4.7)

The execution cycles for a pipelined processor are mainly divided into cycles for instruction execution, memory and branches. During overclocking, the number of execution cycles may go up depending on timing errors, as already discussed. Also, in a computation, it is possible that when the clock frequency is scaled there is an increase in the total number of execution cycles.

For instance, in a pipelined processor, when the processor accesses memory, the number of clock cycles taken for that memory operation increases when the frequency is scaled, if the clock frequency of the memory remains constant. Consider a processor whose clock period is 10ns, and a memory access that took 20 CPU cycles. If after scaling the clock period is reduced to 5ns, then the same memory access would take 40 CPU cycles. Thus, if the clock frequency of the memory is not scaled along with that of the CPU, there will be an increase in memory cycles. The branch penalty cycles, however, remain unaffected.

Analytically, if each memory operation takes CM cycles at TWC, then by scaling the clock by a factor of δ each memory operation will now take δ ×CM cycles. However, this is strongly dependent on the number of memory bound instructions.

Let γ denote the fraction of memory accesses per cycle.

If we consider that CTOT excludes memory cycles, then the aggressive execution time for any particular sampling interval is given by Equation 4.8. Equation 4.8 accounts for the frequency switching penalty for that sampling interval. Equation 4.9 presents the overall execution time.

EX_{OVINT} = C_{SAMP} \times T_{OV} + \gamma \times C_{SAMP} \times \delta \times C_M \times T_{OV} + \varepsilon \times C_{SAMP} \times C_{REC} \times T_{OV} + T_{TUNE} \quad (4.8)

EX_{OV} = EX_{OVINT_1} + EX_{OVINT_2} + \ldots + EX_{OVINT_n} \quad (4.9)

By replacing TOV by TWC and substituting δ = 1 and ε = 0 in Equation 4.8, we get the worst-case runtime for a sampling interval, as given by Equation 4.10. By replacing CSAMP by CTOT, we get the overall worst-case execution time, EXWC. The one-time clock generation overhead can be added to the EXWC estimate.

EX_{WCINT} = C_{SAMP} \times T_{WC} + \gamma \times C_{SAMP} \times C_M \times T_{WC} \quad (4.10)
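The per-interval model of Equations 4.8 and 4.10 is straightforward to evaluate; the following sketch does so with invented parameter values (the ratio it prints anticipates the speedup metric of Section 4.1.1):

# Sketch: per-interval execution times from Equations 4.8 and 4.10.
# Parameter values are placeholders chosen only for illustration.

def ex_ov_interval(c_samp, t_ov, gamma, delta, c_m, eps, c_rec, t_tune):
    return (c_samp * t_ov
            + gamma * c_samp * delta * c_m * t_ov   # stretched memory cycles
            + eps * c_samp * c_rec * t_ov           # error recovery cycles
            + t_tune)                               # Equation 4.8

def ex_wc_interval(c_samp, t_wc, gamma, c_m):
    return c_samp * t_wc + gamma * c_samp * c_m * t_wc   # Equation 4.10

t_wc, t_ov = 1.0e-9, 0.8e-9
delta = t_wc / t_ov                                      # Equation 4.2
ov = ex_ov_interval(100_000, t_ov, 0.02, delta, 20, 0.01, 1, 0.0)
wc = ex_wc_interval(100_000, t_wc, 0.02, 20)
print(f"interval speedup = {wc / ov:.3f}")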

Although overclocking improves performance, it also increases the switching activity of the circuits. This causes more dynamic power dissipation. As Equation 4.11 illustrates, we see a factor of δ increase in the dynamic power consumed because of overclocking. Here, α, C and V are the switching activity factor, circuit capacitance and system voltage, respectively. Leakage power is discussed in Section 4.2.2.

P_{OV} = \frac{\alpha \times C \times V^2}{T_{OV}} + P_{LEAK} = \frac{\delta \times \alpha \times C \times V^2}{T_{WC}} + P_{LEAK} \quad (4.11)

4.1.1 Performance Metrics

In order to evaluate the performance of speculative reliable overclocking, we derive the following performance metrics:

The speedup achieved from reliable overclocking in a sampling interval is obtained by dividing EXWCINT by EXOVINT. This is given by Equation 4.12. The overall speedup is given by Equation 4.13.

\text{Speedup} = \frac{EX_{WCINT}}{EX_{OVINT}} = \frac{\delta \times (1 + \gamma \times C_M)}{(1 + \gamma \times \delta \times C_M + \varepsilon \times C_{REC}) + T_{TUNE}} \quad (4.12)

\text{Overall Speedup} = \frac{EX_{WC}}{EX_{OV}} = \frac{EX_{WC}}{EX_{OVINT_1} + EX_{OVINT_2} + \ldots + EX_{OVINT_n}} \quad (4.13)
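Chaining the intervals gives the overall speedup. A self-contained sketch of Equations 4.6, 4.7 and 4.13, ignoring memory cycles and using invented interval data, with CREC = 1:

# Sketch: overall speedup across sampling intervals (Equations 4.6,
# 4.7 and 4.13). Interval periods and error fractions are invented.

t_wc, c_samp, c_rec, t_tune = 1.0e-9, 100_000, 1, 1.0e-6

intervals = [  # (T_OV in seconds, error fraction in that interval)
    (1.0e-9, 0.000), (0.9e-9, 0.002), (0.8e-9, 0.012), (0.85e-9, 0.006),
]

ex_ov = sum((c_samp + eps * c_samp * c_rec) * t_ov + t_tune
            for t_ov, eps in intervals)                  # Equation 4.6
ex_wc = len(intervals) * c_samp * t_wc + t_tune          # Equation 4.7
print(f"overall speedup = {ex_wc / ex_ov:.3f}")          # Equation 4.13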

Normally, for pipeline stage level timing speculation designs, the recovery is made in the following cycle, i.e., CREC = 1. From Equation 4.12, we can understand that for optimal performance enhancement, the workload should have the right mixture of memory bound and CPU bound instructions. Practically, γ is quite small, as several methods exist in the literature for shadowing memory operations, such as caching and buffering.

Traditionally, power and delay have been the two most important specifications for digital systems and microprocessor designs. There are numerous metrics in use presently, all of which are primarily based on these two factors. Each metric assigns different weights to the two terms based on the design goals. The Power-Delay Product (PDP), or the energy, is a widely used metric for older technologies (> 180nm). With the increase in leakage power in the deep submicron era, PDP may not prove to be the best option. Also, PDP is technology dependent, necessitating a newer metric for comparing across different technologies. An improved metric called the Energy-Delay Product (EDP) was later developed. This gives a quadratic weight to delay compared to energy. In other words, EDP measures how fast a given circuit operates while consuming minimal energy.

The energy or PDP for an overclocked system is given by Equation 4.14. The EDP for an overclocked system is calculated from Equation 4.15. By replacing TOV with TWC and EXOV with EXWC in Equations 4.11, 4.14 and 4.15, the corresponding metrics for a non-overclocked system are calculated.

PDP_{OV} = \frac{\alpha \times C \times V^2}{T_{OV}} \times EX_{OV} + P_{LEAK} \times EX_{OV} = P_{OV} \times EX_{OV} \quad (4.14)

EDP_{OV} = \frac{\alpha \times C \times V^2}{T_{OV}} \times EX_{OV}^2 + P_{LEAK} \times EX_{OV}^2 = PDP_{OV} \times EX_{OV} \quad (4.15)
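The following sketch evaluates Equations 4.11, 4.14 and 4.15 for a single overclocked run; the capacitance, voltage, activity and leakage values are invented placeholders, not measured quantities:

# Sketch: power, PDP and EDP for an overclocked run (Equations 4.11,
# 4.14 and 4.15). All electrical values are illustrative assumptions.

alpha, C, V, p_leak = 0.2, 2.0e-9, 1.0, 5.0   # activity, F, volts, watts
t_ov, ex_ov = 0.8e-9, 1.0                     # clock period (s), runtime (s)

p_ov = alpha * C * V**2 / t_ov + p_leak       # Equation 4.11
pdp_ov = p_ov * ex_ov                         # Equation 4.14 (energy, J)
edp_ov = pdp_ov * ex_ov                       # Equation 4.15 (J*s)
print(f"P = {p_ov:.2f} W, PDP = {pdp_ov:.2f} J, EDP = {edp_ov:.2f} J*s")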

4.2 Analysis Framework

To evaluate the effectiveness of speculative reliable overclocking, we develop an analysis framework that allows us to analyze the impact of various factors on the extent of reliable overclocking. We modify an existing model of a microprocessor, add the necessary features that help us to study the benefits of speculative reliable overclocking, and provide an effective framework for understanding the performance gains.

Figure 4.2 presents the entire analysis methodology. The figure depicts a SimpleScalar Alpha processor simulator, in combination with a Wattch power model and a HotSpot thermal model. The individual components are explained below in detail. The power model takes in technology specific parameters to compute active and leakage power every cycle. The thermal model provides block level temperature based on the chip floor plan and run-time power consumption information. A hardware model of an Alpha processor implemented in 45nm technology is used to enable overclocking dependent error injection. During normal execution, without any overclocking, the clock controller provides a single technology dependent base frequency. When reliable overclocking is enabled, timing error based feedback control is activated. Adaptive clock tuning techniques are employed to adapt system behavior based on workload characteristics.

Our base processor, an out-of-order 64-bit Alpha EV6 processor, is derived from the SimpleScalar-Alpha tool set [17] and executes the Alpha AXP ISA. Table 4.1 provides the configuration details of its principal features. The processor configuration remains consistent with technology scaling, and we evaluate the same base processor across different technologies. However, the area estimate varies, and this is captured by the floor plan, which is provided as an input to the thermal model.


Figure 4.2 Simulation framework

4.2.1 Modeling a Reliably Overclocked Processor (ROP)

To evaluate the capabilities of speculative reliable overclocking, we modeled a reliably overclocked processor using a functional simulator, which incorporates a random timing error injector based on error profiles obtained by running application binaries on a hardware model. When reliably overclocked, we dynamically tune the clock frequency based on the number of errors occurring during a predetermined time interval and the target error rate.

In order to bring timing error behavior into the SimpleScalar Alpha simulator, which is cycle accurate but not timing accurate, we analyzed the number of timing errors occurring in the hardware model of a superscalar processor. For this purpose, we analyzed the error rate in the different pipeline stages of a superscalar, dynamically scheduled integer pipeline similar in complexity to the Alpha 21264 [101] that executes a subset of the Alpha instruction set. Our analysis was performed on each stage in isolation, because the hardware model of the processor is not fully synthesizable and we require a synthesizable model to get gate-level timing information. We therefore synthesized individual pipeline stages using the 45nm OSU standard cell library [91]. Once we obtained the synthesized blocks for a particular stage, we replaced the RTL model for that block with the synthesized model and evaluated each pipeline stage independently. We annotated the timing information, extracted in standard delay format (SDF), onto the blocks to run timing accurate simulations.

Feature                            Specifications
Fetch/decode/issue/commit width    4/4/4/4 instructions/cycle, out-of-order execution
Functional units                   4 integer ALUs; 1 integer multiplier/divider;
                                   4 floating point ALUs; 1 floating point multiplier/divider
Branch predictor                   8K-entry bimodal
Branch target buffer               512 sets, four-way set associative
L1 instruction cache size          64 KB
L1 data cache size                 64 KB
L2 unified cache size              2 MB

Table 4.1 Processor specifications

We ran the instruction profiles of various integer benchmarks, obtained from the SimpleScalar simulator, through the various stages. We used random data values for the other inputs and filled the memory with random data. We explain below in detail how we obtained the integer pipeline error profile. To measure the error rate in floating point computation, we evaluated a floating point ALU obtained from opencores.org, measured its error rate at varying frequencies, and incorporated the results as part of the floating point error profile shown in Figure 4.2. For the rest of the floating point pipeline, we used the average of the integer pipeline error profiles.

Pipeline Stage    T_PD (ns)    T_CD (ns)    % Critical Registers
Fetch               3.90         0.06              2.1
Decode              2.76         0.10              0
Rename              2.88         0.06              0
Issue               4.89         0.10             89.17
Execute             6.65         0.08             11.86
Memory              5.21         0.10              3.21
Commit              1.94         0.07              0

Table 4.2 Synthesis report of major pipeline stages

The hardware model of the processor has altogether 12 pipeline stages. Table 4.2 reports the synthesis results for the major pipeline stages. In the hardware model, the fetch stage, for instance, is divided into three stages; the table reports only the propagation delay, T_PD, and the contamination delay, T_CD, for the slowest of the three fetch stages. The timing values, reported in ns, are obtained from static timing analysis reports. In the table, we also report the percentage of registers that have paths terminating at them with delay values greater than or equal to 3.5ns. We fixed the worst-case delay at 7ns to allow the maximum propagation delay of 6.65ns in the execute stage.

Figure 4.3 shows the error rate for the issue stage, the execute stage, and the cumulative error rate for all stages of the processor. Around 89.17% of the paths fail in the issue stage at 3.5ns, which causes the sudden rise in error rate observed in Figure 4.3. From the distribution of critical registers, as correlated by the error rate results for various benchmarks, we observe that the primary contributors to timing errors are the issue stage and the execute stage. Since the issue stage performs computation every cycle, overclocking it results in a significant increase in errors, as shown in Figure 4.3. The execute stage, however, performs computations that exercise the critical paths rarely, resulting in a moderate increase in error rate with overclocking.

The error rate at different frequencies was measured by running the experiment for 100,000 cycles, and repeating the experiment with different sequences of 100,000 instructions for each benchmark. Average values are reported in the chart. Based on the error profile, the random timing error injector in Figure 4.2 injects the appropriate number of errors. Whenever an error occurs, the pipeline stalls for a cycle.

As explained earlier, it is necessary to augment critical registers with error detection and recovery circuitry, and also to increase the contamination delay of paths terminating at critical registers to a value greater than the desired extent of overclocking. Our simulator overclocks up to 55% of the worst-case clock period, which requires increasing the contamination delay to over 45% of the clock period. We obtain the power overhead values from Razor [27] and incorporate them in our power and thermal model. In Razor, the contamination delay was increased to 50% of the clock period for paths terminating in critical registers. The power overhead reported was 3%, which came from the extra buffers padded to improve contamination delay and from the backup registers.

Figure 4.4 shows the error profiles for three benchmarks executing five different instruction and data sets. The variability seen in these plots indicates the need for adaptive clocking techniques. The performance of a ROP is significantly enhanced by a dynamic overclocking mechanism. Traditional dynamic frequency scaling techniques stall the system clock during the frequency switching phase.

[Plots: percent error rate versus clock period (ns) for all pipeline stages, the issue stage, and the execute stage]

Figure 4.3 Cumulative error profile for all pipeline stages at overclocked operating frequencies for SPEC2000 integer benchmarks. Also shown separately are error profiles for the issue stage and the execute stage.

[Plots: percent error rate versus clock period (ns) for the execute stage of the bzip2, crafty, and gap benchmarks, each with five different instruction and data profiles]

Figure 4.4 Error profile for three SPEC2000 integer benchmarks executing five different instruction and data sets

The entire process of switching from one frequency to another may take upwards of 100 clock cycles, depending upon the speed at which the voltage controlled oscillator (VCO) and the delay locked loops (DLLs) or phase-locked loops (PLLs) can generate the new stable clock signal. This frequency switching penalty becomes a bottleneck, and prevents adjusting the clock frequency frequently. To overcome this limitation, we may consider using two clock generators, with a control mechanism that switches between them. This provides the capability to adjust the clock frequency concurrently, while the system is running. Adaptive clocking mechanisms are discussed in detail in Section 4.3.

Table 4.3 specifies the various simulation parameters we incorporated into our ROP simulation model.

Parameter                         Value
Technology node                   45nm
Supply voltage                    1V
Threshold voltage                 0.2398V
Worst-case frequency              1536MHz
Maximum overclocked frequency     2792MHz
No. of frequency levels           32
Frequency sampling interval       100,000 cycles
PLL locking time                  10µs
Frequency switching penalty       Single PLL: 10µs; Dual PLL: 0µs
Temperature sampling interval     1ms

Table 4.3 Simulator parameters

4.2.2 Power and Thermal Modeling

Wattch [15] is an accurate, architecture level power tool that is embedded within the SimpleScalar simulator. Wattch categorizes the various processor units into one of the following four types: array structures, fully-associative content-addressable memory structures, combinational logic blocks and clock resources. This classification enables modeling the power for each processor functional block based on its categorization, the input configuration that determines the size of the block, the workload-dependent access pattern, and implementation technology parameters. Wattch estimates the worst-case cycle power based on voltage, frequency and process technology. During runtime, this worst-case power consumption value is scaled based on the accesses and resources used in a cycle. Wattch calculates instantaneous power, including both active and leakage power, every cycle, and outputs the total power accumulated over a simulated period of time.

We chose the in built linear with 10% turnoff power, as this resembles the industry reported leakage power values. Even as leakage power tends to get worse with scaling, several new techniques have come up to minimize leakage power within acceptable levels. Some of the popular leakage reduction techniques include architecture level techniques, such as power gating, transistor level techniques, such as multi-threshold transistors, and material changes, such as the latest hafnium base high-k dielectric for gate oxide. One important fact that makes leakage power so important with scaling is its exponential dependence on on-chip temperature. Active power, however, is largely in- dependent of temperature, even as circuit switching is temperature dependent because of the effect of temperature on the threshold voltage of a CMOS device.
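As a rough illustration of this per-cycle scaling (not Wattch's actual code), the sketch below charges an accessed unit its full worst-case dynamic power and an idle unit the 10% turnoff floor chosen above; the unit names and wattages are invented.

TURNOFF = 0.10   # "linear with 10% turnoff": idle units draw 10% of peak

def cycle_power(unit_peak_w, accessed, p_leak_w):
    # Each accessed unit burns its worst-case dynamic power this cycle;
    # idle units are scaled down to the turnoff floor.
    dyn = sum(p if accessed.get(name) else TURNOFF * p
              for name, p in unit_peak_w.items())
    return dyn + p_leak_w

units = {"alu": 0.9, "icache": 1.4, "dcache": 1.6, "regfile": 0.7}  # watts
print(cycle_power(units, {"alu": True, "icache": True}, p_leak_w=0.5))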

We model this temperature dependence of leakage power using the empirical relationship presented in Equation 4.16 [40]. Here, β is a technology dependent constant (β is 0.036 and 0.017 for 180nm and 70nm, respectively), T_0 is the temperature of a reference point, and T_i is the temperature at the i-th instant with respect to the reference point. We obtain on-chip temperature values for each of the functional blocks using the HotSpot tool [42], which acts as a thermal sensor.

P_LEAK ∝ e^(β × (T_i − T_0))    (4.16)
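The following sketch applies Equation 4.16 to rescale a reference leakage value with temperature; the β values for 180nm and 70nm come from the cited source, while the reference point chosen here is arbitrary.

import math

def leakage(p_leak_ref, temp_k, t_ref_k=340.0, beta=0.017):
    # Equation 4.16: P_LEAK proportional to exp(beta * (T_i - T_0)),
    # calibrated so that leakage equals p_leak_ref at t_ref_k.
    return p_leak_ref * math.exp(beta * (temp_k - t_ref_k))

print(leakage(1.0, 355.0))   # a 15 K rise inflates leakage by ~29% at beta=0.017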

We modified Wattch to track instantaneous power for each functional block. The instantaneous power trace is provided to the HotSpot RC model to calculate temperature. Thermal modeling requires the chip floor plan. We obtain the Alpha EV6 floor plan for the 45nm technology node from the 130nm floor plan provided as part of the HotSpot distribution, by assuming that area scales with the square of the technology node. The thermal model, which was originally designed for 0.18um technology, includes die, heat spreader, and heat sink models.

4.3 Adaptive Clocking

Dynamic clock frequency tuning is controlled by a global feedback system, which is based on the total number of timing errors that occur in a specified time interval. The number of errors occurring in each timing error counter sampling interval is continuously monitored. As long as the number of errors is within the target limits, the frequency is scaled up; otherwise it is scaled down. The error rate can be construed as a monotonically increasing function of frequency, which allows the use of efficient search algorithms to select the next tuned frequency, starting from the base frequency. The maximum frequency for performance enhancement is theoretically limited by the contamination delay of the circuit. If the time period of the new frequency is less than the contamination delay of the circuit, timing errors certainly occur during every cycle and the error rate goes to 100%. Clock periods that are greater than the propagation delay do not cause any timing errors (0% error rate). Earlier studies have indicated that fixing a non-zero target error rate improves performance significantly.


Figure 4.5 Feedback control system to dynamically tune clock frequency: Single clock generator with variable phase shift

The dynamically tuned frequency is achieved through the global feedback system pictured in Figures 4.5 and 4.6. Figure 4.5 depicts a feedback mechanism with a single clock generator and a variable phase shift option for the PSCLK. Figure 4.6 shows two clock generators, with the PSCLK at a fixed phase shift from the MAINCLK. Both the variable and fixed phase shift options can be used with either single or dual clock generators.

Figure 4.5 shows the ROP, which is SPRIT3E in this case, along with the dynamic clock tuning mechanism, wherein a single programmable clock generator is used. Before operation begins, a small, non-zero error rate is programmed as the set point. The clock controller is initialized with the worst-case delay parameters of the pipeline. The initial frequency of the clocks is determined by the worst-case propagation delay, and the PSCLK begins with no phase shift (∆Φ = 0). These values are sent to the clock generator block. This block consists of the clock generator, which includes a voltage controlled oscillator (VCO) and a PLL. The VCO generates a variable frequency clock that meets the ∆F value given by the clock controller. The PLL locks the output of the VCO to provide the MAINCLK to the pipeline. The clock generator then phase shifts the MAINCLK by the value requested by the clock controller and produces the PSCLK. Once the clocks are stable, the clock generator indicates that the signals are locked. During the period in which the clocks are being adjusted, the pipeline is stalled. To avoid a high overhead from frequent clock switching, the number of timing errors in the pipeline must be sampled over a large interval and a new frequency determined after that interval.

Figure 4.6 depicts a clock tuner, which is composed of a clock controller, two programmable clock generators, and a phase shift control block. In the figure, the parameters that control the operation of each of the blocks are highlighted. The operation of this clock tuner is as follows:


Figure 4.6 Feedback control system to dynamically tune clock frequency: Dual clock generators with fixed phase shift

The programmable clock generators generate clocks whose frequencies lie within the range prescribed by F_MIN and F_MAX. The exact frequency to be generated is based on the frequency tuning value estimated by the clock controller from the current error rate. The frequency steps that are possible in the given frequency range are determined by the number of bits used to represent the two tuning values, TunerGen1 and TunerGen2. For instance, if we use 5 bits, then 32 different frequencies can be generated between F_MIN and F_MAX, with 0 corresponding to F_MIN and 31 corresponding to F_MAX. If the number of steps, N_Steps, is fixed, then the step increase or decrease in clock time period is given by Equation 4.17.

T_Step = (T_MAX − T_MIN) / N_Steps;    T_MAX = 1 / F_MIN and T_MIN = 1 / F_MAX    (4.17)

Based on the values of TunerGen1 and TunerGen2, T_PS for each generator is calculated as T_Step × TunerGen1 and T_Step × TunerGen2, respectively. Once T_PS is known, the clock frequencies of Clk1 and Clk2 are decided using Equation 3.3.

The clock select block switches between the clocks generated by the two generators. The clock controller generates the signal ClkSwitch to select between the two clocks. Generator selection toggles whenever a new frequency is generated, in sync with the newest frequency being produced. The actual switching happens after allowing sufficient time for the outputs of the clock generators to stabilize. Once the MAINCLK is selected, the fixed phase shift block phase shifts the MAINCLK to generate the PSCLK. The clocks to the ROP are stalled for a few cycles to ensure that system functionality is not affected by unwanted glitches in either clock.

4.3.1 Clock Tuning Schemes

The operation of the clock controller is dependent on two parameters: TargetErrRate, which specifies the tolerable error rate, and SwitchRate, which specifies the rate at which frequency switching happens. For example, if we specify SwitchRate as 100,000, then frequency switching happens every 100,000 clock cycles. If TargetErrRate is specified as 1% (an error rate of 1,000 per 100,000 cycles), then the frequency is increased if the number of errors is below 1,000 after 100,000 cycles; otherwise it is decreased. The frequency is decreased immediately if the error rate exceeds the TargetErrRate at any time during the 100,000 cycles. A tolerance band for the error rate can also be used, within which the frequency is not changed. As mentioned earlier, frequency switching takes several cycles once initiated. The ERateClr signal is asserted for one cycle just before asserting ClkSwitch to reset the system error rate.

Algorithm 1 Binary Search Algorithm for Calculating Tuning Values

 1: Initial: Ti = Tj = Thi = Thj = 0; i = 2; j = 1
 2: Initial: LowerBound = 0; UpperBound = 2^length(TunerGen1) − 1
 3: if cycles ≥ SwitchRate then
 4:   x ⇐ MAX(Tj, Thj, Thi)
 5:   y ⇐ MIN(Tj, Thj, Thi)
 6:   Ti ⇐ Ti
 7:   if ErrRate ≤ TargetErrRate then
 8:     if Ti ≥ x then
 9:       Tj ⇐ (Ti + UpperBound) / 2
10:     else
11:       if Ti < y then
12:         Tj ⇐ (Ti + y) / 2
13:       else
14:         Tj ⇐ (Ti + x) / 2
15:       end if
16:     end if
17:   else
18:     if Ti ≤ y then
19:       Tj ⇐ (Ti + LowerBound) / 2
20:     else
21:       if Ti > x then
22:         Tj ⇐ (Ti + x) / 2
23:       else
24:         Tj ⇐ (Ti + y) / 2
25:       end if
26:     end if
27:   end if
28:   Thi ⇐ TunerGeni; Thj ⇐ TunerGenj
29:   TunerGeni ⇐ Ti; TunerGenj ⇐ Tj
30:   swap i and j
31:   cycles ⇐ 0
32: else
33:   cycles ⇐ cycles + 1
34: end if

Frequency tuning is implemented in an ad hoc fashion: the frequency is increased or decreased by a single step based on the error history during the preceding sampling interval. A binary search could be used instead; however, a simple binary search does not work well, as the frequency changes by a large amount on every switch and does not settle at the right value. We therefore use a modified binary search algorithm to decide on the tuning values. Algorithm 1 runs once to calculate the new tuning values whenever frequency switching is called for. Our search alternates between TunerGen1 and TunerGen2. We also record the current tuning value before changing it, so that whenever it is required to find the midpoint between the current value and the lower or upper bound, the search algorithm first looks at the last three tuning values to make a decision. This allows the dynamic clock tuner to change frequencies by smaller amounts in the beginning and, if the error rate is still high or low, drop to the midpoint based on the upper or lower bound, as the case may be.
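For comparison, the simpler single-step scheme described at the start of this paragraph can be sketched in a few lines; the names and the tolerance handling are illustrative only.

def next_level(level, errors, cycles, target_rate, tol, n_levels=32):
    # One-step ad hoc tuning: move one frequency level per sampling
    # interval, holding steady inside the tolerance band.
    rate = errors / cycles
    if rate > target_rate + tol:
        return max(level - 1, 0)             # too many errors: slow down
    if rate < target_rate - tol:
        return min(level + 1, n_levels - 1)  # headroom left: speed up
    return level                             # within tolerance: hold

The modified binary search of Algorithm 1 converges faster than this one-step scheme while avoiding the oscillation of a naive binary search.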

4.3.2 Comparing Adaptive Clocking Techniques

In this section, we compare the benefits of having a single clock generator or two clock generators to support reliable overclocking. We compare the performance of the two clocking schemes, shown in Figures 4.5 and 4.6, against a base system that does not support any overclocking. The base processor operates at the worst-case clock frequency, and does not see any timing errors during its operation. The base system theoretically has an error rate target of 0% and has no need to adapt its frequency during runtime. The two overclocked systems adapt themselves during runtime based on the specified target error rate. The single clock generator system incurs a clock switching penalty of around 100µs, since the processor has to be stalled while the system frequency is changing. The dual clock generator system tunes its frequency using the second generator, while the first generator provides the clock signal to the processor. When the need arises, the processor alternates between the two generators, and the switching happens within a few cycles.

Table 4.4 compares various performance metrics between a base non-overclocked processor, a reliably overclocked processor that uses a single clock generator for dynamic clock tuning, and a reliably overclocked processor that uses two clock generators to adapt its frequency quickly to changing runtime requirements. The comparison results are presented for three error rate targets. We report results for six integer benchmarks, namely, bzip2, crafty, gap, gzip, mcf, and vpr. From the results, we see that there are significant runtime benefits as a result of reliable overclocking. The dual clock generator system achieves the best runtime across all benchmarks. However, the power consumption is much higher for the dual clock generator system. Still, the dual clock generator system is beneficial if we look at the energy-delay product.

For an error rate target of 1%, the single clock generator system achieves, on average, an 11.34% improvement in runtime over the base system. The dual clock generator system outperforms the single clock generator system by almost 20%, and betters the base system by almost 30%, across all benchmarks. Even for 3% and 5% error rate targets the dual clock generator system is 20% faster than the single clock generator system, even as the single clock generator system outperforms the base system by about 20%.

With respect to power, the base system consumes the lowest power, followed by the single and dual clock generator based overclocked systems. If we consider the energy-delay product metric, the dual clock generator based overclocked system turns out to be much better, as it achieves around 30% improvement over the base system and outperforms the single clock generator based overclocked system by about 25%, across all benchmarks.

Table 4.5 presents results for six SPEC2000 floating point benchmarks. The benchmarks we considered are applu, apsi, equake, galgel, mesa, and mgrid. For the 1% error rate target, the single clock generator based overclocked system achieves, on average, only about 4% improvement in runtime, while the dual clock generator based overclocked system achieves almost 24%. For the 3% and 5% error rate targets, the system with two clock generators does better by 20% than the system with a single clock generator. The energy-delay product values still indicate that having a dual clock generator based tuning system for a reliably overclocked processor is a good idea, and we can achieve more benefits by doing so.

Table 4.4 Comparing various performance metrics between a base non-overclocked processor, a reliably overclocked processor tuned using a single clock generator, and a reliably overclocked processor tuned using dual clock generators. All systems execute SPEC2000 integer benchmarks.

Target Error Rate              0%       1%              3%              5%
Benchmark  Metric             Base   Single   Dual   Single   Dual   Single   Dual
bzip2      RUN TIME (ms)     40.08    34.27  27.42    32.36  25.57    31.02  24.67
           POWER (W)          2.55     3.27   3.74     3.53   4.08     3.70   4.27
           ENERGY (J)         0.10     0.11   0.10     0.11   0.10     0.11   0.11
           EDP (ms*J)         4.09     3.83   2.81     3.70   2.66     3.56   2.59
crafty     RUN TIME (ms)     28.30    22.77  18.24    22.92  17.72    23.21  17.89
           POWER (W)          3.48     4.65   5.45     4.78   5.73     4.85   5.82
           ENERGY (J)         0.10     0.11   0.10     0.11   0.10     0.11   0.10
           EDP (ms*J)         2.79     2.41   1.81     2.50   1.79     2.60   1.85
gap        RUN TIME (ms)     20.04    20.14  15.44    18.18  15.47    16.77  13.62
           POWER (W)          4.37     4.67   5.72     5.28   5.77     5.75   6.74
           ENERGY (J)         0.09     0.09   0.09     0.09   0.09     0.10   0.09
           EDP (ms*J)         1.75     1.89   1.36     1.72   1.38     1.61   1.24
gzip       RUN TIME (ms)     52.25    48.72  38.99    41.50  33.40    41.06  32.21
           POWER (W)          2.75     3.21   3.70     3.81   4.39     3.97   4.64
           ENERGY (J)         0.14     0.16   0.14     0.16   0.15     0.16   0.15
           EDP (ms*J)         7.52     7.61   5.63     6.56   4.89     6.68   4.80
mcf        RUN TIME (ms)     26.03    22.71  18.52    21.04  16.96    21.04  16.58
           POWER (W)          5.36     6.44   7.59     7.11   8.45     7.29   8.82
           ENERGY (J)         0.14     0.15   0.14     0.15   0.14     0.15   0.15
           EDP (ms*J)         3.63     3.32   2.60     3.14   2.42     3.21   2.41
vpr        RUN TIME (ms)     19.56    16.62  13.54    15.82  12.44    16.02  12.49
           POWER (W)          3.41     4.32   4.98     4.68   5.53     4.72   5.61
           ENERGY (J)         0.07     0.07   0.07     0.07   0.07     0.08   0.07
           EDP (ms*J)         1.31     1.19   0.91     1.16   0.85     1.20   0.87

Table 4.5 Comparing various performance metrics between a base non-overclocked processor, a reliably overclocked processor tuned using a single clock generator, and a reliably overclocked processor tuned using dual clock generators. All systems execute SPEC2000 floating point benchmarks.

Target Error Rate              0%       1%              3%              5%
Benchmark  Metric             Base   Single   Dual   Single   Dual   Single   Dual
applu      RUN TIME (ms)     41.22    40.29  31.34    35.57  29.13    34.12  27.40
           POWER (W)          3.79     4.18   5.02     4.74   5.49     5.05   5.93
           ENERGY (J)         0.16     0.17   0.16     0.17   0.16     0.17   0.16
           EDP (ms*J)         6.44     6.78   4.93     5.99   4.65     5.87   4.44
apsi       RUN TIME (ms)     12.26    11.62   9.31    10.10   8.14     9.96   7.94
           POWER (W)          5.69     6.31   7.55     7.43   8.84     7.66   9.21
           ENERGY (J)         0.07     0.07   0.07     0.07   0.07     0.08   0.07
           EDP (ms*J)         0.86     0.85   0.65     0.75   0.58     0.75   0.57
equake     RUN TIME (ms)     27.17    25.74  20.68    22.18  17.85    21.84  17.33
           POWER (W)          4.86     5.41   6.43     6.42   7.61     6.65   7.98
           ENERGY (J)         0.13     0.14   0.13     0.14   0.14     0.14   0.14
           EDP (ms*J)         3.59     3.58   2.75     3.14   2.41     3.16   2.38
galgel     RUN TIME (ms)     41.24    39.58  31.31    34.26  27.81    33.31  26.76
           POWER (W)          2.72     3.11   3.62     3.63   4.15     3.82   4.40
           ENERGY (J)         0.11     0.12   0.11     0.12   0.12     0.13   0.12
           EDP (ms*J)         4.63     4.87   3.54     4.25   3.20     4.23   3.14
mesa       RUN TIME (ms)     27.12    25.69  20.64    22.14  17.81    21.80  17.30
           POWER (W)          4.84     5.38   6.40     6.38   7.57     6.61   7.93
           ENERGY (J)         0.13     0.14   0.13     0.14   0.13     0.14   0.14
           EDP (ms*J)         3.56     3.55   2.72     3.12   2.39     3.13   2.36
mgrid      RUN TIME (ms)     40.31    39.14  30.63    34.21  27.93    33.20  26.56
           POWER (W)          3.90     4.32   5.18     4.97   5.77     5.25   6.19
           ENERGY (J)         0.16     0.17   0.16     0.17   0.16     0.17   0.16
           EDP (ms*J)         6.34     6.62   4.86     5.80   4.50     5.78   4.36

4.4 Reliable Overclocking Analysis

The Razor architecture is among the first in which the voltage was decreased below worst-case settings without altering the frequency during run time [27]. In a 64-bit Alpha processor, only 192 flip-flops out of 2048 required Razor augmentation, which amounted to a power overhead of 1%. The Razor architecture was analyzed at various levels: first, an 18 × 18 bit multiplier was implemented and analyzed on an FPGA; then a C-level timing model of a Kogge-Stone adder was implemented with Razor timing details from SPICE analysis. This C-model was then integrated into the execution stage of the SimpleScalar simulator. Overall, substantial energy savings of up to 62% were observed, with less than 3% impact on performance due to error recovery.

In contrast to Razor's energy saving goals, the SPRIT3E framework varied the frequency dynamically at a fixed voltage level [92]. To gauge the performance improvements provided by the SPRIT3E framework, an initial experiment on multiplier circuits showed that using a method such as dual latching to tolerate a small number of timing errors allowed the multiplier circuit to run at almost half the period, a speedup of 44%. The SPRIT3E framework was also evaluated in a DLX superscalar processor. Experimental results show that, on average, a performance gain of up to 57% across all benchmark applications is achievable.

The most important aspect that makes speculative reliable overclocking so attractive is the possibility of allowing and tolerating timing errors. An important design constraint is fixing the target error rate. We evaluated the performance of a ROP with different error rate targets. For this analysis, we used 45nm technology parameters. The results presented are for a reliably overclocked system that tunes its frequency using the two clock generator system shown in Figure 4.6.

Figures 4.7 and 4.8 show the performance trends for SPEC2000 integer and floating point benchmarks as we vary the target error rate from 0% to 20%. The processor operates in normal mode, without any speculative reliable overclocking, for the 0% error rate target. From the plots, we can see that all benchmarks show significant improvement in run time with reliable overclocking. However, as the error rate target is increased, the error recovery penalty plays a role and offsets some of the improvement in performance. This is seen from the stagnating or decreasing improvement in run time beyond a 5% error rate target.

The energy curves show that even as run time decreases, the total energy consumption increases modestly with higher error rate targets. This increase comes from the significant increase in power consumption for reliably overclocked systems. However, the energy-delay product metric shows that reliable overclocking is beneficial for systems that require both energy efficiency and high performance. From the plots, we understand that 5% is a good target error rate, as it minimizes both run time and energy-delay product.

Tables 4.6 and 4.7 present the percent increase or decrease in run time, power and energy consumption, energy-delay product and temperature when reliably overclocked, as compared to the base processor. A positive percent difference indicates a favorable improvement because of reliable overclocking, while a negative value indicates a decline. Power consumption increases by an average of 38% across all benchmarks, while energy consumption increases on average by only 3%. The significant improvement in run time and energy-delay product validates the reliable overclocking scheme. The change in the maximum temperature reached is also within 3%.

The comparison results presented until now assume that the memory is also reliably overclocked. If only the processor is overclocked, then memory intensive applications may not see significant benefits. This is reflected in Tables 4.8 and 4.9. The integer benchmarks bzip2 and gzip, and the floating point benchmark mgrid, are memory intensive and benefit minimally from overclocking, as memory latency plays a key role in their run time. A non-memory intensive benchmark such as crafty is more computation oriented, and we notice that there is no difference in performance with or without memory overclocking. We observe the same behavior for gap, mcf, vpr, apsi, equake, and galgel. applu is moderately memory intensive and suffers a 6.25% performance degradation with no memory overclocking.


Figure 4.7 Run time, energy and energy-delay product trends for SPEC2000 integer benchmarks as target error rate varies from 0% to 20%. All values are normalized to 0% target error rate (no overclocking mode).


Figure 4.8 Run time, energy and energy-delay product trends for SPEC2000 floating point benchmarks as target error rate varies from 0% to 20%. All values are normalized to 0% target error rate (no overclocking mode).

Table 4.6 Comparing various performance metrics for non-overclocked and reliably overclocked processors executing SPEC2000 integer benchmarks

                       bzip2                    crafty                   gap
METRIC             Base    ROP   % Diff     Base    ROP   % Diff     Base    ROP   % Diff
RUN TIME (ms)      40.08  24.67   38.45     28.30  17.89   36.76     20.04  13.62   32.04
POWER (W)           2.55   4.27  -67.39      3.48   5.82  -67.14      4.37   6.74  -54.45
ENERGY (J)          0.10   0.11   -2.88      0.10   0.10   -5.11      0.09   0.09   -4.27
EDP (ms*J)          4.09   2.59   36.68      2.79   1.85   33.53      1.75   1.24   29.14
TEMPERATURE (K)      325    330   -1.52       329    337   -2.33       332    339   -2.13

                       gzip                     mcf                      vpr
METRIC             Base    ROP   % Diff     Base    ROP   % Diff     Base    ROP   % Diff
RUN TIME (ms)      52.25  32.21   38.36     26.03  16.58   36.30     19.56  12.49   36.14
POWER (W)           2.75   4.64  -68.46      5.36   8.82  -64.48      3.41   5.61  -64.21
ENERGY (J)          0.14   0.15   -3.65      0.14   0.15   -4.31      0.07   0.07   -4.09
EDP (ms*J)          7.52   4.80   36.12      3.63   2.41   33.55      1.31   0.87   33.53
TEMPERATURE (K)      329    337   -2.30       333    343   -3.06       328    335   -2.13

Table 4.7 Comparing various performance metrics for non-overclocked and reliably overclocked processors executing SPEC2000 floating point benchmarks

                       applu                    equake                   mgrid
METRIC             Base    ROP   % Diff     Base    ROP   % Diff     Base    ROP   % Diff
RUN TIME (ms)      41.22  27.40   33.54     27.17  17.33   36.20     40.31  26.56   34.12
POWER (W)           3.79   5.93  -56.32      4.86   7.98  -64.22      3.90   6.19  -58.84
ENERGY (J)          0.16   0.16   -3.69      0.13   0.14   -4.19      0.16   0.16   -4.49
EDP (ms*J)          6.44   4.44   31.09      3.59   2.38   33.53      6.34   4.36   31.16
TEMPERATURE (K)      335    344   -2.80       332    342   -2.82       334    344   -2.89

                       apsi                     galgel                   mesa
METRIC             Base    ROP   % Diff     Base    ROP   % Diff     Base    ROP   % Diff
RUN TIME (ms)      12.26   7.94   35.24     41.24  26.76   35.12     27.12  17.30   36.20
POWER (W)           5.69   9.21  -61.82      2.72   4.40  -61.51      4.84   7.93  -64.10
ENERGY (J)          0.07   0.07   -3.72      0.11   0.12   -4.49      0.13   0.14   -4.16
EDP (ms*J)          0.86   0.57   32.83      4.63   3.14   32.21      3.56   2.36   33.55
TEMPERATURE (K)      333    344   -3.08       328    334   -1.84       332    341   -2.80

Table 4.8 Effect of memory overclocking on the performance benefits of a ROP executing SPEC2000 integer benchmarks

                         bzip2                      crafty                     gap
METRIC            Mem RO: Yes    No  % Change    Yes     No  % Change    Yes     No  % Change
RUN TIME (ms)         24.67  36.51    -48.01    17.89  17.90    -0.01    13.62  13.62    0.00
POWER (W)              4.27   3.39     20.52     5.82   5.82     0.01     6.74   6.74    0.00
ENERGY (J)             0.11   0.12    -17.86     0.10   0.10     0.00     0.09   0.09    0.00
EDP (ms*J)             2.59   4.52    -74.43     1.85   1.85    -0.01     1.24   1.24    0.00
TEMPERATURE (K)         330    328      0.56      337    337     0.00      339    339    0.00

                         gzip                       mcf                        vpr
METRIC            Mem RO: Yes    No  % Change    Yes     No  % Change    Yes     No  % Change
RUN TIME (ms)         32.21  44.64    -38.60    16.58  16.80    -1.31    12.49  12.49    0.00
POWER (W)              4.64   3.74     19.47     8.82   8.75     0.82     5.61   5.61    0.00
ENERGY (J)             0.15   0.17    -11.78     0.15   0.15    -0.49     0.07   0.07    0.00
EDP (ms*J)             4.80   7.44    -54.94     2.41   2.46    -1.81     0.87   0.87    0.00
TEMPERATURE (K)         337    335      0.42      343    343     0.10      335    335    0.00

Table 4.9 Effect of memory overclocking on the performance benefits of a ROP executing SPEC2000 floating point benchmarks

                         applu                      equake                     mgrid
METRIC            Mem RO: Yes    No  % Change    Yes     No  % Change    Yes     No  % Change
RUN TIME (ms)         27.40  29.11     -6.25    17.33  17.34    -0.01    26.56  32.18  -21.16
POWER (W)              5.93   5.72      3.61     7.98   7.98     0.02     6.19   5.43   12.35
ENERGY (J)             0.16   0.17     -2.44     0.14   0.14     0.00     0.16   0.17   -6.30
EDP (ms*J)             4.44   4.83     -8.85     2.38   2.39    -0.01     4.36   5.62  -28.79
TEMPERATURE (K)         344    343      0.35      342    342     0.00      344    340    1.01

                         apsi                       galgel                     mesa
METRIC            Mem RO: Yes    No  % Change    Yes     No  % Change    Yes     No  % Change
RUN TIME (ms)          7.94   7.95     -0.16    26.76  26.76    -0.03    17.30  17.31   -0.05
POWER (W)              9.21   9.20      0.09     4.40   4.40     0.04     7.93   7.93    0.07
ENERGY (J)             0.07   0.07     -0.07     0.12   0.12     0.01     0.14   0.14    0.02
EDP (ms*J)             0.57   0.58     -0.23     3.14   3.14    -0.02     2.36   2.36   -0.03
TEMPERATURE (K)         344    344      0.00      334    334     0.00      341    341    0.00

CHAPTER 5. THERMAL IMPACT OF RELIABLE OVERCLOCKING

The design for worst-case settings provides an opportunity to improve processor performance to a great extent through overclocking. Reliable overclocking mechanisms employ proven fault tolerance techniques to detect and recover from any timing errors that occur at better-than-worst-case speeds. Although aggressive clocking mechanisms help improve performance, they adversely impact on-chip temperatures, leading to hotspots. Overclocking enthusiasts invest heavily in expensive cooling solutions to protect the chip from overheating, and such overclocked systems typically have significantly lower lifetimes. Additionally, reliable overclocking techniques necessitate additional circuitry, leading to an increase in power consumption. Higher clock speeds and power densities invariably lead to accretion of on-chip temperature over a period of time. As the system operates faster, on-chip temperatures quickly reach and exceed safe limits. This poses a serious threat to the lifetime reliability of these systems [90]. In this chapter, we perform a comparative analysis of the thermal behavior of reliably overclocked systems against non-accelerated systems. Our aim in this work is to establish a safe acceleration zone for such 'better than worst-case' designs by efficiently balancing the gains of overclocking against the impact on system temperature.

We must emphasize that current products from both of the leading microprocessor vendors, Intel and AMD, have dynamic thermal monitoring techniques that take necessary corrective action to maintain on-chip temperature [2, 55, 75]. The corrective actions, in most cases, shut down the system or reduce system voltage and frequency, leading to considerable performance degradation. Our goal in this study is to analyze the temperature pattern of reliably overclocked systems, and to evaluate the lifetime reliability of such reliable aggressive clocking mechanisms. Furthermore, we monitor the on-chip temperature of aggressively overclocked systems that dynamically enhance single threaded application performance. We couple thermal monitoring techniques with reliable overclocking to alleviate lateral issues relating to system power and reliability. While taking feedback from an integrated thermal monitor, we observed an average performance increase of 25%, while operating within a temperature limit of 355K.

To the best of our knowledge, this is the first work that analyzes the impact of reliable overclocking on on-chip temperature.

5.1 Thermal and Reliability Management

Over the last decade, thermal awareness has gained importance, distinguishing itself from power awareness. Processor chips began to include thermal sensors in various locations to regularly sample the temperature and to shut down operation in case of overheating. However, rapid heating and cooling of processor chips creates thermal cycles, affecting the lifetime reliability of the system [90].

The power consumed by a VLSI chip consists of two parts: dynamic and static. Dynamic power depends on capacitance (C), voltage (V), frequency (f) and the switching factor (α), and is given by P_dyn = α × C × V² × f. Since dynamic power is directly proportional to the frequency at which the circuit operates, overclocked systems consume more power, which in turn causes them to overheat. However, solving the thermal problem is not as simple as bringing down the overall power consumed [86].

The thermal problem becomes much more noticeable in designs under 90nm technology, where leakage power grows significantly. Leakage power grows exponentially with temperature, as given by the empirical relationship in Equation 4.16 [40]. Leakage power suffers from a positive feedback, wherein an increase in temperature leads to further leakage and increased total power consumption, which in turn leads to a further increase in temperature. Due to non-uniform switching and leakage, temperature is not distributed uniformly across the chip, creating localized heating that leads to hotspots.

Higher temperatures not only increase the power budget, but also affect the lifetime reliability of the devices. To improve the overall reliability and lifetime of systems, the thermal performance should be monitored and the average degradation of transistors managed. The RAMP model [90] relates thermal cycling to mean time to failure due to various factors, such as electromigration, stress migration and dielectric breakdown, and brings out the importance of keeping the on-chip temperature within critical limits. Table 5.1 summarizes five critical failure mechanisms, namely, electromigration, stress migration, time dependent dielectric breakdown, thermal cycling and negative bias temperature instability, as specified in [90], with their respective mean time to failure (MTTF) models. Here, k is Boltzmann's constant and T is temperature in Kelvin. These wear out phenomena create impedance in the circuits, gradually leading to permanent device failures.

Wear out Mechanism              Proportional Model (MTTF) and Fitting Parameters
Electromigration (EM) [70]      (J)^(-n) × e^(E_aEM / kT); J = current density; n = 1.1, E_aEM = 0.9eV
Stress Migration (SM) [70]      |T_metal − T|^(-m) × e^(E_aSM / kT); m = 2.5, E_aSM = 0.9eV
Time dependent dielectric       (1/V)^(a − bT) × e^([X + (Y/T) + ZT] / kT); a = 78, b = −0.081,
breakdown (TDDB) [104]          X = 0.759eV, Y = −66.8eV·K, Z = −8.37e-4 eV/K
Thermal Cycling (TC) [90]       (1 / (T − T_ambient))^q; T_ambient = ambient temperature; q = 2.35
Negative Bias Temperature       [ (ln(A / (1 + 2e^(B/kT))) − ln(A / (1 + 2e^(B/kT)) − C)) × T / e^(−D/kT) ]^(1/β1);
Instability (NBTI) [106]        A = 1.6328, B = 0.07377, C = 0.01, D = 0.06852, β1 = 0.3

Table 5.1 Mean Time To Failure (MTTF) for critical wear out models
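As an illustration of how these models are used in Section 5.3, the sketch below scales the thermal cycling MTTF term from a baseline lifetime; the 30-year baseline at 337K follows the discussion later in this chapter, while the assumed ambient temperature is a hypothetical placeholder.

Q = 2.35            # thermal cycling exponent from Table 5.1
T_AMBIENT = 318.0   # assumed ambient temperature in K (placeholder)

def mttf_tc_years(temp_k, mttf_base_years=30.0, t_base_k=337.0):
    # MTTF_TC proportional to (1 / (T - T_ambient))^q, calibrated at the
    # baseline temperature t_base_k.
    return mttf_base_years * ((t_base_k - T_AMBIENT) /
                              (temp_k - T_AMBIENT)) ** Q

print(mttf_tc_years(352.0))   # ~15 K hotter -> roughly 7.6 years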

5.2 Analysis Framework for Estimating On-chip Temperature

The analysis framework for estimating on-chip temperature is similar to the one presented in Chapter 4. In order to demonstrate the full extent of the overheating problem, we disabled clock gating. Clock gating is a low power technique that minimizes dynamic power consumption during idle time; the benefits of such low power techniques complement the benefits achieved through thermal throttling. The analysis framework requires a few additions in order to support thermal throttling. Figure 5.1 depicts both the timing error based feedback control and the thermal throttle. For our initial evaluation of how on-chip temperatures vary when reliably overclocked, we only observe the temperature, without employing any thermal throttle. We employ dynamic clock tuning beyond worst-case limits, using timing error based feedback control, to adapt system behavior based on workload characteristics. The number of timing errors occurring at a given time depends on the workload being executed by the processor.


Figure 5.1 Simulation framework depicting thermal throttling, alongside timing error based feedback control, for a reliably overclocked system

5.2.1 Thermal Throttling

Thermal throttling is a technique in which system operation is throttled when the on-chip temperature exceeds a critical value. In Figure 5.1, the HotSpot thermal model estimates the on-chip temperature during run-time. The current on-chip temperature is compared with a user- or manufacturer-defined critical value, and the control unit takes corrective action. Our implementation of the thermal throttle reduces the system operating frequency whenever the on-chip temperature exceeds the critical value. We restrict reliable overclocking based on the on-chip temperature: frequency is increased only if the timing error rate is below the target error rate, and frequency is decreased if either the timing error rate exceeds the timing error set point or the on-chip temperature exceeds the critical temperature value.
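The combined decision can be sketched as follows; the interface is invented for illustration and does not mirror the simulator's actual code.

def throttle_step(level, err_rate, temp_k, target_rate, crit_temp_k,
                  n_levels=32):
    # Decrease on either trigger; increase only when the error rate
    # leaves headroom and the chip is below the critical temperature.
    if err_rate > target_rate or temp_k > crit_temp_k:
        return max(level - 1, 0)
    if err_rate < target_rate:
        return min(level + 1, n_levels - 1)
    return level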

5.2.2 Simulation Parameters

Table 5.2 presents the simulation parameters. We evaluate the system temperature while running at 1.25V. From Figure 3.1 in Chapter 3, we can see that the clock period can be scaled only up to 50% of the original cycle time; we assume up to 45% overclocking. Table 5.2 provides the worst-case frequency and the maximum overclocked frequency we considered for our simulations.

Parameter                         Value
Technology node                   45nm
Voltage                           1.25V
Minimum frequency                 1024MHz
Maximum frequency                 1862MHz
No. of frequency levels           32
Area                              10mm²
Temperature sampling interval     1ms
Frequency sampling interval       100µs
Frequency penalty                 Single PLL: 10µs; Dual PLL: 0µs

Table 5.2 Simulator parameters

We perform a binary search between 32 frequency levels within the allowed range, based on error rate, and also on temperature when employing the thermal throttle. We assume the presence of two phase-locked loops (PLLs), so that there is no performance penalty involved while switching between frequencies. If there is only one PLL, it takes up to 10µs to change from one frequency to another.

5.3 On-chip Temperature Trends in Reliably Overclocked Processors

[Plot: on-chip temperature (K) and MTTF (years) versus thousands of cycles for the non-overclocked, reliable overclocking, and thermal throttling configurations]

Figure 5.2 On-chip temperature trends and MTTF results for the bzip2 benchmark. The plots show how on-chip temperature and MTTF vary for a non-overclocked processor, a reliably overclocked processor, and a reliably overclocked processor with thermal throttling.

We simulated six SPEC INT 2000 benchmarks, namely bzip2, crafty, gcc, gzip, mcf and parser, to analyze and compare the on-chip temperature trends and MTTF behavior of a non-overclocked processor, a reliably overclocked processor, and a reliably overclocked processor with thermal throttling.

We calculate MTTF based on the on-chip temperature at a given instant of time. We obtain the proportionality constant for our calculations from the baseline MTTF at 337K [90]. The MTTF values are obtained from the formulas listed in Table 5.1.

[Plot: on-chip temperature (K) and MTTF (years) versus thousands of cycles for the non-overclocked, reliable overclocking, and thermal throttling configurations]

Figure 5.3 On-chip temperature trends and MTTF results for the crafty benchmark. The plots show how on-chip temperature and MTTF vary for a non-overclocked processor, a reliably overclocked processor, and a reliably overclocked processor with thermal throttling.

Figure 5.2 compares the transient temperature trends of a reliably overclocked processor with those of a non-overclocked processor for the bzip2 benchmark. We evaluate ROP performance with and without thermal throttling. From the plots, we can clearly see that there is up to a 15K difference between a reliably overclocked processor and a non-overclocked processor. We also see that the reliably overclocked processor reaches and exceeds 360K after executing around 3 million instructions.

Based on the cooling solution used, the system will reach a steady state temperature and remain there. In our experiments, a non-overclocked processor settles at 347K for the same cooling solution.

We start our experiments at a steady state temperature of 340K. This initial temperature is based on the assumption that the system has already performed certain operations before it executes the benchmark of interest.

[Plot: on-chip temperature (K) and MTTF (years) versus thousands of cycles for the non-overclocked, reliable overclocking, and thermal throttling configurations]

Figure 5.4 On-chip temperature trends and MTTF results for the gzip benchmark. The plots show how on-chip temperature and MTTF vary for a non-overclocked processor, a reliably overclocked processor, and a reliably overclocked processor with thermal throttling.

[Plot: on-chip temperature (K) and MTTF (years) versus thousands of cycles for the non-overclocked, reliable overclocking, and thermal throttling configurations]

Figure 5.5 On-chip temperature trends and MTTF results for the mcf benchmark. The plots show how on-chip temperature and MTTF vary for a non-overclocked processor, a reliably overclocked processor, and a reliably overclocked processor with thermal throttling.

When incorporating the thermal throttle, we find that the temperature gets clamped at the desired operating temperature. Since thermal sensor outputs are available only once every ms, it is prudent to choose a set point 3K below the critical temperature, so that even if the system temperature overshoots before the next thermal measurement, it will not exceed the critical temperature.

Figure 5.2 also shows the mean time to failure of the reliably overclocked system, with and without thermal throttle, compared to the non-overclocked system. We observe that the non-overclocked system has a longer lifetime, of about 30 years, as its on-chip temperature does not exceed 347K. The reliably overclocked system, however, has a much shorter lifetime of about 9 years. Applying thermal throttling at about 355K increases the system lifetime to about 14 years. We understand from the figure that running the bzip2 benchmark at lower temperatures over a long period of time improves the MTTF significantly. This motivates the need for efficient dynamic thermal management techniques, alongside reliable overclocking, to achieve both performance gain and reliability. Thermal management techniques also alleviate the need for an expensive cooling solution, making high performance systems cost effective.

Figures 5.3, 5.4 and 5.5 show the on-chip temperature trends and MTTF results for the crafty, gzip and mcf benchmarks. As can be seen from the plots, the thermal characteristics and MTTF trends for crafty, gzip and mcf are similar in nature to those recorded for the bzip2 benchmark. The other two benchmarks, gcc and parser, show similar thermal and MTTF trends.

The relative speed-up for the six benchmarks, each running 10^7 instructions, is illustrated in Figure 5.6. Reliable overclocking, on average, achieves a 35% increase in performance over a non-overclocked system. When the thermal throttle is applied, the performance gain drops to 25%.

Figure 5.6 Relative performance for SPEC2000 integer benchmarks

CHAPTER 6. RELIABLE OVERCLOCKING AND TECHNOLOGY SCALING

The continued and progressive scaling of the minimum feature size of metal-oxide-semiconductor field-effect transistors (MOSFETs) has played a key role in the design of spectacular low-cost, high performance computing systems [46]. ITRS uses the term "technology node" to indicate overall industry progress in integrated circuit (IC) feature scaling [29]. Even though technology scaling reduces cost in the long run, the research, development and production cost involved in building the next generation fabrication plant, at this level of nanometer-scale feature sizes, has resulted in diminishing return-on-investment, forcing many IC designers to adopt a fabless, asset-lite business model. Unlike earlier practices, present and future technology generations call for more technology and design interaction very early in the design cycle to minimize cost and maximize yield. This required interaction adversely affects fabless IC vendors, resulting in ineffective time-to-market.

The race towards the next technology node has become a key part of retaining market share and competitive advantage in the semiconductor business, and of designing high performance, energy efficient systems [24, 36, 44]. The time-to-market difference gives a significant advantage to the bigger players in the semiconductor industry. To facilitate competitiveness and to create a level playing field for everyone involved, it becomes important to look at cheaper alternative solutions that enable the rest of the industry to compete without the need to immediately embrace technology scaling. Though it is not possible to avoid technology scaling altogether, in this chapter we present an approach that either bridges the gap before porting to the next technology node, by extending the lifetime of a technology generation, or allows a process technology generation to be skipped altogether.

Our work in this chapter evaluates the competitiveness of timing speculation based adaptive reliable overclocking with respect to technology scaling. We make a convincing case for the gains of reliable overclocking, quantifying such gains at different technology nodes, and developing a methodology and framework to evaluate such systems by estimating likely error rates using a synthesizable hardware model and applying them to the newly developed complete evaluation framework for realistic simulations.

We present the results of our experimental analysis based on integer and floating point SPEC2000 benchmarks running on a SimpleScalar Alpha processor simulator, augmented with error rate data obtained from hardware simulations of a superscalar processor. We compare the performance and energy management of reliably overclocked systems to non-overclocked systems implemented at different process technology nodes. Our evaluation results quantify the comparative gains achievable with reliable overclocking, and our substantive and significant results make a case for this approach as a worthwhile technique to pursue in mainstream processor design.

6.1 Technology Scaling

The key driving force behind technology scaling is the market's need for cost-efficient, high performance, energy efficient computing systems. New process methods and materials have continuously emerged to surmount seemingly impossible technology barriers, such as lithography and oxide scaling limits [36]. The semiconductor industry pursues technology scaling, irrespective of the cost involved, for the following reasons, as hypothesized by the scaling theory [24, 9]:

For every subsequent generational change in technology, gate delay reduces proportionally, contributing to the performance improvement seen in the operating frequency of a system that is ported to the scaled technology node. However, the increase in active and leakage power observed in sub-100nm nodes has subdued the quest for higher operating frequencies.

Until the 1990s, constant voltage scaling was practiced, resulting in increased performance and higher active power. As power became an important design constraint owing to thermal design point requirements, semiconductor manufacturers switched to constant electric field scaling to scale down the supply voltage and minimize power dissipation. Constant field scaling requires the threshold voltage to be scaled proportionally to the feature size. However, threshold voltage scaling is limited by the subthreshold slope, which in turn is limited by the thermal voltage, V_T = kT/q, where k is Boltzmann's constant and q is the electron charge.

As manufacturers reduce supply voltage with subsequent technology nodes, significant energy and active power savings are expected. One thing that worsens with sub-100nm technologies is the leakage power [44]. In MOS technology, leakage power comes from two sources: sub-threshold leakage and gate leakage. Sub-threshold leakage occurs when current flows through MOS transistors in their off state, that is, when the gate-to-source voltage (VGS) is lower than the threshold voltage (VTH).

When VTH is lowered proportionally with VDD, as the large gate overdrive (VGS − VTH) required for high speed switching must be maintained, the result is a 5x increase in sub-threshold leakage current. Gate leakage arises as the gate oxide between the metal gate and the channel becomes thinner and thinner, resulting in tunneling current through the gate dielectric. While active power is largely independent of temperature, leakage power has an exponential dependence on device operating temperature. With scaling, keeping the devices cool becomes much more important, as leakage power becomes a significant part of the total power consumption.

Technology scaling complicates on-chip communication, as interconnects do not scale as well as logic gates [9]. Even as area and fringe capacitance decrease proportionally with downscaling, interconnect resistance and capacitance increase with the scaling of wire width and thickness. Wire delays have started dominating the overall delay, and in modern microprocessors pipeline stages are dedicated solely to moving signals across the chip. More interconnect layers are added at each subsequent technology node to account for the increased density and complexity that comes with reduced area. Even as interconnect scaling is considered one of the bottlenecks that hinder future scaling, over the years changes in interconnect materials and architectures have kept interconnect performance roughly in accordance with transistor scaling trends.

In a nutshell, for every 30% downscaling of a technology node, transistor density doubles, gate delay reduces by 30%, operating frequency improves by 43%, active power consumption halves, and energy savings of 65% are observed. However, recent technology generations are unable to scale clock frequency as desired because of low power requirements, process variations and reliability concerns.
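As a quick arithmetic check, the short sketch below (illustrative Python, not part of the original framework) reproduces these canonical constant-field scaling numbers for a single 30% linear shrink:

    # Constant-field scaling arithmetic for one generation; s = 0.7 is a
    # 30% linear shrink of the technology node.
    s = 0.7
    density   = 1 / s**2   # transistor density: ~2x
    delay     = s          # gate delay: reduced by ~30%
    frequency = 1 / s      # operating frequency: ~1.43x, i.e. 43% improvement
    power     = s**2       # active power, P ~ C*V^2*f = s*(s^2)*(1/s): ~halved
    energy    = s**3       # energy per operation, E ~ C*V^2: ~65% savings

    print(f"density x{density:.2f}, delay x{delay:.2f}, frequency x{frequency:.2f}, "
          f"power x{power:.2f}, energy x{energy:.2f}")
    # density x2.04, delay x0.70, frequency x1.43, power x0.49, energy x0.34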

The trade-off between maintaining high speed switching and low leakage remains an important design constraint and plays a key role in deciding the scaling trends of supply voltage and threshold voltage.

6.2 A Reliable Overclocking Approach

As discussed in the previous section, technology scaling does present a strong reason for the continued investment by semiconductor manufacturers. We do not dispute technology scaling. Our goal is to give designers the option of treating reliable overclocking as an enhancement within the current technology node, extending the useful life of that node until a new one is adopted. We believe that reliable overclocking presents an opportunity for VLSI designers to achieve significant performance improvements by considering it as an enhancement in every process technology node.

In this chapter, we make a viable case for a reliable overclocking approach that is at least as effective and competitive as technology scaling, albeit at a lower cost. Our technique either acts as an intermediate point between two technology nodes or presents an alternative to technology porting, a step commonly taken before moving on to the next architecture that consumes the extra silicon area created by scaling. Technology porting performs a die shrink on an existing design and achieves significant improvements in speed and power consumption. Our goal is to show that by reliably overclocking a VLSI chip design implemented in the current technology node, we can match and exceed the performance of the same design ported to the next technology node.

It is important to remember that even as scaling shrinks the device sizes, the die size tends to increase with improvements in yield enhancing manufacturing techniques. This allows multiple cores and hardware co-processors to be added to the system enabling high throughput parallel operation.

Hence, technology scaling is necessary to accommodate significant changes in design architecture. We present reliable overclocking as a stopgap alternative and as a supplement to technology scaling.

Our approach is based on timing speculative reliable overclocking. Timing speculation is a technique by which a computing system performs aggressive computation, albeit incorrectly during a small fraction of instances. As the name suggests, the technique involves using data speculatively, and deploying efficient checking mechanisms to detect and correct erroneous computations. Timing errors resulting from accelerated computation and premature use of data are tolerated by exploiting proven fault tolerance techniques to ensure functional correctness. Many techniques proposed earlier utilize different error tolerance mechanisms to tolerate timing errors. These techniques show significant improvement in performance, as they enable faster execution of typical computations and suffer an error recovery penalty only for the rare computations that exercise worst-case delays. Simply put, these techniques take advantage of Amdahl's law, making the common case faster.
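To make the trade-off concrete, the following is a minimal first-order throughput model of timing speculation; the function name and the particular recovery penalty are our own illustrative choices, not values taken from the framework:

    def speculative_speedup(freq_gain, error_rate, recovery_cycles):
        """First-order throughput model of timing speculation.

        freq_gain       : overclocked frequency / worst-case frequency
        error_rate      : fraction of cycles that trigger a timing error
        recovery_cycles : cycles lost per detected error
        Returns the net speedup over worst-case operation.
        """
        return freq_gain / (1.0 + error_rate * recovery_cycles)

    # 30% overclocking at a 5% error rate with an assumed 3-cycle recovery
    # still nets about a 13% speedup: the common case pays for the rare one.
    print(speculative_speedup(1.3, 0.05, 3))  # ~1.13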

6.3 Analysis Framework

Table 6.1 Technology scaling parameters

    PARAMETER                               90nm        65nm        45nm
    Supply Voltage, VDD (V)                 1.2         1.1         1.0
    Threshold Voltage, VT (V)               0.2943      0.2725      0.2398
    Base Frequency, fbase (MHz)             768         1063        1536
    Overclocking Frequency Range (MHz)      768-1396    1063-1932   1536-2792

In order to prove that speculative reliable overclocking can be considered a viable stopgap alternative for technology scaling, we adopt the analysis framework described in Chapter 4. For our analysis, we look at three technology nodes. Several IC designers are now either implementing their products in the 45nm node or making the transition to it; the predecessors of the 45nm node are the 65nm and 90nm nodes. We obtained the scaling values for each of the three nodes from the Ss-PPC tool developed by the University of Texas at Austin. Ss-PPC is a SimpleScalar simulator for the PowerPC instruction set architecture [78]. In the Ss-PPC simulator, the technology dependent parameters are scaled from the values provided for 0.8um technology, which were originally presented as part of the Wattch power simulator. The technology nodes we chose are scaled approximately by 30% from one another and are currently used by industry. Other technology nodes in the vicinity of these nodes exist. However, performance characteristics do not change drastically with minor scaling, and our choice of technology nodes enables us to investigate the performance trends in state of the art semiconductor process technologies.

With technology scaling, resistance, capacitance, voltage and circuit delay values are scaled in accordance with the scaling theory. As discussed earlier in this chapter in Section 6.1, supply voltage and threshold voltage are scaled minimally to guarantee higher performance. Table 6.1 presents the scaled supply voltage and threshold voltage values across technologies. Because of the difficulty involved in changing the heat sink characteristics, the processor frequency for 90nm technology is fixed at 768 MHz, so that technology scaling and overclocking result in frequencies that maintain the temperature within acceptable limits. We assume that the processor is operating at room temperature (300K) before executing a benchmark. For 65nm and 45nm technologies, we scale the 90nm base frequency proportionally to 1063 MHz and 1536 MHz, respectively. Table 6.1 indicates the range of overclocking for the three technology nodes. While operating the simulator at these higher frequencies and different technology nodes, we apply the error rates discussed in detail in Section 4.2.1 on a relative basis.

One important factor that needs to be taken care of with technology scaling is memory performance, since memory does not scale as well as logic. Over the years, for every 60% reduction in feature size for logic transistors, memory is scaled by 10% [53]. We have taken this into account while scaling from one technology to another.

Table 6.2 Comparing various performance metrics across different technology nodes for a non-overclocked processor executing SPEC2000 integer benchmarks

                        BZIP2               CRAFTY              GAP                 SCALING IMPACT
    METRIC              90nm  65nm  45nm    90nm  65nm  45nm    90nm  65nm  45nm    Better or Worse
    RUN TIME (ms)       57.20 48.11 40.08   56.58 40.88 28.30   40.08 28.96 20.04   Better
    POWER (W)           2.48  2.51  2.55    2.73  3.10  3.48    3.54  3.95  4.37    Worse
    ENERGY (J)          0.14  0.12  0.10    0.15  0.13  0.10    0.14  0.11  0.09    Better
    EDP (ms*J)          8.11  5.81  4.09    8.73  5.18  2.79    5.69  3.31  1.758   Better
    TEMPERATURE (K)     321   322   325     322   324   329     323   326   332     Worse

                        GZIP                MCF                 VPR                 SCALING IMPACT
    METRIC              90nm  65nm  45nm    90nm  65nm  45nm    90nm  65nm  45nm    Better or Worse
    RUN TIME (ms)       80.56 65.28 52.25   51.63 37.43 26.03   39.10 28.25 19.56   Better
    POWER (W)           2.61  2.69  2.75    4.45  4.91  5.36    2.71  3.04  3.41    Worse
    ENERGY (J)          0.21  0.18  0.14    0.23  0.18  0.14    0.11  0.09  0.07    Better
    EDP (ms*J)          16.94 11.47 7.52    11.87 6.88  3.63    4.14  2.43  1.31    Better
    TEMPERATURE (K)     322   324   329     323   326   333     321   323   328     Worse

Table 6.3 Comparing various performance metrics across different technology nodes for a non-overclocked processor executing SPEC2000 floating point benchmarks

                        APPLU               EQUAKE              MGRID               SCALING IMPACT
    METRIC              90nm  65nm  45nm    90nm  65nm  45nm    90nm  65nm  45nm    Better or Worse
    RUN TIME (ms)       78.84 58.02 41.22   54.33 39.26 27.17   68.84 53.22 40.31   Better
    POWER (W)           3.08  3.47  3.79    3.98  4.42  4.86    3.49  3.75  3.90    Worse
    ENERGY (J)          0.24  0.20  0.16    0.22  0.17  0.13    0.24  0.20  0.16    Better
    EDP (ms*J)          19.15 11.67 6.44    11.74 6.81  3.59    16.53 10.61 6.34    Better
    TEMPERATURE (K)     323   326   335     323   326   332     323   326   334     Worse

                        APSI                GALGEL              MESA                SCALING IMPACT
    METRIC              90nm  65nm  45nm    90nm  65nm  45nm    90nm  65nm  45nm    Better or Worse
    RUN TIME (ms)       24.49 17.70 12.26   82.44 59.57 41.24   54.20 39.17 27.12   Better
    POWER (W)           4.73  5.22  5.69    2.04  2.37  2.72    3.96  4.40  4.84    Worse
    ENERGY (J)          0.12  0.09  0.07    0.17  0.14  0.11    0.21  0.17  0.13    Better
    EDP (ms*J)          2.84  1.63  0.86    13.85 8.42  4.63    11.62 6.74  3.56    Better
    TEMPERATURE (K)     323   326   333     320   322   328     323   326   332     Worse

6.4 Performance at Different Technology Nodes

In this section, we evaluate the impact of technology scaling on performance. Tables 6.2 and 6.3 present the results for six integer and six floating point benchmarks, respectively, executing on the base processor implemented in 90nm, 65nm and 45nm technology. From the results, we find that going from the 90nm to the 65nm node, a 27.7% reduction in feature size, on average decreases run time across both integer and floating point benchmarks by 25.47%, while increasing power consumption by about 10.09%. Because of the significant improvement in run time and the moderate increase in power consumption, energy efficiency improves on average by 18.09%. The energy-delay product drops by almost 38.90%, indicating the significant benefits of technology scaling.

A switch from the 65nm to the 45nm node, a 30% reduction in feature size, on average across all twelve benchmarks decreases run time by 27.98%, increases power consumption by about 8.77%, improves energy efficiency by 21.83%, and reduces the energy-delay product by 43.59%. Overall, the performance trends are very much in accordance with the scaling theory.
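These averages can be recomputed directly from the run time rows of Tables 6.2 and 6.3; the short fragment below (illustrative Python) reproduces the 25.47% figure for the 90nm-to-65nm transition, and the other metrics follow the same pattern:

    # Run times (ms) at 90nm and 65nm, taken row by row from Tables 6.2
    # and 6.3 (bzip2, crafty, gap, gzip, mcf, vpr, applu, equake, mgrid,
    # apsi, galgel, mesa).
    t90 = [57.20, 56.58, 40.08, 80.56, 51.63, 39.10,
           78.84, 54.33, 68.84, 24.49, 82.44, 54.20]
    t65 = [48.11, 40.88, 28.96, 65.28, 37.43, 28.25,
           58.02, 39.26, 53.22, 17.70, 59.57, 39.17]

    cuts = [(a - b) / a for a, b in zip(t90, t65)]
    print(f"average run time reduction: {100 * sum(cuts) / len(cuts):.2f}%")  # 25.47%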

6.5 Comparing Technology Scaling with Reliable Overclocking

Having seen that both technology scaling and reliable overclocking improve run time and energy efficiency, let us now compare how they fare against each other. From Tables 4.6, 4.7, 6.2 and 6.3 we notice that technology scaling increases power moderately as compared to reliable overclocking. This is reflected in Figures 6.1 and 6.2: power consumption values increase by almost 60% for reliable overclocking. A ROP implemented in 90nm technology consumes more power than a non-overclocked processor implemented in the same technology, as well as in the subsequent technologies. The temperature trends for reliable overclocking, shown in Figures 6.3 and 6.4, are much better than the power consumption trends. The on-chip temperatures of the ROP are higher than those of the non-overclocked processor implemented in the same technology, but lower than those of the next technology generation.

Figures 6.5 and 6.6 show the run time trends for SPEC2000 integer and floating point benchmarks, respectively. Reliable overclocking results in better run time than technology scaling. This is because, by setting an error rate target of 5%, reliable overclocking allows operation at higher frequencies than are possible with technology scaling.

Even as reliable overclocking outperforms technology scaling in terms of run time, the increase in power consumption is a worry. However, the significant decrease in run time helps keep the increase in energy consumption within 3% for the ROP, as compared to the non-overclocked processor. The energy trends are shown in Figures 6.7 and 6.8.

Reliable overclocking scores over technology scaling when the energy-delay product is considered. Figures 6.9 and 6.10 show that the energy-delay product for the ROP is better than or comparable to that of the non-overclocked processor implemented in the next technology node.

Overall, reliable overclocking presents a compelling reason to be looked at seriously by chip designers alongside technology scaling. Even as technology scaling is important for continued progress in the design of integrated circuits, speculative reliable overclocking presents a convincing case as a supplement to technology scaling. A reliable overclocking approach offers a stopgap alternative to porting to the next technology node, and may even enable certain products to skip a technology node altogether.

Figure 6.1 Technology scaling vs. speculative reliable overclocking: Power consumption trends for SPEC2000 integer benchmarks.

Figure 6.2 Technology scaling vs. speculative reliable overclocking: Power consumption trends for SPEC2000 floating point benchmarks.

Figure 6.3 Technology scaling vs. speculative reliable overclocking: Temperature trends for SPEC2000 integer benchmarks.

Figure 6.4 Technology scaling vs. speculative reliable overclocking: Temperature trends for SPEC2000 floating point benchmarks.

Figure 6.5 Technology scaling vs. speculative reliable overclocking: Run time for SPEC2000 integer benchmarks. All values are normalized to 90nm run time.

Figure 6.6 Technology scaling vs. speculative reliable overclocking: Run time for SPEC2000 floating point benchmarks. All values are normalized to 90nm run time.

Figure 6.7 Technology scaling vs. speculative reliable overclocking: Energy consumption for SPEC2000 integer benchmarks. All values are normalized to 90nm energy values.

Figure 6.8 Technology scaling vs. speculative reliable overclocking: Energy consumption for SPEC2000 floating point benchmarks. All values are normalized to 90nm energy values.

Figure 6.9 Technology scaling vs. speculative reliable overclocking: EDP for SPEC2000 integer benchmarks. All values are normalized to 90nm EDP.

Figure 6.10 Technology scaling vs. speculative reliable overclocking: EDP for SPEC2000 floating point benchmarks. All values are normalized to 90nm EDP.

CHAPTER 7. FAULT TOLERANT AGGRESSIVE SYSTEMS

Technology scaling and hazardous operating environments make embedded processors and systems-on-chip highly susceptible to faults. The impact of soft errors and silicon failures on system reliability has been steadily rising as we progress toward 32nm technologies and beyond. Soft errors, which are transient in nature, and silicon defects, which lead to permanent failures, have prompted researchers to formulate fault tolerance techniques with varied capabilities to improve system reliability. Soft errors, induced by high energy radiation and external noise, have become more frequent and may result in incorrect computation and silent data corruption. Intermittent faults that persist for a short duration of time at one particular location are also a cause for concern [20]. Silicon defects resulting from failure mechanisms such as transistor wear-out, gate breakdown, hot carrier degradation, and manufacturing limitations degrade the lifetime and reliability of fabricated devices. Transient, intermittent and permanent fault classes constitute the three major reasons for hardware failure.

In this chapter, we develop a conjoined duplex system approach to provide tolerance for myriad hardware faults that plague modern computing systems. Our approach is capable of protecting both the datapath and control logic. With minor additions to the error recovery procedure used in SPRIT3E, our fault tolerant aggressive system is capable of recovering from soft errors and timing errors. When coupled with a dynamic clock tuning mechanism based on a set target error rate, the system frequency adapts to application characteristics during run time. The concept of increasing the frequency and phase shifting the clocks makes sure that both the primary and redundant pipelines can run faster and the second pipeline is timing safe.

Our Conjoined Pipeline (in short, CPipe) system employs a special way of organizing pipeline redundancy, with the goal of tolerating the three major fault classes that severely undermine the reliability of current and future systems. The CPipe system builds on the better-than-worst-case design methodologies [3] proposed in the Razor [27] and SPRIT3E [92] performance enhancement techniques. In CPipe, both the pipeline registers and the pipeline stage combinational logic are replicated. The term "conjoined" implies the intertwining of the two pipelines and their constant and continued dependency on each other.

SPRIT3E employed temporal redundancy to reliably overclock a superscalar processor. By means of duplicating critical registers and clocking the redundant register by a delayed version of the system clock, SPRIT3E demonstrated that considerable performance improvement can be achieved through reliable overclocking. However, in the presence of faults, the redundant register cannot be relied upon, and this necessitates spatial redundancy of combinational logic to ensure that the value stored in the redundant register is “gold”. Our CPipe system is designed to tolerate transient and intermittent faults along with timing errors, and implements a robust error detection and recovery mechanism.

The contamination delay of the redundant pipeline is increased to support the operation of the CPipe system. Depending on the extent of overclocking desired, the contamination delay is increased accordingly. The contamination delay of the primary pipeline is not increased, and this allows the CPipe approach to have fewer timing errors at a particular better-than-worst-case frequency, as compared to the Razor or SPRIT3E approaches. This is because increasing contamination delay affects non-short-path delays too, even though the overall worst-case propagation delay is not increased. An input data combination that originally did not cause an error at a particular frequency may result in an error after the contamination delay is increased.

The CPipe system benefits from a dynamic clock tuning mechanism that is capable of adapting the system clock frequency to the optimal value based on the currently executing application and the environmental conditions. The range of frequencies at which the CPipe system operates reliably is estimated from the implementation of the CPipe datapath and error recovery logic. The frequency is tuned within this range during run-time to maximize performance.

We performed a series of experiments to evaluate the fault tolerance and overclocking capability of the CPipe technique. We designed and implemented a two stage conjoined arithmetic pipeline for this purpose. The first stage performs 64-bit carry look ahead addition, and the second stage performs 32-bit multiplication of the most significant and least significant words of the adder output. Separate experiments were carried out to verify detection and recovery from soft errors, timing errors, and intermittent faults. Permanent fault detection was also verified. Our fault injection campaign indicated fault masking in the case of soft errors. The output of the pipeline was verified for correctness, and it was made certain that all randomly injected faults were detected and recovered from.

To prove that our CPipe technique is viable in the presence of feedback signals from subsequent pipeline stages, we implemented our technique in a five-stage in-order pipeline processor supporting the DLX instruction set architecture. The implemented processor supports data forwarding and hazard detection. We subjected the conjoined processor to faults, and performed analysis for three different microbenchmarks. Our results demonstrate sufficient confidence in the correct working of our technique, and indicate the possibility of extending it to out-of-order systems as well.

7.1 Conjoined Pipeline Architecture

The basic principle behind the CPipe system architecture is to replicate the entire pipeline, and to interlink the two pipelines in a way that provides the capability to tolerate various fault types. Both primary and redundant pipelines are susceptible to faults that are uniformly distributed in time and space. Timing errors occur if the primary pipeline is overclocked to speed up execution. The redundant pipeline is guaranteed to have sufficient time for execution, and is free from timing errors. Since the redundant pipeline can still be corrupted by a fault occurrence, the error detection and recovery process is more complex than that described in the Razor and SPRIT3E techniques. The ensuing description of the CPipe architecture explains how random occurrences of faults and timing errors are handled concurrently. The following description assumes that the CPipe system is running at an overclocked frequency when errors are detected.

7.1.1 Conjoined Pipeline Datapath Description

The organization of redundancy in CPipe is illustrated in Figure 7.1. The figure shows three pipeline stages: P-STAGE N-1, P-STAGE N and P-STAGE N+1. The CPipe concept in its entirety is portrayed in the figure for P-STAGE N. The primary pipeline is referred to as the L-PIPELINE (Leading Pipeline) and the redundant pipeline as the S-PIPELINE (Shadow Pipeline); in the figure, the shaded pattern distinguishes the L-PIPELINE from the S-PIPELINE. The L-PIPELINE registers, S-PIPELINE registers, E-DETECT module, and the MUX before the L-PIPELINE registers together form the local fault detection and recovery (LFDR) circuit. The LFDR circuit, highlighted in the figure, replaces the pipeline registers that are present in a normal pipelined system. In the figure, feedback signals indicate signals received from pipeline stages other than the immediately preceding stage.

Figure 7.1 Conjoined Pipeline Architecture: Shaded region represents the L-PIPELINE. Dotted line encompasses the Local Fault Detection and Recovery (LFDR) circuit.

To provide tolerance for soft errors that occur in the combinational logic, the pipeline stage combinational logic between the pipeline registers is duplicated. The leading logic, L-LOGIC, receives its inputs from the previous stage L-PIPELINE register, and stores its computed results in the current stage L-PIPELINE register. The shadow logic, S-LOGIC, though also receiving its inputs from the previous stage L-PIPELINE register, stores its outputs in the current stage S-PIPELINE register. To understand the CPipe concept clearly, observe in Figure 7.1 that the L-PIPELINE register of P-STAGE N-1 feeds both the L-LOGIC and S-LOGIC of P-STAGE N, and the L-LOGIC of P-STAGE N writes its results to the L-PIPELINE register of P-STAGE N, while the S-LOGIC of P-STAGE N writes its output to the S-PIPELINE register of P-STAGE N. This implementation ensures that both datapath and control signals are protected from hardware faults.

The CPipe architecture requires three clocks for proper operation. The three input clocks are the leader clock, LClk, the error clock, EClk, and the shadow clock, SClk. EClk and SClk are phase shifted versions of LClk. These three clocks, along with the error signals from all the pipeline stages, control LGClk and SGClk, which are gated versions of LClk and SClk. The L-PIPELINE registers are clocked by LGClk, while the S-PIPELINE registers are clocked by SGClk. EClk is required to precisely control when LClk and SClk need to be stalled to ensure correct operation. Also, on error detection, the control signal to load the S-PIPELINE register values into the L-PIPELINE registers, shown as LoadSP in Figure 7.1, is asserted for a cycle. This entire control mechanism is performed in the clock stall control module, shown as CLK STALL CNTRL in Figure 7.1.

7.1.2 Error Detection and Recovery

As mentioned earlier, the results computed by the S-LOGIC are free from timing errors, but susceptible to soft errors. This complicates the error detection and recovery process. It is very important to ensure that the S-PIPELINE register is not corrupted with an incorrect result; otherwise recovery will not be possible. Considering this complication, in the CPipe architecture error detection is performed before storing the results in the S-PIPELINE register. Only if the results computed by the S-LOGIC match the values registered in the L-PIPELINE register are the S-LOGIC outputs written into the S-PIPELINE register. The E-DETECT module incorporates metastability detection, similar to the one described in [22], for the L-PIPELINE register, as the L-PIPELINE flip-flops may enter a metastable state when overclocked, or when a soft error reaches the registers during the latching window. The Error flag is asserted to indicate an error.
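The following behavioral sketch captures the write-gating decision just described; the function and signal names are ours, and the real E-DETECT module additionally flags metastable L-PIPELINE flip-flops, which a value-level model like this cannot express:

    def lfdr_update(l_reg_value, s_logic_value):
        """Value-level model of the E-DETECT gating in one LFDR circuit.

        The S-LOGIC result is written into the S-PIPELINE register only
        when it matches the value already latched in the L-PIPELINE
        register; a mismatch blocks the write and raises the stage Error
        flag instead, so the S-PIPELINE register always holds a
        known-good value for recovery.
        """
        error = (s_logic_value != l_reg_value)
        return {"write_s_reg": not error, "error": error}

    # A mismatch (e.g. a timing error latched into the L register):
    print(lfdr_update(l_reg_value=0xDEAD, s_logic_value=0xBEEF))
    # {'write_s_reg': False, 'error': True}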

The delay between the clocking of the L-PIPELINE and S-PIPELINE registers of a pipeline stage introduces the spatial and temporal redundancy required to detect timing errors. The contamination delay of the S-LOGIC needs to be increased to a value greater than the delay between LGClk and SGClk. This is important to ensure that the S-LOGIC outputs are not changed by the values newly registered in the L-PIPELINE register.

The error detection and recovery process does not differentiate between errors occurring in the S-PIPELINE and the L-PIPELINE. The L-PIPELINE is susceptible to both soft errors and timing errors, while the S-PIPELINE is susceptible to soft errors. It is not possible to differentiate between these errors. The transient fault tolerance mechanism is overloaded to detect and recover from any timing errors that might occur because of overclocking.

Figure 7.2 illustrates the entire error detection and recovery mechanism when a soft error or a timing error occurs in the L-PIPELINE of P-STAGE N (see Figure 7.1). The figure shows the instructions that are being executed in the L-LOGIC and the S-LOGIC of the three pipeline stages. If the instructions in the L-PIPELINE execute without errors, the conjoined pipeline proceeds without any interruption.

An error occurrence is highlighted in cycle 3. The error occurs during the execution of INST 1 in the L-LOGIC of P-STAGE N. This error is not yet detected, leading to the output of L-LOGIC being stored in the registers of P-STAGE N.

The L-PIPELINE of P-STAGE N starts executing INST 2 in cycle 4. However, the L-PIPELINE of P-STAGE N+1 executes INST 1 in cycle 4 with the incorrect result provided by P-STAGE N. This needs to be corrected. After the S-LOGIC of P-STAGE N finishes execution, the E-DETECT module detects the mismatch between the L-PIPELINE register values and the outputs of the S-LOGIC. The error flag is asserted, triggering the recovery process.

Figure 7.2 Waveforms highlighting error detection and recovery in a Conjoined Pipeline system

The Error signals from all the pipeline stages are combined using "or" gates, and latched at the rising edge of EClk in the CLK STALL CNTRL module. The latched signal is referred to as GError in Figure 7.2. The global error signal ensures correct execution of the pipeline, and helps in global recovery. The GError signal is asserted or deasserted only when the recovery counter (RCounter) is B"00" or B"11". This guarantees two cycles for the L-PIPELINE to re-execute the erroneous instruction, which is necessary since the error might have been caused by overclocking. It can be observed in Figure 7.2 that on error detection the entire pipeline goes back by one instruction. It is also possible to insert bubbles to avoid re-execution of instructions in the forward pipeline stages.

The entire error detection and recovery mechanism happens in three cycles. The cycle counts are with respect to the leader clock. The following sequence takes place in cycles 4, 5 and 6 as a result of the error in cycle 3:

• FIRST (see cycle 4 in Figure 7.2): The Error flag is asserted by the E-DETECT module, and GError goes high at the rising edge of EClk. Immediately after GError goes high, SClkStall goes low before the SClk edge, and LoadSP goes high before the LClk edge. As a result, the S-PIPELINE registers are not updated, as SGClk is low, and the values from the S-PIPELINE registers are loaded into the corresponding L-PIPELINE registers. RCounter is incremented at the end of the cycle.

• SECOND: LClkStall goes low at the negative edge of LClk. This stalls LGClk in the next cycle and avoids any glitches. SClkStall remains low. Both LClkStall and SClkStall are active low signals. In this cycle, neither the L-PIPELINE nor the S-PIPELINE registers are updated. RCounter is incremented at the end of the cycle.

• THIRD: LClkStall goes high at the negative edge of LClk. SClkStall remains low. RCounter is incremented at the end of the cycle.

At the end of cycle 6, the erroneous instruction completes its re-execution successfully. In cycle 7, the execution of CPipe returns to normal. In Figure 7.2, it can be seen that during the recovery process the S-LOGIC computes intermediate results, but the outputs are not written to the S-PIPELINE registers.
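The three-cycle sequence can be summarized as a signal trace; the table-driven sketch below (illustrative Python, with signal names following Figure 7.2) replays the FIRST, SECOND and THIRD cycles described above:

    # Table-driven replay of the three-cycle recovery sequence (signal
    # names follow Figure 7.2; the trace encoding itself is ours).
    # LClkStall and SClkStall are active low: 0 means the corresponding
    # gated clock is held. RCounter is incremented after each cycle.
    RECOVERY_TRACE = [
        # label     GError LoadSP LClkStall SClkStall
        ("FIRST",   1,     1,     1,        0),  # S-regs hold; S values reloaded into L-regs
        ("SECOND",  1,     0,     0,        0),  # both register sets hold during re-execution
        ("THIRD",   1,     0,     1,        0),  # LGClk resumes; S-regs still held
    ]

    def replay_recovery():
        for label, gerror, loadsp, lclkstall, sclkstall in RECOVERY_TRACE:
            print(f"{label:6s} GError={gerror} LoadSP={loadsp} "
                  f"LClkStall={lclkstall} SClkStall={sclkstall}")

    replay_recovery()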

Possible Error Scenarios: The possible error scenarios include a soft error or timing error in the L-PIPELINE, combined with a soft error in the S-PIPELINE. Intermittent faults are also possible in either of the pipelines. The error detection and recovery mechanism described is robust and can handle any number of errors in a single cycle, and all possible combinations of errors. Table 7.1 shows the possible error scenarios that can occur in a CPipe system.

Table 7.1 Possible error scenarios

    Case    L-PIPELINE      S-PIPELINE
    1.      Soft Error      No Error
    2.      Soft Error      Soft Error
    3.      Timing Error    No Error
    4.      Timing Error    Soft Error
    5.      No Error        Soft Error

Intermittent Faults: If, in cycle 7, the global error signal does not go low, the entire recovery process is repeated. The recovery process is triggered repeatedly until the error disappears. This allows recovery from transients that persist for a short duration of time. Intermittent faults that occur in bursts are handled similarly by the CPipe architecture.

Permanent Faults: If after a significant number of retries the error persists and the pipeline is stuck in a loop, then the fault is declared permanent. The permanent fault flag is asserted, indicating a system failure. In this case, it could be possible to reconfigure the CPipe system to run using only the L-PIPELINE or only the S-PIPELINE, with no fault tolerance and overclocking. In this thesis, this possibility is not pursued further. However, it can be noted here that with additional logic, we can choose a combination of L-PIPELINE and S-PIPELINE stages, if the need arises during reconfiguration.

Timing Errors: Timing errors occur when the system is overclocked. However, when the CPipe is used primarily for fault tolerance, and not for improving performance, the signal Overclock is deasserted, indicating that there will be no timing errors in the L-PIPELINE. In this case, error recovery takes two cycles, as LClk is not stalled to accommodate recovery from timing errors.

Fault Tolerance Analysis: The possibility of the CPipe architecture not detecting a fault is extremely low. One possibility is a timing error occurring in the L-PIPELINE and a soft error occurring in the S-PIPELINE, with the error flag not being asserted because of identical corruption. This possibility is extremely low, since a single mismatch anywhere in the system asserts the error flag, and timing errors and soft errors affect multiple signals, consequently affecting several flip-flops in the registers. Another case is when both the S-LOGIC and L-LOGIC are affected by soft errors. The same soft error cannot affect both copies of the logic; if it did, it would be detected by the previous stage E-DETECT module, because the L-PIPELINE register outputs feeding the S-LOGIC also go to the E-DETECT module. Another failure possibility is a transient pulse occurring right after the error signal is latched and before the S-LOGIC outputs are stored in the S-PIPELINE registers, corrupting the S-PIPELINE register values. This duration is extremely small (one NOT and one AND gate, plus global routing delay), and given the distribution of soft errors in time and space, this error possibility is insignificant. The error register is metastability hardened, and any small variation will make the GError signal go high. In essence, the CPipe architecture is capable of providing very high degrees of fault coverage.

7.2 Timing Requirements

Figure 7.3 Dynamic frequency scaling

For proper operation of systems implementing the CPipe architecture, it is of paramount importance to respect the timing relationship between the three clocks, namely LClk, EClk and SClk. To support reliable dynamic overclocking, certain governing conditions need to be met at all times. Figure 7.3 shows the parameters that control the full range of frequencies, FMin to FMax, that are possible when a system is dynamically overclocked beyond the worst-case operating frequency, FMin. For our CPipe architecture, we extend the clock generation methodology described in SPRIT3E [92]. In a CPipe system, there are three clocks, two of which, EClk and SClk, are phase shifted versions of LClk. Additionally, CPipe requires minimum phase shift guarantees for correct operation.

The following parameters, which can be estimated for the FMin settings of any digital system, are defined below to calculate the dynamic frequency operation range:

• Let TMax represent the worst-case time period required by the digital system under consideration.

• Let TErr represent the time required for error detection and assertion of the global error signal. This includes the E-DETECT module delay and the generation of the global error signal from the pipeline stage error signals.

• Let TSStall represent the time required to stall SClk to prevent an incorrect value from being loaded into the S-PIPELINE registers. This includes the clock gating delay and the clock propagation delay.

• Let TLSP represent the time required to assert the LoadSP signal on detection of an error, the routing delay, and the multiplexer delay required to load the S-PIPELINE register values into the L-PIPELINE registers.

• Let TFminCD represent the minimum contamination delay of the S-LOGIC of all the pipeline stages.

Figure 7.3 shows the time available for the above operations under FMin and FMax settings. PSMin, defined by Equation 7.1, represents the minimum required phase shift to ensure correct operation, and it should satisfy Equation 7.2.

PSMin = TErr + TSStall (7.1)

PSMin ≤ TFminCD (7.2)

Clock Timing Requirements: The LClk active edge occurs first, followed by the EClk edge, and then the SClk edge. The EClk time lag should be at least TErr, and the SClk phase shift amount should be at least PSMin. Fixing the phase shift between EClk and SClk as TSStall makes dynamic frequency operation easier, since only the phase shift between LClk and EClk needs to be controlled. Also, maintaining the phase shift value between EClk and SClk at the bare minimum reduces the possibility of common mode failure. The rest of the discussion in this chapter is based on this approach. The effects that lead to variable circuit delays, such as temperature, voltage, and process variations, also cause variations in the clock period, referred to as clock skew. In order to account for this possibility, the worst-case clock skew is assumed when determining the maximum frequency scaling achievable, and is added to the estimation of TErr, TLSP and TSStall.

Dynamic Overclocking: When dynamic overclocking is done to improve performance, the following additional parameters need to be derived for FMax settings:

• Let TMin represent the minimum clock period at which the system is guaranteed to recover from timing errors that might happen as a result of overclocking.

• Let PSMax represent the maximum phase shift required to ensure correct operation.

• Let TFmaxCD represent the minimum contamination delay of the S-LOGIC of all the pipeline stages.

As seen in Figure 7.3, the only parameter that becomes critical because of frequency scaling is TLSP. TErr and TSStall are taken care of by the clock timing requirements, and their criticality remains the same as in the FMin settings. Depending on the extent of overclocking required, TFmaxCD is fixed at any value within the range given by Equation 7.3. If a higher value is chosen, then the contamination delay of the S-LOGIC of the pipeline stages needs to be increased above this value.

PSMin ≤ TFmaxCD ≤ TMin (7.3)

The error detection and, if necessary, the recovery should be initiated before the L-PIPELINE registers receive the next set of values. The minimum clock period, TMin, is given by Equation 7.4. The corresponding phase shift, PSMax, is given by Equation 7.5, and should satisfy Equation 7.6.

TMin ≤ (TMax + TErr + TLSP) / 2 (7.4)

PSMax = TMax − TMin + PSMin (7.5)

PSMax ≤ TFmaxCD (7.6)

Let TPS represent the adjustable phase shift value. Equation 7.7 defines the range of phase shift values, while Equation 7.8 defines the range for TPS.

PSMin ≤ TPS + TErr + TSStall ≤ PSMax (7.7)

0 ≤ TPS ≤ TMax − TMin (7.8)

For a system under consideration, the values of TMin and PSMax are derived using the above method. Then, for any frequency F, such that FMin ≤ F ≤ FMax, the associated time period T is given by Equation 7.9 and the phase shift PS by Equation 7.10.

T = TMax − TPS (7.9)

PS = PSMin + TPS (7.10)
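Equations 7.1 through 7.10 lend themselves to a small calculator; the sketch below (illustrative Python) takes Equation 7.4 with equality, as Section 7.4 later does, and reproduces the TMin and PSMax values reported there for the two-stage arithmetic pipeline:

    def cpipe_timing(t_max, t_err, t_sstall, t_lsp):
        """Derive the CPipe dynamic-frequency window (all times in ns).

        Implements Equations 7.1, 7.4 (taken with equality) and 7.5.
        Returns (ps_min, t_min, ps_max).
        """
        ps_min = t_err + t_sstall              # Equation 7.1
        t_min = (t_max + t_err + t_lsp) / 2.0  # Equation 7.4, with equality
        ps_max = t_max - t_min + ps_min        # Equation 7.5
        return ps_min, t_min, ps_max

    def clock_setting(t_ps, t_max, ps_min):
        """Period and phase shift for an adjustable shift TPS (Eqs. 7.9, 7.10)."""
        return t_max - t_ps, ps_min + t_ps

    # Values reported in Section 7.4 for the two-stage arithmetic pipeline:
    print(cpipe_timing(9.1, 1.7, 0.67, 1.85))
    # (2.37, 6.325, 5.145), matching the quoted ~6.33ns TMin and ~5.14ns PSMax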

Fixed Frequency Operation: For operating without any run-time optimizations, the frequency of the three clocks is fixed at the desired operating frequency satisfying the above conditions, and the required phase shifts between the clocks are enforced. Under these conditions, the CPipe architecture offers protection from soft errors and permanent fault detection, while achieving performance improvement if the error rate is low. Also, it is important to ensure that the contamination delay of the S-LOGIC of all the pipeline stages is more than the phase shift required for this frequency. If the frequency is fixed at the worst-case operating frequency, the CPipe system is guaranteed to have the same performance as an unprotected system, while offering high reliability. If operating at or below the worst-case operating frequency, the Overclock signal is deasserted, enabling two cycle recovery from transient errors.

Figure 7.4 Modular implementation

7.3 Implementation Considerations

The CPipe architecture is easy to integrate in any system during the RTL/structural level design phase. After the modules representing pipeline combinational logic are designed, they can be assembled together by using the local fault detection and recovery (LFDR) circuits instead of the registers. As explained earlier, the LFDR module includes error detection logic and both leader and shadow pipeline registers. The LFDR circuit is designed as a separate module, with its data width configurable. This modular design makes it easy to replicate the logic. The connectivity is done as explained in the previous sections, and the CPipe system implementation is complete. Figure 7.4 illustrates the modular implementation of the CPipe architecture, where L and S stand for leader logic and shadow logic, respectively. This can be extended to any number of pipeline stages.

One of the major issues that needs to be taken care of is the clocking of the LFDR circuit, ensuring that the timing requirements derived in Section 7.2 are met. For pipeline stages with short latencies, the error detection delay TErr will be significant. Also, because of the global routing delays, the performance gain will be modest, as the frequency cannot be scaled much. However, in most pipelined designs, the longest pipeline stage limits the frequency. If the critical path in the slowest pipeline stage is not exercised often, then it can be overclocked, and all other pipeline stages will benefit. The CPipe approach guarantees a high degree of fault coverage for all designs, while offering performance gains whenever possible. One significant benefit derived from implementing the CPipe architecture is the reduction in design optimization time needed to achieve a particular performance target. The adaptive clocking mechanism allows performance to match or exceed expected levels during run-time.

We implemented and evaluated our clock generation methodology on a Xilinx Virtex 5 FPGA [105]. The Virtex 5 FPGA has support for digital clock managers (DCMs) and phase lock loops (PLLs). For our purpose, we used the PLL in frequency synthesis mode and generated the three clocks with the required phase shifts between them. The PLL supports six output clocks with different phase shifts and frequencies. Using the dynamic reconfiguration port, it is possible to reconfigure the clock frequency, as well as the phase shift, during run-time. Figure 7.5 shows the programmable part of the PLL. The multiplier and divider values are varied to get the required frequency.

Figure 7.5 Clock generation circuitry

We use three of the output clocks. Each of the output clocks is programmed to have the required phase shift. Based on information provided by Xilinx, we adjusted the phase shift in increments of 11.25 degrees, where 0 degrees corresponds to no phase shift, and 360 degrees corresponds to one full clock period shift. This corresponds to 32 different clock frequencies between the minimum and maximum possible frequencies. Finer phase shifts are also possible. Also, SClk is phase shifted by a constant value from EClk based on the value of TSStall, which for experimentation was fixed at 1ns. When the output of the VCO is close to 1GHz, the PLL locks in approximately 1400 cycles, measured at a reference clock frequency of 100MHz. The output of the VCO is further divided to generate each of the output clocks. The PLL takes upwards of 1500 cycles when the VCO output frequency is less than 1GHz. We used a look-up table for reconfiguring, during run-time, the values of the multiplier, the divider, and the output clock dividers and phase shifts.
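A small helper shows how a desired phase shift maps onto these discrete settings; the function is our own illustration, assuming the 11.25-degree granularity stated above:

    def phase_shift_setting(desired_shift_ns, period_ns, step_deg=11.25):
        """Quantize a desired phase shift to the PLL's discrete steps.

        360 degrees corresponds to one full clock period, so 11.25-degree
        increments give 32 settings per period. Returns the step index
        and the realized shift in ns.
        """
        step_ns = period_ns * step_deg / 360.0
        steps = round(desired_shift_ns / step_ns)
        return steps, steps * step_ns

    # A 2.37ns shift at a 9.1ns period lands on step 8 (about 2.28ns):
    print(phase_shift_setting(2.37, 9.1))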

7.3.1 Two Clock Approach

Clock distribution and routing inside a design is one of the major design issues. At high frequencies, clock skew will limit the implementation of the CPipe architecture, as it imposes strict timing requirements between the three clocks at all times to guarantee correct execution.

With a modest increase in implementation overhead, the CPipe architecture can operate with two clocks. The leader clock is inverted inside the LFDR circuit to locally generate the shadow clock, and the clock stall logic is moved inside the LFDR circuit. With this setup, only one clock needs to be routed inside the design, since the error clock, EClk, is used only to clock the error register. The duty cycle of LClk is adjusted to maximize performance gain. Since a 50% duty cycle is not necessary for the system to operate correctly, it can be adjusted instead of varying the phase shift. For this approach to work, a few conditions need to be taken care of.

Let THigh represent the high time, and TLow the low time, of LClk. The assertion and global routing delays of the two stall signals, LClkStall and SClkStall, and of the load S-PIPELINE signal, LoadSP, determine TLow. Since these delays do not change with frequency scaling, sufficient time should be guaranteed at the FMin settings. Also, the phase shift of EClk is kept below THigh − TSStall. We implemented and evaluated this approach too on the Virtex 5 FPGA. The duty cycle of each output clock is as easily programmable as its phase shift.
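The stated conditions reduce to two simple checks; the sketch below (illustrative Python, with parameter names of our own choosing) encodes them:

    def two_clock_feasible(t_high, t_low, t_stall_paths, eclk_shift, t_sstall):
        """Check the two-clock duty-cycle conditions stated above.

        t_stall_paths bounds the assertion plus global routing delay of
        LClkStall, SClkStall and LoadSP, which must fit within TLow at
        the FMin settings; the EClk phase shift must stay below
        THigh - TSStall. All times are in ns.
        """
        return t_low >= t_stall_paths and eclk_shift <= t_high - t_sstall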

7.4 Experiments and Results

To prove the viability of the CPipe system architecture, we performed the following experimental runs on a two stage arithmetic pipeline. The CPipe system we designed performs 64-bit addition in the first stage, and a 32-bit multiplication in the second stage. The 64-bit carry look ahead adder output is divided into two words, which are fed to the multiplier as multiplicand and multiplier.

We synthesized our design in the Synopsys Design Compiler, using the 45nm OSU standard cell library [91]. From static timing analysis reports, we estimated the value of TMax as 9.1ns, TErr as 1.7ns, TSStall as 0.67ns, and TLSP as 1.85ns. Then, using the equations derived in Section 7.2, we calculated the values of TMin as 6.33ns and PSMax as 5.14ns. Based on these values, the synthesis was performed again with minimum delay constraints to increase the contamination delay of the S-LOGIC blocks. Since increasing contamination delay increases area and power, we chose not to overclock all the way to TMin, and our implementation supported overclocking down to a 7ns clock period. We used the SOC Encounter tool to lay out the design and to extract standard delay format (SDF) timing information. We performed timing simulations on the SDF annotated post layout design to evaluate fault coverage and performance improvement. We designed the dynamic clock generation circuit in VHDL and used the delay values for locking that we obtained from the Xilinx Virtex 5 experimentation.

In our experiment, for a 1ms run, we injected faults randomly in time and space and evaluated the fault tolerance capability of the design for the various fault types. Our random fault injectors introduced approximately 100 transient faults and 3 intermittent faults per 100,000 cycles. Some of the intermittent faults persisted longer, simulating a permanent fault. The pipeline output was verified for correctness by comparing it with a run without fault injection. Timing errors occur as a result of overclocking, and the timing error recovery process was verified similarly.

Table 7.2 Fault injection results

                               Transient Faults      Intermittent Faults    Permanent Faults
    Mode    Run  Operations    Injected  Detected    Injected  Detected     Injected  Detected
    NOOC    1    109818        981       131         27        27           5         5
    NOOC    2    109892        937       105         26        26           5         5
    NOOC    3    109772        941       124         27        27           5         5
    MAXOC   1    141054        913       216         27        27           5         5
    MAXOC   2    140976        953       193         27        27           5         5
    MAXOC   3    140879        919       214         31        31           5         5
    DYNOC   1    132975        925       207         26        26           5         5
    DYNOC   2    133053        914       177         25        25           5         5
    DYNOC   3    133031        933       190         26        26           5         5

We repeated the experiment with three different random seeds, and performed the experiments in three different modes: no overclocking (NOOC: TMin = TMax = 9.1ns), maximum overclocking (MAXOC: TMin = TMax = 7ns), and dynamic overclocking (DYNOC: TMin = 7ns, TMax = 9.1ns). Table 7.2 reports results for the three types of faults injected, and it also presents the number of correct operations performed in the 1ms time duration.

In [92], a 44% performance improvement was achieved for a multiplier circuit at an error rate target of 1%. However, because of the limitations imposed by the clock timing requirements, the maximum frequency achievable in CPipe is limited. Even while running at the maximum possible frequency, we observed very few timing errors for randomly generated inputs, as reported in Table 7.3. From the results, we can see that when running at the worst-case frequency, fewer transient errors get detected, as most of them are masked by the longer clock period. In dynamic overclocking mode, we perform the modified binary search algorithm, described in Algorithm 1, over the allowed range of frequencies, and also account for the clock scaling penalty. Always running at the maximum frequency yields the best results for the two-stage arithmetic pipeline. Even when exposed to a severe fault campaign, we obtain approximately a 28% performance increase over NOOC while operating at MAXOC; DYNOC offers about a 21% performance increase over NOOC.
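Algorithm 1 is not reproduced in this chapter; the sketch below is one plausible rendering of such a target-error-rate driven binary search over the clock period, omitting the clock scaling penalty bookkeeping that the algorithm additionally charges per frequency switch:

    def tune_period(t_min, t_max, error_rate_at, target=0.01, tol=0.05, steps=8):
        """One plausible form of a target-error-rate binary search.

        error_rate_at(period) is assumed to return the error rate
        measured over a sampling interval at that clock period; t_min
        and t_max bound the search (TMin and TMax in ns).
        """
        lo, hi = t_min, t_max          # lo is fast/risky, hi is slow/safe
        period = (lo + hi) / 2.0
        for _ in range(steps):
            rate = error_rate_at(period)
            if abs(rate - target) <= tol * target:
                break                  # close enough to the target rate
            if rate > target:
                lo = period            # too many errors: lengthen the period
            else:
                hi = period            # headroom left: shorten the period
            period = (lo + hi) / 2.0
        return period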

Table 7.3 Timing errors

    Mode    Run  Operations  Timing Errors
    NOOC    1    109818      0
    NOOC    2    109892      0
    NOOC    3    109772      0
    MAXOC   1    141054      13
    MAXOC   2    140976      10
    MAXOC   3    140879      13
    DYNOC   1    132975      13
    DYNOC   2    133053      11
    DYNOC   3    133031      12

We also designed and simulated a five stage conjoined in-order pipeline processor. The conjoined processor, implemented in 45nm technology, supports operand forwarding and is based on the DLX instruction set architecture. The purpose of this experiment was to show that the CPipe architecture works correctly in the presence of feedback signals.

We ran three different microbenchmarks to evaluate the conjoined processor architecture. The microbenchmarks were written in assembly. The RandGen application performs a simple random number generation to give a number between 0 and 255. One million random numbers are generated, and the distribution of the random variable is kept in memory. The MatrixMult application multiplies two 50x50 integer matrices and stores the result into memory. The BubbleSort program performs a bubble sort on 5,000 half-word variables.
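For reference, a high-level equivalent of the RandGen microbenchmark is sketched below; the original is hand-written DLX assembly, and the particular linear congruential generator used here is an assumption for illustration only:

    # High-level equivalent of the RandGen microbenchmark; the LCG
    # constants are assumed, not taken from the original assembly.
    def randgen(n=1_000_000):
        hist = [0] * 256                 # distribution kept in memory
        seed = 1
        for _ in range(n):
            seed = (seed * 1103515245 + 12345) & 0x7FFFFFFF
            hist[seed & 0xFF] += 1       # a number between 0 and 255
        return hist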

The performance of the three modes is shown in Figure 7.6. The fault injection campaign is similar to the arithmetic pipeline case. From static timing analysis reports, we estimated the value of TMax as 6ns, TErr as 1.4ns, TSStall as 0.67ns, and TLSP as 1.7ns. We estimated TMin to be 4.55ns. From the chart, we see that when overclocked, all three applications show significant performance gains while operating in the DYNOC and MAXOC modes, even when subjected to a severe fault campaign. For applications running for a longer time, the performance benefits achieved through reliable overclocking are substantial.

Figure 7.6 Execution time for three different applications running on Conjoined Processor in various modes

For our approach, there are no timing overheads on the leading pipeline except for the MUXing delay. The error detection is done in parallel with useful computation. Superficially, the area overhead is the cost of a second core along with the overclocking and error detection overhead. For the two stage conjoined arithmetic pipeline, the post-layout area is estimated to be 1.72E5 um2, which works out to 285% of the size of a non fault tolerant arithmetic pipeline. The DLX processor area is about 310% of the size of the original processor. A significant component of the area overhead results from the contamination delay compensation of the S-LOGIC. By designing buffers specifically for this purpose and a robust algorithm to increase short-path delays, this overhead can be alleviated.

CHAPTER 8. CONCLUSIONS AND FUTURE WORK

Advances in computing technologies have transformed society and have allowed formation and growth of many communities that were beyond imagination two decades ago. To sustain this growth, advances in microprocessor architectures are critical. The continued shrinking of VLSI circuits has complemented architectural innovations ever since silicon transistors began to revolutionize our world.

As device scaling reaches its limits, reliable overclocking has the capability to extend the quest for high performance further, until suitable device alternatives are found. As more and more people get their hands on computers, and more and more day-to-day activities get automated, it becomes important to have techniques that adapt to the environment and limit power consumption whenever possible, without leaving the end user unhappy because of a lack of performance.

This thesis takes into account the wide range of applicability of digital systems, which subjects them to diverse demands in terms of performance, power consumption and dependability, as they perform a plurality of tasks and run in a multitude of operating environments. Considering that these demands are interrelated and need to be addressed cohesively, as improving one metric alone is counterproductive for another, we developed schemes that combine fault tolerance, overclocking and thermal throttling techniques to dynamically enhance computer system performance, reliability and thermal management.

As demonstrated by the successful timing error tolerant overclocking methodology, the current way of estimating the operating frequency for synchronous circuits is far too conservative. The SPRIT3E framework reuses existing superscalar pipeline logic whenever possible, resulting in a modest error detection and recovery logic overhead. Moreover, as silicon feature size decreases, architects have increasingly large silicon real estate available to them, so this trade-off to achieve high performance is acceptable. This work extends the SPRIT3E framework by taking advantage of the margins produced by the worst-case design mentality.

In this work, we studied the various factors that limit overclocking. Contamination delay had a serious impact on the extent of overclocking. We looked at ways to manipulate the contamination delay of logic circuits to favorably benefit reliable overclocking. In this thesis, we developed an analysis framework that enables the understanding of several nuances of reliable overclocking. We explored the benefits of reliable overclocking, and looked at ways to make it better. Our results indicate that setting a target error rate of 5% yields significant run-time benefits, while minimizing energy-delay product.

We also presented an initial study of the effects of reliable overclocking on on-chip temperatures [93]. In addition, we analyzed the consequent effects on the lifetime reliability of these systems. We considered a reliable overclocking framework and studied its thermal behavior compared to worst-case design. Our work in this dissertation is an initial exploration of dynamic thermal management in reliably overclocked systems. We are continuing this work by developing a powerful thermal management scheme that enhances performance as much as possible while operating well within the thermal limits, guaranteeing an extended system lifetime. The results we have obtained at this juncture are very promising, opening up many different directions for the near future.

Our thermal throttling approach can be extended by adding a dynamic voltage-frequency control technique. Based on the work done in this thesis, we are developing a scheme called DVARFS, which explores a new direction to manage on-chip thermal conditions and improves energy efficiency for processors, especially for battery operated devices [71]. The DVARFS mechanism facilitates reliable overclocking under thermal bounds. Our technique currently relies on an ad-hoc scheme for switching frequencies; a prediction based scheme that makes wise decisions will reduce the time taken by the clock controller to decide on the best operating frequency for the currently executing application.

This thesis makes significant research contributions at a time when improving performance generation after generation is becoming difficult, as technology scaling in the ultra deep sub-micron region is both expensive and time consuming. Challenges from process variations can possibly wipe out the benefits of an entire technology node. Overclocking has become mainstream, and several hardware vendors are allowing high-performance enthusiasts to overclock their systems. Reliable overclocking has the capability to enhance the lifetime of a technology node by extending the performance gains achievable with that generation.

The research presented in this thesis can enable hardware vendors to provide overclocking in their mainstream chips. Also, in this expensive semiconductor business, our technique offers smaller players a possibility to meet design goals without resorting to expensive process technology upgrades.

We also compared speculative reliable overclocking with technology scaling. Historically, each new technology generation improves performance by about 30%. Our results indicate that reliable overclocking improves performance by more than a switch to the next technology node. Even though power and energy consumption increase with reliable overclocking, the energy-delay product metric indicates that reliable overclocking is a good choice for high-performance, energy-conscious systems; the short derivation below makes the trade-off explicit.
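
The energy-delay argument can be stated in one line; the symbols s (speedup factor) and p (power increase factor) are ours, introduced only for this illustration.

    % If reliable overclocking shortens runtime t by a factor s while
    % raising average power P by a factor p, the energy-delay product
    % ED = (Pt)t scales as
    \[
        E'D' = (pP)\left(\frac{t}{s}\right)\left(\frac{t}{s}\right)
             = \frac{p}{s^{2}}\, ED .
    \]
    % EDP therefore improves whenever p < s^2: a speedup of s = 1.3,
    % comparable to one technology generation, tolerates up to a 69%
    % increase in average power before the energy-delay product degrades.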

System dependability is a core concern that is often neglected because of its impact on performance. More often than not, system designers must make difficult trade-offs between reliability and high performance. In this dissertation, we proposed a solution that guarantees fault-tolerant execution without compromising the performance of the system. The proposed solution integrates overclocking with redundant execution, thereby providing tolerance to soft errors, timing errors, intermittent faults, and permanent faults.

The CPipe architecture relies on the organization of redundancy and adaptive clocking to improve both fault coverage and performance [94]. One of the salient features of our approach is the capability to trigger recovery immediately on error detection, without requiring any checkpointing, thereby saving the time and space needed to store the current execution state. The CPipe architecture protects both the datapath and the control signals. In essence, CPipe presents a viable high-performance, high-reliability solution.
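
To convey the flavor of this checkpoint-free recovery, the fragment below models a duplicated stage boundary in which a mismatch immediately restores the aggressively clocked copy from its redundant counterpart. The structure names and the single-cycle restore are simplifying assumptions, not the CPipe implementation.

    #include <stdint.h>
    #include <stdbool.h>

    /* Simplified model of a duplicated pipeline boundary: the leading
     * copy is clocked aggressively, the trailing copy conservatively,
     * so the trailing value doubles as the recovery state and no
     * separate checkpoint ever needs to be stored. */
    typedef struct {
        uint64_t leading;    /* aggressively clocked result */
        uint64_t trailing;   /* known-good redundant result */
    } DualReg;

    /* Returns true when recovery fired this cycle. */
    bool compare_and_recover(DualReg *r)
    {
        if (r->leading != r->trailing) {
            r->leading = r->trailing;  /* restore in place, stall briefly */
            return true;
        }
        return false;
    }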

In the future, to minimize the area overhead of the CPipe architecture, new low-overhead, high-reliability, high-performance architectures can be developed. In that direction, our work in this dissertation laid the foundation for two techniques, Soft Error Mitigation (SEM) and Soft and Timing Error Mitigation (STEM), that protect combinational logic blocks from soft errors [6]. The first technique, SEM, is based on distributed and temporal voting of three registers and moves the soft error detection overhead off the critical path of the system. The second technique, STEM, adds timing error detection capability to guarantee reliable execution in aggressively clocked designs that enhance system performance by operating beyond the worst-case clock frequency.
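
The voting step at the heart of SEM can be illustrated with a bitwise two-of-three majority over three register samples; this is only a behavioral sketch, with the distributed placement and temporal sampling offsets abstracted away.

    #include <stdint.h>

    /* Bitwise 2-of-3 majority: a single upset in any one of the three
     * sampled copies is masked without lengthening the critical path. */
    uint64_t majority3(uint64_t a, uint64_t b, uint64_t c)
    {
        return (a & b) | (b & c) | (a & c);
    }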

As part of future work, it will be interesting to see how reliable overclocking applies to superpipelined cores. The main concerns are the global propagation of the stall signal on error detection to initiate proper recovery, precise control of the phase shift between the two clocks at high frequencies, handling multi-cycle execution within a pipeline stage, and the distribution of both clock signals to all pipeline registers in the design.

Another interesting direction is independent overclocking of individual cores in a multi-core chip. Scheduling tasks based on on-chip temperature, core speed, and the ability to overclock can allow multi-core processors to enhance single-threaded application performance. The work can also be extended to real-time systems, enabling faster, energy-efficient execution of tasks.

Overall, our research examined the possibility of integrating performance, reliability, and energy efficiency. Past solutions have typically addressed only one of these issues at a time. An integrated solution is attractive because it offers the best of each while addressing the primary issues that plague modern computing systems, and a unified approach could yield significant cost and performance benefits. As computing machines become ubiquitous, our research helps make them highly reliable and energy efficient without compromising performance. The techniques we introduced and explored in this work hold good prospects for further research.

Bibliography

[1] AMD. AMD 790GX chipset. http://www.amd.com/us/products/desktop/chipsets/7-series-integrated/Pages/amd-790gx-chipset.aspx, Date Accessed: June 20, 2009.

[2] AMD. AMD PowerNow! technology. http://www.amd.com/epd/processors/6.32bitproc/8.amdk6fami/x24267/24267a.pdf, Date Accessed: June 20, 2009.

[3] T. Austin, V. Bertacco, D. Blaauw, and T. Mudge. Opportunities and challenges for better than worst-case design. In Asia South Pacific Design Automation Conference, pages 2–7, 2005.

[4] T. Austin, E. Larson, and D. Ernst. SimpleScalar: An infrastructure for computer system modeling. IEEE Computer, 35(2), February 2002.

[5] T. M. Austin. DIVA: A reliable substrate for deep submicron microarchitecture design. In International Symposium on Microarchitecture, pages 196–207, 1999.

[6] N. D. Avirneni, V. Subramanian, and A. K. Somani. Low overhead soft error mitigation techniques for high-performance and aggressive systems. In IEEE/IFIP Dependable Systems and Networks, pages 185–194, 2009.

[7] R. C. Baumann. Radiation-induced soft errors in advanced semiconductor technologies. IEEE Transactions on Device and Materials Reliability, 5(3):305–316, September 2005.

[8] M. Bezdek. Utilizing timing error detection and recovery to dynamically improve superscalar processor performance. Master's thesis, Iowa State University, 2006.

[9] S. Borkar. Design challenges of technology scaling. IEEE Micro, 19(4):23–29, 1999.

[10] S. Borkar, T. Karnik, and V. De. Design and reliability challenges in nanometer technologies. In Proceedings of the 41st Annual Conference on Design Automation, page 75, 2004.

[11] S. Borkar, T. Karnik, S. Narendra, J. Tschanz, A. Keshavarzi, and V. De. Parameter variations and impact on circuits and microarchitecture. In Proceedings of the 40th Annual Design Automation Conference, pages 338–342, 2003.

[12] J. A. Bower, W. Luk, O. Mencer, M. J. Flynn, and M. Morf. Dynamic clock-frequencies for FPGAs. Microprocessors and Microsystems, 30(6):388–397, 2006.

[13] K. A. Bowman, S. G. Duvall, and J. D. Meindl. Impact of die-to-die and within-die parameter fluctuations on the maximum clock frequency distribution for gigascale integration. IEEE Journal of Solid-State Circuits, 37(2):183–190, 2002.

[14] D. Brooks and M. Martonosi. Dynamic thermal management for high-performance microprocessors. In Proceedings of the 7th International Symposium on High-Performance Computer Architecture, 2001.

[15] D. Brooks, V. Tiwari, and M. Martonosi. Wattch: A framework for architectural-level power analysis and optimizations. In Proceedings of the 27th Annual International Symposium on Computer Architecture, pages 83–94, 2000.

[16] T. D. Burd, T. A. Pering, A. J. Stratakos, and R. W. Brodersen. A dynamic voltage scaled microprocessor system. IEEE Journal of Solid-State Circuits, 35(11):1571–1580, November 2000.

[17] D. Burger and T. M. Austin. The SimpleScalar tool set, version 2.0. ACM SIGARCH Computer Architecture News, 25(3):13–25, 1997.

[18] C. L. Chen and M. Y. Hsiao. Error-correcting codes for semiconductor memory applications: A state-of-the-art review. IBM Journal of Research and Development, 28(2):124–134, 1984.

[19] B. Colwell. The zen of overclocking. IEEE Computer, 37(3):9–12, March 2004.

[20] C. Constantinescu. Trends and challenges in VLSI circuit reliability. IEEE Micro, 23(4):14–19, 2003.

[21] E. Cota, F. Lima, S. Rezgui, L. Carro, R. Velazco, M. Lubaszewski, and R. Reis. Synthesis of an 8051-like micro-controller tolerant to transient faults. Springer Science Journal of Electronic Testing, 17(2):149–161, April 2001.

[22] S. Das, D. Roberts, S. Lee, S. Pant, D. Blaauw, T. Austin, K. Flautner, and T. Mudge. A self-tuning DVS processor using delay-error detection and correction. IEEE Journal of Solid-State Circuits, 41(4):792–804, April 2006.

[23] B. Davari, R. H. Dennard, and G. G. Shahidi. CMOS scaling for high performance and low power: The next ten years. Proceedings of the IEEE, 83(4):595–606, April 1995.

[24] R. H. Dennard, F. H. Gaensslen, H. N. Yu, V. L. Rideout, E. Bassous, and A. R. LeBlanc. Design of ion-implanted MOSFETs with very small physical dimensions. Proceedings of the IEEE, 87(4):668–678, 1999.

[25] S. Dhar, D. Maksimović, and B. Kranzen. Closed-loop adaptive voltage scaling controller for standard-cell ASICs. In International Symposium on Low Power Electronics and Design, pages 103–107, 2002.

[26] K. Emerson. Asynchronous design: An interesting alternative. In International Conference on VLSI Design, pages 318–320, January 1997.

[27] D. Ernst, N. S. Kim, S. Das, S. Pant, R. Rao, T. Pham, C. Ziesler, D. Blaauw, T. Austin, K. Flautner, and T. Mudge. Razor: A low-power pipeline based on circuit-level timing speculation. In IEEE Micro, pages 7–18, December 2003.

[28] M. Fleischmann. LongRun power management: Dynamic power management for Crusoe processors. Transmeta Corporation, pages 1–18, 2001.

[29] International Technology Roadmap for Semiconductors. 2008 update overview, 2008. http://www.itrs.net/Links/2008ITRS/Home2008.htm, Date Accessed: June 20, 2009.

[30] D. J. Frank, R. H. Dennard, E. Nowak, P. M. Solomon, Y. Taur, and H-S. P. Wong. Device scaling limits of Si MOSFETs and their application dependencies. Proceedings of the IEEE, 89(3):259–288, March 2001.

[31] I. Ganusov and M. Burtscher. Future execution: A prefetching mechanism that uses multiple cores to speed up single threads. ACM Transactions on Architecture and Code Optimization, pages 424–449, 2006.

[32] B. Greskamp and J. Torrellas. Paceline: Improving single-thread performance in nanoscale CMPs through core overclocking. In Parallel Architectures and Compilation Techniques, pages 213–224, September 2007.

[33] B. Greskamp, L. Wan, U. R. Karpuzcu, J. J. Cook, J. Torrellas, D. Chen, and C. Zilles. BlueShift: Designing processors for timing speculation from the ground up. In IEEE 15th International Symposium on High Performance Computer Architecture (HPCA 2009), pages 213–224, 2009.

[34] D. Grunwald, P. Levis, K. I. Farkas, C. B. Morrey III, and M. Neufeld. Policies for dynamic clock scheduling. In USENIX Symposium on Operating System Design and Implementation, pages 73–86, October 2000.

[35] V. Gutnik and A. Chandrakasan. An efficient controller for variable supply-voltage low power processing. In Symposium on VLSI Circuits, pages 158–159, June 1996.

[36] W. Haensch, E. J. Nowak, R. H. Dennard, P. M. Solomon, A. Bryant, O. H. Dokumaci, A. Kumar, X. Wang, J. B. Johnson, and M. V. Fischetti. Silicon CMOS devices beyond scaling. IBM Journal of Research and Development, 50(4-5):339–362, 2006.

[37] A. Hartstein and T. R. Puzak. The optimum pipeline depth for a microprocessor. In International Symposium on Computer Architecture, pages 7–13, May 2002.

[38] S. Hauck. Asynchronous design methodologies: An overview. Proceedings of the IEEE, 83(1):69–93, January 1995.

[39] P. Hazucha, C. Svensson, and S. A. Wender. Cosmic-ray soft error rate characterization of a standard 0.6-μm CMOS process. IEEE Journal of Solid-State Circuits, 35(10):1422–1429, October 2000.

[40] S. Heo, K. Barr, and K. Asanovic. Reducing power density through activity migration. In Proceedings of the 2003 International Symposium on Low Power Electronics and Design (ISLPED '03), pages 217–222, 2003.

[41] G. Hinton, D. Sager, M. Upton, D. Boggs, D. Carmean, A. Kyker, and P. Roussel. The microarchitecture of the Pentium 4 processor. Intel Technology Journal, 2001.

[42] W. Huang, K. Sankaranarayanan, R. J. Ribando, M. R. Stan, and K. Skadron. An improved block-based thermal model in HotSpot 4.0 with granularity considerations. In Proceedings of the Workshop on Duplicating, Deconstructing, and Debunking, 2007.

[43] H. L. Hughes and J. M. Benedetto. Radiation effects and hardening of MOS technology: Devices and circuits. IEEE Transactions on Nuclear Science, 50(3):500–521, June 2003.

[44] M. Ieong, B. Doris, J. Kedzierski, K. Rim, and M. Yang. Silicon device scaling to the sub-10-nm regime. Science, 306(5704):2057–2060, 2004.

[45] Intel. White paper: Introduction to 32nm process technology, 2009. http://download.intel.com/pressroom/kits/32nm/westmere/Intel_32nm_Overview.pdf, Date Accessed: June 20, 2009.

[46] H. Iwai, K. Kakushima, and H. Wong. Challenges for future semiconductor manufacturing. International Journal of High Speed Electronics and Systems, 16(1):43, 2006.

[47] A. H. Johnston. Radiation effects in advanced microelectronics technologies. IEEE Transactions on Nuclear Science, 45(3):1339–1354, June 1998.

[48] T. Kehl. Hardware self-tuning and circuit performance monitoring. In IEEE International Conference on Computer Design, pages 188–192, October 1993.

[49] S. Kim and A. K. Somani. On-line integrity monitoring of microprocessor control logic. In International Conference on Computer Design, pages 314–319, 2001.

[50] S. Kim and A. K. Somani. SSD: An affordable fault tolerant architecture for superscalar processors. In Pacific Rim Dependable Computing Conference, pages 27–34, December 2001.

[51] S. Kim and A. K. Somani. Soft error sensitivity characterization for microprocessor dependability enhancement strategy. In IEEE/IFIP International Conference on Dependable Systems and Networks, pages 416–425, 2002.

[52] J-C. Laprie. Dependable computing and fault tolerance: Concepts and terminology. In Twenty-Fifth International Symposium on Fault-Tolerant Computing, June 1995.

[53] C. C. Liu, I. Ganusov, M. Burtscher, and S. Tiwari. Bridging the processor-memory performance gap with 3D IC technology. IEEE Design & Test of Computers, 22(6):556–564, 2005.

[54] T. Liu and S-L. Lu. Performance improvement with circuit-level speculation. In International Symposium on Microarchitecture, pages 348–355, December 2000.

[55] R. McGowen, C. A. Poirier, C. Bostak, J. Ignowski, M. Millican, W. H. Parks, and S. Naffziger. Power and temperature control on a 90-nm Itanium family processor. IEEE Journal of Solid-State Circuits, 41(1):229–237, 2006.

[56] P. J. Meaney, S. B. Swaney, P. N. Sanda, and L. Spainhower. IBM z990 soft error detection and recovery. IEEE Transactions on Device and Materials Reliability, 5(3):419–427, September 2005.

[57] M. Mehrara, M. Attarian, S. Shyam, K. Constantinides, V. Bertacco, and T. Austin. Low-cost protection for SER upsets and silicon defects. In Design, Automation and Test in Europe Conference, pages 1146–1151, March 2007.

[58] G. Memik, M. H. Chowdhury, A. Mallik, and Y. I. Ismail. Engineering over-clocking: Reliability-performance trade-offs for high-performance register files. In IEEE/IFIP International Conference on Dependable Systems and Networks, pages 770–779, June 2005.

[59] A. Merchant, B. Melamed, E. Schenfeld, and B. Sengupta. Analysis of a control mechanism for a variable speed processor. IEEE Transactions on Computers, 45(7):793–801, July 1996.

[60] F. Mesa-Martinez and J. Renau. Effective optimistic-checker tandem core design through architectural pruning. In Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture, pages 236–248, 2007.

[61] S. Mirapuri, M. Woodacre, and N. Vasseghi. The MIPS R4000 processor. IEEE Micro, 12(2):10–22, 1992.

[62] S. Mitra and E. J. McCluskey. Which concurrent error detection scheme to choose? In International Test Conference, pages 985–994, 2000.

[63] G. Moore. Cramming more components onto integrated circuits. Electronics Magazine, 38(8), April 1965.

[64] T. Mudge. Power: A first-class architectural design constraint. IEEE Computer, 34(4):52–58, April 2001.

[65] S. S. Mukherjee, C. Weaver, J. Emer, S. K. Reinhardt, and T. Austin. A systematic methodology to compute the architectural vulnerability factors for a high-performance microprocessor. In Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003.

[66] B. Narasimham et al. Characterization of digital single event transient pulse-widths in 130-nm and 90-nm CMOS technologies. IEEE Transactions on Nuclear Science, 54(6):2506–2511, December 2007.

[67] S. R. Nassif. Modeling and forecasting of manufacturing variations. In Asia South Pacific Design Automation Conference, pages 145–149, January 2001.

[68] J. B. Nickel and A. K. Somani. REESE: A method of soft error detection in microprocessors. In IEEE/IFIP International Conference on Dependable Systems and Networks, pages 401–410, 2001.

[69] T. Pering, T. Burd, and R. Brodersen. The simulation and evaluation of dynamic voltage scaling algorithms. In International Symposium on Low Power Electronics and Design, pages 76–81, 1998.

[70] Joint Electron Device Engineering Council. Failure mechanisms and models for semiconductor devices, 2002.

[71] P. K. Ramesh, V. Subramanian, and A. K. Somani. Thermal management in reliably overclocked systems. In IEEE Workshop on Silicon Errors in Logic - System Effects, Stanford, CA, March 2009.

[72] S. K. Reinhardt and S. S. Mukherjee. Transient fault detection via simultaneous multithreading. In International Symposium on Computer Architecture, pages 25–36, 2000.

[73] D. Roberts, T. Austin, D. Blaauw, T. Mudge, and K. Flautner. Error analysis for the support of robust voltage scaling. In International Symposium on Quality of Electronic Design, pages 65–70, March 2005.

[74] E. Rohou and M. D. Smith. Dynamically managing processor temperature and power. In 2nd Workshop on Feedback-Directed Optimization, 1999.

[75] E. Rotem, A. Naveh, M. Moffie, and A. Mendelson. Analysis of thermal monitor features of the Intel Pentium M processor. In TACS Workshop at ISCA-31, 2004.

[76] E. Rotenberg. AR-SMT: A microarchitectural approach to fault tolerance in microprocessors. In International Symposium on Fault-Tolerant Computing, pages 84–91, June 1999.

[77] Ganesh T. S., V. Subramanian, and A. K. Somani. SEU mitigation techniques for microprocessor control logic. In European Dependable Computing Conference, October 2006.

[78] K. Sankaralingam, R. Nagarajan, S. W. Keckler, and D. Burger. SimpleScalar simulation of the PowerPC instruction set architecture. Technical Report TR2000-04, The University of Texas at Austin, 2001.

[79] S. Sarangi, B. Greskamp, A. Tiwari, and J. Torrellas. EVAL: Utilizing processors with variation-induced timing errors. In Proceedings of the 41st IEEE/ACM International Symposium on Microarchitecture, pages 423–434, 2008.

[80] S. R. Sarangi, B. Greskamp, R. Teodorescu, J. Nakano, A. Tiwari, and J. Torrellas. VARIUS: A model of process variation and resulting timing errors for microarchitects. IEEE Transactions on Semiconductor Manufacturing, 21(1):3, 2008.

[81] N. Seifert, X. Zhu, and L. W. Massengill. Impact of scaling on soft-error rates in commercial microprocessors. IEEE Transactions on Nuclear Science, 49(6):3100–3106, December 2002.

[82] N. Seifert, X. Zhu, and L. W. Massengill. Impact of scaling on soft-error rates in commercial microprocessors. IEEE Transactions on Nuclear Science, 49(6):3100–3106, December 2002.

[83] G. Semeraro, G. Magklis, R. Balasubramonian, D. H. Albonesi, S. Dwarkadas, and M. L. Scott. Energy-efficient processor design using multiple clock domains with dynamic voltage and frequency scaling. In International Symposium on High-Performance Computer Architecture, pages 29–40, February 2002.

[84] N. V. Shenoy, R. K. Brayton, and A. L. Sangiovanni-Vincentelli. Minimum padding to satisfy short path constraints. In IEEE/ACM International Conference on Computer-Aided Design, pages 156–161, 1993.

[85] P. Shivakumar, M. Kistler, S. W. Keckler, D. Burger, and L. Alvisi. Modeling the effect of technology trends on the soft error rate of combinational logic. In IEEE/IFIP International Conference on Dependable Systems and Networks, pages 389–398, June 2002.

[86] K. Skadron, M. R. Stan, W. Huang, S. Velusamy, K. Sankaranarayanan, and D. Tarjan. Temperature-aware microarchitecture. In Proceedings of the International Symposium on Computer Architecture, pages 2–13, 2003.

[87] K. Skadron, M. R. Stan, K. Sankaranarayanan, W. Huang, S. Velusamy, and D. Tarjan. Temperature-aware microarchitecture: Modeling and implementation. ACM Transactions on Architecture and Code Optimization (TACO), 1(1):94–125, 2004.

[88] J. C. Smolens, B. T. Gold, B. Falsafi, and J. C. Hoe. Reunion: Complexity-effective multicore redundancy. In IEEE Micro, pages 223–234, 2006.

[89] R. F. Sproull, I. E. Sutherland, and C. E. Molnar. The counterflow pipeline processor architecture. IEEE Design and Test of Computers, 11(3):48–59, 1994.

[90] J. Srinivasan, S. V. Adve, P. Bose, and J. A. Rivers. Lifetime reliability: Toward an architectural solution. IEEE Micro, 25(3):70–80, 2005.

[91] J. E. Stine, I. Castellanos, M. Wood, J. Henson, F. Love, W. R. Davis, P. D. Franzon, M. Bucher, S. Basavarajaiah, J. Oh, et al. FreePDK: An open-source variation-aware design kit. In Proceedings of the 2007 IEEE International Conference on Microelectronic Systems Education, pages 173–174, 2007.

[92] V. Subramanian, M. Bezdek, N. D. Avirneni, and A. K. Somani. Superscalar processor performance enhancement through reliable dynamic clock frequency tuning. In IEEE/IFIP International Conference on Dependable Systems and Networks, pages 196–205, June 2007.

[93] V. Subramanian, P. K. Ramesh, and A. K. Somani. Managing the impact of on-chip temperature on the lifetime reliability of reliably overclocked systems. In Second International Conference on Dependability (DEPEND '09), pages 156–161, June 2009.

[94] V. Subramanian and A. K. Somani. Conjoined pipeline: A fault-tolerant high performance microarchitecture. In Pacific Rim International Symposium on Dependable Computing, pages 9–16, Taipei, Taiwan, December 2008.

[95] K. Sundaramoorthy, Z. Purser, and E. Rotenberg. Slipstream processors: Improving both performance and fault tolerance. In International Conference on Architectural Support for Programming Languages and Operating Systems, pages 257–268, 2000.

[96] S. E. Thompson, R. S. Chau, T. Ghani, K. Mistry, S. Tyagi, and M. T. Bohr. In search of "forever," continued transistor scaling one new material at a time. IEEE Transactions on Semiconductor Manufacturing, 18(1):26–36, 2005.

[97] A. Tiwari, S. R. Sarangi, and J. Torrellas. ReCycle: Pipeline adaptation to tolerate process variation. In Proceedings of the 34th Annual International Symposium on Computer Architecture, pages 323–334, 2007.

[98] Y. Tosaka, H. Kanata, S. Satoh, and T. Itakura. Simple method for estimating neutron-induced soft error rates based on modified BGR model. IEEE Electron Device Letters, 20(2):89–91, February 1999.

[99] A. K. Uht. Uniprocessor performance enhancement through adaptive clock frequency control. IEEE Transactions on Computers, 54(2):132–140, February 2005.

[100] N. J. Wang and S. J. Patel. ReStore: Symptom based soft error detection in microprocessors. In IEEE/IFIP International Conference on Dependable Systems and Networks, pages 30–39, July 2005.

[101] N. J. Wang, J. Quek, T. M. Rafacz, and S. J. Patel. Characterizing the effects of transient faults on a high-performance processor pipeline. In IEEE/IFIP International Conference on Dependable Systems and Networks, pages 61–70, 2004.

[102] C. Weaver, J. Emer, S. S. Mukherjee, and S. K. Reinhardt. Techniques to reduce the soft error rate of a high-performance microprocessor. In International Symposium on Computer Architecture, pages 264–275, June 2004.

[103] G. Wolrich, E. McLellan, L. Harada, J. Montanaro, and R. Yodlowski. A high performance floating point coprocessor. IEEE Journal of Solid-State Circuits, 19(5):690–696, 1984.

[104] E. Wu, J. Sune, W. Lai, E. Nowak, J. McKenna, A. Vayshenker, and D. Harmon. Interplay of voltage and temperature acceleration of oxide breakdown for ultra-thin gate oxides. Solid State Electronics, 46(11):1787–1798, 2002.

[105] Xilinx. Virtex-5 FPGA user guide (UG190 v5.0). www.xilinx.com/support/documentation/user_guides/ug190.pdf, Date Accessed: July 11, 2009.

[106] S. Zafar. A model for negative bias temperature instability in oxide and high-k pFETs. In IEEE International Conference on Integrated Circuit Design and Technology (ICICDT '07), pages 1–5, 2007.

[107] H. Zhou. Dual-core execution: Building a highly scalable single-thread instruction window. In Parallel Architectures and Compilation Techniques, pages 231–242, 2005.

[108] H. Zhou. A case for fault tolerance and performance enhancement using chip multi-processors. IEEE Computer Architecture Letters, 5(1):22–25, 2006.