DIVA: A Reliable Substrate for Deep Submicron Microarchitecture Design

Todd M. Austin*
Advanced Architecture Laboratory
University of Michigan
[email protected]

* This work was performed while the author was (un)employed as an independent consultant in late May and June of 1999.

Abstract

Building a high-performance microprocessor presents many reliability challenges. Designers must verify the correctness of large complex systems and construct implementations that work reliably in varied (and occasionally adverse) operating conditions. To further complicate this task, deep submicron fabrication technologies present new reliability challenges in the form of degraded signal quality and logic failures caused by natural radiation interference.

In this paper, we introduce dynamic verification, a novel microarchitectural technique that can significantly reduce the burden of correctness in microprocessor designs. The approach works by augmenting the commit phase of the processor pipeline with a functional checker unit. The functional checker verifies the correctness of the core processor's computation, only permitting correct results to commit. Overall design cost can be dramatically reduced because designers need only verify the correctness of the checker unit.

We detail the DIVA checker architecture, a design optimized for simplicity and low cost. Using detailed timing simulation, we show that even resource-frugal DIVA checkers have little impact on core processor performance. To make the case for reduced verification costs, we argue that the DIVA checker should lend itself to functional and electrical verification better than a complex core processor. Finally, future applications that leverage dynamic verification to increase processor performance and availability are suggested.

1 Introduction

Reliable operation is perhaps the single most important attribute of any computer system, followed closely by performance and cost. Users need to be able to trust that when the processor is put to a task the results it renders are correct. If this is not the case, there can be serious repercussions, ranging from disgruntled users to financial damage to loss of life. There have been a number of high-profile examples of faulty processor designs. Perhaps the most publicized case was the FDIV bug, in which an infrequently occurring error caused erroneous results in some floating point divides [16]. More recently, the MIPS R10000 microprocessor was recalled early in its introduction due to implementation problems [18]. These faulty parts resulted in bad press, lawsuits, and reduced customer confidence. In most cases, the manufacturers replaced the faulty parts with fixed ones, but only at great cost. For example, Intel made available replacement Pentium processors to any customer that had a faulty part and requested a replacement, at an estimated cost of $475 million [32].

1.1 Designing Correct Processors

To avoid reliability hazards, chip designers spend considerable resources at design and fabrication time to verify the correct operation of parts. They do this by applying functional and electrical verification to their designs. (Often the term "validation" is used to refer to the process of verifying the correctness of a design. In this paper we adopt the nomenclature used in the formal verification literature, i.e., verification is the process of determining if a design is correct, and validation is the process of determining if a design meets customers' needs. In other words, verification answers the question, "Did we build the chip right?", and validation answers the question, "Did we build the right chip?")

Functional Verification: Functional verification occurs at design time. The process strives to guarantee that a design is correct, i.e., for any starting state and inputs, the design will transition to the correct next state. It is quite difficult to make this guarantee due to the immense size of the test space for modern microprocessors. For example, a microprocessor with 32 32-bit registers, 8k-byte instruction and data caches, and 300 pins would have a test space with at least 2^132396 starting states and up to 2^300 transition edges emanating from each state. Moreover, much of the behavior in this test space is not fully defined, leaving in question what constitutes a correct design.
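For readers who want to see where these counts come from, the figures above follow from straightforward bit counting, assuming the 300 pins are treated both as part of the starting state and as the per-cycle inputs:

\[
\underbrace{32 \cdot 32}_{\text{registers}} \;+\; \underbrace{2 \cdot (8 \cdot 1024 \cdot 8)}_{\text{8k-byte I- and D-caches}} \;+\; \underbrace{300}_{\text{pins}} \;=\; 1024 + 131072 + 300 \;=\; 132396 \ \text{state bits},
\]

so the machine has at least \(2^{132396}\) distinct starting states, and each state has up to \(2^{300}\) possible input combinations, i.e., up to \(2^{300}\) outgoing transition edges.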

Functional verification is often implemented with simulation-based testing. A model of the processor being designed executes a series of tests and compares the model's results to the expected results. Tests are constructed to provide good coverage of the processor test space. Unfortunately, design errors sometimes slip through this testing process due to the immense size of the test space. To minimize the probability of this happening, designers employ various techniques to improve the quality of verification, including co-simulation [6], coverage analysis [6], random test generation [1], and model-driven test generation [13].

A recent development called formal verification [15] works to increase test space coverage by using formal methods to prove that a design is correct. Due to the large number of states that can be tested with a single proof, the approach can be much more efficient than simulation-based testing. In some cases it is even possible to completely verify a design; however, this level of success is usually reserved for in-order issue pipelines or simple out-of-order pipelines with small window sizes. Complete formal verification of complex modern microprocessors with out-of-order issue, speculation, and large window sizes is currently an intractable problem [8, 24].

Electrical Verification: Functional verification only verifies the correctness of a processor's function at the logic level; it cannot verify the correctness of the logic implementation in silicon. This task is performed during electrical verification. Electrical verification occurs at design time and at fabrication time (to speed bin parts). Parts are stress tested at extreme operating conditions, e.g., low voltage, high temperature, high frequency, and slow process, until they fail to operate (or fail to meet a critical design constraint such as power dissipation or mean time to failure (MTTF)). The allowed maximum (or minimum) for each of these operating conditions is then reduced by a safe operating margin (typically 10-20%) to ensure that the part provides robust operation at the most extreme operating conditions. If after this process the part fails to meet its operational goals (e.g., frequency or voltage), directed testing is used to identify the critical paths that are preventing the design from reaching these targets [3].
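As a purely hypothetical numeric illustration of this margining step (the failure points and the 15% and 10% margins below are invented for the example, not taken from the paper): if stress testing shows a part failing above some frequency and below some supply voltage, the shipped specification is derated from those failure points:

\[
f_{\text{spec}} = (1 - m)\, f_{\text{fail}} = (1 - 0.15) \times 500\ \text{MHz} = 425\ \text{MHz},
\]
\[
V_{\text{min,spec}} = (1 + m)\, V_{\text{min,fail}} = (1 + 0.10) \times 1.5\ \text{V} = 1.65\ \text{V}.
\]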
Occasionally, implementation errors slip through the electrical verification process. For example, if an infrequently used critical path is not exercised during electrical verification, any implementation errors in the circuit will not be detected. Data-dependent implementation errors are perhaps the most difficult to find because they require very specific directed testing to locate. Examples of these types of errors include parasitic crosstalk on buses [4], Miller effects on transistors [31], charge sharing in dynamic logic [31], and supply voltage noise due to dI/dt spikes [25].

1.2 Deep Submicron Reliability Challenges

To further heighten the importance of high-quality verification, new reliability challenges are materializing in deep submicron fabrication technologies (i.e., process technologies with minimum feature sizes below 0.25um). Finer feature sizes result in an increased likelihood of noise-related faults, interference from natural radiation sources, and huge verification burdens brought on by increasingly complex designs. If designers cannot meet these new reliability challenges, they may not be able to enjoy the cost and speed advantages of these denser technologies.

Noise-Related Faults: Noise-related faults are the result of electrical disturbances in the logic values held in circuits and wires. As process feature sizes shrink, interconnect becomes increasingly susceptible to noise induced by other wires [4, 23]. This effect, often called crosstalk, is the result of increased capacitance and inductance due to densely packed wires [2]. At the same time, designs are using lower supply voltage levels to decrease power dissipation, resulting in even more susceptibility to noise as voltage margins are decreased.

Natural Radiation Interference: There are a number of radiation sources in nature that can affect the operation of electronic circuits. The two most prevalent radiation sources are gamma rays and alpha particles. Gamma rays arrive from space. While most are filtered out by the atmosphere, some occasionally reach the surface of the earth, especially at higher altitudes [33]. Alpha particles are created when atomic impurities (found in all materials) decay [21]. When these energetic particles strike a very small transistor, they can deposit or remove sufficient charge to temporarily turn the device on or off, possibly creating a logic error [11, 23]. Energetic particles have been a problem for DRAM designs since the late 1970s, when DRAM capacitors became sufficiently small to be affected by energetic particles [21].

It is difficult to shield against natural radiation sources. Gamma rays that reach the surface of the earth have sufficiently high momentum that they can only be stopped with thick, dense materials [33]. Alpha particles can be stopped with thin shields, but any effective shield would have to be free of atomic impurities; otherwise, the shield itself would be an additional source of natural radiation. Neither shielding approach is cost effective for most system designs. As a result, designers will likely be forced to adopt fault-tolerant design solutions to protect against natural radiation interference.

Increased Complexity: With denser feature sizes, there is an opportunity to create designs with many millions and soon billions of transistors. While many of these transistors will be invested in simple regular structures like cache and predictor arrays, others will find their way into complex components such as dynamic schedulers, new functional units, and other yet-to-be-invented gadgets. There is no shortage of testimonials from industry leaders warning that increasing complexity is perhaps the most pressing problem facing future microprocessor designs [3, 30, 14, 10]. Without improved verification techniques, future designs will likely be more costly, take longer to design, and include more undetected design errors.

1.3 Dynamic Verification

Today, most commercial microprocessor designs employ fault-avoidance techniques to ensure correct operation. Design faults are avoided by putting parts through extensive functional verification. To ensure designs are electrically robust, frequency and voltage margins are inserted into acceptable operating ranges.

[Figure 1 artwork: a) traditional out-of-order core (in-order issue, out-of-order execution, in-order retirement); b) DIVA core with DIVA checker (in-order issue, out-of-order execution, in-order verify and commit in the CHK stage).]

Figure 1. Dynamic Verification. Figure a) shows a traditional out-of-order processor core. Figure b) shows the core augmented with a checker stage (labeled CHK). The shaded components in each figure indicate the part of the processor that must be verified correct to ensure correct computation.
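To make the checker stage in Figure 1 concrete, the following C sketch illustrates the commit-time check described in the abstract: the core hands the checker each completed instruction together with its input operand values and its claimed result, the checker recomputes the result with simple logic, and only matching results update architected state. This is a sketch of the general idea, not the DIVA design detailed later in the paper; the opcode set, the recovery action on a mismatch, and all names used here (retire_entry_t, simple_execute, check_and_commit, flush_and_restart_core) are illustrative assumptions.

#include <stdint.h>
#include <stdio.h>

enum { OP_ADD, OP_SUB, OP_AND, OP_OR };   /* tiny illustrative opcode set */

typedef struct {
    unsigned op;            /* opcode of the completed instruction        */
    uint32_t src1, src2;    /* input operand values captured by the core  */
    uint32_t core_result;   /* result value the core wants to commit      */
    unsigned dest_reg;      /* architected destination register           */
} retire_entry_t;

static uint32_t arch_reg[32];             /* architected register file    */

/* Simple, separately verifiable functional unit used by the checker. */
static uint32_t simple_execute(unsigned op, uint32_t a, uint32_t b)
{
    switch (op) {
    case OP_ADD: return a + b;
    case OP_SUB: return a - b;
    case OP_AND: return a & b;
    case OP_OR:  return a | b;
    default:     return 0;                /* other classes omitted here   */
    }
}

static void flush_and_restart_core(void)
{
    printf("checker: core result wrong -- flush and restart core\n");
}

/* Called in program order for each instruction the core wants to retire:
 * only results that match the checker's recomputation update state.     */
static void check_and_commit(const retire_entry_t *e)
{
    uint32_t checked = simple_execute(e->op, e->src1, e->src2);

    if (checked == e->core_result) {
        arch_reg[e->dest_reg] = e->core_result;   /* correct: commit      */
    } else {
        arch_reg[e->dest_reg] = checked;          /* assumed recovery:    */
        flush_and_restart_core();                 /* repair, then flush   */
    }
}

int main(void)
{
    retire_entry_t good = { OP_ADD, 2, 3, 5, 1 }; /* core computed 2+3=5  */
    retire_entry_t bad  = { OP_ADD, 2, 3, 7, 2 }; /* faulty core result   */
    check_and_commit(&good);
    check_and_commit(&bad);
    printf("r1=%u r2=%u\n", (unsigned)arch_reg[1], (unsigned)arch_reg[2]);
    return 0;
}

Run on the two sample entries in main, the checker commits the correct ADD result directly and, for the faulty entry, overrides the core's value and requests a flush, so both destination registers end up holding the correct value 5.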

These techniques, however, are becoming less attractive for future designs due to increasing complexity, degraded signal integrity, and natural radiation interference.

Traditional fault-tolerance techniques could address ...

... into a deadlock or livelock state where no instructions attempt to retire. For example, if an energetic particle strike changes an input tag in a reservation station to the result tag of the same instruction, the processor core sched