Techniques to Reduce the Soft Error Rate of a High-Performance Microprocessor Christopher Weaver1, Joel Emer1, Shubhendu S

31st Annual International Symposium on Computer Architecture (ISCA), June 2004 Techniques to Reduce the Soft Error Rate of a High-Performance Microprocessor Christopher Weaver1, Joel Emer1, Shubhendu S. Mukherjee1, and Steven K. Reinhardt1,2 1Massachusetts Microprocessor Design Center 2Advanced Computer Architecture Lab Intel Corporation EECS Department, University of Michigan 77 Reed Road, Hudson MA 01749 1301 Beal Avenue, Ann Arbor, MI 48109 {christopher.t.weaver,joel.emer,shubu.mukherjee}@intel.com [email protected] Abstract we add error protection mechanisms or use a more robust technology (such as fully-depleted SOI), a microprocessor’s error Transient faults due to neutron and alpha particle strikes pose rate will grow in direct proportion to the number of devices we a significant obstacle to increasing processor transistor counts in add to a processor in each succeeding generation. future technologies. Although fault rates of individual transistors may not rise significantly, incorporating more transistors into a Figure 1 illustrates the possible outcomes of a single-bit device makes that device more likely to encounter a fault. Hence, fault. Outcomes labeled 1-3 indicate non-error conditions. The maintaining processor error rates at acceptable levels will require most insidious form of error is silent data corruption (SDC) increasing design effort. (outcome 4), where a fault induces the system to generate erro- neous outputs. To avoid SDC, designers often employ basic This paper proposes two simple approaches to reduce error error detection mechanisms, such as parity. With the ability to rates and evaluates their application to a microprocessor instruc- detect a fault but not correct it, we avoid generating incorrect tion queue. The first technique reduces the time instructions sit in outputs, but cannot recover when an error occurs. In other vulnerable storage structures by selectively squashing instructions words, simple error detection does not reduce the error rate, but when long delays are encountered. A fault is less likely to cause an does provide fail-stop behavior and thereby avoids any data error if the structure it affects does not contain valid instructions. corruption. We call errors in this category detected unrecover- We introduce a new metric, MITF (Mean Instructions To Failure), able errors (DUE). to capture the trade-off between performance and reliability intro- We subdivide DUE events according to whether the duced by this approach. detected error would affect the final outcome of the execution. The second technique addresses false detected errors. In the We call benign detected errors false DUE events (outcome 5 in absence of a fault detection mechanism, such errors would not Figure 1) and others true DUE events (outcome 6). In most sit- have affected the final outcome of a program. For example, a fault uations, it is impossible for a processor to determine at the time affecting the result of a dynamically dead instruction would not an error is detected whether it is benign. The conservative change the final program output, but could still be flagged by the approach is to signal all detected errors as processor failures hardware as an error. To avoid signalling such false errors, we (e.g., via a machine-check exception). modify a pipeline's error detection logic to mark affected instruc- A direct approach to reducing error rates involves adding tions and data as possibly incorrect rather than immediately sig- error correction or recovery mechanisms to a design, eliminat- naling an error. Then, we signal an error only if we determine later that the possibly incorrect value could have affected the program’s output. faulty bit is read? no yes 1. Introduction benign fault; Single bit upsets from transient faults have emerged as a 1 bit has no error key challenge in microprocessor design. These faults arise error protection? from energetic particles—such as neutrons from cosmic rays detection & detection and alpha particles from packaging material—generating elec- fault corrected; correction no 2 only tron-hole pairs as they pass through a semiconductor device. no error Transistor source and diffusion nodes can collect these charges. A sufficient amount of accumulated charge may invert the state affects program out- affects program out- of a logic device—such as an SRAM cell, a latch, or a gate— come? come? thereby introducing a logical fault into the circuit’s operation no yes no yes [32]. Because this type of fault does not reflect a permanent failure of the device, it is termed soft or transient. benign fault; Soft errors will be an increasing burden for microproces- 3 no error 4 SDC5 false DUE6 true DUE sor designers as the number of on-chip transistors continues to grow exponentially. The raw error rate per latch or SRAM bit is Figure 1. Classification of the possible outcomes of a faulty bit projected to remain roughly constant or decrease slightly for in a microprocessor. SDC = silent data corruption. DUE = the next several technology generations [9][11]. Thus, unless detected unrecoverable error. 1 31st Annual International Symposium on Computer Architecture (ISCA), June 2004 ing outcomes 3 through 6 from Figure 1. Unfortunately, these events account for as much as 52% of the total DUE rate of an mechanisms come at a significant cost in power, performance, instruction queue protected only with parity. and area. Circuit-level recovery mechanisms, such as the DICE To track false DUE events, we propose attaching a bit cell [6], can double the number of transistors per storage ele- called the π bit (pi for possibly incorrect) to every instruction ment. Error correction codes in memory have significant stor- and potentially to various hardware structures. When an error is age overhead and incur additional latency in generating and detected, the hardware will set the π bit of the affected instruc- checking the codes. Recovery with redundant execution tion instead of signaling the error. Later, by examining the π bit schemes (e.g., [2], [29], [8]) requires duplication of a thread or and identifying the nature of the instruction, the hardware can processor. Furthermore, these techniques may be overkill for decide if indeed a visible error has occurred. Thus, for exam- most of the microprocessor market, which requires good reli- ple, the retire unit in a pipeline can ignore the π bit of a wrong- ability but not bulletproof operation. path instruction and avoid signaling a false error. Among the This paper focuses on an alternative mechanism to lower- different categories of instructions, identifying dynamically ing error rates: reducing the likelihood that a transient fault dead instructions is most involved. We propose several mecha- causes the processor to declare an error condition. We propose nisms, including a Post-commit Error Tracking (PET) buffer two variations on this theme, with specific application to a and the optional use of π bits on register files, various hard- microprocessor instruction queue. ware structures, and in the memory system, to provide good coverage of false DUE from dynamically dead instructions. Our first approach reduces the probability that a transient Using the above techniques, we can cover 100% of the false fault affects valid state (e.g., an instruction) by keeping valid DUE events. We show that the combination of the exposure state out of vulnerable structures. This reduces the exposure of and false DUE reduction techniques reduce the SDC rate of an valid state to radiation and, hence, lowers the soft error rate of unprotected instruction queue by 26% and the DUE rate of an the affected structures. For example, most bits in an invalid instruction queue protected with parity by 57%. instruction queue entry (other than the valid bit itself) will Overall this paper makes four contributions. First, we pro- never be read; thus a fault in an invalid entry will not result in pose a new approach to reducing soft error rates by reducing an error (outcome 1 from Figure 1). By filling the instruction the exposure of valid state to neutron or alpha strikes. Second, queue with invalid entries instead of valid instructions during we introduce the term MITF (mean instructions between fail- lengthy stalls, we reduce the probability of a neutron or alpha ures) to reason about the trade-off between reliability and per- particle strike corrupting a valid instruction. In general, we use formance. Third, we identify the existence of false DUE situations that presage long delays as triggers to carry out events. And, fourth and final, we describe several new mecha- actions that reduce the exposure of valid state to potential nisms, such as the π bit, to avoid flagging errors on false DUE faults. In this paper, we examine cache-miss triggers and events. squashing actions to remove existing instructions from the instruction queue. Our results with an Itanium 2-like proces- The rest of the paper is organized as follows. Section 2 sor running the SPEC CPU2000 show that these techniques can discusses how to compute the SDC and DUE rates of a micro- reduce the instruction queue’s soft error rate by 18-34% for processor. Section 3 describes how to reduce the error rate of a only a 2-10% decrease in overall IPC. processor by reducing the amount of time an instruction sits in a vulnerable storage structure. Section 4 describes how to Because such stalling operations may reduce performance, reduce false DUE rate by marking instructions as possibly we must determine whether the performance loss is justified for incorrect. Section 5 describes our methodology and Section 6 the corresponding gain in reliability. To quantify this trade-off, shows our results. Section 7 discusses related work. Finally, we introduce a new metric, Mean Instructions to Failure Section 8 presents our conclusions. (MITF). We show that as long as the relative decrease in vul- nerability is larger than the corresponding decrease in IPC 2.

Load more