Fuse: a Technique to Anticipate Failures Due to Degradation in Alus

View metadata, citation and similar papers at core.ac.uk brought to you by CORE provided by UPCommons. Portal del coneixement obert de la UPC Fuse: A Technique to Anticipate Failures due to Degradation in ALUs Jaume Abella, Xavier Vera, Osman Unsal†, Oguz Ergin‡, Antonio González Intel Barcelona Research Center, Intel Labs – UPC {jaumex.abella, xavier.vera, antonio.gonzalez}@intel.com Ψ‡ Abstract Such unreliability can be addressed in several ways. One solution consists in testing the blocks for errors This paper proposes the fuse, a technique to anticipate [11][18] and reconfigure the system accordingly. failures due to degradation in any ALU (Arithmetic Logic However, testing only avoids future crashes, but it does Unit), and particularly in an adder. The fuse consists of a not prevent the system from crashing whenever failures replica of the weakest transistor in the adder and the show up for the first time. circuitry required to measure its degradation. By Another set of solutions is based on detecting failures mimicking the behavior of the replicated transistor the and avoiding data corruption. Memory-like structures, fuse anticipates the failure short before the first failure in such as caches and register files, can be protected with the adder appears, and hence, data corruption and ECC [9], which is useful to detect transient and permanent program crashes can be avoided. Our results show that errors. ALUs are very likely to cause crashes and data the fuse anticipates the failure in more than 99.9% of the corruption because most of the instructions use them, and cases after 96.6% of the lifetime, even for pessimistic thus, it is mandatory to protect them. However, random within-die variations. combinational blocks like ALUs cannot use ECC or parity, and require more expensive techniques like reexecution or residue codes computation among others. 1. Introduction Reexecution can be performed in a different instance of the same type of ALU [16][17] or in a special ALU As technology evolves, the geometry of transistors and devoted to error detection [4]. Both solutions are wires shrinks. However, supply voltage does not scale at expensive either in terms of performance and/or extra the same pace [23]; this causes transistors and wires to hardware. Residue computation [5][12] is an alternative to suffer higher current densities, which also imply higher detect failures in ALUs, especially in adders. Obtaining temperatures. The increased current density and residue codes requires special hardware to compute the temperature translate into higher vulnerability of circuits. modulo function. Under these conditions, transistors and wires will degrade This paper proposes the fuse, a new technique to faster and will be more prone to failures (higher failure anticipate failures due to degradation in ALUs at a very rate per device). Furthermore, there will be an increased low cost. In particular, we propose the design and number of failures in the chip because of the larger implementation of a fuse to anticipate failures in a sparse number of such devices (transistors shrink but the chip tree adder [14], which is the one used in the Intel® size is expected to remain constant [23]). Pentium® 4. The fuse is built as a replica of the weakest The increasing unreliability of processors will make transistor in the adder and the circuitry required to devices fail frequently during the normal lifetime of the measure its degradation. Whenever the fuse does not meet processor. Moreover, transistor geometry may change the delay constraints, it implies that the protected adder is significantly from one chip to another or even within the about to fail, so it can be disabled or its frequency chip itself, in such a way that some components are prone decreased [19] to prevent data corruption and program to degrade faster than others. Similarly, dynamic crashes. variations of operating frequency, voltage and The fuse is a very efficient solution in terms of temperature may accelerate degradation significantly for hardware and power. We illustrate how to design and some blocks. Thus, lifetime of blocks in a chip is implement a fuse for a sparse tree adder, although the unpredictable and mechanisms are required to detect same idea can be extended to any other ALU without failures before such failures produce crashes or data requiring any special property for the protected block, as corruption. it is the case for residue computation. The rest of the paper is organized as follows. Section 2 introduces the main sources of failure affecting † Osman Unsal is currently with the Barcelona Supercomputing Center, Spain ([email protected]) microprocessors. Section 3 presents the fuse, our ‡ Oguz Ergin is currently with the TOBB University of Economics technique to anticipate failures in adders. Section 4 and Technology, Ankara, Turkey ([email protected]) presents the evaluation of the fuse. Section 5 reviews 13th IEEE International On-Line Testing Symposium (IOLTS 2007) 0-7695-2918-6/07 $25.00 © 2007 some related work. Future work is introduced in Section voltage of 1.2V, and a lifetime of 7 years. Note that 6. Finally, section 7 draws the main conclusions of this operating conditions (temperature and voltage) are those work. considered for TDP (Thermal Design Point), which are the ones used to decide whether a circuit passes or fails 2. Sources of Wearout and Failure the test. There are many sources of failure (SOF) [24] affecting transistors and wires such as electromigration, stress migration, time-dependent dielectric breakdown, negative bias temperature instability (NBTI), etc. Although our invention can be used to detect failures due to degradation for most of such SOF, we focus on a single SOF for the sake of illustration: NBTI. NBTI, which mainly affects PMOS transistors, has emerged in recent years as the most important SOF for transistors in sub-130nm technology as shown in literature [3]. Our simulations show a similar trend since transistors fail much earlier due to NBTI than due to any other SOF. Some molecules at the gate interface are broken when the input voltage of the gate is negative and the source voltage is positive. As a consequence, the threshold voltage increases and the gate becomes slower, leading to failures due to time constraints. NBTI depends on temperature, voltage, Figure 1. Schematic of a 32-bit sparse tree adder utilization and transistor geometry. [14] 3. Fuse for a Sparse Tree Adder Table 1. Workloads This section presents the fuse for a sparse tree adder Benchmark suite # traces Description [14]. The fuse is used to detect when a failure due to Encoder 62 Audio/video encoding degradation is about to happen in the adder, but it does not SpecFP2000 41 Floating-point specs SpecINT2000 33 Integer specs detect the actual failure. A fuse consists of a transistor Kernels 53 VectorAdd, FIRs and some extra logic to (i) degrade it properly and (ii) Multimedia 85 WMedia, photoshop measure its degradation. When the fuse fails (i.e., it does Office 75 Excel, Word, PowerPoint not meet the delay constraints), the adder is considered to Productivity 45 Internet contents creation be unsafe. As a result, the system may disable the adder or Server 55 TPC-C try to find a frequency [19] such that the fuse (and thus Workstation 49 CAD, rendering the adder) still meets the delay constraints. Similarly, such Spec2006 71 New spec release scheme can be used to remove part of the guardband that has been set up to tolerate some extent of degradation. The following sections identify and address the several A scheme of the adder can be found in Figure 1. The key issues when designing the fuse. The first one is implementation used for the adder is the one provided by identifying the proper characteristics of the fuse; the the authors of the adder and both the adder and the fuse second is designing the circuit in such a way that it is have been designed in 65nm technology for Hspice-like guaranteed that there is no transistor in the adder that will simulation. The simulator is an Intel production aging fail due to degradation before the fuse. Finally, we simulator whose inputs are the spice description of the address the problem of pessimism: it must be minimized circuit, the technology description, the environmental the time elapsed between the fuse failure and the time parameters of the simulation (mainly temperature and when the adder would have failed. voltage), the length of the simulation (how many years the circuit is expected to work), and the inputs for the circuit. 3.1 How to Identify the Required Fuse Note that we do not feed the circuit with inputs for its whole lifetime. Instead, we provide the simulator with The fuse replicates the weakest transistor of the adder. inputs for a given period of time, and the simulator The weakness of transistors depends on different critical repeats the circuit inputs for the whole lifetime. The intrinsic SOF. Our simulations show that NBTI is much experiment has been run sampling 1138 instructions more significant than any other SOF affecting transistors requiring additions from a set of 569 traces from different for the evaluated technology. Thus, to design the fuse we workloads (see Table 1). Longer experiments have not must care about the weakest transistor due to NBTI. been possible due to the large cost of such a long We choose as fuse the transistor whose combination of simulation at electrical level. Simulations have been physical characteristics and activity patterns gives the conducted assuming a temperature of 110ºC, a supply shortest lifetime in terms of NBTI. Vulnerability depends 13th IEEE International On-Line Testing Symposium (IOLTS 2007) 0-7695-2918-6/07 $25.00 © 2007 on the length and width of the PMOS transistors, and its oscillators consist of an odd number of inverters arranged inputs (i.e.

Load more