Modeling the Effect of Redundancy on Yield and Performance of VLSI Systems

344 IEEE TRANSACTIONS ON COMPUTERS, VOL. C-36, NO. 3, MARCH 1987

ISRAEL KOREN, MEMBER, IEEE, AND DHIRAJ K. PRADHAN, SENIOR MEMBER, IEEE

Abstract-The incorporation of different forms of redundancy has been recently proposed for various VLSI and WSI designs. These include regular architectures, built by interconnecting a large number of a few types of system elements on a single chip or wafer. The motivation for introducing fault-tolerance (redundancy) into these architectures is two-fold: yield enhancement and performance (like computational availability) improvement. Our objective in this paper is to develop analytical models that evaluate how yield enhancement and performance improvement may both be achieved by introducing redundancy into VLSI and WSI designs. Such models also allow us to evaluate the cost-effectiveness of a given fault-tolerance strategy and calculate the amount of redundancy to be added.

Index Terms-Computational availability, fault tolerance, redundancy, reliability, VLSI designs, wafer-scale integration, yield.

Manuscript received August 1, 1985; revised February 4, 1986 and June 10, 1986. This work was supported in part by the United States Air Force Office of Scientific Research under Grant 84-0052 and by the National Science Foundation under Grant DCR-8509423.
I. Koren is with the Department of Electrical and Computer Engineering, University of Massachusetts, Amherst, MA 01003, on leave from the Technion-Israel Institute of Technology, Haifa 32000, Israel.
D. K. Pradhan is with the Department of Electrical and Computer Engineering, University of Massachusetts, Amherst, MA 01003.
IEEE Log Number 8611446.
0018-9340/87/0300-0344$01.00 © 1987 IEEE

I. INTRODUCTION

IMPORTANT innovations are likely to occur in two VLSI-based areas, namely, wafer-scale integrated architectures, and single VLSI chip/multielement architectures. The former has the potential for a major breakthrough with its ability to realize a complete multiprocessing system on a single wafer. This will eliminate the expensive steps required to dice the wafer into individual chips and bond their pads to external pins. In addition, internal connections between chips on the same wafer are more reliable and have a smaller propagation delay than external connections. The latter does make it possible to build a high-speed processor on a single chip, designed by interconnecting a large number of simple processing elements, memory modules and the like. These architectures already have captured the imagination of several computer manufacturers and researchers alike.

Much recent research has focused on these new architectural innovations, especially those created by interconnecting a large number of elements such as processors, memories, switches, communication links etc., all on a single chip or wafer. Concerns about fault tolerance in such VLSI-based systems stem from the two key factors of performance and yield enhancements.

Low yield (expected percentage of good chips out of a wafer) is a problem of increasing significance as circuit density grows. One solution suggests improvement of the manufacturing and testing processes to minimize manufacturing faults. However, this approach is not only very costly, but also quite difficult (or even impossible) to implement, with the increasing number of components that can be placed on one chip. But incorporating redundancy for fault tolerance does provide a very practical solution to the low yield problem. This has been demonstrated in practice for high density memory chips (e.g., [1]) and should be extended to other types of VLSI circuits. In general, yield may be enhanced because the circuit can be accepted, in spite of some manufacturing defects, by means of restructuring, as opposed to having to discard the faulty chip.

Achieving reliable operation also becomes increasingly difficult with the growing number of interconnected elements and hence, the increased likelihood that faults can occur. Here too, redundant elements which are ready to replace faulty ones when the system is in operation, can increase the reliability and other performance measures like computational availability.

In summary, the justification for introducing fault tolerance (redundancy) into the architecture of VLSI-based systems is two-fold. One is to deal with manufacturing flaws and increase the yield. The other is to deal with operational faults and enhance the performance availability.

Our objective in this paper is to formulate analytical models that will enable us to analyze the effectiveness of a given fault-tolerance technique in increasing yield and improving performance, or find the tradeoff between the two. These models will also allow us to compare various fault tolerance techniques, examine different system topologies and determine the optimal amount of redundancy to be added.

In the next section, the aspects that have to be considered when evaluating a fault tolerance strategy are detailed. In Section III, expressions for the actual and apparent yield of a VLSI chip with added redundancy are derived. In Section IV we present models that allow us to compute various measures of combined performance and reliability. Then, an example of a VLSI-based system with redundancy is analyzed in Section V and final conclusions are presented in Section VI.

II. FAULT-TOLERANCE IN VLSI AND WSI

A variety of techniques for introducing fault tolerance into VLSI and WSI architectures with regular topologies have been recently proposed, [2], [3], [6], [7], [10], [15], [17], [18]. Because fault tolerance is an involved subject, completely different schemes might be cost-effective in different situations and for different objective functions.

Several aspects have to be considered when evaluating a fault-tolerance strategy for multielement systems. The first is the type of failures to be dealt with. There are two distinct types of failures with which fault-tolerance strategies can be designed to deal. These are production defects and operational faults. A relatively large number of defects is expected when manufacturing a silicon wafer in the current technology. Normally, all chips with production flaws are discarded leading to a low yield.

Operational faults have in comparison a considerably lower probability of occurrence, the difference of which may be in orders of magnitude. Improvements in solid-state technology and maturity of the fabrication processes have reduced the failure rate of a single component within a VLSI chip. However, the exponential increase in the component-count per VLSI chip has more than offset the increase in reliability of a single component. Thus, operational faults cannot be ignored although they have a substantially lower probability of occurrence compared to production defects. Consequently, a fault-tolerance strategy that enables the system to continue processing, even in the presence of operational faults, can be beneficial.

The two types of failures, manufacturing defects and operational faults, also differ in the costs associated with them. Defects are tested for before the IC's are assembled into a system and therefore, they contribute only to the production costs of the IC's. In contrast, faults occur after the system has been assembled and is already operational. Hence, their impact is on the system's operation and their damage might be substantial, especially in systems used for critical real-time applications. Clearly, a method which is cost-effective for [...]

[...] analysis we have to take into account the relative hardware complexity (silicon area) of all system elements, and their susceptibility to failures (manufacturing defects or operational faults).

Processing elements (PE's) are traditionally considered the most important system resource; hence, achieving 100 percent utilization of them is often attempted. For example, in [2], [15], and [18] switching elements are added between processors to assist in achieving this goal. In [3] and [10] connecting tracks are added on the wafer to be used in bypassing the defective PE's when connecting the fault-free ones. However, the silicon area that needs to be devoted to switching elements (e.g., switches capable of interconnecting 4 to 8 separate parallel busses [18]) or to additional communication links cannot be ignored. Consequently, such schemes might be beneficial only for PE's which are substantially larger than the switches and the additional links (e.g., [13]). Also, the addition of switching elements and especially the longer interconnection between active processors result in longer delays affecting the throughput of the system. To overcome this performance penalty, it has been suggested in [9] to add registers for bypassing faulty processors. The effect of this is to introduce extra stages in the pipeline, thus increasing the latency of the pipeline without reducing its throughput.

In the above mentioned schemes, one of the underlying assumptions is that the extra circuitry (e.g., switching elements, communication links or registers) are failure free and only processors can fail. However, larger silicon areas devoted to those elements increase their susceptibility to defects or faults; as a result, the above-mentioned assumption might not be valid any more.

In general, there are several alternative ways for introducing redundancy into the system. Redundancy can be intro-
