<<

344 IEEE TRANSACTIONS ON COMPUTERS, VOL. -36, NO. 3, MARCH 1987 Modeling the Effect of Redundancy on Yield and Performance of VLSI Systems

ISRAEL KOREN, MEMBER, IEEE, AND DHIRAJ K. PRADHAN, SENIOR MEMBER, IEEE

Abstract-The incorporation of different forms of redundancy Low yield (expected percentage of good chips out of a has been recently proposed for various VLSI and WSI designs. wafer) is a problem of increasing significance as circuit These include regular architectures, built by interconnecting a density grows. One solution suggests improvement of the large number of a few types of system elements on a single chip or wafer. The motivation for introducing fault-tolerance (redun- manufacturing and testing processes to minimize manufactur- dancy) into these architectures is two-fold: yield enhancement ing faults. However, this approach is not only very costly, but and performance (like computational availability) improvement. also quite difficult (or even impossible) to implement, with the Our objective in this paper is to develop analytical models that increasing number of components that can be placed on one evaluate how yield enhancement and performance improvement But incorporating redundancy for fault tolerance does may both be achieved by introducing redundancy into VLSI and chip. WSI designs. Such models also allow us to evaluate the cost- provide a very practical solution to the low yield problem. effectiveness of a given fault-tolerance strategy and calculate the This has been demonstrated in practice for high density amount of redundancy to be added. memory chips (e.g., [1]) and should be extended to other types of VLSI circuits. In general, yield may be enhanced because Index Terms-Computational availability, fault tolerance, can be accepted, in spite of some manufacturing redundancy, reliability, VLSI designs, wafer-scale integration, the circuit yield. defects, by means of restructuring, as opposed to having to discard the faulty chip. Achieving reliable operation also becomes increasingly I. INTRODUCTION difficult with the growing number of interconnected elements IMPORTANT innovations are likely to occur in two and hence, the increased likelihood that faults can occur. Here VLSI-based areas, namely, wafer-scale integrated too, redundant elements which are ready to replace faulty ones architectures, and single VLSI chip/multielement architec- when the system is in operation, can increase the reliability tures. The former has the potential for a major breakthrough and other performance measures like computational availabil- with its ability to realize a complete system on ity. This will eliminate the steps In summary, the justification for introducing fault tolerance a single wafer. expensive required is to dice the wafer into individual chips and bond their pads to (redundancy) into the architecture of VLSI-based systems In addition, internal connections between chips two-fold. One is to deal with manufacturing flaws and increase external pins. and on the same wafer are more reliable and have a smaller the yield. The other is to deal with operational faults propagation delay than external connections. The latter does enhance the performance availability. make it possible to build a high-speed processor on a single Our objective in this paper is to formulate analytical models chip, designed by interconnecting a large number of simple that will enable us to analyze the effectiveness of a given fault- processing elements, memory modules and the like. These tolerance technique in increasing yield and improving per- imaginaion of several formance, or find the tradeoff between the two. These models architectures already have captured the tech- and researchers alike. will also allow us to compare various fault tolerance computer manufacturers the Much recent research has focused on these new architec- niques, examine different system topologies and determine tural innovations, especially those created by interconnecting a optimal amount of redundancy to be added. that have to be considered large number of elements such as processors, memories, In the next section, the aspects when evaluating a fault tolerance strategy are detailed. In switches, communication links etc, all on a single chip or of wafer. Concerns about fault tolerance in such VLSI-based Section III, expressions for the actual and apparent yield In Section IV systems stem from the two key factors of performance and VLSI chip with added redundancy are derived. yield enhancements. we present models that allow us to compute various mneasures of combined performance and reliability. Then, an example of is in Section Manuscript received August 1, 1985; revised February 4, 1986 and JunelO, a VLSI-based system with redundancy analyzed 1986. This work was supported in part by the United States Air Force Office V and final conclusions are presented in Section VI. of Scientific Research under Grant 84-0052 and by the National Science Foundation under Grant DCR-8509423. II. FAULT-TOLERANCE IN VLSI AND WSI I. Koren is with, the Department of Electrical and Computer Engineering, University of Massachusetts, Amherst, MA 01003, on leave from the A variety of techniques for introducing fault tolerance into Technion-Israel Institute of Technology, Haifa 32000, Israel. VLSI and WSI architectures with regular topologies have been D. K. Pradhan is with the Department of Electrical and Computer Engineering, University of Massachusetts, Amherst, MA 01003. recently proposed, [2], [3], [6], [7], [10], [15], [17], [18]. IEEE Log Number 8611446. Because fault tolerance is an involved subject, completely 0018-9340/87/0300-0344$01.00 © 1987 IEEE KOREN AND PRADHAN: EFFECT OF REDUNDANCY ON YIELD AND PERFORMANCE OF VLSI SYSTEMS 345

different schemes might be cost-effective in different situa- analysis we have to take into account the relative hardware tions and for different objective functions. complexity (silicon area) of all system elements, and their Several aspects have to be considered when evaluating a susceptibility to failures (manufacturing defects or operational fault-tolerance strategy for multielement systems. The first is faults). the type of failures to be dealt with. There are two distinct Processing elements (PE's) are traditionally considered the types of failures with which fault-tolerance strategies can be most important system resource; hence, achieving 100 percent designed to deal. These are production defects and operational utilization of them is often attempted. For example, in [2], faults. A relatively large number of defects is expected when [15], and [18] switching elements are added between proces- manufacturing a silicon wafer in the current technology. sors to assist in achieving this goal. In [3] and [10] connecting Normally, all chips with production flaws are discarded tracks are added on the wafer to be used in bypassing the leading to a low yield. defective PE's when connecting the fault-free ones. However, Operational faults have in comparison a considerably lower the silicon area that needs to be devoted to switching elements probability of occurrence, the difference of which may be in (e.g., switches capable of interconnecting 4 to 8 separate orders of magnitude. Improvements in solid-state technology parallel busses [18]) or to additional communication links and maturity of the fabrication processes have reduced the cannot be ignored. Consequently, such schemes might be failure rate of a single component within a VLSI chip. beneficial only for PE's which are substantially larger than the However, the exponential increase in the component-count per switches and the additional links (e.g., [13]). Also, the VLSI chip has more than offset the increase in reliability of a addition of switching elements and especially the longer single componient. Thus, operational faults cannot be ignored interconnection between active processors result in longer although they have -a substantially lower probability of delays affecting the throughput of the system. To overcome occurrence compared to production defects. Consequently, a this performance penalty, it has been suggested in [9] to add fault-tolerance strategy that enables the system to continue registers for bypassing faulty processors. The effect of this is processing, even in the presence of operational faults, can be to introduce extra stages in the pipeline, thus increasing the beneficial. latency of the pipeline without-reducing its throughput. The two types of failures, manufacturing defects and In the above mentioned schemes, one of the underlying operational faults, also differ in the costs associated with them. assumptions is that the extra circuitry (e.g., switching ele- Defects are tested for before the IC's are assembled into a ments, communication links or registers) are failure free and system and therefore, they contribute only to the production only processors can fail. However, larger silicon areas costs of the IC's. In contrast, faults occur after the system has devoted to those elements increase their susceptibility to been assembled and is already operational. Hence, their defects or faults; as a result, the above-mentioned assumption impact is on the system's operation and their damage might be might not be valid any more. substantial, especially in systems used for critical real-time In general, there are several alternative ways for introduc- applications. Clearly, a method which is cost-effective for ing redundancy into the system. Redundancy can be intro- handling defects is not necessarily cost-effective for handling duced into the architecture at the basic element level and/or at operational faults, and vice versa. the system level. In the case of system level redundancy, spare For both types of failures in VLSI, a repair operation is elements are added to the original design and they will be used impossible and the best one can do is to somehow avoid the use to replace any faulty system element. In the case of element of the faulty part by restructuring the system links so as to level redundancy, each element has some internal redundancy isolate it. This restructuring capability can be either static or allowing it to remain operational even in the presence of dynamic. When the manufacturing is completed, the certain internal faults (with possibly a lower computational chips within the wafer are tested to determine the defective capacity). Note that both element level and system level ones. A static scheme using for example fusible links or laser- redundancies can be incorporated into the same system. formed links, can be then employed to configure out the Several forms of redundancy can be used to handle defective elements and interconnect most of the good ones to manufacturing defects to increase the yield. The defective form an operational system. Dynamic restructuring is required elements are configured out and the good ones are intercon- during the normal system operation, when faulty parts have to nected to form an operational system. Once this procedure is be restructured out of the system without human intervention. completed, the system goes into operation and it has to handle Such a dynamic strategy might be appropriate to handle from this point on only operational faults. At this point the defects as well. Comnparing the two alternatives, static fault-tolerance capacity of the system is used to improve its schemes tend to use less hardware but consume operator time, performance availability. First, the remaining redundant while dynamic schemes which are controlled internally by the elements (if any) can be used as spares and then, the system is system usually require extra circuitry. gracefully degraded. We conclude therefore, that the same Another aspect that has to be considered when evaluating redundancy can be used for yield enhancement and for the effectiveness of a given fault-tolerance technique, is the performance improvement as well. required hardware investment. The hardware added can be in the form of redundant switching elements, [21, [18], [15] and! III. THE YIELD OF FAULT-TOLERANT CHIPS or redundant processors, communication links, or other To evaluate the effectiveness of redundancy for yield system elements [3], [10], [6]. When carrying out such an enhancement we need an expression for the yield of a fault- 346 IEEE TRANSACTIONS ON COMPUTERS, VOL. C-36, NO. 3, MARCH 1987

tolerant VLSI chip. Such expressions have been presented in having zero defects in all its elernents [11] and [7]. A more general expression for the yield was proposed in [8]. In what follows we modify the latter to Y= Pr 1 (3.2) include some simplified yield models (as used in [3] and [10]) i Aid>J( +i and to take into account the effect of incomplete testing on the where n is the total number of different types of elements in yield. the chip. The yield of any VLSI chip depends on the types of defects Note that in (3.2) we implicitly assume that all mhanufactur- which may occur during the manufacturing process and their ing defects result in logical faults which in turn cause distribution. The majority of fabrication defects can be erroneous behavior of the chip. Certain defects may however classified as random spot defects [20] caused by minute produce no faults at all. For example, a defect in the outer area particles deposited on the wafer. The area of the system of the chip which is usutally occupied by bonding pads, may be elements that we will be considering here and for which we harmless to the electrical performance of the circuit. To will have spare ones (e.g., processors, memories, busses etc), consider logical faults instead of fabrication defects, we will as is substantially larger than the expected area of a spot defect. a first approximation, multiply the defect density d by the Consequently, we assumie in what follows that each spot defect probability that at least one fault is caused by a defect. If for affects only a single element. example, we adopt the assumption made in [16] that the For the statistics of the fabrication defects we can adopt one number of logical faults corresponding to a single defect of the models suggested in the literature like Poisson, follows a Poisson distribution with mean c, we should then binomial, general negative binomial statistics and others. multiply d by (1 - e-C). For convenience, we will in what Under proper assumptions each one of these statistics can be follows still refer to manufacturing defects (rather than logical used and the "correct" one is the one that fits the data best faults) with average density d which equals the original defect [20]. One model which has been shown to agree with density multiplied by the probability that a defect results in a experimental results, is the generalized negative binomial logical fault. distribution [19]. Its attractiveness stenms from the fact that it does not assume that all defects are evenly distributed The Yield of a Chip with Added Redundancy throughout the wafer but rather allows defects to cluster. We Suppose now that redundancy is added to a chip at the adopt here this distribution although all our subsequent results system level so that s, defective elements of type i ca-n be remain valid if some other distribution is selected. tolerated, (i.e., substituted by good spares), and denote by N1 The negative binomial distribution has two parameters; d is the total number of elements (including the spares) of type i in the average defect density1 and a is the defect clustering a chip. Then, the chip is acceptable with any number of parameter. A low value of a can be used to model severe manufacturing defects in type i elements as long as all of them clustering of defects on a wafer, while for a oo we obtain are restricted to at Most si elements. The yield, which is now the Poisson distribution. These two parameters depend on the the probability of a chip being acceptable, is given by number and complexity of manufacturing steps performed. n Consequently, different system elements might have different Y= Pr {There are defects in at most values for these two parameters. Let di and oai be the two ,11 parameters for type i system elements, then the probability of si elements of type i} (3.3) having xi defects in the system elements of type i on a chip is If we denote (Aid, xi a = Pr {The defects in type i elements are all (x, + distributed in exactly j out of the Ni elements} PrPrlXi=x}l= r ai) (3.1) {X=x1}=x1!Pr(a1) then (1+ n Si ai Y=I2IEa50. (3.4) i=1 j=O where Ai is the total silicon area devoted to type i elements within the chip. We may now obtain an explicit expression for the probabil- in The parameters d and at of the yield equation are estimated ity a(i) two different ways. One is to follow [8] and define either by monitoring many wafers or from a few carefully placed test chips on each wafer. Another technique for Q = Pr {x defects are distributed into exactly estimating the parameters of the yield equation has been I out of N elements/There are x defects} recently proposed by Seth and Agrawal [16] based on the commulative coverage of test patterns applied to the chips which equals after manufacturing. For nonredundant chips, the yield is the probability of N N (I-k)x QXJ) z( 1) (k k=0 j-k, N-jJ The average number of defects per Chip X (e.g., [8]) is d*A where A is the chip's area. for x.j and 0

where (kN-kN is the multinomial coefficient. (isolated) given that it is faulty. By adding this probability one For x < i we have Q = 0 and for j = 0 may consider less than perfect procedures for locating faulty elements and reconfiguring them out of the system. (N)_ I if x= 0 To tolerate si defective elements of type i, at least Q xo - si jo otherwise. redundant ones are needed. However, the exact amount of This expression was derived assuming that the defects are required redundancy depends upon the specific static or distinguishable, i.e., Boltzmann statistics are followed [20]. If dynamic reconfiguration scheme used. This in turn, deter- we select Bose-Einstein statistics, the defects are indistin- mines the increase in chip area which must be taken into guishable and the resulting expression is account when calculating the yield, since a larger number of defects is expected now. Let zys denote the increase in the area Ai (due to the addition j-1 of redundancy), needed to tolerate these si faulty elements of Q((JN))_ y I)k i x+j-k-1 type i. Let denote the increase in total chip area that is a (- k x -yf (x+N- 1 k=O required to tolerate all s = (sl, s2, .,* sn) faulty elements. The factor oy is called the redundancy factor [7] and it for,x>j and ON-J (l - Yy)' (3.8) we call wafer-equivalent yield, is obtained from (3.4) after dividing it by yy. The assumption here is that each element of type i may be By comparing the wafer-equivalent yield of the fault- defective with an independent probability (1 - Y,). This tolerant chip and the yield of the simplex one (with no fault- approach has been adopted, for example, in [3] and [10]. tolerance features), we can determine whether it is beneficial When setting the parameters for the yield of a single when yield is considered, to have built-in fault tolerance and element Yi, we may require that the expected value and how many redundant elements should we add. This compari- variance of the number of defects in the total chip area will be son can be done for various system topologies and different the same as in the first approach. The expected value and reconfiguration algorithms. variance for the generalized negative binomial distribution are An analysis along these lines has been done in [11] and in Ad and Ad(l + Ad/a), respectively. Therefore, both Aid [7]. In both it has been observed that the improvement in yield and a, should be divided by Ni. Consequently saturates above some amount of redundancy. This indicates that there is an optimal amount of redundancy that should be -caiINi added. 1(Aid Yi= + (3.9) Still, the exact value of this optimal amount of redundancy does depend upon the expression we adopt for the wafer- Before comparing these two alternatives for calculating the equivalent yield of a fault-tolerant chip. To illustrate this, we probability a(i) (given by (3.7) and (3.8)), we return to the compare in the following example the optimal amounts of general equation for the yield, i.e., (3.4). This equation can be redundancy when using the above two alternative schemes for multiplied by a "bypass coverage probability" [11], which is calculating a') (defined by (3.7) and (3.8), respectively). the conditional probability that an element can be bypassed Example: Consider a chip with only a single type of 348 IEEE TRANSACTIONS ON COMPUTERS, VOL. C-36, NO. 3, MARCH 1987

element that can have defects. Assume further that at least 15 elements, it is simpler to derive a general expression for the such elements are required and the redundancy factor is apparent yield. Assuming, for simplicity, that the fault therefore, y, = 1 + (sI 15). The optimal percentage of coverage is the same for all types of elements, we obtain redundancy (i.e., s/15) for the two schemes was computed for n S. different values of the clustering parameter. Yapparent = i [ a () As was expected, for high values of a (i.e., no clustering), il j=O both schemes call for exactly the same amount of redundancy. Here, the generalized negative bionomial distribution ap- proaches the Poisson distribution for which the two schemes + i a(')~ Q() (fc)r(l-fC)j r1 (3.12) j=sj+1 r=O are equivalent. This was observed for a 2 10 and Ad = 5 IV. RELIABILITY AND PERFORMANCE OF FAULT-TOLERANT CHIPS defects per silicon unit area. For low values of a(ca < 0.01), adding redundancy is not recommended by any of the two In the previous section we have derived expressions for the schemes; the defects are severly clustered resulting in a high actual and apparent yield of a fault-tolerant VLSI chip. In this yield of the simplex chip. section we develop models for operational fault-tolerant chips For values of a in (0.01, 10), the second scheme calls, in in order to derive expressions for their reliability and most cases, for a lower optimal amount of redundancy. performance availability. Moreover, this scheme might be too optimistic since it Once the manufacturing process has been completed, chips achieves a higher wafer-equivalent yield (compared to that having si or less defects in type i elemetits (i = 1, 2, * *, n) achieved by the first scheme) for a smaller amount of are accepted and then reconfigured to avoid the use of the redundancy. For example, for a = 0.6 the yield of the defective elements. If the number of defective elements of any simplex chip is for both schemes 26.2 percent. The optimal type i is less than s,, the chip has some "residual" redundancy amount of redundancy according to the first scheme is 33 which can then be used for performance enhancement, i.e., percent achieving a maximum wafer-equivalent yield of 47.4 handle operational faults which occur during the life time of percent. The optimal amount of redundancy according to the the system. Even chips in which no redundant elements are left second scheme is only 20 percent but the maximum wafer- when leaving the manufacturing site (i.e., there were origi- equivalent yield is 78.2 percent. Still, only experimental data nally E=n si defects in the chip), can still benefit from the obtained by monitoring wafers can show which scheme is fault-tolerance capability, if graceful degradation is allowed. "correct." To evaluate the effectiveness of the "residual" redundancy and the fault-tolerance capacity of the chip, we have to select The Effect of Imperfect Production Testing some performance measures and we need a model that will For both schemes, a chip with redundancy is considered allow us to calculate these measures. A natural choice for this acceptable as long as the number of defective elements of one purpose is a Markov model. type does not exceed the number of spares for this type. This The proposed Markov model includes a single system assumes that the applied production testing procedure is failure state (F). At all other states of the model, the system is perfect and we accept only chips which satisfy the above operational in the presence of some faulty elements. Consider requirements. However, testing is never perfect and conse- now such a general state and letfi and ui denote the number of quently, chips having more than the allowed number of faulty elements and active (used) ones of type i at this state, defective elements will be declared good resulting in a higher respectively. Note that ui is defined as the number of active "apparent" yield [22] [16] defined by elements and not just fault-free ones. We have therefore, the inequality = Y+ Yapparent Ybg ui

This inequality means that if less than elements were Sm si . defective at t = 0, the system will endure the failure of more than mi elements at t > 0. An example of the suggested Markov model for a chip with two types of elements that can fail, is depicted in Fig. 1. In this model (fi, f2) is a state at which the system is operational in the presence of fi and f2 faulty elements of type 1 and 2, fIf respectively. A transition from state (f1, f2) to the system I.v failure state (F) takes place when an additional element becomes faulty and the 'system fails to recover from its effect. The corresponding transition rate is denoted by af2. Simi- lary, f4il2 f2 'is the transition rate from state (ft, f2) to state sm j (fl + 1, f2) These transition rates depend on the failure rates of the F system elements and on the probability of a successful recovery (by reconfiguration) of the system following an Fig. 1. A Markov model for a VLSI chip with defects and operational faults. element failure. Let Xi denote the failure rate of operational elements of type i (adopting the common assumption that the simplified by partitioning the Markov chain into n independent time to failure is exponentially distributed), and let pi denote chains. Each will then be solved separately and the final results the probability of a successful recovery, i.e., will be combined to obtain the required performance mea- pi= Pr {The system recovers (reconfigures) successfully/ sures. This case is demonstrated in the next section. State (0, 0) of the Markov model in Fig. 1 is the initial state A fault in type i element has occurred} of the'system if no defects occurred while the chip was The expression for the transition rate from (fi, f2) to (fi + 1, manufactured. If there were k, and k2 defective elements of f2) depends on the restructuring strategy employed. For type 1 and 2, respectively, (0 c kI c si, i = 1, 2) then (kl, k2) example, if upon a failure all the remaining fault-free elements will be the initial state. The probability of this event is of both types are used (active) then akl * a(2 (0 s1 and k2 < S2, the to some empirical rule. Several such rules can be envisioned, probability of (kl, k2) being an initial state is then for example, the average over all possible positions of the Si faulty elements, or the worst case one, or the most probable (1) ki r (2) 11 ' akl E r (fC)r(I -fC) .* ak2 (4.5) one. r=O There are cases in which ui depends only on the value offi(i = 1, 2, * , n). In these cases, not only the above expression The expressions for the probabilities of the other two kinds for the transition rate is accurate but the entire model can be of initial states may be derived similarly. Summing the 350 IEEE TRANSACTIONS ON COMPUTERS, VOL. C-36, NO. 3, MARCH 1987 probability of each state being an initial state results in the The summation in (4.9) is over all paths (total of apparent yield as given by (3.12). (f1l i Sf2[k2 ) paths, each consisting of (f/ + f2) - (kA + k2) A state like (sl + ml, f2) in Fig. I is a terminal state with + 1 nodes) in the Markov model (Figure 1) from state (kA, k2) respect to failures in type 1 elements, i.e., a state from which, to (fi, f2)- upon an additional failure of a type 1 element, the only For such a Markov model we can calculate several transition possible is to the system failure state (F). Recall that performance measures like Reliability, Performability, Com- ml is the largest number of faulty elements of type 1 that the putational availability, and Area utilization [7]. Let Rk1,k2(t) system can tolerate if no redundant elements were left when denote the reliability of a system (i.e., the probability that it the system went into operation. Similarly, all states (fi, s2 + operates correctly in the time interval [o, t]), which had kA and M2) are terminal states with respect to failures in type 2 k2 defective elements of type 1 and 2, respectively, at t = 0. elements. This reliability can be calculated from the above Markov We define now the state probabilities of our model model as follows SRk1Sk(t)=l+ml 52+m2 2 (4.11) P (kik2(t) =Pr (The system is in state (fi, f2) at time t/ Rkl,k2(t)= f =flff2k (t). fl=kl f2=k2 The system was initially in state (kl, k2)} We may then define and compute forf1 .Ak1 andf2 . k2) with 1 and P(klk12)(0) s1+m1 s2+m2 Ptkl4k2)(0) = 0 forf1 > Ac orf2 > k2- E Pr (kl, k2) is the initial state} ,k2(t) The Markov model in Fig. 1 is described then by the E *Rkj following differential equations. R (t) =kj=O s1+m1k2=O s2+m2 Pr is the initial state} dP (k1,k2) (t) {(kl, k2) k 1, k2 kj=O k2=0 _ _ =pcr1,2kl1k2I.t)t (4.6) dt Ck,2 lk (4.12) as the average reliability of a system having si(i = 1, 2) or less =J2 - k1'k2gfl dt af f2Pf,f2 (t defects when manufactured. The probability of (kA, Ak2) being an initial state is obtained _ 2 (t) + ) (4.7) either from (4.3) or from (4.5) or a similar appropriate +tft f2PflIS flf2 1fl f2 equation. The denominator in (4.12) (which is a normalization where factor so that the probabilities sum to 1) is the apparent yield (3.12). If the fault coverage fc is nearly 1, then equation

I afl 'f2 =cYf2+ Yfif2h+ + a%,f2 (4.8) (4.12) reduces to We denote by V4 any vertex (state) at level b in Fig. 1, i.e., I Si 52 R(t)=1 E al2 * a52 * Rkl,k2(t) (4.13) any state (i, j) satisfying i + j = b. Thus, the sequence (kA, kl=O k2=0 k2), Vkj+k2+1, * * vi+ 1, (i, j) corresponds to a path in Fig. 1 from the initial state (kA, k2) to the state (i, j). The average reliability is in general a function of the amount Using this notation, the solution of (4.6) and (4.7) under the of added redundancy. When searching for the optimal amount condition of redundancy to be incorporated into the chip, we should not ignore the cost associated with redundancy, i.e., a larger chip ai,4j*ax,y for all (x, y)*(i, j) area resulting in a smaller number of chips fabricated out of a given size wafer. We may therefore, attempt to optimize the which is usually satisfied, is expected number of chips, in a given wafer, that are operating P klk2)(t) correctly in the time interval [0, t]. This equals the reliability of a single chip times the expected number of chips per wafer. The number of chips decreases as redundancy is introduced, Yd akl,k2+1 vkl+k22+1 t'fl+f21 by the factor of Consequently, we may search for the A path (kl,k2), * (flif2) 'yg. values of s1 and s2 that optimize the following cost function, fl +f2 e-av-b t which we call wafer-level reliability X fl +f2 (4.9) b=kl +k2 II (aVC- avb) C=k1 +k2 WR (t)= (4.14) c*b *Ys and This will allow us to determine whether it is beneficial when kl,k2)(t) = ecklI,k2t (4.10) reliability is considered, to introduce redundancy into the architecture of the system and how many redundant elements where Vkl + k2 and vfl +f2 in (4.9) specify the states (kl, k2) and of each type we should include. (fA, f2), respectively. Example: The wafer-level reliability as a function of the KOREN AND PRADHAN: EFFECT OF REDUNDANCY ON YIELD AND PERFORMANCE OF VLSI SYSTEMS 351 active elements of both types in State (f1, f2). An expression for the computational capacity of an example system is 0.s developed in the next section. Here too, when searching for the optimal amount of redundancy, we should employ a cost function which con- siders the additional silicon area needed when fault tolerance is introduced into the chip. Such a cost function, called area -J 0.7 utilization in [7], is a] A (klk2)(t) w = (4.16) LJ Uk ,k2(t) 0 5 Other performance measures, like mean time to failure, can also be calculated. For example, let Tkl,k2 denote the mean LLJ time to failure of a system which was initially in state (k1, k2), ': then 0.3/ Tkl,k2= Rkl,k2(t) dt. (4.17) The average mean time to failure can be defined similarly to I 3 5 7 (4.12). S A Modelfor a System with Element-Level Redundancy Fig. 2. The wafer-equivalent yield and wafer-level reliability of a fault- tolerant chip. The Markov mnodel depicted in Fig. 1 describes a system which has redundancy only at the system level. Here, a single added redundancy has been calculated for the following defect or fault in an element requires the switching out of it with example. A chip has one type of elements that can have defects (isolation) of the failing element and the replacement or operational faults. Fifteen such elements are required but a spare one (if one exists). We may add redundancy at the us to use an element even if it the chip can be used if at least 12 are operational (i.e., m = element level which will allow has one defect or failure. We assume that the second failure is 3). The model parameters chosen are a - 0.743, Ad = 1.93 two will to [16] and p = 0.9. The wafer-equivalent yield of this chip is fatal meaning that an element having failures have maximized when four redundant elements are added to the be switched out. An example might be a multiprocessor in simplex chip. The improvement in wafer-level reliability also which each node consists of a dual processor. If one of the two a saturates above some amount of redundancy. The optimal processors fails, we may still use this node (with possibly as as the second amount of redundancy that maximizes the wafer-level reliabil- lower computational capacity) long processor ity depends on the mission time of the system. Fig. 2 depicts is operating correctly. the wafer-level reliability and wafer-equivalent yield as A Markov model suitable for such a system with a single functions of the number of spare elements with the mission type of elements (e.g., processors) that can fail is shown in time as parameter. For a low value of t (time is measured in 1/ Fig. 3. Here, (f, f) is a state at which the system is operational nodes X units), the optimal amount of redundancy is = 0. For t in the presence off faulty nodes and f partially faulty sp. nodes having a single faulty processor). As before, s + = 0.25/X, sop = 3, and for t = 0.35/X, s,p. = 6. The (i.e., tradeoff between yield enhancement and reliability improve- m is the maximum number of defective and faulty nodes that ment depends therefore, on the mission time. Graphs like the the system can tolerate. Consequently, the state (s + m, N - one shown in Fig. 2, can be used to determine the final amount s - m) is a terminal state. The initial state of the Markov of redundancy to be added when both yield and reliability are model is determined by the number and distribution of the considered. manufacturing defects. If s is the maximum number of Reliability is considered in many cases an insufficient defective nodes (partially or completely) that we are willing to measure for the performance of a system and hence, other tolerate and still accept the VLSI chip, then all the states (j, J); measures have been proposed and used. Many such perform- j = 0,1, *,s; = 0,1,* - j arepossible initial ance measures can be computed using the same Markov states. Let aj,; be the probability of (j, I) being the initial model. For example, the computational availability A (1k2)(t) state, then this is the probability of having 2j + i defects out which is the expected available computational capacity defined of which exactly 2j appear in pairs, hence by s1+ml s2+m2 (N)(N-i)2;

D (4.15) A (klfk2)(t) = fk1, k2= (2t) a2 (4.18) fl=kl f2=k2 aj1= i j+J^J where CfIf2 is the computational capacity of the system in state (f1,f2) [7], expressed for example in instructions per time unit. The computational capacity depends on the number of where a2 +I ,is given by (3.7) with N being replaced by 2N. 352 IEEE TRANSACTIONS ON COMPUTERS, VOL. C-36, NO. 3, MARCH 1987

)(t _ (k,k) dflf tj dt oef+fa7 (t)

+ a f- lf+ -_lf1, (t fgf_ I PsM- (t) (4.21) where

f+i F ajf,y=ajjCZ ,f-1 +a/ff+f,f 1 + f,j. (4.22)

VA f- We denote by VU any state (J, f) satisfying 2f + f = b. For example, v5 iS any one of the states (2, 1) and (1, 3) in Fig. 3. Thus, the sequence (k, k), V2k+ k+ I, V2k+k2, V2f+f - 1, (f, f,f f) corresponds to a path in Fig. 3 from the initial state (k, k) to the state (f, f). Using this notation, the solution of (4.20) and (4.21) under the condition af,f* aXX for all (x, x)*(f f which is usually satisfied, is k) Pp(kf,f )(t)t

= + k+ I V2k 2 . .. ot12k CZ +^+1 otf,f 7k,k 2k+k:+ U2f+f- I A path (k,k), ,(f,f) Fig. 3. A Markov model for a VLSI chip with element-level and system- level redundancy. 2f+f e vbt E 2f+f (4.23) The yield of the chip is b=2)k+k rIH (°avc dvb-) c=2k+k s N-j c*b -^. (4.19) =Y-y=i y=0Y, aj Ij=o fr=o and An expression for the apparent yield, similar to (3.13), can (k - ak,kt p k,'.'Q)k -e (4.24) also be derived. If we insist on having at least N - s perfectly operating where V2k+k and V2f +f in (4.23) specify the states (k, k) and nodes, then s is the total number of completely or partially C!, f), respectively. faulty nodes that we allow, and the yield is given in this case The summation in (4.23) is over all paths (total of by (2f -25.+{-1) paths, each consisting of (2f + ) - (2k + 1) s s-j + 1 nodes) in the Markov model (Fig. 3) from state (k, k) to Y= Z a10-. the state V,f f). j=O J=0 Using the state probabilities given by (4.23), one can derive for various measures and determine The state probabilities for the Markov model shown in Fig. expressions performance as was done for the 3 are defined as follows: the optimal amount of redundancy, previous Markov model. -(t)Pr (The system is in state (f, f) at time t/ V. A MULTIPLE Bus MULTIPROCESSOR SYSTEM The system was initially in state (k, k)} As an example for a VLSI chip consisting of several types of system elements with some redundant ones of each type, k-0, 1, , s; k=0, 1, , N-s; consider a multiple bus multiprocessor systeni designed on a and (f, f) . (k k) lexicographically. large area chip (or even a wafer). Let P, M, and B denote the number of homogeneous processors, memory modules and with P(kt) (0) = 1 and p7,k) (0) = 0 for (f, > (k, k) interconnecting busses, respectively, that are operational lexicographically. (fault-free). Such a system is depicted in Fig. 4 where each The above state probabilities satisfy the following differen- global bus can connect any processor to any memory module. tial equations: Following the notation used in [12] and [21], we refer to this multiprocessor as a P * M * B system. There are several ways to characterize the of a P * (4.20) behavior dPkWk=-°kPkk(t) M * B system. Our purpose in this section is only to illustrate KOREN AND PRADHAN: EFFECT OF REDUNDANCY ON YIELD AND PERFORMANCE OF VLSI SYSTEMS 35;3

/3 (P)p /(P-i)P D(i)p

bus I 6 26 (P- )1 (P-i+)6 P5 bus 2 Fig. 5. A birth and death model for a P * M * B multiple bus multiprocessor.

bus B to exactly j out of the M memory modules is Ql.) which is defined in (3.5). This probability has to be multiplied by min processor processor processor (j, B) since only B requests can be serviced ifj is greater than 2 0 0 0 p the number of busses B. Finally we obtain Fig. 4. A P * M * B multiple bus multiprocessor. min (iMM) :) Q (m min (j, B) i=l, 2, * P. (5.1) the application of the model presented in the previous section. j=1 We adopt therefore, a relatively simple characterization of the The Markov chain in Fig. 5 is a birth and death one, whose system which is based on two parameters. One is the time solution is easily obtained. Let 4(i) denote the steady-state between two consecutive memory references and the second is probability of state i, then the connection time between processor and memory in a single memory access. Both parameters are random variables and are assumed to be exponentially distributed with mean 1/6 i - I /3(P-k) (processing time) and 1/,u (connection time). A i 1-i We wish to find an expression for the computational ,I# k= capacity of a P*M*B system denoted by CP,M,B. An expres- t6 i! eb(i) (i= 1, 2, . P) sion that will allow us to determine the optimal number of spare processors, spare memory modules and spare intercon- nection busses to be designed in the VLSI chip so that yield p HA)II (P-k) and/or performance are maximized. We may define the computational capacity of the P * M *1B system as the expected number of active processors, i.e., (5.2) processors which are executing their task and not idle while waiting to access a common memory module. This perform- ance index is known as processing power. Other performance m-1 indexes like the average cycle time and the instruction ]j execution rate can be simply derived from the processing m= (It ) m(P-k) power index [12]. To calculate the processing power index we may construct a queuing network model. The computational complexity of this model increases very rapidly with system size. Fortunately, as The processing power .measure is given by has been shown in [12] and [21], approximate models with reasonably small errors in the final results, can be employed. p CP,M,B = 1; i(bi) e (5.3) These are derived by lumping "equivalent" states of the i=l model to obtain a Markov chain of substantially smaller size. An example of such a model is shown in Fig. 5. In it, at state Expression (5.3) can now by substituted into (4.15) to (P - i) there are P - i processors which are executing their calculate the computational availability which is defined now tasks while the remaining i processors are idle being serviced as the expected available processing power. Equation (4.16) or waiting to be serviced by a memory module. may then be used to compute the area utilization of the At a rate of i3(i) * I, one of the i idle processors will multiprocessor VLSI system. complete its service increasing the number of active ones to P To calculate the computational availability we also need the - i + 1. ,B(i) is the average number of processors, out ofthe i state probabilities of a Markov model similar to the one shown idle ones, that are serviced at a given time instant. Similarly, at in Fig. 1, with three types of elements, namely processors, a rate of (P - i)6, an active processor will generate a memory memory modules, and busses. Fortunately, in the multiple bus request and join the idle processors, reducing the number of multiprocessor system, the number of active elements of any active ones by 1. type depends only on the number of faulty elements of this To derive an expression for ,3(i), we assume (as in [12] and type and is independent of faulty elements of the other two [21]) that processors request service from the different types. Consequently, the Markov model (like the one in Fig. memory modules with equal probabilities. Hence, the proba- 1) may be partitioned into three independent ones, each solved bility that all i requests of the idle processors will be directed separately. 354 IEEE TRANSACTIONS ON COMPUTERS, VOL. C-36, NO. 3, MARCH 1987

of spare memories and busses were fixed at 3 and 1, respectively, and the different values for the number of spare processors appear on the horizontal axis. One conclusion that might be drawn from this figure is that the area utilization measure is more sensitive to the number of

z spare memory modules, than it is to the other two types of 0 elements.

N Another interesting phenomenon was observed while per- -J forming this analysis. The optimal number of spares of any type in Fig. 1, is independent of the numbers of spares of the Di 2P other two types of elements. For example, the curves (2, -, 1) and (0, -, 0) have their maximum at exactly the same value of 3 spare memories. The same phenomenon was observed for a mission time of t = 0.3 l/Xo. Here, the optimal values of spare numbers are: (4, 3, 2) for maximum wafer-equivalent 3 5 7 9 yield (as before), (3, 5, 2) for maximum wafer-level reliabil- NUMBER OF SPARES ity, and (4,5 2) for maximum area utilization. However, for a Fig. 6. The area utilization of the P * M * B multiprocessor as a function of longer mission time (e.g., t > 0.4 MO0) where the optimal the number of spare elements of different types. values of the spare numbers are higher, the above indepen- dence is not preserved. The state probability which is needed for the calculation of the computational availability (and the system reliability as VI. CONCLUSIONS well), is equal to the product of the state probabilities obtained separately from the three simpler Markov models for the VLSI and WSI architectures that use redundancy for yield processors, memory modules, and busses. and performance improvement have been considered. The This calculation was done for a multiple bus multiprocessor available redundancy on the chip or wafer is primarily limited system with the following parameters: by the size of the chip or wafer; hence, it is imperative to find a method by which one can share the available 1) Eight processors with a = 0.3, Ad = 1.5, p = 0.9, and optimally X = X0 (time will be measured in 1/X0 units). In addition, m = redundancy between yield enhancement and performance 2, and 6/I= 0.91. improvement. 2) Eight memory modules with e = 0.2, Ad = 1.2, p = We have developed in this paper analytical models for the 0.92, X Xo, and m = 1. evaluation of performance and yield improvement through redundancy. The models can be used to the 3) Four buses with a = 0.12, Ad = 0.9, p = 0.95, X = proposed study 0.75XO, and m = 1. effect of sharing element level and system level redundancy, For these system parameters three sets of values for the between these two somewhat competing requirements. number of spare processors, spare memory modules and spare busses were obtained. For maximum wafer-equivalent yield 4, ACKNOWLEDGMENT 3, and 2 spare processors, memories, and busses, respec- The authors wish to thank D. Towsley and Z. Koren for tively, are required. In this calculation we have assumed that helpful discussions and to thank Z. Koren for developing the fc = 1 and therefore, the apparent yield equals the actual one. program to calculate the optimal amount of redundancy for For mnaximum wafer-level reliability (1, 3, 1) spares (proces- maximum yield and performance. sors, memories and busses, respectively) are required for a mission time of t = 0.2 * 1/Xo. For the same mission time, the REFERENCES maximum area utilization (wafer-level processing power [1] R. P. Cenker, et al., "A fault-tolerant 64K dynamic random-access availability), is achieved for (2, 3, 1) spares (processors, memory," IEEE Trans. Electron. Dev., vol. ED-26, pp. 853-860, memories, and busses, respectively). June 1979. A useful application for this model might be the analysis of [2] D. S. Fussel and P. J. Varman, "Fault-tolerant wafer-scale architec- tures for VLSI," in Proc. 9th Annu. Symp. Comput. Arch., May the relative importance of spares for the three types of system 1982. elements. To perform this kind of analysis we can, for [3] J. W. Greene and A. El Gamal, "Configuration of VLSI arrays in the example, set the number of spare memory modules and spare presence of defects," J. Ass. Comput. Mach., vol. 31, pp. 694-717, Oct. 1984. busses at some fixed values (e.g., their optimal values for a [4] D. Gordon, I. Koren, and G. M. Silberman, "Restructuring hexagonal desired mission time) and then, observe the dependency of the arrays of processors in the presence of faults," to appear in J. VLSI area utilization on the number of spare processors. Such an Comput. Syst. [5] K. Hedlund and L. Snyder, "Wafer-scale integration of configurable, analysis has been done for the above system parameters, and highly parallel (chip) processor," in Proc. 1982 Int. Conf. Parallel the results are illustrated in Fig. 6. Processing, Aug. 1982, pp. 262-265. In this figure, the area utilization is shown as a function of [6] 1. Koren, "A reconfigurable and fault-tolerant VLSI multiprocessor array," in Proc. 8th Ann. Symp. Comput. Architect., May 1981, the number of spares (spare processors or spare memories or pp. 425-441. spare busses). The notation (-, 3, 1) means that the numbers [7] I. Koren and M. A. Breuer, "On area and yield considerations for KOREN AND PRADHAN: EFFECT OF REDUNDANCY ON YIELD AND PERFORMANCE OF VLSI SYSTEMS 355

fault- tolerant VLSI processor arrays," IEEE Trans. Comput., vol. I| 2' _Israel Koren (S'72-M'76) received the B.Sc. C-33, pp. 21-27, Jan. 1984. / (cumm laude), M.Sc., and D.Sc. degrees from the [8] I. Koren and D. K. Pradhan, "Introducing redundancy into VLSI Technion-Israel Institute of Technology, Haifa, in designs for yield and performance enhancement," in Proc. 15th Ann. 1967, 1970, and 1975, respectively, all in electrical Symp. Fault-Tolerant Comput., June 1985, pp. 330-335. >~engineering. [9] H. T. Kung and M. S. Lam, "Fault-tolerance and two-level pipelining Since 1979 he has been with the Departments of in VLSI systolic arrays," in Proc. of 1984 Conf. Advanced Res. Electrical Engineering and at the VLSI, M.I.T., Jan. 1984, pp. 74-83. Technion-Israel Institute of Technology, Haifa, [10] F. T. Leighton and C. E. Leiserson, "Wafer-scale integration of A | | / Israel where he became the Head of the VLSI systolic arrays," IEEE Trans. Comput., vol. C-34, pp. 448-461, j Systems Research Center in 1985. Previously he has May 1985. held positions with the University of California, [11] T. E. Mangir and A. Avizienis, "Fault-tolerant design for VLSI: Santa Barbara and the University of Southern California, Los Angeles. In Effects of interconnect requirements on yield improvements of VLSI 1982 he was on sabbatical leave with the University of California, Berkeley. designs," IEEE Trans. Comput., vol. C-31, pp. 609-615, July 1982. He has been a consultant to National Semiconductor, Israel, in the areas of [12] M. Ajmone Marsan and M. Gerla, "Markov models for multiple bus architecture of microprocessors and high-speed algorithms for arithmetic multiprocessor systems," IEEE Trans. Comput., vol. C-31, pp. 239- operations, in 1984-1986, to Tolerant Systems, San Jose, CA in architecture 248, Mar. 1982. of fault-tolerant distributed computer systems in 1983, and to ELTA, [13] H. Mizrahi and I. Koren, "Evaluating the cost-effectiveness of Electronics Industries, Israel, in the area of architecture of parallel signal switches in processor array architectures," in Proc. 1985 Int. Conf. processors in 1981-1982. Currently he is a Visiting Professor at the Parallel Processing, Aug. 1985. University of Massachusetts, Amherst. His current research interests are [14] D. K. Pradhan, "Fault-tolerant architectures for multiprocessors and fault-tolerant VLSI architectures and the reliability evaluation of computer VLSI systems," in Proc. 13th Symp. Fault-Tolerant Comput., June systems. 83. [15] A. L. Rosenberg, "The diogenes approach to testable fault-tolerant arrays of processors," IEEE Trans. Comput., vol. C-32, pp. 902- Dhiraj K. Pradhan (S'70-M'72-SM'80) received 910, Oct. 1983. the Ph.D. degree in 1972. [16] S. C. Seth and V. D. Agrawal, "Characterizing the LSI yield equation He is currently a Professor in the Department of from wafer test data," IEEE Trans. Comput.-Aided Des., vol. CAD- Electrical and Computer Engineering, University of 3, pp. 123-126, Apr. 1984. Massachusetts, Amherst. Previously, he has held [17] C. H. Sequin and R. M. Fujimoto, "X-tree and Y-component," in positions with the University of Regina, Regina, VLSI Architectures, R. Randell and P. C. Treleaven, Eds., Sask., Canada, Oakland University, Rochester, MI, Eglewood Cliffs, NJ: Prentice-Hall, 1983, pp. 299-326. Stanford University, Stanford, CA, and IBM Cor- [18] L. Snyder, "Introduction to the configurable highly parallel com- poration. He has been actively involved with re- puter," Computer, vol. 15, pp. 47-56, Jan. 1982. search in fault-tolerant computing and parallel [19] C. H. Stapper, A. N. McLaren, and M. Dreckman, "Yield model for processing since 1972. He has presented several productivity optimization of VLSI memory chips with redundancy and papers on fault-tolerant computing and parallel processing. He has also partially good product," IBM J. Res. Develop., vol. 24, no. 3, pp. published extensively in journals such as IEEE TRANSACTIONS and Net- 398-409, May 1980. works. His research interests include fault-tolerant computing, computer [20] C. H. Stapper, F. M. Armstrong, and K. Saji, "Integrated circuit yield architecture, graph theory, and flow networks. statistics," Proc. IEEE, vol. 71, pp. 453-470, Apr. 1983. Dr. Pradhan has edited the Special Issues on Fault-Tolerant Computing, in [21] D. Towsley, "Approximate models of multiple bus multiprocessor IEEE COMPUTER (March 1980), and the April 1986 issue of this TRANSAC- systems," IEEE Trans. Comput., vol. C-35, pp. 220-228, Mar. TIONS and served as Session Chairman and Program Committee member for 1986. various conferences. He is an Editor for the Journal of VLSI and Digital [22] R. L. Wadsack, "Fault coverage in digital integrated circuits," Bell Systems. He is also the Editor of a book entitled Fault-Tolerant Computing: Syst. Tech. J., vol. 57, pp. 1475-1488, May 1978. Theory and Techniques (Prentice-Hall).