<<

roundtable

Fault Tolerance

Jeffrey Voas

his is a “virtual” roundtable discussion ent ways and with differing success for between five respected experts from hardware as compared to . the fault-tolerant computing field: Karama Kanoun and Jean-Claude Laprie: Joanne Bechta Dugan (UVA), Les Hat- is generally implemented by ton (Oakwood Computing), Karama error detection and subsequent system recov- T Kanoun and Jean-Claude Laprie ery. Recovery consists of error handling (to Particpants (LAAS-CNRS), and Mladen Vouk (NC State eliminate errors from the system state) and University). The reason for including this fault handling (to prevent located faults from Joanne Bechta Dugan is a professor piece is to provide a mini tutorial for readers being activated again). Fault handling in- of electrical and engineering at the University of Virginia. Her re- who are unfamiliar with the field of software volves four steps: fault diagnosis, fault isola- search interests include hardware and fault tolerance or who might have had more tion, system reconfiguration, and system reini- software , fault- exposure to hardware fault tolerance. Eleven tialization. Using sufficient might tolerant computing, and mathematical modeling using dynamic fault trees, questions were posed to each member of the allow recovery without explicit error detec- Markov models, Petri nets, and simula- panel over email, and here we present their tion. This form of recovery is called fault tion. Contact her at [email protected]. responses. masking. Fault tolerance can also be imple- Les Hatton’s biography appears on —Jeffrey Voas mented preemptively and preventively—for page 39. example, in the so-called software rejuvena- tion, aimed at preventing accrued error condi- Karama Kanoun’s biography appears on page 33. tions to lead to failure. Mladen Vouk: A principal way of intro- Jean-Claude Laprie is a research ducing fault tolerance into a system is to director at CNRS, the French National Organization for Scientific Research. provide a method to dynamically determine He is Director of LAAS-CNRS, where he if the system is behaving as it should—that founded and previously directed the is, you introduce a self-checking or “oracle” research group on Fault Tolerance and What are the basic principles of building fault- Dependable Computing. He also capability. If the method detects unexpected founded and previously directed the tolerant systems? and unwanted behavior, a fault-tolerant sys- Laboratory for Dependability Engi- Joanne Bechta Dugan: To and build tem must provide the means to recover or neering, a joint academia–industry laboratory. His research interests have a fault-tolerant system, you must understand continue operation (preferably, from the focused on fault tolerance, depend- how the system should work, how it might user’s perspective, in a seamless manner). ability evaluation, and the formulation fail, and what kinds of errors can occur. Error of the basic concepts of dependability. What is the difference between hardware and Contact him at [email protected]. detection is an essential component of fault tolerance. That is, if you know an error has software fault tolerance? Mladen A. Vouk is a professor of occurred, you might be able to tolerate it—by Joanne: The real question is what’s the dif- computer science at the N.C. State Uni- versity, Raleigh, North Carolina. His re- replacing the offending component, using an ference between design faults (usually soft- search and development interests in- alternative means of computation, or raising ware) and physical faults (usually hardware). clude software engineering, scientific an exception. However, you want to avoid We can tolerate physical faults in redundant computing, computer-based education, and high-speed networks. He received adding unnecessary complexity to enable fault (spare) copies of a component that are identi- his PhD from the King’s College, Univer- tolerance because that complexity could result cal to the original, but we can’t generally tol- sity of London, UK. He is an IEEE Fellow in a less reliable system. erate design faults in this way because the er- and a member of the IEEE Reliability, Communications, Computer and Educa- Les Hatton: The basic principles depend, ror is likely to recur on the spare component tion Societies, and of the IEEE TC on to a certain extent, on whether you’re de- if it is identical to the original. However, the Software Engineering, ACM, ASQC, and signing hardware or software. Classic tech- distinction between design and physical faults Sigma Xi. Contact him at vouk@csc. ncsu.edu. niques such as independence work in differ- is not so easily drawn. A large class of errors

54 IEEE SOFTWARE July/August 2001 0740-7459/01/$10.00 © 2001 IEEE ROUNDTABLE arise from design faults that do not re- aware. What I mean is that, to the ex- detection, but the response to errors cur because the system’s state might tent a software system can check its is what differentiates the two ap- slightly differ when the computation is responses before activating any physi- proaches. Fault tolerance implies that retried. cal components, a mechanism for im- the software system can recover from Les: From a design point of view, proving error detection, fault toler- —or in some way tolerate—the error the basic principles of thinking about ance, and safety exists. Think of the and continue correct operation. Safety system behavior as a whole are the adage, “Measure twice, cut once.” implies that the system either contin- same, but again, we need to consider The software analogy might be, ues correct operation or fails in a safe what we’re designing—software or “Compute twice, activate once.” manner. A safe failure is an inability to hardware. I suspect the major differ- Les: I’m not sure there is a key tolerate the fault. So, we can have low ence is cost. Software fault tolerance technology. The behavior of the sys- fault tolerance and high safety by is usually a lot more expensive. tem as a whole and the hardware and safely shutting down a system in re- Karama and Jean-Claude: A widely software interaction must be thought sponse to every detected error. used approach to fault tolerance is to through very carefully. Efforts to sup- Les: It’s certainly not a simple re- perform multiple computations in port software fault tolerance, such as lationship. Software fault tolerance is multiple channels, either sequentially in languages, are related to reliability, and a system or concurrently. To tolerate hardware rather crude, and education on how can certainly be reliable and unsafe physical faults, the channels might be to use them is not well established in or unreliable and safe as well as the identical, based on the assumption universities. more usual combinations. Safety is that hardware components fail inde- Karama and Jean-Claude: We can intimately associated with the sys- pendently. Such an approach has use three key technologies—design di- tem’s capacity to do harm. Fault tol- proven to be adequate for soft faults— versity, checkpointing, and exception erance is a very different property. that is, software or hardware faults handling—for software fault toler- Karama and Jean-Claude: Fault whose activation is not systematically ance, depending on whether the cur- tolerance is—together with fault pre- reproducible—through rollback. Roll- rent task should be continued or can vention, fault removal, and fault fore- back involves returning the system be lost while avoiding error propaga- casting—a means for ensuring that the back to a saved state (a checkpoint) tion (ensuring error containment and system function is implemented so that existed prior to error detection. thus avoiding total system failure). that the dependability attributes, Tolerating solid design faults (hard- Tolerating solid software faults for which include safety and availability, ware or software) necessitates that the task continuity requires diversity, satisfy the users’ expectations and re- channels implement the same function while checkpointing tolerates soft quirements. Safety involves the notion through separate and imple- software faults for task continuity. Ex- of controlled failures: if the system mentations—that is, through design ception handling avoids system failure fails, the failure should have no cata- diversity. at the expense of current task loss. strophic consequence—that is, the Mladen: Hardware fault tolerance, Mladen: Runtime failure detection system should be fail-safe. Controlling for the most part, deals with random is often accomplished through either failures always include some forms of failures that result from hardware de- an acceptance test or comparison of fault tolerance—from error detection fects occurring during system opera- results from a combination of “differ- and halting to complete system recov- tion. Software fault tolerance contends ent” but functionally equivalent sys- ery after component failure. The sys- with random (and sometimes not-so- tem alternates, components, versions, tem function and environment dictate, random) invocations of software paths or variants. However, other tech- through the requirements in terms of (or path and state or environment com- niques—ranging from mathematical service continuity, the extent of fault binations) that usually activate soft- consistency checking to error coding tolerance required. ware design and implementation de- to data diversity—are also useful. Mladen: You can have a safe sys- fects. These defects can then lead to There are many options for effective tem that has little fault tolerance in it. system failures. system recovery after a problem has When the system specifications prop- been detected. They range from com- erly and adequately define safety, then What key technologies make software plete rejuvenation (for example, stop- a well-designed fault-tolerant system fault-tolerant? ping with a full data and software re- will also be safe. However, you can Joanne: Software involves a sys- load and then restarting) to dynamic also have a system that is highly fault- tem’s conceptual model, which is eas- forward error correction to partial tolerant but that can fail in an unsafe ier than a physical model to engineer state rollback and restart. way. Hence, fault tolerance and safety to test for things that violate basic are not synonymous. Safety is con- concepts. To the extent that a soft- What is the relationship between cerned with failures (of any nature) ware system can evaluate its own per- software fault tolerance and software that can harm the user; fault tolerance formance and correctness, it can be safety? is primarily concerned with runtime made fault-tolerant—or at least error- Joanne: Both require good error prevention of failures in any shape or

July/August 2001 IEEE SOFTWARE 55 ROUNDTABLE form (including prevention of safety- puter Soc. Press, 1992, pp. 5–17.) How can a company deciding whether critical failures). A fault-tolerant and to add fault tolerance to a system safe system will minimize overall fail- Much research has been done in this determine whether the return on ures and ensure that when a failure area, but it seems little has made it investment is sufficient for the addi- occurs, it is a safe failure. into practice. Is this true? tional costs? Joanne: I think it’s a psychological Joanne: I think it all hinges on the What are the four most seminal issue. In general, I don’t think that soft- cost of failure. The more expensive papers or books on this topic? ware engineers are trained to consider the failure (in terms of actual cost, Joanne: Important references on failures as well as other engineers. Soft- reputation, human life, environmen- my desk include Michael Lyu’s Soft- ware designers like to think about neat tal issues, and other less tangible ware Fault Tolerance (John Wiley, ways to make a system work—they measures), the more it’s worth the ef- 1995), Nancy Leveson’s Safeware: aren’t trained to think about errors and fort to prevent it. System Safety and —es- failures. Remember that most software Les: This is very difficult. Determin- pecially for the case studies in the faults are design faults, traceable to hu- ing the cost of software failure in actu- appendix (Addison-Wesley, 1995), man error. If I have a finite amount of arial terms given the basically chaotic Debra Herrmann’s Software Safety resources to expend on a project, it’s nature of such failure is just about im- and Reliability (IEEE Press, 2000), hard to make a case for fault tolerance possible at the current level of knowl- and Henry Petroski’s To Engineer is (which usually implies redundancy) edge—the variance on the estimates is Human: The Role of Failure in Suc- rather than just spending more time to usually ridiculous. ROI must include a cessful Design (Vintage Books, get the one version right. knowledge of risk. Risk is when you 1992). Les: Yes, I agree that little has made are not sure what will happen but you Les: The literature is pretty packed, it into practice. Most industries are in know the odds. Uncertainty is when although some of Nancy Leveson’s too much of a rush to do it properly— you don’t know either. We generally contributions are right up there. Be- if at all—with reduced time-to-market have uncertainty. On the other hand, I yond that, I usually read books on fail- so prevalent, and software engineers am inclined to believe that the cost of ure tolerance and safety in conven- are not normally trained to do it well. failure in consumer embedded systems tional engineering. There is also a widespread and wholly is so high that all such systems should Karama and Jean-Claude: B. Ran- mistaken belief that software is either incorporate fault-tolerant techniques. dell, “System Structure for Software right or wrong, and testing proves that Karama and Jean-Claude: It is Fault Tolerance,” IEEE Trans. Soft- it is “right.” misleading to consider the ROI as the ware Eng., vol. SE-1, no. 10, 1975, pp. Karama and Jean-Claude: This only criterion for deciding whether 1220–1232. A. Avizienis and L. Chen, could be true for design diversity, as it to add fault tolerance. For some ap- “On the implementation of N-version has been used mainly for safety-critical plication domains, the cost of a sys- Programming for Software Fault Toler- systems only. Exception handling is tem outage is a driver that is more ance During Execution,” Proc. IEEE used in most systems: telecommunica- important than the additional devel- COMPSAC 77, 1977, pp. 149–155. tions, commercial servers, and so forth. opment cost due to fault tolerance. Software Fault Tolerance, M.R. Lyu, Mladen: I think we’ve made a lot Indeed, the cost of system outage is ed., John Willey, 1995. J.C. Laprie et of progress—we now know how to usually the determining factor, be- al., “Definition and Analysis of Hard- make systems that are quite fault- cause it includes all sources of loss of ware- and Software-Fault-Tolerant Ar- tolerant. Most of the time, it is not an profit and income to the company. chitectures,” Computer, vol. 23, no. 7, issue of technology (although many Mladen: There are several fault- July 1990, pp. 39–51. research issues exist) but an issue of tolerance cost models we can use to Mladen: I agree that M. Lyu’s Soft- cost and schedules. Economic and answer this question. The first step is ware Fault-Tolerance and B. Randell’s business models dictate cost-effective to create a risk-based operational “System Structure for Software Fault- and timely solutions. Available fault- profile for the system and decide Tolerance” are important. Also, A. tolerant technology, depending on which system elements need protec- Avizienis, “The N-Version Approach the level of confidence you desire, tion. Risk analysis will yield the to Fault-Tolerant Software,” IEEE might offer unacceptably costly or problem occurrence to cost-to-bene- Trans. Software Eng., vol. SE-11, no. time-consuming solutions. However, fit results, which we can then use to 12, 1085, pp. 1491–1501, and J.C. most of our daily critical reliability make appropriate decisions. Laprie et al., “Definition and Analysis expectations are being met by fault- of Hardware- and Software-Fault-Tol- tolerant systems—for example, 911 What key standards, government erant ,” Computer, vol. services and aircraft flight systems— agencies, or standards bodies require 23, no. 7, July 1990, pp. 39–51. so it is possible to strike a successful fault-tolerant systems? (Reprinted in Fault-Tolerant Software balance between economic and tech- Joanne: I’d be surprised if there Systems: Techniques and Applica- nical models. are any standards that require fault tions, Hoang Pham, ed., IEEE Com- tolerance (at least for design faults)

56 IEEE SOFTWARE July/August 2001 ROUNDTABLE per se. I would expect that correct strating that something has been Stop) tolerate soft software faults operation is required, which might achieved in classic terms means com- through checkpointing. imply fault tolerance as an internal paring the behavior with and without Mladen: Well-known projects in- means to achieve correct operation. the additional tolerance-inducing tech- clude the French train systems, Boe- Les: Quite a few standards have niques. We don’t do experiments like ing and Airbus fly-by-wire aircraft, something to say about this—for ex- this in software engineering, and our military fly-by-wire, and NASA space ample, IEC 61508. However, IEC prediction systems are often very shuttles. Several proceedings and 61508 also has a lot to say about a lot crude. There are some things you can books, sponsored by IFIP WG 10.4 of other things. I do not find the prolif- do, but comparing software technolo- and the IEEE CS Technical Commit- eration of complex standards particu- gies in general is about as easy as com- tee on Fault Tolerant Computing, de- larly helpful, although they are usually paring supermarkets. scribe older fault-tolerance projects well meaning. There is too much opin- Karama and Jean-Claude: As for (see www.dependability.org). ion and not enough experimentation. any system, analysis and testing are Karama and Jean-Claude: Several mandatory. However, fault injection Is much progress being made into standards for safety-critical applica- constitutes a very efficient technique fault-tolerant software? tions recommend fault tolerance—for for testing fault-tolerance mechanisms: Les: I don’t think there is much new hardware as well as for software. For well-selected sets of faults are injected in the theory. Endless and largely point- example, the IEC 61508 standard in the system and the reaction of fault- less technological turnover in software (which is generic and application sec- tolerance mechanisms is observed. engineering just makes it harder to do tor independent) recommends among They can thus be improved. The main the right things in practice. With a new other techniques: “failure assertion problem remains the representative- operating system environment emerg- programming, safety bag technique, ness of the injected faults. For design ing about every two years on average, diverse programming, backward and diversity, the implementation of the a different programming language forward recovery.” Also, the Defense specifications by different teams has emerging about every three years, and standard (MOD 00-55), the avionics proven to be efficient for detecting a new paradigm about every five years, standard (DO-178B), and the stan- specifications faults. Back-to-back it’s a wonder we get anywhere at all. dard for space projects (ECSS-Q-40- testing has also proven to be efficient Karama and Jean-Claude: To the A) list design diversity as possible in detecting some difficult faults that best of our knowledge, the main con- means for improving safety. no other method can detect. cepts and means for software fault Mladen: Usually, the requirement is Mladen: You know you’ve achieved tolerance have been defined for a few not so much for fault tolerance (by it- fault tolerance through a combination years. However, some refinements of self) as it is for , relia- of good requirements specifications, these concepts are still being done bility, and safety. Hence, IEEE, FAA, realistic quantitative availability, relia- and implemented in practice, mainly FCC, DOE, and other standards and bility and safety parameters, thorough for distributed systems such as fault- regulations appropriate for reliable design analysis, formal methods, and tolerant communication protocols. computer-based systems apply. We can actual testing. Mladen: The basic ideas of how to achieve high availability, reliability, and provide fault tolerance have been safety in different ways. They involve a What well-known projects contain known for decades. Specifics on how proper reliable and safe design, proper fault-tolerant software? to do that cost effectively in modern safeguards, and proper implementa- Les: Certainly some avionics sys- systems with current and future tech- tion. Fault tolerance is just one of the tems and some automobile systems, nologies (wireless, personal devices, techniques that assures that a system’s but the techniques are not usually nanodevices, and so forth) require quality of service (in a broader sense) widely publicized so I’m not sure that considerable research as well as inge- meets user needs (such as high safety). many would be well known. Ariane nious engineering solutions. For exam- 5 is a good, well-documented exam- ple, our current expectations for 911 How do you demonstrate that fault ple of a system with hardware toler- services require telephone switch avail- tolerance is achieved? ance but no software tolerance. abilities that are in the range of 5 to 6 Joanne: This is a difficult question Karama and Jean-Claude: Airbus nines (0.999999). These expectations to answer. If we can precisely de- A-320 and successors use design di- are likely to start extending to more scribe the classes of faults we must versity for software fault tolerance in mundane things such as personal com- tolerate, then a collection of tech- the flight control system. Boeing B- puting and networking applications, niques to demonstrate fault tolerance 777 uses hardware diversity and car automation (such as GIS locators), (including simulation, modeling, test- compiler diversity. The Elektra Aus- and, perhaps, space tourism. Almost ing, fault injection, formal analysis, trian railway signaling system uses none of these would measure up to- and so forth) can be used to build a design diversity for hardware and day, and many will require new fault- credible case for fault tolerance. software fault tolerance. Tandem tolerance approaches that are far from Les: I honestly don’t know. Demon- ServerNet and predecessors (Non- standard textbook solutions.

July/August 2001 IEEE SOFTWARE 57