Fault Tolerance

roundtable Fault Tolerance Jeffrey Voas his is a “virtual” roundtable discussion ent ways and with differing success for between five respected experts from hardware as compared to software. the fault-tolerant computing field: Karama Kanoun and Jean-Claude Laprie: Joanne Bechta Dugan (UVA), Les Hat- Fault tolerance is generally implemented by ton (Oakwood Computing), Karama error detection and subsequent system recov- T Kanoun and Jean-Claude Laprie ery. Recovery consists of error handling (to Particpants (LAAS-CNRS), and Mladen Vouk (NC State eliminate errors from the system state) and University). The reason for including this fault handling (to prevent located faults from Joanne Bechta Dugan is a professor piece is to provide a mini tutorial for readers being activated again). Fault handling in- of electrical and computer engineering at the University of Virginia. Her re- who are unfamiliar with the field of software volves four steps: fault diagnosis, fault isola- search interests include hardware and fault tolerance or who might have had more tion, system reconfiguration, and system reini- software reliability engineering, fault- exposure to hardware fault tolerance. Eleven tialization. Using sufficient redundancy might tolerant computing, and mathematical modeling using dynamic fault trees, questions were posed to each member of the allow recovery without explicit error detec- Markov models, Petri nets, and simula- panel over email, and here we present their tion. This form of recovery is called fault tion. Contact her at [email protected]. responses. masking. Fault tolerance can also be imple- Les Hatton’s biography appears on —Jeffrey Voas mented preemptively and preventively—for page 39. example, in the so-called software rejuvena- tion, aimed at preventing accrued error condi- Karama Kanoun’s biography appears on page 33. tions to lead to failure. Mladen Vouk: A principal way of intro- Jean-Claude Laprie is a research ducing fault tolerance into a system is to director at CNRS, the French National Organization for Scientific Research. provide a method to dynamically determine He is Director of LAAS-CNRS, where he if the system is behaving as it should—that founded and previously directed the is, you introduce a self-checking or “oracle” research group on Fault Tolerance and What are the basic principles of building fault- Dependable Computing. He also capability. If the method detects unexpected founded and previously directed the tolerant systems? and unwanted behavior, a fault-tolerant sys- Laboratory for Dependability Engi- Joanne Bechta Dugan: To design and build tem must provide the means to recover or neering, a joint academia–industry laboratory. His research interests have a fault-tolerant system, you must understand continue operation (preferably, from the focused on fault tolerance, depend- how the system should work, how it might user’s perspective, in a seamless manner). ability evaluation, and the formulation fail, and what kinds of errors can occur. Error of the basic concepts of dependability. What is the difference between hardware and Contact him at [email protected]. detection is an essential component of fault tolerance. That is, if you know an error has software fault tolerance? Mladen A. Vouk is a professor of occurred, you might be able to tolerate it—by Joanne: The real question is what’s the dif- computer science at the N.C. State Uni- versity, Raleigh, North Carolina. His re- replacing the offending component, using an ference between design faults (usually soft- search and development interests in- alternative means of computation, or raising ware) and physical faults (usually hardware). clude software engineering, scientific an exception. However, you want to avoid We can tolerate physical faults in redundant computing, computer-based education, and high-speed networks. He received adding unnecessary complexity to enable fault (spare) copies of a component that are identi- his PhD from the King’s College, Univer- tolerance because that complexity could result cal to the original, but we can’t generally tol- sity of London, UK. He is an IEEE Fellow in a less reliable system. erate design faults in this way because the er- and a member of the IEEE Reliability, Communications, Computer and Educa- Les Hatton: The basic principles depend, ror is likely to recur on the spare component tion Societies, and of the IEEE TC on to a certain extent, on whether you’re de- if it is identical to the original. However, the Software Engineering, ACM, ASQC, and signing hardware or software. Classic tech- distinction between design and physical faults Sigma Xi. Contact him at vouk@csc. ncsu.edu. niques such as independence work in differ- is not so easily drawn. A large class of errors 54 IEEE SOFTWARE July/August 2001 0740-7459/01/$10.00 © 2001 IEEE ROUNDTABLE arise from design faults that do not re- aware. What I mean is that, to the ex- detection, but the response to errors cur because the system’s state might tent a software system can check its is what differentiates the two ap- slightly differ when the computation is responses before activating any physi- proaches. Fault tolerance implies that retried. cal components, a mechanism for im- the software system can recover from Les: From a design point of view, proving error detection, fault toler- —or in some way tolerate—the error the basic principles of thinking about ance, and safety exists. Think of the and continue correct operation. Safety system behavior as a whole are the adage, “Measure twice, cut once.” implies that the system either contin- same, but again, we need to consider The software analogy might be, ues correct operation or fails in a safe what we’re designing—software or “Compute twice, activate once.” manner. A safe failure is an inability to hardware. I suspect the major differ- Les: I’m not sure there is a key tolerate the fault. So, we can have low ence is cost. Software fault tolerance technology. The behavior of the sys- fault tolerance and high safety by is usually a lot more expensive. tem as a whole and the hardware and safely shutting down a system in re- Karama and Jean-Claude: A widely software interaction must be thought sponse to every detected error. used approach to fault tolerance is to through very carefully. Efforts to sup- Les: It’s certainly not a simple re- perform multiple computations in port software fault tolerance, such as lationship. Software fault tolerance is multiple channels, either sequentially exception handling in languages, are related to reliability, and a system or concurrently. To tolerate hardware rather crude, and education on how can certainly be reliable and unsafe physical faults, the channels might be to use them is not well established in or unreliable and safe as well as the identical, based on the assumption universities. more usual combinations. Safety is that hardware components fail inde- Karama and Jean-Claude: We can intimately associated with the sys- pendently. Such an approach has use three key technologies—design di- tem’s capacity to do harm. Fault tol- proven to be adequate for soft faults— versity, checkpointing, and exception erance is a very different property. that is, software or hardware faults handling—for software fault toler- Karama and Jean-Claude: Fault whose activation is not systematically ance, depending on whether the cur- tolerance is—together with fault pre- reproducible—through rollback. Roll- rent task should be continued or can vention, fault removal, and fault fore- back involves returning the system be lost while avoiding error propaga- casting—a means for ensuring that the back to a saved state (a checkpoint) tion (ensuring error containment and system function is implemented so that existed prior to error detection. thus avoiding total system failure). that the dependability attributes, Tolerating solid design faults (hard- Tolerating solid software faults for which include safety and availability, ware or software) necessitates that the task continuity requires diversity, satisfy the users’ expectations and re- channels implement the same function while checkpointing tolerates soft quirements. Safety involves the notion through separate designs and imple- software faults for task continuity. Ex- of controlled failures: if the system mentations—that is, through design ception handling avoids system failure fails, the failure should have no cata- diversity. at the expense of current task loss. strophic consequence—that is, the Mladen: Hardware fault tolerance, Mladen: Runtime failure detection system should be fail-safe. Controlling for the most part, deals with random is often accomplished through either failures always include some forms of failures that result from hardware de- an acceptance test or comparison of fault tolerance—from error detection fects occurring during system opera- results from a combination of “differ- and halting to complete system recov- tion. Software fault tolerance contends ent” but functionally equivalent sys- ery after component failure. The sys- with random (and sometimes not-so- tem alternates, components, versions, tem function and environment dictate, random) invocations of software paths or variants. However, other tech- through the requirements in terms of (or path and state or environment com- niques—ranging from mathematical service continuity, the extent of fault binations) that usually activate soft- consistency checking to error coding tolerance required. ware design and implementation de- to data diversity—are

Fault Tolerance

The Google File System (GFS)

Introspection-Based Verification and Validation

Fault-Tolerant Components on AWS

Network Reliability and Fault Tolerance

Fault Tolerance

Detecting and Tolerating Byzantine Faults in Database Systems Benjamin Mead Vandiver

Fault Tolerance Techniques for Scalable Computing∗

Fault-Tolerant Operating System for Many-Core Processors

Fault Tolerance in Tandem Computer Systems

Reliability and Fault-Tolerance by Choreographic Design∗

Milesight-Troubleshooting RAID

Scalable Design of Fault-Tolerance for Wireless Sensor Networks