
NASA / TM-2000-210616


Software Fault Tolerance: A Tutorial

Wilfredo Torres-Pomales Langley Research Center, Hampton, Virginia

National Aeronautics and Space Administration

Langley Research Center Hampton, Virginia 23681-2199

October 2000

Available from:

NASA Center for AeroSpace Information (CASI)
7121 Standard Drive
Hanover, MD 21076-1320
(301) 621-0390

National Technical Information Service (NTIS)
5285 Port Royal Road
Springfield, VA 22161-2171
(703) 605-6000

Abstract

Because of our present inability to produce error-free software, software fault tolerance is and will continue to be an important consideration in software systems. The root cause of software errors is the complexity of the systems. Compounding the problems in building correct software is the difficulty in assessing the correctness of software for highly complex systems. This paper presents a review of software fault tolerance. After a brief overview of the software development processes, we note how hard-to-detect design faults are likely to be introduced during development and how software faults tend to be state-dependent and activated by particular input sequences. Although component reliability is an important quality measure for system level analysis, software reliability is hard to characterize and the use of post-verification reliability estimates remains a controversial issue. For some applications software safety is more important than reliability, and fault tolerance techniques used in those applications are aimed at preventing catastrophes. Single version software fault tolerance techniques discussed include system structuring and closure, atomic actions, inline fault detection, exception handling, and others. Multiversion techniques are based on the assumption that software built differently should fail differently and thus, if one of the redundant versions fails, at least one of the others should provide an acceptable output. Recovery blocks, N-version programming, N self-checking programming, consensus recovery blocks, and t/(n-1)-variant programming techniques are reviewed. Current research in software engineering focuses on establishing patterns in the software structure and trying to understand the practice of software engineering. It is expected that software fault tolerance research will benefit from this research by enabling greater predictability of the dependability of software.

Acknowledgements

I would like to acknowledge my gratitude to Dr. F. Gail Gray of the Virginia Polytechnic Institute and State University (VA Tech) for providing advice during the preparation and reviewing of this report. I also wish to acknowledge my appreciation to Mr. Ricky Butler of the NASA Langley Research Center for his help in reviewing the final version of this report.

Table of Contents

1. INTRODUCTION
2. SOFTWARE DEVELOPMENT
3. SOFTWARE DESIGN FAULTS
4. SOFTWARE FAULT TOLERANCE
4.1. Single-Version Software Fault Tolerance Techniques
4.1.1. Software Structure and Actions
4.1.2. Error Detection
4.1.3. Exception Handling
4.1.4. Checkpoint and Restart
4.1.5. Process Pairs
4.1.6. Data Diversity
4.1.7. Considerations on the use of Checkpointing
4.2. Multi-Version Software Fault Tolerance Techniques
4.2.1. Recovery Blocks
4.2.2. N-Version Programming
4.2.3. N Self-Checking Programming
4.2.4. Consensus Recovery Blocks
4.2.6. t/(n-1)-Variant Programming
4.2.7. Additional Considerations
4.2.7.1. Multi-Version Software Development
4.2.7.2. Output Selection Algorithms
4.3. Fault Tolerance in Operating Systems
4.4. Software Fault Injection for Fault Tolerance Assessment
5. HARDWARE AND SOFTWARE FAULT TOLERANCE
5.1. Fault Tolerance
5.2. Examples of Fault Tolerant Systems
5.2.1. B777 Primary Flight Control Computer
5.2.2. AIRBUS A320/A330/A340 Flight Control Computer
6. SUMMARY AND CONCLUDING REMARKS
7. BIBLIOGRAPHY

List of Figures

Figure 1: Linear Sequential Process Model
Figure 2: Prototyping Process Model
Figure 3: Logical Representation of Checkpoint and Restart
Figure 4: Logical Representation of Process Pairs
Figure 5: Checkpoint and Restart using Data Diversity (with Input Re-Expression Model)
Figure 6: Data Diversity using Input Data Re-Expression
Figure 7: Data Diversity using Input Re-expression with Post-Execution Adjustment
Figure 8: Data Diversity using Re-expression via Decomposition and Recombination
Figure 9: Recovery Block Model
Figure 10: N-Version Programming Model
Figure 11: N Self-Checking Programming using Acceptance Tests
Figure 12: N Self-Checking Programming using Comparison
Figure 13: Consensus Recovery Blocks
Figure 14: Example of Passive Redundancy
Figure 15: Passive Redundancy with Input Voting
Figure 16: Dynamic Redundancy using Duplication with Comparison
Figure 17: Dynamic Redundancy using Self-Checking Pairs
Figure 18: Hybrid Redundancy using N-Modular Redundancy with Spares
Figure 19: Abstract Representation of a Fly-By-Wire Flight System
Figure 20: Architecture of B777 Flight Control Computer
Figure 21: Architecture of A3XX Flight Control Computer

List of Tables

Table 1: Fault Classification
Table 2: Fault Tolerant System Design Paradigm

1. Introduction

Software permeates every aspect of modern society. Government, transportation, manufacturing, utilities, and almost every other sector that influences our way of life is affected directly or indirectly by software systems. The flexibility provided by software-controlled systems, the insatiable appetite of society for new and better products, and competition for business drive the continued expansion of the domain ruled by software systems. Without software, many of our modern conveniences would be virtually impossible.

Despite its widespread use, software is hardly ever "perfect". For a myriad of reasons, it is extremely difficult to produce a flawless piece of software. According to [Lyu 95], "software is a systematic representation and processing of human knowledge". For humans, perfect knowledge of a problem and its solution is rarely achieved. [Abbott 90] states that "programs are really not much more than the programmer's best guess about what a system should do". Even if a programmer had sufficient knowledge to solve a problem, that knowledge must be transformed into a systematic representation adequate for automatic processing. Our computers today are merciless when it comes to processing software: if there is an error in the logic, sooner or later that error will show up in the output independently of the consequences. Only the most trivial software problems can be solved without some trial and error. As computers are applied to solve more complex problems, the probability of logic errors being present in the software grows.

F. P. Brooks [Brooks 87] conjectured that the hard part about building software is not so much the representing of the solution to a problem in a particular computer language, but rather what he called the "essence" of a software entity. This essence is the algorithms, data structures, functions, and their interrelationships. Specification, design, and testing of this conceptual construct is the "hard part" of software engineering. This is not to say that capturing the software description in a textual or graphical manner is not difficult in itself; it certainly is. To Brooks, the labor intensiveness associated with software is really more of an "accidental" difficulty. Brooks enumerates four inherent properties that make software hard: complexity, conformity, changeability, and invisibility. Software is complex because of the extremely large number of states present in a design and the nonlinear interactions among these states. Software is forced to conform because it is perceived as the most conformable of all the components in a system. Software design complexity often follows from requirements to accommodate interfaces designed with no apparent consideration for homogeneity and ease of use. Also, it is often left to the software to handle deficiencies and incompatibilities among other system components. Software changes continuously because it is extremely malleable. As such, new and revised system functionality is often implemented through software changes. Even if the software is not the direct target of a change in system functionality, the software is forced to change to accommodate changes in other system components. Lastly, software is invisible. We use computer languages to try to capture the essence of software, but the concepts are so intricate that they generally defy attempts to completely visualize them in a practical manner and require the use of techniques to simplify relationships and enable communication among designers.

Software engineering is the discipline concerned with the establishment and application of sound engineering practices to the development of reliable and efficient software. The IEEE defines software engineering as "the application of a systematic, disciplined, quantifiable approach to the development, operation, and maintenance of software; that is, the application of engineering to software" [IEEE 93]. This discipline has been around for more than forty years, and in that time software engineering practices have made possible significant accomplishments. Textbooks exist on the subject (e.g., [Pressman 97]) and guidelines for the development of software abound (e.g., [Mazza 96]). We will review some high-level concepts of the design and verification of software from the perspective of realizing what is involved in a complete and disciplined development effort.

Because absolute certainty of design correctness is rarely achieved, software fault tolerance techniques are sometimes employed to meet design dependability requirements. Software fault tolerance refers to the use of techniques to increase the likelihood that the final design embodiment will produce correct and/or safe outputs. Since correctness and safety are really system level concepts, the need for software fault tolerance, and the degree to which it is used, depend directly on the intended application and the overall system design.

This paper reviews the concepts of software fault tolerance. Our aim is to survey the literature and present the material as an introduction to the field. We emphasize breadth and variety of the concepts to serve as a starting point for those interested in research and as a tutorial for people wanting some exposure to the field. The next section is an overview of the software development process.

2. Software Development

The goal here is to present some of the ideas in software engineering. The information in this section is based on [Pressman 97] and [DO178B]. The reader should consult those and other references for a more detailed and precise treatment of the subject.

The software life cycle is composed of three types of processes: planning, development, and supporting processes. The planning process is the first step in the cycle and its goal is to define the activities of the development and supporting processes, including their interactions and sequencing. The data produced by the processes and the design development environments, with definitions of the methods and tools to be used, are also determined during the planning process. The planning process scopes the complexity of the software and estimates the resources needed in the development activities. Software development standards and methods are selected during the planning activity. Standards exist for all the processes in the software life cycle and are used as part of the quality assurance strategy. For better performance in terms of development time and quality of the final product, the plans should be generated at a point in time that provides timely direction to those involved in the life cycle process. The planning process should provide mechanisms for further refinement of the plans as the project advances.

The strategy used to develop the software, also known as the process model or software engineering paradigm, is chosen based on the particular characteristics of the application to be developed. Further consideration in selecting a process model is given to project elements like the maturity level of the developers and the organization, tools to be used, existent process control, and the expected deliverables. Various process models have been proposed in the software engineering literature [Pressman 97]. The Linear Sequential Model is the most basic and straightforward process model (see Figure 1). Following the system engineering analysis where requirements are allocated to each element of a full system, the software development process goes through the steps of analyzing its allocated requirements and categorizing them in terms of functional, performance, interface and safety-related requirements. After this phase is complete, the high level design is built and the code is generated. Testing is then performed on the final coded version.


Figure 1: Linear Sequential Process Model

Figure 2 presents the Prototyping Process Model. This process model is appropriate for projects where the requirements are incompletely specified or when the developers are unsure whether a proposed design solution is adequate. The process begins with a requirements capture activity, followed by a quick design and build of a prototype or mock-up of the product. After analyzing the prototype, further refinements to the requirements are generated and the process begins again. This cyclic activity not only helps develop the requirements, but it also helps the developers better understand the problem.


Figure 2: Prototyping Process Model

Other process models have been proposed. The Rapid Application Development (RAD) process model uses multiple teams of developers working simultaneously on different modules of an application, with all the teams following what is basically the Linear Sequential Model to develop their corresponding module. The Incremental Model is an evolutionary process model that combines the Linear Sequential Model with prototyping activity in an iterative fashion; after each iteration the result is an incremental improvement on the software product. The Spiral Model, the Component Assembly Model, and the Concurrent Development Model are other evolutionary process models.

Regardless of the process model chosen, actual development of the software has four main processes: requirements capture, design, coding, and integration. The high-level software requirements are developed during the requirements capture process. These requirements include functional, performance, interface and safety requirements derived from the system level analysis and development. The capturing of requirements is usually an iterative process, with corrections being added to rectify omissions, ambiguities, and inconsistencies found during the other processes. Further corrections and additions to the requirements can also originate from changing system-level requirements and design.

The design process produces the software architecture and corresponding lower level requirements for the coding process. The architectural design is a modular hierarchical decomposition of the software, including the control relationships and data flow between the modules. Complementary to the architectural design, this process also includes activities for performing the data design, interface design and procedural design. The data design is the selection of the data structures to be used and the identification of the program modules that operate directly on these structures. The interface design considers the interactions between software modules, between the software and other non-human external entities, and between the computer and human operators. A general guideline applicable to all software is the implementation of data validation and error handling capability within each structural module in order to control the propagation of side effects associated with the processing of erroneous data. Procedural design is the selection of the algorithms to be used by the software components. The design process should allocate requirements to software elements, identify available system resources, their limitations, and the selected management strategies, identify scheduling procedures and inter-processor and/or inter-task communication mechanisms, and select design methods and their implementation. This part of the software development deals with what F. P. Brooks [Brooks 87] called the "essence" of a software entity. In general, errors originating during this phase of development tend to be systemic errors stemming from the inherent difficulty in mastering the complexity of the problem being solved and the proposed design solution.

The coding process develops the source code implementing the software architecture, including the data, interface, and procedural designs together with any other low-level requirements. If the output of the design process is detailed enough, the source code can be generated using automatic code generators. Any errors or inadequacies found during the coding activities generate problem reports to be resolved by changes to the requirements or the design. According to F. P. Brooks [Brooks 87], the difficulty in developing the source code is really "accidental", meaning that the errors originating here tend to be random oversight errors or systemic errors caused by the difficulty of mapping the design onto a particular computer language representation.

The integration process is the phase of development when the source code is linked and transformed into the executable object code to be loaded on the target computer. If the design and coding were done properly, this step should flow smoothly. However, errors in integration do appear often, and any interfacing problem found during this process should generate a problem report to be resolved by the previous development processes.

The supporting processes are three: verification, configuration management, and quality assurance. The purpose of the verification process is to search for and report errors in the software requirements, design, source code, and integration. Verification is composed of three types of activities: reviews, analysis, and testing.

Reviews are qualitative inspections of the software items targeting compliance with requirements, accuracy and consistency of the structural modules and their outputs, compatibility with the target computer, verifiability, and conformance to standards. A review can take several forms: informal meetings to discuss technical problems; formal presentations of the design to customers, management, and the technical staff; and formal technical reviews or walkthroughs.

Verification through analysis uses quantitative techniques to examine the software for functionality, performance, and safety in a system context. Analyses check the information domain to see how data and events are processed. Functional analysis is used to check the input-output functionality of the system and its components. Behavioral analysis studies the stimulus-response characteristics, including internal and external events and state sequencing.

Testing is used as a complementary verification activity to provide assurance of correctness and completeness. There are four types of tests performed at each of three levels of the software system hierarchy. Tests should be performed at the low-level structural decomposition modules, at the software integration level where the modules are "wired" to perform some architectural functionality, and at the hardware-software integration level to verify proper operation of the software on the target computer. At each level, normal input range tests, input robustness tests, requirements-based tests, and structural-coverage tests should be performed. Normal input range tests demonstrate the ability of the software to respond properly to normal inputs. These tests should be developed considering internal state transitions, timing, and sequencing. Robustness tests check the response of the software to abnormal inputs in value, timing, or sequencing. Requirements-based testing develops tests for each software requirement, be it derived from system-level requirements, a high-level requirement, or a low-level derived requirement. Structural coverage testing is based on structural coverage analysis that identifies code structures not exercised by the previous tests. Testing techniques like path testing, data flow testing, condition testing, and loop testing are used to develop tests that help in increasing the structural test coverage.

The Configuration Management Process identifies, organizes, and controls modifications to the software being built with the objective of minimizing the level of confusion generated by those changes. The changes include not only changes in the requirements coming from the system level analysis, but also the natural changes that occur in the software design as the project advances and the design items are created. This supporting process includes four main activities: configuration identification, baseline establishment, change control, and archiving of the software product. Configuration identification is the unambiguous labeling of each configuration item and its versions or variants. Baselines are established as milestones in the development and are defined as groups of items that have been formally reviewed and accepted as a basis for further development. Change control is the activity of managing the items under configuration through a formal process to accommodate changes in the requirements or problems found during the development activities. Archiving of the software ensures that the software items remain accessible when they are needed for further development, changes, or future reference.

The Quality Assurance Process is composed of the activities which ensure that the software development organization does "the right thing at the right time in the right way" [Pressman 97]. The basis for quality assurance is the planning activity. In that sense, quality assurance translates to ensuring compliance with the plans and all the processes and activities defined in them.

In spite of what is known about the discipline of software engineering, software often fails. A frequent source of this problem is the use of informal approaches to develop software. [Prowell 99] states: "The vast majority of the software today is handcrafted by artisans using craft-based techniques that cannot produce consistent results. These techniques have little in common with the rigorous, theory-based processes characteristic of other engineering disciplines. As a result, software failure is a common occurrence, often with substantial societal and economic consequences. Many software projects simply collapse under the weight of the unmastered complexity and never result in usable systems at all".

The next section covers the area of software failures, reliability, and safety as a preamble to the techniques of software fault tolerance.

3. Software Design Faults

This section discusses the problem of software faults and provides background for software fault tolerance techniques. Because software does not exist in a physical sense, it cannot degrade in the same manner as physical hardware components do. Errors that could arise in the bit pattern representation of the software (e.g., on a hard disk) are really faults in the media used to store the representation, and they can be dealt with and corrected using standard hardware redundancy techniques. The only type of fault possible in software is a design fault introduced during the software development. Software faults are what we commonly call "bugs". According to [Chou 97], software faults are the root cause in a high percentage of operational system failures. The consequences of these failures depend on the application and the particular characteristics of the faults. The immediate effects can range from minor inconveniences (e.g., having to restart a hung personal computer) to catastrophic events (e.g., software in an aircraft that prevents the pilot from recovering from an input error) [Weinstock 97]. From a business perspective, operational failures caused by software faults can translate into loss of potential customers, lower sales, higher warranty repair costs, and losses due to legal actions from the people affected by the failures.

There are four ways of dealing with software faults: prevention, removal, fault tolerance, and input sequence workarounds. Fault prevention is concerned with the use of design methodologies, techniques, and technologies aimed at preventing the introduction of faults into the design. Fault removal considers the use of techniques like reviews, analyses, and testing to check an implementation and remove any faults thereby exposed. The proper use of software engineering during the development processes is a way of realizing fault prevention and fault removal (i.e., fault avoidance). The use of fault avoidance is the standard approach for dealing with software faults, and the many developments in the software field target the improvement of the fault avoidance techniques. [Chou 97] states that the software development process usually removes most of the deterministic design faults. This type of fault is activated by the inputs independently of the internal state of the software. A large number of the faults in operational software are state-dependent faults activated by particular input sequences.

Given the lack of techniques that can guarantee that complex software designs are free of design faults, fault tolerance is sometimes used as an extra layer of protection. Software fault tolerance is the use of techniques to enable the continued delivery of services at an acceptable level of performance and safety after a design fault becomes active. The selection of particular fault tolerance techniques is based on system level and software design considerations.

The last line of defense against design faults is to use input sequence workarounds. This is nothing more than accepting that a particular software design has faults and taking those faults as "features". This fix is employed by the system operator to work around known faults while still maintaining availability of the system. An example of this is not entering a particular input sequence to which the system has demonstrated susceptibility. Also, if an unknown fault is activated, the operator could try a series of inputs to try to return the system to an acceptable state. The ultimate workaround is to restart the system to recover from a fault. Evidently, this type of system wouldn't be dependable, as defined in [Laprie 92], because the operator cannot rely on proper delivery of services.

From a system perspective, two very important software quality measures are reliability and safety. Reliability is "the probability of failure free operation of a computer program in a specified environment for a specified period of time", where failure free operation in the context of software is interpreted as adherence to its requirements [Pressman 97]. A measure of software reliability is the Mean Time Between Failures (MTBF) [Pressman 97]:

    MTBF = MTTF + MTTR

where MTTF is an acronym for the Mean Time To Failure and MTTR is the Mean Time To Repair. The MTTF is a measure of how long a software item is expected to operate properly before a failure occurs. The MTTR measures the maintainability of the software (i.e., the degree of difficulty in repairing the software after a failure occurs). As mentioned above, software does not degrade with time and its failures are due to the activation of design faults by the input sequences. So, if a fault exists in a piece of software, it will manifest itself the first time the relevant condition occurs [Abbott 90]. What allows the use of reliability as a measure of software quality is the fact that the software is embedded in a stochastic environment that generates input sequences to the software over time [Butler 91]. Some of those inputs will result in software failures. Thus, reliability becomes a weighted measure of correctness, with the weights being dependent on the actual use of the software. Overall, different environments will result in different reliability values [Abbott 90].
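For example, using arbitrary figures, a software item with an MTTF of 400 hours whose failures take an average of 2 hours to diagnose and restart (MTTR = 2 hours) has MTBF = 400 + 2 = 402 hours.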

Much controversy exists on the use of reliability to characterize software. Reliability is an important quality measure of a system. Since software is viewed as one of many system components, system analysts often consider the estimation of software reliability essential in order to estimate the full system reliability [Parnas 90]. To some people, the apparently random behavior of software failures is nothing more than a reflection of our ignorance and lack of understanding of the software [Parnas 90]. Since software does not fail like hardware does, some reliability engineers argue that software is either correct (reliability 1) or incorrect (reliability 0), and in order to get any meaningful system reliability estimates they assume a software reliability of 1 [Parnas 90]. To others, the notion of reliability of software is of no use unless it can help in reducing the total number of errors in a program [Abbott 90]. An interesting implication of the root cause of the stochastic nature of software failures is that a program with a high number of errors is not necessarily less reliable than another with a lower error count [Parnas 90]; the environment is as important as the software itself in determining the reliability.

Perhaps a more critical issue to the use of software reliability is the physical limitations on achieving accurate estimates. The key assumption that enables the design and reliability estimation of highly reliable hardware systems is that the components fail independently [Butler 91]. This does not apply to software, where the results of one component are directly or indirectly dependent on the results of other components, and thus errors in some modules can result in problems in other modules. When it comes to software reliability estimation, the only reasonable approach is to treat the software as a black box [Parnas 90]. This approach is applicable to systems with low to moderate reliability requirements. However, according to [Butler 91], testing of software with a very high target reliability is prohibitively impractical. In addition, [DO178B] states:

"...methodsfor estimatingthe post-verificationprobabilitiesof software errorswereexamined.Thegoalwasto developnumericalrequirementsfor such probabilitiesforsoftwarein computer-basedairbornesystemsorequipment.The conclusionreached,however,was that currentlyavailablemethodsdo not provideresultsin whichconfidencecanbeplacedto thelevelrequiredfor this purpose."

Thus, it seems that much research work remains to be done in the area of software reliability modeling before adequate results can be achieved for complex software with high reliability requirements.

Reliability measures the probability of failure, not the consequences of those failures. Software safety is concerned with the consequences of failures from a global system safety perspective. Leveson [Leveson 95] defines software system safety as "the software will execute within a system context without contributing to hazards". A hazard is defined as "a state or set of conditions of a system (or an object) that, together with other conditions in the environment of the system (or object), will lead inevitably to an accident (loss event)". A system safety design begins by performing modeling and analysis to identify and categorize potential hazards. This is followed by the use of analysis techniques to assign a level of severity and probability of occurrence to the identified hazards. Software is considered as one of the many system components during this analysis. After the analysis is complete, safety related requirements are assigned to the software. [DO178B] states:

"Thegoalof [software]faulttolerancemethodsistoincludesafetyfeaturesin thesoftwaredesignor SourceCode to ensure that the software will respond correctly to input data errors and prevent output and control errors. The need for error prevention or fault tolerance methods is determined by the system requirements and the system safety assessment process."

Thus the function of software fault tolerance is to prevent system accidents (or undesirable events, in general), and mask out faults if possible.

4. Software Fault Tolerance

In this section we present fault tolerance techniques applicable to software. These techniques are divided into two groups [Lyu 95]: single version and multi-version software techniques. Single version techniques focus on improving the fault tolerance of a single piece of software by adding mechanisms into the design targeting the detection, containment, and handling of errors caused by the activation of design faults. Multi-version fault tolerance techniques use multiple versions (or variants) of a piece of software in a structured way to ensure that design faults in one version do not cause system failures. A characteristic of the software fault tolerance techniques is that they can, in principle, be applied at any level in a software system: procedure, process, full application program, or the whole system including the operating system (e.g., [Randell 95]). Also, the techniques can be applied selectively to those components deemed most likely to have design faults due to their complexity [Lyu 95].

4.1.Single-Version Software Fault Tolerance Techniques

Single-version fault tolerance is based on the use of redundancy applied to a single version of a piece of software to detect and recover from faults. Among others, single-version software fault tolerance techniques include considerations on program structure and actions, error detection, exception handling, checkpoint and restart, process pairs, and data diversity [Lyu 95].

4.1.1. Software Structure and Actions

The software architecture provides the basis for implementation of fault tolerance. The use of modularizing techniques to decompose a problem into manageable components is as important to the efficient application of fault tolerance as it is to the design of a system. The modular decomposition of a design should consider built-in protections to keep aberrant component behavior in one module from propagating to other modules. Control hierarchy issues like visibility (i.e., the set of components that may be invoked directly and indirectly by a particular component [Pressman 97]) and connectivity (i.e., the set of components that may be invoked directly or used by a given component [Pressman 97]) should be considered in the context of error propagation for their potential to enable uncontrolled corruption of the system state.

Partitioning is a technique for providing isolation between functionally independent modules [DO178B]. Partitioning can be performed in the horizontal and vertical dimensions of the modular hierarchy of the software architecture [Pressman 97]. Horizontal partitioning separates the major software functions into highly independent structural branches communicating through interfaces to control modules whose function is to coordinate communication and execution of the functions. Vertical partitioning (or factoring) focuses on distributing the control and processing work in a top-down hierarchy, where high level modules tend to focus on control functions and low level modules do most of the processing. Advantages of using partitioning in a design include simplified testing, easier maintenance, and lower propagation of side effects [Pressman 97].

System closure is a fault tolerance principle stating that no action is permissible unless explicitly authorized [Denning 76]. Under the guidance of this principle, no system element is granted any more capability than is needed to perform its function, and any restrictions must be expressly removed before a particular capability can be used. The rationale for system closure is that it is easier (and safer) to handle errors by limiting their chances of propagating and creating more damage before being detected. In a closed environment all the interactions are known and visible, and this simplifies the task of positioning and developing error detection checks. With system closure, any capability damaged by errors only disables a valid action. In a system with relaxed control over allowable capabilities, a damaged capability can result in the execution of undesirableactionsandunexpectedinterferencebetweencomponents.

Temporalstructuringof theactivitybetweeninteractingstructuralmodulesis alsoimportant for faulttolerance.An atomicactionamonga groupof componentsis anactivityin whichthe componentsinteractexclusivelywith eachotherandthereis no interactionwith therestof the systemfor thedurationof theactivity[Anderson81].Withinan atomicaction,theparticipating componentsneitherimport nor exportany type of informationfrom othernon-participating components.Fromtheperspectiveof thenon-participatingcomponents,all theactivitywithin theatomicactionappearsasoneandindivisibleoccurringinstantaneouslyat anytimeduringthe durationoftheaction.Theadvantageof usingatomicactionsin definingtheinteractionbetween systemcomponentsis thattheyprovidea frameworkfor errorconfinementandrecovery.There areonlytwopossibleoutcomesof anatomicaction:eitherit terminatesnormallyor it is aborted uponerror detection. If an atomicactionterminatesnormally,its resultsare completeand committed.If afailureis detectedduringanatomicaction,it isknownbeforehandthatonlythe participatingcomponentscanbeaffected.Thuserrorconfinementis defined(andneednotbe diagnosed)andrecoveryislimitedtotheparticipatingsetof components.

4.1.2. Error Detection

Effective application of fault tolerance techniques in single version systems requires that the structural modules have two basic properties: self-protection and self-checking [Abbott 90]. The self-protection property means that a component must be able to protect itself from external contamination by detecting errors in the information passed to it by other interacting components. Self-checking means that a component must be able to detect internal errors and take appropriate actions to prevent the propagation of those errors to other components. The degree (and coverage) to which error detection mechanisms are used in a design is determined by the cost of the additional redundancy and the run-time overhead. Note that the fault tolerance redundancy is not intended to contribute to system functionality but rather to the quality of the product. Similarly, detection mechanisms detract from system performance. Actual usage of fault tolerance in a design is based on trade-offs of functionality, performance, complexity, and safety.

Anderson [Anderson 81] has proposed a classification of error detection checks, some of which can be chosen for the implementation of the module properties mentioned above. The location of the checks can be within the modules or at their outputs, as needed. The checks include replication, timing, reversal, coding, reasonableness, and structural checks.

Replication checks make use of matching components with error detection based on comparison of their outputs. This is applicable to multi-version software fault tolerance discussed in section 4.2.

Timing checks are applicable to systems and modules whose specifications include timing constraints, including deadlines. Based on these constraints, checks can be developed to look for deviations from the acceptable module behavior. Watchdog timers are a type of timing check with general applicability that can be used to monitor for satisfactory behavior and detect "lost or locked out" components.
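A simple software watchdog can be sketched in C as follows. This is a single-threaded polling illustration with invented names; in practice a watchdog is typically an independent hardware timer or monitoring process.

    #include <stdbool.h>
    #include <time.h>

    typedef struct { time_t start; double limit_s; } Watchdog;

    /* Arm the watchdog with a deadline for the monitored activity. */
    static void wd_arm(Watchdog *w, double limit_s) {
        w->start = time(NULL);
        w->limit_s = limit_s;
    }

    /* Returns true if the activity has exceeded its deadline, signaling
       a possibly "lost or locked out" component. */
    static bool wd_expired(const Watchdog *w) {
        return difftime(time(NULL), w->start) > w->limit_s;
    }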

Reversal checks use the output of a module to compute the corresponding inputs based on the function of the module. An error is detected if the computed inputs do not match the actual inputs. Reversal checks are applicable to modules whose inverse computation is relatively straightforward.
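For instance, a module computing square roots admits a straightforward reversal check. The sketch below uses an absolute tolerance purely for illustration; in practice the tolerance would come from the application's accuracy requirements.

    #include <math.h>
    #include <stdbool.h>

    /* Reversal check for a square-root module: recompute the input from
       the output and compare against the actual input. */
    static bool sqrt_reversal_check(double input, double output, double tol) {
        return fabs(output * output - input) <= tol;
    }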

Coding checks use redundancy in the representation of information, with fixed relationships between the actual and the redundant information. Error detection is based on checking those relationships before and after operations. Checksums are a type of coding check. Similarly, many techniques developed for hardware (e.g., Hamming, M-out-of-N, cyclic codes) can be used in software, especially in cases where the information is supposed to be merely referenced or transported by a module from one point to another without changing its contents. Many arithmetic operations preserve some particular properties between the actual and redundant information, and can thus enable the use of this type of check to detect errors in their execution.
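A simple additive checksum illustrates the idea (a sketch only; a stronger code such as a cyclic code would be chosen where higher error coverage is required):

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Additive checksum: the redundant word is the sum of the data words. */
    static uint32_t checksum(const uint32_t *data, size_t n) {
        uint32_t sum = 0;
        for (size_t i = 0; i < n; i++)
            sum += data[i];          /* wraps modulo 2^32 */
        return sum;
    }

    /* Coding check: verify the fixed relationship between the data and
       the redundant word after the data has been transported. */
    static bool coding_check(const uint32_t *data, size_t n, uint32_t stored) {
        return checksum(data, n) == stored;
    }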

Reasonableness checks use known semantic properties of data (e.g., range, rate of change, and sequence) to detect errors. These properties can be based on the requirements or the particular design of a module.
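As an illustration, a reasonableness check on a sensor-style input might combine a range check with a rate-of-change check; the quantity and the limits below are invented for the example.

    #include <math.h>
    #include <stdbool.h>

    /* Reasonableness check based on known semantic properties of the
       data: an altitude reading must lie in a plausible range and must
       not change faster than the vehicle physically can. The limits are
       purely illustrative. */
    static bool altitude_reasonable(double alt_m, double prev_alt_m, double dt_s) {
        if (dt_s <= 0.0) return false;                          /* sequence */
        if (alt_m < -500.0 || alt_m > 20000.0) return false;    /* range    */
        if (fabs(alt_m - prev_alt_m) / dt_s > 250.0) return false; /* rate  */
        return true;
    }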

Structural checks use known properties of data structures. For example, lists, queues, and trees can be inspected for number of elements in the structure, their links and pointers, and any other particular information that could be articulated. Structural checks could be made more effective by augmenting data structures with redundant structural data like extra pointers, embedded counts of the number of items on a particular structure, and individual identifiers for all the items ([TaylorD 80A], [TaylorD 80B], [Black 80], [Black 81]).

Another fault detection tool is run-time checks [Pradhan 96]. These are provided as standard error detection mechanisms in hardware systems (e.g., divide by zero, overflow, underflow). Although they are not application specific, they do represent an effective means of detecting design errors.

Error detection strategies can be developed in an ad-hoc fashion or using structured methodologies. Ad-hoc strategies can be used by experienced designers guided by their judgement to identify the types of checks and their location needed to achieve a high degree of error coverage. A problem with this approach stems from the nature of software design faults. It is impossible to anticipate all the faults (and their generated errors) in a module. In fact, according to Abbott [Abbott 90]:

"If one had a list of anticipated design faults, it makes much more sense to eliminate those faults during design reviews than to add features to the system to tolerate those faults after deployment .... The problem, of course, is that it is unanticipated design fauks that one would really like to tolerate."

Fault trees have been proposed as a design aid in the development of fault detection strategies [Hecht 96]. Fault trees can be used to identify general classes of failures and conditions that can trigger those failures. Fault trees represent a top-down approach which, although not guaranteeing complete coverage, is very helpful in documenting assumptions, simplifying design reviews, identifying omissions, and allowing the designer to visualize component interactions and their consequences through structured graphical means. Fault trees enable the designer to perform qualitative analysis of the complexity and degree of independence in the error checks of a proposed fault tolerance strategy. In general, as a fault tree is elaborated, the structuring of the tree goes from high-level functional concepts to more design dependent elements. Therefore, by means of a fault tree a designer can "tune" a fault detection strategy, trading off independence and requirements emphasis on the tests (by staying with relatively shallow and mostly functional fault trees) versus ease of development of the tests (by moving deeper down the design structure and creating tests that target particular aspects of the design).

4.1.3. Exception Handling

Exception handling is the interruption of normal operation to handle abnormal responses. In the context of software fault tolerance, exceptions are signaled by the implemented error detection mechanisms as a request for initiation of an appropriate recovery. The design of exception handlers requires that consideration be given to the possible events triggering the exceptions, the effects of those events on the system, and the selection of appropriate mitigating actions [Pradhan 96]. [Randell 95] lists three classes of exception triggering events for a software component: interface exceptions, internal local exceptions, and failure exceptions.

Interface exceptions are signaled by a component when it detects an invalid service request. This type of exception is triggered by the self-protection mechanisms of a module and is meant to be handled by the module that requested the service.

Local exceptions are signaled by a module when its error detection mechanisms find an error in its own internal operations. These exceptions should be handled by the module's fault tolerant capabilities.

Failure exceptions are signaled by a module after it has detected an error which its fault processing mechanisms have been unable to handle successfully. In effect, failure exceptions tell the module requesting the service that some other means must be found to accomplish its function.

If the system structure, its actions, and error detection mechanisms are designed properly, the effects of errors will be contained within a particular set of interacting components at the moment the error is detected. This knowledge of error containment is essential to the design of effective exception handlers.
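Even in a language without built-in exceptions, the three classes can be made explicit in a component's status reporting. The following C sketch is purely illustrative; the service, its computation, and its internal recovery are placeholders.

    #include <stdbool.h>

    typedef enum {
        OK,                   /* normal termination                       */
        INTERFACE_EXCEPTION,  /* invalid request: to be handled by caller */
        FAILURE_EXCEPTION     /* local recovery failed: caller must find
                                 another way to obtain the service        */
    } Status;

    /* Stub standing in for the module's internal computation plus its
       own fault tolerance; local exceptions are raised and handled
       inside and never cross the module boundary. */
    static bool compute_with_local_recovery(double x, double *r) {
        *r = x * x;     /* stand-in computation */
        return true;
    }

    /* Hypothetical service illustrating the exception classes. */
    static Status service(double x, double *result) {
        if (x < 0.0)
            return INTERFACE_EXCEPTION;   /* self-protection check       */
        if (!compute_with_local_recovery(x, result))
            return FAILURE_EXCEPTION;     /* local handling exhausted    */
        return OK;
    }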

4.1.4. Checkpoint and Restart

For single-version software there are few recovery mechanisms. The most often mentioned is the checkpoint and restart mechanism (e.g., [Pradhan 96]). As mentioned in previous sections, most of the software faults remaining after development are unanticipated, state-dependent faults. This type of fault behaves similarly to transient hardware faults: they appear, do the damage, and then apparently just go away, leaving behind no obvious reason for their activation in the first place [Gray 86]. Because of these characteristics, simply restarting a module is usually enough to allow successful completion of its execution [Gray 86]. A restart, or backward error recovery (see Figure 3), has the advantages of being independent of the damage caused by a fault, applicable to unanticipated faults, general enough that it can be used at multiple levels in a system, and conceptually simple [Anderson 81].


Figure 3: Logical Representation of Checkpoint and Restart

There exist two kinds of restart recovery: static and dynamic. A static restart is based on returning the module to a predetermined state. This can be a direct return to the initial reset state, or to one of a set of possible states, with the selection being made based on the operational situation at the moment the error detection occurred. Dynamic restart uses dynamically created checkpoints that are snapshots of the state at various points during the execution. Checkpoints can be created at fixed intervals or at particular points during the computation, determined by some optimizing rule. The advantage of these checkpoints is that they are based on states created during operation, and can thus be used to allow forward progress of execution without having to discard all the work done up to the time of error detection.
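A minimal sketch of dynamic checkpoint and restart for a single module is shown below; the state layout, the unit of work, the error detection check, and the retry limit are all placeholders.

    #include <stdbool.h>
    #include <string.h>

    typedef struct { double data[64]; int step; } State;

    /* Stubs for illustration only: one unit of work and its check. */
    static bool process_segment(State *s)        { s->step++; return true; }
    static bool output_acceptable(const State *s) { return s->step >= 0; }

    /* Dynamic restart: snapshot the state, execute, and on a detected
       error reload the snapshot and retry from the checkpoint rather
       than from the beginning of the program. */
    static bool run_with_checkpoints(State *s, int n_segments, int max_retries) {
        for (int i = 0; i < n_segments; i++) {
            State checkpoint;
            memcpy(&checkpoint, s, sizeof *s);       /* create checkpoint */
            int tries = 0;
            while (!(process_segment(s) && output_acceptable(s))) {
                if (++tries > max_retries)
                    return false;                    /* raise an exception */
                memcpy(s, &checkpoint, sizeof *s);   /* reload and retry   */
            }
        }
        return true;
    }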

An issue of particular importance to backward error recovery is the existence of unrecoverable actions [Anderson 81]. These tend to be associated with external events that cannot be cleared by the simple process of reloading the state and restarting a module. Examples of unrecoverable actions include firing a missile or soldering a pair of wires. These actions must be given special treatment, including compensating for their consequences (e.g., undoing a solder) or just delaying their output until after additional confirmation checks are complete (e.g., do a friend-or-foe confirmation before firing).

4.1.5. Process Pairs

A process pair uses two identical versions of the software that run on separate processors [Pradhan 96] (Figure 4). The recovery mechanism is checkpoint and restart. Here the processors are labeled as primary and secondary. At first the primary processor is actively processing the input and creating the output while generating checkpoint information that is sent to the backup, or secondary, processor. Upon error detection, the secondary processor loads the last checkpoint as its starting state and takes over the role of primary processor. As this happens, the faulty processor goes offline and executes diagnostic checks. If required, maintenance and replacement is performed on the faulty processor. After returning to service, the repaired processor becomes the secondary processor and begins taking checkpoints from the primary. The main advantage of this recovery technique is that the delivery of services continues uninterrupted after the occurrence of a failure in the system.


Figure 4: Logical Representation of Process Pairs
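The control flow of a process pair can be sketched abstractly as follows. This single-program illustration simulates, with invented names, logic that in reality runs on two separate processors exchanging checkpoint messages.

    #include <stdbool.h>

    typedef struct { long cycle; double state; } Ckpt;

    /* Stub: one processing cycle; returns false on a detected error. */
    static bool primary_step(Ckpt *c) { c->cycle++; c->state += 1.0; return true; }

    static void process_pair(int n_cycles) {
        Ckpt primary   = {0, 0.0};
        Ckpt secondary = {0, 0.0};      /* standby holds last checkpoint */
        for (int i = 0; i < n_cycles; i++) {
            if (primary_step(&primary)) {
                secondary = primary;    /* send checkpoint to secondary  */
            } else {
                /* Error detected: the secondary takes over from the last
                   checkpoint; the old primary would go offline for
                   diagnostic checks before rejoining as secondary. */
                primary = secondary;
            }
        }
    }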

4.1.6. Data Diversity

In a previous section we mentioned that the last line of defense against design faults is to use "input sequence workarounds". Data diversity can be seen as the automatic implementation of "input sequence workarounds" combined with checkpoint and restart. Again, the rationale for this technique is that faults in deployed software are usually input sequence dependent. Data diversity has the potential of increasing the effectiveness of the checkpoint and restart by using different input re-expressions on each retry [Ammann 88] (see Figure 5). The goal of each retry is to generate output results that are either exactly the same or semantically equivalent in some way. In general, the notion of equivalence is application dependent.

[Ammann 88] presents three basic data diversity models:

• Input Data Re-Expression, where only the input is changed (Figures 5 and 6);

• Input Re-Expression with Post-Execution Adjustment, where the output is also processed as necessary to achieve the required output value or format (Figure 7);

• Re-Expression via Decomposition and Recombination, where the input is broken down into smaller elements and then recombined after processing to form the desired output (Figure 8).

Figure 5: Checkpoint and Restart using Data Diversity (with Input Re-Expression Model)


Figure 6: Data Diversity using Input Data Re-Expression


Figure 7: Data Diversity using Input Re-expression with Post-Execution Adjustment


Figure 8: Data Diversity using Re-expression via Decomposition and Recombination
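A sketch of checkpoint and restart with input re-expression appears below. The re-expression used, a tiny multiplicative perturbation, is purely illustrative and is only legitimate where the application's notion of output equivalence tolerates it.

    #include <stdbool.h>

    /* Stub standing in for the program plus its error detection checks;
       returns false when an error is detected. */
    static bool execute(double input, double *output) {
        *output = input;   /* stand-in computation */
        return true;
    }

    /* Checkpoint and restart with data diversity: each retry uses a
       re-expressed version of the input instead of the identical one. */
    static bool run_with_data_diversity(double input, double *output, int retries) {
        for (int i = 0; i <= retries; i++) {
            double re = input * (1.0 + 1.0e-12 * (double)i);  /* re-express */
            if (execute(re, output))
                return true;      /* acceptable (equivalent) output         */
        }
        return false;             /* signal a failure exception             */
    }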

Data diversity is compatible with the Process Pairs technique, using different re-expressions of the input in the primary and secondary. It also seems plausible to incorporate some degree of execution flexibility into the design of the software components to simplify the use of the data diversity concept. Finally, data diversity could be used in conjunction with the multi-version fault tolerance techniques presented in the next section.

4.1.7. Considerations on the use of Checkpointing

We are concerned in this section with the use of checkpointing during execution of a program. The results referenced here assume instantaneous detection of errors from the moment a fault is activated. In real systems these detection delays are non-zero and should be taken into account when selecting a checkpointing strategy. Non-zero detection delays can invalidate checkpoints if the time to detect errors is larger than the interval between checkpoints.

As mentioned above, there exist two kinds of checkpointing that can be used with the checkpoint and restart technique: static and dynamic checkpointing. Static checkpoints take single snapshots of the state at the beginning of a program or module execution. With this approach, the system returns to the beginning of that module when an error is detected and restarts execution all over again. This basic approach to checkpointing provides a generic capability to recover from errors that appear during execution. The use of the single static checkpoint strategy allows the use of error detection checks placed at the output of the module without necessarily having to embed checks in the code. A problem with this approach is that under the presence of random faults, the expected time to complete the execution grows exponentially with the processing requirement. Nevertheless, because of the overhead associated with the use of checkpoints (e.g., creating the checkpoints, reloading checkpoints, restarting), the single checkpoint approach is the most effective when the processing requirement is relatively small.

Dynamic checkpointing is aimed at reducing the execution time for large processing requirements in the presence of random faults by saving the state information at intermediate points during the execution. In general, with dynamic checkpointing it is possible to achieve a linear increase in actual execution time as the processing requirements grow. Because of the overhead associated with checkpointing and restart, there exists an optimal number of checkpoints that optimizes a given performance measure. Factors that influence the checkpointing performance include the execution requirement, the fault tolerance overhead (i.e., error detection checks, creating checkpoints, recovery, etc.), the fault activation rate, and the interval between checkpoints. Because checkpoints are created dynamically during processing, the error detection checks must be embedded in the code and executed before the checkpoints are created. This increases the effectiveness of the checks and the likelihood that the checkpoints are valid and usable upon error detection.

[Nicola 95] presents three basic dynamic checkpointing strategies: equidistant, modular, and random. Equidistant checkpointing uses a deterministic fixed time between checkpoints. [Nicola 95] shows that for an arbitrary duration between equidistant checkpoints, the expected execution time increases linearly as the processing requirement grows. The optimal time between checkpoints that minimizes the total execution time is shown to be directly dependent on the fault rate and independent of the processing requirements.

Modular checkpointing is the placement of checkpoints at the end of the sub-modular components of a piece of software right after the error detection checks for each sub-module are complete. Assuming a component with a fixed number of sub-modules, the expected execution time is directly related to the processing distribution of the sub-modules (i.e., the processing time between checkpoints). For a given failure rate, a linear dependence between the execution time and the processing requirement is achieved when the processing distribution is the same throughout the modules. For the more general case of a variable processing requirement and an

exponential distribution in the duration of the sub-modules, the execution time becomes a linear function of the processing requirements when the checkpointing rate is larger than the failure rate.

In random checkpointing the process of checkpoint creation is triggered at random, without consideration of the status of the software execution. Here it is found that the optimal average checkpointing rate is directly dependent on the failure rate and independent of the processing requirements. With this optimal checkpointing rate, the execution time is linearly dependent on the processing requirement.
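The control loop shared by these checkpointing strategies can be sketched as follows. This is an idealized outline, assuming transient faults (so a retry can succeed), equidistant checkpoints, and application-supplied step, save, and load hooks; none of these names come from the referenced works.

    def run_with_checkpoints(step, save, load, total_steps, interval):
        # Perform `total_steps` units of work, saving a checkpoint every
        # `interval` steps; on a detected error, roll back to the last
        # checkpoint instead of restarting the whole computation.
        state = {"next_step": 0}
        save(state)
        while state["next_step"] < total_steps:
            try:
                state = step(state)      # one unit of work; raises on a detected error
                if state["next_step"] % interval == 0:
                    save(state)          # checkpoint only a verified state
            except Exception:
                state = load()           # restart from the last checkpoint
        return state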

4.2. Multi-Version Software Fault Tolerance Techniques

Multi-version fault tolerance is based on the use of two or more versions (or "variants") of a piece of software, executed either in sequence or in parallel. The versions are used as alternatives (with a separate means of error detection), in pairs (to implement detection by replication checks), or in larger groups (to enable masking through voting). The rationale for the use of multiple versions is the expectation that components built differently (i.e., different designers, different algorithms, different design tools, etc.) should fail differently [Avizienis 77]. Therefore, if one version fails on a particular input, at least one of the alternate versions should be able to provide an appropriate output. This section covers some of these "design diversity" approaches to software reliability and safety.

4.2.1. Recovery Blocks

The Recovery Blocks technique ([Randell 75], [Randell 95A]) combines the basics of the checkpoint and restart approach with multiple versions of a software component such that a different version is tried after an error is detected (see Figure 9). Checkpoints are created before a version executes; they are needed to recover the state after a version fails, so that a valid operational starting point is available for the next version. The acceptance test need not be an output-only test and can be implemented by various embedded checks to increase the effectiveness of the error detection. Also, because the primary version will be executed successfully most of the time, the alternates could be designed to provide degraded performance in some sense (e.g., by computing values to a lesser accuracy). As in data diversity, the outputs of the alternates could be designed to be equivalent to that of the primary, with the definition of equivalence being application dependent. Actual execution of the multiple versions can be sequential or parallel depending on the available processing capability and performance requirements. If all the alternates are tried unsuccessfully, the component must raise an exception to communicate its failure to complete its function to the rest of the system. Note that such a failure occurrence does not imply a permanent failure of the component, which may be reusable after changes in its inputs or state. The possibility of coincident faults is the source of much controversy concerning all the multi-version software fault tolerance techniques.

Figure 9: Recovery Block Model
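The structure of Figure 9 can be outlined in a few lines of Python. This is an illustrative sketch, not the original formulation: the version callables, the acceptance test, and the dictionary standing in for the component state are all hypothetical, and the checkpoint is a shallow copy for brevity.

    def recovery_block(versions, acceptance_test, state, inputs):
        # Establish a checkpoint, then try the primary and each alternate
        # in ranked order, restoring the checkpoint after every rejected
        # attempt so each version starts from a valid state.
        checkpoint = dict(state)              # shallow snapshot before execution
        for version in versions:              # primary first, then alternates
            try:
                result = version(state, inputs)
                if acceptance_test(result):
                    return result             # first accepted output wins
            except Exception:
                pass                          # a raised error counts as detection
            state.clear()
            state.update(checkpoint)          # roll back before the next attempt
        raise RuntimeError("recovery block failed: all versions rejected")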

4.2.2. N-Version Programming

N-Version programming [Avizienis 95B] is a multi-version technique in which all the versions are designed to satisfy the same basic requirements and the decision of output correctness is based on the comparison of all the outputs (see Figure 10). The use of a generic decision algorithm (usually a voter) to select the correct output is the fundamental difference between this approach and the Recovery Blocks approach, which requires an application-dependent acceptance test. Since all the versions are built to satisfy the same requirements, the use of N-version programming requires considerable development effort, but the complexity (i.e., development difficulty) is not necessarily much greater than the inherent complexity of building a single version. Design of the voter can be complicated by the need to perform inexact voting (see Section 4.2.7.2). Much research has gone into the development of methodologies that increase the likelihood of achieving effective diversity in the final product (see Section 4.2.7.1). Actual execution of the versions can be sequential or parallel. Sequential execution may require the use of checkpoints to reload the state before an alternate version is executed.

Figure 10: N-Version Programming Model
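A minimal sketch of N-Version execution with an exact-match majority voter follows (inexact voting is discussed in Section 4.2.7.2). The version callables are hypothetical, and the outputs are assumed to be hashable values.

    from collections import Counter

    def n_version_programming(versions, inputs):
        # Run every version on the same inputs (conceptually in parallel)
        # and let a generic decision algorithm select the output.
        outputs = [version(inputs) for version in versions]
        value, count = Counter(outputs).most_common(1)[0]
        if count > len(versions) // 2:
            return value                       # majority agreement
        raise RuntimeError("voter could not find a majority")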

4.2.3. N Self-Checking Programming

N Self-Checking programming ([Laprie 87], [Laprie 90], [Laprie 95]) is the use of multiple software versions combined with structural variations of Recovery Blocks and N-Version Programming. N Self-Checking programming using acceptance tests is shown in Figure 11. Here the versions and the acceptance tests are developed independently from common requirements. This use of a separate acceptance test for each version is the main difference between this N Self-Checking model and the Recovery Blocks approach. Similar to Recovery Blocks, execution of the versions and their tests can be done sequentially or in parallel, but the output is taken from the highest-ranking version that passes its acceptance test. Sequential execution requires the use of checkpoints, and parallel execution requires the use of input and state consistency algorithms.

Figure 11: N Self-Checking Programming using Acceptance Tests

N self-checking programming using comparison for error detection is shown in Figure 12. Similar to N-Version Programming, this model has the advantage of using an application-independent decision algorithm to select a correct output. This variation of self-checking programming has the theoretical vulnerability of encountering situations where multiple pairs pass their comparisons, each with a different output. That case must be considered, and an appropriate decision policy should be selected during design.

Figure 12: N Self-Checking Programming using Comparison
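The comparison-based variation of Figure 12 might be sketched as follows, under the assumption that the pairs are listed in ranking order; the policy for handling multiple passing but mutually disagreeing pairs is left to the designer, as discussed above. All names are illustrative.

    def n_self_checking(pairs, inputs):
        # Each self-checking component runs two versions and compares
        # their outputs; the module output comes from the highest-ranking
        # pair whose replication check passes.
        for version_a, version_b in pairs:     # pairs listed in ranking order
            out_a = version_a(inputs)
            out_b = version_b(inputs)
            if out_a == out_b:                 # comparison-based error detection
                return out_a
        raise RuntimeError("no pair passed its comparison")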

4.2.4. Consensus Recovery Blocks

The Consensus Recovery Blocks approach [Scott 87] (see Figure 13) combines N-Version Programming and Recovery Blocks to improve reliability over that achievable by using just one of the approaches. According to Scott [Scott 87], the acceptance tests in Recovery Blocks suffer from a lack of guidelines for their development and a general proneness to design faults due to the inherent difficulty in creating effective tests. The use of voters as in N-Version Programming may not be appropriate in all situations, especially when multiple correct outputs are possible; in that case a voter would declare a failure even though appropriate outputs exist. Consensus Recovery Blocks uses a decision algorithm similar to N-Version Programming as a first layer of decision. If this first layer declares a failure, a second layer using acceptance tests similar to those used in the Recovery Blocks approach is invoked. Although this combination is obviously much more complex than either of the individual techniques, reliability models indicate that it has the potential of producing a more reliable piece of software [Scott 87]. The word potential is important here because the added complexity could actually work against the design and result in a less reliable system.

Figure 13: Consensus Recovery Blocks
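A sketch of the two decision layers follows. The voter is assumed to signal failure by raising an exception, as in the earlier N-Version sketch, and all names are illustrative rather than taken from [Scott 87].

    def consensus_recovery_block(versions, voter, acceptance_test, inputs):
        # Layer 1: an N-Version-style voter over all version outputs.
        # Layer 2: Recovery-Blocks-style acceptance tests, applied in
        # ranked order only if the voter declares a failure.
        outputs = [version(inputs) for version in versions]
        try:
            return voter(outputs)
        except RuntimeError:
            for output in outputs:
                if acceptance_test(output):
                    return output
        raise RuntimeError("both decision layers failed")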

4.2.6. t/(n-1)-Variant Programming

t/(n-1)-Variant Programming (VP) was proposed by Xu and Randell in [Xu 97]. The main difference between this approach and the ones mentioned above is in the mechanism used to select the output from among the multiple variants. The design of the selection logic is based on the theory of system-level fault diagnosis, which is beyond the scope of this paper (see [Pradhan 96] for a presentation of the subject). Basically, a t/(n-1)-VP architecture consists of n variants and uses the t/(n-1) diagnosability measure to isolate the faulty units to a subset of size at most (n-1), assuming there are at most t faulty units [Xu 97]. Thus, at least one non-faulty unit exists whose output is correct and can be used as the result of the computation for the module. t/(n-1)-VP compares favorably with other approaches in that the complexity of the selection mechanism grows as O(n), and it can potentially tolerate multiple dependent faults among the versions. It also has a lower probability of failure than N Self-Checking Programming and N-Version Programming when they use a simple voter as selection logic.

4.2.7. Additional Considerations

Two critical issues in the use of multi-version software fault tolerance techniques are guaranteeing independence of failure among the multiple versions and developing the output selection algorithms.

4.2.7.1. Multi-Version Software Development

Design diversity is "protection against uncertainty" [Bishop 95]. In the case of software design, the uncertainty is in the presence of design faults and the failure modes due to those faults. The goal of design diversity techniques applied to software design is to build program

versions that fail independently and with low probability of coincidental failures. If this goal is achieved, the probability of not being able to select a good output at a particular point during program execution is greatly reduced or eliminated.

Due to the complexity of software, the use of design diversity for software fault tolerance is today more of an art than a science. Some researchers have developed guidelines and methodologies to achieve a desired level of diversity, but the implementation of design diversity remains a rather complex (and controversial) subject. Presently, the assessment of the achieved improvement over single version software design is difficult (if not impossible) and is based mostly on qualitative arguments.

Perhaps the most comprehensive effort to develop a methodology for multi-version software design was carried out by Algirdas Avizienis and his colleagues at UCLA starting in the 1970s ([Avizienis 85A], [Avizienis 85B], [Avizienis 86], [Avizienis 88], [Avizienis 89], [Avizienis 95A], [Avizienis 95B], [Avizienis 97]). Although focused mainly on software, their research considered the use of design diversity concepts for other aspects of systems like the operating system, the hardware, and the user interfaces. [Avizienis 95B] presents a design methodology for multi-version software that considers the full development effort from the system requirements phase to the operational phase. The objectives of the design paradigm are to reduce the probability of design errors, eliminate sources of similar design faults, and minimize the probability of similar output errors. The presented methodology basically follows the same software engineering principles presented in Section 2 of this paper, augmented with activities to support the introduction of design diversity. Decisions to be made include: the selection of the number of software versions; assessment of the required diversity (i.e., diverse specification, design, code, and/or testing); assessment of the use of random (or unmanaged) diversity versus forced (or managed) diversity to minimize the common causes of errors; rules of isolation between the development teams to reduce the probability of similar design errors; the establishment of a coordinating team to serve as an interface between the development teams; and the definition of a rigorous communication protocol between the design teams and the coordinating team to prevent the flow of information that could result in common design errors.

An approach to introducing software fault tolerance is to implement the fault tolerance at the host system level while allowing the application programs to be developed with a minimum of concern for the fault tolerance services. This allows the application developers to focus on their application specialties without being overwhelmed by the fault tolerance aspects of the system. To implement this approach, a framework must be developed which expands the capabilities of the basic operating system with fault tolerance services like cross-version communication, error recovery, and output value selection ([Avizienis 95B], [Bresoud 98]).

As mentioned above, it is hard to determine the benefits of using design diversity for software fault tolerance. There are some inherent difficulties with this approach, including the elimination of failure dependencies and the cost of development. Assuming that the development is rigorous and design diversity is adequately applied to the product, there is still the common error source of the identical input profile. [Saglietti 90B] points out that experiments (e.g., [Knight 85], [Knight 86], [Eckhardt 91]) have shown that the probability of error manifestation is not equally distributed over the input space and that the probability of coincident errors is affected by the chosen inputs. Certainly data diversity techniques could be used to reduce the impact of this error source, but the problem of quantifying the effectiveness of the approach still remains.

The cost of using multi-version software is also an important issue. A direct replication of the full development effort, including testing, would certainly be an expensive proposition. Since a supporting execution environment is needed to complete the implementation, the total cost could be prohibitive for some applications. However, cost savings can be effected by judicious use of acceptable alternatives. For example, in some applications where only a small part of the functionality is safety critical, development and production cost can be reduced by applying design diversity only to those critical parts [Bishop 95]. In situations where demonstrating safety attributes to an official regulatory authority tends to be more costly than the actual development cost, design diversity can be used to make a more convincing safety case with a smaller safety assessment effort. Also, when the cost of alternative design assurance techniques is rather high because of the need for specialized staff and tools, the use of design diversity could actually result in a cost saving.

4.2.7.2. Output Selection Algorithms

The basic difference among the multi-version software fault tolerance techniques presented above is in the output selection algorithms. For some techniques, inline acceptance tests simplify the output selection. Problems with acceptance tests are that they are highly application dependent, they tend to be difficult to develop, and they cannot test for a specific correct answer but only for "acceptable" values. The ranking of versions based on their individual expected reliabilities can supplement the acceptance tests for cases where multiple versions pass the tests. However, when all the versions are considered equally reliable, the output selection must be based on cross-comparison of the available version outputs, possibly augmented by knowledge of the application. As noted in the particular case of the Consensus Recovery Blocks approach, the output reliability can be increased by the combination of multiple output selection techniques.

The development of output selection algorithms should consider the consequences of erroneous output selection in terms of critical application issues like safety, reliability, and availability. In general, the output of properly developed versions should be correct for the vast majority of inputs and input sequences, and therefore the reliability of a single version will tend to be relatively good. Nevertheless, for increased reliability in a multi-version arrangement, cross-comparison techniques should be designed such that the selected output is correct with a very high probability. For applications where safety is a main concern, it is important that the output selection algorithm be capable of detecting erroneous version outputs and preventing the propagation of bad values to the main output. For these applications the selection algorithm must be given the capability to declare an error condition or initiate an acceptable safe output sequence when it cannot achieve high confidence of selecting a correct output. In those cases where availability is more important, the output selection can be designed such that it will always produce an output, even if it is incorrect. Such an approach could be acceptable as long as the program execution is not subsequently dependent on previously generated and possibly erroneous outputs.

[Anderson 86] presents a generic two-step structure for the output selection process. The first step is a filtering process where individual version outputs are analyzed by acceptance tests for likelihood of correctness, timing, completeness, and other characteristics. In general, the function of the filtering step is to remove any outputs that can be declared bad by direct inline examination. The outputs that pass the filtering step are then forwarded to the arbitration step, where a selection algorithm is used to produce a final output value. Because the values used in

the arbitration have been pre-screened, the selection algorithm and the overall approach are likely to be more effective.

Cross-comparison of the available version outputs is usually performed by means of a voting algorithm. [Lorczak 89] presents four generalized voters for use in redundant systems: the Formalized Majority Voter, the Generalized Median Voter, the Formalized Plurality Voter, and the Weighted Averaging Technique. The proposed generalization of the voting techniques is based on a framework of metric spaces 1. By assuming the use of a metric space, the voters are given the capability of performing inexact voting by declaring values to be equal if their metric distance is less than some predefined threshold ε. In the Formalized Majority Voter, version outputs are compared for metric equality and, if more than half of the values agree, the voter output is selected as one of the values in the agreement group. The Generalized Median Voter selects the median of the version values as the output. In the metric framework defined here, the median is determined by successively eliminating pairs of values that are farthest apart until only one value remains (assuming an odd number of versions). The Formalized Plurality Voter partitions the set of version outputs based on metric equality, and the selected output is one of the elements in the partition with the largest cardinality. The Weighted Averaging Technique combines the version outputs in a weighted average to produce a new output. The weights can be selected a priori based on the characteristics of the individual versions and the application. When all the weights are equal, this technique becomes a mean selection technique. The weights can also be selected dynamically based on the pair-wise distances of the version outputs [Broen 75] or the success history of the versions measured by some performance metric ([Lorczak 89], [Gersting 91]).
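As an illustration, the first two voters can be sketched for a scalar output space, using the absolute difference as the metric; EPSILON is a hypothetical application-specific threshold standing in for ε, and the function names are ours rather than [Lorczak 89]'s.

    from itertools import combinations

    EPSILON = 1e-6                      # hypothetical agreement threshold

    def distance(x, y):
        return abs(x - y)               # metric d(x, y) on a scalar output space

    def formalized_majority_voter(outputs):
        # Inexact voting: two outputs "agree" when their metric distance
        # is below EPSILON; return a member of any group of more than
        # n/2 agreeing values.
        for candidate in outputs:
            group = [y for y in outputs if distance(candidate, y) < EPSILON]
            if len(group) > len(outputs) // 2:
                return candidate
        raise RuntimeError("no majority agreement group")

    def generalized_median_voter(outputs):
        # Successively discard the pair of values farthest apart until a
        # single value remains (an odd number of versions is assumed).
        values = list(outputs)
        while len(values) > 1:
            i, j = max(combinations(range(len(values)), 2),
                       key=lambda p: distance(values[p[0]], values[p[1]]))
            del values[j]               # j > i, so delete the larger index first
            del values[i]
        return values[0]

Note that the choice of metric and threshold is itself a safety-relevant design decision: too tight a threshold sacrifices availability, while too loose a threshold risks grouping erroneous outputs with correct ones.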

Other voting techniques have been proposed. For example, [Croll 95] proposed a selection function that always produces an acceptable output through the use of artificial intelligence techniques. Specifically, the voting system would behave like a majority or plurality voter when version outputs are sufficiently close to each other and within an acceptable normal range. When there is disagreement, the voter would behave like a weighted averaging voter that assigns the weights based on "fault records" generated from normal cases when the voter is able to form a majority for output selection. These fault records contain information about the disagreements in the value and time domains for each individual version. The records are then used when there is a disagreement beyond the capabilities of the majority or plurality voter. In those cases the output selection would be based on the reliability information contained in the fault records. The authors propose the use of neural networks or genetic algorithms to implement the voter in such a way that its performance is related to the application and the particular characteristics of the software versions. [Bass 95] proposes the use of predictive voters (e.g., Linear Predictor and First Order Predictor) that use the history of previous results to produce an expected output value and then select the output based on which version output value is closest to the expected value. [Broen 75] proposed eight weighted average voters for control applications. The voters are designed to produce smooth continuous outputs and to have various degrees of transient failure suppression for failed channels (or versions) of a redundant system.

An important parameter for output selection algorithms is the granularity of the arbitration.

1 Definition of a metric space [Lorczak 89]: Let X denote the output space of the software. Let d denote a real-valued function defined on the Cartesian product X x X with the following properties: (1) d(x,y) ≥ 0; (2) d(x,y) = 0 implies x = y; (3) d(x,y) = d(y,x); (4) d(x,z) ≤ d(x,y) + d(y,z), for all x, y, z in X. Then (X,d) is a metric space.

This concept of granularity applies to systems where the version outputs are composites or matrices with various sub-elements. The decision to be made here concerns the level at which output selection will be performed: at a coarse level, at a fine level, or at some intermediate level. [Kelly 86], [Tso 87], and [Saglietti 91] discuss the problem in some detail. The most obvious characteristic of this perspective on the output selection problem is that coarse-level arbitration will result in many more disagreements in output selection among the versions. In general, the higher the level of arbitration, the more likely it is that the selected output will be correct, but the likelihood of achieving agreement is diminished. Similarly, applying the output selection algorithm at the lower levels will increase the availability of the system, but it will also increase the probability of having inconsistent outputs. An interesting characteristic of voter granularity is that the output selection can dynamically select the granularity. For example, if a voter is unable to detect a majority or plurality at a coarse level, it can automatically switch to progressively lower granularities until agreement is achieved. In doing so, the selection logic would be trading safety and reliability for an increase in system availability. As mentioned above, the characteristics of the output selection algorithm must be based on system-level issues like reliability, safety, and availability, as well as the particular details of the application.
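A sketch of dynamically selected granularity follows, assuming each version output is a tuple of sub-elements. It falls back from whole-tuple (coarse) voting to element-wise (fine) voting when no coarse majority exists; all names are hypothetical.

    from collections import Counter

    def exact_majority(values):
        value, count = Counter(values).most_common(1)[0]
        if count > len(values) // 2:
            return value
        raise RuntimeError("no majority")

    def vote_with_dynamic_granularity(outputs):
        # Coarse level first: vote on whole output tuples. If no majority
        # exists, fall back to fine-grained element-wise voting, trading
        # some safety margin for availability (see text).
        try:
            return exact_majority(outputs)
        except RuntimeError:
            return tuple(exact_majority(column) for column in zip(*outputs))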

4.3. Fault Tolerance in Operating Systems

Application level software relies on the correct behavior of the operating system. In theory, the previously mentioned techniques to achieve software fault tolerance can be applied to the design of operating systems (e.g., [Denning 76]). However, in general, designing and building operating systems tends to be a rather complex, lengthy, and costly endeavor. For safety critical applications it may be necessary to develop custom operating systems through highly structured design processes (e.g., [DO178B]), involving highly experienced programmers and advanced verification techniques, in order to gain a high degree of confidence in the correctness of the software. For many other applications where time to market and cost are driving factors, such highly structured approaches are not viable, and tradeoffs are necessary in those cases. For example, as mentioned previously, in some applications where only a small part of the functionality is safety critical, development and production cost can be reduced by applying design diversity only to those critical parts. This, of course, requires analysis and insight into the workings of the applications and the operating system.

Another approach to the development of fault tolerant operating systems for mission critical applications is the use of wrappers around off-the-shelf operating systems to boost their robustness to faults. A problem with the use of off-the-shelf software in dependable systems is that the system developers cannot be sure the off-the-shelf components are reliable enough for the application [Voas 98A]. It is known that the development process for commercial off-the-shelf software does not consider de facto standards for safety or mission critical applications, and the available documentation for the design and validation activities tends to be rather weak [Salles 99]. A point in favor of using commercial operating systems is that they often include the latest developments in operating system technology. Also, widely deployed commercial operating systems could have fewer bugs overall than custom developed software due to the corrective actions performed in response to bug reports from the users [Koopman 97]. Because modifications to the internals of the operating system could increase the risk of introducing design faults, it is preferable to apply techniques that use the software as is.

A wrapper is a piece of software put around another component to limit what that component

can do without modifying the component's source code [Voas 98A]. Wrappers monitor the flow of information into and out of the component and try to keep undesirable values from being propagated. In this manner, the wrapper limits the component's input and output spaces. As with other inline software fault tolerance techniques, wrappers are not a fix-all solution. Their error detection techniques are based on anticipated fault models. As mentioned previously, it is unanticipated faults that are the main source of concern. Also, wrappers cannot protect against illegal outputs explicitly generated by the off-the-shelf component which are not part of the component's specification. Again, the wrapper cannot protect against unanticipated events. Nevertheless, within their inherent limitations, wrappers can be an acceptable technique to achieve the robustness and cost goals for certain applications.
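A minimal sketch of such a wrapper follows. The screening policies (input_ok, output_ok) and the fallback are hypothetical application-supplied callbacks derived from an anticipated fault model; they are not part of any referenced wrapper design.

    def wrap(component, input_ok, output_ok, fallback):
        # Build a wrapper around an unmodified component: screen inputs
        # and outputs against the anticipated fault model and substitute
        # a safe fallback value whenever a check fails.
        def wrapped(x):
            if not input_ok(x):
                return fallback(x)    # undesirable input never reaches the component
            y = component(x)
            if not output_ok(x, y):
                return fallback(x)    # undesirable output never propagates
            return y
        return wrapped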

Wrappers have been used as middleware located between the operating system and the application software ([Russinovich 93], [Russinovich 94], [Russinovich 95]). The wrappers (called "sentries" in the referenced work) encapsulate operating system services to provide application-transparent fault tolerant functionality and can augment or change the characteristics of the services as seen by the application layer. In this design the sentries provide the mechanism to implement fault tolerance policies that can be dynamically assigned to particular applications based on their individual fault tolerance, cost, and performance needs. The sentries have the capability to implement fault detection and recovery policies through checkpointing and journaling. Journaling is a technique that allows recovery by guiding an application through the replay and synchronization of key input events that occurred from the last checkpointed state to a state close to that just before the fault was detected. The sentries can also perform error correction by performing consistency and validity checks on operating system data structures and doing corrections when errors are detected. Tests performed by the researchers seem to indicate the viability of their approach for effectively implementing fault tolerance policies with acceptable performance penalties.

[Salles 99] proposed the use of wrappers at the microkernel level for off-the-shelf operating systems. The wrappers proposed by these researchers aim at verifying consistency constraints at a semantic level by utilizing information beyond what is available at the interface of the wrapped component. Their approach uses abstractions (i.e., models) of the expected component functionality. Fault containment is based on verifying dynamic predicates defined to assert the correct behavior of the component. As with other error detection techniques, there is a tradeoff between developing costly detailed models of the targeted component that enable more accurate error detection versus the performance achievable using simpler models which might not be as effective in detecting errors. The authors of the referenced work deliberately targeted microkernels instead of the full general-purpose operating system built on top of them because their functionality is easier to understand and manageable from a modeling perspective. The proposed wrappers require access to information internal to the microkernel to verify the predicates and enable corrective actions when a fault is detected. In order to do this, the addition of a "metainterface" that would allow observation and control of the microkernel data structures is proposed. This additional interface would protect the source code developed by the microkernel manufacturer while enabling full access to the critical internal data structures. The requirement for the additional metainterface is a drawback of this approach to wrapper design, but it does enable fault tolerance capabilities beyond those achievable by a simpler interface wrapper.

One way to increase the effectiveness of wrappers is to carry out fault injection experiments on the targeted operating system before designing the wrappers in order to gain knowledge of the weaknesses and pitfalls of the operating system error detection and recovery

mechanisms. Section 4.4 covers the area of software fault injection in more detail.

4.4. Software Fault Injection for Fault Tolerance Assessment

Software fault injection (SFI) is the process of testing software under anomalous circumstances involving erroneous external inputs or internal state information. The main reason for using software fault injection is to assess the goodness of a design [Voas 98B]. Basically, SFI tries to measure the degree of confidence that can be placed on the proper delivery of services. Since it is very hard to produce correct software, SFI tries to show what could happen when faults are activated. The collected information can be used to make code less likely to hide faults and also less likely to propagate faults to the outputs, either by reworking the existing code or by augmenting its capabilities with additional code, as done with wrappers [Voas 98B]. SFI can be used to target both objectives of the dependability validation process: fault removal and fault forecasting [Avresky 92]. In the context of fault removal, SFI can be used as part of the testing strategy during the software development process to see if the designed algorithms and mechanisms work as intended. In fault forecasting, SFI is used to assess the fault tolerance robustness of a piece of software (e.g., an off-the-shelf operating system). In this context, SFI enables a performance estimate for the fault tolerance mechanisms in terms of their coverage (i.e., the percentage of faults handled properly) and latency (i.e., the time from fault occurrence to error manifestation at the observation point). The use of SFI has two important advantages over traditional input sequence test cases [Lai 95]. First, by actively injecting faults into the software we are in effect accelerating the failure rate, and this allows thorough testing in a controlled environment within a limited time frame. Second, by systematically injecting faults to target particular mechanisms we are able to better understand the behavior of those mechanisms, including error propagation and output response characteristics.

There exist two basic models of software fault injection: fault injection and error injection. Fault injection simulates software design faults by targeting the code. Here the injection considers the syntax of the software in order to modify it in various ways, with the goal of replacing existing code with new code that is semantically different [Voas 98B]. This "code mutation" can be performed at the source code level before compilation if the source code is available. The mutation can also be done by modifying the text segment of a program's object code after compilation. Error injection, called "data-state mutation" in [Voas 98B], targets the state of the program to simulate fault manifestations. Actual state injection can be performed by modifying the data of a program using any of various available mechanisms: high priority processes that modify lower priority processes with the support of the operating system; debuggers that directly change the program state; message-based mechanisms where one component corrupts the messages received by another component; storage-based mechanisms that use storage (e.g., cache, primary, or secondary memory) manipulation tools; or command-based approaches that change the state by means of the system administration and maintenance interface commands [Lai 95]. An important aspect of both types of fault injection is the operational profile of the software [Voas 98B]. Fault injection is a form of dynamic testing because it must be used in the context of running software following a particular input sequence and internal state profile. The operational profile must be similar to the actual profile in order to realistically assess the robustness of the software. However, for the purpose of removing weaknesses in the code or characterizing the code under special or unlikely circumstances, the operational profile can be manipulated to improve other aspects of a test like observability and test duration.
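A toy error-injection harness might look like the following. It assumes integer-valued state variables, idealizes detection as a raised exception, and uses entirely hypothetical names; real tools intercept memory, messages, or system calls instead.

    import random

    def inject_state_error(state):
        # Data-state mutation: flip one random bit in one randomly chosen
        # integer state variable to simulate a fault manifestation.
        key = random.choice(list(state))
        state[key] ^= 1 << random.randrange(32)
        return state

    def estimate_coverage(program, make_state, trials=1000):
        # Estimate error-detection coverage: the fraction of injected
        # errors that the program detects and signals rather than
        # silently accepting.
        detected = 0
        for _ in range(trials):
            try:
                program(inject_state_error(make_state()))
            except Exception:
                detected += 1          # error detected and signaled
        return detected / trials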

Software fault injection is but one element of the larger area of experimental system testing. A large amount of work has been done in this area by many researchers. The reader is encouraged to review the reported experiments and experimental tools to gather a deeper understanding of the pros and cons of this approach to robustness assessment. Examples of reported works include [Iyer 96], [Kao 93], [Fabre 99], [Koopman 97], [LeeI 95], [Arlat 90].

5. Hardware and Software Fault Tolerance

System fault tolerance is a vast area of knowledge well beyond what can be covered in a single paper. The concepts presented in this section are purposely treated at a high level with details considered only where regarded as appropriate. Readers interested in a more thorough treatment of the concepts of computer system fault tolerance should consult additional reference material (for example, [Pradhan 96], [Suri 95], [Randell 95B]).

5.1. Computer Fault Tolerance

Computer fault tolerance is one of the means available to increase dependability of delivered computational services. Dependability is a quality measure encompassing the concepts of reliability, availability, safety, performability, maintainability and testability [Johnson 96].

• Reliability is the probability that a system continues to operate correctly during a particular time interval given that it was operational at the beginning of the interval.

• Availability is the probability that a system is operating correctly at a given time instant.

• Safety is the probability that the system will perform in a non-hazardous way. A hazard is defined as "a state or condition of a system that, together with other conditions in the environment of the system, will lead inevitably to an accident" [Leveson 95].

• Performability is the probability that the system performance will be equal to or greater than some particular level at a given instant of time.

• Maintainability is the probability that a failed system will be returned to operation within a particular time period. Maintainability measures the ease with which a system can be repaired.

• Testability is a measure of the ability to characterize a system through testing. Testability includes the ease of test development (i.e., controllability) and effect observation (i.e., observability).

The main direct concern for fault tolerant designs is the ability to continue delivery of services in the presence of faults in the system. A fault is an anomalous condition occurring in the system hardware or software. [Suri 95] presents a general fault classification table (see Table 1) which is

excellent for understanding the types of faults that fault tolerant designs are called upon to handle. A latent fault is a fault that is present in the system but has not caused errors; after errors occur, the fault is said to be active. Permanent faults are present in the system until they are removed; transient faults appear and disappear on their own with no explicit intervention from the system. Symmetric faults are those perceived identically by all good subsystems; asymmetric faults are perceived differently by the good subsystems. A random fault is caused by the environment (e.g., heat, humidity, vibration, etc.) or by component degradation; generic faults are built-in faults accidentally introduced during design or manufacturing of the system. Benign faults are detectable by all good subsystems; malicious faults are not directly detectable by all good subsystems. The fault count classification is relative to the modularity of the system. A single fault is a fault in a single system module; a group of multiple faults affects more than one module. The time classification is relative to the time granularity. Coincident multiple faults appear during the same time interval; distinct-time faults appear in different time intervals. Independent faults are faults originating from different causes or natures. Common mode faults, in the context of multiple faults, are faults that have the same cause and are present in multiple components.

Table 1: Fault classification (source: [Suri 95])

Criteria                   Fault
Activity                   Latent vs. Active
Duration                   Transient vs. Permanent
Perception                 Symmetric vs. Asymmetric
Cause                      Random vs. Generic
Intent                     Benign vs. Malicious
Count                      Single vs. Multiple
Time (multiple faults)     Coincident vs. Distinct
Cause (multiple faults)    Independent vs. Common Mode

The selection of the fault tolerance techniques used in a system depends on the requirements of the application. Fault tolerance is used in a varied set of applications. These include critical, long-life, delayed-maintenance, high-availability, and commercial applications:

• Critical applications require a high degree of confidence in the correct and safe operation of the computer system in order to prevent loss of life or damage to expensive machinery.

• Long-life applications require that computer systems operate as intended with a high probability when the time between scheduled maintenance is extremely long (e.g., on the order of years or tens of years).

• Delayed-maintenance applications involve situations where maintenance actions are extremely costly, inconvenient, or difficult to perform. For this reason the system must be designed to have a high probability of being able to continue operating without requiring unscheduled maintenance actions.

• High-availability applications require a very high probability that the system will be ready to provide the intended service when so requested. This type of system allows

frequent service interruptions if they are all short in duration.

• Commercial applications are typically less demanding than the previous applications. The main use of fault tolerance in these systems is to provide added value and prevent nuisance faults from affecting the perceived dependability from a user perspective.

The design of systems with fault tolerance capabilities to satisfy particular application requirements is a complex process loaded with theoretical and experimental analysis in order to find the most appropriate tradeoffs within the design space. [Suri 95] offers a high-level design paradigm (see Table 2) extracted from the more detailed description presented in [Avizienis 87]. System properties to be considered include dependability (i.e., reliability, availability, maintainability, etc.), performance, failure modes, environmental resilience, weight, cost, volume, power, design effort, and verification effort. In addition to these, development programs must also weigh in the development risks associated with using technologies that in theory could result in a better system but that could also drive the whole development effort to failure due to the inability of the design team to manage the complexity of the system within a reasonable time frame.

Table 2: Fault Tolerant System Design Paradigm (source: [Suri 95])

1. Identify the classes of faults expected over the life of the system.
2. Specify goals for the system dependability.
3. Partition the system into subsystems, both hardware and software, taking both performance and fault tolerance into account.
4. Select error detection and fault diagnosis algorithms for every subsystem.
5. Devise state recovery and fault removal techniques for every subsystem.
6. Integrate subsystem fault tolerance on a global (system-wide) scale.
7. Evaluate the effectiveness of fault tolerance and its relationship with performance.
8. Refine the design by iteration of steps 3 through 7.

Every fault tolerant design must deal with one or more of the following aspects ([Nelson 90], [Anderson 81]):

• Detection: A basic element of a fault tolerant design is error detection. Error detection is a critical prerequisite for other fault tolerant mechanisms.

• Containment: In order to be able to deal with the large number of possible effects of faults in a complex computer system, it is necessary to define confinement boundaries for the propagation of errors. Containment regions are usually arranged hierarchically throughout the modular structure of the system. Each boundary protects the rest of the system from errors that occur within it and enables the designer to count on a certain number of correctly operating components by means of which the system can continue to perform its function.

• Masking: For some applications, the timely flow of information is a critical design issue. In such cases, it is not possible to just stop the information processing to deal with detected errors. Masking is the dynamic correction of errors. In general, masking errors is difficult to perform inline with a complex component. Masking, however, is much simpler when redundant copies of the data in question are available.

• Diagnosis: After an error is detected, the system must assess its health in order to decide how to proceed. If the containment boundaries are highly secure, diagnosis is reduced to just identifying the enclosed components. If the established boundaries are not completely secure, then more involved diagnosis is required to identify which other areas are affected by propagated errors.

• Repair/reconfiguration: In general, systems do not actually try to repair component-level faults in order to continue operating. Because faults are either physical or design-related, repair techniques are based on finding ways to work around faults by either effectively removing the affected components from operation or by rearranging the activity within the system in order to prevent the activation of the faults.

• Recovery and Continued Service: After an error is detected, a system must be returned to proper service by ensuring an error-free state. This usually involves restoration to a previous or predefined state, or rebuilding the state by means of known-good external information.

Redundancy in computer systems is the use of resources beyond the minimum needed to deliver the specified services. Fault tolerance is achieved through the use of redundancy in the hardware, software, information, or time domain ([Johnson 96], [Nelson 90]). In what follows we present some basic concepts of hardware redundancy used to achieve hardware fault tolerance. Good examples of information domain redundancy for hardware fault tolerance are error detecting and correcting codes [Wicker 95]. Time redundancy is the repetition of computations in ways that allow faults to be detected [Johnson 96].

Hardware redundancy can be implemented in static, dynamic, or hybrid configurations. Static (or passive) redundancy techniques do not detect or explicitly perform any reactive action to control errors, but rather rely on masking to simply prevent their propagation beyond predefined error containment boundaries. Dynamic (or active) redundancy techniques use fault detection followed by diagnosis and reconfiguration. Masking is not used in dynamic redundancy; errors are handled by actively diagnosing error propagation and isolating or replacing faulty components. Hybrid redundancy techniques combine elements of both static and dynamic redundancy. In hybrid approaches, masking is used to prevent the propagation of errors, while error detection, diagnosis, and reconfiguration are used to handle faulty components.

Figure 14 is an example of passive hardware redundancy. Here the modules are replicated multiple times depending on the desired fault tolerance capability. A selection mechanism (usually a voter) is used to mask errors that reach the outputs of the modules. Figure 15 shows a different approach where the voters are moved to the input of the modules to eliminate the single point of failure that is the single voter in Figure 14. This configuration protects the computations performed by the replicated components but requires that redundant components reading the outputs use the same approach to prevent the propagation of errors and single point of failure.

Figure 14: Example of Passive Redundancy

Figure 15: Passive Redundancy with Input Voting

Figure 16 shows an active redundancy approach. In duplication with comparison, error detection is achieved by comparing the outputs of two modules performing the same function. If the outputs of the modules disagree, an error condition is raised followed by diagnosis and repair actions to return the system to operation. In a similar approach only one module would actually perform the intended function with the other component being a dissimilar monitor that checks the outputs looking for errors. Figure 17 shows four modules arranged in a self-checking pair configuration (or dual-dual configuration). In this configuration the comparators perform the error detection function. Normally the output is taken from one of the pairs known as the primary pair, with the other pair acting as a spare or backup. When an error on the primary is detected, the spare is brought online and the primary is taken offline for diagnosis and maintenance if necessary.

Figure 16: Dynamic Redundancy using Duplication with Comparison

Figure 17: Dynamic Redundancy using Self-Checking Pairs

Figure 18 shows an example of hybrid redundancy using an N-modular masking configuration with spares. Here we are combining the masking approach used in passive redundancy with the error detection, diagnosis, and reconfiguration used in dynamic approaches. The system in Figure 18 uses a set of primary modules to provide inputs to the voter to implement error masking. Simultaneously, an error detection component monitors the outputs of the active modules looking for errors. When an error is detected, the faulty module is taken offline for diagnosis and a spare module is brought online to participate in the error-masking configuration. Implemented properly, this configuration has better dependability characteristics than purely passive or active configurations. However, the cost and complexity are higher for the hybrid approach. The selection of one of the three approaches is highly dependent on the application.

Figure 18: Hybrid Redundancy using N-Modular Redundancy with Spares
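One masking-plus-reconfiguration cycle of Figure 18 can be sketched as follows. The module and spare callables are hypothetical, and disagreement with the voted value stands in for the error detection component of the figure.

    from collections import Counter

    def majority(values):
        value, count = Counter(values).most_common(1)[0]
        if count > len(values) // 2:
            return value
        raise RuntimeError("no majority")

    def hybrid_nmr_step(modules, spares, inputs):
        # One cycle of N-modular redundancy with spares: the voter masks
        # errors while each module's output is compared against the voted
        # value, and any disagreeing module is replaced by a spare.
        outputs = [module(inputs) for module in modules]
        voted = majority(outputs)
        for i, output in enumerate(outputs):
            if output != voted and spares:
                modules[i] = spares.pop(0)   # reconfigure: bring a spare online
        return voted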

It is worth noting that although redundancy is required for fault tolerance, it is not sufficient to just put a group of components together in a "fault tolerant" configuration. How the redundancy is used is as important as the redundancy itself in order to contribute to higher dependability. The following is quoted from [LalaJ 94]:

"Redundancy alone does not guarantee fault tolerance. The only thing it does guarantee is a higher fault arrival rate compared to a nonredundanct system of the same functionality. For a redundant system to continue correct operation in the presence of a fault, the redundancy must be managed properly. Redundancy management issues are deeply interrelated and determine not only the ultimate system reliability but also the performance penalty paid for fault tolerance."

5.2. Examples of Fault Tolerant Architectures

In this section we present two examples of fault tolerant architectures for safety critical applications. These architectures are used on the flight control computers of the fly-by-wire systems of two types of commercial jet transport aircraft. The first computer is used on the Boeing 777 airplane. The second computer is used on the AIRBUS A320/A330/A340 series aircraft.

5.2.1. B777 Primary Flight Control Computer

The fly-by-wire system of the Boeing 777 airplane departs from old-style mechanical systems that directly connect the pilot's control instruments to the external control surfaces. A fly-by-wire system (see Figure 19) enables the creation of artificial airplane flight characteristics that allow crew workload alleviation and flight safety enhancement, as well as simplified maintenance procedures through modularization and automatic periodic self-inspection ([Bleeg 88], [Hills 88], [Yeh 96], [Aleska 97], [McKinzie 96]).

Figure 19: Abstract Representation of a Fly-By-Wire Flight System

Some of the requirements for the 777 flight control computer include:

• No single fault of any kind should cause degradation below MIN-OP (i.e., minimum configuration to meet requirements)

• 1x10^-10 (i.e., 1 in 10 billion) probability of degrading below MIN-OP configuration due to random hardware faults, generic faults, or common mode faults

• No single fault should result in the transmission of erroneous outputs without a failure indication.

• Components should be located in separate compartments throughout the airplane to assure continued safe flight despite physical damage to the airplane and its systems.

"Never give up" redundancy management strategy for situations when the flight control computer degrades below MIN-OP configuration. This includes considerations for keeping the computer operational if there are any good resources, preventing improper

removal of resources, and recovering resources after being improperly removed.

• Fully automatic redundancy management

• Fully automatic Minimum Dispatch Configuration assessment prior to a flight

• Mean-Time-Between-Maintenance-Actions of 25,000 operating hours assuming 13.6 operating hours per day.

Figure 20 presents the architecture of the B777 flight control computer. It is a triple-triple configuration of three identical channels, each composed of three redundant computation lanes. The computers are connected to the flight control data buses that serve to exchange information among the fly-by-wire system components. Each channel transmits on a preassigned data bus and receives on all the buses. This setup enables the channels to communicate with each other without the possibility of one bad channel interrupting all the communications. The channels are placed in separate equipment bays on the aircraft to allow continued safe flight despite structural damage. Normally the lanes are arranged in a command-monitor-standby arrangement where one lane writes to the bus while the others monitor its operation. The spare lanes in each channel enable rapid reconfiguration in case of a lane failure. The lanes exchange information for redundancy management and for time and data synchronization in order to allow tighter cross-lane monitoring. When the command lane is declared bad, it is taken offline and one of the spare lanes is upgraded to the command assignment. Before sending a computed output to the actuators, the channels perform an exchange of their proposed output values, do a median select, and then finally declare the selected value as the actual computed control value. The channels also exchange information for critical variable equalization to ensure tracking of their outputs within acceptable bounds. The channels must also monitor the operation on the data buses to ensure that data flow is taking place according to data bus requirements.

The initial design of this flight control computer was a four by three configuration including hardware and software dissimilarity in all the channels [Hills 88]. Software diversity was to be achieved through the use of different programming languages targeting different lane processors. The final and current implementation uses only one programming language with the executable code being generated by three different compilers still targeting dissimilar lane processors. The lane processors are dissimilar because they are the single most complex hardware devices, and thus there is a perceived risk of design faults associated with their use.

Figure 20: Architecture of B777 Flight Control Computer

(Adapted from [Hills 88] and [Yeh 96])

5.2.2. AIRBUS A320/A330/A340 Flight Control Computer

The requirements for the flight control computer on the Airbus A320/A330/A340 include many of the same considerations as in the B777 fly-by-wire system ([Traverse 91], [Briere 93]). The selected architecture, however, is much different. Figure 21 shows the architecture used on the Airbus aircraft. The basic building block is the fail-stop control and monitor module. Each module is composed of a control computer performing the flight control function and a completely independent monitoring computer performing functions not necessarily identical to the flight control function. The specifications for the control and monitoring computers are developed independently from a common functional specification for the computer module. The software for the control and monitoring computers is designed and built by independent design teams to reduce the likelihood of common design errors. As part of the software development, forced diversity rules are applied to ensure different designs for those areas deemed more complex (and thus more likely to have errors in the final design). The primary and secondary computer modules are designed by different manufacturers to reduce the likelihood of any kind of software or hardware generic errors. In effect, there are four dissimilar types of computers working together to perform the flight control function. In the basic configuration, the primary module sends its commands to the actuators, with the secondary module remaining in standby. When the primary module fails, it is taken offline and the secondary module takes over the command function. In addition, a second pair of modules (Primary 2 and Secondary 2 in Figure 21) is also available and sending commands to redundant actuators. At any particular time, only one computer module is driving a given control surface. Upon detection of a computer or actuator failure, control is passed to another computer based on a predetermined handover sequence.

Figure 21: Architecture of A3XX Flight Control Computer

(Adapted from [Traverse 91])

6. Summary and Concluding Remarks

In this paper we have presented a review of software fault tolerance. We gave a brief overview of the software development processes and noted how hard-to-detect design faults are likely to be introduced during development. We noted how software faults tend to be state-dependent and activated by particular input sequences. Although component reliability is an important quality measure for system-level analysis, software reliability is hard to estimate and the use of post-verification reliability estimates remains a controversial issue. For some applications software safety is more important than reliability, and fault tolerance techniques used in those applications are aimed at preventing catastrophes. Single version software fault tolerance techniques discussed include system structuring and closure, atomic actions, inline fault detection, exception handling, and checkpoint and restart. Process pairs exploit the state dependence characteristic of most software faults to allow uninterrupted delivery of services despite the activation of faults. Similarly, data diversity aims at preventing the activation of design faults by trying multiple alternate input sequences. Multiversion techniques are based on the assumption that software built differently should fail differently and thus, if one of the redundant versions fails, at least one of the others should provide an acceptable output. Recovery blocks, N-version programming, N self-checking programming, consensus recovery blocks, and t/(n-1)-variant techniques were presented. Special consideration was given to multiversion software development and output selection algorithms. Operating systems must be given special treatment when designing a fault tolerant software system because of the cost and complexity associated with their development, as well as their criticality for correct system functionality. Software fault injection was presented as a technique to experimentally assess the robustness of software to design faults and errors. Finally, we presented a brief high-level overview of fault tolerant computer design followed by the review of two safety-critical flight control computer systems.
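As a compact illustration of the output selection step recapped above, the following C sketch implements a simple majority voter over the outputs of N versions. The names and the agreement tolerance are assumptions of the example; practical voters must also deal with issues such as the consistent comparison problem [Brilliant 89].

#include <math.h>
#include <stdbool.h>

#define N_VERSIONS 3       /* number of independently developed versions */
#define VOTE_TOL   0.001   /* assumed tolerance for judging agreement */

/* Returns true and sets *out when some version's output agrees with a
 * strict majority of the outputs (including itself). */
static bool majority_vote(const double v[N_VERSIONS], double *out)
{
    for (int i = 0; i < N_VERSIONS; i++) {
        int agree = 0;
        for (int j = 0; j < N_VERSIONS; j++)
            if (fabs(v[i] - v[j]) <= VOTE_TOL)
                agree++;
        if (2 * agree > N_VERSIONS) {
            *out = v[i];        /* majority found: select this output */
            return true;
        }
    }
    return false;               /* no majority: report a voter failure */
}

When no majority exists, the voter reports failure so that a higher-level mechanism (for example, a recovery block acceptance test or a system-level reconfiguration) can respond.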

Because of our present inability to produce error-free software, software fault tolerance is and will continue to be an important consideration in software systems. The root cause of software design errors is the complexity of the systems. Compounding the problems in building correct software is the difficulty in assessing the correctness of software for highly complex systems. Current research in software engineering focuses on establishing patterns in the software structure and trying to understand the practice of software engineering [Weinstock 97]. It is expected that software fault tolerance research will benefit from this research by enabling greater predictability of the dependability of software [Weinstock 97].

7. Bibliography

[Abbott 90] Russell J. Abbott, Resourceful Systems for Fault Tolerance, Reliability, and Safety, ACM Computing Surveys, Vol. 22, No. 1, March 1990, pp. 35 - 68.

[Aleska 97] Brian D. Aleska and Joseph P. Carter, Boeing 777 Airplane Information Management System Operational Experience, AIAA/IEEE Digital Avionics Systems Conference, Vol. II, 1997, pp. 3.1-21 - 3.1-27.

[Ammann 88] Paul E. Ammann and John C. Knight, Data Diversity: An Approach to Software Fault Tolerance, IEEE Transactions on Computers, Vol. 37, No. 4, April 1988, pp. 418 - 425.

[Anderson 81] T. Anderson and P.A. Lee, Fault Tolerance: Principles and Practice, Prentice-Hall, 1981.

[Anderson 85A] T. Anderson, Ed., Resilient Computing Systems, Vol. I, John Wiley & Sons, 1985.

[Anderson 85B] T. Anderson, et al, An Evaluation of Software Fault Tolerance in a Practical System, Digest of Papers FTCS-15: The Fifteenth Annual International Symposium on Fault-Tolerant Computing, June 1985, pp. 140 - 145.

[Anderson 86] Tom Anderson, A Structured Mechanism for Diverse Software, Proceedings of the Fifth Symposium on Reliability in Distributed Software and Database Systems, January 1986, pp. 125 - 129.

[Andrews 79A] Dorothy M. Andrews, Using Executable Assertions for Testing and Fault Tolerance, Digest of Papers: The Ninth Annual International Symposium on Fault-Tolerant Computing (FTCS 9), June 1979, pp. 102 - 105.

[Andrews 79B] D. M. Andrews, Software Fault Tolerance Through Executable Assertions, 12th Asilomar Conference on Circuits and Systems and Computers, November 1979, pp. 641 - 645.

[Arlat 90] Jean Arlat, et al, Fault Injection for Dependability Validation: A Methodology and Some Applications, IEEE Transactions on Software Engineering, Vol. 16, No. 2, February 1990, pp. 166 - 182.

[Avizienis 77] A. Avizienis and L. Chen, On the Implementation of N-Version Programming for Software Fault Tolerance During Execution, Proceedings of the IEEE COMPSAC'77, November 1977, pp. 149 - 155.

[Avizienis 85A] A. Avizienis, et al, The UCLA DEDIX System: A Distributed Testbed for Multiple-Version Software, Digest of Papers: The Fifteenth Annual International Symposium on Fault-Tolerant Computing (FTCS 15), Ann Arbor, Michigan, June 19 - 21, 1985, pp. 126 - 134.

[Avizienis 85B] Algirdas Avizienis, The N-Version Approach to Fault-Tolerant Software, IEEE Transactions on Software Engineering, Vol. SE-11, No. 12, December 1985, pp. 290 - 300.

[Avizienis 86] Algirdas Avizienis and Jean-Claude Laprie, Dependable Computing: From Concepts to Design Diversity, Proceedings of the IEEE, Vol. 74, No. 5, May 1986, pp. 629 - 638.

[Avizienis 87] Algirdas Avizienis, A Design Paradigm for Fault Tolerant Systems, Proceedings of the AIAA/IEEE Digital Avionics Systems Conference (DASC), Washington, D.C., 1987.

[Avizienis 88] Algirdas Avizienis, In Search of Effective Diversity: A Six-Language Study of Fault-Tolerant Flight Control Software, Digest of Papers FTCS-18: The Eighteenth International Symposium on Fault-Tolerant Computing, June 27 - 30, 1988, pp. 15 - 22.

[Avizienis 89] Algirdas Avizienis, Software Fault Tolerance, Information Processing 89, Proceedings of the IFIP 11th World Computer Congress, 1989, pp. 491 - 498.

[Avizienis 95A] Algirdas Avizienis, Dependable Computing Depends on Structured Fault Tolerance, Proceedings of the 1995 6th International Symposium on Software Reliability Engineering, Toulouse, France, 1995, pp. 158 - 168.

[Avizienis 95B] Algirdas Avizienis, The Methodology of N-Version Programming, in Michael R. Lyu, editor, Software Fault Tolerance, John Wiley & Sons, 1995.

[Avizienis 97] Algirdas Avizienis, Toward Systematic Design of Fault-Tolerant Systems, Computer, April 1997, pp. 51 - 58.

[Avresky 92] Dimitri Avresky, et al, Fault Injection for the Formal Testing of Fault Tolerance, Digest of Papers of the Twenty-Second International Symposium on Fault-Tolerant Computing, Boston, Massachusetts, July 8 - 10, 1992, pp. 345 - 354.

[Ayache 96] S. Ayache, et al, Formal Methods for the Validation of Fault Tolerance in Autonomous Spacecraft, Proceedings of the Twenty-Sixth International Symposium on Fault-Tolerant Computing, Sendai, Japan, June 25 - 27, 1996, pp. 353 - 357.

[Babikyan 90] Carol A. Babikyan, The Fault Tolerant Processor Operating System Concepts and Performance Measurement Overview, IEEE/AIAA/NASA 9th Digital Avionics Systems Conference, October 1990, pp. 366 - 371.

[Bass 95] J. M. Bass, Voting in Real-Time Distributed Computer Control Systems, PhD Thesis, University of Sheffield, October 1995.

[Bass 97] J. M. Bass, et al, Experimental Comparison of Voting Algorithms in Cases of Disagreement, Proceedings of the 1997 23rd EUROMICRO Conference, 1997, pp. 516 - 523.

[Beedubail 95] Ganesha Beedubail, et al, An Algorithm for Supporting Fault Tolerant Objects in Distributed Object Oriented Operating Systems, Proceedings of the 4th International Workshop on Object Orientation in Operating Systems, 1995, pp. 142 - 148.

[Bishop 95] Peter Bishop, Software Fault Tolerance by Design Diversity, in Michael R. Lyu, editor, Software Fault Tolerance, John Wiley & Sons, 1995.

[Black 80] J. P. Black, et al, Introduction to Robust Data Structures, Digest of Papers FTCS-10: The Tenth Annual International Symposium on Fault-Tolerant Computing, October 1 - 3, 1980, pp. 110 - 112.

[Black 81] J. P. Black, et al, A Compendium of Robust Data Structures, Digest of Papers FTCS-11: The Eleventh Annual International Symposium on Fault-Tolerant Computing, June 24 - 26, 1981, pp. 129 - 131.

[Bleeg 88] Robert J. Bleeg, Commercial Jet Transport Fly-By-Wire Architecture Considerations, AIAA/IEEE 8th Digital Avionics Systems Conference, October 1988, pp. 399 - 406.

[Blough 90] Douglas M. Blough and Gregory F. Sullivan, A Comparison of Voting Strategies for Fault-Tolerant Distributed Systems, Proceedings of the 9th Symposium on Reliable Distributed Systems, Huntsville, AL, 1990, pp. 136 - 145.

[Boi 81] L. Boi, et al, Exception Handling and Error Recovery Techniques in Modular Systems - An Application to the ISAURE System -, Digest of Papers FTCS-11: The Eleventh Annual International Symposium on Fault-Tolerant Computing, 1981, pp. 62 - 64.

[Bressoud 98] Thomas C. Bressoud, TFT: A Software System for Application-Transparent Fault Tolerance, Digest of Papers: Twenty-Eighth Annual International Symposium on Fault-Tolerant Computing, Munich, Germany, June 23 - 25, 1998, pp. 128 - 137.

[Briere 93] Dominique Briere and Pascal Traverse, AIRBUS A320/A330/A340 Electrical Flight Controls: A Family of Fault-Tolerant Systems, Digest of Papers FTCS-23: The Twenty-Third International Symposium on Fault-Tolerant Computing, June 1993, pp. 616 - 623.

[Brilliant 89] Susan S. Brilliant, et al, The Consistent Comparison Problem in N-Version Software, IEEE Transactions on Software Engineering, Vol. 15, No. 11, November 1989, pp. 1481 - 1485.

[Broen 75] R. B. Broen, New Voters for Redundant Systems, Transactions of the ASME, Journal of Dynamic Systems, Measurement, and Control, March 1975, pp. 41 - 45.

[Brooks 87] Frederick P. Brooks, Jr., No Silver Bullet: Essence and Accidents of Software Engineering, IEEE Computer, March 1987, pp. 10 - 19.

[Butler 91] Ricky W. Butler and George B. Finelli, The Infeasibility of Quantifying the Reliability of Life-Critical Real-Time Software, IEEE Transactions on Software Engineering, Vol. 19, No. 1, January 1993, pp. 3 - 12.

[Caglayan 85] Alper K. Caglayan and Dave E. Eckhardt, Jr., Systems Approach to Software Fault Tolerance, AIAA Paper 85-6018, 1985.

[Cha 87] Sung D. Cha, An Empirical Study of Software Error Detection Using Self-Checks, Digest of Papers FTCS-17: The Seventeenth International Symposium on Fault-Tolerant Computing, July 6 - 8, 1987, pp. 156 - 161.

[Chandra 98] Subhachandra Chandra and Peter M. Chen, How Fail-Stop are Faulty Programs?, Digest of Papers: Twenty-Eighth Annual International Symposium on Fault-Tolerant Computing, Munich, Germany, June 23 - 25, 1998, pp. 240 - 249.

[Chang 94] Elizabeth Chang and Tharam S. Dillon, Achieving Software Reliability and Fault Tolerance Using the Object Oriented Paradigm, Computer Systems Science and Engineering, Vol. 9, No. 2, April 1994, pp. 118 - 121.

[Chisholm 99] G. H. Chisholm and A. S. Wojcik, An Application of Formal Analysis to Software in a Fault-Tolerant Environment, IEEE Transactions on Computers, Vol. 48, No. 10, October 1999, pp. 1053 - 1064.

[Christmansson 94] J. Christmansson, et al, Dependable Flight Control System Using Data Diversity with Error Recovery, Computer Systems Science and Engineering, Vol. 9, No. 2, April 1994, pp. 142 - 150.

[Christmansson 98] J. Christmansson, et al, Dependable Flight Control System Using Data Diversity with Error Recovery, Doktorsavhandlingar vid Chalmers Tekniska Hogskola, No. 1362, Sweden, 1998, pp. A.1 - A.10.

[Cristian 80] Flaviu Cristian, Exception Handling and Software-Fault Tolerance, Digest of Papers FTCS-10: The Tenth International Symposium on Fault-Tolerant Computing, October 1 - 3, 1980, pp. 97 - 103.

[Cristian 82] Flaviu Cristian, Exception Handling and Software Fault Tolerance, IEEE Transactions on Computers, Vol. C-31, No. 6, June 1982, pp. 531 - 540.

[Croll 95] P. R. Croll, et al, Dependable, Intelligent Voting for Real-Time Control Software, Engineering Applications of Artificial Intelligence, Vol. 8, No. 6, December 1995, pp. 615 - 623.

[Chou 97] Timothy C. K. Chou, Beyond Fault Tolerance, IEEE Computer, April 1997, pp. 47 - 49.

[Damm 86] Andreas Damm, The Effectiveness of Software Error-Detection Mechanisms in Real-Time Operating Systems, Digest of Papers: 16th Annual International Symposium on Fault-Tolerant Computing Systems (FTCS 16), Vienna, Austria, July 1 - 4, 1986, pp. 171 - 176.

[Daughan 94] Michael G. Daughan, Seawolf Submarine Ship Control System: A Case Study of a Fault-Tolerant Design, Naval Engineering Journal, January 1994, pp. 54 - 70.

[David 93] Philippe David and Claude Guidal, Development of a Fault Tolerant Computer System for the HERMES Space Shuttle, Digest of Papers: The Twenty-Third International Symposium on Fault-Tolerant Computing (FTCS 23), Toulouse, France, June 22 - 24, 1993, pp. 641 - 646.

[Deck 98] Michael Deck, Software Reliability and the "Cleanroom" Approach: A Position Paper, Proceedings of the Annual Reliability and Maintainability Symposium, Anaheim, CA, January 19 - 22, 1998, pp. 218 - 223.

[Denning 76] Peter J. Denning, Fault Tolerant Operating Systems, ACM Computing Surveys, Vol. 8, No. 4, December 1976, pp. 359 - 389.

[Dingman 95] Christopher P. Dingman, et al, Measuring Robustness of a Fault Tolerant Aerospace System, Digest of Papers: The Twenty-Fifth International Symposium on Fault-Tolerant Computing, Pasadena, CA, June 27 - 30, 1995, pp. 522 - 527.

[Dong 99] Libin Dong, et al, Implementation of a Transient-Fault-Tolerance Scheme on DEOS, Proceedings of the 1999 5th IEEE Real-Time Technology and Applications Symposium, 1999, pp. 56 - 65.

[Doyle 95] Stacy A. Doyle and Jane Latin Mackey, Comparative Analysis of Two Architectural Alternatives for the N-Version Programming (NVP) System, Proceedings of the 1995 Annual Reliability and Maintainability Symposium, 1995, pp. 275 - 282.

[DO178B] Software Considerations in Airborne Systems and Equipment Certification, RTCA/DO-178B, RTCA, Inc., 1992.

[Echtle 95] Klaus Echtle and Tomislav Lovric, Hardware and Software Fault Tolerance Using Fail-Silent Virtual Duplex Systems, Proceedings of the 1995 Conference on Fault Tolerant Parallel and Distributed Systems, 1995, pp. 10 - 17.

[Eckhardt 85A] Dave E. Eckhardt and Larry D. Lee, A Theoretical Basis for the Analysis of Redundant Software Subject to Coincident Errors, National Aeronautics and Space Administration (NASA), Technical Memorandum 86369, January 1985.

[Eckhardt 85B] Dave E. Eckhardt and Larry D. Lee, An Analysis of the Effects of Coincident Errors on Multi-Version Software, AIAA/ACM/NASA/IEEE Computers in Aerospace V Conference, October 21 - 23, 1985, pp. 370 - 373.

[Eckhardt 88] Dave E. Eckhardt and Larry D. Lee, Fundamental Differences in the Reliability of N-Modular Redundancy and N-Version Programming, The Journal of Systems and Software 8, 1988, pp. 313 - 318.

[Eckhardt 91] Dave E. Eckhardt, et al, An Experimental Evaluation of Software Redundancy as a Strategy for Improving Reliability, IEEE Transactions on Software Engineering, Vol. 17, No. 7, July 1991, pp. 692 - 702.

[Elmendorf 72] William R. Elmendorf, Fault-Tolerant Programming, Digest of Papers FTCS-2: The 2nd Annual International Symposium on Fault Tolerant Computing, June 1972, pp. 79 - 83.

[Everett 98] William Everett, Samuel Keene and Allen Nikora, Applying Software Reliability Engineering in the 1990s, IEEE Transactions on Reliability, Vol. 47, No. 3-SP, September 1998, pp. 372 - 378.

[Fabre 88] Jean-Charles Fabre, et al, Saturation: Reduced Idleness for Improved Fault-Tolerance, Digest of Papers: The Eighteenth International Symposium on Fault-Tolerant Computing (FTCS 18), June 1988, pp. 200 - 205.

[Fabre 99] J. C. Fabre, et al, Assessment of COTS Microkernels by Fault Injection, Proceedings IFIP DCCA-7, 1999, pp. 19 - 38.

[Fraser 99] Timothy Fraser, et al, Hardening COTS Software with Generic Software Wrappers, Proceedings of the 1999 IEEE Computer Society Symposium on Research in Security and Privacy, 1999, pp. 2 - 16.

[Gersting 91] Judith Gersting, et al, A Comparison of Voting Algorithms for N-Version Programming, Proceedings of the 24th Annual Hawaii International Conference on System Sciences, Volume II, January 1991, pp. 253 - 262.

[Gluch 86] David P. Gluch and Michael J. Paul, Fault-Tolerance in Distributed Fly-By-Wire Flight Control Systems, Proceedings IEEE/AIAA 7th Digital Avionics Systems Conference, October 1986, pp. 507 - 514.

[Goseva 93] K. Goseva-Popstojanova and A. Grnarov, N Version Programming with Majority Decision: Dependability Modeling and Evaluation, Microprocessing and Microprogramming, Vol. 38, No. 1 - 5, September 1993, pp. 811 - 818.

[Gray 86] Jim Gray, Why Do Computers Stop and What Can Be Done About It?, Proceedings of the Fifth Symposium on Reliability in Distributed Software and Database Systems, January 13 - 15, 1986, pp. 3 - 12.

[Greeley 87] Gregory L. Greeley, The Effects of Voting Algorithms on N-Version Software Reliability, Master of Science Thesis, Massachusetts Institute of Technology, June 1987.

[Guerraoui 97] Rachid Guerraoui and Andre Schiper, Software-Based Replication for Fault Tolerance, IEEE Computer, April 1997, pp. 68 - 74.

[Hamilton 96] Major Lynne Hamilton-Jones, et al, Software Technology for Next-Generation Strike Fighter Avionics, Proceedings of the 1996 15th AIAA/IEEE Digital Avionics Systems Conference, October 1996, pp. 37 - 42.

[Hamlet 93] Dick Hamlet and Jeff Voas, Faults in Its Sleeve: Amplifying Software Reliability Testing, Proceedings of the 1993 International Symposium on Software Testing and Analysis (ISSTA), June 1993, pp. 88 - 98.

[Hecht 79] Herbert Hecht, Fault-Tolerant Software, IEEE Transactions on Reliability, Vol. R-28, No. 3, August 1979, pp. 227 - 232.

[Hecht 86] Herbert Hecht and Myron Hecht, Software Reliability in the System Context, IEEE Transactions on Software Engineering, Vol. SE-12, No. 1, January 1986, pp. 51 - 58.

[Hecht 96] Herbert Hecht and Myron Hecht, Fault-Tolerance in Software, in Fault-Tolerant Computer System Design, Dhiraj K. Pradhan, Prentice Hall, 1996.

[Hills 88] Andy D. Hills and Dr. Nisar A. Mirza, Fault Tolerant Avionics, AIAA/IEEE 8th Digital Avionics Systems Conference, October 1988, pp. 407 - 414.

[Huang 84] Kuang-Hua Huang and Jacob A. Abraham, Algorithm-Based Fault Tolerance for Matrix Operations, IEEE Transactions on Computers, Vol. C-33, No. 6, June 1984, pp. 518 - 528.

[IEEE 93] IEEE Standards Collection: Software Engineering, IEEE Standard 610.12-1990, IEEE, 1993.

[Iyer 96] Ravishankar K. Iyer and Dong Tang, Experimental Analysis of Computer System Dependability, in Fault Tolerant Computer System Design, Dhiraj K. Pradhan, Prentice Hall, 1996, pp. 282 - 392.

[Jewett 91] Doug Jewett, Integrity S2: A Fault-Tolerant Unix Platform, Digest of Papers Fault-Tolerant Computing: The Twenty-First International Symposium, Montreal, Canada, June 25 - 27, 1991, pp. 512 - 519.

[Johnson 96] Barry W. Johnson, An Introduction to the Design and Analysis of Fault-Tolerant Systems, in Fault-Tolerant Computer System Design, Dhiraj K. Pradhan, Prentice Hall, Inc., 1996, pp. 1 - 87.

[Kanekawa 89] Nobuyasu Kanekawa, et al, Dependable Onboard Computer Systems with a New Method - Stepwise Negotiating Voting, Digest of Papers: The Nineteenth International Symposium on Fault-Tolerant Computing (FTCS-19), 1989, pp. 13 - 19.

[Kao 93] Wei-lun Kao, et al, FINE: A Fault Injection and Monitoring Environment for Tracing the UNIX System Behavior Under Faults, IEEE Transactions on Software Engineering, Vol. 19, No. 11, November 1993, pp. 1105 - 1118.

[Kelly 86] John P. J. Kelly, et al, Multi-Version Software Development, Proceedings of the Fifth IFAC Workshop, Safety of Computer Control Systems, October 1986, pp. 43 - 49.

[Kelly 91] John P. J. Kelly, Implementing Design Diversity to Achieve Fault Tolerance, IEEE Software, July 1991, pp. 61 - 71.

[Kersken 92] M. Kersken and F. Saglietti, editors, Software Fault Tolerance: Achievement and Assessment Strategies, Springer-Verlag, 1992.

[Kieckhafer 87] R. M. Kieckhafer, Task Reconfiguration in a Distributed Real-Time System, Proceedings Real-Time Systems Symposium, December 1987, pp. 25 - 32.

[Kieckhafer 88] Roger M. Kieckhafer, et al, The MAFT Architecture for Distributed Fault Tolerance, IEEE Transactions on Computers, Vol. 37, No. 4, April 1988, pp. 398 - 405.

[Kieckhafer 89] Roger M. Kieckhafer, Fault-Tolerant Real-Time Task Scheduling in the MAFT Distributed System, Proceedings of the Twenty-Second Annual Hawaii International Conference on System Sciences: Architecture Track, 1989, pp. 143 - 151.

[KimH 95] Hyun C. Kim and V.S.S. Nair, Application Layer Software Fault Tolerance for Distributed Object-Oriented Systems, Proceedings of the Nineteenth Annual International Computer Software & Applications Conference (COMPSAC '95), August 1995, pp. 199 - 204.

[KimH 97] Hyun C. Kim and V.S.S. Nair, Software Fault Tolerance for Distributed Object Based Computing, Journal of Systems and Software, Vol. 39, No. 2, November 1997, pp. 103 - 117.

[KimK 97] K. H. Kim and Chittur Subbaraman, Fault-Tolerant Real-Time Objects, Communications of the ACM, Vol. 40, No. 1, January 1997, pp. 75 - 82.

[Knight 85] J. C. Knight, et al, A Large Scale Experiment in N-Version Programming, Digest of Papers FTCS-15: The 15th Annual International Conference on Fault Tolerant Computing, June 1985, pp. 135 - 139.

[Knight 86] J. C. Knight and Nancy G. Leveson, An Experimental Evaluation of the Assumption of Independence in Multiversion Programming, IEEE Transactions on Software Engineering, Vol. SE-12, No. 1, January 1986, pp. 96 - 109.

[Knight 91] J. C. Knight and P. E. Ammann, Design Fault Tolerance, Reliability Engineering and System Safety, Vol. 32, No. 1-2, 1991, pp. 25 - 49.

[Kojo 88] Takashi Kojo, et al, Fault Tolerant Real-Time Operating System for 32 Bit Microprocessor V69/V70, NEC Research and Development, No. 89, April 1988, pp. 123 - 129.

[Koo 87] Richard Koo and Sam Toueg, Checkpointing and Rollback-Recovery for Distributed Systems, IEEE Transactions on Software Engineering, Vol. SE-13, No. 1, January 1987, pp. 23 - 31.

[Koopman 97] Philip Koopman, et al, Comparing Operating Systems Using Robustness Benchmarks, Proceedings of the 1997 16th IEEE Symposium on Reliable Distributed Systems, October 1997, pp. 72 - 79.

[Koopman 99] Philip Koopman and John DeVale, Comparing the Robustness of POSIX Operating Systems, Digest of Papers: Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing, Madison, Wisconsin, June 15 - 18, 1999, pp. 30 - 37.

[Kopetz 99] Hermann Kopetz and Dietmar Millinger, The Transparent Implementation of Fault Tolerance in the Time-Triggered Architecture, Dependable Computing for Critical Applications 7, Volume 12, A. Avizienis, H. Kopetz, J.C. Laprie, editors, 1999, pp. 192 - 205.

[Kropp 98] Nathan P. Kropp, et al, Automated Robustness Testing of Off-the-Shelf Software Components, Digest of Papers: Twenty-Eighth Annual International Symposium on Fault-Tolerant Computing, June 1998, pp. 230 - 239.

[Lai 95] Ming-Yee Lai and Steve Y. Wang, Software Fault Insertion Testing for Fault Tolerance, in Software Fault Tolerance, Michael R. Lyu, editor, John Wiley & Sons, 1995, pp. 315 - 333.

[LalaJ 86] Jaynarayan H. Lala, A Byzantine Resilient Fault Tolerant Computer for Nuclear Power Plant Applications, Digest of Papers: 16th Annual International Symposium on Fault-Tolerant Computing Systems (FTCS-16), Vienna, Austria, July 1 - 4, 1986, pp. 338 - 343.

[LalaJ 88] Jaynarayan H. Lala and Linda S. Alger, Hardware and Software Fault Tolerance: A Unified Architectural Approach, Digest of Papers FTCS-18: The Eighteenth International Symposium on Fault-Tolerant Computing, June 1988, pp. 240 - 245.

[LalaJ 91] Jaynarayan H. Lala, et al, A Design Approach for Ultrareliable Real-Time Systems, IEEE Computer, Vol. 24, No. 5, May 1991, pp. 12 - 22.

[LalaJ 93] Jaynarayan H. Lala and Richard E. Harper, Reducing the Probability of Common-Mode Failure in the Fault Tolerant Parallel Processor, AIAA/IEEE 12th Digital Avionics Systems Conference, October 1993, pp. 221 - 230.

[LalaJ 94] Jaynarayan H. Lala and Richard E. Harper, Architectural Principles for Safety-Critical Real-Time Applications, Proceedings of the IEEE, Vol. 82, No. 1, January 1994, pp. 25 - 40.

[LalaP 91] P. K. Lala, On Self-Checking Software Design, IEEE Proceedings of the Southeastcon, Vol. 1, 1991, pp. 331 - 335.

[Laprie 87] Jean-Claude Laprie, et al, Hardware- and Software-Fault Tolerance: Definition and Analysis of Architectural Solutions, Digest of Papers FTCS-17: The Seventeenth International Symposium on Fault-Tolerant Computing, July 1987, pp. 116 - 121.

[Laprie 90] Jean-Claude Laprie, et al, Definition and Analysis of Hardware- and Software-Fault-Tolerance Architectures, IEEE Computer, July 1990, pp. 39 - 51.

[Laprie 92] J.C. Laprie, editor, Dependability: Basic Concepts and Terminology, International Federation for Information Processing (IFIP), WG10.4 Dependable Computing and Fault Tolerance, Springer-Verlag, 1992.

[Laprie 94] J.C. Laprie, How Much is Safety Worth?, Proceedings of the IFIP 13th World Computer Congress, Vol. III, n. A-53, 1994, pp. 251 - 253.

[Laprie 95] J.C. Laprie, et al, Architectural Issues in Software Fault Tolerance, in Software Fault Tolerance, Michael R. Lyu, editor, Wiley, 1995, pp. 47 - 80.

[LeeI 93A] Inhwan Lee, et al, Measurement-Based Evaluation of Operating System Fault Tolerance, IEEE Transactions on Reliability, Vol. 42, No. 2, June 1993, pp. 238 - 249.

[LeeI 93B] Inhwan Lee and Ravishankar K. Iyer, Faults, Symptoms, and Software Fault Tolerance in Tandem GUARDIAN90 Operating System, Digest of Papers: The Twenty-Third International Symposium on Fault-Tolerant Computing (FTCS- 23), Toulouse, France, June 22 - 24, 1993, pp. 20 - 29.

[LeeI 95] Inhwan Lee and Ravishankar K. Iyer, Software Dependability in the Tandem GUARDIAN System, IEEE Transactions on Software Engineering, Vol. 21, No. 5, May 1995, pp. 455 - 467.

[LeeI 96] Inhwan Lee, et al, Efficient Service of Rediscovered Software Problems, Proceedings of the Twenty-Sixth International Symposium on Fault Tolerant Computing, Sendai, Japan, June 25 - 27, 1996, pp. 348 - 352.

[LeeP 83] P.A. Lee, Structuring Software Systems for Fault Tolerance, AIAA Paper 83-2322, 1983.

[LeeY 98] Yann-Hang Lee, et al, An Integrated Scheduling Mechanism for Fault-Tolerant Modular Avionics Systems, Proceedings of the 1998 IEEE Aerospace Applications Conference, Vol. 4, 1998, pp. 21 - 29.

[Levendel 89] Y. Levendel, Defects and Reliability Analysis of Large Software Systems: Field Experience, Digest of Papers: The Nineteenth International Symposium on Fault-Tolerant Computing (FTCS-19), 1989, pp. 238 - 244.

[Leveson 83] Nancy G. Leveson, Software Fault Tolerance: The Case for Forward Recovery, AIAA Paper 83-2327, 1983.

[Leveson 86] Nancy G. Leveson, Software Safety: Why, What, and How, ACM Computing Surveys, Vol. 18, No. 2, June 1986, pp. 125 - 163.

[Leveson 95] Nancy G. Leveson, Safeware: System Safety and Computers, Addison-Wesley, 1995.

[Leung 97] Yiu-Wing Leung, Processor Assignment and Execution Sequence for Multiversion Software, IEEE Transactions on Computers, Vol. 46, No. 12, December 1997, pp. 1371 - 1377.

[Littlewood 89] Bev Littlewood and Douglas R. Miller, Conceptual Modeling of Coincident Failures in Multiversion Software, IEEE Transactions on Software Engineering, Vol. 15, No. 12, December 1989, pp. 1596 - 1614.

[Littlewood 93] Bev Littlewood and Lorenzo Strigini, Validation of Ultrahigh Dependability for Software-Based Systems, Communications of the ACM, Vol. 36, No. 11, November 1993, pp. 69 - 80.

[Littlewood 96] Bev Littlewood, The Impact of Diversity Upon Common Mode Failures, Reliability Engineering and System Safety, Vol. 51, No. 1, 1996, pp. 101 - 113.

[Lorczak 89] Paul R. Lorczak, et al, A Theoretical Investigation of Generalized Voters for Redundant Systems, Digest of Papers FTCS-19: The Nineteenth International Symposium on Fault-Tolerant Computing, 1989, pp. 444 - 451.

[Lyu 92] Michael R. Lyu, et al, Software Diversity Metrics and Measurements, Proceedings IEEE COMPSAC 1992, September 1992, pp. 69 - 78.

[Lyu 95] Michael R. Lyu, editor, Software Fault Tolerance, John Wiley & Sons, 1995.

[Lyu 99] Michael R. Lyu and Veena B. Mendiratta, Software Fault Tolerance in a Clustered Architecture: Techniques and Reliability Modeling, Proceedings of the 1999 IEEE Aerospace Conference, Vol. 5, 1999, pp. 141 - 150.

[Mahmood 84] Aamer Mahmood, et al, Executable Assertions and Flight Software, Proceedings of the 6th AIAA/IEEE Digital Avionics Systems Conference, 1984, pp. 346 - 351.

[Mancini 89] L. V. Mancini and S. K. Shrivastava, Replication within Atomic Actions and Conversations: A Case Study in Fault-Tolerance Duality, Digest of Papers FTCS-19: The Nineteenth International Symposium on Fault-Tolerant Computing, 1989, pp. 454 - 461.

[Maxion 98] Roy A. Maxion and Robert T. Olszewski, Improving Software Robustness with Dependability Cases, Digest of Papers: Twenty-Eighth Annual International Symposium on Fault-Tolerant Computing, Munich, Germany, June 23 - 25, 1998, pp. 346 - 355.

[Mazza 96] C. Mazza, et al, Software Engineering Guides, Prentice Hall, 1996.

[McAllister 90] David McAllister, et al, Reliability of Voting in Fault-Tolerant Software Systems for Small Output-Spaces, IEEE Transactions on Reliability, Vol. 39, No. 5, December 1990, pp. 524 - 534.

[McElvany 88] Michelle C. McElvany, Guaranteeing Deadlines in MAFT, Proceedings of the Real-Time Systems Symposium, Vol. 35, No. 6, 1988, pp. 130 - 139.

[McKinzie 96] Gordon McKinzie, Summing Up the 777's First Year: Is This a Great Airplane, or What?, Airliner, July - September 1996, pp. 22 - 25.

[Melliar 83] Peter Michael Melliar-Smith, Development of Software Fault-Tolerance Techniques, NASA Contractor Report 172122, March 1983.

[Metze 81] Gernot Metze and Ali Mili, Self-Checking Programs: An Axiomatisation of Program Validation by Executable Assertions, Digest of Papers FTCS-11: The Eleventh Annual International Symposium on Fault-Tolerant Computing, June 24 - 26, 1981, pp. 118 - 120.

[Migneault 83] Gerard E. Migneault, On Requirements for Software Fault Tolerance for Flight Controls, AIAA Paper 83-2321, 1983.

[Moser 95] L.E. Moser, et al, The Totem System, Digest of Papers: The Twenty-Fifth International Symposium on Fault-Tolerant Computing, June 1995, pp. 61 - 66.

[Mukherjee 97]

[Mullender 90] Sape J. Mullender, et al, Amoeba: A Distributed Operating System for the 1990s, Computer, Vol. 23, No. 5, May 1990, pp. 44 - 53.

[Nelson 90] Victor P. Nelson, Fault-Tolerant Computing: Fundamental Concepts, IEEE Computer, July 1990, pp. 19 - 25.

[Nicola 95] Victor F. Nicola, Checkpointing and the Modeling of Program Execution Time, in Software Fault Tolerance, Michael R. Lyu, editor, Wiley, 1995, pp. 167 - 188.

[Parhami 96] Behrooz Parhami, Design of Reliable Software via General Combination of N-Version Programming and Acceptance Testing, Proceedings of the 1996 17th International Symposium on Software Reliability Engineering (ISSRE'96), 1996, pp. 104 - 109.

[Pham 92] Hoang Pham, editor, Fault-Tolerant Software Systems: Techniques and Applications, IEEE Computer Society Press, 1992.

[Parnas 90] David L. Parnas, A. John van Schouwen, and Shu Po Kwan, Evaluation of Safety-Critical Software, Communications of the ACM, Vol. 33, No. 6, June 1990, pp. 636 - 648.

[Potter 86] James E. Potter and M. C. Suman, Extension of the Midvalue Selection Technique for Redundancy Management of Inertial Sensors, Journal of Guidance, Control, and Dynamics, Vol. 9, No. 1, January - February 1986, pp. 37 - 44.

[Pradhan 92] Dhiraj K. Pradhan and Nitin H. Vaidya, Roll-Forward Checkpointing Scheme: Concurrent Retry with Nondedicated Spares, The 1992 IEEE Workshop on Fault-Tolerant Parallel and Distributed Systems, July 6 - 7, 1992, pp. 166 - 174.

[Pradhan 96] Dhiraj K. Pradhan, Fault-Tolerant Computer System Design, Prentice-Hall, Inc., 1996.

[Pressman 97] Roger S. Pressman, Software Engineering: A Practitioner's Approach, The McGraw-Hill Companies, Inc., 1997.

[Prowell 99] Stacy J. Prowell, et al, Cleanroom Software Engineering: Technology and Process, Addison-Wesley, 1999.

[Purtilo 89] James M. Purtilo and Pankaj Jalote, A System for Supporting Multi-Language Versions for Software Fault Tolerance, Digest of Papers: The Nineteenth International Symposium on Fault-Tolerant Computing (FTCS-19), 1989, pp. 268 - 274.

[Rabejac 96] Christophe Rabéjac, et al, Executable Assertions and Timed Traces for On-Line Software Error Detection, Proceedings of the Twenty-Sixth International Symposium on Fault-Tolerant Computing, Sendai, Japan, June 25 - 27, 1996, pp. 138 - 147.

[Ramirez 99] John C. Ramirez and Rami G. Melhem, Reducing Message Overhead in TMR Systems, Proceedings of the 1999 19th International Conference on Distributed Computing Systems (ICDCS'99), Austin, TX, pp. 45 - 54.

[Randell 75] Brian Randell, System Structure for Software Fault Tolerance, IEEE Transactions on Software Engineering, Vol. SE-1, No. 2, June 1975, pp. 220 - 232.

[Randell 95A] Brian Randell and Jie Xu, The Evolution of the Recovery Block Concept, in Software Fault Tolerance, Michael R. Lyu, editor, Wiley, 1995, pp. 1 - 21.

[Randell 95B] Brian Randell, et al, editors, Predictably Dependable Computing Systems, Springer, 1995.

[Russinovich 93] Mark Russinovich, et al, Application Transparent Fault Management in Fault Tolerant Mach, Digest of Papers: The Twenty-Third International Symposium on Fault-Tolerant Computing (FTCS-23), Toulouse, France, June 22 - 24, 1993, pp. 10 - 19.

[Russinovich 94] Mark Russinovich, et al, Application Transparent Fault Management in Fault Tolerant Mach, in Foundations of Dependable Computing: System Implementation, Gary M. Koob and Clifford G. Lau, editors, Kluwer Academic Publishers, 1994, pp. 215 - 241.

[Russinovich 95] Mark Russinovich and Zary Segall, Fault-Tolerance for Off-The-Shelf Applications and Hardware, Digest of Papers: The Twenty-Fifth International Symposium on Fault-Tolerant Computing, Pasadena, CA, June 27 - 30, 1995, pp. 67 - 71.

[Saglietti 86] F. Saglietti and W. Ehrenberger, Software Diversity - Some Considerations About Its Benefits and Its Limitations, Proceedings of the Fifth International Federation of Automatic Control (IFAC) Workshop, Safety of Computer Control Systems, October 14 - 17, 1986, pp. 27 - 34.

[Saglietti 90A] F. Saglietti, Software Diversity Metrics: Quantifying Dissimilarity in the Input Partition, Software Engineering Journal, January 1990, pp. 59 - 63.

[Saglietti 90B] F. Saglietti, Strategies for the Achievement and Assessment of Software Fault-Tolerance, IFAC 1990 World Congress, Automatic Control, Vol. IV, IFAC Symposia Series, Number 4, 1991, pp. 303 - 308.

[Saglietti 91] Francesca Saglietti, The Impact of Voter Granularity in Fault-Tolerant Software on System Reliability and Availability, in Software Fault Tolerance: Achievement and Assessment Strategies, M. Kersken and F. Saglietti, editors, Springer-Verlag, 1991.

[Salles 99] Frédéric Salles, et al, MetaKernels and Fault Containment Wrappers, Digest of Papers: Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing, Madison, Wisconsin, June 15 - 18, 1999, pp. 22 - 29.

[Schach 96] Stephen R. Schach, Testing: Principles and Practice, ACM Computing Surveys, Vol. 28, No. 1, March 1996, pp. 277 - 279.

[Scott 87] R. Keith Scott, James W. Gault, and David F. McAllister, Fault-Tolerant Software Reliability Modeling, IEEE Transactions on Software Engineering, Vol. SE-13, No. 5, May 1987, pp. 582 - 592.

[Scott 96] R. Keith Scott and David F. McAllister, Cost Modeling of N-Version Fault-Tolerant Software Systems for Large N, IEEE Transactions on Reliability, Vol. 45, No. 2, June 1996, pp. 297 - 302.

[Shimeall 91] Timothy J. Shimeall and Nancy G. Leveson, An Empirical Comparison of Software Fault Tolerance and Fault Elimination, IEEE Transactions on Software Engineering, Vol. 17, No. 2, February 1991, pp. 173 - 182.

[Shokri 96A] Eltefaat H. Shokri, et al, Development of Software Fault-Tolerant Applications with ADA95 Object-Oriented Support, Proceedings of the 1996 IEEE National Aerospace and Electronics Conference, Vol. 2, 1996, pp. 519 - 526.

[Shokri 96B] Eltefaat H. Shokri and Kam S. Tso, Ada95 Object-Oriented and Real-Time Support for Development of Software Fault Tolerance Reusable Components, Proceedings of the 1996 2nd Workshop on Object-Oriented Real-Time Dependable Systems, 1996, pp. 93 - 100.

[Shrivastava 91] S. K. Shrivastava and A. Waterworth, Using Objects and Actions to Provide Fault Tolerance in Distributed, Real-Time Applications, Proceedings of the Twelfth Real-Time Systems Symposium, December 1991, pp. 276 - 285.

[Simcox 88] L. N. Simcox, Software Fault Tolerance, Royal Signals and Radar Establishment, Memorandum 4237, 90N12575, June 1988.

[Simon 90] D. Simon, et al, A Software Fault Tolerance Experiment for Space Applications, Digest of Papers Fault Tolerant Computing: 20th International Symposium, University of North Carolina at Chapel Hill, June 26 - 28, 1990, pp. 28 - 35.

[Sims 97] J. Terry Sims, Redundancy Management Software Services for Seawolf Ship Control System, Digest of Papers: Twenty-Seventh Annual International Symposium on Fault-Tolerant Computing, Seattle, Washington, June 24 - 27, 1997, pp. 390 - 394.

[Spitzer 86] Cary R. Spitzer, All-Digital Jets are Taking Off, IEEE Spectrum, September 1986, pp. 51 - 56.

[Strom 88] Robert E. Strom, et al, Towards Self-Recovering Operating Systems, Proceedings of the International Conference on Parallel Processing and Applications, September 1987, pp. 475 - 483.

[Subramanian 89] C. Subramanian and D.K. Subramanian, Performance Analysis of Voting Strategies for a Fly-By-Wire System of a Fighter Aircraft, IEEE Transactions on Automatic Control, Vol. 34, No. 9, September 1989, pp. 1018 - 1021.

[Sullivan 90] Gregory F. Sullivan and Gerald M. Masson, Using Certification Trails to Achieve Software Fault Tolerance, Digest of Papers FTCS-20: The Twentieth International Symposium on Fault-Tolerant Computing, June 26 - 28, 1990, pp. 423 - 431.

[Sullivan 95] Gregory F. Sullivan, et al, Certification of Computational Results, IEEE Transactions on Computers, Vol. 44, No. 7, July 1995, pp. 833 - 847.

[Sundresh 98] Tippure S. Sundresh, Software Hardening Unifying Software Reliability Strategies, IEEE International Conference on Systems, Man, and Cybernetics, San Diego, CA, October 11 - 14, 1998, pp. 4710 - 4715.

[Suri 95] N. Suri, et al, Advances in Ultra-dependable Distributed Systems, IEEE Computer Society Press, 1995.

[Takano 91] Tadashi Takano, et al, Longlife Dependable Computers for Spacecraft, in Dependable Computing for Critical Applications, Vol. 4, A. Avizienis, J.C. Laprie, editors, 1991, pp. 153 - 173.

[TaylorD 80A] David J. Taylor, et al, Redundancy in Data Structures: Improving Software Fault Tolerance, IEEE Transactions on Software Engineering, Vol. SE-6, No. 6, November 1980, pp. 585 - 594.

[TaylorD 80B] David J. Taylor, et al, Redundancy in Data Structures: Some Theoretical Results, IEEE Transactions on Software Engineering, Vol. SE-6, No. 6, November 1980, pp. 595 - 602.

[TaylorR 90] R. C. Taylor, et al, A Flexible Fault Tolerant Processor for Launch Vehicle Avionics Systems, 9th Digital Avionics Systems Conference, 1990, pp. 147 - 152.

[Tong 88] Zhijun Tong and Richard Y. Kain, Vote Assignments in Weighted Voting Mechanisms, Proceedings of the Seventh Symposium on Reliable Distributed Systems, October 1988, pp. 138 - 143.

[Traverse 91] Pascal Traverse, Dependability of Digital Computers on Board Airplanes, Dependable Computing for Critical Applications, Volume 4, A. Avizienis, J.C. Laprie, editors, 1991, pp. 134 - 152.

[Tso 86] Kam Sing Tso, Error Recovery in Multi-Version Software, Proceedings of the Fifth IFAC Workshop, Safety of Computer Control Systems, October 1986, pp. 35 - 41.

[Tso 87] Kam Sing Tso and Algirdas Avizienis, Community Error Recovery in N-Version Software: A Design Study with Experimentation, Digest of Papers FTCS-17: The Seventeenth International Symposium on Fault-Tolerant Computing, July 6 - 8, 1987, pp. 127 - 133.

[Tso 95] K. S. Tso, et al, A Reuse Framework for Software Fault Tolerance, AIAA Paper 95-1012, 1995.

[Vaidya 93] N. H. Vaidya, et al, Trade-Offs in Developing Fault Tolerant Software, IEE Proceedings-E, Vol. 140, No. 6, November 1993, pp. 320 - 326.

[Voas 94] Jeffrey M. Voas and Keith W. Miller, Dynamic Testability Analysis for Assessing Fault Tolerance, High Integrity Systems Journal, Vol. 1, No. 2, 1994, pp. 171 - 178.

[Voas 98A] Jeffrey M. Voas, Certifying Off-the-Shelf Software Components, IEEE Computer, Vol. 31, June 1998, pp. 53 - 59.

[Voas 98B] Jeffrey M. Voas and Gary McGraw, Software Fault Injection: Inoculating Programs Against Errors, John Wiley & Sons, Inc., 1998.

[Vrsalovic 90] D. Vrsalovic, et al, Is it Possible to Quantify the Fault Tolerance of Distributed/Parallel Computer Systems, Digest of Papers: 35th IEEE Computer Society International Conference - COMPCON Spring '90, 1990, pp. 219 - 225.

[Walter 90] Chris J. Walter, Evaluation and Design of an Ultra-Reliable Distributed Architecture for Fault Tolerance, IEEE Transactions on Reliability, Vol. 39, No. 4, October 1990, pp. 492 - 49.

[Watanabe 92] Aki Watanabe, et al, The Multi-Layered Design Diversity Architecture: Application of the Design Diversity Approach to Multiple System Layers, Proceedings of the Ninth TRON Project Symposium, December 1992, pp. 116 - 121.

[Watanabe 94] Aki Watanabe and Ken Sakamura, MLDD (Multi-Layered Design Diversity) Architecture for Achieving High Design Fault Tolerance Capabilities, Dependable Computing - EDCC-1, Klaus Echtle, Dieter Hammer and David Powell, editors, First European Dependable Computing Conference, October 1994, pp. 336 - 349.

[Watanabe 95] Aki Watanabe and Ken Sakamura, Design Fault Tolerance in Operating Systems Based on a Standardization Project, Digest of Papers: The Twenty-Fifth International Symposium on Fault-Tolerant Computing, Pasadena, CA, June 27 - 30, 1995, pp. 372 - 380.

[Webber 91] Steve Webber and John Beirne, The Stratus Architecture, Digest of Papers Fault-Tolerant Computing: The Twenty-First International Symposium, Montreal, Canada, June 25 - 27, 1991, pp. 79 - 85.

[Weinstock 97] Charles B. Weinstock and David P. Gluch, A Perspective on the State of Research in Fault-Tolerant Systems, Software Engineering Institute, Special Report CMU/SEI-97-SR-008, June 1997.

[Wicker 95] Stephen B. Wicker, Error Control Systems for Digital Communication and Storage, Prentice Hall, 1995.

[Williams 88] Ronald D. Williams, et al, An Operating System for a Fault-Tolerant Multiprocessor Controller, IEEE Micro, Vol. 8, No. 4, August 1988, pp. 18 - 29.

[Wu 91] Jie Wu, Software Fault Tolerance Using Hierarchical N-Version Programming, IEEE Proceedings of the SOUTHEASTCON '91, 1991, pp. 124 - 128.

[Wu 94] J. Wu, et al, A Uniform Approach to Software and Hardware Fault Tolerance, Journal of Systems and Software, Vol. 26, No. 2, August 1994, pp. 117 - 127.

[Xu 95A] Jie Xu, et al, Fault Tolerance in Concurrent Object-Oriented Software Through Coordinated Error Recovery, Digest of Papers: The Twenty-Fifth International Symposium on Fault-Tolerant Computing, Pasadena, CA, June 27 - 30, 1995, pp. 499 - 508.

[Xu 95B] J. Xu, et al, Toward an Object-Oriented Approach to Software Fault Tolerance, Proceedings of the 1995 Conference on Fault Tolerant Parallel and Distributed Systems, 1995, pp. 226 - 233.

[Xu 97] Jie Xu and Brian Randell, Software Fault Tolerance: t/(n-1)-Variant Programming, IEEE Transactions on Reliability, Vol. 46, No. 1, March 1997, pp. 60 - 68.

[Xu 99] J. Xu, et al, Rigorous Development of a Safety-Critical System Based on Coordinated Atomic Actions, Digest of Papers: Twenty-Ninth Annual International Symposium on Fault-Tolerant Computing, June 1999, pp. 68 - 75.

[Yau 96] Stephen S. Yau, et al, Object-Oriented Software Development with Fault Tolerance for Distributed Real-Time Systems, Proceedings of the 1996 2nd Workshop on Object-Oriented Real-Time Dependable Systems (WORDS), 1996, pp. 160 - 167.

[Yeh 96] Y.C. Yeh, Triple-Triple Redundant 777 Primary Flight Computer, Proceedings of the 1996 IEEE Aerospace Applications Conference, Vol. 1, 1996, pp. 293 - 307.

[Yen 96] I-Ling Yen, Specialized N-Modular Redundant Processors in Large-Scale Distributed Systems, Proceedings of the 1996 15th Symposium on Reliable Distributed Systems, Ontario, Canada, pp. 12 - 21.

[Yen 97] I-Ling Yen, An Object-Oriented Fault-Tolerance Framework Based on Specialization Techniques, Proceedings of the 1997 3rd International Workshop on Object-Oriented Real-Time Dependable Systems (WORDS'97), 1997, pp. 291 - 297.
