Fault-Tolerant Design
Total Page:16
File Type:pdf, Size:1020Kb
Elena Dubrova Fault-Tolerant Design www.ketabdownload.com Fault-Tolerant Design www.ketabdownload.com Elena Dubrova Fault-Tolerant Design 123 www.ketabdownload.com Elena Dubrova KTH Royal Institute of Technology Krista Sweden ISBN 978-1-4614-2112-2 ISBN 978-1-4614-2113-9 (eBook) DOI 10.1007/978-1-4614-2113-9 Springer New York Heidelberg Dordrecht London Library of Congress Control Number: 2013932644 Ó Springer Science+Business Media New York 2013 This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher’s location, in its current version, and permission for use must always be obtained from Springer. Permissions for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to prosecution under the respective Copyright Law. The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. While the advice and information in this book are believed to be true and accurate at the date of publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for any errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect to the material contained herein. Printed on acid-free paper Springer is part of Springer Science+Business Media (www.springer.com) www.ketabdownload.com To my mother www.ketabdownload.com Preface This textbook serves as an introduction to fault tolerance, intended for upper division undergraduate students, graduate-level students, and practicing engineers in need of an overview of the field. Readers will develop skills in modeling and evaluating fault-tolerant architectures in terms of reliability, availability, and safety. They will gain a thorough understanding of fault-tolerant computing, including both the theory of how to achieve fault tolerance through hardware, software, information, and time redundancy and the practical knowledge of designing fault-tolerant hardware and software systems. The book contains eight chapters covering the following topics. Chapter 1 is an introduction, discussing the importance of fault tolerance in developing a dependable system. Chapter 2 describes three fundamental characteristics of dependability: attributes, impairment, and means. Chapter 3 introduces depend- ability evaluation techniques and dependability models such as reliability block diagrams and Markov chains. Chapter 4 presents commonly used approaches for the design of fault-tolerant hardware systems, such as triple modular redundancy, standby redundancy, and self-purging redundancy and evaluates their effect on system dependability. Chapter 5 shows how fault tolerance can be achieved by means of coding. It covers many important families of codes, including parity, linear, cyclic, unordered, and arithmetic codes. Chapter 6 presents time redun- dancy techniques which can be used for detecting and correcting transient and permanent faults. Chapter 7 describes the main approaches for the design of fault- tolerant software systems, including checkpoint and restart, recovery blocks, N-version programming, and N self-checking programming. Chapter 8 concludes the book. The content is designed to be highly accessible, including numerous examples and problems to reinforce the material learned. Solutions to problems and Power- Point slides are available from the author upon request. Stockholm, Sweden, December 2012 Elena Dubrova vii www.ketabdownload.com Acknowledgments This book has evolved from the lecture notes of the course ‘‘Design of Fault-Tolerant Systems’’ that I have taught at the Royal Institute of Technology (KTH), since 2000. Throughout the years, many students and colleagues have helped me polish the text. I am grateful to all who gave me feedback. In particular, I would like to thank Nan Li, Shohreh Sharif Mansouri, Nasim Farahini, Bayron Navas, Jonian Grazhdani, Xavier Lowagie, Pieter Nuyts, Henrik Kirkeby, Chen Fu, Kareem Refaat, Sergej Koziner, Julia Kuznetsova, Zhonghai Lu, and Roman Morawek for finding multiple mistakes and suggesting valuable improvements in the manuscript. I am grateful to Hannu Tenhunen who inspired me to teach this course and constantly supported me during my academic career. I am also indebted to Charles Glaser from Springer for his encouragement and assistance in publishing the book and to Sandra Brunsberg for her very careful proofreading of the final draft. Special thanks to a friend of mine who advised me to use the MetaPost tool for drawing pictures. Indeed, MetaPost gives perfect results. Finally, I am grateful to the Swedish Foundation for International Cooperation in Research and Higher Education (STINT), for the scholarship KU2002-4044 which supported my trip to the University of New South Wales, Sydney, Australia, where the first draft of this book was written during October–December 2002. ix www.ketabdownload.com Contents 1 Introduction ........................................ 1 1.1 Definition of Fault Tolerance . 1 1.2 Fault Tolerance and Redundancy. 2 1.3 Applications of Fault Tolerance . 2 References . 4 2 Fundamentals of Dependability .......................... 5 2.1 Notation . 5 2.2 Dependability Attributes. 6 2.2.1 Reliability . 6 2.2.2 Availability . 7 2.2.3 Safety . 8 2.3 Dependability Impairments . 9 2.3.1 Faults, Errors, and Failures. 9 2.3.2 Origins of Faults. 10 2.3.3 Common-Mode Faults . 12 2.3.4 Hardware Faults . 12 2.3.5 Software Faults. 14 2.4 Dependability Means . 15 2.4.1 Fault Tolerance. 15 2.4.2 Fault Prevention . 16 2.4.3 Fault Removal . 16 2.4.4 Fault Forecasting . 17 2.5 Summary . 17 References . 19 3 Dependability Evaluation Techniques ...................... 21 3.1 Basics of Probability Theory. 22 3.2 Common Measures of Dependability . 24 3.2.1 Failure Rate . 24 3.2.2 Mean Time to Failure . 27 xi www.ketabdownload.com xii Contents 3.2.3 Mean Time to Repair . 28 3.2.4 Mean Time Between Failures . 30 3.2.5 Fault Coverage . 30 3.3 Dependability Modeling . 31 3.3.1 Reliability Block Diagrams . 31 3.3.2 Fault Trees. 33 3.3.3 Reliability Graphs . 33 3.3.4 Markov Processes . 34 3.4 Dependability Evaluation . 37 3.4.1 Reliability Evaluation Using Reliability Block Diagrams . 38 3.4.2 Dependability Evaluation Using Markov Processes . 39 3.5 Summary . 48 References . 54 4 Hardware Redundancy ................................ 55 4.1 Redundancy Allocation . 56 4.2 Passive Redundancy. 57 4.2.1 Triple Modular Redundancy . 58 4.2.2 N-Modular Redundancy . 63 4.3 Active Redundancy . 65 4.3.1 Duplication with Comparison . 65 4.3.2 Standby Redundancy . 67 4.3.3 Pair-And-A-Spare . 73 4.4 Hybrid Redundancy . 75 4.4.1 Self-Purging Redundancy. 76 4.4.2 N-Modular Redundancy with Spares . 79 4.5 Summary . 80 References . 85 5 Information Redundancy ............................... 87 5.1 History . 87 5.2 Fundamental Notions . 88 5.2.1 Code . 88 5.2.2 Encoding . 89 5.2.3 Information Rate. 89 5.2.4 Decoding . 89 5.2.5 Hamming Distance . 90 5.2.6 Code Distance . 90 5.3 Parity Codes . 92 5.3.1 Definition and Properties . 92 5.3.2 Applications . 93 5.3.3 Horizontal and Vertical Parity Codes. 95 www.ketabdownload.com Contents xiii 5.4 Linear Codes. 96 5.4.1 Basic Notions . 96 5.4.2 Definition. 98 5.4.3 Generator Matrix . 99 5.4.4 Parity Check Matrix . 101 5.4.5 Syndrome. 102 5.4.6 Construction of Linear Codes . 102 5.4.7 Hamming Codes . 105 5.4.8 Lexicographic Parity Check Matrix . 107 5.4.9 Applications of Hamming Codes . 108 5.4.10 Extended Hamming Codes . 109 5.5 Cyclic Codes. 110 5.5.1 Definition. 110 5.5.2 Polynomial Manipulation . 111 5.5.3 Generator Polynomial . 112 5.5.4 Parity Check Polynomial . 114 5.5.5 Syndrome Polynomial . 115 5.5.6 Implementation of Encoding and Decoding . 115 5.5.7 Separable Cyclic Codes . 118 5.5.8 Cyclic Redundancy Check Codes . 121 5.5.9 Reed-Solomon Codes.