Fault Tolerant Computer Architecture

Synthesis Lectures on Computer Architecture

Editor
Mark D. Hill, University of Wisconsin, Madison

Synthesis Lectures on Computer Architecture publishes 50 to 150 page publications on topics pertaining to the science and art of designing, analyzing, selecting and interconnecting hardware components to create computers that meet functional, performance and cost goals.

Fault Tolerant Computer Architecture
Daniel Sorin
2009

The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines
Luiz André Barroso and Urs Hölzle
2009

Computer Architecture Techniques for Power-Efficiency
Stefanos Kaxiras and Margaret Martonosi
2008

Chip Multiprocessor Architecture: Techniques to Improve Throughput and Latency
Kunle Olukotun, Lance Hammond, James Laudon
2007

Transactional Memory
James R. Larus, Ravi Rajwar
2007

Quantum Computing for Computer Architects
Tzvetan S. Metodi, Frederic T. Chong
2006

Copyright © 2009 by Morgan & Claypool

All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means (electronic, mechanical, photocopy, recording, or any other), except for brief quotations in printed reviews, without the prior permission of the publisher.

Fault Tolerant Computer Architecture
Daniel Sorin
www.morganclaypool.com

ISBN: 9781598299533 paperback
ISBN: 9781598299540 ebook
DOI: 10.2200/S00192ED1V01Y200904CAC005

A Publication in the Morgan & Claypool Publishers series
SYNTHESIS LECTURES ON COMPUTER ARCHITECTURE
Lecture #5

Series ISSN
ISSN 1935-3235 print
ISSN 1935-3243 electronic

Daniel J. Sorin
Duke University

ABSTRACT
For many years, most computer architects have pursued one primary goal: performance.
Architects have translated the ever-increasing abundance of ever-faster transistors provided by Moore’s law into remarkable increases in performance. Recently, however, the bounty provided by Moore’s law has been accompanied by several challenges that have arisen as devices have become smaller, including a decrease in dependability due to physical faults. In this book, we focus on the dependability challenge and the fault tolerance solutions that architects are developing to overcome it. The two main purposes of this book are to explore the key ideas in fault-tolerant computer architecture and to present the current state of the art, over approximately the past 10 years, in academia and industry.

KEYWORDS
fault tolerance (or fault tolerant), reliability, dependability, computer architecture, error detection, error recovery, fault diagnosis, self-repair, autonomous, dynamic verification

Dedication
“To Deborah, Jason, and Julie”

Acknowledgments
I would like to thank my family for their support while I was writing this lecture. I would also like to thank Mark Hill for inviting me to write this lecture and Mike Morgan for organizing the production of the lecture. Valuable feedback on early drafts of the lecture was provided by Babak Falsafi, Jude Rivers, and Mark Hill. I would also like to thank Lihao Xu for helping me with a question about error coding.

Contents

1. Introduction
   1.1 Goals of this Book
   1.2 Faults, Errors, and Failures
       1.2.1 Masking
       1.2.2 Duration of Faults and Errors
       1.2.3 Underlying Physical Phenomena
   1.3 Trends Leading to Increased Fault Rates
       1.3.1 Smaller Devices and Hotter Chips
       1.3.2 More Devices per Processor
       1.3.3 More Complicated Designs
   1.4 Error Models
       1.4.1 Error Type
       1.4.2 Error Duration
       1.4.3 Number of Simultaneous Errors
   1.5 Fault Tolerance Metrics
       1.5.1 Availability
       1.5.2 Reliability
       1.5.3 Mean Time to Failure
       1.5.4 Mean Time Between Failures
       1.5.5 Failures in Time
       1.5.6 Architectural Vulnerability Factor
   1.6 The Rest of This Book
   1.7 References

2. Error Detection
   2.1 General Concepts
       2.1.1 Physical Redundancy
       2.1.2 Temporal Redundancy
       2.1.3 Information Redundancy
       2.1.4 The End-to-End Argument
   2.2 Microprocessor Cores
       2.2.1 Functional Units
       2.2.2 Register Files
       2.2.3 Tightly Lockstepped Redundant Cores
       2.2.4 Redundant Multithreading Without Lockstepping
       2.2.5 Dynamic Verification of Invariants
       2.2.6 High-Level Anomaly Detection
       2.2.7 Using Software to Detect Hardware Errors
       2.2.8 Error Detection Tailored to Specific Fault Models
   2.3 Caches and Memory
       2.3.1 Error Code Implementation
       2.3.2 Beyond EDCs
       2.3.3 Detecting Errors in Content Addressable Memories
       2.3.4 Detecting Errors in Addressing
   2.4 Multiprocessor Memory Systems
       2.4.1 Dynamic Verification of Cache Coherence
       2.4.2 Dynamic Verification of Memory Consistency
       2.4.3 Interconnection Networks
   2.5 Conclusions
   2.6 References

3. Error Recovery
   3.1 General Concepts
       3.1.1 Forward Error Recovery
       3.1.2 Backward Error Recovery
       3.1.3 Comparing the Performance of FER and BER
   3.2 Microprocessor Cores
       3.2.1 FER for Cores
       3.2.2 BER for Cores
   3.3 Single-Core Memory Systems
