AN ULTRARELIABLE MULTICOMPUTER ARCHITECTURE

FOR REAL-TIME CONTROL APPLICATIONS

by

Peter C. Buechler

A Thesis Submitted to the Faculty of the

College of Engineering in Partial Fulfillment of the Requirements for the Degree of

Master of Science in Computer Engineering

Florida Atlantic University

Boca Raton, Florida

December 1989

AN ULTRARELIABLE MULTICOMPUTER ARCHITECTURE FOR REAL-TIME CONTROL APPLICATIONS

by Peter C. Buechler

This thesis was prepared under the direction of the candidate's thesis advisor, Dr. Eduardo B. Fernandez, Department of Computer Engineering, and has been approved by the members of his supervisory committee. It was submitted to the faculty of the College of Engineering and was accepted in partial fulfillment of the requirements for the degree of Master of Science in Computer Engineering.

SUPERVISORY COMMITTEE:

Dr. E. B. Fernandez, Thesis Advisor
Dr. T. M. Khoshgoftaar
Dr. D. P. Gluch

Chairperson, Department of Computer Engineering

Dean of Graduate Studies          Date


ACKNOWLEDGEMENTS

I would like to thank my committee members for their suggestions and criticisms, my wife for acting as an economic slave while I pursued my studies, my mother for her support, and Mr. Paul Luebbers for stimulating discussions and encouragement.

ABSTRACT

Author: Peter C. Buechler
Title: An Ultrareliable Multicomputer Architecture for Real-Time Control Applications

Institution: Florida Atlantic University
Thesis Advisor: Dr. Eduardo B. Fernandez
Degree: Master of Science in Engineering
Year: 1989

This thesis considers the design of ultrareliable multicomputers for control applications. The fault tolerance problem is divided into three subproblems: software, processing node, and communication fault tolerance. Design is performed using layers of abstraction, with fault tolerance implemented by dedicated layers. For software fault tolerance, new constructs for concurrent n-version programming are introduced. For processing node fault tolerance, the distributed fault tolerance (DFT) concept of Chen and Chen is extended to allow for arbitrary failures. Communication fault tolerance is achieved with multicasting on a fault-tolerant graph (FG) network. Reliability models are developed for each of the layers, and a performance model is developed for the communication layer. An example flight control system is compared to currently existing architectures.

TABLE OF CONTENTS

1. Introduction ...... 1

2. Current Architectures ...... 6

2.1 Historical Perspective ...... 6

2.2 Sperry Flight Systems ...... 8

2.3 FTP/AP ...... 10

2.4 MAFT...... 14

2.5 Airbus A320 ...... 17

2.6 Summary...... 19

3. URMC Architecture ...... 21

3.1 Virtual Machine Design Approach ...... 22

3.2 The Recovery Layer Concept...... 24

3.3 Top-Level URMC Design ...... 25

3.4 Allocation of System Requirements to Layers ...... 28

3.4.1 Software Fault Tolerance ...... 28

3.4.2 Processing Node Fault Tolerance ...... 29

3.4.3 Interconnection Network Fault Tolerance ...... 31

3.5 Top-Level Reliability Model...... 32

4. Software Fault Tolerance ...... 34

4.1. Sequential Constructs ...... 35

4.1.1 Recovery Blocks...... 35

4.1.2 N-Version Programming ...... 38

4.2. Concurrent Constructs ...... 40

4.2.1 Recovery Block Extensions ...... 40


4.2.1.1 PTC ...... 41
4.2.1.2 The Conversation ...... 44
4.2.1.3 The Colloquy ...... 45
4.2.2 N-Version Programming Extensions ...... 47
4.2.2.1 Modular Redundancy in CSP ...... 47
4.2.2.2 Resilient Procedures ...... 50
4.3 Selecting a Fault-tolerant Construct ...... 51
4.3.1 Sequential Software ...... 52
4.3.2 Concurrent Software ...... 57
4.4 Concurrent N-Version Programming (CNVP) ...... 58
4.4.1 Process-Dissimilar CNVP ...... 60
4.4.2 Subprogram-Dissimilar CNVP ...... 65
4.4.3 Structure-Dissimilar CNVP ...... 69
4.4.4 Comparison of CNVP Constructs ...... 73
4.5 CNVP Reliability Model ...... 79
4.5.1 Review of Sequential NVP Models ...... 79
4.5.2 Process-Dissimilar CNVP ...... 87
4.5.3 Subprogram-Dissimilar CNVP ...... 90
4.5.4 Structure-Dissimilar CNVP ...... 91
4.6 Correlated Error Problem ...... 91
4.7 Software Fault Tolerance Summary ...... 95
5. Processing Node Fault Tolerance ...... 97
5.1 Derived Requirements for the Layer ...... 99
5.2 SISD Redundancy Management Methods ...... 100

5.2.1 Standby Sparing ...... 100

5.2.2 N-Modular Redundancy ...... 102

5.3 MIMD Redundancy Management Methods...... 104

5.3.1 Redundancy with System Diagnosis...... 105

5.3.2 Modular Redundancy with Voting ...... 112

5.4 Voting vs. Diagnosis...... 113

5.5 URMC Redundancy Management Technique ...... 115

5.5.1 Byzantine Fault Masking ...... 116

5.5.2 Byzantine Fault Diagnosis ...... 118

5.5.3 Spare Processor Coordination ...... 121

5.5.4 I/O Hardware Fault Tolerance ...... 122

5.5.5 Support for Dissimilar Processors ...... 123

5.6 Reliability Model...... 124

6. Network Fault Tolerance ...... 130

6.1 Network Requirements ...... 131

6.2 Interconnection Network Overview ...... 133

6.3 The Unidirectional Link FG Network ...... 137

6.4 Reliability Analysis ...... 140

6.5 Performance Analysis ...... 142

6.6 Summary...... 147

7. Example URMC System ...... 149

7.1 Requirements ...... 149

7.2 Current Capability...... 150

7.2.1 Software Failure Parameters ...... 150


7.2.2 Hardware Failure Parameters ...... 154

7.2.3 Hardware Throughput...... 155

7.3 Example URMC Design ...... 155

7.3.1 Software Fault Tolerance ...... 156

7.3.2 Hardware Fault Tolerance ...... 158

7.3.3 Communication Fault Tolerance ...... 163

7.3.4 Communication Performance ...... 163

7.4 Comparison with Current Systems ...... 164

7.5 Summary ...... 166

8. Conclusion ...... 167

8.1 Summary of the URMC...... 167

8.2 Contributions...... 169

8.3 Suggestions for Further Work...... 171

8.3.1 Software ...... 171

8.3.2 Processing Hardware...... 172

8.3.3 Communication Hardware...... 173

8.3.4 Development of an URMC ...... 173

Appendix A ...... 175

Appendix B ...... 186

Bibliography...... 189

LIST OF TABLES

7.3.2-1. Failure probability for I/O repsets ...... 162

7.3.2-2. Failure probability for processing repsets ...... 162


LIST OF FIGURES

2.2-1. Sperry flight control architecture ...... 9

2.3-1. The FTP/AP architecture ...... 11

2.3-2. Distribution of channel value to all ...... 12

2.4-1. The MAFT architecture ...... 15

2.5-1. A single A320 computer (SEC or ELAC) ...... 18

2.5-2. A320 pitch control system ...... 19

3.1-1. Layers of abstraction in an SISD computer...... 23

3.1-2. Layers of abstraction in the OSI Model...... 23

3.3-1. URMC layers of abstraction ...... 27

3.5-1. Top-level reliability diagram for URMC ...... 33

4.2.1-1. Domino effect from uncoordinated recovery blocks ...... 41

4.2.1.2-1. A conversation with three processes ...... 44

4.3.1-1. Characteristics of fault-tolerant constructs ...... 53

4.3.1-2. Overheads of constructs...... 56

4.4.1-1. Process-dissimilar CNVP ...... 61

4.4.1-2. Recovery layer components of a three version process...... 64

4.4.2-1. Subprogram-dissimilar CNVP ...... 67

4.4.3-1. Structure-dissimilar CNVP ...... 70

4.4.4-1. Relationship between CNVP constructs ...... 74

4.4.4-2. Communication structure for a minimum extraction sort...... 75

4.4.4-3. Structure-dissimilar CNVP for a sorting problem ...... 78

4.5.1-1. NVP fault sources...... 80

4.5.1-2. Major fault types for NVP ...... 80

4.5.1-3. Detailed reliability model of n-version program ...... 81

4.5.1-4. Simple reliability model without recovery ...... 85

4.5.1-5. Simplified reliability model with recovery ...... 86

4.5.2-1. Top-level reliability diagram for process-dissimilar CNVP ...... 87

4.5.2-2. Reliability for three version sequential process ...... 88

4.5.3-1. Reliability diagram for subprogram-dissimilar CNVP ...... 90

4.6-1. Effect of independence assumption (from [Eckh85]) ...... 92

4.6-2. Effect of shifted intensity distribution (from [Eckh85]) ...... 93

5.2.1-1. Standby sparing with comparison checking ...... 101

5.2.2-1. Redundant system with voting ...... 102

5.3.1-1. An optimally one-step 2-diagnosable system ...... 107

5.3.1-2. Necessary partitioning for ti-diagnosable system ...... 108

5.5.1-1. Sending messages between repsets ...... 117

5.6-1. Reliability of 4-node I/O repset ...... 126

5.6-2. Markov model for 4-node processing repset...... 127

5.6-3. Markov model for spares pool reliability ...... 128

5.6-4. URMC processing node reliability diagram ...... 129

6.2-1. Link and architectures ...... 134

6.3-1. A (2,3) FG network...... 137

6.4-1. Node architecture ...... 141

6.5-1. The basic queueing process...... 142

6.5-2. Queueing system formed by one node of network...... 144

7.2.1-1. Comparisons of estimators (from [Eckh88]) ...... 153

7.3.2-1. A single processing node in the URMC-1 ...... 159

7.3.2-2. A single input/output node in the URMC-1 ...... 160
A-1. SHARPE program to compute I/O repset failure probability ...... 175
A-2. Output from SHARPE program of Figure A-1 ...... 178
A-3. SHARPE program to compute processing repset failure probability ...... 181
A-4. Output from SHARPE program of Figure A-3 ...... 184

1. Introduction.

Digital computers are now being applied to real-time control applications which require high throughput and extremely high reliability. The throughput requirements can be met by the use of a multicomputer, since multiple instruction stream, multiple data stream (MIMD) computers have been shown to be capable of considerable speedup over conventional single instruction stream, single data stream (SISD) computers [Atha88]. However, using multicomputers for real-time control will require that they be fault-tolerant and able to handle the special needs of real-time systems, such as hard processing deadlines and high I/O bandwidth. In this thesis, concepts for designing and analyzing ultrareliable multicomputers (URMC) for real-time control applications will be examined. It is intended that the concepts be applicable to a wide range of systems, from flight control computers with tens of processors and a failure probability of less than 1 x 10^-9 over a ten-hour flight, to process control computers with up to thousands of processors but with a failure probability several orders of magnitude larger. However, for brevity and concreteness, the thesis concentrates on the flight control application.

Current flight control computers for fly-by-wire commercial transport aircraft must have a throughput of 5.5 million instructions per second, an I/O rate of one million bits per second, and a probability of failure less than 1 x 10^-10 per flight hour [Kiec88]. Despite these stringent requirements, several computers have been developed for fly-by-wire control. Both the military and civilian versions must tolerate hardware failures, but due to the higher reliability demanded for civilian applications, the commercial air transport systems must have software fault tolerance as well. Though developed for flight control, these systems are applicable to other high-reliability control applications as well, such as process control or autonomous robot control. The current generation of flight control computers made the assumption that failures in software versions would occur independently. Unfortunately, this assumption is probably incorrect, so more versions of software will be required to reach the desired reliability level [Eckh88]. To execute n versions of software in the same time as a single version, the throughput of the flight control computer hardware will have to be increased n times.

The next generation flight control computer will be assigned more tasks than the current generation. The next generation of avionics computers will integrate flight control, propulsion control, navigation, flight management, collision avoidance, communications, and radar in the same system [Redi84] [Swih84]. Also, the system will be smarter, with an expert system to consult with the pilot, apprising the pilot of its diagnosis of problems and of the recommended procedure for dealing with them. Different parts of the system will require different levels of reliability.

How can the throughput requirements be met, first for a computer with more software versions, and then later for a computer with much greater throughput to take on the additional tasks? Also, how can different sections of the software in the same computer system be designed with differing reliabilities? This thesis will outline a top-level approach to designing an ultrareliable multicomputer (URMC), as may be required for greater reliability in current applications, with the capability of expanding to much larger multicomputers in the near future. The computer must be programmed with concurrent software to take advantage of the parallelism, and that software must be fault-tolerant to meet the reliability requirements. The thesis will first discuss how such a computer may be designed, then detail the software fault tolerance and hardware fault tolerance methods to be used. Analytical models for evaluating the reliability of the proposed schemes will be derived. An example system will be constructed which must handle the same tasks as current computers, but without the assumption of software failure independence.

An outline of each chapter in the thesis follows.

Chapter 2 will describe past work in architectures for highly reliable digital flight control systems. Those which were to be used in flight-critical applications and which have both hardware and software fault tolerance will be reviewed in more detail.

Chapter 3 discusses the design approach used in the URMC, which is to design a hierarchy of virtual machines, with each machine using the services of the machine below it to provide services to the machine above it. A new level of abstraction, the recovery layer, has been proposed for software fault tolerance [Fern89a]. In this thesis the concept is extended by adding levels to the virtual machine hierarchy to provide tolerance to faults at three different levels: communication, node hardware, and software. Each of these layers will provide highly reliable service to the layer above it despite using the facilities of an unreliable layer below it. Chapter 3 will also discuss the top-level requirements which each of the fault tolerance layers must meet.

Chapter 4 describes the software constructs which will be necessary for programming fault-tolerant concurrent software. Fault-tolerant constructs for both sequential and concurrent software are reviewed and compared for use in the URMC. It is found that no current software fault-tolerant construct is sufficient, so a new construct called concurrent n-version programming is introduced. A reliability model for this construct is found.

Chapter 5 discusses fault tolerance at the level of the processing nodes. The requirements imposed on this layer by the URMC system requirements and by requirements derived from the software layer's implementation are outlined. Current approaches to distributed fault tolerance are described, then the distributed fault tolerance (DFT) approach of [Chen85] is selected for further development. The DFT proposal does not account for Byzantine faults [Lamp82], so it is extended to mask and diagnose Byzantine faults. A reliability model for this layer is found.

Chapter 6 discusses the interconnection network in the URMC. It presents the requirements imposed upon this layer by the URMC system requirements and by the requirements derived from the implementations of higher layers. It is determined that a pseudo-completely connected network should be used, and from this class the fault-tolerant graph (FG) network is chosen. A reliability model is developed for this layer as for the other layers.

Chapter 7 puts together the ideas of chapters 3, 4, 5, and 6 by outlining an example system, a flight control computer which meets the requirements of the current generation without assuming software failure independence. It is compared to the systems described in chapter 2.

Chapter 8 is the conclusion. It is concluded that the URMC architecture may be a possible approach to designing fault-tolerant control computers, but that there are still several questions to which the answers must be found before this can be definitively shown. Ideas for future work are outlined.

2. Current Architectures.

In order to orient the reader to the current state of the art in fault-tolerant embedded control applications, this chapter reviews some of the architectures which have been developed. A large amount of work has been performed on fault-tolerant architectures for aircraft flight control, so all examples will be taken from flight control. However, these architectures may be used for other applications as well; for example, the Fault Tolerant Processor [Smit84] has been developed for use in flight control, power plant monitoring and control, process control, and/or air traffic control.

2.1 Historical Perspective.

The first digital fly-by-wire system was used in the Lunar Module of the Apollo program [Redi84]. This was followed by several others, including the F-8 Digital Fly-By-Wire, the AFTI/F-16, and the space shuttle; many systems are surveyed in [Redi84]. The first systems intended for use in commercial air transports were the Fault-Tolerant Multiprocessor Computer (FTMP) [Smit86] [Hopk78] and the Software Implemented Fault-Tolerant Computer (SIFT) [Smit86] [Wens78]. They were to compete with each other, the FTMP demonstrating fault tolerance in hardware and SIFT demonstrating it in software. Both systems relied upon modular redundancy and voting for fault tolerance.

All of the above systems had tolerance to physical hardware faults which occurred at random intervals. After the Byzantine fault problem was discovered in the development of the SIFT computer, Byzantine fault tolerance was also considered in the design. However, these systems did not tolerate generic faults, which are mistakes made in the design of the system. It became evident that unless there was protection against errors in the design of the system, the reliability goals could not be achieved. In the software-intensive systems of the present, software faults are the greatest design problem; it is estimated that even well-debugged avionics software has a failure rate on the order of 10^-5 per hour [Youn84].

The newest generation of digital flight control computers deals with both physical hardware failures and generic errors. Some of these systems are used for applications in which fault detection is crucial, but fault tolerance is not, such as the Boeing 737 autopilot [Youn84] and the Airbus A310 flap/slat controller [Rouq86]. These are called fail-passive systems. Fail-operational systems are more challenging to design, as the failure must not only be detected but also masked. The following sections describe four systems which have been designed to be fail-operational after either a random hardware failure or the manifestation of a generic fault.

2.2 Sperry Flight Systems.

An architecture suggested in [Youn84] uses three dissimilar versions of the processor, with each processor version running its own version of software (see Figure 2.2-1). The processors are arranged in three lanes (Flight Control Computers) with three processors in each lane, one processor of each type. Thus there are a total of nine processors. Each flight control computer has two outputs, one which is dedicated to one of the three processors (which may be called the dedicated processor), and the other output capable of being driven by either one of the other two processors (which may be called the checking processors). Each of the three lanes uses a different version of the processor as its dedicated output processor. The dedicated output processor is checked against the two checking processors with comparators (marked C in the figure).

A hardware or generic fault in the dedicated processor will cause both comparators in the lane to detect a disagreement, and thus the outputs disengage. A fault in one of the checking processors will cause only one of the two comparators to detect a disagreement, so the outputs will not disengage. Therefore, if a generic fault is encountered in one of the three processor versions, the lane with that version as its dedicated processor disengages, while the other two lanes are left with two processors each. This allows the system to tolerate two faults: two physical faults, or one physical and one generic fault.
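As an illustration only, the following Python sketch models the per-lane comparator decision just described. The function name and the representation of processor outputs as plain integer command values are assumptions of the sketch, not features of the Sperry design.

    # Sketch of the per-lane comparator logic (assumed representation:
    # each processor's output is an integer command value).
    def lane_disengages(dedicated_output, checking_outputs):
        # A lane disengages only when BOTH comparators disagree, i.e. when
        # the dedicated processor differs from both checking processors.
        disagreements = [dedicated_output != c for c in checking_outputs]
        return all(disagreements)

    # A fault in the dedicated processor disengages the lane...
    print(lane_disengages(5, [7, 7]))   # True: both comparators disagree
    # ...while a fault in one checking processor does not.
    print(lane_disengages(7, [7, 9]))   # False: only one comparator disagrees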

[Figure: three flight control computer lanes, FCC 1, FCC 2, and FCC 3, each containing three dissimilar processors and two comparators.]

Figure 2.2-1. Sperry flight control architecture.

2.3 FTP/AP.

The FTP is a tightly synchronized triple modular redundant system with identical hardware in each channel [Smit84] [Lala86]. It is fail-operational to hardware faults, including Byzantine faults, using interstage communicators to guarantee interactive consistency. The FTP/AP adds a fourth channel and a dissimilar applications processor to each of the four channels [Lala88] (see Figure 2.3-1). The FTP/AP is a redundant SISD machine with no multiprocessing capability. Each of the four channels in the FTP/AP contains a main processor, an application processor, and an interstage communication device. The four main processors and the four interstages make up the core FTP. The main processor handles synchronization, voting, scheduling, and I/O. The main processors are tightly synchronized so that they can perform a bit-for-bit comparison upon their results to detect failures. Input sensors and results from the application processors' dissimilar software will not be exactly the same bit for bit, so values from them are distributed to ensure simplex source congruency, then voted.

The software and hardware in the core FTP stay the same from application to application, so they are to be verified rigorously to prove that they have no design faults. The four applications processors are of four different hardware designs and run four dissimilar versions of software. This makes the FTP/AP fail-operational/fail-operational to similar hardware faults in the processors or interstage communicators, or to faults in the applications software. All four versions of software are stored in each application processor, but normally the processor executes only one of the four.

[Figure: channels A, B, C, and D, each containing a main processor, an interstage communicator, and an application processor.]

Figure 2.3-1. The FTP/AP architecture.

The main processors each input one of four redundant sensor channel values, then broadcast it to the other channels by passing it through the interstages (see Figure 2.3-2). The pseudocode for a value to be broadcast from channel i to all channels is:

    channel i:
        read sensor value(i)
        for j = 1 to n do
            send value(i) to interstage j
        end for

    all channels:
        for j = 1 to n do
            read value(i,j)
        end for
        value(i) = majority_vote(value(i,1), ..., value(i,n))


Each processor now has the same value for each simplex input sensor.

The processors perform a mid-value averaging vote upon the redundant sensors to arrive at a value for that input. If one of the simplex values is different from the voted value by more than an allowable tolerance, that input sensor is declared bad.
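A minimal Python sketch of this kind of sensor voting is given below. The use of the median as the mid-value and the particular tolerance are assumptions made for illustration; they are not taken from the FTP/AP documentation.

    # Sketch of mid-value selection over redundant sensor copies, with a
    # tolerance check that flags deviant simplex inputs.
    import statistics

    def mid_value_vote(samples, tolerance):
        voted = statistics.median(samples)           # mid-value of the redundant copies
        bad = [i for i, s in enumerate(samples)      # simplex inputs too far from the vote
               if abs(s - voted) > tolerance]
        return voted, bad

    voted, bad_sensors = mid_value_vote([100.2, 99.8, 100.1, 87.0], tolerance=1.0)
    print(voted, bad_sensors)   # 99.95 [3]: sensor 3 is declared bad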

[Figure: channel A's main processor distributes its sensor value through the interstage communicators to the main processors of channels B, C, and D.]

Figure 2.3-2. Distribution of channel value to all.

The voted values and sensor diagnoses are sent to all of the other main processors, where they are compared bit-for-bit to detect failures in any of the main processors. Each main processor also sends its voted input values to its application processor if a task requiring those values is executing.

When the application processor has arrived at its results, they are passed back to the main processors and exchanged using the same protocol as for the input sensor values. The application results are voted to arrive at a final output value. If any application value differs from the voted value by more than some tolerance, that application processor is declared bad. The main processors exchange the voted application values and diagnoses, which are voted bit-for-bit to detect any error in the main processors.

If an application processor is declared bad, then the problem may have been in the hardware or in the software. A procedure is followed to isolate the problem between these two. The state of the software version in question is rolled back to the state it had before the last execution. The iteration of the questionable version is then executed on all four processors. A bit-for-bit vote of the results of the three channels which are not suspects provides the suspect version's output with hardware faults masked. The algorithm compares the original value, the suspect channel's new value, and the hardware-fault-masked value from the other three processors. If the new iteration on the suspect channel agrees with the voted value of the other three channels, but not with the suspect's original value, then the fault was just a transient. If both the new iteration and the original iteration on the suspect channel disagree with the voted value of the other three channels, then a permanent hardware fault in the suspect channel is diagnosed. If both the new and original iterations of the suspect channel agree with the hardware-fault-masked value from the other three channels, it is an application version software failure.

If the malfunction was a hardware problem, that channel's results are not used in the future. If the malfunction was the result of an application version software error, then a confidence voter is invoked. If only one version of software has failed, there is a 3-1 split, and the majority value is taken.

There may be a correlated error in two of the versions which resulted in the failure of both, but which did not cause them to fail to the same incorrect value. In this case, there is a 2-1-1 split. The value chosen is that of the two versions which agree, while the identities of the pair which failed are logged. Later, if a correlated error occurs in two versions which causes the two failed versions to agree, then there is a 2-2 split. In that case, the value is chosen from the pair which has been shown to fail together as a pair in 2-1-1 splits the least often. This allows some correlated errors in two software versions to be tolerated.
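The following Python sketch illustrates one possible reading of this split-handling logic for four versions. The data structures (a list of (version, value) pairs and a dictionary of past pair-failure counts) and the use of exact value comparison are assumptions of the sketch, not the FTP/AP implementation.

    # Sketch of confidence voting over 3-1, 2-1-1, and 2-2 splits.
    from collections import Counter

    def confidence_vote(results, pair_failure_counts):
        groups = Counter(value for _, value in results)
        counts = sorted(groups.values(), reverse=True)
        majority_value = groups.most_common(1)[0][0]
        if counts[0] >= 3:
            # 4-0 or 3-1 split: take the majority value.
            return majority_value
        if counts == [2, 1, 1]:
            # 2-1-1 split: take the agreeing pair's value and log the two versions that failed.
            failed_pair = tuple(sorted(v for v, x in results if x != majority_value))
            pair_failure_counts[failed_pair] = pair_failure_counts.get(failed_pair, 0) + 1
            return majority_value
        # 2-2 split: choose the value of the pair that has failed together
        # least often in previous 2-1-1 splits.
        pairs = {tuple(sorted(v for v, x in results if x == value)): value
                 for value in groups}
        best_pair = min(pairs, key=lambda p: pair_failure_counts.get(p, 0))
        return pairs[best_pair]

    history = {}
    print(confidence_vote([("A", 1), ("B", 1), ("C", 2), ("D", 3)], history))  # 2-1-1: prints 1
    print(confidence_vote([("A", 1), ("B", 1), ("C", 2), ("D", 2)], history))  # 2-2: avoids pair (C, D), prints 1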

It was determined that it would be too difficult to reinitialize a failed software version to a state that is congruent to the other versions, so instead a failed version is reinitialized to a cold start state. It is then allowed to run with its output masked, in the hope that it will eventually start to agree with the other versions. The version is restored if its output agrees with the voted output for several iterations.

2.4 MAFT.

The Multicomputer Architecture for Fault Tolerance (MAFT) is a system developed by Bendix Corporation which has multiprocessing capability along with hardware and software fault tolerance, including tolerance to Byzantine faults [Kiec89] [Kiec88] [Gluc86] [Walt85]. It is a hardware implementation which builds upon the strategies of SIFT. A MAFT system has up to eight nodes, each connected to the other seven with a broadcast serial communication network. Each node is partitioned into two processors, the operations controller (OC) and the application processor (AP) (see Figure 2.4-1).

[Figure: operations controllers joined by a fully connected broadcast network; each operations controller is paired with an application processor running a distinct software version, and the application processors connect to an application-specific I/O network.]

Figure 2.4-1. The MAFT architecture.


The OC handles internode communication and synchronization, data voting, error detection, task scheduling, and system reconfiguration. The AP is application specific, performing I/O and computation of application functions. MAFT allows the application designer to program the system without concern for fault tolerance aspects, except that the designer must prepare three different versions of the application software, preferably in different high-level computer languages. A fault tolerance designer may then use several built-in tools for voting, depending upon the type (continuous or discrete) and the criticality of the result. The fault tolerance engineer assigns tasks to processors in a static fashion. They are only moved if reconfiguration is necessary after a fault, when they are moved to second-choice or third-choice processors. However, the tasks are scheduled dynamically on each processor. No distinction is made in the handling of hardware and software faults except those made by the application designer.

The facilities of the operations controllers are used to implement the application fault tolerance. They maintain synchronization with a hardware version of an optimal clock synchronization algorithm and use a Byzantine agreement protocol to arrive at distributed agreement on task scheduling and system reconfiguration. The results of any variables shared between application tasks are forwarded to all OC nodes upon completion of the task, where they are voted. To avoid waiting for hung tasks, each OC updates the voted value as the copies arrive rather than waiting for all expected values to arrive.

Errors are detected and counted for each node. If a node has had too many errors, then the system is reconfigured to exclude that node. No distinction is made between software-caused and hardware-caused errors, so it is possible for a node to be excluded due to errors in the software tasks executed upon it. The node continues to run after being excluded, and its results are still compared to those of the nodes in the current operating set. Thus, if the node recovers its internal state well enough to start getting the same answers that the other nodes do, it can be readmitted after some time.

A possible configuration for MAFT is discussed in [Gluc86]. A multitasking computer was implemented redundantly with six application processors, using two processors in each of three dissimilar hardware designs. The system requires at least three processors to be operational [Kiec88]. Processors may fail one at a time from hardware faults, or pairs with similar hardware and software may fail together due to a design fault in the hardware or software.
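A small Python sketch of this exclusion and readmission bookkeeping follows. The class name, the numerical thresholds, and the use of a simple consecutive-agreement window for readmission are assumptions for illustration; MAFT's actual error-count management is not reproduced here.

    # Sketch of per-node error bookkeeping: nodes accumulating too many errors
    # are excluded from the operating set, and an excluded node that agrees
    # with the voted results long enough may be readmitted.
    class NodeMonitor:
        EXCLUSION_THRESHOLD = 10      # errors before a node is excluded (assumed value)
        READMISSION_WINDOW = 100      # consecutive agreements before readmission (assumed value)

        def __init__(self, node_ids):
            self.error_count = {n: 0 for n in node_ids}
            self.agreement_streak = {n: 0 for n in node_ids}
            self.operating_set = set(node_ids)

        def record_result(self, node, agrees_with_vote):
            if agrees_with_vote:
                self.agreement_streak[node] += 1
                if (node not in self.operating_set and
                        self.agreement_streak[node] >= self.READMISSION_WINDOW):
                    self.operating_set.add(node)         # readmit the recovered node
            else:
                self.agreement_streak[node] = 0
                self.error_count[node] += 1
                if self.error_count[node] >= self.EXCLUSION_THRESHOLD:
                    self.operating_set.discard(node)     # reconfigure without the node

    monitor = NodeMonitor(["N1", "N2", "N3"])
    for _ in range(10):
        monitor.record_result("N2", agrees_with_vote=False)
    print(monitor.operating_set)   # N2 has been excluded; N1 and N3 remain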

2.5 Airbus A320.

The Airbus A320 is the first commercial transport to use fly-by-wire control for its primary flight control system [Rouq86]. There is a mechanical backup system to make the plane easier to certify for flight, but the mechanical backup has limited authority (i.e., it cannot command the aircraft control surfaces through their full range of motion), and it is hoped that it will never be used. The basic building block of the flight control system is a duplex computer (see Figure 2.5-1), with each channel monitoring the other; thus each computer is a self-checking pair. Each of the two channels has similar hardware, but dissimilar software.

Figure 2.5-1. A single A320 computer (SEC or ELAC).

There are two types of computers based upon the self-checking pairs: the Elevator and Aileron Computer (ELAC), based upon the Motorola 68000, and the Spoiler and Elevator Computer (SEC), based upon the 80186. Each of the two computers has two different programs, so a total of four computer programs have been developed.

The A320 pitch control system is shown in Figure 2.5-2. The sidestick commands are sent to two ELACs and two SECs. Under normal operation, one of the ELACs controls the elevator; if it fails, the other ELAC takes over. If two physical hardware failures or a single generic failure in the software or hardware cause both of the ELACs to shut down, then elevator control is performed by one of the two SECs.

This system has a weakness: although there are two types of processors used, the dissimilar computer programs which compare with each other run on similar hardware. Thus a generic hardware fault which causes both channels in a computer to fail in a similar manner will not be detected.

[Figure: sidestick commands feed two ELACs and two SECs, which produce the elevator commands.]

Figure 2.5-2. A320 pitch control system.

2.6 Summary.

The four architectures reviewed above have several common characteristics:

1) computer-level fault tolerance - redundant computers are used as the building blocks for fault tolerance. In the past, redundancy has been explored at lower levels of the design (e.g., processors, memories, and buses in FTMP), but with the development of VLSI it is more economical to use a higher-level unit for redundancy.

2) generic error detection - all use dissimilar software for different channels in order to detect design errors in the software. Dissimilar hardware is also used, at least as a backup. In all but the A320 architecture, dissimilar hardware processors are used to check each other's results.

3) synchronous - all use comparison of synchronized hardware channels for fault detection, although the level at which they are synchronized differs.

4) SISD - with the exception of MAFT, none are capable of multiprocessing. MAFT is theoretically capable, but the system described in [Gluc86] is an SISD system. A full MAFT system would have only eight nodes, which does not allow a high degree of multiprocessing.

The similarities above provide guidance for the URMC architecture. It will use fault tolerance at the computer level, dissimilar designs for generic error detection, and synchronized hardware. However, it will be a MIMD computer which is capable of considerable parallelism in its processing. Several issues must be addressed before such a computer can be built, and these are the subject of the rest of this thesis.

3. URMC Architecture.

This chapter gives an overview of the URMC architecture, justifying the top-level design, explaining the approach to fault tolerance, and assigning system level requirements to different parts of the design. Section 3.1 describes the concept of design through layers of abstraction, showing that this has been successfully used for design of computers and communication networks in the past. Section 3.2 reviews previous work on adding an extra layer of abstraction, the recovery layer, to simplify the design of software fault tolerance.

The new work begins in section 3.3. First a justification for developing a message-passing MIMD machine (multicomputer) is presented. Next it is explained that the URMC design extends the recovery layer approach, adding several fault-tolerant layers, with each layer detecting, diagnosing, masking, and recovering from its assigned fault type. It is decided to have three of these layers, for software, processing node hardware, and interconnection network hardware. Top-level requirements are assigned to the three fault tolerant layers in section 3.4. These requirements are to be met by the designs for the individual layers described in chapters 4, 5, and 6. Finally, in section 3.5, a top level reliability model for the URMC is shown.


3.1 Virtual Machine Design Approach.

Design of computers may be performed using layers of abstraction, designing a hierarchy of virtual machines. Each virtual machine presents a list of operations as its interface to the next higher layer. The machine uses one or more of the operations of the level below it to construct its own operations.

This process makes the design modular, greatly reducing the psychological complexity of the design process. The layers of abstraction may be used only as a design aid, or they may be actually implemented in the final system. If implemented in the final system, then another advantage is derived - the implementation of a lower layer of abstraction may be changed without affecting the layers above. Figure 3.1-1 shows a typical virtual machine hierarchy for a single instruction stream, single data stream (SISD) computer.

A similar approach may be used for the design of communication networks, such as the International Standards Organization's Open Systems Interconnect (OSI) model shown in Figure 3.1-2.

The services of a layer may be implemented using several services of the layer below, or only one. For instance, an operating system layer provides the services of system calls and of nonprivileged machine instructions to a high-level language layer, and uses the services of the machine language level to do this. Each of the system calls is constructed from many machine language instructions, while the user machine instructions are made from just one architectural machine instruction.

[Figure: layer hierarchy, from top to bottom: Application, High-Level Language, Operating System, Machine Language, Microcode, Register Transfer, Logic, Circuit.]

Figure 3.1-1. Layers of abstraction in an SISD computer.

[Figure: OSI layers, from top to bottom: Application, Presentation, Session, Transport, Network, Data Link, Physical.]

Figure 3.1-2. Layers of abstraction in the OSI Model.

3.2 The Recovery Layer Concept.

Kim suggests [KimK84] that a fault-tolerant computer be designed by providing fault tolerance at each of the layers of abstraction. For each type of fault at each layer, the designer must determine whether to handle the fault or to allow it to propagate to a higher level. Adding the fault-handling and decision-making functions to each layer complicates the design of the virtual machines. [Anco87] suggests inserting an additional layer in the virtual machine hierarchy in order to handle software faults without complicating the design of the software layer. This extra layer is called the recovery layer, and provides tolerance to software faults through the execution of a recovery metaprogram.

The application software layer can then be simpler, developed for the application without concern for fault tolerance mechanisms. In this thesis, the idea of a recovery layer is extended, so that several layers are inserted in the virtual machine hierarchy to handle different types of faults. Each fault-tolerant layer uses the services of a fault-susceptible layer below it to construct fault-tolerant services used by the layer above it. This simplifies the design of the layers of the virtual machine by removing the need to place fault tolerance in each. Each of the fault-tolerating layers will have to tolerate only one type of fault, and can neglect others. This approach to design does not improve the fault coverage, except incidentally through simplifying the fault tolerance designer's task. It must next be determined how many of these layers to insert, and where to place them in the hierarchy.

3.3 Top-Level URMC Design.

The first decision to be made is to what general class of computer the next generation of flight control computers should belong. There are four classifications of computers: single instruction stream with a single data stream (SISD), single instruction stream with multiple data streams (SIMD), multiple instruction streams with a single data stream (MISD), and multiple instruction streams with multiple data streams (MIMD) [Flyn66]. The most flexible of these is the MIMD machine. An MIMD machine may be either a multiprocessor, which has several processors sharing a common memory, or a multicomputer, which has several processors with local memories communicating only through message passing. Since each of the processing subsystems of a multicomputer is autonomous, a multicomputer has more isolation between processors than a multiprocessor, and is therefore better for isolating faults. Also, shared memory can act as a single point of failure unless duplicated. It was thus decided that the next generation of critical control computers should be multicomputers, hence the name Ultrareliable Multicomputer (URMC).

The next decision is how to divide the fault tolerance requirements into layers. One obvious split is into hardware and software. Fault tolerance may be provided separately for these two areas. Since multicomputers have several processing nodes which communicate through an interconnection network, the hardware can be further broken down into processing node design and interconnection network design. Examination of the current literature found that there are several fault tolerance strategies for processing nodes, and several fault tolerance strategies for communications networks. It was therefore decided to divide the hardware fault tolerance into two types, processing node fault tolerance and interconnection network fault tolerance. Thus there are three basic layers of fault tolerance, which will be inserted as three layers (or groups of layers) in the hierarchy of virtual machines.

Since hardware is present only to allow the execution of software, the software fault tolerance layer is placed above the hardware layers. Having the hardware fault tolerance below the software allows a scheme to be designed to handle software faults without concern for the hardware. The problems of distinguishing software faults from hardware faults and of rerouting software around faulty hardware are eliminated from this layer. The design of software fault tolerance in the URMC will be discussed in chapter 4.

For hardware fault tolerance, it must be decided which of the two layers will be on top, the processing node fault tolerance or the communication network fault tolerance. Many algorithms have been developed for the redundancy management of the processors in a distributed system. Most of these assume perfect communication channels, though some do not. In order to allow the use of any of these algorithms, the communication fault tolerance layer will be placed below the node fault tolerance layer. This allows the isolation and detection of node faults with the assumption of perfect reliability of the communication links between the nodes. The design of the processing node fault tolerance will be discussed in chapter 5.


[Figure: fault-tolerant software above fault-tolerant hardware; below that, a fault-tolerant communication layer built on several OSI-like layers down to circuit components, in parallel with several node hardware layers down to circuit components.]

Figure 3.3-1. URMC layers of abstraction.

The communication layer is not used for constructing the processing node hardware, but only for the construction of the processing node fault tolerance. Therefore the communication layer is not placed below all of the node hardware layers, but will be placed below the hardware fault tolerance layer. The design of the interconnection network and of the processing nodes themselves will be separated and parallel below the hardware fault tolerance layer.


The positioning of the communication fault tolerance below the node fault tolerance is reinforced by the fact that there are many techniques for multiple routing of messages between single pairs of nodes which do not assume that the nodes are failure-free. Design of the communication layer is discussed in chapter 6.

The resulting structure of the URMC layers of abstraction is shown in Figure 3.3-1. Note that the node hardware and communication hardware are separate until joined in the hardware fault tolerance layer.

3.4 Allocation of System Requirements to Layers.

Since many multicomputers have been designed in the past [Atha88], this section will outline only the allocation of fault tolerance and reliability requirements to the layers of the URMC. Other requirements, such as language support, scheduling, etc., can be met using one of the many techniques presented in the literature.

3.4.1 Software Fault Tolerance.

It would be too ambitious to expect the software fault tolerance layer to mask faults in the specification of the software, but it should be able to tolerate faults in the implementation of the software. It should also provide support for fault tolerance in a concurrent processing model, as the next generation flight control computer will be programmed in a concurrent language such as Ada [MIL-STD-1815A].

When a failure occurs in the software, it must be detected so that action may be taken to correct it. The software module which caused the fault must be diagnosed so that the effects of the fault may be isolated to that module or the fault containment region of which it is a part. Software faults are all design faults present since the code was written, but a failure is triggered by some sequence of events or particular data input [Hech86]. Since software in a real-time system has generally been thoroughly tested, the fault will probably not occur again for some time, arising only due to exceptional, untested circumstances. An input in a certain area of the input space will cause the design fault to be manifested as an error. If the internal state is not corrupted, the software will behave correctly when the input leaves this area [Bish88]. Thus, software failures can usually be considered as transient failures [TsoK86]. If possible, the failed channel should be restored to a normal state, so that it can rejoin the system and provide resilience against software failures in other channels. These requirements will be addressed in the discussion of the software fault tolerance layer(s) in chapter 4.

3.4.2 Processing Node Fault Tolerance.

The throughput of current flight control computers is on the order of 5.5 million instructions per second (MIPS), with I/O rates of 1 million bits per second (MBPS) [Kiec88]. One question not answered in the reference is what type of instruction is meant: 16-bit or 32-bit, CISC or RISC? The date of the development effort suggests that a 16-bit CISC processor was intended. The next generation system should be capable of significantly higher throughput, both through the use of faster, 32-bit processors and through the capability of having more nodes in the multicomputer.

The node fault tolerance layer must provide hardware service to the software layers above as if hardware failures never occurred. If a node fails, the failure must be detected and the fault diagnosed as being caused by a certain node, then that node isolated so that the fault does not spread. Many hardware faults are transient, so recovery should be performed using the node which failed, if possible. If the failure is permanent, the system should be restructured to replace the faulty node with a spare node, with minimal disturbance of the software tasks which had been executing on the faulty node.

If the software is diverse, then the underlying hardware of the software versions will be in different states, and a hardware design error is not likely to cause failure in more than one software version's hardware at a time [Aviz86]. A failure which occurs due to a hardware design fault may be detected, isolated, and recovered from as if it were a software fault. However, since some application designers will wish to have dissimilar versions of hardware (because of the chance of correlated hardware design errors, or environmentally induced hardware failures), the node fault tolerance layer of the URMC shall be able to support dissimilar node hardware.


I/O presents a special problem for reconfiguration after a fault. The I/O lines are often hardwired to particular nodes of the computer, so I/O drivers cannot migrate to an available spare node upon failure of the original. A broadcast bus design does not suffer from this limitation, but serves as an I/O bottleneck. It will be assumed that I/O is from/to redundant sensors/actuators. The signals are replicated and wired to each of several I/O nodes. A means to handle I/O node failures separately from failures in non-I/O nodes must be found.

The implementation chosen for the software layers will cause additional requirements to be placed on the processing node fault tolerance layer. These are called derived requirements, and are discussed in section 5.1. Chapter 5 discusses the implementation of hardware fault tolerance to satisfy the requirements stated in this section as well as the derived requirements of section 5.1.

3.4.3 Interconnection Network Fault Tolerance.

The communication fault-tolerant layer is the lowest fault-tolerating layer in the URMC. This layer must transmit message packets between nodes reliably. If a message is sent between a pair of nodes, then the communication fault tolerance layer guarantees that it will be delivered correctly within some time limit.


To increase the dissimilarity of the parallel algorithms, different virtual hardware configurations may be assumed, which are then mapped to the physical hardware configuration. For instance, three versions of software may be prepared, with one designed to run on a tree network, one on a shuffle-exchange, and one on a mesh. These must then be mapped to the physical architecture. The easiest mapping is to a completely connected machine. As it is impractical to provide a completely connected multicomputer with more than a few processors, a communication scheme will be designed which provides a reliable virtual completely connected machine using the services of a fault-susceptible, partially connected physical network.

The interconnection network should not be liable to destruction by a single incident of damage to the airplane, such as an engine explosion. Thus the interconnection network must be redundant and physically distributed. In addition, further derived requirements are imposed upon the communication system by the implementations chosen for the software and processing node fault tolerance. These derived requirements are discussed in section 6.1. Chapter 6 describes an implementation scheme for the communication layer which satisfies the requirements of this section as well as those of section 6.1.

3.5 Top-Level Reliability Model.

A pessimistic approach to determining the reliability is to assume that a failure of any one of the three fault-tolerating layers will cause the system to fail. This may not always be true. For instance, loss of communication between a pair of nodes is not bad until those two nodes wish to communicate, and loss of all hardware nodes capable of running a software task is not bad unless that task should be run. However, the series approximation will be a decent lower bound on reliability, so the top-level reliability diagram chosen for the URMC is that of Figure 3.5-1.

[Figure: three blocks in series: software, processing node, and communication fault tolerance.]

Figure 3.5-1. Top-level reliability diagram for URMC.
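The series approximation of Figure 3.5-1 can be evaluated with the minimal Python sketch below. The individual layer failure probabilities used in the example are illustrative placeholders only, not the allocations made later in the thesis.

    # Sketch of the series (pessimistic) reliability approximation: the system
    # is assumed to fail if any of the three fault-tolerating layers fails.
    def series_reliability(layer_failure_probs):
        r = 1.0
        for p in layer_failure_probs:
            r *= (1.0 - p)          # every layer must survive
        return r

    p_software, p_node, p_comm = 8e-11, 1e-11, 1e-11   # assumed per-hour values
    r = series_reliability([p_software, p_node, p_comm])
    print(1.0 - r)    # system failure probability, approximately the sum (about 1e-10)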

The failure rate goal for a critical system such as a flight control computer is 1 x 10^-10 failures per hour [Gluc86]. This total failure rate must be distributed amongst the three blocks of the URMC reliability diagram. It is very difficult to achieve high reliability in software due to the fact that many failures are correlated between versions [Knig86]. Therefore it will be assumed that most of the total allowable system failure rate will be due to software failures, with the failure rate of the processing node and communication fault tolerance being much lower.

4. Software Fault Tolerance.

This chapter discusses software fault tolerance in the URMC. In section 4.1, the two most used fault-tolerant constructs for sequential software are described: the recovery block [Rand75] and n-version programming [Aviz85]. Section 4.2 reviews three constructs which have been proposed to extend the recovery block concept to concurrent software, programmer-transparent coordination [KimK86], the conversation [Rand75], and the colloquy [Greg85]; reviews a construct which extends n-version programming to concurrent software, modular redundancy in CSP [Manc86a]; and reviews a construct which may be used to implement n-version programs in distributed systems, the resilient procedure [LinK86]. New work begins in section 4.3, which compares the fault-tolerant constructs of sections 4.1 and 4.2 for use in the URMC. Section 4.3 decides that n-version programming best meets the fault tolerance and real-time requirements of most software in a critical control computer. While two of the constructs reviewed in section 4.2 extend n-version programming to concurrent software, it is determined that more work must be done. Section 4.4 suggests three types of concurrent n-version programming (CNVP): they are process-dissimilar, subprogram-dissimilar, and structure-dissimilar CNVP. For each of the three CNVP constructs, an implementation based upon a recovery layer [Fern89a] is suggested. In section 4.5 a reliability model for the three types of concurrent n-version programming is derived by extending a previously presented model [TsoK86] for sequential n-version programming. Section 4.6 concludes with a summary of software fault tolerance in the URMC.

4.1. Sequential Constructs.

Several software constructs which can be used to implement software fault tolerance in sequential programs have been suggested, chiefly the recovery block [Rand75] and n-version programming [Aviz85]. Both rely upon redundant components with diverse designs for fault tolerance. The constructs are thus similar in their approach to fault tolerance, but differ in their approach to redundancy management. Data diversity [Amma87] has been suggested as a simpler alternative to design diversity, or as a source of additional diversity in systems built with design diversity.

4.1.1 Recovery Blocks.

A recovery block [Rand75] consists of a recovery point, an acceptance test, and several try blocks. The syntax of a recovery block is:

    ensure acceptance_test
        by try_1
        else by try_2
        ...
        else by try_n
        else fail
    end ensure

The semantics of a recovery block are as follows. At the entry to the block, a recovery point is established. The states which all variables modified in the recovery block have at the recovery point are saved. The first try block is executed, then the acceptance test is performed upon the result. If the result passes the acceptance test, the modified variables stay modified, the recovery point information is discarded, and the recovery block is exited. If the result of the first try block fails the acceptance test, the variables modified during the course of execution of the try block are restored to the values which they had at the recovery point, the second try block is executed, and the acceptance test is repeated. The process continues until either a try block's results pass the acceptance test or all n try blocks have been tried and have failed. If all n try blocks fail, the recovery block fails.

When a try block's results do not pass the acceptance test, all data is restored to the value it had prior to the try block's execution. Fault recovery is therefore performed automatically, and no further recovery is necessary. Extra time must be allowed for the execution of a recovery block in case of failure, to allow time for the data to be restored and for the extra try block(s) to be executed.
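As an illustrative sketch only, the recovery block semantics can be modelled in a few lines of Python. Checkpointing here is simulated by copying a dictionary of program variables; a real implementation would save only the variables modified inside the block.

    # Minimal sketch of recovery-block execution with rollback between tries.
    import copy

    class RecoveryBlockFailure(Exception):
        pass

    def recovery_block(state, acceptance_test, try_blocks):
        recovery_point = copy.deepcopy(state)        # establish the recovery point
        for try_block in try_blocks:
            try:
                result = try_block(state)
                if acceptance_test(result, state):   # result accepted: discard checkpoint
                    return result
            except Exception:
                pass                                 # a raised error also counts as a failure
            state.clear()
            state.update(copy.deepcopy(recovery_point))   # roll back before the next try
        raise RecoveryBlockFailure("all try blocks failed the acceptance test")

    # Example: the primary try block produces a negative square-root estimate and
    # fails the acceptance test; the alternate is then run on the restored state.
    state = {"x": 2.0}
    primary = lambda s: -abs(s["x"]) ** 0.5
    alternate = lambda s: s["x"] ** 0.5
    accept = lambda result, s: result >= 0 and abs(result * result - s["x"]) < 1e-6
    print(recovery_block(state, accept, [primary, alternate]))   # 1.4142...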

Several variations are possible on this basic structure. In one variation of the recovery block, the first try block attempts the complete computation, while subsequent try blocks attempt to compute only critical answers, or perform a less precise computation [KimK84]. This could require more than one type of acceptance test.

Another variation on the recovery block is the distributed recovery block [KimK89], in which a recovery block with n tries is replicated on n different processors. Each copy of the recovery block has the same try blocks, but they do not execute the tries in the same order. For example, a two-version recovery block is distributed to two processors, one the primary processor and the other the alternate. One processor uses version A as its primary try block and version B as its alternate, while the other processor uses B as the primary and A as the alternate. Both processors execute their primary try block, then check the results with the acceptance test. If both results pass, the results of the primary processor are output and the secondary processor's results are ignored. If the primary processor's results pass and the secondary processor's results do not, then the secondary uses its alternate try block to compute the correct result, thus recovering from the failure. If the primary processor's results do not pass the acceptance test and the secondary's results do, then the secondary processor becomes the primary and vice versa. The new secondary (old primary) processor then recovers from the failure by executing its alternate try block. If both processors' results fail the acceptance test, then the recovery block fails.
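The decision logic of the two-processor distributed recovery block can be sketched as follows. Both "processors" are simulated within one Python process and the try blocks are plain functions; these are simplifying assumptions of the sketch, not the distributed implementation of [KimK89].

    # Sketch of the two-node distributed recovery block (DRB) decision logic.
    def drb_step(nodes, acceptance_test, data):
        # nodes: list of two dicts with keys 'role', 'primary_try', 'alternate_try'.
        primary = next(n for n in nodes if n["role"] == "primary")
        shadow = next(n for n in nodes if n["role"] == "shadow")
        p_result = primary["primary_try"](data)
        s_result = shadow["primary_try"](data)
        p_ok, s_ok = acceptance_test(p_result), acceptance_test(s_result)

        if p_ok and s_ok:
            return p_result                            # shadow's result is ignored
        if p_ok and not s_ok:
            shadow["alternate_try"](data)              # shadow recovers with its alternate try
            return p_result
        if s_ok and not p_ok:
            primary["role"], shadow["role"] = "shadow", "primary"   # role switch
            primary["alternate_try"](data)             # old primary recovers with its alternate
            return s_result
        raise RuntimeError("distributed recovery block failed on both nodes")

    accept = lambda r: r >= 0
    nodes = [
        {"role": "primary", "primary_try": lambda x: x * x, "alternate_try": lambda x: abs(x) * abs(x)},
        {"role": "shadow", "primary_try": lambda x: -1, "alternate_try": lambda x: x * x},
    ]
    print(drb_step(nodes, accept, 3))   # 9: primary passes, shadow recovers locally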

4.1.2 N-Version Programming.

An n-version program [Aviz85] has n versions of software which perform the same end computation by differing means, as do the try blocks in a recovery block. However, n-version programming uses voting to establish correctness rather than an acceptance test. Every version of software is executed, then the results are voted to pick one answer. An n-version program has the form:

    par
        version_1
        version_2
        ...
        version_n
    end par
    vote result

The versions may be executed sequentially or in parallel, as long as all

of the versions' results are computed before voting - usually they are executed

in parallel. The voter may use any of several techniques, such as selecting the

majority's value or selecting a middle value.
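For example, a majority voter and a middle-value voter for three versions could be sketched as follows; the comparison tolerance is an assumed parameter, since correct versions of numerical software may differ slightly.

    from statistics import median

    def majority_vote(results, tolerance=0.0):
        """Return a value that a majority of versions agree upon, else None."""
        for candidate in results:
            agreeing = sum(abs(r - candidate) <= tolerance for r in results)
            if agreeing > len(results) // 2:
                return candidate
        return None                              # no majority: the vote fails

    def midvalue_vote(results):
        """Return the middle value; one wrong version cannot be selected."""
        return median(results)

    print(majority_vote([4.0, 4.0, 97.3]))       # 4.0, the bad version is masked
    print(midvalue_vote([4.0, 4.1, 97.3]))       # 4.1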

Upon the failure of one version, the output of the majority voter remains

correct, thus masking the failure. If the software version which failed has any

internal states, they may be wrong. It is desirable to correct this problem so

that the version may provide a correct output later if another version fails, thus

helping to mask that failure. A method for accomplishing this is Community

Error Recovery (CER) [TsoK86], which uses default exception handling to

achieve forward error recovery from errors in n-version software. The CER method is based upon two levels of recovery: cross-check

points (cc-points) and recovery points (r-points). The cc-points allow recovery from errors which produce wrong results but do not interfere with the correct

control flow in the version. The r-points allow recovery from errors which cause

incorrect control flow in the failed version.

At the cc-points, the versions send their current result to a supervisor program. The supervisor runs a decision function to arrive at a decision result,

and this value is sent back to the versions. The versions use the decision result in subsequent computations regardless of whether it agrees with their

own result or not. This will often be enough to recover from a failure, as was

demonstrated in [TsoK87].

At recovery points, each version submits a recovery point identification

to show that it is at the recovery point. If a version sends the supervisor a

missing or incorrect recovery point then the supervisor activates a state-output exception handler in every good version. The state-output exception handler

outputs the values of all internal states in a version to the supervisor. The

internal variables from all good versions are used in a decision function to

arrive at a decision state. The supervisor then activates a state-input exception

handler in the bad version, which replaces the internal state of the bad version

with the decision state.
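A rough sketch of the supervisor's side of CER is given below; the Version class and its state-output and state-input handlers are hypothetical stand-ins for the interface the scheme assumes, and the decision functions are left as parameters.

    class Version:
        """Stand-in for one software version with CER exception handlers."""
        def __init__(self, state):
            self.state = dict(state)
        def state_output(self):            # state-output exception handler
            return dict(self.state)
        def state_input(self, decision):   # state-input exception handler
            self.state = dict(decision)

    def cc_point(decide, version_results):
        """Cross-check point: vote the intermediate results and return the
        decision value that every version must use from here on."""
        return decide(version_results)

    def r_point(decide, versions, submitted_ids, expected_id):
        """Recovery point: a missing or wrong identification triggers
        recovery of the failed version from the good versions' states."""
        for version, rp_id in zip(versions, submitted_ids):
            if rp_id != expected_id:
                good_states = [v.state_output() for v in versions if v is not version]
                version.state_input(decide(good_states))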


4.2. Concurrent Constructs.

Several concurrent constructs have been derived by extending sequential constructs to concurrent software. The recovery block concept has been extended as programmer-transparent coordination [KimK86], conversations [Rand75], and colloquies [Greg85]. The n-version programming concept has been extended as modular redundancy in communicating sequential processes [Manc86].

4.2.1 Recovery Block Extensions.

When extending the recovery block to concurrent software, the chief problem is to avoid the domino effect [Rand75] when a recovery block fails its acceptance test. Figure 4.2.1-1 illustrates the problem with three processes, each of which has entered four nested recovery blocks that it has not yet exited. The brackets indicate the recovery points of the recovery blocks and the dashed lines indicate interactions between processes. If process 1 fails, it will back up to its fourth recovery point. If process 2 fails, it will back up to its fourth recovery point past an interaction with process 1, which must therefore back up to its third recovery point. If process 3 fails, all processes will end up rolling back to their first recovery points.

The domino problem can occur any time that:

1) the recovery block structures of the various processes are
   uncoordinated, and
2) either member of any pair of interacting processes can cause the other
   to back up.


Figure 4.2.1-1. Domino effect from uncoordinated recovery blocks.

4.2.1.1 PTC.

The programmer transparent coordination (PTC) scheme allows processes to be programmed with recovery blocks without burdening the program designer with the task of coordinating the recovery points of interacting processes. The original PTC scheme [KimK84] relies upon a centralized monitor to handle communication between processes. Recently

[KimK86], the PTC has been extended to loosely coupled networks, called

PTC/LCN. The latter scheme will be reviewed here.

The computation model used is a loosely coupled network, where there are multiple processing nodes with no shared memory. A process which sends information to another is called an exporter process, while the receiver is called an importer process. The processes communicate through point-to­

point messages which are handled by an underlying communication

subsystem. The communication subsystem is assumed to deliver a message to its destination within a bounded communication delay without loss of the content of the message.

There are four major elements to the PTC/LCN scheme:

1) Each process performs error detection and recovery using recovery
   blocks, with no error-handling coordination between processes.

2) An exporter process may send an importer process a message which
   contains material which is not yet fully validated; thus a message may
   be sent during execution of a try in a recovery block, and if the
   message is later invalidated, the exporter revokes the message.

3) Each process is responsible for detecting and correcting all errors
   which it originated. The processes must accept all imported data as
   correct unless the exporter revokes the message.

4) Before a process imports uncommitted information, a recovery point is
   automatically inserted by the underlying machine. Thus if the
   information is later revoked by the exporter, the importer will not have
   to roll back any further than the point at which it got the bad
   information.

This new type of recovery point is termed a branch recovery point to distinguish it from the recovery points established at the beginning of a recovery block, which are called base recovery points.

If a process imports information which may be revoked, it is called a direct dependent upon the exporter, and the exporter is called a direct potential recaller of the importer. If the exporter may have to revoke the information when information which it previously received from a third process is revoked, then the third process is called an indirect potential recaller of the importing process. A branch recovery point is established when a process imports information from an exporter which has a direct potential recaller which is not a direct potential recaller of the importing process.

If an acceptance test is passed for a recovery block, but branch

recovery points still exist in that block, it is partially validated. A branch recovery point may be discarded when all processes in its potential recall set

(direct or indirect) have been partially validated. A recovery block which has passed the acceptance test and which either contained no branch recovery

points or had them all discarded is completely validated. A base recovery

point may be discarded when its recovery block has been completely validated. The PTC/LCN simplifies the design of the program at the expense of a more complicated underlying machine and the loss of the ability to detect failures in other processes.


4.2.1.2 The Conversation.


Figure 4.2.1.2-1. A conversation with three processes.

A conversation [Rand75] is a two-dimensional recovery block (see Figure 4.2.1.2-1) which spans two or more processes. It prevents the domino effect by forcing the programmer to coordinate the recovery points in different processes. At the start of the conversation is a recovery line, at which each process establishes a recovery point. Sidewalls are erected which prevent the processes in the conversation from communicating with any process outside the conversation. The processes then proceed concurrently, executing their primary try block, until they reach the test line. At the test line, a local acceptance test is executed in each process. If all local acceptance tests pass, the recovery line values are discarded, the sidewalls are taken down, and execution continues. If any acceptance test fails, all of the processes in the conversation restore their state to that of the recovery line, then execute an alternate try block.

4.2.1.3 The Colloquy.

The colloquy [Greg85] was introduced to provide a backward error recovery mechanism which is more general than the conversation. A colloquy

is a group of processes which work together to achieve a common goal, executing one or more dialogs. A dialog is like a single try block in a conversation. A dialog is an occurrence in which a set of processes: 1) establish individual recovery points,

2) communicate among themselves and no others,

3) determine whether all should discard their recovery points and proceed

or restore their states from their recovery points and proceed, and

4) follow this determination.

A dialog differs from a conversation in several ways:

1) there may be a time limit on the execution,

2) there may be a global acceptance test which is performed in addition to

the local acceptance tests performed in each process,

3) if the local and global acceptance tests can be passed without the
   participation of some of the processes, then the dialog can succeed
   without the participation of those processes, and
4) after a rollback, the processes are free to participate in another dialog
   with different processes as their alternate approach - they do not need

to try again with the same processes with which they just failed. A colloquy is an execution time concept, a collection of dialogs which attempts to accomplish some global goal. Entry of a process into a colloquy is

announced by a dialog sequence, which has the following syntax:

    select
        attempt_1
    or
        attempt_2
    ...
    or
        attempt_n
    time-out
        sequence_of_statements
    else
        sequence_of_statements
    end select;

When the process reaches the SELECT statement, it establishes a

recovery point. The process then attempts to perform the procedure in

attempt_1 , which may be to engage in a dialog with some other processes. If

that dialog fails, the process will try the other attempts, which may be other

dialogs with other sets of processes. If any dialog succeeds, the SELECT statement succeeds. If the time limit expires before all attempts are completed,

the execution of the dialog which the process is currently executing will fail, and the sequence of statements after TIME-OUT will be executed. If all attempts have failed before the TIME-OUT, the sequence of statements after the ELSE statement will be executed.
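The control flow of the dialog sequence can be sketched as follows; each attempt is assumed to be a callable that runs one dialog and reports success or failure, and the time limit is checked only between attempts, which is a simplification of the construct.

    import time

    def dialog_sequence(attempts, time_limit, on_timeout, on_all_failed):
        """Execute the attempts of a SELECT statement in order.

        A recovery point would be established on entry; it is implicit here."""
        deadline = time.monotonic() + time_limit
        for attempt in attempts:
            if time.monotonic() > deadline:
                return on_timeout()          # TIME-OUT branch
            if attempt():                    # this dialog succeeded
                return True
        return on_all_failed()               # ELSE branch: every dialog failed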

4.2.2 N-Version Programming Extensions.

When extending n-version programming to concurrent software, the

chief problem has to do with the nondeterminism which is possible in several

concurrent software models. In a nondeterministic situation a correct program may take any of several actions. Correct versions could obtain completely

different results by performing different correct actions.

4.2.2.1 Modular Redundancy in CSP.

The CSP model of concurrent computation [Hoar78] represents a

concurrent program as several sequential processes which communicate only

through the synchronous send and receive of messages over predefined channels. Mancini suggests [Manc86] that fault tolerance can be added to the

CSP model by allowing a process P to be replicated n times, making an n­

redundant module nP. For hardware fault tolerance, one implements the

processes identically and maps them to n different processors. If software fault

tolerance is desired, the n processes have the same message passing and

receiving specifications, but one implements the n versions dissimilarly. The

notation of CSP is extended to allow modular redundancy of some or all of the processes. To prevent the problem of nondeterminism causing correct versions to reach different results, two chief requirements are imposed:

1) all the copies P1 ... Pn of a multiple module nP must resolve
   nondeterminism in an identical manner, and
2) all copies of a multiple module must process in an identical order the
   contents of all input channels which are used in some input guards of
   an alternative command.

Two solutions to the latter requirement are presented. The first uses a centralized communication structure to ensure that all inputs arrive in the same

order at all n modules. The second solution is a distributed solution, and uses

the services of the distributed system kernel. The kernel in each processor which has one of the n copies of a process P arrives at a distributed agreement with the other n - 1 kernels about the order of the messages. As

the latter solution will be of interest later in this thesis, it is presented here in

more detail.

The following assumptions are made:

1) a message will be correctly delivered within some time interval, delta, if
   it is sent (i.e., none are lost),
2) processors communicate only by means of two-party messages,
3) the sender of a message is always identifiable by the receiver, and
4) all nonfaulty processors have clocks that have been synchronized.

The sending of a message from a module Q to a multiple module nP

proceeds as follows:


1) The sender process Q sends a copy of the message to the kernels
   (N1 ... Nn) in the nodes containing the n copies of the receiver process
   (nP, or P1 ... Pn).

2) Each kernel periodically starts an instance of an algorithm for
   interactive consistency. Each message is uniquely identified by the
   triple (sender, receiver, msg_type). The kernels arrive at an interactive
   consistency on the identifiers for the messages which they have
   received.

3) Based upon the interactive consistency vector, each processor
   determines the set I, the union of all messages which have been
   received by any of the instances N1 ... Nn.

4) The processor waits the time interval delta for any messages which it
   has not received. If a message never comes, it is given the value
   NULL. Due to assumption 1, this can only happen if a faulty Nj has
   informed the others of receiving a message which it did not actually
   receive. Since all nonfaulty processes will mark the message contents
   as NULL, they will still be in agreement.

5) Each kernel orders the messages in I according to a priority list
   referring to the senders. Each message receives a mark which specifies
   the ordering.

6) Each kernel Ni inserts the marked messages into the channels leading
   to Pi.

Note that distributed agreement is not reached upon the contents of the messages, but only upon the identifiers. This is possible because of assumption (1), which assumes all messages are correctly transmitted. If the sending process is multiple, the kernels run a voting algorithm to choose a value to send to Pi.
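Steps 3 through 6 might be sketched as below, under the assumption that the interactive consistency exchange has already left every kernel with the same agreed set of message identifiers; the wait of length delta for late messages is omitted.

    NULL = None

    def order_agreed_messages(agreed_ids, received, sender_priority):
        """Build the identically ordered input sequence at one kernel.

        agreed_ids      - identifiers (sender, receiver, msg_type) in the
                          agreed set I
        received        - identifier -> contents actually received here
        sender_priority - list of senders, earlier means higher priority"""
        ordered = sorted(agreed_ids,
                         key=lambda ident: sender_priority.index(ident[0]))
        # A message reported by others but never received here is marked NULL;
        # by assumption 1 this only happens when the reporter was faulty, so
        # every nonfaulty kernel marks it NULL and they remain in agreement.
        return [(ident, received.get(ident, NULL)) for ident in ordered]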

4.2.2.2 Resilient Procedures.

The resilient procedure [UnK86] was developed to implement fault-tolerant constructs in distributed systems. It is mentioned here because it could be adapted as a means to implement n-version programming.

A resilient procedure appears as a single procedure to the outside world, but actually consists of a group of processes, one coordinator and n cohorts, each located in a different node of the network. The coordinator is the process invoked upon a call to the resilient procedure. When the coordinator process receives a request for a computation from a calling process, it invokes its cohort processes. These may execute the same software if hardware fault tolerance is the only tolerance needed, or may execute different versions of software in parallel for both hardware and software fault tolerance.

When finished computing, the cohorts send their answers to the coordinator.

The coordinator may either compare the results or perform acceptance tests to determine their validity.

The problem of nondeterminism is solved by routing all calls through the coordinator. The cohorts then perform the tasks presented to them by the coordinator in a deterministic fashion. If an n-version program were to be implemented, each cohort would be implemented with a different version of the software process or subprogram. Since all versions of the software execute simultaneously in the cohorts, the execution time required when failures are present is the same as when they are not present. However, the coordinator is a single failure point. The

cohorts can monitor the coordinator to detect a failure, then replace it with a

new coordinator if it fails. This requires some sort of acceptance test on the coordinator, with the same problems of safety and reliability that the acceptance test for a recovery block has. An additional problem is that time

is necessary for the reconfiguration.
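The coordinator's role can be sketched in a few lines; the cohort callables and the decision function are placeholders, coordinator monitoring and replacement are not shown, and the cohorts are invoked sequentially here although they would run in parallel.

    def coordinator(requests, cohorts, decide):
        """Coordinator of a resilient procedure.

        All calls are routed through the coordinator, which presents them to
        the cohorts in a single deterministic order; each cohort may run a
        different version of the software."""
        results = []
        for request in requests:              # one agreed order for all cohorts
            answers = [cohort(request) for cohort in cohorts]
            results.append(decide(answers))   # compare or vote the answers
        return results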

4.3 Selecting a Fault-tolerant Construct.

In order to pick the best fault-tolerant software construct for real-time multicomputing, the characteristics and overheads of the constructs must be

compared and considered. This will be done by first picking which of the two

main sequential fault-tolerant constructs is best for sequential real-time

software. Then a construct which extends the chosen sequential construct to

concurrent software is chosen.

The characteristics which distinguish real-time systems from ordinary systems are [Hech76]:

1) processing must be completed within a hard time deadline, and

2) processing must continue despite faults and the occurrence of unusual

circumstances.

For example, in an on-line system such as an airplane reservation system, speed is desirable from the user's point of view, but if an occasional transaction takes too long, nothing serious will occur. Also, while it may be very important that the passengers be booked correctly, if a problem is detected the system may be shut down for a time while the problem is corrected. In contrast, for a real-time aircraft control system missing a processing deadline could cause loss of control of the aircraft, and for control of an unstable aircraft, the system may not shut down for more than a few milliseconds without catastrophic consequences.

4.3.1 Sequential Software.

The characteristics and overheads of the two sequential software fault­ tolerant constructs reviewed in section 4.1 are shown in figures 4.3.1-1 and

4.3.1-2. These characteristics and overheads will be considered one at a time to choose the best sequential fault-tolerant construct for real-time computing.

First the characteristics will be covered. The best error processing technique is not immediately apparent, and so judgement on this will be postponed until below.

The judgement on result acceptability is an important characteristic for determining which of the two constructs is better for real-time systems. For certain tasks with well defined results, such as sorting of a list, the acceptance test is clear and simple. However, for real-time computation such as control loops or navigation, the expected result is often not clear. In [Hech76] the

problem of acceptance tests for fault-tolerant real-time software is addressed, and it is suggested that a reasonableness check on the value or rate of change of a value will usually be the acceptance test used. This has two disadvantages: the designer of the acceptance test must be aware of what the reasonable values are, and wrong values that are reasonable will not be detected.

Recovery Blocks:
    Error processing technique:          error detection by acceptance test and
                                         backward recovery
    Judgement on result acceptability:   absolute, made with respect to the
                                         specification
    Version execution scheme:            sequential
    Consistency of input data:           implicit, from the backward recovery
                                         technique
    Suspension of service delivery
    during error processing:             yes, for the duration necessary to
                                         execute one or more try blocks

N-Version Programming:
    Error processing technique:          vote
    Judgement on result acceptability:   relative, made with respect to the
                                         versions' results
    Version execution scheme:            parallel
    Consistency of input data:           explicit, from the use of dedicated
                                         mechanisms
    Suspension of service delivery
    during error processing:             no

Figure 4.3.1-1. Characteristics of fault-tolerant constructs.

One example given by Hecht is an East/West position routine in an

aircraft navigator. The position is computed, and the change in position is

compared to the previously computed change in position with some tolerance.

To determine that tolerance, it is necessary for the acceptance test

programmer to know a lot about aircraft navigation and expected flight patterns, and the tolerance must be large enough to allow the aircraft to execute turns without causing the acceptance test to fail. Even if the software were to fail in such a way as to indicate that an aircraft in straight and level flight had turned 180°, the failure would not be detected if the erroneous turn occurred at a reasonable pace. Additional reasonableness checks may be added to increase the fault coverage, but this may quickly grow

unmanageable. Take for example an aircraft pitch control system. The aircraft controller may have a different mode for each mission of the aircraft: a fighter would have modes such as air-to-air, air-to-ground, landing, takeoff, etc. The reasonable results would be different in each mode. Reasonable results would also vary depending upon the aircraft speed, altitude, weight, center-of­ gravity, throttle level, etc. Some of these may change suddenly, e.g. weight changes suddenly when ordnance is released. The acceptance tests grow

with each variable which must be considered. The design of the tests is a manual operation, and as the acceptance test complexity rises, so does the

probability that an important test will be neglected or an erroneous test inserted. Voting is a relative procedure only. All that is required is to ensure that correct versions of the software will reach the same result within some tolerance. Determination of that tolerance is the only problem in the voter which requires detailed knowledge of the application.
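The contrast can be made concrete with a small sketch; the degree tolerances and the rate-of-change check are illustrative assumptions in the spirit of Hecht's example, not values from the text.

    def position_acceptance_test(new_delta_deg, previous_delta_deg, tolerance_deg=2.0):
        """Absolute reasonableness check: the computed change in East/West
        position is compared with the previously computed change.  A wrong
        result that changes at a plausible rate still passes."""
        return abs(new_delta_deg - previous_delta_deg) <= tolerance_deg

    def relative_vote(version_results, tolerance_deg=0.1):
        """Relative judgement: correct versions need only agree with each
        other within a tolerance; no model of reasonable flight is needed."""
        for candidate in version_results:
            if sum(abs(r - candidate) <= tolerance_deg for r in version_results) >= 2:
                return candidate
        return None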

Parallel execution is best for n-version systems with real-time deadlines,

as this reduces the total time required. Most n-version programs are executed

in parallel, and most recovery blocks in series. However, it is possible to execute recovery blocks in parallel [KimK89], so this characteristic does not

indicate which of the two constructs is better. Explicit consistency of input data requires a lot of overhead, so it appears that the recovery block is best from this standpoint. However, this advantage is reduced by the fact that input sensors will generally be redundant and will require voting overhead even in systems which use implicit consistency. In a real-time flight control system, suspension of service delivery should be for at most a few milliseconds. The suspension for the extra time required for executing try blocks is often unacceptable. If the different tries are executed in parallel as in the distributed recovery block [KimK89], the results are available after execution of a single try block, but additional time is still necessary before the next execution of the recovery block to allow for execution of additional try blocks in the versions which failed the first time.

Next the overheads of the constructs will be considered.

Both the recovery block and n-version programming require two extra modules, so they are tied when compared for overhead in the diversified software layer. An n-version program voter is generally fairly simple, but guaranteeing interactive consistency requires several message exchanges between

versions. A recovery block must save its values, either in a recovery cache or

by some other means. Which of the two methods has less overhead depends

too much on implementation for a clear comparison here.

To determine the construct with the lowest operational time overhead, the number of input variables versus the number of internal variables must be considered. If the number of internal variables is small relative to input variables, storing values as the recovery block does would take less time than the input consistency checking of n-version programming. If there are many internal variables compared to inputs, the input data consistency would take less time. In most real-time control systems, I/O is intensive, so the recovery block would have the lower overhead. However, the recovery block's advantage is partially negated by the need to vote input sensor values.

Recovery Blocks:
    Structural overhead:                     one extra version and the
                                             acceptance test
    Operational time overhead (ongoing):     saving of the recovery point data
    Operational time overhead (on error):    state restoration plus execution
                                             of additional try block(s) and the
                                             acceptance test

N-Version Programming:
    Structural overhead:                     two extra versions and the voter
    Operational time overhead (ongoing):     input data consistency, version
                                             execution synchronization, and
                                             voter execution
    Operational time overhead (on error):    usually negligible

Figure 4.3.1-2. Overheads of constructs.


Voting and comparison are fairly simple operations, so a voting scheme will generally execute faster than acceptance tests. This is especially true for

real-time control systems, where there is often no easy test to run on the

output to determine if it is acceptable.

Operational time overhead on error occurrence is negligible for voting

systems without recovery. For systems with recovery, it is likely to take longer, but that can be run in the spare time as it is not critical to complete recovery

before the real-time deadline. In contrast, executing an entire new try block and rerunning the acceptance test would take considerable time.

In summary, the recovery block has more execution time overhead

upon error occurrence than an n-version program, due to the time required for

backward error recovery, and the necessary acceptance tests will often be too

complicated to complete in a reasonable time for many real-time applications.

N-version programming should generally be used in real-time applications

because of the relative judgement of result acceptability and the low overhead on error occurrence.

Checking industry practice shows that these conclusions are

supported, as all four examples discussed in chapter 2 use n-version

programming.

4.3.2 Concurrent Software.

The results of the sequential comparison are used to find and pick a

good construct for fault-tolerant concurrent software. The conversation,

programmer-transparent coordination, and the colloquy are all concurrent extensions of the recovery block, and have the same disadvantages for real-time control systems that the recovery block has: extra time required for backward error recovery and the need for an absolute acceptance test.

Therefore, we choose n-version programming for use as the fault-tolerant construct in the URMC. Unfortunately, not much work has been performed on concurrent n-version programming (CNVP). Mancini has suggested adding modular redundancy to CSP [Manc86a]. This method relies upon dissimilar implementations of each process to achieve fault tolerance. However, the message passing requirements for each process are identical, reducing the amount of dissimilarity possible between versions. It has been shown that correlated errors between software versions are a major problem with n-version programming, even if the specification of the software is correct [Knig86]. Different programmers tend to make the same mistakes when presented with a problem. This source of correlated errors could be reduced by increasing the dissimilarity between versions. The next section suggests ways to increase dissimilarity between versions.

4.4 Concurrent N-Version Programming (CNVP).

In this section it is proposed that as there are three constructs which

extend the recovery block concept to concurrent software, so there should be

three constructs which extend n-version programming to concurrent software.


These constructs are named process-dissimilar, subprogram-dissimilar, and structure-dissimilar CNVP; they are corollaries to programmer-transparent coordination, the conversation, and the colloquy, respectively. The three schemes for extending recovery blocks require different levels of similarity between the computational tries. PTC relies upon establishing a new recovery point at specific communications between processes, so all of the try blocks in a process must have the same communication structure. The conversation extends the recovery block over several processes, so that there may be different communication patterns in the different try blocks of the participating processes; however the same processes participate in each of the try blocks. Also the results of different tries in a conversation must all be left in the same processes, as local acceptance tests are used to determine validity. The colloquy extends the conversation by allowing the try blocks to have different processes participating, and to use a global acceptance test. Hence the tries may have completely different computational structures. The three different n-version programming derived constructs have the same dissimilarity requirements as the three different recovery block extensions. For describing possible implementations of the constructs, the concurrent software will be assumed to consist of several sequential processes which communicate using message passing. Implementations will be proposed based upon the concept of a recovery layer [Fern89a]. The fault tolerance will be application transparent [Anco87], i.e., system behavior in the

event of faults is specified with minimal modifications to the application software. Application transparent strategies can be further divided into:

1) application independent - all policies which can be applicable to
   arbitrary programs,
2) application implicit - activation of fault tolerance actions is associated
   with the occurrence of predefined critical constructs of the application
   language, and
3) application explicit - the application software specifies a minimal set of
   application dependent entities.

The CNVP constructs can be used with any of the other three communication models (monitors, remote actions, or generative communication [Gele85]) as well, but this will not be described. As message passing can be used to implement any of these schemes, simple transformations should be sufficient to change the implementations outlined here into implementations suitable for other communication models.

4.4.1 Process-Dissimilar CNVP.

Process-dissimilar CNVP is the NVP corollary to PTC. As with the recovery blocks of PTC, there are n different versions of each process. However, instead of establishing a recovery point at each message receive, a vote is performed at each message receive (see Figure 4.4.1-1). The dissimilarity between versions is limited by the requirement that each version must have the same message send and receive requirements.

However, fault diagnosis is good because voting at every message receive masks a fault in one of the versions of a duplicated sequential process before that failure can propagate to the processes which import data from the failed process. The modular redundancy in a message passing system proposal of Mancini [Manc86a] uses the synchronous communication of CSP, so it is a special case of process-dissimilar CNVP.

Figure 4.4.1-1. Process-dissimilar CNVP.

To construct a process-dissimilar program, first the application designer should specify the processes and messages in the system. The application designer must also specify how nondeterminism is to be resolved, as each process should behave in the same way given the same input messages. [Manc86a] extends the CSP notation of [Hoar78] to describe such an approach, and discusses means to ensure input data consistency between versions, but recovery is not considered. An implementation with recovery based upon the idea of a recovery layer will be presented. The recovery layer is visible to the fault tolerance designer. On this layer, each application process with i message input channels and j message output channels can be viewed as an abstract entity assembled from the following recovery layer components: n separate versions of the process, n x i input consistency processes, and n x j voting processes (see Figure 4.4.1-2). The input consistency technique can be the same as that described by Mancini, which has already been reviewed in section 4.2.2.1. While Mancini suggested putting the input consistency function in the system kernel of each processor, it should instead be placed in the processes in the recovery layer.

This places all functions concerned with redundancy management in the

recovery layer and thus increases the modularity of the design. Mancini's plan did not allow for recovery. If fault masking is desired without recovery afterwards, the voter could be implemented as a single

process, executing on a different resource than the application processes. If recovery is desired, the Community Error Recovery scheme suggested by Tso

[TsoK86] and reviewed in section 4.1.2 may be used. A process in charge of the recovery of an application should have full visibility of the application's variables [Fern89a]. Therefore, the voter process should be distributed into several voter processes, with a voter process on each resource which is executing an application process. This has the disadvantage of requiring the voters to use an algorithm for interactive consistency [Lamp82], but has two advantages when implementing the CER scheme: at cc-points the voter can communicate values back to the application process without burdening the interprocessor communications network, and at r-points the underlying processor hardware can give the voter processes the authority to stop the application processes and to invoke the state-output and state-input exception

handlers. Figure 4.4.1-2 already has this configuration shown. A method for giving recovery processes authority over application processes is described in

[Ozak88]. An application implicit strategy is used to implement the cc-points. Whenever a message is sent, a vote is performed, and then the decision value

is returned to the application process and used in subsequent computations.

If the programmer is aware of this, he may structure the software in such a

way as to make recovery more likely.
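A sketch of this application implicit strategy is given below; the channel object, the function that collects the corresponding values from the other versions, and the voting function are all assumed interfaces of the recovery layer rather than defined ones.

    def send_with_cc_point(channel, local_value, collect_copies, decide):
        """Send on an output channel of one version of a process-dissimilar
        CNVP process.  The vote happens on every send, and the decision value,
        not the version's own value, is what the version keeps computing with."""
        copies = collect_copies(local_value)   # values from all n versions,
                                               # made consistent beforehand
        decision = decide(copies)              # majority, midvalue, etc.
        channel.send(decision)
        return decision                        # caller continues with this value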

Recovery points are implemented with an application explicit strategy.

Extra processes must be placed in the recovery layer, as for each of the n

application versions there is a state output exception handler and a state input exception handler. The application designer should assign levels of criticality

and timing requirements for each process. Based upon this information, the

location of the recovery points is determined by the fault tolerance designer.

The fault tolerance designer also specifies a time limit on each version's execution and the type of voting to perform on each communication channel (e.g., majority, midvalue, or approximate agreement, depending upon the

types of values sent).

Figure 4.4.1-2. Recovery layer components of a three version process.

In [Sarm88] it is suggested that an intelligent system could be used, with the designer specifying the criticality of a module, and the system


informing the designer which modules need extra versions. The system would then construct the fault-tolerant program. The need for n versions of each process can be supported by extending a language which uses separate specifications and bodies for the processes (such as Ada [MIL-STD-1815A]). There would be one specification and n bodies. To allow cc-points the specification should state how many bodies should be present and the type of voting to use on each message. For r-points, the location is marked in the specification, along with a time limit on the execution since the last r-point and a list of the internal variables to be handled by the state input and output exception handlers. The n bodies will contain n dissimilar implementations of the specification. No internal states other than those listed in the specification may be preserved past r-points. A translator may automatically convert the program into the form of the recovery layer, inserting standard voting, input consistency, and exception handling processes.

4.4.2 Subprogram-Dissimilar CNVP.

Subprogram-dissimilar CNVP is the NVP corollary to the conversation. In a conversation, every process participates in every try by executing its own local try block, and local acceptance tests are used in each process to determine if the conversation has succeeded. The different try blocks in the participating processes are not required to have the exact same communication structure. For subprogram-dissimilar CNVP the entire

subprogram is duplicated as a unit, as if the n tries of a conversation were all

results are found in each of the n versions of each process. Performing voting at the end of the subprogram's execution allows failures to propagate between the sequential processes, making diagnosis more difficult than if voting were performed at each message send. However, the distribution of the failures at the end can provide some clues as to the sources of the failures,

allowing possible failure sources to be identified by backtracking through the communication structure of the program. In addition, failing the voting may activate acceptance tests on the results for fault diagnosis purposes.

Figure 4.4.2-1. Subprogram-dissimilar CNVP.

To program a subprogram-dissimilar system, the application designer specifies the processes in each system and specifies in which processes the results should reside at the time of voting. There is more freedom than in

This scheme may be supported by a distributed language, but the specification covers an entire subprogram, a group of tasks rather than a single task. The specification states the processes which should be present, which processes results should be located at cc-points and r-points, and time limits between the points. The n bodies must have the tasks specified, but they

may communicate amongst themselves as desired by the body programmer.

The body programmer must insert cc-points and r-points as found in the specification. The input consistency, cc-point voting and r-point voting are in the form of calls to recovery layer primitives.

If a voter detects a problem, it is no longer certain that the problem originated in the process in which the failure was detected. At a cc-point this

may be neglected, and execution may continue with the decision result used in the process which detected the failure. R-points are more complicated. The process which detected a failure must be reset, but so must all processes which may have caused the failure. It is suggested that a recovery process be associated with each application process. The recovery process associated with the failed application process may communicate with the recovery processes associated with all processes which sent messages to its application. They in turn send messages back to the recovery processes associated with the application processes which sent them messages, and so on. The recovery processes that have been notified that their application may have failed send messages to the n - 1 recovery processes in the corresponding process of the other subprogram versions, requesting state-output data. The recovery process votes this data, then uses the decision value to restart the application process. To avoid sending warnings around in circles, the dependency of processes upon others may be tracked using the technique outlined in [KimK89] for the PTC/LCN scheme.
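The warning propagation can be sketched as a reachability computation; the mapping from each application process to the processes that have sent it messages since the last r-point is assumed to be maintained by the recovery layer, in the manner of the PTC/LCN dependency tracking.

    def processes_to_reset(failed_process, senders_of):
        """Return every process that may have caused the detected failure.

        senders_of maps a process name to the processes that sent it messages
        since the last r-point; the visited set keeps warnings from circling."""
        to_reset, frontier = set(), [failed_process]
        while frontier:
            process = frontier.pop()
            if process in to_reset:
                continue
            to_reset.add(process)
            frontier.extend(senders_of.get(process, ()))
        return to_reset

    # Hypothetical dependency chain: P3 failed, P2 fed P3, and P1 fed P2.
    print(processes_to_reset("P3", {"P3": ["P2"], "P2": ["P1"], "P1": []}))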

4.4.3 Structure-Dissimilar CNVP.

Structure-dissimilar CNVP is the NVP corollary to the colloquy. The colloquy extends the conversation by allowing different numbers of processes in each try at the computation. The equivalent extension in NVP is to duplicate the entire parallel subprogram, allowing the n versions of the parallel program to use different numbers of processes connected in different ways, and with 70

the results distributed in different processes at the end of computation (see Figure 4.4.3-1).

Figure 4.4.3-1. Structure-dissimilar CNVP.


Since there is no longer a one-to-one correspondence between processes in different versions, voting cannot be performed on a process by process basis. Instead, a global vote must be performed on the complete results of each of the versions. This corresponds to the use of a global acceptance test in a colloquy instead of the local acceptance tests used in the conversation. The only requirement for similarity between the versions is that they arrive at the same final result. This provides an opportunity for great dissimilarity between versions; however fault diagnosis is difficult, as there is no clue as to which sequential process in the failed version was the source of the failure. As with subprogram-dissimilar CNVP, additional acceptance tests may be run to aid in fault diagnosis.

As with the other two CNVP constructs, structure-dissimilar CNVP may be supported by a language with separate specification and bodies. Now the specification only states the results which should be obtained, whether those results are at a cc-point or an r-point, and time limits between the points.

Within each body there may be different processes, different messages, and nondeterminism. Due to the lack of similarity between versions, input consistency and voting must be performed in a separate process or group of

processes. Since every cc-point or recovery point requires that the data be

merged together into a voter, they reduce dissimilarity, and hence they should be placed at rare intervals.

Implementation of recovery will be more difficult for structure-dissimilar CNVP than for the other two CNVP approaches. Cc-points are complicated by the fact that the return of the decision result to the versions will be performed


differently with each version, depending upon the structure of the processes in

the version. If there are m results, there may be m voters. At an r-point, if any of the

m voters do not receive a result within the time interval allotted, or a version

believes it is at the wrong recovery point, then the voter must restart that

version. The voter needs to restart not just one application process, but all of

the processes in the failed version. This means that the voter process must

have the authority to restart processes in processors other than the one upon

which they are running. This may be implemented by having the voter process

signal the recovery layer in every processor which is running one of the

application processes for the bad version. The recovery layer stops the

application process and reinitializes it. It may not be possible to reinitialize using

internal variables from other versions due to the great dissimilarity, so the

application may be set to a standard initial condition. For applications where

no internal states are kept between executions (e.g., sorting), this will work

well. For applications with internal variables stored between executions, means

must be found to initialize them to reasonable values. For example, in an

aircraft adaptive control system, the software constantly estimates the system

parameters. At an r-point, an initial estimate of the parameters may be made

based upon the current flight conditions.

In general, structure-dissimilar CNVP will probably be harder than the

other two types of CNVP, as it is harder to find dissimilar implementations than

similar implementations. However, the extra effort should result in a lower

correlated error rate.


4.4.4 Comparison of CNVP Constructs.

It is interesting to compare process-, subprogram-, and structure-dissimilar CNVP from the standpoint of two goals of design fault tolerance: maximizing the dissimilarity between the n versions, and minimizing the size of the failure containment regions. Correlated errors between dissimilar versions have been shown to be a potential problem in n-version software [Knig86]. The greater the dissimilarity that can be achieved between the n versions, the lower the probability that there will be correlated errors between the versions, so from this standpoint one should choose the CNVP method with the greatest dissimilarity between versions, which is structure-dissimilar CNVP. When a software failure occurs, it is desirable to diagnose the failure and to contain its effects to as small a section of software as possible. The smaller the fault containment regions, the more failures the program can withstand on average before complete failure. Thus from this standpoint one should choose the CNVP method with the smallest fault isolation regions, which is process-dissimilar CNVP. Unfortunately, the two goals of maximal dissimilarity and minimal fault containment regions are incompatible. The

relationship between the three types of CNVP with respect to these

parameters is shown graphically in Figure 4.4.4-1. The choice of which of the three CNVP constructs to use must be made with the application in mind.

To illustrate the three methods, an example will be described. The

example is to sort a list of m integers into ascending order. This has a clearly

defined result, so in actual use it would probably be better to use an acceptance test rather than voting, but the example is clear and simple and thus is chosen for pedagogical reasons. For additional simplicity, it is assumed that no two integers are the same, and recovery of failed versions is not handled.

Figure 4.4.4-1. Relationship between CNVP constructs. (The figure plots the three constructs against two axes, increasing dissimilarity and increasing fault diagnosis capability: process-dissimilar CNVP has the most fault diagnosis capability, structure-dissimilar CNVP the most dissimilarity, and subprogram-dissimilar CNVP lies between them.)

To implement a process-dissimilar CNVP, the designer specifies the processes and their message passing behavior. Assume the designer chooses to use a minimum extraction sort using processes in a tree communication structure [AkiS85]. The structure for an 8 integer sort is shown in Figure 4.4.4-2. The sort extracts the minimum integer from the sequence of those to be sorted, then the minimum of those left, and so on until all m members of the sequence are in monotonically increasing order. The integers


to be sorted are loaded into the leaves of the tree. Each processor on the next level determines the smaller integer held by its two children and passes that to its parent, leaving the child which held the smaller element empty. This continues until the smallest value is in the root node. It is stored in an output buffer, and the cycle is repeated.

Figure 4.4.4-2. Communication structure for a minimum extraction sort.
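A sequential sketch of the idea follows; the tree of processes is collapsed into a single scan over the leaves, so only the extraction order, not the parallel message passing, is represented.

    def minimum_extraction_sort(values):
        """Simulate the tree sort: each cycle the smallest remaining value
        reaches the root and is appended to the output buffer."""
        leaves = list(values)                 # integers loaded into the leaves
        output = []
        while leaves:
            smallest = min(leaves)            # what the root holds after the
            leaves.remove(smallest)           # values percolate up the tree
            output.append(smallest)           # stored in the output buffer
        return output

    print(minimum_extraction_sort([5, 3, 8, 1, 7, 2, 6, 4]))  # an 8 integer sort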

For process-dissimilar CNVP, each process is implemented n times. A vote is performed on each communication from child to parent, and each communication must be performed in the same way. One way to communicate between parents and children is to use asynchronous message passing. The parent informs the child when the parent is ready to accept a

value. The child then sends it to the parent; the parent selects the smaller

value and returns the larger value to the child which sent it. A disadvantage is that all n versions must perform the same computation and the same communication, so the versions are fairly similar. One advantage is that voting on every message allows many uncorrelated failures to occur without crashing the system. For instance, in a three version program, one of the versions of every process could fail and the system would still operate correctly.

Redoing this example with subprogram-dissimilar CNVP allows the dissimilarity to increase between versions. Assume that all n versions perform

a minimum extraction sort on a tree of processes. Since voting is saved until the end, the communication between parent and child can be different in each version. One version may communicate as described for the process­

dissimilar program. Another may use synchronous communication, eliminating the need for the parent to inform the child when it is ready to accept a value. A third version may have the parent keep both values instead of returning the

larger value to the child from which it came. This greater dissimilarity reduces

the chance of a correlated error between the versions. However, fault

diagnosis is more difficult. For example, at the end it may be found that one

version misplaced a value. The guilty process could be any of those which the

value passed through on its way through the tree. If very dissimilar versions are desired, structure-dissimilar CNVP may be used. With this construct, n completely different parallel algorithms can be used. For maximum dissimilarity, these should use different paradigms and

different computational structures to achieve the same goal. Greater

dissimilarity of algorithms is probably possible with parallel algorithms than

with sequential algorithms due to the additional variable added by the communication structure, so structure-dissimilar CNVP may allow greater dissimilarity than any other type of n-version programming yet suggested. Let us return to the example which sorts a sequence of m integers. The sort could be done using three algorithms: a minimum extraction sort on a tree network, a bitonic merge sort on a shuffle-exchange network, and an enumeration sort on an array of interconnected trees [AkiS85]. These algorithms are each based upon a different paradigm from [Nels87]. The minimum extraction sort uses the pipeline paradigm; the bitonic merge sort uses the divide and conquer paradigm; and the enumeration sort uses a compute-aggregate-broadcast (CAB) algorithm to find the correct location of each element, with each element found separately, so it is a divide-and­ conquer paradigm wrapped around a CAB paradigm. For additional diversity the three versions may use different communication structures: one may use message passing, one remote actions, and one generative communication. All three algorithms read the unsorted list of m integers, sort the integers, then output a sorted list of m integers. As the algorithms have different intermediate results, it is not possible to have voting throughout - the voting must wait until the complete answer is ready. There could be M copies

of the voter, so that each element of the sequence is voted upon separately from the other elements. This increases the concurrency and allows multiple faults to be masked, even if the faults are in different versions, provided that those faults do not cause failures in the same elements of the sequence. Figure 4.4.4-3 shows the program structure for this example.

Figure 4.4.4-3. Structure-dissimilar CNVP for a sorting problem. (The unsorted list feeds the three versions, minimum extraction on a tree, bitonic merge on a shuffle-exchange, and enumeration on an array of trees, whose outputs go to the voter, which produces the sorted list.)
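Element-by-element voting over the three versions' outputs might be sketched as follows; three versions and a simple two-out-of-three majority per position are assumed.

    def vote_sorted_lists(version_outputs):
        """Vote each position of the sorted sequence separately, so faults in
        different versions that corrupt different positions are all masked."""
        result = []
        for position in zip(*version_outputs):        # one vote per element
            for candidate in position:
                if position.count(candidate) >= 2:    # two-out-of-three majority
                    result.append(candidate)
                    break
            else:
                raise RuntimeError("no majority for an element of the sequence")
        return result

    # One version misplaces two values; both faults are masked.
    print(vote_sorted_lists([[1, 2, 3, 4], [1, 2, 3, 4], [1, 3, 2, 4]]))  # [1, 2, 3, 4]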

Nondeterminacy can no longer be handled by requiring that all versions handle nondeterminacy in the same way, followed by guaranteeing input

consistency. It will be necessary either that the groups of processes in the

versions not interface to the outside world in a nondeterminate manner, or that each nondeterminate interface be handled in a single process which is a point of similarity between the versions. In the latter case, the nondeterminacy handling process is similar to the coordinator process in the resilient procedure construct, except that the cohorts are made up of groups of processes rather than single processes, and there may be more than one of the nondeterminacy handling processes for each group of cohorts.

4.5 CNVP Reliability Model.

The CNVP models will be found by extending currently existing n-version programming models. These are reviewed in section 4.5.1. Sections 4.5.2 - 4.5.4 then present reliability models for process-dissimilar, subprogram-dissimilar, and structure-dissimilar CNVP, respectively.

4.5.1 Review of Sequential NVP Models.

The stages of the design and implementation of an n-version program were described in [Arla88]. The design process is shown in Figure 4.5.1-1.

The stage at which an error is made affects the nature of the resulting fault(s) as shown in Figure 4.5.1-2. The fact that correlated errors may exist even if the mistake is made at a later stage [Knig86] is represented by the dependency channels marked with the letters a - d.

A Markov model for the reliability of an n-version program without recovery running under the Design Diversity Experiment System (DEDIX) was

devised by Laprie [TsoK86]. It is shown in Figure 4.5.1-3.

Figure 4.5.1-1. NVP fault sources.

Paths where faults are created,        Fault type
or dependency channels

1 -> 2                                 Related fault in the three versions
a, b, or c                             Related fault in two versions
1 -> 2 -> 3, 1 -> 3, or d              Related fault in versions and decider
2 -> 4, 2 -> 5, or 2 -> 6              Independent fault in a version
3 -> 7                                 Independent fault in the decider

Figure 4.5.1-2. Major fault types for NVP.

Figure 4.5.1-3. Detailed reliability model of an n-version program.


In this model, it is assumed that the program has three versions running under DEDIX supervision. DEDIX performs the communication and decision functions for the three programs, which run in parallel. The states in the

Markov model are made from a combination of the state of execution of the software module and its failure status. They are marked with the state code (X,Y), whose components have the following meanings:

X: block status with respect to execution

    I - idle,
    D - executing DEDIX, and
    V - executing versions in parallel.

Y: block status with respect to error activation

    Y missing - no error activated,
    CE(D) - common mode error from DEDIX failure,
    CE(C) - common mode error in the versions due to a mistake in the
            specification or a common error made by the independent design
            teams,
    IE(V) - independent error in a version (a number placed before IE(V)
            indicates how many versions have independent errors present).

Transitions between states are made when the program enters a new mode or a failure occurs. The transition rates are labeled with codes X sub Y, which are defined as follows:

X : rate classification

eta - solicitation rate of the n-version program; its inverse is the mean duration of idle periods,

gamma - end of solicitation rate; its inverse is the mean duration of execution periods, and

lambda - the failure rate; its inverse is the mean latency of the system errors.

Y : source modifier

D - DEDIX,

V - version, and

C - correlated error.

The system is assumed to begin in the idle state (state I). When the program is to be run, DEDIX is invoked and runs for some time setting up the versions for execution (state D). Then the versions execute (state V), and when a cc-point or r-point is reached they return their values to DEDIX. DEDIX votes the results (state D). If there is more work to do the versions continue (state V); otherwise the system returns to idle (state I). The probability that the system has more work to do, given that it is in DEDIX, is called the activation ratio and denoted by q. The occurrence of a failure causes other transitions to occur. A failure in DEDIX causes the system to enter state D, CE(D). This is a common mode error, so when DEDIX completes running, the final answer is wrong. This is represented by the transition to state CF. A common mode error which occurs during version execution will also cause a complete system failure. Upon occurrence of the failure, the system enters state V, CE(V). When version execution is complete, DEDIX runs (state D, CE(V)) and produces an incorrect result (state CF). An independent error in a single version causes the system to enter state V, 1IE(V). If the system goes back to DEDIX without another error, the system will not fail, and will enter the states I, 1IE(V); D, 1IE(V); and V, 1IE(V). These correspond to the I, D, and V states but with an error in one version. Occurrence of another failure, whether common mode or independent, results in system failure.

There are some interesting anomalies in the model as shown in [TsoK86]. For one thing, the possibility of a second or third independent failure occurring during a single version run is shown, but not the possibility of an independent failure of one version followed by a common mode failure. Another anomaly is that in the transitions from the V state, the chance of more than one independent error occurring before return to DEDIX is shown to be nonzero, but from the V, 1IE(V) state the chance of two independent failures occurring before return to DEDIX is zero. A third anomaly has to do with the activation ratio, q, after a single failure. For some reason, q is 1 after a failure. The figure could be redrawn to eliminate these anomalies, but instead we will


examine the simplified figure of [TsoK86], then determine if that must be

redrawn. The model may be simplified by recalling that execution rates are much higher than failure rates, and by combining states. The resulting simplified

model is shown in Figure 4.5.1-4. The chance of more than one failure

occurring during a version execution is considered to be close enough to zero to ignore: this eliminates several states, and also solves two of the anomalies

of the more complicated model. The only change which appears to be

necessary would be to account for the activation ratio q in the transition from

state V to state V, 1IE(V). The transition between these states should be q x lambda sub V, and there should be a transition from state V to state I, 1IE(V) with rate (1 - q) x lambda sub V.

Figure 4.5.1-4. Simple reliability model without recovery.

Recovery may be added to the system as described in [TsoK86] and reviewed in section 4.1.2. A simplified Markov model for NVP with recovery is shown in Figure 4.5.1-5.

Figure 4.5.1-5. Simplified reliability model with recovery.

The only difference between the figure with recovery and the one without recovery is that the program enters a recovery state, corresponding to execution of the state-input and state-output exception handlers. If the handlers succeed, the system returns to state V, and if the handlers fail, the system goes to state V, 1IE(V).

4.5.2 Process-Dissimilar CNVP.

To develop a process-dissimilar CNVP, the parallel program is divided into m processes, then each of these m processes is programmed n times in n dissimilar ways. Let us assume that for the parallel program to operate correctly, all m processes must operate correctly. The reliability diagram is m blocks in series, one block for each process (see Figure 4.5.2-1).

Figure 4.5.2-1. Top-level reliability diagram for process-dissimilar CNVP.

Let us examine a block. The kth process has n versions of the application, and each of the n versions has i(k) inputs and j(k) outputs, with a

consistency checker on each input and a voter on each output. For this

model, let us assume that the communication is synchronous, as in CSP (this

may be modified later). If we assume that the input consistency processes,

application process, and its voter processes are on the same physical

processor then only one can run at a time. Also, if the versions are synchronized, the n processors will make state transitions at the same time.

We may describe the status of the system with a model very similar to that for

a sequential n-version program (which makes sense, since we are modeling a sequential process). The model for a three-version system appears in Figure 4.5.2-2.

Figure 4.5.2-2. Reliability model for a three-version sequential process.

Assuming synchronous I/O, there are five possible states for a correct n-modular sequential process program: waiting for an input (WI), performing input consistency checks (IC), executing the versions (V), deciding on the correct result (D), and waiting for an output (WO). Then if a failure occurs, recovery may be attempted (R), the entire system may fail (F), or a single version may fail permanently, in which case the system is in one of the five states listed before, but with a (,1) after the state to indicate that one version is failed (the F state has been shown twice to avoid excessive crossing of the transition arcs in the figure). The transitions are marked with rates using the following conventions:

lambda - failure rate of state, and

gamma - completion rate of state.

The subscripts modifying the lambda and gamma symbols follow the state names above, with the following exception: lambda sub C gives the rate at which common mode faults are activated in the versions, while lambda sub

V gives the rate at which independent faults are activated in the versions.

Finally, the rates are modified with the following variables:

a - the probability that the sender of a message is not ready given that the process is ready to receive a message,

b - the probability that the receiver of a message is not ready given that the process is ready to send a message, and

d - the probability that an I/O operation is an input operation.

If the system uses asynchronous message passing, the model is modified either by setting the average wait to send a message equal to zero, or by setting b to zero.

4.5.3 Subprogram-Dissimilar CNVP.

Figure 4.5.3-1. Reliability diagram for subprogram-dissimilar CNVP.

To prepare a subprogram-dissimilar CNVP, the entire subprogram is replicated. The subprogram sits idle until it is activated by being called with input. It then performs input consistency checking, runs the n versions of the subprogram, runs a decision function, and outputs the result. After the result has been output, the program returns to the waiting-for-input state.

The reliability diagram for this sequence is a modified version of the reliability diagram for a single process in process-dissimilar CNVP, as shown in Figure 4.5.3-1. In this figure the failure state, F, has been shown twice to

avoid excessive crossing of the transition arcs.

The reliability diagram for subprogram-dissimilar CNVP is very similar to

that of a sequential n-version program because the concurrent nature of the

software is hidden within the subprogram.

4.5.4 Structure-Dissimilar CNVP.

The reliability model of the structure-dissimilar CNVP is the same as that

for subprogram-dissimilar CNVP. It is hoped that the correlated error rate will

prove to be lower due to the greater dissimilarity between versions.

4.6 Correlated Error Problem.

To determine the number of versions necessary to meet the URMC requirements, some means of finding the failure rates for both independent and correlated faults is necessary. The model of Eckhardt and Lee [Eckh85] takes into account correlated errors as well as independent errors. In this model, theta gives the probability that a version will contain an error at a certain point in the input space, and g(theta) is the probability of encountering the area of the input space with coincident error intensity theta. Figure 4.6-1 shows a figure from [Eckh85], with the failure rate of each version assumed to be 2 x 10^-4 per hour, and with the decision function consisting of a majority voter. If version failures are assumed to be independent, then it would require only 5 versions to reach a system failure rate of 1 x 10^-9 per hour, but if the values for theta shown are reasonable, then it would take 17 versions to produce a system with a failure rate less than 1 x 10^-9 per hour. This is an unreasonable number of software components to prepare. One way to minimize the number of components is to limit theta to a low value. Figure 4.6-2 shows the effect of varying theta, even over a small proportion of the input space.

[Figure: probability P(N) of system failure versus number (N) of components, for several values of theta and g(theta).]

Figure 4.6-1. Effect of independence assumption (from [Eckh85]).


[Figure: probability P(N) of system failure versus number (N) of components, for N from 5 to 21.]

Figure 4.6-2. Effect of shifted intensity distribution (from [Eckh85]).
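The behavior illustrated in the two figures can be reproduced numerically. The sketch below evaluates the failure probability of an N-version, majority-voted program under the Eckhardt and Lee model, first under the independence assumption and then under a coincident-error intensity distribution. The distribution values, and the treatment of the per-version failure rate as a per-demand probability, are illustrative assumptions and are not the numbers of [Eckh85].

```python
from math import comb

def p_majority_fails(theta, n):
    """Probability that a majority of the n versions fail at an input where each
    version fails independently with probability theta."""
    k = n // 2 + 1                      # smallest number of failures that defeats a majority
    return sum(comb(n, i) * theta**i * (1 - theta)**(n - i) for i in range(k, n + 1))

def system_failure_prob(intensity_dist, n):
    """Eckhardt-Lee style calculation: average the coincident-failure probability
    over a distribution g(theta) given as (theta, g(theta)) pairs."""
    return sum(g * p_majority_fails(theta, n) for theta, g in intensity_dist)

# Hypothetical distributions (not the values of [Eckh85]).
independent = [(2e-4, 1.0)]                        # every input equally hard
correlated  = [(0.0, 0.989), (0.01, 0.005),        # a small part of the input space
               (0.02, 0.0025), (0.10, 0.0035)]     # carries a high coincident intensity

for n in (3, 5, 9, 17):
    print(n, system_failure_prob(independent, n), system_failure_prob(correlated, n))
```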

Another way to reduce the number of components needed is to use a plurality vote rather than a majority vote. In the FTP/AP [Lala88], there are four versions of software. If a correlated error occurs in two versions which results in those two versions reaching two different wrong answers, then the versions form a 2 - 1 - 1 split. The decision function then selects the value of the 2 versions which agree.
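A minimal sketch of the difference between the two decision functions follows; the 2 - 1 - 1 split described above is accepted by the plurality voter but rejected by a strict majority voter. The function names are hypothetical.

```python
from collections import Counter

def majority_vote(results):
    """Accept a value only if more than half of the versions agree on it."""
    value, count = Counter(results).most_common(1)[0]
    return value if count > len(results) / 2 else None

def plurality_vote(results):
    """Accept the most common value unless it is tied with another value."""
    ranked = Counter(results).most_common(2)
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None                        # ambiguous split such as 2 - 2
    return ranked[0][0]

split = [42.0, 42.0, 17.3, 99.9]           # a 2 - 1 - 1 split among four versions
print(majority_vote(split))                # None: no strict majority exists
print(plurality_vote(split))               # 42.0: the two agreeing versions win
```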

Before the number of versions necessary for the URMC can be determined, the following questions must be answered:

1) What values of theta are reasonable for software developed to industrial standards?

2) What proportion of correlated failures will result in similar wrong answers in the incorrect versions?

Once these answers have been found, the number of versions may be minimized by using the decision function of [Lala88], which does not require a majority, and by keeping theta as small as possible. Knight and Leveson performed an empirical study of failure probabilities in multiversion software [Knig87] which appears to show that the correlated errors between versions make the reliability levels required for the URMC very difficult to reach. However, Avizienis has suggested that the use of rigorous specifications, enforced dissimilarity, and industrial software practices should reduce the correlated errors below the level observed by Knight and Leveson [Aviz87]. Until a better method for preventing correlated errors is found, the following rules are suggested:

1) The versions should use independent design and implementation techniques such as diverse algorithms, programming languages, translators, design automation tools, and machine languages [Aviz84].

2) The versions should be prepared by independent (noninteracting) programmers or designers, preferably with a wide background of experience [Aviz84].

3) The requirements specification must be complete and correct, as ambiguities or incomplete areas may cause correlated errors [Aviz84].

4) Modules should be subjected to four types of tests: functional, performance, stress, and structural [Fair85]. Extreme values and illegal values should be checked in the functional tests. In this author's opinion, the combination of unusual values in functional testing and stress testing should detect problems very likely to have been neglected by the programmer, such as numeric overflow or filling of buffers, and should reduce theta in these areas.

5) All software should be designed with software engineering practice as rigorous as for critical single-version software - the ideas that redundancy allows money to be saved by avoiding module test [Youn84] or by using free-lance programmers [Aviz85] should be abandoned.

6) Simple software, or software used on many different projects (such as the recovery layer software), should be proven correct if possible.

4.7 Software Fault Tolerance Summary.

The development of fault-tolerant software is the biggest problem in the next generation of computers for critical control, as techniques for fault-tolerant concurrent software are not yet well developed. The URMC system should make provisions for sequential or concurrent n-version programming, recovery blocks and colloquies, and for simplex software as well. Time-critical software can be written as n-version software, while software with no hard deadlines could be written with the recovery block or colloquy. The colloquy will probably have to be used for programs which find heuristic suboptimal solutions, as different methods may find very different and yet still correct answers. To guarantee that different versions in an n-version program will find the same suboptimal answer would require considerable similarity between the versions. Finally, simplex software can be used for operations which are non-critical.

Since concurrent n-version programming is not well defined, considerable discussion of the subject was presented. Three methods are outlined: process-dissimilar, subprogram-dissimilar, and structure-dissimilar CNVP; they are counterparts of the three methods used to extend the recovery block to concurrent software: PTC, the conversation, and the colloquy. Modular redundancy in CSP is a special case of process-dissimilar CNVP. The greatest problem with software fault tolerance is correlated errors between the versions. Programming the versions with algorithms based upon different paradigms should cause theta to be distributed differently in the different versions, thus significantly reducing correlated errors. This should be investigated with a programming experiment, if possible.

5. Processing Node Fault Tolerance.

The URMC processes run on the hardware of the processing nodes. There may be one or more processes on each hardware processor. The hardware nodes must provide this service to the software regardless of the occurrence of hardware failures. Communication between processors is through message passing, using the services of a "perfect" communication

network, which is guaranteed to correctly deliver a message within a finite time; therefore, the node hardware layer has to tolerate only failures within the

processing nodes themselves. Hardware fault tolerance has been studied for some time, so the techniques for hardware fault tolerance are fairly well understood for

conventional SISD computers. Overviews may be found in [Fern89a] and

[Siew82]. Considerable work has been done in extending fault tolerance to

parallel processors, but it is still an active research area. Several different ideas

from past work have been combined to develop the fault tolerance scheme

used in the URMC. Fault tolerance in hardware is achieved by redundancy, so that a failed

component's function can be handled by another component. For the URMC,

it has been decided that fault tolerance will be achieved through redundancy

of the entire processing node, and the fault isolation regions will be entire nodes. This approach has been chosen because it allows the use of inexpensive general-purpose components in the development of a fault-tolerant system. The desirability of redundancy and isolation at this level of the hardware is attested to by the use of this approach by all four of the example systems of chapter 2, despite past investigations of isolation at lower levels of hardware (such as redundancy at the level of buses, CPUs, and memories in FTMP [Smit86]). This is because inexpensive VLSI components have made redundancy at the computer level the most economical. Section 5.1 describes additional requirements which must be placed on the URMC processing node hardware fault tolerance layer(s), which are derived from the choice of concurrent n-version programming as the software fault tolerance technique. Section 5.2 reviews the two main methods for fault tolerance in SISD machines: standby sparing and modular redundancy. In section 5.3, extensions of the two SISD fault tolerance techniques to MIMD machines are described: redundancy with t-fault diagnosis and modular redundancy with voting. Section 5.4 compares the two fault tolerance techniques for MIMD machines and concludes that modular redundancy with voting is the construct which is best for meeting the URMC requirements. An n-modular redundant scheme for distributed fault tolerance (DFT) has been described by Chen and Chen [Chen85], but was not designed to handle Byzantine faults [Lamp82], so it is determined that it is necessary to extend this scheme. Section 5.5 contains new work, as it describes the scheme used

in the URMC: it begins with the repset concept of the distributed fault tolerance scheme described by Chen and Chen, but adds several extensions: it is combined with Byzantine fault masking as described by Lamport et al. [Lamp82], Byzantine fault diagnosis through the methods of intermittent fault diagnosis [Shin87], and access to the spare processors is coordinated

through a monitor. Two possible methods of diagnosis are discussed,

centralized and decentralized. Special problems with I/O fault tolerance and with systems which use dissimilar hardware processors are also discussed. Finally, in section 5.6 a reliability model for the URMC processing node fault tolerance layer is derived.

5.1 Derived Requirements for the Layer.

In addition to the requirements stated in section 3.4.2, there are further requirements on the processing node fault tolerance layer(s) which are derived from the choice of concurrent n-version programming for software fault tolerance:

1) The concurrent nature of the software means that different tasks must be able to communicate their results to each other. Due to the real-time nature of the system, the communication delay between tasks must be minimized.

2) The n versions of each software process must be isolated from each other to prevent fault propagation. That is, if different versions of the same process share the same physical processor, there must be a protection scheme to protect the resources of one process from the other (fortunately, many current processors include hardware support for such protection mechanisms).

3) The throughput requirement has been increased by n times due to the n-redundant tasks. Also, the overhead of running the tasks which perform voting and input consistency between software versions requires still more throughput. Experimental work on SIFT shows that this can be up to 80% of the throughput [Smit86]. The FTP/AP and MAFT have reduced this overhead through migration of the redundancy management functions to hardware, which decreases flexibility and makes the system rely on special-purpose hardware. However, leaving the functions in software will require that the total throughput requirement be raised two or three times.

5.2 SISD Redundancy Management Methods.

There are two major approaches to managing a redundant SISD system: standby sparing and modular redundancy with voting. These can be combined to achieve dynamic modular redundancy. Since these approaches are standard, they will be described only briefly.

5.2.1 Standby Sparing.

The standby sparing scheme has n identical copies of the hardware with only one actively providing results. The other copies are spares which are used only in the event of detected failure of the primary hardware. Upon primary failure, a switch activates one of the spare components in its place. This system has the problem that the coverage of the diagnostic tests must be very high or the reliability will suffer. One means to improve the fault coverage is through comparison. The component is duplicated with both copies executing simultaneously. One copy outputs the results while the other copy serves only as input to a comparator. If there is a disagreement in the two copies' results, the comparator disconnects the output. This can be combined with standby sparing to yield a system which is fail-operational and which has very good coverage (see Figure 5.2.1-1), sometimes called a dual-dual system.

Figure 5.2.1-1. Standby sparing with comparison checking.

The reliability of a dual-dual system is given by:

(5.1-1)

where Rk is the reliability of the circuits needed for comparison and switching of modules, Rm is the module reliability, and C is the coverage factor [Fern89b].

5.2.2 N-Modular Redundancy.

In N-modular redundancy, there are n copies of the hardware followed by a voter (see Figure 5.2.2-1). All n copies compute a result and send it to the voter. The voter then chooses a decision result using some scheme, such as the majority value or the midvalue. If fewer than half of the copies have failed, the voter can mask the failures.

Figure 5.2.2-1. Redundant system with voting.

The system reliability of a voted system is given by:

    Rsys = Rv * SUM(i = 0 to floor(N/2)) C(N, i) * Rm^(N-i) * (1 - Rm)^i        (5.2-1)

where N is the total number of modules, Rv is the reliability of the voter, Rm is the reliability of a module, and C(N, i) is the number of i-combinations of an N-set [Siew82]. The number of faults which can be tolerated by the n components can be increased if reconfigurable NMR is used. In this scheme, it is assumed that at

increased if reconfigurable NMR is used. In this scheme, it is assumed that at

most one copy of the hardware will fail at a time. The fault is diagnosed and

the system reconfigured so that the failed copy is excluded from future voting,

which allows up to n - 2 faults to be tolerated. For example, in a 5-modular

redundant system which is not reconfigurable, only 2 faults will be tolerated, as

the third fault will leave a majority of incorrect copies. However, if the hardware

copies fail one at a time and failed copies are excluded from future votes, then

3 faults may be tolerated. The fourth fault will cause failure, as it is impossible

to attribute the fault to one of the two remaining channels as there are no

others with which to compare results. The system reliability for a reconfigurable N - modular redundant system with S spares is given by

[Siew82]:

    Rsys = Rv * SUM(i = 0 to P) C(N + S, i) * Rm^(N+S-i) * (1 - Rm)^i        (5.2-2)

where the symbols have the same meanings as in equation (5.2-1).
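Equations (5.2-1) and (5.2-2) can be evaluated directly, as in the sketch below. Since the upper summation limit P of equation (5.2-2) is not defined explicitly here, the sketch assumes P = S + floor(N/2), the total number of module failures the reconfigurable arrangement is presumed to absorb; the caller may supply a different limit.

```python
from math import comb, floor

def r_nmr(n, r_m, r_v):
    """Equation (5.2-1): reliability of an N-modular redundant system."""
    return r_v * sum(comb(n, i) * r_m**(n - i) * (1 - r_m)**i
                     for i in range(floor(n / 2) + 1))

def r_nmr_with_spares(n, s, r_m, r_v, p=None):
    """Equation (5.2-2): reconfigurable NMR with S spares.  The upper limit P is
    assumed to be S + floor(N/2) unless the caller supplies another value."""
    if p is None:
        p = s + floor(n / 2)
    return r_v * sum(comb(n + s, i) * r_m**(n + s - i) * (1 - r_m)**i
                     for i in range(p + 1))

# Example: triplex modules of reliability 0.99 behind a near-perfect voter.
print(r_nmr(3, r_m=0.99, r_v=0.9999))
print(r_nmr_with_spares(3, 2, r_m=0.99, r_v=0.9999))
```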

5.3 MIMD Redundancy Management Methods.

The two main methods for redundancy management in SISD systems, standby sparing and n-modular redundancy, have been extended to MIMD systems.

Standby sparing in an MIMD machine works similarly to that in an SISD machine. Each task is scheduled redundantly, and the results sent to a

comparator. If the results from redundant tasks compare correctly they are

released, otherwise the tasks are rescheduled for execution on other

processors. Several schemes for improving the coverage of processor faults

using the concept of t-fault diagnosis have been advanced, and these will be

discussed in section 5.3.1.

Modular redundancy with voting can also be extended in a

straightforward manner. Each processing node in the multicomputer may be

replicated n times, and the results passed through a voter either before they

are output from the node or before the next node inputs them. This scheme

was used in FTMP [Smit86], in which processors, memories, and buses were used in groups of three.

If no reconfiguration is allowed, then less than n/2 of the modules may

fail. If failures are restricted to one at a time and reconfiguration is allowed to mask modules which have failed, then as many as n - 2 modules may fail. If

the system is down to two modules which disagree and built-in diagnostics


can isolate the problem to one of the modules, then as many as n - 1 of the modules may fail. If failures occur one at a time and a spare can replace a failed module before another fails, then there is no limit to the number of modules which may fail. This shows that it is desirable to be able to access a pool of spares to replace failed modules.

The diagnosis and reconfiguration tasks are complicated by the fact that they must not introduce a single point of failure. Other problems are avoiding incorrect diagnosis of a failed processor due to a Byzantine fault, and coordinating access to the spare processors. A preliminary approach to modular redundancy in a MIMD computer was proposed by Chen and Chen and is discussed in section 5.3.2. This scheme is determined to be insufficient (as discussed in section 5.4), resulting in several extensions which are discussed in section 5.5.

5.3.1 Redundancy with System Diagnosis.

One way to achieve fault tolerance in a message passing MIMD

machine is to schedule each software task redundantly to the physical

processors. After the execution of the tasks is completed, the results are stored until the system can run diagnostics to determine the health of the

processors. The results of the tasks' computations are released only from tasks which ran on processors found to be nonfaulty by the diagnosis. Since only ·one copy of the task need be executed on a nonfaulty processor, this

·--~---~---- 106

scheme allows up to n - 1 of the n redundant processors to fail while still allowing the computation result to be correct and available. The most studied model for fault diagnosis is the PMC model [Prep67]. In the PMC model the system is broken down into processing elements (PEs), with each PE capable of testing one or more other PEs with perfect coverage.

This is represented with a directed graph, with the modules represented as nodes of the graph and with a directed edge from node i to node j if node i is capable of performing a test on node j. Edge E(i,j) is assigned a weight of 0 (1) if the test by node i finds that node j is good (faulty). A fault-free unit will always get the correct test result, while a bad processor may get an incorrect test result. The results of all of the tests are gathered together into a fault syndrome from which the condition of the system may be determined. A system which may have up to t faults and still be correctly diagnosed is called a t-fault diagnosable system. If this can be performed in one step it is called one-step t-diagnosable, while if the faulty units must be found and repaired in a sequence then it is called sequentially t-fault diagnosable. One-step t-fault diagnosability can be achieved optimally by having n = 2t + 1 units, each of which is tested by exactly t other units. Figure 5.3.1-1 shows an optimally 2-fault diagnosable system.

Necessary conditions for t-fault diagnosability are given in [Prep67].

Hakimi and Amin showed that if no two units test each other then the above

conditions are also sufficient, and they also derived necessary and sufficient

conditions for the case where no such restriction is placed on the tests

[Haki74]. A system in which no two units test each other is one-step t-diagnosable if and only if there are at least n = 2t + 1 units and each unit is tested by at least t other units. If units are allowed to test each other, then these same two conditions must be present plus another: for each integer p such that 0 <= p < t, any subset X of the vertices with |X| = n - 2t + p must have test links from units in the subset X to at least p units in V - X.

Figure 5.3.1-1. An optimally one-step 2-diagnosable system.
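One simple way to realize such an optimal design, sketched below under the stated conditions, is to arrange the n = 2t + 1 units in a ring and let unit i test units i+1 through i+t (mod n); no two units then test each other and every unit is tested by exactly t others. The construction is one possibility, not necessarily the exact assignment drawn in Figure 5.3.1-1.

```python
def optimal_test_assignment(t):
    """Test graph of an optimally one-step t-diagnosable system: n = 2t + 1 units,
    with unit i testing units i+1 .. i+t (mod n)."""
    n = 2 * t + 1
    return {i: [(i + d) % n for d in range(1, t + 1)] for i in range(n)}

def check_conditions(tests):
    """Check the quoted conditions: every unit tested by at least t others,
    and no two units test each other."""
    n = len(tests)
    t = (n - 1) // 2
    tested_by = {j: [i for i in tests if j in tests[i]] for j in tests}
    no_mutual = all(i not in tests[j] for i in tests for j in tests[i])
    return no_mutual and all(len(tested_by[j]) >= t for j in tests)

graph = optimal_test_assignment(2)   # a 5-unit, one-step 2-diagnosable system (cf. Figure 5.3.1-1)
print(graph)
print(check_conditions(graph))       # True
```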

One severe disadvantage of the PMC model is the assumption that all units will fail permanently, while many faults are intermittent in nature. This problem was addressed in [Mall78]. Tests are run on the processors and the results assembled into a syndrome. If a syndrome is ever consistent with some of the nodes having permanent faults, those nodes are flagged as faulty.

If the syndrome is not consistent, the testing continues. A system is defined as ti-fault diagnosable if it is such that if no more than ti units are intermittently faulty, then a fault-free unit will never be diagnosed as faulty and the diagnosis at any time is at worst incomplete, but never incorrect. It was found that the conditions for one-step and sequential ti-fault diagnosability are identical. It is shown to be both necessary and sufficient that a system is ti-fault diagnosable if, given any 2 subsets of units in the system S1, S2, such that both subsets' size is less than or equal to ti and they have no elements in common, the set A of remaining elements has at least one testing link to both S1 and S2 (see Figure 5.3.1-2). The system of Figure 5.3.1-1 is 2-fault diagnosable for permanent faults, but only one-fault diagnosable for intermittent faults.

Figure 5.3.1-2. Necessary partitioning for a ti-diagnosable system.

Another disadvantage to the PMC model is the assumption that a unit is

capable of performing a perfect test upon another unit. This is impossible to

achieve in practice, and an attempt to find a very thorough though imperfect test is likely to result in an extremely time-consuming diagnostic test. [Male80] suggests using the results of the user tasks as a test. Each task is run on two different nodes and the results compared to determine the health of the nodes. If the results from the two nodes disagree, a comparison task is run on one of the two nodes plus a third node. The method assumes that the coverage from comparison of results is very high, and assumes a centralized analysis to allow the faults to be located.

Chwa and Hakimi [Chwa81] suggest combining the comparison technique with intermittent t-fault diagnosis techniques to arrive at a diagnosis for a system. Each task is scheduled to execute on several different processors of the system, and the results compared. A comparison between processors i and j is considered as both a test of i on j and of j on i. If tasks are assigned to z processors at a time, then each task execution causes a comparison test between a processor and z - 1 other processors. The results of these comparisons are assembled, and checked for a syndrome consistent with permanent faults of fewer than ti of the processors. If no miscomparisons occur, the results of all task computations may be released. If a consistent syndrome is found, then those processors are diagnosed as faulty, and all

results of tasks executed on the faulty processors are thrown away - results from other processors which executed those same tasks are released. If an

inconsistent syndrome is found, the results of tasks which did not agree are

held back while more tasks are executed on the processors. When a

consistent syndrome is found, the tasks on the processors which were not


found to be faulty will be released. Tasks which were run only on processors in the fault set are rescheduled for execution on other processors. Since the diagnosis may be incomplete, it is possible but unlikely that an incorrect task result will be released. An algorithm for diagnosis is given which completes in O(|E|) time. This system has the disadvantage of having to hold the results from tasks while awaiting a syndrome to appear which is consistent with a

permanent fault situation. Thus several other algorithms have been presented which speed up the diagnosis by avoiding the wait for a syndrome which is

consistent with permanent faults [Dahb83] [Dahb85] [Yang86]. These algorithms execute in O(|E|) time, and results must still wait upon a diagnosis.

Another approach to comparison is that in [Agra85], in which an

algorithm is proposed named the Recursive Algorithm for Fault Tolerance

(RAFT). RAFT uses the comparison method to detect faults but does not use

ti-diagnosability. A task is scheduled to be run on two processors. A digital

signature is generated by the processors while they are computing the values

for the task, with the signature uniquely representing the activity taking place

in the processor while it executes the task. If the signatures from the two

processors agree, the results are released. Otherwise, the task is assigned to

a third processor. The signature of the third processor is compared to that of

the first. If it agrees with the first, the result from the first is released and the

second is added to the list of suspect processors. If not, it is compared to the second. If it still does not match, the task is run on a fourth processor, whose signature is compared to the first, second, and then third. The task keeps being placed on a new processor until either there are no more or some pair have the same digital signature. The article computes probabilities for RAFT to make the correct choice when releasing the results.

A problem with using comparison for fault diagnosis as described above is that while it provides complete coverage for the results of a particular task, it does not provide complete coverage of the processors upon which the tasks were run. Some processors may be partly faulty in such a way that certain of the tasks run on them fail while others succeed. For instance, a bit fault in program memory will affect the execution of the task which runs the code with that bit in it, but no other task in the node will be affected. This leads to the following scenario. A task may be scheduled to run on m processors, and it computes m different answers. After diagnosis, m - 1 of the processors may be diagnosed as faulty. Should the result from the processor which was diagnosed as good be released? If it is desired that results from a task be released from a processor even if that processor is the only one running the task which was diagnosed as correct, then there is a possibility of releasing incorrect results due to an incomplete diagnosis. If it is required that more than one execution of a task reach the same result before those results are released, then there is no advantage over n-modular redundancy.

5.3.2 Modular Redundancy with Voting.

Another technique for constructing a fault-tolerant multicomputer is to

have n-modular redundancy at each of the hardware nodes [Chwa81]. When any of the nodes has exhausted its fault tolerance, that node can no longer

help with the computation. If it is a real-time system which must have all nodes working to meet its deadlines, then the system will fail. The reliability of such a system may be found by finding the reliability of each NMR component and then placing them in series. Another scheme for fault tolerance is to have dynamic n-modular redundancy, as in the DFT scheme described by Chen and Chen [Chen85]. A MIMD distributed system consisting of several nodes is divided into replication sets (repsets) of three processors, and any additional processors are placed

into a pool of spares. The processors in each repset maintain input consistency of all messages received from other repsets (e.g., with the

algorithm of [Manc86b]), execute identical tasks, and are responsible for the maintenance of a majority of correct processors in their own repset. Thus each repset behaves as a fault-tolerant SISD node of the multicomputer. A repset

sends and receives messages with other repsets just as the node of a multicomputer sends and receives messages with other nodes. A repset processor maintains a list of the members of its own repset and of all repsets with which it interfaces. When one of the processors in a repset fails, the other two processors in its repset detect the failure through the result voting and

command one of the spare processors to replace the failed processor. A

spare processor takes over the tasks of a failed processor if two other processors simultaneously tell it to do so (within the limits imposed by synchronization). After the reconfiguration, the processors in all interfacing repsets are informed of the change so that they may send and receive messages correctly between themselves and the new member. The DFT scheme of Chen and Chen ignores the problem of Byzantine agreement between the processors and the problem of coordinating access to the spares between the repsets.

5.4 Voting vs. Diagnosis.

Analysis shows that systems which are ti-fault diagnosable using comparisons as the tests will have lower hardware overhead for a given reliability than systems employing n-modular redundancy [Chwa81] [Dahb85]. However, this comes at a price: results from tasks must be delayed until the fault diagnosis is performed. This precludes any of the tasks used in a given diagnosis cycle from depending upon results from tasks previously used in that cycle. This would cause a large delay for systems with tasks arranged in a precedence order, and since the diagnosis algorithms have complexity O(|E|), the delay grows as the size of the system increases. Because they cannot depend upon each other, the tasks in a diagnosis set must all come from the same level of a precedence graph or from different precedence graphs. In addition, the execution of the tasks is arranged in cycles, so the tasks must either be of the same execution time or a lot of processor throughput will be wasted while waiting upon the completion of tasks in a task cycle. For these reasons, ti-diagnosability is still not a practical technique to determine the validity of output results in most systems, whether real-time or not. The comparison methods suggested in [Male80] or in [Agra85] could be used to advantage in a system which had enough spare execution time to allow tasks to be reexecuted upon failures. However, the problem of who is to determine when to reschedule the tasks and when to release the tasks is not covered in these articles. Also, in real-time systems there are often many tasks (or sequences of tasks in a precedence list) which will barely complete before a hard deadline, and the time cannot be taken to allow a task to be executed more than once.

Using n-modular redundancy solves the problems of throughput and time delay which the comparison technique has, but at the expense of additional processor overhead. Chen and Chen have suggested a scheme for distributed fault tolerance with n-modular redundant repsets, but unfortunately, their scheme has the following problems:

1) Byzantine faults are not considered, so the processors in a repset may

misdiagnose the failed processor and attempt to replace the wrong

one.

2) The reconfiguration has a single point of failure, the spare processor itself. A spare processor does not join a repset unless requested by

two processors in a repset. But what if a spare processor has failed in


such a way as to agree to join two different repsets? Some means of

coordinating the reconfiguration in a distributed fashion is required.

The above analysis shows that a scheme should be developed which allows for at least masking and possibly also diagnosis of Byzantine faults, and which coordinates access to the spare processors in the event of more than one diagnosed processor failure.

5.5 URMC Redundancy Management Technique.

The processing nodes of the URMC will be split into repsets, as in

[Chen85], with m processors in each. In addition, there will be k spares, for a

total of k + m = n processors. The scheme of Chen and Chen will be used, with extensions to allow the masking and diagnosis of Byzantine faults.

If there are many repsets and the exposure time is large, the probability

of more than one Byzantine fault occurring in a given repset increases. It is

possible to mask Byzantine failures using an algorithm as described in

[Lamp82]. However, even with authenticated signatures this requires 2f + 1 processors. If the Byzantine failures could be diagnosed, then the failed

processor could be replaced in the same fashion as a processor which fails in

a consistent fashion. Byzantine faults may be diagnosed using an algorithm

for intermittent fault diagnosis [Shin87], such as those described above.

Section 5.5.1 will discuss modifications to the DFT concept of Chen and

Chen to allow Byzantine fault masking. Byzantine fault diagnosis will be

discussed in section 5.5.2, first with a centralized method and then with a

decentralized method. Coordination of access to the spares pool will be discussed in section 5.5.3. Section 5.5.4 will discuss fault tolerance of nodes which perform system I/O, as nodes with hardwired I/O cannot be replaced by spares. Finally, section 5.5.5 will discuss modifications to this scheme which are necessary if processors with dissimilar hardware are to be used within repsets.

5.5.1 Byzantine Fault Masking.

Assume that two repsets are to communicate as in Figure 5.5.1-1a. The problem of masking a Byzantine fault in a repset processor may be handled in one of two ways:

1) Complete communication - each member of a sending repset sends a message to every member of the receiving repset. The three receivers

then exchange the values and vote on them to arrive at the value of the

sender with Byzantine failures masked. For this purpose, the

unauthenticated algorithm of [Lamp82] may be used, with each

sending processor in turn acting as the general and the three receiving

processors acting as the lieutenants. When values have been arrived at for all three sending nodes, the results are voted (see Figure 5.5.1-1b).

2) Reduced communication - a repset may arrive at Byzantine agreement

upon the values before sending them to another repset, using the

authenticated algorithm of [Lamp82]. After arriving at distributed agreement in the sending repset, the values are sent by each sender to 117

only one member of the receiving repset (its peer processor), where

the values are voted once again using an algorithm for Byzantine

agreement (see Figure 5.5.1-1c). The second interactive agreement is necessary because a Byzantine failed processor in the sending repset

may send a false value to its peer processor in the receiving repset.

a. Abstracted communication    b. Complete repset to repset communication    c. Reduced repset to repset communication

Figure 5.5.1-1. Sending messages between repsets.

Since all processors in a repset execute the same sequential process, nondeterminacy must be resolved in the same way in the different copies of the hardware, requiring that all input messages be kept consistent in the

different copies. The input messages may be kept consistent using the algorithm described in [Manc86b].

The latter method described above requires that somewhat fewer messages be sent in total. However, the former method will result in more diagnostic information being collected, which may be of use in Byzantine fault diagnosis.
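A much simplified sketch of the complete-communication case is given below. Each of the three receivers keeps the value it received directly from a sender plus the two copies relayed by its peer receivers, takes a per-sender majority (one round of the unauthenticated exchange, with a default value when no majority exists), and then votes across the three agreed sender values. Synchronization, authentication, and the full recursive algorithm of [Lamp82] are deliberately left out.

```python
from collections import Counter

DEFAULT = None    # default used when a receiver cannot find a majority for a sender

def majority(values):
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else DEFAULT

def receiver_decision(direct, relayed_by_peers):
    """One receiver's view: for each sender, take the majority of its own copy and
    the copies relayed by its two peer receivers, then vote across the senders."""
    agreed = []
    for sender in range(len(direct)):
        copies = [direct[sender]] + [peer[sender] for peer in relayed_by_peers]
        agreed.append(majority(copies))      # masks one Byzantine sender or relayer
    return majority(agreed)                  # final vote across the three senders

# Example: sender 2 is Byzantine and tells each receiver something different;
# every receiver honestly relays what it actually received.
received = [[5, 5, 9],    # values receiver 0 got from senders 0, 1, 2
            [5, 5, 7],    # receiver 1
            [5, 5, 8]]    # receiver 2
for r in range(3):
    peers = [received[p] for p in range(3) if p != r]
    print(receiver_decision(received[r], peers))   # each receiver decides 5
```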

5.5.2 Byzantine Fault Diagnosis.

In the systems described in chapter 2, Byzantine faults are masked but not diagnosed. As the number of repsets in the system grows, the chance of a fault occurring in a repset which already has a Byzantine fault present

increases. There are three possible approaches to this problem:

1) the fault tolerance level of the repsets may be increased to handle

multiple Byzantine faults,

2) the fault tolerance level of the repsets may be increased to handle

some Byzantine faults and some benign faults, or

3) an attempt may be made to diagnose the Byzantine faults and to

reconfigure the repset to eliminate them. To handle f Byzantine faults, a repset must have 3f + 1 members if using unauthenticated messages and 2f + 1 members if using authenticated messages [Lamp82]. A model for systems with dual failure modes has been described by

Meyer and Pradhan [Meye87]. It is shown that to tolerate up to f faults, with up to m of those faults malicious, the repset must have n members with n > f + 2m. While it is always possible for a malicious processor to behave in such a way as to elude diagnosis, this is improbable, so an attempt may be made to diagnose Byzantine faults [Shin87]. Byzantine diagnosis may be considered as a problem similar to intermittent fault diagnosis. First the processors arrive at interactive consistency upon their values.

If any processor must have its value set to the default value of the interactive consistency algorithm, then clearly it has a Byzantine failure. More elusive failures may be found by having each processor compare the value it had for the other processor's internal value with the value which that processor sent it

during the message rounds. Inconsistencies are used by each processor to

prepare a list of suspicious processors. The lists of suspicious processors are

combined to arrive at a fault syndrome. The Byzantine faults may then be

considered as intermittent faults, and the results for ti-fault diagnosability used to ensure correct (although possibly incomplete) diagnosis. Shin describes a

method to collect the syndromes for evaluation which limits the number of

Byzantine faults which a processor may exhibit, by not allowing any more

communication between two processors if one accuses the other of sending it

a value inconsistent with its actual value. The same result could be achieved

by remembering the accusations for some time.
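The bookkeeping just described might be organized roughly as in the following sketch: each processor accuses any peer whose directly sent value disagrees with the value the processor concluded that peer holds after the exchange, and the per-processor suspicion lists are merged into a syndrome. The data layout and the two-accuser threshold (used so that a single lying accuser cannot condemn a good processor) are assumptions for illustration; the actual diagnosis would rely on the ti-fault diagnosability results cited above.

```python
def suspicion_list(direct_values, agreed_values):
    """Peers whose directly sent value disagrees with the value this processor
    concluded they hold after the interactive consistency exchange."""
    return {peer for peer in direct_values
            if direct_values[peer] != agreed_values.get(peer)}

def fault_syndrome(all_lists, accusers_needed=2):
    """Merge the per-processor suspicion lists; a processor accused by at least
    accusers_needed others is entered into the syndrome as suspect."""
    counts = {}
    for accuser, suspects in all_lists.items():
        for suspect in suspects:
            if suspect != accuser:
                counts[suspect] = counts.get(suspect, 0) + 1
    return {p for p, c in counts.items() if c >= accusers_needed}

# Hypothetical four-processor repset in which processor 3 tells its peers
# different things while the agreed (voted) value for it is 9.
lists = {
    0: suspicion_list({1: 10, 2: 10, 3: 7}, {1: 10, 2: 10, 3: 9}),
    1: suspicion_list({0: 10, 2: 10, 3: 8}, {0: 10, 2: 10, 3: 9}),
    2: suspicion_list({0: 10, 1: 10, 3: 6}, {0: 10, 1: 10, 3: 9}),
    3: suspicion_list({0: 10, 1: 10, 2: 10}, {0: 10, 1: 10, 2: 10}),
}
print(fault_syndrome(lists))   # {3}
```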

Where should the fault syndromes be collected? One possibility would

be to have centralized fault diagnosis, in which all suspicious processor lists

would be sent to a single repset which acts as the diagnosis center for the

entire system. Using a repset as the diagnosis center ensures that the diagnosis will be performed in a fault-tolerant fashion.

While centralized diagnosis is likely to be the most complete, the diagnosis algorithms increase in complexity as O(|E|). As the system size grows, the algorithm could prove to be a bottleneck. In addition, the centralized diagnosis repset would be flooded with messages in a large system.

It may be necessary to break a large system down into smaller diagnosis units, which may be done by using each repset as a diagnosis unit.

Since diagnosis will be distributed, it is necessary that all lists of suspicious

processors be shared using an interactive consistency algorithm. If complete

repset to repset communication is used, then the processors in the receiving

repset should each construct a list of suspicious processors and then send that list to each of the processors in the repset which sent them the message.

If reduced repset to repset communication is used, then the processors may

attempt diagnosis based only upon the interactive consistency algorithms they

participate in with members of their own repset. If reduced communication is

used, care must be taken not to diagnose a processor as failed because it

sends out an inconsistent input value during execution of the input

consistency algorithm - the fault may lie in the processor in the sending repset.

5.5.3 Spare Processor Coordination.

In the case of centralized diagnosis, the same repset which handles diagnosis may also handle reconfiguration. After diagnosing a faulty processor, the diagnosis repset commands a spare processor to join the repset of the failed processor. It then informs the members of that repset and its interfacing repsets of the change. In the case of decentralized diagnosis, the reconfiguration may still be handled in a centralized fashion. This is possible because the failure rate of the processors is low enough that even in a system with thousands of processors, centralized reconfiguration should not be a severe bottleneck. When a processor is diagnosed as failed by the other members of its repset, they communicate with the centralized reconfiguration repset to inform it of the need for a spare. The reconfiguration repset then continues as described in the above paragraph.

Several additional steps are necessary when a spare processor joins a repset:

1) The new member synchronizes its clock with that of the other repset members.

2) The application code and current state data are sent to the new member by the other members of the repset.

3) The new member votes the application code and state data before initializing itself. In repsets which can mask multiple faults, this will guard


against an undiagnosed Byzantine failed member sending the new

member the wrong initialization code and/or data.

This reconfiguration process will probably take a significant amount of time. As the number of repsets in the system increases, the chance of a second failure prior to reconfiguration of the failed processor increases. Thus it may be necessary to make each repset tolerant to multiple faults. The number of faults each repset should tolerate can be determined with the reliability model of section 5.6.

5.5.4 I/O Hardware Fault Tolerance.

In a system with a large amount of I/O it is impractical to hardwire every I/O device to every processing node in the system. I/O in MAFT was accomplished by having all I/O devices on redundant buses, with any processor capable of accessing the bus. Since the bus may become a bottleneck, a different I/O scheme is proposed for the URMC.

I/O will be hardwired or bused to special I/O repsets. If more I/O is required than one repset can handle, then several repsets will be used, each with a different subset of the total I/O wired to them. Each processor in the I/O repset is connected to the other repsets of the multicomputer by the virtual completely-connected network to be described in section 6. However, a failed I/O processor cannot be replaced from the pool of spare processors, as the spares do not have the I/O wired to them. Thus the I/O repsets must gradually decrease in size due to attrition. This may cause the I/O repsets to need more members than the processing repsets, given that the reliability of an I/O node is the same as that of a processing node. Two problems cause the message traffic between members of the I/O repset to be very high: the I/O repset should be able to handle interrupts from the I/O devices with which it interfaces, and not much processing is performed between message exchanges (all the processors do is vote on the values). Because the members of an I/O repset are static (failed I/O processors are not replaced), it is practical for them to be interconnected by their own private completely connected I/O repset communication network. This would keep much of the traffic of the I/O devices off of the virtual completely connected network and reduce interrupt latency.

5.5.5 Support for Dissimilar Processors.

Design errors are usually considered as a software fault problem, but it

is also possible to have design errors in the hardware used for the

processors. It has been argued that if different software versions are used, then the underlying hardware will probably be in different states and the

chance of triggering a design error in multiple copies of the hardware is small

[Aviz86]. However, with the exception of the Airbus A320, the example

systems of chapter 2 compare the results of dissimilar processors to detect

faults (while the A320 system does use two different types of processors,

comparisons are made between similar processors running dissimilar

software). Therefore it may be assumed that the URMC may be called upon to use dissimilar processors as well. This can be handled simply by requiring that each member of a single processing repset be a different type of processor. The total number of processor types required will be the same as the number of member processors in a repset. Suppose all repsets have four members, using processor types A, B, C, and D. The spare processors will be broken down into four pools of spares, one for each type. When a processor failure is diagnosed, the reconfiguration repset replaces it with a processor of the same type. Having dissimilar processors requires that processors in a repset be synchronized less tightly than the instruction level. Since I/O will usually be simple, similar processors may be used in the

I/O repsets even if dissimilar processors are used in the processing repsets; the processors could then be synchronized at an instruction level. The reduced interrupt latency, extra speed, and convenience of synchronizing the

processors on an instruction level may be worth the loss of resistance to generic faults. If it is determined that dissimilar processors should be used for

I/O as well, then the scheme may be modified to allow frame synchronization

instead of instruction synchronization, at the expense of longer interrupt

latency and reduced coverage of nongeneric hardware faults.

5.6 Reliability Model.

The reliability model breaks the hardware faults into two types as in the dual failure mode model of Meyer and Pradhan [Meye87]. In their article, faults

were classified as either benign or malicious. Benign faults cause the

processor to stop operating entirely, while malicious faults may result in arbitrary processor behavior. If only benign faults are permitted and a proper diagnosis scheme is used, then there may be as few as one processor left operating without causing a system failure. To tolerate f benign faults there need be only f + 1 processors. To tolerate f faults with at most m of them malicious, there must be more than f + 2m processors [Meye87]. The URMC processing nodes may be divided into three groups: nodes in I/O repsets, nodes in processing repsets, and nodes in the pool of spares.

The hardware failure rate will be represented as the Greek letter lambda. The two different types of faults will be differentiated by the subscript on the letter lambda, with lambda sub M representing malicious faults and lambda sub B representing benign faults. Both fault types together will be represented by lambda sub H. The diagnosis rate of a malicious fault is represented by the

Greek letter beta, and the repair rate by the Greek letter mu.

Failed nodes in the 1/0 repsets are not replaced from the pool of

spares. Therefore the reliability of each 1/0 repset may be modeled as if it were a stand alone system. A four node example system is shown in Figure 5.6-1. In this diagram, there are seven states. They are:

1) all four nodes of the repset correct,

2) three nodes correct with one benign failure,

3) three nodes correct with one malicious failure,

4) two nodes correct with two benign failures,

5) two nodes correct with one malicious failure and one benign failure,

6) complete repset failure.

Figure 5.6-1. Reliability of a 4-node I/O repset.

It is assumed that the system begins with all four nodes OK. If a benign failure occurs, the system diagnoses it quickly enough that the diagnosis time is ignored here, and the system switches to 3 OK with one permanently failed. If a malicious failure occurs, the system moves to the 3 OK, 1 intermittently failing state. A malicious failure takes some time to diagnose, and the diagnosis rate is represented by beta. If the malicious fault is diagnosed, or the same node develops a benign failure, the system moves to the 3 OK, 1 permanent failure state. A second malicious failure before the first is diagnosed results in failure of the repset. Other state transitions are as marked. Note that lambda sub H equals the total hardware failure rate, including both malicious and benign failures.

The model for the processing repsets is more complicated due to their interaction through the pool of spares and its monitor. Therefore a simplified,

approximate model is proposed, in which the processing repsets and the spares pool are modeled separately. Each processing repset may be modeled as shown in Figure 5.6-2. It is assumed that the spares pool is infinite when modeling a single processing repset. The model is like that for an I/O repset except for the repair rates, mu.

Figure 5.6-2. Markov model for 4-node processing repset.

There is a possibility that a spare may have an undiagnosed malicious failure when it is taken into the repset. This is not shown in the model in Figure

5.6-2. To include this possibility in the model, the μ transition may be divided

into two.
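These Markov models can be evaluated numerically by building the generator matrix and computing a matrix exponential. The sketch below (Python with NumPy and SciPy, for illustration only) uses the six states listed above for the I/O repset; the transition list is a partial, illustrative reading of Figure 5.6-1, and both the rate values and the scaling of λ by the number of surviving nodes are assumptions rather than values taken from the figure.

import numpy as np
from scipy.linalg import expm

lam_B, lam_M = 0.9e-4, 0.1e-4   # illustrative benign/malicious failure rates (per hour)
beta = 360.0                    # illustrative diagnosis rate of a malicious fault (per hour)

states = ["4 OK", "3 OK, 1 benign", "3 OK, 1 malicious",
          "2 OK, 2 benign", "2 OK, 1 malicious + 1 benign", "FAIL"]
transitions = [
    (0, 1, 4 * lam_B),   # benign failure of any of the four nodes
    (0, 2, 4 * lam_M),   # malicious failure of any of the four nodes
    (2, 1, beta),        # malicious fault diagnosed; node treated as permanently failed
    (2, 5, 3 * lam_M),   # second malicious fault before diagnosis; repset fails
    (1, 3, 3 * lam_B),   # further benign failure
    (1, 4, 3 * lam_M),   # further malicious failure
    (4, 3, beta),        # second malicious fault diagnosed
    # ... remaining transitions as marked in Figure 5.6-1 ...
]

Q = np.zeros((len(states), len(states)))
for i, j, rate in transitions:
    Q[i, j] += rate
np.fill_diagonal(Q, -Q.sum(axis=1))       # generator rows sum to zero

p0 = np.zeros(len(states)); p0[0] = 1.0   # start with all four nodes OK
p_t = p0 @ expm(Q * 10.0)                 # state probabilities after a 10-hour flight
print("P(repset failure in 10 h) ~", p_t[-1])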

Finally, the spares pool may be modeled by assuming that every failure

results in a spare being removed from the pool. After the pool has been

emptied, then it is assumed that all successive failures are malicious and take

place in the same processing repset. For a case with four members in the

processing repset, then the second failure after exhausting the spares causes

failure (see Figure 5.6-3). This should provide a more pessimistic answer than

reality. The system shown is a special type of Markov model known as a pure

death process with linear rate. Its reliability may also be found by considering

it to be an m-out-of-n system, where m is the number of processors in all

working repsets minus the number of malicious failures a single repset can

stand, and n is the total number of working processors and spares.



Figure 5.6-3. Markov model for spares pool reliability.

The final model for the processing node hardware is shown in Figure

5.6-4. If there are i I/O repsets and j processing repsets, then the failure rate is

i times the failure rate of an I/O repset from Figure 5.6-1, plus j times the failure

rate of a processing repset from Figure 5.6-3, plus the failure rate of the

spares pool from Figure 5.6-3.



Figure 5.6-4. URMC processing node reliability diagram.
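The combination just described amounts to treating the i I/O repsets, the j processing repsets, and the spares pool as independent series elements. A minimal sketch follows; for the small failure probabilities of interest the product form used here agrees with the sum of failure rates stated above, and the numeric inputs are placeholders only.

def node_hardware_failure(p_io, p_proc, p_spares, i, j):
    """Failure probability of the series system of Figure 5.6-4: i I/O repsets,
    j processing repsets, and the spares pool, treated as independent blocks."""
    r_total = ((1.0 - p_io) ** i) * ((1.0 - p_proc) ** j) * (1.0 - p_spares)
    return 1.0 - r_total

# illustrative values only; the inputs come from the individual models above
print(node_hardware_failure(p_io=6e-11, p_proc=6e-11, p_spares=1e-10, i=2, j=3))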

The models above may be used to determine acceptable values for the failure rates. The ratio between the malicious and benign failure rates is especially important, as a repset may withstand many more benign failures than malicious. This indicates that care should be taken to include extensive self-diagnosis capabilities, such as background testing, watchdog timers, ticket checks, etc. Also, self-checking pairs could be used at each node, which would make the coverage very close to 100%.

Another way to reduce the rate of failure due to too many malicious failures is to make the Byzantine diagnosis rate high. This indicates that probably the complete communication model should be used for repset to repset messages because of its greater diagnostic capability, although this will cause more traffic on the communication network.

Because I/O processors cannot be replaced by spares, there must be more of them in a repset, they must be made highly reliable, or both.

Otherwise, they will become the weak point in the system. It is proposed that the I/O nodes in a single I/O repset be completely-connected with their own high-speed network in addition to the links to the rest of the system.

6. Network Fault Tolerance.

The processing nodes of the URMC communicate with messages passed through an interconnection network. This network provides the services of a "perfect" completely-connected network: any node may send a message to any other node, and it is guaranteed that it will be delivered within some time interval. The interconnection network must provide this service despite hardware faults in the interconnection network or in relay points. The interconnection network should also be physically spread out, so that it is not liable to destruction by a single incident of damage to the airplane. In addition, further requirements have been imposed upon the communication system by the implementations chosen for higher levels of abstraction.

Many fault-tolerant interconnection networks have been proposed.

From this array of alternatives it is impossible to say which is optimal with respect to use in a given application such as flight control. Therefore, several of the more commonly used alternatives will be presented and a subjective argument used to choose one for use in the URMC. This selection will then be analyzed for reliability and performance to verify that it is an acceptable alternative.

Section 6.1 describes additional requirements, derived from the choice of implementation for the software fault tolerance and processing node fault


tolerance, which must be placed on the URMC communication system fault tolerance layer(s). Section 6.2 reviews the most common architectures for interconnecting the processing nodes of a multicomputer, and chooses the unidirectional link fault-tolerant graph (FG) network for further investigation. Section 6.3 describes the interconnection rules and fault-tolerant routing strategy for the FG network. Section 6.4 covers reliability analysis of the network, section 6.5 covers performance analysis, and section 6.6 summarizes the results of this chapter.

6.1 Network Requirements.

The algorithm for Byzantine agreement used by the processing nodes of the URMC places the following requirements upon the communication network [Lamp82]:

1) every message that is sent is delivered correctly,

2) the receiver of a message knows who sent it, and

3) the absence of a message can be detected.

The first requirement will be met if the network appears to be a fault-free completely-connected network, as already stated in the network requirements of chapter 3. This will be accomplished by allowing the node hardware fault tolerance layer to make a call to the communication layer containing a message to send and its destination. Using a series of layers similar to the OSI network, the message will be broken into packets, routed, and received.


The second requirement will be met if authenticated communication is used at the node fault tolerance level, and it has already been decided in chapter 5 that this will be used to allow Byzantine fault diagnosis. Each message will be signed by the sending processor with a digital signature

[Rive78].
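The sketch below illustrates only the sign-and-verify pattern implied by this requirement. The thesis proposes true digital signatures in the sense of [Rive78]; a keyed MAC from the Python standard library is used here solely to keep the example self-contained, and the shared key table is a hypothetical stand-in for a real key-distribution or public-key scheme, under which receivers would need only the sender's public key.

import hmac, hashlib

NODE_KEYS = {"node-3": b"secret-key-of-node-3"}   # hypothetical key table

def sign(sender: str, payload: bytes) -> bytes:
    """Tag a message so receivers can attribute it to its sender."""
    return hmac.new(NODE_KEYS[sender], payload, hashlib.sha256).digest()

def verify(sender: str, payload: bytes, tag: bytes) -> bool:
    expected = hmac.new(NODE_KEYS[sender], payload, hashlib.sha256).digest()
    return hmac.compare_digest(expected, tag)

msg = b"surface command: elevator +2.5 deg"
tag = sign("node-3", msg)
assert verify("node-3", msg, tag)             # authentic message accepted
assert not verify("node-3", msg + b"x", tag)  # altered message rejected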

The absence of a message can be detected by synchronizing the

processors and guaranteeing that a message will be delivered within some

upper limit on time. The members of a repset must be synchronized using a

distributed fault tolerant clock synchronization algorithm. These algorithms are

evaluated with respect to agreement (how closely correct clocks agree) and

accuracy (how closely correct clocks follow real time). For most real-time systems, the former may be reduced at the expense of the latter. However, we

wish to make the design constraints on the interconnection network as tight as possible, so assume that we need optimal accuracy, i.e., the accuracy of the

logical clocks is bounded only by the accuracy of the physical clocks. One

such algorithm is described in [Srik87]. If the optimal accuracy version of the

algorithm is used, the maximum skew between processors is given by:

(6.1-1)

where Dmax is the maximum difference between clocks in correct processors,

tdel is the maximum time delay in delivering a packet, r is the bound on the

accuracy of the hardware clock, and dr is the rate of drift of the hardware

clock. Increasing skew between processors increases the delay in arriving at

distributed agreement, and thus delays the availability of output. Abandoning a completely-connected network (with a diameter of 1) can greatly increase the diameter and hence the delay in message passing. Messages must be sent from node to node through a short distance, so the diameter of the URMC interconnection network should be kept as small as possible. The main requirements for the interconnection network are high reliability and performance, and methods for evaluating these are discussed in sections 6.4 and 6.5, respectively.

6.2 Interconnection Network Overview.

As explained by Pradhan [Prad85], a point-to-point interconnection network may be represented as a graph G(V,E), with V the set of all vertices in the graph, labeled i, and E the set of all edges, with edge (i,j) present only if there is a connection between vertex i and vertex j. From these graphs an architecture may be developed which is either link-oriented or bus-oriented. In a link-oriented graph, the vertices are considered as processors and the edges are links between the processors. Communication takes place by making hops from processor to processor through the interconnecting communication links. In a bus-oriented system, each node of the graph represents a bus, and the edges are processors which link the buses. Communication takes place by making hops from bus to bus through the interconnecting processors. Figure 6.2-1 shows how a graph, G, may be turned into either a link-oriented architecture,

LA(G), or a bus architecture, BA(G). There are many possible graphs for networks, only a few of which have been studied in any great detail.




Figure 6.2-1. Link and bus architectures.

There have been several overviews of interconnection networks. Pradhan discusses such architectures as the shared bus, shared memory, loop, tree, dynamically reconfigurable networks, binary cube, and fault-tolerant graph (FG) networks in [Prad86]. Uhr describes the pipeline, ring, star, mesh, pyramid, tree, and hypercube [UhrL87]. Quinn adds the binary shuffle-exchange, cube-connected cycles, and butterfly [Quin87]. An overview of multistage interconnection networks (MIN) is given in [Feng81], and an


overview of fault-tolerant MIN may be found in [Adam87]. Some other networks are considered by Agrawal and Janakiram [Agra86]. According to Agrawal and Janakiram, communication networks may be evaluated with respect to the following characteristics: average distance

between nodes, number of communication links, routing algorithm, fault

tolerance, and expansion capability. The article [Agra86] investigates these

characteristics for several networks, but not all that have been mentioned above. With all of the types of interconnection possible, it is impossible to say which is optimal for the URMC. However, brief subjective arguments are

presented to narrow down the choices to one alternative which is acceptable,

the link-oriented FG network. Some networks may be eliminated because their interconnection sends

all data through a single point, which is a single point of failure, and which will

act as a bottleneck as the number of processors increases: the ring, star,

pipeline, tree, and pyramid may be eliminated this way (see Figure 6.2-2). This makes these interconnections impractical for large numbers of processors,

with current technology limiting the number of processors to about 30, or even in special cases perhaps a hundred [UhrL87].

The crossbar switch has much higher bandwidth between processors,

but its complexity grows as O(n^2) (see Figure 6.2-3). Hence the number of

processors which may be feasibly connected with a crossbar is limited to

about the same value as for the bus, ring, and star [UhrL87]. The MIN reduces the number of connections required, with complexity

growing as O(N log N), and without severe bottlenecks. Uhr suggests that it


should be possible to construct systems with 1024 or even 8096 processors, which is plenty for the URMC's purposes. However, there is still an overhead of log N switches per processor in the multicomputer, and the network should be extendable to larger computers if future technology allows.

Others may be eliminated because the distance between processors grows too quickly. The mesh, for instance, has a diameter which grows as

O(n), while the diameter of the hypercube, FG network, cube-connected cycles, and butterfly increases as O(log n). We may insist that the diameter grow as O(log n), and thus eliminate the mesh. In a hypercube, the number of connections per node is log N, while the butterfly has four and the cube-connected cycles has three. An FG network has N = r^m processing nodes, each with degree r, and with the diameter of the network growing linearly with m. The higher the degree possible in each node, the smaller the FG network's diameter. This is an advantage because the number of connections per node may be traded off against the diameter and fault tolerance of the network. The FG network is also self-routing and optimally fault-tolerant [Prad85]. This network will be considered as the preferred candidate.

Link-oriented architectures are better suited to the technology which allows the highest bandwidth, fiber-optic cables. Therefore the link-oriented FG network will be used.

The FG network considered will be the class with unidirectional links discussed by Sengupta, Sen, and Bandyopadhyay [Seng87], due to the simple fault-tolerant routing scheme available for this network.

6.3 The Unidirectional Link FG Network.

An FG network is built by augmenting an (r,m) shuffle-exchange network. Let the digraph be represented by G(V,E) as discussed above. There are r^m nodes in V (r and m positive integers), which may be numbered in radix-r as (i_0, i_1, ..., i_{m-1}). There is a directed edge from node i to node j if any of the following conditions is satisfied:

1) i_k = j_k for all 0 ≤ k ≤ m-2,

2) i_k = j_{k-1} for all 1 ≤ k ≤ m-1 and i_0 = j_{m-1},

3) if i_k = i_{k+1} for all 0 ≤ k ≤ m-2, then j_k = (i_k + 1) mod r for all 0 ≤ k ≤ m-1.

The first condition provides the exchange connections, the second condition provides the shuffle connections, and the third condition is the augmentation of the (r,m) shuffle-exchange network. Figure 6.3-1 shows a

(2,3) FG network.

Figure 6.3-1. A (2,3) FG network.
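The three conditions translate directly into a successor function. The sketch below (Python, illustrative only) follows the cleaned-up statement of the conditions given above, reading condition 2 as a left cyclic shift and condition 3 as a link between the nodes whose digits are all equal; this reading should be checked against [Seng87].

from itertools import product

def successors(i, r):
    """Out-neighbors of node i (a radix-r address tuple) in the (r, m) FG network."""
    succ = set()
    # condition 1: exchange links; agree in digits 0..m-2, last digit free
    for d in range(r):
        succ.add(i[:-1] + (d,))
    # condition 2: shuffle link; j is the left cyclic shift of i
    succ.add(i[1:] + i[:1])
    # condition 3: augmentation between the nodes whose digits are all equal
    if all(d == i[0] for d in i):
        succ.add(tuple((d + 1) % r for d in i))
    succ.discard(i)   # drop self-loops
    return succ

r, m = 2, 3
for n in product(range(r), repeat=m):
    print(n, "->", sorted(successors(n, r)))

For the (2,3) network of Figure 6.3-1 this yields out-degree 2 at every node, consistent with the degree-r property cited above.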

The FG network graph, G, has the following desirable characteristics:

1) the graph has a diameter of 2m - 1,

2) G has connectivity r, the maximum possible given nodes of degree r,

which gives tolerance to r - 1 faults, and

3) if up to r - 2 nodes are removed from G and all of the edges to and

from these nodes deleted, the resulting graph has a maximum increase

in diameter of only 2.

Property 1) can be proven by presenting the normal routing scheme in

the network. Suppose a message is to be sent from source (s_0, s_1, ..., s_{m-1}) to

destination (d_0, d_1, ..., d_{m-1}). Then we may define the following path, of length no more than 2m - 1:

(s_0, s_1, ..., s_{m-2}, s_{m-1})

(s_0, s_1, ..., s_{m-2}, d_0)

(s_1, s_2, ..., s_{m-2}, d_0, s_0)

(s_1, s_2, ..., s_{m-2}, d_0, d_1)

    ...

(d_0, d_1, ..., d_{m-2}, s_{m-2})

(d_0, d_1, ..., d_{m-2}, d_{m-1})

This is called the normal path, denoted by np(s,d). If there are no faults,

np(s,d) is a path of length at most 2m - 1. Routing is simple if the message

format contains the destination address, as each node may route the message packet by examining the destination and comparing it to its own address. Property 2 is discussed in both [Prad85] and [Seng87], but will not be treated here, as when there are r faults the route may be up to 6m - 3 in length. We will consider only the case where there are r - 2 faults or fewer, in which case property 3 keeps the maximum routing distance to 2m + 1. For routing with an increase of no more than two in diameter, consider

again the case of a message from source (s_0, s_1, ..., s_{m-1}) to destination (d_0, d_1,

..., d_{m-1}). The message is sent by multiple paths, called multicasting. For the first step, send the message packet to all r - 1 of the locations which differ from the source in the least significant digit. The message is now in r nodes,

labeled (s_0, s_1, ..., s_{m-2}, X). Then, route the message over the r normal paths

from these nodes to the r nodes with the addresses (d_0, d_1, ..., d_{m-2}, X). Finally, route the messages from the r - 1 nodes which differ from the destination by

one digit to the destination. Unfortunately, the r paths are not guaranteed to be node-disjoint, so failure of a single node may cause the loss of more than one

path from source to destination. Fortunately, Sengupta, Sen, and

Bandyopadhyay have shown that as long as there have not been more than r

- 2 failures, at least one of these paths will be available [Seng87].
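The normal path and the multicast routing just described may be sketched as follows. The reading of "least significant digit" as the last digit of the address (the digit varied by the exchange connections) is an assumption; everything else follows the construction above, and the example addresses are illustrative only.

def normal_path(s, d):
    """np(s, d): the fault-free route of length at most 2m - 1, alternating an
    exchange step (write the next destination digit into the last position)
    with a shuffle step (left cyclic shift)."""
    m = len(s)
    path, cur = [s], s
    for k in range(m):
        nxt = cur[:-1] + (d[k],)          # exchange: write d_k into last digit
        if nxt != cur:
            path.append(nxt); cur = nxt
        if k < m - 1:
            nxt = cur[1:] + cur[:1]       # shuffle: left cyclic shift
            if nxt != cur:
                path.append(nxt); cur = nxt
    return path

def multicast_paths(s, d, r):
    """The r routes used by the fault-tolerant multicast: fan out over the
    exchange links of the source, follow a normal path to the node matching
    the destination in all but the last digit, then take the final exchange link."""
    paths = []
    for x in range(r):
        first = s[:-1] + (x,)             # (s_0 ... s_{m-2}, X)
        mid_target = d[:-1] + (x,)        # (d_0 ... d_{m-2}, X)
        path = [s] if first != s else []
        path += normal_path(first, mid_target)
        if mid_target != d:
            path.append(d)
        paths.append(path)
    return paths

for p in multicast_paths((0, 0, 1), (1, 1, 0), r=2):
    print(" -> ".join(str(n) for n in p))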

For most graphs, there is a concern as to how to map the processes to the processors so as to reduce the average distance between communicating

processes. This problem is ignored for the URMC because the multicasting

technique greatly reduces the gains which could be realized by placing communicating processes in adjacent processors: one of the paths will be made very short, but the others will still be longer. However, there may be some mappings which are better than others, and this should be investigated in future work.

6.4 Reliability Analysis.

Every message that is sent by a fault-free processor should be received by any fault-free processor which is supposed to receive it. As described above, a route exists between any pair of nodes with length at most 2m + 1 if there are no more than r - 2 failures in the interconnection network. To avoid having to diagnose failures, it is proposed that the fault tolerance be utilized through multicasting. It is suggested that the messages be sent with error detecting and correcting codes sufficient to ensure that an incorrectly transmitted message will be detected with a probability high enough to meet the reliability goal of the communication network. This allows the network to correctly pass messages between a pair of processors as long as at least one fault-free route remains. Therefore, the reliability of an FG network may be found using a standard formula for an m-out-of-n system, where n is r^m and m is r^m - (r - 2). The reliability of an m-out-of-n system is given by [Siew82]:

R = Σ_{i=0}^{N-M} C(N,i) · Rm^(N-i) · (1 - Rm)^i                (6.4-1)


where Rm is the reliability of a single message-passing node. The higher the desired reliability for the communication network, the larger r should be. In order to avoid the complication of passing messages through the main processing element at each node and to allow the reliability of the message-passing to be increased, message-passing should be handled by a separate processor at each node, as depicted in Figure 6.4-1.

Figure 6.4-1. Node architecture.
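Equation (6.4-1) is straightforward to evaluate directly; the sketch below does so, with purely illustrative numbers for m, n, and the single-node reliability Rm.

from math import comb

def m_out_of_n_reliability(m, n, rm):
    """Probability that at least m of n identical elements survive (eq. 6.4-1)."""
    return sum(comb(n, i) * rm ** (n - i) * (1.0 - rm) ** i
               for i in range(n - m + 1))

# illustrative values only: 25 message-passing nodes, 3 failures tolerated
print(1.0 - m_out_of_n_reliability(22, 25, 0.999))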


6.5 Performance Analysis.

Since the repsets are synchronized through message passing, the message delay must be small. A discrete event simulation of the network should be prepared to ensure that the bandwidth on the interconnections is high enough and that the routing is performed quickly enough. To avoid having to run a computer simulation, a queueing model can be constructed for the system. This is subject to several assumptions which may not be very accurate, but they are necessary to allow an analytical solution. A basic queueing process is shown in Figure 6.5-1. The queueing process is characterized by the input source size, the statistical pattern which describes how the customers arrive, the queue size and discipline, and the service mechanism. The simplest model is attained if the input source is infinite, customers arrive as a Poisson process, the queue is infinite with a first come first served discipline, and the service mechanism serves the customer with an exponential distribution. These assumptions are particularly necessary in the case of analyzing queueing networks, so we will make assumptions to fit this.


Figure 6.5-1. The basic queueing process.

To allow analysis of the network as a network of queues, the following assumptions are made:

1) Each node generates message packets as a Poisson process with rate λ.

2) Each node sends messages to each of the other nodes with equal likelihood.

3) Each message is sent by r different routes as described in the fault-tolerant routing strategy.

4) At the source node, the message packet is placed in the queue for every outgoing link, thus multicasting the message packets.

5) Every message takes 2m + 1 steps to arrive at its destination (if one arrives before then, it will decrease the load on the network, so this is a pessimistic assumption).

6) There is a first-come-first-served queue for every output link from a node.

7) The lengths of the message packets follow an exponential distribution, and the time for a node to forward a message packet is directly proportional to its length. This gives exponential service times.

8) Each node input places the message packet into the correct queue for the output the packet is to go on next. Conflicts between inputs attempting to simultaneously place messages in the output queue are ignored.

With these assumptions, each node of the network may be modeled as a queueing system as shown in Figure 6.5-2.


Figure 6.5-2. Queueing system formed by one node of network.

A useful property of Poisson processes is that if merged or split they

yield Poisson processes. In the network each node has a source which

originates messages at a rate λ. These messages are placed in every

one of the r outputs so that they can be routed as described in the fault-tolerant routing strategy. The messages travel for 2m + 1 hops each. At each node, the messages bound for that node are split off, then all messages are

routed through a crossbar to their output, and finally each of the r output

streams is merged with the messages originating in the node. The messages

originate at a rate λ. Each message is sent out r different outputs, so the rate per output is λ. Since each message travels at most 2m + 1 hops, the message rate on each link is at most (2m + 1) × λ.


We can define the utilization of a queueing system, ρ, to be the arrival rate λ divided by the service rate μ. If the utilization is less than one, then the queueing system will eventually arrive at a steady-state condition. In this case the distribution of waiting times for messages in the queueing system (waiting time in queue plus service time) is given by [Hill74]:

P{W > t} = e^{-μ(1 - ρ)t},   for t ≥ 0                (6.5-1)

The nodes of the network taken together form a network of queues. Networks of queues are difficult to analyze except for special cases; the special case we use here takes advantage of the fact that the assumptions above make the network a feedforward queueing network. In that case, the equivalence property simplifies the analysis [Hill74]:

EQUIVALENCE PROPERTY: Assume that a service facility has a Poisson input with parameter λ and an exponential service-time distribution with parameter μ, where μ > λ. Then the steady-state output of this service facility is also a Poisson process with parameter λ.

Therefore each queue produces customers as a Poisson process with the same average rate at which they arrive; each node may be analyzed

separately as a stand-alone queueing system. It is now possible to find the cumulative distribution for the entire system. First the probability density function for a queue wait must be found, by taking the derivative of the cumulative distribution function above:


P{W = t} = d/dt P{W ≤ t}

         = d/dt (1 - e^{-μ(1 - ρ)t})

         = μ(1 - ρ) e^{-μ(1 - ρ)t}                (6.5-2)

Then the system pdf may be found by convolving the pdf of each queue, and the system pdf integrated to find the system cdf. Multiplying the Laplace transforms of two functions is the same as convolution in the time domain, so first take the Laplace transform of the pdf for a node queue, raise it to the 2m + 1 power, then take the inverse Laplace transform. Referring to a table of Laplace transforms [Spie65] we find:

P{Wsys = t} = P{Wnode = t} * P{Wnode = t} * ... * P{Wnode = t}

            = L^{-1}[ ( μ(1 - ρ) / (s + μ(1 - ρ)) )^(2m + 1) ]

            = [μ(1 - ρ)]^(2m + 1) · t^(2m) · e^{-μ(1 - ρ)t} / (2m)!                (6.5-3)

The system wait cdf may then be found by integrating the system wait pdf:

P{Wsys < t} = 1 - e^{-μ(1 - ρ)t} Σ_{k=0}^{2m} [μ(1 - ρ)t]^k / k!                (6.5-4)

The required service rate for the messages can be found as follows (a minimal sketch of this procedure follows the list):

1) Pick an acceptable delay for delivery of a message, t.

2) Determine how small the probability of exceeding the delay t should be

over the flight time.

3) Determine the rate at which messages are generated at each node,

λ, and the maximum number of hops a message must travel, 2m + 1.

4) Solve for μ, the allowable service rate at each node.

5) Find the average length of a message. Find the bit rate of the

communication links as the rate at which the node must send an

average-length message to meet the required service rate, μ.
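A minimal sketch of this procedure is given below, using the system-wait distribution of equations (6.5-1) through (6.5-4). The per-link arrival rate (2m + 1)λ and the utilization ρ = (2m + 1)λ/μ follow the assumptions listed earlier; all numeric inputs are illustrative placeholders, and step 4 is solved by simple bisection.

from math import exp, factorial

def p_wait_exceeds(t, mu, lam, m):
    """P{Wsys > t} for a message crossing 2m + 1 queues (complement of eq. 6.5-4)."""
    hops = 2 * m + 1
    rho = hops * lam / mu                 # utilization of each output queue
    assert rho < 1.0, "queues are unstable at this service rate"
    a = mu * (1.0 - rho)
    return exp(-a * t) * sum((a * t) ** k / factorial(k) for k in range(hops))

def required_service_rate(t_max, p_target, lam, m):
    """Smallest service rate mu (messages/s) meeting the delay target (step 4)."""
    lo, hi = (2 * m + 1) * lam * 1.0001, 1e9
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if p_wait_exceeds(t_max, mid, lam, m) > p_target:
            lo = mid
        else:
            hi = mid
    return hi

# illustrative: 200 messages/s per node, a (6,2) network, 1 ms delay budget
mu = required_service_rate(t_max=1e-3, p_target=1e-9, lam=200.0, m=2)
print("required service rate:", mu, "messages/s")
# step 5: with an average message length of L bits, the link bit rate is mu * L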

6.6 Summary.

The URMC architecture will consist of a multicomputer with the nodes

interconnected in a unidirectional link FG network. It will appear to be a

completely-connected network to the node hardware fault tolerance layer. This

will be accomplished by allowing the node hardware fault tolerance layer to

make a call to the communication layer containing a message to send and its

destination. Using a series of layers similar to the OSI network, the message

will be broken into packets, routed, and received. Each message will be signed by the sending processor with a digital signature.

The FG network allows considerable flexibility by allowing the number of connections per node to be traded off with the required fault tolerance and/or network diameter. Messages from one node to another will be sent by r different routes with an error correcting and detecting code, which allows up to r - 2 faults to occur while still maintaining system connectivity. Self-routing in the network is simple to implement. The number of connections per node will be determined from the total needed reliability, the message-passing node reliability, and a standard m-out-of-n formula.

The degree of synchronization between nodes determines the maximum message delivery time. Given this, a queueing analysis provides an approximate value for the speed of the message-passing links.

7. Example URMC System.

This chapter discusses the requirements for a flight control system and describes a possible URMC configuration which meets these requirements, which will be called URMC-1. This is intended only as an example of one possible way to combine the ideas of the previous chapters.

Section 7.1 outlines the requirements for a flight control system, and section 7.2 outlines the capabilities of current technology for throughput and reliability. Section 7.3 uses the schemes and reliability models of chapters 3, 4,

5, and 6 to derive a top-level conceptual design for the URMC-1 system which meets the requirements of section 7.1. In section 7.4 the resulting URMC-1 design is compared with the aircraft flight control computer architectures reviewed in chapter 2. The results of the chapter are summarized in section 7.5.

7.1 Requirements.

In the design of MAFT it was determined that a modern flight control computer for a commercial air transport should have iteration rates up to 200

Hz, execute instructions at a rate up to 5.5 million instructions per second (MIPS), perform I/O at a rate up to 1 million bits per second (BPS),


have a transport lag (input to output delay) as short as 5 milliseconds, and have a failure probability of less than 1 × 10^-9 in a ten-hour flight [Kiec88]. The type of instruction was not specified (8, 16, or 32 bit, RISC or CISC). It will be assumed that they referred to a 16-bit CISC instruction set, which should be

more than matched by a modern-day 32-bit RISC instruction set operating at

the same MIPS rate.

7.2 Current Capability.

As the problem of reliability is the most difficult problem to deal with in

designing the URMC-1, the system will be analyzed primarily for failure rate. To

determine this we must first find values for typical failure rates, recovery rates,

and coverage using current techniques and technology. The throughput assumed possible for processing and communications hardware will also be found.

7.2.1 Software Failure Parameters.

"Experience tends to show that a reasonable expectation of bugs in a large software developed with maximum care is in the order of 10^-5 per operating hour" [RTCA83]. We may accept this as the software failure rate, but must still determine how many of these failures are correlated failures.


A study by Knight and Leveson [Knig86] found a discouraging result for the correlated error rate. Twenty-seven programs were developed and subjected to one million test cases. The results from these test cases were then grouped into threes and passed through a voter. A total of 2925 three-version systems were formed by taking all possible combinations of three programs from the 27, and the performance of each of the three-version systems was determined for the one million test cases. It was found that the three-version system failed

19 times less often than the single-version system. Due to the low probability of encountering two independent faults on the same data set, it may be assumed that the failures of the three-version system were almost entirely due to correlated faults. This gives a correlated failure rate for three versions which is approximately 19 times lower than the failure rate of a single version. This suggests that for a three-version system, approximately 1 in 19 of the faults found in a given software version will be correlated with a fault in another software version.

Later studies also showed high correlated fault rates. For instance, in a six-language multi-version software experiment run jointly by Honeywell and UCLA [Aviz88], 82 faults were removed from the six versions before acceptance, with one identical pair, five faults were uncovered by testing, with one identical pair, and six more faults were discovered by code inspection, all unrelated and different. Thus out of 93 faults, there were 89 independent faults and two pairs of identical faults. With six versions, there are 15 possible pairings, giving an average of 14.8 independent faults per version and 0.13 correlated faults between each pair. In a three-version system, there would be

an average of 44.4 independent faults and 0.39 correlated faults. Since each correlated fault appears in two versions, the average number of faults per version that are correlated with another version would be 0.7, so for a given fault the chance of its being correlated with another version is 0.7/(14.8 + 0.7), or about 1 in 22.1. This is not much better than the Knight and Leveson value of 1 in 19. Another study is the Project on Diverse Software (PODS). Bishop and Pullen analyzed three versions of software, which were found to have 17, 16, and 13 faults. It was found that several of the faults which were not correlated appeared to be correlated due to fault masking effects in the computation of binary outputs. This effect could be eliminated by redesigning the program to remove the masking. However, three faults were found to be genuinely correlated, so out of the total of 46 faults there were 43 independent faults and one triple. Thus there were an average of 15.3 faults per version, of which one was correlated with other versions. The results of the above studies are not encouraging. Although Avizienis argued that the results of Knight and Leveson were overly pessimistic, his own results were not much better. It appears that correlated faults between versions make up about 1 in 20 of the total faults in a single version. How many of these are correlated between 2 versions, 3 versions, etc.? This is difficult to say. Eckhardt and Lee used their theory for predicting correlated faults between versions [Eckh85] to analyze the data from the study of Knight and Leveson [Eckh88]. They found a probability of system failure as shown in Figure 7.2.1-1, where the bottom curve is the prediction based upon the independent failures assumption, the middle curve is a theoretical average based upon selecting without replacement a subset of N versions from a total set of n components, and the upper curve is that suggested by their model. Since the Knight and Leveson study does not appear to be too pessimistic after comparing it with the Honeywell/UCLA and PODS results, the upper curve will be used for the URMC-1 design.


Figure 7.2.1-1. Comparisons of estimators (from [Eckh88]).
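The Honeywell/UCLA arithmetic above can be reproduced directly from the reported counts; the sketch below follows the same rounding as the text, and the raw counts (93 faults, of which 89 were independent and two were identical pairs across six versions) are taken directly from the quoted study.

versions = 6
independent_faults = 89
identical_pairs = 2

per_version_independent = round(independent_faults / versions, 1)                  # 14.8
per_pair_correlated = round(identical_pairs / (versions * (versions - 1) // 2), 2) # 0.13
per_version_correlated = round(2 * identical_pairs / versions, 1)                  # 0.7 (each correlated fault is in two versions)

# chance that a given fault in a version is correlated with another version
odds = (per_version_independent + per_version_correlated) / per_version_correlated
print(f"about 1 in {odds:.1f}")                                                    # ~1 in 22.1

# expected fault counts for a three-version system built from such versions
print(round(3 * per_version_independent, 1), round(3 * per_pair_correlated, 2))    # 44.4, 0.39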

Another question which is of importance when using a plurality voter

(rather than a majority voter) is what proportion of the correlated faults result in

identical incorrect results and different incorrect results. In the work by Knight and Leveson it was found that 65% of the correlated faults were detectable and 35% were not. While this rate was found for correlated errors in two or three versions only, it will be assumed that for any fault correlated between m versions, 65% will arrive at distinct results and 35% at identical results. This indicates that having a voter such as the one in the FTP/AP will reduce the failure rate to about 1/3 of its previous value. While somewhat helpful, it is not the orders of magnitude improvement necessary.

7.2.2 Hardware Failure Parameters.

In the design of SIFT a study was made of the expected failure rates, which resulted in a rate of 1 × 10^-4 for the main processors and 1 × 10^-5 for the I/O processors and buses [Wens78]. These rates will be assumed correct for current systems as well. For systems with special hardware processors (such as the interstage communicators of FTP/AP or the operations controllers of

MAFT) it will be assumed that the special hardware is of the same complexity and reliability as an I/O processor.

A dual fault model will be used [Meye87] as in section 5.6, in which some faults are benign, that is, they result in the processor stopping completely, and all other faults are malicious, that is, capable of arbitrary behavior. If a processor detects a fault in itself through some self-diagnosis mechanism it can shut itself down, so all self-diagnosable faults may be considered as benign. The self-test coverage is thus very important. Values on the order of 90% coverage are routinely claimed by flight control computer

developers (e.g., [Yous83]). The use of self-checking pairs would result in a coverage close to 100%, but 90% will be assumed.

7.2.3 Hardware Throughput.

The nodes of the URMC-1 may be constructed from any desired processor; it will be assumed in this example that they are to be made from a processor with throughput similar to that of the Intel 80960, which can maintain 10 MIPS with bursts up to 20 MIPS. This will show that the assumed instruction execution rate is within current technological capabilities.

The communication throughput rates are assumed to be up to several hundred Mbits per second. This may be achieved with a fiber-optic link or by several slower links operating in parallel.

7.3 Example URMC Design.

In this section an example URMC design, called URMC-1, is outlined and analyzed using the reliability diagrams of chapters 4, 5, and 6. The design is to meet the requirements of section 7.1 assuming the use of 10 MIPS

processors as the computing elements.

The combined failure rate for the entire URMC-1 system must be better than 1 × 10^-10 per hour, including software, hardware, and communications.

The design of reliable software is the most difficult part of the URMC-1, so the

software failure rate will be assumed to take the lion's share of the total allowable failure rate. Thus an attempt will be made to make the software failure rate under 5 × 10^-11 per hour and the node and communication hardware failure rates together less than 5 × 10^-11 per hour.

7.3.1 Software Fault Tolerance.

How many software versions are necessary to achieve a failure rate of less than 8 × 10^-11 per hour? This will be estimated from Figure 7.2.1-1. It was assumed that the test cases presented to the software versions in the Knight and Leveson study corresponded to approximately 20 years of use in the field.

Thus the probability of system failure given in the figure is that over 20 years, or 175,200 hours. We thus want to find the number of versions necessary to make the probability of system failure such that:

P(sys_fail) ≤ 1 - e^(-λt)                (7.3.1-1)

If λ = 8 × 10^-11 and t = 175,200 hours, we want P(sys_fail) ≤ 1.4 × 10^-5. Using the top curve from Figure 7.2.1-1, this will take seven versions of software.
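The target probability quoted here follows directly from equation (7.3.1-1); as a quick check (the 20-year exposure assumption is the one stated above):

from math import exp

lam = 8e-11          # per-hour software failure budget used in the text
t = 175_200          # 20 years of operation, in hours
print(1 - exp(-lam * t))   # ~1.4e-5, the value to be read off Figure 7.2.1-1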

Is it absolutely impractical to develop seven versions of the same software? Possibly not. The extra effort required to develop more software

versions is not as great as might be thought, as the development of the

specification and tests is the largest part of the software development project,

not the coding. In a joint Honeywell/UCLA experiment [Aviz88] six versions of an autoland system were developed in parallel. It was found that having multiple independent programming teams forced the requirements to be more clearly defined, and that back-to-back testing of the versions could be used to find many problems without the need to manually calculate the expected test results. Thus the extra effort of coding multiple versions is partially offset by the ease of testing. Hence it is proposed that the URMC-1 be designed to provide for the execution of seven different software versions. It may be that the cost of software maintenance and documentation of the versions will cause this approach to be impractical. If this is the case, then a way must be found to guarantee that there are few correlated errors between versions.

The software reliability model of Figure 4.5.2-2 is for three versions, but may be extended by allowing for six ranks of the five possible states of the system rather than two, and by allowing correlated software failures to skip over levels. However, analysis of this model requires that the transition rates be known, which requires a greater knowledge of the characteristics of the software, so for now it will just be assumed that seven versions are sufficient.

It is unlikely that all of the software in the flight control system will have to be reliable enough to require seven versions. Most likely only the inner control loops, which are comparatively simple, will need this reliability level. In addition, small modules of the software may be proven correct, thus reducing the amount of software which must be replicated for fault tolerance. Finally, some of the software will have clearly defined results and not have hard

deadlines, and thus will be able to be implemented as a colloquy. In summary,

it will be assumed that only 10% of the software needs to have seven versions,

30% needs to have three versions, and the other 60% is either noncritical, colloquys, or proven correct. Finally, the overhead of managing the redundant software will be assumed to be 50%. SIFT redundancy management consumed up to 80% of the throughput [Kiec88], but since much of the software is not redundant, it should be reasonable to assume a lower figure. Total required throughput is thus:

5.5 MIPS × 2 × [(0.1 × 7) + (0.3 × 3) + 0.6] = 24.2 MIPS                (7.3.1-2)

7.3.2 Hardware Fault Tolerance.

One strength of the URMC concept is that software fault tolerance and hardware fault tolerance are treated separately, so there may be seven versions of software at some places, three at others, and one at others, without restricting the size or number of the hardware repsets. The number of hardware repsets will be determined from the throughput required, and the size of the repsets from the hardware reliability required.

The goal for the system is 5.5 usable MIPS. A software configuration such as that described above increases the required MIPS rate to 24.2 MIPS. In studies of SIFT, it was found that as much as 80% of the system throughput was occupied by overhead functions [Kiec88]. However, this overhead may be placed in a special-purpose message-passing processor and a general-purpose or special-purpose redundancy management processor (the former approach is used in FTP/AP, the latter in MAFT). In a study of the multiprocessing efficiency of MAFT, it was found that the application processor


utilization varied from 70 - 90%, depending upon the number of processors left in the configuration. As the scheduling of the tasks in a repset will not change with failures, it seems reasonable to assume that the utilization of the application processors can be kept over 80%; thus the 24 MIPS can be achieved with just three repsets if 10 MIPS processors (such as the Intel

80960) are used. A block diagram for a single node of the URMC-1 is shown in

Figure 7.3.2-1. The complexity of the redundancy management processor and communicator is assumed to be like that of an I/O processor in SIFT, for a failure rate of 1 × 10^-5 per hour. The bus fails at 1 × 10^-5 per hour and the application processor fails at 1 × 10^-4 per hour, so the total node failure rate is 1.3 × 10^-4 per hour.


Figure 7.3.2-1. A single processing node in the URMC-1.

The next question is how many I/O repsets to have. Single I/O processors capable of handling the full one million bits per second exist, but the overhead of redundancy management will be assumed to be 50%, so two I/O repsets will be needed. Figure 7.3.2-2 shows a possible configuration for a single node of an I/O repset. The I/O processor, communicators, redundancy management processor, and backplane bus are all assumed to fail at a rate of

1 × 10^-5 per hour, for a total node failure rate of 5 × 10^-5 per hour.


Figure 7.3.2-2. A single input/output node in the URMC-1.

The complete reliability diagram for the processing and I/O node hardware is shown in Figure 5.6-4, where i = 2 and j = 3. The SHARPE

[Sahn86] package was used to calculate the reliability of the processing and

I/O repsets (see Appendix A) with the following assumptions: each repset contained 2, 3, or 4 nodes, all rates were exponentially distributed, the total hardware failure rate was 1.3 × 10^-4 for a processing node and 5 × 10^-5 for an

I/O node, 90% of all failures are benign, diagnosis of other failures takes an average of 10 seconds, and replacement of a diagnosed failed processor takes an average of 10 seconds. With the above assumptions, the failure probabilities for I/O repsets and processing repsets over a ten-hour flight are as shown in Tables 7.3.2-1 and 7.3.2-2. From these tables it may be seen that it is necessary to have four nodes in each repset. The reliability models are thus those of Figures 5.6-1 and 5.6-2. The next question is how many spare processors to have. If we have k spares, then we need to solve for an m-out-of-n system, where m is one less than the number of processors needed for the processing repsets and n is m + k. There are three processing repsets and two I/O repsets. A monitor repset is made from four of the processors in the spares pool. The monitor repset may give away its own members when the spares are exhausted, so the monitor repset members count as spares as well. Thus there is a need for 12 processors to make all repsets complete. Allowing one repset to have lost a processor makes m equal to 11. Using the formula for an m-out-of-n system it is found that n should be at least 17, so there are 6 spares (see Appendix B).

Counting the processing repset members, I/O repset members, and spares, we must have a system consisting of at least (3 × 4) + (2 × 4) + 6 =

25 nodes. The total node hardware failure probability is a little over 3 × 10^-10 over a ten-hour flight.

Table 7.3.2-1. Failure probability for I/O repsets.

time (hr)    2-node        3-node        4-node

  0.0      0.0000 e+00   0.0000 e+00   0.0000 e+00
  1.0      1.0002 e-05   7.5379 e-10   0.0000 e+00
  2.0      2.0007 e-05   3.0079 e-09   2.0660 e-12
  3.0      3.0016 e-05   6.7626 e-09   3.8505 e-12
  4.0      4.0028 e-05   1.2018 e-08   6.5364 e-12
  5.0      5.0044 e-05   1.8775 e-08   1.0422 e-11
  6.0      6.0063 e-05   2.7033 e-08   1.5809 e-11
  7.0      7.0086 e-05   3.6793 e-08   2.2996 e-11
  8.0      8.0112 e-05   4.8056 e-08   3.2283 e-11
  9.0      9.0142 e-05   6.0820 e-08   4.3971 e-11
 10.0      1.0017 e-04   7.5087 e-08   5.8360 e-11

Table 7.3.2-2. Failure probability for processing repsets.

time (hr)    2-node        3-node        4-node

  0.0      0.0000 e+00   0.0000 e+00   0.0000 e+00
  1.0      2.6000 e-05   5.6169 e-11   5.6180 e-12
  2.0      5.1999 e-05   1.1250 e-10   1.1251 e-11
  3.0      7.7997 e-05   1.6884 e-10   1.6885 e-11
  4.0      1.0399 e-04   2.2517 e-10   2.2518 e-11
  5.0      1.2999 e-04   2.8150 e-10   2.8151 e-11
  6.0      1.5599 e-04   3.3784 e-10   3.3785 e-11
  7.0      1.8198 e-04   3.9417 e-10   3.9418 e-11
  8.0      2.0798 e-04   4.5050 e-10   4.5052 e-11
  9.0      2.3397 e-04   5.0684 e-10   5.0685 e-11
 10.0      2.5997 e-04   5.6317 e-10   5.6319 e-11

7.3.3 Communication Fault Tolerance.

In this section, the number of communication links per node needed to keep the chance of failure below 2 × 10^-10 in ten hours is found. It is assumed that the internode communicator can continue to operate even if the processing node of which it is a member has failed, so the failure rate per communication node is 1 × 10^-5 per hour. Since any r - 2 failures may be tolerated, we wish to find the reliability of an m-out-of-n system again. First the obvious answer of five connections per node is tried; unfortunately the failure probability of a 22-out-of-25 system is about 2.3 × 10^-9 over a ten-hour flight, which is too high. Thus 6 connections are necessary, so the system must have

6^2 = 36 nodes. This has the advantage of providing considerably more throughput than the 25-node system, or else of allowing slower processors in each node. The failure probability of the 32-out-of-36 communication system is about 5.9 × 10^-12 over a ten-hour flight, which is more than low enough. Thus the URMC-1 system is connected as an augmented (6,2) shuffle-exchange network.

7.3.4 Communication Performance.

There is some question as to whether the system can maintain the data rate necessary between nodes, since the bandwidth of a system which is not completely-connected is of course lower than one that is. Using the method outlined in chapter 6, we should determine the message rate, message length,

and the maximum allowable message delay. Unfortunately, we have the same problem here as with the software fault tolerance analysis: we need to know more about the application to estimate these parameters. In the absence of a better method, we will use the following: assume that the 1 Mbit/second bandwidth of MAFT was sufficient for the processors which it used. Current processors are perhaps an order of magnitude faster than those in MAFT, so a bandwidth an order of magnitude greater than that of MAFT should suffice. This requires a 10 Mbit/second bandwidth between each node and the nodes with which it communicates. Assume that the 10

Mbit/second rate is necessary for output from the repset and communication with the three other members of the repset. This makes the requirement 40 Mbit/second for data output from a single node. Then, each communication link serves as part of a chain of length 2m + 1 (five in this system), so we must multiply the data rate by that as well, for a total link data rate requirement of 200 Mbits per second. This is very high, but can be achieved with the use of a fiber-optic link.

7.4 Comparison with Current Systems.

One question which arises is whether any of the four current architectures covered in chapter 2 is capable of meeting the requirements of section 7.1, and if so, why use a URMC-1 with the need for 36 nodes and 200 Mbit per second data links?

If the correlated failure rate found in section 7.2.1 were valid, then it would require more than four versions of software to meet the reliability requirement. Assuming a majority voter, and a system which recovers failed versions quickly (compared to the rate at which they fail), we need enough versions that a correlated failure of the majority of the versions occurs at a rate less than 1 × 10^-10 failures per hour, which may take seven versions. Of the currently existing systems, MAFT has the most possible nodes, with eight. It would be possible to develop a different version on each of seven nodes and to vote the results. With a 10 MIPS processor at each node it would be possible to obtain the full required throughput of 5.5 MIPS from a single processor despite a utilization as low as 55%. However, each version would execute in a SISD fashion, which does not allow for multiprocessing. The URMC-1 allows multiprocessing, which may be desirable from the standpoint of allowing greater variety in the algorithms (as different parallel communication structures add a new dimension of diversity). Also, certain sections of the software may be replicated seven times, other sections three times, and others only once. The separation of the URMC software fault tolerance and hardware fault tolerance allows mixing the number of versions in different sections of the code, depending upon the criticality level of the software.

Another possibility would be for some sections to use n-version programming and others to use the recovery block. The hardware would continue to use n-modular redundancy regardless.

Also, recall that 25 processors would be enough for the throughput desired, but 36 processors were used in order to have a power of six. Thus,

the URMC system has almost half again the throughput necessary, and so has the capability to run more complex control laws and to add intelligence. It may be that due to its size, the URMC-1 is not practical for flight control, but will turn out to be useful for ground-based applications where portability is not so important. The URMC concepts may then be used to

extend the number of processors to hundreds or even thousands, with n-version programs, colloquys, and simplex software all running concurrently.

7.5 Summary.

An example URMC system was developed to meet the needs for modern flight control systems. The resulting system may be too large to

economically fly in an aircraft, with 36 processing nodes, each linked to five other nodes with 160 Mbit per second links. However, the example does show the use of the tools developed earlier, and raises several important points in the design of fault-tolerant systems. It is a brute force approach, but may yield insight into how to improve more economical systems. The URMC concept may also be useful in ground-based applications, with the possibility for

thousands of processors and truly parallel processing.

8. Conclusion.

8.1 Summary of the URMC.

A very high level conceptual design for an ultrareliable multicomputer for an aircraft flight control application was developed. This system builds upon past work in the field, principally borrowing concepts from MAFT, with some input from the FTP/AP, a Sperry Flight Systems architecture, and the

Airbus A320 flight controller.

The system is designed as a hierarchy of virtual machines for the advantages of conceptual simplicity and the ability to change the implementation of lower levels without unduly impacting higher levels. The concept of a separate recovery layer for fault tolerance is expanded, with three separate groups of layers being inserted in the hierarchy: software fault tolerance, node hardware fault tolerance, and communication hardware fault tolerance. Each fault-tolerant level presents a perfect machine for use by the layer above, despite being constructed from imperfect components.

As it is necessary to program the multicomputer with concurrent software, the currently available software fault tolerance constructs were reviewed for their applicability in the URMC software fault tolerance layer(s). It was determined that concurrent n-version programming will probably be best

for most of the fault-tolerant software in the URMC. This construct provides the ability to detect faults despite a lack of foreknowledge of the characteristics of a correct output, and allows real-time deadlines to be met somewhat more easily than constructs built upon backward error recovery. Different sections of the software may have different levels of redundancy and use different fault-tolerant constructs if desired, as the mapping of software processes is not restricted by hardware fault tolerance considerations. The node hardware achieves fault tolerance by an extension of the DFT concept of Chen and Chen [Chen85]. As in the DFT concept, the nodes of the multicomputer are divided into groups called repsets. The processors in a repset are responsible for maintaining the integrity of their own repset despite the occurrence of faults. The repset processors execute the same tasks in synchronism, then share and vote the results of those tasks. The voted values are sent to interfacing repsets, and the results of the voting are used to diagnose a failed processor within one's own repset. When a failed processor is diagnosed, the good processors in the repset replace the failed processor with a processor from a pool of spares. The method for voting and reconfiguration described by Chen and

Chen works only if all failures are consistent; Byzantine failures, which can exhibit malicious behavior, are not accounted for. In the URMC the DFT concept is extended in three ways:

1) data is shared with an interactive consistency algorithm to allow

masking of Byzantine faults in a repset,

2) a Byzantine fault diagnosis algorithm is run in an attempt to diagnose

Byzantine faults, and 3) access to spare processors is controlled to protect a repset against

arbitrary behavior by a processor selected from the pool of spares. The communication layer is assumed to be a perfect, completely-connected network by the layers above it. It achieves this appearance by being connected in a fault-tolerant graph (FG) network, with message-passing

processors routing packets by several different routes to ensure that one will get through from the source to the destination. The main processors need only submit their message to the message-passing system and they may be assured that it will be sent. The resulting URMC design has a lot of overhead for the fault tolerance and requires state-of-the-art software tools, processing hardware, and communication hardware. However, it provides more flexibility, reliability, and

expansibility than previous flight control architectures. It is flexible enough to

adapt to much higher throughput requirements given lower reliability

requirements, so URMC systems should be applicable in a wide range of applications requiring higher than normal throughput and reliability.

8.2 Contributions.

Before embarking upon the design, the currently existing flight control

computer systems were surveyed. After identifying the needs for the next

generation of flight control computers, a very high level conceptual design for an ultrareliable multicomputer was presented. In the process of defining the computer, several new ideas had to be developed: the concept of the recovery metaprogram was expanded to include hardware (chapter 3), possible methods for concurrent n-version programming were discussed (chapter 4), and the DFT concept of Chen and Chen was extended to allow for arbitrary failures (chapter 5). In addition, some old ideas were compared for their applicability to real-time control systems: recovery blocks versus n-version programming (chapter 4), t-fault diagnosability versus n-modular redundancy (chapter 5), and several different interconnection networks (chapter 6). In order to allow evaluation of the suggested concepts, recently developed reliability models for fault-tolerant software (chapter 4) and somewhat older hardware reliability modeling techniques (chapters 5 and 6) were adapted to the structure outlined. For evaluation of the interconnection network performance, a simplified queueing model was developed (chapter 6). An example system was developed in chapter 7 to show the use of the fault-tolerant schemes and evaluation techniques from the previous chapters. In the process of estimating parameters for the models it was necessary to examine and compare the reported failure characteristics from several empirical studies.


8.3 Suggestions for Further Work.

Several questions arose in the process of developing the URMC

concepts which showed a need for investigation in various fields, and much

more work remains to be performed upon the URMC concept itself.

8.3.1 Software.

The questions to be resolved at the software layer are the most basic of the questions found, reflecting the lack of maturity in this new field of investigation.

1) There are many questions having to do with software faults and their manifestation. How many may be expected? How many will be correlated? If the model of Eckhardt and Lee is correct, what are the parameters to be expected?

2) How may software dissimilarity be increased? How may it be measured? Does parallel programming with dissimilar structures increase the dissimilarity?

3) Perhaps software metrics could be used to provide a multidimensional measure giving the location of the software in a complexity space [Muns89]. Would different locations in that space imply dissimilarity?

4) How do software faults manifest themselves? Recent work by Bishop and Pullen [Bish88] has suggested that software failure models should assume that inputs follow a random walk rather than being randomly selected from an input probability distribution. How may the models of chapter 4 be modified to take the resulting changes in the failure rates into account?

5) In chapter 4 it was suggested that concurrent software could be developed by having the applications designer specify the criticality of the modules and then having the fault-tolerance designer request only that the extra applications be developed (a similar proposal was made in [Sarm88]). This could be supported by using a language which has separate specifications and bodies (such as Ada), modified to allow n bodies for a single specification; the shape such a structure might take is sketched after this list. Much work must be done in this field to determine if this suggestion is practical. A translator could be developed which would convert the software from a high-level specification and the n versions of the bodies into a complete fault-tolerant software system.
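The following sketch suggests, in C rather than Ada for consistency with Appendix B, the shape of the software such a translator might emit: one specification (a single calling interface), n independently written bodies, and a generated driver that runs every body and votes. The control-law expressions, the names, and the choice of a median voter are purely illustrative assumptions.

#include <stdio.h>

#define N_VERSIONS 3   /* number of dissimilar bodies for the one specification */

/* The single "specification": every version body must satisfy this interface. */
typedef double (*roll_cmd_fn)(double roll_error, double roll_rate);

/* Three independently written "bodies" (placeholder control laws, not real    */
/* flight software).                                                            */
static double roll_cmd_v1(double e, double r) { return 0.8 * e - 0.1 * r; }
static double roll_cmd_v2(double e, double r) { return (8.0 * e - r) / 10.0; }
static double roll_cmd_v3(double e, double r) { return 0.8 * (e - 0.125 * r); }

static const roll_cmd_fn version[N_VERSIONS] = { roll_cmd_v1, roll_cmd_v2, roll_cmd_v3 };

/* Inexact median voter: with an odd number of versions the median output      */
/* masks one erroneous version without requiring bit-exact agreement.          */
static double vote_median(double v[], int n)
{
    int i, j;
    for (i = 0; i < n - 1; i++)            /* simple selection sort             */
        for (j = i + 1; j < n; j++)
            if (v[j] < v[i]) { double t = v[i]; v[i] = v[j]; v[j] = t; }
    return v[n / 2];
}

/* The driver a translator would generate from the specification and bodies.   */
double roll_cmd(double roll_error, double roll_rate)
{
    double out[N_VERSIONS];
    int i;
    for (i = 0; i < N_VERSIONS; i++)
        out[i] = version[i](roll_error, roll_rate);
    return vote_median(out, N_VERSIONS);
}

int main(void)
{
    printf("voted roll command: %f\n", roll_cmd(2.0, 0.5));
    return 0;
}

In the proposed Ada-based approach, the same structure would correspond to one package specification with n alternative bodies, selected and combined by the translator rather than written by hand as above.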

8.3.2 Processing Hardware.

Several questions exist at this level, though they are not as basic as the questions at the software level.

1) Can the scheme outlined be implemented with the level of overhead claimed in chapter 5?

2) What will be the average required times for diagnosis and reconfiguration?


3) Is the dual failure mode model enough, or could there be another model with three failure modes: malicious (arbitrary behavior), consistent (one face presented to the outside world), and benign (detectable by self-diagnosis)?

4) Should self-checking pairs be used as the processing devices at each node? While this would increase the cost of the node, it would raise the coverage factor so high that the repsets could have fewer members. (A minimal sketch of such a pair follows this list.)

8.3.3 Communication Hardware.

At the interconnection network layer, there are also questions to be resolved.

1) Would a more thorough review of the interconnection network literature find a network with better characteristics than those of the FG network?

2) Given an FG network using multicasting, is there a mapping of processes to the processors which will result in a shorter average of the longest paths between communicating processes? (One way of scoring candidate mappings is sketched after this list.)
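One possible objective function for question 2 is sketched below: given a table of FG-network hop counts and the sets of processes that multicast among themselves, compute the average over those sets of the longest inter-member path under a candidate mapping. The distances, group sizes, and mapping shown are placeholders, not data from the example system of chapter 7.

#include <stdio.h>

#define N_PROCS    4   /* assumed number of application processes            */
#define N_NODES    4   /* assumed number of processors                       */
#define N_GROUPS   2   /* assumed number of communicating groups             */
#define GROUP_SIZE 2   /* assumed number of processes per group              */

/* Hop counts between processors in the (hypothetical) FG network.            */
static const int dist[N_NODES][N_NODES] = {
    { 0, 1, 2, 1 },
    { 1, 0, 1, 2 },
    { 2, 1, 0, 1 },
    { 1, 2, 1, 0 },
};

/* Each group lists the processes that multicast among themselves.            */
static const int group[N_GROUPS][GROUP_SIZE] = { { 0, 1 }, { 2, 3 } };

/* Average, over all groups, of the longest path between any two members      */
/* under a given process-to-processor mapping.  A search for a good mapping   */
/* would seek to minimize this value.                                         */
static double avg_longest_path(const int mapping[N_PROCS])
{
    int g, i, j, total = 0;
    for (g = 0; g < N_GROUPS; g++) {
        int longest = 0;
        for (i = 0; i < GROUP_SIZE; i++)
            for (j = 0; j < GROUP_SIZE; j++) {
                int d = dist[mapping[group[g][i]]][mapping[group[g][j]]];
                if (d > longest)
                    longest = d;
            }
        total += longest;
    }
    return (double)total / N_GROUPS;
}

int main(void)
{
    int mapping[N_PROCS] = { 0, 2, 1, 3 };   /* process p runs on mapping[p]  */
    printf("average longest path: %.2f\n", avg_longest_path(mapping));
    return 0;
}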

8.3.4 Development of an URMC.

Answering the above questions should make the URMC system a productive research field for years to come. However, the ultimate determination of its viability will require more knowledge of the application area, followed by more accurate models of its behavior, and finally an actual microprocessor-based implementation of an URMC system.

Appendix A.

Two programs were written for SHARPE, one to find the probability of failure for an I/O repset, and another to find the probability of failure for a processing repset. The I/O repset program is shown in Figure A-1, and the resulting output is shown in Figure A-2. The processing repset program is shown in Figure A-3, and the resulting output in Figure A-4.

Figure A-1. SHARPE program to compute I/O repset failure probability.

* The four processor repset model is made from the Markov
* model of Figure 5.6-1.
* Declare transitions in Markov model.  States as follows:
*   6 - 4 OK
*   5 - 3 OK, 1 benign failure
*   4 - 3 OK, 1 malicious failure
*   3 - 2 OK, 2 benign failures
*   2 - 2 OK, 1 benign failure, 1 malicious failure
*   1 - 1 OK, 3 benign failures
*   0 - System Failure
* The transition variables are coded as follows:
*   c  - coverage
*   lh - hardware failure rate
*   b  - diagnosis rate
markov iorep4
6 5 4*c*lh
6 4 4*(1-c)*lh
5 3 3*c*lh
5 2 3*(1-c)*lh
4 5 b+(c*lh)
4 2 3*c*lh
4 0 3*(1-c)*lh
3 1 2*c*lh
3 0 2*(1-c)*lh
2 3 b+(c*lh)
2 0 2*lh
1 0 lh
end

* Declare initial probability distribution.  It is assumed
* that the system begins in state 6 (4 OK).
6 1
5 0
4 0
3 0
2 0
1 0
0 0
end

* The three processor model was found by eliminating states
* 6 and 4 and their output transitions from the four
* processor model.  It is assumed to begin in the three
* processor good state.
markov iorep3
5 3 3*c*lh
5 2 3*(1-c)*lh
3 1 2*c*lh
3 0 2*(1-c)*lh
2 3 b+(c*lh)
2 0 2*lh
1 0 lh
end
5 1
3 0
2 0
1 0
0 0
end

* The two processor model was found by eliminating states
* 2 and 5 and their output transitions from the three
* processor model.  It is assumed to begin in the two
* processor good state.
markov iorep2
3 1 2*c*lh
3 0 2*(1-c)*lh
1 0 lh
end
3 1
1 0
0 0
end

* Bind values of the system parameters before execution
bind
c 0.9
lh 0.05
b 360000.0
end

* The cumulative distribution function for each model is
* printed, then evaluated for each hour for 0 - 10 hours.
* Rates were multiplied by 1000 and time reduced by the
* same amount to avoid numeric problems in SHARPE.
cdf(iorep2,0)
eval(iorep2,0) 0.0 0.01 0.001
cdf(iorep3,0)
eval(iorep3,0) 0.0 0.01 0.001
cdf(iorep4,0)
eval(iorep4,0) 0.0 0.01 0.001
end

Figure A-2. Output from SHARPE program of Figure A-1.

information about system iorep2 node 0

probability of entering node: 1.0000e+00

conditional CDF for time of reaching this absorbing state
   1.0000e+00 t(0) exp( 0.0000e+00 t)
 + -1.8000e+00 t(0) exp(-5.0000e-02 t)
 +  8.0000e-01 t(0) exp(-1.0000e-01 t)
mean: 2.8000e+01   variance: 4.9600e+02

system iorep2 node 0
     t            F(t)
 0.0000e+00   0.0000e+00
 1.0000e-03   1.0002e-05
 2.0000e-03   2.0001e-05
 3.0000e-03   3.0016e-05
 4.0000e-03   4.0028e-05
 5.0000e-03   5.0044e-05
 6.0000e-03   6.0063e-05
 7.0000e-03   7.0086e-05
 8.0000e-03   8.0112e-05
 9.0000e-03   9.0142e-05
 1.0000e-02   1.0017e-04

information about system iorep3 node 0

probability of entering node: 1.0000e+00

conditional CDF for time of reaching this absorbing state
   1.0000e+00 t(0) exp( 0.0000e+00 t)
 + -2.7000e+00 t(0) exp(-5.0000e-02 t)
 +  2.4000e+00 t(0) exp(-1.0000e-01 t)
 + -7.0000e-01 t(0) exp(-1.5000e-01 t)
 + -1.1574e-15 t(0) exp(-3.6000e+05 t)
mean: 3.4667e+01   variance: 5.4044e+02

system iorep3 node 0
     t            F(t)
 0.0000e+00   0.0000e+00
 1.0000e-03   7.5379e-10
 2.0000e-03   3.0079e-09
 3.0000e-03   6.7626e-09
 4.0000e-03   1.2018e-08
 5.0000e-03   1.8775e-08
 6.0000e-03   2.7033e-08
 7.0000e-03   3.6793e-08
 8.0000e-03   4.8056e-08
 9.0000e-03   6.0820e-08
 1.0000e-02   7.5087e-08

information about system iorep4 node 0

probability of entering node: 1.0000e+00

conditional CDF for time of reaching this absorbing state
   1.0000e+00 t(0) exp( 0.0000e+00 t)
 + -3.6000e+00 t(0) exp(-5.0000e-02 t)
 +  4.8000e+00 t(0) exp(-1.0000e-01 t)
 + -2.8000e+00 t(0) exp(-1.5000e-01 t)
 +  6.0000e-01 t(0) exp(-2.0000e-01 t)
 +  2.3148e-15 t(0) exp(-3.6000e+05 t)
mean: 3.9667e+01   variance: 5.6544e+02

system iorep4 node 0
     t            F(t)
 0.0000e+00   0.0000e+00
 1.0000e-03   0.0000e+00
 2.0000e-03   2.0660e-12
 3.0000e-03   3.8505e-12
 4.0000e-03   6.5364e-12
 5.0000e-03   1.0422e-11
 6.0000e-03   1.5809e-11
 7.0000e-03   2.2996e-11
 8.0000e-03   3.2283e-11
 9.0000e-03   4.3971e-11
 1.0000e-02   5.8360e-11


Figure A-3. SHARPE program to compute processing repset failure probability.

* The four processor repset model is made from the Markov
* model of Figure 5.6-2.
* Declare transitions in Markov model.  States as follows:
*   6 - 4 OK
*   5 - 3 OK, 1 benign failure
*   4 - 3 OK, 1 malicious failure
*   3 - 2 OK, 2 benign failures
*   2 - 2 OK, 1 benign failure, 1 malicious failure
*   1 - 1 OK, 3 benign failures
*   0 - System Failure
* The transition variables are coded as follows:
*   c  - coverage
*   lh - hardware failure rate
*   b  - diagnosis rate
*   u  - repair rate
markov prep4
6 5 4*c*lh
6 4 4*(1-c)*lh
5 6 u
5 3 3*c*lh
5 2 3*(1-c)*lh
4 5 b+(c*lh)
4 2 3*c*lh
4 0 3*(1-c)*lh
3 5 u
3 1 2*c*lh
3 0 2*(1-c)*lh
2 4 u
2 3 b+(c*lh)
2 0 2*lh
1 3 u
1 0 lh
end

* Declare initial probability distribution.  It is assumed
* that the system begins in state 6 (4 OK).
6 1
5 0
4 0
3 0
2 0
1 0
0 0
end

* The three processor model was found by eliminating states
* 6 and 4 and their output transitions from the four
* processor model.  It is assumed to begin in the three
* processor good state.
markov prep3
5 3 3*c*lh
5 2 3*(1-c)*lh
3 5 u
3 1 2*c*lh
3 0 2*(1-c)*lh
2 3 b+(c*lh)
2 0 2*lh
1 3 u
1 0 lh
end
5 1
3 0
2 0
1 0
0 0
end

* The two processor model was found by eliminating states
* 2 and 5 and their output transitions from the three
* processor model.  It is assumed to begin in the two
* processor good state.
markov prep2
3 1 2*c*lh
3 0 2*(1-c)*lh
1 3 u
1 0 lh
end
3 1
1 0
0 0
end

* Bind values of the system parameters before execution
bind
c 0.9
lh 0.13
b 360000.0
u 360000.0
end

* The cumulative distribution function for each model is
* printed, then evaluated for each hour for 0 - 10 hours.
* Rates were multiplied by 1000 and time reduced by the
* same amount to avoid numeric problems in SHARPE.
cdf(prep2,0)
eval(prep2,0) 0.0 0.01 0.001
cdf(prep3,0)
eval(prep3,0) 0.0 0.01 0.001
cdf(prep4,0)
eval(prep4,0) 0.0 0.01 0.001
end

Figure A-4. Output from SHARPE program of Figure A-3.

information about system prep2 node 0

probability of entering node: 1.0000e+00

conditional CDF for time of reaching this absorbing state
   1.0000e+00 t(0) exp( 0.0000e+00 t)
 + -1.0000e+00 t(0) exp(-2.6000e-02 t)
mean: 3.8461e+01   variance: 1.4793e+03

system prep2 node 0
     t            F(t)
 0.0000e+00   0.0000e+00
 1.0000e-03   2.6000e-05
 2.0000e-03   5.1999e-05
 3.0000e-03   7.7997e-05
 4.0000e-03   1.0399e-04
 5.0000e-03   1.2999e-04
 6.0000e-03   1.5599e-04
 7.0000e-03   1.8198e-04
 8.0000e-03   2.0798e-04
 9.0000e-03   2.3397e-04
 1.0000e-02   2.5997e-04

information about system prep3 node 0

probability of entering node: 1.0008e+00

conditional CDF for time of reaching this absorbing state
   1.0000e+00 t(0) exp( 0.0000e+00 t)
 + -1.0000e+00 t(0) exp(-5.6287e-08 t)
mean: 1.7766e+07   variance: 3.1564e+14

system prep3 node 0
     t            F(t)
 0.0000e+00   0.0000e+00
 1.0000e-03   5.6169e-11
 2.0000e-03   1.1250e-10
 3.0000e-03   1.6884e-10
 4.0000e-03   2.2517e-10
 5.0000e-03   2.8150e-10
 6.0000e-03   3.3784e-10
 7.0000e-03   3.9417e-10
 8.0000e-03   4.5050e-10
 9.0000e-03   5.0684e-10
 1.0000e-02   5.6317e-10

information about system prep4 node 0

probability of entering node: 1.0081e+00

conditional CDF for time of reaching this absorbing state
   1.0000e+00 t(0) exp( 0.0000e+00 t)
 + -1.0000e+00 t(0) exp(-5.5879e-09 t)
mean: 1.7896e+08   variance: 3.2026e+16

system prep4 node 0
     t            F(t)
 0.0000e+00   0.0000e+00
 1.0000e-03   5.6180e-12
 2.0000e-03   1.1251e-11
 3.0000e-03   1.6885e-11
 4.0000e-03   2.2518e-11
 5.0000e-03   2.8151e-11
 6.0000e-03   3.3785e-11
 7.0000e-03   3.9418e-11
 8.0000e-03   4.5052e-11
 9.0000e-03   5.0685e-11
 1.0000e-02   5.6319e-11

Appendix B.

/********************************************************/
/*                                                       */
/*  Main Program: finds reliability of m-out-of-n system */
/*                                                       */
/********************************************************/

#include <stdio.h>
#include <math.h>

/* function prototypes */
int main(void);
double permutations(int, int);    /* finds n!/i!  (called as permutations(n, n-i)) */
double factorial(int);            /* finds n!                                      */
double combinations(int, int);    /* finds C(n,i)                                  */

/*------------------------- MAIN -------------------------*/
int main(void)
{
    int i, n, m;
    double lambda, r_mod, r_sys, hours;

    printf("\nThis program finds a failure probability\n");
    printf("for an m-out-of-n system, given the hourly\n");
    printf("failure rate of each module and total time\n");
    printf("\nEnter module's hourly failure rate: ");
    scanf("%lf", &lambda);
    printf("\nEnter the value for n: ");
    scanf("%d", &n);
    printf("\nEnter the value for m: ");
    scanf("%d", &m);
    printf("\nEnter the time to evaluate (in hours): ");
    scanf("%lf", &hours);

    r_mod = exp(-hours * lambda);
    r_sys = 0;
    /* Sum over the tolerable number of failed modules (0 through n-m);  */
    /* the bound is inclusive so that exactly m working modules counts.  */
    for (i = 0; i <= (n - m); i++) {
        r_sys += combinations(n, i) * pow(r_mod, (n - i)) * pow((1.0 - r_mod), i);
    }
    printf("\nThe system reliability is: %20.15e\n", r_sys);
    return 0;
}

/********************************************************/
/*                                                       */
/*  combinations - finds number of combinations of       */
/*                 n things taken i at a time            */
/*                                                       */
/********************************************************/
double combinations(int n, int i)
{
    return (permutations(n, n - i) / factorial(i));
}

/********************************************************/
/*                                                       */
/*  permutations - computes n*(n-1)*...*(i+1) = n!/i!    */
/*                 (a recursive implementation)          */
/*                                                       */
/********************************************************/
double permutations(int n, int i)
{
    return ((n == i) ? 1 : n * permutations(n - 1, i));
}

/********************************************************/
/*                                                       */
/*  factorial - returns value of n factorial             */
/*              (a recursive implementation)             */
/*                                                       */
/********************************************************/
double factorial(int n)
{
    return ((n == 0) ? 1 : n * factorial(n - 1));
}
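For reference, with R_m = exp(-lambda*t) denoting the reliability of a single module, the quantity evaluated by the main loop above is the usual m-out-of-n system reliability (this restates what the program computes; it is not a new result):

R_{sys} \;=\; \sum_{i=0}^{n-m} \binom{n}{i}\, R_m^{\,n-i}\, \bigl(1 - R_m\bigr)^{i}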

Bibliography.

[Adam87] G. B. Adams III, D. P. Agrawal, and H. J. Siegel, "A survey and comparison of fault-tolerant multistage interconnection networks," Computer, June 1987, pp. 14 - 27.

[Agra86] D. P. Agrawal and V. K. Janakiram, "Evaluating the performance of multicomputer configurations," Computer, May 1986, pp. 23 - 37.

[Agra85] P. Agrawal, "RAFT: a recursive algorithm for fault tolerance," Proc. of the 15th Int'l Symp. on Fault-Tolerant Computing, 1985, pp. 814 - 821.

[AklS85] S. G. Akl, Parallel Sorting Algorithms, Academic Press, Inc., 1985.

[Amma87] P. E. Ammann and J. C. Knight, "Data diversity: an approach to software fault tolerance," Proc. of the 17th Int'l Symp. on Fault-Tolerant Computing, 1987, pp. 122 - 126.

[Anco87] M. Ancona, A. Clematis, G. Dodero, E. B. Fernandez, and V. Gianuzzi, "Using different language levels for implementing fault-tolerant programs," Microprocessing and Microprogramming, vol. 20, 1987, pp. 33 - 38.

[Arla88] J. Arlat, K. Kanoun, and J.-C. Laprie, "Dependability evaluation of software fault tolerance," Proc. of the 18th Int'l Symp. on Fault-Tolerant Computing Systems, 1988, pp. 142 - 147.

[Atha88] W. C. Athas and C. L. Seitz, "Multicomputers: message-passing concurrent computers," Computer, Aug. 1988, pp. 9 - 24.

[Aviz88] A. Avizienis, M. R. Lyu, and W. Schutz, "In search of effective diversity: a six-language study of fault-tolerant flight control software," Proc. of the 18th Int'l Symp. on Fault-Tolerant Computing Systems, 1988, pp. 15 - 22.

[Aviz87] A. Avizienis, M. R. Lyu, and W. Schutz, In Search of Effective Diversity: a Six-Language Study of Fault-Tolerant Flight Control Software, UCLA Computer Science Dept. Report No. CSD-870060, Nov. 1987.

[Aviz86] A. Avizienis and J. C. Laprie, "Dependable computing: from concepts to design diversity," Proc. of the IEEE, vol. 74, no. 5, May 1986, pp. 629 - 638.

[Aviz85] A. Avizienis, "The n-version approach to fault-tolerant software," IEEE Trans. on Software Engineering, vol. SE-11, no. 12, Dec. 1985, pp. 1491 - 1501.

[Aviz84] A. Avizienis and J. P. J. Kelly, "Fault tolerance by design diversity: concepts and experiments," Computer, Aug. 1984, pp. 67 - 80.

[Bish88] P. G. Bishop and F. D. Pullen, "PODS revisited - a study of software failure behavior," Proc. of the 18th Int'l Symp. on Fault-Tolerant Computing Systems, 1988, pp. 2 - 8.

[Chen85] Y. Chen and T. Chen, "DFT: distributed fault tolerance - analysis and design," Proc. of the 15th Int'l Symp. on Fault-Tolerant Computing Systems, 1985, pp. 280 - 285.

[Chwa81] K.-Y. Chwa and S. L. Hakimi, "Schemes for fault-tolerant computing: a comparison of modularly redundant and t-diagnosable systems," Information and Control, vol. 49, 1981, pp. 212 - 238.

[Dahb85] A. T. Dahbura, K. K. Sabnani, and L. L. King, "The comparison approach to multiprocessor fault diagnosis," Proc. of the 15th Int'l Symp. on Fault-Tolerant Computing Systems, 1985, pp. 260 - 265.

[Dahb83] A. T. Dahbura and G. M. Masson, "Greedy diagnosis as the basis of an intermittent fault/transient-upset tolerant system design," IEEE Trans. on Computers, vol. C-32, no. 10, Oct. 1983, pp. 953 - 957.

[Eckh88] D. E. Eckhardt and L. D. Lee, "Fundamental differences in the reliability of n-modular redundancy and n-version programming," Journal of Systems and Software, vol. 8, 1988, pp. 313 - 318.

[Eckh85] D. E. Eckhardt and L. D. Lee, "A theoretical basis for the analysis of multiversion software subject to coincident errors," IEEE Trans. on Software Engineering, vol. SE-11, no. 12, Dec. 1985.

[Fair85] R. E. Fairley, Software Engineering Concepts, McGraw-Hill Book Co., 1985.

[Feng81] T. Y. Feng, "A survey of interconnection networks," Computer, Dec. 1981, pp. 12 - 27.

[Fern89a] E. B. Fernandez, V. Gianuzzi, G. Dodero, A. Clematis, and M. Ancona, "A system architecture for fault tolerance in concurrent software," in preparation.

[Fern89b] E. B. Fernandez, Fault-Tolerant Computer Systems, notes for EEL 6706, 1989.

[Flyn66] M. J. Flynn, "Very high-speed computing systems," Proceedings of the IEEE, vol. 54, 1966, pp. 1901 - 1909.

[Gele85] D. Gelernter, "Generative communication in Linda," ACM Trans. on Programming Languages and Systems, vol. 7, no. 1, Jan. 1985, pp. 80 - 112.

[Gluc86] D. P. Gluch and M. J. Paul, "Fault-tolerance in distributed digital fly-by-wire flight control systems," Proc. of the 7th Digital Avionics Systems Conf., 1986.

[Greg85] S. T. Gregory and J. C. Knight, "A new linguistic approach to backward error recovery," Proc. of the 15th Int'l Symp. on Fault-Tolerant Computing, 1985, pp. 404 - 409.

[Haki74] S. L. Hakimi and A. T. Amin, "Characterization of connection assignment of diagnosable systems," IEEE Trans. on Computers, Jan. 1974, pp. 86 - 88.

[Hech86] H. Hecht and M. Hecht, "Software reliability in the system context," IEEE Trans. on Software Engineering, vol. SE-12, no. 1, Jan. 1986, pp. 51 - 58.

[Hech76] H. Hecht, "Fault-tolerant software for real-time applications," ACM Computing Surveys, vol. 8, no. 4, Dec. 1976, pp. 391 - 406.

[Hill74] F. S. Hillier and G. J. Lieberman, Operations Research, San Francisco: Holden-Day, Inc., 1974.

[Hoar78] C. A. R. Hoare, "Communicating sequential processes," Comm. of the ACM, vol. 21, no. 8, Aug. 1978, pp. 666 - 677.

[Hopk78] A. L. Hopkins, Jr., T. B. Smith, III, and J. H. Lala, "FTMP - a highly reliable fault-tolerant multiprocessor for aircraft," Proc. of the IEEE, vol. 66, no. 10, Oct. 1978, pp. 1221 - 1239.

[Kiec89] R. M. Kieckhafer, "Fault-tolerant real-time task-scheduling in the MAFT distributed system," Proc. of the Hawaii Int'l Conf. on Systems Science, 1989, pp. 143 - 151.

[Kiec88] R. M. Kieckhafer, C. J. Walter, A. M. Finn, and P. M. Thambidurai, "The MAFT architecture for distributed fault tolerance," IEEE Trans. on Computers, vol. 37, no. 4, April 1988, pp. 398 - 405.

[KimK89] K. H. Kim and H. O. Welch, "Distributed execution of recovery blocks: an approach for uniform treatment of hardware and software faults in real-time applications," IEEE Trans. on Computers, vol. 38, no. 5, May 1989, pp. 626 - 636.

[KimK86] K. H. Kim, J. H. You, and A. Aboulnaga, "A scheme for coordinated execution of independently designed recoverable distributed processes," Proc. of the 16th Int'l Symp. on Fault-Tolerant Computing, 1986, pp. 130 - 135.

[KimK84] K. H. Kim, "Software fault tolerance," chapter 20 in Handbook of Software Engineering, ed. by C. R. Vick and C. V. Ramamoorthy, Van Nostrand Reinhold, 1984, pp. 437 - 454.

[Knig87] J. C. Knight and N. G. Leveson, "An empirical study of failure probabilities in multi-version software," Proc. of the 16th Int'l Symp. on Fault-Tolerant Computing Systems, 1986, pp. 165 - 170.

[Knig86] J. C. Knight and N. G. Leveson, "An experimental evaluation of the assumption of independence in multiversion programming," IEEE Trans. on Software Engineering, vol. SE-12, no. 1, Jan. 1986, pp. 96 - 109.

[Lala88] J. H. Lala and L. S. Alger, "Hardware and software fault tolerance: a unified architectural approach," Proc. of the 18th Int'l Symp. on Fault-Tolerant Computing Systems, 1988, pp. 240 - 245.

[Lala86] J. H. Lala, "A Byzantine resilient fault tolerant computer for nuclear power plant applications," Proc. of the 16th Int'l Symp. on Fault-Tolerant Computing Systems, pp. 338 - 343.

[Lamp82] L. Lamport, R. Shostak, and M. Pease, "The Byzantine Generals problem," ACM Trans. on Programming Languages and Systems, vol. 4, no. 3, July 1982, pp. 382 - 401.

[Lapr87] J.-C. Laprie, J. Arlat, C. Beounes, K. Kanoun, and C. Hourtolle, "Hardware- and software-fault tolerance: definition and analysis of architectural solutions," Proc. of the 17th Int'l Symp. on Fault-Tolerant Computing, 1987, pp. 116 - 121.

[LinK86] K.-J. Lin, "Resilient procedures - an approach to highly available system," Proc. IEEE Int'l Conf. on Computer Languages, 1986, pp. 98 - 106.

[Male80] M. Malek, "A comparison connection assignment for diagnosis of multiprocessor systems," Proc. of the 7th Symposium on Computer Architecture, 1980.

[Mall78] S. Mallela and G. M. Masson, "Diagnosable systems for intermittent faults," IEEE Transactions on Computers, vol. C-27, no. 6, June 1978, pp. 560 - 566.

[Manc86a] L. Mancini, "Modular redundancy in a message passing system," IEEE Trans. on Software Engineering, vol. SE-12, no. 1, Jan. 1986, pp. 79 - 86.

[Manc86b] L. Mancini and G. Pappalardo, "The join algorithm: ordering messages in replicated systems," Proc. Conf. on Safety of Computer Control Systems 1986, pp. 51 - 55.

[Meye87] F. J. Meyer and D. K. Pradhan, "Consensus with dual failure modes," Proc. of the 17th Int'l Symp. on Fault-Tolerant Computing, 1987, pp. 48 - 54.

[MIL-STD-1815A] Reference Manual for the Ada Programming Language, U.S. Dept. of Defense, 1983.

[Muns89] J. C. Munson and T. M. Khoshgoftaar, "The dimensionality of program complexity," Proc. of the 11th Int'l Conf. on Software Engineering, 1989, pp. 245 - 253.

[Nels87] P. A. Nelson and L. Snyder, "Programming paradigms for nonshared memory parallel computers," in The Characteristics of Parallel Algorithms, ed. by L. H. Jamieson, D. Gannon, and R. J. Douglass, MIT Press, 1987.

[Ozak88] B. M. Ozaki, E. B. Fernandez, and E. Gudes, "Software fault tolerance in architectures with hierarchical protection levels," IEEE Micro, Aug. 1988, pp. 30 - 43.

[Prad86] D. K. Pradhan, "Fault-tolerant multiprocessor and VLSI-based system communication architectures," chapter 7 in Fault-Tolerant Computing: Theory and Techniques, ed. by D. K. Pradhan, Prentice-Hall, 1986.

[Prad85] D. K. Pradhan, "Fault-tolerant multiprocessor link and bus network architectures," IEEE Trans. on Computers, vol. 34, no. 1, Jan. 1985, pp. 33 - 45.

[Prep67] F. P. Preparata, G. Metze, and R. T. Chien, "On the connection assignment problem of diagnosable systems," IEEE Trans. on Electronic Computers, vol. EC-16, no. 6, Dec. 1967, pp. 848 - 854.

[Quin87] M. J. Quinn, Designing Efficient Algorithms for Parallel Computers, New York: McGraw-Hill Book Co., 1987.

[Rand75] B. Randell, "System structure for software fault tolerance," IEEE Trans. on Software Engineering, vol. SE-1, no. 2, June 1975, pp. 220 - 232.

[Redi84] H. A. Rediess, Technology Review of Flight Crucial Flight Control Systems, NASA Contractor Report 172332, 1984.

[Rive78] R. Rivest, A. Shamir, and L. Adleman, "A method for obtaining digital signatures and public-key cryptosystems," Communications of the ACM, vol. 21, no. 2, pp. 120 - 126, Feb. 1978.

[Rouq86] J. C. Rouquet and P. J. Traverse, "Safe and reliable computing on board the Airbus and ATR aircraft," Proc. Conf. on Safety of Computer Control Systems 1986, pp. 93 - 97.

[RTCA83] Radio Technical Commission on Aeronautics Paper No. 226-83/SC152-13, paragraph 7.2, as quoted in [Youn84].

[Sahn86] R. A. Sahner and K. S. Trivedi, SHARPE: Symbolic Hierarchical Automatic Reliability and Performance Evaluator, Introduction and Guide for Users, Sept. 1986.

[Sarm88] J. L. Sarmiento and E. B. Fernandez, "A knowledge-based system for the development of fault-tolerant programs," Proc. of the Florida A. I. Research Symp. (FLAIRS), May 1988, pp. 119 - 124. A revised version will appear in Advances in Artificial Intelligence Research, JAI Press, 1989.

[Seng87] A. Sengupta, A. Sen, and S. Bandyopadhyay, "On an optimally fault-tolerant multiprocessor network architecture," IEEE Trans. on Computers, vol. C-36, no. 5, May 1987, pp. 619 - 623.

[Shin87] K. G. Shin and P. Ramanathan, "Diagnosis of processors with Byzantine faults in a distributed computing system," Proc. of the 17th Int'l Symp. on Fault-Tolerant Computing Systems, 1987, pp. 55 - 60.

[Siew82] D. P. Siewiorek and R. S. Swarz, The Theory and Practice of Reliable System Design, Bedford, MA: Digital Press, 1982.

[Smit86] T. B. Smith et al., The Fault-Tolerant Multiprocessor Computer, Noyes Publications, 1986.

[Smit84] T. B. Smith, "Fault tolerant processor concepts and operation," Proc. of the 14th Int'l Symp. on Fault-Tolerant Computing Systems, 1984, pp. 158 - 163.

[Spie65] M. R. Spiegel, Schaum's Outline of Theory and Problems of Laplace Transforms, New York: McGraw-Hill Book Co., 1965.

[Srik87] T. K. Srikanth and S. Toueg, "Optimal clock synchronization," Journal of the ACM, vol. 34, no. 3, July 1987, pp. 626 - 645.

[Swih84] D. E. Swihart and A. M. Arabian, "Digital flight control and avionics integration techniques," IEEE National Aerospace Electronics Conference, 1984, pp. 1329 - 1331.

[TsoK87] K. S. Tso and A. Avizienis, "Community error recovery in n-version software: a design study with experimentation," Proc. of the 17th Int'l Symp. on Fault-Tolerant Computing, 1987, pp. 127 - 133.

[TsoK86] K. S. Tso, A. Avizienis, and J. P. J. Kelly, "Error recovery in multi-version software," Proc. Conf. on Safety of Computer Control Systems 1986, Sarlat, France: 1986, pp. 35 - 41.

[UhrL87] L. Uhr, Multi-Computer Architectures for Artificial Intelligence, New York: John Wiley and Sons, 1987.

[Walt85] C. J. Walter, R. M. Kieckhafer, and A. M. Finn, "MAFT: a multicomputer architecture for fault-tolerance in real-time systems," Proc. of the Real-Time Systems Symposium, 1985, pp. 133 - 140.

[Wens78] J. H. Wensley, L. Lamport, J. Goldberg, M. W. Green, K. N. Levitt, P. M. Melliar-Smith, R. E. Shostak, and C. B. Weinstock, "SIFT: design and analysis of a fault-tolerant computer for aircraft control," Proc. of the IEEE, vol. 66, no. 10, Oct. 1978, pp. 1240 - 1255.

[Yang86] C.-L. Yang and G. M. Masson, "A fault identification algorithm for ti-diagnosable systems," IEEE Trans. on Computers, vol. C-35, no. 6, June 1986, pp. 503 - 510.

[Youn84] L. J. Yount, "Architectural solution to safety problems of digital flight-critical systems for commercial transports," Proc. of the 6th Digital Avionics Systems Conference, 1984, pp. 28 - 35.

[Yous83] W. J. Yousey et al., "AFTI/F-16 DFCS development summary - a report to industry: redundancy management system design," Proc. of the National Aerospace and Electronics Conference 1983, pp. 1220 - 1226.