AN ULTRARELIABLE MULTICOMPUTER ARCHITECTURE

FOR REAL-TIME CONTROL APPLICATIONS

by

Peter C. Buechler

A Thesis Submitted to the Faculty of the

College of Engineering in Partial Fulfillment of the Requirements for the Degree of

Master of Science in Computer Engineering

Florida Atlantic University

Boca Raton, Florida

December 1989

AN ULTRARELIABLE MULTICOMPUTER ARCHITECTURE FOR REAL-TIME CONTROL APPLICATIONS

by Peter C. Buechler

This thesis was prepared under the direction of the candidate's thesis advisor, Dr. Eduardo B. Fernandez, Department of Computer Engineering, and has been approved by the members of his supervisory committee. It was submitted to the faculty of the College of Engineering and was accepted in partial fulfillment of the requirements for the degree of Master of Science in Computer Engineering.

SUPERVISORY COMMITTEE:

Dr. E. B. Fernandez, Thesis Advisor
Dr. T. M. Khoshgoftaar
Dr. D. P. Gluch

Chairperson, Department of Computer Engineering

Dean of Graduate Studies          Date


ACKNOWLEDGEMENTS

I would like to thank my committee members for their suggestions and criticisms, my wife for acting as an economic slave while I pursued my studies, my mother for her support, and Mr. Paul Luebbers for stimulating discussions and encouragement.

ABSTRACT

Author: Peter C. Buechler
Title: An Ultrareliable Multicomputer Architecture for Real-Time Control Applications

Institution: Florida Atlantic University
Thesis Advisor: Dr. Eduardo B. Fernandez
Degree: Master of Science in Engineering
Year: 1989

This thesis considers the design of ultrareliable multicomputers for control applications. The fault tolerance problem is divided into three subproblems: software, processing node, and communication fault tolerance. Design is performed using layers of abstraction, with fault tolerance implemented by dedicated layers. For software fault tolerance, new constructs for concurrent n-version programming are introduced. For processing node fault tolerance, the distributed fault tolerance (DFT) concept of Chen and Chen is extended to allow for arbitrary failures. Communication fault tolerance is achieved with multicasting on a fault-tolerant graph (FG) network. Reliability models are developed for each of the layers, and a performance model is developed for the communication layer. An example flight control system is compared to currently existing architectures.

TABLE OF CONTENTS

1. Introduction ...... 1

2. Current Architectures ...... 6

2.1 Historical Perspective ...... 6

2.2 Sperry Flight Systems ...... 8

2.3 FTP/AP ...... 10

2.4 MAFT...... 14

2.5 Airbus A320 ...... 17

2.6 Summary...... 19

3. URMC Architecture ...... 21

3.1 Virtual Machine Design Approach ...... 22

3.2 The Recovery Layer Concept...... 24

3.3 Top-Level URMC Design ...... 25

3.4 Allocation of System Requirements to Layers ...... 28

3.4.1 Software Fault Tolerance ...... 28

3.4.2 Processing Node Fault Tolerance ...... 29

3.4.3 Interconnection Network Fault Tolerance ...... 31

3.5 Top-Level Reliability Model...... 32

4. Software Fault Tolerance ...... 34

4.1. Sequential Constructs ...... 35

4.1.1 Recovery Blocks...... 35

4.1.2 N-Version Programming ...... 38

4.2. Concurrent Constructs ...... 40

4.2.1 Recovery Block Extensions ...... 40


4.2.1.1 PTC ...... 41
4.2.1.2 The Conversation ...... 44
4.2.1.3 The Colloquy ...... 45
4.2.2 N-Version Programming Extensions ...... 47
4.2.2.1 Modular Redundancy in CSP ...... 47
4.2.2.2 Resilient Procedures ...... 50
4.3 Selecting a Fault-tolerant Construct ...... 51
4.3.1 Sequential Software ...... 52
4.3.2 Concurrent Software ...... 57
4.4 Concurrent N-Version Programming (CNVP) ...... 58
4.4.1 Process-Dissimilar CNVP ...... 60
4.4.2 Subprogram-Dissimilar CNVP ...... 65
4.4.3 Structure-Dissimilar CNVP ...... 69
4.4.4 Comparison of CNVP Constructs ...... 73
4.5 CNVP Reliability Model ...... 79
4.5.1 Review of Sequential NVP Models ...... 79
4.5.2 Process-Dissimilar CNVP ...... 87
4.5.3 Subprogram-Dissimilar CNVP ...... 90
4.5.4 Structure-Dissimilar CNVP ...... 91
4.6 Correlated Error Problem ...... 91
4.7 Software Fault Tolerance Summary ...... 95
5. Processing Node Fault Tolerance ...... 97
5.1 Derived Requirements for the Layer ...... 99
5.2 SISD Redundancy Management Methods ...... 100

5.2.1 Standby Sparing ...... 100

5.2.2 N-Modular Redundancy ...... 102

5.3 MIMD Redundancy Management Methods...... 104

5.3.1 Redundancy with System Diagnosis...... 105

5.3.2 Modular Redundancy with Voting ...... 112

5.4 Voting vs. Diagnosis...... 113

5.5 URMC Redundancy Management Technique ...... 115

5.5.1 Byzantine Fault Masking ...... 116

5.5.2 Byzantine Fault Diagnosis ...... 118

5.5.3 Spare Processor Coordination ...... 121

5.5.4 I/O Hardware Fault Tolerance ...... 122

5.5.5 Support for Dissimilar Processors ...... 123

5.6 Reliability Model...... 124

6. Network Fault Tolerance ...... 130

6.1 Network Requirements ...... 131

6.2 Interconnection Network Overview ...... 133

6.3 The Unidirectional Link FG Network ...... 137

6.4 Reliability Analysis ...... 140

6.5 Performance Analysis ...... 142

6.6 Summary...... 147

7. Example URMC System ...... 149

7.1 Requirements ...... 149

7.2 Current Capability...... 150

7.2.1 Software Failure Parameters ...... 150


7.2.2 Hardware Failure Parameters ...... 154

7.2.3 Hardware Throughput...... 155

7.3 Example URMC Design ...... 155

7.3.1 Software Fault Tolerance ...... 156

7.3.2 Hardware Fault Tolerance ...... 158

7.3.3 Communication Fault Tolerance ...... 163

7.3.4 Communication Performance ...... 163

7.4 Comparison with Current Systems ...... 164

7.5 Summary ...... 166

8. Conclusion ...... 167

8.1 Summary of the URMC...... 167

8.2 Contributions...... 169

8.3 Suggestions for Further Work...... 171

8.3.1 Software ...... 171

8.3.2 Processing Hardware...... 172

8.3.3 Communication Hardware...... 173

8.3.4 Development of an URMC ...... 173

Appendix A ...... 175

Appendix B ...... 186

Bibliography...... 189

LIST OF TABLES

7.3.2-1. Failure probability for I/O repsets ...... 162

7.3.2-2. Failure probability for processing repsets ...... 162


LIST OF FIGURES

2.2-1. Sperry flight control architecture ...... 9

2.3-1. The FTP/AP architecture ...... 11

2.3-2. Distribution of channel value to all ...... 12

2.4-1. The MAFT architecture ...... 15

2.5-1. A single A320 computer (SEC or ELAC) ...... 18

2.5-2. A320 pitch control system ...... 19

3.1-1. Layers of abstraction in an SISD computer...... 23

3.1-2. Layers of abstraction in the OSI Model...... 23

3.3-1. URMC layers of abstraction ...... 27

3.5-1. Top-level reliability diagram for URMC ...... 33

4.2.1-1. Domino effect from uncoordinated recovery blocks ...... 41

4.2.1.2-1. A conversation with three processes ...... 44

4.3.1-1. Characteristics of fault-tolerant constructs ...... 53

4.3.1-2. Overheads of constructs...... 56

4.4.1-1. Process-dissimilar CNVP ...... 61

4.4.1-2. Recovery layer components of a three version process...... 64

4.4.2-1. Subprogram-dissimilar CNVP ...... 67

4.4.3-1. Structure-dissimilar CNVP ...... 70

4.4.4-1. Relationship between CNVP constructs ...... 74

4.4.4-2. Communication structure for a minimum extraction sort...... 75

4.4.4-3. Structure-dissimilar CNVP for a sorting problem ...... 78

4.5.1-1. NVP fault sources...... 80

4.5.1-2. Major fault types for NVP ...... 80

4.5.1-3. Detailed reliability model of n-version program ...... 81

4.5.1-4. Simple reliability model without recovery ...... 85

4.5.1-5. Simplified reliability model with recovery ...... 86

4.5.2-1. Top-level reliability diagram for process-dissimilar CNVP ...... 87

4.5.2-2. Reliability for three version sequential process ...... 88

4.5.3-1. Reliability diagram for subprogram-dissimilar CNVP ...... 90

4.6-1. Effect of independence assumption (from [Eckh85]) ...... 92

4.6-2. Effect of shifted intensity distribution (from [Eckh85]) ...... 93

5.2.1-1. Standby sparing with comparison checking ...... 101

5.2.2-1. Redundant system with voting ...... 102

5.3.1-1. An optimally one-step 2-diagnosable system ...... 107

5.3.1-2. Necessary partitioning for ti-diagnosable system ...... 108

5.5.1-1. Sending messages between repsets ...... 117

5.6-1. Reliability of 4-node I/O repset ...... 126

5.6-2. Markov model for 4-node processing repset...... 127

5.6-3. Markov model for spares pool reliability ...... 128

5.6-4. URMC processing node reliability diagram ...... 129

6.2-1. Link and architectures ...... 134

6.3-1. A (2,3) FG network...... 137

6.4-1. Node architecture ...... 141

6.5-1. The basic queueing process...... 142

6.5-2. Queueing system formed by one node of network...... 144

7.2.1-1. Comparisons of estimators (from [Eckh88]) ...... 153

7.3.2-1. A single processing node in the URMC-1 ...... 159

7.3.2-2. A single input/output node in the URMC-1 ...... 160
A-1. SHARPE program to compute I/O repset failure probability ...... 175
A-2. Output from SHARPE program of Figure A-1 ...... 178
A-3. SHARPE program to compute processing repset failure probability ...... 181
A-4. Output from SHARPE program of Figure A-3 ...... 184

1. Introduction.

Digital computers are now being applied to real-time control applications which require high throughput and extremely high reliability. The throughput requirements can be met by the use of a multicomputer, since multiple instruction stream, multiple data stream (MIMD) computers have been shown to be capable of considerable speedup over conventional single instruction stream, single data stream (SISD) computers [Atha88]. However, using multicomputers for real-time control will require that they be fault-tolerant and able to handle the special needs of real-time systems, such as hard processing deadlines and high I/O bandwidth. In this thesis, concepts for designing and analyzing ultrareliable multicomputers (URMC) for real-time control applications will be examined. It is intended that the concepts be applicable to a wide range of systems, from flight control computers with tens of processors and a failure probability of less than 1 x 10^-9 over a ten-hour flight, to process control computers with up to thousands of processors but with a failure probability several orders of magnitude larger. However, for brevity and concreteness, the thesis concentrates on the flight control application.

Current flight control computers for fly-by-wire commercial transport aircraft must have a throughput of 5.5 million instructions per second, an I/O rate of one million bits per second, and a probability of failure less than 1 x 10^-10 per flight hour [Kiec88]. Despite these stringent requirements, several computers have been developed for fly-by-wire control. Both the military and civilian versions must tolerate hardware failures, but due to the higher reliability demanded for civilian applications, the commercial air transport systems must have software fault tolerance as well. Though developed for flight control, these systems are applicable to other high-reliability control applications as well, such as process control or autonomous robot control. The current generation of flight control computers made the assumption that failures in software versions would occur independently. Unfortunately, this assumption is probably incorrect, so more versions of software will be required to reach the desired reliability level [Eckh88]. To execute n versions of software in the same time as a single version, the throughput of the flight control computer hardware will have to be increased n times.

The next generation flight control computer will be assigned more tasks than the current generation. The next generation of avionics computers will integrate flight control, propulsion control, navigation, flight management, collision avoidance, communications, and radar in the same system [Redi84] [Swih84]. Also, the system will be smarter, with an expert system to consult with the pilot, apprising the pilot of its diagnosis of problems and of the recommended procedure for dealing with them. Different parts of the system will require different levels of reliability.

How can the throughput requirements be met, first for a computer with more software versions, and then later for a computer with much greater throughput to take on the additional tasks? Also, how can different sections of the software in the same computer system be designed with differing reliabilities? This thesis will outline a top-level approach to designing an ultrareliable multicomputer (URMC), as may be required for greater reliability in current applications, with the capability of expanding to much larger multicomputers in the near future. The computer must be programmed with concurrent software to take advantage of the parallelism, and that software must be fault-tolerant to meet the reliability requirements. The thesis will first discuss how such a computer may be designed, then detail the software fault tolerance and hardware fault tolerance methods to be used. Analytical models for evaluating the reliability of the proposed schemes will be derived. An example system will be constructed which must handle the same tasks as current computers, but without the assumption of software failure independence.

An outline of each chapter in the thesis follows.

Chapter 2 will describe past work in architectures for highly reliable digital flight control systems. Those which were to be used in flight-critical applications and which have both hardware and software fault tolerance will be reviewed in more detail.

Chapter 3 discusses the design approach used in the URMC, which is to design a hierarchy of virtual machines, with each machine using the services of the machine below it to provide services to the machine above it. A new level of abstraction, the recovery layer, has been proposed for software fault tolerance [Fern89a]. In this thesis the concept is extended by adding levels to the virtual machine hierarchy to provide tolerance to faults at three different levels: communication, node hardware, and software. Each of these layers will provide highly reliable service to the layer above it despite using the facilities of an unreliable layer below it. Chapter 3 will also discuss the top-level requirements which each of the fault tolerance layers must meet.

Chapter 4 describes the software constructs which will be necessary for programming fault-tolerant concurrent software. Fault-tolerant constructs for both sequential and concurrent software are reviewed and compared for use in the URMC. It is found that no current software fault-tolerant construct is sufficient, so a new construct called concurrent n-version programming is introduced. A reliability model for this construct is found.

Chapter 5 discusses fault tolerance at the level of the processing nodes. The requirements imposed on this layer by the URMC system requirements and by requirements derived from the software layer's implementation are outlined. Current approaches to distributed fault tolerance are described, then the distributed fault tolerance (DFT) approach of [Chen85] is selected for further development. The DFT proposal does not account for Byzantine faults [Lamp82], so it is extended to mask and diagnose Byzantine faults. A reliability model for this layer is found.

Chapter 6 discusses the interconnection network in the URMC. It presents the requirements imposed upon this layer by the URMC system requirements and by the requirements derived from the implementations of higher layers. It is determined that a pseudo-completely connected network should be used, and from this class the fault-tolerant graph (FG) network is chosen. A reliability model is developed for this layer as for the other layers.

Chapter 7 puts together the ideas of chapters 3, 4, 5, and 6 by outlining an example system, a flight control computer which meets the requirements of the current generation without assuming software failure independence. It is compared to the systems described in chapter 2.

Chapter 8 is the conclusion. It is concluded that the URMC architecture may be a possible approach to designing fault-tolerant control computers, but that there are still several questions to which the answers must be found before this can be definitively shown. Ideas for future work are outlined.

2. Current Architectures.

In order to orient the reader to the current state of the art in fault-tolerant embedded control applications, this chapter reviews some of the architectures which have been developed. A large amount of work has been performed on fault-tolerant architectures for aircraft flight control, so all examples will be taken from flight control. However, these architectures may be used for other applications as well; for example, the Fault Tolerant Processor [Smit84] has been developed for use in flight control, power plant monitoring and control, process control, and/or air traffic control.

2.1 Historical Perspective.

The first digital fly-by-wire system was used in the Lunar Module of the Apollo program [Redi84]. This was followed by several others, including the F-8 Digital Fly-By-Wire, the AFTI/F-16, and the space shuttle; many systems are surveyed in [Redi84]. The first systems intended for use in commercial air transports were the Fault-Tolerant Multiprocessor Computer (FTMP) [Smit86] [Hopk78] and the Software Implemented Fault-Tolerant Computer (SIFT) [Smit86] [Wens78]. They were to compete with each other, the FTMP demonstrating fault tolerance in hardware and SIFT demonstrating it in software. Both systems relied upon modular redundancy and voting for fault tolerance.

All of the above systems had tolerance to physical hardware faults which occurred at random intervals. After the Byzantine fault problem was discovered in the development of the SIFT computer, Byzantine fault tolerance was also considered in the design. However, these systems did not tolerate generic faults, which are mistakes made in the design of the system. It became evident that unless there was protection against errors in the design of the system, the reliability goals could not be achieved. In the software-intensive systems of the present, software faults are the greatest design problem; it is estimated that even well-debugged avionics software has a failure rate on the order of 10^-5 per hour [Youn84].

The newest generation of digital flight control computers deals with both physical hardware failures and generic errors. Some of these systems are used for applications in which fault detection is crucial, but fault tolerance is not, such as the Boeing 737 autopilot [Youn84] and the Airbus A310 flap/slat controller [Rouq86]. These are called fail-passive systems. Fail-operational systems are more challenging to design, as the failure must not only be detected but also masked. The following sections describe four systems which have been designed to be fail-operational after either a random hardware failure or the manifestation of a generic fault.

2.2 Sperry Flight Systems.

An architecture suggested in [Youn84] uses three dissimilar versions of the processor, with each processor version running its own version of software (see Figure 2.2-1). The processors are arranged in three lanes (Flight Control Computers) with three processors in each lane, one processor of each type. Thus there are a total of nine processors. Each flight control computer has two outputs, one which is dedicated to one of the three processors (which may be called the dedicated processor), and the other output capable of being driven by either one of the other two processors (which may be called the checking processors). Each of the three lanes uses a different version of the processor as its dedicated output processor. The dedicated output processor is checked against the two checking processors with comparators (marked C in the figure).

A hardware or generic fault in the dedicated processor will cause both comparators in the lane to detect a disagreement, and thus the outputs disengage. A fault in one of the checking processors will cause only one of the two comparators to detect a disagreement, so the outputs will not disengage. Therefore, if a generic fault is encountered in one of the three processor versions, the lane with that version as its dedicated processor disengages, while the other two lanes are left with two processors each. This allows the system to tolerate two faults: two physical faults, or one physical and one generic fault.
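As an illustration only, the following Python sketch models the per-lane comparator decision just described. The function name and the representation of processor outputs as plain integer command values are assumptions of the sketch, not features of the Sperry design.

    # Sketch of the per-lane comparator logic (assumed representation:
    # each processor's output is an integer command value).
    def lane_disengages(dedicated_output, checking_outputs):
        # A lane disengages only when BOTH comparators disagree, i.e. when
        # the dedicated processor differs from both checking processors.
        disagreements = [dedicated_output != c for c in checking_outputs]
        return all(disagreements)

    # A fault in the dedicated processor disengages the lane...
    print(lane_disengages(5, [7, 7]))   # True: both comparators disagree
    # ...while a fault in one checking processor does not.
    print(lane_disengages(7, [7, 9]))   # False: only one comparator disagrees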

[Figure: three flight control computer lanes, FCC 1, FCC 2, and FCC 3, each containing three dissimilar processors and two comparators.]

Figure 2.2-1. Sperry flight control architecture.

2.3 FTP/AP.

The FTP is a tightly synchronized triple modular redundant system with identical hardware in each channel [Smit84] [Lala86]. It is fail-operational to hardware faults, including Byzantine faults, using interstage communicators to guarantee interactive consistency. The FTP/AP adds a fourth channel and a dissimilar applications processor to each of the four channels [Lala88] (see Figure 2.3-1). The FTP/AP is a redundant SISD machine with no multiprocessing capability. Each of the four channels in the FTP/AP contains a main processor, an application processor, and an interstage communication device. The four main processors and the four interstages make up the core FTP. The main processor handles synchronization, voting, scheduling, and I/O. The main processors are tightly synchronized so that they can perform a bit-for-bit comparison upon their results to detect failures. Input sensors and results from the application processors' dissimilar software will not be exactly the same bit for bit, so values from them are distributed to ensure simplex source congruency, then voted.

The software and hardware in the core FTP stay the same from application to application, so they are to be verified rigorously to prove that they have no design faults. The four applications processors are of four different hardware designs and run four dissimilar versions of software. This makes the FTP/AP fail-operational/fail-operational to similar hardware faults in the processors or interstage communicators, or to faults in the applications software. All four versions of software are stored in each application processor, but normally the processor executes only one of the four.

[Figure: channels A, B, C, and D, each containing a main processor, an interstage communicator, and an application processor.]

Figure 2.3-1. The FTP/AP architecture.

The main processors each input one of four redundant sensor channel values, then broadcast it to the other channels by passing it through the interstages (see Figure 2.3-2). The pseudocode for a value to be broadcast from channel i to all channels is:

    channel i:
        read sensor value(i)
        for j = 1 to n do
            send value(i) to interstage j
        end for

    all channels:
        for j = 1 to n do
            read value(i,j)
        end for
        value(i) = majority_vote(value(i,1), ..., value(i,n))


Each processor now has the same value for each simplex input sensor.

The processors perform a mid-value averaging vote upon the redundant sensors to arrive at a value for that input. If one of the simplex values is different from the voted value by more than an allowable tolerance, that input sensor is declared bad.
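A minimal Python sketch of this kind of sensor voting is given below. The use of the median as the mid-value and the particular tolerance are assumptions made for illustration; they are not taken from the FTP/AP documentation.

    # Sketch of mid-value selection over redundant sensor copies, with a
    # tolerance check that flags deviant simplex inputs.
    import statistics

    def mid_value_vote(samples, tolerance):
        voted = statistics.median(samples)           # mid-value of the redundant copies
        bad = [i for i, s in enumerate(samples)      # simplex inputs too far from the vote
               if abs(s - voted) > tolerance]
        return voted, bad

    voted, bad_sensors = mid_value_vote([100.2, 99.8, 100.1, 87.0], tolerance=1.0)
    print(voted, bad_sensors)   # 99.95 [3]: sensor 3 is declared bad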

[Figure: channel A's main processor distributes its sensor value through the interstage communicators to the main processors of channels B, C, and D.]

Figure 2.3-2. Distribution of channel value to all.

The voted values and sensor diagnoses are sent to all of the other main processors, where they are compared bit-for-bit to detect failures in any of the main processors. Each main processor also sends its voted input values to its application processor if a task requiring those values is executing.

When the application processor has arrived at its results, they are passed back to the main processors and exchanged using the same protocol as for the input sensor values. The application results are voted to arrive at a final output value. If any application value differs from the voted value by more than some tolerance, that application processor is declared bad. The main processors exchange the voted application values and diagnoses, which are voted bit-for-bit to detect any error in the main processors.

If an application processor is declared bad, then the problem may have been in the hardware or in the software. A procedure is followed to isolate the problem between these two. The state of the software version in question is rolled back to the state it had before the last execution. The iteration of the questionable version is then executed on all four processors. A bit-for-bit vote of the results of the three channels which are not suspects provides the suspect version's output with hardware faults masked. The algorithm compares the original value, the suspect channel's new value, and the hardware-fault-masked value from the other three processors. If the new iteration on the suspect channel agrees with the voted value of the other three channels, but not with the suspect's original value, then the fault was just a transient. If both the new iteration and the original iteration on the suspect channel disagree with the voted value of the other three channels, then a permanent hardware fault in the suspect channel is diagnosed. If both the new and original iterations of the suspect channel agree with the hardware-fault-masked value from the other three channels, it is an application version software failure.

If the malfunction was a hardware problem, that channel's results are not used in the future. If the malfunction was the result of an application version software error, then a confidence voter is invoked. If only one version of software has failed, there is a 3-1 split, and the majority value is taken.

There may be a correlated error in two of the versions which resulted in the failure of both, but which did not cause them to fail to the same incorrect value. In this case, there is a 2-1-1 split. The value chosen is that of the two versions which agree, while the identities of the pair which failed are logged. Later, if a correlated error occurs in two versions which causes the two failed versions to agree, then there is a 2-2 split. In that case, the value is chosen from the pair which has been shown to fail together as a pair in 2-1-1 splits the least often. This allows some correlated errors in two software versions to be tolerated.
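The following Python sketch illustrates one possible reading of this split-handling logic for four versions. The data structures (a list of (version, value) pairs and a dictionary of past pair-failure counts) and the use of exact value comparison are assumptions of the sketch, not the FTP/AP implementation.

    # Sketch of confidence voting over 3-1, 2-1-1, and 2-2 splits.
    from collections import Counter

    def confidence_vote(results, pair_failure_counts):
        groups = Counter(value for _, value in results)
        counts = sorted(groups.values(), reverse=True)
        majority_value = groups.most_common(1)[0][0]
        if counts[0] >= 3:
            # 4-0 or 3-1 split: take the majority value.
            return majority_value
        if counts == [2, 1, 1]:
            # 2-1-1 split: take the agreeing pair's value and log the two versions that failed.
            failed_pair = tuple(sorted(v for v, x in results if x != majority_value))
            pair_failure_counts[failed_pair] = pair_failure_counts.get(failed_pair, 0) + 1
            return majority_value
        # 2-2 split: choose the value of the pair that has failed together
        # least often in previous 2-1-1 splits.
        pairs = {tuple(sorted(v for v, x in results if x == value)): value
                 for value in groups}
        best_pair = min(pairs, key=lambda p: pair_failure_counts.get(p, 0))
        return pairs[best_pair]

    history = {}
    print(confidence_vote([("A", 1), ("B", 1), ("C", 2), ("D", 3)], history))  # 2-1-1: prints 1
    print(confidence_vote([("A", 1), ("B", 1), ("C", 2), ("D", 2)], history))  # 2-2: avoids pair (C, D), prints 1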

It was determined that it would be too difficult to reinitialize a failed software version to a state that is congruent to the other versions, so instead a failed version is reinitialized to a cold start state. It is then allowed to run with its output masked, in the hope that it will eventually start to agree with the other versions. The version is restored if its output agrees with the voted output for several iterations.

2.4 MAFT.

The Multicomputer Architecture for Fault Tolerance (MAFT) is a system developed by Bendix Corporation which has multiprocessing capability along with hardware and software fault tolerance, including tolerance to Byzantine faults [Kiec89] [Kiec88] [Gluc86] [Walt85]. It is a hardware implementation which builds upon the strategies of SIFT. A MAFT system has up to eight nodes, each connected to the other seven with a broadcast serial communication network. Each node is partitioned into two processors, the operations controller (OC) and the application processor (AP) (see Figure 2.4-1).

[Figure: operations controllers joined by a fully connected broadcast network; each operations controller is paired with an application processor running a distinct software version, and the application processors connect to an application-specific I/O network.]

Figure 2.4-1. The MAFT architecture.


The OC handles internode communication and synchronization, data voting, error detection, task scheduling, and system reconfiguration. The AP is application specific, performing I/O and computation of application functions. MAFT allows the application designer to program the system without concern for fault tolerance aspects, except that the designer must prepare three different versions of the application software, preferably in different high-level computer languages. A fault tolerance designer may then use several built-in tools for voting, depending upon the type (continuous or discrete) and the criticality of the result. The fault tolerance engineer assigns tasks to processors in a static fashion. They are only moved if reconfiguration is necessary after a fault, when they are moved to second-choice or third-choice processors. However, the tasks are scheduled dynamically on each processor. No distinction is made in the handling of hardware and software faults except those made by the application designer.

The facilities of the operations controllers are used to implement the application fault tolerance. They maintain synchronization with a hardware version of an optimal clock synchronization algorithm and use a Byzantine agreement protocol to arrive at distributed agreement on task scheduling and system reconfiguration. The results of any variables shared between application tasks are forwarded to all OC nodes upon completion of the task, where they are voted. To avoid waiting for hung tasks, each OC updates the voted value as the copies arrive rather than waiting for all expected values to arrive.

Errors are detected and counted for each node. If a node has had too many errors, then the system is reconfigured to exclude that node. No distinction is made between software-caused and hardware-caused errors, so it is possible for a node to be excluded due to errors in the software tasks executed upon it. The node continues to run after being excluded, and its results are still compared to those of the nodes in the current operating set. Thus, if the node recovers its internal state well enough to start getting the same answers that the other nodes do, it can be readmitted after some time.

A possible configuration for MAFT is discussed in [Gluc86]. A multitasking computer was implemented redundantly with six application processors, using two processors in each of three dissimilar hardware designs. The system requires at least three processors to be operational [Kiec88]. Processors may fail one at a time from hardware faults, or pairs with similar hardware and software may fail together due to a design fault in the hardware or software.
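A small Python sketch of this exclusion and readmission bookkeeping follows. The class name, the numerical thresholds, and the use of a simple consecutive-agreement window for readmission are assumptions for illustration; MAFT's actual error-count management is not reproduced here.

    # Sketch of per-node error bookkeeping: nodes accumulating too many errors
    # are excluded from the operating set, and an excluded node that agrees
    # with the voted results long enough may be readmitted.
    class NodeMonitor:
        EXCLUSION_THRESHOLD = 10      # errors before a node is excluded (assumed value)
        READMISSION_WINDOW = 100      # consecutive agreements before readmission (assumed value)

        def __init__(self, node_ids):
            self.error_count = {n: 0 for n in node_ids}
            self.agreement_streak = {n: 0 for n in node_ids}
            self.operating_set = set(node_ids)

        def record_result(self, node, agrees_with_vote):
            if agrees_with_vote:
                self.agreement_streak[node] += 1
                if (node not in self.operating_set and
                        self.agreement_streak[node] >= self.READMISSION_WINDOW):
                    self.operating_set.add(node)         # readmit the recovered node
            else:
                self.agreement_streak[node] = 0
                self.error_count[node] += 1
                if self.error_count[node] >= self.EXCLUSION_THRESHOLD:
                    self.operating_set.discard(node)     # reconfigure without the node

    monitor = NodeMonitor(["N1", "N2", "N3"])
    for _ in range(10):
        monitor.record_result("N2", agrees_with_vote=False)
    print(monitor.operating_set)   # N2 has been excluded; N1 and N3 remain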

2.5 Airbus A320.

The Airbus A320 is the first commercial transport to use fly-by-wire control for its primary flight control system [Rouq86]. There is a mechanical backup system to make the plane easier to certify for flight, but the mechanical backup has limited authority (i.e., it cannot command the aircraft control surfaces through their full range of motion), and it is hoped that it will never be used. The basic building block of the flight control system is a duplex computer (see Figure 2.5-1), with each channel monitoring the other; thus each computer is a self-checking pair. Each of the two channels has similar hardware, but dissimilar software.

Figure 2.5-1. A single A320 computer (SEC or ELAC).

There are two types of computers based upon the self-checking pairs: the Elevator and Aileron Computer (ELAC), based upon the Motorola 68000, and the Spoiler and Elevator Computer (SEC), based upon the 80186. Each of the two computers has two different programs, so a total of four computer programs have been developed.

The A320 pitch control system is shown in Figure 2.5-2. The sidestick commands are sent to two ELACs and two SECs. Under normal operation, one of the ELACs controls the elevator; if it fails, the other ELAC takes over. If two physical hardware failures or a single generic failure in the software or hardware cause both of the ELACs to shut down, then elevator control is performed by one of the two SECs.

This system has a weakness: although there are two types of processors used, the dissimilar computer programs which compare with each other run on similar hardware. Thus a generic hardware fault which causes both channels in a computer to fail in a similar manner will not be detected.

[Figure: sidestick commands feed two ELACs and two SECs, which produce the elevator commands.]

Figure 2.5-2. A320 pitch control system.

2.6 Summary.

The four architectures reviewed above have several common characteristics:

1) computer-level fault tolerance - redundant computers are used as the building blocks for fault tolerance. In the past, redundancy has been explored at lower levels of the design (e.g., processors, memories, and buses in FTMP), but with the development of VLSI it is more economical to use a higher-level unit for redundancy.

2) generic error detection - all use dissimilar software for different channels in order to detect design errors in the software. Dissimilar hardware is also used, at least as a backup. In all but the A320 architecture, dissimilar hardware processors are used to check each other's results.

3) synchronous - all use comparison of synchronized hardware channels for fault detection, although the level at which they are synchronized differs.

4) SISD - with the exception of MAFT, none are capable of multiprocessing. MAFT is theoretically capable, but the system described in [Gluc86] is an SISD system. A full MAFT system would have only eight nodes, which does not allow a high degree of multiprocessing.

The similarities above provide guidance for the URMC architecture. It will use fault tolerance at the computer level, dissimilar designs for generic error detection, and synchronized hardware. However, it will be a MIMD computer which is capable of considerable parallelism in its processing. Several issues must be addressed before such a computer can be built, and these are the subject of the rest of this thesis.

3. URMC Architecture.

This chapter gives an overview of the URMC architecture, justifying the top-level design, explaining the approach to fault tolerance, and assigning system level requirements to different parts of the design. Section 3.1 describes the concept of design through layers of abstraction, showing that this has been successfully used for design of computers and communication networks in the past. Section 3.2 reviews previous work on adding an extra layer of abstraction, the recovery layer, to simplify the design of software fault tolerance.

The new work begins in section 3.3. First a justification for developing a message-passing MIMD machine (multicomputer) is presented. Next it is explained that the URMC design extends the recovery layer approach, adding several fault-tolerant layers, with each layer detecting, diagnosing, masking, and recovering from its assigned fault type. It is decided to have three of these layers, for software, processing node hardware, and interconnection network hardware. Top-level requirements are assigned to the three fault tolerant layers in section 3.4. These requirements are to be met by the designs for the individual layers described in chapters 4, 5, and 6. Finally, in section 3.5, a top level reliability model for the URMC is shown.


3.1 Virtual Machine Design Approach.

Design of computers may be performed using layers of abstraction, designing a hierarchy of virtual machines. Each virtual machine presents a list of operations as its interface to the next higher layer. The machine uses one or more of the operations of the level below it to construct its own operations.

This process makes the design modular, greatly reducing the psychological complexity of the design process. The layers of abstraction may be used only as a design aid, or they may be actually implemented in the final system. If implemented in the final system, then another advantage is derived - the implementation of a lower layer of abstraction may be changed without affecting the layers above. Figure 3.1-1 shows a typical virtual machine hierarchy for a single instruction stream, single data stream (SISD) computer.

A similar approach may be used for the design of communication networks, such as the International Standards Organization's Open Systems Interconnect (OSI) model shown in Figure 3.1-2.

The services of a layer may be implemented using several services of the layer below, or only one. For instance, an operating system layer provides the services of system calls and of nonprivileged machine instructions to a high-level language layer, and uses the services of the machine language level to do this. Each of the system calls is constructed from many machine language instructions, while the user machine instructions are made from just one architectural machine instruction.

[Figure: layer hierarchy, from top to bottom: Application, High-Level Language, Operating System, Machine Language, Microcode, Register Transfer, Logic, Circuit.]

Figure 3.1-1. Layers of abstraction in an SISD computer.

[Figure: OSI layers, from top to bottom: Application, Presentation, Session, Transport, Network, Data Link, Physical.]

Figure 3.1-2. Layers of abstraction in the OSI Model.

3.2 The Recovery Layer Concept.

Kim suggests [KimK84] that a fault-tolerant computer be designed by providing fault tolerance at each of the layers of abstraction. For each type of fault at each layer, the designer must determine whether to handle the fault or to allow it to propagate to a higher level. Adding the fault-handling and decision-making functions to each layer complicates the design of the virtual machines. [Anco87] suggests inserting an additional layer in the virtual machine hierarchy in order to handle software faults without complicating the design of the software layer. This extra layer is called the recovery layer, and provides tolerance to software faults through the execution of a recovery metaprogram.

The application software layer can then be simpler, developed for the application without concern for fault tolerance mechanisms. In this thesis, the idea of a recovery layer is extended, so that several layers are inserted in the virtual machine hierarchy to handle different types of faults. Each fault-tolerant layer uses the services of a fault-susceptible layer below it to construct fault-tolerant services used by the layer above it. This simplifies the design of the layers of the virtual machine by removing the need to place fault tolerance in each. Each of the fault-tolerating layers will have to tolerate only one type of fault, and can neglect others. This approach to design does not improve the fault coverage, except incidentally through simplifying the fault tolerance designer's task. It must next be determined how many of these layers to insert, and where to place them in the hierarchy.

3.3 Top-Level URMC Design.

The first decision to be made is to what general class of computer the next generation of flight control computers should belong. There are four classifications of computers: single instruction stream with a single data stream (SISD), single instruction stream with multiple data streams (SIMD), multiple instruction streams with a single data stream (MISD), and multiple instruction streams with multiple data streams (MIMD) [Flyn66]. The most flexible of these is the MIMD machine. An MIMD machine may be either a multiprocessor, which has several processors sharing a common memory, or a multicomputer, which has several processors with local memories communicating only through message passing. Since each of the processing subsystems of a multicomputer is autonomous, a multicomputer has more isolation between processors than a multiprocessor, and is therefore better for isolating faults. Also, shared memory can act as a single point of failure unless duplicated. It was thus decided that the next generation of critical control computers should be multicomputers, hence the name Ultrareliable Multicomputer (URMC).

The next decision is how to divide the fault tolerance requirements into layers. One obvious split is into hardware and software. Fault tolerance may be provided separately for these two areas. Since multicomputers have several processing nodes which communicate through an interconnection network, the hardware can be further broken down into processing node design and interconnection network design. Examination of the current literature found that there are several fault tolerance strategies for processing nodes, and several fault tolerance strategies for communications networks. It was therefore decided to divide the hardware fault tolerance into two types, processing node fault tolerance and interconnection network fault tolerance. Thus there are three basic layers of fault tolerance, which will be inserted as three layers (or groups of layers) in the hierarchy of virtual machines.

Since hardware is present only to allow the execution of software, the software fault tolerance layer is placed above the hardware layers. Having the hardware fault tolerance below the software allows a scheme to be designed to handle software faults without concern for the hardware. The problems of distinguishing software faults from hardware faults and of rerouting software around faulty hardware are eliminated from this layer. The design of software fault tolerance in the URMC will be discussed in chapter 4.

For hardware fault tolerance, it must be decided which of the two layers will be on top, the processing node fault tolerance or the communication network fault tolerance. Many algorithms have been developed for the redundancy management of the processors in a distributed system. Most of these assume perfect communication channels, though some do not. In order to allow the use of any of these algorithms, the communication fault tolerance layer will be placed below the node fault tolerance layer. This allows the isolation and detection of node faults with the assumption of perfect reliability of the communication links between the nodes. The design of the processing node fault tolerance will be discussed in chapter 5.


[Figure: fault-tolerant software above fault-tolerant hardware; below that, a fault-tolerant communication layer built on several OSI-like layers down to circuit components, in parallel with several node hardware layers down to circuit components.]

Figure 3.3-1. URMC layers of abstraction.

The communication layer is not used for constructing the processing node hardware, but only for the construction of the processing node fault tolerance. Therefore the communication layer is not placed below all of the node hardware layers, but will be placed below the hardware fault tolerance layer. The design of the interconnection network and of the processing nodes themselves will be separated and parallel below the hardware fault tolerance layer.


The positioning of the communication fault tolerance below the node fault tolerance is reinforced by the fact that there are many techniques for multiple routing of messages between single pairs of nodes which do not assume that the nodes are failure-free. Design of the communication layer is discussed in chapter 6.

The resulting structure of the URMC layers of abstraction is shown in Figure 3.3-1. Note that the node hardware and communication hardware are separate until joined in the hardware fault tolerance layer.

3.4 Allocation of System Requirements to Layers.

Since many multicomputers have been designed in the past [Atha88], this section will outline only the allocation of fault tolerance and reliability requirements to the layers of the URMC. Other requirements, such as language support, scheduling, etc., can be met using one of the many techniques presented in the literature.

3.4.1 Software Fault Tolerance.

It would be too ambitious to expect the software fault tolerance layer to mask faults in the specification of the software, but it should be able to tolerate faults in the implementation of the software. It should also provide support for fault tolerance in a concurrent processing model, as the next generation flight control computer will be programmed in a concurrent language such as Ada [MIL-STD-1815A].

When a failure occurs in the software, it must be detected so that action may be taken to correct it. The software module which caused the fault must be diagnosed so that the effects of the fault may be isolated to that module or the fault containment region of which it is a part. Software faults are all design faults present since the code was written, but a failure is triggered by some sequence of events or particular data input [Hech86]. Since software in a real-time system has generally been thoroughly tested, the fault will probably not occur again for some time, arising only due to exceptional, untested circumstances. An input in a certain area of the input space will cause the design fault to be manifested as an error. If the internal state is not corrupted, the software will behave correctly when the input leaves this area [Bish88]. Thus, software failures can usually be considered as transient failures [TsoK86]. If possible, the failed channel should be restored to a normal state, so that it can rejoin the system and provide resilience against software failures in other channels. These requirements will be addressed in the discussion of the software fault tolerance layer(s) in chapter 4.

3.4.2 Processing Node Fault Tolerance.

The throughput of current flight control computers is on the order of 5.5 million instructions per second (MIPS), with I/O rates of 1 million bits per second (MBPS) [Kiec88]. One question not answered in the reference is what type of instruction is meant: 16-bit or 32-bit, CISC or RISC? The date of the development effort suggests that a 16-bit CISC processor was intended. The next generation system should be capable of significantly higher throughput, both through the use of faster, 32-bit processors and through the capability of having more nodes in the multicomputer.

The node fault tolerance layer must provide hardware service to the software layers above as if hardware failures never occurred. If a node fails, the failure must be detected and the fault diagnosed as being caused by a certain node, then that node isolated so that the fault does not spread. Many hardware faults are transient, so recovery should be performed using the node which failed, if possible. If the failure is permanent, the system should be restructured to replace the faulty node with a spare node, with minimal disturbance of the software tasks which had been executing on the faulty node.

If the software is diverse, then the underlying hardware of the software versions will be in different states, and a hardware design error is not likely to cause failure in more than one software version's hardware at a time [Aviz86]. A failure which occurs due to a hardware design fault may be detected, isolated, and recovered from as if it were a software fault. However, since some application designers will wish to have dissimilar versions of hardware (because of the chance of correlated hardware design errors, or environmentally induced hardware failures), the node fault tolerance layer of the URMC shall be able to support dissimilar node hardware.


I/O presents a special problem for reconfiguration after a fault. The I/O lines are often hardwired to particular nodes of the computer, so I/O drivers cannot migrate to an available spare node upon failure of the original. A broadcast bus design does not suffer from this limitation, but serves as an I/O bottleneck. It will be assumed that I/O is from/to redundant sensors/actuators. The signals are replicated and wired to each of several I/O nodes. A means to handle I/O node failures separately from failures in non-I/O nodes must be found.

The implementation chosen for the software layers will cause additional requirements to be placed on the processing node fault tolerance layer. These are called derived requirements, and are discussed in section 5.1. Chapter 5 discusses the implementation of hardware fault tolerance to satisfy the requirements stated in this section as well as the derived requirements of section 5.1.

3.4.3 Interconnection Network Fault Tolerance.

The communication fault-tolerant layer is the lowest fault-tolerating layer in the URMC. This layer must transmit message packets between nodes reliably. If a message is sent between a pair of nodes, then the communication fault tolerance layer guarantees that it will be delivered correctly within some time limit.


To increase the dissimilarity of the parallel algorithms, different virtual hardware configurations may be assumed, which are then mapped to the physical hardware configuration. For instance, three versions of software may be prepared, with one designed to run on a tree network, one on a shuffle-exchange, and one on a mesh. These must then be mapped to the physical architecture. The easiest mapping is to a completely connected machine. As it is impractical to provide a completely connected multicomputer with more than a few processors, a communication scheme will be designed which provides a reliable virtual completely connected machine using the services of a fault-susceptible, partially connected physical network.

The interconnection network should not be liable to destruction by a single incident of damage to the airplane, such as an engine explosion. Thus the interconnection network must be redundant and physically distributed. In addition, further derived requirements are imposed upon the communication system by the implementations chosen for the software and processing node fault tolerance. These derived requirements are discussed in section 6.1. Chapter 6 describes an implementation scheme for the communication layer which satisfies the requirements of this section as well as those of section 6.1.

3.5 Top-Level Reliability Model.

A pessimistic approach to determining the reliability is to assume that a failure of any one of the three fault-tolerating layers will cause the system to fail. This may not always be true. For instance, loss of communication between a pair of nodes is not bad until those two nodes wish to communicate, and loss of all hardware nodes capable of running a software task is not bad unless that task should be run. However, the series approximation will be a decent lower bound on reliability, so the top-level reliability diagram chosen for the URMC is that of Figure 3.5-1.

[Figure: three blocks in series: software, processing node, and communication fault tolerance.]

Figure 3.5-1. Top-level reliability diagram for URMC.
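The series approximation of Figure 3.5-1 can be evaluated with the minimal Python sketch below. The individual layer failure probabilities used in the example are illustrative placeholders only, not the allocations made later in the thesis.

    # Sketch of the series (pessimistic) reliability approximation: the system
    # is assumed to fail if any of the three fault-tolerating layers fails.
    def series_reliability(layer_failure_probs):
        r = 1.0
        for p in layer_failure_probs:
            r *= (1.0 - p)          # every layer must survive
        return r

    p_software, p_node, p_comm = 8e-11, 1e-11, 1e-11   # assumed per-hour values
    r = series_reliability([p_software, p_node, p_comm])
    print(1.0 - r)    # system failure probability, approximately the sum (about 1e-10)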

The failure rate goal for a critical system such as a flight control computer is 1 x 10^-10 failures per hour [Gluc86]. This total failure rate must be distributed amongst the three blocks of the URMC reliability diagram. It is very difficult to achieve high reliability in software due to the fact that many failures are correlated between versions [Knig86]. Therefore it will be assumed that most of the total allowable system failure rate will be due to software failures, with the failure rate of the processing node and communication fault tolerance being much lower.

4. Software Fault Tolerance.

This chapter discusses software fault tolerance in the URMC. In section 4.1, the two most used fault-tolerant constructs for sequential software are described: the recovery block [Rand75] and n-version programming [Aviz85]. Section 4.2 reviews three constructs which have been proposed to extend the recovery block concept to concurrent software, programmer-transparent coordination [KimK86], the conversation [Rand75], and the colloquy [Greg85]; reviews a construct which extends n-version programming to concurrent software, modular redundancy in CSP [Manc86a]; and reviews a construct which may be used to implement n-version programs in distributed systems, the resilient procedure [LinK86]. New work begins in section 4.3, which compares the fault-tolerant constructs of sections 4.1 and 4.2 for use in the URMC. Section 4.3 decides that n-version programming best meets the fault tolerance and real-time requirements of most software in a critical control computer. While two of the constructs reviewed in section 4.2 extend n-version programming to concurrent software, it is determined that more work must be done. Section 4.4 suggests three types of concurrent n-version programming (CNVP): they are process-dissimilar, subprogram-dissimilar, and structure-dissimilar CNVP. For each of the three CNVP constructs, an implementation based upon a recovery layer [Fern89a] is suggested. In section 4.5 a reliability model for the three types of concurrent n-version programming is derived by extending a previously presented model [TsoK86] for sequential n-version programming. Section 4.6 concludes with a summary of software fault tolerance in the URMC.

4.1. Sequential Constructs.

Several software constructs which can be used to implement software fault tolerance in sequential programs have been suggested, chiefly the recovery block [Rand75] and n-version programming [Aviz85]. Both rely upon redundant components with diverse designs for fault tolerance. The constructs are thus similar in their approach to fault tolerance, but differ in their approach to redundancy management. Data diversity [Amma87] has been suggested as a simpler alternative to design diversity, or as a source of additional diversity in systems built with design diversity.

4.1.1 Recovery Blocks.

A recovery block [Rand75] consists of a recovery point, an acceptance test, and several try blocks. The syntax of a recovery block is:

    ensure acceptance_test
        by try_1
        else by try_2
        ...
        else by try_n
        else fail
    end ensure

The semantics of a recovery block are as follows. At the entry to the block, a recovery point is established. The states which all variables modified in the recovery block have at the recovery point are saved. The first try block is executed, then the acceptance test is performed upon the result. If the result passes the acceptance test, the modified variables stay modified, the recovery point information is discarded, and the recovery block is exited. If the result of the first try block fails the acceptance test, the variables modified during the course of execution of the try block are restored to the values which they had at the recovery point, the second try block is executed, and the acceptance test is repeated. The process continues until either a try block's results pass the acceptance test or all n try blocks have been tried and have failed. If all n try blocks fail, the recovery block fails.

When a try block's results do not pass the acceptance test, all data is restored to the value it had prior to the try block's execution. Fault recovery is therefore performed automatically, and no further recovery is necessary. Extra time must be allowed for the execution of a recovery block in case of failure, to allow time for the data to be restored and for the extra try block(s) to be executed.
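As an illustrative sketch only, the recovery block semantics can be modelled in a few lines of Python. Checkpointing here is simulated by copying a dictionary of program variables; a real implementation would save only the variables modified inside the block.

    # Minimal sketch of recovery-block execution with rollback between tries.
    import copy

    class RecoveryBlockFailure(Exception):
        pass

    def recovery_block(state, acceptance_test, try_blocks):
        recovery_point = copy.deepcopy(state)        # establish the recovery point
        for try_block in try_blocks:
            try:
                result = try_block(state)
                if acceptance_test(result, state):   # result accepted: discard checkpoint
                    return result
            except Exception:
                pass                                 # a raised error also counts as a failure
            state.clear()
            state.update(copy.deepcopy(recovery_point))   # roll back before the next try
        raise RecoveryBlockFailure("all try blocks failed the acceptance test")

    # Example: the primary try block produces a negative square-root estimate and
    # fails the acceptance test; the alternate is then run on the restored state.
    state = {"x": 2.0}
    primary = lambda s: -abs(s["x"]) ** 0.5
    alternate = lambda s: s["x"] ** 0.5
    accept = lambda result, s: result >= 0 and abs(result * result - s["x"]) < 1e-6
    print(recovery_block(state, accept, [primary, alternate]))   # 1.4142...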

Several variations are possible on this basic structure. In one variation of the recovery block, the first try block attempts the complete computation, while subsequent try blocks attempt to compute only critical answers, or perform a less precise computation [KimK84]. This could require more than one type of acceptance test.

Another variation on the recovery block is the distributed recovery block [KimK89], in which a recovery block with n tries is replicated on n different processors. Each copy of the recovery block has the same try blocks, but they do not execute the tries in the same order. For example, a two-version recovery block is distributed to two processors, one the primary processor and the other the alternate. One processor uses version A as its primary try block and version B as its alternate, while the other processor uses B as the primary and A as the alternate. Both processors execute their primary try block, then check the results with the acceptance test. If both results pass, the results of the primary processor are output and the secondary processor's results are ignored. If the primary processor's results pass and the secondary processor's results do not, then the secondary uses its alternate try block to compute the correct result, thus recovering from the failure. If the primary processor's results do not pass the acceptance test and the secondary's results do, then the secondary processor becomes the primary and vice versa. The new secondary (old primary) processor then recovers from the failure by executing its alternate try block. If both processors' results fail the acceptance test, then the recovery block fails.
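The decision logic of the two-processor distributed recovery block can be sketched as follows. Both "processors" are simulated within one Python process and the try blocks are plain functions; these are simplifying assumptions of the sketch, not the distributed implementation of [KimK89].

    # Sketch of the two-node distributed recovery block (DRB) decision logic.
    def drb_step(nodes, acceptance_test, data):
        # nodes: list of two dicts with keys 'role', 'primary_try', 'alternate_try'.
        primary = next(n for n in nodes if n["role"] == "primary")
        shadow = next(n for n in nodes if n["role"] == "shadow")
        p_result = primary["primary_try"](data)
        s_result = shadow["primary_try"](data)
        p_ok, s_ok = acceptance_test(p_result), acceptance_test(s_result)

        if p_ok and s_ok:
            return p_result                            # shadow's result is ignored
        if p_ok and not s_ok:
            shadow["alternate_try"](data)              # shadow recovers with its alternate try
            return p_result
        if s_ok and not p_ok:
            primary["role"], shadow["role"] = "shadow", "primary"   # role switch
            primary["alternate_try"](data)             # old primary recovers with its alternate
            return s_result
        raise RuntimeError("distributed recovery block failed on both nodes")

    accept = lambda r: r >= 0
    nodes = [
        {"role": "primary", "primary_try": lambda x: x * x, "alternate_try": lambda x: abs(x) * abs(x)},
        {"role": "shadow", "primary_try": lambda x: -1, "alternate_try": lambda x: x * x},
    ]
    print(drb_step(nodes, accept, 3))   # 9: primary passes, shadow recovers locally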

4.1.2 N-Version Programming.

An n-version program [Aviz85] has n versions of software which perform the same end computation by differing means, as do the try blocks in a recovery block. However, n-version programming uses voting to establish correctness rather than an acceptance test. Every version of software is executed, then the results are voted to pick one answer. An n-version program has the form:

    par
        version_1
        version_2
        ...
        version_n
    end par
    vote result

The versions may be executed sequentially or in parallel, as long as all

of the versions' results are computed before voting - usually they are executed

in parallel. The voter may use any of several techniques, such as selecting the

majority's value or selecting a middle value.
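For example, a majority voter and a middle-value voter for three versions could be sketched as follows; the comparison tolerance is an assumed parameter, since correct versions of numerical software may differ slightly.

    from statistics import median

    def majority_vote(results, tolerance=0.0):
        """Return a value that a majority of versions agree upon, else None."""
        for candidate in results:
            agreeing = sum(abs(r - candidate) <= tolerance for r in results)
            if agreeing > len(results) // 2:
                return candidate
        return None                              # no majority: the vote fails

    def midvalue_vote(results):
        """Return the middle value; one wrong version cannot be selected."""
        return median(results)

    print(majority_vote([4.0, 4.0, 97.3]))       # 4.0, the bad version is masked
    print(midvalue_vote([4.0, 4.1, 97.3]))       # 4.1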

Upon the failure of one version, the output of the majority voter remains

correct, thus masking the failure. If the software version which failed has any

internal states, they may be wrong. It is desirable to correct this problem so

that the version may provide a correct output later if another version fails, thus

helping to mask that failure. A method for accomplishing this is Community

Error Recovery (CER) [TsoK86], which uses default exception handling to

achieve forward error recovery from errors in n-version software. The CER method is based upon two levels of recovery: cross-check

points (cc-points) and recovery points (r-points). The cc-points allow recovery from errors which produce wrong results but do not interfere with the correct

control flow in the version. The r-points allow recovery from errors which cause

incorrect control flow in the failed version.

At the cc-points, the versions send their current result to a supervisor program. The supervisor runs a decision function to arrive at a decision result,

and this value is sent back to the versions. The versions use the decision result in subsequent computations regardless of whether it agrees with their

own result or not. This will often be enough to recover from a failure, as was

demonstrated in [TsoK87].

At recovery points, each version submits a recovery point identification

to show that it is at the recovery point. If a version sends the supervisor a

missing or incorrect recovery point then the supervisor activates a state-output exception handler in every good version. The state-output exception handler

outputs the values of all internal states in a version to the supervisor. The

internal variables from all good versions are used in a decision function to

arrive at a decision state. The supervisor then activates a state-input exception

handler in the bad version, which replaces the internal state of the bad version

with the decision state.
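A rough sketch of the supervisor's side of CER is given below; the Version class and its state-output and state-input handlers are hypothetical stand-ins for the interface the scheme assumes, and the decision functions are left as parameters.

    class Version:
        """Stand-in for one software version with CER exception handlers."""
        def __init__(self, state):
            self.state = dict(state)
        def state_output(self):            # state-output exception handler
            return dict(self.state)
        def state_input(self, decision):   # state-input exception handler
            self.state = dict(decision)

    def cc_point(decide, version_results):
        """Cross-check point: vote the intermediate results and return the
        decision value that every version must use from here on."""
        return decide(version_results)

    def r_point(decide, versions, submitted_ids, expected_id):
        """Recovery point: a missing or wrong identification triggers
        recovery of the failed version from the good versions' states."""
        for version, rp_id in zip(versions, submitted_ids):
            if rp_id != expected_id:
                good_states = [v.state_output() for v in versions if v is not version]
                version.state_input(decide(good_states))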


4.2. Concurrent Constructs.

Several concurrent constructs have been derived by extending sequential constructs to concurrent software. The recovery block concept has been extended as programmer-transparent coordination [KimK86], conversations [Rand75], and colloquies [Greg85]. The n-version programming concept has been extended as modular redundancy in communicating sequential processes [Manc86].

4.2.1 Recovery Block Extensions.

When extending the recovery block to concurrent software, the chief problem is to avoid the domino effect [Rand75] when a recovery block fails its acceptance test. Figure 4.2.1-1 illustrates the problem with three processes, each of which has entered four nested recovery blocks that it has not yet exited. The brackets indicate the recovery points of the recovery blocks and the dashed lines indicate interactions between processes. If process 1 fails, it will back up to its fourth recovery point. If process 2 fails, it will back up to its fourth recovery point past an interaction with process 1, which must therefore back up to its third recovery point. If process 3 fails, all processes will end up rolling back to their first recovery points.

The domino problem can occur any time that:

1) the recovery block structures of the various processes are
   uncoordinated, and
2) either member of any pair of interacting processes can cause the other
   to back up.


Figure 4.2.1-1. Domino effect from uncoordinated recovery blocks.

4.2.1.1 PTC.

The programmer transparent coordination (PTC) scheme allows processes to be programmed with recovery blocks without burdening the program designer with the task of coordinating the recovery points of interacting processes. The original PTC scheme [KimK84] relies upon a centralized monitor to handle communication between processes. Recently

[KimK86], the PTC has been extended to loosely coupled networks, called

PTC/LCN. The latter scheme will be reviewed here.

The computation model used is a loosely coupled network, where there are multiple processing nodes with no shared memory. A process which sends information to another is called an exporter process, while the receiver is called an importer process. The processes communicate through point-to­

point messages which are handled by an underlying communication

subsystem. The communication subsystem is assumed to deliver a message to its destination within a bounded communication delay without loss of the content of the message.

There are four major elements to the PTC/LCN scheme:

1) Each process performs error detection and recovery using recovery
   blocks, with no error-handling coordination between processes.

2) An exporter process may send an importer process a message which
   contains material which is not yet fully validated; thus a message may
   be sent during execution of a try in a recovery block, and if the
   message is later invalidated, the exporter revokes the message.

3) Each process is responsible for detecting and correcting all errors
   which it originated. The processes must accept all imported data as
   correct unless the exporter revokes the message.

4) Before a process imports uncommitted information, a recovery point is
   automatically inserted by the underlying machine. Thus if the
   information is later revoked by the exporter, the importer will not have
   to roll back any further than the point at which it got the bad
   information.

This new type of recovery point is termed a branch recovery point to distinguish it from the recovery points established at the beginning of a recovery block, which are called base recovery points.

If a process imports information which may be revoked, it is called a direct dependent upon the exporter, and the exporter is called a direct potential recaller of the importer. If the exporter may have to revoke the information when information which it previously received from a third process is revoked, then the third process is called an indirect potential recaller of the importing process. A branch recovery point is established when a process imports information from an exporter which has a direct potential recaller which is not a direct potential recaller of the importing process.

If an acceptance test is passed for a recovery block, but branch

recovery points still exist in that block, it is partially validated. A branch recovery point may be discarded when all processes in its potential recall set

(direct or indirect) have been partially validated. A recovery block which has passed the acceptance test and which either contained no branch recovery

points or had them all discarded is completely validated. A base recovery

point may be discarded when its recovery block has been completely validated. The PTC/LCN simplifies the design of the program at the expense of a more complicated underlying machine and the loss of the ability to detect failures in other processes.


4.2.1.2 The Conversation.


Figure 4.2.1.2-1. A conversation with three processes.

A conversation [Rand75] is a two-dimensional recovery block (see Figure 4.2.1.2-1) which spans two or more processes. It prevents the domino effect by forcing the programmer to coordinate the recovery points in different processes. At the start of the conversation is a recovery line, at which each process establishes a recovery point. Sidewalls are erected which prevent the processes in the conversation from communicating with any process outside the conversation. The processes then proceed concurrently, executing their primary try block, until they reach the test line. At the test line, a local acceptance test is executed in each process. If all local acceptance tests pass, the recovery line values are discarded, the sidewalls are taken down, and execution continues. If any acceptance test fails, all of the processes in the conversation restore their state to that of the recovery line, then execute an alternate try block.

4.2.1.3 The Colloquy.

The colloquy [Greg85] was introduced to provide a backward error recovery mechanism which is more general than the conversation. A colloquy

is a group of processes which work together to achieve a common goal, executing one or more dialogs. A dialog is like a single try block in a conversation. A dialog is an occurrence in which a set of processes: 1) establish individual recovery points,

2) communicate among themselves and no others,

3) determine whether all should discard their recovery points and proceed

or restore their states from their recovery points and proceed, and

4) follow this determination.

A dialog differs from a conversation in several ways:

1) there may be a time limit on the execution,

2) there may be a global acceptance test which is performed in addition to

the local acceptance tests performed in each process,

3) if the local and global acceptance tests can be passed without the
   participation of some of the processes, then the dialog can succeed
   without the participation of those processes, and
4) after a rollback, the processes are free to participate in another dialog
   with different processes as their alternate approach - they do not need

to try again with the same processes with which they just failed. A colloquy is an execution time concept, a collection of dialogs which attempts to accomplish some global goal. Entry of a process into a colloquy is

announced by a dialog sequence, which has the following syntax:

    select
        attempt_1
    or
        attempt_2
    ...
    or
        attempt_n
    time-out
        sequence_of_statements
    else
        sequence_of_statements
    end select;

When the process reaches the SELECT statement, it establishes a

recovery point. The process then attempts to perform the procedure in

attempt_1 , which may be to engage in a dialog with some other processes. If

that dialog fails, the process will try the other attempts, which may be other

dialogs with other sets of processes. If any dialog succeeds, the SELECT statement succeeds. If the time limit expires before all attempts are completed,

the execution of the dialog which the process is currently executing will fail, and the sequence of statements after TIME-OUT will be executed. If all attempts have failed before the TIME-OUT, the sequence of statements after the ELSE statement will be executed.
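The control flow of the dialog sequence can be sketched as follows; each attempt is assumed to be a callable that runs one dialog and reports success or failure, and the time limit is checked only between attempts, which is a simplification of the construct.

    import time

    def dialog_sequence(attempts, time_limit, on_timeout, on_all_failed):
        """Execute the attempts of a SELECT statement in order.

        A recovery point would be established on entry; it is implicit here."""
        deadline = time.monotonic() + time_limit
        for attempt in attempts:
            if time.monotonic() > deadline:
                return on_timeout()          # TIME-OUT branch
            if attempt():                    # this dialog succeeded
                return True
        return on_all_failed()               # ELSE branch: every dialog failed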

4.2.2 N-Version Programming Extensions.

When extending n-version programming to concurrent software, the

chief problem has to do with the nondeterminism which is possible in several

concurrent software models. In a nondeterministic situation a correct program may take any of several actions. Correct versions could obtain completely

different results by performing different correct actions.

4.2.2.1 Modular Redundancy in CSP.

The CSP model of concurrent computation [Hoar78] represents a

concurrent program as several sequential processes which communicate only

through the synchronous send and receive of messages over predefined channels. Mancini suggests [Manc86] that fault tolerance can be added to the

CSP model by allowing a process P to be replicated n times, making an n­

redundant module nP. For hardware fault tolerance, one implements the

processes identically and maps them to n different processors. If software fault

tolerance is desired, the n processes have the same message passing and

receiving specifications, but one implements the n versions dissimilarly. The

notation of CSP is extended to allow modular redundancy of some or all of the processes. To prevent the problem of nondeterminism causing correct versions to reach different results, two chief requirements are imposed:

1) all the copies P1 ... Pn of a multiple module nP must resolve
   nondeterminism in an identical manner, and
2) all copies of a multiple module must process in an identical order the
   contents of all input channels which are used in some input guards of
   an alternative command.

Two solutions to the latter requirement are presented. The first uses a centralized communication structure to ensure that all inputs arrive in the same

order at all n modules. The second solution is a distributed solution, and uses

the services of the distributed system kernel. The kernel in each processor which has one of the n copies of a process P arrives at a distributed agreement with the other n - 1 kernels about the order of the messages. As

the latter solution will be of interest later in this thesis, it is presented here in

more detail.

The following assumptions are made:

1) a message will be correctly delivered within some time interval, delta, if
   it is sent (i.e., none are lost),
2) processors communicate only by means of two-party messages,
3) the sender of a message is always identifiable by the receiver, and
4) all nonfaulty processors have clocks that have been synchronized.

The sending of a message from a module Q to a multiple module nP

proceeds as follows:


1) The sender process Q sends a copy of the message to the kernels
   (N1 ... Nn) in the nodes containing the n copies of the receiver process
   (nP, or P1 ... Pn).

2) Each kernel periodically starts an instance of an algorithm for
   interactive consistency. Each message is uniquely identified by the
   triple (sender, receiver, msg_type). The kernels arrive at an interactive
   consistency on the identifiers for the messages which they have
   received.

3) Based upon the interactive consistency vector, each processor
   determines the set I, the union of all messages which have been
   received by any of the instances N1 ... Nn.

4) The processor waits the time interval delta for any messages which it
   has not received. If a message never comes, it is given the value
   NULL. Due to assumption 1, this can only happen if a faulty Nj has
   informed the others of receiving a message which it did not actually
   receive. Since all nonfaulty processes will mark the message contents
   as NULL, they will still be in agreement.

5) Each kernel orders the messages in I according to a priority list
   referring to the senders. Each message receives a mark which specifies
   the ordering.

6) Each kernel Ni inserts the marked messages into the channels leading
   to Pi.

Note that distributed agreement is not reached upon the contents of the messages, but only upon the identifiers. This is possible because of assumption (1), which assumes all messages are correctly transmitted. If the sending process is multiple, the kernels run a voting algorithm to choose a value to send to Pi.
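Steps 3 through 6 might be sketched as below, under the assumption that the interactive consistency exchange has already left every kernel with the same agreed set of message identifiers; the wait of length delta for late messages is omitted.

    NULL = None

    def order_agreed_messages(agreed_ids, received, sender_priority):
        """Build the identically ordered input sequence at one kernel.

        agreed_ids      - identifiers (sender, receiver, msg_type) in the
                          agreed set I
        received        - identifier -> contents actually received here
        sender_priority - list of senders, earlier means higher priority"""
        ordered = sorted(agreed_ids,
                         key=lambda ident: sender_priority.index(ident[0]))
        # A message reported by others but never received here is marked NULL;
        # by assumption 1 this only happens when the reporter was faulty, so
        # every nonfaulty kernel marks it NULL and they remain in agreement.
        return [(ident, received.get(ident, NULL)) for ident in ordered]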

4.2.2.2 Resilient Procedures.

The resilient procedure [UnK86] was developed to implement fault-tolerant constructs in distributed systems. It is mentioned here because it could be adapted as a means to implement n-version programming.

A resilient procedure appears as a single procedure to the outside world, but actually consists of a group of processes, one coordinator and n cohorts, each located in a different node of the network. The coordinator is the process invoked upon a call to the resilient procedure. When the coordinator process receives a request for a computation from a calling process, it invokes its cohort processes. These may execute the same software if hardware fault tolerance is the only tolerance needed, or may execute different versions of software in parallel for both hardware and software fault tolerance.

When finished computing, the cohorts send their answers to the coordinator.

The coordinator may either compare the results or perform acceptance tests to determine their validity.

The problem of nondeterminism is solved by routing all calls through the coordinator. The cohorts then perform the tasks presented to them by the coordinator in a deterministic fashion. If an n-version program were to be implemented, each cohort would be implemented with a different version of the software process or subprogram. Since all versions of the software execute simultaneously in the cohorts, the execution time required when failures are present is the same as when they are not present. However, the coordinator is a single failure point. The

cohorts can monitor the coordinator to detect a failure, then replace it with a

new coordinator if it fails. This requires some sort of acceptance test on the coordinator, with the same problems of safety and reliability that the acceptance test for a recovery block has. An additional problem is that time

is necessary for the reconfiguration.
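The coordinator's role can be sketched in a few lines; the cohort callables and the decision function are placeholders, coordinator monitoring and replacement are not shown, and the cohorts are invoked sequentially here although they would run in parallel.

    def coordinator(requests, cohorts, decide):
        """Coordinator of a resilient procedure.

        All calls are routed through the coordinator, which presents them to
        the cohorts in a single deterministic order; each cohort may run a
        different version of the software."""
        results = []
        for request in requests:              # one agreed order for all cohorts
            answers = [cohort(request) for cohort in cohorts]
            results.append(decide(answers))   # compare or vote the answers
        return results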

4.3 Selecting a Fault-tolerant Construct.

In order to pick the best fault-tolerant software construct for real-time multicomputing, the characteristics and overheads of the constructs must be

compared and considered. This will be done by first picking which of the two

main sequential fault-tolerant constructs is best for sequential real-time

software. Then a construct which extends the chosen sequential construct to

concurrent software is chosen.

The characteristics which distinguish real-time systems from ordinary systems are [Hech76]:

1) processing must be completed within a hard time deadline, and

2) processing must continue despite faults and the occurrence of unusual

circumstances.

For example, in an on-line system such as an airplane reservation system, speed is desirable from the user's point of view, but if an occasional transaction takes too long, nothing serious will occur. Also, while it may be very important that the passengers be booked correctly, if a problem is detected the system may be shut down for a time while the problem is corrected. In contrast, for a real-time aircraft control system missing a processing deadline could cause loss of control of the aircraft, and for control of an unstable aircraft, the system may not shut down for more than a few milliseconds without catastrophic consequences.

4.3.1 Sequential Software.

The characteristics and overheads of the two sequential software fault­ tolerant constructs reviewed in section 4.1 are shown in figures 4.3.1-1 and

4.3.1-2. These characteristics and overheads will be considered one at a time to choose the best sequential fault-tolerant construct for real-time computing.

First the characteristics will be covered. The best error processing technique is not immediately apparent, and so judgement on this will be postponed until below.

The judgement on result acceptability is an important characteristic for determining which of the two constructs is better for real-time systems. For certain tasks with well defined results, such as sorting of a list, the acceptance test is clear and simple. However, for real-time computation such as control loops or navigation, the expected result is often not clear. In [Hech76] the

problem of acceptance tests for fault-tolerant real-time software is addressed, and it is suggested that a reasonableness check on the value or rate of change of a value will usually be the acceptance test used. This has two disadvantages: the designer of the acceptance test must be aware of what the reasonable values are, and wrong values that are reasonable will not be detected.

Recovery Blocks:
    Error processing technique:          error detection by acceptance test and
                                         backward recovery
    Judgement on result acceptability:   absolute, made with respect to the
                                         specification
    Version execution scheme:            sequential
    Consistency of input data:           implicit, from the backward recovery
                                         technique
    Suspension of service delivery
    during error processing:             yes, for the duration necessary to
                                         execute one or more try blocks

N-Version Programming:
    Error processing technique:          vote
    Judgement on result acceptability:   relative, made with respect to the
                                         versions' results
    Version execution scheme:            parallel
    Consistency of input data:           explicit, from the use of dedicated
                                         mechanisms
    Suspension of service delivery
    during error processing:             no

Figure 4.3.1-1. Characteristics of fault-tolerant constructs.

One example given by Hecht is an East/West position routine in an

aircraft navigator. The position is computed, and the change in position is

compared to the previously computed change in position with some tolerance.

To determine that tolerance, it is necessary for the acceptance test

programmer to know a lot about aircraft navigation and expected flight patterns, and the tolerance must be large enough to allow the aircraft to execute turns without causing the acceptance test to fail. Even if the software were to fail in such a way as to indicate that an aircraft in straight and level flight had turned 180°, the failure would not be detected if the erroneous turn occurred at a reasonable pace. Additional reasonableness checks may be added to increase the fault coverage, but this may quickly grow

unmanageable. Take for example an aircraft pitch control system. The aircraft controller may have a different mode for each mission of the aircraft: a fighter would have modes such as air-to-air, air-to-ground, landing, takeoff, etc. The reasonable results would be different in each mode. Reasonable results would also vary depending upon the aircraft speed, altitude, weight, center-of­ gravity, throttle level, etc. Some of these may change suddenly, e.g. weight changes suddenly when ordnance is released. The acceptance tests grow

with each variable which must be considered. The design of the tests is a manual operation, and as the acceptance test complexity rises, so does the

probability that an important test will be neglected or an erroneous test inserted. Voting is a relative procedure only. All that is required is to ensure that correct versions of the software will reach the same result within some tolerance. Determination of that tolerance is the only problem in the voter which requires detailed knowledge of the application.
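The contrast can be made concrete with a small sketch; the degree tolerances and the rate-of-change check are illustrative assumptions in the spirit of Hecht's example, not values from the text.

    def position_acceptance_test(new_delta_deg, previous_delta_deg, tolerance_deg=2.0):
        """Absolute reasonableness check: the computed change in East/West
        position is compared with the previously computed change.  A wrong
        result that changes at a plausible rate still passes."""
        return abs(new_delta_deg - previous_delta_deg) <= tolerance_deg

    def relative_vote(version_results, tolerance_deg=0.1):
        """Relative judgement: correct versions need only agree with each
        other within a tolerance; no model of reasonable flight is needed."""
        for candidate in version_results:
            if sum(abs(r - candidate) <= tolerance_deg for r in version_results) >= 2:
                return candidate
        return None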

Parallel execution is best for n-version systems with real-time deadlines,

as this reduces the total time required. Most n-version programs are executed

in parallel, and most recovery blocks in series. However, it is possible to execute recovery blocks in parallel [KimK89], so this characteristic does not

indicate which of the two constructs is better. Explicit consistency of input data requires a lot of overhead, so it appears that the recovery block is best from this standpoint. However, this advantage is reduced by the fact that input sensors will generally be redundant and will require voting overhead even in systems which use implicit consistency. In a real-time flight control system, suspension of service delivery should be for at most a few milliseconds. The suspension for the extra time required for executing try blocks is often unacceptable. If the different tries are executed in parallel as in the distributed recovery block [KimK89], the results are available after execution of a single try block, but additional time is still necessary before the next execution of the recovery block to allow for execution of additional try blocks in the versions which failed the first time.

Next the overheads of the constructs will be considered.

Both the recovery block and n-version programming require two extra modules, so they are tied when compared for overhead in the diversified software layer. An n-version program voter is generally fairly simple, but guaranteeing interactive consistency requires several message exchanges between

versions. A recovery block must save its values, either in a recovery cache or

by some other means. Which of the two methods has less overhead depends

too much on implementation for a clear comparison here.

To determine the construct with the lowest operational time overhead, the number of input variables versus the number of internal variables must be considered. If the number of internal variables is small relative to input variables, storing values as the recovery block does would take less time than the input consistency checking of n-version programming. If there are many internal variables compared to inputs, the input data consistency would take less time. In most real-time control systems, I/O is intensive, so the recovery block would have the lower overhead. However, the recovery block's advantage is partially negated by the need to vote input sensor values.

Recovery Blocks:
    Structural overhead:                     one extra version and the
                                             acceptance test
    Operational time overhead (ongoing):     saving of the recovery point data
    Operational time overhead (on error):    state restoration plus execution
                                             of additional try block(s) and the
                                             acceptance test

N-Version Programming:
    Structural overhead:                     two extra versions and the voter
    Operational time overhead (ongoing):     input data consistency, version
                                             execution synchronization, and
                                             voter execution
    Operational time overhead (on error):    usually negligible

Figure 4.3.1-2. Overheads of constructs.


Voting and comparison are fairly simple operations, so a voting scheme will generally execute faster than acceptance tests. This is especially true for

real-time control systems, where there is often no easy test to run on the

output to determine if it is acceptable.

Operational time overhead on error occurrence is negligible for voting

systems without recovery. For systems with recovery, it is likely to take longer, but that can be run in the spare time as it is not critical to complete recovery

before the real-time deadline. In contrast, executing an entire new try block and rerunning the acceptance test would take considerable time.

In summary, the recovery block has more execution time overhead

upon error occurrence than an n-version program, due to the time required for

backward error recovery, and the necessary acceptance tests will often be too

complicated to complete in a reasonable time for many real-time applications.

N-version programming should generally be used in real-time applications

because of the relative judgement of result acceptability and the low overhead on error occurrence.

Checking industry practice shows that these conclusions are

supported, as all four examples discussed in chapter 2 use n-version

programming.

4.3.2 Concurrent Software.

The results of the sequential comparison are used to find and pick a

good construct for fault-tolerant concurrent software. The conversation,

programmer-transparent coordination, and the colloquy are all concurrent extensions of the recovery block, and have the same disadvantages for real-time control systems that the recovery block has: extra time required for backward error recovery and the need for an absolute acceptance test.

Therefore, we choose n-version programming for use as the fault-tolerant construct in the URMC. Unfortunately, not much work has been performed on concurrent n-version programming (CNVP). Mancini has suggested adding modular redundancy to CSP [Manc86a]. This method relies upon dissimilar implementations of each process to achieve fault tolerance. However, the message passing requirements for each process are identical, reducing the amount of dissimilarity possible between versions. It has been shown that correlated errors between software versions are a major problem with n-version programming, even if the specification of the software is correct [Knig86]. Different programmers tend to make the same mistakes when presented with a problem. This source of correlated errors could be reduced by increasing the dissimilarity between versions. The next section suggests ways to increase dissimilarity between versions.

4.4 Concurrent N-Version Programming (CNVP).

In this section it is proposed that as there are three constructs which

extend the recovery block concept to concurrent software, so there should be

three constructs which extend n-version programming to concurrent software.


These constructs are named process-dissimilar, subprogram-dissimilar, and structure-dissimilar CNVP; they are corollaries to programmer-transparent coordination, the conversation, and the colloquy, respectively. The three schemes for extending recovery blocks require different levels of similarity between the computational tries. PTC relies upon establishing a new recovery point at specific communications between processes, so all of the try blocks in a process must have the same communication structure. The conversation extends the recovery block over several processes, so that there may be different communication patterns in the different try blocks of the participating processes; however the same processes participate in each of the try blocks. Also the results of different tries in a conversation must all be left in the same processes, as local acceptance tests are used to determine validity. The colloquy extends the conversation by allowing the try blocks to have different processes participating, and to use a global acceptance test. Hence the tries may have completely different computational structures. The three different n-version programming derived constructs have the same dissimilarity requirements as the three different recovery block extensions. For describing possible implementations of the constructs, the concurrent software will be assumed to consist of several sequential processes which communicate using message passing. Implementations will be proposed based upon the concept of a recovery layer [Fern89a]. The fault tolerance will be application transparent [Anco87], i.e., system behavior in the

event of faults is specified with minimal modifications to the application software. Application transparent strategies can be further divided into:

1) application independent - all policies which can be applicable to
   arbitrary programs,
2) application implicit - activation of fault tolerance actions is associated
   with the occurrence of predefined critical constructs of the application
   language, and
3) application explicit - the application software specifies a minimal set of
   application dependent entities.

The CNVP constructs can be used with any of the other three communication models (monitors, remote actions, or generative communication [Gele85]) as well, but this will not be described. As message passing can be used to implement any of these schemes, simple transformations should be sufficient to change the implementations outlined here into implementations suitable for other communication models.

4.4.1 Process-Dissimilar CNVP.

Process-dissimilar CNVP is the NVP corollary to PTC. As with the recovery blocks of PTC, there are n different versions of each process. However, instead of establishing a recovery point at each message receive, a vote is performed at each message receive (see Figure 4.4.1-1). The dissimilarity between versions is limited by the requirement that each version must have the same message send and receive requirements.

However, fault diagnosis is good because voting at every message receive masks a fault in one of the versions of a duplicated sequential process before that failure can propagate to the processes which import data from the failed process. The modular redundancy in a message passing system proposal of Mancini [Manc86a] uses the synchronous communication of CSP, so it is a special case of process-dissimilar CNVP.

Figure 4.4.1-1. Process-dissimilar CNVP.

To construct a process-dissimilar program, first the application designer should specify the processes and messages in the system. The application designer must also specify how nondeterminism is to be resolved, as each process should behave in the same way given the same input messages. [Manc86a] extends the CSP notation of [Hoar78] to describe such an approach, and discusses means to ensure input data consistency between versions, but recovery is not considered. An implementation with recovery based upon the idea of a recovery layer will be presented. The recovery layer is visible to the fault tolerance designer. On this layer, each application process with i message input channels and j message output channels can be viewed as an abstract entity assembled from the following recovery layer components: n separate versions of the process, n x i input consistency processes, and n x j voting processes (see Figure 4.4.1-2). The input consistency technique can be the same as that described by Mancini, which has already been reviewed in section 4.2.2.1. While Mancini suggested putting the input consistency function in the system kernel of each processor, it should instead be placed in the processes in the recovery layer.

This places all functions concerned with redundancy management in the

recovery layer and thus increases the modularity of the design. Mancini's plan did not allow for recovery. If fault masking is desired without recovery afterwards, the voter could be implemented as a single

process, executing on a different resource than the application processes. If recovery is desired, the Community Error Recovery scheme suggested by Tso

[TsoK86] and reviewed in section 4.1.2 may be used. A process in charge of the recovery of an application should have full visibility of the application's variables [Fern89a]. Therefore, the voter process should be distributed into several voter processes, with a voter process on each resource which is executing an application process. This has the disadvantage of requiring the voters to use an algorithm for interactive consistency [Lamp82], but has two advantages when implementing the CER scheme: at cc-points the voter can communicate values back to the application process without burdening the interprocessor communications network, and at r-points the underlying processor hardware can give the voter processes the authority to stop the application processes and to invoke the state-output and state-input exception

handlers. Figure 4.4.1-2 already has this configuration shown. A method for giving recovery processes authority over application processes is described in

[Ozak88]. An application implicit strategy is used to implement the cc-points. Whenever a message is sent, a vote is performed, and then the decision value

is returned to the application process and used in subsequent computations.

If the programmer is aware of this, he may structure the software in such a

way as to make recovery more likely.
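A sketch of this application implicit strategy is given below; the channel object, the function that collects the corresponding values from the other versions, and the voting function are all assumed interfaces of the recovery layer rather than defined ones.

    def send_with_cc_point(channel, local_value, collect_copies, decide):
        """Send on an output channel of one version of a process-dissimilar
        CNVP process.  The vote happens on every send, and the decision value,
        not the version's own value, is what the version keeps computing with."""
        copies = collect_copies(local_value)   # values from all n versions,
                                               # made consistent beforehand
        decision = decide(copies)              # majority, midvalue, etc.
        channel.send(decision)
        return decision                        # caller continues with this value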

Recovery points are implemented with an application explicit strategy.

Extra processes must be placed in the recovery layer, as for each of the n

application versions there is a state output exception handler and a state input exception handler. The application designer should assign levels of criticality

and timing requirements for each process. Based upon this information, the

location of the recovery points is determined by the fault tolerance designer.

The fault tolerance designer also specifies a time limit on each version's execution and the type of voting to perform on each communication channel (e.g., majority, midvalue, or approximate agreement, depending upon the

types of values sent).

Figure 4.4.1-2. Recovery layer components of a three version process.

In [Sarm88] it is suggested that an intelligent system could be used, with the designer specifying the criticality of a module, and the system


informing the designer which modules need extra versions. The system would then construct the fault-tolerant program. The need for n versions of each process can be supported by extending a language which uses separate specifications and bodies for the processes (such as Ada [MIL-STD-1815A]). There would be one specification and n bodies. To allow cc-points the specification should state how many bodies should be present and the type of voting to use on each message. For r-points, the location is marked in the specification, along with a time limit on the execution since the last r-point and a list of the internal variables to be handled by the state input and output exception handlers. The n bodies will contain n dissimilar implementations of the specification. No internal states other than those listed in the specification may be preserved past r-points. A translator may automatically convert the program into the form of the recovery layer, inserting standard voting, input consistency, and exception handling processes.

4.4.2 Subprogram-Dissimilar CNVP.

Subprogram-dissimilar CNVP is the NVP corollary to the conversation. In a conversation, every process participates in every try by executing its own local try block, and local acceptance tests are used in each process to determine if the conversation has succeeded. The different try blocks in the participating processes are not required to have the exact same communication structure. For subprogram-dissimilar CNVP the entire

subprogram is duplicated as a unit, as if the n tries of a conversation were all

results are found in each of the n versions of each process. Performing voting at the end of the subprogram's execution allows failures to propagate between the sequential processes, making diagnosis more difficult than if voting were performed at each message send. However, the distribution of the failures at the end can provide some clues as to the sources of the failures,

allowing possible failure sources to be identified by backtracking through the communication structure of the program. In addition, failing the voting may activate acceptance tests on the results for fault diagnosis purposes.

Figure 4.4.2-1. Subprogram-dissimilar CNVP.

To program a subprogram-dissimilar system, the application designer specifies the processes in each system and specifies in which processes the results should reside at the time of voting. There is more freedom than in

This scheme may be supported by a distributed language, but the specification covers an entire subprogram, a group of tasks rather than a single task. The specification states the processes which should be present, which processes results should be located at cc-points and r-points, and time limits between the points. The n bodies must have the tasks specified, but they

may communicate amongst themselves as desired by the body programmer.

The body programmer must insert cc-points and r-points as found in the specification. The input consistency, cc-point voting and r-point voting are in the form of calls to recovery layer primitives.

If a voter detects a problem, it is no longer certain that the problem originated in the process in which the failure was detected. At a cc-point this

may be neglected, and execution may continue with the decision result used in the process which detected the failure. R-points are more complicated. The process which detected a failure must be reset, but so must all processes which may have caused the failure. It is suggested that a recovery process be associated with each application process. The recovery process associated with the failed application process may communicate with the recovery processes associated with all processes which sent messages to its application. They in turn send messages back to the recovery processes associated with the application processes which sent them messages, and so on. The recovery processes that have been notified that their application may have failed send messages to the n - 1 recovery processes in the corresponding process of the other subprogram versions, requesting state-output data. The recovery process votes this data, then uses the decision value to restart the application process. To avoid sending warnings around in circles, the dependency of processes upon others may be tracked using the technique outlined in [KimK89] for the PTC/LCN scheme.
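The warning propagation can be sketched as a reachability computation; the mapping from each application process to the processes that have sent it messages since the last r-point is assumed to be maintained by the recovery layer, in the manner of the PTC/LCN dependency tracking.

    def processes_to_reset(failed_process, senders_of):
        """Return every process that may have caused the detected failure.

        senders_of maps a process name to the processes that sent it messages
        since the last r-point; the visited set keeps warnings from circling."""
        to_reset, frontier = set(), [failed_process]
        while frontier:
            process = frontier.pop()
            if process in to_reset:
                continue
            to_reset.add(process)
            frontier.extend(senders_of.get(process, ()))
        return to_reset

    # Hypothetical dependency chain: P3 failed, P2 fed P3, and P1 fed P2.
    print(processes_to_reset("P3", {"P3": ["P2"], "P2": ["P1"], "P1": []}))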

4.4.3 Structure-Dissimilar CNVP.

Structure-dissimilar CNVP is the NVP corollary to the colloquy. The colloquy extends the conversation by allowing different numbers of processes in each try at the computation. The equivalent extension in NVP is to duplicate the entire parallel subprogram, allowing the n versions of the parallel program to use different numbers of processes connected in different ways, and with 70

the results distributed in different processes at the end of computation (see Figure 4.4.3-1).

Figure 4.4.3-1. Structure-dissimilar CNVP.


Since there is no longer a one-to-one correspondence between processes in different versions, voting cannot be performed on a process by process basis. Instead, a global vote must be performed on the complete results of each of the versions. This corresponds to the use of a global acceptance test in a colloquy instead of the local acceptance tests used in the conversation. The only requirement for similarity between the versions is that they arrive at the same final result. This provides an opportunity for great dissimilarity between versions; however fault diagnosis is difficult, as there is no clue as to which sequential process in the failed version was the source of the failure. As with subprogram-dissimilar CNVP, additional acceptance tests may be run to aid in fault diagnosis.

As with the other two CNVP constructs, structure-dissimilar CNVP may be supported by a language with separate specification and bodies. Now the specification only states the results which should be obtained, whether those results are at a cc-point or an r-point, and time limits between the points.

Within each body there may be different processes, different messages, and nondeterminism. Due to the lack of similarity between versions, input consistency and voting must be performed in a separate process or group of

processes. Since every cc-point or recovery point requires that the data be

merged together into a voter, they reduce dissimilarity, and hence they should be placed at rare intervals.

Implementation of recovery will be more difficult for structure-dissimilar CNVP than for the other two CNVP approaches. Cc-points are complicated by the fact that the return of the decision result to the versions will be performed


differently with each version, depending upon the structure of the processes in

the version. If there are m results, there may be m voters. At an r-point, if any of the

m voters do not receive a result within the time interval allotted, or a version

believes it is at the wrong recovery point, then the voter must restart that

version. The voter needs to restart not just one application process, but all of

the processes in the failed version. This means that the voter process must

have the authority to restart processes in processors other than the one upon

which they are running. This may be implemented by having the voter process

signal the recovery layer in every processor which is running one of the

application processes for the bad version. The recovery layer stops the

application process and reinitializes it. It may not be possible to reinitialize using

internal variables from other versions due to the great dissimilarity, so the

application may be set to a standard initial condition. For applications where

no internal states are kept between executions (e.g., sorting), this will work

well. For applications with internal variables stored between executions, means

must be found to initialize them to reasonable values. For example, in an

aircraft adaptive control system, the software constantly estimates the system

parameters. At an r-point, an initial estimate of the parameters may be made

based upon the current flight conditions.

In general, structure-dissimilar CNVP will probably be harder than the

other two types of CNVP, as it is harder to find dissimilar implementations than

similar implementations. However, the extra effort should result in a lower

correlated error rate.


4.4.4 Comparison of CNVP Constructs.

It is interesting to compare process-, subprogram-, and structure-dissimilar CNVP from the standpoint of two goals of design fault tolerance: maximizing the dissimilarity between the n versions, and minimizing the size of the failure containment regions. Correlated errors between dissimilar versions have been shown to be a potential problem in n-version software [Knig86]. The greater the dissimilarity that can be achieved between the n versions, the lower the probability that there will be correlated errors between the versions, so from this standpoint one should choose the CNVP method with the greatest dissimilarity between versions, which is structure-dissimilar CNVP. When a software failure occurs, it is desirable to diagnose the failure and to contain its effects to as small a section of software as possible. The smaller the fault containment regions, the more failures the program can withstand on average before complete failure. Thus from this standpoint one should choose the CNVP method with the smallest fault isolation regions, which is process-dissimilar CNVP. Unfortunately, the two goals of maximal dissimilarity and minimal fault containment regions are incompatible. The

relationship between the three types of CNVP with respect to these

parameters is shown graphically in Figure 4.4.4-1. The choice of which of the three CNVP constructs to use must be made with the application in mind.

To illustrate the three methods, an example will be described. The

example is to sort a list of m integers into ascending order. This has a clearly

defined result, so in actual use it would probably be better to use an acceptance test rather than voting, but the example is clear and simple and thus is chosen for pedagogical reasons. For additional simplicity, it is assumed that no two integers are the same, and recovery of failed versions is not handled.

Figure 4.4.4-1. Relationship between CNVP constructs. (The figure plots the three constructs against two axes, increasing dissimilarity and increasing fault diagnosis capability: process-dissimilar CNVP has the most fault diagnosis capability, structure-dissimilar CNVP the most dissimilarity, and subprogram-dissimilar CNVP lies between them.)

To implement a process-dissimilar CNVP, the designer specifies the processes and their message passing behavior. Assume the designer chooses to use a minimum extraction sort using processes in a tree communication structure [AkiS85]. The structure for an 8 integer sort is shown in Figure 4.4.4-2. The sort extracts the minimum integer from the sequence of those to be sorted, then the minimum of those left, and so on until all m members of the sequence are in monotonically increasing order. The integers


to be sorted are loaded into the leaves of the tree. Each processor on the next level determines the smaller integer held by its two children and passes that to its parent, leaving the child which held the smaller element empty. This continues until the smallest value is in the root node. It is stored in an output buffer, and the cycle is repeated.

Figure 4.4.4-2. Communication structure for a minimum extraction sort.
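A sequential sketch of the idea follows; the tree of processes is collapsed into a single scan over the leaves, so only the extraction order, not the parallel message passing, is represented.

    def minimum_extraction_sort(values):
        """Simulate the tree sort: each cycle the smallest remaining value
        reaches the root and is appended to the output buffer."""
        leaves = list(values)                 # integers loaded into the leaves
        output = []
        while leaves:
            smallest = min(leaves)            # what the root holds after the
            leaves.remove(smallest)           # values percolate up the tree
            output.append(smallest)           # stored in the output buffer
        return output

    print(minimum_extraction_sort([5, 3, 8, 1, 7, 2, 6, 4]))  # an 8 integer sort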

For process-dissimilar CNVP, each process is implemented n times. A vote is performed on each communication from child to parent, and each communication must be performed in the same way. One way to communicate between parents and children is to use asynchronous message passing. The parent informs the child when the parent is ready to accept a

value. The child then sends it to the parent; the parent selects the smaller

value and returns the larger value to the child which sent it. A disadvantage is that all n versions must perform the same computation and the same communication, so the versions are fairly similar. One advantage is that voting on every message allows many uncorrelated failures to occur without crashing the system. For instance, in a three version program, one of the versions of every process could fail and the system would still operate correctly.

Redoing this example with subprogram-dissimilar CNVP allows the dissimilarity to increase between versions. Assume that all n versions perform

a minimum extraction sort on a tree of processes. Since voting is saved until the end, the communication between parent and child can be different in each version. One version may communicate as described for the process­

dissimilar program. Another may use synchronous communication, eliminating the need for the parent to inform the child when it is ready to accept a value. A third version may have the parent keep both values instead of returning the

larger value to the child from which it came. This greater dissimilarity reduces

the chance of a correlated error between the versions. However, fault

diagnosis is more difficult. For example, at the end it may be found that one

version misplaced a value. The guilty process could be any of those which the

value passed through on its way through the tree. If very dissimilar versions are desired, structure-dissimilar CNVP may be used. With this construct, n completely different parallel algorithms can be used. For maximum dissimilarity, these should use different paradigms and

different computational structures to achieve the same goal. Greater

dissimilarity of algorithms is probably possible with parallel algorithms than

with sequential algorithms due to the additional variable added by the communication structure, so structure-dissimilar CNVP may allow greater dissimilarity than any other type of n-version programming yet suggested. Let us return to the example which sorts a sequence of m integers. The sort could be done using three algorithms: a minimum extraction sort on a tree network, a bitonic merge sort on a shuffle-exchange network, and an enumeration sort on an array of interconnected trees [AkiS85]. These algorithms are each based upon a different paradigm from [Nels87]. The minimum extraction sort uses the pipeline paradigm; the bitonic merge sort uses the divide and conquer paradigm; and the enumeration sort uses a compute-aggregate-broadcast (CAB) algorithm to find the correct location of each element, with each element found separately, so it is a divide-and­ conquer paradigm wrapped around a CAB paradigm. For additional diversity the three versions may use different communication structures: one may use message passing, one remote actions, and one generative communication. All three algorithms read the unsorted list of m integers, sort the integers, then output a sorted list of m integers. As the algorithms have different intermediate results, it is not possible to have voting throughout - the voting must wait until the complete answer is ready. There could be M copies

of the voter, so that each element of the sequence is voted upon separately from the other elements. This increases the concurrency and allows multiple faults to be masked, even if the faults are in different versions, provided that those faults do not cause failures in the same elements of the sequence. Figure 4.4.4-3 shows the program structure for this example.

Figure 4.4.4-3. Structure-dissimilar CNVP for a sorting problem. (The unsorted list feeds the three versions, minimum extraction on a tree, bitonic merge on a shuffle-exchange, and enumeration on an array of trees, whose outputs go to the voter, which produces the sorted list.)
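Element-by-element voting over the three versions' outputs might be sketched as follows; three versions and a simple two-out-of-three majority per position are assumed.

    def vote_sorted_lists(version_outputs):
        """Vote each position of the sorted sequence separately, so faults in
        different versions that corrupt different positions are all masked."""
        result = []
        for position in zip(*version_outputs):        # one vote per element
            for candidate in position:
                if position.count(candidate) >= 2:    # two-out-of-three majority
                    result.append(candidate)
                    break
            else:
                raise RuntimeError("no majority for an element of the sequence")
        return result

    # One version misplaces two values; both faults are masked.
    print(vote_sorted_lists([[1, 2, 3, 4], [1, 2, 3, 4], [1, 3, 2, 4]]))  # [1, 2, 3, 4]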

Nondeterminacy can no longer be handled by requiring that all versions handle nondeterminacy in the same way, followed by guaranteeing input

consistency. It will be necessary either that the groups of processes in the

versions not interface to the outside world in a nondeterminate manner, or that each nondeterminate interface be handled in a single process which is a point of similarity between the versions. In the latter case, the nondeterminacy handling process is similar to the coordinator process in the resilient procedure construct, except that the cohorts are made up of groups of processes rather than single processes, and there may be more than one of the nondeterminacy handling processes for each group of cohorts.

4.5 CNVP Reliability Model.

The CNVP models will be found by extending currently existing n-version programming models. These are reviewed in section 4.5.1. Sections 4.5.2 - 4.5.4 then present reliability models for process-dissimilar, subprogram-dissimilar, and structure-dissimilar CNVP, respectively.

4.5.1 Review of Sequential NVP Models.

The stages of the design and implementation of an n-version program were described in [Arla88]. The design process is shown in Figure 4.5.1-1.

The stage at which an error is made affects the nature of the resulting fault(s) as shown in Figure 4.5.1-2. The fact that correlated errors may exist even if the mistake is made at a later stage [Knig86] is represented by the dependency channels marked with the letters a - d.

A Markov model for the reliability of an n-version program without recovery running under the Design Diversity Experiment System (DEDIX) was

devised by Laprie [TsoK86]. It is shown in Figure 4.5.1-3.

Figure 4.5.1-1. NVP fault sources.

Paths where faults are created,        Fault type
or dependency channels

1 -> 2                                 Related fault in the three versions
a, b, or c                             Related fault in two versions
1 -> 2 -> 3, 1 -> 3, or d              Related fault in versions and decider
2 -> 4, 2 -> 5, or 2 -> 6              Independent fault in a version
3 -> 7                                 Independent fault in the decider

Figure 4.5.1-2. Major fault types for NVP.

Figure 4.5.1-3. Detailed reliability model of an n-version program.


In this model, it is assumed that the program has three versions running under DEDIX supervision. DEDIX performs the communication and decision functions for the three programs, which run in parallel. The states in the

Markov model are made from a combination of the state of execution of the software module and its failure status. They are marked with the state code (X,Y), whose components have the following meanings:

X: block status with respect to execution

    I - idle,
    D - executing DEDIX, and
    V - executing versions in parallel.

Y: block status with respect to error activation

    Y missing - no error activated,
    CE(D) - common mode error from DEDIX failure,
    CE(C) - common mode error in the versions due to a mistake in the
            specification or a common error made by the independent design
            teams,
    IE(V) - independent error in a version (a number placed before IE(V)
            indicates how many versions have independent errors present).

Transitions between states are made when the program enters a new mode or a failure occurs. The transition rates are labeled with codes X sub Y, which are defined as follows:

X : rate classification

eta - solicitation rate of the n-version program; its inverse is the mean duration of idle periods,

gamma - end of solicitation rate; its inverse is the mean duration of execution periods, and

lambda - the failure rate; its inverse is the mean latency of the system errors.

Y : source modifier

D - DEDIX,

V - version, and

C - correlated error.

The system is assumed to begin in the idle state (state I). When the program is to be run, DEDIX is invoked and runs for some time setting up the versions for execution (state D). Then the versions execute (state V), and when a cc-point or r-point is reached they return their values to DEDIX. DEDIX votes the results (state D). If there is more work to do the versions continue (state V); otherwise the system returns to idle (state I). The probability that the system has more work to do, given that it is in DEDIX, is called the activation ratio and denoted by q. The occurrence of a failure causes other transitions to occur. A failure in DEDIX causes the system to enter state D, CE(D). This is a common mode error, so when DEDIX completes running, the final answer is wrong. This is represented by the transition to state CF. A common mode error which occurs during version execution will also cause a complete system failure. Upon occurrence of the failure, the system enters state V, CE(V). When version execution is complete, DEDIX runs (state D, CE(V)) and produces an incorrect result (state CF). An independent error in a single version causes the system to enter state V, 1IE(V). If the system goes back to DEDIX without another error, the system will not fail, and will enter the states I, 1IE(V); D, 1IE(V); and V, 1IE(V). These correspond to the I, D, and V states but with an error in one version. Occurrence of another failure, whether common mode or independent, results in system failure.

There are some interesting anomalies in the model as shown in [TsoK86]. For one thing, the possibility of a second or third independent failure occurring during a single version run is shown, but not the possibility of an independent failure of one version followed by a common mode failure. Another anomaly is that in the transitions from the V state, the chance of more than one independent error occurring before return to DEDIX is shown to be nonzero, but from the V, 1IE(V) state the chance of two independent failures occurring before return to DEDIX is zero. A third anomaly has to do with the activation ratio, q, after a single failure. For some reason, q is 1 after a failure. The figure could be redrawn to eliminate these anomalies, but instead we will


examine the simplified figure of [TsoK86], then determine if that must be

redrawn. The model may be simplified by recalling that execution rates are much higher than failure rates, and by combining states. The resulting simplified

model is shown in Figure 4.5.1-4. The chance of more than one failure

occurring during a version execution is considered to be close enough to zero to ignore: this eliminates several states, and also solves two of the anomalies

of the more complicated model. The only change which appears to be

necessary would be to account for the activation ratio q in the transition from

state V to state V, 1IE(V). The transition between these states should be q x lambda sub V, and there should be a transition from state V to state I, 1IE(V) with rate (1 - q) x lambda sub V.

Figure 4.5.1-4. Simple reliability model without recovery.

Recovery may be added to the system as described in [TsoK86] and reviewed in section 4.1.2. A simplified Markov model for NVP with recovery is shown in Figure 4.5.1-5.

Figure 4.5.1-5. Simplified reliability model with recovery.

The only difference between the figure with recovery and the one without recovery is that the program enters a recovery state, corresponding to execution of the state-input and state-output exception handlers. If the handlers succeed, the system returns to state V, and if the handlers fail, the system goes to state V, 1IE(V).

4.5.2 Process-Dissimilar CNVP.

To develop a process-dissimilar CNVP, the parallel program is divided into m processes, then each of these m processes is programmed n times in n dissimilar ways. Let us assume that for the parallel program to operate correctly, all m processes must operate correctly. The reliability diagram is m blocks in series, one block for each process (see Figure 4.5.2-1).

Figure 4.5.2-1. Top-level reliability diagram for process-dissimilar CNVP.

Let us examine a block. The kth process has n versions of the application, and each of the n versions has i(k) inputs and j(k) outputs, with a

consistency checker on each input and a voter on each output. For this

model, let us assume that the communication is synchronous, as in CSP (this

may be modified later). If we assume that the input consistency processes,

application process, and its voter processes are on the same physical

processor then only one can run at a time. Also, if the versions are synchronized, the n processors will make state transitions at the same time.

We may describe the status of the system with a model very similar to that for

a sequential n-version program (which makes sense, since we are modeling a sequential process). The model for a three-version system appears in Figure 4.5.2-2.

Figure 4.5.2-2. Reliability model for a three-version sequential process.

Assuming synchronous I/O, there are five possible states for a correct n-modular sequential process program: waiting for an input (WI), performing input consistency checks (IC), executing the versions (V), deciding on the correct result (D), and waiting for an output (WO). Then if a failure occurs, recovery may be attempted (R), the entire system may fail (F), or a single version may fail permanently, in which case the system is in one of the five states listed before, but with a (,1) after the state to indicate that one version is failed (the F state has been shown twice to avoid excessive crossing of the transition arcs in the figure). The transitions are marked with rates using the following conventions:

lambda - failure rate of state, and

gamma - completion rate of state.

The subscripts modifying the lambda and gamma symbols follow the state names above, with the following exception: lambda sub C gives the rate at which common mode faults are activated in the versions, while lambda sub

V gives the rate at which independent faults are activated in the versions.

Finally, the rates are modified with the following variables:

a - the probability that the sender of a message is not ready given that the process is ready to receive a message,

b - the probability that the receiver of a message is not ready given that the process is ready to send a message, and

d - the probability that an I/O operation is an input operation.

If the system uses asynchronous message passing, the model is modified either by setting the average wait to send a message equal to zero, or by setting b to zero.

4.5.3 Subprogram-Dissimilar CNVP.

Figure 4.5.3-1. Reliability diagram for subprogram-dissimilar CNVP.

To prepare a subprogram-dissimilar CNVP, the entire subprogram is replicated. The subprogram sits idle until it is activated by being called with input. It then performs input consistency checking, runs the n versions of the subprogram, runs a decision function, and outputs the result. After the result has been output, the program returns to the waiting-for-input state.

The reliability diagram for this sequence is a modified version of the reliability diagram for a single process in process-dissimilar CNVP, as shown in Figure 4.5.3-1. In this figure the failure state, F, has been shown twice to

avoid excessive crossing of the transition arcs.

The reliability diagram for subprogram-dissimilar CNVP is very similar to

that of a sequential n-version program because the concurrent nature of the

software is hidden within the subprogram.

4.5.4 Structure-Dissimilar CNVP.

The reliability model of the structure-dissimilar CNVP is the same as that

for subprogram-dissimilar CNVP. It is hoped that the correlated error rate will

prove to be lower due to the greater dissimilarity between versions.

4.6 Correlated Error Problem.

To determine the number of versions necessary to meet the URMC requirements, some means of finding the failure rates for both independent and correlated faults is necessary. The model of Eckhardt and Lee [Eckh85] takes into account correlated errors as well as independent errors. In this model, theta gives the probability that a version will contain an error at a certain point in the input space, and g(theta) is the probability of encountering the area of the input space with coincident error intensity theta. Figure 4.6-1 shows a figure from [Eckh85], with the failure rate of each version assumed to be 2 x 10^-4 per hour, and with the decision function consisting of a majority voter. If version failures are assumed to be independent, then it would require only 5 versions to reach a system failure rate of 1 x 10^-9 per hour, but if the values for theta shown are reasonable, then it would take 17 versions to produce a system with a failure rate less than 1 x 10^-9 per hour. This is an unreasonable number of software components to prepare. One way to minimize the number of components is to limit theta to a low value. Figure 4.6-2 shows the effect of varying theta, even over a small proportion of the input space.

[Figure: probability P(N) of system failure versus number (N) of components, for several values of theta and g(theta).]

Figure 4.6-1. Effect of independence assumption (from [Eckh85]).


[Figure: probability P(N) of system failure versus number (N) of components, for N from 5 to 21.]

Figure 4.6-2. Effect of shifted intensity distribution (from [Eckh85]).
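The behavior illustrated in the two figures can be reproduced numerically. The sketch below evaluates the failure probability of an N-version, majority-voted program under the Eckhardt and Lee model, first under the independence assumption and then under a coincident-error intensity distribution. The distribution values, and the treatment of the per-version failure rate as a per-demand probability, are illustrative assumptions and are not the numbers of [Eckh85].

```python
from math import comb

def p_majority_fails(theta, n):
    """Probability that a majority of the n versions fail at an input where each
    version fails independently with probability theta."""
    k = n // 2 + 1                      # smallest number of failures that defeats a majority
    return sum(comb(n, i) * theta**i * (1 - theta)**(n - i) for i in range(k, n + 1))

def system_failure_prob(intensity_dist, n):
    """Eckhardt-Lee style calculation: average the coincident-failure probability
    over a distribution g(theta) given as (theta, g(theta)) pairs."""
    return sum(g * p_majority_fails(theta, n) for theta, g in intensity_dist)

# Hypothetical distributions (not the values of [Eckh85]).
independent = [(2e-4, 1.0)]                        # every input equally hard
correlated  = [(0.0, 0.989), (0.01, 0.005),        # a small part of the input space
               (0.02, 0.0025), (0.10, 0.0035)]     # carries a high coincident intensity

for n in (3, 5, 9, 17):
    print(n, system_failure_prob(independent, n), system_failure_prob(correlated, n))
```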

Another way to reduce the number of components needed is to use a plurality vote rather than a majority vote. In the FTP/AP [Lala88], there are four versions of software. If a correlated error occurs in two versions which results in those two versions reaching two different wrong answers, then the versions form a 2 - 1 - 1 split. The decision function then selects the value of the 2 versions which agree.
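A minimal sketch of the difference between the two decision functions follows; the 2 - 1 - 1 split described above is accepted by the plurality voter but rejected by a strict majority voter. The function names are hypothetical.

```python
from collections import Counter

def majority_vote(results):
    """Accept a value only if more than half of the versions agree on it."""
    value, count = Counter(results).most_common(1)[0]
    return value if count > len(results) / 2 else None

def plurality_vote(results):
    """Accept the most common value unless it is tied with another value."""
    ranked = Counter(results).most_common(2)
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None                        # ambiguous split such as 2 - 2
    return ranked[0][0]

split = [42.0, 42.0, 17.3, 99.9]           # a 2 - 1 - 1 split among four versions
print(majority_vote(split))                # None: no strict majority exists
print(plurality_vote(split))               # 42.0: the two agreeing versions win
```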

Before the number of versions necessary for the URMC can be determined, the following questions must be answered:

1) What values of theta are reasonable for software developed to industrial standards?

2) What proportion of correlated failures will result in similar wrong answers in the incorrect versions?

Once these answers have been found, the number of versions may be minimized by using the decision function of [Lala88], which does not require a majority, and by keeping theta as small as possible. Knight and Leveson performed an empirical study of failure probabilities in multiversion software [Knig87] which appears to show that the correlated errors between versions make the reliability levels required for the URMC very difficult to reach. However, Avizienis has suggested that the use of rigorous specifications, enforced dissimilarity, and industrial software practices should reduce the correlated errors below the level observed by Knight and Leveson [Aviz87]. Until a better method for preventing correlated errors is found, the following rules are suggested:

1) The versions should use independent design and implementation techniques such as diverse algorithms, programming languages, translators, design automation tools, and machine languages [Aviz84].

2) The versions should be prepared by independent (noninteracting) programmers or designers, preferably with a wide background of experience [Aviz84].

3) The requirements specification must be complete and correct, as ambiguities or incomplete areas may cause correlated errors [Aviz84].

4) Modules should be subjected to four types of tests: functional, performance, stress, and structural [Fair85]. Extreme values and illegal values should be checked in the functional tests. In this author's opinion, the combination of unusual values in functional testing and stress testing should detect problems very likely to have been neglected by the programmer, such as numeric overflow or filling of buffers, and should reduce theta in these areas.

5) All software should be designed with software engineering practice as rigorous as for critical single-version software - the ideas that redundancy allows money to be saved by avoiding module test [Youn84] or by using free-lance programmers [Aviz85] should be abandoned.

6) Simple software, or software used on many different projects (such as the recovery layer software), should be proven correct if possible.

4.7 Software Fault Tolerance Summary.

The development of fault-tolerant software is the biggest problem in the next generation of computers for critical control, as techniques for fault-tolerant concurrent software are not yet well developed. The URMC system should make provisions for sequential or concurrent n-version programming, recovery blocks and colloquies, and for simplex software as well. Time-critical software can be written as n-version software, while software with no hard deadlines could be written with the recovery block or colloquy. The colloquy will probably have to be used for programs which find heuristic suboptimal solutions, as different methods may find very different and yet still correct answers. To guarantee that different versions in an n-version program will find the same suboptimal answer would require considerable similarity between the versions. Finally, simplex software can be used for operations which are non-critical.

Since concurrent n-version programming is not well defined, considerable discussion of the subject was presented. Three methods are outlined: process-dissimilar, subprogram-dissimilar, and structure-dissimilar CNVP; they are counterparts of the three methods used to extend the recovery block to concurrent software: PTC, the conversation, and the colloquy. Modular redundancy in CSP is a special case of process-dissimilar CNVP. The greatest problem with software fault tolerance is correlated errors between the versions. Programming the versions with algorithms based upon different paradigms should cause theta to be distributed differently in the different versions, thus significantly reducing correlated errors. This should be investigated with a programming experiment, if possible.

5. Processing Node Fault Tolerance.

The URMC processes run on the hardware of the processing nodes. There may be one or more processes on each hardware processor. The hardware nodes must provide this service to the software regardless of the occurrence of hardware failures. Communication between processors is through message passing, using the services of a "perfect" communication

network, which is guaranteed to correctly deliver a message within a finite time; therefore, the node hardware layer has to tolerate only failures within the

processing nodes themselves. Hardware fault tolerance has been studied for some time, so the techniques for hardware fault tolerance are fairly well understood for

conventional SISD computers. Overviews may be found in [Fern89a] and

[Siew82]. Considerable work has been done in extending fault tolerance to

parallel processors, but it is still an active research area. Several different ideas

from past work have been combined to develop the fault tolerance scheme

used in the URMC. Fault tolerance in hardware is achieved by redundancy, so that a failed

component's function can be handled by another component. For the URMC,

it has been decided that fault tolerance will be achieved through redundancy

of the entire processing node, and the fault isolation regions will be entire nodes. This approach has been chosen because it allows the use of inexpensive general-purpose components in the development of a fault-tolerant system. The desirability of redundancy and isolation at this level of the hardware is attested to by the use of this approach by all four of the example systems of chapter 2, despite past investigations of isolation at lower levels of hardware (such as redundancy at the level of buses, CPUs, and memories in FTMP [Smit86]). This is because inexpensive VLSI components have made redundancy at the computer level the most economical. Section 5.1 describes additional requirements which must be placed on the URMC processing node hardware fault tolerance layer(s), which are derived from the choice of concurrent n-version programming as the software fault tolerance technique. Section 5.2 reviews the two main methods for fault tolerance in SISD machines: standby sparing and modular redundancy. In section 5.3, extensions of the two SISD fault tolerance techniques to MIMD machines are described: redundancy with t-fault diagnosis and modular redundancy with voting. Section 5.4 compares the two fault tolerance techniques for MIMD machines and concludes that modular redundancy with voting is the construct which is best for meeting the URMC requirements. An n-modular redundant scheme for distributed fault tolerance (DFT) has been described by Chen and Chen [Chen85], but was not designed to handle Byzantine faults [Lamp82], so it is determined that it is necessary to extend this scheme. Section 5.5 contains new work, as it describes the scheme used

in the URMC: it begins with the repset concept of the distributed fault tolerance scheme described by Chen and Chen, but adds several extensions: it is combined with Byzantine fault masking as described by Lamport et al. [Lamp82], Byzantine fault diagnosis through the methods of intermittent fault diagnosis [Shin87], and access to the spare processors is coordinated

through a monitor. Two possible methods of diagnosis are discussed,

centralized and decentralized. Special problems with I/O fault tolerance and with systems which use dissimilar hardware processors are also discussed. Finally, in section 5.6 a reliability model for the URMC processing node fault tolerance layer is derived.

5.1 Derived Requirements for the Layer.

In addition to the requirements stated in section 3.4.2, there are further requirements on the processing node fault tolerance layer(s) which are derived from the choice of concurrent n-version programming for software fault tolerance:

1) The concurrent nature of the software means that different tasks must be able to communicate their results to each other. Due to the real-time nature of the system, the communication delay between tasks must be minimized.

2) The n versions of each software process must be isolated from each other to prevent fault propagation. That is, if different versions of the same process share the same physical processor, there must be a protection scheme to protect the resources of one process from the other (fortunately, many current processors include hardware support for such protection mechanisms).

3) The throughput requirement has been increased by n times due to the n-redundant tasks. Also, the overhead of running the tasks which perform voting and input consistency between software versions requires still more throughput. Experimental work on SIFT shows that this can be up to 80% of the throughput [Smit86]. The FTP/AP and MAFT have reduced this overhead through migration of the redundancy management functions to hardware, which decreases flexibility and makes the system rely on special-purpose hardware. However, leaving the functions in software will require that the total throughput requirement be raised two or three times.

5.2 SISD Redundancy Management Methods.

There are two major approaches to managing a redundant SISD system: standby sparing and modular redundancy with voting. These can be combined to achieve dynamic modular redundancy. Since these approaches are standard, they will be described only briefly.

5.2.1 Standby Sparing.

The standby sparing scheme has n identical copies of the hardware with only one actively providing results. The other copies are spares which are used only in the event of detected failure of the primary hardware. Upon primary failure, a switch activates one of the spare components in its place. This system has the problem that the coverage of the diagnostic tests must be very high or the reliability will suffer. One means to improve the fault coverage is through comparison. The component is duplicated with both copies executing simultaneously. One copy outputs the results while the other copy serves only as input to a comparator. If there is a disagreement in the two copies' results, the comparator disconnects the output. This can be combined with standby sparing to yield a system which is fail-operational and which has very good coverage (see Figure 5.2.1-1), sometimes called a dual-dual system.

Figure 5.2.1-1. Standby sparing with comparison checking.

The reliability of a dual-dual system is given by:

(5.1-1)

where Rk is the reliability of the circuits needed for comparison and switching of modules, Rm is the module reliability, and C is the coverage factor [Fern89b].

5.2.2 N-Modular Redundancy.

In N-modular redundancy, there are n copies of the hardware followed by a voter (see Figure 5.2.2-1). All n copies compute a result and send it to the voter. The voter then chooses a decision result using some scheme, such as the majority value or the midvalue. If fewer than half of the copies have failed, the voter can mask the failures.

Figure 5.2.2-1. Redundant system with voting.

The system reliability of a voted system is given by:

    Rsys = Rv * SUM(i = 0 to floor(N/2)) C(N, i) * Rm^(N-i) * (1 - Rm)^i        (5.2-1)

where N is the total number of modules, Rv is the reliability of the voter, Rm is the reliability of a module, and C(N, i) is the number of i-combinations of an N-set [Siew82]. The number of faults which can be tolerated by the n components can be increased if reconfigurable NMR is used. In this scheme, it is assumed that at

increased if reconfigurable NMR is used. In this scheme, it is assumed that at

most one copy of the hardware will fail at a time. The fault is diagnosed and

the system reconfigured so that the failed copy is excluded from future voting,

which allows up to n - 2 faults to be tolerated. For example, in a 5-modular

redundant system which is not reconfigurable, only 2 faults will be tolerated, as

the third fault will leave a majority of incorrect copies. However, if the hardware

copies fail one at a time and failed copies are excluded from future votes, then

3 faults may be tolerated. The fourth fault will cause failure, as it is impossible

to attribute the fault to one of the two remaining channels as there are no

others with which to compare results. The system reliability for a reconfigurable N - modular redundant system with S spares is given by

[Siew82]:

    Rsys = Rv * SUM(i = 0 to P) C(N + S, i) * Rm^(N+S-i) * (1 - Rm)^i        (5.2-2)

where the symbols have the same meanings as in equation (5.2-1).
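Equations (5.2-1) and (5.2-2) can be evaluated directly, as in the sketch below. Since the upper summation limit P of equation (5.2-2) is not defined explicitly here, the sketch assumes P = S + floor(N/2), the total number of module failures the reconfigurable arrangement is presumed to absorb; the caller may supply a different limit.

```python
from math import comb, floor

def r_nmr(n, r_m, r_v):
    """Equation (5.2-1): reliability of an N-modular redundant system."""
    return r_v * sum(comb(n, i) * r_m**(n - i) * (1 - r_m)**i
                     for i in range(floor(n / 2) + 1))

def r_nmr_with_spares(n, s, r_m, r_v, p=None):
    """Equation (5.2-2): reconfigurable NMR with S spares.  The upper limit P is
    assumed to be S + floor(N/2) unless the caller supplies another value."""
    if p is None:
        p = s + floor(n / 2)
    return r_v * sum(comb(n + s, i) * r_m**(n + s - i) * (1 - r_m)**i
                     for i in range(p + 1))

# Example: triplex modules of reliability 0.99 behind a near-perfect voter.
print(r_nmr(3, r_m=0.99, r_v=0.9999))
print(r_nmr_with_spares(3, 2, r_m=0.99, r_v=0.9999))
```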

5.3 MIMD Redundancy Management Methods.

The two main methods for redundancy management in SISD systems, standby sparing and n-modular redundancy, have been extended to MIMD systems.

Standby sparing in an MIMD machine works similarly to that in an SISD machine. Each task is scheduled redundantly, and the results sent to a

comparator. If the results from redundant tasks compare correctly they are

released, otherwise the tasks are rescheduled for execution on other

processors. Several schemes for improving the coverage of processor faults

using the concept of t-fault diagnosis have been advanced, and these will be

discussed in section 5.3.1.

Modular redundancy with voting can also be extended in a

straightforward manner. Each processing node in the multicomputer may be

replicated n times, and the results passed through a voter either before they

are output from the node or before the next node inputs them. This scheme

was used in FTMP [Smit86], in which processors, memories, and buses were used in groups of three.

If no reconfiguration is allowed, then less than n/2 of the modules may

fail. If failures are restricted to one at a time and reconfiguration is allowed to mask modules which have failed, then as many as n - 2 modules may fail. If

the system is down to two modules which disagree and built-in diagnostics


can isolate the problem to one of the modules, then as many as n - 1 of the modules may fail. If failures occur one at a time and a spare can replace a failed module before another fails, then there is no limit to the number of modules which may fail. This shows that it is desirable to be able to access a pool of spares to replace failed modules.

The diagnosis and reconfiguration tasks are complicated by the fact that they must not introduce a single point of failure. Other problems are avoiding incorrect diagnosis of a failed processor due to a Byzantine fault, and coordinating access to the spare processors. A preliminary approach to modular redundancy in a MIMD computer was proposed by Chen and Chen and is discussed in section 5.3.2. This scheme is determined to be insufficient (as discussed in section 5.4), resulting in several extensions which are discussed in section 5.5.

5.3.1 Redundancy with System Diagnosis.

One way to achieve fault tolerance in a message passing MIMD

machine is to schedule each software task redundantly to the physical

processors. After the execution of the tasks is completed, the results are stored until the system can run diagnostics to determine the health of the

processors. The results of the tasks' computations are released only from tasks which ran on processors found to be nonfaulty by the diagnosis. Since only ·one copy of the task need be executed on a nonfaulty processor, this

·--~---~---- 106

scheme allows up to n - 1 of the n redundant processors to fail while still allowing the computation result to be correct and available. The most studied model for fault diagnosis is the PMC model [Prep67]. In the PMC model the system is broken down into processing elements (PEs), with each PE capable of testing one or more other PEs with perfect coverage.

This is represented with a directed graph, with the modules represented as nodes of the graph and with a directed edge from node i to node j if node i is capable of performing a test on node j. Edge E(i,j) is assigned a weight of 0 (1) if the test by node i finds that node j is good (faulty). A fault-free unit will always get the correct test result, while a bad processor may get an incorrect test result. The results of all of the tests are gathered together into a fault syndrome from which the condition of the system may be determined. A system which may have up to t faults and still be correctly diagnosed is called a t-fault diagnosable system. If this can be performed in one step it is called one-step t-diagnosable, while if the faulty units must be found and repaired in a sequence then it is called sequentially t-fault diagnosable. One-step t-fault diagnosability can be achieved optimally by having n = 2t + 1 units, each of which is tested by exactly t other units. Figure 5.3.1-1 shows an optimally 2-fault diagnosable system.

Necessary conditions for t-fault diagnosability are given in [Prep67].

Hakimi and Amin showed that if no two units test each other then the above

conditions are also sufficient, and they also derived necessary and sufficient

conditions for the case where no such restriction is placed on the tests

[Haki74]. A system in which no two units test each other is one-step t-diagnosable if and only if there are at least n = 2t + 1 units and each unit is tested by at least t other units. If units are allowed to test each other, then these same two conditions must be present plus another: for each integer p such that 0 <= p < t, any subset X of the vertices with |X| = n - 2t + p must have test links from units in the subset X to at least p units in V - X.

Figure 5.3.1-1. An optimally one-step 2-diagnosable system.
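One simple way to realize such an optimal design, sketched below under the stated conditions, is to arrange the n = 2t + 1 units in a ring and let unit i test units i+1 through i+t (mod n); no two units then test each other and every unit is tested by exactly t others. The construction is one possibility, not necessarily the exact assignment drawn in Figure 5.3.1-1.

```python
def optimal_test_assignment(t):
    """Test graph of an optimally one-step t-diagnosable system: n = 2t + 1 units,
    with unit i testing units i+1 .. i+t (mod n)."""
    n = 2 * t + 1
    return {i: [(i + d) % n for d in range(1, t + 1)] for i in range(n)}

def check_conditions(tests):
    """Check the quoted conditions: every unit tested by at least t others,
    and no two units test each other."""
    n = len(tests)
    t = (n - 1) // 2
    tested_by = {j: [i for i in tests if j in tests[i]] for j in tests}
    no_mutual = all(i not in tests[j] for i in tests for j in tests[i])
    return no_mutual and all(len(tested_by[j]) >= t for j in tests)

graph = optimal_test_assignment(2)   # a 5-unit, one-step 2-diagnosable system (cf. Figure 5.3.1-1)
print(graph)
print(check_conditions(graph))       # True
```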

One severe disadvantage of the PMC model is the assumption that all units will fail permanently, while many faults are intermittent in nature. This problem was addressed in [Mall78]. Tests are run on the processors and the results assembled into a syndrome. If a syndrome is ever consistent with some of the nodes having permanent faults, those nodes are flagged as faulty.

If the syndrome is not consistent, the testing continues. A system is defined as ti-fault diagnosable if it is such that if no more than ti units are intermittently faulty, then a fault-free unit will never be diagnosed as faulty and the diagnosis at any time is at worst incomplete, but never incorrect. It was found that the conditions for one-step and sequential ti-fault diagnosability are identical. It is shown to be both necessary and sufficient that a system is ti-fault diagnosable if, given any 2 subsets of units in the system S1, S2, such that both subsets' size is less than or equal to ti and they have no elements in common, the set A of remaining elements has at least one testing link to both S1 and S2 (see Figure 5.3.1-2). The system of Figure 5.3.1-1 is 2-fault diagnosable for permanent faults, but only one-fault diagnosable for intermittent faults.

Figure 5.3.1-2. Necessary partitioning for a ti-diagnosable system.

Another disadvantage to the PMC model is the assumption that a unit is

capable of performing a perfect test upon another unit. This is impossible to

achieve in practice, and an attempt to find a very thorough though imperfect test is likely to result in an extremely time-consuming diagnostic test. [Male80] suggests using the results of the user tasks as a test. Each task is run on two different nodes and the results compared to determine the health of the nodes. If the results from the two nodes disagree, a comparison task is run on one of the two nodes plus a third node. The method assumes that the coverage from comparison of results is very high, and assumes a centralized analysis to allow the faults to be located.

Chwa and Hakimi [Chwa81] suggest combining the comparison technique with intermittent t-fault diagnosis techniques to arrive at a diagnosis for a system. Each task is scheduled to execute on several different processors of the system, and the results compared. A comparison between processors i and j is considered as both a test of i on j and of j on i. If tasks are assigned to z processors at a time, then each task execution causes a comparison test between a processor and z - 1 other processors. The results of these comparisons are assembled, and checked for a syndrome consistent with permanent faults of fewer than ti of the processors. If no miscomparisons occur, the results of all task computations may be released. If a consistent syndrome is found, then those processors are diagnosed as faulty, and all

results of tasks executed on the faulty processors are thrown away - results from other processors which executed those same tasks are released. If an

inconsistent syndrome is found, the results of tasks which did not agree are

held back while more tasks are executed on the processors. When a

consistent syndrome is found, the tasks on the processors which were not


found to be faulty will be released. Tasks which were run only on processors in the fault set are rescheduled for execution on other processors. Since the diagnosis may be incomplete, it is possible but unlikely that an incorrect task result will be released. An algorithm for diagnosis is given which completes in O(|E|) time. This system has the disadvantage of having to hold the results from tasks while awaiting a syndrome to appear which is consistent with a

permanent fault situation. Thus several other algorithms have been presented which speed up the diagnosis by avoiding the wait for a syndrome which is

consistent with permanent faults [Dahb83] [Dahb85] [Yang86]. These algorithms execute in O(|E|) time, and results must still wait upon a diagnosis.

Another approach to comparison is that in [Agra85], in which an

algorithm is proposed named the Recursive Algorithm for Fault Tolerance

(RAFT). RAFT uses the comparison method to detect faults but does not use

ti-diagnosability. A task is scheduled to be run on two processors. A digital

signature is generated by the processors while they are computing the values

for the task, with the signature uniquely representing the activity taking place

in the processor while it executes the task. If the signatures from the two

processors agree, the results are released. Otherwise, the task is assigned to

a third processor. The signature of the third processor is compared to that of

the first. If it agrees with the first, the result from the first is released and the

second is added to the list of suspect processors. If not, it is compared to the second. If it still does not match, the task is run on a fourth processor, whose signature is compared to the first, second, and then third. The task keeps being placed on a new processor until either there are no more or some pair have the same digital signature. The article computes probabilities for RAFT to make the correct choice when releasing the results.

A problem with using comparison for fault diagnosis as described above is that while it provides complete coverage for the results of a particular task, it does not provide complete coverage of the processors upon which the tasks were run. Some processors may be partly faulty in such a way that certain of the tasks run on them fail while others succeed. For instance, a bit fault in program memory will affect the execution of the task which runs the code with that bit in it, but no other task in the node will be affected. This leads to the following scenario. A task may be scheduled to run on m processors, and it computes m different answers. After diagnosis, m - 1 of the processors may be diagnosed as faulty. Should the result from the processor which was diagnosed as good be released? If it is desired that results from a task be released from a processor even if that processor is the only one running the task which was diagnosed as correct, then there is a possibility of releasing incorrect results due to an incomplete diagnosis. If it is required that more than one execution of a task reach the same result before those results are released, then there is no advantage over n-modular redundancy.

5.3.2 Modular Redundancy with Voting.

Another technique for constructing a fault-tolerant multicomputer is to

have n-modular redundancy at each of the hardware nodes [Chwa81]. When any of the nodes has exhausted its fault tolerance, that node can no longer

help with the computation. If it is a real-time system which must have all nodes working to meet its deadlines, then the system will fail. The reliability of such a system may be found by finding the reliability of each NMR component and then placing them in series. Another scheme for fault tolerance is to have dynamic n-modular redundancy, as in the DFT scheme described by Chen and Chen [Chen85]. A MIMD distributed system consisting of several nodes is divided into replication sets (repsets) of three processors, and any additional processors are placed

into a pool of spares. The processors in each repset maintain input consistency of all messages received from other repsets (e.g., with the

algorithm of [Manc86b]), execute identical tasks, and are responsible for the maintenance of a majority of correct processors in their own repset. Thus each repset behaves as a fault-tolerant SISD node of the multicomputer. A repset

sends and receives messages with other repsets just as the node of a multicomputer sends and receives messages with other nodes. A repset processor maintains a list of the members of its own repset and of all repsets with which it interfaces. When one of the processors in a repset fails, the other two processors in its repset detect the failure through the result voting and

command one of the spare processors to replace the failed processor. A

spare processor takes over the tasks of a failed processor if two other processors simultaneously tell it to do so (within the limits imposed by synchronization). After the reconfiguration, the processors in all interfacing repsets are informed of the change so that they may send and receive messages correctly between themselves and the new member. The DFT scheme of Chen and Chen ignores the problem of Byzantine agreement between the processors and the problem of coordinating access to the spares between the repsets.

5.4 Voting vs. Diagnosis.

Analysis shows that systems which are ti-fault diagnosable using comparisons as the tests will have lower hardware overhead for a given reliability than systems employing n-modular redundancy [Chwa81] [Dahb85]. However, this comes at a price: results from tasks must be delayed until the fault diagnosis is performed. This precludes any of the tasks used in a given diagnosis cycle from depending upon results from tasks previously used in that cycle. This would cause a large delay for systems with tasks arranged in a precedence order, and since the diagnosis algorithms have complexity O(|E|), the delay grows as the size of the system increases. Because they cannot depend upon each other, the tasks in a diagnosis set must all come from the same level of a precedence graph or from different precedence graphs. In addition, the execution of the tasks is arranged in cycles, so the tasks must either be of the same execution time or a lot of processor throughput will be wasted while waiting upon the completion of tasks in a task cycle. For these reasons, ti-diagnosability is still not a practical technique to determine the validity of output results in most systems, whether real-time or not. The comparison methods suggested in [Male80] or in [Agra85] could be used to advantage in a system which had enough spare execution time to allow tasks to be reexecuted upon failures. However, the problem of who is to determine when to reschedule the tasks and when to release the tasks is not covered in these articles. Also, in real-time systems there are often many tasks (or sequences of tasks in a precedence list) which will barely complete before a hard deadline, and the time cannot be taken to allow a task to be executed more than once.

Using n-modular redundancy solves the problems of throughput and time delay which the comparison technique has, but at the expense of additional processor overhead. Chen and Chen have suggested a scheme for distributed fault tolerance with n-modular redundant repsets, but unfortunately, their scheme has the following problems:

1) Byzantine faults are not considered, so the processors in a repset may

misdiagnose the failed processor and attempt to replace the wrong

one.

2) The reconfiguration has a single point of failure, the spare processor itself. A spare processor does not join a repset unless requested by

two processors in a repset. But what if a spare processor has failed in


such a way as to agree to join two different repsets? Some means of

coordinating the reconfiguration in a distributed fashion is required.

The above analysis shows that a scheme should be developed which allows for at least masking and possibly also diagnosis of Byzantine faults, and which coordinates access to the spare processors in the event of more than one diagnosed processor failure.

5.5 URMC Redundancy Management Technique.

The processing nodes of the URMC will be split into repsets, as in

[Chen85], with m processors in each. In addition, there will be k spares, for a

total of k + m = n processors. The scheme of Chen and Chen will be used, with extensions to allow the masking and diagnosis of Byzantine faults.

If there are many repsets and the exposure time is large, the probability

of more than one Byzantine fault occurring in a given repset increases. It is

possible to mask Byzantine failures using an algorithm as described in

[Lamp82]. However, even with authenticated signatures this requires 2f + 1 processors. If the Byzantine failures could be diagnosed, then the failed

processor could be replaced in the same fashion as a processor which fails in

a consistent fashion. Byzantine faults may be diagnosed using an algorithm

for intermittent fault diagnosis [Shin87], such as those described above.

Section 5.5.1 will discuss modifications to the DFT concept of Chen and

Chen to allow Byzantine fault masking. Byzantine fault diagnosis will be

discussed in section 5.5.2, first with a centralized method and then with a

decentralized method. Coordination of access to the spares pool will be discussed in section 5.5.3. Section 5.5.4 will discuss fault tolerance of nodes which perform system I/O, as nodes with hardwired I/O cannot be replaced by spares. Finally, section 5.5.5 will discuss modifications to this scheme which are necessary if processors with dissimilar hardware are to be used within repsets.

5.5.1 Byzantine Fault Masking.

Assume that two repsets are to communicate as in Figure 5.5.1-1a. The problem of masking a Byzantine fault in a repset processor may be handled in one of two ways:

1) Complete communication - each member of a sending repset sends a message to every member of the receiving repset. The three receivers

then exchange the values and vote on them to arrive at the value of the

sender with Byzantine failures masked. For this purpose, the

unauthenticated algorithm of [Lamp82] may be used, with each

sending processor in turn acting as the general and the three receiving

processors acting as the lieutenants. When values have been arrived at for all three sending nodes, the results are voted (see Figure 5.5.1-1b).

2) Reduced communication - a repset may arrive at Byzantine agreement

upon the values before sending them to another repset, using the

authenticated algorithm of [Lamp82]. After arriving at distributed agreement in the sending repset, the values are sent by each sender to 117

only one member of the receiving repset (its peer processor), where

the values are voted once again using an algorithm for Byzantine

agreement (see Figure 5.5.1-1c). The second interactive agreement is necessary because a Byzantine failed processor in the sending repset

may send a false value to its peer processor in the receiving repset.

a. Abstracted communication    b. Complete repset to repset communication    c. Reduced repset to repset communication

Figure 5.5.1-1. Sending messages between repsets.

Since all processors in a repset execute the same sequential process, nondeterminacy must be resolved in the same way in the different copies of the hardware, requiring that all input messages be kept consistent in the

different copies. The input messages may be kept consistent using the algorithm described in [Manc86b].

The latter method described above requires that somewhat fewer messages be sent in total. However, the former method will result in more diagnostic information being collected, which may be of use in Byzantine fault diagnosis.
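A much simplified sketch of the complete-communication case is given below. Each of the three receivers keeps the value it received directly from a sender plus the two copies relayed by its peer receivers, takes a per-sender majority (one round of the unauthenticated exchange, with a default value when no majority exists), and then votes across the three agreed sender values. Synchronization, authentication, and the full recursive algorithm of [Lamp82] are deliberately left out.

```python
from collections import Counter

DEFAULT = None    # default used when a receiver cannot find a majority for a sender

def majority(values):
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else DEFAULT

def receiver_decision(direct, relayed_by_peers):
    """One receiver's view: for each sender, take the majority of its own copy and
    the copies relayed by its two peer receivers, then vote across the senders."""
    agreed = []
    for sender in range(len(direct)):
        copies = [direct[sender]] + [peer[sender] for peer in relayed_by_peers]
        agreed.append(majority(copies))      # masks one Byzantine sender or relayer
    return majority(agreed)                  # final vote across the three senders

# Example: sender 2 is Byzantine and tells each receiver something different;
# every receiver honestly relays what it actually received.
received = [[5, 5, 9],    # values receiver 0 got from senders 0, 1, 2
            [5, 5, 7],    # receiver 1
            [5, 5, 8]]    # receiver 2
for r in range(3):
    peers = [received[p] for p in range(3) if p != r]
    print(receiver_decision(received[r], peers))   # each receiver decides 5
```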

5.5.2 Byzantine Fault Diagnosis.

In the systems described in chapter 2, Byzantine faults are masked but not diagnosed. As the number of repsets in the system grows, the chance of a fault occurring in a repset which already has a Byzantine fault present

increases. There are three possible approaches to this problem:

1) the fault tolerance level of the repsets may be increased to handle

multiple Byzantine faults,

2) the fault tolerance level of the repsets may be increased to handle

some Byzantine faults and some benign faults, or

3) an attempt may be made to diagnose the Byzantine faults and to

reconfigure the repset to eliminate them. To handle f Byzantine faults, a repset must have 3f + 1 members if using unauthenticated messages and 2f + 1 members if using authenticated messages [Lamp82]. A model for systems with dual failure modes has been described by

Meyer and Pradhan [Meye87]. It is shown that to tolerate up to f faults, with up to m of those faults malicious, the repset must have n members with n > f + 2m. While it is always possible for a malicious processor to behave in such a way as to elude diagnosis, this is improbable, so an attempt may be made to diagnose Byzantine faults [Shin87]. Byzantine diagnosis may be considered as a problem similar to intermittent fault diagnosis. First the processors arrive at interactive consistency upon their values.

If any processor must have its value set to the default value of the interactive consistency algorithm, then clearly it has a Byzantine failure. More elusive failures may be found by having each processor compare the value it had for the other processor's internal value with the value which that processor sent it

during the message rounds. Inconsistencies are used by each processor to

prepare a list of suspicious processors. The lists of suspicious processors are

combined to arrive at a fault syndrome. The Byzantine faults may then be

considered as intermittent faults, and the results for ti-fault diagnosability used to ensure correct (although possibly incomplete) diagnosis. Shin describes a

method to collect the syndromes for evaluation which limits the number of

Byzantine faults which a processor may exhibit, by not allowing any more

communication between two processors if one accuses the other of sending it

a value inconsistent with its actual value. The same result could be achieved

by remembering the accusations for some time.
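The bookkeeping just described might be organized roughly as in the following sketch: each processor accuses any peer whose directly sent value disagrees with the value the processor concluded that peer holds after the exchange, and the per-processor suspicion lists are merged into a syndrome. The data layout and the two-accuser threshold (used so that a single lying accuser cannot condemn a good processor) are assumptions for illustration; the actual diagnosis would rely on the ti-fault diagnosability results cited above.

```python
def suspicion_list(direct_values, agreed_values):
    """Peers whose directly sent value disagrees with the value this processor
    concluded they hold after the interactive consistency exchange."""
    return {peer for peer in direct_values
            if direct_values[peer] != agreed_values.get(peer)}

def fault_syndrome(all_lists, accusers_needed=2):
    """Merge the per-processor suspicion lists; a processor accused by at least
    accusers_needed others is entered into the syndrome as suspect."""
    counts = {}
    for accuser, suspects in all_lists.items():
        for suspect in suspects:
            if suspect != accuser:
                counts[suspect] = counts.get(suspect, 0) + 1
    return {p for p, c in counts.items() if c >= accusers_needed}

# Hypothetical four-processor repset in which processor 3 tells its peers
# different things while the agreed (voted) value for it is 9.
lists = {
    0: suspicion_list({1: 10, 2: 10, 3: 7}, {1: 10, 2: 10, 3: 9}),
    1: suspicion_list({0: 10, 2: 10, 3: 8}, {0: 10, 2: 10, 3: 9}),
    2: suspicion_list({0: 10, 1: 10, 3: 6}, {0: 10, 1: 10, 3: 9}),
    3: suspicion_list({0: 10, 1: 10, 2: 10}, {0: 10, 1: 10, 2: 10}),
}
print(fault_syndrome(lists))   # {3}
```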

Where should the fault syndromes be collected? One possibility would

be to have centralized fault diagnosis, in which all suspicious processor lists

would be sent to a single repset which acts as the diagnosis center for the

entire system. Using a repset as the diagnosis center ensures that the diagnosis will be performed in a fault-tolerant fashion.

While centralized diagnosis is likely to be the most complete, the diagnosis algorithms increase in complexity as O(|E|). As the system size grows, the algorithm could prove to be a bottleneck. In addition, the centralized diagnosis repset would be flooded with messages in a large system.

It may be necessary to break a large system down into smaller diagnosis units, which may be done by using each repset as a diagnosis unit.

Since diagnosis will be distributed, it is necessary that all lists of suspicious

processors be shared using an interactive consistency algorithm. If complete

repset to repset communication is used, then the processors in the receiving

repset should each construct a list of suspicious processors and then send that list to each of the processors in the repset which sent them the message.

If reduced repset to repset communication is used, then the processors may

attempt diagnosis based only upon the interactive consistency algorithms they

participate in with members of their own repset. If reduced communication is

used, care must be taken not to diagnose a processor as failed because it

sends out an inconsistent input value during execution of the input

consistency algorithm - the fault may lie in the processor in the sending repset.

5.5.3 Spare Processor Coordination.

In the case of centralized diagnosis, the same repset which handles diagnosis may also handle reconfiguration. After diagnosing a faulty processor, the diagnosis repset commands a spare processor to join the repset of the failed processor. It then informs the members of that repset and its interfacing repsets of the change. In the case of decentralized diagnosis, the reconfiguration may still be handled in a centralized fashion. This is possible because the failure rate of the processors is low enough that even in a system with thousands of processors, centralized reconfiguration should not be a severe bottleneck. When a processor is diagnosed as failed by the other members of its repset, they communicate with the centralized reconfiguration repset to inform it of the need for a spare. The reconfiguration repset then continues as described in the above paragraph.

Several additional steps are necessary when a spare processor joins a repset:

1) The new member synchronizes its clock with that of the other repset members.

2) The application code and current state data are sent to the new member by the other members of the repset.

3) The new member votes the application code and state data before initializing itself. In repsets which can mask multiple faults, this will guard


against an undiagnosed Byzantine failed member sending the new

member the wrong initialization code and/or data.

This reconfiguration process will probably take a significant amount of time. As the number of repsets in the system increases, the chance of a second failure prior to reconfiguration of the failed processor increases. Thus it may be necessary to make each repset tolerant to multiple faults. The number of faults each repset should tolerate can be determined with the reliability model of section 5.6.

5.5.4 I/O Hardware Fault Tolerance.

In a system with a large amount of I/O it is impractical to hardwire every I/O device to every processing node in the system. I/O in MAFT was accomplished by having all I/O devices on redundant buses, with any processor capable of accessing the bus. Since the bus may become a bottleneck, a different I/O scheme is proposed for the URMC.

I/O will be hardwired or bused to special I/O repsets. If more I/O is required than one repset can handle, then several repsets will be used, each with a different subset of the total I/O wired to them. Each processor in the I/O repset is connected to the other repsets of the multicomputer by the virtual completely-connected network to be described in section 6. However, a failed I/O processor cannot be replaced from the pool of spare processors, as the spares do not have the I/O wired to them. Thus the I/O repsets must gradually decrease in size due to attrition. This may cause the I/O repsets to need more members than the processing repsets, given that the reliability of an I/O node is the same as that of a processing node. Two problems cause the message traffic between members of the I/O repset to be very high: the I/O repset should be able to handle interrupts from the I/O devices with which it interfaces, and not much processing is performed between message exchanges (all the processors do is vote on the values). Because the members of an I/O repset are static (failed I/O processors are not replaced), it is practical for them to be interconnected by their own private completely connected I/O repset communication network. This would keep much of the traffic of the I/O devices off of the virtual completely connected network and reduce interrupt latency.

5.5.5 Support for Dissimilar Processors.

Design errors are usually considered as a software fault problem, but it

is also possible to have design errors in the hardware used for the

processors. It has been argued that if different software versions are used, then the underlying hardware will probably be in different states and the

chance of triggering a design error in multiple copies of the hardware is small

[Aviz86]. However, with the exception of the Airbus A320, the example

systems of chapter 2 compare the results of dissimilar processors to detect

faults (while the A320 system does use two different types of processors,

comparisons are made between similar processors running dissimilar

software). Therefore it may be assumed that the URMC may be called upon to use dissimilar processors as well. This can be handled simply by requiring that each member of a single processing repset be a different type of processor. The total number of processor types required will be the same as the number of member processors in a repset. Suppose all repsets have four members, using processor types A, B, C, and D. The spare processors will be broken down into four pools of spares, one for each type. When a processor failure is diagnosed, the reconfiguration repset replaces it with a processor of the same type. Having dissimilar processors requires that processors in a repset be synchronized less tightly than the instruction level. Since I/O will usually be simple, similar processors may be used in the

I/O repsets even if dissimilar processors are used in the processing repsets; the processors could then be synchronized at an instruction level. The reduced interrupt latency, extra speed, and convenience of synchronizing the

processors on an instruction level may be worth the loss of resistance to generic faults. If it is determined that dissimilar processors should be used for

I/O as well, then the scheme may be modified to allow frame synchronization

instead of instruction synchronization, at the expense of longer interrupt

latency and reduced coverage of nongeneric hardware faults.

5.6 Reliability Model.

The reliability model breaks the hardware faults into two types as in the dual failure mode model of Meyer and Pradhan [Meye87]. In their article, faults

were classified as either benign or malicious. Benign faults cause the

processor to stop operating entirely, while malicious faults may result in arbitrary processor behavior. If only benign faults are permitted and a proper diagnosis scheme is used, then there may be as few as one processor left operating without causing a system failure. To tolerate f benign faults there need be only f + 1 processors. To tolerate f faults with at most m of them malicious, there must be more than f + 2m processors [Meye87]. The URMC processing nodes may be divided into three groups: nodes in I/O repsets, nodes in processing repsets, and nodes in the pool of spares.

The hardware failure rate will be represented as the Greek letter lambda. The two different types of faults will be differentiated by the subscript on the letter lambda, with lambda sub M representing malicious faults and lambda sub B representing benign faults. Both fault types together will be represented by lambda sub H. The diagnosis rate of a malicious fault is represented by the

Greek letter beta, and the repair rate by the Greek letter mu.

Failed nodes in the 1/0 repsets are not replaced from the pool of

spares. Therefore the reliability of each 1/0 repset may be modeled as if it were a stand alone system. A four node example system is shown in Figure 5.6-1. In this diagram, there are seven states. They are:

1) all four nodes of the repset correct,

2) three nodes correct with one benign failure,

3) three nodes correct with one malicious failure,

4) two nodes correct with two benign failures,

5) two nodes correct with one malicious failure and one benign failure,

6) complete repset failure.

Figure 5.6-1. Reliability of a 4-node I/O repset.

It is assumed that the system begins with all four nodes OK. If a benign failure occurs, the system diagnoses it quickly enough that the diagnosis time is ignored here, and the system switches to 3 OK with one permanently failed. If a malicious failure occurs, the system moves to the 3 OK, 1 intermittently failing state. A malicious failure takes some time to diagnose, and the diagnosis rate is represented by beta. If the malicious fault is diagnosed, or the same node develops a benign failure, the system moves to the 3 OK, 1 permanent failure state. A second malicious failure before the first is diagnosed results in failure of the repset. Other state transitions are as marked. Note that lambda sub H equals the total hardware failure rate, including both malicious and benign failures.

The model for the processing repsets is more complicated due to their interaction through the pool of spares and its monitor. Therefore a simplified,

approximate model is proposed, in which the processing repsets and the spares pool are modeled separately. Each processing repset may be modeled as shown in Figure 5.6-2. It is assumed that the spares pool is infinite when modeling a single processing repset. The model is like that for an I/O repset except for the repair rates, mu.

Figure 5.6-2. Markov model for 4-node processing repset.

There is a possibility that a spare may have an undiagnosed malicious failure when it is taken into the repset. This is not shown in the model in Figure

5.6-2. To include this possibility in the model, the μ transition may be divided

into two.
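These Markov models can be evaluated numerically by building the generator matrix and computing a matrix exponential. The sketch below (Python with NumPy and SciPy, for illustration only) uses the six states listed above for the I/O repset; the transition list is a partial, illustrative reading of Figure 5.6-1, and both the rate values and the scaling of λ by the number of surviving nodes are assumptions rather than values taken from the figure.

import numpy as np
from scipy.linalg import expm

lam_B, lam_M = 0.9e-4, 0.1e-4   # illustrative benign/malicious failure rates (per hour)
beta = 360.0                    # illustrative diagnosis rate of a malicious fault (per hour)

states = ["4 OK", "3 OK, 1 benign", "3 OK, 1 malicious",
          "2 OK, 2 benign", "2 OK, 1 malicious + 1 benign", "FAIL"]
transitions = [
    (0, 1, 4 * lam_B),   # benign failure of any of the four nodes
    (0, 2, 4 * lam_M),   # malicious failure of any of the four nodes
    (2, 1, beta),        # malicious fault diagnosed; node treated as permanently failed
    (2, 5, 3 * lam_M),   # second malicious fault before diagnosis; repset fails
    (1, 3, 3 * lam_B),   # further benign failure
    (1, 4, 3 * lam_M),   # further malicious failure
    (4, 3, beta),        # second malicious fault diagnosed
    # ... remaining transitions as marked in Figure 5.6-1 ...
]

Q = np.zeros((len(states), len(states)))
for i, j, rate in transitions:
    Q[i, j] += rate
np.fill_diagonal(Q, -Q.sum(axis=1))       # generator rows sum to zero

p0 = np.zeros(len(states)); p0[0] = 1.0   # start with all four nodes OK
p_t = p0 @ expm(Q * 10.0)                 # state probabilities after a 10-hour flight
print("P(repset failure in 10 h) ~", p_t[-1])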

Finally, the spares pool may be modeled by assuming that every failure

results in a spare being removed from the pool. After the pool has been

emptied, then it is assumed that all successive failures are malicious and take

place in the same processing repset. For a case with four members in the

processing repset, then the second failure after exhausting the spares causes

failure (see Figure 5.6-3). This should provide a more pessimistic answer than

reality. The system shown is a special type of Markov model known as a pure

death process with linear rate. Its reliability may also be found by considering

it to be an m-out-of-n system, where m is the number of processors in all

working repsets minus the number of malicious failures a single repset can

stand, and n is the total number of working processors and spares.



Figure 5.6-3. Markov model for spares pool reliability.

The final model for the processing node hardware is shown in Figure

5.6-4. If there are i I/O repsets and j processing repsets, then the failure rate is

i times the failure rate of an I/O repset from Figure 5.6-1, plus j times the failure

rate of a processing repset from Figure 5.6-3, plus the failure rate of the

spares pool from Figure 5.6-3.



Figure 5.6-4. URMC processing node reliability diagram.
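The combination just described amounts to treating the i I/O repsets, the j processing repsets, and the spares pool as independent series elements. A minimal sketch follows; for the small failure probabilities of interest the product form used here agrees with the sum of failure rates stated above, and the numeric inputs are placeholders only.

def node_hardware_failure(p_io, p_proc, p_spares, i, j):
    """Failure probability of the series system of Figure 5.6-4: i I/O repsets,
    j processing repsets, and the spares pool, treated as independent blocks."""
    r_total = ((1.0 - p_io) ** i) * ((1.0 - p_proc) ** j) * (1.0 - p_spares)
    return 1.0 - r_total

# illustrative values only; the inputs come from the individual models above
print(node_hardware_failure(p_io=6e-11, p_proc=6e-11, p_spares=1e-10, i=2, j=3))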

The models above may be used to determine acceptable values for the failure rates. The ratio between the malicious and benign failure rates is especially important, as a repset may withstand many more benign failures than malicious. This indicates that care should be taken to include extensive self-diagnosis capabilities, such as background testing, watchdog timers, ticket checks, etc. Also, self-checking pairs could be used at each node, which would make the coverage very close to 100%.

Another way to reduce the rate of failure due to too many malicious failures is to make the Byzantine diagnosis rate high. This indicates that probably the complete communication model should be used for repset to repset messages because of its greater diagnostic capability, although this will cause more traffic on the communication network.

Because I/O processors cannot be replaced by spares, there must be more of them in a repset, they must be made highly reliable, or both.

Otherwise, they will become the weak point in the system. It is proposed that the I/O nodes in a single I/O repset be completely-connected with their own high-speed network in addition to the links to the rest of the system.

6. Network Fault Tolerance.

The processing nodes of the URMC communicate with messages passed through an interconnection network. This network provides the services of a "perfect" completely-connected network: any node may send a message to any other node, and it is guaranteed that it will be delivered within some time interval. The interconnection network must provide this service despite hardware faults in the interconnection network or in relay points. The interconnection network should also be physically spread out, so that it is not liable to destruction by a single incident of damage to the airplane. In addition, further requirements have been imposed upon the communication system by the implementations chosen for higher levels of abstraction.

Many fault-tolerant interconnection networks have been proposed.

From this array of alternatives it is impossible to say which is optimal with respect to use in a given application such as flight control. Therefore, several of the more commonly used alternatives will be presented and a subjective argument used to choose one for use in the URMC. This selection will then be analyzed for reliability and performance to verify that it is an acceptable alternative.

Section 6.1 describes additional requirements, derived from the choice of implementation for the software fault tolerance and processing node fault


tolerance, which must be placed on the URMC communication system fault tolerance layer(s). Section 6.2 reviews the most common architectures for interconnecting the processing nodes of a multicomputer, and chooses the unidirectional link fault-tolerant graph (FG) network for further investigation. Section 6.3 describes the interconnection rules and fault-tolerant routing strategy for the FG network. Section 6.4 covers reliability analysis of the network, section 6.5 covers performance analysis, and section 6.6 summarizes the results of this chapter.

6.1 Network Requirements.

The algorithm for Byzantine agreement used by the processing nodes of the URMC places the following requirements upon the communication network [Lamp82]:

1) every message that is sent is delivered correctly,

2) the receiver of a message knows who sent it, and

3) the absence of a message can be detected.

The first requirement will be met if the network appears to be a fault-free completely-connected network, as already stated in the network requirements of chapter 3. This will be accomplished by allowing the node hardware fault tolerance layer to make a call to the communication layer containing a message to send and its destination. Using a series of layers similar to the OSI network, the message will be broken into packets, routed, and received.


The second requirement will be met if authenticated communication is used at the node fault tolerance level, and it has already been decided in chapter 5 that this will be used to allow Byzantine fault diagnosis. Each message will be signed by the sending processor with a digital signature

[Rive78].
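The sketch below illustrates only the sign-and-verify pattern implied by this requirement. The thesis proposes true digital signatures in the sense of [Rive78]; a keyed MAC from the Python standard library is used here solely to keep the example self-contained, and the shared key table is a hypothetical stand-in for a real key-distribution or public-key scheme, under which receivers would need only the sender's public key.

import hmac, hashlib

NODE_KEYS = {"node-3": b"secret-key-of-node-3"}   # hypothetical key table

def sign(sender: str, payload: bytes) -> bytes:
    """Tag a message so receivers can attribute it to its sender."""
    return hmac.new(NODE_KEYS[sender], payload, hashlib.sha256).digest()

def verify(sender: str, payload: bytes, tag: bytes) -> bool:
    expected = hmac.new(NODE_KEYS[sender], payload, hashlib.sha256).digest()
    return hmac.compare_digest(expected, tag)

msg = b"surface command: elevator +2.5 deg"
tag = sign("node-3", msg)
assert verify("node-3", msg, tag)             # authentic message accepted
assert not verify("node-3", msg + b"x", tag)  # altered message rejected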

The absence of a message can be detected by synchronizing the

processors and guaranteeing that a message will be delivered within some

upper limit on time. The members of a repset must be synchronized using a

distributed fault tolerant clock synchronization algorithm. These algorithms are

evaluated with respect to agreement (how closely correct clocks agree) and

accuracy (how closely correct clocks follow real time). For most real-time systems, the former may be reduced at the expense of the latter. However, we

wish to make the design constraints on the interconnection network as tight as possible, so assume that we need optimal accuracy, i.e., the accuracy of the

logical clocks is bounded only by the accuracy of the physical clocks. One

such algorithm is described in [Srik87]. If the optimal accuracy version of the

algorithm is used, the maximum skew between processors is given by:

(6.1-1)

where Dmax is the maximum difference between clocks in correct processors,

tdel is the maximum time delay in delivering a packet, r is the bound on the

accuracy of the hardware clock, and dr is the rate of drift of the hardware

clock. Increasing skew between processors increases the delay in arriving at

distributed agreement, and thus delays the availability of output. Abandoning a completely-connected network (with a diameter of 1) can greatly increase the diameter and hence the delay in message passing. Messages must be sent from node to node through a short distance, so the diameter of the URMC interconnection network should be kept as small as possible. The main requirements for the interconnection network are high reliability and performance, and methods for evaluating these are discussed in sections 6.4 and 6.5, respectively.

6.2 Interconnection Network Overview.

As explained by Pradhan [Prad85], a point-to-point interconnection network may be represented as a graph G(V,E), with V the set of all vertices in the graph, labeled i, and E the set of all edges, with edge (i,j) present only if there is a connection between vertex i and vertex j. From these graphs an architecture may be developed which is either link-oriented or bus-oriented. In a link-oriented graph, the vertices are considered as processors and the edges are links between the processors. Communication takes place by making hops from processor to processor through the interconnecting communication links. In a bus-oriented system, each node of the graph represents a bus, and the edges are processors which link the buses. Communication takes place by making hops from bus to bus through the interconnecting processors. Figure 6.2-1 shows how a graph, G, may be turned into either a link-oriented architecture,

LA(G), or a bus architecture, BA(G). There are many possible graphs for networks, only a few of which have been studied in any great detail.




Figure 6.2-1. Link and bus architectures.

There have been several overviews of interconnection networks. Pradhan discusses such architectures as the shared bus, shared memory, loop, tree, dynamically reconfigurable networks, binary cube, and fault-tolerant graph (FG) networks in [Prad86]. Uhr describes the pipeline, ring, star, mesh, pyramid, tree, and hypercube [UhrL87]. Quinn adds the binary shuffle-exchange, cube-connected cycles, and butterfly [Quin87]. An overview of multistage interconnection networks (MIN) is given in [Feng81], and an


overview of fault-tolerant MIN may be found in [Adam87]. Some other networks are considered by Agrawal and Janakiram [Agra86]. According to Agrawal and Janakiram, communication networks may be evaluated with respect to the following characteristics: average distance

between nodes, number of communication links, routing algorithm, fault

tolerance, and expansion capability. The article [Agra86] investigates these

characteristics for several networks, but not all that have been mentioned above. With all of the types of interconnection possible, it is impossible to say which is optimal for the URMC. However, brief subjective arguments are

presented to narrow down the choices to one alternative which is acceptable,

the link-oriented FG network. Some networks may be eliminated because their interconnection sends

all data through a single point, which is a single point of failure, and which will

act as a bottleneck as the number of processors increases: the ring, star,

pipeline, tree, and pyramid may be eliminated this way (see Figure 6.2-2). This makes these interconnections impractical for large numbers of processors,

with current technology limiting the number of processors to about 30, or even in special cases perhaps a hundred [UhrL87].

The crossbar switch has much higher bandwidth between processors,

but its complexity grows as O(n^2) (see Figure 6.2-3). Hence the number of

processors which may be feasibly connected with a crossbar is limited to

about the same value as for the bus, ring, and star [UhrL87]. The MIN reduces the number of connections required, with complexity

growing as O(N log N), and without severe bottlenecks. Uhr suggests that it


should be possible to construct systems with 1024 or even 8096 processors, which is plenty for the URMC's purposes. However, there is still an overhead of log N switches per processor in the multicomputer, and the network should be extendable to larger computers if future technology allows.

Others may be eliminated because the distance between processors grows too quickly. The mesh, for instance, has a diameter which grows as

O(n), while the diameter of the hypercube, FG network, cube-connected cycles, and butterfly increases as O(log n). We may insist that the diameter grow as O(log n), and thus eliminate the mesh. In a hypercube, the number of connections per node is log N, while the butterfly has four and the cube-connected cycles has three. An FG network has N = r^m processing nodes, each with degree r, and with the diameter of the network growing linearly with m. The higher the degree possible in each node, the smaller the FG network's diameter. This is an advantage because the number of connections per node may be traded off against the diameter and fault tolerance of the network. The FG network is also self-routing and optimally fault-tolerant [Prad85]. This network will be considered as the preferred candidate.

Link-oriented architectures are better suited to the technology which allows the highest bandwidth, fiber-optic cables. Therefore the link-oriented FG network will be used.

The FG network considered will be the class with unidirectional links discussed by Sengupta, Sen, and Bandyopadhyay [Seng87], due to the simple fault-tolerant routing scheme available for this network.

6.3 The Unidirectional Link FG Network.

An FG network is built by augmenting an (r,m) shuffle-exchange network. Let the digraph be represented by G(V,E) as discussed above. There are r^m nodes in V (r and m positive integers), which may be numbered in radix-r as (i_0, i_1, ..., i_{m-1}). There is a directed edge from node i to node j if any of the following conditions is satisfied:

1) i_k = j_k for all 0 ≤ k ≤ m-2,

2) i_k = j_{k-1} for all 1 ≤ k ≤ m-1 and i_0 = j_{m-1},

3) if i_k = i_{k+1} for all 0 ≤ k ≤ m-2, then j_k = (i_k + 1) mod r for all 0 ≤ k ≤ m-1.

The first condition provides the exchange connections, the second condition provides the shuffle connections, and the third condition is the augmentation of the (r,m) shuffle-exchange network. Figure 6.3-1 shows a

(2,3) FG network.

Figure 6.3-1. A (2,3) FG network.
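The three conditions translate directly into a successor function. The sketch below (Python, illustrative only) follows the cleaned-up statement of the conditions given above, reading condition 2 as a left cyclic shift and condition 3 as a link between the nodes whose digits are all equal; this reading should be checked against [Seng87].

from itertools import product

def successors(i, r):
    """Out-neighbors of node i (a radix-r address tuple) in the (r, m) FG network."""
    succ = set()
    # condition 1: exchange links; agree in digits 0..m-2, last digit free
    for d in range(r):
        succ.add(i[:-1] + (d,))
    # condition 2: shuffle link; j is the left cyclic shift of i
    succ.add(i[1:] + i[:1])
    # condition 3: augmentation between the nodes whose digits are all equal
    if all(d == i[0] for d in i):
        succ.add(tuple((d + 1) % r for d in i))
    succ.discard(i)   # drop self-loops
    return succ

r, m = 2, 3
for n in product(range(r), repeat=m):
    print(n, "->", sorted(successors(n, r)))

For the (2,3) network of Figure 6.3-1 this yields out-degree 2 at every node, consistent with the degree-r property cited above.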

The FG network graph, G, has the following desirable characteristics:

1) the graph has a diameter of 2m - 1,

2) G has connectivity r, the maximum possible given nodes of degree r,

which gives tolerance to r - 1 faults, and

3) if up to r - 2 nodes are removed from G and all of the edges to and

from these nodes deleted, the resulting graph has a maximum increase

in diameter of only 2.

Property 1) can be proven by presenting the normal routing scheme in

the network. Suppose a message is to be sent from source (s_0, s_1, ..., s_{m-1}) to

destination (d_0, d_1, ..., d_{m-1}). Then we may define the following path, of length no more than 2m - 1:

(s_0, s_1, ..., s_{m-2}, s_{m-1})

(s_0, s_1, ..., s_{m-2}, d_0)

(s_1, s_2, ..., s_{m-2}, d_0, s_0)

(s_1, s_2, ..., s_{m-2}, d_0, d_1)

    ...

(d_0, d_1, ..., d_{m-2}, s_{m-2})

(d_0, d_1, ..., d_{m-2}, d_{m-1})

This is called the normal path, denoted by np(s,d). If there are no faults,

np(s,d) is a path of length at most 2m - 1. Routing is simple if the message

format contains the destination address, as each node may route the message packet by examining the destination and comparing it to its own address. Property 2 is discussed in both [Prad85] and [Seng87], but will not be treated here, as when there are r faults the route may be up to 6m - 3 in length. We will consider only the case where there are r - 2 faults or fewer, in which case property 3 keeps the maximum routing distance to 2m + 1. For routing with an increase of no more than two in diameter, consider

again the case of a message from source (s_0, s_1, ..., s_{m-1}) to destination (d_0, d_1,

..., d_{m-1}). The message is sent by multiple paths, called multicasting. For the first step, send the message packet to all r - 1 of the locations which differ from the source in the least significant digit. The message is now in r nodes,

labeled (s_0, s_1, ..., s_{m-2}, X). Then, route the message over the r normal paths

from these nodes to the r nodes with the addresses (d_0, d_1, ..., d_{m-2}, X). Finally, route the messages from the r - 1 nodes which differ from the destination by

one digit to the destination. Unfortunately, the r paths are not guaranteed to be node-disjoint, so failure of a single node may cause the loss of more than one

path from source to destination. Fortunately, Sengupta, Sen, and

Bandyopadhyay have shown that as long as there have not been more than r

- 2 failures, at least one of these paths will be available [Seng87].
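The normal path and the multicast routing just described may be sketched as follows. The reading of "least significant digit" as the last digit of the address (the digit varied by the exchange connections) is an assumption; everything else follows the construction above, and the example addresses are illustrative only.

def normal_path(s, d):
    """np(s, d): the fault-free route of length at most 2m - 1, alternating an
    exchange step (write the next destination digit into the last position)
    with a shuffle step (left cyclic shift)."""
    m = len(s)
    path, cur = [s], s
    for k in range(m):
        nxt = cur[:-1] + (d[k],)          # exchange: write d_k into last digit
        if nxt != cur:
            path.append(nxt); cur = nxt
        if k < m - 1:
            nxt = cur[1:] + cur[:1]       # shuffle: left cyclic shift
            if nxt != cur:
                path.append(nxt); cur = nxt
    return path

def multicast_paths(s, d, r):
    """The r routes used by the fault-tolerant multicast: fan out over the
    exchange links of the source, follow a normal path to the node matching
    the destination in all but the last digit, then take the final exchange link."""
    paths = []
    for x in range(r):
        first = s[:-1] + (x,)             # (s_0 ... s_{m-2}, X)
        mid_target = d[:-1] + (x,)        # (d_0 ... d_{m-2}, X)
        path = [s] if first != s else []
        path += normal_path(first, mid_target)
        if mid_target != d:
            path.append(d)
        paths.append(path)
    return paths

for p in multicast_paths((0, 0, 1), (1, 1, 0), r=2):
    print(" -> ".join(str(n) for n in p))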

For most graphs, there is a concern as to how to map the processes to the processors so as to reduce the average distance between communicating

processes. This problem is ignored for the URMC because the multicasting

technique greatly reduces the gains which could be realized by placing communicating processes in adjacent processors: one of the paths will be made very short, but the others will still be longer. However, there may be some mappings which are better than others, and this should be investigated in future work.

6.4 Reliability Analysis.

Every message that is sent by a fault-free processor should be received by any fault-free processor which is supposed to receive it. As described above, a route exists between any pair of nodes with length at most 2m + 1 if there are no more than r - 2 failures in the interconnection network. To avoid having to diagnose failures, it is proposed that the fault tolerance be utilized through multicasting. It is suggested that the messages be sent with error detecting and correcting codes sufficient to ensure that an incorrectly transmitted message will be detected with a probability high enough to meet the reliability goal of the communication network. This allows the network to correctly pass messages between a pair of processors as long as at least one fault-free route remains. Therefore, the reliability of an FG network may be found using a standard formula for an m-out-of-n system, where n is r^m and m is r^m - (r - 2). The reliability of an m-out-of-n system is given by [Siew82]:

R = Σ_{i=0}^{N-M} C(N,i) · Rm^(N-i) · (1 - Rm)^i                (6.4-1)


where Rm is the reliability of a single message-passing node. The higher the desired reliability for the communication network, the larger r should be. In order to avoid the complication of passing messages through the main processing element at each node and to allow the reliability of the message-passing to be increased, message-passing should be handled by a separate processor at each node, as depicted in Figure 6.4-1.

Figure 6.4-1. Node architecture.
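Equation (6.4-1) is straightforward to evaluate directly; the sketch below does so, with purely illustrative numbers for m, n, and the single-node reliability Rm.

from math import comb

def m_out_of_n_reliability(m, n, rm):
    """Probability that at least m of n identical elements survive (eq. 6.4-1)."""
    return sum(comb(n, i) * rm ** (n - i) * (1.0 - rm) ** i
               for i in range(n - m + 1))

# illustrative values only: 25 message-passing nodes, 3 failures tolerated
print(1.0 - m_out_of_n_reliability(22, 25, 0.999))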


6.5 Performance Analysis.

Since the repsets are synchronized through message passing, the message delay must be small. A discrete event simulation of the network should be prepared to ensure that the bandwidth on the interconnections is high enough and that the routing is performed quickly enough. To avoid having to run a computer simulation, a queueing model can be constructed for the system. This is subject to several assumptions which may not be very accurate, but they are necessary to allow an analytical solution. A basic queueing process is shown in Figure 6.5-1. The queueing process is characterized by the input source size, the statistical pattern which describes how the customers arrive, the queue size and discipline, and the service mechanism. The simplest model is attained if the input source is infinite, customers arrive as a Poisson process, the queue is infinite with a first come first served discipline, and the service mechanism serves the customer with an exponential distribution. These assumptions are particularly necessary in the case of analyzing queueing networks, so we will make assumptions to fit this.


Figure 6.5-1. The basic queueing process.

To allow analysis of the network as a network of queues, the following assumptions are made:

1) Each node generates message packets as a Poisson process with rate λ.

2) Each node sends messages to each of the other nodes with equal likelihood.

3) Each message is sent by r different routes as described in the fault-tolerant routing strategy.

4) At the source node, the message packet is placed in the queue for every outgoing link, thus multicasting the message packets.

5) Every message takes 2m + 1 steps to arrive at its destination (if one arrives before then, it will decrease the load on the network, so this is a pessimistic assumption).

6) There is a first-come-first-served queue for every output link from a node.

7) The lengths of the message packets follow an exponential distribution, and the time for a node to forward a message packet is directly proportional to its length. This gives exponential service times.

8) Each node input places the message packet into the correct queue for the output the packet is to go on next. Conflicts between inputs attempting to simultaneously place messages in the output queue are ignored.

With these assumptions, each node of the network may be modeled as a queueing system as shown in Figure 6.5-2.


Figure 6.5-2. Queueing system formed by one node of network.

A useful property of Poisson processes is that if merged or split they

yield Poisson processes. In the network each node has a source which

originates messages at a rate λ. These messages are placed in every

one of the r outputs so that they can be routed as described in the fault-tolerant routing strategy. The messages travel for 2m + 1 hops each. At each node, the messages bound for that node are split off, then all messages are

routed through a crossbar to their output, and finally each of the r output

streams is merged with the messages originating in the node. The messages

originate at a rate λ. Each message is sent out r different outputs, so the rate per output is λ. Since each message travels at most 2m + 1 hops, the message rate on each link is at most (2m + 1) × λ.


We can define the utilization of a queueing system, ρ, to be the arrival rate λ divided by the service rate μ. If the utilization is less than one, then the queueing system will eventually arrive at a steady-state condition. In this case the distribution of waiting times for messages in the queueing system (waiting time in queue plus service time) is given by [Hill74]:

P{W > t} = e^{-μ(1 - ρ)t},   for t ≥ 0                (6.5-1)

The nodes of the network taken together form a network of queues. Networks of queues are difficult to analyze except for special cases; the special case we use here takes advantage of the fact that the assumptions above make the network a feedforward queueing network. In that case, the equivalence property simplifies the analysis [Hill74]:

EQUIVALENCE PROPERTY: Assume that a service facility has a Poisson input with parameter λ and an exponential service-time distribution with parameter μ, where μ > λ. Then the steady-state output of this service facility is also a Poisson process with parameter λ.

Therefore each queue produces customers as a Poisson process with the same average rate at which they arrive; each node may be analyzed

separately as a stand-alone queueing system. It is now possible to find the cumulative distribution for the entire system. First the probability density function for a queue wait must be found, by taking the derivative of the cumulative distribution function above:


P{W = t} = d/dt P{W ≤ t}

         = d/dt (1 - e^{-μ(1 - ρ)t})

         = μ(1 - ρ) e^{-μ(1 - ρ)t}                (6.5-2)

Then the system pdf may be found by convolving the pdf of each queue, and the system pdf integrated to find the system cdf. Multiplying the Laplace transforms of two functions is the same as convolution in the time domain, so first take the Laplace transform of the pdf for a node queue, raise it to the 2m + 1 power, then take the inverse Laplace transform. Referring to a table of Laplace transforms [Spie65] we find:

P{Wsys = t} = P{Wnode = t} * P{Wnode = t} * ... * P{Wnode = t}

            = L^{-1}[ ( μ(1 - ρ) / (s + μ(1 - ρ)) )^(2m + 1) ]

            = [μ(1 - ρ)]^(2m + 1) · t^(2m) · e^{-μ(1 - ρ)t} / (2m)!                (6.5-3)

The system wait cdf may then be found by integrating the system wait pdf:

P{Wsys < t} = 1 - e^{-μ(1 - ρ)t} Σ_{k=0}^{2m} [μ(1 - ρ)t]^k / k!                (6.5-4)

The required service rate for the messages can be found as follows (a minimal sketch of this procedure follows the list):

1) Pick an acceptable delay for delivery of a message, t.

2) Determine how small the probability of exceeding the delay t should be

over the flight time.

3) Determine the rate at which messages are generated at each node,

λ, and the maximum number of hops a message must travel, 2m + 1.

4) Solve for μ, the allowable service rate at each node.

5) Find the average length of a message. Find the bit rate of the

communication links as the rate at which the node must send an

average-length message to meet the required service rate, μ.
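A minimal sketch of this procedure is given below, using the system-wait distribution of equations (6.5-1) through (6.5-4). The per-link arrival rate (2m + 1)λ and the utilization ρ = (2m + 1)λ/μ follow the assumptions listed earlier; all numeric inputs are illustrative placeholders, and step 4 is solved by simple bisection.

from math import exp, factorial

def p_wait_exceeds(t, mu, lam, m):
    """P{Wsys > t} for a message crossing 2m + 1 queues (complement of eq. 6.5-4)."""
    hops = 2 * m + 1
    rho = hops * lam / mu                 # utilization of each output queue
    assert rho < 1.0, "queues are unstable at this service rate"
    a = mu * (1.0 - rho)
    return exp(-a * t) * sum((a * t) ** k / factorial(k) for k in range(hops))

def required_service_rate(t_max, p_target, lam, m):
    """Smallest service rate mu (messages/s) meeting the delay target (step 4)."""
    lo, hi = (2 * m + 1) * lam * 1.0001, 1e9
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if p_wait_exceeds(t_max, mid, lam, m) > p_target:
            lo = mid
        else:
            hi = mid
    return hi

# illustrative: 200 messages/s per node, a (6,2) network, 1 ms delay budget
mu = required_service_rate(t_max=1e-3, p_target=1e-9, lam=200.0, m=2)
print("required service rate:", mu, "messages/s")
# step 5: with an average message length of L bits, the link bit rate is mu * L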

6.6 Summary.

The URMC architecture will consist of a multicomputer with the nodes

interconnected in a unidirectional link FG network. It will appear to be a

completely-connected network to the node hardware fault tolerance layer. This

will be accomplished by allowing the node hardware fault tolerance layer to

make a call to the communication layer containing a message to send and its

destination. Using a series of layers similar to the OSI network, the message

will be broken into packets, routed, and received. Each message will be signed by the sending processor with a digital signature.

The FG network allows considerable flexibility by allowing the number of connections per node to be traded off with the required fault tolerance and/or network diameter. Messages from one node to another will be sent by r different routes with an error correcting and detecting code, which allows up to r - 2 faults to occur while still maintaining system connectivity. Self-routing in the network is simple to implement. The number of connections per node will be determined from the total needed reliability, the message-passing node reliability, and a standard m-out-of-n formula.

The degree of synchronization between nodes determines the maximum message delivery time. Given this, a queueing analysis provides an approximate value for the speed of the message-passing links.

7. Example URMC System.

This chapter discusses the requirements for a flight control system and describes a possible URMC configuration which meets these requirements, which will be called URMC-1. This is intended only as an example of one possible way to combine the ideas of the previous chapters.

Section 7.1 outlines the requirements for a flight control system, and section 7.2 outlines the capabilities of current technology for throughput and reliability. Section 7.3 uses the schemes and reliability models of chapters 3, 4,

5, and 6 to derive a top-level conceptual design for the URMC-1 system which meets the requirements of section 7.1. In section 7.4 the resulting URMC-1 design is compared with the aircraft flight control computer architectures reviewed in chapter 2. The results of the chapter are summarized in section 7.5.

7.1 Requirements.

In the design of MAFT it was determined that a modern flight control computer for a commercial air transport should have iteration rates up to 200

Hz, execute instructions at a rate up to 5.5 million instructions per second (MIPS), perform I/O at a rate up to 1 million bits per second (BPS),


have a transport lag (input to output delay) as short as 5 milliseconds, and have a failure probability of less than 1 × 10^-9 in a ten-hour flight [Kiec88]. The type of instruction was not specified (8, 16, or 32 bit, RISC or CISC). It will be assumed that they referred to a 16-bit CISC instruction set, which should be

more than matched by a modern-day 32-bit RISC instruction set operating at

the same MIPS rate.

7.2 Current Capability.

As the problem of reliability is the most difficult problem to deal with in

designing the URMC-1, the system will be analyzed primarily for failure rate. To

determine this we must first find values for typical failure rates, recovery rates,

and coverage using current techniques and technology. The throughput assumed possible for processing and communications hardware will also be found.

7.2.1 Software Failure Parameters.

"Experience tends to show that a reasonable expectation of bugs in a large software developed with maximum care is in the order of 10^-5 per operating hour" [RTCA83]. We may accept this as the software failure rate, but must still determine how many of these failures are correlated failures.


A study by Knight and Leveson [Knig86] found a discouraging result for the correlated error rate. Twenty-seven programs were developed and subjected to one million test cases. The results from these test cases were then grouped into threes and passed through a voter. A total of 2925 three-version systems were formed by taking all possible combinations of three programs from the 27, and the performance of each of the three-version systems was determined for the one million test cases. It was found that the three-version system failed

19 times less often than the single-version system. Due to the low probability of encountering two independent faults on the same data set, it may be assumed that the failures of the three-version system were almost entirely due to correlated faults. This gives a correlated failure rate for three versions which is approximately 19 times lower than the failure rate of a single version. This suggests that for a three-version system, approximately 1 in 19 of the faults found in a given software version will be correlated with a fault in another software version.

Later studies also showed high correlated fault rates. For instance, in a six-language multi-version software experiment run jointly by Honeywell and UCLA [Aviz88], 82 faults were removed from the six versions before acceptance, with one identical pair, five faults were uncovered by testing, with one identical pair, and six more faults were discovered by code inspection, all unrelated and different. Thus out of 93 faults, there were 89 independent faults and two pairs of identical faults. With six versions, there are 15 possible pairings, giving an average of 14.8 independent faults per version and 0.13 correlated faults between each pair. In a three-version system, there would be

an average of 44.4 independent faults and 0.39 correlated faults. Since each correlated fault appears in two versions, the average number of faults per version that are correlated with another version would be 0.7, so for a given fault the chance of its being correlated with another version is 0.7/(14.8 + 0.7), or about 1 in 22.1. This is not much better than the Knight and Leveson value of 1 in 19. Another study is the Project on Diverse Software (PODS). Bishop and Pullen analyzed three versions of software, which were found to have 17, 16, and 13 faults. It was found that several of the faults which were not correlated appeared to be correlated due to fault masking effects in the computation of binary outputs. This effect could be eliminated by redesigning the program to remove the masking. However, three faults were found to be genuinely correlated, so out of the total of 46 faults there were 43 independent faults and one triple. Thus there were an average of 15.3 faults per version, of which one was correlated with other versions. The results of the above studies are not encouraging. Although Avizienis argued that the results of Knight and Leveson were overly pessimistic, his own results were not much better. It appears that correlated faults between versions make up about 1 in 20 of the total faults in a single version. How many of these are correlated between 2 versions, 3 versions, etc.? This is difficult to say. Eckhardt and Lee used their theory for predicting correlated faults between versions [Eckh85] to analyze the data from the study of Knight and Leveson [Eckh88]. They found a probability of system failure as shown in Figure 7.2.1-1, where the bottom curve is the prediction based upon the independent failures assumption, the middle curve is a theoretical average based upon selecting without replacement a subset of N versions from a total set of n components, and the upper curve is that suggested by their model. Since the Knight and Leveson study does not appear to be too pessimistic after comparing it with the Honeywell/UCLA and PODS results, the upper curve will be used for the URMC-1 design.


Figure 7.2.1-1. Comparisons of estimators (from [Eckh88]).
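The Honeywell/UCLA arithmetic above can be reproduced directly from the reported counts; the sketch below follows the same rounding as the text, and the raw counts (93 faults, of which 89 were independent and two were identical pairs across six versions) are taken directly from the quoted study.

versions = 6
independent_faults = 89
identical_pairs = 2

per_version_independent = round(independent_faults / versions, 1)                  # 14.8
per_pair_correlated = round(identical_pairs / (versions * (versions - 1) // 2), 2) # 0.13
per_version_correlated = round(2 * identical_pairs / versions, 1)                  # 0.7 (each correlated fault is in two versions)

# chance that a given fault in a version is correlated with another version
odds = (per_version_independent + per_version_correlated) / per_version_correlated
print(f"about 1 in {odds:.1f}")                                                    # ~1 in 22.1

# expected fault counts for a three-version system built from such versions
print(round(3 * per_version_independent, 1), round(3 * per_pair_correlated, 2))    # 44.4, 0.39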

Another question which is of importance when using a plurality voter

(rather than a majority voter) is what proportion of the correlated faults result in

identical incorrect results and different incorrect results. In the work by Knight and Leveson it was found that 65% of the correlated faults were detectable and 35% were not. While this rate was found for correlated errors in two or three versions only, it will be assumed that for any fault correlated between m versions, 65% will arrive at distinct results and 35% at identical results. This indicates that having a voter such as the one in the FTP/AP will reduce the failure rate to about 1/3 of its previous value. While somewhat helpful, it is not the orders of magnitude improvement necessary.

7.2.2 Hardware Failure Parameters.

In the design of SIFT a study was made of the expected failure rates, which resulted in a rate of 1 × 10^-4 for the main processors and 1 × 10^-5 for the I/O processors and buses [Wens78]. These rates will be assumed correct for current systems as well. For systems with special hardware processors (such as the interstage communicators of FTP/AP or the operations controllers of

MAFT) it will be assumed that the special hardware is of the same complexity and reliability as an I/O processor.

A dual fault model will be used [Meye87] as in section 5.6, in which some faults are benign, that is, they result in the processor stopping completely, and all other faults are malicious, that is, capable of arbitrary behavior. If a processor detects a fault in itself through some self-diagnosis mechanism it can shut itself down, so all self-diagnosable faults may be considered as benign. The self-test coverage is thus very important. Values on the order of 90% coverage are routinely claimed by flight control computer

developers (e.g., [Yous83]). The use of self-checking pairs would result in a coverage close to 100%, but 90% will be assumed.

7.2.3 Hardware Throughput.

The nodes of the URMC-1 may be constructed from any desired processor; it will be assumed in this example that they are to be made from a processor with throughput similar to that of the Intel 80960, which can maintain 10 MIPS with bursts up to 20 MIPS. This will show that the assumed instruction execution rate is within current technological capabilities.

The communication throughput rates are assumed to be up to several hundred Mbits per second. This may be achieved with a fiber-optic link or by several slower links operating in parallel.

7.3 Example URMC Design.

In this section an example URMC design, called URMC-1, is outlined and analyzed using the reliability diagrams of chapters 4, 5, and 6. The design is to meet the requirements of section 7.1 assuming the use of 10 MIPS

processors as the computing elements.

The combined failure rate for the entire URMC-1 system must be better than 1 × 10^-10 per hour, including software, hardware, and communications.

The design of reliable software is the most difficult part of the URMC-1, so the

software failure rate will be assumed to take the lion's share of the total allowable failure rate. Thus an attempt will be made to make the software failure rate under 5 × 10^-11 per hour and the node and communication hardware failure rates together less than 5 × 10^-11 per hour.

7.3.1 Software Fault Tolerance.

How many software versions are necessary to achieve a failure rate of less than 8 × 10^-11 per hour? This will be estimated from Figure 7.2.1-1. It was assumed that the test cases presented to the software versions in the Knight and Leveson study corresponded to approximately 20 years of use in the field.

Thus the probability of system failure given in the figure is that over 20 years, or 175,200 hours. We thus want to find the number of versions necessary to make the probability of system failure such that:

P(sys_fail) ≤ 1 - e^(-λt)                (7.3.1-1)

If λ = 8 × 10^-11 and t = 175,200 hours, we want P(sys_fail) ≤ 1.4 × 10^-5. Using the top curve from Figure 7.2.1-1, this will take seven versions of software.
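The target probability quoted here follows directly from equation (7.3.1-1); as a quick check (the 20-year exposure assumption is the one stated above):

from math import exp

lam = 8e-11          # per-hour software failure budget used in the text
t = 175_200          # 20 years of operation, in hours
print(1 - exp(-lam * t))   # ~1.4e-5, the value to be read off Figure 7.2.1-1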

Is it absolutely impractical to develop seven versions of the same software? Possibly not. The extra effort required to develop more software

versions is not as great as might be thought, as the development of the

specification and tests is the largest part of the software development project,

not the coding. In a joint Honeywell/UCLA experiment [Aviz88] six versions of an autoland system were developed in parallel. It was found that having multiple independent programming teams forced the requirements to be more clearly defined, and that back-to-back testing of the versions could be used to find many problems without the need to manually calculate the expected test results. Thus the extra effort of coding multiple versions is partially offset by the ease of testing. Hence it is proposed that the URMC-1 be designed to provide for the execution of seven different software versions. It may be that the cost of software maintenance and documentation of the versions will cause this approach to be impractical. If this is the case, then a way must be found to guarantee that there are few correlated errors between versions.

The software reliability model of Figure 4.5.2-2 is for three versions, but may be extended by allowing for six ranks of the five possible states of the system rather than two, and by allowing correlated software failures to skip over levels. However, analysis of this model requires that the transition rates be known, which requires a greater knowledge of the characteristics of the software, so for now it will just be assumed that seven versions are sufficient.

It is unlikely that all of the software in the flight control system will have to be reliable enough to require seven versions. Most likely only the inner control loops, which are comparatively simple, will need this reliability level. In addition, small modules of the software may be proven correct, thus reducing the amount of software which must be replicated for fault tolerance. Finally, some of the software will have clearly defined results and not have hard

deadlines, and thus will be able to be implemented as a colloquy. In summary,

it will be assumed that only 10% of the software needs to have seven versions,

30% needs to have three versions, and the other 60% is either noncritical, colloquys, or proven correct. Finally, the overhead of managing the redundant software will be assumed to be 50%. SIFT redundancy management consumed up to 80% of the throughput [Kiec88], but since much of the software is not redundant, it should be reasonable to assume a lower figure. Total required throughput is thus:

5.5 MIPS × 2 × [(0.1 × 7) + (0.3 × 3) + 0.6] = 24.2 MIPS                (7.3.1-2)

7.3.2 Hardware Fault Tolerance.

One strength of the URMC concept is that software fault tolerance and hardware fault tolerance are treated separately, so there may be seven versions of software at some places, three at others, and one at others, without restricting the size or number of the hardware repsets. The number of hardware repsets will be determined from the throughput required, and the size of the repsets from the hardware reliability required.

The goal for the system is 5.5 usable MIPS. A software configuration such as that described above increases the required MIPS rate to 24.2 MIPS. In studies of SIFT, it was found that as much as 80% of the system throughput was occupied by overhead functions [Kiec88]. However, this overhead may be placed in a special-purpose message-passing processor and a general-purpose or special-purpose redundancy management processor (the former approach is used in FTP/AP, the latter in MAFT). In a study of the multiprocessing efficiency of MAFT, it was found that the application processor


utilization varied from 70 - 90%, depending upon the number of processors left in the configuration. As the scheduling of the tasks in a repset will not change with failures, it seems reasonable to assume that the utilization of the application processors can be kept over 80%; thus the 24 MIPS can be achieved with just three repsets if 10 MIPS processors (such as the Intel

80960) are used. A block diagram for a single node of the URMC-1 is shown in

Figure 7.3.2-1. The complexity of the redundancy management processor and communicator is assumed to be like that of an I/O processor in SIFT, for a failure rate of 1 × 10^-5 per hour. The bus fails at 1 × 10^-5 per hour and the application processor fails at 1 × 10^-4 per hour, so the total node failure rate is 1.3 × 10^-4 per hour.


Figure 7.3.2-1. A single processing node in the URMC-1.

The next question is how many I/O repsets to have. Single I/O processors capable of handling the full one million bits per second exist, but the overhead of redundancy management will be assumed to be 50%, so two I/O repsets will be needed. Figure 7.3.2-2 shows a possible configuration for a single node of an I/O repset. The I/O processor, communicators, redundancy management processor, and backplane bus are all assumed to fail at a rate of

1 × 10^-5 per hour, for a total node failure rate of 5 × 10^-5 per hour.


Figure 7.3.2-2. A single input/output node in the URMC-1.

The complete reliability diagram for the processing and I/O node hardware is shown in Figure 5.6-4, where i = 2 and j = 3. The SHARPE

[Sahn86] package was used to calculate the reliability of the processing and

I/O repsets (see Appendix A) with the following assumptions: each repset contained 2, 3, or 4 nodes, all rates were exponentially distributed, the total hardware failure rate was 1.3 × 10^-4 for a processing node and 5 × 10^-5 for an

I/O node, 90% of all failures are benign, diagnosis of other failures takes an average of 10 seconds, and replacement of a diagnosed failed processor takes an average of 10 seconds. With the above assumptions, the failure probabilities for I/O repsets and processing repsets over a ten-hour flight are as shown in Tables 7.3.2-1 and 7.3.2-2. From these tables it may be seen that it is necessary to have four nodes in each repset. The reliability models are thus those of Figures 5.6-1 and 5.6-2. The next question is how many spare processors to have. If we have k spares, then we need to solve for an m-out-of-n system, where m is one less than the number of processors needed for the processing repsets and n is m + k. There are three processing repsets and two I/O repsets. A monitor repset is made from four of the processors in the spares pool. The monitor repset may give away its own members when the spares are exhausted, so the monitor repset members count as spares as well. Thus there is a need for 12 processors to make all repsets complete. Allowing one repset to have lost a processor makes m equal to 11. Using the formula for an m-out-of-n system it is found that n should be at least 17, so there are 6 spares (see Appendix B).

Counting the processing repset members, I/O repset members, and spares, we must have a system consisting of at least (3 × 4) + (2 × 4) + 6 =

25 nodes. The total node hardware failure probability is a little over 3 × 10^-10 over a ten-hour flight.

Table 7.3.2-1. Failure probability for I/O repsets.

time (hr)    2-node        3-node        4-node

  0.0      0.0000 e+00   0.0000 e+00   0.0000 e+00
  1.0      1.0002 e-05   7.5379 e-10   0.0000 e+00
  2.0      2.0007 e-05   3.0079 e-09   2.0660 e-12
  3.0      3.0016 e-05   6.7626 e-09   3.8505 e-12
  4.0      4.0028 e-05   1.2018 e-08   6.5364 e-12
  5.0      5.0044 e-05   1.8775 e-08   1.0422 e-11
  6.0      6.0063 e-05   2.7033 e-08   1.5809 e-11
  7.0      7.0086 e-05   3.6793 e-08   2.2996 e-11
  8.0      8.0112 e-05   4.8056 e-08   3.2283 e-11
  9.0      9.0142 e-05   6.0820 e-08   4.3971 e-11
 10.0      1.0017 e-04   7.5087 e-08   5.8360 e-11

Table 7.3.2-2. Failure probability for processing repsets.

time (hr)    2-node        3-node        4-node

  0.0      0.0000 e+00   0.0000 e+00   0.0000 e+00
  1.0      2.6000 e-05   5.6169 e-11   5.6180 e-12
  2.0      5.1999 e-05   1.1250 e-10   1.1251 e-11
  3.0      7.7997 e-05   1.6884 e-10   1.6885 e-11
  4.0      1.0399 e-04   2.2517 e-10   2.2518 e-11
  5.0      1.2999 e-04   2.8150 e-10   2.8151 e-11
  6.0      1.5599 e-04   3.3784 e-10   3.3785 e-11
  7.0      1.8198 e-04   3.9417 e-10   3.9418 e-11
  8.0      2.0798 e-04   4.5050 e-10   4.5052 e-11
  9.0      2.3397 e-04   5.0684 e-10   5.0685 e-11
 10.0      2.5997 e-04   5.6317 e-10   5.6319 e-11

7.3.3 Communication Fault Tolerance.

In this section, the number of communication links per node needed to keep the chance of failure below 2 × 10^-10 in ten hours is found. It is assumed that the internode communicator can continue to operate even if the processing node of which it is a member has failed, so the failure rate per communication node is 1 × 10^-5 per hour. Since any r - 2 failures may be tolerated, we wish to find the reliability of an m-out-of-n system again. First the obvious answer of five connections per node is tried; unfortunately the failure probability of a 22-out-of-25 system is about 2.3 × 10^-9 over a ten-hour flight, which is too high. Thus 6 connections are necessary, so the system must have

6^2 = 36 nodes. This has the advantage of providing considerably more throughput than the 25-node system, or else of allowing slower processors in each node. The failure probability of the 32-out-of-36 communication system is about 5.9 × 10^-12 over a ten-hour flight, which is more than low enough. Thus the URMC-1 system is connected as an augmented (6,2) shuffle-exchange network.

7.3.4 Communication Performance.

There is some question as to whether the system can maintain the data rate necessary between nodes, since the bandwidth of a system which is not completely-connected is of course lower than one that is. Using the method outlined in chapter 6, we should determine the message rate, message length,

and the maximum allowable message delay. Unfortunately, we have the same problem here as with the software fault tolerance analysis: we need to know more about the application to estimate these parameters. In the absence of a better method, we will use the following: assume that the 1 Mbit/second bandwidth of MAFT was sufficient for the processors which it used. Current processors are perhaps an order of magnitude faster than those in MAFT, so a bandwidth an order of magnitude greater than that of MAFT should suffice. This requires a 10 Mbit/second bandwidth between each node and the nodes with which it communicates. Assume that the 10

Mbit/second rate is necessary for output from the repset and communication with the three other members of the repset. This makes the requirement 40 Mbit/second for data output from a single node. Then, each communication link serves as part of a chain of length 2m + 1 (five in this system), so we must multiply the data rate by that as well, for a total link data rate requirement of 200 Mbits per second. This is very high, but can be achieved with the use of a fiber-optic link.

7.4 Comparison with Current Systems.

One question which arises is whether any of the four current architectures covered in chapter 2 is capable of meeting the requirements of section 7.1, and if so, why use a URMC-1 with the need for 36 nodes and 200 Mbit per second data links?

If the correlated failure rate found in section 7.2.1 were valid, then it would require more than four versions of software to meet the reliability requirement. Assuming a majority voter, and a system which recovers failed versions quickly (compared to the rate at which they fail), we need enough versions that a correlated failure of the majority of the versions occurs at a rate less than 1 × 10^-10 failures per hour, which may take seven versions. Of the currently existing systems, MAFT has the most possible nodes, with eight. It would be possible to develop a different version on each of seven nodes and to vote the results. With a 10 MIPS processor at each node it would be possible to obtain the full required throughput of 5.5 MIPS from a single processor despite a utilization as low as 55%. However, each version would execute in a SISD fashion, which does not allow for multiprocessing. The URMC-1 allows multiprocessing, which may be desirable from the standpoint of allowing greater variety in the algorithms (as different parallel communication structures add a new dimension of diversity). Also, certain sections of the software may be replicated seven times, other sections three times, and others only once. The separation of the URMC software fault tolerance and hardware fault tolerance allows mixing the number of versions in different sections of the code, depending upon the criticality level of the software.

Another possibility would be for some sections to use n-version programming and others to use the recovery block. The hardware would continue to use n-modular redundancy regardless.

Also, recall that 25 processors would be enough for the throughput desired, but 36 processors were used in order to have a power of six. Thus,

the URMC system has almost half again the throughput necessary, and so has the capability to run more complex control laws and to add intelligence. It may be that due to its size, the URMC-1 is not practical for flight control, but will turn out to be useful for ground-based applications where portability is not so important. The URMC concepts may then be used to

extend the number of processors to hundreds or even thousands, with n-version programs, colloquys, and simplex software all running concurrently.

7.5 Summary.

An example URMC system was developed to meet the needs for modern flight control systems. The resulting system may be too large to

economically fly in an aircraft, with 36 processing nodes, each linked to five other nodes with 160 Mbit per second links. However, the example does show the use of the tools developed earlier, and raises several important points in the design of fault-tolerant systems. It is a brute force approach, but may yield insight into how to improve more economical systems. The URMC concept may also be useful in ground-based applications, with the possibility for

thousands of processors and truly parallel processing.

8. Conclusion.

8.1 Summary of the URMC.

A very high level conceptual design for an ultrareliable multicomputer for an aircraft flight control application was developed. This system builds upon past work in the field, principally borrowing concepts from MAFT, with some input from the FTP/AP, a Sperry Flight Systems architecture, and the

Airbus A320 flight controller.

The system is designed as a hierarchy of virtual machines for the advantages of conceptual simplicity and the ability to change the implementation of lower levels without unduly impacting higher levels. The concept of a separate recovery layer for fault tolerance is expanded, with three separate groups of layers being inserted in the hierarchy: software fault tolerance, node hardware fault tolerance, and communication hardware fault tolerance. Each fault-tolerant level presents a perfect machine for use by the layer above, despite being constructed from imperfect components.

As it is necessary to program the multicomputer with concurrent software, the currently available software fault tolerance constructs were reviewed for their applicability in the URMC software fault tolerance layer(s). It was determined that concurrent n-version programming will probably be best

for most of the fault-tolerant software in the URMC. This construct provides the ability to detect faults despite a lack of foreknowledge of the characteristics of a correct output, and allows real-time deadlines to be met somewhat more easily than constructs built upon backward error recovery. Different sections of the software may have different levels of redundancy and use different fault-tolerant constructs if desired, as the mapping of software processes is not restricted by hardware fault tolerance considerations. The node hardware achieves fault tolerance by an extension of the DFT concept of Chen and Chen [Chen85]. As in the DFT concept, the nodes of the multicomputer are divided into groups called repsets. The processors in a repset are responsible for maintaining the integrity of their own repset despite the occurrence of faults. The repset processors execute the same tasks in synchronism, then share and vote the results of those tasks. The voted values are sent to interfacing repsets, and the results of the voting are used to diagnose a failed processor within one's own repset. When a failed processor is diagnosed, the good processors in the repset replace the failed processor with a processor from a pool of spares. The method for voting and reconfiguration described by Chen and

Chen works only if all failures are consistent; Byzantine failures, which can exhibit malicious behavior, are not accounted for. In the URMC the DFT concept is extended in three ways:

1) data is shared with an interactive consistency algorithm to allow

masking of Byzantine faults in a repset,

2) a Byzantine fault diagnosis algorithm is run in an attempt to diagnose

Byzantine faults, and 3) access to spare processors is controlled to protect a repset against

arbitrary behavior by a processor selected from the pool of spares. The communication layer is assumed to be a perfect, completely-connected network by the layers above it. It achieves this appearance by being connected in a fault-tolerant graph (FG) network, with message-passing

processors routing packets by several different routes to ensure that one will get through from the source to the destination. The main processors need only submit their message to the message-passing system and they may be assured that it will be sent. The resulting URMC design has a lot of overhead for the fault tolerance and requires state-of-the-art software tools, processing hardware, and communication hardware. However, it provides more flexibility, reliability, and

expansibility than previous flight control architectures. It is flexible enough to

adapt to much higher throughput requirements given lower reliability

requirements, so URMC systems should be applicable in a wide range of applications requiring higher than normal throughput and reliability.

8.2 Contributions.

Before embarking upon the design, the currently existing flight control

computer systems were surveyed. After identifying the needs for the next

generation of flight control computers, a very high level conceptual design for an ultrareliable multicomputer was presented. In the process of defining the computer, several new ideas had to be developed: the concept of the recovery metaprogram was expanded to include hardware (chapter 3), possible methods for concurrent n-version programming were discussed (chapter 4), and the DFT concept of Chen and Chen was extended to allow for arbitrary failures (chapter 5). In addition, some old ideas were compared for their applicability to real-time control systems: recovery blocks versus n-version programming (chapter 4), t-fault diagnosability versus n-modular redundancy (chapter 5), and several different interconnection networks (chapter 6). In order to allow evaluation of the suggested concepts, recently developed reliability models for fault-tolerant software (chapter 4) and somewhat older hardware reliability modeling techniques (chapters 5 and 6) were adapted to the structure outlined. For evaluation of the interconnection network performance, a simplified queueing model was developed (chapter 6). An example system was developed in chapter 7 to show the use of the fault-tolerant schemes and evaluation techniques from the previous chapters. In the process of estimating parameters for the models it was necessary to examine and compare the reported failure characteristics from several empirical studies.


8.3 Suggestions for Further Work.

Several questions arose in the process of developing the URMC

concepts which showed a need for investigation in various fields, and much

more work remains to be performed upon the URMC concept itself.

8.3.1 Software.

The questions to be resolved at the software layer are the most basic of the questions found, reflecting the lack of maturity in this new field of investigation.

1) There are many questions having to do with software faults and their manifestation. How many may be expected? How many will be correlated? If the model of Eckhardt and Lee is correct, what are the parameters to be expected?

2) How may software dissimilarity be increased? How may it be measured? Does parallel programming with dissimilar structures increase the dissimilarity?

3) Perhaps software metrics could be used to provide a multidimensional measure giving the location of the software in a complexity space [Muns89]. Would different locations in that space imply dissimilarity?

4) How do software faults manifest themselves? Recent work by Bishop and Pullen [Bish88] has suggested that software failure models should assume that inputs follow a random walk rather than being randomly selected from an input probability distribution. How may the models of chapter 4 be modified to take the resulting changes in the failure rates into account?

5) In chapter 4 it was suggested that concurrent software could be developed by having the applications designer specify the criticality of the modules and then having the fault-tolerance designer request only that the extra applications be developed (a similar proposal was made in [Sarm88]). This could be supported by using a language which has separate specifications and bodies (such as Ada), modified to allow n bodies for a single specification; the shape such a structure might take is sketched after this list. Much work must be done in this field to determine if this suggestion is practical. A translator could be developed which would convert the software from a high-level specification and the n versions of the bodies into a complete fault-tolerant software system.
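The following sketch suggests, in C rather than Ada for consistency with Appendix B, the shape of the software such a translator might emit: one specification (a single calling interface), n independently written bodies, and a generated driver that runs every body and votes. The control-law expressions, the names, and the choice of a median voter are purely illustrative assumptions.

#include <stdio.h>

#define N_VERSIONS 3   /* number of dissimilar bodies for the one specification */

/* The single "specification": every version body must satisfy this interface. */
typedef double (*roll_cmd_fn)(double roll_error, double roll_rate);

/* Three independently written "bodies" (placeholder control laws, not real    */
/* flight software).                                                            */
static double roll_cmd_v1(double e, double r) { return 0.8 * e - 0.1 * r; }
static double roll_cmd_v2(double e, double r) { return (8.0 * e - r) / 10.0; }
static double roll_cmd_v3(double e, double r) { return 0.8 * (e - 0.125 * r); }

static const roll_cmd_fn version[N_VERSIONS] = { roll_cmd_v1, roll_cmd_v2, roll_cmd_v3 };

/* Inexact median voter: with an odd number of versions the median output      */
/* masks one erroneous version without requiring bit-exact agreement.          */
static double vote_median(double v[], int n)
{
    int i, j;
    for (i = 0; i < n - 1; i++)            /* simple selection sort             */
        for (j = i + 1; j < n; j++)
            if (v[j] < v[i]) { double t = v[i]; v[i] = v[j]; v[j] = t; }
    return v[n / 2];
}

/* The driver a translator would generate from the specification and bodies.   */
double roll_cmd(double roll_error, double roll_rate)
{
    double out[N_VERSIONS];
    int i;
    for (i = 0; i < N_VERSIONS; i++)
        out[i] = version[i](roll_error, roll_rate);
    return vote_median(out, N_VERSIONS);
}

int main(void)
{
    printf("voted roll command: %f\n", roll_cmd(2.0, 0.5));
    return 0;
}

In the proposed Ada-based approach, the same structure would correspond to one package specification with n alternative bodies, selected and combined by the translator rather than written by hand as above.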

8.3.2 Processing Hardware.

Several questions exist at this level, though they are not as basic as the questions at the software level.

1) Can the scheme outlined be implemented with the level of overhead claimed in chapter 5?

2) What will be the average required times for diagnosis and reconfiguration?


3) Is the dual failure mode model enough, or could there be another model with three failure modes: malicious (arbitrary behavior), consistent (one face presented to the outside world), and benign (detectable by self-diagnosis)?

4) Should self-checking pairs be used as the processing devices at each node? While this would increase the cost of the node, it would raise the coverage factor so high that the repsets could have fewer members. (A minimal sketch of such a pair follows this list.)

8.3.3 Communication Hardware.

At the interconnection network layer, there are also questions to be resolved.

1) Would a more thorough review of the interconnection network literature find a network with better characteristics than those of the FG network?

2) Given an FG network using multicasting, is there a mapping of processes to the processors which will result in a shorter average of the longest paths between communicating processes? (One way of scoring candidate mappings is sketched after this list.)
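One possible objective function for question 2 is sketched below: given a table of FG-network hop counts and the sets of processes that multicast among themselves, compute the average over those sets of the longest inter-member path under a candidate mapping. The distances, group sizes, and mapping shown are placeholders, not data from the example system of chapter 7.

#include <stdio.h>

#define N_PROCS    4   /* assumed number of application processes            */
#define N_NODES    4   /* assumed number of processors                       */
#define N_GROUPS   2   /* assumed number of communicating groups             */
#define GROUP_SIZE 2   /* assumed number of processes per group              */

/* Hop counts between processors in the (hypothetical) FG network.            */
static const int dist[N_NODES][N_NODES] = {
    { 0, 1, 2, 1 },
    { 1, 0, 1, 2 },
    { 2, 1, 0, 1 },
    { 1, 2, 1, 0 },
};

/* Each group lists the processes that multicast among themselves.            */
static const int group[N_GROUPS][GROUP_SIZE] = { { 0, 1 }, { 2, 3 } };

/* Average, over all groups, of the longest path between any two members      */
/* under a given process-to-processor mapping.  A search for a good mapping   */
/* would seek to minimize this value.                                         */
static double avg_longest_path(const int mapping[N_PROCS])
{
    int g, i, j, total = 0;
    for (g = 0; g < N_GROUPS; g++) {
        int longest = 0;
        for (i = 0; i < GROUP_SIZE; i++)
            for (j = 0; j < GROUP_SIZE; j++) {
                int d = dist[mapping[group[g][i]]][mapping[group[g][j]]];
                if (d > longest)
                    longest = d;
            }
        total += longest;
    }
    return (double)total / N_GROUPS;
}

int main(void)
{
    int mapping[N_PROCS] = { 0, 2, 1, 3 };   /* process p runs on mapping[p]  */
    printf("average longest path: %.2f\n", avg_longest_path(mapping));
    return 0;
}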

8.3.4 Development of an URMC.

Answering the above questions should make the URMC system a productive research field for years to come. However, the ultimate determination of its viability will require more knowledge of the application area, followed by more accurate models of its behavior, and finally an actual microprocessor-based implementation of an URMC system.

Appendix A.

Two programs were written for SHARPE, one to find the probability of failure for an I/O repset, and another to find the probability of failure for a processing repset. The I/O repset program is shown in Figure A-1, and the resulting output is shown in Figure A-2. The processing repset program is shown in Figure A-3, and the resulting output in Figure A-4.

Figure A-1. SHARPE program to compute I/O repset failure probability.

* The four processor repset model is made from the Markov
* model of Figure 5.6-1.
* Declare transitions in Markov model.  States as follows:
*   6 - 4 OK
*   5 - 3 OK, 1 benign failure
*   4 - 3 OK, 1 malicious failure
*   3 - 2 OK, 2 benign failures
*   2 - 2 OK, 1 benign failure, 1 malicious failure
*   1 - 1 OK, 3 benign failures
*   0 - System Failure
* The transition variables are coded as follows:
*   c  - coverage
*   lh - hardware failure rate
*   b  - diagnosis rate
markov iorep4
6 5 4*c*lh
6 4 4*(1-c)*lh
5 3 3*c*lh
5 2 3*(1-c)*lh
4 5 b+(c*lh)
4 2 3*c*lh
4 0 3*(1-c)*lh
3 1 2*c*lh
3 0 2*(1-c)*lh
2 3 b+(c*lh)
2 0 2*lh
1 0 lh
end

* Declare initial probability distribution.  It is assumed
* that the system begins in state 6 (4 OK).
6 1
5 0
4 0
3 0
2 0
1 0
0 0
end

* The three processor model was found by eliminating states
* 6 and 4 and their output transitions from the four
* processor model.  It is assumed to begin in the three
* processor good state.
markov iorep3
5 3 3*c*lh
5 2 3*(1-c)*lh
3 1 2*c*lh
3 0 2*(1-c)*lh
2 3 b+(c*lh)
2 0 2*lh
1 0 lh
end
5 1
3 0
2 0
1 0
0 0
end

* The two processor model was found by eliminating states
* 2 and 5 and their output transitions from the three
* processor model.  It is assumed to begin in the two
* processor good state.
markov iorep2
3 1 2*c*lh
3 0 2*(1-c)*lh
1 0 lh
end
3 1
1 0
0 0
end

* Bind values of the system parameters before execution
bind
c 0.9
lh 0.05
b 360000.0
end

* The cumulative distribution function for each model is
* printed, then evaluated for each hour for 0 - 10 hours.
* Rates were multiplied by 1000 and time reduced by the
* same amount to avoid numeric problems in SHARPE.
cdf(iorep2,0)
eval(iorep2,0) 0.0 0.01 0.001
cdf(iorep3,0)
eval(iorep3,0) 0.0 0.01 0.001
cdf(iorep4,0)
eval(iorep4,0) 0.0 0.01 0.001
end

Figure A-2. Output from SHARPE program of Figure A-1.

information about system iorep2 node 0

probability of entering node: 1.0000e+00

conditional CDF for time of reaching this absorbing state
   1.0000e+00 t(0) exp( 0.0000e+00 t)
 + -1.8000e+00 t(0) exp(-5.0000e-02 t)
 +  8.0000e-01 t(0) exp(-1.0000e-01 t)
mean: 2.8000e+01   variance: 4.9600e+02

system iorep2 node 0
     t            F(t)
 0.0000e+00   0.0000e+00
 1.0000e-03   1.0002e-05
 2.0000e-03   2.0001e-05
 3.0000e-03   3.0016e-05
 4.0000e-03   4.0028e-05
 5.0000e-03   5.0044e-05
 6.0000e-03   6.0063e-05
 7.0000e-03   7.0086e-05
 8.0000e-03   8.0112e-05
 9.0000e-03   9.0142e-05
 1.0000e-02   1.0017e-04

information about system iorep3 node 0

probability of entering node: 1.0000e+00

conditional CDF for time of reaching this absorbing state
   1.0000e+00 t(0) exp( 0.0000e+00 t)
 + -2.7000e+00 t(0) exp(-5.0000e-02 t)
 +  2.4000e+00 t(0) exp(-1.0000e-01 t)
 + -7.0000e-01 t(0) exp(-1.5000e-01 t)
 + -1.1574e-15 t(0) exp(-3.6000e+05 t)
mean: 3.4667e+01   variance: 5.4044e+02

system iorep3 node 0
     t            F(t)
 0.0000e+00   0.0000e+00
 1.0000e-03   7.5379e-10
 2.0000e-03   3.0079e-09
 3.0000e-03   6.7626e-09
 4.0000e-03   1.2018e-08
 5.0000e-03   1.8775e-08
 6.0000e-03   2.7033e-08
 7.0000e-03   3.6793e-08
 8.0000e-03   4.8056e-08
 9.0000e-03   6.0820e-08
 1.0000e-02   7.5087e-08

information about system iorep4 node 0

probability of entering node: 1.0000e+00

conditional CDF for time of reaching this absorbing state
   1.0000e+00 t(0) exp( 0.0000e+00 t)
 + -3.6000e+00 t(0) exp(-5.0000e-02 t)
 +  4.8000e+00 t(0) exp(-1.0000e-01 t)
 + -2.8000e+00 t(0) exp(-1.5000e-01 t)
 +  6.0000e-01 t(0) exp(-2.0000e-01 t)
 +  2.3148e-15 t(0) exp(-3.6000e+05 t)
mean: 3.9667e+01   variance: 5.6544e+02

system iorep4 node 0
     t            F(t)
 0.0000e+00   0.0000e+00
 1.0000e-03   0.0000e+00
 2.0000e-03   2.0660e-12
 3.0000e-03   3.8505e-12
 4.0000e-03   6.5364e-12
 5.0000e-03   1.0422e-11
 6.0000e-03   1.5809e-11
 7.0000e-03   2.2996e-11
 8.0000e-03   3.2283e-11
 9.0000e-03   4.3971e-11
 1.0000e-02   5.8360e-11


Figure A-3. SHARPE program to compute processing repset failure probability.

* The four processor repset model is made from the Markov
* model of Figure 5.6-2.
* Declare transitions in Markov model.  States as follows:
*   6 - 4 OK
*   5 - 3 OK, 1 benign failure
*   4 - 3 OK, 1 malicious failure
*   3 - 2 OK, 2 benign failures
*   2 - 2 OK, 1 benign failure, 1 malicious failure
*   1 - 1 OK, 3 benign failures
*   0 - System Failure
* The transition variables are coded as follows:
*   c  - coverage
*   lh - hardware failure rate
*   b  - diagnosis rate
*   u  - repair rate
markov prep4
6 5 4*c*lh
6 4 4*(1-c)*lh
5 6 u
5 3 3*c*lh
5 2 3*(1-c)*lh
4 5 b+(c*lh)
4 2 3*c*lh
4 0 3*(1-c)*lh
3 5 u
3 1 2*c*lh
3 0 2*(1-c)*lh
2 4 u
2 3 b+(c*lh)
2 0 2*lh
1 3 u
1 0 lh
end

* Declare initial probability distribution.  It is assumed
* that the system begins in state 6 (4 OK).
6 1
5 0
4 0
3 0
2 0
1 0
0 0
end

* The three processor model was found by eliminating states
* 6 and 4 and their output transitions from the four
* processor model.  It is assumed to begin in the three
* processor good state.
markov prep3
5 3 3*c*lh
5 2 3*(1-c)*lh
3 5 u
3 1 2*c*lh
3 0 2*(1-c)*lh
2 3 b+(c*lh)
2 0 2*lh
1 3 u
1 0 lh
end
5 1
3 0
2 0
1 0
0 0
end

* The two processor model was found by eliminating states
* 2 and 5 and their output transitions from the three
* processor model.  It is assumed to begin in the two
* processor good state.
markov prep2
3 1 2*c*lh
3 0 2*(1-c)*lh
1 3 u
1 0 lh
end
3 1
1 0
0 0
end

* Bind values of the system parameters before execution
bind
c 0.9
lh 0.13
b 360000.0
u 360000.0
end

* The cumulative distribution function for each model is
* printed, then evaluated for each hour for 0 - 10 hours.
* Rates were multiplied by 1000 and time reduced by the
* same amount to avoid numeric problems in SHARPE.
cdf(prep2,0)
eval(prep2,0) 0.0 0.01 0.001
cdf(prep3,0)
eval(prep3,0) 0.0 0.01 0.001
cdf(prep4,0)
eval(prep4,0) 0.0 0.01 0.001
end

Figure A-4. Output from SHARPE program of Figure A-3.

information about system prep2 node 0

probability of entering node: 1.0000e+00

conditional CDF for time of reaching this absorbing state
   1.0000e+00 t(0) exp( 0.0000e+00 t)
 + -1.0000e+00 t(0) exp(-2.6000e-02 t)
mean: 3.8461e+01   variance: 1.4793e+03

system prep2 node 0
     t            F(t)
 0.0000e+00   0.0000e+00
 1.0000e-03   2.6000e-05
 2.0000e-03   5.1999e-05
 3.0000e-03   7.7997e-05
 4.0000e-03   1.0399e-04
 5.0000e-03   1.2999e-04
 6.0000e-03   1.5599e-04
 7.0000e-03   1.8198e-04
 8.0000e-03   2.0798e-04
 9.0000e-03   2.3397e-04
 1.0000e-02   2.5997e-04

information about system prep3 node 0

probability of entering node: 1.0008e+00

conditional CDF for time of reaching this absorbing state
   1.0000e+00 t(0) exp( 0.0000e+00 t)
 + -1.0000e+00 t(0) exp(-5.6287e-08 t)
mean: 1.7766e+07   variance: 3.1564e+14

system prep3 node 0
     t            F(t)
 0.0000e+00   0.0000e+00
 1.0000e-03   5.6169e-11
 2.0000e-03   1.1250e-10
 3.0000e-03   1.6884e-10
 4.0000e-03   2.2517e-10
 5.0000e-03   2.8150e-10
 6.0000e-03   3.3784e-10
 7.0000e-03   3.9417e-10
 8.0000e-03   4.5050e-10
 9.0000e-03   5.0684e-10
 1.0000e-02   5.6317e-10

information about system prep4 node 0

probability of entering node: 1.0081e+00

conditional CDF for time of reaching this absorbing state
   1.0000e+00 t(0) exp( 0.0000e+00 t)
 + -1.0000e+00 t(0) exp(-5.5879e-09 t)
mean: 1.7896e+08   variance: 3.2026e+16

system prep4 node 0
     t            F(t)
 0.0000e+00   0.0000e+00
 1.0000e-03   5.6180e-12
 2.0000e-03   1.1251e-11
 3.0000e-03   1.6885e-11
 4.0000e-03   2.2518e-11
 5.0000e-03   2.8151e-11
 6.0000e-03   3.3785e-11
 7.0000e-03   3.9418e-11
 8.0000e-03   4.5052e-11
 9.0000e-03   5.0685e-11
 1.0000e-02   5.6319e-11

Appendix B.

/********************************************************/
/*                                                       */
/*  Main Program: finds reliability of m-out-of-n system */
/*                                                       */
/********************************************************/

#include <stdio.h>
#include <math.h>

/* function prototypes */
int main(void);
double permutations(int, int);    /* finds n!/i!  (called as permutations(n, n-i)) */
double factorial(int);            /* finds n!                                      */
double combinations(int, int);    /* finds C(n,i)                                  */

/*------------------------- MAIN -------------------------*/
int main(void)
{
    int i, n, m;
    double lambda, r_mod, r_sys, hours;

    printf("\nThis program finds a failure probability\n");
    printf("for an m-out-of-n system, given the hourly\n");
    printf("failure rate of each module and total time\n");
    printf("\nEnter module's hourly failure rate: ");
    scanf("%lf", &lambda);
    printf("\nEnter the value for n: ");
    scanf("%d", &n);
    printf("\nEnter the value for m: ");
    scanf("%d", &m);
    printf("\nEnter the time to evaluate (in hours): ");
    scanf("%lf", &hours);

    r_mod = exp(-hours * lambda);
    r_sys = 0;
    /* Sum over the tolerable number of failed modules (0 through n-m);  */
    /* the bound is inclusive so that exactly m working modules counts.  */
    for (i = 0; i <= (n - m); i++) {
        r_sys += combinations(n, i) * pow(r_mod, (n - i)) * pow((1.0 - r_mod), i);
    }
    printf("\nThe system reliability is: %20.15e\n", r_sys);
    return 0;
}

/********************************************************/
/*                                                       */
/*  combinations - finds number of combinations of       */
/*                 n things taken i at a time            */
/*                                                       */
/********************************************************/
double combinations(int n, int i)
{
    return (permutations(n, n - i) / factorial(i));
}

/********************************************************/
/*                                                       */
/*  permutations - computes n*(n-1)*...*(i+1) = n!/i!    */
/*                 (a recursive implementation)          */
/*                                                       */
/********************************************************/
double permutations(int n, int i)
{
    return ((n == i) ? 1 : n * permutations(n - 1, i));
}

/********************************************************/
/*                                                       */
/*  factorial - returns value of n factorial             */
/*              (a recursive implementation)             */
/*                                                       */
/********************************************************/
double factorial(int n)
{
    return ((n == 0) ? 1 : n * factorial(n - 1));
}
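For reference, with R_m = exp(-lambda*t) denoting the reliability of a single module, the quantity evaluated by the main loop above is the usual m-out-of-n system reliability (this restates what the program computes; it is not a new result):

R_{sys} \;=\; \sum_{i=0}^{n-m} \binom{n}{i}\, R_m^{\,n-i}\, \bigl(1 - R_m\bigr)^{i}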

Bibliography.

[Adam87] G. B. Adams III, D. P. Agrawal, and H. J. Siegel, "A survey and comparison of fault-tolerant multistage interconnection networks," Computer, June 1987, pp. 14 - 27.

[Agra86] D. P. Agrawal and V. K. Janakiram, "Evaluating the performance of multicomputer configurations," Computer, May 1986, pp. 23 - 37.

[Agra85] P. Agrawal, "RAFT: a recursive algorithm for fault tolerance," Proc. of the 15th Int'l Symp. on Fault-Tolerant Computing, 1985, pp. 814 - 821.

[AklS85] S. G. Akl, Parallel Sorting Algorithms, Academic Press, Inc., 1985.

[Amma87] P. E. Ammann and J. C. Knight, "Data diversity: an approach to software fault tolerance," Proc. of the 17th Int'l Symp. on Fault-Tolerant Computing, 1987, pp. 122 - 126.

[Anco87] M. Ancona, A. Clematis, G. Dodero, E. B. Fernandez, and V. Gianuzzi, "Using different language levels for implementing fault-tolerant programs," Microprocessing and Microprogramming, vol. 20, 1987, pp. 33 - 38.

[Arla88] J. Arlat, K. Kanoun, and J.-C. Laprie, "Dependability evaluation of software fault tolerance," Proc. of the 18th Int'l Symp. on Fault-Tolerant Computing Systems, 1988, pp. 142 - 147.

[Atha88] W. C. Athas and C. L. Seitz, "Multicomputers: message-passing concurrent computers," Computer, Aug. 1988, pp. 9 - 24.

[Aviz88] A. Avizienis, M. R. Lyu, and W. Schutz, "In search of effective diversity: a six-language study of fault-tolerant flight control software," Proc. of the 18th Int'l Symp. on Fault-Tolerant Computing Systems, 1988, pp. 15 - 22.

[Aviz87] A. Avizienis, M. R. Lyu, and W. Schutz, In Search of Effective Diversity: a Six-Language Study of Fault-Tolerant Flight Control Software, UCLA Computer Science Dept. Report No. CSD-870060, Nov. 1987.

[Aviz86] A. Avizienis and J. C. Laprie, "Dependable computing: from concepts to design diversity," Proc. of the IEEE, vol. 74, no. 5, May 1986, pp. 629 - 638.

[Aviz85] A. Avizienis, "The n-version approach to fault-tolerant software," IEEE Trans. on Software Engineering, vol. SE-11, no. 12, Dec. 1985, pp. 1491 - 1501.

[Aviz84] A. Avizienis and J. P. J. Kelly, "Fault tolerance by design diversity: concepts and experiments," Computer, Aug. 1984, pp. 67 - 80.

[Bish88] P. G. Bishop and F. D. Pullen, "PODS revisited - a study of software failure behavior," Proc. of the 18th Int'l Symp. on Fault-Tolerant Computing Systems, 1988, pp. 2 - 8.

[Chen85] Y. Chen and T. Chen, "DFT: distributed fault tolerance - analysis and design," Proc. of the 15th Int'l Symp. on Fault-Tolerant Computing Systems, 1985, pp. 280 - 285.

[Chwa81] K.-Y. Chwa and S. L. Hakimi, "Schemes for fault-tolerant computing: a comparison of modularly redundant and t-diagnosable systems," Information and Control, vol. 49, 1981, pp. 212 - 238.

[Dahb85] A. T. Dahbura, K. K. Sabnani, and L. L. King, "The comparison approach to multiprocessor fault diagnosis," Proc. of the 15th Int'l Symp. on Fault-Tolerant Computing Systems, 1985, pp. 260 - 265.

[Dahb83] A. T. Dahbura and G. M. Masson, "Greedy diagnosis as the basis of an intermittent fault/transient-upset tolerant system design," IEEE Trans. on Computers, vol. C-32, no. 10, Oct. 1983, pp. 953 - 957.

[Eckh88] D. E. Eckhardt and L. D. Lee, "Fundamental differences in the reliability of n-modular redundancy and n-version programming," Journal of Systems and Software, vol. 8, 1988, pp. 313 - 318.

[Eckh85] D. E. Eckhardt and L. D. Lee, "A theoretical basis for the analysis of multiversion software subject to coincident errors," IEEE Trans. on Software Engineering, vol. SE-11, no. 12, Dec. 1985.

[Fair85] R. E. Fairley, Software Engineering Concepts, McGraw-Hill Book Co., 1985.

[Feng81] T. Y. Feng, "A survey of interconnection networks," Computer, Dec. 1981, pp. 12 - 27.

[Fern89a] E. B. Fernandez, V. Gianuzzi, G. Dodero, A. Clematis, and M. Ancona, "A system architecture for fault tolerance in concurrent software," in preparation.

[Fern89b] E. B. Fernandez, Fault-Tolerant Computer Systems, notes for EEL 6706, 1989.

[Flyn66] M. J. Flynn, "Very high-speed computing systems," Proceedings of the IEEE, vol. 54, 1966, pp. 1901 - 1909.

[Gele85] D. Gelernter, "Generative communication in Linda," ACM Trans. on Programming Languages and Systems, vol. 7, no. 1, Jan. 1985, pp. 80 - 112.

[Gluc86] D. P. Gluch and M. J. Paul, "Fault-tolerance in distributed digital fly-by-wire flight control systems," Proc. of the 7th Digital Avionics Systems Conf., 1986.

[Greg85] S. T. Gregory and J. C. Knight, "A new linguistic approach to backward error recovery," Proc. of the 15th Int'l Symp. on Fault-Tolerant Computing, 1985, pp. 404 - 409.

[Haki74] S. L. Hakimi and A. T. Amin, "Characterization of connection assignment of diagnosable systems," IEEE Trans. on Computers, Jan. 1974, pp. 86 - 88.

[Hech86] H. Hecht and M. Hecht, "Software reliability in the system context," IEEE Trans. on Software Engineering, vol. SE-12, no. 1, Jan. 1986, pp. 51 - 58.

[Hech76] H. Hecht, "Fault-tolerant software for real-time applications," ACM Computing Surveys, vol. 8, no. 4, Dec. 1976, pp. 391 - 406.

[Hill74] F. S. Hillier and G. J. Lieberman, Operations Research, San Francisco: Holden-Day, Inc., 1974.

[Hoar78] C. A. R. Hoare, "Communicating sequential processes," Comm. of the ACM, vol. 21, no. 8, Aug. 1978, pp. 666 - 677.

[Hopk78] A. L. Hopkins, Jr., T. B. Smith, III, and J. H. Lala, "FTMP - a highly reliable fault-tolerant multiprocessor for aircraft," Proc. of the IEEE, vol. 66, no. 10, Oct. 1978, pp. 1221 - 1239.

[Kiec89] R. M. Kieckhafer, "Fault-tolerant real-time task-scheduling in the MAFT distributed system," Proc. of the Hawaii Int'l Conf. on Systems Science, 1989, pp. 143 - 151.

[Kiec88] R. M. Kieckhafer, C. J. Walter, A. M. Finn, and P. M. Thambidurai, "The MAFT architecture for distributed fault tolerance," IEEE Trans. on Computers, vol. 37, no. 4, April 1988, pp. 398 - 405.

[KimK89] K. H. Kim and H. O. Welch, "Distributed execution of recovery blocks: an approach for uniform treatment of hardware and software faults in real-time applications," IEEE Trans. on Computers, vol. 38, no. 5, May 1989, pp. 626 - 636.

[KimK86] K. H. Kim, J. H. You, and A. Aboulnaga, "A scheme for coordinated execution of independently designed recoverable distributed processes," Proc. of the 16th Int'l Symp. on Fault-Tolerant Computing, 1986, pp. 130 - 135.

[KimK84] K. H. Kim, "Software fault tolerance," chapter 20 in Handbook of Software Engineering, ed. by C. R. Vick and C. V. Ramamoorthy, Van Nostrand Reinhold, 1984, pp. 437 - 454.

[Knig87] J. C. Knight and N. G. Leveson, "An empirical study of failure probabilities in multi-version software," Proc. of the 16th Int'l Symp. on Fault-Tolerant Computing Systems, 1986, pp. 165 - 170.

[Knig86] J. C. Knight and N. G. Leveson, "An experimental evaluation of the assumption of independence in multiversion programming," IEEE Trans. on Software Engineering, vol. SE-12, no. 1, Jan. 1986, pp. 96 - 109.

[Lala88] J. H. Lala and L. S. Alger, "Hardware and software fault tolerance: a unified architectural approach," Proc. of the 18th Int'l Symp. on Fault-Tolerant Computing Systems, 1988, pp. 240 - 245.

[Lala86] J. H. Lala, "A Byzantine resilient fault tolerant computer for nuclear power plant applications," Proc. of the 16th Int'l Symp. on Fault-Tolerant Computing Systems, pp. 338 - 343.

[Lamp82] L. Lamport, R. Shostak, and M. Pease, "The Byzantine Generals problem," ACM Trans. on Programming Languages and Systems, vol. 4, no. 3, July 1982, pp. 382 - 401.

[Lapr87] J.-C. Laprie, J. Arlat, C. Beounes, K. Kanoun, and C. Hourtolle, "Hardware- and software-fault tolerance: definition and analysis of architectural solutions," Proc. of the 17th Int'l Symp. on Fault-Tolerant Computing, 1987, pp. 116 - 121.

[LinK86] K.-J. Lin, "Resilient procedures - an approach to highly available system," Proc. IEEE Int'l Conf. on Computer Languages, 1986, pp. 98 - 106.

[Male80] M. Malek, "A comparison connection assignment for diagnosis of multiprocessor systems," Proc. of the 7th Symposium on Computer Architecture, 1980.

[Mall78] S. Mallela and G. M. Masson, "Diagnosable systems for intermittent faults," IEEE Transactions on Computers, vol. C-27, no. 6, June 1978, pp. 560 - 566.

[Manc86a] L. Mancini, "Modular redundancy in a message passing system," IEEE Trans. on Software Engineering, vol. SE-12, no. 1, Jan. 1986, pp. 79 - 86.

[Manc86b] L. Mancini and G. Pappalardo, "The join algorithm: ordering messages in replicated systems," Proc. Conf. on Safety of Computer Control Systems 1986, pp. 51 - 55.

[Meye87] F. J. Meyer and D. K. Pradhan, "Consensus with dual failure modes," Proc. of the 17th Int'l Symp. on Fault-Tolerant Computing, 1987, pp. 48 - 54.

[MIL-STD-1815A] Reference Manual for the Ada Programming Language, U.S. Dept. of Defense, 1983.

[Muns89] J. C. Munson and T. M. Khoshgoftaar, "The dimensionality of program complexity," Proc. of the 11th Int'l Conf. on Software Engineering, 1989, pp. 245 - 253.

[Nels87] P. A. Nelson and L. Snyder, "Programming paradigms for nonshared memory parallel computers," in The Characteristics of Parallel Algorithms, ed. by L. H. Jamieson, D. Gannon, and R. J. Douglass, MIT Press, 1987.

[Ozak88] B. M. Ozaki, E. B. Fernandez, and E. Gudes, "Software fault tolerance in architectures with hierarchical protection levels," IEEE Micro, Aug. 1988, pp. 30 - 43.

[Prad86] D. K. Pradhan, "Fault-tolerant multiprocessor and VLSI-based system communication architectures," chapter 7 in Fault-Tolerant Computing: Theory and Techniques, ed. by D. K. Pradhan, Prentice-Hall, 1986.

[Prad85] D. K. Pradhan, "Fault-tolerant multiprocessor link and bus network architectures," IEEE Trans. on Computers, vol. 34, no. 1, Jan. 1985, pp. 33 - 45.

[Prep67] F. P. Preparata, G. Metze, and R. T. Chien, "On the connection assignment problem of diagnosable systems," IEEE Trans. on Electronic Computers, vol. EC-16, no. 6, Dec. 1967, pp. 848 - 854.

[Quin87] M. J. Quinn, Designing Efficient Algorithms for Parallel Computers, New York: McGraw-Hill Book Co., 1987.

[Rand75] B. Randell, "System structure for software fault tolerance," IEEE Trans. on Software Engineering, vol. SE-1, no. 2, June 1975, pp. 220 - 232.

[Redi84] H. A. Rediess, Technology Review of Flight Crucial Flight Control Systems, NASA Contractor Report 172332, 1984.

[Rive78] R. Rivest, A. Shamir, and L. Adleman, "A method for obtaining digital signatures and public-key cryptosystems," Communications of the ACM, vol. 21, no. 2, pp. 120 - 126, Feb. 1978.

[Rouq86] J. C. Rouquet and P. J. Traverse, "Safe and reliable computing on board the Airbus and ATR aircraft," Proc. Conf. on Safety of Computer Control Systems 1986, pp. 93 - 97.

[RTCA83] Radio Technical Commission on Aeronautics Paper No. 226-83/SC152-13, paragraph 7.2, as quoted in [Youn84].

[Sahn86] R. A. Sahner and K. S. Trivedi, SHARPE: Symbolic Hierarchical Automatic Reliability and Performance Evaluator, Introduction and Guide for Users, Sept. 1986.

[Sarm88] J. L. Sarmiento and E. B. Fernandez, "A knowledge-based system for the development of fault-tolerant programs," Proc. of the Florida A. I. Research Symp. (FLAIRS), May 1988, pp. 119 - 124. A revised version will appear in Advances in Artificial Intelligence Research, JAI Press, 1989.

[Seng87] A. Sengupta, A. Sen, and S. Bandyopadhyay, "On an optimally fault-tolerant multiprocessor network architecture," IEEE Trans. on Computers, vol. C-36, no. 5, May 1987, pp. 619 - 623.

[Shin87] K. G. Shin and P. Ramanathan, "Diagnosis of processors with Byzantine faults in a distributed computing system," Proc. of the 17th Int'l Symp. on Fault-Tolerant Computing Systems, 1987, pp. 55 - 60.

[Siew82] D. P. Siewiorek and R. S. Swarz, The Theory and Practice of Reliable System Design, Bedford, MA: Digital Press, 1982.

[Smit86] T. B. Smith et al., The Fault-Tolerant Multiprocessor Computer, Noyes Publications, 1986.

[Smit84] T. B. Smith, "Fault tolerant processor concepts and operation," Proc. of the 14th Int'l Symp. on Fault-Tolerant Computing Systems, 1984, pp. 158 - 163.

[Spie65] M. R. Spiegel, Schaum's Outline of Theory and Problems of Laplace Transforms, New York: McGraw-Hill Book Co., 1965.

[Srik87] T. K. Srikanth and S. Toueg, "Optimal clock synchronization," Journal of the ACM, vol. 34, no. 3, July 1987, pp. 626 - 645.

[Swih84] D. E. Swihart and A. M. Arabian, "Digital flight control and avionics integration techniques," IEEE National Aerospace Electronics Conference, 1984, pp. 1329 - 1331.

[TsoK87] K. S. Tso and A. Avizienis, "Community error recovery in n-version software: a design study with experimentation," Proc. of the 17th Int'l Symp. on Fault-Tolerant Computing, 1987, pp. 127 - 133.

[TsoK86] K. S. Tso, A. Avizienis, and J. P. J. Kelly, "Error recovery in multi-version software," Proc. Conf. on Safety of Computer Control Systems 1986, Sarlat, France: 1986, pp. 35 - 41.

[UhrL87] L. Uhr, Multi-Computer Architectures for Artificial Intelligence, New York: John Wiley and Sons, 1987.

[Walt85] C. J. Walter, R. M. Kieckhafer, and A. M. Finn, "MAFT: a multicomputer architecture for fault-tolerance in real-time systems," Proc. of the Real-Time Systems Symposium, 1985, pp. 133 - 140.

[Wens78] J. H. Wensley, L. Lamport, J. Goldberg, M. W. Green, K. N. Levitt, P. M. Melliar-Smith, R. E. Shostak, and C. B. Weinstock, "SIFT: design and analysis of a fault-tolerant computer for aircraft control," Proc. of the IEEE, vol. 66, no. 10, Oct. 1978, pp. 1240 - 1255.

[Yang86] C.-L. Yang and G. M. Masson, "A fault identification algorithm for ti-diagnosable systems," IEEE Trans. on Computers, vol. C-35, no. 6, June 1986, pp. 503 - 510.

[Youn84] L. J. Yount, "Architectural solution to safety problems of digital flight-critical systems for commercial transports," Proc. of the 6th Digital Avionics Systems Conference, 1984, pp. 28 - 35.

[Yous83] W. J. Yousey et al., "AFTI/F-16 DFCS development summary - a report to industry: redundancy management system design," Proc. of the National Aerospace and Electronics Conference 1983, pp. 1220 - 1226.