Recovery from Soft Errors in Triplicated Computer Systems Operating in Lock-Step
Total Page:16
File Type:pdf, Size:1020Kb
' OLE COPY Nor REMOVE NBSIR 79-1927 0£c 14 1979 Recovery from Soft Errors in Triplicated Computer Systems O perat i ng in Lock-Step A. L. Koenig A. W. Holt System Components Division Center for Computer Systems Engineering Institute for Computer Sciences and Technology National Bureau of Standards Washington, D.C. 20234 June 30, 1979 Issued November 1979 Sponsored by Defense Nuclear Agency Washington, D.C. 20305 NBSIR 79-1927 RECOVERY FROM SOFT ERRORS IN TRIPLICATED COMPUTER SYSTEMS OPERATING IN LOCK-STEP A. L. Koenig A. W. Holt System Components Division Center for Computer Systems Engineering Institute for Computer Sciences and Technology National Bureau of Standards Washington, D.C. 20234 June 30, 1 979 Issued November 1979 Sponsored by Defense Nuclear Agency Washington, D.C. 20305 U S. DEPARTMENT OF COMMERCE Luther H. Hodges, Jr., Under Secretary Jordan J. Baruch, Assistant Secretary for Science and Technology NATIONAL BUREAU OF STANDARDS. Ernest Ambler, Director RECOVERY FROM SOFT ERRORS IN TRIPLICATED COMPUTER SYSTEMS OPERATING IN LOCK-STEP ABSTRACT A Triply Modular Redundant (TMR) computer system operating in clocked lock-step is being investigated for an application requiring a Mean Time Between Failure of five years. No mechanical memories are used; this allows comparison of the outputs of the three computers to be made each clock period. The most novel contribution is the method of recovery from soft errors, such as those produced by lightning strokes or alpha particles. Data are provided on the uptime history of an experimental system, which uses three commercial microcomputers. Key Words: Fault tolerant computer; soft errors; triply modular redundant; TMR. INTRODUCTION Where exceptionally high reliability is essential, an acceptable level of performance can often be achieved through the use of redundant structures or modules. Two common forms of modular redundancy are fault masking and spare switching. With fault masking, all channels are active all the time; majority logic is used to "mask" a failing channel, keeping the total system output correct. Spare switching, as the name implies, has spare channels in a standby or passive state ready to be switched in when an active channel fails. Digital systems have been implemented with these techniques with varying degrees of success. Fault masking systems are often difficult to maintain because the process that makes them fault tolerant also makes fault detection and isolation complicated. Using spare computers in standby mode presents the complex problem of making the spare processor reconstruct the current working status after the active computer has failed. A third category of modular redundancy has been proposed, called Sift-Out Redundancy . With this technique there are no standby channels; all are active at starting time. If any one fails, however, the structure reconfigures itself so that the contribution of the failed channel to system output is eliminated. At least three modules are required to implement this kind of structure. If reduced to two modules and one fails, the system is unable to detect which module is in error. If three or more modules are used in the implementation, any errant one is "sifted out" automatically; when a failure occurs in another module, the process repeats itself until the system is sifted down to two. It must be assumed that no more than one module will fail at a given instant of time. The sifted out modules may be restored to operation at any convenient time by a command from the supervisory element of the system. 1 / P. T. de Sousa and F. P. Mathur, "Sift-Out Modular Redundancy," IEEE Transaction on Computers, Vol. C-27, pp. 624-627, July 1978 3 The experiment described here is an implementation of Sift-Out Redundan- cy applied to the data bus of a microcomputer. One unique aspect of the system is the requirement for synchronous operation of multiple computers in order to use the Sift-Out technique; another is the self-restoring action which takes place when an error has been detected. The latter takes advantage of the synchronous operation in its implementation. In the triply modular redundant (TMR system described in this report, the modules are single-board microcomputers. Each module consists of a microprocessor, read/write random access memory (RAM), programmable read-only memory (PROM), buffers for the address bus and data bus, and various timing and control circuits. All three computers run identical programs which reside in PROM. The RAM is used to store data. The programs run "lockstep"; synchronism is achieved by driving all three processors from a common clock. The input/output (I/O) data bus of each computer is the logical OR of the individual bus outputs. The busses are checked for identical content once during each clock period. If one bus is not identical to the other two, it is logically disconnected from the combined output immediately and latched out. Outputs from the error detector identify which bus has failed. This information is used by the computers to take corrective action: restore erroneous data in RAM; resynchronize the programs; reset the error detector, reconnecting the failed bus to the system output. Thus the system is self-restoring for "soft" failures, i.e., transient faults on one bus. The system can be programmed to give an appropriate alarm for more serious or very repetitive faults. Because of limitations on time and resources, this experiment is designed to demonstrate only the operation of the Sift-Out comparator hardware and the feasibility of automatic recovery through memory-to- memory transfer. Admittedly, a fully operational computer system is susceptible to many types of faults other than soft errors on the data bus. It is not within the scope of this experiment to remedy or even identify all failure modes, but rather to offer a means of improving reliability for one class of errors. Recognizing the limitations of the experiment, no attempt has been made to measure the improvement in Mean Time Between Failures resulting from TMR, though the uptime history of the system presents an encouraging picture. HARDWARE DESCRIPTION System Configuration Figure 1 shows the configuration of the hardware for the TMR experiment. One line from the data bus of each computer is brought to the comparator and to the collector. The collector output is the "correct" system output, i.e., the combined output from which erroneous data has been sifted out. One one data bus line, the least significant bit (LSB) , is used in this experiment in order to minimize the construction of special hardware; in a fully implemented system all data lines, and possibly address bus lines, would be similarly connected. However, comparison of one data line is sufficient to demonstrate the principles of the TMR 4 CONFIGURATION TMR MASTER CLOCK 1, FIGURE 5 . system. The choice of the LSB for comparison makes the comparator particularly sensitive to timing anomalies since this bit is most likely to change from one clock cycle to the next. Other interconnections of the hardware are needed to effect automatic resynchronization after an error has been detected. Command decoders are connected to each address bus in order to assert reset lines and enable input gates for reading the error detector. Parallel I/O ports connect the computers to each other for memory-to-memory transfer. Computer 1 has additional peripheral devices attached which are used primarily for software development and are not directly involved in the TMR experiment; A CRT display, keyboard, cassette tape recorder, and a 2400 baud serial I/O port. The serial port is used to down-load program code from computers in the NBS Experimental Computer Facility (ECF) The CRT display is used in the experiment as a means of verifying correct operation of the TMR system. Computer 1 and its peripheral interface boards are installed on a single backplane which is mounted on a chassis along with the system power supplies; computers 2 and 3, their parallel output ports, and the comparator hardware are mounted on a remote chassis. There is about one meter of cable between chassis. Microprocessors The microprocessors used in this experiment are type 6800. The 6800 is a single chip, 8-bit N-MOS processor housed in a 40-pin package. Its important features include: Bidirectional data bus (8 bits) Sixteen-bit address bus 72 instructions (1 to 3 bytes/instruction) 1 MHz maximum clock rate Two 8-bit accumulators Three 16-bit registers (stack pointer, program counter, index register) Memory-mapped I/O Timing is controlled by a 2-phase clock which is external to the chip. During phase one, the contents of the program counter are transferred to the address bus; on the negative transition of this phase, the program counter is incremented. During phase two, data are put on the data bus, and, when this phase goes low, these data are latched into either the processor or external memory. Direction of data flow is determined by a read /write control line. CPU Boards Each CPU board consists of the microprocessor chip, 2,048 bytes of static RAM, 256 bytes of PROM, buffers for the address and data busses, and several timing and control circuits. 6 The 16-bit address bus is buffered with tri-state drivers. Since the processor uses memory mapped I/O, the same lines are used for memory address and I/O port addresses. The bidirectional data bus is split into four parts by the buffering process; memory data out, memory data in, I/O data out, and I/O data in. Memory and I/O data out are not functionally different; they are simply connected to two different sets of tri-state buffers. The input buffers, however, are controlled differently. The memory inputs are gated with a clocked Read signal while I/O inputs are selected by the Read signal and a signal decoded from the upper half of the address bus.