A Nonstop* Kernel Joel F. Bartlett Tandem Computers Inc. Cupertino
Total Page:16
File Type:pdf, Size:1020Kb
A NonStop* Kernel Joel F. Bartlett Tandem Computers Inc. Cupertino, Ca. Abstract significantly expanded over its lifetime. The Tandem system is intended to fit these The Tandem NonStop System is a fault- requirements. tolerant [1], expandable, and distributed computer system designed expressly for i. Hardware Organization online transaction processing. This paper describes the key primitives of the kernel A network consists of up to 255 nodes. of the operating system. The first section Each node is composed of multiple processor describes the basic hardware building and I/O controller modules interconnected blocks and introduces their software by redundant buses [2,3] as shown in PMS analogs: processes and messages. Using [3] notation in Figure i. A node consists these primitives, a mechanism that allows of two to sixteen processors, where each fault-tolerant resource access, the processor (Pcentral) has its own power process-pair, is described. The paper supply, memory, backup battery, and I/O concludes with some observations on this channel (Sio). All processors are type of system structure and on actual use interconnected by redundant interprocessor of the system. buses (Sipb). Each I/O controller (Kdisc, Ksync, etc.) is connected to two I/O channels and is powered from two different Introduction power supplies using a diode ORing scheme. Fault-tolerant computing systems have been Finally, dual-ported I/O devices such as built over the last two decades in a number discs (Tdisc) may be connected to a second of places to satisfy a variety of goals. I/O controller. The contents of a disc may These results and differing approachs have be "mirrored" on a second volume, but this been summarized in [1,3,11]. In the past, function is supported primarily by software many of these systems have been designed rather than by hardware. I/O devices other for specific tasks, such as telephone than discs are normally single ported and switching, where the costs of ~ailure are are connected to one I/O controller. significant. In addition, the designers of most of the systems did not intend to The processors are 16 bits wide with up to provide general purpose hardware and 2mb of memory per processor. The internal software modules which the end user would clock speed of the processor is 100ns, customize to form a reliable system. resulting in a register-to-register add in .6 microseconds and a load of a word from An increasing number of online applications memory in i.i microseconds. in commercial data processing has created a demand for fault-tolerant general purpose The two interprocessor buses (Sipb) provide computing. These applications are also each processor with two point-to-point characterized by their high rate of growth, paths to each other processor and to which requires that the computing system be itself. Data transfers to and from the buses are buffered in high speed 16-word * NonStop and Tandem are Trademarks of packet buffers in each processor, allowing Tandem Computers Inc. data transfers at a rate of 13.5 megabytes/sec. This is more than three times faster than the processor's memory, Permission to copy without fee all or part of this material is granted which assures that interprocessor messages provided that the copies are not made or distributed for direct are not delayed by the bus bandwidth. commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by i.i Hardware fault model permission of the Association for Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission. The system design goal is to provide continuous operation in the presence of a single fault. This requires that all © 1981ACM0-89791-062-1-12/81-0022 $00.75 single faults be detectable, diagnosable, 22 and repairable online. In addition, the Sometimes these occur during normal software must allow reintegration of the operations, but often their actions are in repaired module into the system. response to another fault at which point a single misstep may cause the entire system Error recovery analysis needs some to fail. assumptions to be made about the types of faults that the system can tolerate. 2. Software Structure First, a fault in either a processor, its memory, or its power supply must be The operating system [4] provides many of contained in and therefore at most disable the user services usually associated with that processor. Second, a fault in either medium-size systems: multiprogramming, an interprocessor bus or an I/O channel access to a gigabyte of virtual memory per will at most disable that bus. Third, a processor, a file system, and extensive fault in an I/O controller will at most communications facilities. However, it disable that I/O controller. With these offers these services in a fault-tolerant assumptions, it can be seen that a single manner on a modular, expandable computer fault will at most make I/O devices system. To do this, its structure has been attached only to one controller designed as an analogue of the hardware unavailable. structure. As the hardware consists of multiple processors and I/O controllers, Physical events affecting the system's the operating system functions are hardware can be divided into three classes distributed over multiple processes; and as [i]. The first class, permanent hardware the hardware modules are interconnected via failures, will be detected either on the redundant buses, the operating system initial instance of the failure or shortly processes communicate via fault-tolerant thereafter by background tests. These give messages. rise to two kinds of problems: error in recovery algorithms and contamination of At the time this structure was proposed and data bases that takes place before failure development started, December 1974, there detection. were few precedents for it. Our main inspiration was found in the work of Intermittent component failures share the Dijkstra [5] and Brinch Hansen [6], whose previously mentioned problem areas. In ideas and examples provided the key kernel addition, unless there is immediate primitives and structure for the system detection, there is a far higher design. probability of data base corruption as the background tests are much less likely to 2.1 Processes see the problem• Each processor supports up to 256 A final source of problems, and perhaps the concurrent processes. Each process has a most serious in actual operation, is that private data space, but may share code with caused by external interference with the other processes in the same processor. A system. This class includes such items as major portion of the cost of a process air conditioning failures, but is primarily switch is chargeable to memory mapping composed of operational errors by either which is required to designate the new the computer operator or service personnel. process" code and data spaces. This •- .......... Sipb I I II Pcentral Pcentral Pcentral I I I Sio Sio Sio ---Kdisc .... Ksync ] .... Tdisc .... ---Tmodem (to another node) ----Tdisc .... ---Tmodem (to another node) Kdisc--- Ktape ........ I Ttape ........ Kasync ....... I Tterminal's Figure i: PMS Diagram of a Node. 23 results in a cost of approximately .5 acknowledgement for each logical request, milliseconds to switch processes. which is the first key to providing fault tolerance in this system. All processors contain both a monitor process and a memory manager process. The In addition, the application program is not Monitor's functions include process allowed direct access to the message management within its processor (e.g., system. This is done by providing a process creation and deletion), information conventional user/privileged mode in the return, message system control, and fault processor. User programs may only enter recovery. When a process is created, it is the operating system (and privileged mode) given a unique "processid" composed of two at certain defined entry points, e.g. parts. The first is its location (node NEWPROCESS in the previous example. This number, processor number, and process restriction provides obvious protection and number) and the second is either a unique information hiding benefits and allows the timestamp or a symbolic name. operating system to control error recovery on message failures, a second key to the Process synchronization primitives include system's fault tolerance. counting semaphores and process-local event flags. Semaphores are used within the kernel to control such things as access to 2.2.1 Message primitives shared I/O controllers. Event flags are used to signal a process that events such Before the error recovery strategies can be as device interrupt, message arrival, and shown, the kernel's message Primitives must message completion have occurred. These be introduced. A message exchange using primitives were chosen using the author's these primitives takes the following form: experience at the time rather than following an exhaustive survey of available A process "R", the requestor, sends a methods. They have been more than adequate message to process "S", the server, by for the resource control involced in I/O calling the procedure LINK. The caller operations and in the implementation of the supplies a processid, six message parameter message system. More complex resource words, and an optional resident buffer control is handled by requesters sending which may contain additional data related messages to the process which "owns" the to the message or may be used to return a resource. result. If the message can be sent, then the address of the sender's Link Control 2.2 Messages Block (LCB) is returned to R, a matching LCB (which contains the six message Almost all information flow, even within a parameter words, R~s processid, S's single processor, is carried in messages processid, and the address of R's LCB) is rather than through shared storage.