<<

Fault Tolerant Operating Systems*

PETER J. DENNING Science Department, Purdue University, West Lafayette, Indiana 47907

This paper develops four related architectural principles which can guide the construction of error-tolerant operating systems. The fundamental principle, system closure, specifies that no action is permissible unless explicitly authorized. The capability based machine is the most efficient known embodiment of this principle: it allows efficient small access domains, multiple domain processes without a privileged mode of operation, and user and system descriptor information protected by the same mechanism. System closure implies a second principle, resource control, that prevents processes from exchanging information via residual values left in physical resource units. These two principles enable a third, decision verification by failure-independent processes. These principles enable prompt error detection and cost-effective recovery. Implementations of these principles are given for process management, interrupts and traps, store access through capabilities, protected procedure entry, and tagged . Keywords and phrases: capability machine, decision verification, error confinement, error detection, memory access, memory control, multiple domain processes, privilege-state mechanism, process implementation, process isolation, process switching, protected procedure entry, resource control, error recovery, store access, store control, tagged architecture CR Category: 4.30

INTRODUCTION The full range of approaches to operating systems reliability is not surveyed here. Many modern computer systems seek to Rather, this paper concentrates on hard- provide nearly continuous interactive ser- ware assistance for error confinement, with- vice for their customers, to protect against out which most other reliability goals are accidental or malicious destruction of in- difficult to achieve. With the focus on the formation in their trust, and to guarantee capability-based architecture as a particu- that confidential information cannot be larly attractive form of such hardware divulged. These systems are judged as much assistance, this tutorial concentrates on oper- by their abilities to do these things reliably ating systems for such machines. The capa- as by the range of services they offer. Al- bility architecture was originally envisioned though reliability, in the sense of error as a uniform method of protecting access to tolerance, has long been sought in operating data segments and procedures in a computer system software, it has always been difficult system [DVH66, WIL75]. More recently, it to achieve. A set of principles of reliable was advocated also as a uniform method of operating systems has begun to emerge. addressing shared objects [FAB74]. This * Supported in part by NSF Grant GJ-43176. paper explores the theme that guided the

Copyright (~) 1976, Association for Computing Machinery, Inc. General permission to republish, but not for profit, all or part of this material is granted provided that ACM's copyright notice is given and that reference is made to the publication, to its date of issue, and to the fact that reprinting privileges were granted by permission of the Association for Computing Machinery.

Computing Surveys, Vol. 8, No. 4, December 1976 360 • Peter J. Denning

CONTENTS capability concept immediately intrigued implementors. Lampson introduced it into the Berkeley Computer Corporation (BCC) 500 computer [LAM69], but unfortunately that system never went into production. Aspects of the capability concept were used in the Cal 6400 timesharing system at the University of California, Berkeley; however, INTRODUCTION Reliability' Lampson and Sturgis report that the full Error Confinement Principles power of the idea could not be implemented 1 PROCESSES Concept on a machine without considerable hard- A Single-Processor Implementation ware assistance [LAST6]. Meanwhile, at the Interrupts and Traps University of Chicago, Fabry had a pre- 2. PROCESS ISOLATION Open versus Closed Environments liminary of a machine with capability Implementing a Capability Environment hardware by 1968 [FAB68]; this machine Processes of Multiple Domains was never completed either, for lack of Isolating Descriptor Information Examples and Comparisons continued support. At the University of 3 RESOURCE CONTROL Cambridge, Wilkes and Needham have The General Principle Application to Memory Management been constructing their own capability Apphcation to Processor Assignment machine (called CAP), which appears likely 4. DECISION VERIFICATION to reach completion [NEE72, WAL73]. The Sender-Receiver Formulation Application to Memory Access and Control Hydra operating system, under develop- Application to Encapsulation ment at Carnegie-Mellon University, is 5. ERROR RECOVERY General Concepts also based on capabilities [CoJ75, WUL74, System Error Recovery WUL75]. The only capability machine in User Error Recovery full production is the British Plessey Cor- 6 SUMMARY ACKNOWLEDGMENTS poration's System 250 [ENG72], in which APPENDIX I Procedure Mechanmm for a Capabd~ty Ma- many of Fabry's ideas have been incor- chine porated. As of 1976 about 20 of these ma- REFERENCES chines were in use, most in program de- velopment for a large military system, the rest under testing as a highly reliable tele- phone switching computer. Critics are fond of noting the paucity of capability machines in the open market. The failure of many projects to reach a v production stage is easily explained by designers of the Plessey S/250: the capa- factors having no connection with the cap- bility architecture is a uniform method of ability concept itself. The common culprit implementing the highest possible degree seems to be underestimation of the difficulty of error confinement [ENG72]. of combining many new ideas (not just The concept of a "capability" as a gen- capabilities) in one design, leading to a eralized permission to use storage objects hopelessly delayed project. Lampson and (segments) and procedures was introduced Sturgis report that the Cal 6400 project fell in 1966 by Dennis and Van Horn [DVH66]. victim to this syndrome; and that others They suggested that operating system func- have suffered but survived--e.g., Hydra, tions could be envisaged as "meta-opera- , OS/360 [LAS76]. The apparent tions" manipulating a protected data struc- lack of vendor interest in capability ma- ture comprising a set of "capability lists"; chines results in part from the costly trauma a process could perform a given meta-opera- experienced by the entry of the computer tion only if a capability for it was recorded era into its third generation during the mid in the capability list of that process. The and late 1960s: buyers were understandably

Computing Surveys, Vol. 8, No. 4, December 1976 Fault-Tolerant Operating ~ystems • 361 chary about further innovations, however • all local data is consistent and error- attractive their principles. Now, in the free; and middle 1970s, we have come to appreciate • nothing outside the module can affect the serious limitations of our 1960s machines; its internal behavior except via the we are much more sensitive to the issues of interface. security and reliability; we are receptive to Thus, the correctness proof can demon- proposals for more secure and reliable strate only that the program in question systems; and, most importantly, we have at contains no design faults. Unless the pro- hand the technology to construct what once grammer or support system provides re- was considered ambitious and expensive dundancy and dynamic checking, errors hardware. The outlook is optimistic. introduced at run time can invalidate the correctness proof--for example, invalid or Reliability: Fault Tolerance inconsistent data can be passed into a procedure, or hardware malfunction can Melliar-Smith has suggested some interest- alter instruction code or data. The essential ing distinctions that clarify the relations point is that a reliable system must employ among failures, errors, and faults; and be- many run time mechanisms to keep it as tween reliability and correctness [MEL75]. close as possible to a correct state. A failure is an event at which a system vio- The architect of a reliable operating lates its specifications. An error is an item system must provide mechanisms of four of information which, when processed by types [WuL75]: the normal algorithms of the system, will • Error Confinement: The computing produce a failure. Errors do not always environment is designed so that no produce failures; they can, for example, be procedure has more capabilities than expunged by error recovery algorithms. A required for its immediate task, no fault is a mechanical or algorithmic defect procedure has a large domain of which may generate an error. The failure access, and no procedure is allowed to of a component may generate an error which operate on inconsistent data. These will appear as a fault elsewhere in the properties do not prevent errors, but system. A considerable time may elapse they will limit the risk that errors do between a failure and its detection. much damage before being detected. Reliability and correctness are not the • Detection and Categorization: Dy- same. Parnas explains the distinction as namic verification checks on infor- follows: a system is correct as soon as it is mation, and on the actions of proce- free of faults and its internal data contains dures, will detect errors by exposing no errors; it is reliable if failures do not inconsistencies in data or attempts seriously impair its satisfactory operation to violate access constraints. The [PAR75]. Reliability means not freedom ideal is precise characterization of from errors and faults, but tolerance against the detected error, so that the data them. can be restored to a consistent state Software need not be correct to be re- (one from which all subsequent system liable. A program module is considered re- operations remain within their speci- liable if the most probable errors or faults fications) and that the fault that do not render it unusable, and if those produced it can be located. It is es- that do are rare and not at times of great sential that the checking algorithms need. Similarly, correct software may be be independent of the system being unreliable. Correctness proofs often make checked, lest an error prevent its important implicit assumptions which can own detection: the checks should be be easily invalidated in practice--particu- based on the specification, rather larly that than the implementation, of the sys- • The underlying machine is correct, tem. i.e., will not fail; • Reconfiguration: The objective is plac-

Computing Surveys, Vol. 8, No. 4, December 1976 362 • Peter J. Denning

ing the system into a consistent state. each process should have no capability This may be accomplished by re- beyond what is required to perform its task. moving from service the failed unit, This means that processes cannot interact be it hardware or software; by recon- except along prearranged paths; no unex- structing valid copies of data; or by pected form of interference can occur. It backing up (parts of) the system to also means that information describing a prior, error-free state. These actions the capabilities of a process must be pro- need not repair damage, but they will tected from alteration. Saltzer and Schroeder prevent further damage. characterize this as "no-privilege defaults": • Restart: The reconfigured system is every action is denied unless explicitly al- restored to service. lowed [SAS75]. Underlying these mechanisms is an error Resource control implements assignment logging and reporting system that passes error of physical resource units to computational messages across interfaces and allows system objects, processors to processes, for example, engineers to locate and correct persistent or page frames to segments. The primary faults [PAR75, RAN75, and WUL75]. principle is that when a unit is preempted Error confinement is the most funda- from an object, the unit's state should be mental of the mechanisms above: full or saved and the unit placed in a null state; partial repair cannot realistically be suc- and when it is reassigned, the unit should cessful without assurance that the damage be placed in the state it had at the time of is localized before repairs are undertaken. last preemption from that object. In short, Error categorization and fault location no unit that contains vestiges of prior use are more likely to succeed if the possible by one object should come under the con- extent of damage is small. In principle, early trol of a different object. A secondary prin- detection gives the means for immediate ciple specifies the ability to verify that com- correction and repair. In practice, however, mands to assign and release are proper: a errors cannot or may not be detected im- unit may be assigned only when free, freed mediately-e.g., a program may not use only when assigned. bad data for a long time after the damage Decision verification specifies that every occurred. Error confinement reduces the decision should be computable in at least risk that programs will be applied to bad two independent ways. A discrepancy in- data, and that external factors can invalidly dicates an error. In the context of inter- affect a program's behavior--thus support- acting processes, a decision-making process ing ~he implicit assumptions underlying should send a message to a decision-doing most program correctness proofs. process assigned independent resources; before carrying out the decision, the re- Error Confinement Principles ceiver verifies that it is the correct receiver (e.g., by checking a tag field in the message), This paper focuses on the capability archi- and that the action requested is consistent tecture as a "natural" approach to maximal with the state of the system. In some ap- error confinement. The following sections plications, a programmer should be able to treat four general principles that enhance specify a third process that approves mes- reliability by confining errors in such sys- sages between two given processes. tems. Though they do not exhaust all the Error recovery seeks to repair damage; possibilities, these principles illustrate the if this cannot be done, it seeks to recon- techniques of primary interest in the nu- figure the system by removing from service cleus of an operating system. They include: any defective resource units and placing • Process isolation; the remaining resource units and processes • Resource control; in consistent states. The principles of process • Decision verification; and isolation, resource control, and decision • Error recovery. verification allow errors to be detected and Process isolation is the principle that located before they can spread. Error re-

Computing Surveys, Vol 8, No 4, December 1976 Fault-Tolerant Operating Systems • 363 covery in such a system can realistically en- 1. PROCESSES deavor to correct errors, rather than merely Concept mitigating their effects. A principal conclusion of this paper is The term "process" is commonly used to that many operating systems are unre- denote an abstract entity that demands and liable because of inadequate hardware as- releases various resources as it carries out a sistance for software error detection and task. It can be visualized as a program exe- confinement. Under the traditional desire cuting on a virtual processor (VP), having for "flexibility," a heavy overhead is as- access to information stored in a virtual sociated with imposing the logical structure memory (VM). As suggested in Figure 1, of the operating system onto amorphous the VP contains four principal components: hardware. Consequently, the additional i, a unique index number identifying structure for error checking and contain- the process; ment cannot be supported on traditional d, a unique identifier of the domain hardware. Highly reliable operation systems of access of the process (in the will result only when the nucleus (kernel) figure, d associates the VP with is constructed by integrating hardware and the VM containing the informa- software. This means that the abstract tion of the process); machine constituting the nucleus is specified ip, an instruction pointer that speci- first; only then is the requisite hardware fies the address in the process's (or firmware) designed, by letting considera- domain at which the next instruc- tions of cost and efficiency determine which tion to be obeyed can be found; portions of the abstract machine will be and implemented in the hardware [DIJ68, Lm72, regs, a vector containing the values of NEU75]. arithmetic, index, and base reg- The suggestion that the hardware support isters used by the process. essential operating system functions has The vector (i, d, ip, regs) is called the state- been made often, but only recently has word, or descriptor, of the process. The technological advance made it feasible. virtual processor of process i will be denoted [BAH73, HAB76, SHA74]. Since the early as VP(i). 1960s, Burroughs has provided extensive A process proceeds in the usual way: its hardware support for its two central ope- VP alternates between an action that rating system structures, the per-process fetches the next instruction (x in the figure) stacks and the segmented virtual memory and an action that performs its effect on (see Organick [ORO72]). Though small, the data (such as y in the figure). A new value Venus system demonstrated the feasibility of stateword is defined on completion of an of rendering process synchronization and instruction. The dynamic behavior of a I/O operations as microprograms [Lis72]. process is implicit in the sequence of state- Recent interest in virtual machine archi- words it generates [BRH73, DEN71, HAB76, tecture has stimulated careful reconsidera- HoR73]. tion of the hardware on which operating systems are built [MES70, PK75a, PK75b]. Experimental capability machines are in A Singte-Processor Implementation operation [ENG72, NEE72, WAL73, WVL74]. Many systems must implement a large PRIME stressed integration of hardware and number of processes with a small number of software for high reliability [FAB73]. Al- real processors (RPs). For simplicity we though the notion of integrating hardware consider an implementation based on a and began from the moti- vation to simplify operating systems, it is single RP. The process manager employs a now receiving wide attention as an essential switching mechanism that lets each VP in step toward computer systems of high re- turn run for an interval on the RP. The liability. term "virtual processor" was invented to

Computing Surveys, VoL 8, No 4, December 1976 364 • Peter J. Denning

instructions !

d

ip t

regs i data

VIRTUAL PROCESSOR (VP)

VIRTUAL MEMORY VM) FIGURE 1. Structure of a process. The stateword is embodied in the virtual processor, the instructions and data in the virtual memory. Process i is shown with access to domain d, about to execute instruction x which will affect data y and possibly the registers (regs). suggest the simulation inherent in this enabled processes according to pri- procedure. ority; and To capture the idea that the RP should 4) a set of blocked queues, one for each be assigned to the most important work signal or event on which a process first, we extend the stateword to include a may wait, containing indices of blocked priority number p: processes. These structures are stored in the machine's stateword = (i, d, ip, regs, p). real memory, along with the virtual mem- The priority number is usually a nonnega- ories of all the processes. (See Figure 2.) give integer, with lower numbers signifying [BRH73, see Chapter 4; N~.H75] higher priority. The system observes a A simple implementation of the ready priority scheduling rule: the running process and blocked queues is obtained by linear must always be a highest priority enabled lists using a link field in each process list one. entry; the link of entry i contains the index At any given time, a process may be in of the process following i in its queue. (In exactly one of three "execution states": practice, doubly-linked lists could be used running, when its VP controls the RP; for better reliability [WvL75].) The ready enabled, when it is awaiting an interval of queue is the concatenation of first-in-first- assignment to the RP; and blocked, when it out (FIFO) queues for the increasing pri- is awaiting some specified signal or event. ority levels; it can be specified by the The process manager for a system with a auxiliary pointers single RP employs four data structures; H, To, T1,'", Tp,... the first records the existence of processes, the three others their execution states: in which H is the index of the head process 1) a process list containing the state- and T~ that of the tail of the pth priority words of all processes-in the system; section. To speed up switching the RP to 2) the virtual process index (VPI) reg- the next ready process, H can be stored in ister in the processor, to hold the in- a register of RP. If a process of priority p dex of the runniug process; becomes enabled, it can be inserted in the 3) the ready queue, ordering indices of ready queue following process Tp. The

Computing Surveys, Vol. 8, No. 4, December 1976 Fault-To,rant Operating Systems • 365

Process L=st

VPI ° o REGSr~ PR~----'~'-'-~

REAL PROCESSOR (RP)

REAL MEMORY FIGURE 2. Relations among real and virtual objects. The process list contains statGwords of all virtual processors. The real memory contains the process list and all virtual memories. The real processor contains the stateword of the running process. queue of processes for the kth signal or address of the process list is kept in an RP event is implemented similarly, though the register, process indices can be interpreted internal priority ordering is usually not re- as displacements from this base. Consistent quired. Figure 3 illustrates a configuration with the priority scheduling rule, any action of such queues, where link value 0 denotes which enables a process whose priority ex- the end of a queue and process 0 is a nullity. ceeds that of the running process, must The process manager also contains a set reschedule the running process and gen- of operations, of three kinds: for switching erate a SWITCH operation. the RP among VPs, for moving process The Multics processor has a pair of in- indices among queues in response to changes structions corresponding to steps 1 and 3 in their execution states, and for adding or [ORG72]. The IBM 360/370 equipment has removing processes or blocked queues in instructions for loading and storing pro- the system. Switching the RP to the first gram status words (PSWs), which corre- ready process can be implemented by a spond to the (d, ip) portion of the state- microprogram of these steps: word [GRA75, MAD74]. Neither of these is 1) save the stateword in the process list complete; Multics does not include the entry whose index is in the register scheduling decision. IBM in addition does VPI; not save the program registers. In contrast, 2) Load register VPI with H, the index the Burroughs B6700 has a SWITCH of the highest priority enabled process, operation that saves a stateword on a replace H with its successor; and then process's stack and activates the process of a 3) load into the RP the stateword of the given next stack [ORG72]. The Control Data process list entry whose index is now 6000 series implements an "exchange in VPI. jump" instruction that swaps 24 registers Hereafter, these steps will be called the of two processes within 5 ~sec [THo70]. SWITCH operation. They are sometimes The Modular Computer Corporation has a also referred to as the "context exchange" minicomputer (the MODCOMP IV) that or "context switch" operation. If the base stores 16 statewords in a local store to speed

Computing Surveys, Vol. 8, No. 4, December 1979 366 Peter J. Denning

PROCESS LIST READY LIST

link H To il i2

i2 #3

QUEUES i6 i7

i7 0 head toil other mfo

'"i '71 -"

13 14

i# 15

i5 [ 0

FIGURE 3. A configuration with processes il, • .. , i5 on the ready list at priority 0, and 96 and i7 waiting in the kth blocked queue. up switching among the most active pro- The third class of process management cesses. operations are used for adding and removing The second class of process management processes, semaphores, or message queues; operations consists of those that cause execu- they include operations like tion state transitions. These include such i :-- createproeess(initial stateword) well known operations as send message deleteproeess(i) (i,m) and get message (i,m) for sending or s := createsemaphore(initial value) receiving a message m from the system mes- deletesemaphore (s) sage queue with index i, and wait(s) and signal(s) for some counting semaphore s Their implementation is straightforward [BRH73]. In the latter case, wait(s) causes manipulation of the process lists or sema- s to be decremented by 1; if the result is phore queues [BRH73, I~B76, N~.K75]. negative, the caller's index (in VPI) is at- tached to the tail of the queue for semaphore Interrupts and Traps s and a SWITCH operation generated. The operation signal(s) causes s to be in- An executing process may encounter con- cremented by 1 ; if the result is not positive, ditions, known as exceptions or faults, which the first index is transferred from the queue preclude its further progress until and unless of s to the ready queue. A SWITCH opera- they are corrected. They include conditions tion is generated if the signal enables a checked for in the hardware, such as arith- process of higher priority than the sender. metic contingencies (overflow, underflow, Another state-changing operation is the divide-by-zero, etc.), addressing snags (seg- interval timer, which is used to generate a ment or page faults), invalid accesses to SWITCH after a time limitwso that no segments, illegal or undefined instructions, process can monopolize the RP. or parity check errors in data transmissions

Computing Surveys, Vol. 8, No 4, December 1976 Fault-Tolerant Operating Systems • 367

(See Dennis and Van Horn [DVH66].) preemption of the currently running pro- They include conditions checked for by cess. This preemption is usually called an programming languages, such as array- "interruption" of the current process; for reference-out-of-bounds or undefined pointer this reason, device signals are sometimes values (see Goodenough [Goo75]). A hard- called "interrupt signals". Implementation ware trapping mechanism is usually imple- can employ a microprogram or micro- mented to stop the process and correct processor to set a semaphore private to the the condition. In simplest form it comprises: device driver process, thereby using the 1) Indicator flipflops. The ith flipflop is ordinary process management mechanism to set whenever the ith condition is bring the driver into operation. Many older detected. It is reset when the condition systems handle interrupts through the trap is responded to. mechanism: the device signal causes a trap 2) Trap Microprogram: At selected points to a procedure which places the current in its instruction cycle the processor process in the ready queue, enables the examines the indicator flipflops. If at driver process, and generates a SWITCH least one is set, it selects one, resets it, operation. and calls a procedure empowered It is important to remember the distinc- to deal with the corresponding fault tion between fault and device signals. A condition. A "condition code" indi- fault signal is intended to force a procedure cating the nature of the fault is passed call on a fault handler that operates in the as parameter to this procedure. environment of the same process. A device 3) Masking: A mask flipflop, set auto- signal is intended to enable a high priority, matically by the trap microprogram, work dispatching device driver process, disables further invocations of the and, hence to preempt the (real) processor trap microprogram until it is reset. into a different environment (domain). This permits the fault handling pro- This distinction is missed or blurred in cedure to run without interruption. machines that handle device signals via the The mask is reset on exit from the trap mechanism. Failure to recognize it in fault handling procedure. the implementation reduces reliability. It is helpful to imagine that the occurrence (This point will be discussed again in con- of a fault produces an immediate, unexpected nection with the resource assignment prob- procedure call on the fault handling procedure lem.) [ORa72]. Such calls must obey the normal rules of procedure calls (to be discussed in 2. PROCESS ISOLATION detail later) which include establishing the Open versus Closed Environments proper domain for the fault handler, and reestablishing the original domain of the A system is a closed environment if the normal interrupted process when the fault has been state of affairs is that no process or pro- corrected. cedure has any capability which has not Besides fault signals, operating systems been explicitly granted--that is, constraints must respond to another important class on actions must be removed expressly. A of signals, device signals. These are typically system is an open environment if the normal the completion signals issued by input/ state of affairs is that every process and output devices or controllers at the end of procedure has every capability which has a transaction. The purpose of a device signal not been explicitly denied--that is, con- is to enable a device driver process that dis- straints on actions must be imposed ex- patches the next transaction for the device. pressly. Implementation can be envisaged To maintain high levels of concurrency, in terms of a list associated with each pro- device driver processes typically have very cess: a list of capabilities in the closed en- high priority. According to the priority vironment, a list of restrictions in the open scheduling rule, the enabling of such a pro- environment. A process may perform an cess will frequently require the immediate action only if permission either is contained

Computing Surveys, Vol. 8, No. 4, December 1976 368 • Peter J. Denning

CL(d)

do Ato Jr.~" .J Ioi

ObJlCt x VP(#) VIRTUAL PROCESSOR CAPABILITY LIST FIOURB 4. An essentially closed environment. The virtual processor specifies action A on object with local address j. The system obtains (c, x) from the capability list associated with VP(i). Assuming action A is enabled by access code c, the system translates the name x to a reference to an object. The capability list is outside the domain of VP(i). in some capability, or is not denied by some forces all interactions into the open, it is restriction. possible to check them all for consistency In principle, the two forms of environment and validity as desired. are functionally equivalent--the one can Figure 4 illustrates the intuitive model simulate the other. In practice, the closed of a dosed environment. A domain of ac- environment is highly error resistant, while cess is defined by listing all capabilities in a the open environment is susceptible both capability list; the list for domain d, de- to design errors and to hardware malfunc- noted CL(d), is associated with a virtual tions. The open environment is susceptible processor via a domain register (D). A to design errors because a forgotten re- capability is represented as a triplet (j: c, x) striction will enable an invalid action; in specifying an association between a local contrast, a forgotten capability in a closed name j of the given domain and a unique environment will prevent a valid action. system name x of an object; an access code e The open environment is susceptible to mal- specifies which actions are enabled for x by functions because a destroyed restriction any process in domain d. The virtual pro- may enable an invalid action; in contrast, cessor invokes an action with a (meta) a destroyed capability in a closed environ- command "do A to j"; using j as key, a ment may disable a valid action. The im- mapping mechanism obtains the pair (c, x) portant point is that in an open environ- from the capability list, completing the ment errors and omissions tend to increase reference to x only if action A is enabled the range of action of a process, whereas by code c. in a closed environment they tend to de- Although capability lists are a natural crease this range. The open enviroment (i.e. intuitive) implementation of a closed is of little interest when error confinement environment, other implementations are is important. possible. Figure 5 illustrates an important The principle of a closed environment alternate method based on associating an is to give each process no more capabilities access list AL(x) with each object x. A than it needs to perform its task. The virtual processor of domain d may issue a normal state of affairs is completely dis- (recta) command "do A to x"; the map- joint, isolated, processes: nothing can be ping mechanism completes the reference shared or exchanged among processes ex- only if AL(x) contains an entry (d, c) whose cept by explicit arrangement, all interactions access code c enables action A. The access being prohibited unless expressly allowed. list method is functionally equivalent to No process can attempt to interfere, or the capability list method, since the triplet communicate, with another in an unex- (d, c, x) which grants processes of domain pected way. Because a closed environment d c-access to object x, is represented by

Computing Surveya, Vol. 8, No. 4, December 1976 Fault-Tolerant Operating Systems • 369

AL(x)

or -7 do A to x~>._.. --~

\11Oblect x VPh) VIRTUAL PROCESSOR ACCESS LIST FIGURE 5. Access list implementation of an essentially closed environment. The virtual processor attempts an action A on an arbitrary object x. The system translates the name to an access list, permitting action A only if enabled by access code c in an entry corresponding to the domain of VP(~).

storing (c, x) in CL(d) or by storing (d, e) (These points are explored in depth by in AL(x) [DE~71, LAM71]. Saltzer and Schroeder [SXS75].) The attraction of the capability list In short, the capability list method per- method is its run time efficiency. Suppose re_its the efficient use of capabilities, whereas each domain has access to an average of m the access list method permits their efficient objects, and each object is accessible to an administration. The most desirable imple- average of n domains. The expected storage mentation of a practical closed environ- requirement for the capabilities of an active ment should thus employ aspects of both, process in the capability list method is one rather than committing itself exclusively to capability list, or an average of m locations one. In Multics, for example, capability of fast memory; to guarantee comparable lists (descriptor segments) are initialized speed in the access list method requires an from access lists stored in file directories, average of m access lists, or an average of and are deleted when processes become in- mn locations of fast memory. Put another active [ORG72, SCS72]. The way, the "working set" of capabilities is arising from using both together improves stored in one place in the capability list the prospects of recovering from errors implementation but not in the access list that damage capability or access lists. method. It is only fair to note that the concept of Another distinction between the two forms isolation implied by a closed environment is of implementation is the approach to check- not so well defined as the above discussion ing authorization. The access list method might suggest. Above, two processes are associates permissions with objects and considered as isolated if no action of one checks authorization on each access; the can directly affect the other. Lampson capability list method associates permis- [LAM73] has pointed out various ways in sions with domains and checks authoriza- which an "isolated" process can transmit tions only once, at the times capabilities are information to another process---e.g., by granted. The reduced need for authoriza- clever use of file interlocks, encoding in- tion checking contributes to run time effi- formation into apparently legitimate mes- ciency in the capability list method. How- sages, or by varying its load on resources in ever, it also means that changes or revoca- tions of permission by an object's owner encoded patterns. With strong restrictions cannot easily be made to take immediate on a system, information flowing invalidly effect; in an access list method, changes or via variables stored in the system can be revocations can take immediate effect and blocked [DEN76]; however, information the system cannot become cluttered with flowing invalidly via resource usage pattern capabilities invalidated by such changes. is much more difficult to block [LIP75].

Computing SurveyB, Vol. 8, No. 4, December 1976 370 • Pe~r J. Denning

Fortunately, subtle interactions of this or execute permissions; the first two per- type are not important in error confinement. missions apply to memory references dur- ing the "execute" portion of a processor Implementing a Capability Environment instruction cycle, while the third applies to memory read operations during the The capability mechanism is the most effi- "fetch" portion. An enter capability grants cient known implementation of a closed en- permission to invoke a procedure in a do- vironment. Since it is referred to frequently, main differing from that of its caller its implementation is considered here. [DVH66, FAB74, WILT5]. Most computer systems have a different Figure 6 illustrates the access mechanism access mechanism for each type of resource. for storage capabilities. Each domain d has Access to segments and pages is controlled its own capability list CL(d), whose en- by memory addressing hardware; access to tries (j: c, x) specify that local name j procedures and instructions is controlled (used as a key during address translation) by the instruction execution mechanism; may be translated to any action, on object access to files is controlled by the file system; x, which is enabled by access code c. The and so on. Capabilities can be grouped into correspondences between system names corresponding classes. They can be placed and the objects themselves is stored in a in separate capability lists according to systemwide central mapping table (CMT), type; or they can be placed in the same list, whose entries are of two forms: distinguished by a type tag. When a process 1) inform: (x: 1, b, l), opens a file, for example, the file system can 2) outform: (x: O, a), return a capability for the opened file, for where x is used as a key into CMT. The example: inform entry signifies that segment x oc- j := openfile(#ilename) cupies l words of main memory beginning at address b--thus (b, l) is the familiar where j is the index of a newly created file base-limit pair [DEN70]. The outform entry capability. The process may thereafter signifies that segment x is in the auxiliary invoke particular file operations with such memory at address a. Suppose the processor calls as requests action A (read, write, instruction readfile(j), writefile(j), or closefile(j). fetch) on the wth word in segment j. The addressing microprogram performs these The parameter mechanism must verify steps: that the passed capabilities axe of the cor- 1) (c, x) : = CL(d) entry with key j; rect types. 2) (p, b, l) := CMT entry with key x; The general form of this mechanism 3) if p = 0 then MISSING-SEGMENT amounts to an automatic type-checking FAULT; mechanism: each argument capability is 4) if A not enabled by c then ACCESS checked against a "template" associated TYPE ERROR; with the procedure to verify that it is of the 5) if w ~_ l then OUT OF BOUNDS type for which the procedure was designed ERROR; [Fv.u73]. New types of objects and capa- 6) perform action A at memory ad- bilities for them can be defined if the mecha- dress b -{- w. nism is "extensible," such as in the Hydra An ordinary segmented virtual memory system [CoJ75]. The CAP system does omits the central mapping table and puts limited type checking [WAL73]. the triplets (p, b, l) in the capability list The two most frequently used types of (which is then called a "descriptor seg- capabilities, storage and enter capabilities, ment"). In contrast, this mechanism al- are often given special treatment. A storage lows permanent capability lists, since their capability grants access to a segment (or entries require no alteration under storage page) in the virtual memory of a process. management decisions (see Fabry [FXB74]). Its access code may specify read, write, As usual, a local associative memory can

Computing Surveys, Vol 8, No. 4, December 1976 Fault-Tolerant Operating Sys~ms 371

CMT

CL(d)

REAL °i rI / ,Icl x xl,101,

• i ® J' l C'l x' x'lO I a

CAPABILITY LIST Object x ASSOCIATIVE MEMORY CENTRAL MAPPING TABLE FIGURE 6. Use of storage capabilities. The processor puts addresses of all references in the local address register (LA). The local object number (3)selects a capability, which is mapped via an inform CMT entry to the base of the object; w defines a displacement. An attempt to reference local number 3' would encounter an outform CMT entry and generate a missing object fault. Paths like A-B can be stored in a local associative memory to speed up address translation. be built into the (real) processor to speed up and domain (D) register values are addressing. The associative memory holds saved in the process's stack; a small number of recently used mapping 2) D is set to d2, the content of the ad- paths (such as A-B in Figure 6). On refer- dress part of the entry with key j2 in ence to segment 3, steps 1-3 can be omitted CL(dl); and whenever the associative memory contains 3) IP is set to (0, 0). an entry with key j. (See [DENT0] and By convention, local name 0 of every do- [SAS75].) main is reserved for an "entry procedure" The implementation of enter capabilities whose entry point is at relative location 0; is complicated by the needs to pass parame- thus step 3 forces the instruction pointer ters, to change the domain register cor- to the entry of the new procedure. When rectly, and to force the instruction pointer executed by procedure p2, a return in- to the entry of the new procedure. The struction loads IP and D from the stack,' central idea of a mechanism is illustrated in and discards the portion of the stack set up Figure 7; the details are given in the Ap- for the previous call (see also [ENG72] and pendix. Suppose procedure pl of domain dl [WIL75].) contains a call on procedure p2 of domain d2. The compiler generates instructions to Processes of Multiple Domains copy capabilities for the parameters from CL(dl) to the stack, followed by the Some procedures in a process require dif- special instruction ferent capabilities from their callers: operat- ing system procedures manipulate private enter j2 system information inaccessible to users; This instruction causes these actions: procedures being debugged have fewer privi- 1) The current instruction pointer (IP) leges than their creators. Indeed, the ob-

Computing Surveys, Vol. 8, No 4, December 1976 372 • Peter J. Denning

CL(dl)

Procedure pl

,' l" [ pl enter/2

]2lent/ dg

IP

VIRTUAL STACK PROCESSOR

Procedure p2

CL(d2) i/// lo I.,I Segment x 31.,I x /""11

FIGURE 7. Enter capabilities. The current procedure (pl) executes the current instruction (at displacement w), which causes D to be set to d2 and IP to (0, 0). Paranieters, and the old (D, IP), are saved on the stack. The codes ex and wr denote execute and write permissions, while ent denotes an enter capability. jects accessible to a procedure are of at least must be forced to the initial instruc- three distinct types: tion of the called procedure. • Local Objects: those associated with, $ Damain Changing: procedure entry and private to, the procedure itself; (exit) must cause the new (former) • Nonlocal Objects: those associated local and parameter domains to be with a caller of the procedure in the established (reestablished). same process; and • Attenuation of Privilege: Under nor- • Parameters: those available as argu- mal circumstances it must be im- ments of a call. possible for a procedure to perform A computer system meeting the need to access to an object passed as a pa- separate these types of objects is said to rameter, when that action is not per- implement multiple domain processes. Such mitred of the caller. implementation must solve four problems: Capability machines solve these prob- • E~cient Representation: subdomains lems well. Separate capability lists pointed must be easily represented and ac- to by special processor base registers keep cessible when needed. track of the domains; the details are given • Protected Entry: on procedure calls, in Appendix 1. Protected entry and domain the instruction pointer of the process changing are handled by the enter capa-

Computin¢ Surveys, Vol. 8, No. 4, December 1976 Fault-Tolerant Operating Systems • 373 bility. Attenuation of privilege is auto- cause v to perform an action on x illegally matic, because passing a capability cannot on its behalf; in other words, simple passing increase the permissions implied by it, of addresses may amplify, rather than at- and because a "copy flag" can be used to tenuate, privilege. Multics solves this with regulate whether a capability can be passed a rather complex indirection mechanism: at all [Dv.N71, LAM71]. (Under special cir- the privilege number of a reference is de- cumstances, a certified procedure may fined as the greatest of the privilege num- have more permissions for a parameter bers encountered on the indirect address object than its caller; increasing the privi- chain. leges denoted by a passed capability is The defect of a privilege state mechanism called "amplification" in the Hydra Sys- is its serious violation of the principle of a tem [COJ75].) closed environment. Since the privilege When the hardware does not permit numbers imply a nesting structure, or in- multiple capability lists to be associated clusion property, on domains of access, it is with a process, a privilege state mechanism impossible to implement mutually disjoint is often used to implement a restricted domains within the same process. It is form of multiple domains. The basic idea is therefore impossible to force each procedure simple [NV.E74]. Each object x--procedure, to have the minimal capabilities it requires instruction, segment, etc.--has associated for its task. Errors committed during a with it an invariant nonnegative integer procedure of high privilege can be spread P(x), called its privilege number. The do- widely. This hazard, together with the lack main of access of a procedure u comprises of flexibility implied by strictly nested do- all objects x (in its virtual memory) for which mains, has motivated Wilkes to argue forci- P(u) _< P(x). The privilege number of the bly for entirely replacing the concept of currently executing procedure is kept in privilege state with the more general con- the instruction pointer register, and can be cept of capability lists [WIL75]. compared during access to that of an object. Procedure calls and returns must properly Isolating Descriptor Information update the privilege number field of the instruction pointer. Information defining or limiting the actions The classic supervisor/user mode flip- of processes is known as process descriptor flop found in most processors is an example information. Examples include the process of a privilege state mechanism--certain list, the various queues of process indices, sensitive instructions can be executed only capability lists, and access lists. Informa- if this flipflop contains 0. The Multics tion defining the structure, type, or format "ring" mechanism is a more ambitious of data (used during access to that data) example extending to segments [SCS72]; is known as data descriptor information. by associating three privilege numbers Examples include so-called array dope ("ring numbers") with each segment, it de- vectors, file description and index tables, fines slightly different domains for reading, directories, and the capability machine's writing, executing, and procedure entering. central mapping table. The correct operation (Priority interrupt mechanisms are not of the system obviously depends on the examples; they are simply priority schedu- continuing correctness and consistency of lers.) all descriptor information. Special care must be taken to implement The basic principle governing the cor- attenuation of privilege in a privilege state rectness of authorized changes in descriptor mechanism. The pointers passed to pro- information is: Descriptor information is cedures as parameters are usually addresses treated as private, accessible only to certain in the virtual memory of the process. If certified operations. The authority to invoice procedure u can call procedure v and can these operations is strictly limited. This pass the address of an object x for which principle is easily implemented in the con- P(v) <_ P(x) < P(u), u may be able to text of Figure 7. Suppose p2 is a certified

Computing Surveys, Vol. S, No. 4, December 1976 374 • Peter J. Denning

CMT

CL(d)

or- ,1o,1,,1 x, = ;l',l \ \ VIRTUAL \\ \ PROCESSOR \ \ / Object x CAPABILITY'1 \ / LIST \ / \ / \ \\ CENTRAL MAPPING / X TABLE / _J (motch) FIGURE 8. Redundancy tags in descriptor information. Each object has a tag (t~) that ini- tially matches a tag in each capability pointing to it (t~). The tags are not mapped, but are com- pared for a match after the capability is mapped to the object. procedure, and the segment x to which it most such information is changed during has write access contains descriptor infor- process switching and procedure entry or mation. Prior to invoking p2, the caller exit by (certified) microprograms, and (pl) places parameters describing the b) the design can be constrained so that requested change on its stack. such registers may receive only exact copies Privilege state mechanisms implement of the (certified) information stored in this principle only in a limited way, for all capability lists. Special capabilities, rather procedures whose privilege number does than privileged instructions, can be de- not exceed that of a given one will have at signed for initiating input/output operations. least the same access. In other words, data (See Wilkes [WIL75], Feustel [FEu73].) cannot be guaranteed to be accessible The problem of unauthorized changes in strictly by the procedure certified to manipu- descriptor information, as would be caused late it. To make matters worse, many privi- by hardware or software failures, has re- lege state mechanisms do not couple privi- ceived less attention. How can we prevent lege-number setting tightly with procedure the use of a capability which, because of calls and returns, it being possible for random some failure, no longer denotes the object errors or clever programming to cause a to which it originally authorized access? An presumably unprivileged procedure to come approach is suggested in Figure 8. Capa- into execution in a highly privileged state. bilities and objects are extended to contain Capability machines require no privi- tags. Associated with each object x is a tag lege state for software procedures [WIL75]. t~, a copy of which can be kept in the central The procedure mechanism automatically mapping table. When a resource unit is associates the proper capability lists with assigned to object x, its tag field is set to the current procedure. Descriptor informa- t=. Similarly, when a capability for x is tion can be made accessible only to the created, its tag is set to t~. The access mech- procedure(s) authorized to change it. No anism translates a capability (j: c, re, x) by privileged instructions for loading or un- mapping x to a base address through the loading processor registers containing de- central mapping table; no tag bits are scriptor information are needed because a) mapped, but the hardware compares the

Computing Surveys, Vol 8, No. 4, December 1976 Fault-Tolerant Operating Systems • 375 source tag (tj) against the resource unit's users, consulted whenever a file is opened, tag (t~), raising an error condition in case is a common example of an access list ap- of mismatch. Note that tj plays the role of proach. Indeed, this approach originated in a "key" and t~ that of a "lock". This method the file systems of the early 1960s [DAN65, will detect any single error of these types: DVH66,WIL75.] An interprocess message • Capability List: One or more bits of facility, in which senders may attempt tj or x is changed. transmissions to any message queue, is an- • Object: One or more bits of t~ is other example, for each receiver process changed. determines which messages to accept by • Mapping: An error translates x to the inspecting a message's identity tag and con- wrong object. sulting an internal access list [BAH70, Multiple, reinforcing errors may not be DEN71, HArt76, LAM71, OaG72]. A virtual caught--e.g., both t~ and t~ could be changed, memory mapping mechanism is a common or all of ts, t~, and x, without detection. instance of a capability list approach; each However, multiple errors are much less mapping table entry behaves as a capability likely than single ones [FAB73]. authorizing access to a given segment (or In principle, one can use t~ = t~ = x for page). The Multics virtual memory illus- the tag values; in practice the number of trates a compromise: when a process at- tag bits can be smaller, determined by the tempts initial access to a segment, the file maximum number of objects that can be system generates a capability (in the form active at once. If, for example, Figure 8 of a new entry in the process's descriptor represents a mapping table for a virtual segment) only if the access list of the seg- memory, the tags need distinguish only the ment so authorizes ~ [ORG72, SCS72]. This segments (or pages) loaded in main memory. compromise allows all virtual memory ac- If a virtual processor has access only to cesses except the first to proceed efficiently. private objects, a single tag value (stored in Only a few experimental systems imple- a processor register) for the entire process is ment a fully closed environment. Most sufficient; this method is used in PmMv. existing machines, which use a privilege [FAB73]. Few systems today have any such state~ mechanism, violate the closure prin- mechanism to serve as a redundancy check ciple in the sense that privileged procedures against unintentional errors. typically have much wider domains of ac- Another safeguard is to assign the unique cess than they require for their tasks. There object names sparsely in the space of all pos- is serious danger of errors committed in sible names. For example, if every two names highly privileged states being propagated have Hamming distance at least k A- 1 (i.e., throughout the system.2 If the virtual they differ in least k q- 1 bits), then any error memory concept is closed, why is it used side that changes no more than k bits of a name by side with the privileged state concept? will render a capability useless. The address- The answer appears to be inertia. The priv- ing mechanism will detect the error because there is no central mapping table entry 1 In the notation of Figures 4 and 5, the virtual process in domain d presents a symbolic file name whose key matches the name in the capabil- X which locates the file in a directory hierarchy. ity presented. This supplements, rather than If the access hst AL(X) contains entry (d, c), the capability (j: c, x) is placed m CL (d), j being an supplants, the tag mechanism, which relies unused local address and x the unique system name in part on the fact of tag bits not being for segment X processed by the address translator. 2 This hazard reinforces the popular apprehension that security is a myth because there is no protec- tion against highly pmvdeged supervisor pro- cedures. Even as capability systems reduce the Examples and Comparisons risk of irreparable damage by design errors or hardware malfunctions, tI~ey reduce the risk of Most practical systems should, and do, mahcious pdferage--e.g., by the procedures that employ both the access list and capability create capabihtms or pass them among processes-- because they restrict the possible interfaces at list approaches. A file directory hierarchy which purloining can occur to a few easdy moni- which attaches to each file a list of authorized tored ones.

Computing Surveys, Vol. 8, No. 4, December 1979 376 • Peter J. Denning

liege state concept was invented before strong current interest seems to derive from virtual memory, to solve early protection IBM's CP-67 project, which has entered problems; the virtual memory concept was production as VM/370 [NES70]. By allow- invented to solve storage overlay problems. ing no sharing to be explicitly arranged, Though it is implicit in the early works of the strong isolation property of itselfposes a Dennis and Van Horn [DVH66] and Fabry serious practical limitation. The exact repli- [FA~68], Wilkes seems to be the first (1968) cation and recursiveness requirements can to explicitly recognize that the principles of in principle be implemented on the more virtual memory could be extended to solve flexible capability architecture without re- both classes of problems [WIL75]. This sorting to the hazardous privilege state, realization of the late 1960s was hardly in though in fact no one has explored this time to have much influence on the design of implementation. The principal attraction of machines conceived during the early 1960s, the virtual machine approach seems inertial: machines which still dominate today. the ability to implement the three objectives Another quasi-open architectural princi- with least modification of existing hardware ple is the faddish "reeursive virtual and software. machine" (see [PoG74, PK75a, PK75b]). A There are at least five important differ- virtual machine is nothing more than a vir- ences between a capability based system tual processor and a virtual memory. As (such as CAP, Hydra, or Plessey S/250) and discussed in the literature, this approach one using virtual memory or privilege-state emphasizes three properties: virtual machines (such as Multics or VM/ • Strong Isolation. There is normally no 370). provision for shared instruction code • Capabilities are permanent; they are or flies, or for otherwise exchanging system addresses for objects [FAR74]. information efficiently between virtual They can be stored in programs or in machines. the auxiliary memory for an indefinite • Re.cursiveness. In support of the ob- period. In contrast, entries in standard jectives of developing new systems virtual memory tables are strictly under an old one, or of continuing an temporary and cannot be placed in old one without alteration when a new the auxiliary memory? one is installed, a virtual machine • Capabilities may be passed efficiently should be able to support replicas of as addresses to procedures in different itself or of similar machines (the rep- domains, and as components of inter- licas have fewer resources). process messages. In a virtual machine • Exact Replication. No running pro- system, a complex indirect addressing gram should be able to distinguish a mechanism may be required to pass replica from the real machine (except addresses safely among procedures of perhaps by observing the speed of its different domains in the same process, execution). To achieve this, intricate and information can be passed be- privilege state mechanisms have been tween virtual memories only by copy- designed. Should a virtual machine ing it. attempt invocation of any operation • The capability hardware handles ca- which may affect, or be affected by, a pabilities for nonstorage objects (e.g., system state variable, a trap to a processes, message queues, flies) with privileged procedure must occur. comparable efficiency as those for Except for the third, these objectives have storage objects (e.g., segments, pages). been present in operating * As noted in Subsection" " Implementing a Cap- since the early 1960s. Early implementa- ability Environment," the form of a capabihty tion of all three objectives appeared by 1964 may depend on where the object it denotes is on the MIT POP-l, and by 1966 on the stored--e.g., informs and outforms. Virtual mem- ory table entries are of inform only, making them M44/44X time sharing system at IBM inherently too transient for holding in auxiliary Watson Research Laboratory [ONv.67]. The memory.

Computing Surveys, VoL 8, No, 4, Deoember 1976 Fault-Tolerant Operating Systems • 377

• The capability hardware can handle fault occurs while the page is still in the small objects in very small domains; page-out queue. For some resource types, most existing virtual memory systems e.g., processors, only transitions 1 and cannot deal efficiently with small num- (dotted) 3 are possible. bers of small segments in a single proc- Reliable resource control is achieved when ess. 4 the units of a given resource type are forced • The capability mechanism is a natural to conform to a well-defined state transition basis for a closed environment. The diagram, e.g., Figure 9. Let us imagine imple- privilege-state mechanism is inflexible menting a tag field in each unit of resource: and not closed: it does not permit dis- tag 0 means that unit is free and its internal joint domains in one process, and it state null, and tag G means it is allocated to exposes all objects to erroneous ac- object x. (This scheme agrees with that dis- tions of highly privileged procedures. cussed in connection with Figure 8.) The That capabilities are permanent, confer following rules must be observed: authorization of access, and can be grouped 1) If a process, in exercising its jth capa- arbitrarily to define access domains, are the bility with tag tj, makes access to a primary reasons for the advantages de- unit with tag t~, then t~ = G must hold. scribed above. Yet the very permanence of This verifies that a unit remains as- capabilities is a liability: a system may be- signed to the object with which it is become cluttered with unusable or invalid presumed associated. capabilities which cannot easily be located 2) A command to allocate a unit to object and expunged [SAS75, WIL75]. x is valid only if the unit's tag is 0. The effect of such command is to change the tag to tx, and to restore the unit's in- 3. RESOURCE CONTROL ternal state to a former nonnnll value, The General Principle if such value is defined by an initial condition or by a prior preemption of The following discussion concerns the relia- a unit from object x. ble control of resource units subject to ex- 3) A command to preempt a unit of object clusive assignment to objects--i.e., reusable x is valid only if the unit's tag is t~. resources [HoL72]. Although we think nor- The effect of such command is to mally of just two possible valid allocation change the tag to 0, save the unit's in- states for such a unit--free and assigned-- ternal state in some protected place, many allocation strategies, notably those and then "clear" the unit by setting controlling memory, permit a third. The its internal state to some null value. third state arises from splitting the assigned (A command to release units omits the state in two, according to whether the unit state-saving step). is accessible to a process or not. A memory The implementation of Rule 1 was discussed page frame is an example. It makes transi- earlier as part of a method for detecting in- tion 1 of Figure 9 when extracted from the pool and assigned to some segment; it makes transition 2 when the paging algorithm de- taches it from the segment and enters it in the page-out queue; and it makes (solid) transition 3 when its contents have been cleared. Transition 4 is possible if a page

Needham tells me of an interesting side benefit. Efficient small domains have been a significant de- bugging aid in the CAP operating system. Most FIGUR~ 9. Allocation states of a reusable re- programming errors quickly violate access rules, source unit. Some memory allocation strategies causing the program to stop before the error has make objects inaccessible to a process without done much damage and aiding the rapid diagnosis setting the state of the object's unit(s) to a null of the cause. value.

Computing Surveys. Vol. $~ No. 4, December 1970 378 • Peter J. Denning valid uses of capabilities; the same method information. (Fabry discusses this point at is part of reliable resource control. Rule 2 some length [FxB73].) implements transition 3 of Figure 9 and Many paging mechanisms are designed to guarantees that no free unit contains in- clear page frames, but only just before re- formation about prior uses. Rule 3 imple- assignment--to be prepared for a possible ments transition 1 of Figure 9 and guarantees fault seeking the page contained in that that each unit upon assignment is in an frame. This is acceptable only if the page 'i~ internal state depending at most on prior treated as continuously assigned to the uses of the object to which it is assigned. original segment during the entire interval Rules 2 and 3 implement together the addi- from its detachment from the segment until tional requirement that reusable units can its clearing. It is unacceptable if the page be assigned to at most one object at a time. ever enters the free pool without being These rules permit transitions 2 and 4 if cleared, for then the danger exists of an un- they exist. Hardware that observes and sets undetectable protection violation. Today's unit tags in accordance with these rules is memory technology permits additional page- easy to design (see Fabry [F~.B73]). clearing circuitry at insignificant cost: set- These rules are needed for process isola- ting a tag field to 0 (Rule 2) can trigger a tion, which requires that there be no in- reset line in a semiconductor memory; or the advertent, undetected exchanges of informa- regeneration cycle of a core memory can be tion between processes through residual disabled during a read-out operation by a values left behind in the internal states of data channel. To my knowledge, only the resource units. How simple this principle is, PRIME machine proposed this [FAB73]. When yet how seldom it is observed ! the need for the simple hardware assist was not anticipated, the only mechanism of Application to Memory Management clearing is the zero-storing iteration, a method whose expense has deterred the most Units of memory resource include pages (or prudent operating systems designers from other blocks) of main memory, and tracks or observing the basic principle. records of auxiliary memory. Though These ideas carry over to the control of phrased primarily for main memory page secondary memory units such as drum frames and disk or drum tracks, the follow- tracks. The drum hardware can be designed ing ideas apply to all units of memory alloca- to check tags during access (Rule 1), in order tion. to reject track assignment requests for Consider page frames. The general princi- tracks whose tag is nonzero, and to over- ple requires that the preemption of a page write O's onto a track when a command is frame from a segment be accompanied by received to reset a tag to 0. saving that frame's content; it is saved, usually on the drum. As long as the frame's Application to ProcessorAssignment content is nonnull, the frame must be treated as still assigned to the segment, even though The process switching mechanism, which the preemption may have made it inaccessi- assigns the real processor to virtual proces- ble by setting a presence bit to 0 in some sors, must conform to these rules (see in mapping table. How many systems actually Section 1, the subsection A Single-Processor do this? Few. Some treat preempted page Implementation). The process index and frames as free even before their content is domain numbers stored in processor regis- saved on the drum, and many do not even ters serve as tags to identify the virtual clear page frames after placing them in the processor to which the real processor is as- free pool. Not only can private information signed. The SWITCH operation completely be divulged by this violation of principle, but saves a processor state and completely re- a process can acquire what it anticipated as places it with a new one. an "empty" page, only to be plunged into Many systems do in fact perform process error because the page contained nonnull switching operations correctly--except for

Computing Surveys, Vol. 8, No 4, December 1976 Fault-Tolerant Operating Systems • 379 switches associated with device signals (inter- taking procedures may produce incorrect rupts). If the hardware design did not antic- decisions because malfunctions alter their ipate the need for an efficient operation of instruction code or data. Decision verification saving and restoring statewords, a large refers to a class of redundancy checks on amount of time would be required to execute decisions; these checks verify that the action a procedure that saves the stateword-con- decided upon is authorized for its initiator, raining registers one at a time. Because a is applicable to the particular object for next device transaction must be dispatched which it is intended, or is consistent with the quickly--e.g., before a rotating drum pro- system state. These checks constitute an gresses too far from its current position-- important means of error detection. such machines cannot afford the delay of a complete switch to the device overseer Sender-Receiver Formulation process. Instead, they do no more than save the instruction pointer value and transfer to The principle of decision verification can be the entry point of a device driver procedure. examined in the context of senders and re- The flaw is obvious: the device driver proce- ceivers exchanging messages. Let us imagine dure has come into execution in the environ- that process u signifies its decision to perform ment of whatever process was interrupted; it action a on object v by transmitting the has the responsibility of saving whatever message (a, v) to the controller (access registers it uses, restoring them before re- mechanism) of v. The system tags the sent turning control to the interrupted process. message with the identification of the sender, The opportunities for undetected protection delivering the triplet (u, a, v) to the con- errors, for interference with another process's troller. The most important check performed correct execution, or for propagation of by the controller is the errors originating in the interrupt routine-- 1) Consistency Check. Verify that action a are too numerous to mention here. Given the is consistent with the state of v and is frequency of interrupts in most systems, and allowable in the state of the system. the critical nature of many interrupt tasks, Three other possible checks are this can be regarded as a major hole in many 2) Source Check. Verify that u is permit- contemporary operating systems, through ted access to v with action a. (This is which untold errors have doubtless flowed. already done in capability systems.) To sum up: the general principle of re- 3) Destination Check. Verify that v is an source assignment is not observed carefully object under the control of the con- with respect to process switching in many troller receiving the decision. contemporary systems, especially in switch- 4) Transmission Check. Verify that the ing to device driver processes. The solution received, encoded representation of is straightforward, but requires a carefully (u, a, v) is valid. (This can be done by engineered SWITCH operation as part of including a checksum in the message processor microcode, an operation that saves and verifying that the checksum of the current stateword, determines the next received (u, a, v) is the same as the re- process to run and loads its stateword--all ceived checksum.) indivisibly. The vast simplifications which If the decision passed all the checks, the would result throughout the operating sys- controller would perform the requested ac- tem, and the tremendous reduction in ex- tion. If it failed, the controller would have to posure to error and error propagation, easily send an error report back to the decision justify the cost of a SWITCH operation. sender. Unless the decision sender and de- cision receiver are very carefully designed, it may be quite difficult for the sender to 4. DECISION VERIFICATION back up and retry the decision [RUB75, Most system actions can be regarded as RAN75]. consequents of decisions taken by its proc- The purpose of this mechanism is locating esses. Even if previously certified, decision- errors by spotting inconsistencies. This im-

Computing Surveys, Vol 8, No. 4, December 1976 380 • Peter J. Denning

plies that, ideally, a decision's generation is concepts of decision verification for memory independent of its verification--the sender management. To review, each unit of and receiver have different algorithms, data, memory (a page frame of main memory, a and hardware. Only the message channel is track of auxiliary memory), has a tag. If the shared. unit's tag is zero, the unit is free and its Under these assumptions, this mecha- internal state null. If the unit's tag is t~, the nism will be capable of detecting single errors unit is assigned to object x and its internal --i.e., an incorrect decision, an erroneous state may be nonnull. A decision to assign a transmission, a malfunction in the accessing unit to object x is consistent only if the unit's mechanism or objects under its control. tag is zero; the assignment, implemented in Multiple errors may not be detected if they the memory hardware, sets the tag to t~. A "reinforce" (cancel each other)--e.g., sender decision to release a unit assigned to object u determines action a on object v, but the x is consistent only if the unit's tag is t~ ; the decision (u', a', v') reaches the controller, release, implemented in the memory hard- where a' is a valid action for u' on object v'. ware, sets the tag to zero and clears the If an error is corrected rapidly after its de- unit's internal state. tection, the problem of multiple errors exist- For main memory page frames, assign- ing simultaneously will be extremely small. ment and release actions can be coupled, With properly designed coding schemes, the respectively, with those that load and save probability of reinforcement, given that a pages, leading to additional confidence that multiple error exists, can be made small. In the resource control principle is correctly other words, the simple idea of detecting implemented. For example, a page fault single errors is both powerful and useful. prompts a decision that allocates some frame If the sender and receiver of a decision are v to object x; the data channel will accept not totally independent, single-error-detec- the request to copy the required page of x tion capability may be reduced. For example, from auxiliary memory track u only if if the sender and receiver processes are exe- tag(u) = t~ and tag(v) = 0, whereupon it cuted (at different times) on the same proc- copies the entire content of track u (tag essor, a recurrent processor malfunction may included) to frame v. A page replacement go undetected. Thus the collection of proc- prompts a similar series of checks and ac- esses on the RC4000 (see Brinch Hansen tions. [BRH70, BRH73]; also Lampson [LAM71]), By storing copies of tags in capabilities, while implementing a message facility of the mechanism can also verify all access this type, is not fully capable of implement- decisions of processes. When a process at- ing this error detection principle. Two im- tempts access to a page of its jth segment, portant cases in which the fully capability of the memory hardware permits access only if this method can be realized are exemplified tag t~ of the jth capability matches that of in the PRIME system [FAB73]. In one ease, the page frame whose address is generated all operating system decisions are taken by a by the mapping mechanism. Since the source control processor and implemented by a tag (t~) and the destination tag (of the ref- distinct process that never runs on the con- erenced frame) are not handled by the trol processor. In the other case, additional mapping mechanism, this check will detect circuitry is placed in the memory hardware single errors in the source tag, the destina- to verify all decisions made by any processor tion tag, or the mapping operation. on memory objects. A full discussion of these ideas in the con- text of a particular system, PRIME, has been Application to Memory Access and Control ° given by Fabry [F•B73].

The mechanisms outlined earlier for resource Application to Encapsulation assignment and release (in Section 3, Appli- cation to Memory Management) and tags on For the purpose of this discussion, a process capabilities and objects (in Section 2, Isolat- is encapsulated if its communication outside ing Descriptor Information) illustrate the its environment must abide by stricter rules

Computing Surveys, Vol. 8, No. 4, December 1976 Fault-Tolerant Operating Sys~ms • 381 than can be expressed directly in the system. well to keep in mind that the following ideas In a capability system, for example, the are much more speculative than the pre- normal rules allow the control, via enter ceding. capabilities, of access to procedures; but they do not allow constraints to be placed General Concepts on the values passed between a process and such procedures. A mechanism for "message The prior sections have emphasized error approval" is useful in this case: the initiator confinement and rapid detection. The penal- oi a given process may declare that any ties for each error propagation and sluggish interactions between that process and the detection show up as costly, incomplete system be approved by another process. error recovery procedures that at best miti- Fabry discusses a message approval mecha- gate rather then eliminate the effect of nism for PRIME [FAB73]; Gaines discusses errors. A not uncommon extreme case illus- another such mechanism for the IDA system trates. Suppose that an undetected error on the CDC 6000 computer [GAff4, GAI75]. changes a page table entry, giving some Specifically, suppose process p is declared process write access to the region of memory as encapsulated, and process q is to approve containing page tables; after a while the any information input to, or output from, p. mapping information becomes too inconsist- Whenever p attempts invocation of some ent for the virtual memory mechanism to procedure, say x, that may communicate translate addresses, and the system comes to outside of p, the system should send a copy a stop. It may sometimes be possible to re- of the parameters to q for approval. If q cover by purging the active processes from approves, the procedure x is activated as the system; should this fail to put the system usual. If q disapproves, the system returns to into a consistent state, a full restart and p without executing x (i.e., x appears as a initialization may be required. If the error no-operation to p). It is important that q was caused by a program bug, the state of approves a copy of the parameters of x, the memory when the system stopped may otherwise an error in q could be passed to x. be so jumbled to make locating the source of Implementation of these ideas would re- the error impossible. In short, the high cost quire a bit in a process stateword to indicate of error recovery in an unconstrained (open) whether or not the process is encapsulated. environment is a strong motivation for Associated with an encapsulated process undertaking the cost of the hardware needed must be the index of the approver process. to support a highly constrained (closed) en- In a capability machine, this would cor- vironment. respond to trapping attempted uses of enter The steps in error recovery are often listed capabilities. In a message-based system, such as detection, location, and repair or recon- as RC4000 [BRH73], it would involve trap- figuration. Detection refers to discovering ping attempted uses of send or get message that an error exists; location to identifying operations. In a conventional system, it it; and repair or reconfiguration to either would involve trapping all supervisor calls. fixing the error or returning the system to a consistent state and restarting it. A great 5. ERROR RECOVERY deal of literature on recovery from hardware failure exists, an excellent overview of recent The mechanisms of the prior sections em- developments being given in 1974 by Info- phasize error confinement and rapid de- tech [INF74]. Commonly used, well known tection, the basis for cost effective error detection methods in hardware include: recovery. Each has been implemented some- parallel operation of duplicated circuits; where. The mechanisms to be described be- error detecting and correcting codes for low concern recovery itself, a subject about storing and transmitting information; error which we have much less experience. The detecting circuits; and time-out mechanisms, techniques are under active discussion. There which signal an error if a unit has acknowl- is considerable agreement that they will be edged nothing within a specified interval. useful when implemented. Nonetheless, it is Fault location methods for hardware in-

Computing SurveyB, Vol. 8, No. 4, December 1976 382 • Peter J. Denning clude: detection mechanisms reporting the [You74], the overhead of classical check- nature of the error; diagnostic programs run pointing methods is high. These methods are to locate a detected error of unknown nature; not used except in special purpose operating and automatic retry mechanisms to prevent systems. full system stops from transient errors. Hard- Randell has recently introduced an in- ware reconfiguration methods include cir- triguing new approach to checkpoint/re- cnits that switch an offending unit off line start, which significantly reduces the over- when an error is detected. head of checkpoint by saving values of Recovery from software error is less well variables only at the moments they are understood. The literature contains various modified [RAN75]. To enable backing up to guidelines (see [WIL75], [WuL75]), each be well defined, Randell proposes a program- based on some form of redundancy. The ming language construction called a recovery ideal of redundancy is at least two copies of block. Each recovery block comprises an each piece of information in the system, each "acceptance test" (a predicate) and a collec- copy being stored separately so that a single tion of alternative procedures ("spares") for error will leave the other intact. It is not accomplishing the same task. On entry to a necessary that both versions be in the same recovery block, the first alternative is tried; form; the original can be structured for if it succeeds (passes the acceptance test), a efficient use, the copy for compact storage. normal block exit follows. If it fails, all Unfortunately, many of the guidelines for variables are restored to their values on software error recovery are vague ("prepare entry to the recovery block; then the second for the most probable errors," "nfinimize alternative is tried; and so on. A hardware references to critical data," "keep a last mechanism called a "recovery cache" is used valid copy of data," or "get it right the first to maintain chains of values of variables; in time"), while others are likely to be used case to the start of a recovery block anyway by system programmers conscious is needed, the recovery cache yields the of the possibility of errors ("use self identify- proper former values. Parallel processes must ing data structures," "use doubly linked be programmed to cooperate with respect to lists," or "use checksums for detection"). recovery points, since backing up past an Some guidelines are in fact recommendations interaction with another process requires for elaborate backup subsystems which miti- "undoing" that interaction by backing up gate against probable errors. One example the other process. (That this can be non- is a file archiving system [WIL75]. When a trivial is illustrated by Russell and Bredt in user logs off a time sharing system, all files their discussion of backing up a producer which he has modified since log-on are and consumer process after an erroneous copied to tape. The tapes are processed message has been discovered [RUB75].) Ran- periodically to retain only the most recent dell's proposed restriction for this case is copies of files. In case of failure, a user may called a conversation, and is represented by request the retrieval of the recent copy from parallel processes within a given recovery the archive tapes. Another example is pro- block. All processes of a conversation must tection against abnormal disconnections. If the pass their individual acceptance tests before user is for any reason disconnected from the any can disengage the conversation. Rele- system without issuing a log-out command, vant to a given conversation, any process an image of his process is saved in a file, that enters a (sub)recovery block is in- ready for resumption when next he logs on. communicado for the duration of that block. Another recovery mechanism often en- countered is checkpoint/restart. Its principle System Error Recovery to make a record of the system state at intervals; should an error be detected, the The methods of this paper are designed to system can be restored to the most recent detect errors in system software, especially state of record and restarted. Even if the in data structures. Once an error is found, checkpoint intervals are chosen optimally the object is to recover either by repairing it,

Computing Surveys, Vol 8, No. 4, December 1976 Fault-Tolerant Operating Systems • 383 or at least by reconstructing a consistent overall philosophy appears in Habermann system state so that the damage cannot be et al. [HFC76]. spread further. Figure 10 illustrates the ith level. Proce- The principles of system error recovery dures P,1, "'", P,~ of M, implement new are most easily illustrated in the context of a system operations not available on M,_,. level structured operating system. This prin- They manipulate resources and data repre- ciple was first used by Dijkstra in the system sented as a structure D,, and they may in- described in [DIJ68]; it was implemented voke operations defined in the lower level with the help of microprograms by Liskov M~_,. The figure shows access to P,~, ..-, on the Venus operating system [Lm72]; and P,K controlled by enter capabilities. Execu- it is recently being exploited by Neumann tion of an operation of M, is thus indivisible and his colleagues in a provably secure in the environment of its caller, and the data operating system at Stanford Research In- D, can be made inaccessible outside of level stitute (SRI) [NEu75]. The idea is to define i. the nucleus of an operating system in terms An important component of error recovery of a hierarchy of nested abstract machines, is a method or error reporting among proce- each of which manages privately a specific dures [PAR75, RAN75, WUL75]. As part of its set of resources not handled on any machine normal output, each procedure should re- contained within it. If we denote the turn an "error report" to its caller; an error machines by M0, M~, ..., M~, then the report can indicate such outcomes as no new operations defined on M, are imple- errors detected, irreparable structural error mented as procedures which operate on the in the private data of the procedure, incon- local data structure of M, and may employ sistent values found stored in the private operations defined in M,-1 ; these procedures data of the procedure, bad parameter passed, and data structure are known as level i. We or an irreparable error reported at some usually interpret M0 as the basic hardware lower level procedure. These error reports and M~ as the environment in which users are normal responses from system levels. run their programs. The number of levels Also needed are error reports as normal (K) depends on the system and its objec- stimuli. A procedure of one level, having tives; Dijkstra used five, Liskov four, and detected an error in its data structure, Neumann thirteen. A good discussion of the should be able to inform lower level proce- dures that bad values may have been passed ~ent I ent ~ enf and that possibly corrections should be made [PAR75]. To this end we can imagine an error recovery function built into each level (E, in Figure 10). An error can be reported to level i via E, ; E, can deal with this report recursively: 1) If there is a possibility that the re- ported error could have been propa- gated to M~_~, E, calls E,_~, which will put M,_~ in a consistent state; then 2) E, performs tests on D~ placing it if need be in a consistent state; it reports finally to its caller the degree of success in repairing the error. FIGURE 10. Abstract machine view of ith In Step 2, E, may safely use operations of level of an operating system Procedures P,1 , .. • , M,-1, which is known from Step 1 to be in a P,k manage the resources and local data structure (D,) of level ~, and may invoke operations of consistent state. Until E~ has reported that M,_~. E, is an error recovery procedure. M, is consistent, no process should be

Computing Surveys, Vol. 8, No. 4, December 1976 384 • Peter J. Denning allowed to execute instructions at level i or able to place such procedures under the con- higher. trol of higher level ones, arranging in par- How is error recovery initiated? Clearly an ticular for erroneous actions of lower level authorized user process could initiate error procedures to be trapped and reported to the recovery by calling E~. If a procedure P of higher levels; and 3) he must be able to stop Ms detects an inconsistency in the data the execution of a "runaway" lower level structure D~, or receives an error report procedure. The first need can be met by from M~-I, it can call E, ; P will report an either a capability or a privilege state mecha- error to its caller only if E~ fails to make nism. The second can be met by languages repairs. With modest supporting hardware, that allow users to express "exceptional the error detection methods of earlier sec- conditions" and state how they are to be tions can also be used to initiate error re- handled [Goo75]. The third can be met by covery. We need to include a level register L allowing creator processes to specify time in each virtual processor to indicate the limits for their offspring and to suspend operating system level at which the current them. These tools can, of course, be used by procedure is executing; L must be saved and the operating system designer in developing restored as part of the enter and return new components of the operating system. operations for protected procedures. An error detected when L = i causes a trap to 6. SUMMARY E~. Portions of these ideas are being used in The objective of this tutorial has been to Hydra [WuL75] and in the SRI system outline some principles of structuring operat- [NEu75]. A full implementation requires the ing systems to make unauthorized actions level structure to be reflected in the hard- impossible, to confine errors to the imme- ware by means of the level register and diate contexts of their occurrences, and to associated mechanism. Such implementa- detect and correct damage before it spreads. tion is less ambitious than the scheme of The basis is an essentially closed architecture Bernstein and Siegel [BES75]; it is more within which every process has no more ambitious that the scheme of Habermann capabilities than needed for its task and can et al [HFC76], who were concerned with interact with no other process in an unex- certification but not error recovery. pected way. This architectural principle interdicts interprocess interference in three ways: processes cannot commit unexpected User Error Recovery actions on one another, they cannot unex- Users define hierarchies among their proc- petedly alter descriptor information, and esses or procedures according to "father- they cannot exchange information in residual son" or "master-slave" relations [BRH70, values left behind in resource unit states. By BRH73, PA72a, PA72b]. Since these relations hastening error detection via such mech- have different interpretations than the anisms as decision verification, this environ- "levels" relation among the seldom changed ment renders realistic the goal of complete system operations, error recovery methods error recovery. beyond those discussed must be available for The closed environment is not intended to the users. In particular, lower level system isolate processes irrevocably. It is intended operations are logically independent of merely to force all interactions into the open higher level ones, while the lower level user where they can be verified. It facilitates, procedures are often totally dependent on rather than impedes, sharing. higher level user procedures. Throughout the discussion it has been The user needs of debugging create three emphasized how certain principles would additional requirements not present among overcome the limitations of third generation the system operations: 1) A user must be operating systems: able to place an undebugged procedure in a 1) The operating system should be de- highly constrained domain; 2) he must be composed into autonomous processes.

Computing Surveys, Vol. 8, No. 4, December 1976 Fault-Tolerant Operating Sys~vns • 385

Both user and system processes should used for organizing operating system be controlled by a single mechanism. operation, the latter for user processes Although many third generation oper- and procedures. With proper hardware ating systems implement processes for assistance, it is possible to guarantee their users, they do not employ the the indivisibility of operations by level concepts deep within the system itself. of a hierarchy and to provide a sys- 2) The process manager should implement tematic approach to error reporting all process synchronization operations and recovery. efficiently. The process switch should I have stressed that considerable hardware provide complete, automatic swapping support is required if the principles of process of process statewords. The interrupt isolation, resource control, decision verifica- strategy should be integrated in, so tion, and error recovery are to be imple- that process switches, rather than traps, mented efficiently--or at all. It is doubtful result from device signals. Inadequate whether all the places where contemporary hardware assistance makes efficient im- systems lack proper hardware assistance plementation of these ideas on third have been identified. Overcoming these limi- generation systems impossible; the in- tations in future systems will require that nermost parts of most systems contain both hardware and software are specified in many implementation shortcuts, loop- an integrated design project, rather than, as holes through which untold errors have has been the custom, more or less independ- flowed. ently. The plummeting costs of hardware 3) The concept of the essentially closed allow great optimism. environment should pervade all levels Though many operating systems today of the operating system. Access to both are explained with process concepts, few use user data and system descriptor infor- these concepts consistently throughout their mation should be controlled by a single structure. The reason seems to be, in part, mechanism without a privileged state. that the classic view of an operating system (The most efficient such mechanism as a collection of procedures (subroutines) known is the capability based machine.) has been slow in giving way to the view of a 4) The concept of resource control requires collection of processes and, in part, that the that the assignment of a resource unit classic view of nested protection domains to an object be accompanied by ini- implemented with privileged states has been tializing the unit to some expected slow in giving way to processes with multiple state. Releases should be accompanied disjointed domains. These older views tend by resetting the unit's state to a null to focus attention on manipulations of the value. Because few systems have ade- processor's instruction pointer and privilege- quate hardware assistance for this, mode register. Thus we find interrupts processes may interfere via residual treated as traps rather than as wakeup values left in these units. signals for high priority processes; we find 5) Many operating system decisions can few efficient facilities for process synchroni- be verified, but hardware assistance is zation or communication at the lowest sys- required for efficiency. A simple tagging tem levels; we find domain changing loosely mechanism can provide the basis for coupled with process switching or protected verifying not only accesses, but also procedure entry. In contrast, the process/ resource unit assignments and releases. capability view tends to force synchroniza- Decision verification is a powerful tool tion, domain changing, and sharing actions of error detection. into the open where they can be checked. 6) Process, or operation, hierarchies are This is not to suggest that the hardware common in both operating systems and mechanisms derived from the two views user subsystems. The two most always differ greatly. It does suggest that common ordering relations are "levels" the process/capability viewpoint can pro- and "father-son," the former being duce and has produced, significant overall

Computing Surveys, Vol. 8~ No. 4, December 1976 386 • Peter J. Denning

improvements in system architecture. Per- patterned after the one of the CAP machine haps the only significant difference in the [WAL73]. The complete current domain of a two views is one of abstraction--above the process is defined by three capability lists, lowest level of the system, the details of pointed to by three base registers in its vir- instruction-pointer movements and of do- tual processor: a) Base register 0 (BRO) main changes are no longer of concern. But points to CL(do), which specifies a stack used that may make all the difference. by all procedure calls in the process; a stack This paper does not pretend to give com- pointer (ST') register defines the base of the plete or uniform coverage of all structural current "frame" of the stack, b) Base regis- principles that can be or have been used to ter 1 (BR1) points to CL(dl), which specifies maximize error confinement or facilitate the global domain, i.e., the set of capabilities error detection; nor are the principles it sug- belonging to every procedure of the process. gests universally accepted. This is a difficult c) Base register 2 (BR2) points to CL(d2), subject about which to write. If readers have which specifies the local domain, i.e., the set been stimulated to think about old problems of capabilities in a process that are private to in new ways, the paper has achieved its the current procedure. Within a process, all purpose. procedure calls and returns modify the stack and SP register, and some modify the local ACKNOWLEDGMENTS domain base register; but none modifies the global domain register. This paper has many virtual coauthors. Roger A local address consists of three parts Needham spent many hours in discussions about (t, j, w), where the tag t selects one of the operating systems for capability machines, and he provided much information about the experience capability lists. The addressing mechanism with the CAP project. Dorothy Denning and translates (t, j, w) to memory address b ~- w, Kevin Kahn helped clarify the distinctions be- where b is the base address of the segment tween open/closed environments, and capability/ whose capability is at position j in CL(dt) access list based authorization ehecking, thus when t specifies the global or local domain excising a string of confusions from an earlier (t = I or 2), or otherwise at positionj + SP draft. A discussion with P. M. Melliar-Smith was in CL(do). The instruction pointer (/P) of considerable help in clarifying the properties of register contains addresses (t, j, w), but t errors and faults. K. J. Hamer-Hodges of the may not refer to the stack. Plessey Corporation saved me from misrepresent- ing the status of System 250, and helped clarify my An ordinary procedure call invokes, with- thoughts on interrupt systems. Jack Dennis sug- out any change of domain, a procedure whose gested the phraseology of approaching a computer capability is in the local or global domains. systeM as an integrated design project with no Prior to the call, parameter capabilities are prior commitment to a distinction between hard- copied onto the stack. The call instruction ware and operating system software. James Peter- is of the form "call t,j, k" where t = 0 or 1; son's critical review of a draft led to substantial its effect is revisions and clarification of many points in the 1) (c, x) := entry of CL(d~) with key j; paper. Jeffrey Buzen, Robert Fabry, Stockton verify execute access in c; Gaines, Scott Graham, David Redell, Michael 2) Save IP, SP, and BR2 (local domain) Schroeder, and Mauriee Wilkes gave freely of their in the stack; time while I struggled with clarifying some of the ideas presented here. I am indebted to the Na- 3) SP := SP + k; tional Science Foundation for its Grant GJ-43176, 4) IP := (t, j, 0). which supported some of this work. Step 4 forces the IP to the entry point (dis- placement 0) of the called procedure. An APPENDIX 1-Procedure Mechanism for a Cap- instruction "return" restores IP, SP, and ability Machine BR2 from the values found in the top stack frame. A procedure mechanism provides for activat- When the called procedure operates in a ing procedures and passing parameters to local domain different from that of its caller, them. The mechanism to be discussed is a more complex mechanism is required. A

Computing Surveys, Vol 8, No 4, December 1976 Fault-Tolerant Operating Systems • 387 special capability, called an enter capability, the IP, SP, and BR2, so that they are auto- is needed; if (j: c, x) is an enter capability, x matically restored when the fault procedure is interpreted not as a procedure name but as executes a return. the name of a new local domain. The 0th entry in the capability list of this new do- REFERENCES main is an ordinary procedure capability for the desired procedure. The instruction [BES75] BERNSTEIN,A J.; AND SIEGEL, P ., "A computer architecture for level struc- "enter t, j, k" is used to invoke the new tured operating systems," IEEE Trans. procedure; its effect is: C-24, 8 (August 1975), 785- 1) (c, x) := entry of CL(dt) with key j; 793. [BRHT0] BRINCHHANSEN, P. "The nucleus of verify enter capability; a multiprogramming system," Comm. 2) Save IP, SP, BR2 on the stack; ACM 13, 4 (April 1970), 238-241,250. [BRH73] BRINCH HANSEN, P. Operating sys- 3) SP := SP -{- k; tems principles, Prentice-Hall, Engle- 4) BR2 := x; IP := (2, 0, 0). wood Cliffs, N.J., 1973. Step 4 forces the local domain to be the [COJ75] COHEN,E.; AND JEFFERSON, D. "Pro- tection in the Hydra operating sys- called domain, and IP to be the entry point tem," in Proc. 5th A CM Syrup. on O~er- of the domain's entry procedure. Note that along System Prznciples, 1975, ACM, enter capabilities can be passed on the New York, 1975, pp. 141-160 [DAN65] DALEY, R. C.; AND NEUMANN, P. G. stack. The same return instruction as before "A general-purpose file system for will work. secondary storage," in Proc. AFIPS Fall Jr. Computer Conf., Vol. 27, Thomp- Dynamic type checking of parameters can son Book Co., Washington D.C., 1967, be performed when their capabilities are pp. 213-229. used. Specifically, any reference to CL(do) [DEN76] DENNING,D.E. "A lattice model for secure information flow," Comm ACM should cause the capability type to be 19, 5 (May 1976), 236-242. checked against that expected by the proce- [DENT0] DENNING, P. J. "Virtual memory," dure. Computing Surveys 2, 3 (Sept. 1970), 153-189. The mechanism can be generalized to ac- [DEN71] DENNING, P. J. "Third generation commodate statically nested local domains, computer systems," Computing Sur- veys 3, 4 (Dec. 1971), 175-216. patterned after ideas used on the B6700. The [DEN74] DENNING,P.J. "Structuring operat- local and global domain registers are re- ing systems for reliability," in In- placed by a display stack of base registers fotech state of the art report ~0: computer systems reliabd~ty, 1974, Infotech In- BR1, BR2, • • • , where BR1 is the outermost ternational Ltd., Maidenhead, Berk- domain. If at a given time the display con- shire, England, pp. 481-504. [DVH66] DENNIS, J. B.; AND VAN HORN, E. C. tains dx, d2, • • • , d,, the current procedure "Programming semantics for multi- is at the nth level of (static) nesting. A programmed computations," Comm. reference to an object in the kth enclosing ACM 9, 3 (March 1966), 143-155. [DIJ68] DIJKSTRA,E. W. "The structure of domain uses tag t = n - k (k >__ 0). Each THE multiprogramming system," procedure call and return manipulates the Comm ACM 11, 5 (May 1968), 341-346. display to preserve this property. (See [ENG72] ENGLAND,D.M. Architectural fea- tures of system 250," in I~fotech state of Organick [ORG72].) the art report 14: operating systems, In Subsection Error Confinement Princi- 1972, Infotech International Ltd., Maid- enhead, Berkshire, England, pp. 395- ples of the Introduction it is noted that fault 428. signals cause procedure calls on fault hand- [FAB68] FABRY, R. S. "Preliminary descrip- lers. In the present context, fault signals tmn of a supervisor for a computer organized around capabilities," in should generate enter operations for fault Quarterly progress report, No. 18, Sect. handlers. The indices of enter capabilities IIA, Inst. Computer Research, Univ. for fault handlers can be reserved, being the Chicago, Chicago, Ill., 1968. [FAB73] FABRY, R. S. "Dynamic verification same for all processes and part of the global of operating system decisions," Comm. domain; they can be stored in special regis- ACM 16, 11 (Nov. 1973), 659-668. [FAB74] FABRY, R. S. "Capability based ad- ters accessible to the trap microprogram. dressing," Comm. ACM 17, 7 (July The mask settings in effect at the time of the 1974), 403-412. fault can be saved on the stack along with [FEu73] FEUSTEL, E. A. "On the advantages

Computing Surveys~ Vol. 8, No. 4, December 1976 388 • Peter J. Denning

of tagged architecture," IEEE Trans. tern," IBM Systems J. 9, 3 (1970), Computers C-22, (July 1973), 644-656. 199-218. [GAI74] GAINES, R.S. "A new boss/slave rela- [Nv.E72] NEEVHAM, R.M. "Protection systems tion between processes," in Proe. 8th and protection implications," in Proe. Princeton Conf. on Information Science AFIPS 1975 Fall Jr. Computer Conf., and Systems, 1974. Vol. 41, AFIPS Press, Montvale, N.J., [GAI75] GAINES, R. S. "Control of processes 1972, pp. 572-578. in operating systems: The boss-slave INSET4] NEEDHAM, R. M., quoted by C. Bunyan relation." in Proc. 5th Syrup. on Operat- in Infoteeh state of the art report ~0: ing Systems Principles, 1975, ACM, New computer systems reliability, 1974, Info- York, 1975. tech International Ltd., Maidenhead, [Goo75] GOODENOUGH,J.B. "Exception hand- Berkshire, England, pp. 104-105. ~ng:~ss~c~nld ? pro~s~d l~t, i~'." [NEH75] NEHMER, J. "Dispatcher primitives for the construction of operating sys- 696. tem kernel," Aeta Informatiea 5, 4 [GRA75] GRAHAM, R. M. Principles of system (1975), 237-256. programming, John Wiley & Sons Inc., [NEu75] NEUMANN, P. G.; ROBINSON, L.; LE- New York, 1975. VITt, K. N.; BOYER, R. S.; AND SAXENA, [HAB76] HABERMANN, A.N. Introduetiou to op- A.R. "A provably secure operating erating systems design, Science Research system," in Project ~581 final report, Associates Inc., Chicago, Ill., 1976. Stanford Research Inst., Menlo Park, [HFC76] HABERMANN, A. N.; FLON, L.; AND Calif., 1975. COOPRIDER, L. "Modularization and [ONE67] O'NEILL, R.W. "Experience using a hierarchy in a family of operating sys- time-sharing multiprogramming sys- tems," Comm. ACM 19, 5 (May 1976), tem with dynamic address relocation 266-272. hardware," in Proe. AFIPS 1967 Spring [HoL72] HOLT, R. C. "Some deadlock proper- Jr. Computer Conf., Vol. 30, Thompson ties of computer systems," Computing Book Co., Washington, D.C., 1967, pp. Surveys 4, 3 (Sept. 1972), 179-196. 611-621. [HoR73] HORNING, J. J.; AND RANDELL, B. [ORG72] ORGANICK, E. I. The multics system: "Process structuring," Computing Sur- an examination of its structure, MIT veys 5, 1 (March 1973), 5-30. Press, Cambridge, Mass., 1972. [INF74] INFOTECH INTERNATIONALLTD. Report [ORG73] ORGANICK, E. I. Computer system or- ~0: computer systems reliability, C. ganization: the B5700/B6700 series, Bunyan, Ed., 1974 Infotech Interna- Academic Press, New York, 1973. tional Ltd, Maidenhead, Berkshire, [PA72a] PXRNXS, D.L. "A technique for soft- England. ware module specification with exam- [LAM69] LAMPSON, B. W. "Dynamic protee- ~lesas,~Tomm. ACM 15,5 (May 1972), tion structures, in Proc. AFIPS 1969 ~t~J~t.)u wJ. Fall Jr. Computer Conf., Vol. 35, AFIPS [PA72b] PARNAS, D. L. "On the criteria to be Press, Montvale, N.J., 1969, pp. 27-38. used decomposing systems into mo- [LA~71] LA~PSON, B. W. "Protection," in dules," Comm. ACM 15, 12 (Dec. 1972), Proc. 5th Princeton Conf. on Informa- 1053-1058. tion Sciences and Systems, 1971, pp. [PAR75] PARNAS, D.L. "The influence of soft- 437-443. Also in ACM SIGOPS Operat- ware structure on reliability," in Proe. ing Systems Review 8, 1 (Jan. 1974), ACM Intsrnatl. Conf. Reliable ~oft- 18-24. ware, SIGPLAN Notices 10, 6 (June [LAM731 LAMPSON, B.W. "A note on the con- 1975), 358-362, ACM, New York, 1975. finement problem," Comm. ACM 16, [PoG74] POPEK, G. J., AND GOLDBERG, R. P. l0 (Oct. 1973), 613-615. "Formal requirements for virtualiza- [LAS76] LAMPSON, B. W.; AND STURGIS, H. ble third generation ," "Reflections on an operating system Comm. ACM 17, 7 (July 1974), 412-421. design," Comm. ACM 19, 5 (May 1976), [PK75a] POPEK, G. J.; AND KLINE C. S. "A 251-265. verifiable protection system," in Proc. [LIP75] LIPNER, S. E. "A comment on the ACM Internatl. Conf. Reliable Soft- confinement problem," in Proe. 5th ware, 1975, ACM, New York, 1975, pp. ACM ,gymp. ou Operating System Prin- 294-304. ciples, 1975, ACM, NewYork, 1975, pp. [PK75b] POPEK, G. J.; AND KLINE, C.S. "The 192-196. PDP-11 virtual machine architecture: [LIs72] LisKov, B. H. "The design of the a case study," in Proe. 5th ACM Syrup. Venus operating system," Comm. ACM on Operating System Principles, 1975, 15, 3 (March 1972), 144-149. ACM, New York, 1975, pp. 97-105. [MAD74] MADNICK, S. E.; ANn DONOVAN, J.J. [RAN75] RANVELL, B. "System structure for Operating systems, McGraw-Hill Inc., software fault tolerance" IEEE Trans New York, 1974. on Software Engineering, SE-1, 2 (June [MEL75] MELLIAR-SMITH, P. M. A project to investigate data-base reliability, Re- 1975), T20-232. [Also in Op. Cit., 437- port, Computing Lab., Univ. New- 449.] castle-upon-Tyne, England, 1975. [RUB751 RUSSELL, D. L.; ANY BREDT, T. H. [MEST0] MEYER, R. A.; ANn SEAWRIGHT, L. H. "Error resynchronization in producer- "A virtual machine time-sharing sys- consumer systems." in Proc. 5th ACM

Computing Surveya~Vol. 8, No. 4, December 1976 Fault- Tolerant Operating Sys~ms • 389

Syrup. on Operating System Principles, [WAL73] WALKER, R. "The structure of a well ACM SIGOPS Operating Systems Re- protected computer," PhD Thesis, view 9, 5 (Nov. 1975), 106-113. Computing Laboratory, Cambridge [SAS75] SALTZER, J. H.; AND SCHROEDER,M. D. Univ., England, 1973. "The protection of information in corn- [WxL75] WILKES,M. V. Ti~ sharing computer uter systems," in Proc. IEEE 63, 9 systems, American Elsevier Publ. Co., ~Sept. 1975), 1278-1308. New York, 1st ed., 1968; 2nd ed., 1972; [SCS72] SCHROEDER, M. D.; AND SAL'I~ZER,J. H. 3rd ed., 1975. "A hardware architecture for imple- [WuL74] WULF, W. A. et al. "Hydra: the kernel menting protection rings," Comm. of a multiplroeessor system," Comm ACM 15, 3 (March 1972), 157-170. ACM 17, 6 (June 1974), 337-345. [SHA74] SHAW, A. The logical design of operat- [WuL75] WULF, W. A. "Reliable hardware/ ing systems, Prentice-Hall, Englewood ," IEEE TSE- Cliffs, N.J., 1974. 2 (June 1975), 233-240. [THo70] THORNTON, J. E. Design of a compu- [You74] YOUNG, J.W. "A first order approxi- ter: the Control Data 6600, Scott, Fores- mation to the optimum checkpoint in- man, and Co., Glenview, Ill., 1970, terval," Comm. ACM 17, 9 (Sept. pp. 117-120. 1974), 530-531.

Computing Survey, Vol. 8. No. 4, Deeembel" 1976